Forming a Regex to Parse Numerical Expressions

Forming a Regex to Parse Numerical Expressions - c#

I am trying to parse numerical comparisons in string form. I want to tokenize a string such as 45+(30*2)<=50 such that the resulting groups are 45, +, (30 * 2), <=, and 50.
I know I can define my groups as
\w* for the numerical terms
\(.*\) for the parenthetical terms
[\+\-\*\\=<>]{1,2} for the operator terms
but I don't know how to say "A numerical or parenthetical term followed by an operational term, and that whole thing repeated any number of times, ending in a numerical or parenthetical term".
Is such a thing possible with regex?

A regular expression isn't exactly the best tool for the job. You can achieve what you want with them, but you'll have to jump through hoops.
The first one being nested constructs like 45+((10 + 20)*2)<=50, so let's start working on that first, as \(.*\) won't do you any good. It's eager and unaware of nested constructs.
Here's a better pattern for parentheses only:
(?>
(?<p>\()
|(?<-p>\))
|(?(p)[^()])
)+
(?(p)(?!))
Yes, that's what it takes. Read about balancing groups for an in-depth explanation of this.
Numerical terms would be matched by \d+ or [0-9]+ (for ASCII only digits in .NET), not by \w+.
As for your question:
A numerical or parenthetical term followed by an operational term, and that whole thing repeated any number of times, ending in a numerical or parenthetical term
You're trying to do it wrong. While you could do just that with PCRE regexes, it'll be much harder in .NET.
You can use regexes for lexing (aka tokenizing). But then use application code to make sense of the tokens the regex returns you. Don't use regex for semantics, you won't end up with pretty code.
Perhaps you should use an existing math parsing library, such as NCalc.
Or you may need to go with a custom solution and build your own parser...

I hope the below regex will give what you expect
(([><!=]{1}[=]{0,1})|[\+\-\*\/]{1}|\(.*\)|[\d]*| *)

Related

Fastest regex for first occurence of a word

I would like my regex to capture the following kind of strings as two Urls with "%3f" inside them.
https://*****%3f****%3D,https://*****%3f****%3D …
Where each string URL of this type should be captured by itself. Note - The * is here for simplification and the URLS can be in any part of the big string with anything in between.
My regex now is:
(https://\S+?%3f)(?<toDelete>\S+?%3D)
But I've been asked to see if there's a non lazy approach for this (or just a faster version), as it is much slower then greediness, and this regex will be called over huge strings and dataflow.
Note that the reason I cant simply put \S* is that doing so will capture in one match from the first http to the last %3D.

You might probably split the string with a comma and then get a substring up to the %3f value.
If you want to make the \S*? pattern work "faster" you must take into account what kind of context this part of a pattern should be aware of.
You are matching any char that is not a whitespace char, any amount of times, up to the first occurrence of %3f. That is, you want to match any chars other than % and whitespace or % chars that are not followed with 3f. That makes (?:[^\s%]|%(?!3f))*. However, alternation ruins the whole idea of optimization. You need to use the "unroll-the-loop" approach: [^%\s]*(?:%(?!3f)[^%\s]*)*.
So, the whole pattern will look like
https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f
Or with the Delete part:
(https://[^%\s]*(?:%(?!3f)[^%\s]*)*%3f)(?<toDelete>[^%\s]*(?:%(?!3D)[^%\s]*)*%3D)
For short strings, this last pattern might work a tiny bit slower than the \S+? based pattern, but it becomes much more efficient when the matched string becomes longer.

How to grab everything between two words in a text with regex [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how $(.+?)$, the example from your question works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.

Regex.IsMatch gives true but http://www.regexr.com/ gives false

I'm trying to check if the next string is match to this pattern in this code:
string str = "CRSSA.T,";
var pattern = #"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
the site: http://www.regexr.com/ says it's not match(everything match, except the last comma), but that code prints True. is it possible?
thanks ahead! :)

First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W and are not always intuitive. E.g. they include the underscore and, depending on the regex engine, a lot more than just [A-Za-z_], e.g. in .NET \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just used for a quick one-off usage) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.

Performance and readability of RegEx using positive look ahead

I am validating following strings with regular expressions in C#:
[/1/2/]
[/1/2/];[/3/4/5/]
[/1/22/333/];[/1/];[/9999/]
Basically it's one or more group of square brackets separated by semi-colon (but not at the end). Each group consists out of one or more numbers seperated by slashes. There are no other characters allowed.
These are two alternatives:
^(\[\/(\d+\/)+\](;(?=\[)|$))+$
^(\[\/(\d+\/)+\];)*(\[\/(\d+\/)+\])$
The first version uses a positive look ahead and the second version duplicates part of the pattern.
Both RegEx-es seem to be ok, do what they should and aren't very nice to read. ;)
Does anybody have an idea for a better, faster and more easy to read solution? When I was playing around in regex101 I realized that the second version uses more steps, why?
At the same time I realized that it would be nice to count the steps used in a C#-RegEx. Is there any way to achieve this?

You can use 1 regex to validate all these strings:
^\[/(\d+/)+\](?:;\[/(\d+/)+\])*$
See regex demo
To make it easier to read, use a VERBOSE flag (inline (?x) or RegexOptions.IgnorePatternWhitespace):
var rx = #"(?x)^ # Start of string
\[/ # Literal `[/`
(\d+/)+ # 1 or more sequences of 1 or more digits followed by `/`
\] # Closing `]`
(?: # A non-capturing group start
; # a semi-colon delimiter
\[/(\d+/)+\] # Same as the first part of the regex
)* # 0 or more occurrences
$ # End of string
";
To test a .NET regex performance (not the number of steps), you can use a regexhero.net service. With the 3 sample strings above, my regex shows 217K iterations per second speed, which is more than either of your regexps.

There is nothing particularly wrong with the two options you suggest. They are not that complicated as regexes go, and they should be understandable enough, as long as you put an appropriate comment in your code.
In general, I think it is preferable to avoid look-arounds, unless they are necessary or greatly simplify the regex--they make it harder to figure out what is going on, since they add a non-linear element to the logic.
The relative performance of regexes this simple is not something to worry about, unless you are performing a huge number of operations or discover a performance problem with your code. Still, understanding the relative performance of different patterns may be instructive.

Matching operators

I have a text input containing lots of operators, variable and English words. From this input I have to separate all the operators alone.
As of now I'm using regular expression matching, so the number of operators matched depends on the regular expression. problem I get are '= is matched with <=', '& is matched with &&'. I need to match both = and <= separately.
Is there any better way for matching the operators other than regex?

as far as regex goes, you could have the pattern match the special (compound) case first, then the catch-all last with simple alternation. In your simple input case: /<=|&&|=|&/. this isn't necessarily terrible, you can still put whatever your catch-all is after that: /special1|...specialN|special-chars-catch-all/
this technique could be useful in some cases where a greedy expression would just get the whole thing, like: if($x==-1), you would want ==, not ==-

Look at the extended variants in your RE language.
In most RE languages /[<](?![=])/ will match "<" but not "<=" and not "=", for example. The (?! ... ) means "except when followed by ...". The term for this is Negative Look-ahead Assertion. These are sometimes spelled differently, as they are less standard than most other formations, but they are usually available. They never consume more characters, but they create slower matches.
The "except when preceded" or Negative Look-behind Assertion is sometimes also available, but you may wish to avoid it. It is seldom clear to a reader and can create slower matches.

There probably is. But as an alternative, you could have your regex as (e.g.):
[><=&|]+
(Modify to your specifications - not sure if you want addition, subtraction, ++ for incrementing etc too).
The + means "one or more" and so the regex matches as many characters as possible, meaning that if <= is in the text, it will match <= rather than < and then =.
Then, only once you've extracted all the matches, loop through them all and classify them.

I think you might still be able to get regex to do what you want.
If you want to completely abandon it, please forgive me and ignore my suggestion :)
If you want to use regex to detect just = then you could use [^<>=]=[^<>=] which means 'match the equals only when it is not preceded or seceded by < > or another =.
You could use {1}& with ampersands to detect one (and only one) ampersand.
(NB you might need to escape a couple of those symbols with \)
I hope that might help. Good luck.
K.

If you do multiple passes, you can also find the compound operators and then replace them with other characters before a pass that finds the simple ones.
This is often a useful approach anyway: to slowly overwrite your interpreted string as it is processed, so that what is left when you are done is just tokens. RE processors often return index ranges. So you can easily go back and overwrite that range with something no one else will match later (like a control-character token, a NUL, or a tilde).
An advantage is that you can then have debug code that does a verification pass to check that you have not left anything around uninterpreted.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.