I have a text input containing lots of operators, variables and English words. From this input I have to extract just the operators.
As of now I'm using regular expression matching, so the number of operators matched depends on the regular expression. The problems I get are that '=' is matched inside '<=' and '&' is matched inside '&&'. I need to match both = and <= separately.
Is there any better way for matching the operators other than regex?
As far as regex goes, you could have the pattern match the special (compound) cases first, then the catch-all last, using simple alternation. In your simple input case: /<=|&&|=|&/. This isn't necessarily terrible; you can still put whatever your catch-all is after that: /special1|...specialN|special-chars-catch-all/
This technique could be useful in cases where a greedy expression would just grab the whole thing. For example, in if($x==-1) you would want ==, not ==-.
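A minimal sketch of the "compound cases first" alternation, written in Python purely for illustration (the alternation syntax is the same in most regex flavours). The `==` and `<` alternatives and the name `find_operators` are my own additions, not part of the question:

```python
import re

# Compound operators are listed before their one-character prefixes,
# because alternation tries the alternatives from left to right.
OPERATORS = re.compile(r"<=|&&|==|=|&|<")

def find_operators(text):
    """Return every operator found in text, in source order."""
    return OPERATORS.findall(text)

print(find_operators("a <= b && c = d & e"))  # ['<=', '&&', '=', '&']
print(find_operators("if($x==-1)"))           # ['==']
```

If the order were reversed (`=|<=`), the `<=` alternative would never get a chance after a lone `=` had already matched.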
Look at the extended variants in your RE language.
In most RE languages /[<](?![=])/ will match "<" but not "<=" and not "=", for example. The (?! ... ) means "except when followed by ...". The term for this is Negative Look-ahead Assertion. These are sometimes spelled differently, as they are less standard than most other constructs, but they are usually available. They never consume characters, but they can make matching slower.
The "except when preceded" or Negative Look-behind Assertion is sometimes also available, but you may wish to avoid it. It is seldom clear to a reader and can create slower matches.
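As a rough illustration of the negative look-ahead, here is a small Python sketch (Python happens to spell it the same way, `(?!...)`). The helper name is hypothetical:

```python
import re

# "<" only when it is NOT immediately followed by "=".
LONE_LESS_THAN = re.compile(r"<(?!=)")

def lone_less_than_positions(text):
    """Return the index of every '<' that is not part of '<='."""
    return [m.start() for m in LONE_LESS_THAN.finditer(text)]

print(lone_less_than_positions("a < b <= c"))  # [2]
```

The look-ahead only peeks at the next character, so the match itself is still just the single `<`.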
There probably is. But as an alternative, you could have your regex as (e.g.):
[><=&|]+
(Modify to your specifications - not sure if you want addition, subtraction, ++ for incrementing etc too).
The + means "one or more" and so the regex matches as many characters as possible, meaning that if <= is in the text, it will match <= rather than < and then =.
Then, only once you've extracted all the matches, loop through them all and classify them.
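The "match greedily, classify afterwards" idea could look something like this in Python. The lookup table and its labels are made up for the example; you would fill in whatever operators your input actually uses:

```python
import re

# Hypothetical classification table; extend it to suit your input.
KNOWN_OPERATORS = {
    "<=": "less-or-equal",
    ">=": "greater-or-equal",
    "==": "equal",
    "=": "assign",
    "&&": "logical-and",
    "&": "bitwise-and",
    "<": "less",
    ">": "greater",
}

def classify_operators(text):
    """Grab each maximal run of operator characters, then label it."""
    runs = re.findall(r"[><=&|]+", text)
    return [(run, KNOWN_OPERATORS.get(run, "unknown")) for run in runs]

print(classify_operators("a<=b && c=d"))
```

Runs that are not in the table (say `=<`) come back as "unknown", which is a handy place to report an error.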
I think you might still be able to get regex to do what you want.
If you want to completely abandon it, please forgive me and ignore my suggestion :)
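If you want to use regex to detect just = then you could use [^<>=]=[^<>=], which means 'match the equals sign only when it is not preceded or followed by <, > or another ='. Note that this also consumes the neighbouring characters and needs a character on each side, so it will miss an = at the very start or end of the input.
For ampersands, note that &{1} is the same as & and would still match inside &&; to detect one (and only one) ampersand, apply the same idea: [^&]&[^&].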
(NB you might need to escape a couple of those symbols with \)
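If your regex flavour supports look-arounds, a variant of the same idea avoids consuming the neighbouring characters. This is only a sketch in Python; the pattern names and the `positions` helper are my own:

```python
import re

# '=' not preceded and not followed by '<', '>' or '='.
LONE_EQUALS = re.compile(r"(?<![<>=])=(?![<>=])")
# '&' not preceded and not followed by another '&'.
LONE_AMPERSAND = re.compile(r"(?<!&)&(?!&)")

def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

print(positions(LONE_EQUALS, "a=b c<=d e==f"))  # [1]
print(positions(LONE_AMPERSAND, "a&b && c"))    # [1]
```

Because look-arounds are zero-width, this also works for an operator at the very start or end of the input.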
I hope that might help. Good luck.
K.
If you do multiple passes, you can also find the compound operators and then replace them with other characters before a pass that finds the simple ones.
This is often a useful approach anyway: to slowly overwrite your interpreted string as it is processed, so that what is left when you are done is just tokens. RE processors often return index ranges. So you can easily go back and overwrite that range with something no one else will match later (like a control-character token, a NUL, or a tilde).
An advantage is that you can then have debug code that does a verification pass to check that you have not left anything around uninterpreted.
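A small Python sketch of the multi-pass approach, assuming NUL as the placeholder character and a made-up operator set. Each match is overwritten with placeholders of the same length, so the recorded index ranges stay valid across passes:

```python
import re

PLACEHOLDER = "\x00"  # assumed never to occur in the real input

def extract_in_passes(text):
    """Find compound operators first, blank them out, then find simple ones."""
    found = []

    def take(match):
        found.append((match.start(), match.group()))
        # Same-length replacement keeps all later indexes unchanged.
        return PLACEHOLDER * len(match.group())

    text = re.sub(r"<=|>=|==|&&|\|\|", take, text)  # pass 1: compound
    text = re.sub(r"[<>=&|]", take, text)           # pass 2: simple
    found.sort()                                    # back into source order
    return [op for _, op in found], text

ops, remainder = extract_in_passes("a<=b&&c=d")
print(ops)  # ['<=', '&&', '=']
```

The returned remainder is what a verification pass would inspect: anything that is neither a placeholder nor an expected leftover was never interpreted.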
I am trying to parse numerical comparisons in string form. I want to tokenize a string such as 45+(30*2)<=50 such that the resulting groups are 45, +, (30 * 2), <=, and 50.
I know I can define my groups as
\w* for the numerical terms
\(.*\) for the parenthetical terms
[\+\-\*\\=<>]{1,2} for the operator terms
but I don't know how to say "A numerical or parenthetical term followed by an operational term, and that whole thing repeated any number of times, ending in a numerical or parenthetical term".
Is such a thing possible with regex?
A regular expression isn't exactly the best tool for the job. You can achieve what you want with them, but you'll have to jump through hoops.
The first hoop is nested constructs like 45+((10 + 20)*2)<=50, so let's work on that first, as \(.*\) won't do you any good: it's greedy and unaware of nesting.
Here's a better pattern for parentheses only:
(?>
(?<p>\()
|(?<-p>\))
|(?(p)[^()])
)+
(?(p)(?!))
Yes, that's what it takes. Read about balancing groups for an in-depth explanation of this.
Numerical terms would be matched by \d+ or [0-9]+ (for ASCII only digits in .NET), not by \w+.
As for your question:
A numerical or parenthetical term followed by an operational term, and that whole thing repeated any number of times, ending in a numerical or parenthetical term
You're trying to do it wrong. While you could do just that with PCRE regexes, it'll be much harder in .NET.
You can use regexes for lexing (aka tokenizing). But then use application code to make sense of the tokens the regex returns you. Don't use regex for semantics, you won't end up with pretty code.
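To illustrate the split between lexing and application code, here is a rough sketch in Python (not .NET, and not using balancing groups). The regex only recognises flat tokens, and a plain depth counter takes care of the parentheses. The token set and both function names are assumptions for the example:

```python
import re

# One token per match: a number, a two-character comparison, or a single symbol.
TOKEN = re.compile(r"\s*(\d+|<=|>=|==|!=|[-+*/<>=()])")

def lex(text):
    """Split text into flat tokens, failing loudly on anything unexpected."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def group_parentheses(tokens):
    """Merge each top-level parenthesised run of tokens into a single term."""
    groups, current, depth = [], [], 0
    for tok in tokens:
        if tok == "(":
            depth += 1
            current.append(tok)
        elif tok == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
            current.append(tok)
            if depth == 0:
                groups.append("".join(current))
                current = []
        elif depth:
            current.append(tok)
        else:
            groups.append(tok)
    if depth:
        raise ValueError("unbalanced '('")
    return groups

print(group_parentheses(lex("45+(30*2)<=50")))
# ['45', '+', '(30*2)', '<=', '50']
```

Nesting such as 45+((10 + 20)*2)<=50 falls out of the depth counter for free, which is the part the regex alone struggles with.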
Perhaps you should use an existing math parsing library, such as NCalc.
Or you may need to go with a custom solution and build your own parser...
I hope the below regex will give what you expect
(([><!=]{1}[=]{0,1})|[\+\-\*\/]{1}|\(.*\)|[\d]*| *)
I'm trying to check whether the following string matches this pattern, using this code:
string str = "CRSSA.T,";
var pattern = @"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
The site http://www.regexr.com/ says it's not a match (everything matches except the last comma), but the code prints True. Is that possible?
Thanks in advance! :)
First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
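The difference the anchors make can be checked quickly. This sketch uses Python's `re.search`, which, like `Regex.IsMatch`, is satisfied by a matching substring; the variable names are mine:

```python
import re

UNANCHORED = re.compile(r"(\w+\.\w+)+(,\w+\.\w+)*")
ANCHORED = re.compile(r"^(\w+\.\w+)+(,\w+\.\w+)*$")

text = "CRSSA.T,"

# Without anchors the substring "CRSSA.T" is enough for a match.
print(UNANCHORED.search(text).group())  # CRSSA.T

# With anchors the trailing comma makes the whole string fail.
print(ANCHORED.search(text))            # None

# A well-formed list still matches in full.
print(ANCHORED.search("CRSSA.T,AB.C") is not None)  # True
```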
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W) and are not always intuitive. E.g. they include the underscore and, depending on the regex engine, a lot more than just [A-Za-z_], e.g. in .NET \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just used for a quick one-off usage) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.
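The same surprise can be reproduced outside .NET. As a hedged illustration, Python's `re` module also treats `\w` as Unicode-aware for text patterns unless `re.ASCII` is passed (the helper name below is hypothetical, and this says nothing about .NET's exact character set):

```python
import re

def is_all_word_chars(text, flags=0):
    """True when every character of text is matched by \\w under the given flags."""
    return re.fullmatch(r"\w+", text, flags) is not None

print(is_all_word_chars("ä_µ"))            # True: Unicode letters and underscore
print(is_all_word_chars("ä_µ", re.ASCII))  # False: ASCII-only \w rejects ä and µ
print(re.fullmatch(r"[A-Za-z]+", "ä"))     # None: the explicit class is strict
```

Spelling out the class you actually mean avoids having the answer depend on engine defaults.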
I have a regex that works fine currently. But now I want to add on to it to capture dates.
Current regex:
(?<GeneralHelp>^/help\s*)?
(?:/client:)
(?<Client>\w*)
(?:(?:\s*/(?<ClientHelp>help))*)*
(?:(?:\s*/)(?<Modules>createHistory)(?:(?:\s*/(?<ModuleHelp>help))*)*)*
I added to the end:
(?:(?:\s*/)(?<StartDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
(?:(?:\s*/)(?<EndDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
Using the below example, it just won't get the dates, but it does match everything else.
/client:testClient/createHistory/11-11-2013/11.11.2013
This regex is used to break up the one string in the string array parameter passed to Main in a console app. No one on my team is "fluent" in regex, nor do we have time to become fluent. We work with what we can, and this addition is something I thought of today that may help with bigger problems that we have with our project, and we are running low on time. So any help would be appreciated.
First, the ^ in your regex means "start of string", that is you only want to match a date at the start of the string (which is not true for you). So remove it. Same with "$" which means "end of string".
Secondly, [0|1] means "match characters 0, | or 1". You probably want [01] meaning "match characters 0 or 1".
Thirdly, you have an extra closing bracket with an unmatched opening bracket in both your regexes.
Fourthly as a general style point, [0] is the same as 0 so the square brackets are redundant here.
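The first two points are easy to verify. A quick Python sketch (the behaviour of character classes and of ^ shown here is the same in .NET; `matches_whole` is a made-up helper):

```python
import re

def matches_whole(pattern, text):
    """True when pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Inside a character class, | is a literal pipe, not alternation.
print(matches_whole(r"[0|1]", "|"))  # True
print(matches_whole(r"[01]", "|"))   # False

# A ^ in the middle of a pattern demands "start of string" after the slash,
# which can never be satisfied.
print(re.search(r"/^11", "/11"))     # None
```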
So your (not quite!) "fixed" regex is:
(?:(?:\s*/)(?<StartDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:(?:\s*/)(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
However, this will not match your test string because of the extra "/testModule" in the string which is not in your working regex anywhere.
You could modify your original regex to allow extra slashes in between the two parts of regex?
<original regex>
(?:/[^/]+)* # <-- for the /testModule and any other similar tokens that appear in between
<date regex>
Also as a general point
you have a few occurrences of (?:(?:regex)*)*. I am not sure what the point is of doubling the outer *, besides making the regex parser work much harder than it should for no good reason (the outer (?: )* is redundant here).
there is no point doing (?:\s*/) as you are not doing anything with the brackets, so just do \s*/
same with things like (?:/client:). Why have non-capturing brackets if you are not doing anything with them? /client: will do.
(?:regex)* means "match 0 to infinity occurrences of regex". With things like (?:\s*/(?<ClientHelp>help))*, do you really expect this to occur infinitely many times in your string, or will it appear just once or not at all? Consider replacing * with ? which means "match 0 or 1 occurrences" (if you know that that token will appear either once or not at all), or replace it with (say) {0,100} if you know that that token will appear at most 100 times (and at least 0 times). This can improve performance.
So I recommend changing your regex like this:
(?<GeneralHelp>^/help\s*)?
/client:
(?<Client>\w*)
(?:\s*/(?<ClientHelp>help))*
(?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
(?:/[^/]+)*
(?:\s*/(?<StartDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:\s*/(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
You can fiddle around with your regex at regexr where I've created an example with your regex/test string. (Edit: the < and > in the regex seem to have been changed to &lt; and &gt; in regexr, so the link won't work unless you copy/paste the regex I've written directly)
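For experimenting away from regexr, here is a sketch of the combined pattern in Python, which spells named groups `(?P<name>...)`. It is not a drop-in copy of the pattern above: I have assumed each optional part occurs at most once (so `?` instead of `*`), made the filler group lazy, and anchored the end, because otherwise a greedy filler group can swallow the dates before the date groups get a chance. `DATE`, `COMMAND` and `parse_command` are names invented for the example:

```python
import re

# Day, month, year separated by '.', ' ' or '-'.
DATE = r"(?:0?[1-9]|[12][0-9]|3[01])[. -](?:0?[1-9]|1[0-2])[. -](?:[0-9]{4}|[0-9]{2})"

COMMAND = re.compile(
    r"(?P<GeneralHelp>^/help\s*)?"
    r"/client:(?P<Client>\w*)"
    r"(?:\s*/(?P<ClientHelp>help))?"
    r"(?:\s*/(?P<Modules>createHistory)(?:\s*/(?P<ModuleHelp>help))?)?"
    r"(?:/[^/]+)*?"                      # lazy filler for other tokens
    rf"(?:\s*/(?P<StartDate>{DATE}))?"
    rf"(?:\s*/(?P<EndDate>{DATE}))?"
    r"\s*$"                              # assumed: dates come last
)

def parse_command(text):
    """Return the named groups as a dict, or None when nothing matches."""
    match = COMMAND.search(text)
    return match.groupdict() if match else None

result = parse_command("/client:testClient/createHistory/11-11-2013/11.11.2013")
print(result["StartDate"], result["EndDate"])  # 11-11-2013 11.11.2013
```

Whether the at-most-once assumption holds for your command line is something only you can tell; if a token really can repeat, keep `*` on that group only.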
If you're sure these two last fields are dates, you could simply add something like
(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$
(or even (?<StartDate>[^/]+)/(?<EndDate>.+)$ if your cases are all in the same pattern and it fits your needs).
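A quick way to try the looser pattern, sketched in Python (named groups are written `(?P<name>...)` there; the helper name is made up). Note that it does no real date validation:

```python
import re

TRAILING_DATES = re.compile(r"(?P<StartDate>(?:\d+[. -]?){3})/(?P<EndDate>.*)$")

def trailing_dates(command):
    """Return (start, end) taken from the tail of command, or None."""
    match = TRAILING_DATES.search(command)
    if match is None:
        return None
    return match.group("StartDate"), match.group("EndDate")

print(trailing_dates("/client:testClient/createHistory/11-11-2013/11.11.2013"))
# ('11-11-2013', '11.11.2013')
```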
Also as already pointed out by mathematical.coffee, the first regex can be improved.
Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means, I only allow the users to search for 1-character strings. If the user tries to search for multi-character strings, an error message will be displayed.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the final matched strings' length is okay, though it's going to be kind of slow.
Yet, the original question is very rare and tedious.
A regexp would be .{1}
This will allow any char, though. If you only want alphanumerics you can use [a-z0-9]{1}, or the shorthand \w{1} (which also allows the underscore).
Another option is to limit the number of chars a user can type in an input field: set a maxlength on it.
Yet another option is to save the form's input field to a char and not a string, although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char?
You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.
Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
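Applying the user's pattern to one character at a time might look like this. It is sketched in Python rather than C#, and the function name is hypothetical; the point is that a pattern which needs more than one character simply never matches:

```python
import re

def matching_characters(pattern, text):
    """Return the characters of text that the pattern matches on their own."""
    compiled = re.compile(pattern)
    return [ch for ch in text if compiled.fullmatch(ch)]

print(matching_characters(r"[a-c]", "abxcd"))  # ['a', 'b', 'c']
print(matching_characters(r"ab", "abab"))      # []
```

An empty result for a pattern the user expected to match is then a reasonable trigger for the error message.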
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
eg:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z matches [AB]|[0-9]|[x-z]|\w
which cases do you need to support?
This would be somewhat easy to parse and validate.
I'm trying to develop a regex that will be used in a C# program.
My initial regex was:
(?<=\()\w+(?=\))
Which successfully matches "(foo)" - matching but excluding from output the open and close parens, to produce simply "foo".
However, if I modify the regex to:
\[(?<=\()\w+(?=\))\]
and I try to match against "[(foo)]" it fails to match. This is surprising. I'm simply prepending and appending the literal open and close brace around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.
Thanks in advance for your kind help.
Rob Cecil
Your look-behinds are the problem. Here's how the string is being processed:
We see [ in the string, and it matches the regex.
Look-behind in regex asks us to see if the previous character was a '('. This fails, because it was a '['.
At least that's what I would guess is causing the problem.
Try this regex instead:
(?<=\[\()\w+(?=\)\])
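Both versions can be compared side by side. This is a Python sketch (the look-around syntax used here is the same as in .NET; the names are mine):

```python
import re

ORIGINAL = re.compile(r"\[(?<=\()\w+(?=\))\]")
FIXED = re.compile(r"(?<=\[\()\w+(?=\)\])")

def inner_word(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group() if match else None

# After consuming "[", the look-behind wants "(" as the previous character.
print(inner_word(ORIGINAL, "[(foo)]"))  # None
print(inner_word(FIXED, "[(foo)]"))     # foo
```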
Out of context, it is hard to judge, but the look-behind here is probably overkill. They are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or if you want to distinguish (foo) from -foo- (for example), using \((\w+)\).
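For comparison, the capturing-group alternative needs no look-arounds at all. A minimal Python sketch, with a made-up helper name:

```python
import re

def words_in_parens(text):
    """Return each word that is directly wrapped in parentheses."""
    # The parentheses are matched (and consumed) but only group 1 is returned.
    return re.findall(r"\((\w+)\)", text)

print(words_in_parens("[(foo)] -bar- (baz)"))  # ['foo', 'baz']
```

In C# the equivalent would be reading group 1 of each match instead of the whole match value.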
Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know alternatives.
Now, if you are just curious why the second expression doesn't work: these are known as "zero-width assertions". They check that what follows or precedes conforms to what is expected, but they don't consume any of the string, so whatever comes after a look-ahead (or before a look-behind) in the pattern must match the same text as the assertion. E.g. if you put something after a positive look-ahead which doesn't match what is asserted, you can be sure the RE will fail.