RegularExpressions.Regex.IsMatch is hanging - c#

Here is an excerpt from my code:
string[] myStr =
{
    " Line1: active 56:09 - tst0063, tst0063",
    "Contacts accosiated with line 1 - tst0063, tst0063",
    "Line 1: 00:00:32 Wrap: 00:00:20 - tst0063, tst0063",
    "Line 1: 00:00:17 Active: 00:00:15 - tst0064, tst0064"
};
string sPattern = @"^Line(\s*\S*)*tst0063$";
RegexOptions options = RegexOptions.IgnoreCase;
foreach (string s in myStr)
{
    System.Console.Write(s);
    if (System.Text.RegularExpressions.Regex.IsMatch(s, sPattern, options))
    {
        System.Console.WriteLine(" - valid");
    }
    else
    {
        System.Console.WriteLine(" - invalid");
    }
    System.Console.ReadLine();
}
RegularExpressions.Regex.IsMatch hangs while working on the last line. I did some experiments, but still can't understand why it hangs when there is no match at the end of the line. Please help!

The question is not why the fourth test hangs, but why the first three don't. The first string starts with a space, and the second starts with Contacts, neither of which matches the regex ^Line, so the first two match attempts fail immediately. The third string matches the regex; although it takes much longer than it should (for reasons I'm about to explain), it still seems instantaneous.
The fourth match fails because the string doesn't match the end part of the regex: tst0063$. When that fails, the regex engine backs up to the variable portion of the regex, (\s*\S*)*, and starts trying all the different ways to fit that onto the string. Unlike the third string, this time it has to try every possible combination of zero or more whitespace characters (\s*) followed by zero or more non-whitespace characters (\S*), zero or more times, before it can give up. The possibilities aren't infinite, but they might as well be.
You were probably thinking of [\s\S]*, which is a well-known idiom for matching any character including newlines. It's used in JavaScript, which doesn't have a way to make the dot (.) match line separator characters. Most other flavors let you specify a matching mode that changes the behavior of the dot; some call it DOTALL mode, but .NET uses the more common Singleline.
string sPattern = @"^Line.*tst0063$";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
You can also use inline modifiers:
string sPattern = @"(?is)^Line.*tst0063$";
UPDATE: In response to your comment, yes, it does seem odd that the regex engine can't tell that any match must end with tst0063. But it's not always so easy to tell. How much effort should it put into looking for shortcuts like that? And how many shortcuts can you bolt onto the normal matching algorithm before all matches (successful as well as failed) become too slow?
.NET has one of the best regex implementations out there: fast, powerful, and with some truly amazing features. But you have to think about what you're telling it to do. For example, if you know there has to be at least one of something, use +, not *. If you had followed that rule, you wouldn't have had this problem. This regex:
@"^Line(\s+\S+)*tst0063$"
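...works just fine as far as performance goes. (\s+\S+)* is a perfectly reasonable way to match zero or more words, where words are defined as one or more non-whitespace characters, separated from other words by one or more whitespace characters. Be aware, though, that as written it needs tst0063 to follow a non-whitespace character directly, so it rejects your third string, where tst0063 comes right after a space. (Is that what you were trying to do?)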
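To make the difference concrete, here is a minimal sketch, assuming .NET 4.5 or later (needed for the IsMatch overload that takes a timeout). The class and method names (BacktrackingDemo, IsValid, TryIsMatch) are made up for illustration. It runs the Singleline pattern from above over the four strings from the question, then tries the original pattern with a one-second timeout so that a runaway match surfaces as an exception rather than a hang.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // The four test strings from the question.
    public static readonly string[] Lines =
    {
        " Line1: active 56:09 - tst0063, tst0063",
        "Contacts accosiated with line 1 - tst0063, tst0063",
        "Line 1: 00:00:32 Wrap: 00:00:20 - tst0063, tst0063",
        "Line 1: 00:00:17 Active: 00:00:15 - tst0064, tst0064"
    };

    // Pattern from the answer: in Singleline mode the dot matches any character.
    public static bool IsValid(string s)
    {
        return Regex.IsMatch(s, @"^Line.*tst0063$",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    // Hypothetical helper: returns null if the match attempt exceeds the timeout.
    public static bool? TryIsMatch(string s, string pattern, TimeSpan timeout)
    {
        try
        {
            return Regex.IsMatch(s, pattern, RegexOptions.IgnoreCase, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    public static void Main()
    {
        foreach (string s in Lines)
        {
            Console.WriteLine("{0} - {1}", s, IsValid(s) ? "valid" : "invalid");
        }

        // The original pattern on the fourth string; the timeout bounds how long it may run.
        bool? result = TryIsMatch(Lines[3], @"^Line(\s*\S*)*tst0063$", TimeSpan.FromSeconds(1));
        Console.WriteLine(result.HasValue ? result.Value.ToString() : "timed out");
    }
}
```

A timeout does not fix the pattern; it only keeps a bad pattern from blocking the thread indefinitely, which can be a useful safety net when the input is not under your control.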

Move System.Console.ReadLine(); outside the foreach loop.
You're blocking the thread at the end of the first iteration of the loop, waiting for user input.

Related

How do I select all including sensitive case (regex) in c#?

I have a problem with a regex command.
I have a file with tons of lines and with a lot of sensitive characters,
this is an Example with all sensitive case 0123456789/*-+.&é"'(-è_çà)=~#{[|`\^#]}²$*ù^%µ£¨¤,;:!?./§<>AZERTYUIOPMLKJHGFDSQWXCVBNazertyuiopmlkjhgfdsqwxcvbn
I tried many regex commands but never get the expected result,
I have to select everything from Example to the end
I tried this command on https://www.regextester.com/ :
\sExample(.*?)+
And when I tried it in C# the only result I get was : Example
I don't understand why --'
Here's a quick chat about greedy and pessimistic (more commonly called lazy) matching:
Here is test data:
Example word followed by another word and then more
Here are two regex:
Example.*word
Example.*?word
The first is greedy. Regex will match Example, then .* consumes everything all the way to the END of the string and then works backwards, spitting a character at a time back out, trying to make the match succeed. It succeeds when Example word followed by another word is matched, the .* having matched word followed by another (and the spaces at either end).
The second is pessimistic; it nibbles forwards along the string one character at a time, trying to match. Regex will match Example, then take one more character into the .*? wildcard, then check whether it has found word, which it has. So the .*? ends up matching only a single space, and the full match in pessimistic mode is Example word
Because you say you want the whole string after Example, I recommend a greedy quantifier, so it immediately takes the whole string that remains and declares a match, rather than nibbling forwards one character at a time (slow).
This, then, will match (and capture) everything after Example:
\sExample(.*)
The brackets make a capture group. In c# we can name the group using ?<namehere> at the start of the brackets and then everything that .* matches can be retrieved with:
Regex r = new Regex(@"\sExample(?<x>.*)");
Match m = r.Match("some text Exampleblahblah");
Console.WriteLine(m.Groups["x"].Value); //prints: blahblah
Note that . doesn't match a newline unless you enable RegexOptions.Singleline when you create the regex, so do that if your data contains newlines.
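Here is a small sketch that puts the greedy and pessimistic patterns side by side on the test data above, and then captures everything after Example with Singleline enabled. The class and method names are invented for the example, and the leading \s is left out so the pattern also works when Example is the first thing in the input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    public const string Sample = "Example word followed by another word and then more";

    // Greedy: .* runs to the end of the input, then backs off to the LAST "word".
    public static string Greedy(string input)
    {
        return Regex.Match(input, @"Example.*word").Value;
    }

    // Pessimistic (lazy): .*? grows one character at a time and stops at the FIRST "word".
    public static string Lazy(string input)
    {
        return Regex.Match(input, @"Example.*?word").Value;
    }

    // Everything after "Example"; Singleline lets the dot cross newlines.
    public static string Rest(string input)
    {
        Match m = Regex.Match(input, @"Example(?<x>.*)", RegexOptions.Singleline);
        return m.Success ? m.Groups["x"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Greedy(Sample)); // Example word followed by another word
        Console.WriteLine(Lazy(Sample));   // Example word
        Console.WriteLine(Rest("Example first line\nsecond line"));
    }
}
```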

C# regex to match words containing known substrings and not equal to specific keywords

I need to verify if a string contains "error" or "exception" in it, excluding certain keywords: "exception1", "exception2", "includeException", "error1".
This regex seems to do the job:
\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w*\b
It correctly returns 2 matches when run against the following string:
Test string: "exception1 exception2 exception3 includeException error1 error2"
Matches: "exception3", "error2"
However, if I set the RegexOptions.IgnoreCase flag or add "(?i)" at the beginning of the Regex it also returns a match for "includeException".
What am I missing here?
Using a good Regex tester can help you figure out what's actually being matched. I used this one:
http://regexhero.net/tester/
In the results where it highlights the matches, there is a small button with an 'i' for information. So the reason that it's matching includeException when it's case insensitive is that you are matching the latter half of the word. The regex doesn't require white space separating the words.
Your regex would also match with case sensitivity on if includeException were written as includeexception, because your positive match (exception|error) is matching the last half. You can also see that when you start removing spaces. exception1exception2 doesn't match, but exception1exception2exception3 does.
While Regex is very compact, there are several ways to get it wrong. A straightforward approach might be a better solution in this case.
Changing your regex to drop the * from the final \w* will make what you have work the way you want on your sample, although it then requires exactly one word character after the keyword:
\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w\b
I see two main bottlenecks with your regex:
It has several unanchored lookaheads (when unanchored, they usually do not help unless used in a tempered greedy token and other complex patterns)
The \w* subpatterns are placed on both sides of lookaheads, thus, removing any impact from the lookaheads.
The problem with case-insensitivity is described in Berin's answer: you want to match the word exception, and includeException contains that substring. So, a possible solution is to add a leading word boundary to the (error|exception) pattern:
\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)\b(exception|error)\w*\b
                                                               ^^
However, if you need to match words containing error or exception, that ARE NOT EQUAL to specific keywords, use
\b(?!(?:exception1|exception2|includeException|error1)\b)\w*(exception|error)\w*\b
Here, the lookaheads are anchored to the leading word boundary, they are only checked once after each word boundary, not at each position inside a word. Certainly, you can contract it further: \b(?!(?:exception[12]|includeException|error1)\b)\w*(exception|error)\w*\b.
Now, if you need to match words containing error or exception, that DO NOT CONTAIN specific keywords, use
\b(?!\w*(?:exception1|exception2|includeException|error1))\w*(exception|error)\w*\b
All regex patterns used here are tested at regexhero.net
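As a quick check in C#, here is a sketch that runs the "not equal to specific keywords" pattern with RegexOptions.IgnoreCase against the test string from the question. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordFilterDemo
{
    // Words containing "error" or "exception" that are not exactly one of the excluded keywords.
    // The lookahead sits right after the leading \b, so it is tested once per word start.
    public static string[] FindWords(string input)
    {
        const string pattern =
            @"\b(?!(?:exception1|exception2|includeException|error1)\b)\w*(exception|error)\w*\b";

        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string test = "exception1 exception2 exception3 includeException error1 error2";
        foreach (string word in FindWords(test))
        {
            Console.WriteLine(word); // exception3, then error2
        }
    }
}
```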
Regex is not very readable... how about a pure C# solution?
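// Extension methods must be declared in a static class.
public static class StringExtensions
{
    // True when input contains "error" or "exception" and none of the excluded keywords.
    // Note: this tests the whole input, case-sensitively, not individual words.
    public static bool ContainsErrorOrExceptionExcept(this string input, string[] excludedKeywords)
    {
        if (!input.Contains("error") && !input.Contains("exception"))
        {
            return false;
        }
        foreach (string x in excludedKeywords)
        {
            if (input.Contains(x))
            {
                return false;
            }
        }
        return true;
    }
}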

Regex.IsMatch gives true but http://www.regexr.com/ gives false

I'm trying to check whether the following string matches this pattern, using this code:
string str = "CRSSA.T,";
var pattern = @"((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)";
Console.WriteLine(Regex.IsMatch(str, pattern));
The site http://www.regexr.com/ says it's not a match (everything matches except the last comma), but the code prints True. Is that possible?
thanks ahead! :)
First of all, sure it can happen that different regex engines disagree, either because the capabilities differ or the interpretation, e.g. Java's String.matches method explicitly requires the whole string to match, not just a substring.
In your case, though, both regexr and .NET say it matches, because the substring CRSSA.T will match. Your third group, containing the comma, has a * quantifier, i.e. it can be matched zero or more times. In this case it's being matched zero times, but that's okay. It's still a match.
If you want the whole string to match, and no substrings whatsoever, then you need to add anchors to your regex:
^((\w+\.{1}\w+)+(,\w+\.{1}\w+)*)$
Furthermore, {1} is a useless quantifier, you can just leave it out. Also, if you have a capturing group around the whole regex, you can leave that out as well, as it's already in capturing group 0 automatically. So a bit simplified you could use:
^(\w+\.\w+)+(,\w+\.\w+)*$
Also be careful with \w and \b. Those two features are closely linked (by the definition of \w and \W) and are not always intuitive. E.g. they include the underscore and, depending on the regex engine, a lot more than just [A-Za-z0-9_]; in .NET, \w also matches things like ä, µ, Ð, ª, or º. For those reasons I tend to be rather explicit when writing more robust regexes (i.e. those that are not just for quick one-off usage) and use things like [A-Za-z], \p{L}, (?=\P{L}|$), etc. instead of \w, \W and \b.
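To see the effect of the anchors from C#, here is a short sketch using the simplified pattern with and without ^ and $. The names AnchorDemo, ContainsMatch and WholeStringMatches are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Unanchored: true if ANY substring matches.
    public static bool ContainsMatch(string s)
    {
        return Regex.IsMatch(s, @"(\w+\.\w+)+(,\w+\.\w+)*");
    }

    // Anchored: true only if the whole string matches.
    public static bool WholeStringMatches(string s)
    {
        return Regex.IsMatch(s, @"^(\w+\.\w+)+(,\w+\.\w+)*$");
    }

    public static void Main()
    {
        string str = "CRSSA.T,";
        Console.WriteLine(ContainsMatch(str));                // True: the substring CRSSA.T matches
        Console.WriteLine(Regex.Match(str, @"(\w+\.\w+)+(,\w+\.\w+)*").Value); // CRSSA.T
        Console.WriteLine(WholeStringMatches(str));           // False: the trailing comma is left over
        Console.WriteLine(WholeStringMatches("CRSSA.T,AB.C")); // True
    }
}
```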

Regex not capturing date

I have a regex that works fine currently. But now I want to add on to it to capture dates.
Current regex:
(?<GeneralHelp>^/help\s*)?
(?:/client:)
(?<Client>\w*)
(?:(?:\s*/(?<ClientHelp>help))*)*
(?:(?:\s*/)(?<Modules>createHistory)(?:(?:\s*/(?<ModuleHelp>help))*)*)*
I added to the end:
(?:(?:\s*/)(?<StartDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
(?:(?:\s*/)(?<EndDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
Using the below example, it just won't get the dates, but it does match everything else.
/client:testClient/createHistory/11-11-2013/11.11.2013
This regex is used to break up the single string in the string array parameter of Main in a console app. No one on my team is "fluent" in regex, nor do we have time to become fluent. We work with what we can, and this addition is something I thought of today that may help with bigger problems that we have with our project, and we are running low on time. So any help would be appreciated.
First, the ^ in your regex means "start of string", that is you only want to match a date at the start of the string (which is not true for you). So remove it. Same with "$" which means "end of string".
Secondly, [0|1] means "match characters 0, | or 1". You probably want [01] meaning "match characters 0 or 1".
Thirdly, you have an extra closing bracket with an unmatched opening bracket in both your regexes.
Fourthly as a general style point, [0] is the same as 0 so the square brackets are redundant here.
So your (not quite!) "fixed" regex is:
(?:(?:\s*/)(?<StartDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:(?:\s*/)(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
However, this will not match your test string because of the extra "/testModule" in the string which is not in your working regex anywhere.
You could modify your original regex to allow extra slashes in between the two parts of regex?
<original regex>
(?:/[^/]+)* # <-- for the /testModule and any other similar tokens that appear in between
<date regex>
Also as a general point
you have a few occurrences of (?:(?:regex)*)*. I am not sure what the point is of doubling the outer * besides making the regex parser work much harder than it should for no good reason (the outer (?: )* is redundant here).
there is no point doing (?:\s*/) as you are not doing anything with the brackets, so just do \s*/
same with things like (?:/client:). Why have non-capturing brackets if you are not doing anything with them? /client: will do.
(?:regex)* means "match 0 to infinity occurrences of regex". With things like (?:\s*/(?<ClientHelp>help))*, do you really expect this to occur infinitely many times in your string, or will it appear just once or not at all? Consider replacing * with ? which means "match 0 or 1 occurrences" (if you know that that token will appear either once or not at all), or replace it with (say) {0,100} if you know that that token will appear at most 100 times (and at least 0 times). This can improve performance.
So I recommend changing your regex like this:
(?<GeneralHelp>^/help\s*)?
/client:
(?<Client>\w*)
(?:\s*/(?<ClientHelp>help))*
(?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
(?:/[^/]+)*
(?:\s*/(?<StartDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:\s*/(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
You can fiddle around with your regex at regexr where I've created an example with your regex/test string. (Edit: the < and > in the regex seem to have been changed to &lt; and &gt; in regexr, so the link won't work unless you copy/paste the regex I've written directly)
If you're sure these two last fields are dates, you could simply add something like
(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$
(or even (?<StartDate>[^/]+)/(?<EndDate>.+)$ if your cases are all in the same pattern and it fits your needs).
Also as already pointed out by mathematical.coffee, the first regex can be improved.
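For the simpler pattern suggested in this answer, a minimal C# sketch could look like the following. DateArgsDemo and TryGetDates are hypothetical names, and the pattern only checks the shape of the text (three runs of digits with separators); it does not check that the values are real calendar dates.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateArgsDemo
{
    // Three digit runs separated by '.', ' ' or '-', then '/', then the rest of the string.
    private static readonly Regex DatePair =
        new Regex(@"(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$");

    // Hypothetical helper: pulls the last two slash-separated fields out of the argument string.
    public static bool TryGetDates(string args, out string start, out string end)
    {
        Match m = DatePair.Match(args);
        start = m.Success ? m.Groups["StartDate"].Value : null;
        end = m.Success ? m.Groups["EndDate"].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        string start, end;
        if (TryGetDates("/client:testClient/createHistory/11-11-2013/11.11.2013", out start, out end))
        {
            Console.WriteLine("{0} to {1}", start, end); // 11-11-2013 to 11.11.2013
        }
    }
}
```

If the values need to be real dates, one option is to pass the captured strings to DateTime.TryParseExact afterwards rather than encoding calendar rules in the regex.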

To find everything between { }

I'm new to regex and was hoping for a pointer towards finding matches for words which are between { } brackets which are words and the first letter is uppercase and the second is lowercase. So I want to ignore any numbers also words which contain numbers
{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}
so I would only want to bring back matches for:
Test
Tesgd
Abc
I've looked at using \b and \w for words that are bound and [Az] for upper followed by lower but not sure how to only get the words which are between the brackets only as well.
Here is your solution:
Regex r = new Regex(@"(?<={[^}]*?({(?<depth>)[^}]*?}(?<-depth>))*?[^}]*?)(?<myword>[A-Z][a-z]+?)(?=,|}|\Z)", RegexOptions.ExplicitCapture);
string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
var m = r.Matches(s);
foreach (Match match in m)
    Console.WriteLine(match.Groups["myword"].Value);
I assumed it is OK to match inside braces that are not at the deepest nesting level.
Let's dissect the regex a bit. AAA means an arbitrary expression. www means an arbitrary identifier (sequence of letters)
. is any character
[A-Z] is as you can guess any upper case letter.
[^}] is any character but }
,|}|\Z means , or } or end-of-string
*? means match what came before 0 or more times but lazily (Do a minimal match if possible and spit what you swallowed to make as many matches as possible)
(?<=AAA) means AAA should match on the left before you really try to match something.
(?=AAA) means AAA should match on the right after you really match something.
(?<www>AAA) means match AAA and give the string you matched the name www. With the ExplicitCapture option, only named groups like this one capture anything.
(?<depth>) matches the empty string (so it always succeeds) but also pushes "depth" on the stack.
(?<-depth>) matches the empty string but also pops "depth" from the stack. Fails if the stack is empty.
We use the last two items to ensure that we are inside braces. It would be much simpler if there were no nested braces or if matches occurred only in the deepest braces.
The regular expression works on your example and probably has no bugs. However I tend to agree with others, you should not blindly copy what you cannot understand and maintain. Regular expressions are wonderful but only if you are willing to spend effort to learn them.
Edit: I corrected a careless mistake in the regex (replaced .*? with [^}]*? in two places). Moral of the story: it's very easy to introduce bugs in regexes.
In answer to your original question, I would have offered this regex:
\b[A-Z][a-z]+\b(?=[^{}]*})
The last part is a positive lookahead; it notes the current match position, tries to match the enclosed subexpression, then returns the match position to where it started. In this case, it starts at the end of the word that was just matched and gobbles up as many characters as it can as long as they're not { or }. If the next character after that is }, it means the word is inside a pair of braces, so the lookahead succeeds. If the next character is {, or if there's no next character because it's at the end of the string, the lookahead fails and the regex engine moves on to try the next word.
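Here is a small sketch of that lookahead regex in C#, run on the string from the question and then on a nested input of my own invention. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BraceLookaheadDemo
{
    // A capitalized all-letter word counts as "inside braces" if the next brace after it is a '}'.
    public static string[] WordsInsideBraces(string input)
    {
        return Regex.Matches(input, @"\b[A-Z][a-z]+\b(?=[^{}]*})")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string flat = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(", ", WordsInsideBraces(flat))); // Test, Tesgd, Abc

        // Made-up nested input: "Outer" is inside braces, but the next brace after it is '{'.
        string nested = "{ Outer { inner } After }";
        Console.WriteLine(string.Join(", ", WordsInsideBraces(nested))); // After
    }
}
```

The second call shows the limitation: Outer sits inside the outer pair of braces but is skipped, because the lookahead only ever inspects the nearest brace.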
Unfortunately, that won't work because (as you mentioned in a comment) the braces may be nested. Matching any kind of nested or recursive structure is fundamentally incompatible with the way regexes work. Many regex flavors offer that capability anyway, but they tend to go about it in wildly different ways, and it's always ugly. Here's how I would do this in C#, using Balanced Groups:
Regex r = new Regex(@"
    \b[A-Z][a-z]+\b
    (?!
        (?>
            [^{}]+
            |
            { (?<Open>)
            |
            } (?<-Open>)
        )*
        $
        (?(Open)(?!))
    )", RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
foreach (Match m in r.Matches(s))
{
    Console.WriteLine(m.Value);
}
output:
Testc
Testf
Testh
I'm still using a lookahead, but this time I'm using the group named Open as a counter to keep track of the number of opening braces relative to the number of closing braces. If the word currently under consideration is not enclosed in braces, then by the time the lookahead reaches the end of the string ($), the value of Open will be zero. Otherwise, whether it's positive or negative, the conditional construct - (?(Open)(?!)) - will interpret it as "true" and try to match (?!). That's a negative lookahead for nothing, which is guaranteed to fail; it's always possible to match nothing.
Nested or not, there's no need to use a lookbehind; a lookahead is sufficient. Most flavors place such severe restrictions on lookbehinds that nobody would even think to try using them for a job like this. .NET has no such restrictions, so you could do this in a lookbehind, but it wouldn't make much sense. Why do all that work when the other conditions--uppercase first letter, no digits, etc--are so much cheaper to test?
Do the filtering in two steps. Use the regular expression
@"\{(.*)\}"
to pull out the pieces between the brackets, and the regular expression
@"\b([A-Z][a-z]+)\b"
to pull out each of the words that begins with a capital letter and is followed by lower case letters.
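A minimal sketch of the two-step approach follows, with one deliberate change that is mine rather than part of this answer: the first pattern uses [^{}]* instead of .* so that each pair of braces is pulled out separately. With a greedy .*, a single match would run from the first { to the last } and include the text between the groups. This assumes the braces are not nested; the names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStepDemo
{
    public static List<string> WordsInsideBraces(string input)
    {
        var words = new List<string>();

        // Step 1: pull out the text of each (non-nested) pair of braces.
        foreach (Match block in Regex.Matches(input, @"\{([^{}]*)\}"))
        {
            // Step 2: within that text, find words made of one capital followed by lowercase letters.
            foreach (Match word in Regex.Matches(block.Groups[1].Value, @"\b[A-Z][a-z]+\b"))
            {
                words.Add(word.Value);
            }
        }
        return words;
    }

    public static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        foreach (string w in WordsInsideBraces(s))
        {
            Console.WriteLine(w); // Test, Tesgd, Abc (one per line)
        }
    }
}
```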
