Understanding the ramifications of CultureInvariant and IgnoreCase on [A-Za-z]


If I create a Regex based on this pattern: @"[A-Za-z]+", does the set that it matches change at all by adding RegexOptions.IgnoreCase if I'm already using RegexOptions.CultureInvariant (due to issues like this)? I think the obvious answer is "no, it's just redundant". My tests agree, but I wonder if I'm missing something due to confirmation bias.
Please correct me if I'm wrong on this point, but I believe that I definitely need to use CultureInvariant, since I do not know what the culture will be. MSDN Reference
Note: this is not the actual pattern I need to use, just the simplest critical portion of it. The full pattern is: @"[A-Za-z0-9\s!\\#$(),.:;=#'\-{}|/&]+", in case there is actually some strange behavior surrounding symbols, case, and culture. No, I didn't create the pattern, I'm just consuming it, can't change it, and I realize the | is not needed before /&.
If I could change the pattern...
1. Pattern "[a-z]" with both CultureInvariant and IgnoreCase would be functionally equivalent to "[A-Za-z]" using only CultureInvariant, correct?
2. Assuming #1 is correct, which would be more efficient, and why? I would guess the shorter pattern is more efficient to evaluate against, but I don't know how the internals work right now to say that with much confidence.

Using this program we can test all possible two-letter sequences:
    static void Main()
    {
        var defaultRegexOptions = RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture | RegexOptions.Singleline;
        var regex1 = new Regex(@"^[A-Za-z]+$", defaultRegexOptions);
        var regex2 = new Regex(@"^[A-Za-z]+$", defaultRegexOptions | RegexOptions.IgnoreCase);
        // Enumerate every possible first character in parallel.
        ParallelEnumerable.Range(char.MinValue, char.MaxValue - char.MinValue + 1)
            .ForAll(firstCharAsInt =>
            {
                var buffer = new char[2];
                buffer[0] = (char)firstCharAsInt;
                // Pair it with every possible second character.
                for (int i = char.MinValue; i <= char.MaxValue; i++)
                {
                    buffer[1] = (char)i;
                    var str = new string(buffer);
                    // Printed only if the two regexes disagree on this input.
                    if (regex1.IsMatch(str) != regex2.IsMatch(str))
                        Console.WriteLine("dfjkgnearjkgh");
                }
            });
    }
There could be differences in longer sequences but I think that's quite unlikely. This is strong evidence that there is no difference.
The program takes 20 minutes to run.
Unfortunately, this answer does not provide any insight into why this is.

So I had a fundamental misunderstanding of the way this all works. I think this is what was throwing me off...
Regex regex = new Regex("[A-Za-z]", RegexOptions.IgnoreCase);
...will return false for regex.IsMatch("ı"), but true for regex.IsMatch("İ"). If I remove the IgnoreCase it returns false for both, and if I used CultureInvariant (with or without IgnoreCase) it will return false regardless, and this basically boils down to what Scott Chamberlain said in his comment. Thank you Scott.
Ultimately I want "İ" and "ı" to both be rejected, and I just got myself all turned around by bringing IgnoreCase into the mix before I had even considered CultureInvariant. If I drop IgnoreCase and add CultureInvariant then I can keep the pattern as is and have it match what I want it to.
If I were able to change the pattern to just "[A-Z]" then I could use both flags and still get the desired behavior. But the bit about changing the pattern, and which version would be more efficient, was just curiosity. I don't want to get into all the issues that could arise from that discussion, and all the ways I could change the pattern. My concern was with culture, case-insensitivity, and these two RegexOptions.
To summarize, I need to drop IgnoreCase and then the entire issue surrounding culture goes away. If the pattern were a-z or A-Z and I needed to use IgnoreCase to match both upper and lower, then I would need to use CultureInvariant also.
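The conclusion above can be sketched as follows. This is only an illustration, assuming the goal is to accept ASCII letters and nothing else; the class name AsciiLetterCheck and the method IsAsciiLetters are made up for this example and do not come from the original post. Without IgnoreCase, the character class is matched as written, so no case mapping is involved.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiLetterCheck
{
    // No IgnoreCase here: [A-Za-z] is matched as written, so
    // culture-sensitive case mapping never comes into play.
    private static readonly Regex AsciiLetters =
        new Regex(@"^[A-Za-z]+$", RegexOptions.CultureInvariant);

    // Hypothetical helper name, used only for this illustration.
    public static bool IsAsciiLetters(string input)
    {
        return AsciiLetters.IsMatch(input);
    }

    public static void Main()
    {
        // U+0130 is the dotted capital I, U+0131 the dotless small i.
        foreach (var s in new[] { "abcXYZ", "\u0130", "\u0131" })
        {
            Console.WriteLine("{0}: {1}", s, IsAsciiLetters(s));
        }
    }
}
```

The results with IgnoreCase and no CultureInvariant depend on the current culture and possibly on the runtime version, which is why this sketch leaves that combination out.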

Related

How to include and exclude patterns in one regex

Snippet
Let's say there are two regexes - one with patterns that are good and another with patterns that are wrong:
var allowed = new Regex("^(a.*|b.*)$");
var disallowed = new Regex("^(aa|bb|.*c)$");
When this snippet is run:
var cases = new[] {"a", "aa", "aaa", "b", "bb", "bbb", "ac", "bc", "ad", "ad"};
foreach (var c in cases)
Console.WriteLine($"{c}: {allowed.IsMatch(c) && !disallowed.IsMatch(c)}");
It works.
Questions
Is there a way to merge those two regexes into one?
Would it be a better design to create a set of regexes and enumerate over them to test if the input string matches any of the good patterns and none of the bad patterns?

You can simply put them together using a negative lookahead assertion:
(?!^(aa|bb|.*c)$)^(a.*|b.*)$
You can shorten this regex by only specifying the parts you don't want. The rest should match:
(?!^(aa|bb|.*c)$)^.*$
Using this you don't have the problem that you try to combine including and excluding regexes in one regex.
And finally you can try this regex:
^(a|b)(?!\1$).*(?<!c)$
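As a quick sanity check, the first combined pattern can be compared against the original two-regex test on the sample cases from the question. This is a minimal sketch; the class and method names (CombinedRegexCheck, TwoStep, OneStep) are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedRegexCheck
{
    private static readonly Regex Allowed = new Regex("^(a.*|b.*)$");
    private static readonly Regex Disallowed = new Regex("^(aa|bb|.*c)$");

    // The negative lookahead rejects any input the "disallowed" regex would match.
    private static readonly Regex Combined =
        new Regex(@"(?!^(aa|bb|.*c)$)^(a.*|b.*)$");

    // Original approach: two separate regexes.
    public static bool TwoStep(string s) => Allowed.IsMatch(s) && !Disallowed.IsMatch(s);

    // Merged approach: one regex with a lookahead.
    public static bool OneStep(string s) => Combined.IsMatch(s);

    public static void Main()
    {
        var cases = new[] { "a", "aa", "aaa", "b", "bb", "bbb", "ac", "bc", "ad" };
        foreach (var c in cases)
        {
            Console.WriteLine($"{c}: two-step={TwoStep(c)}, combined={OneStep(c)}");
        }
    }
}
```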

In answer to your second question: Depends on whether you want maintainability or speed.
If your rule sets are subject to constant change then I would think multiple sets would be easier to maintain, each set being specific enough to name them.
If you're looking for speed, say you're parsing long documents, then a single regex would be faster. If you make the regexes static readonly, the difference between the two approaches becomes negligible, and both become much faster (around 10 times). The static readonly part really only matters when you move the logic out into a separate method, since you wouldn't want to recreate the regex on every call.
However, if you are looking for both ... do it in code! There are many ways to write this, but they all seem to run at around the same speed, which is over 6 times faster than the compiled regex. I believe this would be easier to maintain even without the comments, and with a few comments it becomes very clear.
    private bool IsAllowed(string word)
    {
        // empty string or whitespace is allowed
        if (string.IsNullOrWhiteSpace(word)) return true;
        // any word ending in the letter c is not allowed
        if (word[word.Length - 1] == 'c') return false;
        // any length that is not two letters is allowed
        if (word.Length != 2) return true;
        // allow anything except aa or bb
        return word != "aa" && word != "bb";
    }
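Note that IsAllowed as written only encodes the exclusions: it accepts a word such as "xyz", which the question's ^(a.*|b.*)$ pattern would reject. The following is a hedged sketch of a stricter variant that also enforces the "starts with a or b" rule, compared against the two regexes held in static readonly fields. The names StrictWordFilter, IsAllowedStrict and IsAllowedByRegex are hypothetical, and the comparison is only intended for inputs without line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictWordFilter
{
    // Created once and reused on every call.
    private static readonly Regex Allowed = new Regex("^(a.*|b.*)$");
    private static readonly Regex Disallowed = new Regex("^(aa|bb|.*c)$");

    public static bool IsAllowedByRegex(string word)
    {
        return Allowed.IsMatch(word) && !Disallowed.IsMatch(word);
    }

    // Hand-written variant (an assumption, not the answer's original method).
    public static bool IsAllowedStrict(string word)
    {
        // the include pattern needs at least one character
        if (string.IsNullOrEmpty(word)) return false;
        // must start with a or b
        if (word[0] != 'a' && word[0] != 'b') return false;
        // any word ending in the letter c is not allowed
        if (word[word.Length - 1] == 'c') return false;
        // allow anything else except aa or bb
        return word != "aa" && word != "bb";
    }

    public static void Main()
    {
        foreach (var w in new[] { "a", "aa", "aaa", "bc", "ad", "xyz" })
        {
            Console.WriteLine($"{w}: regex={IsAllowedByRegex(w)}, code={IsAllowedStrict(w)}");
        }
    }
}
```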

Typically exclude regexes are very hard to write, even more so when combined with including matches. You could try using negative lookahead/lookbehind to do this.
Typically even when possible, the result is not very maintainable. Having separate regexes for include and exclude is almost always better from an "I want to understand this code when I come back to it in 3 months" point of view.
You can combine the "good" patterns into a single regex - this should always be possible. It might even improve performance, as the regex compiler can optimise over all the patterns at once. But if there are a lot of them, it may make maintenance more difficult - no one wants to read a 200 character regex.
So in summary, use separate regexes for include and exclude patterns, but smaller numbers of each are better, provided they don't get too complex. You'll have to use your judgement to work out what is best for your individual case.

c# regex - changing pattern matches until find specific word

Usually I can work things out by myself, but this one is kind of tricky; even the MSDN references and examples confuse more than they help.
I have been testing some code and got stuck mixing a capturing group (for the replacement) with a non-capturing group (to stop the matching where I want).
A simplified version of what I want to change is:
    stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
    valor = "value:8080";
I know that if I use
    pattern = @"value:(\d+)";
I can change every value number to 8080 when I do
    Regex.Replace(stats, pattern, valor);
but I need it to stop replacing once it finds the string 'something'.
I managed to replace everything up to 'something' with 'valor' using
    pattern = @"^(?:(?!something).)*";
Is there a way to replace only the 'value:(\d+)' numbers with 'valor', combined with (?:(?!something).) to stop the matching, in the same pattern?
I've seen lots of examples but none of them covered this, so I don't know if it's possible to combine both conditions.

You can make use of a look-behind solution that makes sure there is no something before the value:
(?<!\bsomething\b.*)value:\d+
Note that something is matched as a whole word due to \b word boundaries.
The result of the replace operation:
    label:100,value:8080,label:110,value:8080,something,label:200,value:8888
Note that (?:(?!something).) is very inefficient and should be used only when no other means works. In .NET, there is a powerful variable-width look-behind, which is the right tool for this task.
Also note that if you are not using capture group backreferences, you do not need those capturing groups in your pattern (I removed the parentheses from around \d+).
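The look-behind approach can be tried with the sample data from the question. This is a minimal sketch; the class name ReplaceBeforeMarker and the method ReplaceValuesBefore are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceBeforeMarker
{
    // Replaces every "value:<digits>" that is not preceded, anywhere
    // earlier on the same line, by the whole word "something".
    public static string ReplaceValuesBefore(string input, string replacement)
    {
        return Regex.Replace(input, @"(?<!\bsomething\b.*)value:\d+", replacement);
    }

    public static void Main()
    {
        var stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
        var valor = "value:8080";
        Console.WriteLine(ReplaceValuesBefore(stats, valor));
    }
}
```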

RegEx doesn't work with .NET, but does with other RegEx implementations

I'm trying to match strings that look like this:
http://www.google.com
But not if it occurs in larger context like this:
http://www.google.com
The regex I've got that does the job in a couple different RegEx engines I've tested (PHP, ActionScript) looks like this:
(?<!["'>]\b*)((https?://)([A-Za-z0-9_=%&#?./-]+))\b
You can see it working here: http://regexr.com?36g0e
The problem is that that particular RegEx doesn't seem to work correctly under .NET.
    private static readonly Regex fixHttp = new Regex(@"(?<![""'>]\b*)((https?://)([A-Za-z0-9_=%&#?./-]+))\b", RegexOptions.IgnoreCase);
    private static readonly Regex fixWww = new Regex(@"(?<=[\s])\b((www\.)([A-Za-z0-9_=%&#?./-]+))\b", RegexOptions.IgnoreCase);
    public static string FixUrls(this string s)
    {
        s = fixHttp.Replace(s, "$1");
        s = fixWww.Replace(s, "$1");
        return s;
    }
Specifically, .NET doesn't seem to be paying attention to the first \b*. In other words, it correctly fails to match this string:
http://www.google.com
But it incorrectly matches this string (note the extra spaces):
http://www.google.com
Any ideas as to what I'm doing wrong or how to work around it?

I was waiting for one of the folks who actually originally answered this question to pop the answer down here, but since they haven't, I'll throw it in.
I'm not precisely sure what was going wrong, but it turns out that in .NET, I needed to replace the \b* with a \s*. The \s* doesn't seem to work with other RegEx engines (I only did a little bit of testing), but it does work correctly with .NET. The documentation I've read around \b would lead me to believe that it should match whitespace leading up to a word as well, but perhaps I've misunderstood, or perhaps there are some weirdnesses around captures that different engines handle differently.
At any rate, this is my final RegEx:
(?<!["'>]\s*)((https?:\/\/)([A-Za-z0-9_=%&#\?\.\/\-]+))\b
I don't understand what was going wrong well enough to give any real context for why this change works, and I dislike RegExes enough that I can't quite justify the time figuring it out, but maybe it'll help someone else eventually :-).
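One likely explanation for the difference: \b is a zero-width assertion, so it matches a position and consumes no characters, which means \b* cannot step over whitespace, whereas \s* can. The \s* version also relies on variable-length look-behind, which .NET supports but many other engines do not, and that may be why it failed elsewhere. Below is a small sketch of the final pattern in use. The sample inputs are assumptions about what the stripped HTML examples looked like, and the names UrlLookbehindCheck and ContainsBareUrl are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLookbehindCheck
{
    // Final pattern from the answer: reject URLs preceded by a quote or '>'
    // with optional whitespace in between.
    private static readonly Regex BareHttpUrl = new Regex(
        @"(?<![""'>]\s*)((https?://)([A-Za-z0-9_=%&#?./-]+))\b",
        RegexOptions.IgnoreCase);

    public static bool ContainsBareUrl(string s) => BareHttpUrl.IsMatch(s);

    public static void Main()
    {
        // Assumed sample inputs, not taken from the original question.
        var samples = new[]
        {
            "see http://www.google.com now",
            "<a href=\"http://www.google.com\">",
            "<a href=\"  http://www.google.com\">"
        };
        foreach (var s in samples)
        {
            Console.WriteLine($"{s} -> {ContainsBareUrl(s)}");
        }
    }
}
```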

Regular Expressions in C# running slowly

I have been doing a little work with regex over the past week and managed to make a lot of progress, however, I'm still fairly n00b. I have a regex written in C#:
    string isMethodRegex =
        @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
        @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
        @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
    IsMethodRegex = new Regex(isMethodRegex);
For some reason, when calling the regular expression IsMethodRegex.IsMatch() it hangs for 30+ seconds on the following string:
"\t * Returns collection of active STOP transactions (transaction type 30) "
Does anyone know how the internals of Regex work and why this would be so slow on this string and not others? I have had a play with it and found that if I take out the * and the parentheses then it runs fine. Perhaps the regular expression is poorly written?
Any help would be so greatly appreciated.

EDIT: I think the performance issue comes in the way <parameters> matching group is done. I have rearranged to match a first parameter, then any number of successive parameters, or optionally none at all. Also I have changed the \s* between parameter type and name to \s+ (I think this was responsible for a LOT of backtracking because it allows no spaces, so that object could match as obj and ect with \s* matching no spaces) and it seems to run a lot faster:
    string isMethodRegex =
        @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
        @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s*(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
        @"((?<parameters>((\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)"+
        @"(\s*,\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)*\s*))?\)";
EDIT: As duly pointed out by @Dan, the following is simply because the Regex can exit early.
This is indeed a really bizarre situation, but if I remove the two optional matching at the beginning (for public/private/internal/protected and static/virtual/abstract) then it starts to run almost instantaneously again:
    string isMethodRegex =
        @"\b(public|private|internal|protected)\s*(static|virtual|abstract)"+
        @"(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
        @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
    var IsMethodRegex = new Regex(isMethodRegex);
    string s = "\t * Returns collection of active STOP transactions (transaction type 30) ";
    Console.WriteLine(IsMethodRegex.IsMatch(s));
Technically you could split into four separate Regex's for each possibility to deal with this particular situation. However, as you attempt to deal with more and more complicated scenarios, you will likely run into this performance issue again and again, so this is probably not the ideal approach.

I changed some 0-or-more (*) matchings with 1-or-more (+), where I think it makes sense for your regex (it's more suitable to Java and C# than to VB.NET):
    string isMethodRegex =
        @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
        @"\s*(?<returnType>[a-zA-Z\<\>_1-9]+)\s+(?<method>[a-zA-Z\<\>_1-9]+)\s+\" +
        @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]+\s+[a-zA-Z_1-9]+\s*)[,]?\s*)+)\)";
It's fast now.
Please check if it still returns the result you expect.
For some background on bad regexes, look here.
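To follow up on "check if it still returns the result you expect", here is a small sketch that runs the tightened pattern against the problem string and two sample signatures. One thing worth noticing is that the \s+ before \( in this version requires whitespace between the method name and the opening parenthesis, which may or may not be what you want. The class name MethodRegexCheck, the method IsMethod, and the sample signatures are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MethodRegexCheck
{
    // Same pattern as the tightened version above, split across lines.
    private const string Pattern =
        @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
        @"\s*(?<returnType>[a-zA-Z\<\>_1-9]+)\s+(?<method>[a-zA-Z\<\>_1-9]+)\s+\(" +
        @"(?<parameters>(([a-zA-Z\[\]\<\>_1-9]+\s+[a-zA-Z_1-9]+\s*)[,]?\s*)+)\)";

    private static readonly Regex IsMethodRegex = new Regex(Pattern);

    public static bool IsMethod(string line) => IsMethodRegex.IsMatch(line);

    public static void Main()
    {
        var samples = new[]
        {
            "\t * Returns collection of active STOP transactions (transaction type 30) ",
            "public static void Main (string[] args)", // space before '('
            "public static void Main(string[] args)"   // no space before '('
        };
        foreach (var s in samples)
        {
            Console.WriteLine($"{IsMethod(s)}: {s.Trim()}");
        }
    }
}
```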

Have you tried compiling your Regex?
    string pattern = @"\b[at]\w+";
    RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
    string text = "The threaded application ate up the thread pool as it executed.";
    MatchCollection matches;
    Regex optionRegex = new Regex(pattern, options);
    Console.WriteLine("Parsing '{0}' with options {1}:", text, options.ToString());
    // Get matches of pattern in text
    matches = optionRegex.Matches(text);
    // Iterate matches
    for (int ctr = 1; ctr <= matches.Count; ctr++)
        Console.WriteLine("{0}. {1}", ctr, matches[ctr-1].Value);
Then the Regular Expression is only slow on the first execution.

RegEx: Correct usage of lookbehind assertion and group definitions

I have the following string:
i:0#.w|domain\x123456
I know about the possibility of grouping search terms by using <mysearchterm> and retrieving them via RegEx.Match(myRegEx).Result("${mysearchterm}");.
I also know from MSDN that I can use lookbehind assertions like (?<= subexpression). Could someone help me get the following (including the possibility to retrieve them via groups as shown before):
domain ("domain")
user account ("x123456")
I don't need anything from before the pipe character (nor the pipe character itself) - so basically I am interested in domain\x123456.

As others have noted, this can be done without regex, or without lookbehinds. That being said, I can think of reasons you might want them: to write a RegexValidator instead of having to roll up a CustomValidator, for example. In ASP.NET, CustomValidators can be a little longer to write, and sometimes a RegexValidator does the job just fine.
As far as lookbehinds, the main reason you'd want one for something like this is if the target string could contain irrelevant copies of the |domain\x123456 pattern:
foo#bar|domain\x999999 says: 'i:0#.w|domain\x888888i:0#.w|domain\x123456|domain\x000000'
If you only wanted to grab domain\x888888 and domain\x123456 out of that, a lookbehind could be useful. Or maybe you just want to learn about lookbehinds. Anyway, since we only have one sample input, I can only guess at the rules; so perhaps something like this:
    @"(?<=[a-z]:\d#\.[a-z]\|)(?<domain>[^\\]*)\\(?<user>x\d+)"
Lookarounds are one of the most subtle and misunderstood features of regex, IMHO. I've gotten a lot of use out of them in preventing false positives, or in limiting the length of matches when I'm not trying to match the entire string (for example, if I want only the 3-digit numbers in blah 1234 123 1234567 123 foo, I can use (?<!\d)\d{3}(?!\d)). Here's a good reference if you want to learn more about named groups and lookarounds.
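A short sketch of the lookbehind pattern above applied to the multi-copy sample string, reading the named groups from each match. The class name ClaimsParser and the method Extract are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ClaimsParser
{
    // Only accept domain\user pairs that directly follow a prefix
    // shaped like "i:0#.w|".
    private static readonly Regex ClaimPattern = new Regex(
        @"(?<=[a-z]:\d#\.[a-z]\|)(?<domain>[^\\]*)\\(?<user>x\d+)");

    // Returns each match as "domain\user".
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in ClaimPattern.Matches(input))
        {
            results.Add(m.Groups["domain"].Value + "\\" + m.Groups["user"].Value);
        }
        return results;
    }

    public static void Main()
    {
        var input = @"foo#bar|domain\x999999 says: 'i:0#.w|domain\x888888i:0#.w|domain\x123456|domain\x000000'";
        foreach (var pair in Extract(input))
        {
            Console.WriteLine(pair);
        }
    }
}
```

The copies that are not preceded by the expected prefix (x999999 and x000000 in this sample) are skipped, which is the false-positive protection the lookbehind is there to provide.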

You can just use the regex @"\|([^\\]+)\\(.+)".
The domain and user will be in groups 1 and 2, respectively.

You don't need regular expressions for that.
    var myString = @"i:0#.w|domain\x123456";
    var relevantParts = myString.Split('|')[1].Split('\\');
    var domain = relevantParts[0];
    var user = relevantParts[1];
Explanation: String.Split(separator) returns an array of substrings separated by separator.
If you insist on using regular expressions, this is how you do it with named groups and Match.Result, based on SLaks's answer (+1, by the way):
    var myString = @"i:0#.w|domain\x123456";
    var r = new Regex(@"\|(?<domain>[^\\]+)\\(?<user>.+)");
    var match = r.Matches(myString)[0]; // get first match
    var domain = match.Result("${domain}");
    var user = match.Result("${user}");
Personally, however, I would prefer the following syntax, if you are just extracting the values:
    var domain = match.Groups["domain"].Value;
    var user = match.Groups["user"].Value;
And you really don't need lookbehind assertions here.
