I have been doing a little work with regex over the past week and have made a lot of progress, but I'm still fairly new to it. I have a regex written in C#:
string isMethodRegex =
@"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
@"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
@"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
IsMethodRegex = new Regex(isMethodRegex);
For some reason, when calling the regular expression IsMethodRegex.IsMatch() it hangs for 30+ seconds on the following string:
"\t * Returns collection of active STOP transactions (transaction type 30) "
Does anyone know how the internals of Regex work and why it would be so slow on this string and not on others? I have had a play with it and found that if I take out the * and the parentheses then it runs fine. Perhaps the regular expression is poorly written?
Any help would be so greatly appreciated.
EDIT: I think the performance issue comes in the way <parameters> matching group is done. I have rearranged to match a first parameter, then any number of successive parameters, or optionally none at all. Also I have changed the \s* between parameter type and name to \s+ (I think this was responsible for a LOT of backtracking because it allows no spaces, so that object could match as obj and ect with \s* matching no spaces) and it seems to run a lot faster:
string isMethodRegex =
@"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
@"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s*(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
@"((?<parameters>((\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)"+
@"(\s*,\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)*\s*))?\)";
EDIT: As duly pointed out by @Dan, the speed-up described below is simply because the Regex can exit early.
This is indeed a really bizarre situation, but if I remove the two optional groups at the beginning (for public/private/internal/protected and static/virtual/abstract) then it starts to run almost instantaneously again:
string isMethodRegex =
@"\b(public|private|internal|protected)\s*(static|virtual|abstract)"+
@"(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
@"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
var IsMethodRegex = new Regex(isMethodRegex);
string s = "\t * Returns collection of active STOP transactions (transaction type 30) ";
Console.WriteLine(IsMethodRegex.IsMatch(s));
Technically you could split into four separate Regex's for each possibility to deal with this particular situation. However, as you attempt to deal with more and more complicated scenarios, you will likely run into this performance issue again and again, so this is probably not the ideal approach.
I changed some 0-or-more (*) matchings with 1-or-more (+), where I think it makes sense for your regex (it's more suitable to Java and C# than to VB.NET):
string isMethodRegex =
@"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
@"\s*(?<returnType>[a-zA-Z\<\>_1-9]+)\s+(?<method>[a-zA-Z\<\>_1-9]+)\s+\" +
@"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]+\s+[a-zA-Z_1-9]+\s*)[,]?\s*)+)\)";
It's fast now.
Please check if it still returns the result you expect.
For some background on bad regexes, look here.
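While tuning a pattern like this, a match timeout can keep a runaway regex from hanging the program. The sketch below relies on the Regex constructor overload that takes a TimeSpan (available since .NET Framework 4.5) and on RegexMatchTimeoutException; the SafeRegex class and the TryIsMatch name are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRegex
{
    // Hypothetical helper: returns true/false when the match attempt
    // completes, or null when it exceeds the timeout (a hint that the
    // pattern is backtracking excessively on this input).
    public static bool? TryIsMatch(string pattern, string input, TimeSpan timeout)
    {
        var regex = new Regex(pattern, RegexOptions.None, timeout);
        try
        {
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }
}
```

With a short timeout, the original pattern and the comment line from the question would be expected to come back as null instead of hanging, although that has not been measured here.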
Have you tried compiling your Regex?
string pattern = @"\b[at]\w+";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
string text = "The threaded application ate up the thread pool as it executed.";
MatchCollection matches;
Regex optionRegex = new Regex(pattern, options);
Console.WriteLine("Parsing '{0}' with options {1}:", text, options.ToString());
// Get matches of pattern in text
matches = optionRegex.Matches(text);
// Iterate matches
for (int ctr = 1; ctr <= matches.Count; ctr++)
Console.WriteLine("{0}. {1}", ctr, matches[ctr-1].Value);
Then the Regular Expression is only slow on the first execution.
Related
OK, so I need to design a regex to insert dashes. I'm tasked with building a web API function that returns a specifically formatted string based upon input parameters. For some reason that hasn't been made clear to me, the source data isn't properly formatted, and I need to reformat the data with dashes in the correct places.
Depending on the first two characters and the string length, there is an optional third dash. Fortunately I'm not concerned with what those characters are. This system is a passthrough, so garbage in, garbage out. However, I do need to make sure the dashes are placed correctly based on length.
Structure Types
Format                  Prefixes
XX-9999999999-XX        AB
XX-9999999999-99        CD, EF
XX-9999999999-XXX-99    GH
XX-9999999999-XX-99     IJ, KL
For Example:
AB123456789044 should be AB-01234567890-44 and
GH1234567890YYY99 becomes GH-01234567890-YYY-99.
Thus far I've gotten to this point:
^(\w\w)(\d{10})(\w{2,3})(\d\d)?$
Which leads to my question:
1) I'm attempting to replace with $1-$2-$3-$4. However, whenever there is a fourth section of digits, as is the case with IJ, it's hard to distinguish between that and AB in the replace.
I've gotten GH-01234567890-YY-99 and GH-01234567890-YY-.
How do I reference a conditional capture group in a replace string such that the dash relating to it only shows up if the grouping exists?
The problem is that you need conditional replacements, and C# doesn't support those. So you've got to do the replacements programmatically. Something like:
string resultString = null;
try {
    Regex regexObj = new Regex(@"([A-Z]{2})-?(\d{10})-?(?:([A-Z]{2,3})|(\d{2}))-?(\d{2})?", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    resultString = regexObj.Replace(subjectString, new MatchEvaluator(ComputeReplacement));
} catch (ArgumentException ex) {
    // Error handling
}
public String ComputeReplacement(Match m) {
    // Vary the replacement text in C# as needed.
    // Placeholder only: a MatchEvaluator's return value is inserted literally,
    // so "$1" is not expanded here; build the result from m.Groups instead.
    return "$1-$2-$3-$4-$5";
}
I haven't paid too much attention to the actual RegEx here, as it seems like you know what you're doing with it. I just included some conditional hyphens in case the data are quite dirty (partially formatted). Obviously you have to edit the "return" part of this, using conditionals in case any of the captures are blank. I haven't worked out that logic for you, as C# isn't my strength.
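One way to fill in that logic is to join only the groups that actually took part in the match, so a dash appears only when the optional group exists. This is a minimal sketch that uses the pattern from the question as-is; the DashFormatter class and Format method are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DashFormatter
{
    // The pattern from the question; group 4 is optional.
    private static readonly Regex Pattern =
        new Regex(@"^(\w\w)(\d{10})(\w{2,3})(\d\d)?$");

    public static string Format(string input)
    {
        return Pattern.Replace(input, m =>
            string.Join("-", m.Groups.Cast<Group>()
                .Skip(1)                // group 0 is the whole match
                .Where(g => g.Success)  // skip groups that did not participate
                .Select(g => g.Value)));
    }
}
```

Input that does not match the pattern is returned unchanged, which fits the "garbage in, garbage out" requirement.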
I need to do string replaces... there are only a few cases I need to handle:
1) optional case insensitive
2) optional whole words
Right now I'm using _myRegEx.Replace()... if #1 is specified, I add the RegexOptions.IgnoreCase flag. If #2 is specified, I wrap the search word in \b<word>\b.
This works fine, but it's really slow. My benchmark takes 1100ms vs 90ms with String.Replace. Obviously there are some issues with using String.Replace instead:
1) case insensitive is tricky
2) regex \b<word>\b will handle "<word>", " <word>", "<word> " and " <word> "... string replace would only handle " <word> ".
I'm already using the RegexOptions.Compiled flag.
Any other options?
You can get a noticeable improvement in this case if you don't use a compiled regex. Honestly, this isn't the first time I have measured regex performance and found the compiled regex to be slower, even when used the way it's supposed to be used.
Let's replace \bfast\b with 12345 in a string a million times, using four different methods, and time how long this took - on two different PCs:
var str = "Regex.Replace is extremely FAST for simple replacements like that";
var compiled = new Regex(@"\bfast\b", RegexOptions.IgnoreCase | RegexOptions.Compiled);
var interpreted = new Regex(@"\bfast\b", RegexOptions.IgnoreCase);
var start = DateTime.UtcNow;
for (int i = 0; i < 1000000; i++)
{
// Comment out all but one of these:
str.Replace("FAST", "12345"); // PC #1: 208 ms, PC #2: 339 ms
compiled.Replace(str, "12345"); // 1100 ms, 2708 ms
interpreted.Replace(str, "12345"); // 788 ms, 2174 ms
Regex.Replace(str, @"\bfast\b", "12345", RegexOptions.IgnoreCase); // 1076 ms, 3138 ms
}
Console.WriteLine((DateTime.UtcNow - start).TotalMilliseconds);
Compiled regex is consistently one of the slowest ones. I don't observe quite as big a difference between string.Replace and Regex.Replace as you do, but it's in the same ballpark. So try it without compiling the regex.
Also worth noting is that if you had just one humongous string, Regex.Replace is blazing fast, taking about 7ms for 13,000 lines of Pride and Prejudice on my PC.
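Whichever variant turns out faster on your machine, the regex can be built once and reused. Here is a minimal sketch of the two switches from the question; the WordReplacer class and its method names are hypothetical. Regex.Escape keeps special characters in the search word from being read as pattern syntax, and doubling "$" keeps the replacement text literal.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordReplacer
{
    // Builds the regex once so it can be reused across many inputs.
    public static Regex Build(string word, bool ignoreCase, bool wholeWord)
    {
        string pattern = Regex.Escape(word);
        if (wholeWord)
        {
            pattern = @"\b" + pattern + @"\b";
        }
        return new Regex(pattern, ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None);
    }

    public static string Replace(Regex regex, string input, string replacement)
    {
        // "$" is special in replacement strings; "$$" yields a literal "$".
        return regex.Replace(input, replacement.Replace("$", "$$"));
    }
}
```

Note that \b only behaves as a whole-word test when the search word itself starts and ends with a word character.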
I want to remove specific tags from an HTML string. I am using HtmlAgility, but that removes entire nodes. I want to 'enhance' it to keep the innerHtml. It's all working, but I have serious performance issues. This made me replace string.Replace with Regex.Replace, and it is already 4 times faster. The replacement needs to be case-insensitive. This is my current code:
var scrubHtmlTags = new[] {"strong","span","div","b","u","i","p","em","ul","ol","li","br"};
var stringToSearch = "LargeHtmlContent";
foreach (var stringToScrub in scrubHtmlTags)
{
stringToSearch = Regex.Replace(stringToSearch, "<" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
stringToSearch = Regex.Replace(stringToSearch, "</" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
}
There are still improvements however:
It should be possible to get rid of <b> as well as </b> in one run, I assume...
Is it possible to do all string replacements in one run?
To do it in one run you can use this:
stringToSearch = Regex.Replace(stringToSearch, "<\\/?" + string.Format("(?:{0})", string.Join("|", scrubHtmlTags)) + ".*?>", "", RegexOptions.IgnoreCase);
But keep in mind that this may fail on several cases.
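One such case: because the tag name is followed directly by .*?, the alternative b also matches the start of a tag such as blockquote. A hedged variant that adds a word boundary after the name and stops at the first > is sketched below; TagScrubber and Scrub are hypothetical names, and it is still a regex over HTML, so comments, scripts and attribute values containing > are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagScrubber
{
    // Removes opening and closing tags for the given names in one pass,
    // keeping the text between them.
    public static string Scrub(string html, IEnumerable<string> tags)
    {
        string alternation = string.Join("|", tags.Select(Regex.Escape));
        // \b stops "b" from matching inside "blockquote";
        // [^>]* allows attributes but never runs past the closing '>'.
        string pattern = @"</?(?:" + alternation + @")\b[^>]*>";
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }
}
```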
If I were your manager ... (koff, koff) ... I would reject your code and tell you, nay, require(!) you, to "listen to Thomas Ayoub," in his #1 post to the first entry on this thread. You are well on your way to creating completely-unmaintainable code here: code that was written because it seemed, to someone who wasn’t talking to anyone else, to have “solved” the immediate problem that s/he faced at the time.
Going back to your original task-description, you say that you “wish to remove specific tags from an HTML string.” You further state that you are already using HtmlAgility (good, good ...), but then you object(!) that it “removes entire nodes.”
“ ’scuse me, but ...” exactly what did you expect it to do? A “tag,” I surmise, is a (DOM) “node.”
So, faced with what you call “a performance problem,” instead of(!) questing for the inevitable bug(!!) in your previous code, you decided to throw caution to the four winds, and to thereby inflict it upon the project and the rest of the team.
And that, as an old-phart Manager, would be where I would step in.
I would exercise my “authority has its privileges” and instruct you ... order you ... to abandon your present approach and to go back to find-and-fix the bugs in your original approach. But, going one step further, I would order you first to “find” the bugs, then to present your proposed(!) solution to the Team and to me, before authorizing you (by Team consensus) to implement your proposed fix.
(And I would like to think that, after you spent a suitable amount of time “calling me an a**hole” (of course ...), you would come to understand why I responded in this way, and why I took the time to say as much on Stack-whatever.com.)
You might try this:
foreach (var stringToScrub in scrubHtmlTags)
{
    stringToSearch = Regex.Replace(
        stringToSearch,
        "</?" + stringToScrub + ">", "",
        RegexOptions.IgnoreCase);
}
But I would try to use one expression to remove them all.
If I create a Regex based on this pattern: @"[A-Za-z]+", does the set that it matches change at all by adding RegexOptions.IgnoreCase if I'm already using RegexOptions.CultureInvariant (due to issues like this)? I think the obvious answer is "no, it's just redundant and repetitive". That is also what my tests have shown, but I wonder if I'm missing something due to confirmation bias.
Please correct me if I'm wrong on this point, but I believe that I definitely need to use the CultureInvariant though, since I also do not know what the culture will be. MSDN Reference
Note: this is not the actual pattern I need to use, just the simplest critical portion of it. The full pattern is: @"[A-Za-z0-9\s!\\#$(),.:;=#'\-{}|/&]+", in case there is actually some strange behavior surrounding symbols, case, and culture. No, I didn't create the pattern, I'm just consuming it, can't change it, and I realize the | is not needed before /&.
If I could change the pattern...
1) Pattern "[a-z]" with both CultureInvariant and IgnoreCase would be functionally equivalent to "[A-Za-z]" using only CultureInvariant, correct?
2) Assuming #1 is correct, which would be more efficient, and why? I would guess the shorter pattern is more efficient to evaluate against, but I don't know how the internals work right now to say that with much confidence.
Using this program we can test all possible two-letter sequences:
static void Main()
{
var defaultRegexOptions = RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture | RegexOptions.Singleline;
var regex1 = new Regex(@"^[A-Za-z]+$", defaultRegexOptions);
var regex2 = new Regex(@"^[A-Za-z]+$", defaultRegexOptions | RegexOptions.IgnoreCase);
ParallelEnumerable.Range(char.MinValue, char.MaxValue - char.MinValue + 1)
.ForAll(firstCharAsInt =>
{
var buffer = new char[2];
buffer[0] = (char)firstCharAsInt;
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
buffer[1] = (char)i;
var str = new string(buffer);
if (regex1.IsMatch(str) != regex2.IsMatch(str))
Console.WriteLine("dfjkgnearjkgh");
}
});
}
There could be differences in longer sequences but I think that's quite unlikely. This is strong evidence that there is no difference.
The program takes 20 minutes to run.
Unfortunately, this answer does not provide any insight into why this is.
So I had a fundamental misunderstanding of the way this all works. I think this is what was throwing me off...
Regex regex = new Regex("[A-Za-z]", RegexOptions.IgnoreCase);
...will return false for regex.IsMatch("ı"), but true for regex.IsMatch("İ"). If I remove the IgnoreCase it returns false for both, and if I used CultureInvariant (with or without IgnoreCase) it will return false regardless, and this basically boils down to what Scott Chamberlain said in his comment. Thank you Scott.
Ultimately I want "İ" and "ı" to both be rejected, and I just got myself all turned around by bringing IgnoreCase into the mix before I had even considered CultureInvariant. If I drop IgnoreCase and add CultureInvariant then I can keep the pattern as is and have it match what I want it to.
If I were able to change the pattern to just "[A-Z]" then I could use both flags and still get the desired behavior. But the bit about changing the pattern, and which would be more efficient was just curiosity. I don't want to get into all the issues that could arise from that discussion, and all the ways I could change pattern. My concern was with culture, case-insensitivity, and these two RegexOptions.
To summarize, I need to drop IgnoreCase and then the entire issue surrounding culture goes away. If the pattern were a-z or A-Z and I needed to use IgnoreCase to match both upper and lower, then I would need to use CultureInvariant also.
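That conclusion can be checked with a small sketch: without IgnoreCase, the character class matches exactly the ASCII ranges it lists, so both Turkish I characters are rejected. The AsciiLetters class name is hypothetical, and the anchors are added here only to test whole strings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiLetters
{
    // No IgnoreCase: the class matches exactly A-Z and a-z.
    private static readonly Regex Pattern =
        new Regex(@"^[A-Za-z]+$", RegexOptions.CultureInvariant);

    public static bool IsMatch(string s)
    {
        return Pattern.IsMatch(s);
    }
}
```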
I am writing a BBCode-to-HTML converter.
The converter should skip unclosed tags.
I thought of 2 options for doing it:
1) Match all tags at once using one regex call, like:
Regex re2 = new Regex(@"\[(\/?(?:b|i|u|quote|strike))\]");
MatchCollection mc = re2.Matches(sourcestring);
and then loop over the MatchCollection using 2 pointers to find opening and closing tags, and then replace them with the right HTML tags.
2) Call a regex multiple times, once for every tag, and replace directly:
Regex re = new Regex(@"\[b\](.*?)\[\/b\]");
string s1 = re.Replace(sourcestring2,"<b>$1</b>");
Which is more efficient?
The first option uses one regex but will require me to loop through all tags and find all pairs, and skip tags that don't have a pair.
Another positive thing is that I don't care about the content between the tags; I just find the tags and replace them by position.
With the second option I don't need to worry about looping and writing a special replace function.
But it will require executing multiple regexes and replaces.
What can you suggest?
If the second option is the right one,
there is a problem with regex
\[b\](.*?)\[\/b\]
How can I fix it to also match across multiple lines, like:
[b]
test 1
[/b]
[b]
test 2
[/b]
One option would be to use more SAX-like parsing, where instead of looking for a particular regex you look for [, then have your program handle that event in some manner, look for the ], handle that event, etc. Although more verbose than the regex, it may be easier to understand, and wouldn't necessarily be slower.
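A rough sketch of such an event-style pass follows. It is not a full hand-written scanner: it borrows the single tag regex from option 1 of the question as a tokenizer, pairs each closing tag with the nearest open tag of the same name, and leaves anything unpaired as plain text. The BbCode class name and the BBCode-to-HTML tag mapping (quote to blockquote, strike to s) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class BbCode
{
    private static readonly Regex Tag =
        new Regex(@"\[(/?)(b|i|u|quote|strike)\]", RegexOptions.IgnoreCase);

    // Assumed mapping from BBCode tag names to HTML element names.
    private static readonly Dictionary<string, string> Html =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "b", "b" }, { "i", "i" }, { "u", "u" },
            { "quote", "blockquote" }, { "strike", "s" }
        };

    public static string ToHtml(string source)
    {
        MatchCollection tags = Tag.Matches(source);
        var paired = new bool[tags.Count];
        var open = new List<int>(); // indexes of opening tags still waiting for a close

        for (int i = 0; i < tags.Count; i++)
        {
            string name = tags[i].Groups[2].Value;
            if (tags[i].Groups[1].Length == 0) { open.Add(i); continue; }

            // Closing tag: find the nearest open tag with the same name.
            int j = open.FindLastIndex(k => string.Equals(
                tags[k].Groups[2].Value, name, StringComparison.OrdinalIgnoreCase));
            if (j < 0) continue; // stray closing tag, leave it as text

            paired[open[j]] = true;
            paired[i] = true;
            open.RemoveRange(j, open.Count - j); // tags opened in between stay unpaired
        }

        var sb = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < tags.Count; i++)
        {
            Match m = tags[i];
            sb.Append(source, pos, m.Index - pos); // text before this tag
            if (paired[i])
                sb.Append('<').Append(m.Groups[1].Value).Append(Html[name: m.Groups[2].Value]).Append('>');
            else
                sb.Append(m.Value); // unpaired tags pass through untouched
            pos = m.Index + m.Length;
        }
        sb.Append(source, pos, source.Length - pos);
        return sb.ToString();
    }

    // Small indexer-style helper so the lookup above reads clearly.
    private static class HtmlLookup { }
}
```

Because the pass works on tag positions only, line breaks between the tags need no special handling.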
r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("asdfasdf[b]test[/b]asdfsadf", "<b>$1</b>");
That should give you only elements that have matched closing tags, and it also handles multiple lines (the Singleline option makes . match newline characters too, so the whole input is effectively treated as a single line).
It should also handle [b][b][/b] properly by ignoring the first [b].
As to whether or not this method is better than your first method I couldn't say. But hopefully this will point you in the right direction.
Code that works with your example below:
System.Text.RegularExpressions.Regex r;
r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]", "<b>$1</b>");
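For the simpler, non-nested pattern from the question, the Singleline option on its own is enough to match across line breaks. A minimal sketch (the BoldConverter class name is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoldConverter
{
    // Singleline makes '.' match '\n' as well, so a [b]...[/b] pair
    // may span several lines; the lazy .*? keeps each pair separate.
    private static readonly Regex Bold =
        new Regex(@"\[b\](.*?)\[/b\]", RegexOptions.Singleline);

    public static string Convert(string source)
    {
        return Bold.Replace(source, "<b>$1</b>");
    }
}
```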