I need to do string replaces... there are only a few cases I need to handle:
1) optional case insensitive
2) optional whole words
Right now I'm using _myRegEx.Replace()... if #1 is specified, I add the RegexOptions.IgnoreCase flag. If #2 is specified, I wrap the search word in \b<word>\b.
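For illustration, a minimal sketch of that setup (BuildReplacer is a hypothetical helper; the Regex.Escape call is an extra guard against metacharacters in the search word):

using System.Text.RegularExpressions;

Regex BuildReplacer(string word, bool ignoreCase, bool wholeWords)
{
    var options = RegexOptions.Compiled;
    if (ignoreCase)
        options |= RegexOptions.IgnoreCase; // case 1: optional case insensitivity
    string pattern = Regex.Escape(word);
    if (wholeWords)
        pattern = @"\b" + pattern + @"\b";  // case 2: optional whole-word matching
    return new Regex(pattern, options);
}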
This works fine, but it's really slow. My benchmark takes 1100 ms vs. 90 ms with String.Replace. Obviously there are some issues with switching to String.Replace:
1) case insensitive is tricky
2) regex \b<word>\b will handle "<word>", " <word>", "<word> " and " <word> "... string replace would only handle " <word> ".
I'm already using the RegexOptions.Compiled flag.
Any other options?
You can get a noticeable improvement in this case if you don't use a compiled regex. Honestly, this isn't the first time I've measured regex performance and found the compiled regex to be slower, even when used the way it's supposed to be used.
Let's replace \bfast\b with 12345 in a string a million times, using four different methods, and time how long it takes - on two different PCs:
var str = "Regex.Replace is extremely FAST for simple replacements like that";
var compiled = new Regex(@"\bfast\b", RegexOptions.IgnoreCase | RegexOptions.Compiled);
var interpreted = new Regex(@"\bfast\b", RegexOptions.IgnoreCase);
var start = DateTime.UtcNow;
for (int i = 0; i < 1000000; i++)
{
// Comment out all but one of these:
str.Replace("FAST", "12345"); // PC #1: 208 ms, PC #2: 339 ms
compiled.Replace(str, "12345"); // 1100 ms, 2708 ms
interpreted.Replace(str, "12345"); // 788 ms, 2174 ms
Regex.Replace(str, @"\bfast\b", "12345", RegexOptions.IgnoreCase); // 1076 ms, 3138 ms
}
Console.WriteLine((DateTime.UtcNow - start).TotalMilliseconds);
Compiled regex is consistently one of the slowest ones. I don't observe quite as big a difference between string.Replace and Regex.Replace as you do, but it's in the same ballpark. So try it without compiling the regex.
Also worth noting is that if you had just one humongous string, Regex.Replace is blazing fast, taking about 7ms for 13,000 lines of Pride and Prejudice on my PC.
I want to remove specific tags from an HTML string. I am using HtmlAgility, but that removes entire nodes; I want to 'enhance' it to keep the innerHtml. It's all working, but I have serious performance issues, which made me change the string.Replace to a Regex.Replace, and it is already 4 times faster. The replacement needs to be case-insensitive. This is my current code:
var scrubHtmlTags = new[] {"strong","span","div","b","u","i","p","em","ul","ol","li","br"};
var stringToSearch = "LargeHtmlContent";
foreach (var stringToScrub in scrubHtmlTags)
{
stringToSearch = Regex.Replace(stringToSearch, "<" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
stringToSearch = Regex.Replace(stringToSearch, "</" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
}
There are still possible improvements, however:
It should be possible to get rid of <b> as well as </b> in one run, I assume...
Is it possible to do all string replacements in one run?
To do it in one run you can use this:
stringToSearch = Regex.Replace(stringToSearch, "<\\/?" + string.Format("(?:{0})", string.Join("|", scrubHtmlTags)) + ".*?>", "", RegexOptions.IgnoreCase);
But keep in mind that this may fail on several cases.
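If this runs more than once, it may also be worth building the combined pattern a single time and reusing the Regex instance; a minimal sketch along those lines (not part of the original answer):

// Sketch: construct the alternation once, then reuse it for every input string.
var tagPattern = new Regex(
    "</?(?:" + string.Join("|", scrubHtmlTags) + ").*?>",
    RegexOptions.IgnoreCase);
stringToSearch = tagPattern.Replace(stringToSearch, "");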
If I were your manager ... (koff, koff) ... I would reject your code and tell you, nay, require(!) you, to "listen to Thomas Ayoub," in his #1 post to the first entry on this thread. You are well on your way to creating completely-unmaintainable code here: code that was written because it seemed, to someone who wasn’t talking to anyone else, to have “solved” the immediate problem that s/he faced at the time.
Going back to your original task-description, you say that you “wish to remove specific tags from an HTML string.” You further state that you are already using HtmlAgility (good, good ...), but then you object(!) that it “removes entire nodes.”
“ ’scuse me, but ...” exactly what did you expect it to do? A “tag,” I surmise, is a (DOM) “node.”
So, faced with what you call “a performance problem,” instead of(!) questing for the inevitable bug(!!) in your previous code, you decided to throw caution to the four winds, and to thereby inflict it upon the project and the rest of the team.
And that, as an old-phart Manager, would be where I would step in.
I would exercise my “authority has its privileges” and instruct you ... order you ... to abandon your present approach and to go back to find-and-fix the bugs in your original approach. But, going one step further, I would order you first to “find” the bugs, then to present your proposed(!) solution to the Team and to me, before authorizing you (by Team consensus) to implement your proposed fix.
(And I would like to think that, after you spent a suitable amount of time “calling me an a**hole” (of course ...), you would come to understand why I responded in this way, and why I took the time to say as much on Stack-whatever.com.)
You might try this:
foreach (var stringToScrub in scrubHtmlTags)
{
stringToSearch = Regex.Replace(
stringToSearch,
"<\/?" + stringToScrub + ">", "",
RegexOptions.IgnoreCase);
}
But I would try to use one expression to remove them all.
I have been doing a little work with regex over the past week and managed to make a lot of progress, however, I'm still fairly n00b. I have a regex written in C#:
string isMethodRegex =
#"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
#"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
#"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
IsMethodRegex = new Regex(isMethodRegex);
For some reason, when calling the regular expression IsMethodRegex.IsMatch() it hangs for 30+ seconds on the following string:
"\t * Returns collection of active STOP transactions (transaction type 30) "
Does anyone know how the internals of Regex work, and why it would be so slow matching this string and not others? I have had a play with it and found that if I take out the * and the parentheses then it runs fine. Perhaps the regular expression is poorly written?
Any help would be so greatly appreciated.
EDIT: I think the performance issue comes from the way the <parameters> matching group is done. I have rearranged it to match a first parameter, then any number of successive parameters, or optionally none at all. Also I have changed the \s* between parameter type and name to \s+ (I think the \s* was responsible for a LOT of backtracking, because it allows zero spaces, so that object could match as obj and ect with \s* matching no spaces in between) and it seems to run a lot faster:
string isMethodRegex =
#"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
#"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s*(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
#"((?<parameters>((\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)"+
#"(\s*,\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)*\s*))?\)";
EDIT: As duly pointed out by @Dan, the following is simply because the Regex can exit early.
This is indeed a really bizarre situation, but if I remove the two optional groups at the beginning (for public/private/internal/protected and static/virtual/abstract) then it starts to run almost instantaneously again:
string isMethodRegex =
#"\b(public|private|internal|protected)\s*(static|virtual|abstract)"+
#"(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
#"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
var IsMethodRegex = new Regex(isMethodRegex);
string s = "\t * Returns collection of active STOP transactions (transaction type 30) ";
Console.WriteLine(IsMethodRegex.IsMatch(s));
Technically you could split this into four separate regexes, one for each possibility, to deal with this particular situation. However, as you attempt to deal with more and more complicated scenarios, you will likely run into this performance issue again and again, so this is probably not the ideal approach.
I replaced some 0-or-more (*) quantifiers with 1-or-more (+) where I think it makes sense for your regex (the result is more suited to Java and C# than to VB.NET):
string isMethodRegex =
#"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
#"\s*(?<returnType>[a-zA-Z\<\>_1-9]+)\s+(?<method>[a-zA-Z\<\>_1-9]+)\s+\" +
#"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]+\s+[a-zA-Z_1-9]+\s*)[,]?\s*)+)\)";
It's fast now.
Please check if it still returns the result you expect.
For some background on bad regexes, look here.
Have you tried compiling your Regex?
string pattern = #"\b[at]\w+";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
string text = "The threaded application ate up the thread pool as it executed.";
MatchCollection matches;
Regex optionRegex = new Regex(pattern, options);
Console.WriteLine("Parsing '{0}' with options {1}:", text, options.ToString());
// Get matches of pattern in text
matches = optionRegex.Matches(text);
// Iterate matches
for (int ctr = 1; ctr <= matches.Count; ctr++)
Console.WriteLine("{0}. {1}", ctr, matches[ctr-1].Value);
Then the Regular Expression is only slow on the first execution.
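Note that you only get that benefit if you reuse the same Regex instance across calls; one common way (a sketch, reusing the pattern above) is a static field:

// Sketch: pay the construction and compilation cost once, at type initialization.
static readonly Regex ThreadWords =
    new Regex(@"\b[at]\w+", RegexOptions.IgnoreCase | RegexOptions.Compiled);

static int CountMatches(string text)
{
    return ThreadWords.Matches(text).Count; // no per-call construction or compilation
}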
I've seen this problem a lot: you have some obscure Unicode character which is somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.
In this case I am trying to export to CSV. Having already used a nasty fix for the dash, em dash, en dash and horizontal bar, I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?
Here's what I have at the moment...
formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");
Any Ideas?
It's a rather inelegant problem, so no method will really be deeply elegant.
Still, we can certainly improve things. Just which approach will work best will depend on the number of changes that need to be made (and the size of the string to change, though it's often best to assume this either is or could be quite large).
At one replacement character, the approach you use so far (using .Replace) is superior, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible, but it's more readable, since it's more usual to refer to that character in hexadecimal, as in U+2013, than in decimal notation (char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage over just using the char notation). (One could also put '–' straight into the code: more readable in some cases, but less so here, where it looks much the same as ‒, — or - to the reader.)
I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).
Taking this approach to your code it becomes:
formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');
Even with as few replacements as 3, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to do a test to find how long formattedString would need to be for this, above a certain number it becomes more efficient to use a single pass even for strings of only a few characters). One approach is:
StringBuilder sb = new StringBuilder(formattedString.Length); // we know this is the capacity, so we initialise with it
foreach(char c in formattedString)
{
    switch(c)
    {
        case '\u2013': case '\u2014': case '\u2015':
            sb.Append('-');
            break;
        default:
            sb.Append(c);
            break;
    }
}
formattedString = sb.ToString();
(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015 but the reduction in number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other).
There are various variants on this (e.g. if formattedString is going to be output to a stream at some point, it may be best to write each final character as it is obtained, rather than buffering again).
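For instance, a sketch of that streaming variant (WriteScrubbed is a hypothetical helper; writer is whatever TextWriter the output already goes to):

// Sketch: write each translated character straight to the output, skipping the StringBuilder.
void WriteScrubbed(string source, System.IO.TextWriter writer)
{
    foreach (char c in source)
    {
        switch (c)
        {
            case '\u2013': case '\u2014': case '\u2015':
                writer.Write('-');
                break;
            default:
                writer.Write(c);
                break;
        }
    }
}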
Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, e.g. we could include:
case 'ß':
    sb.Append("ss");
    break;
Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.
Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:
char[] replacements = {/*list of 128 replacement characters*/};
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
sb.Append(replacements[(int)c]);
formattedString = sb.ToString();
Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range going from 0 to 1,114,111 (0x10FFFF). However, chances are the characters you care about are not only far fewer than that (and if you really did care about that many cases, you'd want the approach given just above) but also in a relatively restricted block.
Suppose also that you don't especially care about any surrogates (we'll come to those later). Most characters you just don't care about, so let's consider this:
char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
    unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
// U+2013 sits at offset 0x13 (19) within its 128-char block, so the dashes go at offsets 19-21:
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray();
char[][] blocks = new char[8704][]; // 8704 * 128 covers every code point, though a char index never exceeds 511
for(int i = 1; i != 8704; ++i)
    blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0; // 0x2013 / 128 == 64
/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor */
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
    int cAsI = (int)c;
    sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
throw new ArgumentException("Unconvertable character");
formattedString = ret;
The balance between whether it's better to test for an unconvertable character in one go at the end (as above) or on each conversion varies according to how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't happen, and can remove that check - but you have to be really sure.
The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data, (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?
In this case, you are probably going to have to at the very least read a character or more ahead in your switch (if using the block-method for most cases, you can use an unconvertable case to signal such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer and handle it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation can deal only with the special cases.
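In case it helps, a minimal sketch of that last idea (DashFallback and DashFallbackBuffer are hypothetical names; this assumes every unencodable character, surrogate pairs included, collapses to '-'):

using System.Text;

class DashFallback : EncoderFallback
{
    public override int MaxCharCount { get { return 1; } }
    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new DashFallbackBuffer();
    }
}

class DashFallbackBuffer : EncoderFallbackBuffer
{
    private bool served; // true once the replacement char has been handed out

    public override bool Fallback(char charUnknown, int index)
    {
        served = false;
        return true; // yes, we will supply a replacement
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        served = false; // a surrogate pair also collapses to a single '-'
        return true;
    }

    public override char GetNextChar()
    {
        if (served) return '\0';
        served = true;
        return '-';
    }

    public override bool MovePrevious()
    {
        if (!served) return false;
        served = false;
        return true;
    }

    public override int Remaining { get { return served ? 0 : 1; } }
}

// Usage: encode to US-ASCII (the fallback handles the special cases), then read it back.
// var ascii = Encoding.GetEncoding("us-ascii", new DashFallback(), DecoderFallback.ReplacementFallback);
// string clean = Encoding.ASCII.GetString(ascii.GetBytes(formattedString));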
You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on a character array, to prevent the intermediary string churn that would result from using string.Replace.
For example:
var lookup = new Dictionary<char, char>
{
{ '`', '-' },
{ 'இ', '-' },
//next pair, etc, etc
};
var input = "blah இ blah ` blah";
var result = input.Select(c => lookup.TryGetValue(c, out var r) ? r : c);
string output = new string(result.ToArray());
Or if you want blanket treatment of characters outside the ASCII range:
string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());
Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.
That being said, you could make a few improvements.
If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
You could load both the "from" and "to" character at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration.
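For instance, a sketch of that configuration-driven variant (the file name and its "hex-code-point,replacement" line format are assumptions):

// Hypothetical sketch: load mappings like "2013,-" from a file, then apply them in one pass.
var map = new Dictionary<char, char>();
foreach (var line in System.IO.File.ReadAllLines("char-replacements.txt"))
{
    var parts = line.Split(',');
    map[(char)Convert.ToInt32(parts[0], 16)] = parts[1][0];
}

var sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
    sb.Append(map.TryGetValue(c, out var to) ? to : c);
formattedString = sb.ToString();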
If they are all replaced with the same string:
formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));
or
foreach (char c in "\u2013\u2014\u2015")
formattedString = formattedString.Replace(c, '-');
I am currently iterating over somewhere between 7000 and 10000 text definitions varying in size between 0 and 5000 characters and I want to check whether a particular string exists in any of them. I want to do this for somewhere in the region of 5000 different string definitions.
In most cases I just want to know whether there's an exact case-insensitive match; however, sometimes a regex is required to be more specific. I was wondering, though, whether it would be quicker to use another "search" technique when the regex isn't required.
A slimmed version of the code looks something like this.
foreach (string find in stringsiWantToFind)
{
Regex rx = new Regex(find, RegexOptions.IgnoreCase);
foreach (String s in listOfText)
if (rx.IsMatch(s))
find.FoundIn(s);
}
I've read around a bit to see whether I'm missing anything obvious. There are a number of suggestions for using compiled regexes; however, I can't see that being helpful given the "dynamic" nature of the regexes.
I also read an interesting article on CodeProject so I'm just about to look at using the "FastIndexOf" to see how it compares in performance.
I just wondered if anybody had any advice for this kind of problem and how performance can potentially be optimized?
Thanks
Something like this? Make one regular expression which contains all the strings you want to match, then loop over the files with that regex. The new Regex parameters are probably wrong; my knowledge of .NET regex patterns is not the best. Also I've left out a few usings to make it more readable here. You could make the Regex compiled to see if that improves things.
Regex rx = new Regex("string1|string2|string3|string5|string-etc", RegexOptions.IgnoreCase);
foreach (string fileName in fileNames)
{
var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
var sr = new StreamReader(fs);
string readFile = sr.ReadToEnd();
MatchCollection matches = rx.Matches(readFile);
foreach (Match match in matches)
{
//do stuff
}
}
I would look into a file indexing service like MS Indexing Service or Google Desktop Search. Those APIs will allow you to search the indexes of your files rather than the files themselves and are extremely fast.
One trick that came to my mind was:
Concatenate the strings into 1 big one, and have the regex work at a global level. That would yield results like 'string found xx times' using 1 regex instead of looping over your list.
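A sketch of that trick (the separator is an assumption; pick one that cannot occur in the definitions):

// Sketch: one pass over all the definitions per search string.
string haystack = string.Join("\n", listOfText);
foreach (string find in stringsiWantToFind)
{
    int count = Regex.Matches(haystack, Regex.Escape(find), RegexOptions.IgnoreCase).Count;
    // count is the number of occurrences across the whole set
}

Regex.Escape treats each search string as a literal; drop it for the cases that really are regexes.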
Hope this helps,
I'm generating regular expressions dynamically by running through some xml structure and building up the statement as I shoot through its node types. I'm using this regular expression as part of a Layout type that I defined. I then parse through a text file that has an Id in the beginning of each line. This id points me to a specific layout. I then try to match the data in that row against its regex.
Sounds fine and dandy, right? The only problem is that it matches strings extremely slowly. I have them set as compiled to try to speed things up a bit, but to no avail. What is baffling is that these expressions aren't that complex. I am by no means a regex guru, but I know a decent amount about them to get things going well.
Here is the code that generates the expressions...
StringBuilder sb = new StringBuilder();
//get layout id and memberkey in there...
sb.Append(#"^([0-9]+)[ \t]{1,2}([0-9]+)");
foreach (ColumnDef c in columns)
{
sb.Append(#"[ \t]{1,2}");
switch (c.Variable.PrimType)
{
case PrimitiveType.BIT:
sb.Append("(0|1)");
break;
case PrimitiveType.DATE:
sb.Append(#"([0-9]{2}/[0-9]{2}/[0-9]{4})");
break;
case PrimitiveType.FLOAT:
sb.Append(#"([-+]?[0-9]*\.?[0-9]+)");
break;
case PrimitiveType.INTEGER:
sb.Append(#"([0-9]+)");
break;
case PrimitiveType.STRING:
sb.Append(#"([a-zA-Z0-9]*)");
break;
}
}
sb.Append("$");
_pattern = new Regex(sb.ToString(), RegexOptions.Compiled);
The actual slow part...
public System.Text.RegularExpressions.Match Match(string input)
{
if (input == null)
throw new ArgumentNullException("input");
return _pattern.Match(input);
}
A typical "_pattern" may have about 40-50 columns. I'll save from pasting the entire pattern. I try to group each case so that I can enumerate over each case in the Match object later on.
Any tips or modifications that could drastically help? Or is this running slowly to be expected?
EDIT FOR CLARITY: Sorry, I don't think I was clear enough the first time around.
I use an xml file to generate regexes for a specific layout. I then run through a file for a data import. I need to make sure that each line in the file matches the pattern it's supposed to. So patterns could be checked against multiple times, possibly thousands of times.
You are parsing a 50-column CSV file (that uses tabs) with a regex?
You should just remove duplicate tabs, then split the text on \t. Now you have all of your columns in an array. You can use your ColumnDef object collection to tell you what each column is.
Edit: Once you have things split up, you could optionally use regex to verify each value, this should be much faster than using the giant single regex.
Edit2: You also get the additional benefit of knowing exactly which column(s) are badly formatted, and you can produce an error like "Syntax error in column 30 on line 12: expected date format."
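A sketch of that split-then-verify approach (ColumnDef, columns and PrimitiveType as in the question; only the DATE check is shown):

// Sketch: split each line on tabs, then validate fields one at a time.
static readonly Regex DatePattern = new Regex(@"^[0-9]{2}/[0-9]{2}/[0-9]{4}$");

string[] fields = line.Split(new[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
for (int i = 2; i < fields.Length; i++) // fields 0 and 1 are the layout id and member key
{
    ColumnDef c = columns[i - 2];
    if (c.Variable.PrimType == PrimitiveType.DATE && !DatePattern.IsMatch(fields[i]))
        Console.WriteLine("Syntax error in column {0}: expected date format", i);
    // ... checks for the other PrimitiveType values go here
}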
Some performance thoughts:
use [01] instead of (0|1)
use non-capturing groups (?:expr) instead of capturing groups (if you really need grouping)
Edit: As it seems that your values are separated by whitespace, why don't you split them up there?
Regular expressions are expensive to create and even more expensive if you compile them. So the problem is that you are creating many regular expressions but using each of them only once.
You should cache them for reuse, and really don't compile them unless you use them very often. I have never measured it, but I would imagine you have to use a simple regular expression well over 100 times to outweigh the cost of the compilation.
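A sketch of such a cache (GetCached is a hypothetical helper, keyed by the generated pattern string):

// Hypothetical sketch: hand out one Regex instance per distinct pattern.
static readonly Dictionary<string, Regex> PatternCache = new Dictionary<string, Regex>();

static Regex GetCached(string pattern)
{
    Regex rx;
    if (!PatternCache.TryGetValue(pattern, out rx))
    {
        rx = new Regex(pattern); // deliberately not compiled; see above
        PatternCache[pattern] = rx;
    }
    return rx;
}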
Performance test
Regex: "^(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$"
Input: "www.stackoverflow.com"
Results in milliseconds per iteration
one regex, compiled, 10,000 iterations: 0.0018 ms
one regex, not compiled, 10,000 iterations: 0.0021 ms
one regex per iteration, not compiled, 10,000 iterations: 0.0287 ms
one regex per iteration, compiled, 10,000 iterations: 4.8144 ms
Note that even after 10,000 iterations the compiled and uncompiled regex are still very close together comparing their performance. With increasing number of iterations the compiled regex performs better.
one regex, compiled, 1,000,000 iterations: 0.00137 ms
one regex, not compiled, 1,000,000 iterations: 0.00225 ms
Well, building the pattern using a StringBuilder will save a few cycles compared to concatenating strings, but any drastic (visibly measurable) optimization is most likely going to come from doing this through some other method entirely.
Regular expressions are slow ... powerful, but slow. Parsing through a text file and then comparing via regular expressions just to retrieve the right bits of data is not going to be very quick (depending on the host computer and the size of the text file).
Perhaps storing the data in some other format rather than a large text file would be more efficient to parse (use XML for that as well?), or perhaps a comma-separated list.
Having potentially 50 match groups in a single expression is going to be a bit slow by default. I would do a few things to try to pin down the performance setback:
Start by comparing a hard-coded pattern against the dynamically generated one and see if there is any performance difference.
Look at your requirements and see if there is any way you can reduce the number of groupings that you need to evaluate.
Use a profiler tool if needed, such as ANTS Profiler, to see the location of the slowdown.
I would just build a lexer by hand.
In this case it looks like you have a bunch of fields separated by tabs, with a record terminated by a newline. The XML file appears to describe the sequence of columns, and their types.
Writing code to recognize each case by hand is probably 5-10 lines of code in the worst case.
You would then simply generate a PrimitiveType[] array from the xml file, and call the "GetValues" function below.
This should allow you to make a single pass through the input stream, which should give a big boost over using regexes.
You'll need to supply the "ScanXYZ" methods yourself; they should be easy to write. It's best to implement them without using regexes.
public IEnumerable<object[]> GetValues(TextReader reader, PrimitiveType[] schema)
{
    while (reader.Peek() > 0)
    {
        var values = new object[schema.Length];
        for (int i = 0; i < schema.Length; ++i)
        {
            switch (schema[i])
            {
                case PrimitiveType.BIT:
                    values[i] = ScanBit(reader);
                    break;
                case PrimitiveType.DATE:
                    values[i] = ScanDate(reader);
                    break;
                case PrimitiveType.FLOAT:
                    values[i] = ScanFloat(reader);
                    break;
                case PrimitiveType.INTEGER:
                    values[i] = ScanInt(reader);
                    break;
                case PrimitiveType.STRING:
                    values[i] = ScanString(reader);
                    break;
            }
        }
        EatTabs(reader);
        if (reader.Peek() == '\n')
        {
            reader.Read(); // consume the record terminator
        }
        else if (reader.Peek() >= 0)
        {
            throw new Exception("Extra junk detected!");
        }
        yield return values;
    }
}
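To give a sense of how small those scanners can be, here is a sketch of two of them (EatTabs and ScanInt; error handling kept minimal):

// Sketch: hand-rolled scanners; each consumes exactly one field's worth of characters.
static void EatTabs(TextReader reader)
{
    while (reader.Peek() == '\t')
        reader.Read();
}

static int ScanInt(TextReader reader)
{
    EatTabs(reader);
    var sb = new System.Text.StringBuilder();
    while (reader.Peek() >= '0' && reader.Peek() <= '9')
        sb.Append((char)reader.Read());
    return int.Parse(sb.ToString());
}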