Whats the best way to optimise my regex matching - c#

I've got an app with a textbox in it. The user enters text into this box.
I have the following function triggered in an OnKeyUp event in that textbox
private void bxItemText_KeyUp(object sender, System.Windows.Input.KeyEventArgs e)
{
// rules is an array of regex strings
foreach (string rule in rules)
{
Regex regex = new Regex(rule);
if (regex.Match(text.Trim().ToLower()))
{
// matched rule is a variable
matchedRule = rule;
break;
}
}
}
I've got about 12 strings in rules, although this can vary.
Once the length of the text in my textbox gets bigger than 80 characters performance starts to degrade. typing a letter after 100 characters takes a second to show up.
How do I optimise this? Should I match on every 3rd KeyUp ? Should I abandon KeyUp altogether and just auto match every couple of seconds?

How do I optimise this? Should I match on every 3rd KeyUp ? Should I abandon KeyUp altogether and just auto match every couple of seconds?
I would go with the second option, that is abandon KeyUp and trigger the validation every couple of seconds, or better yet trigger the validation when the TextBox loses focus.
On the other hand, I should suggest to cache the regular expressions beforehand and compile them because it seems like you are using them over and over again, in other words instead of storing the rules as strings in that array, you should store them as compiled regular expression objects when they are added or loaded.

Use static method calls instead of create a new object each time, static calls use a caching feature : Optimizing Regular Expression Performance, Part I: Working with the Regex Class and Regex Objects.
That will be a major improvement in performance, then you can provide your regexes (rules) to see if some optimization can be done in the regexes.
Other resources :
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Optimizing Regex Performance, Part 3

Combining strings to one on Regex level will work faster than foreach in code.
Combining two Regex to one

If you need pattern determination for Each new symbol, and you care about performance, than Final State Machine seems to be the best option...
That is much harder way. You should specify for each symbol list of next symbols, that are allowed.
And OnKeyUp you just walk on next state, if possible. And you will have the amount of patterns, that input text currently matches.
Some useful references, that I could found:
FSM example
Guy explaining how to convert Regex to FSM
Regex - FSM converting discussion

You don't need to create a new regex object each time. Also using static call will cache the pattern if used before (since .Net 2). Here is how I would rewrite it
matchedRule = Rules.FirstOrDefault( rule => Regex.IsMatch(text.Trim().ToLower(), rule));

Given that you seem to be matching keywords, can you just perform the match on the current portion of text that's been edited (i.e. in the vicinity of the cursor)? Might be tricky to engineer, especially for operations like paste or undo, but scope for a big performance gain.

Pre-compile your regexes (using RegexOptions.Compiled). Also, can you make the Trim and ToLower redundant by expanding your regex? You're running Trim and ToLower once for each rule, which is inefficient even if you can't eliminate them altogether
You can try and make your rules mutually exclusive - this should speed things up. I did a short test: matching against the following
"cat|car|cab|box|balloon|button"
can be sped up by writing it like this
"ca(t|r|b)|b(ox|alloon|utton)"

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
Can collect text along the string (.+ style) followed by a lookahead check for what's been captured up to that point, so what would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my #lines = (
q(It just wasn't able just wasn't able no matter how hard it tried.),
q(This has no repetitions.),
q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (#lines) {
if (/$re_rep/) {
# Other conditions/filtering on $1 (the capture) ?
say $1
}
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), what is changed easily if needed; the question is whether then some legitimate repetitions could get flagged.
The program above prints
just wasn't able
C:\test
Note on regex This quest, to find repeated text, is much too generic
as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word that that is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for what we need to know about data.

Maybe I need a Regex?

I am making a simple console application for a home project. Basically, it monitors a folder for any files being added.
FileSystemWatcher fsw = new FileSystemWatcher(#"c:\temp");
fsw.Created += new FileSystemEventHandler(fsw_Created);
bool monitor = true;
while (monitor)
{
fsw.WaitForChanged(WatcherChangeTypes.Created, 1000);
if(Console.KeyAvailable)
{
monitor = false;
}
}
Show("User has quit the process...", ConsoleColor.Yellow);
When a new files arrives, 'WaitForChanges' gets called, and I can then start the work.
What I need to do is check the filename for patterns. In real life, I am putting video files into this folder. Based on the filename, I will have rules, which move the files into specific directories. So for now, I'll have a list of KeyValue pairs... holding a RegEx (I think?), and a folder. So, if the filename matches a regex, it moves it into the related folder.
An example of a filename is:
CSI- NY.S07E01.The 34th Floor.avi
So, my Regex needs to look at it, and see if the words CSI "AND" (NY "OR" NewYork "OR" New York) exist. If they do, I will then move them to a \Series\CSI\NY\ folder.
I need the AND, because another file example for a different series is:
CSI- Crime Scene Investigation.S11E16.Turn On, Tune In, Drop Dead
So, for this one, I would need to have some NOTs. So, I need to check if the filename has CSI, but NOT ("New York" or "NY" or "NewYork")
Could someone assist me with these RegExs? Or maybe, there's a better method?
You can try to store conditions in Func<string,bool>
Dictionary<Func<string,bool>,string> dic = new Dictionary<Func<string, bool>, string>();
Func<string, bool> f1 = x => x.Contains("CSI") && ((x.Contains("NY") || x.Contains("New York"));
dic.Add(f1,"C://CSI/");
foreach (var pair in dic)
{
if(pair.Key.Invoke("CSI- NY.S07E01.The 34th Floor.avi"))
{
// copy
return;
}
}
I think you have the right idea. The nice thing about this approach is that you can add/remove/edit regular expressions to a config file or some other approach which means you don't have to recompile the project every time you want to keep track of a new show.
A regular expression for CSI AND NY would look something like this.
First if you want to check if CSI exists in the filename the regex is simply "CSI". Keep in mind it's case sensitive by default.
If you want to check if NY, New York or NewYork exist in the file name the regex is "((NY)|(New York)|(NewYork))" The bars indicate OR and the parenthesis are used to designate groups.
In order to combine the two you could run both regexes and in some cases (where perhaps order is unimportant) this might be easier. However if you always expect the show type to come after the syntax would be "(CSI).*((NY)|(New York)|(NewYork))" The period means "any character" and the asterisk means zero or more.
This does not look as one regex, even if you succeed with tossing the whole thing into one. Regexes which match "anything without a given word" are a pain. I'd better stick with two regexes for each rule: one which should match, and the other which should NOT match for this rule to be triggered. If you need your "CSI" and "NY" but don't like fixing any particular order within the filename, you as well may switch from a pair of regexes to a pair of lists of regexes. In general it's better to put this logic into code and configuration and keep regexes as simple as possible. And yes, you're quite likely to get away with simple substring search, no explicit need for regexes as long as you keep your code smart enough.
Well, people already gave you some advises about doing this using:
Regular expressions
Func and storing exactly the C# code that will be executed against the file
so I'm just give you a different one.
I disagree with using Regular Expressions for this purpose. I agree with #Anton S. Kraievoy: I don't like regexes to match anything without a given word. It is easier to check: !text.Contains(word)
The second option looks perfect if you are looking for a fast solution, but...
If that is a more complex application, and you want to design it correctly, I think you should:
Define how you will store those patterns (in a class with members, or in a string, etc). An string example could be:
"CSI" & ("NY" || "Las Vegas")
Then write a module that will match a filename with that pattern.
You're creating kind of a DSL.
Why is it better than just paste directly the C# code?
Well, because:
You can easily change the semantic of your patterns
You can generate the validation code in any language you want, because you're storing patterns in a generic way.
The thing is how to match a pattern against a filename.
You have some options:
Write the grammar of your pattern and write a parser by yourself
Generate (I'm not 100% sure if it is possible, that depends on the grammar) the write a regex that will convert your grammar into C# code.
Like: "A" & "B" to string.Contains("A") && string.Contains("B") or something like that.
Use a tool to do that, like ANTLR.

Why is my regex so much slower compiled than interpreted?

I have a large and complex C# regex that runs OK when interpreted, but is a bit slow. I'm trying to speed this up by setting RegexOptions.Compiled, and this seems to take about 30 seconds for the first time and instantly after that. I'm trying to negate this by compiling the regex to an assembly first, so my app can be as fast as possible.
My problem is when the compiling delay takes place, whether it's compiled in the app:
Regex myComplexRegex = new Regex(regexText, RegexOptions.Compiled);
MatchCollection matches = myComplexRegex.Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{
}
or using Regex.CompileToAssembly in advance:
MatchCollection matches = new CompiledAssembly.ComplexRegex().Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{
}
This is making compiling to an assembly basically useless, as I still get the delay on the first foreach call. What I want is for all the compiling delay to be done at compile time instead (at the Regex.CompileToAssembly call), and not at runtime. Where am I going wrong ?
(The code I'm using to compile to an assembly is similar to http://www.dijksterhuis.org/regular-expressions-advanced/ , if that's relevant ).
Edit:
Should I be using new when calling the compiled assembly in new CompiledAssembly.ComplexRegex().Matches(searchText); ? It gives a "object reference required" error without it though.
Update 2
Thanks for the answers/comments. The regex that I'm using is pretty long but basically straightforward, a list of thousands of words each separated by |. I can't see it'd be a backtracking problem really. The subject string can be just one letter long, and it can still cause the compilation delay. For a RegexOptions.Compiled regex, it'll take over 10 seconds to execute when the regex contains 5000 words. For comparison, the non-compiled version of the regex can take 30,000+ words and still execute just about instantly.
After doing a lot of testing on this, what I think I've found out is:
Don't use RegexOptions.Compiled when your regex has many alternatives - it can be extremely slow to to compile.
.Net will use lazy evaluation for regex when possible, and AFAI can see this extends (at least to some extent) to regex compilation too. A regex will be fully compiled only when it has to be, and there seems to be no way of forcing compilation ahead of time.
Regex.CompileToAssembly would be much more useful if the regexes could be forced to be fully compiled, it seems to be verging on being pointless as it is.
Please correct me if I'm wrong or missing something!
When using RegexOptions.Compiled, you should make sure to re-use the Regex object. It doesn't seem like you are doing this.
RegexOptions.Compiled is a trade-off. The initial construction of the Regex will be slower, because code is compiled on-the-fly, but each match should be faster. If your regular expression changes at run-time, there will probably be no benefit from using RegexOptions.Compiled, although it might depend on the actual expression involved.
Update, per the comments
If your actual code looks like the one you have posted, you are not taking any advantage of CompileToAssembly, as you are creating new, on-the-fly compiled instances of Regex each time that piece of code runs. In order to take advantage of CompileToAssembly, you will need to compile the Regex first; then take the generated assembly and reference it in your project. You should then instantiate the generated, strongly-typed Regex types generated.
In the example you link to, he has a regular expression named FindTCPIP, which gets compiled into a type named FindCTPIP. When this needs to be used, one should create a new instance of this specific type, such as:
TheRegularExpressions.FindTCPIP MatchTCP = new TheRegularExpressions.FindTCPIP();
Try using Regex.CompileToAssembly, then link to the assembly so that you can construct the Regex objects. RegexOptions.Compiled is a runtime option, the regex would still get re-compiled every time you run the application.
A very probable cause when investigating a slow regex is that it backtracks too much. This is solved by rewriting the regex so that the number of backtracking is non existent or minimal.
Can you post the regex and a sample input where it is slow.
Personally I didn't have the need to compile a regex although its interesting to see some actual numbers about performance if you have taken this path.
To force initialization you can call Match against an empty string. On top of that you can use ngen to create a native image of the expression to speed up the process even further. But probably most importantly, it's essentially just as fast to throw 30.000 string.IndexOf's or string.Contains or Regex.Match statements against a given text, than compiling a ginormous big expression to Match against a single text. Since that requires a lot less compilation, jitting etc, as the state machine is a lot simpler.
Another thing you could consider is to tokenize the text and intersect it with the list of words you're after.
After extensive testing of my own, I can confirm the suspicions of mikel are essentially correct. Even when using Regex.CompileToAssembly() and statically linking the resultant DLL into the application, there is a substantial initial delay on the first practical matching call (at least for patterns involving many ORed alternatives). Moreover, the initial delay on the first matching call depends on what text you match against. For example, matching against an empty string or some other arbitrary text will cause less of an initial delay, but you will still get additional delays later on when actual positive matches are first encountered in new text. The only way to fully guarantee future matches will all be lightning fast is to initially force a positive match at runtime with text that does indeed match. Of course this gives the maximum initial delay possible (in exchange for all future matches being lightning fast).
I dug deeper in order to understand this better. For each regex compiled into the assembly, a triplet of classes are written with the following naming template: {RegexName, RegexNameFactoryN, RegexNameRunnerN}. A reference to the RegexNameFactoryN class is instantiated at time of RegexName ctor, but the RegexNameRunnerN class is not. See the private factory and runnerref fields in the base Regex class. runnerref is a cached weak reference to a RegexNameRunnerN object. After various experiments with reflection, I can confirm that the ctors of all 3 of these compiled classes are fast and the RegexNameFactoryN.CreateInstance() function (which returns the initial RegexNameRunnerN reference) is also fast. The initial delay occurs somewhere within RegexRunner.Scan(), or it's call tree, and is thus likely outside the reach of the compiled MSIL generated by Regex.CompileToAssembly() since this call tree involves numerous non-abstract functions. This is very unfortunate and means the C# Regex compilation process performance benefits only extend so far: At runtime there will always be some substantial delay at the first time a positive match is encountered (at least for this class of many-ORed patterns).
I theorize that this has to do with how the Nondeterministic Finite Automaton (NFA) engine performs some of it's own internal caching/instantiations at runtime as the pattern is processed.
jessehouwing's suggestion of ngen is interesting and could possibly improve performance. I have not tested it.

C# Code/Algorithm to Search Text for Terms

We have 5mb of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
What's the most efficient way to do this in .NET (ideally C#)?
Our ideas include regex's (a single one, lots of them) plus even the String.Contains stuff.
The input is a 2mb to 5mb text string - all text. Multiple hits are good, as in each term (of the 1000) that matches then we do want to know about it. Performance in terms of entire time to execute, don't care about footprint. Current algorithm gives about 60 seconds+ using naive string.contains. We don't want 'cat' to provide a match with 'category' or even 'cats' (i.e. entire term word must hit, no stemming).
We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (dont need position or frequency just yet). We get a new 2-5mb string every 10 seconds, so can't assume we can index the input. The 1000 terms are dynamic, although have a change rate of about 1 change an hour.
A naive string.Contains with 762 words (the final page) of War and Peace (3.13MB) runs in about 10s for me. Switching to 1000 GUIDs runs in about 5.5 secs.
Regex.IsMatch found the 762 words (much of which were probably in earlier pages as well) in about .5 seconds, and ruled out the GUIDs in 2.5 seconds.
I'd suggest your problem lies elsewhere...Or you just need some decent hardware.
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
have you considered the following:
do you care about substring? lets say I am looking for the word "cat", nothing more or nothing less. now consider the Knuth-Morris-Pratt algorithm, or string.contains for "concatinate". both of these will return true (or an index). is this ok?
Also you will have to look into the idea of the stemmed or "Finite" state of the word. lets look for "diary" again, the test sentance is "there are many kinds of diaries". well to you and me we have the word "diaries" does this count? if so we will need to preprocess the sentance converting the words to a finite state (diaries -> diary) the sentance will become "there are many kind of diary". now we can say that Diary is in the sentance (please look at the porter Stemmer Algroithm)
Also when it comes to processing text (aka Natrual Langauge Processing) you can remove some words as noise, take for example "a, have, you, I, me, some, to" <- these could be considered as useless words, and can then be removed before any processing takes place? for example
"I have written some C# today", if i have 10,000 key works to look for I would have to scan the entire sentance 10,000 x the number of words in the sentance. removing noise before hand will shorting the processing time
"written C# today" <- removed noise, now there are lots less to look throught.
A great article on NLP can be found here. Sentance comparing
HTH
Bones
A modified Suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build it. After that however every search would take O(1).
Here's another idea: Make a class something like this:
class Word
{
string Word;
List<int> Positions;
}
For every unique word in your text you create an instance of this class. Positions array will store positions (counted in words, not characters) from the start of the text where this word was found.
Then make another two lists which will serve as indexes. One will store all these classes sorted by their texts, the other - by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
Then to search for a phrase, you split that phrase into words. Look up the first word in the Dictionary (that's O(log(n))). From there you know what are the possible words that follow it in the text (you have them from the Positions array). Look at those words (use the position index to find them in O(1)) and go on, until you've found one or more full matches.
Are you trying to achieve a list of matched words or are you trying to highlight them in the text getting the start and length of the match position? If all you're trying to do is find out if the words exist, then you could use subset theory to fairly efficiently perform this.
However, I expect you're trying to each match's start position in the text... in which case this approach wouldn't work.
The most efficient approach I can think is to dynamically build a match pattern using a list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.
It is my understanding that Regex uses the same KMP algorithm suggested to efficiently process large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be ok.
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m) where n is the length of your text and m is the length of all patterns, combined. Notice that this is a very good performance.
An alternative that's well suited for many patterns is the Wu Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.
Ok, current rework shows this as fastest (psuedocode):
foreach (var term in allTerms)
{
string pattern = term.ToWord(); // Use /b word boundary regex
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(bigTextToSearchForTerms))
{
result.Add(term);
}
}
What was surprising (to me at least!) is that running the regex 1000 times was faster that a single regex with 1000 alternatives, i.e. "/b term1 /b | /b term2 /b | /b termN /b" and then trying to use regex.Matches.Count
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
List<String> allTerms = new List<String>(new String(){"string1", "string2", "string3", "string4"});
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase));
This uses classic predicates to implement the FIND method, so it should be quicker than LINQ:
static bool Match(string checkItem)
{
return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}
static void Main(string[] args)
{
List<String> allTerms = new List<String>(new String(){"string1", "string2", "string3", "string4"});
List<String> matches = allTerms.Find(Match);
}
Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:
List<String> allTerms = new List<String>(new String(){"string1", "string2", "string3", "string4"});
List<String> matches = allTerms.Find(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iteration through the search list using the regex. It's just different methods of implementing it.

Best way to replace tokens in a large text template

I have a large text template which needs tokenized sections replaced by other text. The tokens look something like this: ##USERNAME##. My first instinct is just to use String.Replace(), but is there a better, more efficient way or is Replace() already optimized for this?
System.Text.RegularExpressions.Regex.Replace() is what you seek - IF your tokens are odd enough that you need a regex to find them.
Some kind soul did some performance testing, and between Regex.Replace(), String.Replace(), and StringBuilder.Replace(), String.Replace() actually came out on top.
The only situation in which I've had to do this is sending a templated e-mail. In .NET this is provided out of the box by the MailDefinition class. So this is how you create a templated message:
MailDefinition md = new MailDefinition();
md.BodyFileName = pathToTemplate;
md.From = "test#somedomain.com";
ListDictionary replacements = new ListDictionary();
replacements.Add("<%To%>", someValue);
// continue adding replacements
MailMessage msg = md.CreateMailMessage("test#someotherdomain.com", replacements, this);
After this, msg.Body would be created by substituting the values in the template. I guess you can take a look at MailDefinition.CreateMailMessage() with Reflector :). Sorry for being a little off-topic, but if this is your scenario I think it's the easiest way.
Well, depending on how many variables you have in your template, how many templates you have, etc. this might be a work for a full template processor. The only one I've ever used for .NET is NVelocity, but I'm sure there must be scores of others out there, most of them linked to some web framework or another.
string.Replace is fine. I'd prefer using a Regex, but I'm *** for regular expressions.
The thing to keep in mind is how big these templates are. If its real big, and memory is an issue, you might want to create a custom tokenizer that acts on a stream. That way you only hold a small part of the file in memory while you manipulate it.
But, for the naiive implementation, string.Replace should be fine.
If you are doing multiple replaces on large strings then it might be better to use StringBuilder.Replace(), as the usual performance issues with strings will appear.
Regular expressions would be the quickest solution to code up but if you have many different tokens then it will get slower. If performance is not an issue then use this option.
A better approach would be to define token, like your "##" that you can scan for in the text. Then select what to replace from a hash table with the text that follows the token as the key.
If this is part of a build script then nAnt has a great feature for doing this called Filter Chains. The code for that is open source so you could look at how its done for a fast implementation.
Had to do something similar recently. What I did was:
make a method that takes a dictionary (key = token name, value = the text you need to insert)
Get all matches to your token format (##.+?## in your case I guess, not that good at regular expressions :P) using Regex.Matches(input, regular expression)
foreach over the results, using the dictionary to find the insert value for your token.
return result.
Done ;-)
If you want to test your regexes I can suggest the regulator.
FastReplacer implements token replacement in O(n*log(n) + m) time and uses 3x the memory of the original string.
FastReplacer is good for executing many Replace operations on a large string when performance is important.
The main idea is to avoid modifying existing text or allocating new memory every time a string is replaced.
We have designed FastReplacer to help us on a project where we had to generate a large text with a large number of append and replace operations. The first version of the application took 20 seconds to generate the text using StringBuilder. The second improved version that used the String class took 10 seconds. Then we implemented FastReplacer and the duration dropped to 0.1 seconds.
If your template is large and you have lots of tokens, you probably don't want walk it and replace the token in the template one by one as that would result in an O(N * M) operation where N is the size of the template and M is the number of tokens to replace.
The following method accepts a template and a dictionary of the keys value pairs you wish to replace. By initializing the StringBuilder to slightly larger than the size of the template, it should result in an O(N) operation (i.e. it shouldn't have to grow itself log N times).
Finally, you can move the building of the tokens into a Singleton as it only needs to be generated once.
static string SimpleTemplate(string template, Dictionary<string, string> replacements)
{
// parse the message into an array of tokens
Regex regex = new Regex("(##[^#]+##)");
string[] tokens = regex.Split(template);
// the new message from the tokens
var sb = new StringBuilder((int)((double)template.Length * 1.1));
foreach (string token in tokens)
sb.Append(replacements.ContainsKey(token) ? replacements[token] : token);
return sb.ToString();
}
This is an ideal use of Regular Expressions. Check out this helpful website, the .Net Regular Expressions class, and this very helpful book Mastering Regular Expressions.

Categories

Resources