Using Regex to efficiently parse/lex tokens

Using Regex to efficiently parse/lex tokens - c#

I am trying to find a way to use Regex in .NET to efficiently determine which of several patterns a string matches. If my tokens were of fixed text, I would use a Dictionary<> and simply look them up. However, the tokens may have one or more sequences of digits embedded in them to represent indices. I have several dozen to ~100 such tokens. For a small example, I would like to match one of the following tokens:
ORDERID
PRICE(\d+)
QUANTITY(\d+)
DESCRIPTION(\d+)
WEIGHT(\d+)_(\d+)
(The imagined use case is that I have a set of name-value pairs and the names use embedded integers to permit repetition. In this example, imagine an order that has multiple lines, and PRICE is the price for the nth line. WEIGHT_ is the weight of the mth individual object of the nth line (imagine the lineitem is a kit of some sort)).
Note that composition of these tokens is outside of my control.
I can efficiently recognize these tokens with something like
^(?<oid>ORDERID)|(?<prc>PRICE(\d+))|(?<qty>QUANTITY(\d+)|(?<dsc>DESCRIPTION(\d+)|(?<wght>WEIGHT(\d+)_(\d+)$
Note that regular expression matching for a given regular express is linear in the size of the string you are matching, and it shouldn't get more than log n less efficient as I add more tokens.
Now do a match:
Match m = r.Match("PRICE44")
Unfortunately, as far as I can tell, to determine which token was matched from the Regex.Match object, I have to iterate through all of the possibilities:
m.Groups["oid"].Success
m.Groups["prc"].Success
m.Groups["qty"].Success
m.Groups["dsc"].Success
m.Groups["wght"].Success
The cost grows linearly (or more likely, n log n) with an increased number of tokens. If there were, say, a SuccessGroups collection, I could iterate through that, where it would generally (in my usage) have a single element: the particular group that was matched.
I could write my own parsing algorithm creating a trie or similar data structure, but I am loathe to reimplement something that Regex already implements, but doesn't appear to give me efficient access to.
Any ideas or suggestions?

Maybe use groups, it will record the first matched and you can iterate through the matches instead of the tokens
http://msdn.microsoft.com/en-us/library/bs2twtah%28v=vs.110%29.aspx#matched_subexpression

Related

Searching for a substring in a key of a key-value pair more efficient when stored in a data structure or in a long string?

I have a string searching problem and two ideas came to mind on how to implement it. I was wondering if people can indicate which method would give me more efficient performance, or perhaps even suggest a better way of doing it?
The problem is I have a text file of around 450kb containing data in the following format:
description1, code1\n
description2, code2\n
description3, code3\n
...
It is two columns of data delimited by a comma and each record consists of a description and a code.
The code is a short three character text that is not immediately meaningful to the user, which is why there is description data paired with the code.
The description data is a short sentence that describes to the user what the code means.
I'm trying to create a GUI where the user can enter a search keyword in an editable text field which is then used to search against the description data. The system would then return back all the filtered records, i.e., all the description data that has the keyword as a substring and the code that it is paired with for the user to select. This occurs for each character the user types.
The first idea that came to mind on how to implement this feature is to create a key-value pair collection using the description data as key, such as a NameValueCollection, and then use a foreach loop to go through each record and search the key for the matching substring.
The second idea is to read the whole text file into one long string, and use the String.IndexOf() method to search for the keyword and wherever there is a hit in the search, I extract that portion of the record to return to the user.
The second idea came to mind because I was concerned by the performance impact that the first idea may have. I've read that the IndexOf method in use with StringComparison.Ordinal performs better than a Boyer–Moore string search algorithm so I think implementing it this way would have better performance?
So when searching for a substring in the key, does it provide faster retrieval to store the whole file as a string or in a NameValueCollection, or are there better ways of doing this?

If you have a wide collection of strings that you are planning on searching for the exact same substring, you have many options available.
One option would be to use the Aho-Corasick string matching algorithm to search for the search query in every single one of the lines of the file. The total runtime of doing this will be O(m + n + z), where m is the length of the query, z is the number of total matches, and n is the total number of characters in all of the strings in the file.
A better but more complex option would be to build a generalized suffix tree out of all the lines of the file. You could then find all matching lines in time O(n + z), where n is the length of the pattern to search for and z is the total number of lines in the file. This requires O(m) preprocessing time, where m is the total number of characters in the file. This is much, much faster than the first option, but you would probably have to find a good suffix tree library, as suffix tree construction algorithms are fairly complex.
Hope this helps!

Given a dictionary, what's the optimal way to find all possible words that contains a particular set of characters and a string

I'm writing a word game. I have access to the dictionary object to validate the words. I need to find all possible words that contains a word and a set of additional characters.
for example:
lets the say the word is "MEN" and the set of additional characters are "WALOHTD". I need a way to find words like....
1.MEND
2.WOMEN
3.MENTAL
4. etc....
basically we are looking at all possible words that contain "MEN" and any of the specific additional characters.
I can certainly write code that can loop through the entire dictionary to first words that contains the subword and then check for the specific characters existance but that is not optimal. It's taking more than a second. Any help towards optimal solution is greatly appreciated.
_rey

The problem is a mixture of that of regular language and that of searching a data structure.
Considering the first aspect alone, we'd be inclined to use a regular expression. You don't say if we can repeat the "additional characters". If we can, it's easy enough [WALOTHD]*MEN[WALOTHD]* for your case, and that's easily adapted.
If we can't repeat, then we can start with [WALOTHD]{0,7}MEN[WALOTHD]{0,7} and filter out any that break the rule ("ALLOTMENT" matches that expression, but repeats L and T).
Or we can try to build a much more complicated regular expression, though I'm not sure if the gains in the better expression would out-weigh the cost of working out what it was though.
Coming from the other side of searching a dictionary, a DAWG is very space-efficient and makes finding matches that contain substrings relatively efficient. It's not a complete match to this puzzle, as we have quite a few permutations of prefixes and suffixes to worry about. Without testing, I'd guess it'd being reasonably good if we can't repeat from the "additional", and horrible if we can. But that is just a guess. A GADDAG might well be worth looking at, it'd be bigger than a DAWG, but likely faster for this sort of search (GADDAGs are used in scrabble-solving, which is pretty much the same problem that you have here).

How to cut specified words from string

There is a list of banned words ( or strings to be more general) and another list with let's say users mails. I would like to excise all banned words from all mails.
Trivial example:
foreach(string word in wordsList)
{
foreach(string mail in mailList)
{
mail.Replace(word,String.Empty);
}
}
How I can improve this algorithm?
Thanks for advices. I voted few answers up but I didn't mark any as answer since it was more like discussion than solution. Some people missed banned words with bad words. In my case I don't have to bother about recognize 'sh1t' or something like that.

Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.
What happens when you get a work like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead - the intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.

You could use RegEx to make things a little cleaner:
var bannedWords = #"\b(this|is|the|list|of|banned|words)\b";
foreach(mail in mailList)
var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
Even that, though, is far from perfect since people will always figure out a way around any type of filter.

You'll get best performance by drawing up a finite state machine (FSM) (or generate one) and then parsing your input 1 character at a time and walking through the states.
You can do this pretty easily with a function that takes your next input char and your current state and that returns the next state, you also create output as you walk through the mail message's characters. You draw the FSM on a paper.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
In that way you only need to walk each message a single time.

Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.

A general algorithm would be to:
Generate a list of tokens based on the input string (ie. by treating whitespace as token separators)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
"bad",
};
string Input = "this is some bad text.";
string Output = Regex.Replace(Input, #"\b\w+\b", (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);

Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless its funny. People get the joke on 4chan, but yahoo groups about history had confused people because the medireview and mediareview periods were being discussed when eval (not profanity, but is used in some XSS attacks that yahoo had been hit by) was replaced with review in medieval and mediaeval (apparantly, medireview is the American spelling of mediareview!).

In some circumstance is possible to improve it:
Just for fun:
u can use SortedList, if ur mailing list is mailing list (because u have a delimiter like ";") u can do as bellow:
first calculate ur running time algorithm:
Words: n item. (each item has an O(1) length).
mailing list: K item.
each item in mailing list average length of Z.
each sub item in mailing list item average length of Y so the average number of subitems in mailing list items is m = Z/Y.
ur algorithm takes O(n*K*Z). // the best way with knut algorithm
1.now if u sort the words list in O(n log n).
2.1- use mailingListItem.Split(";".ToCharArray()) for each mailing list item: O(Z).
2.2- sort the items in mailing list: O(m * log m)
total sorting takes O(K * Z) in worth case with respect to (m logm << Z).
3- use merge algorithm to merge items of bad word and specific mailing list: O((m + n) * k)
total time is O((m+n)*K + m*Z + n^2) with respect to m << n, total algorithm running time is O(n^2 + Z*K) in worth case, which is smaller than O(n*K*Z) if n < K * Z ( i think so).
So if performance is very very very important, u can do this.

You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to assure you are only getting full words that match. You could use a pattern like this:
"\bBADWORD\b"
Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.

Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipents are made more aware what happened, rather than getting nonsensical sentences with missing words.

Well, you certainly don' want to make the clbuttic mistake of naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just...not do it, since it's entirely futile no matter what--there are ways to make your intended words quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word
/censorship is bullshit rant

I assume that you want to detect only complete words (separated by non-letter characters) and ignore words with a filter-word substring (like a p[ass]word example). In that case you should build yourself a HashSet of filter-words, scan the text for words, and for each word check its existence in HashSet. If it's a filter word then build resulting StringBuilder object without it (or with an equal number of asterisks).

I had great results using this algorithm on codeproject.com better than brute force text replacments.

Is there a way to match a set of groups in any order in a regex?

I looked through the related questions, there were quite a few but I don't think any answered this question. I am very new to Regex but I'm trying to get better so bear with me please. I am trying to match several groups in a string, but in any order. Is this something I should be using Regex for? If so, how? If it matters, I plan to use these in IronPython.
EDIT: Someone asked me to be more specific, so here:
I want to use re.match with a regex like:
\[image\s*(?(#alt:(?<alt>.*?);).*(#title:(?<title>.*?);))*.*\](?<arg>.*?)\[\/image\]
But it will only match the named groups when they are in the right order, and separated with a space. I would like to be able to match the named groups in any order, as long as they appear where they do now in the regex.
A typical string that will be applied to this might look like:
[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien.png[/image]
So the 'attributes' (things that come between '#' and ';' in the first 'tag') should be matched in any order, as long as they both appear.

The answer to the question in your title is "no" -- to match N groups "in any order", the regex should have an "or" (the | feature in the regex pattern) among the N! (N factorial) possible permutations of the groups, the product of all integers from 1 to N. That's a number which grows extremely fast -- for N just equal 6, it's already 720, for 7, it's almost 5000, and so on at a dizzying pace -- so this approach is totally impractical for any N which isn't really tiny.
The solutions may be many, depending on what you want the groups to be separated with. Let's say, for example, that you don't care (if you DO care, edit your question with better specs).
In this case, if overlapping matches are impossible or are OK with you, make N separate regular expressions, one per group -- say these N compiled RE objects are in a list named grps, then
mos = [g.search(thestring) for g in grps]
is the list of match objects for the groups (None for a group which doesn't match). With the mos list you can do all sorts of checks and/or further manipulations, for example all(mos) is True if and only if all the groups matched, in which case [m.group() for m in mos] is the list of substrings that have been matched, and so on, and so forth.
If you need non-overlapping matches, it's a bit more complicated -- you may extract the boundaries of all possible matches for each group, then seeing if there's a way to extract from these N lists a set of N intervals, one per lists, so that no two of them are pairwise intersecting. This is a somewhat subtle algorithm (if you want reasonable speed for a large N, of course), so I think it's worth a separate question, and in any case it's not worth discussing right here when the very issue of whether it's needed or not depends on so incredibly many factors that you have not specified.
So, please edit your question with more precise specifications, first, and then things can perhaps be clarified to provide you with the code and/or algorithms you need.
Edit: I see the OP has now clarified the issue at least of the extent of providing an example -- although, confusingly, he offers a RE pattern example and a string example that should not match, regardless of ordering (the RE specifies the presence of a substring #title which the example string does not have -- puzzling!).
Anyway, if the number of groups in the example (two which appear to be interchangeable, one which appears to have to occur in a specific spot) is representative of the OP's actual problems, then the total number of permutations of interest is just two, so joining the "just two" permutations with a vertical bar | would of course be quite feasible. Is that the case in the OP's real problems, though...?
Edit: if the number of permutations of interest is tiny, here's an example of one way to avoid the problem of repeated group names in the pattern (syntax requires Python 2.7 or better, but that's just for the final "dict comprehension" -- the same functionality is available in many previous version of Python, just with the less elegant dict(('a', ... syntax;-)...:
>>> r = re.compile(r'(?P<a1>a.*?a).*?(?P<b1>b.*?b)|(?P<b2>b.*?b).*?(?P<a2>a.*?a)')
>>> m = r.search('zzzakkkavvvbxxxbnnn')
>>> g = m.groupdict()
>>> d = {'a':(g.get('a1') or g.get('a2')), 'b':(g.get('b1') or g.get('b2'))}
>>> d
{'a': 'akkka', 'b': 'bxxxb'}

This is very similar to one of the key problems with using regular expressions to parse HTML - there is no requirement that attributes always be specified in the same order, and many tags have surprising attributes (like <br clear="all">. So it seems you are working with a very similar markup syntax.
Pyparsing addresses this problem in an indirect way - instead of trying to parse all different permutations, parse the general "#attrname:attribute value;" syntax, and keep track of the attributes keys and values in an attribute mapping data structure. The mapping makes it easy to get the "title" attribute, regardless of whether it came first or last in the image tag. This behavior is built into the pyparsing API methods, makeHTMLTags and makeXMLTags.
Of course, this markup is not XML, but a similar approach gives some pretty easy to work with results:
text = """[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien1.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien2.png[/image]
"""
from pyparsing import Suppress, Group, Word, alphas, SkipTo, Dict, ZeroOrMore
LBRACK,RBRACK,COLON,SEMI,AT = map(Suppress,"[]:;#")
tagAttribute = Group(AT + Word(alphas) + COLON + SkipTo(SEMI) + SEMI)
imageTag = LBRACK + "image" + Dict(ZeroOrMore(tagAttribute)) + RBRACK
imageLink = imageTag + SkipTo("[/image]")("text")
for taginfo in imageLink.searchString(text):
print taginfo.alt
print taginfo.title
print taginfo.text
print
Prints:
alien
reddit alien
http://www.reddit.com/alien1.png
alien
reddit alien
http://www.reddit.com/alien2.png

Efficient string matching algorithm

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.
Here are my requirements:
Given a domain name, i.e. www.example.com, determine if it "matches" one in a list of entries.
Entries may be absolute matches, i.e. www.example.com.
Entries may include wildcards, i.e. *.example.com.
Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.
Language/environment: C# (.Net Framework 3.5)
I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.
I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.
My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.
Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.
The only differences between this and a regular spell-checking tree are:
The matching needs to be done in reverse
You have to take into account the wildcards
To address point #2, you would simply check for the "*" character at the end of a test.
A quick example:
Entries:
*.fark.com
www.cnn.com
Tree:
m -> o -> c -> . -> k -> r -> a -> f -> . -> *
\
-> n -> n -> c -> . -> w -> w -> w
Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.
Checking www.cern.com would fail on the second "n" of n,n,c,...
Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".

I would use Regex, just make sure to have it the expression compiled once (instead of it being calculated again and again).

you don't need regexp .. just reverse all the strings,
get rid of '*', and put a flag to indicate partial match
till this point passes.
Somehow, a trie or suffix trie looks most appropriate.
If the list of domains is known at compile time, you may look at
tokenizing at '.' and using multiple gperf generated machines.
Links:
google for trie
http://marknelson.us/1996/08/01/suffix-trees/

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.
Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.
To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.
This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.
Additionally, I would have to bet that this approach would be faster than RegEx.

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.
Here's an article on recursive descent parsers.

Assuming the rules are as you said: literal or start with a *.
Java:
public static boolean matches(String candidate, List<String> rules) {
for(String rule : rules) {
if (rule.startsWith("*")) {
rule = rule.substring(2);
}
if (candidate.endsWith(rule)) {
return true;
}
}
return false;
}
This scales to the number of rules you have.
EDIT:
Just to be clear here.
When I say "sort the rules", I really mean create a tree out of the rule characters.
Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, and see if it has a y branch, and then a z child).
For the "wildcards" I'd use the same concept, but populate it "backwards", and walk it with the back of the match candidate.
If you have a LOT (LOT LOT) of rules I would sort the rules.
For non wildcard matches, you iterate for each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.)
If it IS a wildcard match, you do the exact same thing, but you work against a list of "backwards rules", and simply match form the end of the string against the end of the rule.

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

Have a look at RegExLib

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:
Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, iterated ad nauseum.

Given your requirements, I think you're on-track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix-matching ('*.example.com' becomes: "is 'moc.elpmaxe' the beginning of my input string?), which certainly doesn't require something as heavy-handed as regexps.
What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.

I guess I am tempted to answer your question with another one: what are you doing that you believe your bottleneck is some string matching above and beyond simmple string-compare? surely something else is listed higher up on your performance profiling?
I would use the obvious string compare tests first that'll be right 90% of the time and if they fail then fallback to a regex

If it was just matching strings, then you should look at trie datastructures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.
That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.
If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...

Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These allow you to search the string more quickly than linear time, at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.
One source of information for these is:
KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html
BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.