Writing an extremely simple parser - c#

I'm writing a very basic web server that has to support an extremely limited special server side scripting language. Basically all I need to support is "echo", addition/subtraction/multiplication (no division) with only 2 operands, a simple "date()" function that outputs the date and the use of the "&" operator to concatenate strings.
An example could be:
echo "Here is the date: " & date();
echo "9 x 15 = : & 9*15;
I've gone through and created the code necessary to generate tokens, but I'm not sure I'm using the right tokens.
I created tokens for the following:
ECHO - The echo command
WHITESPACE - Any whitespace
STRING - A string inside quotations
DATE - The date() function
CONCAT - the & operator for concatenation
MATH - Any instance of a binary operation (5+4, 9*2, 8-2, etc.)
TERM - The terminal character (;)
The MATH one I am particularly unsure about. Typically I see people create a token specifically for integers and then for each operator as well, but since I ONLY want to allow binary operations, I thought it made sense to group it into one token. If I were to do everything separately, I would have to do some extra work to ensure that I never accepted "5+4+1".
So question 1 is am I on the right track with which tokens to use?
My next question is what do I do with these tokens next to ensure correct syntax? The approach that I had thought of was to basically say, "Okay I know I have this token, here is a list of tokens that are allowed to come next based on the current token. Is the next token in the list?"
Based on that, I made a list of all of my tokens as well as what tokens are valid to appear directly after them (didn't include whitespace for simplicity).
ECHO -> STRING|MATH|DATE
STRING -> TERM|CONCAT
MATH -> TERM|CONCAT
DATE -> TERM|CONCAT
CONCAT -> STRING|MATH|DATE
The problem is I'm not sure at all how to best implement this. Really I need to keep track of whitespace as well to make sure there are spaces between the tokens, but that means I have to look ahead two tokens at a time, which is getting even more intimidating. I also am not sure how to manage the "valid next tokens" stuff without just a disgusting section of if blocks. Should I be checking for valid syntax before trying to actually execute the script, or should I do it all at once and just throw an error when I reach an unexpected token? In this simple example, everything will always work just fine parsing left to right; there are no real precedence rules (except the MATH thing, but that's part of why I combined it into one token even though it feels wrong). Even so, I wouldn't mind designing a more scalable and elegant solution.
In my research about writing parsers, I see a lot of references to creating "accept()" and "expect()" functions but I can't find any clear description of what they are supposed to do or how they are supposed to work.
I guess I'm just not sure how to implement this, and then how to actually come up with a resulting string at the end of the day.
Am I heading in the right direction and does anybody know of a resource that might help me understand how to best implement something simple like this? I am required to do it by hand and cannot use a tool like ANTLR.
Thanks in advance for any help.

The first thing that you need to do is to discard all the white-spaces (except for the ones in strings). This way, when you add tokens to the list of tokens, you are sure that the list contains only valid tokens. For example, consider this statement:
echo "Here is the date: " & date();
I will start tokenizing and first separate echo based on the white-space (yes, white-space is needed here to separate it but isn't useful after that). The tokenizer then encounters a double quote and continues reading everything until the closing double quote is found. Similarly, I create separate tokens for &, date and ().
My token list now contains the following tokens:
echo
"Here is the date: "
&
date
()
Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks if it is valid (based on the rules / functions you have for the language). It advances to the next token and sees if it is one of date, string or math. Similarly, it checks the rest of the tokens. If at any point a token is not supposed to be there, you can throw an error indicating a syntax error or something.
For the math statement tokenization, combine only an expression contained in brackets into a single token, and keep the rest of the operands and operators separate. For example, 9*3 + (7-3+1) would have the tokens 9, *, 3, +, and (7-3+1). As every token has its own priority (that you define in the token struct), you can start evaluating from the highest-priority token down to the lowest. This way you can have prioritized expressions. If you still have confusion, let me know. I'll write you some example code.
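If it helps, here is a minimal C# sketch of such a tokenizer. Everything here is illustrative rather than definitive: the TokenType names are my own, and a real lexer would also validate that a MATH token really has the digits-operator-digits shape and that every string has a closing quote.

using System.Collections.Generic;

enum TokenType { Echo, Str, Date, Concat, Math, Term }

record Token(TokenType Type, string Value);

static class Lexer
{
    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }        // discard whitespace
            if (c == '"')                                       // quoted string
            {
                int end = input.IndexOf('"', i + 1);            // assumes a closing quote exists
                tokens.Add(new Token(TokenType.Str, input.Substring(i + 1, end - i - 1)));
                i = end + 1;
            }
            else if (c == '&') { tokens.Add(new Token(TokenType.Concat, "&")); i++; }
            else if (c == ';') { tokens.Add(new Token(TokenType.Term, ";")); i++; }
            else if (input.Substring(i).StartsWith("echo"))
            { tokens.Add(new Token(TokenType.Echo, "echo")); i += 4; }
            else if (input.Substring(i).StartsWith("date()"))
            { tokens.Add(new Token(TokenType.Date, "date()")); i += 6; }
            else                                                // digits and + - * form a MATH token
            {
                int start = i;
                while (i < input.Length && (char.IsDigit(input[i]) || "+-*".IndexOf(input[i]) >= 0)) i++;
                tokens.Add(new Token(TokenType.Math, input.Substring(start, i - start)));
            }
        }
        return tokens;
    }
}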

expect is what your parser does to get the next token, and fails if the token isn't a proper following token. To begin with, your parser expects ECHO or WHITESPACE. Those are the only valid starting terms. Having seen "ECHO", your parser expects one of WHITESPACE|STRING|MATH|DATE; anything else is an error. And so on.
accept is when your parser has seen a complete "statement" - ECHO, followed by a valid sequence of tokens, followed by TERM. Your parser now has enough information to process your ECHO command.
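In code, accept and expect usually end up as a pair of tiny helpers over the token stream. A hedged sketch, assuming a Token/TokenType pair like the one sketched in the previous answer (the Parser shape and names are mine, not a standard API):

using System;
using System.Collections.Generic;

class Parser
{
    private readonly List<Token> _tokens;
    private int _pos;

    public Parser(List<Token> tokens) { _tokens = tokens; }

    // accept: consume the next token only if it has the given type; never fails.
    protected bool Accept(TokenType type)
    {
        if (_pos < _tokens.Count && _tokens[_pos].Type == type) { _pos++; return true; }
        return false;
    }

    // expect: like accept, but a mismatch is a syntax error.
    protected Token Expect(TokenType type)
    {
        if (_pos >= _tokens.Count || _tokens[_pos].Type != type)
            throw new Exception($"Syntax error: expected {type} at token {_pos}");
        return _tokens[_pos++];
    }
}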
Oh, and hand-written parsers (especially simple ones) are very often disgusting collections of if blocks (or moral equivalents like switch statements) :) Further up the ladder of elegance would be some kind of state machine, and further up from that is a grammar generator like yacc or GOLD Parser Generator (which in turn churn out ugly if blocks, switches, and state machines for you).
EDIT to provide more details.
To help sort out responsibilities, create a "lexer" whose job is to read the input and produce tokens. This involves deciding what tokens look like. An easy token is the word "echo". A less easy token is a math operation; the token would consist of one or more digits, an operator, and one or more digits, with no whitespace between. The lexer would take care of skipping whitespace, as well as understanding a quoted string and the characters that form the date() function. The lexer would return two things - the type of token read and the value of the token (e.g., "MATH" and "9*15").
With a lexer in hand to read your input, the parser consumes the tokens and ensures they're in a proper order. First you have to see the ECHO token. If not, fail with an error message. After that, you have to see STRING, DATE, or MATH. If not, fail with an error message. After that, you loop, watching for either TERM, or else CONCAT followed by another STRING, DATE, or MATH. If you see TERM, break the loop. If you see neither TERM nor CONCAT, fail with an error message.
You can process the ECHO command as you're parsing, since it's a simple grammar. Each time you find a STRING, DATE or MATH, evaluate it and concatenate it to what you already have. When you find TERM, exit the function and return the built-up string.
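Continuing the Parser sketch above, the whole echo grammar fits in one method. This is a sketch under the same assumptions; Evaluate() is a hypothetical helper that turns a STRING token into its text, a DATE token into today's date, and a MATH token into the computed number.

// Add to the Parser class above. Requires "using System.Text;" for StringBuilder.
// Evaluate() is a hypothetical helper, not shown here.
public string ParseEcho()
{
    Expect(TokenType.Echo);
    var sb = new StringBuilder();
    do
    {
        if (Accept(TokenType.Str) || Accept(TokenType.Date) || Accept(TokenType.Math))
            sb.Append(Evaluate(_tokens[_pos - 1]));  // evaluate and concatenate as we go
        else
            throw new Exception("Expected STRING, DATE, or MATH");
    } while (Accept(TokenType.Concat));              // '&' means another operand follows
    Expect(TokenType.Term);                          // otherwise the statement must end with ';'
    return sb.ToString();
}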
Questions? Comments? Omelets? :)

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can collect text along the string (.+ style), followed by a lookahead check for what has been captured up to that point, i.e., for what would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);

my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/;  # at least two words, and then some

for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture)?
        say $1;
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands, and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with only one word, far too much legitimate text gets flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for what we need to know about data.
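Since the asker's tool (FNR) uses .NET regexes, here is the same idea translated to C#. The pattern is the one above; the class and variable names are just for illustration:

using System;
using System.Text.RegularExpressions;

class RepeatFinder
{
    static void Main()
    {
        // Capture at least two words, then use a lookahead to require
        // an immediate repetition of the capture.
        var re = new Regex(@"(\w+\W+\w+.+)(?=\1)");

        string[] lines =
        {
            @"It just wasn't able just wasn't able no matter how hard it tried.",
            @"This has no repetitions.",
            @"{FolderLoc = ""C:\testC:\test""}",
        };

        foreach (string line in lines)
        {
            Match m = re.Match(line);
            if (m.Success)
                Console.WriteLine(m.Groups[1].Value);   // the repeated chunk
        }
    }
}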

How to arrange the matched output in c#?

I'm matching words to create a simple lexical analyzer.
Here is my example code and output.
example code:
public class
{
    public static void main (String args[])
    {
        System.out.println("Hello");
    }
}
output:
public = identifier
void = identifier
main = identifier
class = identifier
As you can see, my output is not arranged in the same order as the input. void and main come after class in the input, but in my output class comes at the end. I want to print the results in the order the input is matched.
c# code:
private void button1_Click(object sender, EventArgs e)
{
    if (richTextBox1.Text.Contains("public"))
        richTextBox2.AppendText("public = identifier\n");
    if (richTextBox1.Text.Contains("void"))
        richTextBox2.AppendText("void = identifier\n");
    if (richTextBox1.Text.Contains("main"))
        richTextBox2.AppendText("main = identifier\n");
    if (richTextBox1.Text.Contains("class"))
        richTextBox2.AppendText("class = identifier\n");
}
Your code is asking the following questions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void and class are keywords, not identifiers; main is just an identifier.
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split() - that will cut up the input at whitespace boundaries and give you an array of strings. Still, there are a lot of empty entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string with whitespace in it it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), can have an exponent part in scientific notation (e.g., 1.5e-3), which can also be negative, and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
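A hedged sketch of that state machine in C#, handling just whitespace, punctuation, strings with escapes, and both comment styles (the State names and the string-per-token representation are illustrative, not a fixed design):

using System.Collections.Generic;
using System.Text;

enum State { Default, InString, InLineComment, InBlockComment }

static class Tokenizer
{
    public static IEnumerable<string> Tokenize(string input)
    {
        var state = State.Default;
        var current = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            switch (state)
            {
                case State.InString:
                    current.Append(c);
                    if (c == '\\') current.Append(input[++i]);  // escape: keep next char verbatim
                    else if (c == '"')                          // closing quote ends the token
                    { yield return current.ToString(); current.Clear(); state = State.Default; }
                    break;

                case State.InLineComment:
                    if (c == '\n') state = State.Default;       // line comment ends at newline
                    break;

                case State.InBlockComment:
                    if (c == '*' && i + 1 < input.Length && input[i + 1] == '/')
                    { i++; state = State.Default; }             // "*/" closes the block
                    break;

                default:
                    bool comment = c == '/' && i + 1 < input.Length
                                   && (input[i + 1] == '/' || input[i + 1] == '*');
                    if (char.IsWhiteSpace(c) || comment || "(){}[].;,".IndexOf(c) >= 0 || c == '"')
                    {
                        if (current.Length > 0)                 // flush the pending token
                        { yield return current.ToString(); current.Clear(); }
                    }
                    if (comment)
                    { state = input[i + 1] == '/' ? State.InLineComment : State.InBlockComment; i++; }
                    else if ("(){}[].;,".IndexOf(c) >= 0)
                        yield return c.ToString();              // punctuation is a token of its own
                    else if (c == '"')
                    { current.Append(c); state = State.InString; }
                    else if (!char.IsWhiteSpace(c))
                        current.Append(c);
                    break;
            }
        }
        if (current.Length > 0) yield return current.ToString();
    }
}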
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
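As a hedged illustration of one such state (names like NextToken, ParseClassDefinition and ParseMethodDefinition are assumed helpers, not a real compiler API):

// Hypothetical fragment: after an access-modifier keyword,
// only certain tokens may follow.
void ParseAfterAccessModifier(string modifier)
{
    string next = NextToken();
    if (next == "class")
        ParseClassDefinition();          // e.g. "public class ..." is valid
    else if (next == "static" || next == "void")
        ParseMethodDefinition();         // start of a method signature
    else
        // e.g. "public while" is not a valid C# construct
        throw new Exception($"Syntax error: '{next}' cannot follow '{modifier}'");
}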
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsoft's websites.
CodeCaster is right. You are not on the right path.
I have a lexical analyzer that I made some time ago as a project.
I know, I know, I'm not supposed to put things on a plate here, but the analyzer is for C++, so you'll have to change a few things.
Take a look at the source code and please try to understand how it works at least: C++ Lexical Analyzer
In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.

C# strings with embedded null characters: bug, bad practice, or vulnerability?

I have re-coded to avoid embedded null characters in C# strings,... but was wondering why the following calls gave no warning or exception for string parameters with an embedded null character, and whether this was a bug in StringBuilder.ToString(), a bad practice in general for C#, or at worst a vulnerability in .NET.
For background I have a WPF application that was parsing through an XPath to create nodes and attributes within an XmlDocument when needed. The StringBuilder class let me replace a path delimiter with a null character, e.g.: xpathtonode[i] = '\0';
Though this is allowed, if it were a bad practice I would hope to receive an exception or at least a warning.
The call to xpathtonode.ToString() would correctly return the string up to the null terminating character, except when the null character was embedded as the last character; then the null character would be included in the string returned by ToString(). Thus the string's Length property would be longer than the intended string value.
If StringBuilder.ToString() would recognize the null character at the end of the string and exclude it, there would not have been the following issue. Maybe this is just a bug in the StringBuilder class...
The subsequent call to XmlDocument.CreateAttribute(...), or even a call to exclude the embedded null character xpathtonode.ToString().Substring(offset,length) would exit the thread of execution without error or exception. My program and the debugger would continue to operate as if the call had never occurred,...
I doubt that this would be an OS style buffer overflow vulnerability,... but it is creepy to have the flow of execution interrupted and continue without any indication.
Bug? Bad Practice? Vulnerability?
In your problem statement, you said,
The StringBuilder class let me replace a path delimiter with a null character, e.g.:
xpathtonode[i] = '\0';
Though this is allowed, if it were a bad practice I would hope to receive and [sic]
exception or at least a warning.
U+0000 (ASCII NUL) is a perfectly legal Unicode control character and a perfectly legal character in a .NET string: .NET strings aren't NUL-terminated; they carry a length specifier around with them.
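A quick demonstration of the length-prefix behaviour (a top-level C# program; the variable name is arbitrary):

using System;

// .NET strings carry an explicit length, so an embedded NUL is just data.
string s = "abc\0def";
Console.WriteLine(s.Length);        // 7 -- nothing stops at the NUL
Console.WriteLine(s.IndexOf('\0')); // 3 -- the NUL is addressable like any char
Console.WriteLine(s.Substring(4));  // "def" -- content after the NUL is intact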
You might use a more appropriate Unicode/ASCII control character for this:
U+001C (FS) is File Separator.
U+001D (GS) is Group Separator.
U+001E (RS) is Record Separator.
U+001F (US) is Unit Separator.
Back in the old days (history lesson coming), when men were men, data was persisted to paper tape or punch cards.
In particular, on paper tape, fields within a file record would be separated with US, the unit separator. Groups of fields (e.g., repeating fields or a group of related fields) might be delimited with GS (group separator). Individual records within a file would be delimited with RS (record separator) and individual files on the tape with FS the file separator.
Punch cards were a little different since cards were discrete things. Each record was often (but not always!) on a single punch card. And a "file" might be 1 or more boxes of punch cards.
Bug? Bad Practice? Vulnerability?
Specific to .NET's XmlDocument object, since you mention that calls to CreateAttribute(...) or xpathtonode.ToString().Substring(offset,length) cause the thread to be exited without error or exception, this appears to be a small bug. It would be bad practice for you to include the null character in any code because of this quirk.
However, this can also be classed as a vulnerability if you are constructing the path from user input, as a malicious user could include the null character on purpose to change the execution path of your code. It is good practice anyway to sanitize any user-controlled or external data in XPath queries, as otherwise your code would be vulnerable to XPath Injection:
XPath Injection attacks occur when a web site uses user-supplied information to construct an XPath query for XML data. By sending intentionally malformed information into the web site, an attacker can find out how the XML data is structured, or access data that he may not normally have access to.
There are a few ways to avoid XPath Injection in .NET.
Regarding null bytes in general and your StringBuilder example, it appears that this may be a type of off-by-one error. If the StringBuilder does any string processing with user input, it may be possible for an attacker to provide a null terminated string and access the value of a character that they normally wouldn't have access to. It might also be possible for a user to supply a null terminated string and cause the program to discard whatever would normally follow in the string. These attacks would rely on the null value being persisted from the initial input location, as the pipeline may consistently terminate the input at the null byte. It is any inconsistency that is the problem.
For example, if one component treats the string 12345\06789 as 123456789 during validation, and another component treats the string as 12345 when the value is actually used then this is a problem. This was the cause of several PHP null byte related issues where PHP would read the null byte, but any system functions that were written in C classed them as a termination character. This made it possible to smuggle various strings past the PHP validation code and then enable the operating system to execute things it wasn't meant to as an aid to the attacker.
However, as .NET is a managed language this is unlikely to lead to a buffer overflow vulnerability. It might be worth further investigating if it is possible to do any of these by injecting null bytes from user input.
Bad practice, because the \0 character can be interpreted differently by various features/functions of .NET, giving you strange/unpredictable results. The real question is why you would purposely use that character.
Here is a similar question/response: Why is there no Char.Empty like String.Empty?

Regex expression to validate a user input

I am building a system where the user builds a query by selecting his operands from a combobox (operand names are then put between $ signs).
e.g. $TotalPresent$+56
e.g. $Total$*100
e.g. 100*($TotalRegistered$-$NumberPresent$)
Things like that.
However, the user is also allowed to enter brackets and the +, -, * and / operators.
Thus he can also make mistakes like
e.g. $Total$+1a
e.g. 78iu+$NumberPresent$
etc.
I need a way to validate the query built by the user.
How can I do that ?
A regex will never be able to properly validate a query like that. Either your validation would be incomplete, or you would reject valid input.
As you're building a query, you must already have a way to parse and execute it. Why not use your parsing code to validate the user input? If you want to have client-side validation you could use an ajax call to the server.
I need a way to validate the query built by the user.
Personally, I don't think it is a good idea to use regex here. It can be possible with the help of some extensions (see here, for example), but original Kleene expressions aren't fit for checking whether an unlimited number of parentheses is balanced. Even worse, an overly complex expression may result in significant time and memory spent, opening the door to denial-of-service attacks (if your service is public).
You can make use of a weak expression, though: one which is easy to write and match with and forbids most obvious mistakes. Some inputs will still be illegal, but you will discover that on parsing, as Menno van den Heuvel offered. Something like this should do:
^(?:[-]?\(*)?(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+)(?:\)*[+/*-]\(*(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+))*(?:\)*)$
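A quick hedged check of this pattern against the inputs from the question (the class and variable names are mine, for illustration only):

using System;
using System.Text.RegularExpressions;

class QueryCheck
{
    static void Main()
    {
        // The "weak" validation pattern from above.
        var re = new Regex(@"^(?:[-]?\(*)?(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+)(?:\)*[+/*-]\(*(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+))*(?:\)*)$");

        string[] samples =
        {
            "$TotalPresent$+56",                       // valid
            "100*($TotalRegistered$-$NumberPresent$)", // valid
            "$Total$+1a",                              // invalid
            "78iu+$NumberPresent$",                    // invalid
        };

        foreach (string q in samples)
            Console.WriteLine($"{q} -> {re.IsMatch(q)}");  // True, True, False, False
    }
}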
Hey guys, I managed to get what I needed (thanks to Anirudh: Validating a String using regex).
I am posting my answer as it may help future visitors.
string UserFedData = ttextBox1.Text.Trim();

// this is a regex to detect conflicting user-built queries
var troublePattern = new Regex(@"^(\(?\d+\)?|\(?[$][^$]+[$]\)?)([+*/-](\(?\d+\)?|\(?[$][^$]+[$]\)?))*$");
//var troublePattern = new Regex(@"^(?:[-]?\(*)?(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+)(?:\)*[+/*-]\(*(?:\$[A-Za-z][A-Za-z0-9_]*\$|\d+))*(?:\)*)$");

// readyToGo is the boolean that indicates if further processing of the data is safe or not
bool readyToGo = troublePattern.IsMatch(UserFedData);

Efficient string matching algorithm

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.
Here are my requirements:
Given a domain name, i.e. www.example.com, determine if it "matches" one in a list of entries.
Entries may be absolute matches, i.e. www.example.com.
Entries may include wildcards, i.e. *.example.com.
Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.
Language/environment: C# (.Net Framework 3.5)
I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.
I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.
My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?
If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.
Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.
The only differences between this and a regular spell-checking tree are:
The matching needs to be done in reverse
You have to take into account the wildcards
To address point #2, you would simply check for the "*" character at the end of a test.
A quick example:
Entries:
*.fark.com
www.cnn.com
Tree:
m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                \
                 -> n -> n -> c -> . -> w -> w -> w
Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.
Checking www.cern.com would fail on the second "n" of n,n,c,...
Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".
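Here is a hedged C# sketch of that reversed tree, covering both wildcard cases (a "*" reached mid-walk, and the bare domain that a "*." entry should also cover, per the requirements); the class and method names are illustrative:

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWildcard;   // a "*." entry ends here
    public bool IsTerminal;   // an exact entry ends here
}

static class DomainTrie
{
    public static void Insert(TrieNode root, string entry)
    {
        bool wildcard = entry.StartsWith("*.");
        string body = wildcard ? entry.Substring(1) : entry;   // keep the '.' of "*."
        var node = root;
        for (int i = body.Length - 1; i >= 0; i--)             // store characters reversed
        {
            if (!node.Children.TryGetValue(body[i], out var next))
                node.Children[body[i]] = next = new TrieNode();
            node = next;
        }
        if (wildcard) node.IsWildcard = true; else node.IsTerminal = true;
    }

    public static bool Matches(TrieNode root, string domain)
    {
        var node = root;
        for (int i = domain.Length - 1; i >= 0; i--)
        {
            if (node.IsWildcard) return true;                  // traversal reached a '*'
            if (!node.Children.TryGetValue(domain[i], out node)) return false;
        }
        // Exact entry, or a "*.example.com" entry matching bare "example.com".
        return node.IsTerminal
            || (node.Children.TryGetValue('.', out var dot) && dot.IsWildcard);
    }
}

With *.fark.com and www.cnn.com inserted, Matches returns true for www.blog.fark.com and fark.com, and false for dev.www.cnn.com and www.cern.com, mirroring the walks described above.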
I would use Regex; just make sure the expression is compiled once (instead of being recalculated again and again).
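For instance, a hedged illustration of "compiled once" (the pattern is a rough translation of a single wildcard entry, not the full solution):

using System.Text.RegularExpressions;

class RuleMatcher
{
    // Compiled once in a static field, reused for every lookup.
    static readonly Regex FarkRule = new Regex(
        @"^([^.]+\.)*fark\.com$",           // rough translation of "*.fark.com"
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static bool IsMatch(string domain) => FarkRule.IsMatch(domain);
}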
You don't need a regexp: just reverse all the strings, get rid of the '*', and put a flag to indicate that a partial match reaching this point passes.
Somehow, a trie or suffix trie looks most appropriate.
If the list of domains is known at compile time, you may look at tokenizing at '.' and using multiple gperf-generated machines.
Links:
google for trie
http://marknelson.us/1996/08/01/suffix-trees/
I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.
Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.
To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.
This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.
Additionally, I would have to bet that this approach would be faster than RegEx.
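A hedged sketch of the lookup over that Dictionary-based tree, splitting on '.' and walking backwards (the Node shape and flag names are illustrative):

using System;
using System.Collections.Generic;

class LabelNode
{
    public Dictionary<string, LabelNode> Children =
        new Dictionary<string, LabelNode>(StringComparer.OrdinalIgnoreCase);
    public bool IsWildcard;   // a "*." rule ends at this level
    public bool IsLeaf;       // an exact rule ends at this level
}

static class LabelTree
{
    public static bool Lookup(LabelNode root, string domain)
    {
        string[] labels = domain.Split('.');   // "www.example.com" -> ["www","example","com"]
        var node = root;
        for (int i = labels.Length - 1; i >= 0; i--)
        {
            if (node.IsWildcard) return true;  // "*.example.com" covers everything below it
            if (!node.Children.TryGetValue(labels[i], out node)) return false;
        }
        return node.IsLeaf || node.IsWildcard; // exact hit, or bare-domain wildcard hit
    }
}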
You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.
Here's an article on recursive descent parsers.
Assuming the rules are as you said: literal or start with a *.
Java:
public static boolean matches(String candidate, List<String> rules) {
    for (String rule : rules) {
        if (rule.startsWith("*.")) {
            // Strip "*."; the wildcard matches the bare domain and any subdomain,
            // but must not match e.g. "notexample.com".
            String suffix = rule.substring(2);
            if (candidate.equals(suffix) || candidate.endsWith("." + suffix)) {
                return true;
            }
        } else if (candidate.equals(rule)) {
            return true;
        }
    }
    return false;
}
This scales to the number of rules you have.
If you have a LOT (LOT LOT) of rules, I would sort the rules.
For non-wildcard matches, you iterate over each character to narrow the possible rules (i.e., if it starts with "w", then you work with the "w" rules, etc.).
If it IS a wildcard match, you do the exact same thing, but you work against a list of "backwards rules", and simply match from the end of the string against the end of the rule.
EDIT:
Just to be clear here. When I say "sort the rules", I really mean create a tree out of the rule characters. Then you use the match string to try to walk the tree (i.e., if I have a string of xyz, I start with the x character and see if it has a y branch, and then a z child). For the "wildcards" I'd use the same concept, but populate it "backwards" and walk it with the back of the match candidate.
I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.
I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.
Have a look at RegExLib
Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:
Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, iterated ad nauseam.
Given your requirements, I think you're on-track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix-matching ('*.example.com' becomes: "is 'moc.elpmaxe' the beginning of my input string?), which certainly doesn't require something as heavy-handed as regexps.
What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.
I guess I am tempted to answer your question with another one: what are you doing that you believe your bottleneck is some string matching above and beyond simple string-compare? Surely something else is listed higher up in your performance profiling?
I would use the obvious string-compare tests first; they'll be right 90% of the time, and if they fail, fall back to a regex.
If it was just matching strings, then you should look at trie datastructures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.
That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.
If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...
Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These allow you to search the string more quickly than linear time, at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.
One source of information for these is:
KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html
BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
