I have a bit of a background in formal languages, and recently I found out that Java and other languages use what are called extended regular expressions. Because of my background, I had always assumed that in languages like Java, calling compile for Pattern produced a DFA or transducer in the background. As a result, I had always assumed that no matter how ugly or how long my regex, Pattern.matches and similar methods would run in linear time. But that assumption seems to be incorrect.
A post I read suggests that some regexes do run in linear time, but I don't want to take a single person's word for it.
I'll eventually write my own Java formal regular expression library (the existing ones I've found only have GNU GPL licences), but in the meantime I have a few questions on the time complexity of Java/C# regexes. I want to make sure what I read elsewhere is correct.
Questions:
Given a formal regular language like \sRT\s., would the match method of a regex in Java/C# solve membership in linear or non-linear time?
In general, how can I tell whether a regex engine will solve the membership problem for a given regular language's expression in linear time?
I do text analysis, so finding out that Java regexes aren't DFAs was really a downer.
Because of my background, I had always assumed that in languages like Java, calling compile for Pattern produced a DFA or transducer in the background.
This belief is common among academics. In practice, regular expression compilation does not produce a DFA and then execute it. I have only a small amount of experience in this area; I worked briefly on the regular expression compilation system in Microsoft's implementation of JavaScript in the 1990s. We chose to compile the "regular" expression into a simple domain-specific bytecode language and then built an interpreter for that language.
As you note, this can lead to situations where repeated backtracking has exponentially bad time behavior in the length of the input, but the construction of the compiled state is essentially linear in the size of the expression.
So let me counter your questions with two questions of my own -- and I note that these are genuine questions, not rhetorical.
1) Every truly regular expression corresponds to an NDFA with, let's say, n states. The corresponding DFA might require a superposition of up to 2^n states, since each DFA state is a set of NDFA states. So what stops the time taken to construct the DFA from being exponential in pathological cases? (A classic witness family is sketched after question 2 below.) The runtime might be linear in the input, but if the construction is exponential in the pattern size then basically you're just trading one nonlinearity for another.
2) So-called "regular" expressions these days are nothing of the sort; they can do parenthesis matching. They correspond to pushdown automata, not nondeterministic finite automata. Is there a linear algorithm for constructing the corresponding pushdown automaton for a "regular" expression?
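On question 1, a classic witness family for the blowup (standard textbook material, added here for concreteness, not something from this discussion): the language of strings over \{a, b\} whose n-th symbol from the end is an a,

L_n = L\big((a \mid b)^* \, a \, (a \mid b)^{n-1}\big).

An NFA with n + 1 states recognizes L_n, but a DFA must remember the last n symbols it has read, so the minimal DFA for L_n has 2^n states. A pattern of size O(n) can therefore force exponential DFA construction time, which is exactly the trade-off described above.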
Regular expressions are implemented in many languages as NFAs to support backtracking (see http://msdn.microsoft.com/en-us/library/e347654k(v=vs.110).aspx). Because of backtracking, you can construct regular expressions which have terrible performance on some strings (see http://www.regular-expressions.info/catastrophic.html)
As far as analysing regular expressions to determine their performance, I doubt there is a very good way in general. You can look for warning flags, like compounded backtracking in the second link's example, but even that may be hard to detect properly in some cases.
Java's implementation of regular expressions uses the NFA approach.
Here is a link that explains it clearly.
Basically, a badly written but still correct regular expression can cause the engine to perform badly.
For example, given the expression (a+a+)+b and the string aaaaaaaaaaaaaaa.
It can take a while (depending on your machine, from several seconds to minutes) to figure out that there is no match.
The worst-case performance of an NFA engine is an almost-match situation like the example above, because the expression forces the engine to explore every path (with lots of backtracking) to determine a non-match.
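A minimal C# sketch of this behavior (the timeout value is arbitrary; Java's Pattern behaves similarly but has no built-in timeout):

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class CatastrophicDemo
{
    static void Main()
    {
        // Without a timeout this can appear to hang for larger inputs.
        var regex = new Regex("(a+a+)+b", RegexOptions.None, TimeSpan.FromSeconds(2));
        for (int n = 10; n <= 40; n += 5)
        {
            string input = new string('a', n); // no 'b': forces full backtracking
            var sw = Stopwatch.StartNew();
            try
            {
                regex.IsMatch(input);
                Console.WriteLine($"n={n}: no match in {sw.ElapsedMilliseconds} ms");
            }
            catch (RegexMatchTimeoutException)
            {
                // Time roughly doubles per extra 'a' until the timeout trips.
                Console.WriteLine($"n={n}: timed out");
            }
        }
    }
}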
Related
Every time I write a simple lexer and parser, I stumble upon the same question: how should the lexer and the parser communicate? I see four different approaches:
The lexer eagerly converts the entire input string into a vector of tokens. Once this is done, the vector is fed to the parser which converts it into a tree. This is by far the simplest solution to implement, but since all tokens are stored in memory, it wastes a lot of space.
Each time the lexer finds a token, it invokes a function on the parser, passing the current token. In my experience, this only works if the parser can naturally be implemented as a state machine, as LALR parsers can. By contrast, I don't think it would work at all for recursive descent parsers.
Each time the parser needs a token, it asks the lexer for the next one. This is very easy to implement in C# thanks to the yield keyword (a minimal sketch appears after this list), but quite hard in C++, which doesn't have it.
The lexer and parser communicate through an asynchronous queue. This is commonly known as "producer/consumer", and it should simplify the communication between the lexer and the parser a lot. Does it also outperform the other solutions on multicores? Or is lexing too trivial?
Is my analysis sound? Are there other approaches I haven't thought of? What is used in real-world compilers? It would be really cool if compiler writers like Eric Lippert could shed some light on this issue.
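As a minimal illustration of option #3 in C# (the token type and lexing rules are invented for this sketch):

using System;
using System.Collections.Generic;

class PullLexerDemo
{
    // Hypothetical token type, for illustration only.
    public record Token(string Kind, string Text);

    // The parser pulls tokens one at a time; nothing is buffered in memory.
    public static IEnumerable<Token> Tokenize(string input)
    {
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }
            if (char.IsDigit(input[i]))
            {
                int start = i;
                while (i < input.Length && char.IsDigit(input[i])) i++;
                yield return new Token("number", input[start..i]);
            }
            else
            {
                yield return new Token("symbol", input[i].ToString());
                i++;
            }
        }
    }

    static void Main()
    {
        // A recursive descent parser would hold the IEnumerator<Token>
        // and call MoveNext() whenever it needs the next token.
        foreach (var t in Tokenize("12 + 345"))
            Console.WriteLine($"{t.Kind}: {t.Text}");
    }
}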
While I wouldn't classify much of the above as incorrect, I do believe several items are misleading.
Lexing an entire input before running a parser has many advantages over other options. Implementations vary, but in general the memory required for this operation is not a problem, especially when you consider the type of information that you'd like to have available for reporting compilation errors.
Benefits
Potentially more information available during error reporting.
Languages written in a way that allows lexing to occur before parsing are easier to specify and write compilers for.
Drawbacks
Some languages require context-sensitive lexers that simply cannot operate before the parsing phase.
Language implementation note: This is my preferred strategy, as it results in separable code and is best suited for translation to implementing an IDE for the language.
Parser implementation note: I experimented with ANTLR v3 regarding memory overhead with this strategy. The C target uses over 130 bytes per token, and the Java target uses around 44 bytes per token. With a modified C# target, I showed it's possible to fully represent the tokenized input with only 8 bytes per token, making this strategy practical for even quite large source files.
Language design note: I encourage people designing a new language to do so in a way that allows this parsing strategy, whether or not they end up choosing it for their reference compiler.
It appears you've described a "push" version of what I generally see described as a "pull" parser, like you have in #3. My work emphasis has always been on LL parsing, so this wasn't really an option for me. I would be surprised if there are benefits to it over #3, but cannot rule them out.
The most misleading part of this is the statement about C++. Proper use of iterators in C++ makes it exceptionally well suited to this type of behavior.
A queue seems like a rehash of #3 with a middleman. While abstracting independent operations has many advantages in areas like modular software development, a lexer/parser pair for a distributable product offering is highly performance-sensitive, and this type of abstraction removes the ability to do certain types of optimization regarding data structure and memory layout. I would encourage the use of option #3 over this.
As an additional note on multi-core parsing: the initial lexer/parser phases of compilation for a single compilation unit generally cannot be parallelized, nor do they need to be, considering how easy it is to simply run parallel compilation tasks on different compilation units (e.g. one lexer/parser operation per source file, parallelizing across source files but using only a single thread for any given file).
Regarding other options: For a compiler intended for widespread use (commercial or otherwise), generally implementers choose a parsing strategy and implementation which provides the best performance under the constraints of the target language. Some languages (e.g. Go) can be parsed exceptionally quickly with a simple LR parsing strategy, and using a "more powerful" parsing strategy (read: unnecessary features) would only serve to slow things down. Other languages (e.g. C++) are extremely challenging or impossible to parse with typical algorithms, so slower but more powerful/flexible parsers are employed.
I think there is no golden rule here. Requirements vary from one case to another, so reasonable solutions can differ as well. Let me comment on your options from my own experience.
"Vector of tokens". This solution may have a big memory footprint. Imagine compiling a source file with a lot of headers. Storing the token itself is not enough: error messages should contain context, such as the file name and the line number. It may also happen that the lexer depends on the parser. A reasonable example: ">>" - is this a shift operator, or the closing of two levels of template instantiation? I would not recommend this option.
(2,3). "One part calls another". My impression is that the more complex system should call the less complex one. I consider the lexer to be simpler, which means the parser should call the lexer. I do not see why C# is better than C++ here. I implemented a C/C++ lexer as a subroutine (in reality it is a complex class) that is called from a grammar-based parser. There were no problems in this implementation.
"Communicating processes". This seems like overkill to me. There is nothing wrong with this approach, but maybe it is better to keep things simple? As for the multicore aspect: compiling a single file is a relatively rare case. I would recommend loading each core with its own file.
I do not see other reasonable options for combining a lexer and parser together.
I wrote these notes thinking about compiling the sources of a software project. Parsing a short query request is a completely different thing, and the reasoning can differ significantly.
My answer is based on my own experience. Other people may see this differently.
The lexer-parser relationship is simpler than the most general case of coroutines, because in general the communication is one-way; the parser does not need to send information back to the lexer. This is why the method of eager generation works (with some penalty, although it does mean that you can discard the input earlier).
As you've observed, if either the lexer or the parser can be written in a reinvocable style then the other can be treated as a simple subroutine. This can always be implemented as a source code transformation, with local variables translated to object slots.
Although C++ doesn't have language support for coroutines, it is possible to make use of library support, in particular fibers. The Unix setcontext family is one option; another is to use multithreading but with a synchronous queue (essentially single-threading but switching between two threads of control).
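For what it's worth, a minimal C# sketch of the synchronous-queue arrangement, with tokens simplified to strings (BlockingCollection stands in for the queue; this is illustrative only):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class QueueDemo
{
    static void Main()
    {
        // Small capacity keeps the lexer from racing far ahead of the parser.
        var queue = new BlockingCollection<string>(boundedCapacity: 16);

        var lexer = Task.Run(() =>
        {
            foreach (var token in "a b c".Split(' '))
                queue.Add(token);      // blocks while the queue is full
            queue.CompleteAdding();    // signals end of input
        });

        // The "parser": blocks while the queue is empty,
        // and finishes once adding is complete.
        foreach (var token in queue.GetConsumingEnumerable())
            Console.WriteLine($"parser got: {token}");

        lexer.Wait();
    }
}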
Also consider that with #1 you lex tokens that may never be needed, for example when there is an error; in addition, you may run low on memory or I/O bandwidth. I believe the best solution is the one employed by parsers generated by tools like Bison, where the parser calls the lexer to get the next token. This minimizes space and memory-bandwidth requirements.
#4 is just not going to be worth it. Lexing and parsing are inherently synchronous; there's just not enough processing going on to justify the costs of communication. Besides, typically you parse/lex multiple files concurrently, which can already max out all your cores at once.
The way I handle it in my toy build-system project in progress is by having a "file reader" class with a function bool next_token(std::string&, const std::set<char>&). This class contains one line of input (for error-reporting purposes, with the line number). The function accepts a std::string reference to put the token in, and a std::set<char> which contains the "token-ending" characters. My input class is both parser and lexer, but you could easily split it up if you need more fanciness. So the parsing functions just call next_token and can do their thing, including very detailed error output.
If you need to keep the verbatim input, you'll need to store each line that's read in a vector<string> or something, but not store each token separately and/or in duplicate.
The code I'm talking about is located here:
https://github.com/rubenvb/Ambrosia/blob/master/libAmbrosia/Source/nectar_loader.cpp
(search for ::next_token; the extract_nectar function is where it all begins)
This question is about performance within PHP, but you may broaden it to any language if you wish.
After many years of using PHP and having to compare strings I've learned that using string comparison operators over regular expressions is beneficial when it comes to performance.
I fully understand that some operations have to be done with regular expressions due to their complexity, but this question is about operations that can be solved with either regex or string functions.
Take this example:
PHP
preg_match('/^[a-z]*$/','thisisallalpha');
C#
new Regex("^[a-z]*$").IsMatch("thisisallalpha");
can easily be done with
PHP
ctype_alpha('thisisallalpha');
C#
VFPToolkit.Strings.IsAlpha("thisisallalpha");
There are many other examples but you should get the point I'm trying to make.
Which style of string comparison should you lean towards, and why?
It looks like this question arose from our small argument here, so I feel somewhat obliged to respond.
PHP developers are being actively brainwashed about "performance", which gives rise to many rumors and myths, including outright silly claims like "double quotes are slower". Regexps being "slow" is one of these myths, unfortunately supported by the manual (see the infamous comment on the preg_match page). The truth is that in most cases you don't care. Unless your code is repeated 10,000 times, you won't even notice a difference between a string function and a regular expression. And if your code is repeated 10,000 times, you must be doing something wrong in any case, and you will gain performance by optimizing your logic, not by stripping out regular expressions.
As for readability, regexps are admittedly hard to read; however, the code that uses them is in most cases shorter, cleaner, and simpler (compare your answer and mine at the link above).
Another important concern is flexibility, especially in PHP, whose string library doesn't support Unicode out of the box. In your concrete example, what happens when you decide to migrate your site to UTF-8? With ctype_alpha you're pretty much out of luck, while preg_match would merely require another pattern and would keep working.
So: regexes are not slower, are often more readable, and are more flexible. Why on earth should we avoid them?
Regular expressions can actually lead to a performance gain (not that such micro-optimizations are in any way sensible) when they replace multiple atomic string comparisons. Typically, at around five strpos() checks it becomes advisable to use a regular expression instead, even more so for readability.
And here's another thought to round things up: PCRE can handle conditionals faster than the Zend kernel can handle IF bytecode.
Not all regular expressions are created equal, though. If the complexity gets too high, regex recursion can kill the performance advantage. Therefore it's often worth considering a mix of regex matching and regular PHP string functions. Right tool for the job, and all that.
PHP itself recommends using string functions over regex functions when the match is straightforward. For example, from the preg_match manual page:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
Or from the str_replace manual page:
If you don't need fancy replacing rules (like regular expressions), you should always use this function instead of ereg_replace() or preg_replace().
However, I find that people try to use the string functions to solve problems that would be better solved by regex. For instance, when trying to create a full-word string matcher, I have encountered people trying to use strpos($string, " $word ") (note the spaces) for the sake of "performance", without stopping to think about how spaces aren't the only way to delimit a word (think about how many string-function calls would be needed to fully replace preg_match('/\bword\b/', $string)).
My personal stance is to use string functions for matching static strings (ie. a match of a distinct sequence of characters where the match is always the same) and regular expressions for everything else.
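To make the full-word point concrete in the C# flavor of this question's examples (inputs invented):

using System;
using System.Text.RegularExpressions;

class WholeWordDemo
{
    static void Main()
    {
        string[] inputs = { "a word here", "word, with punctuation", "sword fight" };
        foreach (var s in inputs)
        {
            bool naive = s.Contains(" word ");            // misses string edges and punctuation
            bool regex = Regex.IsMatch(s, @"\bword\b");   // \b covers all word boundaries
            Console.WriteLine($"{s,-25} naive={naive} regex={regex}");
        }
    }
}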
Agreed that PHP people tend to over-emphasise the performance of one function over another. That doesn't mean the performance differences don't exist -- they definitely do -- but most PHP code (and indeed most code in general) has much worse bottlenecks than the choice of regex over string comparison. To find out where your bottlenecks are, use xdebug's profiler. Fix the issues it turns up before worrying about fine-tuning individual lines of code.
They're both part of the language for a reason. IsAlpha is more expressive: when the thing you're testing is inherently "alpha or not", and that has domain meaning, use it.
But if it is, say, input validation that could possibly be changed to include underscores, dashes, etc., or if it sits alongside other logic that requires regex, then I would use regex. This tends to be the majority of cases for me.
If you had to explain Lambda expressions to a 5th grader (10/11 years old), how would you do it? And what examples might you give, or resources might you point them to? I may be finding myself in the position of having to teach this to 5th grade level developers and could use some assistance.
[EDIT]: The "5th Grader" reference was meant to relate to an American TV show which pits adults vs. 5th graders in a quiz type setting (I think). I meant to imply that the people who need to be taught this know nothing about Lambda's and I need to find a way to make things VERY simple. I'm sorry that I forgot this forum has a worldwide audience.
Thanks very much.
Just call it a function without a name. If she has not been exposed to much programming, her mind will not yet have calcified into thinking that all functions must have names.
Most of the complexity related to lambda expressions is caused by complicated naming and putting them on a marble pedestal.
Lots of kids create great websites with lots of JavaScript. Chances are they are using lambda expressions all the time without knowing it. They just call it a "cool trick".
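As a concrete C# illustration of "a function without a name" (just a sketch):

using System;

class UnnamedFunctionDemo
{
    // A named function:
    static int Square(int x) { return x * x; }

    static void Main()
    {
        // The same thing as a lambda: a function without a name of its own.
        Func<int, int> square = x => x * x;
        Console.WriteLine(Square(5)); // 25
        Console.WriteLine(square(5)); // 25
    }
}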
I don't think you need to explain lambda expressions to kids aged 10-11. Just show them what lambda expressions look like, what you can do with them, and where you can use them. Kids at that age still have the capability to learn something new without relying on analogies to understand it.
Lambda expressions are what I consider higher-order programming. A rigorous explanation would require extensive prerequisite learning; certainly, that is not practical at the 5th-grade level.
However, it might help to just cover some concepts by example in a way that mirrors real life physical situations.
For instance, a scale is a sort of a lambda expression. It tallies the mass of the objects placed on it. It is not a variable because it does not store the number anywhere. Instead, it generates the number at the time of use. When used again, it recalculates based on its inputs. You can take it places and use it somewhere else, but the underlying mechanics (expression) is the same.
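A rough, purely illustrative translation of the scale analogy into C#:

using System;
using System.Collections.Generic;
using System.Linq;

class ScaleDemo
{
    static void Main()
    {
        var masses = new List<double> { 1.5, 2.0 };

        // The "scale": it stores no total; it recomputes from its inputs
        // every time it is used.
        Func<double> reading = () => masses.Sum();

        Console.WriteLine(reading()); // 3.5
        masses.Add(4.0);              // place another object on the scale
        Console.WriteLine(reading()); // 7.5, recalculated at time of use
    }
}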
If he already understands what a "function" is, you can say that it is a function that you need only once, and therefore it doesn't need a name.
Anyway, if you need to explain functional programming, I would recommend trying to steal some ideas from http://learnyouahaskell.com/ - it's one of the best explanations of the ideas behind functional programming I've ever read.
Lambda expressions in C# are basically just anonymous delegates, so when explaining them to ANYONE, they need to understand, in this order (a sketch follows the list):
What a delegate is and what delegates are used for.
What an anonymous delegate is and how it is just a shorthand way of creating a delegate.
Lambda expression syntax, and how it is really just an even more shorthand way of creating an anonymous delegate.
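A sketch of that progression in C# (type and method names invented for illustration):

using System;

class DelegateProgression
{
    // 1. A delegate type: a type for "something callable like this".
    delegate int IntOp(int x);

    static int Double(int x) { return x * 2; }

    static void Main()
    {
        // A delegate pointing at a named method.
        IntOp a = Double;

        // 2. An anonymous delegate: same thing, no named method required.
        IntOp b = delegate (int x) { return x * 2; };

        // 3. A lambda expression: even shorter notation for the same idea.
        IntOp c = x => x * 2;

        Console.WriteLine($"{a(21)} {b(21)} {c(21)}"); // 42 42 42
    }
}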
Probably not the best thing to start explaining to a fifth grader, given that a cutting-edge OO language (C#) didn't get them for ten years.
It's pretty hard to come up with an analogy for a function with no name... Perhaps it's significant because it's a less verbose way to specify a callback?
I'm assuming you're actually looking for a basic intro for programmers, not actual 5th graders. (For 5th graders, Python or JavaScript tend to be best). Anyways, two good introductions to C# lambda expressions:
Explaining C# lambda statements in one sentence (and LINQ in about 10)
LINQ in Action
The first (disclaimer - my blog) will give you a quick explanation of the fundamental concepts. The book provides complete coverage of all relevant topics.
What exactly is Expression<> used for in C#? Are there any scenarios where you would instantiate an Expression<> yourself as an object? If so, please give an example!
Thank you!
Expression<T> is almost entirely used for LINQ, but it doesn't have to be. Within LINQ, it's usually used to "capture" the logic expressed in code, but keep it in data. That data can then be examined by the LINQ provider and handled appropriately - e.g. by converting it into SQL. Usually the expression trees in LINQ are created by the compiler from lambda expressions or query expressions - but in other cases it can be handy to use the API directly yourself.
A few examples of other places I've used it and seen it used:
In MiscUtil, Marc Gravell used it to implement "generic arithmetic" - if a type has the relevant operator, it can be used generically.
In UnconstrainedMelody I used it in a similar way to perform operations on flags enums, regardless of their underlying type (which is trickier than you might expect, due to long and ulong having different ranges)
In Visual LINQ I used query expressions to "animate" LINQ, so you can see what's going on. While obviously this is a LINQ usage, it's not the traditional form of translating logic into another form.
In terms of LINQ, there are things you can do to create more versatile LINQ queries at runtime than you can purely in lambdas.
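To make the "logic captured as data" idea concrete, here is a minimal sketch of the API (names are illustrative; this is only a sketch, not anyone's production code):

using System;
using System.Linq.Expressions;

class ExpressionAsData
{
    static void Main()
    {
        // The compiler builds an expression tree from this lambda
        // instead of compiling it straight to IL.
        Expression<Func<int, bool>> isPositive = x => x > 0;

        // The logic can be examined as data...
        Console.WriteLine(isPositive);                // x => (x > 0)
        Console.WriteLine(isPositive.Body);           // (x > 0)
        Console.WriteLine(isPositive.Body.NodeType);  // GreaterThan

        // ...or compiled and executed like any delegate.
        Func<int, bool> compiled = isPositive.Compile();
        Console.WriteLine(compiled(5));               // True
    }
}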
I've used Expression many times as a micro-compiler, as an alternative to DynamicMethod and IL. This approach gets stronger in .NET 4.0 (as discussed on InfoQ), but even in 3.5 there are lots of things you can do (generally based on runtime data; configuration etc):
generic operators
object cloning
complex initialization
object comparison
I also used it as part of a maths engine for some work I did with Microsoft - i.e. parse a math expression ("(x + 12) * y = z" etc) into an Expression tree, compile it and run it.
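To illustrate the math-engine idea, a hand-rolled tree for (x + 12) * y might look like the following sketch of the raw Expression API (not the actual parser from that project):

using System;
using System.Linq.Expressions;

class MicroCompilerDemo
{
    static void Main()
    {
        // Build the tree for (x + 12) * y by hand.
        var x = Expression.Parameter(typeof(double), "x");
        var y = Expression.Parameter(typeof(double), "y");
        var body = Expression.Multiply(
            Expression.Add(x, Expression.Constant(12.0)),
            y);

        // Compile once, then call it like any delegate.
        var f = Expression.Lambda<Func<double, double, double>>(body, x, y)
                          .Compile();

        Console.WriteLine(f(3, 2)); // (3 + 12) * 2 = 30
    }
}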
Another interesting use (illustrated by Jason Bock, here) is in genetic programming: build your candidates as expression trees, and you have the necessary code to execute them quickly (after Compile()), but, importantly for genetic programming, also to swap fragments around.
Take a look at my before & after code in my answer to another SO question.
Summary: Expression<> greatly simplified the code, made it easier to understand, and even fixed a phantom bug.
It is clear that there are lots of problems that look like a simple regex will solve them, but which prove to be very hard to solve with regex.
So how does someone who is not a regex expert know whether he/she should be learning regex to solve a given problem?
(See "Regex to parse C# source code to find all strings" for why I am asking this question.)
This seems to sum it up well:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems...
(I have just changed the title of the question to make it more specific, as some of the problems with regex in C# are solved in Perl and JScript; for example, the two levels of quoting make a C# regex very unreadable.)
Don't try to use regex to parse hierarchical text like program source (or nested XML): regexes are provably not powerful enough for that. For example, they can't determine, for a string of parens, whether the parens are balanced.
Use parser generators (or similar technologies) for that.
Also, I'd not recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you want, and you'll end up with either an inaccurate regex or a very long one.
There are two aspects to consider:
Capability: is the language you are trying to recognize a Type-3 language (a regular one)? If so, then you might use regex; if not, you need a more powerful tool.
Maintainability: if it takes more time to write, test, and understand a regular expression than its programmatic counterpart, then it's not appropriate. Checking this is complicated; I'd recommend peer review with your colleagues (if they say "what the ..." when they see it, then it's too complicated), or just leave it undocumented for a few days, then take a look at it yourself and measure how long it takes to understand.
I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex; you'll realise that many, many problems you've solved differently could (and maybe should) have been solved using regex.
For a particular problem, try to find a solution at a site like regexlib, and see if you can understand the solution.
As indicated above, regex might not be sufficient to solve a specific problem, but browsing a site like regexlib will certainly tell you whether regex is the right solution to your problem.
You should always learn regular expressions; only this way can you judge when to use them. Normally they become problematic when you need very good performance. But often it is a lot easier to use a regex than to write a big switch statement.
Have a look at this question, which shows the elegance of a regex in contrast to a similar if() construct...
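In the same spirit, here is an invented C# illustration of the contrast (the format rule, two uppercase letters, a dash, four digits, is made up):

using System;
using System.Text.RegularExpressions;

class RegexVsIf
{
    static void Main()
    {
        string s = "AB-1234";

        // Chained checks, spelled out by hand.
        bool byHand = s.Length == 7
                   && char.IsUpper(s[0]) && char.IsUpper(s[1])
                   && s[2] == '-'
                   && char.IsDigit(s[3]) && char.IsDigit(s[4])
                   && char.IsDigit(s[5]) && char.IsDigit(s[6]);

        // The same rule as a single pattern.
        bool byRegex = Regex.IsMatch(s, @"^[A-Z]{2}-\d{4}$");

        Console.WriteLine($"{byHand} {byRegex}"); // True True
    }
}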
Use regular expressions for recognizing (regular) patterns in text. Don't use it for parsing text into data structures. Don't use regular expressions when the expression becomes very large.
Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for proper email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression for the initial search for email address candidates, but you need a parser to actually verify that an address candidate conforms to the standard.
At the very least, I'd say learn regular expressions just so that you understand them fully and are able to apply them in situations where they work. Off the top of my head, I'd use regular expressions for the following (each is sketched in code after the list):
Identifying parts of a string.
Checking whether a string conforms to a certain format or construction.
Finding substrings that match a certain pattern.
Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.).
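A quick C# sketch of those four uses (patterns and inputs invented for illustration):

using System;
using System.Text.RegularExpressions;

class FourUses
{
    static void Main()
    {
        // 1. Identify parts of a string (capture groups).
        var m = Regex.Match("2024-05-01", @"^(\d{4})-(\d{2})-(\d{2})$");
        Console.WriteLine($"year={m.Groups[1].Value}");

        // 2. Check whether a string conforms to a format.
        Console.WriteLine(Regex.IsMatch("abc123", @"^[a-z]+\d+$")); // True

        // 3. Find all substrings matching a pattern.
        foreach (Match n in Regex.Matches("a1 b22 c333", @"\d+"))
            Console.WriteLine(n.Value); // 1, 22, 333

        // 4. Transform strings that fit a pattern (search-replace, capitalization).
        Console.WriteLine(Regex.Replace("hello world", @"\b\w", c => c.Value.ToUpper()));
        // "Hello World"
    }
}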
At a theoretical level, regular expressions rest on the foundations of state machines: in computer science you have Deterministic Finite Automata (DFAs) and Non-deterministic Finite Automata (NFAs). You can use regular expressions to enforce some kind of validation on inputs; regular expression engines simply interpret or convert regular expression patterns into actual runtime operations.
Once you know whether the string (or data) you want to validate can be tested by a DFA, you have a choice of whether to implement that DFA yourself in your own code or to use a regular expression engine. You'll find that knowing about regular expressions enhances your toolbox and your understanding of how complex string processing can actually get.
From simple regular expressions you can then move on to learning about parsers and how they work. At the lowest level you have lexical analysis (where regular expressions do the work) and at a higher level a grammar and semantic actions. These are the bases on which compilers, interpreters, protocol parsers, and document rendering/transformation applications are built.
The main concern here is maintainability.
It is obvious to me that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation are, only probably worse. So this is out of the question.
On the other hand, one should consider that maintaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at them. So, in my opinion, the correct approach is to use them only when it is inevitable and when the regex-using code will be more readable than its non-regex variant. And, of course, as has already been pointed out, do not use them for things they are not meant to do (like XML). And no email address validation either (one of my pet peeves :P)!
But seriously, doesn't it feel wrong to use all those substrs for something that can be solved with a handful of characters that look like line noise? I know it did for me.