'Regular Expressions' vs. 'String Comparison Operators / Functions' - C#

This question centers on performance within PHP, but feel free to broaden it to any language.
After many years of using PHP and comparing strings, I've learned that favoring string comparison operators and functions over regular expressions is beneficial for performance.
I fully understand that some operations have to be done with regular expressions due to their complexity, but I'm asking about operations that can be resolved with either regex or string functions.
Take this example:
PHP
preg_match('/^[a-z]*$/','thisisallalpha');
C#
new Regex("^[a-z]*$").IsMatch("thisisallalpha");
can easily be done with
PHP
ctype_alpha('thisisallalpha');
C#
VFPToolkit.Strings.IsAlpha("thisisallalpha");
There are many other examples, but you should get the point I'm trying to make.
Which style of string comparison should you lean towards, and why?

It looks like this question arose from our small argument here, so I feel somewhat obliged to respond.
PHP developers are actively brainwashed about "performance", which gives rise to many rumors and myths, including outright silly claims like "double quotes are slower". Regexps being "slow" is one of these myths, unfortunately supported by the manual (see the infamous comment on the preg_match page). The truth is that in most cases you don't care. Unless your code is repeated 10,000 times, you won't even notice a difference between a string function and a regular expression. And if your code does repeat 10,000 times, you must be doing something wrong in any case, and you will gain performance by optimizing your logic, not by stripping out regular expressions.
As for readability: regexps are admittedly hard to read; however, the code that uses them is in most cases shorter, cleaner and simpler (compare your answer and mine at the link above).
Another important concern is flexibility, especially in PHP, whose string library doesn't support Unicode out of the box. In your concrete example, what happens when you decide to migrate your site to UTF-8? With ctype_alpha you're more or less out of luck, while preg_match would only require another pattern and keep working.
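To make the flexibility point concrete in the question's C# variant, here's a minimal sketch (the accented test string and class name are my own): a character class can be widened to Unicode letters with \p{L}, while an ASCII-only helper has no such escape hatch.

using System;
using System.Text.RegularExpressions;

class AlphaDemo
{
    static void Main()
    {
        // ASCII-only class: rejects anything outside a-z.
        Console.WriteLine(Regex.IsMatch("déjàvu", "^[a-z]*$"));   // False

        // Unicode-aware class: \p{L} matches a letter in any script,
        // so the same call keeps working after going international.
        Console.WriteLine(Regex.IsMatch("déjàvu", @"^\p{L}*$"));  // True
    }
}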
So: regexes are not slower, the code that uses them is often more readable, and they are more flexible. Why on earth should we avoid them?

Regular expressions can actually lead to a performance gain (not that such micro-optimizations are in any way sensible) when they replace multiple atomic string comparisons. Typically, at around five strpos() checks it becomes advisable to use a regular expression instead; more so for readability.
And here's another thought to round things off: PCRE can handle conditionals faster than the Zend kernel can handle IF bytecode.
Not all regular expressions are created equal, though. If the complexity gets too high, regex recursion can kill the performance advantage. Therefore it's often worth considering a mix of regex matching and regular PHP string functions. Right tool for the job, and all that.
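As a rough C# sketch of that trade-off (the keyword list and names are invented for illustration): several independent substring checks versus one precompiled alternation.

using System;
using System.Text.RegularExpressions;

class KeywordScan
{
    // Several separate substring checks: each one rescans the input.
    static bool HasKeywordNaive(string s) =>
        s.Contains("error") || s.Contains("fail") ||
        s.Contains("fatal") || s.Contains("panic") ||
        s.Contains("abort");

    // One precompiled alternation: a single left-to-right pass.
    static readonly Regex KeywordRe =
        new Regex("error|fail|fatal|panic|abort", RegexOptions.Compiled);

    static bool HasKeywordRegex(string s) => KeywordRe.IsMatch(s);

    static void Main()
    {
        Console.WriteLine(HasKeywordNaive("job aborted at step 3")); // True
        Console.WriteLine(HasKeywordRegex("job aborted at step 3")); // True
    }
}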

PHP itself recommends using string functions over regex functions when the match is straightforward. For example, from the preg_match manual page:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
Or from the str_replace manual page:
If you don't need fancy replacing rules (like regular expressions), you should always use this function instead of ereg_replace() or preg_replace().
However, I find that people try to use string functions to solve problems that would be better solved by regex. For instance, when trying to build a full-word matcher, I have seen people use strpos($string, " $word ") (note the spaces) for the sake of "performance", without stopping to think that spaces aren't the only way to delimit a word (consider how many string-function calls would be needed to fully replace preg_match('/\bword\b/', $string)).
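For the C# side of the question, the same full-word match is a single call. A minimal sketch (the sample sentences are mine):

using System;
using System.Text.RegularExpressions;

class WordBoundaryDemo
{
    static void Main()
    {
        // \b matches at word boundaries: transitions between word and
        // non-word characters, so punctuation-delimited words count too.
        Console.WriteLine(Regex.IsMatch("Say the word.", @"\bword\b"));   // True

        // No boundary inside "Passwords", so no false positive.
        Console.WriteLine(Regex.IsMatch("Passwords leak.", @"\bword\b")); // False
    }
}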
My personal stance is to use string functions for matching static strings (i.e. a match of a distinct sequence of characters where the match is always the same) and regular expressions for everything else.

Agreed that PHP people tend to over-emphasise the performance of one function over another. That doesn't mean the performance differences don't exist -- they definitely do -- but most PHP code (and indeed most code in general) has much worse bottlenecks than the choice of regex over string comparison. To find out where your bottlenecks are, use Xdebug's profiler. Fix the issues it turns up before worrying about fine-tuning individual lines of code.

They're both part of the language for a reason. IsAlpha is more expressive: when the value you're checking is inherently alphabetic or not, and that has domain meaning, use it.
But if it is, say, input validation that could plausibly be extended to allow underscores, dashes, etc., or if it sits alongside other logic that requires regex, then I would use regex. For me that tends to be the majority of cases.

Related

Uri.IsWellFormedUriString vs. Regular Expression for Validating a URL

Which is the better way to validate a URL in C#: using Uri.IsWellFormedUriString, or a self-created regular expression that is robust but can sometimes miss patterns?
Uri.IsWellFormedUriString is the complete method for that task - use it.
If you need more, like some refined matching, then attach an && MyPattern(uri) check afterwards.
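A minimal sketch of that combination (the https-only rule stands in for whatever "refined matching" you need, and IsAcceptableUrl is an invented name):

using System;

class UrlCheck
{
    // Built-in well-formedness check first, then any extra house rule.
    static bool IsAcceptableUrl(string candidate) =>
        Uri.IsWellFormedUriString(candidate, UriKind.Absolute)
        && candidate.StartsWith("https://", StringComparison.OrdinalIgnoreCase);

    static void Main()
    {
        Console.WriteLine(IsAcceptableUrl("https://example.com/x")); // True
        Console.WriteLine(IsAcceptableUrl("notaurl"));               // False
    }
}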
It is a general rule to use built-in solutions to problems, because the people who implemented that method knew at least as much about the problem domain as you do, and usually more, being specialists. It is therefore highly unlikely that they have missed a case you would cover better.

C# HTMLAgilityPack VS regular expressions for extracting links from HTML

I'm writing a C# web crawler, and when I run the profiler I can see that HTMLAgilityPack's LoadHTML method accounts for 10% of the program's overall CPU usage. I'd like to try to lower this.
I'm sure a regular expression would be faster, but as I look at link-extraction examples on SO I see everyone saying this method should be avoided in favour of an HTML parser like HTMLAgilityPack.
As all I need to do is extract links from HTML, is using HTMLAgilityPack overkill?
Are the reasons for favouring an HTML parser applicable to my case, given that I'm only using it for extracting links?
I downloaded the HTML with WebClient and then compared the two approaches.
Using href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+)) (then trimming and adding to a list) is way faster than HTMLAgilityPack.
43 milliseconds (HTMLAgilityPack) compared to 3 (regex), consistently.
See my code on pastebin
Are the reasons for favouring an HTML parser applicable to my case, given that I'm only using it for extracting links?
In your case the HTML parser is overkill as your tests have shown.
People who answer on SO trot that out as a rote answer to all regex questions. One should reach for an HTML parser only when one actually needs to process the structure of the HTML in a more robust fashion.
Bias against regular expressions comes from people who feel they are too slow or too cumbersome [to learn]. There is some merit in that position for certain operations, in that purpose-built text-searching utilities do perform better. Sure, I agree; but dismissing regex out of hand is, well, par for the course on StackOverflow.
Why is that? Sometimes the analysis is simply flawed because the pattern provided introduces a lot of unnecessary backtracking and is not optimized. That handicaps the regex out of the gate. One does have to learn the regex language and understand what it is doing in order to tune a pattern rather than hobble it.
For example, I took your same C# test code, but used an optimized version of your pattern and my own, and was able to get it down to 1 millisecond consistently!
Most people learn basic pattern matching by doing searches with a *. When they first learn regex, they combine * with ., as in .*. That step, along with indiscriminate use of *, will most likely doom any pattern beyond the beginner level to the hell of backtracking and slow responses.
Unless you know empirically that zero occurrences are actually valid, use + instead.
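A sketch along those lines (my own variant of the question's pattern, not the answerer's actual 1 ms version): compile the pattern once, prefer + where an empty match is useless, and keep each alternative narrow instead of reaching for .*.

using System;
using System.Text.RegularExpressions;

class LinkExtractor
{
    // Compiled once and reused; quantifiers are + where an empty match
    // would be useless, and each branch excludes its own delimiter.
    static readonly Regex HrefRe = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s>]+))",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    static void Main()
    {
        string html = "<a href=\"https://example.com\">x</a> <a href=other.html>y</a>";
        foreach (Match m in HrefRe.Matches(html))
            Console.WriteLine(m.Groups["url"].Value);
        // https://example.com
        // other.html
    }
}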
Back in 2009 I wrote about this subject on my blog Are C# .Net Regular Expressions Fast Enough for You?

Algorithmic complexity of Regular Languages in Extended Regular Language frameworks

I have a bit of a background in formal languages, and recently I found out that Java and other languages use what are termed extended regular expressions. Because of my background, I had always assumed that in languages like Java, calling compile on a Pattern produced a DFA or transducer in the background. As a result, I had always assumed that no matter how ugly or how long my regex, Pattern.matches and similar methods would run in linear time. But that assumption seems to be incorrect.
A post I read suggests that some regex implementations do run in linear time, but I don't entirely believe or trust a single person's word.
I'll eventually write my own Java formal regular expression library (the existing ones I've found only have GNU GPL licences), but in the meantime I have a few questions on the time complexity of Java/C# regexes. I want to ensure that what I read elsewhere is correct.
Questions:
For a formal language like \sRT\s., would the match method of a regex in Java/C# solve membership in linear or non-linear time?
In general, how would I know whether membership for a given regular language's expression can be decided in linear time by a regex engine?
I do text analysis; finding out that Java regexes aren't DFAs was really a downer.
Because of my background, I had always assumed in languages like Java when I call compile for Pattern it produced a DFA or Transducer in the background.
This belief is common among academics. In practice, regular expression compilation does not produce a DFA and then execute it. I have only a small amount of experience in this area; I worked briefly on the regular expression compilation system in Microsoft's implementation of JavaScript in the 1990s. We chose to compile the "regular" expression into a simple domain-specific bytecode language and then built an interpreter for that language.
As you note, this can lead to situations where repeated backtracking has exponentially bad time behavior in the length of the input, but the construction of the compiled state is essentially linear in the size of the expression.
So let me counter your questions with two of my own -- and I note that these are genuine questions, not rhetorical ones.
1) Every truly regular expression corresponds to an NDFA with, let's say, n states. The corresponding DFA might require a superposition of up to 2^n states. So what stops the time taken to construct the DFA from being exponential in pathological cases? The runtime might be linear in the input, but if the construction is exponential in the pattern size, then basically you're just trading one nonlinearity for another.
2) So-called "regular" expressions these days are nothing of the sort; they can do parenthesis matching. They correspond to pushdown automata, not nondeterministic finite automata. Is there a linear algorithm for constructing the corresponding pushdown automaton for a "regular" expression?
Regular expressions are implemented in many languages as backtracking NFAs (see http://msdn.microsoft.com/en-us/library/e347654k(v=vs.110).aspx). Because of backtracking, you can construct regular expressions which have terrible performance on some strings (see http://www.regular-expressions.info/catastrophic.html).
As far as analysing regular expressions to determine their performance, I doubt there is a very good way in general. You can look for warning flags, like compounded backtracking in the second link's example, but even that may be hard to detect properly in some cases.
Java's implementation of regular expression uses the NFA approach.
Here is a link that explains it clearly.
Basically, a badly written but still correct regular expression can cause the engine to perform badly.
For example, take the expression (a+a+)+b and the string aaaaaaaaaaaaaaa.
It can take a while (depending on your machine, from several seconds to minutes) to figure out that there is no match.
The worst-case performance of a backtracking NFA is an almost-match situation, as in this example: the expression forces the engine to explore every path (with lots of backtracking) before it can declare a non-match.
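A C# sketch of that worst case. Note that .NET lets you cap matching time via a constructor timeout; the one-second limit here is an arbitrary choice.

using System;
using System.Text.RegularExpressions;

class BacktrackDemo
{
    static void Main()
    {
        // (a+a+)+b has exponentially many ways to split runs of 'a',
        // and the trailing 'b' never matches, so the engine tries them all.
        var re = new Regex("(a+a+)+b", RegexOptions.None,
                           TimeSpan.FromSeconds(1)); // per-match time limit

        try
        {
            re.IsMatch(new string('a', 40)); // almost-match input, no 'b'
            Console.WriteLine("finished quickly");
        }
        catch (RegexMatchTimeoutException)
        {
            Console.WriteLine("catastrophic backtracking: timed out");
        }
    }
}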

Sharing character buffer between C# strings objects

Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:
var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);
If there is no such function, why bother with making strings immutable?
Or, alternatively, if strings are already immutable for other reasons, why not provide this method?
The specific reason I'm looking into this is file parsing. Simple recursive-descent parsers (such as the one generated by TinyPG, or ones easily written by hand) use Substring all over the place. This means that if you give them a large file to parse, memory churn is unbelievable. Sure, there are workarounds - basically roll your own SubString class, and then of course forget about using String methods such as StartsWith or string libraries such as Regex, so you need to roll your own versions of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify such a monster tool. Even TinyPG is probably overkill.
Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...
No, there's nothing like that.
.NET strings contain their text data directly, unlike Java strings which have a reference to a char array, an offset and a length.
Both solutions have "wins" in some situations, and losses in others.
If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.
As far as I know, all larger parsers parse from streams. Isn't that suitable for your situation?
The .NET framework supports string interning. This is a partial solution, but it does not offer the possibility of reusing parts of a string. I think reusing substrings would cause problems that are not obvious at first glance. If you have to do a lot of string manipulation, StringBuilder is the way to go.
Nothing in C# provides the out-of-the-box functionality you're looking for.
What you want is a rope data structure: an immutable structure that supports O(1) concatenation and O(log n) substring. I can't find any C# implementations of a rope, but here is a Java one.
Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.
Well, you could use "unsafe" code to do the memory management yourself, which might allow you to do what you are looking for. Also, the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't make a new string with each manipulation.
You could easily write a trivial class to represent "cheap". It would just hold a reference to the original string plus the start index and length of the substring. A couple of methods would allow you to read the substring out when needed; a string cast operator would be ideal, as you could write
string text = myCheapObject;
and it would work seamlessly as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one-liners).
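A minimal sketch of such a class (the names are invented; note that modern .NET also offers ReadOnlyMemory<char> and AsMemory for exactly this kind of zero-copy slicing):

using System;

// Holds a reference to the original string plus offset/length,
// so no character data is copied until a real string is needed.
readonly struct CheapSubstring
{
    private readonly string _source;
    private readonly int _start, _length;

    public CheapSubstring(string source, int start, int length)
    {
        _source = source; _start = start; _length = length;
    }

    public bool StartsWith(string prefix) =>
        prefix.Length <= _length &&
        string.CompareOrdinal(_source, _start, prefix, 0, prefix.Length) == 0;

    // Materialize (copy) only when an actual string is required.
    public static implicit operator string(CheapSubstring s) =>
        s._source.Substring(s._start, s._length);
}

class Demo
{
    static void Main()
    {
        var expensive = "a very long document ...";
        var cheap = new CheapSubstring(expensive, 2, 4); // "very"
        Console.WriteLine(cheap.StartsWith("ve"));       // True
        string text = cheap;                             // copies only here
        Console.WriteLine(text);                         // very
    }
}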
The other option is to write a regular parser and store your tokens in a Dictionary, sharing references to the tokens rather than keeping multiple copies.

When not to use Regex in C# (or Java, C++, etc.)

It is clear that there are lots of problems that look like a simple regex will solve them, but which prove to be very hard to solve with regex.
So how does someone who is not an expert in regex know whether he or she should be learning regex to solve a given problem?
(See "Regex to parse C# source code to find all strings" for why I am asking this question.)
This seems to sum it up well:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems...
(I have just changed the title of the question to make it more specific, as some of the problems with regex in C# are solved in Perl and JScript; for example, the two levels of quoting make a regex so unreadable.)
Don't try to use regex to parse hierarchical text like program source (or nested XML): regular expressions are provably not powerful enough for that; for example, they can't determine, for a string of parens, whether the parens are balanced.
Use parser generators (or similar technologies) for that.
Also, I'd not recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you expect, and you'll end up with either an inaccurate or a very long regex.
There are two aspects to consider:
Capability: is the language you are trying to recognize a Type-3 language (a regular one)? If so, you might use regex; if not, you need a more powerful tool.
Maintainability: if it takes more time to write, test and understand a regular expression than its programmatic counterpart, then it's not appropriate. Checking this is complicated; I'd recommend peer review with your colleagues (if they say "what the ..." when they see it, then it's too complicated), or just leave it undocumented for a few days, then take a look yourself and measure how long it takes to understand.
I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex; you'll realise that many, many problems you've solved differently could (and maybe should) be solved using regex.
For a particular problem, try to find a solution at a site like regexlib and see if you can understand the solution.
As indicated above, regex might not be sufficient to solve a specific problem, but browsing a site like regexlib will certainly tell you whether regex is the right solution to your problem.
You should always learn regular expressions; only that way can you judge when to use them. They normally become problematic when you need very good performance. But often it is a lot easier to use a regex than to write a big switch statement.
Have a look at this question, which shows the elegance of a regex in contrast to the equivalent if() construct...
Use regular expressions for recognizing (regular) patterns in text. Don't use it for parsing text into data structures. Don't use regular expressions when the expression becomes very large.
Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for full email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression for an initial search for email address candidates, but you need a parser to actually verify that a candidate conforms to the standard.
At the very least, I'd say learn regular expressions just so that you understand them fully and can apply them in situations where they work. Off the top of my head, I'd use regular expressions for:
Identifying parts of a string.
Checking whether a string conforms to a certain format or construction.
Finding substrings that match a certain pattern.
Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.); see the short sketch below.
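A small C# illustration of the last two items on that list (the patterns and sample text are invented):

using System;
using System.Text.RegularExpressions;

class Uses
{
    static void Main()
    {
        // Checking whether a string conforms to a format: a simple date.
        Console.WriteLine(Regex.IsMatch("2024-01-31", @"^\d{4}-\d{2}-\d{2}$")); // True

        // Transforming strings that fit a pattern: swap "last, first".
        Console.WriteLine(Regex.Replace("Doe, Jane", @"^(\w+), (\w+)$", "$2 $1")); // Jane Doe
    }
}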
Regular expressions at a theoretical level form the foundation of state machines -- in computer science, you have deterministic finite automata (DFA) and non-deterministic finite automata (NFA). You can use regular expressions to enforce some kind of validation on inputs; regular expression engines simply interpret or compile regular expression patterns into actual runtime operations.
Once you know whether the string (or data) you want to determine to be valid could be tested by a DFA, you have a choice of whether to implement that DFA yourself using your own code or using a regular expression engine. You'll find that knowing about regular expressions will actually enhance your toolbox and your understanding of how string processing can actually get complex.
Based on simple regular expressions you can then look into learning about parsers and how parsers work. At the lowest level you have lexical analysis (where regular expressions do the work) and, at a higher level, a grammar and semantic actions. These are the bases upon which compilers, interpreters, protocol parser implementations, and document rendering/transformation applications rely.
The main concern here is maintainability.
It is obvious to me that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation are, only probably worse. So that much is out of the question.
On the other hand, one should consider that maintaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at them. So, in my opinion, the correct approach is to use them only when it is inevitable, and when the regex-based code will be more readable than its non-regex variant. And, of course, as has already been said, do not use them for things they are not meant to do (like XML). And no email address validation either (one of my pet peeves :P)!
But seriously, doesn't it feel wrong to use all those substrs for something that can be solved with a handful of characters that look like line noise? I know it did for me.
