Which is the better way to validate a URL in C#: using Uri.IsWellFormedUriString, or a self-written regular expression, which can be robust but can sometimes miss patterns?
Uri.IsWellFormedUriString is the complete method to do that task - use it.
If you need more, like some refined matching, then attach an && MyPattern(uri) afterwards.
It is a general rule to use built-in solutions to problems, because people who have implemented that method have known at least as much about the problem domain as you do, and usually more, being specialists. Therefore, it is highly unlikely that they have missed some case that you might cover better.
Related
This question is designed around the performance within PHP but you may broaden it to any language if you wish to.
After many years of using PHP and having to compare strings I've learned that using string comparison operators over regular expressions is beneficial when it comes to performance.
I fully understand that some operations have to be done with regular expressions due to their complexity, but this question is about operations that can be resolved with either regex or string functions.
take this example:
PHP
preg_match('/^[a-z]*$/','thisisallalpha');
C#
new Regex("^[a-z]*$").IsMatch("thisisallalpha");
can easily be done with
PHP
ctype_alpha('thisisallalpha');
C#
VFPToolkit.Strings.IsAlpha("thisisallalpha");
There are many other examples but you should get the point I'm trying to make.
What version of string comparison should you try and lean towards and why?
Looks like this question arose from our small argument here, so I feel somewhat obliged to respond.
PHP developers are being actively brainwashed about "performance", which gives rise to many rumors and myths, including plainly silly ones like "double quotes are slower". Regexps being "slow" is one of these myths, unfortunately supported by the manual (see the infamous comment on the preg_match page). The truth is that in most cases you don't care. Unless your code is repeated 10,000 times, you won't even notice a difference between a string function and a regular expression. And if your code does repeat 10,000 times, you must be doing something wrong in any case, and you will gain performance by optimizing your logic, not by stripping out regular expressions.
As for readability, regexps are admittedly hard to read; however, the code that uses them is in most cases shorter, cleaner and simpler (compare your answer and mine at the above link).
Another important concern is flexibility, especially in PHP, whose string library doesn't support Unicode out of the box. In your concrete example, what happens when you decide to migrate your site to UTF-8? With ctype_alpha you're kinda out of luck; preg_match would require another pattern, but will keep working.
So, regexes are not slower, and the code that uses them is more readable and more flexible. Why on earth should we avoid them?
Regular expressions actually lead to a performance gain (not that such micro-optimizations are in any way sensible) when they can replace multiple atomic string comparisons. Typically, at around five strpos() checks it becomes advisable to use a regular expression instead, all the more so for readability.
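To illustrate the shape of that trade-off, here is a minimal sketch in Python rather than PHP (so it is an analogue, not the strpos()/preg_match() code itself). The keyword list and function names are made up for the example, and it makes no claim about which version is faster; it only shows several substring checks collapsing into one pattern.

```python
import re

# Hypothetical list of substrings to look for (illustration only).
NEEDLES = ["error", "warning", "fatal", "panic", "timeout"]

# One alternation pattern built from the same list; re.escape keeps
# any special characters in the needles literal.
PATTERN = re.compile("|".join(re.escape(n) for n in NEEDLES))


def contains_any_substring(text: str) -> bool:
    """Five separate substring checks, one per needle."""
    return any(needle in text for needle in NEEDLES)


def contains_any_regex(text: str) -> bool:
    """A single regex search that covers all needles at once."""
    return PATTERN.search(text) is not None
```

Both functions give the same answer for the same input; which one reads better or runs faster in a given program is something to measure rather than assume.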
And here's another thought to round things up: PCRE can handle conditionals faster than the Zend kernel can handle IF bytecode.
Not all regular expressions are created equal, though. If the complexity gets too high, regex recursion can kill its performance advantage. Therefore it's often worth considering a mix of regex matching and regular PHP string functions. Right tool for the job and all.
PHP itself recommends using string functions over regex functions when the match is straightforward. For example, from the preg_match manual page:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
Or from the str_replace manual page:
If you don't need fancy replacing rules (like regular expressions), you should always use this function instead of ereg_replace() or preg_replace().
However, I find that people try to use the string functions to solve problems that would be better solved by regex. For instance, when trying to create a full-word string matcher, I have encountered people trying to use strpos($string, " $word ") (note the spaces), for the sake of "performance", without stopping to think about how spaces aren't the only way to delineate a word (think about how many string functions calls would be needed to fully replace preg_match('/\bword\b/', $string)).
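The gap between the two approaches can be sketched as follows. This is a Python analogue of the strpos()/preg_match() comparison above, not PHP, and the function names are invented for the example.

```python
import re


def has_word_naive(text: str, word: str) -> bool:
    """Space-delimited substring check, like strpos($string, " $word ")."""
    return f" {word} " in text


def has_word_regex(text: str, word: str) -> bool:
    """Word-boundary check, like preg_match('/\\bword\\b/', $string)."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None
```

The naive version misses the word at the start or end of the string and next to punctuation, for example in "word." or "(word)", while the word-boundary pattern handles those cases and still rejects "swordfish".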
My personal stance is to use string functions for matching static strings (i.e. a match of a distinct sequence of characters where the match is always the same) and regular expressions for everything else.
Agreed that PHP people tend to over-emphasise performance of one function over another. That doesn't mean the performance differences don't exist -- they definitely do -- but most PHP code (and indeed most code in general) has much worse bottlenecks than the choice of regex over string-comparison. To find out where your bottlenecks are, use xdebug's profiler. Fix the issues it comes up with before worrying about fine-tuning individual lines of code.
They're both part of the language for a reason. IsAlpha is more expressive. For example, when an expression you're looking at is inherently alpha or not, and that has domain meaning, then use it.
But if it is, say, an input validation, and could possibly be changed to include underscores, dashes, etc., or if it is with other logic that requires regex, then I would use regex. This tends to be the majority of the time for me.
If you had to explain Lambda expressions to a 5th grader (10/11 years old), how would you do it? And what examples might you give, or resources might you point them to? I may be finding myself in the position of having to teach this to 5th grade level developers and could use some assistance.
[EDIT]: The "5th Grader" reference was meant to relate to an American TV show which pits adults vs. 5th graders in a quiz type setting (I think). I meant to imply that the people who need to be taught this know nothing about lambdas and I need to find a way to make things VERY simple. I'm sorry that I forgot this forum has a worldwide audience.
Thanks very much.
Just call it a function without a name. If she has not been exposed to much programming, her mind will not already have been calcified into thinking that all functions should have names.
Most of the complexity related to lambda expressions is caused by complicated naming and putting it on a marble pedestal.
Lots of kids create great websites with lots of JavaScript stuff. Chances are they are using lambda expressions all the time without knowing it. They just call it a 'cool trick'.
I don't think you need to explain lambda expressions to kids aged 10-11. Just show them what lambda expressions look like, what you can do with them, and where you can use them. Kids at that age still have the capability to learn something new without relying on analogies to understand it.
Lambda expressions are what I consider higher order programming. Rigorous explanation will require extensive prerequisite learning. Certainly, this is not practical at the 5th grade level.
However, it might help to just cover some concepts by example in a way that mirrors real life physical situations.
For instance, a scale is a sort of a lambda expression. It tallies the mass of the objects placed on it. It is not a variable because it does not store the number anywhere. Instead, it generates the number at the time of use. When used again, it recalculates based on its inputs. You can take it places and use it somewhere else, but the underlying mechanics (expression) is the same.
If he already understands what a "function" is, then you can say that a lambda is a function that you need only once and that therefore doesn't need a name.
Anyway, if you need to explain functional programming I would recommend to try to steal some ideas from http://learnyouahaskell.com/ - it's one of the best explanations of ideas behind functional programming I've ever read.
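The "function you need only once, so it doesn't need a name" idea can be shown with a very small example. This sketch uses Python rather than C#, purely for illustration; the word list and the helper name word_length are made up.

```python
words = ["banana", "fig", "cherry"]


# With a named function: we have to invent a name for something
# that is only ever used in one place.
def word_length(word):
    return len(word)


by_length_named = sorted(words, key=word_length)

# With a lambda: the same rule, written right where it is needed,
# without giving it a name.
by_length_lambda = sorted(words, key=lambda word: len(word))
```

Both calls sort the words by length and produce the same list; the only difference is whether the sorting rule got a name.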
Lambda expressions in C# are basically just anonymous delegates, so when explaining what they are to ANYONE, they need to understand the following, in this order:
1. What a delegate is and what delegates are used for.
2. What an anonymous delegate is and how it is just a shorthand way of creating a delegate.
3. Lambda expression syntax and how it is really just an even shorter way of creating an anonymous delegate.
Probably not the best thing to start explaining to a fifth grader if a cutting edge OO language (C#) didn't get it for 10 years.
Pretty hard to come up with an analogy for a function with no name... Perhaps it's significant because it's a less verbose way to specify a callback?
I'm assuming you're actually looking for a basic intro for programmers, not actual 5th graders. (For 5th graders, Python or JavaScript tend to be best). Anyways, two good introductions to C# lambda expressions:
Explaining C# lambda statements in one sentence (and LINQ in about 10)
LINQ in Action
The first (disclaimer - my blog) will give you a quick explanation of the fundamental concepts. The book provides complete coverage of all relevant topics.
Is there any simple way in C# to test if a regular expression is a regular expression? In other words, I would like to check if a user-provided regex pattern is malformed or not. This is purely a syntax test and not what the regex is supposed to achieve/test. Thanks
You may try passing it to the Regex constructor and catching the ArgumentException that is thrown if the argument is a malformed regular expression.
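The same "try to compile it and catch the failure" approach can be sketched in Python, where re.compile raises re.error for a malformed pattern. This is an analogue of the C# Regex constructor and ArgumentException described above, not the C# code itself, and the function name is hypothetical. Note that regex dialects differ, so a pattern accepted here is not guaranteed to be accepted by .NET, and vice versa.

```python
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if the pattern compiles, False if it is malformed."""
    try:
        re.compile(pattern)
    except re.error:
        # Compilation failed, so the pattern has a syntax problem.
        return False
    return True
```

This only checks syntax. It says nothing about whether the pattern matches what the user intended.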
Here's an example from C# Online .NET that uses exceptions:
EDIT:
Removed the code to respect copyright owners, just in case. Simply click on the above link to see it.
I have to say, this doesn't sound good. The extremely small subset of computer users that would be capable of correctly entering a regex should probably also be trusted to interpret the exception message correctly. Trying to validate their entry and getting it wrong would be sufficient grounds for them to get pretty flipping mad and uninstall your program.
If experienced programmers are not actually your target customer, be sure to avoid regex.
It is clear that there are lots of problems that look as if a simple regex will solve them, but which prove to be very hard to solve with regex.
So how does someone that is not an expert in regex, know if he/she should be learning regex to solve a given problem?
(See "Regex to parse C# source code to find all strings" for why I am asking this question.)
This seems to sum it up well:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.”
Now they have two problems...
(I have just changed the title of the question to make it more specific, as some of the problems with Regex in C# are solved in Perl and JScript, for example the fact that the two levels of quoting make a Regex so unreadable.)
Don't try to use regex to parse hierarchical text like program source (or nested XML): they are proven to be not powerful enough for that, for example, they can't, for a string of parens, figure out whether they're balanced or not.
Use parser generators (or similar technologies) for that.
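For the balanced-parentheses case specifically, a few lines of ordinary code are enough, because the one thing a classic regular expression lacks is a counter. A minimal sketch in Python (the function name is made up, and a real source-code parser would also have to handle strings, comments and other bracket types):

```python
def parens_balanced(text: str) -> bool:
    """Check that every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared before its matching '('.
                return False
    # Balanced only if every '(' that was opened has been closed.
    return depth == 0
```

The depth counter can grow without bound, which is exactly what a finite automaton cannot do.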
Also, I'd not recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you'd expect, and you'll end up with either an inaccurate regex or a very long one.
There are two aspects to consider:
Capability: is the language you are trying to recognize a Type-3 language (a regular one)? If so, then you might use regex; if not, you need a more powerful tool.
Maintainability: if it takes more time to write, test and understand a regular expression than its programmatic counterpart, then it's not appropriate. How to check this is complicated. I'd recommend peer review with your fellows (if they say "what the ..." when they see it, then it's too complicated), or just leave it undocumented for a few days, then take a look yourself and measure how long it takes you to understand it.
I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex, you'll realise that many, many problems you've solved differently could (and maybe should) be solved using regex.
For a particular problem, try to find a solution at a site like regexlib, and see if you can understand the solution.
As indicated above, regex might not be sufficient to solve a specific problem, but browsing a site like regexlib will certainly tell you if regex is the right solution to your problem.
You should always learn regular expressions - only this way you can judge when to use them. Normally they get problematic, when you need very good performance. But often it is a lot easier to use a regex than to write a big switch statement.
Have a look at this question - which shows you the elegance of a regex in contrast to the similar if() construct ...
Use regular expressions for recognizing (regular) patterns in text. Don't use them for parsing text into data structures. Don't use regular expressions when the expression becomes very large.
Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for proper email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression for an initial search for email address candidates. But you need a parser to actually verify whether an address candidate conforms to the given standard.
At the very least, I'd say learn regular expressions just so that you understand them fully and be able to apply them in situations where they would work. Off the top of my head I'd use regular expressions for:
Identifying parts of a string.
Checking whether a string conforms to a certain format or construction.
Finding substrings that match a certain pattern.
Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.).
At a theoretical level, regular expressions correspond to finite state machines: in computer science, you have Deterministic Finite Automata (DFA) and Non-deterministic Finite Automata (NFA), and every regular expression can be converted into one. You can use regular expressions to enforce some kind of validation on inputs -- regular expression engines simply interpret or convert regular expression patterns/strings into actual runtime operations.
Once you know whether the string (or data) you want to determine to be valid could be tested by a DFA, you have a choice of whether to implement that DFA yourself using your own code or using a regular expression engine. You'll find that knowing about regular expressions will actually enhance your toolbox and your understanding of how string processing can actually get complex.
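As a concrete instance of that choice, here is a minimal sketch in Python of the same small language, optionally signed decimal integers, recognized once by a hand-written DFA and once by a regex engine. The state names, the transition table and the example language are all assumptions chosen for illustration.

```python
import re

# States of the hand-written DFA (names are illustrative).
START, SIGN, DIGITS, REJECT = "start", "sign", "digits", "reject"


def char_class(ch: str) -> str:
    """Map a character to the input class the DFA cares about."""
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"


# Transition table: (current state, input class) -> next state.
# Any pair not listed here leads to REJECT.
TRANSITIONS = {
    (START, "sign"): SIGN,
    (START, "digit"): DIGITS,
    (SIGN, "digit"): DIGITS,
    (DIGITS, "digit"): DIGITS,
}


def dfa_accepts(text: str) -> bool:
    """Run the DFA over the text; accept only if it ends in DIGITS."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)), REJECT)
        if state == REJECT:
            return False
    return state == DIGITS


# The same language expressed as a regular expression.
SIGNED_INT = re.compile(r"[+-]?[0-9]+")


def regex_accepts(text: str) -> bool:
    return SIGNED_INT.fullmatch(text) is not None
```

The two recognizers accept exactly the same strings. The regex is one line, while the table makes every state and transition explicit, which is the trade-off described above.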
Based on simple regular expressions you can then look into learning about parsers and how parsers work. At the lowest level you're looking at lexical analysis (where regular expressions work) and at a higher level a grammar and semantic actions. These are the foundations on which compilers and interpreters, protocol parser implementations, and document rendering/transformation applications rely.
The main concern here is maintainability.
It is obvious to me, that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation is, only, probably, worse. So this is out of the question.
On the other hand, one should consider that maintaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at them. So, in my opinion, the correct approach here is to only use them when it is inevitable and when the code using regexes will be more readable than its non-regex variant. And, of course, as has already been indicated, do not use them for something that they are not meant to do (like XML). And no email address validation either (one of my pet peeves :P)!
But seriously, doesn't it feel wrong when you use all those substrs for something, that can be solved with a handful of characters, looking like line noise? I know it did for me.
I hope this is a programmer-related question. I do C# programming as a hobby. For my own purposes I need to parse HTML files, and the best idea seems to be regular expressions. As many have found out, it's quite time consuming to learn them, so I'm interested in whether you know of an application that would be able to take input (a piece of any code), understand what I need (by my selecting the piece of the code I need to "cut out"), and give me the proper regular expression for it, or several options.
As I've heard, Regex is a little science in itself, so it might not be as easy as I'd imagine.
Yes, there is. Roy Osherove wrote exactly what you're looking for: Regulazy.
Not a real answer to your question, as it has nothing to do with regex, but HtmlAgilityPack may help you with your parsing.
You might also want to try txt2re : http://txt2re.com/, which tries to identify patterns in a user-supplied string and lets you build a regex out of them.
I gotta agree with Sunny on this one: if you're parsing HTML, you're better off converting it to XML (using the HTML Agility Pack it's trivially easy), and then you can use XPath expressions rather than regular expressions; XPath is far better suited to the job.
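The "get well-formed XML first, then query it with XPath" approach can be sketched like this. The example uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset, as a stand-in; it is not the HTML Agility Pack API, and the markup is a made-up, already well-formed snippet.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup (real-world HTML usually
# needs a tolerant parser or a cleanup step before this works).
doc = (
    "<html><body>"
    "<p>See <a href='http://txt2re.com/'>txt2re</a> and "
    "<a href='http://learnyouahaskell.com/'>LYAH</a>.</p>"
    "<p><a name='anchor-only'>no link here</a></p>"
    "</body></html>"
)

root = ET.fromstring(doc)

# XPath-style query: every <a> element, at any depth, that has an
# href attribute. Results come back in document order.
links = [a.get("href") for a in root.findall(".//a[@href]")]
```

The query states what is wanted in terms of the document's structure, so nesting, attribute order and whitespace stop being the caller's problem.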