I have a quite long regex; sometimes it responds fast, and sometimes it takes an extremely long time to run.
Here is my regex:
<div class=""rwResult bg"">.*?mp3/d/[^>]+>(?<Name>[^<]+)</a>.*?artist:[^>]+>(?<Artist>[^<]+).*?user</span>[^>]+[^""]+""(?<Uploader>[^""]+).*?category:.*?"">.*?"">(?<Category>[^<]+).*?time: (?<Duration>[^ ]+) \| (?<StreamSize>[0-9]+) (?<Weight>[^ ]+) \| listened: (?<Clicks>[0-9]+).*?<a href=""(?<DownloadLink>http://dl[^""]+)
Rather than using a separate regex for each group, I prefer to do it all with a single regex.
Is there any function or technique I could use to check for, or avoid, the long delay while the regular expression is executing?
I'm working in C# (or F#); I hope someone can answer this.
Thanks.
It looks like you are trying to parse an XML document using a regular expression. This is not really an optimal approach. My guess is that you are seeing problems because of the use of backtracking in your regular expression.
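On the "avoid the long load" part of your question: newer versions of .NET (4.5 and later) let you cap how long a match attempt may run by passing a timeout to the Regex constructor. A minimal sketch (the pattern and input here are placeholders, not your real data):

using System;
using System.Text.RegularExpressions;

class TimeoutExample
{
    static void Main()
    {
        string someHtml = "<div class=\"rwResult bg\">example</div>";

        // Abort any match attempt that runs longer than one second,
        // e.g. because of catastrophic backtracking.
        var regex = new Regex(@"<div class=""rwResult bg"">.*?</div>",
                              RegexOptions.Singleline,
                              TimeSpan.FromSeconds(1));
        try
        {
            Match m = regex.Match(someHtml);
            if (m.Success)
                Console.WriteLine(m.Value);
        }
        catch (RegexMatchTimeoutException)
        {
            Console.WriteLine("Match aborted: the pattern backtracked for too long.");
        }
    }
}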
You could try to rewrite your regular expression, but XML is not a regular language and thus is not parsable by regular expressions.
Take a look at the document How to read XML from a file by using Visual C# to get started.
Side note: for an entertaining read on what happens when you try to parse a non-regular language with regular expressions, see this Stack Overflow question.
I think you're using the wrong tool. You really want XPath, and possibly XSLT. The only time you want to use a regex to parse raw XML is when the XML is suspected to be syntactically broken in predictable ways.
Seriously, look at XPath; it's magic for delving into the structure of XML documents and pulling out the bits you want.
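For example, once the markup is well-formed, a query that would take a hairy regex becomes a one-liner. A minimal sketch with made-up XML:

using System;
using System.Xml;

class XPathExample
{
    static void Main()
    {
        // Hypothetical well-formed markup; XPath needs valid XML or XHTML.
        string xml = @"<results>
                         <track artist='Some Artist'>
                           <a href='http://dl.example.com/song.mp3'>Song Name</a>
                         </track>
                       </results>";

        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // One declarative query pulls out every download link.
        foreach (XmlNode href in doc.SelectNodes("//track/a/@href"))
            Console.WriteLine(href.Value);
    }
}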
Related
I'm writing a C# web crawler, and when I run the profiler I can see that HTMLAgilityPack's LoadHTML method accounts for 10% of the program's overall CPU usage. I'd like to try to lower this.
I'm sure a regular expression would be faster, but as I look at link-extracting examples on SO I see everyone saying this method should be avoided in favour of an HTML parser like HTMLAgilityPack.
As all I need to do is extract links from HTML, is using HTMLAgilityPack overkill?
Are the reasons for favouring an HTML parser applicable to my case, given that I'm only using it for extracting links?
I downloaded the HTML with WebClient, then compared the two approaches.
Using href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+)) (then trimming and adding to a list) is way faster than HTMLAgilityPack.
43 milliseconds for HTMLAgilityPack compared to 3 for the regex, consistently.
See my code on pastebin
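For reference, the regex side of the comparison looked roughly like this (the URL is a placeholder; see the pastebin for the real benchmark code):

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class LinkExtractor
{
    static void Main()
    {
        string html;
        using (var client = new WebClient())
            html = client.DownloadString("http://example.com/");

        var hrefRegex = new Regex(
            @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>\S+))",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        var links = new List<string>();
        foreach (Match m in hrefRegex.Matches(html))
            links.Add(m.Groups[1].Value.Trim()); // trim, then add to the list

        Console.WriteLine("{0} links found", links.Count);
    }
}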
Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?
In your case the HTML parser is overkill as your tests have shown.
People who answer on SO use that as a rote answer to all regex questions. One should reach for a parser only if one actually needs to process the HTML in a more robust fashion.
Bias against regular expressions is found among people who feel that they are too slow or too cumbersome [to learn]. There is some merit to what they propose for certain operations, in that specific optimized text-searching utilities do perform better. Sure, I agree, but dismissing regex out of hand is par for the course on StackOverflow.
Why is that? Sometimes the analysis is simply flawed because the pattern being tested introduces a lot of unnecessary backtracking and is not optimized. That handicaps regex right out of the gate. One does have to learn the regex language and understand what it is doing to tune a pattern so the engine does no wasted work.
For example, I took your same C# test code, but I used an optimized version of your pattern and one of my own, and was able to get it down to 1 millisecond consistently!
Most people learn basic pattern matching by doing searches with a *. When they first learn regex, they combine * with . as .*. That step, along with indiscriminate use of *, will most likely doom any non-beginner pattern to the hell of backtracking and slow responses.
Unless you know empirically that an item can genuinely occur zero times, use + instead.
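To make that concrete, here are two patterns for the same job; the negated character class never backtracks, while the lazy dot-star expands and retries one character at a time (my own illustrative patterns, not the ones from the test):

using System;
using System.Text.RegularExpressions;

class BacktrackingExample
{
    static void Main()
    {
        string html = "<a href=\"http://example.com/page\">link</a>";

        // Lazy dot-star: the engine grows the match one character at a
        // time, re-trying the closing quote after every step.
        Match slow = Regex.Match(html, @"href="".*?""");

        // Negated character class with +: the engine runs straight to
        // the closing quote with no backtracking at all.
        Match fast = Regex.Match(html, @"href=""[^""]+""");

        Console.WriteLine(slow.Value == fast.Value); // True: same match, different cost
    }
}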
Back in 2009 I wrote about this subject on my blog Are C# .Net Regular Expressions Fast Enough for You?
I am looking for a regular expression to filter all \r\n out of an HTML file, except that the contents of any textarea should pass through with the line breaks intact.
I am using .NET (C#) technology.
Don't use regular expressions - use an HTML parser.
Speaking of HTML parsers, the Html Agility Pack is great for solving this type of problem.
Alternative approach:
1. Find, with a regex, the position (in the string) of each textarea element. A suitable regex for this would be: (<textarea>(.*?)</textarea>)
2. Remove the \r\n characters from everywhere except the ranges you found in step 1 (a sketch follows below).
This is extremely similar to this answer I've given before.
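A minimal sketch of those two steps, assuming simple, non-nested textarea tags:

using System.Text;
using System.Text.RegularExpressions;

class TextareaPreserver
{
    static string StripNewlinesOutsideTextareas(string html)
    {
        var textarea = new Regex(@"(<textarea>(.*?)</textarea>)",
                                 RegexOptions.Singleline);
        var result = new StringBuilder();
        int pos = 0;
        foreach (Match m in textarea.Matches(html))
        {
            // Outside a textarea: drop the line breaks (step 2).
            result.Append(html.Substring(pos, m.Index - pos).Replace("\r\n", ""));
            // Inside a textarea: keep the match untouched (step 1).
            result.Append(m.Value);
            pos = m.Index + m.Length;
        }
        result.Append(html.Substring(pos).Replace("\r\n", ""));
        return result.ToString();
    }
}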
Fortunately, .NET has a balanced matching feature.
So you can do this:
(<textarea[^>]*>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</textarea>)|\r\n
Then you can perform a replace with a replacement value of $1.
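In C# that replace is a one-liner (using the pattern above verbatim):

using System.Text.RegularExpressions;

class BalancedReplace
{
    static string StripNewlines(string html)
    {
        // The alternation matches either a whole <textarea> block (captured
        // as group 1) or a bare \r\n (group 1 empty), so replacing with $1
        // keeps the textareas and deletes every other line break.
        const string pattern =
            @"(<textarea[^>]*>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</textarea>)|\r\n";
        return Regex.Replace(html, pattern, "$1");
    }
}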
Here it is in action:
http://regexhero.net/tester/?id=292c5529-5fe8-42e9-8d72-d7ea9ab9e1fe
Hope that helps. The benefit of using balanced matching like this is that it's powerful enough to handle nested tags that are inherent to HTML.
However, it's still not 100% reliable. Comments can still throw it off, and of course this is an insanely complicated regular expression to manage if you ever need to make changes. So you may still want to use an HTML parser after all.
Read this:
RegEx match open tags except XHTML self-contained tags
This question is like asking how to do up a bolt with a hammer. Now, I'm sure that if you were determined enough you could tighten the bolt with a hammer, but it would be difficult and problematic to say the least, and the chances are you would break something by trying.
Take a step back, throw away the assumption that your hammer is the best tool, and go back to your toolbox; if you dig around in there you will find a better tool. It's called an HTML parser.
It is clear that there are lots of problems that look like a simple regex will solve them, but which prove to be very hard to solve with regex.
So how does someone who is not an expert in regex know whether he or she should be learning regex to solve a given problem?
(See "Regex to parse C# source code to find all strings" for why I am asking this question.)
This seems to sum it up well:
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
(I have just changed the title of the question to make it more specific, as some of the problems with regex in C# are solved in Perl and JScript; for example, the two levels of quoting that make a C# regex so unreadable.)
Don't try to use regex to parse hierarchical text like program source (or nested XML): regular expressions are provably not powerful enough for that. For example, they cannot tell, for a string of parens, whether it is balanced or not.
Use parser generators (or similar technologies) for that.
Also, I'd not recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you think, and you'll end up with either an inaccurate regex or a very long one.
There are two aspects to consider:
Capability: is the language you are trying to recognize a Type-3 language (a regular one)? If so, then you might use regex; if not, you need a more powerful tool.
Maintainability: if it takes more time to write, test, and understand a regular expression than its programmatic counterpart, then it's not appropriate. Checking this is complicated; I'd recommend peer review with your colleagues (if they say "what the ..." when they see it, then it's too complicated), or just leave it undocumented for a few days and then take a look at it yourself and measure how long it takes to understand.
I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex, you'll realise that many, many problems you've solved differently could (and maybe should) be solved using regex.
For a particular problem, try to find a solution at a site like regexlib, and see if you can understand the solution.
As indicated above, regex might not be sufficient to solve a specific problem, but browsing a site like regexlib will certainly tell you whether regex is the right solution to your problem.
You should always learn regular expressions; only that way can you judge when to use them. They typically become problematic when you need very good performance, but often it is a lot easier to use a regex than to write a big switch statement.
Have a look at this question, which shows the elegance of a regex in contrast to the equivalent if() construct...
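To give a flavour of that contrast, here is a format check done both ways (the invoice-code format is an invented example):

using System.Text.RegularExpressions;

class FormatCheck
{
    // One line with a regex...
    static bool IsInvoiceCodeRegex(string s)
    {
        return Regex.IsMatch(s, @"^[A-Z]{2}-\d{4}$");
    }

    // ...versus the equivalent hand-rolled if() construct.
    static bool IsInvoiceCodeManual(string s)
    {
        if (s == null || s.Length != 7) return false;
        if (!char.IsUpper(s[0]) || !char.IsUpper(s[1])) return false;
        if (s[2] != '-') return false;
        for (int i = 3; i < 7; i++)
            if (!char.IsDigit(s[i])) return false;
        return true;
    }
}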
Use regular expressions for recognizing (regular) patterns in text. Don't use them for parsing text into data structures, and don't use them when the expression becomes very large.
Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for proper email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression for an initial search for email address candidates, but you need a parser to actually verify that a candidate conforms to the given standard.
At the very least, I'd say learn regular expressions just so that you understand them fully and be able to apply them in situations where they would work. Off the top of my head I'd use regular expressions for:
Identifying parts of a string.
Checking whether a string conforms to a certain format or construction.
Finding substrings that match a certain pattern.
Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.).
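Each of those four uses maps directly onto the .NET Regex API; a quick sketch with made-up inputs:

using System;
using System.Text.RegularExpressions;

class FourUses
{
    static void Main()
    {
        // Identifying parts of a string (named capture groups).
        Match date = Regex.Match("2024-05-17", @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})");
        Console.WriteLine(date.Groups["year"].Value);                     // 2024

        // Checking whether a string conforms to a certain format.
        Console.WriteLine(Regex.IsMatch("AB-1234", @"^[A-Z]{2}-\d{4}$")); // True

        // Finding substrings that match a certain pattern.
        foreach (Match word in Regex.Matches("one two three", @"\w+"))
            Console.WriteLine(word.Value);                                // one, two, three

        // Transforming strings that fit a pattern (search-replace).
        Console.WriteLine(Regex.Replace("hello world", @"\b\w",
                                        m => m.Value.ToUpper()));         // Hello World
    }
}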
At a theoretical level, regular expressions form the foundation of state machines: in computer science you have Deterministic Finite Automata (DFAs) and Non-deterministic Finite Automata (NFAs). You can use regular expressions to enforce some kind of validation on inputs; regular expression engines simply interpret or compile regular expression patterns into actual runtime operations.
Once you know whether the string (or data) you want to validate can be tested by a DFA, you have a choice of whether to implement that DFA yourself in your own code or to use a regular expression engine.
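As a toy illustration of that choice, here is the same language (binary strings ending in 01) recognized once by a hand-written DFA and once by the regex engine:

using System.Text.RegularExpressions;

class DfaExample
{
    // Hand-rolled three-state DFA for the language (0|1)*01.
    static bool EndsIn01Dfa(string input)
    {
        int state = 0; // 0: start, 1: just saw 0, 2: just saw 01
        foreach (char c in input)
        {
            if (c == '0') state = 1;
            else if (c == '1') state = (state == 1) ? 2 : 0;
            else return false; // not over the alphabet {0, 1}
        }
        return state == 2;
    }

    // The same recognizer, delegated to the regex engine.
    static bool EndsIn01Regex(string input)
    {
        return Regex.IsMatch(input, "^[01]*01$");
    }
}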
Based on simple regular expressions you can then look into learning about parsers and how parsers work. At the lowest level you have lexical analysis (where regular expressions work), and at a higher level a grammar and semantic actions. These are the bases upon which compilers, interpreters, protocol parsers, and document rendering/transformation applications are built.
The main concern here is maintainability.
It is obvious to me that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation are, only probably worse. So this is out of the question.
On the other hand, one should consider that maintaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at them. So, in my opinion, the correct approach is to use them only when it is inevitable and when the regex version of the code will be more readable than the non-regex variant. And, of course, as has already been indicated, do not use them for something they are not meant to do (like XML). And no email address validation either (one of my pet peeves :P)!
But seriously, doesn't it feel wrong to use all those Substring calls for something that can be solved with a handful of characters that look like line noise? I know it did for me.
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
var yourDoc = new HtmlDocument();
yourDoc.LoadHtml(yourHtmlString); // load your HTML here
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath").Count;
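Since the question is specifically about href values, extracting them would look something like this (a sketch against the Html Agility Pack API):

using System.Collections.Generic;
using HtmlAgilityPack;

class HrefExtractor
{
    static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode a in anchors)
                hrefs.Add(a.GetAttributeValue("href", ""));
        }
        return hrefs;
    }
}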
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard XML techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, then there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack at http://www.codeplex.com/htmlagilitypack; it lets you write XPath expressions against the nodes you want and get them returned in a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
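Typical usage is to point an SgmlReader at the HTML and load the result into an XmlDocument; a sketch along these lines (property names from memory of the SgmlReader API, so double-check them against the download):

using System.IO;
using System.Xml;
using Sgml;

class SgmlExample
{
    static XmlDocument LoadHtmlAsXml(string html)
    {
        var sgmlReader = new SgmlReader
        {
            DocType = "HTML",
            InputStream = new StringReader(html)
        };
        // SgmlReader derives from XmlReader, so XmlDocument can consume it directly.
        var doc = new XmlDocument();
        doc.Load(sgmlReader);
        return doc;
    }
}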
An outstanding achievement is the pure XSLT 2.0 parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for every one of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to)
d:htmlparse(string,'http://ww.w3.org/1999/xhtml',true()))
Parses the string as HTML and/or XML using some inbuilt heuristics to)
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace unless the input has explict namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively: because HTML is often not very well-formed, you are probably best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here, RegExLib should get you started.
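To use a pattern like that from C#, something along these lines should work (the quote and slash escapes are adapted to a verbatim string, and the sample HTML is made up):

using System;
using System.Text.RegularExpressions;

class HrefRegexExample
{
    static void Main()
    {
        string html = "<a href=\"http://example.com/page.html\">link</a>";

        // The pattern above, with its escaping adapted to a C# verbatim string.
        var href = new Regex(
            @"href=[""'](http://|\./|/)?\w+(\.\w+)*(/\w+(\.\w+)?)*(/|\?\w*=\w*(&\w*=\w*)*)?[""']");

        foreach (Match m in href.Matches(html))
            Console.WriteLine(m.Value); // href="http://example.com/page.html"
    }
}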
You might have more luck using XML if you know, or can fix, the document to be at least well-formed. If you have good HTML (or rather, XHTML), the XML system in .NET should be able to handle it. Unfortunately, good HTML is extremely rare.
On the other hand, regular expressions are really bad at parsing HTML. Fortunately, you don't need to handle the full HTML spec; all you need to worry about is parsing href= strings to get the URLs. Even this can be tricky, so I won't attempt it right away. Instead, I'll start by asking a few questions to try to establish some ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript: links, and the like?
Is it possible that the document's content describes HTML features (i.e., href= could appear in the text without belonging to an anchor tag)?
What else can you tell us about the document?
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser
I hope this is a programmer-related question. I program in C# as a hobby. For my own purposes I need to parse HTML files, and the obvious idea is... regular expressions. As many have found out, it's quite time-consuming to learn them, so I'm interested in whether you know of an application that could take an input (a piece of any code), understand what I need (by me selecting the piece of code I need to "cut out"), and give me the proper regular expression for it, or several options.
As I've heard, regex is a little science of its own, so it might not be as easy as I imagine.
Yes, there is: Roy Osherove wrote exactly what you're looking for. It's called Regulazy.
Not a real answer to your question, as it has nothing to do with regex, but HtmlAgilityPack may help you with your parsing.
You might also want to try txt2re (http://txt2re.com/), which tries to identify patterns in a user-supplied string and lets you build a regex out of them.
I gotta agree with Sunny on this one: if you're parsing HTML, you're better off converting it to XML (using the HTML Agility Pack it's trivially easy) and then using XPath expressions rather than regular expressions; XPath is far better suited to the job.