I am looking for a regular expression to strip every \r\n from an HTML file, except inside a textarea, where the line breaks should be left in place.
I am using .NET (C#) technology.
Don't use regular expressions - use an HTML parser.
Speaking of HTML parsers, the Html Agility Pack is great for solving this type of problem.
Alternative approach:
1. Find, with regex, the position (in the string) of each textarea element. A suitable regex for this would be: (<textarea>(.*?)</textarea>), run with RegexOptions.Singleline so that . also matches line breaks.
2. Remove the \r\n characters from everywhere, except the places you found in step 1.
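The two steps above can be sketched in C# roughly as follows. This is only a sketch, not the answerer's code: the helper name StripNewlinesOutsideTextareas is made up for illustration, and the pattern is loosened (my assumption) so that it tolerates attributes on the tag and upper-case tag names.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: removes \r\n everywhere except inside <textarea> elements.
static string StripNewlinesOutsideTextareas(string html)
{
    // Step 1: locate every textarea element. Singleline lets '.' match line breaks.
    var textarea = new Regex(@"<textarea\b[^>]*>.*?</textarea>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    var result = new StringBuilder(html.Length);
    int pos = 0;
    foreach (Match m in textarea.Matches(html))
    {
        // Step 2: strip \r\n from the text in front of this textarea...
        result.Append(html.Substring(pos, m.Index - pos).Replace("\r\n", ""));
        // ...and copy the textarea itself through untouched.
        result.Append(m.Value);
        pos = m.Index + m.Length;
    }
    // Strip \r\n from whatever follows the last textarea.
    result.Append(html.Substring(pos).Replace("\r\n", ""));
    return result.ToString();
}

string sample = "<p>one\r\ntwo</p>\r\n<textarea rows=\"2\">keep\r\nme</textarea>\r\n";
Console.WriteLine(StripNewlinesOutsideTextareas(sample));
```

Like any regex-based approach it can be fooled by a literal "</textarea>" inside a comment or script, so treat it as a quick fix rather than a parser.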
This is extremely similar to this answer I've given before.
Fortunately, .NET has a balanced matching feature.
So you can do this:
(<textarea[^>]*>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</textarea>)|\r\n
Then perform a replace with a replacement value of $1.
Here it is in action:
http://regexhero.net/tester/?id=292c5529-5fe8-42e9-8d72-d7ea9ab9e1fe
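In C# the replace step might look like the sketch below. The pattern is copied from the answer as-is; the wrapper name RemoveNewlines and the sample input are made up for illustration.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from the answer.
static string RemoveNewlines(string html)
{
    // Group 1 captures a whole textarea element (using .NET balancing groups);
    // the second alternative matches a bare \r\n.
    string pattern =
        @"(<textarea[^>]*>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</textarea>)|\r\n";

    // "$1" re-emits a matched textarea unchanged. When the match was a bare \r\n,
    // group 1 did not participate, so the line break is replaced with nothing.
    return Regex.Replace(html, pattern, "$1");
}

string sample = "<p>one\r\ntwo</p>\r\n<textarea>keep\r\nme</textarea>\r\n";
Console.WriteLine(RemoveNewlines(sample));
```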
Hope that helps. The benefit of using balanced matching like this is that it's powerful enough to handle nested tags that are inherent to HTML.
However, it's still not 100% reliable. Comments can still throw it off. And of course this is also an insanely complicated regular expression to manage if you ever need to make changes. So you may still want to use an html parser after all.
Read this:
RegEx match open tags except XHTML self-contained tags
This question is like asking how to do up a bolt with a hammer. Now, I'm sure that if you were determined enough you could tighten the bolt with a hammer. However, it would be difficult and problematic to say the least, and the chances are you would break something by trying.
Take a step back, throw away the assumption that your hammer is the best tool, and go back to your toolbox. If you dig around in there you will find a better tool: it's called an HTML parser.
I'm writing a C# web crawler, and when I run the profiler I can see that HTMLAgilityPack's LoadHTML method accounts for 10% of the program's overall CPU usage. I'd like to try to lower this.
I'm sure a regular expression would be faster, but when I look at link-extraction examples on SO I see everyone saying this method should be avoided in favour of an HTML parser like HTMLAgilityPack.
As all I need to do is extract links from HTML, is using HTMLAgilityPack overkill?
Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?
Downloaded HTML with WebClient then compared.
Using href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+)) (then trimming and adding to a list) is way faster than HTMLAgilityPack.
Consistently 43 milliseconds for HTMLAgilityPack compared to 3 for the regex.
See my code on pastebin
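For readers without access to the pastebin, the regex side of such a test could look roughly like this. It is a reconstruction under assumptions, not the poster's actual code: the pattern is the one quoted above, while the helper name ExtractHrefs and the sample markup are made up.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the value of every href attribute it can find.
static List<string> ExtractHrefs(string html)
{
    // (?<1>...) puts the value in group 1 whether or not it was quoted.
    var hrefPattern = new Regex(
        "href\\s*=\\s*(?:[\"'](?<1>[^\"']*)[\"']|(?<1>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    var links = new List<string>();
    foreach (Match m in hrefPattern.Matches(html))
    {
        links.Add(m.Groups[1].Value.Trim());
    }
    return links;
}

string page = "<a href=\"http://example.com/a\">A</a> <a href='/b.html'>B</a>";
foreach (string link in ExtractHrefs(page))
{
    Console.WriteLine(link);
}
```

Note that the unquoted branch (\S+) keeps going until the next whitespace, so for an unquoted href it can swallow the closing > and whatever follows it.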
Are the reasons for favouring a HTML parser applicable to my case as I'm only using it for extracting links?
In your case the HTML parser is overkill as your tests have shown.
People who answer on SO use that as a rote answer to all regex questions. One should use the tool if one actually needs to parse the domain of the HTML in a more robust fashion.
Bias against regular expressions tends to come from people who feel that they are too slow or too cumbersome [to learn]. There is some merit to that view for certain operations, in that specifically optimized text-finding utilities do perform better. Sure, I agree, but dismissing regex out of hand, well, that is par for the course on StackOverflow.
Why is that? Sometimes the analysis is simply flawed because the pattern provided introduces a lot of unnecessary backtracking and is not optimized. That handicaps regex out of the gate. One does have to learn the regex language and understand what the engine is doing in order to tune a pattern so that it does no needless work.
For example, I took your same C# test code, but I used an optimized version of your pattern and one of my own, and was able to get it down to 1 millisecond consistently!
Most people learn basic pattern matching by doing searches with a *. When they first learn regex they use * with the . as in .*. That step, along with indiscriminate usage of the *, will most likely doom any non-beginner pattern to the hell of backtracking and slow responses.
Unless you know empirically that there may be zero items, use + instead.
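To illustrate the point about .* (this is not the optimized pattern mentioned above, which isn't reproduced in the answer), here is a small sketch contrasting a greedy .* with a negated character class and +. The sample markup is made up.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

string html = "<a href=\"x\">one</a><a href=\"y\">two</a>";

// Greedy .* runs to the end of the input and then backtracks to the LAST '>',
// so the "tag" it finds swallows everything in between.
string greedy = Regex.Match(html, "<a .*>").Value;

// [^>]+ can never run past the first '>', so the engine has far less to give back,
// and + insists on at least one character instead of silently accepting none.
string specific = Regex.Match(html, "<a [^>]+>").Value;

Console.WriteLine(greedy);
Console.WriteLine(specific);
```

On a two-link string the difference is invisible in timing terms; on a large page, every place where .* overshoots and has to back up adds work that the tighter pattern never does.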
Back in 2009 I wrote about this subject on my blog, in the post Are C# .Net Regular Expressions Fast Enough for You?
I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:
It's not known ahead of time whether a document contains HTML at all.
More likely than not, any HTML will be very poorly formatted.
Individual documents might be very large, perhaps hundreds of megabytes.
Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of <.+/?> are a no go. (And stripping XML is less desirable, anyway.)
I'm currently using HTML Agility Pack, and it's just not cutting the mustard. Performance is poorer than I'd like, it doesn't always handle truly awful formatting as gracefully as it could, and lately I've been running into problems with stack overflows on some of the more upsettingly large files.
I suspect that all of these problems stem from the fact that it's trying to actually parse the data, which makes it a poor fit for my needs. I don't want a syntax tree; I just want (most of) the tags to go away.
Using regular expressions seems like the obvious candidate. But then I remember this famous answer and it makes me worry that's not such a great idea. But that diatribe's points are very focused on parsing, and not necessarily dumb tag-stripping. So are regex OK for this purpose?
Assuming it isn't a terrible idea, suggestions for regex that would do a good job are very welcome.
This regex finds all tags avoiding angle brackets inside quotes in tags.
<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>
It can't detect escaped quotes inside quotes (but I think that is unnecessary in HTML).
Taking the list of all allowed tags and putting it in the first part of the regex, like <(tag1|tag2|...), could lead to a more precise solution. I'm afraid an exact solution can't be found given your assumption about stray angle brackets; think, for example, of something like b<a ...
EDIT:
Updated the regex (it performs a lot better than the previous one). Moreover, if you need to strip out code, I suggest doing a little cleaning before the first pass, something like replacing <script.+?</script> with nothing.
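A rough sketch of the tag-list idea combined with the script pre-cleaning step, in C#. It does not use the exact regex from this answer: the tag pattern is a simplified variant of my own, and the tag list, helper name and sample input are all made up for illustration.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strips only tags whose names are on a known list.
static string StripTags(string text)
{
    // Illustrative list; a real one would name every tag you want removed.
    string[] tags = { "a", "b", "br", "div", "i", "p", "span" };

    // Pre-clean: drop script blocks wholesale so their code is not left behind.
    string cleaned = Regex.Replace(text, @"<script\b.*?</script>", "",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // A known tag name, then any mix of quoted strings and other characters, then '>'.
    // Quoted attribute values may contain '>' without ending the tag.
    string tagPattern = "</?(?:" + string.Join("|", tags) +
        @")\b(?:""[^""]*""|'[^']*'|[^<>""'])*>";

    return Regex.Replace(cleaned, tagPattern, "", RegexOptions.IgnoreCase);
}

Console.WriteLine(StripTags("a < b <b>bold</b> <a href=\"x>y\">link</a>"));
```

Because a stray < must be followed directly by a listed tag name to count, text such as "a < b" is left alone, while something like b<a ... can still be misread as a tag, which is the limitation noted above.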
I'm just thinking outside the box here, but you may consider leveraging something like Microsoft Word, or maybe OpenOffice.
I've used Word automation to translate HTML to DOC, RTF, or TXT. The HTML to TXT conversion native to Word would give you exactly what you want, stripping all of the HTML tags and converting it to text format. Of course this wouldn't be efficient at all if you're processing tons of tiny HTML files since there's some overhead in all of this. But if you're dealing with massive files this may not be a bad choice as I'm sure Word has plenty of optimizations around these conversions. You could test this theory by manually opening one of your largest HTML files in Word and resaving it as a TXT file and see how long Word takes to save.
And although I haven't tried it, I bet it's possible to programmatically interact with OpenOffice to accomplish something similar.
I have a quite long regex; sometimes it responds fast and sometimes it takes crazy long to run.
Here is my regex:
<div class=""rwResult bg"">.*?mp3/d/[^>]+>(?<Name>[^<]+)</a>.*?artist:[^>]+>(?<Artist>[^<]+).*?user</span>[^>]+[^""]+""(?<Uploader>[^""]+).*?category:.*?"">.*?"">(?<Category>[^<]+).*?time: (?<Duration>[^ ]+) \| (?<StreamSize>[0-9]+) (?<Weight>[^ ]+) \| listened: (?<Clicks>[0-9]+).*?<a href=""(?<DownloadLink>http://dl[^""]+)
Rather than use a lot of regexes, one for each group, I prefer to do it with a single regex.
Is there any function I could use to check for, or avoid, the long load while the regular expression is executing?
I'm working in C# or F#. I hope someone can answer this problem.
Thanks.
It looks like you are trying to parse an XML document using a regular expression. This is not really an optimal approach. My guess is that you are seeing problems because of the use of backtracking in your regular expression.
You could try to rewrite your regular expression, but XML is not a regular language and thus is not parsable by regular expressions.
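If you stay with a regex for now, one way to keep a runaway match from hanging the program is a match timeout: since .NET Framework 4.5 the Regex constructor accepts a TimeSpan, and a match that exceeds it throws RegexMatchTimeoutException. A minimal sketch follows; the helper name TryMatch, the two-second budget and the sample input are arbitrary, and the pattern is just a fragment of the one in the question.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns false if the engine ran out of time.
static bool TryMatch(string input, string pattern, TimeSpan timeout, out Match match)
{
    var regex = new Regex(pattern, RegexOptions.Singleline, timeout);
    try
    {
        match = regex.Match(input);
        return true;
    }
    catch (RegexMatchTimeoutException)
    {
        // Excessive backtracking: give up instead of blocking the caller.
        match = Match.Empty;
        return false;
    }
}

string sampleHtml = "<span>time: 3:41 | 5 MB | listened: 120</span>";
string samplePattern =
    @"time: (?<Duration>[^ ]+) \| (?<StreamSize>[0-9]+) (?<Weight>[^ ]+) \| listened: (?<Clicks>[0-9]+)";

if (TryMatch(sampleHtml, samplePattern, TimeSpan.FromSeconds(2), out Match found) && found.Success)
{
    Console.WriteLine(found.Groups["Duration"].Value + " / " + found.Groups["Clicks"].Value);
}
else
{
    Console.WriteLine("No match, or the pattern timed out.");
}
```

A timeout only limits the damage; it does not make the pattern any faster, so the advice about backtracking still applies.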
Take a look at the document How to read XML from a file by using Visual C# to get started.
Sidenote: For an entertaining read on what happens when trying to parse a non regular language using regular expression see this Stack Overflow question.
I think you're using the wrong tool. You really want XPath, and possibly XSLT. The only time you want to use a regex to parse raw XML is when the XML is suspected to be syntactically broken in predictable ways.
Seriously, look at XPath - it's magic for delving into the structure of XML documents and pulling out the bits you want.
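As a small taste of what that looks like in C#, here is a sketch using XmlDocument from System.Xml. The element layout is a guess modelled on the class name in the question's regex, and the helper name SelectTrackNames is made up; it only works if the input is well-formed XML.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper: returns the text of each link inside a result block.
static List<string> SelectTrackNames(string xml)
{
    var doc = new XmlDocument();
    doc.LoadXml(xml); // throws XmlException if the markup is not well-formed

    var names = new List<string>();
    // Every <a> directly inside a <div class="rwResult bg">.
    foreach (XmlNode node in doc.SelectNodes("//div[@class='rwResult bg']/a"))
    {
        names.Add(node.InnerText);
    }
    return names;
}

string sampleXml =
    "<results>" +
    "<div class=\"rwResult bg\"><a href=\"/mp3/d/1\">First song</a></div>" +
    "<div class=\"rwResult bg\"><a href=\"/mp3/d/2\">Second song</a></div>" +
    "</results>";

foreach (string name in SelectTrackNames(sampleXml))
{
    Console.WriteLine(name);
}
```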
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
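HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // html is the page source you already hold in a string
var anchors = yourDoc.DocumentNode.SelectNodes("//a[@href]"); // null when nothing matches
int someCount = anchors == null ? 0 : anchors.Count;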
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.
For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack (http://www.codeplex.com/htmlagilitypack). It lets you write XPath queries against the nodes you want and get them returned in a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java and SAX - based tool, developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for every one of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())
Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here on RegExLib should get you started
You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible that you'll work with a document whose content describes html features (i.e., href= could also appear in the document text and not belong to an anchor tag)?
What else can you tell us about the document?
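To make those questions concrete, here is one possible sketch that assumes the most permissive answers: href in any letter case, double quotes, single quotes or no quotes, with '#' anchors and javascript: pseudo-links thrown away. The helper name ExtractLinks, the base address and the sample markup are all made up, and it does nothing about href= text that sits outside an anchor tag.

```csharp
// Written as top-level statements (C# 9 or later).
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns absolute URLs for the usable href values in a page.
static List<string> ExtractLinks(string html, Uri baseUri)
{
    // IgnoreCase covers HREF/Href; the three alternatives cover double-quoted,
    // single-quoted and unquoted values.
    var href = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase);

    var links = new List<string>();
    foreach (Match m in href.Matches(html))
    {
        string value = m.Groups["url"].Value.Trim();

        // Skip empty values, in-page anchors and script pseudo-links.
        if (value.Length == 0 ||
            value.StartsWith("#", StringComparison.Ordinal) ||
            value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        // Resolve relative URLs against the page address; drop anything unparsable.
        if (Uri.TryCreate(baseUri, value, out Uri absolute))
        {
            links.Add(absolute.AbsoluteUri);
        }
    }
    return links;
}

var pageAddress = new Uri("http://example.com/dir/page.html");
string sampleMarkup =
    "<A HREF=\"/a\">1</A><a href='b.html'>2</a><a href=#top>3</a><a href=\"javascript:void(0)\">4</a>";

foreach (string link in ExtractLinks(sampleMarkup, pageAddress))
{
    Console.WriteLine(link);
}
```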
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser
I hope this is a programmer-related question. I program in C# as a hobby. For my own purposes I need to parse HTML files, and the best idea seems to be... regular expressions. As many have found out, they are quite time-consuming to learn, so I'd like to know whether there is an application that can take some input (a piece of any code), understand what I need (by letting me select the piece of the code I want to "cut out"), and give me the proper regular expression for it, or several options.
From what I've heard, regex is a little science in itself, so it might not be as easy as I imagine.
Yes, there is. Roy Osherove wrote exactly what you're looking for: Regulazy.
Not a real answer to your question, as it has nothing to do with regex, but HtmlAgilityPack may help you with your parsing.
You might also want to try txt2re: http://txt2re.com/, which tries to identify patterns in a user-supplied string and lets you build a regex out of them.
I gotta agree with Sunny on this one: if you're parsing HTML, you're better off converting it to XML (using the HTML Agility Pack it's trivially easy), and then you can use XPath expressions rather than regular expressions. It's far better suited to the job.