How do I parse HTML using regular expressions in C#? - c#

How do I parse HTML using regular expressions in C#?
For example, given HTML code
<s2> t1 </s2> <img src='1.gif' /> <span> span1 <span/>
I am trying to obtain
1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>
How do I do this using regular expressions in C#?
In my case, the HTML input is not well-formed XML like XHTML. Therefore I can not use XML parsers to do this.

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.

I used this regex in C#, and it works. Thanks for all your answers.
<([^<]*)>|([^<]*)
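For readers who want to try the pattern above, here is a minimal sketch of how it might be applied to the sample input from the question. The class and method names (RegexHtmlTokenizer, Tokenize) are made up for illustration, and trimming and skipping empty matches is an assumption about the desired output, since the second alternative can also match empty or whitespace-only runs. A literal > inside text will be mistaken for the end of a tag, so this only suits simple input.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexHtmlTokenizer
{
    // Pattern from the answer above: either a <...> tag or a run of text without '<'.
    private static readonly Regex TokenPattern = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(html))
        {
            // Assumption: whitespace-only and empty matches are not wanted.
            string token = match.Value.Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
        }
        return tokens;
    }
}
```

Calling RegexHtmlTokenizer.Tokenize on the question's sample string should give the seven pieces listed in the question, in order.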

You might want to simply use string functions, treating < and > as the delimiters for parsing.
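A rough sketch of the string-function approach, assuming that every < starts a tag and the next > ends it. The names StringHtmlSplitter and Split are hypothetical, and skipping whitespace-only pieces is an assumption about the desired output. Like the regex above, it does not understand comments, scripts, or attribute values that contain >.

```csharp
using System.Collections.Generic;

public static class StringHtmlSplitter
{
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int end;
            if (html[pos] == '<')
            {
                // A tag runs up to and including the next '>' (or to the end if none).
                int close = html.IndexOf('>', pos);
                end = close < 0 ? html.Length : close + 1;
            }
            else
            {
                // Text runs up to, but not including, the next '<'.
                int open = html.IndexOf('<', pos);
                end = open < 0 ? html.Length : open;
            }

            string token = html.Substring(pos, end - pos).Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
            pos = end;
        }
        return tokens;
    }
}
```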

Related

Linq parse html string

I want to parse an HTML page and get a specific value from it. How can I do this using LINQ or string parsing in C#?
------------- MORE HTML ----------
<span class="date">
04.09.2012
</span>
<table cellspacing="0"><tr><th scope="row">1 EUR</th><td><span>**4,4907**</span></td><td><span class="rise">+0,0009</span></td><td><span class="rise">+0,02%</span></td></tr><tr><th scope="row">1 USD</th><td><span>3,5635</span></td><td><span class="fall">-0,0093</span></td><td><span class="fall">-0,26%</span></td></tr></table>
------------- MORE HTML ----------
I am interested in getting the value 4,4907 in bold!
Any idea how to achieve this?
Thanks!
If you only need that bit, use a regular expression. (But don't use a regular expression to parse more complex HTML.)
<td><span>4,4907</span></td>
would be matched most conveniently by the regular expression
<td><span>([0-9,]+)</span></td>
And see for example this quickly Googled page on how to use regexps with C#.
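As a sketch of what that might look like in C# (the names RateExtractor and FirstPlainCellValue are invented for this example, and returning an empty string when nothing matches is an arbitrary choice):

```csharp
using System.Text.RegularExpressions;

public static class RateExtractor
{
    // Returns the first value found in a <td><span>...</span></td> cell whose
    // span has no attributes, or an empty string if there is no such cell.
    public static string FirstPlainCellValue(string html)
    {
        Match match = Regex.Match(html, @"<td><span>([0-9,]+)</span></td>");
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Note that the pattern matches every plain cell in the table (the USD row as well), so Regex.Match only gives the first one; Regex.Matches would be needed to walk through all of them.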
Be careful when trying to parse HTML.
I think the obvious way would be to load it into an XDocument (as XML), but as HTML is often ambiguous or contains syntax errors, this is bound to fail.
People here on Stack Overflow have instead suggested using http://htmlagilitypack.codeplex.com/, which is said to do a great job parsing HTML. You can then use XPath to query your document for various contents.
You can try a regular expression in C# this way:
http://www.c-sharpcorner.com/UploadFile/prasad_1/RegExpPSD12062005021717AM/RegExpPSD.aspx
To find the string between "<span>" and "</span>".
Or you can use an HTML parser like "jericho" and navigate through HTML tags to reach your value.

C#: How to write Regex.Replace to replace the value of an XML element?

I have an XML string, and the goal is to replace the value of an XML element with a fixed string, i.e. to replace "blah blah blah" with a fixed value. I am thinking of using Regex.Replace instead of loading the string into a DOM model and replacing it there.
Could anyone please help with how to write this regular expression? Essentially, the goal is to match everything inside the element tag 'abc'.
Thanks a lot!
This article tells you what you need to know: XML is not Regular
Ignoring the most obvious solution to their problem (which would be to use a pre-existing XML parser), they think they should use regular expressions (regex for short). Now they have two problems.
Use regular expressions only on regular languages.
That said, there are many sites that purport to offer guidance on writing regular expressions for XML. They are all wrong. But they exist, and you can use them at your own risk.
For what it's worth, don't.
Process the XML normally, with XmlDocument, LINQ to XML (System.Xml.Linq), or XmlReader/XmlWriter. That is what they are for; they cover all kinds of edge cases we couldn't even imagine and, above all, are proven to work.
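A minimal sketch of the parser-based route using LINQ to XML, assuming the input is well-formed and the element has no namespace. The names XmlValueReplacer and ReplaceElementValue are hypothetical:

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlValueReplacer
{
    // Sets the content of every element called elementName to newValue
    // and returns the resulting XML without added indentation.
    public static string ReplaceElementValue(string xml, string elementName, string newValue)
    {
        XDocument doc = XDocument.Parse(xml);

        // ToList() takes a snapshot so the tree can be modified while looping.
        foreach (XElement element in doc.Descendants(elementName).ToList())
        {
            // Setting Value replaces all existing child content with plain text.
            element.Value = newValue;
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, ReplaceElementValue("<root><abc>blah blah blah</abc></root>", "abc", "fixed") should return <root><abc>fixed</abc></root>. XDocument.ToString does not write the XML declaration, so use Save if that matters.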
Don't use a regex for this, please . . . just don't.
My two cents.
let the downvoting begin
Regular expressions are meant to be used on regular languages. XML is a non-regular language. As such, regular expressions cannot be used to properly parse anything written in it. You will need to use a real XML parser, which can be found in the numerous libraries available in C#, to do it.
Regular expressions are not suitable for processing markup. Among other flaws, they won't work if elements can be nested:
<abc> ... <abc> ... </abc> ... </abc>
They are also unable to distinguish a comment from a non-comment.
You need a real XML parser.
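To make the nesting problem concrete, here is a small sketch using a typical lazy pattern, <abc>(.*?)</abc>. The pattern and the names NestedElementDemo and FirstAbcMatch are chosen for illustration and do not come from the question:

```csharp
using System.Text.RegularExpressions;

public static class NestedElementDemo
{
    // Returns whatever the lazy pattern considers the first <abc> element.
    public static string FirstAbcMatch(string xml)
    {
        return Regex.Match(xml, @"<abc>(.*?)</abc>").Value;
    }
}
```

Given "<abc> outer <abc> inner </abc> tail </abc>", this returns "<abc> outer <abc> inner </abc>": the match stops at the first closing tag, which belongs to the inner element. A greedy .* would fix this input but would then run two sibling elements together.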

How to extract string between 2 markers using Regex in .NET?

I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.
I've tried the following with no success:
var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.
What am I missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;
Regex is not meant for this kind of HTML handling, as many here would say. Without your sample web page or HTML, I can only suggest removing the non-greedy ? quantifier in (.*?) and trying again. After all, an HTML page will have only one head and one body.
Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:
Un-escape the angle brackets: with the @ before your string, the backslashes go through to the regex, and angle brackets do not need to be escaped in a .NET regex.
With your regex, you need to make sure that the head/body tag combinations do not have any white-space between them.
With your regex, the body tag cannot have any attributes.
I would suggest something more like:
(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)
this seems to work for me on the source of this page!
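A sketch of how the suggested pattern might be used from C#. The options are an addition not mentioned in the answer: RegexOptions.Singleline lets . match newlines, which a body spanning several lines needs, and RegexOptions.IgnoreCase tolerates upper-case tags. The names BodyExtractor and ExtractBody are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Pattern from the answer above; the options are this sketch's own assumption.
    private static readonly Regex BodyPattern = new Regex(
        @"(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between the body tags, or an empty string if nothing matches.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        // The lookbehind and lookahead are not part of the match,
        // so match.Value is just the body content.
        return match.Success ? match.Value : string.Empty;
    }
}
```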
As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.
First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.
As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?

Regular Expression to find hidden fields in html

I am looking for a regular expression to find all input fields of type hidden in html output. Anyone know an expression to do such?
I agree that the link Radomir suggested is correct that HTML should not be parsed with regular expressions. However, I do not agree that nothing meaningful can be gleaned from their use together. And the ensuing rant is totally counter-productive.
To correct Robert's RegEx:
<([^<]*)type=('|")hidden('|")>[^<]*(/>|</.+?>)
I know you asked for regular expression, but download Html Agility Pack and do the following:
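var inputs = htmlDoc.DocumentNode.Descendants("input");
foreach (var input in inputs)
{
    // GetAttributeValue returns the fallback ("") when the attribute is missing,
    // so inputs without a type attribute do not throw.
    if (input.GetAttributeValue("type", "") == "hidden")
    {
        // do something
    }
}
You can also use XPath with Html Agility Pack.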
Regular expressions are generally the wrong tool for the job when trying to search or manipulate HTML or XML; a parsing library would likely be a much cleaner and easier solution.
That said, if you're just looking through a big file and accuracy isn't critical, you can probably do reasonably well with something like <input[^>]*type="?hidden"?.
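A sketch of that quick-and-dirty search in C#. The trailing [^>]*> is added here so that each match covers the whole tag, and RegexOptions.IgnoreCase is likewise an assumption. The names HiddenInputFinder and Find are invented for this example:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HiddenInputFinder
{
    // The answer's pattern, extended with [^>]*> to capture the rest of the tag.
    private static readonly Regex HiddenInput = new Regex(
        @"<input[^>]*type=""?hidden""?[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the text of every <input> tag that looks like type=hidden.
    public static List<string> Find(string html)
    {
        return HiddenInput.Matches(html)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

As written, it misses single-quoted attribute values (type='hidden') and will also report inputs that appear inside comments or scripts.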

RegEx matching HTML tags and extracting text

I have a string of text like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.
I wouldn't use regex for this either, but if you must, this expression should work:
<customtag>(.+?)</customtag>
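Since the question mentions MatchEvaluator, here is a sketch of how the expression might be plugged into Regex.Replace with a lambda as the evaluator. The names CustomTagEditor and AppendToCustomTag are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class CustomTagEditor
{
    // Appends suffix to the text inside every <customtag>...</customtag> pair.
    public static string AppendToCustomTag(string input, string suffix)
    {
        return Regex.Replace(
            input,
            @"<customtag>(.+?)</customtag>",
            // The lambda is converted to a MatchEvaluator; Groups[1] is the inner text.
            match => "<customtag>" + match.Groups[1].Value + suffix + "</customtag>");
    }
}
```

AppendToCustomTag("<customtag>hey</customtag>", ", this is changed!") should produce the string from the question. Because . does not match newlines by default, content spanning several lines would need RegexOptions.Singleline.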
I'd chew my own leg off before using a regular expression to parse and alter HTML.
Use XSL or DOM.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags, attributes, or content.
Here are a couple of articles on how to use XSL with C#:
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
Most people use HTML Agility Pack for HTML text parsing. However, I find it a little heavyweight and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it. (see example below)
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
// Remove all HTML tags
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re, "");
// Remove all non-breaking spaces (&nbsp;)
var x3 = x2.replace(/\u00a0/g, '');
