Linq parse html string - c#

I want to parse an HTML page and get a specific value from it. How can I do this using LINQ or string parsing in C#?
------------- MORE HTML ----------
<span class="date">
04.09.2012
</span>
<table cellspacing="0"><tr><th scope="row">1 EUR</th><td><span>**4,4907**</span></td><td><span class="rise">+0,0009</span></td><td><span class="rise">+0,02%</span></td></tr><tr><th scope="row">1 USD</th><td><span>3,5635</span></td><td><span class="fall">-0,0093</span></td><td><span class="fall">-0,26%</span></td></tr></table>
------------- MORE HTML ----------
I am interested in getting the value 4,4907 in bold!
Any idea how to achieve this?
Thanks!

If you only need that bit, use a regular expression. (But don't use a regular expression to parse more complex HTML.)
<td><span>4,4907</span></td>
would be matched most conveniently by the regular expression
<td><span>([0-9,]+)</span></td>
And see for example this quickly Googled page on how to use regexps with C#.
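A minimal C# sketch of the regex approach above, using the `Regex` class from `System.Text.RegularExpressions`. The HTML fragment is copied from the question; the variable names and the decimal-comma parsing step are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Fragment copied from the question; in practice this would be the downloaded page.
string html =
    "<table cellspacing=\"0\"><tr><th scope=\"row\">1 EUR</th>" +
    "<td><span>4,4907</span></td><td><span class=\"rise\">+0,0009</span></td>" +
    "<td><span class=\"rise\">+0,02%</span></td></tr></table>";

// The pattern from the answer: digits and commas inside a bare <span> inside a <td>.
// Regex.Match returns the first hit, which is the EUR rate in this fragment.
Match match = Regex.Match(html, @"<td><span>([0-9,]+)</span></td>");
string eurText = match.Success ? match.Groups[1].Value : "";

// The page uses a decimal comma, so parse with an explicit number format
// instead of relying on the machine's current culture.
var commaFormat = new NumberFormatInfo { NumberDecimalSeparator = ",", NumberGroupSeparator = "." };
bool parsed = decimal.TryParse(eurText, NumberStyles.AllowDecimalPoint, commaFormat, out decimal eurRate);

Console.WriteLine(eurText);
Console.WriteLine(parsed ? eurRate.ToString(CultureInfo.InvariantCulture) : "no rate found");
```

The other cells do not match because their spans carry a class attribute and a sign character, which the pattern does not allow.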

Be careful when trying to parse HTML.
I think the obvious way would be to load it into an XDocument (as XML), but since HTML is often ambiguous or contains syntax errors, this is bound to fail on real pages.
People here on Stack Overflow have instead suggested http://htmlagilitypack.codeplex.com/, which is said to do a great job of parsing HTML. You can then use XPath to query the document for the content you need.
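Since the question asks about LINQ: when the piece you care about happens to be well-formed XML, as the table in the question is, LINQ to XML is enough. This is a sketch only, assuming the table has already been cut out of the page; on a full real-world page `XElement.Parse` will often throw, which is exactly the caveat above. Variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// The table from the question, which is well-formed XML on its own.
string tableHtml =
    "<table cellspacing=\"0\">" +
    "<tr><th scope=\"row\">1 EUR</th><td><span>4,4907</span></td>" +
    "<td><span class=\"rise\">+0,0009</span></td><td><span class=\"rise\">+0,02%</span></td></tr>" +
    "<tr><th scope=\"row\">1 USD</th><td><span>3,5635</span></td>" +
    "<td><span class=\"fall\">-0,0093</span></td><td><span class=\"fall\">-0,26%</span></td></tr>" +
    "</table>";

// Throws XmlException if the markup is not well-formed.
XElement table = XElement.Parse(tableHtml);

// Find the row whose header cell reads "1 EUR" and take the span in its first data cell.
string eurRate = table.Elements("tr")
    .Where(tr => (string)tr.Element("th") == "1 EUR")
    .Select(tr => (string)tr.Elements("td").First().Element("span"))
    .FirstOrDefault();

Console.WriteLine(eurRate);
```

Selecting the row by its header text, rather than by position, keeps the query working if the rows are reordered.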

You can try a regular expression in C# this way:
http://www.c-sharpcorner.com/UploadFile/prasad_1/RegExpPSD12062005021717AM/RegExpPSD.aspx
to find the string between "<span>**" and "**</span>".
Or you can use an HTML parser like "jericho" and navigate through the HTML tags to reach your value.

Related

Regex against markup after XPath?

I have been searching for a solution to my problem for a while and have been playing around on regex101.com, but I cannot find one.
The problem is that I have to select a substring from a number of different inputs, so I wanted to use regular expressions to get the wanted data out of these strings.
The regular expression will come from a configuration, separately for each string (since they differ).
The string below is obtained with the XPath //body/div/table/tbody/tr/td/p[5], but I cannot dig any deeper with it to retrieve the right data. Or can I?
The string I am using at the moment as example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data"
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeat group
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeat group.
I know I could build a for { } loop around this regex to get the same output, but since this is the only regular expression that needs it (and a loop would mean changing the handling for all the other data), I was wondering whether it is possible to do this in a single regular expression.
Thank you for the support already so far.
Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
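For readers doing this from C#, here is a sketch of the same XPath step with `System.Xml`. It rests on one loud assumption: the markup has already been normalized to well-formed XML (`<br/>` instead of `<br>`), because `XmlDocument` rejects the fragment as posted. The `<p>` root stands in for the node selected by the original XPath, and the attributes on the `<a>` element are trimmed for brevity.

```csharp
using System;
using System.Xml;

// Assumption: the fragment has been normalized to well-formed XML (<br/> instead of <br>).
string xml =
    "<p><strong>Kontaktdaten des Absenders:</strong><br/>" +
    "<strong>Name:</strong> Wanted data<br/>" +
    "<strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX'>XXXXXXXXX</a><br/></p>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// Evaluate the relative XPath from the answer with the <p> element as context node.
XmlNode nameText = doc.DocumentElement.SelectSingleNode(
    "strong[.='Name:']/following-sibling::text()[1]");

// The text node still carries the space that follows </strong>, so trim it.
string name = nameText != null ? nameText.Value.Trim() : "";

Console.WriteLine(name);
```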
You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
Demo
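A short C# check of that pattern against the flattened string from the question. The .NET regex engine supports both lookbehind and lookahead, so the expression can be used as written (the escaped `\/` is simply `/` here). Variable names are illustrative and the `<a>` attributes are shortened.

```csharp
using System;
using System.Text.RegularExpressions;

// The flattened string from the question, with the <a> attributes shortened.
string input =
    "<strong>Kontaktdaten des Absenders:</strong> <br> " +
    "<strong>Name:</strong> Wanted data <br> " +
    "<strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>";

// [^<>]* cannot cross a tag boundary, so the match stops before the next <br>
// instead of running on to the last one as (.*) did.
Match match = Regex.Match(input, @"(?<=</strong> )([^<>]*)(?= <br>)");
string wanted = match.Success ? match.Groups[1].Value : "";

Console.WriteLine(wanted);
```

The first `</strong> ` is followed directly by `<br>`, so no match is possible there and the engine moves on to the text after `Name:`.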

Handling html special character while Parsing html string using c#

I am using HtmlAgilityPack to parse an HTML string and convert certain patterns to links.
Given an HTML string and a pattern "mystring", I have to replace each occurrence of this pattern in the HTML string with <a href="/mystring.html">mystring</a>. But there are two exceptions:
1. I should not replace the pattern if it is already within an anchor tag, meaning neither its immediate parent nor any ancestor may be an anchor tag. For example: <a href="google.com"><span>mystring</span></a>
2. It should not be inside an href. For example: <a href="mystring">.
Input string: "<li><span>mystring test</span></li><li><a href='#'><span>mystring</span></a></li>"
Expected output: "<li><span><a href="/mystring.html">mystring</a> test</span></li><li><a href='#'><span>mystring</span></a></li>"
I am using HtmlAgilityPack: I load this string as an HTML document, get all text nodes, check that none of a node's ancestors is an anchor, and do the replacement. Everything worked fine, but there is a problem.
If my input string is something like "<li><span>mystring test < 10 and 5</span></li>", HtmlAgilityPack treats the less-than symbol as an HTML special character, takes "< 10 and 5" for an HTML tag, and produces something like this:
< 10="" and="" 5=""> (attributes with empty values).
Is there a workaround for this using HtmlAgilityPack?
Should I take a step back and use regex? In that case, how do I handle the anchor-at-any-level exception?
Is there a better approach for this problem?
Using < outside an HTML tag is invalid. Use the &lt; entity instead.
EDIT: If you don't have control over the input string, you may try replacing "< ":
inputhtml = inputhtml.Replace("< ", "&lt; ");
If there are any other errors, you can try importing the MSHTML COM DLL: add a reference to the COM library "Microsoft HTML Object Library".
Two suggestions:
1. You could pre-clean the broken HTML so HtmlAgilityPack works better. This is possibly easier.
2. Or parse and track the nested structure of tags yourself, via a simple regex-based parser. But many HTML tags, such as <TR>, <TD>, <P> and <BR>, do not have to be explicitly closed, and you'll have to deal with the broken < angle brackets here too.
Option 2 is not hard, but it will be more work up front, for a payoff in improved reliability and control over how you handle "malformed" inputs from a low-quality source.
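The pre-cleaning suggestion could be sketched as a single regex replacement that runs before the string is handed to the HTML parser. The rule used here is my own assumption, not something either answer specifies: escape every `<` that is not followed by a letter, `/`, `!` or `?`, since such a `<` cannot start a tag, comment or declaration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<li><span>mystring test < 10 and 5</span></li>";

// Escape every '<' that cannot be the start of a tag, closing tag,
// comment or declaration. Real tags such as <li> and </span> are untouched.
string cleaned = Regex.Replace(input, @"<(?![A-Za-z/!?])", "&lt;");

Console.WriteLine(cleaned);
```

This is a heuristic: text such as `a<b` (no space, letter after the bracket) still looks like a tag and is left alone, so it reduces the problem rather than eliminating it.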

How to find all tags from string using RegEx using C#.net?

I want to find all HTML tags in an input string and remove them or replace them with some text.
Suppose that I have this string:
INPUT=>
<img align="right" src="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg" /><p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, Il Giardino Ristorante in Newport Beach.</p>
OUTPUT=>
string strSrc="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg";
<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, http://www.tenrestaurantgroup.com in Newport Beach.</p>
From the above string:
if an <IMG> tag is found, I want to get the SRC of the tag;
if an <A> tag is found, I want to get the HREF from the tag;
and all other tags should stay as they are.
How can I achieve this using Regex in C#.NET?
You really, really shouldn't use regex for this. In fact, parsing HTML cannot be done perfectly with regex. Have you considered using an XML parser or HTML DOM library?
You can use HtmlAgilityPack for parsing (valid or invalid) HTML and get what you want.
I agree with Justin: regex really isn't the best way to do this, and the HTML Agility Pack is well worth a look if this is something you will need to be doing a lot of.
That said, the expression below stores attributes in a group, from which you should be able to pull them into your text while ignoring the rest of the element:
</?([^ >]+)( [^=]+?="(.+?)")*>
Hope this helps.
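For the specific transformation in the question, a narrower sketch may be easier to follow. The two patterns below are my own assumptions rather than the expression above, which expects `>` directly after the last quoted attribute and so does not match a tag ending in ` />`. The `<a>` element in the sample input is also an assumption: the posted input shows no anchor, but the expected output implies one. All the caveats about regex and HTML still apply.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input modeled on the question; the <a> element is assumed.
string input =
    "<img align=\"right\" src=\"http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg\" />" +
    "<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, " +
    "<a href=\"http://www.tenrestaurantgroup.com\">Il Giardino Ristorante</a> in Newport Beach.</p>";

// Double-quoted src of an <img> tag, and double-quoted href of an <a>...</a> element.
string imgPattern = @"<img\b[^>]*?\bsrc\s*=\s*""([^""]*)""[^>]*>";
string linkPattern = @"<a\b[^>]*?\bhref\s*=\s*""([^""]*)""[^>]*>.*?</a>";

// 1. Pull the SRC out of the first <img> tag, then drop the tag itself.
string strSrc = Regex.Match(input, imgPattern, RegexOptions.IgnoreCase).Groups[1].Value;
string text = Regex.Replace(input, imgPattern, "", RegexOptions.IgnoreCase);

// 2. Replace each <a> element with its HREF; all other tags stay as they are.
text = Regex.Replace(text, linkPattern, "$1", RegexOptions.IgnoreCase | RegexOptions.Singleline);

Console.WriteLine(strSrc);
Console.WriteLine(text);
```

Single-quoted or unquoted attribute values are not handled here, which is one more reason to prefer HtmlAgilityPack for anything beyond a known, fixed input shape.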

How do I parse HTML using regular expressions in C#?

How do I parse HTML using regular expressions in C#?
For example, given HTML code
<s2> t1 </s2> <img src='1.gif' /> <span> span1 <span/>
I am trying to obtain
1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>
How do I do this using regular expressions in C#?
In my case, the HTML input is not well-formed XML like XHTML. Therefore I can not use XML parsers to do this.
Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.
You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.
I used this regex in C#, and it works. Thanks for all your answers.
<([^<]*)>|([^<]*)
You might want to simply use string functions, with < and > as your delimiters for parsing.
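A sketch of how that expression can be driven from C# to produce the numbered list the question asks for. The filtering step is my addition: the second alternative also matches the whitespace between tags and an empty string at the end of the input, so those matches are trimmed and skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string html = "<s2> t1 </s2> <img src='1.gif' /> <span> span1 <span/>";

var tokens = new List<string>();
foreach (Match m in Regex.Matches(html, @"<([^<]*)>|([^<]*)"))
{
    // Each match is either a whole tag or the text up to the next '<'.
    string token = m.Value.Trim();
    if (token.Length > 0)
    {
        tokens.Add(token);
    }
}

for (int i = 0; i < tokens.Count; i++)
{
    Console.WriteLine($"{i + 1}. {tokens[i]}");
}
```

This splits the input into tags and text runs but does not understand nesting, and a stray `>` inside text would be swallowed into the preceding tag, so it is a tokenizer for simple input rather than a parser.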

Get URL from HTML code using a regular expression

Consider:
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
What is the regular expression to get http://anirudhagupta.blogspot.com/
from the following?
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
A suggestion in C# would be good; I would also be happy to do this with jQuery.
If you want to use jQuery you can do the following.
$('a').attr('href')
Quick and dirty:
href="(.*?)"
Ok, let's go with another regex for parsing URLs. This comes from RFC 2396 - URI Generic Syntax: Parsing a URI Reference with a Regular Expression
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Of course, your HTML may contain relative URLs, which need to be handled differently; for those I recommend the C# Uri constructor Uri(Uri, String).
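A brief sketch of that constructor resolving href values against the address of the page they came from. The page path and the relative hrefs are hypothetical; only the two host names appear elsewhere in this thread.

```csharp
using System;

// Hypothetical address of the page the links were extracted from.
var pageUri = new Uri("http://anirudhagupta.blogspot.com/2009/05/post.html");

// Uri(Uri, String) resolves the second argument against the first.
var relative = new Uri(pageUri, "../archive.html");           // relative to the page's folder
var rooted = new Uri(pageUri, "/about.html");                 // relative to the site root
var absolute = new Uri(pageUri, "http://mike.blogspot.com/"); // already absolute: base is ignored

Console.WriteLine(relative.AbsoluteUri);
Console.WriteLine(rooted.AbsoluteUri);
Console.WriteLine(absolute.AbsoluteUri);
```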
The simplest way to do this is using the following regular expression.
/href="([^"]+)"/
This will get all characters from the first quote until it finds a character that is a quote. This is, in most languages, the fastest way to get a quoted string, that can't itself contain quotes. Quotes should be encoded when used in attributes.
UPDATE: A complete Perl program for parsing URLs would look like this:
use 5.010;
my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;
    push @matches, m/href='([^']+)'/gi;
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;
}
say for @matches;
It reads from stdin and prints all URLs. It takes care of the three possible quotes. Use it with curl to find all the URLs in a webpage:
curl url | perl urls.pl
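Since the question asked for C#, here is a rough C# counterpart of the Perl program. It is a sketch, not a port: the three quoting styles are folded into one pattern of my own, and the sample links use placeholder `.example` hosts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Placeholder input covering double-quoted, single-quoted and unquoted hrefs.
string html =
    "<a href=\"http://a.example/\">A</a> " +
    "<a href='http://b.example/'>B</a> " +
    "<a href=http://c.example/>C</a>";

// One alternative per quoting style; exactly one of groups 1..3 succeeds per match.
var hrefPattern = new Regex(
    @"href\s*=\s*(?:""([^""]+)""|'([^']+)'|([^""'\s>][^\s>]*))",
    RegexOptions.IgnoreCase);

var urls = new List<string>();
foreach (Match m in hrefPattern.Matches(html))
{
    for (int g = 1; g <= 3; g++)
    {
        if (m.Groups[g].Success)
        {
            urls.Add(m.Groups[g].Value);
        }
    }
}

foreach (string url in urls)
{
    Console.WriteLine(url);
}
```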
The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to worry about parsing at all.
You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.
data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
for item in data.split("</a>"):
    if "<a href" in item:
        start_of_href = item.index("<a href")  # position where <a href=" begins
        # print the substring that starts right after <a href="
        print(item[start_of_href + len('<a href="'):])
The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each resulting piece, check for "href", then take the substring after "href". Those will be your links.
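One possible C# adaptation of the split idea, as a sketch. It differs from the Python snippet in one respect, which is my own choice: it cuts each result at the closing quote, so only the URL is kept rather than everything after `<a href="`. It assumes double-quoted href values written exactly as `<a href="`.

```csharp
using System;
using System.Collections.Generic;

string data =
    "<html>\nabcd ef ....\n" +
    "blah blah <div><a href=\"http://anirudhagupta.blogspot.com/\">Anirudha Web blog</a></div>\n" +
    "blah ...\n" +
    "<div><a href=\"http://mike.blogspot.com/\">Mike's Web blog\n</a></div>\n" +
    "end...\n</html>";

const string marker = "<a href=\"";
var links = new List<string>();

// Every piece except the last ends where an </a> used to be.
foreach (string item in data.Split(new[] { "</a>" }, StringSplitOptions.None))
{
    int start = item.IndexOf(marker, StringComparison.Ordinal);
    if (start < 0)
    {
        continue; // this piece holds no link
    }

    string rest = item.Substring(start + marker.Length);
    int end = rest.IndexOf('"'); // closing quote of the href value
    links.Add(end >= 0 ? rest.Substring(0, end) : rest);
}

foreach (string link in links)
{
    Console.WriteLine(link);
}
```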
