Parsing HTML page with HtmlAgilityPack using LINQ - c#

How can I parse HTML from a webpage using LINQ and add the values to a string? I am using the HtmlAgilityPack in a Metro application and would like to bring back 3 values and add them to a string.
Here is the URL: http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N
I would like to get the values for the following (see below):
"Balance:",
"Transactions in",
"Received"
WebResponse x = await req.GetResponseAsync();
HttpWebResponse res = (HttpWebResponse)x;
if (res != null)
{
    if (res.StatusCode == HttpStatusCode.OK)
    {
        Stream stream = res.GetResponseStream();
        using (StreamReader reader = new StreamReader(stream))
        {
            html = reader.ReadToEnd();
        }
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);
        string appName = htmlDocument.DocumentNode.Descendants // not sure what to put here
        string a = "Name: " + WebUtility.HtmlDecode(appName);
    }
}

Please try the following. You might also consider pulling the table apart, as it is a little better formed than the free text in the 'p' tag (there's a sketch of that after the sample output below).
Cheers, Aaron.
// download the site content and create a new html document
// NOTE: make this asynchronous etc when considering IO performance
var url = "http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N";
var data = new WebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);

// extract the transactions 'h3' title, the node we want is directly before it
var transTitle =
    (from h3 in doc.DocumentNode.Descendants("h3")
     where h3.InnerText.ToLower() == "transactions"
     select h3).FirstOrDefault();

// tokenise the summary, one line per 'br' element, split each line by the ':' symbol
var summary = transTitle.PreviousSibling.PreviousSibling;
var tokens =
    (from row in summary.InnerHtml.Replace("<br>", "|").Split('|')
     where !string.IsNullOrEmpty(row.Trim())
     let line = row.Trim().Split(':')
     where line.Length == 2
     select new { name = line[0].Trim(), value = line[1].Trim() });

// using LINQPad to debug, the Dump command drops the current variable to the output
tokens.Dump();
Dump() is a LINQPad command that dumps the variable to the console; the following is a sample of the output from the Dump command:
Balance: 5 LTC
Transactions in: 2
Received: 5 LTC
Transactions out: 0
Sent: 0 LTC
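If you do go down the table route instead, here is a minimal sketch of pulling a table apart with the same doc (hedged: the exact table layout on that page may differ, so treat the row/cell shape as an assumption; it also assumes the usual System.Linq and System.Net usings):
// grab the first table and flatten each row into its decoded cell texts
var firstTable = doc.DocumentNode.Descendants("table").FirstOrDefault();
if (firstTable != null)
{
    var rows =
        from tr in firstTable.Descendants("tr")
        select tr.Descendants("td")
                 .Select(td => WebUtility.HtmlDecode(td.InnerText.Trim()))
                 .ToArray();
    foreach (var cells in rows)
        Console.WriteLine(string.Join(" | ", cells));
}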

The document you have to parse is not the most well formed for parsing: many elements are missing a class or even an id attribute. What you want is the content of the second p tag. You can try this:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var pNodes = htmlDocument.DocumentNode.SelectNodes("//p")[1].InnerHtml
    .Split(new string[] { "<br />" }, StringSplitOptions.None)
    .Take(3)
    .ToArray(); // ToArray so the three lines can be indexed below
string vl = "Balance:" + pNodes[0].Split(':')[1]
    + " Transactions in:" + pNodes[1].Split(':')[1]
    + " Received:" + pNodes[2].Split(':')[1];

Related

MVC StackOverflowException with larger html data

I have the following method (I'm using the HtmlAgilityPack):
public DataTable tableIntoTable(HtmlDocument doc)
{
    var nodes = doc.DocumentNode.SelectNodes("//table");
    var table = new DataTable("MyTable");
    table.Columns.Add("raw", typeof(string));
    foreach (var node in nodes)
    {
        if (
            (!node.InnerHtml.Contains("pldefault"))
            && (!node.InnerHtml.Contains("ntdefault"))
            && (!node.InnerHtml.Contains("bgtabon"))
        )
        {
            table.Rows.Add(node.InnerHtml);
        }
    }
    return table;
}
It accepts html grabbed using this:
public HtmlDocument getDataWithGet(string url)
{
    using (var wb = new WebClient())
    {
        string response = wb.DownloadString(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(response);
        return doc;
    }
}
All works fine with an HTML document that is 3294 lines long.
When I feed it some HTML that is 33960 lines long I get:
StackOverflowException was unhandled, at the if statement in the tableIntoTable method, as seen in this image:
http://imgur.com/Q2FnIgb
I thought it might be related to the MaxHttpCollectionKeys limit of 1000, so I tried putting this in my Web.config and it still doesn't work:
<add key="aspnet:MaxHttpCollectionKeys" value="9999" />
I'm not really sure where to go from here; it only breaks with larger HTML documents.
Assuming the values in your if statement are contained in some attribute value of some descendant of a table:
var xpath = @"//table[not(.//*[contains(@*,'pldefault') or
                               contains(@*,'ntdefault') or
                               contains(@*,'bgtabon')])]";
var tables = doc.DocumentNode.SelectNodes(xpath);
Update: more accurately, based on your comments:
#"//table[not(.//td[contains(#class,'pldefault') or
contains(#class,'ntdefault') or
contains(#class,'bgtabon')])]";
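For completeness, here is a sketch of how that filter might slot into the original method, replacing the per-node if checks (assuming the class names really do sit on descendant td elements as above):
public DataTable tableIntoTable(HtmlDocument doc)
{
    // let XPath do the filtering instead of scanning InnerHtml per node
    var xpath = @"//table[not(.//td[contains(@class,'pldefault') or
                                    contains(@class,'ntdefault') or
                                    contains(@class,'bgtabon')])]";
    var table = new DataTable("MyTable");
    table.Columns.Add("raw", typeof(string));
    var nodes = doc.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
        foreach (var node in nodes)
            table.Rows.Add(node.InnerHtml);
    }
    return table;
}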

How to Parse XML String c#

I'm trying to parse an XML string into a list; the result count is always zero.
string result = "";
string address = "http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml";
// Create the web request
HttpWebRequest request = WebRequest.Create(address) as HttpWebRequest;
// Get response
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
{
// Get the response stream
StreamReader reader = new StreamReader(response.GetResponseStream());
// Read the whole contents and return as a string
result = reader.ReadToEnd();
}
XDocument doc = XDocument.Parse(result);
var ListCurr = doc.Descendants("Cube").Select(curr => new CurrencyType()
{ Name = curr.Element("currency").Value, Value = float.Parse(curr.Element("rate").Value) }).ToList();
Where am I going wrong?
The problem is that you're looking for elements without a namespace, whereas the XML contains this in the root element:
xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref"
That specifies the default namespace for any element. Also, currency and rate are attributes within the Cube elements - they're not subelements.
So you want something like:
XNamespace ns = "http://www.ecb.int/vocabulary/2002-08-01/eurofxref";
var currencies = doc.Descendants(ns + "Cube")
    .Select(c => new CurrencyType {
        Name = (string) c.Attribute("currency"),
        Value = (decimal) c.Attribute("rate")
    })
    .ToList();
Note that because I'm casting the currency attribute to string, you'll end up with a null Name property for any currencies which don't specify that attribute. If you want to skip those elements, you can do so with a Where clause either before or after the Select.
Also note that I've changed the type of Value to decimal rather than float - you shouldn't use float for currency-related values. (See this question for more details.)
Additionally, you should consider using XDocument.Load to load the XML:
XDocument doc = XDocument.Load(address);
Then there's no need to create the WebRequest etc yourself.
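Putting those pieces together, a minimal sketch (assuming your CurrencyType has a string Name and a decimal Value as suggested above):
XNamespace ns = "http://www.ecb.int/vocabulary/2002-08-01/eurofxref";
XDocument doc = XDocument.Load(address);
var currencies = doc.Descendants(ns + "Cube")
    .Where(c => c.Attribute("currency") != null)   // skip the outer Cube wrappers
    .Select(c => new CurrencyType {
        Name = (string) c.Attribute("currency"),
        Value = (decimal) c.Attribute("rate")
    })
    .ToList();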
XDocument doc = XDocument.Parse(result);
XNamespace ns = "http://www.ecb.int/vocabulary/2002-08-01/eurofxref";
var ListCurr = doc.Descendants(ns + "Cube")
    .Where(c => c.Attribute("currency") != null) // <-- Some "Cube"s do not have the currency attr.
    .Select(curr => new CurrencyType
    {
        Name = curr.Attribute("currency").Value,
        Value = float.Parse(curr.Attribute("rate").Value)
    })
    .ToList();

extract content from html page

I'm trying to extract the content inside a div tag with id job_title1 in an HTML page. I'm using HtmlAgilityPack to fetch the data. Here is my code:
var obj = new HtmlWeb();
var document = obj.Load("url of website ");
var bold = document.DocumentNode.SelectNodes("//div[@class='job_title1']");
foreach (var i in document.DocumentNode.SelectNodes("//div[@class='job_title1']"))
{
    Response.Write(i.InnerHtml);
}
When I try to run this code I get an error at the foreach saying "Object reference not set to an instance of an object". Please help me solve this.
You said "div tag with id job_title1", shouldn't the xpath be:
document.DocumentNode.SelectNodes("//div[@id='job_title1']")
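A minimal sketch of that wired into your loop, with a null check (assuming the element really carries an id rather than a class):
var nodes = document.DocumentNode.SelectNodes("//div[@id='job_title1']");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Response.Write(node.InnerHtml);
    }
}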
Check whether the result is null, like this:
var nodes = document.DocumentNode.SelectNodes("//div[@class='job_title1']");
if (nodes != null)
    foreach (var i in nodes)
        ...
Edit: Use \" instead ':
var obj = new HtmlWeb();
var document = obj.Load("url of website ");
var bold = document.DocumentNode.SelectNodes("//div[@class=\"job_title1\"]");
if (bold != null)
    foreach (var i in bold)
    {
        Response.Write(i.InnerHtml);
    }

Grab all text from html with Html Agility Pack

Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz - I want to get each text separately, not all at once.
XPath is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrEmpty(text))
            sb.AppendLine(text.Trim());
    }
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, but I came up with the following, which suits my own needs:
StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes)
{
    Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The specified example HTML content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
public string html2text(string html)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(@"<html><body>" + html + "</body></html>");
    return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).
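A quick usage sketch of the method above (InnerText of the body concatenates the text nodes, so the markup is dropped):
string plain = html2text("<p>foo <a href='http://www.example.com'>bar</a> baz</p>");
// plain == "foo bar baz"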
https://github.com/jamietre/CsQuery
Have you tried CsQuery? Though it is no longer actively maintained, it's still my favorite for parsing HTML to text. Here's a one-liner showing how simple it is to get the text from HTML:
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;

public class Program
{
    public static void Main()
    {
        var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
        var text = CQ.CreateDocument(html).Text();
        Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
    }
}
I understand that the OP asked for HtmlAgilityPack only, but CsQuery is a lesser-known option and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!
I just combined and fixed some of the other answers so this works better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
    if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
    {
        string text = node.InnerText?.Trim();
        if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
            sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
    }
}

How do you parse an HTML string for image tags to get at the SRC information?

Currently I use .NET WebBrowser.Document.Images() to do this. It requires the WebBrowser to load the document. It's messy and takes up resources.
According to this question XPath is better than a regex at this.
Anyone know how to do this in C#?
If your input string is valid XHTML you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.
Otherwise you can try this function, which will return all image links from the HTML source:
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        // RelativeOrAbsolute so relative src values don't throw a UriFormatException
        links.Add(new Uri(href, UriKind.RelativeOrAbsolute));
    }
    return links;
}
And you can use it like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
    }
}
The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there - how much of it is really well formed? I needed to do something similar: parse out all the links in a document and, in my case, update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).
Here's a snippet for iterating over links in a document:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a/@href");
// Run only if there are links in the document.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
Original Blog Post
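Since the question is about image tags specifically, the same pattern works with img/src; here is a sketch (htmlSource below stands in for whatever HTML string you already have):
HtmlDocument imgDoc = new HtmlDocument();
imgDoc.LoadHtml(htmlSource);
HtmlNodeCollection imgNodes = imgDoc.DocumentNode.SelectNodes("//img[@src]");
if (imgNodes != null)
{
    foreach (HtmlNode imgNode in imgNodes)
    {
        string src = imgNode.GetAttributeValue("src", null);
        // do whatever you need with src here
    }
}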
If all you need is images, I would just use a regular expression. Something like this should do the trick:
Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
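And a sketch of pulling the captured src values out of the matches (htmlSource is assumed to hold the page markup, with System.Linq available for Cast/Select):
var srcs = rg.Matches(htmlSource)
             .Cast<Match>()                   // MatchCollection is non-generic on .NET Framework
             .Select(m => m.Groups[1].Value)
             .ToList();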
If it's valid XHTML, you could do this:
XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
XmlNodeList results = doc.SelectNodes("//img/@src");
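The //img/@src query returns the attribute nodes themselves, so reading the values is just (a sketch):
foreach (XmlNode src in results)
{
    Console.WriteLine(src.Value);   // Value of an attribute node is the src text
}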
