Html string reader - c#

I need to load HTML and parse it. I think it should be something simple: I pass in a string of HTML and it reads the string into a DOM-like object, so I can search and parse the content of the HTML, which makes scraping and similar tasks easier.
Do you know of anything like that?
Thanks

HTML Agility Pack
Similar API to XmlDocument, for example (from the examples page):
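HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is the placeholder helper from the example; supply your own.
    att.Value = FixLink(att);
}
doc.Save("file.htm");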
(To load HTML from a string rather than from a file path, use LoadHtml instead of Load.)
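Since the question asks about parsing a string, here is a minimal sketch of the LoadHtml route. It assumes the HtmlAgilityPack NuGet package is referenced; the sample markup and the class name are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class LoadHtmlSketch
{
    static void Main()
    {
        // Hypothetical sample markup, not taken from any real page.
        string html = "<html><body><a href='/a'>First</a><a>No href</a><a href='/b'>Second</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the given default if the attribute is missing.
            string href = link.GetAttributeValue("href", "");
            Console.WriteLine(link.InnerText + " -> " + href);
        }
    }
}
```

The null check matters in practice: iterating directly over the result of SelectNodes throws if the XPath matches nothing.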

If you're running in-browser, you should be able to use the Html DOM Bridge, load the HTML into it, and walk the DOM Tree for that.

Related

c# Html agility pack HtmlDocument does not contain all Elements from the website

I'm currently making a chatbot. The bot should be able to define a word, so I tried to get the span element from Google (https://www.google.de/webhp?sourceid=chrome-instant&rlz=1C1CHBD_deDE721DE721&ion=1&espv=2&ie=UTF-8#q=define%20test) in which the definition is written, which didn't work. It turns out that the HtmlDocument does not contain the whole website.
string Url = "https://www.google.de/webhp?sourceid=chrome-instant&rlz=1C1CHBD_deDE721DE721&ion=1&espv=2&ie=UTF-8#q=define%20test";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='uid_0']/div[1]/div/div[1]/div[2]/div/ol/li[1]/div/div/div[2]/div/div[1]/span");
if (!String.IsNullOrEmpty(node.InnerText))
    output += node.InnerText;
The result: node is not set to an instance of an object (it is null).
I dumped the InnerHtml of the document into a gist: https://gist.github.com/MarcelBulpr/bb44a527d8202eb7fffb4e21fb8b4fed
It seems that the loaded document does not include the result of the search request.
Does anyone know how to work around this?
Thanks in advance

HTML parsing from C#

I'm trying to parse some HTML files which don't always have the exact same format. Nevertheless, I've been able to find some patterns which are common to all the files.
For example, this is one of the files:
https://www.sec.gov/Archives/edgar/data/63908/000006390816000103/mcd-12312015x10k.htm#sFBA07EFA89A85B6DB59920A55B5021BC
I've seen that all the files I need have a unique tag whose InnerText equals "Financial Statements and Supplementary Data". I cannot search directly for that string, as it appears repeatedly throughout the text. I used this code to find that tag:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(m_strFilePath);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    if (link.InnerText.Contains("Financial Statements"))
    {
    }
}
I was wondering if there's any way to get the position of this tag in the HTML string, so I can get the data I need by doing something like:
dataNeeded = html.Substring(indexOfATag);
Thanks a lot
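One possible approach, offered as a sketch rather than a confirmed answer: HtmlNode exposes a StreamPosition property, which, as far as I know, gives the node's offset from the start of the parsed text. The sample markup and class name below are made up, and it is worth verifying against your HAP version that the offset lines up with the string you call Substring on (here the same string is passed to LoadHtml, which avoids encoding mismatches that can occur when the page is fetched via HtmlWeb).

```csharp
using System;
using HtmlAgilityPack;

class TagPositionSketch
{
    static void Main()
    {
        // Hypothetical sample markup standing in for the real filing.
        string html = "<p>Intro</p><a href='#s1'>Financial Statements and Supplementary Data</a><p>Data</p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return; // no <a href> elements at all

        foreach (HtmlNode link in links)
        {
            if (!link.InnerText.Contains("Financial Statements")) continue;

            // Assumption: StreamPosition is the offset of the tag within the text given to LoadHtml.
            int start = link.StreamPosition;
            string dataNeeded = html.Substring(start);
            Console.WriteLine(dataNeeded);
            break; // stop at the first matching link
        }
    }
}
```

If the offsets turn out not to match, working with the node tree (for example link.NextSibling or an XPath following-axis query) avoids string positions altogether.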

Add a doctype to HTML via HTML Agility pack

I know it is easy to add elements and attributes to HTML documents with the HTML agility pack. But how can I add a doctype (e.g. the HTML5 one) to an HtmlDocument with the html agility pack? Thank you
As far as I know, the Agility Pack doesn't have a direct method to set the doctype, but as Hans mentioned, HAP treats the doctype as a comment node. So you could look for an existing doctype first and, if there is none, create a new comment node and set the desired value on it:
var doctype = doc.DocumentNode.SelectSingleNode("/comment()[starts-with(.,'<!DOCTYPE')]");
if (doctype == null)
    doctype = doc.DocumentNode.PrependChild(doc.CreateComment());
doctype.InnerHtml = "<!DOCTYPE html>";
The Html Agility Pack parser treats the doctype as a comment node.
In order to add a doctype to an HTML document simply add a
comment node with the desired doctype to the beginning of the document:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("withoutdoctype.html");
HtmlCommentNode hcn = htmlDoc.CreateComment("<!DOCTYPE html>");
HtmlNode htmlNode = htmlDoc.DocumentNode.SelectSingleNode("/html");
htmlDoc.DocumentNode.InsertBefore(hcn, htmlNode);
htmlDoc.Save("withdoctype.html");
Please note that my code does not check for the existence of a doctype.

How to scrape a page generated with a script in C#?

Simple example: Google search page.
http://www.google.com/search?q=foobar
When I get the source of the page, I get the underlying JavaScript. I want the resulting page. What do I do?
Even though it looks as if it is only JavaScript, it really is the full HTML. You can easily confirm this with HtmlAgilityPack:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com/search?q=foobar");
string html = doc.DocumentNode.OuterHtml;
var nodes = doc.DocumentNode.SelectNodes("//div"); //returns 85 nodes

html parsing problem using C#

From here, I am trying to get stock quote data at 10-minute intervals.
I used WebClient to download the page content and regular expressions to parse it. This works fine for other URLs, but for this particular URL my parsing code does not work.
I think the problem is JavaScript: when I load the page in a browser, it takes some extra time after the page content has loaded to plot the data. Maybe the page uses a client-side script for this. Can anyone help me, please?
HTML Agility Pack will save you tons of headaches. Try it instead of using regexps to parse HTML.
For what it's worth, in the page you link to, the quote data is indeed in JavaScript code; see http://www.nseindia.com/js/getquotedata.js and http://www.nseindia.com/js/quote_data.js
As per @Vinko Vrsalovic's answer, Html Agility Pack is your friend. Here is a sample:
WebClient client = new WebClient();
string source = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(source);
// Select every element that has an href attribute.
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//*[@href]");
foreach (HtmlNode node in nodes)
{
    if (node.Attributes.Contains("class"))
    {
        if (node.Attributes["class"].Value.Contains("StockData"))
        {
            // Here is our info
        }
    }
}
