I'm trying to parse some HTML files which don't always have the exact same format. Nevertheless, I've been able to find some patterns which are common to all the files.
For example, this is one of the files:
https://www.sec.gov/Archives/edgar/data/63908/000006390816000103/mcd-12312015x10k.htm#sFBA07EFA89A85B6DB59920A55B5021BC
I've seen that all the files I need have a unique tag whose InnerText equals "Financial Statements and Supplementary Data". I cannot search directly for that string, as it appears repeatedly throughout the text. I used this code to find that tag:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(m_strFilePath);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    if (link.InnerText.Contains("Financial Statements"))
    {
    }
}
I was wondering if there's any way to get the position of this tag in the HTML string, so I can get the data I need by doing something like:
dataNeeded = html.substring(indexOf<a>Tag);
Thanks a lot
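One possible approach (a sketch, not tested against these filings): each HtmlNode exposes its character offset in the parsed markup through the StreamPosition property, so you can slice the document text from the matched anchor onward. Note that DocumentNode.OuterHtml is the markup as HtmlAgilityPack parsed it, which may differ slightly from the raw file.
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(m_strFilePath);

// Markup of the whole document as parsed; the offset below is taken relative to this text
string html = doc.DocumentNode.OuterHtml;

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    if (link.InnerText.Contains("Financial Statements"))
    {
        // StreamPosition is the node's offset in the parsed document,
        // so everything from the matched <a> tag onward is:
        string dataNeeded = html.Substring(link.StreamPosition);
        break;
    }
}
Often you don't need the raw offset at all, since you can keep navigating from the matched node (for example, its following siblings) with HtmlAgilityPack itself.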
I need to grab some posts from a blog. All went well until I wanted to get the post creation date. The DOM tree for it is:
div class="stories-feed__container"
-> article
-> div class="story__main"
-> div class="story__footer"
-> div class="story__user user"
-> div class="user__info-item"
-> time datetime="date and time in UTC format".
So I wrote the code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = web.Load("https://pikabu.ru/@serhiy1994");
string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]/div[contains(@class, 'user__info-item')]/time").GetAttributeValue("datetime", "NULL"); // e.g. for the 2nd article on the page
And it throws a NullReferenceException.
BUT if you stop at the "div class="story__user user"" level, e.g.,
string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]").InnerHtml;
it works properly and returns the inner HTML.
So I think there is something like a "maximum visibility level" for HtmlAgilityPack, and you won't be able to manipulate the deeper markup.
Am I right, or am I coding something wrong?
The original page code is here: https://pastebin.com/jFC0XD9C
HtmlAgilityPack parses the entire document, regardless of how deeply the element you want is nested. You can use this to get to the item you are looking for, since you don't have to provide the entire path.
The code below searches the whole document for the first <div> tag that has the class name user__info-item. You can also change SelectSingleNode to SelectNodes if there are multiple matching tags, then loop through them to get the dates.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = web.Load("https://pikabu.ru/@serhiy1994");
var postDate = doc.DocumentNode.SelectSingleNode("//div[@class='user__info-item']/time");
Console.WriteLine(postDate.InnerText);
What's wrong with your code?
The reason your code above doesn't work is that there is another div you are missing: '<div class="user__info user__info_left">'.
If you write your code like this, it works.
var nodes = doc.DocumentNode.SelectSingleNode("//div[@class='story__main']/div[@class='story__footer']/div[@class='story__user user']/div[@class='user__info user__info_left']/div[@class='user__info-item']/time");
Console.WriteLine(nodes.InnerText);
Another way
Another way to do it is by searching for a parent div. Once you find the parent tag, search under that tag to find what you are looking for.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='story__user user']");
foreach (HtmlNode node in nodes)
{
// Search within each node using .// notation
var timeNodes = node.SelectSingleNode(".//div[@class='user__info-item']/time");
Console.WriteLine(timeNodes.InnerText);
}
This is the HTML document I am trying to extract the highlighted data from.
I have read a lot on this site but was unable to find a solution that was helpful.
I tried using
nodes = doc.DocumentNode.SelectNodes(table_title + "/tbody/tr/td");
headers = nodes.Elements("span").Select(d => d.InnerText.Trim());
foreach (var this_header in headers)
{
    Console.WriteLine(this_header);
}
This does not give me the correct information. How do I find the specific content I am looking for?
What is this /tbody/tr/td for? There is no table at all.
You have to use a unique selector (XPath, CSS, id) in SelectNodes.
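A minimal sketch of that idea, assuming the target text sits in a span under an element with a known id or class (the id 'location-header' and the class 'address' below are purely hypothetical, since the screenshot is not reproduced here):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html = the page markup you already have as a string

// Anchor on something unique near the target instead of a long positional path
var byId = doc.DocumentNode.SelectSingleNode("//div[@id='location-header']//span");
if (byId != null)
    Console.WriteLine(byId.InnerText.Trim());

// Or match on a class fragment and loop over all hits
var byClass = doc.DocumentNode.SelectNodes("//span[contains(@class, 'address')]");
if (byClass != null)
    foreach (var span in byClass)
        Console.WriteLine(span.InnerText.Trim());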
I'm loading a web page into my WebView, and I can access its raw HTML as text. The page has several video elements embedded within it, and I want to get their locations as a list of strings so I can download them separately.
How would I go about doing this?
You can use Html Agility Pack for parsing:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(rawText);

// Each embedded video is expected to expose its file in a <video><source src="..."> element
var videoSourceNodes = document.DocumentNode.SelectNodes("//video/source");
var videoPaths = new List<string>();
if (videoSourceNodes != null)
{
    foreach (var node in videoSourceNodes)
    {
        videoPaths.Add(node.GetAttributeValue("src", string.Empty));
    }
}
Converting relative paths to absolute ones is up to you.
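A sketch of that conversion, assuming you know the URL of the page the HTML came from (both URLs below are just illustrations):
// Resolve a relative src against the page's own URL
Uri baseUri = new Uri("https://example.com/videos/page.html"); // hypothetical page URL
string absoluteUrl = new Uri(baseUri, "media/clip.mp4").ToString(); // hypothetical relative src
// absoluteUrl -> "https://example.com/videos/media/clip.mp4"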
I need to load HTML and parse it. I think it should be something simple: I pass in a string with HTML, it reads the string into a DOM-like object, and I can then search and parse the content of the HTML, facilitating scraping and things like that.
Do you guys know about anything like that?
Thanks
HTML Agility Pack
Similar API to XmlDocument, for example (from the examples page):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att); // FixLink stands in for whatever rewriting you need
}
doc.Save("file.htm");
(You should also be able to use LoadHtml to load a string of HTML, rather than loading from a file path.)
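A minimal sketch of that string-based variant (the markup here is just an illustration):
// Parse markup that is already in memory instead of a file on disk
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html><body><a href='https://example.com'>link</a></body></html>");

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
}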
If you're running in-browser, you should be able to use the Html DOM Bridge, load the HTML into it, and walk the DOM Tree for that.
From here, I am trying to get stock quote data at 10-minute intervals.
I used WebClient to download the page content, and for parsing I used regular expressions. It works fine for other URLs, but for this particular URL my parsing code is not working.
I think the problem is with JavaScript: when I load the page in a browser, it takes some extra time after the page content loads to plot the data. Maybe this site is using some client-side script for this page. Can anyone help me, please?
HTML Agility Pack will save you tons of headaches. Try it instead of using regexps to parse HTML.
For what it's worth, in the page you link to, the quote data is indeed in JavaScript code; check http://www.nseindia.com/js/getquotedata.js and http://www.nseindia.com/js/quote_data.js
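If the quote values really come from a separate script or data URL like those, one option (a sketch, not verified against this site; the response format still has to be inspected by hand) is to fetch that resource directly instead of scraping the rendered page:
// Download the script/data resource the page itself requests and inspect its format
WebClient client = new WebClient();
string quoteData = client.DownloadString("http://www.nseindia.com/js/getquotedata.js");
Console.WriteLine(quoteData);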
As per @Vinko Vrsalovic's answer, Html Agility Pack is your friend. Here is a sample:
WebClient client = new WebClient();
string source = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(source);
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//*[@href]");
foreach (HtmlNode node in nodes)
{
    if (node.Attributes.Contains("class"))
    {
        if (node.Attributes["class"].Value.Contains("StockData"))
        {
            // Here is our info
        }
    }
}