How deep is the visible area of HtmlAgilityPack? - C#

I need to grab some posts from a blog. Everything went well until I wanted to get the post creation date. The DOM tree for it is:
div class="stories-feed__container"
  -> article
    -> div class="story__main"
      -> div class="story__footer"
        -> div class="story__user user"
          -> div class="user__info-item"
            -> time datetime="date and time in UTC format"
So I wrote the code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://pikabu.ru/#serhiy1994");
string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]/div[contains(@class, 'user__info-item')]/time").GetAttributeValue("datetime", "NULL"); // e.g. for the 2nd article on the page
And it throws a NullReferenceException.
BUT if you stop at the "div class="story__user user"" level, e.g.:
string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]").InnerHtml;
it works properly and returns the inner HTML code.
So I think there is something like a "maximum visibility level" for HtmlAgilityPack, and you aren't able to manipulate the deeper markup.
Am I right, or am I doing something wrong?
The original page code is here: https://pastebin.com/jFC0XD9C

HtmlAgilityPack will parse the entire page, regardless of how deep you want to go. You can use this to get to the item you are looking for, since you don't have to provide the entire path.
This will search the entire document and look for the first <div> tag that has the class name user__info-item. You can also change SelectSingleNode to SelectNodes if there are multiple matching tags, then loop through them to get the dates.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://pikabu.ru/#serhiy1994");
var postDate = doc.DocumentNode.SelectSingleNode("//div[@class='user__info-item']/time");
Console.WriteLine(postDate.InnerText);
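The SelectNodes variant mentioned above can be sketched like this (a minimal sketch, assuming the same class names as in the question); note that SelectNodes returns null when nothing matches, so a null check avoids the NullReferenceException from the question:

```csharp
// Grab every post date on the page instead of just the first one.
var timeNodes = doc.DocumentNode.SelectNodes("//div[@class='user__info-item']/time");
if (timeNodes != null) // SelectNodes returns null when there are no matches
{
    foreach (HtmlNode timeNode in timeNodes)
    {
        // Fall back to "NULL" when the datetime attribute is missing.
        Console.WriteLine(timeNode.GetAttributeValue("datetime", "NULL"));
    }
}
```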
What's wrong with your code?
The reason your code doesn't work is that there is another div you are missing: '<div class="user__info user__info_left">'.
If you write your code like this, it works.
var nodes = doc.DocumentNode.SelectSingleNode("//div[@class='story__main']/div[@class='story__footer']/div[@class='story__user user']/div[@class='user__info user__info_left']/div[@class='user__info-item']/time");
Console.WriteLine(nodes.InnerText);
Another way
Another way to do it is by searching for a parent div. Once you find the parent tag, search under that tag to find what you are looking for.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='story__user user']");
foreach (HtmlNode node in nodes)
{
// Search within each node using .// notation
var timeNodes = node.SelectSingleNode(".//div[@class='user__info-item']/time");
Console.WriteLine(timeNodes.InnerText);
}

Related

c# Html agility pack HtmlDocument does not contain all Elements from the website

At the moment I'm making a chatbot. The bot should be able to define a word, so I tried getting the span element from Google (https://www.google.de/webhp?sourceid=chrome-instant&rlz=1C1CHBD_deDE721DE721&ion=1&espv=2&ie=UTF-8#q=define%20test) where the definition is written, which didn't work. It turns out that the HtmlDocument does not contain the whole website.
string Url = "https://www.google.de/webhp?sourceid=chrome-instant&rlz=1C1CHBD_deDE721DE721&ion=1&espv=2&ie=UTF-8#q=define%20test";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='uid_0']/div[1]/div/div[1]/div[2]/div/ol/li[1]/div/div/div[2]/div/div[1]/span");
if (!String.IsNullOrEmpty(node.InnerText))
output += node.InnerText;
This throws "Object reference not set to an instance of an object" (node is null).
I dumped the InnerHtml of the document into a gist: https://gist.github.com/MarcelBulpr/bb44a527d8202eb7fffb4e21fb8b4fed
It seems that the fetched HTML does not contain the result of the search request.
Does anyone know how to work around this?
Thanks in advance
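HtmlWeb only fetches the static HTML the server returns; content that Google injects client-side with JavaScript is never in the document. As a diagnostic sketch (not from the original thread, and assuming the same Url variable as above), a null check both avoids the NullReferenceException and confirms the node simply is not in the fetched markup:

```csharp
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
// A simpler selector than the full path; if even this finds nothing,
// the element is absent from the server-rendered HTML.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='uid_0']//span");
if (node == null)
{
    // The definition is injected later by JavaScript, so a headless
    // browser (or a different, static endpoint) is needed instead.
    Console.WriteLine("Node not found in static HTML");
}
else
{
    Console.WriteLine(node.InnerText);
}
```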

HTML parsing from C#

I'm trying to parse some HTML files which don't always have the exact same format. Nevertheless, I've been able to find some patterns which are common to all the files.
For example, this is one of the files:
https://www.sec.gov/Archives/edgar/data/63908/000006390816000103/mcd-12312015x10k.htm#sFBA07EFA89A85B6DB59920A55B5021BC
I've seen that all the files I need have a unique <a> tag whose InnerText equals "Financial Statements and Supplementary Data". I cannot search directly for that string, as it appears repeatedly in the text. I used this code to find that tag:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(m_strFilePath);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
if (link.InnerText.Contains("Financial Statements"))
{
}
}
I was wondering if there's any way to get the position of this tag in the HTML string so I can get the data I need by doing:
dataNeeded = html.substring(indexOf<a>Tag);
Thanks a lot
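One possible approach (an assumption on my part, not from the original thread): HtmlAgilityPack's HtmlNode exposes a StreamPosition property with the node's character offset in the parsed text, so the substring idea above can be sketched as:

```csharp
// ParsedText is the raw HTML string the document was parsed from.
string html = doc.ParsedText;
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    if (link.InnerText.Contains("Financial Statements"))
    {
        // StreamPosition is the offset of this <a> tag in that string,
        // so everything from the matching tag onwards can be taken directly.
        string dataNeeded = html.Substring(link.StreamPosition);
        break;
    }
}
```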

C# htmlAgility Webscrape html node inside the first Html node

I am trying to access some nodes on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one. I am confused about how to access the secondary HTML path and then parse through it for the nodes.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div. I tried working my way down the nodes, but it didn't work.
Any help, or a place to look up the necessary information to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);
You will be able to scrape the data if you access the following url:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link");
if (presentedBy != null)
{
Console.WriteLine(presentedBy.FirstOrDefault().InnerText);
}
The code above scrapes the Presented By field as an example.
Remarks:
I use ScrapySharp nuget package along with HtmlAgilityPack, so I can scrape using css selectors instead of xpath expressions - something I find easier to do.
The url you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which you can find by using Firefox developer tools to analyze the site's traffic/network requests and responses.
I could not yet identify what triggers this HTTP request in the end (maybe JavaScript code, maybe one of the frame HTMLs that are requested in the main, frame-enabled document).
If you only have a couple of urls like this to scrape, then manually extracting the correct url is also an option.

HTMLAgilityPack missing child nodes that exist on website being scraped

I'm running the following piece of code. It returns the correct number of divs found for 'callTable', but they are all empty: the InnerHtml is empty and it doesn't find any children for any of them, even though if you inspect the elements on the actual site, they have children.
I thought maybe it had to do with having a table within a div, so I tested it by looking within 'box-content' divs. Those seem to be loading correctly, though. Is it possible it has to do with callTable having 'table-layout: fixed'?
Anyway, I can't seem to find anyone else having this error after poking around. Anyone have some thoughts? Much appreciated!
string Url = "https://malwr.com/analysis/MWI5MThhZWZhNDI0NDEyYThmOWMxMjc3MzRmZjQ1MDg" + id;
HtmlWeb web = new HtmlWeb();
HtmlDocument webpage = web.Load(Url);
HtmlNodeCollection callTable = webpage.DocumentNode.SelectNodes("//div[@class='calltable']"); //[contains(@class, 'calltable')]
//Just a test
HtmlNodeCollection boxContentTest = webpage.DocumentNode.SelectNodes("//div[@class='box-content']");
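A quick way to confirm what the server actually returned (a diagnostic sketch, not from the original thread): dump the OuterHtml of the matched divs. If they are empty there too, the rows are filled in later by JavaScript, and a plain HTTP fetch will never see them:

```csharp
if (callTable != null)
{
    foreach (HtmlNode div in callTable)
    {
        // If this prints an empty element, the children are injected
        // client-side and are not present in the downloaded HTML.
        Console.WriteLine(div.OuterHtml);
    }
}
```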

How to scrape a page generated with a script in C#?

Simple example: Google search page.
http://www.google.com/search?q=foobar
When I get the source of the page, I get the underlying JavaScript. I want the resulting page. What do I do?
Even though it looks as if it is only JavaScript, it really is the full HTML; you can easily confirm this with HtmlAgilityPack:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com/search?q=foobar");
string html = doc.DocumentNode.OuterHtml;
var nodes = doc.DocumentNode.SelectNodes("//div"); //returns 85 nodes
