Extract a certain part of HTML with XPath and HtmlAgilityPack

I am having an issue with XPath syntax, as I don't understand how to use it to extract certain HTML elements.
I am trying to load a video's information from a channel page: http://www.youtube.com/user/CinemaSins/videos
I know there is a line that holds all the details: views, title, ID, etc.
Here is what I am trying to get from within the HTML:
That's line 2836:
<div class="yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item" data-context-item-id="ntgNB3Mb08Y" data-context-item-views="243,456 views" data-context-item-time="9:01" data-context-item-type="video" data-context-item-user="CinemaSins" data-context-item-title="Everything Wrong With The Chronicles Of Riddick In 8 Minutes Or Less">
I'm not sure how, but I have HtmlAgilityPack added as a resource and have started attempts at getting it.
Can someone explain how to get all of those details and the XPath syntax involved?
What I have attempted:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']//a"))
{
if (node.ChildNodes[0].InnerHtml != String.Empty)
{
title.Add(node.ChildNodes[0].InnerHtml);
}
}
^ The above code works, but it only gets the title of each video, and it also returns a blank entry as well.

Your xpath is selecting the <a> element inside the <div>. If you want the attributes of the <div> too, then you need to either
a) select both elements and process them separately.
b) run several xpath queries where you specify the exact attribute you want.
Let's go with (a) for this example.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']");
and get the attributes and title like so:
foreach(var node in nodes)
{
foreach(var attribute in node.Attributes)
{
// ... Get the values of the attributes here.
}
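// The leading dot keeps the search relative to this div; a bare "//a" would search the whole document.
var linkNodes = node.SelectNodes(".//a");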
// ... Get the InnerHtml as per your own example.
}
I hope this was clear enough. Good luck.

It seems the answer given to me did not help whatsoever, so after HEAPS of digging I finally understand how XPath works and managed to do it myself, as seen below:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']"))
{
String val = node.Attributes["data-context-item-id"].Value;
videoid.Add(val);
}
I just had to grab the content within the class. Knowing this made it a lot easier to use.
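For reference, the same idea can be extended to read several of the data-context-item-* attributes at once. The following is only a minimal sketch, assuming HtmlAgilityPack is referenced and that the markup still looks like the div quoted in the question. The class name VideoScraper and the method ExtractVideos are hypothetical names, and the inline HTML stands in for the downloaded page. It uses contains(@class, ...) instead of an exact match and checks for null, since SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class VideoScraper
{
    // Returns (id, title, views) for every video div found in the given HTML.
    public static List<(string Id, string Title, string Views)> ExtractVideos(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string Id, string Title, string Views)>();

        // contains(...) is a substring test on the class attribute, so it
        // tolerates extra or reordered class names (assumption: the
        // 'yt-lockup-video' class marks a video entry, as in the question).
        var nodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'yt-lockup-video')]");
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            // GetAttributeValue returns the fallback ("") when the attribute is missing.
            results.Add((
                node.GetAttributeValue("data-context-item-id", ""),
                node.GetAttributeValue("data-context-item-title", ""),
                node.GetAttributeValue("data-context-item-views", "")));
        }
        return results;
    }

    public static void Main()
    {
        // Inline sample copied from the div in the question; a real run would load the page instead.
        string html =
            "<div class=\"yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item\" " +
            "data-context-item-id=\"ntgNB3Mb08Y\" data-context-item-views=\"243,456 views\" " +
            "data-context-item-title=\"Everything Wrong With The Chronicles Of Riddick In 8 Minutes Or Less\"></div>";

        foreach (var video in ExtractVideos(html))
        {
            Console.WriteLine(video.Id + " | " + video.Title + " | " + video.Views);
        }
    }
}
```

The null check matters in practice: if the page layout changes and no div matches, the foreach in the original snippets would throw a NullReferenceException.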

Related

When should I use XPath?

Consider the following example, where a ul element's id is known, and we want to Click() the li element it contains if the li.Text equals a certain text.
Here are two working solutions to this problem:
Method 1: Using XPath
ReadOnlyCollection<IWebElement> lis = FindElements(By.XPath("//ul[@id='id goes here']/li"));
foreach (IWebElement li in lis) {
if (li.Text == text) {
li.Click();
break;
}
}
Method 2: Using ID and TagName
IWebElement ul = FindElement(By.Id("id goes here"));
ReadOnlyCollection<IWebElement> lis = ul.FindElements(By.TagName("li"));
foreach (IWebElement li in lis) {
if (li.Text == text) {
li.Click();
break;
}
}
My question is: When should we use XPath and when shouldn't we?
I prefer to use XPath only when necessary. For this specific example, I think that XPath is completely unnecessary, but when I looked up this specific problem on StackOverflow, it seems that a majority of users default to using XPath.

In this particular case, XPath can even simplify the problem to a single line:
driver.FindElement(By.XPath(String.Format("//ul[@id='id goes here']/li[. = '{0}']", text))).Click();
In general though, if you can uniquely identify an element using simple By.Id or By.TagName or other similar "simple" locators, do it. XPath expression and CSS selector based locators usually either provide advanced ways to locate elements (we can go up/down/sideways in the tree, use partial attribute matches, count elements, determine their position etc) or make the element's location concise, as in this particular situation.
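The text predicate in that one-liner can be tried without a browser. The sketch below is an illustration only: it evaluates the same kind of expression with HtmlAgilityPack against an inline HTML string rather than with Selenium, and the names ListItemFinder, FindItem and the sample list are made up for the example.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper used only to demonstrate the XPath predicate.
public static class ListItemFinder
{
    // Returns the li under the ul with the given id whose string value equals text, or null.
    public static HtmlNode FindItem(string html, string ulId, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Caveat: the values are inserted verbatim, so a value containing
        // a single quote would produce an invalid expression.
        string xpath = String.Format("//ul[@id='{0}']/li[. = '{1}']", ulId, text);
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    public static void Main()
    {
        string html = "<ul id='menu'><li>Alpha</li><li>Beta</li><li>Gamma</li></ul>";
        HtmlNode match = FindItem(html, "menu", "Beta");
        Console.WriteLine(match != null ? match.InnerText : "no match");
    }
}
```

The predicate [. = '...'] compares the whole string value of the li, so surrounding whitespace in the markup would prevent a match; normalize-space(.) is one way to relax that.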

When you need to track several similar web elements, use XPath.
When you need one particular element, use its id.
XPath has an advantage, because ids are sometimes duplicated.
This is my experience!

Getting a specific data from webpage using only class items

I have source code from a webpage that I wish to extract data from (I've narrowed it down to exactly what is relevant here):
<div class="sideInfoPlayer">
<a class="signLink" href="spieler.php?uid=12345" title="Profile">
<span class="wrap">Wagamama</span>
</a>
Now the trick here is that I want to get the word Wagamama into a message box but that word changes on every page of that site so I need to get that element but there is no ID on this page. Therefore I was thinking of doing a search for the class named "sideInfoPlayer" first and then find the "wrap" class within the previous class block.
I have written the below to get the first one but do not know how to tackle the second one and then get the desired value.
HtmlElementCollection col = webBrowser1.Document.GetElementsByTagName("div");
foreach (HtmlElement element in col)
{
string cls = element.GetAttribute("className");
if (String.IsNullOrEmpty(cls) || !cls.Equals("sideInfoPlayer"))
continue;
}
I hope you can help me get unstuck on this one.

You have better options. Look at http://htmlagilitypack.codeplex.com/
And here: How can i parse html string
First you'll need to add reference to HtmlAgilityPack library by downloading it manually or with NuGet package manager.
// loading html into HtmlDocument
var doc = new HtmlWeb().Load("http://website.com/mypage");
// walking through all nodes of interest
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='sideInfoPlayer']//span[@class='wrap']"))
{
// here is your text: node.InnerText
}
//div[@class='sideInfoPlayer']//span[@class='wrap'] is called an XPath expression, and this one literally means "get me all span elements with class=wrap that are descendants of a div element with class=sideInfoPlayer". A single slash before span would only match direct children of the div, and in your markup the span sits inside the a element.
I didn't test it, but it should work.
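To check the expression without hitting the site, it can be run against the fragment from the question. This is a minimal sketch, assuming HtmlAgilityPack is referenced. PlayerNameReader and ReadPlayerName are hypothetical names, and a closing div tag is added to the fragment so the sample is well-formed. Because the span sits inside the a element rather than directly under the div, the sketch uses the descendant axis (//) before span.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper for illustration.
public static class PlayerNameReader
{
    // Returns the text of the first span.wrap inside div.sideInfoPlayer, or null if absent.
    public static string ReadPlayerName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // '//' before span matches descendants at any depth below the div.
        HtmlNode span = doc.DocumentNode.SelectSingleNode(
            "//div[@class='sideInfoPlayer']//span[@class='wrap']");

        return span == null ? null : span.InnerText.Trim();
    }

    public static void Main()
    {
        string html =
            "<div class=\"sideInfoPlayer\">" +
            "<a class=\"signLink\" href=\"spieler.php?uid=12345\" title=\"Profile\">" +
            "<span class=\"wrap\">Wagamama</span>" +
            "</a></div>";

        Console.WriteLine(ReadPlayerName(html) ?? "not found");
    }
}
```

In a WinForms application the returned string could then be passed to MessageBox.Show, which is what the question asks for.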

Scraping HTML from Financial Statements

This is my first attempt at learning to work with HTML in Visual Studio and C#. I am using the Html Agility Pack library to do the parsing.
From this page I am attempting to pull out information from various places and save it as correctly formatted strings.
Here is my current code (taken from: shriek )
HtmlNode tdNode = document.DocumentNode.DescendantNodes().FirstOrDefault(n => n.Name == "td"
&& n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
HtmlNode trNode = tdNode.ParentNode;
foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
{
Console.WriteLine(node.InnerText.Trim());
//Output:
//Net Income
//265.00
//298.00
//601.00
//672.00
//666.00
}
}
It works correctly; however, I want to get more information and I am unsure how to search through the HTML correctly. First, I would like to be able to select these numbers from the annual data as well, not only from the quarterly data (the View option at the top of the page).
I would also like to get the dates for each column of numbers, both quarterly and annual (the "As of ..." at the top of each column).
Also, for future projects, does Google provide an API for this?

If you take a close look at the original input HTML source, you will see its data is organized around 6 main sections that are DIV elements with one of the following 'id' attributes: "incinterimdiv" "incannualdiv" "balinterimdiv" "balannualdiv" "casinterimdiv" "casannualdiv". Obviously, these match Income Statement, Balance Sheet, and Cash Flow for Quarterly or Annual Data.
Now, when you're scraping a site with Html Agility Pack, I suggest you use XPath, which is the easiest way to get to any node inside the HTML code, without any dependency on XML, as Html Agility Pack supports plain XPath over HTML.
XPath has to be learned, for sure, but it is very elegant because it does so many things in just one line. I know this may look old-fashioned next to the new cool C#-oriented XLinq syntax :), but XPath is much more concise. It also enables you to concentrate the bindings between your code and the input HTML in plain old strings, and avoid recompiling the code when the input source evolves (for example, when an ID changes). This makes your scraping code more robust and future-proof. You could also put the XPath bindings in an XSL(T) file, to be able to transform the HTML into data presented as XML.
Anyway, enough digression :) Here is sample code that gets the financial data for a specific line title, and another sample that gets all data from all lines (from one of the 6 main sections):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");
// How get a specific line:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the given text (trimmed)
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']"))
{
Console.WriteLine("Title:" + node.InnerHtml.Trim());
// get all following sibling TD elements
foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
{
Console.WriteLine(" data:" + sibling.InnerText.Trim()); // InnerText works also for negative values
}
}
// How to get all lines:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the class 'lft lm'
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']"))
{
Console.WriteLine("Title:" + node.InnerHtml.Trim());
foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
{
Console.WriteLine(" data:" + sibling.InnerText.Trim());
}
}
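The title-cell-plus-following-siblings pattern can be exercised offline. The sketch below is an assumption-laden illustration: it runs the same XPath expressions against a tiny inline table instead of the live page, the names FinancialTableReader and ReadRows are hypothetical, and the "Deferred Taxes" figures are invented sample values (the "Net Income" figures are the first two from the question's output). It also guards against SelectNodes returning null.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration.
public static class FinancialTableReader
{
    // Maps each row title in the given section to the values in that row.
    public static Dictionary<string, List<string>> ReadRows(string html, string sectionId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new Dictionary<string, List<string>>();

        // Title cells carry the class 'lft lm' (taken from the answer above).
        var titles = doc.DocumentNode.SelectNodes(
            "//div[@id='" + sectionId + "']/table[@id='fs-table']//td[@class='lft lm']");
        if (titles == null)
        {
            return rows; // no such section or no title cells
        }

        foreach (HtmlNode title in titles)
        {
            var values = new List<string>();

            // following-sibling::td selects the data cells to the right of the title cell.
            var siblings = title.SelectNodes("following-sibling::td");
            if (siblings != null)
            {
                foreach (HtmlNode sibling in siblings)
                {
                    values.Add(sibling.InnerText.Trim());
                }
            }
            rows[title.InnerText.Trim()] = values;
        }
        return rows;
    }

    public static void Main()
    {
        // Inline stand-in for the real page; the Deferred Taxes numbers are made up.
        string html =
            "<div id='casannualdiv'><table id='fs-table'>" +
            "<tr><td class='lft lm'>Net Income</td><td>265.00</td><td>298.00</td></tr>" +
            "<tr><td class='lft lm'>Deferred Taxes</td><td>-10.00</td><td>25.00</td></tr>" +
            "</table></div>";

        foreach (var row in ReadRows(html, "casannualdiv"))
        {
            Console.WriteLine(row.Key + ": " + String.Join(", ", row.Value));
        }
    }
}
```

Passing "incinterimdiv" or "incannualdiv" as sectionId would switch between the quarterly and annual income statement, provided the page still uses the ids listed above.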

You have two options. One is to reverse engineer the HTML page, figure out what JavaScript code is run when you click on Annual Data, see where it gets the data from and ask for the data.
The second solution, which is more robust, is to use a platform such as Selenium, that actually emulates the web browser and runs JavaScript for you.
As far as I could tell, there's no CSV interface to the financial statements. Perhaps Yahoo! has one.

If you need to navigate around to get to the right page, then you probably want to look into using WatiN. WatiN was designed as an automated testing tool for web pages and drives a selected web browser to get the page. It also allows you to identify input fields and enter text in textboxes or push buttons. It's a lot like HtmlAgilityPack, so you shouldn't find it too difficult to master.

I would highly recommend against this approach. The HTML that google is spitting out is likely highly volatile, so even once you solidify your parsing approach to get all of the data you need, in a day, a week or a month the HTML format could all change and you would need to rewrite your parsing logic.
You should try to use something more static, like XBRL.
SEC publishes this XBRL for each publicly traded company here = http://xbrl.sec.gov/
You can use this toolkit to work with the data programmatically - http://code.google.com/p/xbrlware/
EDIT: The path of least resistance is actually using http://www.xignite.com/xFinancials.asmx, but this service costs money.

Is there a way to replace html nodes with text nodes using HTMLAgilityPack?

I would like to use HTMLAgility pack to replace a node within the document with a text node. The purpose of this is to remove tags surrounding the node itself. Currently, I do something like this:
//This code fixes redundant HTML formatting tags
//This is a snippet of code
foreach (var hChildNode in hd.DocumentNode.SelectNodes("//b//b | //i//i | //u//u") ?? Enumerable.Empty<HtmlNode>())
hChildNode.Name = "remove";
StringBuilder sb = new StringBuilder(hd.DocumentNode.WriteTo());
sb.Replace("<remove>", string.Empty);
sb.Replace("</remove>", string.Empty);
Is there a better way to do this? If I try to create a new text node, and then do something like the code snippet below, I receive an invalid cast error:
foreach (var hChildNode in hd.DocumentNode.SelectNodes("//b//b | //i//i | //u//u") ?? Enumerable.Empty<HtmlNode>())
{
HtmlNode hNewNode = hd.CreateTextNode(hChildNode.InnerHtml);
hChildNode.ParentNode.ReplaceChild(hNewNode, hChildNode);
}
(updated after a typo was pointed out, however the problem still remains)
Am I using the method wrong? Is there another method I am supposed to use to perform functions like this? Thanks.

The purpose of this is to remove tags surrounding the node itself
Your second code snippet does exactly that tag removal, except for one typo (I guess):
HtmlNode hNewNode = hd.CreateTextNode(hNewNode.InnerHtml);
You should replace hNewNode.InnerHtml with hChildNode.InnerHtml, otherwise your code won't even compile (use of unassigned variable).
I also want to mention that the new text node won't have the child nodes of the node it replaces; instead, its InnerHtml property will have the same value as that of the replaced node.
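Put together, the corrected loop might look like the sketch below. It is only an illustration under assumptions: HtmlAgilityPack is referenced, TagUnwrapper and RemoveNestedFormatting are hypothetical names, and the sample string is made up. It keeps the approach from the question (CreateTextNode plus ReplaceChild) and adds a null check in place of the Enumerable.Empty fallback.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper for illustration.
public static class TagUnwrapper
{
    // Replaces b/i/u elements nested inside the same kind of element
    // with a text node carrying their inner HTML.
    public static string RemoveNestedFormatting(string html)
    {
        var hd = new HtmlDocument();
        hd.LoadHtml(html);

        var nested = hd.DocumentNode.SelectNodes("//b//b | //i//i | //u//u");
        if (nested != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode hChildNode in nested)
            {
                // The text node holds the markup as raw text; it has no child nodes of its own.
                HtmlNode hNewNode = hd.CreateTextNode(hChildNode.InnerHtml);
                hChildNode.ParentNode.ReplaceChild(hNewNode, hChildNode);
            }
        }
        return hd.DocumentNode.WriteTo();
    }

    public static void Main()
    {
        string html = "<div><b>one <b>two</b> three</b></div>";
        Console.WriteLine(RemoveNestedFormatting(html));
    }
}
```

One limitation of this sketch: with three or more levels of nesting, the innermost match ends up inside an already detached node, so a single pass leaves some redundant tags in the output. Reparsing the result and running the method again until nothing matches is one way around that.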

HTML Agility Pack Screen Scraping XPATH isn't returning data

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPATH that I'm seeing in Chrome Devtools as well as Firebug on Firefox and what my C# program is seeing.
The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND
The code I'm currently using is pretty quick and dirty...
//This function retrieves data from the digikey
private static List<string> ExtractProductInfo(HtmlDocument doc)
{
List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
List<string> m_unparsedProductInfo = new List<string>();
//Base Node for part info
string m_baseNode = @"//html[1]/body[1]/div[2]";
//Write part info to list
m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
//More lines of similar form will go here for more info
//this retrieves digikey PN
foreach(HtmlNode node in m_unparsedProductInfoNodes)
{
m_unparsedProductInfo.Add(node.InnerText);
}
return m_unparsedProductInfo;
}
Although the path I'm using appears to be "correct" I keep getting NULL when I look at the list "m_unparsedProductInfoNodes"
Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode it only returns a div with the only significant child being "cs=####", which seems to vary with browser user agents. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser) it pitches a fit, insisting that my expression doesn't evaluate to a node set, but leaving them out still leaves the problem that all data past div[2] is returned as NULL.

Try using this XPath expression:
/html[1]/body[1]/div[2]/cs=0[1]/rf=141[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]
Using Google Chrome Developer Tools and Firebug in Firefox, it seems the webpage has 'cs' and 'rf' tags before the first table. Something like:
<cs="0">
<rf="141">
<table>
...
</table>
</rf>
</cs>
Here is something that might be useful for finding out what is happening when you parse a known HTML file and don't get the results you expect. In this case I just did:
string xpath = "";
//In this case I'll get all cells and see what cell has the text "296-12602-1-ND"
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
if (node.InnerText.Trim() == "296-12602-1-ND")
xpath = node.XPath; //Here it is
}
Or you could just debug your application after the document loads, and go through each child node until you find the node you want to get the info from. If you set a breakpoint where the InnerText is found, you can go through the parents and then keep looking for other nodes. I usually do that by manually entering commands in a 'watch' window and navigating using the treeview to see properties, attributes and children.
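The lookup trick above can be wrapped in a small helper so it is easy to reuse while debugging. This is a minimal sketch, assuming HtmlAgilityPack is referenced. XPathProbe and FindXPathOfCell are hypothetical names, and the inline HTML (including the "Digi-Key Part Number" label) is a made-up stand-in for the real page. It asks the parser which path it assigned to the cell, instead of trusting the path shown by the browser tools.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical debugging helper for illustration.
public static class XPathProbe
{
    // Returns the XPath that HtmlAgilityPack assigns to the first td whose
    // trimmed text equals cellText, or null if there is no such cell.
    public static string FindXPathOfCell(string html, string cellText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null)
        {
            return null; // the parsed document contains no td at all
        }

        foreach (HtmlNode node in cells)
        {
            if (node.InnerText.Trim() == cellText)
            {
                return node.XPath; // path as seen by the parser, not by the browser
            }
        }
        return null;
    }

    public static void Main()
    {
        // Made-up stand-in for the product page.
        string html =
            "<html><body><div>header</div><div><table><tr>" +
            "<td>Digi-Key Part Number</td><td>296-12602-1-ND</td>" +
            "</tr></table></div></body></html>";

        Console.WriteLine(FindXPathOfCell(html, "296-12602-1-ND") ?? "cell not found");
    }
}
```

Comparing the printed path with the one copied from the browser shows where the two views of the document diverge, which is usually the step where a hand-written absolute path starts returning null.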

Just for an update:
I switched from c# into a bit more friendly Python (my experience with programming is asm, c, and python, the whole OO thing was totally new) and managed to correct my xpath issues. The tag was indeed the problem, but luckily it's unique, so a little regular expression and a removed line and I was in good shape. I'm not sure why a tag like that breaks the XPATH though. If anyone has some insight I'd like to hear it.
