Getting specific data from a webpage using only class attributes - C#

I have some source code on a webpage that I wish to extract (I've narrowed it down to exactly what is relevant here):
<div class="sideInfoPlayer">
<a class="signLink" href="spieler.php?uid=12345" title="Profile">
<span class="wrap">Wagamama</span>
</a>
Now the trick here is that I want to get the word Wagamama into a message box, but that word changes on every page of the site, and there is no ID on this page to target. Therefore I was thinking of searching first for the class named "sideInfoPlayer" and then finding the "wrap" class within that block.
I have written the code below to find the first class, but I do not know how to tackle the second one and then get the desired value.
HtmlElementCollection col = webBrowser1.Document.GetElementsByTagName("div");
foreach (HtmlElement element in col)
{
string cls = element.GetAttribute("className");
if (String.IsNullOrEmpty(cls) || !cls.Equals("sideInfoPlayer"))
continue;
}
I hope you can help unstuck me on this one.

You have better options. Look at http://htmlagilitypack.codeplex.com/
And here: How can I parse an HTML string
First you'll need to add a reference to the HtmlAgilityPack library, by downloading it manually or with the NuGet package manager.
// loading html into HtmlDocument
var doc = new HtmlWeb().Load("http://website.com/mypage");
// walking through all nodes of interest
foreach (var node in doc.DocumentNode.SelectNodes("//div[@class='sideInfoPlayer']//span[@class='wrap']"))
{
// here is your text: node.InnerText
}
//div[@class='sideInfoPlayer']//span[@class='wrap'] is called an XPath expression, and this one literally means "get me all span elements with class=wrap that are descendants of a div element with class=sideInfoPlayer" (the double slash before span is needed because the span sits inside the a element rather than directly under the div).
I didn't test it, but it should work.
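As a self-contained illustration of that XPath, here is a sketch that runs it over the question's fragment. System.Xml is used only because this particular snippet happens to be well-formed; a real page should go through Html Agility Pack as above, which evaluates the same XPath 1.0 expressions.

```csharp
using System;
using System.Xml;

// The question's fragment, closed up so it parses as XML. This is only
// for illustration; real-world HTML needs an HTML-tolerant parser
// such as Html Agility Pack.
var doc = new XmlDocument();
doc.LoadXml(@"<div class='sideInfoPlayer'>
  <a class='signLink' href='spieler.php?uid=12345' title='Profile'>
    <span class='wrap'>Wagamama</span>
  </a>
</div>");

// Descendant axis (//span) because the span sits inside the <a>,
// not directly under the div.
string playerName = "";
foreach (XmlNode node in doc.SelectNodes("//div[@class='sideInfoPlayer']//span[@class='wrap']"))
{
    playerName = node.InnerText;
}
Console.WriteLine(playerName); // Wagamama
```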

Html Agility Pack, search through site for a specified string of words

I'm using the Html Agility Pack for this task. Basically I've got a URL, and my program should read through the content of the HTML page at that URL; if it finds a given line of text (e.g. "John had three apples"), it should change a label's text to "Found it".
I tried to do it with contains, but I guess it only checks for one word.
var nodeBFT = doc.DocumentNode.SelectNodes("//*[contains(text(), 'John had three apples')]");
if (nodeBFT != null && nodeBFT.Count != 0)
myLabel.Text = "Found it";
EDIT: Rest of my code, now with ako's attempt:
if (CheckIfValidUrl(v)) // foreach var v in a list..., checks if the URL works
{
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(v);
try
{
if (doc.DocumentNode.InnerHtml.ToString().Contains("string of words"))
{
mylabel.Text = v;
}
...
One possible option is using . instead of text(). Passing text() to the contains() function the way you did is, as you suspected, effective only when the searched text is inside the first text node of the current element:
doc.DocumentNode.SelectNodes("//*[contains(., 'John had three apples')]");
On the other hand, contains(., '...') evaluates the entire text content of the current element, concatenated. So, just a heads up, the above XPath will also consider the following element, for example, as a match:
<span>John had <br/>three <strong>apples</strong></span>
If you need the XPath to only match when the entire keyword is contained in a single text node, and therefore to treat the above case as a no-match, you can try this:
doc.DocumentNode.SelectNodes("//*[text()[contains(., 'John had three apples')]]");
If none of the above works for you, please post a minimal HTML snippet that contains the keyword but returns no match, so we can examine further what is causing that behavior and how to fix it.
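A quick way to see the difference is to run both expressions over the split-up example element. System.Xml is used here as a stdlib stand-in since the snippet is well-formed; Html Agility Pack evaluates the same XPath 1.0 semantics:

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<root><span>John had <br/>three <strong>apples</strong></span></root>");

// contains(., ...) looks at the concatenated text of the element,
// so the span whose phrase is split across child elements is a match:
int withDot = doc.SelectNodes("//span[contains(., 'John had three apples')]").Count;

// text()[contains(., ...)] requires a single text node to hold the
// whole phrase, so the same span is not a match:
int withTextNode = doc.SelectNodes("//span[text()[contains(., 'John had three apples')]]").Count;

Console.WriteLine($"{withDot} {withTextNode}"); // 1 0
```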
use this:
if (doc.DocumentNode.InnerHtml.ToString().Contains("John had three apples"))
myLabel.Text="Found it";

Extract a certain part of HTML with XPath and HtmlAgilityPack

I am having an issue with XPath syntax, as I don't understand how to use it to extract certain HTML elements.
I am trying to load a videos information from a channel page; http://www.youtube.com/user/CinemaSins/videos
I know there is a line that holds all the details: views, title, ID, etc.
Here is what I am trying to get from within the html:
That's line 2836:
<div class="yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item" data-context-item-id="ntgNB3Mb08Y" data-context-item-views="243,456 views" data-context-item-time="9:01" data-context-item-type="video" data-context-item-user="CinemaSins" data-context-item-title="Everything Wrong With The Chronicles Of Riddick In 8 Minutes Or Less">
I'm not sure how, but I have the Html Agility Pack added as a resource and have started attempts at getting it.
Can someone explain how to get all of those details and the XPath syntax involved?
What I have attempted:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']//a"))
{
if (node.ChildNodes[0].InnerHtml != String.Empty)
{
title.Add(node.ChildNodes[0].InnerHtml);
}
}
^ The above code works, but it only gets the title of each video, and it also picks up a blank entry as well.
Your xpath is selecting the <a> element inside the <div>. If you want the attributes of the <div> too, then you need to either
a) select both elements and process them separately.
b) run several xpath queries where you specify the exact attribute you want.
Let's go with (a) for this example.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']");
and get the attributes and title like so:
foreach(var node in nodes)
{
foreach(var attribute in node.Attributes)
{
// ... Get the values of the attributes here.
}
var linkNodes = node.SelectNodes(".//a");
// ... Get the InnerHtml as per your own example.
}
I hope this was clear enough. Good luck.
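One caveat with the inner SelectNodes call: an XPath that starts with // always searches from the document root, even when called on an individual node, so "//a" inside the loop would re-select every link on the page. Prefix it with a dot (".//a") to stay within the current node. A minimal sketch of the difference (System.Xml is used as a stdlib stand-in; Html Agility Pack behaves the same way here):

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<root><div id='one'><a>first</a></div><div id='two'><a>second</a></div></root>");

XmlNode divTwo = doc.SelectSingleNode("//div[@id='two']");

// "//a" ignores the context node and searches the whole document:
int absolute = divTwo.SelectNodes("//a").Count;

// ".//a" is relative to the node it is called on:
int relative = divTwo.SelectNodes(".//a").Count;

Console.WriteLine($"{absolute} {relative}"); // 2 1
```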
It seems the answer given to me did not help whatsoever, so after HEAPS of digging I finally understood how XPath works and managed to do it myself, as seen below:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']"))
{
String val = node.Attributes["data-context-item-id"].Value;
videoid.Add(val);
}
I just had to grab the content within the class. Knowing this made it a lot easier to use.
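Since all the data-context-item-* values live on the same div, one pass over the node's Attributes collection can collect them all at once instead of indexing each name separately. A sketch against a shortened version of the question's element (System.Xml stands in for Html Agility Pack here; the attribute access is analogous):

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// The video div from the question, shortened, parsed with System.Xml
// purely for illustration (a real YouTube page needs an HTML parser).
var doc = new XmlDocument();
doc.LoadXml("<div class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item' " +
            "data-context-item-id='ntgNB3Mb08Y' " +
            "data-context-item-time='9:01' " +
            "data-context-item-user='CinemaSins'/>");

// Collect every data-context-item-* attribute into a dictionary in one pass.
var details = new Dictionary<string, string>();
XmlNode node = doc.SelectSingleNode("//div");
foreach (XmlAttribute attr in node.Attributes)
{
    if (attr.Name.StartsWith("data-context-item-"))
        details[attr.Name] = attr.Value;
}
Console.WriteLine(details["data-context-item-id"]);   // ntgNB3Mb08Y
Console.WriteLine(details["data-context-item-time"]); // 9:01
```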

Selenium C# Webdriver FindElements(By.LinkText) RegEx?

Is it possible to find links on a webpage by searching their text using a pattern like A-ZNN:NN:NN:NN, where N is a single digit (0-9)?
I've used Regex in PHP to turn text into links, so I was wondering if it's possible to use this sort of filter in Selenium with C# to find links that will all look the same, following a certain format.
I tried:
driver.FindElements(By.LinkText("[A-Z][0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2}")).ToList();
But this didn't work. Any advice?
In a word, no. None of the FindElement() strategies support using regular expressions for finding elements. The simplest way to do this would be to use FindElements() to find all of the links on the page and match their .Text property against your regular expression.
Note though that if clicking on the link navigates to a new page in the same browser window (i.e., does not open a new browser window when clicking on the link), you'll need to capture the exact text of all of the links you'd like to click on for later use. I mention this because if you try to hold onto the references to the elements found during your initial FindElements() call, they will be stale after you click on the first one. If this is your scenario, the code might look something like this:
// WARNING: Untested code written from memory.
// Not guaranteed to be exactly correct.
List<string> matchingLinks = new List<string>();
// Assume "driver" is a valid IWebDriver.
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
// You could probably use LINQ to simplify this, but here is
// the foreach solution
foreach(IWebElement link in links)
{
string text = link.Text;
if (Regex.IsMatch(text, "your Regex here"))
{
matchingLinks.Add(text);
}
}
foreach(string linkText in matchingLinks)
{
IWebElement element = driver.FindElement(By.LinkText(linkText));
element.Click();
// do stuff on the page navigated to
driver.Navigate().Back();
}
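The regular-expression half of this can be tried on its own. The pattern below is an assumption based on the question's A-ZNN:NN:NN:NN description, the sample strings are made-up stand-ins for link .Text values, and the LINQ form is the simplification hinted at in the comments above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed pattern for "A-ZNN:NN:NN:NN": one capital letter, then
// four colon-separated two-digit groups.
var pattern = new Regex(@"^[A-Z]\d{2}:\d{2}:\d{2}:\d{2}$");

// Stand-ins for the .Text values of links found on a page.
string[] linkTexts = { "A12:34:56:78", "Home", "Z00:01:02:03", "B1:2:3:4" };

// LINQ equivalent of the foreach filter.
var matchingLinks = linkTexts.Where(t => pattern.IsMatch(t)).ToList();

Console.WriteLine(string.Join(",", matchingLinks)); // A12:34:56:78,Z00:01:02:03
```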
Don't use regex to parse HTML.
Use Html Agility Pack.
You can follow these steps:
Step 1: Use an HTML parser to extract all the links from the webpage and store them in a list.
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
//collect all links here
}
Step 2: Use this regex to match the links in the list:
.*?[A-Z]\d{2}:\d{2}:\d{2}:\d{2}.*?
Step 3: You get your desired links.

Scraping HTML from Financial Statements

First attempt at learning to work with HTML in Visual Studio and C#. I am using the Html Agility Pack library to do the parsing.
From this page I am attempting to pull out information from various places and save it as correctly formatted strings.
Here is my current code (taken from: shriek):
HtmlNode tdNode = document.DocumentNode.DescendantNodes().FirstOrDefault(n => n.Name == "td"
&& n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
HtmlNode trNode = tdNode.ParentNode;
foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
{
Console.WriteLine(node.InnerText.Trim());
//Output:
//Net Income
//265.00
//298.00
//601.00
//672.00
//666.00
}
}
It works correctly; however, I want to get more information and I am unsure of how to search through the HTML correctly. First, I would like to be able to select these numbers from the annual data, not only from the quarterly (the View option at the top of the page).
I would also like to get the dates for each column of numbers, both quarterly and annual (the "As of ..." at the top of each column).
Also, for future projects: does Google provide an API for this?
If you take a close look at the original input HTML source, you will see its data is organized around 6 main sections: DIV elements with one of the following 'id' attributes: "incinterimdiv", "incannualdiv", "balinterimdiv", "balannualdiv", "casinterimdiv", "casannualdiv". Obviously, these match Income Statement, Balance Sheet, and Cash Flow, for quarterly or annual data.
Now, when you're scraping a site with Html Agility Pack, I suggest you use XPATH, which is the easiest way to get to any node inside the HTML code, without any dependency on XML, as Html Agility Pack supports plain XPATH over HTML.
XPATH has to be learned, for sure, but it is very elegant because it does so many things in just one line. I know this may look old-fashioned next to the cool new C#-oriented XLinq syntax :), but XPATH is much more concise. It also enables you to concentrate the bindings between your code and the input HTML in plain old strings, and avoid recompiling the code when the input source evolves (for example, when an ID changes). This makes your scraping code more robust and future-proof. You could also put the XPATH bindings in an XSL(T) file, to be able to transform the HTML into the data presented as XML.
Anyway, enough digression :) Here is a sample code that allows you to get the financial data from a specific line title, and another that gets all data from all lines (from one of the 6 main sections):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");
// How get a specific line:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the given text (trimmed)
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']"))
{
Console.WriteLine("Title:" + node.InnerHtml.Trim());
// get all following sibling TD elements
foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
{
Console.WriteLine(" data:" + sibling.InnerText.Trim()); // InnerText works also for negative values
}
}
// How to get all lines:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the class 'lft lm'
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']"))
{
Console.WriteLine("Title:" + node.InnerHtml.Trim());
foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
{
Console.WriteLine(" data:" + sibling.InnerText.Trim());
}
}
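The following-sibling axis used above can be seen in isolation on a minimal row mimicking the statement layout (System.Xml as a stdlib stand-in for Html Agility Pack; the axis works the same in both):

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// A minimal table row in the shape of a financial-statement line,
// to show what following-sibling::td selects.
var doc = new XmlDocument();
doc.LoadXml("<tr><td>Net Income</td><td>265.00</td><td>298.00</td><td>601.00</td></tr>");

// Find the title cell, then walk its sibling cells for the figures.
XmlNode title = doc.SelectSingleNode("//td[normalize-space(text()) = 'Net Income']");
var values = new List<string>();
foreach (XmlNode sibling in title.SelectNodes("following-sibling::td"))
{
    values.Add(sibling.InnerText);
}
Console.WriteLine(string.Join(" ", values)); // 265.00 298.00 601.00
```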
You have two options. One is to reverse engineer the HTML page: figure out what JavaScript code runs when you click on Annual Data, see where it gets the data from, and request the data yourself.
The second solution, which is more robust, is to use a platform such as Selenium, that actually emulates the web browser and runs JavaScript for you.
As far as I could tell, there's no CSV interface to the financial statements. Perhaps Yahoo! has one.
If you need to navigate around to get to the right page, then you probably want to look into using WatiN. WatiN was designed as an automated testing tool for web pages and drives a selected web browser to get the page. It also allows you to identify input fields and enter text in textboxes or push buttons. It's a lot like HtmlAgilityPack, so you shouldn't find it too difficult to master.
I would highly recommend against this approach. The HTML that Google is spitting out is likely highly volatile, so even once you solidify your parsing approach to get all of the data you need, in a day, a week or a month the HTML format could change completely and you would need to rewrite your parsing logic.
You should try to use something more static, like XBRL.
The SEC publishes XBRL filings for each publicly traded company here: http://xbrl.sec.gov/
You can use this toolkit to work with the data programmatically: http://code.google.com/p/xbrlware/
EDIT: The path of least resistance is actually using http://www.xignite.com/xFinancials.asmx, but this service costs money.

HTML Agility Pack Screen Scraping XPATH isn't returning data

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath that I'm seeing in Chrome DevTools and Firebug on Firefox, and what my C# program is seeing.
The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND
The code I'm currently using is pretty quick and dirty...
//This function retrieves data from the digikey
private static List<string> ExtractProductInfo(HtmlDocument doc)
{
List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
List<string> m_unparsedProductInfo = new List<string>();
//Base Node for part info
string m_baseNode = @"//html[1]/body[1]/div[2]";
//Write part info to list
m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
//More lines of similar form will go here for more info
//this retrieves digikey PN
foreach(HtmlNode node in m_unparsedProductInfoNodes)
{
m_unparsedProductInfo.Add(node.InnerText);
}
return m_unparsedProductInfo;
}
Although the path I'm using appears to be "correct", I keep getting null when I look at the list "m_unparsedProductInfoNodes".
Any idea what's going on here? I'll also add that if I do a SelectNodes on the base node, it only returns a div whose only significant child is "cs=####", which seems to vary with the browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit insisting that my expression doesn't evaluate to a node set; but leaving it out still leaves the problem that all data past div[2] is returned as null.
Try using this XPath expression:
/html[1]/body[1]/div[2]/cs=0[1]/rf=141[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]
Using Google Chrome Developer Tools and Firebug in Firefox, it seems the webpage has 'cs' and 'rf' tags before the first table, something like:
<cs="0">
<rf="141">
<table>
...
</table>
</rf>
</cs>
There is something that might be useful for finding out what is happening when you parse a known HTML file and don't get the results you expect. In this case I just did:
string xpath = "";
//In this case I'll get all cells and see what cell has the text "296-12602-1-ND"
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
if (node.InnerText.Trim() == "296-12602-1-ND")
xpath = node.XPath; //Here it is
}
Or you could just debug your application after the document loads and go through each child node until you find the node you want to get the info from. If you set a breakpoint when the InnerText is found, you can go through the parents and then keep looking for other nodes. I usually do that by entering commands manually in a 'watch' window and navigating the treeview to see properties, attributes and children.
Just for an update:
I switched from C# to the somewhat friendlier Python (my programming experience is asm, C, and Python; the whole OO thing was totally new) and managed to correct my XPath issues. The tag was indeed the problem, but luckily it's unique, so with a little regular expression and a removed line I was in good shape. I'm not sure why a tag like that breaks the XPath, though. If anyone has some insight I'd like to hear it.
