Removing useless TextNodes in HtmlAgilityPack - c#

I'm scraping a number of websites using HtmlAgilityPack. The problem is that it seems to insist on inserting TextNodes in most places which are either empty or just contain a mass of \n, whitespace and \r.
They tend to cause me issues when I'm counting child nodes, since Firebug doesn't show them, but HtmlAgilityPack does.
Is there a way of telling HtmlAgilityPack to stop doing this, or at least to clear out these text nodes? (I want to keep the USEFUL ones, though.) While we're here, same thing for comment and script tags.

You can use the following extension method:
static class HtmlNodeExtensions
{
    public static List<HtmlNode> GetChildNodesDiscardingTextOnes(this HtmlNode node)
    {
        return node.ChildNodes.Where(n => n.NodeType != HtmlNodeType.Text).ToList();
    }
}
And call it like this:
List<HtmlNode> nodes = someNode.GetChildNodesDiscardingTextOnes();

There is a difference between "no whitespace" and "some whitespace" between two nodes: for example, <b>bold</b> <i>italic</i> renders differently from <b>bold</b><i>italic</i>. So all-whitespace text nodes are still needed and significant.
Couldn't you preprocess the HTML and remove all nodes that you do not need, before starting the "real" scraping?
See also this answer for the "how to remove".
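For illustration, a minimal sketch of such a preprocessing pass, assuming whitespace-only text nodes, comments and script elements can all be discarded (doc is an HtmlAgilityPack HtmlDocument):
var disposable = doc.DocumentNode
    .Descendants()
    .Where(n => (n.NodeType == HtmlNodeType.Text && string.IsNullOrWhiteSpace(n.InnerText))
             || n.NodeType == HtmlNodeType.Comment
             || n.Name == "script")
    .ToList(); // materialize first: removing nodes while enumerating breaks the iteration
foreach (var node in disposable)
    node.Remove();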

Create an extension method that operates on the node's child collection (ChildNodes in HtmlAgilityPack) and uses some LINQ to filter out unwanted nodes. Then, when you traverse your tree, do something like this (FilterNodes being the hypothetical extension):
myNode.ChildNodes.FilterNodes().ForEach(x => { /* ... */ });

I am looking for a better answer. Here is my current method with respect to child nodes like table rows and table cells. Nodes are identified by their names (TR, TH, TD), so I strip out #text every time.
List<HtmlNode> rows = table.ChildNodes.Where(w => w.Name != "#text").ToList();
Sure, it is tedious, but it works, and it could be improved by an extension (see the sketch below).
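As a rough sketch, such an extension could simply keep element children instead of matching names (the method name here is illustrative):
static class HtmlNodeExtensions
{
    // Keep only element children (TR, TH, TD, ...), dropping #text and comment nodes.
    public static List<HtmlNode> ElementChildren(this HtmlNode node)
    {
        return node.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element).ToList();
    }
}
// Usage:
List<HtmlNode> rows = table.ElementChildren();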


When should I use XPath?

Consider the following example, where a ul element's id is known, and we want to Click() one of the li elements it contains if that li's Text equals a certain string.
Here are two working solutions to this problem:
Method 1: Using XPath
ReadOnlyCollection<IWebElement> lis = FindElements(By.XPath("//ul[@id='id goes here']/li"));
foreach (IWebElement li in lis) {
    if (li.Text == text) {
        li.Click();
        break;
    }
}
Method 2: Using ID and TagName
IWebElement ul = FindElement(By.Id("id goes here"));
ReadOnlyCollection<IWebElement> lis = ul.FindElements(By.TagName("li"));
foreach (IWebElement li in lis) {
    if (li.Text == text) {
        li.Click();
        break;
    }
}
My question is: When should we use XPath and when shouldn't we?
I prefer to use XPath only when necessary. For this specific example, I think that XPath is completely unnecessary, but when I looked up this specific problem on StackOverflow, it seems that a majority of users default to using XPath.
In this particular case, XPath can even simplify the problem to a single line:
driver.FindElement(By.XPath(String.Format("//ul[@id='id goes here']/li[. = '{0}']", text))).Click();
In general though, if you can uniquely identify an element using a simple By.Id or By.TagName or another similar "simple" locator, do it. XPath-expression and CSS-selector based locators usually either provide advanced ways to locate elements (you can go up/down/sideways in the tree, use partial attribute matches, count elements, determine their position, etc.) or make the element's location concise, as in this particular situation.
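For instance, the same lookup works with a CSS selector (the id here is a placeholder, FirstOrDefault needs System.Linq, and note that CSS selectors cannot match on text, so the text comparison stays in code):
IWebElement match = driver.FindElements(By.CssSelector("ul#someId > li"))
                          .FirstOrDefault(li => li.Text == text);
if (match != null)
{
    match.Click();
}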
When you need to track down several similar web elements, use XPath.
When you need one particular element, use id.
XPath has an added advantage, because sometimes ids get duplicated.
This is my experience!

Extract a certain part of HTML with XPath and HtmlAgilityPack

I am having an issue with XPath syntax, as I don't understand how to use it to extract certain HTML elements.
I am trying to load a video's information from a channel page: http://www.youtube.com/user/CinemaSins/videos
I know there is a line that holds all the details: views, title, ID, etc.
Here is what I am trying to get from within the HTML.
That's line 2836:
<div class="yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item" data-context-item-id="ntgNB3Mb08Y" data-context-item-views="243,456 views" data-context-item-time="9:01" data-context-item-type="video" data-context-item-user="CinemaSins" data-context-item-title="Everything Wrong With The Chronicles Of Riddick In 8 Minutes Or Less">
I'm not sure how, but I have Html Agility Pack added as a reference and have started attempting to get it.
Can someone explain how to get all of those details and the XPath syntax involved?
What I have attempted:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']//a"))
{
    if (node.ChildNodes[0].InnerHtml != String.Empty)
    {
        title.Add(node.ChildNodes[0].InnerHtml);
    }
}
The above code works, but it only gets the title of each video, and it also picks up a blank entry as well.
Your XPath is selecting the <a> elements inside the <div>. If you want the attributes of the <div> too, then you need to either
a) select both elements and process them separately, or
b) run several XPath queries where you specify the exact attribute you want.
Let's go with (a) for this example.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']");
and get the attributes and title like so:
foreach (var node in nodes)
{
    foreach (var attribute in node.Attributes)
    {
        // ... Get the values of the attributes here.
    }
    // Note the leading dot: in Html Agility Pack, "//a" would search the whole
    // document again, while ".//a" searches under the current node only.
    var linkNodes = node.SelectNodes(".//a");
    // ... Get the InnerHtml as per your own example.
}
I hope this was clear enough. Good luck.
It seems the answer given did not help whatsoever, so after HEAPS of digging I finally understand how XPath works and managed to do it myself, as seen below:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix yt-lockup-video yt-lockup-grid context-data-item']"))
{
    string val = node.Attributes["data-context-item-id"].Value;
    videoid.Add(val);
}
I just had to grab the attribute content from within the element selected by its class. Knowing this made it a lot easier to use.
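Building on that, here is a short sketch pulling the rest of the details from each video block; the attribute names come from the HTML shown above, and contains() makes the class match a little less brittle:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[contains(@class, 'yt-lockup-video')]"))
{
    string id    = node.GetAttributeValue("data-context-item-id", "");
    string views = node.GetAttributeValue("data-context-item-views", "");
    string time  = node.GetAttributeValue("data-context-item-time", "");
    string title = node.GetAttributeValue("data-context-item-title", "");
    // ... store or print the values here
}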

Scraping HTML from Financial Statements

First attempt at learning to work with HTML in Visual Studio and C#. I am using the Html Agility Pack library to do the parsing.
From this page I am attempting to pull information from various places and save it as correctly formatted strings.
Here is my current code (taken from: shriek):
HtmlNode tdNode = document.DocumentNode.DescendantNodes().FirstOrDefault(n => n.Name == "td"
    && n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
    HtmlNode trNode = tdNode.ParentNode;
    foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
    {
        Console.WriteLine(node.InnerText.Trim());
        // Output:
        // Net Income
        // 265.00
        // 298.00
        // 601.00
        // 672.00
        // 666.00
    }
}
It works correctly; however, I want to get more information, and I am unsure how to search through the HTML correctly. First, I would like to be able to select these numbers from the annual data as well, not only from the quarterly data (the View option at the top of the page).
I would also like to get the dates for each column of numbers, both quarterly and annual (the "As of ..." at the top of each column).
Also, for future projects, does Google provide an API for this?
If you take a close look at the original input HTML source, you will see its data is organized around 6 main sections that are DIV elements with one of the following 'id' attributes: "incinterimdiv", "incannualdiv", "balinterimdiv", "balannualdiv", "casinterimdiv", "casannualdiv". Obviously, these match Income Statement, Balance Sheet, and Cash Flow for Quarterly or Annual Data.
Now, when you're scraping a site with Html Agility Pack, I suggest you use XPATH, which is the easiest way to get to any node inside the HTML code, without any dependency on XML, as Html Agility Pack supports plain XPATH over HTML.
XPATH has to be learned, for sure, but it is very elegant because it does so many things in just one line. I know this may look old-fashioned next to the new cool C#-oriented XLinq syntax :), but XPATH is much more concise. It also enables you to concentrate the bindings between your code and the input HTML in plain old strings, and avoid recompilation of the code when the input source evolves (for example, when the IDs change). This makes your scraping code more robust and future-proof. You could also put the XPATH bindings in an XSL(T) file, to be able to transform the HTML into the data presented as XML.
Anyway, enough digression :) Here is sample code that allows you to get the financial data for a specific line title, and another that gets all data from all lines (from one of the 6 main sections):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");
// How to get a specific line:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the given text (trimmed)
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']"))
{
    Console.WriteLine("Title:" + node.InnerHtml.Trim());
    // get all following sibling TD elements
    foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
    {
        Console.WriteLine(" data:" + sibling.InnerText.Trim()); // InnerText works also for negative values
    }
}
// How to get all lines:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the class 'lft lm'
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']"))
{
    Console.WriteLine("Title:" + node.InnerHtml.Trim());
    foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
    {
        Console.WriteLine(" data:" + sibling.InnerText.Trim());
    }
}
You have two options. One is to reverse engineer the HTML page, figure out what JavaScript code runs when you click on Annual Data, see where it gets the data from, and ask for the data yourself.
The second solution, which is more robust, is to use a platform such as Selenium, which actually drives a web browser and runs JavaScript for you.
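As a minimal sketch of that second option (the "Annual Data" link text and the Chrome driver are assumptions, not details taken from the page):
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

IWebDriver driver = new ChromeDriver();
driver.Navigate().GoToUrl("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");
driver.FindElement(By.LinkText("Annual Data")).Click(); // assumed link text
string html = driver.PageSource; // hand this off to Html Agility Pack as before
driver.Quit();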
As far as I could tell, there's no CSV interface to the financial statements. Perhaps Yahoo! has one.
If you need to navigate around to get to the right page, then you probably want to look into using WatiN. WatiN was designed as an automated testing tool for web pages and drives a selected web browser to get the page. It also allows you to identify input fields and enter text in textboxes or push buttons. It's a lot like HtmlAgilityPack, so you shouldn't find it too difficult to master.
I would highly recommend against this approach. The HTML that Google is spitting out is likely highly volatile, so even once you solidify your parsing approach to get all of the data you need, in a day, a week or a month the HTML format could all change and you would need to rewrite your parsing logic.
You should try to use something more static, like XBRL.
The SEC publishes XBRL for each publicly traded company here: http://xbrl.sec.gov/
You can use this toolkit to work with the data programmatically: http://code.google.com/p/xbrlware/
EDIT: The path of least resistance is actually using http://www.xignite.com/xFinancials.asmx, but this service costs money.

Get tags around text in HTML document using C#

I would like to search an HTML file for a certain string and then extract the tags. Given:
<div_outer><div_inner>Happy birthday</div></div>
I would like to search the HTML for "Happy birthday" and then have a function return some sort of tag structure: this is the innermost tag, this is the tag outside that one, etc. So, <div_inner> first, then <div_outer>.
Any ideas? I am thinking HTMLAgilityPack but I haven't been able to figure out how to do it.
Thanks as always, guys.
The HAP is indeed a good fit for this.
You can use a node's OuterHtml and ParentNode properties to get the enclosing elements and their markup, for example:
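A minimal sketch of that walk, assuming the target string sits in a single text node (text nodes in HAP have the name #text):
HtmlNode current = doc.DocumentNode
    .Descendants("#text")
    .FirstOrDefault(n => n.InnerText.Contains("Happy birthday"))
    ?.ParentNode;
while (current != null && current.NodeType == HtmlNodeType.Element)
{
    Console.WriteLine(current.Name); // innermost first: div_inner, then div_outer
    current = current.ParentNode;
}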
You could use XPath for this. I use the //*[text()='Happy birthday'][1]/ancestor-or-self::* expression, which finds the first (for simplicity) node whose text content is Happy birthday, and then returns all the ancestors (parent, grandparent, etc.) of this node, plus the node itself:
var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday</div></div>");
var ancestors = doc.DocumentNode
    .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
    .Reverse()
    .ToList();
It seems that the order of the nodes returned is the order in which the nodes are found in the document, so I used the Enumerable.Reverse method to reverse it.
This will return 2 nodes: div_inner and div_outer.

strip out tag occurrences from XML

I'd like to strip out occurrences of a specific tag, leaving the inner XML intact. I'd like to do this with one pass (rather than searching, replacing, and starting from scratch again). For instance, from the source:
<element>
  <RemovalTarget Attribute="Something">
    Content Here
  </RemovalTarget>
</element>
<element>
  More Here
</element>
I'd like the result to be:
<element>
  Content Here
</element>
<element>
  More Here
</element>
I've tried something like this (forgive me, I'm new to LINQ):
var elements = from element in doc.Descendants()
               where element.Name.LocalName == "RemovalTarget"
               select element;
foreach (var element in elements) {
    element.AddAfterSelf(element.Value);
    element.Remove();
}
But on the second time through the loop I get a null reference, presumably because the collection is invalidated by changing it. What is an efficient way to remove these tags from a potentially large document?
You'll have to skip the deferred execution with a call to ToList, which probably won't hurt your performance in large documents as you're just going to be iterating and replacing at a much lower big-O than the original search. As @jacob_c pointed out, I should be using element.Nodes() to replace it properly, and as @Panos pointed out, I should reverse the list in order to handle nested replacements accurately.
Also, use XElement.ReplaceWith, much faster than your current approach in large documents:
var elements = doc.Descendants("RemovalTarget").ToList();
elements.Reverse(); // List<T>.Reverse() reverses in place and returns void,
                    // so it can't be chained onto the ToList() call
/* Reverse on the List<T> may be faster than Reverse on the IEnumerable<T>;
 * that needs benchmarking, but it can't be any slower.
 */
foreach (var element in elements) {
    // Materializing the children first avoids mutating the sequence
    // while ReplaceWith moves them up to the parent.
    element.ReplaceWith(element.Nodes().ToList());
}
One last point: in reviewing what this MAY be used for, I tend to agree with @Trull that XSLT may be what you're actually looking for, if, say, you're removing all <b> tags from a document. Otherwise, enjoy this fairly decent and fairly well performing LINQ to XML implementation.
Have you considered using XSLT? It seems like the perfect solution, as you are doing exactly what XSLT is meant for: transforming one XML doc into another. The templating system will delve into nested nastiness for you without problems.
Here is a basic example:
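As a sketch of that approach driven from C# (the stylesheet is the standard identity transform plus one template that drops RemovalTarget elements while keeping their children; the file names are placeholders):
const string xslt = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <!-- identity template: copy everything through unchanged -->
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <!-- for RemovalTarget, emit only its children -->
  <xsl:template match='RemovalTarget'>
    <xsl:apply-templates select='node()'/>
  </xsl:template>
</xsl:stylesheet>";

var transform = new System.Xml.Xsl.XslCompiledTransform();
transform.Load(System.Xml.XmlReader.Create(new System.IO.StringReader(xslt)));
transform.Transform("input.xml", "output.xml");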
I would recommend XSLT, as Trull suggested, as the best solution.
Alternatively, you might look at using a StringBuilder and regex matching to remove the items.
You could also look at walking the document and working with nodes and parent nodes to effectively move the content from inside the node to the parent, but it would be tedious, and quite unnecessary given the other potential solutions out there.
A lightweight solution would be to use XmlReader to go through the input document and XmlWriter to write the output.
Note: the XmlReader and XmlWriter classes are abstract; use the derived classes appropriate for your situation.
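A minimal sketch of that single pass (all types are in System.Xml; XmlReader.Create and XmlWriter.Create hand back suitable derived classes, and the file names are placeholders):
using (var reader = XmlReader.Create("input.xml"))
using (var writer = XmlWriter.Create("output.xml"))
{
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                if (reader.Name == "RemovalTarget")
                    continue; // drop the start tag; its children still stream through
                bool isEmpty = reader.IsEmptyElement; // capture before the cursor moves
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, false);
                if (isEmpty)
                    writer.WriteEndElement();
                break;
            case XmlNodeType.EndElement:
                if (reader.Name == "RemovalTarget")
                    continue; // drop the matching end tag
                writer.WriteFullEndElement();
                break;
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.CDATA:
                writer.WriteCData(reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
        }
    }
}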
Depending on how you manage your XML, you could use a regular expression to remove the tags.
Here's a simple console application that demonstrates the use of a regex:
static void Main(string[] args)
{
    string content = File.ReadAllText(args[0]);
    Regex openTag = new Regex("<([/]?)RemovalTarget([^>]*)>", RegexOptions.Multiline);
    string cleanContent = openTag.Replace(content, string.Empty);
    File.WriteAllText(args[1], cleanContent);
}
This leaves newline characters in the file, but it shouldn't be too difficult to augment the regular expression.
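For example, something along these lines should also swallow surrounding indentation and the trailing line break (an untested sketch):
Regex tagAndNewline = new Regex(@"[ \t]*<[/]?RemovalTarget[^>]*>\r?\n?", RegexOptions.Multiline);
string cleanContent = tagAndNewline.Replace(content, string.Empty);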
