HtmlAgilityPack and large HTML documents - C#

I have built a little crawler, and when trying it out I found that crawling certain sites drives my crawler to 98-99% CPU.
I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.
I then went to see which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :)
So now I am 99% certain it has to do with the following piece of code:
HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");
All I want to do is extract the links on the page. For large sites, is there any way I can get this to not use so much CPU?
I was thinking maybe limit what I fetch? What would be my best option here?
Certainly someone must have run into this problem before :)
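One hedged way to limit the fetch would be to cap how many bytes are downloaded before the string ever reaches HtmlAgilityPack. A minimal sketch, assuming a 512 KB cap (the limit and the helper name are illustrative choices, not anything from the original code):
// Requires: using System.IO; using System.Net; using System.Text;
private static string FetchCapped(string url, int maxBytes = 512 * 1024)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var ms = new MemoryStream())
    {
        var buffer = new byte[8192];
        int read;
        while (ms.Length < maxBytes && (read = stream.Read(buffer, 0, buffer.Length)) > 0)
            ms.Write(buffer, 0, read);
        // Truncation can cut the document mid-tag; HtmlAgilityPack will still
        // parse whatever it receives, but links near the cap may be lost.
        // Assumes UTF-8; a real crawler should honor the response's charset.
        return Encoding.UTF8.GetString(ms.ToArray());
    }
}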

Have you tried dropping the XPath and using the LINQ functionality?
var list = documentt.DocumentNode.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty));
That'll pull the href attribute of every anchor tag as an IEnumerable<string>; call .ToList() if you need a List<string>.

If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the document, and selectors are much faster than Html Agility Pack's. See a comparison.
CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on NuGet as CsQuery.
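For illustration, the same link extraction in CsQuery might look like this (a sketch under my own naming assumptions, not taken from the answer):
// Requires: using System.Linq; using CsQuery;
CQ doc = CQ.Create(_html);
// "a[href]" is a standard CSS attribute selector; CsQuery's index makes it fast.
var hrefs = doc["a[href]"]
    .Select(a => a.GetAttribute("href"))
    .ToList();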

".//a[@href]" is an extremely slow XPath expression. Try replacing it with "//a[@href]", or with code that simply walks the whole document and checks all A nodes.
Why this XPath is slow:
"." - start from the current node
"//" - select all descendant nodes
"a" - pick only "a" nodes
"[@href]" - that have an href attribute
Parts 1 and 2 end up as "for every node, select all of its descendant nodes", which is very slow.
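A hedged sketch of that walk-the-document alternative, reusing the question's documentt variable (Descendants avoids XPath evaluation entirely):
// Requires: using System.Linq;
var links = documentt.DocumentNode
    .Descendants("a")                          // every <a> element, one pass
    .Where(a => a.Attributes["href"] != null)  // keep only anchors with an href
    .Select(a => a.Attributes["href"].Value)
    .ToList();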

Related

Why is this seemingly simple XPath navigation not working?

I'm having what seems like a really simple problem. I'm trying to navigate to an element in HTML by Xpath, and can't seem to get it to function properly.
I want to grab a span from the HTML contents of a page. The page is fairly complex, so I've been using Firebug's "get element by xpath" and pasting the result into my code. I've noticed it's slightly different from the XPath you get from doing the same thing in Chrome, but they both seem to point to the same place.
The html I'm trying to navigate is found here. The field I'm trying to access via xpath is the first "Results 1 - 10 of n".
Based on FireBug's 'inspect element' the xpath should be: /html/body/div/center/table/tbody/tr[6]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/span
However, when I try to use this XPath to identify the element in C# code-behind, it gives me errors saying that the path cannot be found.
Am I doing something wrong here? I've tried a number of permutations of the XPath and I don't understand why this wouldn't cooperate in code.
Edit: I'm having this problem both in HtmlAgilityPack (where I managed to hack out a bad solution using regexes instead) and in a SELECT statement modeled after the answer found here
Edit 2: I'm trying to figure out this issue using Yahoo's free proxy as shown in the example here:
var query = 'SELECT * FROM html WHERE url="http://mattgemmell.com/2008/12/08/what-have-you-tried/" and xpath="//h1" and class="entry-title"';
var url = "http://query.yahooapis.com/v1/public/yql?q=" + query + "&format=json&callback=??";
$.getJSON(url, function(data){
    alert(data.query.results.h1.content);
});
I'm having the same problems with Html Agility Pack, but I'm more interested in getting this part to work. It works for the URL the answerer gave me (seen above). However, when I try even simple XPath expressions on the http://nl.newsbank.com URL, I get errors that no object has been retrieved, no matter how basic the XPath.
Edit 3: I thought I'd elaborate on the big picture of the larger problem I'm trying to solve, of which this problem is a critical component, in the hopes that it provides a little more insight.
To learn basic ASP.NET development skills from scratch, I decided to make a simple web application based around the news archive search at http://nl.newsbank.com/. In its current iteration, it sends a POST request (although I've now learned you can use a GET request and just dump the body at the end of the URL) to submit search criteria, as if the user had entered them in the search bar. It then searches the response (using regexes, not XPath, because I couldn't get that working) for the "Results 1-n of n" span, extracts n, and dumps it in a table. It's a cool little tool for looking up news occurrence rates over time.
I wanted to add functionality so that you could enter a date range (say May 2002 - June 2010) and run a frequency search for every month / week in that range. This is very easy to implement conceptually. However, the problem is that right now all of this happens server-side, and since there's no API, the HTTP response contains the entire page and is therefore very bandwidth-intensive. Sending dozens of queries at once would swallow absolutely unspeakable amounts of bandwidth and wouldn't be even a little scalable.
As a result I tried rewriting the application to work client-side. However, because of the same-origin policy I'm not able to send a request to an external host from the client side. There is a loophole, though: I can use a free Yahoo proxy that makes the request and converts the response to JSON, and then use the JSON exception to the same-origin policy to retrieve that data from the proxy.
Here's where I'm running into these XPath problems specific to http://nl.newsbank.com. I'm not able to retrieve HTML with any XPath, and I'm not sure why or how I can fix it. When debugging in VS2010, I receive the error Microsoft JScript runtime error: Unable to get value of the property 'content': object is null or undefined.
As paul t. already mentioned in a comment, the TBODY elements are generated by the WebKit engine. The next problem is that the DIV between BODY and CENTER does not exist on the page by default; it is added by a JS statement on line 119.
After stripping out the DIV and TBODY elements like
/html/body/center/table/tr[6]/td/table/tr/td[2]/table/tr/td/table/tr/td/table/tr/td/table/tr/td/span
I can successfully select a node with HtmlAgilityPack.
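A minimal sketch of that selection with HtmlAgilityPack (the variable names are mine; the point is loading the raw server response rather than the browser-modified DOM):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(pageHtml); // raw HTTP response, before any JS runs
var span = doc.DocumentNode.SelectSingleNode(
    "/html/body/center/table/tr[6]/td/table/tr/td[2]/table/tr/td/table/tr/td/table/tr/td/table/tr/td/span");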
Edit: don't use tools like Firebug for getting an XPath value on a website. Don't even use them if you just want to look at the source of the page. The problem with Firebug is that it shows you the current DOM document tree, which on almost every page has already been (heavily) modified by JS.
Your sample HTML page's elements haven't got many classes to select on, but if you're interested in the first <span> element that contains "Results: 1 - 10 of n", you can use an XPath expression that explicitly targets this textual content.
For example:
//table//span[starts-with(., "Results:")]
will select all <span> elements contained in a <table> that contain text beginning with "Results:" (the //table is not strictly necessary in your case, I think, but it might as well restrict things a little).
You want the first one of these <span>, so you can use this expression:
(//table//span[starts-with(., "Results:")])[1]
Note the parentheses around the whole previous expression, followed by [1] to select the first of all the <span> elements matching the text.
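In Html Agility Pack terms that could look like this (a sketch; doc is assumed to already hold the parsed page):
var resultsSpan = doc.DocumentNode.SelectSingleNode(
    "(//table//span[starts-with(., 'Results:')])[1]");
// SelectSingleNode returns null when nothing matches.
string text = resultsSpan != null ? resultsSpan.InnerText : null; // e.g. "Results: 1 - 10 of n"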
It may sound kind of simplistic, but the element you are looking for is the only element in the document using the CSS class "basic-text-white". I would think this would be a lot easier to find and extract than a long XPath. Web scraping is never a stable thing, but I would think this is probably as stable as the XPath. Trying to debug the XPath just about makes my eyes bleed.
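Html Agility Pack has no CSS selector engine built in, so a hedged sketch of that class lookup would phrase it in XPath:
// Matches any element whose class attribute is exactly "basic-text-white".
var node = doc.DocumentNode.SelectSingleNode("//*[@class='basic-text-white']");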

Can Html Agility Pack be used to parse HTML fragments?

I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
Not to mention how HTML is not a regular language
I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation but it'd save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)
Absolutely, that is what it excels at.
In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do - try to make sense of what is sometimes a jumble of mismatched tags. An imperfect science, but HtmlAgilityPack does it very well.
An alternative to Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPath. Additionally, its HTML parser is designed specifically with a variety of purposes in mind, and there are several options for parsing HTML: as a full document (missing html and body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, the same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).
See creating a new DOM for details.
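A hedged sketch of those three modes, with method names taken from my reading of the CsQuery docs (treat them as assumptions if your version differs):
// Full document: missing html/body tags are added around the content.
CQ doc = CQ.CreateDocument("<head><title>t</title></head>");
// Content block: not wrapped as a document, but DOM-mandatory tags
// such as tbody are inserted, the same way browsers do.
CQ content = CQ.Create("<table><tr><td>cell</td></tr></table>");
// True fragment: nothing is added at all.
CQ fragment = CQ.CreateFragment("<td>orphan cell</td>");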
Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
Using CsQuery is very straightforward, e.g.
CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);
// there are other methods for asynchronous web gets, creating from files, streams, etc.
// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];
// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
Finally CsQuery indexes documents on class, id, attribute, and tag - making selectors extremely fast compared to Html Agility Pack.

XML: Searching elements for specific text using C#

I'm trying to get a list of PDF links from different websites. First I'm using the Web client class to download the page source. I then use sgmlReader to convert the HTML to XML. So for one particular site, I'll get a tag that looks like this:
<p><a href="boardactionsummary.pdf">1985 to 1997 Board Action Summary</a></p>
I need to grab all the links that contain ".pdf". Obviously not all websites are laid out the same, so just searching for a <p> tag won't be dynamic enough. I'd rather not use LINQ, but I will if I have to. Thanks in advance.
LINQ makes this easy...
var hrefs = doc.Root.Descendants("a")
    .Where(a => a.Attribute("href").Value.ToUpper().EndsWith(".PDF"))
    .Select(a => a.Attribute("href"));
away you go! (note: did this from memory, so you might have to fix it somewhat)
This will break down for <a/> tags that don't have an href (anchors) but you can fix that surely...
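One hedged way to patch that hole: cast the attribute to string, which yields null instead of throwing when href is missing:
var hrefs = doc.Root.Descendants("a")
    .Select(a => (string)a.Attribute("href"))  // null when there is no href
    .Where(href => href != null && href.ToUpper().EndsWith(".PDF"));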
I think you have 2 options here. If you need only the links, you can use regular expressions to find matches for strings ending with .pdf. If you need to manipulate the XML structure or get other values from the XML, it would be better to use XmlDocument and an XPath query to find the nodes that link to a PDF file. Using LINQ to XML just reduces the number of lines of code you have to write.
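A sketch of that XmlDocument route (XPath 1.0 has no ends-with(), so translate() lower-cases the href before matching; the variable names are illustrative):
// Requires: using System.Xml;
var doc = new XmlDocument();
doc.LoadXml(xmlFromSgmlReader); // the sgmlReader output described in the question
var pdfLinks = doc.SelectNodes(
    "//a[contains(translate(@href, 'PDF', 'pdf'), '.pdf')]");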

Read specific text from page into string array in C#

I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression
to extract the content you want from a string, such as your html string.
Or you can use a DOM parser such as
Html Agility Pack
Hope this helps!
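For the DOM-parser route, a minimal Html Agility Pack sketch (the h2 tag is an assumption about where the titles live; adjust the XPath to the real markup):
// Requires: using System.Linq;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(pageSource);
var nodes = doc.DocumentNode.SelectNodes("//h2"); // null if nothing matches
string[] titles = nodes == null
    ? new string[0]
    : nodes.Select(n => HtmlAgilityPack.HtmlEntity.DeEntitize(n.InnerText)).ToArray();
// DeEntitize converts entities like &amp; back to their characters,
// which covers the "without losing any special characters" requirement.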
You could use something like this -
var text = "12 hello 45 yes 890 bye 999";
// Requires: using System.Linq;
var matches = System.Text.RegularExpressions.Regex.Matches(text, @"\d+")
    .Cast<System.Text.RegularExpressions.Match>()
    .Select(m => m.Value)
    .ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
If the page is well-formed XML, you could use LINQ to XML by loading the page into an XDocument and using XPath or another way of traversing to the element(s) you desire, then loading what you need into the array you're looking for (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution that could break at any time when a subtle change breaks the well-formedness of the XML. In that case you're probably better off using regular expressions. Either way, the page could be changed under you and your code will suddenly stop working.
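A minimal sketch of that XDocument approach (assuming well-formed XHTML and that the repeated string lives in <h2> elements; both are my assumptions):
// Requires: using System.Linq; using System.Xml.Linq;
var xdoc = XDocument.Parse(pageXml); // throws if the page is not well-formed XML
// Note: XHTML elements are usually namespaced; you may need XNamespace + ns + "h2".
string[] items = xdoc.Descendants("h2")
    .Select(e => e.Value)
    .ToArray();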
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.

Is there a jQuery-like CSS/HTML selector that can be used in C#?

I'm wondering if there's a jQuery-like css selector that can be used in C#.
Currently, I'm parsing some html strings using regex and thought it would be much nicer to have something like the css selector in jQuery to match my desired elements.
Update 10/18/2012
CsQuery is now in release 1.3. The latest release incorporates a C# port of the validator.nu HTML5 parser. As a result CsQuery will now produce a DOM that uses the HTML5 spec for invalid markup handling and is completely standards compliant.
Original Answer
Old question but new answer. I've recently released version 1.1 of CsQuery, a jQuery port for .NET 4 written in C# that I've been working on for about a year. Also on NuGet as "CsQuery"
The current release implements all CSS2 & CSS3 selectors, all jQuery extensions, and all jQuery DOM manipulation methods. It's got extensive test coverage, including all the tests from jQuery and Sizzle (the jQuery CSS selection engine). I've also included some performance tests for direct comparisons with Fizzler; for the most part CsQuery dramatically outperforms it. The exception is actually loading the HTML in the first place, where Fizzler is faster; I assume this is because Fizzler doesn't build an index. You get that time back after your first selection, though.
There's documentation on the github site, but at a basic level it works like this:
Create from a string of HTML
CQ dom = CQ.Create(htmlString);
Load synchronously from the web
CQ dom = CQ.CreateFromUrl("http://www.jquery.com");
Load asynchronously (non-blocking)
CQ.CreateFromUrlAsync("http://www.jquery.com", responseSuccess => {
    Dom = responseSuccess.Dom;
}, responseFail => {
    // ..
});
Run selectors & do jQuery stuff
var childSpans = dom["div > span"];
childSpans.AddClass("myclass");
The CQ object is like the jQuery object. The property indexer used above is the default method (like $(...)).
Output:
string html = dom.Render();
You should definitely see #jamietre's CsQuery. Check out his answer to this question!
Fizzler and Sharp-Query provide similar functionality, but the projects seem to be abandoned.
Not quite jQuery like, but this may help:
http://www.codeplex.com/htmlagilitypack
For XML you might use XPath...
I'm not entirely clear as to what you're trying to achieve, but if you have an HTML document that you're trying to extract data from, I'd recommend loading it with a parser; then it becomes fairly trivial to query the object to pull desired elements.
The parser I linked above allows for use of XPath queries, which sounds like what you are looking for.
Let me know if I've misunderstood.
