XPath + Firebug + XML/HTML + HTML Agility Pack (C#)

When using Firebug or a bookmarklet like this one:
javascript:(function(){var a=document.createElement("script");a.setAttribute("src","http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.js");if(typeof jQuery=="undefined"){document.getElementsByTagName("head")[0].appendChild(a)}(function(){if(typeof jQuery=="undefined"){setTimeout(arguments.callee,100)}else{jQuery("*").one("click",function(d){jQuery(this)[0].scrollIntoView();for(var e="",c=jQuery(this)[0];c&&c.nodeType==1;c=c.parentNode){var b=jQuery(c.parentNode).children(c.tagName).index(c)+1;b>1?(b="["+b+"]"):(b="");e="/"+c.tagName.toLowerCase()+b+e}window.location.hash="#xpath:"+e;prompt('Your expression:',e);d.preventDefault();d.stopPropagation();jQuery("*").unbind("click",arguments.callee)})}})()})();
I get an element's XPath expression for the HTML. In order to parse the HTML with HTML Agility Pack or SgmlReader, I need to convert it to XHTML (XML).
The problem (I think) is that the XPath for the XHTML differs from the XPath for the original HTML.
That's why Firebug's "Copy XPath" feature doesn't work when used with
HtmlNode valueNode = doc.DocumentNode.SelectSingleNode(Firebugs_XPath);
For example, Firebug/the bookmarklet gives (removing tbody doesn't help):
/html/body/div[2]/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]/form/table/tbody/tr[2]/td/div/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]/u
while the working expression is (give or take):
/html/body/div/table/tr[1]/td[2]/table/tr[1]//td[2]/table[2]/tr[1]//td[2]/table/tr/tr/td[2]/u
My question is: how do I fix this behaviour so that a Firebug XPath works with HTML Agility Pack?
Also, is it possible to use the bookmarklet with the built-in C# WebBrowser component?
I would really appreciate your help.

Firebug's representation of your markup can differ from the actual XHTML because it tries to normalise the markup, and the XPath queries are generated against that normalised version rather than the underlying source. I'm not sure it's possible to change this behaviour; you may just need to tweak the XPaths by hand.
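One common tweak, sketched below under assumptions: strip the tbody steps that the browser adds before handing the expression to HTML Agility Pack. The asker notes that removing tbody alone did not help in their case, presumably because element indexes also shift, but it is usually the first adjustment to try; the helper name, input file and shortened XPath are illustrative only.
using System;
using System.IO;
using HtmlAgilityPack;

class FirebugXPathFixer
{
    // Browsers insert <tbody> elements automatically; the raw HTML that
    // HTML Agility Pack parses usually has none, so drop those steps.
    static string StripTbody(string firebugXPath)
    {
        return firebugXPath.Replace("/tbody", "");
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(File.ReadAllText("page.html")); // hypothetical input file

        string firebugXPath = "/html/body/div[2]/table/tbody/tr/td[2]/u"; // shortened example
        HtmlNode valueNode = doc.DocumentNode.SelectSingleNode(StripTbody(firebugXPath));
        Console.WriteLine(valueNode?.InnerText);
    }
}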

I was having the same issue trying to get the correct XPath with Firebug and the Chrome and IE dev tools, so I wrote an application using HTML Agility Pack to find the XPath:
http://letschat.info/?p=23

Related

Getting nested Divs using HTML Agility Pack c#

I'm trying to scrape a web page (PubMed) to see how many references appear in specific articles (some articles have references, some don't). The problem I'm having is that the divs are all nested and share the same name, so I haven't been able to figure out what code is required to get the elements.
So far I've tried using contains() to grab a catch-all node and dig my way in from there, but that hasn't worked:
.SelectNodes("//div[contains(@class,'portlet_title')]");
I also tried copying the XPath, but that just returns null:
.SelectNodes("//*[@id='disc_col']/div[3]/div[1]/div/h3/span");
Any help would be appreciated, as I am no master of XPath.
For reference, a page that fits my criteria is:
http://www.ncbi.nlm.nih.gov/pubmed/?term=23489346 (the right-hand side says "Cited by * articles").
I've also browsed some other answers, but they all seemed to be for results with differently named divs (e.g. get all the divs ids on a html page using Html Agility Pack). Either I don't understand how to use this correctly, or my problem is different.
Thanks again.
Mike! Try using
var titles = website.DocumentNode.SelectNodes("//div[#class='portlet_title']");
The errors in your XPaths are: 1. attributes go inside "[]" with the "@" symbol, as written above; 2. in every XPath step you should write an index, e.g. "//div[3]/div[1]/div[1]/h3[1]/span[1]".
Good luck!
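A minimal sketch of how that selector might be used against the page from the question (HtmlWeb is HTML Agility Pack's built-in downloader; if the live page puts extra classes on those divs, a contains() test may be needed instead of the exact match):
using System;
using HtmlAgilityPack;

class PubMedCitedBy
{
    static void Main()
    {
        // HtmlWeb downloads and parses the page in one step.
        var web = new HtmlWeb();
        HtmlDocument website = web.Load("http://www.ncbi.nlm.nih.gov/pubmed/?term=23489346");

        // All portlet title divs, including the "Cited by N articles" block.
        var titles = website.DocumentNode.SelectNodes("//div[@class='portlet_title']");
        if (titles == null)
            return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode title in titles)
            Console.WriteLine(title.InnerText.Trim());
    }
}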

HtmlAgilityPack and large HTML Documents

I have built a little crawler, and when trying it out I found that it uses 98-99% CPU when crawling certain sites.
I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.
I then looked at which URLs were causing the CPU load and found that it was actually sites that are extremely large in size. Go figure :)
So now I am 99% certain it has to do with the following piece of code:
HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");
All I want to do is extract the links on the page, so for large sites... is there any way I can get this to not use so much CPU?
I was thinking of maybe limiting what I fetch? What would be my best option here?
Surely someone must have run into this problem before :)
Have you tried dropping the XPath and using the LINQ functionality?
var list = documentt.DocumentNode.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
That'll pull the href attribute of every anchor tag into a List<string>.
If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the documents, and selectors are much faster than HTML Agility Pack. See a comparison.
CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on nuget as CsQuery.
".//a[#href]" is extremely slow XPath. Tried to replace with "//a[#href]" or with code that simply walks whole document and checks all A nodes.
Why this XPath is slow:
"." starting with a node
"//" select all descendent nodes
"a" - pick only "a" nodes
"#href" with href.
Portion 1+2 ends up with "for every node select all its descendant nodes" which is very slow.
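A minimal sketch of the walk-the-document alternative mentioned above, using HTML Agility Pack's Descendants enumeration instead of XPath (names are illustrative):
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Enumerate every <a> element once and keep only those with an href attribute.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("a")
                  .Where(a => a.Attributes["href"] != null)
                  .Select(a => a.GetAttributeValue("href", string.Empty))
                  .ToList();
    }
}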

Can Html Agility Pack be used to parse HTML fragments?

I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
I need to be able to special-case IE conditional comments, and META and LINK elements inside IE conditional comments
Not to mention that HTML is not a regular language
I did some research on HTML parsers in .NET, and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before, and I don't know whether it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it, and no HTML or BODY.) I know I could read the documentation, but it would save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)
Absolutely, that is what it excels at.
In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do: try to make sense of what is sometimes a jumble of mismatched tags. It's an imperfect science, but HtmlAgilityPack does it very well.
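For the asker's specific case, here is a minimal sketch under assumptions (the fragment and output file name are illustrative) of parsing a head-only fragment, reading and updating its META elements, and writing the markup back out with HTML Agility Pack:
using System;
using HtmlAgilityPack;

class HeadFragmentExample
{
    static void Main()
    {
        // A fragment with no <html> or <body> wrapper, as in the question.
        string fragment = "<head><meta name=\"description\" content=\"old\" />" +
                          "<link rel=\"stylesheet\" href=\"site.css\" /></head>";

        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // SelectNodes returns null when nothing matches, so guard for that.
        var metas = doc.DocumentNode.SelectNodes("//meta");
        if (metas != null)
        {
            foreach (HtmlNode meta in metas)
            {
                Console.WriteLine(meta.GetAttributeValue("name", ""));
                meta.SetAttributeValue("content", "updated"); // modify in place
            }
        }

        // Write the (possibly modified) markup back out.
        doc.Save("output.ascx");
    }
}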
An alternative to Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPath. Additionally, its HTML parser is designed with a variety of purposes in mind, and there are several options for parsing HTML: as a full document (missing html and body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, the same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).
See creating a new DOM for details.
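A sketch of those three parsing modes, assuming the CQ.CreateDocument, CQ.Create and CQ.CreateFragment factory methods; check the "creating a new DOM" documentation referenced above for the exact defaults:
using CsQuery;

class CsQueryParsingModes
{
    static void Main()
    {
        string snippet = "<tr><td>cell</td></tr>";

        // Full document: html/body are added and orphaned content is moved into body.
        CQ asDocument = CQ.CreateDocument(snippet);

        // Content block: not wrapped as a full document, but DOM-mandatory
        // optional tags (e.g. tbody) are added, as a browser would.
        CQ asContent = CQ.Create(snippet);

        // True fragment: parsed exactly as given, no extra tags created.
        CQ asFragment = CQ.CreateFragment(snippet);
    }
}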
Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
Using CsQuery is very straightforward, e.g.
CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);
// there are other methods for asynchronous web gets, creating from files, streams, etc.
// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];
// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
Finally CsQuery indexes documents on class, id, attribute, and tag - making selectors extremely fast compared to Html Agility Pack.

Does .NET framework offer methods to parse an HTML string?

Knowing that I can't use HtmlAgilityPack, only straight .NET, say I have a string that contains some HTML that I need to parse and edit in the following ways:
find specific controls in the hierarchy by id or by tag
modify (and ideally create) attributes of those found elements
Are there methods available in .NET to do so?
HtmlDocument
GetElementById
HtmlElement
You can create a dummy HTML document:
// System.Windows.Forms.WebBrowser exposes the browser's HTML DOM
// (requires a reference to System.Windows.Forms).
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);      // load a blank document
HtmlDocument doc = w.Document; // the DOM of that blank document
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();
Output:
2
file:///c:
about:myUrl
Editing elements:
HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
    "src=\"c:\"",
    "src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Output:
file:///d:
Assuming you're dealing with well-formed HTML, you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
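A minimal sketch of that approach with System.Xml, assuming the input really is well-formed XHTML (the markup and attribute values are illustrative):
using System;
using System.Xml;

class XhtmlEdit
{
    static void Main()
    {
        string html = "<html><head><link id=\"theme\" rel=\"stylesheet\" href=\"old.css\"/></head>" +
                      "<body><p>Hello</p></body></html>";

        var doc = new XmlDocument();
        doc.LoadXml(html); // throws XmlException if the markup is not well-formed

        // Find by tag name...
        XmlNode firstLink = doc.GetElementsByTagName("link")[0];
        Console.WriteLine(firstLink.Attributes["href"].Value);

        // ...or by id via XPath, then modify or create attributes.
        XmlElement byId = (XmlElement)doc.SelectSingleNode("//*[@id='theme']");
        byId.SetAttribute("href", "new.css"); // modify an existing attribute
        byId.SetAttribute("media", "screen"); // create a new one

        Console.WriteLine(doc.OuterXml);
    }
}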
Aside from the HTML Agility Pack and porting HtmlUnit over to C#, solutions that sound solid are:
Most obviously, use regex. (System.Text.RegularExpressions)
Use an XML parser. (Because HTML is a system of tags, treat it like an XML document?)
LINQ?
One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it here.
Also, here is a post about LINQ vs regex.
You can look at how HTML Agility Pack works; however, it is .NET. You can reflect the assembly to see what it uses and reproduce it if you so wanted, but you'd be doing nothing more than moving the assembly, not making it any more .NET.

c# parse html using XPathDocument

I'm trying to parse an HTML page with XPathDocument, but it throws an error because the HTML is not XML...
Is there a way to do this or not?
Should use HtmlAgilityPack. Still the best!
Use something like Html Agility Pack, which can load your HTML into a DOM object that can be traversed with, for example, XPath queries.
Unless your HTML is in fact XHTML, it is usually not a valid XML structure with correctly opened and closed tags.
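A minimal sketch of that route: load the HTML with HTML Agility Pack and query it with XPath, either through its own SelectSingleNode or, in builds where HtmlDocument implements IXPathNavigable, through a standard XPathNavigator (the file name is illustrative):
using System;
using System.Xml.XPath;
using HtmlAgilityPack;

class ParseHtmlWithXPath
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("page.html"); // tolerant of HTML that is not well-formed XML

        // Option 1: HTML Agility Pack's own XPath support.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        Console.WriteLine(title?.InnerText);

        // Option 2: a standard XPathNavigator over the parsed document.
        XPathNavigator nav = doc.CreateNavigator();
        XPathNavigator node = nav.SelectSingleNode("//title");
        Console.WriteLine(node?.Value);
    }
}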
