Can Html Agility Pack be used to parse HTML fragments? - c#

I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
Not to mention how HTML is not a regular language
I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation but it'd save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)

Absolutely, that is what it excels at.
In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do - try to make sense of what is sometimes a jumble of mismatched tags. It is an imperfect science, but HtmlAgilityPack does it very well.
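As a rough illustration of the fragment case from the question (a user control containing only a HEAD element), the sketch below loads a fragment, rewrites every LINK href, and writes the markup back out. The class name `FragmentDemo`, the method `BumpLinkVersions`, and the `?v=` suffix are made up for this example; only `HtmlDocument.LoadHtml`, `SelectNodes`, `GetAttributeValue`, `SetAttributeValue` and `OuterHtml` come from the Html Agility Pack itself.

```csharp
using HtmlAgilityPack;

public static class FragmentDemo
{
    // Hypothetical helper: appends "?v=<version>" to the href of every LINK
    // element in an HTML fragment and returns the rewritten markup.
    public static string BumpLinkVersions(string fragment, string version)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment); // the input needs no <html> or <body> wrapper

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//link[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", string.Empty);
                link.SetAttributeValue("href", href + "?v=" + version);
            }
        }

        // OuterHtml of the document node serializes the (modified) fragment.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `FragmentDemo.BumpLinkVersions(text, "2")` on the contents of a user control should return the same fragment with only the LINK hrefs changed; the null check matters because many fragments will contain no LINK elements at all.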

An alternative to Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPath. Additionally, its HTML parser is designed with a variety of purposes in mind, and there are several options for parsing HTML: as a full document (missing html and body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, the same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).
See creating a new DOM for details.
Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
Using CsQuery is very straightforward, e.g.
CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);
// there are other methods for asynchronous web gets, creating from files, streams, etc.
// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];
// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
Finally, CsQuery indexes documents on class, id, attribute, and tag, which makes selectors extremely fast compared to Html Agility Pack.

Related

Html.Raw in controller

I get content from an editor, so the content includes HTML tags, like this: "dddd"
I must remove the HTML tags from the content because I write it to a PDF (generated in a C# controller action) using itextsharp.dll, and iTextSharp does not render the HTML tags; they show up as literal text in the PDF.
There is no Html.Raw or HtmlHelper.Raw function available in C# code (in a controller action).
What should I do? I tried to remove the HTML tags with a regex, but the content is very complex and dynamic, so there are a great many different HTML tags.
One approach would be to use an HTML parser like the HTML Agility Pack. I've used this successfully for problems like the one you describe (but am otherwise unaffiliated with its development). From the site:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
You'll find lots of examples online to tailor to your needs.
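For the specific goal of turning editor markup into plain text before handing it to a PDF writer, a minimal sketch could look like the following. `HtmlToText` and `ToPlainText` are names invented for this example; `InnerText` and `HtmlEntity.DeEntitize` are Html Agility Pack members. It assumes the editor content contains no script or style blocks, whose text would otherwise end up in the output.

```csharp
using HtmlAgilityPack;

public static class HtmlToText
{
    // Hypothetical helper: returns the text content of an HTML string
    // with all tags removed and character entities decoded.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // InnerText concatenates the text nodes; DeEntitize converts
        // entities such as &amp; back into the characters they stand for.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

The returned string can then be passed to whatever call currently receives the raw editor content.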
You can get the equivalent of Html.Raw and Json.Encode in a controller like this.
Example:
If I use this in a view
var attrilist = @Html.Raw(Json.Encode(attriFeildlist));
then I can use the following code as an alternative in the controller:
var jsonencode = System.Web.Helpers.Json.Encode(attriFeildlist);
var htmlencode= WebUtility.HtmlEncode(jsonencode);

HTML Agility pack - How to get URLs, which start with specific text?

The question is in the title, but to be more specific: can I get URLs from HTML that start with specific text? Is there perhaps a way to extract them jQuery-style?
$( "a[href^='event_handler']" )
Out of the box, the library doesn't support jQuery-type selectors (those are CSS selectors, FYI), only XPath or XSLT selectors. Of course, there are good people who took the time to add an extension for CSS selector support; see Add CSS Selector Query Engine onto HTMLAgilityPack.
Adding this, you can select your links with the string selector you've already provided yourself.
HTMLAgilityPack is based on using XPath queries, not CSS selectors (which is what you have in your original post).
If you absolutely must use CSS selectors, there is a tool I've used in the past to do this called Fizzler:
https://code.google.com/p/fizzler/
It sits on top of HTMLAgilityPack, so therefore much of the documentation stays the same.
I'd also say your question is a little confusing. Your CSS selector there is selecting something based on its href starting with a value, yet you mention you want to select something by its text, which is different. The below is a direct equivalent of your original selector:
//a[starts-with(@href, 'event_handler')]
However, to match on the actual text, not the href, then it's:
//a[starts-with(text(), 'event_handler')]
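To show how the first of those XPath expressions plugs into the Html Agility Pack, here is a small sketch. The class `LinkFinder` and method `HrefsStartingWith` are illustrative names, and the prefix is assumed to contain no single quote, because it is spliced directly into the XPath string.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkFinder
{
    // Hypothetical helper: returns the href of every <a> element whose
    // href attribute starts with the given prefix.
    public static List<string> HrefsStartingWith(string html, string prefix)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: prefix contains no single quote (it is not escaped here).
        var nodes = doc.DocumentNode.SelectNodes(
            "//a[starts-with(@href, '" + prefix + "')]");

        // SelectNodes returns null when there are no matches.
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
    }
}
```

Swapping `@href` for `text()` in the expression gives the text-based variant instead.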
You can also use LINQ:
doc.DocumentNode.SelectNodes("//li").Where(x => x.FirstChild.Attributes["href"].Value.StartsWith("event_handler")).Select(x => x.FirstChild.Attributes["href"].Value).ToList();

HtmlAgilityPack and large HTML Documents

I have built a little crawler, and when trying it out I found that on certain sites it uses 98-99% CPU.
I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.
I then checked which URLs were causing the CPU load and found that it was actually sites that are extremely large in size - go figure :)
So now I am 99% certain it has to do with the following piece of code:
HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");
All I want to do is extract the links on the page, so for large sites, is there any way I can get this to not use so much CPU?
I was thinking of maybe limiting what I fetch. What would be my best option here?
Certainly someone must have run into this problem before :)
Have you tried dropping the XPath and using the LINQ functionality?
var list = documentt.DocumentNode.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
That'll pull a list of the href attribute of all anchor tags as a List<string>.
If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the documents, and selectors are much faster than HTML Agility Pack. See a comparison.
CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on nuget as CsQuery.
".//a[@href]" is an extremely slow XPath. Try replacing it with "//a[@href]" or with code that simply walks the whole document and checks all A nodes.
Why this XPath is slow:
"." - start with a node
"//" - select all descendant nodes
"a" - pick only "a" nodes
"@href" - with an href attribute.
Parts 1 and 2 together end up as "for every node, select all its descendant nodes", which is very slow.

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.
Use the HTML Agility Pack library.
It's a very fine library for parsing HTML. For your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Your path (local, web)");
var result = doc.DocumentNode.SelectNodes("//body//text()"); // returns an HtmlNodeCollection
foreach (var node in result)
{
string achievedText = node.InnerText; // the text you want
}
It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)
You can strip tags using regular expressions such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex("<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better - use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to handle escape sequences like &amp;.
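A slightly fuller sketch of the regex approach, using only the .NET base class library, might look like this. `TagStripper` and `Strip` are names chosen for the example. It decodes entities only after the tags are gone, so that text such as `&lt;b&gt;` is not mistaken for a tag, and it shares the caveats in the note above about script content and angle brackets inside attributes.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches anything that looks like a tag; Singleline lets a tag span line breaks.
    private static readonly Regex Tag = new Regex("<.+?>", RegexOptions.Singleline);

    // Hypothetical helper: removes tags, then decodes entities such as &amp;.
    public static string Strip(string html)
    {
        string withoutTags = Tag.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For the example in the question, `TagStripper.Strip("<b>cake</b>")` gives back just the cake.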

Is there a jQuery-like CSS/HTML selector that can be used in C#?

I'm wondering if there's a jQuery-like css selector that can be used in C#.
Currently, I'm parsing some html strings using regex and thought it would be much nicer to have something like the css selector in jQuery to match my desired elements.
Update 10/18/2012
CsQuery is now in release 1.3. The latest release incorporates a C# port of the validator.nu HTML5 parser. As a result CsQuery will now produce a DOM that uses the HTML5 spec for invalid markup handling and is completely standards compliant.
Original Answer
Old question but new answer. I've recently released version 1.1 of CsQuery, a jQuery port for .NET 4 written in C# that I've been working on for about a year. Also on NuGet as "CsQuery"
The current release implements all CSS2 & CSS3 selectors, all jQuery extensions, and all jQuery DOM manipulation methods. It's got extensive test coverage including all the tests from jQuery and Sizzle (the jQuery CSS selection engine). I've also included some performance tests for direct comparisons with Fizzler; for the most part CsQuery dramatically outperforms it. The exception is actually loading the HTML in the first place, where Fizzler is faster; I assume this is because Fizzler doesn't build an index. You get that time back after your first selection, though.
There's documentation on the github site, but at a basic level it works like this:
Create from a string of HTML
CQ dom = CQ.Create(htmlString);
Load synchronously from the web
CQ dom = CQ.CreateFromUrl("http://www.jquery.com");
Load asynchronously (non-blocking)
CQ.CreateFromUrlAsync("http://www.jquery.com", responseSuccess => {
Dom = responseSuccess.Dom;
}, responseFail => {
..
});
Run selectors & do jQuery stuff
var childSpans = dom["div > span"];
childSpans.AddClass("myclass");
The CQ object is like the jQuery object. The property indexer used above is the default method (like $(...)).
Output:
string html = dom.Render();
You should definitely see @jamietre's CsQuery. Check out his answer to this question!
Fizzler and Sharp-Query provide similar functionality, but the projects seem to be abandoned.
Not quite jQuery like, but this may help:
http://www.codeplex.com/htmlagilitypack
For XML you might use XPath...
I'm not entirely clear as to what you're trying to achieve, but if you have a HTML document that you're trying to extract data from, I'd recommend loading it with a parser, and then it becomes fairly trivial to query the object to pull desired elements.
The parser I linked above allows for use of XPath queries, which sounds like what you are looking for.
Let me know if I've misunderstood.
