HtmlAgilityPack td.innertext bug?

I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.
This is a C# app in VS2010. Running in Debug, I see the string in my class begins:
Animal and vegetable oils 1 < 5 MW <br>5-50 MW 30 <br>
But when I assign with:
td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();
The td.InnerHtml shows
Animal and vegetable oils 1 < 5=\"\" mw=\"\"><br>5-50 MW 30 <br>
Why is it putting the equals signs and escaped quotes into that text? It doesn't do it across all the data, just a few files. Any ideas? (PS. There are HTML breaks in the strings that are not showing up. How do I post so that it ignores HTML? I tried the "indent with 4 spaces" approach but it didn't seem to work.)

HTML Agility Pack's HTML parser is treating the < as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag, and so it is treating them as tag attributes. This treatment stops once it runs into the <br> which forces it to close the tag.
The reason it works in browsers is that browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a less-than sign (<) followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case. So I wouldn't say this is a bug so much as a limitation of HAP's native HTML parser.
An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.
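If switching parsers is not an option, one workaround is to approximate the spec's rule before the string reaches HAP. The sketch below is an assumption-laden illustration, not part of either library: the class and method names are made up, and it treats a < as markup only when it is followed by a letter, /, ! or ?, escaping every other < as &lt;. It does not special-case script blocks or comments.

```csharp
using System.Text.RegularExpressions;

public static class BareLessThan
{
    // Hypothetical helper: a '<' is kept as markup only when the next
    // character is a letter, '/', '!' or '?'. Any other '<' is text.
    private static readonly Regex Bare = new Regex(@"<(?![A-Za-z/!?])");

    public static string Escape(string html)
    {
        // Replace each bare '<' with the entity so the parser sees text.
        return Bare.Replace(html, "&lt;");
    }
}
```

With this in place the assignment from the question would become td.InnerHtml = BareLessThan.Escape(this.DRGuideNote.ToString());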

Related

Handling html special character while Parsing html string using c#

I am using htmlagility pack to parse html string, and convert certain patterns to links.
Given an HTML string and a pattern "mystring", I have to replace each occurrence of this pattern in the HTML string with <a href="/mystring.html">mystring</a>. But there are two exceptions:
1. I should not replace the pattern if it is already within an anchor tag, which means neither its immediate parent nor any ancestor may be an anchor tag. For example: <a href="google.com"><span>mystring</span></a>
2. It should not be inside an href. For example: <a href="mystring">.
input string: "<li><span>mystring test</span></li><li><a href='#'><span>mystring</span></a></li>"
expected output: "<li><span><a href="/mystring.html">mystring</a> test</span></li><li><a href='#'><span>mystring</span></a></li>"
I am using htmlagilitypack and loading this string as html doc and getting all text and looking whether its any level parent is not an anchor and replacing it. Everything worked simple and fine. But there is a problem here.
If my input string is something like "<li><span>mystring test < 10 and 5</span></li>", there is a problem. The HtmlAgilityPack parser treats the less-than symbol as the start of a tag, considers "< 10 and 5" to be an HTML tag, and produces something like this:
< 10="" and="" 5=""> (attributes with empty values).
Is there a workaround for this using HtmlAgilityPack?
Should I take a step back and use regex? In that case, how do I handle the any-level anchor exception?
Is there a better approach for this problem?

Using < outside an HTML tag is invalid. Use the &lt; entity instead.
EDIT: If you don't have control over the input string, you may try replacing "< ":
inputhtml = inputhtml.Replace("< ", "&lt; ");
If there are any other errors, you can try importing MSHTML COM DLL. Reference COM dll "Microsoft HTML object library".

Two suggestions:
1) You could pre-clean the broken HTML so HtmlAgilityPack works better. This is possibly easier.
2) Or parse and track the nested structure of tags yourself, via a simple regex-based parser. But many HTML tags do not have to be explicitly closed, such as <TR> <TD> <P> <BR>, and you'll have to deal with the broken < angle brackets here too.
Option 2) is not hard, but will be more work up front, for a payoff in improved reliability and control over how you handle "malformed" inputs from a low-quality source.
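A minimal sketch of what the second suggestion could look like, using only the .NET base class library. Everything here (the Linkifier class, the Linkify method, the tag regex) is a hypothetical illustration rather than an established recipe. It assumes tags contain no > inside attribute values and that anchors are properly closed. Because the tag regex requires a letter after < or </, a bare "< 10" is never taken for a tag.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Matches an opening or closing tag whose name starts with a letter.
    private static readonly Regex Tag =
        new Regex(@"<(?<close>/?)(?<name>[A-Za-z][A-Za-z0-9]*)\b[^>]*>");

    public static string Linkify(string html, string pattern)
    {
        var output = new StringBuilder();
        string link = "<a href=\"/" + pattern + ".html\">" + pattern + "</a>";
        int anchorDepth = 0; // how many <a> elements are currently open
        int pos = 0;

        foreach (Match m in Tag.Matches(html))
        {
            // Text between the previous tag and this one.
            AppendText(output, html.Substring(pos, m.Index - pos), pattern, link, anchorDepth);

            // Tags are copied verbatim, so href values are never touched.
            output.Append(m.Value);

            if (m.Groups["name"].Value.Equals("a", StringComparison.OrdinalIgnoreCase))
            {
                if (m.Groups["close"].Value == "/") anchorDepth = Math.Max(0, anchorDepth - 1);
                else anchorDepth++;
            }
            pos = m.Index + m.Length;
        }

        // Trailing text after the last tag.
        AppendText(output, html.Substring(pos), pattern, link, anchorDepth);
        return output.ToString();
    }

    private static void AppendText(StringBuilder output, string text,
                                   string pattern, string link, int anchorDepth)
    {
        // Replace only when no anchor is open at any ancestor level.
        output.Append(anchorDepth == 0 ? text.Replace(pattern, link) : text);
    }
}
```

Tracking a depth counter instead of a full tag stack is enough here because the only nesting question that matters is whether some anchor is still open.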

Can Html Agility Pack be used to parse HTML fragments?

I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.
I could try using regular expressions to grab just these elements but there are several issues with that approach:
I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
Not to mention how HTML is not a regular language
I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation but it'd save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)

Absolutely, that is what it excels at.
In fact, many web pages you'll find in the wild could be described as HTML fragments, due to missing <html> tags, or improperly closed tags.
The HtmlAgilityPack simulates what the browser has to do: try to make sense of what is sometimes a jumble of mismatched tags. It's an imperfect science, but HtmlAgilityPack does it very well.
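As a rough illustration of the fragment case from the question, the sketch below loads a HEAD-only fragment, updates a META element and writes the markup back out. It assumes the HtmlAgilityPack NuGet package is referenced; the HeadFragmentDemo class, the RewriteDescription method and the sample attribute names are invented for the example. How the library formats the rewritten attributes is not something this sketch relies on.

```csharp
using System;
using HtmlAgilityPack;

public static class HeadFragmentDemo
{
    public static string RewriteDescription(string fragment, string newContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment); // no <html> or <body> wrapper is supplied

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var metas = doc.DocumentNode.SelectNodes("//meta[@name='description']");
        if (metas != null)
        {
            foreach (HtmlNode meta in metas)
            {
                meta.SetAttributeValue("content", newContent);
            }
        }

        // Serialize the (possibly modified) fragment back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

The same pattern with an XPath of "//link" would cover the LINK elements; IE conditional comments would still need separate handling.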

An alternative to Html Agility Pack is CsQuery, a C# jQuery port of which I am the primary author. It lets you use CSS selectors and the full jQuery API to access and manipulate the DOM, which for many people is easier than XPath. Additionally, its HTML parser is designed with a variety of purposes in mind, and there are several options for parsing HTML: as a full document (missing html and body tags will be added, and any orphaned content moved inside the body); as a content block (meaning it won't be wrapped as a full document, but optional tags such as tbody that are still mandatory in the DOM are added automatically, the same as browsers do); and as a true fragment where no tags are created (e.g. in case you're just working with building blocks).
See creating a new DOM for details.
Additionally, CsQuery's HTML parser has been designed to honor the HTML5 spec for optional closing tags. For example, closing p tags are optional, but there are specific rules that determine when the block should be closed. In order to produce the same DOM that a browser does, the parser needs to implement the same rules. CsQuery does this to provide a high degree of compatibility with browser DOM for a given source.
Using CsQuery is very straightforward, e.g.
CQ docFromString = CQ.Create(htmlString);
CQ docFromWeb = CQ.CreateFromUrl(someUrl);
// there are other methods for asynchronous web gets, creating from files, streams, etc.
// css selector: the indexer [] is like jQuery $(..)
CQ lastCellInFirstRow = docFromString["table tr:first-child td:last-child"];
// Text() is a jQuery method returning text contents of selection
string textOfCell = lastCellInFirstRow.Text();
Finally CsQuery indexes documents on class, id, attribute, and tag - making selectors extremely fast compared to Html Agility Pack.

HtmlDocument.Write Stripping Quotation Marks

For some reason when I try writing to an HtmlDocument it strips some (not all) of the quotation marks of the string I am giving it.
Look here:
HtmlDocument htmlDoc = Webbrowser1.Document.OpenNew(true);
htmlDoc.Write("<HTML><BODY><DIV ID=\"TEST\"></DIV></BODY></HTML>");
string temp = htmlDoc.GetElementsByTagName("HTML")[0].InnerHtml;
The result of temp is this:
<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>
It works exactly as it should except it is stripping the quotation marks. Does anyone have a solution on how to prevent or fix this?

There are no guarantees that innerHTML will return content identical to the string you passed in. The innerHTML is constructed by the browser from its HTML tree representation, so it will produce the resulting string as it sees fit.
So depending on your needs, you can either use some HTML parsing code that understands IDs without quotes around them, or try to convince the browser to use its latest engine, which is more likely to produce innerHTML to your liking.
For example, in your case it looks like at least IE9 renders your HTML in IE9:Quirks mode (which returns innerHTML in the shape you are not happy with). If you make the HTML valid or force the mode to IE9:Standards, you'll get a string with quotes, like this:
document.getElementsByTagName("html")[0].innerHTML
IE9:Standards - "<head></head><body><div id="TEST"></div></body>"
IE9:Quirks -
"<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>"
You can try it yourself by creating sample HTML file and opening from disk. F12 to show dev tools and check out mode in the menu bar.

C# has a quirky feature, the verbatim string literal. Sorry, I'm not sure of a VB equivalent.
Add an @ at the beginning of a string literal to turn off backslash escapes. A double quote inside such a literal is written as two double quotes.
htmlDoc.Write(@"<HTML><BODY><DIV ID=""TEST""></DIV></BODY></HTML>");
Also, this isn't important, but your HTML would not validate as XHTML, where all tag and attribute names must be lower case. E.g. <HTML> should be <html>.

Is this not a suitable scenario for an Html parser?

I have to deal with malformed Html and Html tags inside Html attributes:
<p class="<sometag attr="something"></sometag>">
Link
</p>
I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:
<p class="<sometag attr=" something"="">">
Link
</p>
The p tag becomes malformed and the sometag inside the class attribute of the p tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?

It's not valid HTML, so I don't think you can rely on an HTML parser to parse it.

You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be taking the source for an HTML parser such as Html Agility Pack and adding some functionality to the attribute parsing. Use a stack; for every opening character, push it, then read until you find another opening or closing character. If it's opening, push it, if it's closing, pop it.
Alternately, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser and changing the illegal characters in the attribute values to escaped characters (ampersand-semicolon). Unfortunately, this would require doing some preliminary parsing on your part.
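The last two ideas can be combined into a small pre-processing pass. The sketch below is only an illustration under stated assumptions: the AttributeRepair class and its method are hypothetical, attribute values are assumed to be double-quoted and written as name="value" with no spaces around the equals sign, and every < inside a value is assumed to have a matching >. It uses a depth counter in place of a full stack, treats a quote as the end of the value only when no nested tag is open, and then entity-encodes the value.

```csharp
using System;
using System.Net;
using System.Text;

public static class AttributeRepair
{
    public static string EscapeMarkupInAttributes(string html)
    {
        var output = new StringBuilder();
        int pos = 0;

        while (pos < html.Length)
        {
            // Assumed marker for the start of a double-quoted attribute value.
            int start = html.IndexOf("=\"", pos, StringComparison.Ordinal);
            if (start < 0) break;

            int valueStart = start + 2;
            int depth = 0; // number of nested '<' still waiting for '>'
            int end = -1;

            for (int i = valueStart; i < html.Length; i++)
            {
                char c = html[i];
                if (c == '<') depth++;
                else if (c == '>' && depth > 0) depth--;
                else if (c == '"' && depth == 0) { end = i; break; }
            }
            if (end < 0) break; // unbalanced input: leave the rest untouched

            // Copy everything up to and including the opening quote.
            output.Append(html, pos, valueStart - pos);
            // Encode <, >, & and quotes inside the value.
            output.Append(WebUtility.HtmlEncode(html.Substring(valueStart, end - valueStart)));
            output.Append('"');
            pos = end + 1;
        }

        output.Append(html, pos, html.Length - pos);
        return output.ToString();
    }
}
```

One known limitation of this sketch: values that already contain entities are encoded a second time, so it should only be applied to input known to be unescaped.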

Why isn't MarkdownSharp encoding my HTML?

In my mind, one of the bigger goals of Markdown is to prevent the user from typing potentially malformed HTML directly.
Well that isn't exactly working for me in MarkdownSharp.
This example works properly when you have the extra line break immediately after "abc"...
But when that line break isn't there, I think it should still be HtmlEncoded, but that isn't happening here...
Behind the scenes, the rendered markup is coming from an iframe. And this is the code behind it...
<%
var md = new MarkdownSharp.Markdown();
%>
<%= md.Transform(Request.Form[0]) %>
Surely I must be missing something. Oh, and I am using v1.13 (the latest version as of this writing).
EDIT (this is a test for StackOverflow's implementation)
abc
this shouldn't be red

For those not wanting to use Steve Wortham's customized solution, I have submitted an issue and a proposed fix to the MarkdownSharp guys: http://code.google.com/p/markdownsharp/issues/detail?id=43
If you download my attached Markdown.cs file you will find a new option that you can set. It will stop MarkdownSharp from re-encoding text within the code blocks.
Just don't forget to HTML encode your input BEFORE you pass it into markdown, NOT after.
Another solution is to white-list HTML tags like Stack Overflow does. You would do this AFTER you pass your content to markdown.
See this for more information: http://www.CodeTunnel.com/blog/post/24/mardownsharp-and-encoded-html

Since it became clear that the StackOverflow implementation contains quite a few customizations that could be time consuming to test and figure out, I decided to go another direction.
I created my own simplified markup language that's a subset of Markdown. The open-source project is at http://ultralight.codeplex.com/ and you can see a working example at http://www.bucketsoft.com/ultralight/
The project is a complete ASP.NET MVC solution with a Javascript editor. And unlike MarkdownSharp, safe HTML is guaranteed. The Javascript parser is used both client-side and server-side to guarantee consistent markup (special thanks to the Jurassic Javascript compiler). It's a beautiful thing to only have to maintain one codebase for that parser.
Although the project is still in beta, I'm using it on my own site already and it seems to be working well so far.

Maybe I'm not understanding? If you are starting a new code block in Markdown, in all its varieties, you do need a double linebreak and four-space indentation -- a single newline won't do in any of the renderers I have to hand.
abc -- Here comes a code block:

    <div style="background-color: red"> This is code</div>
yielding:
abc -- Here comes a code block:
<div style="background-color: red"> This is code</div>
From what you are saying it seems that MarkdownSharp does fine with this rule, so with just one newline (but indentation):
abc -- Here comes a code block:
    <div style="background-color: red"> This should be code</div>
we get a mess not a code block:
abc -- Here comes a code block:
This should be code
I assume StackOverflow is stripping the <div> tags, because they think comments shouldn't have divisions and suchlike things. (?) (In general they have to do a lot of other processing don't they, e.g. to get syntax highlighting and so on?)
EDIT: I think people are expecting the wrong thing of a Markdown implementation. For example, as I say below, there is no such thing as 'invalid markdown'. It isn't a programming language or anything like one. I have verified that all three markdown implementations I have available from the command line indifferently 'convert' random .js and .c files, or those inserted into otherwise sensible markdown -- and also interpolated zip files and other nonsense -- into valid html that browsers don't mind displaying at all -- chicken scratches though it is. If you want to exclude something, e.g. in a wiki program, you do something further, of course, as most markdown-employing wiki programs do.
