HtmlDocument.Write Stripping Quotation Marks - c#

For some reason when I try writing to an HtmlDocument it strips some (not all) of the quotation marks of the string I am giving it.
Look here:
HtmlDocument htmlDoc = Webbrowser1.Document.OpenNew(true);
htmlDoc.Write("<HTML><BODY><DIV ID=\"TEST\"></DIV></BODY></HTML>");
string temp = htmlDoc.GetElementsByTagName("HTML")[0].InnerHtml;
The result of temp is this:
<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>
It works exactly as it should except it is stripping the quotation marks. Does anyone have a solution on how to prevent or fix this?

There is no guarantees with innerHTML that it will return content identical to string you passed in. The innerHTML is constructed by browser using its HTML tree representation - so it will produce resulting string as it see fits.
So depending on your needs you can try to use some HTML parsing code that understands ID's without quotes around OR try to convince browser to use latest engine which more likely to produce innerHTML to you liking.
I.e. in your case it looks like at least IE9 renders your HTML as IE9:Quirks mode (that returns innerHTML in the shape your are not happy with), if you make valid HTML or force mode to IE9:Standard you'll get string with qoutes like
document.getElementsByTagName("html")[0].innerHTML
IE9:Standards - "<head></head><body><div id="TEST"></div></body>"
IE9:Quirks -
"<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>"
You can try it yourself by creating sample HTML file and opening from disk. F12 to show dev tools and check out mode in the menu bar.

C# has a quirky feature though I'm not sure of it's name. Sorry i'm not sure of a vb equivalent.
Add an # at the beginning of a literal string to escape all characters.
htmlDoc.Write(#"<HTML><BODY><DIV ID="TEST"></DIV></BODY></HTML>");
Also, this isn't important but your html would not validate. All tags and attributes should be lower case. E.g.<HTML> should be <html>.

Related

Handling html special character while Parsing html string using c#

I am using htmlagility pack to parse html string, and convert certain patterns to links.
Given a html string and a pattern "mystring". I have to replace the occurrence of this pattern in the hrml string with <a href="/mystring.html>mystring</a>. But there are two exceptions
1. I should not replace the pattern if it is already within an anchor tag, which means its immediate parent or any level parent should not be an anchor tag. For ex: <a href="google.com><span>mystring</span><\a>
2. It should not be inside href. For ex <a href="mystring">.
input string: "<li><span>mystring test</span></li><li><a href='#'><span>mystring</span></li</li>"
expected output : "<li><span><a href="/mystring.html>mystring</a> test</span></li><li><a href='#'><span>mystring</span></li</li>"
I am using htmlagilitypack and loading this string as html doc and getting all text and looking whether its any level parent is not an anchor and replacing it. Everything worked simple and fine. But there is a problem here.
If my input string is something like "li><span>mystring test < 10 and 5</span></li>" there is a problem. Htmlagility parser considers the less than symbol as a html special character and considers the "< 10 and 5" as a html tag and produces something like this.
< 10="" and="" 5=""> (attributes with empty values).
IS there a work around for this using htmlagilityparser?
Should I take a step back and use regex? In that case how do I handle the any level anchor exception?
IS there a better approach for this problem?
Using < outside HTML tag is invalid. Use < entity instead.
EDIT: If don't have control over input string, you may try replacing "< ":
inputhtml = inputhtml.Replace("< ", "< ");
If there are any other errors, you can try importing MSHTML COM DLL. Reference COM dll "Microsoft HTML object library".
Two suggestions:
You could pre-clean the broken HTML so HtmlAgilityPack works better. This is possibly easier.
Or parse & track nested-structure of tags yourself, via a simple regex-based parser. But many HTML tags do not have to be normatively ended, such as <TR> <TD> <P> <BR>.. and you'll have to deal with the broken < angle-brackets here too.
Option 2) is not hard -- but will be more work first-off, for a payoff in improved reliability & control over how you handle "malformed" inputs from a low-quality source.

HtmlAgilityPack td.innertext bug?

I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.
This is a C# app in VS2010. Running in Debug, I see the string in my class begins:
Animal and vegetable oils 1 < 5 MW <br>5-50 MW 30 <br>
But when I assign with:
td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();
The td.InnerHtml shows
Animal and vegetable oils 1 < 5=\"\" mw=\"\"><br>5-50 MW 30 <br>
Why is it putting the equals and escaped quotes into that text??? It doesn't do it across all the data, just a few files. Any ideas? (PS. There are html breaks in the strings not showing up, how do I post so it ignores html? Tried the "indent with 4 spaces but didn't seem to work?)
HTML Agility Pack's HTML parser is treating the < as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag, and so it is treating them as tag attributes. This treatment stops once it runs into the <br> which forces it to close the tag.
The reason it works in browsers is because browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a carat followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case. So I wouldn't say this is a bug, so much as a limitation of HAP's native HTML parser.
An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.

How can i add double quotes to a string?

I want to add double quotes for a sting . I know by using /" we can add double quotes . My string is
string scrip = "$(function () {$(\"[src='" + names[i, 0] + "']\"" + ").pinit();});";
When i do this on the browser i am getting &quot instead of " quotes . How can i overcome with the problem ?
If your browser has displayed a "&quot" instead of a " character, than there are only a few causes possible. The character should have been emitted to the browser as either itself, or as a HTML entity of ". Please note the semicolor at the end. If a browser sees such 'code', it presents a quote. This is to allow writing the HTML easier, when its attribtues need to contain special characters, compare:
<div attribute="blahblahblah" />
if you want to put a " into the blahs, it'd terminate the attribute's notation, and the HTML code would break. So, adding a single " character should look like:
<div attribute="blah&quote;blahblah" />
Now, if you miss the semicolon, the browser will display blah&quotblahblah instead of blah"blahblah.
I've just noted that your code is actually glueing up the JavaScript code. In JavaScript, the semicolon is an expression delimiter, so probably there is actually a " in the emitted HTML and it is just improperly presented in the error message... Or maybe you have forgotten to open/close some quotes in the javascript, and the semicolon is actually treated as expression terminator?
Be also sure to check why the JavaScript code undergoes html-entity translation. Usually, blocks are not reparsed. Are you setting that JavaScript code as a HTML element attribute? like OnClick or OnSend? Then stop doing it now. Create a javascript-function with this code and call that function from the click/send instead.. It is not worth to encode long expressions in the JS into an attribute! Just a waste of time and nerves.
If all else fails and if the JavaScript is emitted correctly, then look for any text-correcting or text-highlighting or text-formatting modules you have on your site. Quite probable that one of them is mis-reading the html entities and removed the semicolon, or the opposite - that they add them were they are not needed. The ASP.Net itself in general does its job right, and it translates the entites correctly wherever they are needed, so I'd look at the other libraries first.
You can use something like this:
String str=#"hello,,?!"
This should escape all characters
Or
String TestString = "This is a <Test String>.";
String EncodedString = Server.HtmlEncode(TestString);
Here's the manual: http://msdn.microsoft.com/en-us/library/w3te6wfz.aspx
What else are you doing with the string?
Seems that somewhere after that the string gets encoded. You can could use HttpUtility.HtmlDecode(str); but first you'll have to figure out where your string gets encoded in the first place.
Keep in mind that if you use <%: %> in aspx or #yourvarin Razor it will get encoded automatically. You'll have to use #Html.Raw(yourvar) to suppress that.

Is this not a suitable scenario for an Html parser?

I have to deal with malformed Html and Html tags inside Html attributes:
<p class="<sometag attr="something"></sometag>">
Link
</p>
I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:
<p class="<sometag attr=" something"="">">
Link
</p>
The p tag becomes malformed and the someothertag inside the href attribute of the a tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?
it's not valid html, so i don't think you can rely on an html parser to parse it.
You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be taking the source for an HTML parser such as Html Agility Pack and adding some functionality to the attribute parsing. Use a stack; for every opening character, push it, then read until you find another opening or closing character. If it's opening, push it, if it's closing, pop it.
Alternately, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser and changing the illegal characters in the attribute values to escaped characters (ampersand-semicolon). Unfortunately, this would require doing some preliminary parsing on your part.

Checking a HTML string for unopened tags

I have a string as a HTML source and I want to check whether the HTML source which is string contains a tag which is not opened.
For example the string below contains </u> after WAVEFORM which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these types of unopened tag and then I have to append the open tag to the start of the string?
For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
"WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
// Prints: TagNotOpened
Console.WriteLine(error.Code);
// Prints: Start tag <u> was not found
Console.WriteLine(error.Reason);
}
Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, ie. if it's one of the EMPTY content-model tags like <img>. If a element is EMPTY, it doesn't need closing so you can ignore it. (If you have XHTML, this is all a bit easier.)
If you have a start-tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup. If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the the tags-to-close to the end, again in reverse order.
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)

Categories

Resources