Handling html special character while Parsing html string using c# - c#

I am using htmlagility pack to parse html string, and convert certain patterns to links.
Given a html string and a pattern "mystring". I have to replace the occurrence of this pattern in the hrml string with <a href="/mystring.html>mystring</a>. But there are two exceptions
1. I should not replace the pattern if it is already within an anchor tag, which means its immediate parent or any level parent should not be an anchor tag. For ex: <a href="google.com><span>mystring</span><\a>
2. It should not be inside href. For ex <a href="mystring">.
input string: "<li><span>mystring test</span></li><li><a href='#'><span>mystring</span></li</li>"
expected output : "<li><span><a href="/mystring.html>mystring</a> test</span></li><li><a href='#'><span>mystring</span></li</li>"
I am using htmlagilitypack and loading this string as html doc and getting all text and looking whether its any level parent is not an anchor and replacing it. Everything worked simple and fine. But there is a problem here.
If my input string is something like "li><span>mystring test < 10 and 5</span></li>" there is a problem. Htmlagility parser considers the less than symbol as a html special character and considers the "< 10 and 5" as a html tag and produces something like this.
< 10="" and="" 5=""> (attributes with empty values).
IS there a work around for this using htmlagilityparser?
Should I take a step back and use regex? In that case how do I handle the any level anchor exception?
IS there a better approach for this problem?

Using < outside HTML tag is invalid. Use < entity instead.
EDIT: If don't have control over input string, you may try replacing "< ":
inputhtml = inputhtml.Replace("< ", "< ");
If there are any other errors, you can try importing MSHTML COM DLL. Reference COM dll "Microsoft HTML object library".

Two suggestions:
You could pre-clean the broken HTML so HtmlAgilityPack works better. This is possibly easier.
Or parse & track nested-structure of tags yourself, via a simple regex-based parser. But many HTML tags do not have to be normatively ended, such as <TR> <TD> <P> <BR>.. and you'll have to deal with the broken < angle-brackets here too.
Option 2) is not hard -- but will be more work first-off, for a payoff in improved reliability & control over how you handle "malformed" inputs from a low-quality source.

Related

Using C# to convert incorrect html string to real html

My original issue is that I am trying to serialize a string containing html tags to an XML element.
hello World, this
is
a nice
test
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
However, I have 2 issues
Serializing HTML to XML: I did not succeed in defining the Serializable class to correctly serialize with XmlSerialze, so I decided that, using CDATA sections might be the better way. This is however not correctly deserialized by the target tool (that I have no influence on). What I need is plain and correct html (XHMTL?) within the xml output file.
2. The string looks e.g. as above, but is not fully correct html (no <p> tags, no <br> tags).
Now I would like to replace the newlines by a p or br tag. I have had a look here and used the suggested solution:
string result = "<p>" + text
.Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
.Replace(Environment.NewLine, "<br />")
.Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";
However, this does not in all cases generate valid html. In the example above, it would create <br />s between the <li> tags or cause <ul> tags within <p> tags - which is both not allowed.
Target would be to have a result like the following (line breaks are only for better readability and don't matter here)
<p>hello World, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
Do you have any suggestion how to solve this either with a string.Replace, Regex, or better solution (HtmlDocument)?
Please note: I have no influence on deserialization, the XML output is evaluated by I tool I have no influence on, and it has to be UTF-8 encoded.
Thank you!
EDIT: Clearly separated the 2 issues
EDIT2: No influence on deserialization
EDIT3: Added target output
What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms that into a valid DOM, that a HTML parser can handle.
You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.
Or you can just encode the input HTML in such a way that it doesn't interfere with the XML that you're trying to put it in, like a CDATA section or base64-encoding the input would also suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
I've had to do similar (ensuring 3rd party content has valid HTML). If I was doing this, I'd do the following:
1) Replace line breaks with HTML line breaks
string result = text.Replace(Environment.NewLine, "<br />");
2) Use the HTMLAgility pack to fix any invalid HTML
var doc = new HtmlDocument();
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
doc.OptionFixNestedTags = false;
doc.LoadHtml(result);
if (doc.ParseErrors.Count() > 0)
{
// throw error
}else{
// get fixed html
result= doc.DocumentNode.OuterHtml;
}

HtmlAgilityPack td.innertext bug?

I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.
This is a C# app in VS2010. Running in Debug, I see the string in my class begins:
Animal and vegetable oils 1 < 5 MW <br>5-50 MW 30 <br>
But when I assign with:
td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();
The td.InnerHtml shows
Animal and vegetable oils 1 < 5=\"\" mw=\"\"><br>5-50 MW 30 <br>
Why is it putting the equals and escaped quotes into that text??? It doesn't do it across all the data, just a few files. Any ideas? (PS. There are html breaks in the strings not showing up, how do I post so it ignores html? Tried the "indent with 4 spaces but didn't seem to work?)
HTML Agility Pack's HTML parser is treating the < as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag, and so it is treating them as tag attributes. This treatment stops once it runs into the <br> which forces it to close the tag.
The reason it works in browsers is because browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a carat followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case. So I wouldn't say this is a bug, so much as a limitation of HAP's native HTML parser.
An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.

HtmlDocument.Write Stripping Quotation Marks

For some reason when I try writing to an HtmlDocument it strips some (not all) of the quotation marks of the string I am giving it.
Look here:
HtmlDocument htmlDoc = Webbrowser1.Document.OpenNew(true);
htmlDoc.Write("<HTML><BODY><DIV ID=\"TEST\"></DIV></BODY></HTML>");
string temp = htmlDoc.GetElementsByTagName("HTML")[0].InnerHtml;
The result of temp is this:
<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>
It works exactly as it should except it is stripping the quotation marks. Does anyone have a solution on how to prevent or fix this?
There is no guarantees with innerHTML that it will return content identical to string you passed in. The innerHTML is constructed by browser using its HTML tree representation - so it will produce resulting string as it see fits.
So depending on your needs you can try to use some HTML parsing code that understands ID's without quotes around OR try to convince browser to use latest engine which more likely to produce innerHTML to you liking.
I.e. in your case it looks like at least IE9 renders your HTML as IE9:Quirks mode (that returns innerHTML in the shape your are not happy with), if you make valid HTML or force mode to IE9:Standard you'll get string with qoutes like
document.getElementsByTagName("html")[0].innerHTML
IE9:Standards - "<head></head><body><div id="TEST"></div></body>"
IE9:Quirks -
"<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>"
You can try it yourself by creating sample HTML file and opening from disk. F12 to show dev tools and check out mode in the menu bar.
C# has a quirky feature though I'm not sure of it's name. Sorry i'm not sure of a vb equivalent.
Add an # at the beginning of a literal string to escape all characters.
htmlDoc.Write(#"<HTML><BODY><DIV ID="TEST"></DIV></BODY></HTML>");
Also, this isn't important but your html would not validate. All tags and attributes should be lower case. E.g.<HTML> should be <html>.

Is this not a suitable scenario for an Html parser?

I have to deal with malformed Html and Html tags inside Html attributes:
<p class="<sometag attr="something"></sometag>">
Link
</p>
I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:
<p class="<sometag attr=" something"="">">
Link
</p>
The p tag becomes malformed and the someothertag inside the href attribute of the a tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?
it's not valid html, so i don't think you can rely on an html parser to parse it.
You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be taking the source for an HTML parser such as Html Agility Pack and adding some functionality to the attribute parsing. Use a stack; for every opening character, push it, then read until you find another opening or closing character. If it's opening, push it, if it's closing, pop it.
Alternately, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser and changing the illegal characters in the attribute values to escaped characters (ampersand-semicolon). Unfortunately, this would require doing some preliminary parsing on your part.

How to find all tags from string using RegEx using C#.net?

I want to find all HTML tags from the input strings and removed/replace with some text.
suppose that I have string
INPUT=>
<img align="right" src="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg" /><p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, Il Giardino Ristorante in Newport Beach.</p>
OUTPUT=>
string strSrc="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg";
<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, http://www.tenrestaurantgroup.com in Newport Beach.</p>
From above string
if <IMG> tag found then I want to get SRC of the tag,
if <A> tag found then I want get HREF from the tag.
and all other tag as same it is..
How can I achieved using Regex in C#.net?
You really, really shouldn't use regex for this. In fact, parsing HTML cannot be done perfectly with regex. Have you considered using an XML parser or HTML DOM library?
You can use HtmlAgilityPack for parsing (valid/non valid) html and get what you want.
I agree with Justin, Regex really isn't the best way to do this, and the HTML Agility is well worth a look if this is something you will need to be doing alot of.
With that said, the expression below will store attributes into a group from where you should be able to pull them into your text while ignoring the rest of the element. :
</?([^ >]+)( [^=]+?="(.+?)")*>
Hope this helps.

Categories

Resources