Is this not a suitable scenario for an Html parser? - c#

I have to deal with malformed Html and Html tags inside Html attributes:
<p class="<sometag attr="something"></sometag>">
Link
</p>
I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:
<p class="<sometag attr=" something"="">">
Link
</p>
The p tag becomes malformed and the someothertag inside the href attribute of the a tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?

It's not valid HTML, so I don't think you can rely on an HTML parser to parse it.

You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be taking the source for an HTML parser such as Html Agility Pack and adding some functionality to the attribute parsing. Use a stack; for every opening character, push it, then read until you find another opening or closing character. If it's opening, push it, if it's closing, pop it.
Alternately, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser, changing the illegal characters in the attribute values to character entities (the ampersand-semicolon form, such as &lt; and &quot;). Unfortunately, this would require doing some preliminary parsing on your part.
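A minimal sketch of that pre-escaping idea, combined with the angle-bracket counting described above. It assumes every < inside an attribute value has a matching >, and it only handles double-quoted values. The class and method names (AttributeEscaper, EscapeNestedMarkup) are made up for illustration and are not part of Html Agility Pack.

```csharp
using System.Text;

public static class AttributeEscaper
{
    // Escapes <, > and nested double quotes found inside double-quoted
    // attribute values so that an ordinary HTML parser sees plain text there.
    public static string EscapeNestedMarkup(string html)
    {
        var output = new StringBuilder(html.Length + 32);
        bool inTag = false;    // between '<' and '>' of a real tag
        bool inValue = false;  // inside a double-quoted attribute value
        int depth = 0;         // unclosed '<' seen inside the current value

        foreach (char c in html)
        {
            if (inValue)
            {
                switch (c)
                {
                    case '<':
                        depth++;
                        output.Append("&lt;");
                        break;
                    case '>':
                        if (depth > 0) depth--;
                        output.Append("&gt;");
                        break;
                    case '"':
                        if (depth == 0)
                        {
                            // No nested tag is open, so this quote ends the value.
                            inValue = false;
                            output.Append('"');
                        }
                        else
                        {
                            // Quote belongs to a nested tag; escape it.
                            output.Append("&quot;");
                        }
                        break;
                    default:
                        output.Append(c);
                        break;
                }
            }
            else if (inTag)
            {
                if (c == '"') { inValue = true; depth = 0; }
                else if (c == '>') inTag = false;
                output.Append(c);
            }
            else
            {
                if (c == '<') inTag = true;
                output.Append(c);
            }
        }
        return output.ToString();
    }
}
```

The escaped string can then be handed to the parser, and the attribute value can be decoded and parsed separately if the nested tag needs to be treated as a node.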

Related

Remove some bbcodes from code

I am facing a problem. I am using a BBCode-to-HTML parser, and parsing fails when the input contains tags that are not in the parser's tag set.
For example:
My parser permits only the [b], [center] and [i] tags.
If I try to parse [u] or [color={anyColor}] tags, it throws an exception.
I would like to remove every tag that is not permitted.
At first I thought about blocking them in my textarea, but when I paste into the textarea with Ctrl+C/Ctrl+V it fills with those tags, and I only notice once the data is already in my database.
What I have in mind:
1. The user enters a string containing disallowed tags
2. I call some method to remove the tags that are not permitted (this is my problem)
3. I save the data to my database
Can anyone help me with this, or suggest something else?

After taking a quick look at the parser source found at the link you provided, it seems that if it runs into a tag it does not know (meaning one not in the list of tags provided during instantiation), it errors out in some manner.
As it stands, it looks like you have a few options:
Change your ErrorMode to ErrorFree.
This will no longer produce any exceptions and will instead treat unknown tags as text.
Go with your original idea and restrict the input on the front end.
If you can, instead of going straight to HTML, add all possible tags to the parser, check whether you can get a C# object out of the parser, and eliminate the unwanted tags before outputting HTML.
Or, at the other end, after the HTML is produced, prohibit the use of the generated HTML tags.
Send the authors of the parser an email, or (if you know German) open a ticket/issue on CodePlex, and ask them to add support for stripping unwanted tags.
Or, since you have the source, add the functionality to strip unwanted tags yourself.
This shouldn't be too hard, I think: follow the pattern they use for the current Tags list in BBCodeParser.cs, make a TagsToIgnore list, and add a check before the rest of the tag parsing that strips the tag and continues on to the next token.
EDIT:
You may be able to make the parser render the unwanted tags as nothing, at the point where you initialize the BBCodeParser:
var parser = new BBCodeParser(new[]
{
    // keep these tags
    new BBTag("b", "<b>", "</b>"),
    new BBTag("i", "<span style=\"font-style:italic;\">", "</span>"),
    new BBTag("u", "<span style=\"text-decoration:underline;\">", "</span>"),
    // remove these (or at least their markup)
    new BBTag("code", "", ""),
    new BBTag("img", "", ""),
});
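Another route, independent of the parser library, is to strip the disallowed tags from the string before it ever reaches the parser. This is only a rough sketch: it assumes tags look like [name], [/name] or [name=value], and the names BbCodeFilter and StripDisallowedTags are invented here, not part of any BBCode library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BbCodeFilter
{
    // Matches [tag], [/tag] and [tag=value]; group 1 captures the tag name.
    private static readonly Regex TagPattern =
        new Regex(@"\[/?([A-Za-z]+)(?:=[^\]]*)?\]", RegexOptions.Compiled);

    // Removes the markup of every tag whose name is not in 'allowed',
    // leaving the text between the tags untouched.
    public static string StripDisallowedTags(string input, ISet<string> allowed)
    {
        return TagPattern.Replace(input, m =>
            allowed.Contains(m.Groups[1].Value) ? m.Value : string.Empty);
    }
}
```

Called with new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "center", "i" } as the allowed set, this would run as step 2 of the flow in the question, just before saving to the database.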

Handling html special character while Parsing html string using c#

I am using HtmlAgilityPack to parse an HTML string and convert certain patterns to links.
Given an HTML string and a pattern "mystring", I have to replace each occurrence of the pattern in the HTML string with <a href="/mystring.html">mystring</a>. But there are two exceptions:
1. I should not replace the pattern if it is already within an anchor tag, which means neither its immediate parent nor any ancestor at any level may be an anchor tag. For example: <a href="google.com"><span>mystring</span></a>
2. It should not be inside an href. For example: <a href="mystring">.
Input string: "<li><span>mystring test</span></li><li><a href='#'><span>mystring</span></a></li>"
Expected output: "<li><span><a href="/mystring.html">mystring</a> test</span></li><li><a href='#'><span>mystring</span></a></li>"
I am using HtmlAgilityPack: I load this string as an HTML document, get all the text nodes, check that no ancestor at any level is an anchor, and replace the pattern. Everything worked fine, but there is one problem.
If my input string is something like "<li><span>mystring test < 10 and 5</span></li>", HtmlAgilityPack treats the less-than symbol as the start of markup, considers "< 10 and 5" to be an HTML tag, and produces something like this:
< 10="" and="" 5=""> (attributes with empty values).
Is there a workaround for this using HtmlAgilityPack?
Should I take a step back and use regex? In that case, how do I handle the any-level anchor exception?
Is there a better approach for this problem?

Using a bare < outside an HTML tag is invalid. Use the &lt; entity instead.
EDIT: If you don't have control over the input string, you may try replacing "< ":
inputhtml = inputhtml.Replace("< ", "&lt; ");
If there are any other errors, you can try importing the MSHTML COM DLL. Reference the COM DLL "Microsoft HTML Object Library".

Two suggestions:
1) You could pre-clean the broken HTML so HtmlAgilityPack works better. This is possibly easier.
2) Or parse & track the nested structure of tags yourself, via a simple regex-based parser. But many HTML tags do not have to be normatively ended, such as <TR> <TD> <P> <BR>.. and you'll have to deal with the broken < angle-brackets here too.
Option 2) is not hard -- but will be more work up front, for a payoff in improved reliability & control over how you handle "malformed" inputs from a low-quality source.
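For the pre-clean route, one possible sketch is to escape any < that cannot begin a tag, comment or declaration, i.e. one not followed by a letter, /, ! or ?. The names HtmlPreCleaner and EscapeStrayLessThan are hypothetical, and the heuristic will miss text such as "a <b" where a letter directly follows the bracket.

```csharp
using System.Text.RegularExpressions;

public static class HtmlPreCleaner
{
    // A '<' that is not followed by a letter, '/', '!' or '?' cannot start
    // a tag, end tag, comment or declaration, so treat it as literal text.
    private static readonly Regex StrayLessThan =
        new Regex(@"<(?![A-Za-z/!?])", RegexOptions.Compiled);

    public static string EscapeStrayLessThan(string html)
    {
        return StrayLessThan.Replace(html, "&lt;");
    }
}
```

The cleaned string can then be loaded into the parser as usual; the text node will contain the entity, which decodes back to < when rendered.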

XPath to first occurrence of element with text length >= 200 characters

How do I get the first element that has an inner text (plain text, discarding other children) of 200 or more characters in length?
I'm trying to create an HTML parser like Embed.ly and I've set up a system of fallbacks where I first check for og:description, then I would search for this occurrence and only then for the description meta tag.
This is because most sites that even include meta description describe their site in that tag, instead of the contents of the current page.
Example:
<html>
<body>
<div>some characters
<p>200 characters <span>some more stuff</span></p>
</div>
</body>
</html>
What selector could I use to get the 200 characters portion of that HTML fragment? I don't want the some more stuff either, I don't care what element it is (except for <script> or <style>), as long as it's the first plain text to contain at least 200 characters.
What should the XPath query look like?

Use:
(//*[not(self::script or self::style)]/text()[string-length() >= 200])[1]
Note: In case the document is an XHTML document (and that means all elements are in the XHTML namespace), the above expression should be specified as:
(//*[not(self::x:script or self::x:style)]/text()[string-length() >= 200])[1]
where the prefix "x:" must be bound to the XHTML namespace -- "http://www.w3.org/1999/xhtml" (or as many XPath APIs call this -- the namespace must be "registered" with this prefix)
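To try the expression from C#, here is a small sketch using the built-in System.Xml classes. It assumes the input is well-formed XML with no namespace; the names LongTextFinder and FirstLongText are invented for this example. Real-world HTML would first need an HTML parser that exposes an XPath interface.

```csharp
using System.Xml;

public static class LongTextFinder
{
    // Returns the first text node (in document order) that is at least
    // minLength characters long and whose parent is not <script> or <style>.
    // Returns null when no such text node exists.
    public static string FirstLongText(string xml, int minLength)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        string query =
            "(//*[not(self::script or self::style)]/text()" +
            "[string-length() >= " + minLength + "])[1]";

        XmlNode node = doc.SelectSingleNode(query);
        return node?.Value;
    }
}
```

Because the predicate is applied to text() nodes, the text of child elements (the "some more stuff" span in the question) is not included in the result.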

I meant something like this:
root.SelectNodes("html/body/.//*[(name() !='script') and (name()!='style')]/text()[string-length() >= 200]")
Seems to work pretty well.

HTML is not XML. You should not use XML parsers to parse HTML, period. They are two entirely different things, and your parser will choke the first time it sees HTML that's not well-formed XML.
You should find an open-source HTML parser instead of rolling your own.

HtmlDocument.Write Stripping Quotation Marks

For some reason when I try writing to an HtmlDocument it strips some (not all) of the quotation marks of the string I am giving it.
Look here:
HtmlDocument htmlDoc = Webbrowser1.Document.OpenNew(true);
htmlDoc.Write("<HTML><BODY><DIV ID=\"TEST\"></DIV></BODY></HTML>");
string temp = htmlDoc.GetElementsByTagName("HTML")[0].InnerHtml;
The result of temp is this:
<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>
It works exactly as it should except it is stripping the quotation marks. Does anyone have a solution on how to prevent or fix this?

There is no guarantee that innerHTML will return content identical to the string you passed in. The innerHTML is constructed by the browser from its HTML tree representation, so it will produce the resulting string as it sees fit.
So, depending on your needs, you can either use some HTML parsing code that understands IDs without quotes around them, or try to convince the browser to use its latest engine, which is more likely to produce innerHTML to your liking.
I.e. in your case it looks like at least IE9 renders your HTML in IE9:Quirks mode (which returns innerHTML in the shape you are not happy with). If you make the HTML valid or force the mode to IE9:Standards, you'll get a string with quotes, like this:
document.getElementsByTagName("html")[0].innerHTML
IE9:Standards - "<head></head><body><div id="TEST"></div></body>"
IE9:Quirks -
"<HEAD></HEAD>
<BODY>
<DIV id=TEST></DIV></BODY>"
You can try it yourself by creating a sample HTML file and opening it from disk. Press F12 to show the dev tools and check the mode in the menu bar.

C# has a quirky feature called verbatim string literals. Sorry, I'm not sure of a VB equivalent.
Add an @ at the beginning of a string literal so that backslashes are not treated as escape characters; a double quote inside such a literal is written as two double quotes.
htmlDoc.Write(@"<HTML><BODY><DIV ID=""TEST""></DIV></BODY></HTML>");
Also, this isn't important, but your HTML would not validate as XHTML, where all tag and attribute names must be lower case. E.g. <HTML> should be <html>.

Checking a HTML string for unopened tags

I have a string containing HTML source, and I want to check whether it contains a closing tag that was never opened.
For example, the string below contains </u> after WAVEFORM, which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these types of unopened tags and then prepend the missing opening tag to the start of the string.

For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}

Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, i.e. whether it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)
Otherwise, for a start tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close list to the end, again in reverse order.
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)
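The procedure above could be sketched in C# roughly as follows. The list of EMPTY elements is an assumption and may be incomplete, the class and method names (TagBalancer, Balance) are made up, and, as in the description, an end tag simply pops the most recent open tag without checking that the names match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagBalancer
{
    // Group 1 = start-tag name, group 2 = end-tag name; comments match neither.
    private static readonly Regex Markup = new Regex(
        @"<(\w+)(?:\s+[-\w]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[^'"">\s][^>\s]*))?)*\s*>" +
        @"|</(\w+)\s*>" +
        @"|<!--.*?-->",
        RegexOptions.Singleline);

    // Elements with EMPTY content that never take a closing tag (assumed list).
    private static readonly HashSet<string> EmptyElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param" };

    public static string Balance(string html)
    {
        var toOpen = new List<string>();   // end tags seen with nothing open
        var toClose = new List<string>();  // start tags still waiting for an end tag

        foreach (Match m in Markup.Matches(html))
        {
            if (m.Groups[1].Success)
            {
                string name = m.Groups[1].Value;
                if (!EmptyElements.Contains(name)) toClose.Add(name);
            }
            else if (m.Groups[2].Success)
            {
                if (toClose.Count > 0) toClose.RemoveAt(toClose.Count - 1);
                else toOpen.Add(m.Groups[2].Value);
            }
        }

        // Both lists are emitted in reverse order so the nesting comes out right.
        string prefix = string.Concat(Enumerable.Reverse(toOpen).Select(t => "<" + t + ">"));
        string suffix = string.Concat(Enumerable.Reverse(toClose).Select(t => "</" + t + ">"));
        return prefix + html + suffix;
    }
}
```

For the string in the question, Balance prepends a single <u> and appends nothing.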
