XPath to first occurrence of element with text length >= 200 characters - c#

How do I get the first element that has an inner text (plain text, discarding other children) of 200 or more characters in length?
I'm trying to create an HTML parser like Embed.ly and I've set up a system of fallbacks where I first check for og:description, then I would search for this occurrence and only then for the description meta tag.
This is because most sites that even include meta description describe their site in that tag, instead of the contents of the current page.
Example:
<html>
<body>
<div>some characters
<p>200 characters <span>some more stuff</span></p>
</div>
</body>
</html>
What selector could I use to get the 200 characters portion of that HTML fragment? I don't want the some more stuff either, I don't care what element it is (except for <script> or <style>), as long as it's the first plain text to contain at least 200 characters.
What should the XPath query look like?

Use:
(//*[not(self::script or self::style)]/text()[string-length() >= 200])[1]
Note: If the document is XHTML (meaning all elements are in the XHTML namespace), the expression should instead be written as:
(//*[not(self::x:script or self::x:style)]/text()[string-length() >= 200])[1]
where the prefix "x:" must be bound to the XHTML namespace, "http://www.w3.org/1999/xhtml" (many XPath APIs call this "registering" the namespace with that prefix).
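A minimal sketch of how that prefix binding might look in C#, assuming the input is well-formed XHTML that System.Xml can load. The class name FirstLongText, the method name Find, and the minLength parameter are made up for illustration and do not come from the answer above.

```csharp
using System;
using System.Xml;

public static class FirstLongText
{
    // Returns the first text node of at least minLength characters that is a
    // direct child of any element other than <script> or <style>, or null.
    public static string Find(string xhtml, int minLength = 200)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed XML

        // Bind the (arbitrary) prefix "x" to the XHTML namespace for use in the XPath.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        string xpath =
            "(//*[not(self::x:script or self::x:style)]" +
            "/text()[string-length() >= " + minLength + "])[1]";

        XmlNode node = doc.SelectSingleNode(xpath, ns);
        return node?.Value;
    }
}
```

Calling FirstLongText.Find(pageSource) returns null when no text node is long enough, so the caller can fall through to the next fallback (for example the description meta tag).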

I meant something like this:
root.SelectNodes("html/body/.//*[(name() !='script') and (name()!='style')]/text()[string-length() >= 200]")
Seems to work pretty well.

HTML is not XML. You should not use XML parsers to parse HTML, period. They are two different things entirely, and your parser will choke the first time it sees HTML that's not well-formed XML.
You should find an open-source HTML parser instead of rolling your own.

Related

Handling html special character while Parsing html string using c#

I am using HtmlAgilityPack to parse an HTML string and convert certain patterns to links.
Given an HTML string and a pattern "mystring", I have to replace each occurrence of this pattern in the HTML string with <a href="/mystring.html">mystring</a>. But there are two exceptions:
1. I should not replace the pattern if it is already within an anchor tag, which means neither its immediate parent nor any ancestor should be an anchor tag. For example: <a href="google.com"><span>mystring</span></a>
2. It should not be inside an href. For example: <a href="mystring">.
input string: "<li><span>mystring test</span></li><li><a href='#'><span>mystring</span></a></li>"
expected output: "<li><span><a href="/mystring.html">mystring</a> test</span></li><li><a href='#'><span>mystring</span></a></li>"
I am using HtmlAgilityPack: I load this string as an HTML document, get all the text nodes, check that none of a node's ancestors is an anchor, and then do the replacement. Everything worked fine, but there is a problem.
If my input string is something like "<li><span>mystring test < 10 and 5</span></li>", the HtmlAgilityPack parser treats the less-than symbol as an HTML special character, considers "< 10 and 5" to be an HTML tag, and produces something like this:
< 10="" and="" 5=""> (attributes with empty values).
Is there a workaround for this using HtmlAgilityPack?
Should I take a step back and use regex? In that case, how do I handle the "anchor at any level" exception?
Is there a better approach for this problem?
Using < outside an HTML tag is invalid. Use the &lt; entity instead.
EDIT: If you don't have control over the input string, you may try replacing "< ":
inputhtml = inputhtml.Replace("< ", "&lt; ");
If there are any other errors, you can try importing the MSHTML COM DLL. Reference the COM DLL "Microsoft HTML Object Library".
Two suggestions:
1. You could pre-clean the broken HTML so HtmlAgilityPack works better. This is possibly easier.
2. Or parse and track the nested structure of tags yourself, via a simple regex-based parser. But many HTML tags do not have to be explicitly closed, such as <TR>, <TD>, <P> and <BR>, and you'll have to deal with the broken < angle brackets here too.
Option 2 is not hard, but it will be more work up front, for a payoff in improved reliability and control over how you handle "malformed" input from a low-quality source.
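The pre-cleaning option could be sketched roughly as follows. This is only one possible heuristic, not something HtmlAgilityPack provides: it assumes that a "<" which is not followed by a letter, "/", "!" or "?" can never start real markup, and the names HtmlPreCleaner and EscapeStrayLessThan are invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class HtmlPreCleaner
{
    // A "<" that is not followed by a letter, "/", "!" or "?" cannot start a
    // tag, closing tag, comment/doctype or processing instruction.
    private static readonly Regex StrayLessThan =
        new Regex(@"<(?![A-Za-z/!?])", RegexOptions.Compiled);

    // Replaces every stray "<" with the &lt; entity and leaves real tags alone.
    public static string EscapeStrayLessThan(string html)
    {
        return StrayLessThan.Replace(html, "&lt;");
    }
}
```

Run the input through EscapeStrayLessThan before handing it to the parser. The heuristic does not catch text such as "a<b" (which looks like the start of a tag), and it would also rewrite "<" inside script blocks, so it is best suited to simple fragments like the one in the question.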

How to compare 2 HTML strings

How can I compare 2 html strings for equality? I was trying some 'stuff' out with the Agility pack, but it doesn't have a compare method, or anything like that.
For the record, the .NET framework doesn't do the trick.
[EDIT]
With comparing 2 html strings, I mean the innerHTML of a webpage.
[/EDIT]
Example:
For example, right-click on this page and click 'View Page Source' (I use Firefox). Put that content into a string variable.
Now do this again, exactly like you did before but pick another page and create a new string variable.
When you're done, compare those 2 strings against each other.
It all comes down to whether you're actually comparing valid XML.
HTML itself is not XML, but if both strings happen to be well-formed XML (as XHTML is), you can always create two XmlDocument objects and compare them.
If there's a problem with your HTML syntax, then you need another algorithm for the comparison, like stripping all double spaces, stripping all spaces between tags, and comparing the results.
Of course, you will also need to work out a canonical representation, since <body style="padding:2em;color:white;"> is exactly the same as <body style="color:white;padding:2em"> as far as HTML is concerned.
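As a rough illustration of the "load both as XML and compare" idea, here is a sketch that uses LINQ to XML instead of XmlDocument, because XNode.DeepEquals gives a ready-made structural comparison. It assumes both strings are well-formed XML without a DOCTYPE; the class and method names are hypothetical.

```csharp
using System.Xml.Linq;

public static class HtmlXmlComparer
{
    // Parses both strings as XML and compares the resulting trees.
    // XDocument.Parse throws XmlException if either input is not well-formed.
    public static bool AreEquivalent(string first, string second)
    {
        // With the default load options, whitespace-only text between tags
        // is dropped, so pure indentation differences do not matter.
        XDocument a = XDocument.Parse(first);
        XDocument b = XDocument.Parse(second);
        return XNode.DeepEquals(a.Root, b.Root);
    }
}
```

This only smooths over formatting between tags. Differences inside attribute values, such as the reordered style declarations mentioned above, may still be reported as different and would need their own normalization step.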
Assuming you're only interested in the textual content of the HTML elements (i.e. the stuff between the tags), just compare the .InnerText properties of the two elements. This returns a string containing the concatenation of all the "#text" nodes of all child nodes.

Is this not a suitable scenario for an Html parser?

I have to deal with malformed Html and Html tags inside Html attributes:
<p class="<sometag attr="something"></sometag>">
Link
</p>
I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:
<p class="<sometag attr=" something"="">">
Link
</p>
The p tag becomes malformed and the sometag inside the class attribute of the p tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?
It's not valid HTML, so I don't think you can rely on an HTML parser to parse it.
You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be taking the source for an HTML parser such as Html Agility Pack and adding some functionality to the attribute parsing. Use a stack; for every opening character, push it, then read until you find another opening or closing character. If it's opening, push it, if it's closing, pop it.
Alternately, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser and changing the illegal characters in the attribute values to escaped characters (ampersand-semicolon). Unfortunately, this would require doing some preliminary parsing on your part.

Checking a HTML string for unopened tags

I have a string containing HTML source, and I want to check whether it contains a closing tag that was never opened.
For example the string below contains </u> after WAVEFORM which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these kinds of unopened tags and then prepend the missing opening tag to the start of the string. How can I do that?
For this specific case you can use HTML Agility Pack to check whether the HTML is well formed or whether it has tags that were never opened.
using System;
using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}
Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, i.e. if it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing so you can ignore it. (If you have XHTML, this is all a bit easier.)
If you have a start tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close to the end, again in reverse order.
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)
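The bookkeeping described in this answer could be sketched in C# along these lines. It reuses the regex from the answer, treats a hard-coded set of HTML void elements as the EMPTY ones, and pops the tags-to-close list without checking that the names match. The class name TagBalancer and everything else not in the answer is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TagBalancer
{
    // Group 1: start-tag name, group 2: end-tag name; comments match neither.
    private static readonly Regex Markup = new Regex(
        @"<(\w+)(?:\s+[-\w]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[^'"">\s][^>\s]*))?)*\s*>" +
        @"|</(\w+)\s*>" +
        @"|<!--.*?-->",
        RegexOptions.Singleline);

    // Elements that never take a closing tag (assumed list of HTML void elements).
    private static readonly HashSet<string> VoidElements = new HashSet<string>(
        new[] { "area", "base", "br", "col", "embed", "hr", "img", "input",
                "link", "meta", "param", "source", "track", "wbr" },
        StringComparer.OrdinalIgnoreCase);

    public static string Balance(string html)
    {
        var toOpen = new List<string>();   // end tags seen with nothing open
        var toClose = new List<string>();  // start tags still waiting for an end tag

        foreach (Match m in Markup.Matches(html))
        {
            if (m.Groups[1].Success)
            {
                string name = m.Groups[1].Value;
                if (!VoidElements.Contains(name)) toClose.Add(name);
            }
            else if (m.Groups[2].Success)
            {
                // No name check here: a mismatch means the markup is invalid anyway.
                if (toClose.Count > 0) toClose.RemoveAt(toClose.Count - 1);
                else toOpen.Add(m.Groups[2].Value);
            }
        }

        var result = new StringBuilder();
        // Prepend missing start tags in reverse order of discovery.
        for (int i = toOpen.Count - 1; i >= 0; i--)
            result.Append('<').Append(toOpen[i]).Append('>');
        result.Append(html);
        // Append missing end tags, innermost element first.
        for (int i = toClose.Count - 1; i >= 0; i--)
            result.Append("</").Append(toClose[i]).Append('>');
        return result.ToString();
    }
}
```

For the string in the question, TagBalancer.Balance prepends a single <u> and leaves the rest unchanged. Self-closing syntax such as <br /> is not matched by the regex at all, which is harmless here because those elements need no closing tag.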

Regular Expression to Extract HTML Body Content

I am looking for a regex statement that will let me extract the HTML content from just between the body tags of an XHTML document.
The XHTML that I need to parse will be very simple files, I do not have to worry about JavaScript content or <![CDATA[ tags, for example.
Below is the expected structure of the HTML file that I have to parse. Since I know exactly all of the content of the HTML files that I am going to have to work with, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<p>
</p>
<p>
<br />
</p>
<h1>Header 1</h1>
</body>
</html>
Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed xhtml document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
XHTML would be more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser would be able to quickly navigate to the body node and give you back its content without any of the tag-matching problems that the regex is giving you.
EDIT:
In response to a comment here; that an XML parser is too slow.
There are two kinds of XML parser. One, called DOM, is big and heavy but easy and friendly: it builds a tree out of the document before you can do anything. The other, called SAX, is fast and light but more work: it reads the file sequentially. You will want SAX to find the body tag.
The DOM method is good for multiple uses, such as pulling out tags and finding which node is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because they both simply walk across the file and pattern match, with the exception that the regex won't quit looking after it has found a body tag, because regex has no built-in knowledge of XML. In fact, your SAX parser probably uses small pieces of regex to find each tag.
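.NET has no SAX API as such. The closest built-in equivalent is the forward-only XmlReader, so a sequential read could be sketched as below. This assumes the input is well-formed XHTML in the XHTML namespace; the names XhtmlBody and ReadInnerXml are chosen just for this example.

```csharp
using System.IO;
using System.Xml;

public static class XhtmlBody
{
    private const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    // Reads forward to the first <body> element and returns its inner markup,
    // or null if the document has no body element.
    public static string ReadInnerXml(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Skip the DOCTYPE instead of trying to fetch and process the DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            if (!reader.ReadToFollowing("body", XhtmlNamespace))
                return null;
            return reader.ReadInnerXml();
        }
    }
}
```

Because the DTD is ignored, named entities such as &nbsp; are not defined and would cause an XmlException. For the simple files described in the question that should not matter.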
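import java.util.regex.Matcher;
import java.util.regex.Pattern;

String toMatch = "aaaaaaaaaaabcxx sldjfkvnlkfd <body>i m avinash</body>";
// DOTALL lets "." match line breaks, so multi-line documents match too
Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?", Pattern.DOTALL);
Matcher matcher = pattern.matcher(toMatch);
if (matcher.matches()) {
    System.out.println(matcher.group(1)); // prints: i m avinash
}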
Why can't you just split it by
</{0,1}body[^>]*>
and take the second string? I believe it will be much faster than looking for a huge regexp.
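In C#, that split might look like the following sketch. The class name BodySplitter is hypothetical, and the length check is an added safeguard for input that lacks one of the two body tags.

```csharp
using System.Text.RegularExpressions;

public static class BodySplitter
{
    // Splits on the opening or closing body tag; with both tags present the
    // pieces are [before body, body content, after body].
    public static string Extract(string html)
    {
        string[] parts = Regex.Split(html, @"</?body[^>]*>", RegexOptions.IgnoreCase);
        return parts.Length >= 3 ? parts[1] : null;
    }
}
```

The pattern contains no capturing groups, so Regex.Split returns only the text between the matches and the tags themselves are discarded.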
/<body[^>]*>(.*)<\/body>/s
replace with
\1
Match the first body tag: <\s*body.*?>
Match the last body tag: <\s*/\s*body.*?>
(note: we account for spaces in the middle of the tags, which is completely valid markup btw)
Combine them together like this and you will get everything in between, including the body tags: <\s*body.*?>.*?<\s*/\s*body.*?>. And make sure you are using Singleline mode, which makes . match line breaks as well.
This works in VB.NET, and hopefully others too!
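A C# version of the same idea might look like this sketch. It adds a capturing group so that only the content between the tags is returned, and it tightens .*? inside the tags to [^>]* and \s*. Those changes, the IgnoreCase option, and the name BodyExtractor are assumptions on top of the answer.

```csharp
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Singleline makes "." match line breaks, so the body may span many lines.
    private static readonly Regex Body = new Regex(
        @"<\s*body[^>]*>(.*?)<\s*/\s*body\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the markup between the body tags, or null if there is no match.
    public static string Extract(string html)
    {
        Match m = Body.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Use m.Value instead of m.Groups[1].Value if the surrounding body tags should be included in the result, as in the combined pattern above.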
