How can I compare two HTML strings for equality? I tried a few things with the Html Agility Pack, but it doesn't have a compare method or anything like that.
For the record, the .NET framework doesn't do the trick.
[EDIT]
By comparing two HTML strings, I mean comparing the innerHTML of two web pages.
[/EDIT]
Example:
Right-click on this page and choose 'View Page Source' (I use Firefox). Put that content into a string variable.
Now do exactly the same thing for another page and store the result in a second string variable.
When you're done, compare those two strings against each other.
It all comes down to whether you're actually comparing valid XML.
HTML is not XML, but if both strings happen to be well-formed XML (XHTML, for instance) you can load them into two XmlDocuments and compare those.
If the HTML is not well-formed, you need another algorithm for the comparison, such as collapsing all double spaces, stripping all whitespace between tags, and then comparing the results.
Of course you will also need to work out a canonical representation, because as far as HTML is concerned <body style="padding:2em;color:white;"> is exactly the same as <body style="color:white;padding:2em">.
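As a rough illustration of the "canonical representation" point, here is a minimal sketch that normalizes a style attribute value before comparing it. The class and method names are made up for this example, and it assumes simple declarations: a value that itself contains a semicolon (a data: URL, say) or a property that is declared twice would need real CSS parsing.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of .NET or the Html Agility Pack.
public static class StyleComparer
{
    // Splits "a:b;c:d" into declarations, trims them, lower-cases the
    // property names and sorts them so that ordering no longer matters.
    public static string Normalize(string style)
    {
        var declarations = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => d.Split(new[] { ':' }, 2))
            .Where(parts => parts.Length == 2)
            .Select(parts => parts[0].Trim().ToLowerInvariant() + ":" + parts[1].Trim())
            .OrderBy(d => d, StringComparer.Ordinal);

        return string.Join(";", declarations);
    }

    public static bool AreEquivalent(string a, string b)
    {
        return Normalize(a) == Normalize(b);
    }
}
```

With this, StyleComparer.AreEquivalent("padding:2em;color:white;", "color:white;padding:2em") returns true, because both values normalize to "color:white;padding:2em".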
Assuming you're only interested in the textual content of the HTML elements (i.e. the stuff between the tags), just compare the .InnerText properties of the two elements. This property returns a string containing the concatenation of all the "#text" nodes of all child nodes.
I hope someone can tell me how to get a superscript character like ² to display correctly in the text of a DropDownList option.
Thanks.
Probably using HTML entities:
&sup2;
instead of the actual character. But it's probably better to let C# take care of it:
string safeString = HttpUtility.HtmlEncode("your string²");
// Use the result as the displayed value in your Dropdownlist
This method will also find other problematic characters such as & and replace them accordingly. See MSDN HttpUtility.HtmlEncode for more information on this.
Edit: be advised that the string returned by HtmlEncode will display (when used in HTML) exactly what you passed into the method. So do not put HTML entities in your input, because then the entity text itself is what you'll see on the resulting page.
If you want to show m² then just enter that inside the method. .NET will take care of the rest.
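To make the encode/decode behaviour concrete, here is a small sketch. It uses System.Net.WebUtility rather than HttpUtility so that it compiles without a reference to System.Web; that substitution and the class name are choices made for this example, not something the answer prescribes.

```csharp
using System.Net;

// Hypothetical wrapper used only for illustration.
public static class DropdownText
{
    // Escapes characters such as & and < so the text is safe to emit as HTML.
    public static string Encode(string raw)
    {
        return WebUtility.HtmlEncode(raw);
    }

    // Turns entities such as &sup2; or &amp; back into the characters they stand for.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

Encoding "R&D" gives "R&amp;D", and encoding "m&sup2;" gives "m&amp;sup2;", which is why entities should not be fed into the encoder. Going the other way, decoding "m&sup2;" gives "m²".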
Maybe unicode symbols would do the trick for you: http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html#super
For the superscripted two you would use &sup2; (or &#178;), resulting in: ²
You can write it this way:
string item = HttpUtility.HtmlDecode("ml/min/1.73m&sup2;");
For more info on superscript characters, see this link:
http://symbolcodes.tlt.psu.edu/bylanguage/mathchart.html#super
How do I get the first element that has an inner text (plain text, discarding other children) of 200 or more characters in length?
I'm trying to create an HTML parser like Embed.ly and I've set up a system of fallbacks where I first check for og:description, then I would search for this occurrence and only then for the description meta tag.
This is because most sites that even include meta description describe their site in that tag, instead of the contents of the current page.
Example:
<html>
<body>
<div>some characters
<p>200 characters <span>some more stuff</span></p>
</div>
</body>
</html>
What selector could I use to get the 200 characters portion of that HTML fragment? I don't want the some more stuff either, I don't care what element it is (except for <script> or <style>), as long as it's the first plain text to contain at least 200 characters.
What should the XPath query look like?
Use:
(//*[not(self::script or self::style)]/text()[string-length() >= 200])[1]
Note: In case the document is an XHTML document (and that means all elements are in the XHTML namespace), the above expression should be specified as:
(//*[not(self::x:script or self::x:style)]/text()[string-length() >= 200])[1]
where the prefix "x:" must be bound to the XHTML namespace, "http://www.w3.org/1999/xhtml" (or, as many XPath APIs put it, the namespace must be "registered" with this prefix).
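A minimal sketch of running such an expression from C#, assuming the markup is well-formed and has no namespace. The class name, method name and minLength parameter are invented for the example; for real-world HTML the same XPath could be handed to an HTML parser instead of XmlDocument.

```csharp
using System.Xml;

// Hypothetical helper; XmlDocument only accepts well-formed markup.
public static class FirstLongText
{
    // Returns the first text node of at least minLength characters that is
    // not a direct child of <script> or <style>, or null if there is none.
    public static string Find(string xhtml, int minLength)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        XmlNode node = doc.SelectSingleNode(
            "(//*[not(self::script or self::style)]/text()[string-length() >= "
            + minLength + "])[1]");

        return node?.Value;
    }
}
```

For the fragment in the question this returns only the long text inside <p>, without the "some more stuff" that belongs to the nested <span>.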
I meant something like this:
root.SelectNodes("html/body/.//*[(name() !='script') and (name()!='style')]/text()[string-length() >= 200]")
Seems to work pretty well.
HTML is not XML. You should not use XML parsers to parse HTML, period. They are two different things entirely, and your parser will choke the first time it sees HTML that's not well-formed XML.
You should find an open-source HTML parser instead of rolling your own.
I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression to extract the content you want from a string, such as your HTML string.
Or you can use a DOM parser such as Html Agility Pack.
Hope this helps!
You could use something like this -
var text = "12 hello 45 yes 890 bye 999";
var matches = System.Text.RegularExpressions.Regex.Matches(text, @"\d+").Cast<System.Text.RegularExpressions.Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
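Adapted to the article-title case from the question, the same idea might look like the sketch below. It assumes, purely for illustration, that every title sits in an <h2> element; the class and method names are invented, and WebUtility.HtmlDecode is used so that special characters written as entities come back as real characters.

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the <h2> assumption depends entirely on the page.
public static class TitleScraper
{
    // Collects the inner text of every <h2>...</h2> in document order.
    public static string[] ExtractTitles(string html)
    {
        return Regex.Matches(html, @"<h2[^>]*>(.*?)</h2>",
                             RegexOptions.IgnoreCase | RegexOptions.Singleline)
            .Cast<Match>()
            .Select(m => WebUtility.HtmlDecode(m.Groups[1].Value.Trim()))
            .ToArray();
    }
}
```

A pattern like this breaks as soon as titles contain nested markup or the page layout changes, so a DOM parser is usually the safer choice for anything beyond a quick script.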
If the page is well-formed XML, you could use LINQ to XML: load the page into an XDocument, use XPath or another way of traversing to the element(s) you want, and load what you need into the array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution, because any subtle change could break the well-formedness of the XML. If that's the case, you're probably better off using regular expressions. Either way, though, the page could be changed under you and your code would suddenly stop working.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.
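The LINQ to XML approach described above could be sketched as follows. The <h2> element name and the helper names are assumptions made for the example, and XDocument.Parse throws an XmlException as soon as the input is not well-formed.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; works only on well-formed markup without a namespace.
public static class XmlTitleReader
{
    // Returns the text of every <h2> element in document order.
    public static string[] ReadTitles(string wellFormedMarkup)
    {
        XDocument doc = XDocument.Parse(wellFormedMarkup);
        return doc.Descendants("h2").Select(e => e.Value).ToArray();
    }
}
```

For XHTML, where elements live in the "http://www.w3.org/1999/xhtml" namespace, the element name would have to be built from an XNamespace instead of a plain string.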
I have a string containing HTML source, and I want to check whether it contains a closing tag that was never opened.
For example, the string below contains </u> after WAVEFORM, which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these kinds of unopened tags and then prepend the matching opening tag to the start of the string. How can I do that?
For this specific case you can use the Html Agility Pack to check whether the HTML is well formed or whether it contains tags that were never opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
"WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
// Prints: TagNotOpened
Console.WriteLine(error.Code);
// Prints: Start tag <u> was not found
Console.WriteLine(error.Reason);
}
Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, i.e. if it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)
Otherwise, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close to the end, again in reverse order.
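The steps above can be sketched in C# roughly as follows. The class name, the method name and the list of void elements are assumptions made for this example; the regex is the one given above, written as a verbatim string. Like the description, the sketch does not verify that a closing tag actually matches the tag it pops.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the tags-to-open / tags-to-close idea.
public static class TagBalancer
{
    // Group 1 = start tag name, group 2 = end tag name, third branch = comment.
    private static readonly Regex Markup = new Regex(
        @"<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^'"">\s][^>\s]*)))?)*\s*>"
        + @"|</(\w+)\s*>"
        + @"|<!--.*?-->",
        RegexOptions.Singleline);

    // Elements with EMPTY content that never take a closing tag (assumed list).
    private static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "param", "source", "track", "wbr"
        };

    public static string Balance(string html)
    {
        var toOpen = new List<string>();
        var toClose = new List<string>();

        foreach (Match m in Markup.Matches(html))
        {
            if (m.Groups[1].Success)
            {
                // Start tag: remember it unless it is a void element.
                if (!VoidElements.Contains(m.Groups[1].Value))
                    toClose.Add(m.Groups[1].Value);
            }
            else if (m.Groups[2].Success)
            {
                // End tag: pop the most recent open tag, or record it as unopened.
                if (toClose.Count > 0)
                    toClose.RemoveAt(toClose.Count - 1);
                else
                    toOpen.Add(m.Groups[2].Value);
            }
            // Comments match neither group and are ignored.
        }

        string prefix = string.Concat(toOpen.AsEnumerable().Reverse().Select(t => "<" + t + ">"));
        string suffix = string.Concat(toClose.AsEnumerable().Reverse().Select(t => "</" + t + ">"));
        return prefix + html + suffix;
    }
}
```

For the string in the question, TagBalancer.Balance returns the input with a single <u> prepended; for "a</b></i>" it returns "<i><b>a</b></i>".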
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)
I've made a small program in C#/.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. On load it takes an RSS feed from the BBC website and then looks for keywords which either increment or decrease the percentage chance of DOOM.
It's a crazy little project, but maybe one day the classes will come in handy again for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
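would match each HTML element, from its opening tag through the matching closing tag (use case-insensitive matching, since the pattern is written in upper case).
Use this to remove them from your data.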
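A sketch of how that pattern could be applied in C#, under the assumption that you want to drop the tags but keep the text between them. The class and method names are invented; the loop is there because one pass only unwraps the outermost element of each nested group.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper built around the pattern quoted above.
public static class ElementStripper
{
    private static readonly Regex Element = new Regex(
        @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every element with its inner content until nothing changes,
    // so nested elements are unwrapped as well.
    public static string Unwrap(string input)
    {
        string previous;
        do
        {
            previous = input;
            input = Element.Replace(input, "$2");
        } while (input != previous);

        return input;
    }
}
```

For example, ElementStripper.Unwrap("<div>Storm <b>warning</b></div> issued") returns "Storm warning issued". Replacing with string.Empty instead of "$2" would remove the content along with the tags.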