How to parse bad html?

I am writing a search engine that visits all of my company's affiliate websites, parses their HTML, and stores the content in a database. These websites are really old and not HTML-compliant: out of 100,000 websites, around 25% have bad HTML that is difficult to parse. I need to write C# code that fixes the bad HTML and then parses the contents, or come up with another solution to this problem. If you have an idea, a concrete hint or code snippet would help.

Just use Html Agility Pack. It is very good at parsing faulty HTML.

People generally use some form of heuristic-driven tag soup parser.
Such parsers exist, e.g., for Java and Haskell.
These are mostly just lexers that try their best to build an AST from all the random symbols.

Use a tagsoup parser; I'm sure there is one for C#. Then you can serialize the DOM to more-or-less valid HTML, depending on whether that parser conforms to the HTML DTD. Alternatively you can use HTML Tidy, which will clear up at least the worst faults.
Regexes are not applicable for this task.

Related

What techniques are used to write a parser that switches between languages?

I'm interested in how a parser like the Razor view engine can parse two distinct languages like C# and JavaScript.
It's very cool that the following works, for instance:
$("#fm_duedate").val('@DateTime.Now.AddMonths(1).ToString("MM/dd/yyyy")');
I'm going to try and look at the source, but I'm curious whether there's some kind of theoretical foundation for a parser like this, or whether it is more brute force, like taking the union of the two languages and parsing that.
Trying to reason it out for myself, I say "you start with a parser for each language, then you add to each one a set of productions that switch it to the other", but I doubt it's so simple.
I guess the perfect answer would be a pointer to discussion on how the Razor engine is implemented or a walk-through of the source (I haven't actually Google'd this for fear of going down a rabbit hole). Alternately, just some insight on how the problem of parsing two languages is approached would be great.

As Corey points out, Razor and similar frameworks do not do anything particularly fancy.
However there are some more theoretically sound models for building parsers for languages where one language is embedded in another. My erstwhile colleague Luke Hoban has a great introductory article on parser combinators, which afford a very nice way to build a parser for one-language-embedded-in-another-language scenarios:
http://blogs.msdn.com/b/lukeh/archive/2007/08/19/monadic-parser-combinators-using-c-3-0.aspx
The wikipedia page is pretty straightforward as well:
http://en.wikipedia.org/wiki/Parser_combinator
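
To make the parser-combinator idea concrete, here is a minimal sketch in C#. It is not taken from the linked article; the Parser delegate, the Combinators class and the toy grammar (plain text as the host language, with dotted identifiers embedded in @( ... ) islands) are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// A parser reads `input` starting at `pos` and returns the parsed value plus the
// next position, or null when it does not match. All names are illustrative.
public delegate (T Value, int Next)? Parser<T>(string input, int pos);

public static class Combinators
{
    // Matches one character satisfying the predicate.
    public static Parser<char> Satisfy(Func<char, bool> pred) => (s, i) =>
    {
        if (i < s.Length && pred(s[i])) return (s[i], i + 1);
        return null;
    };

    // Matches an exact piece of text.
    public static Parser<string> Literal(string text) => (s, i) =>
    {
        if (s.Length - i >= text.Length && s.Substring(i, text.Length) == text)
            return (text, i + text.Length);
        return null;
    };

    // Applies p repeatedly; stops on failure or when p consumes nothing.
    public static Parser<List<T>> Many<T>(this Parser<T> p) => (s, i) =>
    {
        var items = new List<T>();
        while (true)
        {
            var r = p(s, i);
            if (r == null || r.Value.Next == i) break;
            items.Add(r.Value.Value);
            i = r.Value.Next;
        }
        return (items, i);
    };

    // Transforms the value produced by p.
    public static Parser<U> Select<T, U>(this Parser<T> p, Func<T, U> f) => (s, i) =>
    {
        var r = p(s, i);
        if (r == null) return null;
        return (f(r.Value.Value), r.Value.Next);
    };

    // Tries a first, then b if a fails.
    public static Parser<T> Or<T>(this Parser<T> a, Parser<T> b) => (s, i) => a(s, i) ?? b(s, i);

    // Runs open, body, close in sequence and keeps only the body's value.
    public static Parser<T> Between<T>(Parser<string> open, Parser<T> body, Parser<string> close) => (s, i) =>
    {
        var o = open(s, i);
        if (o == null) return null;
        var b = body(s, o.Value.Next);
        if (b == null) return null;
        var c = close(s, b.Value.Next);
        if (c == null) return null;
        return (b.Value.Value, c.Value.Next);
    };
}

public static class Template
{
    // Embedded language (assumed): a dotted identifier such as DateTime.Now.
    static readonly Parser<string> Code = Combinators
        .Satisfy(c => char.IsLetterOrDigit(c) || c == '.').Many()
        .Select(cs => "code:" + new string(cs.ToArray()));

    // Host language (assumed): any run of characters other than '@'.
    static readonly Parser<string> Text = Combinators
        .Satisfy(c => c != '@').Many()
        .Select(cs => "text:" + new string(cs.ToArray()));

    // The switch between the two languages is just one more combinator.
    static readonly Parser<string> Island =
        Combinators.Between(Combinators.Literal("@("), Code, Combinators.Literal(")"));

    public static List<string> Parse(string input) =>
        Island.Or(Text).Many()(input, 0).Value.Value;
}
```

Calling Template.Parse("due: @(DateTime.Now)!") yields three segments: the text before the island, the embedded code, and the trailing text. The point of the design is that each language is an ordinary parser value, so embedding one in the other is plain composition.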

Razor (and the other view engines) do not parse the HTML or JavaScript of a view. Instead they parse the text to detect specific tokens, with no real concern about the surrounding text.
In the case of Razor, every @ character in the source file is processed as a code block of some sort. Razor is quite smart about detecting the expression that follows the @ character, including handling things like @foreach (var x in collection) { and locating the closing } while not trying to parse the HTML (or JavaScript) inside. It also lets you use @{ } and @( ) to override the processing to a degree.
I find the ASPX <%...%> format simpler to read, since I've used that format more and I've got some established pattern recognition going on for those. Having explicit start/finish tokens is simpler to process and simpler to read in-place.
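
The token-detection approach described above can be sketched roughly as follows. This is not Razor's actual implementation: MarkerScanner is a hypothetical helper that only recognises a marker character (Razor uses @) followed by either a parenthesised expression or a dotted identifier, and treats everything else as opaque text. Real Razor handles many more cases, such as method calls in implicit expressions, statement blocks, and e-mail addresses.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerScanner
{
    // Splits a template into literal chunks and code chunks introduced by '@'.
    // The surrounding HTML or JavaScript is never parsed, only skipped over.
    public static List<(bool IsCode, string Text)> Scan(string src)
    {
        var parts = new List<(bool IsCode, string Text)>();
        int i = 0, literalStart = 0;
        while (i < src.Length)
        {
            if (src[i] != '@') { i++; continue; }

            int codeStart = i + 1, end;
            if (codeStart < src.Length && src[codeStart] == '(')
            {
                // Explicit expression: find the matching ')' by counting depth.
                int depth = 0;
                end = codeStart;
                do
                {
                    if (src[end] == '(') depth++;
                    else if (src[end] == ')') depth--;
                    end++;
                } while (depth > 0 && end < src.Length);
            }
            else
            {
                // Implicit expression: letters, digits, '_' and '.' only.
                end = codeStart;
                while (end < src.Length &&
                       (char.IsLetterOrDigit(src[end]) || src[end] == '_' || src[end] == '.'))
                    end++;
            }

            if (end == codeStart) { i++; continue; } // lone '@': leave it in the literal text

            if (i > literalStart)
                parts.Add((false, src.Substring(literalStart, i - literalStart)));
            parts.Add((true, src.Substring(codeStart, end - codeStart)));
            i = literalStart = end;
        }
        if (literalStart < src.Length)
            parts.Add((false, src.Substring(literalStart)));
        return parts;
    }
}
```

For the input "Due @Model.Date, total @(a + (b * 2))!" the scanner returns five parts: three literal chunks and the two code chunks Model.Date and (a + (b * 2)).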

Stripping HTML tags without using HtmlAgilityPack

I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:
It's not known ahead of time whether a document contains HTML at all.
More likely than not, any HTML will be very poorly formatted.
Individual documents might be very large, perhaps hundreds of megabytes.
Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of <.+/?> are a no go. (And stripping XML is less desirable, anyway.)
I'm currently using HTML Agility Pack, and it's just not cutting the mustard. Performance is poorer than I'd like, it doesn't always handle truly awful formatting as gracefully as it could, and lately I've been running into problems with stack overflows on some of the more upsettingly large files.
I suspect that all of these problems stem from the fact that it's trying to actually parse the data, which makes it a poor fit for my needs. I don't want a syntax tree; I just want (most of) the tags to go away.
Using regular expressions seems like the obvious candidate. But then I remember this famous answer, and it makes me worry that it's not such a great idea. But that diatribe's points are very focused on parsing, and not necessarily dumb tag-stripping. So are regexes OK for this purpose?
Assuming it isn't a terrible idea, suggestions for regex that would do a good job are very welcome.

This regex finds all tags while skipping over angle brackets that appear inside quoted attribute values.
<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>
It can't detect escaped quotes inside quotes (but I think that is unnecessary in HTML).
Replacing the first part of the regex with a list of all allowed tags, like <(tag1|tag2|...), could give a more precise solution. I'm afraid an exact solution can't be found given your assumption about stray angle brackets; think, for example, of something like b<a ...
EDIT:
Updated the regex (it performs a lot better than the previous one). Moreover, if you need to strip out code, I suggest doing a little cleaning before the first pass, such as replacing <script.+?</script> with nothing.
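
As a rough illustration of the two-pass idea (remove script blocks first, then strip tags), here is a sketch in C#. Note that the tag pattern below is a rewritten variant, not the regex from this answer: it avoids nested lazy quantifiers, which can backtrack badly on text that contains a stray < with no closing >. The TagStripper class name is hypothetical, and the approach still requires the whole document in memory as a string, so it may not suit the very largest files.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Pass 1: remove <script>...</script> blocks together with their content.
    static readonly Regex Script = new Regex(
        @"<script\b.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

    // Pass 2: a tag starts with '<' followed by a letter, '/' or '!', and ends at
    // the first '>' that is not inside a single- or double-quoted attribute value.
    // A '<' followed by anything else (e.g. a space) is left alone as plain text.
    static readonly Regex Tag = new Regex(
        @"<[a-zA-Z/!][^<>""']*(?:(?:""[^""]*""|'[^']*')[^<>""']*)*>",
        RegexOptions.Compiled);

    public static string Strip(string input)
    {
        string withoutScripts = Script.Replace(input, "");
        return Tag.Replace(withoutScripts, "");
    }
}
```

For example, TagStripper.Strip("<p class=\"a>b\">Hi</p> 1 < 2 and 3 > 2") leaves "Hi 1 < 2 and 3 > 2". Known gaps in this sketch: comments containing > are only partly removed, and text such as b<a followed later by a > is still mistaken for a tag, as the answer warns.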

I'm just thinking outside the box here, but you may consider leveraging something like Microsoft Word, or maybe OpenOffice.
I've used Word automation to translate HTML to DOC, RTF, or TXT. The HTML to TXT conversion native to Word would give you exactly what you want, stripping all of the HTML tags and converting it to text format. Of course this wouldn't be efficient at all if you're processing tons of tiny HTML files since there's some overhead in all of this. But if you're dealing with massive files this may not be a bad choice as I'm sure Word has plenty of optimizations around these conversions. You could test this theory by manually opening one of your largest HTML files in Word and resaving it as a TXT file and see how long Word takes to save.
And although I haven't tried it, I bet it's possible to programmatically interact with OpenOffice to accomplish something similar.

c# .net4 - regex vs html agility pack

What's faster? I just made a web scraper that uses HTML Agility pack and it's consuming massive amounts of memory.
Profiling it with a memory profiler, I found that the HtmlDocument, HtmlNode, etc. instances are taking up the most memory.
I feel like maybe it would be faster and more efficient to use regex, am I wrong?

A regex will be a lot faster than Html Agility Pack.
But you should remember that HTML is not always well formed. Searching for the data you want using only a regex may fail. Browsers are very forgiving about mistakes.
Agility Pack is a great tool. It provides a lot of features for the memory it consumes.

Depending on what exactly you do, it really could be possible to speed things up and free some memory using regex. The question is how rigid and well-formed the pages you are extracting data from are. Regex is much more easily confused by perfectly valid, but unexpected, HTML constructs that you might encounter in the wild.

Simple screen scraping and analysis in .NET

I'm building a small specialized search engine for price info. The engine will only collect specific segments of data on each site. My plan is to split the process into two steps.
1. Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
2. Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example the price value of a product). My problem is that this script somehow has to be unique for each site I pull, it has to be able to handle really ugly HTML (so I don't think XSLT will do ...) and I need to be able to change it on the fly as the target sites update and change. I will finally take the specific values and write these to a database to make them searchable.
Could you please give me some hints on how best to architect this? Would you do anything differently from what is described above?

Well, I would go with the way you describe.
1. How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2. I would go for HtmlAgilityPack for HTML parsing. It's very forgiving, and can handle pretty ugly markup. As HtmlAgilityPack supports XPath, it's pretty easy to have specific XPath selections for individual sites.
I'm on the run and going to expand on this answer asap.

Yes, a WebClient can work well for this. The WebBrowser control will work as well depending on your requirements. If you are going to load the document into a HtmlDocument (the IE HTML DOM) then it might be easier to use the web browser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control but you can use the implementation from the mshtml dll as well. I have not used the HtmlAgilityPack, but I hear that it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most ugly HTML that you throw at them. They also allow a nicer way to parse the HTML: document.GetElementsByTagName to get a collection of tag objects, for example.
As for handling the changing requirements of the site, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort.
I have worked on a system that uses XML to define a generic set of parameters for extracting text from HTML pages. Basically it would define start and end elements to begin and end extraction. I have found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites gets larger and larger. Keeping the XML up to date and trying to keep a generic set of XML and code to handle any type of site is difficult. But if the type and number of sites is small then this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable on the code I have worked on in the past. Perhaps implementing a type of pipeline would be a good approach if you think the domain is complex enough to warrant it. But even just a method that runs some regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, etc. The amount of really dodgy HTML that is out there continues to amaze me...
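
A minimal sketch of how the strategy pattern and the cleaning step mentioned above might fit together. Everything here is assumed for illustration: the IPriceExtractor interface, the regex-based strategy, the host names and the patterns are made up, and a real system would more likely load the strategies from configuration or via reflection and use a DOM parser inside them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One extraction strategy per site (hypothetical interface).
public interface IPriceExtractor
{
    string ExtractPrice(string html); // returns null when no price is found
}

// A strategy configured with a site-specific regex; group 1 captures the price.
public sealed class RegexPriceExtractor : IPriceExtractor
{
    private readonly Regex _pattern;

    public RegexPriceExtractor(string pattern) =>
        _pattern = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public string ExtractPrice(string html)
    {
        Match m = _pattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}

public static class ScrapePipeline
{
    // Cleaning steps run in order before any site strategy sees the HTML.
    static readonly List<Func<string, string>> CleaningSteps = new List<Func<string, string>>
    {
        html => Regex.Replace(html, @"<img\b[^>]*>", "", RegexOptions.IgnoreCase),    // drop images
        html => Regex.Replace(html, @"</?font\b[^>]*>", "", RegexOptions.IgnoreCase), // drop <font> wrappers
        html => Regex.Replace(html, @"\s+", " "),                                     // collapse whitespace
    };

    // Host name -> strategy. The hosts and patterns are made-up examples.
    static readonly Dictionary<string, IPriceExtractor> Strategies =
        new Dictionary<string, IPriceExtractor>(StringComparer.OrdinalIgnoreCase)
        {
            ["shop-a.example"] = new RegexPriceExtractor(@"<span class=""price"">\s*([\d.,]+)"),
            ["shop-b.example"] = new RegexPriceExtractor(@"Price:\s*<b>\s*([\d.,]+)"),
        };

    // Cleans the HTML, then applies the strategy registered for the host (null if none).
    public static string Process(string host, string html)
    {
        foreach (var step in CleaningSteps) html = step(html);
        return Strategies.TryGetValue(host, out var extractor) ? extractor.ExtractPrice(html) : null;
    }
}
```

When a target site changes its markup, only that site's entry needs updating; the cleaning steps and the other strategies stay untouched.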

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
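var yourDoc = new HtmlAgilityPack.HtmlDocument();
yourDoc.LoadHtml(html); // html: the page source you saved to a string
var nodes = yourDoc.DocumentNode.SelectNodes("your_xpath"); // e.g. "//a[@href]"; null when nothing matches
int someCount = nodes?.Count ?? 0;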

Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.

For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack at http://www.codeplex.com/htmlagilitypack. It lets you write XPaths against the nodes you want and get those returned in a collection.

Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.

It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java and SAX-based tool, developed by John Cowan. This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for everyone of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())
Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.

I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here, RegExLib should get you started.

You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible that you'll work with a document where the content describes HTML features (i.e. href= could also appear in the document and not belong to an anchor tag)?
What else can you tell us about the document?
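
To show how the answers to these questions shape the pattern, here is one possible sketch. It assumes the most permissive answers: href in any letter case, values in double quotes, single quotes or unquoted, and only href attributes inside an <a ...> tag counted. HrefExtractor is a hypothetical helper, and it does no validation of the extracted values, so '#' and javascript: entries are returned as-is.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Matches an href attribute inside an <a ...> tag. The value may be in double
    // quotes, single quotes, or unquoted; all three alternatives capture into "url".
    static readonly Regex Href = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the raw href values in document order, without validating them.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Href.Matches(html))
            urls.Add(m.Groups["url"].Value);
        return urls;
    }
}
```

Because the pattern requires a preceding <a, a bare href="..." that appears in the page text is not reported. If the document is known to be more uniform (say, always lower case and double-quoted), the pattern can be simplified accordingly.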

I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser
