I have built a web crawler using ASP.NET, and it works well. The problem comes when I want to extract content from the crawled pages, because some of the content is wrapped in HTML tags. I have a few possible solutions for extracting it, but I don't know which one is better. It should perform well and be easy to implement.
Using Regex with many patterns to extract content.
Using Linq to XML to extract content.
Using XPath to extract content.
Could somebody please help me choose the best solution? I think I will go with XPath, but I am not sure whether it performs better than Regex or LINQ to XML.
Many thanks for any ideas.
None of your solutions is particularly good.
HTML is not a regular language and as such is not a good fit for regular expressions. See also the standard response to parsing HTML with regex.
HTML is not necessarily valid XML, so LINQ to XML and XML-based XPath can fail on real pages.
Instead, you should use an HTML parsing library like the Html Agility Pack.
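To make the "HTML is not necessarily valid XML" point concrete, here is a minimal sketch. It uses Python's standard library only as a stand-in for the .NET pieces discussed here (a strict XML parser playing the role of LINQ to XML or XmlDocument, a tolerant HTML parser playing the role of Html Agility Pack). The sample markup and the names `parses_as_xml` and `TagCollector` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical "real world" HTML: unquoted attribute value, unclosed <br> and <p>.
html = "<div class=post><p>Hello<br>world</div>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Tolerant HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(html)
collector.close()

xml_ok = parses_as_xml(html)
print("strict XML parser accepts it:", xml_ok)
print("tags seen by the HTML parser:", collector.tags)
```

The strict parser rejects the markup outright, while the HTML parser still reports the div, p and br tags, which is the behaviour you want from a crawler.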
Neither. Use a proper HTML parser such as HTML Agility Pack.
Regex is no doubt faster than both the LINQ to XML and the XPath approach, but you cannot parse everything out of HTML markup using regex. HTML is too complex for that purpose.
I didn't design my own crawler, though; I used arachnode.net, which crawls massive amounts of data. Everywhere, I've used Html Agility Pack to extract various components, e.g. HTML controls, cookies, meta tags, etc.
As the others already hinted, use a proper HTML parser. In most cases, HTML is not written well enough to be treated as XML. What's worse, HTML5 permits syntax that an XML parser cannot handle at all. For example, HTML5 allows you to omit the quotes around attribute values.
Along with HTML Agility Pack, you can take a look at Majestic-12's C# HTML parser (.NET).
Related
I have successfully scraped data from a website's pages, but it contains both HTML tags and plain text. How can I filter the unwanted data (tags, scripts, text that is not required, etc.) out of this scraped data? Please at least suggest an approach for doing it.
You can use HTML Agility Pack to parse the HTML and remove any unwanted tags.
How to use HTML Agility Pack
You can start by taking a look at the HTML Agility Pack. This should allow you to remove any HTML.
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
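As a rough illustration of "parse, then drop what you don't want", the sketch below keeps only the visible text and skips everything inside script and style elements. It is written against Python's standard `html.parser` rather than Html Agility Pack, so treat it as the idea, not as that library's API. `TextExtractor`, `extract_text` and the sample page are hypothetical names and data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var x = 1;</script>"
    "<style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
)
print(extract_text(sample))
```

Filtering "text which is not required" beyond scripts and styles depends on the site; in practice that usually means selecting specific elements instead of stripping everything.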
I'm working on a browser-like application which gets HTML from a site (any website) then applies a style-script over it to change certain elements (just like greasemonkey).
My initial plan is to parse the HTML using XPath and XmlDocument, but is there a better way?
Thanks in advance!
Ps> Handy tips, tricks & links on HTML+C# would be great~ ^^
Use the HTML Agility Pack. You can find it here: http://www.codeplex.com/htmlagilitypack
HTML does not always follow XML rules. For example, some HTML tags may have no closing tag, so XPath and XDocument will sometimes throw errors. The IE API gives you the ability to do that (see here), and you can also find third-party parsers for it (see this or this).
I would highly recommend using XSLT. It lets you keep all your transformation rules outside your code, which makes them easy to change if the HTML to be transformed is modified or you want to change your layout.
Nonetheless, if you are using HTML and not XHTML, beware of possible errors. A Tidy library can help you overcome this.
I would really recommend using a package for your programming language of choice that handles all the oddities of HTML parsing. I've used Hpricot in Ruby before and it's made things a breeze.
If you want to be able to browse the HTML based on its content, XPath is a good choice. But you'll have to clean up the HTML first. You can use HTML tidy to convert the HTML to XHTML. In the process you might modify how the page renders. But it seems to be the purpose of your project so that's not a big deal.
I am writing a program that will help me find out which sites my competitors are linking to.
In order to do that, I am writing a program that will parse an HTML file, and will produce 2 lists: internal links and external links.
I will use the internal links to further crawl the website, and the external links are actually what I am looking for.
How, using .NET RegEx, do I parse an HTML file and find 1. External links. 2. Internal links.
Thanks in advance,
Eytan Levit.
Edit: In response to the question: no, I am not bound to regex; I can use any other approach.
Don't use a regular expression for this.
Use something like the HTML Agility Pack which is specifically designed for parsing HTML. (There's even an example on their CodePlex homepage which finds all links in a page.)
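For the internal/external split itself, the parsing library only needs to hand you the href values; the classification is then a matter of resolving each link against the page URL and comparing hosts. Below is a minimal sketch of that logic using Python's standard library in place of Html Agility Pack. The names `LinkCollector` and `classify_links`, the example URLs, and the rule "same host means internal" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(html, page_url):
    """Return (internal, external) lists of absolute URLs."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()

    site_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() == site_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


page = (
    '<a href="/about">About</a> '
    '<a href="http://other.example/x">Other</a> '
    '<a href="mailto:me@example.com">Mail</a> '
    '<a href=contact.html>Contact</a>'
)
internal, external = classify_links(page, "http://www.example.com/index.html")
print("internal:", internal)
print("external:", external)
```

Note that this sketch treats subdomains as external; whether blog.example.com counts as part of www.example.com is a policy decision for your crawler.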
I have used regex for HTML parsing. It is really fast, but there are now better options that reduce development cost.
Try LINQ to HTML; it's good. Beth has a great post about it that can be found here.
What is the best way to extract RSS/ATOM URLs from HTML LINK tags? I know regex is not the best way to do this, so I'm wondering what alternatives I have. Surely some kind of horrible string munging using .Contains after loading the HTML into a string is not optimal either. Anyone got a decent strategy for this?
Use XPath.
1. Convert the HTML into XHTML with Tidy
2. With the XHTML, use XPath to search for the link
/html/head/link[@type='application/rss+xml']
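The second step can be sketched as follows, assuming step 1 has already produced well-formed XHTML. This uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath, so the path is written relative to the html root element instead of starting with /html. The XHTML namespace handling reflects what Tidy output normally contains; `find_feed_urls` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tidy's XHTML output normally declares this namespace on <html>.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}
FEED_TYPES = ("application/rss+xml", "application/atom+xml")


def find_feed_urls(xhtml):
    """Return the href of every RSS/Atom <link> in the document head."""
    root = ET.fromstring(xhtml)  # root is the <html> element
    urls = []
    for feed_type in FEED_TYPES:
        path = "x:head/x:link[@type='%s']" % feed_type
        for link in root.findall(path, XHTML_NS):
            href = link.get("href")
            if href:
                urls.append(href)
    return urls


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Blog</title>
    <link rel="stylesheet" type="text/css" href="/site.css" />
    <link rel="alternate" type="application/rss+xml" href="/feed.rss" />
    <link rel="alternate" type="application/atom+xml" href="/feed.atom" />
  </head>
  <body><p>Hi</p></body>
</html>"""

feeds = find_feed_urls(sample)
print(feeds)
```

The returned href values may be relative, so resolve them against the page URL before fetching the feeds.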
Maybe Html Agility Pack can help you. I have not used it, but I have heard good things about it.
What's the best way to parse fragments of HTML in C#?
For context, I've inherited an application that uses a great deal of composite controls, which is fine, but a good deal of the controls are rendered using a long sequence of literal controls, which is fairly terrifying. I'm trying to get the application into unit tests, and I want to get these controls under tests that will find out if they're generating well formed HTML, and in a dream solution, validate that HTML.
Have a look at the Html Agility Pack. It's very compatible with the .NET XmlDocument class, but it is much more forgiving about HTML that's not clean/valid XHTML.
If the HTML is XHTML compliant, you can use the built in System.Xml namespace.
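For the unit-testing case, the "is it well formed" check can be as simple as feeding the rendered fragment to a strict XML parser and failing the test on a parse error. The sketch below shows the idea with Python's standard library instead of System.Xml. `well_formedness_error` and the two fragments are made up for illustration, and the check covers well-formedness only, not validity against an HTML schema.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(fragment):
    """Return None if the fragment is well-formed XML, else the parser's message."""
    try:
        # A fragment may have several top-level elements, so wrap it in a dummy root.
        ET.fromstring("<root>" + fragment + "</root>")
        return None
    except ET.ParseError as exc:
        return str(exc)


good = '<div class="box"><span>OK</span></div><br />'
bad = '<div class="box"><span>Broken</div>'

print("good:", well_formedness_error(good))
print("bad: ", well_formedness_error(bad))
```

One caveat: a plain XML parser also rejects named HTML entities such as &nbsp; unless they are declared, so controls that emit them would need either numeric entities or a tolerant HTML parser in the test.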
I've used an SGMLReader to produce a valid XML document from HTML, and then parsed what is required using XPath or transformed it to another format using XSLT.
You can also look into HTML Tidy for HTML parsing/cleanup. I don't think they have specific .NET libraries, but you might be able to run the binary via command-line, or IKVM the java libraries.