I have successfully scraped data from a website's pages, but it contains HTML tags as well as plain text. How can I filter out the unwanted data (tags, scripts, some text which is not required, etc.) from this scraped data? At least suggest some approach for doing it.
You can use the HTML Agility Pack to parse the HTML and remove any unwanted tags.
How to use HTML Agility Pack
You can start by taking a look at the HTML Agility Pack. This should allow you to remove any HTML.
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what System.Xml proposes, but for HTML documents (or streams).
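To make the suggestion concrete, here is a minimal sketch of that approach; the method name and input variable are illustrative, and any other unwanted sections can be removed the same way once you find an XPath that matches them:

```csharp
using HtmlAgilityPack;

class Scraper
{
    static string ExtractVisibleText(string scrapedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(scrapedHtml);

        // Drop script and style elements outright; their text is never wanted.
        var junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
        {
            foreach (var node in junk)
                node.Remove();
        }

        // InnerText returns the remaining text with all tags stripped.
        return doc.DocumentNode.InnerText;
    }
}
```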
I have made a web crawler using ASP.NET. It works well. The problem is when I want to extract content from it: some of the content is wrapped between HTML tags. I have a few candidate solutions for extracting the content, but I don't know which one is better. It should have good performance and be easy to implement.
Using Regex with many patterns to extract content.
Using Linq to XML to extract content.
Using XPath to extract content.
Somebody please help me choose the better solution. I think I will go with XPath, but I am not sure whether its performance is better than RegEx or Linq2XML.
Many thanks for any ideas.
None of your solutions is particularly good.
HTML is not a regular language and as such is not a good fit for regular expressions. See also the standard response to parsing HTML with regex.
HTML is not necessarily valid XML
Instead, you should use an HTML parsing library like the Html Agility Pack.
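For illustration, a sketch of the XPath option on top of that tolerant DOM; the URL and the //h2[@class='title']/a expression are invented for the example:

```csharp
using System;
using HtmlAgilityPack;

class CrawlerExtract
{
    static void Main()
    {
        // HtmlWeb fetches and parses in one step; no XML validity required.
        var doc = new HtmlWeb().Load("http://example.com/");

        var links = doc.DocumentNode.SelectNodes("//h2[@class='title']/a");
        if (links != null)
        {
            foreach (var a in links)
                Console.WriteLine(a.InnerText + " -> " + a.GetAttributeValue("href", ""));
        }
    }
}
```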
None of them. Use a proper HTML parser such as the HTML Agility Pack.
RegEx is no doubt faster than both the Linq to XML and XPath approaches, but you cannot parse everything out of the HTML markup using RegEx. HTML is too complex for that purpose.
I didn't design my own crawler, though; I used arachnode.net, and it crawls massive amounts of data. Everywhere I've used the Html Agility Pack to extract various components, e.g. HTML controls, cookies, meta tags, etc.
As the other guys already hinted, use a proper HTML parser. In most cases, HTML is not written carefully enough to be treated as XML. What's worse, HTML5 pushes for syntax that is not parseable as XML at all; for example, HTML5 allows you to omit the quotes around attribute values.
Along with HTML Agility Pack, you can take a look at Majestic-12's HTML Parser: Majestic-12 : Projects : C# HTML parser (.NET).
I have a string that contains a bunch of HTML. I want to HTML-encode the text within the HTML tags but not the tags themselves. Is there an easy way to do this in ASP.NET C# 3.5?
You could use the HTML Agility Pack (available via NuGet) to read the HTML, then, if necessary, use the HtmlEncode method to encode the specific values by querying them.
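A sketch of that query-then-encode idea: //text() selects every text node, so the tags themselves stay untouched. Note that already-encoded entities would be encoded a second time; handling that is left out here:

```csharp
using System.Web;            // HttpUtility; reference System.Web.dll
using HtmlAgilityPack;

class Encoder
{
    static string EncodeTextOnly(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            // Encode each text node in place.
            foreach (var node in textNodes)
                node.InnerHtml = HttpUtility.HtmlEncode(node.InnerText);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```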
If you need a bunch of functions, use the HtmlAgilityPack. If you just need URLs or something in specific tags, a self-written parser would be more efficient.
I'm trying to load a piece of (possibly) malformed HTML into an XmlDocument object, but it fails with XmlExceptions, since there are extra opening/closing tags and malformed XML tags such as <img > instead of <img />.
How do I get the XML to parse despite all the errors in the data? Is there any XML validator that I can apply before parsing to correct these errors? Or would handling the exception let me parse whatever can be parsed?
The HTML Agility Pack will parse HTML, rather than XHTML, and is quite forgiving. The object model will be familiar if you've used XmlDocument.
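Unlike XmlDocument, it records recoverable problems instead of throwing; a small sketch with invented sample markup:

```csharp
using System;
using HtmlAgilityPack;

class ParseDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><img ><p>unclosed paragraph</body></html>");

        // The parser keeps going and lists whatever it had to tolerate.
        foreach (HtmlParseError error in doc.ParseErrors)
            Console.WriteLine(error.Line + "," + error.LinePosition + ": " + error.Reason);
    }
}
```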
You might want to check out the answer to this question.
Basically, somewhere between a .NET port of BeautifulSoup and the HTML Agility Pack, there is a way.
It's unlikely that you will be able to build an XmlDocument from content with this level of malformed structure. XmlDocument (to my knowledge) requires that XML content adhere to proper nesting and closure syntax.
However, I suspect that you could parse this with an XmlReader instead. It may still throw exceptions if certain egregious errors are encountered, but according to the MSDN docs, it can at least disclose the location of the errors.
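A sketch of that XmlReader approach with an invented malformed sample; the XmlException carries the line and position of the first fatal error:

```csharp
using System;
using System.IO;
using System.Xml;

class ReaderDemo
{
    static void Main()
    {
        string markup = "<html><body><img ></body></html>";

        try
        {
            using (var reader = XmlReader.Create(new StringReader(markup)))
            {
                while (reader.Read())
                {
                    // Everything read before the exception is usable.
                }
            }
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Failed at line " + ex.LineNumber
                + ", position " + ex.LinePosition + ": " + ex.Message);
        }
    }
}
```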
If you're just dealing with HTML, there is the HTML Agility Pack, which may serve your purposes.
Depending on the specific needs, you might be able to use HTML Tidy to clean up the document, then import it using the XmlDocument object.
What you are trying to do is very difficult. HTML cannot be parsed using an XML parser since XML is strict and HTML is not. If that HTML were compliant XHTML (HTML as XML), then an XML parser would parse the HTML without issue.
You might want to see if there are any HTML to XHTML converters out there, if you really want to use an XML parser for HTML.
In other words, I have yet to meet an XML parser that handles malformed XML... they are not designed to accept loose markup like HTML (for good reason, too :) )
You can't load malformed XML into an XmlDocument.
Check out the Html Agility Pack on CodePlex
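If you need the result in an XmlDocument anyway, the Html Agility Pack can re-emit its tolerant parse as well-formed XML; a sketch, with a hypothetical file name:

```csharp
using System.IO;
using System.Xml;
using HtmlAgilityPack;

class ConvertDemo
{
    static XmlDocument LoadHtmlAsXml(string path)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // emit well-formed XML when saving
        doc.Load(path);                 // e.g. "input.html" (hypothetical)

        var writer = new StringWriter();
        doc.Save(writer);

        var xml = new XmlDocument();
        xml.LoadXml(writer.ToString()); // succeeds despite the original tag soup
        return xml;
    }
}
```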
I have XHTML files whose source is not completely valid; it does not follow the DTD of an XML document.
For example, there are places where for " it uses &ldquo;, or for apostrophes it uses &rsquo;. This causes exceptions in my C# code.
So is there any method or any web link that I can use to get rid of this?
If the file is otherwise well-formed you can define the character entities in your own DTD.
If the file is ill-formed the HTML Agility Pack from CodePlex will parse it.
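For the well-formed case, a sketch of declaring the missing entities in an internal DTD subset so the standard XML parser accepts them; the entity values are the usual Unicode code points, and whether the DTD is processed can depend on parser settings:

```csharp
using System.Xml;

class EntityDemo
{
    static void Main()
    {
        // Internal DTD subset defining the HTML named entities the file uses.
        string xml = @"<!DOCTYPE html [
  <!ENTITY ldquo ""&#8220;"">
  <!ENTITY rdquo ""&#8221;"">
  <!ENTITY rsquo ""&#8217;"">
]>
<html><body><p>&ldquo;quoted&rdquo; and it&rsquo;s fine</p></body></html>";

        var doc = new XmlDocument();
        doc.LoadXml(xml);  // no longer throws on the entity references
    }
}
```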
You could parse the document as HTML instead, since both end up in a DOM, and HTML parsers scoff at these pansy quotation-mark problems. Going along with unknown's HTML Tidy idea, you could then serialize the DOM back into a valid XHTML file. (This is identical to using HTML Tidy, which presumably uses an HTML parser anyway, except you'd do it from C# programmatically.)
Well, by the nature of XML it needs to be valid, otherwise it won't render at all. I'd first see what type of errors it generates with the W3C validator: http://validator.w3.org/
Also consider using HTML tidy, which can be configured to fix XML as well.
We use hpricot to fix our XML, but then again we are building Rails apps. Not sure about C#.
What's the best way to parse fragments of HTML in C#?
For context, I've inherited an application that uses a great deal of composite controls, which is fine, but a good deal of the controls are rendered using a long sequence of literal controls, which is fairly terrifying. I'm trying to get the application into unit tests, and I want to get these controls under tests that will find out if they're generating well-formed HTML and, in a dream solution, validate that HTML.
Have a look at the HTML Agility Pack. It's very compatible with the .NET XmlDocument class, but it is much more forgiving about HTML that's not clean/valid XHTML.
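For the unit-testing angle, a hypothetical helper: render the control to a string however you already do, then assert that the parser recorded no errors:

```csharp
using System.Linq;
using HtmlAgilityPack;

static class MarkupAssert
{
    // True when the fragment parses without any recorded syntax errors.
    public static bool IsWellFormed(string fragment)
    {
        var doc = new HtmlDocument();
        doc.OptionCheckSyntax = true;  // the default, made explicit here
        doc.LoadHtml(fragment);
        return !doc.ParseErrors.Any();
    }
}
```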
If the HTML is XHTML compliant, you can use the built in System.Xml namespace.
I've used SgmlReader to produce a valid XML document from HTML and then parsed what was required using XPath, or transformed it to another format using XSLT.
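A sketch of that SgmlReader pipeline, with invented sample markup and an illustrative XPath:

```csharp
using System.IO;
using System.Xml;
using Sgml;  // from the SgmlReader package

class SgmlDemo
{
    static void Main()
    {
        string html = "<html><body><a href=foo.html>link</a></body></html>";

        var sgmlReader = new SgmlReader
        {
            DocType = "HTML",
            CaseFolding = CaseFolding.ToLower,
            InputStream = new StringReader(html)
        };

        // SgmlReader is an XmlReader, so XmlDocument can load from it directly.
        var doc = new XmlDocument();
        doc.Load(sgmlReader);

        // From here, plain XPath works.
        XmlNodeList hrefs = doc.SelectNodes("//a/@href");
    }
}
```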
You can also look into HTML Tidy for HTML parsing/cleanup. I don't think they have specific .NET libraries, but you might be able to run the binary via the command line, or IKVM the Java libraries.