When I parse an HTML5 document such as:
<p>Content</p>
using HtmlAgilityPack with default options, it parses it successfully, but the constructed HtmlDocument does not include the <html> and <body> elements that the standard HTML5 parsing algorithm would construct.
Are there options I am missing that would do this?
Or is there some other library (.NET 6) that I should be using instead?
I have come to the conclusion that unless the functionality is very well hidden, HtmlAgilityPack does not offer this capability.
I discovered the package AngleSharp, which seems to meet my requirement.
Well, almost. Parsing <p>Content</p>, I get
<?xml version="1.0" encoding="UTF-8"?>
<HTML xmlns="http://www.w3.org/1999/xhtml"><HEAD/>
<BODY><P>Content</P></BODY></HTML>
I need to do a bit of further work to get the element names in lower case, but we're close.
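For anyone landing here, a minimal sketch of the AngleSharp approach, assuming the AngleSharp NuGet package in a version where `HtmlParser.ParseDocument` lives in `AngleSharp.Html.Parser` (0.10 or later, as far as I know). In an HTML DOM, `NodeName` and `TagName` report upper-case names while `LocalName` is lower-case, which may be where the upper-case output above comes from. Serializing as HTML through `OuterHtml` is one way to sidestep that.

```csharp
using System;
using AngleSharp.Html.Parser;

// Parse a bare fragment as a full document; the HTML5 tree-construction
// algorithm supplies the missing <html>, <head> and <body> elements.
var parser = new HtmlParser();
var document = parser.ParseDocument("<p>Content</p>");

// LocalName is lower-case even though NodeName/TagName are upper-case
// for HTML elements.
string rootName = document.DocumentElement.LocalName;
string bodyText = document.Body?.TextContent ?? "";
string serialized = document.DocumentElement.OuterHtml;

Console.WriteLine(rootName);
Console.WriteLine(bodyText);
Console.WriteLine(serialized); // HTML serialization, not XML
```

If XML output is really required, the element names would have to be taken from `LocalName` when building the XML tree; that part is not shown here.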
Related
I have a set of XSL-FO documents which are used for PDF generation. I also have a requirement to export the same output data (which is in the PDF) as an HTML file. Further, I need the HTML to have styles similar to those in the PDF.
Is there any way to convert XSL-FO to XHTML using C#?
NOTE : I know one option is to use "RenderX:FO2HTML". But since it's a commercial product, I would like to learn about any other options available and do a comparison before continuing further.
I use the RenderX fo2html stylesheet a lot, and I recommend it to my customers because it is zero cost. Thus I have built it into a number of client solutions. You have to go through the RenderX online store to get it, but it costs nothing.
Write or find an XSLT stylesheet which converts XSL-FO into XHTML, and modify it if necessary to get the rendering you require. A web search for "XSL-FO to HTML" finds at least one such stylesheet.
Though this is somewhat backward. Normally the document starts in some semantic markup language (such as XHTML), and a stylesheet converts it into XSL-FO for rendering.
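To run such a stylesheet from C#, the built-in `XslCompiledTransform` class (XSLT 1.0 only) is one option. The sketch below is only an illustration: the inline stylesheet is a hypothetical toy that turns every `fo:block` into a `<p>` and ignores all styling, and the input is a stripped-down stand-in for a real XSL-FO document. A real conversion would load a proper stylesheet from a file instead.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical toy stylesheet: each fo:block becomes an XHTML <p>.
const string stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:fo=""http://www.w3.org/1999/XSL/Format""
    xmlns=""http://www.w3.org/1999/xhtml""
    exclude-result-prefixes=""fo"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <xsl:template match=""/"">
    <html><body><xsl:apply-templates select=""//fo:block""/></body></html>
  </xsl:template>
  <xsl:template match=""fo:block"">
    <p><xsl:value-of select="".""/></p>
  </xsl:template>
</xsl:stylesheet>";

// Stripped-down stand-in for an XSL-FO document (not a complete FO file).
const string foDocument = @"<fo:root xmlns:fo=""http://www.w3.org/1999/XSL/Format"">
  <fo:page-sequence><fo:flow>
    <fo:block>Hello from XSL-FO</fo:block>
  </fo:flow></fo:page-sequence>
</fo:root>";

var xslt = new XslCompiledTransform();
using (var xslReader = XmlReader.Create(new StringReader(stylesheet)))
{
    xslt.Load(xslReader);
}

using var input = XmlReader.Create(new StringReader(foDocument));
using var output = new StringWriter();
xslt.Transform(input, null, output); // null: no stylesheet parameters

string xhtml = output.ToString();
Console.WriteLine(xhtml);
```

With files on disk, `xslt.Load(path)` and `xslt.Transform(inputPath, outputPath)` do the same job without the string plumbing.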
I have a C# application that receives an HTML file. I want to parse and validate it. As output, it should return either a list of errors or a confirmation that the HTML is valid.
Does anyone have any idea how I can do this?
I'd run a local instance of the W3C Markup Validation Service and communicate with it via its API.
You can use HTML Tidy. There is a wrapper for .NET called TidyManaged.
There is an obscure DLL dating from framework version 1.0 (!), Microsoft.mshtml.dll, and that is the only way in the framework to deal with an HTML DOM. If the HTML is XHTML and well-formed XML, you can use the XML classes; otherwise this is the only option.
I'm working on a browser-like application which gets HTML from a site (any website) then applies a style-script over it to change certain elements (just like greasemonkey).
My initial plan is to parse the HTML using XPath and XmlDocument, but is there a better way?
Thanks in advance!
Ps> Handy tips, tricks & links on HTML+C# would be great~ ^^
Use the HTML Agility Pack. You can find it here: http://www.codeplex.com/htmlagilitypack
HTML does not always follow XML rules. For example, some HTML tags may have no closing tag, so XPath and XDocument will sometimes throw errors. The IE API lets you do this (see here), and you can also find third-party parsers for it (see this or this).
I would highly recommend using XSLT. This allows you to keep all your transformation rules OUTSIDE your code, which makes them really easy to change if the HTML to be transformed is modified or you want to change your layout.
Nonetheless, if you are using HTML and not XHTML, beware of possible errors. Using a Tidy library can help you overcome this.
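A minimal sketch of the failure mode, using only `System.Xml`. The markup is a made-up example: `<br>` has no closing tag, which is fine in HTML but not well-formed XML, so `XmlDocument` rejects it.

```csharp
using System;
using System.Xml;

// Valid HTML, but <br> is never closed, so this is not well-formed XML.
const string html = "<p>line one<br>line two</p>";

string outcome;
try
{
    var doc = new XmlDocument();
    doc.LoadXml(html);
    outcome = "parsed";
}
catch (XmlException ex)
{
    // The XML parser reports the mismatched start/end tags here.
    outcome = "XmlException: " + ex.Message;
}

Console.WriteLine(outcome);
```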
I would really recommend using a package for your programming language of choice that handles all the oddities of HTML parsing. I've used Hpricot in Ruby before and it's made things a breeze.
If you want to be able to browse the HTML based on its content, XPath is a good choice. But you'll have to clean up the HTML first. You can use HTML tidy to convert the HTML to XHTML. In the process you might modify how the page renders. But it seems to be the purpose of your project so that's not a big deal.
I have XHTML files whose source is not completely valid: it does not conform to the DTD.
For example, in places it uses &ldquo; for a double quotation mark, or &rsquo; for an apostrophe. This causes exceptions in my C# code.
So is there any method or any web link that I can use to get rid of this?
If the file is otherwise well-formed you can define the character entities in your own DTD.
If the file is ill-formed the HTML Agility Pack from CodePlex will parse it.
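The first suggestion can be sketched as follows, assuming the input has no DOCTYPE of its own (so one can simply be prepended) and that only a handful of entities need declaring. The entity list is illustrative, not complete. `DtdProcessing.Parse` has to be enabled explicitly, since `XmlReaderSettings` prohibits DTDs by default.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Well-formed apart from entities that plain XML does not predefine.
const string body = "<p>&ldquo;Quoted&rdquo; and it&rsquo;s fine</p>";

// Internal DTD subset declaring just the entities this input uses.
const string doctype =
    "<!DOCTYPE p [" +
    "<!ENTITY ldquo \"&#8220;\">" +
    "<!ENTITY rdquo \"&#8221;\">" +
    "<!ENTITY rsquo \"&#8217;\">" +
    "]>";

// DTD processing is off by default; turn it on so the declarations are read.
var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };

using var reader = XmlReader.Create(new StringReader(doctype + body), settings);
var document = XDocument.Load(reader);

// The reader expands the entities, so the text holds the real characters.
string text = document.Root?.Value ?? "";
Console.WriteLine(text);
```

Only enable DTD processing for input you trust, since entity declarations can be abused to make the parser do a lot of work.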
You could parse the document as HTML instead, since both end up in a DOM and HTML parsers scoff at these pansy quotation mark problems. Going along with unknown's HTML Tidy idea, you could then serialize the DOM back into a valid XHTML file. (This is identical to using HTML Tidy, which presumably uses an HTML parser anyway, except you'd do it programmatically from C#.)
Well, by the nature of XML, it needs to be well-formed, otherwise it won't render at all. I'd first see what type of errors it generates with W3C's validator: http://validator.w3.org/
Also consider using HTML Tidy, which can be configured to fix XML as well.
We use Hpricot to fix our XML, but then again we are building Rails apps. Not sure about C#.
What's the best way to parse fragments of HTML in C#?
For context, I've inherited an application that uses a great deal of composite controls, which is fine, but a good deal of the controls are rendered using a long sequence of literal controls, which is fairly terrifying. I'm trying to get the application into unit tests, and I want to get these controls under tests that will find out if they're generating well formed HTML, and in a dream solution, validate that HTML.
Have a look at the HTML Agility Pack. It's very compatible with the .NET XmlDocument class, but it is much more forgiving about HTML that's not clean/valid XHTML.
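A minimal sketch of how that might look in a test, assuming the HtmlAgilityPack NuGet package. The fragment is a hypothetical stand-in for a control's rendered output. Note that `ParseErrors` only lists what the lenient parser happened to notice; it is not a full HTML validator.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical fragment standing in for a control's rendered output.
const string fragment =
    "<div class=\"row\"><span>Name</span><a href=\"/edit\">Edit</a></div>";

var doc = new HtmlDocument();
doc.LoadHtml(fragment);

// Report whatever the parser flagged while reading the fragment.
foreach (var error in doc.ParseErrors)
{
    Console.WriteLine($"{error.Line}:{error.LinePosition} {error.Code} {error.Reason}");
}

// SelectNodes returns null (not an empty collection) when nothing matches.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
int linkCount = links?.Count ?? 0;
string firstHref = links?[0].GetAttributeValue("href", "") ?? "";

Console.WriteLine($"{linkCount} link(s), first href: {firstHref}");
```

In a unit test, the assertions would typically go on the selected nodes (counts, attribute values, inner text) rather than on the raw string.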
If the HTML is XHTML compliant, you can use the built in System.Xml namespace.
I've used SgmlReader to produce a valid XML document from HTML, and then either parsed what is required using XPath or converted it to another format using XSLT.
You can also look into HTML Tidy for HTML parsing/cleanup. I don't think there are specific .NET libraries for it, but you might be able to run the binary via the command line, or use IKVM on the Java libraries.
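A rough sketch of the SgmlReader route, assuming the SgmlReader package and its commonly used `DocType`, `CaseFolding` and `InputStream` properties; check the names against the version you install. The input markup is a made-up example with unclosed tags.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml;

// Tag soup: <br> and <p> are never closed.
const string html = "<p>line one<br>line two";

// SgmlReader is an XmlReader that reads HTML and emits well-formed XML.
using var sgmlReader = new SgmlReader
{
    DocType = "HTML",
    CaseFolding = CaseFolding.ToLower, // normalize element names to lower case
    InputStream = new StringReader(html)
};

var doc = new XmlDocument();
doc.Load(sgmlReader);

// From here on, ordinary XPath (or XSLT) works against the XmlDocument.
int lineBreaks = doc.SelectNodes("//br")?.Count ?? 0;
Console.WriteLine($"br elements found: {lineBreaks}");
Console.WriteLine(doc.OuterXml);
```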