I have a set of XSL-FO documents that are used for PDF generation. I also have a requirement to export the same output data (which is in the PDF) as an HTML file. Further, the HTML needs to have styling similar to the PDF.
Is there any way to convert XSL-FO to XHTML using C#?
NOTE: I know one option is to use "RenderX:FO2HTML". But since it's a commercial product, I would like to learn about any other available options and compare them before going further.
I use the RenderX fo2html stylesheet a lot, and I recommend it to my customers because it is zero cost. Thus I have built it into a number of client solutions. You have to go through the RenderX online store to get it, but it costs nothing.
You could write or find an XSLT stylesheet that converts XSL-FO into XHTML, and modify it if necessary to get the rendering you require. A web search for "XSL-FO to HTML" finds at least one such stylesheet.
This is somewhat backward, though. Normally the document starts in some semantic markup language (such as XHTML), and a stylesheet converts it into XSL-FO for rendering.
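To illustrate the kind of element-to-element mapping such a stylesheet performs, here is a minimal sketch. It is written in Python rather than XSLT or C#, and it is not the RenderX stylesheet or any published one. The `TAG_MAP` and `STYLE_PROPS` tables are assumptions chosen for illustration; a real conversion would need to cover many more FO elements and properties, and this sketch ignores page masters and static content entirely.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"

# Hypothetical, deliberately small mapping from FO elements to XHTML elements.
TAG_MAP = {
    "block": "div",
    "inline": "span",
    "table": "table",
    "table-body": "tbody",
    "table-row": "tr",
    "table-cell": "td",
}

# FO formatting properties that share a name with CSS properties.
STYLE_PROPS = ("font-size", "font-weight", "font-family", "color", "text-align")


def convert(fo_elem):
    """Recursively convert one FO element into an XHTML element."""
    local = fo_elem.tag.split("}")[-1]  # strip the namespace part
    html_elem = ET.Element(TAG_MAP.get(local, "div"))  # unknown elements become div
    style = "; ".join(
        f"{prop}: {fo_elem.get(prop)}" for prop in STYLE_PROPS if fo_elem.get(prop)
    )
    if style:
        html_elem.set("style", style)
    html_elem.text = fo_elem.text
    for child in fo_elem:
        converted = convert(child)
        converted.tail = child.tail  # keep text that follows the child
        html_elem.append(converted)
    return html_elem


def fo_to_xhtml(fo_source):
    """Convert the fo:flow content of an XSL-FO document to an XHTML string."""
    root = ET.fromstring(fo_source)
    body = ET.Element("body")
    for flow in root.iter(f"{{{FO_NS}}}flow"):
        for child in flow:
            body.append(convert(child))
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    html.append(body)
    return ET.tostring(html, encoding="unicode")
```

Copying same-named properties into an inline `style` attribute is what keeps the HTML looking similar to the PDF; layout-specific properties (margins of page regions, keeps, breaks) have no direct CSS counterpart here and would need separate handling.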
Related
How can I add accessibility tagging to non-tagged PDF with iTextSharp using PDF content nodes?
Per "Can I fully tag a non tagged PDF using iTextSharp?", I know we can't get perfect tagging compared to a human tagging the PDF or tagging it as it's being created (the best options), but what about taking the PDF content objects and just doing a "best effort" semantic tagging to improve readability?
The basic principle, I would think, could be to sort all PDF content objects left to right, top to bottom, and then, for each text node, create simple P tags in that order so that they are at least spoken. If there are form objects intermixed, then tag those as well. Obviously if it's all paths and artifacts, there isn't much you can do with it, but plenty of PDFs have text nodes. I can't rely on Adobe Reader trying to determine reading order.
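The ordering step described above can be sketched independently of any PDF library. This is a minimal Python sketch, not iTextSharp code: `TextChunk` is a hypothetical record standing in for whatever the extraction step returns, and `line_tolerance` is an assumed threshold for treating two baselines as the same line. It assumes PDF user space coordinates, where the origin is at the bottom-left and y grows upward.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    # Hypothetical record for one extracted text object; not an iTextSharp type.
    text: str
    x: float  # left edge in PDF user space
    y: float  # baseline in PDF user space (y grows upward)


def reading_order(chunks, line_tolerance=2.0):
    """Group chunks into lines by baseline, then order top-to-bottom, left-to-right."""
    lines = []
    # Higher y means higher on the page, so sort by descending y.
    for chunk in sorted(chunks, key=lambda c: -c.y):
        if lines and abs(lines[-1][0].y - chunk.y) <= line_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    return [sorted(line, key=lambda c: c.x) for line in lines]


def to_paragraph_tags(chunks):
    """Emit one ('P', text) pair per line, in reading order."""
    return [("P", " ".join(c.text for c in line)) for line in reading_order(chunks)]
```

A plain geometric sort like this reads straight across multi-column pages, so column detection would have to come before it for such layouts.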
For example, the content structure of a simple PDF has Text content nodes that could be tagged. We can't control the source PDF generation, but we need to manipulate the PDF by adding headers/footers, etc., and want it all tagged together.
How can we achieve this with iTextSharp? We have the commercial version, 5.5.10.0.
For example, abcpdf has a function called MakeAccessible that attempts this and works fairly well. However, we want to use iTextSharp.
It's obvious from the title what I want to do. I know it is possible to convert HTML to a PDF document using the very popular library iTextSharp. But what I learned from this post is that iTextSharp cannot render HTML5 and CSS3 styles correctly. Is there any free library that can achieve this?
Background:
I am using DevExtreme for report generation. It supports chart export to PDF, but my client wants some extra content in the PDF apart from the charts. DevExtreme does not support this, so I decided to write my own custom PDF exporter.
There are some libraries available, but I cannot rely on them since I can't predict in advance what issues they will cause in production. Correct me if I am wrong, but Microsoft provides no API for manipulating PDF files. We can create and manipulate Excel and Word files using Microsoft.Office.Interop.Excel.dll and Microsoft.Office.Interop.Word.dll, but I didn't find anything for PDF manipulation.
Please suggest what options I have.
Hope this makes sense..!
A few years back I was using iTextSharp to convert our HTML manuals (XHTML/CSS/wiki) to PDF. It was... painful and a lot of work. So, the first news is: you will need quite a few weeks (two to four, depending on the degree of perfection you want) if you have more than a few HTML pages.
If you only have a very limited number of pages, the quickest and dirtiest way is to take screenshots of your rendered pages and add those images to the PDF. Not very high-tech, but quickly done.
If your style sheets can be sacrificed and you do not care about the formatting of the content to be identical, you can convert your html5 pages to xhtml so you can load them as XmlDocuments. Then you simply create a program which does some mapping from xml elements, such as <h1>MyTitle</h1> to some section of code which creates a pdf entity using iTextSharp. Basically that was the way I did it in my case. I also did some mapping from css style classes to some specific pdf formatting, but not to the extreme.
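The element-to-handler mapping described above can be sketched roughly as follows. This is a Python sketch rather than the original C#/iTextSharp code, which is not shown in the answer; `ELEMENT_STYLES` and the `(kind, font_size, text)` tuples are hypothetical stand-ins for the calls that would create real PDF entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical style table: XHTML tag -> (PDF entity kind, font size in points).
ELEMENT_STYLES = {
    "h1": ("heading", 20),
    "h2": ("heading", 16),
    "p": ("paragraph", 11),
    "li": ("list-item", 11),
}


def xhtml_to_pdf_entities(xhtml_source):
    """Walk the XHTML tree and return (kind, font_size, text) for each mapped element."""
    root = ET.fromstring(xhtml_source)  # requires well-formed XHTML
    entities = []
    for elem in root.iter():
        tag = elem.tag.split("}")[-1]  # drop the XHTML namespace if present
        if tag in ELEMENT_STYLES:
            kind, size = ELEMENT_STYLES[tag]
            # Flatten inline markup and collapse runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            entities.append((kind, size, text))
    return entities
```

In a real exporter each tuple would be replaced by a call into the PDF library, and a second table could map CSS class names to formatting in the same way.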
Also worth trying are converters from HTML (or XML) to TeX/LaTeX. If you are lucky, you will find one that does a good enough job. Then you can use pdfTeX to get your PDF.
Also, it is possible that you can print your documents to an xps printer and then convert the xps to pdf. Or you simply convince your customer that xps is what they want.
Is there a markup language that can be used in conjunction with a well-supported .NET open source project to generate PDF or HTML documents with very fine control over the output, in terms of style and anchoring, for both?
The documents will be partly static and partly auto-generated from the XML comments of some class libraries.
To clarify the question: I know HTML is a markup language. The reason I don't want to use it to store the content directly is that all of the HTML-to-PDF tools and libraries I have looked at have patchy support for creating tables of contents and indexes, and for turning hyperlinks into PDF document anchors.
I would opt for HTML documents. Markdown comes to mind. But as far as arbitrarily 'very fine' control goes, you can always just use HTML. It is THE HyperText Markup Language, after all.
There have been many questions like this on Stack Overflow before. I think the consensus is that you should have one markup language, rather than two.
HTML is - by definition (hypertext MARKUP LANGUAGE) - the markup language of choice and all you need to do is convert that to PDF. The other way around, from PDF to HTML is quite a bit tougher.
In order to convert HTML to PDF there's a truckload of tools, depending on what exact needs you have for the resulting PDF and what kind of CSS you need to support.
I'd always go for a rendering engine that's used in browsers (instead of something like iText or Prince), because you want to make sure your docs look like they do in a browser. You'd end up with Winnovative or something based on WebKit like the API by htm2pdf.
XSL-FO is the recommended solution. It provides a great level of control over the document layout and there are several tools for XSL-FO to PDF conversion.
I'm working on a browser-like application which gets HTML from a site (any website) then applies a style-script over it to change certain elements (just like greasemonkey).
My initial plan is to parse the HTML using XPath and XmlDocument, but is there a better way?
Thanks in advance!
P.S. Handy tips, tricks and links on HTML + C# would be great.
Use the HTML Agility Pack. You can find it here: http://www.codeplex.com/htmlagilitypack
HTML does not always follow XML rules. For example, some HTML tags may have no closing tag, so XPath and XDocument will sometimes throw errors. The IE API lets you do this (see here), and you can also find third-party parsers for it (see this or this).
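The failure mode described above is easy to reproduce. The sketch below uses Python's standard library rather than .NET, so the class and function names are illustrative only: a strict XML parser rejects an unclosed `<br>`, while a tolerant HTML parser reads the same markup without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records start tags using a tolerant HTML parser instead of an XML one."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Return the start tags found in the markup, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

For instance, `is_well_formed_xml("<p>one<br>two</p>")` is false because `<br>` is never closed, yet `html_tags` on the same string still reports both tags. A dedicated HTML parser such as the HTML Agility Pack plays the tolerant role on the .NET side.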
I would highly recommend using XSLT. This allows you to keep all your transformation rules OUTSIDE your code, making them really easy to change if the HTML to be transformed is modified or you want to change your layout.
Nonetheless, if you are using HTML and not XHTML, beware of possible errors. Using a Tidy library can help you overcome this.
I would really recommend using a package for your programming language of choice that handles all the oddities of HTML parsing. I've used Hpricot in Ruby before and it's made things a breeze.
If you want to be able to browse the HTML based on its content, XPath is a good choice. But you'll have to clean up the HTML first. You can use HTML tidy to convert the HTML to XHTML. In the process you might modify how the page renders. But it seems to be the purpose of your project so that's not a big deal.
I need to populate a Word 2007 document from code, including repeating table sections. Currently I use an XML transform on the document.xml portion of the docx, but this is extremely time-consuming to set up (each time you edit the template document, you have to recreate the transform.xsl file, which can take up to a day for complex documents).
Is there any better way, preferably one that doesn't require you to run Word 2007 during the process?
Regards
Richard
I tried myself to write some code for that purpose, but gave up. Now I use a 3rd party product: Aspose Words and am quite happy with that component.
It doesn't need Microsoft Word on the machine.
"Aspose.Words enables .NET and Java applications to read, modify and write Word® documents without utilizing Microsoft Word®."
"Aspose.Words supports a wide array of features including document creation, content and formatting manipulation, powerful mail merge abilities, comprehensive support of DOC, OOXML, RTF, WordprocessingML, HTML, OpenDocument and PDF formats. Aspose.Words is truly the most affordable, fastest and feature rich Word component on the market."
DISCLAIMER: I am not affiliated with that company.
Since a DOCX file is simply a ZIP file containing a folder structure with images and XML files, you should be able to manipulate those XML files using your favorite XML manipulation API. The specification of the format is known as WordprocessingML, part of the Office Open XML standard.
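As a rough illustration of treating the DOCX as a ZIP archive, here is a minimal Python sketch using only the standard library. The function name `fill_docx` and the `{{NAME}}`-style placeholders are assumptions made for this example; it uses plain string substitution on `word/document.xml` rather than a full XML API, and copies every other part of the archive unchanged.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx(template_path, output_path, replacements):
    """Copy a DOCX, substituting placeholder strings inside word/document.xml."""
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for placeholder, value in replacements.items():
                    # Escape &, < and > so the part stays well-formed XML.
                    xml = xml.replace(placeholder, escape(value))
                data = xml.encode("utf-8")
            dst.writestr(item, data)  # all other parts are copied as-is
```

Usage might look like `fill_docx("template.docx", "out.docx", {"{{NAME}}": "Richard"})`. One caveat: Word can split a piece of text across several runs in the XML, so a placeholder typed into the template is only found if it ended up inside a single text element. Repeating table rows would need real XML manipulation rather than string replacement.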
I thought I'd mention it in case the 3rd party tool suggested by splattne is not an option.
Have you considered using the Open XML SDK from Microsoft? The only dependency is on .NET 3.5.
Documentation: http://msdn.microsoft.com/en-us/library/bb448854%28office.14%29.aspx
Download: http://www.microsoft.com/downloads/details.aspx?familyid=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en
Use the Invoke docx library. It supports table data (http://invoke.co.nz/products/help/docx_tables.aspx). More info at http://invoke.co.nz/products/docx.aspx
Have you considered using VB? You could create a separate assembly to populate your document.
I know you are looking for a C# solution, but this is one area where VB's XML literal support could help you populate the document. Create a document in Word to serve as a template, unzip the docx, paste the relevant XML section you want to change into your VB code, and add code to fill in the parts you wish to change. It's difficult to say from your description whether this would meet your requirements, but I would suggest looking into it.