Converting PDF to PDFA1-A with iTextSharp


What I want is to load a plain PDF file in iText and export it (or write it) as PDF/A-1a.
I have "iText in Action", second edition, at hand and I'm using iTextSharp. Still, progress == null.

Hiya Leonard. (Leonard works for Adobe as their PDF Dev Evangelist Guy. His PDF-Fu is Mighty. I'll refrain from comparing our mightiness in some vague attempt at false modesty.) >:)
Arbitrary PDF -> PDF/A1-A is all but impossible. 1-A requires a great big pile of formatting information embedded in your tagging... about as much info as you'd need to rebuild the PDF as html/css.
Going from "this is a pile of lines and characters with these coordinates" to "this is a table with X columns and Y rows and the following information in its cells" is EXTREMELY DIFFICULT. All but impossible.
PDF/A1-b is much more realistic, though still not easy. You need to put everything into a specific set of colorspaces and render intents and things with molecular structures that your primitive intellect wouldn't understand.
(Terrible misquote, but there's still some funny in there, so I left it.)
iText[Sharp] supports generating PDF/A inasmuch as it will tell you when you do something blatantly against the spec... but it may not catch it until you call document.close(). The programmer writing the generator still needs to fill in a Whole Bunch of Information "manually".
Ain't nobody that can say "we'll take some arbitrary PDF and turn it into PDF/A-1a" (without lying through their teeth). You point me at some software that says so, and I'll give you a perfectly valid PDF that'll break it. EVERY TIME. I'd bet money on it.
You need a copy of the PDF/A ISO spec ($). You need a copy of the PDF ISO spec (free!). You need to KNOW THEM. And then you'll understand what you're up against.
Now all that is "Arbitrary PDF". If you have a stack of instances of some report that are all coming from the same program, then there's a light at the end of the tunnel. It's still a long tunnel, but the problem degrades to "hard" instead of "almost impossible". And once you've got one report working, handling similar reports FROM THE SAME APP is likely to be relatively easy.
Still not fun.

iText doesn't support conversion of PDF->PDF/A "out of the box".
You could certainly use the low-level APIs in the library as a starting point for writing such a converter... but it would only be a start.

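Try this.
// Sets the PDF/A-1a conformance flag on the stamper's writer (older iTextSharp API).
// The flag alone does not add the tagging and other information that PDF/A-1a requires.
PdfReader reader = new PdfReader(GetTemplateBytes());
PdfStamper pst = new PdfStamper(reader, Response.OutputStream);
pst.Writer.SetPdfVersion(PdfWriter.PDF_VERSION_1_4);
pst.Writer.PDFXConformance = PdfWriter.PDFA1A;
pst.Close(); // the output is only complete once the stamper is closed
reader.Close();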

Related

Convert docx to postscript

I need to convert a Word document (docx) to a postscript file so that I can use this postscript file to generate PDF using the Ghostscript command line tool.
How do I generate the postscript file from the docx?
I need to code using .NET/C#. I found about LaTeX which generates postscript but how do I make my Word file be used with LaTeX or any other tool to get the postscript generated?

There are three main products I will mention that understand DOCX.
The obvious one is MS Word. It produces the definitive rendering of all DOCX files. Nothing is ever going to be exactly the same. By definition it is always correct. However, it is not really designed for automated conversion, and getting it to do this kind of thing is fraught with difficulty. On a legal level the EULA may conflict with your chosen solution.
OpenOffice.org is a great product. The EULA is much more accommodating. The freeness is attractive. However, while it will produce pretty good output for most DOCX documents, it does not for all. While it is similar to MS Word it is not the same, and this is something you may notice, particularly for more complex documents. Probably more importantly, again it's not designed for automated conversions, and trying to get it to do this can be fraught and tiresome.
WordGlue .NET (on which I work) is a native .NET library that understands DOCX. It is designed specifically to produce output which is the same as MS Word. While I'm not going to say it is perfect (it's a big task), it is superior to OpenOffice.org in that it does actually attempt this as a specific design decision. However, probably the biggest advantage is that it is designed for high-performance, multi-threaded, server-side conversion. It's native .NET and thus low impact in terms of security.
Products like ABCpdf (on which I work) will integrate with these three applications to allow conversion direct to PDF. Why bother going via PostScript if you want PDF? However, if you really want to save as PostScript you can do that too.
Or indeed you can write your own code to integrate with these products. Just be aware of the caveats above regarding fraughtness and tiresomeness relating to MS Office and OpenOffice.org. To get these things working unattended requires an awful lot of attention.

You need to print it to a PostScript file from an application which can read .docx files. Or you could just export directly to PDF from the app; as far as I know, anything which reads .docx and can print can also write a PDF file.

If you have a Windows computer you can use the command line:
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "printerThatDumpsPS"
You can find free file printers for PostScript printing on the internet, or you can use Adobe PDF, PDF-XChange, or any PS printer you already have. You can use C# to temporarily change the printer's settings so that it does this for you.
So, for example, using PDF-XChange as follows,
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "PDF-XChange Printer 2012"
produces a PDF file with hardly any trace of which program was used, assuming PDF-XChange is set to save the file without asking.
This produces a passable document, but it does lose quite a few features. It might be enough, though.
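The command line above can be launched from C# roughly as follows. This is a minimal sketch, not a tested production routine: the class and method names (WordPadPrinter, BuildStartInfo, Print) are made up for illustration, and it assumes WordPad is installed at the path shown in the answer and that the named printer saves its output without prompting.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class WordPadPrinter
{
    // Builds the start info for: wordpad.exe /pt "<document>" "<printer>"
    // (hypothetical helper; the /pt usage is taken from the command line above).
    public static ProcessStartInfo BuildStartInfo(string documentPath, string printerName)
    {
        string programFiles = Environment.GetFolderPath(Environment.SpecialFolder.ProgramFiles);
        string wordPad = Path.Combine(programFiles, "Windows NT", "Accessories", "wordpad.exe");

        return new ProcessStartInfo
        {
            FileName = wordPad,
            Arguments = string.Format("/pt \"{0}\" \"{1}\"", documentPath, printerName),
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Starts WordPad and blocks until the process exits.
    public static void Print(string documentPath, string printerName)
    {
        using (Process process = Process.Start(BuildStartInfo(documentPath, printerName)))
        {
            process.WaitForExit();
        }
    }
}
```

Usage would look like WordPadPrinter.Print("foobaar.docx", "PDF-XChange Printer 2012"). Quoting both arguments keeps file and printer names with spaces intact.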

What options do I have to produce a PDF report from code in .NET for scientific data (winforms)

I have a "legacy" VB.NET application (winforms) written for .NET 1.1, and re-compiled under 2.0 that produces a report in HTML via a custom XmlTextWriter wrapper that is suited for HTML. The user then would print the report into pdf if they wanted to.
That was 2003, and now technology has changed a bit, especially within the C#/VB.NET world, and customers want to skip the HTML part and go to PDF directly. What are my options for open source or low-cost PDF libraries that work well with .NET and support tables with pictures (bitmaps generated from code) and text?
Here is a screenshot of the resulting HTML rendering.
Obviously this needs some cleaning up and tidying, but I am interested in knowing which technology to pursue in this project.
This related question might be what I need, or it may be out of date by now. I don't have any data sources that will provide all the information I want displayed. Currently it is collected from various classes within the application in order to be displayed as HTML.
Does anybody have direct experience with iTextSharp or SharpPDF?
Thanks for any advice.
Update 1:
found possible duplicate here.

I have used iTextSharp to produce PDF reports before. Although you have to get used to the library (and it is an extensive library), once you get the hang of it it isn't so bad. I found the book iText In Action to be very helpful. Even though the book is about the original Java library, not the .NET port, most of the methods and classes are named the same so it wasn't really a problem.
My #1 piece of advice when working with iTextSharp is that you'll be writing a lot of the same code, over and over again. (i.e. creating a table cell, setting the fonts, sizes, colors, and borders for that table cell, setting text...). Do yourself a favor and make your own little Utility class that will do all of your gruntwork for you -- otherwise you'll end up with 2000 lines of code that just create a few tables with some special formatting.
In addition, this site has a series of brief articles that I found useful when I was first learning iTextSharp.
Edit:
If you're interested in an XHTML-->PDF converter, I just found this blog post by Darin Dimitriov that shows how to port the open-source Java flying-saucer library to .NET. He makes it look easy!
Interestingly enough, it seems that flying-saucer uses iText under the hood to perform the conversion.

Report.NET is a .NET PDF library specific for report generation supporting the features you asked for. It is smaller than iTextPdf, but perhaps sufficient for your needs:
http://sourceforge.net/projects/report/
(license is LGPL).

You can use this free print driver:
http://www.dopdf.com/
When you print to it, it outputs PDF.

I was researching this topic two months ago, and basically you have two ways:
DLLs
open source iTextSharp - well, don't expect too much from it. It can generate PDFs from simple web pages, but your table seems quite complicated, so I don't think you will succeed with it without some tweaking of its source code
paid options - I've used ABCpdf; it works pretty smoothly. Not everything is rendered as well as in a browser, but it does its job. I believe there are many more libraries like this
Command line tools
If you are lucky enough to have full control over the server, I think this will be your best option
wkhtmltopdf - good user opinions
htmldoc
I have not tried those two, though; I was on shared hosting, so they were not an option for me

I just wrote a TIP in CodeProject on how to do this without using any external DLL in a couple of lines.
Here's the short code copied:
// ----------------------------------------------------------------------------------------------
// If you run this on Windows 10 (with its default printer "Microsoft Print to PDF" installed),
// this should print a PDF file named "CreatedByCSharp.pdf" in your "MyDocuments" folder
// containing the string "When nothing goes right, go left"
// ----------------------------------------------------------------------------------------------
// If not present, you will need to add a reference to System.Drawing in your project References
using System;
using System.Drawing;
using System.Drawing.Printing;

void PrintPDF()
{
    // Set the output dir and file name
    string directory = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
    string file = "CreatedByCSharp.pdf";

    PrintDocument pDoc = new PrintDocument()
    {
        PrinterSettings = new PrinterSettings()
        {
            PrinterName = "Microsoft Print to PDF",
            PrintToFile = true,
            PrintFileName = System.IO.Path.Combine(directory, file),
        }
    };
    pDoc.PrintPage += new PrintPageEventHandler(Print_Page);
    pDoc.Print();
}

void Print_Page(object sender, PrintPageEventArgs e)
{
    // Here you can play with the font style (and much much more, this is just an ultra-basic example)
    Font fnt = new Font("Courier New", 12);
    // Insert the desired text into the PDF file
    e.Graphics.DrawString("When nothing goes right, go left", fnt, System.Drawing.Brushes.Black, 0, 0);
}

I ended up using iTextSharp to produce image flashcards with some tricky formatting after becoming deeply frustrated with other libraries.
Where it really paid off was the competent documentation compared to the other options. I believe there is also an option to automatically parse HTML/XML.

I'd also suggest taking a look at PD4ML, an HTML-to-PDF conversion library. It has quite a modest price for a paid product, yet it supports lots of features and is updated regularly.

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it would help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy them over to Dreamweaver, fix some formatting, and put the result onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert them.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT has the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and applies the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that was originally provided by MS for SharePoint servers to provide the ability to render Word documents in a browser. Or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the result file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.
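For readers following along, the unzip-and-transform approach described in the question can be sketched like this. It is an illustration under assumptions, not the asker's pastebin code: DocxXslt and Transform are hypothetical names, and a real stylesheet such as DocX2Html.xsl may need other parts of the package or non-default XSLT settings that this sketch does not supply.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;
using System.Xml.Xsl;

public static class DocxXslt
{
    // Reads word/document.xml out of a .docx (which is a ZIP archive) and runs
    // it through the given XSLT stylesheet, returning the transformed text.
    public static string Transform(Stream docx, XmlReader stylesheet)
    {
        var xslt = new XslCompiledTransform();
        xslt.Load(stylesheet);

        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in archive");

            using (Stream part = entry.Open())
            using (XmlReader input = XmlReader.Create(part))
            using (var output = new StringWriter())
            {
                // The TextWriter overload honors the stylesheet's xsl:output settings.
                xslt.Transform(input, null, output);
                return output.ToString();
            }
        }
    }
}
```

Because the stylesheet is passed in as an XmlReader, a simpler hand-written XSL that only maps paragraphs, tables, and basic text formatting could be swapped in without changing this code.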

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
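The clean-up step could look something like the sketch below, which takes the regular-expression route the answer mentions as a fallback. HtmlCleaner and its list of attributes and tags are assumptions chosen for illustration; regular expressions are only suitable for simple, predictable markup, and a parser such as SgmlReader or HTML Agility Pack is the safer choice for anything else.

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Attributes that generated HTML tends to repeat on every element
    // (an assumed list; adjust it to what the converter actually emits).
    private static readonly Regex NoisyAttributes = new Regex(
        @"\s+(?:style|class|lang)\s*=\s*(?:""[^""]*""|'[^']*')",
        RegexOptions.IgnoreCase);

    // Opening and closing <span> and <font> tags; their inner text is kept.
    private static readonly Regex WrapperTags = new Regex(
        @"</?(?:span|font)\b[^>]*>",
        RegexOptions.IgnoreCase);

    public static string Clean(string html)
    {
        string withoutAttributes = NoisyAttributes.Replace(html, string.Empty);
        return WrapperTags.Replace(withoutAttributes, string.Empty);
    }
}
```

Note that the attribute pattern does not check that it is inside a tag, so body text that happens to contain something like class="x" would be altered too.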

Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML and the code placed into the clipboard. The current version also supports command line access, and the new version will have a server version too.
There is a free trial version downloadable from the site, and if you have any questions do contact me any time.

Generating large amounts of PDF Docs from templates

Where I work we get PDF templates from our clients and convert them into HTML templates, in which we can swap out tokens on the page for other info before mailing the results out to their clients.
The reason we convert them into HTML is that the text can wrap if any of the info is too long.
The process can be slow, since we can only do one at a time on a single computer because of a GDI problem. So we have a farm that creates the PDF docs that need to be mailed out.
The GDI problem:
http://support.microsoft.com/kb/939884/en-us/
The hotfix does not seem to fix the problem.
Is there a better way of doing this that would be more efficient or easier, without having to go from PDF -> HTML -> PDF?

Have a look at iTextSharp. You can use the PdfStamper object to dink with the original PDF and spit out customized versions very quickly. You might need to do a little massaging of the source PDF (define some form objects in Acrobat, etc.) to get it to do exactly what you want, but it's very high performance compared to the process you're describing. We use it to generate thousands of PDFs a day (with customized info inserted for each customer), and it's free. A little tricky to get your head around (steep learning curve). I'd highly suggest buying the eBook "iText In Action"; the samples there make the whole thing much easier to grok.

ASP.Net Converting and Merging documents into single PDF

I need to have the ability to convert and merge various documents into a single Pdf.
The documents could be of varying types, such as Word, Open Office, Images, Text, Web pages (by URL) and the PDF would usually consist of 2-3 documents.
At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed onto the Server. This handles most documents but we haven't had it doing Open Office ones yet.
We currently produce around 100-1000 of these PDFs per day.
The reason I am asking the question is that performance is a key issue. The PDF is generated for users on the fly, so the waiting times we are currently getting, 30-60 seconds, are becoming unacceptable.
We have done some caching around documents when they are initially uploaded, so the main task that happens when a user requests a PDF is merging a number of already generated PDFs.
Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!
Edit:
Thanks for all the great advice, I'll look into some of these and compare performance.
Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.

Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page) - you could compare your merge performance with something like iTextSharp (.NET version of iText) to be sure it isn't a bottleneck - otherwise the conversion from other formats to PDF is likely the bottleneck.
In almost all cases, the method used to convert X to PDF is to execute the application's print command, targeted at a software PDF printer, to create a temporary PDF file.
This means:
The target application (for example Office) is opened and closed
The document has to travel through the printing service
In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.
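The convert-on-upload idea can be sketched as a small cache that runs the slow conversion once per source file and reuses the result afterwards. Everything here is hypothetical: PdfCache is an illustrative name, and the convertToPdf delegate stands in for whatever converter is actually used (easyPDF, Office automation, and so on). The sketch keys cached files by base file name only, so two sources with the same name in different folders would collide.

```csharp
using System;
using System.IO;

public sealed class PdfCache
{
    private readonly string cacheDirectory;
    private readonly Action<string, string> convertToPdf; // (sourcePath, pdfPath)

    public PdfCache(string cacheDirectory, Action<string, string> convertToPdf)
    {
        this.cacheDirectory = cacheDirectory;
        this.convertToPdf = convertToPdf;
        Directory.CreateDirectory(cacheDirectory);
    }

    // Call when a file is added to the library; returns the cached PDF path.
    public string GetOrConvert(string sourcePath)
    {
        string pdfPath = Path.Combine(
            cacheDirectory, Path.GetFileNameWithoutExtension(sourcePath) + ".pdf");

        // Reconvert only if there is no cached copy or the source is newer.
        bool stale = !File.Exists(pdfPath)
            || File.GetLastWriteTimeUtc(sourcePath) > File.GetLastWriteTimeUtc(pdfPath);
        if (stale)
            convertToPdf(sourcePath, pdfPath);

        return pdfPath;
    }
}
```

With this in place, a user request only has to merge the cached PDFs, which keeps the expensive application start-up and print step out of the request path.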

We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.

I had a very similar issue where we had documents that were already existing in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.
O2 Solutions - PDF4NET
Here is the code sample that they provide for merging. The first line looks like it just writes the merged file to disk; the next two lines allow for streaming the content back to the user.
PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
PDFDocument doc = PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
doc.SaveToStream( stream );

You say you're using Microsoft Office to open these files; I would imagine this is the bottleneck rather than the actual PDF creation.
Is it possible to distill these documents into a more accessible format (HTML/XML/database), so that it's not necessary to open Office every time a PDF needs to be created?

While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.
