I have to load a few hundred documents of various sorts from disk and produce one giant aggregated PDF.
My approach is to convert each document to PDF individually, then aggregate them.
I have been writing each PDF document to isolated storage as part of this process.
Is there are better way to do this?
Performance is a priority - this is client side.
Have you heard of Pdftk - http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ (free) - I use it to merge multiple pdfs into one.
In combination with BullZip PDF printer (free) - http://www.bullzip.com/products/pdf/info.php - I can basically create any PDFs I need.
I'm not sure if you can do any automation with BullZip, but with Pdftk you could certainly work with it through C# and System.Diagnostics.Process.
Here's a typical command line that I use:
pdftk folder1/file1.pdf folder1/file2.pdf folder2/file1.pdf cat output all_files.pdf
That merges the 3 files, in order, to all_files.pdf
Related
I need to convert a Word document (docx) to a postscript file so that I can use this postscript file to generate PDF using the Ghostscript command line tool.
How do I generate the postscript file from the docx?
I need to code using .NET/C#. I found about LaTeX which generates postscript but how do I make my Word file be used with LaTeX or any other tool to get the postscript generated?
There are three main products I will mention that understand DOCX.
The obvious one is MS Word. It produces the definitive rendering of all DOCX files. Nothing is ever going to be exactly the same. By definition it is always correct. However it is not really designed for automated conversion and getting it to do this kind of thing is fraught with difficulty. On a legal level the EULA may confict with your chosen solution.
OpenOffice.org is a great product. The EULA is much more accomodating. The freeness is attractive. However, while it will produce a pretty good output for most DOCX documents it does not for all. While it is similar to MS Word it is not the same and this is something you may notice, particularly for more complex documents. Probably more importantly, again it's not designed for automated conversions and trying to get it to do this can be fraught and tiresome.
WordGlue .NET (on which I work) is a native .NET library that understands DOCX. It is designed specifically to produce output which is the same as MS Word. While I'm not going to say it is perfect (it's a big task) it is superior to OpenOffice.org in that it does actually attempt this as a specfic design decision. However probably the biggest advantage is that it is designed for high perfomance multi-threaded server side conversion. It's native .NET and thus low impact in terms of security.
Products like ABCpdf (on which I work) will integrate with these three applicatons to allow conversion direct to PDF. Why bother going via PostScript if you want PDF? However if you really want to save as PostScript you can do that too.
Or indeed you can write your own code to integrate with these products. Just be aware of the caveats above regarding fraughtness and tiresomeness relating to MS Office and OpenOffice.org. To get these things working unattended requires an awful lot of attention.
You need to print it to a PostScript file, from an application which can read .docx files. Or you could just export direct to PDf from the app, as far as I know anything which reads .docx and can print, can also write a PDF file.
If you have a windows computer you can use the commandline
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "printerThatDumpsPS"
You can find file printers for postscript printing for free on the internet. Or if you have adobe pfdf, pdf exchange or any PS printer. You can use c# to temporarily set the printers settings so that it does this for you.
So for example using pdf exchange as follows,
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "PDF-XChange Printer 2012"
Produces a pdf file without much of a trace anywhere what program was used, assuming pdf exchange was set to save file without asking.
This produces a passable document but yeah it looses quiet many features. But it might be enough.
I have a DataTable with 3 columns (a,b,c) and a docx file with the corresponding MailMerge fields set up. What I'd like to do is perform a Mail Merge on the document with the data.
Presume that you can write to the hard disk (if you need to create a csv etc for doing the merge etc), you don't have word, excel etc, Open XML SDK is installed, but equally we can install anything else.
In terms of the answer, converting the input data to whatever is needed isn't really a problem, the problem is how to perform a mail merge in the Open XML SDK (or other FREE API).
As a side note, the output should be one file with n pages (where n is the number of rows in the data), i.e. not n documents (though I don't mind if the merge of documents is done at the end).
(I should add, I'm not tied to the MailMerge concept, being able to just do a replace for example would work - though obviously that then requires merging the files together at the end...)
I've got this working in a pretty horrible way - basically - at the moment, the algorithm follows this:
Unzip docx file
Read in document.xml (to a string)
string.Replace the fields
Rezip to temporary docx
Merge all the temporary documents created
The actual code for merging the documents comes from Eric White's Blog : http://blogs.msdn.com/ericwhite/archive/2009/02/05/move-insert-delete-paragraphs-in-word-processing-documents-using-the-open-xml-sdk.aspx
I need to have the ability to convert and merge various documents into a single Pdf.
The documents could be of varying types, such as Word, Open Office, Images, Text, Web pages (by URL) and the PDF would usually consist of 2-3 documents.
At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed onto the Server. This handles most documents but we haven't had it doing Open Office ones yet.
We currently produce around 100-1000 of these PDF's per day.
The reason I am asking the question is that performance is a key issue. The PDF is generated for users on the fly and so the waiting times we are currently getting of 30-60 seconds is becoming unacceptable.
We have done some caching around documents when they are intially uploaded so the main tasks that happens when a User requests a Pdf is merging a number of already generated Pdf's.
Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!
Edit:
Thanks for all the great advice, I'll look into some of these and compare performance.
Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.
Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page) - you could compare your merge performance with something like iTextSharp (.NET version of iText) to be sure it isn't a bottleneck - otherwise the conversion from other formats to PDF is likely the bottleneck.
In almost all cases, the method used to convert X to PDF is to execute the applications print command, targeted at a software PDF printer, to create a temporary PDF file.
This means:
The target application (for example Office) is opened and closed
The document has to travel through the printing service
In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.
We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.
I had a very similar issue where we had documents that were already existing in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.
O2 Solutions - PDF4NET
Here is the code sample that they provide for merging. The top line looks like it just outputs the file, the second 2 lines allow for streaming the content back to the user.
PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
PDFDocument doc = PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
doc.SaveToStream( stream );
You say you're using Microsoft Office to open these files, I would imagine this is the bottleneck rather than the actual PDF creation.
Is it possible to distill these documents into a more accessible format (html/xml/database), so that it's not necessary to open office every time a PDF needs to be created?
While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.
I have a situation where in a web application a user may need a variable list of PDFs to be printed. That is, given a large list of PDFs, the user may choose an arbitrary subset of that list to print. These PDFs are stored on the file system. I need a method to allow users to print these batches of PDFs relatively easily (thus, asking the user to click each PDF and print is not an option) and without too much of a hit on performance.
A couple of options I've thought about:
1) I have a colleague who uses a PDF library that I could use to take the PDFs and combine them on the fly and then send that PDF to the user for printing. I don't know if this method will mess up any sort of page numbering. This may be an "ok" method but I worry about the performance hit of this.
2) I've thought about creating an ActiveX that I would pass the PDFs off to and let it invoke the printing features. My concern is that this is needlessly complex and may present some odd user interactions.
So, I'm looking for the best option to use in this scenario, which is probably not one of the ones I've gone through.
The best solution I have for you is number 1. There are plenty of libraries that will merge documents. From the one I've used the numbering should not be an issue since all the pages are all ready rendered.
If you go with ActiveX you are going to limit yourself to IE which might be acceptable. The only other idea would be to use a smart client so you can have more control...then you could serve up the PDF's via a web service.
I think concatenating the documents is the way to go.
For tools I recommend iText#. Its free
You can download here iTextSharp
iText# (iTextSharp) is a port of the iText open source java library for PDF generation written entirely in C# for the .NET platform. Use the iText mailing list to get support.
I agree with #1. You could do some tests to see what the performance hit would be like.
Is it possible to reorder an already generated PDF file programmatically, and using as little resources as possible, as this will need to be ran on ~8000 PDFs every month or so?
We are currently using iTextSharp to merge the PDF’s in to larger PDF’s, but iTextsharp’s Documentation does not really explain much.
I've used iTextSharp -- check out this code sample, it's what I used to write a (simpler) splitting utility.
I've used this on over 10,000 PDFs in a shot, and I can't remember the exact performance, but it was certainly acceptable for a batch job.
The Merger product from DynamicPDF will do this (http://www.dynamicpdf.com/). I cannot speak to what kind of performance you'll see with 8k documents, but i can say that it is one of the fastest PDF processing tools i have found.
There is a .Net version of both the Merger tool, and the Generator tool.