ASP.Net Converting and Merging documents into single PDF - c#

I need to have the ability to convert and merge various documents into a single Pdf.
The documents could be of varying types, such as Word, Open Office, Images, Text, Web pages (by URL) and the PDF would usually consist of 2-3 documents.
At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed onto the Server. This handles most documents but we haven't had it doing Open Office ones yet.
We currently produce around 100-1000 of these PDF's per day.
The reason I am asking the question is that performance is a key issue. The PDF is generated for users on the fly and so the waiting times we are currently getting of 30-60 seconds is becoming unacceptable.
We have done some caching around documents when they are intially uploaded so the main tasks that happens when a User requests a Pdf is merging a number of already generated Pdf's.
Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!
Edit:
Thanks for all the great advice, I'll look into some of these and compare performance.
Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.

Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page) - you could compare your merge performance with something like iTextSharp (.NET version of iText) to be sure it isn't a bottleneck - otherwise the conversion from other formats to PDF is likely the bottleneck.
In almost all cases, the method used to convert X to PDF is to execute the applications print command, targeted at a software PDF printer, to create a temporary PDF file.
This means:
The target application (for example Office) is opened and closed
The document has to travel through the printing service
In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.

We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.

I had a very similar issue where we had documents that were already existing in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.
O2 Solutions - PDF4NET
Here is the code sample that they provide for merging. The top line looks like it just outputs the file, the second 2 lines allow for streaming the content back to the user.
PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
PDFDocument doc = PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
doc.SaveToStream( stream );

You say you're using Microsoft Office to open these files, I would imagine this is the bottleneck rather than the actual PDF creation.
Is it possible to distill these documents into a more accessible format (html/xml/database), so that it's not necessary to open office every time a PDF needs to be created?

While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.

Related

Convert docx to postscript

I need to convert a Word document (docx) to a postscript file so that I can use this postscript file to generate PDF using the Ghostscript command line tool.
How do I generate the postscript file from the docx?
I need to code using .NET/C#. I found about LaTeX which generates postscript but how do I make my Word file be used with LaTeX or any other tool to get the postscript generated?
There are three main products I will mention that understand DOCX.
The obvious one is MS Word. It produces the definitive rendering of all DOCX files. Nothing is ever going to be exactly the same. By definition it is always correct. However it is not really designed for automated conversion and getting it to do this kind of thing is fraught with difficulty. On a legal level the EULA may confict with your chosen solution.
OpenOffice.org is a great product. The EULA is much more accomodating. The freeness is attractive. However, while it will produce a pretty good output for most DOCX documents it does not for all. While it is similar to MS Word it is not the same and this is something you may notice, particularly for more complex documents. Probably more importantly, again it's not designed for automated conversions and trying to get it to do this can be fraught and tiresome.
WordGlue .NET (on which I work) is a native .NET library that understands DOCX. It is designed specifically to produce output which is the same as MS Word. While I'm not going to say it is perfect (it's a big task) it is superior to OpenOffice.org in that it does actually attempt this as a specfic design decision. However probably the biggest advantage is that it is designed for high perfomance multi-threaded server side conversion. It's native .NET and thus low impact in terms of security.
Products like ABCpdf (on which I work) will integrate with these three applicatons to allow conversion direct to PDF. Why bother going via PostScript if you want PDF? However if you really want to save as PostScript you can do that too.
Or indeed you can write your own code to integrate with these products. Just be aware of the caveats above regarding fraughtness and tiresomeness relating to MS Office and OpenOffice.org. To get these things working unattended requires an awful lot of attention.
You need to print it to a PostScript file, from an application which can read .docx files. Or you could just export direct to PDf from the app, as far as I know anything which reads .docx and can print, can also write a PDF file.
If you have a windows computer you can use the commandline
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "printerThatDumpsPS"
You can find file printers for postscript printing for free on the internet. Or if you have adobe pfdf, pdf exchange or any PS printer. You can use c# to temporarily set the printers settings so that it does this for you.
So for example using pdf exchange as follows,
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "PDF-XChange Printer 2012"
Produces a pdf file without much of a trace anywhere what program was used, assuming pdf exchange was set to save file without asking.
This produces a passable document but yeah it looses quiet many features. But it might be enough.

Wanted: ASP.NET control to view/print PDF, TIFF, possibly more?

I'm looking for an asp.NET control that will allow for viewing and printing of a pdf and TIFF within a web form. I'm willing to use more than 1 control if needed (1 control for pdf, 1 for Tiff, show and hide based on file extension), but I have not been able to find a good Tiff viewer.
Files are stored on our LAN in a shared folder, and this application is an intranet site.
Open source / free licensing preferred, but I'm willing to look at paid options as well.
http://www.alternatiff.com/ is one of the viewers that I've seen used for this type of viewing of tiffs.
You can get a free licence of ABCPDF (provided you link back to their site) which will do the conversion from TIFF to PDF for you as per #Chris Lively 's suggestion.
It'll also do conversion from PDF to TIFF if you decide to do things backwards.
It makes sense to present the content in a common format. If you wanted to you can embed the PDF in the browser to create the 'seamless' experience you're looking for using something like PDFObject.
As #BenCr says though, PDF is a really common format and the tools already exist to open and work with them, so introducing new ways to perform existing tasks could actually end up complicating matters unnecessarily.
I'm in total agreement with #BenCr on this.
Viewing PDFs is an extremely common thing to do. This isn't a "technical" issue by any stretch.
It sounds like you have some type of faxing solution in place that is creating these documents. Most likely multi-page TIFF and PDFs.
If this is the case you might want to just convert the TIFFs to PDFs to begin with and run everything through Adobe's pdf reader. Every online fax solution does this.
You could try http://issuu.com/ and they appear to have a API too if you want to go that deep.
We used the the Seadragon control to do this. I think it was an overkill and we should have just rolled our own -- would have been cheaper than integrating it. TIFFs and PDFs are converted to PNG on the server side. I don't think you can do better than that, especially with PDFs (assuming you don't want to use Acrobat Reader to display them). Convert PDFs to PNG using Xpdf/Poppler.
How about using Google Docs Viewer?
EDIT: Probably not working, since the viewer has to read the document from your URL; when it's on the Intranet, this won't work.
If you can mess about with mime types -- mainly by making the .tiff files expose an application/pdf mimetype -- you should be able to get acrobat to open TIFF files directly by effectively fooling the browser to open TIFF files with acrobat. Then all you need is a trusty old iframe to get you familiar UI with print buttons.

Give users the ability to edit Word 97-2010 documents within a .NET application

I was wondering if anyone has any idea of any product/method to give my end users the ability to edit Word documents within our C#/.NET application, avoiding the use of Automation and separate instances of Word opening outside of the application. This is a possibility [backup plan!] - but one that I'd rather not have to implement (due to the amount of work involved and having users exit our application).
I know that I could possibly use the WebBrowser control - but from what I've been able to find -- support for this is sketchy at best, and things such as toolbars are not present, and it does not appear to work with Word 2010 anyway.
I've been evaluating a few products that claim to do this but many are lacking in features or produce compatibility errors within documents rendering them useless when opened in Word.
We are using Word 2003 and Word 2010. Our documents start out as .DOCX files through our custom merge/templating processes.
Any suggestions for products or other ideas would be great.
Edit:
We're creating documents without issue using OpenXML. Fun stuff, works really well. However, at the end of the day I would prefer to have users editing the created documents as well as legacy documents (created as .DOC files) within our .NET application directly. Unforunately, with Microsoft removing the ability to embed via ActiveX/OLE, etc. there isn't a way to do this. What I am looking for is a 3rd party product to achieve this, which should be virtually 100% compatible with both the .DOC and .DOCX formats.
For those asking why ? Security, ease of use, etc. We are storing documents in a database. Once I start dropping files on the filesystem and working with Automation support/macros, ... there's a lot of things that would have to be done to get the files back into the database / update, etc. This is made especially difficult since Word doesn't expose the raw bytes[] of a document and files must be saved as temporary files somewhere on the fs. Just a lot of headaches.
So, the "easiest" solution - embed Word [seems not possible] or use a 3rd party product that supports editing .DOC/.DOCX files.
An example is DevExpress XtraRichEdit control - unfortunately, while it supports a lot of nice Word-like/compatible features it only works with .DOCX files.. and isn't 100% feature complete, compared to Word.
The file structure of a word document is huge, it could take hundreds of man hours to program even limited .doc/docx support. What exactly is the reason for using your program to edit a word file over word itself?
I am not exactly sure how Word 2003 has .docx support though, my understanding is there was only a word viewer release when Office 2007 was released, it of course has been years since thats been a problem.
If you are going to actually do this only add support for .doc files since there is more information out there, you can allow word itself to handle the converstion to a .docx file if you want.
You are not going to find a third party product that does this. The amount of effort required to build an app that 100% supports the Word formats is beyond consideration. Not just every feature, but every bug as well would have to be duplicated. Considering the potential legal pitfalls of doing such, no one in their right mind would bother trying. The legal aspects, incidentally, is one of the primary reasons for the new formats.
Which means you have to go external. There are two really good options here.
One would be to hook into Office Live to give them the ability to edit Microsoft Documents online.
Another possibility is to just leverage Sharepoint in your application. It has built in methods for document workflow and integrates nicely with Office.
A third possibility would be to write your own word add-in which would take care of saving / loading the documents from your system. I'd go with the first two above before going this route.
This used to be supported through a feature called OLE Embedding. Support for it has been disappearing from Microsoft software and tools over the past 10 years. Notably .NET has no support for it whatsoever. Office was one of the last hold-outs with 2007 already getting pretty cranky about it. But this indeed looks to be completely gonzo in the 2010 edition. All download links to the DSOFramer control, a generic ActiveX embedding control were removed around the time that 2010 went into beta.
There's no future here, look at VSTO for the road ahead.
Word Automation Services and Office Web Apps (requires SP 2010).
Certainly not 100% coverage of Word features, but have you tried ASPOSE.Words.NET Total or TXTextControl.NET?

Batch Printing PDFs from ASP.NET

I have a situation where in a web application a user may need a variable list of PDFs to be printed. That is, given a large list of PDFs, the user may choose an arbitrary subset of that list to print. These PDFs are stored on the file system. I need a method to allow users to print these batches of PDFs relatively easily (thus, asking the user to click each PDF and print is not an option) and without too much of a hit on performance.
A couple of options I've thought about:
1) I have a colleague who uses a PDF library that I could use to take the PDFs and combine them on the fly and then send that PDF to the user for printing. I don't know if this method will mess up any sort of page numbering. This may be an "ok" method but I worry about the performance hit of this.
2) I've thought about creating an ActiveX that I would pass the PDFs off to and let it invoke the printing features. My concern is that this is needlessly complex and may present some odd user interactions.
So, I'm looking for the best option to use in this scenario, which is probably not one of the ones I've gone through.
The best solution I have for you is number 1. There are plenty of libraries that will merge documents. From the one I've used the numbering should not be an issue since all the pages are all ready rendered.
If you go with ActiveX you are going to limit yourself to IE which might be acceptable. The only other idea would be to use a smart client so you can have more control...then you could serve up the PDF's via a web service.
I think concatenating the documents is the way to go.
For tools I recommend iText#. Its free
You can download here iTextSharp
iText# (iTextSharp) is a port of the iText open source java library for PDF generation written entirely in C# for the .NET platform. Use the iText mailing list to get support.
I agree with #1. You could do some tests to see what the performance hit would be like.

Reorder PDF Page Order

Is it possible to reorder an already generated PDF file programmatically, and using as little resources as possible, as this will need to be ran on ~8000 PDFs every month or so?
We are currently using iTextSharp to merge the PDF’s in to larger PDF’s, but iTextsharp’s Documentation does not really explain much.
I've used iTextSharp -- check out this code sample, it's what I used to write a (simpler) splitting utility.
I've used this on over 10,000 PDFs in a shot, and I can't remember the exact performance, but it was certainly acceptable for a batch job.
The Merger product from DynamicPDF will do this (http://www.dynamicpdf.com/). I cannot speak to what kind of performance you'll see with 8k documents, but i can say that it is one of the fastest PDF processing tools i have found.
There is a .Net version of both the Merger tool, and the Generator tool.

Categories

Resources