This question may have been answered already. I searched the web, but I don't really know what term to search for, so here it is.
As some of you know, some people provide fillable PDF files: they basically look like a web form, but in PDF format, with fields that can be filled in. One of our clients has such a form and would like to be able to fill in the fields from software that we wrote. Most of the fields are there and we can easily add the ones that are missing, but we have two main problems with this:
I can't find a way to actually write to those documents.
The document is provided by a government entity, which seems to update it with new changes several times a year. On top of that, they only accept the most recent version.
My question is: is this even possible, and if so, how do we do it? When I search the web, I find plenty of resources on how to create a PDF in C# or another language, but nothing on how to write into those PDF forms.
Thanks,
Related
I would like to know how I can edit an existing PDF document in C#. The document is already created and has fields like the ones in the image below:
I want to know whether there is code that can check the desired checkbox or enter text on the lines. Please let me know.
I looked at iTextSharp but I don't know if that tool can help me achieve that.
There are ways to do it, but it requires external tools. I use the ActivePDF library; it provides form-filling routines and works quite well.
You can do that with iTextSharp, BUT first you should find out more about the document.
If the PDF contains an actual AcroForm form definition, filling it is fairly easy. There are many examples in the documentation and on the iText web site.
If it does not contain such a form definition, though, and the check boxes and text fields merely are some lines drawn somewhere, it gets a bit more difficult: you have to measure where to put your entries.
Additionally, you should find out whether the document is signed or encrypted, which might limit what you are allowed to do with it.
I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd help to explain what this entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy them over to Dreamweaver, fix some formatting, and put the result onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now; perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs document.xml from it, and applies the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that was originally provided by MS for SharePoint servers so they could render Word documents in a browser, or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.
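For what it's worth, the unzip-and-read-document.xml step described above can also be done without any XSLT, which gives full control over the markup that comes out. Below is a minimal sketch in Python (standard library only), not a replacement for the MS stylesheet: the function names are made up for illustration, it only handles paragraphs with bold and italic runs, it treats any w:b or w:i element as "on" even if it is explicitly switched off, and it flattens table cells into plain paragraphs, so table support would still have to be added.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used throughout word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def run_to_html(run):
    """Render one w:r run, wrapping its text in <b>/<i> where flagged."""
    text = html.escape("".join(t.text or "" for t in run.iter(W + "t")))
    props = run.find(W + "rPr")
    if text and props is not None:
        # Simplification: presence of the element counts as "enabled".
        if props.find(W + "i") is not None:
            text = "<i>" + text + "</i>"
        if props.find(W + "b") is not None:
            text = "<b>" + text + "</b>"
    return text


def docx_to_html(source):
    """Return minimal HTML for the paragraphs of a .docx path or file object."""
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        body = "".join(run_to_html(r) for r in para.iter(W + "r"))
        if body:  # skip empty paragraphs
            lines.append("<p>" + body + "</p>")
    return "\n".join(lines)
```

Calling docx_to_html("report.docx") (the file name is just an example) returns one p element per non-empty paragraph, which is easy to drop between an existing page header and footer.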
This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
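The clean-up step (removing unwanted tags and attributes) does not depend on which converter produced the HTML. Here is a rough sketch of a whitelist filter in Python, using only the standard library rather than SgmlReader or HTML Agility Pack; the ALLOWED table and the names Cleaner and clean_html are assumptions for illustration, so adjust the whitelist to whatever formatting you actually need to keep.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags to keep, and the attributes allowed on each.
ALLOWED = {
    "p": (), "b": (), "i": (), "table": (), "tr": (),
    "td": ("colspan", "rowspan"), "th": ("colspan", "rowspan"),
}
# Elements whose text content should be dropped entirely.
SKIP = {"style", "script"}


class Cleaner(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in ALLOWED:
            kept = "".join(
                ' %s="%s"' % (name, escape(value or ""))
                for name, value in attrs
                if name in ALLOWED[tag]
            )
            self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP:
            if self.skipping:
                self.skipping -= 1
        elif tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def clean_html(markup):
    """Return markup with non-whitelisted tags and attributes removed."""
    parser = Cleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Tags that are not on the whitelist (span, font, div and so on) are dropped but their text is kept, which is usually what you want for Word-generated markup.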
Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be directly converted to HTML with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server version too.
There is a free trial version downloadable from the site, and if you have any questions do contact me any time.
We are planning on implementing a solution for comparing different revisions of a PDF document in our .NET Windows Forms application. In Adobe Acrobat there is a nice feature for comparing two documents, but I have not been able to find any information about whether it is possible to create a plug-in (or something else) that uses this feature from our application.
I would really appreciate it if any of you could point me in the right direction for building such a solution.
I have also looked at other threads here on Stack Overflow about comparing PDF documents, particularly these:
How to compare two PDF-files
PDF-libraries
I did not really find a good solution there for a library or SDK letting us create a good solution for comparing PDF-documents in a way which is easy to understand for users of the system.
Do you know any good solutions to solve this problem?
All help appreciated! :)
Do you know the structure of the PDF files, or do you want to compare them without knowing anything about them? If you know the files, you can read the values of specific fields and compare those values between files instead of comparing the entire PDF.
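To illustrate the field-by-field idea: once the field values of each revision have been read out by whatever PDF library you use, the comparison itself is tiny. This sketch (Python, standard library only) assumes the values are already available as plain dictionaries of field name to value; diff_fields is a made-up name, and a missing field and a field whose value is None are not distinguished.

```python
def diff_fields(old, new):
    """Compare two {field name: value} mappings and report what changed.

    Returns {field name: (old value, new value)} for every field that
    differs; a field missing from one side shows up as None on that side.
    """
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

The result can be shown to users as a short "field, old value, new value" list, which tends to be easier to read than a visual diff of two whole pages.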
Given an existing PDF document, I would like to tack on an index to the end of the file to show the pages on which key words show up. It would be best if I don't have to give a list of words to look for and the list of words is automatically generated. However, if a list of words must be given, I can work with it. I'm looking to do this either through a C# library or a command line tool. It needs to run as part of another command line app.
Is there anything out there that is capable of this?
This "PDF Index Everthing" (http://www.pdfstore.com/details.asp?ProdID=799) seems to be on the right track, but requires interaction through its GUI.
I don't actually have a C# solution, but hopefully this will still help...
PDFlib is an excellent PDF development library, one of the better ones available. As far as I know it doesn't have a C# binding. PDF is a random-access, object-based file format, and although many libraries can create PDFs, most freely available ones don't support adding pages to existing PDFs. PDFlib does support adding pages with its PDI option, so it may be worth checking out.
Updated Info:
Check out the iText# library and
merging pdf files with C# and iText
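Whichever library ends up extracting the text and appending the pages, the index itself is straightforward to build once you have the text of each page. A minimal sketch in Python (standard library only), assuming the page texts are already available as a list of strings, one per page; the stop-word list, the min_length cut-off, and the function names are all placeholders to tune.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; extend it for real documents.
STOP_WORDS = {"that", "with", "this", "from", "have"}


def build_index(pages, min_length=4):
    """Map each keyword to the sorted 1-based page numbers it appears on."""
    index = defaultdict(set)
    for number, text in enumerate(pages, start=1):
        for word in re.findall(r"[A-Za-z]+", text.lower()):
            if len(word) >= min_length and word not in STOP_WORDS:
                index[word].add(number)
    # Alphabetical order, with page numbers ascending.
    return {word: sorted(nums) for word, nums in sorted(index.items())}


def format_index(index):
    """Render the index as one 'word: page, page' line per keyword."""
    return "\n".join(
        "%s: %s" % (word, ", ".join(map(str, nums)))
        for word, nums in index.items()
    )
```

The formatted text can then be rendered onto new pages and appended to the original document with a library that supports adding pages to existing PDFs.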
I need to be able to convert and merge various documents into a single PDF.
The documents could be of varying types, such as Word, Open Office, Images, Text, Web pages (by URL) and the PDF would usually consist of 2-3 documents.
At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed onto the Server. This handles most documents but we haven't had it doing Open Office ones yet.
We currently produce around 100-1000 of these PDFs per day.
The reason I am asking is that performance is a key issue. The PDF is generated for users on the fly, so the waiting times we are currently getting of 30-60 seconds are becoming unacceptable.
We have done some caching around documents when they are initially uploaded, so the main task that happens when a user requests a PDF is merging a number of already generated PDFs.
Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!
Edit:
Thanks for all the great advice, I'll look into some of these and compare performance.
Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.
Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page) - you could compare your merge performance with something like iTextSharp (.NET version of iText) to be sure it isn't a bottleneck - otherwise the conversion from other formats to PDF is likely the bottleneck.
In almost all cases, the method used to convert X to PDF is to execute the application's print command, targeted at a software PDF printer, to create a temporary PDF file.
This means:
The target application (for example Office) is opened and closed
The document has to travel through the printing service
In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.
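The "convert when the file is added, not when the user asks" idea can be sketched as a small content-addressed cache. This is Python with the standard library only; PdfCache is a made-up name and convert is a hypothetical callable standing in for whatever actually does the conversion (easyPDF, Office automation, and so on).

```python
import hashlib
from pathlib import Path


class PdfCache:
    """Store one converted PDF per distinct source document."""

    def __init__(self, cache_dir, convert):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical callable: source bytes -> PDF bytes.
        self.convert = convert

    def _path_for(self, data):
        # Key on a hash of the content, so re-uploads of the same file are free.
        return self.cache_dir / (hashlib.sha256(data).hexdigest() + ".pdf")

    def add(self, data):
        """Convert at upload time, unless identical content was already converted."""
        target = self._path_for(data)
        if not target.exists():
            target.write_bytes(self.convert(data))
        return target
```

With this in place, a user request only has to merge the files returned by add(), so the slow conversion never sits on the request path.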
We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.
I had a very similar issue where we had documents that were already existing in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.
O2 Solutions - PDF4NET
Here is the code sample that they provide for merging. The first line looks like it just writes the merged file to disk; the next two lines allow for streaming the content back to the user.
PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
PDFDocument doc = PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
doc.SaveToStream( stream );
You say you're using Microsoft Office to open these files; I would imagine this is the bottleneck rather than the actual PDF creation.
Is it possible to distill these documents into a more accessible format (HTML/XML/database), so that it's not necessary to open Office every time a PDF needs to be created?
While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.
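One cheap way to see where the time goes before rewriting anything is to wrap each stage of the pipeline in a timer and compare the totals. A minimal sketch in Python (standard library only); the stage names are whatever you choose to pass in, and timed and slowest_stage are made-up helper names.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage name.
timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def slowest_stage():
    """Return the name of the stage with the largest accumulated time."""
    return max(timings, key=timings.get)
```

For example, wrapping the conversion call in "with timed('convert'):" and the merge call in "with timed('merge'):" for a day's worth of requests should show fairly quickly which of the two is worth optimising or distributing.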