Read the content of the currently opened page of a PDF file - C#

I'm trying to read data and attributes from a PDF file that is currently open on screen.
Is there a way of attaching to a running Acrobat Reader and manipulating data from it?

Attaching to another process means that you will have to handle a lot of inter-process communication (IPC). Apart from that, you don't know what Acrobat Reader looks like inside, so you cannot simply ask it to deliver you some bytes.
Instead, you should use one of the many libraries that open, display and read PDF files, such as iTextSharp. I am certain that these will serve your purposes well.
There are many more libraries available; you should also have a look at PDFSharp.

I have never done it myself, but from a quick look around I found that Acrobat Reader (assuming that is what you are talking about) has an API which, judging by its documentation, includes an IPC module. That will be the closest to what you are asking for.

Related

Does the MuPDF library have Unicode or text search functionality?

Background
I am working on a WPF Windows application and I want to add an embedded PDF viewer with only basic functionality: PDF viewing, text search and page navigation.
I tried embedding Internet Explorer with the installed Adobe PDF Reader, but this method does not meet our requirements: Adobe PDF Reader has too many external links, which cannot be allowed for security reasons in our application.
Therefore, I am trying to use the moonpdf library. It works fine for our requirements; the only problem is that it has no text search functionality. (I think it shows the PDF as images.)
I then downloaded the moonpdf source code and realized that moonpdf wraps libmupdf.dll for C#.
I can modify the moonpdf source code and the mupdf source code for our requirements if needed.
My Question
Is there any text search functionality in mupdf? If so, how can I use it?
In the basic mupdf library, there are several functions for searching for text. They search a page for a text string, in a few different variants, and return the areas of all hits for the given text. You need to iterate over the pages yourself (in order to do a forward or reverse search).
fz_quad hits[1000];
int count = fz_search_page(ctx, page, needle, hits, nelem(hits));
That said, I do not know how, or even if, "moonpdf" has wrapped these functions.
You can certainly extract the text from a document; the MuPDF library will do that. I believe it's up to you to apply your own search criteria after that. I'm afraid I'm not expert enough to answer the 'how to' part of it, but I imagine one of the mutool examples would be helpful here. I'll see if I can get one of the developers to answer.

Wanted: ASP.NET control to view/print PDF, TIFF, possibly more?

I'm looking for an ASP.NET control that will allow viewing and printing of PDF and TIFF files within a web form. I'm willing to use more than one control if needed (one for PDF, one for TIFF, shown and hidden based on file extension), but I have not been able to find a good TIFF viewer.
Files are stored on our LAN in a shared folder, and this application is an intranet site.
Open source / free licensing preferred, but I'm willing to look at paid options as well.
http://www.alternatiff.com/ is one of the viewers that I've seen used for this type of TIFF viewing.
You can get a free licence of ABCPDF (provided you link back to their site), which will do the conversion from TIFF to PDF for you as per @Chris Lively's suggestion.
It'll also do conversion from PDF to TIFF if you decide to do things backwards.
It makes sense to present the content in a common format. If you want to, you can embed the PDF in the browser to create the 'seamless' experience you're looking for, using something like PDFObject.
As @BenCr says though, PDF is a really common format and the tools already exist to open and work with it, so introducing new ways to perform existing tasks could actually end up complicating matters unnecessarily.
I'm in total agreement with @BenCr on this.
Viewing PDFs is an extremely common thing to do. This isn't a "technical" issue by any stretch.
It sounds like you have some type of faxing solution in place that is creating these documents, most likely multi-page TIFFs and PDFs.
If this is the case, you might want to just convert the TIFFs to PDFs to begin with and run everything through Adobe's PDF reader. Every online fax solution does this.
You could try http://issuu.com/; they appear to have an API too if you want to go that deep.
We used the Seadragon control to do this. I think it was overkill and we should have just rolled our own -- it would have been cheaper than integrating it. TIFFs and PDFs are converted to PNG on the server side. I don't think you can do better than that, especially with PDFs (assuming you don't want to use Acrobat Reader to display them). Convert PDFs to PNG using Xpdf/Poppler.
How about using Google Docs Viewer?
EDIT: This probably won't work, since the viewer has to read the document from your URL, and it cannot reach a document that is only on the intranet.
If you can mess about with MIME types -- mainly by making the .tiff files expose an application/pdf MIME type -- you should be able to get Acrobat to open TIFF files directly, by effectively fooling the browser into opening TIFF files with Acrobat. Then all you need is a trusty old iframe to get a familiar UI with print buttons.

C# PDF Control & Library

I'm looking for a way to display a PDF (similar to a picture box), in a Windows Form. After that I need to be able to create a PDF. What's the best library for the job for creating the PDF (from simple text)? I've taken a look at several and I'm not sure which one is the best. Preferably open source. As for the control, I tried the COM object Adobe provides... I can't seem to get it working. At all. I've tried loading several files, there are no errors. It simply fails to load.
PDF Sharp, Sharp PDF and iTextSharp are excellent. They are all open source.
To answer your question about getting the PDF to render, you could use a WebBrowser Control on your form as long as the client workstation has Adobe Reader installed. The browser will automatically pick up the MIME type and load the in-browser Adobe Reader.
For rendering, I echo Will Marcouiller and SLaks. We have had good success with PDFSharp.
For creating PDFs, iTextSharp is very good, and it's free too.
I worked with SharpPDF and it did a great job. And it's open source.

Start process from stream

I have a memory stream that contains a PDF file.
Is it possible to view the PDF without saving it to the hard disk? Process.Start() only takes a path and not a stream.
Thank you
Only by implementing your own pseudo-file system in C#, somehow mounting this as a disk in Windows, and having it intercept the file open and stream the contents of your MemoryStream. Absolutely 100% certainly not worth the effort.
You can create a RAM drive and write the stream to it; this way you are still keeping it all in RAM (assuming the disk operations are what worries you).
Sure, this is certainly possible. Just not via Process.Start and Adobe Reader (I assume you are invoking Adobe or something similar).
If you are using .NET or Java, you simply need to find a PDF viewer component. There are lots to choose from and Google will give you plenty of links; Gnostice has a good one, but it's expensive. Once you find a suitable control, view the PDF directly from your app.
If there is a way, Process.Start won't be it, but I'd risk guessing that there isn't.
Unless there's a specific PDF API that allows that somehow (I doubt) I'd save it to disk.
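For what it's worth, the "save it to disk" route can be kept fairly small. The following is only a minimal sketch, assuming a throwaway file in the user's temp folder is acceptable and that some application is registered for .pdf files. The class and method names (TempPdfLauncher, WriteTempPdf, OpenWithDefaultViewer) are made up for illustration; the calls inside them are plain .NET.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of any library.
static class TempPdfLauncher
{
    // Copies the stream's bytes to a uniquely named .pdf file in the
    // temp folder and returns the full path of that file.
    public static string WriteTempPdf(MemoryStream pdfStream)
    {
        string path = Path.Combine(
            Path.GetTempPath(),
            Guid.NewGuid().ToString("N") + ".pdf");
        File.WriteAllBytes(path, pdfStream.ToArray());
        return path;
    }

    // Asks the shell to open the file with whatever application is
    // registered for .pdf. The result can be null if no new process
    // was started (for example, when an existing viewer is reused).
    public static Process OpenWithDefaultViewer(string path)
    {
        var info = new ProcessStartInfo(path) { UseShellExecute = true };
        return Process.Start(info);
    }
}
```

Usage would be along the lines of calling TempPdfLauncher.WriteTempPdf(stream) and passing the returned path to TempPdfLauncher.OpenWithDefaultViewer. The temp file is not cleaned up here; the viewer may keep it open, so deleting it is probably best left until your application exits.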

ASP.Net Converting and Merging documents into single PDF

I need to have the ability to convert and merge various documents into a single PDF.
The documents could be of varying types, such as Word, Open Office, images, text and web pages (by URL), and the PDF would usually consist of 2-3 documents.
At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed on the server. This handles most documents, but we haven't had it doing Open Office ones yet.
We currently produce around 100-1000 of these PDFs per day.
The reason I am asking the question is that performance is a key issue. The PDF is generated for users on the fly, so the waiting times we are currently getting, 30-60 seconds, are becoming unacceptable.
We have done some caching around documents when they are initially uploaded, so the main task that happens when a user requests a PDF is merging a number of already generated PDFs.
Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!
Edit:
Thanks for all the great advice, I'll look into some of these and compare performance.
Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.
Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page). You could compare your merge performance with something like iTextSharp (the .NET version of iText) to be sure it isn't a bottleneck; otherwise, the conversion from other formats to PDF is likely the bottleneck.
In almost all cases, the method used to convert X to PDF is to execute the application's print command, targeted at a software PDF printer, to create a temporary PDF file.
This means:
The target application (for example Office) is opened and closed
The document has to travel through the printing service
In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.
We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.
I had a very similar issue where we had documents that were already existing in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.
O2 Solutions - PDF4NET
Here is the code sample that they provide for merging. The first line looks like it just writes the merged file to disk; the next two lines allow for streaming the content back to the user.
PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
PDFDocument doc = PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
doc.SaveToStream( stream );
You say you're using Microsoft Office to open these files. I would imagine this is the bottleneck rather than the actual PDF creation.
Is it possible to distill these documents into a more accessible format (HTML/XML/database), so that it's not necessary to start Office every time a PDF needs to be created?
While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.
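To find out where the time goes, a rough first step could be to time each stage separately. This is just a sketch using System.Diagnostics.Stopwatch: StageTimer, Measure and Slowest are hypothetical names, and the stage labels you pass in (say "convert" and "merge") would wrap whatever calls your own pipeline makes.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper for coarse, wall-clock timing of pipeline stages.
static class StageTimer
{
    // Runs the stage once and returns how long it took.
    public static TimeSpan Measure(Action stage)
    {
        var watch = Stopwatch.StartNew();
        stage();
        watch.Stop();
        return watch.Elapsed;
    }

    // Runs every named stage in order, prints each timing, and returns
    // the name of the slowest stage (null if no stages were given).
    public static string Slowest(IEnumerable<KeyValuePair<string, Action>> stages)
    {
        string slowestName = null;
        TimeSpan slowestTime = TimeSpan.Zero;
        foreach (var stage in stages)
        {
            TimeSpan elapsed = Measure(stage.Value);
            Console.WriteLine("{0}: {1:F0} ms", stage.Key, elapsed.TotalMilliseconds);
            if (slowestName == null || elapsed > slowestTime)
            {
                slowestName = stage.Key;
                slowestTime = elapsed;
            }
        }
        return slowestName;
    }
}
```

A single run like this only gives wall-clock numbers, so it won't tell you whether a stage is CPU-bound or waiting on Office or the print spooler, but it should at least show whether conversion or merging dominates before you invest in distributing the work.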
