I am using the Acrobat SDK to convert an image PDF to a searchable text PDF. Can anyone help me out? I am stuck: I need to check whether a file has already been OCRed or not.
You can try this sample to apply OCR to images in a PDF document.
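For the "is it already OCRed?" part, one rough heuristic is to pull out whatever text layer the PDF has and see whether there is a meaningful amount of it. The sketch below does not use the Acrobat SDK; it assumes the pdftotext command-line tool (from xpdf or poppler) is on the PATH, and the function names and the 20-character threshold are made up for illustration.

```python
import subprocess


def extract_text(pdf_path):
    """Return the PDF's text layer using the pdftotext CLI ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_searchable(text, min_chars=20):
    """Heuristic: enough letters/digits in the text layer suggests OCR was done.

    min_chars is an arbitrary threshold (an assumption); tune it for your files.
    """
    return sum(ch.isalnum() for ch in text) >= min_chars
```

Calling looks_searchable(extract_text("scan.pdf")) would then return False for a pure image PDF. Note this cannot tell OCR text from text that was in the PDF from the start; it only says a text layer exists.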
I want to perform OCR on PNG and PDF files. I was able to get the Tesseract 3.0.2 .NET wrapper working for PNG files, but I can't find any class in it for PDF files. Does it work for PDF files? If not, please let me know of any other open source library for scanning PDFs. My requirement is to scan diagrams in a PDF for specific circles and create hyperlinks for those circles.
No, it doesn't. You'll have to extract the images from the PDF first. This can be done using pdfimages (pdfimages.exe -j your.pdf) or gs, as suggested by Zakk Diaz.
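If you want to drive pdfimages from a script, something like the following could work. It is only a sketch: it assumes pdfimages is on the PATH, the function names are hypothetical, and it passes an output prefix (image root) after the PDF name, which the tool expects as its last argument.

```python
import subprocess


def pdfimages_command(pdf_path, image_root, as_jpeg=True):
    """Build the pdfimages argument list.

    -j writes JPEG-encoded (DCT) images as .jpg files; other images
    come out as PPM/PBM. image_root is the prefix for the output files.
    """
    cmd = ["pdfimages"]
    if as_jpeg:
        cmd.append("-j")
    cmd += [pdf_path, image_root]
    return cmd


def extract_images(pdf_path, image_root):
    """Run pdfimages; raises CalledProcessError if the tool fails."""
    subprocess.run(pdfimages_command(pdf_path, image_root), check=True)
```

The extracted files can then be passed to Tesseract one at a time, the same way you already handle PNG files.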
I need to create software that generates print previews of documents in the following formats: MS Office documents (.doc(x), .ppt(x), .xls(x)), images, .txt files, and PDF files. I have made a working prototype using XPS files. Basically, I convert the Office files to .xps using Office Automation and then render the .xps documents to images. For images and .txt files, I create XPS files by adding the text or image to a FlowDocument and then rendering it. However, I have found that there is no fast way to convert PDF to XPS (a 600-page document takes more than 2 minutes to convert, which is not acceptable). So I am stuck at this point. It seems that I should start over with a different file format. Should I rewrite my program around PDF, for example, or is there another way to accomplish my task? And if I should use PDF, could you please suggest a good C# PDF library that renders page previews as fast as possible? I tried WebSupergoo's ABCpdf, but it is too slow because it cannot render previews to System.Windows.Media.Imaging.BitmapSource, only to System.Drawing.Bitmap, so I have to convert each Bitmap to a BitmapSource, which takes a lot of time.
Thanks in advance.
Use Ghostscript to convert the PDF to images, though I don't know why you wouldn't just use the PDF. I have used Ghostscript for a number of PDF/image manipulation tasks.
http://www.wibit.net/blog/integrating_ghostscript_c
Ghostscript will render any PDF to images using the settings you specify. I think you can use it either as a DLL or as a command-line process.
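As a rough illustration of the command-line route, the sketch below builds a Ghostscript call that renders pages to PNG. The function name, the output pattern, and the default resolution are assumptions for the example; the executable is usually gs on Linux/macOS and gswin64c on 64-bit Windows.

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern="page-%03d.png", dpi=150,
                        first_page=None, last_page=None, executable="gs"):
    """Build a Ghostscript command that renders PDF pages to 24-bit PNG files."""
    cmd = [
        executable,
        "-dBATCH",            # exit when the input has been processed
        "-dNOPAUSE",          # do not prompt between pages
        "-dSAFER",            # restrict file access
        "-sDEVICE=png16m",    # 24-bit colour PNG output device
        f"-r{dpi}",           # resolution in dots per inch
    ]
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    # %03d in the pattern is replaced by the page number (001, 002, ...)
    cmd += [f"-sOutputFile={output_pattern}", pdf_path]
    return cmd


def render_pages(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(ghostscript_command(pdf_path, **options), check=True)
```

For previews, a low dpi (72 or so) together with first_page and last_page lets you render only the pages currently on screen instead of the whole 600-page document.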
I have a PDF document that contains links. I need to get the title of each link.
Please help me to solve this issue.
Thanks in advance.
Sow
You can use the iText PDF library to read the contents of the PDF file and get the text, the links, or their values from it. You can get the library from here!
I have PDF files that have been "recognized" using the OCR Text Recognition -> Recognize Text Using OCR functionality in Acrobat.
I would like to take these as an upload (C# ASP.NET MVC) and be able to extract this information for indexing and search purposes.
I have tried opening the PDF files and I don't find any of the recognized text so I'm guessing it's compressed and/or encoded.
Any ideas?
There is an article on CodeProject that explains how you can extract text from PDF using C#.
xpdf and poppler have pdftotext tools.
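To give an idea of how pdftotext output could feed a search index, here is a minimal sketch. It assumes pdftotext is on the PATH and relies on the tool ending each page with a form feed character; the function names and the very simple word-to-pages index are only for illustration, not a real search engine.

```python
import re
import subprocess
from collections import defaultdict


def split_pages(raw_text):
    """Split pdftotext output into pages; each page ends with a form feed."""
    pages = raw_text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def extract_pages(pdf_path):
    """Run pdftotext with output on stdout ("-") and return a list of page texts."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return split_pages(result.stdout.decode("utf-8", errors="replace"))


def build_index(pages):
    """Map each lowercased word to the set of 1-based page numbers containing it."""
    index = defaultdict(set)
    for number, page in enumerate(pages, start=1):
        for word in re.findall(r"\w+", page.lower()):
            index[word].add(number)
    return index
```

From an ASP.NET upload handler the same idea applies: save the upload to a temporary file, run pdftotext as an external process, and index the text it prints.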
I have a PDF file which contains just 1 Page. I have a barcode at the end of the page.
How do I extract the barcode number from the PDF in C#?
I have seen a post about converting a barcode image to Code 39, but how do we do it from a PDF? Please help.
barcode image to Code39 conversion in C#?
Thanks
Your best bet is to get a PDF library that can extract images from the pages. We use Aspose.PDF and Aspose.PDF.Kit, which are both excellent products.
http://www.aspose.com
This page shows a code example of how to extract an image from a PDF document:
http://www.aspose.com/documentation/.net-components/aspose.pdf.kit-for-.net/extract-image-from-pdf-document.html
They also have a barcode library, and one of the things it can do is read barcodes from multi-page TIFF images. You could convert the pages of your PDF to TIFF and then use the barcode library to read the barcodes:
http://www.aspose.com/documentation/.net-components/aspose.barcode-for-.net/how-to-read-barcode-from-multipage-tiff-images.html
You can use the iTextSharp library to extract the images from the PDF and then a barcode reading library such as IBScanner to extract the barcode(s) from the image.