Does Tesseract OCR for .NET work with PDF files?

I want to perform OCR on PNG and PDF files. I can get the Tesseract 3.0.2 .NET wrapper to work for PNG files, but I can't find any class in it for PDF files. Does it work with PDF files? If not, please suggest another open-source library for scanning PDFs. My requirement is to scan diagrams in a PDF for specific circles and create hyperlinks for those circles.

No, it doesn't. You'll have to extract the images from the PDF first. This can be done with pdfimages (pdfimages.exe -j your.pdf image-root, where image-root is the prefix for the output files) or with gs (Ghostscript), as suggested by Zakk Diaz.
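A minimal sketch of calling pdfimages from C# so that the extracted images can then be handed to the OCR wrapper. The class name PdfImageExtractor, its method names, and the paths in the usage note are hypothetical; only the -j option and the "PDF file, then output prefix" argument order come from pdfimages itself.

```csharp
using System;
using System.Diagnostics;

public static class PdfImageExtractor
{
    // Builds the pdfimages argument string.
    // -j saves images stored in DCT format as .jpg files; other images are
    // written in the tool's default PBM/PGM/PPM formats.
    public static string BuildArguments(string pdfPath, string outputPrefix)
    {
        return $"-j \"{pdfPath}\" \"{outputPrefix}\"";
    }

    // Runs pdfimages without showing a console window and returns its exit code.
    // pdfImagesExe is the path to the executable (an assumption of this sketch).
    public static int Extract(string pdfImagesExe, string pdfPath, string outputPrefix)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = pdfImagesExe,
            Arguments = BuildArguments(pdfPath, outputPrefix),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

For example, PdfImageExtractor.Extract(@"C:\tools\pdfimages.exe", "your.pdf", "page") should leave files named with the prefix page in the working directory, which can then be loaded by the Tesseract wrapper one by one.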

Related

Merge multiple files and convert them into a single PDF using .NET MVC

My application has an option to upload multiple attachments (PNG, JPG, DOC, or XLSX), and it works fine using .NET MVC.
Now I want to merge all these attachments into a single PDF. Is that possible using .NET MVC?

Yes, you can use PDFsharp.
PDFsharp is a .NET library for processing PDF files. You create PDF pages using drawing routines known from GDI+. Almost anything that can be done with GDI+ will also work with PDFsharp. Only basic text layout is supported by PDFsharp, and page breaks are not created automatically. The same drawing routines can be used for screen, PDF, or meta files.

Extracting the first page of multiple PDFs & saving them as Image

I have about 400 ebooks, all in PDF format, and my task is to extract the cover from every one of them (the first page of each PDF) and export the covers as separate image (PNG or JPEG) files.
So I will end up with 400 ebooks and 400 images of their covers.
I am on Windows.
Any advice is greatly appreciated.

Use Ghostscript to render TIFF or JPEG images from the PDF. You have fine-grained control over the result.
If this is a commercial application, you need a commercial license. If you use the application commercially but only inside your organisation, you are allowed to use the open-source (AGPL-licensed) version of Ghostscript.
The PDF support in many open-source packages relies on the gs PDF interpreter. ImageMagick, for example, requires the Ghostscript libraries.
Download GS here: http://ghostscript.com/download/gsdnld.html
Use the C# Process class to execute Ghostscript; there is an SO topic on this: How to run a C# console application with the console hidden.
The command line for TIFF will be:
D:\gs\gs9.20>bin\gswin64c.exe -sOutputFile=d:\some%02d.tiff -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -sCompression=lzw -r150 -sPageList=1 d:\PDFReference.pdf
This will create a single some01.tiff file on d:\ at 150 dpi resolution.
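The command above can be driven from C# for a whole folder of PDFs. This is only a sketch under assumptions: the class name CoverExtractor, its method names, and the folder layout are hypothetical, and it swaps the tiff24nc device for png16m because the question asked for PNG or JPEG output. The -dBATCH, -dNOPAUSE, -r, -sPageList and -sOutputFile switches are the ones used in the command line above.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class CoverExtractor
{
    // Builds Ghostscript arguments that render only page 1 of a PDF to a PNG.
    // Note: Ghostscript treats %d in the output name as a page-number pattern,
    // so output paths containing '%' may need special handling.
    public static string BuildArguments(string pdfPath, string outputPath, int dpi)
    {
        return "-dBATCH -dNOPAUSE -sDEVICE=png16m -r" + dpi + " -sPageList=1 "
             + $"-sOutputFile=\"{outputPath}\" \"{pdfPath}\"";
    }

    // Renders the first page of every PDF in pdfFolder into outputFolder,
    // one PNG per PDF, named after the source file.
    public static void ExtractCovers(string ghostscriptExe, string pdfFolder,
                                     string outputFolder, int dpi = 150)
    {
        Directory.CreateDirectory(outputFolder);

        foreach (string pdfPath in Directory.GetFiles(pdfFolder, "*.pdf"))
        {
            string outputPath = Path.Combine(
                outputFolder, Path.GetFileNameWithoutExtension(pdfPath) + ".png");

            var startInfo = new ProcessStartInfo
            {
                FileName = ghostscriptExe,
                Arguments = BuildArguments(pdfPath, outputPath, dpi),
                UseShellExecute = false,
                CreateNoWindow = true   // keep the console hidden
            };

            using (var process = Process.Start(startInfo))
            {
                process.WaitForExit();
                if (process.ExitCode != 0)
                {
                    Console.Error.WriteLine("Ghostscript failed for " + pdfPath);
                }
            }
        }
    }
}
```

A call such as CoverExtractor.ExtractCovers(@"D:\gs\gs9.20\bin\gswin64c.exe", @"D:\ebooks", @"D:\covers") would then process all files in one run; the two folder paths are placeholders.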

The following thread is suitable for your request: converting pdf file to an jpeg image
One solution is to use a third-party library. ImageMagick is a very popular, freely available tool. You can get a .NET wrapper for it here. The original ImageMagick download page is here.
http://www.codeproject.com/KB/library/pdftoimages.aspx Convert PDF pages to image files using the Solid Framework
http://www.print-driver.com/howto/convert_pdf_to_jpeg.html Universal Document Converter
http://www.makeuseof.com/tag/6-ways-to-convert-a-pdf-file-to-a-jpg-image/ 6 Ways To Convert A PDF To A JPG Image
You can also take a look at this thread: how to open a page from a pdf file in pictureBox in C#
If you use this process to convert a PDF to TIFF, you can use this class to retrieve the bitmap from the TIFF.

Creating print previews of documents

I need to create software that generates print previews of documents in the following formats: MS Office documents (.doc(x), .ppt(x), .xls(x)), images, .txt files, and PDF files. I have a working prototype based on XPS files: I convert the Office files to .xps using Office Automation and then render the .xps documents to images. For images and .txt files, I create XPS files by adding the text or image to a FlowDocument and then rendering it.
However, I have found no fast way to convert PDF to XPS (a 600-page document takes more than 2 minutes to convert, which is not acceptable), so I am stuck. It seems I should start over with a different file format. Should I rewrite my program around PDF, for example, or is there another way to accomplish my task?
If I should use PDF, could you please suggest a good C# PDF library that renders page previews as fast as possible? I tried WebSupergoo's ABCpdf, but it is too slow for me because it cannot render previews to System.Windows.Media.Imaging.BitmapSource, only to System.Drawing.Bitmap, so I have to convert each Bitmap to a BitmapSource, which takes a lot of time.
Thanks in advance.

Use Ghostscript to convert PDF to images, though I don't know why you wouldn't just use the PDF. I have used Ghostscript for a number of PDF/image manipulation tasks.
http://www.wibit.net/blog/integrating_ghostscript_c
Ghostscript will render any PDF to images using the settings you specify. I think you can use it either as a DLL or as a command-line process.

OCR enabled pdf through C#

I am using the Acrobat SDK to convert an image PDF to a searchable text PDF. Can anyone help me out? I am stuck: I need to check whether a file has already been OCRed or not.

You can try this sample to apply OCR to images in a PDF document.

c# converting PDF to Tif

I was using Ghostscript with a C# class wrapper to convert PDFs to TIF, and then using the tessnet2 OCR engine to read the content of the image file, but the TIF images are pretty much unreadable: the image is faded and doesn't look right, and the OCR engine fails to read anything. Is there any open-source library, or one that will cost me a few bucks, that can convert PDFs to TIF in good quality? Or is there any open-source OCR engine that reads PDFs, since tessnet2 cannot?

As DaNet said, I'm not sure whether there is an open-source DLL or a free way to do that. We use a third-party toolkit named LEADTOOLS that gives us very good results when OCRing PDF documents. You can use it to preprocess the image (i.e. binarize it, remove unwanted dots, convert it to 1-bit black & white, save it as a TIF image, etc.) and then pass it to their OCR engine.
I know that they have an online demo you can try. Here is the link for the demo:
http://demo.leadtools.com/OnlineRecognitionDemo
If the results match your requirements, you can check this tutorial:
Scanning to Searchable PDF

I'm not sure about an open-source OCR engine, but if you play with the output resolution of the Ghostscript-generated TIFF you shouldn't have a problem.
Try adding -r150 to the "string args" of the Ghostscript wrapper to change the resolution, and hopefully you still get a reasonably sized file!

I had to change the properties of the imageMagicNET class, setting the output format to png16m and raising the DPI, so that the generated images are high quality and readable for the OCR engine.
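To get a feel for how the resolution setting trades readability against file size, here is a small worked calculation. It is a sketch only: the class name RenderSize and its methods are hypothetical, it relies on the PDF convention of 72 points per inch, and the pixel counts are approximate because the renderer may round differently. The byte figure is the uncompressed size of a 24-bit image, so compressed output files will be smaller.

```csharp
using System;

public static class RenderSize
{
    // Converts a page dimension in PDF points (72 per inch) to pixels at the given dpi.
    public static int Pixels(double points, int dpi)
    {
        return (int)Math.Round(points / 72.0 * dpi);
    }

    // Approximate uncompressed size of the rendered page in bytes
    // (3 bytes per pixel for 24-bit colour by default).
    public static long UncompressedBytes(double widthPoints, double heightPoints,
                                         int dpi, int bytesPerPixel = 3)
    {
        return (long)Pixels(widthPoints, dpi) * Pixels(heightPoints, dpi) * bytesPerPixel;
    }
}
```

For a US Letter page (612 x 792 points), 150 dpi gives about 1275 x 1650 pixels and roughly 6.3 MB uncompressed, while 300 dpi gives about 2550 x 3300 pixels and roughly 25 MB. Doubling the resolution quadruples the pixel count, which is worth keeping in mind when raising the DPI for the OCR engine.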
