I was using Ghostscript with a C# class wrapper to convert PDFs to TIF, and then using the tessnet2 OCR engine to read the content of the image files. However, the TIF images are pretty much unreadable: they look faded and wrong, and the OCR engine fails to read anything. Is there an open-source library, or one that costs only a few bucks, that can convert PDFs to TIF in good quality? Or is there an open-source OCR engine that can read PDFs directly? tessnet2 cannot.
As DaNet said, I'm not sure there is an open-source DLL or a free way to do that. We use a third-party toolkit named LEADTOOLS that gives us very good results when running OCR on PDF documents. You can use it to preprocess the image (e.g. binarize it, remove unwanted dots, convert it to 1-bit black and white, save it as a TIF image, etc.) and then pass it to their OCR engine.
I know that they have an online demo, you can try it. Here is the link for the demo:
http://demo.leadtools.com/OnlineRecognitionDemo
If the results match your requirements, you can check this tutorial:
Scanning to Searchable PDF
I'm not sure about an open-source OCR engine, but if you play with the output resolution of the Ghostscript-generated TIFF you shouldn't have a problem.
I tried adding -r150 to the "string args" of the Ghostscript wrapper to change the resolution, hoping for a file of a reasonable size in megabytes!
I had to change the properties of the imageMagicNET class, setting the output format to png16m and adjusting the DPI, so that the generated images are high quality and readable by the OCR engine.
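The resolution and output-format advice above could be wired into a wrapper's argument string roughly like this. This is only a minimal sketch, assuming the wrapper accepts a plain Ghostscript argument string. The class name GhostscriptOcrArgs and its Build method are hypothetical, and the 300 dpi default is a common starting point for OCR rather than a value from this thread. The switches themselves (-sDEVICE=png16m, -r, -dTextAlphaBits, -dGraphicsAlphaBits) are standard Ghostscript options.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: builds Ghostscript arguments for OCR-friendly page images.
public static class GhostscriptOcrArgs
{
    // Renders every page of inputPdf to a 24-bit PNG (png16m) at the given dpi.
    // outputPattern may contain a page counter such as %03d.
    public static string Build(string inputPdf, string outputPattern, int dpi = 300)
    {
        if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));

        return string.Join(" ", new[]
        {
            "-dBATCH", "-dNOPAUSE", "-dQUIET",
            "-sDEVICE=png16m",
            "-r" + dpi.ToString(CultureInfo.InvariantCulture),
            // Anti-aliasing for text and line art.
            "-dTextAlphaBits=4", "-dGraphicsAlphaBits=4",
            Quote("-sOutputFile=" + outputPattern),
            Quote(inputPdf)
        });
    }

    // Wraps an argument in double quotes so paths with spaces survive.
    private static string Quote(string value)
    {
        return "\"" + value + "\"";
    }
}
```

Calling GhostscriptOcrArgs.Build(@"d:\in.pdf", @"d:\page%03d.png") gives a string that can be passed to the wrapper or to gswin64c.exe. If the images are still faded, raising the dpi is the first thing to try.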
Project Information : .NETCore v3.1 - IText7 v7.1.11
We have a PDF document builder library based on the IText7 NuGet package, which we use to dynamically build large PDF files for our customers.
Normally we work with .jpg images, but a new feature has been added to our online side: we have started to work with interactive SVG files as well, with hotspots integrated into our UI.
So, when our application builds a PDF document, it also needs to import those SVG files. We can do that with:
var image = SvgConverter.ConvertToImage(
new FileStream(imagePath, FileMode.Open, FileAccess.Read), pdfDocument);
The original file (I can't upload an SVG, so I am uploading it as a jpg; this doesn't matter, because I just want to show the line thickness):
The output is shown below. As you can see on the left side, the lines become very thick and some parts are not rendered as expected.
Steps:
SvgConverter.DrawOnCanvas(svgStream, pdfCanvas);
SvgConverter.DrawOnDocument(svgStream, pdfDocument);
SvgConverter.DrawOnPage(svgStream, pdfPage);
SvgConverter.ConvertToXObject(svgStream, pdfDocument);
SvgConverter.ConvertToImage(svgStream, pdfDocument);
I have tried them all, but the results are the same for each.
Questions:
PDF and SVG are both vector formats, so can't we embed the SVG directly through IText? Why would we need to convert it to a raster image? Why do we need a converter at all?
Is there a way to decrease the line thickness, or a way to avoid losing image quality?
Thank you for your time!
When you invoke SvgConverter.ConvertToImage or SvgConverter.ConvertToXObject, your SVG is not converted into a raster image - it still remains a vector image. So you can use the integrated SVG converter workflow and you are in fact using it with the SvgConverter. The converter is needed to process SVG file format into more PDF-specific structures, so it performs some conversion because PDF does not support SVG directly. This is not vector -> raster conversion though.
Regarding the problem with the line thickness, the first thing you should do is try the latest version: as far as I can see you are using 7.1.11, which is around a year old, and 7.1.15 is already out. If the problem persists, then it's a bug in iText's SVG support. You can try to minimize the SVG file to see whether there is a workaround that gives a proper conversion until the bug is fixed for your case, and/or report the problem to iText (StackOverflow is not the right place to report bugs).
I want to perform OCR on png and pdf files. I was able to get the Tesseract 3.0.2 .NET wrapper to work for png files, but I can't find any class in it for PDF files. So, does it work for PDF files? If not, please let me know of any other open-source library for scanning PDFs. My requirement is to scan diagrams in a PDF for specific circles and create hyperlinks for those circles.
No, it doesn't. You'll have to extract the images from the PDF first. This can be done using pdfimages (pdfimages.exe -j your.pdf out, where out is a placeholder for the prefix of the extracted image files) or gs, as suggested by Zakk Diaz.
I have about 400 ebooks, all in PDF format, and my task is to extract the cover from every one of them (which is the first page of every PDF) and export them all as separate image files (PNG or JPEG).
So I will end up with 400 ebooks and 400 images of their covers.
I am on Windows.
Any advice greatly appreciated.
Use Ghostscript to render TIFF or JPG from the PDF. You have fine-grained control over the result.
If this is a commercial application, you need a commercial license. If you use the application commercially, but inside your organisation, you are allowed to use the GPLed version of ghostscript.
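For the 400-ebook case, the Ghostscript approach could be looped over a folder along these lines. This is a rough sketch under stated assumptions: the class name CoverExtractor and its methods are hypothetical, the path to gswin64c.exe has to be supplied by the caller, and PNG at 150 dpi is an arbitrary choice. -dFirstPage and -dLastPage are standard Ghostscript switches that limit rendering to a page range.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper: renders page 1 of every PDF in a folder to a PNG.
public static class CoverExtractor
{
    // Builds the Ghostscript arguments for a single PDF.
    // The cover is written to outputDir as <pdf name>.png.
    public static string BuildArgs(string pdfPath, string outputDir)
    {
        string cover = Path.Combine(
            outputDir, Path.GetFileNameWithoutExtension(pdfPath) + ".png");

        return "-dBATCH -dNOPAUSE -dQUIET -sDEVICE=png16m -r150 " +
               "-dFirstPage=1 -dLastPage=1 " +
               "\"-sOutputFile=" + cover + "\" \"" + pdfPath + "\"";
    }

    // Runs Ghostscript once per PDF found directly in pdfDir.
    // gsExe is the full path to the Ghostscript console executable.
    public static void ExtractAll(string gsExe, string pdfDir, string outputDir)
    {
        Directory.CreateDirectory(outputDir);

        foreach (string pdf in Directory.GetFiles(pdfDir, "*.pdf"))
        {
            var startInfo = new ProcessStartInfo(gsExe, BuildArgs(pdf, outputDir))
            {
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using (Process process = Process.Start(startInfo))
            {
                process.WaitForExit();
                if (process.ExitCode != 0)
                    Console.Error.WriteLine("Ghostscript failed for " + pdf);
            }
        }
    }
}
```

One caveat: Ghostscript treats % in the output file name as a page-counter format, so a PDF whose name contains % would need its output name sanitized first.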
Ghostscript can be found here. The PDF support in many open-source packages relies on the gs PDF interpreter. ImageMagick, for example, requires the Ghostscript libraries.
Download GS here: http://ghostscript.com/download/gsdnld.html
Use the C# Process class to execute Ghostscript. There is an SO topic on this here: How to run a C# console application with the console hidden
The commandline for tiff will be:
D:\gs\gs9.20>bin\gswin64c.exe -sOutputFile=d:\some%02d.tiff -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -sCompression=lzw -r150 -sPageList=1 d:\PDFReference.pdf
This will create a single some01.tiff file on d:\ at 150 dpi resolution.
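Running that command line from C# with the console hidden might look like the following. Treat it as a sketch rather than a definitive implementation: the class name GhostscriptRunner and its methods are hypothetical, while ProcessStartInfo.CreateNoWindow, UseShellExecute and RedirectStandardError are the regular System.Diagnostics properties.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: launches Ghostscript without showing a console window.
public static class GhostscriptRunner
{
    // Prepares the start info; kept separate so it can be inspected or reused.
    public static ProcessStartInfo CreateStartInfo(string gsExe, string arguments)
    {
        return new ProcessStartInfo
        {
            FileName = gsExe,
            Arguments = arguments,
            UseShellExecute = false,      // needed for stream redirection
            CreateNoWindow = true,        // no console window pops up
            RedirectStandardError = true  // lets the caller read error output
        };
    }

    // Runs Ghostscript, waits for it to finish and returns its exit code.
    // errors receives whatever the process wrote to standard error.
    public static int Run(string gsExe, string arguments, out string errors)
    {
        using (Process process = Process.Start(CreateStartInfo(gsExe, arguments)))
        {
            errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

With the install from the command above, gsExe would be D:\gs\gs9.20\bin\gswin64c.exe and arguments would be everything after the executable name. A non-zero exit code means the conversion failed.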
The following thread is suitable for your request. converting pdf file to an jpeg image
One solution is to use a third-party library. ImageMagick is a very popular, freely available tool. You can get a .NET wrapper for it here. The original ImageMagick download page is here.
http://www.codeproject.com/KB/library/pdftoimages.aspx Convert PDF pages to image files using the Solid Framework
http://www.print-driver.com/howto/convert_pdf_to_jpeg.html Universal Document Converter
http://www.makeuseof.com/tag/6-ways-to-convert-a-pdf-file-to-a-jpg-image/ 6 Ways To Convert A PDF To A JPG Image
And you also can take a look at this thread: how to open a page from a pdf file in pictureBox in C#
If you use this process to convert a PDF to tiff, you can use this class to retrieve the bitmap from tiff.
I have prevented my program from saving images downloaded as files. (They are saved in Image variables instead). However, my application is slowed down significantly because the PDFsharp libraries save the image files anyway before drawing them to the PDF document. This is done deep within a hierarchy of calls by its functions.
Is there a simple fix to get around this?
JPEG images are simply copied into the PDF file; all other image formats have to be converted to the PDF format. AFAIK PDFsharp does not save the images to the local file system. However, they are saved into a memory stream during the conversion.
There is a simple fix to get around this: write a better conversion and submit it to the PDFsharp team.
I think PDFsharp always needs to save the image first. It cannot take an image as a byte array for import. I ran into this recently when I was coding with PDFsharp myself. My process is to save the image to a file and then import it in the PDFsharp code. I also have a step that deletes all the temporary images afterwards.
I need to create software that produces print previews of documents in the following formats: MS Office documents (.doc(x), .ppt(x), .xls(x)), images, .txt files and PDF files. I have made a working prototype using XPS files. Basically, I convert the Office files to .xps using Office Automation and then render the .xps documents to images. For images and .txt files, I simply create XPS files by adding the text or image to a FlowDocument and then rendering it.
But I have found that there is no way to convert PDF to XPS quickly (a 600-page document takes more than 2 minutes to convert, which is totally unsuitable). So I am stuck at this point. It seems that I should start over using a different file format. Should I rewrite my program around PDF, for example, or is there another way to accomplish my task?
And if I should use PDF, could you please suggest a good C# PDF library that renders page previews as fast as possible? I tried Websupergoo's ABCPdf, but it is too slow, because it cannot render the previews to System.Windows.Media.Imaging.BitmapSource, only to System.Drawing.Bitmap. I have to convert each Bitmap to a BitmapSource, and that takes a lot of time.
Thanks in advance.
Use Ghostscript to convert PDF to images. That said, I don't know why you wouldn't just use the PDF directly. I have used Ghostscript for a number of PDF/image manipulation tasks.
http://www.wibit.net/blog/integrating_ghostscript_c
Ghostscript will render any PDF to images using the settings you specify. I think you can use it either as a DLL or as a command-line process.