Extracting the first page of multiple PDFs & saving them as Image - c#

I have about 400 ebooks, all in PDF format, and my task is to extract the cover of each one (the first page of the PDF) and export it as a separate image file (PNG or JPEG).
So I will end up with 400 ebooks and 400 images of their covers.
I am on Windows.
Any advice greatly appreciated.

Use Ghostscript to render TIFF or JPEG from the PDF. It gives you fine-grained control over the result.
Check the licence first: the open-source Ghostscript releases are distributed under the AGPL, so if you ship Ghostscript as part of a closed-source commercial application, you need a commercial licence. If you only use the application inside your organisation, even commercially, you are allowed to use the open-source version.
Ghostscript can be found here. The PDF support in many open-source packages relies on the gs PDF interpreter. ImageMagick, for example, requires the Ghostscript libraries.
Download GS here: http://ghostscript.com/download/gsdnld.html
Use the C# Process class to execute Ghostscript. There is an SO topic on this here: How to run a C# console application with the console hidden
The command line for TIFF output is:
D:\gs\gs9.20>bin\gswin64c.exe -sOutputFile=d:\some%02d.tiff -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -sCompression=lzw -r150 -sPageList=1 d:\PDFReference.pdf
This creates a single file, some01.tiff, on d:\ at a resolution of 150 dpi.
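To apply this to a whole folder of ebooks, the same call can be driven from the C# Process class. The following is only a minimal sketch under stated assumptions: the class and method names (CoverExtractor, BuildArguments, ExtractAll) are made up for illustration, the Ghostscript path is the one from the prompt above, and it uses the png16m device with -dFirstPage=1 -dLastPage=1 instead of -sPageList=1. File names containing a % character are not handled, because Ghostscript treats % in the output file name as a format specifier.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class CoverExtractor
{
    // Assumed install location, copied from the prompt shown above; adjust as needed.
    private const string GhostscriptExe = @"D:\gs\gs9.20\bin\gswin64c.exe";

    // Builds the argument string that renders only page 1 of a PDF to a PNG file.
    public static string BuildArguments(string pdfPath, string pngPath, int dpi)
    {
        return "-dBATCH -dNOPAUSE -dQUIET -sDEVICE=png16m -r" + dpi +
               " -dFirstPage=1 -dLastPage=1" +
               " -sOutputFile=\"" + pngPath + "\" \"" + pdfPath + "\"";
    }

    // Renders the cover of every PDF in pdfFolder into imageFolder, one PNG per book.
    public static void ExtractAll(string pdfFolder, string imageFolder, int dpi)
    {
        Directory.CreateDirectory(imageFolder);
        foreach (string pdf in Directory.EnumerateFiles(pdfFolder, "*.pdf"))
        {
            string png = Path.Combine(
                imageFolder, Path.GetFileNameWithoutExtension(pdf) + ".png");

            var startInfo = new ProcessStartInfo(GhostscriptExe, BuildArguments(pdf, png, dpi))
            {
                UseShellExecute = false, // needed so that CreateNoWindow is honoured
                CreateNoWindow = true    // keep the Ghostscript console hidden
            };

            using (Process gs = Process.Start(startInfo))
            {
                gs.WaitForExit();
                if (gs.ExitCode != 0)
                    Console.Error.WriteLine("Ghostscript failed on " + pdf);
            }
        }
    }
}
```

A call such as CoverExtractor.ExtractAll(@"d:\books", @"d:\covers", 150) would then leave one PNG per PDF in d:\covers, named after the source file. Both folder names are placeholders.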

The following thread covers your request: converting pdf file to an jpeg image
One solution is to use a third-party library. ImageMagick is a very popular, freely available tool. You can get a .NET wrapper for it here. The original ImageMagick download page is here.
http://www.codeproject.com/KB/library/pdftoimages.aspx Convert PDF pages to image files using the Solid Framework
http://www.print-driver.com/howto/convert_pdf_to_jpeg.html Universal Document Converter
http://www.makeuseof.com/tag/6-ways-to-convert-a-pdf-file-to-a-jpg-image/ 6 Ways To Convert A PDF To A JPG Image
And you also can take a look at this thread: how to open a page from a pdf file in pictureBox in C#
If you use this process to convert a PDF to tiff, you can use this class to retrieve the bitmap from tiff.

Related

Does tesseract OCR for .net work with pdf files?

I want to perform OCR on PNG and PDF files. I was able to get the Tesseract 3.0.2 .NET wrapper to work for PNG files, but I can't find any class in it for PDF files. So, does it work for PDF files? If not, please let me know of any other open source library for scanning PDFs. My requirement is scanning diagrams in a PDF for specific circles, and creating hyperlinks for those circles.
No, it doesn't. You'll have to extract the images from the PDF first. This can be done using pdfimages (pdfimages.exe -j your.pdf image-root, where image-root is the prefix for the output file names) or gs, as suggested by Zakk Diaz.

How to display multiple .tif in C#?

Is there a simple way to open and display multipage .tif files? I want to write a simple WinForms application that opens a multipage .tif file and lets me scroll through its pages. I want to add next and previous buttons to my project to move between them. Any suggestions or examples?
Try the free DotImage Photo SDK by Atalasoft. Though it is not open source, it is free and a very good choice for viewing images.
Meanwhile I will check whether AForge.Net and EmguCV can open multipage TIFF images and let you know. These frameworks are open source and include very powerful SDKs for image processing.
I know an SDK named leadtools that can load and display multipage TIF files using .NET. For sample code, see the following link:
http://www.leadtools.com/help/leadtools/v175/dh/to/leadtools.topics~leadtools.topics.loadingsavingtutorials.html
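As a point of comparison with the SDKs above, the built-in System.Drawing classes can also step through the pages of a multipage TIFF via Image.GetFrameCount and Image.SelectActiveFrame. This is only a sketch: the TiffPager class and its members are hypothetical names, and it assumes a Windows project that references System.Drawing.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;

// Hypothetical helper that backs "next" and "previous" buttons for one TIFF file.
public sealed class TiffPager : IDisposable
{
    private readonly Image _image;

    public int PageCount { get; }
    public int CurrentPage { get; private set; }

    public TiffPager(string path)
    {
        _image = Image.FromFile(path);
        // Each page of a multipage TIFF is a frame in the Page dimension.
        PageCount = _image.GetFrameCount(FrameDimension.Page);
    }

    // Keeps a requested page index inside 0 .. pageCount - 1.
    public static int Clamp(int index, int pageCount)
    {
        return Math.Max(0, Math.Min(index, pageCount - 1));
    }

    // Selects the requested page and returns a copy of it for display.
    public Image Show(int index)
    {
        CurrentPage = Clamp(index, PageCount);
        _image.SelectActiveFrame(FrameDimension.Page, CurrentPage);
        return new Bitmap(_image); // copies the currently active frame
    }

    public Image Next() { return Show(CurrentPage + 1); }

    public Image Previous() { return Show(CurrentPage - 1); }

    public void Dispose() { _image.Dispose(); }
}
```

In a form, the button handlers would assign the result to a PictureBox, for example pictureBox1.Image = pager.Next(), and dispose of the previously shown image. The source file stays locked until the pager itself is disposed.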

Creating print previews of documents

I need to write software that creates print previews of documents in the following formats: MS Office documents (.doc(x), .ppt(x), .xls(x)), images, .txt files and PDF files. I have made a working prototype using XPS files. Basically, I convert the Office files to .xps using Office Automation and then render the .xps documents to images. For images and .txt files, I create XPS files by adding the text or image to a FlowDocument and then rendering it. But I have found that there is no fast way to convert PDF to XPS (a 600-page document takes more than 2 minutes to convert, which is not acceptable). So I am stuck at this point. It seems that I should start over with a different file format. Should I rewrite my program around PDF, for example, or is there another way to accomplish my task? And if I should use PDF, could you please suggest a good C# PDF library that renders page previews as fast as possible? I tried Websupergoo's ABCPdf, but it is too slow for me: it cannot render previews to System.Windows.Media.Imaging.BitmapSource, only to System.Drawing.Bitmap, so I have to convert each Bitmap to a BitmapSource, which takes a lot of time.
Thanks in advance.
Use Ghostscript to convert PDF to images. Though, I don't know why you wouldn't just use the PDF. I have used Ghostscript for a number of PDF/image manipulation tasks.
http://www.wibit.net/blog/integrating_ghostscript_c
Ghostscript will render any PDF to images using the settings you specify. I think you can use it as a DLL or as a command-line process.

c# converting PDF to Tif

I was using Ghostscript with a C# class wrapper to convert PDFs to TIF, and then the tessnet2 OCR engine to read the content of the image file. But the TIF images are pretty much unreadable: the image is faded and doesn't look right, and the OCR engine fails to read anything. Is there an open source library, or one that costs a few bucks, that can convert PDFs to TIF in good quality? Or is there an open source OCR engine that reads PDFs, since tessnet2 cannot?
As DaNet said, I'm not sure whether there is an open source DLL or a free way to do that. We use a third-party toolkit named leadtools that gives us very good results when running OCR on PDF documents. You can use it to do some processing on the image (i.e. binarize it, remove the unwanted dots from the image, convert it to 1-bit black & white, save it as a TIF image, etc.), and then pass it to their OCR engine.
I know that they have an online demo, you can try it. Here is the link for the demo:
http://demo.leadtools.com/OnlineRecognitionDemo
If the results match your requirements, you can check this tutorial:
Scanning to Searchable PDF
I'm not sure about an open source OCR engine, but if you play with the output resolution of the Ghostscript-generated TIFF, you shouldn't have a problem.
Tried adding -r150 to the "string args" of the Ghostscript wrapper to change the resolution, hopefully with a decent file size in megabytes!
I had to change the properties of the imageMagicNET class, setting the output format to png16m and raising the DPI, so that the generated images are high quality and readable for the OCR engine.
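The fix described above comes down to which device and resolution are passed to Ghostscript. A hedged sketch of such an argument string follows. The OcrRenderArgs name is hypothetical, and the anti-aliasing switches (-dTextAlphaBits=4, -dGraphicsAlphaBits=4) are an addition that is not mentioned in the thread. Which DPI value works best for tessnet2 is something to find by experiment; 300 is a common starting point rather than a figure from the original posts.

```csharp
public static class OcrRenderArgs
{
    // Builds Ghostscript arguments that render every page of a PDF to 24-bit PNG.
    // outputPattern may contain %03d, which Ghostscript replaces with the page number.
    public static string Build(string pdfPath, string outputPattern, int dpi)
    {
        return "-dBATCH -dNOPAUSE -sDEVICE=png16m -r" + dpi +
               " -dTextAlphaBits=4 -dGraphicsAlphaBits=4" +
               " -sOutputFile=\"" + outputPattern + "\" \"" + pdfPath + "\"";
    }
}
```

The resulting string can be handed to whichever wrapper or Process call is already launching Ghostscript, for example OcrRenderArgs.Build(@"d:\scan.pdf", @"d:\page%03d.png", 300), where both paths are placeholders.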

Edit tif files with C#

I need to create a program that reads tif files from a directory and then trims the bottom inch of the file and resaves the file. I know how to open the files but how would I automate this process from c#?
If you need to handle TIFF images in C#, have a look at LibTIFF.Net
http://bitmiracle.com/libtiff/ - It is open source, a native .NET component, and free for commercial use.
This library should also have the TIFF cropping functions you need. I am not sure whether the built-in .NET libraries can handle all of the TIFF features you may require, whereas LibTIFF will.
The original LibTIFF for C/C++ can be found at http://www.remotesensing.org/libtiff/ which may help you with documentation and support if needed.
Included with LibTIFF is a program called tiffcrop, which should also come with source code: http://www.remotesensing.org/libtiff/man/tiffcrop.1.html, which can be reached via http://www.remotesensing.org/libtiff/tools.html.
See here.
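For simple single-page files, trimming the bottom inch can also be sketched with System.Drawing alone, using the image's vertical resolution to convert inches to pixels. This is not the LibTIFF.Net approach recommended above, and the TiffTrimmer name and its methods are hypothetical. It assumes a Windows project that references System.Drawing, only handles the first page of a multipage TIFF, and has not been checked against every TIFF pixel format or compression scheme.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class TiffTrimmer
{
    // Computes the region to keep after removing inchesToTrim from the bottom edge.
    public static Rectangle KeepRegion(int width, int height, float verticalDpi, float inchesToTrim)
    {
        int trimPixels = (int)Math.Round(verticalDpi * inchesToTrim);
        int newHeight = Math.Max(1, height - trimPixels); // never produce an empty image
        return new Rectangle(0, 0, width, newHeight);
    }

    // Writes a trimmed copy of inputPath to outputPath (must be a different file).
    public static void TrimBottom(string inputPath, string outputPath, float inches)
    {
        using (var source = new Bitmap(inputPath))
        {
            Rectangle keep = KeepRegion(source.Width, source.Height,
                                        source.VerticalResolution, inches);
            using (Bitmap cropped = source.Clone(keep, source.PixelFormat))
            {
                // Carry the DPI over so the physical size stays meaningful.
                cropped.SetResolution(source.HorizontalResolution, source.VerticalResolution);
                cropped.Save(outputPath, ImageFormat.Tiff);
            }
        }
    }

    // Trims every .tif in inputFolder and writes the results to outputFolder.
    public static void TrimFolder(string inputFolder, string outputFolder, float inches)
    {
        Directory.CreateDirectory(outputFolder);
        foreach (string file in Directory.EnumerateFiles(inputFolder, "*.tif"))
        {
            TrimBottom(file, Path.Combine(outputFolder, Path.GetFileName(file)), inches);
        }
    }
}
```

Writing to a separate output folder avoids saving over a file that is still open for reading. For a letter-size page scanned at 100 dpi (850 x 1100 pixels), trimming one inch keeps the top 1000 rows.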
