I have PDF files that have been "recognized" using the OCR Text Recognition -> Recognize Text Using OCR functionality in Acrobat.
I would like to take these as an upload (C# ASP.NET MVC) and be able to extract this information for indexing and search purposes.
I have tried opening the PDF files and I don't find any of the recognized text so I'm guessing it's compressed and/or encoded.
Any ideas?
There is an article on CodeProject that explains how you can extract text from PDF using C#.
Both xpdf and poppler include a pdftotext command-line tool.
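As a rough sketch of the pdftotext route, an upload handler could shell out to the tool and read the extracted text from standard output. This assumes pdftotext (from xpdf or poppler) is installed and reachable on the PATH; the class and method names are made up for illustration. Passing "-" as the output file asks pdftotext to write to stdout.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PdfTextExtractor
{
    // Builds the pdftotext arguments: keep the page layout, emit UTF-8,
    // and write to stdout ("-") instead of a .txt file.
    public static string BuildArguments(string pdfPath)
    {
        return "-layout -enc UTF-8 \"" + pdfPath + "\" -";
    }

    // Runs pdftotext and returns the text layer it finds (possibly empty).
    // exePath is an assumption: adjust it if pdftotext is not on the PATH.
    public static string ExtractText(string pdfPath, string exePath = "pdftotext")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(pdfPath),
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read everything before waiting so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("pdftotext exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

The returned string can then be handed to whatever indexer the application uses. If the text comes back empty, the PDF probably has no text layer at all.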
Related
I want to perform OCR on PNG and PDF files. I am able to get the Tesseract 3.0.2 .NET wrapper to work for PNG files, but I can't find any class in it for PDF files. So, does it work for PDF files? If not, please let me know of any other open source library for scanning PDFs. My requirement is to scan diagrams in a PDF for specific circles and to create hyperlinks for those circles.
No, it doesn't. You'll have to extract the images from the PDF first. This can be done using pdfimages (pdfimages.exe -j your.pdf) or gs, as suggested by Zakk Diaz.
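A minimal sketch of the extraction step, assuming pdfimages (xpdf or poppler) is on the PATH. The class name, the "img" prefix, and the output folder are illustrative choices. Note that pdfimages expects an output prefix (the "image root") after the PDF path.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class PdfImageExtractor
{
    // -j keeps JPEG-encoded images as .jpg files; the last argument is the
    // output prefix that pdfimages prepends to each numbered file name.
    public static string BuildArguments(string pdfPath, string outputPrefix)
    {
        return "-j \"" + pdfPath + "\" \"" + outputPrefix + "\"";
    }

    // Extracts the embedded images into outputDir and returns the files written.
    public static string[] ExtractImages(string pdfPath, string outputDir, string exePath = "pdfimages")
    {
        Directory.CreateDirectory(outputDir);
        string prefix = Path.Combine(outputDir, "img");

        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(pdfPath, prefix),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("pdfimages exited with code " + process.ExitCode);
        }

        // Images that are not JPEG-encoded come out as .ppm or .pbm files.
        return Directory.GetFiles(outputDir, "img-*");
    }
}
```

Each returned file can then go through the same Tesseract code path that already works for PNG files, after converting the image format if the wrapper cannot read it directly.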
I have code that lets users upload files (doc, txt) and saves them to SQL. I want the user to be able to click a download button and download the file as a PDF, so the doc should be converted to PDF.
You have a couple of options for handling this within your application:
You can convert the DOC / TXT file before saving it to the SQL db.
Or you can do this on the fly when the user clicks the download button.
Either way, you will need an SDK, API, or utility to do the conversion. There is already quite a bit of information about this on the web; here are some links you should take a look at to see how to do the conversion in C#:
.NET library to convert Microsoft Office docs to PDF
How do I convert Word files to PDF programmatically?
I need to create software that generates print previews of documents in the following formats: MS Office documents (.doc(x), .ppt(x), .xls(x)), images, .txt files and PDF files.
I have made a working prototype using XPS files. Basically, I convert the Office files to .xps using Office Automation and then render the .xps documents to images. For images and .txt files, I simply create XPS files by adding the text or image to a FlowDocument and then rendering it.
However, I have found that there is no fast way to convert PDF to XPS (a 600-page document takes more than 2 minutes to convert, which is not acceptable). So I am stuck at this point. It seems that I should start over with a different file format. Should I rewrite my program using PDF, for example, or is there another way to accomplish my task?
And if I should use PDF, could you please suggest a good C# PDF library that renders page previews as fast as possible? I tried Websupergoo's ABCPdf, but it is too slow: it cannot render previews to System.Windows.Media.Imaging.BitmapSource, only to System.Drawing.Bitmap, so I have to convert each Bitmap to a BitmapSource, which takes a lot of time.
Thanks in advance.
Use Ghostscript to convert the PDF to images, though I don't know why you wouldn't just use the PDF. I have used Ghostscript for a number of PDF/image manipulation tasks.
http://www.wibit.net/blog/integrating_ghostscript_c
Ghostscript will render any PDF to images using the settings you specify (output device, resolution, page range). I think you can use it either as a DLL or as a command-line process.
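For the command-line route, a sketch along these lines could render a page range to PNG files. It assumes the Ghostscript console executable is installed (gswin64c.exe is the usual name on 64-bit Windows); the class name, the default 96 dpi, and the output pattern are illustrative choices, not recommendations from the answer above.

```csharp
using System;
using System.Diagnostics;

public static class GhostscriptRenderer
{
    // png16m is Ghostscript's 24-bit colour PNG device; %03d in the output
    // pattern is replaced by the page number of each rendered page.
    public static string BuildArguments(string pdfPath, string outputPattern,
                                        int dpi, int firstPage, int lastPage)
    {
        return "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m"
             + " -r" + dpi
             + " -dFirstPage=" + firstPage
             + " -dLastPage=" + lastPage
             + " \"-sOutputFile=" + outputPattern + "\""
             + " \"" + pdfPath + "\"";
    }

    // Renders pages firstPage..lastPage; throws if Ghostscript reports an error.
    public static void RenderPages(string pdfPath, string outputPattern,
                                   int firstPage, int lastPage,
                                   int dpi = 96, string exePath = "gswin64c.exe")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(pdfPath, outputPattern, dpi, firstPage, lastPage),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("Ghostscript exited with code " + process.ExitCode);
        }
    }
}
```

Rendering only the pages currently needed for the preview, at screen resolution, is one way to keep a 600-page document from being converted in full up front.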
I am using the Acrobat SDK to convert an image PDF to a searchable text PDF. Can anyone help me out? I am stuck: I need to check whether a file has already been OCR'd or not.
You can try this sample to apply OCR to images in a PDF document.
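One possible heuristic for the "already OCR'd?" check, which does not use the Acrobat SDK: extract the text layer with any text extraction tool (pdftotext, for instance) and see whether anything meaningful comes back. The helper below is a hypothetical sketch that only inspects the extracted string; the 20-character threshold is an arbitrary assumption. It cannot distinguish OCR text from text in a PDF that was never a scan.

```csharp
using System;
using System.Linq;

public static class OcrCheck
{
    // Returns true when the extracted text contains at least minChars letters
    // or digits, which suggests the PDF already carries a text layer.
    public static bool HasTextLayer(string extractedText, int minChars = 20)
    {
        if (string.IsNullOrEmpty(extractedText))
            return false;

        // Whitespace and form feeds (page breaks) do not count as content.
        int meaningful = extractedText.Count(c => char.IsLetterOrDigit(c));
        return meaningful >= minChars;
    }
}
```

Files for which this returns false are candidates for the OCR pass; the rest can be skipped.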
I have some files stored in a directory on the server application. Actually, this folder is inside my app's folder.
Is there a way to preview these files (most of them xls, docs and pdfs) using a component or a custom control?
Probably the easiest way is to use Google Docs Viewer.
Basically, you call the viewer URL and pass it the fully qualified public URL of a document on your server, and the Google Docs Viewer will render your document to HTML.
The documentation states that they support the following document types:
Microsoft Word (.DOC and .DOCX)
Microsoft Excel (.XLS and .XLSX)
Microsoft PowerPoint (.PPT and .PPTX)
Adobe Portable Document Format (.PDF)
Apple Pages (.PAGES)
Adobe Illustrator (.AI)
Adobe Photoshop (.PSD)
Tagged Image File Format (.TIFF)
Autodesk AutoCAD (.DXF)
Scalable Vector Graphics (.SVG)
PostScript (.EPS, .PS)
TrueType (.TTF)
XML Paper Specification (.XPS)
Archive file types (.ZIP and .RAR)
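Building the viewer link might look like the sketch below. The base address and the embedded=true parameter reflect the form the viewer has historically accepted and should be checked against the current documentation; the helper name is made up. The document URL must be percent-encoded because it travels as a query string value.

```csharp
using System;

public static class DocsViewer
{
    // Assumed viewer endpoint; verify it against Google's documentation.
    private const string ViewerBase = "https://docs.google.com/viewer";

    // publicDocumentUrl must be reachable from the internet, since the
    // viewer fetches the file itself.
    public static string BuildViewerUrl(string publicDocumentUrl)
    {
        return ViewerBase
             + "?url=" + Uri.EscapeDataString(publicDocumentUrl)
             + "&embedded=true";
    }
}
```

The resulting string can be used as the src of an iframe in the view. Files that sit inside the app's folder but are not publicly reachable will not render this way.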