I'm trying to use IronOCR to create text-searchable versions of scanned PDF documents. The output file is displayed properly (and is text-selectable) in pretty much every viewer except Chrome's built-in PDF viewer.
Here's my code for converting the files:
byte[] origPdfBytes = Properties.Resources.Non_text_searchable;
using (MemoryStream pdfStream = new MemoryStream(origPdfBytes))
{
var ocr = new IronTesseract();
using (OcrInput input = new OcrInput())
{
input.AddPdf(pdfStream);
OcrResult ocrResult = ocr.Read(input);
ocrResult.SaveAsSearchablePdf("C:\\temp\\OCRTest\\output.pdf");
}
}
Here is a sample file that I've converted using IronOCR: https://drive.google.com/file/d/1_uhmZKJN_TFStApfeeieI8LezAPjp1mj/view
If you download and view this file in pretty much any viewer other than Chrome, the text is properly selectable. However, in Chrome, the cursor does appear to be selecting text, but it does not display properly.
I've used Chrome's built-in PDF viewer for years, and I've never run into an issue like this. I'm not sure if this is an issue with IronOCR's output formatting, or if it's just a problem with Chrome. Any ideas?
How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespace, and a complete paragraph is broken into multiple lines, etc.
How do I get the desired result as in the desired text file?
Additionally, is it possible to detect and mark (for example, with a tag) text that is bold, italic or underlined? Things also get more problematic for pages that have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference https://ironsoftware.com/csharp/ocr/
With this you should get formatted text output, but not necessarily the exact output you want.
If you need exactly pre-interpreted output, you should look at paid OCR products such as the OmniPage Capture SDK and the ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
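The clean-up step mentioned above could look something like the following. This is only a minimal sketch, not part of Spire.PDF or PDFBox: the class and method names are made up for illustration, and it assumes that a blank line in the extracted text marks a paragraph boundary, which will not hold for every PDF. It joins hard-wrapped lines into paragraphs and collapses runs of whitespace; it does not handle hyphenated line breaks or multi-column layouts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for post-processing text returned by a PDF text extractor.
public static class ExtractedTextCleanup
{
    // Joins hard-wrapped lines into paragraphs and collapses whitespace runs.
    // Assumption: a blank line marks a paragraph boundary.
    public static string Reflow(string raw)
    {
        // Normalize line endings so the splitting below only sees '\n'.
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        var paragraphs = new List<string>();
        // A newline followed by optional spaces/tabs and at least one more newline ends a paragraph.
        foreach (string block in Regex.Split(text, @"\n[ \t]*\n+"))
        {
            // Inside a paragraph, every whitespace run (including single newlines) becomes one space.
            string joined = Regex.Replace(block, @"\s+", " ").Trim();
            if (joined.Length > 0)
            {
                paragraphs.Add(joined);
            }
        }
        return string.Join("\n\n", paragraphs);
    }
}
```

In the Spire.PDF snippet from the question, this would be applied to buffer.ToString() before the call to File.WriteAllText.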
After struggling for a whole day, I identified the issue, but that didn't solve my problem.
In short:
I need to open a PDF, convert it to black and white (grayscale), search for some words, and insert notes near the words found. At first glance this seems easy, but I discovered how hard PDF files are to process (they have no concept of "words", and so on).
Now the first task, converting to grayscale, just drove me crazy. I didn't find a working solution, either commercial or free. I came up with this solution:
open the PDF
print with windows drivers, some free PDF printers
This is quite ugly, since I will force the C# users to install such third-party software, but that is it for the moment. I tested FreePDF, CutePDF and PDFCreator. All of them work "stand-alone" as expected.
Now, when printing from C#, I obviously don't want the print dialog; I just want to select the BW option and print (i.e. convert).
The following code just uses a PDF library, shown for clarity only.
Aspose.Pdf.Facades.PdfViewer viewer = new Aspose.Pdf.Facades.PdfViewer();
viewer.BindPdf(txtPDF.Text);
viewer.PrintAsGrayscale = true;
//viewer.RenderingOptions = new RenderingOptions { UseNewImagingEngine = true };
//Set attributes for printing
//viewer.AutoResize = true; //Print the file with adjusted size
//viewer.AutoRotate = true; //Print the file with adjusted rotation
viewer.PrintPageDialog = false; //Do not produce the page number dialog when printing
////PrinterJob printJob = PrinterJob.getPrinterJob();
//Create objects for printer and page settings and PrintDocument
System.Drawing.Printing.PrinterSettings ps = new System.Drawing.Printing.PrinterSettings();
System.Drawing.Printing.PageSettings pgs = new System.Drawing.Printing.PageSettings();
//System.Drawing.Printing.PrintDocument prtdoc = new System.Drawing.Printing.PrintDocument();
//prtdoc.PrinterSettings = ps;
//Set printer name
//ps.PrinterName = prtdoc.PrinterSettings.PrinterName;
ps.PrinterName = "CutePDF Writer";
ps.PrintToFile = true;
ps.PrintFileName = @"test.pdf";
//
//ps.
//Set PageSize (if required)
//pgs.PaperSize = new System.Drawing.Printing.PaperSize("A4", 827, 1169);
//Set PageMargins (if required)
//pgs.Margins = new System.Drawing.Printing.Margins(0, 0, 0, 0);
//Print document using printer and page settings
viewer.PrintDocumentWithSettings(ps);
//viewer.PrintDocument();
//Close the PDF file after printing
viewer.Close();
What I discovered, and what seems to be poorly documented, is that if you set
ps.PrintToFile = true;
then no matter which C# PDF library or PDF printer driver is used, Windows will just skip the PDF driver and output a PS (PostScript) file instead of a PDF file, which obviously will not be recognized by Adobe Reader.
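One quick way to confirm what actually ended up in the output file is to look at its first bytes: PDF files begin with "%PDF-", while PostScript files conventionally begin with "%!". The sketch below is only a diagnostic aid with made-up class and method names; some printer drivers prepend job-control data before the PostScript header, in which case this check would report Unknown.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper that classifies a print-to-file result by its header bytes.
public static class PrintOutputSniffer
{
    public enum OutputKind { Pdf, PostScript, Unknown }

    // Classifies a buffer by its leading bytes.
    public static OutputKind Classify(byte[] head)
    {
        string start = Encoding.ASCII.GetString(head, 0, Math.Min(head.Length, 8));
        if (start.StartsWith("%PDF-", StringComparison.Ordinal)) return OutputKind.Pdf;
        if (start.StartsWith("%!", StringComparison.Ordinal)) return OutputKind.PostScript;
        return OutputKind.Unknown;
    }

    // Reads up to the first 8 bytes of a file and classifies them.
    public static OutputKind ClassifyFile(string path)
    {
        byte[] head = new byte[8];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(head, 0, head.Length);
            Array.Resize(ref head, read);
        }
        return Classify(head);
    }
}
```

For example, PrintOutputSniffer.ClassifyFile("test.pdf") could be called after printing to see whether the file is really a PDF despite its extension.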
Now the question (and I am positive that others who want to print PDFs from C# have run into it) is how to print to CutePDF, for example, and still suppress any filename dialog.
In other words, how do I print silently with a programmatically selected filename from a C# application? Or how do I somehow convince "print to file" to go through the PDF driver, not the default Windows PS driver?
Thanks very much for any hints.
I solved the conversion to grayscale with a commercial component, as described in this post, and I also posted my complete solution there, in case anyone struggles like I did.
Converting PDF to Grayscale pdf using ABC PDF
I am using Amyuni PDF Creator .Net to print PDF using a Windows service.
The Windows service is running under the Local System user account. When I try to print using the above library, it prints the PDF in the wrong font. See the attachment (Wrong font in PDF printing).
This issue occurs only with some printers, such as the Brother MFC-8890DW.
But for the same printer and the same Windows service, the PDF prints properly when the "Enable advanced printing features" setting is unchecked in the printer's Properties. See the attachment (Disable Advanced printing features).
using (FileStream file1 = new FileStream(pdfFile, FileMode.Open, FileAccess.Read))
{
using (IacDocument doc1 = new IacDocument())
{
doc1.Open(file1, string.Empty);
doc1.Copies = 1;
bool printed = doc1.Print(printer, false);
}
}
But the same Windows service prints the PDF correctly on some other printers, such as the HP LaserJet P1005, whether "Enable advanced printing features" is checked or unchecked.
Without having access to the same printer that you are using it is hard to know exactly what is happening. My best guess would be that the driver of this printer has issues dealing with process-level fonts (those that are registered using the GDI function AddFontResourceEx) when "Enable advanced printing features" is checked. This is how Amyuni PDF Creator uses fonts embedded in the PDF file, which is the case for the file that you have presented.
A possible workaround for this could be to use the attribute "PrintAsImage" of the Document class.
The code would look like this:
//Set the license key (needed only with the licensed version)
acPDFCreatorLib.SetLicenseKey("your company", "your activation code");
//Create a new document instance
Amyuni.PDFCreator.IacDocument doc = new Amyuni.PDFCreator.IacDocument(null);
doc.AttributeByName("PrintAsImage").Value = 1;
//Open the file here (...)
//Print to the default printer
doc.Print("", false);
Another alternative would be to save your file as xps using Amyuni PDF Creator then send the xps file to the printer:
// Create print server and print queue.
LocalPrintServer localPrintServer = new LocalPrintServer();
PrintQueue defaultPrintQueue = LocalPrintServer.GetDefaultPrintQueue();
defaultPrintQueue.AddJob("my document", "c:\\temp\\mytempfile.xps", true);
Disclaimer: I work for Amyuni Technologies.
I've been trying to solve this problem for nearly two days. There are a lot of more or less good solutions on the net, but not a single one fits my task perfectly.
Task:
Print a PDF programmatically
Do it with a fixed printer
Don't let the user do more than one Button_Click
Do it silently - the more silent, the better
Do it client side
First Solutions:
Do it with a Forms.WebBrowser
If we have Adobe Reader installed, there is a plugin to show PDFs in the web browser. With this solution we have a nice preview, and with webbrowserControlName.Print() we can trigger the control to print its content.
Problem - we still have a PrintDialog.
Start the AcroRd32.exe with start arguments
The following CMD command lets us use Adobe Reader to print our PDF.
InsertPathTo..\AcroRd32.exe /t "C:\sample.pdf" "\\printerNetwork\printerName"
Problems - we need the absolute path to AcroRd32.exe, and an Adobe Reader window opens and has to stay open until the print task is finished.
Use windows presets
Process process = new Process();
process.StartInfo.FileName = pathToPdf;
process.StartInfo.Verb = "printto";
process.StartInfo.Arguments = "\"" + printerName + "\"";
process.Start();
process.WaitForInputIdle();
process.Kill();
Problem - there is still an Adobe Reader window popping up, but after the printing is done it closes itself usually.
Solution - convince the client to use Foxit Reader (and don't use the last two lines of code).
Convert PDF pages to Drawing.Image
I have no idea how to do this in code, but once I get it to work the rest is a piece of cake. Printing.PrintDocument can fulfill all the requirements.
Does anyone have an idea how to get some Drawing.Image objects out of those PDFs, or another approach?
Best Regards,
Max
The most flexible, easiest and best performing method I could find was using GhostScript. It can print to windows printers directly by printer name.
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" -dPrinted -dBATCH -dNOPAUSE -sDEVICE=mswinpr2 -dNoCancel -sOutputFile="%printer%printer name" "pdfdocument.pdf"
Add these switches to shrink the document to an A4 page.
-sPAPERSIZE=a4 -dPDFFitPage
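From C#, the command line above can be assembled and launched with System.Diagnostics.Process. The following is a rough sketch rather than a tested integration: the class and method names are invented for illustration, the switches are copied from the command shown above, and the path to gswin64c.exe has to be supplied by the caller because it depends on the installed Ghostscript version.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around the Ghostscript command line shown above.
public static class GhostscriptPrint
{
    // Builds the argument string for the mswinpr2 device.
    public static string BuildArguments(string printerName, string pdfPath, bool fitToA4)
    {
        string args = "-dPrinted -dBATCH -dNOPAUSE -sDEVICE=mswinpr2 -dNoCancel ";
        if (fitToA4)
        {
            // Optional switches that shrink the document to an A4 page.
            args += "-sPAPERSIZE=a4 -dPDFFitPage ";
        }
        args += "-sOutputFile=\"%printer%" + printerName + "\" \"" + pdfPath + "\"";
        return args;
    }

    // Runs Ghostscript without a console window and returns its exit code.
    public static int Print(string gsExePath, string printerName, string pdfPath, bool fitToA4)
    {
        var psi = new ProcessStartInfo
        {
            FileName = gsExePath,
            Arguments = BuildArguments(printerName, pdfPath, fitToA4),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process gs = Process.Start(psi))
        {
            gs.WaitForExit();
            return gs.ExitCode;
        }
    }
}
```

Keeping the argument building separate from the process launch makes it easy to log or inspect the exact command before anything is sent to the printer.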
If a commercial library is an option, you can try Amyuni PDF Creator .Net.
Printing directly with the library:
To open a PDF file and send it directly to the printer, you can use the method IacDocument.Print. The code in C# will look like this:
// Open PDF document from file
FileStream file1 = new FileStream ("test.pdf", FileMode.Open, FileAccess.Read);
IacDocument doc1 = new IacDocument (null);
doc1.Open (file1, "" );
// print document to a specified printer with no prompt
doc1.Print ("My Laser Printer", false);
Exporting to images (then printing if needed):
Choice 1: You can use the method IacDocument.ExportToJPeg for converting all pages in a PDF to JPG images that you can print or display using Drawing.Image
Choice 2: You can draw each page into a bitmap using the method IacDocument.DrawCurrentPage with the method System.Drawing.Graphics.FromImage. The code in C# should look like this:
FileStream myFile = new FileStream ("test.pdf", FileMode.Open, FileAccess.Read);
IacDocument doc = new IacDocument(null);
doc.Open(myFile, "");
doc.CurrentPage = 1;
Image img = new Bitmap(100,100);
Graphics gph = Graphics.FromImage(img);
IntPtr hdc = gph.GetHdc();
doc.DrawCurrentPage(hdc, false);
gph.ReleaseHdc( hdc );
Disclaimer: I work for Amyuni Technologies
I tried many things, and the one that worked best for me was launching SumatraPDF from the command line:
// Launch SumatraPDF Reader to print
String arguments = "-print-to-default -silent \"" + fileName + "\"";
System.Diagnostics.Process.Start("SumatraPDF.exe", arguments);
There are so many advantages to this:
SumatraPDF is much much faster than Adobe Acrobat Reader.
The UI doesn't load. It just prints.
You can use SumatraPDF as a standalone application so you can include it with your application so you can use your own pa. Note that I did not read the license agreement; you should probably check it out yourself.
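Since the question asks for a fixed printer, a variation of the call above can pass the printer name through SumatraPDF's -print-to switch instead of -print-to-default. The sketch below is an illustration with invented class and method names; the path to SumatraPDF.exe is an assumption the caller has to fill in.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around SumatraPDF's command-line printing.
public static class SumatraPrint
{
    // Uses -print-to "<printer>" when a printer name is given, otherwise -print-to-default.
    public static string BuildArguments(string fileName, string printerName)
    {
        string target = string.IsNullOrEmpty(printerName)
            ? "-print-to-default"
            : "-print-to \"" + printerName + "\"";
        return target + " -silent \"" + fileName + "\"";
    }

    // Starts SumatraPDF, waits for it to finish, and returns its exit code.
    public static int Print(string sumatraExePath, string fileName, string printerName)
    {
        var psi = new ProcessStartInfo(sumatraExePath, BuildArguments(fileName, printerName))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(psi))
        {
            p.WaitForExit();
            return p.ExitCode;
        }
    }
}
```

Waiting for the process to exit is a design choice: it lets the caller delete temporary files afterwards, at the cost of blocking the calling thread while the job is handed to the spooler.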
Another approach would be to use the spooler functions from .NET to send pre-formatted printer data to a printer. Unfortunately, you need to work with the Win32 spooler API.
You can look at "How to send raw data to a printer by using Visual C# .NET".
You can only use this approach when the printer supports PDF documents natively.
My company offers Docotic.Pdf library that can render and print PDF documents. The article behind the link contains detailed information about the following topics:
printing PDFs in Windows Forms or WPF application directly
printing PDFs via an intermediate image
rendering PDFs on a Graphics
There are links to sample code, too.
I work for the company, so please read the article and try suggested solutions yourselves.
Process proc = new Process();
proc.StartInfo.FileName = @"C:\Program Files\Adobe\Acrobat 7.0\Reader\AcroRd32.exe";
proc.StartInfo.Arguments = @"/p /h ""C:\Documents and Settings\brendal\Desktop\Test.pdf""";
proc.StartInfo.UseShellExecute = false;
proc.StartInfo.CreateNoWindow = true;
proc.Start();
for (int i = 0; i < 5; i++)
{
if (!proc.HasExited)
{
proc.Refresh();
Thread.Sleep(2000);
}
else
break;
}
if (!proc.HasExited)
{
proc.CloseMainWindow();
}
You can use ghostscript to convert PDF into image formats.
The following example converts a single PDF into a sequence of PNG-Files:
private static void ExecuteGhostscript(string input, string tempDirectory)
{
// %d will be replaced by ghostscript with a number for each page
string filename = Path.GetFileNameWithoutExtension(input) + "-%d.png";
string output = Path.Combine(tempDirectory, filename);
Process ghostscript = new Process();
ghostscript.StartInfo.FileName = _pathToGhostscript;
ghostscript.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
ghostscript.StartInfo.Arguments = string.Format(
"-dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -sOutputFile=\"{0}\" \"{1}\"", output, input);
ghostscript.Start();
ghostscript.WaitForExit();
}
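Once the pages exist as PNG files, they can be sent to a fixed printer with System.Drawing.Printing.PrintDocument, as the question suggests. The following is a minimal sketch under a few assumptions: the class name is made up, the file names follow the "-%d.png" pattern produced above, and each image is simply stretched to the page bounds, which may distort pages whose aspect ratio differs from the paper. StandardPrintController is used so that the usual "Printing page..." status dialog is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Printing;
using System.IO;
using System.Linq;

// Hypothetical helper that prints a set of page images silently.
public static class PngPagePrinter
{
    // Orders page files by the number Ghostscript substituted for %d,
    // so that "doc-10.png" sorts after "doc-2.png".
    public static List<string> OrderPages(IEnumerable<string> files)
    {
        return files.OrderBy(f => PageNumber(f)).ToList();
    }

    private static int PageNumber(string file)
    {
        string name = Path.GetFileNameWithoutExtension(file);
        int dash = name.LastIndexOf('-');
        int n;
        // Files without a trailing number are sorted to the end.
        return int.TryParse(name.Substring(dash + 1), out n) ? n : int.MaxValue;
    }

    // Prints one image per page on the named printer.
    public static void Print(IEnumerable<string> files, string printerName)
    {
        List<string> pages = OrderPages(files);
        if (pages.Count == 0) return;

        int index = 0;
        using (var doc = new PrintDocument())
        {
            doc.PrinterSettings.PrinterName = printerName;
            doc.PrintController = new StandardPrintController(); // no status dialog
            doc.PrintPage += (sender, e) =>
            {
                using (Image page = Image.FromFile(pages[index]))
                {
                    e.Graphics.DrawImage(page, e.PageBounds);
                }
                index++;
                e.HasMorePages = index < pages.Count;
            };
            doc.Print();
        }
    }
}
```

A typical call would pass Directory.GetFiles(tempDirectory, "*.png") together with the fixed printer name.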
If you prefer to use Adobe Reader instead you can hide its window:
process.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
I found a slightly different version of your code that uses the printto verb. I didn't try it, but maybe it helps you:
http://vbcity.com/forums/t/149141.aspx
If you're interested in commercial solutions which do exactly what you require then there are quite a few options. My company provides one of those options in a developer toolkit called Debenu Quick PDF Library.
Here is a code sample (key functions are PrintOptions and PrintDocument):
/* Print a document */
// Load a local sample file from the input folder
DPL.LoadFromFile("Test.pdf", "");
// Configure print options
int iPrintOptions = DPL.PrintOptions(0, 0, "Printing Sample");
// Print the current document to the default
// printer using the options as configured above.
// You can also specify a specific printer.
DPL.PrintDocument(DPL.GetDefaultPrinterName(), 1, 1, iPrintOptions);
I know that the tag has Windows Forms; however, due to the general title, some people might be wondering if they may use that namespace with a WPF application -- they may.
Here's code:
var file = File.ReadAllBytes(pdfFilePath);
var printQueue = LocalPrintServer.GetDefaultPrintQueue();
using (var job = printQueue.AddJob())
using (var stream = job.JobStream)
{
stream.Write(file, 0, file.Length);
}
Now, this namespace must be used with a WPF application. It does not play well with ASP.NET or Windows Service. It should not be used with Windows Forms, as it has System.Drawing.Printing. I don't have a single issue with my PDF printing using the above code.
Note that if your printer does not support Direct Printing for PDF files, this won't work.
What about using the PrintDocument class?
http://msdn.microsoft.com/en-us/library/system.drawing.printing.printdocument.aspx
You just need to pass the filename of the file you want to print (based on the example).
HTH
As of July 2018, there is still no answer for the OP. There is no free way to 1) silently print your pdf for a 2) closed source project.
1) You can most certainly use a process i.e. Adobe Acrobat or Foxit Reader
2) Free solutions have a GPL (GNU's General Public License). This means you must open your source code if giving the software, even for free, to anyone outside your company.
As the OP says, if you can get a PDF to Drawing.Image, you can print it with .NET methods. Sadly, software to do this also requires payment or a GPL.