Which library to use to extract text from images?

Which library to use to extract text from images? - c#

I am writing a program that when given an image of a low level math problem (e.g. 98*13) should be able to output the answer. The numbers would be black, and the background white. Not a captcha, just an image of a math problem.
The math problems would only have two numbers and one operator, and that operator would only be +, -, *, or /.
Obviously, I know how to do the calculating ;) I'm just not sure how to go about getting the text from the image.
A free library would be ideal... although If I have to write the code myself I could probably manage.

For extract words from image, I use the most accurate open source OCR engine: Tesseract. Available here or directly in your packages NuGet.
And this is my function in C#, which extract words from image passed in sourceFilePath. Set EngineMode to TesseractAndCube; it detect more word than the other options.
var path = "YourSolutionDirectoryPath";
using (var engine = new TesseractEngine(path + Path.DirectorySeparatorChar + "tessdata", "fra", EngineMode.TesseractAndCube))
{
using (var img = Pix.LoadFromFile(sourceFilePath))
{
using (var page = engine.Process(img))
{
var text = page.GetText();
// text variable contains a string with all words found
}
}
}
I hope that helps.

Try this post regarding using the C++ Google Tessaract OCR lib in C#
OCR with the Tesseract interface

You need OCR. There is the free Tesseract library from Google, but it's C code. You could use in a C++/CLI project and access via .NET.
This article gives some information on recognizing numbers (for Sudoku, but your problem is similar)
http://sudokugrab.blogspot.com/2009/07/how-does-it-all-work.html

you can use Microsoft Office Document Imaging (Interop.MODI.dll) in visaul studio and extract text of pictures
Document modiDocument = new Document();
modiDocument.Create(filePath);
modiDocument.OCR(MiLANGUAGES.miLANG_ENGLISH);
MODI.Image modiImage = (modiDocument.Images[0] as MODI.Image);
string extractedText = modiImage.Layout.Text;
modiDocument.Close();
return extractedText;

IronOCR is free for development and testing. The default English language pack should do a good job of reading this, but you may also want to consider using a custom Tesseract language pack written specifically for equations.
See https://ironsoftware.com/csharp/ocr/languages/#custom-language-example
using IronOcr;
var Ocr = new IronTesseract();
Ocr.UseCustomTesseractLanguageFile("languages/equ.traineddata");
using (var Input = new OcrInput(#"images\equation.png"))
{
var Result = Ocr.Read(Input);
Console.WriteLine(Result.Text);
}
Disclaimer: I work for Iron Software.

Related

Generating GS1 DataMatrix using ZXing.Net

What I need
Is to generate a working GS1 DataMatrix, using this test content:
(240)1234567890(10)AA12345(11)123456(21)1(96)1234567
Steps
I've downloaded the nuget package from here:
and
I've created a console app that uses this code:
private static void DoGs1DataMatrixStuff()
{
var writer = new BarcodeWriter
{
Format = BarcodeFormat.DATA_MATRIX
};
writer
.Write("(240)1234567890(10)AA12345(11)123456(21)1(96)1234567")
.Save(#"C:\Temp\barcode.png");
}
There's no obvious specific GS1_DataMatrix format I can use ...
that gives me
which if read by a scanner app on my smartphone, gives the literal content that I originally presented, not with the FNC1 formatting that I expect for GS1:
(240)1234567890(10)AA12345(11)123456(21)1(96)1234567
while it should be
2401234567890 10AA12345 11123456211 961234567
From another source (not a source I can use) I got this barcode:
Using my smartphone app this reads into the correct data.
Question
How can I recreate this working GS1 datamatrix, using ZXing.Net?
also see
this link, Chris Bahns raises the same concern I have, but his request didn't get a working answer.

You have to use a formatted string with ASCII character 29 (GS - Group Separator):
< GS >2401234567890< GS >10AA12345< GS >11123456211< GS >961234567
(replace the "< GS >" with ASCII 29)
ZXing.Net supports the GS symbol with the ASCII encoder since version 0.15. It replaces the ASCII 29 value with the FNC1 codeword (232) in the resulting datamatrix image.
That's only a low level support. There is no built in class or something similar which understands AI (application identifiers) with fixed or variable length (similar to the result parser classes for vCards, vEvent, ISBN, ...).

HTML to PDF in a Windows universal app (UWP)

Is it possible to convert an HTML string to a PDF file in a UWP app?
I've seen lots of different ways it can be done in regular .NET apps (there seem to be plenty of third party libraries), but I've yet to see a way it can be done in a Universal/UWP app. Does anyone know how it can be done?
Perhaps there is some way to hook into the "Microsoft Print to PDF" option, if there is no pure code solution?
Or is there a roundabout way of doing it, maybe like somehow using Javascript and https://github.com/MrRio/jsPDF inside a C# UWP app? I'm not sure, clutching at straws...
EDIT
I have marked the solution provided by Grace Feng - MSFT as correct for proving that it IS possible to convert HTML to PDF, through the use of the Microsoft Print to PDF option in the print dialog. Thank you

Perhaps there is some way to hook into the "Microsoft Print to PDF" option, if there is no pure code solution?
Yes, but using this way, you will need to firstly show your HTML string in controls like RichEditBox or TextBlock, only UIElement can be printable content.
You can also create PDF file by yourself, here is basic syntax used in PDF:
You can use BT and ET to create paragraph:
Here is sample in C#:
StringBuilder sb = new StringBuilder();
sb.AppendLine("BT"); // BT = begin text object, with text-units the same as userspace-units
sb.AppendLine("/F0 40 Tf"); // Tf = start using the named font "F0" with size "40"
sb.AppendLine("40 TL"); // TL = set line height to "40"
sb.AppendLine("230.0 400.0 Td"); // Td = position text point at coordinates "230.0", "400.0"
sb.AppendLine("(Hello World)'");
sb.AppendLine("/F2 20 Tf");
sb.AppendLine("20 TL");
sb.AppendLine("0.0 0.2 1.0 rg"); // rg = set fill color to RGB("0.0", "0.2", "1.0")
sb.AppendLine("(This is StackOverflow)'");
sb.AppendLine("ET");
Then you can create a PDF file and save this into this file. But since you want to convert the HTML to PDF, it could be a hard work and I think you don't want to do this.
Or is there a roundabout way of doing it, maybe like somehow using Javascript and https://github.com/MrRio/jsPDF inside a C# UWP app? I'm not sure, clutching at straws...
To be honestly, using Libs or Web service to convert HTML to PDF is also a method, there are many and I just searched for them, but I can't find any free to be used in WinRT. So I think the most practicable method here is the first one, hooking into Microsoft Print to PDF. To do this, you can check the official Printing sample.
Update:
Used #Jerry Nixon - MSFT's code in How do I print WebView content in a Windows Store App?, this is a great sample. I just added some code for add pages for printing, in the NavigationCompleted event of WebView:
private async void webView_NavigationCompleted(WebView sender, WebViewNavigationCompletedEventArgs args)
{
MyWebViewRectangle.Fill = await GetWebViewBrush(webView);
MyPrintPages.ItemsSource = await GetWebPages(webView, new Windows.Foundation.Size(842, 595));
}
Then in the printDoc.AddPages += PrintDic_AddPages; event (printDoc is instance of PrintDocument):
private void PrintDic_AddPages(object sender, AddPagesEventArgs e)
{
foreach (var item in MyPrintPages.Items)
{
var rect = item as Rectangle;
printDoc.AddPage(rect);
}
printDoc.AddPagesComplete();
}
For other code you can refer to the official printing sample.

Need Alternative to EO.Pdf for Converting HTML to PDF in C#, wkhtmltopdf?

I am creating a HTML catalog of movies, and then converting it to PDF. I was using EO.Pdf, and it worked for my small test sample. However, when I run it against the entire list of movies, the resulting HTML file is nearly 8000 lines, and 7MB. EO.Pdf times out when attempting to convert it. I believe it is a limitation of the free version, as I can copy the entire HTML and paste it into their online demo and it works.
I am looking for an alternative to use. I am not good with command line, or running external programs, so I would prefer something I can add to the .NET library and use easily. I will admit that the use of EO.Pdf was easy, once I added the dll to the libarary and added the namespace, it took one line of code to convert either the HTML Code, or the HTML file into a PDF. The downsides I ran into were that they had a stamp on every page (in 16pt font) with their website on it. It also wouldn't pick up half of my images, not sure why. I used a relative URL in the HTML file to the images, and I created the PDF in the same dir as the HTML file.
I do not want to re-create the layout in a PDF, so I think something like iTextSharp is out. I've read a bit about something called like wkhtmltopdf or something strange like that. It sounded good, but needed a wrapper, and I have no clue how to accomplish that, or use it.
I would appreciate suggestions with basic instructions how to use them. Either a library and a couple lines on how to use it. Or if you can tell me how to setup/use the wkhtmltopdf I would be extremely greatful!
Thanks in advance!

I'm using wkhtmltopdf and I'm very happy with it. One way of using it:
Process p = new Process();
p.StartInfo.CreateNoWindow = true;
p.StartInfo.UseShellExecute = false;
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.FileName = "wkhtmltopdf.exe";
p.StartInfo.Arguments = "-O landscape <<URL>> -";
p.Start();
and then you can get a stream:
p.StandardOutput.BaseStream
I used this because I needed a stream, of course you can invoke it differently.
Here is also a discussion about invoking wkhtmltopdf
I just saw, that someone is implementing a c# wrapper for wkhtmltopdf. I haven't tested it, but may be worth a look.

After much searching I decided to use HiQPdf. It was simple to use, fast enough for my needs and the price point was acceptable to me.
var converter = new HiQPdf.HtmlToPdf();
converter.Document.PageSize = PdfPageSize.Letter;
converter.Document.PageOrientation = PdfPageOrientation.Portrait;
converter.Document.Margins = new PdfMargins(15); // Unit = Points
converter.ConvertHtmlToFile(htmlText, null, fileName);
It even includes a free version if you can keep it to 3 pages.
http://www.hiqpdf.com/free-html-to-pdf-converter.aspx
And no, I am in no way affiliated with them.

I recommend ExpertPdf.
ExpertPdf Html To Pdf Converter is very easy to use and it supports the latest html5/css3. You can either convert an entire url to pdf:
using ExpertPdf.HtmlToPdf;
byte[] pdfBytes = new PdfConverter().GetPdfBytesFromUrl(url);
or a html string:
using ExpertPdf.HtmlToPdf;
byte[] pdfBytes = new PdfConverter().GetPdfBytesFromHtmlString(html, baseUrl);
You also have the alternative to directly save the generated pdf document to a Stream of file on the disk.

Converting PDF, Doc and Docx to rtf in c#

I have a requirement for an application that takes Doc, Docx and PDF and converts them to RTF.
The conversion is one way and I do not need to convert back to Doc or PDF.
Has anyone done this and can you recommend a libray? I know there is aspose but it's way to pricey and the licenses are per year so that's not going to work for the company I happen to work for.
I'm ok using more than one library for each of the file types if thats what it takes.
Thanks in advance

Telerik has a nice library to do this. They actually have an entire editor that looks like Microsoft Word. It can open multiple file formats and it saves natively as RTF (although it can save as PDF, DOCX, etc.) The one thing I'm not sure of is opening the PDF and saving as an RTF. I'm not sure that the Telerik library can do that.
Here is a link to the library:
http://www.telerik.com/products/wpf/richtextbox.aspx
For a PDF to RTF library, you could use this:
http://www.sautinsoft.com/products/pdf-focus/index.php

GroupDocs.Conversion Cloud is a REST API that converts all common file formats from on format to another reliably and easily. Its free pricing plan offers 50 free credits per month.
Here is sample code for PDF to RTF from default storage:
// Get App Key and App SID from https://dashboard.groupdocs.cloud/
var configuration = new GroupDocs.Conversion.Cloud.Sdk.Client.Configuration(MyAppSid, MyAppKey);
var apiInstance = new ConvertApi(configuration);
try
{
// convert settings
var settings = new GroupDocs.Conversion.Cloud.Sdk.Model.ConvertSettings
{
StorageName = null,
FilePath = "02_pages.pdf",
Format = "rtf",
ConvertOptions = new RtfConvertOptions(),
OutputPath = "02_pages.rtf"
};
// convert to specified format
List<StoredConvertedResult> response = apiInstance.ConvertDocument(new ConvertDocumentRequest(settings));
Console.WriteLine("Document converted successfully: " + response[0].Url);
}
catch (Exception e)
{
Console.WriteLine("Exception when calling ConvertApi.QuickConvert: " + e.Message);
}
I'm developer evangelist at aspose.

how to convert pdf file to text file using c#.net

currently i have been using the following code and i am using some dll files from pdfbox
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText (doc);
richTextBox1.Text = qq;
using this code i can able to get text file but not in a correct format plz give me a some ideas

Extracting the text from a pdf file is anything but trivial.
To quote from th iTextSharp tutorial.
"The pdf format is just a canvas where
text and graphics are placed without
any structure information. As such
there aren't any 'iText-objects' in a
PDF file. In each page there will
probably be a number of 'Strings', but
you can't reconstruct a phrase or a
paragraph using these strings. There
are probably a number of lines drawn,
but you can't retrieve a Table-object
based on these lines. In short:
parsing the content of a PDF-file is
NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler http://poppler.freedesktop.org/ which is used by the pdf viewers of GNOME and KDE. It has a function called pdftotext() but I have no experience with it. It may be your best free option.

There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.