How to get the most accurate results with Tesseract OCR

I'm building and training Tesseract to recognize passport MRZ codes from a captured photo. I apply the following image pre-processing techniques before the photo is sent to the Tesseract engine:
Binarization
Normalization
Sampling
Denoising
Thinning (optionally)
Furthermore I've already trained the Tesseract engine with the correct font (OCR-B) by creating numerous box files (from 35 or so samples that contain photos taken from textual samples of OCR-B font), fixing any mistakes in the box files, creating training files and finally training the Tesseract engine with all my samples and generating a traineddata file.
However, even after all this, Tesseract 3.04 in C# (engine mode = Default, pagesegmode = Auto) with my custom traineddata still makes simple mistakes such as:
Confusing alphabet characters with numeric ones (or vice versa) for example S and 5, B and 8.
Now for my question, what can I do so that Tesseract produces much more accurate results? My 30 training samples consisted of photos taken from:
Passports
Typed word pages with OCR-B font
Sample of what the input image would look like compared to what Tesseract receives:

Scale the image up to 480% using the ImageMagick convert program, and also apply sharpening and whitening. This gives dramatic improvements. I see better results than many paid OCR programs doing this.
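As a rough sketch of what such a call could look like, the Python snippet below only builds the ImageMagick argument list. The sharpening value (0x1) and the use of -white-threshold 85% for "whitening" are my own guesses at reasonable settings, not values from the answer, and the file names are placeholders.

```python
import subprocess


def build_convert_command(src, dst, scale_percent=480):
    """Build an ImageMagick `convert` call: upscale, sharpen, whiten."""
    return [
        "convert", src,
        "-resize", f"{scale_percent}%",  # scale up, 480% as suggested above
        "-sharpen", "0x1",               # assumed sharpening strength
        "-white-threshold", "85%",       # assumed: force light pixels to white
        dst,
    ]


def run_convert(src, dst):
    """Run the command; requires ImageMagick to be installed and on PATH."""
    subprocess.run(build_convert_command(src, dst), check=True)


# Placeholder file names, printed only to show the resulting command line.
print(" ".join(build_convert_command("mrz.png", "mrz_big.png")))
```

Both numeric settings would need tuning against real photos before feeding the result to Tesseract.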

Related

Tesseract OCR C#: Training the network for unknown font

So I am using Tesseract with C# to read English text and it works like a charm. I use pre-trained data from the Tesseract repo: https://github.com/tesseract-ocr/tessdata
So far, so good. However, I fail to understand how to solve the following situation: I have an image with a maximum of three numbers on it:
I also followed this tutorial in order to train my own data, but midway through I no longer understood what exactly I was doing: https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
In this tutorial, they used some existing font and trained their network accordingly. However, I do not know what this font is. I tried to figure it out myself but was overwhelmed by the huge amount of information about Tesseract and do not have any idea where to start.
I was wondering if the following would be possible: I have lots of pictures looking like that (in fact, every possible character with every possible color; the only difference is that the background is different):
etc...
And with those pictures, I want to train the network, without using any existing font files.
My algorithm right now does not use Tesseract; it just screenshots the position of the numbers and compares them pixel-wise. I do not like this approach though, as the accuracy is something like 60%.
Thanks for your help in advance

Does anyone know any API for OCRing 7-Segment Display for Windows Phone?

I'm trying to develop a Windows Phone 8.1 App but I need to recognize some numbers from different Displays.
I was following this example:
http://bsubramanyamraju.blogspot.com/2014/08/windowsphone-81-optical-character.html
That is using the Microsoft OCR Runtime Library:
https://www.nuget.org/packages/Microsoft.Windows.Ocr/
However, it doesn't work when I try to recognize those kinds of pictures. I also found this site:
https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
Does anyone have a recommendation, or know of any code related to it?
Thanks for sharing your knowledge.

I wish the answer to your question would be "Sure, here it is" with a link to a black-box process-anything OCR tool, but there are several aspects involved, which are best considered separately.
First, there is some work on image pre-processing BEFORE you even consider any OCR. Your image samples are drastically different from each other and cover a full range of issues.
SAMPLE 1 has low contrast, so when it is binarized to black and white layer, which most OCR will perform internally at some stage, there are no characters to process. It looks like this after binarization:
See this OCR Blog post for additional details on image pre-processing: http://www.ocr-it.com/guide-to-better-mobile-images-from-cell-phone-camera-for-higher-quality-ocr.
Secondly, the image has no dpi information in the header, which some OCR technologies use to determine appropriate scaling of the image. Without header information, some OCR programs may set some default dpi, which may or may not match your image, thus affecting the OCR result. This is NOT critical, but preferred if this can be implemented at the time of picture creation.
SAMPLE 2 has sufficient contrast and adaptive binarization returns a clear image. It is also missing the dpi resolution value in the header.
SAMPLE 3 has very clear contrast, but it also has no resolution dpi in the header.
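To illustrate why a low-contrast image like Sample 1 can vanish under a fixed threshold while adaptive binarization still recovers the strokes, here is a minimal pure-Python sketch. The function names, the 128 cutoff, the window radius and the offset are illustrative choices of mine, not what any particular OCR engine does internally.

```python
def global_threshold(img, cutoff=128):
    """Mark a pixel as ink (1) when it is darker than one fixed cutoff."""
    return [[1 if p < cutoff else 0 for p in row] for row in img]


def adaptive_threshold(img, radius=2, offset=5):
    """Mark a pixel as ink when it is clearly darker than its local mean."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if img[y][x] < local_mean - offset else 0
    return out


# Low-contrast toy image: background 120, one vertical stroke of value 100.
low_contrast = [[100 if x == 3 else 120 for x in range(7)] for _ in range(5)]

for row in global_threshold(low_contrast):
    print(row)   # every pixel counts as ink, so the stroke is lost
print()
for row in adaptive_threshold(low_contrast):
    print(row)   # only the stroke column is marked as ink
```

On real photos the window would be far larger than the stroke width, and a library implementation would be much faster than these nested loops.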
Once you have images that are optimized for OCR processing, the next step is to look at OCR technologies.
I did NOT test the ones you mentioned, assuming you had a correct implementation and yet no success with them. I tested other OCR tools I have used in the past.
In general, there is no 7-segment OCR known to me. However, I was able to adapt other generic OCR for this specialized task. Every OCR I tried 'out-of-box' or with default settings was unable to handle this recognition. And that is logical and expected. Why? Because most generic OCR is written to recognize inseparable pixel patterns for each character. This is related to the "character separability" principle used to separate words into separate characters. In other words, inner OCR algorithms look for connected strokes which make up each character. More powerful commercial OCR allows some breaks in pixel patterns, but they are expected to be minimal to none, like defects in print or scan, which may result in missing character pieces.
7-segment display by nature will have multiple breaks in each character, conflicting with the character separability principle.
More powerful OCR technologies have a) more tolerance to breaks in pixel patterns and/or b) have special settings to handle these cases.
I will perform further testing with the OCR-IT web-based OCR API platform, which is well known to me. I worked as a developer on its OCR capabilities. I also use it extensively in my own iOS and Android apps. The OCR-IT API is based on a strong commercial OCR engine, so it has good tolerance to character imperfections as well as some controls to help in this case.
SAMPLE 3. This is the easiest sample to process, so I tested it first. Using OCR-IT API, and making a request with default settings, requesting the output to TXT format, I get the following:
It appears that OCR is a) segmenting characters into two separate lines, and b) trying to read the resulting patterns as close as possible to valid characters.
Based on this quick analysis, making one adjustment to the OCR settings results in the following recognition:
The setting that made substantial difference in OCR result is switching from default print type to using "DotMatrix", which is in the middle of this entire OCR-IT API settings XML:
<Job>
  <InputURL>http://i.stack.imgur.com/wOtFx.jpg</InputURL>
  <CleanupSettings>
    <Deskew>false</Deskew>
    <RemoveGarbage>false</RemoveGarbage>
    <RemoveTexture>false</RemoveTexture>
    <RotationType>NoRotation</RotationType>
  </CleanupSettings>
  <OCRSettings>
    <PrintType>DotMatrix</PrintType>
    <OCRLanguage>English</OCRLanguage>
    <SpeedOCR>false</SpeedOCR>
    <AnalysisMode>MixedDocument</AnalysisMode>
    <LookForBarcodes>false</LookForBarcodes>
  </OCRSettings>
  <OutputSettings>
    <ExportFormat>Text</ExportFormat>
  </OutputSettings>
</Job>
The use of the DotMatrix print type turned on the algorithms needed to increase tolerance for breaks in character structure, which commonly occur in dot-matrix prints. Alternatively, a "Typewriter" print type could be used, since character breaks are also expected in typewritten fonts and are therefore handled automatically by OCR.
There could be one more change to the API setting to run OCR using "Digits" character set (language), effectively eliminating any possibility of misreading 1 as I, etc.
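Where an engine offers no digits-only setting, a rough substitute is to clean up the recognized text afterwards. The sketch below is plain post-processing, not the OCR-IT "Digits" setting itself, and the lookalike table is my own assumption about which confusions are likely on a numeric display.

```python
# Hypothetical lookalike table: letters commonly confused with digits.
LOOKALIKES = str.maketrans({
    "I": "1", "l": "1", "|": "1",
    "O": "0", "o": "0",
    "S": "5", "B": "8", "Z": "2",
})

ALLOWED = set("0123456789.-")


def to_digits(text):
    """Map letter lookalikes to digits, then drop anything non-numeric."""
    mapped = text.translate(LOOKALIKES)
    return "".join(ch for ch in mapped if ch in ALLOWED)


print(to_digits("I2.5O"))
```

This only makes sense when the display is known to show numbers only; a digits-only character set inside the engine is preferable because it influences recognition itself rather than patching the result.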
SAMPLE 2. In this sample, the gaps in each character's structure are much wider. Even standard algorithms for handling DotMatrix or Typewriter print types cannot accommodate these wide gaps. The use of all possible setting variations returned something like this:
Character segmentation seems to be the issue. One technical solution goes back to image pre-processing. A simple algorithm can be implemented to fill in gaps between each segment of the 7-segment character. It does not have to be very precise, something like this:
But that is enough to produce a perfect OCR result.
Since it may be unknown in advance which 7-segment LCD displays will require filled-in gaps and which will not, I recommend applying this algorithm to all LCD 7-segment images, with small or large gaps. I would limit the size of the gap to no wider than the width of a segment. Given that these screens come in various background and segment colors, this pre-processing algorithm can be substantially simplified if it is performed on a binarized (black & white) image.
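One simple way to fill such gaps on a binarized image is a morphological closing (dilate, then erode). This is my own choice of technique for illustration; the answer above does not name the algorithm it used. In the pure-Python sketch below, 1 means ink, and a radius r closes gaps of up to about 2*r pixels, so r would be picked at roughly half the segment width.

```python
def _pad(img, r):
    """Surround the image with r background pixels on every side."""
    width = len(img[0]) + 2 * r
    top = [[0] * width for _ in range(r)]
    bottom = [[0] * width for _ in range(r)]
    middle = [[0] * r + list(row) + [0] * r for row in img]
    return top + middle + bottom


def _window(img, y, x, r):
    """Yield pixels in the (2r+1) square around (y, x); outside counts as 0."""
    h, w = len(img), len(img[0])
    for j in range(y - r, y + r + 1):
        for i in range(x - r, x + r + 1):
            yield img[j][i] if 0 <= j < h and 0 <= i < w else 0


def dilate(img, r):
    return [[int(any(_window(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img, r):
    return [[int(all(_window(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def close_gaps(img, r):
    """Morphological closing: fills gaps up to about 2*r pixels wide."""
    closed = erode(dilate(_pad(img, r), r), r)
    return [row[r:len(row) - r] for row in closed[r:len(closed) - r]]


# Two stacked vertical segments with a one-pixel gap between them.
segment = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in close_gaps(segment, 1):
    print(row)
```

Keeping the radius small matters: too large a value would also merge neighbouring digits or fill the hollow interior of a character.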
Overall, this task is possible with OCR and near out-of-box functionality, assuming that some image pre-processing is performed. In general, I believe that image pre-processing is required for any OCR-related project anyway, specific to that project.
If you have any further questions about OCR or image pre-processing, pm me.

It has been a while since Ilya's answer, but thanks to his advice and to other answers, especially this one:
Seven Segment Optical Character Recognition
I was able to create my own class in C#:
https://github.com/FANMixco/7-segment-ocr-reader/blob/master/OCR/SevenSegmentOCR.cs
Feel free to use it and improve it.

How to OCR email address

I am trying to OCR and extract the email address from the images. Each image is supposed to contain one line of text, which is the email address.
I am using EmguCV.OCR to extract the text (email address) from those images. The target is to have 100% accurate result.
We can fix the font and size of the text. For example Arial, 12pt, so that all the images will have the email written in Arial 12pt, black on a white background.
The problem is that Tesseract OCR in EmguCV is not recognizing the text properly. It recognizes only 80% of the characters accurately.
I am using preprocessing with Leptonica library.
Here are some sample images I am trying to recognize.
Is there any way to achieve the target of 100% accuracy?

With those sample images I can suggest two ways to solve the same problem. In those images JPEG artifacts are present (the result of lossy compression). Because of this, the letters become connected to each other (zoom in on the image in a program where you can see the actual pixels; Windows Photo Viewer worked fine for me). Tesseract OCR relies on spacing between letters (it uses connected components) to do character recognition. Having any pieces connected throws off the recognition process, which means it tries to recognize the combination "co" as one letter.
Two possible solutions:
I'm not sure what preprocessing steps are already being done, but you'll want to do some thresholding to remove the lighter shades in the image (disconnecting the characters). However, you have to be careful with this, as it may remove more than you want.
If at any time during this process you have a higher-resolution image, or a non-JPEG, lossless format (e.g. PNG), then keep it in that format as you do the other processing steps. Try to avoid any lossy compression that might happen. It sounds like these images don't come to you as shown above. This is the preferable solution, as you won't risk losing too much data.
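To make the connected-components point concrete, here is a small pure-Python sketch. The toy image and both threshold values are made up for illustration: two dark glyphs are joined by a gray bridge of the kind JPEG compression leaves behind, and a stricter threshold separates them again.

```python
def count_components(img, threshold):
    """Count 4-connected groups of pixels darker than `threshold`."""
    h, w = len(img), len(img[0])
    ink = [[p < threshold for p in row] for row in img]
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not ink[y][x] or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            stack = [(y, x)]
            while stack:  # flood fill from this seed pixel
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and ink[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


W, G, K = 255, 150, 0  # white paper, gray JPEG artifact, black ink
glyphs = [
    [W, W, W, W, W, W, W],
    [W, K, K, W, K, K, W],
    [W, K, K, G, K, K, W],  # the gray pixel bridges the two glyphs
    [W, K, K, W, K, K, W],
    [W, W, W, W, W, W, W],
]
print(count_components(glyphs, 200))  # lenient cutoff keeps the bridge
print(count_components(glyphs, 100))  # stricter cutoff drops the bridge
```

The same caveat applies as in the answer: a cutoff strict enough to break the bridges may also eat thin strokes of the real letters.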

I tried to recognize your images with ABBYY Cloud OCR SDK and got 100% accuracy.
You can use Demo Tool to make sure of recognition accuracy.
I work for ABBYY and can give you more information about our technologies if you need.

Could someone perform an OCR on this image successfully?

I've tried some demos downloaded from the web to test OCR on this image. The characters in the image are not as well-formed as the printed characters you see when you type in a TextBox. I'm not experienced with OCR or neural networks. These are my images: https://sites.google.com/site/thecabinet3/home/files-store/sample.bmp?attredirects=0 and https://sites.google.com/site/thecabinet3/home/files-store/6bi.bmp?attredirects=0
I have some questions here:
Do I have to re-train the neural network with these new non-standard characters? The network has already been trained using a standard character set (by standard I mean characters that look like what you see when typing with a specified font into a TextBox).
Could you successfully perform OCR on the images I uploaded using some example? If so, please share that working example.
Your help would be highly appreciated!

I tested your image in a commercial high-quality webservice OCR and received 100% recognition result out of box.
65 -HC
0999
I looked at your sample, and in my experience it has enough quality and character definition to produce a high-quality result in any decent OCR system, unless your algorithm is very sensitive to rough edges of character patterns.
I am not sure if your need is academic or commercial. Last time I used neural networks for OCR was very many years ago in college, but not in commercial implementations due to training limitations.

recognizing letter in image OCR

I searched with no result. If this duplicates another topic, please delete it so it doesn't make a mess.
I have a question about image recognition (OCR) using C#.
I am working on an image which shows a Scrabble game.
First I converted the image to grayscale, thresholded it to keep only the black letters, and then used a median filter to remove the borders around the letters.
Now, how do I get from here to a function which will recognize a letter? Should I somehow separate each letter first ('foreach' letter :-)) or just start recognizing? What problems should I pay attention to before I start recognizing? Any sources will be welcome. :-)
Any idea will be also very helpful.

Well, this is a big, vast discussion...
If you don't want to reinvent the wheel, I suggest you use Tesseract, an open source OCR in C++, or Tessnet2 which is a .NET wrapper around it.
I had the same approach as yours, i.e. grayscale then thresholding etc. in .NET; then I adapted Tesseract a bit (Tessnet2 is a wrapper around Tesseract 2.0, not 3.x, which is the current version) to get a good interface for the OCR output, and now I have good results.
Here is another subject on OCR and Tesseract on Stack Overflow which describes more precisely what it consists of.
Of course, if what you want is to play with OCR concepts and make them work by yourself, this is not the right way :-)
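For anyone who does want to experiment by hand, the grayscale, threshold and median steps mentioned above can be sketched in a few lines of pure Python. The luma weights are the common Rec. 601 ones; the 128 cutoff and the 3x3 window are arbitrary illustrative choices, and none of this is Tessnet2 code.

```python
def to_gray(rgb_img):
    """Convert rows of (r, g, b) tuples to 0-255 gray values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_img]


def threshold(gray, cutoff=128):
    """Dark pixels become ink (1), everything else background (0)."""
    return [[1 if p < cutoff else 0 for p in row] for row in gray]


def median3x3(binary):
    """3x3 median filter; border pixels are copied unchanged."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(binary[j][i]
                            for j in (y - 1, y, y + 1)
                            for i in (x - 1, x, x + 1))
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out


WHITE, BLACK = (255, 255, 255), (0, 0, 0)

# A lone dark speck on white paper is removed by the median filter.
speck = [[WHITE] * 5 for _ in range(5)]
speck[2][2] = BLACK
for row in median3x3(threshold(to_gray(speck))):
    print(row)
```

Note that the same filter also rounds off the corners of real letters, so the window size is a trade-off between removing noise and preserving thin strokes.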
