Currently I'm very interested in image processing and optical character recognition. After some basic recognition and some filters, I decided to start on something more difficult.
I'm trying to read the value out of these captchas:
http://img851.imageshack.us/img851/9579/57859946.png
I have written some filters for pre-processing:
- Replace color (to white)
- Remove blue lines
- Remove the two lines that go through the text
- Threshold image (255)
Which outputs an image like this:
http://img232.imageshack.us/img232/2325/00i3q45j1zt.png
As you can see, there are holes in some letters. I first thought it might be better to leave the lines through the letters, but that made it worse. I'm using the Tesseract OCR engine,
and I trained it using the Elephant font (the font the captcha uses). I also tried
other OCR engines like GOCR, but they made everything worse. With Tesseract I now have a recognition rate of 20%. I'm coding in C# (.NET 4.0).
The captcha is generated by a software package named PHPCaptcha.
Now my question is:
Is there any algorithm or trick to fill up the holes in the letters? And is there any other way to get better recognition?
I'm excited to hear from you guys :)
Greetings,
Part 0 - Preface
i) Beforehand, you may want to read my OCR-related answer here, which may give you some tricks for using Tesseract.
ii) I assume you could just turn everything into black and white (in your case, processing in color doesn't give you an edge).
Part 1 - Preprocessing
To fill the holes after you've removed the blue lines, you can always dilate, or perform a 'dilate-then-erode' operation. Here, dilation means you enlarge every pixel in 8 directions (making a bigger pixel). Once you've dilated the pixels, see if the characters can be recognized, or whether they are 'over-filled' (dilated too much). If the characters cannot be recognized or are dilated too much, you can then apply an erosion operation. Of course there are advanced synthesis algorithms, but I think you are better off starting with a simpler image processing operation first. A sketch of the idea is below.
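For illustration, here is a minimal sketch of dilate-then-erode (a morphological closing), assuming the OpenCvSharp package; the file names and the 3x3 kernel size are placeholders you would tune to the stroke width:

using OpenCvSharp;

class FillHoles
{
    static void Main()
    {
        // Load the thresholded captcha as a single-channel image
        Mat src = Cv2.ImRead("captcha_thresholded.png", ImreadModes.Grayscale);
        Mat dst = new Mat();

        // 3x3 rectangular kernel; enlarge it if the holes are wider than one pixel
        Mat kernel = Cv2.GetStructuringElement(MorphShapes.Rect, new Size(3, 3));

        // Closing = dilation followed by erosion: fills small gaps without
        // permanently thickening the character strokes
        Cv2.MorphologyEx(src, dst, MorphTypes.Close, kernel);

        Cv2.ImWrite("captcha_closed.png", dst);
    }
}

If closing alone over-fills the characters, run the dilation and erosion separately (Cv2.Dilate, Cv2.Erode) and inspect the intermediate result, as described above.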
Part 2 - OCR/Tesseract
With Tesseract, if you feed it the whole image, it will perform line analysis and so on and so forth. Since characters in a captcha don't behave like normal text, doing line analysis or recognizing them as a group may somewhat deteriorate the recognition rate. So my suggestion is to recognize character by character first; a rough sketch follows.
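For example, with the C# Tesseract wrapper (https://github.com/charlesw/tesseract), a single pre-segmented character might be recognized roughly like this; the tessdata path, the language name, and the file name are placeholders for your own trained data:

using System;
using Tesseract;

class SingleCharOcr
{
    static void Main()
    {
        // "./tessdata" and "eng" stand in for your own trained data
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile("single_char.png"))
        // SingleChar skips layout analysis and treats the image as one character
        using (var page = engine.Process(img, PageSegMode.SingleChar))
        {
            Console.WriteLine("{0} (confidence: {1})", page.GetText(), page.GetMeanConfidence());
        }
    }
}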
Related
I would like to ask for some insight/assistance on how I might improve my OCR accuracy. My target images are low resolution (screenshots) and I would very much prefer not to upscale them, as my program needs to perform fast.
I have two images. I see no apparent difference between them; however, Tesseract is having trouble with one.
[image 1 and image 2: broken links removed]
The first image is the issue; the result I am getting is: 251\n41\n31\n\n11\n11\n\n11\n
As you can see, something is wrong with how it handles the spacing: there are two newlines in a row where things start to go wrong.
Meanwhile, for the second image I get the expected result: 300\n60\n40\n\n1\n15\n15\n10\n6\n15\n
These images were created through the following preprocessing steps:
image.Alpha(AlphaOption.Remove); // Flatten the alpha channel
image.BlackThreshold(new Percentage(27)); // Force pixels below 27% intensity to black
image.Negate(); // Original image has white text on black background
I have limited tesseract's charset to only digits (01234567890-).
I have tried various segmentation modes (SparseText, SingleColumn, SingleBlock). I am running Tesseract 4.1. Do you guys have any pointers?
Or maybe you could tell me what resize algorithm is fast and good for OCR?
If you are having issues with Tesseract and are considering a more robust library that doesn't need training, you could try a commercial library such as Leadtools. With the Leadtools OCR toolkit, I was able to get perfect results for both images with only the basic image processing built into the OCR demo. There are, however, more sophisticated image processing functions that you can use for more complex tasks if need be. Besides the OCR demo, I was also able to get the same results, in JSON form, without any preprocessing, from one of the tutorials linked below. As a disclaimer, I work for this vendor.
https://www.leadtools.com/help/sdk/v21/tutorials/dotnet-console-export-ocr-results-to-json.html
Here's some simplified source code that would achieve the same task for one image and print out the raw text:
// Create a new OCR engine with the default settings
IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD);
ocrEngine.Startup(null, null, null, null);
// Create an OCR document to hold everything
using (IOcrDocument ocrDocument = ocrEngine.DocumentManager.CreateDocument()) {
    // Add the input image as a new page
    using (IOcrPage page = ocrDocument.Pages.AddPage(inputFilename, null)) {
        // Perform OCR on just the one page
        page.Recognize(null);
        // Build a string from the recognized characters
        string text = page.GetText(0);
        // Show the output
        Console.WriteLine($"text: '{text}'");
    }
}
The results I got for the two images were "251\r\n41\r\n31\r\n1\r\n11\r\n11\r\n7\r\n4\r\n11\r\n" and "300\r\n60\r\n40\r\n1\r\n15\r\n15\r\n10\r\n6\r\n15\r\n".
I'm using a C# wrapper for the Tesseract library (3.02 if I'm not mistaken) (https://github.com/charlesw/tesseract). I've got it running and giving output, but that output is essentially garbage: often it gives nothing, and when it does give something it's often a mess. I know it works in principle because I've tried it on some really clean images and it performs well. I'm wondering if someone can help me diagnose the issues and suggest ways to improve the accuracy. I've already converted all the images to black and white, and the resolution is set at 300x300. I don't do any line straightening programmatically, but as you can see below they're pretty straight.
This image works perfectly
This one does not work at all, producing either gibberish or nothing at all
I tried flipping the colors on some examples, thinking that it might give greater contrast (since most text is black on a white background, whereas the working ones were white text on a black background). But one of the inverted images does not work at all, whereas another again works perfectly. [broken image links removed]
I suspect this has something to do with the additional spacing between the letters in "INVOICE." But there must be some way to get decent results with a tighter font. Any suggestions are welcome, I'm a relative noob here.
If possible, you should consider using pictures with a higher resolution. The other problem with the Payments image is probably that the gap between the letters is too small. Tesseract cannot detect single letters if they are (almost) connected to the next letter of the word.
I would suggest an image processing library like OpenCV to improve your results.
You could try erosion/dilation; this will separate the letters if the right parameters are used for the kernel. Try different kernels to see what works best for you. For example:
int erosion_size = 1; // example value; tune to your font
int erosion_type = MORPH_RECT; // rectangular structuring element
Mat element = getStructuringElement(erosion_type,
    Size(2 * erosion_size + 1, 2 * erosion_size + 1),
    Point(erosion_size, erosion_size));
erode(src, erosion_dst, element);
What helped me a lot when I was working on my project was using an adaptive threshold. I found this to be far more effective than just turning the image into a grayscale or binary image.
Note: this is Java code; it should be very similar in C++, though.
Imgproc.adaptiveThreshold(cropedIm, cropedIm, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY, 29, 10);
This is what I get after selecting one of your images in Pixtern, an Android project of mine (source code on GitHub). I was using the adaptive threshold but no dilation/erosion, and the result is already quite good.
[broken links removed]
For the Payments image and similar ones:
Try using a normal threshold and inverting the image (black font, white background). Again, dilation/erosion can be used afterwards. Java code:
//results in binary image
Imgproc.threshold(cropedIm, cropedIm, 127, 255, Imgproc.THRESH_BINARY);
//Inverting image
Core.bitwise_not(cropedIm, cropedIm);
Tesseract expects whole pages, or rather, it was trained on those.
If you give it one or two characters or words, it won't work well.
I assume you have more of these images. Stitch them together as lines of text, each image becoming one line after the previous, and it should work much better; a sketch of that stitching is below.
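For illustration, a minimal sketch of that stitching with System.Drawing; the padding value and the white background are assumptions you may want to adjust:

using System;
using System.Collections.Generic;
using System.Drawing;

static class PageBuilder
{
    // Stack several small snippets vertically into one page-like image so
    // Tesseract can run its usual layout analysis on it.
    public static Bitmap StackVertically(IList<Bitmap> images, int padding)
    {
        int width = 0, height = padding;
        foreach (var img in images)
        {
            width = Math.Max(width, img.Width);
            height += img.Height + padding;
        }
        var page = new Bitmap(width + 2 * padding, height);
        using (var g = Graphics.FromImage(page))
        {
            g.Clear(Color.White); // white background, like a scanned page
            int y = padding;
            foreach (var img in images)
            {
                g.DrawImage(img, padding, y); // each snippet becomes one "text line"
                y += img.Height + padding;
            }
        }
        return page;
    }
}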
Furthermore, make sure you set the psm parameter correctly when using Tesseract. More on this: https://www.pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
Say I have these boxes, some of which are black and some white.
The image shows a U shape drawn with the black boxes. Now say I have a matrix of 1s and 0s (it can be a huge matrix) like this:
111111111111111111
111111111111111111
111111111111111111
111111111101111111
111111111101111111
111011111101111111
111011111101111111
111011111101111111
111011111101111111
111011111101111111
111011111101111111
111100000011111111
111111111111111111
which shows zeros forming roughly the shape shown in the image. The image and the matrix are just examples. The image is a screenshot of a piece of software where we draw patterns, which then need to be located in given matrices (inside simple text files).
What I'm looking for is guidance on how to get started on this, because I have never programmed anything related to pattern recognition, which this problem clearly seems to be. This is all I have to do: a pattern is given and has to be matched against a matrix of 0s and 1s. I don't think I can write it on my own in a few days. I'm writing code in C# with Visual Studio 2013, so I'm hoping I can find some libraries that would let me achieve this with minimal dependencies. Thanks
I think you need to provide a bit more information on what exactly you're looking for. Are the shapes all letters or arbitrary shapes?
Whatever you're looking for, I'd start with EmguCV. It's a pretty comprehensive library that isn't too difficult to use.
EmguCV has a lot of OCR (optical character recognition) functions which should be able to pick out letters pretty well; a basic call is sketched below.
I don't have as much experience using it for arbitrary shape detection, but I think SURF detection, which EmguCV also does, might be a good way to go. It attempts to match a given image against features in another image.
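For the OCR side, a hedged sketch with Emgu CV's Tesseract wrapper might look like this; the tessdata path and image name are placeholders, and the exact types can vary between Emgu CV versions:

using Emgu.CV;
using Emgu.CV.CvEnum;
using Emgu.CV.OCR;

class OcrSketch
{
    static void Main()
    {
        // "./tessdata" and "eng" are placeholders for your trained data
        using (var ocr = new Tesseract(@"./tessdata", "eng", OcrEngineMode.Default))
        using (var img = CvInvoke.Imread("shapes.png", ImreadModes.Grayscale))
        {
            ocr.SetImage(img);
            ocr.Recognize();
            System.Console.WriteLine(ocr.GetUTF8Text());
        }
    }
}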
People never draw at exactly the same place and scale as your stored data.
The things you want are often done with neural networks (there is one in AForge, too).
But it might be hard to (a) understand it and (b) use it in your code.
So maybe you could try it like this: get the first position, then record the delta positions.
Try to find long lines and their next direction, and store the general direction changes.
The above sample would be "down, right, up"; you might also store some length information.
Then there is some math to check how different two sets are, for example the edit distance between strings (like PHP's levenshtein function). I can't think of a Levenshtein function in C#, though; I don't think C# is that rich in string functions, but once you've seen the PHP one I'm sure you can derive something for C# (a sketch is below).
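For reference, a minimal Levenshtein sketch in C# (standard dynamic programming, not specific to this problem):

using System;

static class EditDistance
{
    public static int Levenshtein(string a, string b)
    {
        // d[i, j] = edit distance between the first i chars of a and the first j chars of b
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // delete everything
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insert everything
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;  // substitution cost
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,          // deletion
                    d[i, j - 1] + 1),         // insertion
                    d[i - 1, j - 1] + cost);  // substitution
            }
        return d[a.Length, b.Length];
    }
}

Comparing the direction string of a drawn pattern against the stored ones then gives you a tolerance for sloppy drawing: the smaller the distance, the better the match.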
Let me start with this: I can't zip it or anything similar.
What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a space to divide the numbers. The 12 means 12 pixels in a row use that color. The h is to divide the numbers again.)
The problem I'm having is that the blocks aren't all the same length. They range from 0g1h to 2546g115h. Basically I want to create a palette of common patterns to hopefully save space. Say I have 12g345h19g12h190g11h occurring at least three times; then I could save space if I had something like a=12g345h19g12h190g11h in the palette array and just put 'a' in the string. Or I could even ignore the data-block boundaries; as you can see in the attached file, you get g640h a ton of times.
I could be wrong, but I'm pretty sure this could work. If you have a better idea how I could save space and not lose data, I'm more than open to the ideas.
Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.
You could use a dictionary (probably Dictionary<string, int>) and just count how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements. The counting step is sketched below.
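A rough sketch of that counting step, assuming the data has already been split into blocks like "12g345h" (the splitting itself depends on your exact format):

using System.Collections.Generic;
using System.Linq;

static class PaletteBuilder
{
    // blocks: the run-length blocks parsed out of the long string
    public static IEnumerable<string> FrequentBlocks(IEnumerable<string> blocks)
    {
        var counts = new Dictionary<string, int>();
        foreach (string block in blocks)
        {
            int n;
            counts.TryGetValue(block, out n);
            counts[block] = n + 1; // count every occurrence
        }
        // blocks seen at least three times are candidates for the palette
        return counts.Where(kv => kv.Value >= 3)
                     .OrderByDescending(kv => kv.Value)
                     .Select(kv => kv.Key);
    }
}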
However, I would recommend that you read up a little on compression algorithms. What you are implementing appears to be a run-length encoding (RLE) scheme, and you are then trying to compress again on top of that. Consider looking at how sliding-window compression works (which is what GZIP uses) as an alternative to your RLE. Or look at Huffman coding as a mechanism to reduce the space needed for the codewords you are creating (in simple terms, Huffman coding uses shorter symbols for more frequent patterns and longer symbols for less frequent patterns, in an 'optimal' way).
This is a fun problem space to play in! Good Luck!
I am looking for some morphological functions and edge linking in C#, corresponding to these MATLAB functions.
BW = binary image; the operations I'm looking for are:
'clean'
Removes isolated pixels (individual 1s that are surrounded by 0s), such as the center pixel in this pattern
'skel'
With n = Inf, removes pixels on the boundaries of objects but does not allow objects to break apart. The pixels remaining make up the image skeleton. This option preserves the Euler number
If somebody knows of a link or some code, it would be helpful. Regards,
You can use this application/library:
Image Processing Lab in C#
Image processing is a complex topic, but a median filter may meet your needs; on a binary image, a 3x3 median filter removes isolated pixels, much like MATLAB's 'clean'. If not, then this is at least a good framework in which to implement your own filtering algorithm, such as the sketch below.
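If the library doesn't cover it, here is a minimal hand-rolled sketch of the 'clean' operation, assuming the image is already a 0/1 matrix ('skel' is considerably more involved, so a library is the better route there):

static class Morphology
{
    // Remove isolated pixels: a 1 whose eight neighbors are all 0 becomes 0,
    // mirroring MATLAB's bwmorph(BW, 'clean'). Border pixels are left as-is.
    public static int[,] Clean(int[,] bw)
    {
        int h = bw.GetLength(0), w = bw.GetLength(1);
        int[,] result = (int[,])bw.Clone();
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++)
            {
                if (bw[y, x] == 0) continue;
                int neighbors = 0; // sum of the 8-neighborhood
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        if (dy != 0 || dx != 0)
                            neighbors += bw[y + dy, x + dx];
                if (neighbors == 0)
                    result[y, x] = 0; // isolated pixel: remove it
            }
        return result;
    }
}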