I'm using a c# wrapper for the Tesseract library (3.02 if I'm not mistaken) (https://github.com/charlesw/tesseract). I've got it running and giving output, but that output is essentially garbage. Often it gives nothing and when it does give something it's often a mess. I know it's theoretically working because I've tried it on some really perfect images and it works. I'm wondering if someone can help me diagnose the issues and suggest some ways I can improve Tesseract accuracy. I've already converted all the images to black and white and the resolution is set at 300x300. I don't do any line straightening programmatically but as you can see below they're pretty straight.
This image works perfectly
This one does not work at all, producing either gibberish or nothing at all
I tried flipping the colors on some examples, thinking that it might give greater contrast (since most text is black on a white background, whereas the working ones were white text on black background). But:
Does not work at all, whereas
Again works perfectly.
I suspect this has something to do with the additional spacing between the letters in "INVOICE." But there must be some way to get decent results with a tighter font. Any suggestions are welcome, I'm a relative noob here.
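For context, the recognition call itself is about as minimal as it gets; roughly this (paths, language and file names are placeholders):
using System;
using Tesseract;

class OcrSample
{
    static void Main()
    {
        // "./tessdata" must contain the traineddata matching the Tesseract version the wrapper targets
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(@"invoice.png"))   // already converted to black and white
        using (var page = engine.Process(img))
        {
            Console.WriteLine(page.GetText());
            Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());
        }
    }
}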
If possible, you should consider using pictures with a higher resolution. The other problem with the Payments image is probably that the gap between the letters is too small: Tesseract cannot detect single letters if they are (almost) connected to the next letter of the word.
I would suggest an image processing library like OpenCV to improve your results.
You could try erosion/dilation. This will separate the letters if the right parameters are used for the kernel. Use different kernels to see what works best for you.
// Erode with a rectangular kernel; erosion_size controls how aggressively the strokes are thinned
int erosion_size = 1;
Mat element = getStructuringElement(MORPH_RECT,
                                    Size(2 * erosion_size + 1, 2 * erosion_size + 1),
                                    Point(erosion_size, erosion_size));
erode(src, erosion_dst, element);
What helped me a lot when I was working on my project was using an adaptive threshold. I found this to be far more effective than just turning the image into a grayscale or binary image.
Note: this is Java code, but it should be very similar in C.
// Gaussian-weighted adaptive threshold: 29x29 neighbourhood, constant 10 subtracted from the local weighted mean
Imgproc.adaptiveThreshold(cropedIm, cropedIm, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY, 29, 10);
This is what I get after running one of your images through Pixtern, an Android project of mine (source code on GitHub). I was using the adaptive threshold but no dilation/erosion, and the result is already quite good.
[broken links removed]
For the Payments image and similar ones:
Try using a normal threshold and inverting the image (black font, white background). Again, dilation/erosion can be used afterwards. Java code:
//results in binary image
Imgproc.threshold(cropedIm, cropedIm, 127, 255, Imgproc.THRESH_BINARY);
//Inverting image
Core.bitwise_not(cropedIm, cropedIm);
Tesseract expects whole pages, or rather it was trained on those.
If you give it only one or two characters or words, it won't work well.
I assume you have more of these images. Stitch them together as lines of text, each image forming a line after the previous one, and it should work much better.
Furthermore, make sure you set the psm parameter correctly when using Tesseract; for example, see the sketch below. More on this: https://www.pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
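For example, with the C# wrapper from the question you can pass a page segmentation mode per call (a sketch; the stitched image name is illustrative):
using System;
using Tesseract;

class PsmSample
{
    static void Main()
    {
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(@"stitched_lines.png"))
        // Tell Tesseract the image is a single uniform block of text rather than a full page
        using (var page = engine.Process(img, PageSegMode.SingleBlock))
        {
            Console.WriteLine(page.GetText());
        }
    }
}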
I would like to ask for some insight/assistance on how I might improve my OCR accuracy. My target images are low-resolution screenshots, and I would very much prefer not to upscale them, as my program needs to be fast.
I have two images. I see no apparent difference between them; however, Tesseract is having trouble with one of them.
The first image is the issue, the result I am getting is: 251\n41\n31\n\n11\n11\n\n11\n
As you can see, there is something wrong with how it's handling the spacing. There are two consecutive newlines where things start to go wrong.
Meanwhile, in the second image I get the expected result: 300\n60\n40\n\n1\n15\n15\n10\n6\n15\n
These images were created through the following preprocessing steps:
image.Alpha(AlphaOption.Remove);           // flatten/remove the alpha channel
image.BlackThreshold(new Percentage(27));  // force dark pixels to pure black
image.Negate();                            // original image has white text on black background
I have limited tesseract's charset to only digits (01234567890-).
I have tried various segmentation modes (SparseText, SingleColumn, SingleBlock). I am running Tesseract 4.1. Do you guys have any pointers?
Or maybe you could tell me which resize algorithm is fast and still good for OCR?
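For reference, this is roughly how the engine is configured (a sketch using the charlesw/tesseract wrapper that mirrors what I described above; file names are placeholders):
using System;
using Tesseract;

class DigitOcr
{
    static void Main()
    {
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        {
            // Digits plus the minus sign, as described above
            engine.SetVariable("tessedit_char_whitelist", "0123456789-");

            // One of the segmentation modes I've been trying (SparseText / SingleColumn / SingleBlock)
            engine.DefaultPageSegMode = PageSegMode.SingleColumn;

            using (var img = Pix.LoadFromFile(@"screenshot_preprocessed.png"))
            using (var page = engine.Process(img))
            {
                Console.WriteLine(page.GetText());
            }
        }
    }
}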
If you are having issues with Tesseract and are considering a more robust library that doesn't require training, you can try a commercial library such as LEADTOOLS. With the LEADTOOLS OCR toolkit, I was able to get perfect results for both images with only the basic image processing built into the OCR demo. There are, however, more sophisticated image processing functions that you can use for more complex tasks if need be. Besides the OCR demo, I was also able to get the same results, in JSON form, without any preprocessing, by following the tutorial linked below. As a disclaimer, I work for this vendor.
https://www.leadtools.com/help/sdk/v21/tutorials/dotnet-console-export-ocr-results-to-json.html
Here's some simplified source code that would achieve the same task for one image and print out the raw text:
// Create a new OCR engine with the default settings
IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD);
ocrEngine.Startup(null, null, null, null);

// Create an OCR document to hold everything
using (IOcrDocument ocrDocument = ocrEngine.DocumentManager.CreateDocument())
{
    // Add the input image as a new page
    using (IOcrPage page = ocrDocument.Pages.AddPage(inputFilename, null))
    {
        // Perform OCR on just the one page
        page.Recognize(null);

        // Build a string from the recognized characters
        string text = page.GetText(0);

        // Show output
        Console.WriteLine($"text: '{text}'");
    }
}
The results I got for the two images were "251\r\n41\r\n31\r\n1\r\n11\r\n11\r\n7\r\n4\r\n11\r\n" and "300\r\n60\r\n40\r\n1\r\n15\r\n15\r\n10\r\n6\r\n15\r\n".
Say I have these boxes, some of which are black and some of which are white.
The image shows a U shape drawn with the black boxes. Now say I have a matrix of 1s and 0s (it can be a huge matrix) like this:
111111111111111111
111111111111111111
111111111111111111
111111111101111111
111111111101111111
111011111101111111
111011111101111111
111011111101111111
111011111101111111
111011111101111111
111011111101111111
111100000011111111
111111111111111111
which shows zeros forming roughly the shape shown in the image. The image and the matrix are just examples. The image is a screenshot from a piece of software in which we draw patterns, which then need to be located in given matrices (stored in simple text files).
What I'm looking for is guidance on how to get started on this, because I have never programmed anything related to pattern recognition, which this problem clearly is. This is all I have to do: given a pattern, match it against a matrix of 0s and 1s. I don't think I can write it on my own in a few days. I'm writing code in C# with Visual Studio 2013, so I'm hoping I can find some libraries that would let me achieve this with minimal dependencies. Thanks.
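As a brute-force starting point before reaching for a pattern-recognition library, you can slide the drawn pattern over the matrix and compare it cell by cell. A sketch (names are illustrative, and it only handles exact position and scale):
static class PatternMatcher
{
    // Scans 'matrix' for an occurrence of 'pattern' (both jagged arrays of 0s and 1s).
    // Only the pattern's 0-cells are required to line up with 0s in the matrix.
    public static bool TryFind(int[][] matrix, int[][] pattern, out int row, out int col)
    {
        int mRows = matrix.Length, mCols = matrix[0].Length;
        int pRows = pattern.Length, pCols = pattern[0].Length;

        for (int r0 = 0; r0 <= mRows - pRows; r0++)
        {
            for (int c0 = 0; c0 <= mCols - pCols; c0++)
            {
                bool match = true;
                for (int r = 0; r < pRows && match; r++)
                    for (int c = 0; c < pCols && match; c++)
                        if (pattern[r][c] == 0 && matrix[r0 + r][c0 + c] != 0)
                            match = false;

                if (match)
                {
                    row = r0; col = c0;   // top-left corner of the first match
                    return true;
                }
            }
        }
        row = col = -1;
        return false;
    }
}
Hand-drawn input will rarely line up exactly in position and scale, which is where the libraries suggested below come in.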
I think you need to provide a bit more information on what exactly you're looking for. Are the shapes all letters or arbitrary shapes?
Whatever you're looking for, I'd start with EmguCV. It's a pretty comprehensive library that isn't too difficult to use.
EmguCV has a lot of OCR (optical character recognition) functions, which should be able to pick out letters pretty well.
I don't have as much experience using it for arbitrary shape detection, but I think SURF detection, which EmguCV also supports, might be a good way to go. It attempts to match features of a given image against features in another image.
People never draw at the exact same place and scale as your stored data.
The kind of thing you want is often done with neural networks (AForge has them too).
But it might be hard to (a) understand them and (b) use them in your code.
So maybe you could try it like this: get the first position, then record the delta positions.
Try to find long lines and their next direction; store the general direction changes.
The sample above would be "down, right, up"; you might also store some length information.
Then there is some math to check how different two sets are, for example the edit distance between strings (like PHP's levenshtein function). I can't think of a built-in Levenshtein function in C#, and I don't think C# is that rich in string functions, but once you understand it you can easily derive one for C#.
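For completeness, Levenshtein distance is only a few lines in C# (a standard dynamic-programming sketch, no library needed):
using System;

static class EditDistance
{
    static int Levenshtein(string a, string b)
    {
        // d[i, j] = edit distance between the first i chars of a and the first j chars of b
        var d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,          // deletion
                    d[i, j - 1] + 1),         // insertion
                    d[i - 1, j - 1] + cost);  // substitution
            }
        }
        return d[a.Length, b.Length];
    }
}
For example, Levenshtein("dru", "drru") is 1, so "down right up" and "down right right up" would be treated as close.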
I want to create tiles out of an equirectangular image, so I want the image to be split into four lateral faces plus top and bottom. Does anyone know of a library that I can import into my C# project and that is able to do something like this?
Depending on your original image format, System.Drawing.Bitmap.Clone(Rectangle, PixelFormat) should do the trick.
More information here.
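For the four lateral faces, a straight crop with Clone looks roughly like this (a sketch; it treats the lateral tiles as simple vertical strips of the source, which is only an approximation of a true cube map):
using System.Drawing;
using System.Drawing.Imaging;

class Tiler
{
    static void Main()
    {
        using (var source = new Bitmap(@"equirect.jpg"))
        {
            // Four equal vertical strips as a rough stand-in for the lateral faces
            int tileWidth = source.Width / 4;
            for (int i = 0; i < 4; i++)
            {
                var region = new Rectangle(i * tileWidth, 0, tileWidth, source.Height);
                using (Bitmap tile = source.Clone(region, PixelFormat.Format32bppArgb))
                {
                    tile.Save("face_" + i + ".png", ImageFormat.Png);
                }
            }
        }
    }
}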
EDIT:
First, let me say that this is not going to answer your question either (not even close), you're seeking a library that already exists for this purpose and I don't know of one personally.
Equirectangular projection is the same as plate carrée (wow, that's humbling), so it's a very simple projection to work with in code.
Here is an example of its use in a GIS application. I don't know what your purposes are, but the math is the same.
One way to do it is to deproject each pixel and then draw it onto a new image, but understand that doing this still requires some sort of projection, because you're going from three dimensions to two.
I wasn't able to find a good example, but an easier or faster way might be to first apply a matrix transform (again, to change projections) and then cut the image into the regions you need.
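If you do go the per-pixel route, the math for one face is short. Here is a sketch for the front face with nearest-neighbour sampling; the method name, face orientation convention and use of GetPixel/SetPixel are all illustrative rather than optimized:
using System;
using System.Drawing;

static class CubeFaces
{
    // Sample the front (+Z) cube face from an equirectangular image.
    static Bitmap ExtractFrontFace(Bitmap equirect, int faceSize)
    {
        var face = new Bitmap(faceSize, faceSize);
        for (int y = 0; y < faceSize; y++)
        {
            for (int x = 0; x < faceSize; x++)
            {
                // Face pixel -> [-1, 1] coordinates on the +Z face of a unit cube
                double u = 2.0 * (x + 0.5) / faceSize - 1.0;
                double v = 2.0 * (y + 0.5) / faceSize - 1.0;
                double dx = u, dy = -v, dz = 1.0;

                // Direction -> longitude/latitude
                double lon = Math.Atan2(dx, dz);                            // [-pi, pi]
                double lat = Math.Atan2(dy, Math.Sqrt(dx * dx + dz * dz));  // [-pi/2, pi/2]

                // Longitude/latitude -> source pixel (nearest neighbour)
                int sx = (int)((lon + Math.PI) / (2 * Math.PI) * equirect.Width) % equirect.Width;
                int sy = Math.Min(Math.Max((int)((Math.PI / 2 - lat) / Math.PI * equirect.Height), 0),
                                  equirect.Height - 1);

                face.SetPixel(x, y, equirect.GetPixel(sx, sy));
            }
        }
        return face;
    }
}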
Like I said, this isn't an answer, but if nothing else it will give you more keywords to google for.
Currently I have a huge interest in image processing and optical character recognition. After some basic recognition and some filters, I decided to start on something more difficult.
I'm trying to read the value out of these captchas:
http://img851.imageshack.us/img851/9579/57859946.png
I have written some filters for pre-processing:
- Replace color (to white)
- Remove the blue lines
- Remove the (two) lines that go through the text
- Threshold the image (255)
Which outputs an image like this:
http://img232.imageshack.us/img232/2325/00i3q45j1zt.png
As you can see, there are holes in some letters. I first thought it might be better to leave the lines through the letters, but that made it worse. I'm using the Tesseract OCR engine,
and I trained it using the Elephant font (the font the captcha uses). I also tried
using other OCR engines like GOCR, but they made everything worse. With Tesseract I now have a recognition rate of about 20%. I'm coding in C# (.NET 4.0).
The captcha is generated by a software package named PHPCaptcha.
Now my question is:
Is there any algorithm or trick to fill up the holes in the letters? And is there any other way to get better recognition?
I'm excited to hear from you guys :)
Greetings,
Part 0 - Preface
i) Beforehand, you may want to read my OCR-related answer here, which may give you some tricks for using Tesseract.
ii) I assume you can just turn everything into black and white (in your case, processing in colour doesn't give you an edge).
Part 1 - Preprocessing
To fill the holes after you've removed the blue lines, you can always dilate, or perform a 'dilate-then-erode' operation. Here, dilation means you enlarge every pixel in eight directions (making a bigger pixel). Once you've dilated the pixels, see if they can be recognized, or whether the characters are 'over-filled' (dilated too much). If the characters cannot be recognized or are dilated too much, you can then apply an erosion operation. Of course there are more advanced synthesis algorithms, but I think you are better off starting with a simpler image-processing operation first.
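As a concrete example, 'dilate-then-erode' is the morphological close operation. A sketch of how that might look from C#, assuming an OpenCV binding such as OpenCvSharp is available (the kernel size is something to tune, not a recommendation):
using OpenCvSharp;

class FillHoles
{
    static void Main()
    {
        // Load the already-binarized captcha
        using (var src = Cv2.ImRead("captcha_binary.png", ImreadModes.Grayscale))
        using (var dst = new Mat())
        using (var kernel = Cv2.GetStructuringElement(MorphShapes.Rect, new Size(3, 3)))
        {
            // Close = dilate then erode: fills small gaps and holes without growing the strokes much
            Cv2.MorphologyEx(src, dst, MorphTypes.Close, kernel);
            Cv2.ImWrite("captcha_closed.png", dst);
        }
    }
}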
Part 2 - OCR/Tesseract
With Tesseract, if you feed the whole image in, it will perform line analysis and so on. Since characters in a captcha don't behave like normal text, doing line analysis or recognizing them as a group may somewhat deteriorate the recognition rate. So my suggestion is to recognize character by character first.
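With the .NET wrapper, that might look roughly like this, feeding each pre-segmented character box to the engine with the single-character page segmentation mode (the boxes and file names are placeholders; segmenting the boxes is up to your preprocessing):
using System;
using Tesseract;

class CharByChar
{
    static void Main()
    {
        // "eng" here, or the name of your custom Elephant-trained data
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(@"captcha_closed.png"))
        {
            // Example character boxes from your own segmentation step (placeholders)
            var boxes = new[] { new Rect(5, 5, 30, 40), new Rect(40, 5, 30, 40) };

            foreach (var box in boxes)
            {
                using (var page = engine.Process(img, box, PageSegMode.SingleChar))
                {
                    Console.Write(page.GetText().Trim());
                }
            }
        }
    }
}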
I have a C# app for which I've written GDI+ code that uses Bitmap/TextureBrush rendering to present 2D images, which can have various image processing functions applied. This code is a new path in an application that mimics existing DX9 code, and they share a common library to perform all vector and matrix (e.g. ViewToWorld/WorldToView) operations. My test bed consists of DX9 output images that I compare against the output of the new GDI+ code.
A simple test case that renders to a viewport that matches the Bitmap dimensions (i.e. no zoom or pan) does match pixel-perfect (no binary diff) - but as soon as the image is zoomed up (magnified), I get very minor differences in 5-10% of the pixels. The magnitude of the difference is 1 (occasionally 2)/256. I suspect this is due to interpolation differences.
Question: For a DX9 ortho projection (and identity world space), with a camera perpendicular and centered on a textured quad, is it reasonable to expect DirectX.Direct3D.TextureFilter.Linear to generate identical output to a GDI+ TextureBrush filled rectangle/polygon when using the System.Drawing.Drawing2D.InterpolationMode.Bilinear setting?
For this (magnification) case, the DX9 code is using this (MinFilter,MipFilter set similarly):
Device.SetSamplerState(0, SamplerStageStates.MagFilter, (int)TextureFilter.Linear);
and the GDI+ path is using:
g.InterpolationMode = InterpolationMode.Bilinear;
I thought that "Bilinear Interpolation" was a fairly specific filter definition, but then I noticed that there is another option in GDI+ for "HighQualityBilinear" (which I've tried, with no difference - which makes sense given the description of "added prefiltering for shrinking")
Followup Question: Is it reasonable to expect pixel-perfect output matching between DirectX and GDI+ (assuming all external coordinates passed in are equal)? If not, why not?
Clarification: The images I'm using are opaque grayscale (R=G=B, A=1) using Format32bppPArgb.
Finally, there are a number of other APIs I could be using (Direct2D, WPF, GDI, etc.) - and this question generally applies to comparing the output of "equivalent" bilinear interpolated output images across any two of these. Thanks!
DirectX runs mostly on the GPU, and DX9 may be running shaders. GDI+ uses completely different algorithms. I don't think it is reasonable to expect the two to produce exactly pixel-matching output.
I'd expect DX9 to have better quality than GDI+, which is a step up from the old GDI but not by much. GDI+ has long been understood to have trouble with anti-aliasing lines and with preserving quality when scaling images (which seems to be your problem). To get something similar in quality to latest-generation GPU texture processing, you'll need to move to WPF graphics, which gives quality similar to DX.
WPF also uses the GPU (if available) and falls back to software rendering (if not), so the output of GPU and software rendering is reasonably close.
EDIT: Although this has been picked as the answer, it is only an early attempt to explain and doesn't really address the true reason. The reader is referred to discussions laid out in the comments to the question and to the answers instead.
Why do you make the assumption that they use the same formula?
Even if they do use the same formula, and you accept that the implementations are different, would you expect the output to be the same?
At the end of the day, the code is designed to work with perception, not to be mathematically precise, although you can get that precision with CUDA if you want.
Rather than being surprised that you get different results, I would be very surprised if you got pixel-perfect matches.
The way they represent colour is different: I know for a fact NVIDIA uses a float (maybe a double) to represent colour, whereas GDI uses an int, I believe.
http://en.wikipedia.org/wiki/GPGPU
DX9 is where Shader Model 2.0 appeared, which is when the implementation of colour switched from int to 24- and 32-bit floats.
Try comparing ATI/AMD rendering to NVIDIA rendering and you can clearly see that the colour handling is very different.
I first noticed this in Quake 2; the difference between the two cards was staggering. Of course that is due to a great many things, not least their bilinear interpolation implementations.
EDIT: the info about how the specification was made appeared after I answered. Anyway, I think the data types used to store it will be different no matter how you specify it. Moreover, the implementation of float is likely to be different. I may be wrong, but I'm pretty sure C# implements float differently from the C compiler that NVIDIA uses (and that assumes GDI+ doesn't just convert the float into the equivalent int).
Even if I am wrong about that, I would generally consider it exceptional for two different implementations of an algorithm to be identical. They are optimised for speed, and as a result the differences in optimisation translate directly into differences in image quality, since that speed comes from different approaches to cutting corners and approximating.
There are two possibilities for round-off differences. The first is obvious, when the RGB value is calculated as a fraction of the values on either side. The second is more subtle, when calculating the ratio to use when determining the fraction between the two points.
I'd actually be very surprised if two different implementations of the algorithm were pixel-for-pixel identical; there are so many places for a +/-1 difference to occur. Without the exact implementation details of both methods, it's impossible to be more precise than that.
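To make those two round-off points concrete, here is a scalar bilinear sample in plain C# (a reference sketch only, not what either API actually does internally):
using System;

static class Bilinear
{
    // Sample a grayscale image at a fractional source coordinate (sx, sy).
    // Round-off point 1: the fraction between neighbouring texels (fx, fy), which depends on
    //                    how the source coordinate was derived from the destination pixel.
    // Round-off point 2: converting the blended value back to an 8-bit integer.
    static byte Sample(byte[,] img, double sx, double sy)
    {
        int x0 = Math.Max((int)Math.Floor(sx), 0);
        int y0 = Math.Max((int)Math.Floor(sy), 0);
        int x1 = Math.Min(x0 + 1, img.GetLength(1) - 1);
        int y1 = Math.Min(y0 + 1, img.GetLength(0) - 1);

        double fx = sx - x0;   // round-off point 1
        double fy = sy - y0;

        double top    = img[y0, x0] * (1 - fx) + img[y0, x1] * fx;
        double bottom = img[y1, x0] * (1 - fx) + img[y1, x1] * fx;
        double value  = top * (1 - fy) + bottom * fy;

        return (byte)Math.Round(value);   // truncation vs. rounding here is round-off point 2
    }
}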