How to get character wise confidence values in tesseract - c#

I'm building an application that runs OCR on images. Everything works, except that I can't get the confidence value for each character in a word; I can only get the confidence value of the whole word.
This is the code I tried for per-character confidence:
    using (ResultIterator iter = doocr.GetIter())
    {
        iter.Begin();
        do
        {
            listBox1.Items.Add("Char Confidence" + iter.GetConfidence(PageIteratorLevel.Word).ToString());
        } while (iter.Next(PageIteratorLevel.Symbol));
    }
It always shows a single value of 0, even when there are multiple characters.
GetIter() is a function in my class that returns page.GetIterator().
How do I get the confidence value for each character? What am I doing wrong?

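It turned out that initializing Tesseract in a different class and then calling the iterator from a form gives a null result. I ran the iteration inside the class itself, saved the results in a list, and then read that list in the form to get the result I wanted. Note also that the snippet in the question passes PageIteratorLevel.Word to GetConfidence while stepping by Symbol; for per-character values it should pass PageIteratorLevel.Symbol.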

Related

Having issues working out minimum and maximum calculations within my forms on c#

I am creating a student module form with multiple functions. I am having trouble reading the minimum and maximum values from the list box. I have been struggling for days and would greatly appreciate any form of help. Thanks in advance!
I have tried using different arrays, storing different values, etc. I thought the problem came from no 'mark' being stored, but I am certain that part is working and believe the issue lies in line 15 of the code.
    public int MinMark()
    {
        int lowest = int.Parse(ModuleData.studentMark[0]);
        for (int index = 1; index < ModuleData.studentMark.Count; index++)
        {
            if (int.Parse(ModuleData.studentMark[index]) < lowest)
            {
                lowest = ModuleData.studentMark.ToString()[index];
            }
        }
        return lowest;
    }
So far my code just outputs the first item in the list. I have gone through all my lecture notes and tried everything I can think of to get it working.

The line
lowest = ModuleData.studentMark.ToString()[index];
is incorrect: it gives you the char value of a character in a string, not a mark. It calls ToString() on the whole studentMark collection and then selects the character at position index from the resulting string.
Instead you want the mark at that index parsed as an integer, which can be achieved as below:
lowest = int.Parse(ModuleData.studentMark[index]);

lowest = ModuleData.studentMark.ToString()[index];
This line is almost certainly incorrect: you're taking the string representation of your collection (likely something like "System.String[]" or "System.Collections.Generic.List`1[System.String]"), getting a character from that string by index, and implicitly converting that character to an integer. The line should likely be
lowest = int.Parse(ModuleData.studentMark[index]);
However, you can replace this method with a single LINQ query (this needs using System.Linq;), like so:
public int MinMark() => ModuleData.studentMark.Select(int.Parse).Min();
This parses all the student marks to integers and then selects the smallest from the collection. If this still only returns the first item, then either the first item really is the lowest value in your case, or the ModuleData.studentMark field isn't being populated as you expect.
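
Since the question asks about both the minimum and the maximum, here is a minimal sketch of the two approaches side by side. It assumes the marks are stored as strings that all parse as integers; the MarkStats class and its parameter names are made up for illustration and are not part of the original ModuleData code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MarkStats
{
    // Loop version: parse each mark once and keep the smallest value seen so far.
    public static int MinMark(IReadOnlyList<string> marks)
    {
        if (marks.Count == 0)
            throw new InvalidOperationException("No marks have been stored.");

        int lowest = int.Parse(marks[0]);
        for (int index = 1; index < marks.Count; index++)
        {
            int current = int.Parse(marks[index]);
            if (current < lowest)
                lowest = current;
        }
        return lowest;
    }

    // LINQ version for the maximum: parse every mark, then take the largest.
    public static int MaxMark(IEnumerable<string> marks)
    {
        return marks.Select(m => int.Parse(m)).Max();
    }
}
```

Comparing parsed integers matters here: compared as text, "100" would sort before "9".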

Magick.net check if two images are identical

I'm trying to compare two screenshots from a webpage using Magick.NET, a C# library from ImageMagick. My code looks like this:
    // Allow some colour fuzz, otherwise the whole image is flagged as different
    newScreenshot.ColorFuzz = new Percentage(15);
    // Get the difference: 1 = exactly the same, less than 1 = not
    double diff = newScreenshot.Compare(benchmarkScreenshot, new ErrorMetric(), imgDiff);
    // Write out the result image for comparison
    imgDiff.Write(compareResultPath);
    if (diff < 0.998)
    {
        // Do something
    }
In this case I get values lower than 1, where I imagined 1 would mean "identical" and anything less than 1 would not. I was wrong... So the only way I could think of to check whether the images are as identical as possible is to lower the tolerance by lowering the value in the if-statement.
So if I have a screenshot from a website, and I adapt it I get the following values for the "diff" variable:
Identical image: 0.99842343024053205
Removing a sentence: 0.99776453647987487
Removing one letter from any word on the page: 0.99698398328761506
I'm very worried by the fact that removing an entire sentence gives a higher value than removing just a single letter.
I also tried ErrorMetric.Absolute rather than new ErrorMetric(); the values that I got for the "diff" variable were:
Identical image: 1949
Removing a sentence: 766
Removing one letter from any word on the page: 75
Is there a better, more accurate way than what I'm doing to check whether there's an actual change or not?

Searching an array string with a binary search sub string

I have a file.txt containing about 200,000 records.
The format of each record is 123456-99-Text. The 123456 part is a unique account number, the 99 is a location code that I need (it ranges from 01 to 99), and the text is irrelevant. The records are sorted by account number, one account per line (111111, 111112, 111113, etc.).
I made a Visual Studio form with a textbox and a search button so that someone can search for an account number. The account number is actually 11 digits long, but only the first 6 matter. I wrote this as string actnum = textBox1.Text.Substring(0, 6);
I wrote a foreach (string x in File.ReadLines("file.txt")) loop containing an if (x.Contains(actnum)) string code = x.Substring(8, 2); statement.
The program works well, but because there are so many records, if someone searches for an account number that doesn't exist, or for a number at the bottom of the list, the program locks up for a good 10 seconds before reaching the "number not found" else statement or finding that last record.
My Question:
After reading about binary searches I have attempted one without much success. I cannot seem to get the array or file to work with a proper binary search. Is there a way to take the 6-digit actnum from textBox1, compare it to the 6-digit account number substring of each array entry, and then grab the 99 code substring from that specific line?
A binary search would help greatly! I could take 555-555 and compare it to the top or bottom half of the record file, then keep searching until I find the line I need, grab the entire line, and substring the 99 out. The problem I have is that I can't seem to get a proper integer conversion of the file because it contains both numbers AND text, and therefore I can't properly use the <, >, = operators.
Any help on this would be greatly appreciated. The program I currently have actually works but is incredibly slow at times.

As one possible solution (not necessarily the best), you can add your record IDs to a Dictionary<string, int> (or even a Dictionary<long, int> if all record IDs are numeric), where each key is the ID of one line and each value is the line index. When you need to look up a particular record, just look in the dictionary: it does an efficient lookup for you and gives you the line number. If the item is not there (non-existent ID), you won't find it in the dictionary.
At this point, if the record ID exists in the file, you have a line number - you can either load the entire file into memory (if it's not too big) or just seek to the right line and read in the line with the data.
For this to work, you have to go through the file at least once and collect all the record IDs from all lines and add them to the dictionary. You won't have to implement the binary search - the dictionary will internally perform the lookup for you.
Edit:
If you don't need all the data from a particular line, just one bit (like the location code you mentioned), you don't even need to store the line number (since you won't need to go back to the line in the file) - just store the location data as the value in the dictionary.
I personally would still store the line index because, in my experience, such projects start out small but end up collecting features and there'll be a point where you'll have to have everything from the file. If you expect this to be the case over time, just parse data from each line into a data structure and store that in the dictionary - it'll make your future life simpler. If you're very sure you'll never need more data than the one bit of information, you can just stash the data itself in the dictionary.
Here's a simple example (assuming that your record IDs can be parsed into a long):
    public class LineData
    {
        public int LineIndex { get; set; }
        public string LocationCode { get; set; }
        // other data from the line that you need
    }
    // ...
    // declare your map
    private Dictionary<long, LineData> _dataMap = new Dictionary<long, LineData>();
    // ...
    // Read file, parse lines into LineData objects and put them in dictionary
    // ...
To see if a record ID exists, you just call TryGetValue():
    LineData lineData;
    if (_dataMap.TryGetValue(recordID, out lineData))
    {
        // record ID was found
    }
This approach essentially keeps the entire file in memory but all data is parsed only once (at the beginning, during building the dictionary). If this approach uses too much memory, just store the line index in the dictionary and then go back to the file if you find a record and parse the line on the fly.
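
The "read file, parse lines, put them in the dictionary" step could look roughly like the sketch below. It takes the simpler variant from the edit above and stores only the location code as the value. The AccountIndex class and its method names are hypothetical, and the code assumes every well-formed line looks like 123456-99-Text.

```csharp
using System;
using System.Collections.Generic;

public static class AccountIndex
{
    // Builds a map from the 6-digit account number to the 2-digit location code.
    // Assumes each line has the form "123456-99-Text"; malformed lines are skipped.
    public static Dictionary<string, string> Build(IEnumerable<string> lines)
    {
        var map = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            // Split into at most 3 parts so dashes inside the text are left alone.
            string[] parts = line.Split(new[] { '-' }, 3);
            if (parts.Length < 3)
                continue;
            map[parts[0]] = parts[1];
        }
        return map;
    }

    // Looks up the location code using only the first 6 characters of the input.
    public static bool TryGetLocationCode(Dictionary<string, string> map, string input, out string code)
    {
        code = null;
        if (input == null || input.Length < 6)
            return false;
        return map.TryGetValue(input.Substring(0, 6), out code);
    }
}
```

The map would be built once at startup, for example with AccountIndex.Build(File.ReadLines("file.txt")), and each search then becomes a single TryGetLocationCode call.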

You cannot really do a binary search against file.ReadLine, because you have to be able to access the lines in a different order. Instead you should read the whole file into memory (File.ReadAllLines would be an option).
Assuming your file is sorted by the substring, you can create a new class that implements IComparer<string>:
    public class SubstringComparer : IComparer<string>
    {
        public int Compare(string x, string y)
        {
            return x.Substring(0, 6).CompareTo(y.Substring(0, 6));
        }
    }
and then your binary search would look like this (assuming foundStrings is a List<string>; for the string[] returned by File.ReadAllLines, call Array.BinarySearch(foundStrings, searchValue, new SubstringComparer()) instead). A negative return value means no match was found:
int returnedValue = foundStrings.BinarySearch(searchValue, new SubstringComparer());
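
For comparison, here is a sketch of the same idea with the binary search written out by hand, which also shows that no integer conversion is needed: fixed-width digit strings can be compared ordinally. The AccountSearch class is hypothetical, and the code assumes the lines are sorted by their first six characters and follow the 123456-99-Text format.

```csharp
using System;

public static class AccountSearch
{
    // Returns the location code for the account, or null when it is not found.
    // Assumes sortedLines is ordered by the first six characters of each line.
    public static string FindLocationCode(string[] sortedLines, string accountNumber)
    {
        if (accountNumber == null || accountNumber.Length < 6)
            return null;

        string key = accountNumber.Substring(0, 6);
        int low = 0;
        int high = sortedLines.Length - 1;

        while (low <= high)
        {
            int mid = low + (high - low) / 2;
            string line = sortedLines[mid];

            // Compare only the first six characters, character by character.
            int cmp = string.CompareOrdinal(line, 0, key, 0, 6);
            if (cmp == 0)
                return line.Split('-')[1]; // second field is the location code
            if (cmp < 0)
                low = mid + 1;
            else
                high = mid - 1;
        }
        return null;
    }
}
```

With about 200,000 records this needs at most around 18 comparisons per search, whether or not the account exists.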

Assuming the file doesn't change often, then you can simply load the entire file into memory using a structure that handles the searching in faster time. If the file can change then you will need to decide on a mechanism for reloading the file, be it restarting the program or a more complex process.
It looks like you are looking for exact matches (searching for 123456 yields only one record which is labelled 123456). If that is the case then you can use a Dictionary. Note that to use a Dictionary you need to define key and value types. It looks like in your case they would both be string.

While I did not find a way to do a better type of search, I did manage to learn about embedded resources which considerably sped up the program. Scanning the entire file takes a fraction of a second now, instead of 5-10 seconds. Posting the following code:
    string searchfor = textBox1.Text;
    Assembly assm = Assembly.GetExecutingAssembly();
    using (Stream datastream = assm.GetManifestResourceStream("WindowsFormsApplication2.Resources.file1.txt"))
    using (StreamReader reader = new StreamReader(datastream))
    {
        string lines;
        while ((lines = reader.ReadLine()) != null)
        {
            if (lines.StartsWith(searchfor))
            {
                label1.Text = "Found";
                break;
            }
            else
            {
                label1.Text = "Not found";
            }
        }
    }

Parse Number Data from Letters C#

I am trying to parse data I am getting from an Arduino robot I created. I currently have my serial program up and running and I am able to monitor the data being sent and received by my computer.
The data I am trying to get from the robot includes speed, range, and heading. The values sent from the Arduino are floats.
I use a single character to denote what kind of data is being received: S, R, or H. For example:
R150.6
This would denote that this is range data and 150.6 would be the new range to update the program with.
I'm a little stuck trying to figure out the best way to parse this using C#, as this is my first C# program.
I have tried code similar to:
    if (RxString[0] == 'R')
        Range = double.Parse(RxString);
I read on this site that I should use regular expressions; however, I am having a hard time figuring out how to incorporate them into my code.
This is the link I was using for guidance:
String parsing, extracting numbers and letters

You're almost there. If you're always starting with a single letter, try Range = double.Parse(RxString.Substring(1)). It will read from the second character on.

You can use a regex to find the number:
\d+([,\.]\d+)?
Usage:
    Match match = Regex.Match("R1.23", @"\d+([,\.]\d+)?");
    if (match.Success)
    {
        // Normalize a decimal comma to a point before parsing with the invariant culture
        double value = Convert.ToDouble(match.Value.Replace(',', '.'), System.Globalization.CultureInfo.InvariantCulture);
    }
This reliably picks out the first decimal number in the string, even if the letter moves to a different position or other symbols are added.

Since you know the format of the returned data, you can try something like this:
    string data = RxString.Substring(0, 1);
    string value = RxString.Substring(1);
    if (data == "R")
        range = double.Parse(value);
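
Putting the answers together, a small sketch of a parser for all three message types might look like this. The TelemetryParser class is hypothetical. It assumes each message is one prefix letter followed by a number that uses a decimal point, and it parses with the invariant culture so the result does not depend on the PC's regional settings.

```csharp
using System;
using System.Globalization;

public static class TelemetryParser
{
    // Splits a message such as "R150.6" into its kind ('S', 'R' or 'H') and value.
    // Returns false for unknown prefixes or text that is not a valid number.
    public static bool TryParse(string message, out char kind, out double value)
    {
        kind = '\0';
        value = 0;
        if (string.IsNullOrEmpty(message) || message.Length < 2)
            return false;

        kind = message[0];
        if (kind != 'S' && kind != 'R' && kind != 'H')
            return false;

        // Trim removes a trailing "\r\n" that serial lines often carry.
        string number = message.Substring(1).Trim();
        return double.TryParse(number, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }
}
```

The caller can then switch on kind to decide whether to update speed, range, or heading.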

Dynamic Regex generation for predictable repeating string patterns in a data feed

I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.
The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.
e.g
Panasonic TV with FREE Blu-Ray Player
Sony TV with FREE DVD Player + Box Office DVD
Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher
So the format in this instance is: PRODUCT with FREEGIFT.
PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.
Creating a Regular Expression to extract the dynamic parts is trivial.
The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:
Brand new Panasonic TV and a FREE Blu-Ray Player if you order today
Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today
Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today
And next week it will probably be something different, so I have to keep modifying my Regular Expressions...
How would you handle this?
Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programmatically create a dynamic Regular Expression?
Thanks for any help or advice.

This code isn't perfect, it certainly isn't efficient, and it's very likely to be too late to help you, but it does work. If given a set of strings, it will return the common content above a certain length.
However, as others have mentioned, an algorithm can only give you an approximation, as you could hit a bad batch where all products have the same initial word, and then the code would accidentally identify that content as static. It may also produce mismatches when dynamic content shares values with static content, but as the size of samples you feed into it grows, the chance of error will shrink.
I'd recommend running this on a subset of your data (20000 rows would be a bad idea!) with some sort of extra sanity checking (max # of static elements etc)
Final caveat: it may do a perfect job, but even if it does, how do you know which item is the PRODUCT and which one is the FREEGIFT?
The algorithm
1. If all strings in the set start with the same character, add that character to the "current match" set, then remove the leading character from all strings
2. If not, remove the first character from all strings whose first x (minimum match length) characters aren't contained in all the other strings
3. As soon as a mismatch is reached (case 2), yield the current match if it meets the length requirement
4. Continue until all strings are exhausted
The implementation
    private static IEnumerable<string> FindCommonContent(string[] strings, int minimumMatchLength)
    {
        string sharedContent = "";
        while (strings.All(x => x.Length > 0))
        {
            var item1FirstCharacter = strings[0][0];
            if (strings.All(x => x[0] == item1FirstCharacter))
            {
                sharedContent += item1FirstCharacter;
                for (int index = 0; index < strings.Length; index++)
                    strings[index] = strings[index].Substring(1);
                continue;
            }
            if (sharedContent.Length >= minimumMatchLength)
                yield return sharedContent;
            sharedContent = "";
            // If the first minMatch characters of a string aren't in all the other strings, consume the first character of that string
            for (int index = 0; index < strings.Length; index++)
            {
                string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));
                if (!strings.All(x => x.Contains(testBlock)))
                    strings[index] = strings[index].Substring(1);
            }
        }
        if (sharedContent.Length >= minimumMatchLength)
            yield return sharedContent;
    }
Output
Set 1 (from your example):
FindCommonContent(strings, 4);
=> "with "
Set 2 (from your example):
FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"
Building the regex
This should be as simple as joining the static parts with capture groups:
"^(.*)" + string.Join("(.*)", FindCommonContent(strings, 4)) + "(.*)$";
=> "^(.*)Brand new (.*)and a (.*)if you order today(.*)$"
Although you could modify the algorithm to return information about where the matches are (between or outside the static content), this will be fine, as you know some will match zero-length strings anyway.
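
As a sketch of how the static parts could be turned into a working Regex and used to pull out the dynamic text: the FeedPattern class below is hypothetical and takes the static parts as input rather than calling FindCommonContent itself. It escapes each static part with Regex.Escape, so that characters such as $ or + in the feed text are matched literally, and it uses lazy groups between the parts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FeedPattern
{
    // Joins the static parts with capture groups, escaping any regex metacharacters.
    public static Regex Build(IEnumerable<string> staticParts)
    {
        string body = string.Join("(.*?)", staticParts.Select(p => Regex.Escape(p)));
        return new Regex("^(.*?)" + body + "(.*)$");
    }

    // Returns the non-empty captured (dynamic) parts of a row, in order.
    public static List<string> ExtractDynamicParts(Regex pattern, string row)
    {
        var parts = new List<string>();
        Match match = pattern.Match(row);
        if (!match.Success)
            return parts;

        // Group 0 is the whole match, so start at 1.
        for (int i = 1; i < match.Groups.Count; i++)
        {
            string text = match.Groups[i].Value.Trim();
            if (text.Length > 0)
                parts.Add(text);
        }
        return parts;
    }
}
```

For the second example set, the static parts "Brand new ", "and a " and "if you order today" leave two non-empty captures per row. As noted in the caveat above, deciding which capture is the PRODUCT and which is the FREEGIFT is still up to you.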

I think it would be possible with an algorithm, but the time it would take you to code it versus simply updating the Regular Expression might not be worth it.
You could however make your changing process faster. If instead of having your Regex String inside your application, you'd put it in a text file somewhere, you wouldn't have to recompile and redeploy everything every time there's a change, you could simply edit the text file.
Depending on your project size and implementation, this could save you a generous amount of time.
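
A minimal sketch of that idea, assuming the pattern lives on its own in a plain text file: the PatternFile class and the file layout are illustrative assumptions, not something taken from the question.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class PatternFile
{
    // Reads the regex pattern from a text file so it can be changed without recompiling.
    // Assumes the file contains only the pattern; surrounding whitespace is trimmed.
    public static Regex Load(string path)
    {
        string pattern = File.ReadAllText(path).Trim();
        return new Regex(pattern);
    }
}
```

The file could hold something like ^(?<product>.+) with (?<gift>.+)$ this fortnight and be edited in place when the feed wording changes; the named groups keep the calling code the same.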
