I have a table that contains human entered observation data. There is a column that is supposed to correspond to another list; the human entered value should identically match that in a sort of master list of possibilities.
The problem, however, is that the human-entered data is abbreviated, misspelled, and so on. Is there a mechanism that does some sort of similarity search to find what the human-entered data should actually be?
Examples:

| Human Entered | Should Be |
| --- | --- |
| Carbon-12 | Carbon(12) |
| South Korea | Republic of Korea |
| farenheit | Fahrenheit |
The only thought I really have is to break up the Human Entered data into like 3 character sections and see if they are contained in the Should Be list. It would just pick the highest rated entry. As a later addition it could present the user with a choice of the top 10 or something.
I'm also not necessarily interested in an absolutely perfect solution, but if it worked like 70% right it would save A LOT of time going through the list.
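A minimal sketch of the three-character-section idea, written in Python for brevity. The names `trigrams` and `rank_candidates` are hypothetical, and the score (fraction of the entered sections found in a candidate) is only one possible choice:

```python
def trigrams(text):
    """Return the set of 3-character sections of a lower-cased string."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def rank_candidates(entered, master_list, top_n=10):
    """Rank master entries by the fraction of entered trigrams they contain."""
    wanted = trigrams(entered)
    scored = []
    for candidate in master_list:
        have = trigrams(candidate)
        # Assumption: score = shared trigrams / trigrams in the entered value.
        score = len(wanted & have) / len(wanted) if wanted else 0.0
        scored.append((score, candidate))
    # Highest score first; ties keep their master-list order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


master = ["Carbon(12)", "Republic of Korea", "Fahrenheit"]
for entered in ["Carbon-12", "South Korea", "farenheit"]:
    score, match = rank_candidates(entered, master)[0]
    print(entered, "->", match, round(score, 2))
```

Returning the whole ranked list rather than a single winner leaves room for the later "show the user the top 10" step.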
One option is to look for a small Levenshtein distance between two strings rather than requiring an exact match. This would help find matches where there are minor spelling differences or typos.
Another option is to normalize the strings before comparing them. The normalization techniques that make sense depend on your specific application but it could for example involve:
Removing all punctuation.
Converting UK spellings to US spellings.
Using the scientific name for a substance instead of its common names.
etc.
You can then compare the normalized forms of the members of each list instead of the original forms. You may also want to consider using a case-insensitive comparison instead of a case-sensitive comparison.
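As a rough illustration of comparing normalized forms, here is a small Python sketch. The rule used (case-fold and keep only letters and digits) is an assumption chosen for the Carbon-12 example; `normalize`, `build_lookup` and `match_normalized` are hypothetical names, and real data would likely need domain-specific rules on top:

```python
def normalize(text):
    """Case-fold and drop everything that is not a letter or digit."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def build_lookup(master_list):
    """Map each normalized master entry back to its original spelling."""
    return {normalize(entry): entry for entry in master_list}


def match_normalized(entered, lookup):
    """Return the master entry whose normalized form matches, or None."""
    return lookup.get(normalize(entered))


lookup = build_lookup(["Carbon(12)", "Republic of Korea", "Fahrenheit"])
print(match_normalized("Carbon-12", lookup))     # punctuation differences vanish
print(match_normalized("South Korea", lookup))   # aliases are not handled: None
```

Normalization alone cannot map aliases such as "South Korea", so it is probably best combined with a similarity measure or an explicit alias table.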
You can try to calculate the similarity of two strings using Levenshtein distance:
private static int CalcLevensteinDistance(string left, string right)
{
if (left == right)
return 0;
int[,] matrix = new int[left.Length + 1, right.Length + 1];
for (int i = 0; i <= left.Length; i++)
// delete
matrix[i, 0] = i;
for (int j = 0; j <= right.Length; j++)
// insert
matrix[0, j] = j;
for (int i = 0; i < left.Length; i++)
{
for (int j = 0; j < right.Length; j++)
{
if (left[i] == right[j])
matrix[i + 1, j + 1] = matrix[i, j];
else
{
// deletion or insertion
matrix[i + 1, j + 1] = System.Math.Min(matrix[i, j + 1] + 1, matrix[i + 1, j] + 1);
// substitution
matrix[i + 1, j + 1] = System.Math.Min(matrix[i + 1, j + 1], matrix[i, j] + 1);
}
}
}
return matrix[left.Length, right.Length];
}
Then convert the distance into a similarity score between 0.0 (completely different) and 1.0 (identical); multiply by 100 if you want a percentage:
public static double CalcSimilarity(string left, string right, bool ignoreCase)
{
if (ignoreCase)
{
left = left.ToLower();
right = right.ToLower();
}
double distance = CalcLevensteinDistance(left, right);
if (distance == 0.0f)
return 1.0f;
double longestStringSize = System.Math.Max(left.Length, right.Length);
double percent = distance / longestStringSize;
return 1.0f - percent;
}
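To show how such a similarity score might be used against the master list, here is a compact Python sketch of the same idea. It is not a port of the C# above line for line; `best_matches` and its `top_n` parameter are hypothetical additions:

```python
def levenshtein(left, right):
    """Edit distance using two rows instead of a full matrix."""
    previous = list(range(len(right) + 1))
    for i, lch in enumerate(left, start=1):
        current = [i]
        for j, rch in enumerate(right, start=1):
            cost = 0 if lch == rch else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(left, right, ignore_case=True):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    if ignore_case:
        left, right = left.lower(), right.lower()
    longest = max(len(left), len(right))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(left, right) / longest


def best_matches(entered, master_list, top_n=3):
    """Master entries ordered from most to least similar."""
    return sorted(master_list,
                  key=lambda candidate: similarity(entered, candidate),
                  reverse=True)[:top_n]


master = ["Carbon(12)", "Republic of Korea", "Fahrenheit"]
print(best_matches("farenheit", master, top_n=1))
```

Edit distance handles typos well, but a pair like "South Korea" and "Republic of Korea" will score poorly, so a threshold below which a human reviews the match may be worth adding.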
Have you considered using one (or several) drop-down lists to enforce correct input? In my opinion, that would be a better approach in most cases when considering usability and user friendliness. It would also make treatment of this input a lot easier. When just using free text input, you'd probably get a lot of different ways to write one thing, and you'll "never" be able to figure out every way of writing anything complex.
Example: As you wrote; "carbon-12", "Carbon 12", "Carbon ( 12 )", "Carbon (12)", "Carbon - 12" etc... Just for this, the possibilities are nearly endless. When you also consider things like "South Korea" vs "Republic of Korea" where the mapping is not "1:1" (What about North Korea? Or just "Korea"?), this gets even harder.
Of course, I know nothing about your application and might be completely wrong. But usually, when you expect complex values in a certain format, a drop down list would in many cases make both your job as a developer easier, as well as give the end user a better experience.
I am having trouble with an assignment about finding the shortest total path in a grid, while visiting all the correct tiles in the correct order.
We are supposed to emulate manually inputting a word, like when using a controller to write something, and find the least amount of commands (up, down, left, right) needed to do so.
Our input is the grid, parameters, and the word we are supposed to work with.
I store them like this (with example inputs):
Height = 2;
Width = 2;
Content = "ABCC";
Word = "ABC";
grid = new char[Height, Width];
Contents = Content.ToCharArray();
Words = Word.ToCharArray();
int ch = 0;
for (int i = 0; i < Height; i++)
{
for (int j = 0; j < Width; j++)
{
if (ch < Contents.Length)
{
grid[i, j] = Contents[ch];
ch++;
}
}
}
The actual way I compute the shortest path is like so:
public void GridSearch( int a, int FirstX, int FirstY, int PathLength)
{
int NewPath;
int NewX;
int NewY;
int SecondX = 0;
int SecondY = 0;
for (int i = 0; i < Height; i++)
{
for (int j = 0; j < Width; j++)
{
if (grid[i, j] == Words[a])
{
SecondX = i;
SecondY = j;
NewPath = PathLength;
NewPath += Math.Abs(FirstX - SecondX);
NewPath += Math.Abs(FirstY - SecondY);
NewX = SecondX;
NewY = SecondY;
if (a < Words.Length-1)
{
GridSearch(a+1, NewX, NewY, NewPath);
}
else
{
if (FinalPath > NewPath ^ FinalPath == -1)
{
FinalPath = NewPath;
}
}
}
}
}
We are also supposed to "click" when on the correct tile, so I am adding the length of "Words" to the total of commands.
In this case, the shortest path between the letters would be 2 (right, down) and the length is 3, so 5 is the correct answer.
This is also what my program gets, however when I try to send it in, the automated checker says it only passes 1 out of 5 tests, which is an improvement over the 0 that I had until recently, but still not actually good.
Sadly it does not say which inputs it used to make my program fail, and after a day of trying things I am out of ideas on how to fix it, could anyone please point out the, no doubt, obvious mistake I am making and help me fix this program?
EDIT: The assignment instructions as written (since a commenter asked for them):
Some devices allow text entry using a grid of letters. The grid
contains a movable cursor, which begins in the upper-left-hand corner.
Arrow keys move the cursor up, down, left and right and Enter key
chooses the letter under cursor.
For example, if the input grid looks like this:
ABCDEFGH
IJKLMNOP
QRSTUVWX
YZ
we can enter the text "HELLO" with the following sequence of keys
(which is only one of many possible sequences):
right
right
right
right
right
right
right
Enter
left
left
left
Enter
down
left
Enter
Enter
right
right
right
Enter
Write a program that for a given grid (which may contain both
lowercase and uppercase letters) and text (which may also contain
non-alphabetic characters) determines and writes out the minimum
number of keystrokes required to enter the given text.
Caution: Each letter may appear more than once in the grid!
The input begins with numbers indicating the width and height of the
grid (each on its own line).
A single line follows containing the contents of the entire grid (in
row-major order, i.e. with one row after another).
The rest of the lines contain the text to be entered. ! You should
ignore any characters in the text that are not present in the grid.
Example:
Input:
3
3
ABCBFECDF
ABCDEFA
Output:
15
In this example, the grid has the form
ABC
BFE
CDF
It is possible to enter the text ABCDEFA in many possible ways; 15
keystrokes is the length of the shortest of these.
Revising my previous answer, it is likely that you have not counted the "enter" keystroke. I.e. you should add one to the candidate path length for each letter:
...
NewY = SecondY;
NewPath++; // count the Enter keystroke for this letter
if (a < Words.Length - 1)
...
This gives a correct length of 15 keypresses on your example set of "ABCBFECDF" / "ABCDEFA".
Note that this type of code greatly benefits from a type that represents a pair of x/y coordinates, like a Point or Vector2i, so you don't have to repeat a bunch of calculations for both the x and y coordinates. I would also recommend following common coding conventions like
declare local variables in the smallest possible scope, not at the top of the method
Use "camelCasing" for local variables
Prefer pure methods whenever possible, i.e.
I would still recommend reading up on Dijkstra's algorithm or A*, since these are more generally applicable and should be more efficient.
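For comparison, here is a hedged Python sketch of a different technique from the recursive search above: a layer-by-layer dynamic programme that keeps, for every grid position of the current letter, the cheapest cost of reaching it. It assumes the cursor can always travel the Manhattan distance between two cells (which may not hold if the last row is incomplete), and `min_keystrokes` is a hypothetical name:

```python
from collections import defaultdict


def min_keystrokes(width, contents, text):
    """Minimum arrow + Enter presses to type text, starting at the top-left."""
    # Every (row, column) at which each character occurs in the grid.
    positions = defaultdict(list)
    for k, ch in enumerate(contents):
        positions[ch].append((k // width, k % width))

    # best maps a cursor position to the cheapest cost of being there.
    best = {(0, 0): 0}
    for ch in text:
        if ch not in positions:
            continue  # characters missing from the grid are ignored
        layer = {}
        for (r, c) in positions[ch]:
            # Cheapest way to arrive here from any previous position,
            # plus one keystroke for Enter.
            layer[(r, c)] = 1 + min(
                cost + abs(r - pr) + abs(c - pc)
                for (pr, pc), cost in best.items()
            )
        best = layer
    return min(best.values())


print(min_keystrokes(3, "ABCBFECDF", "ABCDEFA"))
```

Because each step only looks at the previous letter's positions, the work grows with the text length times the square of the number of occurrences per letter, rather than exponentially as in an exhaustive recursion.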
I've always loved reducing number of code lines by using simple but smart math approaches. This situation seems to be one of those that need this approach. So what I basically need is to sum up digits in the odd and even places separately with minimum code. So far this is the best way I have been able to think of:
string number = "123456789";
int sumOfDigitsInOddPlaces=0;
int sumOfDigitsInEvenPlaces=0;
for (int i=0;i<number.length;i++){
if(i%2==0)//Means odd ones
sumOfDigitsInOddPlaces+=number[i];
else
sumOfDigitsInEvenPlaces+=number[i];
}
//The rest is not important
Do you have a better idea? Something without needing to use if else
int* sum[2] = {&sumOfDigitsInOddPlaces,&sumOfDigitsInEvenPlaces};
for (int i=0;i<number.length;i++)
{
*(sum[i&1])+=number[i];
}
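The snippet above selects the accumulator with the lowest bit of the index instead of an if/else. The same trick can be sketched in Python as follows; `sums_by_parity` is a hypothetical name, and the sketch assumes the digits arrive as characters that need converting to their numeric value:

```python
def sums_by_parity(number):
    """Return [sum at even indexes, sum at odd indexes] (0-based)."""
    sums = [0, 0]
    for i, ch in enumerate(number):
        # i & 1 is 0 for even indexes and 1 for odd ones,
        # so it picks the accumulator without branching.
        sums[i & 1] += int(ch)
    return sums


even_index_sum, odd_index_sum = sums_by_parity("123456789")
print(even_index_sum, odd_index_sum)
```

Whether index 0 counts as an "odd place" or an "even place" depends on whether positions are counted from 1 or from 0, so the two results may need swapping.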
You could use two separate loops, one for the odd indexed digits and one for the even indexed digits.
Also your modulus conditional may be wrong, you're placing the even indexed digits (0,2,4...) in the odd accumulator. Could just be that you're considering the number to be 1-based indexing with the number array being 0-based (maybe what you intended), but for algorithms sake I will consider the number to be 0-based.
Here's my proposition
number = 123456789;
sumOfDigitsInOddPlaces=0;
sumOfDigitsInEvenPlaces=0;
//even digits
for (int i = 0; i < number.length; i = i + 2){
sumOfDigitsInEvenPlaces += number[i];
}
//odd digits, note the start at j = 1
for (int j = 1; j < number.length; j = j + 2){
sumOfDigitsInOddPlaces += number[j];
}
On the large scale this doesn't improve efficiency, still an O(N) algorithm, but it eliminates the branching
Since you added C# to the question:
var numString = "123456789";
// Requires: using System.Linq;
// A string is a sequence of chars, so Where can filter it by index directly.
var odds = numString.Where((c, i) => i % 2 == 1);
var evens = numString.Where((c, i) => i % 2 == 0);
var sumOfOdds = odds.Sum(c => c - '0');
var sumOfEvens = evens.Sum(c => c - '0');
Do you like Python?
num_string = "123456789"
odds = sum(map(int, num_string[::2]))
evens = sum(map(int, num_string[1::2]))
This Java solution requires no if/else, has no code duplication and is O(N):
String number = "123456789";
int[] sums = new int[2]; //sums[0] == sum of even digits, sums[1] == sum of odd
for(int arrayIndex=0; arrayIndex < 2; ++arrayIndex)
{
for (int i=0; i < number.length()-arrayIndex; i += 2)
{
sums[arrayIndex] += Character.getNumericValue(number.charAt(i+arrayIndex));
}
}
Assuming number.length is even, it is quite simple. The only corner case is the last element when number.length is odd.
int i=0;
while(i<number.length-1)
{
sumOfDigitsInEvenPlaces += number[ i++ ];
sumOfDigitsInOddPlaces += number[ i++ ];
}
if( i < number.length )
sumOfDigitsInEvenPlaces += number[ i ];
Because the loop advances i two at a time, subtracting 1 from the bound changes nothing when number.length is even.
If number.length is odd, the bound excludes the last item from the loop.
In that case, the value of i on exiting the loop is the index of that last, not yet visited, element.
That index is even, so the last item has to be added to sumOfDigitsInEvenPlaces.
This seems slightly more verbose, but also more readable, to me than Anonymous' (accepted) answer. However, benchmarks to come.
Well, the compiler seems to find my code more understandable as well, since it removes it all if I don't print the results (which explains why I kept getting a time of 0 all along...). The other code, though, is obfuscated enough to fool even the compiler.
In the end, even with huge arrays, it's pretty hard for clock_t to tell the difference between the two. You get about a third less instructions in the second case, but since everything's in cache (and your running sums even in registers) it doesn't matter much.
For the curious, I've put the disassembly of both versions (compiled from C) here : http://pastebin.com/2fciLEMw
I have a quick question that I haven't found out how to do efficiently (in C#).
I have a list array of Points (X,Y). I need to find which 3 points are the tightest cluster. It's for a mapping project.
What would the best way to do this be? There's only about 6 to 9 items in the list.
Thanks in advance.
Cheers!
For such small numbers, the brute force method should work just fine. With six points, there are 20 possible combinations of three points. With 9 points, there are 84 possible combinations. I wouldn't recommend this approach for a lot of points, but with just a handful, it's going to be plenty fast enough and it's dead simple to write.
You can easily generate the combinations:
for (int i = 0; i < points.Length - 2; ++i)
{
for (int j = i + 1; j < points.Length - 1; j++)
{
for (int k = j + 1; k < points.Length; k++)
{
// Here, your three points are
// points[i], points[j], and points[k]
// compute "tightness" and store
}
}
}
You'll need a structure to hold your combinations:
struct PointGroup
{
public readonly int i;
public readonly int j;
public readonly int k;
public readonly double tightness;
public PointGroup(int i, int j, int k, double tight)
{
this.i = i;
this.j = j;
this.k = k;
this.tightness = tight;
}
}
If you create one of those structures for each group and store them in an array, you can simply sort the array and take the best three.
Your bigger problem is coming up with a definition of "tight group." Also, you have to decide if a point can be in more than one of those "tightest" groups. Three possible ways to define tightness are:
The sum of the distances between the points is minimized.
The average distance from each point to the center of the group is minimized.
The perimeter of the triangle formed by the three points is minimized.
Undoubtedly there are more.
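As a quick illustration of the brute-force approach with one of these definitions, here is a Python sketch that takes tightness to be the triangle's perimeter. That choice is an assumption, `tightest_triple` is a hypothetical name, and `math.dist` needs Python 3.8 or later:

```python
from itertools import combinations
from math import dist


def perimeter(a, b, c):
    """Sum of the three pairwise distances between the points."""
    return dist(a, b) + dist(b, c) + dist(c, a)


def tightest_triple(points):
    """Try every combination of three points and keep the tightest one."""
    return min(combinations(points, 3), key=lambda triple: perimeter(*triple))


points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (20, 20)]
print(tightest_triple(points))
```

Swapping in another definition only means replacing the `key` function, so it is cheap to experiment with several and compare the results on real map data.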
If the points are not identical, this becomes a form of cluster analysis.
There are various algorithms that differ in how they measure and "cluster" points, though with only a few points, a brute force approach might be the easiest... You could just measure the distance between each pair of points, and sort...
You can simplify the problem as follows:
Don't check a Point against itself; distance is zero.
Exploit symmetry: distance from Point i to Point j is the same as Point j to Point i
Those eliminate a number of combinations.
But, given those, you have to calculate the distance between each pair and sort.
I've searched online for a diff algorithm but none of them do what I am looking for. It is for a texting contest (as in cell phone) and I need the entry text compared to the master text recording the errors along the way. I am semi-new to C# and I get most of the string functions and didn't think this was going to be that hard of a problem, but alas I just can't wrap my head around it.
I have a form with 2 rich-text-boxes (one on top of the other) and 2 buttons. The top box is the master text (string) and the bottom box is the entry text (string). Every contestant sends a text to an email account; from the email we copy and paste the text into the Entry RTB and compare it to the Master RTB. Each word and each space counts as one thing to check. A word, no matter how many errors it has, is still 1 error. Every error adds 1 second to their time.
Examples:
Hello there! <= 3 checks (2 words and 1 space)
Helothere! <= 2 errors (Helo and space)
Hello there!! <= 1 error (extra ! at end of there!)
Hello there! How are you? <= 9 checks (5 words and 4 spaces)
Helothere!! How a re you? <= still 9 checks, 4 errors(helo, no space, extra !, and a space in are)
Hello there!# Ho are yu?? <= 3 errors (# at end of there!, no w, no o and extra ?; all errors within one word still count as 1)
What I have so far:
I've created 6 arrays (3 for master, 3 for entry) and they are
CharArray of all chars
StringArray of all strings(words) including the spaces
IntArray with length of the string in each StringArray
My biggest trouble is if the entry text is wrong and it's shorter or longer than the master. I keep getting IndexOutOfRange exceptions (understandably) but can't fathom how to go about checking and writing the code to compensate.
I hope I have made myself clear enough as to what I need help with. If anyone could give some code examples or something to shoot me in the right path would be very helpful.
Have you looked into the Levenshtein distance algorithm? It returns the number of single-character edits between two strings, which in your case would be texting errors. An implementation based on the pseudo-code found on the Wikipedia page passes the first 3 of your 4 use cases:
Assert.AreEqual(2, LevenshteinDistance("Hello there!", "Helothere!"));
Assert.AreEqual(1, LevenshteinDistance("Hello there!", "Hello there!!"));
Assert.AreEqual(4, LevenshteinDistance("Hello there! How are you?", "Helothere!! How a re you?"));
Assert.AreEqual(3, LevenshteinDistance("Hello there! How are you?", "Hello there!# Ho are yu??")); //fails, returns 4 errors
So while not perfect out of the box, it is probably a good starting point for you. Also, if you have too much trouble implementing your scoring rules, it might be worth revisiting them.
hth
Update:
Here is the result of the string you requested in the comments:
Assert.AreEqual(7, LevenshteinDistance("Hello there! How are you?", "Hlothere!! Hw a reYou?")); //fails, returns 8 errors
And here is my implementation of the Levenshtein Distance algorithm:
int LevenshteinDistance(string left, string right)
{
if (left == null || right == null)
{
return -1;
}
if (left.Length == 0)
{
return right.Length;
}
if (right.Length == 0)
{
return left.Length;
}
int[,] distance = new int[left.Length + 1, right.Length + 1];
for (int i = 0; i <= left.Length; i++)
{
distance[i, 0] = i;
}
for (int j = 0; j <= right.Length; j++)
{
distance[0, j] = j;
}
for (int i = 1; i <= left.Length; i++)
{
for (int j = 1; j <= right.Length; j++)
{
if (right[j - 1] == left[i - 1])
{
distance[i, j] = distance[i - 1, j - 1];
}
else
{
distance[i, j] = Min(distance[i - 1, j] + 1, //deletion
distance[i, j - 1] + 1, //insertion
distance[i - 1, j - 1] + 1); //substitution
}
}
}
return distance[left.Length, right.Length];
}
int Min(int val1, int val2, int val3)
{
return Math.Min(val1, Math.Min(val2, val3));
}
You need to come up with a scoring system that works for your situation.
I would split the text into an array of words at each space.
If a word is found at the same index: +5.
If a word is found within +-1 of the same index: +3 (keep a counter of how many words differ to increase the +- correction).
If a needed word is found as part of another word: +2.
And so on. Matching words is hard; coming up with a rules engine that works is 'easier'.
I once implemented an algorithm (which I can't find at the moment, I'll post code when I find it) which looked at the total number of PAIRS of adjacent characters in the target string, i.e. "Hello, World!" would have 12 pairs, { "He", "el", "ll",...,"ld", "d!" }.
You then do the same thing on an input string such as "Helo World" so you have { "He",...,"ld" }.
You can then calculate accuracy as a function of correct pairs (i.e. input pairs that are in the list of target pairs) and incorrect pairs (i.e. input pairs that do not exist in the list of target pairs), compared to the total number of target pairs. Over long enough sentences, this measure would be fairly accurate.
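Since the original code is not available, the following Python sketch is only a guess at how such a pair-based measure could look. The formula (correct pairs minus incorrect pairs, divided by the number of target pairs, floored at zero) is an assumption, and `pairs` and `pair_accuracy` are hypothetical names:

```python
from collections import Counter


def pairs(text):
    """Count every pair of adjacent characters in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def pair_accuracy(target, entered):
    """Score the entered text against the target, from 0.0 to 1.0."""
    target_pairs = pairs(target)
    entered_pairs = pairs(entered)
    total = sum(target_pairs.values())
    if total == 0:
        return 1.0 if not entered_pairs else 0.0
    # Pairs present in both (respecting how often each occurs).
    correct = sum((target_pairs & entered_pairs).values())
    # Entered pairs that the target does not account for.
    incorrect = sum((entered_pairs - target_pairs).values())
    return max(0.0, (correct - incorrect) / total)


print(round(pair_accuracy("Hello, World!", "Helo World"), 2))
```

Using counts rather than plain sets means a repeated pair in the input is only credited as often as it really occurs in the target.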
A simple algorithm would be to check letter by letter. If the letters differ, increment the number of errors. If the next pair of letters matches, it's a switched letter, so just continue. If the entered letter matches the next master letter, it is an omission; treat it accordingly. If the next entered letter matches the current master letter, it's an insertion; treat it accordingly. Otherwise the person really messed up, so just continue.
This doesn't get everything but with a few modifications this could become comprehensive.
a weak attempt at pseudocode:
edit: new idea. look at comments. I don't know the string functions off the top of my head so you'll have to figure that part out. The algorithm kinda fails for words that repeat a lot though...
string entry; //we'll pretend that this has stuff inside
string master; // this too...
string tempentry = entry; //stuff will be deleted so I need a copy to mess up
int e =0; //word index for entry
int m = 0; //word index for master
int errors = 0;
while(there are words in tempentry) //!tempentry.empty() ?
string mword = the next word in master;
m++;
int eplace = find mword in tempentry; //eplace is the index of where the mword starts in tempentry
if(eplace == -1) //word not there...
continue;
else
errors += m - e;
errors += find number of spaces before eplace
e = m // there is an error
tempentry = stripoff everything between the beginning and the next word// substring?
all words and spaces left in master are considered errors.
There are a couple of bounds checking errors that need to be fixed here but its a good start.
Suppose we have an IEnumerable collection with 20,000 Person objects.
Then suppose we have created another Person object.
We want to list all Persons that resemble this Person.
That means, for instance, if the Surname affinity is more than 90%, add that Person to the list.
e.g. ("Andrew" vs "Andrw")
What is the most effective / quick way of doing this?
Iterating through the collection and comparing char by char with affinity determination? OR?
Any ideas?
Thank you!
You may be interested in:
Levenshtein Distance Algorithm
Peter Norvig - How to Write a Spelling Corrector
(you'll be interested in the part where he compares a word against a collection of existing words)
Depending on how often you'll need to do this search, the brute force iterate and compare method might be fast enough. Twenty thousand records really isn't all that much and unless the number of requests is large your performance may be acceptable.
That said, you'll have to implement the comparison logic yourself, and if you want a large degree of flexibility (or if you find you need to work on performance) you might want to look at something like Lucene.Net. Most of the text search engines I've seen and worked with have been more file-based, but I think you can index in-memory objects as well (however, I'm not sure about that).
Good luck!
I'm not sure if you're asking for help writing the search given your existing affinity function, or if you're asking for help writing the affinity function. So for the moment I'll assume you're completely lost.
Given that assumption, you'll notice that I divided the problem into two pieces, and that's what you need to do as well. You need to write a function that takes two string inputs and returns a boolean value indicating whether or not the inputs are sufficiently similar. Then you need a separate search that accepts a delegate matching any function with that kind of signature.
The basic signature for your affinity function might look like this:
bool IsAffinityMatch(string p1, string p2)
And then your search would look like this:
MyPersonCollection.Where(p => IsAffinityMatch(p.Surname, OtherPerson.Surname));
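To make the split between the affinity function and the search concrete, here is a Python sketch. It swaps in the standard library's `difflib.SequenceMatcher` ratio as the affinity measure, which is a different technique from Levenshtein distance; `affinity`, `similar_surnames` and the 0.9 default threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher


def affinity(left, right):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def similar_surnames(surnames, target, threshold=0.9):
    """Keep only the surnames whose affinity reaches the threshold."""
    return [s for s in surnames if affinity(s, target) >= threshold]


print(similar_surnames(["Andrew", "Anderson", "Smith"], "Andrw"))
```

Note that the threshold interacts with the measure: with a Levenshtein-based ratio of 1 - distance / longest length, "Andrew" versus "Andrw" scores about 0.83 and would fall below a 90% cut-off, so the limit should be tuned for whichever function is used.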
I provide the source code of that Affinity method:
/// <summary>
/// Compute Levenshtein distance according to the Levenshtein Distance Algorithm
/// </summary>
/// <param name="s">String 1</param>
/// <param name="t">String 2</param>
/// <returns>Distance between the two strings.
/// The larger the number, the bigger the difference.
/// </returns>
private static int Compare(string s, string t)
{
/* if both string are not set, its uncomparable. But others fields can still match! */
if (string.IsNullOrEmpty(s) && string.IsNullOrEmpty(t)) return 0;
/* if one string has value and the other one hasn't, it's definitely not match */
if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(t)) return -1;
s = s.ToUpper().Trim();
t = t.ToUpper().Trim();
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
int cost;
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; d[i, 0] = i++) ;
for (int j = 0; j <= m; d[0, j] = j++) ;
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
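This means that a return value of 0 indicates the two strings are identical after upper-casing and trimming (or both are empty), while -1 indicates that exactly one of them is null or empty.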