I have the following implementation, but I want to add a threshold so that, if the result is going to be greater than the threshold, the function stops calculating and returns early.
How would I go about that?
EDIT: Here is my current code. The threshold parameter is not used yet; the goal is to make use of it.
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;
var del = d[i - 1, j] + 1;
var ins = d[i, j - 1] + 1;
var sub = d[i - 1, j - 1] + cost;
d[i, j] = Math.Min(del, Math.Min(ins, sub));
if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
}
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
}
This is regarding your answer to: Damerau - Levenshtein Distance, adding a threshold
(Sorry, I can't comment as I don't have 50 rep yet.)
I think you have made an error here. You initialized:
var minDistance = threshold;
And your update rule is:
if (d[i, j] < minDistance)
minDistance = d[i, j];
Also, your early exit criterion is:
if (minDistance > threshold)
return int.MaxValue;
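Now, observe that the if condition above will never hold true: minDistance starts at threshold and can only decrease, so it can never be greater than threshold. You should instead initialize minDistance to int.MaxValue.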
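To make the point concrete, here is a minimal sketch of a row-minimum early exit, not the original poster's exact code. The class and method names are hypothetical. It assumes the same convention as the question's final code (return int.MaxValue when the distance exceeds the threshold). Instead of int.MaxValue, the running minimum here starts at the row's first cell d[i, 0], which is also not capped by the threshold and additionally covers an empty second string.

```csharp
using System;

public static class BoundedDistance // hypothetical name, for illustration only
{
    // Returns the Damerau-Levenshtein (adjacent transposition) distance,
    // or int.MaxValue when that distance is greater than threshold.
    public static int DamerauLevenshtein(string a, string b, int threshold)
    {
        a = a ?? "";
        b = b ?? "";
        var d = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) d[0, j] = j;

        for (var i = 1; i <= a.Length; i++)
        {
            // Start from a value that belongs to this row, NOT from threshold;
            // otherwise the exit test below could never succeed.
            var rowMin = d[i, 0];
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                var best = Math.Min(d[i - 1, j] + 1,
                           Math.Min(d[i, j - 1] + 1, d[i - 1, j - 1] + cost));
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    best = Math.Min(best, d[i - 2, j - 2] + cost);
                d[i, j] = best;
                if (best < rowMin) rowMin = best;
            }
            // Every cell in this row is already over the limit, so stop.
            if (rowMin > threshold) return int.MaxValue;
        }

        var result = d[a.Length, b.Length];
        return result > threshold ? int.MaxValue : result;
    }
}
```

With this sketch, BoundedDistance.DamerauLevenshtein("kitten", "sitting", 3) yields 3, while the same call with a threshold of 2 yields int.MaxValue.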
Here's the most elegant way I can think of. After setting each index of d, see if it exceeds your threshold. The evaluation is constant-time, so it's a drop in the bucket compared to the theoretical N^2 complexity of the overall algorithm:
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
...
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
...
var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub));
if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost);
//Does this value exceed your threshold? if so, get out now
if(temp > threshold)
return temp;
}
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
You also asked this as a SQL CLR UDF question, so I'll answer in that specific context: your best optimization won't come from optimizing the Levenshtein distance, but from reducing the number of pairs you compare. Yes, a faster Levenshtein algorithm will improve things, but not nearly as much as reducing the number of comparisons from N squared (with N in the millions of rows) to N times some factor. My proposal is to compare only elements whose length difference is within a tolerable delta. On your big table, add a persisted computed column on LEN(Data) and then create an index on it that includes Data:
ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED;
CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data);
Now you can restrict the problem space by joining within a maximum difference in length (say, 5), provided your data's LEN(Data) varies significantly:
SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data)
FROM Table A
JOIN Table B ON B.LenData BETWEEN A.LenData - 5 AND A.LenData + 5
Finally got it, though it's not as beneficial as I had hoped:
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
var im1 = i - 1;
var im2 = i - 2;
var minDistance = threshold;
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var jm1 = j - 1;
var jm2 = j - 2;
var cost = string1[im1] == string2[jm1] ? 0 : 1;
var del = d[im1, j] + 1;
var ins = d[i, jm1] + 1;
var sub = d[im1, jm1] + cost;
//Math.Min is slower than native code
//d[i, j] = Math.Min(del, Math.Min(ins, sub));
d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);
if (d[i, j] < minDistance)
minDistance = d[i, j];
}
if (minDistance > threshold)
return int.MaxValue;
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
? int.MaxValue
: d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
I have a for loop, and I'd like to store the data from every iteration in C#.
At the moment it only stores the data from the last iteration.
My code is attached. Thanks a lot!
for (int i = 0; i < n; i++)
{
if (i <= k - p - 1)
{
alpha[i] = 1;
NewCPVector[i] = CPVector[i];
}
if (k - p <= i && i <= k-1)
{
alpha[i] = (FinalKnotsVector[k] - Initialknots[i]) / (Initialknots[i + p + 1] - Initialknots[i]);
NewCPVector[i] = alpha[i] * CPVector[i] + (1 - alpha[i]) * CPVector[i - 1];
}
if (i >= k)
{
alpha[i] = 0;
NewCPVector[i] = CPVector[i - 1];
}
}
I'm going to assume that your arrays hold double values. (They could hold other types, like float or decimal; you just have to specify that type in the declaration of the list.)
You could save the data in a List this way:
List<double> data = new List<double>();
for (int i = 0; i < n; i++)
{
if (i <= k - p - 1)
{
alpha[i] = 1;
data.Add(NewCPVector[i] = CPVector[i]);
}
if (k - p <= i && i <= k-1)
{
alpha[i] = (FinalKnotsVector[k] - Initialknots[i]) / (Initialknots[i + p + 1] - Initialknots[i]);
data.Add(alpha[i] * CPVector[i] + (1 - alpha[i]) * CPVector[i - 1]);
}
if (i >= k)
{
alpha[i] = 0;
data.Add(CPVector[i - 1]);
}
}
I am attempting to implement the Levenshtein Distance algorithm in C# (for practice and because it'd be handy to have). I used an implementation from the Wikipedia page but for some reason I'm getting the wrong distance on one set of words. Here's the code (from LinqPad):
void Main()
{
var ld = new LevenshteinDistance();
int dist = ld.LevenshteinDistanceCalc("sitting","kitten");
dist.Dump();
}
// Define other methods and classes here
public class LevenshteinDistance
{
private int[,] distance;
public int LevenshteinDistanceCalc(string source, string target)
{
int sourceSize = source.Length, targetSize = target.Length;
distance = new int[sourceSize, targetSize];
for (int sIndex = 0; sIndex < sourceSize; sIndex++)
{
distance[sIndex, 0] = sIndex;
}
for (int tIndex = 0; tIndex < targetSize; tIndex++)
{
distance[0,tIndex] = tIndex;
}
// for j from 1 to n:
// for i from 1 to m:
// if s[i] = t[j]:
// substitutionCost:= 0
// else:
// substitutionCost:= 1
// d[i, j] := minimum(d[i - 1, j] + 1, // deletion
// d[i, j - 1] + 1, // insertion
// d[i - 1, j - 1] + substitutionCost) // substitution
//
//
// return d[m, n]
for (int tIndex = 1; tIndex < targetSize; tIndex++)
{
for (int sIndex = 1; sIndex < sourceSize; sIndex++)
{
int substitutionCost = source[sIndex] == target[tIndex] ? 0 : 1;
int deletion = distance[sIndex-1, tIndex]+1;
int insertion = distance[sIndex,tIndex-1]+1;
int substitution = distance[sIndex-1, tIndex-1] + substitutionCost;
distance[sIndex, tIndex] = leastOfThree(deletion, insertion, substitution);
}
}
return distance[sourceSize-1,targetSize-1];
}
private int leastOfThree(int a, int b, int c)
{
return Math.Min(a,(Math.Min(b,c)));
}
}
When I try "sitting" and "kitten" I get an LD of 2 (should be 3). Yet when I try "Saturday" and "Sunday" I get an LD of 3 (which is correct). I know something's wrong but I can't figure out what I'm missing.
The example on Wikipedia uses 1-based strings. In C# we use 0-based strings.
In their matrix the 0-row and 0-column do exist, so the size of their matrix is [source.Length + 1, target.Length + 1]. In your code they don't exist.
public int LevenshteinDistanceCalc(string source, string target)
{
int sourceSize = source.Length, targetSize = target.Length;
distance = new int[sourceSize + 1, targetSize + 1];
for (int sIndex = 1; sIndex <= sourceSize; sIndex++)
distance[sIndex, 0] = sIndex;
for (int tIndex = 1; tIndex <= targetSize; tIndex++)
distance[0, tIndex] = tIndex;
for (int tIndex = 1; tIndex <= targetSize; tIndex++)
{
for (int sIndex = 1; sIndex <= sourceSize; sIndex++)
{
int substitutionCost = source[sIndex-1] == target[tIndex-1] ? 0 : 1;
int deletion = distance[sIndex - 1, tIndex] + 1;
int insertion = distance[sIndex, tIndex - 1] + 1;
int substitution = distance[sIndex - 1, tIndex - 1] + substitutionCost;
distance[sIndex, tIndex] = leastOfThree(deletion, insertion, substitution);
}
}
return distance[sourceSize, targetSize];
}
Your matrix isn't big enough.
In the pseudo-code, s and t have lengths m and n respectively (char s[1..m], char t[1..n]). The matrix, however, has dimensions [0..m, 0..n], i.e. one more than the length of the strings in each direction. You can see this in the tables below the pseudo-code.
So the matrix for "sitting" and "kitten" is 7x8, but your matrix is only 6x7.
You're also indexing into the strings incorrectly, because the strings in the pseudo-code are 1-indexed, but C#'s strings are 0-indexed.
After fixing these, you get this code, which works with "sitting" and "kitten":
public static class LevenshteinDistance
{
public static int LevenshteinDistanceCalc(string source, string target)
{
int sourceSize = source.Length + 1, targetSize = target.Length + 1;
int[,] distance = new int[sourceSize, targetSize];
for (int sIndex = 0; sIndex < sourceSize; sIndex++)
{
distance[sIndex, 0] = sIndex;
}
for (int tIndex = 0; tIndex < targetSize; tIndex++)
{
distance[0, tIndex] = tIndex;
}
// for j from 1 to n:
// for i from 1 to m:
// if s[i] = t[j]:
// substitutionCost:= 0
// else:
// substitutionCost:= 1
// d[i, j] := minimum(d[i - 1, j] + 1, // deletion
// d[i, j - 1] + 1, // insertion
// d[i - 1, j - 1] + substitutionCost) // substitution
//
//
// return d[m, n]
for (int tIndex = 1; tIndex < targetSize; tIndex++)
{
for (int sIndex = 1; sIndex < sourceSize; sIndex++)
{
int substitutionCost = source[sIndex - 1] == target[tIndex - 1] ? 0 : 1;
int deletion = distance[sIndex - 1, tIndex] + 1;
int insertion = distance[sIndex, tIndex - 1] + 1;
int substitution = distance[sIndex - 1, tIndex - 1] + substitutionCost;
distance[sIndex, tIndex] = leastOfThree(deletion, insertion, substitution);
}
}
return distance[sourceSize - 1, targetSize - 1];
}
private static int leastOfThree(int a, int b, int c)
{
return Math.Min(a, (Math.Min(b, c)));
}
}
(I also took the liberty of making distance a local variable since there's no need for it to be a field (it only makes your class non-threadsafe), and also making it static to avoid the unnecessary instantiation).
To debug this, I put a breakpoint on return distance[sourceSize - 1, targetSize - 1] and compared distance to the table on Wikipedia. It was very obvious that it was too small.
I have this Levenshtein algorithm:
public static int? GetLevenshteinDistance(string input, string output, int maxDistance)
{
var stringOne = String.Empty;
var stringTwo = String.Empty;
if (input.Length >= output.Length)
{
stringOne = input;
stringTwo = output;
}
else
{
stringOne = output;
stringTwo = input;
}
var stringOneLength = stringOne.Length;
var stringTwoLength = stringTwo.Length;
var matrix = new int[stringOneLength + 1, stringTwoLength + 1];
for (var i = 0; i <= stringOneLength; matrix[i, 0] = i++) { }
for (var j = 0; j <= stringTwoLength; matrix[0, j] = j++) { }
for (var i = 1; i <= stringOneLength; i++)
{
bool isBreak = true;
for (var j = 1; j <= stringTwoLength; j++)
{
var cost = (stringTwo[j - 1] == stringOne[i - 1]) ? 0 : 1;
matrix[i, j] = Math.Min(
Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1),
matrix[i - 1, j - 1] + cost);
if (matrix[i, j] < maxDistance)
{
isBreak = false;
}
}
if (isBreak)
{
return null;
}
}
return matrix[stringOneLength, stringTwoLength];
}
I check each value, and if it is greater than the max distance I break out of the for loop.
But it does not always work correctly.
For example:
string1 = "#rewRPAF"
string2 = "#rewQVRZP"
maxDistance = 4
I get the value 5 instead of null.
I took this solution from here - Levenstein distance limit
We don't fix code on here, but I will help you fix it yourself.
Change this
if (matrix[i, j] < maxDistance)
{
isBreak = false;
}
to
if (matrix[i, j] < maxDistance)
{
isBreak = false;
} else {
System.Diagnostics.Debugger.Break();
}
That should break into the debugger when a value reaches maxDistance. When that happens, step forward in the debugger and follow what your program does. That should allow you to see what is happening that you don't want.
Look at what happens the first time around the inner loop. At this point the cost cannot exceed one, so isBreak is always set to false if maxDistance is greater than 1.
My gut says:
scrap everything to do with isBreak
int distance = matrix[stringOneLength, stringTwoLength];
return distance > maxDistance ? (int?)null : distance;
but I haven't tried it.
Alternatively (I haven't done enough with Levenshtein to be confident of this approach):
scrap everything to do with isBreak
if (matrix[i, j] < maxDistance)
{
isBreak = false;
}
becomes
if (matrix[i, j] > maxDistance)
{
return null;
}
(Note that your termination test had an off-by-one.)
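Putting the two suggestions together, a minimal sketch could look like the following. It is not a drop-in replacement: the class and method names are hypothetical, and it assumes the intended contract is "return null when the distance is greater than maxDistance". It exits early only when every value in a row is over the limit, and it also checks the final value, which the original code never did.

```csharp
using System;

public static class LevenshteinLimit // hypothetical name, for illustration only
{
    // Returns the Levenshtein distance, or null when it is greater than maxDistance.
    public static int? GetBoundedDistance(string input, string output, int maxDistance)
    {
        int n = input.Length, m = output.Length;

        // The length difference is a lower bound on the distance.
        if (Math.Abs(n - m) > maxDistance) return null;

        var matrix = new int[n + 1, m + 1];
        for (var i = 0; i <= n; i++) matrix[i, 0] = i;
        for (var j = 0; j <= m; j++) matrix[0, j] = j;

        for (var i = 1; i <= n; i++)
        {
            var rowMin = matrix[i, 0];
            for (var j = 1; j <= m; j++)
            {
                var cost = input[i - 1] == output[j - 1] ? 0 : 1;
                matrix[i, j] = Math.Min(
                    Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1),
                    matrix[i - 1, j - 1] + cost);
                rowMin = Math.Min(rowMin, matrix[i, j]);
            }
            // Early exit: no cell in this row is within the limit.
            if (rowMin > maxDistance) return null;
        }

        // A row minimum within the limit does not guarantee the final cell is.
        var distance = matrix[n, m];
        return distance > maxDistance ? (int?)null : distance;
    }
}
```

For the strings from the question, LevenshteinLimit.GetBoundedDistance("#rewRPAF", "#rewQVRZP", 4) returns null, and the same call with a limit of 5 returns 5.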
I get this performance warning from Visual Studio (Prefer jagged arrays over multidimensional).
The code to be replaced is marked with the "// matrix" comment.
How can I do this with my code?
public static int LevenshteinDistance(string s, string t)
{
int n = s.Length; //length of s
int m = t.Length; //length of t
int[,] d = new int[n + 1, m + 1]; // matrix
int cost; // cost
// Step 1
if (n == 0) return m;
if (m == 0) return n;
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++) ;
for (int j = 0; j <= m; d[0, j] = j++) ;
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
// Step 6
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
Here's a version which uses only a single-dimensional array.
public static int LevenshteinDistance(string s, string t)
{
int n = s.Length; //length of s
int m = t.Length; //length of t
int stride = m+1;
int[] d = new int[(n + 1)*stride];
// note, d[i*m + i + j] holds (i,j)
int cost;
// Step 1
if (n == 0) return m;
if (m == 0) return n;
// Step 2, adjusted to skip (0,0)
for (int i = 0, k = stride; k < d.Length; k += stride) d[k] = ++i;
for (int j = 1; j < stride; ++j) d[j] = j;
// Step 3
int newrow = stride * 2;
char si = s[0];
for (int i=0, j=0, k = stride + 1; k < d.Length; ++k)
{
// don't overwrite d[i,0]
if (k == newrow) {
newrow += stride;
j=0;
si=s[++i];
continue;
}
// Step 5
cost = (t[j++] == si ? 0 : 1); // advance j so each column compares against the next character of t
// Step 6
d[k] = System.Math.Min(System.Math.Min(
d[k-stride] + 1, /* up one row */
d[k-1] + 1 /* left one */ ),
d[k-stride-1] + cost /* diagonal */ );
}
// Step 7
return d[d.Length-1];
}
This should improve performance 3 ways:
No string comparison and no one-character string garbage for the GC to clean up
Changed memory layout to match iteration order, improving cache behavior
Used single dimensional array and optimizer-friendly idioms, which should reduce bounds-checking
However, I'm pretty sure that applying mike z's suggestion of using two vectors will make for even clearer code.
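For comparison, here is a rough sketch of what the two-vector idea might look like. This is my reading of that suggestion, not mike z's actual code, and the class name is hypothetical. Only the previous and current rows are kept, since each cell depends on nothing older than the row above it.

```csharp
using System;

public static class TwoRowLevenshtein // hypothetical name, for illustration only
{
    public static int Distance(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        var previous = new int[t.Length + 1]; // row i - 1
        var current = new int[t.Length + 1];  // row i

        // Row 0: turning an empty prefix of s into t[0..j) takes j insertions.
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i; // column 0: i deletions
            char si = s[i - 1];
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = t[j - 1] == si ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1,  // deletion
                             current[j - 1] + 1), // insertion
                    previous[j - 1] + cost);   // substitution
            }
            // Reuse the two arrays instead of allocating a new row.
            var swap = previous;
            previous = current;
            current = swap;
        }

        // After the final swap, the last computed row is in "previous".
        return previous[t.Length];
    }
}
```

This keeps memory at two rows of t.Length + 1 integers and avoids both the multidimensional array and the manual index arithmetic.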
Hi, I'm using the Levenshtein algorithm to calculate the difference between two strings, using the code below. It currently provides the total number of changes needed to get from 'answer' to 'target', but I'd like to split these up by the type of error being made, classifying each error as a deletion, substitution or insertion.
I've tried adding a simple count, but I'm new at this and don't really understand how the code works, so I'm not sure how to go about it.
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
Thanks in advance.
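One possible approach, sketched here under assumptions, is to fill the matrix as above and then walk back from d[n, m] to d[0, 0], noting which of the three neighbours produced each value. The class and method names are hypothetical. Note that an optimal edit script is generally not unique, so the breakdown depends on the tie-breaking order; this sketch prefers the diagonal step, then deletion, then insertion.

```csharp
using System;

public static class LevenshteinBreakdown // hypothetical name, for illustration only
{
    // Counts the operations in ONE optimal edit script that turns s into t.
    public static (int Insertions, int Deletions, int Substitutions) Count(string s, string t)
    {
        int n = s.Length, m = t.Length;
        var d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
            {
                int cost = t[j - 1] == s[i - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }

        int ins = 0, del = 0, sub = 0;
        int x = n, y = m;
        while (x > 0 || y > 0)
        {
            if (x > 0 && y > 0)
            {
                int cost = t[y - 1] == s[x - 1] ? 0 : 1;
                if (d[x, y] == d[x - 1, y - 1] + cost)
                {
                    sub += cost; // a match (cost 0) is not counted as an edit
                    x--; y--;
                    continue;
                }
            }
            if (x > 0 && d[x, y] == d[x - 1, y] + 1)
            {
                del++; // a character of s was removed
                x--;
            }
            else
            {
                ins++; // a character of t was added
                y--;
            }
        }
        return (ins, del, sub);
    }
}
```

The three counts always add up to the value that Compute returns, because the walk follows a path of minimal cost through the same matrix.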