C# implementation of Levenshtein algorithm for substring matching

I'm experimenting with Levenshtein distance to build a C# implementation that can not only tell whether two strings are similar, but also find a string similar to a given one (the needle) inside a larger string (the haystack).
To this end, I tried to follow the advice at the bottom of this excellent post, but I'm running into some issues.
To start with, I adopted this implementation and changed it to fit my additional requirements. I also added some diagnostic dump support to help me understand the algorithm better, inspired by this other post.
My implementation returns an object with the score and (when requested) the index and length of the match, plus a reference to the calculated matrix for diagnostic purposes:
public class LevenshteinMatch
{
public int Score { get; }
public int Index { get; }
public int Length { get; }
public int[,] Matrix { get; set; }
public LevenshteinMatch(int score, int index = 0, int length = 0)
{
Score = score;
Index = index;
Length = length;
}
public override string ToString()
{
return $"{Score} #{Index}x{Length}";
}
}
Here is my implementation: the Distance method works "normally" if sub is false; otherwise, it finds a similar substring. DumpMatrix is just a diagnostic helper method.
public static class Levenshtein
{
public static string DumpMatrix(int[,] d, string a, string b)
{
if (d == null) throw new ArgumentNullException(nameof(d));
if (a == null) throw new ArgumentNullException(nameof(a));
if (b == null) throw new ArgumentNullException(nameof(b));
//       #  k  i  t  t  e  n
//      00 01 02 03 04 05 06
// # 00 .. .. .. .. .. .. ..
// s 01 .. .. .. .. .. .. ..
// ...etc (sitting)
StringBuilder sb = new StringBuilder();
int n = a.Length;
int m = b.Length;
// b-legend
sb.Append(" # ");
for (int j = 0; j < m; j++) sb.Append(b[j]).Append(" ");
sb.AppendLine();
sb.Append(" 00 ");
for (int j = 1; j < m; j++) sb.AppendFormat("{0:00}", j).Append(' ');
sb.AppendFormat("{0:00} ", m).AppendLine();
// matrix
for (int i = 0; i <= n; i++)
{
// a-legend
if (i == 0)
{
sb.Append("# 00 ");
}
else
{
sb.Append(a[i - 1])
.Append(' ')
.AppendFormat("{0:00}", i)
.Append(' ');
}
// row of values
for (int j = 0; j <= m; j++)
sb.AppendFormat("{0,2} ", d[i, j]);
sb.AppendLine();
}
return sb.ToString();
}
private static LevenshteinMatch BuildMatch(string a, string b, int[,] d)
{
int n = a.Length;
int m = b.Length;
// take the min rightmost score instead of the bottom-right corner
int min = 0, rightMinIndex = -1;
for (int j = m; j > -1; j--)
{
if (rightMinIndex == -1 || d[n, j] < min)
{
min = d[n, j];
rightMinIndex = j;
}
}
// corner case: perfect match, just collect m chars from score=0
if (min == 0)
{
return new LevenshteinMatch(min,
rightMinIndex - n,
n);
}
// collect all the lowest scores on the bottom row leftwards,
// up to the length of the needle
int count = n, leftMinIndex = rightMinIndex;
while (leftMinIndex > -1)
{
if (d[n, leftMinIndex] == min && --count == 0) break;
leftMinIndex--;
}
return new LevenshteinMatch(min,
leftMinIndex - 1,
rightMinIndex + 1 - leftMinIndex);
}
public static LevenshteinMatch Distance(string a, string b,
bool sub = false, bool withMatrix = false)
{
if (a is null) throw new ArgumentNullException(nameof(a));
if (b == null) throw new ArgumentNullException(nameof(b));
int n = a.Length;
int m = b.Length;
int[,] d = new int[n + 1, m + 1];
if (n == 0) return new LevenshteinMatch(m);
if (m == 0) return new LevenshteinMatch(n);
for (int i = 0; i <= n; i++) d[i, 0] = i;
// if matching substring, leave the top row to 0
if (!sub)
{
for (int j = 0; j <= m; j++) d[0, j] = j;
}
for (int j = 1; j <= m; j++)
{
for (int i = 1; i <= n; i++)
{
if (a[i - 1] == b[j - 1])
{
d[i, j] = d[i - 1, j - 1]; // no operation
}
else
{
d[i, j] = Math.Min(Math.Min(
d[i - 1, j] + 1, // a deletion
d[i, j - 1] + 1), // an insertion
d[i - 1, j - 1] + 1 // a substitution
);
}
}
}
LevenshteinMatch match = sub
? BuildMatch(a, b, d)
: new LevenshteinMatch(d[n, m]);
if (withMatrix) match.Matrix = d;
return match;
}
}
To be more complete, here is the demo console program using it. This just prompts the user for the matching mode (substring or not) and the two strings, then calls the Distance method, dumps the resulting matrix, and shows the substring if required.
internal static class Program
{
private static string ReadLine(string defaultLine)
{
string s = Console.ReadLine();
return string.IsNullOrEmpty(s) ? defaultLine ?? s : s;
}
private static void Main()
{
Console.WriteLine("Fuzzy Levenshtein Matcher");
string a = "sitting", b = "kitten";
bool sub = false;
LevenshteinMatch match;
while (true)
{
Console.Write("sub [y/n]? ");
string yn = Console.ReadLine();
if (!string.IsNullOrEmpty(yn)) sub = yn == "y" || yn == "Y";
Console.Write(sub? $"needle ({a}): " : $"A ({a}): ");
a = ReadLine(a);
Console.Write(sub? $"haystack ({b}): " : $"B ({b}): ");
b = ReadLine(b);
match = Levenshtein.Distance(a, b, sub, true);
Console.WriteLine($"{a} - {b}: {match}");
Console.WriteLine(Levenshtein.DumpMatrix(match.Matrix, a, b));
if (sub) Console.WriteLine(b.Substring(match.Index, match.Length));
}
}
}
Now, for substring matches this works in a case like "aba" in "c abba c". Here is the matrix:
aba - c abba c: 1 #3x3
      #  c     a  b  b  a     c
     00 01 02 03 04 05 06 07 08
# 00  0  0  0  0  0  0  0  0  0
a 01  1  1  1  0  1  1  0  1  1
b 02  2  2  2  1  0  1  1  1  2
a 03  3  3  3  2  1  1  1  2  2
Yet, in other cases, e.g. "abas" in "ego sum abbas Cucaniensis", I fail to collect the min scores from the bottom row:
abas - ego sum abbas Cucaniensis: 1 #-2x15
      #  e  g  o     s  u  m     a  b  b  a  s     C  u  c  a  n  i  e  n  s  i  s
     00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# 00  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
a 01  1  1  1  1  1  1  1  1  1  0  1  1  0  1  1  1  1  1  0  1  1  1  1  1  1  1
b 02  2  2  2  2  2  2  2  2  2  1  0  1  1  1  2  2  2  2  1  1  2  2  2  2  2  2
a 03  3  3  3  3  3  3  3  3  3  2  1  1  1  2  2  3  3  3  2  2  2  3  3  3  3  3
s 04  4  4  4  4  4  3  4  4  4  3  2  2  2  1  2  3  4  4  3  3  3  3  4  3  4  3
Here there is just a single score=1 in the bottom row. In the case of a perfect match (score=0) my code just takes the N characters (where N is the length of the needle) to the left of the rightmost lowest score; but here I have scores greater than 0. Probably I've just misinterpreted the hints in the above post, as I'm new to the internals of this algorithm. Could anyone suggest the correct way of finding the needle's index and length in the haystack?

Start at the best score in the bottom row: the 1 at (13,4), written here as (column, row).
Then look at the predecessor states and transitions that could have led there:
(12,4) - not possible, because it has a higher difference
(13,3) - not possible, because it has a higher difference
(12,3) - same difference and the characters match, so this works
From (12,3), follow the same procedure to get to (11,2) and then (10,1).
At (10,1) the letters don't match, so you couldn't have come from (9,0) by a match. You could use either (10,0), which gives the similar string "bas", or (9,1) and then (8,0), which gives the similar string "abbas"; both have distance 1.
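The backtracking described above could be sketched roughly like this. It is only an illustration under some assumptions: the matrix is filled with a zero top row as in the question's substring mode (rows for the needle, columns for the haystack), the names SubstringTraceback, BuildMatrix and FindMatch are made up for this example, and ties between equally cheap predecessors are broken in an arbitrary order (match, then left, then up, then substitution), so other equally good answers exist.

```csharp
using System;

public static class SubstringTraceback
{
    // Fills the edit-distance matrix with a zero top row, so a match may
    // start at any haystack position (rows = needle, columns = haystack).
    public static int[,] BuildMatrix(string needle, string haystack)
    {
        int n = needle.Length, m = haystack.Length;
        var d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = needle[i - 1] == haystack[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d;
    }

    // Returns the best score plus the start index and length of one
    // haystack substring that achieves it.
    public static (int Score, int Index, int Length) FindMatch(string needle, string haystack)
    {
        int[,] d = BuildMatrix(needle, haystack);
        int n = needle.Length, m = haystack.Length;

        // The rightmost lowest score in the bottom row marks where the match ends.
        int end = m;
        for (int j = m - 1; j >= 0; j--)
            if (d[n, j] < d[n, end]) end = j;

        // Walk back through predecessors until the top row is reached;
        // the column reached there is the start of the match.
        int row = n, col = end;
        while (row > 0)
        {
            bool same = col > 0 && needle[row - 1] == haystack[col - 1];
            if (same && d[row, col] == d[row - 1, col - 1]) { row--; col--; } // match
            else if (col > 0 && d[row, col] == d[row, col - 1] + 1) col--;    // extra haystack char
            else if (d[row, col] == d[row - 1, col] + 1) row--;               // needle char missing
            else { row--; col--; }                                            // substitution
        }
        return (d[n, end], col, end - col);
    }
}
```

With this tie-breaking order, FindMatch("abas", "ego sum abbas Cucaniensis") walks the path through (9,1) and (8,0) from the answer and yields index 8 and length 5, i.e. "abbas".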

Related

Write some code that accepts an integer and prints the integers from 0 to that input integer in a spiral format

Write some code that accepts an integer and prints the integers from 0 to that input integer in a spiral format.
For example, if I supplied 24 the output would be:
20 21 22 23 24
19 6 7 8 9
18 5 0 1 10
17 4 3 2 11
16 15 14 13 12
This is the code I have tried, but I am not able to get the expected output shown in the example. Please suggest a fix.
This is the actual output:
25 24 23 22 21
10 9 8 7 20
11 2 1 6 19
12 3 4 5 18
13 14 15 16 17
class Program
{
static void Main(string[] args)
{
int n = 5;
printSpiral(n);
}
static void printSpiral(int n)
{
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
int x;
x = Math.Min(Math.Min(i, j),
Math.Min(n - 1 - i, n - 1 - j));
if (i <= j)
Console.Write((n - 2 * x) *
(n - 2 * x) -
(i - x) - (j - x) + "\t");
else
Console.Write((n - 2 * x - 2) *
(n - 2 * x - 2) +
(i - x) + (j - x) + "\t");
}
Console.WriteLine();
}
}
}
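One possible approach, which is not from the original thread, is to stop looking for a closed formula and simply walk the spiral outward from the centre: right, down, left, up, with the run length growing by one after every second turn. The sketch below assumes that the input plus one is the square of an odd number (24 gives a 5 x 5 grid); the names Spiral and Build are hypothetical, and other inputs are rejected rather than handled.

```csharp
using System;

public static class Spiral
{
    // Returns a size x size grid with 0 at the centre and the numbers
    // spiralling outward clockwise (right, down, left, up).
    // Assumes max + 1 is the square of an odd number, e.g. 24 -> 5 x 5.
    public static int[,] Build(int max)
    {
        int size = (int)Math.Round(Math.Sqrt(max + 1));
        if (size * size != max + 1 || size % 2 == 0)
            throw new ArgumentException("max + 1 must be an odd perfect square");

        var grid = new int[size, size];
        int row = size / 2, col = size / 2;
        int[] dRow = { 0, 1, 0, -1 };   // right, down, left, up
        int[] dCol = { 1, 0, -1, 0 };

        int value = 0, dir = 0, run = 1;
        grid[row, col] = value;
        while (value < max)
        {
            // Two runs of the same length, then the run grows by one.
            for (int turn = 0; turn < 2 && value < max; turn++)
            {
                for (int step = 0; step < run && value < max; step++)
                {
                    row += dRow[dir];
                    col += dCol[dir];
                    grid[row, col] = ++value;
                }
                dir = (dir + 1) % 4;
            }
            run++;
        }
        return grid;
    }
}
```

Printing the grid row by row with tab separators, as in the question's code, should then give the layout from the example for an input of 24.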

Print a matrix in following format for given number

For a given number n I have to print the following matrix (n = 3 example):
3 3 3 3 3
3 2 2 2 3
3 2 1 2 3
3 2 2 2 3
3 3 3 3 3
The number of rows and columns should be (2 * n) - 1. I tried to find the pattern but couldn't figure it out. Any help would be appreciated. Thanks

Something like this:
private static int[][] Matrix(int n) {
// Create arrays
int[][] result = Enumerable.Range(0, 2 * n - 1)
.Select(_ => new int[2 * n - 1])
.ToArray();
// Feed arrays
for (int i = 1; i <= n; ++i) {
int from = i - 1;
int to = 2 * n - i - 1;
int v = n - i + 1;
for (int j = from; j <= to; ++j) {
result[from][j] = v;
result[to][j] = v;
result[j][from] = v;
result[j][to] = v;
}
}
return result;
}
....
int n = 3;
String report = String.Join(Environment.NewLine, Matrix(n)
.Select(line => String.Join(" ", line)));
Console.Write(report);
Output for n = 3 is
3 3 3 3 3
3 2 2 2 3
3 2 1 2 3
3 2 2 2 3
3 3 3 3 3
And for n = 4:
4 4 4 4 4 4 4
4 3 3 3 3 3 4
4 3 2 2 2 3 4
4 3 2 1 2 3 4
4 3 2 2 2 3 4
4 3 3 3 3 3 4
4 4 4 4 4 4 4

Here's a version that doesn't use any intermediate storage:
static void printMatrix(int n)
{
int x = 2*n - 1;
for (int i = 0, p = n; i < x; ++i, p += (i > x/2) ? 1 : -1)
{
for (int j = 0, q = n; j < x; ++j, q += (j > x/2) ? 1 : -1)
Console.Write(Math.Max(p, q) + " ");
Console.WriteLine();
}
}
This works as follows:
The outer loop (i) and inner loop (j) both go from 0 to 2*n-2.
However, the values that we want to print (p and q) start at n and decrease until halfway across/down the matrix, at which point they start increasing again.
We can determine whether to increment or decrement these values by checking the loop variable to see if it has passed the halfway point yet. If it has, we increment; otherwise we decrement.
That's what this is doing: p += (i > x/2) ? 1 : -1.
If i > x/2 then 1 is added; otherwise -1 is added (i.e. the value is decremented).
(Similarly for q.)
The final piece of the puzzle is that the value we want to use is actually the maximum of p and q. If you inspect the matrix, you will see that if you consider each row value and each column value, the maximum of each is used for the corresponding cell.
Hence the use of Math.Max(p, q) in the output.
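Another way to look at the Math.Max(p, q) step: p is one more than the row's distance from the centre row, and q is one more than the column's distance from the centre column. A small sketch that computes the cell directly from that observation, without running counters, might look like this (RingMatrix, RingValue and Render are names invented for the illustration):

```csharp
using System;
using System.Text;

public static class RingMatrix
{
    // Value at (row, col) of the (2n-1) x (2n-1) matrix: one more than the
    // larger of the row and column distances from the centre cell (n-1, n-1).
    public static int RingValue(int n, int row, int col) =>
        Math.Max(Math.Abs(row - (n - 1)), Math.Abs(col - (n - 1))) + 1;

    // Renders the whole matrix, one line per row, values separated by spaces.
    public static string Render(int n)
    {
        int size = 2 * n - 1;
        var sb = new StringBuilder();
        for (int row = 0; row < size; row++)
        {
            for (int col = 0; col < size; col++)
            {
                if (col > 0) sb.Append(' ');
                sb.Append(RingValue(n, row, col));
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Console.Write(RingMatrix.Render(3)) prints the same 5 x 5 matrix as in the question.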

Here is a simpler, less complicated solution, and the fastest one here so far:
private static void printMatrix(int n)
{
// length of the matrix in one dimension
int length = (2 * n) - 1;
// iterate through y axis of the matrix
for (int i = 0; i < length; i++)
{
int value = n;
// iterate through x axis of the matrix
for (int j = 0; j < length; j++)
{
Console.Write(value);
if (i > j && i + j < length - 1)
{
value--;
}
else if (i <= j && i + j >= length - 1)
{
value++;
}
}
Console.WriteLine();
}
}
Explanation of the if statements
First, look at the matrix as if it were an array, and note the indexes and values and what changes while iterating through it in the two for loops. The value of i is for the y axis, from top to bottom, and the value of j is for the x axis, from left to right.
// +1 and -1 means the changes of the values
0,0 0,1 0,2 0,3 0,4 // values of i,j (first is i, second is j)
[3] [3] [3] [3] [3] // printed value
1,0 1,1 1,2 1,3 1,4
[3] -1 [2] [2] [2] +1 [3]
2,0 2,1 2,2 2,3 2,4
[3] -1 [2] -1 [1] +1 [2] +1 [3]
3,0 3,1 3,2 3,3 3,4
[3] -1 [2] [2] [2] +1 [3]
4,0 4,1 4,2 4,3 4,4
[3] [3] [3] [3] [3]
As you can see, the value changes only in specific circumstances.
The value decreases by 1 only if i > j and i + j < length - 1. Without the second condition you would get wrong values from index 3,1 onwards, where the value should no longer be decremented.
These two conditions lead us to the first if statement in the code:
if (i > j && i + j < length - 1)
{
value--;
}
Similarly, the value increases by 1 if i + j >= length - 1. So that positions such as 3,1 don't also add 1, it is incremented only if i <= j as well, which leads us to the second if statement in the code:
else if (i <= j && i + j >= length - 1)
{
value++;
}
If neither of these conditions is true, the value simply stays the same, as it should.

Binding Variable Nested List to GridView

I am trying to make a multiplication table appear on a page based on input from the user. This is my code:
<asp:GridView runat="server" ID="TableData"></asp:GridView>
List<List<int>> nestedList = new List<List<int>>();
protected void LoadTable(int val)
{
for (int y = 0; y <= val; y++)
{
List<int> list = new List<int>();
for (int x = 0; x <= val; x++)
list.Add(x * y);
nestedList.Add(list);
}
TableData.DataSource = nestedList;
TableData.DataBind();
}
But this displays as:
Capacity Count
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
16 14
What am I doing wrong?
For clarification, if the user enters 5, the output should be:
0 0 0 0 0 0
0 1 2 3 4 5
0 2 4 6 8 10
0 3 6 9 12 15
0 4 8 12 16 20
0 5 10 15 20 25
I am not worried about column or row headers at this time.

The problem is with your items source.
A List<List<int>> is not a good choice, I think.
For a linear view you can use this approach:
Code Snippet
var objList = new List<object>();
for (int i = 0; i < 5; i++)
{
var temp = new { operation = string.Format("{0} * {1}", i, i + 1), result = i * (i + 1) };
objList.Add(temp);
}

GridView does not support binding to a 2D list; consider using another method.
For example, use a simple List<string>, where each string represents a row. You can fill each string with a loop like this:
// nestedList is now declared as List<string>
for (int y = 0; y <= val; y++)
{
    string s = "";
    for (int x = 0; x <= val; x++)
    {
        s += (x * y).ToString() + " ";
    }
    nestedList.Add(s);
}
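As far as I can tell, the Capacity and Count columns appear because GridView generates its columns from the public properties of each bound item, and each item here is a List<int>. One alternative, not mentioned in the answers above, is to bind a DataTable with one column per value. The sketch below is only an illustration: MultiplicationTable, Build and the column names C0, C1, ... are made up, and the binding lines are shown as comments because they need the page from the question.

```csharp
using System;
using System.Data;

public static class MultiplicationTable
{
    // Builds a (val + 1) x (val + 1) table; the cell in row y, column x holds x * y.
    public static DataTable Build(int val)
    {
        var table = new DataTable();
        for (int x = 0; x <= val; x++)
            table.Columns.Add("C" + x, typeof(int));   // hypothetical column names

        for (int y = 0; y <= val; y++)
        {
            DataRow row = table.NewRow();
            for (int x = 0; x <= val; x++) row[x] = x * y;
            table.Rows.Add(row);
        }
        return table;
    }
}

// In the page's code-behind (assumed usage, mirroring the question):
//   TableData.DataSource = MultiplicationTable.Build(val);
//   TableData.DataBind();
```

Since the question does not care about headers yet, setting ShowHeader to false on the GridView should hide the generated column names.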

How to implement Gaussian elimination for binary equations

I have this system of equations:
1=x⊕y⊕z
1=x⊕y⊕w
0=x⊕w⊕z
1=w⊕y⊕z
I'm trying to implement Gaussian elimination to solve this system as described here, replacing division, subtraction and multiplication by XOR, but it gives me the wrong answer. The correct answer is (x,y,z,w)=(0,1,0,0). What am I doing wrong?
public static void ComputeCoefficents(byte[,] X, byte[] Y)
{
int I, J, K, K1, N;
N = Y.Length;
for (K = 0; K < N; K++)
{
K1 = K + 1;
for (I = K; I < N; I++)
{
if (X[I, K] != 0)
{
for (J = K1; J < N; J++)
{
X[I, J] /= X[I, K];
}
//Y[I] /= X[I, K];
Y[I] ^= X[I, K];
}
}
for (I = K1; I < N; I++)
{
if (X[I, K] != 0)
{
for (J = K1; J < N; J++)
{
X[I, J] ^= X[K, J];
}
Y[I] ^= Y[K];
}
}
}
for (I = N - 2; I >= 0; I--)
{
for (J = N - 1; J >= I + 1; J--)
{
//Y[I] -= AndOperation(X[I, J], Y[J]);
Y[I] ^= (byte)(X[I, J]* Y[J]);
}
}
}

I think you're trying to apply Gaussian elimination mod 2 for this.
In general you can do Gaussian elimination mod k, if your equations are of the form
a_1 * x + b_1 * y + c_1 * z = d_1
a_2 * x + b_2 * y + c_2 * z = d_2
a_3 * x + b_3 * y + c_3 * z = d_3
a_4 * x + b_4 * y + c_4 * z = d_4
In Z2, * is AND and + is XOR, so you can use Gaussian elimination to solve equations of the form
x (xor) y (xor) z = 1
x (xor) y (xor) w = 1
x (xor) z (xor) w = 0
y (xor) z (xor) w = 1
Let's solve this system using Gaussian elimination by hand.
The corresponding augmented matrix is:
1 1 1 0 | 1
1 1 0 1 | 1
1 0 1 1 | 0
0 1 1 1 | 1
1 1 1 0 | 1
0 0 1 1 | 0 (R2 = R2 + R1)
0 1 0 1 | 1 (R3 = R3 + R1)
0 1 1 1 | 1
1 1 1 0 | 1
0 1 1 1 | 1 (R2 = R4)
0 1 0 1 | 1
0 0 1 1 | 0 (R4 = R2)
1 0 0 1 | 0 (R1 = R1 + R2)
0 1 1 1 | 1
0 0 1 0 | 0 (R3 = R3 + R2)
0 0 1 1 | 0
1 0 0 1 | 0
0 1 0 1 | 1 (R2 = R2 + R3)
0 0 1 0 | 0
0 0 0 1 | 0 (R4 = R4 + R3)
1 0 0 0 | 0 (R1 = R1 + R4)
0 1 0 0 | 1 (R2 = R2 + R4)
0 0 1 0 | 0
0 0 0 1 | 0
Giving your solution of (x,y,z,w) = (0,1,0,0).
But this requires row pivoting, which I can't see in your code.
There are also some multiplications and divisions floating around in your code that probably don't need to be there. I'd expect the code to look like this (you'll need to fix the TODOs):
public static void ComputeCoefficents(byte[,] X, byte[] Y)
{
    int N = Y.Length;
    for (int K = 0; K < N; K++)
    {
        // First ensure that we have a non-zero entry in X[K,K]
        if (X[K, K] == 0)
        {
            // Only rows below K are candidates; the rows above already hold pivots
            for (int i = K + 1; i < N; ++i)
            {
                if (X[i, K] != 0)
                {
                    // TODO: a loop to swap rows i and K of X
                    // TODO: swap Y[i] and Y[K] too
                    break;
                }
            }
        }
        if (X[K, K] == 0)
        {
            // TODO: Handle the case where we have a zero column
            // - for now we just move on to the next column
            // - This means we have no solutions or multiple
            //   solutions
            continue;
        }
        // Do full row elimination.
        for (int I = 0; I < N; ++I)
        {
            if (I != K) // Don't self eliminate
            {
                if (X[I, K] != 0)
                {
                    for (int J = K; J < N; ++J) { X[I, J] ^= X[K, J]; }
                    Y[I] ^= Y[K];
                }
            }
        }
    }
    // Now, assuming we didn't hit any zero columns, Y should be our solution.
}
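For reference, here is one way the whole procedure (pivot search, row swap, XOR elimination) could be put together as a self-contained sketch. It assumes a square system whose entries are all 0 or 1, works on copies instead of modifying the inputs, and simply returns null when a column has no pivot rather than handling the no-solution or many-solutions case. The names Gf2Solver and Solve are invented for this example.

```csharp
using System;

public static class Gf2Solver
{
    // Solves X * s = Y over GF(2) by Gauss-Jordan elimination with row swaps.
    // Returns null when some column has no pivot (no unique solution).
    public static byte[] Solve(byte[,] x, byte[] y)
    {
        int n = y.Length;
        var a = (byte[,])x.Clone();   // work on copies, leave the inputs untouched
        var b = (byte[])y.Clone();

        for (int k = 0; k < n; k++)
        {
            // Find a row at or below k with a 1 in column k.
            int pivot = k;
            while (pivot < n && a[pivot, k] == 0) pivot++;
            if (pivot == n) return null;

            if (pivot != k)
            {
                for (int j = 0; j < n; j++)
                    (a[k, j], a[pivot, j]) = (a[pivot, j], a[k, j]);
                (b[k], b[pivot]) = (b[pivot], b[k]);
            }

            // XOR the pivot row into every other row that has a 1 in column k.
            for (int i = 0; i < n; i++)
            {
                if (i == k || a[i, k] == 0) continue;
                for (int j = k; j < n; j++) a[i, j] ^= a[k, j];
                b[i] ^= b[k];
            }
        }
        return b;   // a has been reduced to the identity, so b holds the solution
    }
}
```

Called with the rows {1,1,1,0}, {1,1,0,1}, {1,0,1,1}, {0,1,1,1} and the right-hand side {1,1,0,1} from the question, it follows the same steps as the hand calculation above and returns {0,1,0,0}.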

create the same function in another language

Sorry for the previous question, but I was so confused and stressed :S.
I have this function in R:
shift <- function(d, k) rbind( tail(d,k), head(d,-k), deparse.level = 0 )
and this data frame:
A B value
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
this function will transform this data frame to (in case k=1) :
A B value
3 2 0.412
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
Code:
string[] data = File.ReadAllLines("test.txt");
decimal[,] numbers = new decimal[data.Length, 3];
for(int x = 0; x < data.Length; x++)
{
string[] temp = data[x].Split(' ');
for(int y = 1; y < temp.Length; y++)
{
numbers[x,y] = Convert.ToDecimal(temp[y]);
}
}
That's the code I'm using to get the values from the text file, but I want to create the function that will rotate this table.
I want to write the same function for a text file in Java or C#.
How can this be done?
I'm storing the data in C# in a 2D array: decimal[,]
UPDATE:
Your function rotates the rows as in the previous example, but what I want to do is this:
I have this table:
A B value
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
I want it to become (in case of a shift by 2):
A B value
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
1 1 0.123
2 1 0.213

I think this will do what you want but I must admit I'm not that familiar with C# so I'd expect there to be a more idiomatic form with less looping:
static decimal[,] Rotate(decimal[,] input, int k)
{
    int m = input.GetLength(0);
    int n = input.GetLength(1);
    decimal[,] result = new decimal[m, n];
    for (int i = 0; i < m; i++)
    {
        // row i moves down by k positions, wrapping around
        int p = (i + k) % m;
        if (p < 0)
            p += m;
        for (int j = 0; j < n; j++)
            result[p, j] = input[i, j];
    }
    return result;
}
Regarding your update, that is handled by passing a negative value for k. For your example pass k=-2.
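If the rows are kept as a list of row objects instead of a decimal[,], a version with less explicit looping is possible using LINQ's Skip, Take and Concat. This is only a sketch under that assumption; RowRotation and RotateRows are hypothetical names, and k has the same meaning as above (positive moves the last k rows to the front, negative moves the first rows to the end).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowRotation
{
    // Moves the last k rows to the front (k > 0) or the first |k| rows to
    // the end (k < 0), like the R shift function in the question.
    public static List<T> RotateRows<T>(IReadOnlyList<T> rows, int k)
    {
        int m = rows.Count;
        if (m == 0) return new List<T>();
        int split = ((-k % m) + m) % m;   // index of the row that becomes first
        return rows.Skip(split).Concat(rows.Take(split)).ToList();
    }
}
```

For the six-row table in the question, k = 1 puts the last row first, and k = -2 gives the ordering asked for in the update.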
