Compare 2 strings - C#
I have the following 2 strings:
String A: Manchester United
String B: Manchester Utd
Both strings mean the same thing but contain different values.
How can I compare these strings to get a "matching score"? In this case, the first word, "Manchester", is identical, and the second words contain similar letters, but not in the same positions.
Is there a simple algorithm that returns a "matching score" when I supply 2 strings?
You could calculate the Levenshtein distance between the two strings and if it is smaller than some value (that you must define) you may consider them to be pretty close.
I've needed to do something like this and used Levenshtein distance.
I used it for a SQL Server UDF which is used in queries with more than a million rows (and texts of up to 6 or 7 words).
I found that the algorithm runs faster and the "similarity index" is more precise if you compare each word separately. I.e. you split each input string into words, and compare each word of one input string to each word of the other input string.
Remember that Levenshtein gives the difference, and you have to convert it to a "similarity index". I used something like the distance divided by the length of the longer word (but with some variations).
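As a rough illustration of that conversion, here is a minimal sketch. It assumes the similarity is 1 minus the distance divided by the length of the longer word, which is only one plausible reading of "something like" above; the class and method names (WordSimilarity, Distance, Similarity) are made up for the example.

```csharp
using System;

public static class WordSimilarity
{
    // Levenshtein distance using two rolling rows instead of a full matrix.
    public static int Distance(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // swap rows
        }
        return prev[t.Length];
    }

    // Assumption: similarity = 1 - distance / length of the longer word.
    public static double Similarity(string s, string t)
    {
        int longest = Math.Max(s.Length, t.Length);
        if (longest == 0) return 1.0; // two empty words are treated as identical
        return 1.0 - (double)Distance(s, t) / longest;
    }
}
```

With this definition, WordSimilarity.Similarity("United", "Utd") is 0.5, because three deletions are needed and the longer word has six letters.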
First rule: order and number of words
You must also consider:
if there must be the same number of words in both inputs, or it can change
and if the order must be the same on both inputs, or it can change.
Depending on this, the algorithm changes. For example, applying the first rule is really fast if the number of words differs. And the second rule reduces the number of comparisons, especially if there are many words in the compared texts. That's explained with examples later.
Second rule: weighting the similarity of each compared pair
I also weighted the longer words higher than the shorter words to get the global similarity index. My algorithm takes the longest of the two words in the compared pair, and gives a higher weight to the pair with the longer words than to the pair with the shorter ones, although not exactly proportional to the pair length.
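The exact weighting is not given above, so the following is only a sketch of the idea. It assumes the weight of a pair is the square root of the length of its longer word, which grows with length but less than proportionally; that choice, and the names WeightedScore and Combine, are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class WeightedScore
{
    // Each pair carries its similarity (0..1) and the length of its longer word.
    public static double Combine(IReadOnlyList<(double Similarity, int LongestLength)> pairs)
    {
        double weighted = 0.0, totalWeight = 0.0;
        foreach (var (similarity, longestLength) in pairs)
        {
            // Assumption: square root dampens the weight so it grows slower than the length.
            double weight = Math.Sqrt(longestLength);
            weighted += similarity * weight;
            totalWeight += weight;
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
    }
}
```

For the pairs (Manchester, Manchester) with similarity 1.0 and (United, Utd) with similarity 0.5, the result is above the plain average of 0.75, because the 10-letter pair counts for more than the 6-letter one.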
Sample comparison: same order
Take this example, which uses a different number of words:
compare "Manchester United" to "Manchester Utd FC"
If the same order of the words in both inputs is guaranteed, you should compare these pairs:
Manchester United
Manchester Utd FC
(Manchester,Manchester) (Utd,United) (FC: not compared)
Manchester United
Manchester Utd FC
(Manchester,Manchester) (Utd: not compared) (United,FC)
Manchester United
Manchester Utd FC
(Manchester: not compared) (Manchester,Utd) (United,FC)
Obviously, the highest score would be for the first set of pairs.
Implementation
To compare words in the same order:
The string with the higher number of words is a fixed vector, shown as A,B,C,D,E in this example, where v[0] is the word A, v[1] the word B and so on.
For the string with the lower number of words we need to create all the possible combinations of indexes that can be compared with the first set. In this case, the string with the lower number of words is represented by a,b,c.
You can use a simple loop to create all the vectors that represent the pairs to be compared, like so:
(long set: A,B,C,D,E; short set: a,b,c; each line shows an index vector and the pairs it defines)
0,1,2: a-A, b-B, c-C
0,1,3: a-A, b-B, c-D
0,1,4: a-A, b-B, c-E
0,2,3: a-A, b-C, c-D
0,2,4: a-A, b-C, c-E
0,3,4: a-A, b-D, c-E
1,2,3: a-B, b-C, c-D
1,2,4: a-B, b-C, c-E
1,3,4: a-B, b-D, c-E
2,3,4: a-C, b-D, c-E
The numbers in the sample are vectors holding the indices of the long set of words that must be compared with the indices of the short set. I.e. v[0]=0 means compare index 0 of the short set (a) to index 0 of the long set (A), v[1]=2 means compare index 1 of the short set (b) to index 2 of the long set (C), and so on.
To calculate these vectors, simply start with 0,1,2 and repeatedly move the rightmost index that can still be moved one place to the right.
Start by moving the last one:
0,1,2 -> 0,1,3 -> 0,1,4
When the last index can't be moved any further, move the one before the last, and reset the last to the nearest possible place (1 moves to 2, and 4 moves back to 3):
0,2,3 -> 0,2,4
No more moves of the last are possible, so move the one before the last again:
0,3,4
Now neither the last nor the one before the last can move, so move the one before those, and reset the others to the lowest possible values:
1,2,3 -> 1,2,4
And so on. See the picture
When you have all the possible combinations you can compare the defined pairs.
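The index-moving procedure described above can be sketched as an iterator. This is one possible way to write it, not the original code; IndexVectors and Generate are names chosen for the example, with k the number of words in the short text and n the number of words in the long one.

```csharp
using System.Collections.Generic;

public static class IndexVectors
{
    // Yields every increasing k-index vector over 0..n-1, in the order
    // 0,1,2 -> 0,1,3 -> 0,1,4 -> 0,2,3 -> ...
    public static IEnumerable<int[]> Generate(int k, int n)
    {
        if (k < 0 || k > n) yield break;

        int[] v = new int[k];
        for (int i = 0; i < k; i++) v[i] = i;      // start with 0,1,2,...

        while (true)
        {
            yield return (int[])v.Clone();          // hand out a copy of the current vector

            // Find the rightmost index that can still move to the right.
            int pos = k - 1;
            while (pos >= 0 && v[pos] == n - k + pos) pos--;
            if (pos < 0) yield break;               // nothing can move any more

            v[pos]++;                               // move it one place
            for (int i = pos + 1; i < k; i++)       // reset the following indices
                v[i] = v[i - 1] + 1;                // to the lowest possible values
        }
    }
}
```

For k = 3 and n = 5 this produces the ten vectors listed in the sample, from 0,1,2 to 2,3,4.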
Third rule: minimum similarity to stop comparison
Stop the comparison when a minimum similarity is reached: depending on what you want to do, you may be able to set a threshold that stops the comparison of the pairs as soon as it's reached.
If you can't set a threshold, you can at least always stop if you get 100% similarity for each pair of words. This saves a lot of time.
On some occasions you can simply decide to stop the comparison if the similarity is at least something like 75%. This can be used if you want to show the user all the strings that are similar to the one the user provided.
Sample: comparison with change of the order of the words
If the order can change, you need to compare each word of the first set with each word of the second set, and take the highest-scoring combination of results: every word of the shorter set, in every possible order, paired with distinct words of the longer set. For this you can populate the upper or lower triangle of a matrix of (n x m) elements, and then take the required elements from the matrix.
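As a sketch of the order-independent case, the code below takes an already filled score matrix and exhaustively tries every way of pairing each short-text word with a distinct long-text word. The exhaustive search is an assumption made for simplicity (it is only practical for a handful of words), and OrderFreeMatch and BestTotal are hypothetical names.

```csharp
using System;

public static class OrderFreeMatch
{
    // scores[i, j] = similarity of short word i against long word j (rows <= columns).
    // Returns the highest sum obtainable when each short word gets a distinct long word.
    public static double BestTotal(double[,] scores)
    {
        int rows = scores.GetLength(0), cols = scores.GetLength(1);
        if (rows > cols) throw new ArgumentException("Put the shorter text in the rows.");
        return Search(scores, 0, new bool[cols]);
    }

    private static double Search(double[,] scores, int row, bool[] used)
    {
        if (row == scores.GetLength(0)) return 0.0; // every short word has been paired

        double best = double.NegativeInfinity;
        for (int col = 0; col < used.Length; col++)
        {
            if (used[col]) continue;                // this long word is already taken
            used[col] = true;
            best = Math.Max(best, scores[row, col] + Search(scores, row + 1, used));
            used[col] = false;
        }
        return best;
    }
}
```

For example, with "Utd Manchester" in the rows and "Manchester United FC" in the columns, the best total pairs Utd with United and Manchester with Manchester, regardless of word order.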
Fourth rule: normalization
You must also normalize the words before comparison, like so:
if not case-sensitive, convert all the words to upper or lower case
if not accent-sensitive, remove accents in all the words
if you know that there are common abbreviations, you can also normalize them, to the abbreviation to speed it up (i.e. convert united to utd, not utd to united)
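A minimal sketch of these three normalization steps could look like the code below. The accent removal relies on Unicode decomposition (string.Normalize with FormD) followed by dropping combining marks; the abbreviation table with its single united/utd entry, and the names WordNormalizer and Normalize, are assumptions for the example.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class WordNormalizer
{
    // Hypothetical abbreviation table: full word -> abbreviation.
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string> { { "united", "utd" } };

    public static string Normalize(string word)
    {
        // Split accented letters into base letter + combining mark, then drop the marks.
        string decomposed = word.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Recompose and lower-case so the comparison is case-insensitive.
        string cleaned = sb.ToString().Normalize(NormalizationForm.FormC).ToLowerInvariant();

        // Replace known words by their abbreviation (shorter words compare faster).
        return Abbreviations.TryGetValue(cleaned, out string abbreviation) ? abbreviation : cleaned;
    }
}
```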
Caching for optimization
To optimize the procedure, I cached whatever I could, e.g. the comparison vectors for different sizes, like the vectors 0,1,2-0,1,3-0,1,4-0,2,3 in the A,B,C,D,E to a,b,c comparison example: all the vectors for lengths 3,5 are calculated on first use and reused for every incoming 3-word to 5-word comparison.
Other algorithms
I tried Hamming distance and the results were less accurate.
You can do much more complex things like semantic comparisons, phonetic comparisons, or treating some letters as identical (like b and v in several languages, such as Spanish, where there is no distinction). Some of these things are very easy to implement and others are really difficult.
NOTE: I didn't include the implementation of Levenshtein distance, because you can easily find it implemented in different languages.
Take a look at this article, which explains how to do it and gives sample code too :)
Fuzzy Matching (Levenshtein Distance)
Update:
Here is the method code that takes two strings as parameters and calculates the "Levenshtein Distance" of the two strings
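// Requires: using System;
public static int Compute(string s, string t)
{
    int n = s.Length;
    int m = t.Length;

    // Step 1: if one string is empty, the distance is the length of the other
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }

    int[,] d = new int[n + 1, m + 1];

    // Step 2: initialize the first column and the first row
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }
    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }

    // Step 3: walk over the characters of s
    for (int i = 1; i <= n; i++)
    {
        // Step 4: walk over the characters of t
        for (int j = 1; j <= m; j++)
        {
            // Step 5: substitution cost is 0 when the characters match
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Step 6: cheapest of deletion, insertion and substitution
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }

    // Step 7: the bottom-right cell holds the distance
    return d[n, m];
}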
Detecting duplicates can sometimes be a "little" more complicated than computing the Levenshtein distance.
Consider following example:
1. Jeff, Lynch, Maverick, Road, 181, Woodstock
2. Jeff, Alf., Lynch, Maverick, Rd, Woodstock, NY
These duplicates can be matched by complicated clustering algorithms.
For further information you might want to check some research papers like
"Effective Incremental Clustering for Duplicate Detection in Large Databases".
(Example is from the paper)
What you are looking for is a string similarity measure. There are multiple ways of doing this:
Edit Distances between two strings (as in Answer #1)
Converting the strings into sets of characters (generally on bigrams or words) and then calculating Bruce Coefficient or Dice Coefficient on the two sets.
Projecting the strings into term vectors (either on words or bigrams) and calculating the Cosine Distance between the two vectors.
I generally find the option #2 to be the easiest to implement and if your strings are phrases then you can simply tokenize them on word-boundaries.
In all the above cases, you might want to first remove the stop words (common words like and, a, the, etc.) before tokenizing.
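To make option #3 concrete, here is a small sketch of cosine similarity over word-count vectors. It assumes whitespace tokenization, case-insensitive terms and no stop-word removal; CosineSimilarity, TermCounts and Compare are names invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class CosineSimilarity
{
    // Builds a term vector: word -> number of occurrences.
    private static Dictionary<string, int> TermCounts(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(word, out int n);   // n stays 0 when the word is new
            counts[word] = n + 1;
        }
        return counts;
    }

    // Cosine of the angle between the two term vectors: 1 = same direction, 0 = no shared terms.
    public static double Compare(string a, string b)
    {
        var va = TermCounts(a);
        var vb = TermCounts(b);
        double dot = 0, normA = 0, normB = 0;

        foreach (var kv in va)
        {
            normA += kv.Value * kv.Value;
            if (vb.TryGetValue(kv.Key, out int other)) dot += kv.Value * other;
        }
        foreach (var kv in vb) normB += kv.Value * kv.Value;

        if (normA == 0 || normB == 0) return 0.0;  // an empty text shares nothing
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

On whole words, "Manchester United" and "Manchester Utd" share one of two terms each, which gives a score of about 0.5; using bigrams instead of words as terms would reward the partial overlap between United and Utd.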
Update: Links
Dice Coefficient
Cosine Similarity
Implementing Naive Similarity engine in C# *Warning: shameless Self Promotion
Here is an alternative to using the Levenshtein distance algorithm. This compares strings based on Dice's Coefficient, which compares the number of common letter pairs in each string to generate a value between 0 and 1, with 0 being no similarity and 1 being complete similarity.
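// Requires: using System; using System.Collections.Generic; using System.Linq;
public static double CompareStrings(string strA, string strB)
{
    // Collect the letter pairs (bigrams) of each string
    List<string> setA = new List<string>();
    List<string> setB = new List<string>();
    for (int i = 0; i < strA.Length - 1; ++i)
        setA.Add(strA.Substring(i, 2));
    for (int i = 0; i < strB.Length - 1; ++i)
        setB.Add(strB.Substring(i, 2));

    // Strings shorter than 2 characters have no pairs; avoid dividing by zero
    if (setA.Count + setB.Count == 0)
        return 0.0;

    // Intersect returns the distinct pairs present in both lists
    var intersection = setA.Intersect(setB, StringComparer.InvariantCultureIgnoreCase);
    return (2.0 * intersection.Count()) / (setA.Count + setB.Count);
}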
Call the method like this:
CompareStrings("Manchester United", "Manchester Utd");
Output is: 0.75862068965517238
Related
Calculate the total amount (number) of combinations of a set with the specified combination length? [duplicate]
Given a set of elements, in this case an array of 3 characters {A, B, C}:
char[] charSet = "ABC".ToCharArray();
I would like to write a general-purpose function that determines the total number of combinations OF THE SPECIFIED LENGTH that can be generated, both with and without repetition. To avoid possible mistakes: this question is not about combination/permutation generation, just calculation.
A simple incomplete example to illustrate:
public static long CalculateCombinations(int setLength, int comboLength, bool allowRepetition)
{
    return result;
}
(where setLength is the number of elements in the set, comboLength is the desired length of each combination, and allowRepetition is a flag that determines whether repeated elements are allowed in each combination.)
Then, with the character set above, if I want to calculate the number of possible combinations with repetition, the algorithm should return 9, which corresponds to this series of combinations:
1: AA 2: AB 3: AC 4: BA 5: BB 6: BC 7: CA 8: CB 9: CC
The same algorithm should return 6 if I don't want repetition, which corresponds to this series of combinations:
1: AB 2: AC 3: BA 4: BC 5: CA 6: CB
Basically I'm trying to reproduce what this online service can do: http://textmechanic.com/text-tools/combination-permutation-tools/combination-generator/
However, I tried to investigate and implement different 'nCr' formulas from around the web (like http://www.vcskicks.com/code-snippet/combination.php ) and StackOverflow threads (like https://stackoverflow.com/a/26312275/1248295 ), but I don't understand how to calculate it when the combination length and repetition are involved. Maybe this is more basic than it appears to me, but maths is not my forte.
My question: how can I write an algorithm that calculates what I explained? I would be very grateful if someone could link a formula and its implementation in C# or VB.NET.
Let's try it with three characters, A, B and C (n = 3) and a combo length of k = 2, as your example states.
With repetition
We start with two empty spaces. The first empty space can be filled in 3 possible ways. For each of those three ways, the second space can be filled in another three possible ways. This gives you a total of 3 × 3 possibilities. In general, there are n ^ k possibilities.
Without repetition
We start with two empty spaces. The first empty space can be filled in 3 possible ways. The second empty space can be filled in 2 possible ways, because you don't want to repeat yourself. This gives you 3 × 2 possibilities in your case.
Let's go with another example. Say you have five letters (ABCDE) and a combo length of four: _ _ _ _.
We put any of the five letters in the first empty space. This is five possibilities: A, B, C, D, E.
Now, for each possibility after the last step, no matter which letter we've chosen, we have 4 letters left to choose from. If in the previous step we've chosen A, the corpus is now BCDE: this is four possibilities. For B, we choose from ACDE: this is again four possibilities. Since there were 5 ways to do the previous step, and there are 4 ways to go after any of the previous choices, in total this is 20 possibilities: (AB, AC, AD, AE), (BA, BC, BD, BE), (CA, CB, CD, CE), (DA, DB, DC, DE), (EA, EB, EC, ED).
Let's keep going. After picking two letters, we're left with 3. With the same logic as before, for each of the previous 20 possibilities we have another 3 possibilities. This is 60 in total.
And one more space is left. We have two letters which we haven't chosen before. From any of the previous 60 possibilities, we now have two possibilities. That's 120 in total.
So we've arrived at this by multiplying 5 × 4 × 3 × 2. Why start from 5? Because we initially had 5 letters: ABCDE. Why have four numbers in our multiplication? Because there were 4 empty spaces: _ _ _ _.
In general, you keep multiplying a decremented value starting from n, and do this k times: n × (n - 1) × ... × (n - k + 1). The last value is (n - k + 1) because you are multiplying k values in total: from n down to (n - k + 1) there are k values (inclusive).
We can test this with our n = 5 and k = 4 example. We said that the formula was 5 × 4 × 3 × 2. Now look at the general formula: indeed, we start from n = 5 and keep multiplying until we reach the number 5 - 4 + 1 = 2.
In your function's signature, n is setLength and k is comboLength. The implementation should be trivial with the above formulas, so I'm leaving this to the reader.
These are called permutations with and without repetition.
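For reference, the two formulas can be turned into code along these lines. This is a sketch that reuses the signature from the question; the wrapping class name PermutationCounter, the overflow check and the choice to return 0 when k exceeds n without repetition are assumptions.

```csharp
using System;

public static class PermutationCounter
{
    public static long CalculateCombinations(int setLength, int comboLength, bool allowRepetition)
    {
        if (setLength < 0 || comboLength < 0) throw new ArgumentOutOfRangeException();

        // Without repetition you cannot fill more spaces than there are elements.
        if (!allowRepetition && comboLength > setLength) return 0;

        long result = 1;
        for (int i = 0; i < comboLength; i++)
        {
            // With repetition: n * n * ... (k times) = n ^ k.
            // Without repetition: n * (n - 1) * ... * (n - k + 1).
            long factor = allowRepetition ? setLength : setLength - i;
            result = checked(result * factor);   // throws OverflowException instead of wrapping
        }
        return result;
    }
}
```

CalculateCombinations(3, 2, true) gives 9, CalculateCombinations(3, 2, false) gives 6, and CalculateCombinations(5, 4, false) gives 120, matching the worked examples.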
How to implement the longest common subsequences in C#
I have read this paper, which explains an algorithm for finding the longest common subsequences. I have a problem with coding the creation of the matchlist in Algorithm 2 on page 352. To be exact, it reads the first string, then starts the second string from the last letter and scans it, saving in decreasing order the positions of the letters that are equal.
First, my problem is how I could save the indices for each letter in a linked list. I mean creating a list for each letter in the first string, and then a sub-list of it which saves the letter's positions in the other string.
Second, I do not know how I could find k in step 3. The key data structure for the described algorithm is an array of threshold values T_(i,k) defined as:
T_(i,k) = the smallest j such that A[1:i] and B[1:j] contain a common subsequence of length k.
What is confusing for me here is that in the algorithm THRESH[] is set up as:
for i = 1 step 1 until n do THRESH[i] := n + 1;
which, as I understand it, sets all entries to the same fixed number. Right? And then the following line in Algorithm 2 would somehow not be correct:
find k such that THRESH[k-1] < j <= THRESH[k]
Could you please let me know your comments on it? I really do need some hints, please!
Distinct number algorithm from string
I'm working on a simple game and I have the requirement of taking a word or phrase such as "hello world" and converting it to a series of numbers. The criteria are:
Numbers need to be distinct.
Need the ability to configure the maximum length of the sequence, i.e. 10 numbers total.
Need the ability to configure the max range for each number in the sequence.
Must be deterministic, that is, we should get the same sequence every time for the same input phrase.
I've tried breaking down the problem like so:
Convert characters to ASCII number codes: "hello world" = 104 101 108 108 111 32 119 111 114 108 100
Remove every other number until we satisfy the total number of numbers (10 in this case).
For each number, if number > max number then divide by 2 until number <= max number.
If any numbers are duplicated, increase or decrease the first occurrence until satisfied. (This could cause a problem, as you could create a duplicate by solving another duplicate.)
Is there a better way of doing this, or am I on the right track? As stated above, I think I may run into issues with keeping the numbers distinct.
If you want to limit the size of the output series, then this is impossible.
Proof: Assume your output is a series of size k, with each number limited to at most M possible values for some predefined M. Then there are at most M^k possible outputs. However, there is an infinite number of inputs, and specifically there are at least M^k + 1 different inputs. By the pigeonhole principle (where the inputs are the pigeons and the outputs are the pigeonholes), there are 2 pigeons (inputs) in one pigeonhole (output), so the requirement cannot be achieved.
Original answer, which provides a workaround without limiting the size of the output series:
You can use prime numbers. Let p1, p2, ... be the series of prime numbers. Then convert the string into a series of numbers using number[i] = ascii(char[i]) * p_i
The range of each character is then obviously [0, 255 * p_i].
The intent is that p_i * x != p_j * y whenever i != j, which would give uniqueness. Note that this does not hold for every x and y (for example 2 * 102 = 3 * 68), so collisions are possible for some inputs.
However, this is mainly nice theoretically, as the generated numbers might grow quickly, and for a practical implementation you are going to need some big number library such as Java's BigInteger (I cannot recall the C# equivalent).
Another possible solution (with the same relaxation of no series limitation) is: number[i] = ascii(char[i]) + 256*(i-1)
Here the range for number[i] is [256*(i-1), 256*i), and the elements are still distinct.
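The second workaround is easy to sketch in C#. The code below assumes zero-based positions (so the offset is 256 * i rather than 256 * (i - 1)) and rejects characters above 255, since the disjoint ranges only work for 8-bit codes; DistinctEncoder and Encode are hypothetical names.

```csharp
using System;

public static class DistinctEncoder
{
    // number[i] = code(char[i]) + 256 * i, with i counted from 0.
    // Position i therefore owns the range [256 * i, 256 * (i + 1)), so no two numbers collide.
    public static int[] Encode(string phrase)
    {
        var numbers = new int[phrase.Length];
        for (int i = 0; i < phrase.Length; i++)
        {
            if (phrase[i] > 255)
                throw new ArgumentException("Only 8-bit character codes are supported.");
            numbers[i] = checked(phrase[i] + 256 * i);
        }
        return numbers;
    }
}
```

For "hello world" the first number is 104 and the last one is 100 + 2560 = 2660; the three l characters map to 620, 876 and 2412.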
Mathematically, it is theoretically possible to do what you want, but you won't be able to do it in C#:
If your outputs are required to be distinct, then you cannot lose any information after encoding the string using ASCII values. This means that if you limit your output size to n numbers, then the numbers will have to include all the information from the encoding.
So for your example "hello world" -> 104 101 108 108 111 32 119 111 114 108 100 you would have to preserve the meaning of each of those numbers. The simplest way to do this would be to zero-pad your numbers to three digits and concatenate them together into one large number, making your result 104101108108111032119111114108100 for max numbers = 1. (You can see where the issue arises: for arbitrary-length input you need very large numbers.)
So it is certainly possible to encode any arbitrary-length string input into n numbers, but the numbers will become exceedingly large.
If by "numbers" you meant digits, then no, you cannot have distinct outputs, as amit explained in his example with the pigeonhole principle.
Let's eliminate your criteria as easily as possible.
For distinct and deterministic, just use a hash code. (A hash isn't actually guaranteed to be distinct, but it is highly likely to be):
string s = "hello world";
uint hash = unchecked((uint)s.GetHashCode());
Note that I converted the signed int returned from GetHashCode to unsigned, to avoid the chance of having a '-' appear. An unchecked cast is used because Convert.ToUInt32 throws an OverflowException for negative values.
Then, for your max range per number, just convert the base.
That leaves you with the maximum sequence criteria. Without understanding your requirements better, all I can propose is to truncate if necessary:
string digits = hash.ToString();
digits.Substring(0, Math.Min(size, digits.Length))
Truncating leaves a chance that you'll no longer be distinct, but that must be built in as acceptable to your requirements? As amit explains in another answer, you can't have infinite input and non-infinite output.
Ok, so in one comment you've said that this is just to pick lottery numbers. In that case, you could do something like this:
// Requires: using System; using System.Collections.Generic; using System.Linq;
public static List<int> GenNumbers(String input, int count, int maxNum)
{
    List<int> ret = new List<int>();
    Random r = new Random(input.GetHashCode());
    for (int i = 0; i < count; ++i)
    {
        int next = r.Next(maxNum - i);
        foreach (int picked in ret.OrderBy(x => x))
        {
            if (picked <= next)
                ++next;
            else
                break;
        }
        ret.Add(next);
    }
    return ret;
}
The idea is to seed a random number generator with the hash code of the String. The rest of that is just picking numbers without replacement. I'm sure it could be written more efficiently: an alternative is to generate all maxNum numbers and shuffle the first count.
Warning, untested. I know newer versions of the .Net runtime use a random String hash code algorithm (so results will differ between runs), but I believe this is opt-in. Writing your own hash algorithm is an option.
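The two alternatives mentioned at the end (a hand-written hash, and shuffling only the first count entries of the full range) could be combined roughly as follows. This is a sketch, not tested against any particular runtime: the FNV-1a-style hash over UTF-16 code units and the name LotteryPicker are assumptions, and the sequence produced by a seeded Random may still differ between .NET versions.

```csharp
using System;
using System.Collections.Generic;

public static class LotteryPicker
{
    // FNV-1a-style hash over the characters; unlike string.GetHashCode it does not
    // change from one process run to the next.
    private static int StableHash(string input)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in input)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }

    // Picks `count` distinct numbers from 0..maxNum-1 with a partial Fisher-Yates shuffle.
    public static List<int> GenNumbers(string input, int count, int maxNum)
    {
        if (count < 0 || count > maxNum) throw new ArgumentOutOfRangeException(nameof(count));

        var pool = new int[maxNum];
        for (int i = 0; i < maxNum; i++) pool[i] = i;

        var random = new Random(StableHash(input));
        var picked = new List<int>(count);
        for (int i = 0; i < count; i++)
        {
            int j = random.Next(i, maxNum);          // choose among the not yet fixed entries
            int tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
            picked.Add(pool[i]);
        }
        return picked;
    }
}
```

Calling LotteryPicker.GenNumbers("hello world", 6, 49) twice returns the same six distinct numbers in the range 0 to 48.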
Dealing With Combinations
In C# I created a list array containing a list of varied indexes. I'd like to display 1 combination of 2 combinations of different indexes. The 2 combinations inside the one must not be repeated. I am trying to code a tennis tournament with 14 players who are paired up. Each player must never be paired with the same player twice.
Your problem falls under the domain of the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K from a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus' method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that, when true, will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
There are 2 different ways to interpret your problem. In tennis, tournaments are usually arranged as single elimination, where the winning player from each match advances. However, some local clubs also use round robins where each player plays each other player just once, which appears to be the problem that you are looking at.
So, the question is how to calculate the total number of unique matches that can be played with 14 players (N = 14), where each player plays just one other player (and thus K = 2). The binomial coefficient calculation is as follows:
Total number of unique combinations = N! / (K! * (N - K)!).
The ! character denotes a factorial, and N! means N * (N-1) * (N-2) ... * 1. When K is 2, the binomial coefficient reduces to N * (N - 1) / 2. So, plugging in 14 for N and 2 for K, we find that the total number of combinations is 91.
The following code will iterate through each of the unique combinations:
int N = 14; // Total number of elements in the set.
int K = 2; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The KIndexes array specifies the 2 players, starting with index 0.
int[] KIndexes = new int[K];
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
    // Get the k-indexes for this combination.
    BC.GetKIndexes(Combo, KIndexes);
    // KIndexes[0] is the first player & KIndexes[1] is the 2nd player.
    // Print out the indexes for both players.
    String S = "Player1 = " + KIndexes[0].ToString() + ", " + "Player2 = " + KIndexes[1].ToString();
    Console.WriteLine(S);
}
You should be able to port this class over fairly easily to the language of your choice. You probably will not have to port over the generic part of the class to accomplish your goals. Depending on the number of combinations you are working with, you might need to use a bigger word size than 4-byte ints.
I should also mention, that since this is a class project, your teacher might not accept the above answer since he might be looking for more original work. In that case, you might want to consider using loops. You should check with him before submitting a solution.
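For the plain-loops route, a minimal sketch that does not depend on the BinCoeff class could look like this. Players are identified by zero-based index, and RoundRobin and AllPairs are names chosen for the example.

```csharp
using System.Collections.Generic;

public static class RoundRobin
{
    // Lists every unordered pair (a, b) with a < b, so no pairing appears twice.
    public static List<(int First, int Second)> AllPairs(int playerCount)
    {
        var pairs = new List<(int First, int Second)>();
        for (int a = 0; a < playerCount - 1; a++)
        {
            for (int b = a + 1; b < playerCount; b++)
            {
                pairs.Add((a, b));
            }
        }
        return pairs;
    }
}
```

RoundRobin.AllPairs(14) returns 91 pairs, which agrees with N * (N - 1) / 2 for N = 14.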
How to generate a list of shuffled integers between 2 numbers?
I want to create a shuffled set of integers such that:
Given the same seed, the shuffle will be the same every time
As I iterate through, every number in the shuffled set will be used exactly once before repeating itself
Will work for large sets (I want all numbers between 0 and 2 billion)
Will generate between a range, for example, 100 to 150.
This option gives a great solution if you want, say, all of the numbers between 0 and a specified number: Generating Shuffled Range Using a PRNG Rather Than Shuffling
Any ideas?
You can use the exact same algorithm as the linked question. Just generate numbers between 0 and upperBound - lowerBound + 1 and add lowerBound to the result, e.g. (using code from the linked question):
var upper = 5;
var lower = 3;
foreach (int n in GenerateSequence(upper-lower+1))
{
    Console.WriteLine(n+lower);
}
If you want the sequence to repeat (shuffled differently each time), you can add a while (true) around the iterator method body.