I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
Make sure your arrays slice accross the shorter word width.
Code (it works the exact same if you replace int[] with String in the parameter declarations:
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
There is a big number of string similarity distance algorithms that can be used. Some listed here (but not exhaustively listed are):
Levenstein
Needleman Wunch
Smith Waterman
Smith Waterman Gotoh
Jaro, Jaro Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge Elkan
A library that contains implementation to all of these is called SimMetrics
which has both java and c# implementations.
I have found that Levenshtein and Jaro Winkler are great for small differences betwen strings such as:
Spelling mistakes; or
รถ instead of o in a persons name.
However when comparing something like article titles where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic:
compare these 2 titles (that are the same but worded differently from different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site that provides algorithm comparison of the strings shows:
Levenshtein: 81
Smith-Waterman Gotoh 94
Jaro Winkler 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Levenshtein: 54
Smith-Waterman Gotoh 49
Jaro Winkler 89
As Anastasiosyal pointed out, SimMetrics has the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
Here is my implementation of Damerau Levenshtein Distance, which returns not only similarity coefficient, but also returns error locations in corrected word (this feature can be used in text editors). Also my implementation supports different weights of errors (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
I wrote some tests and it's seems to me that method is working.
But comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 out of 7 (or 4 out of 10) match. If this works for you, and you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function
Related
Given two arrays of integers with equal lengths, return the maximum value of:
|arr1[i] - arr1[j]| + |arr2[i] - arr2[j]| + |i - j|
where the maximum is taken over all 0 <= i, j < arr1.length.
Example 1:
Input: arr1 = [1,2,3,4], arr2 = [-1,4,5,6]
Output: 13
Example 2:
Input: arr1 = [1,-2,-5,0,10], arr2 = [0,-2,-1,-7,-4]
Output: 20
Constraints:
2 <= arr1.length == arr2.length <= 40000
-10^6 <= arr1[i], arr2[i] <= 10^6
public class Solution {
public int MaxAbsValExpr(int[] arr1, int[] arr2) {
int max = 0;
for (int i =0; i < arr1.Length; i++){
for (int k =i+1; k < arr1.Length; k++){
if (Math.Abs(arr1[i] - arr1[k]) + Math.Abs(arr2[i] - arr2[k]) + Math.Abs(i - k) > max){
max = Math.Abs(arr1[i] - arr1[k]) + Math.Abs(arr2[i] - arr2[k]) + Math.Abs(i - k);
}
}
}
return max;
}}
This passes most test cases but gets a "Time Limit Exceeded" when the test case has hundreds of values in the arrays in the six digits. I've tried looking online to see how to shorten it in c# but I don't know what else I could do to make it faster.
You encountered "Time Limit Exceeded", it is because your algorithm takes too long to execute and the computer timeout. As commented by #Flydog57, nested loops can be very very expensive and should be avoided. He also suggested a way to implement the algorithm without nested loops. Here's the implementation in PHP (I do not know C#):
/**
* Find the indexes of the max and min values in $arr.
*
* #param array $arr
* #return array
*/
protected function minMaxIndexes(Array $arr)
{
// Initialize variables indexes and values by the 1st element of $arr.
[$i0, $v0, $i1, $v1] = [0, $arr[0], 0, $arr[0]];
for ($i = 1, $size = count($arr); $i < $size; ++$i) {
if ($v0 > $arr[$i]) {
$v0 = $arr[$i];
$i0 = $i;
} elseif ($v1 < $arr[$i]) {
$v1 = $arr[$i];
$i1 = $i;
}
}
return [$i0, $i1];
}
public function maxAbsValExpr(Array $arr1, Array $arr2)
{
[$i, $j] = $this->minMaxIndexes($arr1);
$max1 = abs($arr1[$i] - $arr1[$j]) + abs($arr2[$i] - $arr2[$j]) + abs($i - $j);
[$i, $j] = $this->minMaxIndexes($arr2);
$max2 = abs($arr1[$i] - $arr1[$j]) + abs($arr2[$i] - $arr2[$j]) + abs($i - $j);
return max($max1, $max2);
}
As you can see, the for loop is 2 times of N-1, if my calculation is correct, it is N/2 times faster than nested loop of N(N-1). That is, for 1 mil array length, it is 500k times faster.
Problem statement:
Given an array of non-negative integers, count the number of unordered pairs of array elements, such that their bitwise AND is a power of 2.
Example:
arr = [10, 7, 2, 8, 3]
Answer: 6 (10&7, 10&2, 10&8, 10&3, 7&2, 2&3)
Constraints:
1 <= arr.Count <= 2*10^5
0 <= arr[i] <= 2^12
Here's my brute-force solution that I've come up with:
private static Dictionary<int, bool> _dictionary = new Dictionary<int, bool>();
public static long CountPairs(List<int> arr)
{
long result = 0;
for (var i = 0; i < arr.Count - 1; ++i)
{
for (var j = i + 1; j < arr.Count; ++j)
{
if (IsPowerOfTwo(arr[i] & arr[j])) ++result;
}
}
return result;
}
public static bool IsPowerOfTwo(int number)
{
if (_dictionary.TryGetValue(number, out bool value)) return value;
var result = (number != 0) && ((number & (number - 1)) == 0);
_dictionary[number] = result;
return result;
}
For small inputs this works fine, but for big inputs this works slow.
My question is: what is the optimal (or at least more optimal) solution for the problem? Please provide a graceful solution in C#. ๐
One way to accelerate your approach is to compute the histogram of your data values before counting.
This will reduce the number of computations for long arrays because there are fewer options for value (4096) than the length of your array (200000).
Be careful when counting bins that are powers of 2 to make sure you do not overcount the number of pairs by including cases when you are comparing a number with itself.
We can adapt the bit-subset dynamic programming idea to have a solution with O(2^N * N^2 + n * N) complexity, where N is the number of bits in the range, and n is the number of elements in the list. (So if the integers were restricted to [1, 4096] or 2^12, with n at 100,000, we would have on the order of 2^12 * 12^2 + 100000*12 = 1,789,824 iterations.)
The idea is that we want to count instances for which we have overlapping bit subsets, with the twist of adding a fixed set bit. Given Ai -- for simplicity, take 6 = b110 -- if we were to find all partners that AND to zero, we'd take Ai's negation,
110 -> ~110 -> 001
Now we can build a dynamic program that takes a diminishing mask, starting with the full number and diminishing the mask towards the left
001
^^^
001
^^
001
^
Each set bit on the negation of Ai represents a zero, which can be ANDed with either 1 or 0 to the same effect. Each unset bit on the negation of Ai represents a set bit in Ai, which we'd like to pair only with zeros, except for a single set bit.
We construct this set bit by examining each possibility separately. So where to count pairs that would AND with Ai to zero, we'd do something like
001 ->
001
000
we now want to enumerate
011 ->
011
010
101 ->
101
100
fixing a single bit each time.
We can achieve this by adding a dimension to the inner iteration. When the mask does have a set bit at the end, we "fix" the relevant bit by counting only the result for the previous DP cell that would have the bit set, and not the usual union of subsets that could either have that bit set or not.
Here is some JavaScript code (sorry, I do not know C#) to demonstrate with testing at the end comparing to the brute-force solution.
var debug = 0;
function bruteForce(a){
let answer = 0;
for (let i = 0; i < a.length; i++) {
for (let j = i + 1; j < a.length; j++) {
let and = a[i] & a[j];
if ((and & (and - 1)) == 0 && and != 0){
answer++;
if (debug)
console.log(a[i], a[j], a[i].toString(2), a[j].toString(2))
}
}
}
return answer;
}
function f(A, N){
const n = A.length;
const hash = {};
const dp = new Array(1 << N);
for (let i=0; i<1<<N; i++){
dp[i] = new Array(N + 1);
for (let j=0; j<N+1; j++)
dp[i][j] = new Array(N + 1).fill(0);
}
for (let i=0; i<n; i++){
if (hash.hasOwnProperty(A[i]))
hash[A[i]] = hash[A[i]] + 1;
else
hash[A[i]] = 1;
}
for (let mask=0; mask<1<<N; mask++){
// j is an index where we fix a 1
for (let j=0; j<=N; j++){
if (mask & 1){
if (j == 0)
dp[mask][j][0] = hash[mask] || 0;
else
dp[mask][j][0] = (hash[mask] || 0) + (hash[mask ^ 1] || 0);
} else {
dp[mask][j][0] = hash[mask] || 0;
}
for (let i=1; i<=N; i++){
if (mask & (1 << i)){
if (j == i)
dp[mask][j][i] = dp[mask][j][i-1];
else
dp[mask][j][i] = dp[mask][j][i-1] + dp[mask ^ (1 << i)][j][i - 1];
} else {
dp[mask][j][i] = dp[mask][j][i-1];
}
}
}
}
let answer = 0;
for (let i=0; i<n; i++){
for (let j=0; j<N; j++)
if (A[i] & (1 << j))
answer += dp[((1 << N) - 1) ^ A[i] | (1 << j)][j][N];
}
for (let i=0; i<N + 1; i++)
if (hash[1 << i])
answer = answer - hash[1 << i];
return answer / 2;
}
var As = [
[10, 7, 2, 8, 3] // 6
];
for (let A of As){
console.log(JSON.stringify(A));
console.log(`DP, brute force: ${ f(A, 4) }, ${ bruteForce(A) }`);
console.log('');
}
var numTests = 1000;
for (let i=0; i<numTests; i++){
const N = 6;
const A = [];
const n = 10;
for (let j=0; j<n; j++){
const num = Math.floor(Math.random() * (1 << N));
A.push(num);
}
const fA = f(A, N);
const brute = bruteForce(A);
if (fA != brute){
console.log('Mismatch:');
console.log(A);
console.log(fA, brute);
console.log('');
}
}
console.log("Done testing.");
int[] numbers = new[] { 10, 7, 2, 8, 3 };
static bool IsPowerOfTwo(int n) => (n != 0) && ((n & (n - 1)) == 0);
long result = numbers.AsParallel()
.Select((a, i) => numbers
.Skip(i + 1)
.Select(b => a & b)
.Count(IsPowerOfTwo))
.Sum();
If I understand the problem correctly, this should work and should be faster.
First, for each number in the array we grab all elements in the array after it to get a collection of numbers to pair with.
Then we transform each pair number with a bitwise AND, then counting the number that satisfy our 'IsPowerOfTwo;' predicate (implementation here).
Finally we simply get the sum of all the counts - our output from this case is 6.
I think this should be more performant than your dictionary based solution - it avoids having to perform a lookup each time you wish to check power of 2.
I think also given the numerical constraints of your inputs it is fine to use int data types.
So, this is my problem to solve:
I want to calculate 2^(n) where 0 < n< 10000
I am representing each element of array as a space where 4digit number should be "living" and if extra digit appears, I am replacing it to the next element of this array.
The principle I am using looks like this:
The code I am using is the following:
static string NotEfficient(int power)
{
if (power < 0)
throw new Exception("Power shouldn't be negative");
if (power == 0)
return "1";
if (power == 1)
return "2";
int[] A = new int[3750];
int current4Digit = 0;
//at first 2 is written in first element of array
A[current4Digit] = 2;
int currentPower = 1;
while (currentPower < power)
{
//multiply every 4digit by 2
for (int i = 0; i <= current4Digit; i++)
{
A[i] *= 2;
}
currentPower++;
//checking every 4digit if it
//contains 5 digit and if yes remove and
//put it in next 4digit
for (int i = 0; i <= current4Digit; i++)
{
if (A[i] / 10000 > 0)
{
int more = A[i] / 10000;
A[i] = A[i] % 10000;
A[i + 1] += more;
//if new digit should be opened
if (i + 1 > current4Digit)
{
current4Digit++;
}
}
}
}
//getting data from array to generate answer
string answer = "";
for (int i = current4Digit; i >= 0; i--)
{
answer += A[i].ToString() + ",";
}
return answer;
}
The problem I have is that it doesn't display correctly the number, which contains 0 in reality. for example 2 ^ (50) = 1 125 899 906 842 624 and with my algorithm I get 1 125 899 96 842 624 (0 is missing). This isn't only for 50...
This happens when I have the following situation for example:
How I can make this algorithm better?
Use BigInteger, which is already included in .Net Core or available in the System.Runtime.Numerics Nuget Package.
static string Efficient(int power)
{
var result = BigInteger.Pow(2, power);
return result.ToString(CultureInfo.InvariantCulture);
}
On my machine, NotEfficient takes roughly 80ms, where Efficient takes 0.3ms. You should be able to manipulate that string (if I'm understanding your problem statement correctly):
static string InsertCommas(string value)
{
var sb = new StringBuilder(value);
for (var i = value.Length - 4; i > 0; i -= 4)
{
sb.Insert(i, ',');
}
return sb.ToString();
}
One way to resolve this is to pad your 4-digit numbers with leading zeroes if they are less than four digits by using the PadLeft method:
answer += A[i].ToString().PadLeft(4, '0') + ",";
And then you can use the TrimStart method to remove any leading zeros from the final result:
return answer.TrimStart('0');
I implemented this algorithm to solve Leetcode's #28 question "implemment strStr()".
Problem description:
Implement strStr(),
Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
My code was implemented based on http://www.geeksforgeeks.org/searching-for-patterns-set-3-rabin-karp-algorithm/ 's tutoril.
I found out that with different prime number I use to scale down the hash, the function could go wrong.
Here is my code:
public class Solution {
public int StrStr(string haystack, string needle) {
int len = needle.Length;
//2 special case
if (haystack.Length < len) return -1;
if (needle == "") return 0;
//base prime number used for rabin-karp's hash function
int basement = 101;
//prime number used to scale down the hash value
int prime = 101;
//the factor used to multiply with the character to be removed from the hash
int factor = (int)(Math.Pow(basement, needle.Length - 1)) % prime;
//get ascii value of the needle and the initial window
int needleHash = 0;
int windowHash = 0;
byte[] needleBytes = Encoding.ASCII.GetBytes(needle);
byte[] windowBytes = Encoding.ASCII.GetBytes(haystack.Substring(0, len));
//generate hash value for both
for (int i = 0; i < needle.Length; i++)
{
needleHash = (needleHash * basement + needleBytes[i]) % prime;
windowHash = (windowHash * basement + windowBytes[i]) % prime;
}
//loop to find match
bool findMatch = true;
for (int i = 0; i < haystack.Length - len + 1; i++){
//if hash value matches, incase the hash value are not uniq, iterate through needle and window
if(needleHash == windowHash){
findMatch = true;
for (int j = 0; j < len; j++)
{
if (needle[j] != haystack[i + j])
{
findMatch = false;
break;
}
}
if (findMatch == true) return i;
}
//move the sliding window and find the hash value for new window
if (i < haystack.Length - len){
byte removeByte = Encoding.ASCII.GetBytes(haystack.Substring(i, 1))[0];
byte addByte = Encoding.ASCII.GetBytes(haystack.Substring(i + len, 1))[0];
//function of rolling hash
windowHash = ((windowHash - removeByte * factor) * basement + addByte) % prime;
//ensure the window hash to be positive
if(windowHash < 0) windowHash += prime;
}
}
return -1;
}
}
With "prime" set to "101", this code passes all test. But if i change it to other prime number no matter smaller or larger (example: 17, 31, 103), it always failed at test "68/72" which
haystack = "baabbaaaaaaabbaaaaabbabbababaabbabbbbbab
babbbbbbabababaabbbbbaaabbbbabaababababbbaabbbbaaabbaaba
bbbaabaabbabbaaaabababaaabbabbababbabbaaabbbbabbbbabbabbaabbbaa";
needle = "bbaaaababa";
Thus I do believe my code have big issues that I could not detect. Why does that happen?
Two issues with your code:
basement should be set at 256 per the algorithm at the site you used as a reference. This value is the number of characters in the input alphabet.
Your computation of factor is not correct; you're casting to int before calculating the remainder. The Math.Pow operation results in a double value that is larger than Int32.MaxValue. When you cast to int before the modulo operation you truncate this value. You need to perform the modulo with the double value and then cast to int. The line should look like this:
int factor = (int)((Math.Pow(basement, needle.Length - 1)) % prime);
I tested your code with these modifications and the example given and it works for primes of 17, 31, 101 and 103.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm following this blog asking
Five programming problems every Software Engineer should be able to solve in less than 1 hour
I am absolutely stumped by question 4 (5 is a different story too)
Write a function that given a list of non negative integers, arranges them such that they form the largest possible number. For example, given [50, 2, 1, 9], the largest formed number is 95021.
Now the author posted a answer, and I saw an python attempt too:
import math
numbers = [50,2,1,9,10,100,52]
def arrange(lst):
for i in xrange(0, len(lst)):
for j in xrange(0, len(lst)):
if i != j:
comparison = compare(lst[i], lst[j])
if lst[i] == comparison[0]:
temp = lst[j]
lst[j] = lst[i]
lst[i] = temp
return lst
def compare(num1, num2):
pow10_1 = math.floor(math.log10(num1))
pow10_2 = math.floor(math.log10(num2))
temp1 = num1
temp2 = num2
if pow10_1 > pow10_2:
temp2 = (temp2 / math.pow(10, pow10_2)) * math.pow(10, pow10_1)
elif pow10_2 > pow10_1:
temp1 = (temp1 / math.pow(10, pow10_1)) * math.pow(10, pow10_2)
print "Starting", num1, num2
print "Comparing", temp1, temp2
if temp1 > temp2:
return [num1, num2]
elif temp2 > temp1:
return [num2, num1]
else:
if num1 < num2:
return [num1, num2]
else:
return [num2, num1]
print arrange(numbers)
but I'm not going to learn these languages soon. Is anyone willing to share how they would sort the numbers in C# to form the largest number please?
I've also tried straight conversion to C# in VaryCode but when the IComparer gets involved, then it causes erroneous conversions.
The python attempt uses a bubble sort it seems.
Is bubble sort a starting point? What else would be used?
The solution presented uses the approach to sort the numbers in the array in a special way. For example the value 5 comes before 50 because "505" ("50"+"5") comes before "550" ("5"+"50"). The thinking behind this isn't explained, and I'm not convinced that it actually works...
When looking that the problem, I came to this solution:
You can do it recursively. Make a method that loops through the numbers in the array and concatenates each number with the largest value that can be formed by the remaining numbers, to see which one of those is largest:
public static int GetLargest(int[] numbers) {
if (numbers.Length == 1) {
return numbers[0];
} else {
int largest = 0;
for (int i = 0; i < numbers.Length; i++) {
int[] other = numbers.Take(i).Concat(numbers.Skip(i + 1)).ToArray();
int n = Int32.Parse(numbers[i].ToString() + GetLargest(other).ToString());
if (i == 0 || n > largest) {
largest = n;
}
}
return largest;
}
}
If you want to do it by bubble sort, try this:
public static void Main()
{
bool swapped = true;
while (swapped)
{
swapped = false;
for (int i = 0; i < VALUES.Length - 1; i++)
{
if (Compare(VALUES[i], VALUES[i + 1]) > 0)
{
int temp = VALUES[i];
VALUES[i] = VALUES[i + 1];
VALUES[i + 1] = temp;
swapped = true;
}
}
}
String result = "";
foreach (int integer in VALUES)
{
result += integer.ToString();
}
Console.WriteLine(result);
}
public static int Compare(int lhs, int rhs)
{
String v1 = lhs.ToString();
String v2 = rhs.ToString();
return (v1 + v2).CompareTo(v2 + v1) * -1;
}
The Compare method compares the order of two numbers that would create the largest number. When it returns value larger than 0, that means you need to swap.
This is the sorting part:
while (swapped)
{
swapped = false;
for (int i = 0; i < VALUES.Length - 1; i++)
{
if (Compare(VALUES[i], VALUES[i + 1]) > 0)
{
int temp = VALUES[i];
VALUES[i] = VALUES[i + 1];
VALUES[i + 1] = temp;
swapped = true;
}
}
}
It checks to consecutive values in the array and swaps if necessary. After one iteration without swapping, the sort is finished.
Finally, you concatenate the values in the array and print it out.