Recursion to iteration conversion using dynamic programming - C#

public static int n;
public static int w;
public static int[] s;
public static int[] p;
static void Main(string[] args)
{
n = 5;
w = 5;
s = new int[n + 1];
p = new int[n + 1];
Random rnd = new Random();
for (int i = 1; i <= n; i++)
{
s[i] = rnd.Next(1, 10);
p[i] = rnd.Next(1, 10);
}
Console.WriteLine(F_recursion(n, w));
Console.WriteLine(DP(n, w));
}
// recursive approach
public static int F_recursion(int n, int w)
{
if (n == 0 || w == 0)
return 0;
else if (s[n] > w)
return F_recursion(n - 1, w);
else
{
return Math.Max(F_recursion(n - 1, w), (p[n] + F_recursion(n - 1, w - s[n])));
}
}
// iterative approach
public static int DP(int n, int w)
{
int result = 0;
for (int i = 1; i <= n; i++)
{
if (s[i] > w)
{
continue;
}
else
{
result += p[i];
w = w - s[i];
}
}
return result;
}
I need to convert the F_recursion function to an iterative one. I have currently written the following function DP that sometimes works, but not always. I learned that the problem is in F_recursion(n - 1, w - s[n]); I have no idea how to make w - s[n] work correctly in an iterative solution. If I change w - s[n] and w - s[i] to just w, then the program always works.
In Console:
s[i] = 2 p[i] = 3
-------
s[i] = 3 p[i] = 4
-------
s[i] = 5 p[i] = 3
-------
s[i] = 3 p[i] = 8
-------
s[i] = 6 p[i] = 6
-------
Recursive:11
Iteration:7
but sometimes it works
s[i] = 5 p[i] = 6
-------
s[i] = 8 p[i] = 1
-------
s[i] = 3 p[i] = 5
-------
s[i] = 3 p[i] = 1
-------
s[i] = 7 p[i] = 7
-------
Recursive:6
Iteration:6

The following approach might be useful when bigger numbers are involved (especially for s), so that a two-dimensional array would be unnecessarily big and only a few w values would actually be used in computing the result.
The idea: precompute the possible w values by starting at w and, for each i in [n, n-1, ..., 1], determining the values w_[i], where w_[i+1] >= s[i], without duplicates.
Then iterate i_n over n and compute sub-results only for the valid w_[i] values.
I chose an array of Dictionary as the data structure, since it's relatively easy to represent sparse data this way.
public static int DP(int n, int w)
{
// compute possible w values for each iteration from 0 to n
Stack<HashSet<int>> validW = new Stack<HashSet<int>>();
validW.Push(new HashSet<int>() { w });
for (int i = n; i > 0; i--)
{
HashSet<int> validW_i = new HashSet<int>();
foreach (var prevValid in validW.Peek())
{
validW_i.Add(prevValid);
if (prevValid >= s[i])
{
validW_i.Add(prevValid - s[i]);
}
}
validW.Push(validW_i);
}
// compute sub-results for all possible n,w values.
Dictionary<int, int>[] value = new Dictionary<int,int>[n + 1];
for (int n_i = 0; n_i <= n; n_i++)
{
value[n_i] = new Dictionary<int, int>();
HashSet<int> validSubtractW_i = validW.Pop();
foreach (var w_j in validSubtractW_i)
{
if (n_i == 0 || w_j == 0)
value[n_i][w_j] = 0;
else if (s[n_i] > w_j)
value[n_i][w_j] = value[n_i - 1][w_j];
else
value[n_i][w_j] = Math.Max(value[n_i - 1][w_j], (p[n_i] + value[n_i - 1][w_j - s[n_i]]));
}
}
return value[n][w];
}
It's important to understand that some space and computation is "wasted" in order to precompute the possible w values and to support the sparse data structures. So this approach might perform badly for large data sets with small values in s, where most w values will be possible sub-results.
After some more thought I realized that, if space is a concern, you can actually throw away the sub-results of everything except the previous outer-loop iteration, since the recursion in this algorithm follows a strict n-1 pattern. However, I'm not folding this into the DP method above.
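For illustration only, a sketch of that rolling-row idea might look like this (the validW precomputation is unchanged; only the previous row of sub-results stays alive):
public static int DP_Rolling(int n, int w)
{
    // same precomputation of reachable w values as in DP above
    Stack<HashSet<int>> validW = new Stack<HashSet<int>>();
    validW.Push(new HashSet<int>() { w });
    for (int i = n; i > 0; i--)
    {
        HashSet<int> validW_i = new HashSet<int>();
        foreach (var prevValid in validW.Peek())
        {
            validW_i.Add(prevValid);
            if (prevValid >= s[i])
                validW_i.Add(prevValid - s[i]);
        }
        validW.Push(validW_i);
    }
    // row 0: no items used, so every reachable w has value 0
    Dictionary<int, int> prev = new Dictionary<int, int>();
    foreach (var w_j in validW.Pop())
        prev[w_j] = 0;
    for (int n_i = 1; n_i <= n; n_i++)
    {
        var current = new Dictionary<int, int>();
        foreach (var w_j in validW.Pop())
        {
            if (w_j == 0)
                current[w_j] = 0;
            else if (s[n_i] > w_j)
                current[w_j] = prev[w_j];
            else
                current[w_j] = Math.Max(prev[w_j], p[n_i] + prev[w_j - s[n_i]]);
        }
        prev = current; // rows older than the previous one are discarded here
    }
    return prev[w];
}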

Your approach does not work because your dynamic programming state space (which apparently is only one variable) does not match the signature of the recursive method. The goal of the dynamic programming approach should be to define and fill a state space such that all results needed for an evaluation are available when needed. On inspection of the recursive method, notice that the recursive calls of F_recursion may change both arguments, n and w. This is an indication that a two-dimensional state space should be used.
The first argument (which apparently limits the range of items) can range from 0 to n, while the second argument (which apparently is some bound for the total of an item property) can range from 0 to w.
You should define a two-dimensional state space
int[,] value = new int[n + 1, w + 1];
to accommodate the values. Next, you could initialize the values to undefined; Int32.MinValue is a suitable sentinel here, because it behaves correctly when the maximum with some other value is calculated. (With the evaluation order below, every value is written before it is read, so the initialization is mainly a safety net.)
Next, the iterative version of the algorithm should use two loops which iterate in a forward manner, unlike the recursive evaluation, which decreases its arguments.
for (int i = 0; i <= n; i++)
{
for (int j = 0; j <= w; j++)
{
// logic for the recurrence relation goes here
}
}
In the innermost block you can use a modified version of the recurrence relation: instead of making recursive calls, you read values which are stored in value; instead of returning values, you write them to value.
Semantically this is the same as memoization, but instead of using actual recursive calls, the order of evaluation ensures that the necessary values always exist, making additional logic unnecessary.
Once the state space is filled, you have to examine its last state (namely the part of the array where the first index is n) to determine the maximal value for the entire input.
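Putting those pieces together, a sketch of the finished method might look like this (it reuses the question's static s and p arrays; the name DP_TwoDimensional is made up here). Row 0 and column 0 keep the default value 0, mirroring the recursive base cases, so no sentinel is needed:
public static int DP_TwoDimensional(int n, int w)
{
    // value[i, j] = best profit using items 1..i with remaining capacity j
    int[,] value = new int[n + 1, w + 1];
    for (int i = 1; i <= n; i++)
    {
        for (int j = 0; j <= w; j++)
        {
            if (s[i] > j)
                value[i, j] = value[i - 1, j]; // item i does not fit
            else
                value[i, j] = Math.Max(
                    value[i - 1, j],                // skip item i
                    p[i] + value[i - 1, j - s[i]]); // take item i
        }
    }
    return value[n, w]; // same result as F_recursion(n, w)
}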

Is it okay to exit a loop when an exception is thrown?

I solved a task on Hackerrank.com, where the problem was like this:
You have an Array. This Array contains numbers.
Now you enter two numbers:
The first one describes a sum
The second one describes the number of indexes (the sequence length) you add together
In the end you get the number of sequences whose sum equals your defined number
For example:
Your array is [ 1, 2, 3, 4], your sum is 3 and your sequence length is 2.
Now you take the first two indexes and output the sum: [1, 2] = 3.
This is equal to your sum, so now you have found one sequence.
The next sequence is [ 2, 3 ] = 5. This is not equal to 3, so your sequence counter stays 1.
The last sequence is [3, 4] = 7. This is also not equal to 3 and in the end, you found one sequence.
I wrote this code for that:
static int GetSequences(List<int> s, int d, int m)
{
//m = segment-length
//d = sum
int count = 0;
int j = 0;
int k = 0;
do
{
try
{
List<int> temp = new List<int>();
for (int i = 0; i < m; i++)
{
temp.Add(s[i + k]);
}
if (temp.Sum() == d)
{
count++;
}
j++;
k++;
}
catch (ArgumentOutOfRangeException)
{
break;
}
} while (true);
return count;
}
As I didn't know how often I would have to count
(for example, a 6-length array with a sequence length of 3 has 4 sequences: 1,2,3 | 2,3,4 | 3,4,5 | 4,5,6),
I stop the while loop when the index is out of range. But I'm not sure whether this solution is okay, both regarding program speed and code cleanliness. Is this code acceptable, or is it better to use a for loop which loops, for example, exactly 4 times for a 6-length array with 3-length sequences?
It's not recommended, no. Exceptions should be reserved for stuff that isn't supposed to happen, not flow control or validation.
What you want is to use conditional logic (if statements) and the break keyword.
Also, codereview.stackexchange.com is better suited for these kinds of questions.
It would be better to fix your code so that it doesn't routinely throw exceptions:
You sum each of these segments:
0 1 2 3      start = 0
| |          summing indexes: 0, 1
+-+
0 1 2 3      start = 1
  | |        summing indexes: 1, 2
  +-+
0 1 2 3      start = 2
    | |      summing indexes: 2, 3
    +-+
The bracket starts at the index start and has a size of m. The length of s is given by s.Count. Therefore we want to keep going while start + m <= s.Count.
(I always find it's useful to draw these things out and put sample numbers in, to make sure you've got the maths right. In the sample above, you can see that the last segment processed is the one where start (2) + m (2) == the array size (4).)
static int GetSequences(List<int> s, int d, int m)
{
//m = segment-length
//d = sum
int count = 0;
for (int start = 0; start + m <= s.Count; start++)
{
List<int> temp = new List<int>();
for (int i = 0; i < m; i++)
{
temp.Add(s[start + i]);
}
if (temp.Sum() == d)
{
count++;
}
}
return count;
}
However, you can improve your code a bit:
Use meaningful variable names
Don't create a new temporary list each time, just to sum it
Check your inputs
static int GetSequences(List<int> numbers, int targetSum, int segmentLength)
{
if (numbers == null)
throw new ArgumentNullException(nameof(numbers));
if (segmentLength > numbers.Count)
throw new ArgumentException("segmentLength must be <= numbers.Count");
int count = 0;
for (int start = 0; start + segmentLength <= numbers.Count; start++)
{
int sum = 0;
for (int i = 0; i < segmentLength; i++)
{
sum += numbers[start + i];
}
if (sum == targetSum)
{
count++;
}
}
    return count;
}
Usually, except for switch/case, there is rarely a real reason to use break.
Also, an exception MUST be, as the name says, exceptional, so it MUST NOT be part of your logic.
As Jeppe said, you can use the methods and properties the framework provides to do what you like.
Here s.Count seems to be the way to go.
int[] arr = new[] { 1, 2, 1, 2 };
// Sum and len are given by the task.
// 'last' is the last index where we should stop iterating.
int sum = 3, len = 2, last = arr.Length - len;
// One of the overloads of Where accepts index, i.e. the position of element.
// 1) We check that we don't go after our stop-index (last).
// 2) Avoid exception by using '&&'.
// 3) We use C# 8 Range (..) to get the slice of the numbers we need:
// we start from the current position (index) till then index,
// calculated as current index + length given by the task.
// 4) Sum all the numbers in the slice (Sum()) and compare it with the target sum,
// given by the task (== sum).
// 5) The count of all matches (Count()) is the sought amount of sequences.
int count = arr.Where((z, index) => index <= last && arr[index..(index+len)].Sum() == sum).Count();

find number with no pair in array

I am having trouble with a small bit of code. I have a random-size array filled with pairs of random numbers, except for one number which has no pair.
I need to find that number which has no pair.
arLength is the length of the array.
I am having trouble actually matching the pairs and finding the one which has no pair.
for (int i = 0; i <= arLength; i++)
{ // go through the array one by one..
var number = nArray[i];
// now search through the array for a match.
for (int e = 0; e <= arLength; e++)
{
if (e != i)
{
}
}
}
I have also tried this:
var findValue = nArray.Distinct();
I have searched around, but so far I haven't been able to find a method for this.
This code is what generates the array (this question isn't about this part of the code; it's included only for clarity):
Random num = new Random();
int check = CheckIfOdd(num.Next(1, 1000000));
int counter = 1;
while (check <= 0)
{
if (check % 2 == 0)
{
check = CheckIfOdd(num.Next(1, 1000000)); ;
}
counter++;
}
int[] nArray = new int[check];
int arLength = 0;
//generate arrays with pairs of numbers, and one number which does not pair.
for (int i = 0; i < check; i++)
{
arLength = nArray.Length;
if (arLength == i + 1)
{
nArray[i] = i + 1;
}
else
{
nArray[i] = i;
nArray[i + 1] = i;
}
i++;
}
You can do it using the bitwise operator ^, and the complexity is O(n).
Theory
The operator ^, aka XOR, has the following truth table:
a b | a ^ b
0 0 |   0
0 1 |   1
1 0 |   1
1 1 |   0
Since x ^ x == 0 and x ^ 0 == x, all the pairs cancel out, and the single number without a pair remains.
var element = nArray[0];
for(int i = 1; i < arLength; i++)
{
element = element ^ nArray[i];
}
At the end, the variable element will hold the number without a pair.
Distinct will give you back the array with distinct values; it will not find the value you need.
You can GroupBy and choose the values with Count modulo 2 equals 1.
var noPairs = nArray.GroupBy(i => i)
.Where(g => g.Count() % 2 == 1)
.Select(g=> g.Key);
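Since the problem guarantees exactly one unpaired number, you can also collapse this to a single value (a small follow-up sketch, assuming exactly one odd-count group exists):
int unpaired = nArray.GroupBy(i => i)
                     .Single(g => g.Count() % 2 == 1)
                     .Key;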
You can use a dictionary to store the number of occurrences of each value in the array. To find the value without pairs, look for a (single) number of occurrences smaller than 2.
using System.Linq;
int[] data = new[] {1, 2, 3, 4, 5, 3, 2, 4, 1};
// key is the number, value is its count
var numberCounts = new Dictionary<int, int>();
foreach (var number in data) {
if (numberCounts.ContainsKey(number)) {
numberCounts[number]++;
}
else {
numberCounts.Add(number, 1);
}
}
var noPair = numberCounts.Single(kvp => kvp.Value < 2);
Console.WriteLine(noPair.Key);
Time complexity is O(n) because you traverse the array only a single time and then traverse the dictionary a single time. The same dictionary can also be used to find triplets etc.
.NET Fiddle
An easy and fast way to do this is with a frequency table: keep a dictionary with the number as key and the number of times you found it as value. This way you only have to run through your array once.
Your example could work too with some changes, but it will be a lot slower if you have a big array.
for (int i = 0; i < arLength; i++)
{
    bool hasMatch = false;
    for (int e = 0; e < arLength; e++)
    {
        if (e != i && nArray[e] == nArray[i]) // compare the elements, not the indexes
        {
            hasMatch = true;
        }
    }
    // if hasMatch == false, you found your item.
}
All you have to do is to XOR all the numbers:
int result = nArray.Aggregate((s, a) => s ^ a);
All items which have a pair will cancel out: a ^ a == 0, and you'll be left with the distinct item: 0 ^ 0 ^ ... ^ 0 ^ distinct ^ 0 ^ ... ^ 0 == distinct
Because you mentioned in a comment that you like short and simple, how about getting rid of most of your other code as well?
var total = new Random().Next(500000) * 2 + 1;
var myArray = new int[total];
for (var i = 1; i < total; i+=2)
{
myArray[i] = i;
myArray[i -1] = i;
}
myArray[total - 1] = total;
Then indeed use LINQ to get what you are looking for. Here is a slight variation, returning the key of the unpaired item in your array:
var key = myArray.GroupBy(t => t).FirstOrDefault(g=>g.Count()==1)?.Key;

The most similar string [duplicate]

I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify in the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percentage? I always take the length of the real word. So it becomes 2 / 8 = 25%, meaning the DSM (degree of similarity) of these 2 given strings is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x+. Note a couple of key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum.
Use a rotating set of three arrays rather than a massive matrix, as in all the implementations I've seen elsewhere.
Make sure your arrays slice across the shorter word's width.
Code (it works exactly the same if you replace int[] with string in the parameter declarations):
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if the threshold is exceeded; otherwise the Damerau-Levenshtein distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
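For illustration, here's a hypothetical call site following the advice above to pre-convert words to integer code points (the Select-based conversion is an assumption; any char-to-int mapping works, and it requires using System.Linq):
// Hypothetical usage sketch: convert both words to int arrays once,
// then compare with a maximum allowable distance of 2.
int[] source = "hospital".Select(c => (int)c).ToArray();
int[] target = "haspita".Select(c => (int)c).ToArray();
int distance = DamerauLevenshteinDistance(source, target, 2);
// distance == 2 here; int.MaxValue would have signaled "more than 2 edits apart"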
What you are looking for is called edit distance or Levenshtein distance. The Wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's a straightforward implementation:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
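Tying this back to the question's percentage: divide the distance by the real word's length and subtract from 1. A minimal sketch using the method above:
// Similarity as defined in the question: 1 - distance / length of the real word
private static double Similarity(string real, string mistaken)
{
    int distance = CalcLevenshteinDistance(real, mistaken);
    return 1.0 - (double)distance / real.Length;
}
// Similarity("hospital", "haspita") == 0.75, i.e. 75%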
There is a large number of string similarity distance algorithms that can be used. Some of them (not an exhaustive list):
Levenstein
Needleman Wunch
Smith Waterman
Smith Waterman Gotoh
Jaro, Jaro Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge Elkan
A library that contains implementations of all of these is called SimMetrics,
which has both Java and C# implementations.
I have found that Levenshtein and Jaro-Winkler are great for small differences between strings, such as:
spelling mistakes; or
ö instead of o in a person's name.
However, when comparing something like article titles, where significant chunks of the text are the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic:
compare these 2 titles (that are the same article but worded differently by different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
A site that provides algorithm comparisons over strings shows:
Levenshtein: 81
Smith-Waterman-Gotoh: 94
Jaro-Winkler: 78
Jaro-Winkler and Levenshtein are not as competent as Smith-Waterman-Gotoh at detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro-Winkler gives a false positive, but Smith-Waterman-Gotoh does not:
Levenshtein: 54
Smith-Waterman-Gotoh: 49
Jaro-Winkler: 89
As Anastasiosyal pointed out, SimMetrics has the Java code for these algorithms. I had success using the SmithWatermanGotoh Java code from SimMetrics.
Here is my implementation of Damerau-Levenshtein distance, which returns not only the similarity coefficient but also the error locations in the corrected word (this feature can be used in text editors). My implementation also supports different weights for the errors (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is part of my project: Yandex-Linguistics.NET.
I wrote some tests, and it seems to me that the method is working.
But comments and remarks are welcome.
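For illustration, a possible call might look like this (a hypothetical usage sketch; the exact positions reported depend on the traceback order):
// Compare the misspelled word against its correction.
List<Mistake> mistakes = OptimalStringAlignmentDistance("haspita", "hospital");
foreach (Mistake m in mistakes)
    Console.WriteLine(m); // position in the corrected word + error type
// With the default weights of 1, the count equals the edit distance.
Console.WriteLine("Distance: " + mistakes.Count);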
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing against every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, but you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words, or n genomic sequences, or n whatever). Keep a mapping from trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is 3, hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 of the 8 distinct trigrams (or 4 out of the 10 total) match. If this works for you, you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparisons to those that have some minimum number of n-grams in common, as sketched below.
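A minimal sketch of the trigram side of that idea (the distinct-set variant; Intersect requires using System.Linq):
// Collect the distinct trigrams of a word.
static HashSet<string> Trigrams(string s)
{
    var result = new HashSet<string>();
    for (int i = 0; i + 3 <= s.Length; i++)
        result.Add(s.Substring(i, 3));
    return result;
}
// Overlap between "English" and "England" is { "Eng", "ngl" }:
// Trigrams("English").Intersect(Trigrams("England")).Count() == 2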
Here's a VB.NET implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function

Counting sort - implementation differences

I heard about Counting Sort and wrote my version of it based on what I understood.
public void my_counting_sort(int[] arr)
{
int range = 100;
int[] count = new int[range];
for (int i = 0; i < arr.Length; i++) count[arr[i]]++;
int index = 0;
for (int i = 0; i < count.Length; i++)
{
while (count[i] != 0)
{
arr[index++] = i;
count[i]--;
}
}
}
The above code works perfectly.
However, the algorithm given in CLRS is different. Below is my implementation:
public int[] counting_sort(int[] arr)
{
int k = 100;
int[] count = new int[k + 1];
for (int i = 0; i < arr.Length; i++)
count[arr[i]]++;
for (int i = 1; i <= k; i++)
count[i] = count[i] + count[i - 1];
int[] b = new int[arr.Length];
for (int i = arr.Length - 1; i >= 0; i--)
{
b[count[arr[i]]] = arr[i];
count[arr[i]]--;
}
return b;
}
I've directly translated this from pseudocode to C#. The code doesn't work and I get an IndexOutOfRangeException.
So my questions are:
What's wrong with the second piece of code?
What's the difference, algorithm-wise, between my naive implementation and the one given in the book?
The problem with your version is that it won't work if the elements have satellite data.
The CLRS version would work, and it's stable.
EDIT:
Here's an implementation of the CLRS version in Python, which sorts pairs (key, value) by key:
def sort(a):
    B = 101
    count = [0] * B
    for (k, v) in a:
        count[k] += 1
    for i in range(1, B):
        count[i] += count[i - 1]
    b = [None] * len(a)
    for i in range(len(a) - 1, -1, -1):
        (k, v) = a[i]
        count[k] -= 1
        b[count[k]] = a[i]
    return b
>>> print sort([(3,'b'),(2,'a'),(3,'l'),(1,'s'),(1,'t'),(3,'e')])
[(1, 's'), (1, 't'), (2, 'a'), (3, 'b'), (3, 'l'), (3, 'e')]
It should be
b[count[arr[i]]-1] = arr[i];
I'll leave it to you to track down why ;-).
I don't think they perform any differently. The second just hoists the accumulation of counts out of the loop, so that the final loop is simplified a bit. That's not necessary as far as I'm concerned. Your way is just as straightforward and probably more readable. In fact (I don't know about C#, since I'm a Java guy), I would expect that you could replace that inner while loop with a library array fill; something like this:
for (int i = 0; i < count.Length; i++)
{
arrayFill(arr, index, count[i], i);
index += count[i];
}
In Java the method is java.util.Arrays.fill(...).
The problem is that you have hard-coded the length of the count array to 100. The length of the array should be m + 1, where m is the maximum element in the original array. This is the main reason you would think of using counting sort in the first place: you have the information that all elements of the array are below some constant, and in that case it works great.
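In code, that might look like the following sketch (assuming a non-empty array of non-negative ints; Max requires using System.Linq):
public static int[] counting_sort_dynamic(int[] arr)
{
    int m = arr.Max();                  // size the count array from the data
    int[] count = new int[m + 1];
    foreach (int x in arr) count[x]++;
    for (int i = 1; i <= m; i++)
        count[i] += count[i - 1];       // prefix sums, as in CLRS
    int[] b = new int[arr.Length];
    for (int i = arr.Length - 1; i >= 0; i--)
        b[--count[arr[i]]] = arr[i];    // decrement before use: the off-by-one fix
    return b;
}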

Is there a good radixsort-implementation for floats in C#

I have a data structure with a field of type float. A collection of these structures needs to be sorted by the value of the float. Is there a radix sort implementation for this?
If there isn't, is there a fast way to access the exponent, the sign and the mantissa?
Because if you sort the floats first on mantissa, then on exponent, and last on the sign, you sort floats in O(n).
Update:
I was quite interested in this topic, so I sat down and implemented it (using this very fast and memory-conservative implementation). I also read this one (thanks celion) and found out that you don't even have to split the floats into mantissa and exponent to sort them. You just have to take the bits one-to-one and perform an int sort. You just have to take care of the negative values, which have to be put, in inverse order, in front of the positive ones at the end of the algorithm (I did that in one step with the last iteration of the algorithm to save some CPU time).
So here's my float radix sort:
public static float[] RadixSort(this float[] array)
{
// temporary array and the array of converted floats to ints
int[] t = new int[array.Length];
int[] a = new int[array.Length];
for (int i = 0; i < array.Length; i++)
a[i] = BitConverter.ToInt32(BitConverter.GetBytes(array[i]), 0);
// set the group length to 1, 2, 4, 8 or 16
// and see which one is quicker
int groupLength = 4;
int bitLength = 32;
// counting and prefix arrays
// (dimension is 2^r, the number of possible values of a r-bit number)
int[] count = new int[1 << groupLength];
int[] pref = new int[1 << groupLength];
int groups = bitLength / groupLength;
int mask = (1 << groupLength) - 1;
int negatives = 0, positives = 0;
for (int c = 0, shift = 0; c < groups; c++, shift += groupLength)
{
// reset count array
for (int j = 0; j < count.Length; j++)
count[j] = 0;
// counting elements of the c-th group
for (int i = 0; i < a.Length; i++)
{
count[(a[i] >> shift) & mask]++;
// additionally count all negative
// values in first round
if (c == 0 && a[i] < 0)
negatives++;
}
if (c == 0) positives = a.Length - negatives;
// calculating prefixes
pref[0] = 0;
for (int i = 1; i < count.Length; i++)
pref[i] = pref[i - 1] + count[i - 1];
// from a[] to t[] elements ordered by c-th group
for (int i = 0; i < a.Length; i++){
// Get the right index to sort the number in
int index = pref[(a[i] >> shift) & mask]++;
if (c == groups - 1)
{
// We're in the last (most significant) group, if the
// number is negative, order them inversely in front
// of the array, pushing positive ones back.
if (a[i] < 0)
index = positives - (index - negatives) - 1;
else
index += negatives;
}
t[index] = a[i];
}
// a[]=t[] and start again until the last group
t.CopyTo(a, 0);
}
// Convert back the ints to the float array
float[] ret = new float[a.Length];
for (int i = 0; i < a.Length; i++)
ret[i] = BitConverter.ToSingle(BitConverter.GetBytes(a[i]), 0);
return ret;
}
It is slightly slower than an int radix sort because of the array copying at the beginning and end of the function, where the floats are bitwise copied to ints and back. The whole function is nevertheless still O(n). In any case, it is much faster than sorting 3 times in a row as you proposed. I don't see much room for optimization anymore, but if anyone does: feel free to tell me.
To sort descending change this line at the very end:
ret[i] = BitConverter.ToSingle(BitConverter.GetBytes(a[i]), 0);
to this:
ret[a.Length - i - 1] = BitConverter.ToSingle(BitConverter.GetBytes(a[i]), 0);
Measuring:
I set up a short test containing all special cases of floats (NaN, +/-Inf, Min/Max value, 0) and random numbers. It sorts in exactly the same order as Linq or Array.Sort sorts floats:
NaN -> -Inf -> Min -> Negative Nums -> 0 -> Positive Nums -> Max -> +Inf
So I ran a test with a huge array of 10M numbers:
float[] test = new float[10000000];
Random rnd = new Random();
for (int i = 0; i < test.Length; i++)
{
byte[] buffer = new byte[4];
rnd.NextBytes(buffer);
float rndfloat = BitConverter.ToSingle(buffer, 0);
switch(i){
case 0: { test[i] = float.MaxValue; break; }
case 1: { test[i] = float.MinValue; break; }
case 2: { test[i] = float.NaN; break; }
case 3: { test[i] = float.NegativeInfinity; break; }
case 4: { test[i] = float.PositiveInfinity; break; }
case 5: { test[i] = 0f; break; }
default: { test[i] = rndfloat; break; }
}
}
And stopped the time of the different sorting algorithms:
Stopwatch sw = new Stopwatch();
sw.Start();
float[] sorted1 = test.RadixSort();
sw.Stop();
Console.WriteLine(string.Format("RadixSort: {0}", sw.Elapsed));
sw.Reset();
sw.Start();
float[] sorted2 = test.OrderBy(x => x).ToArray();
sw.Stop();
Console.WriteLine(string.Format("Linq OrderBy: {0}", sw.Elapsed));
sw.Reset();
sw.Start();
Array.Sort(test);
float[] sorted3 = test;
sw.Stop();
Console.WriteLine(string.Format("Array.Sort: {0}", sw.Elapsed));
And the output was (update: now ran with release build, not debug):
RadixSort: 00:00:03.9902332
Linq OrderBy: 00:00:17.4983272
Array.Sort: 00:00:03.1536785
That is roughly four times as fast as Linq, which is not bad. It is still not as fast as Array.Sort, but also not that much worse. But I was really surprised by this one: I expected it to be slightly slower than Linq on very small arrays, so I ran a test with just 20 elements:
RadixSort: 00:00:00.0012944
Linq OrderBy: 00:00:00.0072271
Array.Sort: 00:00:00.0002979
and even at this size my radix sort is quicker than Linq, though way slower than Array.Sort. :)
Update 2:
I made some more measurements and found out some interesting things: longer group-length constants mean fewer iterations and more memory usage. If you use a group length of 16 bits (only 2 iterations), you have a huge memory overhead when sorting small arrays, but you can beat Array.Sort on arrays larger than about 100k elements, even if not by very much. The chart's axes are both logarithmic:
(timing chart omitted; source: daubmeier.de)
There's a nice explanation of how to perform radix sort on floats here:
http://www.codercorner.com/RadixSortRevisited.htm
If all your values are positive, you can get away with using the binary representation; the link explains how to handle negative values.
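The core of that trick is a monotone bit mapping; here is a sketch of the usual flip (an assumption based on the article's approach, not a quote from it): invert all bits of negative floats, and only the sign bit of positive ones, then sort the resulting uints.
// Maps float bits to uints whose unsigned order matches the float order.
static uint FloatToSortableBits(float f)
{
    uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
    // Negative: flip every bit (reverses their order and puts them first).
    // Positive: set the sign bit (places them above all negatives).
    return (bits & 0x80000000u) != 0 ? ~bits : bits | 0x80000000u;
}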
By doing some fancy casting, and by swapping arrays instead of copying, this version is 2x faster for 10M numbers than Philip Daubmeier's original with groupLength set to 8. It is 3x faster than Array.Sort for that array size.
static public void RadixSortFloat(this float[] array, int arrayLen = -1)
{
// Some use cases have an array that is longer than the filled part which we want to sort
if (arrayLen < 0) arrayLen = array.Length;
// Reinterpret the float array as ints without copying
// (MemoryMarshal requires using System.Runtime.InteropServices)
Span<float> asFloat = array;
Span<int> a = MemoryMarshal.Cast<float, int>(asFloat);
// Create a temp array
Span<int> t = new Span<int>(new int[arrayLen]);
// set the group length to 1, 2, 4, 8 or 16 and see which one is quicker
int groupLength = 8;
int bitLength = 32;
// counting and prefix arrays
// (dimension is 2^r, the number of possible values of a r-bit number)
var dim = 1 << groupLength;
int groups = bitLength / groupLength;
if (groups % 2 != 0) throw new Exception("groups must be even so data is in original array at end");
var count = new int[dim];
var pref = new int[dim];
int mask = (dim) - 1;
int negatives = 0, positives = 0;
// counting elements of the 1st group, including negatives/positives
for (int i = 0; i < arrayLen; i++)
{
if (a[i] < 0) negatives++;
count[(a[i] >> 0) & mask]++;
}
positives = arrayLen - negatives;
int c;
int shift;
for (c = 0, shift = 0; c < groups - 1; c++, shift += groupLength)
{
CalcPrefixes();
var nextShift = shift + groupLength;
//
for (var i = 0; i < arrayLen; i++)
{
var ai = a[i];
// Get the right index to sort the number in
int index = pref[( ai >> shift) & mask]++;
count[( ai>> nextShift) & mask]++;
t[index] = ai;
}
// swap the arrays and start again until the last group
var temp = a;
a = t;
t = temp;
}
// Last round
CalcPrefixes();
for (var i = 0; i < arrayLen; i++)
{
var ai = a[i];
// Get the right index to sort the number in
int index = pref[( ai >> shift) & mask]++;
// We're in the last (most significant) group, if the
// number is negative, order them inversely in front
// of the array, pushing positive ones back.
if ( ai < 0) index = positives - (index - negatives) - 1; else index += negatives;
//
t[index] = ai;
}
void CalcPrefixes()
{
pref[0] = 0;
for (int i = 1; i < dim; i++)
{
pref[i] = pref[i - 1] + count[i - 1];
count[i - 1] = 0;
}
}
}
You can use an unsafe block to memcpy or alias a float * to a uint * to extract the bits.
I think your best bet, if the values aren't too close together and there's a reasonable precision requirement, is to just use the actual float digits before and after the decimal point to do the sorting.
For example, you can just use the first 4 decimals (be they 0 or not) to do the sorting, as sketched below.
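One way to read that suggestion (a sketch only; OrderBy is used here just to show the fixed-point keying, and a radix sort could consume the same integer keys; requires using System.Linq):
// Sort by a fixed-point key with 4 decimal places. Only valid while
// the scaled values fit into a long and the inputs are finite floats.
float[] data = { 3.14159f, -2.5f, 0.0001f };
float[] sorted = data.OrderBy(x => (long)Math.Round(x * 10000.0)).ToArray();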
