I was working on LeetCode problem 278, First Bad Version, and used binary search to find the element.
My way of computing the middle index, m = (start + end)/2, caused a problem for large arrays whose size is close to the int maximum. At first I suspected integer overflow, but I am not sure, because it worked with end = MAX_LIMIT and some start < MAX_LIMIT even though the sum goes beyond the int range.
I would like to understand how m = start + (end - start)/2 is better than m = (start + end)/2.
Code 1 works with input :
2147483647
98765432
But Code 1 fails with input:
2147483647
987654321
I think the overflow should happen either in both cases or in neither.
1. Code that works with m = (start + end)/2 for some inputs but fails for a large array
public int FirstBadVersion(int n) {
if(n == 1){
return 1;
}
int s = 1;
int e = n;
int x = 0;
while(s != e){
x = (s+e)/2;
if(IsBadVersion(x)){
e = x;
}
else{
s = x + 1;
}
}
return s;
}
2. Code which worked with m = start + (end - start)/2
public int FirstBadVersion(int n) {
if(n == 1){
return 1;
}
int s = 1;
int e = n;
int x= 0;
while(s != e){
// x = (s+e)/2;
x = s + (e-s)/2;
if(IsBadVersion(x)){
e = x;
}
else{
s = x + 1;
}
}
return e;
}
You have integer overflow when computing
(a + b) / 2
When (a + b) > int.MaxValue, the sum cannot be represented as a 32-bit integer (it requires 33 bits), so bit 33 is dropped and you get an incorrect (often negative) result, which you then divide by 2.
I suggest working with long in order to be safe from integer overflow:
public int FirstBadVersion(int n) {
    // Special case: all versions starting from #1 failed
    if (IsBadVersion(1))
        return 1;
    // Use long in order to avoid integer overflow
    long correct = 1;
    long failed = n;
    // While there is an unknown version between correct and failed
    while (failed - correct > 1) {
        // Safe now: we are adding two long values
        int middle = (int)((correct + failed) / 2);
        if (IsBadVersion(middle))
            failed = middle;
        else
            correct = middle;
    }
    return (int)failed;
}
If using long feels like cheating and you want to stick to int, you can use the formula below for non-negative int a, b (note that we don't add the big values a and b, only their halves):
(a + b) / 2 == a / 2 + b / 2 + (a % 2 + b % 2) / 2
Please note that your formula is not universal:
(a + b) / 2 == a + (b - a) / 2
It works in the particular case of problem #278, where a and b are positive, but fails in the general case, say when a = 1_000_000_000, b = -2_000_000_000, because b - a itself overflows.
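To make the difference concrete, here is a minimal sketch that puts the midpoint formulas from this answer side by side. The class and method names (Midpoint, Naive, Safe, ViaLong) are made up for illustration, and unchecked is spelled out so that the wrap-around shows up regardless of the project's overflow settings.

```csharp
using System;

public static class Midpoint
{
    // Naive midpoint: the intermediate sum s + e can exceed int.MaxValue
    // and wrap around to a negative number before the division happens.
    public static int Naive(int s, int e) => unchecked((s + e) / 2);

    // Safe when 0 <= s <= e (the case in problem #278): e - s always fits in an int.
    public static int Safe(int s, int e) => s + (e - s) / 2;

    // Widening to long keeps the sum exact for any pair of int values.
    public static int ViaLong(int s, int e) => (int)(((long)s + e) / 2);
}
```

With s = 1 and e = int.MaxValue, Naive returns -1073741824, while Safe and ViaLong both return 1073741824.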
I have been thinking about adding binary numbers that are given as strings, so that the sum is also returned as a binary string.
So far I have been using this:
long first = Convert.ToInt64(a, 2);
long second = Convert.ToInt64(b, 2);
long addresult = first + second;
string result = Convert.ToString(addresult, 2);
return result;
Courtesy of Stackoverflow: Binary addition of 2 values represented as strings
But now I want to add numbers that are far beyond the range of the long data type.
For example, binary values whose decimal equivalent needs a BigInteger, i.e., incredibly huge integers such as the ones shown below:
111101101000010111101000101010001010010010010110011010100001000010110110110000110001101 which equals 149014059082024308625334669
1111001101011000001011000111100011101011110100101010010001110101011101010100101000001101000010000110001110100010011101011111111110110101100101110001010101011110001010000010111110011011 which equals 23307765732196437339985049250988196614799400063289798555
At least I think it does.
Courtesy of Stackoverflow:
C# Convert large binary string to decimal system
BigInteger to Hex/Decimal/Octal/Binary strings?
I have used the logic provided in the links above, which works more or less perfectly.
But is there a more compact way to add the two given binary strings?
Kindly let me know, as this question has been rattling around in my mind for some time now.
You can exploit the same scheme you used before but with BigInteger:
using System.Linq;
using System.Numerics;
using System.Text; // needed for StringBuilder
...
BigInteger first = a.Aggregate(BigInteger.Zero, (s, item) => s * 2 + item - '0');
BigInteger second = b.Aggregate(BigInteger.Zero, (s, item) => s * 2 + item - '0');
StringBuilder sb = new StringBuilder();
for (BigInteger addresult = first + second; addresult > 0; addresult /= 2)
sb.Append(addresult % 2);
if (sb.Length <= 0)
sb.Append('0');
string result = string.Concat(sb.ToString().Reverse());
This question was a nostalgic one - thanks. Note that my code is explanatory and inefficient with little to no validation, but it works for your example. You definitely do not want to use anything like this in a real world solution, this is just to illustrate binary addition in principle.
BinaryString#1
111101101000010111101000101010001010010010010110011010100001000010110110110000110001101
decimal:149014059082024308625334669
BinaryString#2
1111001101011000001011000111100011101011110100101010010001110101011101010100101000001101000010000110001110100010011101011111111110110101100101110001010101011110001010000010111110011011
decimal:23307765732196437339985049250988196614799400063289798555
Calculated Sum
1111001101011000001011000111100011101011110100101010010001110101011101010100101000001101000010001101111011100101011010100101010000000111111000100100101001100110100000111001000100101000
decimal:23307765732196437339985049251137210673881424371915133224
Check
23307765732196437339985049251137210673881424371915133224
decimal:23307765732196437339985049251137210673881424371915133224
Here's the code
using System;
using System.Linq;
using System.Numerics;
namespace ConsoleApp3
{
class Program
{
// return 0 for '0' and 1 for '1' (C# chars promotion to ints)
static int CharAsInt(char c) { return c - '0'; }
// and vice-versa
static char IntAsChar(int bit) { return (char)('0' + bit); }
static string BinaryStringAdd(string x, string y)
{
// get rid of spaces
x = x.Trim();
y = y.Trim();
// check if valid binaries
if (x.Any(c => c != '0' && c != '1') || y.Any(c => c != '0' && c != '1'))
throw new ArgumentException("binary representation may contain only '0' and '1'");
// align on right-most bit
if (x.Length < y.Length)
x = x.PadLeft(y.Length, '0');
else
y = y.PadLeft(x.Length, '0');
// NB: the result may require one more bit than the longer of the two input strings (carry at the end), let's not forget this
var result = new char[x.Length];
// add from least significant to most significant (right to left)
var i = result.Length;
var carry = '0';
while (--i >= 0)
{
// to add x[i], y[i] and carry
// - if 2 or 3 bits are set then we carry '1' again (otherwise '0')
// - if the number of set bits is odd the sum bit is '1' otherwise '0'
var countSetBits = CharAsInt(x[i]) + CharAsInt(y[i]) + CharAsInt(carry);
carry = countSetBits > 1 ? '1' : '0';
result[i] = countSetBits == 1 || countSetBits == 3 ? '1' : '0';
}
// now to make this char[] a string
var ret = new string(result);
// remember that final carry?
return carry == '1' ? carry + ret : ret;
}
static BigInteger BigIntegerFromBinaryString(string s)
{
var biRet = new BigInteger(0);
foreach (var t in s)
{
biRet = biRet << 1;
if (t == '1')
biRet += 1;
}
return biRet;
}
static void Main(string[] args)
{
var s1 = "111101101000010111101000101010001010010010010110011010100001000010110110110000110001101";
var s2 = "1111001101011000001011000111100011101011110100101010010001110101011101010100101000001101000010000110001110100010011101011111111110110101100101110001010101011110001010000010111110011011";
var sum = BinaryStringAdd(s1, s2);
var bi1 = BigIntegerFromBinaryString(s1);
var bi2 = BigIntegerFromBinaryString(s2);
var bi3 = bi1 + bi2;
Console.WriteLine($"BinaryString#1\n {s1}\n decimal:{bi1}");
Console.WriteLine($"BinaryString#2\n {s2}\n decimal:{bi2}");
Console.WriteLine($"Calculated Sum\n {sum}\n decimal:{BigIntegerFromBinaryString(sum)}");
Console.WriteLine($"Check\n {bi3}\n decimal:{bi3}");
Console.ReadKey();
}
}
}
I'll add an alternative solution alongside AlanK's just as an example of how you might go about this without converting the numbers to some form of integer before adding them.
static string BinaryStringAdd(string b1, string b2)
{
char[] c = new char[Math.Max(b1.Length, b2.Length) + 1];
int carry = 0;
for (int i = 1; i <= c.Length; i++)
{
int d1 = i <= b1.Length ? b1[^i] : 48;
int d2 = i <= b2.Length ? b2[^i] : 48;
int sum = carry + (d1-48) + (d2-48);
if (sum == 3)
{
sum = 1;
carry = 1;
}
else if (sum == 2)
{
sum = 0;
carry = 1;
}
else
{
carry = 0;
}
c[^i] = (char) (sum+48);
}
return c[0] == '0' ? String.Join("", c)[1..] : String.Join("", c);
}
Note that this solution is ~10% slower than Alan's solution (at least for this test case), and assumes the strings arrive to the method formatted correctly.
I am writing an algorithm that chooses a subset (of k elements) from a set (of n elements).
I have accomplished the task, and it works fine for small numbers.
I have tested it for n=6, k=3 and n=10, k=5.
The problem starts when I need to use it for huge numbers.
Sometimes I would need to use it for, let's say, n = 96000000 and k = 3000.
To simplify testing a bit, let's focus on the example n = 786432 and k = 1000. Then there are 461946653375201 such possibilities. The third parameter of my function is the rank, i.e. the number identifying one particular unique subset. Let's try a few random ones: 3264832 works fine (it gave me a subset of different numbers), but for 4619466533201 one number in the subset is repeated several times, which is wrong. The result must also be a subset of unique numbers!
The question is how to make it work correctly, and what is the problem? Are the numbers too big even for ulong?
If you have any questions, feel free to ask.
Here is my code:
public static ulong BinomialCoefficient(ulong N, ulong K)
{
ulong result = 1;
for (ulong i = 1; i <= K; i++)
{
result *= N - (K - i);
result /= i;
}
return result;
}
public static ulong[] ChooseSubsetByRank(ulong sizeOfSet, ulong sizeOfSubset, ulong rank)
{
ulong[] resultingSubset = new ulong[sizeOfSubset];
ulong x = sizeOfSet;
for (ulong i = sizeOfSubset; i > 0; i--)
{
while (BinomialCoefficient(x, i) > rank)
x--;
resultingSubset[i - 1] = x + 1;
rank = BinomialCoefficient(x + 1, i) - rank - 1;
}
return resultingSubset;
}
And below is the run code. To test it you may change the third argument at the line below.
ulong[] arrayTest = Logic.ChooseSubsetByRank(786432, 1000, 4619466533201);
string test = "";
for (int i = 0; i < arrayTest.Length; i++)
test = test + arrayTest[i].ToString() + " ";
System.Windows.MessageBox.Show(" " + test);
No hope. You cannot.
As spender says: use BigInteger.
Your calculation is wrong (probably because you calculate with ulong, which is far too limited for this).
C(786432, 1000) is in reality:
60335739263255945515318688735702150537088237708892271361411802065747888910755857157266975769998669300832120179937604834856448557303232145077861272831185157586672193350617695735729694922634116364725590591143726910437872258744592766163608232931085009291828308068316240980809821656371861756358808110263885649122247471482014202037962939411180067535158610223966657060950362528934202403341104871194136342945550651663982197676885785567919186978153411651002136627159430437374120385353588189429604356347215648984257524798744944459899532677684769952893759426202190895034018327978197588091243296577246915732540798102579908560683635925495601119143268208022233439808433571747276432997894389613414038669420051598195878129372651199743343510315051507755473112578350391612585548496098656615748167715111611680337687824193692418583233363415309820420939994104024170648387186860643129658368622495987701429186597081064829352665740679854123216802927508170191044796507361415023326067243024004124613733118815840209632972794378358196663554908049701159834366456284606886794168266806213781328348574528162329821482385328376003983787105147582765294106003242717970905028184448254277535132559848285154724627067149006971942611058817681241693380726079426752198996302468222989501173235443990234536035285178293907719151030361739617559551594228064830763707620685389028035522447949863627287945733060256838660384707937035139356539877447022771370208428621165443004816885196257081158432992757187475969618994919104808971489554069629852693413416304609102875169845346324129407516295130181449479789529329442515854627540043929532722688192177515735759253193321904357440627639900898857321576843424508731803077355490839846475822106981218845137857625788270790774993212246282313530834510551844831827777990316328578108082692861126794573845884319864598633944405784007650945570596286272078875101984275179802066617940558121982633916035520228831180474159722542115921437061278159854866926008706079766235619984343730912442953567847089972356254227774152093040
56464924341151878262503587256198384142718049855042621519149038523177569828231641690393173865902883254477356340730939905543154540746759842093744184723706019384873683467974667731206411977863548104488741332797192887789005759777716153901423692511142309333333044144404295842596379993363263619514077277847401673508888691303190564956937240904605718333403477875735125913053605250218671009674129773564325959311930556006735185907557691220793718745513911096043358579428288852312401862707347174079157233572972231584221683511928548130771207729971476262436947167805862489722247791944393249804177227081889352572247647101767728277149206844417712380170809760442471306983505977784517425621794122861839031329562074224252476256692950187473655698688314932344304325068076491419731413851641058957149245827761363536463550636030779009703117216843500031930755136735771022162481784531500378393390581558695370099627488825651248884473844195719258621451229987520317542943566297340698028466818937335976792343382788134518740623993664131802576690485505420542865842569675333314900726976825951448445467650748963731221593412649796639395685018463040431779020656159571608044184646177251839940386267422657877801967082672251079906237183824765375906939480520508656199566649638083679757430680818796170362008227564859519761936618260089868694546582873807181452115865272320
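As a rough sketch of what "use BigInteger" could look like, here is the asker's multiplicative loop carried over to System.Numerics.BigInteger. The class name Combinatorics and the symmetry shortcut (replacing k by n - k when that is smaller) are my own additions, not part of the original code. Each intermediate value is itself a binomial coefficient, so the division inside the loop is always exact.

```csharp
using System;
using System.Numerics;

public static class Combinatorics
{
    // n choose k with arbitrary precision: no overflow, at the cost of speed.
    public static BigInteger BinomialCoefficient(BigInteger n, BigInteger k)
    {
        if (k < 0 || k > n) return BigInteger.Zero;
        if (k > n - k) k = n - k; // C(n, k) == C(n, n - k), so take the shorter loop

        BigInteger result = BigInteger.One;
        for (BigInteger i = 1; i <= k; i++)
        {
            // After this step result == C(n - k + i, i), which is an integer.
            result = result * (n - (k - i)) / i;
        }
        return result;
    }
}
```

ChooseSubsetByRank would need its rank parameter and intermediate values switched to BigInteger as well; note that calling this inside the inner while loop for n = 786432 and k = 1000 will be slow.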
I'm trying to calculate the cumulative binomial probability of 'n' trials, with 'p' probability and 'r' as the successful outcome of each trial. I have written the following code that works sometimes, but not always:
Console.WriteLine ();
Console.WriteLine ("B~(n, p)");
incorrectN:
Console.WriteLine ("Enter value of 'n': ");
int n = Convert.ToInt32 (Console.ReadLine ());
if (n < 0) {
Console.WriteLine ("ERROR: 'n' must be greater than 0");
goto incorrectN;
}
incorrectP:
Console.WriteLine ();
Console.WriteLine ("Enter value of 'p': ");
double p = Convert.ToDouble (Console.ReadLine ());
if (p > 1) {
Console.WriteLine ();
Console.WriteLine ("ERROR: 'p' must be between 0 and 1");
goto incorrectP;
}
Console.WriteLine ();
incorrectS:
int r = GetR();
int k = r;
double binomTotal = 0;
for (int j = r + 1; j > 0; j--) {
int nCr = Factorial(n) / (Factorial(n - (r - k)) * Factorial(r - k));
binomTotal = binomTotal + nCr * Math.Pow(p, (r - k)) * Math.Pow(1 - p, (n - (r - k)));
k--;
}
Console.WriteLine();
Console.WriteLine(binomTotal);
P.S. I have written the GetR() and Factorial() functions elsewhere within the class, where GetR() asks the user for the value of 'r' and Factorial() is defined as follows:
public static int Factorial(int x)
{
return x <= 1 ? 1 : x * Factorial(x - 1);
}
I tested the code with values n = 10, p = 0.5 and r = 5 and the output is 0.623046875, which is correct. However, when I use n = 13, p = 0.35 and r = 7, I get 0.297403640622647 instead of 0.9538.
Any help would be much appreciated.
In addition to your own answer:
public static double Factorial(double x)
{
return x <= 1 ? 1 : x * Factorial(x - 1);
}
accepts a double parameter, which means that x is not restricted to be an integer.
So you could call your Factorial method like this.
var fac1 = Factorial(1.4);
var fac2 = Factorial(2.7);
However, this does not make sense since the factorial is defined only* for non-negative integers, meaning that the factorial of a value such as 1.4 is undefined.
So, instead of using double and allowing for invalid inputs, you should be using long instead, which has a greater range than int.
public static long Factorial(long x)
{
return x <= 1 ? 1 : x * Factorial(x - 1);
}
* there are some cases where factorials can be used with real values as well - e.g. by using the gamma function - but I don't think they're relevant to your use case and therefore you should not allow invalid parameters.
Change:
public static int Factorial(int x)
{
return x <= 1 ? 1 : x * Factorial(x - 1);
}
To:
public static double Factorial(double x)
{
return x <= 1 ? 1 : x * Factorial(x - 1);
}
Because Factorial(13) is too large for Int32.
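For what it's worth, the factorials can be avoided altogether. The sketch below is not the code from the question: it swaps in the multiplicative formula for nCr, so no factorial is ever formed, and the names BinomialDistribution, Choose and CumulativeProbability are hypothetical.

```csharp
using System;

public static class BinomialDistribution
{
    // n choose k, built up step by step in double instead of dividing factorials.
    public static double Choose(int n, int k)
    {
        if (k < 0 || k > n) return 0.0;
        double result = 1.0;
        for (int i = 1; i <= k; i++)
        {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // P(X <= r) for X ~ B(n, p): sum the probabilities of 0..r successes.
    public static double CumulativeProbability(int n, double p, int r)
    {
        double total = 0.0;
        for (int k = 0; k <= r; k++)
        {
            total += Choose(n, k) * Math.Pow(p, k) * Math.Pow(1 - p, n - k);
        }
        return total;
    }
}
```

CumulativeProbability(10, 0.5, 5) gives 0.623046875, and CumulativeProbability(13, 0.35, 7) comes out at roughly 0.9537, which matches the expected 0.9538 up to rounding.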
I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify in the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would the percentage be? I always take the length of the real word, so it becomes 2 / 8 = 25%, which means the DSM of these 2 given strings is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple of key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum.
Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've seen elsewhere.
Make sure your arrays slice across the shorter word width.
Code (it works exactly the same if you replace int[] with String in the parameter declarations):
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>int.MaxValue if threshold exceeded; otherwise the Damerau-Levenshtein distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
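To get from a distance to the percentage asked about in the question, something along these lines might do. It is only a sketch: the StringSimilarity class, the two-row Levenshtein variant and the Similarity helper are my own illustration, and the score is clamped at 0 because the distance can exceed the length of the real word.

```csharp
using System;

public static class StringSimilarity
{
    // Plain Levenshtein distance, keeping only two rows of the matrix in memory.
    public static int LevenshteinDistance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row of the next pass.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // 1 - distance / length of the real word, clamped to the range [0, 1].
    public static double Similarity(string realWord, string mistakenWord)
    {
        if (realWord.Length == 0) return mistakenWord.Length == 0 ? 1.0 : 0.0;
        int distance = LevenshteinDistance(realWord, mistakenWord);
        return Math.Max(0.0, 1.0 - (double)distance / realWord.Length);
    }
}
```

For the example from the question, Similarity("hospital", "haspita") is 0.75, i.e. 2 edits against a real word of length 8.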
There are a large number of string similarity and distance algorithms that can be used. Some are listed here (the list is not exhaustive):
Levenshtein
Needleman-Wunsch
Smith-Waterman
Smith-Waterman-Gotoh
Jaro, Jaro-Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge-Elkan
A library that contains implementations of all of these is called SimMetrics, which has both Java and C# implementations.
I have found that Levenshtein and Jaro Winkler are great for small differences between strings such as:
Spelling mistakes; or
ö instead of o in a person's name.
However, when comparing something like article titles, where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic:
Compare these 2 titles (they are the same but worded differently by different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site that provides algorithm comparison of the strings shows:
Levenshtein: 81
Smith-Waterman Gotoh 94
Jaro Winkler 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Levenshtein: 54
Smith-Waterman Gotoh 49
Jaro Winkler 89
As Anastasiosyal pointed out, SimMetrics has the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
Here is my implementation of Damerau-Levenshtein distance, which returns not only the similarity coefficient but also the error locations in the corrected word (this feature can be used in text editors). My implementation also supports different weights for the error types (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
I wrote some tests and it seems to me that the method is working.
But comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code that short-circuits the calculation if the distance is greater than some threshold, but you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 out of 8 distinct trigrams (or 4 out of 10 in total) match. If this works for you, you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
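A minimal sketch of the trigram comparison, assuming case-sensitive character trigrams and the Jaccard ratio (shared trigrams divided by distinct trigrams) as the overlap measure. The Trigrams class and its method names are hypothetical, and no index is built here; it only scores one pair of strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Trigrams
{
    // All distinct substrings of length n (n = 3 gives trigrams).
    public static HashSet<string> Of(string text, int n = 3)
    {
        var grams = new HashSet<string>();
        for (int i = 0; i + n <= text.Length; i++)
        {
            grams.Add(text.Substring(i, n));
        }
        return grams;
    }

    // Shared trigrams divided by the number of distinct trigrams in either string.
    public static double Jaccard(string a, string b)
    {
        var gramsA = Of(a);
        var gramsB = Of(b);
        if (gramsA.Count == 0 && gramsB.Count == 0) return 1.0;

        int shared = gramsA.Count(g => gramsB.Contains(g));
        int union = gramsA.Count + gramsB.Count - shared;
        return (double)shared / union;
    }
}
```

Trigrams.Jaccard("English", "England") is 0.25: the two words share Eng and ngl out of 8 distinct trigrams.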
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function
I need to create a function that will generate 2 random numbers between x and y (e.g. x = 1, y = 20) which when added will not involve regrouping / carryover or which when subtracted will not involve borrowing.
For example,
18 + 1 = good
14 + 5 = good
18-7 = good
29 - 8 = good
15 + 6 = bad
6 + 7 = bad
21 - 3 = bad
36 - 8 = bad etc.
I want to create a simple worksheet generator that will generate sample problems using the requirements above.
I guess I could always convert the numbers to strings, get the rightmost digit of each of the 2 numbers, convert them back to integers, and test if one is greater than the other, then repeat for all the digits. The only thing is, that is so damn ugly (read: inefficient). I am sure that there is a better way. Does anyone have any suggestions? Thanks
Generate them one digit at a time, e.g.
a1 = rand(9)
a2 = rand(9 - a1)
b1 = rand(9)
b2 = rand(9 - b1)
x = b1*10 + a1
y = b2*10 + a2
From the construction you know that x+y will not involve any carry, because a1+a2 <= 9 and b1 + b2 <= 9.
You can do something similar for subtraction.
If you want to restrict the overall range to be [1..20] instead of [1..99], just adjust the range for the leftmost digit:
b1 = rand(1)
b2 = rand(1 - b1)
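In C# the digit-by-digit idea could look roughly like the sketch below. It assumes rand(k) in the pseudocode means a uniform integer from 0 to k inclusive and covers two-digit numbers in [0, 99]; NoCarryGenerator and its method names are made up for the example.

```csharp
using System;

public static class NoCarryGenerator
{
    // Returns (x, y) such that x + y needs no carry in any digit.
    public static (int X, int Y) NextAdditionPair(Random rnd)
    {
        int a1 = rnd.Next(10);       // ones digit of x: 0..9
        int a2 = rnd.Next(10 - a1);  // ones digit of y: 0..(9 - a1)
        int b1 = rnd.Next(10);       // tens digit of x: 0..9
        int b2 = rnd.Next(10 - b1);  // tens digit of y: 0..(9 - b1)
        return (b1 * 10 + a1, b2 * 10 + a2);
    }

    // Returns (x, y) such that x - y needs no borrow in any digit.
    public static (int X, int Y) NextSubtractionPair(Random rnd)
    {
        int a1 = rnd.Next(10);       // ones digit of x: 0..9
        int a2 = rnd.Next(a1 + 1);   // ones digit of y: 0..a1
        int b1 = rnd.Next(10);       // tens digit of x: 0..9
        int b2 = rnd.Next(b1 + 1);   // tens digit of y: 0..b1
        return (b1 * 10 + a1, b2 * 10 + a2);
    }
}
```

Keep in mind that Random.Next(n) excludes n itself, which is why the upper bounds are written as 10 - a1 and a1 + 1. Zero can come out for either number, so a worksheet generator may want to retry when it does.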
using System;
class Sample {
static void Main() {
var rnd = new Random();
var x = 1;
var y = 20;
var a = rnd.Next(x, y);
var b = rnd.Next(x, y);
var op = '+';
Console.WriteLine("{0} {2} {1} = {3}", a, b, op , isValid(a, b, op)? "good":"bad");
op = '-';
Console.WriteLine("{0} {2} {1} = {3}", a, b, op , isValid(a, b, op)? "good":"bad");
}
static bool isValid(int x, int y, char op){
int a = x % 10;
int b = y % 10;
switch (op){
case '+':
return a + b < 10;
case '-':
return x >= y && a - b >= 0;
default:
throw new Exception(String.Format("unknown operator '{0}'", op));
}
}
}
Breaking up the numbers into digits is indeed exactly what you need to do. It does not matter whether you do that by arithmetic manipulation (division and modulus by 10) or by converting the numbers into strings, but fundamentally your question is precisely about the individual digits of the numbers.
For the subtraction x − y, no borrows are required if and only if none of the digits in y are greater than the corresponding digit in x.
For the addition x + y, there will be no carries if and only if the sum of each pair of corresponding digits is less than 10.
Here's some pseudo-C# for checking these conditions:
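static bool CanSubtractWithoutBorrow(uint x, uint y) {
    // A borrow is needed as soon as a digit of y exceeds the matching digit of x
    while (y > 0) {
        if ((x % 10) < (y % 10)) return false;
        x /= 10; y /= 10;
    }
    return true;
}
static bool CanAddWithoutCarry(uint x, uint y) {
    // A carry is needed as soon as a pair of matching digits sums to 10 or more
    while (x > 0 && y > 0) {
        if ((x % 10) + (y % 10) >= 10) return false;
        x /= 10; y /= 10;
    }
    return true;
}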
You need to look at each pair of digits in turn, and see if adding or subtracting them involves carries.
You can get the rightmost digit by taking the value modulo 10, x%10, and you can erase the rightmost digit by dividing by 10.
No string conversions are necessary.