Given two arrays of integers with equal lengths, return the maximum value of:
|arr1[i] - arr1[j]| + |arr2[i] - arr2[j]| + |i - j|
where the maximum is taken over all 0 <= i, j < arr1.length.
Example 1:
Input: arr1 = [1,2,3,4], arr2 = [-1,4,5,6]
Output: 13
Example 2:
Input: arr1 = [1,-2,-5,0,10], arr2 = [0,-2,-1,-7,-4]
Output: 20
Constraints:
2 <= arr1.length == arr2.length <= 40000
-10^6 <= arr1[i], arr2[i] <= 10^6
public class Solution {
public int MaxAbsValExpr(int[] arr1, int[] arr2) {
int max = 0;
for (int i =0; i < arr1.Length; i++){
for (int k =i+1; k < arr1.Length; k++){
if (Math.Abs(arr1[i] - arr1[k]) + Math.Abs(arr2[i] - arr2[k]) + Math.Abs(i - k) > max){
max = Math.Abs(arr1[i] - arr1[k]) + Math.Abs(arr2[i] - arr2[k]) + Math.Abs(i - k);
}
}
}
return max;
}}
This passes most test cases but gets a "Time Limit Exceeded" when the test case has hundreds of values in the arrays in the six digits. I've tried looking online to see how to shorten it in c# but I don't know what else I could do to make it faster.
You encountered "Time Limit Exceeded", it is because your algorithm takes too long to execute and the computer timeout. As commented by #Flydog57, nested loops can be very very expensive and should be avoided. He also suggested a way to implement the algorithm without nested loops. Here's the implementation in PHP (I do not know C#):
/**
* Find the indexes of the max and min values in $arr.
*
* #param array $arr
* #return array
*/
protected function minMaxIndexes(Array $arr)
{
// Initialize variables indexes and values by the 1st element of $arr.
[$i0, $v0, $i1, $v1] = [0, $arr[0], 0, $arr[0]];
for ($i = 1, $size = count($arr); $i < $size; ++$i) {
if ($v0 > $arr[$i]) {
$v0 = $arr[$i];
$i0 = $i;
} elseif ($v1 < $arr[$i]) {
$v1 = $arr[$i];
$i1 = $i;
}
}
return [$i0, $i1];
}
public function maxAbsValExpr(Array $arr1, Array $arr2)
{
[$i, $j] = $this->minMaxIndexes($arr1);
$max1 = abs($arr1[$i] - $arr1[$j]) + abs($arr2[$i] - $arr2[$j]) + abs($i - $j);
[$i, $j] = $this->minMaxIndexes($arr2);
$max2 = abs($arr1[$i] - $arr1[$j]) + abs($arr2[$i] - $arr2[$j]) + abs($i - $j);
return max($max1, $max2);
}
As you can see, the for loop is 2 times of N-1, if my calculation is correct, it is N/2 times faster than nested loop of N(N-1). That is, for 1 mil array length, it is 500k times faster.
Related
Problem statement:
Given an array of non-negative integers, count the number of unordered pairs of array elements, such that their bitwise AND is a power of 2.
Example:
arr = [10, 7, 2, 8, 3]
Answer: 6 (10&7, 10&2, 10&8, 10&3, 7&2, 2&3)
Constraints:
1 <= arr.Count <= 2*10^5
0 <= arr[i] <= 2^12
Here's my brute-force solution that I've come up with:
private static Dictionary<int, bool> _dictionary = new Dictionary<int, bool>();
public static long CountPairs(List<int> arr)
{
long result = 0;
for (var i = 0; i < arr.Count - 1; ++i)
{
for (var j = i + 1; j < arr.Count; ++j)
{
if (IsPowerOfTwo(arr[i] & arr[j])) ++result;
}
}
return result;
}
public static bool IsPowerOfTwo(int number)
{
if (_dictionary.TryGetValue(number, out bool value)) return value;
var result = (number != 0) && ((number & (number - 1)) == 0);
_dictionary[number] = result;
return result;
}
For small inputs this works fine, but for big inputs this works slow.
My question is: what is the optimal (or at least more optimal) solution for the problem? Please provide a graceful solution in C#. 😊
One way to accelerate your approach is to compute the histogram of your data values before counting.
This will reduce the number of computations for long arrays because there are fewer options for value (4096) than the length of your array (200000).
Be careful when counting bins that are powers of 2 to make sure you do not overcount the number of pairs by including cases when you are comparing a number with itself.
We can adapt the bit-subset dynamic programming idea to have a solution with O(2^N * N^2 + n * N) complexity, where N is the number of bits in the range, and n is the number of elements in the list. (So if the integers were restricted to [1, 4096] or 2^12, with n at 100,000, we would have on the order of 2^12 * 12^2 + 100000*12 = 1,789,824 iterations.)
The idea is that we want to count instances for which we have overlapping bit subsets, with the twist of adding a fixed set bit. Given Ai -- for simplicity, take 6 = b110 -- if we were to find all partners that AND to zero, we'd take Ai's negation,
110 -> ~110 -> 001
Now we can build a dynamic program that takes a diminishing mask, starting with the full number and diminishing the mask towards the left
001
^^^
001
^^
001
^
Each set bit on the negation of Ai represents a zero, which can be ANDed with either 1 or 0 to the same effect. Each unset bit on the negation of Ai represents a set bit in Ai, which we'd like to pair only with zeros, except for a single set bit.
We construct this set bit by examining each possibility separately. So where to count pairs that would AND with Ai to zero, we'd do something like
001 ->
001
000
we now want to enumerate
011 ->
011
010
101 ->
101
100
fixing a single bit each time.
We can achieve this by adding a dimension to the inner iteration. When the mask does have a set bit at the end, we "fix" the relevant bit by counting only the result for the previous DP cell that would have the bit set, and not the usual union of subsets that could either have that bit set or not.
Here is some JavaScript code (sorry, I do not know C#) to demonstrate with testing at the end comparing to the brute-force solution.
var debug = 0;
function bruteForce(a){
let answer = 0;
for (let i = 0; i < a.length; i++) {
for (let j = i + 1; j < a.length; j++) {
let and = a[i] & a[j];
if ((and & (and - 1)) == 0 && and != 0){
answer++;
if (debug)
console.log(a[i], a[j], a[i].toString(2), a[j].toString(2))
}
}
}
return answer;
}
function f(A, N){
const n = A.length;
const hash = {};
const dp = new Array(1 << N);
for (let i=0; i<1<<N; i++){
dp[i] = new Array(N + 1);
for (let j=0; j<N+1; j++)
dp[i][j] = new Array(N + 1).fill(0);
}
for (let i=0; i<n; i++){
if (hash.hasOwnProperty(A[i]))
hash[A[i]] = hash[A[i]] + 1;
else
hash[A[i]] = 1;
}
for (let mask=0; mask<1<<N; mask++){
// j is an index where we fix a 1
for (let j=0; j<=N; j++){
if (mask & 1){
if (j == 0)
dp[mask][j][0] = hash[mask] || 0;
else
dp[mask][j][0] = (hash[mask] || 0) + (hash[mask ^ 1] || 0);
} else {
dp[mask][j][0] = hash[mask] || 0;
}
for (let i=1; i<=N; i++){
if (mask & (1 << i)){
if (j == i)
dp[mask][j][i] = dp[mask][j][i-1];
else
dp[mask][j][i] = dp[mask][j][i-1] + dp[mask ^ (1 << i)][j][i - 1];
} else {
dp[mask][j][i] = dp[mask][j][i-1];
}
}
}
}
let answer = 0;
for (let i=0; i<n; i++){
for (let j=0; j<N; j++)
if (A[i] & (1 << j))
answer += dp[((1 << N) - 1) ^ A[i] | (1 << j)][j][N];
}
for (let i=0; i<N + 1; i++)
if (hash[1 << i])
answer = answer - hash[1 << i];
return answer / 2;
}
var As = [
[10, 7, 2, 8, 3] // 6
];
for (let A of As){
console.log(JSON.stringify(A));
console.log(`DP, brute force: ${ f(A, 4) }, ${ bruteForce(A) }`);
console.log('');
}
var numTests = 1000;
for (let i=0; i<numTests; i++){
const N = 6;
const A = [];
const n = 10;
for (let j=0; j<n; j++){
const num = Math.floor(Math.random() * (1 << N));
A.push(num);
}
const fA = f(A, N);
const brute = bruteForce(A);
if (fA != brute){
console.log('Mismatch:');
console.log(A);
console.log(fA, brute);
console.log('');
}
}
console.log("Done testing.");
int[] numbers = new[] { 10, 7, 2, 8, 3 };
static bool IsPowerOfTwo(int n) => (n != 0) && ((n & (n - 1)) == 0);
long result = numbers.AsParallel()
.Select((a, i) => numbers
.Skip(i + 1)
.Select(b => a & b)
.Count(IsPowerOfTwo))
.Sum();
If I understand the problem correctly, this should work and should be faster.
First, for each number in the array we grab all elements in the array after it to get a collection of numbers to pair with.
Then we transform each pair number with a bitwise AND, then counting the number that satisfy our 'IsPowerOfTwo;' predicate (implementation here).
Finally we simply get the sum of all the counts - our output from this case is 6.
I think this should be more performant than your dictionary based solution - it avoids having to perform a lookup each time you wish to check power of 2.
I think also given the numerical constraints of your inputs it is fine to use int data types.
I got asked a question and now I am kicking myself for not being able to come up with the exact/correct result.
Imagine we have a function that splits a string into multiple lines but each line has to have x number of characters before we "split" to the new line:
private string[] GetPagedMessages(string input, int maxCharsPerLine) { ... }
For each line, we need to incorporate, at the end of the line "x/y" which is basically 1/4, 2/4 etc...
Now, the paging mechanism must also be part of the length restriction per line.
I have been overworked and overthinking and tripping up on things and this seems pretty straight forward but for the life of me, I cannot figure it out! What am I not "getting"?
What am I interested in? The calculation and some part of the logic but mainly the calculation of how many lines are required to split the input based on the max chars per line which also needs to include the x/y.
Remember: we can have more than a single digit for the x/y (i.e: not just 1/4 but also 10/17 or 99/200)
Samples:
input = "This is a long message"
maxCharsPerLine = 10
output:
This i 1/4 // << Max 10 chars
s a lo 2/4 // << Max 10 chars
ng mes 3/4 // << Max 10 chars
sage 4/4 // << Max 10 chars
Overall the logic is simple but its just the calculation that is throwing me off.
The idea: First, find how many digits is the number of lines:
(n = input.Length, maxCharsPerLine = 10)
if n <= 9*(10-4) ==> 1 digit
if n <= 9*(10-5) + 90*(10-6) ==> 2 digits
if n <= 9*(10-6) + 90*(10-7) + 900*(10-8) ==> 3 digits
if n <= 9*(10-7) + 90*(10-8) + 900*(10-9) + 9000*(10-10) ==> No solution
Then, subtract the spare number of lines. The solution:
private static int GetNumberOfLines(string input, int maxCharsPerLine)
{
int n = input.Length;
int x = maxCharsPerLine;
for (int i = 4; i < x; i++)
{
int j, sum = 0, d = 9, numberOfLines = 0;
for (j = i; j <= i + i - 4; j++)
{
if (x - j <= 0)
return -1; // No solution
sum += d * (x - j);
numberOfLines += d;
d *= 10;
}
if (n <= sum)
return numberOfLines - (sum - n) / (x - j + 1);
}
return -2; // Invalid
}
Usage:
private static string[] GetPagedMessages(string input, int maxCharsPerLine)
{
int numberOfLines = GetNumberOfLines(input, maxCharsPerLine);
if (numberOfLines < 0)
return null;
string[] result = new string[numberOfLines];
int spaceLeftForLine = maxCharsPerLine - numberOfLines.ToString().Length - 2; // Remove the chars of " x/y" except the incremental 'x'
int inputPosition = 0;
for (int line = 1; line < numberOfLines; line++)
{
int charsInLine = spaceLeftForLine - line.ToString().Length;
result[line - 1] = input.Substring(inputPosition, charsInLine) + $" {line}/{numberOfLines}";
inputPosition += charsInLine;
}
result[numberOfLines-1] = input.Substring(inputPosition) + $" {numberOfLines}/{numberOfLines}";
return result;
}
A naive approach is to start counting the line lengths minus the "pager"'s size, until the line count changes in size ("1/9" is shorter than "1/10", which is shorter than "11/20", and so on):
private static int[] GetLineLengths(string input, int maxCharsPerLine)
{
/* The "pager" (x/y) is at least 4 characters (including the preceding space) and at most ... 8?
* 7/9 (4)
* 1/10 (5)
* 42/69 (6)
* 3/123 (6)
* 42/420 (7)
* 999/999 (8)
*/
int charsRemaining = input.Length;
var lineLengths = new List<int>();
// Start with " 1/2", (1 + 1 + 2) = 4 length
var highestLineNumberLength = 1;
var lineNumber = 0;
do
{
lineNumber++;
var currentLineNumberLength = lineNumber.ToString().Length; // 1 = 1, 99 = 2, ...
if (currentLineNumberLength > highestLineNumberLength)
{
// Pager size changed, reset
highestLineNumberLength = currentLineNumberLength;
lineLengths.Clear();
lineNumber = 0;
charsRemaining = input.Length;
continue;
}
var pagerSize = currentLineNumberLength + highestLineNumberLength + 2;
var lineLength = maxCharsPerLine - pagerSize;
if (lineLength <= 0)
{
throw new ArgumentException($"Can't split input of size {input.Length} into chunks of size {maxCharsPerLine}");
}
lineLengths.Add(lineLength);
charsRemaining -= lineLength;
}
while (charsRemaining > 0);
return lineLengths.ToArray();
}
Usage:
private static string[] GetPagedMessages(string input, int maxCharsPerLine)
{
if (input.Length <= maxCharsPerLine)
{
// Assumption: no pager required for a message that takes one line
return new[] { input };
}
var lineLengths = GetLineLengths(input, maxCharsPerLine);
var result = new string[lineLengths.Length];
// Cut the input and append the pager
var previousIndex = 0;
for (var i = 0; i < lineLengths.Length; i++)
{
var lineLength = Math.Min(lineLengths[i], input.Length - previousIndex); // To cater for final line being shorter
result[i] = input.Substring(previousIndex, lineLength) + " " + (i + 1) + "/" + lineLengths.Length;
previousIndex += lineLength;
}
return result;
}
Prints, for example:
This 1/20
is a 2/20
long 3/20
strin 4/20
g tha 5/20
t wil 6/20
l spa 7/20
n mor 8/20
e tha 9/20
n te 10/20
n li 11/20
nes 12/20
beca 13/20
use 14/20
of i 15/20
ts e 16/20
norm 17/20
ous 18/20
leng 19/20
th 20/20
I am a beginner to programming and I am attempting to write a program that should solve the following mathematical problem:
"On a circle we have 108 numbers. The sum of any 20 consecutive numbers is always 1000. At position "1" we have the number 1, at position "19" we have the number 19 and at position "50" we have number 50. What number do we have on position "100"?" The numbers are always integers.
Up to this point i know how to use array and looping structures like "for".
I attempted to set a variable "check" which I initialized it as "false" in order to change it later to true, when I will find the number on position "100".
My initial thought is to go through the array and make the sum of each 20 consecutive numbers. I used two variables in order to assign values for each position that is not 1, 19 or 50 and then make the sum. I did this in two "for" structures, one inside another. Of course, this creates an issue that i am not able to properly calculate the values of the first 20 number.
I checked on the MSDN some information about the "Queue" structure, yet I can't understand how to use it in my particular case.
Let me know please if anyone has some suggestions on how i should handle this.
Thank you.
As you asked, here is the code that I have managed to write so far. Please take into consideration that I am really in the begging of programming and I still have a lot to learn.
check = true;
int[] array = new int[108];
for (counter = 1; counter <= array.Length; counter++)
{
if (counter !=1 || counter !=19 || counter !=50)
{
for (i = 1; i <= 999; i++)
{
array[counter] = i;
if (counter >= 20)
{
for (consecutive = counter; consecutive => 2; consecutive--)
{
checkValue = array[consecutive] + array[consecutive]-1;
}
if (checkvalue != 1000) check = false;
}
}
}
}
As say #fryday the code of this problem doesn't mean anything without logic, so you should understand well the solution before read this.
Let me explain to you the logic of this problem and then read the code (you will see that is quite simple). First you may note that each time you get 20 consecutive numbers, the next number must be equal to the first in the 20 consecutive, because you have
x_1 + x_2 + ... + x_20 = 1000
and
x_2 + ... + x_20 + x_21 = 1000
so
x_1 + x_2 + ... + x_20 = x_2 + ... + x_20 + x_21
and finally
x_1 = x_21
Following this insight all we have to do is walk through the array starting at each one of this positions 1, 19 and 50, fill the positions in the array with this values and then calculate the next position adding 20 to the current and mod by 108, repeat the process until the value in the array is equal to 1, 19 or 50 in each case. Then the values in the array equals to 0 means that all must have the same value (and has not found yet). To found this value sum 20 values consecutive and divide the difference with 1000 by the count of elements with value equal to 0 in such 20 consecutive. You can notice that the position 100 is one of this that have value equals to 0. Finally we get the value of 130.
This is the code:
using System;
namespace Stackoverflow
{
class Program
{
static void Main(string[] args)
{
int[] values = { 1, 19, 50 };
int[] arr = new int[108];
foreach (var val in values)
{
int pos = val - 1;
while (arr[pos] != val)
{
arr[pos] = val;
pos += 20;
pos %= 108;
}
}
int sum = 0;
int zeros = 0;
for (int i = 0; i < 20; i++)
{
if (arr[i] == 0) zeros++;
sum += arr[i];
}
Console.WriteLine((1000 - sum) / zeros);
}
}
}
I solved this problem with array, but better use some type of cycle list. It's about structure choice.
In general you should solve this problem with logic, computer only can help you to check result or visualize data. I'll append code which show how I use array for solving this problem:
N = 108
step = 20
a = [None] * (N + 1)
def toArrayIndex(i):
if i < 0:
return toArrayIndex(N + i)
if i % N == 0:
return N
else:
return i % N
def setKnownNumbers(value, index):
while a[toArrayIndex(index)] != value:
if a[toArrayIndex(index)] != None:
print "ERROR: problem unsolvable i = %d" % toArrayIndex(index)
a[toArrayIndex(index)] = value
index += step
setKnownNumbers(1, 1)
setKnownNumbers(19, 19)
setKnownNumbers(50, 50)
Now you only need to understand how finish it. It's not hard ;)
You can use thi code to output array and check correctness
# output
for i in range(1,N + 1):
print i, a[i]
counter = 0
for i in range(1,N + 1):
s = 0
for j in range(0,step):
s += a[toArrayIndex(i + j)]
if s == 1000:
counter += 1
if counter == N:
print "Correct"
print "a[100] = %d" % a[100]
I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
Make sure your arrays slice accross the shorter word width.
Code (it works the exact same if you replace int[] with String in the parameter declarations:
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
There is a big number of string similarity distance algorithms that can be used. Some listed here (but not exhaustively listed are):
Levenstein
Needleman Wunch
Smith Waterman
Smith Waterman Gotoh
Jaro, Jaro Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge Elkan
A library that contains implementation to all of these is called SimMetrics
which has both java and c# implementations.
I have found that Levenshtein and Jaro Winkler are great for small differences betwen strings such as:
Spelling mistakes; or
ö instead of o in a persons name.
However when comparing something like article titles where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic:
compare these 2 titles (that are the same but worded differently from different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site that provides algorithm comparison of the strings shows:
Levenshtein: 81
Smith-Waterman Gotoh 94
Jaro Winkler 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Levenshtein: 54
Smith-Waterman Gotoh 49
Jaro Winkler 89
As Anastasiosyal pointed out, SimMetrics has the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
Here is my implementation of Damerau Levenshtein Distance, which returns not only similarity coefficient, but also returns error locations in corrected word (this feature can be used in text editors). Also my implementation supports different weights of errors (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
I wrote some tests and it's seems to me that method is working.
But comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 out of 7 (or 4 out of 10) match. If this works for you, and you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function
I am creating a forecasting application that will run simulations for various "modes" that a production plant is able to run. The plant can run in one mode per day, so I am writing a function that will add up the different modes chosen each day that best maximize the plant’s output and best aligns with the sales forecast numbers provided. This data will be loaded into an array of mode objects that will then be used to calculate the forecast output of the plant.
I have created the functions to do this, however, I need to make them recursive so that I am able to handle any number (within reason) of modes and work days (which varies based on production needs). Listed below is my code using for loops to simulate what I want to do. Can someone point me in the right direction in order to create a recursive function to replace the need for multiple for loops?
Where the method GetNumbers4 would be when there were four modes, and GetNumbers5 would be 5 modes. Int start would be the number of work days.
private static void GetNumber4(int start)
{
int count = 0;
int count1 = 0;
for (int i = 0; 0 <= start; i++)
{
for (int j = 0; j <= i; j++)
{
for (int k = 0; k <= j; k++)
{
count++;
for (int l = 0; l <= i; l++)
{
count1 = l;
}
Console.WriteLine(start + " " + (count1 - j) + " " + (j - k) + " " + k);
count1 = 0;
}
}
start--;
}
Console.WriteLine(count);
}
private static void GetNumber5(int start)
{
int count = 0;
int count1 = 0;
for (int i = 0; 0 <= start; i++)
{
for (int j = 0; j <= i; j++)
{
for (int k = 0; k <= j; k++)
{
for (int l = 0; l <= k; l++)
{
count++;
for (int m = 0; m <= i; m++)
{
count1 = m;
}
Console.WriteLine(start + " " + (count1 - j) + " " + (j - k) + " " + (k - l) + " " + l);
count1 = 0;
}
}
}
start--;
}
Console.WriteLine(count);
}
EDITED:
I think that it would be more helpful if I gave an example of what I was trying to do. For example, if a plant could run in three modes "A", "B", "C" and there were three work days, then the code will return the following results.
3 0 0
2 1 0
2 0 0
1 2 0
1 1 1
1 0 2
0 3 0
0 2 1
0 1 2
0 0 3
The series of numbers represent the three modes A B C. I will load these results into a Modes object that has the corresponding production rates. Doing it this way allows me to shortcut creating a list of every possible combination; it instead gives me a frequency of occurrence.
Building on one of the solutions already offered, I would like to do something like this.
//Where Modes is a custom classs
private static Modes GetNumberRecur(int start, int numberOfModes)
{
if (start < 0)
{
return Modes;
}
//Do work here
GetNumberRecur(start - 1);
}
Thanks to everyone who have already provided input.
Calling GetNumber(5, x) should yield the same result as GetNumber5(x):
static void GetNumber(int num, int max) {
Console.WriteLine(GetNumber(num, max, ""));
}
static int GetNumber(int num, int max, string prefix) {
if (num < 2) {
Console.WriteLine(prefix + max);
return 1;
}
else {
int count = 0;
for (int i = max; i >= 0; i--)
count += GetNumber(num - 1, max - i, prefix + i + " ");
return count;
}
}
A recursive function just needs a terminating condition. In your case, that seems to be when start is less than 0:
private static void GetNumberRec(int start)
{
if(start < 0)
return;
// Do stuff
// Recurse
GetNumberRec(start-1);
}
I've refactored your example into this:
private static void GetNumber5(int start)
{
var count = 0;
for (var i = 0; i <= start; i++)
{
for (var j = 0; j <= i; j++)
{
for (var k = 0; k <= j; k++)
{
for (var l = 0; l <= k; l++)
{
count++;
Console.WriteLine(
(start - i) + " " +
(i - j) + " " +
(j - k) + " " +
(k - l) + " " +
l);
}
}
}
}
Console.WriteLine(count);
}
Please verify this is correct.
A recursive version should then look like this:
public static void GetNumber(int start, int depth)
{
var count = GetNumber(start, depth, new Stack<int>());
Console.WriteLine(count);
}
private static int GetNumber(int start, int depth, Stack<int> counters)
{
if (depth == 0)
{
Console.WriteLine(FormatCounters(counters));
return 1;
}
else
{
var count = 0;
for (int i = 0; i <= start; i++)
{
counters.Push(i);
count += GetNumber(i, depth - 1, counters);
counters.Pop();
}
return count;
}
}
FormatCounters is left as an exercise to the reader ;)
I previously offered a simple C# recursive function here.
The top-most function ends up having a copy of every permutation, so it should be easily adapted for your needs..
I realize that everyone's beaten me to the punch at this point, but here's a dumb Java algorithm (pretty close to C# syntactically that you can try out).
import java.util.ArrayList;
import java.util.List;
/**
* The operational complexity of this is pretty poor and I'm sure you'll be able to optimize
* it, but here's something to get you started at least.
*/
public class Recurse
{
/**
* Base method to set up your recursion and get it started
*
* #param start The total number that digits from all the days will sum up to
* #param days The number of days to split the "start" value across (e.g. 5 days equals
* 5 columns of output)
*/
private static void getNumber(int start,int days)
{
//start recursing
printOrderings(start,days,new ArrayList<Integer>(start));
}
/**
* So this is a pretty dumb recursion. I stole code from a string permutation algorithm that I wrote awhile back. So the
* basic idea to begin with was if you had the string "abc", you wanted to print out all the possible permutations of doing that
* ("abc","acb","bac","bca","cab","cba"). So you could view your problem in a similar fashion...if "start" is equal to "5" and
* days is equal to "4" then that means you're looking for all the possible permutations of (0,1,2,3,4,5) that fit into 4 columns. You have
* the extra restriction that when you find a permutation that works, the digits in the permutation must add up to "start" (so for instance
* [0,0,3,2] is cool, but [0,1,3,3] is not). You can begin to see why this is a dumb algorithm because it currently just considers all
* available permutations and keeps the ones that add up to "start". If you want to optimize it more, you could keep a running "sum" of
* the current contents of the list and either break your loop when it's greater than "start".
*
* Essentially the way you get all the permutations is to have the recursion choose a new digit at each level until you have a full
* string (or a value for each "day" in your case). It's just like nesting for loops, but the for loop actually only gets written
* once because the nesting is done by each subsequent call to the recursive function.
*
* #param start The total number that digits from all the days will sum up to
* #param days The number of days to split the "start" value across (e.g. 5 days equals
* 5 columns of output)
* #param chosen The current permutation at any point in time, may contain between 0 and "days" numbers.
*/
private static void printOrderings(int start,int days,List<Integer> chosen)
{
if(chosen.size() == days)
{
int sum = 0;
for(Integer i : chosen)
{
sum += i.intValue();
}
if(sum == start)
{
System.out.println(chosen.toString());
}
return;
}
else if(chosen.size() < days)
{
for(int i=0; i < start; i++)
{
if(chosen.size() >= days)
{
break;
}
List<Integer> newChosen = new ArrayList<Integer>(chosen);
newChosen.add(i);
printOrderings(start,days,newChosen);
}
}
}
public static void main(final String[] args)
{
//your equivalent of GetNumber4(5)
getNumber(5,4);
//your equivalent of GetNumber5(5)
getNumber(5,5);
}
}