Algorithm to find all possible binary combinations with a condition - c#

Here is one for you math brains out there. I have a matrix, actually it's half a matrix, cut diagonally. Each element of the matrix can be a 1 or a 0. I need to find all the possible combinations of 1s and 0s for any matrix of width N.
This is easy enough: you can get the number of elements in this matrix given width N with $N(N+1)/2$; for this example, where N=7, this gives us 28 elements. Then you can get the number of combinations with $2^{\text{elements}}$.
So the formula to get all the possible combinations would be $2^{N(N+1)/2}$.
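As a quick check of these counts, here is a minimal C# sketch. It assumes the half matrix includes the diagonal (which is what makes N = 7 give the 28 elements mentioned above); the class and method names are made up for illustration.

```csharp
using System.Numerics;

public static class HalfMatrixCounts
{
    // Number of cells in a triangular half of an N x N matrix,
    // diagonal included (assumption based on N = 7 giving 28).
    public static int ElementCount(int n) => n * (n + 1) / 2;

    // Number of unconstrained 0/1 fillings: one bit per cell.
    public static BigInteger CombinationCount(int n) =>
        BigInteger.Pow(2, ElementCount(n));
}
```

For N = 7 this gives 28 elements and 268,435,456 combinations before any of the row conditions are applied, which is why brute force stops being an option very quickly.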
Now here is where it gets tricky. There is one condition that must hold true for each result. The sum of each set of elements of the matrix (shown below, with each row represented) must be less than 4 for the first set (the one on the first row) and less than 3 for all the other sets (these are constants regardless of the value of N).
Here is what the sets for this example (N=7) look like. Notice that each row is represented. So for the first set, the combination 0 1 0 1 0 1 0 would be valid, as its sum is < 4 (since it's the first row). For the second set, the combination 1 0 0 0 0 1 0 is valid, as its sum needs to be < 3.
I need to do this for huge matrices, so brute forcing all possible permutations to find the ones that satisfy this condition would be infeasible. I need to find some sort of algorithm I can use to generate the valid matrices bottom up rather than top down, maybe by doing separate operations that can be composed later to yield the total set of results.
Any and all ideas are welcome.

A simple algorithm generating each solution recursively:
global File // a file where you will store your data
global N // your matrix size
// matrix contains the matrix we build (int[][])
// set contains the number of 1s we can still use in each set (int[])
// c is the column number (int)
// r is the row number (int)
function f ( matrix, set, c, r ) :
    if ( c == N ):
        r = r + 1
        c = r
    if ( r == N ):
        write ( matrix in File )
        // Implement your own way of storing the matrix
        return
    if ( set[r] > 0 AND (c+2 < N AND set[c+2] > 0) ):
        matrix[c][r] = 1
        set[c]--
        set[r]--
        f ( matrix, set, c+1, r )
        // undo the change before trying a 0 in the same cell
        set[c]++
        set[r]++
    matrix[c][r] = 0
    f ( matrix, set, c+1, r )
end
// Calling our function with N = 5
N = 5
f([[0,0,0,0,0],[0,0,0,0,0],...], [3,2,2,2,2], 0, 0)
You can store each matrix in something other than a file, but keep an eye on your memory consumption.

Here's a basic idea to get started; it's too large for a comment, but it's not a complete answer.
The idea is to start with a maximally 'filled' matrix and strip 1s away, rather than starting with an empty one and filling it.
Basic stripping-away procedure
Start with a matrix in which every row is filled to its maximum number of 1s, that is, row 0 has three 1s and the other rows each have two 1s (the sums must be less than 4 and less than 3, respectively). Then, start checking the conditions. Condition 0 (row 0) is automatically satisfied. For the rest of the conditions, either they are satisfied, or there are too many 1s in the condition's set: take away 1s until the condition is satisfied. Do this for all conditions.
Generating all 'simpler' ones
Doing this, you end up with a matrix that satisfies all conditions. Now, you can change any 1 to a 0 and the matrix will still satisfy all the conditions. So, once you have a 'maximal' solution, you can generate all sub-solutions of it trivially.
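Generating the sub-solutions of one maximal solution can be sketched as follows. This is only an illustration: it assumes the matrix has been flattened into an int[] of 0s and 1s (a made-up representation), that it has at most 30 ones, and that every condition is an upper bound on a sum, so clearing any 1 keeps the matrix valid.

```csharp
using System;
using System.Collections.Generic;

public static class SubSolutions
{
    // Yields every matrix obtained by turning some subset of the 1s
    // in 'maximal' into 0s (including the empty and the full subset).
    public static IEnumerable<int[]> Enumerate(int[] maximal)
    {
        // Remember where the 1s are.
        var ones = new List<int>();
        for (int i = 0; i < maximal.Length; i++)
            if (maximal[i] == 1) ones.Add(i);

        if (ones.Count > 30)
            throw new ArgumentException("Too many 1s to enumerate with an int mask.");

        // Each bit of 'mask' decides whether one of the 1s is kept.
        int total = 1 << ones.Count;
        for (int mask = 0; mask < total; mask++)
        {
            var result = new int[maximal.Length];
            for (int k = 0; k < ones.Count; k++)
                if ((mask & (1 << k)) != 0) result[ones[k]] = 1;
            yield return result;
        }
    }
}
```

Different maximal solutions share sub-solutions (the all-zero matrix, for one), so combining the output of several maximal solutions would still need a de-duplication step.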

Related

I need help understanding the modulo operator

I was trying to recreate my C++ factor program from a few years ago in my new language, C#. All I could remember is that it possibly involved a modulo, and possibly didn't. I knew that it involved at least one for and one if statement. However, when I started trying to recreate it, I kept getting nothing near what it should be. I thought it had something to do with me not understanding loops, but it turns out I understand loops just fine. What I don't understand is how to use the modulo when performing math operations.
For instance, what am I doing when I say something like:
(ignore that it might not actually work, it's just an example)
if (12 % 2 == 0)
{
    Console.WriteLine("I don't understand.");
}
This kind of thing I don't quite have a grasp of yet. I realize that it is taking the remainder, and that's all I can grasp, not how it's actually used in real programming. I managed to get my factor program to work in C# after a bit of thinking and tinkering, but that doesn't mean I understand this operator or its uses. I no longer have access to the old C++ file.
The % (modulo) operator yields the remainder from the division. In your example the remainder is equal to 0 and the if evaluates to true (0 == 0). A classic example is when it's used to see if a number is even or not.
if (number % 2 == 0) {
    // even
} else {
    // odd
}
Think of modulo like a circle with a pointer (a spinner); the easiest example is a clock.
Notice how at the top it is zero.
The modulo function maps any value to one of the values on the spinner. Think of the value to the left of the % as the number of steps to take around the spinner, and the second value as the total number of steps on the spinner, so we have the following.
0 % 12 = 0
1 % 12 = 1
12 % 12 = 0
13 % 12 = 1
We always start at 0.
So if we go 0 steps around a 12 step spinner we are still at 0, if we go 1 step from zero we are on 1, if we go 12 steps we are back at 0. If we go 13 we go all the way around and end at 1 again.
I hope this helps you visualize it.
It helps when you are using a structure like an array and you want to cycle through it. Imagine you have an array of the days of the week, 7 elements (Monday to Sunday). You want to always display the day 3 days from the current day. Today is Tuesday, so the array element is days[1]; to get the day 3 days from now we use days[1+3]. That is all right, but what if we are at Saturday (days[5]) and want to get 3 days from there? We get days[5+3], which is an index-out-of-bounds error, as our array has only 7 elements (max index of 6) and we tried to access index 8.
However, knowing what you know about modulos and spinners now you can do the following:
string threeDaysFromNow = days[(currentDay + 3) % 7];
When it goes over the bounds of the array, it wraps around and starts at the beginning again. There are many applications for this. Just remember the visualization of spinners; that is when it clicked in my head.
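A small self-contained version of the day-of-week wrap-around could look like the sketch below. The class and method names are made up for illustration, and the extra "+ n" step is an addition so that negative offsets (days in the past) also wrap, since % in C# can return a negative remainder.

```csharp
public static class DayWheel
{
    private static readonly string[] Days =
        { "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday" };

    // Returns the day 'offset' days away from 'currentDay' (0 = Monday),
    // wrapping around the end of the array like a spinner.
    public static string DaysFromNow(int currentDay, int offset)
    {
        int n = Days.Length;
        // The "+ n" keeps the index non-negative for negative offsets.
        int index = ((currentDay + offset) % n + n) % n;
        return Days[index];
    }
}
```

With this, DayWheel.DaysFromNow(5, 3) (three days after Saturday) returns "Tuesday" instead of throwing an out-of-range exception.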
The modulo operator % returns the remainder of a division operation. For example, where 13 / 5 = 2, 13 % 5 = 3 (using integer math).
It's a common tactic to check a value against % 2 to see if it is even. If it is even, the remainder will be 0; otherwise it will be 1 (or -1 for a negative odd number, since in C# the result of % takes the sign of the dividend).
As for your specific use of it, you are doing 12 % 2 which is not only 0, but will always be 0. That will always make the if condition 12 % 2 == 0 true, which makes the if rather redundant.
As mentioned, it's commonly used for checking even/odd, but you can also use it to do something at intervals inside a loop, or to split files into chunks. I personally use mod for clock-face-type problems, as my data often navigates a circle.
Registers also behave in mod: for example, an 8-bit register rolls over at 2^8, so you can force a value to fit a register size with var = mod(var, 256).
The last thing I know about mod is that it is used in checksums and random number generation, but I haven't gone into the why for those at all.
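Two of these uses, acting at intervals in a loop and wrapping to an 8-bit range, can be sketched in C# as below. The names are hypothetical, and WrapToByte assumes a non-negative input.

```csharp
public static class ModuloUses
{
    // True on iterations 0, interval, 2 * interval, ...
    // Handy for doing something every so many passes through a loop.
    public static bool IsIntervalStep(int iteration, int interval) =>
        iteration % interval == 0;

    // Keeps a non-negative value inside the range of an 8-bit register (0..255),
    // the same roll-over an 8-bit register performs on its own.
    public static int WrapToByte(int value) => value % 256;
}
```

For example, ModuloUses.WrapToByte(256) is 0 and ModuloUses.WrapToByte(300) is 44, just as a byte-sized counter would roll over.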
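An example where you could use this is in indexing arrays in certain for loops. For example, take the simple equation that defines the new pixel value of a resampled image using bicubic interpolation:
$$p(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij} x^{i} y^{j}$$
where the $a_{ij}$ are the 16 coefficients stored in the vector 'a'.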
Don't worry what bicubic interpolation exactly is for the moment, we're just concerned about executing what seems to be two simple for loops: one for index i and one for index j. Note that the vector 'a' is 16 numbers long.
A simple for loop someone would try could be:
int n = 0;
for (int i = 0; i < 4; ++i)
{
    for (int j = 0; j < 4; ++j)
    {
        pxy += a[n] * pow(x,i) * pow(y,j); // p(x,y)
        n++; // the last index used is 15
    }
}
Or you could do it in one for loop:
for (int i = 0; i < 16; ++i)
{
    int i_new = floor(i / 4.0); // i_new provides indices 0-3 incrementing every 4 iterations of loop
    int j_new = i % 4; // j_new is reset to 0 when i is a multiple of 4
    pxy += a[i] * pow(x,i_new) * pow(y,j_new); // p(x,y)
}
Printing i_new and j_new in the loop:
i_new j_new
0 0
0 1
0 2
0 3
1 0
1 1
1 2
1 3
2 0
2 1
2 2
2 3
3 0
3 1
3 2
3 3
As you can see, % can be very useful.

Jumping segments in binary

My question is, is there a way in C# with a starting bit location to find the next binary digit within a byte that has a specified value of 0 or 1 without iteration (looking for the highest performance option).
As an example, if you had 10011 and started at the first bit (far right) and searched for the first 0, it would be the 3rd place going right to left. If you then started at the 3rd place and wanted to find the next 1, it would be at the 5th place (far left).
Thanks for any help and feel free to let me know if I need to provide anything further.
Edit: Here is my current code.
private int GetBinarySegment(uint uiValue, int iStart, int iMaxBits, byte bValue)
{
    int r = 0;
    uiValue >>= iStart;
    if (uiValue == 0) return iMaxBits - iStart;
    while ((uiValue & 1) == bValue) { uiValue >>= 1; r++; }
    return r;
}
There are ways, but they're ugly because there's no _BitScanForward or equivalent intrinsic. Still, you can actually compute this thing efficiently without needing a huge table.
First step: make a number that has a 1 at the position you're searching for and 0 everywhere else.
If searching for a 1, that means x & -x. If searching for a 0, use ~x & (x + 1).
Then, use one of the many ways to emulate either bitscan (there is only one set bit now, so it doesn't matter which side you search from). Some ways to do that are detailed here (not in C#, but you can convert them).
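As a rough illustration of these two steps, here is a C# sketch. It assumes a newer runtime (.NET Core 3.0 or later), where System.Numerics.BitOperations.TrailingZeroCount can play the role of the bit scan; on older frameworks that call would have to be replaced by one of the emulations. The class and method names are made up, and x & -x is written as x & (~x + 1) because the value is unsigned.

```csharp
using System.Numerics;

public static class BitSearch
{
    // Returns the zero-based index of the first bit at or after 'start'
    // whose value equals 'bit' (0 or 1), or -1 if there is none in 32 bits.
    public static int NextBit(uint value, int start, int bit)
    {
        if (start < 0 || start > 31) return -1;

        // Searching for a 0 is the same as searching for a 1 in the complement.
        uint x = bit == 1 ? value : ~value;

        // Ignore everything below the starting position.
        x &= uint.MaxValue << start;
        if (x == 0) return -1;

        // Step 1: isolate the lowest set bit (x & -x for unsigned values).
        uint lowest = x & (~x + 1);

        // Step 2: bit scan; only one bit is set, so its index is the answer.
        return BitOperations.TrailingZeroCount(lowest);
    }
}
```

With the value 10011 from the question, BitSearch.NextBit(0b10011, 0, 0) returns 2 (the 3rd place from the right) and BitSearch.NextBit(0b10011, 2, 1) returns 4 (the 5th place). Because the complement of a byte stored in a uint has 1s above bit 7, a caller working with bytes would treat any result of 8 or more as "not found".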
Use a lookup table. That is, precalculate a 2D array indexed by byte value and current position. You can do a separate table for zeros and ones, or you can combine it.
So for your example, you start at bit 0 of the number 19. That happens to be a 1. So if you lookup nextBit[19][0] it should return 1, and so on. Here's what a combined lookup table might look like. It shows the next bit for both 0s and 1s:
nextBit[19][0] = 1 // 1
nextBit[19][1] = 4 // 1
nextBit[19][2] = 3 // 0
nextBit[19][3] = 5 // 0
nextBit[19][4] = 0 // 1
nextBit[19][5] = 6 // 0
nextBit[19][6] = 7 // 0
Obviously there is no 'next' for bit 7, and if 'next' returns 0, there are no more of that particular bit.
I may have interpreted your question incorrectly, but this technique can be modified to suit your purposes. I initially thought you wanted to navigate through all 1-bits or 0-bits. If instead you want to skip over consecutive 1-bits, then you just arrange your table in that way. Or indeed, you can have a 'next' for both 0 and 1 at each position.
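Precalculating such a combined table might look like the sketch below. It follows the convention described above, where an entry of 0 means there is no later bit with the same value; the class name and the use of a byte[256, 8] array are choices made for illustration.

```csharp
public static class NextBitTable
{
    // table[value, pos] = position of the next bit above 'pos' that has the
    // same value as the bit at 'pos', or 0 if there is no such bit.
    public static byte[,] Build()
    {
        var table = new byte[256, 8];
        for (int value = 0; value < 256; value++)
        {
            for (int pos = 0; pos < 8; pos++)
            {
                int bit = (value >> pos) & 1;
                for (int next = pos + 1; next < 8; next++)
                {
                    if (((value >> next) & 1) == bit)
                    {
                        table[value, pos] = (byte)next;
                        break;
                    }
                }
            }
        }
        return table;
    }
}
```

The table is only 2 KB, so it can be built once at start-up; after that a lookup such as table[19, 1] (which is 4) replaces the loop in the original code.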

Compare 2 string

I have the following 2 strings:
String A: Manchester United
String B: Manchester Utd
Both strings mean the same thing but contain different values.
How can I compare these strings to get a "matching score"? In this case, the first word, "Manchester", is identical, and the second words contain similar letters, but not in the same places.
Is there any simple algorithm that returns the "matching score" after I supply 2 strings?
You could calculate the Levenshtein distance between the two strings and if it is smaller than some value (that you must define) you may consider them to be pretty close.
I've needed to do something like this and used Levenshtein distance.
I used it for a SQL Server UDF which is being used in queries with more than a million rows (and texts of up to 6 or 7 words).
I found that the algorithm runs faster and the "similarity index" is more precise if you compare each word separately, i.e. you split each input string into words and compare each word of one input string to each word of the other input string.
Remember that Levenshtein gives the difference, and you have to convert it to a "similarity index". I used something like the distance divided by the length of the longest word (but with some variations).
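One plausible way to turn the distance into a similarity index is 1 - distance / length of the longer word. The sketch below is only an illustration of that idea, not the exact variation used in the UDF; it is written in C# with made-up names and includes a compact two-row Levenshtein so that it is self-contained.

```csharp
using System;

public static class WordSimilarity
{
    // Standard Levenshtein distance, keeping only two rows of the table.
    public static int Levenshtein(string s, string t)
    {
        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            var swap = previous; previous = current; current = swap;
        }
        return previous[t.Length];
    }

    // 1.0 means identical words, 0.0 means nothing in common.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

For the words in the question, WordSimilarity.Similarity("United", "Utd") is 0.5 (three deletions over six letters) and WordSimilarity.Similarity("Manchester", "Manchester") is 1.0.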
First rule: order and number of words
You must also consider:
if there must be the same number of words in both inputs, or it can change
and if the order must be the same on both inputs, or it can change.
Depending on this the algorithm changes. For example, applying the first rule is really fast if the number of words differs. And the second rule reduces the number of comparisons, especially if there are many words in the compared texts. That's explained with examples later.
Second rule: weighting the similarity of each compared pair
I also weighted the longer words higher than the shorter words to get the global similarity index. My algorithm takes the longest of the two words in the compared pair, and gives a higher weight to the pair with the longer words than to the pair with the shorter ones, although not exactly proportional to the pair length.
Sample comparison: same order
With this example, which uses different number of words:
compare "Manchester United" to "Manchester Utd FC"
If the same order of the words in both inputs is guaranteed, you should compare these pairs:
Manchester United
Manchester Utd    FC
(Manchester,Manchester) (Utd,United) (FC: not compared)
Manchester        United
Manchester Utd    FC
(Manchester,Manchester) (Utd: not compared) (United,FC)
           Manchester United
Manchester Utd        FC
(Manchester: not compared) (Manchester,Utd) (United,FC)
Obviously, the highest score would be for the first set of pairs.
Implementation
To compare words in the same order.
The string with the higher number of words is a fixed vector, shown as A,B,C,D,E in this example. Where v[0] is the word A, v[1] the word B and so on.
For the string with the lower number of words, we need to create all the possible combinations of indexes that can be compared with the first set. In this case, the string with the lower number of words is represented by a,b,c.
You can use a simple loop to create all the vectors that represents the pairs to be compared like so
A,B,C,D,E   A,B,C,D,E   A,B,C,D,E   A,B,C,D,E   A,B,C,D,E   A,B,C,D,E
a,b,c       a,b,  c     a,b,    c   a,  b,c     a,  b,  c   a,    b,c
0 1 2       0 1 3       0 1 4       0 2 3       0 2 4       0 3 4
A,B,C,D,E   A,B,C,D,E   A,B,C,D,E   A,B,C,D,E
  a,b,c       a,b,  c     a,  b,c       a,b,c
1 2 3       1 2 4       1 3 4       2 3 4
The numbers in the sample are vectors holding the indices of the long set of words that the words of the short set must be compared with, i.e. v[0]=0 means compare index 0 of the short set (a) to index 0 of the long set (A), v[1]=2 means compare index 1 of the short set (b) to index 2 of the long set (C), and so on.
To calculate these vectors, simply start with 0,1,2. Start by moving the last index to the right until it can no longer be moved:
0,1,2 -> 0,1,3 -> 0,1,4
When the last index can't be moved any further, move the one before the last, and reset the last to the nearest possible place (1 moves to 2, and 4 moves back to 3):
0,2,3 -> 0,2,4
No more moves of the last index are possible, so move the one before the last again:
0,3,4
Now neither the last index nor the one before it can be moved, so move the one before those and reset the others to the lowest possible values:
1,2,3 -> 1,2,4
And so on. See the picture
When you have all the possible combinations you can compare the defined pairs.
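The "move the last index that can still move, then reset the ones after it" procedure can be sketched in C# as follows. This is an illustrative take on the steps described above, with hypothetical names: longCount is the number of words in the longer string and shortCount the number in the shorter one.

```csharp
using System.Collections.Generic;

public static class IndexCombinations
{
    // Yields every increasing vector of 'shortCount' indices into a list of
    // 'longCount' words, in the order 0,1,2 -> 0,1,3 -> 0,1,4 -> 0,2,3 ...
    public static IEnumerable<int[]> Generate(int longCount, int shortCount)
    {
        if (shortCount < 0 || shortCount > longCount) yield break;

        var v = new int[shortCount];
        for (int i = 0; i < shortCount; i++) v[i] = i;

        while (true)
        {
            yield return (int[])v.Clone();

            // Find the rightmost index that can still move to the right.
            int p = shortCount - 1;
            while (p >= 0 && v[p] == longCount - shortCount + p) p--;
            if (p < 0) yield break; // nothing can move: we are done

            // Move it, then reset the following indices to the nearest places.
            v[p]++;
            for (int q = p + 1; q < shortCount; q++) v[q] = v[q - 1] + 1;
        }
    }
}
```

IndexCombinations.Generate(5, 3) produces the ten vectors listed earlier, from 0,1,2 to 2,3,4; each vector tells which word of the long string is paired with each word of the short one.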
Third rule: minimum similarity to stop comparison
Stop the comparison when a minimum similarity is reached: depending on what you want to do, it's possible that you can set a threshold that, when it's reached, stops the comparison of the pairs.
If you can't set a threshold, at least you can always stop if you get a 100% similarity for each pair of words. This can save a lot of time.
On some occasions you can simply decide to stop the comparison if the similarity is at least something like 75%. This can be used if you want to show the user all the strings which are similar to the one provided by the user.
Sample: comparison with change of the order of the words
If there can be changes in the order, you need to compare each word of the first set with each word of the second set, and take the highest scores for the combinations of results, which include all the words of the shortest pair ordered in all the possible ways, compared to different words of the second pair. For this you can populate the upper or lower triangle of a matrix of (n X m) elements, and then take the required elements from the matrix.
Fourth rule: normalization
You must also normalize the word before comparison, like so:
if not case-sensitive convert all the words to upper or lower case
if not accent sensitive, remove accents in all the words
if you know that there are usual abbreviations, you can also normalize the words to the abbreviation to speed it up (i.e. convert united to utd, not utd to united)
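The three normalization steps above might be sketched in C# like this. The accent removal uses the common decompose-and-drop-combining-marks approach; the abbreviation table is hypothetical and holds only the example pair from the list, and the class name is made up.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class WordNormalizer
{
    // Hypothetical abbreviation table; real entries would come from your own data.
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string> { { "united", "utd" } };

    public static string Normalize(string word)
    {
        // 1. Case: compare everything in lower case.
        // 2. Accents: decompose characters, then drop the combining marks.
        string decomposed = word.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        string stripped = sb.ToString().Normalize(NormalizationForm.FormC);

        // 3. Abbreviations: map the long form to the short one.
        return Abbreviations.TryGetValue(stripped, out string abbreviation)
            ? abbreviation
            : stripped;
    }
}
```

After this step "United" and "Utd" both normalize to "utd", so that pair can be scored as identical without running the distance calculation at all.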
Caching for optimization
To optimize the procedure, I cached whatever I could, i.e. the comparison vectors for different sizes, like the vectors 0,1,2 - 0,1,3 - 0,1,4 - 0,2,3 in the A,B,C,D,E to a,b,c comparison example: all the vectors for lengths 3,5 would be calculated on first use and reused for all incoming 3-word to 5-word comparisons.
Other algorithms
I tried Hamming distance and the results were less accurate.
You can do much more complex things, like semantic comparisons or phonetic comparisons, or consider that some letters are just the same (like b and v in several languages, such as Spanish, where there is no distinction). Some of these things are very easy to implement and others are really difficult.
NOTE: I didn't include the implementation of Levenshtein distance, because you can easily find it implemented in different languages.
Take a look at this article, which explains how to do it and gives sample code too :)
Fuzzy Matching (Levenshtein Distance)
Update:
Here is the method code that takes two strings as parameters and calculates the "Levenshtein Distance" of the two strings
public static int Compute(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];
    // Step 1
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }
    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }
    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }
    // Step 3
    for (int i = 1; i <= n; i++)
    {
        // Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            // Step 6
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}
Detecting duplicates sometimes might be a "little" more complicated than computing Levenshtein distance.
Consider following example:
1. Jeff, Lynch, Maverick, Road, 181, Woodstock
2. Jeff, Alf., Lynch, Maverick, Rd, Woodstock, NY
These duplicates can be matched by complicated clustering algorithms.
For further information you might want to check some research papers like
"Effective Incremental Clustering for Duplicate Detection in Large Databases".
(Example is from the paper)
What you are looking for is a string similarity measure. There are multiple ways of doing this:
Edit Distances between two strings (as in Answer #1)
Converting the strings into sets of characters (generally on bigrams or words) and then calculating Bruce Coefficient or Dice Coefficient on the two sets.
Projecting the strings into term vectors (either on words or bigrams) and calculating the Cosine Distance between the two vectors.
I generally find the option #2 to be the easiest to implement and if your strings are phrases then you can simply tokenize them on word-boundaries.
In all the above cases, you might want to first remove the stop words (common words like and, a,the etc) before tokenizing.
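As an illustration of the term-vector option, here is a small C# sketch that tokenizes on spaces, counts words case-insensitively, and computes the cosine similarity of the two count vectors. The names are made up, and stop-word removal is left out to keep it short.

```csharp
using System;
using System.Collections.Generic;

public static class TermVectorSimilarity
{
    // Builds a word -> count map (the term vector) for one string.
    private static Dictionary<string, int> TermCounts(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(word, out int n);
            counts[word] = n + 1;
        }
        return counts;
    }

    // Cosine of the angle between the two term vectors: 1 = same words, 0 = none shared.
    public static double Cosine(string a, string b)
    {
        var va = TermCounts(a);
        var vb = TermCounts(b);
        double dot = 0, normA = 0, normB = 0;
        foreach (var pair in va)
        {
            normA += pair.Value * pair.Value;
            if (vb.TryGetValue(pair.Key, out int other)) dot += pair.Value * other;
        }
        foreach (var pair in vb) normB += pair.Value * pair.Value;
        if (normA == 0 || normB == 0) return 0;
        return dot / Math.Sqrt(normA * normB);
    }
}
```

On whole words, "Manchester United" and "Manchester Utd" share one of their two words, giving a cosine of 0.5; using bigrams instead of words as the terms would give partial credit to "United" versus "Utd" as well.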
Update: Links
Dice Coefficient
Cosine Similarity
Implementing Naive Similarity engine in C# *Warning: shameless Self Promotion
Here is an alternative to using the Levenshtein distance algorithm. This compares strings based on Dice's Coefficient, which compares the number of common letter pairs in each string to generate a value between 0 and 1 with 0 being no similarity and 1 being complete similarity
public static double CompareStrings(string strA, string strB)
{
    List<string> setA = new List<string>();
    List<string> setB = new List<string>();
    for (int i = 0; i < strA.Length - 1; ++i)
        setA.Add(strA.Substring(i, 2));
    for (int i = 0; i < strB.Length - 1; ++i)
        setB.Add(strB.Substring(i, 2));
    var intersection = setA.Intersect(setB, StringComparer.InvariantCultureIgnoreCase);
    return (2.0 * intersection.Count()) / (setA.Count + setB.Count);
}
Call the method like this:
CompareStrings("Manchester United", "Manchester Utd");
Output is: 0.75862068965517238

Find the best interval match result

I have two sets of data in this form:
x   | y | z        x1  | y1  | z1
ab1 | 1 | 2        ab1 | 1   | 2
ab1 | 2 | 3        ab1 | 1.8 | 2
ab2 | 2 | 3        ab1 | 1.8 | 2
The number of columns can vary from 1 to 30. The two sets are likely to have different numbers of rows.
The number of rows per set can range from a few hundred to a few million.
For each column a different matching rule will be applied, for example:
x: perfect match
y: +/- 0.1
z: +/- 0.5
Two rows are equivalent when all the criteria are satisfied.
My final goal is to find the rows in the first set with no match in second set.
The naive algorithm could be:
foreach a in SetA
{
    foreach b in SetB
    {
        if (a == b)
        {
            remove b from SetB
            process the next element in SetA
        }
    }
    log a is not in SetB
}
At this stage I am not very interested in the efficiency of the algorithm. I am sure I could do better and I could reduce the complexity.
I am more concerned about the correctness of the result. Let's try a very simple example.
Two sets of numbers:
A B
1.6 1.55
1.5 1.45
4 3.2
And two elements are equal if:
b + 0.1 >= a >= b - 0.1
Now, if I run the naive algorithm I will find 2 matches.
However the result of the algorithm depends on the order of the two sets. For example:
A B
1.5 1.55
1.6 1.45
4 3.2
The algorithm will find only one match.
I would like to find the maximum number of matching rows.
I reckon in the real world data one of the columns will store an id, so the number of possible multiple matches will be a much smaller subset of the original set.
I know I can try to tackle this problem with a post-processing step after the first scan.
However, I don't want to reinvent the wheel, and I am wondering if my problem is equivalent to some famous, well-known and already solved problem.
PS: I have tagged the question also as C++, C# and Java because I am going to use one of these languages to implement it.
It can be cast as a graph theory problem. Let X be a set that contains one node for each row in your first set. Let Y be another set which contains one node for each row in your second set.
The edges in the graph are defined by: for a given x in X and a given y in Y, there is an edge (x,y) if the row corresponding to x matches the row corresponding to y.
Once you have built this graph you can run the "maximum-bipartite-matching" algorithm on it and you will be done.
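To make this concrete, here is a minimal C# sketch using the simple augmenting-path method (often called Kuhn's algorithm), which is one of several ways to compute a maximum bipartite matching. It assumes single-column numeric rows and a +/- tolerance rule, as in the question's small example; the names are made up, and for millions of rows a faster algorithm such as Hopcroft-Karp and a smarter way of building the edges would be needed.

```csharp
using System;

public static class RowMatcher
{
    // Returns the size of a maximum one-to-one matching between a and b,
    // where a[i] and b[j] may be paired if they differ by at most 'tolerance'.
    public static int MaxMatches(double[] a, double[] b, double tolerance)
    {
        var matchOfB = new int[b.Length];
        for (int j = 0; j < b.Length; j++) matchOfB[j] = -1; // -1 = unmatched

        int matches = 0;
        for (int i = 0; i < a.Length; i++)
        {
            var visited = new bool[b.Length];
            if (TryAugment(i, a, b, tolerance, matchOfB, visited)) matches++;
        }
        return matches;
    }

    // Tries to give a[i] a partner, re-assigning earlier partners if necessary.
    private static bool TryAugment(int i, double[] a, double[] b, double tolerance,
                                   int[] matchOfB, bool[] visited)
    {
        for (int j = 0; j < b.Length; j++)
        {
            if (visited[j] || Math.Abs(a[i] - b[j]) > tolerance) continue;
            visited[j] = true;
            if (matchOfB[j] == -1 ||
                TryAugment(matchOfB[j], a, b, tolerance, matchOfB, visited))
            {
                matchOfB[j] = i;
                return true;
            }
        }
        return false;
    }
}
```

With the question's data and a tolerance of 0.1, both orderings of set A (1.6, 1.5, 4 and 1.5, 1.6, 4) against B (1.55, 1.45, 3.2) give 2 matches, so the result no longer depends on the order of the rows.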
As I understand it, you want the rows in the first set which don't match any row in the second set (within the error range). This clearly can be done with an O(n^2) algorithm by walking through the elements in the first set and comparing each of them with the elements in the second set.
An optimization could be this:
sort both sets - O(n*ln(n))
eliminate from the start the elements of the first set that are too small or too big (within the error) - O(n)
look in the second set for elements from the first set using a binary search (within the error) - O(n*lg(n)) and eliminate those not suitable
complexity: O(n*ln(n))
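For a single numeric column, the sort-and-binary-search idea might look like the C# sketch below. It is only an illustration under stated assumptions: rows are plain doubles, the rule is +/- tolerance, and a row of the second set may match several rows of the first (no one-to-one pairing). Only the second set needs sorting here, and the names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class UnmatchedFinder
{
    // Returns the elements of setA that have no element of setB within 'tolerance'.
    public static List<double> Unmatched(double[] setA, double[] setB, double tolerance)
    {
        var sortedB = (double[])setB.Clone();
        Array.Sort(sortedB); // O(n log n)

        var unmatched = new List<double>();
        foreach (double a in setA)
        {
            // Find the first element of B that is >= a - tolerance.
            int index = Array.BinarySearch(sortedB, a - tolerance);
            if (index < 0) index = ~index; // not found: complement is the insertion point

            // It is a match only if it is also <= a + tolerance.
            bool hasMatch = index < sortedB.Length && sortedB[index] <= a + tolerance;
            if (!hasMatch) unmatched.Add(a);
        }
        return unmatched;
    }
}
```

With A = 1.6, 1.5, 4 and B = 1.55, 1.45, 3.2 and a tolerance of 0.1, only 4 is reported as unmatched. Extending this to several columns would need a multi-dimensional structure or a pre-grouping on the exact-match column.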
range tree? http://en.wikipedia.org/wiki/Range_tree
i dont really know, just throwing ideas out there
From the statement "My final goal is to find the rows in the first set with no match in second set." I understand that there can be multiple rows in first set that match the same row in the second set. In this case the solution is to remove the line "remove b from SetB" from your naive algorithm.
If, however, you really need one-to-one matches between elements of the two sets, then the answer with "maximum-bipartite-matching" provided by Corey Kosak applies.
Given your constraints, I don't see a way to do it in less than O(n^2). I'd probably modify your naive algorithm to include either a bool or a count field for each row in table A and then mark it if it matches a row in table B.
Then post process it with std::partition based on the indicator to group all the unique and non unique rows together. If you go with a count, you could get the rows that were "least unique". The bool would be somewhat more efficient since you could break out of the loop over B at the first match.
Two rows are equivalent when all the criteria are satisfied. My final goal is to find the rows in the first set with no match in second set.
foreach a in SetA
{
    foreach b in SetB
    {
        if (a == b) //why would you alter SetB at all
            go to next A
    }
    remove a from SetA //log a is not in SetB
}
However, you are right, that this is equivalent to some famous, well known and already solved problem. It's called "Set Difference". It's... kind of a major part of set theory. And since all those languages have sets, they also have that algorithm. C++ even has a dedicated function for it. Approximate Complexity of all of these is O(2(A+B)-1).
C++ standard algorithm function: http://www.cplusplus.com/reference/algorithm/set_difference/
// A and B must be sorted
vector<row> Result(A.rows());
auto end = std::set_difference(A.begin(), A.end(),
                               B.begin(), B.end(),
                               Result.begin());
Result.resize(end - Result.begin());
or std::unordered_set can be made to do this: http://msdn.microsoft.com/en-us/library/bb982739.aspx
std::unordered_set<row> Result(A.begin(), A.end());
for (auto i = B.begin(); i != B.end(); ++i) {
    auto f = Result.find(*i);
    if (f != Result.end())
        Result.erase(f);
}
Java does as well: http://download.oracle.com/javase/tutorial/collections/interfaces/set.html
Set<row> Result = new HashSet<row>(A);
Result.removeAll(B);
And C#: http://msdn.microsoft.com/en-us/library/bb299875.aspx
HashSet<row> Result = new HashSet<row>(A);
Result.ExceptWith(B);

Centering Divisions Around Zero

I'm trying to create something that sort of resembles a histogram. I'm trying to create buckets from an array.
Suppose I have a random array of doubles between -10 and 10; this is very simplified. I then want to specify a center point, in this case 0, and the number of buckets.
If I want 4 buckets, the divisions would be -10 to -5, -5 to 0, 0 to 5 and 5 to 10. Not that complicated, right? Now if I change the min and max to -12 and -9 and ask for 4 divisions, it's more complicated. I either want a division from -3 to 3, so that it is centered around 0, or divisions from -6 to 0 and 0 to 6.
It's not that hard to find the division size:
= Math.Ceiling((Abs(Max) + Abs(Min)) / Divisions)
Then you would basically have an if statement to determine whether you want it centered on 0 or on an edge. You then iterate out from either 0 or DivisionSize/2 depending on the situation. You may not ALWAYS end up with the specified number of divisions but it will be close. Then you iterate through the array and increment the bin count.
Does this seem like a good way to go about this? This method would surely work but it does not seem to be the most elegant. I'm curious as to whether the creation of the bins and the counting from the list could be done in a clever class with linq in a more elegant way?
Something like creating the bins and then having each bin be a property {get;} that returns list.Count(x=> x >= Lower && x < Upper).
To me it seems simpler: You need to find lower bound and size of each "division".
Since you want it to be symmetrical around 0, depending on the number of divisions you either get a bucket that includes 0 for odd numbers (-3,3) or buckets on either side of 0 for even ones (-3,0)(0,3).
lowerBound = -Max(Abs(from), Abs(to))
bucketSize = 2 * Abs(lowerBound) / divisions
(throw in Ceiling and update bucketSize and lowerBound if needed)
Then use .Aggregate to update the array of buckets (the position would be (value - lowerBound) / bucketSize, with additional range checks if needed).
Note: do not implement get the way you suggested - getters are not expected to perform non-trivial work like walking a large array.
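Put together, the bounds and the .Aggregate step could be sketched like this in C#. It is a rough illustration with made-up names; the optional Ceiling rounding is left out, and values on the upper edge are clamped into the last bucket as one possible range check.

```csharp
using System;
using System.Linq;

public static class SymmetricHistogram
{
    // Counts 'values' into 'divisions' equal buckets laid out symmetrically around 0.
    public static int[] Count(double[] values, double from, double to, int divisions)
    {
        if (divisions <= 0) throw new ArgumentOutOfRangeException(nameof(divisions));

        double lowerBound = -Math.Max(Math.Abs(from), Math.Abs(to));
        if (lowerBound == 0) throw new ArgumentException("The range must not be empty.");
        double bucketSize = 2 * Math.Abs(lowerBound) / divisions;

        return values.Aggregate(new int[divisions], (buckets, value) =>
        {
            int position = (int)Math.Floor((value - lowerBound) / bucketSize);
            // Range check: clamp anything outside into the first or last bucket.
            position = Math.Min(Math.Max(position, 0), divisions - 1);
            buckets[position]++;
            return buckets;
        });
    }
}
```

For values between -10 and 10 with 4 divisions this gives the buckets -10 to -5, -5 to 0, 0 to 5 and 5 to 10 from the question, and the whole array is walked once instead of once per bucket.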
