best algorithm to reconcile 3 lists


I am looking for a way to reconcile elements from 3 different sources. I've simplified the elements to having just a key (string) and a version (long).
The lists are obtained concurrently (2 from separate database queries, and 1 from a memory cache on another system).
For my end result, I only care about elements whose versions are not identical across all 3 sources. So the result I care about would be a list of keys, with the corresponding version in each system:
Element1 | system1:v100 | system2:v100 | system3:v101 |
Element2 | system1:missing | system2:v200 | system3:v200 |
Elements with identical versions can be discarded.
The 2 ways of achieving this that I thought of are:
1. Wait for all data sources to finish retrieving, and then loop through each list to aggregate a master list with a union of keys + all 3 versions (discarding all identical items).
2. As soon as the first list has been retrieved, put it into a concurrent collection such as a dictionary (offered in .NET 4.0), and start aggregating the remaining lists (into the concurrent collection) as soon as they are available.
My thinking is that the second approach will be a little quicker, but probably not by much. I can't really do much until all 3 sources are present, so not much is gained from the 2nd approach, and contention is introduced.
Maybe there is a completely different way to go about this? Also, since versions are stored as longs, and there will be hundreds of thousands (possibly millions) of elements, memory allocation could be a concern (though probably not a big one, since these objects are short-lived).

HashSet is an option, as it has Union and Intersect methods.
HashSet.UnionWith Method
To use this you must override Equals and GetHashCode.
A good (unique) hash is key to performance.
If every version is "v" followed by a number, you could use the numeric part to build the hash, with missing as 0.
GetHashCode gives you an Int32 to play with, so if each version fits in 10 bits or fewer, the three versions can be packed into it to create a perfect hash.
Another option is ConcurrentDictionary (there is no concurrent HashSet), with all three sources feeding into it.
You still need to override Equals and GetHashCode.
My gut feel is that three HashSets followed by a Union would be faster.
If all versions are numeric and you can use 0 for missing, then you could just pack them into a UInt32 or UInt64 and put that directly in a HashSet. After the Union, unpack. Use bit shifting (<<) rather than math to pack and unpack.
The example below packs just two UInt16 values, but it runs in 2 seconds.
This is going to be faster than hashing classes.
If all three versions are long, then HashSet<integral type> will not be an option.
long1 ^ long2 ^ long3 might be a good hash, but that is not my expertise.
I know GetHashCode on a Tuple is bad.
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        HashSetComposite hsc1 = new HashSetComposite();
        HashSetComposite hsc2 = new HashSetComposite();
        for (UInt16 i = 0; i < 100; i++)
        {
            for (UInt16 j = 0; j < 40000; j++)
            {
                hsc1.Add(i, j);
            }
            for (UInt16 j = 20000; j < 60000; j++)
            {
                hsc2.Add(i, j);
            }
        }
        // LINQ Intersect/Union leave both sets unchanged
        Console.WriteLine(hsc1.Intersect(hsc2).Count().ToString());
        Console.WriteLine(hsc1.Union(hsc2).Count().ToString());
    }
}
public class HashSetComposite : HashSet<UInt32>
{
    // pack two 16-bit values into one 32-bit key: u1 in the high bits, u2 in the low bits
    public void Add(UInt16 u1, UInt16 u2)
    {
        UInt32 unsignedKey = (((UInt32)u1) << 16) | u2;
        Add(unsignedKey);
    }
    //left over notes from long
    //ulong unsignedKey = (ulong) key;
    //uint lowBits = (uint) (unsignedKey & 0xffffffffUL);
    //uint highBits = (uint) (unsignedKey >> 32);
    //int i1 = (int) highBits;
    //int i2 = (int) lowBits;
}
Tested using a ConcurrentDictionary and the above was over twice as fast.
Taking locks on the inserts is expensive.
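The "pack, union, then unpack" idea from this answer could look roughly like the sketch below. It is not the answerer's code: the VersionPacking class and its Pack and Unpack helpers are hypothetical names, and it assumes both values fit in 16 bits.

```csharp
using System;
using System.Collections.Generic;

public static class VersionPacking
{
    // Packs two 16-bit values into one 32-bit key using shifts (hypothetical helper).
    public static uint Pack(ushort high, ushort low)
    {
        return ((uint)high << 16) | low;
    }

    // Reverses Pack: the upper 16 bits become 'high', the lower 16 bits become 'low'.
    public static void Unpack(uint packed, out ushort high, out ushort low)
    {
        high = (ushort)(packed >> 16);
        low = (ushort)(packed & 0xFFFF);
    }

    public static void Main()
    {
        var first = new HashSet<uint> { Pack(1, 100), Pack(2, 200) };
        var second = new HashSet<uint> { Pack(2, 200), Pack(3, 300) };

        first.UnionWith(second); // in-place union; duplicate keys collapse

        foreach (uint packed in first)
        {
            Unpack(packed, out ushort high, out ushort low);
            Console.WriteLine(high + " / " + low);
        }
    }
}
```

Because the packed key is a plain integer, the set uses the built-in integer hashing and equality, so no Equals or GetHashCode override is needed.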

Your problem seems suitable for an event-based solution. Basically, assign an event to the completion of data retrieval for each of your sources. Keep a global concurrent hash that maps each element key to a list. In each event handler, go over the completed data source: if the concurrent hash already contains the key of the current element, add the element to that key's list; if not, insert a new list containing the element.
But depending on your performance requirements this may overcomplicate your application. Your first method would be the simplest one to use.
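A minimal sketch of that idea, under stated assumptions: the Reconciler class, its Merge and Mismatches methods, and the Missing sentinel (-1, assuming real versions are never negative) are all hypothetical, and a fixed three-slot array stands in for the per-key list. Each source is merged as soon as it is available, and keys whose three versions agree are filtered out at the end.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Reconciler
{
    // Assumed sentinel for "key not present in this source".
    public const long Missing = -1;

    // Records every (key, version) pair from one source in that source's slot.
    public static void Merge(ConcurrentDictionary<string, long[]> table,
                             int sourceIndex,
                             IEnumerable<KeyValuePair<string, long>> source)
    {
        foreach (var item in source)
        {
            long[] versions = table.GetOrAdd(
                item.Key, _ => new[] { Missing, Missing, Missing });
            versions[sourceIndex] = item.Value; // each source writes only its own slot
        }
    }

    // Keeps only the keys whose three versions are not all identical.
    public static Dictionary<string, long[]> Mismatches(
        ConcurrentDictionary<string, long[]> table)
    {
        return table
            .Where(p => p.Value[0] != p.Value[1] || p.Value[1] != p.Value[2])
            .ToDictionary(p => p.Key, p => p.Value);
    }

    public static void Main()
    {
        var sources = new[]
        {
            new Dictionary<string, long> { ["Element1"] = 100, ["Element3"] = 300 },
            new Dictionary<string, long> { ["Element1"] = 100, ["Element2"] = 200, ["Element3"] = 300 },
            new Dictionary<string, long> { ["Element1"] = 101, ["Element2"] = 200, ["Element3"] = 300 },
        };

        var table = new ConcurrentDictionary<string, long[]>();
        Parallel.For(0, sources.Length, i => Merge(table, i, sources[i]));

        foreach (var pair in Mismatches(table).OrderBy(p => p.Key))
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value));
    }
}
```

Whether this beats three plain collections merged after all sources finish would need to be measured on real data; the previous answer reports that ConcurrentDictionary inserts were noticeably slower in its test.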

Related

Reading time of arrays with an equal number of elements but different dimensions

I have 3 arrays that hold integer values: a 4-dimensional array, a 2-dimensional array, and a single-dimensional array. The total number of elements is the same in each. I'm going to print all the elements of these arrays to the console. Which one prints the fastest? Or are the printing times equal?
int[,,,] Q = new int[4, 4, 4, 4];
int[,] W = new int[16,16];
int[] X = new int[256];

Unless I'm missing something, there are two main ways you could be iterating over the multi-dimensional arrays.
The first is:
int[,] W = new int[16,16];
for(int i = 0; i < 16; i++)
{
    for(int j = 0; j < 16; j++)
        Console.WriteLine(W[i, j]); // a rectangular int[,] is indexed with [i, j]
}
This method is slower than iterating over the single-dimensional array, as the only difference is that for every 16 members, you need to start a new iteration of the outside loop and re-initiate the inner loop.
The second is:
for(int i = 0; i < 256; i++)
{
    Console.WriteLine(W[i / 16, i % 16]);
}
This method is slower because every iteration you need to calculate both (i / 16) and (i % 16).
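Ignoring the iteration factor, there is also the time it takes to access another pointer every iteration. (Strictly, that extra pointer applies to jagged arrays such as int[][]; a rectangular int[,] is a single block of memory, but the address of each element still has to be computed from several indices.)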
To the extent of my knowledge in boolean functions*, given two sets of two integers, one of them bigger numbers but both having the same size in memory (as is the case for all numbers of type int in c#), the time to compute the addition of the two sets would be exactly the same (as in the number of clock ticks, but it's not something I'd expect everyone who stumbles upon this question to be familiar with). This being the case, the time for calculating the address of an array member is not dependent upon how big its index is.
So to summarize, unless I'm missing something or I'm way rustier than I think, there is one factor that is guaranteed to lengthen the time it takes for iterating over multidimensional arrays (the extra pointers to access), another factor that is guaranteed to do the same, but you can choose one of two options for (multiple loops or additional calculations every iteration of the loop), and there are no factors that would slow down the single-dimensional array approach (no "tax" for an extra long index).
CONCLUSIONS:
That makes it two factors working for a single-dimensional array, and none for a multi-dimensional one.
Thus, I would assume the single-dimensional array would be faster
That being said, you're using C#, so you're probably not really looking for that insignificant an edge or you'd use a low-level language. And if you are, you should probably either switch to a low-level language or really contemplate whether you are doing whatever it is you're trying to in the best way possible (the only case where this could make an actual difference, that I can think of, is if you load into your code a whole 1 million record plus database, and that's really bad practice).
However, if you're just starting out in C# then you're probably just overthinking it.
Whichever it is, this was a fun hypothetical, so thanks for asking it!
*by boolean functions, I mean functions at the binary level, not C# functions returning a bool value
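Rather than reasoning about it, the difference can be measured. The sketch below is only an illustration (the ArrayTiming class and its method names are made up here): it times summing the elements instead of printing them, because console output would dominate the measurement. Results vary by machine and runtime, so no numbers are claimed.

```csharp
using System;
using System.Diagnostics;

public static class ArrayTiming
{
    public static long Sum4D(int[,,,] a)
    {
        long sum = 0;
        for (int i = 0; i < a.GetLength(0); i++)
            for (int j = 0; j < a.GetLength(1); j++)
                for (int k = 0; k < a.GetLength(2); k++)
                    for (int l = 0; l < a.GetLength(3); l++)
                        sum += a[i, j, k, l];
        return sum;
    }

    public static long Sum2D(int[,] a)
    {
        long sum = 0;
        for (int i = 0; i < a.GetLength(0); i++)
            for (int j = 0; j < a.GetLength(1); j++)
                sum += a[i, j];
        return sum;
    }

    public static long Sum1D(int[] a)
    {
        long sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i];
        return sum;
    }

    // Runs 'body' repeatedly and returns the elapsed time in milliseconds.
    public static double Time(Func<long> body, int rounds)
    {
        body(); // warm-up call so JIT compilation is not measured
        long sink = 0;
        var watch = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
            sink += body();
        watch.Stop();
        GC.KeepAlive(sink); // keep the accumulated result observable
        return watch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        var q = new int[4, 4, 4, 4];
        var w = new int[16, 16];
        var x = new int[256];
        const int rounds = 100000;

        Console.WriteLine("4-D: " + Time(() => Sum4D(q), rounds) + " ms");
        Console.WriteLine("2-D: " + Time(() => Sum2D(w), rounds) + " ms");
        Console.WriteLine("1-D: " + Time(() => Sum1D(x), rounds) + " ms");
    }
}
```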

Does increasing RNG seed by 1 each time ensure not getting same consecutive values?

I've had a chance to see some interesting piece of code that was either used as an April Fools joke (update became public on April 1st) or was just a bug, because someone did not understand how to use RNGs.
Question is related to Random class being part of .NET/C#, but maybe other RNGs work the same way.
Simplified version of the code I found, after taking out all the unnecessary details, would look like this:
for ( int i = startI; i < stopI; ++i ) {
    int newValue = new Random( i ).Next( 0, 3 ); // new RNG with seed i generates a single value from 3 options: 0, 1 and 2
    // rest of code
}
I did run simple test of that code in LINQPad to see if what I was observing in program was just my "luck" or whether maybe that's actually how RNG used this way will work. Here's the code:
int lastChoice = -1;
int streakLength = -1;
for ( int i = 0; i < 100000000; ++i ) {
    int newChoice = new Random( i ).Next( 0, 3 );
    if ( newChoice == lastChoice ) {
        streakLength++;
        ( i + ";" + lastChoice + ";" + streakLength ).Dump();
    } else {
        lastChoice = newChoice;
        streakLength = 1;
    }
}
"The End".Dump();
(The Dump() method simply prints the value to the screen)
The result of running this "script" was just "The End", nothing more. It means, that for 100M cycles of generating a random value, not a single time was it able to generate same consecutive values, when having only 3 of them as an option.
So back to my question from the title - does increasing the seed of RNG (specifically the .NET/C#'s Random class, but general answer is also welcome) by one after every (integer) random number generation would ensure that no repeated consecutive values would occur? Or is that just pure luck?

The behavior you show depends on the PRNG.
For many PRNGs, including linear PRNGs such as the one implemented in the .NET Framework's System.Random, if you initialize two instances of a PRNG with consecutive seeds, the number sequences they produce may be correlated with each other, even though each of those sequences produces random-behaving numbers on its own. The behavior you describe in your question is just one possible result of this.
For System.Random in particular, this phenomenon is described in further detail in "A Primer on Repeatable Random Numbers".
Other PRNGs, however, give each seed its own independent pseudorandom number sequence (examples are SFC64 and counter-based PRNGs; see, e.g., "Parallel Random Numbers: As Easy as 1, 2, 3"), and some PRNGs can be "jumped ahead" a huge number of steps to produce pseudorandom number sequences that are independent of each other.
See also:
Impact of setting random.seed() to recreate a simulated behaviour and choosing the seed
Seeding Multiple Random Number Generators
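For comparison, the usual way to sidestep this kind of seed correlation is to create one Random instance and reuse it for every draw. The sketch below counts consecutive repeats for both patterns; the RngDemo class, the CountRepeats helper, and the seed 12345 are illustrative choices, not from the original post. For independent uniform draws from three values, roughly one draw in three should equal its predecessor.

```csharp
using System;

public static class RngDemo
{
    // Counts how often a draw equals the draw immediately before it.
    public static int CountRepeats(Func<int, int> draw, int count)
    {
        int repeats = 0;
        int last = -1;
        for (int i = 0; i < count; i++)
        {
            int value = draw(i);
            if (value == last)
                repeats++;
            last = value;
        }
        return repeats;
    }

    public static void Main()
    {
        const int count = 1000000;

        // Pattern from the question: a fresh generator per draw, seeded with the loop index.
        int perSeed = CountRepeats(i => new Random(i).Next(0, 3), count);

        // Usual pattern: one generator, reused for every draw (the seed is arbitrary).
        var shared = new Random(12345);
        int sharedRepeats = CountRepeats(_ => shared.Next(0, 3), count);

        Console.WriteLine("New Random(i) per draw: " + perSeed + " repeats");
        Console.WriteLine("Single shared Random:   " + sharedRepeats + " repeats");
    }
}
```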

Sort a list by only the swapping of its elements

What would be the optimal solution to the following problem:
Given a list of values (for example, numbers ranging from 0 to 14), how would you sort them using only swap operations (for example, swapping the 0th and the 9th element in the list)? Your goal is to find the solution with the fewest swaps.
Thank you in advance

Assuming the values are 0 to n-1 for an array of size n, here is an algorithm with O(n) time complexity, and it should be the optimal algorithm for minimizing swaps. Every swap will place at least one value (sometimes both) in its proper location.
#include <utility> // std::swap
// the values of A[] range from 0 to n-1
void sort(int A[], int n)
{
    for(int i = 0; i < n; i++)
        while(A[i] != i)
            std::swap(A[i], A[A[i]]);
}
For a more generic solution and assuming that only the swaps used to sort the original array are counted, generate an array of indices to the array to be sorted, sort the array of indices according to the array to be sorted (using any sort algorithm), then use the above algorithm to sort the original array and the array of indices at the same time. Using C++ to describe this, and using a lambda compare for this example:
#include <algorithm> // std::sort
#include <utility>   // std::swap
void sort(int A[], int n)
{
    // generate indices I[]
    int *I = new int[n];
    for(int i = 0; i < n; i++)
        I[i] = i;
    // sort I according to A
    std::sort(I, I+n,
        [&A](int i, int j)
        {return A[i] < A[j];});
    // sort A and I according to I using swaps
    for(int i = 0; i < n; i++){
        while(I[i] != i){
            int j = I[i];
            std::swap(I[i], I[j]);    // I[j] == j now, so A[j] must receive its final value
            std::swap(A[j], A[I[i]]); // only this swap is counted
        }
    }
    delete[] I;
}
For languages without the equivalent of a lambda compare, a custom sort function can be used. Sorting is accomplished by undoing the "cycles" in the array, with O(n) time complexity. Every permutation of an array can be considered as a series of cycles. The value is really the order for the element, but in this case the ordering and value are the same:
index 0 1 2 3 4 5 6 7
value 6 3 1 2 4 0 7 5
The cycles are the "paths" to follow a chain of values, start with index 0, which has a value of 6, then go to index 6 which has a value of 7 and repeat the process until the cycle completes back at index 0. Repeat for the rest of the array. For this example, the cycles are:
0->6 6->7 7->5 5->0
1->3 3->2 2->1
4->4
Following the algorithm shown above the swaps are:
swap(a[0],a[6]) // puts 6 into place
swap(a[0],a[7]) // puts 7 into place
swap(a[0],a[5]) // puts 0 and 5 into place
swap(a[1],a[3]) // puts 3 into place
swap(a[1],a[2]) // puts 1 and 2 into place
// done
Link to the more practical example of sorting multiple arrays according to one of them. In this example, the cycles are done using a series of moves instead of swaps:
Sorting two arrays based on one with standard library (copy steps avoided)
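Since the question is tagged C#, here is a hedged C# sketch of the first algorithm that also counts swaps and cycles (the SwapSort class and its method names are made up for illustration). It shows the relationship behind the optimality claim: a permutation of n values with c cycles, counting fixed points as cycles of length 1, needs n - c swaps. For the example above that is 8 - 3 = 5, matching the five swaps listed.

```csharp
using System;

public static class SwapSort
{
    // Sorts a permutation of 0..n-1 in place and returns the number of swaps performed.
    public static int SortCountingSwaps(int[] a)
    {
        int swaps = 0;
        for (int i = 0; i < a.Length; i++)
        {
            while (a[i] != i)
            {
                int target = a[i];   // the value a[i] belongs at index 'target'
                a[i] = a[target];
                a[target] = target;  // 'target' is now in its final position
                swaps++;
            }
        }
        return swaps;
    }

    // Counts the cycles of the permutation (fixed points included) without modifying it.
    public static int CountCycles(int[] a)
    {
        var seen = new bool[a.Length];
        int cycles = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (seen[i])
                continue;
            cycles++;
            for (int j = i; !seen[j]; j = a[j])
                seen[j] = true;
        }
        return cycles;
    }

    public static void Main()
    {
        int[] values = { 6, 3, 1, 2, 4, 0, 7, 5 };
        int cycles = CountCycles(values);
        int swaps = SortCountingSwaps(values);
        Console.WriteLine("cycles = " + cycles + ", swaps = " + swaps);
        Console.WriteLine(string.Join(" ", values));
    }
}
```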

What you're searching for is a sorting algorithm.
https://brilliant.org/wiki/sorting-algorithms/
A good one is "QuickSort" combined with a simpler sorting algorithm like "BubbleSort"
Ted-Ed also have a good video on the topic:
https://www.youtube.com/watch?v=WaNLJf8xzC4

Probably the best way to find the answer to this question is to open your favorite search engine and put the title to your question there. You will find many results, including:
Sorting algorithm - Wikipedia (which includes a section on Popular sorting algorithms)
10.4. Sorting Algorithms - Introductory Programming in C#
Read through these and find the algorithms that only use the swapping of elements to do the sorting (since that is your requirement). You can also read about the performance of the algorithms as well (since that was another part of the requirement).
Note that some will perform faster than others depending on how large and how sorted the array is.
Another way to figure this out is to ask yourself, "What do the professionals do?". Which will likely lead you to reading the documentation for the Array.Sort Method, which is the built-in mechanism that most of us use if we need to quickly sort an array. Here you will find the following information:
Remarks
This method uses the introspective sort (introsort) algorithm as follows:
If the partition size is fewer than 16 elements, it uses an insertion sort algorithm.
If the number of partitions exceeds 2 * LogN, where N is the range of the input array, it uses a Heapsort algorithm.
Otherwise, it uses a Quicksort algorithm.
So now we see that, for small partitions (like your example with 15 elements), the pros use insertion sort.

Most efficient way to store and retrieve a 512-bit number?

I have a String of 512 characters that contains only 0s and 1s. I'm trying to represent it in a data structure that saves space. Is BitArray the most efficient way?
I'm also thinking about using 16 int32 to store the number, which would then be 16 * 4 = 64 bytes.

Most efficient can mean many different things...
1. Most efficient from a memory management perspective?
2. Most efficient from a CPU calculation perspective?
3. Most efficient from a usage perspective? (In respect to writing code that uses the numbers for calculations)
For 1 - use byte[64] or long[8] - if you aren't doing calculations or don't mind writing your own calculations.
For 3 definitely BigInteger is the way to go. You have your math functions already defined and you just need to turn your binary number into a decimal representation.
EDIT: Sounds like you don't want BigInteger due to size concerns... however I think you are going to find that you will of course have to parse this as an enumerable / yield combo where you are parsing it a bit at a time and don't hold the entire data structure in memory at the same time.
That being said... I can help you somewhat with parsing your string into an array of 64-bit integers... Thanks to King King for part of this LINQ statement.
// convert string into an array of 64-bit unsigned integers (ulong)
// Note that MSB is in result[0]
var result = input.Select((x, i) => i)
    .Where(i => i % 64 == 0)
    .Select(i => input.Substring(i, input.Length - i >= 64 ?
        64 : input.Length - i))
    .Select(x => Convert.ToUInt64(x, 2))
    .ToArray();
If you decide you want a different array structure byte[64] or whatever it should be easy to modify.
EDIT 2: OK I got bored so I wrote an EditDifference function for fun... here you go...
static public int GetEditDistance(ulong[] first, ulong[] second)
{
    int editDifference = 0;
    var smallestArraySize = Math.Min(first.Length, second.Length);
    for (var i = 0; i < smallestArraySize; i++)
    {
        long signedDifference;
        var f = first[i];
        var s = second[i];
        var biggest = Math.Max(f, s);
        var smallest = Math.Min(f, s);
        var difference = biggest - smallest;
        if (difference > long.MaxValue)
        {
            editDifference += 1;
            signedDifference = Convert.ToInt64(difference - long.MaxValue - 1);
        }
        else
            signedDifference = Convert.ToInt64(difference);
        editDifference += Convert.ToString(signedDifference, 2)
            .Count(x => x == '1');
    }
    // if arrays are different sizes every bit is considered to be different
    var differenceOfArraySize =
        Math.Max(first.Length, second.Length) - smallestArraySize;
    if (differenceOfArraySize > 0)
        editDifference += differenceOfArraySize * 64;
    return editDifference;
}
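Note that the function above counts the set bits of the arithmetic difference between words, which is not the same as the number of bit positions that differ (for example, 8 and 7 differ in four bit positions, but 8 - 7 = 1 has a single set bit). If the measure you want is the number of differing positions (the Hamming distance), XOR followed by a bit count is the usual approach. The sketch below assumes that interpretation; the BitStrings class and its methods are hypothetical names.

```csharp
using System;

public static class BitStrings
{
    // Parses a string of '0'/'1' characters into 64-bit words, most significant word first.
    public static ulong[] Parse(string bits)
    {
        if (bits.Length % 64 != 0)
            throw new ArgumentException("Length must be a multiple of 64.", nameof(bits));
        var words = new ulong[bits.Length / 64];
        for (int w = 0; w < words.Length; w++)
            words[w] = Convert.ToUInt64(bits.Substring(w * 64, 64), 2);
        return words;
    }

    // Counts set bits by repeatedly clearing the lowest one.
    public static int PopCount(ulong value)
    {
        int count = 0;
        while (value != 0)
        {
            value &= value - 1;
            count++;
        }
        return count;
    }

    // Number of bit positions in which the two equally sized numbers differ.
    public static int HammingDistance(ulong[] first, ulong[] second)
    {
        if (first.Length != second.Length)
            throw new ArgumentException("Arrays must have the same length.");
        int distance = 0;
        for (int i = 0; i < first.Length; i++)
            distance += PopCount(first[i] ^ second[i]);
        return distance;
    }

    public static void Main()
    {
        string a = new string('0', 512);
        string b = "1" + new string('0', 510) + "1";
        ulong[] wordsA = Parse(a);
        ulong[] wordsB = Parse(b);
        Console.WriteLine(wordsA.Length + " words, distance " + HammingDistance(wordsA, wordsB));
    }
}
```

Stored this way, a 512-bit value takes eight 64-bit words (64 bytes of payload plus the array object overhead).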

Use BigInteger from .NET. It can easily support 512-bit numbers as well as operations on those numbers.
BigInteger.Parse("your huge number");

BitArray (with 512 bits), byte[64], int[16], long[8] (or List<> variants of those), or BigInteger will all be much more efficient than your String. I'd say that byte[] is the most idiomatic/typical way of representing data such as this, in general. For example, ComputeHash uses byte[] and Streams deal with byte[]s, and if you store this data as a BLOB in a DB, byte[] will be the most natural way to work with that data. For that reason, it'd probably make sense to use this.
On the other hand, if this data represents a number that you might do numeric things to like addition and subtraction, you probably want to use a BigInteger.
These approaches have roughly the same performance as each other, so you should choose between them based primarily on things like what makes sense, and secondarily on performance benchmarked in your usage.

The most efficient would be having eight UInt64/ulong or Int64/long typed variables (or a single array), although this might not be optimal for querying/setting. One way to get around this is, indeed, to use a BitArray (which is basically a wrapper around the former method, including additional overhead [1]). It's a matter of choice either for easy use or efficient storage.
If this isn't sufficient, you can always choose to apply compression, such as RLE-encoding or various other widely available encoding methods (gzip/bzip/etc...). This will require additional processing power though.
It depends on your definition of efficient.
[1] Additional overhead, as in storage overhead. BitArray internally uses an Int32 array to store values. In addition to that, BitArray stores its current mutation version, the number of ints 'allocated', and a syncroot. Even though the overhead is negligible for a small number of values, it can be an issue if you keep a lot of these in memory.

Dealing With Combinations

In C# I created a list array containing a list of varied indexes. I'd like to display 1 combination of 2 combinations of different indexes. The 2 combinations inside the one must not be repeated.
I am trying to code a tennis tournament with 14 players that pair. Each player must never be paired with another player twice.

Your problem falls under the domain of the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
1. Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
2. Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
3. Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
4. Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
There are 2 different ways to interpret your problem. In tennis, tournaments are usually arranged as single elimination, where the winning player from each match advances. However, some local clubs also use round robins, where each player plays each other player just once, which appears to be the problem that you are looking at.
So, the question is: how do you calculate the total number of unique matches that can be played with 14 players (N = 14), where each match pairs two players (and thus K = 2)? The binomial coefficient calculation is as follows:
Total number of unique combinations = N! / (K! * (N - K)!). The ! character denotes a factorial, and N! means N * (N-1) * (N-2) ... * 1. When K is 2, the binomial coefficient reduces to N * (N - 1) / 2. So, plugging in 14 for N and 2 for K, we find that the total number of combinations is 14 * 13 / 2 = 91.
The following code will iterate through each unique combination:
int N = 14; // Total number of elements in the set.
int K = 2; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The KIndexes array specifies the 2 players, starting with index 0.
int[] KIndexes = new int[K];
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
    // Get the k-indexes for this combination.
    BC.GetKIndexes(Combo, KIndexes);
    // KIndexes[0] is the first player & KIndexes[1] is the 2nd player.
    // Print out the indexes for both players.
    String S = "Player1 = " + KIndexes[0].ToString() + ", " +
               "Player2 = " + KIndexes[1].ToString();
    Console.WriteLine(S);
}
You should be able to port this class over fairly easily to the language of your choice. You probably will not have to port over the generic part of the class to accomplish your goals. Depending on the number of combinations you are working with, you might need to use a bigger word size than 4 byte ints.
I should also mention, that since this is a class project, your teacher might not accept the above answer since he might be looking for more original work. In that case, you might want to consider using loops. You should check with him before submitting a solution.
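For the plain-loops route, a minimal sketch might look like the following. The Pairings class and AllPairs method are hypothetical names, and players are identified by index 0 to 13. Starting the inner loop at first + 1 guarantees that no pair is listed twice and that no player is paired with themselves.

```csharp
using System;
using System.Collections.Generic;

public static class Pairings
{
    // Lists every unordered pair (first < second) of player indexes 0..playerCount-1.
    public static List<Tuple<int, int>> AllPairs(int playerCount)
    {
        var pairs = new List<Tuple<int, int>>();
        for (int first = 0; first < playerCount; first++)
        {
            for (int second = first + 1; second < playerCount; second++)
            {
                pairs.Add(Tuple.Create(first, second));
            }
        }
        return pairs;
    }

    public static void Main()
    {
        List<Tuple<int, int>> pairs = AllPairs(14);
        foreach (Tuple<int, int> pair in pairs)
            Console.WriteLine("Player1 = " + pair.Item1 + ", Player2 = " + pair.Item2);
        Console.WriteLine("Total matches: " + pairs.Count);
    }
}
```

This only enumerates the matches; arranging them into rounds so that every player plays once per round is a separate scheduling step.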
