Need help on an algorithm - C#

I need help on an algorithm. I have randomly generated 6-digit numbers, like:
123654
109431
There are approximately 1 million of them saved in a file, one per line. I have to filter them according to the rule I try to describe below.
Take a number and compare it to all the others digit by digit. If another number is identical except that one digit is bigger by exactly one, delete that number. Let me show it with numbers.
Our number is: 123456
Increase the first digit by 1, so the number becomes 223456. Delete all the 223456s from the file.
Increase the second digit by 1, the number becomes: 133456. Delete all 133456s from the file, and so on...
I can do it just as I describe, but I need it to be "FAST".
So can anyone help me on this?
Thanks.

First of all, since it is around 1 million numbers, you had better perform the algorithm in RAM, not on disk: first load the contents into an array, then modify the array, then write the results back to the file.
I would suggest the following straightforward algorithm. Precalculate all the target numbers, in this case 223456, 133456, 124456, 123556, 123466, 123457. Now make a pass over the array, and if a number is NOT one of these, write it to another array. Alternatively, if it is one of these numbers, delete it (recommended if your data structure has O(1) removal).
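For illustration, here is a minimal sketch of that filter for a single source number (the file names are made up, and I'm assuming a digit of 9 simply has no target rather than wrapping around):

using System.Collections.Generic;
using System.IO;
using System.Linq;

class TargetFilter
{
    // The six "one digit incremented" targets of a 6-digit number.
    static HashSet<int> Targets(int number)
    {
        var targets = new HashSet<int>();
        foreach (int p in new[] { 100000, 10000, 1000, 100, 10, 1 })
        {
            if ((number / p) % 10 != 9)   // assumption: a 9 has no target
                targets.Add(number + p);
        }
        return targets;
    }

    static void Main()
    {
        var targets = Targets(123456);
        var kept = File.ReadAllLines("numbers.txt")
                       .Select(int.Parse)
                       .Where(n => !targets.Contains(n));
        File.WriteAllLines("filtered.txt", kept.Select(n => n.ToString("000000")));
    }
}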

This algorithm will keep a lot of numbers in memory, but it processes the file one number at a time, so you don't actually need to read it all in at once. You only need to supply an IEnumerable<int> for it to operate on.
public static IEnumerable<int> FilterInts(IEnumerable<int> ints)
{
    var removed = new HashSet<int>();
    foreach (var i in ints)
    {
        // Generate the six variants of i (one digit incremented, 9 wraps to 0)
        // and remember them so later occurrences get filtered out.
        var iStr = i.ToString("000000").ToCharArray();
        for (int j = 0; j < iStr.Length; j++)
        {
            var c = iStr[j];
            if (c == '9')
                iStr[j] = '0';
            else
                iStr[j] = (char)(c + 1);
            removed.Add(int.Parse(new string(iStr)));
            iStr[j] = c;   // restore the digit before moving on
        }
        // Note: a number is suppressed only if one of its "sources" appeared
        // earlier in the stream; a two-pass approach would catch both orders.
        if (!removed.Contains(i))
            yield return i;
    }
}
You can use this method to create an IEnumerable<int> from the file:
public static IEnumerable<int> ReadIntsFrom(string path)
{
    using (var reader = File.OpenText(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return int.Parse(line);
    }
}

Take all the numbers from the file into an array or List, then:
use as many threads as there are digits,
increment the first digit of the number in the first thread, the second digit in the second thread, and so on, then compare each result with the rest of the numbers.
It should be fast, since the work runs in parallel...
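That could be sketched with PLINQ instead of hand-rolled threads; IncrementDigit is a hypothetical helper, and skipping digits of 9 is my assumption:

using System.Collections.Generic;
using System.Linq;

static class ParallelFilter
{
    static readonly int[] Powers = { 1, 10, 100, 1000, 10000, 100000 };

    // Bump one digit of a 6-digit number, or return null for a 9.
    static int? IncrementDigit(int number, int place)
    {
        int p = Powers[place];
        return (number / p) % 10 == 9 ? (int?)null : number + p;
    }

    public static List<int> Filter(List<int> numbers, int source)
    {
        // One PLINQ partition per digit position, per the suggestion above.
        var targets = new HashSet<int>(
            Enumerable.Range(0, 6).AsParallel()
                      .Select(d => IncrementDigit(source, d))
                      .Where(t => t.HasValue)
                      .Select(t => t.Value));
        // The comparison against the rest of the numbers also parallelizes well.
        return numbers.AsParallel().Where(n => !targets.Contains(n)).ToList();
    }
}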

All the suggestions (so far) require six comparisons per input line, which is not necessary. The numbers are coming in as strings, so use string comparisons.
Start with @Armen Tsirunyan's idea:
Precalculate all the target numbers,
in this case 223456, 133456, 124456,
123556, 123466, 123457.
But instead of single comparisons, make that into a string:
string arg = "223456 133456 124456 123556 123466 123457";
Then read through the input (either from file or in memory). Pseudocode:
foreach (string s in theBigListOfNumbers)
if (arg.indexOf(s) == -1)
print s;
This is just one comparison per input line, no dictionaries, maps, iterators, etc.
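In C#, that pseudocode might look roughly like this (theBigListOfNumbers stands in for however the lines are read):

string arg = "223456 133456 124456 123556 123466 123457";
foreach (string s in theBigListOfNumbers)
{
    if (arg.IndexOf(s) == -1)   // not one of the six targets
        Console.WriteLine(s);
}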
Edited to add:
In x86 instruction set processors (not just the Intel brand), substring searches like this are very fast. Searching for a character within a string, for example, can be done with a single (repeated) machine instruction.
I'll have to ask others to weigh in on alternate architectures.

For starters, I would just read all the numbers into an array.
When you are finally done, rewrite the file.

It seems like the rule you're describing is: for the target number abcdef, you want to find all numbers that contain a+1, b+1, c+1, d+1, e+1, or f+1 in the appropriate place. You can do this in O(n) by looping over the lines in the file and comparing each of the six digits to the corresponding digit in the target number; if no digits match, write the number to an output file.
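A sketch of that single pass, implementing the rule as the question states it (exactly one digit larger by one, the rest identical); the target and file names are made up:

using System.IO;

class DigitFilter
{
    static void Main()
    {
        string target = "123456";   // hypothetical target number
        using (var output = new StreamWriter("kept.txt"))   // made-up file names
        {
            foreach (string line in File.ReadLines("numbers.txt"))
            {
                int diffs = 0;
                bool oneGreater = false;
                for (int i = 0; i < 6; i++)
                {
                    if (line[i] != target[i])
                    {
                        diffs++;
                        if (line[i] == target[i] + 1)
                            oneGreater = true;
                    }
                }
                // Drop exact variants only: one digit larger by one, the rest equal.
                if (!(diffs == 1 && oneGreater))
                    output.WriteLine(line);
            }
        }
    }
}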

This sounds like a potential case for a multidimensional array, and possibly also unsafe C# code so that you can use pointer math to iterate through such a large quantity of numbers.
I would have to dig into it further, but I would probably also use a Dictionary for non-linear lookups, if you are comparing numbers that aren't in sequence.

How about this: process the numbers one by one, keeping them in two hash tables, NumbersOK and NumbersNotOK.
1. Take one number.
2. If it's not in NumbersNotOK, place it in NumbersOK.
3. Put its variants (each single digit incremented) into NumbersNotOK.
4. Remove any members of NumbersOK that match one of those variants.
5. Repeat from 1 until the end of the file.
6. Save NumbersOK to the file.
This way you'll pass the list just once. Hash tables are made for exactly this kind of purpose, and lookups will be very fast (no expensive comparison methods).
This algorithm isn't complete, as it doesn't handle repeated numbers, but that can be fixed with some tweaking...
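A rough sketch of those steps, again assuming digits of 9 simply have no variant, and leaving the duplicate handling aside as noted:

using System.Collections.Generic;
using System.IO;
using System.Linq;

class TwoHashFilter
{
    // The six single-digit-increment variants; 9s are skipped (an assumption,
    // since the question doesn't say what happens to a 9).
    static IEnumerable<int> Variants(int n)
    {
        foreach (int p in new[] { 1, 10, 100, 1000, 10000, 100000 })
            if ((n / p) % 10 != 9)
                yield return n + p;
    }

    static void Main()
    {
        var numbersOK = new HashSet<int>();
        var numbersNotOK = new HashSet<int>();
        foreach (var line in File.ReadLines("numbers.txt"))   // hypothetical path
        {
            int n = int.Parse(line);
            if (!numbersNotOK.Contains(n))
                numbersOK.Add(n);
            foreach (int v in Variants(n))
            {
                numbersNotOK.Add(v);
                numbersOK.Remove(v);   // step 4: evict matching earlier numbers
            }
        }
        File.WriteAllLines("filtered.txt",
                           numbersOK.Select(x => x.ToString("000000")));
    }
}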

Read all your numbers from the file and store them in a map where the number is the key and a boolean value signifies that the number hasn't been deleted (true means it exists, false means deleted).
Then iterate through your keys. For each key, set the map to false for the values you would be deleting from the list.
Iterate through your list one more time and get all the keys where the value is true. This is the list of remaining numbers.
public List<int> FilterNumbers(string fileName)
{
    Dictionary<int, bool> numbers = new Dictionary<int, bool>();
    using (StreamReader sr = File.OpenText(fileName))
    {
        string s;
        while ((s = sr.ReadLine()) != null)
            numbers[Int32.Parse(s)] = true;   // indexer tolerates duplicate lines
    }
    int[] places = { 100000, 10000, 1000, 100, 10, 1 };
    // Snapshot the keys, since we modify values while looping.
    foreach (int number in new List<int>(numbers.Keys))
    {
        if (!numbers[number])
            continue;
        foreach (int p in places)
        {
            // Skip digits of 9: adding p would carry into the next digit.
            if ((number / p) % 10 == 9)
                continue;
            if (numbers.ContainsKey(number + p))
                numbers[number + p] = false;
        }
    }
    List<int> validNumbers = new List<int>();
    foreach (int number in numbers.Keys)
    {
        if (numbers[number])   // keep only the numbers not marked deleted
            validNumbers.Add(number);
    }
    return validNumbers;
}
This may need testing, as I don't have a C# compiler on this computer and I'm a bit rusty. The algorithm takes a fair amount of memory, but it runs in linear time.
** EDIT **
The original version ran into problems whenever one of the digits was 9; the code above now skips those positions, on the assumption that incrementing a 9 has no valid 6-digit counterpart under the stated rule.

Still sounds like a homework question... The fastest sort on a million numbers will be n log(n), that is, 1,000,000 * log(1,000,000) = 6 * 1,000,000, which is the same as comparing 6 numbers to each of the million numbers. So a direct comparison will be faster than sort-and-remove, because after sorting you still have to compare to remove. Unless, of course, my calculations have entirely missed the target.
Something else comes to mind: when you pick up a number, read it as hex and not base 10; then maybe some bitwise operators can help somehow.
Still thinking on what can be done with this. Will update if it works.
EDIT: currently thinking along the lines of Gray code. 123456 (our original number) and 223456 or 133456 will be off by only one digit, and a Gray code converter will catch it fast. It's late at night here, so if someone else finds this useful and can give a solution...


Searching an array string with a binary search sub string

I have a file.txt containing about 200,000 records.
The format of each record is 123456-99-Text. The 123456 values are unique account numbers, the 99 is a location code that I need (it ranges from 01 to 99), and the text is irrelevant. The account numbers are sorted in order, one per line in the file (111111, 111112, 111113, etc.).
I made a Visual Studio textbox and search button so someone can search for an account number. The account number is actually 11 digits long, but only the first 6 matter. I wrote this as string actnum = textBox1.Text.Substring(0, 6);
I wrote a foreach (string x in File.ReadLines("file.txt")) with an if (x.Contains(actnum)) then string code = x.Substring(8, 2); statement.
The program works well, but because there are so many records, if someone searches for an account number that doesn't exist, or one near the bottom of the list, the program locks up for a good 10 seconds before reaching the "number not found" else statement, or takes forever to find that last record.
My Question:
Reading about binary searches, I have attempted one without much success. I cannot seem to get the array or file to behave like a legitimate binary search. Is there a way to take the 6-digit actnum from textbox1, compare it to a substring holding each record's 6-digit account number, then grab the 99 code from that specific line?
A binary search would help greatly! I could take 555-555 and compare it to the top or bottom half of the record file, then keep searching until I find the line I need, grab the entire line, then substring the 99 out. The problem I have is that I can't seem to get a proper integer conversion of the file because it contains both numbers AND text, and therefore I can't properly use <, >, = signs.
Any help on this would be greatly appreciated. The program I currently have actually works but is incredibly slow at times.
As one possible solution (not necessarily the best), you can add your record IDs to a Dictionary<string, int> (or even a Dictionary<long, int> if all record IDs are numeric) where each key is the ID of one line and each value is the line index. When you need to look up a particular record, just look in the dictionary (it'll do an efficient lookup for you) and it gives you the line number. If the item is not there (non-existent ID), you won't find it in the dictionary.
At this point, if the record ID exists in the file, you have a line number - you can either load the entire file into memory (if it's not too big) or just seek to the right line and read in the line with the data.
For this to work, you have to go through the file at least once and collect all the record IDs from all lines and add them to the dictionary. You won't have to implement the binary search - the dictionary will internally perform the lookup for you.
Edit:
If you don't need all the data from a particular line, just one bit (like the location code you mentioned), you don't even need to store the line number (since you won't need to go back to the line in the file) - just store the location data as the value in the dictionary.
I personally would still store the line index because, in my experience, such projects start out small but end up collecting features and there'll be a point where you'll have to have everything from the file. If you expect this to be the case over time, just parse data from each line into a data structure and store that in the dictionary - it'll make your future life simpler. If you're very sure you'll never need more data than the one bit of information, you can just stash the data itself in the dictionary.
Here's a simple example (assuming that your record IDs can be parsed into a long):
public class LineData
{
    public int LineIndex { get; set; }
    public string LocationCode { get; set; }
    // other data from the line that you need
}

// ...
// declare your map
private Dictionary<long, LineData> _dataMap = new Dictionary<long, LineData>();

// ...
// Read file, parse lines into LineData objects and put them in dictionary
// ...
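The elided reading step might look something like this (a sketch that assumes the "123456-99-Text" layout with '-' separators and System.IO imported; adjust the parsing if the real lines differ):

int lineIndex = 0;
foreach (string line in File.ReadLines("file.txt"))
{
    string[] parts = line.Split('-');       // ["123456", "99", "Text..."]
    long recordID = long.Parse(parts[0]);   // account number becomes the key
    _dataMap[recordID] = new LineData
    {
        LineIndex = lineIndex++,
        LocationCode = parts[1]             // the "01".."99" location code
    };
}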
To see if a record ID exists, you just call TryGetValue():
LineData lineData;
if (_dataMap.TryGetValue(recordID, out lineData))
{
    // record ID was found
}
This approach essentially keeps the entire file in memory but all data is parsed only once (at the beginning, during building the dictionary). If this approach uses too much memory, just store the line index in the dictionary and then go back to the file if you find a record and parse the line on the fly.
You cannot really do a binary search against File.ReadLine because you have to be able to access the lines in arbitrary order. Instead you should read the whole file into memory (File.ReadAllLines would be an option).
Assuming your file is sorted by the substring, you can create a new class that implements IComparer<string>:
public class SubstringComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        return x.Substring(0, 6).CompareTo(y.Substring(0, 6));
    }
}
and then your binary search would look like:
int returnedValue = foundStrings.BinarySearch(searchValue, new SubstringComparer());
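For completeness, List<T>.BinarySearch returns a non-negative index on a hit and the bitwise complement of the insertion point on a miss, so the lookup could finish like this (the Substring offsets assume the exact 123456-99-Text layout):

if (returnedValue >= 0)
{
    // Hit: pull the "99" location code out of the matched record.
    string code = foundStrings[returnedValue].Substring(7, 2);
}
else
{
    // Miss: returnedValue is the bitwise complement of where it would go.
}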
If the file doesn't change often, you can simply load the entire file into memory using a structure that handles the searching in faster time. If the file can change, then you will need to decide on a mechanism for reloading it, be it restarting the program or a more complex process.
It looks like you are looking for exact matches (searching for 123456 yields only the one record labelled 123456). If that is the case, you can use a Dictionary. Note that to use a Dictionary you need to define key and value types; it looks like in your case they would both be string.
While I did not find a way to do a better type of search, I did manage to learn about embedded resources, which considerably sped up the program. Scanning the entire file now takes a fraction of a second instead of 5-10 seconds. Posting the following code:
string searchfor = textBox1.Text;
Assembly assm = Assembly.GetExecutingAssembly();
using (Stream datastream = assm.GetManifestResourceStream("WindowsFormsApplication2.Resources.file1.txt"))
using (StreamReader reader = new StreamReader(datastream))
{
    string lines;
    while ((lines = reader.ReadLine()) != null)
    {
        if (lines.StartsWith(searchfor))
        {
            label1.Text = "Found";
            break;
        }
        else
        {
            label1.Text = "Not found";
        }
    }
}

Algorithm for generating a nearly sorted list on predefined data

Note: This is part 1 of a 2 part question.
Part 2 here
I want to learn more about sorting algorithms, and what better way to do that than to code! So I figure I need some data to work with.
My approach to creating some "standard" data will be as follows: create a set number of items; I'm not sure how large to make it, but I want to have fun and make my computer groan a little bit :D
Once I have that list, I'll push it into a text file and just read off of that to run my algorithms against. I should have a total of 4 text files filled with the same data, just sorted differently, to run my algorithms against (see below).
Correct me if I'm wrong but I believe I need 4 different types of scenarios to profile my algorithms.
Randomly sorted data (for this I'm going to use the Knuth shuffle)
Reversed data (easy enough)
Nearly sorted (not sure how to implement this)
Few unique (once again not sure how to approach this)
This question is for generating a nearly sorted list.
Which approach is best to generate a nearly sorted list on predefined data?
To "shuffle" a sorted list to make it "almost sorted":
Create a list of functions you can think of which you can apply to parts of the array, like:
Negate(array, startIndex, endIndex);
Reverse(array, startIndex, endIndex);
Swap(array, startIndex, endIndex);
For i from zero to some function of the array's length (e.g. Log(array.Length)):
Randomly choose 2 integers*
Randomly choose a function from the functions you thought of
Apply that function to those indices of the array
*Note: The integers should not be constricted to the array size. Rather, choose random integers and "wrap" around the array -- that way the elements near the ends will have the same chance of being modified as the elements in the middle.
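A sketch of that loop using two of the suggested functions (the modulo is a simplified stand-in for the wrap-around advice):

using System;

static class AlmostSorter
{
    static readonly Random Rand = new Random();

    public static void Disturb(int[] array)
    {
        int passes = (int)Math.Log(array.Length, 2);   // some function of the length
        for (int i = 0; i < passes; i++)
        {
            // Random indices taken modulo the length stand in for the
            // "wrap around the array" advice above.
            int x = Rand.Next() % array.Length;
            int y = Rand.Next() % array.Length;
            int lo = Math.Min(x, y), hi = Math.Max(x, y);
            if (Rand.Next(2) == 0)
                Array.Reverse(array, lo, hi - lo + 1);   // Reverse(array, lo, hi)
            else
            {
                int t = array[lo]; array[lo] = array[hi]; array[hi] = t;   // Swap
            }
        }
    }
}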
Answering my own question here. All this does is take a sorted list and shuffle up small sections of it.
private static readonly Random _random = new Random();

public static T[] ShuffleBagSort<T>(T[] array, int shuffleSize)
{
    Random r = _random;
    for (int i = 0; i < array.Length; i += shuffleSize)
    {
        // Prevents an index out of bounds while still shuffling the last,
        // smaller block of the array; skips it entirely when only one
        // element is left (nothing to shuffle there).
        if (i + shuffleSize > array.Length)
        {
            shuffleSize = array.Length - i;
            if (shuffleSize <= 1)
                continue;
        }
        // Random swaps confined to the window [i, i + shuffleSize).
        for (int j = i; j < i + shuffleSize; j++)
        {
            // Pick a random element to swap from our small section of the array.
            int k = r.Next(i, i + shuffleSize);
            T tmp = array[k];
            array[k] = array[j];
            array[j] = tmp;
        }
    }
    return array;
}
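Hypothetical usage, starting from an already sorted array (assumes using System.Linq):

int[] data = Enumerable.Range(0, 100000).ToArray();   // sorted input
ShuffleBagSort(data, 8);   // shuffles each 8-element window in place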
Sort the array.
Start re-sorting it in descending order with bubble sort.
Stop after a few iterations (depending on how "dis-sorted" you want the array to be).
Add some randomness: each time bubble sort wants to swap two elements, toss a coin and perform the swap or not depending on the result (or use a probability other than 50/50).
This will give you an array which is roughly equally modified across its whole range, preserving most of the order (the beginning will hold the smallest elements, the end the greatest). That's because the changes performed by bubble sort with a random test are rather local; it won't mix the whole array at once so much that it stops resembling the original.
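A minimal sketch of that randomized, descending bubble pass; the pass count and coin bias are the knobs described above:

static void BubbleDisturb(int[] a, int passes, double swapProbability, Random rand)
{
    for (int pass = 0; pass < passes; pass++)
    {
        for (int i = 0; i < a.Length - 1; i++)
        {
            // Descending-order bubble step, applied only when the coin toss says so.
            if (a[i] < a[i + 1] && rand.NextDouble() < swapProbability)
            {
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
        }
    }
}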
If you want, you can also completely randomly shuffle whole parts of the array (but keep the parts not too big, because you'll completely lose the ordering).
Or you may randomly swap whole sorted parts of the array. That would be an interesting test case, for example:
[1,2,3,4,5,6,7,8] -> [1,2,6,7,8,3,4,5]
Almost-sorted lists are the reason Timsort (Python's built-in sort) is so efficient in the real world: data is typically "almost sorted". There is an article about it explaining the math behind the entropy of data.

Performance problem

OK, so I need to know if anyone can see a way to reduce the number of iterations of these loops, because I can't. The first while loop goes through a file, reading one line at a time. The foreach loop then compares each member of compareSet with what the while loop just read. The innermost while loop does the bit counting.
As requested, an explaination of my algorithm:
There is a file that is too large to fit in memory. It contains a word followed by the pages in a very large document that this word is on. EG:
sky 1 7 9 32....... (it is not in this format, but you get the idea).
so parseLine reads in the line and converts it into a list of ints that acts like a bit array, where 1 means the word is on the page and 0 means it isn't.
CompareSet is a bunch of other words. I can't fit my entire list of words into memory, so I can only fit a subset of them. This is a bunch of words just like the "sky" example. I then compare each word in compareSet with sky by seeing if they are on the same page.
So if sky and some other word both have a 1 set at a certain index in the bit array (simulated as a uint array for performance), they are on the same page. The algorithm therefore counts the occurrences of any two words on a particular page. So in the end I will have a list like:
(for all words in list) is on the same page as (for all words in list) x number of times.
eg sky and land is on the same page x number of times.
while ((line = parseLine(s)) != null)
{
    getPageList(line.Item2, compareWord);
    foreach (Tuple<int, uint[], List<Tuple<int, int>>> word in compareSet)
    {
        unchecked
        {
            for (int i = 0; i < 327395; i++)
            {
                if (word.Item2[i] == 0 || compareWord[i] == 0)
                    continue;   // no shared pages in this 32-page block
                uint combinedNumber = word.Item2[i] & compareWord[i];
                // Kernighan's trick: each step clears the lowest set bit,
                // so this inner while runs once per shared page.
                while (combinedNumber != 0)
                {
                    actual++;
                    combinedNumber = combinedNumber & (combinedNumber - 1);
                }
            }
        }
    }
}
As my old professor Bud used to say: "When you see nested loops like this, your spidey senses should be goin' CRAZY!"
You have a while with a nested for with another while. This nesting of loops multiplies the orders of operations together. Your one for loop has 327,395 iterations. Assuming the other loops have the same or similar number of iterations, that means you have an order of operations of
327,395 * 327,395 * 327,395 = 35,092,646,987,154,875 (insane)
It's no wonder that things slow down. You need to redefine your algorithm to remove these nested loops or combine work somewhere. Even if the real numbers are smaller than my assumptions, the nesting of the loops is creating a LOT of operations that are probably unnecessary.
As Joal already mentioned, nobody can optimize this looping algorithm as-is. But what you can do is try to better explain what you are trying to accomplish and what your hard requirements are. Maybe you can take a different approach, using something like HashSet<T>.IntersectWith() or a Bloom filter.
So if you really want help here, you should not only post the code that is slow, but also the overall task you'd like to accomplish. Maybe someone has a completely different idea that solves your problem and makes your whole algorithm obsolete.
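For illustration, here is what the HashSet idea could look like if each word's pages were stored as a set of page numbers instead of packed bits (made-up page sets, and requires System.Collections.Generic; not a drop-in replacement for the code above):

var skyPages = new HashSet<int> { 1, 7, 9, 32 };
var landPages = new HashSet<int> { 7, 32, 40 };

var shared = new HashSet<int>(skyPages);   // copy, so the original survives
shared.IntersectWith(landPages);           // keeps only pages both words are on
int timesOnSamePage = shared.Count;        // 2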

How do I insert an int into a sorted array quickly?

I'd like to insert an int into a sorted array. This operation is going to be performed very often, so it needs to be as fast as possible.
It is possible and even preferred to use a List or any other class instead of an array.
All values are in the 1 to 34 range
The array typically contains exactly 14 values
I was thinking of many different approaches, including binary search and simple insert-on-copy, but found it hard to decide. Also, I felt like I missed an idea. Do you have experience on this topic or any new ideas to consider?
I would use an int array of length 35 (because you said the range is 1-34) to record the status of the numbers.
int[] status = Enumerable.Repeat(0, 35).ToArray();
// an array containing 35 zeros,
// which means there are currently no elements in the list

status[10] = 1;   // now the list holds only one number: 10
status[11]++;     // a new number, 11, is added to the list
So if you want to add a number i to the list:
status[i]++; // O(1) to add a number
To remove an i from the list:
status[i]--; // O(1) to remove a number
Want to know all the numbers in the list?
for (int i = 0; i < status.Length; i++)
{
    if (status[i] > 0)
    {
        for (int j = 0; j < status[i]; j++)
            Console.WriteLine(i);
    }
}

// or, more easily, using LINQ:
var result = status.SelectMany((count, index) => Enumerable.Repeat(index, count));
The following example may help you understand my code better:
the real number array: 1 12 12 15 9 34 // I don't care if it's sorted
the status array: status[1]=1,status[12]=2,status[15]=1,status[9]=1,status[34]=1
all others are 0
At 14 values this is a pretty small array; I don't think switching to a smarter data structure such as a list will win you much, especially since arrays give you fast random access. Even binary search may actually be slower than linear search at this scale. Are you sure that, say, insert-on-copy does not satisfy your performance requirements?
This operation is going to be performed very often, so it needs to be as fast as possible.
The things that you notice happen "very often" are frequently not the bottlenecks in the program - it's often surprising what the actual bottlenecks are. You should code something simple and measure the actual performance of your program before performing any optimizations.
I was thinking of many different approaches, including binary search and simple insert-on-copy, but found it hard to decide.
Assuming that this is the bottleneck, the big-O performance of the different methods is not going to be relevant here because of the small size of your array. It is easier to just try a few different approaches, measure the results, see which performs best and choose that method. If you have followed the advice from the first paragraph you already have a profiler setup that you can use for this step too.
For inserting into the middle, a LinkedList<int> would be the fastest option - anything else involves copying data. At 14 elements, don't stress over binary search etc - just walk forwards to the item you want:
using System;
using System.Collections.Generic;

static class Program
{
    static void Main()
    {
        LinkedList<int> data = new LinkedList<int>();
        Random rand = new Random(12345);
        for (int i = 0; i < 20; i++)
        {
            data.InsertSortedValue(rand.Next(300));
        }
        foreach (int i in data) Console.WriteLine(i);
    }
}

static class LinkedListExtensions
{
    public static void InsertSortedValue(this LinkedList<int> list, int value)
    {
        LinkedListNode<int> node = list.First, next;
        // Insert at the front if the list is empty or the head is already larger.
        if (node == null || node.Value > value)
        {
            list.AddFirst(value);
        }
        else
        {
            // Walk forward to the last node whose value is below the new one.
            while ((next = node.Next) != null && next.Value < value)
                node = next;
            list.AddAfter(node, value);
        }
    }
}
Doing the brute-force approach is the best decision here because 14 is a tiny number :). However, this is not a scalable decision, since should 14 become 14000 one day, that will cause problems.
What is the most common operation with your array?
Insert? Read?
A heap data structure will give you O(log 14) for both of them. A SortedDictionary may hurt your performance.
Using a simple array will give you O(1) for reading and O(14) for insert.
By the way, have you tried System.Collections.Generic.SortedDictionary or System.Collections.Generic.SortedList?
If you're on .NET 4 you should take a look at SortedSet<T>. Otherwise take a look at SortedDictionary<TKey, TValue>, where you make TValue object and just put null into it, because you're only interested in the keys.
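One caveat worth noting with the set-based types (my observation): they store distinct values only, so duplicates among your 14 numbers would be silently dropped:

var set = new SortedSet<int> { 17, 5, 17 };   // the second 17 is ignored
Console.WriteLine(set.Count);                 // prints 2, not 3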
If there are no repeated values in the array and the possible values won't change, a fixed-size array where the value is equal to the index is a good choice.
Both insert and read are O(1).
You have a range of possible values from 1-34, which is rather narrow. So the fastest way would likely be an array with 34 slots. To insert a number n just do array[n-1]++, and to remove it do array[n-1]-- (if array[n-1] > 0).
To check if a value exists in your collection, test array[n-1] > 0.
edit: Damn...Danny was faster. :)

Algorithm for matching lists of integers

For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulates the following:
DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers
This is probably not relevant but the list values is a list of distinct integers with the property that for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.
The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.
Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.
We are using the following algorithm. Convert the list values to a string via
string signature = String.Join("|", values.OrderBy(n => n).ToArray());
then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches, and then check whether the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
Is there a better method?
You could probably just hash the list itself, instead of going through String.
Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)
You can do this in O(n + m) if you use a bucket array (essentially: a hashtable) that you plug the hashes in. You can compare the two bucket arrays in O(max(n, m)) assuming a length dependent on the number of entries (to get a reasonable load factor).
It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
You cannot do better than O(n + m), because every entry needs to be visited at least once.
Edit: misread, fixed.
On top of the other answers, you could make the process faster by creating a low-cost hash, simply constructed as a XOR over all the elements of each List.
You wouldn't have to order your list, and all you get is an int, which is easier and faster to store than a string.
Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it.
If there is already an existing key, only then do you sort the corresponding Lists and compare them.
You still need to compare them if you find a match because there may be some collisions using a simple XOR.
I think, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.
If you were to have your own implementation of the List<>, then you could build the generation of the XOR key within it so it would be recalculated at each operation on the List.
This would make the process of checking duplicate lists even faster.
Code
Below is a first attempt at implementing this.
Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList)
{
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    checkHash.TryGetValue(xorkey, out existingLists);

    if (existingLists != null)
    {
        // Key already in the dictionary: check each stored list for a real match.
        foreach (List<int> li in existingLists)
        {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical)
            {
                // Check all elements (Contains is O(n), fine for short lists).
                foreach (int v in theList)
                {
                    if (!li.Contains(v))
                    {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    }
    if (existingLists == null)
    {
        // Never seen this key before: start a new bucket for it.
        existingLists = new List<List<int>>();
        checkHash.Add(xorkey, existingLists);
    }
    if (!isIdentical)
    {
        // Record this list so future duplicates of it are detected.
        existingLists.Add(theList);
    }
    return isIdentical;
}
Not the most elegant or easiest to read at first sight, it's rather "hacky", and I'm not even sure it performs better than the more elegant version from Guffa.
What it does, though, is take care of collisions in the XOR key by storing lists of List<int> in the Dictionary.
If a duplicate key is found, we loop through each previously stored List until we find a mismatch.
The good point about the code is that it should be about as fast as you can get in most cases, and still faster than building strings when there is a collision.
Implement an IEqualityComparer for List<int>; then you can use the list as a key in a dictionary.
If the lists are sorted, it could be as simple as this:
public class IntListEqualityComparer : IEqualityComparer<List<int>>
{
    public int GetHashCode(List<int> list)
    {
        int code = 0;
        foreach (int value in list) code ^= value;
        return code;
    }

    public bool Equals(List<int> list1, List<int> list2)
    {
        if (list1.Count != list2.Count) return false;
        for (int i = 0; i < list1.Count; i++)
        {
            if (list1[i] != list2[i]) return false;
        }
        return true;
    }
}
Now you can create a dictionary that uses the IEqualityComparer:
Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());
Add all the items from the first day to the dictionary, then loop through the items from the second day and check if the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much what ever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (with the same hash code).
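A sketch of that day-over-day comparison (day1Items and day2Items are hypothetical collections of the question's objects):

// Index day 1 by its values list.
var day1 = new Dictionary<List<int>, int>(new IntListEqualityComparer());
foreach (var item in day1Items)
    day1[item.values] = item.key;

// A day-2 list found under a different key is exactly the match sought.
foreach (var item in day2Items)
{
    int oldKey;
    if (day1.TryGetValue(item.values, out oldKey) && oldKey != item.key)
        Console.WriteLine("key changed: " + oldKey + " -> " + item.key);
}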
Does the ordering matter? i.e. [1,2] on day 1 and [2,1] on day 2, are they equal?
If they are, then hashing might not work all that well. You could use a sorted array/vector instead to help with the comparison.
Also, what kind of keys are they? Do they have a definite range (e.g. 0-63)? You might be able to concatenate them into one large integer (which may require precision beyond 64 bits) and hash that, instead of converting to string, because the string conversion might take a while.
It might be worthwhile to put this in a SQL database. If you don't want a full-blown DBMS, you could use SQLite.
This would make uniqueness checks, unions, and these types of operations very simple queries, and would be very efficient. It would also allow you to easily store the information if it is ever needed again.
Would you consider summing up the list of values to obtain an integer that can be used as a precheck of whether different lists contain the same set of values?
Though there will be many more collisions (the same sum doesn't necessarily mean the same set of values), I think it can first reduce the number of comparisons required by a large part.
