One module in my app generates a small array of integers, typically 25 of them, each fairly small (less than 10000). I'd like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.
So, for every new array I need to figure out whether it already exists, and if it does, what its index is.
A naive approach is to keep all arrays in a list and then just call:
MyList.FindIndex(x => x.SequenceEqual(Small_Array));
But this becomes very slow once the number of arrays gets into the thousands.
A less naive approach is to store all arrays in a dictionary where the key is a hash value computed from the array. If the hash is just another 32-bit integer, then I cannot find a good hashing algorithm that doesn't collide.
Which, I think, leaves me with a hashing algorithm like MD5 whose result can be treated as a 128-bit integer. Is that a good way to tackle my problem?
Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".
The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great lengths to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:
public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}
You'd then create the dictionary like this:
var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());
Then you'd have something like:
public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}
Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.
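For illustration, a minimal sketch of that variant, reusing the comparer above (note that GetOrAdd makes lookup-or-insert a single call, but a Count-based index can still race under concurrent writers):
var arrayMap = new ConcurrentDictionary<int[], int>(new Int32ArrayEqualityComparer());

public int MaybeAddArray(int[] array) =>
    // The value factory may run more than once under contention,
    // and Count + 1 is not atomic with the insert.
    arrayMap.GetOrAdd(array, _ => arrayMap.Count + 1);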
I need to use a list of numbers (longs) as a Dictionary key in order to do some group calculations on them.
When using the long array as a key directly, I get a lot of collisions. If I use string.Join(",", myLongs) as a key, it works as I would expect it to, but that's much, much slower (because the hash is more complicated, I assume).
Here's an example demonstrating my problem:
Console.WriteLine("Int32");
Console.WriteLine(new[] { 1, 2, 3, 0}.GetHashCode());
Console.WriteLine(new[] { 1, 2, 3, 0 }.GetHashCode());
Console.WriteLine("String");
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0}).GetHashCode());
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0 }).GetHashCode());
Output:
Int32
43124074
51601393
String
406954194
406954194
As you can see, the arrays return a different hash.
Is there any way of getting the performance of the long array hash, but the uniqueness of the string hash?
See my own answer below for a performance comparison of all the suggestions.
About the potential duplicate -- that question has a lot of useful information, but as this question was primarily about finding high performance alternatives, I think it still provides some useful solutions that are not mentioned there.
That the first two hashes differ is actually by design. Arrays are a reference type, and the default hash is derived from the reference itself (somehow); I would guess it is something like the pointer used at machine-code level, or some garbage-collector-level value. It is one of the things you have no influence on, but it is carried along if you assign the same instance to a new reference variable.
In the second case you get the hash of the string "1,2,3,0": string.Join calls ToString on each element and concatenates the results, so both expressions produce the same string. And since strings have all those special rules, such as "compares like a value type" and string interning, equal strings produce equal hashes.
Another alternative is to leverage the lesser-known IEqualityComparer<T> interface to implement your own hash and equality comparisons. There are some notes you'll need to observe about building good hashes, and it's generally not good practice to have editable data in your keys, as it introduces instability should the keys ever change, but it would certainly be more performant than using string joins.
public class ArrayKeyComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        return x == null || y == null
            ? x == null && y == null
            : x.SequenceEqual(y);
    }

    public int GetHashCode(int[] obj)
    {
        // Note: seed starts at 0, and 0 % n stays 0, so this hash puts
        // every array into the same bucket -- see the benchmark further down.
        var seed = 0;
        if (obj != null)
            foreach (int i in obj)
                seed %= i.GetHashCode();
        return seed;
    }
}
Note that this still may not be as performant as a tuple, since it still iterates the array rather than working over a fixed number of fields.
Your strings correctly return the same hash codes for the same content because string.GetHashCode() is computed from the string's characters.
The default GetHashCode() for an array is based on the object's identity (effectively its reference), not its contents, which is why your arrays with identical contents return different hash codes.
Rather than using an array directly as a key, you should consider writing a wrapper class for an array that will provide a proper hash code.
The main disadvantage with this is that it will be an O(N) operation to compute the hash code (it has to be - otherwise it wouldn't represent all the data in the array).
Fortunately you can cache the hash code so it's only computed once.
Another major problem with using a mutable array for a hash code is that if you change the contents of the array after using it for the key of a hashing container such as Dictionary, you will break the container.
Ideally you would only use this kind of hashing for arrays that are never changed.
Bearing all that in mind, a simple wrapper would look like this:
public sealed class IntArrayKey
{
    public IntArrayKey(int[] array)
    {
        Array = array;
        _hashCode = hashCode();
    }

    public int[] Array { get; }

    // Note: to work as a dictionary key this class also needs an Equals
    // override (added in the fuller version below); without one, two keys
    // wrapping equal arrays still compare by reference.
    public override int GetHashCode()
    {
        return _hashCode;
    }

    int hashCode()
    {
        int result = 17;
        unchecked
        {
            foreach (var i in Array)
            {
                result = result * 23 + i;
            }
        }
        return result;
    }

    readonly int _hashCode;
}
You can use that in place of the actual arrays for more sensible hash code generation.
As per the comments below, here's a version of the class that:
Makes a defensive copy of the array so that it cannot be modified.
Implements equality operators.
Exposes the underlying array as a read-only list, so callers can access its contents but cannot break its hash code.
Code:
public sealed class IntArrayKey : IEquatable<IntArrayKey>
{
    public IntArrayKey(IEnumerable<int> sequence)
    {
        _array = sequence.ToArray();   // defensive copy
        _hashCode = hashCode();
        Array = new ReadOnlyCollection<int>(_array);
    }

    public bool Equals(IntArrayKey other)
    {
        if (other is null)
            return false;
        if (ReferenceEquals(this, other))
            return true;
        return _hashCode == other._hashCode && equals(other.Array);
    }

    public override bool Equals(object obj)
    {
        return ReferenceEquals(this, obj) || obj is IntArrayKey other && Equals(other);
    }

    public static bool operator ==(IntArrayKey left, IntArrayKey right)
    {
        return Equals(left, right);
    }

    public static bool operator !=(IntArrayKey left, IntArrayKey right)
    {
        return !Equals(left, right);
    }

    public IReadOnlyList<int> Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    bool equals(IReadOnlyList<int> other) // other cannot be null.
    {
        if (_array.Length != other.Count)
            return false;
        for (int i = 0; i < _array.Length; ++i)
            if (_array[i] != other[i])
                return false;
        return true;
    }

    int hashCode()
    {
        int result = 17;
        unchecked
        {
            foreach (var i in _array)
            {
                result = result * 23 + i;
            }
        }
        return result;
    }

    readonly int _hashCode;
    readonly int[] _array;
}
If you wanted to use the above class without the overhead of making a defensive copy of the array, you can change the constructor to:
public IntArrayKey(int[] array)
{
    _array = array; // no defensive copy: the caller must not mutate the array afterwards
    _hashCode = hashCode();
    Array = new ReadOnlyCollection<int>(_array);
}
If you know the length of the arrays you're using, you could use a Tuple.
Console.WriteLine("Tuple");
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Outputs
Tuple
1248
1248
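For what it's worth, on C# 7 or later a ValueTuple behaves the same way without the heap allocation of Tuple; a quick sketch:
Console.WriteLine("ValueTuple");
Console.WriteLine((1, 2, 3, 0).GetHashCode());
Console.WriteLine((1, 2, 3, 0).GetHashCode());
Both calls print the same value, since ValueTuple computes its hash code from its contents.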
I took all the suggestions from this question and the similar byte[].GetHashCode() question, and made a simple performance test.
The suggestions are as follows:
int[] as key (original attempt -- does not work at all, included as a benchmark)
string as key (original solution -- works, but slow)
Tuple as key (suggested by David)
ValueTuple as key (inspired by the Tuple)
Direct int[] hash as key
IntArrayKey (suggested by Matthew Watson)
int[] as key with Skeet's IEqualityComparer
int[] as key with David's IEqualityComparer
I generated a List containing one million int[]-arrays of length 7 containing random numbers between 100 000 and 999 999 (which is an approximation of my current use case). Then I duplicated the first 100 000 of these arrays, so that there are 900 000 unique arrays, and 100 000 that are listed twice (to force collisions).
For each solution, I enumerated the list, and added the keys to a Dictionary, OR incremented the Value if the key already existed. Then I printed how many keys had a Value more than 1**, and how much time it took.
The results are as follows (ordered from best to worst):
Algorithm                Works?    Time usage
NonGenericSkeetEquality  YES           392 ms
SkeetEquality            YES           422 ms
ValueTuple               YES           521 ms
QuickIntArrayKey         YES           747 ms
IntArrayKey              YES           972 ms
Tuple                    YES         1 609 ms
string                   YES         2 291 ms
DavidEquality            YES     1 139 200 ms ***
int[]                    NO            336 ms
IntHash                  NO            386 ms
The Skeet IEqualityComparer is only slightly slower than using the int[] as key directly, with the huge advantage that it actually works, so I'll use that.
** I'm aware that this is not a completely foolproof check, as I could theoretically get the expected number of collisions without them actually being the collisions I expected, but having run the test many times, I'm fairly certain I don't.
*** Did not finish, probably due to poor hashing algorithm and a lot of equality checks. Had to reduce the number of arrays to 10 000, then multiply the time usage by 100 to compare with the others.
When an approximated comparison between strings is required, the basic Levenshtein Distance can help. It measures the amount of modifications of the string needed to equal another string:
"aaaa" vs "aaab" => 1
"abba" vs "aabb" => 2
"aaaa" vs "a" => 3
When using a Dictionary<T, U> one can provide a custom IEqualityComparer<T>. One can implement the Levenshtein Distance as an IEqualityComparer<string>:
public class LevenshteinStringComparer : IEqualityComparer<string>
{
    private readonly int _maximumDistance;

    public LevenshteinStringComparer(int maximumDistance)
        => _maximumDistance = maximumDistance;

    public bool Equals(string x, string y)
        => ComputeLevenshteinDistance(x, y) <= _maximumDistance;

    public int GetHashCode(string obj)
        => 0;

    private static int ComputeLevenshteinDistance(string s, string t)
    {
        // Omitted for simplicity
        // Example can be found here: https://www.dotnetperls.com/levenshtein
    }
}
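For reference, the omitted method could be the standard dynamic-programming formulation; a minimal sketch:
private static int ComputeLevenshteinDistance(string s, string t)
{
    // d[i, j] = distance between the first i chars of s and the first j chars of t.
    var d = new int[s.Length + 1, t.Length + 1];
    for (int i = 0; i <= s.Length; i++) d[i, 0] = i; // delete everything
    for (int j = 0; j <= t.Length; j++) d[0, j] = j; // insert everything
    for (int i = 1; i <= s.Length; i++)
    {
        for (int j = 1; j <= t.Length; j++)
        {
            int cost = s[i - 1] == t[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                        d[i, j - 1] + 1),     // insertion
                               d[i - 1, j - 1] + cost);       // substitution
        }
    }
    return d[s.Length, t.Length];
}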
So we can use a fuzzy dictionary:
var dict = new Dictionary<string, int>(new LevenshteinStringComparer(2));
dict["aaa"] = 1;
dict["aab"] = 2; // Modify existing value under "aaa" key
// Only one key was created:
dict.Keys => { "aaa" }
Having all this set up, you may have noticed that we haven't implemented a proper GetHashCode in LevenshteinStringComparer, which would be greatly appreciated by the dictionary. As some rules of thumb regarding hash codes, I'd use:
Unequal objects should not have the same hash code
Equal objects must have the same hash code
The only hash function following these rules that I can imagine is a constant number, just as implemented in the given code. This isn't optimal, though: if we instead take, for example, the default hash of the string, then aaa and aab end up with different hashes even though they are treated as equal. Taken further, this means all possible strings would have to share the same hash.
Am I correct? And why does the performance of the dictionary get better when I use the default string hash function, with its hash collisions, for our comparer? Shouldn't this make the hash buckets inside the dictionary invalid?
public int GetHashCode(string obj)
=> obj.GetHashCode();
I don't think there is a hashing function that could work in your case.
The problem is that you have to assign the bucket based on a single value only, while you can't know what was added before. But the Levenshtein distance of the item being hashed can be anything from 0 to "infinity"; the only thing that matters is what it is compared with. Hence you cannot satisfy the second condition of a hash function (that equal objects have the same hash code).
Another argument, a "pseudo-proof", would be the situation where you want a maximum distance of 2 and you already have two items in the dictionary with a mutual distance of 3. If you then add a string at distance 2 from the first item and distance 1 from the second, how would you decide which item it should match? It satisfies your maximum for both items, but it should probably match the second rather than the first. Yet, knowing nothing about the contents of the dictionary, you cannot know how to hash it consistently.
For the second question - using the default string.GetHashCode() method does improve performance, but it destroys the functionality of your equality comparer. If you test this solution on your sample code, you can see that the dict will contain two keys now. This is because GetHashCode returned two different hash codes, so there was no conflict and dict now has two buckets and your Equals method was not even executed.
I can understand fuzzy lookup. But not fuzzy storage. Why would you want to overwrite "aaa" when assigning a value for "aab"? If all you want is fuzzy lookup wouldn't it be better to have a normal dictionary which has an extension to do a fuzzy lookup like...
public static class DictionaryExtensions
{
    public static IEnumerable<T> FuzzyMatch<T>(this IDictionary<string, T> dictionary, string key, int distance = 2)
    {
        IEqualityComparer<string> comparer = new LevenshteinStringComparer(distance);
        return dictionary
            .Keys
            .Where(k => comparer.Equals(k, key))
            .Select(k => dictionary[k]);
    }
}
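Hypothetical usage (the keys here are my own):
var dict = new Dictionary<string, int>();
dict["aaa"] = 1;
dict["abc"] = 2;
foreach (var value in dict.FuzzyMatch("aab", 1))
    Console.WriteLine(value); // prints 1: only "aaa" is within distance 1 of "aab"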
This is more of a comment than an answer. To answer your question, if you consider the following example...
"abba" vs "cbbc" => 2
"cddc" vs "cbbc" => 2
"abba" vs "cddc" => 4
You get the gist here? I.e. clearly it's not possible for all of the following to be true:
abba == cbbc &&
cddc == cbbc &&
abba != cddc
I have a list of 10 methods. Now I want to call these methods in a random sequence, generated at runtime. What's the best way to do this?
It is always astonishing to me how many incorrect and inefficient answers one sees whenever anyone asks how to shuffle a list of things on StackOverflow. Here we have several examples of code which is brittle (because it assumes that key collisions are impossible when in fact they are merely rare) or slow for large lists. (In this case the problem is stated to be only ten elements, but when possible surely it is better to give a solution that scales to thousands of elements if doing so is not difficult.)
This is not a hard problem to solve correctly. The correct, fast way to do this is to create an array of actions, and then shuffle that array in-place using a Fisher-Yates Shuffle.
http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
Some things not to do:
Do not implement the Fisher-Yates shuffle incorrectly. One sees more incorrect than correct implementations of this trivial algorithm. In particular, make sure you are choosing the random number from the correct range. Choosing it from the wrong range produces a biased shuffle.
If the shuffle algorithm must actually be unpredictable then use a source of randomness other than Random, which is only pseudo-random. Remember, Random only has 2^32 possible seeds, and therefore it can produce at most that many distinct shuffles.
If you are going to be producing many shuffles in a short amount of time, do not create a new instance of Random every time. Save and re-use the old one, or use a different source of randomness entirely. Random chooses its seed based on the time; many Randoms created in close succession will produce the same sequence of "random" numbers.
Do not sort on a "random" GUID as your key. GUIDs are guaranteed to be unique. They are not guaranteed to be randomly ordered. It is perfectly legal for an implementation to spit out consecutive GUIDs.
Do not use a random function as a comparator and feed that to a sorting algorithm. Sort algorithms are permitted to do anything they please if the comparator is bad, including crashing, and including producing non-random results. As Microsoft recently found out, it is extremely embarrassing to get a simple algorithm like this wrong.
Do not use the input to random as the key to a dictionary, and then sort the dictionary. There is nothing stopping the randomness source from choosing the same key twice, and therefore either crashing your application with a duplicate key exception, or silently losing one of your methods.
Do not use the algorithm "Create two lists. Add the elements to the first list. Repeatedly move a random element from the first list to the second list, removing the element from the first list". If the list is O(n) to remove an item then this is an O(n2) algorithm.
Do not use the algorithm "Create two lists. Add the elements to the first list. Repeatedly move a random non-null element from the first list to the second list, setting the element in the first list to null." Also do not do this crazy equivalent of that algorithm.If there are lots of items in the list then this gets slower and slower as you start hitting more and more nulls.
New, short answer
Starting from where Ilya Kogan left off, totally correct after we had Eric Lippert find the bug:
var methods = new Action[10];
var rng = new Random();
var shuffled = methods.Select(m => Tuple.Create(rng.Next(), m))
                      .OrderBy(t => t.Item1)
                      .Select(t => t.Item2);
foreach (var action in shuffled) {
    action();
}
Of course this is doing a lot behind the scenes. The method below should be much faster. But if LINQ is fast enough...
Old answer (much longer)
After stealing this code from here:
public static T[] RandomPermutation<T>(T[] array)
{
    T[] retArray = new T[array.Length];
    array.CopyTo(retArray, 0);

    // Note: creating a new Random per call is one of the pitfalls listed
    // above; prefer passing in and reusing a single instance.
    Random random = new Random();
    for (int i = 0; i < array.Length; i += 1)
    {
        int swapIndex = random.Next(i, array.Length);
        if (swapIndex != i)
        {
            T temp = retArray[i];
            retArray[i] = retArray[swapIndex];
            retArray[swapIndex] = temp;
        }
    }
    return retArray;
}
the rest is easy:
var methods = new Action[10];
var perm = RandomPermutation(methods);
foreach (var method in perm)
{
// call the method
}
Have an array of delegates. Suppose you have this:
class YourClass {
    public int YourFunction1(int x) { return x; }
    public int YourFunction2(int x) { return x; }
    public int YourFunction3(int x) { return x; }
}
Now declare a delegate:
public delegate int MyDelegate(int x);
Now create an array of delegates:
var instance = new YourClass();
MyDelegate[] delegates = new MyDelegate[10];
delegates[0] = new MyDelegate(instance.YourFunction1);
delegates[1] = new MyDelegate(instance.YourFunction2);
delegates[2] = new MyDelegate(instance.YourFunction3);
and now call it like this:
int result = delegates[randomIndex](48);
You can create a shuffled collection of delegates, and then call all methods in the collection.
Here is an easy way of doing so using a dictionary. The keys of the dictionary are random numbers, and the values are delegates to your methods. When you iterate through the dictionary, it has the effect of shuffling.
var shuffledActions = actions.ToDictionary(
    action => random.Next(),
    action => action);

foreach (var pair in shuffledActions.OrderBy(item => item.Key))
{
    pair.Value();
}
actions is an enumerable of your methods, and random is of type Random. Note that, as pointed out above, ToDictionary will throw if random.Next() happens to produce the same key twice.
Think of it as a list of objects from which you want to draw objects randomly. You can get a random index using the Random.Next method (always passing the current List.Count as the parameter) and then remove the object from the list so it will not be drawn again.
When processing a list in a random order, the natural inclination is to shuffle a list.
Another approach is to just keep the list order, but randomly select and remove each item.
var actionList = new[]
{
    new Action( () => CallMethodOne() ),
    new Action( () => CallMethodTwo() ),
    new Action( () => CallMethodThree() )
}.ToList();

var r = new Random();
while (actionList.Count > 0) {
    var index = r.Next(actionList.Count);
    var action = actionList[index];
    actionList.RemoveAt(index);
    action();
}
I think:
via reflection, get the MethodInfo objects;
create an array of those method objects;
generate a random index (normalized to the array's range);
invoke the method.
You can remove each method from the array so that it executes only once, as in the sketch below.
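A sketch of that idea (assuming, as in the delegate answer above, a YourClass whose public methods each take a single int; requires using System.Linq and System.Reflection):
var target = new YourClass();
var methods = typeof(YourClass)
    .GetMethods(BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly)
    .ToList();
var random = new Random();
while (methods.Count > 0)
{
    int index = random.Next(methods.Count);             // random index within the current range
    methods[index].Invoke(target, new object[] { 48 }); // boxed argument for the int parameter
    methods.RemoveAt(index);                            // remove so each method executes only once
}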
Is there a better way to examine whether two string arrays have the same contents than this?
string[] first = new string[] { "cat", "and", "mouse" };
string[] second = new string[] { "cat", "and", "mouse" };

bool contentsEqual = true;
if (first.Length == second.Length) {
    foreach (string s in first)
    {
        contentsEqual &= second.Contains(s);
    }
}
else {
    contentsEqual = false;
}
Console.WriteLine(contentsEqual.ToString()); // true
Enumerable.SequenceEqual, if they're supposed to be in the same order.
You should consider using the Intersect method. It will give you all the matching values, and then you can just compare the count of the result with the count of one of the arrays that were compared.
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.intersect.aspx
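A sketch of that check (assuming neither array contains duplicates, since Intersect uses set semantics):
bool sameContents = first.Length == second.Length
    && first.Intersect(second).Count() == first.Length;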
The Contains-based approach in the question is O(n^2). If the arrays have the same length, sort them, then compare elements in the same position; this is O(n log n).
Or you can use a hash set or dictionary: insert each word in the first array, then see if every word in the second array is in the set or dictionary. This is O(n) on average.
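A sketch of the set-based approach (like Intersect, this ignores duplicate counts):
var set = new HashSet<string>(first);
bool sameContents = first.Length == second.Length && second.All(set.Contains);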
The logic of the method mostly works (though note that the Contains test can give a false positive when the arrays contain duplicates), but testing Contains for each item in the first sequence means the algorithm runs in O(n^2) time in general. You can also make one or two other small optimisations and improvements.
I would implement such a function as follows. Define an extension method as such (example in .NET 4.0).
public static bool SequenceEquals<T>(this IEnumerable<T> seq1, IEnumerable<T> seq2)
{
    // Note: Enumerable.Zip truncates to the shorter sequence, which would
    // wrongly treat a sequence and its own prefix as equal, so enumerate
    // both sequences explicitly and check that they end together.
    var comparer = EqualityComparer<T>.Default;
    using (var e1 = seq1.GetEnumerator())
    using (var e2 = seq2.GetEnumerator())
    {
        while (e1.MoveNext())
        {
            if (!e2.MoveNext() || !comparer.Equals(e1.Current, e2.Current))
                return false;
        }
        return !e2.MoveNext(); // equal only if seq2 is exhausted too
    }
}
You could try Enumerable.Intersect: http://msdn.microsoft.com/en-us/library/bb460136.aspx
The result of the operation is every element that is common to both arrays. If the length of the result is equal to the length of both arrays, then the two arrays contain the same items.
Enumerable.Union: http://msdn.microsoft.com/en-us/library/bb341731.aspx would work too; just check that the result of the Union operation has the same length as both arrays (meaning there are no elements that are unique to only one array).
Although I'm not exactly sure how the functions handle duplicates.
I feel bad asking this question but I am currently not able to program and test this as I'm writing this on my cell-phone and not on my dev machine :P (Easy rep points if someone answers! XD )
Anyway, I've had experience with using hash values from String objects. E.g., if I have StringA and StringB both equal to "foo", they'll both compute the same hash value, because they're set to equal values.
Now what if I have a List<T>, with T being a native data type? If I tried to compute the hash values of ListA and ListB, assuming they're both the same size and contain the same information, wouldn't they have equal hash values as well?
Assume a sample dataset of byte with a length of 5:
{ 5, 2, 0, 1, 3 }
It depends on how you calculate the hash value and how you define equality. For example, two different instances of an array which happen to contain the same values may not be considered equal depending on your application. In this case you may include the address or some other unique value per array as part of the hash function.
However, if you want to consider two distinct arrays which contain the same values equal, you would calculate the list hash using only the values in the array. Of course, you then have to consider whether ordering matters to you in determining equality (and thus in influencing your hash function).
If the order of items is important then you could generate a sequence hashcode like this.
public static int GetOrderedHashCode<T>(this IEnumerable<T> source)
{
    unchecked
    {
        int hash = 269;
        foreach (T item in source)
        {
            hash = (hash * 17) + item.GetHashCode();
        }
        return hash;
    }
}
If the order of items isn't important, then you could do something like this instead:
public static int GetUnorderedHashCode<T>(this IEnumerable<T> source)
{
    unchecked
    {
        int sum = 907;
        int count = 953;
        foreach (T item in source)
        {
            sum = sum + item.GetHashCode();
            count++;
        }
        return 991 * sum * count;
    }
}
(Note that both of these methods will have poor performance for larger collections, in which case you might want to implement some sort of cache and only recalculate the hashcode when the collection changes.)
If you're talking about the built-in list types, then no, they will not be equal. Why? Because List<T> is a reference type, so equality compares references. If you are creating a custom list type, you could override the Equals and GetHashCode methods to support this behavior, but it isn't going to happen with the built-in types.