Array as Dictionary key gives a lot of collisions - c#

I need to use a list of numbers (longs) as a Dictionary key in order to do some group calculations on them.
When using the long array as a key directly, I get a lot of collisions. If I use string.Join(",", myLongs) as a key, it works as I would expect it to, but that's much, much slower (because the hash is more complicated, I assume).
Here's an example demonstrating my problem:
Console.WriteLine("Int32");
Console.WriteLine(new[] { 1, 2, 3, 0}.GetHashCode());
Console.WriteLine(new[] { 1, 2, 3, 0 }.GetHashCode());
Console.WriteLine("String");
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0}).GetHashCode());
Console.WriteLine(string.Join(",", new[] { 1, 2, 3, 0 }).GetHashCode());
Output:
Int32
43124074
51601393
String
406954194
406954194
As you can see, the arrays return a different hash.
Is there any way of getting the performance of the long array hash, but the uniqueness of the string hash?
See my own answer below for a performance comparison of all the suggestions.
About the potential duplicate -- that question has a lot of useful information, but as this question is primarily about finding high-performance alternatives, I think the answers here still provide some useful solutions that are not mentioned there.

That the first two hash codes are different is actually expected. Arrays are a reference type, and they use the default GetHashCode(), which is based on the object's identity (effectively the reference) rather than its contents. That is something you have no influence on, but it is preserved if you assign the same instance to a new reference variable.
In the 2nd case you get the hash of the string "1,2,3,0", because string.Join enumerates the array and joins its elements. Strings have value-like equality semantics (plus string interning), so two strings with the same contents produce the same hash code.

Another alternative is to leverage the lesser-known IEqualityComparer<T> to implement your own hash and equality comparisons. There are some guidelines you'll need to observe about building good hashes, and it's generally not good practice to have editable data in your keys, as it'll introduce instability should the keys ever change, but it would certainly be more performant than using string joins.
public class ArrayKeyComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        return x == null || y == null
            ? x == null && y == null
            : x.SequenceEqual(y);
    }

    public int GetHashCode(int[] obj)
    {
        var seed = 0;
        if (obj != null)
            foreach (int i in obj)
                seed %= i.GetHashCode();
        return seed;
    }
}
Note that this still may not be as performant as a tuple, since it has to iterate the array rather than working with a fixed number of fields.
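For illustration, here is a minimal usage sketch (the dictionary value type and sample data are just assumptions for the example):
var counts = new Dictionary<int[], int>(new ArrayKeyComparer());

var key = new[] { 1, 2, 3, 0 };
counts[key] = 1;

// A different array instance with the same contents now finds the same entry,
// because the comparer's Equals/GetHashCode are used instead of the array's own.
bool found = counts.ContainsKey(new[] { 1, 2, 3, 0 }); // true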

Your strings correctly return the same hash code for the same contents because string.GetHashCode() computes the hash from the characters of the string.
Arrays don't override GetHashCode(); int[].GetHashCode() uses the default implementation, which is based on the object's identity rather than its contents.
So that's why your arrays with identical contents are returning different hash codes.
Rather than using an array directly as a key, you should consider writing a wrapper class for an array that will provide a proper hash code.
The main disadvantage with this is that it will be an O(N) operation to compute the hash code (it has to be - otherwise it wouldn't represent all the data in the array).
Fortunately you can cache the hash code so it's only computed once.
Another major problem with using a mutable array for a hash code is that if you change the contents of the array after using it for the key of a hashing container such as Dictionary, you will break the container.
Ideally you would only use this kind of hashing for arrays that are never changed.
Bearing all that in mind, a simple wrapper would look like this:
public sealed class IntArrayKey
{
    public IntArrayKey(int[] array)
    {
        Array = array;
        _hashCode = hashCode();
    }

    public int[] Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    int hashCode()
    {
        int result = 17;
        unchecked
        {
            foreach (var i in Array)
            {
                result = result * 23 + i;
            }
        }
        return result;
    }

    readonly int _hashCode;
}
You can use that in place of the actual arrays for more sensible hash code generation.
As per the comments below, here's a version of the class that:
Makes a defensive copy of the array so that it cannot be modified.
Implements equality operators.
Exposes the underlying array as a read-only list, so callers can access its contents but cannot break its hash code.
Code:
public sealed class IntArrayKey : IEquatable<IntArrayKey>
{
    public IntArrayKey(IEnumerable<int> sequence)
    {
        _array = sequence.ToArray();
        _hashCode = hashCode();
        Array = new ReadOnlyCollection<int>(_array);
    }

    public bool Equals(IntArrayKey other)
    {
        if (other is null)
            return false;

        if (ReferenceEquals(this, other))
            return true;

        return _hashCode == other._hashCode && equals(other.Array);
    }

    public override bool Equals(object obj)
    {
        return ReferenceEquals(this, obj) || obj is IntArrayKey other && Equals(other);
    }

    public static bool operator ==(IntArrayKey left, IntArrayKey right)
    {
        return Equals(left, right);
    }

    public static bool operator !=(IntArrayKey left, IntArrayKey right)
    {
        return !Equals(left, right);
    }

    public IReadOnlyList<int> Array { get; }

    public override int GetHashCode()
    {
        return _hashCode;
    }

    bool equals(IReadOnlyList<int> other) // other cannot be null.
    {
        if (_array.Length != other.Count)
            return false;

        for (int i = 0; i < _array.Length; ++i)
            if (_array[i] != other[i])
                return false;

        return true;
    }

    int hashCode()
    {
        int result = 17;
        unchecked
        {
            foreach (var i in _array)
            {
                result = result * 23 + i;
            }
        }
        return result;
    }

    readonly int _hashCode;
    readonly int[] _array;
}
If you wanted to use the above class without the overhead of making a defensive copy of the array, you can change the constructor to:
public IntArrayKey(int[] array)
{
    _array = array;
    _hashCode = hashCode();
    Array = new ReadOnlyCollection<int>(_array);
}

If you know the length of the arrays you're using, you could use a Tuple.
Console.WriteLine("Tuple");
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Console.WriteLine(Tuple.Create(1, 2, 3, 0).GetHashCode());
Outputs
Tuple
1248
1248
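If the data starts out as an int[] of known length, the key can be built from it; a sketch, where the ToKey helper is just an illustrative assumption:
// Build a 4-tuple key from an int[] known to have length 4.
static Tuple<int, int, int, int> ToKey(int[] a) =>
    Tuple.Create(a[0], a[1], a[2], a[3]);

var dict = new Dictionary<Tuple<int, int, int, int>, int>();
dict[ToKey(new[] { 1, 2, 3, 0 })] = 1;
Console.WriteLine(dict[ToKey(new[] { 1, 2, 3, 0 })]); // prints 1, same key found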

I took all the suggestions from this question and the similar byte[].GetHashCode() question, and made a simple performance test.
The suggestions are as follows:
int[] as key (original attempt -- does not work at all, included as a benchmark)
string as key (original solution -- works, but slow)
Tuple as key (suggested by David)
ValueTuple as key (inspired by the Tuple)
Direct int[] hash as key
IntArrayKey (suggested by Matthew Watson)
int[] as key with Skeet's IEqualityComparer
int[] as key with David's IEqualityComparer
I generated a List containing one million int[]-arrays of length 7 containing random numbers between 100 000 and 999 999 (which is an approximation of my current use case). Then I duplicated the first 100 000 of these arrays, so that there are 900 000 unique arrays, and 100 000 that are listed twice (to force collisions).
For each solution, I enumerated the list, and added the keys to a Dictionary, OR incremented the Value if the key already existed. Then I printed how many keys had a Value more than 1**, and how much time it took.
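For reference, the per-solution counting loop looked roughly like this; the sketch below shows the string-key variant, with arrays standing for the generated List<int[]> and the key construction swapped out for each suggestion:
var dict = new Dictionary<string, int>();
foreach (var array in arrays)
{
    var key = string.Join(",", array);
    if (dict.TryGetValue(key, out var count))
        dict[key] = count + 1;
    else
        dict[key] = 1;
}
int duplicateKeys = dict.Values.Count(v => v > 1); // expected: 100 000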
The results are as follows (ordered from best to worst):
Algorithm                 Works?  Time usage
NonGenericSkeetEquality   YES             392 ms
SkeetEquality             YES             422 ms
ValueTuple                YES             521 ms
QuickIntArrayKey          YES             747 ms
IntArrayKey               YES             972 ms
Tuple                     YES           1 609 ms
string                    YES           2 291 ms
DavidEquality             YES       1 139 200 ms ***
int[]                     NO              336 ms
IntHash                   NO              386 ms
The Skeet IEqualityComparer is only slightly slower than using the int[] as key directly, with the huge advantage that it actually works, so I'll use that.
** I'm aware that this is not a completely foolproof check, as I could theoretically get the expected number of collisions without them actually being the collisions I expected, but having run the test a lot of times, I'm fairly certain that isn't the case.
*** Did not finish, probably due to poor hashing algorithm and a lot of equality checks. Had to reduce the number of arrays to 10 000, then multiply the time usage by 100 to compare with the others.

Related

Dictionary hash function for fuzzy lookups

When an approximate comparison between strings is required, the basic Levenshtein distance can help. It measures the number of single-character edits needed to turn one string into another:
"aaaa" vs "aaab" => 1
"abba" vs "aabb" => 2
"aaaa" vs "a" => 3
When using a Dictionary<T, U> one can provide a custom IEqualityComparer<T>. One can implement the Levenshtein Distance as an IEqualityComparer<string>:
public class LevenshteinStringComparer : IEqualityComparer<string>
{
    private readonly int _maximumDistance;

    public LevenshteinStringComparer(int maximumDistance)
        => _maximumDistance = maximumDistance;

    public bool Equals(string x, string y)
        => ComputeLevenshteinDistance(x, y) <= _maximumDistance;

    public int GetHashCode(string obj)
        => 0;

    private static int ComputeLevenshteinDistance(string s, string t)
    {
        // Omitted for simplicity
        // Example can be found here: https://www.dotnetperls.com/levenshtein
    }
}
So we can use a fuzzy dictionary:
var dict = new Dictionary<string, int>(new LevenshteinStringComparer(2));
dict["aaa"] = 1;
dict["aab"] = 2; // Modify existing value under "aaa" key
// Only one key was created:
dict.Keys => { "aaa" }
Having set all this up, you may have noticed that we haven't implemented a proper GetHashCode in LevenshteinStringComparer, which the dictionary would greatly appreciate. As rules of thumb regarding hash codes, I'd use:
Unequal objects should not have the same hash code
Equal objects must have the same hash code
The only hash function following these rules that I can imagine is a constant number, just as implemented in the code above. That isn't optimal, though; if we instead take, for example, the default hash of the string, then aaa and aab end up with different hashes, even though they are treated as equal. Thinking this through, it means all possible strings would have to have the same hash.
Am I correct? And why does the performance of the dictionary get better when I use the default string hash function with hash collisions for our comparer? Shouldn't this make the hash buckets inside the dictionary invalid?
public int GetHashCode(string obj)
=> obj.GetHashCode();
I don't think there is a hashing function that could work in your case.
The problem is that you have to assign the bucket based on a single value only, while you can't know what was added before. But the Levenshtein distance of the item being hashed can be anything from 0 to "infinity"; the only thing that matters is what it is compared with. Hence you cannot satisfy the second condition of a hash function (that equal objects have the same hash code).
Another argument, a "pseudo-proof", is the situation where you want a maximum distance of 2 and you already have two items in the dictionary with a mutual distance of 3. If you then add a string which is at distance 2 from the first item and distance 1 from the second item, how would you decide which item it should match? It satisfies your maximum for both items, but it should probably match the second one rather than the first. Not knowing anything about the contents of the dictionary, you cannot know how to hash it correctly.
For the second question - using the default string.GetHashCode() method does improve performance, but it destroys the functionality of your equality comparer. If you test this solution on your sample code, you can see that the dict will contain two keys now. This is because GetHashCode returned two different hash codes, so there was no conflict and dict now has two buckets and your Equals method was not even executed.
I can understand fuzzy lookup, but not fuzzy storage. Why would you want to overwrite "aaa" when assigning a value for "aab"? If all you want is fuzzy lookup, wouldn't it be better to have a normal dictionary with an extension that does a fuzzy lookup, like...
public static class DictionaryExtensions
{
    public static IEnumerable<T> FuzzyMatch<T>(this IDictionary<string, T> dictionary, string key, int distance = 2)
    {
        IEqualityComparer<string> comparer = new LevenshteinStringComparer(distance);
        return dictionary
            .Keys
            .Where(k => comparer.Equals(k, key))
            .Select(k => dictionary[k]);
    }
}
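Usage would then look something like this (a sketch, assuming the distance computation from the question is filled in):
var dict = new Dictionary<string, int>();
dict["aaa"] = 1;
dict["aab"] = 2;

// Normal storage, fuzzy lookup: both keys are within distance 1 of "aac".
var matches = dict.FuzzyMatch("aac", 1).ToList(); // { 1, 2 }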
This is more of a comment than an answer. To answer your question, if you consider the following example...
"abba" vs "cbbc" => 2
"cddc" vs "cbbc" => 2
"abba" vs "cddc" => 4
You get the gist here? I.e., clearly it's not possible for the following to be true:
abba == cbbc &&
cddc == cbbc &&
abba != cddc

Grouping unique integer arrays

One module in my app generates a small array of integers. Typically the size is 25 integers. The integers tend to be pretty small, less than 10000. I'd like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.
So, for every new array I need to figure out if it already exists, and if it does, what its index is.
A naive approach is to keep all arrays in a list and then just call:
MyList.FindIndex(x=>x.SequenceEqual(Small_Array));
But this becomes very slow once the number of arrays gets into the thousands.
A less naive approach is to store all arrays in a dictionary where the key is a hash value of the array. If the hash is just another (32-bit) integer, then I cannot find a good hashing algorithm that doesn't collide.
Which, I think, leaves me with using a hashing algorithm like MD5 that can be converted into a 128-bit integer. Is that a good way to tackle my problem?
Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".
The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great length to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:
public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}
You'd then create the dictionary like this:
var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());
Then you'd have something like:
public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}
Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.
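As a rough sketch of the ConcurrentDictionary remark (names mirror the code above; note the value factory may run more than once under contention, so the generated indexes can have gaps):
private readonly ConcurrentDictionary<int[], int> arrayMap =
    new ConcurrentDictionary<int[], int>(new Int32ArrayEqualityComparer());
private int nextIndex;

public int MaybeAddArray(int[] array) =>
    // GetOrAdd does the lookup-or-insert in a single call.
    arrayMap.GetOrAdd(array, _ => Interlocked.Increment(ref nextIndex));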

Looking for something like a HashSet, but with a range of values for the key?

I'm wondering if there is something like HashSet, but keyed by a range of values.
For example, we could add an item which is keyed to all integers between 100 and 4000. This item would be returned if we used any key between 100 and 4000, e.g. 287.
I would like the lookup speed to be quite close to HashSet, i.e. O(1). It would be possible to implement this using a binary search, but this would be too slow for the requirements. I would like to use standard .NET API calls as much as possible.
Update
This is interesting: https://github.com/mbuchetics/RangeTree
It has a time complexity of O(log(N)) where N is number of intervals, so it's not exactly O(1), but it could be used to build a working implementation.
I don't believe there's a structure for it already. You could implement something like a RangedDictionary:
class RangedDictionary {
    private Dictionary<Range, int> _set = new Dictionary<Range, int>();

    public void Add(Range r, int key) {
        _set.Add(r, key);
    }

    public int Get(int key) {
        //find a range that includes that key and return _set[range]
    }
}

struct Range {
    public int Begin;
    public int End;
    //override GetHashCode() and Equals() methods so that you can index a Dictionary by Range
}
EDIT: changed HashSet to Dictionary
Here is a solution you can try out. However, it makes a couple of assumptions:
No range overlaps
When you request a number, it is effectively inside a range (no error checking)
From what you said, this one is O(N), but you can make it O(log(N)) with little effort, I think.
The idea is that a class will handle the range logic: it basically converts any value given to it to its range's lower boundary. This way your hash table (here a Dictionary) contains the low boundaries as keys.
public class Range
{
    //We store all the ranges we have
    private static List<int> ranges = new List<int>();

    public int value { get; set; }

    public static void CreateRange(int RangeStart, int RangeStop)
    {
        ranges.Add(RangeStart);
        ranges.Sort();
    }

    public Range(int value)
    {
        int previous = ranges[0];

        //Here we will find the range and give it the low boundary
        //This is a very simple foreach loop but you can make it better
        foreach (int item in ranges)
        {
            if (item > value)
            {
                break;
            }
            previous = item;
        }

        this.value = previous;
    }

    public override int GetHashCode()
    {
        return value;
    }
}
Here is how to test it:
class Program
{
    static void Main(string[] args)
    {
        Dictionary<int, int> myRangedDic = new Dictionary<int, int>();

        Range.CreateRange(10, 20);
        Range.CreateRange(50, 100);

        myRangedDic.Add(new Range(15).value, 1000);
        myRangedDic.Add(new Range(75).value, 5000);

        Console.WriteLine("searching for 16 : {0}", myRangedDic[new Range(16).value].ToString());
        Console.WriteLine("searching for 64 : {0}", myRangedDic[new Range(64).value].ToString());

        Console.ReadLine();
    }
}
I don't believe you can really go below O(log(N)), because there is no way to know immediately which range a number is in; you must always compare it with a lower (or upper) bound.
If you had predetermined ranges, that would be easier to do, i.e. if your ranges are every hundred, it is really easy to find the correct range of any number by taking it modulo 100, but here we can assume nothing, so we must check.
To go down to O(log(N)) with this solution, just replace the foreach with a loop that looks at the middle of the array and splits it in two at every iteration, as sketched below.
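A minimal sketch of that replacement, assuming the same sorted List<int> of lower boundaries and that the value is not below the first range start:
public Range(int value)
{
    // Binary search over the sorted lower boundaries: O(log N) instead of O(N).
    int index = ranges.BinarySearch(value);

    // If not found, BinarySearch returns the bitwise complement of the index of
    // the next larger element, so the containing range starts one position earlier.
    if (index < 0)
        index = ~index - 1;

    this.value = ranges[index];
}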

Converting a method to use any Enum

My Problem:
I want to convert my randomBloodType() method to a static method that can take any enum type. I want my method to take any type of enum whether it be BloodType, DaysOfTheWeek, etc. and perform the operations shown below.
Some Background on what the method does:
The method currently chooses a random element from the BloodType enum based on the values assigned to each element. An element with a higher value has a higher probability to be picked.
Code:
public enum BloodType
{
    // BloodType = Probability
    ONeg = 4,
    OPos = 36,
    ANeg = 3,
    APos = 28,
    BNeg = 1,
    BPos = 20,
    ABNeg = 1,
    ABPos = 5
};

public BloodType randomBloodType()
{
    // Get the values of the BloodType enum and store them in an array
    BloodType[] bloodTypeValues = (BloodType[])Enum.GetValues(typeof(BloodType));
    List<BloodType> bloodTypeList = new List<BloodType>();

    // Create a list where each element occurs the approximate number of
    // times defined as its value (probability)
    foreach (BloodType val in bloodTypeValues)
    {
        for (int i = 0; i < (int)val; i++)
        {
            bloodTypeList.Add(val);
        }
    }

    // Sum the values
    int sum = 0;
    foreach (BloodType val in bloodTypeValues)
    {
        sum += (int)val;
    }

    // Get a random value
    Random rand = new Random();
    int randomValue = rand.Next(sum);

    return bloodTypeList[randomValue];
}
What I have tried so far:
I have tried to use generics. They worked out for the most part, but I was unable to cast my enum elements to int values. I included an example of a section of code that was giving me problems below.
foreach (T val in bloodTypeValues)
{
    sum += (int)val; // This line is the problem.
}
I have also tried using Enum e as a method parameter. I was unable to declare the type of my array of enum elements using this method.
(Note: My apologies in advance for the lengthy answer. My actual proposed solution is not all that long, but there are a number of problems with the proposed solutions so far and I want to try to address those thoroughly, to provide context for my own proposed solution).
In my opinion, while you have in fact accepted one answer and might be tempted to use either one, neither of the answers provided so far are correct or useful.
Commenter Ben Voigt has already pointed out two major flaws with your specifications as stated, both related to the fact that you are encoding the enum value's weight in the value itself:
You are tying the enum's underlying type to the code that then must interpret that type.
Two enum values that have the same weight are indistinguishable from each other.
Both of these issues can be addressed. Indeed, while the answer you accepted (why?) fails to address the first issue, the one provided by Dweeberly does address this through the use of Convert.ToInt32() (which can convert from long to int just fine, as long as the values are small enough).
But the second issue is much harder to address. The answer from Asad attempts to address this by starting with the enum names and parsing them to their values. And this does indeed result in the final array being indexed containing the corresponding entries for each name separately. But the code actually using the enum has no way to distinguish the two; it's really as if those two names are a single enum value, and that single enum value's probability weight is the sum of the value used for the two different names.
I.e. in your example, while the enum entries for e.g. BNeg and ABNeg will be selected separately, the code that receives the randomly selected value has no way to know whether it was BNeg or ABNeg that was selected. As far as it knows, those are just two different names for the same value.
Now, even this problem can be addressed (but not in the way that Asad attempts to…his answer is still broken). If you were, for example, to encode the probabilities in the value while still ensuring unique values for each name, you could decode those probabilities while doing the random selection and that would work. For example:
enum BloodType
{
    // BloodType = Probability
    ONeg = 4 * 100 + 0,
    OPos = 36 * 100 + 1,
    ANeg = 3 * 100 + 2,
    APos = 28 * 100 + 3,
    BNeg = 1 * 100 + 4,
    BPos = 20 * 100 + 5,
    ABNeg = 1 * 100 + 6,
    ABPos = 5 * 100 + 7,
};
Having declared your enum values that way, then you can in your selection code divide the enum value by 100 to obtain its probability weight, which then can be used as seen in the various examples. At the same time, each enum name has a unique value.
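For instance, a one-line sketch of that decoding step (assuming the multiplier of 100 used above):
// Recover the probability weight encoded in the enum value.
static int WeightOf(BloodType type) => (int)type / 100; // e.g. BloodType.OPos -> 36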
But even solving that problem, you are still left with problems related to the choice of encoding and representation of the probabilities. For example, in the above you cannot have an enum that has more than 100 values, nor one with weights larger than (2^31 - 1) / 100; if you want an enum that has more than 100 values, you need a larger multiplier but that would limit your weight values even more.
In many scenarios (maybe all the ones you care about) this won't be an issue. The numbers are small enough that they all fit. But that seems like a serious limitation in what seems like a situation where you want a solution that is as general as possible.
And that's not all. Even if the encoding stays within reasonable limits, you have another significant limit to deal with: the random selection process requires an array large enough to contain for each enum value as many instances of that value as its weight. Again, if the values are small maybe this is not a big problem. But it does severely limit the ability of your implementation to generalize.
So, what to do?
I understand the temptation to try to keep each enum type self-contained; there are some obvious advantages to doing so. But there are also some serious disadvantages that result from that, and if you truly ever try to use this in a generalized way, the changes to the solutions proposed so far will tie your code together in ways that IMHO negate most if not all of the advantage of keeping the enum types self-contained (primarily: if you find you need to modify the implementation to accommodate some new enum type, you will have to go back and edit all of the other enum types you're using…i.e. while each type looks self-contained, in reality they are all tightly coupled with each other).
In my opinion, a much better approach would be to abandon the idea that the enum type itself will encode the probability weights. Just accept that this will be declared separately somehow.
Also, IMHO is would be better to avoid the memory-intensive approach proposed in your original question and mirrored in the other two answers. Yes, this is fine for the small values you're dealing with here. But it's an unnecessary limitation, making only one small part of the logic simpler while complicating and restricting it in other ways.
I propose the following solution, in which the enum values can be whatever you want, the enum's underlying type can be whatever you want, and the algorithm uses memory proportionally only to the number of unique enum values, rather than in proportion to the sum of all of the probability weights.
In this solution, I also address possible performance concerns, by caching the invariant data structures used to select the random values. This may or may not be useful in your case, depending on how frequently you will be generating these random values. But IMHO it is a good idea regardless; the up-front cost of generating these data structures is so high that if the values are selected with any regularity at all, it will begin to dominate the run-time cost of your code. Even if it works fine today, why take the risk? (Again, especially given that you seem to want a generalized solution).
Here is the basic solution:
static T NextRandomEnumValue<T>()
{
    KeyValuePair<T, int>[] aggregatedWeights = GetWeightsForEnum<T>();
    int weightedValue =
            _random.Next(aggregatedWeights[aggregatedWeights.Length - 1].Value),
        index = Array.BinarySearch(aggregatedWeights,
            new KeyValuePair<T, int>(default(T), weightedValue),
            KvpValueComparer<T, int>.Instance);

    return aggregatedWeights[index < 0 ? ~index : index + 1].Key;
}

static KeyValuePair<T, int>[] GetWeightsForEnum<T>()
{
    object temp;

    if (_typeToAggregatedWeights.TryGetValue(typeof(T), out temp))
    {
        return (KeyValuePair<T, int>[])temp;
    }

    if (!_typeToWeightMap.TryGetValue(typeof(T), out temp))
    {
        throw new ArgumentException("Unsupported enum type");
    }

    KeyValuePair<T, int>[] weightMap = (KeyValuePair<T, int>[])temp;
    KeyValuePair<T, int>[] aggregatedWeights =
        new KeyValuePair<T, int>[weightMap.Length];
    int sum = 0;

    for (int i = 0; i < weightMap.Length; i++)
    {
        sum += weightMap[i].Value;
        aggregatedWeights[i] = new KeyValuePair<T, int>(weightMap[i].Key, sum);
    }

    _typeToAggregatedWeights[typeof(T)] = aggregatedWeights;

    return aggregatedWeights;
}

readonly static Random _random = new Random();

// Helper method to reduce verbosity in the enum-to-weight array declarations
static KeyValuePair<T1, T2> CreateKvp<T1, T2>(T1 t1, T2 t2)
{
    return new KeyValuePair<T1, T2>(t1, t2);
}

readonly static KeyValuePair<BloodType, int>[] _bloodTypeToWeight =
{
    CreateKvp(BloodType.ONeg, 4),
    CreateKvp(BloodType.OPos, 36),
    CreateKvp(BloodType.ANeg, 3),
    CreateKvp(BloodType.APos, 28),
    CreateKvp(BloodType.BNeg, 1),
    CreateKvp(BloodType.BPos, 20),
    CreateKvp(BloodType.ABNeg, 1),
    CreateKvp(BloodType.ABPos, 5),
};

readonly static Dictionary<Type, object> _typeToWeightMap =
    new Dictionary<Type, object>()
    {
        { typeof(BloodType), _bloodTypeToWeight },
    };

readonly static Dictionary<Type, object> _typeToAggregatedWeights =
    new Dictionary<Type, object>();
Note that the work of actually selecting a random value is simply a matter of choosing a non-negative random integer less than the sum of the weights, and then using a binary search to find the appropriate enum value.
Once per enum type, the code will build the table of values and weight-sums that will be used for the binary search. This result is stored in a cache dictionary, _typeToAggregatedWeights.
There are also the objects that have to be declared and which will be used at run-time to build this table. Note that the _typeToWeightMap is just in support of making this method 100% generic. If you wanted to write a differently named method for each specific type you wanted to support, that could still use a single generic method to implement the initialization and selection, but the named method would know the correct object (e.g. _bloodTypeToWeight) to use for initialization.
Alternatively, another way to avoid the _typeToWeightMap while still keeping the method 100% generic would be to have the _typeToAggregatedWeights be of type Dictionary<Type, Lazy<object>>, and have the values of the dictionary (the Lazy<object> objects) explicitly reference the appropriate weight array for the type.
In other words, there are lots of variations on this theme that would work fine. But they will all have essentially the same structure as above; semantics would be the same and performance differences would be negligible.
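For illustration, here is a hedged sketch of that Lazy<object> variant (one possible reading; BuildAggregatedWeights stands in for the aggregation loop shown in GetWeightsForEnum above):
// Each entry lazily builds its aggregated-weight array on first use.
readonly static Dictionary<Type, Lazy<object>> _typeToAggregatedWeights =
    new Dictionary<Type, Lazy<object>>()
    {
        { typeof(BloodType), new Lazy<object>(() => BuildAggregatedWeights(_bloodTypeToWeight)) },
    };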
One thing you'll notice is that the binary search requires a custom IComparer<T> implementation. That is here:
class KvpValueComparer<TKey, TValue> :
    IComparer<KeyValuePair<TKey, TValue>> where TValue : IComparable<TValue>
{
    public readonly static KvpValueComparer<TKey, TValue> Instance =
        new KvpValueComparer<TKey, TValue>();

    private KvpValueComparer() { }

    public int Compare(KeyValuePair<TKey, TValue> x, KeyValuePair<TKey, TValue> y)
    {
        return x.Value.CompareTo(y.Value);
    }
}
This allows the Array.BinarySearch() method to correctly compare the array elements, allowing a single array to contain both the enum values and their aggregated weights, while limiting the binary search comparison to just the weights.
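Usage is then simply (assuming the weight table for the enum has been registered as above):
// Draws a BloodType with probability proportional to its registered weight.
BloodType bt = NextRandomEnumValue<BloodType>();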
Assuming your enum values are all of type int (you can adjust this accordingly if they're long, short, or whatever):
static TEnum RandomEnumValue<TEnum>(Random rng)
{
    var vals = Enum
        .GetNames(typeof(TEnum))
        .Aggregate(Enumerable.Empty<TEnum>(), (agg, curr) =>
        {
            var value = Enum.Parse(typeof(TEnum), curr);
            return agg.Concat(Enumerable.Repeat((TEnum)value, (int)value)); // For int enums
        })
        .ToArray();

    return vals[rng.Next(vals.Length)];
}
Here's how you would use it:
var rng = new Random();
var randomBloodType = RandomEnumValue<BloodType>(rng);
People seem to have their knickers in a knot about multiple indistinguishable enum values in the input enum (for which I still think the above code provides expected behavior). Note that there is no answer here, not even Peter Duniho's, that will allow you to distinguish enum entries when they have the same value, so I'm not sure why this is being considered as a metric for any potential solutions.
Nevertheless, an alternative approach that doesn't use the enum values as probabilities is to use an attribute to specify the probability:
public enum BloodType
{
    [P(4)]
    ONeg,
    [P(36)]
    OPos,
    [P(3)]
    ANeg,
    [P(28)]
    APos,
    [P(1)]
    BNeg,
    [P(20)]
    BPos,
    [P(1)]
    ABNeg,
    [P(5)]
    ABPos
}
Here is what the attribute used above looks like:
[AttributeUsage(AttributeTargets.Field, AllowMultiple = false)]
public class PAttribute : Attribute
{
    public int Weight { get; private set; }

    public PAttribute(int weight)
    {
        Weight = weight;
    }
}
and finally, this is what the method to get a random enum value would like:
static TEnum RandomEnumValue<TEnum>(Random rng)
{
    var vals = Enum
        .GetNames(typeof(TEnum))
        .Aggregate(Enumerable.Empty<TEnum>(), (agg, curr) =>
        {
            var value = Enum.Parse(typeof(TEnum), curr);
            FieldInfo fi = typeof(TEnum).GetField(curr);
            var weight = ((PAttribute)fi.GetCustomAttribute(typeof(PAttribute), false)).Weight;
            return agg.Concat(Enumerable.Repeat((TEnum)value, weight)); // For int enums
        })
        .ToArray();

    return vals[rng.Next(vals.Length)];
}
(Note: if this code is performance critical, you might need to tweak this and add caching for the reflection data).
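One possible shape for that caching, as a sketch (the cache field and the BuildWeightedValues helper are assumptions standing in for the Aggregate code above):
// Cache the expanded value array per enum type so the reflection work runs once per TEnum.
static readonly ConcurrentDictionary<Type, Array> _valueCache =
    new ConcurrentDictionary<Type, Array>();

static TEnum CachedRandomEnumValue<TEnum>(Random rng)
{
    var vals = (TEnum[])_valueCache.GetOrAdd(typeof(TEnum), _ => BuildWeightedValues<TEnum>());
    return vals[rng.Next(vals.Length)];
}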
Some of this you can do and some of it isn't so easy. I believe the following extension method will do what you describe.
static public class Util {
    static Random rnd = new Random();

    static public int PriorityPickEnum(this Enum e) {
        // The approved types for an enum are byte, sbyte, short, ushort, int, uint, long, or ulong.
        // However, Random only supports an int (or double) as a max value. Either way
        // it doesn't have the range for uint, long and ulong.
        //
        // sum enum
        int sum = 0;
        foreach (var x in Enum.GetValues(e.GetType())) {
            sum += Convert.ToInt32(x);
        }

        var i = rnd.Next(sum); // get a random value, it will form a ratio i / sum

        // enums may not have a uniform (incremented) value range (think about flags)
        // therefore we have to step through to get to the range we want,
        // this is due to the requirement that the return value have a probability
        // proportional to its value. Note enum values must be sorted for this to work.
        foreach (var x in Enum.GetValues(e.GetType()).OfType<Enum>().OrderBy(a => a)) {
            i -= Convert.ToInt32(x);
            if (i <= 0) return Convert.ToInt32(x);
        }

        throw new Exception("This doesn't seem right");
    }
}
Here is an example of using this extension:
BloodType bt = BloodType.ABNeg;
for (int i = 0; i < 100; i++) {
    var v = (BloodType)bt.PriorityPickEnum();
    Console.WriteLine("{0}: {1}({2})", i, v, (int)v);
}
This should work pretty well for enums of type byte, sbyte, ushort, short and int. Once you get beyond int (uint, long, ulong), the problem is the Random class. You can adjust the code to use doubles generated by Random, which would cover uint, but the Random class just doesn't have the range to cover long and ulong. Of course, you could use/find/write a different Random class if this is important.

Arrays/Lists and computing hashvalues (VB, C#)

I feel bad asking this question but I am currently not able to program and test this as I'm writing this on my cell-phone and not on my dev machine :P (Easy rep points if someone answers! XD )
Anyway, I've had experience with using hash values from String objects. E.g., if I have StringA and StringB both equal to "foo", they'll both compute the same hash value, because they're set to equal values.
Now what if I have a List<T>, with T being a native data type? If I tried to compute the hash values of ListA and ListB, assuming they're both the same size and contain the same information, wouldn't they have equal hash values as well?
Assume a sample dataset of bytes with a length of 5:
{5,2,0,1,3}
It depends on how you calculate the hash value and how you define equality. For example, two different instances of an array which happen to contain the same values may not be considered equal, depending on your application. In this case you may include the address or some other unique per-array value as part of the hash function.
However, if you want to consider two distinct arrays which contain the same values equal, you would calculate the list hash using only the values in the array. Of course, then you have to consider whether ordering matters in determining equality (and thus in your hash function).
If the order of items is important, then you could generate a sequence hash code like this:
public static int GetOrderedHashCode<T>(this IEnumerable<T> source)
{
    unchecked
    {
        int hash = 269;
        foreach (T item in source)
        {
            hash = (hash * 17) + item.GetHashCode();
        }
        return hash;
    }
}
If the order of items isn't important, then you could do something like this instead:
public static int GetUnorderedHashCode<T>(this IEnumerable<T> source)
{
    unchecked
    {
        int sum = 907;
        int count = 953;
        foreach (T item in source)
        {
            sum = sum + item.GetHashCode();
            count++;
        }
        return 991 * sum * count;
    }
}
(Note that both of these methods will have poor performance for larger collections, in which case you might want to implement some sort of cache and only recalculate the hashcode when the collection changes.)
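One possible shape for such a cache, as a sketch (the wrapper type is made up for illustration):
// Hypothetical wrapper that caches the ordered hash code and recomputes it
// only after the collection has been modified through the wrapper.
public class HashCachedList<T>
{
    private readonly List<T> _items = new List<T>();
    private int? _hash;

    public void Add(T item) { _items.Add(item); _hash = null; }

    public override int GetHashCode() =>
        _hash ?? (_hash = _items.GetOrderedHashCode()).Value;
}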
If you're talking about the built-in list types, then no, they will not be equal. Why? Because List<T> is a reference type, so equality will do a comparison to see if the references are the same. If you are creating a custom list type, then you could override the Equals and GetHashCode methods to support this behavior, but it isn't going to happen on the built-in types.
