The implementation of Nullable<T>.GetHashCode() is as follows:
public override int GetHashCode()
{
if (!this.HasValue)
{
return 0;
}
return this.value.GetHashCode();
}
If, however, the underlying value also generates a hash code of 0 (e.g. a bool set to false or an Int32 set to 0), then we have two commonly occurring, different object states with the same hash code. It seems to me that a better implementation would have been something like:
public override int GetHashCode()
{
if (!this.HasValue)
{
return unchecked((int)0xD523648A); // E.g. some arbitrary 32-bit value with a good mix of set and
                                   // unset bits (possibly also a prime number).
}
return this.value.GetHashCode();
}
Yes, you do have a point. It is always possible to write a better GetHashCode() implementation if you know up front what data you are going to store. Not a luxury that a library writer ever has available. But yes, if you have a lot of bool? values that are either false or !HasValue, then the default implementation is going to hurt. Same for enums and ints; zero is a common value.
Your argument is academic, however: changing the implementation costs minus ten thousand points, and you can't do it yourself. The best you can do is submit the suggestion; the proper channel is the user-voice site. Getting traction on this is going to be difficult, good luck.
Let's first note that this question is just about performance. The hash code is not required to be unique or collision resistant for correctness. It is helpful for performance though.
Actually, this is the main value proposition of a hash table: Practically evenly distributed hash codes lead to O(1) behavior.
So what hash code constant is most likely to lead to the best possible performance profile in real applications?
Certainly not 0 because 0 is a common hash code: 0.GetHashCode() == 0. That goes for other types as well. 0 is the worst candidate because it tends to occur so often.
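To see how commonly that collision occurs in practice, here is a small illustrative snippet (not from the original question; it just prints the hash codes involved):
using System;
class HashCollisionDemo
{
    static void Main()
    {
        int? noValue = null;
        int? zero = 0;
        bool? notSet = null;
        bool? isFalse = false;
        // All four lines print 0 with the current implementation, so "no value"
        // collides with the most common underlying values.
        Console.WriteLine(noValue.GetHashCode());
        Console.WriteLine(zero.GetHashCode());
        Console.WriteLine(notSet.GetHashCode());
        Console.WriteLine(isFalse.GetHashCode());
    }
}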
So how to avoid collisions? My proposal:
static readonly int nullableDefaultHashCode = GetRandomInt32();
public override int GetHashCode()
{
if (!this.HasValue)
return nullableDefaultHashCode;
else
return this.value.GetHashCode();
}
Evenly distributed, unlikely to collide and no stylistic problem of choosing an arbitrary constant.
Note that GetRandomInt32 could be implemented as return unchecked((int)0xD523648A);. It would still be more useful than return 0;, but it is probably best to query a cheap source of pseudo-random numbers.
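For illustration, GetRandomInt32 could be sketched like this (the helper name comes from the proposal above; the implementation shown is just one possibility, not a framework API):
using System;
static class NullableHashSeed
{
    // Fixed for the lifetime of the process; it only needs to be "some value that is
    // unlikely to collide with common hash codes", not cryptographically strong.
    public static readonly int nullableDefaultHashCode = GetRandomInt32();

    static int GetRandomInt32()
    {
        return new Random().Next(int.MinValue, int.MaxValue);
    }
}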
In the end, a Nullable<T> without a value has to return a hash code, and that hash code should be a constant.
Returning an arbitrary constant may look safer or more appropriate, perhaps even more so when viewed within the specific case of Nullable<int>, but in the end it's just that: a hash.
And within the entire set of types that Nullable<T> can cover (which is infinite), zero is not a better hash code than any other value.
I don't understand the concern here - poor performance in what situation?
Why would you consider a hash function poor based on its result for one value?
I could see that it would be a problem if many different values of a Type hash to the same result. But the fact that null hashes to the same value as 0 seems insignificant.
As far as I know the most common use of a .NET hash function is for a Hashtable, HashSet or Dictionary key, and the fact that zero and null happen to be in the same bucket will have an insignificant effect on overall performance.
Related
I found the following code for computing the hash code:
int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
int index = hashCode % buckets.Length;
Why didn't the engineers choose a universal hashing method, such as:
int index = ((a * k + b) % p) % buckets.Length;
where a and b are random numbers between 0 and p-1 (p is prime), and k is the key's hash code?
A complete answer to the question would require consulting with the individual(s) who wrote that code. So I don't think you're going to get a complete answer.
That said:
The "universal hashing method", as you call it, is hardly the only possible implementation of a good hash code. People implement hash code computations in a variety of ways for a variety of reasons.
More important though…
The computation to which you refer is not actually computing a hash code. The variable name is a bit misleading, because while the value is based on the hash code of the item in question, it's really an implementation detail of the class's internal hash table. By sacrificing the highest bit from the actual hash code, the Entry value for the hash table can be flagged as unused using that bit. Masking the bit off as opposed to, for example, just special-casing an element with a hash code value of -1, preserves the distribution qualities of the original hash code implementation (which is determined outside the Dictionary<TKey, TValue> class).
In other words, the code you're asking about is simply how the author of that code implemented a particular optimization, in which they decreased the size of the Entry value by storing a flag they needed for some other purpose — i.e. the purpose of indicating whether a particular table Entry is used or not — in the same 32-bit value where part of the element's hash code is stored.
Storing the hash code in the Entry value is in turn also an optimization. Since the Entry value includes the TKey key value for the element, the implementation could in fact just have always called the key.GetHashCode() method to get the hash code. This is a trade-off in acknowledging that the GetHashCode() method is not always optimized itself (indeed, most implementations, including .NET's implementation for the System.String class, always recompute the hash code from scratch), and so the choice was (apparently) made to cache the hash code value within the Entry value, rather than asking the TKey value to recompute it every time it's needed.
Don't confuse the caching and subsequent usage of some other object's hash code implementation with an actual hash code implementation. The latter is not what's going on in the code you're asking about, the former is.
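To illustrate the point, here is a simplified sketch of that caching pattern; it is not the actual BCL source, and the type and member names are assumptions made for the example:
using System.Collections.Generic;
class TinyHashTable<TKey, TValue>
{
    struct Entry
    {
        public int hashCode;   // lower 31 bits of the key's hash code; -1 can then mark an unused slot
        public int next;       // index of the next entry in the same bucket chain; -1 ends the chain
        public TKey key;
        public TValue value;
    }

    readonly int[] buckets;    // bucket -> index of the first entry in its chain, or -1
    readonly Entry[] entries;
    readonly IEqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;

    public TinyHashTable(int capacity)
    {
        buckets = new int[capacity];
        entries = new Entry[capacity];
        for (int i = 0; i < capacity; i++)
        {
            buckets[i] = -1;
        }
    }

    // The cached entry hash lets most non-matching entries be rejected with a single
    // integer comparison, without re-running the key's GetHashCode or Equals.
    public int FindEntry(TKey key)
    {
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;   // sacrifice the sign bit
        for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)
        {
            if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key))
            {
                return i;
            }
        }
        return -1;
    }
}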
I have immutable objects whose hashcode I wish to calculate lazily. I've implemented
private bool _HasHashCode = false;
private int _HashCode;
public override int GetHashCode()
{
if (_HasHashCode)
return _HashCode;
long hashCode;
unchecked
{
hashCode = Digits;
hashCode = (hashCode*397) ^ XI;
hashCode = (hashCode*397) ^ YI;
hashCode = (int) ( hashCode % Int32.MaxValue);
}
// is it possible that these two write instructions
// get reordered on a certain .NET/CPU architecture
// combination:
_HashCode = (int)hashCode;
_HasHashCode = true;
return _HashCode;
}
My reasoning is that the _HashCode member is 32 bits wide and writes to it are atomic, so even if the calculation is run twice due to a race on setting the _HasHashCode flag, it doesn't matter: the same value will be calculated each time.
My worry is that the CLR might reorder the writes to _HashCode and _HasHashCode. Is this a concern, or can I be sure the CLR doesn't reorder writes?
There's a lazy approach here: avoid the issue and avoid the question. For example, re-ordering is only a concern if there are two "things" - one "thing" can never be out of order. You could sacrifice the sentinel value of 0 to mean "not yet calculated" - then as the last step of the calculation, avoid the sentinel:
int hash;
public override int GetHashCode()
{
var snapshot = hash;
if(snapshot == 0) // means: not yet calculated
{
// snapshot = ... your actual implementation
if(snapshot == 0) snapshot = -124987; // avoid sentinel value
hash = snapshot;
}
return snapshot;
}
Note that int reads and writes are guaranteed to be atomic, which also helps.
No, it is NOT thread-safe, because of the concern you mentioned: writes can be reordered by the JIT compiler.
This is confirmed in this MSDN article about the CLR memory model (in the very first couple of paragraphs). (Also see part two of the article.)
The solution is not to use volatile. Rather, you should use Thread.MemoryBarrier() as follows:
_HashCode = (int)hashCode;
Thread.MemoryBarrier(); // Prevents reordering of the statements before and after.
_HasHashCode = true;
A MemoryBarrier has precisely the semantics you need for this code.
However, note that according to Microsoft:
MemoryBarrier is required only on multiprocessor systems with weak
memory ordering (for example, a system employing multiple Intel
Itanium processors).
Also, I'm not totally convinced that it would be any faster doing it like this rather than caching the hash code from the constructor (and thereby removing all logic from the GetHashCode() implementation).
I would certainly try some careful timings with both approaches to be sure.
Edit: @Groo drew my attention to the reordering of instructions, either by the underlying framework (the CLR could do that) or by the OS. I believed that lock blocks prevent this, and according to this they do prevent reordering of instructions. Another source is this one, which states that "Monitor.Enter and Monitor.Exit both generate full fences".
It's not thread-safe; but first, my proposition:
private bool _HasHashCode = false;
private int _HashCode;
private readonly object _lock = new object();
public override int GetHashCode()
{
if (_HasHashCode)
return _HashCode;
lock (_lock)
{
if (_HasHashCode)
return _HashCode;
long hashCode;
unchecked
{
hashCode = Digits;
hashCode = (hashCode*397) ^ XI;
hashCode = (hashCode*397) ^ YI;
hashCode = (int) (hashCode%Int32.MaxValue);
}
_HashCode = (int) hashCode;
_HasHashCode = true;
return _HashCode;
}
}
One problem I encounter most of the time in parallel/async programming is the "Is That Job Already Done?" question. This code takes care of that. The lock statement is pretty fast and will only be hit a couple of times (and the hash code will not get re-calculated!). The hash code is calculated only under the first lock; anything that comes afterwards (even if you are creating this object very fast, over and over) will just see that _HasHashCode is true and return the cached value.
The good part is that, apart from some of the initial objects created at first, none of the latecomers will hit the lock! So that lock block is only hit a couple of times (at most).
Note: I was hasty in answering. I should ask: if this object is immutable, why not calculate the hash at construction time? :)
To add to other answers, here is a table which shows possible reorderings on different architectures:
(credit: Linux Journal, Memory Ordering in Modern Microprocessors by Paul E. McKenney)
Regarding Intel architectures and the OP's question, it shows that:
stores cannot be reordered with other stores on x86 (this includes IA-32 and Intel 64, i.e. Intel's implementation of x86-64, not to be confused with IA-64/Itanium),
but stores can be reordered with other stores on IA-64 (Itanium) processors.
On the other hand, according to this link, .NET (since 2.0) should ensure that out-of-order writes never happen (even on such architectures):
On .NET (...) this kind of code motion and processor reordering is not legal. This specific example was a primary motivation for the strengthening changes we made to the .NET Framework 2.0’s implemented memory model in the CLR. Writes always retire in-order. To forbid out-of-order writes, the CLR’s JIT compiler emits the proper instructions on appropriate architectures (i.e. in this case, ensuring all writes on IA64 are store/release).
This MSDN article also explains it:
Strong Model 2: .NET Framework 2.0
The rules for this model (introduced in .NET 2.0) are:
All the rules that are contained in the ECMA model, in particular the three fundamental memory model rules as well as the ECMA rules for volatile.
Reads and writes cannot be introduced.
A read can only be removed if it is adjacent to another read to the same location from the same thread. A write can only be removed if it is adjacent to another write to the same location from the same thread. Rule 5 can be used to make reads or writes adjacent before applying this rule.
Writes cannot move past other writes from the same thread.
Reads can only move earlier in time, but never past a write to the same memory location from the same thread.
Given the fact that Microsoft recently dropped support for Itanium in both Windows Server and Visual Studio, you can basically only target x86/x64 from now on, which have stricter memory models mentioned above, disallowing out-of-order writes.
Of course, since there exist different implementations of Microsoft's .NET (Mono), claims like these should be taken with reserve.
I'm implementing a simple Dictionary<String, Int> that keeps track of picture files I download, and also the renaming of the files.
String - original filename
Int - new filename
I read up on TryGetValue vs ContainsKey and came across this:
TryGetValue approach is faster than ContainsKey approach but only when
you want to check the key in collection and also want to get the value
associated with it. If you only want to check the key is present or
not use ContainsKey only.
from here
As such, I was wondering what were other people's views on the following:
Should I use TryGetValue even if I do not plan to use the returned value, assuming the Dictionary would grow to 1000 entries, and I do duplicate checks every time I download, i.e. frequently?
In theory, follow the documentation. If you don't want the value then use ContainsKey because there's no code to go and actually grab the value out of memory.
Now, in practice, it probably doesn't matter because you're micro-optimizing on a Dictionary that's probably very small in the grand scheme of things. So, in practice, do what is best for you and the readability of your code.
And just to give you a sense of scale: "would grow to 1000 entries" is really small, so it really isn't going to matter in practice.
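For the download scenario in the question, the two patterns would look something like this (a rough sketch; the type and member names are made up for illustration):
using System.Collections.Generic;
class RenameTracker
{
    private readonly Dictionary<string, int> renames = new Dictionary<string, int>();

    // Only need to know whether the file was seen before: ContainsKey is enough.
    public bool AlreadyDownloaded(string originalName)
    {
        return renames.ContainsKey(originalName);
    }

    // Need the new name as well: TryGetValue does the lookup once and hands back the value.
    public bool TryGetNewName(string originalName, out int newName)
    {
        return renames.TryGetValue(originalName, out newName);
    }

    public void Record(string originalName, int newName)
    {
        renames[originalName] = newName;
    }
}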
If you only want to check the key is present or not use ContainsKey only.
I think you answered the question for yourself.
Let's see the implementations of both under Reflector:
public bool TryGetValue(TKey key, out TValue value)
{
int index = this.FindEntry(key);
if (index >= 0)
{
value = this.entries[index].value;
return true;
}
value = default(TValue);
return false;
}
public bool ContainsKey(TKey key)
{
return (this.FindEntry(key) >= 0);
}
This is how both methods are implemented.
Now you can decide for yourself which method is best.
I think that performance gains (if any) aren't worth the cost of obfuscating your code with this optimization.
Balance the scale you're targeting with code maintainability. E.g.:
~10K concurrent calls on average vs. a team of < 5 developers: GO FOR IT!
~500 concurrent calls on average vs. a team of > 50 developers: DON'T DO IT!
I am working on software for scientific research that deals heavily with chemical formulas. I keep track of the contents of a chemical formula using an internal Dictionary<Isotope, int> where Isotope is an object like "Carbon-13", "Nitrogen-14", and the int represents the number of those isotopes in the chemical formula. So the formula C2H3NO would exist like this:
{"C12", 2
"H1", 3
"N14", 1
"O16", 1}
This is all fine and dandy, but when I want to add two chemical formulas together, I end up having to calculate the hash function of Isotope twice to update a value; see the following code example.
public class ChemicalFormula {
internal Dictionary<Isotope, int> _isotopes = new Dictionary<Isotope, int>();
public void Add(Isotope isotope, int count)
{
if (count != 0)
{
int curValue = 0;
if (_isotopes.TryGetValue(isotope, out curValue))
{
int newValue = curValue + count;
if (newValue == 0)
{
_isotopes.Remove(isotope);
}
else
{
_isotopes[isotope] = newValue;
}
}
else
{
_isotopes.Add(isotope, count);
}
_isDirty = true;
}
}
}
While this may not seem like much of a slowdown, it is when we are adding billions of chemical formulas together: this method is consistently the slowest part of the program (>45% of the running time). I am dealing with large chemical formulas like "H5921C3759N1023O1201S21" that are consistently being added to by smaller chemical formulas.
My question is: is there a better data structure for storing data like this? I have tried creating a simple IsotopeCount object that contains an int so I can access the value in a reference type (as opposed to a value type) to avoid the double hash-function call. However, this didn't seem beneficial.
EDIT
Isotope is immutable and shouldn't change during the lifetime of the program so I should be able to cache the hashcode.
I have linked to the source code so you can see the classes more in depth rather than me copy and paste them here.
I second the opinion that Isotope should be made immutable with precalculated hash. That would make everything much simpler.
(in fact, a functional programming style is better suited to calculations of this sort, and it deals in immutable objects)
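A minimal sketch of what that could look like (the fields shown, such as a symbol and a mass number, are assumptions for the example and not taken from the linked source code):
public sealed class Isotope
{
    private readonly string symbol;
    private readonly int massNumber;
    private readonly int hashCode;

    public Isotope(string symbol, int massNumber)
    {
        this.symbol = symbol;
        this.massNumber = massNumber;
        // Computed once here; GetHashCode just returns the cached value.
        unchecked
        {
            hashCode = (symbol.GetHashCode() * 397) ^ massNumber;
        }
    }

    public string Symbol { get { return symbol; } }
    public int MassNumber { get { return massNumber; } }

    public override int GetHashCode()
    {
        return hashCode;
    }

    public override bool Equals(object obj)
    {
        var other = obj as Isotope;
        return other != null && other.symbol == symbol && other.massNumber == massNumber;
    }
}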
I have tried creating a simple IsotopeCount object that contains an int so I can access the value in a reference type (as opposed to a value type) to avoid the double hash-function call. However, this didn't seem beneficial.
Well it would stop the double hashing... but obviously it's then worse in terms of space. What performance difference did you notice?
Another option you should strongly consider if you're doing this a lot is caching the hash within the Isotope class, assuming it's immutable. (If it's not, then using it as a dictionary key is at least somewhat worrying.)
If you're likely to use most Isotope values as dictionary keys (or candidates) then it's probably worth computing the hash during initialization. Otherwise, pick a particularly unlikely hash value (in an ideal world, that would be any value) and use that as the "uncached" value, and compute it lazily.
If you've got 45% of the running time in GetHashCode, have you looked at optimizing that? Is it actually GetHashCode, or Equals which is the problem? (You talk about "hashing" but I suspect you mean "hash lookup in general".)
If you could post the relevant bits of the Isotope type, we may be able to help more.
EDIT: Another option to consider if you're using .NET 4 would be ConcurrentDictionary, with its AddOrUpdate method. You'd use it like this:
public void Add(Isotope isotope, int count)
{
// I prefer early exit to lots of nesting :)
if (count == 0)
{
return;
}
int newCount = _isotopes.AddOrUpdate(isotope, count,
(key, oldCount) => oldCount + count);
if (newCount == 0)
{
int ignored;
_isotopes.TryRemove(isotope, out ignored);   // ConcurrentDictionary exposes TryRemove rather than Remove
}
_isDirty = true;
}
Do you actually require random access to Isotope count by type or are you using the dictionary as a means for associating a key with a value?
I would guess the latter.
My suggestion to you is not to work with a dictionary but with a sorted array (or List) of IsotopeTuples, something like:
class IsotopeTuple{
Isotope i;
int count;
}
sorted by the name of the isotope.
Why the sorting?
Because then, when you want to "add" two isotopes together, you can do this in linear time by traversing both arrays (hope this is clear, I can elaborate if needed). No hash computation required, just super fast comparisons of order.
This is a classic approach when dealing with vector multiplications where the dimensions are words.
Used widely in text mining.
The trade-off is of course that the construction of the initial vector is O(n log n), but I doubt you will feel the impact.
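Here is a rough sketch of that linear-time merge (it assumes each tuple exposes the isotope's name for ordering; the exact field layout is an assumption):
using System.Collections.Generic;
class IsotopeTuple
{
    public string Name;
    public int Count;
}

static class FormulaMerge
{
    // Both inputs must already be sorted by Name; the result comes out sorted as well.
    public static List<IsotopeTuple> Add(List<IsotopeTuple> a, List<IsotopeTuple> b)
    {
        var result = new List<IsotopeTuple>(a.Count + b.Count);
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            int cmp = string.CompareOrdinal(a[i].Name, b[j].Name);
            if (cmp == 0)
            {
                // Same isotope in both formulas: combine the counts.
                result.Add(new IsotopeTuple { Name = a[i].Name, Count = a[i].Count + b[j].Count });
                i++;
                j++;
            }
            else if (cmp < 0)
            {
                result.Add(a[i++]);
            }
            else
            {
                result.Add(b[j++]);
            }
        }
        // Copy whatever is left of the longer list.
        while (i < a.Count) result.Add(a[i++]);
        while (j < b.Count) result.Add(b[j++]);
        return result;
    }
}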
Another solution that you could think of if you had a limited number of Isotopes and no memory problems:
public struct Formula
{
public int C12;
public int H1;
public int N14;
public int O16;
}
I am guessing you're looking at organic chemistry, so you may not have to deal with that many isotopes, and if the lookup is the issue, this one will be pretty fast...
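For completeness, here is a sketch of how two such formulas could be combined, assuming the struct above; it is just field-by-field addition, with no hashing or lookups involved:
public static class FormulaMath
{
    public static Formula Add(Formula a, Formula b)
    {
        return new Formula
        {
            C12 = a.C12 + b.C12,
            H1 = a.H1 + b.H1,
            N14 = a.N14 + b.N14,
            O16 = a.O16 + b.O16
        };
    }
}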
I'm basing this on performance characteristics I've recently found out about Dictionary, so I'm using Dictionary<type, bool> where the bool is ignored but supposedly I could use HashSet instead.
For example:
Dictionary<bounds, bool> overlap;
class bounds
{
public float top_left_x, top_left_y, width, height;
public bool equal(bounds other)
{
return top_left_x + width > other.top_left_x &&
top_left_x < other.top_left_x + other.width &&
top_left_y + height > other.top_left_y &&
top_left_y < other.top_left_y + other.height;
}
public ... GetHashCode()
{
...;
}
}
Here I'm not using equal to check for equality but instead for overlapping, which is bound to be annoying elsewhere but there is a reason why I'm doing it.
I'm presuming that if a value can be looked up from a key in O(1) time then so can a key from itself.
So I could presumably put thousands of bounds into overlap and do this:
overlap.ContainsKey(new bounds(...));
To find out in O(1) time if a given bound overlaps any others from the collection.
I'd also like to know what happens if I change the (x, y) position of a bound, presumably it's like removing and then adding it into the set again performance wise, very expensive?
What do I put into the GetHashCode function?
goal
If this works then I'm after using this sort of mechanism to find out what other bounds a given bound overlaps.
Very few bounds move in this system and no new ones are added after the collection has been populated. Newly added bounds need to be able to overlap old ones.
conclusion
See the feedback below for more details.
In summary it's not possible to achieve the O(1) performance because, unlike the default equals, a check for overlapping is not transitive.
An interval tree however is a good solution.
The equality relation is completely the wrong relation to use here because equality is required to be an equivalence relation. That is, it must be reflexive -- A == A for any A. It must be symmetric -- A == B implies that B == A. And it must be transitive -- if A == B and B == C then A == C.
You are proposing a violation of the transitive property; "overlaps" is not a transitive relation, therefore "overlaps" is not an equivalence relation, and therefore you cannot define equality as overlapping.
Rather than trying to do this dangerous thing, solve the real problem. Your goal is apparently to take a set of intervals, and then quickly determine whether a given interval overlaps any of those intervals. The data structure you want is called an interval tree; it is specifically optimized to solve exactly that problem, so use it. Under no circumstances should you attempt to use a hash set as an interval tree. Use the right tool for the job:
http://wikipedia.org/wiki/Interval_tree
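To make the non-transitivity concrete, here is a small one-dimensional example (the values are made up purely for illustration):
using System;
class OverlapIsNotTransitive
{
    // Two ranges overlap if each one starts before the other ends.
    static bool Overlaps(int startA, int endA, int startB, int endB)
    {
        return startA < endB && startB < endA;
    }

    static void Main()
    {
        // A overlaps B, and B overlaps C...
        Console.WriteLine(Overlaps(0, 10, 8, 20));   // True
        Console.WriteLine(Overlaps(8, 20, 18, 30));  // True
        // ...but A does not overlap C, so "overlaps" cannot serve as equality.
        Console.WriteLine(Overlaps(0, 10, 18, 30));  // False
    }
}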
Here I'm not using equal to check for equality but instead for overlapping, which is bound to be annoying elsewhere but there is a reason why I'm doing it.
I'm assuming this means you will have a scenario where A.Equals(B) is true, B.Equals(C) is true, but A.Equals(C) is false. In other words, your Equals is not transitive.
That is breaking the rules of Equals(), and as a result Dictionary will not work for you. The rule of Equals/GetHashCode is (from http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx):
If two objects compare as equal, the GetHashCode method for each object must return the same value.
If your Equals is not transitive, then you can't possibly write a valid GetHashCode.
If you use the derived class approach I mentioned above, you need the following:
public class Bounds
{
public Point position;
public Point size; // I know the width and height don't really compose
// a point, but this is just for demonstration
public override int GetHashCode(){...}
}
public class OverlappingBounds : Bounds
{
public override bool Equals(object other)
{
// your implementation here
}
}
// Usage:
if (_bounds.ContainsKey(new OverlappingBounds(...))){...}
but since the GetHashCode() method always needs to return the same value, the runtime complexity will most likely be O(n) instead of O(1).
You can't use a Dictionary or HashSet to check if bounds overlap. To be able to use a dictionary (or hashset), you need an Equals() and a GetHashCode() method that meet the following properties:
The Equals() method is an equivalence relation
a.Equals(b) must imply a.GetHashCode() == b.GetHashCode()
You can't meet either of these requirements, so you have to use another data structure: an interval tree.
You cannot guarantee O(1) performance on a dictionary where you customize the hash code calculation. If I put a web-service request inside the GetHashCode() method, which checks the equality of the 2 provided items for me, it's clear that the time can never be the expected O(1). OK, this is kind of an edge case, but just to give an idea.
By doing it the way you are thinking of (assuming that it is even possible), you negate, in my opinion, the benefit provided by Dictionary<K,V>: constant key-retrieval time, even on big collections.
It needs to be measured on a reasonable number of the objects you have, but I would first try to use
List<T> as an object holder, and do something like this:
var bounds = new List<Bound> {.... initialization... }
Bound providedBound = //something. Some data filled in it.
var overlappedAny = bounds.Any(b => b.Equals(providedBound));