I would like to know whether hash codes are always the same.
For instance:
string myString = "my super string";
int myHashCode = myString.GetHashCode();
Will myHashCode always be the same value? On any computer, at any time, under any circumstances?
Can I trust this value to use it as a custom unique identifier (for the same object type)?
No, the value can change between computers and between .NET runtime versions.
You should only depend on it to be constant during a given program run.
From the documentation:
The value returned by GetHashCode is platform-dependent. It differs on the 32-bit and 64-bit versions of the .NET Framework. It also can differ between versions of the .NET Framework.
Caution: A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
Do not serialize hash code values or store them in databases.
Can I trust this value to use it as a custom unique identifier?
That won't work either, even during a single program run, as hash codes do collide (same hash code for unequal objects).
To quote the docs again:
Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.
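A quick illustration of that rule (equal hash codes prove nothing on their own; Equals is the real test):
string a = "my super string";
string b = "my super string";
// Equal strings are guaranteed to produce equal hash codes within one run...
Console.WriteLine(a.GetHashCode() == b.GetHashCode());  // True
// ...but the reverse does not hold: two different strings may also share a
// hash code, so only Equals (or ==) tells you whether the values match.
Console.WriteLine(a.Equals(b));                          // True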
As per the documentation:
If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.
The hash code itself is not guaranteed to be stable. Hash codes for identical strings can differ across versions of the .NET Framework and across platforms (such as 32-bit and 64-bit) for a single version of the .NET Framework. In some cases, they can even differ by application domain.
As a result, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection, and they should never be persisted.
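If you really need an identifier that is stable across machines, processes, and .NET versions (for example, to persist it), compute one yourself from the string's bytes instead of using GetHashCode. A minimal sketch using SHA-256 from the standard library (Convert.ToHexString needs .NET 5 or later; on older frameworks use BitConverter.ToString):
using System;
using System.Security.Cryptography;
using System.Text;

static string StableId(string value)
{
    // Hash the UTF-8 bytes: the same string always yields the same id,
    // on any machine and any runtime version (unlike GetHashCode).
    using var sha = SHA256.Create();
    byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(value));
    return Convert.ToHexString(digest);
}
This still isn't mathematically unique, but the collision probability of a 256-bit hash is negligible for practical purposes.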
For strings the hash code is derived from the value; for other objects the default implementation derives it from the object's identity (its reference), not from its contents.
It can produce the same hash code for different strings, since collisions can occur. So, as a rule of thumb, never use a hash code as a key, because the values can change. Here is a good source about hash codes:
http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/
Related
Clarifying edit: The keys in the dictionary are actual instances of System.Type. More specifically every value is stored with its type as the key.
In a specific part of my program the usage of Dictionary<System.Type, SomeThing> takes a large chunk of CPU time, as per Visual Studio 2017 performance profiler.
Changing the dictionary's type to Dictionary<int, SomeThing> and passing type.GetHashCode() instead of the type object directly seems to be about 20%-25% faster.
The above optimization will result in a nasty bug if two types have the same hash code, but it seems plausible to me that types can have unique hash codes, at least when it comes to types from the same assembly - which all the types used in this dictionary are.
Possibly relevant information - As per this answer the number of possible types in an assembly is far smaller than the number of values represented by System.Int32.
No. The documentation on object.GetHashCode() makes no guarantees, and states:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
...
Do not use the hash code as the key to retrieve an object from a keyed collection.
Because having equal hash codes is necessary, but not sufficient, for two objects to be equal.
If you're wondering if Type.GetHashCode() follows a more restrictive definition, its documentation makes no mention of such a change, so it still does not guarantee uniqueness. The reference source does not show any attempt to make this guarantee, either.
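To make the risk concrete, here is a small sketch (SomeThing stands in for the value type in the question): a Type-keyed dictionary stays correct even if two types happened to share a hash code, because Equals is still consulted; the int-keyed version silently conflates entries.
using System;
using System.Collections.Generic;

class SomeThing { public string Name; }   // stand-in for the question's value type

class Program
{
    static void Main()
    {
        var safe = new Dictionary<Type, SomeThing>();
        safe[typeof(int)] = new SomeThing { Name = "int entry" };
        safe[typeof(string)] = new SomeThing { Name = "string entry" };
        // Even if the two Type instances shared a hash code, the dictionary
        // would still tell them apart by calling Type.Equals.

        var risky = new Dictionary<int, SomeThing>();
        risky[typeof(int).GetHashCode()] = new SomeThing { Name = "int entry" };
        // If another type ever produced the same hash code, the next line would
        // overwrite the entry above without any error or exception:
        risky[typeof(string).GetHashCode()] = new SomeThing { Name = "string entry" };

        Console.WriteLine(safe.Count);    // always 2
        Console.WriteLine(risky.Count);   // 2 only while the hash codes differ
    }
}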
A hash code is never guaranteed to be unique for different values, so you should not use it the way you are doing.
The same value should, however, generate the same hash code.
This is also stated in MSDN:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash codes.
and somewhat further:
Do not use the hash code as the key to retrieve an object from a keyed collection.
Therefore, I would not rely on GetHashCode being unique across different types either, but at least you can verify it:
Dictionary<int, string> s = new Dictionary<int, string>();
var types = typeof(int).Assembly.GetTypes();
Console.WriteLine($"Inspecting {types.Length} types...");
foreach (var t in types)
{
if (s.ContainsKey(t.GetHashCode()))
{
Console.WriteLine($"{t.Name} has the same hashcode as {s[t.GetHashCode()]}");
}
else
{
s.Add(t.GetHashCode(), t.Name);
}
}
Console.WriteLine("done!");
But even if the above test concluded that there are no collisions, I wouldn't do it, since the implementation of GetHashCode can change over time, which means collisions might become possible in the future.
A hash code isn't meant to be unique. Instead, it is used in hash-based collections such as Dictionary in order to limit the number of possible ambiguities. A hash code is nothing but an index, so instead of searching the entire collection for a match, only the few items that share a common value - the hash code - have to be considered.
In fact, you could even have a hash implementation that always returns the same number for every item. However, that leads to O(n) lookups in your dictionary, as every key has to be compared.
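A small sketch of that degenerate case: a key type whose GetHashCode always returns 0 still behaves correctly in a Dictionary, it just turns every lookup into a linear scan over the colliding keys.
using System;
using System.Collections.Generic;

sealed class BadKey : IEquatable<BadKey>
{
    public int Id { get; }
    public BadKey(int id) => Id = id;
    public bool Equals(BadKey other) => other != null && other.Id == Id;
    public override bool Equals(object obj) => Equals(obj as BadKey);
    public override int GetHashCode() => 0;   // every key lands in the same bucket
}

class Program
{
    static void Main()
    {
        var dict = new Dictionary<BadKey, string>();
        for (int i = 0; i < 1000; i++)
            dict[new BadKey(i)] = "value " + i;

        // Correct result, but the lookup had to call Equals on the colliding
        // keys in the bucket instead of jumping straight to the entry.
        Console.WriteLine(dict[new BadKey(500)]);   // "value 500"
    }
}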
Anyway, you shouldn't strive for micro-optimizations that gain you a few nanoseconds in exchange for maintainability and understandability. You should instead use a data structure that gets the job done and is easy to understand.
I found the following code for computing the hash code:
int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
int index = hashCode % buckets.Length;
Why didn't the engineers choose a universal hashing method:
int index = ((a * k + b) mod p) mod buckets.Length
where a,b are random numbers between 0...p-1 (p is prime) ?
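Something like this, for concreteness (a, b and p are placeholders; p is a prime, a and b are chosen at random):
// Universal-hashing style index (what I had in mind). Here p = 2^31 - 1 (a prime),
// with a in 1..p-1 and b in 0..p-1 picked at random.
const long P = 2147483647;  // 2^31 - 1

static int UniversalIndex(int key, long a, long b, int bucketCount)
{
    long k = (uint)key % P;           // bring the key into 0..p-1
    long h = (a * k + b) % P;         // fits in a long, since a and k are both below 2^31
    return (int)(h % bucketCount);
}

// What the framework code above does instead:
static int FrameworkIndex(int hashCode, int bucketCount)
{
    return (hashCode & 0x7FFFFFFF) % bucketCount;  // clear the sign bit, then take the modulo
}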
A complete answer to the question would require consulting with the individual(s) who wrote that code. So I don't think you're going to get a complete answer.
That said:
The "universal hashing method", as you call it, is hardly the only possible implementation of a good hash code. People implement hash code computations in a variety of ways for a variety of reasons.
More important though…
The computation to which you refer is not actually computing a hash code. The variable name is a bit misleading, because while the value is based on the hash code of the item in question, it's really an implementation detail of the class's internal hash table. By sacrificing the highest bit from the actual hash code, the Entry value for the hash table can be flagged as unused using that bit. Masking the bit off as opposed to, for example, just special-casing an element with a hash code value of -1, preserves the distribution qualities of the original hash code implementation (which is determined outside the Dictionary<TKey, TValue> class).
In other words, the code you're asking about is simply how the author of that code implemented a particular optimization, in which they decreased the size of the Entry value by storing a flag they needed for some other purpose — i.e. the purpose of indicating whether a particular table Entry is used or not — in the same 32-bit value where part of the element's hash code is stored.
Storing the hash code in the Entry value is in turn also an optimization. Since the Entry value includes the TKey key value for the element, the implementation could in fact just have always called the key.GetHashCode() method to get the hash code. This is a trade-off in acknowledging that the GetHashCode() method is not always optimized itself (indeed, most implementations, including .NET's implementation for the System.String class, always recompute the hash code from scratch), and so the choice was (apparently) made to cache the hash code value within the Entry value, rather than asking the TKey value to recompute it every time it's needed.
Don't confuse the caching and subsequent usage of some other object's hash code implementation with an actual hash code implementation. The latter is not what's going on in the code you're asking about, the former is.
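To picture the caching described above, here is a rough illustration only - not the actual framework source:
// The point is that the (masked) hash code is computed once when an item is
// added and stored next to the key, so later probes can compare cached ints
// cheaply and only call Equals on the keys when those cached values match.
struct Entry<TKey, TValue>
{
    public int HashCode;   // comparer.GetHashCode(Key) & 0x7FFFFFFF; the freed-up bit can flag unused slots
    public TKey Key;
    public TValue Value;
}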
For a given string "5", if I use the built-in GetHashCode() function, what value is returned? Am I confused in thinking that it returns the integer value 5?
It's implementation specific, and you should not rely on anything you happen to observe in one particular implementation. Only rely on what is guaranteed: two equal strings will return the same value, in the same process. The same string value can return a different hash next time you run your program, or on a different machine. This means you should never persist the result of GetHashCode - it's not useful for future comparisons.
If two strings return the same hash code they may be equal - but they may not be.
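You can check what your own runtime returns, as long as you treat the number as opaque:
// The printed value is implementation-specific. On newer runtimes
// (.NET Core / .NET 5+) string hashing is randomized per process,
// so it will usually change between runs of the same program.
Console.WriteLine("5".GetHashCode());
Console.WriteLine("5".GetHashCode() == "5".GetHashCode());  // True within a single run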
For String.GetHashCode(), the MSDN docs write:
If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.
The hash code itself is not guaranteed to be stable. Hash codes for identical strings can differ across versions of the .NET Framework and across platforms (such as 32-bit and 64-bit) for a single version of the .NET Framework. In some cases, they can even differ by application domain.
As a result, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection, and they should never be persisted.
Finally, do not use the hash code instead of a value returned by a cryptographic hashing function if you need a cryptographically strong hash. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
So it's a kind of "quick compare check" for strings. But you should not rely on the hash alone. It's important to know that these hash codes are not stable, meaning you must never store them in files, databases, etc. - don't persist them.
In general, GetHashCode() is specific to the class implementation, as Jon wrote. If we look at the MSDN docs for object.GetHashCode(), we see that hash codes serve as an index into hash-based collections so that items can be inserted and looked up efficiently. See this article for more information on hashing algorithms.
So if you query one and the same object using GetHashCode(), it should return the same hash code. That code may be different the next time your application runs.
I'm trying to cache the result of an expensive function in a MemoryCache object.
The MemoryCache requires a key that is a string, so I was wondering if it was valid to do the following:
string key = Char.ConvertFromUtf32(myObject.GetHashCode());
if (!_resourceDescriptionCache.Contains(key))
{
_resourceDescriptionCache[key] = ExpensiveFunction(myObject);
}
return (string)_resourceDescriptionCache[key];
It feels odd using a single UTF32 character as the key for a potentially large cache.
That depends.
There are many cases where using GetHashCode() could cause incorrect behavior:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
Do not serialize hash code values or store them in databases.
Do not use the hash code as the key to retrieve an object from a keyed collection.
Do not send hash codes across application domains or processes. In some cases, hash codes may be computed on a per-process or per-application domain basis.
http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx
If the memory cache happens (or can in the future happen) in a different process or app domain than the code that calls it, you fail the 3rd condition.
It feels odd using a single UTF32 character as the key for a potentially large cache.
If you are caching enough things, the collision rate on a 32-bit hash can be uncomfortably high due to the Birthday Problem.
When caching tens of millions of things, I have used a 64-bit hash called CityHash (created by Google, open source) with good success. You can also use a Guid, though the memory needed to hold the keys is twice as large for a GUID compared to a 64-bit hash.
Hash codes can collide. return 0; is a valid implementation of GetHashCode. Multiple keys will then share a cache slot, which is not what you want... you will confuse objects.
If your code does not work with return 0; as the implementation for GetHashCode your code is broken.
Choose a better cache key.
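For example (myObject.Id is only an assumption here; use whatever property actually identifies the object being cached):
// Build the key from the object's real identity instead of its hash code.
// myObject.Id is hypothetical - substitute whatever truly identifies it.
string key = "resource-description:" + myObject.Id;
if (!_resourceDescriptionCache.Contains(key))
{
    _resourceDescriptionCache[key] = ExpensiveFunction(myObject);
}
return (string)_resourceDescriptionCache[key];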
The memory cache is backed by a normal C# Dictionary. It really isn't much different, other than the fact that it provides expiration.
The chance of two particular keys colliding is about 1 in 2^32, the number of values an integer can hold. Even if you do get a collision, the dictionary has safety measures for that (it uses Equals when hash codes collide).
Edit: Key collisions are only handled when a dictionary is given the unaltered key. In this case, since the MemoryCache key is a string (here derived from the hash code), there's no collision detection.
I've been coding in C++ and Java my entire life, but C# feels like a totally different animal.
In case of a hash collision in the C# Dictionary container, what does it do? Does it even detect the collision?
In similar containers (in the C++ STL, for example), some handle a collision by chaining the colliding entries off the bucket like a linked list, and some attempt to find a different hash (rehashing).
[Update 10:56 A.M. 6/4/2010]
I am trying to keep a counter per user. The set of users is not fixed; it can both grow and shrink. I expect the data to contain more than 1000 entries.
So, I want:
Fast access, preferably not O(n). It is important that I get close to O(1), because I need to be able to force-log-off people before they can do something silly.
Dynamic growth and shrinking.
Unique data.
A hash map was my solution, and Dictionary seems to be the C# equivalent of a hash map...
Hash collisions are correctly handled by Dictionary<> - in that so long as an object implements GetHashCode() and Equals() correctly, the appropriate instance will be returned from the dictionary.
First, you shouldn't make any assumptions about how Dictionary<> works internally - that's an implementation detail that is likely to change over time. Having said that....
What you should be concerned with is whether the types you are using for keys implement GetHashCode() and Equals() correctly. The basic rules are that GetHashCode() must return the same value for the lifetime of the object, and that Equals() must return true when two instances represent the same object. Unless you override it, Equals() uses reference equality - which means it only returns true if two objects are actually the same instance. You may override how Equals() works, but then you must ensure that two objects that are 'equal' also produce the same hash code.
From a performance standpoint, you may also want to provide an implementation of GetHashCode() that generates a good spread of values to reduce the frequency of hash code collisions. The primary downside of hash code collisions is that they reduce the dictionary to a list in terms of performance. Whenever two different object instances yield the same hash code, they are stored in the same internal bucket of the dictionary. The result of this is that a linear scan must be performed, calling Equals() on each instance until a match is found.
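For illustration, here is a minimal sketch of a well-behaved key type (the properties are made up): Equals and GetHashCode are overridden together, and the hash combines the fields so values spread across buckets.
using System;

public sealed class UserKey : IEquatable<UserKey>
{
    public string Name { get; }
    public int Region { get; }

    public UserKey(string name, int region) { Name = name; Region = region; }

    public bool Equals(UserKey other) =>
        other != null && other.Name == Name && other.Region == Region;

    public override bool Equals(object obj) => Equals(obj as UserKey);

    // Combine the fields so that equal values hash equally and different
    // values tend to spread across buckets.
    public override int GetHashCode() =>
        unchecked(((Name?.GetHashCode() ?? 0) * 397) ^ Region);
}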
According to this article at MSDN, in case of a hash collision the Dictionary class converts the bucket into a linked list. The older HashTable class, on the other hand, uses rehashing.
I offer an alternative, code-oriented answer that demonstrates a Dictionary will exhibit exception-free and functionally correct behavior when two items with different keys are added but the keys produce the same hash code.
On .Net 4.6 the strings "699391" and "1241308" produce the same hashcode. What happens in the following code?
myDictionary.Add( "699391", "abc" );
myDictionary.Add( "1241308", "def" );
The following code demonstrates that a .Net Dictionary accepts different keys that cause a hash collision. No exception is thrown and dictionary key lookup returns the expected object.
var hashes = new Dictionary<int, string>();
var collisions = new List<string>();
for (int i = 0; ; ++i)
{
string st = i.ToString();
int hash = st.GetHashCode();
if (hashes.TryGetValue( hash, out string collision ))
{
// On .Net 4.6 we find "699391" and "1241308".
collisions.Add( collision );
collisions.Add( st );
break;
}
else
hashes.Add( hash, st );
}
Debug.Assert( collisions[0] != collisions[1], "Check we have produced two different strings" );
Debug.Assert( collisions[0].GetHashCode() == collisions[1].GetHashCode(), "Prove we have different strings producing the same hashcode" );
var newDictionary = new Dictionary<string, string>();
newDictionary.Add( collisions[0], "abc" );
newDictionary.Add( collisions[1], "def" );
Console.Write( "If we get here without an exception being thrown, it demonstrates a dictionary accepts multiple items with different keys that produce the same hash value." );
Debug.Assert( newDictionary[collisions[0]] == "abc" );
Debug.Assert( newDictionary[collisions[1]] == "def" );
Check this link for a good explanation: An Extensive Examination of Data Structures Using C# 2.0
Basically, .NET generic dictionary chains items with the same hash value.
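If it helps to picture the chaining, here is a purely conceptual sketch - not the framework's actual data layout:
using System.Collections.Generic;

// Each bucket holds the entries whose hash codes map to it; a lookup walks
// that bucket and calls Equals until the key matches.
class ChainedMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] _buckets =
        new List<KeyValuePair<TKey, TValue>>[16];

    private int BucketOf(TKey key) =>
        (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;

    public void Add(TKey key, TValue value) =>
        (_buckets[BucketOf(key)] ??= new List<KeyValuePair<TKey, TValue>>())
            .Add(new KeyValuePair<TKey, TValue>(key, value));

    public bool TryGet(TKey key, out TValue value)
    {
        var bucket = _buckets[BucketOf(key)];
        if (bucket != null)
        {
            foreach (var pair in bucket)
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }
        value = default;
        return false;
    }
}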