Why do "int" and "sbyte" GetHashCode functions generate different values? - c#

We have the following code:
int i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 1
This makes sense, and the same happens with all integral types in C# except sbyte and short.
That is:
sbyte i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 257
Why is this?

Because the source of that method (SByte.GetHashCode) is
public override int GetHashCode()
{
    return (int)this ^ ((int)this << 8);
}
As for why, well someone at Microsoft knows that..

Yes, it's all about value distribution. Since GetHashCode returns an int while sbyte has only 256 possible values, the formula above spreads those values out in steps of 257 instead of clustering them in the range 0–255. For the same reason (the return type is only an int), the long type will have collisions.
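To make the distribution point concrete, here is a minimal sketch (assuming the SByte.GetHashCode implementation shown in the first answer) that prints the hash codes of a few small sbyte values; each is the value multiplied by 257, because v ^ (v << 8) equals v + 256*v for small non-negative v:
for (sbyte v = 0; v <= 5; v++)
{
    Console.WriteLine("{0} -> {1}", v, v.GetHashCode());   // 0, 257, 514, 771, 1028, 1285
}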

The reason is that it is probably done to avoid clustering of hash values.
As GetHashCode documentation says:
For the best performance, a hash function must generate a random
distribution for all input.
Providing a good hash function on a class can significantly affect the
performance of adding those objects to a hash table. In a hash table with
a good implementation of a hash function, searching for an element takes
constant time (for example, an O(1) operation).
Also, as this excellent article explains:
Guideline: the distribution of hash codes must be "random"
By a "random distribution" I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be "clustered"; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.

Related

Does a perfect hash function guarantee no collisions?

I have been reading and learning about hashing and hash tables and experimented with some code (I am still very new to this, so I might say something I have misunderstood). I came to the issue of perfect hash functions. Provided that I have my own custom type that somehow has a perfect hash function:
class Foo
{
    private int data;

    public override int GetHashCode()
    {
        return data.GetHashCode();
    }
}
An int's hash code is the int itself so I have a perfect hash function, right? But when we use the hashing function to map the objects to a hashtable by the simple formula:
index = foo.GetHashCode() % hashtable.Length
we get an index that also depends on the size of the hashtable. Only if the hashtable's size were int.MaxValue would we have a perfect hash function. For example, let's say we have a hashtable with a size of 2, and we hash the numbers 1 and 3:
1 % 2 = 1
3 % 2 = 1
A collision! Have I misunderstood something about hashing and hashtables? It turns out that a perfect hash function is not perfect.
You have it all right until this point
index = foo.GetHashCode() % hashtable.Length
Your hash function is perfect, but when you calculate the modulo, you're actually using a different hash function. In this case, your hash function int.GetHashCode is perfect, but your data structure using foo.GetHashCode() % hashtable.Length is not. That is, one thing is the hash of your objects, and a different thing is the hash used by the structure holding those objects.
For your data structure to be perfect too, its maximum size must also be the number of ints.
So why don't we have collisions in Dictionary? Actually, we do. If two objects A and B have the same hash in the dictionary, we have a collision. What happens is that the dictionary runs A.Equals(B) as the final check to see whether the two objects actually are the same or not. If they are, you get an exception for adding a duplicate key. If they aren't, they are both kept under the same dictionary hash.
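As a small sketch of that behaviour (BadKey is a hypothetical type, written only to force collisions), every key below hashes to the same bucket, yet the dictionary still distinguishes them via Equals; only adding an actually-equal key throws:
class BadKey
{
    public int Id;

    // Deliberately awful: every instance lands in the same bucket.
    public override int GetHashCode() { return 42; }

    public override bool Equals(object obj)
    {
        BadKey other = obj as BadKey;
        return other != null && other.Id == Id;
    }
}

var map = new Dictionary<BadKey, string>();
map.Add(new BadKey { Id = 1 }, "one");
map.Add(new BadKey { Id = 2 }, "two");         // same hash, different key: both are kept
// map.Add(new BadKey { Id = 1 }, "again");    // same hash AND Equals is true: ArgumentException
Console.WriteLine(map[new BadKey { Id = 2 }]); // prints "two"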
Yes! (as said, by definition)
Where do you get a p.h.f. from in the first place? You want to hash a fixed (i.e. constant) set S of distinct values (i.e. not a multiset) to the set 1..|S|, bijectively. Apparently, then, the p.h.f. depends on the set S. Also, if you remove a single element from S and add another one, you almost surely get a collision (of the new element with an old one). So what you actually want is "a p.h.f. for such-and-such well-defined/described set", and then we can try to find one.
Yes, a perfect hash function is guaranteed not to have collisions.
That's its very definition!
From Wikipedia (http://en.wikipedia.org/wiki/Perfect_hash_function)
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented

Comparing immutable data types

Is there a commonly accepted way of how to compare immutable objects that might contain long lists of values?
So far, my interfaces are as follows:
interface Formula : IEquatable<Formula> {
    IList<Symbol> Symbols { get; }
}
interface Symbol : IEquatable<Symbol> {
    String Value { get; }
}
Here, the immutable datatype Formula represents a sequence of Symbols. So in a formula:
x -> y
symbols would be x,->,y.
I want to compare two Formulas based on their content (e.g. a list of symbols). So new Formula(symbols) would equal new Formula(symbols) for some arbitrary list of symbols.
However, I don't want to iteratively compare two lists all the time.
I was thinking, in implementation, of creating some kind of calculated value during the initialization of the Formula - and using that for comparison. However, that will require me to add a method of some sort to my interface. What would I call that method?
I am not sure if it is appropriate to use hash code for this, as it seems to be limited to integers.
Any help appreciated - and if something is not clear I will revise my question.
Thank you!
You could definitely use a hash code for this. Don't forget that a hash code doesn't have to be unique - it just helps if it doesn't give collisions (two unequal sequences with the same hash code) terribly often. (At least, try to come up with an approach which avoids equal hash codes for obvious situations.)
So you could compute the hash code once on construction (by combining the hash codes of each symbol in turn), then return that from GetHashCode without recomputing it each time. That would mean that you'd only ever need to compare sequences with equal hash codes - which would rarely happen for non-equal sequences.
No, you have to compare all of the elements. You can't use hash code or a similar approach, because the set of possible formulas is infinite, while the set of possible hash codes is finite.
As Jon Skeet notes, you could use hash codes to reduce the need to compare formulas element-by-element, but you cannot eliminate the need. When two formulas have unequal hash codes, you know the formulas are unequal, but when they have equal hash codes, you will need to do an element-by-element comparison to see whether they are equal.
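Putting both answers together, a rough sketch of what an implementation might look like (FormulaImpl and its members are hypothetical names, not part of the question's interfaces): the hash is combined once in the constructor, GetHashCode just returns the cached value, and Equals uses the cached hashes as a cheap pre-check before falling back to element-by-element comparison.
class FormulaImpl : Formula
{
    private readonly IList<Symbol> symbols;
    private readonly int hash;   // cached: safe because the type is immutable

    public FormulaImpl(IEnumerable<Symbol> source)
    {
        symbols = new List<Symbol>(source).AsReadOnly();
        int h = 17;
        foreach (Symbol s in symbols)
            h = unchecked(h * 31 + s.GetHashCode());   // combine the symbols' hashes
        hash = h;
    }

    public IList<Symbol> Symbols { get { return symbols; } }

    public override int GetHashCode() { return hash; }

    public bool Equals(Formula other)
    {
        if (other == null) return false;
        FormulaImpl impl = other as FormulaImpl;
        if (impl != null && impl.hash != hash)
            return false;                               // unequal hashes: definitely not equal
        IList<Symbol> a = Symbols, b = other.Symbols;   // equal hashes: still compare the symbols
        if (a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
            if (!a[i].Equals(b[i])) return false;
        return true;
    }
}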
I do believe that is not all you need to do...
a+b = (a+b)
would result in false with your approach.
I believe you have to construct ASTs (abstract syntax trees) for the expressions on both sides and then compare the expressions. The AST would do away with the parentheses, since they are expressed as hierarchy in the AST.
hth
Mario
This is kinda like the other answer for overriding GetHashCode but I have a different approach....
Since the formula appears to have a string representation....
Can't you override GetHashCode and in the override do a
int hashCode = 0;
foreach (char c in ToString())
{
    hashCode |= c;
}
return hashCode;
The result of this would yield a 4 byte code which was a packed representation of the symbols in the equation...
This could be taken further if each symbol has specific OpCode which could be looked up in a HashTable.
I would build the HashTable up with alias's of each OpCode so each Symbol would not have to declare a property OpCode.
I would then make an Extension ToOpCode on the Symbol class which did the look-up in the HashTable described above.
I would then utilize the Extension method in the GetHashCode such as
Formula....
public override int GetHashCode()
{
    int hashCode = 0;
    foreach (Symbol c in Symbols)
    {
        hashCode |= c.ToOpCode();
    }
    return hashCode;
}
Symbol....
public override int GetHashCode()
{
    return Extensions.ToOpCode(this);
}
This implementation would yield the same hash for a + b and b + a which is very important per your question...
Additionally if you specified the OpCode in correct succession you would technically be able to compare equations in the form of:
(a) + (b) == (a+b)
This would be achieved by ensuring the Parenthesis OpCodes were given a value in the HashCode in a different place than the numbers...
E.g. If you have 4 bytes (an integer) the scope depth could be kept in the first byte, the index to the previous or next equation / symbol in the stack would be next and the next two bytes would be reserved for sign data and the value / continuations or number of variables in the equation (exclusive).
This allows you to tell certain things such as how many nesting levels etc so you can essentially override Equals as well to ensure you can differentiate between a + b and b + a and ((a) + (b)) if required.
For instance you may want to know if the equation is exactly the same with a certain method but in another you may want to know if the equations are doing the same thing but not written the same exact way.
This would also allow you to determine equality in different ways such as checking if the scope depths match and if there are exactly the same amount of steps in the equation rather than just assuming so based on the hash code..
e.g. you could then shift as follows to determine things such as :
hash << 8 would be the depth of parens
hash << 16 would be the previous or next equation pointer for the stack
hash << 24 would be the sign or code value continuation or number of variables in the equation (exclusive)
you could also just do hash == anotherHash but this way gives you much more flexibility with literally no overhead.
If you need more room in the Hash then create a new Method GetExtendedHashCode which returns long and then shift / downcast or reformat the ExtendedHashCode in GetHashCode to match the int format required by the CLR.
You also have the benefit of the symbols being able to represent variables and values in this way by leaving them as they are on the stack and using them just like the CLR.
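If you did add a GetExtendedHashCode that returns a long, one common way to fold it back down to the int the CLR requires (my own assumption; the answer above doesn't prescribe a specific downcast) is to XOR the high half into the low half:
public override int GetHashCode()
{
    long extended = GetExtendedHashCode();                 // the 64-bit hash suggested above
    return unchecked((int)(extended ^ (extended >> 32)));  // fold the upper 32 bits into the lower 32
}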
First of all, I would advise against implementing IEquatable<T> for any non-sealed type T. The only safe way to implement IEquatable<T>.Equals on an unsealed type is generally to call the virtual method Object.Equals. Otherwise, there is a possibility that a class whose parent class implements IEquatable<T> for one or more types T will override Object.Equals and Object.GetHashCode without re-implementing all of its IEquatable<T> interfaces; any such interfaces that aren't re-implemented will thus be broken.
Secondly, if, while comparing the lists in two Formula instances, one finds a pair of corresponding Symbol references that are equivalent but refer to distinct instances, it may be helpful to call System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode() on each instance. If one compares larger than the other, replace the reference whose RuntimeHelpers.GetHashCode() value is larger with the reference from the other list. This will accelerate any future comparisons of those lists. Further, if one repeatedly compares multiple lists that have the same items, all of the lists will "gravitate" toward having the same Symbol instances.
Finally, if one finds that the lists are equal, and if the lists are supposed to be "semantically" immutable, one can use the same RuntimeHelpers.GetHashCode() trick to pick one List instance for both formulas to share. This will then expedite future comparisons.

What would be a good hashCode for a DateRange class

I have the following class
public class DateRange
{
    private DateTime startDate;
    private DateTime endDate;

    public override bool Equals(object obj)
    {
        DateRange other = (DateRange)obj;
        if (startDate != other.startDate)
            return false;
        if (endDate != other.endDate)
            return false;
        return true;
    }
    ...
}
I need to store some values in a dictionary keyed with a DateRange like:
Dictionary<DateRange, double> tddList;
How should I override the GetHashCode() method of DateRange class?
I use this approach from Effective Java for combining hashes:
unchecked
{
    int hash = 17;
    hash = hash * 31 + field1.GetHashCode();
    hash = hash * 31 + field2.GetHashCode();
    ...
    return hash;
}
There's no reason that shouldn't work fine in this situation.
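Applied to the DateRange in the question, that would look roughly like this (a sketch using the question's private fields):
public override int GetHashCode()
{
    unchecked
    {
        int hash = 17;
        hash = hash * 31 + startDate.GetHashCode();
        hash = hash * 31 + endDate.GetHashCode();
        return hash;
    }
}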
It depends on the values I expect to see it used with.
If it was most often going to have different day values, rather than different times on the same day, and they were within a century of now, I would use:
unchecked
{
    int hash = startDate.Year + endDate.Year - 4007;
    hash *= 367 + startDate.DayOfYear;
    return hash * 367 + endDate.DayOfYear;
}
This distributes the bits well with the expected values, while reducing the number of bits lost in the shifting. Note that while there are cases where dependency on primes can be surprisingly bad at collisions (esp. when the hash is fed into something that uses a modulo of the same prime in trying to avoid collisions when producing a yet-smaller hash to distribute among its buckets), I've opted to go for primes above the more obvious choices, as they're only just above and so still pretty "tight" for bit-distribution. I don't worry much about using the same prime twice, as they're so "tight" in this way, but it does hurt if you've a hash-based collection with 367 buckets. This deals well (but not as well) with dates well into the past or future, but is dreadful if the assumption that there will be few or no ranges within the same day (differing in time) is wrong, as that information is entirely lost.
If I was expecting ranges that often differ only in their time of day (or writing for general use by other parties, and not able to assume otherwise), I'd go for:
int startHash = startDate.GetHashCode();
return (((startHash >> 24) & 0x000000FF)
    | ((startHash >> 8) & 0x0000FF00)
    | ((startHash << 8) & 0x00FF0000)
    | (unchecked((int)((startHash << 24) & 0xFF000000))))
    ^ endDate.GetHashCode();
Where the first method works on the assumption that the general-purpose GetHashCode in DateTime isn't as good as we want, this one depends on it being good, but mixes around the bits of one value.
It's good in dealing with the more obvious tricky cases such as the two values being the same, or a common distance from each other (e.g. lots of 1day or 1hour ranges). It's not as good at the cases where the first example works best, but the first one totally sucks if there are lots of ranges using the same day, but different times.
Edit: To give a more detailed response to Dour's concern:
Dour points out, correctly, that some of the answers on this page lose data. The fact is, all of them lose data.
The class defined in the question has 8.96077483×10^37 different valid states (or 9.95641648×10^36 if we don't care about the DateTimeKind of each date), and the output of GetHashCode has 4294967296 possible states (one of which - zero - is also going to be used as the hashcode of a null value, which may be commonly compared with in real code). Whatever we do, we reduce the information by a scale of 2.31815886×10^27. That's a lot of information we lost!
It's likely true that we can lose more with some than in others. Certainly, it's easy to prove some solutions can lose more than others by writing a valid, but really poor, answer.
(The worst possible valid solution is return 0; which is valid as it never errors or mismatches on equal objects, but as poor as possible as it collides for all values. The performance of a hash-based collection becomes O(n), and slow as O(n) goes, as the constants involved are higher than such O(n) operations as searching an unordered list).
It's difficult to measure just how much is lost. How much more does shifting some bits before XORing lose than swapping bits, considering that XOR halves the amount of information left? Even the naïve x ^ y doesn't lose more than a swap-and-xor, it just collides more on common values; swap-and-xor will collide on values where plain-xor does not.
Once we've got a choice between solutions that are not losing much more information than we have to, but returning 4294967296 or close to 4294967296 possible values with a good distribution between those values, then the question is no longer how much information is lost (the answer being that only 4.31376821×10^-28 of the original information remains) but which information is lost.
This is why my first suggestion above ignores time components. There are 864000000000 "ticks" (the 100-nanosecond units DateTime has a resolution of) in a day, and I throw away two chunks of those ticks (7.46496×10^23 possible values between the two) on purpose, because I'm thinking of a scenario where that information is not used anyway. In this case I've deliberately structured the mechanism in such a way as to pick which information gets lost; that improves the hash for a given situation, but makes it absolutely worthless if we had different values all with start and end dates happening on the same days but at different times.
Likewise x ^ y doesn't lose any more information than any of the others, but the information that it does lose is more likely to be significant than with other choices.
In the absence of any way to predict which information is likely to be of importance (esp. if your class will be public and its hash code used by external code), then we are more restricted in the assumptions we can safely make.
As a whole prime-mult or prime-mod methods are better in which information they lose than shift-based methods, except when the same prime is used in a further hashing that may take place inside a hash-based method, ironically with the same goal in mind (no number is relatively prime to itself! even primes) in which case they are much worse. On the other hand shift-based methods really fall down if fed into a shift-based further hash. There is no perfect hash for arbitrary data and arbitrary use (except when a class has few valid values and we match them all, in which case it's more strictly an encoding than a hash that we produce).
In short, you're going to lose information whatever you do, it's which you lose that's important.
Well, consider what characteristics a good hash function should have. It must:
be in agreement with Equals - that is, if Equals is true for two objects then the two hash codes have to also be the same.
never crash
And it should:
be very fast
give different results for similar inputs
What I would do is come up with a very simple algorithm; say, taking 16 bits from the hash code of the first and 16 bits from the hash code of the second, and combining them together. Make yourself a test case of representative samples; date ranges that are likely to be actually used, and see if this algorithm does give a good distribution.
A common choice is to xor the two hashes together. This is not necessarily a good idea for this type because it seems likely that someone will want to represent the zero-length range that goes from X to X. If you xor the hashes of two equal DateTimes you always get zero, which seems like a recipe for a lot of hash collisions.
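As a rough illustration of the "16 bits from each" idea (the exact bit layout here is my own assumption, not something the answer prescribes), you could write something like the following and then check its distribution against a representative sample of ranges:
public override int GetHashCode()
{
    int startHash = startDate.GetHashCode();
    int endHash = endDate.GetHashCode();
    // Low 16 bits come from the start date; the end date's low 16 bits fill the high half.
    // Equal start and end dates no longer collapse to zero, unlike a plain XOR.
    return unchecked((startHash & 0xFFFF) | (endHash << 16));
}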
You have to shift one end of the range, otherwise two equal dates will hash to zero, a pretty common scenario I imagine:
return startDate.GetHashCode() ^ (endDate.GetHashCode() << 4);
return startDate.GetHashCode() ^ endDate.GetHashCode();
might be a good start. You have to check that you get good distribution when there is equal distance between startDate and endDate, but different dates.

Good hash for small class? (override GetHashCode)

I use some identity classes/structs that contain 1-2 ints, maybe a datetime or a small string as well. I use these as keys in a dictionary.
What would be a good override of GetHashCode for something like this? Something quite simple but still somewhat performant hopefully.
Thanks
Take a look at Essential C#.
It contains a detailed description of how to override GetHashCode() correctly.
Extract from the book
The purpose of the hash code is to efficiently balance a hash table by generating a number that corresponds to the value of an object.
Required: Equal objects must have equal hash codes (if a.Equals(b), then a.GetHashCode() == b.GetHashCode())
Required: GetHashCode()'s returns over the life of a particular object should be constant (the same value), even if the object's data changes. In many cases, you should cache the method return to enforce this.
Required: GetHashCode() should not throw any exceptions; GetHashCode() must always successfully return a value.
Performance: Hash codes should be unique whenever possible. However, since GetHashCode() returns only an int, there has to be an overlap in hash codes for objects that have potentially more values than an int can hold -- virtually all types. (An obvious example is long, since there are more possible long values than an int could uniquely identify.)
Performance: The possible hash code values should be distributed evenly over the range of an int. For example, creating a hash that doesn't consider the fact that distribution of a string in Latin-based languages primarily centers on the initial 128 ASCII characters would result in a very uneven distribution of string values and would not be a strong GetHashCode() algorithm.
Performance: GetHashCode() should be optimized for performance. GetHashCode() is generally used in Equals() implementations to short-circuit a full equals comparison if the hash codes are different. As a result, it is frequently called when the type is used as a key type in dictionary collections.
Performance: Small differences between two objects should result in large differences between hash codes values -- ideally, a 1-bit difference in the object results in around 16 bits of the hash code changing, on average. This helps ensure that the hash table remains balanced no matter how it is "bucketing" the hash values.
Security: It should be difficult for an attacker to craft an object that has a particular hash code. The attack is to flood a hash table with large amounts of data that all hash to the same value. The hash table implementation then becomes O(n) instead of O(1), resulting in a possible denial-of-service attack.
As already mentioned here, you also have to consider some points about overriding Equals(), and there are some code examples showing how to implement these two functions.
So this information should give you a starting point, but I recommend buying the book and reading the complete chapter 9 (at least the first twelve pages) to get all the points on how to correctly implement these two crucial functions.
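For the kind of small identity type the question describes (a couple of ints, maybe a DateTime), a sketch that follows those guidelines might look like this (the type name and fields are hypothetical, just mirroring the question's description):
public sealed class CustomerKey : IEquatable<CustomerKey>
{
    private readonly int regionId;
    private readonly int customerId;
    private readonly DateTime snapshotDate;

    public CustomerKey(int regionId, int customerId, DateTime snapshotDate)
    {
        this.regionId = regionId;
        this.customerId = customerId;
        this.snapshotDate = snapshotDate;
    }

    public bool Equals(CustomerKey other)
    {
        return other != null
            && regionId == other.regionId
            && customerId == other.customerId
            && snapshotDate == other.snapshotDate;
    }

    public override bool Equals(object obj) { return Equals(obj as CustomerKey); }

    public override int GetHashCode()
    {
        unchecked
        {
            // The fields are readonly, so the hash stays constant over the object's lifetime.
            int hash = 17;
            hash = hash * 31 + regionId;
            hash = hash * 31 + customerId;
            hash = hash * 31 + snapshotDate.GetHashCode();
            return hash;
        }
    }
}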

What hash algorithm does .net utilise? What about java?

Regarding the HashTable (and subsequent derivatives of such) does anyone know what hashing algorithm .net and Java utilise?
Are List and Dictionary both direct descendants of Hashtable?
The hash function is not built into the hash table; the hash table invokes a method on the key object to compute the hash. So, the hash function varies depending on the type of key object.
In Java, a List is not a hash table (that is, it doesn't extend the Map interface). One could implement a List with a hash table internally (a sparse list, where the list index is the key into the hash table), but such an implementation is not part of the standard Java library.
I know nothing about .NET but I'll attempt to speak for Java.
In Java, the hash code is ultimately a combination of the code returned by a given object's hashCode() function, and a secondary hash function inside the HashMap/ConcurrentHashMap class (interestingly, the two use different functions). Note that Hashtable and Dictionary (the precursors to HashMap and AbstractMap) are obsolete classes. And a list is really just "something else".
As an example, the String class constructs a hash code by repeatedly multiplying the current code by 31 and adding in the next character. See my article on how the String hash function works for more information. Numbers generally use "themselves" as the hash code; other classes, e.g. Rectangle, that have a combination of fields often use a combination of the String technique of multiplying by a small prime number and adding in, but add in the various field values. (Choosing a prime number means you're unlikely to get "accidental interactions" between certain values and the hash code width, since they don't divide by anything.)
Since the hash table size-- i.e. the number of "buckets" it has-- is a power of two, a bucket number is derived from the hash code essentially by lopping off the top bits until the hash code is in range. The secondary hash function protects against hash functions where all or most of the randomness is in those top bits, by "spreading the bits around" so that some of the randomness ends up in the bottom bits and doesn't get lopped off. The String hash code would actually work fairly well without this mixing, but user-created hash codes may not work quite so well. Note that if two different hash codes resolve to the same bucket number, Java's HashMap implementations use the "chaining" technique-- i.e. they create a linked list of entries in each bucket. It's thus important for hash codes to have a good degree of randomness so that items don't cluster into a particular range of buckets. (However, even with a perfect hash function, you will still by law of averages expect some chaining to occur.)
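For illustration, here is a C# re-creation of the two ideas just described - the multiply-by-31 string hash and deriving a bucket index from a power-of-two table by spreading high bits downwards. The mixing constant is a sketch; Java's actual supplemental hash has changed between versions:
static int StringStyleHash(string s)
{
    int h = 0;
    foreach (char c in s)
        h = unchecked(h * 31 + c);      // repeatedly multiply by 31 and add the next character
    return h;
}

static int BucketIndex(int hash, int capacity)    // capacity is assumed to be a power of two
{
    int spread = hash ^ (int)((uint)hash >> 16);  // push some high-bit randomness into the low bits
    return spread & (capacity - 1);               // keep only the low bits ("lop off" the rest)
}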
Hash code implementations shouldn't be a mystery. You can look at the hashCode() source for any class you choose.
The HASHING algorithm is the algorithm used to determine the hash code of an item within the HashTable.
The HASHTABLE algorithm (which I think is what this person is asking) is the algorithm the HashTable uses to organize its elements given their hash code.
Java happens to use a chained hash table algorithm.
While looking for the same answer myself, I found this in .NET's reference source at http://referencesource.microsoft.com:
/*
Implementation Notes:
The generic Dictionary was copied from Hashtable's source - any bug
fixes here probably need to be made to the generic Dictionary as well.
This Hashtable uses double hashing. There are hashsize buckets in the
table, and each bucket can contain 0 or 1 element. We use a bit to mark
whether there's been a collision when we inserted multiple elements
(ie, an inserted item was hashed at least a second time and we probed
this bucket, but it was already in use). Using the collision bit, we
can terminate lookups & removes for elements that aren't in the hash
table more quickly. We steal the most significant bit from the hash code
to store the collision bit.
Our hash function is of the following form:
h(key, n) = h1(key) + n*h2(key)
where n is the number of times we've hit a collided bucket and rehashed
(on this particular lookup). Here are our hash functions:
h1(key) = GetHash(key); // default implementation calls key.GetHashCode();
h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));
The h1 can return any number. h2 must return a number between 1 and
hashsize - 1 that is relatively prime to hashsize (not a problem if
hashsize is prime). (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
If this is true, then we are guaranteed to visit every bucket in exactly
hashsize probes, since the least common multiple of hashsize and h2(key)
will be hashsize * h2(key). (This is the first number where adding h2 to
h1 mod hashsize will be 0 and we will search the same bucket twice).
We previously used a different h2(key, n) that was not constant. That is a
horrifically bad idea, unless you can prove that series will never produce
any identical numbers that overlap when you mod them by hashsize, for all
subranges from i to i+hashsize, for all i. It's not worth investigating,
since there was no clear benefit from using that hash function, and it was
broken.
For efficiency reasons, we've implemented this by storing h1 and h2 in a
temporary, and setting a variable called seed equal to h1. We do a probe,
and if we collided, we simply add h2 to seed each time through the loop.
A good test for h2() is to subclass Hashtable, provide your own implementation
of GetHash() that returns a constant, then add many items to the hash table.
Make sure Count equals the number of items you inserted.
Note that when we remove an item from the hash table, we set the key
equal to buckets, if there was a collision in this bucket. Otherwise
we'd either wipe out the collision bit, or we'd still have an item in
the hash table.
--
*/
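A much-simplified sketch of the probe sequence that comment describes (illustrative only, not the real Hashtable code: it ignores the collision bit, resizing, and removals, and Entry is just a stand-in bucket record):
class Entry { public object Key; public object Value; }

static int FindBucket(object key, Entry[] buckets)
{
    int hashsize = buckets.Length;                    // kept prime, so h2 is relatively prime to it
    uint h1 = (uint)key.GetHashCode() & 0x7FFFFFFF;   // the sign bit is reserved for the collision flag
    uint h2 = 1 + ((h1 >> 5) + 1) % (uint)(hashsize - 1);

    uint seed = h1;                                   // probe n lands on (h1 + n*h2) % hashsize
    for (int n = 0; n < hashsize; n++)
    {
        int bucket = (int)(seed % (uint)hashsize);
        if (buckets[bucket] == null || key.Equals(buckets[bucket].Key))
            return bucket;                            // empty slot, or the key we were looking for
        seed += h2;                                   // collided: step ahead by h2 and try again
    }
    return -1;                                        // table full and key not present
}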
Anything purporting to be a HashTable or something like it in .NET does not implement its own hashing algorithm: it always calls the GetHashCode() method of the object being hashed.
There is a lot of confusion though as to what this method does or is supposed to do, especially when concerning user-defined or otherwise custom classes that override the base Object implementation.
For .NET, you can use Reflector to see the various algorithms. There is a different one for the generic and non-generic hash table, plus of course each class defines its own hash code formula.
The .NET Dictionary<T> class uses an IEqualityComparer<T> to compute hash codes for keys and to perform comparisons between keys in order to do hash lookups.
If you don't provide an IEqualityComparer<T> when constructing the Dictionary<T> instance (it's an optional argument to the constructor) it will create a default one for you, which uses the object.GetHashCode and object.Equals methods by default.
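For example, here is a minimal sketch of supplying your own comparer rather than relying on the default one (a case-insensitive string key, using the framework's built-in StringComparer):
var lookup = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
lookup["Alpha"] = 1;
Console.WriteLine(lookup["ALPHA"]);   // 1 - both hashing and equality go through the supplied comparer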
As for how the standard GetHashCode implementation works, I'm not sure it's documented. For specific types you can read the source code for the method in Reflector or try checking the Rotor source code to see if it's there.
