Does a perfect hash function guarantee no collisions?


I have been reading about hashing and hash tables and experimenting with some code (I am still very new to this, so I may have misunderstood something). I ran into an issue with perfect hash functions. Suppose I have my own custom type that has a perfect hash function:
class Foo
{
    private int data;

    // Must be public to override Object.GetHashCode.
    public override int GetHashCode()
    {
        return data.GetHashCode();
    }
}
An int's hash code is the int itself, so I have a perfect hash function, right? But when we use the hash function to map objects into a hashtable with the simple formula:
index = foo.GetHashCode() % hashtable.Length
we get an index that also depends on the size of the hashtable. Only if the hashtable's size were int.MaxValue would we have a perfect hash function. For example, let's say we have a hashtable of size 2. If we hash the numbers 1 and 3, we get
1 % 2 = 1
3 % 2 = 1
A collision! Have I misunderstood something about hashing and hashtables? It seems that a perfect hash function is not perfect after all.

You have it all right up to this point:
index = foo.GetHashCode() % hashtable.Length
Your hash function is perfect, but when you take the modulo, you are actually using a different hash function. In this case, int.GetHashCode is perfect, but the function your data structure uses, foo.GetHashCode() % hashtable.Length, is not. That is, the hash of your objects is one thing, and the hash used by the structure holding those objects is another.
For your data structure to be perfect too, its size must equal the number of possible int values.
So why don't we see collisions in Dictionary? Actually, we do. If two objects A and B end up with the same hash in the dictionary, we have a collision. The dictionary then runs A.Equals(B) as the final check to see whether the two objects are actually the same. If they are, Add throws an exception for the duplicate key. If they aren't, both are kept under the same dictionary hash.
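To make the distinction concrete, here is a minimal sketch. The Foo constructor, the Equals overrides and the BucketDemo helper are assumptions added for illustration, and the bucket formula is a simplification of what a real table does. The hash codes of 1 and 3 differ, the bucket indexes in a 2-slot table do not, and a Dictionary still keeps both keys.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type mirroring the question's Foo: its hash code is its int payload.
public sealed class Foo : IEquatable<Foo>
{
    private readonly int data;
    public Foo(int data) { this.data = data; }
    public override int GetHashCode() { return data.GetHashCode(); }
    public bool Equals(Foo other) { return other != null && other.data == data; }
    public override bool Equals(object obj) { return Equals(obj as Foo); }
}

public static class BucketDemo
{
    // Simplified bucket selection: clear the sign bit, then reduce modulo the table size.
    public static int BucketIndex(Foo key, int bucketCount)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % bucketCount;
    }

    // Keys 1 and 3 collide in a 2-bucket table (see BucketIndex), yet a Dictionary
    // stores both, because Equals tells them apart.
    public static int CountDistinctKeys()
    {
        var map = new Dictionary<Foo, string>();
        map.Add(new Foo(1), "one");
        map.Add(new Foo(3), "three");
        return map.Count;
    }
}
```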

Yes! (as said, by definition)
Where do you get a p.h.f from in the first place?
You want to hash a fixed, i.e. constant, set S of distinct values (i.e. no multiset) to the set 1..|S|, bijectively.
Apparently then, the p.h.f. depends on the set S.
Also, if you remove a single element from S and add another one, you will very likely get a collision (of the new element with an old one).
So, you actually want "a p.h.f. for such-and-such well defined/described set".
And then we can try to find one.
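One simple way to "try to find one" for a small fixed set of ints is sketched below. It is not a general construction: it assumes distinct, non-negative keys, it searches for a table size rather than a clever function, and the class and method names are made up for the example. The result is collision-free on S but not necessarily minimal (the table may be larger than |S|).

```csharp
using System.Collections.Generic;

public static class PerfectHashSketch
{
    // True when key % m sends every key to a different slot.
    public static bool IsCollisionFree(int[] keys, int m)
    {
        var used = new HashSet<int>();
        foreach (int key in keys)
        {
            if (!used.Add(key % m)) return false;
        }
        return true;
    }

    // Smallest table size m >= |S| for which key % m is collision-free.
    // Terminates for distinct non-negative keys, since m = (largest key + 1) always works.
    public static int FindModulus(int[] keys)
    {
        int m = keys.Length < 1 ? 1 : keys.Length;
        while (!IsCollisionFree(keys, m)) m++;
        return m;
    }
}
```

For S = {1, 3, 17, 42, 1000} this search settles on m = 6. Swapping 1 for 9 breaks it, because 9 % 6 and 3 % 6 are both 3, which is the dependence on S described above.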

Yes, a perfect hash function is guaranteed not to have collisions.
That's its very definition!
From Wikipedia (http://en.wikipedia.org/wiki/Perfect_hash_function)
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented

Related

Create a identical "random" float based on multiple data

I'm working on a game (Unity) and I need to create a random float value (between 0 and 1) based on multiple ints and/or floats.
I think it would be easier to manually build a single string for the function, but maybe it could accept a list of ints and/or floats.
Example of result:
"[5-91]-52-1" > 0.158756..
Important points:
The distribution of results (between 0 and 1) must be uniform (I don't want 90% of results between 0.45 and 0.55).
Asking twice for the same string must return exactly the same result (even if I reload the app, or start it on a different computer, ...).
Results do not need to be unique.
Bonus point:
Sometimes I need similar strings to return close results, but not every time. Is it possible for the "random generation" to take a boolean that enables this feature?

What you've described is essentially the definition of a hash function.
So just use one and normalize the result into the range you want. The most basic case can use GetHashCode, but it is not guaranteed to produce the same results across different versions of the framework.
A stable version that gives exactly the same results across machines is to use a well-known good hash, such as the cryptographic hash SHA256, take the first several bytes of the result as an integer, and normalize. Cryptographic hash functions also conveniently take byte arrays as input, so you can combine multiple values as bytes directly and get a stable result.
var intValue = 42;
var bytesToHash = BitConverter.GetBytes(intValue); // byte order follows the machine
byte[] hash;
using (var sha = System.Security.Cryptography.SHA256.Create())
    hash = sha.ComputeHash(bytesToHash);
// Interpret the first 4 bytes as a uint and scale into [0, 1].
var toNormalize = BitConverter.ToUInt32(hash, 0);
var fancyRandom = (double)toNormalize / UInt32.MaxValue;
To combine multiple values into a byte array you can either manually concatenate the results of BitConverter.GetBytes or use a BinaryWriter on a MemoryStream.
Alternatively, you can use the resulting integer as the seed for some custom implementation of a pseudo-random generator (the one in .NET does not guarantee the same results across machines or versions of .NET), as suggested in the comments, but I don't think it will give a significantly better distribution.
Note: make sure the resulting numbers are distributed "randomly enough" for your case. Cryptographic hash functions likely give the result you want, but I'm not sure how to prove that.
For the "bonus" part: I would be very surprised if you can find a pseudo-random generator that consistently produces close results for "similar" seeds. Instead you can use the same approach as above on separate parts, one that stays the "same" and one that carries the variation (e.g. intValue & 0xFFFFFF00 for the stable part, intValue & 0xFF for the "small difference"), and then combine the resulting "random" numbers with some weight: randomFromStable + 0.05 * randomFromDifference.
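Since the question starts from a string key, the same idea can be sketched for strings. This is only an illustration under some assumptions: the StableRandom class and FromKey method are made-up names, the key is encoded as UTF-8, and the first four hash bytes are assembled by hand so that the result does not depend on the machine's byte order.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StableRandom
{
    // Maps a key string to a double in [0, 1]; identical keys give identical results.
    public static double FromKey(string key)
    {
        byte[] hash;
        using (var sha = SHA256.Create())
        {
            hash = sha.ComputeHash(Encoding.UTF8.GetBytes(key));
        }
        // Combine the first four bytes explicitly (little-endian) instead of using BitConverter.
        uint value = hash[0] | ((uint)hash[1] << 8) | ((uint)hash[2] << 16) | ((uint)hash[3] << 24);
        return (double)value / uint.MaxValue;
    }
}
```

Calling StableRandom.FromKey("[5-91]-52-1") twice returns the same number; whether the spread is uniform enough for a given game is still worth checking on real keys.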

I would suggest using the hashcode (or something similar) as the seed to a Random object. Hashcodes must be the same for the same string so you will always get the same sequence back.
As Nuf notes, hashcodes are only guaranteed to be the same in the same app-domain; so it may not work across restarts.
As to your bonus point, getting there without writing your own RNG will be hard. Any variance in the seed can and should cause a lot of variation in the resulting sequence.

Why do "int" and "sbyte" GetHashCode functions generate different values?

We have the following code:
int i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 1
This makes sense, and the same happens with all integral types in C# except sbyte and short.
That is:
sbyte i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 257
Why is this?

Because the source of that method (SByte.GetHashCode) is
public override int GetHashCode()
{
    return (int)this ^ ((int)this << 8);
}
As for why, well someone at Microsoft knows that..
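The arithmetic is easy to check by hand. The sketch below repeats the quoted formula in a standalone helper (SByteHashSketch is a made-up name) rather than calling sbyte.GetHashCode, since the built-in implementation may differ between runtime versions.

```csharp
public static class SByteHashSketch
{
    // Same arithmetic as the quoted method: the value XORed with itself shifted left by 8 bits.
    public static int Hash(sbyte value)
    {
        return (int)value ^ ((int)value << 8);
    }
}
```

For 1 this gives 1 ^ 256 = 257, and for 2 it gives 2 ^ 512 = 514.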

Yes, it's all about the distribution of values. Since the return type of GetHashCode is int, the hash codes for non-negative sbyte values are spread out at intervals of 257. For the same reason (the return type is int), the long type will have collisions.

The reason is that it is probably done to avoid clustering of hash values.
As GetHashCode documentation says:
For the best performance, a hash function must generate a random
distribution for all input.
Providing a good hash function on a class can significantly affect the
performance of adding those objects to a hash table. In a hash table with
a good implementation of a hash function, searching for an element takes
constant time (for example, an O(1) operation).
Also, as this excellent article explains:
Guideline: the distribution of hash codes must be "random"
By a "random distribution" I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be "clustered"; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.

hash that maps strings to integers

I'm looking for a hash function that maps strings to ints with the following restrictions.
Restrictions:
Same strings go to the same number.
Different strings go to different numbers.
During one run of the application, all the strings I get have the same length; the length is known only at runtime.
Any suggestions on how to create the hash function?

A hash function never guarantees that two different values (strings in your case) yield different hash codes. However, equal values will always yield the same hash code.
This is because information gets lost. A string of 32 characters occupies 64 bytes (2 bytes per char), while an int hash code has only four bytes. Two different strings ending up with the same hash code is therefore inevitable; this is called a collision.
Note: Dictionary<TKey,TValue> uses a hash table internally. Therefore it implements a collision resolution strategy. See An Extensive Examination of Data Structures Using C# 2.0 on MSDN.
Here is the current implementation of dictionary.cs.

You aren't going to find a hash algorithm that guarantees that the same integer won't be returned for different strings. By definition, hash algorithms have collisions. There are far more possible strings in the world than there are possible 32-bit integers.

Different strings go to different numbers.
There are more strings than there are numbers, so this is flat out impossible without restricting the input set. You can't put n pigeons in m boxes with n > m without having at least one box contain more than one pigeon.
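The pigeonhole argument can be watched in miniature. The sketch below uses a toy 8-bit hash (an assumption for illustration, not String.GetHashCode): with only 256 possible hash values, hashing 257 different strings has to produce at least one pair with the same hash.

```csharp
using System;
using System.Collections.Generic;

public static class PigeonholeDemo
{
    // Toy 8-bit hash: a multiply-by-31 string hash reduced to its low byte.
    public static byte TinyHash(string s)
    {
        int h = 0;
        foreach (char c in s) h = unchecked(31 * h + c);
        return (byte)(h & 0xFF);
    }

    // Hashes "key0", "key1", ... and returns the first pair of distinct strings sharing a hash.
    // With 256 possible hash values, a pair must turn up within 257 strings.
    public static Tuple<string, string> FirstCollision()
    {
        var seen = new Dictionary<byte, string>();
        for (int i = 0; i <= 256; i++)
        {
            string s = "key" + i;
            byte h = TinyHash(s);
            string earlier;
            if (seen.TryGetValue(h, out earlier)) return Tuple.Create(earlier, s);
            seen[h] = s;
        }
        return null; // not reached: 257 strings cannot fit into 256 hash values
    }
}
```

The same reasoning applies to 32-bit hash codes, only with about four billion boxes instead of 256.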

Is the String.GetHashCode function not right for your needs?

Is there a method to randomize integers so that visitors can't figure out the sequence of objects

I have an id in the URL. Normally it is an auto-number, so it goes 1, 2, 3, 4, 5, ...
I don't want visitors to figure out the sequence, so I want the number to look random: 1 is converted to 174891, 2 to 817482, and so on. But I want the result to stay in a specific range, like 1 to 1,000,000.
I figured out I can do this by XORing and shifting the bits of the integer, but I was wondering whether this is already implemented somewhere.
Thanks

You could pass your integer as the seed to a random number generator (just make sure the result is unique).
You could also generate the SHA-512 hash of the integer and use that instead.
However, the best thing to do here is to use a GUID instead of an integer.
EDIT: If it needs to be reversible, the correct way to do it is to encrypt the number using AES or a different encryption algorithm. However, this won't result in a number between one and a million.

Don't rely on obscurity -- i.e., non-sequential ids -- for security. Build your app so that even if someone does guess the next id, it's still secure.
If you do need non-sequential ids, though, generate a new id randomly each time. Store it in your table as a uniquely indexed column alongside your autogenerated primary key id. Then all you need to do is a lookup on that column to get back the real id.

EDIT: In general, I prefer tvanfosson's approach on both scores. However, here's an answer to the question as stated...
These are fairly strange design constraints, to be honest - but they're reasonably easy to deal with:
Pick an arbitrary RNG seed which you will use on every execution of your program
Create an instance of Random using that seed
Create an array of integers 1..1000000
Shuffle the array using the Random instance
Create a "reverse mapping" array by going through the original array like this:
int[] reverseMapping = new int[mapping.Length];
for (int i = 0; i < mapping.Length; i++)
{
    // mapping holds the values 1..N, so subtract 1 to get a valid array index.
    reverseMapping[mapping[i] - 1] = i + 1;
}
Then you can map both ways. This does rely on the algorithm used by Random not changing, admittedly... if that's a concern, you could always generate this mapping once and save it somewhere.
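Put together, the steps above might look like the sketch below. IdShuffle, ToPublic and ToReal are made-up names, the shuffle is a standard Fisher-Yates pass, and as noted the mapping is only reproducible as long as Random keeps producing the same sequence for a given seed.

```csharp
using System;

public sealed class IdShuffle
{
    private readonly int[] mapping;        // mapping[i] is the public id for real id i + 1
    private readonly int[] reverseMapping; // reverseMapping[p - 1] is the real id for public id p

    public IdShuffle(int count, int seed)
    {
        mapping = new int[count];
        for (int i = 0; i < count; i++) mapping[i] = i + 1;

        // Fisher-Yates shuffle driven by a fixed seed.
        var rng = new Random(seed);
        for (int i = count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            int tmp = mapping[i];
            mapping[i] = mapping[j];
            mapping[j] = tmp;
        }

        reverseMapping = new int[count];
        for (int i = 0; i < count; i++) reverseMapping[mapping[i] - 1] = i + 1;
    }

    public int ToPublic(int realId) { return mapping[realId - 1]; }
    public int ToReal(int publicId) { return reverseMapping[publicId - 1]; }
}
```

With new IdShuffle(1000000, someSeed), ToReal(ToPublic(id)) returns id for every id in 1..1,000,000, at the cost of two arrays of a million ints.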

If you're looking for a fairly simple pseudo-random integer sequence, the linear congruential method is pretty good:
n_{i+1} = (a × n_i + k) mod m
Use prime numbers for a and k.
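A variation on this formula, applied to the id itself rather than used to generate a sequence, gives a reversible mapping inside 1..1,000,000. This is a sketch under assumptions: the constants A and K are arbitrary picks for illustration, and the mapping is only a permutation because A shares no factor with the range size. It hides the sequence from casual inspection but offers no real security.

```csharp
public static class AffineIdMap
{
    private const long M = 1000000; // size of the id range 1..M
    private const long A = 387421;  // hypothetical multiplier; must share no factor with M
    private const long K = 611953;  // hypothetical offset

    // Maps id in 1..M to another value in 1..M.
    public static int Encode(int id)
    {
        return (int)((A * (id - 1) + K) % M) + 1;
    }

    // Undoes Encode by subtracting K and multiplying by the inverse of A modulo M.
    public static int Decode(int code)
    {
        long shifted = ((code - 1 - K) % M + M) % M;
        return (int)(ModInverse(A, M) * shifted % M) + 1;
    }

    // Extended Euclid; returns x with (a * x) % m == 1, assuming gcd(a, m) == 1.
    private static long ModInverse(long a, long m)
    {
        long r0 = m, r1 = a % m, t0 = 0, t1 = 1;
        while (r1 != 0)
        {
            long q = r0 / r1;
            long r2 = r0 - q * r1; r0 = r1; r1 = r2;
            long t2 = t0 - q * t1; t0 = t1; t1 = t2;
        }
        return ((t0 % m) + m) % m;
    }
}
```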

What hash algorithm does .net utilise? What about java?

Regarding the HashTable (and subsequent derivatives of such) does anyone know what hashing algorithm .net and Java utilise?
Are List and Dictionary both direct descendants of Hashtable?

The hash function is not built into the hash table; the hash table invokes a method on the key object to compute the hash. So, the hash function varies depending on the type of key object.
In Java, a List is not a hash table (that is, it doesn't extend the Map interface). One could implement a List with a hash table internally (a sparse list, where the list index is the key into the hash table), but such an implementation is not part of the standard Java library.

I know nothing about .NET but I'll attempt to speak for Java.
In Java, the hash code is ultimately a combination of the code returned by a given object's hashCode() function, and a secondary hash function inside the HashMap/ConcurrentHashMap class (interestingly, the two use different functions). Note that Hashtable and Dictionary (the precursors to HashMap and AbstractMap) are obsolete classes. And a list is really just "something else".
As an example, the String class constructs a hash code by repeatedly multiplying the current code by 31 and adding in the next character. See my article on how the String hash function works for more information. Numbers generally use "themselves" as the hash code; other classes, e.g. Rectangle, that have a combination of fields often use a combination of the String technique of multiplying by a small prime number and adding in, but add in the various field values. (Choosing a prime number means you're unlikely to get "accidental interactions" between certain values and the hash code width, since they don't divide by anything.)
Since the hash table size-- i.e. the number of "buckets" it has-- is a power of two, a bucket number is derived from the hash code essentially by lopping off the top bits until the hash code is in range. The secondary hash function protects against hash functions where all or most of the randomness is in those top bits, by "spreading the bits around" so that some of the randomness ends up in the bottom bits and doesn't get lopped off. The String hash code would actually work fairly well without this mixing, but user-created hash codes may not work quite so well. Note that if two different hash codes resolve to the same bucket number, Java's HashMap implementations use the "chaining" technique-- i.e. they create a linked list of entries in each bucket. It's thus important for hash codes to have a good degree of randomness so that items don't cluster into a particular range of buckets. (However, even with a perfect hash function, you will still by law of averages expect some chaining to occur.)
Hash code implementations shouldn't be a mystery. You can look at the hashCode() source for any class you choose.
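The pieces described above can be sketched in a few lines. The code is written in C# to match the rest of this page and is not the Java source; the bit-spreading step is modelled on the one HashMap uses in Java 8 and later (earlier versions mixed the bits differently), and the class and method names are made up.

```csharp
public static class JavaStyleHash
{
    // Multiply the running code by 31 and add the next character, wrapping on overflow.
    public static int StringHash(string s)
    {
        int h = 0;
        foreach (char c in s) h = unchecked(31 * h + c);
        return h;
    }

    // Fold the top 16 bits into the bottom 16 so they influence the bucket choice.
    public static int Spread(int h)
    {
        return h ^ (int)((uint)h >> 16);
    }

    // bucketCount must be a power of two; the mask keeps only the low bits.
    public static int BucketIndex(int hashCode, int bucketCount)
    {
        return Spread(hashCode) & (bucketCount - 1);
    }
}
```

For example, StringHash("ab") is 97 * 31 + 98 = 3105. A hash code of 0x10000 has all its information in the upper half: masking alone would put it in bucket 0 of a 16-bucket table, while spreading first puts it in bucket 1.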

The HASHING algorithm is the algorithm used to determine the hash code of an item within the HashTable.
The HASHTABLE algorithm (which I think is what this person is asking) is the algorithm the HashTable uses to organize its elements given their hash code.
Java happens to use a chained hash table algorithm.

While looking for the same answer myself, I found this in .NET's reference source at http://referencesource.microsoft.com.
/*
Implementation Notes:
The generic Dictionary was copied from Hashtable's source - any bug
fixes here probably need to be made to the generic Dictionary as well.
This Hashtable uses double hashing. There are hashsize buckets in the
table, and each bucket can contain 0 or 1 element. We use a bit to mark
whether there's been a collision when we inserted multiple elements
(ie, an inserted item was hashed at least a second time and we probed
this bucket, but it was already in use). Using the collision bit, we
can terminate lookups & removes for elements that aren't in the hash
table more quickly. We steal the most significant bit from the hash code
to store the collision bit.
Our hash function is of the following form:
h(key, n) = h1(key) + n*h2(key)
where n is the number of times we've hit a collided bucket and rehashed
(on this particular lookup). Here are our hash functions:
h1(key) = GetHash(key); // default implementation calls key.GetHashCode();
h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));
The h1 can return any number. h2 must return a number between 1 and
hashsize - 1 that is relatively prime to hashsize (not a problem if
hashsize is prime). (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
If this is true, then we are guaranteed to visit every bucket in exactly
hashsize probes, since the least common multiple of hashsize and h2(key)
will be hashsize * h2(key). (This is the first number where adding h2 to
h1 mod hashsize will be 0 and we will search the same bucket twice).
We previously used a different h2(key, n) that was not constant. That is a
horrifically bad idea, unless you can prove that series will never produce
any identical numbers that overlap when you mod them by hashsize, for all
subranges from i to i+hashsize, for all i. It's not worth investigating,
since there was no clear benefit from using that hash function, and it was
broken.
For efficiency reasons, we've implemented this by storing h1 and h2 in a
temporary, and setting a variable called seed equal to h1. We do a probe,
and if we collided, we simply add h2 to seed each time through the loop.
A good test for h2() is to subclass Hashtable, provide your own implementation
of GetHash() that returns a constant, then add many items to the hash table.
Make sure Count equals the number of items you inserted.
Note that when we remove an item from the hash table, we set the key
equal to buckets, if there was a collision in this bucket. Otherwise
we'd either wipe out the collision bit, or we'd still have an item in
the hash table.
--
*/
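The probe sequence from that comment can be reproduced in isolation. The sketch below only follows the two formulas quoted above; it is not the Hashtable source, the names are made up, and it assumes hashsize is a prime of at least 2 so that the step h2 is relatively prime to it.

```csharp
using System.Collections.Generic;

public static class DoubleHashingSketch
{
    // Lists the buckets probed for a key: (h1 + n * h2) % hashsize for n = 0 .. hashsize - 1.
    public static List<int> ProbeSequence(int hashCode, int hashsize)
    {
        uint h1 = unchecked((uint)hashCode) & 0x7FFFFFFF;         // drop the sign bit
        uint h2 = 1 + (((h1 >> 5) + 1) % ((uint)hashsize - 1));  // step in 1 .. hashsize - 1
        var probes = new List<int>();
        for (ulong n = 0; n < (ulong)hashsize; n++)
        {
            probes.Add((int)((h1 + n * h2) % (ulong)hashsize));
        }
        return probes;
    }
}
```

With a prime hashsize such as 11, the returned list contains every bucket from 0 to 10 exactly once, which is the "visit every bucket in exactly hashsize probes" property the comment relies on.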

Anything purporting to be a HashTable or something like it in .NET does not implement its own hashing algorithm: they always call the object-being-hashed's GetHashCode() method.
There is a lot of confusion though as to what this method does or is supposed to do, especially when concerning user-defined or otherwise custom classes that override the base Object implementation.

For .NET, you can use Reflector to see the various algorithms. There is a different one for the generic and non-generic hash table, plus of course each class defines its own hash code formula.

The .NET Dictionary<TKey, TValue> class uses an IEqualityComparer<TKey> to compute hash codes for keys and to perform comparisons between keys in order to do hash lookups.
If you don't provide an IEqualityComparer<TKey> when constructing the Dictionary<TKey, TValue> instance (it's an optional argument to the constructor), it uses EqualityComparer<TKey>.Default, which falls back on the key's own GetHashCode and Equals methods.
As for how the standard GetHashCode implementation works, I'm not sure it's documented. For specific types you can read the source code for the method in Reflector or try checking the Rotor source code to see if it's there.
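As a small illustration of supplying your own comparer, the sketch below makes string keys case-insensitive. IgnoreCaseComparer and ComparerDemo are made-up names; in practice the built-in StringComparer.OrdinalIgnoreCase could be passed directly, and the hand-written class is only there to show the two methods the dictionary calls.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: treats strings that differ only by case as the same key.
public sealed class IgnoreCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
    }

    // Equal keys must return equal hash codes, so hash with the same case-insensitive rule.
    public int GetHashCode(string s)
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(s);
    }
}

public static class ComparerDemo
{
    // Adds "abc" and looks up "ABC" through the custom comparer.
    public static bool LookupIgnoringCase()
    {
        var map = new Dictionary<string, int>(new IgnoreCaseComparer());
        map.Add("abc", 1);
        return map.ContainsKey("ABC");
    }
}
```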
