Create an identical "random" float based on multiple values - c#

I'm working on a game (Unity) and I need to create a random float value (between 0 and 1) based on multiple int and/or float values.
I think it would be easier to manually build a single string to pass to the function, but it could also accept a list of ints and/or floats.
Example of a result:
"[5-91]-52-1" > 0.158756...
Important points:
The distribution of results (between 0 and 1) must be uniform (I don't want 90% of results between 0.45 and 0.55).
Asking twice for the same string must return the exact same result (even if I reload the app, or start it on a different computer, ...).
Results do not need to be unique.
Bonus Point:
Sometimes I need similar strings to return close results, but not always. Is it possible for the "random generation" to handle this feature with a boolean flag?

What you've described is essentially the definition of a hash function.
So just use one and normalize the result into the range you want. The most basic approach can use GetHashCode, but it is not guaranteed to produce the same results across different versions of the framework.
A stable version that is guaranteed to provide exactly the same results across machines would be to use a well-known good hash, like the crypto hash SHA256: take the first few bytes of the result as an integer and normalize. Crypto hash functions also conveniently take byte arrays as input, so you can combine multiple values as bytes directly and get a stable result.
var intValue = 42;
var bytesToHash = BitConverter.GetBytes(intValue);
// SHA256 produces the same digest on every machine and framework version.
var hash = System.Security.Cryptography.SHA256Managed.Create()
    .ComputeHash(bytesToHash);
// Take the first four bytes of the digest as an unsigned integer...
var toNormalize = BitConverter.ToUInt32(hash, 0);
// ...and normalize it into the range [0, 1].
var fancyRandom = (double)toNormalize / UInt32.MaxValue;
To combine multiple values into a byte array you can either manually concatenate the results of BitConverter.GetBytes or use a BinaryWriter over a MemoryStream, as sketched below.
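For instance, a minimal sketch combining an int, a float, and another int (the helper name CombinedHashToDouble is mine, not from the original answer):

static double CombinedHashToDouble(int a, float b, int c)
{
    using (var stream = new System.IO.MemoryStream())
    using (var writer = new System.IO.BinaryWriter(stream))
    {
        // BinaryWriter appends the raw bytes of each value in order.
        writer.Write(a);
        writer.Write(b);
        writer.Write(c);
        writer.Flush();
        using (var sha = System.Security.Cryptography.SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(stream.ToArray());
            // First four bytes as an unsigned int, normalized into [0, 1].
            return (double)BitConverter.ToUInt32(hash, 0) / UInt32.MaxValue;
        }
    }
}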
Alternatively you can use the resulting integer as a seed for a custom implementation of a pseudo-random generator (the one in .NET does not guarantee the same results across machines/versions of .NET), as suggested in the comments, but I don't think it will give a significantly better distribution.
Note: make sure the resulting numbers are distributed "randomly enough" for your case. Crypto hash functions likely give the result you want, but I'm not sure how to prove that.
For the "bonus" part: I would be very surprised if you could find a pseudo-random generator that consistently produces close results for "similar" seeds. Instead you can apply the same approach as above to separate parts - one that stays the same and one that carries the variation (i.e. intValue & 0xFFFFFF00 for the stable part, intValue & 0xFF for the "small difference") - and then combine the resulting "random" numbers with some weight: randomFromStable + 0.05 * randomFromDifference. A sketch follows.

I would suggest using the hash code (or something similar) as the seed to a Random object. Hash codes must be the same for the same string, so you will always get the same sequence back.
As Nuf notes, hash codes are only guaranteed to be the same within the same app-domain, so it may not work across restarts.
As to your bonus point: getting there without writing your own RNG will be hard. Any variance in the seed can and should cause a lot of variation in the resulting sequence.
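A minimal sketch of that suggestion (subject to the caveat above: string.GetHashCode is not stable across framework versions or restarts):

double SeededValue(string key)
{
    // Same string => same hash code => same first sample, within one run.
    return new Random(key.GetHashCode()).NextDouble();
}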


Does a perfect hash function guarantee no collisions?

I have been reading about and learning hashing and hash tables and experimented with some code (I am still very new to this, so I might say something wrong that I misunderstood). I came to an issue with perfect hash functions. Suppose I have my own custom type that somehow has a perfect hash function:
class Foo
{
    private int data;

    public override int GetHashCode()
    {
        return data.GetHashCode();
    }
}
An int's hash code is the int itself, so I have a perfect hash function, right? But when we use the hash function to map the objects into a hashtable by the simple formula:
index = foo.GetHashCode() % hashtable.Length
we get an index that also depends on how many buckets the hashtable has. Only if the hashtable's size were int.MaxValue would we have a perfect hash function. For example, let's say we have a hashtable with a size of 2. If we hash, say, the numbers 1 and 3 we get
1 % 2 = 1
3 % 2 = 1
A collision! Have I misunderstood something about hashing and hashtables? It turns out that a perfect hash function is not perfect.
You have it all right until this point
index = foo.GetHashCode() % hashtable.Length
Your hash function is perfect, but when you calculate the modulo, you're actually using a different hash function. In this case your hash function int.GetHashCode is perfect, but the hash your data structure uses, foo.GetHashCode() % hashtable.Length, is not. That is, the hash of your objects is one thing, and the hash used by the structure holding those objects is a different thing.
For your data structure to be perfect too, its maximum size must also be the number of ints.
So why don't we have collisions in Dictionary? Actually, we do. If two keys A and B do have the same hash in the dictionary, we have a collision. What happens is that the dictionary runs A.Equals(B) as the final check to see whether the two keys actually are the same or not. If they are, you get an exception for adding a duplicate key. If they are not, they are both kept under the same dictionary hash.
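A minimal sketch of that behaviour, using a deliberately colliding key type invented for this illustration:

class CollidingKey
{
    public int Id;
    public override int GetHashCode() { return 0; }  // force every key into a collision
    public override bool Equals(object o) { return o is CollidingKey k && k.Id == Id; }
}

var d = new Dictionary<CollidingKey, string>();
d.Add(new CollidingKey { Id = 1 }, "one");  // same hash...
d.Add(new CollidingKey { Id = 2 }, "two");  // ...but Equals differs, so both are kept
d.Add(new CollidingKey { Id = 1 }, "dup");  // Equals matches: throws ArgumentException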
Yes! (as said, by definition)
Where do you get a p.h.f. from in the first place?
You want to hash a fixed, i.e. constant, set S of distinct values (i.e. not a multiset) to the set 1..|S|, bijectively.
Apparently, then, the p.h.f. depends on the set S.
Also, if you remove a single element from S and add another one, you almost surely get a collision (of the new element with an old one).
So what you actually want is "a p.h.f. for such-and-such a well-defined/described set".
And then we can try to find one.
Yes, a perfect hash function is guaranteed not to have collisions.
That's its very definition!
From Wikipedia (http://en.wikipedia.org/wiki/Perfect_hash_function)
A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.

hash that maps strings to integers

I'm looking for a hash function to map strings to ints with the following restrictions.
Restrictions:
Same strings map to the same number.
Different strings map to different numbers.
During one run of the application all strings have the same length; I only know that length at runtime.
Any suggestions on how to create such a hash function?
A hash function never guarantees that two different values (strings in your case) yield different hash codes. However, equal values will always yield the same hash code.
This is because information gets lost. A string with a length of 32 characters occupies 64 bytes (2 bytes per char), while an int hash code has only four bytes. Collisions are therefore inevitable.
Note: Dictionary<TKey,TValue> uses a hash table internally and therefore implements a collision resolution strategy. See An Extensive Examination of Data Structures Using C# 2.0 on MSDN.
Here is the current implementation of dictionary.cs.
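If a deterministic string-to-int mapping that is stable across runs and machines is enough (collisions remain unavoidable), a simple FNV-1a hash is one option. This sketch is my own illustration, not part of the answer:

static int StableStringHash(string s)
{
    unchecked
    {
        // FNV-1a over the UTF-16 code units: standard offset basis and prime.
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return (int)hash;
    }
}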
You aren't going to find a hash algorithm that guarantees that the same integer won't be returned for different strings. By definition, hash algorithms have collisions. There are far more possible strings in the world than there are possible 32-bit integers.
Different strings go to different numbers.
There are more strings than there are numbers, so this is flat out impossible without restricting the input set. You can't put n pigeons in m boxes with n > m without having at least one box contain more than one pigeon.
Is the String.GetHashCode function not right for your needs?

Manipulating very large binary strings in c#

I'm working on a genetic algorithm project where I encode my chromosome in a binary string in which every 32 bits represent a value. The problem is that the function I'm optimizing has over 3000 parameters, which means I have over 96000 bits in my bit string, and the manipulations I do on it are simply too slow...
I have proceeded as follows: I have a binary class in which I create a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts, moves, and such.
My question is: is there a better way to do this? The speed is just killing me. I'm sure it would be fine if I only did this on one bit string, but I have to do the manipulations on 25 bit strings for well over 1000 generations.
What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.
Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)
Once you have that list, you can then do research on what data structures work well.
For example, suppose your list of operations is:
construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly
Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.
However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.
If I understood correctly, you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to int[] (or even long[]) and take advantage of bit operations on a whole machine word where possible.
For example, this would make shifts more efficient by roughly a factor of 4.
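As an illustration of word-level manipulation (my own sketch, not from the answer), here is a left shift over a ulong[] buffer where bit 0 lives in the lowest word:

static void ShiftLeft(ulong[] words, int shift)
{
    int wordShift = shift / 64, bitShift = shift % 64;
    // Walk from the top down so no source word is overwritten before it is read.
    for (int i = words.Length - 1; i >= 0; i--)
    {
        ulong value = 0;
        int src = i - wordShift;
        if (src >= 0)
        {
            value = words[src] << bitShift;
            // Pull in the bits that spill over from the next-lower word.
            if (bitShift > 0 && src > 0)
                value |= words[src - 1] >> (64 - bitShift);
        }
        words[i] = value;
    }
}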
Is the BitArray class no help?
A BitArray would probably be faster than a boolean array, but you would still not get built-in support for shifting 96000 bits.
GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?
Let me see if I understand: currently you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?
If it's DNA, you deal with 4 values: A, C, G, T. Those can be coded in 2 bits, making a byte able to hold 4 values. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.
Your two most expensive operations are converting to and from the compressed bitstream. I would recommend a byte-backed enum:
public enum DnaMarker : byte { A, C, G, T };
Then you can pack four of these into a byte:
public static byte ToByteCode(this DnaMarker[] markers)
{
    byte output = 0;
    // Each marker occupies two bits; shift previous bits up and OR in the next.
    for (int i = 0; i < 4; i++)
        output = (byte)((output << 2) | (byte)markers[i]);
    return output;
}
... and unpack them back out with something like this:
public static DnaMarker[] ToMarkers(this byte input)
{
    var result = new DnaMarker[4];
    // Undo the packing: marker 0 sits in the top two bits, marker 3 in the bottom.
    for (int i = 0; i < 4; i++)
        result[i] = (DnaMarker)((input >> (2 * (3 - i))) & 0x3);
    return result;
}
You might see a slight performance increase by using four parameters (out parameters if necessary) instead of allocating and using an array on the heap. But you would lose the loop that keeps the code compact.
Now, because you're packing markers into bytes four at a time, if your sequence length isn't always an exact multiple of four you'll end up "padding" the end of your last byte with zero values (A). Working around this is messy, but if you store a 32-bit integer recording the exact number of markers, you can simply discard anything extra you find in the byte stream.
From here, the possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each element, and likewise you can feed in a string and get an enum array back by iterating with Enum.Parse().
And always remember: unless memory is at a premium (usually it isn't), it's almost always faster to deal with data in an easily usable format rather than in the most compact one. The one big exception is network transmission; if you have to send 750 bytes vs. 12 KB over the Internet, there's an obvious advantage to the smaller size.

Random.Next() - finding the Nth .Next()

Given a consistently seeded Random:
Random r = new Random(0);
Calling r.Next() repeatedly produces the same series every time; so is there a way to quickly discover the N-th value in that series without calling r.Next() N times?
My scenario is a huge array of values created via r.Next(). The app occasionally reads a value from the array at arbitrary indexes. I'd like to optimize memory usage by eliminating the array and instead generating the values on demand. But brute-forcing r.Next() 5 million times to simulate the 5-millionth index of the array is more expensive than storing the array. Is it possible to shortcut your way to the N-th .Next() value, without looping, or with less of it?
I don't know the details of the PRNG used in the BCL, but my guess is that you will find it extremely difficult or impossible to find a nice closed-form solution for the N-th value of the series.
How about this workaround:
Make the seed to the random-number generator the desired index, and then take the first generated number. This is equally 'deterministic' and gives you a wide range to play with in O(1) space.
static int GetRandomNumber(int index)
{
    // Seeding with the index makes the result a pure function of the index.
    return new Random(index).Next();
}
In theory, if you knew the exact algorithm and the initial state, you'd be able to duplicate the series, but the end result would just be identical to calling r.Next().
Depending on how 'good' you need your random numbers to be, you might consider creating your own PRNG based on a linear congruential generator, which is relatively easy and fast to generate numbers with (and which permits jumping ahead, as sketched below). If you can live with a "bad" PRNG, there are likely other algorithms better suited to your purpose. Whether this would be faster or better than just storing a large array of numbers from r.Next() is another question.
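One convenient property of an LCG (also noted in the next answer) is that its n-th state has a closed form, so you can jump ahead in O(log n) instead of stepping n times. A minimal sketch using Knuth's MMIX constants; the class name and the Jump API are my own invention:

public sealed class SkippableLcg
{
    // Knuth's MMIX constants; the modulus is 2^64 via natural wraparound.
    const ulong A = 6364136223846793005UL;
    const ulong C = 1442695040888963407UL;
    ulong state;

    public SkippableLcg(ulong seed) { state = seed; }

    public ulong Next() => state = unchecked(A * state + C);

    // Advance n steps in O(log n): each step is the affine map x -> A*x + C,
    // so square-and-multiply the transform instead of iterating it.
    public void Jump(ulong n)
    {
        ulong aPow = 1, cSum = 0;   // identity transform
        ulong a = A, c = C;
        while (n > 0)
        {
            if ((n & 1) != 0)
            {
                aPow = unchecked(aPow * a);
                cSum = unchecked(cSum * a + c);
            }
            c = unchecked(c * (a + 1));  // square the transform: (a,c) -> (a^2, c*(a+1))
            a = unchecked(a * a);
            n >>= 1;
        }
        state = unchecked(aPow * state + cSum);
    }
}

With this, Jump(4999999) followed by a single Next() yields what the 5-millionth sequential Next() call would have returned.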
No, I don't believe there is. For some RNG algorithms (such as linear congruential generators) it's possible in principle to get the n'th value without iterating through n steps, but the Random class doesn't provide a way of doing that.
I'm not sure whether the algorithm it uses makes it possible in principle - it's a variant (details not disclosed in the documentation) of Knuth's subtractive RNG, and it seems like the original Knuth RNG should be equivalent to some sort of polynomial-arithmetic scheme that would allow access to the n-th value, but (1) I haven't actually checked that and (2) whatever tweaks Microsoft has made might break it.
If you have a good enough "scrambling" function f, then you can use f(0), f(1), f(2), ... as your sequence of random numbers instead of f(0), f(f(0)), f(f(f(0))), etc. (the latter being roughly what most RNGs do), and then it's trivial to start the sequence at any point you please. But you'll need to choose a good f, and it'll probably be slower than a standard RNG.
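A minimal sketch of that counter-based idea, using the SplitMix64 finalizer as the scrambling function f (my choice of mixer; the answer doesn't name one):

static ulong ScrambledValue(ulong seed, ulong index)
{
    // f(index): mix the counter so consecutive indexes give unrelated outputs.
    ulong z = unchecked(seed + index * 0x9E3779B97F4A7C15UL);
    z = unchecked((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL);
    z = unchecked((z ^ (z >> 27)) * 0x94D049BB133111EBUL);
    return z ^ (z >> 31);
}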
You could build your own on-demand dictionary of 'indexes' and 'random values'. This assumes that you will always 'demand' the indexes in the same order each time the program runs, or that you don't care whether the results are the same across runs.
Random rnd = new Random(0);
Dictionary<int, int> randomNumbers = new Dictionary<int, int>();

int getRandomNumber(int index)
{
    // Generate and cache on first request; later requests reuse the stored value.
    if (!randomNumbers.ContainsKey(index))
        randomNumbers[index] = rnd.Next();
    return randomNumbers[index];
}

What hash algorithm does .net utilise? What about java?

Regarding the HashTable (and its subsequent derivatives), does anyone know what hashing algorithm .NET and Java utilise?
Are List and Dictionary both direct descendants of Hashtable?
The hash function is not built into the hash table; the hash table invokes a method on the key object to compute the hash. So, the hash function varies depending on the type of key object.
In Java, a List is not a hash table (that is, it doesn't extend the Map interface). One could implement a List with a hash table internally (a sparse list, where the list index is the key into the hash table), but such an implementation is not part of the standard Java library.
I know nothing about .NET but I'll attempt to speak for Java.
In Java, the hash code is ultimately a combination of the code returned by a given object's hashCode() function, and a secondary hash function inside the HashMap/ConcurrentHashMap class (interestingly, the two use different functions). Note that Hashtable and Dictionary (the precursors to HashMap and AbstractMap) are obsolete classes. And a list is really just "something else".
As an example, the String class constructs a hash code by repeatedly multiplying the current code by 31 and adding in the next character. See my article on how the String hash function works for more information. Numbers generally use "themselves" as the hash code; other classes, e.g. Rectangle, that have a combination of fields often use a combination of the String technique of multiplying by a small prime number and adding in, but add in the various field values. (Choosing a prime number means you're unlikely to get "accidental interactions" between certain values and the hash code width, since they don't divide by anything.)
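For illustration, here is the 31-multiplier scheme rendered in C# (written for this explanation, not taken from either library):

static int JavaStyleStringHash(string s)
{
    unchecked
    {
        int h = 0;
        foreach (char c in s)
            h = 31 * h + c;   // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
        return h;
    }
}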
Since the hash table size-- i.e. the number of "buckets" it has-- is a power of two, a bucket number is derived from the hash code essentially by lopping off the top bits until the hash code is in range. The secondary hash function protects against hash functions where all or most of the randomness is in those top bits, by "spreading the bits around" so that some of the randomness ends up in the bottom bits and doesn't get lopped off. The String hash code would actually work fairly well without this mixing, but user-created hash codes may not work quite so well. Note that if two different hash codes resolve to the same bucket number, Java's HashMap implementations use the "chaining" technique-- i.e. they create a linked list of entries in each bucket. It's thus important for hash codes to have a good degree of randomness so that items don't cluster into a particular range of buckets. (However, even with a perfect hash function, you will still by law of averages expect some chaining to occur.)
Hash code implementations shouldn't be a mystery. You can look at the hashCode() source for any class you choose.
The HASHING algorithm is the algorithm used to determine the hash code of an item within the HashTable.
The HASHTABLE algorithm (which I think is what this person is asking about) is the algorithm the HashTable uses to organize its elements given their hash codes.
Java happens to use a chained hash table algorithm.
While looking for the same answer myself, I found this in .NET's reference source at http://referencesource.microsoft.com:
/*
Implementation Notes:
The generic Dictionary was copied from Hashtable's source - any bug
fixes here probably need to be made to the generic Dictionary as well.
This Hashtable uses double hashing. There are hashsize buckets in the
table, and each bucket can contain 0 or 1 element. We use a bit to mark
whether there's been a collision when we inserted multiple elements
(ie, an inserted item was hashed at least a second time and we probed
this bucket, but it was already in use). Using the collision bit, we
can terminate lookups & removes for elements that aren't in the hash
table more quickly. We steal the most significant bit from the hash code
to store the collision bit.
Our hash function is of the following form:
h(key, n) = h1(key) + n*h2(key)
where n is the number of times we've hit a collided bucket and rehashed
(on this particular lookup). Here are our hash functions:
h1(key) = GetHash(key); // default implementation calls key.GetHashCode();
h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));
The h1 can return any number. h2 must return a number between 1 and
hashsize - 1 that is relatively prime to hashsize (not a problem if
hashsize is prime). (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
If this is true, then we are guaranteed to visit every bucket in exactly
hashsize probes, since the least common multiple of hashsize and h2(key)
will be hashsize * h2(key). (This is the first number where adding h2 to
h1 mod hashsize will be 0 and we will search the same bucket twice).
We previously used a different h2(key, n) that was not constant. That is a
horrifically bad idea, unless you can prove that series will never produce
any identical numbers that overlap when you mod them by hashsize, for all
subranges from i to i+hashsize, for all i. It's not worth investigating,
since there was no clear benefit from using that hash function, and it was
broken.
For efficiency reasons, we've implemented this by storing h1 and h2 in a
temporary, and setting a variable called seed equal to h1. We do a probe,
and if we collided, we simply add h2 to seed each time through the loop.
A good test for h2() is to subclass Hashtable, provide your own implementation
of GetHash() that returns a constant, then add many items to the hash table.
Make sure Count equals the number of items you inserted.
Note that when we remove an item from the hash table, we set the key
equal to buckets, if there was a collision in this bucket. Otherwise
we'd either wipe out the collision bit, or we'd still have an item in
the hash table.
--
*/
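To make the probing scheme concrete, here is a minimal sketch of the h(key, n) = h1(key) + n*h2(key) loop described in that comment (an illustration of the technique, not the actual Hashtable source; the occupied array stands in for the real bucket storage):

static int FindSlot(int hashCode, bool[] occupied)
{
    int hashsize = occupied.Length;
    long h1 = hashCode & 0x7FFFFFFF;                  // the sign bit is reserved for the collision flag
    long h2 = 1 + ((h1 >> 5) + 1) % (hashsize - 1);   // always in [1, hashsize - 1]
    for (long n = 0; n < hashsize; n++)
    {
        int bucket = (int)((h1 + n * h2) % hashsize); // h(key, n) = h1(key) + n*h2(key)
        if (!occupied[bucket])
            return bucket; // with hashsize prime, every bucket is visited exactly once
    }
    return -1; // table full
}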
Anything purporting to be a HashTable or something like it in .NET does not implement its own hashing algorithm: it always calls the hashed object's GetHashCode() method.
There is a lot of confusion, though, as to what this method does or is supposed to do, especially concerning user-defined or otherwise custom classes that override the base Object implementation.
For .NET, you can use Reflector to see the various algorithms. There is a different one for the generic and non-generic hash table, plus of course each class defines its own hash code formula.
The .NET Dictionary<T> class uses an IEqualityComparer<T> to compute hash codes for keys and to perform comparisons between keys in order to do hash lookups.
If you don't provide an IEqualityComparer<T> when constructing the Dictionary<T> instance (it's an optional argument to the constructor) it will create a default one for you, which uses the object.GetHashCode and object.Equals methods by default.
As for how the standard GetHashCode implementation works, I'm not sure it's documented. For specific types you can read the source code for the method in Reflector or try checking the Rotor source code to see if it's there.
