What would be a good hashCode for a DateRange class - c#

I have the following class
public class DateRange
{
    private DateTime startDate;
    private DateTime endDate;
    public override bool Equals(object obj)
    {
        // 'as' yields null for a null or non-DateRange argument instead of throwing
        DateRange other = obj as DateRange;
        if (other == null)
            return false;
        if (startDate != other.startDate)
            return false;
        if (endDate != other.endDate)
            return false;
        return true;
    }
    ...
}
I need to store some values in a dictionary keyed with a DateRange like:
Dictionary<DateRange, double> tddList;
How should I override the GetHashCode() method of DateRange class?

I use this approach from Effective Java for combining hashes:
unchecked
{
    int hash = 17;
    hash = hash * 31 + field1.GetHashCode();
    hash = hash * 31 + field2.GetHashCode();
    ...
    return hash;
}
There's no reason that shouldn't work fine in this situation.
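Applied to the class in the question, the pattern might look like the sketch below. The constructor, the sealed/readonly modifiers and the null-safe Equals are assumptions added so the example compiles on its own; only the two fields and the 17/31 combining step come from the posts above.

```csharp
using System;

// Illustrative version of the question's class; the constructor is an assumption.
public sealed class DateRange
{
    private readonly DateTime startDate;
    private readonly DateTime endDate;

    public DateRange(DateTime startDate, DateTime endDate)
    {
        this.startDate = startDate;
        this.endDate = endDate;
    }

    public override bool Equals(object obj)
    {
        // 'as' returns null for null or foreign types, so no cast exception is possible.
        var other = obj as DateRange;
        return other != null
            && startDate == other.startDate
            && endDate == other.endDate;
    }

    public override int GetHashCode()
    {
        unchecked // let the multiplications wrap around instead of throwing
        {
            int hash = 17;
            hash = hash * 31 + startDate.GetHashCode();
            hash = hash * 31 + endDate.GetHashCode();
            return hash;
        }
    }
}
```

With this in place, two DateRange instances built from the same dates find the same entry in a Dictionary<DateRange, double>. A zero-length range (start equal to end) also gets a non-zero hash code here, because the result works out to 16337 + 32 * h modulo 2^32, which is always odd.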

It depends on the values I expect to see it used with.
If it was most often going to have different day values, rather than different times on the same day, and they were within a century of now, I would use:
unchecked
{
    int hash = startDate.Year + endDate.Year - 4007;
    hash = hash * 367 + startDate.DayOfYear; // multiply first, then add the start day
    return hash * 367 + endDate.DayOfYear;
}
This distributes the bits well with the expected values, while reducing the number of bits lost in the shifting. Note that while there are cases where dependency on primes can be surprisingly bad at collisions (esp. when the hash is fed into something that uses a modulo of the same prime in trying to avoid collisions when producing a yet-smaller hash to distribute among its buckets), I've opted to go for primes above the more obvious choices (367 is the first prime above the 366 possible values of DayOfYear), as they're only just above and so still pretty "tight" for bit-distribution. I don't worry much about using the same prime twice, as they're so "tight" in this way, but it does hurt if you've a hash-based collection with 367 buckets. This deals well (but not as well) with dates well into the past or future, but is dreadful if the assumption that there will be few or no ranges within the same day (differing in time) is wrong, as that information is entirely lost.
If I was expecting ranges that differ only in time of day (or writing for general use by other parties, and not able to assume otherwise), I'd go for:
int startHash = startDate.GetHashCode();
// reverse the byte order of the start hash, then XOR it with the end hash
return (((startHash >> 24) & 0x000000FF)
    | ((startHash >> 8) & 0x0000FF00)
    | ((startHash << 8) & 0x00FF0000)
    | (unchecked((int)((startHash << 24) & 0xFF000000)))) ^ endDate.GetHashCode();
Where the first method works on the assumption that the general-purpose GetHashCode in DateTime isn't as good as we want, this one depends on it being good, but mixes around the bits of one value.
It's good in dealing with the more obvious tricky cases such as the two values being the same, or a common distance from each other (e.g. lots of 1-day or 1-hour ranges). It's not as good at the cases where the first example works best, but the first one totally sucks if there are lots of ranges using the same day, but different times.
Edit: To give a more detailed response to Dour's concern:
Dour points out, correctly, that some of the answers on this page lose data. The fact is, all of them lose data.
The class defined in the question has 8.96077483×10^37 different valid states (or 9.95641648×10^36 if we don't care about the DateTimeKind of each date), and the output of GetHashCode has 4294967296 possible states (one of which - zero - is also going to be used as the hashcode of a null value, which may be commonly compared with in real code). Whatever we do, we reduce information by a scale of 2.31815886×10^27. That's a lot of information we lost!
It's likely true that we can lose more with some than in others. Certainly, it's easy to prove some solutions can lose more than others by writing a valid, but really poor, answer.
(The worst possible valid solution is return 0; which is valid as it never errors or mismatches on equal objects, but as poor as possible as it collides for all values. The performance of a hash-based collection becomes O(n), and slow as O(n) goes, as the constants involved are higher than such O(n) operations as searching an unordered list).
It's difficult to measure just how much is lost. How much more does shifting of some bits before XORing lose than swapping bits, considering that XOR halves the amount of information left? Even the naïve x ^ y doesn't lose more than a swap-and-xor, it just collides more on common values; swap-and-xor will collide on values where plain-xor does not.
Once we've got a choice between solutions that are not losing much more information than is unavoidable, but returning 4294967296 or close to 4294967296 possible values with a good distribution between those values, then the question is no longer how much information is lost (the answer being that only 4.31376821×10^-28 of the original information remains) but which information is lost.
This is why my first suggestion above ignores time components. There are 864000000000 "ticks" (the 100-nanosecond units DateTime has a resolution of) in a day, and I throw away two chunks of those ticks (7.46496×10^23 possible values between the two) on purpose because I'm thinking of a scenario where that information is not used anyway. In this case I've deliberately structured the mechanism in such a way as to pick which information gets lost. That improves the hash for a given situation, but makes it absolutely worthless if we had different values all with start and end dates happening on the same days but at different times.
Likewise x ^ y doesn't lose any more information than any of the others, but the information that it does lose is more likely to be significant than with other choices.
In the absence of any way to predict which information is likely to be of importance (esp. if your class will be public and its hash code used by external code), then we are more restricted in the assumptions we can safely make.
As a whole prime-mult or prime-mod methods are better in which information they lose than shift-based methods, except when the same prime is used in a further hashing that may take place inside a hash-based method, ironically with the same goal in mind (no number is relatively prime to itself! even primes) in which case they are much worse. On the other hand shift-based methods really fall down if fed into a shift-based further hash. There is no perfect hash for arbitrary data and arbitrary use (except when a class has few valid values and we match them all, in which case it's more strictly an encoding than a hash that we produce).
In short, you're going to lose information whatever you do, it's which you lose that's important.

Well, consider what characteristics a good hash function should have. It must:
be in agreement with Equals - that is, if Equals is true for two objects then the two hash codes have to also be the same.
never crash
And it should:
be very fast
give different results for similar inputs
What I would do is come up with a very simple algorithm; say, taking 16 bits from the hash code of the first and 16 bits from the hash code of the second, and combining them together. Make yourself a test case of representative samples; date ranges that are likely to be actually used, and see if this algorithm does give a good distribution.
A common choice is to xor the two hashes together. This is not necessarily a good idea for this type because it seems likely that someone will want to represent the zero-length range that goes from X to X. If you xor the hashes of two equal DateTimes you always get zero, which seems like a recipe for a lot of hash collisions.
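The advice to test candidate algorithms against representative samples can be sketched as a small harness like the one below. The class and method names are made up for illustration, and SplitHash is only one possible reading of "16 bits from each hash code". The harness counts distinct hash codes over a sample; it does not claim any particular distribution for DateTime.GetHashCode.

```csharp
using System;
using System.Collections.Generic;

public static class RangeHashExperiment
{
    // Plain XOR of the two DateTime hash codes.
    public static int XorHash(DateTime start, DateTime end)
    {
        return start.GetHashCode() ^ end.GetHashCode();
    }

    // Low 16 bits of the start hash in the upper half,
    // low 16 bits of the end hash in the lower half.
    public static int SplitHash(DateTime start, DateTime end)
    {
        return (start.GetHashCode() << 16) | (end.GetHashCode() & 0xFFFF);
    }

    // Counts how many distinct hash codes a candidate produces for a sample.
    public static int CountDistinct(
        IEnumerable<Tuple<DateTime, DateTime>> sample,
        Func<DateTime, DateTime, int> hash)
    {
        var seen = new HashSet<int>();
        foreach (var range in sample)
            seen.Add(hash(range.Item1, range.Item2));
        return seen.Count;
    }

    // Sample of zero-length ranges (start == end), one per day.
    public static List<Tuple<DateTime, DateTime>> ZeroLengthRanges(int count)
    {
        var sample = new List<Tuple<DateTime, DateTime>>();
        var firstDay = new DateTime(2020, 1, 1);
        for (int i = 0; i < count; i++)
        {
            DateTime moment = firstDay.AddDays(i);
            sample.Add(Tuple.Create(moment, moment));
        }
        return sample;
    }
}
```

For the zero-length sample, CountDistinct with XorHash always reports 1, since every range hashes to zero. Running the same sample, and samples closer to real usage, through SplitHash or any other candidate shows whether it spreads those inputs any better.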

You have to shift one end of the range, otherwise two equal dates will hash to zero, a pretty common scenario I imagine:
return startDate.GetHashCode() ^ (endDate.GetHashCode() << 4);

return startDate.GetHashCode() ^ endDate.GetHashCode();
might be a good start. You have to check that you get good distribution when there is equal distance between startDate and endDate, but different dates.

Related

Create a identical "random" float based on multiple data

I'm working on a game (Unity) and I need to create a random float value (between 0 and 1) based on multiple int and/or float.
I think it'll be easier to manually create a single string for the function, but maybe it could accept a list of int and/or float.
Example of result:
"[5-91]-52-1" > 0.158756..
Important points:
The distribution of results (between 0 and 1) must be uniform (I don't want 90% of results between 0.45 and 0.55)
Asking twice for the same string must return the exact same result (even if I reload the app, or start it on different computers, ..)
Results have no need to be unique.
Bonus Point:
Sometimes I need similar strings to return close results, but not every time. Is it possible for the "random generation" to take a boolean that switches this feature on?
What you've described is essentially the definition of a hash function.
So just use one and normalize the results into the range you want. The most basic case can use GetHashCode, but it is not guaranteed to produce the same results across different versions of the framework.
A stable version that guarantees exactly the same results across machines would use a well-known good hash, such as the cryptographic hash SHA256: take the first several bytes of the result as an integer and normalize. Cryptographic hash functions also conveniently take byte arrays as input, so you can combine multiple values as bytes directly and get a stable result.
var intValue = 42;
var bytesToHash = BitConverter.GetBytes(intValue);
byte[] hash;
// SHA256.Create() returns a disposable hasher, so release it when done
using (var sha256 = System.Security.Cryptography.SHA256.Create())
{
    hash = sha256.ComputeHash(bytesToHash);
}
var toNormalize = BitConverter.ToUInt32(hash, 0);
var fancyRandom = (double)toNormalize / UInt32.MaxValue;
To combine multiple values into byte array you can either manually combine results of BitConverter.GetBytes or use BinaryWriter on MemoryStream.
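A minimal sketch of the BinaryWriter route follows. StableRandom and FromValues are hypothetical names, and accepting only int values is a simplification. BinaryWriter always writes little-endian, and the first four hash bytes are assembled by hand rather than with BitConverter, so the result should not depend on the byte order of the machine.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class StableRandom
{
    // Hypothetical helper: maps a list of ints to a repeatable value in [0, 1].
    public static double FromValues(params int[] values)
    {
        byte[] bytes;
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            foreach (int value in values)
                writer.Write(value); // four bytes per value, little-endian
            writer.Flush();
            bytes = stream.ToArray();
        }

        byte[] hash;
        using (var sha = SHA256.Create())
            hash = sha.ComputeHash(bytes);

        // Build a uint from the first four hash bytes in a fixed order.
        uint n = (uint)hash[0]
            | ((uint)hash[1] << 8)
            | ((uint)hash[2] << 16)
            | ((uint)hash[3] << 24);
        return (double)n / uint.MaxValue;
    }
}
```

Calling StableRandom.FromValues(5, 91, 52, 1) twice returns the same number, and the order of the arguments matters because it changes the bytes that are hashed.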
Alternatively you can use the resulting integer as the seed for some custom implementation of a pseudo-random generator (as the one in .NET does not guarantee the same results across machines or versions of .NET), as suggested in the comments, but I don't think it will give a significantly better distribution.
Note: make sure the resulting numbers are distributed "randomly enough" for your case. Cryptographic hash functions likely give the result you want, but I'm not sure how to prove that.
For the "bonus" part: I would be very surprised if you could find a pseudo-random generator that consistently produces close results for "similar" seeds. Instead you can use the same approach as above for separate parts, one that stays the same and another that handles variation (i.e. intValue & 0xFFFFFF00 for the stable part, intValue & 0xFF for the "small difference"), and then combine the resulting "random" numbers with some weight: randomFromStable + 0.05 * randomFromDifference.
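The weighted combination for the bonus part could be sketched as follows. SimilarityAwareRandom, Unit and FromValue are hypothetical names. Dividing by 1.05 is an addition of this sketch, made so that the result stays inside [0, 1]. The sum of two uniform values is no longer exactly uniform, so the uniform-distribution requirement is only approximately met.

```csharp
using System;
using System.Security.Cryptography;

public static class SimilarityAwareRandom
{
    // Maps one int to a repeatable value in [0, 1] via the first four bytes of its SHA256 hash.
    private static double Unit(int value)
    {
        byte[] bytes;
        unchecked
        {
            // Explicit little-endian bytes, so every machine hashes the same input.
            bytes = new byte[]
            {
                (byte)value, (byte)(value >> 8), (byte)(value >> 16), (byte)(value >> 24)
            };
        }

        byte[] hash;
        using (var sha = SHA256.Create())
            hash = sha.ComputeHash(bytes);

        uint n = (uint)hash[0]
            | ((uint)hash[1] << 8)
            | ((uint)hash[2] << 16)
            | ((uint)hash[3] << 24);
        return (double)n / uint.MaxValue;
    }

    // Values sharing their upper 24 bits get results at most 0.05 / 1.05 apart.
    public static double FromValue(int value)
    {
        double fromStable = Unit(value & ~0xFF);     // upper 24 bits: the stable part
        double fromDifference = Unit(value & 0xFF);  // lower 8 bits: the small difference
        return (fromStable + 0.05 * fromDifference) / 1.05; // rescaled into [0, 1]
    }
}
```

For example, FromValue(0x1200) and FromValue(0x12FF) share the stable part and therefore differ by less than 0.048, while FromValue(0x1300) may land anywhere in the range.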
I would suggest using the hashcode (or something similar) as the seed to a Random object. Hashcodes must be the same for the same string so you will always get the same sequence back.
As Nuf notes, hashcodes are only guaranteed to be the same in the same app-domain; so it may not work across restarts.
As to your bonus point, getting there without writing your own RNG will be hard. Any variance in the seed can and should cause a lot of variation in the resulting sequence.

Applying Rabin-Karp Hash for large N

I refer to the Rabin Karp Wikipedia article on Hash use.
In the example, the string "hi" is hashed using a prime number 101 as the base.
hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
Can such an algorithm be used practically in Java or C# where long has a maximum value of 9,223,372,036,854,775,807? Naively, to me it seems that the hash value grows exponentially and with a large enough N (being string length) will result in overflow of the long type. For example, say I have 65 characters in my string input for the hash?
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?
hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
That's only half the truth. In reality, if you would actually compute the value s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n, the result would be a number whose representation would be about as long as the string itself, so you haven't gained anything. So what you actually do is to compute
(s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n) mod M
where M is reasonably small. Thus your hash value will always be smaller than M.
So what you do in practice is choose M = 2^64 and make use of the fact that unsigned integer overflow is well-defined in most programming languages. In fact, multiplication and addition of 64-bit integers in Java, of unsigned 64-bit integers in C++, and of 64-bit integers in an unchecked context in C# are equivalent to multiplication and addition modulo 2^64.
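In C# the modulo-2^64 variant can be sketched as below, using Horner's rule so that no power is ever computed in full. PolynomialHash and its members are illustrative names, and the digit order follows the question's example (the first character gets the highest power), which reproduces hash("hi") = 10609. The unchecked blocks make the wrap-around explicit even if the project is compiled with overflow checking.

```csharp
using System;

public static class PolynomialHash
{
    public const ulong Base = 101;

    // Horner's rule; every step wraps modulo 2^64 instead of overflowing.
    public static ulong Compute(string text)
    {
        ulong hash = 0;
        unchecked
        {
            foreach (char c in text)
                hash = hash * Base + c;
        }
        return hash;
    }

    // Base^(length - 1) modulo 2^64: the weight of the first character in a window.
    public static ulong HighPower(int length)
    {
        ulong power = 1;
        unchecked
        {
            for (int i = 1; i < length; i++)
                power *= Base;
        }
        return power;
    }

    // Slides the window one character to the right in constant time.
    public static ulong Roll(ulong hash, char outgoing, char incoming, ulong highPower)
    {
        unchecked
        {
            return (hash - outgoing * highPower) * Base + incoming;
        }
    }
}
```

A 65-character input is no problem: Compute simply returns the true value reduced modulo 2^64. Rolling from "hi" to "it" inside "hit" is Roll(Compute("hi"), 'h', 't', HighPower(2)), which equals Compute("it").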
It's not necessarily a wise choice to use 2^64 as the modulus. In fact you can easily construct a string with lots of collisions, thus provoking the worst case behaviour of Rabin-Karp, which is Ω(n * m) matching instead of O(n + m).
It would be better to use a large prime as the modulus and get much better collision resistance. The reason why this is usually not done is performance: we would need to explicitly use modular reduction (add a % M) on every addition and multiplication. What's worse, we can't even use the builtin multiplication anymore, because it could overflow if M > 2^32. So we need a custom MultiplyMod function, which is bound to be a lot slower than machine-level multiplication.
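One way to write such a MultiplyMod is the double-and-add loop sketched below, which needs only additions and therefore never overflows as long as the modulus stays below 2^63. The choice of 2^61 - 1 (a Mersenne prime) as the modulus and the names ModularHash and Compute are assumptions of this sketch. Faster approaches exist, for example 128-bit intermediate products where the platform offers them.

```csharp
using System;

public static class ModularHash
{
    // Assumption: any modulus passed in is below 2^63, so a + b cannot overflow a ulong.
    public const ulong Modulus = 2305843009213693951UL; // 2^61 - 1, a Mersenne prime

    // Computes (a * b) % m by double-and-add, one bit of b per iteration.
    public static ulong MultiplyMod(ulong a, ulong b, ulong m)
    {
        a %= m;
        b %= m;
        ulong result = 0UL;
        while (b > 0UL)
        {
            if ((b & 1UL) == 1UL)
                result = (result + a) % m;
            a = (a + a) % m;
            b >>= 1;
        }
        return result;
    }

    // Polynomial hash with an explicit reduction after every step.
    public static ulong Compute(string text, ulong p = 101UL)
    {
        ulong hash = 0UL;
        foreach (char c in text)
            hash = (MultiplyMod(hash, p, Modulus) + c) % Modulus;
        return hash;
    }
}
```

The loop runs up to 61 iterations per multiplication with this modulus, which illustrates the performance cost described above compared with a single machine multiplication.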
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?
As I already mentioned, if you don't reduce using a modulus, your hash value will grow as large as the string itself, thus rendering it useless to use a hash function in the first place. So yes, using controlled overflow modulo 2^64 is correct and even necessary if we don't manually reduce.
If your goal is a type of storage which contains only "small" number,
but where the sum can be compared:
You could view this simply as 101 - number system,
like 10=decimal, 16=hex. and so on.
Ie.
a) You have to store a set of { ascii value and its 101-power }
(without possibility for multiple entries with the same power).
b) When creating the data from a string,
values >= 101 have to be carried over to the next power.
Example 1:
"a" is 97*101^0
(trivial)
Example 2:
"g" is 1*101^1 + 2*101^0
because g is 103. 103>=101 ie. take only 103%101 for 101^0
(modulo, remainder of division)
and (int)(103/101) for the next power.
(if the ascii numbers could be higher or the prime number is lower than 101
it could be possible that (int)(103/101) would exceed the prime number too.
In this case, it would continue to prime^2 and so on, until the value is smaller
than the prime number)
Example 3:
"ag" is 98*101^1 + 2*101^0
Compared to above, 97*101^1 is added because of a.
and so on...
To compare without calculating the full sum,
just compare the values of one power to each other, for each power.
Equal if all "power values" are the same.
Side note: Be aware that ^ is not exponentiation in languages like C# and Java.

Why do "int" and "sbyte" GetHashCode functions generate different values?

We have the following code:
int i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 1
This makes sense, and the same happens with all integral types in C# except sbyte and short.
That is:
sbyte i = 1;
Console.WriteLine(i.GetHashCode()); // outputs => 257
Why is this?
Because the source of that method (SByte.GetHashCode) is
public override int GetHashCode()
{
    return (int)this ^ ((int)this << 8);
}
As for why, well someone at Microsoft knows that..
Yes, it's all about value distribution. As the return type of GetHashCode is int, the hash codes of non-negative sbyte values are spread out in intervals of 257. For the same reason, the long type will necessarily have collisions.
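The arithmetic behind the 257 can be checked with a standalone copy of the quoted formula. SByteHashFormula is a hypothetical helper. It deliberately does not call SByte.GetHashCode, whose implementation may differ between runtime versions.

```csharp
using System;

public static class SByteHashFormula
{
    // Re-implements the quoted expression: the value XORed with itself shifted left by 8 bits.
    public static int Hash(sbyte value)
    {
        return (int)value ^ ((int)value << 8);
    }
}
```

For a non-negative value x the two operands share no set bits, so the XOR equals x + 256 * x = 257 * x: Hash(1) is 257, Hash(2) is 514 and Hash(127) is 32639. Negative values are sign-extended first, so the pattern differs there; Hash(-1) is 255.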
The reason is that it is probably done to avoid clustering of hash values.
As GetHashCode documentation says:
For the best performance, a hash function must generate a random
distribution for all input.
Providing a good hash function on a class can significantly affect the
performance of adding those objects to a hash table. In a hash table with
a good implementation of a hash function, searching for an element takes
constant time (for example, an O(1) operation).
Also, as this excellent article explains:
Guideline: the distribution of hash codes must be "random"
By a "random distribution" I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be "clustered"; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.

C# random number generator

I'm looking for a random number generator that always produces the same "random" number for a given seed. The seed is defined by x + (y << 16), where x and y are positions on a heightmap.
I could create a new instance of System.Random every time with my seed, but that's a lot of GC pressure. Especially since this will be called a lot of times.
EDIT:
"A lot" means half a million times.
Thanks to everyone that answered! I know I was unclear, but I learned here that a hash function is exactly what I want.
Since a hash function is apparently closer to what you want, consider a variation of the following:
int Hash(int n) {
    const int prime = 1031;
    // mask the high half so that a negative n cannot produce a negative result
    return ((n & 0xFFFF) * prime % 0xFFFF) ^ ((n >> 16) & 0xFFFF);
}
This XORs the least significant two bytes with the most significant two bytes of a four-byte number after shuffling the least significant two bytes around a little bit by multiplication with a prime number. The result is thus in the range 0 <= result < 0x10000 (i.e. it fits in a UInt16).
This should “shuffle” the input number a bit, reliably produces the same value for the same input and looks “random”. Now, I haven’t done a stochastic analysis of the distribution and if ever a statistician was to look at it, he would probably go straight into anaphylactic shock. (In fact, I have really written this implementation off the top of my head.)
If you require something less half-baked, consider using an established check sum (such as CRC32).
I could create a new instance of System.Random every time with my seed
Do that.
but that's a lot of GC pressure. Especially since this will be called a lot of times.
How many times do you call it? Does it verifiably perform badly? Notice, the GC is optimized to deal with lots of small objects with short life time. It should deal with this easily.
And, what would be the alternative that takes a seed but doesn’t create a new instance of some object? That sounds rather like a badly designed class, in fact.
See Simple Random Number Generation for C# source code. The state is just two unsigned integers, so it's easy to keep up with between calls. And the generator passes standard tests for quality.
What about storing a Dictionary<int, int> that provides the first value returned by a new Random object for a given seed?
class RandomSource
{
    Dictionary<int, int> _dictionary = new Dictionary<int, int>();
    public int GetValue(int seed)
    {
        int value;
        if (!_dictionary.TryGetValue(seed, out value))
        {
            value = _dictionary[seed] = new Random(seed).Next();
        }
        return value;
    }
}
This incurs the GC pressure of constructing a new Random instance the first time you want a value for a particular seed, but every subsequent call with the same seed will retrieve a cached value instead.
I don't think a "random number generator" is actually what you're looking for. Simply create another map and pre-populate it with random values. If your current heightmap is W x H, the simplest solution would be to create a W x H 2D array and just fill each element with a random value using System.Random. You can then look up the pre-populated random value for a particular (x, y) coordinate whenever you need it.
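A minimal sketch of the pre-populated map, assuming the values should be doubles in [0, 1) and that one seed for the whole map is acceptable. RandomMap and Create are illustrative names.

```csharp
using System;

public static class RandomMap
{
    // Fills a width x height grid from a single seeded System.Random instance.
    public static double[,] Create(int width, int height, int seed)
    {
        var random = new Random(seed);
        var map = new double[width, height];
        for (int x = 0; x < width; x++)
        {
            for (int y = 0; y < height; y++)
            {
                map[x, y] = random.NextDouble(); // value in [0, 1)
            }
        }
        return map;
    }
}
```

Only one Random object is allocated for the whole map, and each later lookup is a plain array access such as map[x, y]. The same seed reproduces the same map within one runtime version; System.Random does not promise identical sequences across .NET versions.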
Alternatively, if your current heightmap actually stores some kind of data structure, you could modify that to store the random value in addition to the height value.
A side benefit that this has is that later, if you need to, you can perform operations over the entire "random" map to ensure that it has certain properties. For example, depending on the context (is this for a game?) you may find later that you want to smooth the randomness out across the map. This is trivial if you precompute and store the values as I've described.
CSharpCity provides source to several random number generators. You'll have to experiment to see whether these have less impact on performance than System.Random.
ExtremeOptimization offers a library with several generators. They also discuss quality and speed of the generators and compare against System.Random.
Finally, what do you mean by GC pressure? Do you really mean memory pressure, which is the only context I've seen it used in? The job of the GC is to handle the creation and destruction of gobs of objects very efficiently. I'm concerned that you're falling for the premature optimization temptation. Perhaps you can create a test app that gives some cold, hard numbers.

Should I use byte or int?

I recall having read somewhere that it is better (in terms of performance) to use Int32, even if you only require Byte. It applies (supposedly) only to cases where you do not care about the storage. Is this valid?
For example, I need a variable that will hold a day of week. Do I
int dayOfWeek;
or
byte dayOfWeek;
EDIT:
Guys, I am aware of DayOfWeek enum. The question is about something else.
Usually yes, a 32 bit integer will perform slightly better because it is already properly aligned for native CPU instructions. You should only use a smaller sized numeric type when you actually need to store something of that size.
You should use the DayOfWeek enum, unless there's a strong reason not to.
DayOfWeek day = DayOfWeek.Friday;
To explain, since I was downvoted:
The correctness of your code is almost always more critical than the performance, especially in cases where we're talking this small of a difference. If using an enum or a class representing the semantics of the data (whether it's the DayOfWeek enum, or another enum, or a Gallons or Feet class) makes your code clearer or more maintainable, it will help you get to the point where you can safely optimize.
int z;
int x = 3;
int y = 4;
z = x + y;
That may compile. But there's no way to know if it's doing anything sane or not.
Gallons z;
Gallons x = new Gallons(3);
Feet y = new Feet(4);
z = x + y;
This won't compile, and even looking at it it's obvious why not - adding Gallons to Feet makes no sense.
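For that snippet to behave as described, Gallons and Feet have to be separate types whose + operator only accepts their own kind. The definitions below are one possible sketch; the type names come from the snippet above, while the members are assumptions.

```csharp
using System;

// Hypothetical unit type: a volume in gallons.
public struct Gallons
{
    public readonly double Value;

    public Gallons(double value)
    {
        Value = value;
    }

    // Only Gallons + Gallons is defined, so mixing units is a compile-time error.
    public static Gallons operator +(Gallons left, Gallons right)
    {
        return new Gallons(left.Value + right.Value);
    }
}

// Hypothetical unit type: a length in feet.
public struct Feet
{
    public readonly double Value;

    public Feet(double value)
    {
        Value = value;
    }

    public static Feet operator +(Feet left, Feet right)
    {
        return new Feet(left.Value + right.Value);
    }
}
```

With these definitions new Gallons(3) + new Gallons(4) yields a Gallons with Value 7, while new Gallons(3) + new Feet(4) is rejected by the compiler because no matching operator exists.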
My default position is to try to use strong types to add constraints to values - where you know those in advance. Thus in your example, it may be preferable to use byte dayOfWeek because it is closer to your desired value range.
Here is my reasoning; with the example of storing and passing a year-part of a date. The year part - when considering other parts of the system that include SQL Server DateTimes, is constrained to 1753 through to 9999 (note C#'s possible range for DateTime is different!) Thus a short covers my possible values and if I try to pass anything larger the compiler will warn me before the code will compile. Unfortunately, in this particular example, the C# DateTime.Year property will return an int - thus forcing me to cast the result if I need to pass e.g. DateTime.Now.Year into my function.
This starting-position is driven by considerations of long-term storage of data, assuming 'millions of rows' and disk space - even though it is cheap (it is far less cheap when you are hosted and running a SAN or similar).
In another DB example, I will use smaller types such as byte (SQL Server tinyint) for lookup ID's where I am confident that there will not be many lookup types, through to long (SQL Server bigint) for id's where there are likely to be more records. i.e. to cover transactional records.
So my rules of thumb:
Go for correctness first if possible. Use DayOfWeek in your example, of course :)
Go for a type of appropriate size thus making use of compiler safety checks giving you errors at the earliest possible time;
...but offset against extreme performance needs and simplicity, especially where long-term storage is not involved, or where we are considering a lookup (low row count) table rather than a transactional (high row count) one.
In the interests of clarity, DB storage tends not to shrink as quickly as you expect by shrinking column types from bigint to smaller types. This is both because of padding to word boundaries and page-size issues internal to the DB. However, you probably store every data item several times in your DB, perhaps through storing historic records as they change, and also keeping the last few days of backups and log backups. So saving a few percent of your storage needs will have long term savings in storage cost.
I have never personally experienced issues where the in-memory performance of bytes vs. ints has been an issue, but I have wasted hours and hours having to reallocate disk space and have live servers entirely stall because there was no one person available to monitor and manage such things.
Use an int. Computer memory is addressed by "words," which are usually 4 bytes long. What this means is that if you want to get one byte of data from memory, the CPU has to retrieve the entire 4-byte word from RAM and then perform some extra steps to isolate the single byte that you're interested in. When thinking about performance, it will be a lot easier for the CPU to retrieve a whole word and be done with it.
Actually in all reality, you won't notice any difference between the two as far as performance is concerned (except in rare, extreme circumstances). That's why I like to use int instead of byte, because you can store bigger numbers with pretty much no penalty.
In terms of storage amount, use byte and in terms of cpu performance, use int.
System.DayOfWeek
MSDN
Most of the time use int. Not for performance but simplicity.
