Applying Rabin-Karp Hash for large N

I am referring to the Wikipedia article on Rabin-Karp and its use of hashing.
In the example, the string "hi" is hashed using a prime number 101 as the base.
hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
Can such an algorithm be used practically in Java or C# where long has a maximum value of 9,223,372,036,854,775,807? Naively, to me it seems that the hash value grows exponentially and with a large enough N (being string length) will result in overflow of the long type. For example, say I have 65 characters in my string input for the hash?
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?

hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
That's only half the truth. In reality, if you actually computed the value s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n, the result would be a number whose representation is about as long as the string itself, so you wouldn't have gained anything. So what you actually do is compute
(s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n) mod M
where M is reasonably small. Thus your hash value will always be smaller than M.
So what you do in practice is choose M = 2^64 and make use of the fact that unsigned integer overflow is well-defined in most programming languages. In fact, multiplication and addition of 64-bit integers in Java and C# (in the default unchecked context), and of unsigned 64-bit integers in C++, is equivalent to multiplication and addition modulo 2^64.
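As a minimal sketch of the wraparound approach in C#, the hash from the question can be computed with Horner's rule on a ulong. The class and method names are made up for illustration, and the base 101 is simply the one from the question.

```csharp
using System;

public static class WrapHash
{
    // Polynomial hash with base 101, implicitly reduced modulo 2^64
    // because ulong arithmetic wraps around in an unchecked context.
    public static ulong Hash(string s)
    {
        const ulong Base = 101; // base taken from the question's example
        ulong h = 0;
        unchecked // make the wraparound explicit even if the project enables overflow checks
        {
            foreach (char c in s)
            {
                h = h * Base + c; // Horner's rule: the first character gets the highest power
            }
        }
        return h;
    }
}
```

For "hi" this gives 104 * 101 + 105 = 10609, as in the question. A 65-character input simply wraps around instead of throwing.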
It's not necessarily a wise choice to use 2^64 as the modulus. In fact you can easily construct a string with lots of collisions, thus provoking the worst case behaviour of Rabin-Karp, which is Ω(n * m) matching instead of O(n + m).
It would be better to use a large prime as the modulus and get much better collision resistance. The reason why this is usually not done is performance: we would need to explicitly use modular reduction (add a % M) after every addition and multiplication. What's worse, we can't even use the builtin multiplication anymore, because it could overflow if M > 2^32. So we need a custom MultiplyMod function, which is bound to be a lot slower than machine-level multiplication.
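A sketch of the prime-modulus variant, assuming a modulus below 2^32 so that every intermediate product still fits in a ulong and no custom MultiplyMod is needed. The modulus 1000000007 and all names here are illustrative choices, not something from the question or the article.

```csharp
using System;

public static class PrimeModHash
{
    public const ulong Modulus = 1000000007; // a prime below 2^32 (assumed choice)
    public const ulong Base = 101;           // base from the question's example

    // Hash of a whole string, reduced after every step.
    public static ulong Hash(string s)
    {
        ulong h = 0;
        foreach (char c in s)
        {
            h = (h * Base + c) % Modulus;
        }
        return h;
    }

    // Base^(length - 1) mod Modulus: the weight of the leading character.
    public static ulong LeadingPower(int length)
    {
        ulong p = 1;
        for (int i = 1; i < length; i++)
        {
            p = (p * Base) % Modulus;
        }
        return p;
    }

    // Slide the window by one: drop 'outgoing' at the front, append 'incoming' at the back.
    public static ulong Roll(ulong hash, char outgoing, char incoming, ulong leadingPower)
    {
        ulong removed = (outgoing * leadingPower) % Modulus;
        ulong withoutLead = (hash + Modulus - removed) % Modulus; // add Modulus to avoid going negative
        return (withoutLead * Base + incoming) % Modulus;
    }
}
```

With these definitions, rolling the hash of "abc" forward by one character gives the same value as hashing "bcd" directly.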
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?
As I already mentioned, if you don't reduce using a modulus, your hash value will grow as large as the string itself, thus rendering it useless to use a hash function in the first place. So yes, using controlled overflow modulo 2^64 is correct and even necessary if we don't manually reduce.

If your goal is a type of storage which contains only "small" numbers,
but where the sum can be compared:
You could view this simply as a base-101 number system,
like 10=decimal, 16=hex, and so on.
I.e.
a) You have to store a set of { ASCII value and its power of 101 }
(without the possibility of multiple entries with the same power).
b) When creating the data from a string,
values >= 101 have to be carried over to the next power.
Example 1:
"a" is 97*101^0
(trivial)
Example 2:
"g" is 1*101^1 + 2*101^0
because g is 103. 103>=101 ie. take only 103%101 for 101^0
(modulo, remainder of division)
and (int)(103/101) for the next power.
(if the ASCII numbers could be higher, or the prime number is lower than 101,
it could be possible that (int)(103/101) would exceed the prime number too.
In this case, it would continue to prime^2 and so on, until the value is smaller
than the prime number)
Example 3:
"ag" is 98*101^1 + 2*101^0
Compared to above, 97*101^1 is added because of a.
and so on...
To compare without calculating the full sum,
just compare the values of one power to each other, for each power.
Equal if all "power values" are the same.
Side note: Be aware that ^ is not exponentiation in languages like C# and Java.
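The carrying described above can be sketched roughly as follows. The names are hypothetical, and the list index is used as the power of 101, so index 0 holds the 101^0 digit.

```csharp
using System;
using System.Collections.Generic;

public static class Base101
{
    // Returns the base-101 digits of the string's value; digits[k] is the coefficient of 101^k.
    public static List<int> ToDigits(string s)
    {
        const int Radix = 101;
        var digits = new List<int>();

        // The last character belongs to power 0, the one before it to power 1, and so on.
        for (int i = s.Length - 1; i >= 0; i--)
        {
            digits.Add(s[i]);
        }

        // Carry everything >= 101 into the next power.
        for (int k = 0; k < digits.Count; k++)
        {
            int carry = digits[k] / Radix;
            digits[k] %= Radix;
            if (carry > 0)
            {
                if (k + 1 == digits.Count)
                {
                    digits.Add(0); // the number grew by one digit
                }
                digits[k + 1] += carry;
            }
        }
        return digits;
    }
}
```

For "ag" this yields the digits 2 (for 101^0) and 98 (for 101^1), matching Example 3. Two strings are then equal as numbers exactly when their digit lists are equal.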

Related

Random number within a range biased towards the minimum value of that range

I want to generate random numbers within a range (1 - 100000), but instead of purely random I want the results to follow a kind of distribution. What I mean is that in general I want the numbers "clustered" around the minimum value of the range (1).
I've read about Box–Muller transform and normal distributions but I'm not quite sure how to use them to achieve the number generator.
How can I achieve such an algorithm using C#?

There are a lot of ways of doing this with a uniformly distributed PRNG. Here are a few I know of:
Combine more uniform random variables to obtain desired distribution.
I am not a math guy, but there surely are equations for this. This kind of solution usually has the best properties from a randomness and statistical point of view. For more info see the famous:
Understanding “randomness”.
but there is only a limited number of distributions we know the combinations for.
Apply non linear function on uniform random variable
This is the simplest to implement. You simply take floating-point randoms in the <0..1> range, apply your non linear function to them (one that changes the distribution towards your wanted shape while the result stays in the <0..1> range), and rescale the result into your integer range, for example (in C++):
floor( pow( random(),5 ) * 100000 )
The problem is that this is just blind fitting of the distribution, so you usually need to tweak the constants a bit. It is a good idea to render histogram and randomness graphs to see the quality of the result directly, like in here:
How to seed to generate random numbers?
You can also avoid overly blind fitting with Bezier curves, like in here:
Random but most likely 1 float
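Since the question asks for C#, the non linear function approach could look roughly like this. The method name and parameters are made up for illustration, and the exponent 5 is just the value from the C++ one-liner above, which you would tune by looking at a histogram.

```csharp
using System;

public static class BiasedRandom
{
    // Returns an integer in [min, max], clustered towards min when exponent > 1.
    public static int NextBiased(Random rng, int min, int max, double exponent)
    {
        double u = rng.NextDouble();           // uniform in [0, 1)
        double skewed = Math.Pow(u, exponent); // still in [0, 1), but pushed towards 0
        double span = (double)(max - min) + 1; // number of possible results
        return min + (int)Math.Floor(skewed * span);
    }
}
```

Usage would be something like BiasedRandom.NextBiased(new Random(), 1, 100000, 5.0). A larger exponent clusters the results more tightly around the minimum.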
Distribution following pseudo random generator
There are two approaches I know of for this. The simpler one is:
create big enough array of size n
fill it with all values following the distribution
So simply loop through all the values you want to output, compute how many copies of each belong in an array of size n (from your distribution), and add that many copies to the array. Beware that the filled size of the array might be slightly less than n due to rounding. If n is too small you will be missing some of the less frequent numbers, so the probability of the least probable number multiplied by n should be at least 1. After filling, change n to the real array size (the number of entries actually filled).
shuffle the array
now use the array as linear list of random numbers
So instead of random() you just pick a number from the array and move to the next one. Once you reach the n-th value, shuffle the array and start from the first one again.
This solution has very good statistical properties (it follows the distribution exactly), but the randomness properties are not good, and it requires an array and occasional shuffling. For more info see:
How to efficiently generate a set of unique random numbers with a predefined distribution?
The other variation of this is to avoid use of array and shuffling. It goes like this:
get random value in range <0..1>
apply inverse cumulated distribution function to convert to target range
As you can see, it is like the "Apply non linear function" approach above, but instead of "some" non linear function you use the distribution directly. So if p(x) is the probability of x, in the range <0..1> where 1 means 100%, then we need a function that accumulates all the probabilities up to x (the cumulative distribution function). For integers:
f(x) = p(0)+p(1)+...+p(x)
Now we need inverse function g() to it so:
y = f(x)
x = g(y)
Now if my memory serves me well then the generation should look like this:
y = random(); // <0..1>
x = g(y); // probability -> value
Many distributions have a known g() function, but for those that do not (or when we are too lazy to derive it) you can use binary search on f(x), which is non-decreasing. Too lazy to code it, so here is the slower linear search version:
for (x=0;x<max;x++) if (f(x)>=y) break;
So when put all together (and using only p(x)) I got this (C++):
y=random(); // uniform distribution pseudo random value in range <0..1>
for (f=0.0,x=0;x<max;x++) // loop x through all values
{
f+=p(x); // f(x) cumulative distribution function
if (f>=y) break;
}
// here x is your pseudo random value following p(x) distribution
This kind of solution usually has both very good statistical and randomness properties, and does not require the distribution to be a continuous function (it can even be just an array of values instead).
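A possible C# version of the same idea, with the binary search that the linear loop above skips. This is only a sketch: the names are invented, the weights are assumed to be non-negative with at least one positive entry, and the sample is scaled by the total so the weights do not have to sum to exactly 1.

```csharp
using System;

public static class DiscreteSampler
{
    // cumulative[i] = p(0) + p(1) + ... + p(i), i.e. f(i) from the text.
    public static double[] BuildCumulative(double[] weights)
    {
        var cumulative = new double[weights.Length];
        double sum = 0.0;
        for (int i = 0; i < weights.Length; i++)
        {
            sum += weights[i];
            cumulative[i] = sum;
        }
        return cumulative;
    }

    // y is a uniform value in [0, 1); returns the first index whose cumulative value exceeds it.
    public static int Sample(double[] cumulative, double y)
    {
        double target = y * cumulative[cumulative.Length - 1];
        int lo = 0;
        int hi = cumulative.Length - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (cumulative[mid] > target)
            {
                hi = mid;     // mid could be the answer, keep it in range
            }
            else
            {
                lo = mid + 1; // answer lies strictly to the right
            }
        }
        return lo;
    }
}
```

You would build the cumulative array once and then call Sample(cumulative, rng.NextDouble()) for every number you need, which costs O(log n) per sample instead of O(n).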

Create a identical "random" float based on multiple data

I'm working on a game (Unity) and I need to create a random float value (between 0 and 1) based on multiple ints and/or floats.
I think it would be easier to manually build a single string for the function, but maybe it could accept a list of ints and/or floats.
Example of result:
"[5-91]-52-1" > 0.158756..
Important points:
The distribution of results (between 0 and 1) must be uniform (I don't want 90% of results between 0.45 and 0.55)
Asking twice for the same string must return exactly the same result (even if I reload the app, or run it on a different computer, ...)
Results have no need to be unique.
Bonus Point:
Sometimes I need similar strings to return close results, but not every time. Is it possible for the "random generation" to take a boolean that toggles this feature?

What you've described is essentially the definition of a hash function.
So just use one and normalize the results into the range you want. The most basic case can use GetHashCode, but it is not guaranteed to produce the same results across different versions of the framework.
A stable version that is guaranteed to give exactly the same results across machines would use a well-known good hash, such as the cryptographic hash SHA256: take the first several bytes of the result as an integer and normalize. Cryptographic hash functions also conveniently take byte arrays as input, so you can combine multiple values as bytes directly and get a stable result.
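var intValue = 42;
// Note: BitConverter uses the machine's byte order (see BitConverter.IsLittleEndian)
var bytesToHash = BitConverter.GetBytes(intValue);
byte[] hash;
using (var sha = System.Security.Cryptography.SHA256.Create())
{
    hash = sha.ComputeHash(bytesToHash);
}
var toNormalize = BitConverter.ToUInt32(hash, 0);
var fancyRandom = (double)toNormalize / UInt32.MaxValue; // in the range [0, 1]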
To combine multiple values into byte array you can either manually combine results of BitConverter.GetBytes or use BinaryWriter on MemoryStream.
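A rough sketch of the BinaryWriter variant, assuming three inputs purely for illustration (the class name, method name and parameter list are made up). BinaryWriter writes its values in little-endian order, and the first four hash bytes are assembled by hand here, so the result should not depend on the machine's byte order.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class StableRandom
{
    // Maps the given values to a repeatable double in [0, 1).
    public static double FromValues(int a, int b, float c)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        using (var sha = SHA256.Create())
        {
            writer.Write(a);
            writer.Write(b);
            writer.Write(c);
            writer.Flush();

            byte[] hash = sha.ComputeHash(stream.ToArray());

            // Build a 32-bit value from the first four bytes, independent of platform endianness.
            uint bits = (uint)hash[0]
                      | ((uint)hash[1] << 8)
                      | ((uint)hash[2] << 16)
                      | ((uint)hash[3] << 24);

            return bits / (uint.MaxValue + 1.0); // divide by 2^32 so the result stays below 1
        }
    }
}
```

Calling StableRandom.FromValues(5, 91, 52f) twice returns the same value, and changing any single input should give an unrelated result.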
Alternatively, you can use the resulting integer as the seed for a custom implementation of a pseudo-random generator (the one in .NET is not guaranteed to give the same results across machines or versions of .NET), as suggested in the comments, but I don't think it will give a significantly better distribution.
Note: make sure the resulting numbers are distributed "randomly enough" for your case. Cryptographic hash functions likely give the result you want, but I'm not sure how to prove that.
For the "bonus" part: I would be very surprised if you could find a pseudo-random generator that consistently produces close results for "similar" seeds. Instead you can use the same approach as above on separate parts, one that stays the "same" and another that handles the variation (i.e. intValue & 0xFFFFFF00 for the stable part, intValue & 0xFF for the "small difference"), and then combine the resulting "random" numbers with some weight: randomFromStable + 0.05 * randomFromDifference.

I would suggest using the hashcode (or something similar) as the seed to a Random object. Hashcodes must be the same for the same string so you will always get the same sequence back.
As Nuf notes, hashcodes are only guaranteed to be the same in the same app-domain; so it may not work across restarts.
As to your bonus point, getting there without writing your own RNG will be hard. Any variance in the seed can and should cause a lot of variation in the resulting sequence.

Efficient bit remapping algorithm

I have a use case where I need to scramble an input in such a way that:
Each specific input always maps to a specific pseudo-random output.
The output must shuffle the input sufficiently so that an incrementing input maps to a pseudo-random output.
For example, if the input is 64 bits, there must be exactly 2^64 unique outputs, and these must break incrementing inputs as much as possible (arbitrary requirement).
I will code this in C#, but can translate from Java or C, so long as there are not SIMD intrinsics. What I am looking for is some already existing code, rather than reinventing the wheel.
I have looked on Google, but haven't found anything that does a 1:1 mapping.

This seems to work fairly well:
const long multiplier = 6364136223846793005;
const long mulinv_multiplier = -4568919932995229531;
const long offset = 1442695040888963407;
static long Forward(long x)
{
    // unchecked: the arithmetic is meant to wrap around modulo 2^64
    return unchecked(x * multiplier + offset);
}
static long Reverse(long x)
{
    return unchecked((x - offset) * mulinv_multiplier);
}
You can change the constants to whatever you like, as long as multiplier is odd and mulinv_multiplier is the modular multiplicative inverse of multiplier (see wiki:modular multiplicative inverse or Hacker's Delight 10-15, Exact Division by Constants). The inverse is taken modulo 2^64, obviously, and that's why multiplier has to be odd: otherwise it has no inverse.
The offset can be anything, but make it relatively prime to 2^64 (that is, odd) just to be on the safe side.
These specific constants come from Knuth's linear congruential generator.
There's one small thing: it puts the complement of the LSB of the input in the LSB of the result. If that's a problem, you could just rotate it by any nonzero amount.
For 32 bits, the constants can be multiplier = 0x4c957f2d, offset = 0xf767814f, mulinv_multiplier = 0x329e28a5.
For 64 bits, multiplier = 12790229573962758597, mulinv_multiplier = 16500474117902441741 may work better. Note that these two values are larger than long.MaxValue, so they have to be declared as ulong (or converted with an unchecked cast).
Or, you could use a CRC, which is reversible for this use (i.e. when the input is the same size as the CRC). For CRC64 it requires some modifications, of course.
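If you pick your own multiplier, the modular inverse does not have to be looked up. One way to compute it, sketched here with made-up names, is Newton's iteration modulo 2^64: an odd number is its own inverse modulo 8, and each step doubles the number of correct low bits, so five steps are enough for 64 bits.

```csharp
using System;

public static class Remap64
{
    // Modular multiplicative inverse of an odd value modulo 2^64.
    public static ulong Inverse(ulong odd)
    {
        ulong inv = odd; // odd * odd == 1 (mod 8), so 3 bits are already correct
        unchecked
        {
            for (int i = 0; i < 5; i++) // 3 -> 6 -> 12 -> 24 -> 48 -> 96 correct bits
            {
                inv *= 2 - odd * inv;
            }
        }
        return inv;
    }

    public static ulong Forward(ulong x, ulong multiplier, ulong offset)
    {
        return unchecked(x * multiplier + offset);
    }

    public static ulong Reverse(ulong y, ulong inverse, ulong offset)
    {
        return unchecked((y - offset) * inverse);
    }
}
```

With multiplier = 6364136223846793005 and inverse = Remap64.Inverse(multiplier), Reverse(Forward(x, ...), ...) returns x again for every 64-bit x.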

Just off the top of my head:
Shift the input: Make sure you keep every bit, i.e. use two shift operations in different directions and OR the result together.
Apply a static XOR.
Everything else that comes to my mind won't be bijective. However, a search for bijective might bring up something useful ;D
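For what it's worth, the two steps above amount to a rotation followed by an XOR, which could be sketched like this. The shift amount and the XOR constant are arbitrary values picked for the example. The mapping is bijective, but on its own it will not scramble incrementing inputs very much.

```csharp
using System;

public static class RotateXor
{
    private const int Shift = 23;                    // any amount from 1 to 63 (arbitrary choice)
    private const ulong Mask = 0x9E3779B97F4A7C15UL; // arbitrary constant (assumed)

    public static ulong Forward(ulong x)
    {
        // Two shifts in opposite directions ORed together keep every bit: a left rotation.
        ulong rotated = (x << Shift) | (x >> (64 - Shift));
        return rotated ^ Mask;
    }

    public static ulong Reverse(ulong y)
    {
        ulong rotated = y ^ Mask; // XOR is its own inverse
        return (rotated >> Shift) | (rotated << (64 - Shift)); // rotate back to the right
    }
}
```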

What would be a good hashCode for a DateRange class

I have the following class
public class DateRange
{
private DateTime startDate;
private DateTime endDate;
public override bool Equals(object obj)
{
DateRange other = (DateRange)obj;
if (startDate != other.startDate)
return false;
if (endDate != other.endDate)
return false;
return true;
}
...
}
I need to store some values in a dictionary keyed with a DateRange like:
Dictionary<DateRange, double> tddList;
How should I override the GetHashCode() method of DateRange class?

I use this approach from Effective Java for combining hashes:
unchecked
{
int hash = 17;
hash = hash * 31 + field1.GetHashCode();
hash = hash * 31 + field2.GetHashCode();
...
return hash;
}
There's no reason that shouldn't work fine in this situation.
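Applied to the class from the question, this might look as follows. The constructor is an assumption added so the example is self-contained, and Equals is written with a type check so that comparing against null or an unrelated object returns false instead of throwing.

```csharp
using System;
using System.Collections.Generic;

public sealed class DateRange
{
    private readonly DateTime startDate;
    private readonly DateTime endDate;

    // Constructor added for the example; the question does not show one.
    public DateRange(DateTime startDate, DateTime endDate)
    {
        this.startDate = startDate;
        this.endDate = endDate;
    }

    public override bool Equals(object obj)
    {
        var other = obj as DateRange;
        return other != null
            && startDate == other.startDate
            && endDate == other.endDate;
    }

    public override int GetHashCode()
    {
        unchecked // overflow is expected and harmless here
        {
            int hash = 17;
            hash = hash * 31 + startDate.GetHashCode();
            hash = hash * 31 + endDate.GetHashCode();
            return hash;
        }
    }
}
```

Two DateRange instances with the same start and end now produce the same hash code, so a value stored under one can be found in a Dictionary<DateRange, double> with the other.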

It depends on the values I expect to see it used with.
If it was most often going to have different day values, rather than different times on the same day, and they were within a century of now, I would use:
unchecked
{
int hash = startDate.Year + endDate.Year - 4007;
hash = hash * 367 + startDate.DayOfYear;
return hash * 367 + endDate.DayOfYear;
}
This distributes the bits well with the expected values, while reducing the number of bits lost in the shifting. Note that while there are cases where dependency on primes can be surprisingly bad at collisions (esp. when the hash is fed into something that uses a modulo of the same prime in trying to avoid collisions when producing a yet-smaller hash to distribute among its buckets), I've opted to go for primes above the more obvious choices, as they're only just above and so still pretty "tight" for bit-distribution. I don't worry much about using the same prime twice, as they're so "tight" in this way, but it does hurt if you've a hash-based collection with 367 buckets. This deals well (but not as well) with dates well into the past or future, but is dreadful if the assumption that there will be few or no ranges within the same day (differing in time) is wrong, as that information is entirely lost.
If I was expecting otherwise (or writing for general use by other parties, and not able to assume otherwise) I'd go for:
int startHash = startDate.GetHashCode();
return (((startHash >> 24) & 0x000000FF) | ((startHash >> 8) & 0x0000FF00) | ((startHash << 8) & 0x00FF0000) | (unchecked((int)((startHash << 24) & 0xFF000000)))) ^ endDate.GetHashCode();
Where the first method works on the assumption that the general-purpose GetHashCode in DateTime isn't as good as we want, this one depends on it being good, but mixes around the bits of one value.
It's good in dealing with the more obvious tricky cases such as the two values being the same, or a common distance from each other (e.g. lots of 1day or 1hour ranges). It's not as good at the cases where the first example works best, but the first one totally sucks if there are lots of ranges using the same day, but different times.
Edit: To give a more detailed response to Dour's concern:
Dour points out, correctly, that some of the answers on this page lose data. The fact is, all of them lose data.
The class defined in the question has 8.96077483×10^37 different valid states (or 9.95641648×10^36 if we don't care about the DateTimeKind of each date), and the output of GetHashCode has 4294967296 possible states (one of which, zero, is also going to be used as the hashcode of a null value, which may be commonly compared with in real code). Whatever we do, we reduce information by a factor of 2.31815886×10^27. That's a lot of information we lost!
It's likely true that we can lose more with some than in others. Certainly, it's easy to prove some solutions can lose more than others by writing a valid, but really poor, answer.
(The worst possible valid solution is return 0; which is valid as it never errors or mismatches on equal objects, but as poor as possible as it collides for all values. The performance of a hash-based collection becomes O(n), and slow as O(n) goes, as the constants involved are higher than such O(n) operations as searching an unordered list).
It's difficult to measure just how much is lost. How much more does shifting of some bits before XORing lose than swapping bits, considering that XOR halves the amount of information left? Even the naïve x ^ y doesn't lose more than a swap-and-xor, it just collides more on common values; swap-and-xor will collide on values where plain-xor does not.
Once we've got a choice between solutions that are not losing much more information than possible, but returning 4294967296 or close to 4294967296 possible values with a good distribution between those values, then the question is no longer how much information is lost (the answer being that only 4.31376821×10^-28 of the original information remains) but which information is lost.
This is why my first suggestion above ignores time components. There are 864000000000 "ticks" (the 100-nanosecond units DateTime has a resolution of) in a day, and I throw away two chunks of those ticks (7.46496×10^23 possible values between the two) on purpose, because I'm thinking of a scenario where that information is not used anyway. In this case I've deliberately structured the mechanism in such a way as to pick which information gets lost, which improves the hash for a given situation but makes it absolutely worthless if we had different values all with start and end dates falling on the same days but at different times.
Likewise x ^ y doesn't lose any more information than any of the others, but the information that it does lose is more likely to be significant than with other choices.
In the absence of any way to predict which information is likely to be of importance (esp. if your class will be public and its hash code used by external code), then we are more restricted in the assumptions we can safely make.
As a whole prime-mult or prime-mod methods are better in which information they lose than shift-based methods, except when the same prime is used in a further hashing that may take place inside a hash-based method, ironically with the same goal in mind (no number is relatively prime to itself! even primes) in which case they are much worse. On the other hand shift-based methods really fall down if fed into a shift-based further hash. There is no perfect hash for arbitrary data and arbitrary use (except when a class has few valid values and we match them all, in which case it's more strictly an encoding than a hash that we produce).
In short, you're going to lose information whatever you do, it's which you lose that's important.

Well, consider what characteristics a good hash function should have. It must:
be in agreement with Equals - that is, if Equals is true for two objects then the two hash codes have to also be the same.
never crash
And it should:
be very fast
give different results for similar inputs
What I would do is come up with a very simple algorithm; say, taking 16 bits from the hash code of the first and 16 bits from the hash code of the second, and combining them together. Make yourself a test case of representative samples; date ranges that are likely to be actually used, and see if this algorithm does give a good distribution.
A common choice is to xor the two hashes together. This is not necessarily a good idea for this type because it seems likely that someone will want to represent the zero-length range that goes from X to X. If you xor the hashes of two equal DateTimes you always get zero, which seems like a recipe for a lot of hash collisions.
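To make both points concrete, here is a small sketch with hypothetical helper names: the plain XOR combination, which is always zero for a zero-length range, and the "16 bits from each" combination suggested above. Whether the second one distributes well for your data is exactly what the representative test case should tell you.

```csharp
using System;

public static class RangeHashDemo
{
    // Plain XOR: collapses to 0 whenever start == end.
    public static int XorHash(DateTime start, DateTime end)
    {
        return start.GetHashCode() ^ end.GetHashCode();
    }

    // Low 16 bits of the start hash in the upper half, low 16 bits of the end hash in the lower half.
    public static int HalvesHash(DateTime start, DateTime end)
    {
        int a = start.GetHashCode();
        int b = end.GetHashCode();
        return (a << 16) | (b & 0xFFFF);
    }
}
```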

You have to shift one end of the range, otherwise two equal dates will hash to zero, a pretty common scenario I imagine:
return startDate.GetHashCode() ^ (endDate.GetHashCode() << 4);

return startDate.GetHashCode() ^ endDate.GetHashCode();
might be a good start. You have to check that you get good distribution when there is equal distance between startDate and endDate, but different dates.

long/large numbers and modulus in .NET

I'm currently writing a quick custom encoding method where I stamp a key with a number to verify that it is a valid key.
Basically I was taking whatever number that comes out of the encoding and multiplying it by a key.
I would then deploy the product of those numbers to the user/customer who purchases the key. I wanted to simply use (Code % Key == 0) to verify that the key is valid, but for large values the mod operator does not seem to work as expected.
Number = 468721387;
Key = 12345678;
Code = Number * Key;
Using the numbers above:
Code % Key == 11418772
And for smaller numbers it would correctly return 0. Is there a reliable way to check divisibility for a long in .NET?
Thanks!
EDIT:
Ok, tell me if I'm special and missing something...
long a = DateTime.Now.Ticks;
long b = 12345;
long c = a * b;
long d = c % b;
d == 10001 (Bad)
and
long a = DateTime.Now.Ticks;
long b = 12;
long c = a * b;
long d = c % b;
d == 0 (Good)
What am I doing wrong?

As others have said, your problem is integer overflow. You can make this more obvious by ticking "Check for arithmetic overflow/underflow" in the "Advanced Build Settings" dialog. When you do so, you'll get an OverflowException when you perform DateTime.Now.Ticks * 12345.
One simple solution is just to change "long" to "decimal" in your code (a "double" would also avoid the overflow, but it keeps only about 15-16 significant digits, so the remainder of a product this large would not be exact).
In .NET 4.0, there is a new BigInteger class.
Finally, you say you're "... writing a quick custom encoding method ...", so a simple homebrew solution may be satisfactory for your needs. However, if this is production code, you might consider more robust solutions involving cryptography or something from a third-party who specializes in software licensing.
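To see the effect with the numbers from the question, here is a small sketch (the class and method names are invented for the example). The first method reproduces the wrapped 32-bit multiplication, the second widens to long before multiplying, and the third uses BigInteger from System.Numerics for products that do not fit in a long either.

```csharp
using System;
using System.Numerics;

public static class DivisibilityDemo
{
    // 32-bit multiplication wraps around, so the remainder is meaningless.
    public static int WrappedRemainder(int number, int key)
    {
        int code = unchecked(number * key);
        return code % key;
    }

    // Widening one operand first makes the whole multiplication 64-bit.
    public static long WideRemainder(int number, int key)
    {
        long code = (long)number * key;
        return code % key;
    }

    // BigInteger never overflows, whatever the size of the product.
    public static BigInteger BigRemainder(long number, long key)
    {
        BigInteger code = (BigInteger)number * key;
        return code % key;
    }
}
```

With Number = 468721387 and Key = 12345678, WrappedRemainder gives 11418772 as in the question, while WideRemainder gives 0. For the tick value 633909674610619350 multiplied by 12345, BigRemainder gives 0.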

The answers that say that integer overflow is the likely culprit are almost certainly correct; you can verify that by putting a "checked" block around the multiplication and seeing if it throws an exception.
But there is a much larger problem here that everyone seems to be ignoring.
The best thing to do is to take a large step back and reconsider the wisdom of this entire scheme. It appears that you are attempting to design a crypto-based security system but you are clearly not an expert on cryptographic arithmetic. That is a huge red warning flag. If you need a crypto-based security system DO NOT ATTEMPT TO ROLL YOUR OWN. There are plenty of off-the-shelf crypto systems that are built by experts, heavily tested, and readily available. Use one of them.
If you are in fact hell-bent on rolling your own crypto, getting the math right in 64 bits is the least of your worries. 64 bit integers are way too small for this crypto application. You need to be using a much larger integer size; otherwise, finding a key that matches the code is trivial.
Again, I cannot emphasize strongly enough how difficult it is to construct correct crypto-based security code that actually protects real users from real threats.

Integer Overflow...see my comment.
The value of the multiplication you're doing overflows the int data type and causes it to wrap (int values fall between -2147483648 and 2147483647).
Pick a more appropriate data type to hold a value as large as 5786683315615386 (the result of your multiplication).
UPDATE
Your new example changes things a little.
You're using long, but now you're using System.DateTime.Ticks which on Mono (not sure about the MS platform) is returning 633909674610619350.
When you multiply that by a large number, you are now overflowing a long just like you were overflowing an int previously. At that point, you'll probably need a wider type to work with the values you want: decimal may work, depending on how large your multiplier gets, while a double will not overflow but cannot hold all the digits of such a product, so % on it is not exact.

Apparently, your Code fails to fit in the int data type. Try using long instead:
long code = (long)number * key;
The (long) cast is necessary. Without the cast, the multiplication will be done in 32-bit integer form (assuming the number and key variables are typed int) and the result will only be cast to long afterwards, which is not what you want. By casting one of the operands to long, you tell the compiler to perform the multiplication on two long numbers.
