How to generate random numbers with a stable distribution in C#?
The Random class has a uniform distribution. Much of the other code on the
internet shows a normal distribution. But we need a stable distribution,
meaning infinite variance, a.k.a. a fat-tailed distribution.
The reason is for generating realistic stock prices. In the real
world, huge variations in prices are far more likely than a normal
distribution would suggest.
Does anyone know the C# code to convert the Random class output
into a stable distribution?
Edit: Hmmm. The exact distribution is less critical than ensuring it will randomly generate huge deviations - like at least 20 sigma. We want to test a trading strategy for resilience under a truly fat-tailed distribution, which is exactly how stock market prices behave.
I just read about the Zipfian and Cauchy distributions thanks to the comments. Since I must pick, let's go with the Cauchy distribution, but I will also try Zipfian to compare.
In general, the method is:
Choose a stable, fat-tailed distribution. Say, the Cauchy distribution.
Look up the quantile function of the chosen distribution.
For the Cauchy distribution, that would be p --> peak + scale * tan( pi * (p - 0.5) ).
And now you have a method of transforming uniformly-distributed random numbers into Cauchy-distributed random numbers.
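For instance, here is a minimal C# sketch of that transform for the Cauchy case (peak and scale are the distribution's location and scale parameters; the values in Main are just for illustration):
using System;

class CauchySampler
{
    // Inverse transform sampling: push a uniform p in (0,1) through the
    // Cauchy quantile function peak + scale * tan(pi * (p - 0.5)).
    static double NextCauchy(Random rand, double peak, double scale)
    {
        double p = rand.NextDouble(); // uniform in [0,1)
        // p == 0 maps to an astronomically large negative value; redraw it.
        while (p == 0.0) p = rand.NextDouble();
        return peak + scale * Math.Tan(Math.PI * (p - 0.5));
    }

    static void Main()
    {
        var rand = new Random();
        for (int i = 0; i < 10; i++)
            Console.WriteLine(NextCauchy(rand, 0.0, 1.0)); // standard Cauchy
    }
}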
Make sense? See
http://en.wikipedia.org/wiki/Inverse_transform_sampling
for details.
Caveat: It has been a long, long time since I took statistics.
UPDATE:
I liked this question so much I just blogged it: see
http://ericlippert.com/2012/02/21/generating-random-non-uniform-data/
My article exploring a few interesting examples of Zipfian distributions is here:
http://blogs.msdn.com/b/ericlippert/archive/2010/12/07/10100227.aspx
If you're interested in using the Zipfian distribution (which is often used when modeling processes from the sciences or social domains), you would do something along the lines of:
Select your k (skew) for the distribution
Precompute the domain of the cumulative distribution (this is just an optimization)
Generate random values for the distribution by finding the nearest value from the domain
Sample Code:
List<int> domain = Enumerable.Range(0, 1000).ToList(); // generate your domain
double skew = 0.37; // select a skew appropriate to your domain
// normalizing constant: the sum of 1/rank^skew over the whole domain (ranks start at 1)
double sigma = domain.Aggregate(0.0d, (z, x) => z + 1.0 / Math.Pow(x + 1, skew));
// cumulative distribution: for each x, the running sum of normalized probabilities up to x
List<double> cummDist = domain.Select(
    x => domain.Take(x + 1).Aggregate(0.0d, (z, y) => z + 1.0 / Math.Pow(y + 1, skew) / sigma)).ToList();
Now you can generate random values by selecting the closest value from within the domain:
Random rand = new Random();
double seek = rand.NextDouble();
int searchIndex = cummDist.BinarySearch(seek);
// BinarySearch returns the bitwise complement of the index of the next
// larger element when the exact value is not found
return searchIndex < 0 ? ~searchIndex : searchIndex;
You can, of course, generalize this entire process by factoring out the logic that materializes the domain of the distribution from the process that maps and returns a value from that domain.
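A minimal sketch of that factoring (SampleFromCumulative is a hypothetical helper name, not from the code above):
// Hypothetical helper: samples any discrete distribution given its
// precomputed cumulative distribution over a domain of indices.
static int SampleFromCumulative(Random rand, List<double> cummDist)
{
    int i = cummDist.BinarySearch(rand.NextDouble());
    return i < 0 ? ~i : i;
}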
I have before me James Gentle's Springer volume on this topic, Random Number Generation and Monte Carlo Methods, courtesy of my statistician wife. It discusses the stable family on page 105:
The stable family of distributions is a flexible family of generally heavy-tailed distributions. This family includes the normal distribution at one extreme value of one of the parameters and the Cauchy at the other extreme value. Chambers, Mallows, and Stuck (1976) give a method for generating deviates from stable distributions. (Watch for some errors in the constants in the auxiliary function D2, for evaluating (e^x - 1)/x.) Their method is used in the IMSL libraries. For a symmetric stable distribution, Devroye (1986) points out that a faster method can be developed by exploiting the relationship of the symmetric stable to the Fejér-de la Vallée Poussin distribution. Buckle (1995) shows how to simulate the parameters of a stable distribution, conditional on the data.
Generating deviates from the generic stable distribution is hard. If you need to do this, I would recommend a library such as IMSL; I do not advise that you attempt it yourself.
However, if you are looking for a specific distribution in the stable family, e.g. Cauchy, then you can use the method described by Eric, known as the probability integral transform. As long as you can write down the inverse of the distribution function in closed form, you can use this approach.
The following C# code generates a random number following a stable distribution given the shape parameters alpha and beta. I release it to the public domain under Creative Commons Zero.
public static double StableDist(Random rand, double alpha, double beta)
{
    if (alpha <= 0 || alpha > 2 || beta < -1 || beta > 1)
        throw new ArgumentException();
    var halfpi = Math.PI * 0.5;
    // Uniform value in (0, 1), mapped to an angle in (-pi/2, pi/2)
    var unif = NextDouble(rand);
    while (unif == 0.0) unif = NextDouble(rand);
    unif = (unif - 0.5) * Math.PI;
    // Cauchy special case
    if (alpha == 1 && beta == 0)
        return Math.Tan(unif);
    // Exponentially distributed value with mean 1
    var expo = -Math.Log(1.0 - NextDouble(rand));
    var c = Math.Cos(unif);
    if (alpha == 1)
    {
        var s = Math.Sin(unif);
        return 2.0 * ((unif * beta + halfpi) * s / c -
            beta * Math.Log(halfpi * expo * c / (
            unif * beta + halfpi))) / Math.PI;
    }
    var z = -Math.Tan(halfpi * alpha) * beta;
    var ug = unif + Math.Atan(-z) / alpha;
    var cpow = Math.Pow(c, -1.0 / alpha);
    return Math.Pow(1.0 + z * z, 1.0 / (2 * alpha)) *
        (Math.Sin(alpha * ug) * cpow) *
        Math.Pow(Math.Cos(unif - alpha * ug) / expo, (1.0 - alpha) / alpha);
}
private static double NextDouble(Random rand)
{
    // The default NextDouble implementation in .NET (see
    // https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/shared/System/Random.cs)
    // is very problematic:
    // - It generates a random number 0 or greater and less than 2^31-1 in a
    //   way that very slightly biases 2^31-2.
    // - Then it divides that number by 2^31-1.
    // - The result is a number that uses roughly only 32 bits of pseudorandomness,
    //   even though `double` has 53 bits in its significand.
    // To alleviate some of these problems, this method generates a random 53-bit
    // random number and divides that by 2^53. Although this doesn't fix the bias
    // mentioned above (for the default System.Random), this bias may be of
    // negligible importance for most purposes not involving security.
    long x = rand.Next(0, 1 << 30);
    x <<= 23;
    x += rand.Next(0, 1 << 23);
    return (double)x / (double)(1L << 53);
}
In addition, I set forth pseudocode for the stable distribution in a separate article.
Related
I want to generate random numbers within a range (1 - 100000), but instead of purely random I want the results to be based on a kind of distribution. What I mean is that, in general, I want the numbers "clustered" around the minimum value of the range (1).
I've read about Box–Muller transform and normal distributions but I'm not quite sure how to use them to achieve the number generator.
How can I achieve such an algorithm using C#?
There are a lot of ways of doing this (starting from a uniform-distribution PRNG); here are a few I know of:
Combine several uniform random variables to obtain the desired distribution.
I am not a math guy, but there are certainly equations for this. This kind of solution usually has the best properties from both a randomness and a statistical point of view. For more info see the famous:
Understanding “randomness”.
However, there are only a limited number of distributions we know the combinations for.
Apply a nonlinear function to a uniform random variable
This is the simplest to implement. You simply take floating-point randoms in the <0..1> range, apply your nonlinear function (which changes the distribution towards your wanted shape) to them (while the result is still in the <0..1> range), and rescale the result into your integer range, for example (in C++; a C# sketch follows below):
floor( pow( random(),5 ) * 100000 )
The problem is that this is just blind fitting of the distribution, so you usually need to tweak the constants a bit. It's a good idea to render histograms and randomness graphs to see the quality of the result directly, like in here:
How to seed to generate random numbers?
You can also avoid overly blind fitting by using Béziers, like in here:
Random but most likely 1 float
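Here is the same idea as a C# sketch for the asker's concrete range (1 to 100000, clustered near 1); the exponent 5 is just a starting point to tweak:
var rand = new Random();
// Raising a uniform value in [0,1) to a power > 1 pushes mass toward 0,
// so after rescaling the results cluster near the minimum of the range.
int value = 1 + (int)(Math.Pow(rand.NextDouble(), 5) * 100000);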
Distribution-following pseudo-random generator
There are two approaches I know of for this; the simpler is:
create a big enough array of size n
fill it with all values, following the distribution
So simply loop through all the values you want to output, compute how many of them should be in the n-sized array (from your distribution), and add that count of each number to the array. Beware that the filled size of the array might be slightly less than n due to rounding. If n is too small, you will be missing some of the rarer numbers, so the probability of the least probable number multiplied by n should be at least >= 1. After the filling, change n to the real array size (the number of entries actually filled in it).
shuffle the array
now use the array as a linear list of random numbers
So instead of random() you just pick a number from the array and move to the next one. Once you reach the n-th value, shuffle the array and start from the first one again.
This solution has very good statistical properties (it follows the distribution exactly), but the randomness properties are not good, and it requires an array and occasional shuffling. For more info see:
How to efficiently generate a set of unique random numbers with a predefined distribution?
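A minimal C# sketch of this fill-and-shuffle approach, assuming the distribution is given as an array p of per-value probabilities:
using System;
using System.Collections.Generic;

class PoolSampler
{
    // Build a pool where each value v appears roughly p[v] * n times,
    // then Fisher-Yates shuffle it; deal values out sequentially afterwards.
    static int[] BuildShuffledPool(double[] p, int n, Random rand)
    {
        var pool = new List<int>();
        for (int v = 0; v < p.Length; v++)
            for (int k = 0; k < (int)Math.Round(p[v] * n); k++)
                pool.Add(v);
        int[] a = pool.ToArray(); // real size may be slightly less than n
        for (int i = a.Length - 1; i > 0; i--)
        {
            int j = rand.Next(i + 1);
            (a[i], a[j]) = (a[j], a[i]); // swap
        }
        return a;
    }
}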
The other variation of this avoids the use of an array and shuffling. It goes like this:
get a random value in the <0..1> range
apply the inverse cumulative distribution function to convert it to the target range
As you can see, this is like the #2 Apply a nonlinear function... approach, but instead of "some" nonlinear function you use the distribution directly. So if p(x) is the probability of x, in the <0..1> range where 1 means 100%, then we need a function that accumulates all the probabilities up to x (the cumulative distribution function). For integers:
f(x) = p(0)+p(1)+...+p(x)
Now we need the inverse function g() of it, so:
y = f(x)
x = g(y)
Now, if my memory serves me well, the generation should look like this:
y = random(); // <0..1>
x = g(y); // probability -> value
Many distributions have a known g() function, but for those that do not (or when we are too lazy to derive it) you can use binary search on f(x). Too lazy to code it, so here is a slower linear-search version:
for (x=0;x<max;x++) if (f(x)>=y) break;
So when it is all put together (using only p(x)) I get this (C++):
y=random(); // uniform distribution pseudo random value in range <0..1>
for (f=0.0,x=0;x<max;x++) // loop x through all values
{
f+=p(x); // f(x) cumulative distribution function
if (f>=y) break;
}
// here x is your pseudo random value following p(x) distribution
This kind of solution usually has very good statistical and randomness properties and does not require the distribution to be a continuous function (it can even be just an array of values instead).
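Since the question asks for C#, here is a minimal rendering of the same linear-search idea (p is assumed to be an array of per-value probabilities summing to 1):
// Sample an index following the discrete distribution p by walking
// the cumulative sum f(x) until it passes a uniform threshold y.
static int Sample(double[] p, Random rand)
{
    double y = rand.NextDouble(); // uniform in [0,1)
    double f = 0.0;               // running cumulative distribution
    for (int x = 0; x < p.Length - 1; x++)
    {
        f += p[x];
        if (f >= y) return x;
    }
    return p.Length - 1; // guard against floating-point rounding
}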
I'm trying to compute the cosine of 4203708359 radians in C#:
var x = (double)4203708359;
var c = Math.Cos(x);
(4203708359 can be exactly represented in double precision.)
I'm getting
c = -0.57977754519440394
Windows' calculator gives
c = -0.579777545198813380788467070278
PHP's cos(double) function (which internally just uses cos(double) from the C standard library) on Linux gives:
c = -0.57977754519881
C's cos(double) function in a simple C program compiled with Visual Studio 2017 gives
c = -0.57977754519881342
Here is the definition of Math.cos() in C#: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Math.cs#L57-L58
It appears to be a built-in function. I haven't (yet) dug into the C# compiler to check what this effectively compiles to, but that is probably the next step.
In the meantime:
Why is the precision so poor in my C# example, and what can I do about it?
Is it simply that the cosine implementation in the C# compiler deals poorly with large integer inputs?
Edit 1: Wolfram Mathematica 11.0:
In[1] := N[Cos[4203708359], 50]
Out[1] := -0.57977754519881338078846707027800171954257546099993
Edit 2: I do need that level precision, and I'm ready to go pretty far in order to obtain it. I'd be happy to use an arbitrary precision library if there exists a good one that supports cosine (my efforts haven't led to one so far).
Edit 3: I posted the question on coreclr's issue tracker: https://github.com/dotnet/coreclr/issues/12737
I think I might know the answer. I'm pretty sure the sin/cos libraries don't take arbitrarily large numbers and calculate the sin/cos of them directly - instead they reduce them down to small numbers (between 0 and 2π?) and calculate from there. I mean, cos(x) = cos(x + 2π) = cos(x + 4π) = ...
Problem is, how is the program supposed to reduce your 10-digit number down? Realistically, it should figure out how many times it needs to multiply 2π to get a value just below your number, then subtract that out. In your case, that's about 670 million.
So it's multiplying 2π by a 9-digit value - which means it's effectively losing 9 digits' worth of significance from the math library's version of pi.
I ended up writing a little function to test what was going on:
private double reduceDown(double start)
{
    decimal startDec = (decimal)start;
    // decimal keeps ~28 significant digits, so pi survives the reduction intact
    decimal pi = decimal.Parse("3.1415926535897932384626433832795");
    decimal tau = pi * 2;
    int num = (int)(startDec / tau);
    decimal x = startDec - (num * tau);
    double retVal;
    double.TryParse(x.ToString(), out retVal);
    return retVal;
    //return start - (num * tau);
}
All this is doing is using the decimal data type as a way of reducing down the value without losing digits of precision from pi - it still returns a double. When I call it with a modification of your code:
var x = (double)4203708359;
var c = Math.Cos(x);
double y = reduceDown(x);
double c2 = Math.Cos(y);
MessageBox.Show(c.ToString() + Environment.NewLine + c2);
return;
... sure enough, the second one is accurate.
So my advice is - if you really need radians that high, and you really need the accuracy? Do something like that function above, and reduce the number down on your end in a way that you don't lose digits of precision.
Presumably, the salts are stored along with each password. You could use the PHP code to calculate that cosine, and store that also with the password. I would then also add a password version number and default all those older passwords to be version 1. Then, in your C# code, for any new passwords, you implement a new hashing algorithm, and store those password hashes as passwords version 2. For any version 1 passwords, to authenticate, you do not have to calculate the cosine, you simply use the one stored along with the password hash and the salt.
The programmer of that PHP code was probably wanting to do a clever version of pepper. By storing that cosine, or pepper along with the salt and the password hashes, you basically change that pepper into a salt2. So, another versionless way of doing this would be to use two salts in your C# hashing code. For new passwords you could leave the second salt blank or assign it some other way. For old passwords, it would be that cosine, but it is already calculated.
Regarding this part of my question: "Why is the precision so poor in my C# example", coreclr developers answered here: https://github.com/dotnet/coreclr/issues/12737
In a nutshell, .NET Framework 4.6.2 (x86 and x64) and .NET Core (x86) appear to use Intel's x87 FP unit (i.e. fcos or fsincos) that gives inaccurate results while .NET Core on x64 (and PHP, Visual Studio 2017 and gcc) use more accurate, presumably SSE2-based implementations that give correctly rounded results.
I refer to the Rabin Karp Wikipedia article on Hash use.
In the example, the string "hi" is hashed using a prime number 101 as the base.
hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
Can such an algorithm be used practically in Java or C# where long has a maximum value of 9,223,372,036,854,775,807? Naively, to me it seems that the hash value grows exponentially and with a large enough N (being string length) will result in overflow of the long type. For example, say I have 65 characters in my string input for the hash?
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?
hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
That's only half the truth. In reality, if you actually computed the value s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n, the result would be a number whose representation is about as long as the string itself, so you wouldn't have gained anything. So what you actually do is to compute
(s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n) mod M
where M is reasonably small. Thus your hash value will always be smaller than M.
So what you do in practice is you choose M = 2^64 and make use of the fact that unsigned integer overflow is well-defined in most programming languages. In fact, multiplication and addition of 64-bit integers in Java, C++ and C# is equivalent to multiplication and addition modulo 2^64.
It's not necessarily a wise choice to use 2^64 as the modulus. In fact you can easily construct a string with lots of collisions, thus provoking the worst case behaviour of Rabin-Karp, which is Ω(n * m) matching instead of O(n + m).
It would be better to use a large prime as the modulus and get much better collision resistance. The reason this is usually not done is performance: we would need to explicitly apply modular reduction (a % M) after every addition and multiplication. What's worse, we can't even use the built-in multiplication anymore, because it could overflow if M > 2^32. So we need a custom MultiplyMod function, which is bound to be a lot slower than machine-level multiplication.
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?
As I already mentioned, if you don't reduce using a modulus, your hash value will grow as large as the string itself, thus rendering it useless to use a hash function in the first place. So yes, using controlled overflow modulo 2^64 is correct and even necessary if we don't manually reduce.
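A minimal C# sketch of this controlled-overflow hashing, using base 101 as in the question (ulong arithmetic wraps modulo 2^64 in C#'s default unchecked context):
// Polynomial string hash modulo 2^64, relying on ulong wraparound.
// PolyHash("hi") == 'h' * 101 + 'i' == 10609, matching the example above.
static ulong PolyHash(string s)
{
    const ulong p = 101;
    ulong h = 0;
    foreach (char c in s)
        h = h * p + c; // (h * p + c) mod 2^64, via wraparound
    return h;
}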
If your goal is a type of storage which contains only "small" numbers,
but where the sums can still be compared:
You could view this simply as a base-101 number system,
like 10 = decimal, 16 = hex, and so on.
I.e.
a) You have to store a set of { ASCII value and its power of 101 }
(without the possibility of multiple entries with the same power).
b) When creating the data from a string,
values >= 101 have to be carried over to the next power.
Example 1:
"a" is 97*101^0
(trivial)
Example 2:
"g" is 1*101^1 + 2*101^0
because g is 103. 103 >= 101, i.e. take only 103 % 101 for 101^0
(modulo, the remainder of division)
and (int)(103/101) for the next power.
(If the ASCII numbers could be higher, or the prime number were lower than 101,
it would be possible for (int)(103/101) to exceed the prime number too.
In that case, the carry would continue on to prime^2 and so on, until the value
is smaller than the prime number.)
Example 3:
"ag" is 98*101^1 + 2*101^0
Compared to the above, 97*101^1 is added because of the a.
And so on...
To compare without calculating the full sum,
just compare the values of one power to each other, for each power.
They are equal if all the "power values" are the same.
Side note: Be aware that ^ is not exponentiation in languages like C# and Java; there it is bitwise XOR.
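To make the carrying concrete, here is a small C# sketch of this base-101 digit representation (BuildDigits is a hypothetical helper, for illustration only):
// Represent a string as base-101 digits, where index = power of 101.
static int[] BuildDigits(string s)
{
    var digits = new int[s.Length + 8]; // spare room for carries
    for (int i = 0; i < s.Length; i++)
        digits[s.Length - 1 - i] += s[i]; // first char gets the highest power
    for (int k = 0; k < digits.Length - 1; k++)
    {
        digits[k + 1] += digits[k] / 101; // carry into the next power
        digits[k] %= 101;
    }
    return digits;
}
// BuildDigits("g") yields { 2, 1, ... }, i.e. 1*101^1 + 2*101^0 as in Example 2.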
Given two float values (fLow and fHigh), how could you calculate the greatest or maximum stride/gap between the two successive values?
For example:
In the range 16777217f to 20000000f the answer would be 2, as values are effectively rounded to the nearest two.
Generalizing this to an arbitrary range has got me scratching my head - any suggestions?
cheers,
This should be language neutral, but I'm using C# (which conforms to IEEE-754 for this, I think).
This is in C. It requires some IEEE 754 behavior, for rounding and such. For IEEE 754 64-bit binary (double), SmallestPositive is 2^-1074, approximately 4.9406564584124654417656879286822137236505980261e-324, and DBL_EPSILON is 2^-52, which is 2.220446049250313080847263336181640625e-16. For 32-bit binary (float), change DBL to FLT and double to float wherever they appear (and fabs to fabsf and fmax to fmaxf, although it should work without these changes). Then SmallestPositive is 2^-149, approximately 1.401298464324817070923729583289916131280261941876515771757068283889791e-45, and FLT_EPSILON is 2^-23, which is 1.1920928955078125e-07.
For an interval between two values, the greatest step size is of course the step size at the endpoint with larger magnitude. (If that endpoint is exactly a power of two, the step size from that point to the next does not appear in the interval itself, so that would be a special case.)
#include <float.h>
#include <math.h>
/* Return the ULP of q.
This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
05.12_, Faculty for Information and Communication Sciences, Hamburg
University of Technology, November 13, 2005.
*/
double ULP(double q)
{
// SmallestPositive is the smallest positive floating-point number.
static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
/* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
something in [.75 ULP, 1.5 ULP) (even with rounding).
*/
static const double Scale = 0.75 * DBL_EPSILON;
q = fabs(q);
return fmax(SmallestPositive, q - (q - q * Scale));
}
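Since the asker is in C#, a hedged equivalent is possible on .NET Core 3.0 or later, where Math.BitIncrement is available (this sketch does not handle NaN or double.MaxValue):
// ULP of q: the distance from |q| to the next representable double above it.
static double Ulp(double q)
{
    q = Math.Abs(q);
    return Math.BitIncrement(q) - q;
}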
Well, machine accuracy is, as the name indicates, really something that might in general depend on the machine and even on the compiler. So, to be really sure you will typically have to write a program that actually tests what is going on.
However, I suspect that you are really looking for some handy formulas that you can use to approximate the maximum distance in a given interval. The Wikipedia article on machine epsilon gives a really nice overview over this topic and I'm mostly quoting from this source in the following.
Let s be the machine epsilon of your floating point representation (i.e., about 2^(-24) in the case of standard floats), then the maximum spacing between a normalised number x and its neighbors is 2*s*|x|. The word normalised is really crucial here and I will not even try to consider the situation for de-normalised numbers because this is where things get really nasty...
That is, in your particular case the maximum spacing h in the interval you propose is given by h = 2*s*max(|fLow|, |fHigh|).
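In C#, for floats, that formula becomes (with s = 2^-24, per the definition above):
// Maximum spacing between adjacent normalised floats in [fLow, fHigh],
// using machine epsilon s = 2^-24.
static float MaxSpacing(float fLow, float fHigh)
{
    const float s = 1.0f / (1 << 24); // 2^-24
    return 2 * s * Math.Max(Math.Abs(fLow), Math.Abs(fHigh));
}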
I have a use case where I need to scramble an input in such a way that:
Each specific input always maps to a specific pseudo-random output.
The output must shuffle the input sufficiently so that an incrementing input maps to a pseudo-random output.
For example, if the input is 64 bits, there must be exactly 2^64 unique outputs, and these must break incrementing inputs as much as possible (arbitrary requirement).
I will code this in C#, but can translate from Java or C, as long as it doesn't use SIMD intrinsics. What I am looking for is some already-existing code, rather than reinventing the wheel.
I have looked on Google, but haven't found anything that does a 1:1 mapping.
This seems to work fairly well:
const long multiplier = 6364136223846793005;
const long mulinv_multiplier = -4568919932995229531;
const long offset = 1442695040888963407;
static long Forward(long x)
{
    return x * multiplier + offset;
}

static long Reverse(long x)
{
    return (x - offset) * mulinv_multiplier;
}
You can change the constants to whatever you like, as long as multiplier is odd and mulinv_multiplier is the modular multiplicative inverse (see wiki: modular multiplicative inverse, or Hacker's Delight 10-15 Exact Division by Constants) of multiplier (modulo 2^64, obviously - and that's why multiplier has to be odd; otherwise it has no inverse).
The offset can be anything, but make it relatively prime to 2^64 (i.e. odd) just to be on the safe side.
These specific constants come from Knuth's linear congruential generator.
There's one small thing: it puts the complement of the LSB of the input in the LSB of the result. If that's a problem, you could just rotate it by any nonzero amount.
For 32 bits, the constants can be multiplier = 0x4c957f2d, offset = 0xf767814f, mulinv_multiplier = 0x329e28a5.
For 64 bits, multiplier = 12790229573962758597, mulinv_multiplier = 16500474117902441741 may work better.
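A quick round-trip check (illustrative):
long original = 1234567890123456789;
long scrambled = Forward(original);
// Reverse recovers the input for every possible long, since the mapping is bijective.
Console.WriteLine(Reverse(scrambled) == original); // True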
Or you could use a CRC, which is reversible for this use case (i.e. when the input is the same size as the CRC); for CRC-64 it requires some modifications, of course.
Just off the top of my head:
Shift the input: make sure you keep every bit, i.e. use two shift operations in different directions and OR the results together (a bitwise rotation).
Apply a static XOR.
Everything else that comes to my mind won't be bijective. However, a search for bijective might bring up something useful ;D
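For what it's worth, a minimal C# sketch combining those two suggestions (the rotation amount and XOR mask are arbitrary; any nonzero rotation and any mask keep the mapping 1:1):
// Rotate left by 13 (two shifts ORed together), then XOR a constant.
// Both steps are invertible, so the whole mapping is bijective.
static ulong Scramble(ulong x)
{
    x = (x << 13) | (x >> (64 - 13)); // bitwise rotation keeps every bit
    return x ^ 0x9E3779B97F4A7C15UL;  // static XOR
}
static ulong Unscramble(ulong x)
{
    x ^= 0x9E3779B97F4A7C15UL;           // undo the XOR
    return (x >> 13) | (x << (64 - 13)); // rotate right by 13
}
Note that rotation and XOR alone do little to break up incrementing inputs, which is why the multiplicative approach above mixes much more thoroughly.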