binning-bucketing numerical values in .net - c#

Is there a .NET Framework function to bin (bucket) numerical values, for example to prepare data for a histogram chart?
I find it odd that I might have to code one up myself.
Probably I am not searching with the right keywords.

I don't think there is a function that will automatically prepare data for a histogram (including the calculation of the right number of buckets), but you can quite easily create histograms using Seq.countBy.
For example, given a sequence of numbers nums between -1 and 1, you can write something like:
nums
|> Seq.countBy (fun v -> round(v*10.0))
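This creates buckets of width 0.1 centred on multiples of 0.1, roughly ... (-0.15, -0.05), (-0.05, 0.05), (0.05, 0.15), ... etc., and it returns the count of numbers in each bucket, keyed by the rounded value. If you want buckets that start at multiples of 0.1 instead, i.e. ... (-0.1, 0.0), (0.0, 0.1), (0.1, 0.2), ..., use floor in place of round. If you pipe the result to the Chart.Bar function from F# Charting, then you'll get a reasonably nice histogram.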
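Since the question asks about C#, here is a minimal sketch of the same counting idea using LINQ's GroupBy. The class name Histogram and the method CountByBin are made up for this illustration and are not framework APIs. The sketch uses Math.Floor, so each bucket starts at a multiple of the chosen width.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the .NET Framework.
public static class Histogram
{
    // Groups values into buckets of the given width and counts each bucket.
    // Key = lower edge of the bucket, Value = number of items that fell into it.
    public static SortedDictionary<double, int> CountByBin(IEnumerable<double> values, double width)
    {
        var counts = new SortedDictionary<double, int>();
        foreach (var group in values.GroupBy(v => Math.Floor(v / width)))
        {
            counts[group.Key * width] = group.Count();
        }
        return counts;
    }
}
```

Calling Histogram.CountByBin(nums, 0.1) returns the bucket counts sorted by lower edge. Empty buckets are simply absent from the result, so a chart may need them filled in with zero.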

Related

Random number within a range biased towards the minimum value of that range

I want to generate random numbers within a range (1 - 100000), but instead of purely random I want the results to be based on a kind of distribution. What I mean is that in general I want the numbers "clustered" around the minimum value of the range (1).
I've read about Box–Muller transform and normal distributions but I'm not quite sure how to use them to achieve the number generator.
How can I achieve such an algorithm using C#?
There are many ways of doing this with a uniformly distributed PRNG. Here are the few I know of:
1. Combine more uniform random variables to obtain the desired distribution.
I am not a math guy, but there surely are equations for this. This kind of solution usually has the best properties from a randomness and statistical point of view. For more info see the famous:
Understanding “randomness”.
However, there is only a limited number of distributions for which we know the combinations.
2. Apply non linear function on uniform random variable
This is the simplest to implement. You take floating-point randoms in the <0..1> range, apply your non-linear function to them (one that changes the distribution towards your wanted shape while the result stays in the <0..1> range), and rescale the result into your integer range, for example (in C++):
floor( pow( random(),5 ) * 100000 )
The problem is that this is just blind fitting of the distribution, so you usually need to tweak the constants a bit. It is a good idea to render histogram and randomness graphs to see the quality of the result directly, like in here:
How to seed to generate random numbers?
You can also avoid fitting too blindly by using Bezier curves, like in here:
Random but most likely 1 float
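As a rough C# counterpart to the C++ one-liner above, here is a minimal sketch. The class name BiasedRandom and the exponent parameter are assumptions made for this illustration. The exponent 5 from the C++ example is only a starting point that you would tune by looking at a histogram.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BiasedRandom
{
    // Returns an integer in [min, max], clustered toward min.
    // A larger exponent pushes more of the mass toward min; exponent = 1 gives a uniform result.
    public static int Next(Random rng, int min, int max, double exponent)
    {
        double u = rng.NextDouble();            // uniform in [0, 1)
        double skewed = Math.Pow(u, exponent);  // still in [0, 1), but biased toward 0
        int offset = (int)Math.Floor(skewed * (max - min + 1));
        return Math.Min(max, min + offset);     // clamp guards against rounding at the top end
    }
}
```

For the range in the question this would be called as BiasedRandom.Next(rng, 1, 100000, 5.0). With exponent 5, about 87% of the results land in the lower half of the range, because 0.5^(1/5) is roughly 0.87.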
3. Distribution following pseudo random generator
There are two approaches I know of for this. The simpler one is:
create a big enough array of size n
fill it with all values following the distribution
So simply loop through all the values you want to output, compute how many of each will be in the n-sized array (from your distribution), and add that many copies of the number to the array. Beware that the filled size of the array might be slightly less than n due to rounding. If n is too small you will be missing some of the less frequent numbers, so the probability of the least probable number multiplied by n should be at least 1. After the filling, change n to the real array size (the number of values actually stored in it).
shuffle the array
now use the array as a linear list of random numbers
So instead of random() you just pick a number from the array and move to the next one. Once you reach the n-th value, shuffle the array and start from the first one again.
This solution has very good statistical properties (it follows the distribution exactly), but the randomness properties are not good and it requires an array and occasional shuffling. For more info see:
How to efficiently generate a set of unique random numbers with a predefined distribution?
The other variation of this is to avoid use of array and shuffling. It goes like this:
get random value in range <0..1>
apply inverse cumulated distribution function to convert to target range
As you can see, it's like the #2 Apply non linear function... approach, but instead of "some" non-linear function you use the distribution directly. So if p(x) is the probability of x in range <0..1>, where 1 means 100%, then we need a function that accumulates all the probabilities up to x (the cumulative distribution function). For integers:
f(x) = p(0)+p(1)+...+p(x)
Now we need inverse function g() to it so:
y = f(x)
x = g(y)
Now if my memory serves me well then the generation should look like this:
y = random(); // <0..1>
x = g(y); // probability -> value
Many distributions have a known g() function, but for those that do not (or when we are too lazy to derive it) you can use binary search on f(x), since f(x) never decreases. I was too lazy to code it, so here is the slower linear search version:
for (x=0;x<max;x++) if (f(x)>=y) break;
So when put all together (and using only p(x)) I got this (C++):
y=random(); // uniform distribution pseudo random value in range <0..1>
for (f=0.0,x=0;x<max;x++) // loop x through all values
{
f+=p(x); // f(x) cumulative distribution function
if (f>=y) break;
}
// here x is your pseudo random value following p(x) distribution
This kind of solution usually has very good statistical and randomness properties and does not require the distribution to be a continuous function (it can even be just an array of values instead).
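Since the question asks for C#, here is a minimal sketch of the inverse cumulative distribution approach, using the binary search mentioned above instead of the linear scan. The class name DiscreteSampler and its method names are assumptions made for this illustration. The probabilities p(x) are assumed to come as an array that sums to 1.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class DiscreteSampler
{
    // Builds the running sum f(x) = p(0) + p(1) + ... + p(x) from a table of probabilities.
    public static double[] BuildCumulative(double[] p)
    {
        var cdf = new double[p.Length];
        double sum = 0.0;
        for (int x = 0; x < p.Length; x++)
        {
            sum += p[x];
            cdf[x] = sum;
        }
        return cdf;
    }

    // Maps a uniform y in [0, 1) to the smallest x with f(x) >= y, using binary search.
    // If rounding leaves the last entry slightly below y, the last index is returned.
    public static int Sample(double[] cdf, double y)
    {
        int lo = 0, hi = cdf.Length - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (cdf[mid] >= y) hi = mid;
            else lo = mid + 1;
        }
        return lo;
    }
}
```

A typical use would be to build the table once with BuildCumulative and then call Sample(cdf, rng.NextDouble()) for every value needed. Each draw then costs O(log n) instead of the O(n) of the linear search.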

Calculation of orig, norm attributes of NormContinuous in PMML

Overview
I am currently working on a normalization PMML model executor in C#.
These PMML normalization models look like this:
<TransformationDictionary>
<DerivedField displayName="BU01" name="BU01*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 17 column(s)"/>
<NormContinuous field="BU01">
<LinearNorm orig="0.0" norm="-0.6148417019560395"/>
<LinearNorm orig="1.0" norm="-0.6140350877192982"/>
</NormContinuous>
</DerivedField>
(...)
I do know how min-max normalization in theory works using
z_i = (x_i - min(x)) / (max(x) - min(x))
to normalize a dataset into the range of 0-1 and obviously it's not hard to reverse this equation.
Problem
So to execute the normalization and denormalization I somehow have to translate these orig, norm values into min, max values. But I just can't figure out how these orig/norm values are calculated and how they relate to min/max.
Question
So I'm asking whether someone knows an equation to transform orig/norm to min/max and back. Or is someone able to explain how to use the orig/norm values directly to normalize/denormalize my fields?
Further Explanation
EDIT: It looks as if I did not state clearly what exactly the problem is, so here is another approach:
I try to get an attribute of a dataset normalized into the range 0-1 using the Min-Max normalization method (aka Feature Scaling). Using the data analysis tool Knime I can do this and export my "scaling" as a PMML model. (An example of this is the XML provided above.)
With these normalized attributes I train my MLP model. Now if I export my MLP model as PMML, I have to put normalized values in and get normalized output out when calculating a prediction. (Computing the MLP network already works.)
In a deployed scenario where Knime can't do this normalization for me, I want to use my normalization model. As already described, I do know the theory behind Feature Scaling and can easily compute de-/normalization if I am provided with the min and max of my attribute. The problem is that PMML has another, let's say, "notation" for saving this min-max information, which is somehow inside the orig and norm values.
So what I am ultimately looking for is a way to convert orig/norm to min/max, or an explanation of how the min/max information is "encoded" into the orig/norm values.
Extra Info
[Why this "encoding" is done in the first place seems to be for reasons of computation speed (which is not important in my scenario) and to make it easier to encode min/max normalization info for ranges other than 0-1.]
Example #1
To give an example:
Let's say I want to normalize the array of [0, 1, 2, 4, 8] into the range of 0-1. Clearly the answer is [0, 0.125, 0.25, 0.5, 1] as computed by Feature Scaling with min = 0, max = 8. Easy. But now if I look at the PMML normalization Model:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="0.0"/>
<LinearNorm orig="1.0" norm="0.125"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Example #2
[1, 2, 4, 8] -> [0, 0.333, 0.667, 1]
With:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="-0.3333333333333333"/>
<LinearNorm orig="1.0" norm="0.0"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Question
So how am I supposed to scale with orig/norm or compute min/max from these values?
What I'm about to say depends on what you mean by (min, max).
I'm going to assume that min equals the value where 0.5% of the total lies below and max equals the value where 0.5% of the total lies above.
If we agree on that, a symmetric normal distribution would have a mean value of approximately mean ~ (max+min)/2. (You call the mean the origin.)
Six standard deviations (plus or minus three sigma) encompass about 99.7% of a normal distribution, so the standard deviation is approximately sigma ~ (max-min)/6.
The definition of normalized z = (x - mean)/sigma.
With those values you can get yourself back to the denormalized distribution.
Found the answer. After carefully reading through the documentation again (which is extremely confusing, in my opinion) I came across this sentence:
The sequence of LinearNorm elements defines a sequence of points for a stepwise linear interpolation function. The sequence must contain at least two elements. Within NormContinuous the elements LinearNorm must be strictly sorted by ascending value of orig.
This basically explains it all. Normalization in PMML is done by using a stepwise interpolation with only 2 points, so it is in fact just a simple linear conversion function.
In the case of normalization into the range 0-1 it gets even easier, as the two points will always be at x1=0 and x2=1 (the orig values). The y-axis intercept is therefore always the norm value at orig=0. The slope of the function is also very easy to calculate: slope = (y2-y1)/(x2-x1) = (y2-y1)/(1-0) = y2-y1, which is just the difference of the two norm values.
So to get our interpolation function, which will always be a first-degree polynomial, we just calculate:
f(x) = ax + b = (y2-y1)x + y1 = (norm(orig=1)-norm(orig=0)) * x + norm(orig=0)
This is used for normalization. Now we can calculate the inverse:
x = (f(x) - norm(orig=0)) / (norm(orig=1)-norm(orig=0))
This is used for de-normalization.
Hope this helps everyone who will someday also go through the hassle of implementing their own PMML executor engine and gets stuck on this topic.
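The two formulas above can be sketched in C# as follows. This is only an illustration under the assumption that the model contains exactly two LinearNorm points, at orig=0 and orig=1, as in the Knime exports shown in the question. The class name LinearNorm and the parameters norm0 and norm1 (the norm values at orig=0 and orig=1) are names chosen here. RecoverMinMax additionally assumes the target range is 0-1, so that min is the input mapped to 0 and max is the input mapped to 1.

```csharp
using System;

// Hypothetical helper; norm0 and norm1 are the norm attributes at orig=0 and orig=1.
public static class LinearNorm
{
    // Normalization: f(x) = (norm1 - norm0) * x + norm0.
    public static double Normalize(double x, double norm0, double norm1)
    {
        return (norm1 - norm0) * x + norm0;
    }

    // De-normalization: the inverse of Normalize.
    public static double Denormalize(double z, double norm0, double norm1)
    {
        return (z - norm0) / (norm1 - norm0);
    }

    // For a 0-1 min-max scaling, min is where f(x) = 0 and max is where f(x) = 1.
    public static (double Min, double Max) RecoverMinMax(double norm0, double norm1)
    {
        double slope = norm1 - norm0;
        return (-norm0 / slope, (1.0 - norm0) / slope);
    }
}
```

With the values from Example #1 (norm0 = 0.0, norm1 = 0.125), Normalize(8, 0.0, 0.125) gives 1 and RecoverMinMax gives min = 0 and max = 8, which matches the array [0, 1, 2, 4, 8] used there. General PMML allows more than two points and separate outlier handling, which this sketch ignores.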

calculate sum of numbers closest to a given number

I want to find out what is the best way to do this in C#:
I have an array of, let's say, 20 numbers, and then one additional variable.
I want to get the sum of numbers from the array that is closest to the given variable.
Let's say I have 1.1, 1.5, 1.7, 1.9, 2.2, 3.1, 3.2, 1.5, 4.5, 4.1, and the additional variable has a value of 5.
I want to get the sum of some numbers in the array that is closest to the given number, and once I have found it, remove those numbers from the list and add them to a new array.
Every comment is welcomed.
Thanks
You are describing the optimization version of the Subset Sum Problem.
The problem is NP-hard (its decision version is NP-complete), so there is no known polynomial-time solution to it.
However, since the input is fairly small, an exponential solution that checks all subsets is feasible: there are only 2^20 ~= 1000000 of them (a bit more, actually, but close enough for estimating run time).
Pseudo code should be something like:
getClosestSum(list, sum, number):
    if (list is empty):
        return sum
    candidate1 <- getClosestSum(list[1:], sum, number)
    candidate2 <- getClosestSum(list[1:], sum + list[0], number)
    if (abs(number - candidate1) < abs(number - candidate2)):
        return candidate1
    else:
        return candidate2
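For C#, the exhaustive search can be sketched as below. This is an illustrative variant, not the recursive pseudocode itself: it enumerates subsets with a bit mask so that it can also return which numbers were chosen, which the question asks for. The class name ClosestSubset and the method Find are names assumed for this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class ClosestSubset
{
    // Tries every subset (2^n of them) and returns the one whose sum is closest to target.
    // Only practical for small inputs, roughly n <= 25.
    public static List<double> Find(IReadOnlyList<double> numbers, double target)
    {
        int n = numbers.Count;
        long bestMask = 0;
        double bestDiff = Math.Abs(target);   // the empty subset has sum 0

        for (long mask = 1; mask < (1L << n); mask++)
        {
            // Bit i of mask says whether numbers[i] belongs to this subset.
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                if ((mask & (1L << i)) != 0) sum += numbers[i];

            double diff = Math.Abs(target - sum);
            if (diff < bestDiff) { bestDiff = diff; bestMask = mask; }
        }

        var chosen = new List<double>();
        for (int i = 0; i < n; i++)
            if ((bestMask & (1L << i)) != 0) chosen.Add(numbers[i]);
        return chosen;
    }
}
```

The returned list can then be removed from the original list and stored in the new array. For 20 numbers this performs about 20 million additions, which should finish quickly on current hardware.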

Stable distribution random numbers?

How to generate random numbers with a stable distribution in C#?
The Random class has a uniform distribution. Much of the other code on the
internet shows a normal distribution. But we need a stable distribution,
meaning infinite variance, a.k.a. a fat-tailed distribution.
The reason is for generating realistic stock prices. In the real
world, huge variations in prices are far more likely than in
normal distributions.
Does someone know the C# code to convert Random class output
into stable distribution?
Edit: Hmmm. Exact distribution is less critical than ensuring it will randomly generate huge sigma like at least 20 sigma. We want to test a trading strategy for resilience in a true fat tailed distribution which is exactly how stock market prices behave.
I just read about Zipfian and Cauchy thanks to the comments. Since I must pick, let's go with the Cauchy distribution, but I will also try Zipfian to compare.
In general, the method is:
Choose a stable, fat-tailed distribution. Say, the Cauchy distribution.
Look up the quantile function of the chosen distribution.
For the Cauchy distribution, that would be p --> peak + scale * tan( pi * (p - 0.5) ).
And now you have a method of transforming uniformly-distributed random numbers into Cauchy-distributed random numbers.
Make sense? See
http://en.wikipedia.org/wiki/Inverse_transform_sampling
for details.
Caveat: It has been a long, long time since I took statistics.
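A minimal C# sketch of that recipe for the Cauchy case follows. The class name Cauchy and the method names are assumptions made for this illustration. The quantile formula is the one quoted above, with peak as the location parameter and scale as the scale parameter.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class Cauchy
{
    // Quantile function (inverse CDF): maps p in (0, 1) to a Cauchy-distributed value.
    public static double Quantile(double p, double peak, double scale)
    {
        return peak + scale * Math.Tan(Math.PI * (p - 0.5));
    }

    // Draws one sample by feeding a uniform random number through the quantile function.
    public static double Next(Random rng, double peak, double scale)
    {
        double p;
        do { p = rng.NextDouble(); } while (p == 0.0);  // skip the p = 0 endpoint
        return Quantile(p, peak, scale);
    }
}
```

Calling Cauchy.Next(rng, 0.0, 1.0) repeatedly will occasionally produce very large values, which is the fat-tailed behaviour the question is after.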
UPDATE:
I liked this question so much I just blogged it: see
http://ericlippert.com/2012/02/21/generating-random-non-uniform-data/
My article exploring a few interesting examples of Zipfian distributions is here:
http://blogs.msdn.com/b/ericlippert/archive/2010/12/07/10100227.aspx
If you're interested in using the Zipfian distribution (which is often used when modeling processes from the sciences or social domains), you would do something along the lines of:
Select your k (skew) for the distribution
Precompute the domain of the cumulative distribution (this is just an optimization)
Generate random values for the distribution by finding the nearest value from the domain
Sample Code:
List<int> domain = Enumerable.Range(1, 1000).ToList(); // generate your domain (ranks 1..1000)
double skew = 0.37; // select a skew appropriate to your domain
// normalization constant: the sum of 1/k^skew over the whole domain
double sigma = domain.Sum(k => 1.0 / Math.Pow(k, skew));
// running sum of the normalized weights; the last entry is (approximately) 1
double running = 0.0;
List<double> cummDist = domain
    .Select(k => running += 1.0 / Math.Pow(k, skew) / sigma)
    .ToList();
Now you can generate random values by selecting the closest value from within the domain:
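Random rand = new Random();
double seek = rand.NextDouble();
int searchIndex = cummDist.BinarySearch(seek);
// when there is no exact match, BinarySearch returns the bitwise complement
// of the index of the first element larger than seek
int index = searchIndex < 0 ? ~searchIndex : searchIndex;
// return the index of the closest value from the distribution domain
return Math.Min(index, domain.Count - 1);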
You can, of course, generalize this entire process by factoring out the logic that materializes the domain of the distribution from the process that maps and returns a value from that domain.
I have before me James Gentle's Springer volume on this topic, Random Number Generation and Monte Carlo Methods, courtesy of my statistician wife. It discusses the stable family on page 105:
The stable family of distributions is a flexible family of generally heavy-tailed distributions. This family includes the normal distribution at one extreme value of one of the parameters and the Cauchy at the other extreme value. Chambers, Mallows, and Stuck (1976) give a method for generating deviates from stable distributions. (Watch for some errors in the constants in the auxiliary function D2, for evaluating (e^x-1)/x.) Their method is used in the IMSL libraries. For a symmetric stable distribution, Devroye (1986) points out that a faster method can be developed by exploiting the relationship of the symmetric stable to the Fejér-de la Vallée Poussin distribution. Buckle (1995) shows how to simulate the parameters of a stable distribution, conditional on the data.
Generating deviates from the generic stable distribution is hard. If you need to do this then I would recommend a library such as IMSL. I do not advise you attempt this yourself.
However, if you are looking for a specific distribution in the stable family, e.g. Cauchy, then you can use the method described by Eric, known as the probability integral transform. So long as you can write down the inverse of the distribution function in closed form then you can use this approach.
The following C# code generates a random number following a stable distribution given the shape parameters alpha and beta. I release it to the public domain under Creative Commons Zero.
public static double StableDist(Random rand, double alpha, double beta){
    if(alpha<=0 || alpha>2 || beta<-1 || beta>1)
        throw new ArgumentException();
    var halfpi=Math.PI*0.5;
    var unif=NextDouble(rand);
    while(unif == 0.0)unif=NextDouble(rand);
    unif=(unif - 0.5) * Math.PI;
    // Cauchy special case
    if(alpha==1 && beta==0)
        return Math.Tan(unif);
    var expo=-Math.Log(1.0 - NextDouble(rand));
    var c=Math.Cos(unif);
    if(alpha == 1){
        var s=Math.Sin(unif);
        return 2.0*((unif*beta+halfpi)*s/c -
            beta * Math.Log(halfpi*expo*c/(
                unif*beta+halfpi)))/Math.PI;
    }
    var z=-Math.Tan(halfpi*alpha)*beta;
    var ug=unif+Math.Atan(-z)/alpha;
    var cpow=Math.Pow(c, -1.0 / alpha);
    return Math.Pow(1.0+z*z, 1.0 / (2*alpha))*
        (Math.Sin(alpha*ug)*cpow)*
        Math.Pow(Math.Cos(unif-alpha*ug)/expo, (1.0-alpha) / alpha);
}
private static double NextDouble(Random rand){
    // The default NextDouble implementation in .NET (see
    // https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/shared/System/Random.cs)
    // is very problematic:
    // - It generates a random number 0 or greater and less than 2^31-1 in a
    //   way that very slightly biases 2^31-2.
    // - Then it divides that number by 2^31-1.
    // - The result is a number that uses roughly only 32 bits of pseudorandomness,
    //   even though `double` has 53 bits in its significand.
    // To alleviate some of these problems, this method generates a random 53-bit
    // random number and divides that by 2^53. Although this doesn't fix the bias
    // mentioned above (for the default System.Random), this bias may be of
    // negligible importance for most purposes not involving security.
    long x=rand.Next(0,1<<30);
    x<<=23;
    x+=rand.Next(0,1<<23);
    return (double)x / (double)(1L<<53);
}
In addition, I set forth pseudocode for the stable distribution in a separate article.

C# Date/Numerical Axis Scale

I'm developing a histogram container class and I'm trying to determine where the cutoff points should be for the bins. I'd like the cutoff points to be nice-looking numbers, in much the same way that graph axes are scaled.
To distill my request into a basic question: is there a basic method by which data axis labels can be determined from a list of numbers?
For example:
Array{1,6,8,5,12,15,22}
It would make sense to have 5 bins.
Bin Start   Count
0           1
5           3
10          2
15          0
20          1
The bin start stuff is identical to selecting axis labels on a graph in this instance.
For the purpose of this question I don't really care about bins and the histogram, I'm more interested in the graph scale axis label portion of the question.
I will be using C# 4.0 for my app, so nifty solutions using LINQ are welcome.
I've attempted stuff like this in the distant past using some log base 10 scaling stuff, but I never got it to work in great enough detail for this application. I don't want to do log scaling, I just used base 10 to round to nearest whole numbers. I'd like it to work for large numbers and very small numbers and possibly dates too; although dates can be converted to doubles and parsed that way.
Any resources on the subject would be greatly appreciated.
You could start with something simple:
NUM_BINS is a passed argument or constant (e.g. NUM_BINS = 10)
x is your array of x-values (e.g. int[] x = new int[50])
int numBins = x.Length < NUM_BINS ? x.Length : NUM_BINS;
At this point you could calculate a histogram of the x values, and if they are heavily weighted to one side of the distribution (maybe just count left of midpoint vs. right of midpoint), then use log/exp divisions over the range of x[]. If the histogram is flat, use linear divisions.
double[] xAxis = new double[numBins];
double range = x[x.Length-1] - x[0]; // assumes x is sorted in ascending order
CalcAxisValues(xAxis, range, TYPE); // TYPE is an enum of LOG, EXP, or LINEAR
CalcAxisValues is a function you would write yourself; it would then equally space the points based on the TYPE.
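For the "nice-looking numbers" part of the question, one common technique (a different one from the answer above) is to round the raw step to 1, 2 or 5 times a power of ten, in the spirit of Heckbert's "nice numbers" for graph labels. The following is a minimal sketch of that idea for linear axes only. The class name NiceAxis, its method names and the targetBins parameter are assumptions made for this illustration, and the sketch expects max to be greater than min.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only.
public static class NiceAxis
{
    // Rounds a positive raw step to a nearby "nice" value: 1, 2, 5 or 10 times a power of ten.
    public static double NiceStep(double rawStep)
    {
        double magnitude = Math.Pow(10, Math.Floor(Math.Log10(rawStep)));
        double fraction = rawStep / magnitude;   // roughly in [1, 10)
        double nice = fraction < 1.5 ? 1 : fraction < 3 ? 2 : fraction < 7 ? 5 : 10;
        return nice * magnitude;
    }

    // Returns bin start values that cover [min, max] using roughly targetBins bins.
    public static double[] BinStarts(double min, double max, int targetBins)
    {
        double step = NiceStep((max - min) / targetBins);
        double first = Math.Floor(min / step) * step;           // snap down to a multiple of step
        int count = (int)Math.Floor((max - first) / step) + 1;  // last bin must contain max
        return Enumerable.Range(0, count).Select(i => first + i * step).ToArray();
    }
}
```

For the array in the question, NiceAxis.BinStarts(1, 22, 5) yields the bin starts 0, 5, 10, 15 and 20. Dates could be handled by converting them to doubles first, as the question suggests, although calendar-aware steps (days, months) would need their own list of nice values.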
