Calculation of orig, norm attributes of NormContinuous in PMML - C#

Overview
I am currently working on a normalization PMML model executor in C#.
These PMML normalization models look like this:
<TransformationDictionary>
<DerivedField displayName="BU01" name="BU01*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 17 column(s)"/>
<NormContinuous field="BU01">
<LinearNorm orig="0.0" norm="-0.6148417019560395"/>
<LinearNorm orig="1.0" norm="-0.6140350877192982"/>
</NormContinuous>
</DerivedField>
(...)
I do know how min-max normalization works in theory, using
z_i = (x_i - min(x)) / (max(x) - min(x))
to normalize a dataset into the range 0-1, and obviously it's not hard to reverse this equation.
Problem
So to execute the normalization and denormalization I somehow have to translate these orig, norm values into min, max values. But I just can't figure out how these orig/norm values are calculated and how they relate to min/max.
Question
So I'm asking if someone knows an equation to transform orig/norm to min/max and back. Or is someone able to explain how to use orig/norm values directly to normalize/denormalize my fields?
Further Explanation
EDIT: It looks as if I did not state clearly what the problem is, so here is another approach:
I try to get an attribute of a dataset normalized into the range 0-1 using the min-max normalization method (aka feature scaling). Using the data analysis tool KNIME I can do this and export my "scaling" as a PMML model. (An example of this is the XML provided above.)
With these normalized attributes I train my MLP model. Now if I export my MLP model as PMML, I have to put normalized values in and get normalized output out when calculating a prediction. (Computing the MLP network already works.)
In a deployed scenario where KNIME can't do this normalization for me, I want to use my normalization model. As already described, I do know the theory behind feature scaling and can easily compute de-/normalization if I am provided with the min and max of my attribute. The problem is that PMML has another, let's say, "notation" for saving this min-max information, which is somehow encoded in the orig and norm values.
So what I am ultimately looking for is a way to convert orig/norm to min/max or how min/max information is "encoded" into orig/norm values.
Extra Info
[Why this "encoding" is done in the first place seems to be for computation speed reasons (which are not important in my scenario) and to more easily encode min/max normalization info for ranges other than 0-1.]
Example #1
To give an example:
Let's say I want to normalize the array [0, 1, 2, 4, 8] into the range 0-1. Clearly the answer is [0, 0.125, 0.25, 0.5, 1], as computed by feature scaling with min = 0, max = 8. Easy. But now look at the PMML normalization model:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="0.0"/>
<LinearNorm orig="1.0" norm="0.125"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Example #2
[1, 2, 4, 8] -> [0, 0.333, 0.667, 1]
With:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="-0.3333333333333333"/>
<LinearNorm orig="1.0" norm="0.0"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Question
So how am I supposed to scale with orig/norm or compute min/max from these values?

What I'm about to say depends on what you mean by (min, max).
I'm going to assume that min equals the value where 0.5% of the total lies below and max equals the value where 0.5% of the total lies above.
If we agree on that, a symmetric normal distribution would have a mean value of approximately mean ~ (max+min)/2. (You call the mean the origin.)
Six standard deviations (three on each side of the mean) encompass about 99.7% of a normal distribution, so the standard deviation is approximately sigma ~ (max-min)/6.
The definition of normalized z = (x - mean)/sigma.
With those values you can get yourself back to the denormalized distribution.

Found the answer. After carefully reading through the documentation again (which is extremely confusing imo) I came across this sentence:
The sequence of LinearNorm elements defines a sequence of points for a stepwise linear interpolation function. The sequence must contain at least two elements. Within NormContinuous the elements LinearNorm must be strictly sorted by ascending value of orig.
Which basically explains it all. Normalization in PMML is done using a stepwise linear interpolation, here with only 2 points. So it is in fact just a simple linear conversion function.
In the case of normalization into the range 0-1 it gets even easier, as the two points will always be at x1=0 and x2=1 (the orig values). The function will therefore always have its y-axis intercept at the norm value of orig=0. As far as the slope of the function is concerned, it is also very easy to calculate: slope = (y2-y1)/(x2-x1) = (y2-y1)/(1-0) = y2-y1, which is just the difference of the two norm values.
So to get our interpolation function, which will always be a first-degree polynomial, we just calculate:
f(x) = ax + b = (y2-y1)x + y1 = (norm(orig=1) - norm(orig=0)) * x + norm(orig=0) This is used for normalization.
and now we can calculate the inverse:
x = (f(x) - norm(orig=0)) / (norm(orig=1) - norm(orig=0)) This is used for de-normalization.
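The two formulas above translate directly into code. Here is a minimal Python sketch (the function name is mine; a C# version is a one-to-one translation), using the norm values from Example #1:

```python
def make_normalizer(norm_at_orig0, norm_at_orig1):
    """Build normalize/denormalize functions from the two LinearNorm
    points (orig=0, norm=y1) and (orig=1, norm=y2)."""
    slope = norm_at_orig1 - norm_at_orig0    # y2 - y1
    intercept = norm_at_orig0                # y1

    def normalize(x):
        return slope * x + intercept         # f(x) = (y2-y1)x + y1

    def denormalize(y):
        return (y - intercept) / slope       # inverse of f

    return normalize, denormalize

# Example #1 from above: orig=0 -> norm=0.0, orig=1 -> norm=0.125
normalize, denormalize = make_normalizer(0.0, 0.125)
```

With min = 0 and max = 8, normalize(8) gives 1.0 and denormalize(1.0) gives 8, matching the feature-scaling result.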
Hope this helps everyone who will someday also go through the hassle of implementing their own PMML executor engine and gets stuck on this topic.

Related

Random number within a range biased towards the minimum value of that range

I want to generate random numbers within a range (1 - 100000), but instead of being purely random I want the results to be based on a kind of distribution. What I mean is that in general I want the numbers "clustered" around the minimum value of the range (1).
I've read about Box–Muller transform and normal distributions but I'm not quite sure how to use them to achieve the number generator.
How can I achieve such an algorithm using C#?
There are a lot of ways of doing this (using a uniform-distribution PRNG); here are a few I know of:
Combine more uniform random variables to obtain the desired distribution.
I am not a math guy, but there surely are equations for this. This kind of solution usually has the best properties from the randomness and statistical points of view. For more info see the famous:
Understanding “randomness”.
but there is only a limited number of distributions we know the combinations for.
Apply a non-linear function to a uniform random variable
This is the simplest to implement. You simply take floating-point randoms in the <0..1> range, apply your non-linear function (which changes the distribution towards your wanted shape) to them (while the result is still in the <0..1> range) and rescale the result into your integer range, for example (in C++):
floor( pow( random(), 5 ) * 100000 )
The problem is that this is just blind fitting of the distribution, so you usually need to tweak the constants a bit. It's a good idea to render histogram and randomness graphs to see the quality of the result directly, like in here:
How to seed to generate random numbers?
You can also avoid too blind fitting with BEZIERS like in here:
Random but most likely 1 float
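The same power-function approach from #2 above, as a self-contained Python sketch (the function name and the exponent 5 are just illustration choices):

```python
import random

def biased_low(max_value, exponent=5):
    """Uniform random in [0, 1), pushed toward 0 by raising it to a
    power, then rescaled to an integer in [0, max_value)."""
    return int(random.random() ** exponent * max_value)

# draw a batch; values should cluster near the minimum of the range
samples = [biased_low(100000) for _ in range(10000)]
```

A larger exponent clusters the output more tightly around the minimum; exponent 1 gives back the plain uniform distribution.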
Distribution following pseudo random generator
There are two approaches I know of for this; the simpler is:
create big enough array of size n
fill it with all values following the distribution
so simply loop through all values you want to output and compute how many of them will be in the n-sized array (from your distribution), and add that count of each number into the array. Beware: the filled size of the array might be slightly less than n due to rounding. If n is too small you will be missing some of the less frequently occurring numbers, so if you multiply the probability of the least probable number by n it should be at least >= 1. After the filling, change n into the real array size (the number of actually filled numbers in it).
shuffle the array
now use the array as linear list of random numbers
so instead of random() you just pick a number from the array and move on to the next one. Once you get to the n-th value, shuffle the array and start from the first one again.
This solution has very good statistical properties (it follows the distribution exactly), but the randomness properties are not good, and it requires an array and occasional shuffling. For more info see:
How to efficiently generate a set of unique random numbers with a predefined distribution?
The other variation of this avoids the use of an array and shuffling. It goes like this:
get a random value in the range <0..1>
apply the inverse cumulative distribution function to convert it to the target range
As you can see, it's like the #2 Apply non linear function... approach, but instead of "some" non-linear function you use the distribution directly. So if p(x) is the probability of x in the range <0..1>, where 1 means 100%, then we need a function that accumulates all the probabilities up to x (the cumulative distribution function). For integers:
f(x) = p(0)+p(1)+...+p(x)
Now we need inverse function g() to it so:
y = f(x)
x = g(y)
Now if my memory serves me well then the generation should look like this:
y = random(); // <0..1>
x = g(y); // probability -> value
Many distributions have a known g() function, but for those that do not (or when we are too lazy to derive it) you can use binary search on f(x). Too lazy to code it, so here is a slower linear-search version:
for (x=0;x<max;x++) if (f(x)>=y) break;
So when putting it all together (and using only p(x)) I got this (C++):
y=random(); // uniform distribution pseudo random value in range <0..1>
for (f=0.0,x=0;x<max;x++) // loop x through all values
{
f+=p(x); // f(x) cumulative distribution function
if (f>=y) break;
}
// here x is your pseudo random value following p(x) distribution
This kind of solution usually has very good statistical and randomness properties, and it does not require that the distribution be a continuous function (it can even be just an array of values instead).
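The same cumulative-sum search can be sketched in a few lines of Python. The probability table below is a toy example of my own (geometrically decaying, weighted toward 0); any table summing to 1 works:

```python
import random

def sample(p):
    """Draw one value following the discrete distribution p (a list of
    probabilities summing to 1) via linear search on the running CDF."""
    y = random.random()          # uniform in [0, 1)
    f = 0.0
    for x, px in enumerate(p):
        f += px                  # f is the CDF evaluated at x
        if f >= y:
            return x
    return len(p) - 1            # guard against rounding in the final sum

# toy distribution over 0..4, heavily weighted toward 0
p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
draws = [sample(p) for _ in range(10000)]
```

Roughly half the draws come out as 0, a quarter as 1, and so on, matching the table.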

Teaching an ANN how to add

Preface: I'm currently learning about ANNs because I have ~18.5k images in ~83 classes. They will be used to train an ANN to recognize approximately equal images in realtime. I followed the image example in the book, but it doesn't work for me. So I'm going back to the beginning, as I've likely missed something.
I took the Encog XOR example and extended it to teach it how to add numbers less than 100. So far, the results are mixed, even for exact input after training.
Inputs (normalized by dividing by 100): 0+0, 1+2, 3+4, 5+6, 7+8, 1+1, 2+2, 7.5+7.5, 7+7, 50+50, 20+20.
Outputs are the sums, normalized by 100 in the same way.
After training 100,000 times, some sample output from input data:
0+0=1E-18 (great!)
1+2=6.95
3+4=7.99 (so close!)
5+6=9.33
7+8=11.03
1+1=6.70
2+2=7.16
7.5+7.5=10.94
7+7=10.48
50+50=99.99 (woo!)
20+20=41.27 (close enough)
From cherry-picked unseen data:
2+4=7.75
6+8=10.65
4+6=9.02
4+8=9.91
25+75=99.99 (!!)
21+21=87.41 (?)
I've messed with layers, neuron numbers, and [Resilient|Back]Propagation, but I'm not entirely sure if it's getting better or worse. With the above data, the layers are 2, 6, 1.
I have no frame of reference for judging this. Is this normal? Do I not have enough input? Is my data not complete or random enough, or too heavily weighted?
You are not the first one to ask this. It seems logical to teach an ANN to add. We teach them to function as logic gates, so why not as addition/multiplication operators? I can't answer this completely, because I have not researched it myself to see how well an ANN performs in this situation.
If you are just teaching addition or multiplication, you might have the best results with a linear output and no hidden layer. For example, to learn to add, the two weights would need to be 1.0 and the bias weight would have to go to zero:
linear( (input1 * w1) + (input2 * w2) + bias )
becomes
linear( (input1 * 1.0) + (input2 * 1.0) + 0.0 )
Training a sigmoid or tanh might be more problematic. The weights/bias and hidden layer would basically have to undo the sigmoid to truly get back to an addition like the above.
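A single linear neuron trained with the delta rule does recover w1 ≈ w2 ≈ 1 and bias ≈ 0 as described above. A Python sketch (the learning rate, iteration count, and seed are arbitrary illustration choices):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# single linear neuron: y = w1*a + w2*b + bias
w1, w2, bias = (random.uniform(-1, 1) for _ in range(3))
lr = 0.05

for _ in range(20000):
    a, b = random.random(), random.random()   # inputs scaled into 0..1
    err = (w1 * a + w2 * b + bias) - (a + b)  # prediction minus target sum
    # delta rule: gradient descent on the squared error
    w1 -= lr * err * a
    w2 -= lr * err * b
    bias -= lr * err
```

Because the target is exactly linear in the inputs, the error is driven to zero and the weights converge to the exact solution, unlike the sigmoid networks in the question.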
I think part of the problem is that the neural network is recognizing patterns, not really learning math.
ANN can learn arbitrary function, including all arithmetics. For example, it was proved that addition of N numbers can be computed by polynomial-size network of depth 2. One way to teach NN arithmetics is to use binary representation (i.e. not normalized input from 100, but a set of input neurons each representing one binary digit, and same representation for output). This way you will be able to implement addition and other arithmetics. See this paper for further discussion and description of ANN topologies used in learning arithmetics.
PS. If you want to work with image recognition, it's not a good idea to start practicing with your original dataset. Try some well-studied dataset like MNIST, where it is known what results can be expected from correctly implemented algorithms. After mastering the classical examples, you can move on to working with your own data.
I am in the middle of a demo that teaches the computer how to multiply, and I am sharing my progress here: as Jeff suggested, I used the linear approach and in particular ADALINE. At this moment my program "knows" how to multiply by 5. This is the output I am getting:
1 x 5 ~= 5.17716232607829
2 x 5 ~= 10.147218373698
3 x 5 ~= 15.1172744213176
4 x 5 ~= 20.0873304689373
5 x 5 ~= 25.057386516557
6 x 5 ~= 30.0274425641767
7 x 5 ~= 34.9974986117963
8 x 5 ~= 39.967554659416
9 x 5 ~= 44.9376107070357
10 x 5 ~= 49.9076667546553
Let me know if you are interested in this demo. I'd be happy to share.

Stable distribution random numbers?

How to generate random numbers with a stable distribution in C#?
The Random class has a uniform distribution. Much other code on the internet shows a normal distribution. But we need a stable distribution, meaning infinite variance, a.k.a. a fat-tailed distribution.
The reason is to generate realistic stock prices. In the real world, huge variations in prices are far more likely than under a normal distribution. Does someone know the C# code to convert Random class output into a stable distribution?
Edit: Hmmm. The exact distribution is less critical than ensuring it will randomly generate huge sigma events, like at least 20 sigma. We want to test a trading strategy for resilience under a true fat-tailed distribution, which is exactly how stock market prices behave.
I just read about Zipfian and Cauchy thanks to the comments. Since I must pick, let's go with the Cauchy distribution, but I will also try Zipfian to compare.
In general, the method is:
Choose a stable, fat-tailed distribution. Say, the Cauchy distribution.
Look up the quantile function of the chosen distribution.
For the Cauchy distribution, that would be p --> peak + scale * tan( pi * (p - 0.5) ).
And now you have a method of transforming uniformly-distributed random numbers into Cauchy-distributed random numbers.
Make sense? See
http://en.wikipedia.org/wiki/Inverse_transform_sampling
for details.
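Those three steps come out to just a couple of lines for the Cauchy case. A Python sketch (the function name and parameter names are mine; peak is the location parameter):

```python
import math
import random

def cauchy(peak=0.0, scale=1.0):
    """Inverse transform sampling: push a uniform p in (0, 1) through
    the Cauchy quantile function peak + scale * tan(pi * (p - 0.5))."""
    p = random.random()
    while p == 0.0:              # avoid tan(-pi/2) at exactly p = 0
        p = random.random()
    return peak + scale * math.tan(math.pi * (p - 0.5))

samples = [cauchy() for _ in range(100000)]
```

The sample median estimates peak, but the sample mean never settles down, because the Cauchy mean does not exist; that is exactly the fat-tailed behavior the question asks for.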
Caveat: It has been a long, long time since I took statistics.
UPDATE:
I liked this question so much I just blogged it: see
http://ericlippert.com/2012/02/21/generating-random-non-uniform-data/
My article exploring a few interesting examples of Zipfian distributions is here:
http://blogs.msdn.com/b/ericlippert/archive/2010/12/07/10100227.aspx
If you're interested in using the Zipfian distribution (which is often used when modeling processes from the sciences or social domains), you would do something along the lines of:
Select your k (skew) for the distribution
Precompute the domain of the cumulative distribution (this is just an optimization)
Generate random values for the distribution by finding the nearest value from the domain
Sample Code:
List<int> domain = Enumerable.Range(1, 1000).ToList(); // generate your domain (starting at 1 to avoid division by zero)
double skew = 0.37; // select a skew appropriate to your domain
double norm = domain.Aggregate(0.0d, (z, x) => z + 1.0 / Math.Pow(x, skew)); // normalizing constant
double running = 0.0d;
List<double> cummDist = domain
    .Select(x => (running += 1.0 / Math.Pow(x, skew)) / norm)
    .ToList(); // cumulative distribution over the domain, ending at 1.0
Now you can generate random values by selecting the closest value from within the domain:
Random rand = new Random();
double seek = rand.NextDouble();
int searchIndex = cummDist.BinarySearch(seek);
// return the index of the closest value from the distribution domain
return searchIndex < 0 ? (~searchIndex)-1 : searchIndex-1;
You can, of course, generalize this entire process by factoring out the logic that materializes the domain of the distribution from the process that maps and returns a value from that domain.
I have before me James Gentle's Springer volume on this topic, Random Number Generation and Monte Carlo Methods, courtesy of my statistician wife. It discusses the stable family on page 105:
The stable family of distributions is a flexible family of generally heavy-tailed distributions. This family includes the normal distribution at one extreme value of one of the parameters and the Cauchy at the other extreme value. Chambers, Mallows, and Stuck (1976) give a method for generating deviates from stable distributions. (Watch for some errors in the constants in the auxiliary function D2, for evaluating (ex-1)/x.) Their method is used in the IMSL libraries. For a symmetric stable distribution, Devroye (1986) points out that a faster method can be developed by exploiting the relationship of the symmetric stable to the Fejer-de la Vallee Poissin distribution. Buckle (1995) shows how to simulate the parameters of a stable distribution, conditional on the data.
Generating deviates from the generic stable distribution is hard. If you need to do this then I would recommend a library such as IMSL. I do not advise you attempt this yourself.
However, if you are looking for a specific distribution in the stable family, e.g. Cauchy, then you can use the method described by Eric, known as the probability integral transform. So long as you can write down the inverse of the distribution function in closed form then you can use this approach.
The following C# code generates a random number following a stable distribution given the shape parameters alpha and beta. I release it to the public domain under Creative Commons Zero.
public static double StableDist(Random rand, double alpha, double beta){
if(alpha<=0 || alpha>2 || beta<-1 || beta>1)
throw new ArgumentException();
var halfpi=Math.PI*0.5;
var unif=NextDouble(rand);
while(unif == 0.0)unif=NextDouble(rand);
unif=(unif - 0.5) * Math.PI;
// Cauchy special case
if(alpha==1 && beta==0)
return Math.Tan(unif);
var expo=-Math.Log(1.0 - NextDouble(rand));
var c=Math.Cos(unif);
if(alpha == 1){
var s=Math.Sin(unif);
return 2.0*((unif*beta+halfpi)*s/c -
beta * Math.Log(halfpi*expo*c/(
unif*beta+halfpi)))/Math.PI;
}
var z=-Math.Tan(halfpi*alpha)*beta;
var ug=unif+Math.Atan(-z)/alpha;
var cpow=Math.Pow(c, -1.0 / alpha);
return Math.Pow(1.0+z*z, 1.0 / (2*alpha))*
(Math.Sin(alpha*ug)*cpow)*
Math.Pow(Math.Cos(unif-alpha*ug)/expo, (1.0-alpha) / alpha);
}
private static double NextDouble(Random rand){
// The default NextDouble implementation in .NET (see
// https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/shared/System/Random.cs)
// is very problematic:
// - It generates a random number 0 or greater and less than 2^31-1 in a
// way that very slightly biases 2^31-2.
// - Then it divides that number by 2^31-1.
// - The result is a number that uses roughly only 32 bits of pseudorandomness,
// even though `double` has 53 bits in its significand.
// To alleviate some of these problems, this method generates a random 53-bit
// random number and divides that by 2^53. Although this doesn't fix the bias
// mentioned above (for the default System.Random), this bias may be of
// negligible importance for most purposes not involving security.
long x=rand.Next(0,1<<30);
x<<=23;
x+=rand.Next(0,1<<23);
return (double)x / (double)(1L<<53);
}
In addition, I set forth pseudocode for the stable distribution in a separate article.

Bell Curve Gaussian Algorithm (Python and/or C#)

Here's a somewhat simplified example of what I am trying to do.
Suppose I have a formula that computes credit points, but the formula has no constraints (for example, the score might be 1 to 5000). And a score is assigned to 100 people.
Now, I want to assign a "normalized" score between 200 and 800 to each person, based on a bell curve. So for example, if one guy has 5000 points, he might get an 800 on the new scale. The people with the middle of my point range will get a score near 500. In other words, 500 is the median?
A similar example might be the old scenario of "grading on a curve", where the bulk of the students perhaps get a C or C+.
I'm not asking for the code; a library, an algorithm book, or a website to refer to would be fine... I'll probably be writing this in Python (but C# is of some interest as well). There is NO need to graph the bell curve. My data will probably be in a database and I may have as many as a million people to assign this score to, so scalability is an issue.
Thanks.
The important property of the bell curve is that it describes the normal distribution, which is a simple model for many natural phenomena. I am not sure what kind of "normalization" you intend to do, but it seems to me that the current score already complies with a normal distribution; you just need to determine its properties (mean and variance) and scale each result accordingly.
References:
https://en.wikipedia.org/wiki/Grading_on_a_curve
https://en.wikipedia.org/wiki/Percentile
(see also: gaussian function)
I think the approach I would try would be to compute the mean (average) and standard deviation (average distance from the mean). I would then choose parameters to fit my target range. Specifically, I would choose that the mean of the input values maps to 500, and that 6 standard deviations consume 99.7% of my target range; equivalently, a single standard deviation occupies about 16.6% of the target range.
Since your target range is 600 units wide (from 200 to 800), a single standard deviation corresponds to about 99.7 units. So a person whose input credit score is one standard deviation above the input mean would get a normalized credit score of 599.7.
So now:
# mean and standard deviation of the input values have been computed.
for score in input_scores:
    distance_from_mean = score - mean
    distance_from_mean_in_standard_deviations = distance_from_mean / stddev
    target = 500 + distance_from_mean_in_standard_deviations * 99.7
    if target < 200:
        target = 200
    if target > 800:
        target = 800
This won't necessarily map the median of your input scores to 500. This approach assumes that your input is more-or-less normally distributed and simply translates the mean and stretches the input bell curve to fit in your range. For inputs that are significantly not bell curve shaped, this may distort the input curve rather badly.
A second approach is to simply map your input range to our output range:
for score in input_scores:
    value = (score - 1.0) / (5000 - 1)
    target = value * (800 - 200) + 200
This will preserve the shape of your input, but in your new range.
A third approach is to have your target range represent percentiles instead of trying to represent a normal distribution. 1% of people would score between 200 and 205; 1% would score between 794 and 800. Here you would rank your input scores and convert the ranks into values in the range 200..800. This makes full use of your target range and gives it an easy-to-understand interpretation.
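The percentile approach amounts to sorting once and mapping ranks linearly onto the target range, so it scales well even for a million rows. A Python sketch (function and parameter names are mine; it assumes at least two scores and ignores ties):

```python
def percentile_scores(scores, lo=200, hi=800):
    """Map each raw score to lo..hi by its rank, ignoring the shape
    of the input distribution entirely."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices, low to high
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = lo + (hi - lo) * rank / (n - 1)       # linear in rank
    return out

scaled = percentile_scores([5000, 12, 700, 355, 88])
```

The lowest raw score always maps to 200 and the highest to 800, regardless of how skewed the raw values are.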

C# Date/Numerical Axis Scale

I'm developing a histogram container class and I'm trying to determine where the cutoff points should be for the bins. I'd like the cutoff points to be nice-looking numbers, in much the same way that graph axes are scaled.
To distill my request into a basic question: is there a basic method by which axis labels can be determined from a list of numbers?
For example:
Array{1,6,8,5,12,15,22}
It would make sense to have 5 bins.
Bin Start Count
0 1
5 3
10 2
15 0
20 1
The bin start stuff is identical to selecting axis labels on a graph in this instance.
For the purpose of this question I don't really care about bins and the histogram, I'm more interested in the graph scale axis label portion of the question.
I will be using C# 4.0 for my app, so nifty solutions using LINQ are welcome.
I've attempted stuff like this in the distant past using some log-base-10 scaling, but I never got it to work in great enough detail for this application. I don't want to do log scaling; I just used base 10 to round to the nearest whole numbers. I'd like it to work for large numbers and very small numbers, and possibly dates too, although dates can be converted to doubles and parsed that way.
Any resources on the subject would be greatly appreciated.
You could start with something simple:
NUM_BINS is a passed argument or constant (e.g. NUM_BINS = 10)
x is your array of x-values (e.g. int[] x = new int[50])
int numBins = x.Length < NUM_BINS ? x.Length : NUM_BINS;
At this point you could calculate a histogram of the x points, and if they are heavily weighted to one side of the distribution (maybe just count left of the midpoint vs. right of the midpoint), then use log/exp divisions over the range of x[]. If the histogram is flat, use linear divisions.
double[] xAxis = new double[numBins];
double range = x[x.Length-1] - x[0];
CalcAxisValues(xAxis, range, TYPE); //Type is enum of LOG, EXP, or LINEAR
This function would then equally space points based on the TYPE.
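For the "nice looking numbers" part specifically, a common technique (not from the answer above; it is the classic round-to-1/2/5 × 10^k tick heuristic) is to pick a step of the form 1, 2, or 5 times a power of ten. Sketched in Python for brevity; it ports directly to C#/LINQ:

```python
import math

def nice_ticks(lo, hi, max_ticks=10):
    """Choose a step of the form {1, 2, 5} * 10^k so the range lo..hi
    is covered by round-numbered boundaries (assumes hi > lo)."""
    raw_step = (hi - lo) / max_ticks          # ideal but "ugly" step
    mag = 10 ** math.floor(math.log10(raw_step))
    for mult in (1, 2, 5, 10):                # round raw_step up to 1/2/5/10 * mag
        step = mult * mag
        if step >= raw_step:
            break
    start = math.floor(lo / step) * step      # first boundary at/below lo
    ticks = []
    while start <= hi:
        ticks.append(start)
        start += step
    return ticks
```

For the example data {1, 6, 8, 5, 12, 15, 22}, nice_ticks(1, 22, 5) yields the bin starts 0, 5, 10, 15, 20 shown in the question's table.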
