Here's a somewhat simplified example of what I am trying to do.
Suppose I have a formula that computes credit points, but the formula has no constraints (for example, the score might be 1 to 5000). And a score is assigned to 100 people.
Now, I want to assign a "normalized" score between 200 and 800 to each person, based on a bell curve. So for example, if one guy has 5000 points, he might get an 800 on the new scale. The people in the middle of my point range will get a score near 500. In other words, 500 is the median?
A similar example might be the old scenario of "grading on the curve", where the bulk of the students perhaps get a C or C+.
I'm not asking for the code; a library, an algorithm book or a website to refer to would do. I'll probably be writing this in Python (but C# is of some interest as well). There is NO need to graph the bell curve. My data will probably be in a database and I may have as many as a million people to assign this score to, so scalability is an issue.
Thanks.
The important property of the bell curve is that it describes the normal distribution, which is a simple model for many natural phenomena. I am not sure what kind of "normalization" you intend to do, but it seems to me that the current score already follows a normal distribution; you just need to determine its properties (mean and variance) and scale each result accordingly.
References:
https://en.wikipedia.org/wiki/Grading_on_a_curve
https://en.wikipedia.org/wiki/Percentile
(see also: gaussian function)
I think the approach that I would try would be to compute the mean (average) and standard deviation (roughly, the typical distance from the average). I would then choose parameters to fit my target range. Specifically, I would map the mean of the input values to the value 500, and I would let 6 standard deviations (3 on each side of the mean) consume 99.7% of my target range. In other words, a single standard deviation will occupy about 16.6% of my target range.
Since your target range is 600 (from 200 to 800), a single standard deviation would cover 99.7 units. So a person who obtains an input credit score that is one standard deviation above the input mean would get a normalized credit score of 599.7.
So now:
# The mean and standard deviation of the input values have already been computed.
for score in input_scores:
    distance_from_mean = score - mean
    distance_from_mean_in_standard_deviations = distance_from_mean / stddev
    target = 500 + distance_from_mean_in_standard_deviations * 99.7
    # Clamp outliers to the ends of the target range.
    if target < 200:
        target = 200
    if target > 800:
        target = 800
This won't necessarily map the median of your input scores to 500. This approach assumes that your input is more-or-less normally distributed and simply translates the mean and stretches the input bell curve to fit in your range. For inputs that are significantly not bell curve shaped, this may distort the input curve rather badly.
A second approach is to simply map your input range to your output range:
for score in input_scores:
    value = (score - 1.0) / (5000 - 1)
    target = value * (800 - 200) + 200
This will preserve the shape of your input, but in your new range.
A third approach is to have your target range represent percentiles instead of trying to represent a normal distribution. 1% of people would score between 200 and 206; 1% would score between 794 and 800. Here you would rank your input scores and convert the ranks into a value in the range 200..800. This makes full use of your target range and gives it an easy-to-understand interpretation.
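As a minimal sketch of this rank-based approach: the function name percentile_scores and the way ties are handled are assumptions made for illustration, not something the answer specifies. Each score is placed on the target scale purely by its rank, so the shape of the input distribution does not matter.

```python
def percentile_scores(raw_scores, low=200.0, high=800.0):
    """Map each raw score to [low, high] according to its rank."""
    n = len(raw_scores)
    if n == 1:
        # A single score has no rank information; put it in the middle.
        return [(low + high) / 2.0]
    # Indices of the scores, sorted by ascending raw value.
    order = sorted(range(n), key=lambda i: raw_scores[i])
    result = [0.0] * n
    for rank, i in enumerate(order):
        # rank 0 maps to low, rank n-1 maps to high, evenly spaced between.
        result[i] = low + (high - low) * rank / (n - 1)
    return result


scaled = percentile_scores([10, 5000, 300, 42, 1200])
print(scaled)
```

Equal raw scores receive neighbouring but different targets here; averaging the ranks of tied scores would be one way to avoid that. The sort is O(n log n), which should be fine for a million rows.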
I want to generate random numbers within a range (1 - 100000), but instead of purely random I want the results to be based on a kind of distribution. What I mean is that in general I want the numbers "clustered" around the minimum value of the range (1).
I've read about Box–Muller transform and normal distributions but I'm not quite sure how to use them to achieve the number generator.
How can I achieve such an algorithm using C#?
There are a lot of ways of doing this (using a uniform-distribution PRNG). Here are a few I know of:
1. Combine more uniform random variables to obtain desired distribution
I am not a math guy, but there sure are equations for this. This kind of solution usually has the best properties from the randomness and statistical point of view. For more info see the famous:
Understanding “randomness”.
But there is only a limited number of distributions we know the combinations for.
2. Apply non linear function on uniform random variable
This is the simplest to implement. You simply take floating-point randoms in the <0..1> range, apply your non-linear function to them (it changes the distribution towards your wanted shape while the result stays in the <0..1> range), and rescale the result into your integer range, for example (in C++):
floor( pow( random(),5 ) * 100000 )
The problem is that this is just blind fitting of the distribution, so you usually need to tweak the constants a bit. It is a good idea to render histogram and randomness graphs to see the quality of the result directly, like here:
How to seed to generate random numbers?
You can also avoid too-blind fitting with Beziers, like here:
Random but most likely 1 float
3. Distribution following pseudo random generator
There are two approaches I know of for this. The simpler one is:
a) Create a big enough array of size n.
b) Fill it with all values following the distribution.
Loop through all the values you want to output, compute how many copies of each belong in an array of size n (from your distribution), and add that many copies to the array. Beware that the filled size of the array might be slightly less than n due to rounding. If n is too small you will be missing some of the less frequent numbers, so the probability of the least probable number multiplied by n should be at least 1. After the filling, change n to the real array size (the number of values actually stored in it).
c) Shuffle the array.
d) Now use the array as a linear list of random numbers.
Instead of calling random() you just pick a number from the array and move to the next one. Once you reach the n-th value, shuffle the array and start from the first one again.
This solution has very good statistical properties (follows the distribution exactly) but the randomness properties are not good and requires array and occasional shuffling. For more info see:
How to efficiently generate a set of unique random numbers with a predefined distribution?
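The fill-and-shuffle approach above can be sketched roughly as follows. The names build_pool and ShuffledPool are hypothetical, and the distribution is assumed to be given as a list where probabilities[x] is the probability of the integer x.

```python
import random


def build_pool(probabilities, n):
    """Fill a list so that value x appears about probabilities[x] * n times."""
    pool = []
    for value, p in enumerate(probabilities):
        # Rounding means the pool may end up slightly smaller or larger than n.
        pool.extend([value] * round(p * n))
    return pool


class ShuffledPool:
    """Hands out the pool's values in shuffled order, reshuffling each cycle."""

    def __init__(self, probabilities, n, rng=None):
        self.rng = rng or random.Random()
        self.pool = build_pool(probabilities, n)
        self.pos = len(self.pool)  # forces a shuffle on the first call

    def next(self):
        if self.pos >= len(self.pool):
            self.rng.shuffle(self.pool)
            self.pos = 0
        value = self.pool[self.pos]
        self.pos += 1
        return value


gen = ShuffledPool([0.5, 0.3, 0.2], 10, random.Random(1))
first_cycle = [gen.next() for _ in range(10)]
print(sorted(first_cycle))
```

Within each cycle the counts match the distribution exactly, which is the statistical property described above. It is also why the output is less random: once most of a cycle has been seen, the remaining values are predictable.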
The other variation of this avoids the array and the shuffling. It goes like this:
a) Get a random value in the range <0..1>.
b) Apply the inverse cumulative distribution function to convert it to the target range.
As you can see, it's like the #2 Apply non linear function... approach, but instead of "some" non-linear function you use the distribution directly. So if p(x) is the probability of x in the range <0..1>, where 1 means 100%, then we need a function that accumulates all the probabilities up to x (the cumulative distribution function). For integers:
f(x) = p(0)+p(1)+...+p(x)
Now we need inverse function g() to it so:
y = f(x)
x = g(y)
Now if my memory serves me well then the generation should look like this:
y = random(); // <0..1>
x = g(y); // probability -> value
Many distributions have a known g() function, but for those that do not (or when we are too lazy to derive it) you can use binary search on f(x), which never decreases. I was too lazy to code it, so here is the slower linear search version:
for (x=0;x<max;x++) if (f(x)>=y) break;
So when put all together (and using only p(x)) I got this (C++):
y=random(); // uniform distribution pseudo random value in range <0..1>
for (f=0.0,x=0;x<max;x++) // loop x through all values
{
    f+=p(x); // f(x) cumulative distribution function
    if (f>=y) break;
}
// here x is your pseudo random value following p(x) distribution
This kind of solution has usually very good both statistical and randomness properties and does not require that the distribution is a continuous function (it can be even just an array of values instead).
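For reference, the binary-search variant mentioned above might look like the sketch below. The helper name make_sampler is hypothetical; the distribution is assumed to be a list where probabilities[x] is the probability of x. The cumulative sums f(x) are precomputed once, and the standard bisect module finds the first x with f(x) >= y.

```python
import bisect
import itertools
import random


def make_sampler(probabilities, rng=None):
    """Return a function that draws integers x with probability probabilities[x]."""
    rng = rng or random.Random()
    # cumulative[x] = p(0) + p(1) + ... + p(x)
    cumulative = list(itertools.accumulate(probabilities))
    total = cumulative[-1]

    def sample():
        # y lies in (0, total], so values with zero probability are never hit.
        y = (1.0 - rng.random()) * total
        # bisect_left returns the first index x with cumulative[x] >= y.
        x = bisect.bisect_left(cumulative, y)
        return min(x, len(cumulative) - 1)

    return sample


sample = make_sampler([0.7, 0.2, 0.1], random.Random(0))
draws = [sample() for _ in range(10000)]
print(draws.count(0), draws.count(1), draws.count(2))
```

Each draw costs O(log max) instead of O(max), at the price of keeping the cumulative array in memory.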
Overview
I am currently working on a normalization PMML-Model executor in C#.
These PMML normalization models look like this:
<TransformationDictionary>
<DerivedField displayName="BU01" name="BU01*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 17 column(s)"/>
<NormContinuous field="BU01">
<LinearNorm orig="0.0" norm="-0.6148417019560395"/>
<LinearNorm orig="1.0" norm="-0.6140350877192982"/>
</NormContinuous>
</DerivedField>
(...)
I do know how min-max normalization in theory works using
z_i = (x_i - min(x)) / (max(x) - min(x))
to normalize a dataset into the range of 0-1 and obviously it's not hard to reverse this equation.
Problem
So to execute the normalization and denormalization I somehow have to translate these orig, norm values into min, max values. But I just can't figure out how these orig/norm values are calculated and how they relate to min/max.
Question
So I'm asking if someone knows an equation to transform orig/norm to min/max and back. Or is someone able to explain how to directly use the orig/norm values to normalize/denormalize my fields?
Further Explanation
EDIT: It looks as if I did not state clearly what exactly the problem is, so here is another approach:
I try to get an attribute of a dataset normalized into the range from 0-1 using the Min-Max normalization method (aka Feature Scaling). Using the data analysis tool Knime I can do this and export my "scaling" as a PMML Model. (An example of this is the XML provided above.)
With these normalized attributes I train my MLP Model. Now if I export my MLP Model as PMML, I have to put normalized values in and get normalized output out when calculating a prediction. (Computing the MLP Network already works.)
In a deployed scenario where Knime can't do this normalization for me, I want to use my normalization Model. As already described, I do know the theory behind Feature Scaling and can easily compute de-/normalization if I am provided with the min and max of my attribute. The problem is that PMML has another, let's say, "notation" for saving this min-max information, which is somehow inside the orig and norm values.
So what I am ultimately looking for is a way to convert orig/norm to min/max, or how min/max information is "encoded" into orig/norm values.
Extra Info
[The reason for this "encoding" in the first place seems to be computation speed (which is not important in my scenario) and making it easier to encode min/max normalization info for ranges other than 0-1.]
Example #1
To give an example:
Let's say I want to normalize the array of [0, 1, 2, 4, 8] into the range of 0-1. Clearly the answer is [0, 0.125, 0.25, 0.5, 1] as computed by Feature Scaling with min = 0, max = 8. Easy. But now if I look at the PMML normalization Model:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="0.0"/>
<LinearNorm orig="1.0" norm="0.125"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Example #2
[1, 2, 4, 8] -> [0, 0.333, 0.667, 1]
With:
<TransformationDictionary>
<DerivedField displayName="column1" name="column1*" optype="continuous" dataType="double">
<Extension name="summary" extender="KNIME" value="Min/Max (0.0, 1) normalization on 1 column(s)"/>
<NormContinuous field="column1">
<LinearNorm orig="0.0" norm="-0.3333333333333333"/>
<LinearNorm orig="1.0" norm="0.0"/>
</NormContinuous>
</DerivedField>
</TransformationDictionary>
Question
So how am I supposed to scale with orig/norm or compute min/max from these values?
What I'm about to say depends on what you mean by (min, max).
I'm going to assume that min equals the value where 0.5% of the total lies below and max equals the value where 0.5% of the total lies above.
If we agree on that, a symmetric normal distribution would have a mean value of approximately mean ~ (max+min)/2. (You call the mean the origin.)
Six standard deviations (three on either side of the mean) encompass about 99.7% of a normal distribution, so the standard deviation is approximately sigma ~ (max-min)/6.
The definition of normalized z = (x - mean)/sigma.
With those values you can get yourself back to the denormalized distribution.
Found the answer. After carefully reading through the documentation again (which is extremely confusing imo) I came across this sentence:
The sequence of LinearNorm elements defines a sequence of points for a stepwise linear interpolation function. The sequence must contain at least two elements. Within NormContinuous the elements LinearNorm must be strictly sorted by ascending value of orig.
Which basically explains it all. Normalization in PMML is done by using a stepwise interpolation with only 2 points. So in fact just a simple conversion function.
In the case of normalization into a range of 0-1 it gets even easier, as the two points will always be at x1=0 and x2=1 (the orig values). The y-axis intercept is therefore always the norm value at orig=0. The slope of the function is also very easy to calculate: slope = (y2-y1)/(x2-x1) = (y2-y1)/(1-0) = y2-y1, which is just the difference of the two norm values.
So to get our interpolation function, which will always be a first-degree polynomial, we just calculate:
f(x) = ax + b = (y2-y1)x + y1 = (norm(orig=1)-norm(orig=0)) * x + norm(orig=0)
This is used for normalization. Now we can calculate the inverse:
x = (f(x) - norm(orig=0)) / (norm(orig=1)-norm(orig=0))
This is used for de-normalization.
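A small sketch of these two formulas, using the values from Example #1 in the question. The function names are hypothetical, and it assumes the export always places the two LinearNorm points at orig=0 and orig=1, as in the KNIME output shown above. Since min maps to 0 and max maps to 1 under min-max scaling, de-normalizing 0 and 1 also recovers min and max.

```python
def make_linear_norm(norm_at_orig0, norm_at_orig1):
    """Build normalize/denormalize from the two LinearNorm norm values."""
    slope = norm_at_orig1 - norm_at_orig0   # (y2 - y1) / (1 - 0)
    intercept = norm_at_orig0               # value of f at orig = 0

    def normalize(x):
        return slope * x + intercept

    def denormalize(y):
        return (y - intercept) / slope

    return normalize, denormalize


def recover_min_max(norm_at_orig0, norm_at_orig1):
    """For a 0-1 min-max scaling: min maps to 0 and max maps to 1."""
    _, denormalize = make_linear_norm(norm_at_orig0, norm_at_orig1)
    return denormalize(0.0), denormalize(1.0)


# Example #1: <LinearNorm orig="0.0" norm="0.0"/>, <LinearNorm orig="1.0" norm="0.125"/>
normalize, denormalize = make_linear_norm(0.0, 0.125)
print(normalize(8.0), denormalize(0.5), recover_min_max(0.0, 0.125))
```

A general PMML executor would have to interpolate between arbitrary orig points and handle more than two of them; this sketch covers only the two-point case discussed here.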
Hope this helps everyone who will someday also go through the hassle of implementing their own PMML executor engine and gets stuck on this topic.
After recording a sound in Unity, is it possible to get the peak power of the sound? Or is there any way to calculate the peak power of a sound?
Peak isn't very interesting in sound. If you want something closer to perceived volume of the sound, one pretty good metric is RMS. To get this, you have to do just a bit of math:
1. Load the sample data using audio.GetOutputData
2. Sum the squares of all the sampled values
3. Take the square root of sum / amountOfSamples - that's the RMS (root-mean-square)
If you want to have a value in dB, you can get it as 20 * log10(rms / reference), where reference stands for the value you want to have at 0 dB. A good reference point is 0.1, for example. Note that the RMS value will always be from 0 to 1 (as long as the samples are in the -1 to 1 range), while dB values are a bit wilder - they better approximate human hearing, though. If you want to be really serious, different frequencies are perceived at different volumes - have a look at dBA, for example.
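The arithmetic itself is independent of Unity, so here is a language-neutral sketch in Python. The function names are made up for illustration, and the samples are assumed to be floats in the -1 to 1 range that have already been read out of the audio source.

```python
import math


def rms(samples):
    """Root-mean-square of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def to_decibels(rms_value, reference=0.1):
    """Level relative to `reference`; reference itself maps to 0 dB."""
    # rms_value must be > 0, since the log of zero (pure silence) is undefined.
    return 20.0 * math.log10(rms_value / reference)


level = rms([0.1, -0.1, 0.1, -0.1])
print(level, to_decibels(level), to_decibels(1.0))
```

With the suggested reference of 0.1, a full-scale RMS of 1.0 comes out at about +20 dB and anything quieter than 0.1 is negative.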
We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:
fee = fixed + variable*transaction_price
A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use).
Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.
The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).
transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transaction until the cumulative total is greater than one cent, and then a "rounding cent" is attached to the fee of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.
The data we get looks like this, and what we want to figure out is the last two columns.
card_id    transaction_price                      fee    fixed    variable
=======================================================================
12345      200                                     22    ?        ?
67890      300                                     21    ?        ?
56789      150                                      8    ?        ?
34567      150                                      8    ?        ?
34567      150                "rounding cent"->     9    ?        ?
34567      150                                      8    ?        ?
The end result we want is a short list like this with 10 or fewer entries showing the fee schemes that best fit our data. Like this:
fee_scheme_id    fixed    variable
======================================
1                22       0
2                21       0
3                ?        ?
4                ?        ?
...
The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.
The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.
We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days of data.
Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.
Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.
Hough transform
Consider your problem in image terms: If you would plot your input data on a diagram of price vs. fee, each scheme's entries would form a straight line (with rounding cents being noise). Consider the density map of your plot as an image, and the task is reduced to finding straight lines in an image. Which is just the job of the Hough transform.
You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.
This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classifier - and once you have classified your points into fee schemes, simple linear regression solves your problem.
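A rough sketch of the voting step described above, working directly in (fixed, variable) space rather than in the usual angle/distance parameterisation. The function name, the grid (variable rate in steps of 0.001, fixed fee in tenths of a cent) and the sample data are all assumptions chosen for illustration. It finds only the single strongest scheme.

```python
from collections import Counter


def vote_for_scheme(transactions, max_variable_steps=100):
    """transactions: iterable of (transaction_price, fee) in integer cents.

    Returns (fixed, variable, votes) for the strongest grid cell.
    """
    votes = Counter()
    for price, fee in transactions:
        # One transaction is a line in parameter space:
        #   fixed = fee - variable * price
        # Walk along that line on the grid and vote for each cell it crosses.
        for k in range(max_variable_steps + 1):
            variable = k / 1000.0
            fixed_bin = round((fee - variable * price) * 10)  # tenths of a cent
            votes[(fixed_bin, k)] += 1
    (fixed_bin, k), count = votes.most_common(1)[0]
    return fixed_bin / 10.0, k / 1000.0, count


# Made-up transactions that all follow fee = 5 + 0.02 * price exactly.
best = vote_for_scheme([(100, 7), (150, 8), (200, 9), (250, 10)])
print(best)
```

With real data the rounding cents smear each peak over neighbouring cells, so coarser bins or some smoothing would be needed, and further schemes would be found by removing the transactions explained by one peak and voting again.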
Considering that a processing query's storage requirements are within the same power of 2 as a day's worth of transaction data, I assume that such storage is not a problem, so:
First pass: Group the transactions for each card_id by transaction_price, keeping card_id, transaction_price and the average fee. This can easily be done in SQL. This assumes there are no outliers, but you can catch those after this stage if so required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.
Second pass: Per group, walk these new data points (with a cursor or in C#) and calculate the average value of b. Again, any outliers can be caught after this stage if desired.
Third pass: Per group, calculate the average value of a, now that b is known. This is basic SQL. Outliers as always.
If you decide to do the second step in a cursor, you can put all of that into a stored procedure.
Different card_id groups that use the same fee scheme can now be coalesced (sorry if this is the wrong word, I'm not a native English speaker) into fee schemes by rounding a and b to a sane precision and grouping again.
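The three passes might be sketched like this for a single card_id. The function name is hypothetical, and "the average value of b" is read here as the average slope between consecutive price points, which is only one possible interpretation of the second pass.

```python
from collections import defaultdict


def estimate_scheme(transactions):
    """transactions: iterable of (transaction_price, fee) for ONE card_id.

    Returns (a, b) for fee = a + b * price, or None if only one price occurs.
    """
    # First pass: average fee per distinct transaction_price.
    sums = defaultdict(lambda: [0, 0])
    for price, fee in transactions:
        sums[price][0] += fee
        sums[price][1] += 1
    points = sorted((price, total / count)
                    for price, (total, count) in sums.items())
    if len(points) < 2:
        return None  # one price cannot separate the fixed and variable parts

    # Second pass: average slope b between consecutive price points.
    slopes = [(f2 - f1) / (p2 - p1)
              for (p1, f1), (p2, f2) in zip(points, points[1:])]
    b = sum(slopes) / len(slopes)

    # Third pass: average intercept a, now that b is known.
    a = sum(f - b * p for p, f in points) / len(points)
    return a, b


# Made-up data: fee = 5 + 0.02 * price, with one rounding cent at price 150.
data = [(100, 7), (150, 8), (150, 8), (150, 9), (150, 8), (200, 9)]
print(estimate_scheme(data))
```

Averaging the fee per price before fitting is what dampens the rounding cents: the single extra cent at price 150 shifts that point by only a quarter of a cent.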
The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).
Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (scales as size_of_the_input / (rho_precision * theta_precision)).
There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)
If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).
Naive Hough transform
Define an accumulator in (theta,rho) space spanning [-pi,pi) and [0,max(hypotenuse(x,y))] as a 2D array.
Foreach point (x,y) in the input data
    Foreach bin in theta
        find the distance rho of the altitude from the origin to
        a line through (x,y), where the altitude makes angle theta with the horizontal
            rho = x cos(theta) + y sin(theta)
        and increment the bin (theta,rho) in the accumulator
Find the maximum bin in the accumulator, this
represents the most line-like structure in the data
    if (sin(theta) != 0) {a = rho/sin(theta); b = -1/tan(theta);}
Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.
You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision, which should be faster than using smaller bins and should pick up the effect of the "rounding" cents fairly smoothly.
You're looking at the rounding cent as a significant source of noise in your calculations, so I'd focus on minimizing the noise due to that issue. The easiest way to do this IMO is to increase the sample size.
Instead of viewing your data as thousands of y=mx + b (+Rounding), group your data into larger subsets:
If you combine X transactions with the same card_id and look at this as (sum of X fees) = (variable rate)*(sum of X transaction prices) + X*(base rate) (+Rounding), the rounding noise will likely fall by the wayside.
Get enough groups of size 'X' and you should be able to come up with a pretty close representation of the real numbers.
I am making a pitch detection program using fft. To get the pitch I need to find the lowest frequency that is significantly above the noise floor.
All the results are in an array. Each position is for a frequency. I don't have any idea how to find the peak.
I am programming in C#.
Here is a screenshot of the frequency analysis in Audacity.
Instead of attempting to find the lowest peak, I would look for a fundamental frequency which maximizes the spectral energy captured by its first 5 integer multiples. Note that every peak is an integer multiple of the lowest peak. This is a hack of the cepstrum method. Don't judge :).
N.B. From your plots, I assume a 1024-sample window and a 44.1 kHz sampling rate. This yields a frequency granularity of only 44.1 kHz / 1024 = 43 Hz. Given 44.1 kHz audio, I recommend using a longer analysis window of ~50 ms, or 2048 samples. This would yield a finer frequency granularity of ~21 Hz.
Assuming a Matlab vector 'psd' of size 2048 with the PSD values.
% 50 Hz (Dude) -> 50Hz/44100Hz * 2048 -> ~2 Lower Lim
% 300 Hz (Baby) -> 300Hz/44100Hz * 2048 -> ~14 Upper Lim
lower_lim = 2;
upper_lim = 14;
for fund_cand = lower_lim:1:upper_lim
    i_first_five_multiples = [1:1:5]*fund_cand;
    sum_energy = sum(psd(i_first_five_multiples));
end
I would find the frequency which maximizes the sum_energy value.
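Here is a sketch of the complete search in Python, including the step of keeping the best candidate. The function names are hypothetical, and psd is assumed to be a plain list of non-negative power values indexed from 0, so that bin k corresponds to k * sample_rate / fft_size Hz. Note that MATLAB indexes from 1, so its indices are shifted by one relative to this.

```python
def best_fundamental_bin(psd, lower_lim=2, upper_lim=14, harmonics=5):
    """Return the candidate bin whose first `harmonics` multiples hold most energy."""
    best_bin, best_energy = None, float("-inf")
    for cand in range(lower_lim, upper_lim + 1):
        multiples = [k * cand for k in range(1, harmonics + 1)]
        if multiples[-1] >= len(psd):
            break  # the highest harmonic would fall outside the spectrum
        energy = sum(psd[i] for i in multiples)
        if energy > best_energy:
            best_bin, best_energy = cand, energy
    return best_bin


def bin_to_hz(bin_index, sample_rate=44100, fft_size=2048):
    return bin_index * sample_rate / fft_size


# Synthetic spectrum: a fundamental at bin 5 plus its next four harmonics.
psd = [0.0] * 128
for k in range(1, 6):
    psd[5 * k] = 1.0
fundamental = best_fundamental_bin(psd)
print(fundamental, bin_to_hz(fundamental))
```

Because only whole bins are tested, the estimate is no finer than the bin spacing (about 21.5 Hz with these defaults).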
It would be easier if you had some notion of the absolute values to expect, but I would suggest:
1. Find the lowest (weakest) value first. It is your noise level.
2. Compute the average level; it is your signal strength.
3. Define some function to decide the noise threshold. This is the tricky part; it may require some experimentation.
In a bad situation, the signal may be only 2 or 3 times the noise level. If the signal is better, you can probably use a threshold of 2x the noise level.
Edit, after looking at the picture:
You should probably just start at the left and find a local maximum. Looks like you could use 30 dB threshold and a 10-bin window or something.
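One possible reading of that suggestion, as a sketch: scan from the lowest bin upward and return the first bin that is above a threshold and is also the largest value within a window of about 10 bins around it. The function name is hypothetical, and the threshold is treated here as an absolute level in dB chosen by the caller, which is an assumption.

```python
def first_peak(spectrum_db, threshold_db, window=10):
    """Index of the first local maximum above threshold_db, or None."""
    half = window // 2
    for i, value in enumerate(spectrum_db):
        if value < threshold_db:
            continue
        lo = max(0, i - half)
        hi = min(len(spectrum_db), i + half + 1)
        # A peak is the largest value within the surrounding window.
        if value == max(spectrum_db[lo:hi]):
            return i
    return None


spectrum_db = [-60.0] * 50
spectrum_db[12] = -20.0   # lowest strong peak
spectrum_db[24] = -10.0   # a louder harmonic further up
print(first_peak(spectrum_db, threshold_db=-30.0))
```

The louder harmonic at bin 24 is ignored because the scan stops at the first qualifying peak.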
Finding the lowest peak won't work reliably for estimating pitch, as this frequency is sometimes completely missing, or down in the noise floor. For better reliability, try another algorithm: autocorrelation (AMDF, ASDF lag), cepstrum (FFT log FFT), harmonic product spectrum, state space density, and variations thereof that use neural nets, genetic algorithms or decision matrices to decide between alternative pitch hypotheses (RAPT, YAAPT, et al.).
Added:
That said, you could guess a frequency, compute the average and standard deviation of spectral magnitudes for, say, a 2-to-1 frequency range around your guess, and see if there exists a peak significantly above the average (2 sigma?). Rinse and repeat for some number of frequency guesses, and see which one, or the lowest of several, has a peak that stands out the most from the average. Use that peak.
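A sketch of that test for a single guess. The function name is hypothetical, and the "2-to-1 frequency range around your guess" is interpreted as the octave from guess/sqrt(2) to guess*sqrt(2), which is an assumption.

```python
import math
import statistics


def standout_peak(magnitudes, guess_bin, n_sigma=2.0):
    """Return (peak_bin, z_score) if a peak near guess_bin stands out, else None."""
    lo = max(0, round(guess_bin / math.sqrt(2)))
    hi = min(len(magnitudes) - 1, round(guess_bin * math.sqrt(2)))
    window = magnitudes[lo:hi + 1]
    if len(window) < 2:
        return None
    mean = statistics.mean(window)
    sigma = statistics.pstdev(window)
    offset = max(range(len(window)), key=lambda i: window[i])
    peak = window[offset]
    # Accept the peak only if it exceeds the mean by n_sigma standard deviations.
    if sigma > 0 and peak > mean + n_sigma * sigma:
        return lo + offset, (peak - mean) / sigma
    return None


spectrum = [1.0] * 64
spectrum[20] = 50.0
print(standout_peak(spectrum, guess_bin=20))
```

Running this for several guesses and comparing the returned z-scores gives the "which peak stands out the most" comparison. For very low guesses the window holds only a handful of bins, so a 2-sigma test becomes hard to pass and a longer FFT may be needed.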