I'm developing a histogram container class and I'm trying to determine where the cut-off points should be for the bins. I'd like the cutoff points to be nice-looking numbers, in much the same way that graph axes are scaled.
To distill my request into a basic question: is there a basic method by which axis labels can be determined from a list of numbers?
For example:
Array{1,6,8,5,12,15,22}
It would make sense to have 5 bins.
Bin Start   Count
0           1
5           3
10          2
15          0
20          1
The bin start stuff is identical to selecting axis labels on a graph in this instance.
For the purpose of this question I don't really care about bins and the histogram, I'm more interested in the graph scale axis label portion of the question.
I will be using C# 4.0 for my app, so nifty solutions using LINQ are welcome.
I've attempted something like this in the distant past using some base-10 scaling, but I never got it to work in enough detail for this application. I don't want log scaling; I only used base 10 to round to the nearest whole numbers. I'd like it to work for large numbers, very small numbers, and possibly dates too, although dates can be converted to doubles and handled that way.
Any resources on the subject would be greatly appreciated.
You could start with something simple:
NUM_BINS is a passed argument or constant (e.g. NUM_BINS = 10)
x is your array of x-values (e.g. int[] x = new int[50])
int numBins = x.Length < NUM_BINS ? x.Length : NUM_BINS;
At this point you could calc a histogram of x, and if the values are heavily weighted to one side of the distribution (maybe just count left of the midpoint vs. right of the midpoint), then use log/exp divisions over the range of x[]. If the histogram is flat, use linear divisions.
double[] xAxis = new double[numBins];
double range = x[x.Length-1] - x[0];
CalcAxisValues(xAxis, range, TYPE); //Type is enum of LOG, EXP, or LINEAR
This function would then equally space points based on the TYPE.
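For the "nice looking" boundaries themselves, here is a minimal C# sketch of the well-known "nice numbers" idea (round the raw step to 1, 2, 5 or 10 times a power of ten). The class and method names are my own, and it assumes min < max and that the input is already sorted if you take min/max from the ends of the array:

using System;

static class NiceScale
{
    // Round a raw step to a "nice" value: 1, 2, 5 or 10 times a power of ten.
    static double NiceNum(double value, bool round)
    {
        double exponent = Math.Floor(Math.Log10(value));
        double fraction = value / Math.Pow(10, exponent); // in [1, 10)
        double niceFraction;
        if (round)
        {
            if (fraction < 1.5) niceFraction = 1;
            else if (fraction < 3) niceFraction = 2;
            else if (fraction < 7) niceFraction = 5;
            else niceFraction = 10;
        }
        else
        {
            if (fraction <= 1) niceFraction = 1;
            else if (fraction <= 2) niceFraction = 2;
            else if (fraction <= 5) niceFraction = 5;
            else niceFraction = 10;
        }
        return niceFraction * Math.Pow(10, exponent);
    }

    // Computes rounded axis limits and a tick/bin spacing for roughly maxTicks labels.
    public static void Compute(double min, double max, int maxTicks,
                               out double niceMin, out double niceMax, out double step)
    {
        step = NiceNum((max - min) / (maxTicks - 1), true);
        niceMin = Math.Floor(min / step) * step;
        niceMax = Math.Ceiling(max / step) * step;
    }
}

For the example array {1,6,8,5,12,15,22} with maxTicks = 5 this yields a step of 5 and limits 0 and 25, i.e. bin starts 0, 5, 10, 15 and 20.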
I implemented a simplex noise algorithm (by KdotJPG: OpenSimplex2S), which works fine, but I'd like to add a "function" that can increase/decrease the contrast of the noise. The noise method returns a value between -1 and 1, but the overall result is quite homogeneous. It is not bad at all, but I need a different outcome now.
So basically I should "pull" the noise values toward the range edges; this will result in more contrast (more distance between the smaller and bigger values). Of course this change must be consistent and proportionally scaled between -1 and 1 (or 0 and 1) to get a natural result.
Actually this is a purely mathematical issue, but I'm not good at math at all! To make it clearer, here is a picture of two graphs:
On these graphs the Y axis is the noise value (-1 at the bottom, +1 at the top) and the X axis is time. The left graph shows the original output of the noise generator, and the right one is the stretched version I need. As you can see, on the right graph everything is the same, but the values are stretched/pulled toward the edges (toward the min/max limits) while still staying in range.
Is there any math formula or built-in C# function to stretch the return value of the noise proportionally with respect to the min/max values (-1/1 or 0/1)? If you need the code of the noise generator you can see it here: OpenSimplex2S, but that is irrelevant in my case, as I just wish to modify its return value. Thanks!
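Since no answer is included for this question here, the following is only a guess at one common approach, not anything taken from the OpenSimplex2S code: apply a sign-preserving power ("gamma") curve to the noise value, which keeps -1, 0 and +1 fixed while pushing everything else toward the edges.

using System;

static class NoiseContrast
{
    // contrast < 1 pulls values toward -1/+1 (more contrast),
    // contrast > 1 pulls them toward 0 (less contrast); the result stays in [-1, 1].
    public static double Stretch(double value, double contrast)
    {
        return Math.Sign(value) * Math.Pow(Math.Abs(value), contrast);
    }
}

For example, Stretch(0.4, 0.5) is about 0.63 and Stretch(-0.4, 0.5) is about -0.63, so mid-range values move outward proportionally while the endpoints stay fixed.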
I want to generate random numbers within a range (1 - 100000), but instead of being purely random I want the results to follow a kind of distribution. What I mean is that in general I want the numbers "clustered" around the minimum value of the range (1).
I've read about the Box–Muller transform and normal distributions, but I'm not quite sure how to use them to build such a number generator.
How can I achieve such an algorithm using C#?
There are a lot of ways of doing this (using a uniform-distribution PRNG); here are a few I know of:
1. Combine several uniform random variables to obtain the desired distribution
I am not a math guy, but there certainly are equations for this. This kind of solution usually has the best properties from a randomness and statistical point of view. For more info see the famous:
Understanding “randomness”.
but there are a limited number of distributions for which we know the right combinations.
2. Apply a non-linear function to a uniform random variable
This is the simplest to implement. You simply take floating-point randoms in the <0..1> range, apply your non-linear function (one that changes the distribution towards your wanted shape) to them (while the result is still in the <0..1> range), and rescale the result into your integer range, for example (in C++):
floor( pow( random(),5 ) * 100000 )
The problem is that this is just blind fitting of the distribution, so you usually need to tweak the constants a bit. It's a good idea to render histogram and randomness graphs to see the quality of the result directly, like in here (a C# sketch of this approach follows the links below):
How to seed to generate random numbers?
You can also avoid too blind fitting with BEZIERS like in here:
Random but most likely 1 float
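A minimal C# sketch of this non-linear-function approach, since the question asks for C#; the class name and the exponent are mine, and the exponent is just a starting point to tweak:

using System;

class ClusteredRandom
{
    private readonly Random rng = new Random();

    // Raising a uniform [0,1) value to a power > 1 pushes its mass toward 0,
    // so after rescaling most results land near the low end of the range.
    public int Next(int min, int max, double exponent)
    {
        double u = Math.Pow(rng.NextDouble(), exponent); // still in [0, 1)
        return min + (int)(u * (max - min + 1));
    }
}

Usage: new ClusteredRandom().Next(1, 100000, 5.0); plot a histogram of the output and adjust the exponent until the shape looks right.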
3. Distribution-following pseudo random generator
There are two approaches I know of for this; the simpler one is:
create a big enough array of size n
fill it with all values following the distribution
So simply loop through all the values you want to output and compute how many of them should be in the n-sized array (from your distribution), and add that count of each number to the array. Beware: the filled size of the array might be slightly less than n due to rounding. If n is too small you will be missing some of the less frequently occurring numbers, so the probability of the least probable number multiplied by n should be at least >= 1. After the filling, change n to the real array size (the number of values actually filled in).
shuffle the array
now use the array as a linear list of random numbers
So instead of random() you just pick a number from the array and move to the next one. Once you get to the n-th value, shuffle the array and start from the first one again.
This solution has very good statistical properties (it follows the distribution exactly), but the randomness properties are not as good, and it requires an array and occasional shuffling (a C# sketch follows the link below). For more info see:
How to efficiently generate a set of unique random numbers with a predefined distribution?
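A C# sketch of this pooled approach, assuming the distribution is given as an array of weights where weights[x] is proportional to p(x); the class and method names are mine:

using System;
using System.Collections.Generic;

class DistributionPool
{
    private readonly int[] pool;
    private readonly Random rng = new Random();
    private int index;

    public DistributionPool(double[] weights, int n)
    {
        double total = 0;
        foreach (double w in weights) total += w;

        var values = new List<int>();
        for (int x = 0; x < weights.Length; x++)
        {
            // how many copies of x belong in an n-sized pool
            int count = (int)Math.Round(n * weights[x] / total);
            for (int i = 0; i < count; i++) values.Add(x);
        }
        pool = values.ToArray(); // real size may differ slightly from n due to rounding
        Shuffle();
    }

    public int Next()
    {
        if (index >= pool.Length) { Shuffle(); index = 0; }
        return pool[index++];
    }

    private void Shuffle() // Fisher-Yates
    {
        for (int i = pool.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
    }
}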
The other variation of this avoids the use of an array and shuffling. It goes like this:
get a random value in the range <0..1>
apply the inverse cumulative distribution function to convert it to the target range
As you can see, it's like the #2 Apply a non-linear function... approach, but instead of "some" non-linear function you use the distribution directly. So if p(x) is the probability of x, in range <0..1> where 1 means 100%, then we need a function that accumulates all the probabilities up to x (the cumulative distribution function). For integers:
f(x) = p(0)+p(1)+...+p(x)
Now we need inverse function g() to it so:
y = f(x)
x = g(y)
Now if my memory serves me well then the generation should look like this:
y = random(); // <0..1>
x = g(y); // probability -> value
Many distributions have a known g() function, but for those that do not (or when we are too lazy to derive it) you can use a binary search on f(x). Too lazy to code it, so here is the slower linear-search version:
for (x=0;x<max;x++) if (f(x)>=y) break;
So when putting it all together (and using only p(x)) I got this (C++):
y=random(); // uniform distribution pseudo random value in range <0..1>
for (f=0.0,x=0;x<max;x++) // loop x through all values
{
f+=p(x); // f(x) cumulative distribution function
if (f>=y) break;
}
// here x is your pseudo random value following p(x) distribution
This kind of solution usually has very good statistical and randomness properties, and it does not require the distribution to be a continuous function (it can even be just an array of values).
I am developing a little application in Visual Studio 2010 in C# to draw a spectrogram (frequency "heat map").
I have already done the basic things:
Cut a rectangular windowed array out of the input signal array
Feed that array into FFT, which returns complex values
Store magnitude values in an array (spectrum for that window)
Step the window, and store the new values in other arrays, resulting in a jagged array that holds every step of windowing and their spectra
Draw these into a Graphics object, in color that uses the global min/max values of the heat map as relative cold and hot
The LEFT side of the screenshot shows my application, and on the RIGHT there is a spectrogram for the same input (512 samples long) and same rectangular window with size 32 from a program called "PAST - time series analysis" (https://folk.uio.no/ohammer/past/index.html). My 512 long sample array only consists of integer elements ranging from around 100 to 1400.
(Note: the light-blue bar on the very right of the PAST spectrogram is only there because I accidentally left an unnecessary '0' element at the end of that input array. Otherwise they are the same.)
Link to screenshot: https://drive.google.com/open?id=1UbJ4GyqmS6zaHoYZCLN9c0JhWONlrbe3
But I have encountered a few problems here:
The spectrogram seems very lacking in detail compared to another one that I made in "PAST time series analysis" for reference, and that one looks extremely detailed. Why is that?
I know that for e.g. a 32-sample time window the FFT returns 32 elements; the 0th element is not needed here, and the next 32/2 elements have the magnitude values I need. But this means that the frequency "resolution" of the output for a 32-sample window is 16. That is exactly what my program uses. But the "PAST" program shows a lot more detail. If you look at the narrow lines in the blue background, you can see that they show a nice pattern along the frequency axis, but in my spectrogram that information remains unseen. Why?
In the first (windowSize/2)-wide band of window steps and in the last (windowSize/2)-wide band, there are fewer values available as FFT input, so there is less output, or just less precision. But in the "PAST" program those parts also seem relatively detailed, not just stretched bars like in mine. How can I improve that?
The 0th element of the FFT output array (the so-called "DC" element) is a huge number, a lot bigger than the sample average, or even than its sum. Why is that?
Why are my values (e.g. the maximum that you see near the color bar) so huge? That is just a magnitude value from the FFT output. Why are there different values in the PAST program? What correction should I use on the FFT output to get those values?
Please share your ideas if you know more about this topic. I am very new to this; I first read about the Fourier transform only a little more than a week ago.
Thanks in advance!
To get more smoothness in the vertical axis, zero pad your FFT so that there are more (interpolated) frequency bins in the output. For instance, zero pad your 32 data points so that you can use a 256 point or larger FFT.
To get more smoothness in the horizontal axis, overlap your FFT input windows (75% overlap, or more).
For both, use a smooth window function (Hamming or von Hann, etc.), and try wider windows, more than 32 samples (thus even more overlapped).
To get better coloring, try using a color mapping table, with the input being the log() of the (non zero) magnitudes.
You can also use multiple different FFTs per graph XY point, and decide which to color with based on local properties.
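To make the windowing and zero-padding concrete, here is a small C# sketch; PrepareFrame is my own helper name, and its result is what you would hand to whatever FFT routine you already use in place of the raw 32 samples:

using System;
using System.Numerics;

static class SpectrogramHelpers
{
    // Applies a von Hann window to one frame and zero-pads it to paddedLength
    // (e.g. 32 samples padded to 256) to get more, interpolated frequency bins.
    public static Complex[] PrepareFrame(double[] frame, int paddedLength)
    {
        int n = frame.Length;
        var padded = new Complex[paddedLength]; // elements beyond n stay zero
        for (int i = 0; i < n; i++)
        {
            double hann = 0.5 * (1.0 - Math.Cos(2.0 * Math.PI * i / (n - 1)));
            padded[i] = frame[i] * hann;
        }
        return padded; // pass this to your FFT instead of the raw frame
    }
}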
Hello LimeAndConconut,
Even though I do not know PAST, I can provide you with some general information about the FFT. Here is an answer to each of your points:
1- You are right, an FFT performed on 32 elements returns 32 frequencies (the null frequency, plus positive and negative components). It means that you already have all the information in your data, and PAST cannot get more information out of the same 32-sample window. That's why I suspect the data is interpolated for plotting, but this is just visual. Once again, PAST cannot create more information than what you have in your data.
2- Once again I agree with you. At the borders, you have access to fewer frequency components. You can choose different strategies: do not show data at the borders, or extend the data with zero-padding or circular padding.
3- The zero element of the FFT should be the sum of your 32 windowed samples. You need to check the FFT normalization; have a look at the documentation of your FFT function.
4- Once again, check the FFT normalization. Since the PAST colorbar exhibits negative values, it seems to be plotted on a logarithmic scale. It is common to use a logarithm when plotting data with high dynamic range in order to enhance details.
I'm making just a basic application that writes pixels along a curve in C#.
I came across this website with a formula that looks promising. I believe this website is also talking about the same thing here.
What I don't really understand is how to implement it. I tried looking at the JavaScript code on the first link, but I can't really tell what data I need to supply; the things involving the PVC, PVI, or PVT are what I don't understand.
The example situation I'm going to set up has both grades (vertical incline/decline) at just 5 and -5. Let's say point 1 is at 0, 0 and point 2 is at 100, 100.
Can someone explain some of the obscure variables in the formula and how would I use the formula to draw the curve?
Generally, to draw a curve in 2D you vary one parameter, collect x,y point pairs, and plot the pairs. In your case it will work to just vary the horizontal distance (x), collect the corresponding y-values, and then plot those.
As for the formula, it is very unclear. Basically it's just a parabola with a bunch of (poorly defined) jargon around it. To graph it, you want to vary x from 0 to L (this isn't obvious, by the way; I had to work out the math, i.e. how to vary x so that the slopes would be as they suggest in the figure; anyway, it's 0 to L, and they should have said so).
I don't have C# running now, but hopefully you can translate this Python code:
from matplotlib.pyplot import plot, show
from numpy import arange
G1 = .1 # an initial slope (grade) of 10% (note that one never uses percentages directly in calculations, it's always %/100)
G2 = -.02 # a final slope (grade) of -2%
c = 0 # elevation (value of curve when x=0, that is, y at PVC
L = 10. # the length of the curve in whatever unit you want to use (ft, m, mi, etc), but this basically sets your unit system
N = 1000 # I'm going to calculate and plot 1000 points to illustrate this curve
x = arange(0, L, float(L)/N) # an array of N x values from 0 to (almost) L
# calculate the curve
a = (G2-G1)/(2*L)
b = G1
y = a*x*x + b*x + c # this is shorthand for a loop y[0]=a*x[0]*x[0] + b*...
plot(x, y)
show()
print((y[1]-y[0])/(x[1]-x[0]), (y[-1]-y[-2])/(x[-1]-x[-2]))
The final line prints the initial and final slopes as a check (in Python, negative indexing counts from the back of the array), and these match what I specified for G1 and G2. The plot shows the resulting curve.
As for your request, "The example situation I'm going to set up has both grades (vertical incline/decline) at just 5 and -5. Let's say point 1 is at 0, 0 and point 2 is at 100, 100.": in a parabola you basically get three free parameters (corresponding to a, b, and c), and here, I think, you have over-specified it.
What are PVC, PVT, and PVI? PVC is the starting point, so Y_PVC is the height of the starting point. PVT is the ending point. PVI: if you draw a line from the PVC at the initial slope G1 (i.e. the tangent to the curve on the left), and similarly from the PVT, the point where they intersect is called the PVI (though why someone would ever care about this point is beyond me).
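A rough C# translation of the Python above, returning sampled (x, y) pairs that you could then draw pixel by pixel or with Graphics.DrawLines; the names are mine:

using System;

static class VerticalCurve
{
    // y = a*x^2 + b*x + c with a = (g2 - g1) / (2 * length), b = g1,
    // c = elevation at the PVC; x runs from 0 to length.
    public static double[][] Sample(double g1, double g2, double length,
                                    double startElevation, int pointCount)
    {
        double a = (g2 - g1) / (2.0 * length);
        double b = g1;
        double c = startElevation;

        var points = new double[pointCount][];
        for (int i = 0; i < pointCount; i++)
        {
            double x = length * i / (pointCount - 1);
            points[i] = new double[] { x, a * x * x + b * x + c };
        }
        return points;
    }
}

For instance, VerticalCurve.Sample(0.05, -0.05, 100, 0, 200) starts at (0, 0) with a +5% grade and ends at x = 100 with a -5% grade; as noted above, you cannot also force the end point to be (100, 100), since that would over-specify the parabola.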
Here's a somewhat simplified example of what I am trying to do.
Suppose I have a formula that computes credit points, but the formula has no constraints (for example, the score might be 1 to 5000). And a score is assigned to 100 people.
Now, I want to assign a "normalized" score between 200 and 800 to each person, based on a bell curve. So for example, if one guy has 5000 points, he might get an 800 on the new scale. People in the middle of my point range will get a score near 500. In other words, 500 is the median?
A similar example might be the old scenario of "grading on the curve", where the bulk of the students perhaps get a C or C+.
I'm not asking for the code, just a library, an algorithm book, or a website to refer to. I'll probably be writing this in Python (but C# is of some interest as well). There is NO need to graph the bell curve. My data will probably be in a database and I may have as many as a million people to whom to assign this score, so scalability is an issue.
Thanks.
The important property of the bell curve is that it describes the normal distribution, which is a simple model for many natural phenomena. I am not sure what kind of "normalization" you intend to do, but it seems to me that the current score already follows a normal distribution; you just need to determine its parameters (mean and variance) and scale each result accordingly.
References:
https://en.wikipedia.org/wiki/Grading_on_a_curve
https://en.wikipedia.org/wiki/Percentile
(see also: gaussian function)
I think the approach that I would try would be to compute the mean (average) and standard deviation (average distance from the average). I would then choose parameters to fit my target range. Specifically, I would map the mean of the input values to 500, and I would choose 6 standard deviations to consume 99.7% of my target range, i.e. a single standard deviation occupies about 16.6% of my target range.
Since your target range is 600 units wide (from 200 to 800), a single standard deviation would cover 99.7 units. So a person whose input credit score is one standard deviation above the input mean would get a normalized credit score of 599.7.
So now:
import statistics

# mean and standard deviation of the input values
mean = statistics.mean(input_scores)
stddev = statistics.pstdev(input_scores)

for score in input_scores:
    distance_from_mean = score - mean
    distance_from_mean_in_standard_deviations = distance_from_mean / stddev
    target = 500 + distance_from_mean_in_standard_deviations * 99.7
    if target < 200:
        target = 200
    if target > 800:
        target = 800
This won't necessarily map the median of your input scores to 500. This approach assumes that your input is more-or-less normally distributed and simply translates the mean and stretches the input bell curve to fit in your range. For inputs that are significantly not bell curve shaped, this may distort the input curve rather badly.
A second approach is to simply map your input range to your output range:
for score in input_scores:
    value = (score - 1.0) / (5000 - 1)
    target = value * (800 - 200) + 200
This will preserve the shape of your input, but in your new range.
A third approach is to have your target range represent percentiles instead of trying to represent a normal distribution: 1% of people would score between 200 and 205; 1% would score between 794 and 800. Here you would rank your input scores and convert the ranks into values in the range 200..800. This makes full use of your target range and gives it an easy-to-understand interpretation.
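As a concrete sketch of the percentile idea (in C#, since you mention it is of interest; the class and method names are mine), rank the scores and spread the ranks evenly across 200..800:

using System;
using System.Linq;

static class PercentileScaler
{
    // Maps each raw score to 200..800 based on its rank; ties keep the order
    // the sort gives them (averaging tied ranks would be a refinement).
    public static double[] Scale(double[] rawScores)
    {
        int n = rawScores.Length;
        var result = new double[n];

        // original indices ordered from lowest to highest raw score
        int[] order = Enumerable.Range(0, n).OrderBy(i => rawScores[i]).ToArray();

        for (int rank = 0; rank < n; rank++)
        {
            double percentile = n == 1 ? 0.5 : (double)rank / (n - 1); // 0..1
            result[order[rank]] = 200 + percentile * 600;              // 200..800
        }
        return result;
    }
}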