I have a little neural network program. I have optimized it to train quicker, but I have noticed that the backpropagation part takes around 10 times longer than the forward pass. The only major difference I can see is that I am accessing my weight matrix non-sequentially (as opposed to sequentially, like in the forward pass). This leads to cache misses and hurts my performance. On the forward pass, I loop through all neurons in the current layer, then through all input neurons, and I access the weights of the current layer in the order they are laid out in the matrix. The matrix holds the weights for the current neuron in rows, so as I iterate over the input neurons I am iterating over the columns of a row, which is cache friendly. On the backward pass, however, I loop through the neurons of the current layer, then through the neurons of the next (output or hidden) layer, and access individual elements from the rows of that next layer's weight matrix inside my loop. This makes my index jump over entire rows in the matrix, effectively destroying my cache performance. Is there a way to do this backwards step in a more cache-friendly way?
I'm no expert in neural networks, but I have spent a little bit of time with matrix multiplication.
A very simple and effective optimization when multiplying matrices is to read the slow-direction slice (a column of B in the row-major code below) into a temporary array so it can be reused. Modifying this answer slightly:
var arr = new double[cA];
for (int j = 0; j < cB; j++)
{
    // copy column j of B into a contiguous temporary array
    for (int t = 0; t < cA; t++)
    {
        arr[t] = B[t, j];
    }
    for (int i = 0; i < rA; i++)
    {
        double temp = 0;
        for (int k = 0; k < cA; k++)
        {
            temp += A[i, k] * arr[k];
        }
        result[i, j] = temp;
    }
}
The point is to copy in the slow direction in the outer loop, so these values can be reused each time in the inner loop. This is fairly simple, and helps a great deal with cache efficiency.
An even better approach would be to read entire cache lines and reuse those values as much as possible, but that requires something like six nested loops, so you will need to find other resources for an exact description of how to do it; I would not be confident writing one freehand.
Some other tips:
Just use a library; there should be plenty of well-optimized matrix multiplication implementations out there.
If you use multidimensional arrays, make sure not to call .GetLength() repeatedly, this is super slow.
Consider making your own matrix class that uses a 1D array as the backing storage (a minimal sketch follows this list). This may help reduce index calculations a bit, but the greatest benefit is ease of interoperability: external tools and libraries have spotty support for multidimensional arrays, while 1D-backed matrices are usually better supported.
Use SIMD intrinsics; this should give roughly a 4x speedup.
Use a parallel outer loop.
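To illustrate the 1D-backing tip above, here is a minimal sketch only; Matrix1D and its members are placeholder names, not anything from your code:
// Minimal row-major matrix backed by a flat 1D array.
class Matrix1D
{
    public readonly int Rows, Cols;
    public readonly double[] Data;           // element (r, c) lives at Data[r * Cols + c]

    public Matrix1D(int rows, int cols)
    {
        Rows = rows;
        Cols = cols;
        Data = new double[rows * cols];
    }

    public double this[int r, int c]
    {
        get { return Data[r * Cols + c]; }
        set { Data[r * Cols + c] = value; }
    }
}
Because the storage is a plain double[], it can be handed directly to SIMD helpers, interop code, or external math libraries.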
I have a ConcurrentDictionary of arrays, where each array has the same fixed size. It looks like this: ConcurrentDictionary<int, double[]> ItemFeatures
I want to normalize the values in the list by dividing all the values by the maximum of the values in that column. For example, if my lists are of size 5, I want every element in the first position to be divided by the maximum of all the values in that position, and so on for position 2 onwards.
The naive way that I can think of doing this, is by first iterating over every list and every element in the list, and storing the max value per position. Then iterating over them again and dividing them by the previously found maximum values.
Is there a more elegant way to do this in Linq perhaps? These dictionaries would be large, so the more efficient/least time consuming, the better.
No, that will actually be the most efficient way. In the end, this is what you need to do anyway, you can't skip anything. You can probably write it in LINQ somehow, but the performance will be worse because it will have a lot of function calls and memory allocations. LINQ doesn't perform miracles, it's just a (sometimes) shorter way of writing things.
What can speed this up is if your algorithm has a good "cache locality" - in other words, if you access the computer memory in a sequential way. That's pretty hard to guarantee in an environment like .NET, but a loop like you described probably has the best chances of getting close to it.
LINQ is designed for querying data, not modifying data. You can use a little LINQ to compute the maximums, but that is about it:
var cols = ItemFeatures.First().Value.Length;

// first pass: find the maximum of each column
var maxv = new double[cols];
for (var j1 = 0; j1 < cols; ++j1)
    maxv[j1] = ItemFeatures.Values.Select(vs => vs[j1]).Max();

// second pass: divide every element by its column maximum
foreach (var kvp in ItemFeatures)
    for (var j1 = 0; j1 < cols; ++j1)
        kvp.Value[j1] /= maxv[j1];
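If the dictionaries are very large, the division pass also parallelizes trivially, since each array is modified independently. A sketch, assuming .NET 4's TPL (System.Threading.Tasks) is available:
// hypothetical parallel version of the second pass; maxv and cols are computed as above
Parallel.ForEach(ItemFeatures, kvp =>
{
    for (var j1 = 0; j1 < cols; ++j1)
        kvp.Value[j1] /= maxv[j1];
});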
First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing code for C#, it's basically a windows form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job and moving around and zooming is pretty smooth, taking less than a second per drawing, but once I start to zoom in a little and get to places which require more calculations it becomes really slow.
Other Mandelbrot applications run smoothly on my computer in places that are much slower in my application, so I'm guessing there is a lot I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel/GetPixel methods on the bitmap object, I used the LockBits method to write directly to memory, which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using two variables, re and im. Doing this allowed me to cut down on multiplications, because squaring the real part and the imaginary part is something that is done a few times during the calculation, so I just save the square in a variable and reuse the result without needing to recalculate it.
I use 4 threads to draw the Mandelbrot, each thread does a different quarter of the image and they all work simultaneously. As I understood, that means my CPU will use 4 of its cores to draw the image.
I use the Escape Time Algorithm, which as I understand is the fastest?
Here is how I move between the pixels and calculate; it's commented so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine which complex number corresponds to the (r,i) pixel.
        double re = (r - (w / 2)) * step + zeroX;
        double im = (i - (h / 2)) * step - zeroY;

        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;

        //Variables to store the squares of the real and imaginary parts.
        double multZre = 0;
        double multZim = 0;

        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /* The new real part equals re(z)^2 - im(z)^2 + re(c); we store it in a temp variable
               tempRe because we still need re(z) in the next calculation. */
            double tempRe = multZre - multZim + re;

            /* The new imaginary part is equal to 2*re(z)*im(z) + im(c).
               Instead of multiplying by 2 I add re(z) to itself and then multiply by im(z),
               which means I do 1 multiplication instead of 2. */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.

            //Do the squaring now; they will be used in the next iteration.
            multZre = zRe * zRe;
            multZim = zIm * zIm;

            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }

        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question; I doubt the problem is with this because my code becomes really slow
        // as the number of ITERATIONS grows, and this only executes more as the number of pixels grows.)
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers; double has limited accuracy and I'm sure there are better non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy. I just need someone to point me to where I need to look and tell me if it's true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? is it even that simple or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject or tell me in general about it that would be great.
Thanks a lot for reading that far and hope you can help me :)
If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal. This would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code and compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general-purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.
WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a built-in .NET class) since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say that they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
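As an illustration of the "write results to a queue and let another thread commit them" idea, here is a rough sketch only; it assumes .NET 4's BlockingCollection, and PixelResult/iterationStore are made-up names rather than anything from my actual code (w and h are the image dimensions from your code):
struct PixelResult { public int X, Y, Iterations; }

// shared queue: compute threads produce, one writer thread consumes
BlockingCollection<PixelResult> results = new BlockingCollection<PixelResult>();
int[,] iterationStore = new int[w, h];   // deferred render data, not a bitmap

Thread writer = new Thread(() =>
{
    // drains the queue; compute threads never wait on the bitmap or on each other
    foreach (PixelResult p in results.GetConsumingEnumerable())
        iterationStore[p.X, p.Y] = p.Iterations;
});
writer.Start();

// in each compute thread, after finding mandelValue for pixel (r, i):
//     results.Add(new PixelResult { X = r, Y = i, Iterations = mandelValue });

// once all compute threads have finished:
//     results.CompleteAdding();
//     writer.Join();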
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and having a huge result-set. I was thinking of targeting an 8 gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier) so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point that I hardly ever see made about the Mandelbrot set is that it is symmetrical about the real axis. You might be doing twice as much calculating as you need to.
For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. It works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
Another suggestion: when you use multiple threads, take care that threads don't read/write memory used by other threads, because this will cause cache collisions (false sharing) and hurt performance. One good approach is to split the work up into scanlines (instead of four quarters like you do now). Create a number of threads, then, as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy this back to the main bitmap after each line (to avoid cache collisions). A rough sketch follows.
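This is illustrative only; RenderLine and CopyLineToBitmap are hypothetical helpers standing in for your escape-time loop and a locked bitmap copy, and w, h and pixelSize are taken from your existing code:
int nextLine = -1;                                    // shared counter handing out rows
Thread[] workers = new Thread[Environment.ProcessorCount];

for (int t = 0; t < workers.Length; t++)
{
    workers[t] = new Thread(() =>
    {
        byte[] lineBuffer = new byte[w * pixelSize];  // thread-local buffer, no sharing
        int line;
        while ((line = Interlocked.Increment(ref nextLine)) < h)
        {
            RenderLine(line, lineBuffer);             // escape-time loop for one row
            CopyLineToBitmap(line, lineBuffer);       // one locked copy per finished row
        }
    });
    workers[t].Start();
}

foreach (Thread worker in workers)
    worker.Join();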
I would like to know the following:
How can I effectively make an initial generation of chromosomes with high diversity using value encoding?
One way is grid initialization, but it is too slow.
Until now I have been using the Random class from .NET for choosing random values in value encoding, but although the values are uniformly distributed, the fitness function values calculated from such chromosomes are not. Here is the code for chromosome initialization:
public Chromosome(Random rand)
{
    Alele = new List<double>();
    for (int i = 0; i < ChromosomeLength; i++)
    {
        Alele.Add(rand.NextDouble() * 2000 - 1000);
    }
}
So, I developed a function that calculates the fitness of a new, randomly made chromosome (code above). If that fitness is similar to the fitness of any chromosome already in the list, a new chromosome is made randomly, its fitness is calculated, and the process is repeated until its fitness is different enough from those already in the list.
Here is the code for this part:
private bool CheckSimilarFitnes(List<Chromosome> chromosome, Chromosome newCandidate)
{
    double fitFromList, fitFromCandidate;
    double fitBigger, fitSmaller;
    foreach (var listElement in chromosome)
    {
        fitFromList = listElement.CalculateChromosomeFitness(listElement.Alele);
        fitFromCandidate = newCandidate.CalculateChromosomeFitness(newCandidate.Alele);
        fitBigger = fitFromList >= fitFromCandidate ? fitFromList : fitFromCandidate;
        fitSmaller = fitFromList < fitFromCandidate ? fitFromList : fitFromCandidate;
        if ((fitBigger / fitSmaller) < 1.5)
            return false;
    }
    return true;
}
But the more chromosomes I have in the list, the longer it takes to add a new one with a fitness that is different enough from the others already in there.
So, is there a way to make this initialization faster? It takes days to make 80 chromosomes like this.
Here's some code that might help (which I just wrote): a GA for ordering 10 values spaced by 1.0. It starts with a population of 100 completely random alleles, which is exactly how your code starts.
The goal I gave the GA was to order the values in increasing order with a separation of 1.0. It does this in the fitness function Eval_OrderedDistance by computing the standard deviation of the adjacent-pair differences from 1.0. As the fitness tends toward 0.0, the alleles should start to appear in sequential order.
Generation 0's fittest Chromosome was completely random, as were the rest of the Chromosomes. You can see the fitness value is very high (i.e., bad):
GEN: fitness (allele, ...)
0: 375.47460 (583.640, -4.215, -78.418, 164.228, -243.982, -250.237, 354.559, 374.306, 709.859, 115.323)
As the generations continue, the fitness (standard deviation from 1.0) decreases until it's nearly perfect in generation 100,000:
100: 68.11683 (-154.818, -173.378, -170.846, -193.750, -198.722, -396.502, -464.710, -450.014, -422.194, -407.162)
...
10000: 6.01724 (-269.681, -267.947, -273.282, -281.582, -287.407, -293.622, -302.050, -307.582, -308.198, -308.648)
...
99999: 0.67262 (-294.746, -293.906, -293.114, -292.632, -292.596, -292.911, -292.808, -292.039, -291.112, -290.928)
The interesting parts of the code are the fitness function:
// try to pack the aleles together spaced apart by 1.0
// returns the standard deviation of the samples from 1.0
static float Eval_OrderedDistance(Chromosome c) {
float sum = 0;
int n = c.alele.Length;
for(int i=1; i<n; i++) {
float diff = (c.alele[i] - c.alele[i-1]) - 1.0f;
sum += diff*diff; // variance from 1.0
}
return (float)Math.Sqrt(sum/n);
}
And the mutations. I used a simple crossover and a "completely mutate one allele":
Chromosome ChangeOne(Chromosome c) {
    Chromosome d = c.Clone();
    int i = rand.Next() % d.alele.Length;
    d.alele[i] = (float)(rand.NextDouble()*2000-1000);
    return d;
}
I used elitism to always keep one exact copy of the best Chromosome. Then generated 100 new Chromosomes using mutation and crossover.
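The crossover operator isn't shown above; a simple one-point crossover might look roughly like this (my sketch, not necessarily the exact operator I used):
// one-point crossover: first part from parent a, the rest from parent b
Chromosome Crossover(Chromosome a, Chromosome b) {
    Chromosome child = a.Clone();
    int cut = rand.Next(1, a.alele.Length);   // crossover point in 1..Length-1
    for(int i=cut; i<a.alele.Length; i++)
        child.alele[i] = b.alele[i];
    return child;
}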
It really sounds like you're calculating the variance of the fitness, which does of course tell you that the fitnesses in your population are all about the same. I've found that it's very important how you define your fitness function. The more granular the fitness function, the more you can discriminate between your Chromosomes. Obviously, your fitness function is returning similar values for completely different chromosomes, since your gen 0 returns a fitness variance of 68e-19.
Can you share your fitness calculation? Or what problem you're asking the GA to solve? I think that might help us help you.
[Edit: Adding Explicit Fitness Sharing / Niching]
I rethought this a bit and updated my code. If you're trying to maintain unique chromosomes, you have to compare their content (as others have mentioned). One way to do this would be to compute the standard deviation between them. If it's less than some threshold, you can consider them the same. From class Chromosome:
// compute the population standard deviation
public float StdDev(Chromosome other) {
float sum = 0.0f;
for(int i=0; i<alele.Length; i++) {
float diff = other.alele[i] - alele[i];
sum += diff*diff;
}
return (float)Math.Sqrt(sum);
}
I think Niching will give you what you'd like. It compares all the Chromosomes in the population to determine their similarity and assigns a "niche" value to each. The chromosomes are then "penalized" for belonging to a niche using a technique called Explicit Fitness Sharing. The fitness values are divided by the number of Chromosomes in each niche. So if you have three in niche group A (A,A,A) instead of that niche being 3 times as likely to be chosen, it's treated as a single entity.
I compared my sample with Explicit Fitness Sharing on and off. With a max STDDEV of 500 and Niching turned OFF, there were about 18-20 niches (so basically 5 duplicates of each item in a population of 100). With Niching turned ON, there were about 85 niches. That's 85% unique Chromosomes in the population. In the output of my test, you can see the diversity after 17000 generations.
Here's the niching code:
// returns: total number of niches in this population
// max_stddev -- any two chromosomes with population stddev less than this max
//               will be grouped together
int ComputeNiches(float max_stddev) {
    List<int> niches = new List<int>();

    // clear niches
    foreach(var c in population) {
        c.niche = -1;
    }

    // calculate niches
    for(int i=0; i<population.Count; i++) {
        var c = population[i];
        if( c.niche != -1) continue; // niche already set

        // compute the niche by finding the stddev between the two chromosomes
        c.niche = niches.Count;
        int count_in_niche = 1; // includes the current Chromosome

        for(int j=i+1; j<population.Count; j++) {
            var d = population[j];
            float stddev = c.StdDev(d);
            if(stddev < max_stddev) {
                d.niche = c.niche; // same niche
                ++count_in_niche;
            }
        }
        niches.Add(count_in_niche);
    }

    // penalize Chromosomes by their niche size
    foreach(var c in population) {
        c.niche_scaled_fitness = c.scaled_fitness / niches[c.niche];
    }

    return niches.Count;
}
[Edit: post-analysis and update of Anton's code]
I know this probably isn't the right forum to address homework problems, but since I put in the effort before knowing this, and I had a lot of fun doing it, I figure it can only be helpful to Anton.
Genotip.cs, Kromosom.cs, KromoMain.cs
This code maintains good diversity, and I was able in one run to get the "raw fitness" down to 47, which is in your case the average squared error. That was pretty close!
As noted in my comment, I'd like to try to help you with your programming, not just with your homework. Please read this analysis of your work.
As we expected, there was no need to make a "more diverse" population from the start. Just generate some completely random Kromosomes.
Your mutations and crossovers were highly destructive, and you only had a few of them. I added several new operators that seem to work better for this problem.
You were throwing away the best solution. When I got your code running with only Tournament Selection, there would be one Kromo that was 99% better than all the rest. With tournament selection, that best value was very likely to be forgotten. I added a bit of "elitism" which keeps a copy of that value for the next generation.
Consider object oriented techniques. Compare the re-write I sent you with my original code.
Don't duplicate code. You had the sampling parameters in two different classes.
Keep your code clean. There were several unused parts of code. Especially when submitting questions to SO, try to narrow it down, remove unused code, and do some cleaning up.
Comment your code! I've commented the re-work significantly. I know it's Serbian, but even a few comments will help someone else understand what you are doing and what you intended to do.
Overall, nice job implementing some of the more sophisticated things like Tournament Selection
Prefer double[] arrays instead of List<double>; there's less overhead. Also, several of your List<double> temp variables weren't even needed. Your structure
List<double> temp = new List<double>();
for(...) {
    temp.Add(value);
}
foreach(var value in temp) {
    sum += value;
}
average = sum / temp.Count;
can easily be written as:
sum = 0;
for(...) {
    sum += value;
}
average = sum / count;
In several places you forgot to initialize a loop variable, which could have easily added to your problem. Something like this will cause serious problems, and it was in your fitness code along with one or two other places
double fit = 0;
for(each chromosome) {
    // YOU SHOULD INITIALIZE fit HERE inside the LOOP
    for(each allele) {
        fit += ...;
    }
    fit /= count;
}
Good luck programming!
The basic problem here is that most randomly generated chromosomes have similar fitness, right? That's fine; the idea isn't for your initial chromosomes to have wildly different fitnesses; it's for the chromosomes themselves to be different, and presumably they are. In fact, you should expect the initial fitness of most of your first generation to be close to zero, since you haven't run the algorithm yet.
Here's why your code is so slow. Let's say the first candidate is terrible, basically zero fitness. If the second one has to be 1.5x different, that really just means it has to be 1.5x better, since it can't really get worse. Then the next one has to be 1.5x better than that, and so on up to 80. So what you're really doing is searching for increasingly better chromosomes by generating completely random ones and comparing them to what you have. I bet if you logged the progress, you'd find it takes more and more time to find the subsequent candidates, because really good chromosomes are hard to find. But finding better chromosomes is what the GA is for! Basically what you've done is optimize some of the chromosomes by hand before, um, actually optimizing them.
If you want to ensure that your chromosomes are diverse, compare their content, don't compare their fitness. Comparing the fitness is the algo's job.
I'm going to take a quick swing at this, but Isaac's pretty much right. You need to let the GA do its job. You have a generation of individuals (chromosomes, whatever), and they're all over the scale on fitness (or maybe they're all identical).
You pick some good ones to mutate (by themselves) and crossover (with each other). You maybe use the top 10% to generate another full population and throw out the bottom 90%. Maybe you always keep the top guy around (Elitism).
You iterate at this for a while until your GA stops improving because the individuals are all very much alike. You've ended up with very little diversity in your population.
What might help you is to 1) make your mutations more effective, 2) find a better way to select individuals to mutate. In my comment I recommended AI Techniques for Game Programmers. It's a great book. Very easy to read.
To list a few headings from the book, the things you're looking for are:
Selection techniques like Roulette Selection (on Stack Overflow) (on Wikipedia) and Stochastic Universal Sampling, which control how you select your individuals. I've always liked Roulette Selection: you set the probabilities that an individual will be selected, so it's not just simple white-noise random sampling. (A bare-bones sketch appears after this list.)
I used this outside of GA for selecting 4 letters from the Roman alphabet randomly. I assigned a value from 0.0 to 1.0 to each letter. Every time the user (child) would pick the letter correctly, I would lower that value by, say 0.1. This would increase the likelihood that the other letters would be selected. If after 10 times, the user picked the correct letter, the value would be 0.0, and there would be (almost) no chance that letter would be presented again.
Fitness Scaling techniques like Rank Scaling, Sigma Scaling, and Boltzmann Scaling (pdf on ftp!!!) that let you modify your raw fitness values to come up with adjusted fitness values. Some of these are dynamic, like Boltzmann Scaling, which allows you to set a "pressure" or "temperature" that changes over time. Increased "pressure" means that fitter individuals are selected. Decreased pressure means that any individual in the population can be selected.
I think of it this way: you're searching through multi-dimensional space for a solution. You hit a "peak" and work your way up it. The pressure to be fit is very high. You snug right into that local maximum. Now your fitness can't change. Your mutations aren't getting you out of the peak. So you start to reduce the pressure and just, oh, select items randomly. Your fitness levels start to drop, which is okay for a while. Then you start to increase the pressure again, and surprise! You've skipped out of the local maximum and found a lovely new local maximum to climb into. Increase the pressure again!
Niching (which I've never used, but appears to be a way to group similar individuals together). Say you have two pretty good individuals, but they're wildly different. They keep getting selected. They keep mutating slightly, and not getting much better. Now you have half your population as minor variants of A, and half your population as minor variants of B. This seems like a way to say, hey, what's the average fitness of that entire group A? And what for B? And what for every other niche you have? Then do your selection based on the average fitness for each niche. Pick your niche, then select a random individual from that niche. Maybe I'll start using this after all. I like it!
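Here is the bare-bones roulette selection sketch promised above. It assumes fitness values are non-negative and stored in a fitness field; SelectRoulette is just a made-up name:
// each chromosome gets a slice of the wheel proportional to its fitness
Chromosome SelectRoulette(List<Chromosome> population, Random rand) {
    double total = 0;
    foreach(var c in population)
        total += c.fitness;                    // c.fitness assumed non-negative

    double spin = rand.NextDouble() * total;
    double running = 0;
    foreach(var c in population) {
        running += c.fitness;
        if(running >= spin)
            return c;
    }
    return population[population.Count - 1];   // guard against rounding error
}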
Hope you find some of that helpful!
If you need true random numbers for your application, I recommend you check out Random.org. They have a free HTTP API, and clients for just about every language.
The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs.
(I am unaffiliated with Random.org, although I did contribute the PHP client).
I think your problem is in your fitness function and in how you select candidates, not in how random the values are. Your filtering feels too strict; it may not even allow enough elements to be accepted.
Sample:
values: random floats 0-10000
fitness function: square root(n)
desired distribution of fitness: linear, with a distance of at least 1
With this fitness function you will quickly get most of the 1-wide "spots" taken (since sqrt(10000) = 100, you have at most 100 places), so every next one will take longer. At some point there will be several tiny ranges left and most of the results will simply be rejected; even worse, after you have about 50 numbers placed there is a good chance that the next one simply will not be able to fit.
I am just wondering what the best approach is for this calculation. Let's assume I have an input array of values and an array of boundaries - I want to calculate/bucketize the frequency distribution for each segment in the boundaries array.
Is it a good idea to use a bucket search for that?
Actually I found that question Calculating frequency distribution of a collection with .Net/C#
But I do not understand how to use buckets for that purpose, because the size of each bucket can be different in my situation.
EDIT:
After all the discussion I have an inner/outer loop solution, but I still want to eliminate the inner loop with a Dictionary to get O(n) performance. If I understood correctly, I need to hash the input values into a bucket index. So we need some sort of hash function with O(1) complexity? Any ideas how to do it?
Bucket Sort is already O(n^2) worst case, so I would just do a simple inner/outer loop here. Since your bucket array is necessarily shorter than your input array, keep it on the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.
int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    for (int i = 0; i < buckets.Length - 1; i++)
    {
        if (d >= buckets[i] && d < buckets[i + 1])
        {
            freq[i]++;
            break;
        }
    }
}
It's also O(n^2) worst case but you can't beat the code simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger bucket array, you could use a binary search of some sort. But, since frequency distributions are typically < 100 elements, I doubt you'd see a lot of real-world performance benefit.
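For completeness, here is what that binary search could look like, assuming the boundaries array is sorted; Array.BinarySearch does the work, and values below the first or at/above the last boundary are simply skipped:
int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    // exact match: index of that boundary; no match: bitwise complement of the
    // index of the next larger boundary
    int idx = Array.BinarySearch(buckets, d);
    if (idx < 0)
        idx = ~idx - 1;                  // index of the largest boundary <= d

    if (idx >= 0 && idx < freq.Length)   // ignore values outside the boundary range
        freq[idx]++;
}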
If your input array represents real-world data (with its patterns) and the array of boundaries is large enough that iterating over it again and again in the inner loop is expensive, you can consider the following approach:
First of all, sort your input array. If you work with real-world data I would recommend considering Timsort (Wikipedia) for this. It provides very good performance guarantees for the patterns that can be seen in real-world data.
Then traverse the sorted array and compare each value with the current value in the array of boundaries (starting with the first):
If the value in the input array is less than the boundary, increment the frequency counter for this boundary.
If the value in the input array is bigger than the boundary, move to the next value in the array of boundaries and increment the counter for the new boundary.
In a code it can look like this:
Timsort(myArray);
int boundPos;
boundaries = GetBoundaries(); //assume the boundaries is a Dictionary<int,int>()
for (int i = 0; i<myArray.Lenght; i++) {
if (myArray[i]<boundaries[boundPos]) {
boundaries[boubdPos]++;
}
else {
boundPos++;
boundaries[boubdPos]++;
}
}
I have an array of float values and want the value and more importantly the position of the maximum four values.
I built the system originally to walk through the array and find the max the usual way: comparing the value at the current position to a recorded max-so-far, and updating a position variable when the max-so-far changes. This worked well; an O(n) algorithm that was very simple. I later learned that I need to keep not only the top value, but the top three or four. I extended the same procedure, complicated the max-so-far into an array of four max-so-fars, and now the code is ugly.
It still works and is still sufficiently fast, because only a trivial amount of computation has been added to the procedure. It still effectively walks across the array and checks each value once.
I do this in MATLAB with a sort function that returns two arrays, the sorted list and the accompanying original position list. By looking at the first few values I have exactly what I need. I am replicating this functionality into a C# .NET 2.0 program.
I know that I could do something similar with a List object, and that the List object has a built in sort routine, but I do not believe that it can tell me the original positions, and those are really what I am after.
It has been working well, but now I find myself wanting the fifth max value and see that rewriting the max-so-far checker that is currently an ugly mess of if statements would only compound the ugliness. It would work fine and be no slower to add a fifth level, but I want to ask the SO community if there is a better way.
Sorting the entire list takes many more computations than my current method, but I don't think it would be a problem, as the list is 'only' one or two thousand floats; so if there is a sort routine that can give back the original positions, that would be ideal.
As background, this array is the result of a Fourier Transform on a kilobyte of wave file, so the max values' positions correspond to the sample data's peak frequencies. I had been content with the top four, but see a need to really gather the top five or six for more accurate sample classification.
I can suggest an alternative algorithm which you'll have to code :)
Use a heap of size K, where K denotes the count of top elements you want to keep. Initialize it with the first K elements of your original array. For the remaining N - K elements, walk the array, replacing the heap's minimum whenever you encounter a larger element.
proc top_k (array<n>, heap<k>)
    heap <- array<0..k-1>
    for each (array<k..n-1>)
        if array[i] > heap.min
            heap.erase(heap.min)
            heap.insert(array[i])
        end if
    end for
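If you don't want to write a heap by hand (there is none built into .NET 2.0), a SortedList keyed on the value can stand in for it. Here is a rough sketch that keeps the K largest values and their original indices; it assumes the float values are distinct, since SortedList can't hold duplicate keys, and f is your float array:
int k = 4;
SortedList<float, int> top = new SortedList<float, int>();   // key = value, value = original index

for (int i = 0; i < f.Length; i++)
{
    if (top.Count < k)
    {
        top.Add(f[i], i);
    }
    else if (f[i] > top.Keys[0])   // Keys[0] is the smallest of the current top K
    {
        top.RemoveAt(0);           // drop the smallest
        top.Add(f[i], i);
    }
}
// 'top' now holds the K largest values in ascending order, with their indices as values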
You could still use your list idea - the elements you put in the list could be a structure which stores both the index and the value; but sorts only on the value, for instance:
class IndexAndValue : IComparable<IndexAndValue>
{
    public int index;
    public double value;

    public int CompareTo(IndexAndValue other)
    {
        return value.CompareTo(other.value);
    }
}
Then you can stick them in the list, while retaining the information about the index. If you keep only the largest m items in the list, then your efficiency should be O(mn).
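For example, the simplest (full-sort) use of that class might look like this; trimming the list to the largest m items as you insert is what gets you the O(mn) behaviour mentioned above:
List<IndexAndValue> items = new List<IndexAndValue>();
for (int i = 0; i < f.Length; i++)
{
    IndexAndValue iv = new IndexAndValue();
    iv.index = i;
    iv.value = f[i];
    items.Add(iv);
}

items.Sort();      // ascending by value, thanks to CompareTo
items.Reverse();   // largest first
// items[0], items[1], ... now give the top values along with their original indices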
I don't know which algorithm you're currently using, but I'll suggest a simple one.
Assuming that you have an array of floats f and want the top capacity numbers, you could do the following:
int capacity = 4; // number of top values you want to retrieve
float[] f; // your float array
int[] max_so_far = new int[capacity]; // indices of the largest values seen so far

// say that the first 'capacity' elements are the biggest, for now
for (int i = 0; i < capacity; i++)
    max_so_far[i] = i;

// for each number not yet processed
for (int i = capacity; i < f.Length; i++)
{
    // find the smallest of the 'max so far' values
    int m = 0;
    for (int j = 0; j < capacity; j++)
        if (f[max_so_far[j]] < f[max_so_far[m]])
            m = j;

    // if our current number is bigger than the smallest stored, replace it
    if (f[i] > f[max_so_far[m]])
        max_so_far[m] = i;
}
By the end of the algorithm, you'll have the indices of the greatest elements stored in max_so_far.
Do note that as the capacity value grows, this becomes slower than the alternative, which is sorting the list while keeping track of the initial positions. Remember that sorting takes O(n log n) comparisons, while this algorithm takes O(n * capacity).
Another option is to use quick-select.
Quick-select returns the position of the k-th element in a list. After you have the position and the value of the k-th element, go over the list and take every element whose value is smaller/larger than the k-th element.
I found a c# implementation of quick-select here: link text
Pros:
O(n+k) average running time.
Cons:
The k elements found are not sorted. If you sort them, the running time is O(n + k log k).
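If you'd rather not rely on an external implementation, a compact quick-select looks roughly like this. It's a sketch using the Lomuto partition; note that it reorders the array in place, so run it on a copy of the values and then scan the original array for elements above the returned threshold to recover the positions:
// returns the value that would sit at index k if 'a' were sorted ascending
static float QuickSelect(float[] a, int k)
{
    Random rand = new Random();
    int lo = 0, hi = a.Length - 1;
    while (true)
    {
        if (lo == hi)
            return a[lo];

        // partition around a randomly chosen pivot
        int pivotIndex = lo + rand.Next(hi - lo + 1);
        float pivot = a[pivotIndex];
        Swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) { Swap(a, i, store); store++; }
        Swap(a, store, hi);

        if (k == store) return a[store];
        if (k < store) hi = store - 1;
        else lo = store + 1;
    }
}

static void Swap(float[] a, int i, int j)
{
    float t = a[i]; a[i] = a[j]; a[j] = t;
}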
I haven't checked this, but I think that for a very small k the best option is to do k runs over the array, each time finding the next smallest/largest element.