I would like to know the following:
How can I effectively create an initial generation of chromosomes with high diversity using value encoding?
One way is grid initialization, but it is too slow.
Until now I have been using the Random class from .NET to choose random values in value encoding. Although the values are uniformly distributed, the fitness function values calculated from such chromosomes are not. Here is the code for Chromosome initialization:
public Chromosome(Random rand)
{
Alele = new List<double>();
for (int i = 0; i < ChromosomeLength; i++)
{
Alele.Add(rand.NextDouble() * 2000 - 1000);
}
}
So I wrote a function that calculates the fitness of each new, randomly generated chromosome (code above). If that fitness is similar to the fitness of any chromosome already in the list, a new random chromosome is generated and its fitness is calculated; this is repeated until the candidate's fitness differs enough from all those already in the list.
Here is the code for this part:
private bool CheckSimilarFitnes(List<Chromosome> chromosome, Chromosome newCandidate)
{
double fitFromList, fitFromCandidate;
double fitBigger,fitSmaller;
// the candidate's fitness does not change inside the loop, so compute it once
fitFromCandidate = newCandidate.CalculateChromosomeFitness(newCandidate.Alele);
foreach (var listElement in chromosome)
{
fitFromList = listElement.CalculateChromosomeFitness(listElement.Alele);
fitBigger = fitFromList >= fitFromCandidate ? fitFromList : fitFromCandidate;
fitSmaller = fitFromList < fitFromCandidate ? fitFromList : fitFromCandidate;
// too similar to a chromosome already in the list: reject the candidate
if ((fitFromList / fitFromCandidate) < 1.5)
return false;
}
// different enough from every chromosome in the list
return true;
}
But the more chromosomes I have in the list, the longer it takes to add a new one whose fitness differs enough from the others already there.
So, is there a way to make this grid initialization faster? It takes days to generate 80 chromosomes like this.
Here's some code that might help (which I just wrote): a GA for ordering 10 values spaced by 1.0. It starts with a population of 100 completely random alleles, which is exactly how your code starts.
The goal I gave the GA to solve was to order the values in increasing order with a separation of 1.0. It does this in the fitness function Eval_OrderedDistance by computing the standard deviation of the spacing between each adjacent pair of samples from 1.0. As the fitness tends toward 0.0, the alleles should start to appear in sequential order.
Generation 0's fittest Chromosome was completely random, as were the rest of the Chromosomes. You can see the fitness value is very high (i.e., bad):
GEN: fitness (allele, ...)
0: 375.47460 (583.640, -4.215, -78.418, 164.228, -243.982, -250.237, 354.559, 374.306, 709.859, 115.323)
As the generations continue, the fitness (standard deviation from 1.0) decreases until it's nearly perfect in generation 100,000:
100: 68.11683 (-154.818, -173.378, -170.846, -193.750, -198.722, -396.502, -464.710, -450.014, -422.194, -407.162)
...
10000: 6.01724 (-269.681, -267.947, -273.282, -281.582, -287.407, -293.622, -302.050, -307.582, -308.198, -308.648)
...
99999: 0.67262 (-294.746, -293.906, -293.114, -292.632, -292.596, -292.911, -292.808, -292.039, -291.112, -290.928)
The interesting parts of the code are the fitness function:
// try to pack the aleles together spaced apart by 1.0
// returns the standard deviation of the samples from 1.0
static float Eval_OrderedDistance(Chromosome c) {
float sum = 0;
int n = c.alele.Length;
for(int i=1; i<n; i++) {
float diff = (c.alele[i] - c.alele[i-1]) - 1.0f;
sum += diff*diff; // variance from 1.0
}
return (float)Math.Sqrt(sum/n);
}
And the mutations. I used a simple crossover and a "completely mutate one allele":
Chromosome ChangeOne(Chromosome c) {
Chromosome d = c.Clone();
int i = rand.Next() % d.alele.Length;
d.alele[i] = (float)(rand.NextDouble()*2000-1000);
return d;
}
I used elitism to always keep one exact copy of the best Chromosome. Then generated 100 new Chromosomes using mutation and crossover.
It really sounds like you're calculating the variance of the fitness, which does of course tell you that the fitnesses in your population are all about the same. I've found that it's very important how you define your fitness function. The more granular the fitness function, the more you can discriminate between your Chromosomes. Obviously, your fitness function is returning similar values for completely different chromosomes, since your gen 0 returns a fitness variance of 68e-19.
Can you share your fitness calculation? Or what problem you're asking the GA to solve? I think that might help us help you.
[Edit: Adding Explicit Fitness Sharing / Niching]
I rethought this a bit and updated my code. If you're trying to maintain unique chromosomes, you have to compare their content (as others have mentioned). One way to do this would be to compute the standard deviation between them. If it's less than some threshold, you can consider them the same. From class Chromosome:
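// compute the "population standard deviation" between two chromosomes
// (note: the sum is not divided by alele.Length, so this is really the Euclidean distance)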
public float StdDev(Chromosome other) {
float sum = 0.0f;
for(int i=0; i<alele.Length; i++) {
float diff = other.alele[i] - alele[i];
sum += diff*diff;
}
return (float)Math.Sqrt(sum);
}
I think Niching will give you what you'd like. It compares all the Chromosomes in the population to determine their similarity and assigns a "niche" value to each. The chromosomes are then "penalized" for belonging to a niche using a technique called Explicit Fitness Sharing. The fitness values are divided by the number of Chromosomes in each niche. So if you have three in niche group A (A,A,A) instead of that niche being 3 times as likely to be chosen, it's treated as a single entity.
I compared my sample with Explicit Fitness Sharing on and off. With a max STDDEV of 500 and Niching turned OFF, there were about 18-20 niches (so basically 5 duplicates of each item in a population of 100). With Niching turned ON, there were about 85 niches. That's 85% unique Chromosomes in the population. In the output of my test, you can see the diversity after 17000 generations.
Here's the niching code:
// returns: total number of niches in this population
// max_stddev -- any two chromosomes with population stddev less than this max
// will be grouped together
int ComputeNiches(float max_stddev) {
List<int> niches = new List<int>();
// clear niches
foreach(var c in population) {
c.niche = -1;
}
// calculate niches
for(int i=0; i<population.Count; i++) {
var c = population[i];
if( c.niche != -1) continue; // niche already set
// compute the niche by finding the stddev between the two chromosomes
c.niche = niches.Count;
int count_in_niche = 1; // includes the current Chromosome
for(int j=i+1; j<population.Count; j++) {
var d = population[j];
if( d.niche != -1) continue; // already claimed by an earlier niche; keeps the counts consistent
float stddev = c.StdDev(d);
if(stddev < max_stddev) {
d.niche = c.niche; // same niche
++count_in_niche;
}
}
niches.Add(count_in_niche);
}
// penalize Chromosomes by their niche size
foreach(var c in population) {
c.niche_scaled_fitness = c.scaled_fitness / niches[c.niche];
}
return niches.Count;
}
[Edit: post-analysis and update of Anton's code]
I know this probably isn't the right forum to address homework problems, but since I did the effort before knowing this, and I had a lot of fun doing it, I figure it can only be helpful to Anton.
Genotip.cs, Kromosom.cs, KromoMain.cs
This code maintains good diversity, and I was able in one run to get the "raw fitness" down to 47, which is in your case the average squared error. That was pretty close!
As noted in my comment, I'd like to try to help you in your programming, not just help you with your homework. Please read this analysis of your work.
As we expected, there was no need to make a "more diverse" population from the start. Just generate some completely random Kromosomes.
Your mutations and crossovers were highly destructive, and you only had a few of them. I added several new operators that seem to work better for this problem.
You were throwing away the best solution. When I got your code running with only Tournament Selection, there would be one Kromo that was 99% better than all the rest. With tournament selection, that best value was very likely to be forgotten. I added a bit of "elitism" which keeps a copy of that value for the next generation.
Consider object oriented techniques. Compare the re-write I sent you with my original code.
Don't duplicate code. You had the sampling parameters in two different classes.
Keep your code clean. There were several unused parts of code. Especially when submitting questions to SO, try to narrow it down, remove unused code, and do some cleaning up.
Comment your code! I've commented the re-work significantly. I know it's Serbian, but even a few comments will help someone else understand what you are doing and what you intended to do.
Overall, nice job implementing some of the more sophisticated things like Tournament Selection.
Prefer double[] arrays instead of List<double>. There's less overhead. Also, several of your List temp variables weren't even needed. Your structure
List temp = new List();
for(...) {
temp.add(value);
}
for(each value in temp) {
sum += value
}
average = sum / temp.Count
can easily be written as:
sum = 0
for(...) {
sum += value;
}
average = sum / count;
In several places you forgot to initialize a loop variable, which could have easily added to your problem. Something like this will cause serious problems, and it was in your fitness code along with one or two other places
double fit = 0;
for(each chromosome) {
// YOU SHOULD INITIALIZE fit HERE inside the LOOP
for(each allele) {
fit += ...;
}
fit /= count;
}
Good luck programming!
The basic problem here is that most randomly generated chromosomes have similar fitness, right? That's fine; the idea isn't for your initial chromosomes to have wildly different fitnesses; it's for the chromosomes themselves to be different, and presumably they are. In fact, you should expect the initial fitness of most of your first generation to be close to zero, since you haven't run the algorithm yet.
Here's why your code is so slow. Let's say the first candidate is terrible, basically zero fitness. If the second one has to be 1.5x different, that really just means it has to be 1.5x better, since it can't really get worse. Then the next one has to 1.5x better than that, and so on up to 80. So what you're really doing is searching for increasingly better chromosomes by generating completely random ones and comparing them to what you have. I bet if you logged the progress, you'd find it takes more and more time to find the subsequent candidates, because really good chromosomes are hard to find. But finding better chromosomes is what the GA is for! Basically what you've done is optimize some of the chromosomes by hand before, um, actually optimizing them.
If you want to ensure that your chromosomes are diverse, compare their content, don't compare their fitness. Comparing the fitness is the algo's job.
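To make "compare their content" concrete, here is a minimal sketch of an initial population built by rejecting candidates that are too close, in allele space, to one already accepted. The class name DiverseInit, the use of plain double[] chromosomes, and the minDistance threshold are all assumptions made for illustration; none of them come from the question's code.

```csharp
using System;
using System.Collections.Generic;

public static class DiverseInit
{
    // Euclidean distance between two allele vectors of equal length.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // Builds `count` random chromosomes with alleles in [-1000, 1000).
    // A candidate is rejected if it lies closer than `minDistance`
    // (a hypothetical threshold) to any chromosome already accepted.
    public static List<double[]> Create(Random rand, int count, int length, double minDistance)
    {
        var population = new List<double[]>();
        while (population.Count < count)
        {
            var candidate = new double[length];
            for (int i = 0; i < length; i++)
                candidate[i] = rand.NextDouble() * 2000 - 1000;

            bool tooClose = false;
            foreach (var existing in population)
            {
                if (Distance(existing, candidate) < minDistance)
                {
                    tooClose = true;
                    break;
                }
            }
            if (!tooClose) population.Add(candidate);
        }
        return population;
    }
}
```

With 10 alleles in a range of 2000, two random chromosomes are typically thousands of units apart, so a modest threshold rejects almost nothing. That is consistent with the point above: plain random initialization is already diverse in content. Note that no fitness evaluation is needed here at all.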
I'm going to take a quick swing at this, but Isaac's pretty much right. You need to let the GA do its job. You have a generation of individuals (chromosomes, whatever), and they're all over the scale on fitness (or maybe they're all identical).
You pick some good ones to mutate (by themselves) and crossover (with each other). You maybe use the top 10% to generate another full population and throw out the bottom 90%. Maybe you always keep the top guy around (Elitism).
You iterate at this for a while until your GA stops improving because the individuals are all very much alike. You've ended up with very little diversity in your population.
What might help you is to 1) make your mutations more effective, 2) find a better way to select individuals to mutate. In my comment I recommended AI Techniques for Game Programming. It's a great book. Very easy to read.
To list a few headings from the book, the things you're looking for are:
Selection techniques like Roulette Selection (on Stack Overflow) (on Wikipedia) and Stochastic Universal Sampling, which control how you select your individuals. I've always liked Roulette Selection. You set the probabilities that an individual will be selected. It's not just simple white-noise random sampling.
I used this outside of GA for selecting 4 letters from the Roman alphabet randomly. I assigned a value from 0.0 to 1.0 to each letter. Every time the user (child) would pick the letter correctly, I would lower that value by, say 0.1. This would increase the likelihood that the other letters would be selected. If after 10 times, the user picked the correct letter, the value would be 0.0, and there would be (almost) no chance that letter would be presented again.
Fitness Scaling techniques like Rank Scaling, Sigma Scaling, and Boltzmann Scaling (pdf on ftp!!!) that let you modify your raw fitness values to come up with adjusted fitness values. Some of these are dynamic, like Boltzmann Scaling, which allows you to set a "pressure" or "temperature" that changes over time. Increased "pressure" means that fitter individuals are selected. Decreased pressure means that any individual in the population can be selected.
I think of it this way: you're searching through multi-dimensional space for a solution. You hit a "peak" and work your way up into it. The pressure to be fit is very high. You snug right into that local maximum. Now your fitness can't change. Your mutations aren't getting you out of the peak. So you start to reduce the pressure and just, oh, select items randomly. Your fitness levels start to drop, which is okay for a while. Then you start to increase the pressure again, and surprise! You've skipped out of the local maximum and found a lovely new local maximum to climb into. Increase the pressure again!
Niching (which I've never used, but appears to be a way to group similar individuals together). Say you have two pretty good individuals, but they're wildly different. They keep getting selected. They keep mutating slightly, and not getting much better. Now you have half your population as minor variants of A, and half your population minor variants of B. This seems like a way to say, hey, what's the average fitness of that entire group A? and what for B? And what for every other niche you have. Then do your selection based on the average fitness for each niche. Pick your niche, then select a random individual from that niche. Maybe I'll start using this after all. I like it!
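As a rough illustration of the Roulette Selection item in the list above, here is a small sketch. The class and method names are made up for this example, and it assumes non-negative weights where a larger weight means "more likely to be picked" (for a fitness that is an error to be minimized, you would have to invert or rescale it first).

```csharp
using System;

public static class Roulette
{
    // Returns the index of the slot that `spin` lands in.
    // `spin` is expected to be in [0, sum of weights).
    public static int SelectAt(double[] weights, double spin)
    {
        double cumulative = 0.0;
        for (int i = 0; i < weights.Length; i++)
        {
            cumulative += weights[i];
            if (spin < cumulative) return i;
        }
        return weights.Length - 1; // guards against floating-point round-off
    }

    // Spins the wheel once: each index is chosen with probability
    // proportional to its weight.
    public static int Select(double[] weights, Random rand)
    {
        double total = 0.0;
        foreach (double w in weights) total += w;
        return SelectAt(weights, rand.NextDouble() * total);
    }
}
```

The alphabet example works the same way: each letter's value is its weight, and lowering a weight toward 0.0 shrinks that letter's slice of the wheel.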
Hope you find some of that helpful!
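For the "pressure"/"temperature" idea behind Boltzmann Scaling mentioned above, one common formulation divides exp(fitness / T) by the population average of that quantity. The sketch below assumes that formulation and assumes higher raw fitness is better; the class name and signature are invented for illustration and are not taken from the book.

```csharp
using System;

public static class BoltzmannScaling
{
    // Returns exp(f_i / T) divided by the population mean of exp(f_j / T).
    // High temperature: all scaled values approach 1 (low selection pressure).
    // Low temperature: the fittest individual dominates (high selection pressure).
    public static double[] Scale(double[] rawFitness, double temperature)
    {
        // Subtract the maximum before exponentiating to avoid overflow;
        // the common factor cancels in the ratio.
        double max = double.NegativeInfinity;
        foreach (double f in rawFitness) max = Math.Max(max, f);

        var scaled = new double[rawFitness.Length];
        double sum = 0.0;
        for (int i = 0; i < rawFitness.Length; i++)
        {
            scaled[i] = Math.Exp((rawFitness[i] - max) / temperature);
            sum += scaled[i];
        }

        double mean = sum / rawFitness.Length;
        for (int i = 0; i < scaled.Length; i++) scaled[i] /= mean;
        return scaled;
    }
}
```

Lowering the temperature over the generations raises the pressure; raising it again for a while corresponds to the "select items randomly" phase described above.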
If you need true random numbers for your application, I recommend you check out Random.org. They have a free HTTP API, and clients for just about every language.
The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs.
(I am unaffiliated with Random.org, although I did contribute the PHP client).
I think your problem is in your fitness function and in how you select candidates, not in how random the values are. Your filtering feels too strict and may not even allow enough elements to be accepted.
Sample
values: random float 0-10000.
fitness function square root(n)
desired distribution of fitness - linear with distance at least 1.
With this fitness function you will quickly fill most of the 1-wide "spots" (there are at most 100 of them), so each subsequent one takes longer. At some point only a few tiny ranges will be left and most results will simply be rejected. Even worse, after about 50 numbers have been placed there is a good chance that the next one will not fit anywhere.
.NET 4.5.1
I have a "bunch" of Int16 values that fit in a range from -4 to 32760. The numbers in the range are not consecutive, but they are ordered from -4 to 32760. In other words, the numbers from 16-302 are not in the "bunch", but numbers 303-400 are in there, number 2102 is not there, etc.
What is the all-out fastest way to determine if a particular value (eg 18400) is in the "bunch"? Right now it is in an Int16[] and the Linq Contains method is used to determine if a value is in the array, but if anyone can say why/how a different structure would deliver a single value faster I would appreciate it. Speed is the key for this lookup (the "bunch" is a static property on a static class).
Sample code that works
Int16[] someShorts = new[] { (short)4 ,(short) 5 , (short)6};
var isInIt = someShorts.Contains( (short)4 );
I am not sure if that is the most performant thing that can be done.
Thanks.
It sounds like you really want BitArray - just offset the value by 4 so you've got a range of [0, 32764] and you should be fine.
That will allocate an array which is effectively 4K in size (32764 / 8), with one bit per value in the array. It will handle finding the relevant element in the array, and applying bit masking. (I don't know whether it uses a byte[] internally or something else.)
This is a potentially less compact representation than storing ranges, but the only cost involved in getting/setting a bit will be computing an index (basically a shift), getting the relevant bit of memory to the CPU, and then bit masking. It takes 1/8th the size of a bool[], making your CPU cache usage more efficient.
Of course, if this is really a performance bottleneck for you, you should compare both this solution and a bool[] approach in your real application - microbenchmarks aren't nearly as important here as how your real app behaves.
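A minimal sketch of the BitArray idea, using the -4 to 32760 range from the question. The wrapper class ShortBitSet is a hypothetical name introduced here; only System.Collections.BitArray itself is a real framework type.

```csharp
using System.Collections;

public sealed class ShortBitSet
{
    private const int Min = -4;     // smallest value in the question's range
    private const int Max = 32760;  // largest value in the question's range
    private readonly BitArray bits = new BitArray(Max - Min + 1);

    // Marks every value in `values` as present (one bit per possible value).
    public ShortBitSet(short[] values)
    {
        foreach (short v in values) bits[v - Min] = true;
    }

    // Offsets the value by Min to get the bit index; out-of-range values are absent.
    public bool Contains(short value)
    {
        int index = value - Min;
        return index >= 0 && index < bits.Length && bits[index];
    }
}
```

Build the set once (for example in the static constructor of the static class) and call Contains for each lookup.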
Make one bool for each possible value:
var isPresentItems = new bool[32760-(-4)+1];
Set the corresponding element to true if the given item is present in the set. Lookup is easy:
var isPresent = isPresentItems[myIndex];
Can't be done any faster. The bools will fit into L1 or L2 cache.
I advise against using BitArray because it stores multiple values per byte. This means that each access is slower. Bit-arithmetic is required.
And if you want insane speed, don't make LINQ call a delegate once for each item. LINQ is not the first choice for performance-critical code. Many indirections that stall the CPU.
If you want to optimize for lookup time, pick a data structure with O(1) (constant-time) lookups. You have several choices since you only care about set membership, and not sorting or ordering.
A HashSet<Int16> will give this to you, as will a BitArray indexed on max - min + 1. The absolute fastest ad-hoc solution would probably be a simple array indexed on max - min + 1, as @usr suggests. Any of these should be plenty "fast enough". The HashSet<Int16> will probably use the most memory, as the size of the internal hash table is an implementation detail. BitArray would be the most space efficient out of these options.
If you only have a single lookup, then memory should not be a concern, and I suggest first going with a HashSet<Int16>. That solution is easy to reason about and deal with in a bug-free manner, as you don't have to worry about staying within array boundaries; you can simply check set.Contains(n). This is particularly useful if your value range might change in the future. You can fall back to one of the other solutions if you need to optimize further for speed or performance.
One option is to use the HashSet. To find if the value is in it, it is a O(1) operation
The code example:
HashSet<Int16> evenNumbers = new HashSet<Int16>();
for (Int16 i = 0; i < 20; i += 2)
{
evenNumbers.Add(i);
}
if (evenNumbers.Contains(0))
{
// 0 is in the set
}
Because the numbers are sorted, I would loop through the list one time and generate a list of Range objects that have a start and end number. That list would be much smaller than having a list or dictionary of thousands of numbers.
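Something along these lines, as a sketch only: the RangeSet class is a made-up name, it assumes the input array is already sorted ascending (as the question states), and it adds a binary search over the range starts so a lookup does not have to scan every range.

```csharp
using System;
using System.Collections.Generic;

public sealed class RangeSet
{
    private readonly int[] starts; // first value of each range, ascending
    private readonly int[] ends;   // last value of each range (inclusive)

    // One pass over the sorted input, merging consecutive values into ranges.
    public RangeSet(short[] sorted)
    {
        var s = new List<int>();
        var e = new List<int>();
        foreach (short v in sorted)
        {
            if (e.Count > 0 && v <= e[e.Count - 1] + 1)
            {
                e[e.Count - 1] = Math.Max(e[e.Count - 1], v); // extend the current range
            }
            else
            {
                s.Add(v); // start a new range
                e.Add(v);
            }
        }
        starts = s.ToArray();
        ends = e.ToArray();
    }

    public int RangeCount { get { return starts.Length; } }

    public bool Contains(short value)
    {
        int i = Array.BinarySearch(starts, (int)value);
        if (i >= 0) return true;  // value is exactly the start of a range
        i = ~i - 1;               // last range that starts before value
        return i >= 0 && value <= ends[i];
    }
}
```

Whether this beats a flat bool[] or BitArray depends on how many ranges there are; it trades a little lookup time for much less memory.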
If your "bunch" of numbers can be identified as a series of intervals, I suggest you use Interval Trees. An interval tree allows dynamic insertions/deletions, and searching whether an interval intersects any interval in the tree is O(log(n)), where n is the number of intervals in the tree. In your case the number of intervals would be way less than the number of ints, and the search is much faster.
Program Purpose: Integration. I am implementing an adaptive quadrature (aka numerical integration) algorithm for high dimensions (up to 100). The idea is to randomly break the volume up into smaller sections by evaluating points using a sampling density proportional to an estimate of the error at that point. Early on I "burn-in" a uniform sample, then randomly choose points according to a Gaussian distribution over the estimated error. In a manner similar to simulated annealing, I "lower the temperature" and reduce the standard deviation of my Gaussian as time goes on, so that low-error points initially have a fair chance of being chosen, but later on are chosen with steadily decreasing probability. This enables the program to stumble upon spikes that might be missed due to imperfections in the error function. (My algorithm is similar in spirit to Markov-Chain Monte-Carlo integration.)
Function Characteristics. The function to be integrated is estimated insurance policy loss for multiple buildings due to a natural disaster. Policy functions are not smooth: there are deductibles, maximums, layers (e.g. zero payout up to 1 million dollars loss, 100% payout from 1-2 million dollars, then zero payout above 2 million dollars) and other odd policy terms. This introduces non-linear behavior and functions that have no derivative in numerous planes. On top of the policy function is the damage function, which varies by building type and strength of hurricane and is definitely not bell-shaped.
Problem Context: Error Function. The difficulty is choosing a good error function. For each point I record measures that seem useful for this: the magnitude of the function, how much it changed as a result of a previous measurement (a proxy for the first derivative), the volume of the region the point occupies (larger volumes can hide error better), and a geometric factor related to the shape of the region. My error function will be a linear combination of these measures where each measure is assigned a different weight. (If I get poor results, I will contemplate non-linear functions.) To aid me in this effort, I decided to perform an optimization over a wide range of possible values for each weight, hence the Microsoft Solver Foundation.
What to Optimize: Error Rank. My measures are normalized, from zero to one. These error values are progressively revised as the integration proceeds to reflect recent averages for function values, changes, etc. As a result, I am not trying to make a function that yields actual error values, but instead yields a number that sorts the same as the true error, i.e. if all sampled points are sorted by this estimated error value, they should receive a rank similar to the rank they would receive if sorted by the true error.
Not all points are equal. I care very much if the point region with #1 true error is ranked #1000 (or vice versa), but care very little if the #500 point is ranked #1000. My measure of success is to MINIMIZE the sum of the following over many regions at a point partway into the algorithm's execution:
ABS(Log2(trueErrorRank) - Log2(estimatedErrorRank))
For Log2 I am using a function that returns the largest power of two less than or equal to the number. From this definition, come useful results. Swapping #1 and #2 costs a point, but swapping #2 and #3 costs nothing. This has the effect of stratifying points into power of two ranges. Points that are swapped within a range do not add to the function.
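To make the stratification concrete, here is a small sketch of the per-region cost. It assumes that Log2 is meant as the exponent of the largest power of two not exceeding the rank (i.e. the floor of the base-2 logarithm); the description above could also be read as returning the power itself, so treat that as an assumption. The names RankCost, FloorLog2 and Cost are hypothetical.

```csharp
using System;

public static class RankCost
{
    // Exponent of the largest power of two that is <= rank (rank must be >= 1).
    public static int FloorLog2(int rank)
    {
        int exponent = 0;
        while (rank > 1)
        {
            rank >>= 1;
            exponent++;
        }
        return exponent;
    }

    // ABS(Log2(trueErrorRank) - Log2(estimatedErrorRank)) for one region.
    public static int Cost(int trueRank, int estimatedRank)
    {
        return Math.Abs(FloorLog2(trueRank) - FloorLog2(estimatedRank));
    }
}
```

Under this reading, ranks 2 and 3 share a stratum, as do 4-7, 8-15 and so on, which is why swaps inside a stratum add nothing to the total.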
How I Evaluate. I have constructed a class called Rank that does this:
Ranks all regions by true error once.
For each separate set of parameterized weights, it computes the trial (estimated) error for that region.
Sorts the regions by that trial error.
Computes the trial rank for each region.
Adds up the absolute difference of logs of the two ranks and calls this the value of the parameterization, hence the value to be minimized.
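The steps above can be sketched roughly as follows. This is not the actual Rank class: the names RankTrial, Ranks and Score, the jagged-array layout of the measures, and the choice that rank 1 means "largest error" are all assumptions made so the example is self-contained. Log2 is taken as the floor of the base-2 logarithm.

```csharp
using System;
using System.Linq;

public static class RankTrial
{
    // Exponent of the largest power of two that is <= n (n must be >= 1).
    private static int FloorLog2(int n)
    {
        int exponent = 0;
        while (n > 1) { n >>= 1; exponent++; }
        return exponent;
    }

    // Returns the rank of each region; rank 1 is the largest error.
    public static int[] Ranks(double[] errors)
    {
        int[] order = Enumerable.Range(0, errors.Length)
            .OrderByDescending(i => errors[i])
            .ToArray();
        var ranks = new int[errors.Length];
        for (int r = 0; r < order.Length; r++) ranks[order[r]] = r + 1;
        return ranks;
    }

    // measures[i] holds the normalized measures for region i.
    // The trial error is the weighted sum of those measures.
    public static int Score(double[][] measures, double[] weights, int[] trueRanks)
    {
        double[] trial = measures
            .Select(m => m.Zip(weights, (x, w) => x * w).Sum())
            .ToArray();
        int[] trialRanks = Ranks(trial);

        int total = 0;
        for (int i = 0; i < trialRanks.Length; i++)
            total += Math.Abs(FloorLog2(trueRanks[i]) - FloorLog2(trialRanks[i]));
        return total; // the value to be minimized
    }
}
```

The true ranks are computed once with Ranks and reused for every set of weights, so each trial costs one weighted sum per region plus one sort.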
C# Code. Having done all that, I just need a way to set up Microsoft Solver Foundation to find me the best parameters. The syntax has me stumped. Here is my C# code that I have so far. In it you will see comments for three problems I have identified. Maybe you can spot even more! Any ideas how to make this work?
public void Optimize()
{
// Get the parameters from the GUI and figures out the low and high values for each weight.
ParseParameters();
// Computes the true rank for each region according to true error.
var myRanker = new Rank(ErrorData, false);
// Obtain Microsoft Solver Foundation's core solver object.
var solver = SolverContext.GetContext();
var model = solver.CreateModel();
// Create a delegate that can extract the current value of each solver parameter
// and stuff it in to a double array so we can later use it to call LinearTrial.
Func<Model, double[]> marshalWeights = (Model m) =>
{
var i = 0;
var weights = new double[myRanker.ParameterCount];
foreach (var d in m.Decisions)
{
weights[i] = d.ToDouble();
i++;
}
return weights;
};
// Make a solver decision for each GUI defined parameter.
// Parameters is a Dictionary whose Key is the parameter name, and whose
// value is a Tuple of two doubles, the low and high values for the range.
// All are Real numbers constrained to fall between a defined low and high value.
foreach (var pair in Parameters)
{
// PROBLEM 1! Should I be using Decisions or Parameters here?
var decision = new Decision(Domain.RealRange(ToRational(pair.Value.Item1), ToRational(pair.Value.Item2)), pair.Key);
model.AddDecision(decision);
}
// PROBLEM 2! This calls myRanker.LinearTrial immediately,
// before the Decisions have values. Also, it does not return a Term.
// I want to pass it in a lambda to be evaluated by the solver for each attempted set
// of decision values.
model.AddGoal("Goal", GoalKind.Minimize,
myRanker.LinearTrial(marshalWeights(model), false)
);
// PROBLEM 3! Should I use a directive, like SimplexDirective? What type of solver is best?
var solution = solver.Solve();
var report = solution.GetReport();
foreach (var d in model.Decisions)
{
Debug.WriteLine("Decision " + d.Name + ": " + d.ToDouble());
}
Debug.WriteLine(report);
// Enable/disable buttons.
UpdateButtons();
}
UPDATE: I decided to look for another library as a fallback, and found DotNumerics (http://dotnumerics.com/). Their Nelder-Mead Simplex solver was easy to call:
Simplex simplex = new Simplex()
{
MaxFunEvaluations = 20000,
Tolerance = 0.001
};
int numVariables = Parameters.Count();
OptBoundVariable[] variables = new OptBoundVariable[numVariables];
//Constrained Minimization on the intervals specified by the user, initial Guess = 1;
foreach (var x in Parameters.Select((parameter, index) => new { parameter, index }))
{
variables[x.index] = new OptBoundVariable(x.parameter.Key, 1, x.parameter.Value.Item1, x.parameter.Value.Item2);
}
double[] minimum = simplex.ComputeMin(ObjectiveFunction, variables);
Debug.WriteLine("Simplex Method. Constrained Minimization.");
for (int i = 0; i < minimum.Length; i++)
Debug.WriteLine(Parameters.ElementAt(i).Key + " = " + minimum[i].ToString());
All I needed was to implement ObjectiveFunction as a method taking a double array:
private double ObjectiveFunction(double[] weights)
{
return Ranker.LinearTrial(weights, false);
}
I have not tried it against real data, but I created a simulation in Excel to set up test data and score it. The results coming back from their algorithm were not perfect, but gave a very good solution.
Here's my TL;DR summary: He doesn't know how to minimize the return value of LinearTrial, which takes an array of doubles. Each value in this array has its own min/max value, and he's modeling that using Decisions.
If that's correct, it seems you could just do the following:
double[] minimums = Parameters.Select(p => p.Value.Item1).ToArray();
double[] maximums = Parameters.Select(p => p.Value.Item2).ToArray();
// Some initial values, here it's a quick and dirty average
double[] initials = Parameters.Select(p => (p.Value.Item1 + p.Value.Item2)/2.0).ToArray();
var solution = NelderMeadSolver.Solve(
x => myRanker.LinearTrial(x, false), initials, minimums, maximums);
// Make sure you check solution.Result to ensure that it found a solution.
// For this, I'll assume it did.
// Value 0 is the minimized value of LinearTrial
int i = 1;
foreach (var param in Parameters)
{
Console.WriteLine("{0}: {1}", param.Key, solution.GetValue(i));
i++;
}
The NelderMeadSolver is new in MSF 3.0. The Solve static method "finds the minimum value of the specified function" according to the documentation in the MSF assembly (despite the MSDN documentation being blank and showing the wrong function signature).
Disclaimer: I'm no MSF expert, but the above worked for me and my test goal function (sum the weights).
I am using C# and I have two List<AACoordinate> collections, where each element represents a 3D point in space by x, y and z.
class AACoordinate
{
public int ResiNumber { get; set; }
public double x { get; set; }
public double y { get; set; }
public double z { get; set; }
}
Each list can contain 2000 or more points, and my aim is to compare each point of list1 to all the points of list2 and, if the distance is smaller than a specific number, keep a record of it. At the moment I am using foreach to compare each element of list1 to all of list2. This is quite slow because of the number of points. Do you have any suggestions to make it faster?
my loop is:
foreach (var resiSet in Program.atomList1)
{
foreach (var res in Program.atomList2)
{
var dis = EuclideanDistance(resiSet, res);
if (dis < 5)
temp1.Add(resiSet.ResiNumber);
}
}
Thanks in advance for your help.
This may be a little complicated to implement, but I don't have any other ideas:
To lower the computational complexity you will probably have to use a data structure like a KD-Tree or QuadTree.
You can use a KD-Tree to do a nearest neighbor search, which is what you need.
1) You build your kd-tree for the first list in O(n log n). This must be done in a single thread.
2) For each item in your second list, you do a lookup in the kd-tree for the nearest neighbor (the nearest point to the point you are looking for), in O(m log n). If the distance from current point to the nearest found point is less than your delta, you have it. If you want you can do this step using multiple threads.
So at the end the complexity of the algorithm will be O(max(n, m) * log n) where n is the number of items in the first list, m is the number of items in the second list.
For KD-Trees, see:
http://home.wlu.edu/~levys/software/kd/ (this seems a good implementation, in Java and C#)
http://www.codeproject.com/KB/architecture/KDTree.aspx
For quad trees, see:
http://csharpquadtree.codeplex.com/
http://www.codeproject.com/KB/recipes/QuadTree.aspx
And of course, look up quadtrees and kd-trees on Wikipedia.
Consider that (2000 * log base 2(2000)) is about 21931.5
Instead 2000*2000 is 4000000, a big difference!
Using a parallel algorithm, if you have 4 processors, the normal O(n*n) algorithm will require 1000000 per processor, and I guess, it will be still too much if you need something fast or almost real time.
You can use Parallel Libraries where you can find Parallel.ForEach.
Parallel example
If you really want to compare each element of list1 with each of list2, you won't get rid of the nested for. But you could speed it up using Parallel.ForEach.
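A rough sketch of what that could look like. Point3 is a stand-in for the question's AACoordinate, and ParallelPairs.IdsNear is a hypothetical helper; the results go into a ConcurrentBag because a plain List is not safe to Add to from several threads. Like the original loop, it records the id once per matching pair.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for the question's AACoordinate class.
public class Point3
{
    public int Id;
    public double X, Y, Z;
}

public static class ParallelPairs
{
    // Returns the Id of each point in list1 once for every point of list2
    // that lies closer than maxDistance, sorted by Id.
    public static List<int> IdsNear(List<Point3> list1, List<Point3> list2, double maxDistance)
    {
        double maxSquared = maxDistance * maxDistance;
        var hits = new ConcurrentBag<int>(); // thread-safe collection for the results

        // The outer loop is split across threads; each iteration only reads shared data.
        Parallel.ForEach(list1, a =>
        {
            foreach (var b in list2)
            {
                double dx = a.X - b.X, dy = a.Y - b.Y, dz = a.Z - b.Z;
                if (dx * dx + dy * dy + dz * dz < maxSquared)
                    hits.Add(a.Id);
            }
        });

        return hits.OrderBy(id => id).ToList();
    }
}
```

The work is still O(n*m); parallelizing only divides it by roughly the number of cores.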
Your current method checks each ordered pair in L x R, a simple O(n^2) algorithm. A couple of ideas come to mind.
First, you can try splitting each of the two arrays into, say, cubes of side equal to your maximum distance; then you'd only have to compute distances between elements in L and R if they are no more than 1 cube away. This is still O(n^2) in the worst case, but if your points are much farther apart on average than your maximum distance, you can save on a lot of spurious comparisons here.
Second, you can micro-optimize how you're doing the distance function. You never need to use sqrt(), for instance; comparing the squared distance to the maximum distance squared is sufficient. Also, you can avoid doing integer multiplications to get the squared distance if you first check whether |dx|, |dy| or |dz| satisfy certain properties (i.e., are already bigger than the maximum distance).
Parallelization, as mentioned by the other posters, is always a good bet. In particular, a sophisticated parallelization + boxing strategy (outlined in the first suggestion) should make for a particularly scalable, efficient solution.
I know C# has the Random class and probably a few classes in LINQ to do this, but if I was to write my own code to randomly select an item from a collection without using any built in .NET objects, how would this be done?
I can't seem to nail the logic required for this - how would I tell the system when to stop an iteration and select the current value - at random?
EDIT: This is a hypothetical question. This is not related to a production coding matter. I am just curious.
Selecting a random element from a collection can be done as follows.
Random r = new Random();
int randomIndex = r.Next(myCollection.Count); // upper bound is exclusive, so every index can be picked
var randomCollectionItem = myCollection[randomIndex];
Unless you have a VERY good reason, writing your own random generator is not necessary.
My advice to you is DON'T DO IT. Whatever reason you think you may have for not wanting to use the built-in library, I am pretty sure you misunderstood something. Please go back to the drawing board.
All of the advice above is technically accurate, but is kind of like giving a chemistry textbook to someone who wants to refine his own oil to use in his car.
There are many pseudo-random number generators. They aren't truly random, but they come at different quality, distinguished by their statistical and sequential properties and what purpose they are applicable for.
It very much depends on "how random you need it". If it just needs to "look random to a human", simple generators look like this:
rnd = seed; // some starting value
rnd = (a * rnd + b) % c; // next value
...
For well-chosen values of a, b, and c these generators are OK for simple statistical tests. A detailed discussion and common values for these can be found here.
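To make that concrete, here is a minimal sketch of such a generator. It assumes c = 2^32 (so the modulus falls out of unsigned overflow) and uses the widely published Numerical Recipes constants for a and b. The class name SimpleLcg and the NextIndex helper are made up for this example, and it is nowhere near suitable for anything security related.

```csharp
public sealed class SimpleLcg
{
    private uint _state;

    public SimpleLcg(uint seed) { _state = seed; }

    // next = (a * current + b) mod 2^32; the modulus comes for free from uint overflow.
    public uint NextUInt()
    {
        unchecked { _state = 1664525u * _state + 1013904223u; }
        return _state;
    }

    // Returns an index in [0, count), assuming count > 0. It scales the high bits
    // of the value, because the low bits of this kind of generator are the weakest.
    public int NextIndex(int count)
    {
        return (int)(((ulong)NextUInt() * (ulong)count) >> 32);
    }
}
```

Picking an item is then just items[rng.NextIndex(items.Length)], with rng seeded from whatever starting value you have.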
One interesting approach is to collect as much "external" data as possible - like time between keypresses, mouse movements, duration of disk reads etc. -, and use an algorithm that accumulates randomness while discarding dependency. That is mathematically tricky though (IIRC not long ago a critical attack surfaced based on one of these not being as random as thought).
Only a very few special applications use a truly random external hardware source: anything between an open-input amplifier and radioactive decay.
You need to use a seed, something semi random provided by the computer itself.
Maybe use very fine resolution time and use the last couple microseconds when the method is called. That should be random enough to generate anything from 00 to 99, you can then go from there.
It sounds like your problem isn't in calculating a random number, but in how to use that random number to select an item from a list. Assuming you can create a random number somehow, all you need to do is use it as the argument to the list's indexer.
int index = customRandomGenerator.Next() % items.Length; // keep the index in range (assumes Next() is non-negative)
var selection = items[index];
Assuming that your presupposition about having to iterate through the list is correct (or the collection doesn't have an indexer) then you could do:
int index = customRandomGenerator.Next() % items.Length; // keep the index in range (assumes Next() is non-negative)
Item selection = null;
for (int i = 0; i < items.Length; i++)
{
if (i == index)
{
selection = items[i];
break;
}
}
The only true "cryptographically strong" random number generator in the .NET Framework is System.Security.Cryptography.RandomNumberGenerator; run it through Reflector to see what it does. Looking at your problem, you would need to know the Count of the collection, otherwise you may never retrieve an item: you have to specify a start and end value to draw random numbers from. The Random class would work best; pop it through Reflector too.
Well, I never thought about implementing that myself, as it seems like reinventing the wheel, but you may have a look at this Wikipedia article. Hope it helps you do what you want.
Random Number Generator
I have built an application that is used to simulate the number of products that a company can produce in different "modes" per month. This simulation is used to aid in finding the optimal series of modes to run in for a month to best meet the projected sales forecast for the month. This application has been working well, until recently when the plant was modified to run in additional modes. It is now possible to run in 16 modes. For a month with 22 work days this yields 9,364,199,760 possible combinations. This is up from 8 modes in the past, which would have yielded a mere 1,560,780 possible combinations. The PC that runs this application is on the old side and cannot handle the number of calculations before an out of memory exception is thrown. In fact the entire application cannot support more than 15 modes because it uses integers to track the number of modes and it exceeds the upper limit for an integer. Barring that issue, I need to do what I can to reduce the memory utilization of the application and optimize this to run as efficiently as possible even if it cannot achieve the stated goal of 16 modes. I was considering writing the data to disk rather than storing the list in memory, but before I take on that overhead, I would like to get people’s opinion on the method to see if there is any room for optimization there.
EDIT
Based on a suggestion by a few people to consider something more academic than merely calculating every possible answer, listed below is a brief explanation of how the optimal run (combination of modes) is chosen.
Currently the computer determines every possible way that the plant can run for the number of work days that month. For example, 3 modes for a max of 2 work days would result in the combinations (where the number represents the mode chosen) of (1,1), (1,2), (1,3), (2,2), (2,3), (3,3). In each mode a product is produced at a different rate; for example in mode 1, product x may produce at 50 units per hour where product y produces at 30 units per hour and product z produces at 0 units per hour. Each combination is then multiplied by work hours and production rates. The run that produces numbers that most closely match the forecasted value for each product for the month is chosen. However, because in some months the plant does not meet the forecasted value for a product, the algorithm increases the priority of that product for the next month to ensure that at the end of the year the product has met the forecasted value. Since warehouse space is tight, it is important that products not overproduce too much either.
Thank you
private List<List<int>> _modeIterations = new List<List<int>>();
private void CalculateCombinations(int modes, int workDays, string combinationValues)
{
List<int> _tempList = new List<int>();
if (modes == 1)
{
combinationValues += Convert.ToString(workDays);
string[] _combinations = combinationValues.Split(',');
foreach (string _number in _combinations)
{
_tempList.Add(Convert.ToInt32(_number));
}
_modeIterations.Add(_tempList);
}
else
{
for (int i = workDays + 1; --i >= 0; )
{
CalculateCombinations(modes - 1, workDays - i, combinationValues + i + ",");
}
}
}
This kind of optimization problem is difficult but extremely well-studied. You should probably read up in the literature on it rather than trying to re-invent the wheel. The keywords you want to look for are "operations research" and "combinatorial optimization problem".
It is well-known in the study of optimization problems that finding the optimal solution to a problem is almost always computationally infeasible as the problem grows large, as you have discovered for yourself. However, it is frequently the case that finding a solution guaranteed to be within a certain percentage of the optimal solution is feasible. You should probably concentrate on finding approximate solutions. (After all, your sales targets are already just educated guesses, therefore finding the optimal solution is already going to be impossible; you haven't got complete information.)
What I would do is start by reading the wikipedia page on the Knapsack Problem:
http://en.wikipedia.org/wiki/Knapsack_problem
This is the problem of "I've got a whole bunch of items of different values and different weights, I can carry 50 pounds in my knapsack, what is the largest possible value I can carry while meeting my weight goal?"
This isn't exactly your problem, but clearly it is related -- you've got a certain amount of "value" to maximize, and a limited number of slots to pack that value into. If you can start to understand how people find near-optimal solutions to the knapsack problem, you can apply that to your specific problem.
You could process the permutation as soon as you have generated it, instead of collecting them all in a list first:
public delegate void Processor(List<int> args);
private void CalculateCombinations(int modes, int workDays, string combinationValues, Processor processor)
{
if (modes == 1)
{
List<int> _tempList = new List<int>();
combinationValues += Convert.ToString(workDays);
string[] _combinations = combinationValues.Split(',');
foreach (string _number in _combinations)
{
_tempList.Add(Convert.ToInt32(_number));
}
processor.Invoke(_tempList);
}
else
{
for (int i = workDays + 1; --i >= 0; )
{
CalculateCombinations(modes - 1, workDays - i, combinationValues + i + ",", processor);
}
}
}
I am assuming here, that your current pattern of work is something along the lines
CalculateCombinations(initial_value_1, initial_value_2, initial_value_3);
foreach( List<int> list in _modeIterations ) {
... process the list ...
}
With the direct-process-approach, this would be
private void ProcessPermutation(List<int> args)
{
... process ...
}
... somewhere else ...
CalculateCombinations(initial_value_1, initial_value_2, initial_value_3, ProcessPermutation);
I would also suggest that you try to prune the search tree as early as possible: if you can already tell that certain combinations of the arguments will never yield something that can be processed, you should catch those during generation and avoid the recursion altogether, if this is possible.
In new versions of C#, generation of the combinations using an iterator (?) function might be usable to retain the original structure of your code. I haven't really used this feature (yield) as of yet, so I cannot comment on it.
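For what it's worth, an iterator version might look roughly like the sketch below. It assumes modes >= 1 and that the consumer handles each combination before asking for the next one, since a single int[] is reused for every result. ModeCombinations, Enumerate and Fill are names invented for this example, not part of the original code.

```csharp
using System.Collections.Generic;

public static class ModeCombinations
{
    // Lazily yields every way to split workDays across modes.
    // The same array is reused for every result, so copy it if you need to keep it.
    public static IEnumerable<int[]> Enumerate(int modes, int workDays)
    {
        var current = new int[modes];
        return Fill(current, 0, workDays);
    }

    private static IEnumerable<int[]> Fill(int[] current, int position, int remaining)
    {
        if (position == current.Length - 1)
        {
            current[position] = remaining; // the last mode takes whatever days are left
            yield return current;
            yield break;
        }
        for (int days = remaining; days >= 0; days--)
        {
            current[position] = days;
            foreach (var result in Fill(current, position + 1, remaining - days))
                yield return result;
        }
    }
}
```

A plain foreach over ModeCombinations.Enumerate(modes, workDays) then replaces the loop over _modeIterations, and nothing is ever stored beyond the one array. It also avoids the string concatenation and Split/Convert round trip of the original.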
The problem lies more in the brute-force approach than in the code itself. It's possible that brute force might be the only way to approach the problem, but I doubt it. Chess, for example, cannot be solved by brute force, but computers play it quite well using heuristics to discard the less promising approaches and focus on the good ones. Maybe you should take a similar approach.
On the other hand we need to know how each "mode" is evaluated in order to suggest any heuristics. In your code you're only computing all possible combinations which, anyway, will not scale if the modes go up to 32... even if you store it on disk.
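To get a feel for how quickly this grows, the number of combinations can be computed without enumerating them. The sketch below assumes the count is the number of ways to split workDays across modes, i.e. C(workDays + modes - 1, modes - 1), which reproduces the two figures given in the question; CombinationCount is a made-up name.

```csharp
using System.Numerics;

public static class CombinationCount
{
    // Number of ways to split workDays across modes: C(workDays + modes - 1, modes - 1).
    public static BigInteger Count(int modes, int workDays)
    {
        int n = workDays + modes - 1;
        int k = modes - 1;
        BigInteger result = BigInteger.One;
        for (int i = 1; i <= k; i++)
            result = result * (n - k + i) / i; // stays an integer at every step
        return result;
    }
}
```

Count(8, 22) gives 1,560,780 and Count(16, 22) gives 9,364,199,760, matching the question. Using BigInteger also sidesteps the int overflow mentioned there.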
if (modes == 1)
{
List<int> _tempList = new List<int>();
combinationValues += Convert.ToString(workDays);
string[] _combinations = combinationValues.Split(',');
foreach (string _number in _combinations)
{
_tempList.Add(Convert.ToInt32(_number));
}
processor.Invoke(_tempList);
}
Everything in this block of code is executed over and over again, so no line in that code should make use of memory without freeing it. The most obvious place to avoid memory craziness is to write out combinationValues to disk as it is processed (i.e. use a FileStream, not a string). I think that in general, doing string concatenation the way you are doing here is bad, since every concatenation allocates a new string. At least use a StringBuilder (see Back to Basics, which discusses the same issue in terms of C). There may be other places with issues, though. The simplest way to figure out why you are getting an out of memory error may be to use a memory profiler (download link from download.microsoft.com).
By the way, my tendency with code like this is to have a global List object that is Clear()ed rather than having a temporary one that is created over and over again.
I would replace the List objects with my own class that uses preallocated arrays to hold the ints. I'm not really sure about this right now, but I believe that each integer in a List is boxed, which means much more memory is used than with a simple array of ints.
Edit: On the other hand it seems I am mistaken: Which one is more efficient : List<int> or int[]