Optimizing a Recursive Function for Very Large Lists (.NET, C#)

I have built an application that is used to simulate the number of products that a company can produce in different "modes" per month. This simulation is used to aid in finding the optimal series of modes to run in for a month to best meet the projected sales forecast for the month. This application has been working well, until recently when the plant was modified to run in additional modes. It is now possible to run in 16 modes. For a month with 22 work days this yields 9,364,199,760 possible combinations, up from the 8 modes in the past that would have yielded a mere 1,560,780 possible combinations. The PC that runs this application is on the old side and cannot handle the number of calculations before an out-of-memory exception is thrown. In fact the entire application cannot support more than 15 modes, because it uses integers to track the number of combinations and 9,364,199,760 exceeds the upper limit of an integer. Barring that issue, I need to do what I can to reduce the memory utilization of the application and optimize it to run as efficiently as possible, even if it cannot achieve the stated goal of 16 modes. I was considering writing the data to disk rather than storing the list in memory, but before I take on that overhead, I would like to get people's opinion on the method to see if there is any room for optimization there.
EDIT
Based on a suggestion by a few to consider something more academic than merely calculating every possible answer, below is a brief explanation of how the optimal run (combination of modes) is chosen.
Currently the computer determines every possible way that the plant can run for the number of work days that month. For example, 3 modes for a maximum of 2 work days would result in the combinations (where the number represents the mode chosen) (1,1), (1,2), (1,3), (2,2), (2,3), (3,3). In each mode a product produces at a different rate: for example, in mode 1 product X may produce at 50 units per hour while product Y produces at 30 units per hour and product Z produces at 0 units per hour. Each combination is then multiplied by work hours and production rates. The run that produces numbers that most closely match the forecasted value for each product for the month is chosen. However, because some months the plant does not meet the forecasted value for a product, the algorithm increases the priority of that product for the next month to ensure that by the end of the year the product has met its forecast. Since warehouse space is tight, it is also important that products do not overproduce too much.
Thank you
private List<List<int>> _modeIterations = new List<List<int>>();

private void CalculateCombinations(int modes, int workDays, string combinationValues)
{
    if (modes == 1)
    {
        // The last mode gets whatever work days remain; parse the accumulated string back into ints.
        List<int> _tempList = new List<int>();
        combinationValues += Convert.ToString(workDays);
        string[] _combinations = combinationValues.Split(',');
        foreach (string _number in _combinations)
        {
            _tempList.Add(Convert.ToInt32(_number));
        }
        _modeIterations.Add(_tempList);
    }
    else
    {
        // Try every possible number of days (workDays down to 0) in the current mode,
        // then recurse for the remaining modes and days.
        for (int i = workDays; i >= 0; i--)
        {
            CalculateCombinations(modes - 1, workDays - i, combinationValues + i + ",");
        }
    }
}

This kind of optimization problem is difficult but extremely well-studied. You should probably read up in the literature on it rather than trying to re-invent the wheel. The keywords you want to look for are "operations research" and "combinatorial optimization problem".
It is well-known in the study of optimization problems that finding the optimal solution is almost always computationally infeasible as the problem grows large, as you have discovered for yourself. However, it is frequently the case that finding a solution guaranteed to be within a certain percentage of the optimal solution is feasible. You should probably concentrate on finding approximate solutions. After all, your sales targets are already just educated guesses, so finding the "optimal" solution was never really possible anyway; you haven't got complete information.
What I would do is start by reading the wikipedia page on the Knapsack Problem:
http://en.wikipedia.org/wiki/Knapsack_problem
This is the problem of "I've got a whole bunch of items of different values and different weights, I can carry 50 pounds in my knapsack, what is the largest possible value I can carry without going over my weight limit?"
This isn't exactly your problem, but clearly it is related -- you've got a certain amount of "value" to maximize, and a limited number of slots to pack that value into. If you can start to understand how people find near-optimal solutions to the knapsack problem, you can apply that to your specific problem.
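To make the connection concrete, here is a minimal sketch of the classic dynamic-programming solution to the 0/1 knapsack problem. This is not your scheduling problem, and the weights, values and capacity are purely illustrative placeholders; the point is that an optimal answer falls out of a small table rather than an enumeration of every subset.

static int Knapsack(int[] weights, int[] values, int capacity)
{
    // best[c] = best total value achievable with remaining capacity c
    int[] best = new int[capacity + 1];
    for (int i = 0; i < weights.Length; i++)
    {
        // Walk capacities downward so each item is used at most once.
        for (int c = capacity; c >= weights[i]; c--)
        {
            best[c] = Math.Max(best[c], best[c - weights[i]] + values[i]);
        }
    }
    return best[capacity];
}

The table has (capacity + 1) entries and is swept once per item, so the cost grows with items times capacity instead of with the number of combinations.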

You could process the permutation as soon as you have generated it, instead of collecting them all in a list first:
public delegate void Processor(List<int> args);

private void CalculateCombinations(int modes, int workDays, string combinationValues, Processor processor)
{
    if (modes == 1)
    {
        List<int> _tempList = new List<int>();
        combinationValues += Convert.ToString(workDays);
        string[] _combinations = combinationValues.Split(',');
        foreach (string _number in _combinations)
        {
            _tempList.Add(Convert.ToInt32(_number));
        }
        processor.Invoke(_tempList);
    }
    else
    {
        for (int i = workDays; i >= 0; i--)
        {
            CalculateCombinations(modes - 1, workDays - i, combinationValues + i + ",", processor);
        }
    }
}
I am assuming here that your current pattern of work is something along these lines:
CalculateCombinations(initial_value_1, initial_value_2, initial_value_3);
foreach (List<int> list in _modeIterations)
{
    ... process the list ...
}
With the direct-process approach, this would become:
private void ProcessPermutation(List<int> args)
{
    ... process ...
}
... somewhere else ...
CalculateCombinations(initial_value_1, initial_value_2, initial_value_3, ProcessPermutation);
I would also suggest that you try to prune the search tree as early as possible; if you can already tell that certain combinations of the arguments will never yield anything that can be processed, you should catch those during generation and avoid the recursion altogether, if possible.
In newer versions of C#, generating the combinations with an iterator method (yield return) might let you retain the original structure of your code. I haven't really used this feature (yield) yet, so I cannot comment on it.
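As an illustration of that suggestion (and only a sketch, since it is my guess at how the recursion would translate, not tested against the real application), an iterator method produces one combination at a time, so nothing accumulates in memory:

// Assumes using System.Collections.Generic; combinations are streamed lazily.
private IEnumerable<List<int>> EnumerateCombinations(int modes, int workDays)
{
    if (modes == 1)
    {
        // The last mode takes whatever days remain.
        yield return new List<int> { workDays };
    }
    else
    {
        for (int i = workDays; i >= 0; i--)
        {
            foreach (List<int> rest in EnumerateCombinations(modes - 1, workDays - i))
            {
                rest.Insert(0, i);   // prepend this mode's day count
                yield return rest;
            }
        }
    }
}

Usage would then be along the lines of foreach (List<int> combo in EnumerateCombinations(16, 22)) { /* score the combination */ }, holding only the current combination at any moment.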

The problem lies more in the brute-force approach than in the code itself. It's possible that brute force is the only way to approach the problem, but I doubt it. Chess, for example, cannot be solved by brute force, yet computers play it quite well by using heuristics to discard the less promising moves and focus on the good ones. Maybe you should take a similar approach.
On the other hand, we need to know how each "mode" is evaluated in order to suggest any heuristics. In your code you're only computing all possible combinations, which will not scale if the modes go up to 32, even if you store them on disk.

if (modes == 1)
{
    List<int> _tempList = new List<int>();
    combinationValues += Convert.ToString(workDays);
    string[] _combinations = combinationValues.Split(',');
    foreach (string _number in _combinations)
    {
        _tempList.Add(Convert.ToInt32(_number));
    }
    processor.Invoke(_tempList);
}
Everything in this block of code is executed over and over again, so no line in it should use memory without freeing it. The most obvious place to avoid memory craziness is to write combinationValues out to disk as it is processed (i.e. use a FileStream, not a string). I think that in general, doing string concatenation the way you are doing it here is bad, since every concatenation allocates a whole new string. At least use a StringBuilder (see "Back to Basics", which discusses the same issue in terms of C). There may be other places with issues, though. The simplest way to figure out why you are getting an out-of-memory error may be to use a memory profiler.
By the way, my tendency with code like this is to have a global List object that is Clear()ed rather than having a temporary one that is created over and over again.
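Putting those two suggestions together, here is a hedged sketch in which the recursion carries a reusable int[] buffer instead of building and re-parsing strings; the Processor delegate is the one from the earlier answer, and none of this has been tested against the real application:

// buffer.Length must equal the total number of modes; depth starts at 0.
private void CalculateCombinations(int modes, int workDays, int[] buffer, int depth, Processor processor)
{
    if (modes == 1)
    {
        buffer[depth] = workDays;                 // last mode gets the remaining days
        processor.Invoke(new List<int>(buffer));  // copy out only at the leaf
    }
    else
    {
        for (int i = workDays; i >= 0; i--)
        {
            buffer[depth] = i;
            CalculateCombinations(modes - 1, workDays - i, buffer, depth + 1, processor);
        }
    }
}

The initial call would look like CalculateCombinations(modes, workDays, new int[modes], 0, ProcessPermutation); if the processor does not keep the list, even the copy at the leaf can be dropped.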

I would replace the List objects with my own class that uses preallocated arrays to hold the ints. I'm not really sure about this right now, but I believe that each integer in a List<int> is boxed, which means much more memory is used than with a simple array of ints.
Edit: On the other hand it seems I am mistaken: Which one is more efficient : List<int> or int[]

Related

Genetic algorithm initial seed diversity using value encoding C#

I would like to know the following:
How can I effectively create an initial generation of chromosomes with high diversity using value encoding?
One way is grid initialization, but it is too slow.
Until now I have been using the Random class from .NET to choose random values for the value encoding, but although the values are uniformly distributed, the fitness values calculated from such chromosomes are not. Here is the code for Chromosome initialization:
public Chromosome(Random rand)
{
    Alele = new List<double>();
    for (int i = 0; i < ChromosomeLength; i++)
    {
        Alele.Add(rand.NextDouble() * 2000 - 1000);
    }
}
So I developed a function that calculates the fitness of a new, randomly made chromosome (code above); if that fitness is similar to any fitness already in the list of chromosomes, another chromosome is made randomly and its fitness is calculated, and this process repeats until the fitness is different enough from those already in the list.
Here is the code for this part:
private bool CheckSimilarFitnes(List<Chromosome> chromosome, Chromosome newCandidate)
{
    double fitFromCandidate = newCandidate.CalculateChromosomeFitness(newCandidate.Alele);
    foreach (var listElement in chromosome)
    {
        double fitFromList = listElement.CalculateChromosomeFitness(listElement.Alele);
        double fitBigger  = fitFromList >= fitFromCandidate ? fitFromList : fitFromCandidate;
        double fitSmaller = fitFromList <  fitFromCandidate ? fitFromList : fitFromCandidate;
        // Reject the candidate if its fitness is within a factor of 1.5 of any existing one.
        if ((fitBigger / fitSmaller) < 1.5)
            return false;
    }
    return true;
}
But the more chromosomes I have in the list, the longer it takes to add a new one whose fitness is different enough from those already in there.
So, is there a way to make this kind of initialization faster? It takes days to make 80 chromosomes like this.
Here's some code that might help (which I just wrote): a GA for ordering 10 values spaced by 1.0. It starts with a population of 100 completely random alleles, which is exactly how your code starts.
The goal I gave the GA was to order the values in increasing order with a separation of 1.0. It does this in the fitness function Eval_OrderedDistance by computing the standard deviation of each adjacent pair's spacing from 1.0. As the fitness tends toward 0.0, the alleles should start to appear in sequential order.
Generation 0's fittest Chromosome was completely random, as were the rest of the Chromosomes. You can see the fitness value is very high (i.e., bad):
GEN: fitness (allele, ...)
0: 375.47460 (583.640, -4.215, -78.418, 164.228, -243.982, -250.237, 354.559, 374.306, 709.859, 115.323)
As the generations continue, the fitness (standard deviation from 1.0) decreases until it's nearly perfect in generation 100,000:
100: 68.11683 (-154.818, -173.378, -170.846, -193.750, -198.722, -396.502, -464.710, -450.014, -422.194, -407.162)
...
10000: 6.01724 (-269.681, -267.947, -273.282, -281.582, -287.407, -293.622, -302.050, -307.582, -308.198, -308.648)
...
99999: 0.67262 (-294.746, -293.906, -293.114, -292.632, -292.596, -292.911, -292.808, -292.039, -291.112, -290.928)
The interesting parts of the code are the fitness function:
// Try to pack the alleles together, spaced apart by 1.0.
// Returns the standard deviation of the adjacent-pair spacing from 1.0.
static float Eval_OrderedDistance(Chromosome c) {
    float sum = 0;
    int n = c.alele.Length;
    for(int i = 1; i < n; i++) {
        float diff = (c.alele[i] - c.alele[i-1]) - 1.0f;
        sum += diff*diff; // variance from 1.0
    }
    return (float)Math.Sqrt(sum/n);
}
And the mutations. I used a simple crossover and a "completely mutate one allele":
Chromosome ChangeOne(Chromosome c) {
    Chromosome d = c.Clone();
    int i = rand.Next() % d.alele.Length;
    d.alele[i] = (float)(rand.NextDouble()*2000 - 1000);
    return d;
}
I used elitism to always keep one exact copy of the best Chromosome. Then generated 100 new Chromosomes using mutation and crossover.
It really sounds like you're calculating the variance of the fitness, which of course just tells you that the fitnesses in your population are all about the same. I've found that it's very important how you define your fitness function: the more granular the fitness function, the more you can discriminate between your Chromosomes. Obviously, your fitness function is returning similar values for completely different chromosomes, since your gen 0 returns a fitness variance of 68e-19.
Can you share your fitness calculation? Or what problem you're asking the GA to solve? I think that might help us help you.
[Edit: Adding Explicit Fitness Sharing / Niching]
I rethought this a bit and updated my code. If you're trying to maintain unique chromosomes, you have to compare their content (as others have mentioned). One way to do this would be to compute the standard deviation between them. If it's less than some threshold, you can consider them the same. From class Chromosome:
// Compute the deviation between two chromosomes:
// the square root of the summed squared differences of their alleles.
public float StdDev(Chromosome other) {
    float sum = 0.0f;
    for(int i = 0; i < alele.Length; i++) {
        float diff = other.alele[i] - alele[i];
        sum += diff*diff;
    }
    return (float)Math.Sqrt(sum);
}
I think Niching will give you what you'd like. It compares all the Chromosomes in the population to determine their similarity and assigns a "niche" value to each. The chromosomes are then "penalized" for belonging to a niche using a technique called Explicit Fitness Sharing. The fitness values are divided by the number of Chromosomes in each niche. So if you have three in niche group A (A,A,A) instead of that niche being 3 times as likely to be chosen, it's treated as a single entity.
I compared my sample with Explicit Fitness Sharing on and off. With a max stddev of 500 and Niching turned OFF, there were about 18-20 niches (so basically 5 duplicates of each item in a population of 100). With Niching turned ON, there were about 85 niches. That's 85% unique Chromosomes in the population. In the output of my test, you can see the diversity after 17,000 generations.
Here's the niching code:
// Returns: total number of niches in this population.
// max_stddev -- any two chromosomes whose deviation is less than this max
// will be grouped together.
int ComputeNiches(float max_stddev) {
    List<int> niches = new List<int>();
    // clear niches
    foreach(var c in population) {
        c.niche = -1;
    }
    // calculate niches
    for(int i = 0; i < population.Count; i++) {
        var c = population[i];
        if( c.niche != -1) continue; // niche already set
        // compute the niche by finding the deviation between the two chromosomes
        c.niche = niches.Count;
        int count_in_niche = 1; // includes the current Chromosome
        for(int j = i+1; j < population.Count; j++) {
            var d = population[j];
            float stddev = c.StdDev(d);
            if(stddev < max_stddev) {
                d.niche = c.niche; // same niche
                ++count_in_niche;
            }
        }
        niches.Add(count_in_niche);
    }
    // penalize Chromosomes by their niche size
    foreach(var c in population) {
        c.niche_scaled_fitness = c.scaled_fitness / niches[c.niche];
    }
    return niches.Count;
}
[Edit: post-analysis and update of Anton's code]
I know this probably isn't the right forum for addressing homework problems, but since I put in the effort before knowing that, and since I had a lot of fun doing it, I figure it can only be helpful to Anton.
Genotip.cs, Kromosom.cs, KromoMain.cs
This code maintains good diversity, and I was able in one run to get the "raw fitness" down to 47, which is in your case the average squared error. That was pretty close!
As noted in my comment, I'd like to try to help you with your programming, not just with your homework. Please read this analysis of your work.
As we expected, there was no need to make a "more diverse" population from the start. Just generate some completely random Kromosomes.
Your mutations and crossovers were highly destructive, and you only had a few of them. I added several new operators that seem to work better for this problem.
You were throwing away the best solution. When I got your code running with only Tournament Selection, there would be one Kromo that was 99% better than all the rest. With tournament selection, that best value was very likely to be forgotten. I added a bit of "elitism" which keeps a copy of that value for the next generation.
Consider object oriented techniques. Compare the re-write I sent you with my original code.
Don't duplicate code. You had the sampling parameters in two different classes.
Keep your code clean. There were several unused parts of code. Especially when submitting questions to SO, try to narrow it down, remove unused code, and do some cleaning up.
Comment your code! I've commented the re-work significantly. I know it's Serbian, but even a few comments will help someone else understand what you are doing and what you intended to do.
Overall, nice job implementing some of the more sophisticated things like Tournament Selection.
Prefer double[] arrays instead of List<double>; there's less overhead. Also, several of your List temp variables weren't even needed. Your structure
List<double> temp = new List<double>();
for(...) {
    temp.Add(value);
}
foreach(var value in temp) {
    sum += value;
}
average = sum / temp.Count;
can easily be written as:
sum = 0;
for(...) {
    sum += value;
}
average = sum / count;
In several places you forgot to initialize a loop variable, which could easily have added to your problem. Something like this will cause serious problems, and it was in your fitness code along with one or two other places:
double fit = 0;
for(each chromosome) {
    // YOU SHOULD INITIALIZE fit HERE, inside the LOOP
    for(each allele) {
        fit += ...;
    }
    fit /= count;
}
Good luck programming!
The basic problem here is that most randomly generated chromosomes have similar fitness, right? That's fine; the idea isn't for your initial chromosomes to have wildly different fitnesses; it's for the chromosomes themselves to be different, and presumably they are. In fact, you should expect the initial fitness of most of your first generation to be close to zero, since you haven't run the algorithm yet.
Here's why your code is so slow. Let's say the first candidate is terrible, basically zero fitness. If the second one has to be 1.5x different, that really just means it has to be 1.5x better, since it can't really get worse. Then the next one has to be 1.5x better than that, and so on up to 80. So what you're really doing is searching for increasingly better chromosomes by generating completely random ones and comparing them to what you have. I bet if you logged the progress, you'd find it takes more and more time to find the subsequent candidates, because really good chromosomes are hard to find. But finding better chromosomes is what the GA is for! Basically what you've done is optimize some of the chromosomes by hand before, um, actually optimizing them.
If you want to ensure that your chromosomes are diverse, compare their content, don't compare their fitness. Comparing the fitness is the algo's job.
I'm going to take a quick swing at this, but Isaac's pretty much right. You need to let the GA do its job. You have a generation of individuals (chromosomes, whatever), and they're all over the scale on fitness (or maybe they're all identical).
You pick some good ones to mutate (by themselves) and crossover (with each other). You maybe use the top 10% to generate another full population and throw out the bottom 90%. Maybe you always keep the top guy around (Elitism).
You iterate at this for a while until your GA stops improving because the individuals are all very much alike. You've ended up with very little diversity in your population.
What might help you is to 1) make your mutations more effective, 2) find a better way to select individuals to mutate. In my comment I recommended AI Techniques for Game Programmers. It's a great book. Very easy to read.
To list a few headings from the book, the things you're looking for are:
Selection techniques like Roulette Selection (on Stack Overflow) (on Wikipedia) and Stochastic Universal Sampling, which control how you select your individuals. I've always liked Roulette Selection: you set the probabilities that an individual will be selected; it's not just simple white-noise random sampling. (A short sketch of roulette selection appears after this list.)
I used this outside of GA for selecting 4 letters from the Roman alphabet randomly. I assigned a value from 0.0 to 1.0 to each letter. Every time the user (child) would pick the letter correctly, I would lower that value by, say 0.1. This would increase the likelihood that the other letters would be selected. If after 10 times, the user picked the correct letter, the value would be 0.0, and there would be (almost) no chance that letter would be presented again.
Fitness Scaling techniques like Rank Scaling, Sigma Scaling, and Boltzmann Scaling (pdf on ftp!!!) that let you modify your raw fitness values to come up with adjusted fitness values. Some of these are dynamic, like Boltzmann Scaling, which lets you set a "pressure" or "temperature" that changes over time. Increased "pressure" means that fitter individuals are selected; decreased pressure means that any individual in the population can be selected. (A sketch of one common Boltzmann formulation also appears after this list.)
I think of it this way: you're searching through multi-dimensional space for a solution. You hit a "peak" and work your way up into it. The pressure to be fit is very high. You snug right into that local maxima. Now your fitness can't change. Your mutations aren't getting you out of the peak. So you start to reduce the pressure and just, oh, select items randomly. Your fitness levels start to drop, which is okay for a while. Then you start to increase the pressure again, and surprise! You've skipped out of the local maxima and found a lovely new local maxima to climb into. Increase the pressure again!
Niching (which I've never used, but appears to be a way to group similar individuals together). Say you have two pretty good individuals, but they're wildly different. They keep getting selected. They keep mutating slightly, and not getting much better. Now you have half your population as minor variants of A, and half your population minor variants of B. This seems like a way to say, hey, what's the average fitness of that entire group A? and what for B? And what for every other niche you have. Then do your selection based on the average fitness for each niche. Pick your niche, then select a random individual from that niche. Maybe I'll start using this after all. I like it!
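Two quick sketches of the ideas above, both hedged: the generic fitness delegate and the exact Boltzmann formula are my assumptions, not the book's code. First, roulette-wheel selection in its simplest form (fitness values are assumed non-negative):

static T RouletteSelect<T>(IList<T> population, Func<T, double> fitness, Random rand)
{
    double total = 0;
    foreach (T individual in population)
        total += fitness(individual);            // size of the whole wheel

    double spin = rand.NextDouble() * total;     // where the ball lands
    double running = 0;
    foreach (T individual in population)
    {
        running += fitness(individual);
        if (running >= spin)
            return individual;
    }
    return population[population.Count - 1];     // guard against floating-point drift
}

Second, one common formulation of Boltzmann scaling: each raw fitness is exponentiated at temperature T and normalized by the population mean, so lowering T sharpens selection pressure and raising T flattens it:

static double[] BoltzmannScale(double[] rawFitness, double temperature)
{
    double[] scaled = new double[rawFitness.Length];
    double sum = 0;
    for (int i = 0; i < rawFitness.Length; i++)
    {
        scaled[i] = Math.Exp(rawFitness[i] / temperature);
        sum += scaled[i];
    }
    double mean = sum / rawFitness.Length;
    for (int i = 0; i < scaled.Length; i++)
    {
        scaled[i] /= mean;    // high T -> values near 1.0, low T -> big spread
    }
    return scaled;
}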
Hope you find some of that helpful!
If you need true random numbers for your application, I recommend you check out Random.org. They have a free HTTP API, and clients for just about every language.
The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs.
(I am unaffiliated with Random.org, although I did contribute the PHP client).
I think your problem is in your fitness function and in how you select candidates, not in how random the values are. Your filtering feels too strict; it may not even allow enough elements to be accepted.
Sample:
values: random float 0-10000
fitness function: square root(n)
desired distribution of fitness: linear, with a distance of at least 1
With this fitness function you will quickly get most of the 1-wide "spots" taken (you have at most 100 places), so every next one will take longer. At some point there will be only a few tiny ranges left and most of the results will simply be rejected; even worse, after you have placed about 50 numbers there is a good chance that the next one simply will not be able to fit.

merge in-place without external storage

I want to merge two arrays with sorted values into one. Since both source arrays are stored as succeeding parts of a large array, I wonder if you know a way to merge them into that large storage, meaning an in-place merge.
All the methods I found need some external storage; they often require sqrt(n) temp arrays. Is there an efficient way without it?
I'm using C#. Other languages are welcome also. Thanks in advance!
AFAIK, merging two (even sorted) arrays does not work in place without considerably increasing the necessary number of comparisons and moves of elements (see merge sort). However, blocked variants exist which can do the merge using a temporary array of length sqrt(n), as you wrote, while still keeping the number of operations reasonably low. It's not bad, but it's also not "nothing", and apparently the best you can get.
For practical situations, and if you can afford it, you are better off using a temporary array to merge your lists.
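For reference, the straightforward merge with a temporary buffer looks something like this; the index convention (two sorted runs a[lo..mid) and a[mid..hi) inside one array) is an assumption about how your halves are laid out:

static void MergeWithBuffer(int[] a, int lo, int mid, int hi)
{
    int[] temp = new int[hi - lo];
    int i = lo, j = mid, k = 0;
    while (i < mid && j < hi)
        temp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)
        temp[k++] = a[i++];      // drain the left run
    while (j < hi)
        temp[k++] = a[j++];      // drain the right run
    Array.Copy(temp, 0, a, lo, temp.Length);
}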
If the values are stored as succeeding parts of a larger array, you just want to sort the array, then remove consecutive values which are equal.
static void SortAndDedupe<T>(T[] a) where T : IComparable<T>
{
    // Do an efficient in-place sort
    Array.Sort(a);
    // Now deduplicate
    int lwm = 0; // low water mark
    int hwm = 1; // high water mark
    while (hwm < a.Length)
    {
        // If the lwm and hwm elements are the same, it is a duplicate entry.
        if (a[lwm].CompareTo(a[hwm]) == 0)
        {
            hwm++;
        }
        else
        {
            // Not a duplicate entry - move the lwm up
            // and copy down the hwm element over the gap.
            lwm++;
            if (lwm < hwm)
            {
                a[lwm] = a[hwm];
            }
            hwm++;
        }
    }
    // New length is lwm + 1;
    // number of elements removed is (hwm - lwm - 1)
}
Before you conclude that this will be too slow, implement it and profile it. That should take about ten minutes.
Edit: This can of course be improved by using a different sort rather than the built-in sort, e.g. Quicksort, Heapsort or Smoothsort, depending on which gives better performance in practice. Note that hardware architecture issues mean that the practical performance comparisons may very well be very different from the results of big O analysis.
Really you need to profile it with different sort algorithms on your actual hardware/OS platform.
Note: I am not attempting in this answer to give an academic answer, I am trying to give a practical one, on the assumption you are trying to solve a real problem.
Don't worry about external storage; sqrt(n) or even larger should not harm your performance. You will just have to make sure the storage is pooled, especially for large data and especially when merging in loops. Otherwise the GC will get stressed and eat up a considerable part of your CPU time / memory bandwidth.
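A sketch of that pooling suggestion, assuming ArrayPool<T> is available (System.Buffers on .NET Core, or the NuGet package on older frameworks): the temporary buffer is rented and returned, so repeated merges in a loop do not keep allocating.

// using System; using System.Buffers;
static void MergePooled(int[] a, int lo, int mid, int hi)
{
    int[] temp = ArrayPool<int>.Shared.Rent(hi - lo);   // may hand back a larger array; that's fine
    try
    {
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi)
            temp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid)
            temp[k++] = a[i++];
        while (j < hi)
            temp[k++] = a[j++];
        Array.Copy(temp, 0, a, lo, hi - lo);
    }
    finally
    {
        ArrayPool<int>.Shared.Return(temp);
    }
}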

performance problem

OK, so I need to know if anyone can see a way to reduce the number of iterations of these loops, because I can't. The first while loop goes through a file, reading one line at a time. The foreach loop then compares each element of compareSet with what was read in the while loop. The innermost while loop does the bit counting.
As requested, an explanation of my algorithm:
There is a file that is too large to fit in memory. It contains a word followed by the pages in a very large document that this word appears on, e.g.:
sky 1 7 9 32....... (it is not in this format, but you get the idea).
So parseLine reads in the line and converts it into a list of ints that acts like a bit array, where 1 means the word is on the page and 0 means it isn't.
CompareSet is a bunch of other words. I can't fit my entire list of words into memory, so I can only fit a subset of them. This is a bunch of words just like the "sky" example. I then compare each word in compareSet with sky by seeing if they are on the same pages.
So if sky and some other word both have a 1 set at a certain index in the bit array (simulated as a uint array for performance), they are on the same page. The algorithm therefore counts the co-occurrences of any two words on a particular page. So in the end I will have a list like:
(for all words in list) is on the same page as (for all words in list) x number of times.
eg sky and land is on the same page x number of times.
while ((line = parseLine(s)) != null) {
    getPageList(line.Item2, compareWord);
    foreach (Tuple<int, uint[], List<Tuple<int, int>>> word in compareSet) {
        unchecked {
            for (int i = 0; i < 327395; i++) {
                if (word.Item2[i] == 0 || compareWord[i] == 0)
                    continue;
                uint combinedNumber = word.Item2[i] & compareWord[i];
                while (combinedNumber != 0) {
                    actual++;
                    combinedNumber = combinedNumber & (combinedNumber - 1);
                }
            }
        }
    }
}
As my old professor Bud used to say: "When you see nested loops like this, your spidey senses should be goin' CRAZY!"
You have a while with a nested for with another while. This nesting of loops multiplies the order of operations. Your one for loop alone has 327,395 iterations. Assuming the other loops have the same or a similar number of iterations, that means you have an order of operations of
327,395 * 327,395 * 327,395 = 35,092,646,987,154,875 (insane)
It's no wonder that things would be slowing down. You need to redefine your algorithm to remove these nested loops or combine work somewhere. Even if the numbers are smaller than my assumptions, the nesting of the loops is creating a LOT of operations that are probably unnecessary.
As Joal already mentioned, nobody here is going to be able to optimize this looping algorithm as it stands. But what you can do is try to explain better what you are trying to accomplish and what your hard requirements are. Maybe you can take a different approach by using something like HashSet<T>.IntersectWith() or a Bloom filter.
So if you really want help from here, you should not only post the code that doesn't work but also describe the overall task you are trying to accomplish. Maybe someone has a completely different idea that solves your problem and makes your whole algorithm obsolete.
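On the narrower point of the innermost loop: it is the classic Kernighan bit-count, and on newer runtimes (.NET Core 3.0 or later, which may or may not apply here) it can be replaced with a hardware popcount. A hedged sketch of just that piece, with the question's tuple fields flattened into plain arrays:

// wordBits and compareBits stand in for word.Item2 and compareWord from the question.
static long CountSharedPages(uint[] wordBits, uint[] compareBits)
{
    long shared = 0;
    for (int i = 0; i < wordBits.Length; i++)
    {
        // BitOperations.PopCount maps to a single POPCNT instruction where supported.
        shared += System.Numerics.BitOperations.PopCount(wordBits[i] & compareBits[i]);
    }
    return shared;
}

This only shaves a constant factor off the inner work; the nested-loop structure discussed above is still the real problem.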

Simple proof that GUID is not unique [closed]

I'd like to prove that a GUID is not unique in a simple test program.
I expected the following code to run for hours, but it's not working. How can I make it work?
BigInteger begin = new BigInteger((long)0);
BigInteger end = new BigInteger("340282366920938463463374607431768211456", 10); //2^128
for (begin; begin < end; begin++)
    Console.WriteLine(System.Guid.NewGuid().ToString());
I'm using C#.
Kai, I have provided a program that will do what you want using threads. It is licensed under the following terms: you must pay me $0.0001 per hour per CPU core you run it on. Fees are payable at the end of each calendar month. Please contact me for my paypal account details at your earliest convenience.
using System;
using System.Collections.Generic;
using System.Linq;

namespace GuidCollisionDetector
{
    class Program
    {
        static void Main(string[] args)
        {
            //var reserveSomeRam = new byte[1024 * 1024 * 100]; // This indeed has no effect.

            Console.WriteLine("{0:u} - Building a bigHeapOGuids.", DateTime.Now);

            // Fill up memory with guids.
            var bigHeapOGuids = new HashSet<Guid>();
            try
            {
                do
                {
                    bigHeapOGuids.Add(Guid.NewGuid());
                } while (true);
            }
            catch (OutOfMemoryException)
            {
                // Release the ram we allocated up front.
                // Actually, these are pointless too.
                //GC.KeepAlive(reserveSomeRam);
                //GC.Collect();
            }

            Console.WriteLine("{0:u} - Built bigHeapOGuids, contains {1} of them.", DateTime.Now, bigHeapOGuids.LongCount());

            // Spool up some threads to keep checking if there's a match.
            // Keep running until the heat death of the universe.
            for (long k = 0; k < Int64.MaxValue; k++)
            {
                for (long j = 0; j < Int64.MaxValue; j++)
                {
                    Console.WriteLine("{0:u} - Looking for collisions with {1} thread(s)....", DateTime.Now, Environment.ProcessorCount);
                    System.Threading.Tasks.Parallel.For(0, Int32.MaxValue, (i) =>
                    {
                        if (bigHeapOGuids.Contains(Guid.NewGuid()))
                            throw new ApplicationException("Guids collided! Oh my gosh!");
                    });
                    Console.WriteLine("{0:u} - That was another {1} attempts without a collision.", DateTime.Now, ((long)Int32.MaxValue) * Environment.ProcessorCount);
                }
            }

            Console.WriteLine("Umm... why hasn't the universe ended yet?");
        }
    }
}
PS: I wanted to try out the Parallel extensions library. That was easy.
And using OutOfMemoryException as control flow just feels wrong.
EDIT
Well, it seems this still attracts votes. So I've fixed the GC.KeepAlive() issue. And changed it to run with C# 4.
And to clarify my support terms: support is only available on the 28/Feb/2010. Please use a time machine to make support requests on that day only.
EDIT 2
As always, the GC does a better job than I do at managing memory; any previous attempts at doing it myself were doomed to failure.
This will run for a lot more than hours. Assuming it loops at 1 GHz (which it won't - it will be a lot slower than that), it will run for 10790283070806014188970 years. Which is about 83 billion times longer than the age of the universe.
Assuming Moores law holds, it would be a lot quicker to not run this program, wait several hundred years and run it on a computer that is billions of times faster. In fact, any program that takes longer to run than it takes CPU speeds to double (about 18 months) will complete sooner if you wait until the CPU speeds have increased and buy a new CPU before running it (unless you write it so that it can be suspended and resumed on new hardware).
A GUID is theoretically non-unique. Here's your proof:
GUID is a 128 bit number
You cannot generate 2^128 + 1 or more GUIDs without re-using old GUIDs
However, if the entire power output of the sun was directed at performing this task, it would go cold long before it finished.
GUIDs can be generated using a number of different tactics, some of which take special measures to guarantee that a given machine will not generate the same GUID twice. Finding collisions in a particular algorithm would show that your particular method for generating GUIDs is bad, but would not prove anything about GUIDs in general.
Of course GUIDs can collide. Since GUIDs are 128-bits, just generate 2^128 + 1 of them and by the pigeonhole principle there must be a collision.
But when we say that a GUID is a unique, what we really mean is that the key space is so large that it is practically impossible to accidentally generate the same GUID twice (assuming that we are generating GUIDs randomly).
If you generate a sequence of n GUIDs randomly, then the probability of at least one collision is approximately p(n) = 1 - exp(-n^2 / (2 * 2^128)) (this is the birthday problem with the number of possible birthdays being 2^128).
n p(n)
2^30 1.69e-21
2^40 1.77e-15
2^50 1.86e-09
2^60 1.95e-03
To make these numbers concrete, 2^60 = 1.15e+18. So, if you generate one billion GUIDs per second, it will take you 36 years to generate 2^60 random GUIDs and even then the probability that you have a collision is still 1.95e-03. You're more likely to be murdered at some point in your life (4.76e-03) than you are to find a collision over the next 36 years. Good luck.
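For anyone who wants to reproduce that table, a small sketch of the approximation p(n) ≈ 1 - exp(-n^2 / (2 * 2^128)) using plain doubles; the small-x shortcut is there because 1 - exp(-x) would otherwise round to zero for the tiniest probabilities:

using System;

class GuidBirthday
{
    static void Main()
    {
        foreach (int k in new[] { 30, 40, 50, 60 })        // n = 2^k GUIDs generated
        {
            double x = Math.Pow(2, 2 * k - 129);           // n^2 / (2 * 2^128) = 2^(2k - 129)
            double p = x < 1e-8 ? x : 1 - Math.Exp(-x);    // 1 - exp(-x) ~ x for tiny x
            Console.WriteLine("n = 2^{0}: p = {1:E2}", k, p);
        }
    }
}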
If you're worried about uniqueness you can always purchase new GUIDs so you can throw away your old ones. I'll put some up on eBay if you'd like.
Personally, I think the "Big Bang" was caused when two GUIDs collided.
You can show that in O(1) time with a variant of the quantum bogosort algorithm.
Guid g1 = Guid.NewGuid();
Guid g2 = Guid.NewGuid();
if(g1 != g2) Universe.Current.Destroy();
Any two GUIDs are very likely unique (not equal).
See this SO entry, and from Wikipedia
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is very small. For example, consider the observable universe, which contains about 5×10^22 stars; every star could then have 6.8×10^15 universally unique GUIDs.
So probably you have to wait for many more billion of years, and hope that you hit one before the universe as we know it comes to an end.
[Update:] As the comments below point out, newer MS GUIDs are V4 and do not use the MAC address as part of the GUID generation (I haven't seen any indication of a V5 implementation from MS though, so if anyone has a link confirming that, let me know). With V4, though, time is still a factor, and the odds against duplication of GUIDs remain so small as to be irrelevant for any practical usage. You certainly would not be likely to ever generate a duplicate GUID from a single-system test such as the OP was trying to do.
Most of these answers are missing one vital point about Microsoft's GUID implementation. The first part of the GUID is based on a timestamp and another part is based on the MAC address of the network card (or a random number if no NIC is installed).
If I understand this correctly, it means that the only reliable way to duplicate a GUID would be to run simultaneous GUID generations on multiple machines where the MAC addresses were the same AND where the clocks on both systems were at the same exact time when the generation occurred (the timestamp is based on milliseconds, if I understand it correctly)... and even then there are a lot of other bits in the number that are random, so the odds are still vanishingly small.
For all practical purposes the GUIDs are universally unique.
There is a pretty good description of the MS GUID over at "The Old New Thing" blog
Here's a nifty little extension method that you can use if you want to check guid uniqueness in many places in your code.
internal static class GuidExt
{
    public static bool IsUnique(this Guid guid)
    {
        while (guid != Guid.NewGuid())
        { }
        return false;
    }
}
To call it, simply call Guid.IsUnique whenever you generate a new guid...
Guid g = Guid.NewGuid();
if (!g.IsUnique())
{
    throw new GuidIsNotUniqueException();
}
...heck, I'd even recommend calling it twice to make sure it got it right in the first round.
Counting to 2^128 - ambitious.
Let's imagine that we can count 2^32 IDs per second per machine - not that ambitious, since it's not even 4.3 billion per second. Let's dedicate 2^32 machines to the task. Furthermore, let's get 2^32 civilisations to each dedicate the same resources to it.
So far, we can count 2^96 IDs per second, meaning we will be counting for 2^32 seconds (a little over 136 years).
Now, all we need is to get 4,294,967,296 civilisations to each dedicate 4,294,967,296 machines, each machine capable of counting 4,294,967,296 IDs per second, purely to this task for the next 136 years or so - I suggest we get started on this essential task right now ;-)
Well, if the running time of 83 billion years does not scare you, consider that you will also need to store the generated GUIDs somewhere to check whether you have a duplicate. Storing 2^128 16-byte numbers would require you to allocate 4951760157141521099596496896 terabytes of RAM up front. So, imagining you have a computer which could fit all of that, and that you somehow find a place to buy terabyte DIMMs at 10 grams each: combined they would weigh more than 8 Earth masses, so you could seriously shift the planet off its current orbit before you even press "Run". Think twice!
for(begin; begin<end; begin)
Console.WriteLine(System.Guid.NewGuid().ToString());
You aren't incrementing begin so the condition begin < end is always true.
If GUID collisions are a concern, I would recommend using the ScottGuID instead.
Presumably you have reason to believe that the algorithm for producing GUIDs is not producing truly random numbers, but is in fact cycling with a period << 2^128.
For example, the RFC 4122 method used to derive GUIDs fixes the values of some bits.
Proof of cycling is going to depend upon the possible size of the period.
For small periods, a hash table of hash(GUID) -> GUID, with replacement on collision if the GUIDs do not match (terminating if they do), might be an approach. Consider also only doing the replacement a random fraction of the time.
Ultimately if the maximum period between collisions is large enough (and isn't known in advance) any method is only going to yield a probability that the collision would be found if it existed.
Note that if the method of generating Guids is clock based (see the RFC), then it may not be possible to determine if collisions exist because either (a) you won't be able to wait long enough for the clock to wrap round, or (b) you can't request enough Guids within a clock tick to force a collision.
Alternatively you might be able to show a statistical relationship between the bits in the Guid, or a correlation of bits between Guids. Such a relationship might make it highly probable that the algorithm is flawed without necessarily being able to find an actual collision.
Of course, if you just want to prove that Guids can collide, then a mathematical proof, not a program, is the answer.
I don't understand why no one has mentioned upgrading your graphics card... Surely if you got a high-end NVIDIA Quadro FX 4800 or something (192 CUDA cores) this would go faster...
Of course if you could afford a few NVIDIA Quadro Plex 2200 S4s (at 960 CUDA cores each), this calculation would really scream. Perhaps NVIDIA would be willing to loan you a few for a "Technology Demonstration" as a PR stunt?
Surely they'd want to be part of this historic calculation...
But do you have to be sure you have a duplicate, or do you only care whether there can be a duplicate? To be sure that you have two people with the same birthday, you need 366 people (not counting leap years). For there to be a greater than 50% chance of having two people with the same birthday, you only need 23 people. That's the birthday problem.
If you have 32 bits, you only need 77,163 values to have a greater than 50% chance of a duplicate. Try it out:
Random baseRandom = new Random(0);

int DuplicateIntegerTest(int iterations)
{
    Random r = new Random(baseRandom.Next());
    int[] ints = new int[iterations];
    for (int i = 0; i < ints.Length; i++)
    {
        ints[i] = r.Next();
    }
    Array.Sort(ints);
    for (int i = 1; i < ints.Length; i++)
    {
        if (ints[i] == ints[i - 1])
            return 1;
    }
    return 0;
}

void DoTest()
{
    baseRandom = new Random(0);
    int count = 0;
    int duplicates = 0;
    for (int i = 0; i < 1000; i++)
    {
        count++;
        duplicates += DuplicateIntegerTest(77163);
    }
    Console.WriteLine("{0} iterations had {1} with duplicates", count, duplicates);
}
1000 iterations had 737 with duplicates
Now 128 bits is a lot, so you are still talking about a very large number of items before you get a meaningful chance of collision. You would need the following numbers of records for the given odds, using an approximation:
0.8 billion billion for a 1/1000 chance of a collision occurring
21.7 billion billion for 50% chance of a collision occurring
39.6 billion billion for 90% chance of a collision occurring
There are about 1E14 emails sent per year so it would be about 400,000 years at this level before you would have a 90% chance of having two with the same GUID, but that is a lot different than saying you need to run a computer 83 billion times the age of the universe or that the sun would go cold before finding a duplicate.
Aren't you all missing a major point?
I thought GUIDs were generated using two things which make the chances of them being globally unique quite high: one, they are seeded with the MAC address of the machine you are on, and two, they use the time that they were generated plus a random number.
So unless you run it on the actual machine and run all your guesses within the smallest amount of time that the machine uses to represent a time in the GUID, you will never generate the same number, no matter how many guesses you take using the system call.
I guess if you knew the actual way a GUID is made, it would actually shorten the time to guess one quite substantially.
Tony
You could hash the GUIDs. That way, you should get a result much faster.
Oh, of course, running multiple threads at the same time is also a good idea, that way you'll increase the chance of a race condition generating the same GUID twice on different threads.
GUIDs are 124 bits because 4 bits hold the version number.
Go to the cryogenics lab in the New York City.
Freeze yourself for (roughly) 1990 years.
Get a job at Planet Express.
Buy a brand-new CPU. Build a computer, run the program, and place it in a safe place with a pseudo-perpetual motion machine like the doomsday machine.
Wait until the time machine is invented.
Jump to the future using the time machine. If you bought 1YHz 128bit CPU, go to 3,938,453,320 days 20 hours 15 minutes 38 seconds 463 ms 463 μs 374 ns 607 ps after when you started to run the program.
...?
PROFIT!!!
... It takes at least 10,783,127 years even if you had 1YHz CPU which is 1,000,000,000,000,000 (or 1,125,899,906,842,624 if you prefer to use binary prefix) times faster than 1GHz CPU.
So rather than waiting for the computation to finish, it would be better to feed the pigeons which lost their home because n other pigeons took it. :(
Or, you can wait until a 128-bit quantum computer is invented. Then you may be able to prove that GUIDs are not unique, by running your program in a reasonable time (maybe).
Have you tried begin = begin + new BigInteger((long)1) in place of begin++?
If the number of UUIDs being generated follows Moore's law, the impression that we will never run out of GUIDs in the foreseeable future is false.
With 2^128 UUIDs, it will only take 18 months * log2(2^128) ~= 192 years before we run out of all UUIDs.
And I believe (with no statistical proof whatsoever) that in the past few years, since the mass adoption of UUIDs, the speed at which we are generating them has been increasing much faster than Moore's law dictates. In other words, we probably have less than 192 years until we have to deal with a UUID crisis; that's a lot sooner than the end of the universe.
But since we definitely won't be running them out by the end of 2012, we'll leave it to other species to worry about the problem.
The odds of a bug in the GUID generating code are much higher than the odds of the algorithm generating a collision. The odds of a bug in your code to test the GUIDs is even greater. Give up.
The program, despite its errors, shows that a GUID is not unique. Those who try to prove the contrary are missing the point. This statement just proves the weak implementation of some of the GUID variations.
A GUID is not necessarily unique by definition; it is highly unique by definition. You just refined the meaning of highly. Depending on the version, the implementor (MS or others), the use of VMs, etc., your definition of highly changes. (See the link in an earlier post.)
You can shorten your 128-bit table to prove your point. The best solution is to use a hash formula to shorten your table while allowing duplicates, then compare the full values once a hash collides and, based on that, regenerate a GUID. If running from different locations, you would store your hash/full key pairs in a central location.
Ps: If the goal is just to generate x number of different values, create a hash table of this width and just check on the hash value.
Not to p**s on the bonfire here, but it does actually happen, and yes, I understand the joking you have been giving this guy, but the GUID is unique only in principle. I bumped into this thread because there is a bug in the WP7 emulator which means that every time it boots, it gives out the SAME GUID the first time it is called! So, while in theory you cannot have a conflict, if there is a problem generating said GUID, then you can get duplicates.
http://forums.create.msdn.com/forums/p/92086/597310.aspx#597310
Since part of Guid generation is based on the current machine's time, my theory to get a duplicate Guid is:
Perform a clean installation of Windows
Create a startup script that resets the time to 2010-01-01 12:00:00 just as Windows boots up.
Just after the startup script, it triggers your application to generate a Guid.
Clone this Windows installation, so that you rule out any subtle differences that may occur in subsequent boot-ups.
Re-image the hard drive with this image and boot-up the machine a few times.
For me, the time it takes for a single core to generate a UUIDv1 guarantees it will be unique. Even in a multi-core situation, if the UUID generator only allows one UUID to be generated at a time for your specific resource (keep in mind that multiple resources can use the same UUIDs, however unlikely, since the resource is inherently part of the address), then you will have more than enough UUIDs to last until the timestamp burns out. At which point I really doubt you would care.
Here's a solution, too:
#include <QUuid>
#include <iostream>

int main()
{
    QUuid uuid;
    while ( (uuid = QUuid::createUuid()) != QUuid::createUuid() ) { }
    std::cout << "Aha! I've found one! " << qPrintable( uuid.toString() ) << std::endl;
}
Note: requires Qt, but I guarantee that if you let it run long enough, it might find one.
(Note note: actually, now that I'm looking at it, there may be something about the generation algorithm that prevents two subsequently generated uuids that collide--but I kinda doubt it).
The only way to prove GUIDs are not unique would be to have a World GUID Pool. Each time a GUID is generated somewhere, it would be registered with the organization. Or heck, we might include a standardization requirement that all GUID generators register it automatically, and for that they would need an active internet connection!

working with incredibly large numbers in .NET

I'm trying to work through the problems on projecteuler.net but I keep running into a couple of problems.
The first is a question of storing large quantities of elements in a List<T>. I keep getting OutOfMemoryExceptions when storing large quantities in the list.
Now I admit I might not be doing these things in the best way, but is there some way of defining how much memory the app can consume?
It usually crashes when I get to about 100,000,000 elements :S
Secondly, some of the questions require the addition of massive numbers. I use the ulong data type where I think the number is going to get super big, but I still manage to wrap past the largest supported value and get into negative numbers.
Do you have any tips for working with incredibly large numbers?
Consider System.Numerics.BigInteger.
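A minimal sketch of what that looks like (requires .NET 4.0 or later and a reference to System.Numerics; the particular numbers are just illustrations):

using System;
using System.Numerics;

class BigIntegerDemo
{
    static void Main()
    {
        BigInteger factorial = 1;
        for (int i = 2; i <= 100; i++)
            factorial *= i;                        // 100! is a 158-digit number; no overflow

        Console.WriteLine(factorial);

        BigInteger big = BigInteger.Pow(2, 1000);  // the kind of value some Euler problems involve
        Console.WriteLine(big.ToString().Length);  // 302 digits, far beyond ulong
    }
}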
You need to use a large-number class that uses some basic math principles to split these operations up. This implementation of a C# BigInteger library on CodeProject seems to be the most promising. The article also has some good explanations of how operations with massive numbers work.
Also see:
Big integers in C#
As far as Project Euler goes, you might be barking up the wrong tree if you are hitting OutOfMemory exceptions. From their website:
Each problem has been designed according to a "one-minute rule", which means that although it may take several hours to design a successful algorithm with more difficult problems, an efficient implementation will allow a solution to be obtained on a modestly powered computer in less than one minute.
As user Jakers said, if you're using Big Numbers, probably you're doing it wrong.
Of the ProjectEuler problems I've done, none have required big-number math so far.
It's more about finding the proper algorithm to avoid big numbers.
Want hints? Post here, and we might have an interesting Euler-thread started.
I assume this is C#? F# has built in ways of handling both these problems (BigInt type and lazy sequences).
You can use both F# techniques from C#, if you like. The BigInt type is reasonably usable from other languages if you add a reference to the core F# assembly.
Lazy sequences are basically just syntax friendly enumerators. Putting 100,000,000 elements in a list isn't a great plan, so you should rethink your solutions to get around that. If you don't need to keep information around, throw it away! If it's cheaper to recompute it than store it, throw it away!
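A sketch of that idea with a C# iterator, so values are streamed rather than stored; the sequence itself is arbitrary, the point is the constant memory footprint:

using System;
using System.Collections.Generic;

class LazySequenceDemo
{
    // Values are produced one at a time; nothing is ever kept in a list.
    static IEnumerable<long> Naturals(long count)
    {
        for (long i = 1; i <= count; i++)
            yield return i;
    }

    static void Main()
    {
        long sum = 0;
        foreach (long n in Naturals(100000000))    // 100,000,000 values, constant memory
            sum += n;
        Console.WriteLine(sum);                    // 5000000050000000
    }
}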
See the answers in this thread. You probably need to use one of the third-party big integer libraries/classes available or wait for C# 4.0 which will include a native BigInteger datatype.
As far as defining how much memory an app will use, you can check the available memory before performing an operation by using the MemoryFailPoint class.
This allows you to preallocate memory before doing the operation, so you can check if an operation will fail before running it.
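A hedged sketch of how MemoryFailPoint is typically used; the 512 MB estimate is a placeholder for whatever your operation actually needs:

using System;
using System.Runtime;

class MemoryCheckDemo
{
    static void Main()
    {
        try
        {
            // Ask the runtime whether roughly this much memory is likely to be available.
            using (new MemoryFailPoint(512))
            {
                // ... build the large data structure / run the computation here ...
            }
        }
        catch (InsufficientMemoryException)
        {
            Console.WriteLine("Not enough memory is likely to be available; work in smaller chunks.");
        }
    }
}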
// Adds two non-negative integers represented as decimal strings, digit by digit with carry.
string Add(string s1, string s2)
{
    bool carry = false;
    string result = string.Empty;

    if (s1.Length < s2.Length)
        s1 = s1.PadLeft(s2.Length, '0');
    if (s2.Length < s1.Length)
        s2 = s2.PadLeft(s1.Length, '0');

    for (int i = s1.Length - 1; i >= 0; i--)
    {
        var augend = Convert.ToInt64(s1.Substring(i, 1));
        var addend = Convert.ToInt64(s2.Substring(i, 1));
        var sum = augend + addend;
        sum += (carry ? 1 : 0);
        carry = false;
        if (sum > 9)
        {
            carry = true;
            sum -= 10;
        }
        result = sum.ToString() + result;
    }
    if (carry)
    {
        result = "1" + result;
    }
    return result;
}
I am not sure if it is a good way of handling it, but I use the following in my project.
I have a "double theRelevantNumber" variable and an "int PowerOfTen" for each item, and in my relevant class I have an "int relevantDecimals" variable.
So when large numbers are encountered, they are handled like this:
First they are changed to x.yyy form. So if the number 123456.789 was input and the "PowerOfTen" was 10, it would start like this:
theRelevantNumber = 123456.789
PowerOfTen = 10
The number is then: 123456.789*10^10
It is then changed to:
1.23456789*10^15
It is then rounded by the number of relevant decimals (for example 5) to 1.23456 and saved along with "PowerOfTen = 15".
When adding or subtracting numbers, any digits outside the relevant decimals are ignored. Meaning if you take:
1*10^15 + 1*10^10, it will change to 1.00001 if "relevantDecimals" is 5, but will not change at all if "relevantDecimals" is 4.
This method lets you deal with numbers up to doubleLimit*10^intLimit without any problem, and at least for OOP it is not that hard to keep track of.
You don't need to use BigInteger. You can do this even with a string array of numbers.
class Solution
{
    static void Main(String[] args)
    {
        int n = 5;
        string[] unsorted = new string[6] { "3141592653589793238", "1", "3", "5737362592653589793238", "3", "5" };
        string[] result = SortStrings(n, unsorted);
        foreach (string s in result)
            Console.WriteLine(s);
        Console.ReadLine();
    }

    static string[] SortStrings(int size, string[] arr)
    {
        // A shorter decimal string is a smaller number; equal lengths compare lexicographically.
        Array.Sort(arr, (left, right) =>
        {
            if (left.Length != right.Length)
                return left.Length - right.Length;
            return left.CompareTo(right);
        });
        return arr;
    }
}
If you want to work with incredibly large numbers look here...
MIKI Calculator
I am not a professional programmer; I write for myself, sometimes, so sorry for the unprofessional use of C#, but the program works. I will be grateful for any advice and corrections.
I use this calculator to generate 32-character passwords from numbers that are around 58 digits long.
Since the program adds numbers in string format, you can perform calculations on numbers with the maximum length of a string variable. The program uses long lists for the calculation, so it is possible to calculate on even larger numbers, possibly 18x the maximum capacity of the list.
