Is a GUID unique 100% of the time?
Will it stay unique over multiple threads?
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is very small. For example, consider the observable universe, which contains about 5×10^22 stars; every star could then have 6.8×10^15 universally unique GUIDs.
From Wikipedia.
These are some good articles on how a GUID is made (for .NET) and how you could get the same guid in the right situation.
https://ericlippert.com/2012/04/24/guid-guide-part-one/
https://ericlippert.com/2012/04/30/guid-guide-part-two/
https://ericlippert.com/2012/05/07/guid-guide-part-three/
If you are worried about duplicate GUID values, then put two of them next to each other:
var doubleGuid = Guid.NewGuid().ToString() + Guid.NewGuid().ToString();
If you are too paranoid, then put three.
The simple answer is yes.
Raymond Chen wrote a great article on GUIDs and why substrings of GUIDs are not guaranteed unique. The article goes into some depth on the way GUIDs are generated and the data they use to ensure uniqueness, which should go some way toward explaining why they are :-)
As a side note, I was playing around with Volume GUIDs in Windows XP. This is a very obscure partition layout with three disks and fourteen volumes.
\\?\Volume{23005604-eb1b-11de-85ba-806d6172696f}\ (F:)
\\?\Volume{23005605-eb1b-11de-85ba-806d6172696f}\ (G:)
\\?\Volume{23005606-eb1b-11de-85ba-806d6172696f}\ (H:)
\\?\Volume{23005607-eb1b-11de-85ba-806d6172696f}\ (J:)
\\?\Volume{23005608-eb1b-11de-85ba-806d6172696f}\ (D:)
\\?\Volume{23005609-eb1b-11de-85ba-806d6172696f}\ (P:)
\\?\Volume{2300560b-eb1b-11de-85ba-806d6172696f}\ (K:)
\\?\Volume{2300560c-eb1b-11de-85ba-806d6172696f}\ (L:)
\\?\Volume{2300560d-eb1b-11de-85ba-806d6172696f}\ (M:)
\\?\Volume{2300560e-eb1b-11de-85ba-806d6172696f}\ (N:)
\\?\Volume{2300560f-eb1b-11de-85ba-806d6172696f}\ (O:)
\\?\Volume{23005610-eb1b-11de-85ba-806d6172696f}\ (E:)
\\?\Volume{23005611-eb1b-11de-85ba-806d6172696f}\ (R:)
| | | | |
| | | | +-- 6f = o
| | | +---- 69 = i
| | +------ 72 = r
| +-------- 61 = a
+---------- 6d = m
It's not just that the GUIDs are very similar; every one of them contains the string "mario". Is that a coincidence, or is there an explanation behind it?
Now, when googling for the fourth part of the GUID, I found approximately 125,000 hits with volume GUIDs.
Conclusion: When it comes to Volume GUIDs they aren't as unique as other GUIDs.
It should not happen. However, when .NET is under heavy load, it is possible to get duplicate GUIDs. I have two different web servers using two different SQL Servers. I went to merge the data and found I had 15 million GUIDs and 7 duplicates.
Yes, a GUID should always be unique. It is based on both hardware and time, plus a few extra bits to make sure it's unique. I'm sure it's theoretically possible to end up with two identical ones, but extremely unlikely in a real-world scenario.
Here's a great article by Raymond Chen on Guids:
https://blogs.msdn.com/oldnewthing/archive/2008/06/27/8659071.aspx
Guids are statistically unique. The odds of two different clients generating the same Guid are infinitesimally small (assuming no bugs in the Guid generating code). You may as well worry about your processor glitching due to a cosmic ray and deciding that 2+2=5 today.
Multiple threads allocating new GUIDs will get unique values, but you should check that the function you are calling is thread-safe. Which environment is this in?
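For anyone who wants to see this at small scale, here is a quick sketch that calls Guid.NewGuid() from many threads at once and checks that the results are distinct (it only demonstrates the absence of obvious collisions, nothing more):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class ParallelGuidCheck
{
    static void Main()
    {
        // Generate a million GUIDs across the thread pool and count distinct values.
        var guids = new ConcurrentBag<Guid>();
        Parallel.For(0, 1_000_000, _ => guids.Add(Guid.NewGuid()));
        Console.WriteLine($"{guids.Distinct().Count()} distinct out of {guids.Count}");
    }
}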
Eric Lippert has written a very interesting series of articles about GUIDs.
There are on the order of 2^30 personal computers in the world (and of course lots of hand-held devices or non-PC computing devices that have more or less the same levels of computing power, but let's ignore those). Let's assume that we put all those PCs in the world to the task of generating GUIDs; if each one can generate, say, 2^20 GUIDs per second then after only about 2^72 seconds -- one hundred and fifty trillion years -- you'll have a very high chance of generating a collision with your specific GUID. And the odds of collision get pretty good after only thirty trillion years.
GUID Guide, part one
GUID Guide, part two
GUID Guide, part three
Theoretically, no, they are not unique. It's possible to generate an identical guid over and over. However, the chances of it happening are so low that you can assume they are unique.
I've read before that the chances are so low that you really should stress about something else--like your server spontaneously combusting or other bugs in your code. That is, assume it's unique and don't build in any code to "catch" duplicates--spend your time on something more likely to happen (i.e. anything else).
I made an attempt to describe the usefulness of GUIDs to my blog audience (non-technical family members). From there (via Wikipedia), the odds of generating a duplicate GUID:
1 in 2^128
1 in 340 undecillion (don’t worry, undecillion is not on the
quiz)
1 in 3.4 × 10^38
1 in 340,000,000,000,000,000,000,000,000,000,000,000,000
No one seems to mention the actual math of the probability of a collision occurring.
First, let's assume we can use the entire 128-bit space (Guid v4 only uses 122 bits).
We know that the general probability of NOT getting a duplicate in n picks is:
(1 - 1/2^128)(1 - 2/2^128) ... (1 - (n-1)/2^128)
Because 2^128 is much, much larger than n, we can approximate this to:
(1 - 1/2^128)^(n(n-1)/2)
And because we can assume n is much, much larger than 0, we can approximate that to:
(1 - 1/2^128)^(n^2/2)
Now we can equate this to the "acceptable" probability, let's say 1%:
(1 - 1/2^128)^(n^2/2) = 0.01
Which we solve for n and get:
n = sqrt(2 * log 0.01 / log(1 - 1/2^128))
Which Wolfram Alpha evaluates to 5.598318 × 10^19.
To put that number into perspective, let's take 10,000 machines, each having a 4-core CPU running at 4 GHz and spending 10,000 cycles to generate a Guid, doing nothing else. It would then take ~111 years before they generate a duplicate.
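For anyone who wants to reproduce that arithmetic, here is a small C# sketch of the same calculation. It uses the approximation ln(1 - x) ≈ -x, because 1 - 1/2^128 rounds to exactly 1 in double precision:

using System;

class GuidCollisionMath
{
    static void Main()
    {
        // The probability of *no* duplicate after n picks is roughly (1 - 1/2^128)^(n^2/2).
        // With ln(1 - x) ~= -x, the 1% condition above becomes n ~= sqrt(2 * ln(100) * 2^128).
        double space = Math.Pow(2, 128);
        double n = Math.Sqrt(2 * Math.Log(100) * space);
        Console.WriteLine(n.ToString("E6"));                          // ~5.598318E+19

        // The "~111 years" figure: 10,000 machines * 4 cores * 4 GHz / 10,000 cycles per Guid.
        double guidsPerSecond = 10_000.0 * 4 * 4e9 / 10_000;
        Console.WriteLine(n / guidsPerSecond / (3600.0 * 24 * 365));  // ~111
    }
}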
From http://www.guidgenerator.com/online-guid-generator.aspx
What is a GUID?
GUID (or UUID) is an acronym for 'Globally Unique Identifier' (or 'Universally Unique Identifier'). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.
How unique is a GUID?
128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.
Is a GUID unique 100% of the time?
Not guaranteed, since there are several ways of generating one. However, you can try to calculate the chance of creating two GUIDs that are identical and you get the idea: a GUID has 128 bits, hence, there are 2^128 distinct GUIDs – much more than there are stars in the known universe. Read the wikipedia article for more details.
MSDN:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.
If your system clock is set properly and hasn't wrapped around, and if your NIC has its own MAC (i.e. you haven't set a custom MAC) and your NIC vendor has not been recycling MACs (which they are not supposed to do but which has been known to occur), and if your system's GUID generation function is properly implemented, then your system will never generate duplicate GUIDs.
If everyone on earth who is generating GUIDs follows those rules then your GUIDs will be globally unique.
In practice, the number of people who break the rules is low, and their GUIDs are unlikely to "escape". Conflicts are statistically improbable.
I experienced a duplicate GUID.
I use the Neat Receipts desktop scanner and it comes with proprietary database software. The software has a sync to cloud feature, and I kept getting an error upon syncing. A gander at the logs revealed the awesome line:
"errors":[{"code":1,"message":"creator_guid: is already
taken","guid":"C83E5734-D77A-4B09-B8C1-9623CAC7B167"}]}
I was a bit in disbelief, but sure enough, when I found a way into my local NeatWorks database and deleted the record containing that GUID, the error stopped occurring.
So to answer your question with anecdotal evidence, no. A duplicate is possible. But it is likely that the reason it happened wasn't due to chance, but due to standard practice not being adhered to in some way. (I am just not that lucky) However, I cannot say for sure. It isn't my software.
Their customer support was EXTREMELY courteous and helpful, but they must have never encountered this issue before because after 3+ hours on the phone with them, they didn't find the solution. (FWIW, I am very impressed by Neat, and this glitch, however frustrating, didn't change my opinion of their product.)
For a more robust result, the best way is to append a timestamp to the GUID (just to make sure that it stays unique):
var id = Guid.NewGuid().ToString() + DateTime.Now.ToString();
GUID algorithms are usually implemented according to the v4 GUID specification, which is essentially a pseudo-random string. Sadly, these fall into the category of "likely non-unique", from Wikipedia (I don't know why so many people ignore this bit): "... other GUID versions have different uniqueness properties and probabilities, ranging from guaranteed uniqueness to likely non-uniqueness."
The pseudo-random properties of V8's JavaScript Math.random() are TERRIBLE at uniqueness, with collisions often coming after only a few thousand iterations, but V8 isn't the only culprit. I've seen real-world GUID collisions using both PHP and Ruby implementations of v4 GUIDs.
Because it's becoming more and more common to scale ID generation across multiple clients, and clusters of servers, entropy takes a big hit -- the chances of the same random seed being used to generate an ID escalate (time is often used as a random seed in pseudo-random generators), and GUID collisions escalate from "likely non-unique" to "very likely to cause lots of trouble".
To solve this problem, I set out to create an ID algorithm that could scale safely, and make better guarantees against collision. It does so by using the timestamp, an in-memory client counter, client fingerprint, and random characters. The combination of factors creates an additive complexity that is particularly resistant to collision, even if you scale it across a number of hosts:
http://usecuid.org/
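Purely as an illustration of combining those ingredients (this is not the cuid algorithm itself, and every name below is made up for the example):

using System;
using System.Diagnostics;
using System.Threading;

// Toy sketch of a collision-resistant ID built from a timestamp, an in-process
// counter, a host/process fingerprint, and a few random characters.
static class CollisionResistantId
{
    private static int _counter;
    private static readonly string Fingerprint =
        ((uint)Environment.MachineName.GetHashCode() ^ (uint)Process.GetCurrentProcess().Id).ToString("x8");

    public static string Next()
    {
        long timestamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        int count = Interlocked.Increment(ref _counter);              // monotonic within this process
        string random = Guid.NewGuid().ToString("N").Substring(0, 8); // extra entropy
        return $"{timestamp:x}-{Fingerprint}-{count:x}-{random}";
    }
}

Usage would be something like var id = CollisionResistantId.Next(); the point is that a collision now requires two hosts with the same fingerprint, at the same millisecond, at the same counter value, to also draw the same random characters.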
I too have experienced GUIDs not being unique during multi-threaded/multi-process unit testing. I guess that has to do with, all other things being equal, the identical seeding (or lack of seeding) of pseudo-random generators. I was using them for generating unique file names. I found the OS is much better at doing that :)
Trolling alert
You ask if GUIDs are 100% unique. That depends on the number of GUIDs it must be unique among. As the number of GUIDs approach infinity, the probability for duplicate GUIDs approach 100%.
In a more general sense, this is known as the "birthday problem" or "birthday paradox". Wikipedia has a pretty good overview at:
Wikipedia - Birthday Problem
In very rough terms, the square root of the size of the pool is a rough approximation of when you can expect a 50% chance of a duplicate. The article includes a probability table of pool size and various probabilities, including a row for 2^128. So for a 1% probability of collision you would expect to randomly pick 2.6*10^18 128-bit numbers. A 50% chance requires 2.2*10^19 picks, while SQRT(2^128) is 1.8*10^19.
Of course, that is just the ideal case of a truly random process. As others mentioned, a lot is riding on that random aspect - just how good is the generator and seed? It would be nice if there was some hardware support to assist with this process, which would be more bullet-proof except that anything can be spoofed or virtualized. I suspect that might be the reason why MAC addresses/time-stamps are no longer incorporated.
The Answer of "Is a GUID is 100% unique?" is simply "No" .
If You want 100% uniqueness of GUID then do following.
generate GUID
check if that GUID is Exist in your table column where you are looking for uniquensess
if exist then goto step 1 else step 4
use this GUID as unique.
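A minimal sketch of that loop, using an in-memory HashSet as a stand-in for the table lookup (in practice this would be a database query, or better, a unique constraint that simply rejects the rare duplicate on insert):

using System;
using System.Collections.Generic;

class UniqueGuidSource
{
    // Stand-in for "your table column"; replace with a real lookup against your store.
    private readonly HashSet<Guid> _existing = new HashSet<Guid>();

    public Guid NextUnique()
    {
        while (true)
        {
            Guid candidate = Guid.NewGuid();   // step 1: generate
            if (_existing.Add(candidate))      // steps 2-3: check, retry on a hit
                return candidate;              // step 4: use it
        }
    }
}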
The hardest part is not generating a duplicate Guid.
The hardest part is designing a database to store all of the generated ones so you can check whether a new one is actually a duplicate.
From WIKI:
For example, the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
n ≈ √(2 × 2^122 × ln 2) ≈ 2.71 × 10^18
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes
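As a quick sanity check on those figures: 2.71 × 10^18 UUIDs at 10^9 per second is 2.71 × 10^9 seconds ≈ 86 years, and 2.71 × 10^18 UUIDs × 16 bytes ≈ 4.3 × 10^19 bytes ≈ 43 exabytes, both consistent with the quoted numbers.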
GUID stands for Globally Unique Identifier
In Brief:
(the clue is in the name)
In Detail:
GUIDs are designed to be unique; they are calculated using a random method based on the computer's clock and the computer itself. If you are creating many GUIDs at the same millisecond on the same machine, it is possible they may match, but for almost all normal operations they should be considered unique.
I think that when people bury their thoughts and fears in statistics, they tend to forget the obvious. If a system is truly random, then the result you are least likely to expect (all ones, say) is equally as likely as any other unexpected value (all zeros, say). Neither fact prevents these occurring in succession, nor within the first pair of samples (even though that would be statistically "truly shocking"). And that's the problem with measuring chance: it ignores criticality (and rotten luck) entirely.
IF it ever happened, what's the outcome? Does your software stop working? Does someone get injured? Does someone die? Does the world explode?
The more extreme the criticality, the worse the word "probability" sits in the mouth. In the end, chaining GUIDs (or XORing them, or whatever) is what you do when you regard (subjectively) your particular criticality (and your feeling of "luckiness") to be unacceptable. And if it could end the world, then please on behalf of all of us not involved in nuclear experiments in the Large Hadron Collider, don't use GUIDs or anything else indeterministic!
Enough GUIDs to assign one to each and every hypothetical grain of sand on every hypothetical planet around each and every star in the visible universe.
Enough so that if every computer in the world generates 1000 GUIDs a second for 200 years, there might (MIGHT) be a collision.
Given the number of current local uses for GUIDs (one sequence per table per database for instance) it is extraordinarily unlikely to ever be a problem for us limited creatures (and machines with lifetimes that are usually less than a decade if not a year or two for mobile phones).
... Can we close this thread now?
I'm looking to generate unique ids for identifying some data in my system. I'm using an elaborate system which concatenates some (non unique, relevant) meta-data with System.Guid.NewGuid()s. Are there any drawbacks to this approach, or am I in the clear?
I'm looking to generate unique ids for identifying some data in my system.
I'd recommend a GUID then, since they are by definition globally unique identifiers.
I'm using an elaborate system which concatenates some (non unique, relevant) meta-data with System.Guid.NewGuid(). Are there any drawbacks to this approach, or am I in the clear?
Well, since we do not know what you would consider a drawback, it is hard to say. A number of possible drawbacks come to mind:
GUIDs are big: 128 bits is a lot of bits.
GUIDs are not guaranteed to have any particular distribution; it is perfectly legal for GUIDs to be generated sequentially, and it is perfectly legal for them to be distributed uniformly over their 124-bit space (128 bits minus the four bits that are the version number, of course). This can have serious impacts on database performance if the GUID is being used as a primary key on a database that is indexed into sorted order by the GUID; insertions are much more efficient if the new row always goes at the end. A uniformly distributed GUID will almost never be at the end (see the sketch after this list).
Version 4 GUIDs are not necessarily cryptographically random; if GUIDs are generated by a non-crypto-random generator, an attacker could in theory predict what your GUIDs are when given a representative sample of them. An attacker could in theory determine the probability that two GUIDs were generated in the same session. Version one GUIDs are of course barely random at all, and can tell the sophisticated reader when and where they were generated.
And so on.
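This is not something the answer above proposes, but a common mitigation for the index-locality drawback it describes is a "COMB"-style GUID that embeds a timestamp so that successive values sort near each other. A minimal sketch, assuming SQL Server-style comparison in which the last byte group is the most significant:

using System;

static class CombGuid
{
    public static Guid NewCombGuid()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();

        // Milliseconds since the Unix epoch; 48 bits is enough until roughly the year 10889.
        long timestamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        byte[] timeBytes = BitConverter.GetBytes(timestamp);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(timeBytes);          // big-endian, so later times compare higher

        // Overwrite the last 6 bytes of the GUID with the low 48 bits of the timestamp.
        Array.Copy(timeBytes, 2, guidBytes, 10, 6);
        return new Guid(guidBytes);
    }
}

Whether this helps depends entirely on the database engine and how it compares GUIDs, so measure before adopting anything like it.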
I am planning a series of articles about these and other characteristics of GUIDs in the next couple of weeks; watch my blog for details.
UPDATE: https://ericlippert.com/2012/04/24/guid-guide-part-one/
When you use System.Guid.NewGuid(), you may still want to check that the guid doesn't already exist in your system.
While a guid is so complex as to be virtually unique, there is nothing to guarantee that it doesn't already exist except probability. It's just incredibly statistically unlikely, to the point that in almost any case it's the same as being unique.
Generating two identical guids is like winning the lottery twice - there's nothing to actually prevent it, it's just so unlikely it might as well be impossible.
Most of the time you could probably get away with not checking for existing matches, but in a very extreme case with lots of generation going on, or where the system absolutely must not fail, it could be worth checking.
EDIT
Let me clarify a little more. It is highly, highly unlikely that you would ever see a duplicate guid. That's the point. It's "globally unique", meaning there's such an infinitesimally small chance of a duplicate that you can assume it will be unique. However, if we are talking about code that keeps an aircraft in the sky, monitors a nuclear reactor, or handles life support on the International Space Station, I, personally, would still check for a duplicate, just because it would really be terrible to hit that edge case. If you're just writing a blog engine, on the other hand, go ahead, use it without checking.
Feel free to use NewGuid(). There is no problem with its uniqueness.
The probability that it will generate the same guid twice is far too low to worry about; a nice example can be found here: Simple proof that GUID is not unique
var bigHeapOGuids = new Dictionary<Guid, Guid>();
try
{
    do
    {
        Guid guid = Guid.NewGuid();
        bigHeapOGuids.Add(guid, guid);
    } while (true);
}
catch (OutOfMemoryException)
{
}
At some point it just crashed on OutOfMemory and not on duplicated key conflict.
I need to create random numbers between 0 and up to 2, so I am using:
//field level
private static Random _random = new Random();
//used in a method
_random.Next(0, 2)
My question is: will the sequence ever repeat / stop being random? Should I recreate it (_random = new Random();) every day?
Your code is fine as it is.
You do not need to create a new Random object daily.
Note that Random is not truly random, but produces a pseudo-random result.
If you check the 'remarks' section in the documentation you will find that System.Random is a pseudo-random generator.
This means in theory the sequence does eventually repeat. In practice it very much depends on what you're doing as to how much this matters. For example, the length of the sequence is such that no human will ever notice it repeating. On the other hand, it's pretty useless for cryptography.
Edit to add: restarting daily won't make much difference. Either the pseudo-randomness is sufficient for you or you need to look into a cryptographically secure way of generating random numbers.
Math.Random returns a 32-bit signed integer; it will repeat on the 2^32nd call if you don't reseed. If you DO reseed, it will repeat itself sooner.
Random is not that random. If you need "more" random, you should have a look at the cryptographic classes, such as RandomNumberGenerator (abstract), for example: RNGCryptoServiceProvider
If you read this, you'll see that Random is (currently) based on the subtractive lagged Fibonacci generator of (24, 55), which should guarantee the period to be at least 2^55. BUT if you look at the end of the page, someone has pointed to a possible bug (the comment by ytosa from October 2010):
http://msdn.microsoft.com/en-us/library/system.random(v=vs.100).aspx
Let's say that the period is long enough unless you need to do "special" applications!
I will add that the Wiki http://en.wikipedia.org/wiki/Lagged_Fibonacci_generator tells us there is a paper showing that the (24, 55) lagged Fibonacci sequence isn't "random enough" (under Problems with LFG).
If the purpose is to generate cryptographically secure random numbers, you should probably use a better PRNG. If you just want the occasional random number and true randomness isn't critical (which I imagine given the output values range) then your code is most likely just fine as it is.
It is worth mentioning (at least in my experience) that if you have a method where you declare:
Random myRand = new Random();
myRand.Next(10);
// ...doing something with that
and then call this method many times (in a loop), you might end up with a lot of duplicate results (that is what happened to me).
The solution I found was to move Random myRand = new Random() to be a class member initialization statement (instead of a local variable of the method), much like you did.
After that, there were no longer duplicated results.
My problem is not usual. Let's imagine a few billion strings. The strings are usually less than 15 characters. In this list I need to find out the number of unique elements.
First of all, what object should I use? You shouldn't forget that if I add a new element I have to check whether it already exists in the list. It is not a problem in the beginning, but after a few million words it can really slow down the process.
That's why I thought that a Hashtable would be ideal for this task, because checking the list is ideally only O(1). Unfortunately a single object in .NET can only be 2 GB.
The next step would be to implement a custom hashtable which contains a list of 2 GB hashtables.
I am wondering whether some of you know a better solution.
(Computer has extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug, just use a database. They are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
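To make that concrete, here is a minimal C# trie sketch for counting distinct strings; Add returns true only the first time a string is seen, so UniqueCount ends up being the number of distinct elements (the dictionary-per-node layout is just one reasonable choice, not the most space-efficient one):

using System.Collections.Generic;

class CountingTrie
{
    private sealed class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsTerminal;
    }

    private readonly Node _root = new Node();
    public long UniqueCount { get; private set; }

    public bool Add(string s)
    {
        Node current = _root;
        foreach (char c in s)
        {
            Node next;
            if (!current.Children.TryGetValue(c, out next))
            {
                next = new Node();
                current.Children[c] = next;
            }
            current = next;
        }
        if (current.IsTerminal) return false;  // already seen
        current.IsTerminal = true;
        UniqueCount++;
        return true;
    }
}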
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
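Here is a sketch of the "sort, then make one linear pass" idea, using the framework's comparison sort instead of radix sort just to keep it short:

using System;

class SortedUniqueCount
{
    static long CountUnique(string[] items)
    {
        Array.Sort(items, StringComparer.Ordinal);
        long unique = items.Length > 0 ? 1 : 0;
        for (int i = 1; i < items.Length; i++)
            if (items[i] != items[i - 1])
                unique++;          // each new distinct string differs from its predecessor
        return unique;
    }

    static void Main()
    {
        Console.WriteLine(CountUnique(new[] { "b", "a", "b", "c", "a" })); // 3
    }
}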
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none that come built into the Framework). Be sure to get one that is balanced, like a Red Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it only contains its object and a link to its parent and two leaves), so you can have a whole slew of them.
Also, because it's sorted, the retrieval and insertion time are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and needs only an amount of extra space equal to what you have. Start by dividing the input into pieces that fit into memory, sort those and then start merging.
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit int, yielding roughly 4 billion unique hash values. If you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high). Statistics isn't my strongest point, but some google research turns up that the probability of a collision for a perfectly distributed 32-bit hash is (N - 1) / 2^32, where N is the number of unique things that are hashed.
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
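A sketch of that partitioning idea, using HashSet<string> buckets rather than Hashtable (the 256-way split is the example factor from the answer; every string still has to be stored somewhere, the split just sidesteps the per-object size limit):

using System;
using System.Collections.Generic;

class PartitionedStringSet
{
    private readonly HashSet<string>[] _partitions = new HashSet<string>[256];

    public PartitionedStringSet()
    {
        for (int i = 0; i < _partitions.Length; i++)
            _partitions[i] = new HashSet<string>(StringComparer.Ordinal);
    }

    // Returns true if the string has not been seen before.
    public bool Add(string s)
    {
        int bucket = (s.GetHashCode() & 0x7FFFFFFF) % _partitions.Length;
        return _partitions[bucket].Add(s);
    }

    public long UniqueCount
    {
        get
        {
            long total = 0;
            foreach (var partition in _partitions)
                total += partition.Count;
            return total;
        }
    }
}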
Divide and conquer - partition the data by the first 2 letters (say):
dictionary of xx => (dictionary of string => count)
I would use a database, any database would do.
Probably the fastest because modern databases are optimized for speed and memory usage.
You need only one column with index, and then you can count the number of records.
+1 for the SQL/Db solutions, keeps things simple --will allow you to focus on the real task at hand.
But just for academic purposes, I would like to add my 2 cents.
-1 for hashtables. (I cannot vote down yet.) Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus I agree with Eric J, the chances of collisions will undermine the time efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (that will be the case if you may need to perform search-like operations on the set of strings in the future as well and you have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
Below is my solution for a one-time duplicate count with limits on how much space can be used.
If we have the storage available for holding 1 billion elements in my memory, we can go for sorting them in place by heap-sort in Θ(n log n) time and then by simply traversing the collection once in O(n) time and doing this:
if (a[i] == a[i+1])
dupCount++;
If we do not have that much memory available, we can divide the input file on disk into smaller files (till the size becomes small enough to hold the collection in memory); then sort each such small file by using the above technique; then merge them together. This requires many passes on the main input file.
I would like to keep away from quick-sort because the dataset is huge. If I could squeeze in some memory for the second case, I would rather use it to reduce the number of passes than waste it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a Hash-map (Dictionary in .Net)?
Dictionary<String, byte> would only take up 5 bytes per entry on x86 (4 for the pointer to the string pool, 1 for the byte), which allows for about 400M elements. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bet would be to sort the data in-place on disk (after which counting unique elements is trivial), or use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes. Many modern databases have implemented it. It is pretty fast and can work with minimal memory.
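To give a flavor of how it works, here is a minimal HyperLogLog sketch in C#: 2^14 registers, only the small-range correction, and an ad-hoc FNV-style hash chosen purely for the illustration. With 2^14 registers the expected relative error is about 1.04/√16384 ≈ 0.8%:

using System;

class HyperLogLog
{
    private const int B = 14;                 // 2^14 = 16384 registers
    private const int M = 1 << B;
    private readonly byte[] _registers = new byte[M];

    public void Add(string item)
    {
        ulong hash = Fnv1a64(item);
        int bucket = (int)(hash >> (64 - B)); // top B bits select the register
        ulong rest = hash << B;               // remaining bits
        int rank = 1;                         // position of the first 1-bit (1-based)
        while (rest != 0 && (rest & 0x8000000000000000UL) == 0)
        {
            rank++;
            rest <<= 1;
        }
        if (rest == 0) rank = 64 - B + 1;     // no 1-bit among the remaining bits
        if (rank > _registers[bucket]) _registers[bucket] = (byte)rank;
    }

    public double Estimate()
    {
        double alpha = 0.7213 / (1 + 1.079 / M);
        double sum = 0;
        int zeroRegisters = 0;
        foreach (byte r in _registers)
        {
            sum += Math.Pow(2, -r);
            if (r == 0) zeroRegisters++;
        }
        double estimate = alpha * M * M / sum;
        if (estimate <= 2.5 * M && zeroRegisters > 0)          // small-range correction
            estimate = M * Math.Log((double)M / zeroRegisters);
        return estimate;
    }

    private static ulong Fnv1a64(string s)
    {
        ulong hash = 14695981039346656037UL;  // FNV offset basis
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 1099511628211UL;          // FNV prime
        }
        return hash;
    }
}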
Why would anybody use the "standard" random number generator from System.Random at all instead of always using the cryptographically secure random number generator from System.Security.Cryptography.RandomNumberGenerator (or its subclasses because RandomNumberGenerator is abstract)?
Nate Lawson tells us in his Google Tech Talk presentation "Crypto Strikes Back" at minute 13:11 not to use the "standard" random number generators from Python, Java and C# and to instead use the cryptographically secure version.
I know the difference between the two versions of random number generators (see question 101337).
But what rationale is there to not always use the secure random number generator? Why use System.Random at all? Performance perhaps?
Speed and intent. If you're generating a random number and have no need for security, why use a slow crypto function? You don't need security, so why make someone else think that the number may be used for something secure when it won't be?
Apart from the speed and the more useful interface (NextDouble() etc.) it is also possible to make a repeatable random sequence by using a fixed seed value. That is quite useful, among other things, during testing.
Random gen1 = new Random(); // auto seeded by the clock
Random gen2 = new Random(0); // Next(10) always yields 7,8,7,5,2,....
First of all the presentation you linked only talks about random numbers for security purposes. So it doesn't claim Random is bad for non security purposes.
But I do claim it is. The .net 4 implementation of Random is flawed in several ways. I recommend only using it if you don't care about the quality of your random numbers. I recommend using better third party implementations.
Flaw 1: The seeding
The default constructor seeds with the current time. Thus all instances of Random created with the default constructor within a short time-frame (ca. 10ms) return the same sequence. This is documented and "by-design". This is particularly annoying if you want to multi-thread your code, since you can't simply create an instance of Random at the beginning of each thread's execution.
The workaround is to be extra careful when using the default constructor and manually seed when necessary.
Another problem here is that the seed space is rather small (31 bits). So if you generate 50k instances of Random with perfectly random seeds you will probably get one sequence of random numbers twice (due to the birthday paradox). So manual seeding isn't easy to get right either.
Flaw 2: The distribution of random numbers returned by Next(int maxValue) is biased
There are parameters for which Next(int maxValue) is clearly not uniform. For example if you calculate r.Next(1431655765) % 2 you will get 0 in about 2/3 of the samples. (Sample code at the end of the answer.)
Flaw 3: The NextBytes() method is inefficient.
The per byte cost of NextBytes() is about as big as the cost to generate a full integer sample with Next(). From this I suspect that they indeed create one sample per byte.
A better implementation using 3 bytes out of each sample would speed NextBytes() up by almost a factor 3.
Thanks to this flaw Random.NextBytes() is only about 25% faster than System.Security.Cryptography.RNGCryptoServiceProvider.GetBytes on my machine (Win7, Core i3 2600MHz).
I'm sure if somebody inspected the source/decompiled byte code they'd find even more flaws than I found with my black box analysis.
Code samples
r.Next(0x55555555) % 2 is strongly biased:
Random r = new Random();
const int mod = 2;
int[] hist = new int[mod];
for (int i = 0; i < 10000000; i++)
{
    int num = r.Next(0x55555555);
    int num2 = num % 2;
    hist[num2]++;
}
for (int i = 0; i < mod; i++)
    Console.WriteLine(hist[i]);
Performance:
byte[] bytes = new byte[8 * 1024];
var cr = new System.Security.Cryptography.RNGCryptoServiceProvider();
Random r = new Random();

// Random.NextBytes
for (int i = 0; i < 100000; i++)
{
    r.NextBytes(bytes);
}

// One sample per byte
for (int i = 0; i < 100000; i++)
{
    for (int j = 0; j < bytes.Length; j++)
        bytes[j] = (byte)r.Next();
}

// One sample per 3 bytes
for (int i = 0; i < 100000; i++)
{
    for (int j = 0; j + 2 < bytes.Length; j += 3)
    {
        int num = r.Next();
        bytes[j + 2] = (byte)(num >> 16);
        bytes[j + 1] = (byte)(num >> 8);
        bytes[j] = (byte)num;
    }
    // Yes I know I'm not handling the last few bytes, but that won't have a noticeable impact on performance
}

// Crypto
for (int i = 0; i < 100000; i++)
{
    cr.GetBytes(bytes);
}
System.Random is much more performant since it does not generate cryptographically secure random numbers.
A simple test on my machine filling a buffer of 4 bytes with random data 1,000,000 times takes 49 ms for Random, but 2845 ms for RNGCryptoServiceProvider. Note that if you increase the size of the buffer you are filling, the difference narrows as the overhead for RNGCryptoServiceProvider is less relevant.
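For reference, here is a sketch of a comparable measurement using Stopwatch and the RandomNumberGenerator factory (absolute numbers will differ per machine; only the ratio is interesting):

using System;
using System.Diagnostics;
using System.Security.Cryptography;

class RandomVsCryptoBenchmark
{
    static void Main()
    {
        var buffer = new byte[4];
        var random = new Random();
        using (var crypto = RandomNumberGenerator.Create())
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000_000; i++) random.NextBytes(buffer);
            Console.WriteLine($"Random:                {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < 1_000_000; i++) crypto.GetBytes(buffer);
            Console.WriteLine($"RandomNumberGenerator: {sw.ElapsedMilliseconds} ms");
        }
    }
}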
The most obvious reasons have already been mentioned, so here's a more obscure one: cryptographic PRNGs typically need to be continually reseeded with "real" entropy. Thus, if you use a CPRNG too often, you could deplete the system's entropy pool, which (depending on the implementation of the CPRNG) will either weaken it (thus allowing an attacker to predict it) or it will block while trying to fill up its entropy pool (thus becoming an attack vector for a DoS attack).
Either way, your application has now become an attack vector for other, totally unrelated applications which – unlike yours – actually vitally depend on the cryptographic properties of the CPRNG.
This is an actual real world problem, BTW, that has been observed on headless servers (which naturally have rather small entropy pools because they lack entropy sources such as mouse and keyboard input) running Linux, where applications incorrectly use the /dev/random kernel CPRNG for all sorts of random numbers, whereas the correct behavior would be to read a small seed value from /dev/urandom and use that to seed their own PRNG.
If you're programming an online card game or lottery, then you would want to make sure the sequence is next to impossible to guess. However, if you are showing users, say, a quote of the day, performance is more important than security.
This has been discussed at some length, but ultimately, the issue of performance is a secondary consideration when selecting an RNG. There is a vast array of RNGs out there, and the canned Lehmer LCG that most system RNGs consist of is not the best, nor even necessarily the fastest. On old, slow systems it was an excellent compromise. That compromise is seldom ever really relevant these days. The thing persists into present-day systems primarily because A) the thing is already built, and there is no real reason to 'reinvent the wheel' in this case, and B) for what the vast bulk of people will be using it for, it's 'good enough'.
Ultimately, the selection of an RNG comes down to Risk/Reward ratio. In some applications, for example a video game, there is no risk whatsoever. A Lehmer RNG is more than adequate, and is small, concise, fast, well-understood, and 'in the box'.
If the application is, for example, an on-line poker game or lottery where there are actual prizes involved and real money comes into play at some point in the equation, the 'in the box' Lehmer is no longer adequate. In a 32-bit version, it only has 2^32 possible valid states before it begins to cycle at best. These days, that's an open door to a brute force attack. In a case like this, the developer will want to go to something like a Very Long Period RNG of some species, and probably seed it from a cryptographically strong provider. This gives a good compromise between speed and security. In such a case, the person will be out looking for something like the Mersenne Twister, or a Multiple Recursive Generator of some kind.
If the application is something like communicating large quantities of financial information over a network, now there is a huge risk, and it heavily outweighs any possible reward. There are still armored cars because sometimes heavily armed men are the only security that's adequate, and trust me, if a brigade of special ops people with tanks, fighters, and helicopters was financially feasible, it would be the method of choice. In a case like this, using a cryptographically strong RNG makes sense, because whatever level of security you can get, it's not as much as you want. So you'll take as much as you can find, and the cost is a very, very remote second-place issue, either in time or money. And if that means that every random sequence takes 3 seconds to generate on a very powerful computer, you're going to wait the 3 seconds, because in the scheme of things, that is a trivial cost.
Note that the System.Random class in C# is coded incorrectly, so should be avoided.
https://connect.microsoft.com/VisualStudio/feedback/details/634761/system-random-serious-bug#tabs
Not everyone needs cryptographically secure random numbers, and they might benefit more from a speedier plain PRNG. Perhaps more important is that you can control the sequence of System.Random numbers.
In a simulation utilizing random numbers that you might want to recreate, you rerun the simulation with the same seed. It can also be handy for tracking bugs, when you want to regenerate a given faulty scenario: running your program with the exact same sequence of random numbers that crashed the program.
Random is not a random number generator, it is a deterministic pseudo-random sequence generator, which takes its name for historical reasons.
The reason to use System.Random is if you want these properties, namely a deterministic sequence, which is guaranteed to produce the same sequence of results when initialized with the same seed.
If you want to improve the "randomness" without sacrificing the interface, you can inherit from System.Random overriding several methods.
https://msdn.microsoft.com/en-us/library/system.random.sample(v=vs.110).aspx
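As an illustration of that option, here is a minimal sketch of a Random subclass whose Sample() draws from a cryptographic source. Disposal and thread safety are left out, and note that the base implementations of some members (notably Next() and NextBytes()) do not go through Sample(), so a complete drop-in replacement would override those as well, as the linked MSDN sample does:

using System;
using System.Security.Cryptography;

class CryptoBackedRandom : Random
{
    private readonly RandomNumberGenerator _rng = RandomNumberGenerator.Create();
    private readonly byte[] _buffer = new byte[8];

    protected override double Sample()
    {
        _rng.GetBytes(_buffer);
        // Use 53 random bits to build a double in [0, 1), as NextDouble() expects.
        ulong bits = BitConverter.ToUInt64(_buffer, 0) >> 11;
        return bits / (double)(1UL << 53);
    }

    public override void NextBytes(byte[] buffer)
    {
        _rng.GetBytes(buffer);
    }
}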
Why would you want a deterministic sequence
One reason to have a deterministic sequence rather than true randomness is because it is repeatable.
For example, if you are running a numerical simulation, you can initialize the sequence with a (true) random number, and record what number was used.
Then, if you wish to repeat the exact same simulation, e.g. for debugging purposes, you can do so by instead initializing the sequence with the recorded value.
Why would you want this particular, not very good, sequence
The only reason I can think of would be for backwards compatibility with existing code which uses this class.
In short, if you want to improve the sequence without changing the rest of your code, go ahead.
If I don't need the security, i.e., I just want a relatively indeterminate value, not one that's cryptographically strong, Random has a much easier interface to use.
Different needs call for different RNGs. For crypto, you want your random numbers to be as random as possible. For Monte Carlo simulations, you want them to fill the space evenly and to be able to start the RNG from a known state.
I wrote a game (Crystal Sliders on the iPhone) that would put up a "random" series of gems (images) on the map; you would rotate the map how you wanted, select them, and they went away - similar to Bejeweled. I was using Random(), and it was seeded with the number of 100ns ticks since the phone booted, a pretty random seed.
I found it amazing that it would generate games that were nearly identical to each other - of the 90 or so gems, of 2 colors, I would get two boards EXACTLY the same except for 1 to 3 gems! If you flip 90 coins and get the same pattern except for 1-3 flips, that is VERY unlikely! I have several screenshots that show them being the same. I was shocked at how bad System.Random() was! I assumed that I MUST have written something horribly wrong in my code and was using it wrong. I was wrong though, it was the generator.
As an experiment - and a final solution, I went back to the random number generator that I've been using since 1985 or so - which is VASTLY better. It is faster, has a period of 1.3 * 10^154 (2^521) before it repeats. The original algorithm was seeded with a 16 bit number, but I changed that to a 32 bit number, and improved the initial seeding.
The original one is here:
ftp://ftp.grnet.gr/pub/lang/algorithms/c/jpl-c/random.c
Over the years, I've thrown every random number test I could think of at this, and it passed all of them. I don't expect that it is of any value as a cryptographic one, but it returns a number as fast as "return *p++;" until it runs out of the 521 bits, and then it runs a quick process over the bits to create new random ones.
I created a C# wrapper - called it JPLRandom() - implemented the same interface as Random() and changed all the places where I called it in the code.
The difference was VASTLY better - OMG I was amazed - there should be no way I could tell from just looking at the screens of 90 or so gems in a pattern, but I did an emergency release of my game following this.
And I would never use System.Random() for anything ever again. I'm SHOCKED that their version is blown away by something that is now 30 years old!
-Traderhut Games