How to generate "random" but also "unique" numbers? - c#

How are random numbers generated? How do languages such as Java generate random numbers, and especially how is it done for GUIDs? I found that algorithms like pseudo-random number generators use initial (seed) values.
But I need to create a random-number program in which a number, once it has occurred, never repeats, even if the system is restarted. I thought I would need to store the values somewhere so I could check whether a number repeats, but that becomes too complex once the list grows very large.

First: If the number is guaranteed to never repeat, it's not very random.
Second: There are lots of PRNG algorithms.
UPDATE:
Third: There's an IETF RFC for UUIDs (what MS calls GUIDs), but you should recognize that (U|G)UIDs are not cryptographically secure, if that is a concern for you.
UPDATE 2:
If you want to actually use something like this in production code (not just for your own edification) please use a pre-existing library. This is the sort of code that is almost guaranteed to have subtle bugs in it if you've never done it before (or even if you have).
UPDATE 3:
Here are the docs for .NET's Guid structure.
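For illustration, generating a GUID in C# is a single library call. A minimal sketch:

```csharp
using System;

class GuidDemo
{
    static void Main()
    {
        // Guid.NewGuid() produces a version-4 (random) GUID.
        Guid id = Guid.NewGuid();
        Console.WriteLine(id);   // e.g. "3f2504e0-4f89-41d3-9a0c-0305e82c3301"

        // Uniqueness is statistically guaranteed; cryptographic
        // unpredictability is NOT part of the GUID contract.
    }
}
```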

There are a lot of ways you could generate random numbers. It's usually done with a system/library call that uses a pseudo-random number generator with a seed, as you've already described.
But, there are other ways of getting random numbers which involve specialized hardware to get TRUE random numbers. I know of some poker sites that use this kind of hardware. It's very interesting to read how they do it.

Most random number generators have a way to "randomly" reinitialize the seed value (sometimes called randomize).
If that's not possible, you can also use the system clock to initialize the seed.

You could use this code sample:
http://xkcd.com/221/
Or, you can use this book:
http://www.amazon.com/Million-Random-Digits-Normal-Deviates/dp/0833030477
But seriously, don't implement it yourself, use an existing library. You can't be the first person to do this.

Specifically regarding Java:
java.util.Random uses a linear congruential generator, which is not very good
java.util.UUID#randomUUID() uses java.security.SecureRandom, an interface for a variety of cryptographically secure RNGs - the default is based on SHA-1, I believe.
UUIDs/GUIDs are not necessarily random
It's easy to find implementations of RNGs on the net that are much better than java.util.Random, such as the Mersenne Twister or multiply-with-carry

I understand that you are seeking a way to generate random numbers using C#. If so, RNGCryptoServiceProvider is what you are looking for.
[EDIT]
If you generate a fairly long sequence of bytes using RNGCryptoServiceProvider, it is likely to be unique, but there is no guarantee. In theory, true random numbers are not required to be unique. Flip a coin twice and you may get heads both times, but the flips are still random. TRULY RANDOM!
I guess to enforce uniqueness, you just have to roll your own mechanism for keeping a history of previously generated numbers.
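Such a history-keeping mechanism might look like the following sketch. The HashSet would need to be persisted to disk to survive restarts; that part, and all the names here, are assumptions of this sketch, not part of any library API:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

class UniqueRandom
{
    // History of previously issued values. Persist this to disk
    // (not shown) so uniqueness survives a restart.
    private readonly HashSet<ulong> _seen = new HashSet<ulong>();
    private readonly RandomNumberGenerator _rng = RandomNumberGenerator.Create();

    public ulong Next()
    {
        var buffer = new byte[8];
        ulong value;
        do
        {
            _rng.GetBytes(buffer);
            value = BitConverter.ToUInt64(buffer, 0);
        } while (!_seen.Add(value));   // retry on the (very rare) repeat
        return value;
    }
}
```

With 64-bit values, a collision is astronomically unlikely until billions of numbers have been issued, so the retry loop almost never runs twice.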

Related

Does UuidCreate use a CSPRNG?

Note that this is not my application, it is an application I am pentesting for a client. I usually ask questions like this on https://security.stackexchange.com/, however as this is more programming related I have asked on here.
Granted, RFC 4122 for UUIDs does not specify that type 4 UUIDs have to be generated by a Cryptographically Secure Pseudo Random Number Generator (CSPRNG). It simply says
Set all the other bits to randomly (or pseudo-randomly) chosen
values.
However, some implementations of the algorithm, such as this one in Java, do use a CSPRNG.
I was trying to dig into whether Microsoft's implementation does or not. Mainly around how .NET or MSSQL Server generates them.
Checking the .NET source we can see this code:
Marshal.ThrowExceptionForHR(Win32Native.CoCreateGuid(out guid), new IntPtr(-1));
return guid;
Checking the CoCreateGuid documentation, it states
The CoCreateGuid function calls the RPC function UuidCreate
All I can find out about this function is here. I seem to have reached the end of the rabbit hole.
Now, does anyone have any information on how UuidCreate generates its UUIDs?
I've seen many related posts:
How Random is System.Guid.NewGuid()? (Take two)
Is using a GUID a valid way to generate a random string of characters and numbers?
How securely unguessable are GUIDs?
how are GUIDs generated in SQL Server?
The first of which says:
A GUID doesn't make guarantees about randomness, it makes guarantees
around uniqueness. If you want randomness, use Random to generate a
string.
I agree with this, except that in my case, for random, unpredictable numbers, you'd of course use a CSPRNG instead of Random (e.g. RNGCryptoServiceProvider).
And the latter states (actually quoted from Wikipedia):
Cryptanalysis of the WinAPI GUID generator shows that, since the
sequence of V4 GUIDs is pseudo-random; given full knowledge of the
internal state, it is possible to predict previous and subsequent
values
Now, on the other side of the fence this post from Will Dean says
The last time I looked into this (a few years ago, probably XP SP2), I
stepped right down into the OS code to see what was actually
happening, and it was generating a random number with the secure
random number generator.
Of course, even if it was currently using a CSPRNG this would be implementation specific and subject to change at any point (e.g. any update to Windows). Unlikely, but theoretically possible.
My point is that there's no canonical reference for this; the above was to demonstrate that I've done my research, and none of the above posts reference anything authoritative.
The reason is that I'm trying to decide whether a system that uses GUIDs for authentication tokens needs to be changed. From a pure design perspective, the answer is a definite yes; however, from a practical point of view, if the Windows UuidCreate function does in fact use a CSPRNG, then there is no immediate risk to the system. Can anyone shed any light on this?
I'm looking for any answers with a reputable source to back it up.
Although I'm still just some guy on the Internet, I have just repeated the exercise of stepping into UuidCreate, in a 32-bit app running on a 64-bit version of Windows 10.
Here's a bit of stack from part way through the process:
> 0018f670 7419b886 bcryptPrimitives!SymCryptAesExpandKeyInternal+0x7f
> 0018f884 7419b803 bcryptPrimitives!SymCryptRngAesGenerateSmall+0x68
> 0018f89c 7419ac08 bcryptPrimitives!SymCryptRngAesGenerate+0x3b
> 0018f8fc 7419aaae bcryptPrimitives!AesRNGState_generate+0x132
> 0018f92c 748346f1 bcryptPrimitives!ProcessPrng+0x4e
> 0018f93c 748346a1 RPCRT4!GenerateRandomNumber+0x11
> 0018f950 00dd127a RPCRT4!UuidCreate+0x11
It's pretty clear that it's using an AES-based RNG to generate the numbers. GUIDs generated by calling other people's GUID generation functions are still not suitable for use as unguessable auth tokens though, because that's not the purpose of the GUID generation function - you're merely exploiting a side effect.
Your "Unlikely, but theoretically possible." about changes in implementation between OS versions is rather given the lie by this statement in the docs for "UuidCreate":
If you do not need this level of security, your application can use the UuidCreateSequential function, which behaves exactly as the UuidCreate function does on all other versions of the operating system.
i.e. it used to be more predictable, now it's less predictable.

What is the quality of Random class implementation in .NET?

I have two questions regarding implementation of Random class in .NET Framework 4.6 (code available here):
What is the rationale for setting Seed argument to 1 at the end of the constructor? It seems to be copy-pasted from Numerical Recipes in C (2nd Ed.) where it made some sense, but it doesn't have any in C#.
It is directly stated in the book (Numerical Recipes in C (2nd Ed.)) that inextp field is set to value 31 because:
The constant 31 is special; see Knuth.
However, in the .NET implementation this field is set to value 21. Why? The rest of a code seems to closely follow the code from book except for this detail.
Regarding the inextp issue: this is a bug, one which Microsoft has acknowledged and declined to fix due to backwards-compatibility concerns.
Indeed, you have discovered a genuine problem with the Random implementation.
We have discussed it within the team and with some of our partners and concluded that we unfortunately cannot fix the problem right now. The reason is that some applications rely on the fact that when initialised with the same seed, the generator produces the same pseudo random sequence. Even if the change is for the better, it will break the applications that made this assumption once they have migrated to the “fixed” version.
For some more context:
A while back I fully analysed this implementation. I found a few differences.
The first one (perfectly fine) is a different large value (MBIG). Numerical Recipes claims that Knuth makes it clear that any large value should work, so that is not an issue, and Microsoft reasonably chose the largest value of a 32-bit integer.
The second one was that constant you mentioned. That one is a big deal. At a minimum it will substantially decrease the period. There have been reports that the effects are actually worse than that.
But then comes another particularly nasty difference. It is literally guaranteed to bias the output (since it does so directly), and will also likely affect the period of the RNG.
So what is this second issue? When .NET first came out, Microsoft did not realize that the RNG they coded was inclusive at both ends, and they documented it as exclusive at the maximum end. To fix this, the security team added a rather evil line of code: if (retVal == MBIG) retVal--;. This is very unfortunate, as the correct fix would literally be only 4 added characters (plus whitespace).
The correct fix would have been to change MBIG to int.MaxValue-1 but switch Sample() to use MBIG+1 (i.e. to keep using int.MaxValue). That would guarantee that Sample has the range [0.0, 1.0) without introducing any bias, and it only changes the value of MBIG, which Numerical Recipes says (citing Knuth) is perfectly fine.
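A toy illustration (not the actual .NET code) of why the clamp biases the output: decrementing the top value of an inclusive range makes MBIG-1 land twice as often as every other value. A tiny range is used here so the doubling is visible:

```csharp
using System;

class BiasDemo
{
    static void Main()
    {
        const int MBIG = 4;                 // tiny range so the bias is obvious
        var counts = new int[MBIG];

        // v takes each value in [0, MBIG] inclusive exactly once,
        // standing in for a uniform draw from the generator.
        for (int v = 0; v <= MBIG; v++)
        {
            int retVal = v;
            if (retVal == MBIG) retVal--;   // the "evil" clamp
            counts[retVal]++;
        }

        // Both v = MBIG-1 and v = MBIG map to MBIG-1, so it is
        // counted twice while every other value is counted once.
        Console.WriteLine(string.Join(",", counts));   // prints "1,1,1,2"
    }
}
```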

Does every machine generate same result of random number by using the same seed?

I'm currently stuck on the random generator. The requirement specification shows a sample like this:
Random rand = new Random(3412);
The rand result is not output directly, but is used in further computation.
I wrote the same code as above to generate a random number with the seed 3412;
however, the results of that subsequent computation are totally different from the sample.
My machine generates 518435373. I tried the same code on an online C# compiler and got a different value, 11688046, and the rest of the computation also differed from the sample.
So I'm just wondering: is the result supposed to differ between machines?
BTW, could anyone run this on your machine and post the result, to see whether it matches mine?
I would expect any one implementation to give the same sequence for the same seed, but there may well be different implementations involved. For example, an "online C# compiler" may well end up using Mono, which I'd expect to have a different implementation to the one in .NET.
I don't know whether the implementations have changed between versions of .NET, but again, that seems entirely possible.
The documentation for the Random(int) constructor states:
Providing an identical seed value to different Random objects causes each instance to produce identical sequences of random numbers.
... but it doesn't specify the implications of different versions, etc. Heck, it doesn't even state whether the x86 and x64 versions will give the same results. I'd expect the same results within any one specific CLR instance (i.e. one process, and not two CLRs running side by side).
If you need anything more stable, I'd start off with a specified algorithm - I bet there are implementations of the Mersenne Twister etc available.
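One way to pin the algorithm down yourself: implement a small, fully specified generator so the sequence is identical on every runtime. The sketch below uses Marsaglia's xorshift32, chosen purely for brevity; for serious use pick a well-studied generator such as the Mersenne Twister:

```csharp
using System;

// A fully specified PRNG: the same seed yields the same sequence
// on .NET, Mono, or any other conforming runtime, because the
// algorithm is defined here rather than by the framework.
class Xorshift32
{
    private uint _state;

    public Xorshift32(uint seed)
    {
        _state = seed == 0 ? 1u : seed;   // xorshift state must be non-zero
    }

    public uint Next()
    {
        _state ^= _state << 13;
        _state ^= _state >> 17;
        _state ^= _state << 5;
        return _state;
    }
}

class Program
{
    static void Main()
    {
        var rng = new Xorshift32(3412);
        Console.WriteLine(rng.Next());   // same value on every platform
    }
}
```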
It isn't specified as making such a promise, so you should assume that it does not.
A good rule with any specification, is not to make promises that aren't required for reasonable use, so you are freer to improve things later on.
Indeed, Random's documentation says:
The current implementation of the Random class is based on Donald E. Knuth's subtractive random number generator algorithm.
Note the phrase "current implementation", implying it may change in the future. This very strongly suggests that not only is there no promise to be consistent between versions, but there is no intention to either.
If a spec requires consistent pseudo-random numbers, then it must specify the algorithm as well as the seed value. Indeed, even if Random was specified as making such a promise, what if you need a non-.NET implementation of all or part of your specification - or something that interoperates with it - in the future?
This is probably due to different framework versions. Have a look at this
The online compiler you tried might use the Mono implementation of the CLR, which differs from the one Microsoft provides, so their Random class implementation is probably a bit different.

Is there any drawback in using a combination of RandomNumberGenerator class and base 64 encoding for generating passwords?

In my web service I need to generate passwords that are strong and can be represented as a string. Currently I use System.Security.Cryptography.RandomNumberGenerator and generate a large enough (let's just assume it is really large enough) array of random bytes and then encode it using base 64 and return that to the user.
This way I have a random password which is generated using a suitable-for-cryptography PRNG (not class Random, see this question for details on why class Random is not okay here) and which can be represented as a string and sent in an email, shown in an interface or typed in or copy-pasted by the user.
Is anything inherently wrong with this scheme from the security standpoint?
With regard to whether there is anything inherently wrong with this scheme from a security standpoint, I would consider sending a password via e-mail to be a security risk in itself. Even if the e-mail is encrypted when going down the wire, it's still going to be stored on a medium that you have no control over.
Plus, the types of passwords that you're generating will not get memorised by users, making them more likely to get written down on a sticky note, or something similar, for all to see.
Both the Random and the RandomNumberGenerator classes are basically pseudo-random number generators, and as they are based on algorithms, there is a limit to how random their outputs can be.
But compared to Random, the RandomNumberGenerator class is considered a cryptographically secure pseudo-random number generator, as it makes use of quite a few environmental parameters (http://en.wikipedia.org/wiki/CryptGenRandom) to ensure randomness. Some of the parameters are:
The current process ID
The current thread ID
The tick count since boot time
The current time
Various high-precision performance counters
An MD4 hash of the user's environment block
High-precision internal CPU counters
Do go through the following link which is an interesting read regarding randomness: http://www.codinghorror.com/blog/2006/11/computers-are-lousy-random-number-generators.html
For normal usage scenarios such as generating passwords, as in your case, the RandomNumberGenerator class is more than enough (http://msdn.microsoft.com/en-us/library/system.random.aspx):
"To generate a cryptographically secure random number suitable for creating a random password, for example, use a class derived from System.Security.Cryptography.RandomNumberGenerator such as System.Security.Cryptography.RNGCryptoServiceProvider."
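For concreteness, the scheme the question describes can be sketched as follows (the class name and the 24-byte length are illustrative choices, not part of any API):

```csharp
using System;
using System.Security.Cryptography;

class PasswordGenerator
{
    // Generates a random password: CSPRNG bytes, base64-encoded.
    public static string Generate(int numBytes = 24)
    {
        var bytes = new byte[numBytes];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        // 24 random bytes -> 32 base64 characters (no padding),
        // i.e. 192 bits of entropy.
        return Convert.ToBase64String(bytes);
    }
}
```

Note that base64 output includes '+' and '/', which may need escaping if the password is ever embedded in a URL.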

Dictionary with two hash functions in C#?

I've got a huge (>>10m) list of entries. Each entry offers two hash functions:
Cheap: quickly computes a hash, but its distribution is terrible (it may put 99% of items in 1% of the hash space)
Expensive: takes a lot of time to compute, but the distribution is a lot better also
An ordinary Dictionary lets me use only one of these hash functions. I'd like a Dictionary that uses the cheap hash function first, and checks the expensive one on collisions.
It seems like a good idea to use a dictionary inside a dictionary for this. I currently basically use this monstrosity:
Dictionary<int, Dictionary<int, List<Foo>>>;
I improved this design so the expensive hash gets called only if there are actually two items of the same cheap hash.
It fits perfectly and does a flawless job for me, but it looks like something that should have died 65 million years ago.
To my knowledge, this functionality is not included in the basic framework. I am about to write a DoubleHashedDictionary class but I wanted to know of your opinion first.
As for my specific case:
First hash function = number of files in a file system directory (fast)
Second hash function = sum of size of files (slow)
Edits:
Changed the title and added more information.
Added a quite important missing detail.
In your case, you are technically using a combined function (A|B), not double hashing. However, depending on how huge your "huge" list of entries is and the characteristics of your data, consider the following:
A 20% full hash table with a not-so-good distribution can have more than 80% chance of collision. This means your expected function cost could be: (0.8 expensive + 0.2 cheap) + (cost of lookups). So if your table is more than 20% full it may not be worth using the (A|B) scheme.
Coming up with a perfect hash function is possible, but it is O(n^3), which makes it impractical.
If performance is supremely important, you can make a specifically tuned hash table for your specific data by testing various hash functions on your key data.
First off, I think you're on the right path to implementing your own hash table, if what you are describing is truly desired. But as a critic, I'd like to ask a few questions:
Have you considered using something more unique for each entry?
I am assuming that each entry is file-system directory information; have you considered using its full path as the key? Or prefixing it with the computer name/IP address?
On the other hand, if you're using the number of files as the hash key, are those directories never going to change? Because if the hash key/result changes, you will never be able to find the entry again.
While on this topic, if the directory content/size is never going to change, can you store that value somewhere to save the time of actually calculating it?
Just my few cents.
Have you taken a look at the Power Collections or C5 Collections libraries? The Power Collections library hasn't had much action recently, but the C5 stuff seems to be fairly up to date.
I'm not sure if either library has what you need, but they're pretty useful and they're open source so it may provide a decent base implementation for you to extend to your desired functionality.
You're basically talking about a hash table of hash tables, each using a different GetHashCode implementation... While it's possible, I think you'd want to consider seriously whether you'll actually get a performance improvement over just doing one or the other...
Will there actually be a substantial number of objects that can be located via the quick-hash mechanism alone, without resorting to the more expensive one to narrow things down further? Because if you can't locate a significant number purely off the first calculation, you really save nothing by doing it in two steps (not knowing the data, it's hard to predict whether this is the case).
If a significant number can be located in one step, then you'll probably have to do a fair bit of tuning to work out how many records to store in each hash location of the outer table before resorting to an inner "expensive" hashtable lookup, but under certain circumstances I can see how you'd get a performance gain from this (the circumstances would be few and far between, but aren't inconceivable).
Edit
I just saw your amendment to the question - you plan to do both lookups regardless... I doubt you'll get any performance benefit from this that you couldn't get just by configuring the main hash table a bit better. Have you tried using a single dictionary with an appropriate capacity passed in the constructor, and perhaps an XOR of the two hash codes as your hash code?
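The single-dictionary suggestion can be sketched with a custom comparer. The Foo fields below (file count and total size, per the question's two hash functions) are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;

class Foo
{
    public int FileCount;    // basis of the cheap hash (per the question)
    public long TotalSize;   // basis of the expensive hash (per the question)
}

// One comparer that XORs both hash codes, so a single Dictionary
// replaces the dictionary-inside-a-dictionary design.
class CombinedComparer : IEqualityComparer<Foo>
{
    public int GetHashCode(Foo f) =>
        f.FileCount ^ f.TotalSize.GetHashCode();

    public bool Equals(Foo a, Foo b) =>
        a.FileCount == b.FileCount && a.TotalSize == b.TotalSize;
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<Foo, string>(new CombinedComparer());
        dict[new Foo { FileCount = 3, TotalSize = 4096 }] = "some directory";
        Console.WriteLine(dict.Count);   // prints "1"
    }
}
```

Note the trade-off: this computes the expensive hash on every lookup, which is exactly what the questioner's two-level design tries to avoid, so it only wins if the expensive hash is cheap enough relative to the extra indirection.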
