I have two questions regarding the implementation of the Random class in .NET Framework 4.6 (code available here):
What is the rationale for setting the Seed argument to 1 at the end of the constructor? It seems to be copy-pasted from Numerical Recipes in C (2nd Ed.), where it made some sense, but it doesn't make any in C#.
It is directly stated in the book (Numerical Recipes in C (2nd Ed.)) that the inextp field is set to the value 31 because:
The constant 31 is special; see Knuth.
However, in the .NET implementation this field is set to the value 21. Why? The rest of the code seems to closely follow the code from the book except for this detail.
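For reference, here is the sampling step in question, lightly paraphrased from the reference source (the surrounding field declarations are compressed here; treat this as a sketch, not a verbatim copy):

    private int inext, inextp;             // cursors into the lag table
    private int[] SeedArray = new int[56]; // the subtractive generator's state
    private const int MBIG = int.MaxValue;

    private int InternalSample()
    {
        // The two cursors walk the 56-entry lag table; their initial
        // separation is what the book's 31 (and .NET's 21) controls.
        int locINext = inext, locINextp = inextp;
        if (++locINext >= 56) locINext = 1;
        if (++locINextp >= 56) locINextp = 1;

        int retVal = SeedArray[locINext] - SeedArray[locINextp];
        if (retVal == MBIG) retVal--; // discussed further below
        if (retVal < 0) retVal += MBIG;
        SeedArray[locINext] = retVal;

        inext = locINext;
        inextp = locINextp;
        return retVal;
    }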
Regarding the inextp issue, this is a bug, one which Microsoft has acknowledged and refused to fix due to backwards-compatibility concerns.
Indeed, you have discovered a genuine problem with the Random implementation.
We have discussed it within the team and with some of our partners and concluded that we unfortunately cannot fix the problem right now. The reason is that some applications rely on the fact that when initialised with the same seed, the generator produces the same pseudo random sequence. Even if the change is for the better, it will break the applications that made this assumption once they have migrated to the “fixed” version.
For some more context:
A while back I fully analysed this implementation. I found a few differences.
The first one (perfectly fine) is a different large value (MBIG). Numerical Recipes claims that Knuth makes it clear that any large value should work, so that is not an issue, and Microsoft reasonably chose to use the largest value of a 32-bit integer.
The second one was that constant you mentioned. That one is a big deal. At a minimum it will substantially decrease the period. There have been reports that the effects are actually worse than that.
But then comes one other particularly nasty difference. It is literally guaranteed to bias the output (since it does so directly), and will also likely affect the period of the RNG.
So what is this third issue? When .NET first came out, Microsoft did not realize that the RNG they coded was inclusive at both ends, and they documented it as exclusive at the maximum end. To fix this, the security team added a rather evil line of code: if (retVal == MBIG) retVal--;. This is very unfortunate, as the correct fix would literally be only 4 added characters (plus whitespace).
The correct fix would have been to change MBIG to int.MaxValue-1, but switch Sample() to use MBIG+1 (i.e. to keep dividing by int.MaxValue). That would guarantee that Sample() has the range [0.0, 1.0) without introducing any bias, and only changes the value of MBIG, which Numerical Recipes said Knuth said is perfectly fine.
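To make the arithmetic concrete, here is a minimal sketch of that proposed fix (the names and the standalone shape are illustrative; this is not the actual .NET source):

    using System;

    class SampleFix
    {
        // Proposed: shrink the raw range of the generator to [0, int.MaxValue - 1] ...
        const int MBIG = int.MaxValue - 1;

        // rawValue is assumed to come from the subtractive generator, in [0, MBIG].
        static double Sample(int rawValue)
        {
            // ... but keep dividing by MBIG + 1 == int.MaxValue, so the result
            // stays in [0.0, 1.0) with no biasing clamp needed.
            return rawValue * (1.0 / (MBIG + 1.0));
        }

        static void Main()
        {
            Console.WriteLine(Sample(0));    // 0
            Console.WriteLine(Sample(MBIG)); // just under 1.0
        }
    }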
Note that this is not my application; it is an application I am pentesting for a client. I usually ask questions like this on https://security.stackexchange.com/, but as this is more programming-related I have asked it here.
Granted, RFC 4122 for UUIDs does not specify that type 4 UUIDs have to be generated by a Cryptographically Secure Pseudo Random Number Generator (CSPRNG). It simply says
Set all the other bits to randomly (or pseudo-randomly) chosen values.
However, some implementations of the algorithm, such as this one in Java, do use a CSPRNG.
I was trying to dig into whether Microsoft's implementation does or not - mainly how .NET and MSSQL Server generate them.
Checking the .NET source we can see this code:
Marshal.ThrowExceptionForHR(Win32Native.CoCreateGuid(out guid), new IntPtr(-1));
return guid;
Checking the CoCreateGuid documentation, it states
The CoCreateGuid function calls the RPC function UuidCreate
All I can find out about this function is here. I seem to have reached the end of the rabbit hole.
Now, does anyone have any information on how UuidCreate generates its UUIDs?
I've seen many related posts:
How Random is System.Guid.NewGuid()? (Take two)
Is using a GUID a valid way to generate a random string of characters and numbers?
How securely unguessable are GUIDs?
how are GUIDs generated in SQL Server?
The first of which says:
A GUID doesn't make guarantees about randomness, it makes guarantees around uniqueness. If you want randomness, use Random to generate a string.
I agree with this, except that in my case, for random, unpredictable numbers, you'd of course use a CSPRNG instead of Random (e.g. RNGCryptoServiceProvider).
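For example, a minimal sketch of generating an unguessable token with the framework's CSPRNG (the 32-byte length is an arbitrary choice):

    using System;
    using System.Security.Cryptography;

    class TokenSketch
    {
        static string NewToken()
        {
            var bytes = new byte[32]; // 256 bits of entropy
            using (var rng = new RNGCryptoServiceProvider())
            {
                rng.GetBytes(bytes); // filled from the OS CSPRNG
            }
            return Convert.ToBase64String(bytes);
        }

        static void Main() => Console.WriteLine(NewToken());
    }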
And the latter states (actually quoted from Wikipedia):
Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given full knowledge of the internal state, it is possible to predict previous and subsequent values.
Now, on the other side of the fence, this post from Will Dean says
The last time I looked into this (a few years ago, probably XP SP2), I stepped right down into the OS code to see what was actually happening, and it was generating a random number with the secure random number generator.
Of course, even if it was currently using a CSPRNG this would be implementation specific and subject to change at any point (e.g. any update to Windows). Unlikely, but theoretically possible.
My point is that there's no canonical reference for this; the above was to demonstrate that I've done my research, and that none of the above posts reference anything authoritative.
The reason is that I'm trying to decide whether a system that uses GUIDs for authentication tokens needs to be changed. From a pure design perspective, the answer is a definite yes; however, from a practical point of view, if the Windows UuidCreate function does in fact use a CSPRNG, then there is no immediate risk to the system. Can anyone shed any light on this?
I'm looking for any answers with a reputable source to back it up.
Although I'm still just some guy on the Internet, I have just repeated the exercise of stepping into UuidCreate, in a 32-bit app running on a 64-bit version of Windows 10.
Here's a bit of stack from part way through the process:
> 0018f670 7419b886 bcryptPrimitives!SymCryptAesExpandKeyInternal+0x7f
> 0018f884 7419b803 bcryptPrimitives!SymCryptRngAesGenerateSmall+0x68
> 0018f89c 7419ac08 bcryptPrimitives!SymCryptRngAesGenerate+0x3b
> 0018f8fc 7419aaae bcryptPrimitives!AesRNGState_generate+0x132
> 0018f92c 748346f1 bcryptPrimitives!ProcessPrng+0x4e
> 0018f93c 748346a1 RPCRT4!GenerateRandomNumber+0x11
> 0018f950 00dd127a RPCRT4!UuidCreate+0x11
It's pretty clear that it's using an AES-based RNG to generate the numbers. GUIDs generated by calling other people's GUID generation functions are still not suitable for use as unguessable auth tokens though, because that's not the purpose of the GUID generation function - you're merely exploiting a side effect.
Your "Unlikely, but theoretically possible." about changes in implementation between OS versions is rather given the lie by this statement in the docs for "UuidCreate":
If you do not need this level of security, your application can use the UuidCreateSequential function, which behaves exactly as the UuidCreate function does on all other versions of the operating system.
i.e. it used to be more predictable, now it's less predictable.
According to http://referencesource.microsoft.com/#mscorlib/system/runtime/interopservices/safebuffer.cs
SafeBuffer uses the aligned size of the struct type rather than its actual size. This appears to cause alignment issues when writing what needs to be a densely packed array of structures, and when reading from a pre-existing densely packed, non-aligned array of structures in the buffer. In the first case, the use of the aligned rather than the actual size results in unwanted padding bytes; in the second, the data gets mangled.

I have two questions (4 really, but 3 are related):
Is there a way around this, other than manually aligning access using sequential calls to SafeBuffer.Write<T> / Read<T> (which is slower; see the sketch after these questions), or ditching the SafeBuffer class (and therefore the quite nice UnmanagedMemoryAccessor class) entirely?
What are the reasons behind this choice? Why is the CLR enforcing its own alignment requirements on unmanaged memory? Why should this not be considered a bug?
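For concreteness, here is a rough sketch of the manual-packing workaround from the first question, assuming a struct whose packed size (5 bytes here) is smaller than its aligned stride; the struct and sizes are illustrative:

    using System;
    using System.IO.MemoryMappedFiles;
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential, Pack = 1)]
    struct Sample
    {
        public byte Tag;  // 1 byte
        public int Value; // 4 bytes -> 5 bytes packed
    }

    class PackedAccess
    {
        static void Main()
        {
            int packedSize = Marshal.SizeOf(typeof(Sample)); // 5 with Pack = 1
            var data = new Sample[4];
            for (int i = 0; i < data.Length; i++)
                data[i] = new Sample { Tag = (byte)i, Value = i * 100 };

            using (var mmf = MemoryMappedFile.CreateNew("demo", packedSize * data.Length))
            using (var accessor = mmf.CreateViewAccessor())
            {
                // WriteArray<T> would advance by the aligned stride; writing each
                // element at an explicit packed offset avoids the padding.
                for (int i = 0; i < data.Length; i++)
                    accessor.Write(i * packedSize, ref data[i]);

                // Read back the same way.
                for (int i = 0; i < data.Length; i++)
                {
                    accessor.Read(i * packedSize, out Sample s);
                    Console.WriteLine($"{s.Tag}: {s.Value}");
                }
            }
        }
    }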
Hmya, answers to these questions are invariably subjective; we don't have the .NET Framework designers contributing here to pass their design meeting notes to us. But you can safely assume that this is not a bug and that it was agonized over a great deal. It is surely at least one of the reasons it took so long for MMFs to be supported in .NET.
Everybody loves to ignore or wish away structure packing and alignment details, and the CLR does a terrific job of hiding them. But the buck stops here; there is no way to ignore them anymore. The cold hard fact is that it is entirely impossible to make everybody happy. The framework has no reasonable way to guess what the code on the other side of the MMF looks like. It is unknowable; MMFs are entirely too simplistic to support anything like metadata. One clear failure mode is having a 32-bit process on one end and a 64-bit process on the other: they use different alignment choices, 4 vs 8. There are many more, particularly if it is native code on the other end using its own #pragma pack.
Given that the framework can never get it 100% right, they chose to at least make it right and efficient when .NET code runs on either side. An entirely reasonable choice.
The only real flaw is that the documentation is lacking. You will have a headache when you need to interop with native code. Trial and error is, right now, the only good way. Or asking a question about the specific problem you have at SO of course :)
I'm currently stuck on the random generator. The requirement specification shows a sample like this:
Random rand = new Random(3412);
The rand result is not output directly, but is used in further processing.
I'd written the same code as above to generate a random number with the seed 3412;
however, the result of the subsequent processing is totally different from the sample.
The generated result on my machine is 518435373. I tried the same code on an online C# compiler but got a different result, 11688046; the subsequent processing result was also different from the sample.
So I'm just wondering: is the result supposed to differ between machines?
BTW, could anyone run this on their machine to see whether the result matches mine?
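For anyone who wants to compare, here is a minimal snippet (this assumes the value in question comes from the first call to Next(); adjust if your spec draws the number differently):

    using System;

    class SeedCheck
    {
        static void Main()
        {
            // First value drawn from a Random seeded with 3412;
            // compare this across machines and runtimes.
            Console.WriteLine(new Random(3412).Next());
        }
    }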
I would expect any one implementation to give the same sequence for the same seed, but there may well be different implementations involved. For example, an "online C# compiler" may well end up using Mono, which I'd expect to have a different implementation to the one in .NET.
I don't know whether the implementations have changed between versions of .NET, but again, that seems entirely possible.
The documentation for the Random(int) constructor states:
Providing an identical seed value to different Random objects causes each instance to produce identical sequences of random numbers.
... but it doesn't specify the implications of different versions etc. Heck, it doesn't even state whether the x86 and x64 versions will give the same results. I'd expect the same results within any one specific CLR instance (i.e. one process, not two CLRs running side-by-side).
If you need anything more stable, I'd start off with a specified algorithm - I bet there are implementations of the Mersenne Twister etc. available.
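As a sketch of what "a specified algorithm" could look like, here is Marsaglia's xorshift32 (chosen purely for brevity; a Mersenne Twister port would do just as well). Its output depends only on the seed, so it is reproducible across runtimes, OS versions, and architectures:

    public sealed class XorShift32
    {
        private uint _state;

        public XorShift32(uint seed)
        {
            // The state must be non-zero for xorshift.
            _state = seed == 0 ? 2463534242u : seed;
        }

        public uint NextUInt()
        {
            // Marsaglia's xorshift32 step.
            uint x = _state;
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            return _state = x;
        }

        // A double in [0.0, 1.0), analogous to Random.NextDouble().
        public double NextDouble() => NextUInt() / 4294967296.0;
    }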
It isn't specified as making such a promise, so you should assume that it does not.
A good rule with any specification is not to make promises that aren't required for reasonable use, so you are freer to improve things later on.
Indeed, Random's documentation says:
The current implementation of the Random class is based on Donald E. Knuth's subtractive random number generator algorithm.
Note the phrase "current implementation", implying it may change in the future. This very strongly suggests that not only is there no promise to be consistent between versions, but there is no intention to either.
If a spec requires consistent pseudo-random numbers, then it must specify the algorithm as well as the seed value. Indeed, even if Random was specified as making such a promise, what if you need a non-.NET implementation of all or part of your specification - or something that interoperates with it - in the future?
This is probably due to different framework versions. Have a look at this
The online provider you tried might use the Mono implementation of the CLR, which is different from the one Microsoft provides. So their Random class implementation is probably a bit different.
I found this very cool C++ sample, literally the "Hello World!" of genetic algorithms.
So I decided to re-code the whole thing in C#, and this is the result.
Now I am asking myself: is there any practical application along the lines of generating a target string starting from a population of random strings?
EDIT: my buddy on Twitter just tweeted that it "is useful for transcription type things such as translation. Does not have to be Monkey's". I wish I had a clue.
Is there any practical application along the lines of generating a target string starting from a population of random strings?
Sure. Imagine any scenario in which you know how to evaluate the fitness of a particular string, and in which the choices are discrete and constrained in some way:
Picking pronounceable names ("Xhjkxc" has low fitness; "Artekzo" has high fitness)
Trying out a series of chess moves
Guessing the combination to a safe, assuming you can tell how close you are to unlocking each tumbler
Picking phone numbers that evaluate to words (e.g. "843-2378" has high fitness because it spells "THE-BEST")
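As a concrete illustration of the pattern these examples share - mutate, score, keep the fittest - here is a compact sketch of the string-evolution loop (the target, alphabet, mutation rate, and population size are all arbitrary choices):

    using System;
    using System.Linq;

    class WeaselSketch
    {
        const string Target = "METHINKS IT IS LIKE A WEASEL";
        const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ";
        static readonly Random Rng = new Random();

        // Fitness = number of characters matching the target.
        static int Fitness(string s) => s.Zip(Target, (a, b) => a == b ? 1 : 0).Sum();

        // Replace each character with probability 'rate'.
        static string Mutate(string s, double rate) =>
            new string(s.Select(c => Rng.NextDouble() < rate
                ? Alphabet[Rng.Next(Alphabet.Length)] : c).ToArray());

        static void Main()
        {
            // Start from a fully random string; keep the best of 100 mutants per generation.
            string parent = Mutate(new string(' ', Target.Length), 1.0);
            for (int gen = 0; Fitness(parent) < Target.Length; gen++)
            {
                string best = Enumerable.Range(0, 100)
                    .Select(_ => Mutate(parent, 0.05))
                    .OrderByDescending(Fitness).First();
                if (Fitness(best) > Fitness(parent)) parent = best;
                Console.WriteLine($"{gen}: {parent}");
            }
        }
    }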
No. Each time you run the GA, you are giving it the eventual answer. This is great for showing how a GA works and how powerful it can be, but it does not have any purpose beyond that.
You could write an EA that writes code in a dynamic language like IronPython with the goal of creating code that a) executes without crashing and b) analyzes the stock market and intelligently buys and sells stock.
That's a very simplistic take on what would be necessary, but it's possible. You would need a host that provides a lot of methods for the IronPython code (technical indicators, etc) and a database of ticks.
It would also be smart to not just generate any old random code, lest you format your own hard drive. You need a sandbox, you need to limit the namespaces that are accessible, and you would need to provide a time limit to avoid infinite loops. You could also provide semantic guidelines that allow it to choose appropriate approved keywords instead of just stringing random letters together -- this would greatly speed up evolution.
So, I was involved with a project that did everything but the EA. We had a satellite dish that got real-time stock ticks from the NASDAQ, a service for trading that had an API, and a primitive decision making "brain" that made decisions as the ticks came in.
Sadly, one of the partners flipped out, quit his job, forked the project (got his own dish, etc), and started trading with logic that wasn't ready. He lost a bunch of money. It turns out that for some people this type of project is only a step away from common gambling. But anyway, the project kind of fizzled out after that. Evolving the logic part is the missing link though. And I know there are people out there doing this type of thing.
I have used GA in 2 real life research problems.
One was a power optimization problem (maximize number of appliances turned on, meeting the available power constraint and service guarantee for each appliance)
Another was for radio network optimization, maximizing the coverage area given a fixed equipment budget
GA has one main disadvantage: it usually works at "genetic" speed, so using it in seriously time-dependent projects is quite risky.
I need to build an assembler for a CPU architecture that I've built. The architecture is similar to MIPS, but this is of no importance.
I started using C#, although C++ would be more appropriate. (C# means faster development time for me).
My only problem is that I can't come up with a good design for this application. I am building a 2-pass assembler; I know what I need to do in each pass.
I've implemented the first pass and I realised that if I have two lines of assembly code on the same line, no error is thrown. This means only one thing: poor parsing techniques.
So, almighty programmers, fathers of assemblers, enlighten me: how should I proceed?
I just need to support symbols and data declaration. Instructions have fixed size.
Please let me know if you need more information.
I've written three or four simple assemblers. Without using a parser generator, what I did was model them on the S-C assembler for the 6502, which I knew best.
To do this, I used a simple syntax - a line was one of the following:
nothing
[label] [instruction] [comment]
[label] [directive] [comment]
A label was one letter followed by any number of letters or numbers.
An instruction was <whitespace><mnemonic> [operands]
A directive was <whitespace>.XX [operands]
A comment was a * up to end of line.
Operands depended on the instruction and the directive.
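In C#, a rough sketch of a line splitter for that grammar might look like this (the pattern is illustrative and deliberately loose):

    using System.Text.RegularExpressions;

    class LineSplitter
    {
        // Optional label in column 1, then an optional mnemonic or .directive
        // with optional operands, then an optional "*" comment to end of line.
        static readonly Regex LinePattern = new Regex(
            @"^(?<label>[A-Za-z][A-Za-z0-9]*)?" +   // one letter, then letters/digits
            @"(\s+(?<op>\.?[A-Za-z][A-Za-z0-9]*)" + // mnemonic or .XX directive
            @"(\s+(?<operands>[^*]*?))?)?" +        // operands, up to any comment
            @"\s*(?<comment>\*.*)?$");              // * comment to end of line

        public static Match Split(string line) => LinePattern.Match(line);
    }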
Directives included
.EQ equate for defining constants
.OR set origin address of code
.HS hex string of bytes
.AS ascii string of bytes - any delimiter except white space - whatever started it ended it
.TF target file for output
.BS n reserve block storage of n bytes
When I wrote it, I wrote simple parsers for each component. Whenever I encountered a label, I put it in a table with its target address. Whenever I encountered a label I didn't know, I marked the instruction as incomplete and put the unknown label with a reference to the instruction that needed fixing.
After all source lines had been processed, I looked through the "to fix" table and tried to find an entry in the symbol table; if I did, I patched the instructions. If not, it was an error.
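In C#, that bookkeeping might look something like this sketch (the way operands are OR-ed into instruction words is made up for illustration):

    using System.Collections.Generic;

    class FixupSketch
    {
        readonly Dictionary<string, int> _symbols = new Dictionary<string, int>();
        readonly List<(string Label, int Offset)> _fixups = new List<(string, int)>();
        readonly List<int> _code = new List<int>(); // one slot per emitted word

        public void DefineLabel(string name, int address) => _symbols[name] = address;

        // Emit an instruction whose operand is a label that may not be defined yet.
        public void EmitWithLabel(int opcode, string label)
        {
            _code.Add(opcode); // operand field patched below or in Resolve()
            if (_symbols.TryGetValue(label, out int addr))
                _code[_code.Count - 1] |= addr;        // known: patch immediately
            else
                _fixups.Add((label, _code.Count - 1)); // unknown: remember for later
        }

        // After the first pass, resolve everything left in the "to fix" table.
        public void Resolve()
        {
            foreach (var (label, offset) in _fixups)
            {
                if (!_symbols.TryGetValue(label, out int addr))
                    throw new KeyNotFoundException("undefined label: " + label);
                _code[offset] |= addr;
            }
        }
    }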
I kept a table of instruction names and all the valid addressing modes for operands. When I got an instruction, I tried to parse each addressing mode in turn until something worked.
Given this structure, it should take a day maybe two to do the whole thing.
Look at this Assembler Development Kit from Randy Hyde, author of the famous "The Art of Assembly Language":
The Assembler Developer's Kit
The first pass of a two-pass assembler assembles the code and puts placeholders for the symbols (as you don't know how big everything is until you've run the assembler). The second pass fills in the addresses. If the assembled code subsequently needs to be linked to external references, this is the job of the eponymous linker.
If you are to write an assembler that just works and spits out a hex file to be loaded on a microcontroller, it can be simple and easy. Part of my ciforth library is a full Pentium assembler for adding inline definitions, of about 150 lines. There is an assembler for the 8080 of a couple dozen lines.
The principle is explained at http://home.hccnet.nl/a.w.m.van.der.horst/postitfixup.html.
It amounts to applying the blackboard design pattern to the problem. You start with laying down the instruction, leaving holes for any and all operands. Then you fill in the holes, when you encounter the parameters.
There is a strict separation between the generic tool and the instruction set.
In case the assembler you need is just for yourself, and there are no requirements other than usability (i.e. it's not a homework assignment), you can find an example implementation at http://home.hccnet.nl/a.w.m.van.der.horst/forthassembler.html. If you dislike Forth, there is also an example implementation in Perl. If the Pentium instruction set is too much to chew on, you should still be able to understand the principle and the generic part.
You're advised to have a look at the asi8080.frt file first. This is 389 WOC (Words Of Code, not Lines Of Code). An experienced Forther familiar with the instruction set can crank out an assembler like that in an evening. The Pentium is a bitch.