Why is Random.NextBytes() "surprisingly slow"?

In a comment on "Fastest way to generate a random boolean", CodesInChaos said:
MS messed up the implementation of NextBytes, so it's surprisingly slow.
[...] the performance is about as bad as calling Next for each byte, instead of taking advantage of all 31 bits. But since System.Random has bad design and implementation at pretty much every level, this is one of my smaller gripes.
Why did he say that System.Random has bad design and implementation at pretty much every level?
How is the Random class wrongly implemented?

Of course I can't look inside his head for his reasons, but System.Random is pretty weird.
InternalSample() returns a non-negative int that cannot be int.MaxValue. That doesn't sound so bad at the outset, but it means it has almost (but not quite) 31 usable bits of randomness. That complicates things such as efficiently implementing NextBytes(byte[] buffer), which it doesn't even try to do! It does this:
for (int index = 0; index < buffer.Length; ++index)
    buffer[index] = (byte) (this.InternalSample() % 256);
That makes approximately 4 times more calls to InternalSample than necessary. Also, the % 256 is useless, since casting to a byte truncates anyway. It's also biased: 255 is just slightly less probable than any other result, since the internal sample cannot be int.MaxValue.
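As a rough illustration of what "taking advantage of all 31 bits" could look like, here is a minimal sketch in Java (used only because Java also appears elsewhere on this page; this is not the .NET implementation). The names PackedBytes, fill and sample31 are made up for the example, and sample31 is assumed to return uniformly distributed 31-bit values. It takes three whole bytes from each sample, so it needs roughly a third as many samples as bytes:

```java
import java.util.Random;
import java.util.function.IntSupplier;

final class PackedBytes {
    // Fills the buffer using three bytes per 31-bit sample and returns
    // how many samples were drawn. sample31 is a hypothetical source that
    // is assumed to return uniform values in [0, 2^31).
    static int fill(IntSupplier sample31, byte[] buffer) {
        int calls = 0;
        int i = 0;
        while (i < buffer.length) {
            int bits = sample31.getAsInt();
            calls++;
            // Use the low 24 bits; the remaining 7 bits are simply dropped.
            for (int k = 0; k < 3 && i < buffer.length; k++) {
                buffer[i++] = (byte) bits; // the cast keeps the low 8 bits
                bits >>>= 8;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        byte[] buffer = new byte[10];
        int calls = fill(() -> rng.nextInt() >>> 1, buffer);
        System.out.println(calls + " samples for " + buffer.length + " bytes");
    }
}
```

For a 10-byte buffer this draws 4 samples instead of 10.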
But it gets worse. For example, NextDouble uses this.InternalSample() * 4.6566128752458E-10. It is probably not immediately obvious, but 4.6566128752458E-10 is 1.0 / int.MaxValue. What's annoying about that is that it's not a power of two, so it's a "messy" number that causes the gaps between adjacent possible results to be nonuniform.
Worse yet, the algorithms for Next(int) and Next(int, int) are inherently biased, since they simply scale a random double and reject nothing. They are not especially fast either, even though speed would normally be the reason to avoid rejection sampling.
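To make the bias from "scale and reject nothing" concrete, here is a toy sketch in Java. It assumes a made-up source with only 8 equally likely raw samples (instead of about 2^31) and a target range of 3; the class and method names are hypothetical. It enumerates every raw sample and counts which result it lands on, with and without rejection:

```java
import java.util.Arrays;

final class ScalingBias {
    // Counts how often each result in [0, range) is produced when every one of
    // the sourceSize equally likely raw samples is scaled like a random double.
    static int[] scaledCounts(int sourceSize, int range) {
        int[] counts = new int[range];
        for (int s = 0; s < sourceSize; s++) {
            double unit = (double) s / sourceSize; // plays the role of NextDouble()
            counts[(int) (unit * range)]++;
        }
        return counts;
    }

    // Same enumeration with rejection: raw samples at or above the largest
    // multiple of range are discarded, so every result gets the same share.
    static int[] rejectionCounts(int sourceSize, int range) {
        int[] counts = new int[range];
        int limit = sourceSize - sourceSize % range;
        for (int s = 0; s < limit; s++) {
            counts[s % range]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scaledCounts(8, 3)));    // [3, 3, 2]
        System.out.println(Arrays.toString(rejectionCounts(8, 3))); // [2, 2, 2]
    }
}
```

With a real generator the imbalance per result is far smaller, but it never disappears unless the range divides the number of raw samples.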
It's also fairly slow in general. It's a subtractive generator, a relatively unknown PRNG that apparently isn't too bad, but it has a big state (which is slow to seed and has an annoying cache footprint) and a bunch of annoying operations in the sampling algorithm. Certainly it has better quality than a basic LCG, but it's significantly marred by biased scaling methods and bad performance.
The interface design is also annoying. Since the upper bounds everywhere are exclusive, there is no easy way to generate a sample in [0 .. int.MaxValue] or [int.MinValue .. int.MaxValue], both of which are fairly commonly useful. Exclusive upper bounds are often nice for avoiding weird -1's, but offering no way to get a full-range sample is just annoying. Of course it can be done through NextDouble, but since the input of NextDouble already isn't a full-range sample, the result is necessarily biased.
There are probably some deficiencies that I've missed.

Related

How double hashing works in case of the .NET Dictionary?

The other day I was reading that article on CodeProject, and I had a hard time understanding a few points about the implementation of the .NET Dictionary (considering the implementation here, without all the optimizations in .NET Core):
Note: If will add more items than the maximum number in the table
(i.e 7199369), the resize method will manually search the next prime
number that is larger than twice the old size.
Note: The reason that the sizes are being doubled while resizing the
array is to make the inner-hash table operations to have asymptotic
complexity. The prime numbers are being used to support
double-hashing.
So I tried to recall my old CS classes from a decade ago with the help of my good friend Wikipedia:
Open Addressing
Separate Chaining
Double Hashing
But I still don't really see how it relates to double hashing (which is a collision resolution technique for open-addressed hash tables), apart from the fact that the Resize() method doubles the number of entries, rounded up to the next prime number (based on the current/old size). And to be honest, I don't really see the benefit of "doubling" the size for the "asymptotic complexity" (I guess the article meant the O(n) cost incurred when the underlying array (entries) is full and has to be resized).
First, if you double the size, does it really make a difference whether you use a prime or not?
Second, to me, the .NET hash table uses a separate chaining technique for collision resolution.
I guess I must have missed a few things, and I would like someone to shed some light on those two points.

I got my answer on Reddit, so I am going to try to summarize it here:
Collision Resolution Technique
First off, it seems that collision resolution uses separate chaining rather than open addressing, and therefore there is no double hashing strategy.
The code goes as follows:
private struct Entry
{
    public int hashCode; // Lower 31 bits of hash code, -1 if unused
    public int next;     // Index of next entry, -1 if last
    public TKey key;     // Key of entry
    public TValue value; // Value of entry
}
It is just that, instead of having dedicated storage per bucket (a list or similar) for all the entries sharing the same hash code / index, everything is stored in the same entries array.
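The idea of chaining through one shared entries array can be sketched as follows. This is a deliberately simplified Java illustration with int keys and values, a fixed capacity and no removal or resizing; ChainedIntMap and its fields are made-up names, not the actual .NET source. Each bucket stores the index of the first entry in its chain, and each entry stores the index of the next one:

```java
import java.util.Arrays;

final class ChainedIntMap {
    private final int[] buckets; // per bucket: index of first entry in its chain, -1 if empty
    private final int[] next;    // per entry: index of next entry in the same chain, -1 if last
    private final int[] keys;
    private final int[] values;
    private int count;           // number of entry slots in use

    ChainedIntMap(int capacity) {
        buckets = new int[capacity];
        Arrays.fill(buckets, -1);
        next = new int[capacity];
        keys = new int[capacity];
        values = new int[capacity];
    }

    void put(int key, int value) {
        int bucket = (key & 0x7FFFFFFF) % buckets.length;
        for (int i = buckets[bucket]; i >= 0; i = next[i]) {
            if (keys[i] == key) { values[i] = value; return; } // overwrite existing key
        }
        if (count == keys.length) {
            throw new IllegalStateException("full; a real table would resize here");
        }
        keys[count] = key;
        values[count] = value;
        next[count] = buckets[bucket]; // new entry points at the old chain head
        buckets[bucket] = count++;     // and becomes the new head
    }

    Integer get(int key) {
        int bucket = (key & 0x7FFFFFFF) % buckets.length;
        for (int i = buckets[bucket]; i >= 0; i = next[i]) {
            if (keys[i] == key) return values[i];
        }
        return null; // not found
    }
}
```

With a capacity of 7, the keys 1, 8 and 15 all land in bucket 1 and are linked together through the next array, with no per-bucket list object involved.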
Prime Number
About the prime number, the answer lies here: https://cs.stackexchange.com/a/64191/42745. It's all about multiples:
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this
be achieved? By choosing m to be a number that has very few factors: a
prime number.
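A small Java sketch of that point, under the assumption that all keys happen to be multiples of 4 (the class and method names are made up for the example): with a table size of 8, which shares a factor with the keys, only two buckets are ever used, while the prime size 7 spreads the same keys over every bucket.

```java
import java.util.HashSet;
import java.util.Set;

final class PrimeBuckets {
    // Number of distinct buckets hit by the keys 0, step, 2*step, ...
    // when the bucket index is key % tableSize.
    static int bucketsUsed(int tableSize, int step, int keyCount) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < keyCount; i++) {
            used.add((i * step) % tableSize);
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println(bucketsUsed(8, 4, 100)); // 2
        System.out.println(bucketsUsed(7, 4, 100)); // 7
    }
}
```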
Doubling the underlying entries array size
This helps avoid too many resize operations (i.e. copies) by growing the array by a large enough number of slots each time.
See this answer: https://stackoverflow.com/a/2369504/4636721
Hash-tables could not claim "amortized constant time insertion" if,
for instance, the resizing was by a constant increment. In that case
the cost of resizing (which grows with the size of the hash-table)
would make the cost of one insertion linear in the total number of
elements to insert. Because resizing becomes more and more expensive
with the size of the table, it has to happen "less and less often" to
keep the amortized cost of insertion constant.
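The difference can be counted directly. The following Java sketch (hypothetical names, starting capacity of 1 chosen only to keep the arithmetic simple) simulates inserts into a growable array and adds up how many elements get copied during resizes, once with doubling and once with a constant increment:

```java
import java.util.function.IntUnaryOperator;

final class ResizeCost {
    // Simulates `inserts` insertions and returns the total number of elements
    // copied by resizes, where `grow` maps the old capacity to the new one.
    static long copies(int inserts, IntUnaryOperator grow) {
        int capacity = 1;
        int size = 0;
        long copied = 0;
        for (int i = 0; i < inserts; i++) {
            if (size == capacity) {
                copied += size;                       // every existing element is moved
                capacity = grow.applyAsInt(capacity);
            }
            size++;
        }
        return copied;
    }

    public static void main(String[] args) {
        System.out.println(copies(1024, c -> c * 2)); // 1023
        System.out.println(copies(1024, c -> c + 1)); // 523776
    }
}
```

With doubling, the copies stay below the number of inserts (constant amortized cost); with a constant increment they grow quadratically.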

c# Denormalized Floating Point: is "zero literal 0.0f" slow?

I just read about denormalized floating-point numbers. Should I replace all zero literals with an almost-zero literal to get better performance?
I am afraid that the evil zero constants in my code could pollute my performance.
Example:
Program 1:
float a = 0.0f;
Console.WriteLine(a);
Program 2:
float b = 1.401298E-45f;
Console.WriteLine(b);
Shouldn't program 2 be 1.000.000 times faster than program 1, since b can be represented in the IEEE floating-point format in canonical form, whereas program 1 has to work with "zero", which is not directly representable?
If so, the whole software development industry is flawed. A simple field declaration:
float c;
would automatically initialize it to zero, which would cause the dreaded performance hit.
Please spare me the hassle of mentioning "Premature optimization is the..., blablabla".
Delayed knowledge of how compiler optimizations work could result in the explosion of a nuclear plant. So I would like to know ahead of time what I am paying, so that I can safely ignore optimizing it.
P.S. I don't care if a float becomes denormalized as the result of a mathematical operation. I have no control over that, so I don't care.
Proof: x + 0.1f is 10 times faster than x + 0
Why does changing 0.1f to 0 slow down performance by 10x?
Question synopsis: is 0.0f evil? So are all who used it as a constant also evil?

There's nothing special about denormals that makes them inherently slower than normalized floating point numbers. In fact, a FP system which only supported denormals would be plenty fast, because it would essentially only be doing integer operations.
The slowness comes from the relative difficulty of certain operations when performed on a mix of normals and denormals. Adding a normal to a denormal is much trickier than adding a normal to a normal, or adding a denormal to a denormal. The machinery of computation is simply more involved, requires more steps. Because most of the time you're only operating on normals, it makes sense to optimize for that common case, and drop into the slower and more generalized normal/denormal implementation only when that doesn't work.
The exception to denormals being unusual, of course, is 0.0, which uses the same all-zero exponent encoding as a denormal, with a zero mantissa. Because 0 is the sort of thing one often finds and does operations on, and because an operation involving a 0 is trivial, those are handled as part of the fast common case.
I think you've misunderstood what's going on in the answer to the question you linked. The 0 isn't by itself making things slow: despite sharing its exponent encoding with the denormals, operations on it are fast. The denormals in question are the ones stored in the y array after a sufficient number of loop iterations. The advantage of the 0.1 over the 0 is that, in that particular code snippet, it prevents numbers from becoming nonzero denormals, not that it's faster to add 0.1 than 0.0 (it isn't).
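For anyone who wants to check the encodings, here is a small Java sketch (Java's float uses the same IEEE 754 single-precision format as C#'s; the class and helper names are made up). It shows that 0.0f and the smallest positive float both have an all-zero exponent field, and that the "almost-zero" literal from the question, 1.401298E-45f, is itself the smallest nonzero denormal rather than a normalized number:

```java
final class Subnormals {
    // True for nonzero values below the smallest normalized float.
    static boolean isSubnormal(float f) {
        return f != 0.0f && Math.abs(f) < Float.MIN_NORMAL;
    }

    // The raw 8-bit exponent field of the IEEE 754 single-precision encoding.
    static int exponentBits(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    public static void main(String[] args) {
        float b = 1.401298E-45f; // the literal from "Program 2"
        System.out.println(b == Float.MIN_VALUE); // true
        System.out.println(isSubnormal(b));       // true
        System.out.println(isSubnormal(0.0f));    // false
        System.out.println(exponentBits(0.0f));   // 0
        System.out.println(exponentBits(b));      // 0
        System.out.println(exponentBits(1.0f));   // 127
    }
}
```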

Why does everyone use 2^n numbers for allocation? -> new StringBuilder(256)

15 years ago, while programming in Pascal, I understood why to use powers of two for memory allocation. But this still seems to be state-of-the-art.
C# Examples:
new StringBuilder(256);
new byte[1024];
int bufferSize = 1 << 12;
I still see this thousands of times, I use it myself, and I still wonder:
Do we need this in modern programming languages and modern hardware?
I guess it's good practice, but what's the reason?
EDIT
For example, for a byte[] array, as stated by answers here, a power of 2 makes no sense: the array itself will use 16 bytes (?) of overhead, so does it make sense to use 240 (=256-16) for the size to fit a total of 256 bytes?

Do we need this in modern programming languages and modern hardware? I guess it's good practice, but what's the reason?
It depends. There are two things to consider here:
For sizes less than the memory page size, there's no appreciable difference between a power-of-two and an arbitrary number to allocate space;
You mostly use managed data structures with C#, so you won't even know how many bytes are really allocated underneath.
Assuming you're doing low-level allocation with malloc(), using multiples of the page size would be considered a good idea, i.e. 4096 or 8192; this is because it allows for more efficient memory management.
My advice would be to just allocate what you need and let C# handle the memory management and allocation for you.

Sadly, it's quite stupid if you want to keep a block of memory within a single 4k memory page... and people don't even know it :-) (I didn't until 10 minutes ago; I only had a hunch). An example: this is unsafe code and implementation dependent (using .NET 4.5 at 32/64 bits):
byte[] arr = new byte[4096];
fixed (byte* p = arr)
{
    int size = ((int*)p)[IntPtr.Size == 4 ? -1 : -2];
}
So the CLR has allocated at least 4096 + (1 or 2) sizeof(int) bytes, which means it has gone over one 4k memory page. This is logical: it has to keep the size of the array somewhere, and keeping it together with the array is the most sensible thing (for those who know what Pascal strings and BSTR are: yes, it's the same principle).
I'll add that all objects in .NET have a sync block number and a RuntimeType. They are at least an int each, if not an IntPtr, so a total of between 8 and 16 bytes per object (this is explained in various places; try searching for ".net object header" if you are interested).

It still makes sense in certain cases, but I would prefer to analyze case-by-case whether I need that kind of specification or not, rather than blindly use it as good practice.
For example, there might be cases where you want to use exactly 8 bits of information (1 byte) to address a table.
In that case, I would let the table have the size of 2^8.
Object[] table = new Object[256];
This way, you can address any object in the table using only one byte.
Even if the table is actually smaller and doesn't use all 256 places, you still have the guarantee of a bidirectional mapping from table to index and from index to table, which could prevent errors that would appear, for example, if you had:
Object[] table = new Object[100];
and then someone (probably someone else) accessed it with a byte value outside the table's range.
Maybe this kind of bijective behavior is useful; maybe you have other ways to guarantee your constraints.
Given how much smarter current compilers have become, it is probably not the only good practice anymore.

IMHO, arithmetic with exact powers of two is like a fast track: low-level arithmetic operations on powers of two take fewer steps and bit manipulations, while any other number needs extra work from the CPU.
I also found this possible duplicate: Is it better to allocate memory in the power of two?

Yes, it's good practice, and it has at least one reason.
Modern processors have an L1 cache-line size of 64 bytes, and if you use a buffer size of 2^n (for example 1024, 4096, ...), you fill whole cache lines without wasted space.
In some cases, this will help prevent false sharing problem (http://en.wikipedia.org/wiki/False_sharing).

C#/XNA - Multiplication faster than Division?

I saw a tweet recently that confused me (this was posted by an XNA coder, in the context of writing an XNA game):
Microoptimization tip of the day: when possible, use multiplication instead of division in high frequency areas. It's a few cycles faster.
I was quite surprised, because I always thought compilers were pretty smart (for example, using bit-shifting), and recently read a post by Shawn Hargreaves saying much the same thing. I wondered how much truth there was in this, since there are lots of calculations in my game.
I inquired, hoping for a sample, however the original poster was unable to give one. He did, however, say this:
Not necessarily when it's something like "center = width / 2". And I've already determined "yes, it's worth it". :)
So, I'm curious...
Can anyone give an example of some code where you can change a division to a multiplication and get a performance gain, where the C# compiler wasn't able to do the same thing itself?

Most compilers can do a reasonable job of optimizing when you give them a chance. For example, if you're dividing by a constant, chances are pretty good that the compiler can/will optimize that so it's done about as quickly as anything you can reasonably substitute for it.
When, however, you have two values that aren't known ahead of time, and you need to divide one by the other to get the answer, if there was much way for the compiler to do much with it, it would -- and for that matter, if there was much room for the compiler to optimize it much, the CPU would do it so the compiler didn't have to.
Edit: Your best bet for something like that (that's reasonably realistic) would probably be something like:
double scale_factor = get_input();
for (i=0; i<values.size(); i++)
    values[i] /= scale_factor;
This is relatively easy to convert to something like:
scale_factor = 1.0 / scale_factor;
for (i=0; i<values.size(); i++)
    values[i] *= scale_factor;
I can't really guarantee much one way or the other about a particular compiler doing that. It's basically a combination of strength reduction and loop hoisting. There are certainly optimizers that know how to do both, but what I've seen of the C# compiler suggests that it may not (but I never tested anything exactly like this, and the testing I did was a few versions back...)
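A self-contained version of that transformation, sketched in Java with made-up names, looks like this. One caveat worth stating as an assumption of the sketch: multiplying by a rounded reciprocal is not guaranteed to give bit-identical results to dividing, which is one reason an optimizer may refuse to do this for floating-point values on its own. For a power-of-two scale factor such as 4.0 the reciprocal is exact, so both versions agree:

```java
import java.util.Arrays;

final class ReciprocalScale {
    // One division per element.
    static void divideAll(double[] values, double scaleFactor) {
        for (int i = 0; i < values.length; i++) {
            values[i] /= scaleFactor;
        }
    }

    // One division in total, hoisted out of the loop, then one multiply per element.
    static void multiplyAllByReciprocal(double[] values, double scaleFactor) {
        double reciprocal = 1.0 / scaleFactor;
        for (int i = 0; i < values.length; i++) {
            values[i] *= reciprocal;
        }
    }

    public static void main(String[] args) {
        double[] a = {8.0, 2.0, 1.0};
        double[] b = a.clone();
        divideAll(a, 4.0);
        multiplyAllByReciprocal(b, 4.0);
        System.out.println(Arrays.toString(a)); // [2.0, 0.5, 0.25]
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```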

Although the compiler can optimize out divisions and multiplications by powers of 2, other numbers can be difficult or impossible to optimize. Try optimizing a division by 17 and you'll see why. This is of course assuming the compiler doesn't know that you are dividing by 17 ahead of time (it is a run-time variable, not a constant).

Bit late but never mind.
The answer to your question is yes.
Have a look at my article here, http://www.codeproject.com/KB/cs/UniqueStringList2.aspx, which uses information based on the article mentioned in the first comment to your question.
I have a QuickDivideInfo struct which stores the magic number and the shift for a given divisor, thus allowing division and modulo to be calculated using faster multiplication. I pre-computed (and tested!) QuickDivideInfos for a list of Golden Prime Numbers. For x64 at least, the .Divide method on QuickDivideInfo is inlined and is 3x quicker than using the divide operator (on an i5); it works for all numerators except int.MinValue and cannot overflow, since the multiplication is stored in 64 bits before shifting. (I've not tried on x86, but if it doesn't inline for some reason then the neatness of the Divide method would be lost and you would have to manually inline it.)
So the above will work in all scenarios (except int.MinValue) if you can precalculate. If you trust the code that generates the magic number/shift, then you can deal with any divisor at runtime.
Other well-known small divisors with a very limited range of numerators could be written inline and may well be faster if they don't need an intermediate long.
Division by a power of two: I would expect the compiler to deal with this (as in your width / 2 example), since it is a constant. If it doesn't, then changing it to width >> 1 should be fine.

To give some numbers, this page
http://cs.smith.edu/dftwiki/index.php/CSC231_Pentium_Instructions_and_Flags
lists instruction timings for the Pentium, and they aren't good:
IMUL 10 or 11
FMUL 3+1
IDIV 46 (32 bits operand)
FDIV 39
We are speaking of BIG differences

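while(start<=end)
{
    int mid=(start+end)/2;
    if(mid*mid==A)
        return mid;
    if(mid*mid<A)
    {
        start=mid+1;
        ans=mid;
    }
    else
        end=mid-1;
}
If I do it this way, the outcome is TIME LIMIT EXCEEDED for the square root of 2147483647. The most likely cause is not the speed of multiplication: mid*mid overflows a 32-bit int for large mid, so the comparisons go wrong and the search can misbehave.
If I do it the following way, comparing mid with A/mid so that the square is never computed, it finishes in time: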
while(start<=end)
{
    int mid=(start+end)/2;
    if(mid==A/mid)
        return mid;
    if(mid<A/mid)
    {
        start=mid+1;
        ans=mid;
    }
    else
        end=mid-1;
}
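For reference, here is a complete, overflow-safe variant of that binary search as a Java sketch (the names are made up, and it assumes a non-negative input). It compares mid with a / mid instead of squaring, and computes the midpoint as start + (end - start) / 2 so that the sum cannot overflow either. The helper squareOverflows shows where squaring in 32 bits stops being safe:

```java
final class SafeIsqrt {
    // Floor of the square root of a, assuming a >= 0.
    static int isqrt(int a) {
        if (a < 2) {
            return a;
        }
        int start = 1;
        int end = a;
        int ans = 0;
        while (start <= end) {
            int mid = start + (end - start) / 2; // avoids overflow of start + end
            if (mid <= a / mid) {                // same as mid*mid <= a, without overflow
                ans = mid;
                start = mid + 1;
            } else {
                end = mid - 1;
            }
        }
        return ans;
    }

    // True when mid * mid does not fit in a 32-bit signed int.
    static boolean squareOverflows(int mid) {
        return (long) mid * mid > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isqrt(2147483647));      // 46340
        System.out.println(squareOverflows(46340)); // false
        System.out.println(squareOverflows(46341)); // true
    }
}
```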

Is shifting bits faster than multiplying and dividing in Java? .NET? [closed]

Shifting bits left and right is apparently faster than multiplication and division operations on most, maybe even all, CPUs if you happen to be using a power of 2. However, it can reduce the clarity of code for some readers and some algorithms. Is bit-shifting really necessary for performance, or can I expect the compiler or VM to notice the case and optimize it (in particular, when the power-of-2 is a literal)? I am mainly interested in the Java and .NET behavior but welcome insights into other language implementations as well.

Almost any environment worth its salt will optimize this away for you. And if it doesn't, you've got bigger fish to fry. Seriously, do not waste one more second thinking about this. You will know when you have performance problems. And after you run a profiler, you will know what is causing it, and it should be fairly clear how to fix it.
You will never hear anyone say "my application was too slow, then I started randomly replacing x * 2 with x << 1 and everything was fixed!" Performance problems are generally solved by finding a way to do an order of magnitude less work, not by finding a way to do the same work 1% faster.

Most compilers today will do more than convert multiply or divide by a power-of-two to shift operations. When optimizing, many compilers can optimize a multiply or divide with a compile time constant even if it's not a power of 2. Often a multiply or divide can be decomposed to a series of shifts and adds, and if that series of operations will be faster than the multiply or divide, the compiler will use it.
For division by a constant, the compiler can often convert the operation to a multiply by a 'magic number' followed by a shift. This can be a major clock-cycle saver since multiplication is often much faster than a division operation.
Henry Warren's book, Hacker's Delight, has a wealth of information on this topic, which is also covered quite well on the companion website:
http://www.hackersdelight.org/
See also a discussion (with a link or two) in:
Reading assembly code
Anyway, all this boils down to allowing the compiler to take care of the tedious details of micro-optimizations. It's been years since doing your own shifts outsmarted the compiler.

Humans are wrong in these cases:
99% of the time when they try to second-guess modern (and all future) compilers.
99.9% of the time when they try to second-guess modern (and all future) JITs as well.
99.999% of the time when they try to second-guess modern (and all future) CPU optimizations.
Program in a way that accurately describes what you want to accomplish, not how to do it. Future versions of the JIT, VM, compiler, and CPU can all be independently improved and optimized. If you specify something so tiny and specific, you lose the benefit of all future optimizations.

You can almost certainly depend on the literal-power-of-two multiplication optimisation to a shift operation. This is one of the first optimisations that students of compiler construction will learn. :)
However, I don't think there's any guarantee for this. Your source code should reflect your intent, rather than trying to tell the optimiser what to do. If you're making a quantity larger, use multiplication. If you're moving a bit field from one place to another (think RGB colour manipulation), use a shift operation. Either way, your source code will reflect what you are actually doing.

Note that shifting down and division will (in Java, certainly) give different results for negative, odd numbers.
int a = -7;
System.out.println("Shift: "+(a >> 1));
System.out.println("Div: "+(a / 2));
Prints:
Shift: -4
Div: -3
Since Java has no unsigned integer types, a Java compiler cannot replace a signed division by two with a bare shift; it has to add a correction for negative values.
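One well-known way to make a shift agree with signed division by two is to add the sign bit before shifting, so that negative values round toward zero like the / operator does. This is a sketch of the arithmetic identity only (the class and method names are made up), not a claim about what any particular JIT emits:

```java
final class SignedHalf {
    // a >>> 31 is 1 for negative a and 0 otherwise, so negative values
    // are nudged up by one before the arithmetic shift.
    static int halfByShift(int a) {
        return (a + (a >>> 31)) >> 1;
    }

    public static void main(String[] args) {
        int a = -7;
        System.out.println("Plain shift: " + (a >> 1));        // -4
        System.out.println("Corrected:   " + halfByShift(a));  // -3
        System.out.println("Div:         " + (a / 2));         // -3
    }
}
```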

On the computers I tested, integer divisions are 4 to 10 times slower than other operations.
While compilers may replace divisions by powers of 2 so that you see no difference, divisions by other numbers are significantly slower.
For example, I have a (graphics) program with many, many divisions by 255.
My current computation is:
r = (((top.R - bottom.R) * alpha + (bottom.R * 255)) * 0x8081) >> 23;
I can assure you that it is a lot faster than my previous computation:
r = ((top.R - bottom.R) * alpha + (bottom.R * 255)) / 255;
So no, compilers cannot do all the optimization tricks.
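The multiply-and-shift form can be checked exhaustively for the inputs this blend can produce. The Java sketch below (made-up names) assumes that the channel values and alpha are all in 0..255, so the numerator lies between 0 and 255 * 255; for that range the product still fits in a 32-bit int, and the sketch confirms that the result matches a real division by 255. Larger inputs are outside what it checks:

```java
final class DivideBy255 {
    // Multiply by the "magic number" 0x8081, then shift right by 23.
    static int fastDiv255(int x) {
        return (x * 0x8081) >> 23;
    }

    // True if fastDiv255 agrees with x / 255 for every x in [0, maxInclusive].
    static boolean matchesDivision(int maxInclusive) {
        for (int x = 0; x <= maxInclusive; x++) {
            if (fastDiv255(x) != x / 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesDivision(255 * 255)); // true
        System.out.println(fastDiv255(255 * 255));      // 255
    }
}
```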

I would ask "what are you doing that it would matter?". First design your code for readability and maintainability. The likelihood that bit shifting versus standard multiplication will make a performance difference is EXTREMELY small.

It is hardware dependent. If we are talking micro-controller or i386, then shifting might be faster but, as several answers state, your compiler will usually do the optimization for you.
On modern (Pentium Pro and beyond) hardware the pipelining makes this totally irrelevant, and straying from the beaten path usually means you lose a lot more optimizations than you can gain.
Micro optimizations are not only a waste of your time, they are also extremely difficult to get right.

If the compiler (compile-time constant) or JIT (runtime constant) knows that the divisor or multiplicand is a power of two and integer arithmetic is being performed, it will convert it to a shift for you.

According to the results of this microbenchmark, shifting is twice as fast as dividing (Oracle Java 1.7.0_72).

Most compilers will turn multiplication and division into bit shifts when appropriate. It is one of the easiest optimizations to do. So, you should do what is more easily readable and appropriate for the given task.

I am stunned as I just wrote this code and realized that shifting by one is actually slower than multiplying by 2!
(EDIT: changed the code to stop overflowing after Michael Myers' suggestion, but the results are the same! What is wrong here?)
import java.util.Date;
public class Test {
    public static void main(String[] args) {
        Date before = new Date();
        for (int j = 1; j < 50000000; j++) {
            int a = 1;
            for (int i = 0; i < 10; i++) {
                a *= 2;
            }
        }
        Date after = new Date();
        System.out.println("Multiplying " + (after.getTime()-before.getTime()) + " milliseconds");
        before = new Date();
        for (int j = 1; j < 50000000; j++) {
            int a = 1;
            for (int i = 0; i < 10; i++) {
                a = a << 1;
            }
        }
        after = new Date();
        System.out.println("Shifting " + (after.getTime()-before.getTime()) + " milliseconds");
    }
}
The results are:
Multiplying 639 milliseconds
Shifting 718 milliseconds
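Numbers like these are hard to trust, for a few likely reasons: the computed value a is never used, so the JIT is free to drop the work entirely; whichever loop runs first also pays for JIT warm-up; and Date only has millisecond resolution. A slightly more careful sketch, still hand-rolled and only an approximation of what a proper harness such as JMH does, keeps the results alive and warms up both loops before timing (all names are made up for the example):

```java
final class ShiftVsMultiply {
    // Both loops return a checksum so the work cannot be discarded as dead code.
    static int multiplyLoop(int rounds) {
        int sum = 0;
        for (int j = 1; j < rounds; j++) {
            int a = j;
            for (int i = 0; i < 10; i++) {
                a *= 2;
            }
            sum += a;
        }
        return sum;
    }

    static int shiftLoop(int rounds) {
        int sum = 0;
        for (int j = 1; j < rounds; j++) {
            int a = j;
            for (int i = 0; i < 10; i++) {
                a <<= 1;
            }
            sum += a;
        }
        return sum;
    }

    public static void main(String[] args) {
        int rounds = 50_000_000;
        // Warm-up so both loops have been compiled before anything is timed.
        int sink = multiplyLoop(rounds) + shiftLoop(rounds);
        long t0 = System.nanoTime();
        sink += multiplyLoop(rounds);
        long t1 = System.nanoTime();
        sink += shiftLoop(rounds);
        long t2 = System.nanoTime();
        System.out.println("Multiplying " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("Shifting " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("(checksum " + sink + ")");
    }
}
```

Both loops compute the same checksum, and any remaining difference between the two timings should be treated as noise unless it is reproducible.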
