I have the following C# code trying to benchmark under release mode:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication54
{
    class Program
    {
        static void Main(string[] args)
        {
            int counter = 0;
            var sw = new Stopwatch();
            unchecked
            {
                int sum = 0;
                while (true)
                {
                    try
                    {
                        if (counter > 20)
                            throw new Exception("exception");
                    }
                    catch
                    {
                    }
                    sw.Restart();
                    for (int i = 0; i < int.MaxValue; i++)
                    {
                        sum += i;
                    }
                    counter++;
                    Console.WriteLine(sw.Elapsed);
                }
            }
        }
    }
}
I am on a 64-bit machine with VS 2015 installed. When I run the code as 32-bit, each iteration takes around 0.6 seconds, printed to the console. When I run it as 64-bit, the duration of each iteration jumps to 4 seconds! I tried the same code on my colleague's computer, which only has VS 2013 installed. There, both the 32-bit and 64-bit versions run in around 0.6 seconds.
In addition, if we simply remove the try-catch block, the 64-bit build under VS 2015 also runs in 0.6 seconds.
This looks like a serious RyuJIT regression triggered by the presence of a try-catch block. Am I correct?
Benchmarking is a fine art. Make a small modification to your code:
Console.WriteLine("{0}", sw.Elapsed, sum);
And you'll now see the difference disappear. Or to put it another way, the x86 version is now just as slow as the x64 code. From this minor change you can probably figure out what RyuJIT doesn't do that the legacy jitter did: it doesn't eliminate the unnecessary
sum += i;
Something you can see when you look at the generated machine code with Debug > Windows > Disassembly. This is indeed a quirk in RyuJIT; its dead code elimination isn't as thorough as the legacy jitter's. Not entirely without reason, though: Microsoft rewrote the x64 jitter because of bugs it could not easily fix. One of them was a fairly nasty issue in the optimizer: it had no upper bound on the amount of time it spent optimizing a method, which caused rather poor behavior on methods with very large bodies. It could be out in the woods for dozens of milliseconds and cause noticeable execution pauses.
Calling it a bug? Meh, not really. Write sane code and the jitter won't disappoint you. Optimization forever starts at the usual place, between the programmer's ears.
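For anyone who wants to repeat the measurement without tripping over dead code elimination, here is a minimal sketch of the same benchmark with the result kept observable. The Volatile.Write sink and the class name are my own, not part of the original program:
using System;
using System.Diagnostics;
using System.Threading;

static class BenchSketch
{
    // Written via Volatile.Write so neither jitter can discard the accumulation.
    private static int _sink;

    static void Main()
    {
        var sw = new Stopwatch();
        for (int run = 0; run < 5; run++)
        {
            sw.Restart();
            int sum = 0;
            unchecked
            {
                for (int i = 0; i < int.MaxValue; i++)
                    sum += i;
            }
            Volatile.Write(ref _sink, sum);   // keeps sum, and therefore the loop, alive
            Console.WriteLine(sw.Elapsed);
        }
    }
}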
After a bit of testing I've got some interesting results. My testing revolved around the try-catch block. As the OP pointed out, if you remove this block, the execution times are the same. I've narrowed this down a bit further and have concluded that it's caused by the counter variable in the if statement inside the try block.
Let's remove the redundant throw:
try
{
    if (counter == 0) { }
}
catch
{
}
You will get the same results with this code as you did with the original code.
Let's change counter to a literal int value:
try
{
    if (1 == 0) { }
}
catch
{
}
With this code, the 64-bit version's execution time drops from 4 seconds to about 1.7 seconds. That's still double the 32-bit version, but I thought it was interesting. Unfortunately, a quick Google search hasn't turned up a reason; I'll dig a bit more and update this answer if I find out why this is happening.
As for the remaining second that we would like to shave off the 64-bit version, I can see that it comes down to incrementing sum by i in your for loop.
Let's change this so that sum does not exceed its bounds:
for (int i = 0; i < int.MaxValue; i++)
{
    sum++;
}
This change (along with the change in the try block) reduces the execution time of the 64-bit app to 0.7 seconds. My reasoning for the remaining 1 second difference is the artificial way the 64-bit version has to handle an int, which is naturally 32 bits.
In the 32-bit version, there are 32 bits allocated to the Int32 (sum). When sum exceeds its bounds, it is easy to detect that.
In the 64-bit version, there are 64 bits allocated to the Int32 (sum). When sum exceeds its bounds, there needs to be a mechanism to detect this, which could cause the slowdown. Perhaps even the operation of adding sum and i takes longer due to the extra redundant bits involved.
I am theorising here, so don't take this as gospel. I just thought I would post my findings. I'm sure someone else will be able to shed some light on the problem that I've found.
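If anyone wants to probe that theory, a crude sketch is to time the same loop once with a 32-bit accumulator and once with a 64-bit one and compare; the helper names below are mine:
static TimeSpan TimeIntSum()
{
    var sw = Stopwatch.StartNew();
    int sum = 0;
    unchecked
    {
        for (int i = 0; i < int.MaxValue; i++)
            sum += i;               // wraps around, as in the original program
    }
    sw.Stop();
    Console.WriteLine(sum);         // keep the result observable
    return sw.Elapsed;
}

static TimeSpan TimeLongSum()
{
    var sw = Stopwatch.StartNew();
    long sum = 0;
    for (int i = 0; i < int.MaxValue; i++)
        sum += i;                   // a long accumulator cannot overflow here
    sw.Stop();
    Console.WriteLine(sum);
    return sw.Elapsed;
}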
--
Update
@HansPassant's answer pointed out that the sum += i; line may be eliminated because it is deemed unnecessary, which makes perfect sense: sum is not used outside of the for loop. After he introduced the value of sum outside of the for loop, we noticed that the x86 version became just as slow as the x64 version. So I decided to do a bit more testing. Let's change the for loop and the printing to the following:
int x = 0;
for (int i = 0; i < int.MaxValue; i++)
{
    sum += i;
    x = sum;
}
counter++;
Console.WriteLine(sw.Elapsed + " " + x);
You can see that I've introduced a new int x which is assigned the value of sum inside the for loop. The value of x is written out to the console; the sum variable itself still never leaves the for loop. This, believe it or not, actually reduces the x64 execution time to 0.7 seconds, while the x86 version jumps up to 1.4 seconds.
Related
My knowledge of try-catch is limited, but I wonder if it can be used to gain performance.
For example, I am creating a voxel engine where a function looks like this:
Block GetBlockInChunk(Vector Position){
    if(InBound(Position)){
        return Database.GetBlock();
    }
    return null;
}
Here it has to check the bounds of the given position. By using try-catch instead, could you remove that check?
Block GetBlockInChunk(Vector Position){
    try{
        return Database.GetBlock();
    }
    catch{
        return null;
    }
}
I feel like this is probably terrible practice, but I am curious.
The link I provided in the comment above describes why you shouldn't ever use a try-catch when an if-statement would prevent the exception from being thrown. But in the interest of showing the performance in terms of actual numbers, I wrote this quick little test program:
Stopwatch watch = new Stopwatch();
int[] testArray = new int[] { 1, 2, 3, 4, 5 };
int? test = null;

watch.Start();
for (int i = 0; i < 10000; i++)
{
    try
    {
        testArray[(int)test] = 0;
    }
    catch { }
}
watch.Stop();
Console.WriteLine("try-catch result:");
Console.WriteLine(watch.Elapsed);
Console.WriteLine();

watch.Restart();
for (int i = 0; i < 10000; i++)
{
    if (test != null)
        testArray[(int)test] = 0;
}
watch.Stop();
Console.WriteLine("if-statement result:");
Console.WriteLine(watch.Elapsed);
The result of the program is this:
try-catch result:
00:00:32.6764911
if-statement result:
00:00:00.0001047
As you can see, the try-catch approach introduces significant overhead when an exception is actually thrown and caught, taking over 30 seconds to complete 10,000 iterations on my machine. The if-statement, on the other hand, runs so fast that it is basically instantaneous. Compared to the try-catch, that is a difference of roughly 300,000 times.
(This isn't a rigorous benchmark, and there are ways to write it and run it differently to get more precise numbers, but this should give you a good idea of just how much more efficient it is to use an if-statement over a try-catch whenever possible.)
It is about 2 years later, but it is still relevant...
I respect the other replies, but I believe there is a basic misunderstanding here about the purpose of try-catch. In C# and .NET, try-catch is a very effective way to identify errors during development and over the life of the code. It was never intended as a flow-control tool; if a catch fires, it means you have a bug that needs to be fixed.
The problem it solves is the standard error message that stops the code and forces you to dig in. With try-catch you can know in which method the problem occurred and narrow down the search.
As a standard, I wrap ALL my methods with try-catch and add functionality that writes the error message, the name of the method, and other essential information (time, plus some helpful anchors of data) to a debug file, which I can access even when the code is in production. This is priceless!
As for performance, if the try-catch doesn't fire (which should be the normal case) there is essentially no performance penalty, as it is merely a simple wrapper. If someone really needs every last fraction of performance, it is possible to compile the wrapper out using preprocessor directives (#if ...), as sketched below.
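As a sketch of that last idea (reusing the hypothetical Block/Vector/Database names from the question above, plus a made-up LogError helper), the wrapper can be compiled out of Release builds like this:
Block GetBlockInChunk(Vector Position)
{
#if DEBUG
    try
    {
        return Database.GetBlock();
    }
    catch (Exception ex)
    {
        // Hypothetical helper that writes method name, time and the exception to a debug file.
        LogError(nameof(GetBlockInChunk), ex);
        return null;
    }
#else
    // Release build: no try-catch wrapper at all, so no overhead whatsoever.
    return Database.GetBlock();
#endif
}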
Hope this is helpful.
I am having a problem where it seems that the results of some calculations change after having used the Microsoft ACE driver to open an Excel spreadsheet.
The code below reproduces the problem.
The first two calls to DoCalculation yield the same result. Then I call OpenSpreadSheet, which opens and closes an Excel 2003 spreadsheet using the ACE driver. You would not expect OpenSpreadSheet to have any effect on the last call to DoCalculation, but it turns out that the result actually changes. This is the output the program generates:
1,59142713593566
1,59142713593566
1,59142713593495
Note the differences on the last 3 decimals. This does not seem like a big difference but in our production code the calculations are complex and the resulting differences are quite large.
It makes no difference if I use the JET driver instead of the ACE driver. If I change the types from double to decimal the error goes away. But this is not an option in our production code.
I am running on 64-bit Windows 7 and the assemblies are compiled for .NET 4.5 x86. Using the 64-bit ACE driver is not an option as we are running 32-bit Office.
Does anybody know why this is happening and how I can fix it?
The following code reproduces my problem:
static void Main(string[] args)
{
    DoCalculation();
    DoCalculation();
    OpenSpreadSheet();
    DoCalculation();
}

static void DoCalculation()
{
    // Multiply two randomly chosen numbers 10,000 times.
    var d1 = 1.0003123132;
    var d3 = 0.999734234;
    double res = 1;
    for (int i = 0; i < 10000; i++)
    {
        res *= d1 * d3;
    }
    Console.WriteLine(res);
}

public static void OpenSpreadSheet()
{
    var cn = new OleDbConnection(@"Provider=Microsoft.ACE.OLEDB.12.0;data source=c:\temp\workbook1.xls;Extended Properties=Excel 8.0");
    var cmd = new OleDbCommand("SELECT [Column1] FROM [Sheet1$]", cn);
    cn.Open();
    using (cn)
    {
        using (OleDbDataReader reader = cmd.ExecuteReader())
        {
            // Do nothing
        }
    }
}
This is technically possible; unmanaged code may be tinkering with the FPU control word and changing the way it calculates. Well-known troublemakers are DLLs compiled with Borland tools, whose runtime support code unmasks exceptions that can crash managed code, and DirectX, which is known for tinkering with the FPU control word to get calculations with double performed as float to speed up graphics math.
The specific kind of FPU control word change that appears to be made here is the rounding mode, used by the FPU when it needs to write an internal register value with 80-bit precision to a 64-bit memory location. It has 4 options for making that conversion: round up, round down, truncate, and round-to-even (banker's rounding). The differences are very small, but your code makes an effort to accumulate them rapidly. And if your numerical model is unstable then you will certainly see a difference in the end result. That doesn't make it more or less accurate, just different.
Managed code is pretty defenseless against code that does this; you cannot directly access the FPU control word without writing assembly code. You've got one trick available, highly undocumented but pretty effective: the CLR will reset the FPU whenever it handles an exception. So you could do this:
public static void ResetMathProcessor()
{
    if (IntPtr.Size != 4) return;   // No need in 64-bit code, it uses SSE
    try
    {
        throw new Exception("Please ignore, resetting the FPU");
    }
    catch (Exception ex) { }
}
Do beware that this is expensive, so use it as infrequently as possible. It is also a major pain when you debug code, so you might want to disable it in the Debug build.
I should mention an alternative: you can pinvoke the _fpreset() function in msvcrt.dll. It is however risky if you use it inside a method that also performs floating point math; the jitter optimizer doesn't know that this function jerks the floor mat from under it. You'll need to thoroughly test the Release build:
[System.Runtime.InteropServices.DllImport("msvcrt.dll")]
public static extern void _fpreset();
And do keep in mind that this does not make your calculation results more accurate in any way, just different. It's just like running the Release build of your code without a debugger producing different results than the Debug build: the Release build performs this kind of rounding less frequently since the jitter optimizer makes an effort to keep intermediate results inside the FPU at 80-bit precision, producing a result that differs from the Debug build but is actually more accurate. Give or take. This 80-bit intermediate format was Intel's billion-dollar mistake, not repeated in the SSE2 instruction set.
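One possible way to reduce that risk (a sketch, not something from the documentation) is to keep the pinvoke behind a small helper that the jitter is not allowed to inline, and to call it between your calculation methods rather than inside them:
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class FpuReset
{
    [DllImport("msvcrt.dll")]
    private static extern void _fpreset();

    // NoInlining keeps the call from being folded into a method that is doing FP math.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Reset()
    {
        if (IntPtr.Size == 4)   // only the x87 path needs it; x64 uses SSE
            _fpreset();
    }
}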
I ran this on a laptop: 64-bit Windows 8.1, 2.2 GHz Intel Core i3. The code was compiled in Release mode and run without a debugger attached.
static void Main(string[] args)
{
    calcMax(new[] { 1, 2 });
    calcMax2(new[] { 1, 2 });

    var A = GetArray(200000000);
    var stopwatch = new Stopwatch();
    stopwatch.Start(); stopwatch.Stop();

    GC.Collect();
    stopwatch.Reset();
    stopwatch.Start();
    calcMax(A);
    stopwatch.Stop();
    Console.WriteLine("calcMax - \t{0}", stopwatch.Elapsed);

    GC.Collect();
    stopwatch.Reset();
    stopwatch.Start();
    calcMax2(A);
    stopwatch.Stop();
    Console.WriteLine("calcMax2 - \t{0}", stopwatch.Elapsed);

    Console.ReadKey();
}

static int[] GetArray(int size)
{
    var r = new Random(size);
    var ret = new int[size];
    for (int i = 0; i < size; i++)
    {
        ret[i] = r.Next();
    }
    return ret;
}

static int calcMax(int[] A)
{
    int max = int.MinValue;
    for (int i = 0; i < A.Length; i++)
    {
        max = Math.Max(max, A[i]);
    }
    return max;
}

static int calcMax2(int[] A)
{
    int max1 = int.MinValue;
    int max2 = int.MinValue;
    for (int i = 0; i < A.Length; i += 2)
    {
        max1 = Math.Max(max1, A[i]);
        max2 = Math.Max(max2, A[i + 1]);
    }
    return Math.Max(max1, max2);
}
Here are some statistics of program performance (time in milliseconds):
Framework 2.0
X86 platform:
2269 (calcMax)
2971 (calcMax2)
[winner calcMax]
X64 platform:
6163 (calcMax)
5916 (calcMax2)
[winner calcMax2]
Framework 4.5 (time in milliseconds)
X86 platform:
2109 (calcMax)
2579 (calcMax2)
[winner calcMax]
X64 platform:
2040 (calcMax)
2488 (calcMax2)
[winner calcMax]
As you can see, the performance differs depending on the framework and the chosen target platform. I looked at the generated IL code and it is the same in every case.
calcMax2 is under test because it should take advantage of processor "pipelining". But it is faster only on Framework 2.0 on the 64-bit platform. So, what is the real reason for the performance differences shown?
Just some notes worth mentioning. My processor (Haswell i7) doesn't compare well with yours; I certainly can't get close to reproducing the outlier x64 result.
Benchmarking is a hazardous exercise and it is very easy to make simple mistakes that can have big consequences on execution time. You can only truly see them when you look at the generated machine code. Use Tools + Options, Debugging, General and untick the "Suppress JIT optimization" option. That way you can look at the code with Debug > Windows > Disassembly and not affect the optimizer.
Some things you'll see when you do this:
You made a mistake: you are not actually using the method return values. The jitter optimizer exploits opportunities like this where possible; it completely omits the max variable assignment in calcMax(), but not in calcMax2(). This is a classic benchmarking oops; in a real program you'd of course use the return value. This makes calcMax() look too good.
The .NET 4 jitter is smarter about optimizing Math.Max(); it can generate the code inline. The .NET 2 jitter couldn't do that yet and has to make a call to a CLR helper function. The 4.5 test should thus run a lot faster; that it didn't is a strong hint at what really throttles code execution. It is not the processor's execution engine, it is the cost of accessing memory. Your array is too large to fit in the processor caches, so your program is bogged down waiting for the slow RAM to supply the data. If the processor cannot overlap that with executing instructions then it just stalls.
Noteworthy about calcMax() is what happens to the array-bounds check that C# performs. The jitter knows how to eliminate it from the loop completely. It isn't smart enough, however, to do the same in calcMax2(); the A[i + 1] screws that up. That check doesn't come for free and should make calcMax2() quite a bit slower. That it doesn't is again a strong hint that memory is the true bottleneck. That's pretty normal, by the way; array bounds checking in C# can have low to no overhead because it is so much cheaper than the array element access.
As for your basic quest of trying to improve super-scalar execution opportunities: no, that's not how processors work. A loop is not a boundary for the processor; it just sees a stream of compare and branch instructions, all of which can execute concurrently if they don't have inter-dependencies. What you did by hand is something the optimizer can already do itself, an optimization called "loop unrolling". It chose not to do so in this particular case, by the way. An overview of jitter optimizer strategies is available in this post. Trying to outsmart the processor and the optimizer is a pretty tall order, and getting a worse result by trying to help is certainly not unusual.
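To address the first point, a small sketch of a fairer comparison inside the Main from the question is to actually consume both return values so neither loop body can be dropped:
// Consume the return values so the optimizer cannot omit either loop body.
var stopwatch = Stopwatch.StartNew();
int m1 = calcMax(A);
stopwatch.Stop();
Console.WriteLine("calcMax  - \t{0} (result {1})", stopwatch.Elapsed, m1);

GC.Collect();
stopwatch.Restart();
int m2 = calcMax2(A);
stopwatch.Stop();
Console.WriteLine("calcMax2 - \t{0} (result {1})", stopwatch.Elapsed, m2);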
Many of the differences that you see are well within the range of tolerance, so they should be considered as no differences.
Essentially, what these numbers show is that Framework 2.0 was highly unoptimized for x64 (no surprise at all here), and that overall calcMax performs slightly better than calcMax2. (No surprise there either, because calcMax2 contains more instructions.)
So, what we learn is that someone came up with a theory that they could achieve better performance by writing high-level code that somehow takes advantage of some pipelining of the CPU, and that this theory was proved wrong.
The running time of your code is dominated by the failed branch predictions that occur within Math.Max() due to the randomness of your data. Try less randomness (more consecutive values where the second one is always greater) and see if it gives you any better insight.
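A quick way to test that claim (just a sketch, reusing calcMax and GetArray from the question, with a smaller array so two copies fit comfortably in memory) is to time the same function on random data and on sorted data, where the "new maximum" branch becomes predictable:
var random = GetArray(50000000);
var sorted = (int[])random.Clone();
Array.Sort(sorted);                 // sorted input makes the branch highly predictable

var sw = Stopwatch.StartNew();
int maxRandom = calcMax(random);
sw.Stop();
Console.WriteLine("random: {0} (max {1})", sw.Elapsed, maxRandom);

sw.Restart();
int maxSorted = calcMax(sorted);
sw.Stop();
Console.WriteLine("sorted: {0} (max {1})", sw.Elapsed, maxSorted);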
Every time you run the program, you'll get slightly different results.
Sometimes calcMax will win and sometimes calcMax2 will win. This is because there is a problem with comparing performance that way. What Stopwatch measures is the time elapsed between the calls to stopwatch.Start() and stopwatch.Stop(). In between, things independent of your code can occur. For example, the operating system can take the processor away from your process and give it for a while to another process running on your machine, because your process's time slice has ended. After a while, your process gets the processor back for another time slice.
Such occurrences cannot be controlled or foreseen by your comparison code, and thus the entire experiment shouldn't be treated as reliable.
To minimize this kind of measurement error, you should measure every function many times (for example, 1000 times) and calculate the average of all measurements. This method tends to significantly improve the reliability of the result, as it is more resilient to statistical noise.
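A minimal sketch of that kind of harness (the method name and structure are mine) could look like this:
// Runs an action several times and reports the average elapsed time per run.
static TimeSpan MeasureAverage(Action action, int runs)
{
    action();                            // warm-up run so JIT compilation is not measured
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < runs; i++)
        action();
    sw.Stop();
    return TimeSpan.FromTicks(sw.Elapsed.Ticks / runs);
}

// Usage with the functions from the question:
// Console.WriteLine("calcMax:  {0}", MeasureAverage(() => calcMax(A), 10));
// Console.WriteLine("calcMax2: {0}", MeasureAverage(() => calcMax2(A), 10));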
Does looping in C# occur at the same speed on all systems? If not, how can I control the looping speed to make the experience consistent across platforms?
You can set a minimum time for the time taken to go around a loop, like this:
for (int i = 0; i < 10; i++)
{
    System.Threading.Thread.Sleep(100);
    // ... rest of your code ...
}
The Sleep call will take a minimum of 100 ms (you cannot say what the maximum will be), so your loop will take at least 1 second to run 10 iterations.
Bear in mind that sleeping on your user-interface thread is counter to the normal way of Windows programming, but this might be useful to you for a quick hack.
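If this does end up on a UI thread, a sketch of the same idea that yields instead of blocking (assuming the caller can be made async and System.Threading.Tasks is available) is:
// Non-blocking variant: await Task.Delay instead of sleeping the thread.
async Task RunLoopAsync()
{
    for (int i = 0; i < 10; i++)
    {
        await Task.Delay(100);   // at least 100 ms per iteration, UI thread stays responsive
        // ... rest of your code ...
    }
}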
You can never depend on the speed of a loop. Although all existing compilers strive to make loops as efficient as possible, and so they probably produce very similar results (given enough development time), the compilers are not the only thing influencing this.
And even leaving everything else aside, different machines have different performance. No two machines will yield exactly the same speed for a loop. In fact, even starting the program twice on the same machine will yield slightly different performance. It depends on what other programs are running, how the CPU is feeling today, and whether or not the moon is shining.
No, loops do not run at the same speed on all systems. There are so many factors involved that the question cannot be appreciably answered without code.
This is a simple loop:
int j = 0;
for (int i = 0; i < 100; i++)
{
    j = j + i;
}
This loop is too simple; it's merely a pair of load, add, store operations, plus a jump and a compare. That is only a few micro-ops and it will be really fast. However, the speed of those micro-ops depends on the processor. If the processor can do one micro-op in one billionth of a second (roughly one gigahertz), then the loop will take approximately 6 * 100 micro-ops (this is all rough estimation; there are so many factors involved that I'm only going for an approximation), or 6 * 100 billionths of a second: slightly less than one millionth of a second for the entire loop. You can barely measure this with most operating system functions.
I wanted to demonstrate the speed of the looping. I referenced above a processor doing 1 billion micro-ops per second. Now consider a processor that can do 4 billion micro-ops per second. That processor would be roughly four times faster than the first processor, and we didn't change the code.
Does this answer the question?
For those who want to mention that the compiler might unroll this loop: ignore that for the sake of the exercise.
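If you do want to put a number on something that fast, one common trick (just a sketch) is to repeat the whole loop enough times for Stopwatch to have something to measure, and then divide:
// Repeat the tiny loop many times so the total duration becomes measurable.
const int repetitions = 1000000;
var sw = Stopwatch.StartNew();
long j = 0;
for (int r = 0; r < repetitions; r++)
{
    for (int i = 0; i < 100; i++)
    {
        j = j + i;
    }
}
sw.Stop();
Console.WriteLine("total: {0}, per loop: {1:0.0000} ms",
    sw.Elapsed, sw.Elapsed.TotalMilliseconds / repetitions);
Console.WriteLine(j);   // keep j observable so the loops are not optimized away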
One way of controlling this is by using the Stopwatch to control when you do your logic. See this example code:
int noofrunspersecond = 30;
long ticks1 = 0;
long ticks2 = 0;
double interval = (double)Stopwatch.Frequency / noofrunspersecond;
while (true)
{
    ticks2 = Stopwatch.GetTimestamp();
    if (ticks2 >= ticks1 + interval)
    {
        ticks1 = Stopwatch.GetTimestamp();
        // perform your logic here
    }
    Thread.Sleep(1);
}
This will make sure that the logic is performed at the given interval, as long as the system can keep up. If you try to execute 100 times per second, then depending on the logic performed, the system might not manage to run it 100 times a second; in other cases this should work just fine.
This kind of logic is good for getting smooth animations that will not speed up or slow down on different systems for example.
I am trying to run the following program from the book.
The author claims that the resulting output should be:
1000
2000
....
10000
if you run the program on a normal (single-core) processor, but on a multiprocessor computer it could be
999
1998
...
9998
when using the normal increment method (number += 1). Using the interlocked increment as shown in the program solves the problem (i.e. you get the first output).
Now I have got 3 questions.
First, why can't I use a normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]? Why did the author choose the other method?
Secondly, what purpose does Thread.Sleep(1000) serve in this context? When I comment out this line, I get the second output even if I am using the Interlocked method to increment number.
Thirdly, I get the correct output even with the normal increment method [number += 1] as long as I don't comment out the Thread.Sleep(1000) line, and the second output if I do.
I am running the program on an Intel(R) Core(TM) i7 Q820 CPU, if that makes any difference.
static void Main(string[] args)
{
    MyNum n = new MyNum();

    for (int a = 0; a < 10; a++)
    {
        for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
        {
            Thread t = new Thread(new ThreadStart(n.AddOne));
            t.Start();
        }
        Thread.Sleep(1000);
        Console.WriteLine(n.number);
    }
}

class MyNum
{
    public int number = 0;

    public void AddOne()
    {
        Interlocked.Increment(ref number);
    }
}
The sleep is easy: it lets the threads finish before you look at the result. It's not really a good answer, though. While they should finish within a second, there is no guarantee that they actually do.
The need for the interlocked increment in the MyNum class is clear: there are 1000 threads going for the number, and without protection it would be quite possible for one to read the number, then a second to read it, then the first to write it back and then the second to write it back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores; otherwise they can only happen if a thread switch hits at the wrong time.
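A small sketch that makes that lost-update problem visible (assuming the usual System.Threading and System.Collections.Generic usings, and waiting with Join instead of a Sleep):
// Many threads bump two counters: one with a plain increment, one with
// Interlocked.Increment. The plain counter usually ends up short.
int plainCount = 0, safeCount = 0;
var threads = new List<Thread>();
for (int t = 0; t < 100; t++)
{
    var thread = new Thread(() =>
    {
        for (int k = 0; k < 10000; k++)
        {
            plainCount++;                         // read-modify-write, increments can collide
            Interlocked.Increment(ref safeCount); // atomic
        }
    });
    threads.Add(thread);
    thread.Start();
}
foreach (var thread in threads) thread.Join();     // wait deterministically, no Sleep guess
Console.WriteLine("plain:       {0}", plainCount); // often less than 1,000,000
Console.WriteLine("interlocked: {0}", safeCount);  // always 1,000,000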
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. The threads run faster than they are created, so they aren't all running at once.
Try:
public void AddOne()
{
    int x = number + fibnocci(20) + 1 - fibnocci(20);
}

private int fibnocci(int n)
{
    if (n < 3) return 1;
    else return fibnocci(n - 1) + fibnocci(n - 2);
}
(I hope the optimizer isn't good enough to kill this extra code)
The code is actually pretty strange. Since Thread t is declared locally on each iteration, it can possibly be garbage collected by .NET because no reference exists to the thread. Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, getting the correct result is not really guaranteed with number += 1. Two cores might read the same value and both increment it to the same result (e.g., 1001, 1001).
Lastly, I'm not sure whether or not you are running the program in Debug mode. Building the program in Release mode may give you different behavior and expose the side effects that a multi-threaded program can have.
If you comment out the Thread.Sleep line, there is a good chance that the threads will not finish before the Console.WriteLine call. In that case you will see a number smaller than the "correct" output, but not because the increment wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see a collision.
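If you want to remove the Sleep guesswork entirely, one sketch (using a CountdownEvent and the MyNum class from the question) is to have the main thread wait until every worker has signalled before printing:
// Wait for all workers deterministically instead of sleeping for a fixed time.
var done = new CountdownEvent(1000);
var n = new MyNum();
for (int i = 0; i < 1000; i++)
{
    new Thread(() =>
    {
        n.AddOne();
        done.Signal();       // one tick per finished worker
    }).Start();
}
done.Wait();                 // blocks until all 1000 threads have signalled
Console.WriteLine(n.number); // now guaranteed to print 1000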