One of my co-workers has been reading Clean Code by Robert C Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequence of this methodology. So we wrote a quick program to test the performance and are confused by the results.
For starters here is the normal version of the function.
static double NormalFunction()
{
double a = 0;
for (int j = 0; j < s_OuterLoopCount; ++j)
{
for (int i = 0; i < s_InnerLoopCount; ++i)
{
double b = i * 2;
a = a + b + 1;
}
}
return a;
}
Here is the version I made that breaks the functionality into small functions.
static double TinyFunctions()
{
double a = 0;
for (int i = 0; i < s_OuterLoopCount; i++)
{
a = Loop(a);
}
return a;
}
static double Loop(double a)
{
for (int i = 0; i < s_InnerLoopCount; i++)
{
double b = Double(i);
a = Add(a, Add(b, 1));
}
return a;
}
static double Double(double a)
{
return a * 2;
}
static double Add(double a, double b)
{
return a + b;
}
I use the stopwatch class to time the functions and when I ran it in debug I got the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;
These results make sense to me especially in debug as there is additional overhead in function calls. It is when I run it in release that I get the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;
These results confuse me. Even if the compiler was optimizing TinyFunctions by inlining all the function calls, how could that make it take only ~57% of the time of NormalFunction?
We have tried moving variable declarations around in NormalFunction and it had basically no effect on the run time.
I was hoping that someone would know what is going on and if the compiler can optimize TinyFunctions so well, why can't it apply similar optimizations to NormalFunction.
While looking around, we found someone mentioning that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunction only has 4 variables, so I find it hard to believe that this explains the massive performance difference.
I'd be grateful for any insight someone can provide.
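The timing code itself is not shown in the question. A minimal sketch of such a harness, assuming a warm-up call before each measurement so that JIT compilation is not part of the timed run. The `TimingSketch` class and the `MeasureMs` helper are hypothetical names, and the loop counts are settable only so the sketch can also be run with small values:

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Same names as in the question; made public and settable for experimentation.
    public static int s_OuterLoopCount = 10000;
    public static int s_InnerLoopCount = 10000;

    public static double NormalFunction()
    {
        double a = 0;
        for (int j = 0; j < s_OuterLoopCount; ++j)
        {
            for (int i = 0; i < s_InnerLoopCount; ++i)
            {
                double b = i * 2;
                a = a + b + 1;
            }
        }
        return a;
    }

    public static double TinyFunctions()
    {
        double a = 0;
        for (int i = 0; i < s_OuterLoopCount; i++)
        {
            a = Loop(a);
        }
        return a;
    }

    static double Loop(double a)
    {
        for (int i = 0; i < s_InnerLoopCount; i++)
        {
            double b = Double(i);
            a = Add(a, Add(b, 1));
        }
        return a;
    }

    static double Double(double a) { return a * 2; }
    static double Add(double a, double b) { return a + b; }

    // Hypothetical helper: one untimed warm-up call, then one timed call.
    public static long MeasureMs(Func<double> f, out double result)
    {
        f();
        Stopwatch sw = Stopwatch.StartNew();
        result = f();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        double r;
        Console.WriteLine("NormalFunction Time = " + MeasureMs(NormalFunction, out r) + " ms (result " + r + ")");
        Console.WriteLine("TinyFunctions Time = " + MeasureMs(TinyFunctions, out r) + " ms (result " + r + ")");
    }
}
```

Printing the result keeps the computation observable, and the numbers are only meaningful for a Release build started without a debugger attached.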
Update 1
As pointed out below by Kyle changing the order of operations made a massive difference in the performance of NormalFunction.
static double NormalFunction()
{
double a = 0;
for (int j = 0; j < s_OuterLoopCount; ++j)
{
for (int i = 0; i < s_InnerLoopCount; ++i)
{
double b = i * 2;
a = b + 1 + a;
}
}
return a;
}
Here are the results with this configuration.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;
This is more what I expected but still leaves the question as to why order of operations can have a ~56% performance hit.
Furthermore, I then tried it with integer operations and we are back to not making any sense.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;
And this doesn't change regardless of the order of operations.
I can make performance match much better by changing one line of code:
a = a + b + 1;
Change it to:
a = b + 1 + a;
Or:
a += b + 1;
Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:
int Double( int a ) { return a * 2; }
I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions being a few percent slower (as expected).
The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.
But that doesn't account for the bulk of the runtime discrepancy. That comes down almost entirely to the order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:
Original Changed
01070620 push ebp 01390620 push ebp
01070621 mov ebp,esp 01390621 mov ebp,esp
01070623 push edi 01390623 push edi
01070624 push esi 01390624 push esi
01070625 push eax 01390625 push eax
01070626 fldz 01390626 fldz
01070628 xor esi,esi 01390628 xor esi,esi
0107062A mov edi,dword ptr ds:[0FE43ACh] 0139062A mov edi,dword ptr ds:[12243ACh]
01070630 test edi,edi 01390630 test edi,edi
01070632 jle 0107065A 01390632 jle 0139065A
01070634 xor edx,edx 01390634 xor edx,edx
01070636 mov ecx,dword ptr ds:[0FE43B0h] 01390636 mov ecx,dword ptr ds:[12243B0h]
0107063C test ecx,ecx 0139063C test ecx,ecx
0107063E jle 01070655 0139063E jle 01390655
01070640 mov eax,edx 01390640 mov eax,edx
01070642 add eax,eax 01390642 add eax,eax
01070644 mov dword ptr [ebp-0Ch],eax 01390644 mov dword ptr [ebp-0Ch],eax
01070647 fild dword ptr [ebp-0Ch] 01390647 fild dword ptr [ebp-0Ch]
0107064A faddp st(1),st 0139064A fld1
0107064C fld1 0139064C faddp st(1),st
0107064E faddp st(1),st 0139064E faddp st(1),st
01070650 inc edx 01390650 inc edx
01070651 cmp edx,ecx 01390651 cmp edx,ecx
01070653 jl 01070640 01390653 jl 01390640
01070655 inc esi 01390655 inc esi
01070656 cmp esi,edi 01390656 cmp esi,edi
01070658 jl 01070634 01390658 jl 01390634
0107065A pop ecx 0139065A pop ecx
0107065B pop esi 0139065B pop esi
0107065C pop edi 0139065C pop edi
0107065D pop ebp 0139065D pop ebp
0107065E ret 0139065E ret
Which is opcode-for-opcode identical except for the order of the floating point operations. That makes a huge performance difference but I don't know enough about x86 floating point operations to know why exactly.
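For reference, C# evaluates `a + b + 1` left to right as `(a + b) + 1`, so both additions sit on the chain that carries `a` from one iteration to the next, whereas `b + 1 + a` computes `b + 1` independently of `a` and leaves a single addition on that chain. This matches the instruction order in the listing above and is one plausible, unverified reading of the timing gap. A sketch with the grouping written out explicitly (`OrderSketch`, `AccumulatorFirst`, and `ConstantFirst` are hypothetical names):

```csharp
using System;

public static class OrderSketch
{
    // Mirrors "a = a + b + 1": both additions depend on the previous value of a.
    public static double AccumulatorFirst(int n)
    {
        double a = 0;
        for (int i = 0; i < n; ++i)
        {
            double b = i * 2;
            a = (a + b) + 1;
        }
        return a;
    }

    // Mirrors "a = b + 1 + a": only the final addition depends on the previous a.
    public static double ConstantFirst(int n)
    {
        double a = 0;
        for (int i = 0; i < n; ++i)
        {
            double b = i * 2;
            a = (b + 1) + a;
        }
        return a;
    }

    public static void Main()
    {
        // All intermediate values are small integers, so both orders give the same sum.
        Console.WriteLine(AccumulatorFirst(1000) == ConstantFirst(1000)); // prints True
    }
}
```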
Update:
With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:
int b = 2 * i;
a = a + b + 1;
Into something like:
mov esi, eax ; b = i
add esi, esi ; b += b
lea ecx, [ecx + esi + 1] ; a = a + b + 1
Where a is stored in the ecx register, i in eax, and b in esi.
Whereas the TinyFunctions version gets turned into something like:
mov eax, edx
add eax, eax
inc eax
add ecx, eax
Where i is in edx, b is in eax, and a is in ecx this time around.
I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance between the two to line up:
int b = 2 * i + 1;
a += b;
This forces the NormalFunction approach to be compiled to mov, add, inc, add, just as in the TinyFunctions approach.
Question
I wanted to write a small profiler class that allows me to measure the run time of hot paths throughout the application. In doing so, I discovered an interesting performance difference between two possible implementations that I cannot explain, but would like to understand.
Setup
The idea is as follows:
// somewhere accessible
public static HotPathProfiler profiler = new HotPathProfiler("some name", enabled: true);
// within program
long ticket = profiler.Enter();
... // hot path
var result = profiler.Exit(ticket: ticket);
As there aren't many of these hot paths running in parallel, the idea is to implement this via an array that holds the timestamp (0 when the slot is free) and to return the index (called ticket) when calling Enter(). So the class looks like the following:
public class HotPathProfiler
{
private readonly string _name;
private readonly bool _enabled;
private readonly long[] _ticketList;
public HotPathProfiler(string name, bool enabled)
{
_name = name;
_enabled = enabled;
_ticketList = new long[128];
}
}
If code Enter()s and none of the 128 tickets is available, -1 will be returned which the Exit(ticket) function can handle by returning early.
When thinking about how to implement the Enter() call I saw the Interlocked.Read method that can atomically read values on 32bit systems, while, according to the documentation, it is unnecessary on 64bit systems.
So I went on and implemented various versions of the Enter() method, including one with Interlocked.Read and one with Interlocked.CompareExchange, and compared them with BenchmarkDotNet. That's where I discovered an enormous performance difference:
| Method | Mean | Error | StdDev | Code Size |
|------------- |----------:|---------:|---------:|----------:|
| SafeArray | 28.64 ns | 0.573 ns | 0.536 ns | 295 B |
| SafeArrayCAS | 744.75 ns | 8.741 ns | 7.749 ns | 248 B |
The benchmarks for both look pretty much the same:
[Benchmark]
public void SafeArray()
{
// doesn't matter if 'i < 1' or 'i < 10'
// performance differs by the same factor (approx. 20x)
for (int i = 0; i < 1; i++)
{
_ticketArr[i] = _hpp_sa.EnterSafe();
// SafeArrayCAS:
// _ticketArr[i] = _hpp_sa_cas.EnterSafe();
}
}
Implementations
Again, free slots hold value 0, occupied slots some other value (timestamp). Enter() is supposed to return the index/ticket of the slot.
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
{
return -1;
}
long last = 0;
long ts = Stopwatch.GetTimestamp();
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
last++;
} while (val != 0 && last < 128);
return val == 0 ? last : -1;
}
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
{
return -1;
}
long last = 0;
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
last++;
} while (val != 0 && last < 128);
if (val != 0)
{
return -1;
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
if (prev != 0)
{
return -1;
}
return last;
}
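In both versions above, `last` has already been incremented when the loop exits, so the returned ticket is one past the slot that was examined, and in SafeArray the final CompareExchange targets that next slot as well (index 128 would be out of range). Since the benchmark shown never calls Exit(), it may also be worth checking how many slots each variant really visits per call. A minimal sketch of an Enter()/Exit() pair in which the claimed slot and the returned ticket agree, assuming that is the intent. `HotPathProfilerSketch` and the body of `Exit` are hypothetical, since the question does not show `Exit`:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public class HotPathProfilerSketch
{
    private readonly bool _enabled;
    private readonly long[] _ticketList = new long[128];

    public HotPathProfilerSketch(bool enabled)
    {
        _enabled = enabled;
    }

    // Returns the index of the claimed slot, or -1 if disabled or no slot is free.
    public long Enter()
    {
        if (!_enabled)
        {
            return -1;
        }
        long ts = Stopwatch.GetTimestamp();
        for (int i = 0; i < _ticketList.Length; i++)
        {
            // Read first; only attempt the CAS on a slot that looks free.
            if (Interlocked.Read(ref _ticketList[i]) != 0)
            {
                continue;
            }
            if (Interlocked.CompareExchange(ref _ticketList[i], ts, 0) == 0)
            {
                return i; // the slot that was actually claimed
            }
        }
        return -1;
    }

    // Hypothetical Exit: frees the slot and returns the elapsed Stopwatch ticks, or -1.
    public long Exit(long ticket)
    {
        if (ticket < 0 || ticket >= _ticketList.Length)
        {
            return -1;
        }
        long start = Interlocked.Exchange(ref _ticketList[ticket], 0);
        return start == 0 ? -1 : Stopwatch.GetTimestamp() - start;
    }

    public static void Main()
    {
        var profiler = new HotPathProfilerSketch(enabled: true);
        long ticket = profiler.Enter();
        long elapsedTicks = profiler.Exit(ticket);
        Console.WriteLine("ticket " + ticket + ", elapsed ticks " + elapsedTicks);
    }
}
```

A benchmark built on this sketch would need to call Exit() for every Enter(), otherwise the array fills up after 128 calls and every later Enter() scans all slots.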
Enter rabbit hole
Now, one would say that it's no surprise to see a difference, since the slow method always tries to CAS an entry, while the other one only lazily reads each entry and then only tries a CAS once.
But, besides the fact that the benchmark only does 1 Enter(), i.e. only one while{} run, which shouldn't make that (20x) much difference, it is even harder to explain once you realize the atomic read is implemented as CAS:
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long last = 0;
00007FF82D048FCE xor edi,edi
long ts = Stopwatch.GetTimestamp();
00007FF82D048FD0 lea rcx,[rsp+28h]
00007FF82D048FD5 call CLRStub[JumpStub]#7ff82d076d70 (07FF82D076D70h)
00007FF82D048FDA mov rsi,qword ptr [rsp+28h]
00007FF82D048FDF mov rax,7FF88CF3E07Ch
00007FF82D048FE9 cmp dword ptr [rax],0
00007FF82D048FEC jne HotPathProfilerSafeArrayCAS.EnterSafe()+0A6h (07FF82D049046h)
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
00007FF82D048FEE mov rbx,qword ptr [rsp+50h]
00007FF82D048FF3 mov rax,qword ptr [rbx+10h]
00007FF82D048FF7 mov edx,dword ptr [rax+8]
00007FF82D048FFA movsxd rdx,edx
00007FF82D048FFD cmp rdi,rdx
00007FF82D049000 jae HotPathProfilerSafeArrayCAS.EnterSafe()+0ADh (07FF82D04904Dh)
00007FF82D049002 lea rdx,[rax+rdi*8+10h]
00007FF82D049007 xor eax,eax
00007FF82D049009 lock cmpxchg qword ptr [rdx],rsi
last++;
00007FF82D04900E inc rdi
} while (val != 0 && last < 128);
00007FF82D049011 test rax,rax
00007FF82D049014 je HotPathProfilerSafeArrayCAS.EnterSafe()+084h (07FF82D049024h)
00007FF82D049016 cmp rdi,80h
00007FF82D04901D mov qword ptr [rsp+50h],rbx
00007FF82D049022 jl HotPathProfilerSafeArrayCAS.EnterSafe()+04Eh (07FF82D048FEEh)
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long last = 0;
00007FF82D046C74 xor esi,esi
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
00007FF82D046C76 mov rax,qword ptr [rcx+10h]
00007FF82D046C7A mov edx,dword ptr [rax+8]
00007FF82D046C7D movsxd rdx,edx
00007FF82D046C80 cmp rsi,rdx
00007FF82D046C83 jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82D046D2Ch)
00007FF82D046C89 lea rdx,[rax+rsi*8+10h]
00007FF82D046C8E xor r8d,r8d
00007FF82D046C91 xor eax,eax
00007FF82D046C93 lock cmpxchg qword ptr [rdx],r8
last++;
00007FF82D046C98 inc rsi
} while (val != 0 && last < 128);
00007FF82D046C9B test rax,rax
00007FF82D046C9E je HotPathProfilerSafeArray.EnterSafe()+059h (07FF82D046CA9h)
00007FF82D046CA0 cmp rsi,80h
00007FF82D046CA7 jl HotPathProfilerSafeArray.EnterSafe()+026h (07FF82D046C76h)
if (val != 0)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
00007FF82FBA6ADF mov rcx,qword ptr [rcx+10h]
00007FF82FBA6AE3 mov eax,dword ptr [rcx+8]
00007FF82FBA6AE6 movsxd rax,eax
00007FF82FBA6AE9 cmp rsi,rax
00007FF82FBA6AEC jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82FBA6B4Ch)
00007FF82FBA6AEE lea rdi,[rcx+rsi*8+10h]
00007FF82FBA6AF3 mov qword ptr [rsp+28h],rdi
00007FF82FBA6AF8 lea rcx,[rsp+30h]
00007FF82FBA6AFD call CLRStub[JumpStub]#7ff82d076d70 (07FF82D076D70h)
00007FF82FBA6B02 mov rdx,qword ptr [rsp+30h]
00007FF82FBA6B07 xor eax,eax
00007FF82FBA6B09 mov rdi,qword ptr [rsp+28h]
00007FF82FBA6B0E lock cmpxchg qword ptr [rdi],rdx
00007FF82FBA6B13 mov rdi,rax
00007FF82FBA6B16 mov rax,7FF88CF3E07Ch
00007FF82FBA6B20 cmp dword ptr [rax],0
00007FF82FBA6B23 jne HotPathProfilerSafeArray.EnterSafe()+0D5h (07FF82FBA6B45h)
if (prev != 0)
[...] // omitted for brevity
Summary
I ran everything as an x64 Release build on Win10, on a Xeon E-2176G (6-core Coffee Lake) CPU. The assembler output is from Visual Studio, but matches the DisassemblyDiagnoser output of BenchmarkDotNet.
Leaving aside the hows and whys of why I'm doing this at all, I simply cannot explain the performance difference between these two methods. It shouldn't be this large, I would guess. Can it be BenchmarkDotNet itself? Am I missing something else?
It feels like I have a blind spot in my understanding of this low-level stuff that I'd like to shine some light on... thanks!
PS:
What I've tried so far:
Rearranging the order of the benchmark runs
Defer GetTimestamp() call in the slow method
Doing some initialization/test calls before the benchmark run (though I guess that's covered anyways by BenchmarkDotNet)
I have a simple program written in C#:
static void Main(string[] args)
{
int a = 0;
for (int i = 0; i < 100; ++i)
a = a + 1;
Console.WriteLine(a);
}
I am a newbie in this field of programming and my purpose is just to understand the assembly code created by the JIT. Here is a piece of the asm code:
7: int a = 0;
0000003c xor edx,edx
0000003e mov dword ptr [ebp-40h],edx
8: for (int i = 0; i < 100; ++i)
00000041 xor edx,edx
00000043 mov dword ptr [ebp-44h],edx
I cannot understand the instruction at 0000003c, xor edx,edx. Where is the result of the operation stored? I found only this quote from the "Intel® 64 and IA-32 Architectures Software Developer’s Manual":
The logical instructions AND, OR, XOR (exclusive or), and NOT perform the standard Boolean operations for which
they are named. The AND, OR, and XOR instructions require two operands; the NOT instruction operates on a
single operand
EDIT: As I understand it, the result should be stored in edx (see the next line of code). But that seems weird to me. I thought the result would be pushed onto the stack.
Logical operation instructions store results in the first argument - in your case, it's edx.
Note that XOR-ing a value with itself produces 0. Hence, XOR a, a is a common assembly idiom to clear a register.
xor edx,edx is the idiomatic way of clearing the edx register.
(Note that a XOR a is zero for any value of a.)
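The same identity can be checked from C#. A tiny sketch (the `XorSketch` class and its `Clear` method are hypothetical names used only for illustration):

```csharp
using System;

public static class XorSketch
{
    // XOR-ing a value with itself clears every bit, whatever the value is.
    public static int Clear(int value)
    {
        return value ^ value;
    }

    public static void Main()
    {
        Console.WriteLine(Clear(12345)); // prints 0
        Console.WriteLine(Clear(-1));    // prints 0
    }
}
```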
In C#, I have an array of structs and I need to assign values to each. What is the most efficient way to do this? I could assign each field, indexing the array for each field:
array[i].x = 1;
array[i].y = 1;
I could construct a new struct on the stack and copy it to the array:
array[i] = new Vector2(1, 2);
Is there another way? I could call a method and pass the struct by ref, but I'd guess the method call overhead would not be worth it.
In case the struct size matters, the structs in question have 2-4 fields of type float or byte.
In some cases I need to assign the same values to multiple array entries, eg:
Vector2 value = new Vector2(1, 2);
array[i] = value;
array[i + 1] = value;
array[i + 2] = value;
array[i + 3] = value;
Does this change which approach is more efficient?
I understand this is quite low level, but I'm doing it millions of times and I'm curious.
Edit: I slapped together a benchmark:
this.array = new Vector2[100];
Vector2[] array = this.array;
for (int i = 0; i < 1000; i++){
long startTime, endTime;
startTime = DateTime.Now.Ticks;
for (int x = 0; x < 100000000; x++) {
array[0] = new Vector2(1,2);
array[1] = new Vector2(3,4);
array[2] = new Vector2(5,6);
array[3] = new Vector2(7,8);
array[4] = new Vector2(9,0);
array[5] = new Vector2(1,2);
array[6] = new Vector2(3,4);
array[7] = new Vector2(5,6);
array[8] = new Vector2(7,8);
array[9] = new Vector2(9,0);
}
endTime = DateTime.Now.Ticks;
double ns = ((double)(endTime - startTime)) / ((double)loopCount);
Debug.Log(ns.ToString("F"));
}
This reported ~0.77ns and another version which indexed and assigned the struct fields gave ~0.24ns, FWIW. It appears the array index is cheap compared to the struct stack allocation and copy. Might be interesting to see the performance on a mobile device.
Edit2: Dan Bryant's answer below is why I didn't write a benchmark to begin with, too easy to get wrong.
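Two details of the benchmark above make its numbers hard to interpret: `loopCount` is never defined, and `DateTime.Now.Ticks` counts 100 ns units with a coarse update interval. A minimal sketch of the same comparison using `Stopwatch` instead, assuming a hypothetical `Vec2` struct as a stand-in for `Vector2`. All names here are illustrative, and the results still depend on a Release build running without a debugger:

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for Vector2: two float fields and a constructor.
public struct Vec2
{
    public float x;
    public float y;

    public Vec2(float x, float y)
    {
        this.x = x;
        this.y = y;
    }
}

public static class StructAssignSketch
{
    // Variant 1: index the array once per field.
    public static void AssignFields(Vec2[] array, float x, float y)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i].x = x;
            array[i].y = y;
        }
    }

    // Variant 2: construct a new struct and copy it into the array.
    public static void AssignNew(Vec2[] array, float x, float y)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = new Vec2(x, y);
        }
    }

    // Average time per assigned element in nanoseconds, after one warm-up call.
    public static double NanosecondsPerElement(Action<Vec2[], float, float> assign, Vec2[] array, int repetitions)
    {
        assign(array, 1, 2);
        Stopwatch sw = Stopwatch.StartNew();
        for (int r = 0; r < repetitions; r++)
        {
            assign(array, 1, 2);
        }
        sw.Stop();
        double totalNs = sw.Elapsed.TotalMilliseconds * 1e6;
        return totalNs / ((double)repetitions * array.Length);
    }

    public static void Main()
    {
        var array = new Vec2[100];
        Console.WriteLine("fields:      " + NanosecondsPerElement(AssignFields, array, 1000000).ToString("F2") + " ns");
        Console.WriteLine("constructor: " + NanosecondsPerElement(AssignNew, array, 1000000).ToString("F2") + " ns");
    }
}
```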
I was curious about the first case (field assignment vs. constructor call), so I made a release build and attached post-JIT to see the disassembly. The (x86) code looks like this:
var array = new Vector2[10];
00000000 mov ecx,191372h
00000005 mov edx,0Ah
0000000a call FFF421C4
0000000f mov edx,eax
array[i].x = 1;
00000011 cmp dword ptr [edx+4],0
00000015 jbe 0000003E
00000017 lea eax,[edx+8]
0000001a fld1
0000001c fstp qword ptr [eax]
array[i].y = 1;
0000001e fld1
00000020 fstp qword ptr [edx+10h]
array[i] = new Vector2(1, 1);
00000023 add edx,8
00000026 mov eax,edx
00000028 fld1
0000002a fld1
0000002c fxch st(1)
0000002e fstp qword ptr [eax]
00000030 fstp qword ptr [eax+8]
One thing worth noting is that the 'constructor call' is inlined when using a release build outside the debugger, so, in principle, there should be no difference between setting fields or calling the constructor. That said, the jitter did some interesting things here.
For the 'constructor' version, it used two floating point stack slots and stores them at the same time to the structure memory (fld1, fld1, fstp, fstp.) It also has an fxch (exchange), which is a bit silly since both slots contain constant value 1, but not exactly a high priority optimization target for most applications, I'd assume.
For the 'individual fields' version, it only used one slot on the FPU stack, by splitting up the writes (fld1, fstp, fld1, fstp). I'm not an x86 guru, so I don't know which ordering is more efficient in terms of execution time. Any difference is probably quite minuscule, though, since the primary potential overhead (constructor method call) is inlined out.
Delphi:
procedure TForm1.Button1Click(Sender: TObject);
var I,Tick:Integer;
begin
Tick := GetTickCount();
for I := 0 to 1000000000 do
begin
end;
Button1.Caption := IntToStr(GetTickCount()-Tick)+' ms';
end;
C#:
private void button1_Click(object sender, EventArgs e)
{
int tick = System.Environment.TickCount;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = System.Environment.TickCount - tick;
button1.Text = tick.ToString()+" ms";
}
Delphi gives around 515 ms
C# gives around 3775 ms
Delphi is compiled to native code, whereas C# is compiled to CLR code which is then translated at runtime. That said C# does use JIT compilation, so you might expect the timing to be more similar, but it is not a given.
It would be useful if you could describe the hardware you ran this on (CPU, clock rate).
I do not have access to Delphi to repeat your experiment, but using native C++ vs C# and the following code:
VC++ 2008
#include <iostream>
#include <windows.h>
int main(void)
{
int tick = GetTickCount() ;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = GetTickCount() - tick;
std::cout << tick << " ms" << std::endl ;
}
C#
using System;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
int tick = System.Environment.TickCount;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = System.Environment.TickCount - tick;
Console.Write( tick.ToString() + " ms" ) ;
}
}
}
I initially got:
C++ 2792ms
C# 2980ms
However I then performed a Rebuild on the C# version and ran the executable in <project>\bin\release and <project>\bin\debug respectively directly from the command line. This yielded:
C# (release): 720ms
C# (debug): 3105ms
So I reckon that is where the difference truly lies: you were running the debug version of the C# code from the IDE.
In case you are thinking that C++ is then particularly slow, I ran that as an optimised release build and got:
C++ (Optimised): 0ms
This is not surprising because the loop is empty, and the control variable is not used outside the loop so the optimiser removes it altogether. To avoid that I declared i as a volatile with the following result:
C++ (volatile i): 2932ms
My guess is that the C# implementation also removed the loop and that the 720ms is from something else; this may explain most of the difference between the timings in the first test.
What Delphi is doing I cannot tell, you might look at the generated assembly code to see.
All the above tests on AMD Athlon Dual Core 5000B 2.60GHz, on Windows 7 32bit.
If this is intended as a benchmark, it's an exceptionally bad one, as in both cases the loop can be optimized away, so you have to look at the generated machine code to see what's going on. If you use release mode for C#, the following code
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000000; ++i){ }
sw.Stop();
Console.WriteLine(sw.Elapsed);
is transformed by the JITter to this:
push ebp
mov ebp,esp
push edi
push esi
call 67CDBBB0
mov edi,eax
xor eax,eax ; i = 0
inc eax ; ++i
cmp eax,3B9ACA00h ; i < 1000000000?
jl 0000000E ; true: jump back to inc
mov ecx,edi
cmp dword ptr [ecx],ecx
call 67CDBC10
mov ecx,66DDAEDCh
call FFE8FBE0
mov esi,eax
mov ecx,edi
call 67CD75A8
mov ecx,eax
lea eax,[esi+4]
mov dword ptr [eax],ecx
mov dword ptr [eax+4],edx
call 66A94C90
mov ecx,eax
mov edx,esi
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+3Ch]
call dword ptr [eax+14h]
pop esi
pop edi
pop ebp
ret
TickCount is not a reliable timer; you should use .NET's Stopwatch class. (I don't know what the Delphi equivalent is.)
Also, are you running a Release build?
Do you have a debugger attached?
The Delphi compiler uses the for loop counter downwards (if possible); the above code sample is compiled to:
Unit1.pas. 42: Tick := GetTickCount();
00489367 E8B802F8FF call GetTickCount
0048936C 8BF0 mov esi,eax
Unit1.pas.43: for I := 0 to 1000000000 do
0048936E B801CA9A3B mov eax,$3b9aca01
00489373 48 dec eax
00489374 75FD jnz $00489373
You are comparing native code against VM JITted code, and that is not fair. Native code will ALWAYS be faster since the JITter cannot optimize the code like a native compiler can.
That said, comparing Delphi against C# is not fair at all; a Delphi binary will always win (faster, smaller, without any kind of dependencies, etc).
Btw, I'm sadly amazed how many posters here don't know these differences... or maybe you just hurt some .NET zealots who try to defend C# against anything that shows there are better options out there.
this is the c# disassembly:
DEBUG:
// int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
0000004e 33 D2 xor edx,edx
00000050 89 55 B8 mov dword ptr [ebp-48h],edx
00000053 90 nop
00000054 EB 00 jmp 00000056
00000056 FF 45 B8 inc dword ptr [ebp-48h]
00000059 81 7D B8 00 CA 9A 3B cmp dword ptr [ebp-48h],3B9ACA00h
00000060 0F 95 C0 setne al
00000063 0F B6 C0 movzx eax,al
00000066 89 45 B4 mov dword ptr [ebp-4Ch],eax
00000069 83 7D B4 00 cmp dword ptr [ebp-4Ch],0
0000006d 75 E7 jne 00000056
as you see it is a waste of cpu.
EDIT:
RELEASE:
//unchecked
//{
//int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
00000032 33 D2 xor edx,edx
00000034 89 55 F4 mov dword ptr [ebp-0Ch],edx
00000037 FF 45 F4 inc dword ptr [ebp-0Ch]
0000003a 81 7D F4 00 CA 9A 3B cmp dword ptr [ebp-0Ch],3B9ACA00h
00000041 75 F4 jne 00000037
//}
EDIT:
And this is the C++ version (inline assembly), running about 9x faster on my machine.
__asm
{
PUSH ECX
PUSH EBX
XOR ECX, ECX
MOV EBX, 1000000000
NEXT: INC ECX
CMP ECX, EBX
JS NEXT
POP EBX
POP ECX
}
You should attach a debugger and take a look at the machine code generated by each.
Delphi would almost definitely optimise that loop to execute in reverse order (ie DOWNTO zero rather than FROM zero) - Delphi does this whenever it determines it is "safe" to do, presumably because either subtraction or checking against zero is faster than addition or checking against a non-zero number.
What happens if you try both cases specifying the loops to execute in reverse order?
In Delphi the break condition is calculated only once before the loop procedure begins whereas in C# the break condition is calculated in each loop pass again.
That’s why the looping in Delphi is faster than in C#.
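The C# half of this claim is a language rule and can be observed directly: the condition of a `for` loop is evaluated before every pass. A small sketch that counts the evaluations (`LoopBoundSketch`, `Bound`, and `CountPasses` are hypothetical names). It shows the semantics only, not what they cost once the JIT has compiled a comparison against a constant:

```csharp
using System;

public static class LoopBoundSketch
{
    // Number of times the loop condition asked for the bound.
    public static int BoundCalls;

    static int Bound(int n)
    {
        BoundCalls++;
        return n;
    }

    // Runs a for loop whose upper bound comes from a method call.
    public static int CountPasses(int n)
    {
        BoundCalls = 0;
        int passes = 0;
        for (int i = 0; i < Bound(n); ++i)
        {
            passes++;
        }
        return passes;
    }

    public static void Main()
    {
        int passes = CountPasses(5);
        // prints "5 passes, bound evaluated 6 times"
        Console.WriteLine(passes + " passes, bound evaluated " + BoundCalls + " times");
    }
}
```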
"// int i = 0; while (++i != 1000000000) ;"
That's interesting.
while (++i != x) is not the same as for (; i != x; i++)
The difference is that the while loop doesn't execute the loop for i = 0.
Try it out by running something like this:
int i;
for (i = 0; i < 5; i++)
Console.WriteLine(i);
i = 0;
while (++i != 5)
Console.WriteLine(i);
The following code gives different output when running the release inside Visual Studio, and running the release outside Visual Studio. I'm using Visual Studio 2008 and targeting .NET 3.5. I've also tried .NET 3.5 SP1.
When running outside Visual Studio, the JIT should kick in. Either (a) there's something subtle going on with C# that I'm missing or (b) the JIT is actually in error. I'm doubtful that the JIT can go wrong, but I'm running out of other possibilities...
Output when running inside Visual Studio:
0 0,
0 1,
1 0,
1 1,
Output when running release outside of Visual Studio:
0 2,
0 2,
1 2,
1 2,
What is the reason?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Test
{
struct IntVec
{
public int x;
public int y;
}
interface IDoSomething
{
void Do(IntVec o);
}
class DoSomething : IDoSomething
{
public void Do(IntVec o)
{
Console.WriteLine(o.x.ToString() + " " + o.y.ToString()+",");
}
}
class Program
{
static void Test(IDoSomething oDoesSomething)
{
IntVec oVec = new IntVec();
for (oVec.x = 0; oVec.x < 2; oVec.x++)
{
for (oVec.y = 0; oVec.y < 2; oVec.y++)
{
oDoesSomething.Do(oVec);
}
}
}
static void Main(string[] args)
{
Test(new DoSomething());
Console.ReadLine();
}
}
}
It is a JIT optimizer bug. It is unrolling the inner loop but not updating the oVec.y value properly:
for (oVec.x = 0; oVec.x < 2; oVec.x++) {
0000000a xor esi,esi ; oVec.x = 0
for (oVec.y = 0; oVec.y < 2; oVec.y++) {
0000000c mov edi,2 ; oVec.y = 2, WRONG!
oDoesSomething.Do(oVec);
00000011 push edi
00000012 push esi
00000013 mov ecx,ebx
00000015 call dword ptr ds:[00170210h] ; first unrolled call
0000001b push edi ; WRONG! does not increment oVec.y
0000001c push esi
0000001d mov ecx,ebx
0000001f call dword ptr ds:[00170210h] ; second unrolled call
for (oVec.x = 0; oVec.x < 2; oVec.x++) {
00000025 inc esi
00000026 cmp esi,2
00000029 jl 0000000C
The bug disappears when you let oVec.y increment to 4; that's too many calls to unroll.
One workaround is this:
for (int x = 0; x < 2; x++) {
for (int y = 0; y < 2; y++) {
oDoesSomething.Do(new IntVec { x = x, y = y }); // IntVec as posted has no (x, y) constructor
}
}
UPDATE: re-checked in August 2012; this bug was fixed in the version 4.0.30319 jitter, but it is still present in the v2.0.50727 jitter. It seems unlikely they'll fix it in the old version after this long.
I believe this is a genuine JIT compilation bug. I would report it to Microsoft and see what they say. Interestingly, I found that the x64 JIT does not have the same problem.
Here is my reading of the x86 JIT.
// save context
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
// put oDoesSomething pointer in ebx
00000006 mov ebx,ecx
// zero out edi, this will store oVec.y
00000008 xor edi,edi
// zero out esi, this will store oVec.x
0000000a xor esi,esi
// NOTE: the inner loop is unrolled here.
// set oVec.y to 2
0000000c mov edi,2
// call oDoesSomething.Do(oVec) -- y is always 2!?!
00000011 push edi
00000012 push esi
00000013 mov ecx,ebx
00000015 call dword ptr ds:[002F0010h]
// call oDoesSomething.Do(oVec) -- y is always 2?!?!
0000001b push edi
0000001c push esi
0000001d mov ecx,ebx
0000001f call dword ptr ds:[002F0010h]
// increment oVec.x
00000025 inc esi
// loop back to 0000000C if oVec.x < 2
00000026 cmp esi,2
00000029 jl 0000000C
// restore context and return
0000002b pop ebx
0000002c pop esi
0000002d pop edi
0000002e pop ebp
0000002f ret
This looks like an optimization gone bad to me...
I copied your code into a new Console App.
Debug Build
Correct output with both debugger and no debugger
Switched to Release Build
Again, correct output both times
Created a new x86 configuration (I'm running x64 Windows 2008 and was using 'Any CPU')
Debug Build
Got the correct output both F5 and CTRL+F5
Release Build
Correct output with Debugger attached
No debugger - Got the incorrect output
So it is the x86 JIT incorrectly generating the code. I have deleted my original text about reordering of loops etc. A few other answers on here have confirmed that the JIT is unrolling the loop incorrectly on x86.
To fix the problem you can change the declaration of IntVec to a class and it works in all flavours.
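A sketch of that change, assuming the rest of the program stays as posted. The `Recorder` class is a hypothetical replacement for `DoSomething` that collects the lines instead of printing them, so the output can be inspected:

```csharp
using System;
using System.Collections.Generic;

// IntVec declared as a class instead of a struct.
public class IntVec
{
    public int x;
    public int y;
}

public interface IDoSomething
{
    void Do(IntVec o);
}

// Hypothetical implementation that records each formatted line.
public class Recorder : IDoSomething
{
    public readonly List<string> Lines = new List<string>();

    public void Do(IntVec o)
    {
        Lines.Add(o.x.ToString() + " " + o.y.ToString() + ",");
    }
}

public static class ClassWorkaroundSketch
{
    // Same loop structure as the Test method in the question.
    public static void Test(IDoSomething oDoesSomething)
    {
        IntVec oVec = new IntVec();
        for (oVec.x = 0; oVec.x < 2; oVec.x++)
        {
            for (oVec.y = 0; oVec.y < 2; oVec.y++)
            {
                oDoesSomething.Do(oVec);
            }
        }
    }

    public static void Main()
    {
        var recorder = new Recorder();
        Test(recorder);
        foreach (string line in recorder.Lines)
        {
            Console.WriteLine(line);
        }
    }
}
```

Because `IntVec` is now a reference type, `Do` receives the same instance on every call, so an implementation should read the fields immediately rather than keep the reference.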
Think this needs to go on MS Connect....
-1 to Microsoft!