Run vs. Debug = different results in assembler - C#

I'm learning assembler. I practise with this code:
ASM:
;-------------------------------------------------------------------------
.586
.MODEL flat, stdcall
public srednia_harm
OPTION CASEMAP:NONE
INCLUDE include\windows.inc
INCLUDE include\user32.inc
INCLUDE include\kernel32.inc
.CODE
jeden dd 1.0
DllEntry PROC hInstDLL:HINSTANCE, reason:DWORD, reserved1:DWORD
mov eax, TRUE
ret
DllEntry ENDP
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
srednia_harm PROC
push ebp
mov esp,ebp
push esi
mov esi, [ebp+8] ; address of array
mov ecx, [ebp+12] ; the number of elements
finit
fldz ; the current value of the sum - st(0)=0
mianownik:
fld dword PTR jeden ; ST(0)=1, ST(1)=sum
fld dword PTR [esi] ; load an array element - ST(0)=tab[i], ST(1)=1, ST(2)=sum
fdivp st(1), st(0) ; st(1)=st(1)/st(0) -> ST(0)=1/tab[i], ST(1)=sum
faddp st(1), st(0) ; st(1)=st(0)+st(1) -> st(0)=sum+1/tab[i]
add esi,4
loop mianownik
pop esi
pop ebp
ret
srednia_harm ENDP
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
END DllEntry
DEF:
LIBRARY "biblioteka"
EXPORTS
srednia_harm
C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Runtime.InteropServices;
namespace GUI
{
unsafe class FunkcjeAsemblera // imports of the assembler function
{
[DllImport("bibliotekaASM.dll", CallingConvention = CallingConvention.StdCall)]
private static extern float srednia_harm(float[] table, int n);
public float wywolajTest(float[] table, int n)
{
float wynik = srednia_harm(table, n);
return wynik;
}
}
}
C#:
private void button6_Click(object sender, EventArgs e)
{
FunkcjeAsemblera funkcje = new FunkcjeAsemblera();
int n = 4;
float[] table = new float[n];
for (int i = 0; i < n; i++)
table[i] = 1;
float wynik = funkcje.wywolajTest(table, n);
textBox6.Text = wynik.ToString();
}
When I run this code, everything is fine. The result is 4, as I expected. But I tried to understand the code, so I set a lot of breakpoints in the ASM function. Then the problems started. The array was exactly where it should be in memory, but the second parameter was lost. The address pointed to an empty field in memory. I tried a lot of combinations and changed the types, and it was still the same.
I did some research but didn't find any clues. How is it possible that everything works fine when I run the program, but not when I debug it?

OK, I tested this in Debug and Release mode. I enabled Properties -> Debug -> Enable native code debugging. It works in both cases with Step Into (F11). The 'n' variable is accessed properly.
One problem I noticed is an improper PROC setup. The code above accesses the two parameters relative to EBP but does not clean up the stack (in stdcall, the callee is responsible for cleaning up the stack; see Wikipedia). Note also that the prologue reads mov esp,ebp where it should read mov ebp,esp.
push ebp
mov esp,ebp
push esi
mov esi,dword ptr [ebp+8]
mov ecx,dword ptr [ebp+0Ch]
wait
...
add esi,4
loop 6CC7101F
pop esi
pop ebp
ret <-- two params not cleaned up
For comparison, the following is the code MASM assembles from the PROC heading suggested below:
push ebp
mov ebp,esp
push esi
mov esi,dword ptr [ebp+8]
mov ecx,dword ptr [ebp+0Ch]
wait
...
add esi,4
loop 6CC7101F
pop esi
leave <-- restores EBP
ret 8 <-- two params cleaned up
I suggest changing the PROC to
srednia_harm PROC uses esi lpArr: DWORD, num: DWORD
mov esi, lpArr
mov ecx, num
...
ret
srednia_harm ENDP
Maybe that was the cause of some of the trouble.

Related

Is there a restrict equivalent in C#?

I have the following function (which I cleaned up a bit to make it easier to understand), which takes the destination array, gets the element at index n, adds src1[n] to it, and then multiplies it by src2[n] (nothing too fancy):
static void F(long[] dst, long[] src1, long[] src2, ulong n)
{
dst[n] += src1[n];
dst[n] *= src2[n];
}
Now this generates the following ASM:
<Program>$.<<Main>$>g__F|0_0(Int64[], Int64[], Int64[], UInt64)
L0000: sub rsp, 0x28
L0004: test r9, r9
L0007: jl short L0051
L0009: mov rax, r9
L000c: mov r9d, [rcx+8]
L0010: movsxd r9, r9d
L0013: cmp rax, r9
L0016: jae short L0057
L0018: lea rcx, [rcx+rax*8+0x10]
L001d: mov r9, rcx
L0020: mov r10, [r9]
L0023: mov r11d, [rdx+8]
L0027: movsxd r11, r11d
L002a: cmp rax, r11
L002d: jae short L0057
L002f: add r10, [rdx+rax*8+0x10]
L0034: mov [r9], r10
L0037: mov edx, [r8+8]
L003b: movsxd rdx, edx
L003e: cmp rax, rdx
L0041: jae short L0057
L0043: imul r10, [r8+rax*8+0x10]
L0049: mov [rcx], r10
L004c: add rsp, 0x28
L0050: ret
L0051: call 0x00007ffc9dadb710
L0056: int3
L0057: call 0x00007ffc9dadbc70
L005c: int3
As you can see, it adds a bunch of stuff, and because I can guarantee that n will be within the legal range, I can use pointers:
static unsafe void G(long* dst, long* src1, long* src2, ulong n)
{
dst[n] += src1[n];
dst[n] *= src2[n];
}
Now this generates much simpler ASM:
<Program>$.<<Main>$>g__G|0_1(Int64*, Int64*, Int64*, UInt64)
L0000: lea rax, [rcx+r9*8]
L0004: mov rcx, rax
L0007: mov rdx, [rdx+r9*8]
L000b: add [rcx], rdx
L000e: mov rdx, [rax] ; loads the value again?
L0011: imul rdx, [r8+r9*8]
L0016: mov [rax], rdx
L0019: ret
As you may have noticed, there is an extra MOV there (at least, I can't work out why it's there).
Question
How can I remove that line? In C I could use the restrict keyword, if I'm not wrong. Is there such a keyword in C#? Sadly, I couldn't find anything on the internet.
Note
Here is the SharpLab link.
Here is the C example:
void
f(int64_t *dst,
int64_t *src1,
int64_t *src2,
uint64_t n) {
dst[n] += src1[n];
dst[n] *= src2[n];
}
void
g(int64_t *restrict dst,
int64_t *restrict src1,
int64_t *restrict src2,
uint64_t n) {
dst[n] += src1[n];
dst[n] *= src2[n];
}
this generates:
f:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
mov QWORD PTR [rdx], rax ; this is strange. It loads the value back into [RDX]?
; shouldn't that be the other way around? I don't know.
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
g:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
and here is the Godbolt link.
This:
dst[n] = (dst[n] + src1[n]) * src2[n];
removes that extra mov.
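An equivalent rewrite (my sketch, the same idea in statement form) reads dst[n] into a local exactly once, so nothing forces the JIT to reload it between the add and the multiply:
static unsafe void G(long* dst, long* src1, long* src2, ulong n)
{
    // One load and one store to dst[n]: the intermediate value stays
    // in a register, so potential aliasing cannot force a reload.
    long tmp = dst[n] + src1[n];
    dst[n] = tmp * src2[n];
}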
In C# there is no equivalent of the restrict qualifier from the C language.
In the C# ECMA-334:2017 language specification, chapter 23 (Unsafe Code), there is no syntax to specify that a part of memory must be accessed only through a specific pointer, and no syntax to specify that the memory regions pointed to by pointers do not overlap. Thus there is no such equivalent. This is probably because C# is a managed language: the unsafe syntax that allows working with pointers/unmanaged memory is an edge case in C#, and restrict on pointers would be an edge case of the edge case.

Huge performance difference in byte-array access between x64 and x86

I'm currently doing micro-benchmarks to better understand CLR performance and version issues. The micro-benchmark in question XORs two byte arrays of 64 bytes each.
I always make a reference implementation in safe .NET before I try to beat the .NET Framework implementation with unsafe code and so on.
My reference implementation in question is:
for (int p = 0; p < 64; p++)
a[p] ^= b[p];
where a and b are byte[] a = new byte[64], filled with data from the .NET RNG.
This code runs about twice as fast on x64 as on x86. At first I thought this was fine, because the JIT would turn it into something like *long ^= *long on x64 and *int ^= *int on x86.
But my optimized unsafe version:
fixed (byte* pA = a)
fixed (byte* pB = b)
{
long* ppA = (long*)pA;
long* ppB = (long*)pB;
for (int p = 0; p < 8; p++)
{
*ppA ^= *ppB;
ppA++;
ppB++;
}
}
runs about 4 times faster than the x64 reference implementation. So my assumption about the compiler's *long ^= *long and *int ^= *int optimization was wrong.
Where does this huge performance difference in the reference implementation come from? And now that I've posted the ASM code: why can't the C# compiler optimize the x86 version the same way?
IL code for x86 and x64 reference implementation (they are identical):
IL_0059: ldloc.3
IL_005a: ldloc.s p
IL_005c: ldelema [mscorlib]System.Byte
IL_0061: dup
IL_0062: ldobj [mscorlib]System.Byte
IL_0067: ldloc.s b
IL_0069: ldloc.s p
IL_006b: ldelem.u1
IL_006c: xor
IL_006d: conv.u1
IL_006e: stobj [mscorlib]System.Byte
IL_0073: ldloc.s p
IL_0075: ldc.i4.1
IL_0076: add
IL_0077: stloc.s p
IL_0079: ldloc.s p
IL_007b: ldc.i4.s 64
IL_007d: blt.s IL_0059
I think that ldloc.3 refers to a.
Resulting ASM code for x86:
for (int p = 0; p < 64; p++)
010900DF xor edx,edx
010900E1 mov edi,dword ptr [ebx+4]
a[p] ^= b[p];
010900E4 cmp edx,edi
010900E6 jae 0109010C
010900E8 lea esi,[ebx+edx+8]
010900EC mov eax,dword ptr [ebp-14h]
010900EF cmp edx,dword ptr [eax+4]
010900F2 jae 0109010C
010900F4 movzx eax,byte ptr [eax+edx+8]
010900F9 xor byte ptr [esi],al
for (int p = 0; p < 64; p++)
010900FB inc edx
010900FC cmp edx,40h
010900FF jl 010900E4
Resulting ASM code for x64:
a[p] ^= b[p];
00007FFF4A8B01C6 mov eax,3Eh
00007FFF4A8B01CB cmp rax,rcx
00007FFF4A8B01CE jae 00007FFF4A8B0245
00007FFF4A8B01D0 mov rax,qword ptr [rbx+8]
00007FFF4A8B01D4 mov r9d,3Eh
00007FFF4A8B01DA cmp r9,rax
00007FFF4A8B01DD jae 00007FFF4A8B0245
00007FFF4A8B01DF mov r9d,3Fh
00007FFF4A8B01E5 cmp r9,rcx
00007FFF4A8B01E8 jae 00007FFF4A8B0245
00007FFF4A8B01EA mov ecx,3Fh
00007FFF4A8B01EF cmp rcx,rax
00007FFF4A8B01F2 jae 00007FFF4A8B0245
00007FFF4A8B01F4 nop word ptr [rax+rax]
00007FFF4A8B0200 movzx ecx,byte ptr [rdi+rdx+10h]
00007FFF4A8B0205 movzx eax,byte ptr [rbx+rdx+10h]
00007FFF4A8B020A xor ecx,eax
00007FFF4A8B020C mov byte ptr [rdi+rdx+10h],cl
00007FFF4A8B0210 movzx ecx,byte ptr [rdi+rdx+11h]
00007FFF4A8B0215 movzx eax,byte ptr [rbx+rdx+11h]
00007FFF4A8B021A xor ecx,eax
00007FFF4A8B021C mov byte ptr [rdi+rdx+11h],cl
00007FFF4A8B0220 add rdx,2
for (int p = 0; p < 64; p++)
00007FFF4A8B0224 cmp rdx,40h
00007FFF4A8B0228 jl 00007FFF4A8B0200
You've made a classic mistake: attempting performance analysis on non-optimized code. Here is a complete minimal compilable example:
using System;
namespace SO30558357
{
class Program
{
static void XorArray(byte[] a, byte[] b)
{
for (int p = 0; p < 64; p++)
a[p] ^= b[p];
}
static void Main(string[] args)
{
byte[] a = new byte[64];
byte[] b = new byte[64];
Random r = new Random();
r.NextBytes(a);
r.NextBytes(b);
XorArray(a, b);
Console.ReadLine(); // when the program stops here
// use Debug -> Attach to process
}
}
}
I compiled that using Visual Studio 2013 Update 3, with the default "Release Build" settings for a C# console application (except for the architecture), and ran it on CLR v4.0.30319. Oh, and I think I have Roslyn installed, but that shouldn't replace the JIT, only the translation to MSIL, which is identical on both architectures.
The actual x86 assembly for XorArray:
006F00D8 push ebp
006F00D9 mov ebp,esp
006F00DB push edi
006F00DC push esi
006F00DD push ebx
006F00DE push eax
006F00DF mov dword ptr [ebp-10h],edx
006F00E2 xor edi,edi
006F00E4 mov ebx,dword ptr [ecx+4]
006F00E7 cmp edi,ebx
006F00E9 jae 006F010F
006F00EB lea esi,[ecx+edi+8]
006F00EF movzx eax,byte ptr [esi]
006F00F2 mov edx,dword ptr [ebp-10h]
006F00F5 cmp edi,dword ptr [edx+4]
006F00F8 jae 006F010F
006F00FA movzx edx,byte ptr [edx+edi+8]
006F00FF xor eax,edx
006F0101 mov byte ptr [esi],al
006F0103 inc edi
006F0104 cmp edi,40h
006F0107 jl 006F00E7
006F0109 pop ecx
006F010A pop ebx
006F010B pop esi
006F010C pop edi
006F010D pop ebp
006F010E ret
And for x64:
00007FFD4A3000FB mov rax,qword ptr [rsi+8]
00007FFD4A3000FF mov rax,qword ptr [rbp+8]
00007FFD4A300103 nop word ptr [rax+rax]
00007FFD4A300110 movzx ecx,byte ptr [rsi+rdx+10h]
00007FFD4A300115 movzx eax,byte ptr [rdx+rbp+10h]
00007FFD4A30011A xor ecx,eax
00007FFD4A30011C mov byte ptr [rsi+rdx+10h],cl
00007FFD4A300120 movzx ecx,byte ptr [rsi+rdx+11h]
00007FFD4A300125 movzx eax,byte ptr [rdx+rbp+11h]
00007FFD4A30012A xor ecx,eax
00007FFD4A30012C mov byte ptr [rsi+rdx+11h],cl
00007FFD4A300130 movzx ecx,byte ptr [rsi+rdx+12h]
00007FFD4A300135 movzx eax,byte ptr [rdx+rbp+12h]
00007FFD4A30013A xor ecx,eax
00007FFD4A30013C mov byte ptr [rsi+rdx+12h],cl
00007FFD4A300140 movzx ecx,byte ptr [rsi+rdx+13h]
00007FFD4A300145 movzx eax,byte ptr [rdx+rbp+13h]
00007FFD4A30014A xor ecx,eax
00007FFD4A30014C mov byte ptr [rsi+rdx+13h],cl
00007FFD4A300150 add rdx,4
00007FFD4A300154 cmp rdx,40h
00007FFD4A300158 jl 00007FFD4A300110
Bottom line: the x64 optimizer worked a lot better. While it is still using byte-sized transfers, it unrolled the loop by a factor of 4 and inlined the function call.
Since loop control logic accounts for roughly half the code in the x86 version, the unrolling alone can be expected to yield almost twice the performance.
Inlining allowed the compiler to perform context-sensitive optimization, knowing the size of the arrays and eliminating the runtime bounds check.
If we inline by hand, the x86 compiler now yields:
00A000B1 xor edi,edi
00A000B3 mov eax,dword ptr [ebp-10h]
00A000B6 mov ebx,dword ptr [eax+4]
a[p] ^= b[p];
00A000B9 mov eax,dword ptr [ebp-10h]
00A000BC cmp edi,ebx
00A000BE jae 00A000F5
00A000C0 lea esi,[eax+edi+8]
00A000C4 movzx eax,byte ptr [esi]
00A000C7 mov edx,dword ptr [ebp-14h]
00A000CA cmp edi,dword ptr [edx+4]
00A000CD jae 00A000F5
00A000CF movzx edx,byte ptr [edx+edi+8]
00A000D4 xor eax,edx
00A000D6 mov byte ptr [esi],al
for (int p = 0; p < 64; p++)
00A000D8 inc edi
00A000D9 cmp edi,40h
00A000DC jl 00A000B9
That didn't help much: the loop is still not unrolled and the runtime bounds checking is still there.
Notably, the x86 compiler found a register (EBX) to cache the length of one array, but it ran out of registers and was forced to access the other array's length from memory on every iteration. This should be a "cheap" L1 cache access, but it's still slower than a register access, and much slower than no bounds check at all.
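As an aside (my addition, a commonly cited JIT pattern rather than part of the original answer): using a.Length itself as the loop bound lets the JIT prove the index is always in range and elide the bounds check on a:
static void XorArray(byte[] a, byte[] b)
{
    // With p < a.Length as the condition, the JIT can drop the bounds
    // check on a[p]; b[p] keeps its check unless b.Length is validated
    // against a.Length before the loop.
    for (int p = 0; p < a.Length; p++)
        a[p] ^= b[p];
}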

When to use short?

Say I'm looping through 20 or 30 objects, or any other case where I'm dealing with small numbers: is it good practice to use short instead of int?
I mean why isn't this common:
for(short i=0; i<x; i++)
Method(array[i]);
Is it because the performance gain is too low?
Thanks
"is it a good practice to use short instead of int?"
First of all, this is a micro-optimization that will not achieve what you expect: an increase in speed or efficiency.
Second: no, not really. The CLR internally still uses 32-bit integers (Int32) to perform the iteration; it basically converts the short to an Int32 for computation purposes during JIT compilation.
Third: array indexes are Int32, and the iterating short variable is automatically converted to Int32 when used as an array indexer.
If we take the following code:
var array = new object[32];
var x = array.Length;
for (short i = 0; i < x; i++)
Method(array[i]);
And disassemble it, you can clearly see at 00000089 inc eax that at the machine level a 32-bit register (eax) is used for the iterating variable, which is then truncated to 16 bits at 0000008a movsx eax,ax. So there is no benefit to using a short as opposed to an Int32; in fact, there may be a slight performance loss due to the extra instructions that have to be executed.
00000042 nop
var array = new object[32];
00000043 mov ecx,64B41812h
00000048 mov edx,20h
0000004d call FFBC01A4
00000052 mov dword ptr [ebp-50h],eax
00000055 mov eax,dword ptr [ebp-50h]
00000058 mov dword ptr [ebp-40h],eax
var x = array.Length;
0000005b mov eax,dword ptr [ebp-40h]
0000005e mov eax,dword ptr [eax+4]
00000061 mov dword ptr [ebp-44h],eax
for (short i = 0; i < x; i++)
00000064 xor edx,edx
00000066 mov dword ptr [ebp-48h],edx
00000069 nop
0000006a jmp 00000090
Method(array[i]);
0000006c mov eax,dword ptr [ebp-48h]
0000006f mov edx,dword ptr [ebp-40h]
00000072 cmp eax,dword ptr [edx+4]
00000075 jb 0000007C
00000077 call 657A28F6
0000007c mov ecx,dword ptr [edx+eax*4+0Ch]
00000080 call FFD9A708
00000085 nop
for (short i = 0; i < x; i++)
00000086 mov eax,dword ptr [ebp-48h]
00000089 inc eax
0000008a movsx eax,ax
0000008d mov dword ptr [ebp-48h],eax
00000090 mov eax,dword ptr [ebp-48h]
00000093 cmp eax,dword ptr [ebp-44h]
00000096 setl al
00000099 movzx eax,al
0000009c mov dword ptr [ebp-4Ch],eax
0000009f cmp dword ptr [ebp-4Ch],0
000000a3 jne 0000006C
Yes, the performance difference is negligible. However, a short uses 16 bits instead of the 32 an int uses, so it's conceivable that you might want a short if you're processing enough small numbers.
In general, using numbers that match the word size of the processor is faster than using numbers that don't. On the other hand, a short uses less memory than an int.
If memory is limited, short may be the alternative, but personally I have never run into such a situation when writing C# applications.
An int uses 32 bits of memory, a short uses 16 bits and a byte uses 8 bits. If you're only looping through 20/30 objects and you're concerned about memory usage, use byte instead.
Catering for memory usage to this level is rarely required with today's machines, though you could argue that using int everywhere is just lazy. Personally, I try to always use the relevant type that uses the least memory.
http://msdn.microsoft.com/en-us/library/5bdb6693(v=vs.100).aspx
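To make the memory trade-off concrete (my illustration, not from the answers above), the savings only materialize in bulk storage, never in a loop counter:
// Roughly 2 MB versus 4 MB for a million elements: bulk storage is
// where short actually pays off; a short loop counter saves nothing.
short[] samples = new short[1000000];
int[] wide = new int[1000000];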

Why is looping in Delphi faster than in C#?

Delphi:
procedure TForm1.Button1Click(Sender: TObject);
var I,Tick:Integer;
begin
Tick := GetTickCount();
for I := 0 to 1000000000 do
begin
end;
Button1.Caption := IntToStr(GetTickCount()-Tick)+' ms';
end;
C#:
private void button1_Click(object sender, EventArgs e)
{
int tick = System.Environment.TickCount;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = System.Environment.TickCount - tick;
button1.Text = tick.ToString()+" ms";
}
Delphi gives around 515 ms
C# gives around 3775 ms
Delphi is compiled to native code, whereas C# is compiled to CLR bytecode, which is then translated at runtime. That said, C# does use JIT compilation, so you might expect the timings to be more similar, but it is not a given.
It would be useful if you could describe the hardware you ran this on (CPU, clock rate).
I do not have access to Delphi to repeat your experiment, but using native C++ vs C# and the following code:
VC++ 2008
#include <iostream>
#include <windows.h>
int main(void)
{
int tick = GetTickCount() ;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = GetTickCount() - tick;
std::cout << tick << " ms" << std::endl ;
}
C#
using System;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
int tick = System.Environment.TickCount;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = System.Environment.TickCount - tick;
Console.Write( tick.ToString() + " ms" ) ;
}
}
}
I initially got:
C++ 2792ms
C# 2980ms
However, I then performed a Rebuild of the C# version and ran the executables in <project>\bin\release and <project>\bin\debug respectively, directly from the command line. This yielded:
C# (release): 720ms
C# (debug): 3105ms
So I reckon that is where the difference truly lies: you were running the debug version of the C# code from the IDE.
In case you are thinking that C++ is then particularly slow, I ran that as an optimised release build and got:
C++ (Optimised): 0ms
This is not surprising, because the loop is empty and the control variable is not used outside the loop, so the optimiser removes it altogether. To avoid that, I declared i as volatile, with the following result:
C++ (volatile i): 2932ms
My guess is that the C# implementation also removed the loop and that the 720ms is from something else; this may explain most of the difference between the timings in the first test.
What Delphi is doing I cannot tell; you might look at the generated assembly code to see.
All the above tests on AMD Athlon Dual Core 5000B 2.60GHz, on Windows 7 32bit.
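If you want the same safeguard on the C# side (my sketch: C# has no volatile locals, but writing to a volatile static field likewise keeps the loop alive):
class LoopGuard
{
    // Every write to a volatile field is a memory operation the JIT
    // must preserve, so the loop body cannot be optimized away.
    static volatile int sink;

    static void Run()
    {
        for (int i = 0; i < 1000000000; ++i)
            sink = i;
    }
}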
If this is intended as a benchmark, it's an exceptionally bad one, as in both cases the loop can be optimized away; you have to look at the generated machine code to see what's going on. If you use release mode for C#, the following code
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000000; ++i){ }
sw.Stop();
Console.WriteLine(sw.Elapsed);
is transformed by the JITter to this:
push ebp
mov ebp,esp
push edi
push esi
call 67CDBBB0
mov edi,eax
xor eax,eax ; i = 0
inc eax ; ++i
cmp eax,3B9ACA00h ; i == 1000000000?
jl 0000000E ; false: jmp
mov ecx,edi
cmp dword ptr [ecx],ecx
call 67CDBC10
mov ecx,66DDAEDCh
call FFE8FBE0
mov esi,eax
mov ecx,edi
call 67CD75A8
mov ecx,eax
lea eax,[esi+4]
mov dword ptr [eax],ecx
mov dword ptr [eax+4],edx
call 66A94C90
mov ecx,eax
mov edx,esi
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+3Ch]
call dword ptr [eax+14h]
pop esi
pop edi
pop ebp
ret
TickCount is not a reliable timer; you should use .NET's Stopwatch class. (I don't know what the Delphi equivalent is.)
Also, are you running a Release build?
Do you have a debugger attached?
The Delphi compiler counts the for loop downwards (when possible); the code sample above compiles to:
Unit1.pas.42: Tick := GetTickCount();
00489367 E8B802F8FF call GetTickCount
0048936C 8BF0 mov esi,eax
Unit1.pas.43: for I := 0 to 1000000000 do
0048936E B801CA9A3B mov eax,$3b9aca01
00489373 48 dec eax
00489374 75FD jnz $00489373
You are comparing native code against VM-JITted code, and that is not fair. Native code will ALWAYS be faster, since the JITter cannot optimize the code the way a native compiler can.
That said, comparing Delphi against C# is not fair at all; a Delphi binary will always win (faster, smaller, without any kind of dependencies, etc.).
By the way, I'm sadly amazed at how many posters here don't know these differences... or maybe I just hurt some .NET zealots who try to defend C# against anything that shows there are better options out there.
This is the C# disassembly:
DEBUG:
// int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
0000004e 33 D2 xor edx,edx
00000050 89 55 B8 mov dword ptr [ebp-48h],edx
00000053 90 nop
00000054 EB 00 jmp 00000056
00000056 FF 45 B8 inc dword ptr [ebp-48h]
00000059 81 7D B8 00 CA 9A 3B cmp dword ptr [ebp-48h],3B9ACA00h
00000060 0F 95 C0 setne al
00000063 0F B6 C0 movzx eax,al
00000066 89 45 B4 mov dword ptr [ebp-4Ch],eax
00000069 83 7D B4 00 cmp dword ptr [ebp-4Ch],0
0000006d 75 E7 jne 00000056
As you can see, it is a waste of CPU.
EDIT:
RELEASE:
//unchecked
//{
//int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
00000032 33 D2 xor edx,edx
00000034 89 55 F4 mov dword ptr [ebp-0Ch],edx
00000037 FF 45 F4 inc dword ptr [ebp-0Ch]
0000003a 81 7D F4 00 CA 9A 3B cmp dword ptr [ebp-0Ch],3B9ACA00h
00000041 75 F4 jne 00000037
//}
EDIT:
And this is the C++ version, running about 9x faster on my machine:
__asm
{
PUSH ECX
PUSH EBX
XOR ECX, ECX
MOV EBX, 1000000000
NEXT: INC ECX
CMP ECX, EBX
JS NEXT
POP EBX
POP ECX
}
You should attach a debugger and take a look at the machine code generated by each.
Delphi would almost certainly optimise that loop to execute in reverse order (i.e. DOWNTO zero rather than FROM zero); Delphi does this whenever it determines it is "safe" to do so, presumably because subtraction or checking against zero is faster than addition or checking against a non-zero number.
What happens if you try both cases specifying the loops to execute in reverse order?
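For reference, a down-counting loop in C# would look like this (my sketch; whether the JIT then emits the compact dec/jnz form seen in the Delphi listing above is not guaranteed):
// Counting down lets the termination test compare against zero.
for (int i = 1000000000; i != 0; --i)
{
}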
In Delphi, the break condition is calculated only once, before the loop begins, whereas in C# the break condition is calculated again on each pass.
That's why looping in Delphi is faster than in C#.
"// int i = 0; while (++i != 1000000000) ;"
That's interesting.
while (++i != x) is not the same as for (; i != x; i++).
The difference is that the while loop never executes its body for i = 0.
Try it out; run something like this:
int i;
for (i = 0; i < 5; i++)
Console.WriteLine(i); // prints 0 1 2 3 4
i = 0;
while (++i != 5)
Console.WriteLine(i); // prints 1 2 3 4

.NET 3.5 JIT not working when running the application

The following code gives different output when the release build is run inside Visual Studio versus outside of it. I'm using Visual Studio 2008 and targeting .NET 3.5. I've also tried .NET 3.5 SP1.
When running outside Visual Studio, the JIT should kick in. Either (a) there's something subtle going on with C# that I'm missing, or (b) the JIT is actually in error. I'm doubtful that the JIT can go wrong, but I'm running out of other possibilities...
Output when running inside Visual Studio:
0 0,
0 1,
1 0,
1 1,
Output when running release outside of Visual Studio:
0 2,
0 2,
1 2,
1 2,
What is the reason?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Test
{
struct IntVec
{
public int x;
public int y;
}
interface IDoSomething
{
void Do(IntVec o);
}
class DoSomething : IDoSomething
{
public void Do(IntVec o)
{
Console.WriteLine(o.x.ToString() + " " + o.y.ToString()+",");
}
}
class Program
{
static void Test(IDoSomething oDoesSomething)
{
IntVec oVec = new IntVec();
for (oVec.x = 0; oVec.x < 2; oVec.x++)
{
for (oVec.y = 0; oVec.y < 2; oVec.y++)
{
oDoesSomething.Do(oVec);
}
}
}
static void Main(string[] args)
{
Test(new DoSomething());
Console.ReadLine();
}
}
}
It is a JIT optimizer bug. It is unrolling the inner loop but not updating the oVec.y value properly:
for (oVec.x = 0; oVec.x < 2; oVec.x++) {
0000000a xor esi,esi ; oVec.x = 0
for (oVec.y = 0; oVec.y < 2; oVec.y++) {
0000000c mov edi,2 ; oVec.y = 2, WRONG!
oDoesSomething.Do(oVec);
00000011 push edi
00000012 push esi
00000013 mov ecx,ebx
00000015 call dword ptr ds:[00170210h] ; first unrolled call
0000001b push edi ; WRONG! does not increment oVec.y
0000001c push esi
0000001d mov ecx,ebx
0000001f call dword ptr ds:[00170210h] ; second unrolled call
for (oVec.x = 0; oVec.x < 2; oVec.x++) {
00000025 inc esi
00000026 cmp esi,2
00000029 jl 0000000C
The bug disappears when you let oVec.y increment to 4; that's too many calls to unroll.
One workaround is this:
for (int x = 0; x < 2; x++) {
for (int y = 0; y < 2; y++) {
oDoesSomething.Do(new IntVec(x, y));
}
}
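Note that this workaround assumes IntVec has a constructor, which the struct in the question lacks; a minimal addition:
struct IntVec
{
    public int x;
    public int y;

    // Constructor needed for the workaround above.
    public IntVec(int x, int y)
    {
        this.x = x;
        this.y = y;
    }
}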
UPDATE: re-checked in August 2012; this bug was fixed in the 4.0.30319 jitter but is still present in the v2.0.50727 jitter. It seems unlikely they'll fix the old version after this long.
I believe this is a genuine JIT compilation bug. I would report it to Microsoft and see what they say. Interestingly, I found that the x64 JIT does not have the same problem.
Here is my reading of the x86 JIT.
// save context
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
// put oDoesSomething pointer in ebx
00000006 mov ebx,ecx
// zero out edi, this will store oVec.y
00000008 xor edi,edi
// zero out esi, this will store oVec.x
0000000a xor esi,esi
// NOTE: the inner loop is unrolled here.
// set oVec.y to 2
0000000c mov edi,2
// call oDoesSomething.Do(oVec) -- y is always 2!?!
00000011 push edi
00000012 push esi
00000013 mov ecx,ebx
00000015 call dword ptr ds:[002F0010h]
// call oDoesSomething.Do(oVec) -- y is always 2?!?!
0000001b push edi
0000001c push esi
0000001d mov ecx,ebx
0000001f call dword ptr ds:[002F0010h]
// increment oVec.x
00000025 inc esi
// loop back to 0000000C if oVec.x < 2
00000026 cmp esi,2
00000029 jl 0000000C
// restore context and return
0000002b pop ebx
0000002c pop esi
0000002d pop edi
0000002e pop ebp
0000002f ret
This looks like an optimization gone bad to me...
I copied your code into a new Console App.
Debug Build
Correct output with both debugger and no debugger
Switched to Release Build
Again, correct output both times
Created a new x86 configuration (I'm running x64 Windows 2008 and was using 'Any CPU')
Debug Build
Got the correct output with both F5 and Ctrl+F5
Release Build
Correct output with Debugger attached
No debugger - Got the incorrect output
So it is the x86 JIT incorrectly generating the code. I have deleted my original text about reordering of loops etc.; a few other answers here have confirmed that the JIT is unrolling the loop incorrectly on x86.
To fix the problem you can change the declaration of IntVec to a class and it works in all flavours.
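That change would look like this (my sketch; as a class, IntVec is passed by reference, sidestepping the struct-copying path that the x86 JIT mishandles):
class IntVec
{
    public int x;
    public int y;
}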
Think this needs to go on MS Connect....
-1 to Microsoft!
