I have the following function (cleaned up a bit to make it easier to understand), which takes the destination array, gets the element at index n, adds src1[n] to it, and then multiplies it by src2[n] (nothing too fancy):
static void F(long[] dst, long[] src1, long[] src2, ulong n)
{
    dst[n] += src1[n];
    dst[n] *= src2[n];
}
Now this generates the following ASM:
<Program>$.<<Main>$>g__F|0_0(Int64[], Int64[], Int64[], UInt64)
L0000: sub rsp, 0x28
L0004: test r9, r9
L0007: jl short L0051
L0009: mov rax, r9
L000c: mov r9d, [rcx+8]
L0010: movsxd r9, r9d
L0013: cmp rax, r9
L0016: jae short L0057
L0018: lea rcx, [rcx+rax*8+0x10]
L001d: mov r9, rcx
L0020: mov r10, [r9]
L0023: mov r11d, [rdx+8]
L0027: movsxd r11, r11d
L002a: cmp rax, r11
L002d: jae short L0057
L002f: add r10, [rdx+rax*8+0x10]
L0034: mov [r9], r10
L0037: mov edx, [r8+8]
L003b: movsxd rdx, edx
L003e: cmp rax, rdx
L0041: jae short L0057
L0043: imul r10, [r8+rax*8+0x10]
L0049: mov [rcx], r10
L004c: add rsp, 0x28
L0050: ret
L0051: call 0x00007ffc9dadb710
L0056: int3
L0057: call 0x00007ffc9dadbc70
L005c: int3
As you can see, it adds a bunch of stuff (the range checks). Because I can guarantee that n will be within the legal range, I can use pointers instead:
static unsafe void G(long* dst, long* src1, long* src2, ulong n)
{
    dst[n] += src1[n];
    dst[n] *= src2[n];
}
Now this generates much simpler ASM:
<Program>$.<<Main>$>g__G|0_1(Int64*, Int64*, Int64*, UInt64)
L0000: lea rax, [rcx+r9*8]
L0004: mov rcx, rax
L0007: mov rdx, [rdx+r9*8]
L000b: add [rcx], rdx
L000e: mov rdx, [rax] ; loads the value again?
L0011: imul rdx, [r8+r9*8]
L0016: mov [rax], rdx
L0019: ret
As you may have noticed, there is an extra MOV there (I think; at least, I can't work out why it's there).
Question
How can I remove that line? In C I could use the restrict keyword, if I'm not mistaken. Is there such a keyword in C#? Sadly, I couldn't find anything on the internet.
Note
Here is the SharpLab link.
Here is the C example:
#include <stdint.h>

void f(int64_t *dst, int64_t *src1, int64_t *src2, uint64_t n) {
    dst[n] += src1[n];
    dst[n] *= src2[n];
}

void g(int64_t *restrict dst, int64_t *restrict src1,
       int64_t *restrict src2, uint64_t n) {
    dst[n] += src1[n];
    dst[n] *= src2[n];
}
this generates:
f:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
mov QWORD PTR [rdx], rax ; this is strange. It stores the value back to [RDX] here
                         ; and then stores again below? I don't know.
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
g:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
and here is the Godbolt link.
This:
dst[n] = (dst[n] + src1[n]) * src2[n];
removes that extra mov.
In C# there is no equivalent of the restrict qualifier from the C language.
In the C# ECMA-334:2017 language specification, chapter 23 (Unsafe Code), there is no syntax to specify that a region of memory must be accessed only through a specific pointer, and no syntax to specify that the memory regions pointed to by two pointers do not overlap. Thus there is no such equivalent. This is probably because C# is a managed language: the unsafe syntax that allows working with pointers/unmanaged memory is already an edge case in C#, and restrict on pointers would be an edge case of that edge case.
I'm currently doing micro-benchmarks for a better understanding of CLR performance and version issues. The micro-benchmark in question XORs two 64-byte arrays together.
I always write a reference implementation in safe .NET before I try to beat the .NET Framework implementation with unsafe code and so on.
My reference implementation in question is:
for (int p = 0; p < 64; p++)
    a[p] ^= b[p];
where a and b are declared as byte[] a = new byte[64] and filled with data from the .NET RNG.
This code runs twice as fast on x64 as on x86. At first I thought this was OK, because the JIT would turn it into something like *long ^= *long on x64 and *int ^= *int on x86.
But my optimized unsafe-version:
fixed (byte* pA = a)
fixed (byte* pB = b)
{
    long* ppA = (long*)pA;
    long* ppB = (long*)pB;
    for (int p = 0; p < 8; p++)
    {
        *ppA ^= *ppB;
        ppA++;
        ppB++;
    }
}
runs about 4 times faster than the x64 reference implementation. So my assumption about the *long ^= *long and *int ^= *int optimization was wrong.
Where does this huge performance difference in the reference implementation come from? And now that I've posted the ASM code: why can't the JIT also optimize the x86 version this way?
IL code for x86 and x64 reference implementation (they are identical):
IL_0059: ldloc.3
IL_005a: ldloc.s p
IL_005c: ldelema [mscorlib]System.Byte
IL_0061: dup
IL_0062: ldobj [mscorlib]System.Byte
IL_0067: ldloc.s b
IL_0069: ldloc.s p
IL_006b: ldelem.u1
IL_006c: xor
IL_006d: conv.u1
IL_006e: stobj [mscorlib]System.Byte
IL_0073: ldloc.s p
IL_0075: ldc.i4.1
IL_0076: add
IL_0077: stloc.s p
IL_0079: ldloc.s p
IL_007b: ldc.i4.s 64
IL_007d: blt.s IL_0059
I think that ldloc.3 loads a.
Resulting ASM code for x86:
for (int p = 0; p < 64; p++)
010900DF xor edx,edx
010900E1 mov edi,dword ptr [ebx+4]
a[p] ^= b[p];
010900E4 cmp edx,edi
010900E6 jae 0109010C
010900E8 lea esi,[ebx+edx+8]
010900EC mov eax,dword ptr [ebp-14h]
010900EF cmp edx,dword ptr [eax+4]
010900F2 jae 0109010C
010900F4 movzx eax,byte ptr [eax+edx+8]
010900F9 xor byte ptr [esi],al
for (int p = 0; p < 64; p++)
010900FB inc edx
010900FC cmp edx,40h
010900FF jl 010900E4
Resulting ASM code for x64:
a[p] ^= b[p];
00007FFF4A8B01C6 mov eax,3Eh
00007FFF4A8B01CB cmp rax,rcx
00007FFF4A8B01CE jae 00007FFF4A8B0245
00007FFF4A8B01D0 mov rax,qword ptr [rbx+8]
00007FFF4A8B01D4 mov r9d,3Eh
00007FFF4A8B01DA cmp r9,rax
00007FFF4A8B01DD jae 00007FFF4A8B0245
00007FFF4A8B01DF mov r9d,3Fh
00007FFF4A8B01E5 cmp r9,rcx
00007FFF4A8B01E8 jae 00007FFF4A8B0245
00007FFF4A8B01EA mov ecx,3Fh
00007FFF4A8B01EF cmp rcx,rax
00007FFF4A8B01F2 jae 00007FFF4A8B0245
00007FFF4A8B01F4 nop word ptr [rax+rax]
00007FFF4A8B0200 movzx ecx,byte ptr [rdi+rdx+10h]
00007FFF4A8B0205 movzx eax,byte ptr [rbx+rdx+10h]
00007FFF4A8B020A xor ecx,eax
00007FFF4A8B020C mov byte ptr [rdi+rdx+10h],cl
00007FFF4A8B0210 movzx ecx,byte ptr [rdi+rdx+11h]
00007FFF4A8B0215 movzx eax,byte ptr [rbx+rdx+11h]
00007FFF4A8B021A xor ecx,eax
00007FFF4A8B021C mov byte ptr [rdi+rdx+11h],cl
00007FFF4A8B0220 add rdx,2
for (int p = 0; p < 64; p++)
00007FFF4A8B0224 cmp rdx,40h
00007FFF4A8B0228 jl 00007FFF4A8B0200
You've made a classic mistake, attempting performance analysis on non-optimized code. Here is a complete minimal compilable example:
using System;

namespace SO30558357
{
    class Program
    {
        static void XorArray(byte[] a, byte[] b)
        {
            for (int p = 0; p < 64; p++)
                a[p] ^= b[p];
        }

        static void Main(string[] args)
        {
            byte[] a = new byte[64];
            byte[] b = new byte[64];
            Random r = new Random();
            r.NextBytes(a);
            r.NextBytes(b);
            XorArray(a, b);
            Console.ReadLine(); // when the program stops here
                                // use Debug -> Attach to process
        }
    }
}
I compiled that using Visual Studio 2013 Update 3, with the default "Release Build" settings for a C# console application except for the architecture, and ran it with CLR v4.0.30319. Oh, I think I have Roslyn installed, but that shouldn't replace the JIT, only the translation to MSIL, which is identical on both architectures.
The actual x86 assembly for XorArray:
006F00D8 push ebp
006F00D9 mov ebp,esp
006F00DB push edi
006F00DC push esi
006F00DD push ebx
006F00DE push eax
006F00DF mov dword ptr [ebp-10h],edx
006F00E2 xor edi,edi
006F00E4 mov ebx,dword ptr [ecx+4]
006F00E7 cmp edi,ebx
006F00E9 jae 006F010F
006F00EB lea esi,[ecx+edi+8]
006F00EF movzx eax,byte ptr [esi]
006F00F2 mov edx,dword ptr [ebp-10h]
006F00F5 cmp edi,dword ptr [edx+4]
006F00F8 jae 006F010F
006F00FA movzx edx,byte ptr [edx+edi+8]
006F00FF xor eax,edx
006F0101 mov byte ptr [esi],al
006F0103 inc edi
006F0104 cmp edi,40h
006F0107 jl 006F00E7
006F0109 pop ecx
006F010A pop ebx
006F010B pop esi
006F010C pop edi
006F010D pop ebp
006F010E ret
And for x64:
00007FFD4A3000FB mov rax,qword ptr [rsi+8]
00007FFD4A3000FF mov rax,qword ptr [rbp+8]
00007FFD4A300103 nop word ptr [rax+rax]
00007FFD4A300110 movzx ecx,byte ptr [rsi+rdx+10h]
00007FFD4A300115 movzx eax,byte ptr [rdx+rbp+10h]
00007FFD4A30011A xor ecx,eax
00007FFD4A30011C mov byte ptr [rsi+rdx+10h],cl
00007FFD4A300120 movzx ecx,byte ptr [rsi+rdx+11h]
00007FFD4A300125 movzx eax,byte ptr [rdx+rbp+11h]
00007FFD4A30012A xor ecx,eax
00007FFD4A30012C mov byte ptr [rsi+rdx+11h],cl
00007FFD4A300130 movzx ecx,byte ptr [rsi+rdx+12h]
00007FFD4A300135 movzx eax,byte ptr [rdx+rbp+12h]
00007FFD4A30013A xor ecx,eax
00007FFD4A30013C mov byte ptr [rsi+rdx+12h],cl
00007FFD4A300140 movzx ecx,byte ptr [rsi+rdx+13h]
00007FFD4A300145 movzx eax,byte ptr [rdx+rbp+13h]
00007FFD4A30014A xor ecx,eax
00007FFD4A30014C mov byte ptr [rsi+rdx+13h],cl
00007FFD4A300150 add rdx,4
00007FFD4A300154 cmp rdx,40h
00007FFD4A300158 jl 00007FFD4A300110
Bottom line: the x64 optimizer worked a lot better. While it still uses byte-sized transfers, it unrolled the loop by a factor of 4 and inlined the function call.
Since in the x86 version, loop control logic corresponds to roughly half the code, the unrolling can be expected to yield almost twice the performance.
Inlining allowed the compiler to perform context-sensitive optimization, knowing the size of the arrays and eliminating the runtime bounds check.
If we inline by hand, the x86 compiler now yields:
00A000B1 xor edi,edi
00A000B3 mov eax,dword ptr [ebp-10h]
00A000B6 mov ebx,dword ptr [eax+4]
a[p] ^= b[p];
00A000B9 mov eax,dword ptr [ebp-10h]
00A000BC cmp edi,ebx
00A000BE jae 00A000F5
00A000C0 lea esi,[eax+edi+8]
00A000C4 movzx eax,byte ptr [esi]
00A000C7 mov edx,dword ptr [ebp-14h]
00A000CA cmp edi,dword ptr [edx+4]
00A000CD jae 00A000F5
00A000CF movzx edx,byte ptr [edx+edi+8]
00A000D4 xor eax,edx
00A000D6 mov byte ptr [esi],al
for (int p = 0; p< 64; p++)
00A000D8 inc edi
00A000D9 cmp edi,40h
00A000DC jl 00A000B9
That didn't help much: the loop still isn't unrolled and the runtime bounds checking is still there.
Notably, the x86 compiler found a register (EBX) to cache the length of one array, but ran out of registers and was forced to fetch the other array's length from memory on every iteration. This should be a "cheap" L1 cache access, but it's still slower than a register access, and much slower than no bounds check at all.
I've recently been testing the performance of the for loop vs the foreach loop in C#, and I've noticed that for summing an array of ints into a long, the foreach loop may come out actually faster. Here is the full test program, I've used Visual Studio 2012, x86, release mode, optimizations on.
Here is the assembly code for both loops. The foreach:
long sum = 0;
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 xor ebx,ebx
00000008 xor edi,edi
foreach (var i in collection) {
0000000a xor esi,esi
0000000c cmp dword ptr [ecx+4],0
00000010 jle 00000025
00000012 mov eax,dword ptr [ecx+esi*4+8]
sum += i;
00000016 mov edx,eax
00000018 sar edx,1Fh
0000001b add ebx,eax
0000001d adc edi,edx
0000001f inc esi
foreach (var i in collection) {
00000020 cmp dword ptr [ecx+4],esi
00000023 jg 00000012
}
return sum;
00000025 mov eax,ebx
00000027 mov edx,edi
00000029 pop ebx
0000002a pop esi
0000002b pop edi
0000002c pop ebp
0000002d ret
And the for:
long sum = 0;
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 push eax
00000007 xor ebx,ebx
00000009 xor edi,edi
for (int i = 0; i < collection.Length; ++i) {
0000000b xor esi,esi
0000000d mov eax,dword ptr [ecx+4]
00000010 mov dword ptr [ebp-10h],eax
00000013 test eax,eax
00000015 jle 0000002A
sum += collection[i];
00000017 mov eax,dword ptr [ecx+esi*4+8]
0000001b cdq
0000001c add eax,ebx
0000001e adc edx,edi
00000020 mov ebx,eax
00000022 mov edi,edx
for (int i = 0; i < collection.Length; ++i) {
00000024 inc esi
00000025 cmp dword ptr [ebp-10h],esi
00000028 jg 00000017
}
return sum;
0000002a mov eax,ebx
0000002c mov edx,edi
0000002e pop ecx
0000002f pop ebx
00000030 pop esi
00000031 pop edi
00000032 pop ebp
00000033 ret
As you can see, the main loop is 7 instructions for "foreach" and 9 instructions for "for". This translates into approximately a 10% performance difference in my benchmarks.
I'm not very good at reading assembly code however and I don't understand why the for loop wouldn't be at least as efficient as the foreach. What is going on here?
As the array is so big, the only relevant part is clearly the one inside the loop:
// for loop
00000017 mov eax,dword ptr [ecx+esi*4+8]
0000001b cdq
0000001c add eax,ebx
0000001e adc edx,edi
00000020 mov ebx,eax
00000022 mov edi,edx
// foreach loop
00000012 mov eax,dword ptr [ecx+esi*4+8]
00000016 mov edx,eax
00000018 sar edx,1Fh
0000001b add ebx,eax
0000001d adc edi,edx
Since the sum is a long it is stored in two different registers: ebx contains its least significant four bytes and edi the most significant four. The two versions differ in how collection[i] is (implicitly) cast from int to long:
// for loop
0000001b cdq
// foreach loop
00000016 mov edx,eax
00000018 sar edx,1Fh
Another important thing to notice is that the for-loop version does the sum in "reversed" order:
long temp = (long) collection[i]; // implicit cast, stored in edx:eax
temp += sum; // instead of "simply" sum += temp
sum = temp; // sum is stored back into edi:ebx
I can't tell you why the compiler preferred this way instead of sum += temp (@EricLippert could maybe tell us :) ) but I suspect that it is related to some instruction dependency issues that might arise.
OK, so here's an annotated version of the assembly code; as you will see, the instructions in the two loops are very close.
foreach (var i in collection) {
0000000a xor esi,esi clear index
0000000c cmp dword ptr [ecx+4],0 get size of collection
00000010 jle 00000025 exit if empty
00000012 mov eax,dword ptr [ecx+esi*4+8] get item from collection
sum += i;
00000016 mov edx,eax move to edx:eax
00000018 sar edx,1Fh shift 31 bits to keep sign only
0000001b add ebx,eax add to sum
0000001d adc edi,edx add with carry from previous add
0000001f inc esi increment index
foreach (var i in collection) {
00000020 cmp dword ptr [ecx+4],esi compare size to index
00000023 jg 00000012 loop if more
}
return sum;
00000025 mov eax,ebx result was in ebx
=================================================
for (int i = 0; i < collection.Length; ++i) {
0000000b xor esi,esi clear index
0000000d mov eax,dword ptr [ecx+4] get limit on for
00000010 mov dword ptr [ebp-10h],eax save limit
00000013 test eax,eax test if limit is empty
00000015 jle 0000002A exit loop if empty
sum += collection[i];
00000017 mov eax,dword ptr [ecx+esi*4+8] get item from collection
0000001b cdq convert eax to edx:eax
0000001c add eax,ebx add to sum
0000001e adc edx,edi add with carry from previous add
00000020 mov ebx,eax put result in edi:ebx
00000022 mov edi,edx
for (int i = 0; i < collection.Length; ++i) {
00000024 inc esi increment index
00000025 cmp dword ptr [ebp-10h],esi compare to limit
00000028 jg 00000017 loop if more
}
return sum;
0000002a mov eax,ebx result was in ebx
According to the C# Language Specification 4.0, a foreach loop gets broken down to the following by the compiler:
foreach-statement:
foreach ( local-variable-type identifier in expression ) embedded-statement
{
    E e = ((C)(x)).GetEnumerator();
    try {
        V v;
        while (e.MoveNext()) {
            v = (V)(T)e.Current;
            embedded-statement
        }
    }
    finally {
        … // Dispose e
    }
}
This is after the following processing (again from the specs):
•If the type X of expression is an array type then there is an implicit reference conversion from X to the System.Collections.IEnumerable interface (since System.Array implements this interface). The collection type is the System.Collections.IEnumerable interface, the enumerator type is the System.Collections.IEnumerator interface and the element type is the element type of the array type X.
Likely a good reason why you aren't seeing the same assembly code from the compiler.
EDIT
I tested Release in 32-bit, and the code was compact. Therefore what follows is a 64-bit issue.
I'm using VS 2012 RC. Debug is 32-bit and Release is 64-bit. Below is the Debug, then Release, disassembly of a line of code:
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
0000006f mov eax,dword ptr [ebp-40h]
00000072 shr eax,8
00000075 mov edx,dword ptr [ebp-3Ch]
00000078 mov ecx,0FF00h
0000007d and edx,ecx
0000007f shr edx,8
00000082 mov ecx,dword ptr [ebp-40h]
00000085 mov ebx,0FFh
0000008a and ecx,ebx
0000008c xor edx,ecx
0000008e mov ecx,dword ptr ds:[03387F38h]
00000094 cmp edx,dword ptr [ecx+4]
00000097 jb 0000009E
00000099 call 6F54F5EC
0000009e xor eax,dword ptr [ecx+edx*4+8]
000000a2 mov dword ptr [ebp-40h],eax
-----------------------------------------------------------------------------
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
000000a5 mov eax,dword ptr [rsp+20h]
000000a9 shr eax,8
000000ac mov dword ptr [rsp+38h],eax
000000b0 mov rdx,124DEE68h
000000ba mov rdx,qword ptr [rdx]
000000bd mov eax,dword ptr [rsp+00000090h]
000000c4 and eax,0FF00h
000000c9 shr eax,8
000000cc mov ecx,dword ptr [rsp+20h]
000000d0 and ecx,0FFh
000000d6 xor eax,ecx
000000d8 mov ecx,eax
000000da mov qword ptr [rsp+40h],rdx
000000df mov rax,qword ptr [rsp+40h]
000000e4 mov rax,qword ptr [rax+8]
000000e8 mov qword ptr [rsp+48h],rcx
000000ed cmp qword ptr [rsp+48h],rax
000000f2 jae 0000000000000100
000000f4 mov rax,qword ptr [rsp+48h]
000000f9 mov qword ptr [rsp+48h],rax
000000fe jmp 0000000000000105
00000100 call 000000005FA5D364
00000105 mov rax,qword ptr [rsp+40h]
0000010a mov rcx,qword ptr [rsp+48h]
0000010f mov ecx,dword ptr [rax+rcx*4+10h]
00000113 mov eax,dword ptr [rsp+38h]
00000117 xor eax,ecx
00000119 mov dword ptr [rsp+20h],eax
What is all the extra code in the 64-bit version doing? What is it testing for? I haven't benchmarked this, but the 32-bit code should execute much faster.
EDIT
The whole function:
public static uint CRC32(uint val)
{
    uint crc = 0xffffffff;
    crc = (crc >> 8) ^ crcTable[(val & 0x000000ff) ^ crc & 0xff];
    crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
    crc = (crc >> 8) ^ crcTable[((val & 0x00ff0000) >> 16) ^ crc & 0xff];
    crc = (crc >> 8) ^ crcTable[(val >> 24) ^ crc & 0xff];
    // flip bits
    return (crc ^ 0xffffffff);
}
I suspect you are using "Go to disassembly" while debugging the release build to get the assembly code.
After going to Tools -> Options, Debugging, General, and disabling "Suppress JIT optimization on module load" I got an x64 assembly listing without error checking.
It seems that by default, even in Release mode, the code is not optimized if the debugger is attached. Keep that in mind when trying to benchmark your code.
PS: Benchmarking shows x64 slightly faster than x86: 4.3 vs 4.8 seconds for 1 billion function calls.
Edit: Breakpoints still worked for me; otherwise I wouldn't have been able to see the disassembly after unchecking. Your example line from above looks like this (VS 2012 RC):
crc = (crc >> 8) ^ crcTable[((val & 0x0000ff00) >> 8) ^ crc & 0xff];
00000030 mov r11d,eax
00000033 shr r11d,8
00000037 mov ecx,edx
00000039 and ecx,0FF00h
0000003f shr ecx,8
00000042 movzx eax,al
00000045 xor ecx,eax
00000047 mov eax,ecx
00000049 cmp rax,r9
0000004c jae 00000000000000A4
0000004e mov eax,dword ptr [r8+rax*4+10h]
00000053 xor r11d,eax
Looking at the code, this is related to the error checking for accessing crcTable. It's doing the bounds check before it starts digging into the array.
In the 32-bit code you see this:
0000008e mov ecx,dword ptr ds:[03387F38h]
....
0000009e xor eax,dword ptr [ecx+edx*4+8]
In this case it's loading the base address of the array from 03387F38h and then using standard pointer arithmetic to access the correct entry.
In the 64-bit code this seems to be more complicated.
000000b0 mov rdx,124DEE68h
000000ba mov rdx,qword ptr [rdx]
This loads an address into the rdx register
000000da mov qword ptr [rsp+40h],rdx
...
00000105 mov rax,qword ptr [rsp+40h]
0000010a mov rcx,qword ptr [rsp+48h]
0000010f mov ecx,dword ptr [rax+rcx*4+10h]
This moves the address onto the stack, then later on it moves it into the rax register and does the same pointer work to access the array.
Pretty much everything between 000000da and 00000100/00000105 seems to be validation code. The rest of the code maps pretty well between the 64-bit and the 32-bit code, with some less aggressive register utilization in the 64-bit code.
exp ^ crc & 0xff is compiled as exp ^ (crc & 0xff):
00000082 mov ecx,dword ptr [ebp-40h]
00000085 mov ebx,0FFh
0000008a and ecx,ebx
0000008c xor edx,ecx
Did you mean to write the expression as (exp ^ crc) & 0xff?
The 64-bit version is definitely less optimized than the 32-bit version. The CLR has two separate JIT compiler implementations.
Also, if perf is critical, use unsafe code to remove the bounds checks.
Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:
public static void SingleLineTest()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
count += i % 16 == 0 ? 1 : 0;
}
stopwatch.Stop();
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}
public static void MultiLineTest()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
var isMultipleOf16 = i % 16 == 0;
count += isMultipleOf16 ? 1 : 0;
}
stopwatch.Stop();
Console.WriteLine("Multi-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}
The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.
Possibly even stranger is that on x86 (but not x64), the order that the methods are invoked has around a 20% impact on performance. Invoke the methods like this...
static void Main()
{
SingleLineTest();
MultiLineTest();
}
...and SingleLineTest is faster. (Compile using the x86 Release configuration, ensuring that "Optimize code" setting is enabled, and run the test from outside VS2010.) But reverse the order...
static void Main()
{
MultiLineTest();
SingleLineTest();
}
...and both methods take the same time (almost, but not quite, as long as MultiLineTest before). (When running this test, it's useful to add some additional calls to SingleLineTest and MultiLineTest to get additional samples. How many and what order doesn't matter, except for which method is called first.)
Finally, to demonstrate that JIT order is important, leave MultiLineTest first, but force SingleLineTest to be JITed first...
static void Main()
{
RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);
MultiLineTest();
SingleLineTest();
}
Now, SingleLineTest is faster again.
If you turn off "Suppress JIT optimization on module load" in VS2010, you can put a breakpoint in SingleLineTest and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.
A sample project demonstrating this behavior is on github.
It's not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.
Update: See also the Microsoft Connect entry for this issue.
Please note that I do not trust the "Suppress JIT optimization on module load" option, I spawn the process without debugging and attach my debugger after the JIT has run.
In the version where single-line runs faster, this is Main:
SingleLineTest();
00000000 push ebp
00000001 mov ebp,esp
00000003 call dword ptr ds:[0019380Ch]
MultiLineTest();
00000009 call dword ptr ds:[00193818h]
SingleLineTest();
0000000f call dword ptr ds:[0019380Ch]
MultiLineTest();
00000015 call dword ptr ds:[00193818h]
SingleLineTest();
0000001b call dword ptr ds:[0019380Ch]
MultiLineTest();
00000021 call dword ptr ds:[00193818h]
00000027 pop ebp
}
00000028 ret
Note that MultiLineTest has been placed on an 8-byte boundary, and SingleLineTest on a 4-byte boundary.
Here's Main for the version where both run at the same speed:
MultiLineTest();
00000000 push ebp
00000001 mov ebp,esp
00000003 call dword ptr ds:[00153818h]
SingleLineTest();
00000009 call dword ptr ds:[0015380Ch]
MultiLineTest();
0000000f call dword ptr ds:[00153818h]
SingleLineTest();
00000015 call dword ptr ds:[0015380Ch]
MultiLineTest();
0000001b call dword ptr ds:[00153818h]
SingleLineTest();
00000021 call dword ptr ds:[0015380Ch]
MultiLineTest();
00000027 call dword ptr ds:[00153818h]
0000002d pop ebp
}
0000002e ret
Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.
More digging is necessary. I think it was mentioned that the code before the loop wasn't exactly the same in both versions? Going to investigate.
Here's the "slow" version of SingleLineTest (and I checked, the last digits of the function address haven't changed).
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,7A5A2C68h
0000000b call FFF91EA0
00000010 mov esi,eax
00000012 mov dword ptr [esi+4],0
00000019 mov dword ptr [esi+8],0
00000020 mov byte ptr [esi+14h],0
00000024 mov dword ptr [esi+0Ch],0
0000002b mov dword ptr [esi+10h],0
stopwatch.Start();
00000032 cmp byte ptr [esi+14h],0
00000036 jne 00000047
00000038 call 7A22B314
0000003d mov dword ptr [esi+0Ch],eax
00000040 mov dword ptr [esi+10h],edx
00000043 mov byte ptr [esi+14h],1
int count = 0;
00000047 xor edi,edi
for (uint i = 0; i < 1000000000; ++i) {
00000049 xor edx,edx
count += i % 16 == 0 ? 1 : 0;
0000004b mov eax,edx
0000004d and eax,0Fh
00000050 test eax,eax
00000052 je 00000058
00000054 xor eax,eax
00000056 jmp 0000005D
00000058 mov eax,1
0000005d add edi,eax
for (uint i = 0; i < 1000000000; ++i) {
0000005f inc edx
00000060 cmp edx,3B9ACA00h
00000066 jb 0000004B
}
stopwatch.Stop();
00000068 mov ecx,esi
0000006a call 7A23F2C0
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000006f mov ecx,797C29B4h
00000074 call FFF91EA0
00000079 mov ecx,eax
0000007b mov dword ptr [ecx+4],edi
0000007e mov ebx,ecx
00000080 mov ecx,797BA240h
00000085 call FFF91EA0
0000008a mov edi,eax
0000008c mov ecx,esi
0000008e call 7A23ABE8
00000093 push edx
00000094 push eax
00000095 push 0
00000097 push 2710h
0000009c call 783247EC
000000a1 mov dword ptr [edi+4],eax
000000a4 mov dword ptr [edi+8],edx
000000a7 mov esi,edi
000000a9 call 793C6F40
000000ae push ebx
000000af push esi
000000b0 mov ecx,eax
000000b2 mov edx,dword ptr ds:[03392034h]
000000b8 mov eax,dword ptr [ecx]
000000ba mov eax,dword ptr [eax+3Ch]
000000bd call dword ptr [eax+1Ch]
000000c0 pop ebx
}
000000c1 pop esi
000000c2 pop edi
000000c3 pop ebp
000000c4 ret
And the "fast" version:
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,7A5A2C68h
0000000b call FFE11F70
00000010 mov esi,eax
00000012 mov ecx,esi
00000014 call 7A1068BC
stopwatch.Start();
00000019 cmp byte ptr [esi+14h],0
0000001d jne 0000002E
0000001f call 7A12B3E4
00000024 mov dword ptr [esi+0Ch],eax
00000027 mov dword ptr [esi+10h],edx
0000002a mov byte ptr [esi+14h],1
int count = 0;
0000002e xor edi,edi
for (uint i = 0; i < 1000000000; ++i) {
00000030 xor edx,edx
count += i % 16 == 0 ? 1 : 0;
00000032 mov eax,edx
00000034 and eax,0Fh
00000037 test eax,eax
00000039 je 0000003F
0000003b xor eax,eax
0000003d jmp 00000044
0000003f mov eax,1
00000044 add edi,eax
for (uint i = 0; i < 1000000000; ++i) {
00000046 inc edx
00000047 cmp edx,3B9ACA00h
0000004d jb 00000032
}
stopwatch.Stop();
0000004f mov ecx,esi
00000051 call 7A13F390
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000056 mov ecx,797C29B4h
0000005b call FFE11F70
00000060 mov ecx,eax
00000062 mov dword ptr [ecx+4],edi
00000065 mov ebx,ecx
00000067 mov ecx,797BA240h
0000006c call FFE11F70
00000071 mov edi,eax
00000073 mov ecx,esi
00000075 call 7A13ACB8
0000007a push edx
0000007b push eax
0000007c push 0
0000007e push 2710h
00000083 call 782248BC
00000088 mov dword ptr [edi+4],eax
0000008b mov dword ptr [edi+8],edx
0000008e mov esi,edi
00000090 call 792C7010
00000095 push ebx
00000096 push esi
00000097 mov ecx,eax
00000099 mov edx,dword ptr ds:[03562030h]
0000009f mov eax,dword ptr [ecx]
000000a1 mov eax,dword ptr [eax+3Ch]
000000a4 call dword ptr [eax+1Ch]
000000a7 pop ebx
}
000000a8 pop esi
000000a9 pop edi
000000aa pop ebp
000000ab ret
Just the loops, fast on the left, slow on the right:
00000030 xor edx,edx 00000049 xor edx,edx
00000032 mov eax,edx 0000004b mov eax,edx
00000034 and eax,0Fh 0000004d and eax,0Fh
00000037 test eax,eax 00000050 test eax,eax
00000039 je 0000003F 00000052 je 00000058
0000003b xor eax,eax 00000054 xor eax,eax
0000003d jmp 00000044 00000056 jmp 0000005D
0000003f mov eax,1 00000058 mov eax,1
00000044 add edi,eax 0000005d add edi,eax
00000046 inc edx 0000005f inc edx
00000047 cmp edx,3B9ACA00h 00000060 cmp edx,3B9ACA00h
0000004d jb 00000032 00000066 jb 0000004B
The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. The je to the load of constant 1 is aligned in the slow version and not in the fast version, but that hardly matters, since that jump is only taken 1/16 of the time. The other two jumps (the jmp after loading a constant zero, and the jb repeating the entire loop) are taken millions more times, and are aligned in the "fast" version.
I think this is the smoking gun.
So for a definitive answer... I suspect we would need to dig into the disassembly.
However, I have a guess. For SingleLineTest(), the compiler stores each result of the expression on the stack and pops each value as needed. MultiLineTest(), however, may be storing values elsewhere and having to access them from there. This could cause a few clock cycles to be missed, whereas grabbing the values off the stack keeps them in a register.
Interestingly, changing the order of function compilation may be adjusting the garbage collector's actions. Because isMultipleOf16 is defined within the loop, it may be handled oddly. You may want to move the definition outside of the loop and see what that changes...
My times are 2400 and 2600 on an i5-2410M 2.3 GHz, 4 GB RAM, 64-bit Win 7.
Here is my output: Single first
After starting the process and then attaching the debugger
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
--------------------------------
SingleLineTest()
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,685D2C68h
0000000b call FFD91F70
00000010 mov esi,eax
00000012 mov ecx,esi
00000014 call 681D68BC
stopwatch.Start();
00000019 cmp byte ptr [esi+14h],0
0000001d jne 0000002E
0000001f call 681FB3E4
00000024 mov dword ptr [esi+0Ch],eax
00000027 mov dword ptr [esi+10h],edx
0000002a mov byte ptr [esi+14h],1
int count = 0;
0000002e xor edi,edi
for (int i = 0; i < 1000000000; ++i)
00000030 xor edx,edx
{
count += i % 16 == 0 ? 1 : 0;
00000032 mov eax,edx
00000034 and eax,8000000Fh
00000039 jns 00000040
0000003b dec eax
0000003c or eax,0FFFFFFF0h
0000003f inc eax
00000040 test eax,eax
00000042 je 00000048
00000044 xor eax,eax
00000046 jmp 0000004D
00000048 mov eax,1
0000004d add edi,eax
for (int i = 0; i < 1000000000; ++i)
0000004f inc edx
00000050 cmp edx,3B9ACA00h
00000056 jl 00000032
}
stopwatch.Stop();
00000058 mov ecx,esi
0000005a call 6820F390
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000005f mov ecx,6A8B29B4h
00000064 call FFD91F70
00000069 mov ecx,eax
0000006b mov dword ptr [ecx+4],edi
0000006e mov ebx,ecx
00000070 mov ecx,6A8AA240h
00000075 call FFD91F70
0000007a mov edi,eax
0000007c mov ecx,esi
0000007e call 6820ACB8
00000083 push edx
00000084 push eax
00000085 push 0
00000087 push 2710h
0000008c call 6AFF48BC
00000091 mov dword ptr [edi+4],eax
00000094 mov dword ptr [edi+8],edx
00000097 mov esi,edi
00000099 call 6A457010
0000009e push ebx
0000009f push esi
000000a0 mov ecx,eax
000000a2 mov edx,dword ptr ds:[039F2030h]
000000a8 mov eax,dword ptr [ecx]
000000aa mov eax,dword ptr [eax+3Ch]
000000ad call dword ptr [eax+1Ch]
000000b0 pop ebx
}
000000b1 pop esi
000000b2 pop edi
000000b3 pop ebp
000000b4 ret
Multi first:
MultiLineTest();
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
--------------------------------
SingleLineTest()
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,685D2C68h
0000000b call FFF31EA0
00000010 mov esi,eax
00000012 mov dword ptr [esi+4],0
00000019 mov dword ptr [esi+8],0
00000020 mov byte ptr [esi+14h],0
00000024 mov dword ptr [esi+0Ch],0
0000002b mov dword ptr [esi+10h],0
stopwatch.Start();
00000032 cmp byte ptr [esi+14h],0
00000036 jne 00000047
00000038 call 682AB314
0000003d mov dword ptr [esi+0Ch],eax
00000040 mov dword ptr [esi+10h],edx
00000043 mov byte ptr [esi+14h],1
int count = 0;
00000047 xor edi,edi
for (int i = 0; i < 1000000000; ++i)
00000049 xor edx,edx
{
count += i % 16 == 0 ? 1 : 0;
0000004b mov eax,edx
0000004d and eax,8000000Fh
00000052 jns 00000059
00000054 dec eax
00000055 or eax,0FFFFFFF0h
00000058 inc eax
00000059 test eax,eax
0000005b je 00000061
0000005d xor eax,eax
0000005f jmp 00000066
00000061 mov eax,1
00000066 add edi,eax
for (int i = 0; i < 1000000000; ++i)
00000068 inc edx
00000069 cmp edx,3B9ACA00h
0000006f jl 0000004B
}
stopwatch.Stop();
00000071 mov ecx,esi
00000073 call 682BF2C0
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000078 mov ecx,6A8B29B4h
0000007d call FFF31EA0
00000082 mov ecx,eax
00000084 mov dword ptr [ecx+4],edi
00000087 mov ebx,ecx
00000089 mov ecx,6A8AA240h
0000008e call FFF31EA0
00000093 mov edi,eax
00000095 mov ecx,esi
00000097 call 682BABE8
0000009c push edx
0000009d push eax
0000009e push 0
000000a0 push 2710h
000000a5 call 6B0A47EC
000000aa mov dword ptr [edi+4],eax
000000ad mov dword ptr [edi+8],edx
000000b0 mov esi,edi
000000b2 call 6A506F40
000000b7 push ebx
000000b8 push esi
000000b9 mov ecx,eax
000000bb mov edx,dword ptr ds:[038E2034h]
000000c1 mov eax,dword ptr [ecx]
000000c3 mov eax,dword ptr [eax+3Ch]
000000c6 call dword ptr [eax+1Ch]
000000c9 pop ebx
}
000000ca pop esi
000000cb pop edi
000000cc pop ebp
000000cd ret