Why does JIT order affect performance?

Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:
public static void SingleLineTest()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
count += i % 16 == 0 ? 1 : 0;
}
stopwatch.Stop();
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}
public static void MultiLineTest()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
var isMultipleOf16 = i % 16 == 0;
count += isMultipleOf16 ? 1 : 0;
}
stopwatch.Stop();
Console.WriteLine("Multi-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}
The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.
Possibly even stranger is that on x86 (but not x64), the order that the methods are invoked has around a 20% impact on performance. Invoke the methods like this...
static void Main()
{
SingleLineTest();
MultiLineTest();
}
...and SingleLineTest is faster. (Compile using the x86 Release configuration, ensuring that "Optimize code" setting is enabled, and run the test from outside VS2010.) But reverse the order...
static void Main()
{
MultiLineTest();
SingleLineTest();
}
...and both methods take the same time (almost, but not quite, as long as MultiLineTest took before). (When running this test, it's useful to add some additional calls to SingleLineTest and MultiLineTest to get additional samples. How many, and in what order, doesn't matter, except for which method is called first.)
Finally, to demonstrate that JIT order is important, leave MultiLineTest first, but force SingleLineTest to be JITed first...
static void Main()
{
RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);
MultiLineTest();
SingleLineTest();
}
Now, SingleLineTest is faster again.
If you turn off "Suppress JIT optimization on module load" in VS2010, you can put a breakpoint in SingleLineTest and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.
A sample project demonstrating this behavior is on github.
It's not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.
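One way to take that guesswork out of the picture is to force-JIT everything up front in a deterministic order, using the same RuntimeHelpers.PrepareMethod trick as above. A minimal sketch, assuming the test methods are static members of Program:
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
static class JitWarmup
{
    // Force-JIT every non-generic static method of a type in a fixed order,
    // so later timings do not depend on which method happened to run first.
    public static void PrepareAll(Type type)
    {
        foreach (MethodInfo method in type.GetMethods(
            BindingFlags.Static | BindingFlags.Public | BindingFlags.NonPublic))
        {
            if (!method.ContainsGenericParameters)
                RuntimeHelpers.PrepareMethod(method.MethodHandle);
        }
    }
}
Usage would be a single JitWarmup.PrepareAll(typeof(Program)); call at the top of Main.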
Update: See also the Microsoft Connect entry for this issue.

Please note that I do not trust the "Suppress JIT optimization on module load" option; I spawn the process without debugging and attach my debugger after the JIT has run.
In the version where single-line runs faster, this is Main:
SingleLineTest();
00000000 push ebp
00000001 mov ebp,esp
00000003 call dword ptr ds:[0019380Ch]
MultiLineTest();
00000009 call dword ptr ds:[00193818h]
SingleLineTest();
0000000f call dword ptr ds:[0019380Ch]
MultiLineTest();
00000015 call dword ptr ds:[00193818h]
SingleLineTest();
0000001b call dword ptr ds:[0019380Ch]
MultiLineTest();
00000021 call dword ptr ds:[00193818h]
00000027 pop ebp
}
00000028 ret
Note that MultiLineTest has been placed on an 8 byte boundary, and SingleLineTest on a 4 byte boundary.
Here's Main for the version where both run at the same speed:
MultiLineTest();
00000000 push ebp
00000001 mov ebp,esp
00000003 call dword ptr ds:[00153818h]
SingleLineTest();
00000009 call dword ptr ds:[0015380Ch]
MultiLineTest();
0000000f call dword ptr ds:[00153818h]
SingleLineTest();
00000015 call dword ptr ds:[0015380Ch]
MultiLineTest();
0000001b call dword ptr ds:[00153818h]
SingleLineTest();
00000021 call dword ptr ds:[0015380Ch]
MultiLineTest();
00000027 call dword ptr ds:[00153818h]
0000002d pop ebp
}
0000002e ret
Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.
More digging is necessary. I think it was mentioned that the code before the loop wasn't exactly the same in both versions? Going to investigate.
Here's the "slow" version of SingleLineTest (and I checked, the last digits of the function address haven't changed).
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,7A5A2C68h
0000000b call FFF91EA0
00000010 mov esi,eax
00000012 mov dword ptr [esi+4],0
00000019 mov dword ptr [esi+8],0
00000020 mov byte ptr [esi+14h],0
00000024 mov dword ptr [esi+0Ch],0
0000002b mov dword ptr [esi+10h],0
stopwatch.Start();
00000032 cmp byte ptr [esi+14h],0
00000036 jne 00000047
00000038 call 7A22B314
0000003d mov dword ptr [esi+0Ch],eax
00000040 mov dword ptr [esi+10h],edx
00000043 mov byte ptr [esi+14h],1
int count = 0;
00000047 xor edi,edi
for (uint i = 0; i < 1000000000; ++i) {
00000049 xor edx,edx
count += i % 16 == 0 ? 1 : 0;
0000004b mov eax,edx
0000004d and eax,0Fh
00000050 test eax,eax
00000052 je 00000058
00000054 xor eax,eax
00000056 jmp 0000005D
00000058 mov eax,1
0000005d add edi,eax
for (uint i = 0; i < 1000000000; ++i) {
0000005f inc edx
00000060 cmp edx,3B9ACA00h
00000066 jb 0000004B
}
stopwatch.Stop();
00000068 mov ecx,esi
0000006a call 7A23F2C0
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000006f mov ecx,797C29B4h
00000074 call FFF91EA0
00000079 mov ecx,eax
0000007b mov dword ptr [ecx+4],edi
0000007e mov ebx,ecx
00000080 mov ecx,797BA240h
00000085 call FFF91EA0
0000008a mov edi,eax
0000008c mov ecx,esi
0000008e call 7A23ABE8
00000093 push edx
00000094 push eax
00000095 push 0
00000097 push 2710h
0000009c call 783247EC
000000a1 mov dword ptr [edi+4],eax
000000a4 mov dword ptr [edi+8],edx
000000a7 mov esi,edi
000000a9 call 793C6F40
000000ae push ebx
000000af push esi
000000b0 mov ecx,eax
000000b2 mov edx,dword ptr ds:[03392034h]
000000b8 mov eax,dword ptr [ecx]
000000ba mov eax,dword ptr [eax+3Ch]
000000bd call dword ptr [eax+1Ch]
000000c0 pop ebx
}
000000c1 pop esi
000000c2 pop edi
000000c3 pop ebp
000000c4 ret
And the "fast" version:
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,7A5A2C68h
0000000b call FFE11F70
00000010 mov esi,eax
00000012 mov ecx,esi
00000014 call 7A1068BC
stopwatch.Start();
00000019 cmp byte ptr [esi+14h],0
0000001d jne 0000002E
0000001f call 7A12B3E4
00000024 mov dword ptr [esi+0Ch],eax
00000027 mov dword ptr [esi+10h],edx
0000002a mov byte ptr [esi+14h],1
int count = 0;
0000002e xor edi,edi
for (uint i = 0; i < 1000000000; ++i) {
00000030 xor edx,edx
count += i % 16 == 0 ? 1 : 0;
00000032 mov eax,edx
00000034 and eax,0Fh
00000037 test eax,eax
00000039 je 0000003F
0000003b xor eax,eax
0000003d jmp 00000044
0000003f mov eax,1
00000044 add edi,eax
for (uint i = 0; i < 1000000000; ++i) {
00000046 inc edx
00000047 cmp edx,3B9ACA00h
0000004d jb 00000032
}
stopwatch.Stop();
0000004f mov ecx,esi
00000051 call 7A13F390
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000056 mov ecx,797C29B4h
0000005b call FFE11F70
00000060 mov ecx,eax
00000062 mov dword ptr [ecx+4],edi
00000065 mov ebx,ecx
00000067 mov ecx,797BA240h
0000006c call FFE11F70
00000071 mov edi,eax
00000073 mov ecx,esi
00000075 call 7A13ACB8
0000007a push edx
0000007b push eax
0000007c push 0
0000007e push 2710h
00000083 call 782248BC
00000088 mov dword ptr [edi+4],eax
0000008b mov dword ptr [edi+8],edx
0000008e mov esi,edi
00000090 call 792C7010
00000095 push ebx
00000096 push esi
00000097 mov ecx,eax
00000099 mov edx,dword ptr ds:[03562030h]
0000009f mov eax,dword ptr [ecx]
000000a1 mov eax,dword ptr [eax+3Ch]
000000a4 call dword ptr [eax+1Ch]
000000a7 pop ebx
}
000000a8 pop esi
000000a9 pop edi
000000aa pop ebp
000000ab ret
Just the loops, fast on the left, slow on the right:
00000030 xor edx,edx 00000049 xor edx,edx
00000032 mov eax,edx 0000004b mov eax,edx
00000034 and eax,0Fh 0000004d and eax,0Fh
00000037 test eax,eax 00000050 test eax,eax
00000039 je 0000003F 00000052 je 00000058
0000003b xor eax,eax 00000054 xor eax,eax
0000003d jmp 00000044 00000056 jmp 0000005D
0000003f mov eax,1 00000058 mov eax,1
00000044 add edi,eax 0000005d add edi,eax
00000046 inc edx 0000005f inc edx
00000047 cmp edx,3B9ACA00h 00000060 cmp edx,3B9ACA00h
0000004d jb 00000032 00000066 jb 0000004B
The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. The je that leads to loading a constant 1 is aligned in the slow version and not in the fast version, but it hardly matters, since that jump is only taken 1/16 of the time. The other two jumps (the jmp after loading a constant zero, and the jb repeating the entire loop) are taken millions of times more, and are aligned in the "fast" version.
I think this is the smoking gun.
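If you want to check the alignment without stepping through the disassembly every time, something along these lines could print the native entry points (a sketch, assuming the Program class from the question; note that GetFunctionPointer may hand back a stub rather than the JITted body, so treat the numbers only as a hint):
using System;
using System.Reflection;
using System.Runtime.CompilerServices;
static class JitAddressDump
{
    // JIT both test methods, then print their native entry points so the
    // alignment (address mod 16) can be compared from run to run.
    public static void Dump()
    {
        foreach (string name in new[] { "SingleLineTest", "MultiLineTest" })
        {
            MethodInfo method = typeof(Program).GetMethod(name);
            RuntimeHelpers.PrepareMethod(method.MethodHandle);
            long entry = method.MethodHandle.GetFunctionPointer().ToInt64();
            Console.WriteLine("{0}: 0x{1:X} (mod 16 = {2})", name, entry, entry % 16);
        }
    }
}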

So for a definitive answer... I suspect we would need to dig into the disassembly.
However, I have a guess. For SingleLineTest(), the compiler stores each intermediate result of the expression on the evaluation stack and pops each value as needed. MultiLineTest(), however, may be storing values in a local and having to access them from there, which could cost a few clock cycles, whereas grabbing the values off the evaluation stack keeps them in a register.
Interestingly, changing the order of the method compilation may be adjusting the garbage collector's actions. Because isMultipleOf16 is defined within the loop, it may be handled oddly. You may want to move the definition outside of the loop and see what that changes...
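For reference, that variant might look like this (a sketch; MultiLineTestHoisted is just an illustrative name, and whether it changes the generated code is exactly what the experiment would show):
public static void MultiLineTestHoisted()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    bool isMultipleOf16;                      // declared outside the loop
    for (uint i = 0; i < 1000000000; ++i) {
        isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Hoisted test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}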

My times are 2400 and 2600 ms on an i5-2410M 2.3 GHz, 4 GB RAM, 64-bit Windows 7.
Here is my output after starting the process and then attaching the debugger.
Single first:
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
--------------------------------
SingleLineTest()
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,685D2C68h
0000000b call FFD91F70
00000010 mov esi,eax
00000012 mov ecx,esi
00000014 call 681D68BC
stopwatch.Start();
00000019 cmp byte ptr [esi+14h],0
0000001d jne 0000002E
0000001f call 681FB3E4
00000024 mov dword ptr [esi+0Ch],eax
00000027 mov dword ptr [esi+10h],edx
0000002a mov byte ptr [esi+14h],1
int count = 0;
0000002e xor edi,edi
for (int i = 0; i < 1000000000; ++i)
00000030 xor edx,edx
{
count += i % 16 == 0 ? 1 : 0;
00000032 mov eax,edx
00000034 and eax,8000000Fh
00000039 jns 00000040
0000003b dec eax
0000003c or eax,0FFFFFFF0h
0000003f inc eax
00000040 test eax,eax
00000042 je 00000048
00000044 xor eax,eax
00000046 jmp 0000004D
00000048 mov eax,1
0000004d add edi,eax
for (int i = 0; i < 1000000000; ++i)
0000004f inc edx
00000050 cmp edx,3B9ACA00h
00000056 jl 00000032
}
stopwatch.Stop();
00000058 mov ecx,esi
0000005a call 6820F390
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000005f mov ecx,6A8B29B4h
00000064 call FFD91F70
00000069 mov ecx,eax
0000006b mov dword ptr [ecx+4],edi
0000006e mov ebx,ecx
00000070 mov ecx,6A8AA240h
00000075 call FFD91F70
0000007a mov edi,eax
0000007c mov ecx,esi
0000007e call 6820ACB8
00000083 push edx
00000084 push eax
00000085 push 0
00000087 push 2710h
0000008c call 6AFF48BC
00000091 mov dword ptr [edi+4],eax
00000094 mov dword ptr [edi+8],edx
00000097 mov esi,edi
00000099 call 6A457010
0000009e push ebx
0000009f push esi
000000a0 mov ecx,eax
000000a2 mov edx,dword ptr ds:[039F2030h]
000000a8 mov eax,dword ptr [ecx]
000000aa mov eax,dword ptr [eax+3Ch]
000000ad call dword ptr [eax+1Ch]
000000b0 pop ebx
}
000000b1 pop esi
000000b2 pop edi
000000b3 pop ebp
000000b4 ret
Multi first:
MultiLineTest();
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
SingleLineTest();
MultiLineTest();
--------------------------------
SingleLineTest()
Stopwatch stopwatch = new Stopwatch();
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 mov ecx,685D2C68h
0000000b call FFF31EA0
00000010 mov esi,eax
00000012 mov dword ptr [esi+4],0
00000019 mov dword ptr [esi+8],0
00000020 mov byte ptr [esi+14h],0
00000024 mov dword ptr [esi+0Ch],0
0000002b mov dword ptr [esi+10h],0
stopwatch.Start();
00000032 cmp byte ptr [esi+14h],0
00000036 jne 00000047
00000038 call 682AB314
0000003d mov dword ptr [esi+0Ch],eax
00000040 mov dword ptr [esi+10h],edx
00000043 mov byte ptr [esi+14h],1
int count = 0;
00000047 xor edi,edi
for (int i = 0; i < 1000000000; ++i)
00000049 xor edx,edx
{
count += i % 16 == 0 ? 1 : 0;
0000004b mov eax,edx
0000004d and eax,8000000Fh
00000052 jns 00000059
00000054 dec eax
00000055 or eax,0FFFFFFF0h
00000058 inc eax
00000059 test eax,eax
0000005b je 00000061
0000005d xor eax,eax
0000005f jmp 00000066
00000061 mov eax,1
00000066 add edi,eax
for (int i = 0; i < 1000000000; ++i)
00000068 inc edx
00000069 cmp edx,3B9ACA00h
0000006f jl 0000004B
}
stopwatch.Stop();
00000071 mov ecx,esi
00000073 call 682BF2C0
Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000078 mov ecx,6A8B29B4h
0000007d call FFF31EA0
00000082 mov ecx,eax
00000084 mov dword ptr [ecx+4],edi
00000087 mov ebx,ecx
00000089 mov ecx,6A8AA240h
0000008e call FFF31EA0
00000093 mov edi,eax
00000095 mov ecx,esi
00000097 call 682BABE8
0000009c push edx
0000009d push eax
0000009e push 0
000000a0 push 2710h
000000a5 call 6B0A47EC
000000aa mov dword ptr [edi+4],eax
000000ad mov dword ptr [edi+8],edx
000000b0 mov esi,edi
000000b2 call 6A506F40
000000b7 push ebx
000000b8 push esi
000000b9 mov ecx,eax
000000bb mov edx,dword ptr ds:[038E2034h]
000000c1 mov eax,dword ptr [ecx]
000000c3 mov eax,dword ptr [eax+3Ch]
000000c6 call dword ptr [eax+1Ch]
000000c9 pop ebx
}
000000ca pop esi
000000cb pop edi
000000cc pop ebp
000000cd ret

Related

Is there a restrict equivalent in C#?

I have the following function (which I cleaned up a bit to make it easier to understand). It takes the destination array, gets the element at index n, adds src1[n] to it, and then multiplies it by src2[n] (nothing too fancy):
static void F(long[] dst, long[] src1, long[] src2, ulong n)
{
dst[n] += src1[n];
dst[n] *= src2[n];
}
Now this generates the following ASM:
<Program>$.<<Main>$>g__F|0_0(Int64[], Int64[], Int64[], UInt64)
L0000: sub rsp, 0x28
L0004: test r9, r9
L0007: jl short L0051
L0009: mov rax, r9
L000c: mov r9d, [rcx+8]
L0010: movsxd r9, r9d
L0013: cmp rax, r9
L0016: jae short L0057
L0018: lea rcx, [rcx+rax*8+0x10]
L001d: mov r9, rcx
L0020: mov r10, [r9]
L0023: mov r11d, [rdx+8]
L0027: movsxd r11, r11d
L002a: cmp rax, r11
L002d: jae short L0057
L002f: add r10, [rdx+rax*8+0x10]
L0034: mov [r9], r10
L0037: mov edx, [r8+8]
L003b: movsxd rdx, edx
L003e: cmp rax, rdx
L0041: jae short L0057
L0043: imul r10, [r8+rax*8+0x10]
L0049: mov [rcx], r10
L004c: add rsp, 0x28
L0050: ret
L0051: call 0x00007ffc9dadb710
L0056: int3
L0057: call 0x00007ffc9dadbc70
L005c: int3
As you can see, it adds a bunch of bounds-checking stuff, and because I can guarantee that n will be within the legal range, I can use pointers.
static unsafe void G(long* dst, long* src1, long* src2, ulong n)
{
dst[n] += src1[n];
dst[n] *= src2[n];
}
Now this generates much simpler ASM:
<Program>$.<<Main>$>g__G|0_1(Int64*, Int64*, Int64*, UInt64)
L0000: lea rax, [rcx+r9*8]
L0004: mov rcx, rax
L0007: mov rdx, [rdx+r9*8]
L000b: add [rcx], rdx
L000e: mov rdx, [rax] ; loads the value again?
L0011: imul rdx, [r8+r9*8]
L0016: mov [rax], rdx
L0019: ret
As you may have noticed, there is an extra MOV there (I think; at least I can't see why it is there).
Question
How can I remove that line? In C I could use the restrict keyword, if I'm not wrong. Is there such a keyword in C#? Sadly, I couldn't find anything on the internet.
Note
Here is SharpLab link.
Here is the C example:
void
f(int64_t *dst,
int64_t *src1,
int64_t *src2,
uint64_t n) {
dst[n] += src1[n];
dst[n] *= src2[n];
}
void
g(int64_t *restrict dst,
int64_t *restrict src1,
int64_t *restrict src2,
uint64_t n) {
dst[n] += src1[n];
dst[n] *= src2[n];
}
this generates:
f:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
mov QWORD PTR [rdx], rax ; this is strange. It loads the value back to [RDX]?
; shouldn't that be other way around? I don't know.
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
g:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
and here is the Godbolt link.
This:
dst[n] = (dst[n] + src1[n]) * src2[n];
removes that extra mov.
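Applied to the unsafe method from the question, that rewrite would look something like this (a sketch; G2 is just an illustrative name):
static unsafe void G2(long* dst, long* src1, long* src2, ulong n)
{
    // dst[n] is read once and written once, so there is no intermediate
    // store followed by a reload of the same slot.
    dst[n] = (dst[n] + src1[n]) * src2[n];
}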
In C# there is no equivalent of the restrict qualifier from the C language.
In the C# ECMA-334:2017 language specification, chapter 23 (Unsafe Code), there is no syntax to specify that a region of memory must be accessed only through a specific pointer, nor any syntax to specify that the memory regions pointed to by different pointers do not overlap. Thus there is no such equivalent. This is probably because C# is a managed language; unsafe syntax, which allows working with pointers and unmanaged memory, is an edge case in C#, and restrict on pointers would be an edge case of the edge case.

Surprisingly different performance of simple C# program

Below is a simple program in which a small change makes a significant performance difference, and I don't understand why.
What the program does is not really relevant, but it calculates PI in a very convoluted way by counting collisions between two objects of different mass and a wall. What I noticed as I was changing the code around was a quite large variance in performance.
The lines in question are the commented ones, which are mathematically equivalent. Using the slow version makes the entire program take roughly twice as long as using the fast version.
int iterations = 0;
for (int i = 4; i < 9; i++)
{
Stopwatch s = Stopwatch.StartNew();
double ms = 1.0;
double mL = Math.Pow(100.0, i);
double uL = 1.0;
double us = 0.0;
double msmLInv = 1d / (ms + mL);
long collisions = 0;
while (!(uL < 0 && us <= 0 && uL <= us))
{
Debug.Assert(++iterations > 0);
++collisions;
double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;
//double vL = (2 * ms * us - uL * (ms - mL)) * msmLInv; //fast
double vL = uL + (us - vs) / mL; //slow
Debug.Assert(Math.Abs(((2 * ms * us - uL * (ms - mL)) * msmLInv) - (uL + (us - vs) / mL)) < 0.001d); //checks equality between fast and slow
if (vs > 0)
{
++collisions;
vs = -vs;
}
us = vs;
uL = vL;
}
s.Stop();
Debug.Assert(collisions.ToString() == "314159265359".Substring(0, i + 1)); //check the correctness
Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
}
Debug.Assert(iterations == 174531180); //check that we dont skip loops
Console.Write("Waiting...");
Console.ReadKey();
My intuition says that because the fast version has 7 operations compared to the slow one's 4, the slow one should be faster, but it is not.
I disassembled the program using .NET Reflector, which shows that the two versions are mostly equal, as expected, except for the part shown below. The code before and after is identical.
//slow
ldloc.s uL
ldloc.2
ldloc.s us
ldloc.s vs
sub
mul
ldloc.3
div
add
//fast
ldc.r8 2
ldloc.2
mul
ldloc.s us
mul
ldloc.s uL
ldloc.2
ldloc.3
sub
mul
sub
ldloc.2
ldloc.3
add
div
This also shows that more code is executed in the fast version, which again would lead me to expect it to be slower.
The only guess I have right now is that the slow version causes more cache misses, but I don't know how to measure that (a guide would be welcome). Other than that I am at a loss.
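One way to get such numbers is with hardware performance counters, which can report cache misses and branch mispredictions directly. A sketch using the third-party BenchmarkDotNet package and its ETW-based HardwareCounters diagnoser (an assumption; it needs Windows and an elevated prompt, and the loop bodies below are trimmed copies of the two variants above):
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.BranchMispredictions)]
public class CollisionBench
{
    const double ms = 1.0;
    static readonly double mL = Math.Pow(100.0, 4);   // kept small so each run is quick
    [Benchmark]
    public long Fast()
    {
        double uL = 1.0, us = 0.0, msmLInv = 1d / (ms + mL);
        long collisions = 0;
        while (!(uL < 0 && us <= 0 && uL <= us))
        {
            ++collisions;
            double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;
            double vL = (2 * ms * us - uL * (ms - mL)) * msmLInv;   // fast
            if (vs > 0) { ++collisions; vs = -vs; }
            us = vs; uL = vL;
        }
        return collisions;
    }
    [Benchmark]
    public long Slow()
    {
        double uL = 1.0, us = 0.0, msmLInv = 1d / (ms + mL);
        long collisions = 0;
        while (!(uL < 0 && us <= 0 && uL <= us))
        {
            ++collisions;
            double vs = (2 * mL * uL + us * (ms - mL)) * msmLInv;
            double vL = uL + (us - vs) / mL;                        // slow
            if (vs > 0) { ++collisions; vs = -vs; }
            us = vs; uL = vL;
        }
        return collisions;
    }
}
public class BenchProgram
{
    public static void Main() { BenchmarkRunner.Run<CollisionBench>(); }
}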
EDIT 1.
As per the request of @EricLippert, here is the JIT disassembly for the inner while loop where the difference is.
EDIT 2.
I worked out how to break into the release build and updated the disassembly, so now there does seem to be some difference. I got these results by running the release version, stopping the program in the same function with a ReadKey, attaching the debugger, letting the program continue, breaking on the next line, and opening the disassembly window (Ctrl+Alt+D).
EDIT 3.
Changed the code to an updated example based on all the suggestions.
//slow
78:
79: vs = (2 * mL * uL + us * (ms - mL)) / (ms + mL);
00C10530 call CA9AD013
00C10535 fdiv st,st(3)
00C10537 faddp st(2),st
80:
81: //double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
82: double vL = uL + ms * (us - vs) / mL; //slow
00C10539 fldz
00C1053B fcomip st,st(1)
00C1053D jp 00C10549
00C1053F jae 00C10549
00C10541 add ebx,1
00C10544 adc edi,0
00C10547 fchs
00C10549 fld st(1)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
00C1054B fldz
00C1054D fcomip st,st(3)
00C1054F fstp st(2)
00C10551 jp 00C10508
00C10553 jbe 00C10508
00C10555 fldz
00C10557 fcomip st,st(1)
00C10559 jp 00C10508
00C1055B jb 00C10508
00C1055D fxch st(1)
00C1055F fcomi st,st(1)
00C10561 jnp 00C10567
00C10563 fxch st(1)
00C10565 jmp 00C10508
00C10567 jbe 00C1056D
00C10569 fxch st(1)
00C1056B jmp 00C10508
00C1056D fstp st(1)
00C1056F fstp st(0)
00C10571 fstp st(0)
92: }
93:
94: s.Stop();
00C10573 mov ecx,esi
00C10575 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
00C1057A mov ecx,725B0994h
00C1057F call 00B930C8
00C10584 mov edx,eax
00C10586 mov eax,dword ptr [ebp-14h]
00C10589 mov dword ptr [edx+4],eax
00C1058C mov dword ptr [ebp-34h],edx
00C1058F mov ecx,725F3778h
00C10594 call 00B930C8
00C10599 mov dword ptr [ebp-38h],eax
00C1059C mov ecx,725F2C10h
00C105A1 call 00B930C8
00C105A6 mov dword ptr [ebp-3Ch],eax
00C105A9 mov ecx,esi
00C105AB call 71835820
00C105B0 push edx
00C105B1 push eax
00C105B2 push 0
00C105B4 push 2710h
00C105B9 call 736071A0
00C105BE mov dword ptr [ebp-48h],eax
00C105C1 mov dword ptr [ebp-44h],edx
00C105C4 fild qword ptr [ebp-48h]
00C105C7 fstp dword ptr [ebp-40h]
00C105CA fld dword ptr [ebp-40h]
00C105CD fdiv dword ptr ds:[0C10678h]
00C105D3 mov eax,dword ptr [ebp-38h]
00C105D6 fstp dword ptr [eax+4]
00C105D9 mov edx,dword ptr [ebp-38h]
00C105DC mov eax,dword ptr [ebp-3Ch]
00C105DF mov dword ptr [eax+4],ebx
00C105E2 mov dword ptr [eax+8],edi
00C105E5 mov esi,dword ptr [ebp-3Ch]
00C105E8 lea edi,[ebp-30h]
00C105EB xorps xmm0,xmm0
00C105EE movq mmword ptr [edi],xmm0
00C105F2 movq mmword ptr [edi+8],xmm0
00C105F7 push edx
00C105F8 push esi
00C105F9 lea ecx,[ebp-30h]
00C105FC mov edx,dword ptr [ebp-34h]
00C105FF call 724A2ED4
00C10604 lea eax,[ebp-30h]
00C10607 push dword ptr [eax+0Ch]
00C1060A push dword ptr [eax+8]
00C1060D push dword ptr [eax+4]
00C10610 push dword ptr [eax]
00C10612 mov edx,dword ptr ds:[3832310h]
00C10618 xor ecx,ecx
00C1061A call 72497A00
00C1061F mov ecx,eax
00C10621 call 72571934
61: for (int i = 4; i < 9; i++)
00C10626 inc dword ptr [ebp-14h]
00C10629 cmp dword ptr [ebp-14h],9
00C1062D jl 00C10496
97: }
98:
99: Console.WriteLine(loops);
00C10633 mov ecx,dword ptr [ebp-10h]
00C10636 call 72C583FC
100: Console.Write("Waiting...");
00C1063B mov ecx,dword ptr ds:[3832314h]
00C10641 call 724C67F0
00C10646 lea ecx,[ebp-20h]
00C10649 xor edx,edx
00C1064B call 72C57984
00C10650 lea esp,[ebp-0Ch]
00C10653 pop ebx
00C10654 pop esi
00C10655 pop edi
00C10656 pop ebp
00C10657 ret
//fast
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0550 or al,83h
80:
81: double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
02FD0552 ret
02FD0553 add dword ptr [ebx-3626FF29h],eax
02FD0559 fchs
02FD055B fxch st(1)
02FD055D fld st(0)
73:
74: while (!(uL < 0 && us <= 0 && uL <= us))
02FD055F fldz
02FD0561 fcomip st,st(2)
02FD0563 fstp st(1)
02FD0565 jnp 02FD056B
02FD0567 fxch st(1)
02FD0569 jmp 02FD050B
02FD056B ja 02FD0571
02FD056D fxch st(1)
02FD056F jmp 02FD050B
02FD0571 fldz
02FD0573 fcomip st,st(2)
02FD0575 jnp 02FD057B
02FD0577 fxch st(1)
02FD0579 jmp 02FD050B
02FD057B jae 02FD0581
02FD057D fxch st(1)
02FD057F jmp 02FD050B
02FD0581 fcomi st,st(1)
02FD0583 jnp 02FD0589
02FD0585 fxch st(1)
02FD0587 jmp 02FD050B
02FD0589 jbe 02FD0592
02FD058B fxch st(1)
02FD058D jmp 02FD050B
02FD0592 fstp st(1)
02FD0594 fstp st(0)
92: }
93:
94: s.Stop();
02FD0596 mov ecx,esi
02FD0598 call 71880260
95:
96: Console.WriteLine($"i: {i}, T: {s.ElapsedMilliseconds / 1000f}, PI: {collisions}");
02FD059D mov ecx,725B0994h
02FD05A2 call 013830C8
02FD05A7 mov edx,eax
02FD05A9 mov eax,dword ptr [ebp-14h]
02FD05AC mov dword ptr [edx+4],eax
02FD05AF mov dword ptr [ebp-3Ch],edx
02FD05B2 mov ecx,725F3778h
02FD05B7 call 013830C8
02FD05BC mov dword ptr [ebp-40h],eax
02FD05BF mov ecx,725F2C10h
02FD05C4 call 013830C8
02FD05C9 mov dword ptr [ebp-44h],eax
02FD05CC mov ecx,esi
02FD05CE call 71835820
02FD05D3 push edx
02FD05D4 push eax
02FD05D5 push 0
02FD05D7 push 2710h
02FD05DC call 736071A0
02FD05E1 mov dword ptr [ebp-50h],eax
02FD05E4 mov dword ptr [ebp-4Ch],edx
02FD05E7 fild qword ptr [ebp-50h]
02FD05EA fstp dword ptr [ebp-48h]
02FD05ED fld dword ptr [ebp-48h]
02FD05F0 fdiv dword ptr ds:[2FD06A8h]
02FD05F6 mov eax,dword ptr [ebp-40h]
02FD05F9 fstp dword ptr [eax+4]
02FD05FC mov edx,dword ptr [ebp-40h]
02FD05FF mov eax,dword ptr [ebp-44h]
02FD0602 mov dword ptr [eax+4],ebx
02FD0605 mov dword ptr [eax+8],edi
02FD0608 mov esi,dword ptr [ebp-44h]
02FD060B lea edi,[ebp-38h]
02FD060E xorps xmm0,xmm0
02FD0611 movq mmword ptr [edi],xmm0
02FD0615 movq mmword ptr [edi+8],xmm0
02FD061A push edx
02FD061B push esi
02FD061C lea ecx,[ebp-38h]
02FD061F mov edx,dword ptr [ebp-3Ch]
02FD0622 call 724A2ED4
02FD0627 lea eax,[ebp-38h]
02FD062A push dword ptr [eax+0Ch]
02FD062D push dword ptr [eax+8]
02FD0630 push dword ptr [eax+4]
02FD0633 push dword ptr [eax]
02FD0635 mov edx,dword ptr ds:[4142310h]
02FD063B xor ecx,ecx
02FD063D call 72497A00
02FD0642 mov ecx,eax
02FD0644 call 72571934
61: for (int i = 4; i < 9; i++)
02FD0649 inc dword ptr [ebp-14h]
02FD064C cmp dword ptr [ebp-14h],9
02FD0650 jl 02FD0496
97: }
98:
99: Console.WriteLine(loops);
02FD0656 mov ecx,dword ptr [ebp-10h]
02FD0659 call 72C583FC
100: Console.Write("Waiting...");
02FD065E mov ecx,dword ptr ds:[4142314h]
02FD0664 call 724C67F0
02FD0669 lea ecx,[ebp-28h]
02FD066C xor edx,edx
02FD066E call 72C57984
02FD0673 lea esp,[ebp-0Ch]
02FD0676 pop ebx
02FD0677 pop esi
02FD0678 pop edi
02FD0679 pop ebp
02FD067A ret
I think the reason is CPU instruction pipelining. Your slow equation depends on vs, which means vs must be calculated first and only then can vL be calculated.
In your fast equation, more instructions can be pipelined, as vs and vL can be calculated at the same time because they don't depend on each other.
Please don't confuse this with multithreading. Instruction pipelining is something implemented at a very low hardware level; it tries to keep as many CPU execution units busy as possible at the same time to achieve maximum instruction throughput.
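To see the effect in isolation, here is a small sketch (not the original program): summing an array with a single dependency chain versus two independent chains that are combined at the end. The second version does the same number of additions, but an out-of-order CPU can overlap them, so it typically runs noticeably faster.
static double SumOneChain(double[] data)
{
    double acc = 0;
    for (int i = 0; i < data.Length; i++)
        acc += data[i];               // every add has to wait for the previous one
    return acc;
}
static double SumTwoChains(double[] data)
{
    double acc0 = 0, acc1 = 0;
    int i = 0;
    for (; i + 1 < data.Length; i += 2)
    {
        acc0 += data[i];              // these two adds are independent of each other,
        acc1 += data[i + 1];          // so the CPU can execute them in parallel
    }
    if (i < data.Length)
        acc0 += data[i];              // odd leftover element
    return acc0 + acc1;
}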
Your calculations are not equal:
double vL = (2 * ms * us - uL * (ms - mL)) / (ms + mL); //fast
double vL = uL + ms * (us - vs) / mL; //slow
For example, vs does not appear in the fast version.
I would expect your while loop to do more iterations because of this?

Huge performance difference in byte-array access between x64 and x86

I'm currently doing micro-benchmarks for a better understanding of CLR performance and version issues. The micro-benchmark in question is XORing two byte arrays of 64 bytes each.
I always make a reference implementation with safe .NET before I try to beat the .NET Framework implementation with unsafe code and so on.
My reference implementation in question is:
for (int p = 0; p < 64; p++)
a[p] ^= b[p];
where a and b are byte[] a = new byte[64], filled with data from the .NET RNG.
This code runs about twice as fast on x64 as on x86. At first I thought this was OK, because the JIT would make something like *long ^= *long out of it on x64 and *int ^= *int on x86.
But my optimized unsafe version:
fixed (byte* pA = a)
fixed (byte* pB = b)
{
long* ppA = (long*)pA;
long* ppB = (long*)pB;
for (int p = 0; p < 8; p++)
{
*ppA ^= *ppB;
ppA++;
ppB++;
}
}
runs about 4 times faster than the x64 reference implementation, so my assumption about the compiler's *long ^= *long and *int ^= *int optimization was not right.
Where does this huge performance difference in the reference implementation come from? And now that I have posted the ASM code: why can't the C# compiler also optimize the x86 version this way?
IL code for x86 and x64 reference implementation (they are identical):
IL_0059: ldloc.3
IL_005a: ldloc.s p
IL_005c: ldelema [mscorlib]System.Byte
IL_0061: dup
IL_0062: ldobj [mscorlib]System.Byte
IL_0067: ldloc.s b
IL_0069: ldloc.s p
IL_006b: ldelem.u1
IL_006c: xor
IL_006d: conv.u1
IL_006e: stobj [mscorlib]System.Byte
IL_0073: ldloc.s p
IL_0075: ldc.i4.1
IL_0076: add
IL_0077: stloc.s p
IL_0079: ldloc.s p
IL_007b: ldc.i4.s 64
IL_007d: blt.s IL_0059
I think that ldloc.3 is the array a.
Resulting ASM code for x86:
for (int p = 0; p < 64; p++)
010900DF xor edx,edx
010900E1 mov edi,dword ptr [ebx+4]
a[p] ^= b[p];
010900E4 cmp edx,edi
010900E6 jae 0109010C
010900E8 lea esi,[ebx+edx+8]
010900EC mov eax,dword ptr [ebp-14h]
010900EF cmp edx,dword ptr [eax+4]
010900F2 jae 0109010C
010900F4 movzx eax,byte ptr [eax+edx+8]
010900F9 xor byte ptr [esi],al
for (int p = 0; p < 64; p++)
010900FB inc edx
010900FC cmp edx,40h
010900FF jl 010900E4
Resulting ASM code for x64:
a[p] ^= b[p];
00007FFF4A8B01C6 mov eax,3Eh
00007FFF4A8B01CB cmp rax,rcx
00007FFF4A8B01CE jae 00007FFF4A8B0245
00007FFF4A8B01D0 mov rax,qword ptr [rbx+8]
00007FFF4A8B01D4 mov r9d,3Eh
00007FFF4A8B01DA cmp r9,rax
00007FFF4A8B01DD jae 00007FFF4A8B0245
00007FFF4A8B01DF mov r9d,3Fh
00007FFF4A8B01E5 cmp r9,rcx
00007FFF4A8B01E8 jae 00007FFF4A8B0245
00007FFF4A8B01EA mov ecx,3Fh
00007FFF4A8B01EF cmp rcx,rax
00007FFF4A8B01F2 jae 00007FFF4A8B0245
00007FFF4A8B01F4 nop word ptr [rax+rax]
00007FFF4A8B0200 movzx ecx,byte ptr [rdi+rdx+10h]
00007FFF4A8B0205 movzx eax,byte ptr [rbx+rdx+10h]
00007FFF4A8B020A xor ecx,eax
00007FFF4A8B020C mov byte ptr [rdi+rdx+10h],cl
00007FFF4A8B0210 movzx ecx,byte ptr [rdi+rdx+11h]
00007FFF4A8B0215 movzx eax,byte ptr [rbx+rdx+11h]
00007FFF4A8B021A xor ecx,eax
00007FFF4A8B021C mov byte ptr [rdi+rdx+11h],cl
00007FFF4A8B0220 add rdx,2
for (int p = 0; p < 64; p++)
00007FFF4A8B0224 cmp rdx,40h
00007FFF4A8B0228 jl 00007FFF4A8B0200
You've made a classic mistake, attempting performance analysis on non-optimized code. Here is a complete minimal compilable example:
using System;
namespace SO30558357
{
class Program
{
static void XorArray(byte[] a, byte[] b)
{
for (int p = 0; p< 64; p++)
a[p] ^= b[p];
}
static void Main(string[] args)
{
byte[] a = new byte[64];
byte[] b = new byte[64];
Random r = new Random();
r.NextBytes(a);
r.NextBytes(b);
XorArray(a, b);
Console.ReadLine(); // when the program stops here
// use Debug -> Attach to process
}
}
}
I compiled that using Visual Studio 2013 Update 3, with default "Release Build" settings for a C# console application except for the architecture, and ran it with CLR v4.0.30319. Oh, and I think I have Roslyn installed, but that shouldn't replace the JIT, only the translation to MSIL, which is identical on both architectures.
The actual x86 assembly for XorArray:
006F00D8 push ebp
006F00D9 mov ebp,esp
006F00DB push edi
006F00DC push esi
006F00DD push ebx
006F00DE push eax
006F00DF mov dword ptr [ebp-10h],edx
006F00E2 xor edi,edi
006F00E4 mov ebx,dword ptr [ecx+4]
006F00E7 cmp edi,ebx
006F00E9 jae 006F010F
006F00EB lea esi,[ecx+edi+8]
006F00EF movzx eax,byte ptr [esi]
006F00F2 mov edx,dword ptr [ebp-10h]
006F00F5 cmp edi,dword ptr [edx+4]
006F00F8 jae 006F010F
006F00FA movzx edx,byte ptr [edx+edi+8]
006F00FF xor eax,edx
006F0101 mov byte ptr [esi],al
006F0103 inc edi
006F0104 cmp edi,40h
006F0107 jl 006F00E7
006F0109 pop ecx
006F010A pop ebx
006F010B pop esi
006F010C pop edi
006F010D pop ebp
006F010E ret
And for x64:
00007FFD4A3000FB mov rax,qword ptr [rsi+8]
00007FFD4A3000FF mov rax,qword ptr [rbp+8]
00007FFD4A300103 nop word ptr [rax+rax]
00007FFD4A300110 movzx ecx,byte ptr [rsi+rdx+10h]
00007FFD4A300115 movzx eax,byte ptr [rdx+rbp+10h]
00007FFD4A30011A xor ecx,eax
00007FFD4A30011C mov byte ptr [rsi+rdx+10h],cl
00007FFD4A300120 movzx ecx,byte ptr [rsi+rdx+11h]
00007FFD4A300125 movzx eax,byte ptr [rdx+rbp+11h]
00007FFD4A30012A xor ecx,eax
00007FFD4A30012C mov byte ptr [rsi+rdx+11h],cl
00007FFD4A300130 movzx ecx,byte ptr [rsi+rdx+12h]
00007FFD4A300135 movzx eax,byte ptr [rdx+rbp+12h]
00007FFD4A30013A xor ecx,eax
00007FFD4A30013C mov byte ptr [rsi+rdx+12h],cl
00007FFD4A300140 movzx ecx,byte ptr [rsi+rdx+13h]
00007FFD4A300145 movzx eax,byte ptr [rdx+rbp+13h]
00007FFD4A30014A xor ecx,eax
00007FFD4A30014C mov byte ptr [rsi+rdx+13h],cl
00007FFD4A300150 add rdx,4
00007FFD4A300154 cmp rdx,40h
00007FFD4A300158 jl 00007FFD4A300110
Bottom line: the x64 optimizer worked a lot better. While it is still using byte-sized transfers, it unrolled the loop by a factor of 4 and inlined the function call.
Since in the x86 version, loop control logic corresponds to roughly half the code, the unrolling can be expected to yield almost twice the performance.
Inlining allowed the compiler to perform context-sensitive optimization, knowing the size of the arrays and eliminating the runtime bounds check.
If we inline by hand, the x86 compiler now yields:
00A000B1 xor edi,edi
00A000B3 mov eax,dword ptr [ebp-10h]
00A000B6 mov ebx,dword ptr [eax+4]
a[p] ^= b[p];
00A000B9 mov eax,dword ptr [ebp-10h]
00A000BC cmp edi,ebx
00A000BE jae 00A000F5
00A000C0 lea esi,[eax+edi+8]
00A000C4 movzx eax,byte ptr [esi]
00A000C7 mov edx,dword ptr [ebp-14h]
00A000CA cmp edi,dword ptr [edx+4]
00A000CD jae 00A000F5
00A000CF movzx edx,byte ptr [edx+edi+8]
00A000D4 xor eax,edx
00A000D6 mov byte ptr [esi],al
for (int p = 0; p< 64; p++)
00A000D8 inc edi
00A000D9 cmp edi,40h
00A000DC jl 00A000B9
That didn't help much: the loop is still not unrolled and the runtime bounds checking is still there.
Notably, the x86 compiler found a register (EBX) to cache the length of one array, but ran out of registers and was forced to access the other array length from memory on every iteration. This should be a "cheap" L1 cache access, but that's still slower than register access, and much slower than no bounds check at all.
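For completeness, the hand-inlined variant behind the last listing is roughly this (a sketch; the exact source that was measured is not shown here):
static void Main(string[] args)
{
    byte[] a = new byte[64];
    byte[] b = new byte[64];
    Random r = new Random();
    r.NextBytes(a);
    r.NextBytes(b);
    // body of XorArray pasted directly into Main
    for (int p = 0; p < 64; p++)
        a[p] ^= b[p];
    Console.ReadLine();   // pause here, then use Debug -> Attach to Process
}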

foreach vs for: please explain the assembly code difference

I've recently been testing the performance of the for loop vs the foreach loop in C#, and I've noticed that for summing an array of ints into a long, the foreach loop can actually come out faster. Here is the full test program; I used Visual Studio 2012, x86, Release mode, optimizations on.
Here is the assembly code for both loops. The foreach:
long sum = 0;
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 xor ebx,ebx
00000008 xor edi,edi
foreach (var i in collection) {
0000000a xor esi,esi
0000000c cmp dword ptr [ecx+4],0
00000010 jle 00000025
00000012 mov eax,dword ptr [ecx+esi*4+8]
sum += i;
00000016 mov edx,eax
00000018 sar edx,1Fh
0000001b add ebx,eax
0000001d adc edi,edx
0000001f inc esi
foreach (var i in collection) {
00000020 cmp dword ptr [ecx+4],esi
00000023 jg 00000012
}
return sum;
00000025 mov eax,ebx
00000027 mov edx,edi
00000029 pop ebx
0000002a pop esi
0000002b pop edi
0000002c pop ebp
0000002d ret
And the for:
long sum = 0;
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 push eax
00000007 xor ebx,ebx
00000009 xor edi,edi
for (int i = 0; i < collection.Length; ++i) {
0000000b xor esi,esi
0000000d mov eax,dword ptr [ecx+4]
00000010 mov dword ptr [ebp-10h],eax
00000013 test eax,eax
00000015 jle 0000002A
sum += collection[i];
00000017 mov eax,dword ptr [ecx+esi*4+8]
0000001b cdq
0000001c add eax,ebx
0000001e adc edx,edi
00000020 mov ebx,eax
00000022 mov edi,edx
for (int i = 0; i < collection.Length; ++i) {
00000024 inc esi
00000025 cmp dword ptr [ebp-10h],esi
00000028 jg 00000017
}
return sum;
0000002a mov eax,ebx
0000002c mov edx,edi
0000002e pop ecx
0000002f pop ebx
00000030 pop esi
00000031 pop edi
00000032 pop ebp
00000033 ret
As you can see, the main loop is 7 instructions for "foreach" and 9 instructions for "for". This translates into approximately a 10% performance difference in my benchmarks.
I'm not very good at reading assembly code, however, and I don't understand why the for loop wouldn't be at least as efficient as the foreach. What is going on here?
As the array is so big, the only relevant part is clearly the one inside the loop:
// for loop
00000017 mov eax,dword ptr [ecx+esi*4+8]
0000001b cdq
0000001c add eax,ebx
0000001e adc edx,edi
00000020 mov ebx,eax
00000022 mov edi,edx
// foreach loop
00000012 mov eax,dword ptr [ecx+esi*4+8]
00000016 mov edx,eax
00000018 sar edx,1Fh
0000001b add ebx,eax
0000001d adc edi,edx
Since the sum is a long, it is stored in two different registers: ebx contains its least significant four bytes and edi the most significant four. The two loops differ in how collection[i] is (implicitly) cast from int to long:
// for loop
0000001b cdq
// foreach loop
00000016 mov edx,eax
00000018 sar edx,1Fh
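In C# terms, both sequences implement the same implicit widening conversion; roughly:
int item = collection[i];   // 32-bit array element
long widened = item;        // implicit int -> long sign extension (cdq, or mov + sar by 31)
sum += widened;             // 64-bit add: the add/adc pair on x86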
Another important thing to notice is that the for-loop version does the sum in "reversed" order:
long temp = (long) collection[i]; // implicit cast, stored in edx:eax
temp += sum; // instead of "simply" sum += temp
sum = temp; // sum is stored back into ebx:edi
I can't tell you why the compiler preferred this over sum += temp (@EricLippert could maybe tell us :) ), but I suspect it is related to instruction dependency issues that might arise.
OK, so here's an annotated version of the assembly code; as you will see, the instructions in the loops are very close.
foreach (var i in collection) {
0000000a xor esi,esi clear index
0000000c cmp dword ptr [ecx+4],0 get size of collection
00000010 jle 00000025 exit if empty
00000012 mov eax,dword ptr [ecx+esi*4+8] get item from collection
sum += i;
00000016 mov edx,eax move to edx:eax
00000018 sar edx,1Fh shift 31 bits to keep sign only
0000001b add ebx,eax add to sum
0000001d adc edi,edx add with carry from previous add
0000001f inc esi increment index
foreach (var i in collection) {
00000020 cmp dword ptr [ecx+4],esi compare size to index
00000023 jg 00000012 loop if more
}
return sum;
00000025 mov eax,ebx result was in ebx
=================================================
for (int i = 0; i < collection.Length; ++i) {
0000000b xor esi,esi clear index
0000000d mov eax,dword ptr [ecx+4] get limit on for
00000010 mov dword ptr [ebp-10h],eax save limit
00000013 test eax,eax test if limit is empty
00000015 jle 0000002A exit loop if empty
sum += collection[i];
00000017 mov eax,dword ptr [ecx+esi*4+8] get item from collection
0000001b cdq convert eax to edx:eax
0000001c add eax,ebx add to sum
0000001e adc edx,edi add with carry from previous add
00000020 mov ebx,eax put result in edi:ebx
00000022 mov edi,edx
for (int i = 0; i < collection.Length; ++i) {
00000024 inc esi increment index
00000025 cmp dword ptr [ebp-10h],esi compare to limit
00000028 jg 00000017 loop if more
}
return sum;
0000002a mov eax,ebx result was in ebx
According to the C# Language Specification 4.0, a foreach loop gets broken down to the following by the compiler:
foreach-statement:
foreach ( local-variable-type identifier in expression ) embedded-statement
{
E e = ((C)(x)).GetEnumerator();
try {
V v;
while (e.MoveNext()) {
v = (V)(T)e.Current;
embedded-statement
}
}
finally {
… // Dispose e
}
}
This is after the following processing (again from the specs):
•If the type X of expression is an array type then there is an implicit reference conversion from X to the System.Collections.IEnumerable interface (since System.Array implements this interface). The collection type is the System.Collections.IEnumerable interface, the enumerator type is the System.Collections.IEnumerator interface and the element type is the element type of the array type X.
Likely a good reason why you aren't seeing the same assembly code from the compiler.
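It's also worth noting what the disassembly in the question already shows: for arrays specifically, the compiler does not go through IEnumerator at all; the foreach is lowered to an index-based loop, roughly (a sketch):
long sum = 0;
int[] arr = collection;                  // the array reference is captured once
for (int index = 0; index < arr.Length; index++)
{
    var i = arr[index];                  // no enumerator object is ever allocated
    sum += i;
}
return sum;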

When to use short?

Say I'm looping through like 20/30 objects or in any other case where I'm dealing with smaller numbers, is it a good practice to use short instead of int?
I mean why isn't this common:
for(short i=0; i<x; i++)
Method(array[i]);
Is it because the performance gain is too low?
Thanks
"is it a good practice to use short instead of int?"
First of all, this is a micro-optimization that will not achieve the expected results: increased speed or efficiency.
Second: no, not really; the CLR internally still uses 32-bit integers (Int32) to perform the iteration. Basically, it converts the short to an Int32 for computation purposes during JIT compilation.
Third: array indexes are Int32, and the iterating short variable is automatically converted to Int32 when used as an array indexer.
If we take the next code:
var array = new object[32];
var x = array.Length;
for (short i = 0; i < x; i++)
Method(array[i]);
If we disassemble it, you can clearly see at 00000089 inc eax that at the machine level a 32-bit register (eax) is used for the iterating variable, which is then truncated to 16 bits at 0000008a movsx eax,ax. So there is no benefit to using a short as opposed to an Int32; in fact, there might be a slight performance loss due to the extra instructions that need to be executed.
00000042 nop
var array = new object[32];
00000043 mov ecx,64B41812h
00000048 mov edx,20h
0000004d call FFBC01A4
00000052 mov dword ptr [ebp-50h],eax
00000055 mov eax,dword ptr [ebp-50h]
00000058 mov dword ptr [ebp-40h],eax
var x = array.Length;
0000005b mov eax,dword ptr [ebp-40h]
0000005e mov eax,dword ptr [eax+4]
00000061 mov dword ptr [ebp-44h],eax
for (short i = 0; i < x; i++)
00000064 xor edx,edx
00000066 mov dword ptr [ebp-48h],edx
00000069 nop
0000006a jmp 00000090
Method(array[i]);
0000006c mov eax,dword ptr [ebp-48h]
0000006f mov edx,dword ptr [ebp-40h]
00000072 cmp eax,dword ptr [edx+4]
00000075 jb 0000007C
00000077 call 657A28F6
0000007c mov ecx,dword ptr [edx+eax*4+0Ch]
00000080 call FFD9A708
00000085 nop
for (short i = 0; i < x; i++)
00000086 mov eax,dword ptr [ebp-48h]
00000089 inc eax
0000008a movsx eax,ax
0000008d mov dword ptr [ebp-48h],eax
00000090 mov eax,dword ptr [ebp-48h]
00000093 cmp eax,dword ptr [ebp-44h]
00000096 setl al
00000099 movzx eax,al
0000009c mov dword ptr [ebp-4Ch],eax
0000009f cmp dword ptr [ebp-4Ch],0
000000a3 jne 0000006C
Yes, the performance difference is negligible. However, a short uses 16 bits where an int uses 32, so it's conceivable that you may want to use a short if you're processing enough small numbers.
In general, using numbers that match the word size of the processor is relatively faster than using sizes that don't. On the other hand, a short uses less memory space than an int.
If you have limited memory space, using short may be the alternative; but personally I have never encountered such a situation when writing C# applications.
An int uses 32 bits of memory, a short uses 16 bits and a byte uses 8 bits. If you're only looping through 20/30 objects and you're concerned about memory usage, use byte instead.
Catering for memory usage to this level is rarely required with today's machines, though you could argue that using int everywhere is just lazy. Personally, I try to always use the relevant type that uses the least memory.
http://msdn.microsoft.com/en-us/library/5bdb6693(v=vs.100).aspx
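Where the smaller width genuinely pays off is in the element type of large data, not in the loop counter; a quick sketch of the difference:
short[] samples16 = new short[10000000];   // roughly 20 MB of element data
int[] samples32 = new int[10000000];       // roughly 40 MB for the same element count
// The loop variable that walks either array can stay an int at no cost:
long total = 0;
for (int i = 0; i < samples16.Length; i++)
    total += samples16[i];                 // widened to int/long only in registers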
