I have the following snippet, which sums all the elements of an array (the size is hardcoded to 32):
static unsafe int F(int* a)
{
Vector256<int> ymm0 = Avx2.LoadVector256(a + 0);
Vector256<int> ymm1 = Avx2.LoadVector256(a + 8);
Vector256<int> ymm2 = Avx2.LoadVector256(a + 16);
Vector256<int> ymm3 = Avx2.LoadVector256(a + 24);
ymm0 = Avx2.Add(ymm0, ymm1);
ymm2 = Avx2.Add(ymm2, ymm3);
ymm0 = Avx2.Add(ymm0, ymm2);
const int s = 256 / 32;
int* t = stackalloc int[s];
Avx2.Store(t, ymm0);
int r = 0;
for (int i = 0; i < s; ++i)
r += t[i];
return r;
}
This generates the following ASM:
Program.F(Int32*)
L0000: sub rsp, 0x28
L0004: vzeroupper ; Question #1
L0007: vxorps xmm4, xmm4, xmm4
L000b: vmovdqa [rsp], xmm4 ; Question #2
L0010: vmovdqa [rsp+0x10], xmm4 ; Question #2
L0016: xor eax, eax ; Question #3
L0018: mov [rsp+0x20], rax
L001d: mov rax, 0x7d847bd1f9ce ; Question #4
L0027: mov [rsp+0x20], rax
L002c: vmovdqu ymm0, [rcx]
L0030: vmovdqu ymm1, [rcx+0x20]
L0035: vmovdqu ymm2, [rcx+0x40]
L003a: vmovdqu ymm3, [rcx+0x60]
L003f: vpaddd ymm0, ymm0, ymm1
L0043: vpaddd ymm2, ymm2, ymm3
L0047: vpaddd ymm0, ymm0, ymm2
L004b: lea rax, [rsp] ; Question #5
L004f: vmovdqu [rax], ymm0
L0053: xor edx, edx ; Question #5
L0055: xor ecx, ecx ; Question #5
L0057: movsxd r8, ecx
L005a: add edx, [rax+r8*4]
L005e: inc ecx
L0060: cmp ecx, 8
L0063: jl short L0057
L0065: mov eax, edx
L0067: mov rcx, 0x7d847bd1f9ce ; Question #4
L0071: cmp [rsp+0x20], rcx
L0076: je short L007d
L0078: call 0x00007ffc9de2d430 ; Question #6
L007d: nop
L007e: vzeroupper
L0081: add rsp, 0x28
L0085: ret
Questions
Why do we need VZEROUPPER at the beginning? Wouldn't it be perfectly fine without it?
What do the VMOVDQAs at the beginning do? Or rather, why are they there?
Why zero out the EAX register? It's probably related to the next line, MOV [RSP+0x20], RAX, but I still can't understand it.
What does this mysterious value (0x7d847bd1f9ce) do?
There are also lines in between whose purpose I can't understand (see the "Question #5" comments in the code).
I'm assuming this line (L0078: call 0x00007ffc9de2d430) throws an exception. Is there a function or something in my code that can throw an exception?
I know there are a lot of questions, but I can't separate them because I think they are related to each other. TO BE CRYSTAL CLEAR: I'm just trying to understand the generated ASM here. I'm not a professional in this area.
Note
In case you're wondering what GCC (-O2) generates, here is the result:
int32_t
f(int32_t *a) {
__m256i ymm0;
__m256i ymm1;
__m256i ymm2;
__m256i ymm3;
ymm0 = _mm256_load_si256((__m256i*)(a + 0));
ymm1 = _mm256_load_si256((__m256i*)(a + 8));
ymm2 = _mm256_load_si256((__m256i*)(a + 16));
ymm3 = _mm256_load_si256((__m256i*)(a + 24));
ymm0 = _mm256_add_epi32(ymm0, ymm1);
ymm2 = _mm256_add_epi32(ymm2, ymm3);
ymm0 = _mm256_add_epi32(ymm0, ymm2);
int32_t t[8];
_mm256_store_si256((__m256i*)t, ymm0);
int32_t r;
r = 0;
for (int i = 0; i < 8; ++i)
r += t[i];
return r;
}
And the generated ASM:
f:
push rbp
xor r8d, r8d
mov rbp, rsp
and rsp, -32
lea rax, [rsp-32]
mov rdx, rsp
vmovdqa ymm1, YMMWORD PTR [rdi+96]
vpaddd ymm0, ymm1, YMMWORD PTR [rdi+64]
vpaddd ymm0, ymm0, YMMWORD PTR [rdi+32]
vpaddd ymm0, ymm0, YMMWORD PTR [rdi]
vmovdqa YMMWORD PTR [rsp-32], ymm0
.L2:
add r8d, DWORD PTR [rax]
add rax, 4
cmp rax, rdx
jne .L2
mov eax, r8d
vzeroupper
leave
ret
I think it optimized my code (maybe heavily), but whatever.
Why do we need VZEROUPPER at the beginning? Wouldn't it be perfectly fine without it?
Inserting vzeroupper at the beginning may be a workaround for a library or some other third-party code that is known to forget to clean its uppers (to protect SSE code). But you're not calling SSE code, you only have AVX code, so yes, it's not needed at the beginning.
Your code is using VEX-encoded instructions (the v prefix), which means it would not encounter the "false dependency" (transition penalty) problem (Why is this SSE code 6 times slower without VZEROUPPER on Skylake?). And on top of that, you're using ymm vectors immediately (entering the Dirty Upper State), which means any reasoning about power management/frequency scaling does not apply here either (Dynamically determining where a rogue AVX-512 instruction is executing mentions a forgotten vzeroupper causing reduced frequency for an entire app).
What do the VMOVDQAs at the beginning do? Or rather, why are they there?
L0007: vxorps xmm4, xmm4, xmm4
L000b: vmovdqa [rsp], xmm4 ; Question #2
L0010: vmovdqa [rsp+0x10], xmm4 ; Question #2
Why is it zeroing out memory that you're going to fully overwrite? The CLR requires locals to be zero-initialized when a method's localsinit flag is set, and the C# compiler sets that flag by default. The JIT does not prove that your loop fully overwrites the buffer, so it zeros it up front just in case. (Since .NET 5 you can suppress this with the [SkipLocalsInit] attribute.)
Why zero out the EAX register? It's probably related to the next line, MOV [RSP+0x20], RAX, but I still can't understand it.
L0016: xor eax, eax ; Question #3
L0018: mov [rsp+0x20], rax
L001d: mov rax, 0x7d847bd1f9ce ; Question #4
L0027: mov [rsp+0x20], rax
So it writes a 64-bit zero at address rsp+0x20 and then overwrites the same memory region with the stack canary. Why does it need to write a zero there first? I don't know; it looks like a missed optimization.
What does this mysterious value (0x7d847bd1f9ce) do?
I'm assuming this line (L0078: call 0x00007ffc9de2d430) throws an exception. Is there a function or something in my code that can throw an exception?
As already mentioned, it's the stack canary used to detect buffer overruns.
"The use of stackalloc automatically enables buffer overrun detection features in the common language runtime (CLR). If a buffer overrun is detected, the process is terminated as quickly as possible to minimize the chance that malicious code is executed" - quote from https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/stackalloc
It writes a value it knows at the end of the stack buffer, then executes your loop, then checks whether the value has changed (if it has, your loop wrote out of bounds). Note that this is a huge stack canary; I'm not sure why it has to be 64-bit. Unless there is a good reason for that, I would consider it a missed optimization: it's large in code size and in the uop cache, and it forces the compiler to emit more instructions (it always has to use mov, because a 64-bit constant can't be the immediate operand of any other instruction, such as cmp or a store mov).
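A minimal C model of that scheme may make it concrete. This is an illustrative sketch only: the buffer size and guard value are taken from the listing above, everything else (names, the 0xAB fill byte) is made up.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative model of the canary scheme in the listing: an 8-byte guard
   value is planted just past the 32-byte buffer (mov [rsp+0x20], rax),
   the user's stores run, and the guard is compared afterwards
   (cmp [rsp+0x20], rcx). Returns 1 if the canary survived, 0 on overrun. */
int canary_survives(size_t bytes_written)
{
    unsigned char frame[40];                  /* 32-byte buffer + 8-byte canary */
    const uint64_t guard = 0x7d847bd1f9ceULL; /* the "mysterious value" */
    memcpy(frame + 32, &guard, sizeof guard); /* plant the canary */

    memset(frame, 0xAB, bytes_written);       /* simulate the loop's stores */

    uint64_t check;
    memcpy(&check, frame + 32, sizeof check); /* re-read and compare */
    return check == guard;                    /* je -> fine, else throw helper */
}
```

Writing exactly 32 bytes leaves the canary intact; writing 36 bytes clobbers its low half and the check fails, which in the real code would take the `call` path to the throw helper.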
Also, a note on the canary-checking code:
L0071: cmp [rsp+0x20], rcx
L0076: je short L007d
L0078: call 0x00007ffc9de2d430 ; Question #6
L007d: nop
The fall-through path should be the most likely taken path. In this case, the fall-through path is the "throw exception" path, which shouldn't be the normal case, so this may be another missed optimization. The way it can affect performance: if this code is not in branch history, it'll suffer a branch miss; if it's predicted correctly, it'll be fine. There's also an indirect effect: taken branches occupy space in the branch predictor's history, so a branch that is never taken would be cheaper.
There are also lines in between whose purpose I can't understand (see the "Question #5" comments in the code).
L004b: lea rax, [rsp] ; Question #5
L004f: vmovdqu [rax], ymm0
L0053: xor edx, edx ; Question #5
L0055: xor ecx, ecx ; Question #5
The LEA is not needed here. My guess is that it's related to how the compiler does register allocation/stack management, so it's just a quirk of the compiler (rsp can't be allocated like a normal register; it's always used as the stack pointer, so it has to be treated specially).
Zeroing edx: it's used as the accumulator for the final result. Zeroing ecx: it's used as the counter in the loop that follows.
About the horizontal sum at the end:
In general, when storing to and then reading from the same location at a different offset/size, you need to check the store-forwarding rules for your target CPU to avoid a penalty (you can find them at https://www.agner.org/optimize/#manuals; Intel and AMD list the rules in their guides as well). If you're targeting modern CPUs (Skylake/Zen), you shouldn't suffer a store-forwarding stall in this case, but there are still faster ways to sum a vector horizontally (with the bonus of avoiding the missed optimizations related to the stack buffer).
Check out this nice writeup on good ways to sum a vector horizontally: https://stackoverflow.com/a/35270026/899255
You could also check out how a compiler does it: https://godbolt.org/z/q74abrqzh (GCC at -O3).
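As a scalar sketch of what that shuffle-based reduction computes (illustrative only; the function name is mine, and each step models one vector instruction from the linked compiler output):

```c
#include <stdint.h>

/* Scalar model of the SIMD horizontal sum: each step halves the number of
   live lanes, mirroring vextracti128+vpaddd, then vpshufd+vpaddd, then
   vpunpckhqdq+vpaddd. */
int32_t hsum8(const int32_t v[8])
{
    int32_t t[4];
    for (int i = 0; i < 4; ++i)
        t[i] = v[i] + v[i + 4]; /* add upper 128-bit lane to lower lane */
    t[0] += t[1];               /* add odd elements to even */
    t[2] += t[3];
    return t[0] + t[2];         /* add high 64-bit half to low half */
}
```

The point of the log2-style reduction is that three adds replace the eight-iteration scalar loop over the stack buffer.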
@stepan explained the RyuJIT-generated code quite well, but I thought I would address the question of why the GCC code is so different and why RyuJIT missed so many potential optimizations.
The short answer is that, being just-in-time, RyuJIT has a very limited time budget in which to optimize, so it optimizes for frequently used patterns. In your case, the JIT may be taking your code a bit too literally, while GCC is able to capture your intent a bit better.
The stack canary code can be eliminated simply by removing the stackalloc and using a Vector256<T> local instead. Additionally, the loop over the stack values misses a few optimizations, like your i variable being sign-extended on each iteration. This version of your method resolves both issues by helping the JIT out with things it knows how to optimize.
static unsafe int F(int* a)
{
Vector256<int> ymm0 = Avx.LoadVector256(a + 0);
Vector256<int> ymm1 = Avx.LoadVector256(a + 8);
Vector256<int> ymm2 = Avx.LoadVector256(a + 16);
Vector256<int> ymm3 = Avx.LoadVector256(a + 24);
ymm0 = Avx2.Add(ymm0, ymm1);
ymm2 = Avx2.Add(ymm2, ymm3);
ymm0 = Avx2.Add(ymm0, ymm2);
// This address-taken local will be forced to the stack
Vector256<int> ymm4 = ymm0;
int* t = (int*)&ymm4;
// RyuJIT unrolls loops of Vector<T>.Count,
// Vector128<T>.Count, and Vector256<T>.Count
int r = 0;
for (int i = 0; i < Vector256<int>.Count; ++i)
r += *(t + i);
return r;
}
compiles to:
Program.F(Int32*)
L0000: sub rsp, 0x38
L0004: vzeroupper
L0007: vmovdqu ymm0, [rcx]
L000b: vmovdqu ymm1, [rcx+0x20]
L0010: vmovdqu ymm2, [rcx+0x40]
L0015: vmovdqu ymm3, [rcx+0x60]
L001a: vpaddd ymm2, ymm2, ymm3
L001e: vpaddd ymm0, ymm0, ymm1
L0022: vpaddd ymm0, ymm0, ymm2
L0026: vmovupd [rsp], ymm0 ; write to the stack with no zeroing/canary
L002b: lea rax, [rsp]
L002f: mov edx, [rax] ; auto-unrolled loop
L0031: add edx, [rax+4]
L0034: add edx, [rax+8]
L0037: add edx, [rax+0xc]
L003a: add edx, [rax+0x10]
L003d: add edx, [rax+0x14]
L0040: add edx, [rax+0x18]
L0043: add edx, [rax+0x1c]
L0046: mov eax, edx
L0048: vzeroupper
L004b: add rsp, 0x38
L004f: ret
Note that the stack zeroing and the stack canary write, check, and possible throw are all gone, and the loop is auto-unrolled with more optimal scalar load/add code.
Beyond that, as other comments/answers have suggested, the spill to the stack and the scalar adds are unnecessary, because you can use SIMD instructions to add horizontally. RyuJIT will not do this for you like GCC can, but if you are explicit, you can get optimal SIMD ASM.
static unsafe int F(int* a)
{
Vector256<int> ymm0 = Avx.LoadVector256(a + 0);
Vector256<int> ymm1 = Avx.LoadVector256(a + 8);
// The load can be contained in the add if you use the load
// as an operand rather than declaring explicit locals
ymm0 = Avx2.Add(ymm0, Avx.LoadVector256(a + 16));
ymm1 = Avx2.Add(ymm1, Avx.LoadVector256(a + 24));
ymm0 = Avx2.Add(ymm0, ymm1);
// Add the upper 128-bit lane to the lower lane
Vector128<int> xmm0 = Sse2.Add(ymm0.GetLower(), ymm0.GetUpper());
// Add odd elements to even
xmm0 = Sse2.Add(xmm0, Sse2.Shuffle(xmm0, 0b_11_11_01_01));
// Add high half to low half
xmm0 = Sse2.Add(xmm0, Sse2.UnpackHigh(xmm0.AsInt64(), xmm0.AsInt64()).AsInt32());
// Extract low element
return xmm0.ToScalar();
}
compiles to:
Program.F(Int32*)
L0000: vzeroupper
L0003: vmovdqu ymm0, [rcx]
L0007: vmovdqu ymm1, [rcx+0x20]
L000c: vpaddd ymm0, ymm0, [rcx+0x40]
L0011: vpaddd ymm1, ymm1, [rcx+0x60]
L0016: vpaddd ymm0, ymm0, ymm1
L001a: vextracti128 xmm1, ymm0, 1
L0020: vpaddd xmm0, xmm0, xmm1
L0024: vpshufd xmm1, xmm0, 0xf5
L0029: vpaddd xmm0, xmm0, xmm1
L002d: vpunpckhqdq xmm1, xmm0, xmm0
L0031: vpaddd xmm0, xmm0, xmm1
L0035: vmovd eax, xmm0
L0039: vzeroupper
L003c: ret
which, aside from the overly conservative vzerouppers, is the same as you'd get from an optimizing C/C++ compiler.
vzeroupper can help performance.
The lines from L0007 through L0018 zero out the storage space used by the local variables.
The 0x7d847bd1f9ce value is related to detecting stack overruns: a check value is written, and when the function is done, the code looks to see whether that value has changed. If it has, it calls a diagnostic function.
The function body starts at L002c. First it initializes your local ymm variables, then it does the additions.
The lea at L004b materializes the address of t. The next instruction (L004f) is the Avx2.Store(t, ymm0); statement.
L0053 through L0063 is the for loop: rax already holds the address of t, ecx holds i, and edx holds r.
From L0065 to the end we have the return statement and the function epilog. The epilog checks whether the stack has been clobbered, does some cleanup, and returns to the caller.
Related
I have the following function (which I cleaned up a bit to make it easier to understand). It takes the destination array, gets the element at index n, adds src1[n] to it, and then multiplies the result by src2[n] (nothing too fancy):
static void F(long[] dst, long[] src1, long[] src2, ulong n)
{
dst[n] += src1[n];
dst[n] *= src2[n];
}
Now this generates the following ASM:
<Program>$.<<Main>$>g__F|0_0(Int64[], Int64[], Int64[], UInt64)
L0000: sub rsp, 0x28
L0004: test r9, r9
L0007: jl short L0051
L0009: mov rax, r9
L000c: mov r9d, [rcx+8]
L0010: movsxd r9, r9d
L0013: cmp rax, r9
L0016: jae short L0057
L0018: lea rcx, [rcx+rax*8+0x10]
L001d: mov r9, rcx
L0020: mov r10, [r9]
L0023: mov r11d, [rdx+8]
L0027: movsxd r11, r11d
L002a: cmp rax, r11
L002d: jae short L0057
L002f: add r10, [rdx+rax*8+0x10]
L0034: mov [r9], r10
L0037: mov edx, [r8+8]
L003b: movsxd rdx, edx
L003e: cmp rax, rdx
L0041: jae short L0057
L0043: imul r10, [r8+rax*8+0x10]
L0049: mov [rcx], r10
L004c: add rsp, 0x28
L0050: ret
L0051: call 0x00007ffc9dadb710
L0056: int3
L0057: call 0x00007ffc9dadbc70
L005c: int3
As you can see, it adds a bunch of checks. Because I can guarantee that n will be within the legal range, I can use pointers instead:
static unsafe void G(long* dst, long* src1, long* src2, ulong n)
{
dst[n] += src1[n];
dst[n] *= src2[n];
}
Now this generates much simpler ASM:
<Program>$.<<Main>$>g__G|0_1(Int64*, Int64*, Int64*, UInt64)
L0000: lea rax, [rcx+r9*8]
L0004: mov rcx, rax
L0007: mov rdx, [rdx+r9*8]
L000b: add [rcx], rdx
L000e: mov rdx, [rax] ; loads the value again?
L0011: imul rdx, [r8+r9*8]
L0016: mov [rax], rdx
L0019: ret
As you may have noticed, there is an extra MOV there (I think; at least I can't see a reason for it).
Question
How can I remove that line? In C I could use the restrict keyword, if I'm not wrong. Is there such a keyword in C#? I couldn't find anything on the internet, sadly.
Note
Here is SharpLab link.
Here is the C example:
void
f(int64_t *dst,
int64_t *src1,
int64_t *src2,
uint64_t n) {
dst[n] += src1[n];
dst[n] *= src2[n];
}
void
g(int64_t *restrict dst,
int64_t *restrict src1,
int64_t *restrict src2,
uint64_t n) {
dst[n] += src1[n];
dst[n] *= src2[n];
}
this generates:
f:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
mov QWORD PTR [rdx], rax ; this is strange. It loads the value back to [RDX]?
; shouldn't that be other way around? I don't know.
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
g:
mov r10, rdx
lea rdx, [rcx+r9*8]
mov rax, QWORD PTR [rdx]
add rax, QWORD PTR [r10+r9*8]
imul rax, QWORD PTR [r8+r9*8]
mov QWORD PTR [rdx], rax
ret
and here is the Godbolt link.
This:
dst[n] = (dst[n] + src1[n]) * src2[n];
removes that extra mov.
In C# there is no equivalent of the restrict qualifier from the C language.
In the C# ECMA-334:2017 language specification, chapter 23 (Unsafe Code), there is no syntax to specify that a region of memory must be accessed only through a specific pointer, and no syntax to specify that the memory regions pointed to by different pointers do not overlap. Thus there is no such equivalent. This is probably because C# is a managed language: unsafe syntax that allows working with pointers/unmanaged memory is an edge case in C#, and restrict on pointers would be an edge case of that edge case.
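A small C sketch shows why the reload is required for correctness when the compiler cannot rule out aliasing (illustrative; the function name is mine, the body mirrors the f above):

```c
#include <stdint.h>

/* Without restrict the compiler must assume dst may alias src1 or src2:
   the store in the first statement can change src2[n], so src2[n] has to
   be re-read before the multiply. */
void combine(int64_t *dst, int64_t *src1, int64_t *src2, uint64_t n)
{
    dst[n] += src1[n];  /* if dst == src2, this store also changes src2[n] */
    dst[n] *= src2[n];  /* must observe that change */
}
```

Calling `combine(a, b, a, 0)` with `a[0] = 3`, `b[0] = 4` must yield `(3 + 4) * 7 = 49`, not `(3 + 4) * 3 = 21`; caching `src2[n]` up front would compute the wrong answer, and that is exactly the assumption restrict (or the single-expression rewrite above) lets the compiler make.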
I noticed something odd when comparing the generated JIT of 2 methods which should perform the same.
To my surprise, the generated JIT had major differences, and its length was almost doubled for the supposedly simpler method M1.
The methods I compared were M1 and M2.
The number of assignments is the same, so the only difference should be how the bound checks are handled for each method.
using System;
public class C {
static void M1(int[] left, int[] right)
{
for (int i = 0; i < 5; i++)
{
left[i] = 1;
right[i] = 1;
}
}
static void M2(int[] left, int[] right)
{
for (int i = 0; i < 10; i+=2)
{
left[i] = 1;
right[i] = 1;
}
}
}
Generated JIT for each method:
C.M1(Int32[], Int32[])
L0000: sub rsp, 0x28
L0004: xor eax, eax
L0006: test rcx, rcx
L0009: setne r8b
L000d: movzx r8d, r8b
L0011: test rdx, rdx
L0014: setne r9b
L0018: movzx r9d, r9b
L001c: test r9d, r8d
L001f: je short L005c
L0021: cmp dword ptr [rcx+8], 5
L0025: setge r8b
L0029: movzx r8d, r8b
L002d: cmp dword ptr [rdx+8], 5
L0031: setge r9b
L0035: movzx r9d, r9b
L0039: test r9d, r8d
L003c: je short L005c
L003e: movsxd r8, eax
L0041: mov dword ptr [rcx+r8*4+0x10], 1
L004a: mov dword ptr [rdx+r8*4+0x10], 1
L0053: inc eax
L0055: cmp eax, 5
L0058: jl short L003e
L005a: jmp short L0082
L005c: cmp eax, [rcx+8]
L005f: jae short L0087
L0061: movsxd r8, eax
L0064: mov dword ptr [rcx+r8*4+0x10], 1
L006d: cmp eax, [rdx+8]
L0070: jae short L0087
L0072: mov dword ptr [rdx+r8*4+0x10], 1
L007b: inc eax
L007d: cmp eax, 5
L0080: jl short L005c
L0082: add rsp, 0x28
L0086: ret
L0087: call 0x00007ffc50fafc00
L008c: int3
C.M2(Int32[], Int32[])
L0000: sub rsp, 0x28
L0004: xor eax, eax
L0006: mov r8d, [rcx+8]
L000a: cmp eax, r8d
L000d: jae short L0036
L000f: movsxd r9, eax
L0012: mov dword ptr [rcx+r9*4+0x10], 1
L001b: cmp eax, [rdx+8]
L001e: jae short L0036
L0020: mov dword ptr [rdx+r9*4+0x10], 1
L0029: add eax, 2
L002c: cmp eax, 0xa
L002f: jl short L000a
L0031: add rsp, 0x28
L0035: ret
L0036: call 0x00007ffc50fafc00
L003b: int3
M1's length is double M2's!
What could explain this, and is it some kind of bug?
EDIT
Figured out that M1 creates a version of the loop without bound checks, and that's why M1 is longer. Still, the question remains: why does M1 perform worse, even though it doesn't perform bound checking at all?
I also ran BenchmarkDotNet and verified that M2 performs about 20%-30% faster than M1 for arrays of length 10.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.14393.3930 (1607/AnniversaryUpdate/Redstone1)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
Frequency=3515622 Hz, Resolution=284.4447 ns, Timer=TSC
.NET Core SDK=3.1.401
[Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
DefaultJob : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
| Method | Mean | Error | StdDev | Ratio |
|-------- |---------:|----------:|----------:|------:|
| M1Bench | 4.372 ns | 0.0215 ns | 0.0201 ns | 1.00 |
| M2Bench | 3.350 ns | 0.0340 ns | 0.0301 ns | 0.77 |
"But, there's a lot of overhead up front for M1() to know it can use the "fast" path... if your arrays aren't large enough, the overhead would dominate and produce counter-intuitive results." (Peter Duniho)
The overhead of choosing the path (in the generated code) for the optimized bound-check elimination applies to loops of the form:
for(int i = 0; i < array.Length; i++)
and won't be beneficial for smaller loops.
As loops grow larger, eliminating bound checks becomes more beneficial and surpasses the performance of the non-optimized path.
Examples of loops that are not optimized this way:
for(int i = 0; i < array.Length; i+=2)
for(int i = 0; i <= array.Length; i++)
for(int i = 0; i < array.Length / 2; i++)
One of my co-workers has been reading Clean Code by Robert C. Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequences of this methodology, so we wrote a quick program to test the performance and were confused by the results.
For starters, here is the normal version of the function:
static double NormalFunction()
{
double a = 0;
for (int j = 0; j < s_OuterLoopCount; ++j)
{
for (int i = 0; i < s_InnerLoopCount; ++i)
{
double b = i * 2;
a = a + b + 1;
}
}
return a;
}
Here is the version I made that breaks the functionality into small functions.
static double TinyFunctions()
{
double a = 0;
for (int i = 0; i < s_OuterLoopCount; i++)
{
a = Loop(a);
}
return a;
}
static double Loop(double a)
{
for (int i = 0; i < s_InnerLoopCount; i++)
{
double b = Double(i);
a = Add(a, Add(b, 1));
}
return a;
}
static double Double(double a)
{
return a * 2;
}
static double Add(double a, double b)
{
return a + b;
}
I use the Stopwatch class to time the functions, and when I ran it in debug I got the following results:
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;
These results make sense to me, especially in debug, since there is additional overhead from function calls. It is when I run it in release that I get the following results:
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;
These results confuse me: even if the compiler was optimizing TinyFunctions by inlining all the function calls, how could that make it ~57% faster?
We have tried moving variable declarations around in NormalFunction, and it has basically no effect on the run time.
I was hoping someone would know what is going on, and, if the compiler can optimize TinyFunctions so well, why it can't apply similar optimizations to NormalFunction.
Looking around, we found a mention that breaking functions out allows the JIT to better optimize what to put in registers, but NormalFunction only has 4 variables, so I find it hard to believe that explains the massive performance difference.
I'd be grateful for any insight someone can provide.
Update 1
As pointed out below by Kyle, changing the order of operations made a massive difference in the performance of NormalFunction.
static double NormalFunction()
{
double a = 0;
for (int j = 0; j < s_OuterLoopCount; ++j)
{
for (int i = 0; i < s_InnerLoopCount; ++i)
{
double b = i * 2;
a = b + 1 + a;
}
}
return a;
}
Here are the results with this configuration.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;
This is more what I expected, but it still leaves the question of why the order of operations can have a ~56% performance impact.
Furthermore, I then tried it with integer operations, and we are back to it not making any sense:
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;
And this doesn't change regardless of the order of operations.
I can make performance match much better by changing one line of code:
a = a + b + 1;
Change it to:
a = b + 1 + a;
Or:
a += b + 1;
Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:
int Double( int a ) { return a * 2; }
I thought of these changes because they were what differed between the two implementations. After this, their performance is very similar, with TinyFunctions being a few percent slower (as expected).
The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine-code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.
But that doesn't account for the bulk of the runtime discrepancy. That comes down almost entirely to the order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:
Original Changed
01070620 push ebp 01390620 push ebp
01070621 mov ebp,esp 01390621 mov ebp,esp
01070623 push edi 01390623 push edi
01070624 push esi 01390624 push esi
01070625 push eax 01390625 push eax
01070626 fldz 01390626 fldz
01070628 xor esi,esi 01390628 xor esi,esi
0107062A mov edi,dword ptr ds:[0FE43ACh] 0139062A mov edi,dword ptr ds:[12243ACh]
01070630 test edi,edi 01390630 test edi,edi
01070632 jle 0107065A 01390632 jle 0139065A
01070634 xor edx,edx 01390634 xor edx,edx
01070636 mov ecx,dword ptr ds:[0FE43B0h] 01390636 mov ecx,dword ptr ds:[12243B0h]
0107063C test ecx,ecx 0139063C test ecx,ecx
0107063E jle 01070655 0139063E jle 01390655
01070640 mov eax,edx 01390640 mov eax,edx
01070642 add eax,eax 01390642 add eax,eax
01070644 mov dword ptr [ebp-0Ch],eax 01390644 mov dword ptr [ebp-0Ch],eax
01070647 fild dword ptr [ebp-0Ch] 01390647 fild dword ptr [ebp-0Ch]
0107064A faddp st(1),st 0139064A fld1
0107064C fld1 0139064C faddp st(1),st
0107064E faddp st(1),st 0139064E faddp st(1),st
01070650 inc edx 01390650 inc edx
01070651 cmp edx,ecx 01390651 cmp edx,ecx
01070653 jl 01070640 01390653 jl 01390640
01070655 inc esi 01390655 inc esi
01070656 cmp esi,edi 01390656 cmp esi,edi
01070658 jl 01070634 01390658 jl 01390634
0107065A pop ecx 0139065A pop ecx
0107065B pop esi 0139065B pop esi
0107065C pop edi 0139065C pop edi
0107065D pop ebp 0139065D pop ebp
0107065E ret 0139065E ret
Which is opcode-for-opcode identical except for the order of the floating-point operations. That makes a huge performance difference, but I don't know enough about x86 floating-point operations to say exactly why. (One plausible factor: in the original order, both faddp instructions depend on the running total a, so both sit on the loop-carried dependency chain, while in the changed order b + 1 can be computed independently of a, shortening that chain.)
Update:
With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:
int b = 2 * i;
a = a + b + 1;
Into something like:
mov esi, eax ; b = i
add esi, esi ; b += b
lea ecx, [ecx + esi + 1] ; a = a + b + 1
Where a is stored in the ecx register, i in eax, and b in esi.
Whereas the TinyFunctions version gets turned into something like:
mov eax, edx
add eax, eax
inc eax
add ecx, eax
Where i is in edx, b is in eax, and a is in ecx this time around.
I suppose that for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance of the two to line up:
int b = 2 * i + 1;
a += b;
This ends up forcing the NormalFunction approach to be turned into mov, add, inc, add, as it appears in the TinyFunctions approach.
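For the integer case, that rewrite is an exact transformation, which is why the JIT is free to pick whichever instruction sequence is cheaper. A quick C sketch (names are mine) of the two variants discussed above:

```c
/* Both loops model the two C# forms discussed above; for integers,
   b = 2*i + 1; a += b; accumulates exactly the same value as
   b = 2*i; a = a + b + 1;, so only the emitted instructions differ. */
long sum_original(int n)
{
    long a = 0;
    for (int i = 0; i < n; ++i) {
        long b = i * 2;
        a = a + b + 1;          /* the lea-friendly form */
    }
    return a;
}

long sum_rewritten(int n)
{
    long a = 0;
    for (int i = 0; i < n; ++i) {
        long b = i * 2 + 1;
        a += b;                 /* the mov/add/inc/add form */
    }
    return a;
}
```

Note this equivalence is only guaranteed for integers; for doubles, (a + b) + 1 and (b + 1) + a can round differently, which is why the earlier floating-point reordering was a semantic change the compiler could not make on its own.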
Say I'm looping through 20/30 objects, or any other case where I'm dealing with smaller numbers: is it good practice to use short instead of int?
I mean, why isn't this common:
for(short i=0; i<x; i++)
Method(array[i]);
Is it because the performance gain is too low?
Thanks
"is it a good practice to use short instead of int?"
First of all, this is a micro-optimization that will not achieve the expected results: increased speed or efficiency.
Second: no, not really. The CLR internally still uses 32-bit integers (Int32) to perform the iteration; it basically converts the short to an Int32 for computation purposes during JIT compilation.
Third: array indexes are Int32, and the iterating short variable is automatically converted to Int32 when used as an array indexer.
If we take the following code:
var array = new object[32];
var x = array.Length;
for (short i = 0; i < x; i++)
Method(array[i]);
And disassemble it, you can clearly see at 00000089 inc eax that at the machine level a 32-bit register (eax) was used for the iteration variable, which is then truncated to 16 bits and sign-extended back at 0000008a movsx eax,ax. So there are no benefits to using a short as opposed to an int; in fact, there may be a slight performance loss due to the extra instructions that need to be executed.
00000042 nop
var array = new object[32];
00000043 mov ecx,64B41812h
00000048 mov edx,20h
0000004d call FFBC01A4
00000052 mov dword ptr [ebp-50h],eax
00000055 mov eax,dword ptr [ebp-50h]
00000058 mov dword ptr [ebp-40h],eax
var x = array.Length;
0000005b mov eax,dword ptr [ebp-40h]
0000005e mov eax,dword ptr [eax+4]
00000061 mov dword ptr [ebp-44h],eax
for (short i = 0; i < x; i++)
00000064 xor edx,edx
00000066 mov dword ptr [ebp-48h],edx
00000069 nop
0000006a jmp 00000090
Method(array[i]);
0000006c mov eax,dword ptr [ebp-48h]
0000006f mov edx,dword ptr [ebp-40h]
00000072 cmp eax,dword ptr [edx+4]
00000075 jb 0000007C
00000077 call 657A28F6
0000007c mov ecx,dword ptr [edx+eax*4+0Ch]
00000080 call FFD9A708
00000085 nop
for (short i = 0; i < x; i++)
00000086 mov eax,dword ptr [ebp-48h]
00000089 inc eax
0000008a movsx eax,ax
0000008d mov dword ptr [ebp-48h],eax
00000090 mov eax,dword ptr [ebp-48h]
00000093 cmp eax,dword ptr [ebp-44h]
00000096 setl al
00000099 movzx eax,al
0000009c mov dword ptr [ebp-4Ch],eax
0000009f cmp dword ptr [ebp-4Ch],0
000000a3 jne 0000006C
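The same promotion exists in C, which makes it easy to check (illustrative sketch; the function name is mine):

```c
#include <stddef.h>

/* In C, as in the CLR, arithmetic on a short is performed at int width:
   the operands undergo the usual arithmetic conversions, so `i + 1`
   below has type int even though i is a short. */
size_t width_of_short_arithmetic(void)
{
    short i = 0;
    return sizeof(i + 1);  /* sizeof(int), not sizeof(short) */
}
```

This is the same widen-compute-narrow dance the disassembly shows: the increment happens in a full-width register and the result is squeezed back into the short on every iteration.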
Yes, the performance difference is negligible. However, a short uses 16 bits instead of the 32 of an int, so it's conceivable that you may want to use a short if you're storing enough small numbers.
In general, using numbers that match the word size of the processor is relatively faster than using unmatched sizes. On the other hand, a short uses less memory than an int.
If you have limited memory, using short may be an alternative, but personally I have never encountered such a situation when writing C# applications.
An int uses 32 bits of memory, a short uses 16 bits, and a byte uses 8 bits. If you're only looping through 20/30 objects and you're concerned about memory usage, use byte instead.
Catering to memory usage at this level is rarely required with today's machines, though you could argue that using int everywhere is just lazy. Personally, I try to always use the relevant type that uses the least memory.
http://msdn.microsoft.com/en-us/library/5bdb6693(v=vs.100).aspx
The following code gives different output when running the release inside Visual Studio, and running the release outside Visual Studio. I'm using Visual Studio 2008 and targeting .NET 3.5. I've also tried .NET 3.5 SP1.
When running outside Visual Studio, the JIT should kick in. Either (a) there's something subtle going on with C# that I'm missing, or (b) the JIT is actually in error. I'm doubtful that the JIT can go wrong, but I'm running out of other possibilities...
Output when running inside Visual Studio:
0 0,
0 1,
1 0,
1 1,
Output when running release outside of Visual Studio:
0 2,
0 2,
1 2,
1 2,
What is the reason?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Test
{
struct IntVec
{
public int x;
public int y;
}
interface IDoSomething
{
void Do(IntVec o);
}
class DoSomething : IDoSomething
{
public void Do(IntVec o)
{
Console.WriteLine(o.x.ToString() + " " + o.y.ToString()+",");
}
}
class Program
{
static void Test(IDoSomething oDoesSomething)
{
IntVec oVec = new IntVec();
for (oVec.x = 0; oVec.x < 2; oVec.x++)
{
for (oVec.y = 0; oVec.y < 2; oVec.y++)
{
oDoesSomething.Do(oVec);
}
}
}
static void Main(string[] args)
{
Test(new DoSomething());
Console.ReadLine();
}
}
}
It is a JIT optimizer bug. It is unrolling the inner loop but not updating the oVec.y value properly:
for (oVec.x = 0; oVec.x < 2; oVec.x++) {
0000000a xor esi,esi ; oVec.x = 0
for (oVec.y = 0; oVec.y < 2; oVec.y++) {
0000000c mov edi,2 ; oVec.y = 2, WRONG!
oDoesSomething.Do(oVec);
00000011 push edi
00000012 push esi
00000013 mov ecx,ebx
00000015 call dword ptr ds:[00170210h] ; first unrolled call
0000001b push edi ; WRONG! does not increment oVec.y
0000001c push esi
0000001d mov ecx,ebx
0000001f call dword ptr ds:[00170210h] ; second unrolled call
for (oVec.x = 0; oVec.x < 2; oVec.x++) {
00000025 inc esi
00000026 cmp esi,2
00000029 jl 0000000C
The bug disappears when you let oVec.y iterate to 4; that's too many calls to unroll.
One workaround is this (it assumes you add an (x, y) constructor to IntVec):
for (int x = 0; x < 2; x++) {
for (int y = 0; y < 2; y++) {
oDoesSomething.Do(new IntVec(x, y));
}
}
UPDATE: re-checked in August 2012; this bug was fixed in the 4.0.30319 jitter but is still present in the v2.0.50727 jitter. It seems unlikely they'll fix the old version after this long.
I believe this is a genuine JIT compilation bug. I would report it to Microsoft and see what they say. Interestingly, I found that the x64 JIT does not have the same problem.
Here is my reading of the x86 JIT.
// save context
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
// put oDoesSomething pointer in ebx
00000006 mov ebx,ecx
// zero out edi, this will store oVec.y
00000008 xor edi,edi
// zero out esi, this will store oVec.x
0000000a xor esi,esi
// NOTE: the inner loop is unrolled here.
// set oVec.y to 2
0000000c mov edi,2
// call oDoesSomething.Do(oVec) -- y is always 2!?!
00000011 push edi
00000012 push esi
00000013 mov ecx,ebx
00000015 call dword ptr ds:[002F0010h]
// call oDoesSomething.Do(oVec) -- y is always 2?!?!
0000001b push edi
0000001c push esi
0000001d mov ecx,ebx
0000001f call dword ptr ds:[002F0010h]
// increment oVec.x
00000025 inc esi
// loop back to 0000000C if oVec.x < 2
00000026 cmp esi,2
00000029 jl 0000000C
// restore context and return
0000002b pop ebx
0000002c pop esi
0000002d pop edi
0000002e pop ebp
0000002f ret
This looks like an optimization gone bad to me...
I copied your code into a new Console App.
Debug Build
Correct output with both debugger and no debugger
Switched to Release Build
Again, correct output both times
Created a new x86 configuration (I'm running x64 Windows 2008 and was using 'Any CPU')
Debug Build
Got the correct output both F5 and CTRL+F5
Release Build
Correct output with Debugger attached
No debugger - Got the incorrect output
So it is the x86 JIT incorrectly generating the code. I have deleted my original text about reordering of loops, etc. A few other answers here have confirmed that the JIT is unrolling the loop incorrectly on x86.
To fix the problem, you can change the declaration of IntVec to a class, and it works in all flavours.
I think this needs to go on MS Connect...
-1 to Microsoft!