C# Unexpected loop performance. Possible JIT bound check bug? [closed]

I noticed something odd when comparing the JIT-generated code of two methods that should perform the same.
To my surprise, the generated code had major differences, and its length was almost doubled for the supposedly simpler method M1.
The methods I compared were M1 and M2.
The number of assignments is the same, so the only difference should be how the bounds checks are handled for each method.
using System;
public class C {
    static void M1(int[] left, int[] right)
    {
        for (int i = 0; i < 5; i++)
        {
            left[i] = 1;
            right[i] = 1;
        }
    }

    static void M2(int[] left, int[] right)
    {
        for (int i = 0; i < 10; i += 2)
        {
            left[i] = 1;
            right[i] = 1;
        }
    }
}
Generated JIT for each method:
C.M1(Int32[], Int32[])
L0000: sub rsp, 0x28
L0004: xor eax, eax
L0006: test rcx, rcx
L0009: setne r8b
L000d: movzx r8d, r8b
L0011: test rdx, rdx
L0014: setne r9b
L0018: movzx r9d, r9b
L001c: test r9d, r8d
L001f: je short L005c
L0021: cmp dword ptr [rcx+8], 5
L0025: setge r8b
L0029: movzx r8d, r8b
L002d: cmp dword ptr [rdx+8], 5
L0031: setge r9b
L0035: movzx r9d, r9b
L0039: test r9d, r8d
L003c: je short L005c
L003e: movsxd r8, eax
L0041: mov dword ptr [rcx+r8*4+0x10], 1
L004a: mov dword ptr [rdx+r8*4+0x10], 1
L0053: inc eax
L0055: cmp eax, 5
L0058: jl short L003e
L005a: jmp short L0082
L005c: cmp eax, [rcx+8]
L005f: jae short L0087
L0061: movsxd r8, eax
L0064: mov dword ptr [rcx+r8*4+0x10], 1
L006d: cmp eax, [rdx+8]
L0070: jae short L0087
L0072: mov dword ptr [rdx+r8*4+0x10], 1
L007b: inc eax
L007d: cmp eax, 5
L0080: jl short L005c
L0082: add rsp, 0x28
L0086: ret
L0087: call 0x00007ffc50fafc00
L008c: int3
C.M2(Int32[], Int32[])
L0000: sub rsp, 0x28
L0004: xor eax, eax
L0006: mov r8d, [rcx+8]
L000a: cmp eax, r8d
L000d: jae short L0036
L000f: movsxd r9, eax
L0012: mov dword ptr [rcx+r9*4+0x10], 1
L001b: cmp eax, [rdx+8]
L001e: jae short L0036
L0020: mov dword ptr [rdx+r9*4+0x10], 1
L0029: add eax, 2
L002c: cmp eax, 0xa
L002f: jl short L000a
L0031: add rsp, 0x28
L0035: ret
L0036: call 0x00007ffc50fafc00
L003b: int3
M1's code is double the length of M2's!
What could explain this and is it some kind of bug?
EDIT
Figured out that the JIT emits a second version of M1's loop without bounds checks, which is why M1's code is longer. Still, the question remains: why does M1 perform worse, even though it doesn't perform bounds checking at all?
I also ran BenchmarkDotNet and verified that M2 performs about 20%-30% faster than M1 for arrays of length 10:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.14393.3930 (1607/AnniversaryUpdate/Redstone1)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
Frequency=3515622 Hz, Resolution=284.4447 ns, Timer=TSC
.NET Core SDK=3.1.401
[Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
DefaultJob : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
| Method | Mean | Error | StdDev | Ratio |
|-------- |---------:|----------:|----------:|------:|
| M1Bench | 4.372 ns | 0.0215 ns | 0.0201 ns | 1.00 |
| M2Bench | 3.350 ns | 0.0340 ns | 0.0301 ns | 0.77 |
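The benchmark bodies aren't shown above; a minimal BenchmarkDotNet harness consistent with these results might look like this (a sketch, assuming M1 and M2 are made accessible to the benchmark class):
using BenchmarkDotNet.Attributes;

public class Bench
{
    private readonly int[] _left = new int[10];
    private readonly int[] _right = new int[10];

    [Benchmark(Baseline = true)]
    public void M1Bench() => C.M1(_left, _right);

    [Benchmark]
    public void M2Bench() => C.M2(_left, _right);
}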

But, there's a lot of overhead up front for M1() to know it can use
the "fast" path...if your arrays aren't large enough, the overhead
would dominate and produce counter-intuitive results.
Peter Duniho
The overhead of choosing the path (emitted by the JIT) for the optimized bounds checking of loops of the form:
for (int i = 0; i < array.Length; i++)
isn't beneficial for smaller loops.
As loops grow larger, eliminating the bounds checks becomes more beneficial and overtakes the performance of the non-optimized path (a span-based alternative is sketched below).
Examples of loops that do not get the optimized path:
for (int i = 0; i < array.Length; i += 2)
for (int i = 0; i <= array.Length; i++)
for (int i = 0; i < array.Length / 2; i++)
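A hedged aside (not from the original question): when the loop bound is a constant smaller than the array, as in M1, slicing to a span first is one way to get a check-free loop without the dual-path cloning, since i < span.Length is a shape RyuJIT recognizes (worth verifying against the disassembly):
static void M1Sliced(int[] left, int[] right)
{
    // Throws up front if an array is shorter than 5; inside the loop,
    // the i < l.Length shape lets the JIT elide the bounds checks.
    Span<int> l = left.AsSpan(0, 5);
    Span<int> r = right.AsSpan(0, 5);
    for (int i = 0; i < l.Length; i++)
    {
        l[i] = 1;
        r[i] = 1;
    }
}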

Related

20x performance difference Interlocked.Read vs Interlocked.CompareExchange though both implemented with lock cmpxchg

Question
I wanted to write a small profiler class that allows me to measure the run time of hot paths throughout the application. In doing so, I discovered an interesting performance difference between two possible implementations that I cannot explain, but would like to understand.
Setup
The idea is as follows:
// somewhere accessible
public static HotPathProfiler Profiler = new HotPathProfiler("some name", enabled: true);

// within the program
long ticket = Profiler.Enter();
... // hot path
var result = Profiler.Exit(ticket: ticket);
As there aren't many of these hot paths running in parallel, the idea is to implement this via an array that holds a timestamp per slot (0 when the slot is free), with Enter() returning the slot's index (called a ticket). So the class looks like the following:
public class HotPathProfiler
{
    private readonly string _name;
    private readonly bool _enabled;
    private readonly long[] _ticketList;

    public HotPathProfiler(string name, bool enabled)
    {
        _name = name;
        _enabled = enabled;
        _ticketList = new long[128];
    }
}
If code Enter()s and none of the 128 tickets is available, -1 will be returned, which the Exit(ticket) function can handle by returning early.
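Exit(ticket) itself isn't shown in the question; a minimal sketch consistent with that description (the actual recording step is left open) might be:
public void Exit(long ticket)
{
    if (ticket < 0)
        return; // Enter() found no free slot; nothing to record

    // Free the slot and recover the entry timestamp in one atomic step.
    long start = Interlocked.Exchange(ref _ticketList[ticket], 0);
    long elapsedTicks = Stopwatch.GetTimestamp() - start;
    // ... record elapsedTicks under _name ...
}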
When thinking about how to implement the Enter() call, I saw the Interlocked.Read method, which can atomically read 64-bit values on 32-bit systems, while, according to the documentation, it is unnecessary on 64-bit systems.
So I went on and implemented various types of Enter() methods, including one with Interlocked.Read and one with Interlocked.CompareExchange, and compared them with BenchmarkDotNet. That's where I discovered an enormous performance difference:
| Method | Mean | Error | StdDev | Code Size |
|------------- |----------:|---------:|---------:|----------:|
| SafeArray | 28.64 ns | 0.573 ns | 0.536 ns | 295 B |
| SafeArrayCAS | 744.75 ns | 8.741 ns | 7.749 ns | 248 B |
The benchmarks for both look pretty much the same:
[Benchmark]
public void SafeArray()
{
    // doesn't matter if 'i < 1' or 'i < 10'
    // performance differs by the same factor (approx. 20x)
    for (int i = 0; i < 1; i++)
    {
        _ticketArr[i] = _hpp_sa.EnterSafe();
        // SafeArrayCAS:
        // _ticketArr[i] = _hpp_sa_cas.EnterSafe();
    }
}
Implementations
Again, free slots hold the value 0, occupied slots some other value (a timestamp). Enter() is supposed to return the index/ticket of the slot.
SafeArrayCAS (slow)
public long EnterSafe()
{
    if (!_enabled)
    {
        return -1;
    }
    long last = 0;
    long ts = Stopwatch.GetTimestamp();
    long val;
    do
    {
        val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
        last++;
    } while (val != 0 && last < 128);
    return val == 0 ? last : -1;
}
SafeArray (fast)
public long EnterSafe()
{
    if (!_enabled)
    {
        return -1;
    }
    long last = 0;
    long val;
    do
    {
        val = Interlocked.Read(ref _ticketList[last]);
        last++;
    } while (val != 0 && last < 128);
    if (val != 0)
    {
        return -1;
    }
    long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
    if (prev != 0)
    {
        return -1;
    }
    return last;
}
Enter rabbit hole
Now, one would say it's no surprise to see a difference, since the slow method always tries to CAS an entry, while the other one only lazily reads each entry and then attempts a CAS just once.
But besides the fact that the benchmark only does one Enter(), i.e. a single pass through the while loop, which shouldn't make that much (20x) of a difference, it is even harder to explain once you realize the atomic read is itself implemented as a CAS:
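For reference, Interlocked.Read really is just a CompareExchange against zero; paraphrasing the reference source:
public static long Read(ref long location)
{
    // Compare with 0 and "replace" with 0: the stored value never changes,
    // but the 64-bit read happens atomically.
    return Interlocked.CompareExchange(ref location, 0, 0);
}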
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long last = 0;
00007FF82D048FCE xor edi,edi
long ts = Stopwatch.GetTimestamp();
00007FF82D048FD0 lea rcx,[rsp+28h]
00007FF82D048FD5 call CLRStub[JumpStub]#7ff82d076d70 (07FF82D076D70h)
00007FF82D048FDA mov rsi,qword ptr [rsp+28h]
00007FF82D048FDF mov rax,7FF88CF3E07Ch
00007FF82D048FE9 cmp dword ptr [rax],0
00007FF82D048FEC jne HotPathProfilerSafeArrayCAS.EnterSafe()+0A6h (07FF82D049046h)
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
00007FF82D048FEE mov rbx,qword ptr [rsp+50h]
00007FF82D048FF3 mov rax,qword ptr [rbx+10h]
00007FF82D048FF7 mov edx,dword ptr [rax+8]
00007FF82D048FFA movsxd rdx,edx
00007FF82D048FFD cmp rdi,rdx
00007FF82D049000 jae HotPathProfilerSafeArrayCAS.EnterSafe()+0ADh (07FF82D04904Dh)
00007FF82D049002 lea rdx,[rax+rdi*8+10h]
00007FF82D049007 xor eax,eax
00007FF82D049009 lock cmpxchg qword ptr [rdx],rsi
last++;
00007FF82D04900E inc rdi
} while (val != 0 && last < 128);
00007FF82D049011 test rax,rax
00007FF82D049014 je HotPathProfilerSafeArrayCAS.EnterSafe()+084h (07FF82D049024h)
00007FF82D049016 cmp rdi,80h
00007FF82D04901D mov qword ptr [rsp+50h],rbx
00007FF82D049022 jl HotPathProfilerSafeArrayCAS.EnterSafe()+04Eh (07FF82D048FEEh)
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long last = 0;
00007FF82D046C74 xor esi,esi
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
00007FF82D046C76 mov rax,qword ptr [rcx+10h]
00007FF82D046C7A mov edx,dword ptr [rax+8]
00007FF82D046C7D movsxd rdx,edx
00007FF82D046C80 cmp rsi,rdx
00007FF82D046C83 jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82D046D2Ch)
00007FF82D046C89 lea rdx,[rax+rsi*8+10h]
00007FF82D046C8E xor r8d,r8d
00007FF82D046C91 xor eax,eax
00007FF82D046C93 lock cmpxchg qword ptr [rdx],r8
last++;
00007FF82D046C98 inc rsi
} while (val != 0 && last < 128);
00007FF82D046C9B test rax,rax
00007FF82D046C9E je HotPathProfilerSafeArray.EnterSafe()+059h (07FF82D046CA9h)
00007FF82D046CA0 cmp rsi,80h
00007FF82D046CA7 jl HotPathProfilerSafeArray.EnterSafe()+026h (07FF82D046C76h)
if (val != 0)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
00007FF82FBA6ADF mov rcx,qword ptr [rcx+10h]
00007FF82FBA6AE3 mov eax,dword ptr [rcx+8]
00007FF82FBA6AE6 movsxd rax,eax
00007FF82FBA6AE9 cmp rsi,rax
00007FF82FBA6AEC jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82FBA6B4Ch)
00007FF82FBA6AEE lea rdi,[rcx+rsi*8+10h]
00007FF82FBA6AF3 mov qword ptr [rsp+28h],rdi
00007FF82FBA6AF8 lea rcx,[rsp+30h]
00007FF82FBA6AFD call CLRStub[JumpStub]#7ff82d076d70 (07FF82D076D70h)
00007FF82FBA6B02 mov rdx,qword ptr [rsp+30h]
00007FF82FBA6B07 xor eax,eax
00007FF82FBA6B09 mov rdi,qword ptr [rsp+28h]
00007FF82FBA6B0E lock cmpxchg qword ptr [rdi],rdx
00007FF82FBA6B13 mov rdi,rax
00007FF82FBA6B16 mov rax,7FF88CF3E07Ch
00007FF82FBA6B20 cmp dword ptr [rax],0
00007FF82FBA6B23 jne HotPathProfilerSafeArray.EnterSafe()+0D5h (07FF82FBA6B45h)
if (prev != 0)
[...] // omitted for brevity
Summary
I ran everything on Win10, x64 Release build, on a Xeon E-2176G (6-core Coffee Lake) CPU. The assembler output is from Visual Studio, but matches the DisassemblyDiagnoser output of BenchmarkDotNet.
Setting aside the hows and whys of doing this at all, I simply cannot explain the performance difference between these two methods. It shouldn't be this large, I would guess. Can it be BenchmarkDotNet itself? Am I missing something else?
Feels like I have a blind spot in my understanding of this low-level stuff that I'd like to shine some light on... thanks!
PS:
What I've tried so far:
Rearranging the order of the benchmark runs
Deferring the GetTimestamp() call in the slow method
Doing some initialization/test calls before the benchmark run (though I guess that's covered anyway by BenchmarkDotNet)

Understanding C# SIMD output

I have the following snippet, which sums all the elements of an array (the size is hardcoded to 32):
static unsafe int F(int* a)
{
    Vector256<int> ymm0 = Avx2.LoadVector256(a + 0);
    Vector256<int> ymm1 = Avx2.LoadVector256(a + 8);
    Vector256<int> ymm2 = Avx2.LoadVector256(a + 16);
    Vector256<int> ymm3 = Avx2.LoadVector256(a + 24);

    ymm0 = Avx2.Add(ymm0, ymm1);
    ymm2 = Avx2.Add(ymm2, ymm3);
    ymm0 = Avx2.Add(ymm0, ymm2);

    const int s = 256 / 32;
    int* t = stackalloc int[s];
    Avx2.Store(t, ymm0);

    int r = 0;
    for (int i = 0; i < s; ++i)
        r += t[i];
    return r;
}
This generates the following ASM:
Program.F(Int32*)
L0000: sub rsp, 0x28
L0004: vzeroupper ; Question #1
L0007: vxorps xmm4, xmm4, xmm4
L000b: vmovdqa [rsp], xmm4 ; Question #2
L0010: vmovdqa [rsp+0x10], xmm4 ; Question #2
L0016: xor eax, eax ; Question #3
L0018: mov [rsp+0x20], rax
L001d: mov rax, 0x7d847bd1f9ce ; Question #4
L0027: mov [rsp+0x20], rax
L002c: vmovdqu ymm0, [rcx]
L0030: vmovdqu ymm1, [rcx+0x20]
L0035: vmovdqu ymm2, [rcx+0x40]
L003a: vmovdqu ymm3, [rcx+0x60]
L003f: vpaddd ymm0, ymm0, ymm1
L0043: vpaddd ymm2, ymm2, ymm3
L0047: vpaddd ymm0, ymm0, ymm2
L004b: lea rax, [rsp] ; Question #5
L004f: vmovdqu [rax], ymm0
L0053: xor edx, edx ; Question #5
L0055: xor ecx, ecx ; Question #5
L0057: movsxd r8, ecx
L005a: add edx, [rax+r8*4]
L005e: inc ecx
L0060: cmp ecx, 8
L0063: jl short L0057
L0065: mov eax, edx
L0067: mov rcx, 0x7d847bd1f9ce ; Question #4
L0071: cmp [rsp+0x20], rcx
L0076: je short L007d
L0078: call 0x00007ffc9de2d430 ; Question #6
L007d: nop
L007e: vzeroupper
L0081: add rsp, 0x28
L0085: ret
Questions
Why do we need VZEROUPPER at the beginning? Wouldn't it be perfectly fine without it?
What do the VMOVDQAs at the beginning do? Or rather, why are they there?
Why zero out the EAX register? It's probably related to the next line, MOV [RSP+0x20], RAX, but I still can't understand it.
What does this mysterious value (0x7d847bd1f9ce) do?
There are also lines in between whose purpose I cannot understand (see the "Question #5" comments in the code).
I'm assuming the line L0078: call 0x00007ffc9de2d430 throws an exception. Is there a function or something in my code that can throw an exception?
I know this is a lot of questions, but I can't separate them because I think they are related to each other. To be crystal clear: I'm just trying to understand the generated ASM here. I'm not a professional in this area.
Note
In case you're wondering what GCC (O2) generates, here is the result:
int32_t
f(int32_t *a) {
    __m256i ymm0;
    __m256i ymm1;
    __m256i ymm2;
    __m256i ymm3;

    ymm0 = _mm256_load_si256((__m256i*)(a + 0));
    ymm1 = _mm256_load_si256((__m256i*)(a + 8));
    ymm2 = _mm256_load_si256((__m256i*)(a + 16));
    ymm3 = _mm256_load_si256((__m256i*)(a + 24));

    ymm0 = _mm256_add_epi32(ymm0, ymm1);
    ymm2 = _mm256_add_epi32(ymm2, ymm3);
    ymm0 = _mm256_add_epi32(ymm0, ymm2);

    int32_t t[8];
    _mm256_store_si256((__m256i*)t, ymm0);

    int32_t r;
    r = 0;
    for (int i = 0; i < 8; ++i)
        r += t[i];
    return r;
}
And the generated ASM:
f:
push rbp
xor r8d, r8d
mov rbp, rsp
and rsp, -32
lea rax, [rsp-32]
mov rdx, rsp
vmovdqa ymm1, YMMWORD PTR [rdi+96]
vpaddd ymm0, ymm1, YMMWORD PTR [rdi+64]
vpaddd ymm0, ymm0, YMMWORD PTR [rdi+32]
vpaddd ymm0, ymm0, YMMWORD PTR [rdi]
vmovdqa YMMWORD PTR [rsp-32], ymm0
.L2:
add r8d, DWORD PTR [rax]
add rax, 4
cmp rax, rdx
jne .L2
mov eax, r8d
vzeroupper
leave
ret
I think it optimized (maybe heavily) my code here, but whatever.
Why do we need VZEROUPPER at the beginning? Wouldn't it be perfectly fine without it?
Inserting vzeroupper at the beginning may be a workaround for a library or some other third-party code that is known to forget to clean its uppers (to protect SSE code). But you're not using SSE code, you only have AVX code, so yes, it's not needed at the beginning.
Your code uses VEX-encoded instructions (the v prefix), which means it will not encounter the "false dependency" (transition penalty) problem (Why is this SSE code 6 times slower without VZEROUPPER on Skylake?). And on top of that, you're using ymm vectors immediately (entering the Dirty Upper State), which means any reasoning about power management/frequency scaling does not apply here either (Dynamically determining where a rogue AVX-512 instruction is executing - mentions a forgotten vzeroupper causing reduced frequency for the entire app).
What do the VMOVDQAs at the beginning do? Or rather, why are they there?
L0007: vxorps xmm4, xmm4, xmm4
L000b: vmovdqa [rsp], xmm4 ; Question #2
L0010: vmovdqa [rsp+0x10], xmm4 ; Question #2
Why is it zeroing out memory that you're going to fully overwrite? My guess is that the compiler does not fully compute the write coverage of the loop, so it does not know you will overwrite all of it, and zeros the buffer just in case.
Why zero out the EAX register? It's probably related to the next line, MOV [RSP+0x20], RAX, but I still can't understand it.
L0016: xor eax, eax ; Question #3
L0018: mov [rsp+0x20], rax
L001d: mov rax, 0x7d847bd1f9ce ; Question #4
L0027: mov [rsp+0x20], rax
So it writes a 64-bit zero at address rsp+0x20 and then overwrites the same memory region with a stack canary. Why does it need to write a zero there first? I don't know; it looks like a missed optimization.
What does this mysterious value (0x7d847bd1f9ce) do?
I'm assuming the line L0078: call 0x00007ffc9de2d430 throws an exception. Is there a function or something in my code that can throw an exception?
As already mentioned, it's the stack canary used to detect buffer overruns.
"The use of stackalloc automatically enables buffer overrun detection features in the common language runtime (CLR). If a buffer overrun is detected, the process is terminated as quickly as possible to minimize the chance that malicious code is executed" - quote from https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/stackalloc
It writes a value that it knows at the end of the stack buffer, then executes your loop, and then checks whether that value changed (if it did, your loop wrote out of bounds). Note that this is a huge stack canary; I'm not sure why it has to be 64-bit. Unless there is a good reason for that, I would consider it a missed optimization: it's large in code size and in the uop cache, and it forces the compiler to emit more instructions (it always has to use a mov, since a 64-bit constant can't be the immediate operand of any other instruction, such as cmp or a store mov).
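The mechanism, hand-rolled as a runnable sketch (illustrating the pattern, not the actual emitted code):
static unsafe void CanaryDemo()
{
    // Plant a known value just past the "real" buffer, run the code,
    // then verify the value survived.
    int* t = stackalloc int[8 + 2];  // 8 payload ints plus room for an 8-byte guard
    long* guard = (long*)(t + 8);
    *guard = 0x7d847bd1f9ce;         // plant the canary
    for (int i = 0; i < 8; i++)      // in-bounds writes leave it intact
        t[i] = i;
    if (*guard != 0x7d847bd1f9ce)    // the moral equivalent of the epilog check
        Environment.FailFast("Stack buffer overrun");
}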
Also, a note on the canary-checking code:
L0071: cmp [rsp+0x20], rcx
L0076: je short L007d
L0078: call 0x00007ffc9de2d430 ; Question #6
L007d: nop
The fall-through path should be the most likely taken path. Here, the fall-through path is the "throw exception" call, which shouldn't be the normal case, so this may be another missed optimization. How it could affect performance: if this code is not in the branch history, it will suffer a branch miss; if it's predicted correctly, it will be fine. There is also an indirect effect: taken branches occupy space in the branch predictor's history, so if this branch were never taken, it would be cheaper.
There are also lines in between whose purpose I cannot understand (see the "Question #5" comments in the code).
L004b: lea rax, [rsp] ; Question #5
L004f: vmovdqu [rax], ymm0
L0053: xor edx, edx ; Question #5
L0055: xor ecx, ecx ; Question #5
The LEA is not needed here. My guess is that it's related to how the compiler does register allocation/stack management, so it's just a quirk of the compiler (rsp can't be allocated like a normal register; it's always used as the stack pointer, so it has to be treated specially).
Zeroing edx: it's used as the accumulator for the final result. Zeroing ecx: it's used as the counter in the loop that follows.
About the horizontal sum at the end:
In general, when storing to and reading from the same location with a different offset/size, you need to check the store-forwarding rules for your target CPU to avoid a penalty (you can find those at https://www.agner.org/optimize/#manuals; Intel and AMD list the rules in their guides as well). If you're targeting modern CPUs (Skylake/Zen), you shouldn't suffer a store-forwarding stall in your case, but there are still faster ways to sum up a vector horizontally (with the bonus of avoiding the missed optimizations related to the stack buffer).
Check out this nice writeup on good ways to sum a vector horizontally: https://stackoverflow.com/a/35270026/899255
You could also check out how a compiler does it: https://godbolt.org/z/q74abrqzh (GCC at -O3).
@stepan explained the RyuJIT-generated code quite well, but I thought I would address the question of why the GCC code is so different and why RyuJIT missed so many potential optimizations.
The short answer is that being Just In Time, RyuJIT has a very limited time budget in which to optimize, so it optimizes for frequently-used patterns. In your case the JIT may be taking your code a bit too literally, while GCC is able to capture your intent a bit better.
The stack canary code can be eliminated simply by removing the stackalloc and using a Vector256<T> local instead. Additionally, the loop over the stack values is missing a few optimizations, like your i variable being sign-extended on each iteration. This version of your method resolves both of those issues by helping the JIT out with things it knows how to optimize.
static unsafe int F(int* a)
{
    Vector256<int> ymm0 = Avx.LoadVector256(a + 0);
    Vector256<int> ymm1 = Avx.LoadVector256(a + 8);
    Vector256<int> ymm2 = Avx.LoadVector256(a + 16);
    Vector256<int> ymm3 = Avx.LoadVector256(a + 24);

    ymm0 = Avx2.Add(ymm0, ymm1);
    ymm2 = Avx2.Add(ymm2, ymm3);
    ymm0 = Avx2.Add(ymm0, ymm2);

    // This address-taken local will be forced to the stack
    Vector256<int> ymm4 = ymm0;
    int* t = (int*)&ymm4;

    // RyuJIT unrolls loops of Vector<T>.Count,
    // Vector128<T>.Count, and Vector256<T>.Count
    int r = 0;
    for (int i = 0; i < Vector256<int>.Count; ++i)
        r += *(t + i);
    return r;
}
compiles to:
Program.F(Int32*)
L0000: sub rsp, 0x38
L0004: vzeroupper
L0007: vmovdqu ymm0, [rcx]
L000b: vmovdqu ymm1, [rcx+0x20]
L0010: vmovdqu ymm2, [rcx+0x40]
L0015: vmovdqu ymm3, [rcx+0x60]
L001a: vpaddd ymm2, ymm2, ymm3
L001e: vpaddd ymm0, ymm0, ymm1
L0022: vpaddd ymm0, ymm0, ymm2
L0026: vmovupd [rsp], ymm0 ; write to the stack with no zeroing/canary
L002b: lea rax, [rsp]
L002f: mov edx, [rax] ; auto-unrolled loop
L0031: add edx, [rax+4]
L0034: add edx, [rax+8]
L0037: add edx, [rax+0xc]
L003a: add edx, [rax+0x10]
L003d: add edx, [rax+0x14]
L0040: add edx, [rax+0x18]
L0043: add edx, [rax+0x1c]
L0046: mov eax, edx
L0048: vzeroupper
L004b: add rsp, 0x38
L004f: ret
Note that the stack zeroing, the stack canary write, check, and possible throw are all gone. And the loop is auto-unrolled, with more optimal scalar load/add code.
Beyond that, as other comments/answers have suggested, the spill to the stack and scalar adds are unnecessary, because you can use SIMD instructions to add horizontally. RyuJIT will not do this for you like GCC can, but if you are explicit, you can get optimal SIMD ASM.
static unsafe int F(int* a)
{
    Vector256<int> ymm0 = Avx.LoadVector256(a + 0);
    Vector256<int> ymm1 = Avx.LoadVector256(a + 8);

    // The load can be contained in the add if you use the load
    // as an operand rather than declaring explicit locals
    ymm0 = Avx2.Add(ymm0, Avx.LoadVector256(a + 16));
    ymm1 = Avx2.Add(ymm1, Avx.LoadVector256(a + 24));
    ymm0 = Avx2.Add(ymm0, ymm1);

    // Add the upper 128-bit lane to the lower lane
    Vector128<int> xmm0 = Sse2.Add(ymm0.GetLower(), ymm0.GetUpper());

    // Add odd elements to even
    xmm0 = Sse2.Add(xmm0, Sse2.Shuffle(xmm0, 0b_11_11_01_01));

    // Add high half to low half
    xmm0 = Sse2.Add(xmm0, Sse2.UnpackHigh(xmm0.AsInt64(), xmm0.AsInt64()).AsInt32());

    // Extract low element
    return xmm0.ToScalar();
}
compiles to:
Program.F(Int32*)
L0000: vzeroupper
L0003: vmovdqu ymm0, [rcx]
L0007: vmovdqu ymm1, [rcx+0x20]
L000c: vpaddd ymm0, ymm0, [rcx+0x40]
L0011: vpaddd ymm1, ymm1, [rcx+0x60]
L0016: vpaddd ymm0, ymm0, ymm1
L001a: vextracti128 xmm1, ymm0, 1
L0020: vpaddd xmm0, xmm0, xmm1
L0024: vpshufd xmm1, xmm0, 0xf5
L0029: vpaddd xmm0, xmm0, xmm1
L002d: vpunpckhqdq xmm1, xmm0, xmm0
L0031: vpaddd xmm0, xmm0, xmm1
L0035: vmovd eax, xmm0
L0039: vzeroupper
L003c: ret
which, aside from the overly-conservative vzerouppers, is the same as you'd get from an optimizing C/C++ compiler.
vzeroupper can help performance.
The L0007 through L0018 lines zero out the storage space used by the local variables.
The 0x7d847bd1f9ce value appears to be related to detecting stack overruns. It sets a check value, and when the function is done it looks to see whether that value has changed. If it has, it calls a diagnostic function.
The function body starts at L002c. First it initializes your local ymm variables, then does the additions.
The lea at L004b is the allocation of t. The next instruction (L004f) is the Avx2.Store(t, ymm0); statement.
L0053 through L0063 is the for loop: rax already has the value of t, ecx holds i, and edx holds r.
From L0065 to the end we have the return statement and function epilog. The epilog checks to see if the stack has been clobbered, does some cleanup, and returns to the caller.

C# Performance on Small Functions

One of my co-workers has been reading Clean Code by Robert C. Martin and got to the section about using many small functions instead of fewer large ones. This led to a debate about the performance consequences of that methodology, so we wrote a quick program to test it and are confused by the results.
For starters here is the normal version of the function.
static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = a + b + 1;
        }
    }
    return a;
}
Here is the version I made that breaks the functionality into small functions.
static double TinyFunctions()
{
    double a = 0;
    for (int i = 0; i < s_OuterLoopCount; i++)
    {
        a = Loop(a);
    }
    return a;
}

static double Loop(double a)
{
    for (int i = 0; i < s_InnerLoopCount; i++)
    {
        double b = Double(i);
        a = Add(a, Add(b, 1));
    }
    return a;
}

static double Double(double a)
{
    return a * 2;
}

static double Add(double a, double b)
{
    return a + b;
}
I use the Stopwatch class to time the functions, and when I ran it in debug I got the following results.
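(The harness itself isn't shown; a minimal Stopwatch sketch consistent with the description, printing the results so the work can't be optimized away, would be:)
var sw = System.Diagnostics.Stopwatch.StartNew();
double normal = NormalFunction();
sw.Stop();
Console.WriteLine($"NormalFunction Time = {sw.ElapsedMilliseconds} ms ({normal})");

sw.Restart();
double tiny = TinyFunctions();
sw.Stop();
Console.WriteLine($"TinyFunctions Time = {sw.ElapsedMilliseconds} ms ({tiny})");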
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;
These results make sense to me, especially in debug, as there is additional overhead from the function calls. It is when I run it in release that I get the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;
These results confuse me: even if the compiler was optimizing TinyFunctions by inlining all the function calls, how could that make it ~57% faster?
We have tried moving variable declarations around in NormalFunction, and it had basically no effect on the run time.
I was hoping someone would know what is going on, and, if the compiler can optimize TinyFunctions so well, why it can't apply similar optimizations to NormalFunction.
Looking around, we found a mention that breaking functions out allows the JIT to better optimize what to put in registers, but NormalFunction only has 4 variables, so I find it hard to believe that explains the massive performance difference.
I'd be grateful for any insight someone can provide.
Update 1
As pointed out below by Kyle, changing the order of operations made a massive difference in the performance of NormalFunction.
static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = b + 1 + a;
        }
    }
    return a;
}
Here are the results with this configuration.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;
This is more like what I expected, but it still leaves the question of why the order of operations can cause a ~56% performance hit.
Furthermore, I then tried it with integer operations, and we are back to it not making any sense.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;
And this doesn't change regardless of the order of operations.
I can make performance match much better by changing one line of code:
a = a + b + 1;
Change it to:
a = b + 1 + a;
Or:
a += b + 1;
Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:
int Double(int a) { return a * 2; }
I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions being a few percent slower (as expected).
The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine-code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.
But that doesn't account for the bulk of the runtime discrepancy. That comes down almost entirely to the order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:
Original Changed
01070620 push ebp 01390620 push ebp
01070621 mov ebp,esp 01390621 mov ebp,esp
01070623 push edi 01390623 push edi
01070624 push esi 01390624 push esi
01070625 push eax 01390625 push eax
01070626 fldz 01390626 fldz
01070628 xor esi,esi 01390628 xor esi,esi
0107062A mov edi,dword ptr ds:[0FE43ACh] 0139062A mov edi,dword ptr ds:[12243ACh]
01070630 test edi,edi 01390630 test edi,edi
01070632 jle 0107065A 01390632 jle 0139065A
01070634 xor edx,edx 01390634 xor edx,edx
01070636 mov ecx,dword ptr ds:[0FE43B0h] 01390636 mov ecx,dword ptr ds:[12243B0h]
0107063C test ecx,ecx 0139063C test ecx,ecx
0107063E jle 01070655 0139063E jle 01390655
01070640 mov eax,edx 01390640 mov eax,edx
01070642 add eax,eax 01390642 add eax,eax
01070644 mov dword ptr [ebp-0Ch],eax 01390644 mov dword ptr [ebp-0Ch],eax
01070647 fild dword ptr [ebp-0Ch] 01390647 fild dword ptr [ebp-0Ch]
0107064A faddp st(1),st 0139064A fld1
0107064C fld1 0139064C faddp st(1),st
0107064E faddp st(1),st 0139064E faddp st(1),st
01070650 inc edx 01390650 inc edx
01070651 cmp edx,ecx 01390651 cmp edx,ecx
01070653 jl 01070640 01390653 jl 01390640
01070655 inc esi 01390655 inc esi
01070656 cmp esi,edi 01390656 cmp esi,edi
01070658 jl 01070634 01390658 jl 01390634
0107065A pop ecx 0139065A pop ecx
0107065B pop esi 0139065B pop esi
0107065C pop edi 0139065C pop edi
0107065D pop ebp 0139065D pop ebp
0107065E ret 0139065E ret
Which is opcode-for-opcode identical except for the order of the floating-point operations. That makes a huge performance difference, but I don't know enough about x86 floating-point operations to know exactly why.
Update:
With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization, because it turns this:
int b = 2 * i;
a = a + b + 1;
Into something like:
mov esi, eax ; b = i
add esi, esi ; b += b
lea ecx, [ecx + esi + 1] ; a = a + b + 1
Where a is stored in the ecx register, i in eax, and b in esi.
Whereas the TinyFunctions version gets turned into something like:
mov eax, edx
add eax, eax
inc eax
add ecx, eax
Where i is in edx, b is in eax, and a is in ecx this time around.
I suppose that for our CPU architecture this LEA "trick" ends up being slower than just using the ALU proper. It is still possible to change the code to make the performance of the two line up:
int b = 2 * i + 1;
a += b;
This ends up forcing the NormalFunction approach to get turned into mov, add, inc, add, as it appears in the TinyFunctions approach.
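For completeness, a sketch of that final change folded into the all-integer variant from the update (method name is hypothetical):
static int NormalFunctionInt()
{
    int a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            int b = 2 * i + 1; // avoids the LEA form; the JIT emits mov, add, inc, add
            a += b;
        }
    }
    return a;
}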

Huge performance difference in byte-array access between x64 and x86

I'm currently doing micro-benchmarks to better understand CLR performance and version issues. The micro-benchmark in question XORs two byte arrays of 64 bytes each.
I always make a safe .NET reference implementation before I try to beat the .NET Framework implementation with unsafe code and so on.
My reference implementation in question is:
for (int p = 0; p < 64; p++)
    a[p] ^= b[p];
where a and b are byte[] a = new byte[64], filled with data from the .NET RNG.
This code runs twice as fast on x64 as on x86. At first I thought that was OK, because the JIT would make something like *long ^= *long out of it on x64 and *int ^= *int on x86.
But my optimized unsafe version:
fixed (byte* pA = a)
fixed (byte* pB = b)
{
    long* ppA = (long*)pA;
    long* ppB = (long*)pB;
    for (int p = 0; p < 8; p++)
    {
        *ppA ^= *ppB;
        ppA++;
        ppB++;
    }
}
runs about 4 times faster than the x64 reference implementation, so my idea about the compiler's *long ^= *long and *int ^= *int optimization was wrong.
Where does this huge performance difference in the reference implementation come from? And now that I've posted the ASM code: why can't the C# compiler also optimize the x86 version this way?
IL code for the x86 and x64 reference implementations (they are identical):
IL_0059: ldloc.3
IL_005a: ldloc.s p
IL_005c: ldelema [mscorlib]System.Byte
IL_0061: dup
IL_0062: ldobj [mscorlib]System.Byte
IL_0067: ldloc.s b
IL_0069: ldloc.s p
IL_006b: ldelem.u1
IL_006c: xor
IL_006d: conv.u1
IL_006e: stobj [mscorlib]System.Byte
IL_0073: ldloc.s p
IL_0075: ldc.i4.1
IL_0076: add
IL_0077: stloc.s p
IL_0079: ldloc.s p
IL_007b: ldc.i4.s 64
IL_007d: blt.s IL_0059
I think that ldloc.3 is a.
Resulting ASM code for x86:
for (int p = 0; p < 64; p++)
010900DF xor edx,edx
010900E1 mov edi,dword ptr [ebx+4]
a[p] ^= b[p];
010900E4 cmp edx,edi
010900E6 jae 0109010C
010900E8 lea esi,[ebx+edx+8]
010900EC mov eax,dword ptr [ebp-14h]
010900EF cmp edx,dword ptr [eax+4]
010900F2 jae 0109010C
010900F4 movzx eax,byte ptr [eax+edx+8]
010900F9 xor byte ptr [esi],al
for (int p = 0; p < 64; p++)
010900FB inc edx
010900FC cmp edx,40h
010900FF jl 010900E4
Resulting ASM code for x64:
a[p] ^= b[p];
00007FFF4A8B01C6 mov eax,3Eh
00007FFF4A8B01CB cmp rax,rcx
00007FFF4A8B01CE jae 00007FFF4A8B0245
00007FFF4A8B01D0 mov rax,qword ptr [rbx+8]
00007FFF4A8B01D4 mov r9d,3Eh
00007FFF4A8B01DA cmp r9,rax
00007FFF4A8B01DD jae 00007FFF4A8B0245
00007FFF4A8B01DF mov r9d,3Fh
00007FFF4A8B01E5 cmp r9,rcx
00007FFF4A8B01E8 jae 00007FFF4A8B0245
00007FFF4A8B01EA mov ecx,3Fh
00007FFF4A8B01EF cmp rcx,rax
00007FFF4A8B01F2 jae 00007FFF4A8B0245
00007FFF4A8B01F4 nop word ptr [rax+rax]
00007FFF4A8B0200 movzx ecx,byte ptr [rdi+rdx+10h]
00007FFF4A8B0205 movzx eax,byte ptr [rbx+rdx+10h]
00007FFF4A8B020A xor ecx,eax
00007FFF4A8B020C mov byte ptr [rdi+rdx+10h],cl
00007FFF4A8B0210 movzx ecx,byte ptr [rdi+rdx+11h]
00007FFF4A8B0215 movzx eax,byte ptr [rbx+rdx+11h]
00007FFF4A8B021A xor ecx,eax
00007FFF4A8B021C mov byte ptr [rdi+rdx+11h],cl
00007FFF4A8B0220 add rdx,2
for (int p = 0; p < 64; p++)
00007FFF4A8B0224 cmp rdx,40h
00007FFF4A8B0228 jl 00007FFF4A8B0200
You've made a classic mistake: attempting performance analysis on non-optimized code. Here is a complete minimal compilable example:
using System;

namespace SO30558357
{
    class Program
    {
        static void XorArray(byte[] a, byte[] b)
        {
            for (int p = 0; p < 64; p++)
                a[p] ^= b[p];
        }

        static void Main(string[] args)
        {
            byte[] a = new byte[64];
            byte[] b = new byte[64];
            Random r = new Random();
            r.NextBytes(a);
            r.NextBytes(b);

            XorArray(a, b);
            Console.ReadLine(); // when the program stops here,
                                // use Debug -> Attach to process
        }
    }
}
I compiled that using Visual Studio 2013 Update 3, with the default "Release Build" settings for a C# console application except for the architecture, and ran it with CLR v4.0.30319. Oh, and I think I have Roslyn installed, but that shouldn't replace the JIT, only the translation to MSIL, which is identical on both architectures.
The actual x86 assembly for XorArray:
006F00D8 push ebp
006F00D9 mov ebp,esp
006F00DB push edi
006F00DC push esi
006F00DD push ebx
006F00DE push eax
006F00DF mov dword ptr [ebp-10h],edx
006F00E2 xor edi,edi
006F00E4 mov ebx,dword ptr [ecx+4]
006F00E7 cmp edi,ebx
006F00E9 jae 006F010F
006F00EB lea esi,[ecx+edi+8]
006F00EF movzx eax,byte ptr [esi]
006F00F2 mov edx,dword ptr [ebp-10h]
006F00F5 cmp edi,dword ptr [edx+4]
006F00F8 jae 006F010F
006F00FA movzx edx,byte ptr [edx+edi+8]
006F00FF xor eax,edx
006F0101 mov byte ptr [esi],al
006F0103 inc edi
006F0104 cmp edi,40h
006F0107 jl 006F00E7
006F0109 pop ecx
006F010A pop ebx
006F010B pop esi
006F010C pop edi
006F010D pop ebp
006F010E ret
And for x64:
00007FFD4A3000FB mov rax,qword ptr [rsi+8]
00007FFD4A3000FF mov rax,qword ptr [rbp+8]
00007FFD4A300103 nop word ptr [rax+rax]
00007FFD4A300110 movzx ecx,byte ptr [rsi+rdx+10h]
00007FFD4A300115 movzx eax,byte ptr [rdx+rbp+10h]
00007FFD4A30011A xor ecx,eax
00007FFD4A30011C mov byte ptr [rsi+rdx+10h],cl
00007FFD4A300120 movzx ecx,byte ptr [rsi+rdx+11h]
00007FFD4A300125 movzx eax,byte ptr [rdx+rbp+11h]
00007FFD4A30012A xor ecx,eax
00007FFD4A30012C mov byte ptr [rsi+rdx+11h],cl
00007FFD4A300130 movzx ecx,byte ptr [rsi+rdx+12h]
00007FFD4A300135 movzx eax,byte ptr [rdx+rbp+12h]
00007FFD4A30013A xor ecx,eax
00007FFD4A30013C mov byte ptr [rsi+rdx+12h],cl
00007FFD4A300140 movzx ecx,byte ptr [rsi+rdx+13h]
00007FFD4A300145 movzx eax,byte ptr [rdx+rbp+13h]
00007FFD4A30014A xor ecx,eax
00007FFD4A30014C mov byte ptr [rsi+rdx+13h],cl
00007FFD4A300150 add rdx,4
00007FFD4A300154 cmp rdx,40h
00007FFD4A300158 jl 00007FFD4A300110
Bottom line: the x64 optimizer worked a lot better. While it is still using byte-sized transfers, it unrolled the loop by a factor of 4 and inlined the function call.
Since loop control logic accounts for roughly half the code in the x86 version, the unrolling alone can be expected to yield almost twice the performance.
Inlining allowed the compiler to perform context-sensitive optimization, knowing the size of the arrays and eliminating the runtime bounds check.
If we inline by hand, the x86 compiler now yields:
00A000B1 xor edi,edi
00A000B3 mov eax,dword ptr [ebp-10h]
00A000B6 mov ebx,dword ptr [eax+4]
a[p] ^= b[p];
00A000B9 mov eax,dword ptr [ebp-10h]
00A000BC cmp edi,ebx
00A000BE jae 00A000F5
00A000C0 lea esi,[eax+edi+8]
00A000C4 movzx eax,byte ptr [esi]
00A000C7 mov edx,dword ptr [ebp-14h]
00A000CA cmp edi,dword ptr [edx+4]
00A000CD jae 00A000F5
00A000CF movzx edx,byte ptr [edx+edi+8]
00A000D4 xor eax,edx
00A000D6 mov byte ptr [esi],al
for (int p = 0; p< 64; p++)
00A000D8 inc edi
00A000D9 cmp edi,40h
00A000DC jl 00A000B9
That didn't help much: the loop still does not unroll, and the runtime bounds checks are still there.
Notably, the x86 compiler found a register (EBX) to cache the length of one array, but it ran out of registers and was forced to access the other array's length from memory on every iteration. This should be a "cheap" L1 cache access, but that's still slower than register access, and much slower than no bounds check at all.
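As an aside (not part of the original answer): on runtimes new enough to have Span<T>, you can get the same 8-bytes-at-a-time effect from safe code by reinterpreting the arrays, for example:
using System;
using System.Runtime.InteropServices;

static void XorArray64(byte[] a, byte[] b)
{
    // View each 64-byte array as eight ulongs and XOR them wide.
    Span<ulong> wa = MemoryMarshal.Cast<byte, ulong>(a.AsSpan());
    ReadOnlySpan<ulong> wb = MemoryMarshal.Cast<byte, ulong>(b.AsSpan());
    for (int p = 0; p < wa.Length; p++)
        wa[p] ^= wb[p];
}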

When to use short?

Say I'm looping through 20/30 objects, or in any other case where I'm dealing with small numbers: is it good practice to use short instead of int?
I mean, why isn't this common:
for (short i = 0; i < x; i++)
    Method(array[i]);
Is it because the performance gain is too low?
Thanks
"is it a good practice to use short instead of int?"
First of all, this is a micro optimization that will not achieve the expected results: increase speed or efficiency.
Second: No, not really, the CLR internally still uses 32 bit integers (Int32) to perform the iteration. Basically it converts short to Int32 for computation purposes during JIT compilation.
Third: Array indexes are Int32, and the iterating short variable is automatically converted to int32 when used as an array indexer.
If we take the following code:
var array = new object[32];
var x = array.Length;
for (short i = 0; i < x; i++)
    Method(array[i]);
And disassemble it, you can clearly see at 00000089 inc eax that at the machine level a 32-bit register (eax) is used for the iteration variable, which is then truncated back to 16 bits at 0000008a movsx eax,ax. So there is no benefit to using a short as opposed to an int; in fact, there may be a slight performance loss due to the extra instructions that need to be executed.
00000042 nop
var array = new object[32];
00000043 mov ecx,64B41812h
00000048 mov edx,20h
0000004d call FFBC01A4
00000052 mov dword ptr [ebp-50h],eax
00000055 mov eax,dword ptr [ebp-50h]
00000058 mov dword ptr [ebp-40h],eax
var x = array.Length;
0000005b mov eax,dword ptr [ebp-40h]
0000005e mov eax,dword ptr [eax+4]
00000061 mov dword ptr [ebp-44h],eax
for (short i = 0; i < x; i++)
00000064 xor edx,edx
00000066 mov dword ptr [ebp-48h],edx
00000069 nop
0000006a jmp 00000090
Method(array[i]);
0000006c mov eax,dword ptr [ebp-48h]
0000006f mov edx,dword ptr [ebp-40h]
00000072 cmp eax,dword ptr [edx+4]
00000075 jb 0000007C
00000077 call 657A28F6
0000007c mov ecx,dword ptr [edx+eax*4+0Ch]
00000080 call FFD9A708
00000085 nop
for (short i = 0; i < x; i++)
00000086 mov eax,dword ptr [ebp-48h]
00000089 inc eax
0000008a movsx eax,ax
0000008d mov dword ptr [ebp-48h],eax
00000090 mov eax,dword ptr [ebp-48h]
00000093 cmp eax,dword ptr [ebp-44h]
00000096 setl al
00000099 movzx eax,al
0000009c mov dword ptr [ebp-4Ch],eax
0000009f cmp dword ptr [ebp-4Ch],0
000000a3 jne 0000006C
Yes, the performance difference is negligible. However, a short uses 16 bits instead of an int's 32, so it's conceivable that you may want a short if you're processing enough small numbers.
In general, using numbers that match the word size of the processor is relatively faster than using ones that don't. On the other hand, short uses less memory space than int.
If you have limited memory space, using short may be the alternative; but personally I have never encountered such a situation when writing C# applications.
An int uses 32 bits of memory, a short uses 16 bits and a byte uses 8 bits. If you're only looping through 20/30 objects and you're concerned about memory usage, use byte instead.
Catering to memory usage at this level is rarely required with today's machines, though you could argue that using int everywhere is just lazy. Personally, I try to always use the relevant type that uses the least memory.
http://msdn.microsoft.com/en-us/library/5bdb6693(v=vs.100).aspx
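Where the smaller types do pay off is storage rather than loop counters, for example large arrays of packed structs (an illustration, not from the answers above):
// Four bytes per element instead of eight: for a million points,
// that's roughly 4 MB instead of 8 MB.
struct Point16
{
    public short X;
    public short Y;
}

var points = new Point16[1_000_000];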
