Sometimes I need a hardcoded lookup table for a single method.
I can create such an array either:
- locally in the method itself, or
- static inside the class.
Example for the first case:
public int Convert(int i)
{
int[] lookup = new[] {1, 2, 4, 8, 16, 32, 666, /*...*/ };
return lookup[i];
}
As far as I understand it, a new lookup array will be created by the .net engine each time this method is executed. Is this correct, or is the JITer smart enough to cache and reuse the array between calls?
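One way to check this empirically is to measure the allocation delta around a single call. A hedged sketch, assuming a modern .NET runtime where GC.GetAllocatedBytesForCurrentThread is available (the class and helper names are illustrative):

```csharp
using System;
using System.Runtime.CompilerServices;

public static class AllocationCheck
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Convert(int i)
    {
        int[] lookup = { 1, 2, 4, 8, 16, 32, 666 };
        return lookup[i];
    }

    public static long AllocatedByOneCall()
    {
        Convert(0); // warm-up call, so JIT compilation itself isn't measured
        long before = GC.GetAllocatedBytesForCurrentThread();
        Convert(0);
        // A nonzero delta means a fresh array was allocated for this call.
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

On runtimes without escape analysis for this shape, the delta is the size of one 7-element int array per call.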
I presume that the answer is no, so if I want to make sure that the array is cached between calls, one way would be to make it static:
Example for the second case:
private static readonly int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
public int Convert(int i)
{
return lookup[i];
}
Is there a way to do this without polluting the namespace of my class? Can I somehow declare a static array that is only visible inside the current scope?
Local array
The Roslyn compiler puts local arrays in the metadata. Let's take the first version of your Convert method:
public int Convert(int i)
{
int[] lookup = new[] {1, 2, 4, 8, 16, 32, 666, /*...*/ };
return lookup[i];
}
Here is the corresponding IL code (Release build, Roslyn 1.3.1.60616):
// Token: 0x06000002 RID: 2 RVA: 0x0000206C File Offset: 0x0000026C
.method public hidebysig
instance int32 Convert (
int32 i
) cil managed noinlining
{
// Header Size: 1 byte
// Code Size: 20 (0x14) bytes
.maxstack 8
/* 0x0000026D 1D */ IL_0000: ldc.i4.7
/* 0x0000026E 8D13000001 */ IL_0001: newarr [mscorlib]System.Int32
/* 0x00000273 25 */ IL_0006: dup
/* 0x00000274 D001000004 */ IL_0007: ldtoken field valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=28' '<PrivateImplementationDetails>'::'502D7419C3650DEE94B5938147BC9B4724D37F99'
/* 0x00000279 281000000A */ IL_000C: call void [mscorlib]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [mscorlib]System.Array, valuetype [mscorlib]System.RuntimeFieldHandle)
/* 0x0000027E 03 */ IL_0011: ldarg.1
/* 0x0000027F 94 */ IL_0012: ldelem.i4
/* 0x00000280 2A */ IL_0013: ret
} // end of method Program::Convert
And here is the PrivateImplementationDetails:
// Token: 0x02000003 RID: 3
.class private auto ansi sealed '<PrivateImplementationDetails>'
extends [mscorlib]System.Object
{
.custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = (
01 00 00 00
)
// Nested Types
// Token: 0x02000004 RID: 4
.class nested private explicit ansi sealed '__StaticArrayInitTypeSize=28'
extends [mscorlib]System.ValueType
{
.pack 1
.size 28
} // end of class __StaticArrayInitTypeSize=28
// Fields
// Token: 0x04000001 RID: 1 RVA: 0x00002944 File Offset: 0x00000B44
.field assembly static initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=28' '502D7419C3650DEE94B5938147BC9B4724D37F99' at I_00002944 // 28 (0x001c) bytes
} // end of class <PrivateImplementationDetails>
As you can see, your lookup array is stored in the assembly metadata. On each call, the generated code only has to copy the array content from the metadata. An asm example (Windows 10, .NET Framework 4.6.1 (4.0.30319.42000), RyuJIT: clrjit-v4.6.1080.0, Release build):
int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
00007FFEDF0A44E2 sub esp,20h
00007FFEDF0A44E5 mov esi,edx
00007FFEDF0A44E7 mov rcx,7FFF3D1C4C62h
00007FFEDF0A44F1 mov edx,7
00007FFEDF0A44F6 call 00007FFF3E6B2600
00007FFEDF0A44FB mov rdx,134CF7F2944h
00007FFEDF0A4505 mov ecx,dword ptr [rax+8]
00007FFEDF0A4508 lea r8,[rax+10h]
00007FFEDF0A450C vmovdqu xmm0,xmmword ptr [rdx]
00007FFEDF0A4511 vmovdqu xmmword ptr [r8],xmm0
00007FFEDF0A4516 mov r9,qword ptr [rdx+10h]
00007FFEDF0A451A mov qword ptr [r8+10h],r9
00007FFEDF0A451E mov r9d,dword ptr [rdx+18h]
00007FFEDF0A4522 mov dword ptr [r8+18h],r9d
return lookup[i];
00007FFEDF0A4526 cmp esi,ecx
return lookup[i];
00007FFEDF0A4528 jae 00007FFEDF0A4537
00007FFEDF0A452A movsxd rdx,esi
00007FFEDF0A452D mov eax,dword ptr [rax+rdx*4+10h]
00007FFEDF0A4531 add rsp,20h
00007FFEDF0A4535 pop rsi
00007FFEDF0A4536 ret
00007FFEDF0A4537 call 00007FFF3EB57BE0
00007FFEDF0A453C int 3
A LegacyJIT-x64 version:
int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
00007FFEDF0E41E0 push rbx
00007FFEDF0E41E1 push rdi
00007FFEDF0E41E2 sub rsp,28h
00007FFEDF0E41E6 mov ebx,edx
00007FFEDF0E41E8 mov edx,7
00007FFEDF0E41ED lea rcx,[7FFF3D1C4C62h]
00007FFEDF0E41F4 call 00007FFF3E6B2600
00007FFEDF0E41F9 mov rdi,rax
00007FFEDF0E41FC lea rcx,[7FFEDF124760h]
00007FFEDF0E4203 call 00007FFF3E73CA90
00007FFEDF0E4208 mov rdx,rax
00007FFEDF0E420B mov rcx,rdi
00007FFEDF0E420E call 00007FFF3E73C8B0
return lookup[i];
00007FFEDF0E4213 movsxd r11,ebx
00007FFEDF0E4216 mov rax,qword ptr [rdi+8]
00007FFEDF0E421A cmp r11,7
00007FFEDF0E421E jae 00007FFEDF0E4230
00007FFEDF0E4220 mov eax,dword ptr [rdi+r11*4+10h]
00007FFEDF0E4225 add rsp,28h
00007FFEDF0E4229 pop rdi
00007FFEDF0E422A pop rbx
00007FFEDF0E422B ret
00007FFEDF0E422C nop dword ptr [rax]
00007FFEDF0E4230 call 00007FFF3EB57BE0
00007FFEDF0E4235 nop
A LegacyJIT-x86 version:
int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
009A2DC4 push esi
009A2DC5 push ebx
009A2DC6 mov ebx,edx
009A2DC8 mov ecx,6A2C402Eh
009A2DCD mov edx,7
009A2DD2 call 0094322C
009A2DD7 lea edi,[eax+8]
009A2DDA mov esi,5082944h
009A2DDF mov ecx,7
009A2DE4 rep movs dword ptr es:[edi],dword ptr [esi]
return lookup[i];
009A2DE6 cmp ebx,dword ptr [eax+4]
009A2DE9 jae 009A2DF4
009A2DEB mov eax,dword ptr [eax+ebx*4+8]
009A2DEF pop ebx
009A2DF0 pop esi
009A2DF1 pop edi
009A2DF2 pop ebp
009A2DF3 ret
009A2DF4 call 6B9D52F0
009A2DF9 int 3
Static array
Now, let's compare it with the second version:
private static readonly int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
public int Convert(int i)
{
return lookup[i];
}
IL:
// Token: 0x04000001 RID: 1
.field private static initonly int32[] lookup
// Token: 0x06000002 RID: 2 RVA: 0x00002056 File Offset: 0x00000256
.method public hidebysig
instance int32 Convert (
int32 i
) cil managed noinlining
{
// Header Size: 1 byte
// Code Size: 8 (0x8) bytes
.maxstack 8
/* 0x00000257 7E01000004 */ IL_0000: ldsfld int32[] ConsoleApplication5.Program::lookup
/* 0x0000025C 03 */ IL_0005: ldarg.1
/* 0x0000025D 94 */ IL_0006: ldelem.i4
/* 0x0000025E 2A */ IL_0007: ret
} // end of method Program::Convert
// Token: 0x02000003 RID: 3
.class private auto ansi sealed '<PrivateImplementationDetails>'
extends [mscorlib]System.Object
{
.custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = (
01 00 00 00
)
// Nested Types
// Token: 0x02000004 RID: 4
.class nested private explicit ansi sealed '__StaticArrayInitTypeSize=28'
extends [mscorlib]System.ValueType
{
.pack 1
.size 28
} // end of class __StaticArrayInitTypeSize=28
// Fields
// Token: 0x04000002 RID: 2 RVA: 0x000028FC File Offset: 0x00000AFC
.field assembly static initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=28' '502D7419C3650DEE94B5938147BC9B4724D37F99' at I_000028fc // 28 (0x001c) bytes
} // end of class <PrivateImplementationDetails>
ASM (RyuJIT-x64):
return lookup[i];
00007FFEDF0B4490 sub rsp,28h
00007FFEDF0B4494 mov rax,212E52E0080h
00007FFEDF0B449E mov rax,qword ptr [rax]
00007FFEDF0B44A1 mov ecx,dword ptr [rax+8]
00007FFEDF0B44A4 cmp edx,ecx
00007FFEDF0B44A6 jae 00007FFEDF0B44B4
00007FFEDF0B44A8 movsxd rdx,edx
00007FFEDF0B44AB mov eax,dword ptr [rax+rdx*4+10h]
00007FFEDF0B44AF add rsp,28h
00007FFEDF0B44B3 ret
00007FFEDF0B44B4 call 00007FFF3EB57BE0
00007FFEDF0B44B9 int 3
ASM (LegacyJIT-x64):
return lookup[i];
00007FFEDF0A4611 sub esp,28h
00007FFEDF0A4614 mov rcx,226CC5203F0h
00007FFEDF0A461E mov rcx,qword ptr [rcx]
00007FFEDF0A4621 movsxd r8,edx
00007FFEDF0A4624 mov rax,qword ptr [rcx+8]
00007FFEDF0A4628 cmp r8,rax
00007FFEDF0A462B jae 00007FFEDF0A4637
00007FFEDF0A462D mov eax,dword ptr [rcx+r8*4+10h]
00007FFEDF0A4632 add rsp,28h
00007FFEDF0A4636 ret
00007FFEDF0A4637 call 00007FFF3EB57BE0
00007FFEDF0A463C nop
ASM (LegacyJIT-x86):
return lookup[i];
00AA2E18 push ebp
00AA2E19 mov ebp,esp
00AA2E1B mov eax,dword ptr ds:[03628854h]
00AA2E20 cmp edx,dword ptr [eax+4]
00AA2E23 jae 00AA2E2B
00AA2E25 mov eax,dword ptr [eax+edx*4+8]
00AA2E29 pop ebp
00AA2E2A ret
00AA2E2B call 6B9D52F0
00AA2E30 int 3
Benchmarks
Let's write a benchmark with the help of BenchmarkDotNet:
[Config(typeof(Config)), LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, RPlotExporter]
public class ArrayBenchmarks
{
private static readonly int[] lookup = new[] {1, 2, 4, 8, 16, 32, 666, /*...*/};
[MethodImpl(MethodImplOptions.NoInlining)]
public int ConvertStatic(int i)
{
return lookup[i];
}
[MethodImpl(MethodImplOptions.NoInlining)]
public int ConvertLocal(int i)
{
int[] localLookup = new[] {1, 2, 4, 8, 16, 32, 666, /*...*/};
return localLookup[i];
}
[Benchmark]
public int Static()
{
int sum = 0;
for (int i = 0; i < 10001; i++)
sum += ConvertStatic(0);
return sum;
}
[Benchmark]
public int Local()
{
int sum = 0;
for (int i = 0; i < 10001; i++)
sum += ConvertLocal(0);
return sum;
}
private class Config : ManualConfig
{
public Config()
{
Add(new MemoryDiagnoser());
Add(MarkdownExporter.StackOverflow);
}
}
}
Note that this is a synthetic toy benchmark which uses NoInlining for the Convert methods; we use it only to show the difference between the two approaches. The real-world performance will depend on how you use the Convert method in your code. My results:
Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
Frequency=2143474 ticks, Resolution=466.5324 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1586.0
Type=ArrayBenchmarks Mode=Throughput
Method | Platform | Jit | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
------- |--------- |---------- |-------------- |----------- |--------- |------ |------ |------------------- |
Static | X64 | LegacyJit | 24.0243 us | 0.1590 us | - | - | - | 1.07 |
Local | X64 | LegacyJit | 2,068.1034 us | 33.7142 us | 1,089.00 | - | - | 436,603.02 |
Static | X64 | RyuJit | 20.7906 us | 0.2018 us | - | - | - | 1.06 |
Local | X64 | RyuJit | 83.4041 us | 0.9993 us | 613.55 | - | - | 244,936.53 |
Static | X86 | LegacyJit | 20.9957 us | 0.2267 us | - | - | - | 1.01 |
Local | X86 | LegacyJit | 167.6257 us | 1.3543 us | 431.43 | - | - | 172,121.77 |
Conclusion
Does .NET cache hardcoded local arrays? Kind of: the Roslyn compiler puts the data in the metadata.
Do we have any overhead in this case? Unfortunately, yes: the JIT-generated code has to copy the array content from the metadata on each invocation, which takes longer than the static-array case. The runtime also allocates a new array per call and produces memory traffic.
Should we care about it? It depends. If it's a hot method and you want to achieve a good level of performance, you should use a static array. If it's a cold method which doesn't affect the application performance, you probably should write “good” source code and put the array in the method scope.
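As a side note, newer toolchains offer a third option that avoids both the per-call allocation and a separately initialized static field. A hedged sketch: it relies on the Roslyn optimization that backs a ReadOnlySpan over a constant array directly with assembly metadata (guaranteed for byte/sbyte data since C# 7.3, and extended to other primitives on newer compiler/runtime combinations), so the values here are scaled down to fit in a byte:

```csharp
using System;

public static class Lookup
{
    // A ReadOnlySpan<byte> property over a constant array is compiled by
    // Roslyn into a direct reference into the assembly metadata: no heap
    // allocation per call and no static-field initialization check.
    private static ReadOnlySpan<byte> Table => new byte[] { 1, 2, 4, 8, 16, 32, 255 };

    public static int Convert(int i) => Table[i];
}
```

This keeps the data out of the class's field namespace while still avoiding a per-call copy.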
Related
Question
I wanted to write a small profiler class that allows me to measure the run time of hot paths throughout the application. In doing so, I discovered an interesting performance difference between two possible implementations that I cannot explain, but would like to understand.
Setup
The idea is as follows:
// somewhere accessible
public static HotPathProfiler Profiler = new HotPathProfiler("some name", enabled: true);
// within program
long ticket = Profiler.Enter();
... // hot path
var result = Profiler.Exit(ticket: ticket);
As there aren't many of these hot paths running in parallel, the idea is to implement this via an array that holds the timestamps (0 when a slot is free) and to return the index (called a ticket) from Enter(). So the class looks like the following:
public class HotPathProfiler
{
private readonly string _name;
private readonly bool _enabled;
private readonly long[] _ticketList;
public HotPathProfiler(string name, bool enabled)
{
_name = name;
_enabled = enabled;
_ticketList = new long[128];
}
}
If code Enter()s and none of the 128 tickets is available, -1 will be returned which the Exit(ticket) function can handle by returning early.
When thinking about how to implement the Enter() call I saw the Interlocked.Read method that can atomically read values on 32bit systems, while, according to the documentation, it is unnecessary on 64bit systems.
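For reference, here is a hedged sketch contrasting the two read styles (a standalone example; the field and method names are illustrative):

```csharp
using System;
using System.Threading;

public static class ReadStyles
{
    private static long _slot = 42;

    public static (long ViaInterlocked, long ViaVolatile) ReadBoth()
    {
        // Interlocked.Read guarantees an atomic 64-bit read even in a
        // 32-bit process; on x64 it is implemented as
        // CompareExchange(ref location, 0, 0), a full lock-prefixed
        // operation, even though a plain aligned 64-bit load is already
        // atomic there.
        long viaInterlocked = Interlocked.Read(ref _slot);

        // Volatile.Read is a plain load with acquire semantics: no
        // lock-prefixed instruction on x64, hence much cheaper.
        long viaVolatile = Volatile.Read(ref _slot);

        return (viaInterlocked, viaVolatile);
    }
}
```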
So I went on and implemented various types of Enter() methods, including one with Interlocked.Read and one with Interlocked.CompareExchange, and compared them with BenchmarkDotNet. That's where I discovered an enormous performance difference:
| Method | Mean | Error | StdDev | Code Size |
|------------- |----------:|---------:|---------:|----------:|
| SafeArray | 28.64 ns | 0.573 ns | 0.536 ns | 295 B |
| SafeArrayCAS | 744.75 ns | 8.741 ns | 7.749 ns | 248 B |
The benchmarks for both look pretty much the same:
[Benchmark]
public void SafeArray()
{
// doesn't matter if 'i < 1' or 'i < 10'
// performance differs by the same factor (approx. 20x)
for (int i = 0; i < 1; i++)
{
_ticketArr[i] = _hpp_sa.EnterSafe();
// SafeArrayCAS:
// _ticketArr[i] = _hpp_sa_cas.EnterSafe();
}
}
Implementations
Again, free slots hold value 0, occupied slots some other value (timestamp). Enter() is supposed to return the index/ticket of the slot.
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
{
return -1;
}
long last = 0;
long ts = Stopwatch.GetTimestamp();
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
last++;
} while (val != 0 && last < 128);
return val == 0 ? last : -1;
}
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
{
return -1;
}
long last = 0;
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
last++;
} while (val != 0 && last < 128);
if (val != 0)
{
return -1;
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
if (prev != 0)
{
return -1;
}
return last;
}
Enter rabbit hole
Now, one would say that it's no surprise to see a difference, since the slow method always tries to CAS an entry, while the other one only lazily reads each entry and then only tries a CAS once.
But beside the fact that the benchmark only does one Enter(), i.e. a single while{} iteration, which shouldn't account for a 20x difference, it becomes even harder to explain once you realize that the atomic read is itself implemented as a CAS:
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long last = 0;
00007FF82D048FCE xor edi,edi
long ts = Stopwatch.GetTimestamp();
00007FF82D048FD0 lea rcx,[rsp+28h]
00007FF82D048FD5 call CLRStub[JumpStub]#7ff82d076d70 (07FF82D076D70h)
00007FF82D048FDA mov rsi,qword ptr [rsp+28h]
00007FF82D048FDF mov rax,7FF88CF3E07Ch
00007FF82D048FE9 cmp dword ptr [rax],0
00007FF82D048FEC jne HotPathProfilerSafeArrayCAS.EnterSafe()+0A6h (07FF82D049046h)
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
00007FF82D048FEE mov rbx,qword ptr [rsp+50h]
00007FF82D048FF3 mov rax,qword ptr [rbx+10h]
00007FF82D048FF7 mov edx,dword ptr [rax+8]
00007FF82D048FFA movsxd rdx,edx
00007FF82D048FFD cmp rdi,rdx
00007FF82D049000 jae HotPathProfilerSafeArrayCAS.EnterSafe()+0ADh (07FF82D04904Dh)
00007FF82D049002 lea rdx,[rax+rdi*8+10h]
00007FF82D049007 xor eax,eax
00007FF82D049009 lock cmpxchg qword ptr [rdx],rsi
last++;
00007FF82D04900E inc rdi
} while (val != 0 && last < 128);
00007FF82D049011 test rax,rax
00007FF82D049014 je HotPathProfilerSafeArrayCAS.EnterSafe()+084h (07FF82D049024h)
00007FF82D049016 cmp rdi,80h
00007FF82D04901D mov qword ptr [rsp+50h],rbx
00007FF82D049022 jl HotPathProfilerSafeArrayCAS.EnterSafe()+04Eh (07FF82D048FEEh)
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long last = 0;
00007FF82D046C74 xor esi,esi
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
00007FF82D046C76 mov rax,qword ptr [rcx+10h]
00007FF82D046C7A mov edx,dword ptr [rax+8]
00007FF82D046C7D movsxd rdx,edx
00007FF82D046C80 cmp rsi,rdx
00007FF82D046C83 jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82D046D2Ch)
00007FF82D046C89 lea rdx,[rax+rsi*8+10h]
00007FF82D046C8E xor r8d,r8d
00007FF82D046C91 xor eax,eax
00007FF82D046C93 lock cmpxchg qword ptr [rdx],r8
last++;
00007FF82D046C98 inc rsi
} while (val != 0 && last < 128);
00007FF82D046C9B test rax,rax
00007FF82D046C9E je HotPathProfilerSafeArray.EnterSafe()+059h (07FF82D046CA9h)
00007FF82D046CA0 cmp rsi,80h
00007FF82D046CA7 jl HotPathProfilerSafeArray.EnterSafe()+026h (07FF82D046C76h)
if (val != 0)
[...] // omitted for brevity
{
return -1;
[...] // omitted for brevity
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
00007FF82FBA6ADF mov rcx,qword ptr [rcx+10h]
00007FF82FBA6AE3 mov eax,dword ptr [rcx+8]
00007FF82FBA6AE6 movsxd rax,eax
00007FF82FBA6AE9 cmp rsi,rax
00007FF82FBA6AEC jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82FBA6B4Ch)
00007FF82FBA6AEE lea rdi,[rcx+rsi*8+10h]
00007FF82FBA6AF3 mov qword ptr [rsp+28h],rdi
00007FF82FBA6AF8 lea rcx,[rsp+30h]
00007FF82FBA6AFD call CLRStub[JumpStub]#7ff82d076d70 (07FF82D076D70h)
00007FF82FBA6B02 mov rdx,qword ptr [rsp+30h]
00007FF82FBA6B07 xor eax,eax
00007FF82FBA6B09 mov rdi,qword ptr [rsp+28h]
00007FF82FBA6B0E lock cmpxchg qword ptr [rdi],rdx
00007FF82FBA6B13 mov rdi,rax
00007FF82FBA6B16 mov rax,7FF88CF3E07Ch
00007FF82FBA6B20 cmp dword ptr [rax],0
00007FF82FBA6B23 jne HotPathProfilerSafeArray.EnterSafe()+0D5h (07FF82FBA6B45h)
if (prev != 0)
[...] // omitted for brevity
Summary
I ran everything on Win10 x64, Release build, on a Xeon E-2176G (6-core Coffee Lake) CPU. The assembler output is from Visual Studio, but it matches the DisassemblyDiagnoser output of BenchmarkDotNet.
Leaving aside the hows and whys of why I'm doing this at all, I simply cannot explain the performance difference between these two methods. It shouldn't be this large, I would guess. Can it be BenchmarkDotNet itself? Am I missing something else?
It feels like I have a blind spot in my understanding of this low-level stuff that I'd like to shine some light on... thanks!
PS:
What I've tried so far:
Rearranging the order of the benchmark runs
Deferring the GetTimestamp() call in the slow method
Doing some initialization/test calls before the benchmark run (though I guess that's covered anyway by BenchmarkDotNet)
I noticed something odd when comparing the JIT-generated code of two methods which should perform the same.
To my surprise, the generated code had major differences, and its length was almost double for the supposedly simpler method M1.
The methods I compared were M1 and M2.
The number of assignments is the same, so the only difference should be how the bounds checks are handled for each method.
using System;
public class C {
static void M1(int[] left, int[] right)
{
for (int i = 0; i < 5; i++)
{
left[i] = 1;
right[i] = 1;
}
}
static void M2(int[] left, int[] right)
{
for (int i = 0; i < 10; i+=2)
{
left[i] = 1;
right[i] = 1;
}
}
}
Generated JIT for each method:
C.M1(Int32[], Int32[])
L0000: sub rsp, 0x28
L0004: xor eax, eax
L0006: test rcx, rcx
L0009: setne r8b
L000d: movzx r8d, r8b
L0011: test rdx, rdx
L0014: setne r9b
L0018: movzx r9d, r9b
L001c: test r9d, r8d
L001f: je short L005c
L0021: cmp dword ptr [rcx+8], 5
L0025: setge r8b
L0029: movzx r8d, r8b
L002d: cmp dword ptr [rdx+8], 5
L0031: setge r9b
L0035: movzx r9d, r9b
L0039: test r9d, r8d
L003c: je short L005c
L003e: movsxd r8, eax
L0041: mov dword ptr [rcx+r8*4+0x10], 1
L004a: mov dword ptr [rdx+r8*4+0x10], 1
L0053: inc eax
L0055: cmp eax, 5
L0058: jl short L003e
L005a: jmp short L0082
L005c: cmp eax, [rcx+8]
L005f: jae short L0087
L0061: movsxd r8, eax
L0064: mov dword ptr [rcx+r8*4+0x10], 1
L006d: cmp eax, [rdx+8]
L0070: jae short L0087
L0072: mov dword ptr [rdx+r8*4+0x10], 1
L007b: inc eax
L007d: cmp eax, 5
L0080: jl short L005c
L0082: add rsp, 0x28
L0086: ret
L0087: call 0x00007ffc50fafc00
L008c: int3
C.M2(Int32[], Int32[])
L0000: sub rsp, 0x28
L0004: xor eax, eax
L0006: mov r8d, [rcx+8]
L000a: cmp eax, r8d
L000d: jae short L0036
L000f: movsxd r9, eax
L0012: mov dword ptr [rcx+r9*4+0x10], 1
L001b: cmp eax, [rdx+8]
L001e: jae short L0036
L0020: mov dword ptr [rdx+r9*4+0x10], 1
L0029: add eax, 2
L002c: cmp eax, 0xa
L002f: jl short L000a
L0031: add rsp, 0x28
L0035: ret
L0036: call 0x00007ffc50fafc00
L003b: int3
M1's length is double that of M2's!
What could explain this, and is it some kind of bug?
EDIT
I figured out that the JIT creates a version of M1's loop without bounds checks, and that's why M1 is longer. Still, the question remains: why does M1 perform worse, even though it doesn't perform bounds checking at all?
I also ran BenchmarkDotNet and verified that M2 performs about 20%-30% faster than M1 for arrays of length 10:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.14393.3930 (1607/AnniversaryUpdate/Redstone1)
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
Frequency=3515622 Hz, Resolution=284.4447 ns, Timer=TSC
.NET Core SDK=3.1.401
[Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
DefaultJob : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
| Method | Mean | Error | StdDev | Ratio |
|-------- |---------:|----------:|----------:|------:|
| M1Bench | 4.372 ns | 0.0215 ns | 0.0201 ns | 1.00 |
| M2Bench | 3.350 ns | 0.0340 ns | 0.0301 ns | 0.77 |
"But, there's a lot of overhead up front for M1() to know it can use the "fast" path... if your arrays aren't large enough, the overhead would dominate and produce counter-intuitive results." (Peter Duniho)
The overhead of choosing the path (in the JIT) for the optimized bounds check, with loops of the form:
for(int i = 0; i < array.Length; i++)
won't pay off for smaller loops. As loops grow larger, eliminating bounds checks becomes more beneficial and eventually surpasses the performance of the non-optimized path.
Examples of non-optimized loops:
for(int i = 0; i < array.Length; i+=2)
for(int i = 0; i <= array.Length; i++)
for(int i = 0; i < array.Length / 2; i++)
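To make the optimized pattern concrete, here is a minimal sketch of the loop shape that does allow the JIT to drop the per-element bounds check (an illustrative example, not code from the question):

```csharp
public static class Loops
{
    public static int SumAll(int[] array)
    {
        int sum = 0;
        // This exact shape (i starting at 0, compared against
        // array.Length, stepping by +1, indexing the same array) is
        // what the JIT recognizes; it can then elide the per-element
        // bounds check inside the loop body.
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        return sum;
    }
}
```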
I perform a conversion "byte[4] -> float -> byte[4]" without any arithmetic.
The bytes hold a single-precision number in IEEE-754 format (4 bytes per number, little-endian machine order).
I encountered an issue: when the bytes represent a NaN value, they are not converted back verbatim.
For example:
{ 0x1B, 0xC4, 0xAB, 0x7F } -> NaN -> { 0x1B, 0xC4, 0xEB, 0x7F }
Code for reproduction:
using System;
using System.Linq;
namespace StrangeFloat
{
class Program
{
private static void PrintBytes(byte[] array)
{
foreach (byte b in array)
{
Console.Write("{0:X2}", b);
}
Console.WriteLine();
}
static void Main(string[] args)
{
byte[] strangeFloat = { 0x1B, 0xC4, 0xAB, 0x7F };
float[] array = new float[1];
Buffer.BlockCopy(strangeFloat, 0, array, 0, 4);
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
PrintBytes(strangeFloat);
PrintBytes(bitConverterResult);
bool isEqual = strangeFloat.SequenceEqual(bitConverterResult);
Console.WriteLine("IsEqual: {0}", isEqual);
}
}
}
Result ( https://ideone.com/p5fsrE ):
1BC4AB7F
1BC4EB7F
IsEqual: False
This behaviour depends on the platform and configuration: the code converts the number without errors on x64 in all configurations, and on x86/Debug. On x86/Release the error occurs.
Also, if I change
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
to
float f = array[0];
byte[] bitConverterResult = BitConverter.GetBytes(f);
then it is erroneous on x86/Debug as well.
I researched the problem and found that the compiler generates x86 code that uses the FPU registers (!) to hold the float value (FLD/FST instructions). But the FPU sets the high bit of the mantissa to 1 instead of 0, so it modifies the value, even though the logic just passes the value through unchanged.
On the x64 platform the xmm0 register is used (SSE), and it works fine.
[Question]
What is this: documented undefined behavior for NaN values, or a JIT/optimization bug?
Why does the compiler use the FPU and SSE when no arithmetic operations are performed?
Update 1
Debug configuration: the value is passed via the stack without side effects, giving the correct result:
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
02232E45 mov eax,dword ptr [ebp-44h]
02232E48 cmp dword ptr [eax+4],0
02232E4C ja 02232E53
02232E4E call 71EAC65A
02232E53 push dword ptr [eax+8] // eax+8 points to "1b c4 ab 7f" CORRECT!
02232E56 call 7136D8E4
02232E5B mov dword ptr [ebp-5Ch],eax // eax points to managed
// array data "fc 35 d7 70 04 00 00 00 __1b c4 ab 7f__" and this is correct
02232E5E mov eax,dword ptr [ebp-5Ch]
02232E61 mov dword ptr [ebp-48h],eax
Release configuration: the optimizer or the JIT does a strange pass through the FPU registers and corrupts the data, giving an incorrect result:
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
00B12DE8 cmp dword ptr [edi+4],0
00B12DEC jbe 00B12E3B
00B12DEE fld dword ptr [edi+8] // edi+8 points to "1b c4 ab 7f"
00B12DF1 fstp dword ptr [ebp-10h] // ebp-10h points to "1b c4 eb 7f" (FAIL)
00B12DF4 mov ecx,dword ptr [ebp-10h]
00B12DF7 call 70C75810
00B12DFC mov edi,eax
00B12DFE mov ecx,esi
00B12E00 call dword ptr ds:[4A70860h]
I'll just post @HansPassant's comment as an answer:
"The x86 jitter uses the FPU to handle floating point values. This is
not a bug. Your assumption that those byte values are a proper
argument to a method that takes a float argument is just wrong."
In other words, this is just a GIGO case (Garbage In, Garbage Out).
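To make the quiet/signaling distinction concrete, here is a small sketch of the bit manipulation involved (the constant follows the IEEE-754 single-precision layout; the helper name is illustrative):

```csharp
using System;

public static class NanBits
{
    // Top bit of the single-precision mantissa: set = quiet NaN;
    // clear (with a nonzero mantissa and all-ones exponent) = signaling NaN.
    public const uint QuietBit = 0x00400000;

    // The x87 FLD/FSTP round trip quiets a signaling NaN by setting this bit.
    public static uint Quiet(uint bits) => bits | QuietBit;
}
```

For the value above: 0x7FABC41B has an all-ones exponent and the quiet bit clear, i.e. a signaling NaN; setting the quiet bit yields 0x7FEBC41B, which is exactly the observed 0xAB -> 0xEB byte change.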
I'm learning assembler. I practise with this code:
ASM:
;-------------------------------------------------------------------------
.586
.MODEL flat, stdcall
public srednia_harm
OPTION CASEMAP:NONE
INCLUDE include\windows.inc
INCLUDE include\user32.inc
INCLUDE include\kernel32.inc
.CODE
jeden dd 1.0
DllEntry PROC hInstDLL:HINSTANCE, reason:DWORD, reserved1:DWORD
mov eax, TRUE
ret
DllEntry ENDP
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
srednia_harm PROC
push ebp
mov esp,ebp
push esi
mov esi, [ebp+8] ; address of array
mov ecx, [ebp+12] ; the number of elements
finit
fldz ; the current value of the sum - st(0)=0
mianownik:
fld dword PTR jeden ;ST(0)=1, ST(1)=sum
fld dword PTR [esi] ;loading of array elements - ST(0)=tab[i], ST(1)=1 ST(2)=suma
fdivp st(1), st(0) ; st(1)=st(1)/(st0) -> ST(0)=1/tab[i], ST(1)=suma
faddp st(1),st(0) ; st(1)=st(0)+st(1) -> st(0)=suma+1/tab[i]
add esi,4
loop mianownik
pop esi
pop ebp
ret
srednia_harm ENDP
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
;-------------------------------------------------------------------------
END DllEntry
DEF:
LIBRARY "biblioteka"
EXPORTS
srednia_harm
C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Runtime.InteropServices;
namespace GUI
{
unsafe class FunkcjeAsemblera //imports of assembler's function
{
[DllImport("bibliotekaASM.dll", CallingConvention = CallingConvention.StdCall)]
private static extern float srednia_harm(float[] table, int n);
public float wywolajTest(float[] table, int n)
{
float wynik = srednia_harm(table, n);
return wynik;
}
}
}
C#:
private void button6_Click(object sender, EventArgs e)
{
FunkcjeAsemblera funkcje = new FunkcjeAsemblera();
int n = 4;
float[] table = new float[n];
for (int i = 0; i < n; i++)
table[i] = 1;
float wynik = funkcje.wywolajTest(table, n);
textBox6.Text = wynik.ToString();
}
When I run this code, everything is fine; the result is 4, as I expected. But I tried to understand the code, so I set a lot of breakpoints in the ASM function. Then the problems started. The array was exactly where it should be in memory, but the second parameter was lost: its address pointed to an empty field in memory. I tried a lot of combinations and changed types, and it was still the same.
I did some research but didn't find any clues. How is it possible that everything works fine when I run the program, but not when I debug it?
OK, I tested this in Debug and Release mode with Properties -> Debug -> Enable native code debugging turned on. It works in both cases with Step Into (F11); the 'n' variable is accessed properly.
One problem I noticed is an improper PROC setup. The code above accesses the two parameters relative to EBP but does not clean up the stack (in stdcall, the callee is responsible for cleaning up the stack; see Wikipedia).
push ebp
mov esp,ebp
push esi
mov esi,dword ptr [ebp+8]
mov ecx,dword ptr [ebp+0Ch]
wait
...
add esi,4
loop 6CC7101F
pop esi
pop ebp
ret <-- two params not cleaned up
The following is the code assembled by the PROC heading below:
push ebp
mov ebp,esp
push esi
mov esi,dword ptr [ebp+8]
mov ecx,dword ptr [ebp+0Ch]
wait
...
add esi,4
loop 6CC7101F
pop esi
leave <-- restores EBP
ret 8 <-- two params cleaned up
I suggest changing the PROC to
srednia_harm PROC uses esi lpArr: DWORD, num: DWORD
mov esi, lpArr
mov ecx, num
...
ret
srednia_harm ENDP
Maybe that was the cause of some of the trouble.
Delphi:
procedure TForm1.Button1Click(Sender: TObject);
var I,Tick:Integer;
begin
Tick := GetTickCount();
for I := 0 to 1000000000 do
begin
end;
Button1.Caption := IntToStr(GetTickCount()-Tick)+' ms';
end;
C#:
private void button1_Click(object sender, EventArgs e)
{
int tick = System.Environment.TickCount;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = System.Environment.TickCount - tick;
button1.Text = tick.ToString()+" ms";
}
Delphi gives around 515 ms
C# gives around 3775 ms
Delphi is compiled to native code, whereas C# is compiled to CIL, which is then JIT-compiled at runtime. Since C# does use JIT compilation, you might expect the timings to be more similar, but that is not a given.
It would be useful if you could describe the hardware you ran this on (CPU, clock rate).
I do not have access to Delphi to repeat your experiment, but using native C++ vs C# and the following code:
VC++ 2008
#include <iostream>
#include <windows.h>
int main(void)
{
int tick = GetTickCount() ;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = GetTickCount() - tick;
std::cout << tick << " ms" << std::endl ;
}
C#
using System;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
int tick = System.Environment.TickCount;
for (int i = 0; i < 1000000000; ++i)
{
}
tick = System.Environment.TickCount - tick;
Console.Write( tick.ToString() + " ms" ) ;
}
}
}
I initially got:
C++ 2792ms
C# 2980ms
However I then performed a Rebuild on the C# version and ran the executable in <project>\bin\release and <project>\bin\debug respectively directly from the command line. This yielded:
C# (release): 720ms
C# (debug): 3105ms
So I reckon that is where the difference truly lies: you were running the debug version of the C# code from the IDE.
In case you are thinking that C++ is then particularly slow, I ran that as an optimised release build and got:
C++ (Optimised): 0ms
This is not surprising because the loop is empty, and the control variable is not used outside the loop so the optimiser removes it altogether. To avoid that I declared i as a volatile with the following result:
C++ (volatile i): 2932ms
My guess is that the C# implementation also removed the loop and that the 720ms is from something else; this may explain most of the difference between the timings in the first test.
What Delphi is doing I cannot tell, you might look at the generated assembly code to see.
All the above tests on AMD Athlon Dual Core 5000B 2.60GHz, on Windows 7 32bit.
If this is intended as a benchmark, it's an exceptionally bad one, as in both cases the loop can be optimized away, so you have to look at the generated machine code to see what's going on. If you use Release mode for C#, the following code
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000000; ++i){ }
sw.Stop();
Console.WriteLine(sw.Elapsed);
is transformed by the JITter to this:
push ebp
mov ebp,esp
push edi
push esi
call 67CDBBB0
mov edi,eax
xor eax,eax ; i = 0
inc eax ; ++i
cmp eax,3B9ACA00h ; i < 1000000000 ?
jl 0000000E ; yes: loop again
mov ecx,edi
cmp dword ptr [ecx],ecx
call 67CDBC10
mov ecx,66DDAEDCh
call FFE8FBE0
mov esi,eax
mov ecx,edi
call 67CD75A8
mov ecx,eax
lea eax,[esi+4]
mov dword ptr [eax],ecx
mov dword ptr [eax+4],edx
call 66A94C90
mov ecx,eax
mov edx,esi
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+3Ch]
call dword ptr [eax+14h]
pop esi
pop edi
pop ebp
ret
TickCount is not a reliable timer; you should use .NET's Stopwatch class. (I don't know what the Delphi equivalent is.)
Also, are you running a Release build?
Do you have a debugger attached?
The Delphi compiler counts the for loop downwards (when it can); the code sample above is compiled to:
Unit1.pas.42: Tick := GetTickCount();
00489367 E8B802F8FF call GetTickCount
0048936C 8BF0 mov esi,eax
Unit1.pas.43: for I := 0 to 1000000000 do
0048936E B801CA9A3B mov eax,$3b9aca01
00489373 48 dec eax
00489374 75FD jnz $00489373
You are comparing native code against VM JITted code, and that is not fair. Native code will ALWAYS be faster, since the JITter cannot optimize the code the way a native compiler can.
That said, comparing Delphi against C# is not fair at all; a Delphi binary will always win (faster, smaller, without any kind of dependencies, etc).
Btw, I'm sadly amazed how many posters here don't know these differences... or maybe you just hurt some .NET zealots who try to defend C# against anything that shows there are better options out there.
This is the C# disassembly:
DEBUG:
// int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
0000004e 33 D2 xor edx,edx
00000050 89 55 B8 mov dword ptr [ebp-48h],edx
00000053 90 nop
00000054 EB 00 jmp 00000056
00000056 FF 45 B8 inc dword ptr [ebp-48h]
00000059 81 7D B8 00 CA 9A 3B cmp dword ptr [ebp-48h],3B9ACA00h
00000060 0F 95 C0 setne al
00000063 0F B6 C0 movzx eax,al
00000066 89 45 B4 mov dword ptr [ebp-4Ch],eax
00000069 83 7D B4 00 cmp dword ptr [ebp-4Ch],0
0000006d 75 E7 jne 00000056
As you can see, it is a waste of CPU cycles.
EDIT:
RELEASE:
//unchecked
//{
//int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
00000032 33 D2 xor edx,edx
00000034 89 55 F4 mov dword ptr [ebp-0Ch],edx
00000037 FF 45 F4 inc dword ptr [ebp-0Ch]
0000003a 81 7D F4 00 CA 9A 3B cmp dword ptr [ebp-0Ch],3B9ACA00h
00000041 75 F4 jne 00000037
//}
EDIT:
And this is the C++ version, running about 9x faster on my machine:
__asm
{
PUSH ECX
PUSH EBX
XOR ECX, ECX
MOV EBX, 1000000000
NEXT: INC ECX
CMP ECX, EBX
JS NEXT
POP EBX
POP ECX
}
You should attach a debugger and take a look at the machine code generated by each.
Delphi would almost certainly optimise that loop to execute in reverse order (i.e. DOWNTO zero rather than FROM zero) - Delphi does this whenever it determines it is "safe" to do, presumably because decrementing and testing against zero is cheaper than incrementing and comparing against a non-zero bound.
What happens if you try both cases specifying the loops to execute in reverse order?
In Delphi the loop bound is evaluated only once, before the loop begins, whereas in C# the break condition is evaluated again on each pass.
That is one reason looping in Delphi can be faster than in C#.
"// int i = 0; while (++i != 1000000000) ;"
That's interesting.
while (++i != x) is not the same as for (; i != x; i++)
The difference is that the while loop never executes the body with i = 0.
Try it out by running something like this:
int i;
for (i = 0; i < 5; i++)
    Console.WriteLine(i);   // prints 0 1 2 3 4

i = 0;
while (++i != 5)
    Console.WriteLine(i);   // prints 1 2 3 4