I'm writing a C# class to perform 2D separable convolution using integers, to obtain better performance than the double counterpart. The problem is that I don't obtain a real performance gain.
This is the X filter code (it is valid both for int and double cases):
for (int index = 0; index < pixelCount; index++) // i.e. foreach pixel; pixelCount = width * height
{
    int value = 0;
    for (int k = 0; k < filterOffsetsX.Length; k++)
    {
        // index is relative to the current pixel position
        value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k];
    }
    tempImage[index] = value;
}
In the integer case "value", "InputImage" and "tempImage" are of "int", "Image<byte>" and "Image<int>" types.
In the double case "value", "InputImage" and "tempImage" are of "double", "Image<double>" and "Image<double>" types.
(filterValues is int[] in each case)
(The class Image<T> is part of an external DLL. It should be similar to the .NET Drawing Image class.)
My goal is to achieve fast performance thanks to int += (byte * int) vs double += (double * int).
The following times are mean of 200 repetitions.
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)
The performance gain is minimal. Can this be caused by pipeline stalls and suboptimal code?
EDIT: simplified the code deleting unimportant vars.
EDIT2: I don't think I have a cache-miss-related problem, because "index" iterates through adjacent memory cells (row after row). Moreover, "filterOffsetsX" contains only small offsets relative to pixels on the same row, at a max distance of filter size / 2. The problem could be present in the second separable filter (the Y filter), but the times are not so different.
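For reference, a minimal sketch of the full two-pass separable structure described above, using plain arrays; width, height, the radii halfX/halfY and row-based Y offsets are my assumptions, not the Image<T> API:

// Assumptions: input is byte[], temp/output are int[] of size width*height;
// halfX/halfY are the filter radii; Y offsets are expressed in rows.
for (int y = 0; y < height; y++)
    for (int x = halfX; x < width - halfX; x++)
    {
        int value = 0;
        for (int k = 0; k < filterValuesX.Length; k++)
            value += input[y * width + x + filterOffsetsX[k]] * filterValuesX[k];
        temp[y * width + x] = value;
    }

for (int y = halfY; y < height - halfY; y++)
    for (int x = 0; x < width; x++)
    {
        int value = 0;
        for (int k = 0; k < filterValuesY.Length; k++)
            value += temp[(y + filterOffsetsY[k]) * width + x] * filterValuesY[k];
        output[y * width + x] = value;
    }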
I benchmarked this using Visual C++, because that way I can be sure that I'm timing arithmetic operations and not much else.
Results (each operation is performed 600 million times):
i16 add: 834575
i32 add: 840381
i64 add: 1691091
f32 add: 987181
f64 add: 979725
i16 mult: 850516
i32 mult: 858988
i64 mult: 6526342
f32 mult: 1085199
f64 mult: 1072950
i16 divide: 3505916
i32 divide: 3123804
i64 divide: 10714697
f32 divide: 8309924
f64 divide: 8266111
freq = 1562587
CPU is an Intel Core i7, Turbo Boosted to 2.53 GHz.
Benchmark code:
#include <stdio.h>
#include <windows.h>

template<void (*unit)(void)>
void profile( const char* label )
{
    static __int64 cumtime;
    LARGE_INTEGER before, after;
    ::QueryPerformanceCounter(&before);
    (*unit)();
    ::QueryPerformanceCounter(&after);
    after.QuadPart -= before.QuadPart;
    printf("%s: %I64i\n", label, cumtime += after.QuadPart);
}

const unsigned repcount = 10000000;

template<typename T>
void add(volatile T& var, T val) { var += val; }

template<typename T>
void mult(volatile T& var, T val) { var *= val; }

template<typename T>
void divide(volatile T& var, T val) { var /= val; }

template<typename T, void (*fn)(volatile T& var, T val)>
void integer_op( void )
{
    unsigned reps = repcount;
    do {
        volatile T var = 2000;
        fn(var,5);
        fn(var,6);
        fn(var,7);
        fn(var,8);
        fn(var,9);
        fn(var,10);
    } while (--reps);
}

template<typename T, void (*fn)(volatile T& var, T val)>
void fp_op( void )
{
    unsigned reps = repcount;
    do {
        volatile T var = (T)2.0;
        fn(var,(T)1.01);
        fn(var,(T)1.02);
        fn(var,(T)1.03);
        fn(var,(T)2.01);
        fn(var,(T)2.02);
        fn(var,(T)2.03);
    } while (--reps);
}

int main( void )
{
    LARGE_INTEGER freq;
    unsigned reps = 10;
    do {
        profile<&integer_op<__int16,add<__int16>>>("i16 add");
        profile<&integer_op<__int32,add<__int32>>>("i32 add");
        profile<&integer_op<__int64,add<__int64>>>("i64 add");
        profile<&fp_op<float,add<float>>>("f32 add");
        profile<&fp_op<double,add<double>>>("f64 add");
        profile<&integer_op<__int16,mult<__int16>>>("i16 mult");
        profile<&integer_op<__int32,mult<__int32>>>("i32 mult");
        profile<&integer_op<__int64,mult<__int64>>>("i64 mult");
        profile<&fp_op<float,mult<float>>>("f32 mult");
        profile<&fp_op<double,mult<double>>>("f64 mult");
        profile<&integer_op<__int16,divide<__int16>>>("i16 divide");
        profile<&integer_op<__int32,divide<__int32>>>("i32 divide");
        profile<&integer_op<__int64,divide<__int64>>>("i64 divide");
        profile<&fp_op<float,divide<float>>>("f32 divide");
        profile<&fp_op<double,divide<double>>>("f64 divide");
        ::QueryPerformanceFrequency(&freq);
        putchar('\n');
    } while (--reps);
    printf("freq = %I64i\n", freq.QuadPart); // freq is a LARGE_INTEGER; pass its __int64 member to printf
}
I did a default optimized build using Visual C++ 2010 32-bit.
Every call to add, mult, and divide inside the loops got inlined. Function calls were still generated to profile, but since 60 million operations are done for each call, I think the function-call overhead is unimportant.
Even with volatile thrown in, the Visual C++ optimizing compiler is SMART. I originally used small integers as the right-hand operand, and the compiler happily used lea and add instructions to do integer multiply. You may get more of a boost from calling out to highly optimized C++ code than the common wisdom suggests, simply because the C++ optimizer does a much better job than any JIT.
Originally I had the initialization of var outside the loop, and that made the floating-point multiply code run miserably slow because of the constant overflows. The FPU's handling of NaNs is slow, which is something else to keep in mind when writing high-performance number-crunching routines.
The dependencies are also set up in such a way as to prevent pipelining. If you want to see the effects of pipelining, say so in a comment, and I'll revise the testbench to operate on multiple variables instead of just one.
Disassembly of i32 multiply:
; COMDAT ??$integer_op@H$1??$mult@H@@YAXACHH@Z@@YAXXZ
_TEXT SEGMENT
_var$66971 = -4 ; size = 4
??$integer_op@H$1??$mult@H@@YAXACHH@Z@@YAXXZ PROC ; integer_op<int,&mult<int> >, COMDAT
; 29 : {
00000 55 push ebp
00001 8b ec mov ebp, esp
00003 51 push ecx
; 30 : unsigned reps = repcount;
00004 b8 80 96 98 00 mov eax, 10000000 ; 00989680H
00009 b9 d0 07 00 00 mov ecx, 2000 ; 000007d0H
0000e 8b ff npad 2
$LL3@integer_op@5:
; 31 : do {
; 32 : volatile T var = 2000;
00010 89 4d fc mov DWORD PTR _var$66971[ebp], ecx
; 33 : fn(var,751);
00013 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
00016 69 d2 ef 02 00
00 imul edx, 751 ; 000002efH
0001c 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 34 : fn(var,6923);
0001f 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
00022 69 d2 0b 1b 00
00 imul edx, 6923 ; 00001b0bH
00028 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 35 : fn(var,7124);
0002b 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
0002e 69 d2 d4 1b 00
00 imul edx, 7124 ; 00001bd4H
00034 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 36 : fn(var,81);
00037 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
0003a 6b d2 51 imul edx, 81 ; 00000051H
0003d 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 37 : fn(var,9143);
00040 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
00043 69 d2 b7 23 00
00 imul edx, 9143 ; 000023b7H
00049 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 38 : fn(var,101244215);
0004c 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
0004f 69 d2 37 dd 08
06 imul edx, 101244215 ; 0608dd37H
; 39 : } while (--reps);
00055 48 dec eax
00056 89 55 fc mov DWORD PTR _var$66971[ebp], edx
00059 75 b5 jne SHORT $LL3@integer_op@5
; 40 : }
0005b 8b e5 mov esp, ebp
0005d 5d pop ebp
0005e c3 ret 0
??$integer_op@H$1??$mult@H@@YAXACHH@Z@@YAXXZ ENDP ; integer_op<int,&mult<int> >
; Function compile flags: /Ogtp
_TEXT ENDS
And of f64 multiply:
; COMDAT ??$fp_op@N$1??$mult@N@@YAXACNN@Z@@YAXXZ
_TEXT SEGMENT
_var$67014 = -8 ; size = 8
??$fp_op@N$1??$mult@N@@YAXACNN@Z@@YAXXZ PROC ; fp_op<double,&mult<double> >, COMDAT
; 44 : {
00000 55 push ebp
00001 8b ec mov ebp, esp
00003 83 e4 f8 and esp, -8 ; fffffff8H
; 45 : unsigned reps = repcount;
00006 dd 05 00 00 00
00 fld QWORD PTR __real@4000000000000000
0000c 83 ec 08 sub esp, 8
0000f dd 05 00 00 00
00 fld QWORD PTR __real@3ff028f5c28f5c29
00015 b8 80 96 98 00 mov eax, 10000000 ; 00989680H
0001a dd 05 00 00 00
00 fld QWORD PTR __real@3ff051eb851eb852
00020 dd 05 00 00 00
00 fld QWORD PTR __real@3ff07ae147ae147b
00026 dd 05 00 00 00
00 fld QWORD PTR __real@4000147ae147ae14
0002c dd 05 00 00 00
00 fld QWORD PTR __real@400028f5c28f5c29
00032 dd 05 00 00 00
00 fld QWORD PTR __real@40003d70a3d70a3d
00038 eb 02 jmp SHORT $LN3@fp_op@3
$LN22@fp_op@3:
; 46 : do {
; 47 : volatile T var = (T)2.0;
; 48 : fn(var,(T)1.01);
; 49 : fn(var,(T)1.02);
; 50 : fn(var,(T)1.03);
; 51 : fn(var,(T)2.01);
; 52 : fn(var,(T)2.02);
; 53 : fn(var,(T)2.03);
; 54 : } while (--reps);
0003a d9 ce fxch ST(6)
$LN3@fp_op@3:
0003c 48 dec eax
0003d d9 ce fxch ST(6)
0003f dd 14 24 fst QWORD PTR _var$67014[esp+8]
00042 dd 04 24 fld QWORD PTR _var$67014[esp+8]
00045 d8 ce fmul ST(0), ST(6)
00047 dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
0004a dd 04 24 fld QWORD PTR _var$67014[esp+8]
0004d d8 cd fmul ST(0), ST(5)
0004f dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
00052 dd 04 24 fld QWORD PTR _var$67014[esp+8]
00055 d8 cc fmul ST(0), ST(4)
00057 dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
0005a dd 04 24 fld QWORD PTR _var$67014[esp+8]
0005d d8 cb fmul ST(0), ST(3)
0005f dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
00062 dd 04 24 fld QWORD PTR _var$67014[esp+8]
00065 d8 ca fmul ST(0), ST(2)
00067 dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
0006a dd 04 24 fld QWORD PTR _var$67014[esp+8]
0006d d8 cf fmul ST(0), ST(7)
0006f dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
00072 75 c6 jne SHORT $LN22@fp_op@3
00074 dd d8 fstp ST(0)
00076 dd dc fstp ST(4)
00078 dd da fstp ST(2)
0007a dd d8 fstp ST(0)
0007c dd d8 fstp ST(0)
0007e dd d8 fstp ST(0)
00080 dd d8 fstp ST(0)
; 55 : }
00082 8b e5 mov esp, ebp
00084 5d pop ebp
00085 c3 ret 0
??$fp_op@N$1??$mult@N@@YAXACNN@Z@@YAXXZ ENDP ; fp_op<double,&mult<double> >
; Function compile flags: /Ogtp
_TEXT ENDS
It seems like you are saying you are only running that inner loop 5000 times, even in your longest case. The FPU, last I checked (admittedly a long time ago), only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That's assuming no cache misses or anything else that would cause the CPU to sit and wait in either event.
Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to have saved about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn't sweat that much CPU wastage here, even if we were missing a deadline somewhere.
digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than with the integer unit, not each operation in it. If that were the case, your expected time savings would be so small it would be tough to measure them.
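To illustrate that pipelining point, here is a hedged sketch of splitting the question's accumulation across two independent accumulators, so the multiply-adds can overlap instead of serializing on a single value (the input array name is my assumption; the filter arrays come from the question):

int v0 = 0, v1 = 0;
int k = 0;
for (; k + 1 < filterValuesX.Length; k += 2)
{
    // The two accumulations are independent, so the CPU can pipeline them.
    v0 += input[index + filterOffsetsX[k]] * filterValuesX[k];
    v1 += input[index + filterOffsetsX[k + 1]] * filterValuesX[k + 1];
}
if (k < filterValuesX.Length) // odd filter length: one trailing tap
    v0 += input[index + filterOffsetsX[k]] * filterValuesX[k];
int value = v0 + v1;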
If you really are doing enough floating-point ops to make the entire shebang take a very long time, I'd suggest looking into doing one or more of the following:
Parallelize your algorithm and run it on every CPU available from your processor (see the sketch after this list).
Don't run it on the CLR (use native C++, or Ada or Fortran or something).
Rewrite it to run on the GPU. GPUs are essentially array processors and are designed to do massively parallel math on arrays of floating-point values.
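For the first suggestion, a minimal TPL sketch; height and the per-row FilterRow method are hypothetical stand-ins for the work in the question's loop, not names from the original code:

using System.Threading.Tasks;

// Each output row of a separable filter pass is independent of the others,
// so rows can be distributed across all cores.
Parallel.For(0, height, y =>
{
    FilterRow(y);
});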
Your algorithm seems to access large regions of memory in a very non-sequential pattern. It's probably generating tons of cache misses. The bottleneck is probably memory access, not arithmetic. Using ints should make this slightly faster because ints are 32 bits, while doubles are 64 bits, meaning cache will be used slightly more efficiently. If almost every loop iteration involves a cache miss, though, you're basically out of luck unless you can make some algorithmic or data structure layout changes to improve the locality of reference.
BTW, have you considered using an FFT for convolution? That would put you in a completely different big-O class.
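To expand on that last point, here is a hedged sketch of FFT-based convolution (a textbook iterative radix-2 Cooley-Tukey FFT; none of this is from the original answer). Direct convolution is O(n·m) in signal and kernel length, while the FFT route is O(n log n), so it pays off for large kernels; for small separable kernels like the question's, direct convolution usually stays faster:

using System;
using System.Numerics;

static class FftConvolution
{
    // In-place iterative radix-2 Cooley-Tukey FFT; a.Length must be a power of two.
    static void Fft(Complex[] a, bool invert)
    {
        int n = a.Length;
        // Bit-reversal permutation.
        for (int i = 1, j = 0; i < n; i++)
        {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) { Complex t = a[i]; a[i] = a[j]; a[j] = t; }
        }
        // Butterfly passes.
        for (int len = 2; len <= n; len <<= 1)
        {
            double ang = 2 * Math.PI / len * (invert ? -1 : 1);
            Complex wlen = new Complex(Math.Cos(ang), Math.Sin(ang));
            for (int i = 0; i < n; i += len)
            {
                Complex w = Complex.One;
                for (int k = 0; k < len / 2; k++)
                {
                    Complex u = a[i + k];
                    Complex v = a[i + k + len / 2] * w;
                    a[i + k] = u + v;
                    a[i + k + len / 2] = u - v;
                    w *= wlen;
                }
            }
        }
        if (invert)
            for (int i = 0; i < n; i++) a[i] /= n;
    }

    // Linear convolution of a signal with a kernel via the convolution theorem.
    public static double[] Convolve(double[] signal, double[] kernel)
    {
        int resultLength = signal.Length + kernel.Length - 1;
        int n = 1;
        while (n < resultLength) n <<= 1;        // pad to a power of two
        Complex[] fa = new Complex[n];
        Complex[] fb = new Complex[n];
        for (int i = 0; i < signal.Length; i++) fa[i] = signal[i];
        for (int i = 0; i < kernel.Length; i++) fb[i] = kernel[i];
        Fft(fa, false);
        Fft(fb, false);
        for (int i = 0; i < n; i++) fa[i] *= fb[i];  // pointwise product
        Fft(fa, true);
        double[] result = new double[resultLength];
        for (int i = 0; i < resultLength; i++) result[i] = fa[i].Real;
        return result;
    }
}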
At least it is not fair to compare int (DWORD, 4 bytes) and double (QWORD, 8 bytes) on a 32-bit system. Compare int to float, or long to double, to get fair results. double has increased precision; you must pay for it.
PS: for me it smells like micro (and premature) optimization, and that smell is not good.
Edit: OK, good point. It is not correct to compare long to double, but comparing int and double on a 32-bit CPU is still not correct, even if both have intrinsic instructions. This is not magic; x86 is a fat CISC, but a double is still not processed as a single step internally.
On my machine, I find that floating-point multiplication is about the same speed as integer multiplication.
I'm using this timing function:
static void Time<T>(int count, string desc, Func<T> action)
{
    action();
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < count; i++)
        action();
    double seconds = sw.Elapsed.TotalSeconds;
    Console.WriteLine("{0} took {1} seconds", desc, seconds);
}
Let's say you're processing a 200 x 200 array with a 25-length filter 200 times; then your inner loop is executing 200 * 200 * 25 * 200 = 200,000,000 times. Each time, you're doing one multiply, one add, and 3 array indices. So I used this profiling code:
const int count = 200000000;
int[] a = {1};
double d = 5;
int i = 5;
Time(count, "array index", ()=>a[0]);
Time(count, "double mult", ()=>d * 6);
Time(count, "double add ", ()=>d + 6);
Time(count, "int mult", ()=>i * 6);
Time(count, "int add ", ()=>i + 6);
On my machine (slower than yours, I think), I get the following results:
array index took 1.4076632 seconds
double mult took 1.2203911 seconds
double add took 1.2342998 seconds
int mult took 1.2170384 seconds
int add took 1.0945793 seconds
As you see, integer multiplication, floating-point multiplication, and floating-point addition all took about the same time. Array indexing took a little longer (and you're doing it three times), and integer addition was a little faster.
So I think the performance advantage to integer math in your scenario is just too slight to make a significant difference, especially when outweighed by the relatively huge penalty you're paying for array indexing. If you really need to speed this up, then you should use unsafe pointers to your arrays to avoid the offset calculation and bounds checking.
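A minimal sketch of that unsafe-pointer idea, applied to the question's X pass (the method shape and array names are my assumptions; this requires compiling with /unsafe):

static unsafe void FilterRowX(byte[] input, int[] temp,
                              int[] filterOffsetsX, int[] filterValuesX,
                              int start, int end)
{
    // Pinning the arrays removes bounds checks and offset recomputation
    // from the inner loop.
    fixed (byte* pIn = input)
    fixed (int* pTemp = temp, pOff = filterOffsetsX, pVal = filterValuesX)
    {
        int len = filterValuesX.Length;
        for (int index = start; index < end; index++)
        {
            int value = 0;
            for (int k = 0; k < len; k++)
                value += pIn[index + pOff[k]] * pVal[k];
            pTemp[index] = value;
        }
    }
}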
By the way, the performance difference for division is much more striking. Following the pattern above, I get:
double div took 3.8597251 seconds
int div took 1.7824505 seconds
One more note:
Just to be clear, all profiling should be done with an optimized release build. Debug builds will be slower overall, and some operations may not have accurate timing with respect to others.
If the times you measured are accurate, then the runtime of your filtering algorithm seems to grow with the cube of the filter size. What kind of filter is that? Maybe you can reduce the number of multiplications needed. (e.g. if you're using a separable filter kernel?)
Otherwise, if you need raw performance, you might consider using a library like the Intel Performance Primitives - it contains highly optimized functions for things like this that use CPU SIMD instructions. They're usually a lot faster than hand-written code in C# or C++.
Did you try looking at the disassembled code? In high-level languages I'm pretty much trusting the compiler to optimize my code.
For example, for (i = 0; i < imageSize; i++) might be faster than foreach.
Also, arithmetic operations might get optimized by the compiler anyway. When you need to optimize something, you either optimize the whole "black box" (and maybe reinvent the algorithm used in that loop), or you first take a look at the disassembled code and see what's wrong with it.
As a Contains implementation, I am using a slightly tweaked method written by Andy Dent to query my Realm database:
private IQueryable<Entry> FilterEntriesByIds(IQueryable<Entry> allEntries, int[] idsToMatch)
{
    // Fancy way to invent Contains<>() for LINQ
    ParameterExpression pe = Expression.Parameter(typeof(Entry), "Entry");
    Expression chainedByOr = null;
    Expression left = Expression.Property(pe, typeof(Entry).GetProperty("Id"));
    for (int i = 0; i < idsToMatch.Count(); i++)
    {
        Expression right = Expression.Constant(idsToMatch[i]);
        Expression anotherEqual = Expression.Equal(left, right);
        if (chainedByOr == null)
            chainedByOr = anotherEqual;
        else
            chainedByOr = Expression.OrElse(chainedByOr, anotherEqual);
    }
    MethodCallExpression whereCallExpression = Expression.Call(
        typeof(Queryable),
        "Where",
        new Type[] { allEntries.ElementType },
        allEntries.Expression,
        Expression.Lambda<Func<Entry, bool>>(chainedByOr, new ParameterExpression[] { pe }));
    return allEntries.Provider.CreateQuery<Entry>(whereCallExpression);
}
It all works just fine as long as I pass fewer than 2-3K ids; with larger amounts the app just crashes, with what seems to be a stack overflow exception.
My first thought was to break the query into chunks and combine the results later, but the Concat and Union methods do not work on these Realm IQueryables. So how else can I merge such chunked results? Or is there any other workaround?
I can't just convert the results to a List or something and then merge; I have to return Realm objects as IQueryable<>.
The call stack:
=================================================================
Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
No native Android stacktrace (see debuggerd output).
=================================================================
Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x7c326075c8):
0x7c326075b8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x7c326075c8 fd 7b bb a9 fd 03 00 91 a0 0b 00 f9 10 0e 87 d2 .{..............
0x7c326075d8 10 d0 a7 f2 90 0f c0 f2 b0 0f 00 f9 10 00 9e d2 ................
0x7c326075e8 f0 d5 a9 f2 90 0f c0 f2 b0 13 00 f9 a0 a3 00 91 ................
=================================================================
Managed Stacktrace:
=================================================================
at Realms.QueryHandle:GroupBegin <0x00000>
at Realms.RealmResultsVisitor:VisitCombination <0x0007f>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
at Realms.RealmResultsVisitor:VisitCombination <0x000d7>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
at Realms.RealmResultsVisitor:VisitCombination <0x000d7>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
at Realms.RealmResultsVisitor:VisitCombination <0x000d7>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
UPDATE
I found the exact number of elements after which this error gets thrown: 3939. If I pass anything larger than that, it crashes.
I'm not a database expert; I have just worked at varying levels (including Realm), usually on top of someone else's lower-level engine (the c-tree Plus ISAM engine, or the C++ core by the real wizards at Realm).
My first impression is that you have mostly-static data, so this is a fairly classical problem where you want a better index generated up front.
I think you can build such an index using Realm but with a bit more application logic.
It sounds a bit like an inverted index problem, as in this SO question.
For all the single and probably at least double-word combinations which map to your words, you want another table that links to all the matching words. You can create those with a fairly easy loop over the existing words.
e.g.: your OneLetter table would have an entry for a that uses a one-to-many relationship to all the matching words in the main Words table.
That will deliver a very fast IList of the matching words that you can iterate.
Then you can flip to your Contains approach at 3 letters or above.
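A rough sketch of what such an index could look like as Realm model classes (class and property names are my invention, not from the original answer):

using System.Collections.Generic;
using Realms;

public class Word : RealmObject
{
    [PrimaryKey]
    public int Id { get; set; }

    public string Text { get; set; }
}

// One row per prefix ("a", "ab", ...), each linking to every matching word.
public class PrefixIndex : RealmObject
{
    [PrimaryKey]
    public string Prefix { get; set; }

    public IList<Word> Words { get; }
}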
The SQL provider for SQL Server can handle a Contains operation with more items than the limit on the number of SQL parameters in one query, which is 2100.
This sample is working:
var ids = new List<int>();
for (int i = 0; i < 10000; i++)
{
    ids.Add(i);
}
var testQuery = dbContext.Entity.Where(x => ids.Contains(x.Id)).ToList();
That means you could try to rewrite your method to use ids.Contains(x.Id).
An example of how to do it is in this post.
UPDATE: sorry, I did not notice that this is about Realm, but perhaps it is still worth trying.
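One more possibility, sketched here as an untested idea rather than a confirmed fix: the managed stack trace shows the visitor recursing once per OrElse node, and the question's loop builds a left-leaning chain whose depth grows linearly with the number of ids. Building the same predicate as a balanced tree keeps the recursion depth at O(log n):

// Balanced replacement for the chainedByOr loop in the question; the
// predicate is logically identical, only the tree shape changes.
// Assumes ids is non-empty.
static Expression BalancedOr(Expression idProperty, int[] ids, int lo, int hi)
{
    if (hi - lo == 1)
        return Expression.Equal(idProperty, Expression.Constant(ids[lo]));
    int mid = (lo + hi) / 2;
    return Expression.OrElse(
        BalancedOr(idProperty, ids, lo, mid),
        BalancedOr(idProperty, ids, mid, hi));
}

// Usage inside FilterEntriesByIds:
// Expression chainedByOr = BalancedOr(left, idsToMatch, 0, idsToMatch.Length);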
One of my co-workers has been reading Clean Code by Robert C Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequence of this methodology. So we wrote a quick program to test the performance and are confused by the results.
For starters here is the normal version of the function.
static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = a + b + 1;
        }
    }
    return a;
}
Here is the version I made that breaks the functionality into small functions.
static double TinyFunctions()
{
    double a = 0;
    for (int i = 0; i < s_OuterLoopCount; i++)
    {
        a = Loop(a);
    }
    return a;
}

static double Loop(double a)
{
    for (int i = 0; i < s_InnerLoopCount; i++)
    {
        double b = Double(i);
        a = Add(a, Add(b, 1));
    }
    return a;
}

static double Double(double a)
{
    return a * 2;
}

static double Add(double a, double b)
{
    return a + b;
}
I use the Stopwatch class to time the functions, and when I ran it in debug I got the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;
These results make sense to me, especially in debug, as there is additional overhead in the function calls. It is when I run it in release that I get the following results.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;
These results confuse me: even if the compiler was optimizing TinyFunctions by inlining all the function calls, how could that make it run in roughly 57% of the time?
We have tried moving variable declarations around in NormalFunction and it had basically no effect on the run time.
I was hoping that someone would know what is going on, and if the compiler can optimize TinyFunctions so well, why it can't apply similar optimizations to NormalFunction.
In looking around we found where someone mentioned that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunction only has 4 variables, so I find it hard to believe that explains the massive performance difference.
I'd be grateful for any insight someone can provide.
Update 1
As pointed out below by Kyle, changing the order of operations made a massive difference in the performance of NormalFunction.
static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = b + 1 + a;
        }
    }
    return a;
}
Here are the results with this configuration.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;
This is more what I expected, but it still leaves the question as to why the order of operations can cause a ~56% performance hit.
Furthermore, I then tried it with integer operations and we are back to not making any sense.
s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;
And this doesn't change regardless of the order of operations.
I can make performance match much better by changing one line of code:
a = a + b + 1;
Change it to:
a = b + 1 + a;
Or:
a += b + 1;
Now you'll find that NormalFunction might actually be slightly faster and you can "fix" that by changing the signature of the Double method to:
int Double( int a ) { return a * 2; }
I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar with TinyFunctions being a few percent slower (as expected).
The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.
But that doesn't account for the bulk of the runtime discrepancy. That comes down almost entirely to the order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:
Original Changed
01070620 push ebp 01390620 push ebp
01070621 mov ebp,esp 01390621 mov ebp,esp
01070623 push edi 01390623 push edi
01070624 push esi 01390624 push esi
01070625 push eax 01390625 push eax
01070626 fldz 01390626 fldz
01070628 xor esi,esi 01390628 xor esi,esi
0107062A mov edi,dword ptr ds:[0FE43ACh] 0139062A mov edi,dword ptr ds:[12243ACh]
01070630 test edi,edi 01390630 test edi,edi
01070632 jle 0107065A 01390632 jle 0139065A
01070634 xor edx,edx 01390634 xor edx,edx
01070636 mov ecx,dword ptr ds:[0FE43B0h] 01390636 mov ecx,dword ptr ds:[12243B0h]
0107063C test ecx,ecx 0139063C test ecx,ecx
0107063E jle 01070655 0139063E jle 01390655
01070640 mov eax,edx 01390640 mov eax,edx
01070642 add eax,eax 01390642 add eax,eax
01070644 mov dword ptr [ebp-0Ch],eax 01390644 mov dword ptr [ebp-0Ch],eax
01070647 fild dword ptr [ebp-0Ch] 01390647 fild dword ptr [ebp-0Ch]
0107064A faddp st(1),st 0139064A fld1
0107064C fld1 0139064C faddp st(1),st
0107064E faddp st(1),st 0139064E faddp st(1),st
01070650 inc edx 01390650 inc edx
01070651 cmp edx,ecx 01390651 cmp edx,ecx
01070653 jl 01070640 01390653 jl 01390640
01070655 inc esi 01390655 inc esi
01070656 cmp esi,edi 01390656 cmp esi,edi
01070658 jl 01070634 01390658 jl 01390634
0107065A pop ecx 0139065A pop ecx
0107065B pop esi 0139065B pop esi
0107065C pop edi 0139065C pop edi
0107065D pop ebp 0139065D pop ebp
0107065E ret 0139065E ret
Which is opcode-for-opcode identical except for the order of the floating-point operations. That makes a huge performance difference, but I don't know enough about x86 floating-point operations to know why exactly. (A plausible factor: with a = a + b + 1 both faddp instructions depend on the previous iteration's value of a, while with a = b + 1 + a the b + 1 part is independent of a, so only one addition sits on the loop-carried dependency chain.)
Update:
With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization because it turns this:
int b = 2 * i;
a = a + b + 1;
Into something like:
mov esi, eax ; b = i
add esi, esi ; b += b
lea ecx, [ecx + esi + 1] ; a = a + b + 1
Where a is stored in the ecx register, i in eax, and b in esi.
Whereas the TinyFunctions version gets turned into something like:
mov eax, edx
add eax, eax
inc eax
add ecx, eax
Where i is in edx, b is in eax, and a is in ecx this time around.
I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper. It is still possible to change the code to get the performance between the two to line up:
int b = 2 * i + 1;
a += b;
This ends up forcing the NormalFunction approach to get turned into mov, add, inc, add, just as it appears in the TinyFunctions approach.
Sometimes I need a hardcoded lookup table for a single method.
I can create such an array either
locally in the method itself
static inside the class
Example for the first case:
public int Convert(int i)
{
    int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
    return lookup[i];
}
As far as I understand it, a new lookup array will be created by the .NET runtime each time this method is executed. Is this correct, or is the JITter smart enough to cache and reuse the array between calls?
I presume that the answer is no, so if I want to make sure that the array is cached between calls, one way would be to make it static:
Example for the second case:
private static readonly int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };

public int Convert(int i)
{
    return lookup[i];
}
Is there a way to do this without polluting the namespace of my class? Can I somehow declare a static array that is only visible inside the current scope?
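One hedged option for the scoping concern (my sketch, assuming "polluting" means the enclosing class's member list): nest the field in a private static class. It is still class-level state, but it no longer appears among the class's own members:

private static class ConvertLookup
{
    internal static readonly int[] Values = { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
}

public int Convert(int i)
{
    return ConvertLookup.Values[i];
}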
Local array
The Roslyn compiler puts local arrays in the metadata. Let's take the first version of your Convert method:
public int Convert(int i)
{
    int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
    return lookup[i];
}
Here is the corresponding IL code (Release build, Roslyn 1.3.1.60616):
// Token: 0x06000002 RID: 2 RVA: 0x0000206C File Offset: 0x0000026C
.method public hidebysig
instance int32 Convert (
int32 i
) cil managed noinlining
{
// Header Size: 1 byte
// Code Size: 20 (0x14) bytes
.maxstack 8
/* 0x0000026D 1D */ IL_0000: ldc.i4.7
/* 0x0000026E 8D13000001 */ IL_0001: newarr [mscorlib]System.Int32
/* 0x00000273 25 */ IL_0006: dup
/* 0x00000274 D001000004 */ IL_0007: ldtoken field valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=28' '<PrivateImplementationDetails>'::'502D7419C3650DEE94B5938147BC9B4724D37F99'
/* 0x00000279 281000000A */ IL_000C: call void [mscorlib]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [mscorlib]System.Array, valuetype [mscorlib]System.RuntimeFieldHandle)
/* 0x0000027E 03 */ IL_0011: ldarg.1
/* 0x0000027F 94 */ IL_0012: ldelem.i4
/* 0x00000280 2A */ IL_0013: ret
} // end of method Program::Convert
And here is the PrivateImplementationDetails:
// Token: 0x02000003 RID: 3
.class private auto ansi sealed '<PrivateImplementationDetails>'
extends [mscorlib]System.Object
{
.custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = (
01 00 00 00
)
// Nested Types
// Token: 0x02000004 RID: 4
.class nested private explicit ansi sealed '__StaticArrayInitTypeSize=28'
extends [mscorlib]System.ValueType
{
.pack 1
.size 28
} // end of class __StaticArrayInitTypeSize=28
// Fields
// Token: 0x04000001 RID: 1 RVA: 0x00002944 File Offset: 0x00000B44
.field assembly static initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=28' '502D7419C3650DEE94B5938147BC9B4724D37F99' at I_00002944 // 28 (0x001c) bytes
} // end of class <PrivateImplementationDetails>
As you can see, your lookup array is in the assembly metadata. When you start your application, the JIT only has to get the array content from the metadata. An asm example (Windows 10, .NET Framework 4.6.1 (4.0.30319.42000), RyuJIT: clrjit-v4.6.1080.0, Release build):
int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
00007FFEDF0A44E2 sub esp,20h
00007FFEDF0A44E5 mov esi,edx
00007FFEDF0A44E7 mov rcx,7FFF3D1C4C62h
00007FFEDF0A44F1 mov edx,7
00007FFEDF0A44F6 call 00007FFF3E6B2600
00007FFEDF0A44FB mov rdx,134CF7F2944h
00007FFEDF0A4505 mov ecx,dword ptr [rax+8]
00007FFEDF0A4508 lea r8,[rax+10h]
00007FFEDF0A450C vmovdqu xmm0,xmmword ptr [rdx]
00007FFEDF0A4511 vmovdqu xmmword ptr [r8],xmm0
00007FFEDF0A4516 mov r9,qword ptr [rdx+10h]
00007FFEDF0A451A mov qword ptr [r8+10h],r9
00007FFEDF0A451E mov r9d,dword ptr [rdx+18h]
00007FFEDF0A4522 mov dword ptr [r8+18h],r9d
return lookup[i];
00007FFEDF0A4526 cmp esi,ecx
return lookup[i];
00007FFEDF0A4528 jae 00007FFEDF0A4537
00007FFEDF0A452A movsxd rdx,esi
00007FFEDF0A452D mov eax,dword ptr [rax+rdx*4+10h]
00007FFEDF0A4531 add rsp,20h
00007FFEDF0A4535 pop rsi
00007FFEDF0A4536 ret
00007FFEDF0A4537 call 00007FFF3EB57BE0
00007FFEDF0A453C int 3
A LegacyJIT-x64 version:
int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
00007FFEDF0E41E0 push rbx
00007FFEDF0E41E1 push rdi
00007FFEDF0E41E2 sub rsp,28h
00007FFEDF0E41E6 mov ebx,edx
00007FFEDF0E41E8 mov edx,7
00007FFEDF0E41ED lea rcx,[7FFF3D1C4C62h]
00007FFEDF0E41F4 call 00007FFF3E6B2600
00007FFEDF0E41F9 mov rdi,rax
00007FFEDF0E41FC lea rcx,[7FFEDF124760h]
00007FFEDF0E4203 call 00007FFF3E73CA90
00007FFEDF0E4208 mov rdx,rax
00007FFEDF0E420B mov rcx,rdi
00007FFEDF0E420E call 00007FFF3E73C8B0
return lookup[i];
00007FFEDF0E4213 movsxd r11,ebx
00007FFEDF0E4216 mov rax,qword ptr [rdi+8]
00007FFEDF0E421A cmp r11,7
00007FFEDF0E421E jae 00007FFEDF0E4230
00007FFEDF0E4220 mov eax,dword ptr [rdi+r11*4+10h]
00007FFEDF0E4225 add rsp,28h
00007FFEDF0E4229 pop rdi
00007FFEDF0E422A pop rbx
00007FFEDF0E422B ret
00007FFEDF0E422C nop dword ptr [rax]
00007FFEDF0E4230 call 00007FFF3EB57BE0
00007FFEDF0E4235 nop
A LegacyJIT-x86 version:
int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };
009A2DC4 push esi
009A2DC5 push ebx
009A2DC6 mov ebx,edx
009A2DC8 mov ecx,6A2C402Eh
009A2DCD mov edx,7
009A2DD2 call 0094322C
009A2DD7 lea edi,[eax+8]
009A2DDA mov esi,5082944h
009A2DDF mov ecx,7
009A2DE4 rep movs dword ptr es:[edi],dword ptr [esi]
return lookup[i];
009A2DE6 cmp ebx,dword ptr [eax+4]
009A2DE9 jae 009A2DF4
009A2DEB mov eax,dword ptr [eax+ebx*4+8]
009A2DEF pop ebx
009A2DF0 pop esi
009A2DF1 pop edi
009A2DF2 pop ebp
009A2DF3 ret
009A2DF4 call 6B9D52F0
009A2DF9 int 3
Static array
Now, let's compare it with the second version:
private static readonly int[] lookup = new[] { 1, 2, 4, 8, 16, 32, 666, /*...*/ };

public int Convert(int i)
{
    return lookup[i];
}
IL:
// Token: 0x04000001 RID: 1
.field private static initonly int32[] lookup
// Token: 0x06000002 RID: 2 RVA: 0x00002056 File Offset: 0x00000256
.method public hidebysig
instance int32 Convert (
int32 i
) cil managed noinlining
{
// Header Size: 1 byte
// Code Size: 8 (0x8) bytes
.maxstack 8
/* 0x00000257 7E01000004 */ IL_0000: ldsfld int32[] ConsoleApplication5.Program::lookup
/* 0x0000025C 03 */ IL_0005: ldarg.1
/* 0x0000025D 94 */ IL_0006: ldelem.i4
/* 0x0000025E 2A */ IL_0007: ret
} // end of method Program::Convert
// Token: 0x02000003 RID: 3
.class private auto ansi sealed '<PrivateImplementationDetails>'
extends [mscorlib]System.Object
{
.custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = (
01 00 00 00
)
// Nested Types
// Token: 0x02000004 RID: 4
.class nested private explicit ansi sealed '__StaticArrayInitTypeSize=28'
extends [mscorlib]System.ValueType
{
.pack 1
.size 28
} // end of class __StaticArrayInitTypeSize=28
// Fields
// Token: 0x04000002 RID: 2 RVA: 0x000028FC File Offset: 0x00000AFC
.field assembly static initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=28' '502D7419C3650DEE94B5938147BC9B4724D37F99' at I_000028fc // 28 (0x001c) bytes
} // end of class <PrivateImplementationDetails>
ASM (RyuJIT-x64):
return lookup[i];
00007FFEDF0B4490 sub rsp,28h
00007FFEDF0B4494 mov rax,212E52E0080h
00007FFEDF0B449E mov rax,qword ptr [rax]
00007FFEDF0B44A1 mov ecx,dword ptr [rax+8]
00007FFEDF0B44A4 cmp edx,ecx
00007FFEDF0B44A6 jae 00007FFEDF0B44B4
00007FFEDF0B44A8 movsxd rdx,edx
00007FFEDF0B44AB mov eax,dword ptr [rax+rdx*4+10h]
00007FFEDF0B44AF add rsp,28h
00007FFEDF0B44B3 ret
00007FFEDF0B44B4 call 00007FFF3EB57BE0
00007FFEDF0B44B9 int 3
ASM (LegacyJIT-x64):
return lookup[i];
00007FFEDF0A4611 sub esp,28h
00007FFEDF0A4614 mov rcx,226CC5203F0h
00007FFEDF0A461E mov rcx,qword ptr [rcx]
00007FFEDF0A4621 movsxd r8,edx
00007FFEDF0A4624 mov rax,qword ptr [rcx+8]
00007FFEDF0A4628 cmp r8,rax
00007FFEDF0A462B jae 00007FFEDF0A4637
00007FFEDF0A462D mov eax,dword ptr [rcx+r8*4+10h]
00007FFEDF0A4632 add rsp,28h
00007FFEDF0A4636 ret
00007FFEDF0A4637 call 00007FFF3EB57BE0
00007FFEDF0A463C nop
ASM (LegacyJIT-x86):
return lookup[i];
00AA2E18 push ebp
00AA2E19 mov ebp,esp
00AA2E1B mov eax,dword ptr ds:[03628854h]
00AA2E20 cmp edx,dword ptr [eax+4]
00AA2E23 jae 00AA2E2B
00AA2E25 mov eax,dword ptr [eax+edx*4+8]
00AA2E29 pop ebp
00AA2E2A ret
00AA2E2B call 6B9D52F0
00AA2E30 int 3
Benchmarks
Let's write a benchmark with the help of BenchmarkDotNet:
[Config(typeof(Config)), LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, RPlotExporter]
public class ArrayBenchmarks
{
    private static readonly int[] lookup = new[] {1, 2, 4, 8, 16, 32, 666, /*...*/};

    [MethodImpl(MethodImplOptions.NoInlining)]
    public int ConvertStatic(int i)
    {
        return lookup[i];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public int ConvertLocal(int i)
    {
        int[] localLookup = new[] {1, 2, 4, 8, 16, 32, 666, /*...*/};
        return localLookup[i];
    }

    [Benchmark]
    public int Static()
    {
        int sum = 0;
        for (int i = 0; i < 10001; i++)
            sum += ConvertStatic(0);
        return sum;
    }

    [Benchmark]
    public int Local()
    {
        int sum = 0;
        for (int i = 0; i < 10001; i++)
            sum += ConvertLocal(0);
        return sum;
    }

    private class Config : ManualConfig
    {
        public Config()
        {
            Add(new MemoryDiagnoser());
            Add(MarkdownExporter.StackOverflow);
        }
    }
}
Note that it's a synthetic toy benchmark which uses NoInlining for the Convert methods; we use it to show the difference between the two approaches. The real performance will depend on how you use the Convert method in your code. My results:
Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
Frequency=2143474 ticks, Resolution=466.5324 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1586.0
Type=ArrayBenchmarks Mode=Throughput
Method | Platform | Jit | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
------- |--------- |---------- |-------------- |----------- |--------- |------ |------ |------------------- |
Static | X64 | LegacyJit | 24.0243 us | 0.1590 us | - | - | - | 1.07 |
Local | X64 | LegacyJit | 2,068.1034 us | 33.7142 us | 1,089.00 | - | - | 436,603.02 |
Static | X64 | RyuJit | 20.7906 us | 0.2018 us | - | - | - | 1.06 |
Local | X64 | RyuJit | 83.4041 us | 0.9993 us | 613.55 | - | - | 244,936.53 |
Static | X86 | LegacyJit | 20.9957 us | 0.2267 us | - | - | - | 1.01 |
Local | X86 | LegacyJit | 167.6257 us | 1.3543 us | 431.43 | - | - | 172,121.77 |
Conclusion
Does .NET cache hardcoded local arrays? Kind of: the Roslyn compiler puts them in the metadata.
Do we have any overhead in this case? Unfortunately, yes: the JIT will copy the array content from the metadata on each invocation, which takes longer than the static-array case. The runtime also allocates objects and produces memory traffic.
Should we care about it? It depends. If it's a hot method and you want to achieve a good level of performance, you should use a static array. If it's a cold method which doesn't affect the application performance, you probably should write "good" source code and put the array in the method scope.
I am doing a conversion bytes[4] -> float number -> bytes[4] without any arithmetic.
In the bytes I have a single-precision number in IEEE 754 format (4 bytes per number, little-endian order, as on the machine).
I ran into an issue: when the bytes represent a NaN value, they are not converted verbatim.
For example:
{ 0x1B, 0xC4, 0xAB, 0x7F } -> NaN -> { 0x1B, 0xC4, 0xEB, 0x7F }
Code for reproduction:
using System;
using System.Linq;

namespace StrangeFloat
{
    class Program
    {
        private static void PrintBytes(byte[] array)
        {
            foreach (byte b in array)
            {
                Console.Write("{0:X2}", b);
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            byte[] strangeFloat = { 0x1B, 0xC4, 0xAB, 0x7F };
            float[] array = new float[1];
            Buffer.BlockCopy(strangeFloat, 0, array, 0, 4);
            byte[] bitConverterResult = BitConverter.GetBytes(array[0]);

            PrintBytes(strangeFloat);
            PrintBytes(bitConverterResult);

            bool isEqual = strangeFloat.SequenceEqual(bitConverterResult);
            Console.WriteLine("IsEqual: {0}", isEqual);
        }
    }
}
Result ( https://ideone.com/p5fsrE ):
1BC4AB7F
1BC4EB7F
IsEqual: False
This behaviour depends on platform and configuration: the code converts the number without errors on x64 in all configurations, and on x86/Debug. On x86/Release the error occurs.
Also, if I change
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
to
float f = array[0];
byte[] bitConverterResult = BitConverter.GetBytes(f);
then it is erroneous on x86/Debug too.
I researched the problem and found that the compiler generates x86 code that uses the FPU registers (!) to hold the float value (FLD/FST instructions). But the FPU sets the high bit of the mantissa to 1 instead of 0 (quieting what was a signaling NaN), so it modifies the value even though the logic just passes the value through unchanged.
On x64 an xmm0 register (SSE) is used instead, and it works fine.
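A quick check (my addition) that the observed byte change matches setting that mantissa bit:

uint snan = 0x7FABC41B;                // bytes 1B C4 AB 7F: a signaling NaN
uint quieted = snan | (1u << 22);      // the FPU sets the top mantissa bit on load
Console.WriteLine("{0:X8}", quieted);  // 7FEBC41B -> bytes 1B C4 EB 7F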
[Question]
What is this: undefined behavior for NaN values that is documented somewhere, or a JIT/optimization bug?
Why does the compiler use the FPU and SSE when no arithmetic operations are performed?
Update 1
Debug configuration: the value is passed via the stack without side effects, giving the correct result:
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
02232E45 mov eax,dword ptr [ebp-44h]
02232E48 cmp dword ptr [eax+4],0
02232E4C ja 02232E53
02232E4E call 71EAC65A
02232E53 push dword ptr [eax+8] // eax+8 points to "1b c4 ab 7f" CORRECT!
02232E56 call 7136D8E4
02232E5B mov dword ptr [ebp-5Ch],eax // eax points to managed
// array data "fc 35 d7 70 04 00 00 00 __1b c4 ab 7f__" and this is correct
02232E5E mov eax,dword ptr [ebp-5Ch]
02232E61 mov dword ptr [ebp-48h],eax
Release configuration: the optimizer or the JIT does a strange pass through the FPU registers and breaks the data, giving an incorrect result:
byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
00B12DE8 cmp dword ptr [edi+4],0
00B12DEC jbe 00B12E3B
00B12DEE fld dword ptr [edi+8] // edi+8 points to "1b c4 ab 7f"
00B12DF1 fstp dword ptr [ebp-10h] // ebp-10h points to "1b c4 eb 7f" (FAIL)
00B12DF4 mov ecx,dword ptr [ebp-10h]
00B12DF7 call 70C75810
00B12DFC mov edi,eax
00B12DFE mov ecx,esi
00B12E00 call dword ptr ds:[4A70860h]
I am just posting @HansPassant's comment as an answer:
"The x86 jitter uses the FPU to handle floating point values. This is
not a bug. Your assumption that those byte values are a proper
argument to a method that takes a float argument is just wrong."
In other words, this is just a GIGO case (Garbage In, Garbage Out).
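If the goal is only to round-trip raw bytes, a safe variant (my sketch, not from the original answer) is to keep the bits in an integer type so they never touch the FPU:

byte[] strangeFloat = { 0x1B, 0xC4, 0xAB, 0x7F };
uint bits = BitConverter.ToUInt32(strangeFloat, 0);  // reinterpret; no FPU load
byte[] roundTripped = BitConverter.GetBytes(bits);   // 1B C4 AB 7F, verbatim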
Delphi:
procedure TForm1.Button1Click(Sender: TObject);
var
  I, Tick: Integer;
begin
  Tick := GetTickCount();
  for I := 0 to 1000000000 do
  begin
  end;
  Button1.Caption := IntToStr(GetTickCount() - Tick) + ' ms';
end;
C#:
private void button1_Click(object sender, EventArgs e)
{
    int tick = System.Environment.TickCount;
    for (int i = 0; i < 1000000000; ++i)
    {
    }
    tick = System.Environment.TickCount - tick;
    button1.Text = tick.ToString() + " ms";
}
Delphi gives around 515 ms
C# gives around 3775 ms
Delphi is compiled to native code, whereas C# is compiled to CLR IL which is then translated at runtime. That said, C# does use JIT compilation, so you might expect the timings to be more similar, but it is not a given.
It would be useful if you could describe the hardware you ran this on (CPU, clock rate).
I do not have access to Delphi to repeat your experiment, but using native C++ vs C# and the following code:
VC++ 2008
#include <iostream>
#include <windows.h>

int main(void)
{
    int tick = GetTickCount();
    for (int i = 0; i < 1000000000; ++i)
    {
    }
    tick = GetTickCount() - tick;
    std::cout << tick << " ms" << std::endl;
}
C#
using System;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            int tick = System.Environment.TickCount;
            for (int i = 0; i < 1000000000; ++i)
            {
            }
            tick = System.Environment.TickCount - tick;
            Console.Write(tick.ToString() + " ms");
        }
    }
}
I initially got:
C++ 2792ms
C# 2980ms
However, I then performed a Rebuild on the C# version and ran the executables in <project>\bin\release and <project>\bin\debug respectively, directly from the command line. This yielded:
C# (release): 720ms
C# (debug): 3105ms
So I reckon that is where the difference truly lies: you were running the debug version of the C# code from the IDE.
In case you are now thinking that C++ is particularly slow, I ran that as an optimised release build and got:
C++ (Optimised): 0ms
This is not surprising, because the loop is empty and the control variable is not used outside the loop, so the optimiser removes it altogether. To avoid that, I declared i as volatile, with the following result:
C++ (volatile i): 2932ms
My guess is that the C# implementation also removed the loop and that the 720ms is from something else; this may explain most of the difference between the timings in the first test.
What Delphi is doing I cannot tell, you might look at the generated assembly code to see.
All the above tests on AMD Athlon Dual Core 5000B 2.60GHz, on Windows 7 32bit.
If this is intended as a benchmark, it's an exceptionally bad one, as in both cases the loop can be optimized away, so you have to look at the generated machine code to see what's going on. If you use release mode for C#, the following code
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000000; ++i){ }
sw.Stop();
Console.WriteLine(sw.Elapsed);
is transformed by the JITter to this:
push ebp
mov ebp,esp
push edi
push esi
call 67CDBBB0
mov edi,eax
xor eax,eax ; i = 0
inc eax ; ++i
cmp eax,3B9ACA00h ; i == 1000000000?
jl 0000000E ; false: jmp
mov ecx,edi
cmp dword ptr [ecx],ecx
call 67CDBC10
mov ecx,66DDAEDCh
call FFE8FBE0
mov esi,eax
mov ecx,edi
call 67CD75A8
mov ecx,eax
lea eax,[esi+4]
mov dword ptr [eax],ecx
mov dword ptr [eax+4],edx
call 66A94C90
mov ecx,eax
mov edx,esi
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+3Ch]
call dword ptr [eax+14h]
pop esi
pop edi
pop ebp
ret
TickCount is not a reliable timer; you should use .NET's Stopwatch class. (I don't know what the Delphi equivalent is.)
Also, are you running a Release build?
Do you have a debugger attached?
The Delphi compiler counts the for loop downwards (if possible); the above code sample is compiled to:
Unit1.pas. 42: Tick := GetTickCount();
00489367 E8B802F8FF call GetTickCount
0048936C 8BF0 mov esi,eax
Unit1.pas.43: for I := 0 to 1000000000 do
0048936E B801CA9A3B mov eax,$3b9aca01
00489373 48 dec eax
00489374 75FD jnz $00489373
You are comparing native code against VM-JITted code, and that is not fair. Native code will ALWAYS be faster, since the JITter cannot optimize the code like a native compiler can.
That said, comparing Delphi against C# is not fair at all; a Delphi binary will always win (faster, smaller, without any kind of dependencies, etc.).
Btw, I'm sadly amazed how many posters here don't know these differences... or maybe you just hurt some .NET zealots that try to defend C# against anything that shows there are better options out there.
This is the C# disassembly:
DEBUG:
// int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
0000004e 33 D2 xor edx,edx
00000050 89 55 B8 mov dword ptr [ebp-48h],edx
00000053 90 nop
00000054 EB 00 jmp 00000056
00000056 FF 45 B8 inc dword ptr [ebp-48h]
00000059 81 7D B8 00 CA 9A 3B cmp dword ptr [ebp-48h],3B9ACA00h
00000060 0F 95 C0 setne al
00000063 0F B6 C0 movzx eax,al
00000066 89 45 B4 mov dword ptr [ebp-4Ch],eax
00000069 83 7D B4 00 cmp dword ptr [ebp-4Ch],0
0000006d 75 E7 jne 00000056
As you can see, it is a waste of CPU.
EDIT:
RELEASE:
//unchecked
//{
//int i = 0; while (++i != 1000000000) ;//==for(int i ...blah blah blah)
00000032 33 D2 xor edx,edx
00000034 89 55 F4 mov dword ptr [ebp-0Ch],edx
00000037 FF 45 F4 inc dword ptr [ebp-0Ch]
0000003a 81 7D F4 00 CA 9A 3B cmp dword ptr [ebp-0Ch],3B9ACA00h
00000041 75 F4 jne 00000037
//}
EDIT:
And this is the C++ version, running about 9x faster on my machine:
__asm
{
    PUSH ECX
    PUSH EBX
    XOR ECX, ECX
    MOV EBX, 1000000000
NEXT:
    INC ECX
    CMP ECX, EBX
    JS NEXT
    POP EBX
    POP ECX
}
You should attach a debugger and take a look at the machine code generated by each.
Delphi would almost definitely optimise that loop to execute in reverse order (i.e. DOWNTO zero rather than UP FROM zero); Delphi does this whenever it determines it is "safe" to do, presumably because either subtraction or checking against zero is faster than addition or checking against a non-zero number.
What happens if you try both cases specifying the loops to execute in reverse order?
In Delphi the break condition is calculated only once, before the loop begins, whereas in C# the break condition is recalculated on each pass.
That's why looping in Delphi is faster than in C#.
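To illustrate what that claim means in C# (my example, with an assumed list variable; in practice the JIT can sometimes hoist simple bounds itself): the condition expression is re-evaluated on every pass, so a non-trivial bound can be hoisted by hand:

// Re-evaluates list.Count on every iteration:
for (int i = 0; i < list.Count; i++) { /* ... */ }

// Evaluates the bound once, closer to how Delphi compiles its for loop:
for (int i = 0, n = list.Count; i < n; i++) { /* ... */ }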
"// int i = 0; while (++i != 1000000000) ;"
That's interesting.
while (++i != x) is not the same as for (; i != x; i++)
The difference is that the while loop doesn't execute the loop body for i = 0.
Try it out: run something like this:

int i;
for (i = 0; i < 5; i++)
    Console.WriteLine(i);

i = 0;
while (++i != 5)
    Console.WriteLine(i);