I tried to compare the performance with boxing and without.
Here is the code:
public struct p1
{
int x, y;
public p1(int i)
{
x = y = i;
}
}
public class MainClass
{
public static void Main()
{
var al = new List<object>();
var l = new List<p1>();
var sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 1000; i++)
{
al.Add(new p1(i));
}
p1 b;
for (int i = 0; i < 1000; i++)
{
b = (p1)al[i];
}
Console.WriteLine(sw.ElapsedTicks);
var time = sw.ElapsedTicks;
for (int i = 0; i < 1000; i++)
{
l.Add(new p1(i));
}
p1 v;
for (int i = 0; i < 1000; i++)
{
v = l[i];
}
var t = sw.ElapsedTicks - time;
Console.WriteLine(t);
Console.ReadKey();
}
}
But the List<object> version runs faster than the List<p1> version. Why?
1139
9256
1044
6909
I suspect this could be caused by quite a few things.
First, you should always put timings like this into a loop, and run them more than once in the same process. The JIT overhead will occur on the first run, and can dominate your timings. This will skew the results. (Normally, you'll want to completely ignore the first run...)
Also, make sure that you're running this in a release build, run outside of the Visual Studio test host. (Don't hit F5 - use Ctrl+F5 or run from outside VS.) Otherwise, the test host will disable most optimizations and dramatically slow down your results.
For example, try the following:
public static void Main()
{
for (int run = 0; run < 4; ++run)
{
if (run != 0)
{
// Ignore first run
Console.WriteLine("Run {0}", run);
}
var al = new List<object>();
var l = new List<p1>();
var sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 1000; i++)
{
al.Add(new p1(i));
}
p1 b;
for (int i = 0; i < 1000; i++)
{
b = (p1)al[i];
}
sw.Stop();
if (run != 0)
{
// Ignore first run
Console.WriteLine("With boxing: {0}", sw.ElapsedTicks);
}
sw.Reset();
sw.Start();
for (int i = 0; i < 1000; i++)
{
l.Add(new p1(i));
}
p1 v;
for (int i = 0; i < 1000; i++)
{
v = l[i];
}
sw.Stop();
if (run != 0)
{
// Ignore first run
Console.WriteLine("Without boxing: {0}", sw.ElapsedTicks);
}
}
Console.ReadKey();
}
On my system, by ignoring the first run (JIT issues), and running outside the VS test host in release, I get:
Run 1
With boxing: 99
Without boxing: 61
Run 2
With boxing: 92
Without boxing: 56
Run 3
With boxing: 97
Without boxing: 54
This is obviously dramatically better with the generic, non-boxed version.
Given the very large numbers in your results - I suspect this test was run in VS in Debug mode...
Structs are passed by value, not by reference. When you box one, I believe the boxed copy is then passed around by reference, so the value is copied multiple times in the second loop.
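To make the copying visible, here is a minimal sketch (not part of the original benchmark): both boxing and unboxing copy the value, so the unboxed struct is an independent copy of whatever is in the box.
p1 original = new p1(1);
object boxed = original;   // boxing: the struct is copied into a new object on the heap
p1 unboxed = (p1)boxed;    // unboxing: the struct is copied out of the box again
// 'original', the value inside 'boxed', and 'unboxed' are three separate copies.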
I'm writing a WASM app that needs some fast compute power (signal processing). I'm trying to decide whether I should use C, JS, or C# (I already have a C# library that I wrote). So I decided to run a CPU-intensive workload in each and compare. I'll use the Sieve of Eratosthenes.
First, to get a baseline, I decided to just run it on the desktop, not WASM.
So here is the C# code I used.
public static class App {
static char[] prime = new char[100000];
public static int SieveOfEratosthenes(int n) {
Array.Fill<char>(prime, (char)1);
for (int p = 2; p * p <= n; p++) {
// If prime[p] is not changed,
// then it is a prime
if (prime[p] == 1) {
// Update all multiples of p
for (int i = p * p; i <= n; i += p)
prime[i] = (char)0;
}
}
int count = 0;
// Print all prime numbers
for (int i = 2; i <= n; i++) {
if (prime[i] == 1)
count++;
}
return count;
}
public static void Run() {
for (int i = 2; i < 99999; i++) {
SieveOfEratosthenes(i);
}
// Driver Code
}
static void Main() {
var start = DateTime.Now;
Run();
var end = DateTime.Now;
var time = (end - start).TotalMilliseconds;
Console.WriteLine(time);
}
}
Run release build using regular compiled output = ~14 seconds
Run crossgen2 x64 release output = ~22 seconds (!)
Run crossgen2 x86 release output = ~12 seconds
Crossgen2 was done by doing 'publish' in VS2022 and choosing a specific platform.
I am surprised to see that the x64 is so slow. Any thoughts?
Another baseline I did was the same calculation in C and JS (still on desktop):
Run the equivalent C code = ~4 seconds
JS code in node = ~26 seconds
JS code
prime = new Uint8Array(100000);
function sieveOfEratosthenes(n)
{
prime.fill(1);
for (p = 2; p * p <= n; p++)
{
// If prime[p] is not changed, then it is a
// prime
if (prime[p] == 1)
{
// Update all multiples of p
for (i = p * p; i <= n; i += p)
prime[i] = 0;
}
}
var count = 0;
// Print all prime numbers
for (i = 2; i <= n; i++)
{
if (prime[i] == 1)
count++;
}
return count;
}
globalThis.sieve = () => {
// Driver Code
var n = 30;
console.time("sieve");
for (var j = 3; j < 99999; j++) {
var count = sieveOfEratosthenes(j);
//console.log(count);
}
console.timeEnd("sieve");
}
globalThis.sieve();
C code
#include <memory.h>
#include <stdio.h>
#include <time.h>
int SieveOfEratosthenes(int n)
{
static char prime[100000];
memset(prime, 1, sizeof(prime));
for (int p = 2; p * p <= n; p++)
{
// If prime[p] is not changed,
// then it is a prime
if (prime[p] == 1)
{
for (int i = p * p; i <= n; i += p)
prime[i] = 0;
}
}
int count = 0;
// Print all prime numbers
for (int p = 2; p <= n; p++)
if (prime[p])
count++;
return count;
}
// Driver Code
int sieve()
{
int start = time(NULL);
for (int i = 3; i < 99999;i++)
{
int count = SieveOfEratosthenes(i);
// printf("%d ", i/count);
}
int end = time(NULL);
// printf("time=%d", end - start);
return 0;
}
int main(int argc, char * argv[])
{
sieve();
}
The second surprise is how much slower C# is than C. I didn't expect that much of a hit. Nor did I expect the JS version to give C# such a close race.
FYI, I compiled both the C code and the C# code to WASM and ran them in a Blazor app:
C code ~4 seconds
JS code ~27 seconds (as expected)
C# code 26 minutes (!!)
C# code AOT ~26 seconds
Lessons:
C# Blazor without AOT is horrifically slow.
AOT C# ~= JS.
C is best (a surprise that it's so much faster).
Does anybody think I messed anything up getting these numbers? As I said, there are a few surprises in there.
I have a Dictionary<int, int> populated with ~5 million records.
While the performance is reasonably good considering the volume of data, I'm looking to improve it. I don't care about data population; my main concern is data retrieval.
The first thing I did was change the value type from decimal to int, which doubled the performance.
Then I tried trading 'genericness' for speed by passing a custom IntegerComparer into the Dictionary's constructor, as follows:
public class IntegerComparer : IEqualityComparer<int>
{
public bool Equals(int x, int y)
{
return x == y;
}
public int GetHashCode(int obj)
{
return obj;
}
}
but to no avail; performance degraded by 20%. SortedDictionary slowed things down by 10 times (I didn't have much hope for it, though). I wonder what, if anything, can be done to improve the performance.
Here's a synthetic test just for measuring performance:
var d = new Dictionary<int, int>();
for (var i = 0; i < 5000000; i++)
{
d.Add(i, i + 5);
}
var r = new Random();
var s = new Stopwatch();
s.Start();
for (var i = 0; i < 100000; i++)
{
var r0 = Enumerable.Range(1, 255).Select(t => r.Next(5000000));
var values = r0.Select(t => d[t]).ToList();
}
s.Stop();
MessageBox.Show(s.ElapsedMilliseconds.ToString());
As the comments point out, your test is seriously flawed...
If the highest index you will see is 5,000,000, then an array will be the most performant option. I've tried to quickly rewrite your test to try and eliminate some of the error. There will probably be mistakes; writing accurate benchmarks is hard.
static void Main(string[] args)
{
var loopLength = 100000000;
var d = new Dictionary<int, int>();
for (var i = 0; i < 5000000; i++)
{
d.Add(i, i + 5);
}
var ignore = d[7];
var a = new int[5000000];
for (var i = 0; i < 5000000; i++)
{
a[i] = i + 5;
}
ignore = a[7];
var s = new Stopwatch();
var x = 1;
s.Start();
for (var i = 0; i < loopLength; i++)
{
x = (x * 1664525 + 1013904223) & (4194303);
var y = d[x];
}
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
s.Reset();
x = 1;
s.Start();
for (var i = 0; i < loopLength; i++)
{
x = (x * 1664525 + 1013904223) & (4194303);
var y = a[x];
}
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
Console.ReadKey(true);
}
The x coefficients are borrowed from Wikipedia's linear congruential generator article.
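For context, that update is a linear congruential generator: a cheap, deterministic pseudo-random sequence, with the bitwise AND against 4194303 (2^22 - 1) keeping the index inside the bounds of both the dictionary and the array. The same idea factored into a helper looks like this (my naming, just a sketch):
// One LCG step; the constants are the ones cited from the Wikipedia article.
// Masking with 2^22 - 1 keeps the result in [0, 4194303].
static int NextIndex(ref int state)
{
    state = (state * 1664525 + 1013904223) & 4194303;
    return state;
}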
My results:
24390
2076
That makes the array almost 12x faster.
I used to think that if a method is inlined, it is theoretically identical to merging the method into the calling method, but a benchmark showed there is a slight difference in performance.
For example, this takes 100 ms:
public long TestA()
{
long ret = 0;
for (int n = 0; n < testSize; n++)
{
for (int i = 0; i < a; i++)
for (int j = 0; j < b; j++)
{
ret += myArray[i][j];
}
}
return ret;
}
This takes 110 ms (if I force MethodImplOptions.NoInlining on GetIt, it takes 400 ms, so I assume it is auto-inlined):
public long TestB()
{
long ret = 0;
for (int n = 0; n < testSize; n++)
{
for (int i = 0; i < a; i++)
for (int j = 0; j < b; j++)
{
ret += GetIt(i, j);
}
}
return ret;
}
int GetIt(int x, int y)
{
return myArray[x][y];
}
OK, I've attached a snippet of the benchmark function I used:
public static void RunTests(Func<long> myTest)
{
const int numTrials = 100;
Stopwatch sw = new Stopwatch();
double[] sample = new double[numTrials];
Console.WriteLine("Checksum is {0:N0}", myTest());
sw.Start();
myTest();
sw.Stop();
Console.WriteLine("Estimated time per test is {0:N0} ticks\n", sw.ElapsedTicks);
if (sw.ElapsedTicks < 100)
{
Console.WriteLine("Ticks per test is less than 100 ticks. Suggest increase testSize.");
return;
}
if (sw.ElapsedTicks > 10000)
{
Console.WriteLine("Ticks per test is more than 10,000 ticks. Suggest decrease testSize.");
return;
}
for (int i = 0; i < numTrials / 3; i++)
{
myTest();
}
string testName = myTest.Method.Name;
Console.WriteLine("----> Starting benchmark {0}\n", myTest.Method.Name);
for (int i = 0; i < numTrials; i++)
{
GC.Collect();
GC.WaitForPendingFinalizers();
sw.Restart();
myTest();
sw.Stop();
sample[i] = sw.ElapsedTicks;
}
double testResult = DataSetAnalysis.Report(sample);
DataSetAnalysis.ConsistencyAnalyze(sample, 0.1);
Console.WriteLine();
for (int j = 0; j < numTrials; j = j + 5)
Console.WriteLine("{0,8:N0} {1,8:N0} {2,8:N0} {3,8:N0} {4,8:N0}", sample[j], sample[j + 1], sample[j + 2], sample[j + 3], sample[j + 4]);
Console.WriteLine("\n----> End of benchmark");
}
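For completeness, the harness is driven by passing each test as a Func<long> delegate, something like this (the containing class isn't shown in the post, so the name here is hypothetical):
var bench = new InliningBenchmark();   // hypothetical class that defines TestA, TestB and GetIt
RunTests(bench.TestA);                 // merged-loop version
RunTests(bench.TestB);                 // version that calls GetIt (candidate for auto-inlining)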
The resulting number of IL instructions differs slightly, and the maxstack differs significantly:
TestA:
// Code size 78 (0x4e)
.maxstack 2
TestB:
// Code size 88 (0x58)
.maxstack 4
GetIt:
// Code size 7 (0x7)
.maxstack 1
C# inlining happens in the JIT, so the IL doesn't change whether or not a method is inlined.
MethodImplOptions.NoInlining is not the same as the inline keyword in F#.
I was trying out a code sample from this book that should demonstrate that the post-decrement operator is not atomic. The code is as I entered it into LINQPad.
void Main() {
var count = 0;
do {
_x = 10000;
for (int i = 0; i < 100; i++) {
new Thread(Go).Start();
}
Thread.Sleep(1000);
Console.WriteLine("Try "+ count);
count++;
} while (_x == 0);
Console.WriteLine(_x);
}
int _x = 10000;
void Go() { for (int i = 0; i < 100; i++) _x--; }
The idea is that decrementing _x in parallel on multiple threads without locking may lead to a value of _x other than 0 when all the threads have finished.
My problem is that no matter how long I try, I always get 0 as a result.
I have run the code on two different computers (both Windows 7) and two different versions of .NET and both give me the same result.
What am I missing here?
I have added 100,000 iterations in Go as Lasse V. Karlsen suggested. The code now works as expected on the first try. I have also moved the thread creation out of the loop and reduced the thread count, as Henk Holterman suggested.
void Main()
{
var count = 0;
do {
_x = 1000000;
var threads = Enumerable.Range(0,10).Select (_ => new Thread(Go)).ToList();
foreach (var t in threads)
{
t.Start();
}
Thread.Sleep(1000);
Console.WriteLine("Try "+ count);
count++;
} while (_x == 0);
Console.WriteLine(_x);
}
int _x;
void Go() { for (int i = 0; i < 100000; i++) _x--; }
The code now works as expected.
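For contrast, here is a minimal sketch (not from the book sample) of the atomic version: replacing the bare decrement with Interlocked.Decrement removes the race, so _x reliably ends at 0 no matter how the threads interleave.
// requires: using System.Threading;
void GoAtomic()
{
    // Each decrement is a single indivisible read-modify-write, so no updates are lost.
    for (int i = 0; i < 100000; i++)
        Interlocked.Decrement(ref _x);
}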
If I have an array of structs, MyStruct[]:
struct MyStruct
{
float x;
float y;
}
It's slower than if I use a float[] where x is stored at index i and y at index i + 1 (so this array is twice as big as the struct array).
Time difference for 10,000 items compared against each other (two nested for loops): structs 500 ms, array of plain floats 78 ms.
I thought that a struct behaves like, e.g., a float or int (on the heap).
Firstly, structs don't necessarily appear on the heap - they can, and often do, appear on the stack.
Regarding your performance measurements, I think you must have tested it incorrectly. Using this benchmarking code I get almost the same performance results for both types:
TwoFloats[] a = new TwoFloats[10000];
float[] b = new float[20000];
void test1()
{
int count = 0;
for (int i = 0; i < 10000; i += 1)
{
if (a[i].x < 10) count++;
}
}
void test2()
{
int count = 0;
for (int i = 0; i < 20000; i += 2)
{
if (b[i] < 10) count++;
}
}
Results:
Method   Iterations per second
test1    55,200,000
test2    54,800,000
You are doing something seriously wrong if you get times like that. Float comparisons are dramatically fast; I clock them at 2 nanoseconds including the loop overhead. Crafting a test like this is tricky because the JIT compiler will optimize stuff away if you don't use the result of the comparison.
The structure is slightly faster, 1.96 nanoseconds vs. 2.20 nanoseconds for the float[] on my laptop. That's the way it should be; accessing the Y member of the struct doesn't cost an extra array index.
Test code:
using System;
using System.Diagnostics;
class Program {
static void Main(string[] args) {
var test1 = new float[100000000]; // 100 million
for (int ix = 0; ix < test1.Length; ++ix) test1[ix] = ix;
var test2 = new Test[test1.Length / 2];
for (int ix = 0; ix < test2.Length; ++ix) test2[ix].x = test2[ix].y = ix;
for (int cnt = 0; cnt < 20; ++cnt) {
var sw1 = Stopwatch.StartNew();
bool dummy = false;
for (int ix = 0; ix < test1.Length; ix += 2) {
dummy ^= test1[ix] >= test1[ix + 1];
}
sw1.Stop();
var sw2 = Stopwatch.StartNew();
for (int ix = 0; ix < test2.Length; ++ix) {
dummy ^= test2[ix].x >= test2[ix].y;
}
sw2.Stop();
Console.Write("", dummy);   // consume 'dummy' so the JIT can't optimize the comparison loops away
Console.WriteLine("{0} {1}", sw1.ElapsedMilliseconds, sw2.ElapsedMilliseconds);
}
Console.ReadLine();
}
struct Test {
public float x;
public float y;
}
}
I get results that seem to agree with you (and disagree with Mark). I'm curious if I've made a mistake constructing this (albeit crude) benchmark or if there is another factor at play.
Compiled on MS C# targeting .NET 3.5 framework with VS2008. Release mode, no debugger attached.
Here's my test code:
class Program {
static void Main(string[] args) {
for (int i = 0; i < 10; i++) {
RunBench();
}
Console.ReadKey();
}
static void RunBench() {
Stopwatch sw = new Stopwatch();
const int numPoints = 10000;
const int numFloats = numPoints * 2;
int numEqs = 0;
float[] rawFloats = new float[numFloats];
Vec2[] vecs = new Vec2[numPoints];
Random rnd = new Random();
for (int i = 0; i < numPoints; i++) {
rawFloats[i * 2] = (float) rnd.NextDouble();
rawFloats[i * 2 + 1] = (float)rnd.NextDouble();
vecs[i] = new Vec2() { X = rawFloats[i * 2], Y = rawFloats[i * 2 + 1] };
}
sw.Start();
for (int i = 0; i < numFloats; i += 2) {
for (int j = 0; j < numFloats; j += 2) {
if (i != j &&
Math.Abs(rawFloats[i] - rawFloats[j]) < 0.0001 &&
Math.Abs(rawFloats[i + 1] - rawFloats[j + 1]) < 0.0001) {
numEqs++;
}
}
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds.ToString() + " : numEqs = " + numEqs);
numEqs = 0;
sw.Reset();
sw.Start();
for (int i = 0; i < numPoints; i++) {
for (int j = 0; j < numPoints; j++) {
if (i != j &&
Math.Abs(vecs[i].X - vecs[j].X) < 0.0001 &&
Math.Abs(vecs[i].Y - vecs[j].Y) < 0.0001) {
numEqs++;
}
}
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds.ToString() + " : numEqs = " + numEqs);
}
}
struct Vec2 {
public float X;
public float Y;
}
Edit: Ah! I wasn't iterating the proper number of times. With the updated code my timings look like I expected:
269 : numEqs = 8
269 : numEqs = 8
270 : numEqs = 2
269 : numEqs = 2
268 : numEqs = 4
270 : numEqs = 4
269 : numEqs = 2
268 : numEqs = 2
270 : numEqs = 6
270 : numEqs = 6
269 : numEqs = 8
268 : numEqs = 8
268 : numEqs = 4
270 : numEqs = 4
269 : numEqs = 6
269 : numEqs = 6
268 : numEqs = 2
270 : numEqs = 2
268 : numEqs = 4
270 : numEqs = 4
The most likely reason is that the C# runtime optimizer does a better job when you work with floats than with full structs, probably because the optimizer maps x and y to registers, or makes similar changes that it doesn't make for the full struct.
In your particular example there seems to be no fundamental reason why it couldn't do as good a job when you use structs (it's hard to be sure without seeing your actual benchmarking code), but it just doesn't. However, it would be interesting to compare the performance of the resulting code when compiled with other C# implementations (I'm thinking of Mono on Linux).
I tested Ron Warholic's benchmark with Mono, and the results are consistent with Mark's: the difference between the two types of access seems to be minimal (the version with floats is 1% faster). However, I should still do more testing, as it wouldn't be surprising for library calls like Math.Abs to take a large amount of time and hide a real difference.
After removing the calls to Math.Abs and just doing tests like rawFloats[i] < rawFloats[j], the struct version becomes marginally faster (about 5%) than the two arrays of floats.
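A sketch of that simplified comparison (my reconstruction of the change described above, not the exact code that was benchmarked):
// Raw-float version with the Math.Abs calls removed; the struct version is
// changed the same way (vecs[i].X < vecs[j].X && vecs[i].Y < vecs[j].Y).
for (int i = 0; i < numFloats; i += 2)
    for (int j = 0; j < numFloats; j += 2)
        if (i != j && rawFloats[i] < rawFloats[j] && rawFloats[i + 1] < rawFloats[j + 1])
            numEqs++;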
The code below compares different ways of iterating. On my machine, Test1b takes almost twice as long as Test1a. I wonder if this relates to your issue.
class Program
{
struct TwoFloats
{
public float x;
public float y;
}
static TwoFloats[] a = new TwoFloats[10000];
static int Test1a()
{
int count = 0;
for (int i = 0; i < 10000; i += 1)
{
if (a[i].x < a[i].y) count++;
}
return count;
}
static int Test1b()
{
int count = 0;
foreach (TwoFloats t in a)
{
if (t.x < t.y) count++;
}
return count;
}
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
sw.Start();
for (int j = 0; j < 5000; ++j)
{
Test1a();
}
sw.Stop();
Trace.WriteLine(sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
for (int j = 0; j < 5000; ++j)
{
Test1b();
}
sw.Stop();
Trace.WriteLine(sw.ElapsedMilliseconds);
}
}