Is an inlined method still slower than manually refactoring (merging) it into the calling method? - c#

I used to think that if a method is inlined, it is theoretically identical to merging the method body into the caller, but my benchmark shows a slight difference in performance.
For example, this takes 100 ms:
public long TestA()
{
long ret = 0;
for (int n = 0; n < testSize; n++)
{
for (int i = 0; i < a; i++)
for (int j = 0; j < b; j++)
{
ret += myArray[i][j];
}
}
return ret;
}
This takes 110 ms (if I force MethodImplOptions.NoInlining on GetIt, it takes 400 ms, so I assume it is being auto-inlined):
public long TestB()
{
long ret = 0;
for (int n = 0; n < testSize; n++)
{
for (int i = 0; i < a; i++)
for (int j = 0; j < b; j++)
{
ret += GetIt(i, j);
}
}
return ret;
}
int GetIt(int x, int y)
{
return myArray[x][y];
}
OK, here is a snippet of the benchmark function I used:
public static void RunTests(Func<long> myTest)
{
const int numTrials = 100;
Stopwatch sw = new Stopwatch();
double[] sample = new double[numTrials];
Console.WriteLine("Checksum is {0:N0}", myTest());
sw.Start();
myTest();
sw.Stop();
Console.WriteLine("Estimated time per test is {0:N0} ticks\n", sw.ElapsedTicks);
if (sw.ElapsedTicks < 100)
{
Console.WriteLine("Ticks per test is less than 100 ticks. Suggest increase testSize.");
return;
}
if (sw.ElapsedTicks > 10000)
{
Console.WriteLine("Ticks per test is more than 10,000 ticks. Suggest decrease testSize.");
return;
}
for (int i = 0; i < numTrials / 3; i++)
{
myTest();
}
string testName = myTest.Method.Name;
Console.WriteLine("----> Starting benchmark {0}\n", myTest.Method.Name);
for (int i = 0; i < numTrials; i++)
{
GC.Collect();
GC.WaitForPendingFinalizers();
sw.Restart();
myTest();
sw.Stop();
sample[i] = sw.ElapsedTicks;
}
double testResult = DataSetAnalysis.Report(sample);
DataSetAnalysis.ConsistencyAnalyze(sample, 0.1);
Console.WriteLine();
for (int j = 0; j < numTrials; j = j + 5)
Console.WriteLine("{0,8:N0} {1,8:N0} {2,8:N0} {3,8:N0} {4,8:N0}", sample[j], sample[j + 1], sample[j + 2], sample[j + 3], sample[j + 4]);
Console.WriteLine("\n----> End of benchmark");
}

The resulting number of IL instructions differs slightly, and the maxstack differs significantly:
TestA:
// Code size 78 (0x4e)
.maxstack 2
TestB:
// Code size 88 (0x58)
.maxstack 4
GetIt:
// Code size 7 (0x7)
.maxstack 1

Inlining in C# happens in the JIT compiler, so the IL is the same whether or not a call ends up inlined.
MethodImplOptions.NoInlining is not the same as the inline keyword in F#.
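A hedged illustration (my addition, not from the question): you can ask the JIT to inline with MethodImplOptions.AggressiveInlining, but even then the machine code emitted for the call version may differ slightly from the hand-merged loop, e.g. in register allocation or bounds-check elimination, which can plausibly account for small gaps like 100 ms vs 110 ms. The array shape and values below are assumptions for illustration.

```csharp
using System;
using System.Runtime.CompilerServices;

// Minimal sketch, not the poster's benchmark.
class InlineSketch
{
    static readonly int[][] myArray = { new[] { 1, 2 }, new[] { 3, 4 } };

    // AggressiveInlining is only a hint; the JIT may still lay out the
    // inlined code differently from the manually merged version.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int GetIt(int x, int y) => myArray[x][y];

    public static long SumMerged()
    {
        long ret = 0;
        for (int i = 0; i < myArray.Length; i++)
            for (int j = 0; j < myArray[i].Length; j++)
                ret += myArray[i][j];
        return ret;
    }

    public static long SumViaCall()
    {
        long ret = 0;
        for (int i = 0; i < myArray.Length; i++)
            for (int j = 0; j < myArray[i].Length; j++)
                ret += GetIt(i, j);
        return ret;
    }
}
```

Both methods compute the same checksum; any remaining timing difference comes from the JIT's code generation, not from the IL.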

Related

Code seems to move on prior to Parallel.For loops finishing

I'm struggling with Parallel.For loops. I can tell they really speed up my long-running code, but I'm getting null-reference errors as though the code moved on before the parallel loops completed. Below is my code, with the Parallel.For statements I've tried commented out.
public bool Calculate()
{
// make list of all possible flops 22100
List<Flop> flops = new List<Flop>();
CardSet deck = new CardSet();
for (int i = 0; i < deck.Size() - 2; i++)
{
SbCard card1 = deck.GetCard(i);
for (int j = i + 1; j < deck.Size() - 1; j++)
{
SbCard card2 = deck.GetCard(j);
for (int k = j + 1; k < deck.Size(); k++)
{
SbCard card3 = deck.GetCard(k);
flops.Add(new Flop(card1, card2, card3));
}
}
}
int progress = 0;
var watch = System.Diagnostics.Stopwatch.StartNew();
// Loop over each flop
//Parallel.For(0, flops.Count, i =>
for (int i = 0; i < flops.Count; i++)
{
Dictionary<FlopEquityHoldemHandPair, FlopEquityHoldemHandPair> flopPairs =
new Dictionary<FlopEquityHoldemHandPair, FlopEquityHoldemHandPair>();
Flop flop = flops[i];
String filePath = Directory.GetCurrentDirectory() + "\\flops\\" +
flop.GetSorted() + ".txt";
if (!File.Exists(filePath))
{
// make list of all available starting hands
List<HoldemHand> hands = new List<HoldemHand>();
deck = new CardSet();
deck.RemoveAll(flop);
for (int j = 0; j < deck.Size() - 1; j++)
{
SbCard card1 = deck.GetCard(j);
for (int k = j + 1; k < deck.Size(); k++)
{
SbCard card2 = deck.GetCard(k);
hands.Add(new HoldemHand(card1, card2));
}
}
// loop over all hand vs hand combos
//Parallel.For(0, hands.Count - 1, j =>
for (int j = 0; j < hands.Count - 1; j++)
{
HoldemHand hand1 = hands[j];
//Parallel.For(j + 1, hands.Count, k =>
for (int k = j + 1; k < hands.Count; k++)
{
HoldemHand hand2 = hands[k];
if (!hand1.Contains(hand2))
{
FlopEquityHoldemHandPair holdemHandPair = new
FlopEquityHoldemHandPair(hand1, hand2);
if (!flopPairs.ContainsKey(holdemHandPair))
{
// next line triggers a loop of 1980 iterations
flopPairs.Add(holdemHandPair, new
FlopEquityHoldemHandPair(new
EquityHoldemHand(hand1), new
EquityHoldemHand(hand2), flop));
}
}
}//);
}//);
// WRITE FILE FOR CURRENT FLOP
StringBuilder sb = new StringBuilder();
foreach (FlopEquityHoldemHandPair pair in flopPairs.Values)
{
// Null value appears in flopPairs.Values and the list of values is around 200 short of the 600k values it should have
sb.AppendLine(pair.ToString());
}
File.WriteAllText(filePath, sb.ToString());
// reports calculation progress 1% at a time
int num = ((int)(i * 100 / 22100));
if (num > progress)
{
progress = num;
Console.WriteLine("Progress: " + progress + "%");
}
}
}//);
watch.Stop();
var elapsedMs = watch.ElapsedMilliseconds;
Console.WriteLine("Finished in " + elapsedMs / 60000 + "mins");
return true;
}
When I arrive at the foreach loop late in this code, some items in flopPairs.Values are null and the dictionary is not quite as big as it should be - as though some calculations did not finish before the code moved on. I apologize that this code is not runnable without more code, but there is a lot to provide. I can attempt to provide a minimal runnable example if the problem is not fairly obvious to somebody.
As John Wu stated in the comments, Dictionary is not thread-safe and was causing my problems. Using ConcurrentDictionary was the correct answer.
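A minimal sketch of that fix (types and names simplified from the question's code, mine not the poster's): swap the Dictionary for a ConcurrentDictionary and add entries with TryAdd, which is safe to call from concurrent Parallel.For iterations.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentFix
{
    // Counts distinct keys using many parallel writers; a plain Dictionary
    // would corrupt its internal state here, ConcurrentDictionary is safe.
    public static int CountDistinct(int[] keys)
    {
        var pairs = new ConcurrentDictionary<int, int>();
        Parallel.For(0, keys.Length, i =>
        {
            pairs.TryAdd(keys[i], keys[i]); // atomic insert-if-absent
        });
        return pairs.Count;
    }
}
```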

Stopwatch startup issues

I've got a C# program that tests a sort algorithm and its performance by using an instance of the Stopwatch class.
So far everything is working correctly and I am getting the expected tick results, except on the first run. Somehow the first measurement takes about 900 ticks longer than the rest.
Do I have to initialize the Stopwatch class differently, or is there any way to fix this?
static Stopwatch watch; // field shared with TestSort
static void Main() {
watch = new Stopwatch();
int amount = 10; // Amount of arrays to test
long[, ] stats= new long[3, amount]; // Array that stores ticks for every size (100,1000,10000) 'amount'-times
for (int size = 100, iteration = 0; size <= 10000; size *= 10, iteration++) {
for (int j = 0; j < amount; j++) {
stats[iteration, j] = TestSort(size); // Save ticks for random tested array in stats
}
}
PrintStats(stats);
}
public static long TestSort(int length) {
int[] testArray = GenerateRandomArray(length); // Generate a random array with size of length
watch.Reset();
watch.Start();
sort(testArray);
watch.Stop();
return watch.ElapsedTicks;
}
public static void PrintStats(long[, ] array) {
for (int i = 0; i < array.GetLength(0); i++) {
Console.Write("[");
for (int j = 0; j < array.GetLength(1); j++) {
Console.Write(array[i, j]);
if (j < array.GetLength(1) - 1) {
Console.Write(",");
}
}
Console.Write("]\n");
}
}
// Sample output
// Note that the first entry is about 900 ticks longer than the other ones with size 100
[1150,256,268,262,261,262,263,261,263,262]
[19689,20550,20979,22953,19913,20578,19693,19945,19811,19970]
[1880705,1850265,3006533,1869953,1900301,1846915,1840681,1801887,1931206,2206952]
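The ~900 extra ticks on the first run are most likely JIT compilation: the first call to the sort compiles it to native code, and that cost lands inside the first measurement. A common remedy, sketched here with a stand-in Sort method (an assumption, since the poster's sort isn't shown), is one untimed warm-up call before timing:

```csharp
using System;
using System.Diagnostics;

class WarmupSketch
{
    static void Sort(int[] a) => Array.Sort(a); // stand-in for the poster's sort

    public static long Measure(int length)
    {
        var rnd = new Random(1);
        int[] warmup = new int[length];
        for (int i = 0; i < length; i++) warmup[i] = rnd.Next();
        Sort(warmup); // untimed warm-up: pays the JIT cost outside the measurement

        int[] data = new int[length];
        for (int i = 0; i < length; i++) data[i] = rnd.Next();
        var watch = Stopwatch.StartNew();
        Sort(data);
        watch.Stop();
        return watch.ElapsedTicks;
    }
}
```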

How can I maximize the performance of an element-wise operation on a big array in C#

The operation multiplies the i-th element of an array (call it A) by the i-th element of a matrix of the same size (B), and stores the result back into the i-th element of A.
As an arithmetic formula:
A'[i] = A[i]*B[i] (0 < i < n(A))
What's the best way to optimize this operation in a multi-core environment?
Here's my current code:
var learningRate = 0.001f;
var m = 20000;
var n = 40000;
var W = new float[m * n];
var C = new float[m * n];
//my current code ...[1]
Parallel.ForEach(Enumerable.Range(0, m), i =>
{
for (int j = 0; j <= n - 1; j++)
{
W[i*n+j] *= C[i*n+j];
}
});
//This is somehow far slower than [1], but I don't know why ... [2]
Parallel.ForEach(Enumerable.Range(0, n*m), i =>
{
W[i] *= C[i];
});
//This is faster than [2], but not as fast as [1] ... [3]
for (int i = 0; i < m*n; i++)
{
W[i] *= C[i];
}
I tested the following method, but the performance didn't get better at all.
http://msdn.microsoft.com/en-us/library/dd560853.aspx
public static void Test1()
{
Random rnd = new Random(1);
var sw1 = new Stopwatch();
var sw2 = new Stopwatch();
sw1.Reset();
sw2.Reset();
int m = 10000;
int n = 20000;
int loops = 20;
var W = DummyDataUtils.CreateRandomMat1D(m, n);
var C = DummyDataUtils.CreateRandomMat1D(m, n);
for (int l = 0; l < loops; l++)
{
var v = DummyDataUtils.CreateRandomVector(n);
var b = DummyDataUtils.CreateRandomVector(m);
sw1.Start();
Parallel.ForEach(Enumerable.Range(0, m), i =>
{
for (int j = 0; j <= n - 1; j++)
{
W[i*n+j] *= C[i*n+j];
}
});
sw1.Stop();
sw2.Start();
// Partition the entire source array.
var rangePartitioner = Partitioner.Create(0, n*m);
// Loop over the partitions in parallel.
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
// Loop over each range element without a delegate invocation.
for (int i = range.Item1; i < range.Item2; i++)
{
W[i] *= C[i];
}
});
sw2.Stop();
Console.Write("o");
}
var t1 = (double)sw1.ElapsedMilliseconds / loops;
var t2 = (double)sw2.ElapsedMilliseconds / loops;
Console.WriteLine("t1: " + t1);
Console.WriteLine("t2: " + t2);
}
Result:
t1: 119
t2: 120.4
The problem is that while invoking a delegate is relatively fast, it adds up when you invoke it many times and the code inside the delegate is very simple.
What you could try instead is to use a Partitioner to specify the range you want to iterate, which allows you to iterate over many items for each delegate invocation (similar to what you're doing in [1]):
Parallel.ForEach(Partitioner.Create(0, n * m), partition =>
{
for (int i = partition.Item1; i < partition.Item2; i++)
{
W[i] *= C[i];
}
});
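Going further than the original answer (this part is my suggestion, not from the accepted answer): on runtimes with SIMD support, the inner loop can also be vectorized with System.Numerics.Vector&lt;float&gt;, which processes several floats per instruction:

```csharp
using System.Numerics;

class SimdSketch
{
    // Element-wise W[i] *= C[i], handled Vector<float>.Count lanes at a time.
    public static void MultiplyInPlace(float[] W, float[] C)
    {
        int width = Vector<float>.Count;
        int i = 0;
        for (; i <= W.Length - width; i += width)
        {
            var w = new Vector<float>(W, i);
            var c = new Vector<float>(C, i);
            (w * c).CopyTo(W, i);
        }
        for (; i < W.Length; i++) // scalar tail for leftover elements
            W[i] *= C[i];
    }
}
```

This could be combined with the Partitioner approach above by vectorizing within each partition's range.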

C# HeapSort, System.Timers to check algorithm time

I have to measure the HeapSort algorithm's time in C#. My problem is that I know I must use System.Timers, but I don't know how to measure the algorithm's time.
I have to check the algorithm's time for arrays containing 1,000, 10,000, 100,000 and 1,000,000 integers.
Help me, good people, please.
This is the code:
using System;
namespace Sort
{
class Program
{
public static void Adjust(int[] list, int i, int m)
{
int Temp = list[i];
int j = i * 2 + 1;
while (j <= m)
{
if (j < m)
if (list[j] < list[j + 1])
j = j + 1;
if (Temp < list[j])
{
list[i] = list[j];
i = j;
j = 2 * i + 1;
}
else
{
j = m + 1;
}
}
list[i] = Temp;
}
public static void HeapSort(int[] list)
{
int i;
//Building a heap
for (i = (list.Length - 1) / 2; i >= 0; i--)
Adjust(list, i, list.Length - 1);
for (i = list.Length - 1; i >= 1; i--)
{
int Temp = list[0];
list[0] = list[i];
list[i] = Temp;
Adjust(list, 0, i - 1);
}
}
static void Main(string[] args)
{
Console.Title = "HeapSort";
int i;
int[] a = { 12, 3, -12, 27, 34, 23, 1, 81, 45,
17, 9, 23, 11, 4, 121 };
Console.WriteLine("Data before sort ");
for (i = 0; i < a.Length; i++)
Console.Write(" {0} ", a[i]);
Console.WriteLine();
HeapSort(a);
Console.WriteLine("Data after sort");
for (i = 0; i < a.Length; i++)
Console.Write(" {0} ", a[i]);
Console.ReadLine();
}
}
}
I've written this with your help - is it good?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
namespace Sort
{
class Program
{
public static void Adjust(int[] list, int i, int m)
{
int Temp = list[i];
int j = i * 2 + 1;
while (j <= m)
{
if (j < m)
if (list[j] < list[j + 1])
j = j + 1;
if (Temp < list[j])
{
list[i] = list[j];
i = j;
j = 2 * i + 1;
}
else
{
j = m + 1;
}
}
list[i] = Temp;
}
public static void HeapSort (int[] list)
{
int i;
//Building a heap
for (i = (list.Length - 1) / 2;i >=0;i--)
Adjust (list, i, list.Length - 1);
for ( i = list.Length - 1; i >= 1; i--)
{
int Temp = list [0];
list [0] = list [i];
list [i] = Temp;
Adjust (list, 0, i - 1);
}
}
static void Main(string[] args)
{
Console.Title = "HeapSort";
int i;
Random myRandom = new Random();//Creating instance of class Random
Stopwatch myTime = new Stopwatch(); //variable for time measurement
int[] a = new int[1000]; //table contents 1000 variables
for (i = 0; i < a.Length; i++)
a[i] = myRandom.Next(100);
Console.WriteLine("Data before sort ");
for (i = 0; i < a.Length; i++)
Console.Write(" {0} ", a[i]);
Console.WriteLine();
myTime.Start();
HeapSort(a);
myTime.Stop();
string TimeEl = myTime.Elapsed.ToString();
Console.WriteLine("Data after sort");
for (i = 0; i < a.Length; i++)
Console.Write(" {0} ", a[i]);
Console.WriteLine();
Console.WriteLine();
Console.WriteLine("time elapsed: {0} ", TimeEl);
Console.ReadLine();
}
}
}
If you're looking for time measurements, use the Stopwatch class.
This allows you to easily measure some time using the Start() and Stop() method. The Elapsed property will then tell you how long the operation took.
You could use the Stopwatch class to measure time:
var watch = Stopwatch.StartNew();
SomeFunctionThatCallsYourAlgorithm();
watch.Stop();
Console.WriteLine("algorithm execution time: {0}ms", watch.ElapsedMilliseconds);
There's some code from Vance Morrison's weblog that uses the Stopwatch class (as described above), but does multiple runs and performs some statistical analysis to give you the mean and median runtime, along with the standard deviation.
Check it out here:
Link
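Putting the pieces together for all four requested sizes, a sketch (my arrangement; it keeps the poster's HeapSort signature but substitutes Array.Sort as a stand-in here) that times each array size with Stopwatch:

```csharp
using System;
using System.Diagnostics;

class SizeSweep
{
    static void HeapSort(int[] list) => Array.Sort(list); // stand-in for the poster's HeapSort

    public static long[] Run()
    {
        var rnd = new Random();
        var sizes = new[] { 1000, 10000, 100000, 1000000 };
        var times = new long[sizes.Length];
        for (int s = 0; s < sizes.Length; s++)
        {
            int[] a = new int[sizes[s]];
            for (int i = 0; i < a.Length; i++) a[i] = rnd.Next();
            var watch = Stopwatch.StartNew(); // time only the sort itself
            HeapSort(a);
            watch.Stop();
            times[s] = watch.ElapsedMilliseconds;
            Console.WriteLine("{0,9} items: {1} ms", sizes[s], times[s]);
        }
        return times;
    }
}
```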

Why is struct slower than float?

If I have array of structs MyStruct[]:
struct MyStruct
{
float x;
float y;
}
And it's slower than if I use a float[] with x => i, y => i + 1 (so this array is twice as big as the struct array).
Time difference for comparing 10,000 items against each other (two nested fors): struct 500 ms, array with only floats 78 ms.
I thought that a struct behaved like e.g. a float or int (on the heap).
Firstly structs don't necessarily appear on the heap - they can and often do appear on the stack.
Regarding your performance measurements, I think you must have tested it incorrectly. Using this benchmarking code I get almost the same performance results for both types:
TwoFloats[] a = new TwoFloats[10000];
float[] b = new float[20000];
void test1()
{
int count = 0;
for (int i = 0; i < 10000; i += 1)
{
if (a[i].x < 10) count++;
}
}
void test2()
{
int count = 0;
for (int i = 0; i < 20000; i += 2)
{
if (b[i] < 10) count++;
}
}
Results:
Method Iterations per second
test1 55200000
test2 54800000
You are doing something seriously wrong if you get times like that. Float comparisons are dramatically fast; I clock them at 2 nanoseconds including the loop overhead. Crafting a test like this is tricky because the JIT compiler will optimize stuff away if you don't use the result of the comparison.
The structure is slightly faster, 1.96 nanoseconds vs 2.20 nanoseconds for the float[] on my laptop. That's the way it should be, accessing the Y member of the struct doesn't cost an extra array index.
Test code:
using System;
using System.Diagnostics;
class Program {
static void Main(string[] args) {
var test1 = new float[100000000]; // 100 million
for (int ix = 0; ix < test1.Length; ++ix) test1[ix] = ix;
var test2 = new Test[test1.Length / 2];
for (int ix = 0; ix < test2.Length; ++ix) test2[ix].x = test2[ix].y = ix;
for (int cnt = 0; cnt < 20; ++cnt) {
var sw1 = Stopwatch.StartNew();
bool dummy = false;
for (int ix = 0; ix < test1.Length; ix += 2) {
dummy ^= test1[ix] >= test1[ix + 1];
}
sw1.Stop();
var sw2 = Stopwatch.StartNew();
for (int ix = 0; ix < test2.Length; ++ix) {
dummy ^= test2[ix].x >= test2[ix].y;
}
sw2.Stop();
Console.Write("", dummy);
Console.WriteLine("{0} {1}", sw1.ElapsedMilliseconds, sw2.ElapsedMilliseconds);
}
Console.ReadLine();
}
struct Test {
public float x;
public float y;
}
}
I get results that seem to agree with you (and disagree with Mark). I'm curious if I've made a mistake constructing this (albeit crude) benchmark or if there is another factor at play.
Compiled on MS C# targeting .NET 3.5 framework with VS2008. Release mode, no debugger attached.
Here's my test code:
class Program {
static void Main(string[] args) {
for (int i = 0; i < 10; i++) {
RunBench();
}
Console.ReadKey();
}
static void RunBench() {
Stopwatch sw = new Stopwatch();
const int numPoints = 10000;
const int numFloats = numPoints * 2;
int numEqs = 0;
float[] rawFloats = new float[numFloats];
Vec2[] vecs = new Vec2[numPoints];
Random rnd = new Random();
for (int i = 0; i < numPoints; i++) {
rawFloats[i * 2] = (float) rnd.NextDouble();
rawFloats[i * 2 + 1] = (float)rnd.NextDouble();
vecs[i] = new Vec2() { X = rawFloats[i * 2], Y = rawFloats[i * 2 + 1] };
}
sw.Start();
for (int i = 0; i < numFloats; i += 2) {
for (int j = 0; j < numFloats; j += 2) {
if (i != j &&
Math.Abs(rawFloats[i] - rawFloats[j]) < 0.0001 &&
Math.Abs(rawFloats[i + 1] - rawFloats[j + 1]) < 0.0001) {
numEqs++;
}
}
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds.ToString() + " : numEqs = " + numEqs);
numEqs = 0;
sw.Reset();
sw.Start();
for (int i = 0; i < numPoints; i++) {
for (int j = 0; j < numPoints; j++) {
if (i != j &&
Math.Abs(vecs[i].X - vecs[j].X) < 0.0001 &&
Math.Abs(vecs[i].Y - vecs[j].Y) < 0.0001) {
numEqs++;
}
}
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds.ToString() + " : numEqs = " + numEqs);
}
}
struct Vec2 {
public float X;
public float Y;
}
Edit: Ah! I wasn't iterating the proper amounts. With the updated code my timings look like I expected:
269 : numEqs = 8
269 : numEqs = 8
270 : numEqs = 2
269 : numEqs = 2
268 : numEqs = 4
270 : numEqs = 4
269 : numEqs = 2
268 : numEqs = 2
270 : numEqs = 6
270 : numEqs = 6
269 : numEqs = 8
268 : numEqs = 8
268 : numEqs = 4
270 : numEqs = 4
269 : numEqs = 6
269 : numEqs = 6
268 : numEqs = 2
270 : numEqs = 2
268 : numEqs = 4
270 : numEqs = 4
The most likely reason is that the C# runtime optimizer does a better job when you work with floats than with full structs, probably because the optimizer maps x and y to registers, or makes similar changes it doesn't make for the full struct.
In your particular example there seems to be no fundamental reason why it couldn't do as good a job when you use structs (it's hard to be sure without seeing your actual benchmarking code), but it just doesn't. However, it would be interesting to compare the performance of the resulting code when compiled with another C# implementation (I'm thinking of Mono on Linux).
I tested Ron Warholic's benchmark with Mono, and the results are consistent with Mark's; the difference between the two types of access seems to be minimal (the version with floats is 1% faster). However, I still should do more testing, as it is not unexpected that library calls like Math.Abs take a large amount of time and could hide a real difference.
After removing the calls to Math.Abs and just doing tests like rawFloats[i] < rawFloats[j], the struct version becomes marginally faster (about 5%) than the two arrays of floats.
The code below is based on different ways of iteration. On my machine, Test1b takes almost twice as long as Test1a. I wonder if this relates to your issue.
class Program
{
struct TwoFloats
{
public float x;
public float y;
}
static TwoFloats[] a = new TwoFloats[10000];
static int Test1a()
{
int count = 0;
for (int i = 0; i < 10000; i += 1)
{
if (a[i].x < a[i].y) count++;
}
return count;
}
static int Test1b()
{
int count = 0;
foreach (TwoFloats t in a)
{
if (t.x < t.y) count++;
}
return count;
}
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
sw.Start();
for (int j = 0; j < 5000; ++j)
{
Test1a();
}
sw.Stop();
Trace.WriteLine(sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
for (int j = 0; j < 5000; ++j)
{
Test1b();
}
sw.Stop();
Trace.WriteLine(sw.ElapsedMilliseconds);
}
}
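A plausible explanation for the Test1a/Test1b gap (my reading, not stated in the answer above): foreach over an array of structs copies each element into the loop variable on every iteration, while indexed access can read the fields in place. A tiny self-contained sketch of the two access patterns:

```csharp
struct TwoFloats
{
    public float x;
    public float y;
}

class CopySketch
{
    public static float SumByIndex(TwoFloats[] a)
    {
        float s = 0;
        for (int i = 0; i < a.Length; i++)
            s += a[i].x; // field read directly from the array element
        return s;
    }

    public static float SumByForeach(TwoFloats[] a)
    {
        float s = 0;
        foreach (TwoFloats t in a) // t is a fresh copy of the struct each iteration
            s += t.x;
        return s;
    }
}
```

Both return the same result; whether the copy actually costs anything depends on the JIT, which is why measuring (as the answers here do) matters.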
