Performance impact of array writes is much larger than expected - c#

I've stumbled upon this effect when debugging an application - see the repro code below.
It gives me the following results:
Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs
As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?
UPD extra test added; Perf 0 shows the results as expected, Perf 1 - shows the performance anomaly.
class Program
{
static void Main(string[] args)
{
var data = InitData();
TestPerf0(data, false);
TestPerf0(data, true);
TestPerf1(data, false);
TestPerf1(data, true);
if (Debugger.IsAttached)
Console.ReadKey();
}
private static string[] InitData()
{
var watch = Stopwatch.StartNew();
var data = new string[100_000];
var maxString = 10_000;
for (int i = 0; i < data.Length; i++)
{
data[i] = new string('-', maxString);
}
watch.Stop();
Console.WriteLine($"Data init, count: {data.Length:n0} x {maxString:n0}, {watch.Elapsed.TotalSeconds} secs");
return data;
}
private static void TestPerf1(string[] vals, bool testStore)
{
var watch = Stopwatch.StartNew();
var counters = new int[char.MaxValue];
int tmp = 0;
for (var j = 0; ; j++)
{
var allEmpty = true;
for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];
if (j < val.Length)
{
allEmpty = false;
var ch = val[j];
var count = counters[ch];
tmp ^= count;
if (testStore)
counters[ch] = count + 1;
}
}
if (allEmpty)
break;
}
// prevent the compiler from optimizing away our computations
tmp.GetHashCode();
watch.Stop();
Console.WriteLine($"Perf test 1 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
}
private static void TestPerf0(string[] vals, bool testStore)
{
var watch = Stopwatch.StartNew();
var counters = new int[65536];
int tmp = 0;
for (var i = 0; i < 1_000_000_000; i++)
{
var j = i % counters.Length;
var count = counters[j];
tmp ^= count;
if (testStore)
counters[j] = count + 1;
}
// prevent the compiler from optimizing away our computations
tmp.GetHashCode();
watch.Stop();
Console.WriteLine($"Perf test 0 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
}
}

After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:
if (testStore)
counters[ch] = count + 1;
might be force the compiler to completely load a new cache-line into the memory and displace the current content. There might also be some problems with branch-prediction in this scenario. This is highly hardware dependent and I'm not aware of a really good solution to test this in any interpreted language (It's also quite hard in compiled languages where the hardware is set and well-known).
After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instruction which might increase the before mentioned problems further.
Overall I'd advice you the re-write the complete algorithm as there are better places to improve performance instead of picking at this one little assignment. This would be the optimizations I'd suggest (this also improves readability):
Invert your i and j loop. This will remove the allEmpty variable completely.
Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as index.
Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot-spot" of your code you can start to think about better solutions (Remember: "premature optimization is the root of all evil").
As this is a "test setting" which the name suggests, is this important at all? Just remove it.
EDIT: Why did I suggest to invert to loops? With this little rearrangement of code:
foreach (var val in vals)
{
foreach (int ch in val)
{
var count = counters[ch];
tmp ^= count;
if (testStore)
{
counters[ch] = count + 1;
}
}
}
I come from runtimes like this:
to runtimes like this:
Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.
EDIT2: For the in-depth answer. My best explanation for why this problem occurs is because you cross-reference your cache-lines. In the lines:
for (var i = 0; i < vals.Length; i++)
{
var val = vals[i];
you load a really massive dataset. This is by far bigger than a cache-line itself. So it will most likely need to be loaded every iteration fresh from the memory into a new cache-line (displacing the old content). This is also known as "cache-thrashing" if I remember correctly. Thanks to #mjwills for pointing this out in his comment.
In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).
This is the closest explanation why me code runs that much faster and it also supports the assumption that you have serious caching problems with your code.

Related

What should I use to optimize my c# code?

I was doing a codewars kata and it's working but I'm timing out.
I searched online for solutions, for some kind of reference but they were all for java script.
Here is the kata: https://i.stack.imgur.com/yGLmw.png
Here is my code:
public static int DblLinear(int n)
{
if(n > 0)
{
var list = new List<int>();
int[] next_two = new int[2];
list.Add(1);
for (int i = 0; i < n; i++)
{
for (int m = 0; m < next_two.Length; m++)
{
next_two[m] = ((m + 2) * list[i]) + 1;
}
if(list.Contains(next_two[0]))
{
list.Add(next_two[1]);
}
else if(list.Contains(next_two[1]))
{
list.Add(next_two[0]);
}
else
list.AddRange(next_two);
list.Sort();
}
return list[n];
}
return 1;
}
It's really slow solution but that's what seems to be working for me.
The first rule of performance optimization is to measure. Ideally using a profiler that can tell you where most of the time is spent, but for simple cases using some stopwatches can be sufficient.
I would guess that most of the time would be spent in list.Contains, since this is linear lookup, and is in the innermost loop. So one approach would be to change the list to a HashSet<int> to provide better lookup performance, skip the .Sort-call, and return the maximum value in the hashSet. As far as I can tell that should give the same result.
You might also consider using some specialized data structure that fits the problem better than the general containers provided in .Net.

Parallel.ForEach search doesn't find the correct value

This is my first attempt at parallel programming.
I'm writing a test console app before using this in my real app and I can't seem to get it right. When I run this, the parallel search is always faster than the sequential one, but the parallel search never finds the correct value. What am I doing wrong?
I tried it without using a partitioner (just Parallel.For); it was slower than the sequential loop and gave the wrong number. I saw a Microsoft doc that said for simple computations, using Partitioner.Create can speed things up. So I tried that but still got the wrong values. Then I saw Interlocked, but I think I'm using it wrong.
Any help would be greatly appreciated
Random r = new Random();
Stopwatch timer = new Stopwatch();
do {
// Make and populate a list
List<short> test = new List<short>();
for (int x = 0; x <= 10000000; x++)
{
test.Add((short)(r.Next(short.MaxValue) * r.NextDouble()));
}
// Initialize result variables
short rMin = short.MaxValue;
short rMax = 0;
// Do min/max normal search
timer.Start();
foreach (var amp in test)
{
rMin = Math.Min(rMin, amp);
rMax = Math.Max(rMax, amp);
}
timer.Stop();
// Display results
Console.WriteLine($"rMin: {rMin} rMax: {rMax} Time: {timer.ElapsedMilliseconds}");
// Initialize parallel result variables
short pMin = short.MaxValue;
short pMax = 0;
// Create list partioner
var rangePortioner = Partitioner.Create(0, test.Count);
// Do min/max parallel search
timer.Restart();
Parallel.ForEach(rangePortioner, (range, loop) =>
{
short min = short.MaxValue;
short max = 0;
for (int i = range.Item1; i < range.Item2; i++)
{
min = Math.Min(min, test[i]);
max = Math.Max(max, test[i]);
}
_ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMin), Math.Min(pMin, min));
_ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMax), Math.Max(pMax, max));
});
timer.Stop();
// Display results
Console.WriteLine($"pMin: {pMin} pMax: {pMax} Time: {timer.ElapsedMilliseconds}");
Console.WriteLine("Press enter to run again; any other key to quit");
} while (Console.ReadKey().Key == ConsoleKey.Enter);
Sample output:
rMin: 0 rMax: 32746 Time: 106
pMin: 0 pMax: 32679 Time: 66
Press enter to run again; any other key to quit
The correct way to do a parallel search like this is to compute local values for each thread used, and then merge the values at the end. This ensures that synchronization is only needed at the final phase:
var items = Enumerable.Range(0, 10000).ToList();
int globalMin = int.MaxValue;
int globalMax = int.MinValue;
Parallel.ForEach<int, (int Min, int Max)>(
items,
() => (int.MaxValue, int.MinValue), // Create new min/max values for each thread used
(item, state, localMinMax) =>
{
var localMin = Math.Min(item, localMinMax.Min);
var localMax = Math.Max(item, localMinMax.Max);
return (localMin, localMax); // return the new min/max values for this thread
},
localMinMax => // called one last time for each thread used
{
lock(items) // Since this may run concurrently, synchronization is needed
{
globalMin = Math.Min(globalMin, localMinMax.Min);
globalMax = Math.Max(globalMax, localMinMax.Max);
}
});
As you can see this is quite a bit more complex than a regular loop, and this is not even doing anything fancy like partitioning. An optimized solution would work over larger blocks to reduce overhead, but this is omitted for simplicity, and it looks like the OP is aware such issues already.
Be aware that multi threaded programming is difficult. While it is a great idea to try out such techniques in a playground rather than a real program, I would still suggest that you should start by studying the potential dangers of thread safety, there is fairly easy to find good resources about this.
Not all problems will be as obviously wrong like this, and it is quite easy to cause issues that breaks once in a million, or only when the cpu load is high, or only on single CPU systems, or issues that are only detected long after the code is put into production. It is a good practice to be paranoid whenever multiple threads may read and write the same memory concurrently.
I would also recommend learning about immutable data types, and pure functions, since these are much safer and easier to reason about once multiple threads are involved.
Interlocked.Exchange is thread safe only for Exchange, every Math.Min and Math.Max can be with race condition. You should compute min/max for every batch separately and then join results.
Using low-lock techniques like the Interlocked class is tricky and advanced. Taking into consideration that your experience in multithreading is not excessive, I would say go with a simple and trusty lock:
object locker = new object();
//...
lock (locker)
{
pMin = Math.Min(pMin, min);
pMax = Math.Max(pMax, max);
}

How CPU caching works when you get access to different value in a 'for' loop?

I made some tests of code performance, and I would like to know how the CPU cache works in this kind of situation:
Here is a classic example for a loop:
private static readonly short[] _values;
static MyClass()
{
var random = new Random();
_values = Enumerable.Range(0, 100)
.Select(x => (short)random.Next(5000))
.ToArray();
}
public static void Run()
{
short max = 0;
for (var index = 0; index < _values.Length; index++)
{
max = Math.Max(max, _values[index]);
}
}
Here is the specific situation to get the same thing, but much more performant:
private static readonly short[] _values;
static MyClass()
{
var random = new Random();
_values = Enumerable.Range(0, 100)
.Select(x => (short)random.Next(5000))
.ToArray();
}
public static void Run()
{
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index+=2)
{
max1 = Math.Max(max1, _values[index]);
max2 = Math.Max(max2, _values[index + 1]);
}
short max = Math.Max(max1, max2);
}
So I am interested to know why the second is more efficient as the first one.
I understand it's a story of CPU cache, but I don't get really how it happens (like values are not read twice between loops).
EDIT:
.NET Core 4.6.27617.04
2.1.11
Intel Core i7-7850HQ 2.90GHz 64-bit
Calling 50 Million of times:
MyClass1:
=> 00:00:06.0702028
MyClass2:
=> 00:00:03.8563776 (-36 %)
The last metric are the one with the Loop unrolling.
The difference in performance in this case is not related to caching - you have just 100 values - they fit entirely in the L2 cache already at the time you generated them.
The difference is due to out-of-order execution.
A modern CPU has multiple execution units and can perform more than one operation at the same time even in a single-threaded application.
But your loop is problematic for a modern CPU because it has a dependency:
short max = 0;
for (var index = 0; index < _values.Length; index++)
{
max = Math.Max(max, _values[index]);
}
Here each subsequent iteration is dependent on the value max from the previous one, so the CPU is forced to compute them sequentially.
Your revised loop adds a degree of freedom for the CPU; since max1 and max2 are independent, they can be computed in parallel.
So essentially the revised loop can run equally fast per iteration as the first one:
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index+=2)
{
max1 = Math.Max(max1, _values[index]);
max2 = Math.Max(max2, _values[index + 1]);
}
But it has half the iterations, so in the end you get a significant speedup (not 2x because out-of-order execution is not perfect).
Caching
Caching in the cpu works such as it pre-loads the next few lines of code from memory and stores it in the CPU Cache, This may be data, pointers, variable values, etc. etc.
Code Blocks
between your two blocks of code, the difference may not appear in the syntax, try converting your Code to IL (intermediate runtime language for c# which is executed by JIT(just-in-time compiler)) see ref for tools and resources.
or just decompiler your built/compiled code and check how the compiler "optimized it" when making the dll/exe files using the decompiler below.
other performance optimization
Loop Unrolling
CPU Caching
Refs:
C# Decompiler
JIT

What do this methods do?

I've found two diferent methods to get a Max value from an array but I'm not really fond of parallel programing, so I really don't understand it.
I was wondering do this methods do the same or am I missing something?
I really don't have much information about them. Not even comments...
The first method:
int[] vec = ... (I guess the content doesn't matter)
static int naiveMax()
{
int max = vec[0];
object obj = new object();
Parallel.For(0, vec.Length, i =>
{
lock (obj) {
if (vec[i] > max) max = vec[i];
}
});
return max;
}
And the second one:
static int Max()
{
int max = vec[0];
object obj = new object();
Parallel.For(0, vec.Length, //could be Parallel.For<int>
() => vec[0],
(i, loopState, partial) =>
{
if(vec[i]>partial) partial = vec[i];
return partial;
},
partial => {
lock (obj) {
if( partial > max) max = partial;
}
});
return max;
}
Do these do the same or something diferent and what? Thanks ;)
Both find the maximum value in an array of integers. In an attempt to find the maximum value faster, they do it "in parallel" using the Parallel.For Method. Both methods fail at this, though.
To see this, we first need a sufficiently large array of integers. For small arrays, parallel processing doesn't give us a speed-up anyway.
int[] values = new int[100000000];
Random random = new Random();
for (int i = 0; i < values.Length; i++)
{
values[i] = random.Next();
}
Now we can run the two methods and see how long they take. Using an appropriate performance measurement setup (Stopwatch, array of 100,000,000 integers, 100 iterations, Release build, no debugger attached, JIT warm-up) I get the following results on my machine:
naiveMax 00:06:03.3737078
Max 00:00:15.2453303
So Max is much much better than naiveMax (6 minutes! cough).
But how does it compare to, say, PLINQ?
static int MaxPlinq(int[] values)
{
return values.AsParallel().Max();
}
MaxPlinq 00:00:11.2335842
Not bad, saved a few seconds. Now, what about a plain, old, sequential for loop for comparison?
static int Simple(int[] values)
{
int result = values[0];
for (int i = 0; i < values.Length; i++)
{
if (result < values[i]) result = values[i];
}
return result;
}
Simple 00:00:05.7837002
I think we have a winner.
Lesson learned: Parallel.For is not pixie dust that you can sprinkle over your code to
make it magically run faster. If performance matters, use the right tools and measure, measure, measure, ...
They appear to do the same thing, however they are very inefficient. The point of parallelization is to improve the speed of code that can be executed independently. Due to race conditions, discovering the maximum (as implemented here) requires an atomic semaphore/lock on the actual logic... Which means you're spinning up many threads and related resources simply to do the code sequentially anyway... Defeating the purpose of parallelization entirely.

Fastest algorithm to create an array of known size

I need to create an array of boolean values, which could be on the scale of 100,000s or even millions of entries. It also needs to be super-fast, so every millisecond per iteration counts.
At the time of beginning the loop, I will already know how many entries there are going to be in the array. The question is, will it be faster to create a bool array up front and fill in the values by index (which is random access - could be slow?), or should I create a List<bool>, keep adding entries to the list, and at the end return .ToArray()?
In other words:
Option 1
var array = new bool[size];
for (var n=0; n<size; n++)
array[n] = GetValue(n);
return array;
Option 2
var list = new List<bool>();
for (var n=0; n<size; n++)
list.Add(GetValue(n));
return list.ToArray();
Or maybe there's a 3rd way that's even faster?
Use a System.Collections.BitArray and don't worry about speed.
What you are suggesting above will only waste your memory. This is optimizes both for speed and size, and will pack your bool values nicely (8 per byte, as the gods intended :).
Reply to below comments: If you use a BitArray, everything will be zero at first. Set only those bits for which you have GetValue == true.
The following code seems to show (at least to me) that of the methods discussed on this page, the simple allocation to a bool[] using a loop is quickest.
The code also seems to show me that unless GetValue(n) is computationally trivial, the overhead of allocating the bytes is not the part of the process I would be hoping to optimise.
Hope this helps in some way.
edit: added the results from the run (on my machine)
-- 187ms BitArray
-- 171ms List<bool>().ToArray
-- 168ms bool[] set only if true
-- 130ms bool[] always set
--11460ms bool[] always set with 'complex' GetValue()
class Program
{
static void Main(string[] args)
{
BitArray bitArray = new BitArray(10000000);
bool[] boolArray = new bool[10000000];
Stopwatch sw1 = new Stopwatch();
sw1.Start();
for (int i = 0; i < 10000000; i++)
{
bitArray[i] = GetMod2(i);
}
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
var list = new List<bool>();
for (int i = 0; i < 10000000; i++)
list.Add(GetMod2(i));
var boolArray2 = list.ToArray();
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
for (int i = 0; i < 10000000; i++)
{
bool nextVal = GetMod2(i);
if (nextVal)
bitArray[i] = true;
}
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
for (int i = 0; i < 10000000; i++)
{
boolArray[i] = GetMod2(i);
}
Console.WriteLine(sw1.ElapsedMilliseconds);
sw1.Restart();
for (int i = 0; i < 10000000; i++)
{
boolArray[i] = GetRand(i);
}
Console.WriteLine(sw1.ElapsedMilliseconds);
Console.ReadLine();
}
static bool GetMod2(int i)
{
return (i % 2) == 1;
}
static bool GetRand(int i)
{
return new Random().Next(2) == 1;
}
}
Go with the first. The only reason it might be. "slow" is if it keeps paging data from outside the processor cache.
The list will have exactly the same problem, except it will also need to perform several memory allocations and copies.
Now here's a funny old thing. Inspired by #paul, I ran these benchmark tests myself, on 10,000,000 booleans. The results (in milliseconds) are very surprising, given the discussion in the comments to this question:
BitArray: 517
BitArray + CopyTo(array): 536
List + ToArray(): 455
bool array: 483
And what a turnout for the books! Despite the fact that the List<Bool> is inserting a new record every time, while the bool[] and BitArray are initialized to false on every record, and I only updated them where the value should be true, the List<bool> comes out tops, consistently, even including the .ToArray() call.
Yet another case where practical application is better than textbook knowledge, it seems... :)

Categories

Resources