ConcurrentDictionary<> performance at a single thread misunderstanding? - c#

Related brief info:
AFAIK, the concurrent stack, queue, and bag classes are implemented internally with linked lists.
And I know that there is much less contention because each thread is responsible for its own linked list.
Anyway, my question is about ConcurrentDictionary<,>.
I was testing this code (single thread):
Stopwatch sw = new Stopwatch();
sw.Start();
var d = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 1000000; i++) d[i] = 123;
for (int i = 1000000; i < 2000000; i++) d[i] = 123;
for (int i = 2000000; i < 3000000; i++) d[i] = 123;
Console.WriteLine("baseline = " + sw.Elapsed);
sw.Restart();
var d2 = new Dictionary<int, int>();
for (int i = 0; i < 1000000; i++) lock (d2) d2[i] = 123;
for (int i = 1000000; i < 2000000; i++) lock (d2) d2[i] = 123;
for (int i = 2000000; i < 3000000; i++) lock (d2) d2[i] = 123;
Console.WriteLine("baseline = " + sw.Elapsed);
sw.Stop();
Result (tested many times, same values +/-):
baseline = 00:00:01.2604656
baseline = 00:00:00.3229741
Question:
What makes ConcurrentDictionary<,> much slower in a single-threaded environment?
My first instinct was that lock(){} would always be slower, but apparently it is not.

Well, ConcurrentDictionary is allowing for the possibility that it can be used by multiple threads. It seems entirely reasonable to me that that requires more internal housekeeping than something which assumes it can get away without worrying about access from multiple threads. I'd have been very surprised if it had worked out the other way round - if the safer version were always faster too, why would you ever use the less safe version?

The most likely reason is that ConcurrentDictionary simply has more overhead than Dictionary for the same operation. This is demonstrably true if you dig into the sources:
It uses a lock for the indexer
It uses volatile writes
It has to do atomic writes of values which are not guaranteed to be atomic in .NET
It has extra branches in the core add routine (whether to take a lock, whether to do an atomic write)
All of these costs are incurred irrespective of the number of threads that it's being used on. These costs may be individually small, but they aren't free and do add up over time.
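For illustration only, here is a highly simplified sketch (not the actual BCL source, which is far more sophisticated) of the kind of extra work a striped-lock dictionary does on every indexer access compared with a plain Dictionary:

using System.Collections.Generic;

// Toy striped-lock dictionary: every access pays for hashing to a stripe and taking a lock,
// on top of the normal work a plain Dictionary<TKey, TValue> would do anyway.
public sealed class StripedDictionary<TKey, TValue>
{
    private readonly object[] _locks;
    private readonly Dictionary<TKey, TValue>[] _buckets;

    public StripedDictionary(int concurrencyLevel = 8)
    {
        _locks = new object[concurrencyLevel];
        _buckets = new Dictionary<TKey, TValue>[concurrencyLevel];
        for (int i = 0; i < concurrencyLevel; i++)
        {
            _locks[i] = new object();
            _buckets[i] = new Dictionary<TKey, TValue>();
        }
    }

    private int Stripe(TKey key) => (key.GetHashCode() & 0x7FFFFFFF) % _locks.Length;

    public TValue this[TKey key]
    {
        get { int s = Stripe(key); lock (_locks[s]) return _buckets[s][key]; }
        set { int s = Stripe(key); lock (_locks[s]) _buckets[s][key] = value; } // extra lock per write
    }
}

Even with zero contention, each write above does a hash-to-stripe computation and a Monitor enter/exit that a plain Dictionary never pays for.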

Update for .NET 5: I'll leave the previous answer up as it is still relevant for older runtimes but .NET 5 appears to have further improved ConcurrentDictionary to the point where reads via TryGetValue() are actually faster than even the normal Dictionary, as seen in the results below (COW is my CopyOnWriteDictionary, detailed below). Make what you will of this :)
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------- |------------:|----------:|----------:|---------:|---------:|---------:|----------:|
| ConcurrentWrite | 1,372.32 us | 12.752 us | 11.304 us | 226.5625 | 89.8438 | 44.9219 | 1398736 B |
| COWWrite | 1,077.39 us | 21.435 us | 31.419 us | 56.6406 | 19.5313 | 11.7188 | 868629 B |
| DictWrite | 347.19 us | 5.875 us | 5.208 us | 124.5117 | 124.5117 | 124.5117 | 673064 B |
| ConcurrentRead | 63.53 us | 0.486 us | 0.431 us | - | - | - | - |
| COWRead | 81.55 us | 0.908 us | 0.805 us | - | - | - | - |
| DictRead | 70.71 us | 0.471 us | 0.393 us | - | - | - | - |
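For context, a minimal BenchmarkDotNet harness of the kind that produces a table like the one above might look like the sketch below; the benchmark bodies, sizes, and names here are assumptions, not the code that produced these numbers:

using System.Collections.Concurrent;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class DictionaryReadBenchmarks
{
    private const int N = 100_000;
    private readonly ConcurrentDictionary<int, int> _concurrent = new ConcurrentDictionary<int, int>();
    private readonly Dictionary<int, int> _plain = new Dictionary<int, int>();

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 0; i < N; i++) { _concurrent[i] = i; _plain[i] = i; }
    }

    [Benchmark]
    public long ConcurrentRead()
    {
        long sum = 0;
        for (int i = 0; i < N; i++) { _concurrent.TryGetValue(i, out var v); sum += v; }
        return sum;
    }

    [Benchmark]
    public long DictRead()
    {
        long sum = 0;
        for (int i = 0; i < N; i++) { _plain.TryGetValue(i, out var v); sum += v; }
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<DictionaryReadBenchmarks>();
}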
Previous answer, still relevant for < .NET 5:
The latest versions of ConcurrentDictionary have improved significantly since I originally posted this answer. It no longer locks on read and thus offers almost the same performance profile as my CopyOnWriteDictionary implementation, with more features, so I recommend you use it instead in most cases. ConcurrentDictionary still has 20-30% more overhead than Dictionary or CopyOnWriteDictionary, so performance-sensitive applications may still benefit from CopyOnWriteDictionary.
You can read about my lock-free thread-safe copy-on-write dictionary implementation here:
http://www.singulink.com/CodeIndex/post/fastest-thread-safe-lock-free-dictionary
It's currently append-only (with the ability to replace values) as it is intended for use as a permanent cache. If you need removal then I suggest using ConcurrentDictionary since adding that into CopyOnWriteDictionary would eliminate all performance gains due to the added locking.
CopyOnWriteDictionary is very fast for quick bursts of writes, and lookups usually run at almost standard Dictionary speed without locking. If you write occasionally and read often, this is the fastest option available.
My implementation provides maximum read performance by removing the need for any read locks under normal circumstances while updates aren't being made to the dictionary. The trade-off is that the dictionary needs to be copied and swapped after updates are applied (which is done on a background thread) but if you don't write often or you only write once during initialization then the trade-off is definitely worth it.
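As a rough illustration of the copy-on-write idea described above (a minimal sketch, not the linked CopyOnWriteDictionary implementation):

using System.Collections.Generic;

public sealed class CopyOnWriteMap<TKey, TValue>
{
    private readonly object _writeLock = new object();
    private volatile Dictionary<TKey, TValue> _snapshot = new Dictionary<TKey, TValue>();

    // Read path: no locks, just read the current immutable snapshot.
    public bool TryGetValue(TKey key, out TValue value) =>
        _snapshot.TryGetValue(key, out value);

    // Write path: copy the whole dictionary under a lock, then publish the new snapshot.
    public void Set(TKey key, TValue value)
    {
        lock (_writeLock)
        {
            var copy = new Dictionary<TKey, TValue>(_snapshot) { [key] = value };
            _snapshot = copy;
        }
    }
}

Reads never take a lock because they only ever see a fully built snapshot; the cost is that every write copies the whole dictionary.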

ConcurrentDictionary vs. Dictionary
In general, use a System.Collections.Concurrent.ConcurrentDictionary in any scenario where you are adding and updating keys or values concurrently from multiple threads. In scenarios that involve frequent updates and relatively few reads, the ConcurrentDictionary generally offers modest benefits. In scenarios that involve many reads and many updates, the ConcurrentDictionary generally is significantly faster on computers that have any number of cores.
In scenarios that involve frequent updates, you can increase the degree of concurrency in the ConcurrentDictionary and then measure to see whether performance increases on computers that have more cores. If you change the concurrency level, avoid global operations as much as possible.
If you are only reading key or values, the Dictionary is faster because no synchronization is required if the dictionary is not being modified by any threads.
Link: https://msdn.microsoft.com/en-us/library/dd997373%28v=vs.110%29.aspx

The ConcurrentDictionary<> creates an internal set of locking objects at creation (this is determined by the concurrencyLevel, amongst other factors) - this set of locking objects is used to control access to the internal bucket structures in a series of fine-grained locks.
In a single threaded scenario, there would be no need for the locks, so the extra overhead of acquiring and releasing these locks is probably the source of the difference you're seeing.
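If you want to experiment with this, the constructor overload that takes a concurrency level lets you control how many of those internal locks are created; the numbers below are only illustrative, so tune and measure for your own workload:

int concurrencyLevel = Environment.ProcessorCount * 2; // rough starting point, not a requirement
int initialCapacity = 3_000_000;
var d = new ConcurrentDictionary<int, int>(concurrencyLevel, initialCapacity);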

There is no point in using ConcurrentDictionary, or in synchronizing access at all, if everything is done on a single thread. Of course Dictionary will beat ConcurrentDictionary.
Much depends on the usage pattern and the number of threads. Here is a test that shows ConcurrentDictionary outperforming Dictionary-plus-lock as the thread count increases.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

namespace ConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            Run(1, 100000, 10);
            Run(10, 100000, 10);
            Run(100, 100000, 10);
            Run(1000, 100000, 10);
            Console.ReadKey();
        }

        static void Run(int threads, int count, int cycles)
        {
            Console.WriteLine("");
            Console.WriteLine($"Threads: {threads}, items: {count}, cycles:{cycles}");
            var semaphore = new SemaphoreSlim(0, threads);
            var concurrentDictionary = new ConcurrentDictionary<int, string>();
            for (int i = 0; i < threads; i++)
            {
                Thread t = new Thread(() => Run(concurrentDictionary, count, cycles, semaphore));
                t.Start();
            }
            Thread.Sleep(1000);
            var w = Stopwatch.StartNew();
            semaphore.Release(threads);
            for (int i = 0; i < threads; i++)
                semaphore.Wait();
            Console.WriteLine($"ConcurrentDictionary: {w.Elapsed}");

            var dictionary = new Dictionary<int, string>();
            for (int i = 0; i < threads; i++)
            {
                Thread t = new Thread(() => Run(dictionary, count, cycles, semaphore));
                t.Start();
            }
            Thread.Sleep(1000);
            w.Restart();
            semaphore.Release(threads);
            for (int i = 0; i < threads; i++)
                semaphore.Wait();
            Console.WriteLine($"Dictionary: {w.Elapsed}");
        }

        static void Run(ConcurrentDictionary<int, string> dic, int elements, int cycles, SemaphoreSlim semaphore)
        {
            semaphore.Wait();
            try
            {
                for (int i = 0; i < cycles; i++)
                    for (int j = 0; j < elements; j++)
                    {
                        var value = dic.GetOrAdd(i, k => k.ToString()); // note: the key is the cycle index
                    }
            }
            finally
            {
                semaphore.Release();
            }
        }

        static void Run(Dictionary<int, string> dic, int elements, int cycles, SemaphoreSlim semaphore)
        {
            semaphore.Wait();
            try
            {
                for (int i = 0; i < cycles; i++)
                    for (int j = 0; j < elements; j++)
                        lock (dic)
                        {
                            if (!dic.TryGetValue(i, out string value))
                                dic[i] = i.ToString();
                        }
            }
            finally
            {
                semaphore.Release();
            }
        }
    }
}
Threads: 1, items: 100000, cycles:10
ConcurrentDictionary: 00:00:00.0000499
Dictionary: 00:00:00.0000137
Threads: 10, items: 100000, cycles:10
ConcurrentDictionary: 00:00:00.0497413
Dictionary: 00:00:00.2638265
Threads: 100, items: 100000, cycles:10
ConcurrentDictionary: 00:00:00.2408781
Dictionary: 00:00:02.2257736
Threads: 1000, items: 100000, cycles:10
ConcurrentDictionary: 00:00:01.8196668
Dictionary: 00:00:25.5717232

What makes ConcurrentDictionary<,> much slower in a single-threaded environment?
The overhead of the machinery required to make it much faster in multi-threaded environments.
My first instinct was that lock(){} would always be slower, but apparently it is not.
A lock is very cheap when uncontested. You can lock a million times per second and your CPU won't even notice, provided that you are doing it from a single thread. What kills performance in multi-threaded programs is contention for locks. When multiple threads are competing fiercely for the same lock, almost all of them have to wait for the lucky one that holds the lock to release it. This is where the ConcurrentDictionary, with its granular locking implementation, shines. And the more concurrency you have (the more processors/cores), the more it shines.
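As a rough illustration of how cheap an uncontended lock is, a single-threaded sketch like the one below (not a rigorous benchmark; numbers vary by machine) typically completes tens of millions of lock/unlock pairs in well under a second:

using System;
using System.Diagnostics;

class UncontendedLockDemo
{
    static void Main()
    {
        object gate = new object();
        long counter = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000_000; i++)
        {
            lock (gate) { counter++; } // only one thread exists, so the lock is never contended
        }
        sw.Stop();
        Console.WriteLine($"{counter} locked increments in {sw.ElapsedMilliseconds} ms");
    }
}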

In .NET 4, ConcurrentDictionary used very poor lock management and contention resolution that made it extremely slow. A Dictionary with custom locking, or even a test-and-set approach that copied-on-write the whole dictionary, was faster.
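For readers unfamiliar with the technique being referred to, a minimal sketch of the test-and-set (compare-and-swap) copy-on-write approach might look like this; the names and structure are illustrative, not the code that was actually used:

using System.Collections.Generic;
using System.Threading;

public static class CowUpdater
{
    private static Dictionary<int, string> _map = new Dictionary<int, string>();

    public static void Set(int key, string value)
    {
        while (true)
        {
            var current = Volatile.Read(ref _map);
            var copy = new Dictionary<int, string>(current) { [key] = value };
            // Publish the copy only if no other thread swapped the reference in the meantime.
            if (Interlocked.CompareExchange(ref _map, copy, current) == current)
                return;
        }
    }

    public static bool TryGet(int key, out string value) =>
        Volatile.Read(ref _map).TryGetValue(key, out value);
}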

Your test is wrong: you should stop the Stopwatch before printing the elapsed time!
Stopwatch sw = new Stopwatch();
sw.Start();
var d = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 1000000; i++) d[i] = 123;
for (int i = 1000000; i < 2000000; i++) d[i] = 123;
for (int i = 2000000; i < 3000000; i++) d[i] = 123;
sw.Stop();
Console.WriteLine("baseline = " + sw.Elapsed);

sw.Restart(); // restart, otherwise the second measurement would include the first
var d2 = new Dictionary<int, int>();
for (int i = 0; i < 1000000; i++) lock (d2) d2[i] = 123;
for (int i = 1000000; i < 2000000; i++) lock (d2) d2[i] = 123;
for (int i = 2000000; i < 3000000; i++) lock (d2) d2[i] = 123;
sw.Stop();
Console.WriteLine("baseline = " + sw.Elapsed);

Related

Is parallel code supposed to run slower than sequential code, after a certain dataset size?

I'm fairly new to C# and programming in general and I was trying out parallel programming.
I have written this example code that computes the sum of an array, first using multiple threads and then using one thread (the main thread).
I've timed both cases.
static long Sum(int[] numbers, int start, int end)
{
    long sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += numbers[i];
    }
    return sum;
}

static async Task Main()
{
    // Arrange data.
    const int COUNT = 100_000_000;
    int[] numbers = new int[COUNT];
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }

    // Split task into multiple parts.
    int threadCount = Environment.ProcessorCount;
    int taskCount = threadCount - 1;
    int taskSize = numbers.Length / taskCount;

    var start = DateTime.Now;

    // Run individual parts in separate threads.
    List<Task<long>> tasks = new();
    for (int i = 0; i < taskCount; i++)
    {
        int begin = i * taskSize;
        int end = (i == taskCount - 1) ? numbers.Length : (i + 1) * taskSize;
        tasks.Add(Task.Run(() => Sum(numbers, begin, end)));
    }

    // Wait for all threads to finish, as we need the result.
    var partialSums = await Task.WhenAll(tasks);
    long sumAsync = partialSums.Sum();
    var durationAsync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Async sum: {sumAsync}");
    Console.WriteLine($"Async duration: {durationAsync} milliseconds");

    // Sequential
    start = DateTime.Now;
    long sumSync = Sum(numbers, 0, numbers.Length);
    var durationSync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Sync sum: {sumSync}");
    Console.WriteLine($"Sync duration: {durationSync} milliseconds");

    var factor = durationSync / durationAsync;
    Console.WriteLine($"Factor: {factor:0.00}x");
}
When the array size is 100 million, the parallel sum is computed about 2x faster (on average).
But when the array size is 1 billion, it's significantly slower than the sequential sum.
Why is it running slower?
Hardware Information
Environment.ProcessorCount = 4
GC.GetGCMemoryInfo().TotalAvailableMemoryBytes = 8468377600
Timing: screenshots were attached for array sizes of 100,000,000 and 1,000,000,000.
New Test:
This time, instead of separate threads (3 in my case) working on different parts of a single array of 1,000,000,000 integers, I physically divided the dataset into 3 separate arrays of 333,333,333 elements (one third of the size each). This time, although I'm still adding up a billion integers on the same machine, my parallel code runs faster (as expected).
// Note: a Sum(int[]) overload that sums the whole array is assumed here,
// e.g. static long Sum(int[] numbers) => Sum(numbers, 0, numbers.Length);
private static void InitArray(int[] numbers)
{
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = (int)random.Next(100);
    }
}

public static async Task Main()
{
    Stopwatch stopwatch = new();
    const int SIZE = 333_333_333; // one third of a billion
    List<int[]> listOfArrays = new();
    for (int i = 0; i < Environment.ProcessorCount - 1; i++)
    {
        int[] numbers = new int[SIZE];
        InitArray(numbers);
        listOfArrays.Add(numbers);
    }

    // Sequential.
    stopwatch.Start();
    long syncSum = 0;
    foreach (var array in listOfArrays)
    {
        syncSum += Sum(array);
    }
    stopwatch.Stop();
    var sequentialDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Sequential sum: {syncSum}");
    Console.WriteLine($"Sequential duration: {sequentialDuration} ms");

    // Parallel.
    stopwatch.Restart();
    List<Task<long>> tasks = new();
    foreach (var array in listOfArrays)
    {
        tasks.Add(Task.Run(() => Sum(array)));
    }
    var partialSums = await Task.WhenAll(tasks);
    long parallelSum = partialSums.Sum();
    stopwatch.Stop();
    var parallelDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Parallel sum: {parallelSum}");
    Console.WriteLine($"Parallel duration: {parallelDuration} ms");
    Console.WriteLine($"Factor: {sequentialDuration / parallelDuration:0.00}x");
}
Timing: (results screenshot attached)
I don't know if it helps figure out what went wrong in the first approach.
The asynchronous pattern is not the same as running code in parallel. The main reason for asynchronous code is better resource utilization while the computer is waiting for some kind of IO device. Your code would be better described as parallel computing or concurrent computing.
While your example should work fine, it may not be the easiest nor the optimal way to do it. The easiest option would probably be to use Parallel LINQ: numbers.AsParallel().Sum();. There is also a Parallel.For method that should be better suited, including an overload that maintains thread-local state. Note that while Parallel.For will attempt to optimize its partitioning, you probably want to process chunks of data in each iteration to reduce overhead. I would try around 1-10k values or so.
We can only guess at the reason your parallel method is slower. Summing numbers is a really fast operation, so it may be that the computation is limited by memory bandwidth or cache usage. And while you want your work partitions to be fairly large, using too-large partitions may result in less overall parallelism if a thread gets suspended for any reason. You may also want partitions of certain sizes to work well with the caching system; see cache associativity. It is also possible you are including things you did not intend to measure, like compilation times or GCs. See BenchmarkDotNet, which takes care of many of the edge cases when measuring performance.
Also, never use DateTime for measuring performance; Stopwatch is both much easier to use and much more accurate.
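A sketch of the two alternatives mentioned above (PLINQ, and the Parallel.For overload that maintains thread-local state), assuming an int[] numbers like the one in the question:

using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static long ParallelSum(int[] numbers)
{
    long total = 0;
    Parallel.For(0, numbers.Length,
        () => 0L,                                     // per-thread initial state
        (i, _, local) => local + numbers[i],          // accumulate without touching shared state
        local => Interlocked.Add(ref total, local));  // merge each thread's partial sum once
    return total;
}

// Or, more simply, with PLINQ:
// long plinqSum = numbers.AsParallel().Sum(x => (long)x);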
My machine has 4GB RAM, so initializing an int[1_000_000_000] results in memory paging. Going from int[100_000_000] to int[1_000_000_000] results in non-linear performance degradation (100x instead of 10x). Essentially a CPU-bound operation becomes I/O-bound. Instead of adding numbers, the program spends most of its time reading segments of the array from the disk. In these conditions using multiple threads can be detrimental for the overall performance, because the pattern of accessing the storage device becomes more erratic and less streamlined.
Maybe something similar happens on your 8GB RAM machine too, but I can't say for sure.

Why isn't Parallel.For fast with heap-intensive operations?

For some operations Parallel scales well with the number of CPU's, but for other operations it does not.
Consider the code below, function1 gets a 10x improvement while function2 gets a 3x improvement. Is this due to memory allocation, or perhaps GC?
void function1(int v) {
    for (int i = 0; i < 100000000; i++) {
        var q = Math.Sqrt(v);
    }
}

void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i++) {
        dict.Add(i, v);
    }
}
var sw = new System.Diagnostics.Stopwatch();
var iterations = 100;
sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
The output on my machine:
function1 no parallel: 2 059,4 ms
function1 with parallel: 213,7 ms
function2 no parallel: 14 192,8 ms
function2 parallel: 4 491,1 ms
Environment:
Win 11, .Net 6.0, Release build
i9 12th gen, 16 cores, 24 proc, 32 GB DDR5
After testing more it seems the memory allocation does not scale that well with multiple threads. For example, if I change function 2 to:
void function2(int v) {
Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}
The result is:
function2 no parallell: 124,0 ms
function2 parallell: 402,4 ms
Is the conclusion that memory allocation does not scale well with multiple threads?...
tl;dr: Heap allocation contention.
Your first function is embarrassingly parallel. Each thread can do its computation with embarrassingly little interaction with other threads. So it scales up nicely to multiple threads. huseyin tugrul buyukisik correctly pointed out that your first computation makes use of the non-shared, per thread, processor registers.
Your second function, when it preallocates the dictionary, is somewhat less embarrassingly parallel. Each thread's computation is independent of the others' except for the fact that they each use your machine's RAM subsystem. So you see some thread-to-thread contention at the hardware level as thread-level cached data is written to and read from the machine-level RAM.
Your second function that does not preallocate memory is not embarrassingly parallel. Why not? Each .Add() operation must allocate some data in the shared heap. That can't be done in parallel, because all threads share the same heap. Rather they must be synchronized. The dotnet libraries do a good job of parallelizing heap operations as much as possible, but they do not avoid at least some blocking of thread B when thread A allocates heap data. So the threads slow each other down.
Separate processes rather than separate threads are a good way to scale up workloads like your non-preallocating second function. Each process has its own heap.
The first function works in registers. More cores = more registers.
The second function works on memory. More cores = more L1 cache, but the RAM is shared. A 10-million-element dataset certainly comes from RAM, as even L3 is not big enough. This assumes the JIT optimizes the allocations into reused buffers; if not, there is allocation overhead too. So you should re-use the dictionary on each new iteration instead of recreating it.
Also, you are storing data with an incremental integer index. A simple array could work here (again, re-used between iterations, as shown in the sketch below); it should have a smaller memory footprint than a dictionary.
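A minimal sketch of that suggestion - replacing the dictionary with a per-thread array that is reused across iterations (the [ThreadStatic] wiring here is an assumption about how one might implement the reuse):

using System;
using System.Threading.Tasks;

static class ReusedBufferDemo
{
    private const int N = 10_000_000;

    [ThreadStatic] private static int[] _buffer; // one buffer per thread, reused across iterations

    static void Function2WithArray(int v)
    {
        var buffer = _buffer ??= new int[N]; // allocate once per thread, then reuse
        for (int i = 0; i < N; i++)
            buffer[i] = v;
    }

    public static void Main()
    {
        Parallel.For(0, 100, Function2WithArray); // same shape as the original Parallel.For call
    }
}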
Parallel programming is not that simple. Using Parallel.For() or Parallel.ForEach() doesn't automatically make your program parallel.
Parallel programming is not about calling some higher-level function (in any programming language) to make your code parallel. It is about preparing your code to be parallel.
Actually, you are not parallelizing anything at all in either function1 or function2.
Going back to the foundations, the two basic types of parallelism are:
By task, where you split a complex task into smaller subtasks, each subtask processed at the same time by different cores, CPUs or nodes (in a computer cluster)
By data, where you split a large data set into several smaller slices, each slice processed at the same time by different cores, CPUs or nodes
Data parallelism is much trickier to achieve and does not always provide a real performance gain.
function1 is not really parallel; it's just a heavy piece of computation running concurrently (your CPU cores are just disputing which will finish its 100M-iteration loop first).
Using Parallel.For() you are just spawning this heavy function 100 times among your threads.
A single for loop with Task.Run() inside would have nearly the same result.
If you run this on only one thread/core it will obviously take some time; if you run it on all your cores it will be faster. No big mystery here, although this is concurrent code, not actually parallel. Besides, when invoking these tasks 100 times, if you don't have that many CPU cores (or nodes in a cluster) there's no big difference; parallel/concurrent code will be limited by the actual CPU cores in the machine (as we will see in an example below).
Now about function2 and its interaction with the memory heap. Yes, every modern language with a built-in GC pays a CPU cost for it. One of the most expensive operations in a complex algorithm is garbage collection; sometimes, in non-optimized code, it can represent over 90% of CPU time.
Let's analyze your function2:
It declares a new Dictionary in the function scope
It populates this Dictionary with 10M items
Outside that scope, you call function2 inside a Parallel.For with 100 iterations
So 100 different scopes populate 100 different Dictionaries with 10M items each
There's no interaction between any of these scopes
As said before, this is not parallel programming, this is concurrent programming. You have 100 separate data chunks of 10M entries each that don't interact with each other.
But there is also a second factor. Your function2 performs a write operation (that is, it adds/updates/deletes something in a collection). That's fine if it's just a bunch of random data and you can accept some loss and inconsistency. But if you're handling real data and cannot allow any kind of loss or inconsistency, bad news: there is no true parallelism for writing to the same memory address (object reference). You would need a synchronization context, and that makes things way slower; those synchronized operations will always be concurrent, because while one thread is writing to the memory reference, the other threads must wait for it to finish. Actually, using several threads to write data might make your code slower instead of faster, especially if the parallel operations are not CPU-bound.
To see real gains from data parallelism, you must be performing heavy computations upon the partitioned data.
Let's check some code below, based on your methodology but with a few changes:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var rand = new Random();
var operationSamples = 256;
var datasetSize = 100_000_000;
var computationDelay = 50;
var cpuCores = Environment.ProcessorCount;
Dictionary<int, int> datasetWithLoss = new(datasetSize);
Dictionary<int, int> dataset = new(datasetSize);
double result = 0;
Stopwatch sw = new();
ThreadPool.SetMinThreads(1, 1);

int HeavyComputation(int delay)
{
    int iterations = 0;
    var end = DateTime.Now + TimeSpan.FromMilliseconds(delay);
    while (DateTime.Now < end)
        iterations++;
    return iterations;
}

double SequentialMeanHeavyComputation(int maxMilliseconds, int samples = 64)
{
    double sum = 0;
    for (int i = 0; i < samples; i++)
        sum += HeavyComputation(maxMilliseconds);
    return sum / samples;
}

double ParallelMeanHeavyComputation(int maxSecondsCount, int samples = 64, int threads = 4)
{
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");
    var _lockKey = new object();
    double sum = 0;
    int offset = samples / threads;
    List<Action> tasks = new();
    for (int i = 0; i < samples; i++)
        tasks.Add(new Action(() =>
        {
            var result = HeavyComputation(maxSecondsCount);
            lock (_lockKey)
                sum += result;
        }));
    Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = threads }, tasks.ToArray());
    return sum / samples;
}

void SequentialDatasetPopulation(int size)
{
    for (int i = 0; i < datasetSize; i++)
        dataset.TryAdd(i, Guid.NewGuid().GetHashCode());
}

void ParalellDatasetPopulation(int size, int threads)
{
    var _lock = new object();
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");
    Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
    {
        var value = Guid.NewGuid().GetHashCode();
        lock (_lock)
            dataset.Add(i, value);
    });
}

double SequentialReadOnlyDataset()
{
    foreach (var x in dataset)
    {
        HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
    }
    return 0;
}

double ParallelReadOnlyDataset()
{
    Parallel.ForEach(dataset, x =>
    {
        HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
    });
    return 0;
}

void ParalellDatasetWithLoss(int size, int threads)
{
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");
    Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
    {
        int value = Guid.NewGuid().GetHashCode();
        datasetWithLoss.Add(i, value);
    });
}

sw.Restart();
result = SequentialMeanHeavyComputation(computationDelay, operationSamples);
sw.Stop();
Console.WriteLine($"{nameof(SequentialMeanHeavyComputation)} sequential tasks: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (CPU threads match count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 100);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Higher thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 4);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Lower thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
SequentialDatasetPopulation(datasetSize);
sw.Stop();
Console.WriteLine($"{nameof(SequentialDatasetPopulation)} sequential data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

dataset.Clear();

sw.Restart();
ParalellDatasetPopulation(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetPopulation)} parallel data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
ParalellDatasetWithLoss(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetWithLoss)} parallel data with loss: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

Console.WriteLine($"Lossless dataset count: {dataset.Count}");
Console.WriteLine($"Dataset with loss: {datasetWithLoss.Count}\n");
datasetWithLoss.Clear();

sw.Restart();
SequentialReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(SequentialReadOnlyDataset)} sequential reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
ParallelReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(ParallelReadOnlyDataset)} parallel reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

Console.Read();
Output:
SequentialMeanHeavyComputation sequential tasks: 12 800,7ms
Available Threads: 15
ParallelMeanHeavyComputation parallel tasks (CPU threads match count): 860,3ms
Available Threads: 99
ParallelMeanHeavyComputation parallel tasks (Higher thread count): 805,0ms
Available Threads: 3
ParallelMeanHeavyComputation parallel tasks (Lower thread count): 3 200,4ms
SequentialDatasetPopulation sequential data population: 9 072,4ms
Available Threads: 15
ParalellDatasetPopulation parallel data population: 23 420,0ms
Available Threads: 15
ParalellDatasetWithLoss parallel data with loss: 6 788,3ms
Lossless dataset count: 100000000
Dataset with loss: 77057456
SequentialReadOnlyDataset sequential reading operations: 20 371,0ms
ParallelReadOnlyDataset parallel reading operations: 3 020,6ms
(Red: 25%, Orange: 56%, Green: 75%, Blue: 100%)
With task parallelism we achieved over 20x performance using 100% of the CPU threads (in this example; it's not always like that).
With read-only data parallelism plus some computation we achieved nearly 6.5x faster execution at 56% CPU usage (with less computation the difference would be smaller).
But when trying to implement "real" data parallelism for writing, performance is more than twice as slow and the CPU can't reach its full potential, sitting at only 25% usage because of the synchronization context.
Conclusions:
Using Parallel.For does not guarantee that your code will actually run in parallel, nor that it will be faster. It requires prior code/data preparation, deep analysis, benchmarks and tuning.
Also check this Microsoft documentation about potential pitfalls in data and task parallelism:
https://learn.microsoft.com/pt-br/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism

Convert List<double> to double[n,1]

I need to convert a large List of length n into a double[n,1] array. What is the fastest way to make the conversion?
For further background, this is to be passed into an Excel object's Range.Value property, which requires a two-dimensional array.
I'm writing this on the assumption that you really want the most efficient way to do this. Extreme performance almost always comes with a trade-off, usually code readability.
I can still substantially optimize one part of this as the comments note, but I didn't want to go overboard using dynamic methods on first pass.
const int TEST_SIZE = 100 * 1000;
//Test data setup
var list = new List<double>();
for (int i = 0; i < TEST_SIZE; i++)
list.Add(i);
//Grab the list's underlying array, which is not public
//This can be made MUCH faster with dynamic methods if you want me to optimize
var underlying = (double[])typeof(List<double>)
    .GetField("_items", BindingFlags.NonPublic | BindingFlags.Instance)
    .GetValue(list);
//We need the actual length of the list because there can be extra space in the array
//Do NOT use "underlying.Length"
int underlyingLength = list.Count;
//Benchmark it
var sw = Stopwatch.StartNew();
var twodarray = new double[underlyingLength, 1];
Buffer.BlockCopy(underlying, 0, twodarray, 0, underlyingLength * sizeof(double));
var elapsed = sw.Elapsed;
Console.WriteLine($"Elapsed: {elapsed}");
Output:
Elapsed: 00:00:00.0001998
Hardware used:
AMD Ryzen 7 3800X @ 3.9 GHz
32 GB DDR4 3200 RAM
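On .NET 5 and later there is also a reflection-free way to reach the list's backing store; a sketch under that assumption (CollectionsMarshal.AsSpan and MemoryMarshal.CreateSpan are available):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static double[,] ToColumnArray(List<double> list)
{
    var result = new double[list.Count, 1];
    if (list.Count == 0)
        return result;
    // View the list's backing store without reflection (valid as long as the list isn't resized).
    Span<double> source = CollectionsMarshal.AsSpan(list);
    // A double[n,1] is contiguous in memory, so expose it as a flat span and copy in one go.
    Span<double> destination = MemoryMarshal.CreateSpan(ref result[0, 0], list.Count);
    source.CopyTo(destination);
    return result;
}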
I think this is what you want.
This operation will take no more than a few milliseconds even on a slow core. So why bother? How many times will you do this conversion? If millions of times, then try to find a better approach. But if you do this when the end-user presses a button...
Criticize the answer if you like, but please provide metrics if it's about efficiency.
// Populate a List with 100.000 doubles
Random r = new Random();
List<double> dList = new List<double>();
int i = 0;
while (i++ < 100000) dList.Add(r.NextDouble());
// Convert to double[100000,1]
Stopwatch chrono = Stopwatch.StartNew();
// Conversion:
double[,] ddArray = new double[dList.Count, 1];
int dIndex = 0;
dList.ForEach((x) => ddArray[dIndex++, 0] = x);
Console.WriteLine("Completed in: {0}ms", chrono.Elapsed);
Outputs: (10 repetitions) - Maximum: 2.6 ms
Completed in: 00:00:00.0020677ms
Completed in: 00:00:00.0026287ms
Completed in: 00:00:00.0013854ms
Completed in: 00:00:00.0010382ms
Completed in: 00:00:00.0019168ms
Completed in: 00:00:00.0011480ms
Completed in: 00:00:00.0011172ms
Completed in: 00:00:00.0013586ms
Completed in: 00:00:00.0017165ms
Completed in: 00:00:00.0010508ms
Edit 1.
double[,] ddArray = new double[dList.Count, 1];
int dIndex = 0;
foreach (double x in dList) ddArray[dIndex++, 0] = x;
seems just a little bit faster, but needs more testing:
Completed in: 00:00:00.0020318ms
Completed in: 00:00:00.0019077ms
Completed in: 00:00:00.0023162ms
Completed in: 00:00:00.0015881ms
Completed in: 00:00:00.0013692ms
Completed in: 00:00:00.0022482ms
Completed in: 00:00:00.0015960ms
Completed in: 00:00:00.0012306ms
Completed in: 00:00:00.0015039ms
Completed in: 00:00:00.0016553ms

How CPU caching works when you get access to different value in a 'for' loop?

I made some tests of code performance, and I would like to know how the CPU cache works in this kind of situation:
Here is a classic example for a loop:
private static readonly short[] _values;

static MyClass()
{
    var random = new Random();
    _values = Enumerable.Range(0, 100)
        .Select(x => (short)random.Next(5000))
        .ToArray();
}

public static void Run()
{
    short max = 0;
    for (var index = 0; index < _values.Length; index++)
    {
        max = Math.Max(max, _values[index]);
    }
}
Here is a specific variation that computes the same thing, but is much more performant:
private static readonly short[] _values;

static MyClass()
{
    var random = new Random();
    _values = Enumerable.Range(0, 100)
        .Select(x => (short)random.Next(5000))
        .ToArray();
}

public static void Run()
{
    short max1 = 0;
    short max2 = 0;
    for (var index = 0; index < _values.Length; index += 2)
    {
        max1 = Math.Max(max1, _values[index]);
        max2 = Math.Max(max2, _values[index + 1]);
    }
    short max = Math.Max(max1, max2);
}
So I am interested to know why the second version is more efficient than the first one.
I assumed it was a matter of CPU caching, but I don't really see how that happens (it's not like values are read twice between loops).
EDIT:
.NET Core 4.6.27617.04
2.1.11
Intel Core i7-7850HQ 2.90GHz 64-bit
Calling 50 million times:
MyClass1:
=> 00:00:06.0702028
MyClass2:
=> 00:00:03.8563776 (-36 %)
The last metric is the one with the loop unrolling.
The difference in performance in this case is not related to caching - you have just 100 values - they fit entirely in the L2 cache already at the time you generated them.
The difference is due to out-of-order execution.
A modern CPU has multiple execution units and can perform more than one operation at the same time even in a single-threaded application.
But your loop is problematic for a modern CPU because it has a dependency:
short max = 0;
for (var index = 0; index < _values.Length; index++)
{
    max = Math.Max(max, _values[index]);
}
Here each subsequent iteration is dependent on the value max from the previous one, so the CPU is forced to compute them sequentially.
Your revised loop adds a degree of freedom for the CPU; since max1 and max2 are independent, they can be computed in parallel.
So essentially the revised loop can run equally fast per iteration as the first one:
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index += 2)
{
    max1 = Math.Max(max1, _values[index]);
    max2 = Math.Max(max2, _values[index + 1]);
}
But it has half the iterations, so in the end you get a significant speedup (not 2x because out-of-order execution is not perfect).
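Taking the same idea a step further, a sketch with four independent accumulators (assuming _values.Length is a multiple of 4; remainder handling is omitted):

short max1 = 0, max2 = 0, max3 = 0, max4 = 0;
for (var index = 0; index < _values.Length; index += 4)
{
    // Four independent dependency chains that the CPU can schedule in parallel.
    max1 = Math.Max(max1, _values[index]);
    max2 = Math.Max(max2, _values[index + 1]);
    max3 = Math.Max(max3, _values[index + 2]);
    max4 = Math.Max(max4, _values[index + 3]);
}
short max = Math.Max(Math.Max(max1, max2), Math.Max(max3, max4));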
Caching
CPU caching works by pre-loading the next few cache lines from memory and storing them in the CPU cache. This may be data, pointers, variable values, etc.
Code Blocks
Between your two blocks of code, the difference may not appear in the syntax. Try converting your code to IL (the intermediate language for C#, which is executed by the JIT, the just-in-time compiler); see the refs below for tools and resources.
Or just decompile your built/compiled code and check how the compiler "optimized" it when producing the dll/exe files, using the decompiler linked below.
other performance optimization
Loop Unrolling
CPU Caching
Refs:
C# Decompiler
JIT

Performance impact of array writes is much larger than expected

I've stumbled upon this effect when debugging an application - see the repro code below.
It gives me the following results:
Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs
As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?
UPD: extra test added; Perf test 0 shows the results as expected, Perf test 1 shows the performance anomaly.
using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var data = InitData();
        TestPerf0(data, false);
        TestPerf0(data, true);
        TestPerf1(data, false);
        TestPerf1(data, true);
        if (Debugger.IsAttached)
            Console.ReadKey();
    }

    private static string[] InitData()
    {
        var watch = Stopwatch.StartNew();
        var data = new string[100_000];
        var maxString = 10_000;
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = new string('-', maxString);
        }
        watch.Stop();
        Console.WriteLine($"Data init, count: {data.Length:n0} x {maxString:n0}, {watch.Elapsed.TotalSeconds} secs");
        return data;
    }

    private static void TestPerf1(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();
        var counters = new int[char.MaxValue];
        int tmp = 0;
        for (var j = 0; ; j++)
        {
            var allEmpty = true;
            for (var i = 0; i < vals.Length; i++)
            {
                var val = vals[i];
                if (j < val.Length)
                {
                    allEmpty = false;
                    var ch = val[j];
                    var count = counters[ch];
                    tmp ^= count;
                    if (testStore)
                        counters[ch] = count + 1;
                }
            }
            if (allEmpty)
                break;
        }
        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();
        watch.Stop();
        Console.WriteLine($"Perf test 1 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }

    private static void TestPerf0(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();
        var counters = new int[65536];
        int tmp = 0;
        for (var i = 0; i < 1_000_000_000; i++)
        {
            var j = i % counters.Length;
            var count = counters[j];
            tmp ^= count;
            if (testStore)
                counters[j] = count + 1;
        }
        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();
        watch.Stop();
        Console.WriteLine($"Perf test 0 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }
}
After testing your code for quite some time my best guess is, as already said in the comments, that you experience a lot of cache-misses with your current solution. The line:
if (testStore)
    counters[ch] = count + 1;
might force the CPU to load a new cache line from memory and displace the current content. There might also be some problems with branch prediction in this scenario. This is highly hardware dependent, and I'm not aware of a really good way to test this in a managed language (it's also quite hard in compiled languages where the hardware is fixed and well-known).
After going through the disassembly, you can clearly see that you also introduce a whole bunch of new instructions which might worsen the aforementioned problems further.
Overall I'd advise you to rewrite the complete algorithm, as there are better places to improve performance than this one little assignment. These are the optimizations I'd suggest (they also improve readability):
Invert your i and j loops. This will remove the allEmpty variable completely.
Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as an index.
Think about why this might be a problem at all. You introduce a new instruction and any instruction comes at a cost. If this is really the primary "hot spot" of your code you can start to think about better solutions (remember: "premature optimization is the root of all evil").
As the name suggests, this is a "test setting" - is it important at all? Just remove it.
EDIT: Why did I suggest inverting the loops? With this little rearrangement of the code:
foreach (var val in vals)
{
    foreach (int ch in val)
    {
        var count = counters[ch];
        tmp ^= count;
        if (testStore)
        {
            counters[ch] = count + 1;
        }
    }
}
I went from the original runtimes to much shorter ones (before/after screenshots were attached).
Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear - all optimizations are disabled in the settings). If there are special reasons not to do this you should tell us more about the context in which this code will be used.
EDIT2: The in-depth answer. My best explanation for why this problem occurs is that you thrash your cache lines. In the lines:
for (var i = 0; i < vals.Length; i++)
{
    var val = vals[i];
you load a really massive dataset. This is by far bigger than a cache line itself, so it will most likely need to be loaded fresh from memory into a new cache line on every iteration (displacing the old content). This is also known as "cache thrashing", if I remember correctly. Thanks to @mjwills for pointing this out in his comment.
In my suggested solution, on the other hand, the content of a cache-line can stay alive as long as the inner loop did not exceed its boundaries (which happens a lot less if you use this direction of memory access).
This is the closest explanation of why my code runs that much faster, and it also supports the assumption that you have serious caching problems with your code.
