C# Parallel.ForEach does not give expected speedup

I am trying to find out why parallel foreach does not give the expected speedup on a machine with 32 physical cores and 64 logical cores with a simple test computation.
...
var parameters = new List<string>();
for (int i = 1; i <= 9; i++) {
    parameters.Add(i.ToString());
    if (Scenario.UsesParallelForEach)
    {
        Parallel.ForEach(parameters, parameter => {
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
            var lc = new LongComputation();
            lc.Compute();
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
        });
    }
    else
    {
        foreach (var parameter in parameters)
        {
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
            var lc = new LongComputation();
            lc.Compute();
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
        }
    }
}
...
class LongComputation
{
    public void Compute()
    {
        var s = "";
        for (int i = 0; i <= 40000; i++)
        {
            s = s + i.ToString() + "\n";
        }
    }
}
The Compute function takes about 5 seconds to complete. My assumption was that with the parallel foreach loop, each additional iteration creates a parallel thread running on one of the cores and takes only as long as a single call to Compute. So if I run the loop twice, the sequential foreach would take 10 seconds and the parallel foreach only 5 seconds (assuming 2 cores are available); the speedup would be 2. If I run the loop three times, the sequential foreach would take 15 seconds, but the parallel foreach again only 5 seconds; the speedup would be 3, then 4, 5, 6, 7, 8, and 9. However, what I observe is a constant speedup of 1.3.
[Chart: sequential vs. parallel foreach. X-axis: number of sequential/parallel executions of the computation; Y-axis: time in seconds]
[Chart: speedup, i.e. the time of the sequential foreach divided by the time of the parallel foreach]
The event fired in FireOnParameterComputed is intended to drive a progress bar in the GUI. In the progress bar it can clearly be seen that a new thread is created for each iteration.
My question is, why don't I see the expected speedup or at least close to the expected speedup?

Tasks aren't threads.
Sometimes starting a task will cause a thread to be created, but not always. Creating and managing threads consumes time and system resources. When a task only takes a short amount of time, the single-threaded model is often faster, even though that is counter-intuitive.
The CLR knows this and tries to make its best judgment on how to execute the task based on a number of factors including any hints that you've passed to it.
For Parallel.ForEach, if you're certain that you want multiple threads to be spawned, try passing in ParallelOptions.
Parallel.ForEach(parameters, new ParallelOptions { MaxDegreeOfParallelism = 100 }, parameter => {});
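As a minimal sketch applied to the loop from the question (assuming the same parameters list and LongComputation class; capping at Environment.ProcessorCount is just one reasonable choice, not something from the original post):
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(parameters, options, parameter =>
{
    var lc = new LongComputation();
    lc.Compute();   // same work as in the question, now with an explicit parallelism cap
});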

Related

Is parallel code supposed to run slower than sequential code, after a certain dataset size?

I'm fairly new to C# and programming in general and I was trying out parallel programming.
I have written this example code that computes the sum of an array first, using multiple threads, and then, using one thread (the main thread).
I've timed both cases.
static long Sum(int[] numbers, int start, int end)
{
    long sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += numbers[i];
    }
    return sum;
}

static async Task Main()
{
    // Arrange data.
    const int COUNT = 100_000_000;
    int[] numbers = new int[COUNT];
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }
    // Split task into multiple parts.
    int threadCount = Environment.ProcessorCount;
    int taskCount = threadCount - 1;
    int taskSize = numbers.Length / taskCount;
    var start = DateTime.Now;
    // Run individual parts in separate threads.
    List<Task<long>> tasks = new();
    for (int i = 0; i < taskCount; i++)
    {
        int begin = i * taskSize;
        int end = (i == taskCount - 1) ? numbers.Length : (i + 1) * taskSize;
        tasks.Add(Task.Run(() => Sum(numbers, begin, end)));
    }
    // Wait for all threads to finish, as we need the result.
    var partialSums = await Task.WhenAll(tasks);
    long sumAsync = partialSums.Sum();
    var durationAsync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Async sum: {sumAsync}");
    Console.WriteLine($"Async duration: {durationAsync} milliseconds");
    // Sequential
    start = DateTime.Now;
    long sumSync = Sum(numbers, 0, numbers.Length);
    var durationSync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Sync sum: {sumSync}");
    Console.WriteLine($"Sync duration: {durationSync} milliseconds");
    var factor = durationSync / durationAsync;
    Console.WriteLine($"Factor: {factor:0.00}x");
}
When the array size is 100 million, the parallel sum is computed about 2x faster on average.
But when the array size is 1 billion, it's significantly slower than the sequential sum.
Why is it running slower?
Hardware Information
Environment.ProcessorCount = 4
GC.GetGCMemoryInfo().TotalAvailableMemoryBytes = 8468377600
Timing:
[Timing screenshots for array sizes of 100,000,000 and 1,000,000,000]
New Test:
This time, instead of having separate threads (3 in my case) work on different parts of a single array of 1,000,000,000 integers, I physically divided the dataset into 3 separate arrays of 333,333,333 integers (one third the size each). Although I'm still adding up a billion integers on the same machine, my parallel code now runs faster, as expected.
private static void InitArray(int[] numbers)
{
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = (int)random.Next(100);
    }
}

public static async Task Main()
{
    Stopwatch stopwatch = new();
    const int SIZE = 333_333_333; // one third of a billion
    List<int[]> listOfArrays = new();
    for (int i = 0; i < Environment.ProcessorCount - 1; i++)
    {
        int[] numbers = new int[SIZE];
        InitArray(numbers);
        listOfArrays.Add(numbers);
    }
    // Sequential.
    stopwatch.Start();
    long syncSum = 0;
    foreach (var array in listOfArrays)
    {
        syncSum += Sum(array);
    }
    stopwatch.Stop();
    var sequentialDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Sequential sum: {syncSum}");
    Console.WriteLine($"Sequential duration: {sequentialDuration} ms");
    // Parallel.
    stopwatch.Restart();
    List<Task<long>> tasks = new();
    foreach (var array in listOfArrays)
    {
        tasks.Add(Task.Run(() => Sum(array)));
    }
    var partialSums = await Task.WhenAll(tasks);
    long parallelSum = partialSums.Sum();
    stopwatch.Stop();
    var parallelDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Parallel sum: {parallelSum}");
    Console.WriteLine($"Parallel duration: {parallelDuration} ms");
    Console.WriteLine($"Factor: {sequentialDuration / parallelDuration:0.00}x");
}
[Timing screenshot]
I don't know if it helps figure out what went wrong in the first approach.
The asynchronous pattern is not the same as running code in parallel. The main reason for asynchronous code is better resource utilization while the computer is waiting for some kind of IO device. Your code would be better described as parallel computing or concurrent computing.
While your example should work fine, it may not be the easiest nor the optimal way to do it. The easiest option would probably be to use Parallel LINQ: numbers.AsParallel().Sum();. There is also a Parallel.For method that should be better suited, including an overload that maintains a thread-local state (a sketch follows below). Note that while Parallel.For will attempt to optimize its partitioning, you probably want to process chunks of data in each iteration to reduce overhead. I would try around 1-10k values or so.
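A minimal sketch of that thread-local-state overload with explicit chunking (chunk size and variable names here are illustrative, not from the original post; it assumes numbers is the int[] from the question):
const int chunkSize = 4096;                       // illustrative chunk size (somewhere in the 1-10k range)
long total = 0;
int chunkCount = (numbers.Length + chunkSize - 1) / chunkSize;
Parallel.For(0, chunkCount,
    () => 0L,                                     // thread-local initial partial sum
    (chunk, state, localSum) =>
    {
        int start = chunk * chunkSize;
        int end = Math.Min(start + chunkSize, numbers.Length);
        for (int i = start; i < end; i++)
            localSum += numbers[i];
        return localSum;                          // carried over to the next chunk on the same thread
    },
    localSum => Interlocked.Add(ref total, localSum)); // combine each thread's partial sum once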
We can only guess the reason your parallel method is slower. Summing numbers is a really fast operation, so the computation may be limited by memory bandwidth or cache usage. And while you want your work partitions to be fairly large, using too large partitions may result in less overall parallelism if a thread gets suspended for any reason. You may also want partitions of certain sizes to work well with the caching system; see cache associativity. It is also possible you are measuring things you did not intend to measure, like JIT compilation time or GC pauses. See BenchmarkDotNet, which takes care of many of the edge cases when measuring performance.
Also, never use DateTime for measuring performance; Stopwatch is both much easier to use and much more accurate.
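For what it's worth, a minimal BenchmarkDotNet setup for this comparison could look like the sketch below (class and method names are illustrative, and it assumes the BenchmarkDotNet NuGet package is installed):
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]                 // also reports allocations and GC collections
public class SumBenchmarks
{
    private int[] numbers;

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        numbers = new int[100_000_000];
        for (int i = 0; i < numbers.Length; i++)
            numbers[i] = random.Next(100);
    }

    [Benchmark(Baseline = true)]
    public long Sequential()
    {
        long sum = 0;
        for (int i = 0; i < numbers.Length; i++)
            sum += numbers[i];
        return sum;
    }

    [Benchmark]
    public long ParallelLinq() => numbers.AsParallel().Sum(x => (long)x);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<SumBenchmarks>();
}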
My machine has 4GB RAM, so initializing an int[1_000_000_000] results in memory paging. Going from int[100_000_000] to int[1_000_000_000] results in non-linear performance degradation (100x instead of 10x). Essentially a CPU-bound operation becomes I/O-bound. Instead of adding numbers, the program spends most of its time reading segments of the array from the disk. In these conditions using multiple threads can be detrimental for the overall performance, because the pattern of accessing the storage device becomes more erratic and less streamlined.
Maybe something similar happens on your 8GB RAM machine too, but I can't say for sure.

Why isn't Parallel.For fast with heap-intensive operations?

For some operations Parallel scales well with the number of CPUs, but for other operations it does not.
Consider the code below: function1 gets a 10x improvement while function2 only gets a 3x improvement. Is this due to memory allocation, or perhaps the GC?
void function1(int v) {
    for (int i = 0; i < 100000000; i++) {
        var q = Math.Sqrt(v);
    }
}

void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i++) {
        dict.Add(i, v);
    }
}

var sw = new System.Diagnostics.Stopwatch();
var iterations = 100;
sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
The output on my machine:
function1 no parallel: 2 059,4 ms
function1 with parallel: 213,7 ms
function2 no parallel: 14 192,8 ms
function2 parallel: 4 491,1 ms
Environment:
Win 11, .Net 6.0, Release build
i9 12th gen, 16 cores, 24 proc, 32 GB DDR5
After testing more it seems the memory allocation does not scale that well with multiple threads. For example, if I change function 2 to:
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}
The result is:
function2 no parallell: 124,0 ms
function2 parallell: 402,4 ms
Is the conclusion that memory allocation does not scale well with multiple threads?...
tl;dr: Heap allocation contention.
Your first function is embarrassingly parallel. Each thread can do its computation with embarrassingly little interaction with other threads. So it scales up nicely to multiple threads. huseyin tugrul buyukisik correctly pointed out that your first computation makes use of the non-shared, per thread, processor registers.
Your second function, when it preallocates the dictionary, is somewhat less embarrassingly parallel. Each thread's computation is independent of the others' except for the fact that they each use your machine's RAM subsystem. So you see some thread-to-thread contention at the hardware level as thread-level cached data is written to and read from the machine-level RAM.
Your second function that does not preallocate memory is not embarrassingly parallel. Why not? Each .Add() operation must allocate some data in the shared heap. That can't be done in parallel, because all threads share the same heap. Rather they must be synchronized. The dotnet libraries do a good job of parallelizing heap operations as much as possible, but they do not avoid at least some blocking of thread B when thread A allocates heap data. So the threads slow each other down.
Separate processes rather than separate threads are a good way to scale up workloads like your non-preallocating second function. Each process has its own heap.
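A rough sketch of that separate-process idea (the worker-argument convention here is hypothetical, just to show the shape of it):
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Sketch: fan the work out to worker processes, so each worker gets its own heap (and its own GC).
// Assumes this same executable, when launched with a "worker" argument, runs the function2-style
// workload and exits; that convention is hypothetical, not from the original code.
int workerCount = Environment.ProcessorCount;
var workers = new List<Process>();
for (int i = 0; i < workerCount; i++)
{
    var psi = new ProcessStartInfo
    {
        FileName = Environment.ProcessPath,   // .NET 6+: path of the current executable
        Arguments = $"worker {i}",
        UseShellExecute = false
    };
    workers.Add(Process.Start(psi));
}
foreach (var p in workers)
    p.WaitForExit();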
The first function works in registers: more cores = more registers.
The second function works on memory: more cores only means more L1 cache, while the RAM is shared. A 10-million-element dataset certainly comes from RAM, as even the L3 cache is not big enough. This assumes the JIT optimizes the allocations into reused buffers; if not, there is allocation overhead too. So you should re-use the dictionary on each new iteration instead of recreating it.
Also, you are storing data under an incremental integer index, so a simple array would work here, again re-used between iterations; it has a smaller memory footprint than a dictionary. A sketch of this reuse follows below.
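For illustration, a sketch using the Parallel.For overload with thread-local state so each worker thread keeps and re-uses one buffer (the array variant is an assumption on my part, not the original code; iterations is the 100-iteration count from the question):
const int itemCount = 10_000_000;    // matches the loop count in function2
Parallel.For(0, iterations,
    () => new int[itemCount],        // allocate one buffer per worker thread, once
    (v, state, buffer) =>
    {
        for (int i = 0; i < itemCount; i++)
            buffer[i] = v;           // same writes as function2, but no per-iteration allocation
        return buffer;               // hand the same buffer to the next iteration on this thread
    },
    buffer => { });                  // nothing to combine at the end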
Parallel programming is not that simple. Using Parallel.For() or Parallel.ForEach() doesn't automatically make your program parallel.
Parallel programming is not about calling some higher-level function (in any programming language) to make your code parallel. It is about preparing your code to be parallel.
Actually, you are not parallelizing anything at all in either function1 or function2.
Going back to the foundations, the two basic types of parallelism are:
By task, where you split a complex task into smaller subtasks, each subtask to be processed at the same time on different cores, CPUs, or nodes (in a computer cluster)
By data, where you split a large data set into several smaller slices, each slice to be processed at the same time on different cores, CPUs, or nodes
Data parallelism is much trickier to achieve and does not always provide a real performance gain (a minimal sketch contrasting the two follows right after this list).
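To make the distinction concrete, here is a minimal sketch (LoadCustomers, LoadOrders, and ProcessOrder are placeholder methods, not from the original code):
// Task parallelism: different subtasks of one larger job run at the same time.
Parallel.Invoke(
    () => LoadCustomers(),
    () => LoadOrders());

// Data parallelism: the same operation runs over many slices of one large data set.
Parallel.ForEach(orders, order => ProcessOrder(order));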
function1 is not really parallel; it's just a heavy piece of computation running concurrently. (Your CPU cores are just competing over who finishes the 100M-iteration for loop first.)
Using Parallel.For() you are just spawning this heavy function 100 times across your threads.
A single for loop with Task.Run() inside would have nearly the same result.
If you run this on only one thread/core it will obviously take some time; if you run it on all your cores it will be faster. No big mystery here, although this is concurrent code, not actually parallel. Besides, when invoking these tasks 100 times, if you don't have that many CPU cores (or nodes in a cluster) there's no big difference: parallel/concurrent code is limited by the actual CPU cores in the machine (as we'll see in the example below).
Now about function2 and its interaction with the memory heap. Every modern language with a built-in GC pays a CPU cost for it. Garbage collection is one of the most expensive operations in a complex algorithm; in non-optimized code it can sometimes represent over 90% of CPU time.
Let's analyze your function2:
It declares a new Dictionary in the function scope
It populates this Dictionary with 10M items
Outside that scope, you call function2 inside a Parallel.For with 100 iterations
So 100 different scopes populate 100 different Dictionaries with 10M entries each
There's no interaction between any of these scopes
As said before, this is not parallel programming, this is concurrent programming. You have 100 separate data chunks of 10M entries, one per scope, that don't interact with each other.
But there's a second factor too. Your function2 performs write operations (meaning you are adding, updating, or deleting something in a collection). If it's just a bunch of random data and you can accept some loss and inconsistency, fine. But if you're handling real data and cannot allow any kind of loss or inconsistency, bad news: there is no true parallelism when writing to the same memory address (object reference). You will need a synchronization context, and that will make things much slower; those synchronized operations will always be concurrent, because while one thread is writing to a memory reference, the other threads must wait. Actually, using several threads to write data might make your code slower instead of faster, especially if the parallel operations are not CPU-bound.
To get real gains from data parallelism, you need heavy computation on top of the partitioned data.
Let's check some code below, based on your methodology but with some changes:
var rand = new Random();
var operationSamples = 256;
var datasetSize = 100_000_000;
var computationDelay = 50;
var cpuCores = Environment.ProcessorCount;
Dictionary<int, int> datasetWithLoss = new(datasetSize);
Dictionary<int, int> dataset = new(datasetSize);
double result = 0;
Stopwatch sw = new();
ThreadPool.SetMinThreads(1, 1);
int HeavyComputation(int delay)
{
int iterations = 0;
var end = DateTime.Now + TimeSpan.FromMilliseconds(delay);
while (DateTime.Now < end)
iterations++;
return iterations;
}
double SequentialMeanHeavyComputation(int maxMilliseconds, int samples = 64)
{
double sum = 0;
for (int i = 0; i < samples; i++)
sum += HeavyComputation(maxMilliseconds);
return sum / samples;
}
double ParallelMeanHeavyComputation(int maxSecondsCount, int samples = 64, int threads = 4)
{
ThreadPool.SetMaxThreads(threads, threads);
ThreadPool.GetAvailableThreads(out int workerThreads, out _);
Console.WriteLine($"Available Threads: {workerThreads}");
var _lockKey = new object();
double sum = 0;
int offset = samples / threads;
List<Action> tasks = new();
for (int i = 0; i < samples; i++)
tasks.Add(new Action(() =>
{
var result = HeavyComputation(maxSecondsCount);
lock (_lockKey)
sum += result;
}));
Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = threads }, tasks.ToArray());
return sum / samples;
}
void SequentialDatasetPopulation(int size)
{
for (int i = 0; i < datasetSize; i++)
dataset.TryAdd(i, Guid.NewGuid().GetHashCode());
}
void ParalellDatasetPopulation(int size, int threads)
{
var _lock = new object();
ThreadPool.SetMaxThreads(threads, threads);
ThreadPool.GetAvailableThreads(out int workerThreads, out _);
Console.WriteLine($"Available Threads: {workerThreads}");
Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
{
var value = Guid.NewGuid().GetHashCode();
lock (_lock)
dataset.Add(i, value);
});
}
double SequentialReadOnlyDataset()
{
foreach (var x in dataset)
{
HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
}
return 0;
}
double ParallelReadOnlyDataset()
{
Parallel.ForEach(dataset, x =>
{
HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
});
return 0;
}
void ParalellDatasetWithLoss(int size, int threads)
{
ThreadPool.SetMaxThreads(threads, threads);
ThreadPool.GetAvailableThreads(out int workerThreads, out _);
Console.WriteLine($"Available Threads: {workerThreads}");
Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
{
int value = Guid.NewGuid().GetHashCode();
datasetWithLoss.Add(i, value);
});
}
sw.Restart();
result = SequentialMeanHeavyComputation(computationDelay, operationSamples);
sw.Stop();
Console.WriteLine($"{nameof(SequentialMeanHeavyComputation)} sequential tasks: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (CPU threads match count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 100);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Higher thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 4);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Lower thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
SequentialDatasetPopulation(datasetSize);
sw.Stop();
Console.WriteLine($"{nameof(SequentialDatasetPopulation)} sequential data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
dataset.Clear();
sw.Restart();
ParalellDatasetPopulation(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetPopulation)} parallel data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
ParalellDatasetWithLoss(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetWithLoss)} parallel data with loss: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.WriteLine($"Lossless dataset count: {dataset.Count}");
Console.WriteLine($"Dataset with loss: {datasetWithLoss.Count}\n");
datasetWithLoss.Clear();
sw.Restart();
SequentialReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(SequentialReadOnlyDataset)} sequential reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
ParallelReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(ParallelReadOnlyDataset)} parallel reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.Read();
Output:
SequentialMeanHeavyComputation sequential tasks: 12 800,7ms
Available Threads: 15
ParallelMeanHeavyComputation parallel tasks (CPU threads match count): 860,3ms
Available Threads: 99
ParallelMeanHeavyComputation parallel tasks (Higher thread count): 805,0ms
Available Threads: 3
ParallelMeanHeavyComputation parallel tasks (Lower thread count): 3 200,4ms
SequentialDatasetPopulation sequential data population: 9 072,4ms
Available Threads: 15
ParalellDatasetPopulation parallel data population: 23 420,0ms
Available Threads: 15
ParalellDatasetWithLoss parallel data with loss: 6 788,3ms
Lossless dataset count: 100000000
Dataset with loss: 77057456
SequentialReadOnlyDataset sequential reading operations: 20 371,0ms
ParallelReadOnlyDataset parallel reading operations: 3 020,6ms
[CPU usage chart; legend: Red: 25%, Orange: 56%, Green: 75%, Blue: 100%]
With task parallelism we achieved over 20x performance using 100% of the CPU threads (in this example; it's not always like that).
With read-only data parallelism plus some computation we achieve nearly 6.5x speedup at 56% CPU usage (with less computation the difference would be smaller).
But trying to implement "real parallelism" of data for writes, our performance is more than twice as slow, and the CPU can't use its full potential, staying at only 25% usage, due to the synchronization context.
Conclusions:
Using Parallel.For does not guarantee that your code will really run in parallel, nor that it will be faster. It requires prior code/data preparation, deep analysis, benchmarks, and tuning.
Also check this Microsoft documentation about the villains in parallel code:
https://learn.microsoft.com/pt-br/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism

Why is a parallel-processing much slower for a first call in C#?

I am trying to process numbers as fast as possible with a C# app. I use Thread.Sleep() to simulate the processing and random numbers as the data. I use 3 different techniques.
This is test code that I used:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Test
{
internal class Program
{
private static void Main()
{
var data = new int[500000];
var random = new Random();
for (int i = 0; i < 500000; i++)
{
data[i] = random.Next();
}
var partialTimes = new Dictionary<int, double>();
var iterations = 5;
for (int i = 1; i < iterations + 1; i++)
{
Console.Write($"ProcessData3 {i}\t");
StartProcessing(data, partialTimes, ProcessData3);
GC.Collect();
}
Console.WriteLine();
Console.WriteLine("Press Enter to Exit");
Console.ReadLine();
}
private static void StartProcessing(int[] data, Dictionary<int, double> partialTimes, Action<int[], Dictionary<int, double>> processData)
{
var stopwatch = Stopwatch.StartNew();
try
{
processData?.Invoke(data, partialTimes);
stopwatch.Stop();
Console.WriteLine($"{stopwatch.Elapsed.ToString(#"mm\:ss\:fffffff")} total = {partialTimes.Sum(s => s.Value)} max = {partialTimes.Values.Max()}");
}
finally
{
partialTimes.Clear();
}
}
private static void ProcessData1(int[] data, Dictionary<int, double> partialTimes)
{
Parallel.ForEach(data, number =>
{
var partialStopwatch = Stopwatch.StartNew();
Thread.Sleep(1);
partialStopwatch.Stop();
lock (partialTimes)
{
partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
}
});
}
private static void ProcessData3(int[] data, Dictionary<int, double> partialTimes)
{
// Partition the entire source array.
var rangePartitioner = Partitioner.Create(0, data.Length);
// Loop over the partitions in parallel.
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
// Loop over each range element without a delegate invocation.
for (int i = range.Item1; i < range.Item2; i++)
{
var number = data[i];
var partialStopwatch = Stopwatch.StartNew();
Thread.Sleep(1);
partialStopwatch.Stop();
lock (partialTimes)
{
partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
}
}
});
}
private static void ProcessData2(int[] data, Dictionary<int, double> partialTimes)
{
var tasks = new Task[data.Count()];
for (int i = 0; i < data.Count(); i++)
{
var number = data[i];
tasks[i] = Task.Factory.StartNew(() =>
{
var partialStopwatch = Stopwatch.StartNew();
Thread.Sleep(1);
partialStopwatch.Stop();
lock (partialTimes)
{
partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
}
});
}
Task.WaitAll(tasks);
}
}
}
For each technique I restart the program, and I get these results
with Thread.Sleep(1):
ProcessData1 1 00:56:1796688 total = 801335,282599955 max = 16,8783
ProcessData1 2 00:23:5390014 total = 816167,642100022 max = 14,5913
ProcessData1 3 00:14:7090566 total = 827589,675899998 max = 13,2617
ProcessData1 4 00:10:8929177 total = 829296,528300007 max = 15,0175
ProcessData1 5 00:10:6333310 total = 839282,123200008 max = 29,2738
ProcessData2 1 00:37:8084153 total = 824507,174200022 max = 112,071
ProcessData2 2 00:16:3762096 total = 849272,47810001 max = 77,1514
ProcessData2 3 00:12:9177717 total = 854012,353100029 max = 67,5684
ProcessData2 4 00:10:4798701 total = 857396,642899983 max = 92,9408
ProcessData2 5 00:09:2206146 total = 870966,655499989 max = 51,8945
ProcessData3 1 01:13:6814541 total = 803581,718699918 max = 25,6815
ProcessData3 2 01:07:9809277 total = 814069,532899922 max = 26,0671
ProcessData3 3 01:07:9857984 total = 814148,329399928 max = 21,3116
ProcessData3 4 01:07:4812183 total = 808042,695499966 max = 16,8601
ProcessData3 5 01:07:2954614 total = 805895,325499903 max = 23,8517
where
total is the total time spent inside each Parallel.ForEach() body, summed together, and
max is the maximum time of a single iteration.
Why is the first loop so slow? How is it possible that other attempts are processed so quickly? How to achieve a faster parallel processing on the first attempt?
EDIT:
So I also tried it with Thread.Sleep(10).
Results are:
ProcessData1 1 02:50:2845698 total = 5109831,95429994 max = 12,0612
ProcessData1 2 00:56:3361645 total = 5125884,05919954 max = 12,7666
ProcessData1 3 00:53:4911541 total = 5131105,15209993 max = 12,7486
ProcessData1 4 00:49:5665628 total = 5144654,75829992 max = 13,2678
ProcessData1 5 00:46:0218194 total = 5152955,19509996 max = 13,702
ProcessData2 1 01:21:7207557 total = 5121889,31579983 max = 73,8152
ProcessData2 2 00:39:6660074 total = 5175557,68889969 max = 59,369
ProcessData2 3 00:31:9036416 total = 5193819,89889973 max = 56,2895
ProcessData2 4 00:27:4616803 total = 5207168,56969977 max = 65,5495
ProcessData2 5 00:24:4270755 total = 5222567,9044998 max = 65,368
ProcessData3 1 02:44:9985645 total = 5110117,19019997 max = 11,7172
ProcessData3 2 02:25:6533128 total = 5237779,27010012 max = 26,3171
ProcessData3 3 02:22:2771259 total = 5116123,45259975 max = 12,0581
ProcessData3 4 02:22:1678911 total = 5112574,93779995 max = 11,5334
ProcessData3 5 02:21:9418178 total = 5104980,07120004 max = 11,5583
So the first loop still takes many more seconds than the others.
The behavior you're seeing is entirely explained by the fact that the ThreadPool class delays creating new threads until some small amount of time has passed (on the order of 1 second…it's changed over the years).
It can be informative to add instrumentation to one's program. In your example, a very useful tool is to count the number of concurrent threads as managed by the thread pool, determine the "high water mark" (i.e. the maximum number of threads it eventually settles on), and then use that number to override the thread pool's behavior.
When I did that, I discovered that on the first run of the first method, you get up to about 25 threads. But since the default for the thread pool is to only create a number of threads equal to the number of cores on your computer (eight, in my case), creating the additional threads can take a fair amount of time. And of course, during that time, you get significantly less throughput than you would otherwise (so you incur a larger delay than just the 20 seconds or so getting up to that number of threads causes).
On the subsequent runs of that test, the max number of threads gradually rises (since each new run starts with more threads already in the thread pool, left over from the previous run) and gets as high as around 53.
If you know in advance how many threads the thread pool is going to require in order to perform your work efficiently, you can use the SetMinThreads() method to increase the number of threads it will create immediately on demand before switching to the throttled thread-creation algorithm. For example, having that 53 thread high water mark in hand, you can set the number of minimum threads to that number (or a nice round one, like 50).
When I do that, all five runs of your first test, which previously took between 25 seconds and 1 minute (with the longer runs being earlier, of course), take around 19 seconds to complete.
I'd like to emphasize that you should use SetMinThreads() very carefully. The thread pool is, in general, very good about managing work-loads. The scenario you present above is obviously just for the sake of example and not realistic, but it does have the problem that you're not really doing that much work in each Parallel.ForEach() iteration in the first place. It doesn't seem like a good fit for concurrency, since so much of the time spent will be on overhead. Using SetMinThreads() in any similar scenario just papers over a more insidious underlying issue.
You'll find that if you tailor your workloads to better match available resources, and to minimize transitions between tasks and threads, you can get good throughput without overriding the default thread pool numbers.
Some other notes on this particular test…
Note that if you change the program to run all three tests in the same session (five runs each), the "first run is longer" happens only for the first test. For future reference, you should always approach this sort of "first time is slower" question with an eye to testing different combinations and ordering, to verify whether it's a particular implementation that suffers from the effect, or if you see the effect for the first test, regardless of which implementation is run first. There are a number of implementation and platform details, including JIT, thread pool, disk cache that can affect the initial run of any algorithm, and you'll want to make sure that you quickly narrow down your search to knowing whether you're dealing with one of those or some genuine issue in your own algorithm.
By the way, not that it really matters for your question, but I find your choice to use the random number from the data array as the key for your timings dictionary odd. This IMHO renders those timing values useless, due to collisions in the random numbers. You won't record every iteration (when there's a collision, only the last instance of that number gets stored), which means that the "total" time displayed is less than the true total time spent, and even the max values won't necessarily be correct (if the true max value gets overwritten by a later value using the same key, you'll miss it).
Here's my modified version of your first test, which shows both the diagnostic code I added, and (commented out) the statements to set the thread pool counts to produce faster, more consistent behavior:
private static int _threadCount1;
private static int _maxThreadCount1;
private static void ProcessData1(int[] data, Dictionary<int, double> partialTimes)
{
const int minOverride = 50;
int minMain, minIOCP, maxMain, maxIOCP;
ThreadPool.GetMinThreads(out minMain, out minIOCP);
ThreadPool.GetMaxThreads(out maxMain, out maxIOCP);
WriteLine($"cores: {Environment.ProcessorCount}");
WriteLine($"threads: {minMain} min, {maxMain} max");
// Uncomment two lines below to see uniform behavior across test runs:
//ThreadPool.SetMinThreads(minOverride, minIOCP);
//ThreadPool.SetMaxThreads(minOverride, maxIOCP);
_threadCount1 = _maxThreadCount1 = 0;
Parallel.ForEach(data, number =>
{
int threadCount = Interlocked.Increment(ref _threadCount1);
var partialStopwatch = Stopwatch.StartNew();
Thread.Sleep(1);
partialStopwatch.Stop();
lock (partialTimes)
{
partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
if (_maxThreadCount1 < threadCount)
{
_maxThreadCount1 = threadCount;
}
}
Interlocked.Decrement(ref _threadCount1);
});
ThreadPool.SetMinThreads(minMain, minIOCP);
ThreadPool.SetMaxThreads(maxMain, maxIOCP);
WriteLine($"max thread count: {_maxThreadCount1}");
}

Pause / Resume For loop?

I have a for loop running through a list of 500,000ish items. For each of these it queues up a SmartThreadPool job.
lines.Length below is 500,000ish.
My problem is that I get memory issues when queueing them all at once, so I thought I'd write some logic to prevent this:
int activeThreads = _smartThreadPool2.ActiveThreads;
if (activeThreads < maxThreads)
{
    int iia = 0;
    for (int i = 0; i < lines.Length; i++)
    {
        if (doNotUseAdditive.Checked == true)
        {
            foreach (string engine in _checkedEngines) // Grab selected engines
            {
                query = lines[i];
                _smartThreadPool2.QueueWorkItem(
                    new Amib.Threading.Func<string, string, int, int, int>(scrape),
                    query, engine, iia, useProxies);
                iia++;
            }
        }
    }
}
else
{
    // Wait
    wait.WaitOne();
}
The problem is that I cannot run that if statement inside my for loop, because when I come back to it, it will not remember where it was inside the loop.
I'm using a:
ManualResetEvent wait = new ManualResetEvent(false); // global variable
to "pause/resume".
I need to somehow pause the loop after X threads are in use and then, when threads become available, return and continue the loop.
Any ideas?
I don't think that processing every item in the list in a separate thread is a good idea. Even using a custom thread pool can be really error-prone (and your example proves my point).
First of all, you should determine the number of worker threads correctly. It seems that you're dealing with computation-intensive (so-called CPU-bound) operations, so you should use a number of worker threads equal to the number of logical processors.
Then you can use Parallel LINQ to split your whole working set into an appropriate number of chunks and process those chunks in parallel.
Joe Albahari has a great series of posts about this topic: Threading in C#. Part 5. Parallel Programming.
Here is pseudocode for using PLINQ (note that PLINQ queries are lazy, so the result has to be enumerated, e.g. with ToList(), for the work to actually run):
lines
    .AsParallel()
    .WithDegreeOfParallelism(YourNumberOfProcessors)
    .Select(e => ProcessYourData(e))
    .ToList();
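Applied to your scenario, it might look roughly like the sketch below (simplified: the additive check and the per-item index are left out, and ForAll is used so the query actually executes and runs the side effects in parallel):
lines
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .ForAll(query =>
    {
        foreach (string engine in _checkedEngines)   // same engines collection as in your code
        {
            scrape(query, engine, 0, useProxies);    // index argument simplified to 0 here
        }
    });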

Performance profiling in .NET

I wrote a class which uses Stopwatch to profile methods and for/foreach loops. With for and foreach loops it tests a standard loop against a Parallel.For or Parallel.ForEach implementation.
You would write performance tests like so:
Method:
PerformanceResult result = Profiler.Execute(() => { FooBar(); });
For loop:
SerialParallelPerformanceResult result = Profiler.For(0, 100, x => { FooBar(x); });
ForEach loop:
SerialParallelPerformanceResult result = Profiler.ForEach(list, item => { FooBar(item); });
Whenever I run the tests (one of .Execute, .For or .ForEach) I put them in a loop so I can see how the performance changes over time.
Example of performance might be:
Method execution 1 = 200ms
Method execution 2 = 12ms
Method execution 3 = 0ms
For execution 1 = 300ms (Serial), 100ms (Parallel)
For execution 2 = 20ms (Serial), 75ms (Parallel)
For execution 3 = 2ms (Serial), 50ms (Parallel)
ForEach execution 1 = 350ms (Serial), 300ms (Parallel)
ForEach execution 2 = 24ms (Serial), 89ms (Parallel)
ForEach execution 3 = 1ms (Serial), 21ms (Parallel)
My questions are:
Why does performance change over time? What is .NET doing in the background to facilitate this?
How/why is a serial operation faster than a parallel one? I have made sure that I make the operations complex enough to see the difference properly... in most cases serial operations seem faster!?
NOTE: For parallel processing I am testing on an 8 core machine.
After some more exploration into performance profiling, I have discovered that using a Stopwatch is not an accurate way to measure the performance of a particular task.
(Thanks hatchet and Loren for your comments on this!)
Reasons a stopwatch is not accurate:
Measurements are calculated in elapsed time in milliseconds, not CPU time.
Measurements can be influenced by background "noise" and thread intensive processes.
Measurements do not take into account JIT compilation and overhead.
That being said, using a stopwatch is OK for casual exploration of performance. With that in mind, I have improved my profiling algorithm somewhat.
Where before it simply executed the expression that was passed to it, it now has the facility to iterate over the expression several times, building an average execution time. The first run can be omitted since this is where JIT kicks in, and some major overhead may occur. Understandably, this will never be as sophisticated as using a professional profiling tool like Redgate's ANTS profiler, but it's OK for simpler tasks!
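Roughly, the improved approach looks like the sketch below (a simplified stand-in for the actual Profiler class, not its real code):
// Sketch: run a warm-up pass, then average several timed runs.
static double ProfileAverage(Action method, int runs = 10)
{
    method();                                  // warm-up run, not measured (JIT, caches)
    var stopwatch = new System.Diagnostics.Stopwatch();
    double totalMs = 0;
    for (int i = 0; i < runs; i++)
    {
        stopwatch.Restart();
        method();
        stopwatch.Stop();
        totalMs += stopwatch.Elapsed.TotalMilliseconds;
    }
    return totalMs / runs;                     // mean execution time in milliseconds
}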
As per my comment above: I did some simple tests on my own and found no difference over time. Can you share your code? I'll put mine in an answer as it doesn't fit here.
This is my sample code.
(I also tried with both static and instance methods with no difference)
class Program
{
    static void Main(string[] args)
    {
        int to = 50000000;
        OtherStuff os = new OtherStuff();
        Console.WriteLine(Profile(() => os.CountTo(to)));
        Console.WriteLine(Profile(() => os.CountTo(to)));
        Console.WriteLine(Profile(() => os.CountTo(to)));
    }

    static long Profile(Action method)
    {
        Stopwatch st = Stopwatch.StartNew();
        method();
        st.Stop();
        return st.ElapsedMilliseconds;
    }
}

class OtherStuff
{
    public void CountTo(int to)
    {
        for (int i = 0; i < to; i++)
        {
            // some work...
            i++;
            i--;
        }
    }
}
A sample output would be:
331
331
334
Consider executing this method instead:
class OtherStuff
{
    public string CountTo(Guid id)
    {
        using (SHA256 sha = SHA256.Create())
        {
            int x = default(int);
            for (int index = 0; index < 16; index++)
            {
                x = id.ToByteArray()[index] >> 32 << 16;
            }
            RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
            byte[] y = new byte[1024];
            rng.GetBytes(y);
            y = y.Concat(BitConverter.GetBytes(x)).ToArray();
            return BitConverter.ToString(sha.ComputeHash(BitConverter.GetBytes(x).Where(o => o >> 2 < 0).ToArray()));
        }
    }
}
Sample output:
11
0
0
