I wrote a class which uses Stopwatch to profile methods and for/foreach loops. With for and foreach loops it tests a standard loop against a Parallel.For or Parallel.ForEach implementation.
You would write performance tests like so:
PerformanceResult result = Profiler.Execute(() => { FooBar(); });
For loop:
SerialParallelPerformanceResult result = Profiler.For(0, 100, x => { FooBar(x); });
ForEach loop:
SerialParallelPerformanceResult result = Profiler.ForEach(list, item => { FooBar(item); });
Whenever I run the tests (one of .Execute, .For or .ForEach) I put them in a loop so I can see how the performance changes over time.
Example of performance might be:
Method execution 1 = 200ms
Method execution 2 = 12ms
Method execution 3 = 0ms
For execution 1 = 300ms (Serial), 100ms (Parallel)
For execution 2 = 20ms (Serial), 75ms (Parallel)
For execution 3 = 2ms (Serial), 50ms (Parallel)
ForEach execution 1 = 350ms (Serial), 300ms (Parallel)
ForEach execution 2 = 24ms (Serial), 89ms (Parallel)
ForEach execution 3 = 1ms (Serial), 21ms (Parallel)
My questions are:
Why does performance change over time? What is .NET doing in the background to cause this?
How/why is a serial operation faster than a parallel one? I have made sure the operations are complex enough to see the difference properly, yet in most cases the serial operations seem faster.
NOTE: For parallel processing I am testing on an 8 core machine.
After some more exploration into performance profiling, I have discovered that using a Stopwatch is not an accurate way to measure the performance of a particular task.
(Thanks hatchet and Loren for your comments on this!)
Reasons a Stopwatch is not accurate:
Measurements are calculated in elapsed time in milliseconds, not CPU time.
Measurements can be influenced by background "noise" and thread intensive processes.
Measurements do not take into account JIT compilation and overhead.
That being said, using a stopwatch is OK for casual exploration of performance. With that in mind, I have improved my profiling algorithm somewhat.
Where before it simply executed the expression that was passed to it, it now has the facility to iterate over the expression several times, building an average execution time. The first run can be omitted since this is where JIT kicks in, and some major overhead may occur. Understandably, this will never be as sophisticated as using a professional profiling tool like Redgate's ANTS profiler, but it's OK for simpler tasks!
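The averaging approach described above could look roughly like the following. This is only a minimal sketch: the class name AveragingProfiler, the method name, and the measuredRuns/warmupRuns parameters are hypothetical and are not the original Profiler API.

```csharp
using System;
using System.Diagnostics;

public static class AveragingProfiler
{
    // Runs `action` warmupRuns + measuredRuns times and returns the mean
    // elapsed milliseconds of the measured runs only.
    public static double AverageMilliseconds(Action action, int measuredRuns = 10, int warmupRuns = 1)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (measuredRuns <= 0) throw new ArgumentOutOfRangeException(nameof(measuredRuns));

        // Warm-up: gives the JIT a chance to compile the code path.
        // These timings are deliberately discarded.
        for (int i = 0; i < warmupRuns; i++)
            action();

        double totalMs = 0;
        for (int i = 0; i < measuredRuns; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
        }

        return totalMs / measuredRuns;
    }
}
```

A call such as AveragingProfiler.AverageMilliseconds(() => FooBar(), measuredRuns: 20) would then report a mean that excludes the first, JIT-affected run. It still measures wall-clock time, so background noise affects it just as before.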
As per my comment above: I did some simple tests on my own and found no difference over time. Can you share your code? I'll put mine in an answer as it doesn't fit here.
This is my sample code.
(I also tried with both static and instance methods with no difference)
class Program
{
    static void Main(string[] args)
    {
        int to = 50000000;
        OtherStuff os = new OtherStuff();
        Console.WriteLine(Profile(() => os.CountTo(to)));
        Console.WriteLine(Profile(() => os.CountTo(to)));
        Console.WriteLine(Profile(() => os.CountTo(to)));
    }

    static long Profile(Action method)
    {
        Stopwatch st = Stopwatch.StartNew();
        method(); // run the code being measured
        return st.ElapsedMilliseconds;
    }
}

class OtherStuff
{
    public void CountTo(int to)
    {
        for (int i = 0; i < to; i++)
        {
            // some work...
        }
    }
}
A sample output would be:
Consider executing this method instead:
class OtherStuff
public string CountTo(Guid id)
using(SHA256 sha = SHA256.Create())
int x = default(int);
for (int index = 0; index < 16; index++)
x = id.ToByteArray()[index] >> 32 << 16;
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
byte[] y = new byte[1024];
y = y.Concat(BitConverter.GetBytes(x)).ToArray();
return BitConverter.ToString(sha.ComputeHash(BitConverter.GetBytes(x).Where(o => o >> 2 < 0).ToArray()));
Sample output:
I'm fairly new to C# and programming in general and I was trying out parallel programming.
I have written this example code that computes the sum of an array first, using multiple threads, and then, using one thread (the main thread).
I've timed both cases.
static long Sum(int[] numbers, int start, int end)
long sum = 0;
for (int i = start; i < end; i++)
sum += numbers[i];
return sum;
static async Task Main()
// Arrange data.
const int COUNT = 100_000_000;
int[] numbers = new int[COUNT];
Random random = new();
for (int i = 0; i < numbers.Length; i++)
numbers[i] = random.Next(100);
// Split task into multiple parts.
int threadCount = Environment.ProcessorCount;
int taskCount = threadCount - 1;
int taskSize = numbers.Length / taskCount;
var start = DateTime.Now;
// Run individual parts in separate threads.
List<Task<long>> tasks = new();
for (int i = 0; i < taskCount; i++)
int begin = i * taskSize;
int end = (i == taskCount - 1) ? numbers.Length : (i + 1) * taskSize;
tasks.Add(Task.Run(() => Sum(numbers, begin, end)));
// Wait for all threads to finish, as we need the result.
var partialSums = await Task.WhenAll(tasks);
long sumAsync = partialSums.Sum();
var durationAsync = (DateTime.Now - start).TotalMilliseconds;
Console.WriteLine($"Async sum: {sumAsync}");
Console.WriteLine($"Async duration: {durationAsync} milliseconds");
// Sequential
start = DateTime.Now;
long sumSync = Sum(numbers, 0, numbers.Length);
var durationSync = (DateTime.Now - start).TotalMilliseconds;
Console.WriteLine($"Sync sum: {sumSync}");
Console.WriteLine($"Sync duration: {durationSync} milliseconds");
var factor = durationSync / durationAsync;
Console.WriteLine($"Factor: {factor:0.00}x");
When the array size is 100 million, the parallel sum is computed 2x faster (on average).
But when the array size is 1 billion, it's significantly slower than the sequential sum.
Why is it running slower?
Hardware Information
Environment.ProcessorCount = 4
GC.GetGCMemoryInfo().TotalAvailableMemoryBytes = 8468377600
When array size is 100,000,000
When array size is 1,000,000,000
New Test:
This time, instead of having separate threads (3 in my case) work on different parts of a single array of 1,000,000,000 integers, I physically divided the dataset into 3 separate arrays of 333,333,333 integers each (one third of the size). Now, although I'm still adding up a billion integers on the same machine, my parallel code runs faster (as expected).
private static void InitArray(int[] numbers)
{
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
        numbers[i] = random.Next(100);
}

public static async Task Main()
{
    Stopwatch stopwatch = new();
    const int SIZE = 333_333_333; // one third of a billion
    List<int[]> listOfArrays = new();
    for (int i = 0; i < Environment.ProcessorCount - 1; i++)
    {
        int[] numbers = new int[SIZE];
        InitArray(numbers);
        listOfArrays.Add(numbers);
    }

    // Sequential.
    stopwatch.Restart();
    long syncSum = 0;
    foreach (var array in listOfArrays)
        syncSum += Sum(array);
    var sequentialDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Sequential sum: {syncSum}");
    Console.WriteLine($"Sequential duration: {sequentialDuration} ms");

    // Parallel.
    stopwatch.Restart();
    List<Task<long>> tasks = new();
    foreach (var array in listOfArrays)
        tasks.Add(Task.Run(() => Sum(array)));
    var partialSums = await Task.WhenAll(tasks);
    long parallelSum = partialSums.Sum();
    var parallelDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Parallel sum: {parallelSum}");
    Console.WriteLine($"Parallel duration: {parallelDuration} ms");
    Console.WriteLine($"Factor: {sequentialDuration / parallelDuration:0.00}x");
}
I don't know if it helps figure out what went wrong in the first approach.
The asynchronous pattern is not the same as running code in parallel. The main reason for asynchronous code is better resource utilization while the computer is waiting for some kind of IO device. Your code would be better described as parallel computing or concurrent computing.
While your example should work fine, it may be neither the easiest nor the optimal way to do it. The easiest option would probably be to use Parallel LINQ: numbers.AsParallel().Sum(x => (long)x) (the cast to long keeps the total from overflowing an int). There is also a Parallel.For method that should be better suited, including an overload that maintains thread-local state. Note that while Parallel.For will attempt to optimize its partitioning, you probably want to process a chunk of data in each iteration to reduce overhead. I would try around 1-10k values per chunk or so.
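As a rough illustration of the Parallel.For suggestion, the sketch below sums an int array in chunks, using the overload that carries a thread-local running total. The class name ChunkedSum and the chunkSize default of 10,000 are assumptions made for this example, not measured recommendations.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkedSum
{
    public static long Sum(int[] numbers, int chunkSize = 10_000)
    {
        if (numbers == null) throw new ArgumentNullException(nameof(numbers));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        // Number of chunks, rounding up so the last partial chunk is included.
        int chunkCount = (int)(((long)numbers.Length + chunkSize - 1) / chunkSize);
        long total = 0;

        Parallel.For(
            0, chunkCount,
            () => 0L,                               // per-thread running total
            (chunk, loopState, local) =>
            {
                int start = chunk * chunkSize;
                int end = (int)Math.Min((long)start + chunkSize, numbers.Length);
                for (int i = start; i < end; i++)
                    local += numbers[i];
                return local;
            },
            local => Interlocked.Add(ref total, local)); // merge once per thread

        return total;
    }
}
```

Each loop iteration handles a whole chunk, so the per-iteration delegate overhead is paid once per 10,000 values rather than once per value, and synchronization happens only when a thread hands in its final total.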
We can only guess the reason your parallel method is slower. Summing numbers is a really fast operation, so it may be that the computation is limited by memory bandwidth or cache usage. And while you want your work partitions to be fairly large, partitions that are too large may result in less overall parallelism if a thread gets suspended for any reason. You may also want partition sizes that work well with the caching system; see cache associativity. It is also possible you are including things you did not intend to measure, like compilation time or GCs. See BenchmarkDotNet, which takes care of many of the edge cases when measuring performance.
Also, never use DateTime for measuring performance; Stopwatch is both much easier to use and much more accurate.
My machine has 4GB RAM, so initializing an int[1_000_000_000] results in memory paging. Going from int[100_000_000] to int[1_000_000_000] results in non-linear performance degradation (100x instead of 10x). Essentially a CPU-bound operation becomes I/O-bound. Instead of adding numbers, the program spends most of its time reading segments of the array from the disk. In these conditions using multiple threads can be detrimental for the overall performance, because the pattern of accessing the storage device becomes more erratic and less streamlined.
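The arithmetic behind the paging explanation can be checked with a few lines. This is only a sketch; the ArrayFootprint helper is hypothetical and it ignores the small fixed array header.

```csharp
using System;

public static class ArrayFootprint
{
    // Payload size in bytes of an int[] with the given element count.
    public static long IntArrayBytes(long elementCount) => elementCount * sizeof(int);

    // Fraction of the available memory that the payload would occupy.
    public static double FractionOf(long payloadBytes, long availableBytes) =>
        (double)payloadBytes / availableBytes;
}
```

IntArrayBytes(1_000_000_000) is 4,000,000,000 bytes, which is nearly all of a 4 GB machine and a little under half of the 8,468,377,600 bytes reported in the question. IntArrayBytes(100_000_000) is only 400,000,000 bytes, which fits comfortably in either case.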
Maybe something similar happens on your 8GB RAM machine too, but I can't say for sure.
This is my first attempt at parallel programming.
I'm writing a test console app before using this in my real app and I can't seem to get it right. When I run this, the parallel search is always faster than the sequential one, but the parallel search never finds the correct value. What am I doing wrong?
I tried it without using a partitioner (just Parallel.For); it was slower than the sequential loop and gave the wrong number. I saw a Microsoft doc that said for simple computations, using Partitioner.Create can speed things up. So I tried that but still got the wrong values. Then I saw Interlocked, but I think I'm using it wrong.
Any help would be greatly appreciated
Random r = new Random();
Stopwatch timer = new Stopwatch();
do {
// Make and populate a list
List<short> test = new List<short>();
for (int x = 0; x <= 10000000; x++)
test.Add((short)(r.Next(short.MaxValue) * r.NextDouble()));
// Initialize result variables
short rMin = short.MaxValue;
short rMax = 0;
// Do min/max normal search
foreach (var amp in test)
rMin = Math.Min(rMin, amp);
rMax = Math.Max(rMax, amp);
// Display results
Console.WriteLine($"rMin: {rMin} rMax: {rMax} Time: {timer.ElapsedMilliseconds}");
// Initialize parallel result variables
short pMin = short.MaxValue;
short pMax = 0;
// Create list partitioner
var rangePortioner = Partitioner.Create(0, test.Count);
// Do min/max parallel search
Parallel.ForEach(rangePortioner, (range, loop) =>
short min = short.MaxValue;
short max = 0;
for (int i = range.Item1; i < range.Item2; i++)
min = Math.Min(min, test[i]);
max = Math.Max(max, test[i]);
_ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMin), Math.Min(pMin, min));
_ = Interlocked.Exchange(ref Unsafe.As<short, int>(ref pMax), Math.Max(pMax, max));
// Display results
Console.WriteLine($"pMin: {pMin} pMax: {pMax} Time: {timer.ElapsedMilliseconds}");
Console.WriteLine("Press enter to run again; any other key to quit");
} while (Console.ReadKey().Key == ConsoleKey.Enter);
Sample output:
rMin: 0 rMax: 32746 Time: 106
pMin: 0 pMax: 32679 Time: 66
Press enter to run again; any other key to quit
The correct way to do a parallel search like this is to compute local values for each thread used, and then merge the values at the end. This ensures that synchronization is only needed at the final phase:
var items = Enumerable.Range(0, 10000).ToList();
int globalMin = int.MaxValue;
int globalMax = int.MinValue;
Parallel.ForEach<int, (int Min, int Max)>(
    items,
    () => (int.MaxValue, int.MinValue), // Create new min/max values for each thread used
    (item, state, localMinMax) =>
    {
        var localMin = Math.Min(item, localMinMax.Min);
        var localMax = Math.Max(item, localMinMax.Max);
        return (localMin, localMax); // return the new min/max values for this thread
    },
    localMinMax => // called one last time for each thread used
    {
        lock (items) // Since this may run concurrently, synchronization is needed
        {
            globalMin = Math.Min(globalMin, localMinMax.Min);
            globalMax = Math.Max(globalMax, localMinMax.Max);
        }
    });
As you can see, this is quite a bit more complex than a regular loop, and it is not even doing anything fancy like partitioning. An optimized solution would work over larger blocks to reduce overhead, but this is omitted for simplicity, and it looks like the OP is already aware of such issues.
Be aware that multithreaded programming is difficult. While it is a great idea to try out such techniques in a playground rather than a real program, I would still suggest that you start by studying the potential dangers around thread safety; good resources on this are fairly easy to find.
Not all problems will be as obviously wrong as this one, and it is quite easy to cause issues that break once in a million runs, or only when the CPU load is high, or only on single-CPU systems, or that are only detected long after the code is put into production. It is good practice to be paranoid whenever multiple threads may read and write the same memory concurrently.
I would also recommend learning about immutable data types, and pure functions, since these are much safer and easier to reason about once multiple threads are involved.
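Interlocked.Exchange makes only the exchange itself thread-safe. The Math.Min and Math.Max calls read pMin and pMax before the exchange happens, so another thread can update them in between and have its result overwritten, which is a race condition. You should compute min/max for every batch separately and then combine the results.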
Using low-lock techniques like the Interlocked class is tricky and advanced. Taking into consideration that your experience in multithreading is not excessive, I would say go with a simple and trusty lock:
object locker = new object();
lock (locker)
{
    pMin = Math.Min(pMin, min);
    pMax = Math.Max(pMax, max);
}
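Put together with the range partitioner from the question, the lock-based merge might look like the sketch below. The ParallelMinMax class and Find method are hypothetical names; the lock object is created once, outside the parallel loop, so that every thread locks on the same instance.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelMinMax
{
    public static (short Min, short Max) Find(IList<short> values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        if (values.Count == 0) throw new ArgumentException("Sequence is empty.", nameof(values));

        short globalMin = short.MaxValue;
        short globalMax = short.MinValue;
        object locker = new object(); // shared by all threads

        Parallel.ForEach(Partitioner.Create(0, values.Count), range =>
        {
            // Local search over this range: no shared state is touched here.
            short min = short.MaxValue;
            short max = short.MinValue;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                min = Math.Min(min, values[i]);
                max = Math.Max(max, values[i]);
            }

            // Merge once per range; the read and the write happen under the same lock.
            lock (locker)
            {
                globalMin = Math.Min(globalMin, min);
                globalMax = Math.Max(globalMax, max);
            }
        });

        return (globalMin, globalMax);
    }
}
```

The lock is taken once per range rather than once per element, so contention stays low while the compare-and-update of the shared values remains atomic as a whole.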
I made some tests of code performance, and I would like to know how the CPU cache works in this kind of situation:
Here is a classic example for a loop:
private static readonly short[] _values;
static MyClass()
var random = new Random();
_values = Enumerable.Range(0, 100)
.Select(x => (short)random.Next(5000))
.ToArray();
public static void Run()
short max = 0;
for (var index = 0; index < _values.Length; index++)
max = Math.Max(max, _values[index]);
Here is a variant that computes the same thing, but performs much better:
private static readonly short[] _values;
static MyClass()
var random = new Random();
_values = Enumerable.Range(0, 100)
.Select(x => (short)random.Next(5000))
.ToArray();
public static void Run()
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index+=2)
max1 = Math.Max(max1, _values[index]);
max2 = Math.Max(max2, _values[index + 1]);
short max = Math.Max(max1, max2);
So I am interested to know why the second is more efficient than the first one.
I understand it has something to do with the CPU cache, but I don't really get how it happens (e.g. values not being read twice between loops).
.NET Core 4.6.27617.04
Intel Core i7-7850HQ 2.90GHz 64-bit
Calling each 50 million times:
=> 00:00:06.0702028
=> 00:00:03.8563776 (-36 %)
The last measurement is the one with loop unrolling.
The difference in performance in this case is not related to caching - you have just 100 values - they fit entirely in the L2 cache already at the time you generated them.
The difference is due to out-of-order execution.
A modern CPU has multiple execution units and can perform more than one operation at the same time even in a single-threaded application.
But your loop is problematic for a modern CPU because it has a dependency:
short max = 0;
for (var index = 0; index < _values.Length; index++)
max = Math.Max(max, _values[index]);
Here each subsequent iteration is dependent on the value max from the previous one, so the CPU is forced to compute them sequentially.
Your revised loop adds a degree of freedom for the CPU; since max1 and max2 are independent, they can be computed in parallel.
So essentially the revised loop can run equally fast per iteration as the first one:
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index+=2)
max1 = Math.Max(max1, _values[index]);
max2 = Math.Max(max2, _values[index + 1]);
But it has half the iterations, so in the end you get a significant speedup (not 2x because out-of-order execution is not perfect).
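One detail the two-accumulator loop glosses over is that values[index + 1] runs past the end when the array length is odd (the 100-element array in the question happens to be even). A sketch that keeps the two independent dependency chains and handles the leftover element separately could look like this; TwoLaneMax is a hypothetical name, and like the original it starts from 0, so it assumes non-negative data.

```csharp
using System;

public static class TwoLaneMax
{
    public static short Max(short[] values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));

        short max1 = 0; // chain 1: even indices
        short max2 = 0; // chain 2: odd indices
        int index = 0;
        int pairedEnd = values.Length - (values.Length % 2);

        // max1 and max2 never depend on each other inside the loop,
        // so the CPU is free to overlap the two updates.
        for (; index < pairedEnd; index += 2)
        {
            max1 = Math.Max(max1, values[index]);
            max2 = Math.Max(max2, values[index + 1]);
        }

        // Leftover element when the length is odd.
        if (index < values.Length)
            max1 = Math.Max(max1, values[index]);

        return Math.Max(max1, max2);
    }
}
```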
CPU caching works by pre-loading the memory the CPU is likely to need next (instructions, data, pointers, variable values, and so on) and keeping it in the CPU cache.
Code Blocks
Between your two blocks of code, the difference may not be apparent from the syntax. Try converting your code to IL (the intermediate language for C#, which is executed by the JIT (just-in-time) compiler); see the references for tools and resources.
Or just decompile your compiled code, using the decompiler below, and check how the compiler "optimized it" when producing the dll/exe files.
other performance optimization
Loop Unrolling
CPU Caching
C# Decompiler
I am trying to find out why parallel foreach does not give the expected speedup on a machine with 32 physical cores and 64 logical cores with a simple test computation.
var parameters = new List<string>();
for (int i = 1; i <= 9; i++) {
if (Scenario.UsesParallelForEach)
Parallel.ForEach(parameters, parameter => {
FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
var lc = new LongComputation();
FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
foreach (var parameter in parameters)
FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
var lc = new LongComputation();
FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
class LongComputation
public void Compute()
var s = "";
for (int i = 0; i <= 40000; i++)
s = s + i.ToString() + "\n";
The Compute function takes about 5 seconds to complete. My assumption was, that with the parallel foreach loop each additional iteration creates a parallel thread running on one of the cores and taking as much as it would take to compute the Compute function only once. So, if I run the loop twice, then with the sequential foreach, it would take 10 seconds, with the parallel foreach only 5 seconds (assuming 2 cores are available). The speedup would be 2. If I run the loop three times, then with the sequential foreach, it would take 15 seconds, but again with the parallel foreach only 5 seconds. The speedup would be 3, then 4, 5, 6, 7, 8, and 9. However, what I observe is a constant speedup of 1.3.
Sequential vs parallel foreach. X-axis: number of sequential/parallel execution of the computation. Y-axis: time in seconds
Speedup, time of the sequential foreach divided by parallel foreach
The event fired in FireOnParameterComputed is intended to be used in a GUI progress bar to show the progress. In the progress bar it can clearly be seen that a new thread is created for each iteration.
My question is, why don't I see the expected speedup or at least close to the expected speedup?
Tasks aren't threads.
Sometimes starting a task will cause a thread to be created, but not always. Creating and managing threads consumes time and system resources. When a task only takes a short amount of time, even though it's counter-intuitive, the single-threaded model is often faster.
The CLR knows this and tries to make its best judgment on how to execute the task based on a number of factors including any hints that you've passed to it.
For Parallel.ForEach, if you're certain that you want multiple threads to be spawned, try passing in ParallelOptions.
Parallel.ForEach(parameters, new ParallelOptions { MaxDegreeOfParallelism = 100 }, parameter => {});
Now, I'm new to threading and async/sync programming and all that stuff, so I've been practicing and saw this problem on YouTube. The problem was to sum every element of a byte array. It was from the channel called Jamie King. He did this with threads; I've decided to do it with tasks. I made it asynchronous and it was slower than the synchronous one. The difference between the two was 360 milliseconds! I wonder if any of you could do it faster in an asynchronous way. If so, please post it!
Here's mine:
static Random Random = new Random(999);
static byte[] byteArr = new byte[100_000_000];
static byte TaskCount = (byte)Environment.ProcessorCount;
static int readingLength;
static void Main(string[] args)
for (int i = 0; i < byteArr.Length; i++)
byteArr[i] = (byte)Random.Next(11);
static async void SumAsync(byte[] bytes)
readingLength = bytes.Length / TaskCount;
int sum = 0;
Stopwatch watch = new Stopwatch();
for (int i = 0; i < TaskCount; i++)
Task<int> task = SumPortion(bytes.SubArray(i * readingLength, readingLength));
int result = await task;
sum += result;
Console.WriteLine("Done! Time took: {0}, Result: {1}", watch.ElapsedMilliseconds, sum);
static async Task<int> SumPortion(byte[] bytes)
Task<int> task = Task.Run(() =>
int sum = 0;
foreach (byte b in bytes)
sum += b;
return sum;
int result = await task;
return result;
Note that bytes.SubArray is an extension method. I have one question. Is asynchronous programming slower than synchronous programming?
Please point out my mistakes.
Thanks for your time!
You need to use WhenAll() and await all of the tasks together at the end:
static async void SumAsync(byte[] bytes)
{
    readingLength = bytes.Length / TaskCount;
    Stopwatch watch = new Stopwatch();
    watch.Start();
    var results = new Task<int>[TaskCount];
    for (int i = 0; i < TaskCount; i++)
    {
        // Start every task first; do not await inside the loop.
        Task<int> task = SumPortion(bytes.SubArray(i * readingLength, readingLength));
        results[i] = task;
    }
    int[] result = await Task.WhenAll(results);
    Console.WriteLine("Done! Time took: {0}, Result: {1}", watch.ElapsedMilliseconds, result.Sum());
}
When you use the WhenAll() method, all of the tasks are started before any of them is awaited, so they run in parallel instead of one after another, saving you a lot of unnecessary waiting.
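A variation on the same idea, sketched under a few assumptions: instead of copying each portion with the SubArray extension method, every task sums an index range of the shared array, and the last task also takes the remainder when the length does not divide evenly. ParallelByteSum and its parameters are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelByteSum
{
    public static async Task<long> SumAsync(byte[] bytes, int taskCount)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        if (taskCount <= 0) throw new ArgumentOutOfRangeException(nameof(taskCount));

        int portion = bytes.Length / taskCount;
        var tasks = new Task<long>[taskCount];

        for (int i = 0; i < taskCount; i++)
        {
            // Declared inside the loop so each lambda captures its own copy.
            int start = i * portion;
            int end = (i == taskCount - 1) ? bytes.Length : start + portion;

            tasks[i] = Task.Run(() =>
            {
                long sum = 0;
                for (int j = start; j < end; j++)
                    sum += bytes[j];
                return sum;
            });
        }

        // All tasks are already running; wait for them together.
        long[] partialSums = await Task.WhenAll(tasks);
        return partialSums.Sum();
    }
}
```

Reading the shared array from several tasks is safe here because no task writes to it, and skipping the copies avoids allocating a second full-size set of buffers before any summing starts.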
You can read more about it in learn.microsoft.com.
Asynchronous code is not inherently slower. It runs work in the background (such as waiting for a connection to a website to be established) so that the main thread is not blocked while it waits for something to happen.
The fastest way to do this is probably going to be to hand-roll a Parallel.ForEach() loop.
Plinq may not even give you a speedup in comparison to a single-threaded approach, and it certainly won't be as fast as Parallel.ForEach().
Here's some sample timing code. When you try this, make sure it's a RELEASE build and that you don't run it under the debugger (which will turn off the JIT optimiser, even if it's a RELEASE build):
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
    static class Program
    {
        static void Main()
        {
            // Create some random bytes (using a seed to ensure it's the same bytes each time).
            var rng = new Random(12345);
            byte[] byteArr = new byte[500_000_000];
            rng.NextBytes(byteArr);

            // Time single-threaded Linq.
            var sw = Stopwatch.StartNew();
            long sum = byteArr.Sum(x => (long)x);
            Console.WriteLine($"Single-threaded Linq took {sw.Elapsed} to calculate sum as {sum}");

            // Time single-threaded loop.
            sw.Restart();
            sum = 0;
            foreach (var n in byteArr)
                sum += n;
            Console.WriteLine($"Single-threaded took {sw.Elapsed} to calculate sum as {sum}");

            // Time Plinq.
            sw.Restart();
            sum = byteArr.AsParallel().Sum(x => (long)x);
            Console.WriteLine($"Plinq took {sw.Elapsed} to calculate sum as {sum}");

            // Time Parallel.ForEach() with partitioner.
            sw.Restart();
            sum = 0;
            Parallel.ForEach(
                Partitioner.Create(0, byteArr.Length),
                () => 0L,
                (subRange, loopState, threadLocalState) =>
                {
                    for (int i = subRange.Item1; i < subRange.Item2; i++)
                        threadLocalState += byteArr[i];
                    return threadLocalState;
                },
                finalThreadLocalState =>
                {
                    Interlocked.Add(ref sum, finalThreadLocalState);
                });
            Console.WriteLine($"Parallel.ForEach with partioner took {sw.Elapsed} to calculate sum as {sum}");
        }
    }
}
The results I get with an x64 build on my octo-core PC are:
Single-threaded Linq took 00:00:03.1160235 to calculate sum as 63748717461
Single-threaded took 00:00:00.7596687 to calculate sum as 63748717461
Plinq took 00:00:01.0305913 to calculate sum as 63748717461
Parallel.ForEach with partioner took 00:00:00.0839141 to calculate sum as 63748717461
The results I get with an x86 build are:
Single-threaded Linq took 00:00:02.6964067 to calculate sum as 63748717461
Single-threaded took 00:00:00.8200462 to calculate sum as 63748717461
Plinq took 00:00:01.1251899 to calculate sum as 63748717461
Parallel.ForEach with partioner took 00:00:00.1084805 to calculate sum as 63748717461
As you can see, the Parallel.ForEach() with the x64 build is fastest (probably because it's calculating a long total, rather than because of the larger address space).
The Plinq is around three times faster than the Linq non-threaded solution.
The Parallel.ForEach() with a partitioner is more than 30 times faster.
But notably, the non-linq single-threaded code is faster than the Plinq code. In this case, using Plinq is pointless; it makes things slower!
This tells us that the speedup isn't just from multithreading - it's also related to the overhead of Linq and Plinq in comparison to hand-rolling the loop.
Generally speaking, you should only use Plinq when the processing of each element takes a relatively long time (and adding a byte to a running total takes a very short time).
The advantage of Plinq over Parallel.ForEach() with a partitioner is that it is much simpler to write - however, if it winds up being slower than a simple foreach loop then its utility is questionable. So timing things before choosing a solution is very important!