For some operations Parallel scales well with the number of CPUs, but for other operations it does not.
Consider the code below: function1 gets a 10x improvement, while function2 gets only a 3x improvement. Is this due to memory allocation, or perhaps the GC?
void function1(int v) {
    for (int i = 0; i < 100000000; i++) {
        var q = Math.Sqrt(v);
    }
}
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i++) {
        dict.Add(i, v);
    }
}
var sw = new System.Diagnostics.Stopwatch();
var iterations = 100;
sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));
The output on my machine:
function1 no parallel: 2 059,4 ms
function1 with parallel: 213,7 ms
function2 no parallel: 14 192,8 ms
function2 parallel: 4 491,1 ms
Environment:
Win 11, .Net 6.0, Release build
i9 12th gen, 16 cores, 24 proc, 32 GB DDR5
After more testing, it seems that memory allocation does not scale well across multiple threads. For example, if I change function2 to:
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}
The result is:
function2 no parallel: 124,0 ms
function2 parallel: 402,4 ms
Is the conclusion that memory allocation does not scale well with multiple threads?...
tl;dr: Heap allocation contention.
Your first function is embarrassingly parallel. Each thread can do its computation with embarrassingly little interaction with other threads. So it scales up nicely to multiple threads. huseyin tugrul buyukisik correctly pointed out that your first computation makes use of the non-shared, per thread, processor registers.
Your second function, when it preallocates the dictionary, is somewhat less embarrassingly parallel. Each thread's computation is independent of the others' except for the fact that they each use your machine's RAM subsystem. So you see some thread-to-thread contention at the hardware level as thread-level cached data is written to and read from the machine-level RAM.
Your second function that does not preallocate memory is not embarrassingly parallel. Why not? Each .Add() operation must allocate some data in the shared heap. That can't be done in parallel, because all threads share the same heap. Rather they must be synchronized. The dotnet libraries do a good job of parallelizing heap operations as much as possible, but they do not avoid at least some blocking of thread B when thread A allocates heap data. So the threads slow each other down.
Separate processes rather than separate threads are a good way to scale up workloads like your non-preallocating second function. Each process has its own heap.
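One configuration worth experimenting with (my suggestion, not something measured in the question): .NET's server GC gives each core its own heap segment, which can reduce allocator and collection contention in allocation-heavy parallel workloads compared to the default workstation GC. It can be enabled in the project file:

```xml
<!-- In the .csproj: server GC uses per-core heaps,
     which may reduce allocation contention under Parallel.For -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```

Whether it helps depends on the workload, so benchmark both settings.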
The first function works in registers. More cores = more registers.
The second function works on memory. More cores = only more L1 cache, but shared RAM. A 10-million-element dataset certainly comes from RAM, as even L3 cache is not big enough. This assumes the JIT of the language optimizes allocations as reused buffers. If not, then there is allocation overhead too. So you should reuse the dictionary on each new iteration instead of recreating it.
Also, you are saving data with an incremental integer index. A simple array would work here, again with reuse between iterations. It has a smaller memory footprint than a dictionary.
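A sketch of that suggestion (a hypothetical rewrite, not the OP's code): since the keys are just incremental integers, a plain array suffices, and a `[ThreadStatic]` buffer lets each worker thread reuse its one allocation across iterations:

```csharp
// Hypothetical rewrite of function2: incremental int keys fit a plain array,
// and a [ThreadStatic] buffer means one allocation per thread, then reuse.
[ThreadStatic] static int[]? buffer;

static void Function2Array(int v)
{
    buffer ??= new int[10_000_000]; // allocated once per thread
    for (int i = 0; i < buffer.Length; i++)
        buffer[i] = v;               // no per-element heap traffic
}
```

With the heap out of the hot loop, the work is pure memory writes, so the remaining limit is RAM bandwidth rather than allocator contention.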
Parallel programming is not that simple. Using Parallel.For() or Parallel.ForEach() doesn't automatically make your program parallel.
Parallel programming is not about calling some higher-level function (in any programming language) to make your code parallel. It is about preparing your code to be parallel.
Actually, you are not parallelizing anything at all in either func1 or func2.
Backing to the foundation, the two basic types of parallelism are:
By task: you split a complex task into smaller subtasks, each subtask processed at the same time on different cores, CPUs, or nodes (in a computer cluster)
By data: you split a large data set into several smaller slices, each slice processed at the same time on different cores, CPUs, or nodes
Data parallelism is much trickier to achieve and does not always provide a real performance gain.
Func1 is not really parallel; it's just a heavy piece of computation running concurrently. (Your CPU cores are just disputing who will finish the 100M for loop first.)
Using Parallel.For() you are just spawning this heavy function 100 times among your threads.
A single for loop with Task.Run() inside would have nearly the same result.
If you run this on only one thread/core, it will obviously take some time. If you run it on all your cores, it will be faster. No big mystery here, although this is concurrent code, not actually parallel. Besides, when invoking these tasks 100 times, if you don't have that number of CPU cores (or nodes in a cluster), there's no big difference: parallel/concurrent code is limited by the actual CPU cores in the machine (as we'll see in an example below).
Now about Func2 and its interaction with the memory heap. Yes, every modern language with a built-in GC makes this CPU-expensive. One of the most expensive operations in a complex algorithm is garbage collection; in non-optimized code it can sometimes represent over 90% of CPU time.
Let's analyze your function2:
It declares a new Dictionary in the function scope
It populates this Dictionary with 10M items
Outside that scope, you call function2 inside a Parallel.For with 100 iterations
So 100 different scopes populate 100 different Dictionaries with 10M entries each
There's no interaction between any of these scopes
As said before, this is not parallel programming; this is concurrent programming. You have 100 separate data chunks of 10M entries each, one per scope, and they don't interact with each other.
But there's a second factor too. Your function2 performs write operations (it is adding/updating/deleting something in a collection). If it's just a bunch of random data and you can accept some loss and inconsistency, fine. But if you're handling real data and cannot allow any kind of loss or inconsistency, bad news: there is no true parallelism for writing to the same memory address (object reference). You will need a synchronization context, and that makes things much slower; these synchronized operations will always be concurrent, because if one thread is writing to a memory reference, the other threads must wait until it leaves. In fact, using several threads to write data might make your code slower instead of faster, especially if the parallel operations are not CPU-bound.
To see real gains from data parallelism, you must be performing heavy computations upon the partitioned data.
Let's check some code below, based on your methodology but with some changes:
var rand = new Random();
var operationSamples = 256;
var datasetSize = 100_000_000;
var computationDelay = 50;
var cpuCores = Environment.ProcessorCount;
Dictionary<int, int> datasetWithLoss = new(datasetSize);
Dictionary<int, int> dataset = new(datasetSize);
double result = 0;
Stopwatch sw = new();
ThreadPool.SetMinThreads(1, 1);
int HeavyComputation(int delay)
{
    int iterations = 0;
    var end = DateTime.Now + TimeSpan.FromMilliseconds(delay);
    while (DateTime.Now < end)
        iterations++;
    return iterations;
}
double SequentialMeanHeavyComputation(int maxMilliseconds, int samples = 64)
{
    double sum = 0;
    for (int i = 0; i < samples; i++)
        sum += HeavyComputation(maxMilliseconds);
    return sum / samples;
}
double ParallelMeanHeavyComputation(int maxSecondsCount, int samples = 64, int threads = 4)
{
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");
    var _lockKey = new object();
    double sum = 0;
    int offset = samples / threads;
    List<Action> tasks = new();
    for (int i = 0; i < samples; i++)
        tasks.Add(new Action(() =>
        {
            var result = HeavyComputation(maxSecondsCount);
            lock (_lockKey)
                sum += result;
        }));
    Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = threads }, tasks.ToArray());
    return sum / samples;
}
void SequentialDatasetPopulation(int size)
{
    for (int i = 0; i < datasetSize; i++)
        dataset.TryAdd(i, Guid.NewGuid().GetHashCode());
}
void ParalellDatasetPopulation(int size, int threads)
{
    var _lock = new object();
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");
    Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
    {
        var value = Guid.NewGuid().GetHashCode();
        lock (_lock)
            dataset.Add(i, value);
    });
}
double SequentialReadOnlyDataset()
{
    foreach (var x in dataset)
    {
        HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
    }
    return 0;
}
double ParallelReadOnlyDataset()
{
    Parallel.ForEach(dataset, x =>
    {
        HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
    });
    return 0;
}
void ParalellDatasetWithLoss(int size, int threads)
{
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");
    Parallel.For(0, datasetSize, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
    {
        int value = Guid.NewGuid().GetHashCode();
        datasetWithLoss.Add(i, value);
    });
}
sw.Restart();
result = SequentialMeanHeavyComputation(computationDelay, operationSamples);
sw.Stop();
Console.WriteLine($"{nameof(SequentialMeanHeavyComputation)} sequential tasks: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (CPU threads match count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 100);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Higher thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 4);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Lower thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
SequentialDatasetPopulation(datasetSize);
sw.Stop();
Console.WriteLine($"{nameof(SequentialDatasetPopulation)} sequential data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
dataset.Clear();
sw.Restart();
ParalellDatasetPopulation(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetPopulation)} parallel data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
ParalellDatasetWithLoss(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetWithLoss)} parallel data with loss: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.WriteLine($"Lossless dataset count: {dataset.Count}");
Console.WriteLine($"Dataset with loss: {datasetWithLoss.Count}\n");
datasetWithLoss.Clear();
sw.Restart();
SequentialReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(SequentialReadOnlyDataset)} sequential reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
sw.Restart();
ParallelReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(ParallelReadOnlyDataset)} parallel reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.Read();
Output:
SequentialMeanHeavyComputation sequential tasks: 12 800,7ms
Available Threads: 15
ParallelMeanHeavyComputation parallel tasks (CPU threads match count): 860,3ms
Available Threads: 99
ParallelMeanHeavyComputation parallel tasks (Higher thread count): 805,0ms
Available Threads: 3
ParallelMeanHeavyComputation parallel tasks (Lower thread count): 3 200,4ms
SequentialDatasetPopulation sequential data population: 9 072,4ms
Available Threads: 15
ParalellDatasetPopulation parallel data population: 23 420,0ms
Available Threads: 15
ParalellDatasetWithLoss parallel data with loss: 6 788,3ms
Lossless dataset count: 100000000
Dataset with loss: 77057456
SequentialReadOnlyDataset sequential reading operations: 20 371,0ms
ParallelReadOnlyDataset parallel reading operations: 3 020,6ms
[CPU usage per phase from the screenshot - Red: 25%, Orange: 56%, Green: 75%, Blue: 100%]
With task parallelism we achieved over 20x performance using 100% of the CPU threads (in this example; it's not always like that).
In read-only data parallelism with some computation, we achieved nearly 6,5x speedup at 56% CPU usage (with lighter computations the difference would be smaller).
But trying to implement "real parallelism" of data for writing, our performance is more than twice as slow, and the CPU can't use its full potential, sitting at only 25% usage due to synchronization contexts.
Conclusions:
Using Parallel.For does not guarantee that your code will actually run in parallel, nor that it will be faster. It requires prior code/data preparation, deep analysis, benchmarks, and tuning.
See also this Microsoft documentation about potential pitfalls in data and task parallelism:
https://learn.microsoft.com/pt-br/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism
Related
I'm fairly new to C# and programming in general and I was trying out parallel programming.
I have written this example code that computes the sum of an array first, using multiple threads, and then, using one thread (the main thread).
I've timed both cases.
static long Sum(int[] numbers, int start, int end)
{
    long sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += numbers[i];
    }
    return sum;
}
static async Task Main()
{
    // Arrange data.
    const int COUNT = 100_000_000;
    int[] numbers = new int[COUNT];
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }

    // Split task into multiple parts.
    int threadCount = Environment.ProcessorCount;
    int taskCount = threadCount - 1;
    int taskSize = numbers.Length / taskCount;

    var start = DateTime.Now;

    // Run individual parts in separate threads.
    List<Task<long>> tasks = new();
    for (int i = 0; i < taskCount; i++)
    {
        int begin = i * taskSize;
        int end = (i == taskCount - 1) ? numbers.Length : (i + 1) * taskSize;
        tasks.Add(Task.Run(() => Sum(numbers, begin, end)));
    }

    // Wait for all threads to finish, as we need the result.
    var partialSums = await Task.WhenAll(tasks);
    long sumAsync = partialSums.Sum();
    var durationAsync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Async sum: {sumAsync}");
    Console.WriteLine($"Async duration: {durationAsync} milliseconds");

    // Sequential.
    start = DateTime.Now;
    long sumSync = Sum(numbers, 0, numbers.Length);
    var durationSync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Sync sum: {sumSync}");
    Console.WriteLine($"Sync duration: {durationSync} milliseconds");

    var factor = durationSync / durationAsync;
    Console.WriteLine($"Factor: {factor:0.00}x");
}
When the array size is 100 million, the parallel sum is computed 2x faster. (on average).
But when the array size is 1 billion, it's significantly slower than the sequential sum.
Why is it running slower?
Hardware Information
Environment.ProcessorCount = 4
GC.GetGCMemoryInfo().TotalAvailableMemoryBytes = 8468377600
Timing:
When array size is 100,000,000
When array size is 1,000,000,000
New Test:
This time, instead of separate threads (3 in my case) working on different parts of a single array of 1,000,000,000 integers, I physically divided the dataset into 3 separate arrays of 333,333,333 integers (one third of the size each). Although I'm still adding up a billion integers on the same machine, this time my parallel code runs faster (as expected).
private static void InitArray(int[] numbers)
{
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }
}
public static async Task Main()
{
    Stopwatch stopwatch = new();
    const int SIZE = 333_333_333; // one third of a billion
    List<int[]> listOfArrays = new();
    for (int i = 0; i < Environment.ProcessorCount - 1; i++)
    {
        int[] numbers = new int[SIZE];
        InitArray(numbers);
        listOfArrays.Add(numbers);
    }

    // Sequential.
    stopwatch.Start();
    long syncSum = 0;
    foreach (var array in listOfArrays)
    {
        syncSum += Sum(array);
    }
    stopwatch.Stop();
    var sequentialDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Sequential sum: {syncSum}");
    Console.WriteLine($"Sequential duration: {sequentialDuration} ms");

    // Parallel.
    stopwatch.Restart();
    List<Task<long>> tasks = new();
    foreach (var array in listOfArrays)
    {
        tasks.Add(Task.Run(() => Sum(array)));
    }
    var partialSums = await Task.WhenAll(tasks);
    long parallelSum = partialSums.Sum();
    stopwatch.Stop();
    var parallelDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Parallel sum: {parallelSum}");
    Console.WriteLine($"Parallel duration: {parallelDuration} ms");
    Console.WriteLine($"Factor: {sequentialDuration / parallelDuration:0.00}x");
}
Timing
I don't know if it helps figure out what went wrong in the first approach.
The asynchronous pattern is not the same as running code in parallel. The main reason for asynchronous code is better resource utilization while the computer is waiting for some kind of IO device. Your code would be better described as parallel computing or concurrent computing.
While your example should work fine, it may not be the easiest, nor the optimal, way to do it. The easiest option would probably be to use Parallel LINQ: numbers.AsParallel().Sum();. There is also a Parallel.For method that should be better suited, including an overload that maintains thread-local state. Note that while Parallel.For will attempt to optimize its partitioning, you probably want to process chunks of data in each iteration to reduce overhead. I would try around 1-10k values or so.
We can only guess at the reason your parallel method is slower. Summing numbers is a really fast operation, so the computation may be limited by memory bandwidth or cache usage. And while you want your work partitions to be fairly large, using too-large partitions may result in less overall parallelism if a thread gets suspended for any reason. You may also want partitions of certain sizes to work well with the caching system; see cache associativity. It is also possible you are including things you did not intend to measure, like compilation time or GCs. See BenchmarkDotNet, which takes care of many of the edge cases when measuring performance.
Also, never use DateTime for measuring performance, Stopwatch is both much easier to use and much more accurate.
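The thread-local-state overload of Parallel.For mentioned above looks roughly like this (a sketch, applied to the question's `numbers` array):

```csharp
// Each thread accumulates into its own local partial sum (no shared writes
// inside the loop), and the partials are merged once per thread at the end.
long total = 0;
Parallel.For(0, numbers.Length,
    () => 0L,                                    // per-thread initial state
    (i, state, local) => local + numbers[i],     // runs lock-free per index
    local => Interlocked.Add(ref total, local)); // merge: once per thread
```

As noted above, invoking a lambda per index has overhead; `Partitioner.Create(0, numbers.Length)` with the range-based `Parallel.ForEach` overload lets each call process a whole chunk instead.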
My machine has 4GB RAM, so initializing an int[1_000_000_000] results in memory paging. Going from int[100_000_000] to int[1_000_000_000] results in non-linear performance degradation (100x instead of 10x). Essentially a CPU-bound operation becomes I/O-bound. Instead of adding numbers, the program spends most of its time reading segments of the array from the disk. In these conditions using multiple threads can be detrimental for the overall performance, because the pattern of accessing the storage device becomes more erratic and less streamlined.
Maybe something similar happens on your 8GB RAM machine too, but I can't say for sure.
I am trying to find out why parallel foreach does not give the expected speedup on a machine with 32 physical cores and 64 logical cores with a simple test computation.
...
var parameters = new List<string>();
for (int i = 1; i <= 9; i++) {
    parameters.Add(i.ToString());
    if (Scenario.UsesParallelForEach)
    {
        Parallel.ForEach(parameters, parameter => {
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
            var lc = new LongComputation();
            lc.Compute();
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
        });
    }
    else
    {
        foreach (var parameter in parameters)
        {
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
            var lc = new LongComputation();
            lc.Compute();
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
        }
    }
}
...
class LongComputation
{
    public void Compute()
    {
        var s = "";
        for (int i = 0; i <= 40000; i++)
        {
            s = s + i.ToString() + "\n";
        }
    }
}
The Compute function takes about 5 seconds to complete. My assumption was that with the parallel foreach loop, each additional iteration creates a parallel thread running on one of the cores, taking only as long as computing the Compute function once. So, if I run the loop twice, the sequential foreach would take 10 seconds, but the parallel foreach only 5 seconds (assuming 2 cores are available). The speedup would be 2. If I run the loop three times, the sequential foreach would take 15 seconds, but again the parallel foreach only 5 seconds. The speedup would be 3, then 4, 5, 6, 7, 8, and 9. However, what I observe is a constant speedup of 1.3.
Sequential vs parallel foreach. X-axis: number of sequential/parallel execution of the computation. Y-axis: time in seconds
Speedup, time of the sequential foreach divided by parallel foreach
The event fired in FireOnParameterComputed is intended to be used in a GUI progress bar to show the progress. In the progress bar it can clearly be seen that for each iteration, a new thread is created.
My question is, why don't I see the expected speedup or at least close to the expected speedup?
Tasks aren't threads.
Sometimes starting a task will cause a thread to be created, but not always. Creating and managing threads consumes time and system resources. When a task only takes a short amount of time, even though it's counter-intuitive, the single-threaded model is often faster.
The CLR knows this and tries to make its best judgment on how to execute the task based on a number of factors including any hints that you've passed to it.
For Parallel.ForEach, if you're certain that you want multiple threads to be spawned, try passing in ParallelOptions.
Parallel.ForEach(parameters, new ParallelOptions { MaxDegreeOfParallelism = 100 }, parameter => {});
Now, I'm new to threading and async/sync programming and all that stuff. So I've been practicing and saw this problem on YouTube: summing all the contents of a byte array. It was from the channel called Jamie King; he did it with threads. I decided to do it with tasks. I made it asynchronous, and it was slower than the synchronous one; the difference between the two was 360 milliseconds! I wonder if any of you could do it faster in an asynchronous way. If so, please post it!
Here's mine:
static Random Random = new Random(999);
static byte[] byteArr = new byte[100_000_000];
static byte TaskCount = (byte)Environment.ProcessorCount;
static int readingLength;

static void Main(string[] args)
{
    for (int i = 0; i < byteArr.Length; i++)
    {
        byteArr[i] = (byte)Random.Next(11);
    }
    SumAsync(byteArr);
}

static async void SumAsync(byte[] bytes)
{
    readingLength = bytes.Length / TaskCount;
    int sum = 0;
    Console.WriteLine("Running...");
    Stopwatch watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < TaskCount; i++)
    {
        Task<int> task = SumPortion(bytes.SubArray(i * readingLength, readingLength));
        int result = await task;
        sum += result;
    }
    watch.Stop();
    Console.WriteLine("Done! Time took: {0}, Result: {1}", watch.ElapsedMilliseconds, sum);
}

static async Task<int> SumPortion(byte[] bytes)
{
    Task<int> task = Task.Run(() =>
    {
        int sum = 0;
        foreach (byte b in bytes)
        {
            sum += b;
        }
        return sum;
    });
    int result = await task;
    return result;
}
Note that bytes.SubArray is an extension method. I have one question. Is asynchronous programming slower than synchronous programming?
Please point out my mistakes.
Thanks for your time!
You need to collect all of the tasks first and await them with WhenAll() at the end:
static async Task SumAsync(byte[] bytes)
{
    readingLength = bytes.Length / TaskCount;
    Console.WriteLine("Running...");
    Stopwatch watch = new Stopwatch();
    watch.Start();
    var results = new Task<int>[TaskCount];
    for (int i = 0; i < TaskCount; i++)
    {
        Task<int> task = SumPortion(bytes.SubArray(i * readingLength, readingLength));
        results[i] = task;
    }
    int[] result = await Task.WhenAll(results);
    watch.Stop();
    Console.WriteLine("Done! Time took: {0}, Result: {1}", watch.ElapsedMilliseconds, result.Sum());
}
When you use the WhenAll() method, you combine all of the Task results, so the tasks run in parallel, saving you a lot of time.
You can read more about it in learn.microsoft.com.
Asynchronous code is not inherently slower - it runs work in the background (such as waiting for a connection to a website to be established) so that the main thread is not blocked while it waits for something to happen.
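For instance, asynchrony pays off when the waiting is for I/O rather than CPU (the URLs here are hypothetical placeholders):

```csharp
// Two downloads overlap while the calling thread stays free to do other work;
// no extra threads are burning CPU during the network waits.
using var client = new HttpClient();
Task<string> a = client.GetStringAsync("https://example.com/a"); // hypothetical endpoint
Task<string> b = client.GetStringAsync("https://example.com/b"); // hypothetical endpoint
string[] pages = await Task.WhenAll(a, b);
```

Summing a byte array has no such waits, which is why task-per-chunk parallelism, not asynchrony, is what helps there.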
The fastest way to do this is probably going to be to hand-roll a Parallel.ForEach() loop.
Plinq may not even give you a speedup in comparison to a single-threaded approach, and it certainly won't be as fast as Parallel.ForEach().
Here's some sample timing code. When you try this, make sure it's a RELEASE build and that you don't run it under the debugger (which will turn off the JIT optimiser, even if it's a RELEASE build):
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

namespace Demo
{
    static class Program
    {
        static void Main()
        {
            // Create some random bytes (using a seed to ensure it's the same bytes each time).
            var rng = new Random(12345);
            byte[] byteArr = new byte[500_000_000];
            rng.NextBytes(byteArr);

            // Time single-threaded Linq.
            var sw = Stopwatch.StartNew();
            long sum = byteArr.Sum(x => (long)x);
            Console.WriteLine($"Single-threaded Linq took {sw.Elapsed} to calculate sum as {sum}");

            // Time single-threaded loop.
            sw.Restart();
            sum = 0;
            foreach (var n in byteArr)
                sum += n;
            Console.WriteLine($"Single-threaded took {sw.Elapsed} to calculate sum as {sum}");

            // Time Plinq.
            sw.Restart();
            sum = byteArr.AsParallel().Sum(x => (long)x);
            Console.WriteLine($"Plinq took {sw.Elapsed} to calculate sum as {sum}");

            // Time Parallel.ForEach() with partitioner.
            sw.Restart();
            sum = 0;
            Parallel.ForEach
            (
                Partitioner.Create(0, byteArr.Length),
                () => 0L,
                (subRange, loopState, threadLocalState) =>
                {
                    for (int i = subRange.Item1; i < subRange.Item2; i++)
                        threadLocalState += byteArr[i];
                    return threadLocalState;
                },
                finalThreadLocalState =>
                {
                    Interlocked.Add(ref sum, finalThreadLocalState);
                }
            );
            Console.WriteLine($"Parallel.ForEach with partioner took {sw.Elapsed} to calculate sum as {sum}");
        }
    }
}
The results I get with an x64 build on my octo-core PC are:
Single-threaded Linq took 00:00:03.1160235 to calculate sum as 63748717461
Single-threaded took 00:00:00.7596687 to calculate sum as 63748717461
Plinq took 00:00:01.0305913 to calculate sum as 63748717461
Parallel.ForEach with partioner took 00:00:00.0839141 to calculate sum as 63748717461
The results I get with an x86 build are:
Single-threaded Linq took 00:00:02.6964067 to calculate sum as 63748717461
Single-threaded took 00:00:00.8200462 to calculate sum as 63748717461
Plinq took 00:00:01.1251899 to calculate sum as 63748717461
Parallel.ForEach with partioner took 00:00:00.1084805 to calculate sum as 63748717461
As you can see, the Parallel.ForEach() with the x64 build is fastest (probably because it's calculating a long total, rather than because of the larger address space).
The Plinq is around three times faster than the Linq non-threaded solution.
The Parallel.ForEach() with a partitioner is more than 30 times faster.
But notably, the non-linq single-threaded code is faster than the Plinq code. In this case, using Plinq is pointless; it makes things slower!
This tells us that the speedup isn't just from multithreading - it's also related to the overhead of Linq and Plinq in comparison to hand-rolling the loop.
Generally speaking, you should only use Plinq when the processing of each element takes a relatively long time (and adding a byte to a running total takes a very short time).
The advantage of Plinq over Parallel.ForEach() with a partitioner is that it is much simpler to write - however, if it winds up being slower than a simple foreach loop then its utility is questionable. So timing things before choosing a solution is very important!
I wrote a class which uses Stopwatch to profile methods and for/foreach loops. With for and foreach loops it tests a standard loop against a Parallel.For or Parallel.ForEach implementation.
You would write performance tests like so:
Method:
PerformanceResult result = Profiler.Execute(() => { FooBar(); });
For loop:
SerialParallelPerformanceResult result = Profiler.For(0, 100, x => { FooBar(x); });
ForEach loop:
SerialParallelPerformanceResult result = Profiler.ForEach(list, item => { FooBar(item); });
Whenever I run the tests (one of .Execute, .For or .ForEach) I put them in a loop so I can see how the performance changes over time.
Example of performance might be:
Method execution 1 = 200ms
Method execution 2 = 12ms
Method execution 3 = 0ms
For execution 1 = 300ms (Serial), 100ms (Parallel)
For execution 2 = 20ms (Serial), 75ms (Parallel)
For execution 3 = 2ms (Serial), 50ms (Parallel)
ForEach execution 1 = 350ms (Serial), 300ms (Parallel)
ForEach execution 2 = 24ms (Serial), 89ms (Parallel)
ForEach execution 3 = 1ms (Serial), 21ms (Parallel)
My questions are:
Why does performance change over time, what is .NET doing in the background to facilitate this?
How/why is a serial operation faster than a parallel one? I have made sure that I make the operations complex to see the difference properly...in most cases serial operations seem faster!?
NOTE: For parallel processing I am testing on an 8 core machine.
After some more exploration into performance profiling, I have discovered that using a Stopwatch is not an accurate way to measure the performance of a particular task
(Thanks hatchet and Loren for your comments on this!)
Reasons a stopwatch is not accurate:
Measurements are calculated in elapsed time in milliseconds, not CPU time.
Measurements can be influenced by background "noise" and thread intensive processes.
Measurements do not take into account JIT compilation and overhead.
That being said, using a stopwatch is OK for casual exploration of performance. With that in mind, I have improved my profiling algorithm somewhat.
Where before it simply executed the expression that was passed to it, it now has the facility to iterate over the expression several times, building an average execution time. The first run can be omitted since this is where JIT kicks in, and some major overhead may occur. Understandably, this will never be as sophisticated as using a professional profiling tool like Redgate's ANTS profiler, but it's OK for simpler tasks!
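A minimal sketch of that improved approach (my reconstruction, not the profiler class's actual code): discard the first JIT-affected run and average the rest:

```csharp
// Runs the action once as a warm-up (where JIT compilation occurs),
// then averages the elapsed time over several measured runs.
static double ProfileAverage(Action method, int runs = 5)
{
    method(); // warm-up run: JIT cost lands here and is excluded from timing
    double totalMs = 0;
    for (int i = 0; i < runs; i++)
    {
        var sw = Stopwatch.StartNew();
        method();
        sw.Stop();
        totalMs += sw.Elapsed.TotalMilliseconds;
    }
    return totalMs / runs; // mean over the post-warm-up runs
}
```

This doesn't eliminate background noise or GC pauses, but averaging over several runs dampens their effect.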
As per my comment above: I did some simple tests on my own and found no difference over time. Can you share your code? I'll put mine in an answer as it doesn't fit here.
This is my sample code.
(I also tried with both static and instance methods with no difference)
class Program
{
    static void Main(string[] args)
    {
        int to = 50000000;
        OtherStuff os = new OtherStuff();
        Console.WriteLine(Profile(() => os.CountTo(to)));
        Console.WriteLine(Profile(() => os.CountTo(to)));
        Console.WriteLine(Profile(() => os.CountTo(to)));
    }

    static long Profile(Action method)
    {
        Stopwatch st = Stopwatch.StartNew();
        method();
        st.Stop();
        return st.ElapsedMilliseconds;
    }
}

class OtherStuff
{
    public void CountTo(int to)
    {
        for (int i = 0; i < to; i++)
        {
            // some work...
            i++;
            i--;
        }
    }
}
A sample output would be:
331
331
334
Consider executing this method instead:
class OtherStuff
{
public string CountTo(Guid id)
{
using(SHA256 sha = SHA256.Create())
{
int x = default(int);
for (int index = 0; index < 16; index++)
{
x = id.ToByteArray()[index] >> 32 << 16;
}
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
byte[] y = new byte[1024];
rng.GetBytes(y);
y = y.Concat(BitConverter.GetBytes(x)).ToArray();
return BitConverter.ToString(sha.ComputeHash(BitConverter.GetBytes(x).Where(o => o >> 2 < 0).ToArray()));
}
}
}
Sample output:
11
0
0
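The 11 / 0 / 0 output illustrates the granularity problem: after the first (JIT-affected) run, the method completes in well under a millisecond, so ElapsedMilliseconds truncates to 0. One way to see what is actually happening (a sketch, not part of the code above) is to convert ticks through Stopwatch.Frequency for sub-millisecond resolution:

```csharp
using System;
using System.Diagnostics;

static class MicroProfiler
{
    // Returns the elapsed time of a single call in microseconds.
    // Stopwatch.Frequency is the number of ticks per second.
    public static double ProfileMicroseconds(Action method)
    {
        var sw = Stopwatch.StartNew();
        method();
        sw.Stop();
        return sw.ElapsedTicks * 1_000_000.0 / Stopwatch.Frequency;
    }
}
```

With this, the second and third runs report a small positive number instead of 0, making the difference between runs visible.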
Given a List of updated entity objects, is it safe to instantiate a new context per iteration in a Parallel.For or foreach loop and call SubmitChanges() every (say) 10,000 iterations?
Is it safe performing bulk updates this way? What are the possible drawbacks?
This may be a scenario where parallelism should be avoided.
Instantiating a new DataContext per iteration means that within each iteration a connection is acquired from the connection pool, opened, and a single entity written to the database before the connection is returned to the pool. Doing this on every iteration is a comparatively expensive operation, generating overhead that outweighs the advantages of parallelism. Adding entities to a single data context and writing them to the database as one action is more efficient.
Using the following as a benchmark for the Parallel insertions:
private static TimeSpan RunInParallel(int inserts)
{
Stopwatch watch = new Stopwatch();
watch.Start();
Parallel.For(0, inserts, new ParallelOptions() { MaxDegreeOfParallelism = 100 },
(i) =>
{
using (var context = new DataClasses1DataContext())
{
context.Tables.InsertOnSubmit(new Table() { Number = i });
context.SubmitChanges();
}
}
);
watch.Stop();
return watch.Elapsed;
}
For serial insertions:
private static TimeSpan RunInSerial(int inserts)
{
Stopwatch watch = new Stopwatch();
watch.Start();
using (var ctx = new DataClasses1DataContext())
{
for (int i = 0; i < inserts; i++)
{
ctx.Tables.InsertOnSubmit(new Table() { Number = i });
}
ctx.SubmitChanges();
}
watch.Stop();
return watch.Elapsed;
}
Where DataClasses1DataContext is an automatically generated DataContext for a simple table with a single Number column.
When run on a first generation Intel i7 (8 logical cores) the following results were obtained:
10 inserts:
Average time elapsed for a 100 runs in parallel: 00:00:00.0202820
Average time elapsed for a 100 runs in serial: 00:00:00.0108694
100 inserts:
Average time elapsed for a 100 runs in parallel: 00:00:00.2269799
Average time elapsed for a 100 runs in serial: 00:00:00.1434693
1000 inserts:
Average time elapsed for a 100 runs in parallel: 00:00:02.1647577
Average time elapsed for a 100 runs in serial: 00:00:00.8163786
10000 inserts:
Average time elapsed for a 10 runs in parallel: 00:00:22.7436584
Average time elapsed for a 10 runs in serial: 00:00:07.7273398
In general, when run in parallel the insertions take approximately two to three times as long to execute as when run without parallelism.
UPDATE:
If you can implement some batching scheme for the data, it might be beneficial to use parallel insertions.
When using batches, the size of the batch will affect the insertion performance so some optimal ratio between the number of entries per batch and number of batches inserted will have to be determined. To demonstrate this the following method was used to batch 10000 inserts into groups of 1 (10000 batches, same as the initial parallel approach), 10 (1000 batches), 100 (100 batches), 1000 (10 batches), 10000 (1 batch, same as the serial insertion approach) then insert each batch in parallel:
private static TimeSpan RunAsParallelBatches(int inserts, int batchSize)
{
Stopwatch watch = new Stopwatch();
watch.Start();
// batch the data to be inserted
// (note: any remainder when inserts is not evenly divisible by batchSize
// is dropped; the benchmark only uses evenly divisible sizes)
List<List<int>> batches = new List<List<int>>();
for (int g = 0; g < inserts / batchSize; g++)
{
List<int> numbers = new List<int>();
int start = g * batchSize;
int end = start + batchSize;
for (int i = start; i < end; i++)
{
numbers.Add(i);
}
batches.Add(numbers);
}
// insert each batch in parallel
Parallel.ForEach(batches,
(batch) =>
{
using (DataClasses1DataContext ctx = new DataClasses1DataContext())
{
foreach (int number in batch)
{
ctx.Tables.InsertOnSubmit(new Table() { Number = number });
}
ctx.SubmitChanges();
}
}
);
watch.Stop();
return watch.Elapsed;
}
Taking the average time over 10 runs of 10000 insertions gives the following results:
10000 inserts repeated 10 times
Average time for initial parallel insertion approach: 00:00:22.7436584
Average time in parallel using batches of 1 entity (10000 batches): 00:00:23.1088289
Average time in parallel using batches of 10 entities (1000 batches): 00:00:07.1443220
Average time in parallel using batches of 100 entities (100 batches): 00:00:04.3111268
Average time in parallel using batches of 1000 entities (10 batches): 00:00:04.0668334
Average time in parallel using batches of 10000 entities (1 batch): 00:00:08.2820498
Average time for serial insertion approach: 00:00:07.7273398
So by batching the insertions into groups, a performance increase can be gained as long as enough work is performed within each iteration to outweigh the overhead of setting up the DataContext and performing the batch insertion. In this case, by batching the insertions into groups of 1000, the parallel insertion managed to outperform the serial approach by roughly 2x on this system.
This can be done safely and will yield better performance. You need to make sure that:
you never access the same DataContext concurrently
you are inserting batches of rows (say 100 to 10,000 at a time). This keeps the overhead of instantiating the DataContext and opening connections low.
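The two rules above can be sketched with .NET 6's Enumerable.Chunk. Since a real DataContext needs a database, the per-batch "sink" below is simulated with a thread-safe collection; the commented line shows where the real context usage would go:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class BatchingDemo
{
    // Splits `values` into batches of `batchSize` (Chunk keeps any remainder
    // as a final, smaller batch) and processes each batch in parallel.
    // Each parallel body would create its own context, so nothing is shared.
    static ConcurrentBag<int[]> ProcessInParallelBatches(int[] values, int batchSize)
    {
        var submitted = new ConcurrentBag<int[]>();
        Parallel.ForEach(values.Chunk(batchSize), batch =>
        {
            // real code: using (var ctx = new DataClasses1DataContext())
            // { foreach (int n in batch) ctx.Tables.InsertOnSubmit(new Table { Number = n }); ctx.SubmitChanges(); }
            submitted.Add(batch);
        });
        return submitted;
    }

    static void Main()
    {
        var result = ProcessInParallelBatches(Enumerable.Range(0, 25).ToArray(), 10);
        Console.WriteLine(result.Count); // 3 batches: 10 + 10 + 5
    }
}
```

Unlike the batching loop in RunAsParallelBatches, Chunk also handles counts that are not evenly divisible by the batch size.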