I am currently optimizing data processing logic for parallel execution.
I have noticed that, as the core count increases, data processing performance does not necessarily increase the way I would expect it to.
Here is the test code:
Console.WriteLine($"{DateTime.Now}: Data processing start");
double lastElapsedMs = 0;
for (int i = 1; i <= Environment.ProcessorCount; i++)
{
var watch = System.Diagnostics.Stopwatch.StartNew();
ProcessData(i); // main processing method
watch.Stop();
double elapsedMs = watch.ElapsedMilliseconds;
Console.WriteLine($"{DateTime.Now}: Core count: {i}, Elapsed: {elapsedMs}ms");
lastElapsedMs = elapsedMs;
}
Console.WriteLine($"{DateTime.Now}: Data processing end");
and
public static void ProcessData(int coreCount)
{
// First part is data preparation.
// splitting 1 collection into smaller chunks, depending on core count
////////////////
// combinations = collection of data
var length = combinations.Length;
int chunkSize = length / coreCount;
int[][][] chunked = new int[coreCount][][];
for (int i = 0; i < coreCount; i++)
{
int skip = i * chunkSize;
int take = chunkSize;
int diff = (length - skip) - take;
if (diff < chunkSize)
{
take = take + diff;
}
var sub = combinations.Skip(skip).Take(take).ToArray();
chunked[i] = sub.ToArray();
}
// Second part is iteration: one chunk of data is processed per core.
////////////////
Parallel.For(0, coreCount, new ParallelOptions() { MaxDegreeOfParallelism = coreCount }, (chunkIndex, state) =>
{
var chunk = chunked[chunkIndex];
int chunkLength = chunk.Length;
// iterate over the data inside the chunk
for (int idx = 0; idx < chunkLength; idx++)
{
// additional processing logic here for single data
}
});
}
The results are as follows:
As you can see from the result set, using 2 cores instead of 1 gives an almost ideal performance increase (given that 1 core runs at 4700 MHz while 2 cores run at 4600 MHz each).
After that, when the data was processed in parallel on 3 cores, I expected a 33% performance increase compared to the 2-core execution. The actual increase is 21.62%.
As the core count increases further, the degradation of "parallel" execution performance keeps growing.
With the 12-core result, the actual time is more than twice the ideal one (96442 ms vs 39610 ms)!
I certainly did not expect the difference to be that big. I have an Intel 8700K processor: 6 physical cores plus 6 Hyper-Threading (logical) ones, 12 threads in total. Turbo frequencies are 1C 4700 MHz, 2C 4600, 3C 4500, 4C 4400, 5C 4400, 6C 4300.
If it matters - I have done additional observations in Core-temp:
when 1 core processing was running - 1 out of 6 cores was busy 50%
when 2 core processing was running - 2 out of 6 cores were busy 50%
when 3 core processing was running - 3 out of 6 cores were busy 50%
when 4 core processing was running - 4 out of 6 cores were busy 50%
when 5 core processing was running - 5 out of 6 cores were busy 50%
when 6 core processing was running - 6 out of 6 cores were busy 50%
when 7 core processing was running - 5 out of 6 cores were busy 50%, 1 core 100%
when 8 core processing was running - 4 out of 6 cores were busy 50%, 2 cores 100%
when 9 core processing was running - 3 out of 6 cores were busy 50%, 3 cores 100%
when 10 core processing was running - 2 out of 6 cores were busy 50%, 4 cores 100%
when 11 core processing was running - 1 out of 6 cores was busy 50%, 5 cores 100%
when 12 core processing was running - all 6 cores at 100%
I can certainly see that the end result should not be as performant as the ideal result, because the frequency per core decreases, but still: is there a good explanation for why my code performs so poorly at 12 cores? Is this the general situation on every machine, or perhaps a limitation of my PC?
.NET Core 2 was used for the tests.
Edit: Sorry, I forgot to mention that the data chunking can be optimized, since I wrote it as a draft solution. Nevertheless, the splitting finishes within about a second, so it adds at most 1000-2000 ms to the reported execution time.
Edit 2: I have just gotten rid of all the chunking logic and removed the MaxDegreeOfParallelism property. The data is processed as is, in parallel. The execution time is now 94196 ms, which is basically the same as before once chunking time is excluded. It seems .NET is smart enough to chunk the data at runtime, so the extra code is not necessary unless I want to limit the number of cores used. This did not notably increase the performance, however. I am leaning towards the "Amdahl's law" explanation, since nothing I have done has improved performance beyond the margin of error.
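For reference, the Edit 2 variant boils down to something like this sketch (ProcessSingleCombination is a placeholder for the per-item logic, which is not shown in the question):
// No manual chunking and no MaxDegreeOfParallelism: the TPL's default
// partitioner splits the work across the available cores at runtime.
Parallel.ForEach(combinations, combination =>
{
    ProcessSingleCombination(combination); // additional processing logic for one item
});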
Yeah, Amdahl's law. Performance speedup is never linear with the number of cores thrown at the problem.
Also reciprocals...
Your chunking code runs once no matter how many processors you have, yet it still depends on the number of processors.
Especially the Skip/Take part followed by the double ToArray() call seems very much in need of optimization. See How to copy part of an array to another array in C#? on how to copy an array without traversing the whole thing multiple times.
That should do a lot for bringing your performance closer to what you'd expect. That said, the work of branching out and combining the results will always degrade the performance of parallel execution. "Maximum parallelism" is not exactly something to strive for. There is always a sweet spot where the parallelization outweighs the hit from preparing for it. You need to find that, or let .NET take care of it by leaving out the manual override for cores.
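For illustration, a sketch of the preparation loop using Array.Copy instead of Skip/Take (variable names mirror the question's code; this is an assumption about the intended replacement, not code from the linked answer):
// Sketch: each chunk is copied directly with Array.Copy, so the source
// array is no longer re-traversed from the start for every chunk.
int[][][] chunked = new int[coreCount][][];
for (int i = 0; i < coreCount; i++)
{
    int skip = i * chunkSize;
    int take = (i == coreCount - 1) ? length - skip : chunkSize; // last chunk absorbs the remainder
    chunked[i] = new int[take][];
    Array.Copy(combinations, skip, chunked[i], 0, take);
}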
As nvoigt pointed out, the chunking code is run on a single core and is slow. Look at these two lines:
var sub = combinations.Skip(skip).Take(take).ToArray();
chunked[i] = sub.ToArray();
Skip/Take inside a loop is a Schlemiel the Painter performance issue; use a different method.
sub is already a perfectly good array, so why make another copy on the next line? Array allocation isn't zero-cost.
I think ArraySegment is a good fit for this problem instead of making array copies. At the least, you can ToArray an ArraySegment more efficiently than what you're currently doing.
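For example, a hedged sketch of the ArraySegment variant (again reusing the question's variable names; no per-chunk copies are made at all):
// Sketch: each "chunk" is just a view (offset + count) over the original array.
var segments = new ArraySegment<int[]>[coreCount];
for (int i = 0; i < coreCount; i++)
{
    int skip = i * chunkSize;
    int take = (i == coreCount - 1) ? length - skip : chunkSize;
    segments[i] = new ArraySegment<int[]>(combinations, skip, take);
}

Parallel.For(0, coreCount, chunkIndex =>
{
    var segment = segments[chunkIndex];
    for (int idx = 0; idx < segment.Count; idx++)
    {
        var item = segment.Array[segment.Offset + idx];
        // per-item processing logic here
    }
});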
Related
Multithreading with IEnumerables that are evaluated several times in parallel and are expensive to evaluate does not use 100% CPU. An example is the Aggregate() function combined with Concat():
// Initialisation.
// Each IEnumerable<string> is made so that it takes time to evaluate it
// every time it is accessed.
IEnumerable<string>[] iEnumerablesArray = ...
// The line of the question (using less than 100% CPU):
Parallel.For(0, 1000000, _ => iEnumerablesArray.Aggregate(Enumerable.Concat).ToList());
Question: why does parallel code in which IEnumerables are evaluated several times in parallel not use 100% CPU? The code does not use locks or waits, so this behaviour is unexpected. Full code to reproduce it is at the end of the post.
Notes and Edits:
Interesting fact: If the code
Enumerable.Range(0, 1).Select(__ => GenerateLongString())
of the full code at the end is changed to
Enumerable.Range(0, 1).Select(__ => GenerateLongString()).ToArray().AsEnumerable(),
then the initialisation takes seconds, and after that the CPU is used at 100% (the problem does not occur).
Interesting fact 2 (from a comment): when the method GenerateLongString() is made less heavy on GC and more CPU-intensive, the CPU goes to 100%. So the cause is connected to the implementation of this method. Interestingly, though, if the current form of GenerateLongString() is called without the IEnumerable wrapper, the CPU also goes to 100%:
Parallel.For(0, int.MaxValue, _ => GenerateLongString());
So the heaviness of GenerateLongString() is not the only problem here.
Fact 3 (from a comment): the suggested concurrency visualizer revealed that threads spend most of their time on the line
clr.dll!WKS::gc_heap::wait_for_gc_done,
waiting for the GC to finish. This happens inside the string.Concat() call of GenerateLongString().
The same behaviour is observed when manually starting multiple Task.Factory.StartNew() or Thread.Start() calls.
The same behaviour is observed on Win 10 and Windows Server 2012
The same behaviour is observed on real machine and virtual machine
Release vs. Debug does not matter.
.Net version tested: 4.7.2
The Full Code:
class Program
{
const int DATA_SIZE = 10000;
const int IENUMERABLE_COUNT = 10000;
static void Main(string[] args)
{
// initialisation - takes milliseconds
IEnumerable<string>[] iEnumerablesArray = GenerateArrayOfIEnumerables();
Console.WriteLine("Initialized");
List<string> result = null;
// =================
// THE PROBLEM LINE:
// =================
// CPU usage of next line:
// - 40 % on 4 virtual cores processor (2 physical)
// - 10 - 15 % on 12 virtual cores processor
Parallel.For(
0,
int.MaxValue,
(i) => result = iEnumerablesArray.Aggregate(Enumerable.Concat).ToList());
// just to be sure that Release mode would not omit some lines:
Console.WriteLine(result);
}
static IEnumerable<string>[] GenerateArrayOfIEnumerables()
{
return Enumerable
.Range(0, IENUMERABLE_COUNT)
.Select(_ => Enumerable.Range(0, 1).Select(__ => GenerateLongString()))
.ToArray();
}
static string GenerateLongString()
{
return string.Concat(Enumerable.Range(0, DATA_SIZE).Select(_ => "string_part"));
}
}
The fact that your threads are blocked on clr.dll!WKS::gc_heap::wait_for_gc_done shows that the garbage collector is the bottleneck of your application. As much as possible, you should try to limit the number of heap allocations in your program, to put less stress on the gc.
That said, there is another way to speed-up things. Per default, on desktop, the GC is configured to use limited resources on the computer (to avoid slowing down other applications). If you want to fully use the resources available, then you can activate server GC. This mode assumes that your application is the most important thing running on the computer. It will provide a significant performance boost, but use a lot more CPU and memory.
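For reference, on the .NET Framework (4.7.2 here) server GC is switched on in App.config by adding <gcServer enabled="true" /> under the <runtime> element; on .NET Core the equivalent is the System.GC.Server runtimeconfig setting. A small sketch (using the real GCSettings API; the class wrapper is mine) to confirm at runtime which mode actually took effect:
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // True only when the server GC setting took effect;
        // false means the default workstation GC is still in use.
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
    }
}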
I have the following issue:
I am using a Parallel.ForEach loop for a pretty CPU-intensive workload (applying a method to a number of items), and it works fine for about the first 80% of the items, using all CPU cores nicely.
As the iteration comes near its end (around 80%, I would say), I see the number of busy threads going down core by core, and at the end the last roughly 5% of the items are processed by only two cores. So instead of using all cores until the end, it slows down quite hard toward the end of the iteration.
Please note that the workload can differ greatly per item: one can last 1-2 seconds, another can take 2-3 minutes to finish.
Any idea or suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Length);
using (SqlConnection connection = new SqlConnection(cnStr))
{
    connection.Open();
    try
    {
        Parallel.ForEach(rangePartitioner, (range, loopState) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                CPUIntensiveMethod(source[i]);
            }
        });
    }
    catch (AggregateException ae)
    {
        // exception handling
    }
}
This is an unavoidable consequence of the fact that the parallelism is per computation. Clearly, the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000 s to run) and the rest quick (say 1 s to run). You kick them off in a random order across 8 threads. It's clear that eventually each thread will be working on one of your long-running items; at this point you see full utilisation. Eventually the one(s) that hit their long op(s) first will finish them and quickly work through any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off, i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long-running items might be amenable to 'internal parallelism', allowing them to have a faster minimum runtime.
Your long-running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for as long as possible).
(See the update below.) DON'T use partitioning in cases where the body can be long-running, as this simply increases the 'hit' of this effect (i.e. get rid of your rangePartitioner entirely). This will massively reduce the impact of this effect on your particular loop; a sketch follows the update below.
Either way, your batch run-time is bounded by the run-time of the slowest item in the batch.
Update: I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect: you are saying 'break this work-set down into N work-sets' and then parallelising the running of those N work-sets. In the example above this could mean that (say) 3 of the long ops end up in the same work-set, so they are all processed on the same thread. As such, you should NOT be using partitioning if the inner body can be long-running. For example, the docs on partitioning at https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx say it is aimed at short bodies.
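A sketch of the same loop without the range partitioner (everything else taken from the question's code):
// Sketch: let Parallel.ForEach schedule items itself instead of pre-assigning
// fixed index ranges to each worker.
Parallel.ForEach(source, item =>
{
    CPUIntensiveMethod(item);
});

// Or explicitly ask for a load-balancing partitioner over the array
// (the second argument, true, requests load balancing):
Parallel.ForEach(Partitioner.Create(source, true), item =>
{
    CPUIntensiveMethod(item);
});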
If you have multiple threads that each process the same number of items, and each item takes a varying amount of time, then of course some threads will finish earlier.
If you use a collection whose size is not known, the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach is the producer-consumer pattern:
https://msdn.microsoft.com/en-us/library/dd997371
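A minimal producer-consumer sketch with BlockingCollection (WorkItem, myList and CPUIntensiveMethod are placeholders borrowed from the question; this is an illustration, not the linked article's code):
// Requires System.Collections.Concurrent, System.Linq, System.Threading.Tasks.
// A bounded queue feeds worker tasks one item at a time, so a 2-3 minute item
// only ever occupies a single worker while the others keep draining the queue.
var queue = new BlockingCollection<WorkItem>(boundedCapacity: 100);

var producer = Task.Run(() =>
{
    foreach (var item in myList)
        queue.Add(item);
    queue.CompleteAdding(); // tell the consumers no more items will arrive
});

var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var item in queue.GetConsumingEnumerable())
            CPUIntensiveMethod(item);
    }))
    .ToArray();

producer.Wait();
Task.WaitAll(consumers);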
In my application
int numberOfTimes = 1; //Or 100, or 100000
//Incorrect, please see update.
var tasks = Enumerable.Repeat(
(new HttpClient()).GetStringAsync("http://www.someurl.com")
, numberOfTimes);
var resultArray = await Task.WhenAll(tasks);
With numberOfTimes == 1, it takes 5 seconds.
With numberOfTimes == 100000, it still takes 5 seconds.
That's amazing.
But does that mean I can run unlimited calls in parallel? There has to be some limit where this starts to queue up?
What is that limit? Where is it set? What does it depend on?
In other words: how many I/O completion ports are there? Who is competing for them? Does IIS get its own set of I/O completion ports?
--This is in an ASP.Net MVC action, .Net 4.5.2, IIS
Update: thanks to @Enigmativity, the following is more relevant to the question:
var tasks = Enumerable.Range(1, numberOfTimes ).Select(i =>
(new HttpClient()).GetStringAsync("http://deletewhenever.com/api/default"));
var resultArray = await Task.WhenAll(tasks);
With numberOfTimes == 1, it takes 5 seconds.
With numberOfTimes == 100, it still takes 5 seconds.
I am seeing more believable numbers for higher counts now though. The question remains, what governs the number?
What is that limit? Where is that set?
There's no explicit limit. However, you will eventually run out of resources. Mark Russinovich has an interesting blog series on probing the limits of common resources.
Asynchronous operations generally increase memory usage in exchange for responsiveness. So, each naturally-async op uses at least memory for its Task, an OVERLAPPED struct, and an IRP for the driver (each of these represents an in-progress asynchronous operation at different levels). At the lower levels, there are lots and lots of different limitations that can come into play to affect system resources (for an example, I have an old blog post where I had to calculate the maximum size of an I/O buffer - something you would think is simple but is really not).
Socket operations require a client port, which are (in theory) limited to 64k connections to the same remote IP. Sockets also have their own more significant memory overhead, with both input and output buffers at the device level and in user space.
The IOCP doesn't come into play until the operations complete. On .NET, there's only one IOCP for your AppDomain. The default maximum number of I/O threads servicing this IOCP is 1000 on the modern (4.5) .NET framework. Note that this is a limit on how many operations may complete at a time, not how many may be in progress at a time.
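If you want to see those thread-pool numbers on your own machine, the ThreadPool API exposes them (a quick diagnostic sketch, not part of the original answer):
// Max/min counts for worker threads and for the I/O completion port (IOCP)
// threads; the IOCP figure caps how many completions run at once, not how
// many operations can be in flight.
int maxWorker, maxIo, minWorker, minIo;
ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
ThreadPool.GetMinThreads(out minWorker, out minIo);
Console.WriteLine($"Worker threads: {minWorker}-{maxWorker}, IOCP threads: {minIo}-{maxIo}");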
Here's a test to see what's going on.
Start with this code:
var i = 0;
Func<int> generate = () =>
{
Thread.Sleep(1000);
return i++;
};
Now call this:
Enumerable.Repeat(generate(), 5)
After one second you get { 0, 0, 0, 0, 0 }.
But make this call:
Enumerable.Range(0, 5).Select(n => generate())
After five seconds you get { 0, 1, 2, 3, 4 }.
It's only calling the async function once in your code.
I am trying to measure the DDR3 memory data transfer rate through a test. According to the CPU spec, the maximum theoretical bandwidth is 51.2 GB/s. This should be the combined bandwidth of four channels, meaning 12.8 GB/s per channel. However, this is a theoretical limit, and in this post I am curious how to get closer to it in practice. In the test scenario described below I achieve a ~14 GB/s data transfer rate, which I believe may be a close approximation once most of the throughput boost from the CPU's L1, L2, and L3 caches is killed.
Update 20/3 2014: The assumption about killing the L1-L3 caches is wrong. The hardware prefetching of the memory controller analyzes the data access pattern, and since the pattern is sequential, it has an easy job prefetching data into the CPU caches.
Specific questions follow at the bottom, but mainly I am interested in a) a verification of the assumptions leading up to this result, and b) whether there is a better way to measure memory bandwidth in .NET.
I have constructed a test in C# on .NET as a starter. Although .NET is not ideal from a memory allocation perspective, I think it is doable for this test (please let me know if you disagree and why). The test is to allocate an int64 array and fill it with integers. This array should have its data aligned in memory. Then I simply loop over this array using as many threads as I have cores on the machine, read the int64 value from the array, and assign it to a public field in the test class. Since the result field is public, I should avoid the compiler optimising away the work in the loop. Furthermore, and this may be a weak assumption, I think the result stays in a register and is not written to memory until it is overwritten again. Between reads of elements in the array I use a variable Step offset of 10, 100, and 1000 so that I cannot fetch many references from the same cache block (64 bytes).
Reading an Int64 from the array should mean a lookup read of 8 bytes and then a read of the actual value of another 8 bytes. Since data is fetched from memory in 64-byte cache lines, each read in the array should correspond to a 64-byte read from RAM each time in the loop, given that the data read is not located in any CPU cache.
Here is how I initialize the data array:
_longArray = new long[Config.NbrOfCores][];
for (int threadId = 0; threadId < Config.NbrOfCores; threadId++)
{
_longArray[threadId] = new long[Config.NmbrOfRequests];
for (int i = 0; i < Config.NmbrOfRequests; i++)
_longArray[threadId][i] = i;
}
And here is the actual test:
GC.Collect();
timer.Start();
Parallel.For(0, Config.NbrOfCores, threadId =>
{
var intArrayPerThread = _longArray[threadId];
for (int redo = 0; redo < Config.NbrOfRedos; redo++)
for (long i = 0; i < Config.NmbrOfRequests; i += Config.Step)
_result = intArrayPerThread[i];
});
timer.Stop();
Since the data summary is quite important for the result I give this info too (can be skipped if you trust me...)
var timetakenInSec = timer.ElapsedMilliseconds / (double)1000;
long totalNbrOfRequest = Config.NmbrOfRequests / Config.Step * Config.NbrOfCores*Config.NbrOfRedos;
var throughput_ReqPerSec = totalNbrOfRequest / timetakenInSec;
var throughput_BytesPerSec = throughput_ReqPerSec * byteSizePerRequest;
var timeTakenPerRequestInNanos = Math.Round(1e6 * timer.ElapsedMilliseconds / totalNbrOfRequest, 1);
var resultMReqPerSec = Math.Round(throughput_ReqPerSec/1e6, 1);
var resultGBPerSec = Math.Round(throughput_BytesPerSec/1073741824, 1);
var resultTimeTakenInSec = Math.Round(timetakenInSec, 1);
Omitting the actual output-rendering code, I get the following results:
Step 10: Throughput: 570,3 MReq/s and 34 GB/s (64B), Timetaken/request: 1,8 ns/req, Total TimeTaken: 12624 msec, Total Requests: 7 200 000 000
Step 100: Throughput: 462,0 MReq/s and 27,5 GB/s (64B), Timetaken/request: 2,2 ns/req, Total TimeTaken: 15586 msec, Total Requests: 7 200 000 000
Step 1000: Throughput: 236,6 MReq/s and 14,1 GB/s (64B), Timetaken/request: 4,2 ns/req, Total TimeTaken: 30430 msec, Total Requests: 7 200 000 000
Using 12 threads instead of 6 (since the CPU is hyper-threaded) I get pretty much the same throughput (as expected, I think): 32.9 / 30.2 / 15.5 GB/s.
As can be seen, throughput drops as the step increases, which I think is normal. Partly I think it is because the 12 MB L3 cache forces more cache misses, and partly it may be that the memory controller's prefetch mechanism does not work as well when the reads are so far apart. I further believe that the step 1000 result is the closest to the actual practical memory speed, since it should kill most of the CPU caches and "hopefully" defeat the prefetch mechanism. Furthermore, I am assuming that most of the overhead in this loop is the memory fetch operation and not something else.
The hardware for this test is:
Intel Core i7-3930 (specs: CPU brief, more detailed, and really detailed spec) with 32 GB of DDR3-1600 memory in total.
Open questions
Am I correct in the assumptions made above?
Is there a way to increase the use of the memory bandwidth? For instance by doing it in C/C++ instead and spreading memory allocation out more on the heap, enabling all four memory channels to be used.
Is there a better way to measure the memory data transfer?
Much obliged for input on this. I know it is a complex area under the hood...
All code here is available for download at https://github.com/Toby999/ThroughputTest. Feel free to contact me at a forwarding email, tobytemporary[at]gmail.com.
The decrease in throughput as you increase step is likely caused by the memory prefetching not working well anymore if you don't stride linearly through memory.
Things you can do to improve the speed:
The test speed will be artificially bound by the loop itself taking up CPU cycles. As Roy shows, more speed can be achieved by unrolling the loop.
You should get rid of boundary checking (with "unchecked")
Instead of using Parallel.For, use Thread.Start and pin each thread you start on a separate core (using the code from here: Set thread processor affinity in Microsoft .Net)
Make sure all threads start at the same time, so you don't measure any stragglers (you can do this by spinning on a memory address that you Interlocked.Exchange to a new value once all threads are running and spinning)
On a NUMA machine (for example a 2 Socket Modern Xeon), you may have to take extra steps to allocate memory on the NUMA node that a thread will live on. To do this, you need to PInvoke VirtualAllocExNuma
Speaking of memory allocations, using Large Pages should provide yet another boost
While .NET isn't the easiest framework to use for this type of testing, it IS possible to coax it into doing what you want.
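A rough C# sketch of the unrolled-read idea (it reads contiguously in the spirit of Roy's burst test rather than with the question's Step stride; _longArray, threadId and _result come from the question, and the unroll factor of 8 is an arbitrary choice):
// Read eight longs per iteration so loop overhead is amortised; the values are
// OR-ed into a local and published to _result so the JIT cannot eliminate the
// reads as dead code.
long sink = 0;
long[] data = _longArray[threadId];
for (int i = 0; i + 7 < data.Length; i += 8)
{
    sink |= data[i]     | data[i + 1] | data[i + 2] | data[i + 3]
          | data[i + 4] | data[i + 5] | data[i + 6] | data[i + 7];
}
_result = sink;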
Reported RAM results (128 MB) for my bus8thread64.exe benchmark on an i7-3820 with a max memory bandwidth of 51.2 GB/s vary from 15.6 GB/s with 1 thread and 28.1 GB/s with 2 threads to 38.7 GB/s at 8 threads. The code is:
void inc1word(IDEF data1[], IDEF ands[], int n)
{
int i, j;
for(j=0; j<passes1; j++)
{
for (i=0; i<wordsToTest; i=i+64)
{
ands[n] = ands[n] & data1[i ] & data1[i+1 ] & data1[i+2 ] & data1[i+3 ]
& data1[i+4 ] & data1[i+5 ] & data1[i+6 ] & data1[i+7 ]
& data1[i+8 ] & data1[i+9 ] & data1[i+10] & data1[i+11]
& data1[i+12] & data1[i+13] & data1[i+14] & data1[i+15]
& data1[i+16] & data1[i+17] & data1[i+18] & data1[i+19]
& data1[i+20] & data1[i+21] & data1[i+22] & data1[i+23]
& data1[i+24] & data1[i+25] & data1[i+26] & data1[i+27]
& data1[i+28] & data1[i+29] & data1[i+30] & data1[i+31]
& data1[i+32] & data1[i+33] & data1[i+34] & data1[i+35]
& data1[i+36] & data1[i+37] & data1[i+38] & data1[i+39]
& data1[i+40] & data1[i+41] & data1[i+42] & data1[i+43]
& data1[i+44] & data1[i+45] & data1[i+46] & data1[i+47]
& data1[i+48] & data1[i+49] & data1[i+50] & data1[i+51]
& data1[i+52] & data1[i+53] & data1[i+54] & data1[i+55]
& data1[i+56] & data1[i+57] & data1[i+58] & data1[i+59]
& data1[i+60] & data1[i+61] & data1[i+62] & data1[i+63];
}
}
}
This also measures burst reading speeds, where max DTR, based on this, is 46.9 GB/s. Benchmark and source code are in:
http://www.roylongbottom.org.uk/quadcore.zip
Results with interesting speeds involving the L3 caches are at:
http://www.roylongbottom.org.uk/busspd2k%20results.htm#anchor8Thread
C/C++ would give a more accurate measure of memory performance, as .NET can sometimes do weird things with memory handling and won't give you an accurate picture, since it doesn't use compiler intrinsics or SIMD instructions.
There's no guarantee that the CLR is going to give you anything capable of truly benchmarking your RAM. I'm sure there's probably software already written to do this. Ah, yes, PassMark makes something: http://www.bandwidthtest.net/memory_bandwidth.htm
That's probably your best bet as making benchmarking software is pretty much all they do.
Also, nice processor btw, I have the same one in one of my machines ;)
UPDATE (2/20/2014):
I remember seeing some code in the XNA Framework that did some heavy duty optimizations in C# that may give you exactly what you want. Have you tried using "unsafe" code and pointers?
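If you do try unsafe code, a hedged sketch of what a pointer-based read loop might look like (it mirrors the question's test loop but is illustrative only, and needs the project's "Allow unsafe code" option):
// Walk the long[] through a pinned pointer with the question's Step stride,
// keeping array bounds checks out of the hot path entirely.
unsafe
{
    long sink = 0;
    long[] data = _longArray[threadId];
    fixed (long* basePtr = data)
    {
        long* end = basePtr + data.Length;
        for (long* p = basePtr; p < end; p += Config.Step)
        {
            sink |= *p;
        }
    }
    _result = sink;
}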
I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
private static List<string> input = new List<string>();
private static void exec(int threadcount)
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threadcount;
Parallel.ForEach(Partitioner.Create(0, input.Count), options, (range) =>
{
var dic = new Dictionary<string, List<int>>();
for (int i = range.Item1; i < range.Item2; i++)
{
//make some delay!
//for (int x = 0; x < 400000; x++) ;
var tokens = input[i].Split();
foreach (var token in tokens)
{
if (!dic.ContainsKey(token))
dic[token] = new List<int>();
dic[token].Add(1);
}
}
});
}
public static void Main(String[] args)
{
StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
while(true)
{
var line=reader.ReadLine();
if(line==null)
break;
input.Add(line);
}
DateTime t0 = DateTime.Now;
exec(Environment.ProcessorCount);
Console.WriteLine("Parallel: " + (DateTime.Now - t0));
t0 = DateTime.Now;
exec(1);
Console.WriteLine("Serial: " + (DateTime.Now - t0));
}
}
It is simple and straightforward. I use a dictionary to count each word's occurrences. The style is roughly based on the MapReduce programming model. As you can see, each task uses its own private dictionary. So there are NO shared variables, just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25, which is a tragedy! But when I add some delay to the processing of each line, I can reach speedup values of about 4.
In the original parallel execution with no delay, CPU utilization hardly reaches 30%, and therefore the speedup is not promising. But when we add some delay, CPU utilization reaches 97%.
At first I thought the cause was the IO-bound nature of the program (but I think inserting into a dictionary is to some extent CPU-intensive), and that seemed logical because all of the threads are reading data over a shared memory bus. However, the surprising point is that when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises to about 100% and all four instances finish in about 2.3 seconds!
This means that when the code is run in a multiprocessing configuration it reaches a speedup of about 3.5, but when it is run in a multithreading configuration, the speedup is only about 1.25.
What is your idea?
Is there anything wrong with my code? I think there is no shared data at all, so the code should not experience any contention.
Is there a flaw in .NET's runtime?
Thanks in advance.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector.
Most well tuned applications are in the 0-10% range. This is typically
caused by an allocation pattern that allows objects to live just long
enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec.
This is quite high. It is not uncommon that this is simply a
performance bug.
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
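As a hedged sketch of the "use fewer temporary objects" rewrite mentioned earlier in this answer, one option is the thread-local overload of Parallel.ForEach: each worker keeps a single Dictionary<string, int> for its whole lifetime and the results are merged once at the end. The Split allocations remain, but the per-batch dictionaries and List<int> objects go away; the merge step and the int counter are my substitutions, not the original code.
// One dictionary per worker (localInit), reused across every range that worker
// processes, then merged under a lock exactly once per worker (localFinally).
var total = new Dictionary<string, int>();
object mergeLock = new object();

Parallel.ForEach(
    Partitioner.Create(0, input.Count),
    () => new Dictionary<string, int>(),              // localInit: one per worker
    (range, loopState, local) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            foreach (var token in input[i].Split())
            {
                int count;
                local.TryGetValue(token, out count);
                local[token] = count + 1;
            }
        }
        return local;
    },
    local =>                                          // localFinally: merge once
    {
        lock (mergeLock)
        {
            foreach (var pair in local)
            {
                int count;
                total.TryGetValue(pair.Key, out count);
                total[pair.Key] = count + pair.Value;
            }
        }
    });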
An attempt to explain the results:
a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
String.Split is the main hotspot.
so something shared must be blocking the CPU.
that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
... else dic[token] += 1;
and the result is closer to a 2x speedup.
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is too complex to make broad statements about Parallel.ForEach().
It is too simple to solve/analyze a real problem.
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());