Multithreaded code that repeatedly evaluates expensive IEnumerables in parallel does not use 100% CPU. An example is the Aggregate() function combined with Concat():
// Initialisation.
// Each IEnumerable<string> is built so that it takes time to evaluate
// every time it is accessed.
IEnumerable<string>[] iEnumerablesArray = ...
// The line of the question (using less than 100% CPU):
Parallel.For(0, 1000000, _ => iEnumerablesArray.Aggregate(Enumerable.Concat).ToList());
Question: Why does parallel code that evaluates IEnumerables several times in parallel not use 100% CPU? The code does not use locks or waits, so this behaviour is unexpected. Full code to reproduce this is at the end of the post.
Notes and Edits:
Interesting fact: If the code
Enumerable.Range(0, 1).Select(__ => GenerateLongString())
of the full code at the end is changed to
Enumerable.Range(0, 1).Select(__ => GenerateLongString()).ToArray().AsEnumerable(),
then initialisation takes seconds, and after that the CPU is used at 100% (the problem does not occur).
Interesting fact 2: (from comment) When GenerateLongString() is made less heavy on the GC and more CPU-intensive, CPU usage goes to 100%. So the cause is connected to the implementation of this method. But, interestingly, if the current form of GenerateLongString() is called without the IEnumerable, CPU usage also goes to 100%:
Parallel.For(0, int.MaxValue, _ => GenerateLongString());
So the heaviness of GenerateLongString() is not the only problem here.
Fact 3: (from comment) The suggested Concurrency Visualizer revealed that threads spend most of their time in
clr.dll!WKS::gc_heap::wait_for_gc_done,
waiting for the GC to finish. This happens inside the string.Concat() call in GenerateLongString().
The same behaviour is observed when manually running multiple Task.Factory.StartNew() or Thread.Start() calls.
The same behaviour is observed on Win 10 and Windows Server 2012
The same behaviour is observed on real machine and virtual machine
Release vs. Debug does not matter.
.NET version tested: 4.7.2
The Full Code:
class Program
{
const int DATA_SIZE = 10000;
const int IENUMERABLE_COUNT = 10000;
static void Main(string[] args)
{
// initialisation - takes milliseconds
IEnumerable<string>[] iEnumerablesArray = GenerateArrayOfIEnumerables();
Console.WriteLine("Initialized");
List<string> result = null;
// =================
// THE PROBLEM LINE:
// =================
// CPU usage of next line:
// - 40 % on 4 virtual cores processor (2 physical)
// - 10 - 15 % on 12 virtual cores processor
Parallel.For(
0,
int.MaxValue,
(i) => result = iEnumerablesArray.Aggregate(Enumerable.Concat).ToList());
// just to be sure that Release mode would not omit some lines:
Console.WriteLine(result);
}
static IEnumerable<string>[] GenerateArrayOfIEnumerables()
{
return Enumerable
.Range(0, IENUMERABLE_COUNT)
.Select(_ => Enumerable.Range(0, 1).Select(__ => GenerateLongString()))
.ToArray();
}
static string GenerateLongString()
{
return string.Concat(Enumerable.Range(0, DATA_SIZE).Select(_ => "string_part"));
}
}
The fact that your threads are blocked on clr.dll!WKS::gc_heap::wait_for_gc_done shows that the garbage collector is the bottleneck of your application. As much as possible, you should try to limit the number of heap allocations in your program, to put less stress on the GC.
That said, there is another way to speed things up. By default, on desktop, the GC is configured to use limited resources on the computer (to avoid slowing down other applications). If you want to fully use the resources available, you can activate server GC. This mode assumes that your application is the most important thing running on the computer. It will provide a significant performance boost, but use a lot more CPU and memory.
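To confirm which GC flavour a process actually runs with, a small check like the following can help. This is a minimal sketch: the class and method names are made up for illustration, and the app.config element in the comment is the usual .NET Framework way to switch on server GC.

```csharp
using System;
using System.Runtime;

static class GcModeCheck
{
    // Returns a short description of the GC flavour the current process runs with.
    public static string Describe()
    {
        string flavour = GCSettings.IsServerGC ? "server" : "workstation";
        return flavour + " GC, latency mode " + GCSettings.LatencyMode;
    }

    static void Main()
    {
        // On .NET Framework, server GC is typically enabled in app.config:
        //   <configuration><runtime><gcServer enabled="true"/></runtime></configuration>
        Console.WriteLine(Describe());
    }
}
```

Running it before and after changing the configuration shows whether the setting was picked up.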
Related
There is a C# function A(arg1, arg2) which needs to be called lots of times. To do this fastest, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
for (int i = 0; i < totalCalls / threads; i++)
{
// init arg1 and arg2
var value = A(arg1, arg2);
// do something with value
}
});
Now the issue is that this is not scaling up with an increase in number of cores; e.g. on 8 cores it is using 80% of CPU and on 16 cores it is using 40-50% of CPU. I want to use the CPU to maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it doesn't have any IO or network-bound operations, and also there is no thread locking. What are other possibilities to find out which part of the code is making it not perform in a 100% parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1 - if I run the same code but replace A() with a simple function that calculates prime numbers, then it utilizes 100% CPU and scales well. So this proves that the rest of the code is correct. The issue must therefore be within the original function A(). I need a way to detect the issue that is causing some sort of serialization.
You have determined that the code in A is the problem.
There is one very common problem: Garbage collection. Configure your application in app.config to use the concurrent server GC. The Workstation GC tends to serialize execution. The effect is severe.
If this is not the problem, pause the debugger a few times and look at the Debug -> Parallel Stacks window. There, you can see what your threads are doing. Look for common resources and contention. For example, if you find many threads waiting for a lock, that's your problem.
Another nice debugging technique is commenting out code. Once the scalability limit disappears, you know what code caused it.
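One cheap way to test the GC hypothesis before changing any configuration is to count collections around the parallel section. The sketch below is an illustration only: the allocating lambda is a hypothetical stand-in for A(arg1, arg2), and the helper name is made up.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class GcPressureProbe
{
    // Runs the workload and returns how many Gen 0 collections happened meanwhile.
    public static int CountGen0Collections(Action workload)
    {
        int before = GC.CollectionCount(0);
        workload();
        return GC.CollectionCount(0) - before;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        int collections = CountGen0Collections(() =>
            Parallel.For(0, 100000, _ =>
            {
                // Hypothetical stand-in for A(arg1, arg2): allocates a short-lived buffer.
                var buffer = new byte[16 * 1024];
                buffer[0] = 1;
            }));
        sw.Stop();
        Console.WriteLine("Gen 0 collections: {0}, elapsed: {1} ms",
            collections, sw.ElapsedMilliseconds);
    }
}
```

A very high collection count relative to the elapsed time suggests the threads are spending much of their time suspended for the GC.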
In my application
int numberOfTimes = 1; //Or 100, or 100000
//Incorrect, please see update.
var tasks = Enumerable.Repeat(
(new HttpClient()).GetStringAsync("http://www.someurl.com")
, numberOfTimes);
var resultArray = await Task.WhenAll(tasks);
With numberOfTimes == 1, it takes 5 seconds.
With numberOfTimes == 100000, it still takes 5 seconds.
That's amazing.
But does that mean I can run unlimited calls in parallel? There has to be some limit at which requests start to queue.
What is that limit? Where is that set? What does it depend on?
In other words, how many IO completion ports are there? Who is competing for them? Does IIS get its own set of IO completion ports?
--This is in an ASP.Net MVC action, .Net 4.5.2, IIS
Update: Thanks to @Enigmativity, the following is more relevant to the question:
var tasks = Enumerable.Range(1, numberOfTimes ).Select(i =>
(new HttpClient()).GetStringAsync("http://deletewhenever.com/api/default"));
var resultArray = await Task.WhenAll(tasks);
With numberOfTimes == 1, it takes 5 seconds.
With numberOfTimes == 100, it still takes 5 seconds.
I am seeing more believable numbers for higher counts now though. The question remains, what governs the number?
What is that limit? Where is that set?
There's no explicit limit. However, you will eventually run out of resources. Mark Russinovich has an interesting blog series on probing the limits of common resources.
Asynchronous operations generally increase memory usage in exchange for responsiveness. So, each naturally-async op uses at least memory for its Task, an OVERLAPPED struct, and an IRP for the driver (each of these represents an in-progress asynchronous operation at different levels). At the lower levels, there are lots and lots of different limitations that can come into play to affect system resources (for an example, I have an old blog post where I had to calculate the maximum size of an I/O buffer - something you would think is simple but is really not).
Socket operations require a client port, and client ports are (in theory) limited to 64k connections to the same remote IP. Sockets also have their own more significant memory overhead, with both input and output buffers at the device level and in user space.
The IOCP doesn't come into play until the operations complete. On .NET, there's only one IOCP for your AppDomain. The default maximum number of I/O threads servicing this IOCP is 1000 on the modern (4.5) .NET framework. Note that this is a limit on how many operations may complete at a time, not how many may be in progress at a time.
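Since there is no built-in cap on operations in progress, an application that wants one has to impose it itself. The following is a minimal sketch of one common approach, gating the operations with a SemaphoreSlim. It is not part of the answer above: the helper name is made up, and Task.Delay stands in for the real HTTP call.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Throttle
{
    // Starts one operation per index, but lets at most maxInFlight run at the same time.
    // Returns the highest number of operations observed in flight.
    public static async Task<int> RunThrottledAsync(
        int count, int maxInFlight, Func<int, Task> operation)
    {
        var gate = new SemaphoreSlim(maxInFlight);
        object sync = new object();
        int inFlight = 0, peak = 0;

        var tasks = Enumerable.Range(0, count).Select(async i =>
        {
            await gate.WaitAsync();
            try
            {
                lock (sync) { inFlight++; if (inFlight > peak) peak = inFlight; }
                await operation(i);
            }
            finally
            {
                lock (sync) { inFlight--; }
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
        return peak;
    }

    static void Main()
    {
        // Task.Delay is a stand-in for an asynchronous I/O call such as GetStringAsync.
        int peak = RunThrottledAsync(100, 10, _ => Task.Delay(50)).GetAwaiter().GetResult();
        Console.WriteLine("Peak operations in flight: " + peak);
    }
}
```

The blocking GetResult() call is only appropriate in a console Main; inside an ASP.NET action the method would be awaited instead.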
Here's a test to see what's going on.
Start with this code:
var i = 0;
Func<int> generate = () =>
{
Thread.Sleep(1000);
return i++;
};
Now call this:
Enumerable.Repeat(generate(), 5)
After one second you get { 0, 0, 0, 0, 0 }.
But make this call:
Enumerable.Range(0, 5).Select(n => generate())
After five seconds you get { 0, 1, 2, 3, 4 }.
It's only calling the async function once in your code.
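The same effect can be shown with tasks directly. In this sketch StartOperation is a hypothetical stand-in for GetStringAsync that counts how often it is invoked; Enumerable.Repeat receives one already-started task and repeats the reference, whereas Select invokes the function once per element.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class RepeatVersusSelect
{
    static int _started;

    // Hypothetical stand-in for GetStringAsync: counts how many operations were started.
    static Task<int> StartOperation()
    {
        Interlocked.Increment(ref _started);
        return Task.Delay(10).ContinueWith(_ => 42);
    }

    // Returns how many operations each approach actually starts.
    public static Tuple<int, int> CountStarts(int n)
    {
        _started = 0;
        Task.WaitAll(Enumerable.Repeat(StartOperation(), n).ToArray());
        int withRepeat = _started;

        _started = 0;
        Task.WaitAll(Enumerable.Range(0, n).Select(_ => StartOperation()).ToArray());
        int withSelect = _started;

        return Tuple.Create(withRepeat, withSelect);
    }

    static void Main()
    {
        var counts = CountStarts(5);
        // prints "Repeat started 1, Select started 5"
        Console.WriteLine("Repeat started {0}, Select started {1}", counts.Item1, counts.Item2);
    }
}
```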
I ran this on a laptop with 64-bit Windows 8.1 and a 2.2 GHz Intel Core i3. The code was compiled in release mode and run without a debugger attached.
static void Main(string[] args)
{
calcMax(new[] { 1, 2 });
calcMax2(new[] { 1, 2 });
var A = GetArray(200000000);
var stopwatch = new Stopwatch();
stopwatch.Start(); stopwatch.Stop();
GC.Collect();
stopwatch.Reset();
stopwatch.Start();
calcMax(A);
stopwatch.Stop();
Console.WriteLine("caclMax - \t{0}", stopwatch.Elapsed);
GC.Collect();
stopwatch.Reset();
stopwatch.Start();
calcMax2(A);
stopwatch.Stop();
Console.WriteLine("caclMax2 - \t{0}", stopwatch.Elapsed);
Console.ReadKey();
}
static int[] GetArray(int size)
{
var r = new Random(size);
var ret = new int[size];
for (int i = 0; i < size; i++)
{
ret[i] = r.Next();
}
return ret;
}
static int calcMax(int[] A)
{
int max = int.MinValue;
for (int i = 0; i < A.Length; i++)
{
max = Math.Max(max, A[i]);
}
return max;
}
static int calcMax2(int[] A)
{
int max1 = int.MinValue;
int max2 = int.MinValue;
for (int i = 0; i < A.Length; i += 2)
{
max1 = Math.Max(max1, A[i]);
max2 = Math.Max(max2, A[i + 1]);
}
return Math.Max(max1, max2);
}
Here are some statistics of program performance (time in milliseconds):
Framework 2.0
X86 platform:
2269 (calcMax)
2971 (calcMax2)
[winner calcMax]
X64 platform:
6163 (calcMax)
5916 (calcMax2)
[winner calcMax2]
Framework 4.5 (time in milliseconds)
X86 platform:
2109 (calcMax)
2579 (calcMax2)
[winner calcMax]
X64 platform:
2040 (calcMax)
2488 (calcMax2)
[winner calcMax]
As you can see, the performance differs depending on the framework and the chosen target platform. I looked at the generated IL code and it is the same in each case.
calcMax2 is under test because it should make use of the processor's "pipelining". But it is faster only with Framework 2.0 on the 64-bit platform. So, what is the real reason for the different performance in the cases shown?
Just some notes worth mentioning. My processor (Haswell i7) doesn't compare well with yours, I certainly can't get close to reproducing the outlier x64 result.
Benchmarking is a hazardous exercise and it is very easy to make simple mistakes that can have big consequences on execution time. You can only truly see them when you look at the generated machine code. Use Tools + Options, Debugging, General and untick the "Suppress JIT optimization" option. That way you can look at the code with Debug > Windows > Disassembly and not affect the optimizer.
Some things you'll see when you do this:
You made a mistake: you are not actually using the method return value. The jitter optimizer exploits opportunities like this where possible; it completely omits the max variable assignment in calcMax(). But not in calcMax2(). This is a classic benchmarking oops; in a real program you'd of course use the return value. This makes calcMax() look too good.
The .NET 4 jitter is smarter about optimizing Math.Max(); it can generate the code inline. The .NET 2 jitter couldn't do that yet; it has to make a call to a CLR helper function. The 4.5 test should thus run a lot faster; that it didn't is a strong hint at what really throttles the code execution. It is not the processor's execution engine, it is the cost of accessing memory. Your array is too large to fit in the processor caches, so your program is bogged down waiting for the slow RAM to supply the data. If the processor cannot overlap that with executing instructions, then it just stalls.
Noteworthy about calcMax() is what happens to the array-bounds check that C# performs. The jitter knows how to completely eliminate it from the loop. It however isn't smart enough to do the same in calcMax2(), the A[i + 1] screws that up. That check doesn't come for free, it should make calcMax2() quite a bit slower. That it doesn't is again a strong hint that memory is the true bottleneck. That's pretty normal btw, array bound checking in C# can have low to no overhead because it is so much cheaper than the array element access.
As for your basic quest, trying to improve super-scalar execution opportunities, no, that's not how processors work. A loop is not a boundary for the processor, it just sees a different stream of compare and branch instructions, all of which can execute concurrently if they don't have inter-dependencies. What you did by hand is something the optimizer already does itself, an optimization called "loop unrolling". It selected not to do so in this particular case btw. An overview of jitter optimizer strategies is available in this post. Trying to outsmart the processor and the optimizer is a pretty tall order and getting a worse result by trying to help is certainly not unusual.
Many of the differences that you see are well within the range of tolerance, so they should be considered as no differences.
Essentially, what these numbers show is that Framework 2.0 was highly unoptimized for X64, (no surprise at all here,) and that overall, calcMax performs slightly better than calcMax2. (No surprise there either, because calcMax2 contains more instructions.)
So, what we learn is that someone came up with a theory that they could achieve better performance by writing high-level code that somehow takes advantage of some pipelining of the CPU, and that this theory was proved wrong.
The running time of your code is dominated by the failed branch predictions that occur within Math.Max() due to the randomness of your data. Try less randomness (more consecutive values where the second one is always greater) and see if it gives you any better insights.
Every time you run the program, you'll get slightly different results.
Sometimes calcMax will win, and sometimes calcMax2 will win. This is because there is a problem with comparing performance that way. What Stopwatch measures is the time elapsed from the call to stopwatch.Start() until the call to stopwatch.Stop(). In between, things independent of your code can occur. For example, the operating system can take the processor from your process and give it for a while to another process running on your machine, because your process's time slice has ended. After a while, your process gets the processor back for another time slice.
Such occurrences cannot be controlled or foreseen by your comparison code, and thus the entire experiment shouldn't be treated as reliable.
To minimize this kind of measurement errors, you should measure every function many times (for example, 1000 times), and calculate the average time of all measurements. This method of measurement tends to significantly improve the reliability of the result, as it is more resilient to statistical errors.
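As a rough illustration of the repeated-measurement advice, combined with the earlier point about actually using the return value, a harness could look like this. It is a sketch only: the helper name, the run count and the array size are arbitrary choices, not values from the posts above.

```csharp
using System;
using System.Diagnostics;

static class RepeatedBenchmark
{
    // Runs func 'runs' times and returns the mean elapsed milliseconds.
    // Results are accumulated into 'checksum' so the JIT cannot discard the calls.
    public static double AverageMilliseconds(Func<int> func, int runs, out long checksum)
    {
        checksum = func(); // warm-up call, so JIT compilation is not timed
        var sw = new Stopwatch();
        for (int i = 0; i < runs; i++)
        {
            sw.Start();          // Start() resumes; elapsed time accumulates across runs
            int value = func();
            sw.Stop();
            checksum += value;
        }
        return sw.Elapsed.TotalMilliseconds / runs;
    }

    static int CalcMax(int[] a)
    {
        int max = int.MinValue;
        for (int i = 0; i < a.Length; i++) max = Math.Max(max, a[i]);
        return max;
    }

    static void Main()
    {
        var random = new Random(42);
        var data = new int[1000000];
        for (int i = 0; i < data.Length; i++) data[i] = random.Next();

        long checksum;
        double mean = AverageMilliseconds(() => CalcMax(data), 100, out checksum);
        Console.WriteLine("calcMax mean: {0:F3} ms (checksum {1})", mean, checksum);
    }
}
```

Averaging does not remove the memory-bandwidth effect discussed above; it only reduces the noise from scheduling.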
I wrote a naive Parallel.For() loop in C#, shown below. I also did the same work using a regular for() loop to compare single-thread vs. multi-thread. The single thread version took about five seconds every time I ran it. The parallel version took about three seconds at first, but if I ran it about four times, it would slow down dramatically. Most often it took about thirty seconds. One time it took eighty seconds. If I restarted the program, the parallel version would start out fast again, but slow down after three or four parallel runs. Sometimes the parallel runs would speed up again to the original three seconds then slow down.
I wrote another Parallel.For() loop for computing Mandelbrot set members (discarding the results) because I figured that the problem might be related to memory issues allocating and manipulating a large array. The Parallel.For() implementation of this second problem does indeed execute faster than the single-thread version every time, and the times are consistent too.
What data should I be looking at to understand why my first naive program slows down after a number of runs? Is there something in Perfmon I should be looking at? I still suspect it is memory related, but I allocate the array outside the timer. I also tried a GC.Collect() at the end of each run, but that didn't seem to help, not consistently anyway. Might it be an alignment issue with cache somewhere on the processor? How would I figure that out? Is there anything else that might be the cause?
JR
const int _meg = 1024 * 1024;
const int _len = 1024 * _meg;
private void ParallelArray() {
int[] stuff = new int[_meg];
System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
lblStart.Content = DateTime.Now.ToString();
s.Start();
Parallel.For(0,
_len,
i => {
stuff[i % _meg] = i;
}
);
s.Stop();
lblResult.Content = DateTime.Now.ToString();
lblDiff.Content = s.ElapsedMilliseconds.ToString();
}
I have profiled your code and it indeed looks strange. There should be no deviations. It is not an allocation issue (GC is fine and you are allocating only one array per run).
The problem can be reproduced on my Haswell CPU where the parallel version suddenly takes much longer to execute. I have CLR version 4.0.30319.34209 FX452RTMGDR.
On x64 it works fine and has no issues. Only x86 builds seem to suffer from it.
I have profiled it with the Windows Performance Toolkit and have found that it looks like a CLR issue where the TPL tries to find the next workitem. Sometimes it happens that the call
System.Threading.Tasks.RangeWorker.FindNewWork(Int64 ByRef, Int64 ByRef)
System.Threading.Tasks.Parallel+<>c__DisplayClassf`1[[System.__Canon, mscorlib]].<ForWorker>b__c()
System.Threading.Tasks.Task.InnerInvoke()
System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)
System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object)
System.Threading.Tasks.Task.InnerInvoke()
seems to "hang" in the clr itself.
clr!COMInterlocked::ExchangeAdd64+0x4d
When I compare the sampled stacks with a slow and fast run I find:
ntdll.dll!__RtlUserThreadStart -52%
kernel32.dll!BaseThreadInitThunk -52%
ntdll.dll!_RtlUserThreadStart -52%
clr.dll!Thread::intermediateThreadProc -48%
clr.dll!ThreadpoolMgr::ExecuteWorkRequest -48%
clr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem -48%
clr.dll!ManagedThreadBase_FullTransitionWithAD -48%
clr.dll!ManagedThreadBase_DispatchOuter -48%
clr.dll!ManagedThreadBase_DispatchMiddle -48%
clr.dll!ManagedThreadBase_DispatchInner -48%
clr.dll!QueueUserWorkItemManagedCallback -48%
clr.dll!MethodDescCallSite::CallTargetWorker -48%
clr.dll!CallDescrWorkerWithHandler -48%
mscorlib.ni.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteEntry(Boolean) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.TaskByRef) -48%
mscorlib.ni.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext System.Threading.ContextCallback System.Object Boolean) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecutionContextCallback(System.Object) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.Execute() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke() -48%
mscorlib.ni.dll!System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task) -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke() -48%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0 -24%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0<itself> -24%
...
clr.dll!COMInterlocked::ExchangeAdd64 +50%
In the dysfunctional case most of the time (50%) is spent in clr.dll!COMInterlocked::ExchangeAdd64. This method was compiled with FPO (the stacks were broken in the middle) to get more performance. I had thought that such code is not allowed in the Windows code base because it makes profiling harder. Looks like the optimizations have gone too far.
When I single-step with the debugger to the actual exchange operation
eax=01c761bf ebx=01c761cf ecx=00000000 edx=00000000 esi=00000000 edi=0274047c
eip=747ca4bd esp=050bf6fc ebp=01c761bf iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
clr!COMInterlocked::ExchangeAdd64+0x49:
747ca4bd f00fc70f lock cmpxchg8b qword ptr [edi] ds:002b:0274047c=0000000001c761bf
cmpxchg8b compares EDX:EAX=1c761bf with the memory location and, if the values are equal, copies the new value of ECX:EBX=1c761cf to the memory location. When you look at the registers you find that at index 0x1c761bf = 29.843.903 all values are not equal. It looks like there is a race condition (or excessive contention) when incrementing the global loop counter, which surfaces only when your method body does so little work that it pops out.
Congrats, you have found a real bug in the .NET Framework! You should report it at the Connect website to make them aware of this issue.
To be absolutely sure that it is not another issue you can try the parallel loop with an empty delegate:
System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
s.Start();
Parallel.For(0,_len, i => {});
s.Stop();
System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
This does also repro the issue. It is therefore definitely a CLR issue. Normally we at SO tell people to not try to write lock free code since it is very hard to get right. But even the smartest guys at MS seem to get it wrong sometimes ....
Update:
I have opened a bug report here: https://connect.microsoft.com/VisualStudio/feedbackdetail/view/969699/parallel-for-causes-random-slowdowns-in-x86-processes
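If the time really goes into contention on the shared loop counter, one possible mitigation is to hand each worker a whole index range instead of letting Parallel.For claim many small batches. The sketch below uses Partitioner.Create for that; whether it avoids the slowdown described here is an assumption that would need measuring, and the class name is made up.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

static class ChunkedLoop
{
    // Fills 'stuff' like the original loop body, but each worker receives a whole
    // index range, so work is claimed once per range rather than once per small batch.
    public static void Fill(int[] stuff, int length)
    {
        Parallel.ForEach(Partitioner.Create(0, length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                stuff[i % stuff.Length] = i;
            }
        });
    }

    static void Main()
    {
        const int meg = 1024 * 1024;
        var stuff = new int[meg];
        var sw = Stopwatch.StartNew();
        Fill(stuff, 64 * meg); // smaller than the original _len to keep the demo short
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds + " ms");
    }
}
```

As in the original code, several ranges write to the same array elements, so the final array contents depend on thread timing whenever length exceeds the array size.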
Based on your program, I wrote a program to reproduce the problem. I think it is related to the .NET large object heap and how Parallel.For is implemented.
class Program
{
static void Main(string[] args)
{
for (int i = 0; i < 10; i++)
//ParallelArray();
SingleFor();
}
const int _meg = 1024 * 1024;
const int _len = 1024 * _meg;
static void ParallelArray()
{
int[] stuff = new int[_meg];
System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
s.Start();
Parallel.For(0,
_len,
i =>
{
stuff[i % _meg] = i;
}
);
s.Stop();
System.Console.WriteLine( s.ElapsedMilliseconds.ToString());
}
static void SingleFor()
{
int[] stuff = new int[_meg];
System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
s.Start();
for (int i = 0; i < _len; i++){
stuff[i % _meg] = i;
}
s.Stop();
System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
}
}
I compiled with VS2013, release build, and ran it without a debugger. If the function ParallelArray() is called in the main loop, the result I got is:
1631
1510
51302
1874
45243
2045
1587
1976
44257
1635
If the function SingleFor() is called, the result is:
898
901
897
897
897
898
897
897
899
898
I went through some documentation on MSDN about Parallel.For, and this caught my attention: Writing to shared variables. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values. In the Parallel.For loop here, we're using the shared variable stuff.
The article Parallel Aggregation explains how .NET deals with this case: The Parallel Aggregation pattern uses unshared, local variables that are merged at the end of the computation to give the final result. Using unshared, local variables for partial, locally calculated results is how the steps of a loop can become independent of each other. Parallel aggregation demonstrates the principle that it's usually better to make changes to your algorithm than to add synchronization primitives to an existing algorithm. This means it creates local copies of data instead of using locks to guard the shared variable, and at the end these 10 partitions need to be combined; this brings performance penalties.
When I ran the test program with Parallel.For, I used Process Explorer to count the threads. It has 11 threads, so Parallel.For creates 10 partitions for the loop, which means it creates 10 local copies with size 100K; these objects will be placed on the Large Object Heap.
There are two different types of heaps in .NET: the Small Object Heap (SOH) and the Large Object Heap (LOH). If the object size is larger than 85,000 bytes, it is placed in the LOH. When doing GC, .NET treats the two heaps differently.
As explained in the blog post No More Memory Fragmentation on the .NET Large Object Heap: One of the key differences between the heaps is that the SOH compacts memory and hence reduces the chance of memory fragmentation dramatically while the LOH does not employ compaction. As a result, excessive usage of the LOH may result in memory fragmentation that can become severe enough to cause problems in applications.
As you're continuously allocating big arrays with size > 85,000 bytes, performance goes down once the LOH becomes fragmented.
If you're using .NET 4.5.1, you can set GCSettings.LargeObjectHeapCompactionMode to CompactOnce to make the LOH compact after GC.Collect().
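A minimal sketch of that setting, assuming .NET Framework 4.5.1 or later (the class and method names are made up for illustration):

```csharp
using System;
using System.Runtime;

static class LohCompaction
{
    // Requests a one-off compaction of the large object heap and triggers it.
    public static void CompactLargeObjectHeapOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(); // blocking full collection; the mode resets to Default afterwards
    }

    static void Main()
    {
        var big = new int[1024 * 1024]; // 4 MB, well above the 85,000-byte LOH threshold
        Console.WriteLine("Generation reported for the large array: " + GC.GetGeneration(big));
        big = null;

        CompactLargeObjectHeapOnce();
        Console.WriteLine("Mode after collection: " + GCSettings.LargeObjectHeapCompactionMode);
    }
}
```

Compacting the LOH copies large blocks of memory, so this is something to do occasionally rather than after every run.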
Another good article to understand this problem is: Large Object Heap Uncovered
Further investigation is needed, but I don't have time now.
I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
private static List<string> input = new List<string>();
private static void exec(int threadcount)
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threadcount;
Parallel.ForEach(Partitioner.Create(0, input.Count),options, (range) =>
{
var dic = new Dictionary<string, List<int>>();
for (int i = range.Item1; i < range.Item2; i++)
{
//make some delay!
//for (int x = 0; x < 400000; x++) ;
var tokens = input[i].Split();
foreach (var token in tokens)
{
if (!dic.ContainsKey(token))
dic[token] = new List<int>();
dic[token].Add(1);
}
}
});
}
public static void Main(String[] args)
{
StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
while(true)
{
var line=reader.ReadLine();
if(line==null)
break;
input.Add(line);
}
DateTime t0 = DateTime.Now;
exec(Environment.ProcessorCount);
Console.WriteLine("Parallel: " + (DateTime.Now - t0));
t0 = DateTime.Now;
exec(1);
Console.WriteLine("Serial: " + (DateTime.Now - t0));
}
}
It is simple and straight forward. I use a dictionary to count each word's occurrence. The style is roughly based on the MapReduce programming model. As you can see, each task is using its own private dictionary. So, there is NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25, which is a tragedy! But when I add some delay when processing each line, I can reach speedup values of about 4.
In the original parallel execution with no delay, CPU utilization hardly reaches 30% and therefore the speedup is not promising. But when we add some delay, CPU utilization reaches 97%.
At first, I thought the cause was the IO-bound nature of the program (though I think inserting into a dictionary is to some extent CPU intensive), and that seemed logical because all of the threads are reading data from a shared memory bus. However, the surprising point is that when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises and all four instances finish in about 2.3 seconds!
This means that when the code is run in a multiprocessing configuration, it reaches a speedup of about 3.5, but when it is run in a multithreading configuration, the speedup is about 1.25.
What is your idea?
Is there anything wrong with my code? I think there is no shared data at all, so the code should not experience any contention.
Is there a flaw in .NET's run-time?
Thanks in advance.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector.
Most well tuned applications are in the 0-10% range. This is typically
caused by an allocation pattern that allows objects to live just long
enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec.
This is quite high. It is not uncommon that this is simply a
performance bug.
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
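The 1.6× figure is the usual Amdahl-style estimate, which can be checked with a few lines. This is a sketch; the function name is made up, and the 50/50 split is the simplifying assumption stated above.

```csharp
using System;
using System.Globalization;

static class SpeedupEstimate
{
    // Amdahl-style estimate: only 'parallelFraction' of the run time is divided by 'workers'.
    public static double Speedup(double parallelFraction, int workers)
    {
        double remainingTime = (1.0 - parallelFraction) + parallelFraction / workers;
        return 1.0 / remainingTime;
    }

    static void Main()
    {
        // 50% work that scales across 4 threads, 50% GC that does not:
        // remaining time = 0.5 + 0.5 / 4 = 0.625 of the original, so speedup = 1 / 0.625.
        double speedup = Speedup(0.5, 4);
        Console.WriteLine(speedup.ToString(CultureInfo.InvariantCulture)); // prints 1.6
    }
}
```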
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
An attempt to explain the results:
a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
String.Split is the main hotspot.
so a shared something must be blocking the CPU.
that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
... else dic[token] += 1;
and the result is closer to a 2x speedup.
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is too complex to make broad statements about Parallel.ForEach().
It is too simple to solve/analyze a real problem.
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());
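For comparison, a version that keeps the explicit loop but addresses both points raised above (fewer temporary objects, and actually merging the per-thread results) might look like the following. It is a sketch under assumptions: it uses the Parallel.ForEach overload with thread-local state, counts with Dictionary<string, int>, and the sample input in Main is invented.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelWordCount
{
    public static Dictionary<string, int> Count(IList<string> lines)
    {
        var total = new Dictionary<string, int>();
        if (lines.Count == 0) return total; // Partitioner.Create rejects an empty range
        object mergeLock = new object();

        Parallel.ForEach(
            Partitioner.Create(0, lines.Count),
            () => new Dictionary<string, int>(),   // one dictionary per worker task
            (range, loopState, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    // A null separator array splits on whitespace.
                    var tokens = lines[i].Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
                    foreach (var token in tokens)
                    {
                        int count;
                        local.TryGetValue(token, out count);
                        local[token] = count + 1;
                    }
                }
                return local;
            },
            local =>
            {
                lock (mergeLock)                   // merge once per worker, not once per word
                {
                    foreach (var pair in local)
                    {
                        int count;
                        total.TryGetValue(pair.Key, out count);
                        total[pair.Key] = count + pair.Value;
                    }
                }
            });

        return total;
    }

    static void Main()
    {
        var lines = new List<string> { "a b a", "b c", "a" };
        var counts = Count(lines);
        Console.WriteLine("a={0} b={1} c={2}", counts["a"], counts["b"], counts["c"]);
    }
}
```

String.Split still allocates an array and a string per token, so GC pressure is reduced rather than eliminated.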