Why does Parallel.For lose so badly? [duplicate] - c#

Here is the code:
using (var context = new AventureWorksDataContext())
{
    IEnumerable<Customer> _customerQuery = from c in context.Customers
                                           where c.FirstName.StartsWith("A")
                                           select c;
    var watch = new Stopwatch();
    watch.Start();
    var result = Parallel.ForEach(_customerQuery, c => Console.WriteLine(c.FirstName));
    watch.Stop();
    Debug.WriteLine(watch.ElapsedMilliseconds);
    watch = new Stopwatch();
    watch.Start();
    foreach (var customer in _customerQuery)
    {
        Console.WriteLine(customer.FirstName);
    }
    watch.Stop();
    Debug.WriteLine(watch.ElapsedMilliseconds);
}
The problem is, Parallel.ForEach takes about 400ms vs a regular foreach, which takes about 40ms. What exactly am I doing wrong and why doesn't this work as I expect it to?

Suppose you have a task to perform. Let's say you're a math teacher and you have twenty papers to grade. It takes you two minutes to grade a paper, so it's going to take you about forty minutes.
Now let's suppose that you decide to hire some assistants to help you grade papers. It takes you an hour to locate four assistants. You each take four papers and you are all done in eight minutes. You've traded 40 minutes of work for 68 total minutes of work including the extra hour to find the assistants, so this isn't a savings. The overhead of finding the assistants is larger than the cost of doing the work yourself.
Now suppose you have twenty thousand papers to grade, so it is going to take you about 40000 minutes. Now if you spend an hour finding assistants, that's a win. You each take 4000 papers and are done in a total of 8060 minutes instead of 40000 minutes, a savings of almost a factor of 5. The overhead of finding the assistants is basically irrelevant.
Parallelization is not free. The cost of splitting up work amongst different threads needs to be tiny compared to the amount of work done per thread.
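The arithmetic in the grading analogy can be checked with a small sketch. The names GradingModel and ElapsedMinutes are made up for illustration; the numbers (2 minutes per paper, 60 minutes to find four assistants, five graders in total) come from the paragraphs above.

```csharp
using System;

static class GradingModel
{
    // Wall-clock minutes when the papers are split evenly among the graders,
    // after a one-time setup cost (finding the assistants).
    public static double ElapsedMinutes(int papers, double minutesPerPaper, int graders, double setupMinutes)
    {
        int papersPerGrader = (papers + graders - 1) / graders; // round up
        return setupMinutes + papersPerGrader * minutesPerPaper;
    }

    static void Main()
    {
        foreach (int papers in new[] { 20, 20000 })
        {
            double alone = ElapsedMinutes(papers, 2, graders: 1, setupMinutes: 0);
            double withHelp = ElapsedMinutes(papers, 2, graders: 5, setupMinutes: 60);
            Console.WriteLine("{0} papers: alone {1} min, with four assistants {2} min",
                papers, alone, withHelp);
        }
    }
}
```

With 20 papers the setup cost dominates; with 20000 papers it almost disappears, which is the whole point of the analogy.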
Further reading:
Amdahl's law
Gives the theoretical speedup in latency of the execution of a task at fixed workload, that can be expected of a system whose resources are improved.
Gustafson's law
Gives the theoretical speedup in latency of the execution of a task at fixed execution time, that can be expected of a system whose resources are improved.

The first thing you should realize is that not all parallelism is beneficial. Parallelism carries overhead, and this overhead may or may not be significant depending on the complexity of what is being parallelized. Since the work in your parallel function is very small, the overhead of managing the parallelism becomes significant and slows down the overall work.

The additional overhead of creating all the threads for your enumerable versus just executing the enumerable is more than likely the cause of the slowdown. Parallel.ForEach is not a blanket performance-increasing move; you need to weigh whether the operation to be completed for each element is likely to block.
For example, if you were to make a web request or something instead of simply writing to the console, the parallel version might be faster. As it is, simply writing to the console is a very fast operation, so the overhead of creating the threads and starting them is going to be slower.

As the previous writer has said, there is some overhead associated with Parallel.ForEach, but that is not why you can't see a performance improvement. Console.WriteLine is a synchronized operation, so only one thread can write at a time. Try changing the body to something non-blocking and you will see the performance increase (as long as the amount of work in the body is big enough to outweigh the overhead).
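A minimal sketch of that experiment, assuming a CPU-only body. BusyWork is a hypothetical stand-in for real per-item work, and the iteration count is arbitrary. Whether the parallel version actually wins depends on the machine and on how heavy the body is, so no timings are claimed here.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ParallelBodyDemo
{
    // Hypothetical per-item work: pure CPU, no I/O, no locks.
    public static long BusyWork(int seed)
    {
        long acc = seed;
        for (int i = 0; i < 200000; i++)
            acc = (acc * 31 + i) % 1000003;
        return acc;
    }

    public static long SumSequential(int[] items)
    {
        long total = 0;
        foreach (int item in items)
            total += BusyWork(item);
        return total;
    }

    public static long SumParallel(int[] items)
    {
        long total = 0;
        // Each worker accumulates privately; subtotals are combined once per worker.
        Parallel.ForEach(items,
            () => 0L,
            (item, state, local) => local + BusyWork(item),
            local => Interlocked.Add(ref total, local));
        return total;
    }

    static void Main()
    {
        int[] items = Enumerable.Range(0, 2000).ToArray();

        var watch = Stopwatch.StartNew();
        long sequential = SumSequential(items);
        Console.WriteLine("foreach:          {0} ms", watch.ElapsedMilliseconds);

        watch.Restart();
        long parallel = SumParallel(items);
        Console.WriteLine("Parallel.ForEach: {0} ms", watch.ElapsedMilliseconds);

        Console.WriteLine("Same result: {0}", sequential == parallel);
    }
}
```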

I like salomon's answer and would like to add that you also have the additional overhead of:
Allocating delegates.
Calling through them.

Related

Parallel.ForEach slows down towards end of the iteration

I have the following issue :
I am using a Parallel.ForEach iteration for a pretty CPU-intensive workload (applying a method to a number of items), and it works fine for about the first 80% of the items, using all CPU cores very nicely.
As the iteration nears the end (around 80%, I would say), I see the number of threads begin to go down core by core, and at the end the last 5% or so of the items are processed by only two cores. So instead of using all cores until the end, it slows down pretty hard toward the end of the iteration.
Please note that the workload per item can be very different. One item can take 1-2 seconds, another can take 2-3 minutes to finish.
Any idea or suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Length);
using (SqlConnection connection = new SqlConnection(cnStr))
{
    connection.Open();
    try
    {
        Parallel.ForEach(rangePartitioner, (range, loopState) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                CPUIntensiveMethod(source[i]);
            }
        });
    }
    catch (AggregateException ae)
    {
        // Exception catching
    }
}
This is an unavoidable consequence of the fact that the parallelism is per computation. It is clear that the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000s to run) and the rest are quick (say 1s to run). You kick them off in a random order across 8 threads. It's clear that eventually each thread will be calculating one of your long-running items; at this point you are seeing full utilisation. Eventually the one(s) that hit their long op(s) first will finish them and quickly finish up any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off, i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long running items might be amenable to 'internal parallelism' allowing them to have a faster minimum limit runtime.
Your long running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for a long as possible)
(see update below) DON'T use partitioning in cases where the body can be long running, as this simply increases the 'hit' of this effect (i.e. get rid of your rangePartitioner entirely). This will massively reduce the impact of this effect on your particular loop.
Either way, your batch run-time is bound by the run-time of the slowest item in the batch.
Update: I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect, i.e. you are saying 'break this work-set down into N work-sets' and then parallelizing the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set, so those are going to be processed on the same thread. As such you should NOT be using partitioning if the inner body can be long running. For example, the docs on partitioning here https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx say it is aimed at short bodies.
If you have multiple threads that process the same number of items each and each item takes varying amount of time, then of course you will have some threads that finish earlier.
If you use a collection whose size is not known up front, then the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach can be a Producer-Consumer pattern
https://msdn.microsoft.com/en-us/library/dd997371
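The two mitigations discussed above (start the expensive items first, and hand items out one at a time instead of in fixed ranges) could be combined roughly as follows. This is only a sketch: Process is a hypothetical stand-in for CPUIntensiveMethod, it assumes the per-item cost can be estimated up front, and EnumerablePartitionerOptions.NoBuffering assumes .NET 4.5 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class LoadBalancedLoop
{
    // Hypothetical stand-in for CPUIntensiveMethod: the cost varies a lot per item.
    public static void Process(int cost)
    {
        Thread.SpinWait(cost * 1000);
    }

    // Processes every item and returns how many were processed.
    public static int Run(int[] estimatedCosts)
    {
        // Start the items believed to be most expensive first.
        int[] ordered = estimatedCosts.OrderByDescending(c => c).ToArray();

        // NoBuffering hands items to workers one at a time, on demand,
        // so no worker is stuck with a pre-assigned range of slow items.
        var partitioner = Partitioner.Create(ordered, EnumerablePartitionerOptions.NoBuffering);

        int processed = 0;
        Parallel.ForEach(partitioner, cost =>
        {
            Process(cost);
            Interlocked.Increment(ref processed);
        });
        return processed;
    }

    static void Main()
    {
        int[] costs = { 1, 2000, 3, 1500, 2, 5, 1800, 4 };
        Console.WriteLine("Processed {0} items", Run(costs));
    }
}
```

The tail of the batch is still bounded by the slowest single item; this only avoids making the tail longer than it has to be.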

CPU and/or RAM productivity when working with big integers

Yesterday I was solving an exam problem when I found something very interesting (at least for me). The program computes factorials (very big ones), and the result is how many zeros there are at the end of the number (in some cases 2500 zeros...). I did what I could, but found that when I enter a number like 100 000 it takes exactly 1:30 - 1:33 min to output the result. I thought it was because of my CPU (it is not very fast). I sent the .exe to some of my friends to try, because they have very good PCs performance-wise, and got exactly the same result (1:33 min).
My question is: why is the time to solve the task the same? I know there are better ways to write my code so it wouldn't take so long, but this is very important for me to understand as a beginner programmer.
So here is my code:
static void Main()
{
    int num = int.Parse(Console.ReadLine()),
        zeroCounter = 0;
    BigInteger fact = 1;
    var startTime = DateTime.Now;
    Console.WriteLine();
    for (int i = 1; i <= num; i++)
    {
        fact *= i;
        Console.Write("\r{0}", DateTime.Now - startTime);
    }
    BigInteger factTarget = fact;
    while (factTarget % 10 == 0)
    {
        factTarget /= 10;
        zeroCounter++;
        Console.Write("\r{0}", DateTime.Now - startTime);
    }
    Console.WriteLine();
    Console.WriteLine("Result is number with {0} zeros.", zeroCounter);
    Console.WriteLine();
    Console.WriteLine("Finished for: {0}", DateTime.Now - startTime);
    Console.WriteLine();
    Console.WriteLine("\nPres any key to exit...");
    Console.ReadKey();
}
I am very sorry if this is the wrong place to ask; I did my best to find what I was looking for before posting this.
The thing that I notice immediately about your code is that you have included Console.Write() statements in your computational loops.
The fact is, I/O is much slower for a computer to handle than computations, even under ideal conditions. And I wouldn't say that the Windows console window is a particularly efficient implementation of that particular kind of I/O. Furthermore, I/O tends to be less dependent on CPU and memory differences from machine to machine.
In other words, it seems very likely to me that you are primarily measuring I/O throughput and not computational throughput, and so it's not surprising to see consistent results between machines.
For what it's worth, when I run your example on my laptop, if I disable the output I can complete the computation in about a minute. I get something closer to your 1:30 time if I use the code as-is.
EDIT:
I recommend the answer from Hans Passant as well. Memory I/O is still I/O and is, as I describe above, much less variable from machine to machine than CPU speed. It's my hope that the above general-purpose description gives ideas for where the difference could be (without access to each of the machines in question, there's not really any way to know for sure what is the cause), but Hans's answer provides some very good detail about the memory I/O issue in particular and is very much worth reading.
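To separate the computation from the console I/O when measuring, one option is to keep the loops free of output and time them with a Stopwatch. The following is a sketch of the same algorithm as in the question with the Console.Write calls removed. The class and method names are made up, and the default input of 10000 is an arbitrary choice so the program runs without typing anything.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

static class FactorialZeros
{
    // Same algorithm as the question, but with no console output inside the loops.
    public static int CountTrailingZeros(int num)
    {
        BigInteger fact = BigInteger.One;
        for (int i = 1; i <= num; i++)
            fact *= i;

        int zeroCounter = 0;
        while (fact % 10 == 0)
        {
            fact /= 10;
            zeroCounter++;
        }
        return zeroCounter;
    }

    static void Main(string[] args)
    {
        int num = args.Length > 0 ? int.Parse(args[0]) : 10000;

        var watch = Stopwatch.StartNew();
        int zeros = CountTrailingZeros(num);
        watch.Stop();

        // All output happens once, after the measured work is done.
        Console.WriteLine("Result is number with {0} zeros.", zeros);
        Console.WriteLine("Finished for: {0}", watch.Elapsed);
    }
}
```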
now the time is 00:01:23.5856140
The speed of this program is determined by the bandwidth of the RAM in your machine. It is a design-constant and unrelated to the speed of the processor. RAM plays a role here because of the very large number of digits in the factorial, they don't fit the CPU caches anymore. And the memory access pattern for a BigInteger multiplication is very unfriendly, all digits are required to multiply a number.
Your program takes 57 seconds on my laptop, I know it has PC3-12800 RAM. Which has a peak transfer rate of 12800 MB/sec, give or take the CAS latency (I don't know mine). So we can calculate the RAM speed on your and your friend's machine:
1:23 = 83 sec, 57/83 x 12800 = 8790 MB/sec.
Which is a pretty close match for PC3-8500. A run-of-the-mill RAM speed very common in white-box machines, the kind you'd get from a vendor like Dell. Your friend's fast PC is a bit of a toaster, break it to him gently :)
Fwiw, why the highly upvoted post doesn't have much of an effect on the speed could use an explanation as well. The console window that your program uses is owned by another process, Conhost.exe; you can see it in the Processes tab of Taskmgr.exe. It takes care of scrolling and painting the window; under the hood your program uses process interop to tell it to update the window.
This happens while your program is running, on another thread, so your program is only bogged down when it firehoses Conhost.exe, sending updates faster than it can handle. So at the start of your program, when the multiplications are still fast, you will get bogged down. But not once the number of digits grows large and your multiplications get slow. Overall, the slowdown is not that great.
What happens is that the processor core has internal buses of a fixed size for moving data. RAM is 10-1000 times slower than the processor. There is also cache memory, but the cache is very small. So no matter how much RAM you have in your PC, it will still be slow and take time, because when the numbers get large they take time to read from and write to memory.
Plus, writing to the screen each time eats up some time.

Why is the first iteration always faster than the next in a loop?

I would like to understand why the first iteration in the loop executes quicker than the rest.
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 10; i++)
{
    System.Threading.Thread.Sleep(100);
    Console.WriteLine("Finished at : {0}", ((double)sw.ElapsedTicks / Stopwatch.Frequency) * 1e3);
}
When I execute the code I get the following:
Initially I thought it could be due to the accuracy factor of Stopwatch class, but then why is it applicable only to the first element? Correct me if I'm missing something.
This is a very flawed benchmark. For one, Thread.Sleep does not guarantee you that you'll sleep for exactly 100ms. Try much longer sleeps and you'll see more consistent results.
So it might be even just scheduling - the next iterations are always just doing sleep after sleep. Since Sleep works thanks to the system interrupt clock, the sleeps after the first should take similar amount of time, while the first has to "sync up" with the clock first.
If you add another sleep before the cycle (and before starting the stopwatch), you'll likely get closer times for each of the iterations.
Or even better, don't use sleeps. If you use some actual CPU work instead, you'll avoid thread switches (provided you've got enough CPU to do that) and many other costs not associated with the cycle itself. For example,
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 10; i++)
{
    Thread.SpinWait(10000000);
    Console.WriteLine("Finished at : {0}", ((double)sw.ElapsedTicks / Stopwatch.Frequency) * 1e3);
}
This will give you much more consistent results, because it doesn't depend on the clock at all.
There are many other things that can complicate a benchmark like this, which is why benchmarks simply aren't done this way. There will always be deviations, and they can get rather big, especially on a system with a lot of work.
In other words, if you're getting differences in CPU work execution time on the scale of milliseconds, someone is stealing your work. There's nothing in a modern CPU that would account for such a huge difference just based on e.g. i++ being there or not.
I could describe a lot more issues with your code, but it probably isn't worth it. Just google for some best practices on CPU work benchmarking in C#, and you'll get much more worth out of it.
Oh, and just to help hammer the point home more, on my computer, the first iteration tends to take anywhere from 99 up to 100ms. This would be highly unusual, since the default system timer resolution is 15.6ms rather than 1ms, but the culprit is easily found: Chrome sets it to 1ms. Ouch.
What you're outputting for times is the total time elapsed since the start, so the time increasing by about 100ms on each line is exactly what you should expect.
But when you use Thread.Sleep you're giving up control of the thread, and only for something close to the time you've specified. That time will be in multiples of the system quantum, so what you specify cannot possibly be exact. If other threads of higher priority are doing work, it's less likely that your thread will be given processor time at a granularity close to the time you've suggested.
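To look at the individual sleeps rather than the running total, the loop could record one duration per iteration and do a warm-up sleep before the stopwatch starts, as suggested above. A sketch with made-up names follows; the actual numbers depend on the system timer resolution, so none are shown.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class SleepTiming
{
    // Returns how long each individual Thread.Sleep call took, in milliseconds.
    public static double[] MeasureSleeps(int iterations, int sleepMs)
    {
        var durations = new double[iterations];

        // Warm-up sleep, so the first measured sleep starts right after a timer tick
        // like all the later ones do.
        Thread.Sleep(sleepMs);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            Thread.Sleep(sleepMs);
            durations[i] = sw.Elapsed.TotalMilliseconds;
            sw.Restart(); // measure each iteration separately, not the cumulative time
        }
        return durations;
    }

    static void Main()
    {
        // Printing happens after measuring, so console I/O is not part of the timings.
        foreach (double ms in MeasureSleeps(10, 100))
            Console.WriteLine("Slept for : {0:F3} ms", ms);
    }
}
```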

.NET's Multi-threading vs Multi-processing: Awful Parallel.ForEach Performance

I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
    private static List<string> input = new List<string>();

    private static void exec(int threadcount)
    {
        ParallelOptions options = new ParallelOptions();
        options.MaxDegreeOfParallelism = threadcount;
        Parallel.ForEach(Partitioner.Create(0, input.Count), options, (range) =>
        {
            var dic = new Dictionary<string, List<int>>();
            for (int i = range.Item1; i < range.Item2; i++)
            {
                //make some delay!
                //for (int x = 0; x < 400000; x++) ;
                var tokens = input[i].Split();
                foreach (var token in tokens)
                {
                    if (!dic.ContainsKey(token))
                        dic[token] = new List<int>();
                    dic[token].Add(1);
                }
            }
        });
    }

    public static void Main(String[] args)
    {
        StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
        while (true)
        {
            var line = reader.ReadLine();
            if (line == null)
                break;
            input.Add(line);
        }
        DateTime t0 = DateTime.Now;
        exec(Environment.ProcessorCount);
        Console.WriteLine("Parallel: " + (DateTime.Now - t0));
        t0 = DateTime.Now;
        exec(1);
        Console.WriteLine("Serial: " + (DateTime.Now - t0));
    }
}
It is simple and straightforward. I use a dictionary to count each word's occurrence. The style is roughly based on the MapReduce programming model. As you can see, each task uses its own private dictionary. So there are NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25, which is a tragedy! But when I add some delay to the processing of each line, I can reach speedup values of about 4.
In the original parallel execution with no delay, CPU utilization hardly reaches 30%, so the speedup is not promising. But when we add some delay, CPU utilization reaches 97%.
At first, I thought the cause was the IO-bound nature of the program (although I think inserting into a dictionary is to some extent CPU intensive), which seemed logical because all of the threads are reading data over a shared memory bus. However, the surprising point is that when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises and all four instances finish in about 2.3 seconds!
This means that when the code runs in a multiprocessing configuration, it reaches a speedup of about 3.5, but when it runs in a multithreading configuration, the speedup is only about 1.25.
What is your idea?
Is there anything wrong about my code? Because I think there is no shared data at all and I think the code shall not experience any contentions.
Is there a flaw in .NET's run-time?
Thanks in advance.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector.
Most well tuned applications are in the 0-10% range. This is typically
caused by an allocation pattern that allows objects to live just long
enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec.
This is quite high. It is not uncommon that this is simply a
performance bug.
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
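The 62.5% and 1.6× figures follow from Amdahl's law with half of the run time sped up 4× and the other half (GC) unchanged. A short sketch of that calculation follows; the class and method names are invented for illustration.

```csharp
using System;

static class Amdahl
{
    // Fraction of the original run time that remains when `parallelFraction` of it
    // is made `workers` times faster and the rest is left unchanged.
    public static double RemainingTime(double parallelFraction, int workers)
    {
        return (1.0 - parallelFraction) + parallelFraction / workers;
    }

    public static double Speedup(double parallelFraction, int workers)
    {
        return 1.0 / RemainingTime(parallelFraction, workers);
    }

    static void Main()
    {
        // 50% work (parallelizable across 4 cores), 50% GC (not sped up).
        Console.WriteLine("Remaining time: {0:P1}", RemainingTime(0.5, 4));
        Console.WriteLine("Speedup: {0:F2}x", Speedup(0.5, 4));
    }
}
```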
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
An attempt to explain the results:
a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
String.Split is the main hotspot.
so a shared something must be blocking the CPU.
that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
... else dic[token] += 1;
and the result is closer to a 2x speedup.
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is too complex to make broad statements about Parallel.ForEach().
It is too simple to solve/analyze a real problem.
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());
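For comparison, here is a sketch of a Parallel.ForEach version that does merge its per-worker dictionaries, using the localInit/localFinally overload and int counters instead of List<int>. The WordCount class and Count method are hypothetical names. Unlike the Split() call in the question, this variant drops empty tokens. It has not been benchmarked against the numbers in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class WordCount
{
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        var merged = new Dictionary<string, int>();

        Parallel.ForEach(
            lines,
            () => new Dictionary<string, int>(),   // one private dictionary per worker
            (line, state, local) =>
            {
                // Split on whitespace, dropping empty tokens.
                foreach (string token in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                {
                    int n;
                    local.TryGetValue(token, out n); // n stays 0 when the token is new
                    local[token] = n + 1;
                }
                return local;
            },
            local =>
            {
                // Runs once per worker: fold the private counts into the shared result.
                lock (merged)
                {
                    foreach (var pair in local)
                    {
                        int n;
                        merged.TryGetValue(pair.Key, out n);
                        merged[pair.Key] = n + pair.Value;
                    }
                }
            });

        return merged;
    }

    static void Main()
    {
        var counts = Count(new[] { "a b a", "b c", "a" });
        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```

The lock is taken only once per worker rather than once per word, so contention on the shared dictionary stays low.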

Measure code speed in .net in milliseconds

I want to get the maximum count I have to execute a loop for it to take x milliseconds to finish.
For example:
int GetIterationsForExecutionTime(int ms)
{
    int count = 0;
    /* pseudocode
    do
        some code here
        count++;
    until executionTime > ms
    */
    return count;
}
How do I accomplish something like this?
I want to get the maximum count I have to execute a loop for it to take x milliseconds to finish.
First off, simply do not do that. If you need to wait a certain number of milliseconds do not busy-wait in a loop. Rather, start a timer and return. When the timer ticks, have it call a method that resumes where you left off. The Task.Delay method might be a good one to use; it takes care of the timer details for you.
If your question is actually about how to time the amount of time that some code takes then you need much more than simply a good timer. There is a lot of art and science to getting accurate timings.
First you should always use Stopwatch and never use DateTime.Now for these timings. Stopwatch is designed to be a high-precision timer for telling you how much time elapsed. DateTime.Now is a low-precision timer for telling you if it is time to watch Doctor Who yet. You wouldn't use a wall clock to time an Olympic race; you'd use the highest precision stopwatch you could get your hands on. So use the one provided for you.
Second, you need to remember that C# code is compiled Just In Time. The first time you go through a loop can therefore be hundreds or thousands of times more expensive than every subsequent time due to the cost of the jitter analyzing the code that the loop calls. If you are intending on measuring the "warm" cost of a loop then you need to run the loop once before you start timing it. If you are intending on measuring the average cost including the jit time then you need to decide how many times makes up a reasonable number of trials, so that the average works out correctly.
Third, you need to make sure that you are not wearing any lead weights when you are running. Never make performance measurements while debugging. It is astonishing the number of people who do this. If you are in the debugger then the runtime may be talking back and forth with the debugger to make sure that you are getting the debugging experience you want, and that chatter takes time. The jitter is generating worse code than it normally would, so that your debugging experience is more consistent. The garbage collector is collecting less aggressively. And so on. Always run your performance measurements outside the debugger, and with optimizations turned on.
Fourth, remember that virtual memory systems impose costs similar to those of jitters. If you are already running a managed program, or have recently run one, then the pages of the CLR that you need are likely "hot" -- already in RAM -- where they are fast. If not, then the pages might be cold, on disk, and need to be page faulted in. That can change timings enormously.
Fifth, remember that the jitter can make optimizations that you do not expect. If you try to time:
// Let's time addition!
for (int i = 0; i < 1000000; ++i) { int j = i + 1; }
the jitter is entirely within its rights to remove the entire loop. It can realize that the loop computes no value that is used anywhere else in the program and remove it entirely, giving it a time of zero. Does it do so? Maybe. Maybe not. That's up to the jitter. You should measure the performance of realistic code, where the values computed are actually used somehow; the jitter will then know that it cannot optimize them away.
Sixth, timings of tests which create lots of garbage can be thrown off by the garbage collector. Suppose you have two tests, one that makes a lot of garbage and one that makes a little bit. The cost of the collection of the garbage produced by the first test can be "charged" to the time taken to run the second test if by luck the first test manages to run without a collection but the second test triggers one. If your tests produce a lot of garbage then consider (1) is my test realistic to begin with? It doesn't make any sense to do a performance measurement of an unrealistic program because you cannot make good inferences to how your real program will behave. And (2) should I be charging the cost of garbage collection to the test that produced the garbage? If so, then make sure that you force a full collection before the timing of the test is done.
Seventh, you are running your code in a multithreaded, multiprocessor environment where threads can be switched at will, and where the thread quantum (the amount of time the operating system will give another thread until yours might get a chance to run again) is about 16 milliseconds. 16 milliseconds is about fifty million processor cycles. Coming up with accurate timings of sub-millisecond operations can be quite difficult if the thread switch happens within one of the several million processor cycles that you are trying to measure. Take that into consideration.
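Several of these points (use Stopwatch, warm up first, collect garbage before timing, and actually use the computed values) could be folded into a small helper. The sketch below uses made-up names and is not a substitute for a proper benchmarking harness. It still has to be run outside the debugger with optimizations turned on.

```csharp
using System;
using System.Diagnostics;

static class MicroBenchmark
{
    // Results are accumulated here so the jitter cannot discard the measured work.
    public static long Sink;

    // Returns the average milliseconds per call of `work`, measured after one warm-up call.
    public static double AverageMilliseconds(Func<long> work, int iterations)
    {
        Sink = work();                 // warm-up: pays the JIT cost before timing starts

        GC.Collect();                  // full collection, so earlier garbage
        GC.WaitForPendingFinalizers(); // is not charged to this measurement
        GC.Collect();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Sink += work();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    static void Main()
    {
        double ms = AverageMilliseconds(() =>
        {
            long sum = 0;
            for (int i = 0; i < 1000000; ++i) sum += i;
            return sum;
        }, 100);
        Console.WriteLine("Average: {0:F4} ms per call", ms);
    }
}
```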
var sw = Stopwatch.StartNew();
...
long elapsedMilliseconds = sw.ElapsedMilliseconds;
You could also use the Stopwatch class:
int GetIterationsForExecutionTime(int ms)
{
    int count = 0;
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    do
    {
        // some code here
        count++;
    } while (stopwatch.ElapsedMilliseconds < ms);
    stopwatch.Stop();
    return count;
}
Good points from Eric Lippert.
I have been benchmarking and unit testing for a while, and I'd advise you to discard the first pass over your code because of JIT compilation.
So in benchmarking code that uses a loop and a Stopwatch, remember to put this at the end of the loop:
// JIT optimization.
if (i == 0)
{
    // Discard every result you've collected.
    // And restart the timer.
    stopwatch.Restart();
}
