I am using Parallel.ForEach to extract a bunch of zipped files and copy them to a shared folder on a different machine, where a BULK INSERT process is then started. This all works well, but I have noticed that as soon as some big files come along, no new tasks are started. I assume this is because some files take longer than others, and that the TPL starts scaling down and stops creating new tasks. I have set MaxDegreeOfParallelism to a reasonable number (8). When I look at the CPU activity, I can see that the SQL Server machine is below 30% most of the time, even less when it sits on a single BULK INSERT task. I think it could do more work. Can I somehow force the TPL to create more simultaneously processed tasks?
The reason is most likely the way Parallel.ForEach processes items by default. If you use it on an array or anything that implements IList (so that the total length and an indexer are available), it splits the whole workload into batches up front, and a separate thread then processes each batch. That means that if the batches have different "sizes" (by size I mean the time it takes to process them), the "small" batches complete faster.
For example, let's look at this code:
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// 24 fast items (100 ms each) followed by 4 slow items (2000 ms each)
var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
Parallel.ForEach(delays, new ParallelOptions { MaxDegreeOfParallelism = 4 }, d =>
{
    Thread.Sleep(d);
    Console.WriteLine("Done with " + d);
});
If you run it, you will see that all the "100" (fast) items are processed quickly and in parallel. However, all the "2000" (slow) items are processed at the end, one by one, without any parallelism at all. That's because all the "slow" items end up in the same batch. The workload was split into 4 batches (MaxDegreeOfParallelism = 4), and the first 3 contain only fast items, so they complete quickly. The last batch has all the slow items, and the single thread dedicated to that batch processes them one by one.
You can "fix" this in your situation either by ensuring that the items are distributed evenly (so that the "slow" items are not all together in the source collection), or, for example, with a custom partitioner:
// Partitioner lives in System.Collections.Concurrent
var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
var partitioner = Partitioner.Create(delays, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, new ParallelOptions { MaxDegreeOfParallelism = 4 }, d =>
{
    Thread.Sleep(d);
    Console.WriteLine("Done with " + d);
});
NoBuffering ensures that items are taken one at a time, which avoids the problem.
Using another means to parallelize your work (such as SemaphoreSlim or BlockingCollection) is also an option.
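For illustration, here is a minimal sketch of the SemaphoreSlim approach, assuming a hypothetical ExtractAndCopy method that stands in for your per-file extract-and-copy work:

using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var throttler = new SemaphoreSlim(8); // at most 8 files in flight at once
var tasks = files.Select(async file =>
{
    await throttler.WaitAsync(); // each item waits for a free slot individually
    try
    {
        await Task.Run(() => ExtractAndCopy(file)); // hypothetical per-file work
    }
    finally
    {
        throttler.Release();
    }
}).ToList();
await Task.WhenAll(tasks);

Because each item acquires the semaphore individually, there is no up-front batching, so a run of slow files cannot starve the remaining work.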
Let's simplify this scenario. There is a machine with 16 GB RAM and 4 CPU cores. Given a list of objects with different sizes, e.g. [3,1,7,9,4,5,2], each element needs the corresponding amount of RAM while it is processed, e.g. "1" needs 1 GB RAM.
What is the best way to process this list of elements in parallel, without causing OutOfMemory, in C#, with a parallelism library (built-in or 3rd party)?
One naive strategy could be:
First round: choose [3,1,7]. There is still one core left, but adding "9" would push the program to 20 GB RAM, so for now use only 3 cores.
Second round: if "3" finishes first, consider "9", but that would still surpass the 16 GB RAM capacity (1+7+9 = 17), so stop and wait.
Third round: if "7" then finishes, the program moves on with "1", "9" and "4".
I'm not an expert on algorithms or parallelism, so I can't frame this problem in more specific detail... Any help, link or advice is highly appreciated. I believe this problem may have been solved somewhere else, and I don't want to reinvent the wheel.
You could consider using a specialized Semaphore that can have its CurrentCount decreased and increased atomically by more than 1, like the one found in this question. You could initialize this mechanism with an initialCount equal to the available memory in GBs (16), and Wait/Release it with the size of each object in GBs (between 1 and 16). This way an object could acquire the semaphore only after waiting for the CurrentCount to become equal to or larger than its size.
To incorporate this mechanism in a Parallel.ForEach loop, you could create a deferred enumerable that Waits for the semaphore as part of the enumeration, and then feed this throttled enumerable as the source of the parallel loop. One important detail you should take care of is to disable the chunk partitioning that Parallel.ForEach employs by default, by using the EnumerablePartitionerOptions.NoBuffering configuration; otherwise Parallel.ForEach may greedily enumerate more than one item at a time, interfering with the throttling intentions of this algorithm.
The semaphore should be released inside the body of the parallel loop, in a finally block, with the same releaseCount as the size of the processed object.
Putting everything together:
var items = new[] { 3, 1, 7, 9, 4, 5, 2 };
const int availableMemory = 16; // GB
using var throttler = new SemaphoreManyFifo(availableMemory, availableMemory);

// Deferred enumerable: each item waits until enough memory units are free.
var throttledItems = items
    .Select(item => { throttler.Wait(item); return item; });

// Disable chunk partitioning so that items are enumerated one at a time.
var partitioner = Partitioner.Create(throttledItems,
    EnumerablePartitionerOptions.NoBuffering);

var parallelOptions = new ParallelOptions()
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.ForEach(partitioner, parallelOptions, item =>
{
    try
    {
        Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Processing #{item}");
        Thread.Sleep(item * 1000); // Simulate a CPU-bound operation
    }
    finally
    {
        // Release as many units as the item's size, even if processing failed.
        throttler.Release(item);
        Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Item #{item} completed");
    }
});
Note: the size of each object should not exceed the initialCount of the semaphore, otherwise this algorithm will malfunction. Be aware that the aforementioned SemaphoreManyFifo implementation does not include proper argument validation.
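A simple up-front guard against that case could look like this (a sketch, reusing the items and availableMemory variables from above):

// Hypothetical guard: an item larger than the semaphore's capacity could
// never acquire it and would deadlock the loop, so fail fast instead.
if (items.Any(i => i > availableMemory))
    throw new ArgumentOutOfRangeException(nameof(items),
        "An item's size exceeds the total available memory.");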
I have a REST Web API service in IIS which takes a collection of request objects. The user can enter more than 100 request objects.
I want to run these 100 requests concurrently and then aggregate the results and send them back. This involves both I/O operations (calling backend services for each request) and CPU-bound operations (computing a few response elements).
Code snippet -
using System.Threading.Tasks;
....
var taskArray = new Task<FlightInformation>[multiFlightStatusRequest.FlightRequests.Count];
for (int i = 0; i < multiFlightStatusRequest.FlightRequests.Count; i++)
{
    var z = i; // capture a copy of the loop variable for the lambda
    taskArray[z] = Task.Run(() =>
        PerformLogic(multiFlightStatusRequest.FlightRequests[z], lite, fetchRouteByAnyLeg)
    );
}
Task.WaitAll(taskArray);
for (int i = 0; i < taskArray.Length; i++)
{
    flightInformations.Add(taskArray[i].Result);
}
public Object PerformLogic(Request,...)
{
//multiple IO operations each depends on the outcome of the previous result
//Computations after getting the result from all I/O operations
}
If I run the PerformLogic operation individually (for 1 object) it takes 300 ms. My requirement is that when I run PerformLogic() for 100 objects in a single request, it should take around 2 seconds.
PerformLogic() has the following steps:
1. Call a 3rd-party web service to get some details
2. Based on the details, call another 3rd-party web service
3. Collect the result from the web service and apply a few transformations
But with Task.Run() it takes around 7 seconds. I would like to know the best approach to handle this concurrency and achieve the desired NFR of 2 seconds.
I can see that at any point in time only 7-8 threads are working concurrently.
I'm not sure whether spawning 100 threads or tasks would give better performance. Please suggest an approach to handle this efficiently.
Judging by this
public Object PerformLogic(Request,...)
{
//multiple IO operations each depends on the outcome of the previous result
//Computations after getting the result from all I/O operations
}
I'd wager that PerformLogic spends most of its time waiting on the IO operations. If so, there's hope with async. You'll have to rewrite PerformLogic, and maybe even the IO operations - async needs to be present at all levels, from the top to the bottom. But if you can do it, the result should be a lot faster.
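As a rough sketch of what that rewrite could look like (the GetDetailsAsync, GetMoreDetailsAsync and Transform names are hypothetical stand-ins for the steps described in the question):

public async Task<FlightInformation> PerformLogicAsync(Request request)
{
    // While each call is in flight, the thread goes back to the pool,
    // so 100 requests can overlap without needing 100 threads.
    var details = await GetDetailsAsync(request);          // 1st third-party call
    var moreDetails = await GetMoreDetailsAsync(details);  // 2nd call, depends on the 1st
    return Transform(details, moreDetails);                // CPU-bound final step
}

// Fan out all requests and aggregate the results:
var tasks = multiFlightStatusRequest.FlightRequests.Select(r => PerformLogicAsync(r));
var flightInformations = await Task.WhenAll(tasks);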
Other than that - get faster hardware. If 8 cores take 7 seconds, then get 32 cores. It's pricey, but could still be cheaper than rewriting the code.
First, don't reinvent the wheel. PLINQ is perfectly capable of doing stuff in parallel, there is no need for manual task handling or result merging.
If you want 100 tasks each taking 300 ms done in 2 seconds, you need at least 15 parallel workers (100 × 300 ms = 30 s of sequential work, and 30 s / 2 s = 15), ignoring the cost of the parallelization itself.
var results = multiFlightStatusRequest.FlightRequests
    .AsParallel()
    .WithDegreeOfParallelism(15)
    .Select(flightRequest => PerformLogic(flightRequest, lite, fetchRouteByAnyLeg))
    .ToList();
Now you have told PLINQ to use 15 concurrent workers to work on your queue of tasks. Are you sure your machine is up to the task? You could put any number you want in there; that doesn't mean your computer magically gets the power to handle it.
Another option is to look at your PerformLogic method and optimize that. You call it 100 times, maybe it's worth optimizing.
I have the following issue :
I am using a Parallel.ForEach iteration for a pretty CPU-intensive workload (applying a method to a number of items) and it works fine for about the first 80% of the items - using all CPU cores very nicely.
As the iteration comes near the end (around 80%, I would say), I see the number of active threads go down core by core, and at the end the last roughly 5% of the items are processed by only two cores. So instead of using all cores until the end, it slows down pretty hard toward the end of the iteration.
Please note that the workload can be very different per item. One can last 1-2 seconds, another item can take 2-3 minutes to finish.
Any idea or suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Length);
using (SqlConnection connection = new SqlConnection(cnStr))
{
    connection.Open();
    try
    {
        Parallel.ForEach(rangePartitioner, (range, loopState) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                CPUIntensiveMethod(source[i]);
            }
        });
    }
    catch (AggregateException ae)
    {
        // Exception catching
    }
}
This is an unavoidable consequence of the fact that the parallelism is per computation. Clearly, the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000 s to run) and the rest quick (say 1 s to run). You kick them off in a random order across 8 threads. It's clear that eventually each thread will be calculating one of your long-running items; at this point you are seeing full utilisation. Eventually the thread(s) that hit their long op(s) first will finish them and quickly work off any remaining short ops. At that time ONLY some of the long ops are left to finish, so the active utilisation drops off, i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long-running items might be amenable to "internal parallelism", allowing them to have a faster minimum runtime.
Your long-running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for as long as possible); see the sketch after this list.
(See update below.) DON'T use partitioning in cases where the body can be long running, as this simply increases the "hit" of this effect (i.e. get rid of your rangePartitioner entirely). This will massively reduce the impact of this effect on your particular loop.
Either way, your batch run-time is bound by the run-time of the slowest item in the batch.
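A minimal sketch of the prioritisation idea, assuming you have some estimate of each item's cost (the EstimatedCost property is hypothetical):

// Start the (estimated) longest-running items first, so the slow ops
// overlap with the many short ops instead of forming a tail at the end.
var ordered = source.OrderByDescending(e => e.EstimatedCost);
Parallel.ForEach(ordered, item => CPUIntensiveMethod(item));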
Update: I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect; i.e. you are saying "break this work-set down into N work-sets" and then parallelising the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set, and so those are going to be processed on the same thread. As such, you should NOT be using partitioning if the inner body can be long running. For example, the docs on partitioning at https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx say it is aimed at short bodies.
If you have multiple threads that each process the same number of items, and each item takes a varying amount of time, then of course some threads will finish earlier.
If you use a collection whose size is not known, the items will be taken on demand instead of being pre-split into fixed ranges. Note that AsEnumerable() only changes the compile-time type - Parallel.ForEach still detects the underlying IList<T> at runtime - so to reliably disable range partitioning, hide the list behind a projection (or use a NoBuffering partitioner, as shown above):
var source = myList.Select(x => x);
Another approach could be the Producer-Consumer pattern:
https://msdn.microsoft.com/en-us/library/dd997371
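A rough sketch of that pattern with BlockingCollection (a minimal example, reusing the CPUIntensiveMethod from the question; MyItem is a hypothetical element type):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var queue = new BlockingCollection<MyItem>();

// Consumers: each takes the next item as soon as it is free, so no thread
// gets stuck with a pre-assigned range full of slow items.
var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var item in queue.GetConsumingEnumerable())
            CPUIntensiveMethod(item);
    }))
    .ToArray();

// Producer: feed the work items, then signal that no more are coming.
foreach (var item in myList)
    queue.Add(item);
queue.CompleteAdding();

Task.WaitAll(consumers);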
I have a List of items, and I would like to go through each item, create a task and launch it. But I want it to run in batches of 10 tasks at once.
For example, if I have 100 URLs in a list, I want it to group them into batches of 10 and loop through the batches, getting the web response for 10 URLs per batch iteration.
Is this possible?
I am using C# 5 and .NET 4.5.
You can use Parallel.For() or Parallel.ForEach(); they will execute the work on a number of tasks.
When you need precise control over the batches you could use a custom Partitioner, but given that the problem is about URLs it will probably make more sense to use the more common MaxDegreeOfParallelism option.
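For example (a sketch, assuming a DownloadUrl method that fetches a single URL synchronously):

Parallel.ForEach(
    urls,
    new ParallelOptions { MaxDegreeOfParallelism = 10 }, // at most 10 in flight
    url => DownloadUrl(url)); // DownloadUrl is a hypothetical per-URL fetch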
The Partitioner has a good algorithm for creating the batches, depending also on the number of cores:
Parallel.ForEach(Partitioner.Create(from, to), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        // ... process i
    }
});
I have a million elements in a List to process.
Dropping them crudely into a Parallel.ForEach would just saturate the CPU.
Instead, I split the master list into pieces and drop the sublists into a parallel loop.
List<Element> MasterList = new List<Element>();
Populate(MasterList); // puts a million elements into the list

// Split the master list into 100 lists of 10,000 elements each
List<List<Element>> ListOfSubLists = Split(MasterList, 100);
foreach (List<Element> EL in ListOfSubLists)
{
    Parallel.ForEach(EL, E =>
    {
        // Do Stuff
    });
    // wait for all parallel iterations to end before continuing
}
What is the best way to wait for all parallel iterations to end before continuing to the next iteration of the outer loop?
Edit :
As some answers stated, "saturate the CPU" is not an accurate expression.
Actually, I just want to limit the CPU usage and avoid excessive load on it coming from this processing.
Parallel.ForEach will not saturate the CPU; it uses some intelligence to decide how many parallel threads to run simultaneously, with a max of 63.
See: Does Parallel.ForEach limits the number of active threads?
You can also set the max degree of parallelism if you want, by supplying a ParallelOptions like new ParallelOptions { MaxDegreeOfParallelism = 5 } as the second argument to Parallel.ForEach.
As a last point, Parallel.ForEach blocks until all of the iterations have completed. So your code, as written, works. You do not need to wait for the iterations to complete.
What do you mean by "saturate the CPU"?
You can still throttle the Parallel.ForEach loop by supplying it with ParallelOptions,
one property of which is MaxDegreeOfParallelism.
That will allow you to go back to your single collection, e.g.
Parallel.ForEach(
    collection,
    new ParallelOptions { MaxDegreeOfParallelism = 5 },
    E => { DoStuff(E); }
);