I have a million elements in a List to process.
Dropping them crudely into a Parallel.ForEach would just saturate the CPU.
Instead I split the master list into pieces and drop the sublists into a parallel loop, one at a time.
List<Element> MasterList = new List<Element>();
Populate(MasterList); // puts a million elements into the list

// Split the master list into 100 lists of 10,000 elements each
List<List<Element>> ListOfSubLists = Split(MasterList, 100);

foreach (List<Element> EL in ListOfSubLists)
{
    Parallel.ForEach(EL, E =>
    {
        // Do Stuff
    });
    // wait for all parallel iterations to end before continuing
}
What is the best way to wait for all parallel iterations to end before continuing to the next iteration of the outer loop?
Edit:
As some answers stated, "saturate the CPU" is not an accurate expression.
Actually I just want to limit the CPU usage and avoid excessive load coming from this processing.
Parallel.ForEach will not saturate the CPU; it uses some intelligence to decide how many parallel threads to run simultaneously, with a max of 63.
See: Does Parallel.ForEach limits the number of active threads?
You can also set the max degree of parallelism if you want, by supplying a ParallelOptions like new ParallelOptions { MaxDegreeOfParallelism = 5 } as the second argument to Parallel.ForEach.
As a last point, Parallel.ForEach blocks until all of the iterations have completed. So your code, as written, works. You do not need to wait for the iterations to complete.
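In fact you can skip the sublists entirely and throttle a single loop. A minimal sketch (reusing the question's MasterList; the limit of 5 is just an example):

Parallel.ForEach(
    MasterList,
    new ParallelOptions { MaxDegreeOfParallelism = 5 },
    E =>
    {
        // Do Stuff
    });
// execution reaches this line only after every iteration has finished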
What do you mean by "saturate the CPU"?
You can still throttle the Parallel.ForEach loop by supplying it with ParallelOptions,
one property of which is MaxDegreeOfParallelism.
That will allow you to go back to your single collection, e.g.
Parallel.ForEach(
    collection,
    new ParallelOptions { MaxDegreeOfParallelism = 5 },
    E => { DoStuff(E); }
);
Related
Let's simplify this scenario. There is a machine with 16 GB of RAM and 4 CPU cores. Given a list of objects with different sizes, e.g. [3,1,7,9,4,5,2], each element needs the corresponding amount of RAM to process, e.g. "1" needs 1 GB of RAM.
What is the best way to process this list of elements in parallel, without causing OutOfMemory, in C#, with a parallelism library (built-in or 3rd-party)?
One naive strategy could be:
First round: choose [3,1,7]. There is still one core left, but using "9" would bring the program to 20 GB of RAM, so for now use only 3 cores.
Second round: if "3" finishes first, consider "9", but that would still exceed the 16 GB capacity (1+7+9 = 17), so stop and wait.
Third round: if "7" then finishes, the program moves on with "1", "9" and "4".
I'm not an expert on algorithms or parallelism, so I can't frame this problem in more specific terms... Any help, link, or advice is highly appreciated. I believe this problem may have been solved somewhere else, and I don't want to reinvent the wheel.
You could consider using a specialized Semaphore that can have its CurrentCount decreased and increased atomically by more than 1, like the one found in this question. You could initialize this mechanism with an initialCount equal to the available memory in GBs (16), and Wait/Release it with the size of each object in GBs (between 1 and 16). This way an object could acquire the semaphore only after waiting for the CurrentCount to become equal to or larger than its size.
To incorporate this mechanism in a Parallel.ForEach loop, you could create a deferred enumerable that Waits on the semaphore as part of the enumeration, and then feed this throttled enumerable as the source of the parallel loop. One important detail to take care of is disabling the chunk partitioning that Parallel.ForEach employs by default, by using the EnumerablePartitionerOptions.NoBuffering configuration; otherwise Parallel.ForEach may greedily enumerate more than one item at a time, interfering with the throttling intentions of this algorithm.
The semaphore should be released inside the body of the parallel loop, in a finally block, with the same releaseCount as the size of the processed object.
Putting everything together:
var items = new[] { 3, 1, 7, 9, 4, 5, 2 };
const int availableMemory = 16; // GB
using var throttler = new SemaphoreManyFifo(availableMemory, availableMemory);
var throttledItems = items
.Select(item => { throttler.Wait(item); return item; });
var partitioner = Partitioner.Create(throttledItems,
EnumerablePartitionerOptions.NoBuffering);
var parallelOptions = new ParallelOptions()
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.ForEach(partitioner, parallelOptions, item =>
{
try
{
Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Processing #{item}");
Thread.Sleep(item * 1000); // Simulate a CPU-bound operation
}
finally
{
throttler.Release(item);
Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Item #{item} completed");
}
});
Note: the size of each object should not exceed the initialCount of the semaphore, otherwise this algorithm will malfunction. Be aware that the aforementioned SemaphoreManyFifo implementation does not include proper argument validation.
I am using Parallel.ForEach to extract a bunch of zipped files and copy them to a shared folder on a different machine, where a BULK INSERT process is then started. This all works well, but I have noticed that as soon as some big files come along, no new tasks are started. I assume this is because some files take longer than others, and that the TPL starts scaling down and stops creating new tasks. I have set the MaxDegreeOfParallelism to a reasonable number (8). When I look at the CPU activity, I can see that most of the time the SQL Server machine is below 30%, even less when it sits on a single BULK INSERT task. I think it could do more work. Can I somehow force the TPL to create more simultaneously processed tasks?
The reason is most likely the way Parallel.ForEach processes items by default. If you use it on an array or something that implements IList (so that the total length and an indexer are available), it will split the whole workload into batches up front, and a separate thread will then process each batch. That means that if batches have different "sizes" (by size I mean the time needed to process them), the "small" batches will complete faster.
For example, let's look at this code:
var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
Parallel.ForEach(delays, new ParallelOptions() {MaxDegreeOfParallelism = 4}, d =>
{
Thread.Sleep(d);
Console.WriteLine("Done with " + d);
});
If you run it, you will see that all the "100" (fast) items are processed quickly and in parallel. However, all the "2000" (slow) items are processed at the end, one by one, without any parallelism at all. That's because all the "slow" items are in the same batch: the workload was split into 4 batches (MaxDegreeOfParallelism = 4), and the first 3 contain only fast items. They complete quickly. The last batch has all the slow items, so the thread dedicated to that batch processes them one by one.
You can "fix" that for your situation either by ensuring that items are distributed evenly (so that "slow" items are not all together in source collection), or for example with custom partitioner:
var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
var partitioner = Partitioner.Create(delays, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, new ParallelOptions {MaxDegreeOfParallelism = 4}, d =>
{
Thread.Sleep(d);
Console.WriteLine("Done with " + d);
});
NoBuffering ensures that items are taken one at a time, which avoids the problem.
Using other means to parallelize your work (such as SemaphoreSlim or BlockingCollection) is also an option.
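For illustration, a minimal sketch of the SemaphoreSlim variant (my own sketch, assuming .NET 4.5+ for WaitAsync and an async context such as an async Main): each item becomes a task that must first acquire one of 4 slots, so a slow item never holds up items queued behind it in a batch.

var delays = Enumerable.Repeat(100, 24).Concat(Enumerable.Repeat(2000, 4)).ToArray();
var gate = new SemaphoreSlim(4); // at most 4 items in flight at any moment
var tasks = delays.Select(async d =>
{
    await gate.WaitAsync(); // take a slot, or wait until one is free
    try
    {
        await Task.Delay(d); // simulate the work
        Console.WriteLine("Done with " + d);
    }
    finally
    {
        gate.Release(); // free the slot for the next item
    }
}).ToList();
await Task.WhenAll(tasks);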
I have a List of items, and I would like to go through each item, create a task, and launch the task. But I want it to do batches of 10 tasks at once.
For example, if I have 100 URLs in a list, I want it to group them into batches of 10, and loop through the batches, getting the web response from 10 URLs per batch iteration.
Is this possible?
I am using C# 5 and .NET 4.5.
You can use Parallel.For() or Parallel.ForEach(); they will execute the work on a number of Tasks.
When you need precise control over the batches you could use a custom Partitioner but given that the problem is about URLs it will probably make more sense to use the more common MaxDegreeOfParallelism option.
The Partitioner has a good algorithm for creating the batches depending also on the number of cores.
int from = 0, to = 100; // the index range to process
Parallel.ForEach(Partitioner.Create(from, to), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        // ... process i
    }
});
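For the URL scenario specifically, a minimal sketch of the MaxDegreeOfParallelism route (my own sketch; the url list and the use of WebClient are assumptions, not from the question):

var urls = Enumerable.Range(0, 100).Select(i => "http://example.com/page" + i).ToList();

Parallel.ForEach(
    urls,
    new ParallelOptions { MaxDegreeOfParallelism = 10 }, // at most 10 requests at a time
    url =>
    {
        using (var client = new WebClient()) // System.Net; available on .NET 4.5
        {
            string response = client.DownloadString(url);
            // ... process the response
        }
    });

Note that this blocks a thread per in-flight download, which is acceptable for around 10 concurrent requests.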
I have a for loop with more than 20k iterations; each iteration takes around two or three seconds, around 20 minutes in total. How can I optimize this loop? I am using .NET 3.5, so Parallel.ForEach is not possible. I split the 20,000 items into small chunks and implemented some threading, and I am now able to reduce the time by 50%. Is there any other way to optimize this kind of for loop?
My sample code is given below
static double sum = 0.0;
static readonly object sumLock = new object(); // guards 'sum', which is updated from pool threads
public delegate double MyDelegate(int id);

public double AsyncTest()
{
    List<Item> ItemsList = GetItem(); // around 20k items
    int count = 0;
    var newItemsList = ItemsList.Take(62).ToList();
    while (newItemsList.Count > 0) // loop until the list is exhausted
    {
        int j = 0;
        // size the handle array to the actual batch, so a short final
        // batch does not leave null entries for WaitAll
        WaitHandle[] waitHandles = new WaitHandle[newItemsList.Count];
        foreach (Item item in newItemsList)
        {
            var delegateInstance = new MyDelegate(MyMethod);
            IAsyncResult asyncResult = delegateInstance.BeginInvoke(item.id, new AsyncCallback(MyAsyncResults), null);
            waitHandles[j] = asyncResult.AsyncWaitHandle;
            j++;
        }
        WaitHandle.WaitAll(waitHandles); // WaitAll accepts at most 64 handles
        count = count + 62;
        newItemsList = ItemsList.Skip(count).Take(62).ToList();
    }
    return sum;
}
public double MyMethod(int id)
{
    double result = 0.0;
    // Calculations that produce 'result' for this id
    return result;
}

static public void MyAsyncResults(IAsyncResult iResult)
{
    AsyncResult asyncResult = (AsyncResult)iResult;
    MyDelegate del = (MyDelegate)asyncResult.AsyncDelegate;
    double mySum = del.EndInvoke(iResult);
    lock (sumLock) // callbacks arrive concurrently on pool threads
    {
        sum = sum + mySum;
    }
}
It's possible to reduce the number of loops by various techniques. However, this won't give you any noticeable improvement, since the heavy computation is performed inside your loops. If you've already parallelized it to use all your CPU cores, there is not much to be done. There is a certain amount of computation to be done and a certain amount of computing power available. You can't squeeze more out of your machine than it can provide.
You can try to:
Do a more efficient implementation of your algorithm if it's possible
Switch to faster environment/language, such as unmanaged C/C++.
Is there a rationale behind your batch size (62)?
Is the "MyMethod" method IO-bound or CPU-bound?
What you do in each cycle is wait until the whole batch completes, which wastes some cycles (you are actually waiting for all 62 calls to complete before taking the next batch).
Why not change the approach a bit, so that you still keep N operations running simultaneously, but fire a new operation as soon as one of the executing operations completes?
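A minimal sketch of that idea (my own illustration, .NET 3.5-compatible, using a Semaphore plus the ThreadPool; ThrottledRunner and its members are assumed names, not from the question):

using System;
using System.Collections.Generic;
using System.Threading;

static class ThrottledRunner
{
    // Keeps at most maxConcurrency calls to work() in flight; a new one
    // starts as soon as a slot frees up, instead of waiting per batch.
    public static double Run(IEnumerable<int> ids, Func<int, double> work, int maxConcurrency)
    {
        var slots = new Semaphore(maxConcurrency, maxConcurrency);
        var allDone = new ManualResetEvent(false);
        object sumLock = new object();
        double total = 0.0;
        int pending = 1; // starts at 1; the extra count is released after the loop

        foreach (int id in ids)
        {
            slots.WaitOne(); // blocks until one of the N slots is free
            Interlocked.Increment(ref pending);
            int capturedId = id; // capture a copy for the closure
            ThreadPool.QueueUserWorkItem(delegate
            {
                try
                {
                    double r = work(capturedId);
                    lock (sumLock) { total += r; }
                }
                finally
                {
                    slots.Release(); // free the slot immediately
                    if (Interlocked.Decrement(ref pending) == 0) allDone.Set();
                }
            });
        }
        if (Interlocked.Decrement(ref pending) == 0) allDone.Set();
        allDone.WaitOne(); // wait for every queued item to finish
        return total;
    }
}

With something like this in place, the batch loop in AsyncTest collapses to roughly return ThrottledRunner.Run(ItemsList.Select(i => i.id), MyMethod, Environment.ProcessorCount);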
According to this blog, for loops are faster than foreach for collections. Try looping with for; it may help.
It sounds like you have a CPU intensive MyMethod. For CPU intensive tasks, you can gain a significant improvement by parallelization, but only to the point of better utilizing all CPU cores. Beyond that point, too much parallelization can start to hurt performance -- which I think is what you're doing. (This is unlike I/O intensive tasks where you pretty much parallelize as much as possible.)
What you need to do, in my opinion, is write another method that takes a "chunk" of items (not a single item) and returns their "sum":
double SumChunk(IEnumerable<Item> items)
{
    return items.Sum(x => MyMethod(x.id)); // MyMethod takes the item's id
}
Then divide the number of items by n (n being the degree of parallelism; try n = the number of CPU cores, and compare that with 2×n) and pass each chunk to an async task running SumChunk. Finally, sum up the sub-results.
Also, watch whether any of the chunks completes much earlier than the others. If so, your task distribution is not homogeneous; you'd need to create smaller chunks (say, chunks of 300 items) and pass those to SumChunk.
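A minimal sketch of the chunk-and-sum idea (my own illustration using plain threads, since .NET 3.5 has no Task; assumes the usual System.Linq and System.Threading usings, plus the SumChunk method above):

double ParallelSum(List<Item> items, int n)
{
    int chunkSize = (items.Count + n - 1) / n; // ceiling division
    double[] partial = new double[n];
    var threads = new List<Thread>();
    for (int i = 0; i < n; i++)
    {
        int chunkIndex = i; // capture a copy for the closure
        var chunk = items.Skip(chunkIndex * chunkSize).Take(chunkSize).ToList();
        var t = new Thread(() => partial[chunkIndex] = SumChunk(chunk));
        threads.Add(t);
        t.Start();
    }
    foreach (var t in threads) t.Join(); // wait for every chunk to finish
    return partial.Sum();
}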
Correct me if I'm wrong, but it looks to me like your threading is at the individual item level - I wonder if this may be a little too granular.
You are already doing your work in blocks of 62 items. What if you were to take those items and process all of them within a single thread? I.e., you would have something like this:
void RunMyMethods(IEnumerable<Item> items)
{
    foreach (Item item in items)
    {
        var result = MyMethod(item.id); // MyMethod takes the item's id
        ...
    }
}
Keep in mind that WaitHandle objects can be slower than using Monitor objects: http://www.yoda.arachsys.com/csharp/threads/waithandles.shtml
Otherwise, the usual advice holds: profile the performance to find the true bottlenecks. In your question you state that it takes 2-3 seconds per iteration - with 20000 iterations, it would take a fair bit more than 20 minutes.
Edit:
If you are wanting to maximise your usage of CPU time, then it may be best to split your 20000 items into, say, four groups of 5000 and process each group in its own thread. I would imagine that this sort of "thick 'n chunky" concurrency would be more efficient than a very fine-grained approach.
To start with, the numbers just don't add up:
20k iterations; each iteration takes around two or three seconds, yet around 20 minutes in total
That's a x40 'parallelism factor' - you can never achieve that running on a normal machine.
Second, when 'optimizing' a CPU-intensive computation, there's no sense in parallelizing beyond the number of cores. Try dropping that magical 62 down to 16 and benchmark it - it will actually run faster.
I ran a modified version of your code on my laptop and got some 10-20% improvement using Parallel.ForEach.
So maybe you can make it run in 17 minutes instead of 20 - does it really matter?
I have been writing "linear" winforms for couple of months and now I am trying to figure out threading.
this is my loop that has around 40,000 rows and it takes around 1 second to perform a task on this row:
foreach (String CASE in MAIN_CASES_LIST)
{
//bunch of code here
}
How do I
put each iteration into a separate thread, and
maintain no more than a given number of threads at the same time?
If you're on .NET 4 you can utilize Parallel.ForEach
Parallel.ForEach(MAIN_CASES_LIST, CASE =>
{
//bunch of code here
});
There's a great library called SmartThreadPool which may be useful here; it does a lot of useful stuff with threading and queueing, abstracting most of this away from you.
Not sure if it will help you, but you can queue up a crapload of work items, limit the number of threads, etc.
http://www.codeproject.com/Articles/7933/Smart-Thread-Pool
Of course if you want to get your hands dirty with multi-threading or use Parallel go for it, it's just a suggestion :)
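For a rough idea, a sketch of how that might look (the exact SmartThreadPool member names here are an assumption based on the linked article, so double-check them against the library):

var startInfo = new STPStartInfo { MaxWorkerThreads = 5 }; // cap the pool at 5 threads
var stp = new SmartThreadPool(startInfo);
foreach (String CASE in MAIN_CASES_LIST)
{
    String captured = CASE; // capture a copy for the closure (pre-C#5 foreach)
    stp.QueueWorkItem(new WorkItemCallback(state =>
    {
        // bunch of code here, using (String)state
        return null;
    }), captured);
}
stp.WaitForIdle(); // block until every queued work item has run
stp.Shutdown();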
To mix the answers above, and add a limit to the maximum number of threads created, you can use this overloaded call. Just be sure to add "using System.Threading.Tasks;" at the top.
LinkedList<String> theList = new LinkedList<string>();
ParallelOptions parOptions = new ParallelOptions();
parOptions.MaxDegreeOfParallelism = 5; //only up to 5 threads allowed.
Parallel.ForEach(theList.AsEnumerable(), parOptions, (string CASE) =>
{
//bunch of code here
});