Ensure Parallel Invoke don't use to much CPU

Ensure Parallel Invoke don't use to much CPU - c#

I have a C# Program with WCF that use parallel Invoke a bit.
First, every client call is parallel on service side with my WCF service.
I have a class A that contains a List of class B
I can add a list of class B without adding A.
To insert my list of element B I do it in parallel because before adding I do a lot of verification. And same with A
Some Client adds in one time really big list of A elements.
So, I use a parallel invoke for adding each A elements.
I configure it with parallel options to use no more than half of the CPU.
To let other users that do other thing use the CPU.
But the task that Add Class A who is already parallel limited to half of CPU create another Parallel Invoke to Add Class B
For exemple on call of
InvokeAddClassAList
Create Two thread AddClassA.
And Each AddClassA
Create Two Thread AddClassB
So, I have now 4 Thread.
Is this 4 thread limited to half of CPU?
Or only The Two AddClassA are limited to half CPU and each children Thread can use as much CPU as they want?
var pCount = Environment.ProcessorCount / 2;
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = pCount > 0 ? pCount : 1;
Parallel.Invoke(options, actions.ToArray());

Your CPU is a resource that should be used. Nobody will thank you if it idles. So limiting your process is pointless. You don't even know if there is somebody else using the computer at the time.
You could use your own TaskScheduler implementation to influence when and how many tasks will be started and when.
But again, there is no point. You should request as much as possible. And if you want to be nice to other users, lower your processes priority, so if you want to use 100%, they can still work with their higher prioritized processes.

Related

Thread Contention on a ConcurrentDictionary in C#

I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
Task.Run(() => ExecutionLoop1());
Task.Run(() => ExecutionLoop2());
...
Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
public void SnapTotals()
{
foreach (KeyValuePair<string, MarketData> kvpMarketData in
new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
{
...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?

Your basic approach is sound except for one fatal flaw: they are all hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals you must iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime or even new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime) you are using the ConcurrentDictionary<>'s iterator, which even though is thread-safe, is going to be using the dictionary for the entire period of iteration (including however long it takes to do the processing for each and every entry in the dictionary). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so should lower contention.
Change SnapTotals to
public void SnapTotals()
{
var copy = Handler.MessageEventHandler.Realtime.ToArray();
foreach (var kvpMarketData in copy)
{
...
Now, each ExecutionLoopX can execute in peace without write-side contention (your API updates) and without read-side contention from the other loops. The write-side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to do each copy.
And by the way, the dictionary copy (an array) is not threadsafe; it's just a plain array, but that is ok because each task is executing in isolation on its own copy.

I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since the methods are more than the cores of your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than the Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be separated to pieces so that they can be parallelized. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it will be possible to execute them in steps, with each step being the code from the one await to the next.
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
List<Task> tasks = new();
tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
//...
Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because the taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method is also doing this unwrapping internally, when the action is asynchronous.
An online demo of this idea can be found here.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.

If you're not squeamish about mixing operations in a task, you could redesign such that instead of task A doing A things, B doing B things, C doing C things, etc. you can reduce the number of tasks to the number of processors, and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates. So, wrapper 1 would accept delegates to do A and B work. Wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls the delegates in a loop over the dictionary.
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing. This is 4 units of work per loop over 4 loops. This is not the same as 16 tasks each doing 1 unit of work. In that case you have 16 loops.
16 loops intuitively would cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.

Parallel LINQ GroupBy taking long time on systems with high amount of cores

We detected a weird problem when running a parallel GroupBy on a system with high amount of cores.
We're running this on .Net Framework 4.7.2.
The (simplified) code:
public static void Main()
{
//int MAX_THREADS = Environment.ProcessorCount - 2;
//ThreadPool.SetMinThreads(1, 1);
//ThreadPool.SetMaxThreads(MAX_THREADS, MAX_THREADS);
var elements = new List<ElementInfo>();
for (int i = 0; i < 250000; i++)
elements.Add(new ElementInfo() { Name = "123", Description = "456" });
using (var cancellationTokenSrc = new CancellationTokenSource())
{
var cancellationToken = cancellationTokenSrc.Token;
var dummy = elements.AsParallel()
.WithCancellation(cancellationToken)
.Select(x => new { Name = x.Name })
.GroupBy(x => "abc")
.ToDictionary(g => g.Key, g => g.ToList());
}
}
public class ElementInfo
{
public string Name { get; set; }
public string Description { get; set; }
}
This code is running in an application that is already using about 100 threads. Running this on a "normal" pc (12 or 16 cores), it runs very fast (less than 1 second).
Running this on a PC with a high amount of cores (48), it runs very slow (20 seconds).
Taking a dump during the 20 second delay, I see the threads running this LINQ are all waiting in HashRepartitionEnumerator.MoveNext().
There's a m_barrier.Wait(), so I think it is waiting there. It seems to wait on m_barrier, which is set to the number of partitions.
My guess is the following:
The number of partitions is set to the number of cores (48 in this case).
A number of threads are started in the thread pool, but the thread pool is full, so new threads need to be started. This happens at 1 thread per second.
While the threadpool is spinning up threads, all threads already running this LINQ query, are waiting until enough threads are started.
Only when enough threads are started, the LINQ query can finish.
Uncommenting the first lines in the Main method supports this thesis: By limiting the number of threads, the desired amount of threads is never reached, so this LINQ query never finishes.
Does this seem like a bug in .Net Framework, or am I doing something wrong?
Note: the real LINQ query has a few CPU-intensive Where-clauses, which makes it ideal to run in parallel. I removed this code as it isn't needed to reproduce the issue.

Does this seem like a bug in .NET Framework, or am I doing something wrong?
Yes, it does look like a bug, but actually this behavior is by design. The Task Parallel Library depends heavily on the ThreadPool by default, and the ThreadPool is not an incredibly clever piece of software. Which is both good and bad. It's good because its behavior is predictable, and it's bad because it behaves non-optimally when stressed. The algorithm that controls its behavior¹ is basically this:
Satisfy instantly all demands for work until the number of the worker threads reaches the number specified by the ThreadPool.SetMinThreads method, which
by default is equal to Environment.ProcessorCount.
If the demand for work cannot be satisfied by the available workers, inject more threads in the pool with a frequency of one new thread per second.
This algorithm offers very few configuration options. For example you can't control the injection rate of new threads. So if the behavior of the built-in ThreadPool doesn't fit your needs, you are in a tough situation. You could consider implementing your own ThreadPool, in the form of a custom TaskScheduler, but unfortunately the PLINQ library doesn't even allow to configure the scheduler. There is no public WithTaskScheduler option available, analogous to the ParallelOptions.TaskScheduler property that can be used with the Parallel class (it's internal, due to fear of deadlocks).
Rewriting the PLINQ library from scratch on top of a custom ThreadPool is presumably not a realistic option. So the best that you can really do is to ensure that the ThreadPool has always enough threads to satisfy the demand (increase the ThreadPool.SetMinThreads), specify explicitly the MaxDegreeOfParalellism whenever you use paralellization, and be conservative regarding the degree of paralellism of each parallel operation. Definitely avoid nesting one parallel operation inside another, because this is the easiest way to saturate the ThreadPool and cause it to misbehave.
¹ As of .NET 6. The behavior of the ThreadPool could change in future .NET versions.

Parallel for each or any alternative for parallel loop?

I have this code
Lines.ToList().ForEach(y =>
{
globalQueue.AddRange(GetTasks(y.LineCode).ToList());
});
So for each line in my list of lines I get the tasks that I add to a global production queue. I can have 8 lines. Each get task request GetTasks(y.LineCode) take 1 minute. I would like to use parallelism to be sure I request my 8 calls together and not one by one.
What should I do?
Using another ForEach loop or using another extension method? Is there a ForEachAsync? Make the GetTasks request itself async?

Parallelism isn't concurrency. Concurrency isn't asynchrony. Running multiple slow queries in parallel won't make them run faster, quite the opposite. These are different problems and require very different solutions. Without a specific problem one can only give generic advice.
Parallelism - processing an 800K item array
Parallelism means processing a ton of data using multiple cores in parallel. To do that, you need to partition your data and feed each partition to a "worker" for processing. You need to minimize communication between workers and the need of synchronization to get the best performance, otherwise your workers will spend CPU time doing nothing. That means, no global queue updating.
If you have a lot of lines, or if line processing is CPU-bound, you can use PLINQ to process it :
var query = from y in lines.AsParallel()
from t in GetTasks(y.LineCode)
select t;
var theResults=query.ToList();
That's it. No need to synchronize access to a queue, either through locking or using a concurrent collection. This will use all available cores though. You can add WithDegreeOfParallelism() to reduce the number of cores used to avoid freezing
Concurrency - calling 2000 servers
Concurrency on the other hand means doing several different things at the same time. No partitioning is involved.
For example, if I had to query 8 or 2000 servers for monitoring data (true story) I wouldn't use Parallel or PLINQ. For one thing, Parallel and PLINQ use all available cores. In this case though they won't be doing anything, they'll just wait for responses. Parallelism classes can't handle async methods either because there's no point - they aren't meant to wait for responses.
A very quick & dirty solution would be to start multiple tasks and wait for them to return, eg :
var tasks=lines.Select(y=>Task.Run(()=>GetTasks(y.LineCode));
//Array of individual results
var resultsArray=await Task.WhenAll(tasks);
//flatten the results
var resultList=resultsArray.SelectMany(r=>r).ToList();
This will start all requests at once. Network Security didn't like the 2000 concurrent requests, since it looked like a hack attack and caused a bit of network flooding.
Concurrency with Dataflow
We can use the TPL Dataflow library and eg ActionBlock or TransformBlock to make the requests with a controlled degree of parallelism :
var options=new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 4 ,
BoundedCapacity=10,
};
var spamBlock=new TransformManyBlock<Line,Result>(
y=>GetTasks(y.LineCode),
options);
var outputBlock=new BufferBlock<Result>();
spamBlock.LinkTo(outputBlock);
foreach(var line in lines)
{
await spamBlock.SendAsync(line);
}
spamBlock.Complete();
//Wait for all 4 workers to finish
await spamBlock.Completion;
Once the spamBlock completes, the results can be found in outputBlock. By setting a BoundedCapacity I ensure that the posting loop will wait if there are too many unprocessed messages in spamBlock's input queue.
An ActionBlock can handle asynchronous methods too. Assuming GetTasksAsync returns a Task<Result[]> we can use:
var spamBlock=new TransformManyBlock<Line,Result>(
y=>GetTasksAsync(y.LineCode),
options);

You can use Parallel Foreach:
Parallel.ForEach(Lines, (line) =>
{
globalQueue.AddRange(GetTasks(line.LineCode).ToList());
});
A Parallel.ForEach loop works like a Parallel.For loop. The loop
partitions the source collection and schedules the work on multiple
threads based on the system environment. The more processors on the
system, the faster the parallel method runs.

Performance measurement of individual threads in WaitAll construction

Say I'm writing a piece of software that simulates a user performaning certain actions on a system. I'm measuring the amount of time it takes for such an action to complete using a stopwatch.
Most of the times this is pretty straighforward: the click of a button is simulated, some service call is associated with this button. The time it takes for this service call to complete is measured.
Now comes the crux, some actions have more than one service call associated with them. Since they're all still part of the same logical action, I'm 'grouping' these using the signalling mechanism offered by C#, like so (pseudo):
var syncResultList = new List<WaitHandle>();
var syncResultOne = service.BeginGetStuff();
var syncResultTwo = service.BeginDoOtherStuff();
syncResultList.Add(syncResultOne.AsyncWaitHandle);
syncResultList.Add(syncResultTwo.AsyncWaitHandle);
WaitHandle.WaitAll(syncResultList.ToArray());
var retValOne = service.EndGetStuff(syncResultOne);
var retValTwo = service.EndDoOtherStuff(syncResultTwo);
So, GetStuff and DoOtherStuff constitute one logical piece of work for that particular action. And, ofcourse, I can easily measure the amount of time it takes for this conjuction of methods to complete, by just placing a stopwatch around them. But, I need a more fine-grained approach for my statistics. I'm really interested in the amount of time it takes for each of the methods to complete, without losing the 'grouped' semantics provided by WaitHandle.WaitAll.
What I've done to overcome this, was writing a wrapper class (or rather a code generation file), which implements some timing mechanism using a callback, since I'm not that interested in the actual result (save exceptions, which are part of the statistic), I'd just let that return some statistic. But this turned out to be a performance drain somehow.
So, basically, I'm looking for an alternative to this approach. Maybe it's much simpler than I'm thinking right now, but I can't seem to figure it out by myself at the moment.

This looks like a prime candidate for Tasks ( assuming you're using C# 4 )
You can create Tasks from your APM methods using MSDN: Task.Factory.FromAsync
You can then use all the rich TPL goodness like individual continuations.

If your needs are simple enough, a simple way would be to just record each service call individually, then calculate the logical action based off the individual service calls.
IE if logical action A is made of parallel service calls B and C where B took 2 seconds and C took 1 second, then A takes 2 seconds.
A = Max(B, C)

C# thread pool limiting threads

Alright...I've given the site a fair search and have read over many posts about this topic. I found this question: Code for a simple thread pool in C# especially helpful.
However, as it always seems, what I need varies slightly.
I have looked over the MSDN example and adapted it to my needs somewhat. The example I refer to is here: http://msdn.microsoft.com/en-us/library/3dasc8as(VS.80,printer).aspx
My issue is this. I have a fairly simple set of code that loads a web page via the HttpWebRequest and WebResponse classes and reads the results via a Stream. I fire off this method in a thread as it will need to executed many times. The method itself is pretty short, but the number of times it needs to be fired (with varied data for each time) varies. It can be anywhere from 1 to 200.
Everything I've read seems to indicate the ThreadPool class being the prime candidate. Here is what things get tricky. I might need to fire off this thing say 100 times, but I can only have 3 threads at most running (for this particular task).
I've tried setting the MaxThreads on the ThreadPool via:
ThreadPool.SetMaxThreads(3, 3);
I'm not entirely convinced this approach is working. Furthermore, I don't want to clobber other web sites or programs running on the system this will be running on. So, by limiting the # of threads on the ThreadPool, can I be certain that this pertains to my code and my threads only?
The MSDN example uses the event drive approach and calls WaitHandle.WaitAll(doneEvents); which is how I'm doing this.
So the heart of my question is, how does one ensure or specify a maximum number of threads that can be run for their code, but have the code keep running more threads as the previous ones finish up until some arbitrary point? Am I tackling this the right way?
Sincerely,
Jason
Okay, I've added a semaphore approach and completely removed the ThreadPool code. It seems simple enough. I got my info from: http://www.albahari.com/threading/part2.aspx
It's this example that showed me how:
[text below here is a copy/paste from the site]
A Semaphore with a capacity of one is similar to a Mutex or lock, except that the Semaphore has no "owner" – it's thread-agnostic. Any thread can call Release on a Semaphore, while with Mutex and lock, only the thread that obtained the resource can release it.
In this following example, ten threads execute a loop with a Sleep statement in the middle. A Semaphore ensures that not more than three threads can execute that Sleep statement at once:
class SemaphoreTest
{
static Semaphore s = new Semaphore(3, 3); // Available=3; Capacity=3
static void Main()
{
for (int i = 0; i < 10; i++)
new Thread(Go).Start();
}
static void Go()
{
while (true)
{
s.WaitOne();
Thread.Sleep(100); // Only 3 threads can get here at once
s.Release();
}
}
}

Note: if you are limiting this to "3" just so you don't overwhelm the machine running your app, I'd make sure this is a problem first. The threadpool is supposed to manage this for you. On the other hand, if you don't want to overwhelm some other resource, then read on!
You can't manage the size of the threadpool (or really much of anything about it).
In this case, I'd use a semaphore to manage access to your resource. In your case, your resource is running the web scrape, or calculating some report, etc.
To do this, in your static class, create a semaphore object:
System.Threading.Semaphore S = new System.Threading.Semaphore(3, 3);
Then, in each thread, you do this:
System.Threading.Semaphore S = new System.Threading.Semaphore(3, 3);
try
{
// wait your turn (decrement)
S.WaitOne();
// do your thing
}
finally {
// release so others can go (increment)
S.Release();
}
Each thread will block on the S.WaitOne() until it is given the signal to proceed. Once S has been decremented 3 times, all threads will block until one of them increments the counter.
This solution isn't perfect.
If you want something a little cleaner, and more efficient, I'd recommend going with a BlockingQueue approach wherein you enqueue the work you want performed into a global Blocking Queue object.
Meanwhile, you have three threads (which you created--not in the threadpool), popping work out of the queue to perform. This isn't that tricky to setup and is very fast and simple.
Examples:
Best threading queue example / best practice
Best method to get objects from a BlockingQueue in a concurrent program?

It's a static class like any other, which means that anything you do with it affects every other thread in the current process. It doesn't affect other processes.
I consider this one of the larger design flaws in .NET, however. Who came up with the brilliant idea of making the thread pool static? As your example shows, we often want a thread pool dedicated to our task, without having it interfere with unrelated tasks elsewhere in the system.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.