We're developing a Web API that decrypts around 200 items per request (sometimes more). Each decryption takes around 20 ms.
We've tried to parallelize the tasks so the work finishes as soon as possible, but we seem to be hitting some kind of limit: threads are being reused, waiting for the older ones to complete (only a few are actually used), and the overall operation takes around 1-2 seconds to complete.
What we basically want is for x threads to start at the same time and finish after those ~20 ms.
We tried this:
Await multiple async Task while setting max running task at a time
But that only seems to describe setting a limit, while we want to lift one...
Here's a snippet:
var tasks = new List<Task>();
foreach (var element in Elements)
{
    var task = new Task(() =>
    {
        element.Value = Cipher.Decrypt((string)element.Value);
    });
    task.Start();
    tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
What are we missing here?
Thanks,
Nir.
I cannot recommend parallelism on ASP.NET. It will certainly impact the scalability of your service, particularly if it is public-facing. I have thought "oh, I'm smart enough to do this" a couple of times and added parallelism in an ASP.NET app, only to have to tear it right back out a week later.
However, if you really want to...
it seems we're getting some kind of a limit
Is it the limit of physical cores on your machine?
We tried this: Await multiple async Task while setting max running task at a time
That solution is specifically for asynchronous concurrent code (e.g., I/O-bound). What you want is parallel (threaded) concurrent code (e.g., CPU-bound). Completely different use cases and solutions.
What are we missing here?
Your current code throws a ton of simultaneous tasks at the thread pool, which will attempt to handle them as best it can. You can make this more efficient by using a higher-level abstraction, e.g., Parallel:
Parallel.ForEach(Elements, element =>
{
    element.Value = Cipher.Decrypt((string)element.Value);
});
Parallel is more intelligent about its partitioning and (re)use of threads (i.e., it doesn't exceed the number of cores), so you should see some speedup.
However, I would expect it to be only a minor speedup. You are likely limited by your number of physical cores.
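If you want the cap on parallelism to be explicit rather than relying on the defaults, here is a minimal sketch using ParallelOptions (Elements and Cipher are the names from the question; the cap value is just an example):
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(Elements, options, element =>
{
    // Same CPU-bound body as above, now with an explicit upper bound on parallelism.
    element.Value = Cipher.Decrypt((string)element.Value);
});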
Assuming no hyper-threading:
If it takes 20 ms for 1 item, then you can look at it as taking 1 core 20 ms. If you want 200 items to complete in 20 ms, then you need 200 cores all to yourself. If you don't have that many, it just can't be done...
Under normal circumstances, as many tasks will be scheduled in parallel as is optimal for your system.
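As a back-of-the-envelope sketch of that arithmetic (200 items and 20 ms are the numbers from the question; the core count is whatever your machine reports):
// Rough lower bound for perfectly parallel CPU-bound work.
int items = 200;
double timePerItemMs = 20;
int cores = Environment.ProcessorCount;
double bestCaseMs = Math.Ceiling((double)items / cores) * timePerItemMs;
// With 8 cores: ceil(200 / 8) * 20 = 25 * 20 = 500 ms at best.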
We detected a weird problem when running a parallel GroupBy on a system with a high number of cores.
We're running this on .NET Framework 4.7.2.
The (simplified) code:
public static void Main()
{
    //int MAX_THREADS = Environment.ProcessorCount - 2;
    //ThreadPool.SetMinThreads(1, 1);
    //ThreadPool.SetMaxThreads(MAX_THREADS, MAX_THREADS);

    var elements = new List<ElementInfo>();
    for (int i = 0; i < 250000; i++)
        elements.Add(new ElementInfo() { Name = "123", Description = "456" });

    using (var cancellationTokenSrc = new CancellationTokenSource())
    {
        var cancellationToken = cancellationTokenSrc.Token;
        var dummy = elements.AsParallel()
            .WithCancellation(cancellationToken)
            .Select(x => new { Name = x.Name })
            .GroupBy(x => "abc")
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}

public class ElementInfo
{
    public string Name { get; set; }
    public string Description { get; set; }
}
This code runs in an application that already uses about 100 threads. On a "normal" PC (12 or 16 cores), it runs very fast (less than 1 second).
On a PC with a high number of cores (48), it runs very slowly (20 seconds).
Taking a dump during the 20-second delay, I see the threads running this LINQ are all waiting in HashRepartitionEnumerator.MoveNext().
There's an m_barrier.Wait() there, so I think they're waiting on m_barrier, which is set to the number of partitions.
My guess is the following:
The number of partitions is set to the number of cores (48 in this case).
A number of work items are queued to the thread pool, but the pool is full, so new threads need to be started. This happens at a rate of one thread per second.
While the thread pool is spinning up threads, all threads already running this LINQ query wait until enough threads have started.
Only when enough threads have started can the LINQ query finish.
Uncommenting the first lines in the Main method supports this thesis: by limiting the number of threads, the desired number is never reached, so the LINQ query never finishes.
Does this seem like a bug in .NET Framework, or am I doing something wrong?
Note: the real LINQ query has a few CPU-intensive Where-clauses, which makes it ideal to run in parallel. I removed this code as it isn't needed to reproduce the issue.
Does this seem like a bug in .NET Framework, or am I doing something wrong?
Yes, it does look like a bug, but actually this behavior is by design. The Task Parallel Library depends heavily on the ThreadPool by default, and the ThreadPool is not an incredibly clever piece of software. Which is both good and bad. It's good because its behavior is predictable, and it's bad because it behaves non-optimally when stressed. The algorithm that controls its behavior¹ is basically this:
Satisfy instantly all demands for work until the number of worker threads reaches the number specified by the ThreadPool.SetMinThreads method, which by default is equal to Environment.ProcessorCount.
If the demand for work cannot be satisfied by the available workers, inject more threads into the pool at a frequency of one new thread per second.
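To see the second rule in action, here is a small standalone sketch (my own illustration, not code from the question): it queues far more blocking work items than there are cores and prints how many are running each second. Expect an almost instant jump to roughly Environment.ProcessorCount, followed by an increase of about one per second.
using System;
using System.Threading;

class ThreadInjectionDemo
{
    static int running = 0;

    static void Main()
    {
        // Queue many more blocking work items than there are cores.
        for (int i = 0; i < 64; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                Interlocked.Increment(ref running);
                Thread.Sleep(10000); // block the worker, forcing the pool to grow
                Interlocked.Decrement(ref running);
            });
        }

        // Sample the number of concurrently running work items once per second.
        for (int second = 0; second < 15; second++)
        {
            Console.WriteLine("t={0}s running={1}", second, Volatile.Read(ref running));
            Thread.Sleep(1000);
        }
    }
}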
This algorithm offers very few configuration options. For example, you can't control the injection rate of new threads. So if the behavior of the built-in ThreadPool doesn't fit your needs, you are in a tough situation. You could consider implementing your own ThreadPool in the form of a custom TaskScheduler, but unfortunately the PLINQ library doesn't even allow configuring the scheduler. There is no public WithTaskScheduler option available, analogous to the ParallelOptions.TaskScheduler property that can be used with the Parallel class (it's internal, due to fear of deadlocks).
Rewriting the PLINQ library from scratch on top of a custom ThreadPool is presumably not a realistic option. So the best you can really do is ensure that the ThreadPool always has enough threads to satisfy the demand (raise the floor with ThreadPool.SetMinThreads), specify the MaxDegreeOfParallelism explicitly whenever you use parallelization, and be conservative about the degree of parallelism of each parallel operation. Definitely avoid nesting one parallel operation inside another, because this is the easiest way to saturate the ThreadPool and cause it to misbehave.
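A minimal sketch of those mitigations (the numbers are placeholders, and items/Process are hypothetical stand-ins, not names from the question):
// Once, at application startup: raise the ThreadPool floor so threads
// are created on demand instead of at one per second.
ThreadPool.SetMinThreads(workerThreads: 64, completionPortThreads: 64);

// For each parallel operation: specify the degree of parallelism explicitly.
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(items, options, item => Process(item));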
¹ As of .NET 6. The behavior of the ThreadPool could change in future .NET versions.
I have this code
Lines.ToList().ForEach(y =>
{
    globalQueue.AddRange(GetTasks(y.LineCode).ToList());
});
So for each line in my list of lines, I get the tasks and add them to a global production queue. I can have 8 lines. Each GetTasks(y.LineCode) call takes 1 minute. I would like to use parallelism to make sure my 8 calls are issued together, not one by one.
What should I do?
Should I use another ForEach loop or another extension method? Is there a ForEachAsync? Should I make the GetTasks request itself async?
Parallelism isn't concurrency. Concurrency isn't asynchrony. Running multiple slow queries in parallel won't make them run faster; quite the opposite. These are different problems that require very different solutions. Without a specific problem, one can only give generic advice.
Parallelism - processing an 800K item array
Parallelism means processing a ton of data using multiple cores in parallel. To do that, you need to partition your data and feed each partition to a "worker" for processing. To get the best performance, you need to minimize communication between workers and the need for synchronization, otherwise your workers will spend CPU time doing nothing. That means no updating a global queue.
If you have a lot of lines, or if line processing is CPU-bound, you can use PLINQ to process them:
var query = from y in lines.AsParallel()
            from t in GetTasks(y.LineCode)
            select t;
var theResults = query.ToList();
That's it. No need to synchronize access to a queue, either through locking or using a concurrent collection. This will use all available cores, though; you can add WithDegreeOfParallelism() to reduce the number of cores used and avoid freezing the rest of the system.
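For example (the 4 is an arbitrary cap for illustration, not a recommendation):
// Same query, capped at 4 concurrent workers.
var query = from y in lines.AsParallel().WithDegreeOfParallelism(4)
            from t in GetTasks(y.LineCode)
            select t;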
Concurrency - calling 2000 servers
Concurrency, on the other hand, means doing several different things at the same time. No partitioning is involved.
For example, when I had to query 8 or 2000 servers for monitoring data (true story), I wouldn't use Parallel or PLINQ. For one thing, Parallel and PLINQ use all available cores, and in this case the cores wouldn't be doing anything; they'd just be waiting for responses. The parallelism classes can't handle async methods either, because there's no point: they aren't meant to wait for responses.
A very quick & dirty solution would be to start multiple tasks and wait for them all to return, e.g.:
// Start one task per line; Task.Run queues the blocking call on the thread pool.
var tasks = lines.Select(y => Task.Run(() => GetTasks(y.LineCode)));
// Array of individual results
var resultsArray = await Task.WhenAll(tasks);
// Flatten the results
var resultList = resultsArray.SelectMany(r => r).ToList();
This will start all requests at once. Network Security didn't like the 2000 concurrent requests, since it looked like a hack attack and caused a bit of network flooding.
Concurrency with Dataflow
We can use the TPL Dataflow library with, e.g., an ActionBlock or TransformManyBlock to make the requests with a controlled degree of parallelism:
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4,
    BoundedCapacity = 10,
};

var spamBlock = new TransformManyBlock<Line, Result>(
    y => GetTasks(y.LineCode),
    options);

var outputBlock = new BufferBlock<Result>();
spamBlock.LinkTo(outputBlock);

foreach (var line in lines)
{
    await spamBlock.SendAsync(line);
}
spamBlock.Complete();
// Wait for all 4 workers to finish
await spamBlock.Completion;
Once spamBlock completes, the results can be found in outputBlock. By setting a BoundedCapacity, I ensure the posting loop will wait if there are too many unprocessed messages in spamBlock's input queue.
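To actually read the results out, one option is TryReceiveAll; a short sketch (Result is the type used by the blocks above):
// Drain everything the pipeline produced into a list.
if (outputBlock.TryReceiveAll(out IList<Result> results))
{
    Console.WriteLine("Received {0} results", results.Count);
}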
These blocks can handle asynchronous methods too. Assuming GetTasksAsync returns a Task<Result[]>, we can use:
var spamBlock = new TransformManyBlock<Line, Result>(
    y => GetTasksAsync(y.LineCode),
    options);
You can use Parallel.ForEach:
Parallel.ForEach(Lines, line =>
{
    // Note: globalQueue must be thread-safe here (e.g., a ConcurrentBag, or a
    // lock around AddRange), since multiple threads add to it concurrently.
    globalQueue.AddRange(GetTasks(line.LineCode).ToList());
});
A Parallel.ForEach loop works like a Parallel.For loop. The loop partitions the source collection and schedules the work on multiple threads based on the system environment. The more processors on the system, the faster the parallel method runs.
I am trying this code (it just spawns some tasks and simulates work):
var tasks = Enumerable.Range(1, 10).Select(d => Task.Factory.StartNew(() =>
{
    Console.Out.WriteLine("Processing [{0}]", d);
    Task.Delay(20000).Wait(); // Simulate work. Here will be some web service calls taking 7/8+ seconds.
    Console.Out.WriteLine("Task Complete [{0}]", d);
    return (2 * d).ToString();
})).ToList();
var results = Task.WhenAll(tasks).Result;
Console.Out.WriteLine("All processing were complete with results: {0}", string.Join("|", results));
I was expecting to see all 10 Processing ... lines in the console at once, but when I run it, initially I see this output:
Processing [1]
Processing [2]
Processing [3]
Processing [4]
Then, one or two seconds later, Processing [5], Processing [6] and the others appear slowly, one after another.
Can you explain this? Does this mean the tasks are started with a delay? Why?
As mentioned in another answer, using TaskCreationOptions.LongRunning will solve your problem.
But this is not how you should approach the problem. Your example simulates CPU-bound work, yet you say your tasks will be making calls to a web service, meaning they will be IO-bound.
As such, they should run asynchronously. However, Task.Delay(20000).Wait(); waits synchronously, so it doesn't represent what will/should actually be going on.
Take this example instead:
var tasks = Enumerable.Range(1, 10).Select(async d =>
{
    Console.Out.WriteLine("Processing [{0}]", d);
    await Task.Delay(5000); // Simulate IO work. Here will be some web service calls taking 7/8+ seconds.
    Console.Out.WriteLine("Task Complete [{0}]", d);
    return (2 * d).ToString();
}).ToList();
var results = Task.WhenAll(tasks).Result;
Console.Out.WriteLine("All processing were complete with results: {0}", string.Join("|", results));
All tasks start instantly as expected.
I expect you have 4 CPU cores.
Having two (CPU-bound) threads fighting over one core takes longer than having one thread do the first task and then the second.
Until it knows otherwise, the task system assumes tasks are short-running, CPU-bound, and using non-blocking IO.
Therefore I expect the task system defaults to keeping the number of threads close to the number of cores.
Using TaskCreationOptions.LongRunning provides a hint to the TaskScheduler that oversubscription may be warranted. Oversubscription lets you create more threads than the available number of hardware threads.
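A minimal sketch of that hint applied to the question's code (everything else unchanged):
var tasks = Enumerable.Range(1, 10).Select(d => Task.Factory.StartNew(() =>
{
    Console.Out.WriteLine("Processing [{0}]", d);
    Task.Delay(20000).Wait(); // still blocking, but now on a dedicated thread
    Console.Out.WriteLine("Task Complete [{0}]", d);
    return (2 * d).ToString();
}, TaskCreationOptions.LongRunning)).ToList();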
And lastly, tasks are not threads; they are designed to hide a lot of the details of threads from you, including controlling the number of threads in use. It is reasonable to create hundreds of tasks; if you created hundreds of threads all trying to run at the same time, the CPU cache etc. would have a very hard time.
However, let's get back to what you are trying to do. Your example simulates CPU-bound work, yet you say your tasks will be making calls to a web service, meaning they will be IO-bound.
As such, they should run asynchronously. However, Task.Delay(20000).Wait(); waits synchronously, so it doesn't represent what will/should actually be going on. See Gediminas Masaitis' answer for a code sample using await to make the delay asynchronous. However, as soon as you use more asynchronous code, you need to think more about locking etc.
Asynchronous IO is clearly better if you have hundreds of requests going on at the same time. However, if you just have a handful and no other usage of await in your application, then TaskCreationOptions.LongRunning may be good enough.
I have used the Task Parallel Library in a couple of places in my WCF application.
In one place I am using it like this:
Place 1
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 100 };
Parallel.ForEach(objList, options, recurringOrder =>
{
    Task.Factory.StartNew(() => ProcessSingleRequestForDebitOrder(recurringOrder));
    //var th = new Thread(() => ProcessSingleRequestForDebitOrder(recurringOrder)) { Priority = ThreadPriority.Normal };
    //th.Start();
    //ProcessSingleRequestForDebitOrder(recurringOrder);
});
And in another method I have used it like this:
Place 2
System.Threading.Tasks.Task.Factory.StartNew(() => ProcessTransaction(objInput.Clone()));
The problem is time-slicing between the two places. That is, when the parallel loop at Place 1 is processing hundreds of records, the task started at Place 2 waits until all of those records have been processed. Is there some way I can time-slice the processing between them?
I am using the Task Parallel Library for .NET 3.5 from:
https://www.nuget.org/packages/TaskParallelLibrary/
The problem is that you have spawned a lot of tasks in Place 1, and Place 2 is now queued behind them. The Parallel loop in Place 1 does almost nothing itself, because its body only starts a task, which returns very quickly.
You should probably remove the StartNew call from Place 1 so that the degree of parallelism is lower. I'm not sure this will completely remove the problem, because the Parallel loop might still fully utilize all available pool threads.
Doing IO with Parallel is an anti-pattern anyway, because the system-chosen degree of parallelism is almost always a bad choice; the TPL has no idea how to efficiently schedule IO.
You can make Place 2 a LongRunning task so that it does not depend on the thread pool and is guaranteed to run.
You can also investigate using async IO so that you do not depend on the thread pool at all.
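A sketch of the LongRunning suggestion for Place 2 (ProcessTransaction and objInput are the names from the question):
// LongRunning hints the scheduler to give this work its own thread
// instead of competing for busy thread pool workers.
Task.Factory.StartNew(
    () => ProcessTransaction(objInput.Clone()),
    TaskCreationOptions.LongRunning);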
System.Threading.ThreadPool.SetMaxThreads(50, 50);
File.ReadLines("input.txt") // hypothetical path; the original omitted the argument
    .AsParallel()
    .WithDegreeOfParallelism(100)
    .ForAll(s =>
    {
        // some code which waits on an external API call
        // and does not utilize the CPU
    });
I have never seen the thread count exceed the CPU count on my system.
Can I use PLINQ and get more than one thread per CPU?
If you're calling an external web API, you might be hitting the limit on concurrent simultaneous connections, which is set to 2 by default. At the beginning of your application, do the following:
System.Net.ServicePointManager.DefaultConnectionLimit = 4096;
System.Net.ServicePointManager.Expect100Continue = false;
See if that helps. If not, there might be some other bottleneck within the routine you're trying to parallelize.
Also, as other responders have said, the ThreadPool decides how many threads to spin up based on load. In my experience with the TPL, I've seen the thread count increase over time: the longer the app runs and the heavier the load gets, the more threads are spun up.
PLINQ uses a hill-climbing algorithm to determine the optimum size of the thread pool, which is used by the TPL. I think that if you put a lot of I/O in your tasks, seeing more threads than the CPU count is likely.
That said, I've never seen more threads than the CPU count :) . But maybe I never had the right situation.
I tested this with the following code:
var lines = Enumerable.Range(0, 200).ToArray();
int currentThreads = 0;
int maxThreads = 0;
object l = new object();

lines.AsParallel().WithDegreeOfParallelism(100).ForAll(s =>
{
    lock (l)
    {
        currentThreads++;
        if (currentThreads > maxThreads)
        {
            maxThreads = currentThreads;
            Console.WriteLine(maxThreads);
        }
    }

    Thread.Sleep(3000);

    lock (l)
    {
        currentThreads--;
    }
});

Console.WriteLine();
Console.WriteLine(maxThreads);
Basically, it records the current number of concurrently executing iterations and then saves the maximum encountered value.
The results vary quite a bit, between 15 and 25, but it's always much more than the number of CPUs my computer has (4). Increasing the sleep time increases the maximum number of concurrent threads. So it looks like the limiting factor here is the ThreadPool: it creates new threads slowly, especially when jobs complete relatively quickly.
If you want to increase the number of threads used, you would need to use SetMinThreads() (not SetMaxThreads()). If I set the minimum to 50, the number of threads actually used is around 60.
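For example (50 mirrors the number above; call this once at startup, before the parallel work begins):
// Raise the floor so the pool creates up to 50 threads on demand,
// without the one-thread-per-second injection delay.
System.Threading.ThreadPool.SetMinThreads(50, 50);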
But having dozens of threads that do nothing but wait is quite inefficient, especially when it comes to memory consumption. You should consider using asynchronous methods instead.
PLINQ does not fit this case.
I found the following article useful:
http://msdn.microsoft.com/en-us/library/hh228609(v=vs.110).aspx
Short answer: nope.
The number of threads is simply up to the .NET Framework runtime; there is no direct developer control over the number of threads used by the TPL (Task Parallel Library).
EDIT
Thanks to some other feedback: it is actually possible, though not recommended, to manually control the number of threads in the ThreadPool, which PLINQ and the TPL use.
It's my opinion that any parallelization problem needs to be carefully thought out, and carefully constructed and tested. There's a lot of subtlety in this.