In my application:
int numberOfTimes = 1; //Or 100, or 100000
//Incorrect, please see update.
var tasks = Enumerable.Repeat(
    new HttpClient().GetStringAsync("http://www.someurl.com"),
    numberOfTimes);
var resultArray = await Task.WhenAll(tasks);
With numberOfTimes == 1, it takes 5 seconds.
With numberOfTimes == 100000, it still takes 5 seconds.
That's amazing.
But does that mean I can run unlimited calls in parallel? There has to be some limit beyond which these calls start to queue, right?
What is that limit? Where is that set? What does it depend on?
In other words: how many I/O completion ports are there? Who is competing for them? Does IIS get its own set of I/O completion ports?
--This is in an ASP.NET MVC action, .NET 4.5.2, IIS
Update: Thanks to @Enigmativity, the following is more relevant to the question:
var tasks = Enumerable.Range(1, numberOfTimes).Select(i =>
    new HttpClient().GetStringAsync("http://deletewhenever.com/api/default"));
var resultArray = await Task.WhenAll(tasks);
With numberOfTimes == 1, it takes 5 seconds.
With numberOfTimes == 100, it still takes 5 seconds.
I am seeing more believable numbers for higher counts now, though. The question remains: what governs the number?
What is that limit? Where is that set?
There's no explicit limit. However, you will eventually run out of resources. Mark Russinovich has an interesting blog series on probing the limits of common resources.
Asynchronous operations generally trade memory for responsiveness. Each naturally-async operation uses memory for at least its Task, an OVERLAPPED struct, and an IRP for the driver (each of these represents an in-progress asynchronous operation at a different level). At the lower levels, many different limitations can come into play and affect system resources (for an example, I have an old blog post where I had to calculate the maximum size of an I/O buffer - something you would think is simple but really is not).
Socket operations require a client port, which are (in theory) limited to 64k connections to the same remote IP. Sockets also have their own more significant memory overhead, with both input and output buffers at the device level and in user space.
The IOCP doesn't come into play until the operations complete. On .NET, there's only one IOCP for your AppDomain. The default maximum number of I/O threads servicing this IOCP is 1000 on the modern (4.5) .NET framework. Note that this is a limit on how many operations may complete at a time, not how many may be in progress at a time.
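You can check these pool limits for your own process with ThreadPool.GetMaxThreads. A minimal sketch (the exact numbers vary by framework version and configuration):
using System;
using System.Threading;
class ThreadPoolLimits
{
    static void Main()
    {
        int workerThreads, completionPortThreads;
        // Upper bounds for this process: worker threads and I/O completion threads.
        ThreadPool.GetMaxThreads(out workerThreads, out completionPortThreads);
        Console.WriteLine("Max worker threads: " + workerThreads);
        Console.WriteLine("Max I/O completion threads: " + completionPortThreads);
    }
}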
Here's a test to see what's going on.
Start with this code:
var i = 0;
Func<int> generate = () =>
{
    Thread.Sleep(1000);
    return i++;
};
Now call this:
var results = Enumerable.Repeat(generate(), 5).ToArray();
After one second you get { 0, 0, 0, 0, 0 }.
But make this call:
var results = Enumerable.Range(0, 5).Select(n => generate()).ToArray();
After five seconds you get { 0, 1, 2, 3, 4 }.
It's only calling the async function once in your code.
Let's simplify this scenario. There is a machine with 16 GB RAM and 4 CPU cores. Given a list of objects with different sizes, e.g. [3,1,7,9,4,5,2], each element needs a corresponding amount of RAM while being processed, e.g. "1" needs 1 GB of RAM.
What is the best way to process this list of elements in parallel, without causing an OutOfMemoryException, in C#, with a parallelism library (built-in or 3rd-party)?
One naive strategy could be:
First round: choose [3,1,7]. One core is still free, but adding "9" would require 20 GB of RAM, so use only 3 cores for now.
Second round: if "3" finishes first, consider "9", but that would still exceed the 16 GB capacity (1 + 7 + 9 = 17), so stop and wait.
Third round: once "7" finishes, the program can move on with "1", "9" and "4" (1 + 9 + 4 = 14).
I'm not an expert on algorithms or parallelism, so I can't frame this problem in more specific terms... Any help, link, or advice is highly appreciated. I believe this problem may have been solved somewhere else, and I don't need to reinvent the wheel.
You could consider using a specialized semaphore that can have its CurrentCount decreased and increased atomically by more than 1, like the one found in this question. You could initialize this mechanism with an initialCount equal to the available memory in GBs (16), and Wait/Release it with the size of each object in GBs (between 1 and 16). This way an object can acquire the semaphore only after waiting for the CurrentCount to become equal to or larger than its size.
To incorporate this mechanism in a Parallel.ForEach loop, you can create a deferred enumerable that Waits for the semaphore as part of the enumeration, and then feed this throttled enumerable as the source of the parallel loop. One important detail to take care of is disabling the chunk partitioning that Parallel.ForEach employs by default, by using the EnumerablePartitionerOptions.NoBuffering configuration; otherwise Parallel.ForEach may greedily enumerate more than one item at a time, interfering with the throttling intentions of this algorithm.
The semaphore should be released inside the body of the parallel loop, in a finally block, with the same releaseCount as the size of the processed object.
Putting everything together:
var items = new[] { 3, 1, 7, 9, 4, 5, 2 };
const int availableMemory = 16; // GB
using var throttler = new SemaphoreManyFifo(availableMemory, availableMemory);

// Deferred enumerable: each item blocks here until enough "memory" is available.
var throttledItems = items
    .Select(item => { throttler.Wait(item); return item; });

// NoBuffering disables chunk partitioning, so items are pulled one at a time.
var partitioner = Partitioner.Create(throttledItems,
    EnumerablePartitionerOptions.NoBuffering);
var parallelOptions = new ParallelOptions()
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.ForEach(partitioner, parallelOptions, item =>
{
try
{
Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Processing #{item}");
Thread.Sleep(item * 1000); // Simulate a CPU-bound operation
}
finally
{
throttler.Release(item);
Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Item #{item} completed");
}
});
Note: the size of each object should not exceed the initialCount of the semaphore, otherwise this algorithm will malfunction. Be aware that the aforementioned SemaphoreManyFifo implementation does not include proper argument validation.
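A simple up-front guard can catch this; a sketch reusing the items and availableMemory names from the snippet above:
// Hypothetical validation: any item larger than the semaphore's initialCount
// could never acquire it and would block forever.
var oversized = items.Where(item => item > availableMemory).ToList();
if (oversized.Count > 0)
    throw new ArgumentException(
        "Items larger than " + availableMemory + " GB cannot be processed: "
        + string.Join(", ", oversized));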
Multithreaded code in which IEnumerables that are expensive to evaluate are enumerated several times in parallel does not use 100% CPU. An example is the Aggregate() function combined with Concat():
// Initialisation.
// Each IEnumerable<string> is constructed so that it takes time to evaluate
// every time it is accessed.
IEnumerable<string>[] iEnumerablesArray = ...
// The line of the question (using less than 100% CPU):
Parallel.For(0, 1000000, _ => iEnumerablesArray.Aggregate(Enumerable.Concat).ToList());
Question: why does parallel code in which IEnumerables are evaluated several times in parallel not use 100% CPU? The code uses no locks or waits, so this behaviour is unexpected. Full code to simulate this is at the end of the post.
Notes and Edits:
Interesting fact: if the code
Enumerable.Range(0, 1).Select(__ => GenerateLongString())
from the full code at the end is changed to
Enumerable.Range(0, 1).Select(__ => GenerateLongString()).ToArray().AsEnumerable(),
then initialisation takes seconds, and after that the CPU is used at 100% (no problem occurs).
Interesting fact 2: (from comment) when the GenerateLongString() method is made less heavy on the GC and more CPU-intensive, the CPU goes to 100%. So the cause is connected to the implementation of this method. But, interestingly, if the current form of GenerateLongString() is called without the IEnumerable wrapper, the CPU also goes to 100%:
Parallel.For(0, int.MaxValue, _ => GenerateLongString());
So the heaviness of GenerateLongString() alone is not the problem here.
Fact 3: (from comment) the suggested Concurrency Visualizer revealed that threads spend most of their time on
clr.dll!WKS::gc_heap::wait_for_gc_done,
waiting for the GC to finish. This happens inside the string.Concat() call of GenerateLongString().
The same behaviour is observed when manually starting multiple Task.Factory.StartNew() or Thread.Start() calls.
The same behaviour is observed on Win 10 and Windows Server 2012
The same behaviour is observed on real machine and virtual machine
Release vs. Debug does not matter.
.NET version tested: 4.7.2
The Full Code:
class Program
{
    const int DATA_SIZE = 10000;
    const int IENUMERABLE_COUNT = 10000;

    static void Main(string[] args)
    {
        // initialisation - takes milliseconds
        IEnumerable<string>[] iEnumerablesArray = GenerateArrayOfIEnumerables();
        Console.WriteLine("Initialized");

        List<string> result = null;

        // =================
        // THE PROBLEM LINE:
        // =================
        // CPU usage of the next line:
        // - 40 % on a 4-virtual-core processor (2 physical)
        // - 10-15 % on a 12-virtual-core processor
        Parallel.For(
            0,
            int.MaxValue,
            (i) => result = iEnumerablesArray.Aggregate(Enumerable.Concat).ToList());

        // just to be sure that Release mode would not omit some lines:
        Console.WriteLine(result);
    }

    static IEnumerable<string>[] GenerateArrayOfIEnumerables()
    {
        return Enumerable
            .Range(0, IENUMERABLE_COUNT)
            .Select(_ => Enumerable.Range(0, 1).Select(__ => GenerateLongString()))
            .ToArray();
    }

    static string GenerateLongString()
    {
        return string.Concat(Enumerable.Range(0, DATA_SIZE).Select(_ => "string_part"));
    }
}
The fact that your threads are blocked on clr.dll!WKS::gc_heap::wait_for_gc_done shows that the garbage collector is the bottleneck of your application. As much as possible, you should try to limit the number of heap allocations in your program, to put less stress on the GC.
That said, there is another way to speed things up. By default, on desktop, the GC is configured to use limited resources on the computer (to avoid slowing down other applications). If you want to fully use the available resources, you can activate server GC. This mode assumes that your application is the most important thing running on the computer. It will provide a significant performance boost, but it uses a lot more CPU and memory.
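For reference, this is how server GC is enabled in app.config on the .NET Framework (you can verify it took effect at runtime via System.Runtime.GCSettings.IsServerGC):
<configuration>
  <runtime>
    <!-- One GC heap and collector thread per core; collections run in parallel. -->
    <gcServer enabled="true" />
  </runtime>
</configuration>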
I have a REST Web API service in IIS which takes a collection of request objects. The user can send more than 100 request objects.
I want to run these 100 requests concurrently, then aggregate the results and send them back. This involves both I/O operations (calling backend services for each request) and CPU-bound operations (computing a few response elements).
Code snippet:
using System.Threading.Tasks;
....
var taskArray = new Task<FlightInformation>[multiFlightStatusRequest.FlightRequests.Count];
for (int i = 0; i < multiFlightStatusRequest.FlightRequests.Count; i++)
{
    var z = i;
    taskArray[z] = Task.Run(() =>
        PerformLogic(multiFlightStatusRequest.FlightRequests[z], lite, fetchRouteByAnyLeg)
    );
}
Task.WaitAll(taskArray);
for (int i = 0; i < taskArray.Length; i++)
{
    flightInformations.Add(taskArray[i].Result);
}
public Object PerformLogic(Request,...)
{
    //multiple IO operations each depends on the outcome of the previous result
    //Computations after getting the result from all I/O operations
}
If I run the PerformLogic operation individually (for 1 object), it takes 300 ms. My requirement is that running PerformLogic() for 100 objects in a single request should take around 2 seconds.
PerformLogic() has the following steps:
1. Call a 3rd-party web service to get some details.
2. Based on the details, call another 3rd-party web service.
3. Collect the result from the web service and apply a few transformations.
But with Task.Run() it takes around 7 seconds. I would like to know the best approach to handle this concurrency and achieve the desired NFR of 2 seconds.
I can see that at any point in time 7-8 threads are working concurrently. I am not sure whether spawning 100 threads or tasks would give better performance. Please suggest an approach to handle this efficiently.
Judging by this
public Object PerformLogic(Request,...)
{
    //multiple IO operations each depends on the outcome of the previous result
    //Computations after getting the result from all I/O operations
}
I'd wager that PerformLogic spends most of its time waiting on the I/O operations. If so, there's hope with async. You'll have to rewrite PerformLogic, and maybe even the I/O operations themselves - async needs to be present at all levels, from top to bottom. But if you can do it, the result should be a lot faster.
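A rough sketch of that async shape, assuming the two backend calls expose (or can be wrapped in) async APIs; CallFirstServiceAsync, CallSecondServiceAsync and Transform are illustrative names, and the parameter types are guesses:
public async Task<FlightInformation> PerformLogicAsync(
    FlightRequest request, bool lite, bool fetchRouteByAnyLeg)
{
    // Each I/O step awaits without blocking a thread pool thread.
    var details = await CallFirstServiceAsync(request);
    var moreDetails = await CallSecondServiceAsync(details, lite, fetchRouteByAnyLeg);
    // The CPU-bound transformation runs once the I/O is done.
    return Transform(moreDetails);
}
The caller (an async action method) then starts all requests and awaits them together:
var tasks = multiFlightStatusRequest.FlightRequests
    .Select(r => PerformLogicAsync(r, lite, fetchRouteByAnyLeg));
var flightInformations = (await Task.WhenAll(tasks)).ToList();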
Other than that - get faster hardware. If 8 cores take 7 seconds, then get 32 cores. It's pricey, but could still be cheaper than rewriting the code.
First, don't reinvent the wheel. PLINQ is perfectly capable of doing stuff in parallel; there is no need for manual task handling or result merging.
If you want 100 tasks, each taking 300 ms, done in 2 seconds, you need at least 15 parallel workers (100 / 15 ≈ 7 sequential calls per worker, at 300 ms each ≈ 2 s), ignoring the cost of the parallelization itself.
var results = multiFlightStatusRequest.FlightRequests
    .AsParallel()
    .WithDegreeOfParallelism(15)
    .Select(flightRequest => PerformLogic(flightRequest, lite, fetchRouteByAnyLeg))
    .ToList();
Now you have told PLINQ to use 15 concurrent workers to work through your queue of tasks. Are you sure your machine is up to the task? You can put any number you want in there; that doesn't mean your computer magically gets the power to handle it.
Another option is to look at your PerformLogic method and optimize it. You call it 100 times; maybe it's worth optimizing.
There is a C# function A(arg1, arg2) which needs to be called many times. To do this as fast as possible, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
    for (int i = 0; i < totalCalls / threads; i++)
    {
        // init arg1 and arg2
        var value = A(arg1, arg2);
        // do something with value
    }
});
Now the issue is that this does not scale up with the number of cores; e.g. on 8 cores it uses 80% of the CPU, and on 16 cores it uses only 40-50%. I want to use the CPU to the maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it has no I/O or network-bound operations, and there is no thread locking. How can I find out which part of the code is preventing it from running in a fully parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1: if I run the same code with A() replaced by a simple function that calculates prime numbers, it utilizes 100% CPU and scales up well. So this proves that the surrounding code is correct; the issue must be within the original function A(). I need a way to detect whatever is causing this implicit sequencing.
You have determined that the code in A is the problem.
There is one very common problem: garbage collection. Configure your application in app.config to use the concurrent server GC. The workstation GC tends to serialize execution, and the effect is severe.
If this is not the problem, pause the debugger a few times and look at the Debug -> Parallel Stacks window; there you can see what your threads are doing. Look for shared resources and contention. For example, if you find many threads waiting on a lock, that's your problem.
Another nice debugging technique is commenting out code. Once the scalability limit disappears, you know which code caused it.
I am trying to understand how parallelism is implemented in .NET. The following code is taken as an example from Reed Copsey's blog.
This code loops over the customers collection and sends an email to each customer who hasn't been contacted for 14 days. My question: if the customer table is very big and sending an email takes a few seconds, won't this code hog the CPU and effectively deny service to other important processes?
Is there a way to run the following code in parallel but using only a few cores, so that other processes can share the CPU? Or am I approaching the problem in the wrong way?
Parallel.ForEach(customers, (customer, parallelLoopState) =>
{
    // database operation
    DateTime lastContact = theStore.GetLastContact(customer);
    TimeSpan timeSinceContact = DateTime.Now - lastContact;

    // If it's been more than two weeks, send an email, and update...
    if (timeSinceContact.Days > 14)
    {
        // Exit gracefully if we fail to email, since this
        // entire process can be repeated later without issue.
        if (theStore.EmailCustomer(customer) == false)
            parallelLoopState.Break();
        else
            customer.LastEmailContact = DateTime.Now;
    }
});
Accepted answer:
The thought process was right! As Cole Campbell pointed out, one can control and configure how many cores should be used by specifying a ParallelOptions object, in this specific example. Here is how:
var parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism =
    Math.Max(Environment.ProcessorCount / 2, 1);
And Parallel.ForEach is then used as follows:
Parallel.ForEach(customers, parallelOptions,
    (customer, parallelLoopState) =>
    {
        // do all the same stuff
    });
The same concept can be applied to PLINQ using .WithDegreeOfParallelism(int degreeOfParallelism).
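For example, the overdue-customer query from above could be throttled in PLINQ like this (a sketch reusing the names from the question's snippet; it only selects the overdue customers, without the emailing step):
var overdueCustomers = customers
    .AsParallel()
    .WithDegreeOfParallelism(Math.Max(Environment.ProcessorCount / 2, 1))
    .Where(customer => (DateTime.Now - theStore.GetLastContact(customer)).Days > 14)
    .ToList();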
For more information on how to configure ParallelOptions, read this.
The Task Parallel Library is designed to take the system workload into account when scheduling tasks to run, so this shouldn't be an issue. However, you can use the MaxDegreeOfParallelism property on the ParallelOptions class, which can be passed into one of the overloads of ForEach(), to restrict the number of concurrent operations it performs, if you really need to.