Short version: how do async calls scale when async methods are called thousands and thousands of times in a loop, and those methods might call other async methods? Will my thread pool explode?
I've been reading about and experimenting with the TPL and async, and after going through a lot of material I'm still confused about some aspects I could not find much information on, like how async calls scale. I'll try to go straight to the point.
Async calls
For IO, I read it is better to use async than to start a new thread/task, but from what I understand, performing an async operation without using a different thread is impossible, which means async must use other threads/start tasks at some point.
So my question is: how would code A be better than code B regarding system resources?
Code A
// an array with 5000 urls.
var urls = new string[5000];
// list of awaitable tasks.
var tasks = new List<Task<string>>(5000);
var httpClient = new HttpClient();

foreach (string url in urls)
{
    tasks.Add(httpClient.GetStringAsync(url));
}

await Task.WhenAll(tasks);
Code B
...same variables as code A...
foreach (string url in urls)
{
    tasks.Add(
        Task.Factory.StartNew(() =>
        {
            // GetString here represents a hypothetical
            // synchronous version of GetStringAsync.
            return httpClient.GetString(url);
        })
    );
}

await Task.WhenAll(tasks);
Which leads me to the questions:
1 - should async calls be avoided in a loop?
2 - Is there a reasonable max of async calls that should be fired at a time, or is firing any number of async calls ok? How does this scale?
3 - Do async methods, under the hood, start a task for each call?
I tested this with 1000 urls and the number of used threadpool worker threads never even reached 30, and the number of IO completion threads is always about 5.
My Practical Experiment
I created a web application with a simple async controller.
The page is composed of a single form with a textarea where the user enters all the urls they wish to request/do some work with.
Upon submission, the urls are requested in a loop using the HttpClient.GetStringAsync method, just like code A above.
An interesting point is that if I submit 1000 urls, it takes about 3 minutes to finish all requests.
On the other hand, if I submit 3 forms from 3 different tabs (i.e. clients), each with 1000 urls, it takes much, much longer (about 10 minutes). This really confused me, because per the MSDN documentation it should not take much longer than 3 minutes, especially since even while processing all the requests at the same time the number of threadpool threads in use is about 25, which means the available resources are not being used well at all!
The way it is working now, this type of application is far from scalable (say I had about 5000 clients requesting a bunch of urls all the time), and I fail to see how async is the way to fire multiple IO requests.
Further explanation about the application
Client side:
1. user enters the site
2. types 1000 urls in the text area
3. submits the urls
Server side:
1. receive urls as an array
2. perform the code
foreach (string url in urls)
{
    tasks.Add(GetUrlAsync(url));
}
await Task.WhenAll(tasks);
// At this point the thread is returned to the pool
// to receive further requests.
3. notify the client that the work is done
Please, enlighten me!
Thank you.
from what I understand, performing an async operation without using a different thread is impossible, which means async must use other threads/start tasks at some point.
Nope. As I describe on my blog, pure async methods do not block threads.
So my question is: how would code A be better than code B regarding system resources?
A uses fewer threads than B.
(On a side note, do not use StartNew. It's horribly out-of-date and has very dangerous default parameter values. Use Task.Run instead. If you got this idea/code from a blog post or article, please pass the word along. StartNew is a cancer that seems to be taking over the Internet.)
should async calls be avoided in a loop?
Nope, that's fine.
Is there a reasonable max of async calls that should be fired at a time, or is firing any number of async calls ok?
Any number of them are fine, as long as your backend resource can handle it.
How does this scale?
Asynchronous I/O on .NET almost always uses IOCPs (I/O Completion Ports) underneath, which is generally considered the most scalable form of I/O available on Windows.
Do async methods, under the hood, start a task for each call?
Yes and no. The execution of every asynchronous method is represented by a Task instance, but these do not represent running tasks - they don't represent a thread.
I call async tasks Promise Tasks, as opposed to Delegate Tasks (tasks that actually do run on the thread pool).
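A quick sketch of the distinction (DoCpuWork is a hypothetical CPU-bound method, and httpClient is any HttpClient instance):
// Delegate Task: wraps a delegate that actually runs on a thread pool thread.
Task delegateTask = Task.Run(() => DoCpuWork());

// Promise Task: merely represents a future completion; no thread is
// consumed while the HTTP response is outstanding.
Task<string> promiseTask = httpClient.GetStringAsync("http://example.com/");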
really got me confused
One thing to be aware of when you're testing URL requests is that there's automatic throttling for URL requests built-in to .NET. Try setting ServicePointManager.DefaultConnectionLimit to int.MaxValue.
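For example, somewhere in application startup (this is the documented ServicePointManager property; it applies to the classic HttpWebRequest/ServicePoint stack used by .NET Framework):
using System.Net;

// Lift the small default per-host connection limit so that many
// concurrent requests to the same host aren't silently queued.
ServicePointManager.DefaultConnectionLimit = int.MaxValue;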
Related
I have a C# .NET program that uses an external API to process events for real-time stock market data. I use the API callback feature to populate a ConcurrentDictionary with the data it receives on a stock-by-stock basis.
I have a set of algorithms that each run in a constant loop until a terminal condition is met. They are called like this (but all from separate calling functions elsewhere in the code):
Task.Run(() => ExecutionLoop1());
Task.Run(() => ExecutionLoop2());
...
Task.Run(() => ExecutionLoopN());
Each one of those functions calls SnapTotals():
public void SnapTotals()
{
    foreach (KeyValuePair<string, MarketData> kvpMarketData in
        new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime))
    {
        ...
The Handler.MessageEventHandler.Realtime object is the ConcurrentDictionary that is updated in real-time by the external API.
At a certain specific point in the day, there is an instant burst of data that comes in from the API. That is the precise time I want my ExecutionLoop() functions to do some work.
As I've grown the program and added more of those execution loop functions, and grown the number of elements in the ConcurrentDictionary, the performance of the program as a whole has seriously degraded. Specifically, those ExecutionLoop() functions all seem to freeze up and take much longer to meet their terminal condition than they should.
I added some logging to all of the functions above, and to the function that updates the ConcurrentDictionary. From what I can gather, the ExecutionLoop() functions appear to access the ConcurrentDictionary so often that they block the API from updating it with real-time data. The loops are dependent on that data to meet their terminal condition so they cannot complete.
I'm stuck trying to figure out a way to re-architect this. I would like for the thread that updates the ConcurrentDictionary to have a higher priority but the message events are handled from within the external API. I don't know if ConcurrentDictionary was the right type of data structure to use, or what the alternative could be, because obviously a regular Dictionary would not work here. Or is there a way to "pause" my execution loops for a few milliseconds to allow the market data feed to catch up? Or something else?
Your basic approach is sound except for one fatal flaw: all of your loops are hitting the same dictionary at the same time via iterators, sets, and gets. So you must do one thing: in SnapTotals you must iterate over a copy of the concurrent dictionary.
When you iterate over Handler.MessageEventHandler.Realtime or even new ConcurrentDictionary<string, MarketData>(Handler.MessageEventHandler.Realtime) you are using the ConcurrentDictionary<>'s iterator, which, even though it is thread-safe, is going to be using the dictionary for the entire period of iteration (including however long it takes to do the processing for each and every entry). That is most likely where the contention occurs.
Making a copy of the dictionary is much faster, so should lower contention.
Change SnapTotals to
public void SnapTotals()
{
    var copy = Handler.MessageEventHandler.Realtime.ToArray();
    foreach (var kvpMarketData in copy)
    {
        ...
Now, each ExecutionLoopX can execute in peace without write-side contention (your API updates) and without read-side contention from the other loops. The write-side can execute without read-side contention as well.
The only "contention" should be for the short duration needed to do each copy.
And by the way, the dictionary copy (an array) is not threadsafe; it's just a plain array, but that is ok because each task is executing in isolation on its own copy.
I think that your main problem is not related to the ConcurrentDictionary, but to the large number of ExecutionLoopX methods. Each of these methods saturates a CPU core, and since the methods are more than the cores of your machine, the whole CPU is saturated. My assumption is that if you find a way to limit the degree of parallelism of the ExecutionLoopX methods to a number smaller than the Environment.ProcessorCount, your program will behave and perform better. Below is my suggestion for implementing this limitation.
The main obstacle is that currently your ExecutionLoopX methods are monolithic: they can't be separated into pieces so that they can be parallelized. My suggestion is to change their return type from void to async Task, and place an await Task.Yield(); inside the outer loop. This way it will be possible to execute them in steps, with each step being the code from one await to the next.
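A rough sketch of what one converted loop might look like (the terminal-condition check and the parameterless signature are placeholders, not code from the question):
public async Task ExecutionLoop1()
{
    while (!TerminalConditionMet()) // hypothetical terminal condition
    {
        SnapTotals();

        // Yield control back to the limited-concurrency scheduler that is
        // running this method, so the other loops get a turn on the cores.
        await Task.Yield();
    }
}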
Then create a TaskScheduler with limited concurrency, and a TaskFactory that uses this scheduler:
int maxDegreeOfParallelism = Environment.ProcessorCount - 1;
TaskScheduler scheduler = new ConcurrentExclusiveSchedulerPair(
TaskScheduler.Default, maxDegreeOfParallelism).ConcurrentScheduler;
TaskFactory taskFactory = new TaskFactory(scheduler);
Now you can parallelize the execution of the methods, by starting the tasks with the taskFactory.StartNew method instead of the Task.Run:
List<Task> tasks = new();
tasks.Add(taskFactory.StartNew(() => ExecutionLoop1(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop2(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop3(data)).Unwrap());
tasks.Add(taskFactory.StartNew(() => ExecutionLoop4(data)).Unwrap());
//...
Task.WaitAll(tasks.ToArray());
The .Unwrap() is needed because the taskFactory.StartNew returns a nested task (Task<Task>). The Task.Run method is also doing this unwrapping internally, when the action is asynchronous.
An online demo of this idea can be found here.
The Environment.ProcessorCount - 1 configuration means that one CPU core will be available for other work, like the communication with the external API and the updating of the ConcurrentDictionary.
A more cumbersome implementation of the same idea, using iterators and the Parallel.ForEach method instead of async/await, can be found in the first revision of this answer.
If you're not squeamish about mixing operations in a task, you could redesign so that instead of task A doing A things, task B doing B things, task C doing C things, and so on, you reduce the number of tasks to the number of processors and thus run fewer concurrently, greatly easing contention.
So, for example, say you have just two processors. Make a "general purpose/pluggable" task wrapper that accepts delegates. So, wrapper 1 would accept delegates to do A and B work. Wrapper 2 would accept delegates to do C and D work. Then ask each wrapper to spin up a task that calls the delegates in a loop over the dictionary.
This would of course need to be measured. What I am proposing is, say, 4 tasks each doing 4 different types of processing. This is 4 units of work per loop over 4 loops. This is not the same as 16 tasks each doing 1 unit of work. In that case you have 16 loops.
16 loops intuitively would cause more contention than 4.
Again, this is a potential solution that should be measured. There is one drawback for sure: you will have to ensure that a piece of work within a task doesn't affect any of the others.
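A hypothetical sketch of that pluggable wrapper (the delegate signature and the snapshot-per-pass are my assumptions, not code from the question):
// One wrapper task runs several kinds of work over a single snapshot,
// instead of one dedicated task (and one dedicated loop) per kind of work.
Task RunPluggableLoop(params Action<KeyValuePair<string, MarketData>>[] workItems)
{
    return Task.Run(() =>
    {
        foreach (var entry in Handler.MessageEventHandler.Realtime.ToArray())
        {
            foreach (var work in workItems)
            {
                work(entry); // e.g. the "A things" and then the "B things"
            }
        }
    });
}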
We're developing WebAPI which has some logic of decryption of around 200 items (can be more). Each decryption takes around 20ms.
We've tried to parallelize the tasks so we'd get it done as soon as possible, but it seems we're hitting some kind of limit: the threads are being reused while waiting for the older ones to complete (and only a few are used), so the overall action takes around 1-2 seconds to complete...
What we basically want to achieve is to have x threads start at the same time and finish after those ~20 ms.
We tried this:
Await multiple async Task while setting max running task at a time
But it seems this only describes setting a limit while we want to release it...
Here's a snippet:
var tasks = new List<Task>();
foreach (var element in Elements)
{
    var task = new Task(() =>
    {
        element.Value = Cipher.Decrypt((string)element.Value);
    });
    task.Start();
    tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
What are we missing here?
Thanks,
Nir.
I cannot recommend parallelism on ASP.NET. It will certainly impact the scalability of your service, particularly if it is public-facing. I have thought "oh, I'm smart enough to do this" a couple of times and added parallelism in an ASP.NET app, only to have to tear it right back out a week later.
However, if you really want to...
it seems we're getting some kind of a limit
Is it the limit of physical cores on your machine?
We tried this: Await multiple async Task while setting max running task at a time
That solution is specifically for asynchronous concurrent code (e.g., I/O-bound). What you want is parallel (threaded) concurrent code (e.g., CPU-bound). Completely different use cases and solutions.
What are we missing here?
Your current code is throwing a ton of simultaneous tasks at the thread pool, which will attempt to handle them as best as it can. You can make this more efficient by using a higher-level abstraction, e.g., Parallel:
Parallel.ForEach(Elements, element =>
{
    element.Value = Cipher.Decrypt((string)element.Value);
});
Parallel is more intelligent in terms of its partitioning and (re-)use of threads (i.e., not exceeding number of cores). So you should see some speedup.
However, I would expect it only to be a minor speedup. You are likely being limited by your number of physical cores.
Assuming no hyperthreading:
If it takes 20 ms for 1 item, then you can look at it as 1 core taking 20 ms. If you want 200 items to complete in 20 ms, then you need 200 cores all to yourself. If you don't have that many, it just can't be done. For example, with 8 cores the best case is roughly 200 × 20 ms / 8 ≈ 500 ms of pure decryption work.
Under normal circumstances, as many tasks will be scheduled in parallel as is optimal for your system.
I have read TPL and Task library documents cover to cover. But, I still couldn't comprehend the following case very clearly and right now I need to implement it.
I will simplify my situation. I have an IEnumerable<Uri> of length 1000. I have to make a request for them using HttpClient.
I have two questions.
There is not much computation, just waiting for Http request. In this case can I still use Parallel.Foreach() ?
In case of using Task instead, what is the best practice for creating huge number of them? Let's say I use Task.Factory.StartNew() and add those tasks to a list and wait for all of them. Is there a feature (such as TPL partitioner) that controls number of maximum tasks and maximum HttpClient I can create?
There are couple of similar questions on SO, but no one mentions the maximums. The requirement is just using maximum tasks with maximum HttpClient.
Thank you in advance.
i3arnon's answer with TPL Dataflow is good; Dataflow is useful especially if you have a mix of CPU and I/O bound code. I'll echo his sentiment that Parallel is designed for CPU-bound code; it's not the best solution for I/O-based code, and especially not appropriate for asynchronous code.
If you want an alternative solution that works well with mostly-I/O code - and doesn't require an external library - the method you're looking for is Task.WhenAll:
var tasks = uris.Select(uri => SendRequestAsync(uri)).ToArray();
await Task.WhenAll(tasks);
This is the easiest solution, but it does have the drawback of starting all requests simultaneously. Particularly if all requests are going to the same service (or a small set of services), this can cause timeouts. To solve this, you need to use some kind of throttling...
Is there a feature (such as TPL partitioner) that controls number of maximum tasks and maximum HttpClient I can create?
TPL Dataflow has that nice MaxDegreeOfParallelism which only starts so many at a time. You can also throttle regular asynchronous code by using another built-in, SemaphoreSlim:
private readonly SemaphoreSlim _sem = new SemaphoreSlim(50);

private async Task SendRequestAsync(Uri uri)
{
    await _sem.WaitAsync();
    try
    {
        ...
    }
    finally
    {
        _sem.Release();
    }
}
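With that in place, the Task.WhenAll pattern shown above stays the same; the semaphore simply caps how many requests are actually in flight at once (50 in this sketch).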
In case of using Task instead, what is the best practice for creating huge number of them? Let's say I use Task.Factory.StartNew() and add those tasks to a list and wait for all of them.
You actually don't want to use StartNew. It only has one appropriate use case (dynamic task-based parallelism), which is extremely rare. Modern code should use Task.Run if you need to push work onto a background thread. But you don't even need that to begin with, so neither StartNew nor Task.Run is appropriate here.
There are couple of similar questions on SO, but no one mentions the maximums. The requirement is just using maximum tasks with maximum HttpClient.
Maximums are where asynchronous code really gets tricky. With CPU-bound (parallel) code, the solution is obvious: you use as many threads as you have cores. (Well, at least you can start there and adjust as necessary). With asynchronous code, there isn't as obvious of a solution. It depends on a lot of factors - how much memory you have, how the remote server responds (rate limiting, timeouts, etc), etc.
There are no easy solutions here. You just have to test how your specific application deals with high levels of concurrency, and then throttle to some lower number.
I have some slides for a talk that attempts to explain when different technologies are appropriate (parallelism, asynchrony, TPL Dataflow, and Rx). If you prefer more of a written description with recipes, I think you may benefit from my book on concurrency.
.NET 6
Starting from .NET 6 you can use one of the Parallel.ForEachAsync methods which are async aware:
await Parallel.ForEachAsync(
    uris,
    async (uri, cancellationToken) => await SendRequestAsync(uri, cancellationToken));
This will use Environment.ProcessorCount as the degree of parallelism. To change it you can use the overload that accepts ParallelOptions:
await Parallel.ForEachAsync(
    uris,
    new ParallelOptions { MaxDegreeOfParallelism = 50 },
    async (uri, cancellationToken) => await SendRequestAsync(uri, cancellationToken));
ParallelOptions also allows passing in a CancellationToken and a TaskScheduler.
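For example (a sketch; the 30-second timeout is an arbitrary choice):
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

await Parallel.ForEachAsync(
    uris,
    new ParallelOptions
    {
        MaxDegreeOfParallelism = 50,
        CancellationToken = cts.Token
    },
    async (uri, cancellationToken) => await SendRequestAsync(uri, cancellationToken));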
.NET 5 and older (including all .NET Framework versions)
In this case can I still use Parallel.Foreach ?
This isn't really appropriate. Parallel.ForEach is more for CPU-intensive work. It also doesn't support async operations.
In case of using Task instead, what is the best practice for creating huge number of them?
Use a TPL Dataflow block instead. You don't create huge amounts of tasks that sit there waiting for a thread to become available. You can configure the max amount of tasks and reuse them for all the items that meanwhile sit in a buffer waiting for a task. For example:
var block = new ActionBlock<Uri>(
    uri => SendRequestAsync(uri),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

foreach (var uri in uris)
{
    block.Post(uri);
}

block.Complete();
await block.Completion;
C# offers multiple ways to perform asynchronous execution such as threads, futures, and async.
In what cases is async the best choice?
I have read many articles about the how and what of async, but so far I have not seen any article that discusses the why.
Initially I thought async was a built-in mechanism to create a future. Something like
async int foo(){ return ..complex operation..; }
var x = await foo();
do_something_else();
bar(x);
Where the call to 'await foo' would return immediately, and the use of 'x' would wait on the return value of 'foo'. async does not do this. If you want this behavior you can use the futures library: https://msdn.microsoft.com/en-us/library/Ff963556.aspx
The above example would instead be something like
int foo(){ return ..complex operation..; }
var x = Task.Factory.StartNew<int>(() => foo());
do_something_else();
bar(x.Result);
Which isn't as pretty as I would have hoped, but it works nonetheless.
So if you have a problem where you want to have multiple threads operate on the work then use futures or one of the parallel operations, such as Parallel.For.
async/await, then, is probably not meant for the use case of performing work in parallel to increase throughput.
async solves the problem of scaling an application for a large number of asynchronous events, such as I/O, when creating many threads is expensive.
Imagine a web server where requests are processed immediately as they come in. The processing happens on a single thread where every function call is synchronous. To fully process a request might take a few seconds, which means that an entire thread is consumed until the processing is complete.
A naive approach to server programming is to spawn a new thread for each request. In this way it does not matter how long each thread takes to complete because no thread will block any other. The problem with this approach is that threads are not cheap. The underlying operating system can only create so many threads before running out of memory, or some other kind of resource. A web server that uses 1 thread per request will probably not be able to scale past a few hundred/thousand requests per second. The c10k challenge asks that modern servers be able to scale to 10,000 simultaneous users. http://www.kegel.com/c10k.html
A better approach is to use a thread pool where the number of threads in existence is more or less fixed (or at least, does not expand past some tolerable maximum). In that scenario only a fixed number of threads are available for processing the incoming requests. If there are more requests than there are threads available for processing then some requests must wait. If a thread is processing a request and has to wait on a long running I/O process then effectively the thread is not being utilized to its fullest extent, and the server throughput will be much less than it otherwise could be.
The question is now, how can we have a fixed number of threads but still use them efficiently? One answer is to 'cut up' the program logic so that when a thread would normally wait on an I/O process, instead it will start the I/O process but immediately become free for any other task that wants to execute. The part of the program that was going to execute after the I/O will be stored in a thing that knows how to keep executing later on.
For example, the original synchronous code might look like
void process() {
    string name = get_user_name();
    string address = look_up_address(name);
    string tax_forms = find_tax_form(address);
    render_tax_form(name, address, tax_forms);
}
Where look_up_address and find_tax_form have to talk to a database and/or make requests to other websites.
The asynchronous version might look like
void process() {
    string name = get_user_name();
    invoke_after(() => look_up_address(name), (address) => {
        invoke_after(() => find_tax_form(address), (tax_forms) => {
            render_tax_form(name, address, tax_forms);
        });
    });
}
This is continuation-passing style, where the next thing to do is passed as the second lambda to a function that will not block the current thread when the blocking operation (in the first lambda) is invoked. This works, but it quickly becomes very ugly and the program logic becomes hard to follow.
What the programmer has manually done in splitting up their program can be automatically done by async/await. Any time there is a call to an I/O function the program can mark that function call with await to inform the caller of the program that it can continue to do other things instead of just waiting.
async void process() {
    string name = get_user_name();
    string address = await look_up_address(name);
    string tax_forms = await find_tax_form(address);
    render_tax_form(name, address, tax_forms);
}
The thread that executes process will break out of the function when it gets to look_up_address and continue to do other work, such as processing other requests. When look_up_address has completed and process is ready to continue, some thread (or the same thread) will pick up where the last thread left off and execute the next line, find_tax_form(address).
Since my current understanding is that async is about managing threads, I don't believe that async makes a lot of sense for UI programming. Generally UIs will not have that many simultaneous events that need to be processed. The use case for async with UIs is preventing the UI thread from being blocked. Even though async can be used with a UI, I would find it dangerous, because omitting an await on some long-running function, due to either accident or forgetfulness, would cause the UI to block.
async void button_callback() {
    await do_something_long();
    ....
}
This code won't block the UI because it uses an await for the long running function that it invokes. If later on another function call is added
async void button_callback() {
    do_another_thing();
    await do_something_long();
    ...
}
If it wasn't clear to the programmer who added the call to do_another_thing just how long it would take to execute, the UI will now be blocked. It seems safer to just always execute all processing in a background thread.
void button_callback() {
    new Thread(() => {
        do_another_thing();
        do_something_long();
        ....
    }).Start();
}
Now there is no possibility that the UI thread will be blocked, and the chance that too many threads will be created is very small.
I'm just beginning to learn C# threading and concurrent collections, and am not sure of the proper terminology to pose my question, so I'll describe briefly what I'm trying to do. My grasp of the subject is rudimentary at best at this point. Is my approach below even feasible as I've envisioned it?
I have 100,000 urls in a Concurrent collection that must be tested--is the link still good? I have another concurrent collection, initially empty, that will contain the subset of urls that an async request determines to have been moved (400, 404, etc errors).
I want to spawn as many of these async requests concurrently as my PC and our bandwidth will allow, and was going to start at 20 async-web-request-tasks per second and work my way up from there.
Would it work if a single async task handled both things: it would make the async request and then add the url to the BadUrls collection if it encountered a 4xx error? A new instance of that task would be spawned every 50ms:
class TestArgs {
    public ConcurrentBag<UrlInfo> myCollection { get; set; }
    public System.Uri currentUrl { get; set; }
}
ConcurrentQueue<UrlInfo> Urls = new ConcurrentQueue<UrlInfo>();
// populate the Urls queue
<snip>
// initialize the bad urls collection
ConcurrentBag<UrlInfo> BadUrls = new ConcurrentBag<UrlInfo>();
// timer fires every 50ms, whereupon a new args object is created
// and the timer callback spawns a new task; an autoEvent would
// reset the timer and dispose of it when the queue was empty
void SpawnNewUrlTask() {
    // if the queue is empty then reset the timer
    // otherwise:
    var args = new TestArgs {
        myCollection = BadUrls,
        currentUrl = getNextUrl() // take an item from the queue
    };
    Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
}
public async Task asyncWebRequestAndConcurrentCollectionUpdater(TestArgs args)
{
//make the async web request
// add the url to the bad collection if appropriate.
}
Feasible? Way off?
The approach seems fine, but there are some issues with the specific code you've shown.
But before I get to that, there have been suggestions in the comments that Task Parallelism is the way to go. I think that's misguided. There's a common misconception that if you want to have lots of work going on in parallel, you necessarily need lots of threads. That's only true if the work is compute-bound. But the work you're doing will be IO bound - this code is going to spend the vast majority of its time waiting for responses. It will do very little computation. So in practice, even if it only used a single thread, your initial target of 20 requests per second doesn't seem like a workload that would cause a single CPU core to break into a sweat.
In short, a single thread can handle very high levels of concurrent IO. You only need multiple threads if you need parallel execution of code, and that doesn't look likely to be the case here, because there's so little work for the CPU in this particular job.
(This misconception predates await and async by years. In fact, it predates the TPL - see http://www.interact-sw.co.uk/iangblog/2004/09/23/threadless for a .NET 1.1 era illustration of how you can handle thousands of concurrent requests with a tiny number of threads. The underlying principles still apply today because Windows networking IO still basically works the same way.)
Not that there's anything particularly wrong with using multiple threads here, I'm just pointing out that it's a bit of a distraction.
Anyway, back to your code. This line is problematic:
Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
You've not given us all your code, but I can't see how that will compile. The overloads of StartNew that accept two arguments require the first to be either an Action, an Action<object>, a Func<TResult>, or a Func<object,TResult>. In other words, it has to be a method that either takes no arguments, or accepts a single argument of type object (and which may or may not return a value). Your 'asyncWebRequestAndConcurrentCollectionUpdater' takes an argument of type TestArgs.
But the fact that it doesn't compile isn't the main problem. That's easily fixed. (E.g., change it to Task.Factory.StartNew(() => asyncWebRequestAndConcurrentCollectionUpdater(args));) The real issue is that what you're doing is a bit weird: you're using Task.Factory.StartNew to invoke a method that already returns a Task.
Task.Factory.StartNew is a handy way to take a synchronous method (i.e., one that doesn't return a Task) and run it in a non-blocking way. (It'll run on the thread pool.) But if you've got a method that already returns a Task, then you didn't really need to use StartNew. The weirdness becomes more apparent if we look at what StartNew returns (once you've fixed the compilation error):
Task<Task> t = Task.Factory.StartNew(
() => asyncWebRequestAndConcurrentCollectionUpdater(args));
That Task<Task> reveals what's happening. You've decided to wrap a method that was already asynchronous with a mechanism that is normally used to make non-asynchronous methods asynchronous. And so you've now got a Task that produces a Task.
One of the slightly surprising upshots of this is that if you were to wait for the task returned by StartNew to complete, the underlying work would not necessarily be done:
t.Wait(); // doesn't wait for asyncWebRequestAndConcurrentCollectionUpdater to finish!
All that will actually do is wait for asyncWebRequestAndConcurrentCollectionUpdater to return a Task. And since asyncWebRequestAndConcurrentCollectionUpdater is already an async method, it will return a task more or less immediately. (Specifically, it'll return a task the moment it performs an await that does not complete immediately.)
If you want to wait for the work you've kicked off to finish, you'll need to do this:
t.Result.Wait();
or, potentially more efficiently, this:
t.Unwrap().Wait();
That says: get me the Task that my async method returned, and then wait for that. This may not be usefully different from this much simpler code:
Task t = asyncWebRequestAndConcurrentCollectionUpdater("foo");
... maybe queue up some other tasks ...
t.Wait();
You may not have gained anything useful by introducing Task.Factory.StartNew.
I say "may" because there's an important qualification: it depends on the context in which you start the work. C# generates code which, by default, attempts to ensure that when an async method continues after an await, it does so in the same context in which the await was initially performed. E.g., if you're in a WPF app and you await while on the UI thread, when the code continues it will arrange to do so on the UI thread. (You can disable this with ConfigureAwait.)
So if you're in a situation in which the context is essentially serialized (either because it's single-threaded, as will be the case in a GUI app, or because it uses something resembling a rental model, e.g. the context of a particular ASP.NET request), it may actually be useful to kick an async task off via Task.Factory.StartNew because it enables you to escape the original context. However, you've just made your life harder - tracking your tasks to completion is somewhat more complex. And you might have been able to achieve the same effect simply by using ConfigureAwait inside your async method.
And it may not matter anyway - if you're only attempting to manage 20 requests a second, the minimal amount of CPU effort required to do that means that you can probably manage it entirely adequately on one thread. (Also, if this is a console app, the default context will come into play, which uses the thread pool, so your tasks will be able to run multithreaded in any case.)
But to get back to your question, it seems entirely reasonable to me to have a single async method that picks a url off the queue, makes the request, examines the response, and if necessary, adds an entry to the bad url collection. And kicking the things off from a timer also seems reasonable - that will throttle the rate at which connections are attempted without getting bogged down with slow responses (e.g., if a load of requests end up attempting to talk to servers that are offline). It might be necessary to introduce a cap for the maximum number of requests in flight if you hit some pathological case where you end up with tens of thousands of URLs in a row all pointing to a server that isn't responding. (On a related note, you'll need to make sure that you're not going to hit any per-client connection limits with whichever HTTP API you're using - that might end up throttling the effective throughput.)
You will need to add some sort of completion handling - just kicking off asynchronous operations and not doing anything to handle the results is bad practice, because you can end up with exceptions that have nowhere to go. (In .NET 4.0, these used to terminate your process, but as of .NET 4.5, by default an unhandled exception from an asynchronous operation will simply be ignored!) And if you end up deciding that it is worth launching via Task.Factory.StartNew remember that you've ended up with an extra layer of wrapping, so you'll need to do something like myTask.Unwrap().ContinueWith(...) to handle it correctly.
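For example, a minimal sketch of observing the unwrapped task (the error logging is just a placeholder):
Task<Task> wrapped = Task.Factory.StartNew(
    () => asyncWebRequestAndConcurrentCollectionUpdater(args));

// Unwrap to get the inner task, then attach a continuation so any
// exception is observed instead of being silently dropped.
wrapped.Unwrap().ContinueWith(t =>
{
    if (t.IsFaulted)
    {
        Console.Error.WriteLine(t.Exception);
    }
});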
Of course you can. Concurrent collections are called 'concurrent' because they can be used... concurrently by multiple threads, with some guarantees about their behaviour.
A ConcurrentQueue will ensure that each element inserted in it is extracted exactly once (concurrent threads will never extract the same item by mistake, and once the queue is empty, all the items have been extracted by a thread).
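As a tiny illustration (a hypothetical worker loop using the Urls queue from the question):
// Each dequeued item is handed to exactly one of the competing workers.
while (Urls.TryDequeue(out UrlInfo next))
{
    // process 'next'; no other worker will ever receive this same item
}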
EDIT: the only thing that could go wrong is that 50ms is not enough to complete a request, and so more and more tasks accumulate in the task queue. If that happens, your memory could fill up, but the thing would work anyway. So yes, it is feasible.
Anyway, I would like to underline the fact that a task is not a thread. Even if you create 100 tasks, the framework will decide how many of them will be actually executed concurrently.
If you want to have more control on the level of parallelism, you should use asynchronous requests.
In your comments, you wrote "async web request", but I can't tell whether you wrote async just because it's on a different thread or because you intend to use the async API.
If you were using the async API, I'd expect to see some handler attached to the completion event, but I couldn't see it, so I assumed you're using synchronous requests issued from an asynchronous task.
If you're using asynchronous requests, then it's pointless to use tasks, just use the timer to issue the async requests, since they are already asynchronous.
When I say "asynchronous request" I'm referring to methods like WebRequest.GetResponseAsync and WebRequest.BeginGetResponse.
EDIT2: if you want to use asynchronous requests, then you can just make the requests from the timer handler. The BeginGetResponse method takes two arguments. The first one is a callback procedure that will be called to report the status of the request; you can pass the same procedure for all the requests. The second one is a user-provided object that stores state about the request; you can use this argument to differentiate among the different requests. You can even do it without the timer. Something like:
private readonly int desiredConcurrency = 20;

struct RequestData
{
    public UrlInfo url;
    public HttpWebRequest request;
}

/// Handles the completion of an asynchronous request.
/// When a request has been completed,
/// tries to issue a new request to another url.
private void AsyncRequestHandler(IAsyncResult ar)
{
    if (ar.IsCompleted)
    {
        RequestData data = (RequestData)ar.AsyncState;
        var resp = (HttpWebResponse)data.request.EndGetResponse(ar);
        if (resp.StatusCode != HttpStatusCode.OK)
        {
            BadUrls.Add(data.url);
        }
        // A request has been completed; try to start a new one.
        TryIssueRequest();
    }
}

/// If urls is not empty, dequeues a url from it
/// and issues a new request to the extracted url.
private bool TryIssueRequest()
{
    RequestData rd;
    if (urls.TryDequeue(out rd.url))
    {
        rd.request = CreateRequestTo(rd.url); // TODO implement
        rd.request.BeginGetResponse(AsyncRequestHandler, rd);
        return true;
    }
    else
    {
        return false;
    }
}

// Called by a button handler, or something like that.
void StartTheRequests()
{
    for (int requestCount = 0; requestCount < desiredConcurrency; ++requestCount)
    {
        if (!TryIssueRequest()) break;
    }
}