How to implement ForEachAsync? - c#

Questions on Lambda to delete partitions.
The existing query which uses parallelization is failing since it exceeds the number of parallel queries. We want to replace it with sequential queries and increased timeout for lambda.
Can we change the lambda to parallel with limited threads?
Database -> aws athena = Getting the List of clients from Athena. Looping throgh it.
Right now it works fine with sequential calls as well but since the number of clients is small now, it would pose a problem for future.
The only issue with limited parallel threads is that we need some code to handle the thread count as well.
Then someone suggested me use this: https://devblogs.microsoft.com/pfxteam/implementing-a-simple-foreachasync-part-2/
https://gist.github.com/0xced/94f6c50d620e582e19913742dbd76eb6
public class AthenaClient {
private readonly IAmazonAthena _client;
private readonly string _databaseName;
private readonly string _outputLocation;
private readonly string _tableName;
const int MaxQueryLength = 262144;
readonly int _maxclientsToBeProcessed;
public AthenaClient(string databaseName, string tableName, string outputLocation, int maxclientsToBeProcessed) {
_databaseName = databaseName;
_tableName = tableName;
_outputLocation = outputLocation;
_maxclientsToBeProcessed = maxclientsToBeProcessed == 0 ? 1 : maxclientsToBeProcessed;
_client = new AmazonAthenaClient();
}
public async Task < bool > DeletePartitions() {
var clients = await GetClients();
for (int i = 0; i < clients.Count; i = i + _maxclientsToBeProcessed) {
var clientItems = clients.Skip(i).Take(_maxclientsToBeProcessed);
var queryBuilder = new StringBuilder();
queryBuilder.AppendLine($ "ALTER TABLE { _databaseName }.{_tableName} DROP IF EXISTS");
foreach(var client in clientItems) {
queryBuilder.AppendLine($ " PARTITION (client_id = '{client}'), ");
}
var query = queryBuilder.ToString().Trim().TrimEnd(',') + ";";
LambdaLogger.Log(query);
if (query.Length >= MaxQueryLength) {
throw new Exception("Delete partition query length exceeded.");
}
var queryExecutionId = StartQueryExecution(query).Result;
await CheckQueryExecutionStatus(queryExecutionId);
}
return true;
}
}

It seems that the actual question should be :
How can I change the database partitions for lots of clients in AWS Athena without executing them sequentially?
The answer isn't ForEachAsync or the upcoming await foreach in C# 8. An asynchronous loop would still send calls to the service one at a time, it "just" wouldn't block while waiting for an answer.
Concurrent workers
This is a concurrent worker problem that can be handled using eg the TPL Dataflow library's ActionBlock class or the new System.Threading.Channel classes.
The Dataflow library is meant to create event/message processing pipelines similar to a shell script pipeline, by moving data between independent blocks. Each block runs on its own task/thread which means you can get concurrent execution simply by breaking up processing into blocks.
It's also possible to increase the number of processing tasks per block, by specifying the MaxDegreeOfParallelism option when creating the block. This allows us to quickly create "workers" that can work on lots of messages concurrently.
Example
In this case, the "message" is the Client whatever that is. A single ActionBlock could create the DDL statement and execute it. Each block has an input queue which means we can just post messages to a block and await for it to execute everything using the DOP we specified.
We can also specify a limit to the queue so it won't get flooded if the worker tasks can't run fast enough :
var options=new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = _maxclientsToBeProcessed,
BoundedCapacity = _maxclientsToBeProcessed*3, //Just a guess
});
var block=new ActionBlock<Client>(client=>CreateAndRunDDL(client));
//Post the client requests
foreach(var client in clients)
{
await block.SendAsync(client);
}
//Tell the block we're done
block.Complete();
//Await for all queued messages to finish processing
await block.Completion;
The CreateAndRunDDL(Client) method should do what the code inside the question's loop does. A good idea would be to refactor it though, and create separate functions to create and execute the query , eg :
async Task CreateAndRunDDL(Client client)
{
var query = QueryForClient(...);
LambdaLogger.Log(query);
if (query.Length >= MaxQueryLength) {
throw new Exception("Delete partition query length exceeded.");
}
var queryExecutionId = await StartQueryExecution(query);
await CheckQueryExecutionStatus(queryExecutionId);
}
Blocks can be linked too. If we wanted to batch multiple clients together for processing, we can use a BatchBlock and feed its results to our action block, eg :
var batchClients = new BatchBlock<Client>(20);
var linkOptions = new DataflowLinkOptions
{
PropagateCompletion = true
};
var block=new ActionBlock<Client>(clients=>CreateAndRunDDL(clients));
batchClients.LinkTo(block,linkOptions);
This time the CreateAndRunDDL method accepts a Client[] array with the number of clients/messages we specified in the batch size.
async Task CreateAndRunDDL(Client[] clients)
{
var query = QueryForClients(clients);
...
}
Messages should be posted to the batchClients block now. Once that completes, we need to wait for the last block in the pipeline to complete :
foreach(var client in clients)
{
await batchClients.SendAsync(client);
}
//Tell the *batch block* we're done
batchClient.Complete();
//Await for all queued messages to finish processing
await block.Completion;

Related

Running parallel async tasks and return result in .NET Core Web API

Hi Recently i was working in .net core web api project which is downloading files from external api.
In this .net core api recently found some issues while the no of files is more say more than 100. API is downloading max of 50 files and skipping others. WebAPI is deployed on AWS Lambda and timeout is 15mnts.
Actually the operation is timing out due to the long download process
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachment)
{
try
{
bool DownloadFlag = false;
foreach (DownloadAttachment downloadAttachment in downloadAttachments)
{
DownloadFlag = await DownloadAttachment(downloadAttachment.id);
//update the download status in database
if(DownloadFlag)
{
bool UpdateFlag = await _DocumentService.UpdateDownloadStatus(downloadAttachment.id);
if (UpdateFlag)
{
await DeleteAttachment(downloadAttachment.id);
}
}
}
return true;
}
catch (Exception ext)
{
log.Error(ext, "Error in Saving attachment {attachemntId}",downloadAttachment.id);
return false;
}
}
Document service code
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
return await _documentRepository.UpdateAttachmentDownloadStatus(AttachmentID);
}
And DB update code
public async Task<bool> UpdateAttachmentDownloadStatus(string AttachmentID)
{
using (var db = new SqlConnection(_connectionString.Value))
{
var Result = 0; bool SuccessFlag = false;
var parameters = new DynamicParameters();
parameters.Add("#pm_AttachmentID", AttachmentID);
parameters.Add("#pm_Result", Result, System.Data.DbType.Int32, System.Data.ParameterDirection.Output);
var result = await db.ExecuteAsync("[Loan].[UpdateDownloadStatus]", parameters, commandType: CommandType.StoredProcedure);
Result = parameters.Get<int>("#pm_Result");
if (Result > 0) { SuccessFlag = true; }
return SuccessFlag;
}
}
How can i move this async task to run parallel ? and get the result? i tried following code
var task = Task.Run(() => DownloadAttachment( downloadAttachment.id));
bool result = task.Result;
Is this approach is fine? how can improve the performance? how to get the result from each parallel task and update to DB and delete based on success flag? Or this error is due to AWS timeout?
Please help
If you extracted the code that handles individual files to a separate method :
private async Task DownloadSingleAttachment(DownloadAttachment attachment)
{
try
{
var download = await DownloadAttachment(downloadAttachment.id);
if(download)
{
var update = await _DocumentService.UpdateDownloadStatus(downloadAttachment.id);
if (update)
{
await DeleteAttachment(downloadAttachment.id);
}
}
}
catch(....)
{
....
}
}
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachment)
{
try
{
foreach (var attachment in downloadAttachments)
{
await DownloadSingleAttachment(attachment);
}
}
....
}
It would be easy to start all downloads at once, although not very efficient :
public async Task<bool> DownloadAttachmentsAsync(List<DownloadAttachment> downloadAttachment)
{
try
{
//Start all of them
var tasks=downloadAttachments.Select(att=>DownloadSingleAttachment(att));
await Task.WhenAll(tasks);
}
....
}
This isn't very efficient because external services hate lots of concurrent calls from a single source as you do, and almost certainly impose throttling. The database doesn't like lots of concurrent calls either, because in all database products concurrent calls lead to blocking one way or another. Even in databases that use multiversioning, this comes with an overhead.
Using Dataflow classes - Single block
One easy way to fix this is to use .NET's Dataflow classes to break the operation into a pipeline of steps, and execute each one with a different number of concurrent tasks.
We could put the entire operation into a single block, but that could cause problems if the update and delete operations aren't thread-safe :
var dlOptions= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
};
var downloader=new ActionBlock<DownloadAttachment>(async att=>{
await DownloadSingleAttachment(att);
},dlOptions);
foreach (var attachment in downloadAttachments)
{
await downloader.SendAsync(attachement.id);
}
downloader.Complete();
await downloader.Completion;
Dataflow - Multiple steps
To avoid possible thread issues, the rest of the methods can go to their own blocks. They could both go into one ActionBlock that calls both Update and Delete, or they could go into separate blocks if the methods talk to different services with different concurrency requirements.
The downloader block will execute at most 10 concurrent downloads. By default, each block uses only a single task at a time.
The updater and deleter blocks have their default DOP=1, which means there's no risk of race conditions as long as they don't try to use eg the same connection at the same time.
var downloader=new TransformBlock<string,(string id,bool download)>(
async id=> {
var download=await DownloadAttachment(id);
return (id,download);
},dlOptions);
var updater=new TransformBlock<(string id,bool download),(string id,bool update)>(
async (id,download)=> {
if(download)
{
var update = await _DocumentService.UpdateDownloadStatus(id);
return (id,update);
}
return (id,false);
});
var deleter=new ActionBlock<(string id,bool update)>(
async (id,update)=> {
if(update)
{
await DeleteAttachment(id);
}
});
The blocks can be linked into a pipeline now and used. The setting PropagateCompletion = true means that as soon as a block is finished processing, it will tell all its connected blocks to finish as well :
var linkOptions=new DataflowLinkOptions { PropagateCompletion = true};
downloader.LinkTo(updater, linkOptions);
updater.LinkTo(deleter,linkOptions);
We can pump data into the head block as long as we need. When we're done, we call the head block's Complete() method. As each block finishes processing its data, it will propagate its completion to the next block in the pipeline. We need to await for the last (tail) block to complete to ensure all the attachments have been processed:
foreach (var attachment in downloadAttachments)
{
await downloader.SendAsync(attachement.id);
}
downloader.Complete();
await deleter.Completion;
Each block has an input and (when necessary) an output buffer, which means the "producer" and "consumers" of the messages don't have to be in sync, or even know of each other. All the "producer" needs to know is where to find the head block in a pipeline.
Throttling and backpressure
One way to throttle is to use a fixed number of tasks through MaxDegreeOfParallelism.
It's also possible to put a limit to the input buffer, thus blocking previous steps or producers if a block can't process messages fast enough. This can be done simply by setting the BoundedCapacity option for a block:
var dlOptions= new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10,
BoundedCapacity=20,
};
var updaterOptions= new ExecutionDataflowBlockOptions
{
BoundedCapacity=20,
};
...
var downloader=new TransformBlock<...>(...,dlOptions);
var updater=new TransformBlock<...>(...,updaterOptions);
No other changes are necessary
To run multiple asynchronous operations you could do something like this:
public async Task RunMultipleAsync<T>(IEnumerable<T> myList)
{
const int myNumberOfConcurrentOperations = 10;
var mySemaphore = new SemaphoreSlim(myNumberOfConcurrentOperations);
var tasks = new List<Task>();
foreach(var myItem in myList)
{
await mySemaphore.WaitAsync();
var task = RunOperation(myItem);
tasks.Add(task);
task.ContinueWith(t => mySemaphore.Release());
}
await Task.WhenAll(tasks);
}
private async Task RunOperation<T>(T myItem)
{
// Do stuff
}
Put your code from DownloadAttachmentsAsync at the 'Do stuff' comment
This will use a semaphore to limit the number of concurrent operations, since running to many concurrent operations is often a bad idea due to contention. You would need to experiment to find the optimal number of concurrent operations for your use case. Also note that error handling have been omitted to keep the example short.

Can you avoid Task Continuations when using async/await if you want execution to continue immediately

I am working on a protocol and trying to use as much async/await as I can to make it scale well. The protocol will have to support hundreds to thousands of simultaneous connections. Below is a little bit of pseudo code to illustrate my problem.
private static async void DoSomeWork()
{
var protocol = new FooProtocol();
await protocol.Connect("127.0.0.1", 1234);
var i = 0;
while(i != int.MaxValue)
{
i++;
var request = new FooRequest();
request.Payload = "Request Nr " + i;
var task = protocol.Send(request);
_ = task.ContinueWith(async tmp =>
{
var resp = await task;
Console.WriteLine($"Request {resp.SequenceNr} Successful: {(resp.Status == 0)}");
});
}
}
And below is a little pseudo code for the protocol.
public class FooProtocol
{
private int sequenceNr = 0;
private SemaphoreSlim ss = new SemaphoreSlim(20, 20);
public Task<FooResponse> Send(FooRequest fooRequest)
{
var tcs = new TaskCompletionSource<FooResponse>();
ss.Wait();
var tmp = Interlocked.Increment(ref sequenceNr);
fooRequest.SequenceNr = tmp;
// Faking some arbitrary delay. This work is done over sockets.
Task.Run(async () =>
{
await Task.Delay(1000);
tcs.SetResult(new FooResponse() {SequenceNr = tmp});
ss.Release();
});
return tcs.Task;
}
}
I have a protocol with request and response pairs. I have used asynchronous socket programming. The FooProtocol will take care of matching up request with responses (sequence numbers) and will also take care of the maximum number of pending requests. (Done in the pseudo and my code with a semaphore slim, So I am not worried about run away requests). The DoSomeWork method calls the Protocol.Send method, but I don't want to await the response, I want to spin around and send the next one until I am blocked by the maximum number of pending requests. When the task does complete I want to check the response and maybe do some work.
I would like to fix two things
I would like to avoid using Task.ContinueWith() because it seems to not fit in cleanly with the async/await patterns
Because I have awaited on the connection, I have had to use the async modifier. Now I get warnings from the IDE "Because this call is not waited, execution of the current method continues before this call is complete. Consider applying the 'await' operator to the result of the call." I don't want to do that, because as soon as I do it ruins the protocol's ability to have many requests in flight. The only way I can get rid of the warning is to use a discard. Which isn't the worst thing but I can't help but feel like I am missing a trick and fighting this too hard.
Side note: I hope your actual code is using SemaphoreSlim.WaitAsync rather than SemaphoreSlim.Wait.
In most socket code, you do end up with a list of connections, and along with each connection is a "processor" of some kind. In the async world, this is naturally represented as a Task.
So you will need to keep a list of Tasks; at the very least, your consuming application will need to know when it is safe to shut down (i.e., all responses have been received).
Don't preemptively worry about using Task.Run; as long as you aren't blocking (e.g., SemaphoreSlim.Wait), you probably will not starve the thread pool. Remember that during the awaits, no thread pool thread is used.
I am not sure that it's a good idea to enforce the maximum concurrency at the protocol level. It seems to me that this responsibility belongs to the caller of the protocol. So I would remove the SemaphoreSlim, and let it do the one thing that it knows to do well:
public class FooProtocol
{
private int sequenceNr = 0;
public async Task<FooResponse> Send(FooRequest fooRequest)
{
var tmp = Interlocked.Increment(ref sequenceNr);
fooRequest.SequenceNr = tmp;
await Task.Delay(1000); // Faking some arbitrary delay
return new FooResponse() { SequenceNr = tmp };
}
}
Then I would use an ActionBlock from the TPL Dataflow library in order to coordinate the process of sending a massive number of requests through the protocol, by handling the concurrency, the backpreasure (BoundedCapacity), the cancellation (if needed), the error-handling, and the status of the whole operation (running, completed, failed etc). Example:
private static async Task DoSomeWorkAsync()
{
var protocol = new FooProtocol();
var actionBlock = new ActionBlock<FooRequest>(async request =>
{
var resp = await protocol.Send(request);
Console.WriteLine($"Request {resp.SequenceNr} Status: {resp.Status}");
}, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 20,
BoundedCapacity = 100
});
await protocol.Connect("127.0.0.1", 1234);
foreach (var i in Enumerable.Range(0, Int32.MaxValue))
{
var request = new FooRequest();
request.Payload = "Request Nr " + i;
var accepted = await actionBlock.SendAsync(request);
if (!accepted) break; // The block has failed irrecoverably
}
actionBlock.Complete();
await actionBlock.Completion; // Propagate any exceptions
}
The BoundedCapacity = 100 configuration means that the ActionBlock will store in its internal buffer at most 100 requests. When this threshold is reached, anyone who wants to send more requests to it will have to wait. The awaiting will happen in the await actionBlock.SendAsync line.

Parallel.ForEach faster than Task.WaitAll for I/O bound tasks?

I have two versions of my program that submit ~3000 HTTP GET requests to a web server.
The first version is based off of what I read here. That solution makes sense to me because making web requests is I/O bound work, and the use of async/await along with Task.WhenAll or Task.WaitAll means that you can submit 100 requests all at once and then wait for them all to finish before submitting the next 100 requests so that you don't bog down the web server. I was surprised to see that this version completed all of the work in ~12 minutes - way slower than I expected.
The second version submits all 3000 HTTP GET requests inside a Parallel.ForEach loop. I use .Result to wait for each request to finish before the rest of the logic within that iteration of the loop can execute. I thought that this would be a far less efficient solution, since using threads to perform tasks in parallel is usually better suited for performing CPU bound work, but I was surprised to see that the this version completed all of the work within ~3 minutes!
My question is why is the Parallel.ForEach version faster? This came as an extra surprise because when I applied the same two techniques against a different API/web server, version 1 of my code was actually faster than version 2 by about 6 minutes - which is what I expected. Could performance of the two different versions have something to do with how the web server handles the traffic?
You can see a simplified version of my code below:
private async Task<ObjectDetails> TryDeserializeResponse(HttpResponseMessage response)
{
try
{
using (Stream stream = await response.Content.ReadAsStreamAsync())
using (StreamReader readStream = new StreamReader(stream, Encoding.UTF8))
using (JsonTextReader jsonTextReader = new JsonTextReader(readStream))
{
JsonSerializer serializer = new JsonSerializer();
ObjectDetails objectDetails = serializer.Deserialize<ObjectDetails>(
jsonTextReader);
return objectDetails;
}
}
catch (Exception e)
{
// Log exception
return null;
}
}
private async Task<HttpResponseMessage> TryGetResponse(string urlStr)
{
try
{
HttpResponseMessage response = await httpClient.GetAsync(urlStr)
.ConfigureAwait(false);
if (response.StatusCode != HttpStatusCode.OK)
{
throw new WebException("Response code is "
+ response.StatusCode.ToString() + "... not 200 OK.");
}
return response;
}
catch (Exception e)
{
// Log exception
return null;
}
}
private async Task<ListOfObjects> GetObjectDetailsAsync(string baseUrl, int id)
{
string urlStr = baseUrl + #"objects/id/" + id + "/details";
HttpResponseMessage response = await TryGetResponse(urlStr);
ObjectDetails objectDetails = await TryDeserializeResponse(response);
return objectDetails;
}
// With ~3000 objects to retrieve, this code will create 100 API calls
// in parallel, wait for all 100 to finish, and then repeat that process
// ~30 times. In other words, there will be ~30 batches of 100 parallel
// API calls.
private Dictionary<int, Task<ObjectDetails>> GetAllObjectDetailsInBatches(
string baseUrl, Dictionary<int, MyObject> incompleteObjects)
{
int batchSize = 100;
int numberOfBatches = (int)Math.Ceiling(
(double)incompleteObjects.Count / batchSize);
Dictionary<int, Task<ObjectDetails>> objectTaskDict
= new Dictionary<int, Task<ObjectDetails>>(incompleteObjects.Count);
var orderedIncompleteObjects = incompleteObjects.OrderBy(pair => pair.Key);
for (int i = 0; i < 1; i++)
{
var batchOfObjects = orderedIncompleteObjects.Skip(i * batchSize)
.Take(batchSize);
var batchObjectsTaskList = batchOfObjects.Select(
pair => GetObjectDetailsAsync(baseUrl, pair.Key));
Task.WaitAll(batchObjectsTaskList.ToArray());
foreach (var objTask in batchObjectsTaskList)
objectTaskDict.Add(objTask.Result.id, objTask);
}
return objectTaskDict;
}
public void GetObjectsVersion1()
{
string baseUrl = #"https://mywebserver.com:/api";
// GetIncompleteObjects is not shown, but it is not relevant to
// the question
Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();
Dictionary<int, Task<ObjectDetails>> objectTaskDict
= GetAllObjectDetailsInBatches(baseUrl, incompleteObjects);
foreach (KeyValuePair<int, MyObject> pair in incompleteObjects)
{
ObjectDetails objectDetails = objectTaskDict[pair.Key].Result
.objectDetails;
// Code here that copies fields from objectDetails to pair.Value
// (the incompleteObject)
AllObjects.Add(pair.Value);
};
}
public void GetObjectsVersion2()
{
string baseUrl = #"https://mywebserver.com:/api";
// GetIncompleteObjects is not shown, but it is not relevant to
// the question
Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();
Parallel.ForEach(incompleteHosts, pair =>
{
ObjectDetails objectDetails = GetObjectDetailsAsync(
baseUrl, pair.Key).Result.objectDetails;
// Code here that copies fields from objectDetails to pair.Value
// (the incompleteObject)
AllObjects.Add(pair.Value);
});
}
A possible reason why Parallel.ForEach may run faster is because it creates the side-effect of throttling. Initially x threads are processing the first x elements (where x in the number of the available cores), and progressively more threads may be added depending on internal heuristics. Throttling IO operations is a good thing because it protects the network and the server that handles the requests from becoming overburdened. Your alternative improvised method of throttling, by making requests in batches of 100, is far from ideal for many reasons, one of them being that 100 concurrent requests are a lot of requests! Another one is that a single long running operation may delay the completion of the batch until long after the completion of the other 99 operations.
Note that Parallel.ForEach is also not ideal for parallelizing IO operations. It just happened to perform better than the alternative, wasting memory all along. For better approaches look here: How to limit the amount of concurrent async I/O operations?
https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netframework-4.8
Basically the parralel foreach allows iterations to run in parallel so you are not constraining the iteration to run in serial, on a host that is not thread constrained this will tend to lead to improved throughput
In short:
Parallel.Foreach() is most useful for CPU bound tasks.
Task.WaitAll() is more useful for IO bound tasks.
So in your case, you are getting information from webservers, which is IO. If the async methods are implemented correctly, it won't block any thread. (It will use IO Completion ports to wait on) This way the threads can do other stuff.
By running the async methods GetObjectDetailsAsync(baseUrl, pair.Key).Result synchroniced, it will block a thread. So the threadpool will be flood by waiting threads.
So I think the Task solution will have a better fit.

Is it safe to create thousands of web request tasks and await them with Task.WhenAll()

I have a Web API method that takes in a list of strings, performs a web request for each of those strings, and compiles all of the data into a list for returning.
The input list can be variable length, up into the thousands. Do I need to manually limit the number of concurrent tasks, by batching them into groups, or is it safe to create thousands of tasks and await them with Task.WhenAll()? Here is a snippet of what I am using now:
public async Task<List<Customer>> GetDashboard(List<string> customerIds)
{
HttpClient client = new HttpClient();
var tasks = new List<Task>();
var customers = new List<Customer>();
foreach (var customerId in customerIds)
{
string customerIdCopy = customerId;
tasks.Add(client.GetStringAsync("http://testurl.com/" + customerId)
.ContinueWith(t => {
customers.Add(new Customer { Id = customerIdCopy, Data = t.Result });
}));
}
await Task.WhenAll(tasks);
return customers;
}
HttpClient can perform concurrent requests efficiently, with the caveat that it limits the number of concurrent requests to a single server.
If your requests are all going to the same site, the excess requests will be put into a queue. When requests are in this queue, the request timeout is ticking down... before it ever tries to connect to the server. So, manage that carefully, and if appropriate maybe even turn the timeout off.
Beyond this, it is perfectly fine to launch thousands of requests at once.
If you think that'll affect you, you can use a SemaphoreSlim or maybe TPL Dataflow to limit the number of concurrent requests.
The first thing which comes to mind is to delegate all this "multithreading performance" part to TPL. Use async in your requests instead of manually created tasks and ContinueWith.
It will also make C# take care about thread performance.
private async Task<Customer> GetDashboardAsync(string customerId)
{
using (var httpClient = new HttpClient())
{
string data = await httpClient.GetStringAsync("http://testurl.com/" + id);
return new Customer { Id = id, Data = data });
}
}
public async Task<Customer[]> GetDashboardAsync(List<string> customerIds)
{
var tasks = customerIds
.Select(GetDashboardAsync)
.ToArray();
return await Task.WhenAll(tasks);
}
Do I need to manually limit the number of concurrent tasks, by batching them into groups, or is it safe to create thousands of tasks and await them with Task.WhenAll()?
If you just create tasks using a List, foreach and ContinueWith, then it can cause performance drop due to the excessive number of tasks.
However, if you use async/await in your code as above, then you don't need to bother about it. TPL will use asynchronous tasks and continuations to provide the best performance.
You can just try this out to make sure that it works :)

Best practice for task/await in a foreach loop

I have some time consuming code in a foreach that uses task/await.
it includes pulling data from the database, generating html, POSTing that to an API, and saving the replies to the DB.
A mock-up looks like this
List<label> labels = db.labels.ToList();
foreach (var x in list)
{
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid ==y.userid))
.Select(y => y.ID)
.Contains(q.id))
//Render the HTML
//do some fast stuff with objects
List<response> res = await api.sendMessage(object); //POST
//put all the responses in the db
foreach (var r in res)
{
db.responses.add(r);
}
db.SaveChanges();
}
Time wise, generating the Html and posting it to the API seem to be taking most of the time.
Ideally it would be great if I could generate the HTML for the next item, and wait for the post to finish, before posting the next item.
Other ideas are also welcome.
How would one go about this?
I first thought of adding a Task above the foreach and wait for that to finish before making the next POST, but then how do I process the last loop... it feels messy...
You can do it in parallel but you will need different context in each Task.
Entity framework is not thread safe, so if you can't use one context in parallel tasks.
var tasks = myLabels.Select( async label=>{
using(var db = new MyDbContext ()){
// do processing...
var response = await api.getresponse();
db.Responses.Add(response);
await db.SaveChangesAsync();
}
});
await Task.WhenAll(tasks);
In this case, all tasks will appear to run in parallel, and each task will have its own context.
If you don't create new Context per task, you will get error mentioned on this question Does Entity Framework support parallel async queries?
It's more an architecture problem than a code issue here, imo.
You could split your work into two separate parts:
Get data from database and generate HTML
Send API request and save response to database
You could run them both in parallel, and use a queue to coordinate that: whenever your HTML is ready it's added to a queue and another worker proceeds from there, taking that HTML and sending to the API.
Both parts can be done in multithreaded way too, e.g. you can process multiple items from the queue at the same time by having a set of workers looking for items to be processed in the queue.
This screams for the producer / consumer pattern: one producer produces data in a speed different than the consumer consumes it. Once the producer does not have anything to produce anymore it notifies the consumer that no data is expected anymore.
MSDN has a nice example of this pattern where several dataflowblocks are chained together: the output of one block is the input of another block.
Walkthrough: Creating a Dataflow Pipeline
The idea is as follows:
Create a class that will generate the HTML.
This class has an object of class System.Threading.Tasks.Dataflow.BufferBlock<T>
An async procedure creates all HTML output and await SendAsync the data to the bufferBlock
The buffer block implements interface ISourceBlock<T>. The class exposes this as a get property:
The code:
class MyProducer<T>
{
private System.Threading.Tasks.Dataflow.BufferBlock<T> bufferBlock = new BufferBlock<T>();
public ISourceBlock<T> Output {get {return this.bufferBlock;}
public async ProcessAsync()
{
while (somethingToProduce)
{
T producedData = ProduceOutput(...)
await this.bufferBlock.SendAsync(producedData);
}
// no date to send anymore. Mark the output complete:
this.bufferBlock.Complete()
}
}
A second class takes this ISourceBlock. It will wait at this source block until data arrives and processes it.
do this in an async function
stop when no more data is available
The code:
public class MyConsumer<T>
{
ISourceBlock<T> Source {get; set;}
public async Task ProcessAsync()
{
while (await this.Source.OutputAvailableAsync())
{ // there is input of type T, read it:
var input = await this.Source.ReceiveAsync();
// process input
}
// if here, no more input expected. finish.
}
}
Now put it together:
private async Task ProduceOutput<T>()
{
var producer = new MyProducer<T>();
var consumer = new MyConsumer<T>() {Source = producer.Output};
var producerTask = Task.Run( () => producer.ProcessAsync());
var consumerTask = Task.Run( () => consumer.ProcessAsync());
// while both tasks are working you can do other things.
// wait until both tasks are finished:
await Task.WhenAll(new Task[] {producerTask, consumerTask});
}
For simplicity I've left out exception handling and cancellation. StackOverFlow has artibles about exception handling and cancellation of Tasks:
Keep UI responsive using Tasks, Handle AggregateException
Cancel an Async Task or a List of Tasks
This is what I ended up using: (https://stackoverflow.com/a/25877042/275990)
List<ToSend> sendToAPI = new List<ToSend>();
List<label> labels = db.labels.ToList();
foreach (var x in list) {
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid ==y.userid))
.Select(y => y.ID)
.Contains(q.id))
//Render the HTML
//do some fast stuff with objects
sendToAPI.add(the object with HTML);
}
int maxParallelPOSTs=5;
await TaskHelper.ForEachAsync(sendToAPI, maxParallelPOSTs, async i => {
using (NasContext db2 = new NasContext()) {
List<response> res = await api.sendMessage(i.object); //POST
//put all the responses in the db
foreach (var r in res)
{
db2.responses.add(r);
}
db2.SaveChanges();
}
});
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body) {
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext()) {
await body(partition.Current).ContinueWith(t => {
if (t.Exception != null) {
string problem = t.Exception.ToString();
}
//observe exceptions
});
}
}));
}
basically lets me generate the HTML sync, which is fine, since it only takes a few seconds to generate 1000's but lets me post and save to DB async, with as many threads as I predefine. In this case I'm posting to the Mandrill API, parallel posts are no problem.

Categories

Resources