How to limit requests per second in an async task in C#

I'm writing an application that interacts with Azure Cosmos DB. I need to commit 30,000 records to Cosmos DB in a single session. Because I'm on .NET Core I cannot use the BulkInsert DLL, so I insert into Cosmos DB with a foreach loop. But this issues too many requests per second and exceeds the RU limit set by Cosmos DB.
foreach (var item in listNeedInsert)
{
    await RequestInsertToCosmosDB(item);
}
I want to pause the foreach loop once the number of in-flight requests reaches 100; after those 100 requests are done, the loop should continue.

You can partition the list and await the results:
var tasks = new List<Task>();
foreach (var item in listNeedInsert)
{
    var task = RequestInsertToCosmosDB(item);
    tasks.Add(task);
    if (tasks.Count == 100)
    {
        await Task.WhenAll(tasks);
        tasks.Clear();
    }
}
// Wait for anything left to finish
await Task.WhenAll(tasks);
Every time 100 tasks are in flight, the code waits for all of them to finish before starting the next batch.

You could add a delay every hundred iterations:
int i = 1;
foreach (var item in listNeedInsert)
{
    await RequestInsertToCosmosDB(item);
    if (i % 100 == 0)
    {
        i = 0;
        await Task.Delay(100); // milliseconds
    }
    i++;
}

If you really want to maximize efficiency and can't do bulk updates, look into using SemaphoreSlim as described in this post:
Throttling asynchronous tasks
Hammering a medium-sized database with 100 concurrent requests at a time isn't a great idea because it isn't equipped for that kind of throughput. You could experiment with different throttling values to see what's optimal, but I'd guess the sweet spot is in the single-digit range.
If you want something quick and dirty, you could probably use Sean's solution, but I'd start with a task count of 5, not 100.
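For reference, a minimal sketch of that SemaphoreSlim approach applied to the original loop, throttled to 5 concurrent requests (RequestInsertToCosmosDB and listNeedInsert come from the question; everything else is an assumption):
// Sketch: limit concurrent inserts with SemaphoreSlim.
var throttler = new SemaphoreSlim(initialCount: 5); // single-digit concurrency
var tasks = new List<Task>();

foreach (var item in listNeedInsert)
{
    await throttler.WaitAsync(); // wait for a free slot before starting another insert
    tasks.Add(Task.Run(async () =>
    {
        try
        {
            await RequestInsertToCosmosDB(item);
        }
        finally
        {
            throttler.Release(); // free the slot even if the insert throws
        }
    }));
}

await Task.WhenAll(tasks);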

https://github.com/thomhurst/EnumerableAsyncProcessor
I've written a library to help with this sort of logic.
Usage would be:
await AsyncProcessorBuilder.WithItems(listNeedInsert) // Or extension method: listNeedInsert.ToAsyncProcessorBuilder()
    .ForEachAsync(item => RequestInsertToCosmosDB(item), CancellationToken.None)
    .ProcessInBatches(batchSize: 100);

Related

Multithreading/Concurrent strategy for a network based task

I'm not an expert at making the best use of resources, so I'm looking for the best approach to a task that needs to be done in parallel and efficiently.
We have a scenario where we have to ping millions of systems and receive a response. The response itself takes no time to compute, but the task is network-bound.
My current implementation looks like this:
Parallel.ForEach(list, ip =>
{
    try
    {
        // var record = client.QueryAsync(ip);
        var record = client.Query(ip);
        results.Add(record);
    }
    catch (Exception)
    {
        failed.Add(ip);
    }
});
I tested this code with:
100 items: about 4 seconds
1k items: about 10 seconds
10k items: about 80 seconds
100k items: about 710 seconds
I need to process close to 20M queries. What strategy should I use to speed this up further?
Here is the problem:
Parallel.ForEach uses the thread pool. Moreover, IO-bound operations will block those threads while they wait for a device to respond, tying up resources.
If you have CPU-bound code, parallelism is appropriate;
if you have IO-bound code, asynchrony is appropriate.
In this case, client.Query is clearly I/O, so the ideal consuming code would be asynchronous.
Since you said there is an async version, you are best off using the async/await pattern and/or some kind of limit on concurrent tasks. Another neat solution is the ActionBlock class in the TPL Dataflow library.
Dataflow example
public static async Task DoWorkLoads(List<IPAddress> addresses)
{
    var options = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 50
    };
    var block = new ActionBlock<IPAddress>(MyMethodAsync, options);

    foreach (var ip in addresses)
        block.Post(ip);

    block.Complete();
    await block.Completion;
}
...
public async Task MyMethodAsync(IPAddress ip)
{
    try
    {
        var record = await client.QueryAsync(ip);
        // note this is not thread safe, best to lock it
        results.Add(record);
    }
    catch (Exception)
    {
        // note this is not thread safe, best to lock it
        failed.Add(ip);
    }
}
This approach gives you asynchrony and MaxDegreeOfParallelism; it doesn't waste resources and lets IO be IO without chewing up threads unnecessarily.
*Disclaimer: Dataflow may not be where you want to end up; I just thought I'd give you some more information.
Demo here
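As a side note on the thread-safety comments in the example above, a concurrent collection avoids explicit locking. A rough sketch (QueryRecord is a placeholder for whatever the query actually returns):
// Sketch: thread-safe collections instead of a lock around List<T>.
private readonly ConcurrentBag<QueryRecord> results = new ConcurrentBag<QueryRecord>(); // QueryRecord is a placeholder type
private readonly ConcurrentBag<IPAddress> failed = new ConcurrentBag<IPAddress>();

public async Task MyMethodAsync(IPAddress ip)
{
    try
    {
        var record = await client.QueryAsync(ip);
        results.Add(record); // ConcurrentBag<T>.Add is safe to call from many threads
    }
    catch (Exception)
    {
        failed.Add(ip);
    }
}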
Update
I just did some benchmarking with Parallel.ForEach and Dataflow, running 10,000 pings multiple times:
Parallel.ForEach = 30 seconds
Dataflow = 10 seconds

Getting rows from SQL using C# Tasks

I have around 3 million rows in my table and a console application that gets all the rows and processes them. I want to use the TPL to fetch 1,000 rows at a time and run my processing logic. I can use the following logic; inside the ProcessRowsForPage method I get the records based on the page number.
int totalRecordsCount = GetCount();
int pagecount = totalRecordsCount / 1000;
var tasks = new List<Task>();
for (int j = 0; j <= pagecount; j++)
{
    var pageNo = j;
    var t = Task.Factory.StartNew(() =>
    {
        ProcessRowsForPage(pageNo);
    });
    tasks.Add(t);
}
It may be weird, but is there a way the tasks can be created without knowing the total count? I want to use something like a do-while loop and stop creating tasks when there are no more rows to fetch.
For this kind of situation you're better off with TPL Dataflow.
For that you'll need the following components (sketched just below the list):
a SqlDataReader or some other sort of thing that can stream data from the database
a BatchBlock with BatchSize = 1000
an ActionBlock that will call ProcessRows method
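As a rough sketch, those two blocks might be declared like this (Row is a placeholder for whatever type ConvertRow produces, ProcessRows is your own processing method, and the options shown are just an assumption to tune):
// Sketch only: Row and ProcessRows are placeholders.
// A BatchBlock<Row> emits Row[] batches, so the ActionBlock consumes Row[].
var batchBlock = new BatchBlock<Row>(batchSize: 1000);
var actionBlock = new ActionBlock<Row[]>(
    batch => ProcessRows(batch),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 }); // tune as needed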
Now, to create the processing pipeline, link the blocks together:
batchBlock.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });
After that, from your dataReader Post rows to the BatchBlock:
while (reader.Read())
{
    var item = ConvertRow(reader);
    batchBlock.Post(item);
}
// When you get here you've read all the data from the database,
// so tell the pipeline that no more data is coming
batchBlock.Complete();
And that will take care of processing. If you want to be notified when the pipeline has finished processing all items, use the Completion property of the ActionBlock to get notified.
actionBlock.Completion.ContinueWith(prev => { Console.WriteLine("Finished."); });
You could do this if, instead of spawning potentially millions of tasks (which is a bad idea), you use a pool of some sort.
Create 3 (for example) tasks in an array, and start them all going.
When one task completes, if there are more rows, set it going again.
As soon as a task returns no more data, stop setting it going, wait for all tasks to complete, and then you're done.
Example:
TASK1 > GetNext100Rows(0)
TASK2 > GetNext100Rows(100)
TASK3 > GetNext100Rows(200)
If Task2 completes first, restart it:
TASK1 > GetNext100Rows(0) [Processing]
TASK2 > GetNext100Rows(300) [Processing]
TASK3 > GetNext100Rows(200) [Processing]
Keep restarting any tasks that complete, increasing the offset by 100 each time.
Finally, when a task returns no more data, wait for all remaining tasks to complete.
This requires your task to be able to return or indicate that it has no more data, for example by setting a flag variable or via its return value.
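A rough sketch of that idea using Task.WhenAny (GetNext100Rows comes from the example above and is assumed to return an empty list when there is no more data; Row and ProcessRows are placeholders):
// Sketch: a small pool of page-fetching tasks, each restarted with the next offset
// until a fetch comes back empty.
var running = new List<Task<List<Row>>>();
int nextOffset = 0;
bool moreData = true;

// Prime the pool with 3 tasks.
for (int i = 0; i < 3; i++)
{
    running.Add(GetNext100Rows(nextOffset));
    nextOffset += 100;
}

while (running.Count > 0)
{
    var finished = await Task.WhenAny(running); // wait for whichever fetch completes first
    running.Remove(finished);

    var rows = await finished;
    if (rows.Count == 0)
    {
        moreData = false;      // this page was empty: stop scheduling new fetches
    }
    else
    {
        ProcessRows(rows);     // placeholder for your processing logic
        if (moreData)
        {
            running.Add(GetNext100Rows(nextOffset));
            nextOffset += 100;
        }
    }
}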

Dynamic Client Side Throttling inside a C# Service

I have a client application that will get a large number of jobs to run, on the order of 10k, and for each one it makes an HTTP request to a web API. Each job is semi-long-running, with unpredictable response times of 7-90 seconds.
I am trying to minimize the total time for all jobs. I notice that if I make too many requests at once, response times increase drastically because the server is essentially being DoSed, which drives the total time for all jobs way up. I have been using SemaphoreSlim to set a static degree of parallelism, but I need a way to adjust it dynamically based on current response times so that the overall time is as low as possible. Here is the code I have been using.
List<Task<DataTable>> tasks = new List<Task<DataTable>>();
SemaphoreSlim throttler = new SemaphoreSlim(40, 300);
foreach (DataRow entry in newEntries.Rows)
{
    await throttler.WaitAsync();
    tasks.Add(Task<DataTable>.Run(async () =>
    {
        try
        {
            return RunRequest(entry); // Http requests and other logic are done here
        }
        finally
        {
            throttler.Release();
        }
    }));
}
await Task.WhenAll(tasks.ToArray());
I know that throttler.Release() can be passed different numbers to increase the total number of outstanding requests allowed at one time, and that calling Wait() without Release() will subtract from the count.
I believe I need to keep some sort of rolling average of response times and use it to determine how much to increase or decrease the number of outstanding requests allowed. I am not sure if this is the right direction or not.
Question
Given the information above, how can I keep the total number of outstanding requests at a level that minimizes the total time spent on all jobs?
List<DataTable> dataTables = new List<DataTable>();
Parallel.ForEach(newEntries.AsEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = 2 }, row =>
{
    var request = RunRequest(row);
    lock (dataTables)
    {
        dataTables.Add(request);
    }
});
Now you can adjust the MaxDegreeOfParallelism. (I hadn't understood that you wanted to change this dynamically while the tasks were running.)
I'll tell you from past experience: back before the TPL, when I tried to let users change the number of running threads mid-process using a semaphore, I wanted to jump in front of a moving truck.
Create a list of unstarted tasks and start however many you want, say 5. Each task can return its duration from start to finish, which you can use to drive your throttling logic. Then just loop over the tasks, blocking with WaitAny.
var runningTasks = 5;
// for loop to start 5 tasks.
while (taskList.Count > 0)
{
    var indexer = Task.WaitAny(taskList.ToArray());
    var myTask = taskList[indexer];
    taskList.RemoveAt(indexer);
    Interlocked.Decrement(ref runningTasks);
    var durationResult = myTask.Result;
    // do logic to determine if you need to start more;
    // when you start another, use Interlocked.Increment(ref runningTasks);
}

How to limit Parallel.ForEach for asynchronous operations?

I'm writing a tool that sends queries to an Azure table; the number of queries depends on the user.
I want to send queries in parallel, but only up to a given number (I don't want to send all 100 queries at once).
Is there any built-in mechanism I can use to send, say, up to 20 queries in parallel at a time?
I know there is Parallel.ForEach, which can be limited using ParallelOptions.MaxDegreeOfParallelism,
but for asynchronous operations like mine this will just fire off all the queries very quickly and my tool will end up handling all 100 callbacks at once.
You should use SemaphoreSlim. It's especially nice for async operations because it has WaitAsync, which returns a task you can await. The first 20 will go right through, and the rest will asynchronously wait for an operation to finish so they can start.
SemaphoreSlim _semaphore = new SemaphoreSlim(20);

async Task DoSomethingAsync()
{
    await _semaphore.WaitAsync();
    try
    {
        // possibly async operations limited to 20
    }
    finally
    {
        _semaphore.Release();
    }
}
Usage (start all the calls and let the semaphore limit how many run at once):
var tasks = new List<Task>();
for (int i = 0; i < 100; i++)
{
    tasks.Add(DoSomethingAsync());
}
await Task.WhenAll(tasks);

Cancelling long-running tasks in PLINQ

I am trying to use the .NET 4.0 parallel task library to handle multiple FTS queries. If a query takes too much time, I want to cancel it and continue processing the rest.
This code doesn't stop when one query goes over the threshold. I think I'm calling it such that the cancellation and time limit apply to the whole process rather than to a single transaction. If I set the time period to be very small (300 ms), then cancellation is triggered for all search strings.
I think I'm missing something obvious. Thanks in advance for any insight.
Additionally, this still doesn't seem to stop the very long query from executing. Is this even the correct way to cancel a long-running query once it's been triggered?
Modified code:
CancellationTokenSource cts = new CancellationTokenSource();
CancellationToken token = cts.Token;

var query = searchString.Values.Select(c => myLongQuery(c)).AsParallel().AsOrdered()
    .Skip(counter * numToProcess).Take(numToProcess).WithCancellation(cts.Token);

new Thread(() =>
{
    Thread.Sleep(5000);
    cts.Cancel();
}).Start();

try
{
    List<List<Threads>> results = query.ToList();
    foreach (List<Threads> threads in results)
    {
        // does something with the data
    }
}
catch (OperationCanceledException)
{
    Console.WriteLine("query took too long");
}
PLINQ only polls the cancellation token after every so many elements. If that frequency of checks is insufficient for your application, make sure all expensive delegates in the PLINQ query regularly call cts.Token.ThrowIfCancellationRequested().
For more details, see this article: Link
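For illustration, a rough sketch of what that cooperative check could look like inside the expensive delegate (myLongQuery and Threads come from the question and the token would need to be passed in; the string parameter, GetChunks, and QueryChunk are placeholders for however the work is actually split up):
// Sketch: let the long-running work observe the token itself.
List<Threads> myLongQuery(string c, CancellationToken token)
{
    var result = new List<Threads>();
    foreach (var chunk in GetChunks(c))          // placeholder: split the work into pieces
    {
        token.ThrowIfCancellationRequested();    // cooperative cancellation point
        result.AddRange(QueryChunk(chunk));      // placeholder: run one piece of the query
    }
    return result;
}
The PLINQ query would then pass the token along, e.g. .Select(c => myLongQuery(c, cts.Token)).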
This is just a guess: isn't the problem that the query is lazy (as in normal LINQ) and so it isn't executed until later?
