I have around 3 million rows in my table and a console application that fetches and processes all of them. I want to use the TPL to fetch 1000 rows at a time and run my processing logic on them. I could use the following logic, where the ProcessRowsForPage method fetches the records for a given page number.
int totalRecordsCount = GetCount();
int pagecount = totalRecordsCount/1000;
for (int j= 0; j <= pagecount; j++)
{
var pageNo= j;
var t = Task.Factory.StartNew(() =>
{
ProcessRowsForPage(pageNo);
});
tasks.Add(t);
}
Maybe it's weird, but is there a way to create the tasks without knowing the total count? I want to use something like a do-while loop and stop creating tasks when there are no more rows to fetch.
For this kind of situation you're better off with TPL Dataflow.
For that you'll need the following components:
a SqlDataReader or some other sort of thing that can stream data from the database
a BatchBlock with BatchSize = 1000
an ActionBlock that will call ProcessRows method
Now, to create the processing pipeline, link the blocks together:
batchBlock.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });
After that, from your dataReader Post rows to the BatchBlock:
while(reader.Read())
{
var item = ConvertRow(reader);
batchBlock.Post(item);
}
// When you get here you've read all the data from the database
// tell the pipeline that no more data is coming
batchBlock.Complete();
And that will take care of processing. If you want to be notified when the pipeline has finished processing all items, use the Completion property of the ActionBlock to get notified.
actionBlock.Completion.ContinueWith(prev => { Console.WriteLine("Finished."); });
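The pieces above can be put together into one small program. This is a minimal sketch, not a definitive implementation: it is written as C# 9+ top-level statements, it assumes the System.Threading.Tasks.Dataflow package is available, and it replaces the SqlDataReader with 2,500 fake integer "rows" from Enumerable.Range. The counters and the MaxDegreeOfParallelism value of 4 are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // System.Threading.Tasks.Dataflow package

int processedRows = 0;
int batchCount = 0;

// Groups incoming rows into arrays of up to 1000 items.
var batchBlock = new BatchBlock<int>(1000);

// Processes batches, at most 4 at a time (assumed value, tune as needed).
var actionBlock = new ActionBlock<int[]>(
    batch =>
    {
        // Stand-in for ProcessRows(batch): just count what arrived.
        Interlocked.Add(ref processedRows, batch.Length);
        Interlocked.Increment(ref batchCount);
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

batchBlock.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });

// Stand-in for "while (reader.Read())": 2,500 fake rows, no total count needed.
foreach (int row in Enumerable.Range(0, 2500))
{
    batchBlock.Post(row);
}

// No more data: this also flushes the final, partially filled batch.
batchBlock.Complete();

// Wait until every batch has been processed.
await actionBlock.Completion;

Console.WriteLine($"Processed {processedRows} rows in {batchCount} batches.");
```

With 2,500 rows and a batch size of 1000, the last batch holds only 500 rows; Complete() is what pushes it through.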
You can do this without spawning potentially millions of tasks, which is a bad idea, if you use a pool of some sort.
Create 3 (for example) tasks in an array, and start them all going.
When one task completes, if there are more rows, set it going again.
As soon as a task returns no more data, stop setting it going, wait for all tasks to complete, and then you're done.
Example:
TASK1 > GetNext100Rows(0)
TASK2 > GetNext100Rows(100)
TASK3 > GetNext100Rows(200)
If Task2 completes first, restart it:
TASK1 > GetNext100Rows(0) [Processing]
TASK2 > GetNext100Rows(300) [Processing]
TASK3 > GetNext100Rows(200) [Processing]
Keep restarting any tasks that complete, and increasing it by 100 each time.
Finally, when a task returns no more data, wait for all remaining tasks to complete.
This requires your task to be able to return or indicate that it has no more data, for example by setting a flag variable or in a return object.
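The pool idea can be sketched as follows, under some assumptions: GetNextRows is a hypothetical stand-in for the database query (here it serves 1,050 fake rows and returns an empty list past the end), and instead of literally restarting tasks, each of the three workers loops and claims the next offset atomically. The effect is the same: no total count is needed, and each worker stops the first time it gets no data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int PageSize = 100;
const int WorkerCount = 3;
const int FakeTableSize = 1050; // known only to the fake data source below

int nextOffset = 0;    // offset of the next unclaimed page
int rowsProcessed = 0;

// Hypothetical stand-in for the database query; empty result means "no more rows".
List<int> GetNextRows(int offset, int count) =>
    Enumerable.Range(0, FakeTableSize).Skip(offset).Take(count).ToList();

async Task WorkerAsync()
{
    while (true)
    {
        // Atomically claim the next page so no two workers fetch the same rows.
        int offset = Interlocked.Add(ref nextOffset, PageSize) - PageSize;

        List<int> rows = await Task.Run(() => GetNextRows(offset, PageSize));
        if (rows.Count == 0)
        {
            break; // pages are claimed in increasing order, so later pages are empty too
        }

        // Stand-in for the real processing logic.
        Interlocked.Add(ref rowsProcessed, rows.Count);
    }
}

Task[] workers = Enumerable.Range(0, WorkerCount)
    .Select(_ => WorkerAsync())
    .ToArray();

await Task.WhenAll(workers);
Console.WriteLine($"Processed {rowsProcessed} rows.");
```

Each worker issues at most one query that comes back empty, so the overhead of not knowing the count is a handful of extra queries at the end.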
I'm writing an application that interacts with Azure Cosmos DB. I need to commit 30,000 records to Cosmos DB in one session. Because I'm using .NET Core, I cannot use the BulkInsert DLL, so I insert the records in a foreach loop. But this sends too many requests per second and exceeds the Cosmos DB RU limit.
foreach (var item in listNeedInsert)
{
    await RequestInsertToCosmosDB(item);
}
I want to pause the foreach loop when the number of requests reaches 100, and continue once those 100 requests are done.
You can partition the list and await the results:
var tasks = new List<Task>();
foreach (var item in listNeedInsert)
{
var task = RequestInsertToCosmosDB(item);
tasks.Add(task);
if(tasks.Count == 100)
{
await Task.WhenAll(tasks);
tasks.Clear();
}
}
// Wait for anything left to finish
await Task.WhenAll(tasks);
Every time 100 tasks are in flight, the code waits for all of them to finish before starting the next batch.
You could set a delay on every hundred iteration
int i = 1;
foreach (var item in listNeedInsert)
{
await RequestInsertToCosmosDB(item);
if (i % 100 == 0)
{
i = 0;
await Task.Delay(100); // Milliseconds
}
i++;
}
If you really want to maximize efficiency and can't do bulk updates, look into using SemaphoreSlim as described in this post:
Throttling asynchronous tasks
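As a rough illustration of SemaphoreSlim-based throttling applied to this problem, here is a minimal sketch. FakeInsertAsync is a hypothetical stand-in for RequestInsertToCosmosDB, the limit of 5 is an assumed starting point rather than a recommendation, and the peak counter exists only to show that the limit holds.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int MaxConcurrency = 5; // assumed starting point; tune against your RU budget

var throttle = new SemaphoreSlim(MaxConcurrency);
object gate = new object();
int current = 0, peak = 0, inserted = 0;

// Hypothetical stand-in for RequestInsertToCosmosDB(item).
async Task FakeInsertAsync(int item)
{
    lock (gate) { current++; peak = Math.Max(peak, current); }
    await Task.Delay(10); // simulated network round trip
    lock (gate) { current--; inserted++; }
}

async Task InsertThrottledAsync(int item)
{
    // Waits asynchronously while MaxConcurrency inserts are already in flight.
    await throttle.WaitAsync();
    try
    {
        await FakeInsertAsync(item);
    }
    finally
    {
        throttle.Release(); // always free the slot, even if the insert throws
    }
}

// Start every insert up front; the semaphore decides how many actually run at once.
await Task.WhenAll(Enumerable.Range(0, 200).Select(i => InsertThrottledAsync(i)));

Console.WriteLine($"Inserted {inserted} items, peak concurrency {peak}.");
```

Unlike waiting for a whole batch of 100 to finish, a slot is reused as soon as any single insert completes, so the request rate stays steadier.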
Hammering a medium-sized database with 100 concurrent requests at a time isn't a great idea because it's not equipped to handle that kind of throughput. You could try playing with a different throttling number and seeing what's optimal, but I would guess it's in the single digit range.
If you want to do something quick and dirty, you could probably use Sean's solution. But I'd set the Task count to 5 starting out, not 100.
https://github.com/thomhurst/EnumerableAsyncProcessor
I've written a library to help with this sort of logic.
Usage would be:
await AsyncProcessorBuilder.WithItems(listNeedInsert) // Or Extension Method: listNeedInsert.ToAsyncProcessorBuilder()
.ForEachAsync(item => RequestInsertToCosmosDB(item), CancellationToken.None)
.ProcessInBatches(batchSize: 100);
I understand a Barrier can be used to have several tasks synchronise their completion before a second phase runs.
I would like to have several tasks synchronise multiple steps like so:
state is 1;
Task1 runs and pauses waiting for state to become 2;
Task2 runs and pauses waiting for state to become 2;
Task2 is final Task and causes the state to progress to state 2;
Task1 runs and pauses waiting for state to become 3;
Task2 runs and pauses waiting for state to become 3;
Task2 is final Task and causes the state to progress to state 3;
state 3 is final state and so all tasks exit.
I know I can spin up new tasks at the end of each state, but since each task does not take too long, I want to avoid creating new tasks for each step.
I can run the above synchronously using for loops, but final state can be 100,000, and so I would like to make use of more than one thread to run the process faster as the process is CPU bound.
I have tried using a counter to keep track of the number of completed Tasks, incremented by each Task on completion. If the Task is the final Task to complete, it changes the state to the next state. All completed Tasks then wait using while (iterationState == state) await Task.Yield(); but the performance is terrible, and it seems to me a very crude way of doing it.
What is the most efficient way to get the above done? There must be an optimised tool to get this done?
I'm using Parallel.For, creating 300 tasks, and each task needs to run through up to 100,000 states. Each task running through one state completes in less than a second, and creating 300 * 100,000 tasks is a huge overhead that makes running the whole thing synchronously much faster, even if using a single thread.
So I'd like to create 300 Tasks and have these Tasks synchronise moving through the 100,000 states. Hopefully the overhead of creating only 300 tasks instead of 300 * 100,000 tasks, with the overhead of optimised synchronisation between the tasks, will run faster than when doing it synchronously on a single thread.
Each state must complete fully before the next state can be run.
So - what's the optimal synchronisation technique for this scenario? Thanks!
while (iterationState == state) await Task.Yield(); is indeed a terrible way to synchronize across your 300 tasks (and no, 300 isn't necessarily super-expensive: you'll only get a reasonable number of threads allocated).
The key problem here isn't the Parallel.For, it's synchronizing across 300 tasks to wait efficiently until each of them has completed a given phase.
The simplest and cleanest solution here would probably be to have a for loop over the stages and a Parallel.For over the bit you want parallelized:
for (int stage = 0; stage < 10000; stage++)
{
// the next line blocks until all 300 have completed
// will use thread pool threads as necessary
Parallel.For( ... 300 items to process this stage ... );
}
No extra synchronization primitives needed, no spin-waiting consuming CPU, no needless thrashing between threads trying to see if they are ready to progress.
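For comparison, the Barrier class mentioned in the question can keep a fixed set of long-lived workers in lock-step across stages. The following is a sketch under assumptions, not a benchmarked recommendation: 4 workers and 1,000 stages stand in for the question's numbers, and the per-stage "work" is a trivial assignment. The post-phase action runs once per stage, after every worker has arrived and before any of them is released.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int WorkerCount = 4;   // assumed; the question has 300 work items
const int StageCount = 1000; // assumed; the question mentions up to 100,000

long[] partial = new long[WorkerCount];
long total = 0;

// Post-phase action: executed by exactly one thread per stage while the others wait.
var barrier = new Barrier(WorkerCount, _ =>
{
    for (int w = 0; w < WorkerCount; w++)
    {
        total += partial[w];
    }
});

var workers = new Task[WorkerCount];
for (int w = 0; w < WorkerCount; w++)
{
    int id = w; // capture a per-worker copy of the loop variable
    workers[w] = Task.Factory.StartNew(() =>
    {
        for (int stage = 0; stage < StageCount; stage++)
        {
            partial[id] = stage;     // stand-in for the CPU-bound work of this stage
            barrier.SignalAndWait(); // block until all workers have finished the stage
        }
    }, TaskCreationOptions.LongRunning); // dedicated threads, so no worker waits for a pool slot
}

Task.WaitAll(workers);
barrier.Dispose();

Console.WriteLine($"Total: {total}");
```

Every participant must be running at the same time or the barrier never releases, which is why the workers are started as LongRunning. Whether this beats a plain loop of Parallel.For calls depends on how much work each stage does, so it is worth measuring both.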
I think I am understanding what you are trying to do, so here is a suggested way to handle it. Note - I am using Action as the type for the blocking collection, but you can change it to whatever would work best in your scenario.
// Shared variables
CountdownEvent workItemsCompleted = new CountdownEvent(300);
BlockingCollection<Action> workItems = new BlockingCollection<Action>();
CancellationTokenSource cancelSource = new CancellationTokenSource();
// Work Item Queue Thread
for(int i=1; i < stages; ++i)
{
workItemsCompleted.Reset(300);
for(int j=0; j < workItemsForStage[i].Count; ++j)
{
workItems.Add(() => {}); // Add your work item here
}
workItemsCompleted.Wait(token); // token should be passed in from cancelSource.Token
}
// Worker threads that are making use of the queue
// token should be passed to the threads from cancelSource.Token
while(!token.IsCancellationRequested)
{
var item = workItems.Take(token); // Blocks until available item or token is cancelled
item();
workItemsCompleted.Signal();
}
You can use cancelSource from your main thread to cancel the running operations if you need to. In your worker threads you would then need to handle the OperationCanceledException. With this setup you can launch as many worker threads as you need and easily benchmark where you get the best performance (maybe it is with only 10 worker threads, etc.). Just launch as many workers as you want and then queue up the work items in the work item queue thread. It's basically a producer-consumer model, except that the producer queues up one phase of the work, blocks until that phase is done, and then queues up the next round of work.
I am inside a thread updating a graph and I go into a routine that makes a measurement for 4 seconds. The routine returns a double. What I am noticing is that my graph stops showing activity for 4 seconds until I am done collecting data. I need to start a new thread and put the GetTXPower() activity in the background. In other words, I want GetTXPower() and the graph charting to run in parallel. Any suggestions?
here is my code:
stopwatch.Start();
// Get Tx Power reading and save the Data
_specAn_y = GetTXPower();
_pa_Value = paData.Start;
DataPoint.Measurement = _specAn_y;
//Thread.Sleep(50);
double remaining = 0;
do
{
charting.stuff
}
uxChart.Update();
I suggest looking into Task Parallel Library.
Starting with the .NET Framework 4, the TPL is the preferred way to write multithreaded and parallel code.
Since you also need the result back from GetTXPower, I would use a Task<double> for it.
Task<double> task = Task.Factory.StartNew<double>(GetTXPower);
Depending on when you need the result, you can query if the task has completed by checking task.IsCompleted or alternatively block the thread and wait for the task to finish by calling task.Wait(). You can fetch the result through task.Result property.
An alternative would be to add a continuation to the initial task:
Task.Factory.StartNew<double>(GetTXPower).ContinueWith(task =>
{
// Do something with the task result.
});
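To make the idea concrete, here is a minimal console sketch of running the measurement in the background while the "chart" keeps updating. The GetTXPower below is a hypothetical stand-in (it sleeps for half a second and returns a made-up value), and the chartUpdates counter stands in for uxChart.Update(). In a real UI application the same pattern would typically live in an async event handler.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real 4-second measurement (shortened here).
double GetTXPower()
{
    Thread.Sleep(500);
    return -12.5; // made-up reading
}

int chartUpdates = 0;

// Start the measurement on a thread-pool thread; do not wait for it yet.
Task<double> measurement = Task.Run(() => GetTXPower());

// Keep "charting" while the measurement is still running.
while (!measurement.IsCompleted)
{
    chartUpdates++;       // stand-in for uxChart.Update()
    await Task.Delay(50); // yield instead of blocking the thread
}

// The task has already finished, so this returns the result without waiting.
double reading = await measurement;

Console.WriteLine($"Reading: {reading}, chart updates while measuring: {chartUpdates}");
```

The exact number of chart updates depends on timing; the point is that it is greater than zero, meaning the loop kept running during the measurement.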
Make a void method (I'll call it MeasurementMethod) that collects the data. Then create a Thread using the following code:
Thread MeasurementThread = new Thread(new ThreadStart(MeasurementMethod));
You can then run the thread with
MeasurementThread.Start();
And if your Thread has something like this:
while(true){
//Run your code here
Thread.Sleep(100);
}
Then you can just start it at the beginning, and it will just keep collecting data.
So, you would have your main thread that would update the chart, and you would start the thread that would get the data on the side.
I have a client application that will get a large number of jobs to run, on the order of 10k, and each one will make an HTTP request to a web API. Each job is semi-long-running with unpredictable response times of 7-90s.
I am trying to minimize the total time for all jobs. I notice that if I make too many requests at once, response times increase drastically because the server is essentially being DoSed. This brings the total for all jobs way up. I have been using SemaphoreSlim to statically set the degree of parallelism, but I need a way to adjust it dynamically based on current response times to get the lowest response times overall. Here is the code I have been using.
List<Task<DataTable>> tasks = new List<Task<DataTable>>();
SemaphoreSlim throttler = new SemaphoreSlim(40, 300);
foreach (DataRow entry in newEntries.Rows)
{
await throttler.WaitAsync();
tasks.Add(Task<DataTable>.Run(async () =>
{
try
{
return RunRequest(entry); //Http requests and other logic is done here
}
finally
{
throttler.Release();
}
}
));
}
await Task.WhenAll(tasks.ToArray());
I know that throttler.Release(); can be passed different numbers to increase the total number of outstanding request at one time and calling Wait() without Release() will subtract from the count.
I believe that I need to keep some sort of rolling average of response times, and somehow use that rolling average to determine how much to increase or decrease the number of outstanding requests allowed. I am not sure whether this is the right direction.
Question
Given the information above, how can I keep the total number of outstanding requests at a level to have the minimum time spent for all jobs.
List<DataTable> dataTables = new List<DataTable>();
Parallel.ForEach(newEntries.AsEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = 2 }, row => {
var request = RunRequest(row);
lock(dataTables)
{
dataTables.Add(request);
}
});
Now you can adjust MaxDegreeOfParallelism. (I hadn't understood that you wanted to change this dynamically while the tasks were running.)
I'll tell you from past experience when trying to allow users to change the amount of running threads when they are in process using a Semaphore, I wanted to jump in front of a moving truck. This was back before TPL.
Create a list of unstarted tasks. Start the number of tasks that you want, say 5 to begin with. Each task can return its duration from start to finish, so you can use it to define your throttle logic. Now just loop over the tasks, using Task.WaitAny to block.
var runningTasks = 5;
//for loop to start 5 tasks.
while (taskList.Count > 0)
{
    // Task.WaitAny takes an array, so copy the list on each pass
    var indexer = Task.WaitAny(taskList.ToArray());
    var myTask = taskList[indexer];
    taskList.RemoveAt(indexer);
    Interlocked.Decrement(ref runningTasks);
    var durationResult = myTask.Result; // Result is a property, not a method
    //do logic to determine if you need to start more.
    //when you start another use Interlocked.Increment(ref runningTasks);
}
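One way to flesh out the loop above is to keep a moving average of response times and adjust the concurrency limit after each completion. The sketch below is only an illustration under stated assumptions: RunRequestAsync is a hypothetical stand-in whose latency grows with the number of requests in flight, the 50 ms target and the limits of 1 and 40 are made-up values, and the policy (add one slot while the average is under the target, cut the limit by a quarter when it is over) is a simple additive-increase/multiplicative-decrease rule swapped in for whatever throttle logic suits the real server. It uses Task.WhenAny so the loop does not block a thread.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int MinLimit = 1, MaxLimit = 40; // assumed bounds
const double TargetMs = 50;            // assumed acceptable average latency

int inFlight = 0;

// Hypothetical stand-in for RunRequest: the more requests in flight, the slower each one is.
async Task<TimeSpan> RunRequestAsync(int job)
{
    int load = Interlocked.Increment(ref inFlight);
    var timer = Stopwatch.StartNew();
    await Task.Delay(5 * load);
    Interlocked.Decrement(ref inFlight);
    return timer.Elapsed;
}

var pending = new Queue<int>(Enumerable.Range(0, 200));
var running = new List<Task<TimeSpan>>();
int limit = 5;       // current concurrency limit (assumed starting value)
double average = 0;  // exponential moving average of latency, in ms
int completed = 0;

while (pending.Count > 0 || running.Count > 0)
{
    // Top up to the current limit.
    while (running.Count < limit && pending.Count > 0)
    {
        running.Add(RunRequestAsync(pending.Dequeue()));
    }

    // Wait for whichever request finishes first.
    Task<TimeSpan> finished = await Task.WhenAny(running);
    running.Remove(finished);
    completed++;

    double ms = (await finished).TotalMilliseconds;
    average = completed == 1 ? ms : 0.8 * average + 0.2 * ms;

    // Additive increase while latency is healthy, multiplicative decrease when it is not.
    limit = average < TargetMs
        ? Math.Min(MaxLimit, limit + 1)
        : Math.Max(MinLimit, limit * 3 / 4);
}

Console.WriteLine($"Completed {completed} jobs; final limit {limit}, average {average:F0} ms.");
```

Lowering the limit does not cancel anything already running; it only stops new requests from starting until enough of the current ones have finished.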
I'm writing a tool that sends queries to an Azure table; the number of queries depends on the user.
I want to send queries in parallel, but only up to a given number (I don't want to send all 100 queries at once).
Is there any built-in mechanism I can use to send, say, up to 20 queries in parallel at a time?
I know there is Parallel.ForEach, which can be limited using ParallelOptions.MaxDegreeOfParallelism,
but for asynchronous operations like mine this will just send all the queries really fast, and my tool will handle all 100 callbacks at once.
You should use SemaphoreSlim. It's especially nice in the case of async operations because it has WaitAsync that returns a task you can await on. The first 20 will go right through, and the rest will asynchronously wait for an operation to end so they can start.
SemaphoreSlim _semaphore = new SemaphoreSlim(20);
async Task DoSomethingAsync()
{
await _semaphore.WaitAsync();
try
{
// possibly async operations limited to 20
}
finally
{
_semaphore.Release();
}
}
Usage:
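var tasks = new List<Task>();
for (int i = 0; i < 100; i++)
{
    // Start each operation without awaiting it; awaiting here would run them one at a time
    tasks.Add(DoSomethingAsync());
}
await Task.WhenAll(tasks);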