Got some help here on Stack Overflow earlier this week, which resulted in going forward with a producer/consumer pattern for loading, processing, and importing large datasets into RavenDb.
Parallelization of CPU bound task continuing with IO bound
I'm now looking to throttle the number of work units that the producers prepare in advance, in order to manage memory consumption. I've implemented the throttling using a basic semaphore, but the implementation deadlocks at a certain point.
I cannot figure out what could be causing the deadlocks. Below is an excerpt of the code:
private static void LoadData<TParsedData, TData>(IDataLoader<TParsedData> dataLoader, int batchSize, Action<IndexedBatch<TData>> importProceedure, Func<IEnumerable<TParsedData>, List<TData>> processProceedure)
where TParsedData : class
where TData : class
{
Console.WriteLine(@"Loading {0}...", typeof(TData).ToString());
var batchCounter = 0;
var ist = Stopwatch.StartNew();
var throttler = new SemaphoreSlim(10);
var bc = new BlockingCollection<IndexedBatch<TData>>();
var importTask = Task.Run(() =>
{
bc.GetConsumingEnumerable()
.AsParallel()
.WithExecutionMode(ParallelExecutionMode.ForceParallelism)
//or
//.WithDegreeOfParallelism(1)
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
.ForAll(data =>
{
var st = Stopwatch.StartNew();
importProceedure(data);
Console.WriteLine(@"Batch imported {0} in {1} ms", data.Index, st.ElapsedMilliseconds);
throttler.Release();
});
});
var processTask = Task.Run(() =>
{
dataLoader.GetParsedItems()
.Partition(batchSize)
.AsParallel()
.WithDegreeOfParallelism(Environment.ProcessorCount)
//or
//.WithDegreeOfParallelism(1)
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
.ForAll(batch =>
{
throttler.Wait(); //.WaitAsync()
var batchno = ++batchCounter;
var st = Stopwatch.StartNew();
bc.Add(new IndexedBatch<TData>(batchno, processProceedure(batch)));
Console.WriteLine(@"Batch processed {0} in {1} ms", batchno, st.ElapsedMilliseconds);
});
});
processTask.Wait();
bc.CompleteAdding();
importTask.Wait();
Console.WriteLine(nl(1) + @"Loading {0} completed in {1} ms", typeof(TData).ToString(), ist.ElapsedMilliseconds);
}
public class IndexedBatch<TBatch>
where TBatch : class
{
public IndexedBatch(int index, List<TBatch> batch)
{
Index = index;
Batch = batch ?? new List<TBatch>();
}
public int Index { get; set; }
public List<TBatch> Batch { get; set; }
}
This is the call being made to LoadData:
LoadData<DataBase, Data>(
DataLoaderFactory.Create<DataBase>(datafilePath),
1024,
(data) =>
{
using (var session = Store.OpenSession())
{
foreach (var i in data.Batch)
{
session.Store(i);
d.TryAdd(i.LongId.GetHashCode(), int.Parse(i.Id.Substring(i.Id.LastIndexOf('/') + 1)));
}
session.SaveChanges();
}
},
(batch) =>
{
return batch.Select(i => new Data()
{
...
}).ToList();
}
);
Store is a RavenDb IDocumentStore. DataLoaderFactory constructs a custom parser for the given dataset.
It is hard to debug a deadlock without big arrows that say "blocks here!". A way to sidestep the problem rather than debug it: BlockingCollection can already throttle. Use the constructor that takes the int boundedCapacity argument and eliminate the semaphore. Odds are very high that this solves your deadlock.
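A minimal sketch of that suggestion, assuming the semaphore is dropped entirely and the bound of 10 is carried over from the question. `BoundedPipeline`, `Run` and `produce` are hypothetical names, not part of the original code, and the consumer body is only a stand-in for the import step.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedPipeline
{
    // Hypothetical helper: one producer (the calling thread) and one consumer
    // task sharing a bounded queue.
    public static List<TOut> Run<TIn, TOut>(
        IEnumerable<TIn> source, Func<TIn, TOut> produce, int boundedCapacity = 10)
    {
        var results = new List<TOut>();

        // Add() blocks once boundedCapacity items are waiting in the queue,
        // which throttles the producer without a separate semaphore.
        using (var queue = new BlockingCollection<TOut>(boundedCapacity))
        {
            var consumer = Task.Run(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                    results.Add(item); // stand-in for the import step
            });

            try
            {
                foreach (var input in source)
                    queue.Add(produce(input));
            }
            finally
            {
                // Without this call GetConsumingEnumerable never finishes.
                queue.CompleteAdding();
            }

            consumer.Wait();
        }

        return results;
    }
}
```

Calling `BoundedPipeline.Run(Enumerable.Range(0, 100), i => i * 2)` never holds more than 10 prepared items in the queue at once.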
Can you check the number of threads you have? You have probably exhausted the thread pool due to blocking. The TPL injects more threads than ProcessorCount if it thinks your code would deadlock without them, but it can only do so up to a certain limit.
Anyway, blocking inside of TPL tasks is generally a bad idea as the built-in heuristics work best with non-blocking stuff.
Related
I have a collection of 1000 input message to process. I'm looping the input collection and starting the new task for each message to get processed.
//Assume this messages collection contains 1000 items
var messages = new List<string>();
foreach (var msg in messages)
{
Task.Factory.StartNew(() =>
{
Process(msg);
});
}
Can we guess the maximum number of messages that are processed simultaneously (assuming a normal quad-core processor), or can we limit the maximum number of messages processed at a time?
How can we ensure these messages are processed in the same sequence/order as the collection?
You could use Parallel.ForEach and rely on MaxDegreeOfParallelism instead.
Parallel.ForEach(messages, new ParallelOptions {MaxDegreeOfParallelism = 10},
msg =>
{
// logic
Process(msg);
});
SemaphoreSlim is a very good solution in this case and I highly recommend the OP try it, but @Manoj's answer has a flaw, as mentioned in the comments: the semaphore should be waited on before spawning the task, like this.
Updated answer: as @Vasyl pointed out, the semaphore may be disposed before the tasks complete, which raises an exception when Release() is called. So before exiting the using block, you must wait for all created tasks to complete.
int maxConcurrency=10;
var messages = new List<string>();
using(SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
List<Task> tasks = new List<Task>();
foreach(var msg in messages)
{
concurrencySemaphore.Wait();
var t = Task.Factory.StartNew(() =>
{
try
{
Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
tasks.Add(t);
}
Task.WaitAll(tasks.ToArray());
}
Answer to Comments
For those who want to see how the semaphore can be disposed when Task.WaitAll is omitted:
Run the code below in a console app and this exception will be raised.
System.ObjectDisposedException: 'The semaphore has been disposed.'
static void Main(string[] args)
{
int maxConcurrency = 5;
List<string> messages = Enumerable.Range(1, 15).Select(e => e.ToString()).ToList();
using (SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
List<Task> tasks = new List<Task>();
foreach (var msg in messages)
{
concurrencySemaphore.Wait();
var t = Task.Factory.StartNew(() =>
{
try
{
Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
tasks.Add(t);
}
// Task.WaitAll(tasks.ToArray());
}
Console.WriteLine("Exited using block");
Console.ReadKey();
}
private static void Process(string msg)
{
Thread.Sleep(2000);
Console.WriteLine(msg);
}
I think it would be better to use Parallel.ForEach:
Parallel.ForEach(messages,
new ParallelOptions { MaxDegreeOfParallelism = 4 },
x => Process(x)
);
where 4 is the MaxDegreeOfParallelism and x is the message being processed.
With .NET Core 3.0 and .NET 5.0, channels (System.Threading.Channels) were introduced.
The main benefit of this producer/consumer concurrency pattern is that you can also limit the input data processing to reduce resource impact.
This is especially helpful when processing millions of data records.
Instead of reading the whole dataset at once into memory, you can now consecutively query only chunks of the data and wait for the workers to process it before querying more.
Code sample with a queue capacity of 10 messages and 3 consumer tasks:
/// <exception cref="System.AggregateException">Thrown on Consumer Task exceptions.</exception>
public static async Task ProcessMessages(List<string> messages)
{
const int producerCapacity = 10, consumerTaskLimit = 3;
var channel = Channel.CreateBounded<string>(producerCapacity);
_ = Task.Run(async () =>
{
foreach (var msg in messages)
{
await channel.Writer.WriteAsync(msg);
// WriteAsync waits asynchronously when the channel is full,
// until the consumer tasks pop messages from the queue
}
channel.Writer.Complete();
// signaling the end of queue so that
// WaitToReadAsync will return false to stop the consumer tasks
});
var tokenSource = new CancellationTokenSource();
CancellationToken ct = tokenSource.Token;
var consumerTasks = Enumerable
.Range(1, consumerTaskLimit)
.Select(_ => Task.Run(async () =>
{
try
{
while (await channel.Reader.WaitToReadAsync(ct))
{
ct.ThrowIfCancellationRequested();
while (channel.Reader.TryRead(out var message))
{
await Task.Delay(500);
Console.WriteLine(message);
}
}
}
catch (OperationCanceledException) { }
catch
{
tokenSource.Cancel();
throw;
}
}))
.ToArray();
Task waitForConsumers = Task.WhenAll(consumerTasks);
try { await waitForConsumers; }
catch
{
foreach (var e in waitForConsumers.Exception.Flatten().InnerExceptions)
Console.WriteLine(e.ToString());
throw waitForConsumers.Exception.Flatten();
}
}
As pointed out by Theodor Zoulias:
On multiple consumer exceptions, the remaining tasks will continue to run and have to take the load of the killed tasks. To avoid this, I implemented a CancellationToken to stop all the remaining tasks and handle the exceptions combined in the AggregateException of waitForConsumers.Exception.
Side note:
The Task Parallel Library (TPL) might be good at automatically limiting the tasks based on your local resources. But when you are processing data remotely via RPC, it's necessary to manually limit your RPC calls to avoid filling the network/processing stack!
If your Process method is async you can't use Task.Factory.StartNew as it doesn't play well with an async delegate. Also there are some other nuances when using it (see this for example).
The proper way to do it in this case is to use Task.Run. Here's @ClearLogic's answer, modified for an async Process method.
static void Main(string[] args)
{
int maxConcurrency = 5;
List<string> messages = Enumerable.Range(1, 15).Select(e => e.ToString()).ToList();
using (SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
List<Task> tasks = new List<Task>();
foreach (var msg in messages)
{
concurrencySemaphore.Wait();
var t = Task.Run(async () =>
{
try
{
await Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
tasks.Add(t);
}
Task.WaitAll(tasks.ToArray());
}
Console.WriteLine("Exited using block");
Console.ReadKey();
}
private static async Task Process(string msg)
{
await Task.Delay(2000);
Console.WriteLine(msg);
}
You can create your own TaskScheduler and override QueueTask there.
protected virtual void QueueTask(Task task)
Then you can do anything you like.
One example here:
Limited concurrency level task scheduler (with task priority) handling wrapped tasks
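As a substitute for a hand-written scheduler (a different technique from overriding QueueTask yourself), the framework's `ConcurrentExclusiveSchedulerPair` exposes a `ConcurrentScheduler` that caps how many tasks run at once. The sketch below assumes synchronous work items; `SchedulerThrottle` and `RunLimited` are hypothetical names, and `Thread.Sleep` stands in for `Process(msg)`.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SchedulerThrottle
{
    // Runs `count` work items with at most `maxConcurrency` executing at once
    // and returns the highest concurrency that was actually observed.
    public static int RunLimited(int count, int maxConcurrency)
    {
        var pair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, maxConcurrency);
        var factory = new TaskFactory(pair.ConcurrentScheduler);

        int running = 0, peak = 0;
        var tasks = Enumerable.Range(0, count).Select(_ => factory.StartNew(() =>
        {
            int now = Interlocked.Increment(ref running);

            // Record the peak with a compare-and-swap loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen) { }

            Thread.Sleep(20); // stand-in for Process(msg)
            Interlocked.Decrement(ref running);
        })).ToArray();

        Task.WaitAll(tasks);
        return peak;
    }
}
```

The limit applies to delegates while they are executing on the scheduler; an async delegate that awaits gives its slot back during the await, so this sketch is only meant for synchronous work.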
You can simply set the maximum degree of concurrency like this:
int maxConcurrency=10;
var messages = new List<string>(); // assume this contains 1000 items
using(SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
foreach(var msg in messages)
{
Task.Factory.StartNew(() =>
{
concurrencySemaphore.Wait();
try
{
Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
}
}
If you need in-order queuing (processing might finish in any order), there is no need for a semaphore. Old-fashioned if statements work fine:
const int maxConcurrency = 5;
List<Task> tasks = new List<Task>();
foreach (var arg in args)
{
var t = Task.Run(() => { Process(arg); } );
tasks.Add(t);
if (tasks.Count >= maxConcurrency)
{
Task.WaitAny(tasks.ToArray());
// Drop finished tasks so the count reflects only those still running;
// otherwise WaitAny returns immediately on every later iteration.
tasks.RemoveAll(x => x.IsCompleted);
}
}
Task.WaitAll(tasks.ToArray());
I ran into a similar problem where I wanted to produce 5000 results while calling apis, etc. So, I ran some speed tests.
Parallel.ForEach(products.Select(x => x.KeyValue).Distinct().Take(100), id =>
{
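// Has no effect here: the options object is created and discarded.
// To apply it, pass it as the second argument to Parallel.ForEach.
new ParallelOptions { MaxDegreeOfParallelism = 100 };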
GetProductMetaData(productsMetaData, client, id).GetAwaiter().GetResult();
});
produced 100 results in 30 seconds.
Parallel.ForEach(products.Select(x => x.KeyValue).Distinct().Take(100), id =>
{
// Has no effect here either: the options object is created and discarded.
new ParallelOptions { MaxDegreeOfParallelism = 100 };
GetProductMetaData(productsMetaData, client, id);
});
Moving the GetAwaiter().GetResult() to the individual async api calls inside GetProductMetaData resulted in 14.09 seconds to produce 100 results.
foreach (var id in ids.Take(100))
{
GetProductMetaData(productsMetaData, client, id);
}
Complete non-async programming with the GetAwaiter().GetResult() in api calls resulted in 13.417 seconds.
var tasks = new List<Task>();
while (y < ids.Count())
{
foreach (var id in ids.Skip(y).Take(100))
{
tasks.Add(GetProductMetaData(productsMetaData, client, id));
}
y += 100;
Task.WhenAll(tasks).GetAwaiter().GetResult();
Console.WriteLine($"Finished {y}, {sw.Elapsed}");
}
Forming a task list and working through 100 at a time resulted in a speed of 7.36 seconds.
using (SemaphoreSlim cons = new SemaphoreSlim(10))
{
var tasks = new List<Task>();
foreach (var id in ids.Take(100))
{
cons.Wait();
var t = Task.Factory.StartNew(() =>
{
try
{
GetProductMetaData(productsMetaData, client, id);
}
finally
{
cons.Release();
}
});
tasks.Add(t);
}
Task.WaitAll(tasks.ToArray());
}
Using SemaphoreSlim resulted in 13.369 seconds, and it also took a moment to start up.
var throttler = new SemaphoreSlim(initialCount: take);
foreach (var id in ids)
{
throttler.WaitAsync().GetAwaiter().GetResult();
tasks.Add(Task.Run(async () =>
{
try
{
skip += 1;
await GetProductMetaData(productsMetaData, client, id);
if (skip % 100 == 0)
{
Console.WriteLine($"started {skip}/{count}, {sw.Elapsed}");
}
}
finally
{
throttler.Release();
}
}));
}
Using Semaphore Slim with a throttler for my async task took 6.12 seconds.
The answer for me in this specific project was to use a throttler with SemaphoreSlim. Although the while/foreach task list did sometimes beat the throttler, the throttler won 4 out of 6 times for 1000 records.
I realize I'm not using the OP's code, but I think this is important and adds to the discussion, because "how" is sometimes not the only question that should be asked, and the answer is sometimes "It depends on what you are trying to do."
Now to answer the specific questions:
How to limit the maximum number of parallel tasks in C#: I showed how to limit the number of tasks that run at a time.
Can we guess the maximum number of messages that are processed simultaneously (assuming a normal quad-core processor), or can we limit the maximum number of messages processed at a time? I cannot guess how many will be processed at a time unless I set an upper limit, but I can set an upper limit. Different computers run at different speeds depending on CPU, RAM, etc., on how many threads and cores the program has access to, and on what other programs are running on the same computer.
How can we ensure these messages are processed in the same sequence/order as the collection? If you want to process everything in a specific order, that is synchronous programming. The point of running things asynchronously is that they can proceed without a fixed order. As you can see from my code, the time difference is minimal for 100 records unless you use async code. If you need an order for part of what you are doing, use asynchronous programming up to that point, then await and do things synchronously from there. For example: task1a.start, task2a.start, then later task1a.await, task2a.await... then later task1b.start, task1b.await and task2b.start, task2b.await.
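If only the order of the results matters (not the order in which the work executes), one option is to start the tasks in input order and collect them with `Task.WhenAll`, which returns results positioned like the tasks it was given. The following is a sketch under that assumption; `OrderedResults`, `ProcessInOrderAsync` and `process` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class OrderedResults
{
    // Hypothetical helper: processes items concurrently (at most maxConcurrency
    // at a time) and returns the results in the order of the input.
    public static async Task<TOut[]> ProcessInOrderAsync<TIn, TOut>(
        IEnumerable<TIn> items, Func<TIn, Task<TOut>> process, int maxConcurrency)
    {
        using (var throttler = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                await throttler.WaitAsync();
                try { return await process(item); }
                finally { throttler.Release(); }
            }).ToList(); // ToList creates every task exactly once

            // The result array follows the order of `tasks`,
            // regardless of which task completed first.
            return await Task.WhenAll(tasks);
        }
    }
}
```

Because the semaphore is disposed only after `Task.WhenAll` has completed, no task can call `Release()` on a disposed semaphore.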
public static void RunTasks(List<NamedTask> importTaskList)
{
List<NamedTask> runningTasks = new List<NamedTask>();
try
{
foreach (NamedTask currentTask in importTaskList)
{
currentTask.Start();
runningTasks.Add(currentTask);
if (runningTasks.Where(x => x.Status == TaskStatus.Running).Count() >= MaxCountImportThread)
{
Task.WaitAny(runningTasks.ToArray());
}
}
Task.WaitAll(runningTasks.ToArray());
}
catch (Exception ex)
{
Log.Fatal("ERROR!", ex);
}
}
You can use a BlockingCollection. Once the collection's capacity limit has been reached, the producer stops producing until the consumer takes an item. I find this pattern easier to understand and implement than SemaphoreSlim.
int TasksLimit = 10;
BlockingCollection<Task> tasks = new BlockingCollection<Task>(new ConcurrentBag<Task>(), TasksLimit);
void ProduceAndConsume()
{
var producer = Task.Factory.StartNew(RunProducer);
var consumer = Task.Factory.StartNew(RunConsumer);
try
{
Task.WaitAll(new[] { producer, consumer });
}
catch (AggregateException ae) { }
}
void RunConsumer()
{
foreach (var task in tasks.GetConsumingEnumerable())
{
task.Start();
}
}
void RunProducer()
{
for (int i = 0; i < 1000; i++)
{
tasks.Add(new Task(() => Thread.Sleep(1000), TaskCreationOptions.AttachedToParent));
}
}
Note that RunProducer and RunConsumer are spawned as two independent tasks.
I've set a bunch of Console.WriteLines and as far as I can tell none of them are being invoked when I run the following in .NET Fiddle.
using System;
using System.Net;
using System.Linq.Expressions;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Timers;
using System.Collections.Generic;
public class Program
{
private static readonly object locker = new object();
private static readonly string pageFormat = "http://www.letsrun.com/forum/forum.php?board=1&page={0}";
public static void Main()
{
var client = new WebClient();
// Queue up the requests we are going to make
var tasks = new Queue<Task<string>>(
Enumerable
.Repeat(0,50)
.Select(i => new Task<string>(() => client.DownloadString(string.Format(pageFormat,i))))
);
// Create set of 5 tasks which will be the at most 5
// requests we wait on
var runningTasks = new HashSet<Task<string>>();
for(int i = 0; i < 5; ++i)
{
runningTasks.Add(tasks.Dequeue());
}
var timer = new System.Timers.Timer
{
AutoReset = true,
Interval = 2000
};
// On each tick, go through the tasks that are supposed
// to have started running and if they have completed
// without error then store their result and run the
// next queued task if there is one. When we run out of
// any more tasks to run or wait for, stop the ticks.
timer.Elapsed += delegate
{
lock(locker)
{
foreach(var task in runningTasks)
{
if(task.IsCompleted)
{
if(!task.IsFaulted)
{
Console.WriteLine("Got a document: {0}",
task.Result.Substring(Math.Min(30, task.Result.Length)));
runningTasks.Remove(task);
if(tasks.Any())
{
runningTasks.Add(tasks.Dequeue());
}
}
else
{
Console.WriteLine("Uh-oh, task faulted, apparently");
}
}
else if(!task.Status.Equals(TaskStatus.Running)) // task not started
{
Console.WriteLine("About to start a task.");
task.Start();
}
else
{
Console.WriteLine("Apparently a task is running.");
}
}
if(!runningTasks.Any())
{
timer.Stop();
}
}
};
}
}
I'd also appreciate advice on how I can simplify or fix any faulty logic in this. The pattern I'm trying to do is like
(1) Create a queue of N tasks
(2) Create a set of M tasks, the first M dequeued items from (1)
(3) Start the M tasks running
(4) After X seconds, check for completed tasks.
(5) For any completed task, do something with the result, remove the task from the set and replace it with another task from the queue (if any are left in the queue).
(6) Repeat (4)-(5) indefinitely.
(7) If the set has no tasks left, we're done.
but perhaps there's a better way to implement it, or perhaps there's some .NET function that easily encapsulates what I'm trying to do (web requests in parallel with a specified max degree of parallelism).
There are several issues in your code, but since you are looking for better way to implement it - you can use Parallel.For or Parallel.ForEach:
Parallel.For(0, 50, new ParallelOptions() { MaxDegreeOfParallelism = 5 }, (i) =>
{
// surround with try-catch
string result;
using (var client = new WebClient()) {
result = client.DownloadString(string.Format(pageFormat, i));
}
// do something with result
Console.WriteLine("Got a document: {0}", result.Substring(Math.Min(30, result.Length)));
});
It will execute the body in parallel (not more than 5 tasks at any given time). When one task is completed - next one is started, until they are all done, just like you want.
UPDATE. There are several ways to throttle tasks with this approach, but the most straightforward is just to sleep:
Parallel.For(0, 50, new ParallelOptions() { MaxDegreeOfParallelism = 5 },
(i) =>
{
// surround with try-catch
var watch = Stopwatch.StartNew();
string result;
using (var client = new WebClient()) {
result = client.DownloadString(string.Format(pageFormat, i));
}
// do something with result
Console.WriteLine("Got a document: {0}", result.Substring(Math.Min(30, result.Length)));
watch.Stop();
var sleep = 2000 - watch.ElapsedMilliseconds;
if (sleep > 0)
Thread.Sleep((int)sleep);
});
This isn't a direct answer to your question. I just wanted to suggest an alternative approach.
I'd recommend that you look into using Microsoft's Reactive Framework (NuGet "System.Reactive") for doing this kind of thing.
You could then do something like this:
var query =
Observable
.Range(0, 50)
.Select(i => string.Format(pageFormat, i))
.Select(u => Observable.Using(
() => new WebClient(),
wc => Observable.Start(() => new { url = u, content = wc.DownloadString(u) })))
.Merge(5);
IDisposable subscription = query.Subscribe(x =>
{
Console.WriteLine(x.url);
Console.WriteLine(x.content);
});
It's all async and the process can be stopped at any time by calling subscription.Dispose();
I have an IEnumerable<customClass> object that has roughly 10-15 entries, so not a lot, but I'm running into a System.IO.FileNotFoundException when I try and do
Parallel.Foreach(..some linq query.., object => { ...stuff....});
with the enumerable. Here is the code I have that sometimes works, other times doesn't:
IEnumerable<UserIdentifier> userIds = script.Entries.Select(x => x.UserIdentifier).Distinct();
await Task.Factory.StartNew(() =>
{
Parallel.ForEach(userIds, async userId =>
{
Stopwatch watch = new Stopwatch();
watch.Start();
_Log.InfoFormat("user identifier: {0}", userId);
await Task.Factory.StartNew(() =>
{
foreach (ScriptEntry se in script.Entries.Where(x => x.UserIdentifier.Equals(userId)))
{
// // Run the script //
_Log.InfoFormat("waiting {0}", se.Delay);
Task.Delay(se.Delay);
_Log.InfoFormat("running SelectionInformation{0}", se.SelectionInformation);
ExecuteSingleEntry(se);
_Log.InfoFormat("[====== SelectionInformation {1} ELAPSED TIME: {0} ======]", watch.Elapsed,
se.SelectionInformation.Verb);
}
});
watch.Stop();
_Log.InfoFormat("[====== TOTAL ELAPSED TIME: {0} ======]", watch.Elapsed);
});
});
When the function ExecuteSingleEntry is run, a function a few calls deep within it creates a temp directory and files. It seems to me that when I run the Parallel.ForEach, that function is getting slammed by numerous calls at once (I'm testing 5 at once currently but need to handle about 10) and isn't creating some of the files I need. But if I set a breakpoint in the file creation function and just press F5 every time it gets hit, no file-not-found exception is thrown.
So, my question is, how can I achieve running a subset of my scripts.Entries in parallel based on the user id within the script entries with a delay of 1 second between each different user id entries being started?
and a script entry is like:
UserIdentifier: 141, SelectionInformation: class of stuff, Ids: list of EntryIds, Names: list of Entry Names
And each user identifier can appear 1 or more times in the array. I want to start all the different user identifiers, more or less, at once. Then Task out the different SelectionInformation's tied to a script entry.
scripts.Entries is an array of ScriptEntry, which is as follows:
[DataMember]
public TimeSpan Delay { get; set; }
[DataMember]
public SelectionInformation Selection { get; set; }
[DataMember]
public long[] Ids { get; set; }
[DataMember]
public string Names { get; set; }
[DataMember]
public long UserIdentifier { get; set; }
I referenced: Parallel.ForEach vs Task.Factory.StartNew to obtain the
Task.Factory.StartNew(() => Parallel.Foreach({ }) ) so my UI doesn't lock up on me
There are a few principles to apply:
Prefer Task.Run over Task.Factory.StartNew. I describe on my blog why StartNew is dangerous; Run is a much safer, more modern alternative.
Don't pass an async lambda to Parallel.ForEach. It doesn't make sense, and it won't work right.
Task.Delay doesn't do anything by itself. You either have to await it or use the synchronous version (Thread.Sleep).
(In fact, in your case, the internal StartNew is meaningless; it's already parallel, and the code - running on a thread pool thread - is trying to start a new operation on a thread pool thread and immediately asynchronously await it???)
After applying these principles:
await Task.Run(() =>
{
Parallel.ForEach(userIds, userId =>
{
Stopwatch watch = new Stopwatch();
watch.Start();
_Log.InfoFormat("user identifier: {0}", userId);
foreach (ScriptEntry se in script.Entries.Where(x => x.UserIdentifier.Equals(userId)))
{
// // Run the script //
_Log.InfoFormat("waiting {0}", se.Delay);
Thread.Sleep(se.Delay);
_Log.InfoFormat("running SelectionInformation{0}", se.SelectionInformation);
ExecuteSingleEntry(se);
_Log.InfoFormat("[====== SelectionInformation {1} ELAPSED TIME: {0} ======]", watch.Elapsed,
se.SelectionInformation.Verb);
}
watch.Stop();
_Log.InfoFormat("[====== TOTAL ELAPSED TIME: {0} ======]", watch.Elapsed);
});
});
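The third principle above (a Task.Delay that is neither awaited nor waited on does not pause anything) can be checked with a small timing sketch. `DelayDemo`, `NotAwaited` and `Awaited` are hypothetical names used only for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class DelayDemo
{
    // Returns the elapsed milliseconds for a delay that is created but never awaited.
    public static long NotAwaited(int delayMs)
    {
        var sw = Stopwatch.StartNew();
        Task.Delay(delayMs); // the returned Task is discarded, so execution continues at once
        return sw.ElapsedMilliseconds;
    }

    // Returns the elapsed milliseconds when the delay is awaited.
    public static async Task<long> Awaited(int delayMs)
    {
        var sw = Stopwatch.StartNew();
        await Task.Delay(delayMs); // the method resumes only after the delay completes
        return sw.ElapsedMilliseconds;
    }
}
```

The first method returns almost immediately whatever `delayMs` is, while the second takes roughly `delayMs` milliseconds. Inside a synchronous Parallel.ForEach body, Thread.Sleep is the blocking equivalent, as used in the rewritten code above.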
I am making a bunch of asynchronous calls to Azure Table Storage. For obvious reasons, these records are not inserted in the same order in which the calls were invoked.
I am planning to introduce a ConcurrentQueue to ensure sequence. The following sample code, written as a POC, seems to achieve the desired result.
I am wondering: is this the best way to ensure that asynchronous calls are completed in sequence?
public class ProductService
{
ConcurrentQueue<string> ordersQueue = new ConcurrentQueue<string>();
//Place make calls here
public void PlaceOrder()
{
Task.Run(() =>
{
Parallel.For(0, 100, (i) =>
{
string item = "Product " + i;
ordersQueue.Enqueue(item);
Console.WriteLine("Placed Order: " + item);
Task.Delay(2000).Wait();
});
});
}
//Process calls in sequence, I am hoping concurrentQueue will be consistent.
public void Deliver()
{
Task.Run(() =>
{
while(true)
{
string productId;
ordersQueue.TryDequeue(out productId);
if (!string.IsNullOrEmpty(productId))
{
Console.WriteLine("Delivered: " + productId);
}
}
});
}
}
If you want to process records asynchronously and sequentially this sounds like a perfect fit for TPL Dataflow's ActionBlock. Simply create a block with the action to execute and post records to it. It supports async actions and keeps order:
var block = new ActionBlock<Product>(async product =>
{
await product.ExecuteAsync();
});
block.Post(new Product());
It also supports processing in parallel and bounded capacity if you need.
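A hedged sketch of those two options, assuming the System.Threading.Tasks.Dataflow NuGet package is referenced. The values 4 and 100 are arbitrary, and `BoundedBlockDemo`, `ProcessAllAsync` and `handle` are hypothetical names. Note that raising MaxDegreeOfParallelism above 1 gives up the guarantee that one item finishes before the next starts.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package System.Threading.Tasks.Dataflow

public static class BoundedBlockDemo
{
    // Feeds every item through an ActionBlock and returns how many were handled.
    public static async Task<int> ProcessAllAsync(
        IEnumerable<string> items, Func<string, Task> handle)
    {
        int processed = 0;

        var block = new ActionBlock<string>(async item =>
        {
            await handle(item);
            Interlocked.Increment(ref processed);
        },
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4, // assumed value; 1 (the default) keeps strict sequence
            BoundedCapacity = 100       // assumed value; limits how many items the block holds
        });

        foreach (var item in items)
        {
            // SendAsync waits while the block is full;
            // Post would return false instead of waiting.
            await block.SendAsync(item);
        }

        block.Complete();
        await block.Completion;
        return processed;
    }
}
```

With a bounded block, prefer SendAsync over Post on the producing side, since a rejected Post silently drops the item unless its return value is checked.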
Try using Microsoft's Reactive Framework.
This worked for me:
IObservable<Task<string>> query =
from i in Observable.Range(0, 100, Scheduler.Default)
let item = "Product " + i
select AzureAsyncCall(item);
query
.Subscribe(async x =>
{
var result = await x;
/* do something with result */
});
The AzureAsyncCall call signature I used was public Task<string> AzureAsyncCall(string x).
I dropped in a bunch of Console.WriteLine(Thread.CurrentThread.ManagedThreadId); calls to ensure I was getting the right async behaviour in my test code. It worked well.
All the calls were asynchronous and serialized one after the other.
I am using .NET to build a stock quote updater. Suppose there are X stock symbols to be updated during market hours. To keep the updating at a pace that does not exceed the data provider's limit (e.g. Yahoo Finance), I will try to limit the number of requests per second by using a mechanism similar to a thread pool. Let's say I want to allow only 5 requests/sec, which corresponds to a pool of 5 threads.
I heard about TPL and would like to use it, although I am inexperienced with it. How can I specify the number of threads in the pool that Task implicitly uses? Here is a loop to schedule the requests, where requestFunc(url) is the function that updates quotes. I would like some comments or suggestions from the experts on how to do it properly:
// X is a number much bigger than 5
List<Task> tasks = new List<Task>();
for (int i=0; i<X; i++)
{
Task t = Task.Factory.StartNew(() => { requestFunc(url); }, TaskCreationOptions.None);
t.Wait(100); //slow down 100 ms. I am not sure if this is the right thing to do
tasks.Add(t);
}
Task.WaitAll(tasks);
OK, I added an outer loop to make it run continuously. When I make some changes to @steve16351's code, it only loops once. Why?
static void Main(string[] args)
{
LimitedExecutionRateTaskScheduler scheduler = new LimitedExecutionRateTaskScheduler(5);
TaskFactory factory = new TaskFactory(scheduler);
List<string> symbolsToCheck = new List<string>() { "GOOG", "AAPL", "MSFT", "AGIO", "MNK", "SPY", "EBAY", "INTC" };
while (true)
{
List<Task> tasks = new List<Task>();
Console.WriteLine("Starting...");
foreach (string symbol in symbolsToCheck)
{
Task t = factory.StartNew(() => { write(symbol); },
CancellationToken.None, TaskCreationOptions.None, scheduler);
tasks.Add(t);
}
//Task.WhenAll(tasks);
Console.WriteLine("Ending...");
Console.Read();
}
//Console.Read();
}
public static void write (string symbol)
{
DateTime dateValue = DateTime.Now;
//Console.WriteLine("[{0:HH:mm:ss}] Doing {1}..", DateTime.Now, symbol);
Console.WriteLine("Date and Time with Milliseconds: {0} doing {1}..",
dateValue.ToString("MM/dd/yyyy hh:mm:ss.fff tt"), symbol);
}
If you want to have a flow of url requests while limiting to no more than 5 concurrent operations you should use TPL Dataflow's ActionBlock:
var block = new ActionBlock<string>(
url => requestFunc(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
foreach (var url in urls)
{
block.Post(url);
}
block.Complete();
await block.Completion;
You Post to it the urls and for each of them it would perform the request while making sure there are no more than MaxDegreeOfParallelism requests at a time.
When you are done, you can call Complete to signal the block for completion and await the Completion task to asynchronously wait until the block actually completes.
Do not worry about the number of threads; just make sure that you are not exceeding the number of requests per second. Use a single timer to signal a ManualResetEvent every 200 ms and have the tasks wait for that ManualResetEvent inside a loop.
To create a timer and make it signal the ManualResetEvent every 200 ms:
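resetEvent = new ManualResetEvent(false);
// System.Threading.Timer(callback, state, dueTime, period): start immediately, then fire every 200 ms
timer = new Timer(state => resetEvent.Set(), null, 0, 200);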
Make sure you clean up the timer (call Dispose) when you do not need it anymore.
Let the number of threads be determined by the run-time.
This would be a poor implementation if you create a single task per stock because you do not know when a stock will be updated.
So you could just put all the stocks in a list and have a single task update each stock one after another.
By giving another list of stocks to another task you could give that task a higher priority by setting its timer to every 250 ms and the low priority to every 1000 ms. That would add up to 5 times a second and the high priority list would be updated 4 times more often than the low priority.
You could use a custom task scheduler which limits the rate at which tasks can start.
In the below, tasks are queued up, and dequeued with a timer set to the frequency of your maximum allowed rate. So if 5 requests a second, the timer is set to 200ms. On the tick, a task is then dequeued and executed from those that are pending.
EDIT: In addition to the request rate, you can also extend to control the maximum number of executing threads as well.
static void Main(string[] args)
{
TaskFactory factory = new TaskFactory(new LimitedExecutionRateTaskScheduler(5, 5)); // 5 per second, 5 max executing
List<string> symbolsToCheck = new List<string>() { "GOOG", "AAPL", "MSFT" };
for (int i = 0; i < 5; i++)
symbolsToCheck.AddRange(symbolsToCheck);
foreach (string symbol in symbolsToCheck)
{
factory.StartNew(() =>
{
Console.WriteLine("[{0:HH:mm:ss}] [{1}] Doing {2}..", DateTime.Now, Thread.CurrentThread.ManagedThreadId, symbol);
Thread.Sleep(5000);
Console.WriteLine("[{0:HH:mm:ss}] [{1}] {2} is done", DateTime.Now, Thread.CurrentThread.ManagedThreadId, symbol);
});
}
Console.Read();
}
public class LimitedExecutionRateTaskScheduler : TaskScheduler
{
private ConcurrentQueue<Task> _pendingTasks = new ConcurrentQueue<Task>();
private readonly object _taskLocker = new object();
private List<Task> _executingTasks = new List<Task>();
private readonly int _maximumConcurrencyLevel = 5;
private Timer _doWork = null;
public LimitedExecutionRateTaskScheduler(double requestsPerSecond, int maximumDegreeOfParallelism)
{
_maximumConcurrencyLevel = maximumDegreeOfParallelism;
long frequency = (long)(1000.0 / requestsPerSecond);
_doWork = new Timer(ExecuteRequests, null, frequency, frequency);
}
public override int MaximumConcurrencyLevel
{
get
{
return _maximumConcurrencyLevel;
}
}
protected override bool TryDequeue(Task task)
{
return base.TryDequeue(task);
}
protected override void QueueTask(Task task)
{
_pendingTasks.Enqueue(task);
}
private void ExecuteRequests(object state)
{
Task queuedTask = null;
int currentlyExecutingTasks = 0;
lock (_taskLocker)
{
for (int i = 0; i < _executingTasks.Count; i++)
if (_executingTasks[i].IsCompleted)
_executingTasks.RemoveAt(i--);
currentlyExecutingTasks = _executingTasks.Count;
}
if (currentlyExecutingTasks == MaximumConcurrencyLevel)
return;
if (_pendingTasks.TryDequeue(out queuedTask) == false)
return; // no work to do
lock (_taskLocker)
_executingTasks.Add(queuedTask);
base.TryExecuteTask(queuedTask);
}
protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
{
return false; // not properly implemented just to complete the class
}
protected override IEnumerable<Task> GetScheduledTasks()
{
return new List<Task>(); // not properly implemented just to complete the class
}
}
You could use a while loop with a task delay to control when your requests are issued. Using an async void method to make your requests means you don't get blocked by a failing request.
Async void is fire-and-forget, which some devs don't like, but I think it would work as a possible solution in this case.
I also think erno de weerd makes a great suggestion around prioritising calls to more important stocks.
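The loop-with-delay idea can be sketched as follows. This version deliberately swaps async void for Task-returning methods so the caller can wait for completion, while still keeping one failing request from stopping the others. `PacedRequests`, `RunAsync`, `SafeRequest` and `request` are hypothetical names; `request` stands in for the question's requestFunc(url).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PacedRequests
{
    // Starts one request every `interval` without waiting for earlier ones
    // to finish, then waits for all of them.
    public static async Task RunAsync(
        IEnumerable<string> urls, Func<string, Task> request, TimeSpan interval)
    {
        var inFlight = new List<Task>();
        foreach (var url in urls)
        {
            inFlight.Add(SafeRequest(url, request));
            await Task.Delay(interval); // 200 ms gives at most about 5 starts per second
        }
        await Task.WhenAll(inFlight);
    }

    private static async Task SafeRequest(string url, Func<string, Task> request)
    {
        try
        {
            await request(url);
        }
        catch (Exception ex)
        {
            // A failing request is logged and does not stop the loop.
            Console.WriteLine("Request for {0} failed: {1}", url, ex.Message);
        }
    }
}
```

This limits the rate at which requests start, not how many are in flight at once; slow responses can still overlap.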
Thanks @steve16351! It works like this:
static void Main(string[] args)
{
LimitedExecutionRateTaskScheduler scheduler = new LimitedExecutionRateTaskScheduler(5);
TaskFactory factory = new TaskFactory(scheduler);
List<string> symbolsToCheck = new List<string>() { "GOOG", "AAPL", "MSFT", "AGIO", "MNK", "SPY", "EBAY", "INTC" };
while (true)
{
List<Task> tasks = new List<Task>();
foreach (string symbol in symbolsToCheck)
{
Task t = factory.StartNew(() =>
{
write(symbol);
}, CancellationToken.None,
TaskCreationOptions.None, scheduler);
tasks.Add(t);
}
}
}
public static void write (string symbol)
{
DateTime dateValue = DateTime.Now;
Console.WriteLine("Date and Time with Milliseconds: {0} doing {1}..",
dateValue.ToString("MM/dd/yyyy hh:mm:ss.fff tt"), symbol);
}