I'm currently working on a project where we face the challenge of processing items in parallel. So far not a big deal ;)
Now to the problem. We have a list of IDs, and we periodically (every 2 seconds) want to call a StoredProcedure for each ID.
The 2 seconds need to be tracked for each item individually, as items are added and removed at runtime.
In addition we want to configure the maximum degree of parallelism, as the DB should not be flooded with 300 threads concurrently.
An item which is being processed should not be rescheduled for processing until it has finished with the previous execution. Reason is that we want to prevent queueing up a lot of items, in case of delays on the DB.
Right now we are using a self-developed component that has a main thread, which periodically checks what items need to be scheduled for processing. Once it has the list, it drops those on a custom IOCP-based thread pool, and then uses waithandles to wait for the items to be processed. Then the next iteration starts. IOCP because of the work-stealing it provides.
I would like to replace this custom implementation with a TPL/.NET 4 version, and I would like to know how you would solve it (ideally simple and nicely readable/maintainable).
I know about this article: http://msdn.microsoft.com/en-us/library/ee789351.aspx, but it only limits the number of threads being used. That still leaves work stealing, periodically executing the items, and so on.
Ideally it will become a generic component, that can be used for all the tasks that need to be done periodically for a list of items.
any input welcome,
tia
Martin
I don't think you actually need to get down and dirty with direct TPL Tasks for this. For starters I would set up a BlockingCollection around a ConcurrentQueue (the default) with no BoundedCapacity set on the BlockingCollection to store the IDs that need to be processed.
// Setup the blocking collection somewhere when your process starts up (OnStart for a Windows service)
BlockingCollection<string> idsToProcess = new BlockingCollection<string>();
From there I would just use Parallel::ForEach on the enumeration returned from the BlockingCollection::GetConsumingEnumerable. In the ForEach call you will set up your ParallelOptions::MaxDegreeOfParallelism. Inside the body of the ForEach you will execute your stored procedure.
Now, once the stored procedure execution completes, you're saying you don't want to re-schedule the execution for at least two seconds. No problem: schedule a System.Threading.Timer whose callback simply adds the ID back to the BlockingCollection.
Parallel.ForEach(
    idsToProcess.GetConsumingEnumerable(),
    new ParallelOptions
    {
        MaxDegreeOfParallelism = 4 // read this from config
    },
    (id) =>
    {
        // ... execute sproc ...

        // Need to declare/assign this before the delegate so that we can dispose of it inside
        Timer timer = null;

        timer = new Timer(
            _ =>
            {
                // Add the id back to the collection so it will be processed again
                idsToProcess.Add(id);

                // Cleanup the timer
                timer.Dispose();
            },
            null,             // no state, the id we need is "captured" in the anonymous delegate
            2000,             // probably should read this from config
            Timeout.Infinite);
    });
Finally, when the process is shutting down you would call BlockingCollection::CompleteAdding so that the enumerable being processed will stop blocking and complete, and the Parallel::ForEach will exit. If this were a Windows service, for example, you would do this in OnStop.
// When ready to shutdown you just signal you're done adding
idsToProcess.CompleteAdding();
Update
You raised a valid concern in your comment that you might be processing a large number of IDs at any given point and fear that there would be too much overhead in a timer per ID. I would absolutely agree with that. So in the case that you are dealing with a large list of IDs concurrently, I would change from using a timer-per-ID to using another queue to hold the "sleeping" IDs, which is monitored by a single short-interval timer instead. First you'll need a ConcurrentQueue onto which to place the IDs that are asleep:
ConcurrentQueue<Tuple<string, DateTime>> sleepingIds = new ConcurrentQueue<Tuple<string, DateTime>>();
Now, I'm using a two-part Tuple here for illustration purposes, but you may want to create a more strongly typed struct for it (or at least alias it with a using statement) for better readability. The tuple has the id and a DateTime which represents when it was put on the queue.
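For illustration, a minimal sketch of such a type; the name SleepingEntry is made up here:

// a more readable stand-in for Tuple<string, DateTime>
struct SleepingEntry
{
    public readonly string Id;
    public readonly DateTime EnqueuedUtc;

    public SleepingEntry(string id, DateTime enqueuedUtc)
    {
        Id = id;
        EnqueuedUtc = enqueuedUtc;
    }
}

The using-alias route is even cheaper: using SleepingEntry = System.Tuple<string, System.DateTime>; at the top of the file.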
Now you'll also want to setup the timer that will monitor this queue:
Timer wakeSleepingIdsTimer = new Timer(
    _ =>
    {
        DateTime utcNow = DateTime.UtcNow;

        // Pull all items from the sleeping queue that have been there for at least 2 seconds.
        // TryPeek/TryDequeue is used because TakeWhile would only enumerate the queue
        // without actually removing anything from it.
        Tuple<string, DateTime> entry;
        while (sleepingIds.TryPeek(out entry)
            && (utcNow - entry.Item2).TotalSeconds >= 2
            && sleepingIds.TryDequeue(out entry))
        {
            // Add this id back to the processing queue
            idsToProcess.Add(entry.Item1);
        }
    },
    null,             // no state
    Timeout.Infinite, // created disabled; started with Change below
    100);             // wake up every 100ms, probably should read this from config
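Note that a due time of Timeout.Infinite creates the timer disabled, so presumably you'd start it once everything is wired up:

// first tick after 100ms, then every 100ms
wakeSleepingIdsTimer.Change(100, 100);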
Then you would simply change the Parallel::ForEach to do the following instead of setting up a timer for each one:
(id) =>
{
    // ... execute sproc ...

    sleepingIds.Enqueue(Tuple.Create(id, DateTime.UtcNow));
}
This is pretty similar to the approach you said you already had in your question, but does so with TPL tasks. A task just adds itself back to a list of things to schedule when it's done.
The use of locking on a plain list is fairly ugly in this example; you would probably want a better collection to hold the list of things to schedule.
// Fill the idsToSchedule
for (int id = 0; id < 5; id++)
{
    idsToSchedule.Add(Tuple.Create(DateTime.MinValue, id));
}

// LongRunning will tell TPL to create a new thread to run this on
Task.Factory.StartNew(SchedulingLoop, TaskCreationOptions.LongRunning);
That starts up the SchedulingLoop, which actually checks whether it's been two seconds since something last ran.
// Tuple of the last time an id was processed and the id of the thing to schedule
static List<Tuple<DateTime, int>> idsToSchedule = new List<Tuple<DateTime, int>>();
static int currentlyProcessing = 0;
const int ProcessingLimit = 3;

// An event loop that performs the scheduling
public static void SchedulingLoop()
{
    while (true)
    {
        lock (idsToSchedule)
        {
            DateTime currentTime = DateTime.Now;
            for (int index = idsToSchedule.Count - 1; index >= 0; index--)
            {
                var scheduleItem = idsToSchedule[index];
                var timeSincePreviousRun = (currentTime - scheduleItem.Item1).TotalSeconds;

                // start it executing in a background task
                if (timeSincePreviousRun > 2 && currentlyProcessing < ProcessingLimit)
                {
                    Interlocked.Increment(ref currentlyProcessing);

                    Console.WriteLine("Scheduling {0} after {1} seconds", scheduleItem.Item2, timeSincePreviousRun);

                    // Schedule this task to be processed
                    Task.Factory.StartNew(() =>
                    {
                        Console.WriteLine("Executing {0}", scheduleItem.Item2);

                        // simulate the time taken to call this procedure
                        Thread.Sleep(new Random((int)DateTime.Now.Ticks).Next(0, 5000) + 500);

                        lock (idsToSchedule)
                        {
                            idsToSchedule.Add(Tuple.Create(DateTime.Now, scheduleItem.Item2));
                        }

                        Console.WriteLine("Done Executing {0}", scheduleItem.Item2);
                        Interlocked.Decrement(ref currentlyProcessing);
                    });

                    // remove this from the list of things to schedule
                    idsToSchedule.RemoveAt(index);
                }
            }
        }

        Thread.Sleep(100);
    }
}
Related
I've been using Parallel.ForEach to do some time-consuming processing on collections of items. The processing is actually handled by an external command line tool and I cannot change that. However, it seems that the Parallel.ForEach will get "stuck" on a long running item from the collection. I've distilled the problem down and can show that Parallel.ForEach is, in fact, waiting for this long one to finish and not allowing any others through. I've written a console app to demonstrate the problem:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace testParallel
{
    class Program
    {
        static int inloop = 0;
        static int completed = 0;

        static void Main(string[] args)
        {
            // initialize an array of integers to hold the wait duration (in milliseconds)
            var items = Enumerable.Repeat(10, 1000).ToArray();

            // set one of the items to 10 seconds
            items[50] = 10000;

            // initialize our line for reporting status
            Console.Write(0.ToString("000") + " Threads, " + 0.ToString("000") + " completed");

            // start the loop in a task (to avoid SO answers having to do with the
            // Parallel.ForEach call, itself, not being parallel)
            var t = Task.Factory.StartNew(() => Process(items));

            // wait for the operations to complete
            t.Wait();

            // report finished
            Console.WriteLine("\nDone!");
        }

        static void Process(int[] items)
        {
            // SpinWait (not sleep or yield or anything) for the specified duration
            Parallel.ForEach(items, (msToWait) =>
            {
                // increment the counter for how many threads are in the loop right now
                System.Threading.Interlocked.Increment(ref inloop);

                // determine at what time we should stop spinning
                var e = DateTime.Now + new TimeSpan(0, 0, 0, 0, msToWait);

                // spin until the target time
                while (DateTime.Now < e) /* no body -- just a hard loop */;

                // count another completed
                System.Threading.Interlocked.Increment(ref completed);

                // we're done with this iteration
                System.Threading.Interlocked.Decrement(ref inloop);

                // report status
                Console.Write("\r" + inloop.ToString("000") + " Threads, " + completed.ToString("000") + " completed");
            });
        }
    }
}
Basically, I make an array of int to store the number of milliseconds a given operation takes. I set them all to 10 except for one, which I set to 10000 (so, 10 seconds). I kick off the Parallel.ForEach in a task and process each integer in a hard spin wait (so it shouldn't be yielding or sleeping or anything).
On each iteration, I report how many iterations are in the body of the loop right now, and how many iterations we have completed. Mostly, it goes along fine. However, toward the end (time-wise), it reports "001 Threads, 987 Completed".
My question is why doesn't it use 7 of the other cores to work on the remaining 13 "jobs"? This one long-running iteration should not keep it from processing other elements in the collection, right?
This example happens to be a fixed collection, but it could easily be set to be an enumerable. We wouldn't want to stop fetching the next item in the enumerable just because one was taking a long time.
I found the answer (or at least, an answer). It has to do with the chunk partitioning. The SO answer here got it for me. So basically, at the top of my "Process" function, if I change from this:
static void Process(int[] items)
{
Parallel.ForEach(items, (msToWait) => { ... });
}
to this
static void Process(int[] items)
{
var partitioner = Partitioner.Create(items, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, (msToWait) => { ... });
}
it grabs the work one at a time. For the more typical case of a parallel for each, where the body doesn't take more than a second, I can certainly see chunking the sets of work. In my use case, however, each body part can take anywhere from half a second to 5 hours. I certainly would not want a bunch of the 10-second variety elements to be blocked by one 5 hour element. So, in this case, the overhead of "one-at-a-time" is well worth it.
My process gets the data through HTTP requests, and it receives the data in chunks (100 records at a time). In my case I had 100,000 records,
and then I need to process that data and load it into the DB.
My current process:
GrabAllRecords()
{
    // grab all 100,000 records (i.e. 1,000 requests); this takes a big amount of time
    // load into ArrayData
}
then..
ProcessData(ArrayData)
{
}
But I need something like this:
START:
step 1:
    grab 100 records, load into arrayList
    repeat step 1 until it reaches 100,000
step 2:
    process arrayList
This screams for the producer-consumer design pattern: one producer produces something at its own pace, while one or more consumers wait until something is produced, grab the produced information and process it, possibly leading to new produced output that other consumers might process.
Microsoft has good support for this via the Microsoft TPL Dataflow NuGet package.
Implement a Producer-Consumer Dataflow Pattern
Also helpful to start: Walkthrough: Creating a Dataflow Pipeline
The producer produces output in processable units, in your case: chunks. The output will be sent to an object of class BufferBlock<T>, where T is your chunk. Code will be similar to:
public class ChunkProducer
{
    // whenever the ChunkProducer produces a chunk it is put in this buffer
    private BufferBlock<Chunk> outputBuffer = new BufferBlock<Chunk>();

    // consumers will need access to this output buffer as their source of data:
    public ISourceBlock<Chunk> OutputBuffer
    {
        get { return this.outputBuffer as ISourceBlock<Chunk>; }
    }

    public async Task ProduceAsync()
    {
        while (someThingsToProcess)
        {
            Chunk chunk = CreateChunk(...);
            await this.outputBuffer.SendAsync(chunk);
        }

        // if here: nothing to process anymore.
        // notify consumers that all output has been produced
        this.outputBuffer.Complete();
    }
}
The efficiency of this can be enhanced by creating the next chunk while the previous one is still being sent, awaiting the previous send just before sending the next chunk. This is a bit out of scope here. More info about this is available on Stack Overflow.
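A hedged sketch of that overlap, reusing the placeholder names (someThingsToProcess, CreateChunk) from the example above:

public async Task ProduceOverlappedAsync()
{
    Task<bool> pendingSend = null;
    while (someThingsToProcess)
    {
        Chunk chunk = CreateChunk(...); // build the next chunk first...
        if (pendingSend != null)
            await pendingSend;          // ...then wait for the previous send to finish
        pendingSend = this.outputBuffer.SendAsync(chunk);
    }
    if (pendingSend != null)
        await pendingSend;
    this.outputBuffer.Complete();
}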
You'll also need a ChunkConsumer. The ChunkConsumer will wait for chunks on the buffer block and process them:
public class ChunkConsumer
{
    // the chunk consumer will wait for input at this source
    private readonly ISourceBlock<Chunk> chunkSource;

    public ChunkConsumer(ISourceBlock<Chunk> chunkSource)
    {
        this.chunkSource = chunkSource;
    }

    public async Task ConsumeAsync()
    {
        // wait until there is some data in the buffer
        while (await this.chunkSource.OutputAvailableAsync())
        {
            // get the chunk and process it:
            Chunk chunk = this.chunkSource.Receive();
            ProcessChunk(chunk);
        }
        // if here: chunkSource has been completed. No more data to expect
    }
}
Put it all together:
private async Task ProcessAsync()
{
    ChunkProducer producer = new ChunkProducer();
    ChunkConsumer consumer = new ChunkConsumer(producer.OutputBuffer);

    // start a task for the consumer to consume:
    Task consumeTask = Task.Run(() => consumer.ConsumeAsync());

    // let this thread start producing, and await until it is completed
    await producer.ProduceAsync();
    // if here, I know the producer finished producing

    // wait until the consumer finished consuming:
    await consumeTask;
    // finished, all produced data is consumed.
}
Possible enhancements:
If producing is faster than consuming, consider using multiple consumers listening to the same ISourceBlock. Check TPL to see which of the BufferBlock types can handle multiple listeners
If producing is slower than consuming, consider using multiple producers producing to the same ITargetBlock. Check which type of buffer block can handle this.
Consider enabling cancellation using CancellationToken
If your chunk is not always the same number of records, consider using a batch block: the consumer gets notified once the batch has enough records to process (a sketch follows below).
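For that last point, a hedged sketch of a batch block; Record and ProcessBatch are made-up names here:

// groups individual records into arrays of 100 for the consumer
var batchBlock = new BatchBlock<Record>(100);
var processBlock = new ActionBlock<Record[]>(batch => ProcessBatch(batch));
batchBlock.LinkTo(processBlock, new DataflowLinkOptions { PropagateCompletion = true });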
You can use the DataFlow library to do something like this:
ActionBlock<Record[]> action_block = new ActionBlock<Record[]>(
    x => ConsumeRecords(x),
    new ExecutionDataflowBlockOptions
    {
        //Use one thread to process data.
        //You can increase it if you want.
        //That would make sense if you produce the records faster than you consume them.
        MaxDegreeOfParallelism = 1
    });

for (int i = 0; i < 1000; i++)
{
    action_block.Post(ProduceNext100Records());
}
I am assuming that you have a method called ProduceNext100Records that produces records (e.g. via web service call) and another method called ConsumeRecords that consumes the records.
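One thing the snippet doesn't show: when all records have been posted you presumably want to wait for the block to drain, which the Dataflow API supports via Complete and the Completion task:

// no more batches will be posted; wait for the queued ones to finish
action_block.Complete();
action_block.Completion.Wait();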
The easy answer I think is to use Microsoft Reactive Extensions (NuGet "Rx-Main").
Then you can do something like this:
var query =
    from records in Get100Records().ToObservable()
    from record in records.ToObservable()
    from result in Observable.Start(() => ProcessRecord(record))
    select new { record, result };

IDisposable subscription =
    query
        .Subscribe(
            rr =>
            {
                /* Process each `rr.record`/`rr.result`
                   as they are produced */
            },
            () => { /* Run when all completed */ });
This will process in parallel and you'll start getting results as soon as the first ProcessRecord call is completed.
If you need to stop the processing early you just call subscription.Dispose().
I am trying to queue up items from an Azure Service Bus so I can process them in bulk. I am aware that the Azure Service Bus has a ReceiveBatch() but it seems problematic for the following reasons:
I can only get a max of 256 messages at a time, and even this can be random based on message size.
Even if I peek to see how many messages are waiting, I don't know how many ReceiveBatch calls to make, because I don't know how many messages each call will give me back. Since messages will keep coming in, I can't just continue to make requests until it's empty, since it will never be empty.
I decided to just use the message listener which is cheaper than doing wasted peeks and will give me more control.
Basically I am trying to let a set number of messages build up and then process them at once. I use a timer to force a delay, but I need to be able to queue my items as they come in.
Based on my timer requirement it seemed like the blocking collection was not a good option so I am trying to use ConcurrentBag.
var batchingQueue = new ConcurrentBag<BrokeredMessage>();

myQueueClient.OnMessage((m) =>
{
    Console.WriteLine("Queueing message");
    batchingQueue.Add(m);
});

while (true)
{
    var sw = WaitableStopwatch.StartNew();

    BrokeredMessage msg;
    while (batchingQueue.TryTake(out msg)) // <== Object is already disposed
    {
        ...do this until I have a thousand ready to be written to DB in batch
        Console.WriteLine("Completing message");
        msg.Complete(); // <== ERRORS HERE
    }

    sw.Wait(MINIMUM_DELAY);
}
However, as soon as I access the message outside of the OnMessage pipeline, it shows the BrokeredMessage as already being disposed.
I am thinking this must be some automatic behavior of OnMessage and I don't see any way to do anything with the message other than process it right away which I don't want to do.
This is incredibly easy to do with BlockingCollection.
var batchingQueue = new BlockingCollection<BrokeredMessage>();
myQueueClient.OnMessage((m) =>
{
Console.WriteLine("Queueing message");
batchingQueue.Add(m);
});
And your consumer thread:
foreach (var msg in batchingQueue.GetConsumingEnumerable())
{
Console.WriteLine("Completing message");
msg.Complete();
}
GetConsumingEnumerable returns an iterator that consumes items in the queue until the IsCompleted property is set and the queue is empty. If the queue is empty but IsCompleted is False, it does a non-busy wait for the next item.
To cancel the consumer thread (i.e. shut down the program), you stop adding things to the queue and have the main thread call batchingQueue.CompleteAdding. The consumer will empty the queue, see that the IsCompleted property is True, and exit.
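In code, that shutdown signal is just:

// signal the consumer loop that no more messages will arrive
batchingQueue.CompleteAdding();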
Using BlockingCollection here is better than ConcurrentBag or ConcurrentQueue, because the BlockingCollection interface is easier to work with. In particular, the use of GetConsumingEnumerable relieves you from having to worry about checking the count or doing busy waits (polling loops). It just works.
Also note that ConcurrentBag has some rather strange removal behavior. In particular, the order in which items are removed differs depending on which thread removes the item. The thread that created the bag removes items in a different order than other threads. See Using the ConcurrentBag Collection for the details.
You haven't said why you want to batch items on input. Unless there's an overriding performance reason to do so, it doesn't seem like a particularly good idea to complicate your code with that batching logic.
If you want to do batch writes to the database, then I would suggest using a simple List<T> to buffer the items. If you have to process the items before they're written to the database, then use the technique I showed above to process them. Then, rather than writing directly to the database, add the item to a list. When the list gets 1,000 items, or a given amount of time elapses, allocate a new list and start a task to write the old list to the database. Like this:
// at class scope

// Flush every 5 minutes.
private readonly TimeSpan FlushDelay = TimeSpan.FromMinutes(5);
private const int MaxBufferItems = 1000;

// The timer for the buffer flush. A field initializer can't reference the
// instance method TimedFlush, so create the timer in the constructor:
//   _flushTimer = new System.Threading.Timer(TimedFlush, null,
//       (long)FlushDelay.TotalMilliseconds, Timeout.Infinite);
private System.Threading.Timer _flushTimer;

// A lock for the list. Unless you're getting hundreds of thousands
// of items per second, this will not be a performance problem.
private object _listLock = new Object();

private List<BrokeredMessage> _recordBuffer = new List<BrokeredMessage>();
Then, in your consumer:
foreach (var msg in batchingQueue.GetConsumingEnumerable())
{
    // process the message
    Console.WriteLine("Completing message");
    msg.Complete();

    lock (_listLock)
    {
        _recordBuffer.Add(msg);
        if (_recordBuffer.Count >= MaxBufferItems)
        {
            // Stop the timer
            _flushTimer.Change(Timeout.Infinite, Timeout.Infinite);

            // Save the old list and allocate a new one
            var myList = _recordBuffer;
            _recordBuffer = new List<BrokeredMessage>();

            // Start a task to write to the database
            Task.Factory.StartNew(() => FlushBuffer(myList));

            // Restart the timer
            _flushTimer.Change((long)FlushDelay.TotalMilliseconds, Timeout.Infinite);
        }
    }
}
private void TimedFlush(object state)
{
    bool lockTaken = false;
    List<BrokeredMessage> myList = null;

    try
    {
        // Monitor.TryEnter takes the lock flag by ref and returns void
        Monitor.TryEnter(_listLock, 0, ref lockTaken);
        if (lockTaken)
        {
            // Save the old list and allocate a new one
            myList = _recordBuffer;
            _recordBuffer = new List<BrokeredMessage>();
        }
    }
    finally
    {
        if (lockTaken)
        {
            Monitor.Exit(_listLock);
        }
    }

    if (myList != null)
    {
        FlushBuffer(myList);
    }

    // Restart the timer
    _flushTimer.Change((long)FlushDelay.TotalMilliseconds, Timeout.Infinite);
}
The idea here is that you get the old list out of the way, allocate a new list so that processing can continue, and then write the old list's items to the database. The lock is there to prevent the timer and the record counter from stepping on each other. Without the lock, things would likely appear to work fine for a while, and then you'd get weird crashes at unpredictable times.
I like this design because it eliminates polling by the consumer. The only thing I don't like is that the consumer has to be aware of the timer (i.e. it has to stop and then restart the timer). With a little more thought, I could eliminate that requirement. But it works well the way it's written.
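For what it's worth, one way to remove that awareness (a sketch, untested): funnel all buffer access through a single hypothetical AddToBuffer method, so only it touches the timer:

void AddToBuffer(BrokeredMessage msg)
{
    List<BrokeredMessage> toFlush = null;
    lock (_listLock)
    {
        _recordBuffer.Add(msg);
        if (_recordBuffer.Count >= MaxBufferItems)
        {
            // swap the full buffer out under the lock
            toFlush = _recordBuffer;
            _recordBuffer = new List<BrokeredMessage>();
        }
    }
    if (toFlush != null)
    {
        // reset the flush window, then write the old buffer in the background
        _flushTimer.Change((long)FlushDelay.TotalMilliseconds, Timeout.Infinite);
        Task.Factory.StartNew(() => FlushBuffer(toFlush));
    }
}

The consumer then just calls AddToBuffer(msg) and never sees the timer.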
Switching to OnMessageAsync solved the problem for me
_queueClient.OnMessageAsync(async receivedMessage =>
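For completeness, a hedged sketch of what the rest of that handler might look like; ProcessMessageAsync is a made-up placeholder for your own work:

_queueClient.OnMessageAsync(async receivedMessage =>
{
    // do the work while the message is still alive in the callback...
    await ProcessMessageAsync(receivedMessage); // hypothetical processing method

    // ...and complete it before the callback returns
    await receivedMessage.CompleteAsync();
});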
I reached out to Microsoft about the BrokeredMessage being disposed issue on MSDN, this is the response:
Very basic rule and I am not sure if this is documented. The received message needs to be processed in the callback function's life time. In your case, messages will be disposed when async callback completes, this is why your complete attempts are failing with ObjectDisposedException in another thread.
I don't really see how queuing messages for further processing helps on the throughput. This will add more burden to client for sure. Try processing the message in the async callback, that should be performant enough.
In my case that means I can't use ServiceBus in the way I wanted to, and I have to re-think how I wanted things to work. Bugger.
I had the same issue when I started to work with the Azure Service Bus service.
I found that the OnMessage method always disposes the BrokeredMessage object. The approach proposed by Jim Mischel didn't help me (but it was very interesting to read - thanks!).
After some investigation I found that the whole approach is wrong. Let me explain the right way to do what you want.
Use the BrokeredMessage.Complete() method only inside the OnMessage method handler.
If you need to process a message outside of this method, you should use the QueueClient.Complete(Guid lockToken) method. "LockToken" is a property of the BrokeredMessage object.
Example:
var messageOptions = new OnMessageOptions
{
    AutoComplete = false,
    AutoRenewTimeout = TimeSpan.FromMinutes(5),
    MaxConcurrentCalls = 1
};

var buffer = new Dictionary<string, Guid>();

// get messages from the queue
myQueueClient.OnMessage(
    m => buffer.Add(key: m.GetBody<string>(), value: m.LockToken),
    messageOptions // this option tells Service Bus to "freeze" the message in the queue until we process it
);

foreach (var item in buffer)
{
    try
    {
        Console.WriteLine($"Process item: {item.Key}");
        myQueueClient.Complete(item.Value); // you can also use CompleteBatch(...) to improve performance
    }
    catch
    {
        // "unfreeze" the message in Service Bus; it will be delivered to another listener
        myQueueClient.Defer(item.Value);
    }
}
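As the comment in the loop hints, if you accumulate lock tokens you can complete the messages in a single round trip instead of one by one:

// complete every buffered message in one service call
myQueueClient.CompleteBatch(buffer.Values);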
My solution was to get the message SequenceNumber, then defer the message and add the SequenceNumber to the BlockingCollection. Once the BlockingCollection picks up a new item it can receive the deferred message by the SequenceNumber and mark the message as complete. If for some reason the BlockingCollection doesn't process the SequenceNumber, it will remain in the queue as deferred so it can be picked up later when the process is restarted. This protects against losing messages if the process abnormally terminates while there are still items in the BlockingCollection.
BlockingCollection<long> queueSequenceNumbers = new BlockingCollection<long>();

//This finds any deferred/unfinished messages on startup.
BrokeredMessage existingMessage = client.Peek();
while (existingMessage != null)
{
    if (existingMessage.State == MessageState.Deferred)
    {
        queueSequenceNumbers.Add(existingMessage.SequenceNumber);
    }
    existingMessage = client.Peek();
}

//setup the message handler
Action<BrokeredMessage> processMessage = new Action<BrokeredMessage>((message) =>
{
    try
    {
        //skip deferred messages if they are already in the queueSequenceNumbers collection.
        if (message.State != MessageState.Deferred || (message.State == MessageState.Deferred && !queueSequenceNumbers.Any(x => x == message.SequenceNumber)))
        {
            message.Defer();
            queueSequenceNumbers.Add(message.SequenceNumber);
        }
    }
    catch (Exception ex)
    {
        // Indicates a problem, unlock message in queue
        message.Abandon();
    }
});

// Callback to handle newly received messages
client.OnMessage(processMessage, new OnMessageOptions() { AutoComplete = false, MaxConcurrentCalls = 1 });

//start the blocking loop to process messages as they are added to the collection
foreach (var queueSequenceNumber in queueSequenceNumbers.GetConsumingEnumerable())
{
    var message = client.Receive(queueSequenceNumber);

    //mark the message as complete so it's removed from the queue
    message.Complete();

    //do something with the message
}
I have a situation in which I have a producer/consumer scenario. The producer never stops, which means that even if there is a time when there are no items in the BC, further items can be added later.
Moving from .NET Framework 3.5 to 4.0, I decided to use a BlockingCollection as a concurrent queue between the consumer and the producer. I even added some parallel extensions so I could use the BC with a Parallel.ForEach.
The problem is that, in the consumer thread, I need to have a kind of hybrid model:
I'm always checking the BC to process any item that arrives, with a Parallel.ForEach(bc.GetConsumingEnumerable(), item => ...). Inside this foreach, I execute all the tasks that don't depend on each other.
Here comes the problem. After parallelizing the previous tasks, I need to manage their results in the same FIFO order in which they were in the BC. The processing of these results should be done in a single synchronous thread.
A little example in pseudo code follows:
producer:
//This event is triggered each time a page is scanned. Any batch of new pages can be added at any time at the scanner
private void Current_OnPageScanned(object sender, ScannedPage scannedPage)
{
//The object to add has a property with the sequence number
_concurrentCollection.TryAdd(scannedPage);
}
consumer:
private void Init()
{
    _cancelTasks = false;
    _checkTask = Task.Factory.StartNew(() =>
    {
        while (!_cancelTasks)
        {
            //BlockingCollection with Parallel.ForEach
            var bc = _concurrentCollection;
            Parallel.ForEach(bc.GetConsumingEnumerable(), item =>
            {
                ScannedPage currentPage = item;
                // process a batch of images from the bc and check if an image has a valid barcode.
            });
            //Here should go the code that takes the results from each task, processes them in the same FIFO order in which they entered the BC and saves each image to a file, all of this in this same thread.
        }
    });
}
Obviously, this can't work as it is, because .GetConsumingEnumerable() blocks until there is another item in the BC. I assume I could do it with tasks and just fire 4 or 5 tasks in a single batch, but:
How could I do this with tasks and still have a waiting point before the start of the tasks that blocks until there is an item to be consumed in the BC? I don't want to start processing if there is nothing; once there is something in the BC I would just start the batch of 4 tasks, and use a TryTake inside each one so they don't block if there is nothing to take, since I can't guarantee the BC holds as many items as there are tasks in the batch (for example, there might be just one item left in the BC for a batch of 4 tasks).
How could I do this and take advantage of the efficiency that Parallel.For offers?
How could I save the results of the tasks in the same FIFO order in which the items were extracted from the BC?
Is there any other concurrency class more suited to this kind of hybrid processing of items in the consumer?
Also, this is my first question ever on StackOverflow, so if you need any more data or you think my question is not correct, just let me know.
I think I follow what you're asking, why not create a ConcurrentBag and add to it while processing like this:
while (!_cancelTasks)
{
    //BlockingCollections with Parallel.ForEach
    var bc = _concurrentCollection;
    var q = new ConcurrentBag<ScannedPage>();

    Parallel.ForEach(bc.GetConsumingEnumerable(), item =>
    {
        ScannedPage currentPage = item;
        q.Add(item);
        // process a batch of images from the bc and check if an image has a valid barcode.
    });

    //Here should go the code that takes the results from each task, processes them in the same FIFO order in which they entered the BC and saves each image to a file, all of this in this same thread.
    //process items in your list here by sorting using some sequence key
    var items = q.OrderBy(o => o.SeqNbr).ToList();
    foreach (var item in items)
    {
        ...
    }
}
This obviously doesn't enqueue them in the exact order they were added to the BC, but you could add a sequence number to the ScannedPage object like Alex suggested and then sort the results after.
Here's how I'd handle the sequence:
Add this to the ScannedPage class:
public static int _counter; //public because this is just an example but it would work.
Get a sequence number and assign it here:
private void Current_OnPageScanned(object sender, ScannedPage scannedPage)
{
    // Interlocked.Increment returns the incremented value atomically,
    // so the sequence number can be assigned without any extra locking.
    scannedPage.SeqNbr = System.Threading.Interlocked.Increment(ref ScannedPage._counter);
    ...
}
Whenever you need the results of a parallel operation, using PLINQ is generally more convenient than using the Parallel class. Here is how you could refactor your code using PLINQ:
private void Init()
{
    _cancelTasks = new CancellationTokenSource();
    _checkTask = Task.Run(() =>
    {
        while (true)
        {
            _cancelTasks.Token.ThrowIfCancellationRequested();
            var bc = _concurrentCollection;
            var partitioner = Partitioner.Create(
                bc.GetConsumingEnumerable(_cancelTasks.Token),
                EnumerablePartitionerOptions.NoBuffering);
            ScannedPage[] results = partitioner
                .AsParallel()
                .AsOrdered()
                .Select(scannedPage =>
                {
                    // Process the scannedPage
                    return scannedPage;
                })
                .ToArray();
            // Process the results
        }
    });
}
The .AsOrdered() is what ensures that you'll get the results in the same order as the input.
Be aware that when you consume a BlockingCollection<T> with the Parallel class or PLINQ, it is important to use the Partitioner and the EnumerablePartitionerOptions.NoBuffering configuration, otherwise there is a risk of deadlocks. The default greedy behavior of the Parallel/PLINQ and the blocking behavior of the BlockingCollection<T>, do not interact well.
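To make that concrete, a hedged sketch of the two forms (ProcessPage stands in for your per-item work):

// risky: PLINQ's default partitioner buffers items ahead of the workers,
// so a query can sit waiting for items that will never arrive
var risky = bc.GetConsumingEnumerable()
    .AsParallel()
    .AsOrdered()
    .Select(page => ProcessPage(page));

// safe: NoBuffering hands each item to a worker as soon as it is taken
var safe = Partitioner
    .Create(bc.GetConsumingEnumerable(), EnumerablePartitionerOptions.NoBuffering)
    .AsParallel()
    .AsOrdered()
    .Select(page => ProcessPage(page));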
This seems like it has a very simple solution, but I've been looking on and off for months without finding a definitive answer.
I have an object being created by the UI thread. I'm actually creating several of the same type of object. Sometimes I'll create 1 or 2 every minute, sometimes I'll create 30 in a second. Each time an object is created, I need to perform some calculations on it. The combination of potentially 30 objects created in a short time and the expensive calculations I'm performing on said objects can really lag the UI.
I've tried performing the calculations via Tasks and BackgroundWorkers and all sorts of threading, but they all perform the calculations out of order, and it's imperative that the objects are calculated in the order they are created and that one doesn't start its calculations until the object ahead of it finishes its own.
I can find all sorts of information about how to perform these tasks in parallel, but can anyone explain to me how I can force them to happen sequentially, just not on the UI Thread? Any help would be greatly appreciated. I've been trying to figure this out for months :(
IMO, the easiest way to run tasks in sequential order here is to use Task.ContinueWith. Note TaskContinuationOptions.LazyCancellation, it's used to make sure that cancellation doesn't break the order of the sequence:
Task _currentTask = Task.FromResult(Type.Missing);
readonly object _lock = new Object();

void QueueTask(Action action)
{
    lock (_lock)
    {
        _currentTask = _currentTask.ContinueWith(
            lastTask =>
            {
                // re-throw the error of the last completed task (if any)
                lastTask.GetAwaiter().GetResult();
                // run the new task
                action();
            },
            CancellationToken.None,
            TaskContinuationOptions.LazyCancellation,
            TaskScheduler.Default);
    }
}

private void button1_Click(object sender, EventArgs e)
{
    for (var i = 0; i < 10; i++)
    {
        var sleep = 1000 - i * 100;
        QueueTask(() =>
        {
            Thread.Sleep(sleep);
            Debug.WriteLine("Slept for {0} ms", sleep);
        });
    }
}
I think a proper solution would be to use a Queue, as it is FIFO, drained by a task that is running in the background.
Your code could look something like this:
Edit
Edited to use Queue.Synchronized as #rwong mentioned (thanks!)
var queue = new System.Collections.Queue();
//add objects to the queue..

// note: Synchronized only exists on the non-generic System.Collections.Queue,
// and Dequeue throws on an empty queue, hence the Count check below
var mySyncedQueue = System.Collections.Queue.Synchronized(queue);

Task.Factory.StartNew(() =>
{
    while (true)
    {
        if (mySyncedQueue.Count > 0)
        {
            var myObj = mySyncedQueue.Dequeue() as MyCustomObject;
            if (myObj != null)
            { /* do work... */ }
        }
    }
}, TaskCreationOptions.LongRunning);
This is just to get you started; I'm sure your code could be more efficient once you know exactly what is needed :)
Edit 2
As you are accessing the queue from multiple threads, it is better to use a ConcurrentQueue.
The method won't look much different:
var concurrentQueue = new ConcurrentQueue<MyCustomObject>();
//add objects to the queue..

Task.Factory.StartNew(() =>
{
    while (true)
    {
        MyCustomObject myObj;
        if (concurrentQueue.TryDequeue(out myObj))
        { /* do work... */ }
    }
}, TaskCreationOptions.LongRunning);
Create exactly one background task. Then let it process all items in a thread-safe FIFO buffer. Add objects to the FIFO from your GUI thread.
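A minimal sketch of that suggestion, assuming a hypothetical MyObject type and Calculate method, with a BlockingCollection as the thread-safe FIFO buffer:

var fifo = new BlockingCollection<MyObject>();

// exactly one background consumer, so objects are calculated
// strictly in the order they were added
Task.Factory.StartNew(() =>
{
    foreach (var obj in fifo.GetConsumingEnumerable())
        Calculate(obj); // the expensive, order-sensitive work
}, TaskCreationOptions.LongRunning);

// on the GUI thread, whenever an object is created:
fifo.Add(createdObject);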