I am using a BlockingCollection to process some files and then upload them to a server.
Right now I have a single Producer that recurses the file system and compresses certain files to a temporary location. Once it has finished with a file, it adds an object of my own to the BlockingCollection containing information such as file name, file path, modified date, etc. The Consumer then grabs this object and uses it to upload the file. When the Producer has finished searching the file system and working on files, it calls BlockingCollection.CompleteAdding() to signal the Consumer that it has finished.
What I would like to do is increase the number of Producers to 2 or more. The reason is that the compression process takes a while, and on multi-core processors I'm only taking advantage of one core. This causes the Producer to sometimes fall behind the Consumer on faster networks.
My question is, when I have multiple Producers and only one Consumer, how can I signal the Consumer that all of the Producers have finished their work? If I call BlockingCollection.CompleteAdding() from one of the Producers, one or more of the others could still be working.
You can use a semaphore-like shared counter in your Producer code to gate the call to BlockingCollection.CompleteAdding(). The counter is shared by all the Producer instances: increment it when a Producer starts and decrement it when the Producer finishes its job. Whichever Producer drives the counter to zero is the last one, so it is the one that calls BlockingCollection.CompleteAdding().
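For illustration, here is a minimal sketch of that shared-counter idea using Interlocked; the item payloads and counts are placeholders rather than the asker's actual file objects:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class MultiProducerDemo
{
    static readonly BlockingCollection<string> queue = new BlockingCollection<string>();
    static int activeProducers;

    static void Main()
    {
        const int producerTotal = 3;
        activeProducers = producerTotal; // set before any producer can finish

        for (int i = 0; i < producerTotal; i++)
        {
            int id = i;
            Task.Run(() =>
            {
                for (int n = 0; n < 5; n++)
                    queue.Add("producer " + id + ", item " + n); // stand-in for the real file info

                // Whichever producer decrements the counter to zero is the last
                // one out, so only it completes the collection.
                if (Interlocked.Decrement(ref activeProducers) == 0)
                    queue.CompleteAdding();
            });
        }

        // Single consumer: GetConsumingEnumerable ends cleanly once
        // CompleteAdding has been called and the queue has drained.
        foreach (var item in queue.GetConsumingEnumerable())
            Console.WriteLine("uploading " + item);
    }
}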
I use something like this to have multiple producers and consumers. It is just a very simple solution, not optimized for production code.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class ManageBatchProcessing
{
    private BlockingCollection<Action> blockingCollection;

    public void Process()
    {
        blockingCollection = new BlockingCollection<Action>();
        int numberOfBatches = 10;
        Process(HandleProducers, HandleConsumers, numberOfBatches);
    }

    private void Process(Action<int> produce, Action<int> consume, int numberOfBatches)
    {
        produce(numberOfBatches);
        consume(numberOfBatches);
    }

    private void HandleConsumers(int numberOfBatches)
    {
        var consumers = new List<Task>();
        for (var i = 1; i <= numberOfBatches; i++)
        {
            consumers.Add(Task.Factory.StartNew(() =>
            {
                foreach (var action in blockingCollection.GetConsumingEnumerable())
                {
                    action();
                }
            }));
        }
        Task.WaitAll(consumers.ToArray());
    }

    private void HandleProducers(int numberOfBatches)
    {
        var producers = new List<Task>();
        for (var i = 1; i <= numberOfBatches; i++) // was i = 0; i <= n, which started one producer too many
        {
            producers.Add(Task.Factory.StartNew(() =>
            {
                blockingCollection.Add(() => YourProducerMethod());
            }));
        }
        Task.WaitAll(producers.ToArray());
        blockingCollection.CompleteAdding();
    }
}
I have to write a program that reads the queues to process from a database; all the queues are run in parallel and managed on the parent thread using a ConcurrentDictionary.
I have a class that represents the queue, which has a constructor that takes in the queue information and the parent instance handle. The queue class also has the method that processes the queue.
Here is the Queue Class:
class MyQueue
{
    protected ServiceExecution _parent;
    protected string _queueID;

    public MyQueue(ServiceExecution parentThread, string queueID)
    {
        _parent = parentThread;
        _queueID = queueID;
    }

    public void Process()
    {
        try
        {
            // Do work to process
        }
        catch (Exception)
        {
            // exception handling
        }
        finally
        {
            _parent.ThreadFinish(_queueID);
        }
    }
}
The parent thread loops through the dataset of queues and instantiates a new queue class. It spawns a new thread to execute the Process method of the Queue object asynchronously. The queue is added to the ConcurrentDictionary and the thread is then started as follows:
private ConcurrentDictionary<string, MyQueue> _runningQueues = new ConcurrentDictionary<string, MyQueue>();

foreach (DataRow dr in QueueDataset.Rows)
{
    MyQueue queue = new MyQueue(this, dr["QueueID"].ToString());
    Thread t = new Thread(() => queue.Process());
    if (_runningQueues.TryAdd(dr["QueueID"].ToString(), queue))
    {
        t.Start();
    }
}

// Method that gets called by the queue thread when it finishes
public void ThreadFinish(string queueID)
{
    MyQueue queue;
    _runningQueues.TryRemove(queueID, out queue);
}
I have a feeling this is not the right approach to manage the asynchronous queue processing, and I'm wondering whether I can run into deadlocks with this design. Furthermore, I would like to use Tasks to run the queues asynchronously instead of new Threads. I need to keep track of the queues because I will not spawn a new thread or task for the same queue if the previous run has not completed yet. What is the best way to handle this type of parallelism?
Thanks in advance!
About your current approach
Indeed it is not the right approach. A high number of queues read from the database will spawn a high number of threads, which can be bad. You will create a new thread each time; it is better to create a few threads and re-use them. And if you want Tasks, it is better to create LongRunning Tasks and re-use them.
Suggested Design
I'd suggest the following design:
Reserve only one task to read queues from the database and put those queues in a BlockingCollection;
Now start multiple LongRunning tasks to read a queue each from that BlockingCollection and process that queue;
When a task is done with processing the queue it took from the BlockingCollection, it will then take another queue from that BlockingCollection;
Optimize the number of these processing tasks so as to properly utilize the cores of your CPU. Usually, since DB interactions are slow, you can create about three times as many tasks as you have cores, but YMMV.
Deadlock possibility
Deadlocks will at least not happen on the application side. However, since the queues run database transactions, deadlocks may happen at the database end. You may have to write some logic to make your task start a transaction again if the database rolled it back as a deadlock victim.
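For illustration, a minimal retry sketch, assuming SQL Server, where being picked as the deadlock victim surfaces as a SqlException with error number 1205 (queue.Process() stands in for the transactional work):

// Requires: using System.Data.SqlClient; using System.Threading;
static void ProcessWithDeadlockRetry(MyQueue queue)
{
    const int maxAttempts = 3;
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            queue.Process(); // open the transaction, do the work, commit
            break;
        }
        catch (SqlException ex) when (ex.Number == 1205 && attempt < maxAttempts)
        {
            // Rolled back as deadlock victim; back off briefly and retry.
            Thread.Sleep(200 * attempt);
        }
    }
}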
Sample Code
private static void TaskDesignedRun()
{
    var expectedParallelQueues = 1024; // Optimize it. I've chosen it randomly.
    var parallelProcessingTaskCount = 4 * Environment.ProcessorCount; // Optimize this too.
    var baseProcessorTaskArray = new Task[parallelProcessingTaskCount];
    var taskFactory = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);
    var itemsToProcess = new BlockingCollection<MyQueue>(expectedParallelQueues);

    // Start a new task to populate "itemsToProcess"
    taskFactory.StartNew(() =>
    {
        // Add code to read queues and add them to itemsToProcess
        Console.WriteLine("Done reading all the queues...");
        // Finally signal that you are done by saying..
        itemsToProcess.CompleteAdding();
    });

    // Initializing the base tasks
    for (var index = 0; index < baseProcessorTaskArray.Length; index++)
    {
        baseProcessorTaskArray[index] = taskFactory.StartNew(() =>
        {
            // IsCompleted is true only once adding is done AND the collection is empty;
            // the original check (!IsAddingCompleted && Count != 0) could exit too early.
            while (!itemsToProcess.IsCompleted)
            {
                MyQueue q;
                if (!itemsToProcess.TryTake(out q, 100)) continue; // short timeout avoids a busy spin
                // Process your queue
            }
        });
    }

    // Now just wait till all queues in your database have been read and processed.
    Task.WaitAll(baseProcessorTaskArray);
}
Scope:
I am currently implementing an application that uses Amazon SQS Service as a provider of data for this program to process.
Since I need parallel processing of the messages dequeued from this queue, this is what I did.
Parallel.ForEach(GetMessages(msgAttributes), new ParallelOptions { MaxDegreeOfParallelism = threadCount }, message =>
{
    // Processing Logic
});
Here's the header of the "GetMessages" method:
private static IEnumerable<Message> GetMessages(List<String> messageAttributes = null)
{
    // Dequeueing Logic... 10 at a time
    // Yielding the messages to the Parallel loop
    foreach (Message awsMessage in messages)
    {
        yield return awsMessage;
    }
}
How will this work?
My initial thought about how this would work was that the GetMessages method would be executed whenever the threads had no work (or a good number of threads had no work, something like an internal heuristic to measure this). That being said, I expected the GetMessages method to distribute the messages to the Parallel.ForEach worker threads, which would process the messages and then wait for the Parallel.ForEach handler to give them more messages to work on.
Problem? I was wrong...
The thing is, I was wrong, and I still have no idea what's happening in this situation.
The number of messages being dequeued is way too high, and it grows by powers of 2 every time they get dequeued. The dequeue count (messages) goes as follows:
Dequeue is Called: Returns 80 Messages
Dequeue is Called: Returns 160 Messages
Dequeue is Called: Returns 320 Messages (and so forth)
After a certain point, the number of messages being dequeued, or, in this case, waiting to be processed by my application is too high and I end up running out of memory.
More Information:
I am using thread-safe Interlocked operations to increment the counters mentioned above.
The number of threads being used is 25 (for the Parallel.Foreach)
Each "GetMessages" will return up to 10 messages (as an IEnumerable, yielded).
Question: What exactly is happening in this scenario?
I am having a hard time figuring out what exactly is going on. Is my GetMessages method being invoked by each thread once it finishes the "Processing Loop", leading to more and more messages being dequeued?
Is the call to "GetMessages" made by a single thread, or is it called by multiple threads?
I think there's an issue with Parallel.ForEach partitioning... Your question is a typical producer/consumer scenario. For such a case, you should have independent logic for dequeuing on one side and processing on the other. That respects separation of concerns and will simplify debugging.
BlockingCollection<T> lets you separate both sides: on one side you add items to be processed, and on the other you consume them. Here's an example of how to implement it:
Note that GetConsumingEnumerable() is built into BlockingCollection<T>; if you want Parallel.ForEach to partition a BlockingCollection without buffering items in chunks, the GetConsumingPartitioner() extension from the ParallelExtensionsExtras NuGet package is the usual tool.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ProducerConsumer
{
    public static ConcurrentQueue<String> SqsQueue = new ConcurrentQueue<String>();
    public static BlockingCollection<String> Collection = new BlockingCollection<String>();
    public static ConcurrentBag<String> Result = new ConcurrentBag<String>();

    public static async Task TestMethod()
    {
        // Here we separate all the Tasks in distinct threads
        Task sqs = Task.Run(async () =>
        {
            Console.WriteLine("Amazon on thread " + Thread.CurrentThread.ManagedThreadId.ToString());
            while (true)
            {
                ProducerConsumer.BackgroundFakedAmazon(); // We produce 50 Strings each second
                await Task.Delay(1000);
            }
        });

        Task deq = Task.Run(async () =>
        {
            Console.WriteLine("Dequeue on thread " + Thread.CurrentThread.ManagedThreadId.ToString());
            while (true)
            {
                ProducerConsumer.DequeueData(); // Dequeue 20 Strings every 100ms
                await Task.Delay(100);
            }
        });

        Task process = Task.Run(() =>
        {
            Console.WriteLine("Process on thread " + Thread.CurrentThread.ManagedThreadId.ToString());
            ProducerConsumer.BackgroundParallelConsumer(); // Process all the Strings in the BlockingCollection
        });

        await Task.WhenAll(sqs, deq, process); // the original also waited on an undefined task "c"
    }

    public static void DequeueData()
    {
        foreach (var i in Enumerable.Range(0, 20))
        {
            String dequeued = null;
            if (SqsQueue.TryDequeue(out dequeued))
            {
                Collection.Add(dequeued);
                Console.WriteLine("Dequeued : " + dequeued);
            }
        }
    }

    public static void BackgroundFakedAmazon()
    {
        Console.WriteLine(" ---------- Generate 50 items on amazon side ---------- ");
        foreach (var data in Enumerable.Range(0, 50).Select(i => Path.GetRandomFileName().Split('.').FirstOrDefault()))
            SqsQueue.Enqueue(data + " / ASQS");
    }

    public static void BackgroundParallelConsumer()
    {
        // Here we stay in Parallel.ForEach, waiting for data. Once processed, we keep waiting for the next chunks
        Parallel.ForEach(Collection.GetConsumingEnumerable(), (i) =>
        {
            // Processing Logic
            String processedData = "Processed : " + i;
            Result.Add(processedData);
            Console.WriteLine(processedData);
        });
    }
}
You can try it from a console app like this:
static void Main(string[] args)
{
    ProducerConsumer.TestMethod().Wait();
}
I have an application that works with a queue of strings (which correspond to different tasks the application needs to perform). At random moments the queue can be filled with strings (sometimes several times a minute, but it can also take a few hours).
Until now I have always had a timer that checked the queue every few seconds for items and removed them.
I think there must be a nicer solution than this. Is there any way to get an event or something similar when an item is added to the queue?
Yes. Take a look at TPL Dataflow, in particular BufferBlock<T>, which does more or less the same as BlockingCollection<T> without the nasty side effect of jamming up your threads, by leveraging async/await.
So you can:
// Requires the System.Threading.Tasks.Dataflow (TPL Dataflow) package.
static void Main()
{
    var b = new BufferBlock<string>();
    // Both loops run forever in this demo; block the main thread on them.
    Task.WaitAll(AddToBlockAsync(b), ReadFromBlockAsync(b));
}

public static async Task AddToBlockAsync(BufferBlock<string> b)
{
    while (true)
    {
        b.Post("hello");
        await Task.Delay(1000);
    }
}

public static async Task ReadFromBlockAsync(BufferBlock<string> b)
{
    await Task.Delay(10000); // let some messages buffer up...
    while (true)
    {
        var msg = await b.ReceiveAsync();
        Console.WriteLine(msg);
    }
}
I'd take a look at BlockingCollection.GetConsumingEnumerable. The collection will be backed with a queue by default, and it is a nice way to automatically take values from the queue as they are added using a simple foreach loop.
There is also an overload that allows you to supply a CancellationToken meaning you can cleanly break out.
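A minimal sketch of that overload; the CancellationTokenSource and timing here are illustrative:

var queue = new BlockingCollection<string>();
var cts = new CancellationTokenSource();

var consumer = Task.Run(() =>
{
    try
    {
        // Blocks waiting for items and throws OperationCanceledException
        // as soon as the token is signalled.
        foreach (var item in queue.GetConsumingEnumerable(cts.Token))
            Console.WriteLine(item);
    }
    catch (OperationCanceledException)
    {
        // Clean shutdown path.
    }
});

queue.Add("work item");
cts.CancelAfter(TimeSpan.FromSeconds(1)); // or call cts.Cancel() from anywhere
consumer.Wait();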
Have you looked at BlockingCollection? The GetConsumingEnumerable() method allows an indefinite loop to be run on the consumer, to which new items will be yielded as they become available, with no need for timers or Thread.Sleep calls:
// Common:
BlockingCollection<string> _blockingCollection = new BlockingCollection<string>();

// Producer:
for (var i = 0; i < 100; i++)
{
    _blockingCollection.Add(i.ToString());
    Thread.Sleep(500); // So you can track the consumer synchronization. Remove.
}

// Consumer:
foreach (var item in _blockingCollection.GetConsumingEnumerable())
{
    Debug.WriteLine(item);
}
The code below continues to create threads, even when the queue is empty, until eventually an OutOfMemoryException occurs. If I replace the Parallel.ForEach with a regular foreach, this does not happen. Does anyone know why this might happen?
public delegate void DataChangedDelegate(DataItem obj);

public class Consumer
{
    public DataChangedDelegate OnCustomerChanged;
    public DataChangedDelegate OnOrdersChanged;

    private CancellationTokenSource cts;
    private CancellationToken ct;
    private BlockingCollection<DataItem> queue;

    public Consumer(BlockingCollection<DataItem> queue) {
        this.queue = queue;
        Start();
    }

    private void Start() {
        cts = new CancellationTokenSource();
        ct = cts.Token;
        Task.Factory.StartNew(() => DoWork(), ct);
    }

    private void DoWork() {
        // GetConsumingPartitioner() comes from the ParallelExtensionsExtras samples.
        Parallel.ForEach(queue.GetConsumingPartitioner(), item => {
            if (item.DataType == DataTypes.Customer) {
                OnCustomerChanged(item);
            } else if (item.DataType == DataTypes.Order) {
                OnOrdersChanged(item);
            }
        });
    }
}
I think Parallel.ForEach() was made primarily for processing bounded collections, and it doesn't expect collections like the one returned by GetConsumingPartitioner(), where MoveNext() blocks for a long time.
The problem is that Parallel.ForEach() tries to find the best degree of parallelism, so it starts as many Tasks as the TaskScheduler lets it run. But the TaskScheduler sees many Tasks that take a very long time to finish, and that they're not doing anything (they block), so it keeps starting new ones.
I think the best solution is to set the MaxDegreeOfParallelism.
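As a sketch, here is the question's DoWork capped with ParallelOptions; the limit of 4 is arbitrary and should be tuned for your workload:

private void DoWork() {
    var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
    Parallel.ForEach(queue.GetConsumingPartitioner(), options, item => {
        if (item.DataType == DataTypes.Customer) {
            OnCustomerChanged(item);
        } else if (item.DataType == DataTypes.Order) {
            OnOrdersChanged(item);
        }
    });
}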
As an alternative, you could use TPL Dataflow's ActionBlock. The main difference in this case is that ActionBlock doesn't block any threads when there are no items to process, so the number of threads wouldn't get anywhere near the limit.
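And a minimal ActionBlock sketch under the same assumptions; it requires the TPL Dataflow package (using System.Threading.Tasks.Dataflow):

var worker = new ActionBlock<DataItem>(item => {
    if (item.DataType == DataTypes.Customer) {
        OnCustomerChanged(item);
    } else if (item.DataType == DataTypes.Order) {
        OnOrdersChanged(item);
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// Producers post items directly to the block instead of a BlockingCollection.
worker.Post(nextItem); // nextItem: a DataItem produced elsewhere

// When all producers are done:
worker.Complete();
worker.Completion.Wait(); // no threads sit blocked while the block is idle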
The Producer/Consumer pattern is mainly used when there is just one Producer and one Consumer.
However, what you are trying to achieve (multiple consumers) more neatly fits the Worklist pattern. The following code was taken from the unit 2 slides, "2c - Shared Memory Patterns", from a parallel programming class taught at the University of Utah, which is available in the download at http://ppcp.codeplex.com/
BlockingCollection<Item> worklist;
CancellationTokenSource cts;
int itemcount;

public void Run()
{
    int num_workers = 4;
    // create worklist, filled with initial work
    worklist = new BlockingCollection<Item>(
        new ConcurrentQueue<Item>(GetInitialWork()));
    cts = new CancellationTokenSource();
    itemcount = worklist.Count();

    for (int i = 0; i < num_workers; i++)
        Task.Factory.StartNew(RunWorker);
}

IEnumerable<Item> GetInitialWork() { ... }

public void RunWorker()
{
    try
    {
        do
        {
            Item i = worklist.Take(cts.Token);
            // blocks until an item is available or the token is cancelled
            Process(i);
            // exit loop if no more items left
        } while (Interlocked.Decrement(ref itemcount) > 0);
    }
    finally
    {
        if (!cts.IsCancellationRequested)
            cts.Cancel();
    }
}

public void AddWork(Item item)
{
    Interlocked.Increment(ref itemcount);
    worklist.Add(item);
}

public void Process(Item i)
{
    // Do what you want with the work item here.
}
The preceding code allows you to add worklist items to the queue, and lets you set an arbitrary number of workers (in this case, four) to pull items out of the queue and process them.
Another great resource for the Parallelism on .Net 4.0 is the book "Parallel Programming with Microsoft .Net" which is freely available at: http://msdn.microsoft.com/en-us/library/ff963553
Internally in the Task Parallel Library, Parallel.For and Parallel.ForEach follow a hill-climbing algorithm to determine how much parallelism should be utilized for the operation.
More or less, they start with running the body on one task, move to two, and so on, until a break-point is reached and they need to reduce the number of tasks.
This works quite well for method bodies that complete quickly, but if the body takes a long time to run, it may take a long time before it realizes it needs to decrease the amount of parallelism. Until that point, it continues adding tasks, and can possibly crash the computer.
I learned the above during a lecture given by one of the developers of the Task Parallel Library.
Specifying the MaxDegreeOfParallelism is probably the easiest way to go.
I would like to implement multiple-file downloading with a single-producer, multiple-consumer pattern.
What I have:
- Code which finds new links to be downloaded, in a loop
- When a new link is found, it calls the download function
- The download function accepts a source file path and a destination file path and downloads the file
What I want to do:
- I want to download X files simultaneously (I don't know the total number of files)
- At any time I should have X files downloading simultaneously; as soon as one of the X files finishes, the calling function should be able to add a new download right away, which in turn starts downloading right away
So I have a producer function which keeps adding new downloads to the queue (at any time, a maximum of X downloads), and multiple (X) threads which consume the downloads and download them individually. Once one finishes its download, the producer should be able to add a new download, which will spawn a new thread.
An EXAMPLE would be really appreciated.
For this P/C problem all you need is a BlockingCollection<T>.
// shared and thread-safe
static BlockingCollection<string> queue = new BlockingCollection<string>(100);

// Producer
queue.Add(fileName); // will block when full

// Consumer
if (queue.TryTake(out fileName, timeOut)) // waits when empty
    ...
You'll want to fine-tune it a little with timeouts and CancellationTokens.
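Putting those pieces together, a minimal sketch with one producer and X consumer tasks; DownloadFile and FindLinks are placeholders for your own download call and link-finding loop:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class DownloadDemo
{
    static void Main()
    {
        const int consumerCount = 4;                     // X simultaneous downloads
        var queue = new BlockingCollection<string>(100); // bounded: producer blocks when full

        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                // Each consumer takes the next link as soon as it is free,
                // so X downloads stay in flight while work remains.
                foreach (var url in queue.GetConsumingEnumerable())
                    DownloadFile(url);
            });
        }

        // Producer: the loop that finds new links.
        foreach (var url in FindLinks())
            queue.Add(url);

        queue.CompleteAdding(); // no more links; consumers drain the queue and exit
        Task.WaitAll(consumers);
    }

    static void DownloadFile(string url)
    {
        Console.WriteLine("downloading " + url); // placeholder for the real download
    }

    static IEnumerable<string> FindLinks()
    {
        for (int i = 0; i < 10; i++) yield return "http://example.com/file" + i;
    }
}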
The ReaderWriterLockSlim class is designed for that.
Also, check this brilliant website about threading:
http://www.albahari.com/threading/part4.aspx#_Reader_Writer_Locks
The example comes from the website above.
using System;
using System.Collections.Generic;
using System.Threading;

class SlimDemo
{
    static ReaderWriterLockSlim _rw = new ReaderWriterLockSlim();
    static List<int> _items = new List<int>();
    static Random _rand = new Random();

    static void Main()
    {
        new Thread(Read).Start();
        new Thread(Read).Start();
        new Thread(Read).Start();
        new Thread(Write).Start("A");
        new Thread(Write).Start("B");
    }

    static void Read()
    {
        while (true)
        {
            _rw.EnterReadLock();
            foreach (int i in _items) Thread.Sleep(10);
            _rw.ExitReadLock();
        }
    }

    static void Write(object threadID)
    {
        while (true)
        {
            int newNumber = GetRandNum(100);
            _rw.EnterWriteLock();
            _items.Add(newNumber);
            _rw.ExitWriteLock();
            Console.WriteLine("Thread " + threadID + " added " + newNumber);
            Thread.Sleep(100);
        }
    }

    static int GetRandNum(int max) { lock (_rand) return _rand.Next(max); }
}
Use a Concurrent collection for the communication between the boss and its work crew.
Either ConcurrentQueue (if you care about the order) or ConcurrentBag.
The boss adds to the ConcurrentQueue (Enqueue method) and the crew takes from the queue (TryDequeue method). Let me know if you need code.
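For illustration, a minimal sketch of that boss/crew exchange:

var work = new ConcurrentQueue<string>();

// Boss:
work.Enqueue("job 1");

// Crew member: polls the queue; TryDequeue returns false when it is empty.
string job;
while (work.TryDequeue(out job))
    Console.WriteLine("crew processing " + job);

If the crew should block rather than poll, wrapping the queue in a BlockingCollection (as in the other answers) gives you a blocking Take.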
I would suggest looking into the Task Parallel Library. This wraps up the method calls very cleanly, and manages your multiple threads for you.
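For instance, a hedged sketch of handing each download to the TPL; Download and urls are placeholders for your own method and link source:

var downloads = new List<Task>();
foreach (var url in urls)                         // urls: your discovered links
    downloads.Add(Task.Run(() => Download(url))); // Download: your existing method

Task.WaitAll(downloads.ToArray()); // the scheduler manages the threads for you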