I want to implement "producer/two consumers" functionality.
Producer: scans directories recursively and adds directory information to some storage (I guess Queue<>)
Consumer 1: retrieves data about a directory and writes it to an XML file.
Consumer 2: retrieves data about a directory and adds it to a TreeNode.
So both consumers have to work with the same data: if one of them calls Dequeue(), the other will miss that item.
The only idea I have is to make two different Queue<> objects and have the producer fill both with the same data; each consumer would then work with its own Queue.
I hope you can suggest something more elegant.
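For reference, here is a minimal sketch of that two-queue idea using BlockingCollection (the root directory and the consumer bodies are placeholders):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class TwoQueueDemo
{
    static void Main()
    {
        var xmlQueue  = new BlockingCollection<DirectoryInfo>();
        var treeQueue = new BlockingCollection<DirectoryInfo>();

        var producer = Task.Factory.StartNew(() =>
        {
            var root = new DirectoryInfo(@"C:\SomeRoot");   // placeholder root
            foreach (var dir in root.EnumerateDirectories("*", SearchOption.AllDirectories))
            {
                xmlQueue.Add(dir);    // the same item goes into both queues
                treeQueue.Add(dir);
            }
            xmlQueue.CompleteAdding();
            treeQueue.CompleteAdding();
        });

        var xmlConsumer = Task.Factory.StartNew(() =>
        {
            foreach (var dir in xmlQueue.GetConsumingEnumerable())
                Console.WriteLine("XML: " + dir.FullName);   // stand-in for the XML writer
        });

        var treeConsumer = Task.Factory.StartNew(() =>
        {
            foreach (var dir in treeQueue.GetConsumingEnumerable())
                Console.WriteLine("Tree: " + dir.FullName);  // stand-in for the TreeNode builder
        });

        Task.WaitAll(producer, xmlConsumer, treeConsumer);
    }
}

(Real code would also handle UnauthorizedAccessException during the recursive scan.)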
LMAX Disruptor is one solution to this problem.
Article: http://martinfowler.com/articles/lmax.html
Illustration of the single-producer, multithreaded consumer ring buffer: http://martinfowler.com/articles/images/lmax/disruptor.png
Be aware that you will need good, nearly expert-level, knowledge of how atomic instructions and lock-free algorithms work on your target platform.
The description below is different from LMAX - I adapted it to the OP's scenario.
The underlying structure could be either a ring buffer (fixed capacity) or a lock-free linked list (unlimited capacity, but only available on platforms that support certain kinds of multi-word atomic instructions).
The producer will just push stuff to the front.
Each consumer keeps an iterator to the item it is currently processing, and advances its own iterator at its own pace.
Besides the consumers, there is also a trailing garbage collector which will also try to advance, but it will not advance past any of the consumer's iterators. Thus, it will eventually clean up items that both consumers have finished processing, and only those items.
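A much-simplified sketch of that layout for a single producer and a fixed set of consumers (this is not the real Disruptor: it busy-spins instead of using pluggable wait strategies, and the trailing clean-up is implicit in the producer's wrap check; all names are illustrative):

using System;
using System.Threading;

class MiniRing<T>
{
    private readonly T[] _slots;
    private readonly int _mask;
    private long _head = -1;            // last published sequence (written by the producer only)
    private readonly long[] _cursors;   // last consumed sequence, one per consumer

    public MiniRing(int capacityPowerOfTwo, int consumerCount)
    {
        _slots = new T[capacityPowerOfTwo];
        _mask = capacityPowerOfTwo - 1;
        _cursors = new long[consumerCount];
        for (int i = 0; i < consumerCount; i++) _cursors[i] = -1;
    }

    public void Publish(T item)
    {
        long next = _head + 1;
        // Flow control: never lap the slowest consumer's iterator.
        while (next - SlowestCursor() > _slots.Length) Thread.SpinWait(64);
        _slots[(int)(next & _mask)] = item;
        Volatile.Write(ref _head, next);    // release: slot contents become visible before the head
    }

    public T Take(int consumer)
    {
        long want = _cursors[consumer] + 1;
        while (Volatile.Read(ref _head) < want) Thread.SpinWait(64);
        T item = _slots[(int)(want & _mask)];
        Volatile.Write(ref _cursors[consumer], want);   // advance this consumer's iterator
        return item;
    }

    private long SlowestCursor()
    {
        long min = long.MaxValue;
        for (int i = 0; i < _cursors.Length; i++)
            min = Math.Min(min, Volatile.Read(ref _cursors[i]));
        return min;
    }
}

Because the producer refuses to overwrite a slot until every cursor has passed it, slots are reclaimed exactly when both consumers have finished with them.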
You could use ZeroMQ, which has this functionality built into it (and a lot more) -
http://learning-0mq-with-pyzmq.readthedocs.org/en/latest/pyzmq/patterns/pushpull.html
The above example is with Python code, but there are .NET bindings -
http://zeromq.org/bindings:clr
Related
We already have parallel fan-out working in our code (using ParallelEnumerable), currently running on a 12-core, 64 GB RAM server. But we would like to convert the code to use Rx so that we can have better flexibility over our downstream pipeline.
Current Workflow:
1. We read millions of records from a database (in a streaming fashion).
2. On the client side, we then use a custom OrderablePartitioner<T> class to group the database records. Let's call an instance of this class: partioner.
3. We then use partioner.AsParallel().WithDegreeOfParallelism(5).ForAll(group => ProcessGroupOfRecordsAsync(group)); Note: this could be read as "process all the groups, five at a time in parallel" (i.e. parallel fan-out).
4. ProcessGroupOfRecordsAsync() loops through all the records in the group and turns them into hundreds or even thousands of POCO objects for further processing (i.e. serial fan-out, or better yet, expand).
5. Depending on the client's needs, this new serial stream of POCO objects is evaluated, sorted, ranked, transformed, filtered, filtered by manual process, and possibly fanned out further in parallel and/or serially through the rest of the pipeline.
6. The end of the pipeline may store new records into the database, display the POCO objects in a form, or display them in various graphs.
The process currently works just fine, except that points 5 and 6 aren't as flexible as we would like. We need the ability to swap various downstream workflows in and out. So, our first attempt was to use a Func<Tin, Tout>, like so:
partioner.AsParallel()
    .WithDegreeOfParallelism(5)
    .ForAll(group => ProcessGroupOfRecordsAsync(group,
        singleRecord => NextTaskInWorkFlow(singleRecord)));
And that works okay, but the more we fleshed out our needs, the more we realized we were just re-implementing Rx.
Therefore, we would like to do something like the following in Rx:
IObservable<recordGroup> rg = dbContext.QueryRecords(inputArgs)
    .AsParallel().WithDegreeOfParallelism(5)
    .ProcessGroupOfRecordsInParallel();

if (client1)
    rg.AnalizeRecordsForClient1().ShowResults();

if (client2)
    rg.AnalizeRecordsForClient2()
      .AsParallel()
      .WithDegreeOfParallelism(3)
      .MoreProcessingInParallel()
      .DisplayGraph()
      .GetUserFeedBack()
      .Where(data => data.SaveToDatabase)
      .Select(data => data.NewRecords)
      .SaveToDatabase(Table2);
...
using (rg.Subscribe(groupId => LogToScreen("Group {0} finished.", groupId)));
It sounds like you might want to investigate Dataflow in the Task Parallel Library - this might be a better fit than Rx for dealing with part 5, and could be extended to handle the whole problem.
In general, I don't like the idea of trying to use Rx to parallelize CPU-bound tasks; it's usually not a good fit. If you are not careful, you can inadvertently introduce inefficiencies. Dataflow gives you a nice way to parallelize only where it makes the most sense.
From MSDN:
The Task Parallel Library (TPL) provides dataflow components to help increase the robustness of concurrency-enabled applications. These dataflow components are collectively referred to as the TPL Dataflow Library. This dataflow model promotes actor-based programming by providing in-process message passing for coarse-grained dataflow and pipelining tasks. The dataflow components build on the types and scheduling infrastructure of the TPL and integrate with the C#, Visual Basic, and F# language support for asynchronous programming. These dataflow components are useful when you have multiple operations that must communicate with one another asynchronously or when you want to process data as it becomes available. For example, consider an application that processes image data from a web camera. By using the dataflow model, the application can process image frames as they become available. If the application enhances image frames, for example, by performing light correction or red-eye reduction, you can create a pipeline of dataflow components. Each stage of the pipeline might use more coarse-grained parallelism functionality, such as the functionality that is provided by the TPL, to transform the image.
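A minimal sketch of what that could look like here (the block payloads and the five-way parallelism are assumptions based on the question):

using System;
using System.Threading.Tasks.Dataflow;

class DataflowSketch
{
    static void Main()
    {
        // Stage 1: process up to five groups at a time, in parallel.
        var processGroups = new TransformBlock<int, string>(
            group => "processed group " + group,        // stand-in for ProcessGroupOfRecordsAsync
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

        // Stage 2: a serial downstream step (MaxDegreeOfParallelism defaults to 1).
        var sink = new ActionBlock<string>(result => Console.WriteLine(result));

        processGroups.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 20; i++)
            processGroups.Post(i);

        processGroups.Complete();
        sink.Completion.Wait();
    }
}

Swapping a different downstream workflow in or out then amounts to linking a different chain of blocks to processGroups.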
Kaboo!
As no one has provided anything definite, I'll point out that the Rx source code can be browsed on GitHub. From a quick tour around it, it looks like at least some of the processing (all of it?) is already done on the thread pool. So it may not be possible to control the degree of parallelization explicitly other than by implementing your own scheduler (e.g. the Rx TestScheduler), but it happens nevertheless. See also the links below; judging from the answers (especially the one provided by James in the first link), observable tasks are queued and processed serially by design - but one can provide multiple streams for Rx to process.
See also the related questions. In particular, Reactive Extensions: Concurrency within the subscriber could provide some answers to your question, or maybe Run methods in Parallel using Reactive.
Edit: Just a note that if storing objects to the database becomes a bottleneck, the Rx stream could push the save operations to, say, a ConcurrentQueue, which would then be processed separately. Another option would be to let Rx batch items with a suitable combination of time and item count and push them to the database by bulk insert.
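A minimal sketch of that batching idea (the record type and the bulk-insert sink are hypothetical):

using System;
using System.Collections.Generic;
using System.Reactive.Linq;

class BulkInsertSketch
{
    class Record { }                                        // hypothetical record type
    static void BulkInsert(IList<Record> batch) { }         // hypothetical single-round-trip insert

    static void Main()
    {
        IObservable<Record> records =
            Observable.Interval(TimeSpan.FromMilliseconds(1))    // stand-in for the real stream
                      .Select(_ => new Record());

        using (records
            .Buffer(TimeSpan.FromSeconds(2), 500)   // flush every 2 seconds or 500 items, whichever comes first
            .Where(batch => batch.Count > 0)
            .Subscribe(batch => BulkInsert(batch)))
        {
            Console.ReadLine();                     // keep the subscription alive
        }
    }
}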
I am looking for advice on how best to architect a buffer structure that can handle a massive amount of incoming data that is processed more slowly than it arrives.
I programmed a customized binary reader that can stream up to 12 million byte arrays per second on a single thread, and I am looking to process the byte-array stream in a separate structure on the same machine, on a different thread. The problem is that the consuming structure cannot keep up with the producer's output, so I believe I need some sort of buffer to handle this properly. I am most interested in advice regarding the overall architecture rather than code examples. I target .NET 4.0. Here is more information about my current setup and requirements.
Producer: Runs on a dedicated thread and reads byte arrays from files on a physical storage medium (SSD, OCZ Vertex 3 Max IOPS). Approximate throughput is 12 million byte arrays per second. Each array is only 16 bytes in size. Fully implemented.
Consumer: Runs on a separate thread from the producer. It consumes byte arrays but must parse them into several primitive data types before processing, so the processing speed is significantly slower than the producer's publishing speed. The consumer structure is fully implemented.
In between: Looking to set up a buffered structure that the producer can publish to and the consumer can, well, consume from. Not implemented.
I would be happy if some of you could comment, from your own experience or expertise, on what is best to consider when designing such a structure. Should the buffer implement a throttling algorithm that only requests new data from the producer when the buffer/queue is half empty, or similar? How are locking and blocking handled? I am sorry, I have very limited experience in this space; I have so far handled it through a messaging bus, but every messaging-bus technology I looked at is definitely unable to handle the throughput I am looking for. Any comments very welcome!
Edit: Forgot to mention, the data is consumed by a single consumer only. Also, the order in which the arrays are published matters; the order needs to be preserved so that the consumer consumes them in the same order.
16 bytes (call it 16B) is too small for efficient inter-thread comms. Queueing up such small buffers will result in more CPU spent on inter-thread communication than on actual useful processing of the data.
So, chunk them up.
Declare a buffer class (C16B, say) that contains a nice, big array of these 16B records - at least 4K's worth - and a 'count' int to show how many are loaded (the last buffer loaded from a file will probably not be full). It helps to place a cache-line-sized empty byte array just in front of the 16B array, to avoid false sharing. You can put the code that processes the 16B records in as a method, 'Process16B', say, and perhaps the code that loads the array too - taking a file descriptor as a parameter. Instances of this class can now be efficiently loaded up and queued to other threads.
You need a producer-consumer queue class - C# already has one: BlockingCollection.
You need flow-control in this app. I would do it by creating a pool of C16B's - create a blocking queue and create/add a big pile of C16B's in a loop. 1024 is a nice, round number. Now you have a 'pool queue' that provides flow-control, avoids the need to new() any C16B's and you don't need them to be continually garbage-collected.
Once you have this, the rest is easy. In your loader thread, continually dequeue C16B's from the pool queue, load them up with data from the files and add() them off to the processing threads/s on a '16Bprocess' blocking queue. In the processing threads, take() from the 16Bprocess queue and process each C16B instance by calling its Process16B method. When the 16B's are processed, add() the C16B back to the pool queue for re-use.
The recycling of the C16B's via the pool queue provides end-to-end flow-control. If the producer is the fastest link, the pool will eventually empty and the producer will block there until the consumer/s returns some C16B's.
If the processing takes so much time, you could always add another processing thread, if you have spare cores available. The snag with such schemes is that the data will get processed out of order. This may, or may not, matter. If it does, the data flow might need 'straightening out' later, e.g. using sequence numbers and a buffer list.
I would advise dumping the pool queue count (and maybe the 16Bprocess queue count as well) to a status component or the command line on a timer. This provides a useful snapshot of where the C16B instances are, so you can see bottlenecks and any C16B leaks without 3rd-party tools (the ones that slow the whole app down to a crawl and issue spurious leak reports on shutdown).
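A rough sketch of the whole scheme (the chunk size, pool size and file name are assumptions; the parse loop is a placeholder):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// One chunk: a big array of 16-byte records plus a count
// (the last chunk loaded from a file will usually be short).
class C16B
{
    public const int RecordsPerChunk = 4096;
    public readonly byte[] Data = new byte[RecordsPerChunk * 16];
    public int Count;

    public bool LoadFrom(Stream s)
    {
        int read = s.Read(Data, 0, Data.Length);
        Count = read / 16;
        return Count > 0;
    }

    public void Process16B()
    {
        for (int i = 0; i < Count; i++)
        {
            // parse the 16-byte record at offset i * 16 ...
        }
    }
}

class PoolDemo
{
    static void Main()
    {
        var pool    = new BlockingCollection<C16B>();   // the pool queue (flow control)
        var process = new BlockingCollection<C16B>();   // the '16Bprocess' queue
        for (int i = 0; i < 1024; i++) pool.Add(new C16B());

        var loader = Task.Factory.StartNew(() =>
        {
            using (var s = File.OpenRead("data.bin"))   // hypothetical input file
            {
                while (true)
                {
                    C16B chunk = pool.Take();           // blocks when the pool is exhausted
                    if (!chunk.LoadFrom(s)) { pool.Add(chunk); break; }
                    process.Add(chunk);
                }
            }
            process.CompleteAdding();
        });

        foreach (C16B chunk in process.GetConsumingEnumerable())
        {
            chunk.Process16B();
            pool.Add(chunk);                            // recycle: this is the end-to-end flow control
        }
        loader.Wait();
    }
}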
You can use a BlockingCollection; created with a bounded capacity, it will block the producer from adding items to the collection as long as the consumer hasn't consumed enough items.
There are other concurrent collection classes as well, e.g. ConcurrentQueue.
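A minimal sketch (the capacity is an assumption); the bound is what makes Add() block, and the default ConcurrentQueue backing preserves FIFO order, which matches the ordering requirement:

using System.Collections.Concurrent;
using System.Threading.Tasks;

class BoundedDemo
{
    static void Main()
    {
        // Add() blocks once 100,000 arrays are queued;
        // GetConsumingEnumerable() blocks while the queue is empty.
        var queue = new BlockingCollection<byte[]>(boundedCapacity: 100000);

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 1000000; i++) queue.Add(new byte[16]);
            queue.CompleteAdding();
        });

        foreach (byte[] record in queue.GetConsumingEnumerable())
        {
            // parse and process the 16-byte record ...
        }
        producer.Wait();
    }
}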
IMO a blocking queue of some kind may solve your problem. Essentially, the producer thread will block if the queue has no more capacity. Look at this: Creating a blocking Queue<T> in .NET?
Why bother with a buffer at all? Use the disk files as a buffer. When the consumer starts processing a byte array, have the reader read the next one and that's it.
EDIT: After the request to decouple the consumer and producer:
You can have a coordinator that tells the producer to produce X byte arrays, and supplies X byte arrays to the consumer. The three parts can act like this:
Coordinator tells producer to produce X byte arrays.
Producer produces X byte arrays
And now do this in a loop:
Coordinator tells consumer to consume X byte arrays
Coordinator tells producer to produce X byte arrays
Consumer tells coordinator it's done consuming
Loop until there are no more byte arrays
The producer and coordinator can run in the same thread. The consumer should have its own thread.
You will have almost no locking (I think you can do this with no locking at all, just a single wait handle the consumer uses to notify the coordinator it's done), and your coordinator is extremely simple.
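A rough sketch of that scheme (this version uses two AutoResetEvents, one per direction, rather than literally a single wait handle; the batch size and the producer stand-in are assumptions):

using System.Threading;

class CoordinatorDemo
{
    const int X = 1000;                              // batch size (assumption)
    static byte[][] batch;                           // the batch handed to the consumer
    static readonly AutoResetEvent batchReady   = new AutoResetEvent(false);
    static readonly AutoResetEvent consumerDone = new AutoResetEvent(false);
    static bool finished;

    static void Main()
    {
        var consumer = new Thread(Consume);
        consumer.Start();

        // Coordinator and producer share this thread.
        byte[][] next = Produce();
        while (next != null)
        {
            batch = next;
            batchReady.Set();                        // tell the consumer to consume X arrays
            next = Produce();                        // produce the next X meanwhile
            consumerDone.WaitOne();                  // consumer says it's done consuming
        }
        finished = true;
        batchReady.Set();
        consumer.Join();
    }

    static int produced;
    static byte[][] Produce()                        // stand-in for the file reader
    {
        if (produced >= 10 * X) return null;         // pretend the files run out
        var b = new byte[X][];
        for (int i = 0; i < X; i++) b[i] = new byte[16];
        produced += X;
        return b;
    }

    static void Consume()
    {
        while (true)
        {
            batchReady.WaitOne();
            if (finished) return;
            foreach (var arr in batch) { /* parse the 16-byte record */ }
            consumerDone.Set();
        }
    }
}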
RE-EDIT: Another, really decoupled, option
Use ZeroMQ for handling the communications. The producer reads byte arrays and posts each array to a ZeroMQ socket. The consumer reads arrays from a ZeroMQ socket.
ZeroMQ is very efficient and fast, and handles all the technicalities (thread synchronization, buffering, etc.) internally. When used on the same computer, you won't suffer any data loss either (which might happen when using UDP across two different machines).
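A minimal sketch using NetMQ, one of the .NET bindings (the endpoint name is arbitrary; PUSH/PULL with a single puller preserves publish order, which matches the requirement):

using System.Threading;
using NetMQ;
using NetMQ.Sockets;

class ZmqDemo
{
    static void Main()
    {
        using (var push = new PushSocket("@inproc://records"))         // '@' = bind
        {
            var consumer = new Thread(() =>
            {
                using (var pull = new PullSocket(">inproc://records"))  // '>' = connect
                {
                    for (int i = 0; i < 1000; i++)
                    {
                        byte[] record = pull.ReceiveFrameBytes();       // blocks until data arrives
                        // parse the 16-byte record ...
                    }
                }
            });
            consumer.Start();

            for (int i = 0; i < 1000; i++)
                push.SendFrame(new byte[16]);                           // one record per message

            consumer.Join();
        }
        NetMQConfig.Cleanup();
    }
}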
I would like to use the new concurrent collections in .NET 4.0 to solve the following problem.
The basic data structure I want to have is a producer consumer queue, there will be a single consumer and multiple producers.
There are items of type A, B, C, D, E that will be added to this queue. Items of type A, B, C are added in the normal manner and processed in order.
However, items of type D or E can appear in the queue at most once. If one of these is to be added while another of the same type is already in the queue and not yet processed, the existing one should be updated in place. Its queue position would not change (i.e. it would not move to the back of the queue) after the update.
Which .NET 4.0 classes would be best for this?
I don't think there is any (priority) queue in .NET 4 that supports an atomic AddOrUpdate operation. Only ConcurrentDictionary supports it, and it is not suitable if you need the order preserved.
So one option may be to use some combination of the two.
However, please be aware that you lose the safety of the concurrent structures as soon as you perform combined operations on them; you must implement the locking mechanism on your own (look here for an example of such a situation: A .Net4 Gem: The ConcurrentDictionary - Tips & Tricks).
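A minimal sketch of that combination (all names illustrative): a plain Queue for ordering plus a Dictionary for the in-place updates, guarded by one lock. Callers would use a fresh key (e.g. a Guid) for each A/B/C item and a fixed key per type for D and E:

using System;
using System.Collections.Generic;

class UpdatableQueue<TKey, TValue>
{
    private readonly object _gate = new object();
    private readonly Queue<TKey> _order = new Queue<TKey>();
    private readonly Dictionary<TKey, TValue> _pending = new Dictionary<TKey, TValue>();

    public void AddOrUpdate(TKey key, TValue value)
    {
        lock (_gate)
        {
            if (!_pending.ContainsKey(key))
                _order.Enqueue(key);    // a new entry goes to the back
            _pending[key] = value;      // an existing entry is updated in place, keeping its position
        }
    }

    public bool TryTake(out TValue value)
    {
        lock (_gate)
        {
            if (_order.Count == 0) { value = default(TValue); return false; }
            TKey key = _order.Dequeue();
            value = _pending[key];
            _pending.Remove(key);
            return true;
        }
    }
}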
A second option would be to search for third-party implementations.
I am asking this question with the C# tag, but if it is possible at all, it should be possible in any language.
Is it possible to implement a doubly linked list using Interlocked operations to provide no-wait locking? I would want insert, add, remove, and clear, all without waiting.
Yes it's possible, here's my implementation of an STL-like Lock-Free Doubly-Linked List in C++.
Sample code that spawns threads to randomly perform ops on a list
It requires a 64-bit compare-and-swap to operate without ABA issues. This list is only possible because of a lock-free memory manager.
Check out the benchmarks on page 12. Performance of the list scales linearly with the number of threads as contention increases. The algorithm supports parallelism for disjoint accesses, so as the list size increases contention can decrease.
A simple Google search will reveal many lock-free doubly-linked-list papers.
However, they are based on atomic CAS (compare and swap).
I don't know how atomic the operations in C# are, but according to this website
http://www.albahari.com/threading/part4.aspx
C# operations are only guaranteed to be atomic for reading and writing a 32-bit field. No mention of CAS.
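For what it's worth, .NET does expose a single-word CAS: Interlocked.CompareExchange. A minimal sketch of a lock-free push onto a singly linked stack (the doubly linked case, which needs more than one word to change atomically, is what the papers address):

using System.Threading;

class LockFreeStack<T>
{
    private class Node
    {
        public readonly T Value;
        public Node Next;
        public Node(T value) { Value = value; }
    }

    private Node _head;

    public void Push(T value)
    {
        var node = new Node(value);
        Node seen;
        do
        {
            seen = _head;          // snapshot the current head
            node.Next = seen;
        }   // publish only if no other thread moved the head in the meantime
        while (Interlocked.CompareExchange(ref _head, node, seen) != seen);
    }
}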
Here is a paper which describes a lock-free doubly linked list:
We present an efficient and practical lock-free implementation of a concurrent deque that is disjoint-parallel accessible and uses atomic primitives which are available in modern computer systems. Previously known lock-free algorithms of deques are either based on non-available atomic synchronization primitives, only implement a subset of the functionality, or are not designed for disjoint accesses. Our algorithm is based on a doubly linked list, and only requires single-word compare-and-swap...
Ross Bencina has some really good links I just found, with numerous papers and source-code examples, under "Some notes on lock-free and wait-free algorithms".
I don't believe this is possible, since you're having to set multiple references in one shot, and the interlocked operations are limited in their power.
For example, take the add operation - if you're inserting node B between A and C, you need to set B->next, B->prev, A->next, and C->prev in one atomic operation. Interlocked can't handle that. Presetting B's elements doesn't even help, because another thread could decide to do an insert while you're preparing "B".
I'd focus more on getting the locking as fine-grained as possible in this case, not trying to eliminate it.
Read the footnote - they plan to pull ConcurrentLinkedList from 4.0 prior to the final release of VS2010
Well, you haven't actually asked how to do it. But provided you can do an atomic CAS in C#, it's entirely possible.
In fact, I'm working through an implementation of a doubly linked wait-free list in C++ right now.
Here is a paper describing it:
http://www.cse.chalmers.se/~tsigas/papers/Haakan-Thesis.pdf
And a presentation that may also provide you some clues.
http://www.ida.liu.se/~chrke/courses/MULTI/slides/Lock-Free_DoublyLinkedList.pdf
It is possible to write lock-free algorithms for all copyable data structures on most architectures [1]. But it is hard to write efficient ones.
I wrote an implementation of the lock-free doubly linked list by Håkan Sundell and Philippas Tsigas for .NET. Note that it does not support an atomic PopLeft, due to the underlying concept.
[1]: Maurice Herlihy: Impossibility and universality results for wait-free synchronization (1988)
FWIW, .NET 4.0 is adding a ConcurrentLinkedList, a threadsafe doubly linked list in the System.Collections.Concurrent namespace. You can read the documentation or the blog post describing it.
I would say that the answer is a very deeply qualified "yes, it is possible, but hard". To implement what you're asking for, you'd basically need something that would compile the operations together to ensure no collisions; as such, it would be very hard to create a general implementation for that purpose, and it would still have some significant limitations. It would probably be simpler to create a specific implementation tailored to the precise needs, and even then, it wouldn't be "simple" by any means.
We need to develop some kind of buffer management for an application we are developing using C#.
Essentially, the application receives messages from devices as and when they come in (there could be many in a short space of time). We need to queue them up in some kind of buffer pool so that we can process them in a managed fashion.
We were thinking of allocating a block of memory in 256-byte chunks (all messages are smaller than that) and then using buffer-pool management to maintain a pool of buffers available for incoming messages and a pool of buffers ready to be processed.
So the flow would be "Get a buffer" (process it) "Release buffer" or "Leave it in the pool". We would also need to know when the buffer was filling up.
Potentially, we would also need a way to "peek" into the buffers to see what the highest priority buffer in the pool is rather than always getting the next buffer.
Is there already support for this in .NET or is there some open source code that we could use?
C#'s memory management is actually quite good, so instead of having a pool of buffers you could just allocate exactly what you need and stick it into a queue. Once you are done with a buffer, just let the garbage collector handle it.
Another option (knowing only very little about your application) is to process the messages minimally as you get them and turn them into full-fledged objects (with priorities and all); then your queue could prioritize them just by inspecting the correct set of attributes or methods.
If your messages come in too fast even for minimal processing you could have a two queue system. One is just a queue of unprocessed buffers, and the next queue is the queue of message objects built from the buffers.
I hope this helps.
@grieve: Networking is native, meaning that when buffers are used to receive/send data on the network, they are pinned in memory. See my comments below for elaboration.
Why wouldn't you just receive the messages, create a DeviceMessage (for lack of a better name) object, and put that object into a Queue? If prioritization is important, implement a PriorityQueue class that handles it automatically (by placing the DeviceMessage objects in priority order as they're inserted into the queue). That seems like a more OO approach, and it would simplify maintenance of the prioritization over time.
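A minimal sketch of such a PriorityQueue (names illustrative): one lock around sorted FIFO buckets keeps messages ordered by priority, and FIFO within a priority:

using System.Collections.Generic;

class PriorityQueue<T>
{
    private readonly object _gate = new object();
    private readonly SortedDictionary<int, Queue<T>> _buckets =
        new SortedDictionary<int, Queue<T>>();

    public void Enqueue(T item, int priority)   // lower value = higher priority
    {
        lock (_gate)
        {
            Queue<T> bucket;
            if (!_buckets.TryGetValue(priority, out bucket))
                _buckets.Add(priority, bucket = new Queue<T>());
            bucket.Enqueue(item);
        }
    }

    public bool TryDequeue(out T item)
    {
        lock (_gate)
        {
            if (_buckets.Count == 0) { item = default(T); return false; }
            int top = 0;
            foreach (int key in _buckets.Keys) { top = key; break; }   // lowest key first
            Queue<T> bucket = _buckets[top];
            item = bucket.Dequeue();
            if (bucket.Count == 0) _buckets.Remove(top);
            return true;
        }
    }
}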
I know this is an old post, but I think you should take a look at the memory pool implemented in the ILNumerics project. I think they did exactly what you need and it is a very nice piece of code.
Download the code at http://ilnumerics.net/ and take a look at the file ILMemoryPool.cs
I'm doing something similar. I have messages coming in on MTA threads that need to be serviced on STA threads.
I used a BlockingCollection (part of the parallel fx extensions) that is monitored by several STA threads (configurable, but defaults to xr * the number of cores). Each thread tries to pop a message off the queue. They either time out and try again or successfully pop a message off and service it.
I've got it wired with perfmon counters to keep track of idle time, job lengths, incoming messages, etc, which can be used to tweak the queue's settings.
You'd have to implement a custom collection, or perhaps extend BC, to implement queue item priorities.
One of the reasons why I implemented it this way is that, as I understand it, queueing theory generally favors single-line, multiple-servers (why do I feel like I'm going to catch crap about that?).