Performance of BlockingCollection in C# 4.0

My BlockingCollection piles up more than a normal Queue does. In the following scenario:
I have one dedicated thread as a consumer.
Three or more dedicated threads act as producers.
I have tested with a normal Queue (guarded with Monitor.Enter...) as well as a BlockingCollection.
Results:
Both queues pile up (obviously, since consumers < producers).
The normal Queue is automatically cleared at some point and does not keep growing past 20,000 or 30,000 items.
But the BlockingCollection keeps growing past hundreds of thousands of items, and obviously we have no Clear option; at the same time I don't want to restrict the producers.
Can anyone shed some light?

This is a suggestion I keep making: try ZeroMQ. The producer/consumer pattern is well supported (use PUSH and PULL sockets), and it will be blindingly fast. Since you're staying within the same process, you have no message loss to worry about.
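For a rough idea of the shape, here is a minimal in-process PUSH/PULL sketch using NetMQ (a C# implementation of ZeroMQ); the inproc address and message strings are placeholders, not anything from the question:

```csharp
using System;
using NetMQ;             // NuGet package: NetMQ (C# implementation of ZeroMQ)
using NetMQ.Sockets;

class PushPullDemo
{
    static void Main()
    {
        // "@" = bind, ">" = connect in NetMQ address strings.
        // The consumer binds a PULL socket; each producer connects its own PUSH socket.
        using (var pull = new PullSocket("@inproc://work"))
        using (var push = new PushSocket(">inproc://work"))
        {
            // Producer side: cheap fire-and-forget sends, no explicit locking.
            push.SendFrame("job-1");
            push.SendFrame("job-2");

            // Consumer side: blocks until a frame is available.
            Console.WriteLine(pull.ReceiveFrameString());   // job-1
            Console.WriteLine(pull.ReceiveFrameString());   // job-2
        }
    }
}
```

In real use, each producer thread would own its own PushSocket (NetMQ sockets are not thread-safe), all connected to the one PULL socket the consumer binds.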

Related

Multithreaded approach to process SQS item Queue

In this scenario, I have to poll AWS SQS messages from a queue; each async request can fetch up to 10 SQS items/messages. Once I poll the items, I have to process them on a Kubernetes pod. Item processing includes getting responses from a few API calls, which may take some time, and then saving the item to the DB and S3.
I did some R&D and reached the following conclusions:
Use a producer/consumer model: one thread polls items and another thread processes them, or use multi-threading for the item processing.
Maintain a data structure that contains the polled SQS items ready for processing; the data structure could be a BlockingCollection or a ConcurrentQueue.
Use the Task Parallel Library for thread pooling and for the item processing.
Channels could also be used.
My queries:
What would be the best approach to achieve the best performance or increase TPS?
Can/should I use TPL Dataflow?
Multi-threaded, or single-threaded with async tasks?
This is very dependent on the specifics of your use case and how much effort you want to put in.
I will, however, explain the thought process I would use when making such a decision.
The naive solution to handle SQS messages would be to do it one at a time sequentially (i.e. without concurrency). It doesn't mean that you're limited to a single message at a time since you can add more pods to the cluster.
So even in that naive solution you have one concurrency point you can utilize but it has a lot of overhead. The way to reduce overhead is usually to utilize the same overhead but process more messages with it. That's why, for example, SQS allows you to get 1-10 messages in a single call and not just one. It spreads the call overhead over 10 messages. In the naive solution the overhead is the cost of starting a whole process. Using the process for more messages means concurrent processing.
I've found that for stable and flexible concurrency you want many points of concurrency, but have each of them capped at some configurable degree of parallelism (whether hardcoded or actual configuration). That way you can tweak each of them to achieve optimal output (increase when you have free CPU and memory and decrease otherwise).
So, where can the additional concurrency be introduced? This is a progression where each step utilizes resources better but requires more effort.
Fetch 10 messages instead of one on every SQS API call and process them concurrently. That way you have 2 points of concurrency you can control: the number of pods, and the number of messages (up to 10) processed concurrently.
Have a few tasks each fetching 1-10 messages and processing them concurrently. That's 3 concurrency points: pods, tasks, and messages per task. Both these solutions suffer from messages with varying processing times: a single long-running message will "hold up" the other 1-9 "slots" of work, effectively reducing the concurrency to lower than configured.
Set up a TPL Dataflow block to process the messages concurrently, plus a task (or a few) continuously fetching messages and pumping them into the block. Keep in mind that SQS messages need to be explicitly deleted, so the block needs to receive the message handle too so the message can be deleted after processing (a rough sketch follows this list).
TPL Dataflow "pipe" consisting of a few blocks where each has it's own concurrency degree. That's useful when you have different steps of processing of the message where each step has different limitations (e.g. different APIs with different throttling configurations).
I personally am very fond of, and comfortable with, the Dataflow library so I would go straight to it. But simpler solutions are also valid when performance is less of an issue.
I'm not familiar with Kubernetes but there are many things to consider when maximising throughput.
Everything you have mentioned is IO-bound, not CPU-bound, so using the TPL would overcomplicate the design for marginal benefit. See: https://learn.microsoft.com/en-us/dotnet/csharp/async#recognize-cpu-bound-and-io-bound-work
Your Kubernetes pods are likely to have network limitations. For example, an Azure Function App on a Consumption Plan is limited to 1,200 outbound connections; other services have defined limits too: https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections?tabs=csharp#connection-limit. Given the nature of your work, it is likely that you will reach these limits before you need to process IO work on multiple threads.
You may also need to consider limits of the services which you are dependent on and ensure they are able to handle the throughput.
You may want to consider using semaphores to limit the number of active connections, to satisfy both your infrastructure and external dependency limits: https://learn.microsoft.com/en-us/dotnet/api/system.threading.semaphoreslim?view=net-5.0
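For illustration, a minimal sketch of gating outbound calls with SemaphoreSlim; the limit of 100 and the CallApiAsync/url names are placeholders, not taken from the question:

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledClient
{
    // Allow at most 100 concurrent outbound requests (placeholder limit).
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(100);
    static readonly HttpClient Http = new HttpClient();

    public static async Task<string> CallApiAsync(string url)
    {
        await Gate.WaitAsync();
        try
        {
            return await Http.GetStringAsync(url);
        }
        finally
        {
            Gate.Release();   // always release, even if the call throws
        }
    }
}
```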
That being said, 500 messages per second is a realistic amount. To improve it further, you can look at having multiple processes with independent resource limitations processing the queue.
Not familiar with your use case, or specifically with the tech you are using, but this sounds like a very common message handling scenario.
Few guidelines:
First, these are guidelines; your use case might be very different from what the people commenting here are used to.
Whenever you want to increase your throughput you need to identify your bottlenecks, and strive towards a CPU bottleneck, making sure you fully utilize it. CPU load is usually the most expensive, and generally makes for a more reliable metric for autoscaling. Obviously, depending on your remote API calls and your DB, you might reach other bottlenecks. SQS queue size also makes for a good autoscaling metric, but keep in mind that autoscaling isn't guaranteed to increase your throughput if your bottleneck is DB- or API-related.
I would not go for a fancy solution with complex data structures; again, I'm not familiar with your use case, so I might be wrong, but keep it simple. There should be one thread responsible for polling the queue, and when it finds new messages it should create a Task that processes a batch. There should generally be one Task per processing batch; let the ThreadPool handle the number of threads.
I'm not familiar with the .NET SQS library. However, I am familiar with other libraries for very similar solutions. Most queue libraries out there already do all of this for you, and you don't really have to worry about it. You should probably just have a callback function that is called when the (highly optimized) library finds new messages. Those libraries probably already create a new task for each of those batches; you just need to register their callback and make sure you await any I/O-bound code.
Edit: The solution I am proposing does have a limitation in that a single message can block an entire batch; this is not necessarily a bad thing. If your solution requires different processing for different messages and you don't want to create this inner-batch dependency, TPL Dataflow could definitely be a good solution for your use case.
Yeah, this sounds very much like a task for TPL Dataflow; it is a very versatile yet powerful instrument. Your first chain link would acquire messages from the queue (not necessarily single-threadedly, you just pass some delegates in). You will also be in control of how many items are "queued" locally this way.
Then you "subscribe" your workers in any way you desire – you can even customize it so that "faulted" processings would be put back into your queue — and it woudn't even matter if your processing is IO bound or not. If it is — well, nice, TPL dataflow is asyncronous, if not — well, not a problem, TPL dataflow can also be syncronous. Or you can fire up some thread pool threads, no biggie.

How can I set up a high-traffic queue

I am trying to set up a concurrent queue that will enqueue data objects coming in from one thread while another thread dequeues the data objects and processes them. I have used a BlockingCollection<T> and used the GetConsumingEnumerable() method to create a solution that works pretty well in simple usage. My problem lies in the facts that:
the data is coming in quickly, data items being enqueued approximately every 50ms
processing each item will likely take significantly longer than 50ms
I must maintain the order of the data items while processing as some of the data items represent events that must be fired in the proper order.
On my development machine, which is a pretty powerful setup, it seems the cutoff is about 60ms of processing time for getting things to work right. Beyond that, I have problems either with having the queue grow continuously (not dequeuing fast enough) or having the data items processed in the wrong order depending on how I set up the processing with regard to whether/how much/where I parallelize. Does anyone have any tips/tricks/solutions or can point me to such that will help me here?
Edit: As pointed out below, my issue is most likely not with the queuing structure itself so much as it is with trying to dequeue and process the items faster. Are there tricks/tips/etc. for portioning out the processing work so that I can keep dequeuing quickly while still maintaining the order of the incoming data items?
Edit (again): Thanks for all your replies! It's obvious I need to put some more work into this. This is all great input, though, and I think it will help point me in the right direction! I will reply again either with a solution that I came up with or a more detailed question and code sample. Thanks again.
Update: In the end, we went with a BlockingCollection backed by a ConcurrentQueue. The queue worked perfectly for what we wanted. In the end, as many mentioned, the key was making the processing side as fast and efficient as possible. There is really no way around that. We used parallelization where we found it helped (in some cases it actually hurt performance), cached data in certain areas, and tried to avoid locking scenarios. We did manage to get something working that performs well enough that the processing side can keep up with the data updates. Thanks again to everyone who kicked in a response!
If you are using the TPL on .NET 4.0, you can investigate the TPL Dataflow library; it's not third-party, it's a library from Microsoft distributed via NuGet, and it provides logic that preserves the order of the data being processed in your system.
As I understand it, you have data that arrives in order, and that order has to be maintained after some work is done on each item. You can use the TransformBlock class for this, or a BufferBlock linked with an ActionBlock: simply put the data on its input, set up the action you need to run on each item, and link the block with the classes you need (you can even make it IObservable to create a responsive UI).
As I said, TPL Dataflow blocks encapsulate FIFO queue logic, and they preserve the order of the results of their action. The code you write with them is multithreading-oriented (see MaxDegreeOfParallelism in TPL Dataflow).
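A minimal sketch of that order-preserving behaviour, with a trivial stand-in for the per-item work: even with MaxDegreeOfParallelism greater than 1, a TransformBlock hands its outputs to the next block in the order the inputs were posted.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;   // NuGet: System.Threading.Tasks.Dataflow

class OrderedPipeline
{
    static async Task Main()
    {
        // Process up to 4 items at a time; outputs still come out in input order.
        var process = new TransformBlock<int, string>(
            async n =>
            {
                await Task.Delay(100 - n);            // later items finish their "work" sooner
                return $"item {n} processed";
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        var print = new ActionBlock<string>(Console.WriteLine);
        process.LinkTo(print, new DataflowLinkOptions { PropagateCompletion = true });

        for (int n = 0; n < 10; n++)
            process.Post(n);

        process.Complete();
        await print.Completion;   // prints items 0..9 in order despite parallel processing
    }
}
```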
I think that you are okay with the blocking queue. I enqueue thousands of messages per second into a BlockingCollection and the overhead is very small. I think you should do the following:
Add a synchronized sequence number when enqueuing the messages
Use multiple consumers to try to overload the queue
In general, focus on the processing time. The default backing collection for BlockingCollection is ConcurrentQueue, so by default it is a FIFO (first-in, first-out) queue; something else seems to be wrong.
some of the data items represent events that must be fired in the proper order.
Then you may differentiate dependent items and process them in order while processing other items in parallel. Maybe you can build 2 separate queues: one for items to be processed in order, dequeued and processed by a single thread, and another dequeued by multiple threads.
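A minimal sketch of that two-queue split, assuming a hypothetical Item type with an IsOrdered flag (both names are made up; shutdown via CompleteAdding is omitted):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class Item
{
    public bool IsOrdered;      // must be processed in arrival order
    public string Payload;
}

class TwoQueueProcessor
{
    readonly BlockingCollection<Item> _ordered = new BlockingCollection<Item>();
    readonly BlockingCollection<Item> _parallel = new BlockingCollection<Item>();

    public void Enqueue(Item item) =>
        (item.IsOrdered ? _ordered : _parallel).Add(item);

    public void Start()
    {
        // Single consumer keeps ordered items strictly in FIFO order.
        Task.Run(() =>
        {
            foreach (var item in _ordered.GetConsumingEnumerable())
                Process(item);
        });

        // Several consumers drain the order-insensitive queue concurrently.
        for (int i = 0; i < 4; i++)
            Task.Run(() =>
            {
                foreach (var item in _parallel.GetConsumingEnumerable())
                    Process(item);
            });
    }

    void Process(Item item) => Console.WriteLine(item.Payload);   // placeholder
}
```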
We need to know more about input and expected processing.

Could MSMQ resolve the performance bottleneck of our multithreaded services?

We wrote a service that uses ~200 threads.
The 200 threads must do:
1. Download from the internet
2. Parse the raw data (HTML, XML, JSON, ...)
3. Store the newly created data in the DB
With ~10 threads, the elapsed time for the second operation (parsing) is 50 ms per thread.
With ~50 threads, the elapsed time for the second operation (parsing) is 80-18,000 ms per thread.
So we have an idea:
We can keep the downloads multithreaded, but use MSMQ to send the raw data to another process (the consumer), and that other process implements the second part (parsing) single-threaded.
You may ask: why don't we use the C# Queue class in the same process? Because we could not protect our "precious parsing thread" from thread context switches. If there are 200 threads in the same process, the precious one will be a context-switch victim.
Is using MSMQ for this requirement normal?
Yes, this is an excellent example of where MSMQ makes a lot of sense. You can offload your difficult work to a different process to handle without affecting the performance of your current process which clearly doesn't care about the results. Not only that, but if your new worker process goes down, the queue will preserve state and messages (other than maybe the one being worked on when it went down) will not be lost.
Depending on your needs and goals I'd consider offloading the download to the other process as well - passing URLs to work on to the queue for example. Then, scaling up your system is as easy as dialing up the queue receivers, since queue messages are received in a thread safe manner when implemented correctly.
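For illustration, a bare-bones System.Messaging sketch of that hand-off; the queue path and string payload are placeholders, and real code would want error handling and probably a typed message body:

```csharp
using System.Messaging;   // reference System.Messaging.dll (.NET Framework)

class MsmqHandoff
{
    const string Path = @".\Private$\rawdata";   // placeholder private queue

    // Producer process: downloader threads push raw documents here.
    public static void Send(string rawData)
    {
        if (!MessageQueue.Exists(Path))
            MessageQueue.Create(Path);

        using (var queue = new MessageQueue(Path))
            queue.Send(rawData);
    }

    // Consumer process: a single parsing thread drains the queue.
    public static void ConsumeLoop()
    {
        using (var queue = new MessageQueue(Path))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            while (true)
            {
                var message = queue.Receive();    // blocks until a message arrives
                Parse((string)message.Body);      // single-threaded parsing
            }
        }
    }

    static void Parse(string rawData) { /* placeholder */ }
}
```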
Yes, it is normal. And there are frameworks/libraries that help you build this kind of solution, providing you with more than just transport.
NServiceBus or MassTransit are examples (both can sit on top of MSMQ)

Load balancing with shared priority queues

I am trying to implement a load balancer at the moment and have hit a bit of a speed bump. The situation is as follows (simplified):
I have a queue of requests queue_a which are processed by worker_a
There is a second queue of requests queue_b which are processed by worker_b
And I have a third queue of requests queue_c that can go to either of the workers
The reason for this kind of setup is that each worker has unique requests that only it can process, but there are also general requests that anyone can process.
I was going to implement this basically using 3 instances of the C5 IntervalHeap. Each worker would have access to its local queue + the shared queues that it is a part of (e.g., worker_a could see queue_a & queue_c).
The problem with this idea is that if there is a request in the local queue and a request in the shared queue(s) with the same priority, it's impossible to know which one should be processed first (the IntervalHeap is normally first-come-first-serve when this happens).
EDIT: I have discovered that IntervalHeap appears not to be first-come-first-serve with same-priority requests!
I would like to minimise locking across the queues, as it will be relatively high-throughput and time-sensitive, but the only way I can think of at the moment would involve a lot more complexity: the third queue is removed and shared requests are placed into both queue_a and queue_b. When a shared request is picked up, the worker would know it is shared and would have to remove it from the other queues.
Hope that explains it clearly enough!
It seems that you'll simply end up pushing the bubble around - no matter how you arrange it, in the worst case you'll have three things of equal priority to execute by only two workers. What sort of tie breaking criteria could you apply beyond priority in order to choose which queue to pull the next task from?
Here are two ideas:
Pick the queue at random. All priorities are equal so it shouldn't matter which one is chosen. On average in the worst case, all queues will be serviced at roughly the same rate.
Minimize queue length by taking from the queue that has the largest number of elements. This might cause some starvation of other queues if one queue's fill rate is consistently higher than others.
HTH
With synchronization, your workers can share the same pool of resources as well as each having their own private queue. If there is 1 item available in worker 1's private queue and 1 item available in the shared queue, it would be a shame if worker 1 picked up the item from the shared queue first, since this will limit parallel runs. Rather, you want worker 1 to pick up the private item first. This, however, leads to new caveats, one being that when worker 1 and worker 2 are both busy handling private items, older shared items will not be picked up.
Finding a solution that addresses these problems will be very difficult while also trying to keep the complexity down. A simple implementation is to handle shared items only when the private queue is empty. This does not tackle the part where priorities are not handled correctly in high-load scenarios (e.g. where the shared queue won't be handled since the private queues are always full). To balance this, you might want to handle the private queue first only if the other worker's private queue is empty. This is still not a perfect solution, since it will still prefer private queue items over shared items. Addressing this problem again can be achieved by setting up multiple strategies, but that brings even more complexity.
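For illustration only, a minimal sketch of the simplest strategy ("take shared work only when my private queue is empty"), using ConcurrentQueue instead of the C5 IntervalHeap to keep it short (so priorities are ignored here); the Request type and the sleep interval are placeholders:

```csharp
using System.Collections.Concurrent;
using System.Threading;

class Request { public string Payload; }

class Worker
{
    readonly ConcurrentQueue<Request> _private;
    readonly ConcurrentQueue<Request> _shared;

    public Worker(ConcurrentQueue<Request> privateQueue, ConcurrentQueue<Request> sharedQueue)
    {
        _private = privateQueue;
        _shared = sharedQueue;
    }

    public void Run(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Prefer private work; fall back to the shared queue only when idle.
            if (_private.TryDequeue(out var request) || _shared.TryDequeue(out request))
                Process(request);
            else
                Thread.Sleep(1);   // nothing to do; avoid spinning hot
        }
    }

    void Process(Request request) { /* placeholder */ }
}
```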
It all depends on your requirements.

Are Socket.*Async methods threaded?

I'm currently trying to figure out the best way to minimize the number of threads I use in a TCP master server, in order to maximize performance.
As I've been reading a lot recently about the new async features of C# 5.0, asynchronous does not necessarily mean multithreaded. It could mean separated into smaller chunks of finite-state objects, then processed alongside other operations by alternating. However, I don't see how this could be done in networking, since I'm basically "waiting" for input (from the client).
Therefore, I wouldn't use ReceiveAsync() for all my sockets; it would just be creating and ending threads continuously (assuming it does create threads).
Consequently, my question is more or less: what architecture can a master server take without having one "thread" per connection?
Side question for bonus coolness points: why is having multiple threads bad, considering that having more threads than processing cores simply makes the machine "fake" multithreading, just like any other asynchronous method would?
No, you would not necessarily be creating threads. There are two possible ways you can do async without setting up and tearing down threads all the time:
You can have a "small" number of long-lived threads, and have them sleep when there's no work to do (this means that the OS will never schedule them for execution, so the resource drain is minimal). Then, when work arrives (i.e. Async method called), wake one of them up and tell it what needs to be done. Pleased to meet you, managed thread pool.
In Windows, the most efficient mechanism for async is I/O completion ports, which synchronize access to I/O operations and allow a small number of threads to manage massive workloads.
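For example, here is a minimal sketch of a TCP server that leans on the thread pool and IOCP instead of one thread per connection; the port number and echo handler are placeholders:

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class AsyncEchoServer
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Any, 9000);   // placeholder port
        listener.Start();

        while (true)
        {
            // No thread is blocked while we wait for the next connection.
            var client = await listener.AcceptTcpClientAsync();
            _ = HandleAsync(client);   // fire-and-forget; uses pool threads only while doing work
        }
    }

    static async Task HandleAsync(TcpClient client)
    {
        using (client)
        {
            var stream = client.GetStream();
            var buffer = new byte[4096];
            int read;
            // The OS (via IOCP) signals the awaits when data arrives; no dedicated thread waits.
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
                await stream.WriteAsync(buffer, 0, read);   // echo back
        }
    }
}
```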
Regarding multiple threads:
Having multiple threads is not bad for performance, if
the number of threads is not excessive
the threads do not oversaturate the CPU
If the number of threads is excessive then obviously we are taxing the OS with having to keep track of and schedule all these threads, which uses up global resources and slows it down.
If the threads are CPU-bound, then the OS will need to perform much more frequent context switches in order to maintain fairness, and context switches kill performance. In fact, with user-mode threads (which all highly scalable systems use -- think RDBMS) we make our lives harder just so we can avoid context switches.
Update:
I just found this question, which lends support to the position that you can't say how many threads are too much beforehand -- there are just too many unknown variables.
Seems like the *Async methods use IOCP (by looking at the code with Reflector).
Jon's answer is great. As for the "side question"... see http://en.wikipedia.org/wiki/Amdahl%27s_law. Amdahl's law says that serial code quickly diminishes the gains to be had from parallel code. We also know that thread coordination (scheduling, context switching, etc.) is serial, so at some point more threads mean there are so many serial steps that the parallelization benefits are lost and you have a net negative performance. This is tricky stuff. That's why there is so much effort going into letting .NET manage threads while we define "tasks" for the framework to decide which thread to run on. The framework can switch between tasks much more efficiently than the OS can switch between threads, because the OS has a lot of extra things to worry about when doing so.
Asynchronous work can be done without one thread per connection or a thread pool, using OS support for select or poll (Windows supports this; it is exposed via Socket.Select). I am not sure of its performance on Windows, but it is a very common idiom elsewhere.
One thread is the "pump" that manages the IO connections and monitors changes to the streams and then dispatches messages to/from other threads (conceivably 0 ... n depending upon model). Approaches with 0 or 1 additional threads may fall into the "Event Machine" category like twisted (Python) or POE (Perl). With >1 threads the callers form an "implicit thread pool" (themselves) and basically just offload the blocking IO.
There are also approaches like Actors, Continuations or Fibres exposed in the underlying models of some languages which alter how the basic problem is approached -- don't wait, react.
Happy coding.
