I'm implementing a service for tasks processing and I would like to manage the quality of the service giving greater priority to certain types of tasks. There are four types of tasks, and for this reason I would use four queues, one for each type of task.
Could be it convenient to create four processing threads (one for each queue) and assign to them different priorities?
Or should I should make the processing thread take care mainly with the higher priority queue?
Are there other approaches?
I would suggest having a single thread that is responsible for grabbing tasks.
There are many, many possible strategies. One is simply to have 4 queues, and try to cycle between them. Another is to stick tasks into a priority queue (typically implemented with a heap data structure), but if you do that then be aware that all of the higher priority tasks will be taken before any lower priority tasks. A third is to use a priority queue based on age so you can take the oldest request first - then make high priority requests artificially old. (I could suggest the age of the oldest thing in the queue plus a constant term.)
One general point to keep in mind. If you assign sufficient capacity, your queues will likely remain reasonably short. If your capacity is insufficient, then queues will grow without bound and in the long run you can think of your queueing problem as one of triage rather than prioritization. But if possible, it often works well to try to increase capacity instead of being clever in prioritization.
Related
In this scenerio, I have to Poll AWS SQS messages from a queue, each async request can fetch upto 10 sqs items/messages. Once I Poll the items, Then I have to process those items on a kubernetes pod. Item processing includes getting response from few API calls, it may take some time & then saving the item to DB & S3.
I did some R&D & reach on following conclusion
To use consumer producer model, 1 thread will poll items & another thread will process the item or to use multi-threading for item processing
Maintain a data structure that will containes sqs polled items ready for processing, DS could be Blocking collection or Concurrent queue
Using Task Parellel Library for threadpooling & in item processing.
Channels can be used
My Queries
What would be best approach to achieve best performance or increase TPS.
Can/Should I use data flow TPL
Multi threaded or single threaded with asyn tasks
This is very dependant on the specifics of your use-case and how much effort would you want to put in.
I will, however, explain the thought process I would use when making such a decision.
The naive solution to handle SQS messages would be to do it one at a time sequentially (i.e. without concurrency). It doesn't mean that you're limited to a single message at a time since you can add more pods to the cluster.
So even in that naive solution you have one concurrency point you can utilize but it has a lot of overhead. The way to reduce overhead is usually to utilize the same overhead but process more messages with it. That's why, for example, SQS allows you to get 1-10 messages in a single call and not just one. It spreads the call overhead over 10 messages. In the naive solution the overhead is the cost of starting a whole process. Using the process for more messages means concurrent processing.
I've found that for stable and flexible concurrency you want many points of concurrency, but have each of them capped at some configurable degree of parallelism (whether hardcoded or actual configuration). That way you can tweak each of them to achieve optimal output (increase when you have free CPU and memory and decrease otherwise).
So, where can the additional concurrency be introduced? This is a progression where each step utilizes resources better but requires more effort.
Fetch 10 messages instead of one for every SQS API call and process them concurrently. That way you have 2 points of concurrency you can control: Number of pods, number of messages (up to 10) concurrently.
Have a few tasks each fetching 1-10 tasks and processing them concurrently. That's 3 concurrency points: Pods, tasks and messages per task. Both these solutions suffer from messages with varying processing time, meaning that a single long running message will "hold up" all the other 1-9 "slots" of work effectively reducing the concurrency to lower than configured.
Set up a TPL Dataflow block to process the messages concurrently and a task (or few) continuously fetching messages and pumping into the block. Keep in mind that SQS messages need to be explicitly deleted so the block needs to receive the message handle too so the message can be deleted after processing.
TPL Dataflow "pipe" consisting of a few blocks where each has it's own concurrency degree. That's useful when you have different steps of processing of the message where each step has different limitations (e.g. different APIs with different throttling configurations).
I personally am very fond of, and comfortable with, the Dataflow library so I would go straight to it. But simpler solutions are also valid when performance is less of an issue.
I'm not familiar with Kubernetes but there are many things to consider when maximising throughput.
All the things which you have mentioned is IO bound not CPU bound. So, using TPL is overcomplicating the design for marginal benefit. See: https://learn.microsoft.com/en-us/dotnet/csharp/async#recognize-cpu-bound-and-io-bound-work
Your Kubernetes pods are likely to have network limitations. For example, with Azure Function Apps on Consumption Plans is limited to 1,200 outbound connections. Other services will have some defined limits, too. https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections?tabs=csharp#connection-limit. Due to the nature of your work, it is likely that you will reach these limits before you need to process IO work on multiple threads.
You may also need to consider limits of the services which you are dependent on and ensure they are able to handle the throughput.
You may want to consider using Semaphores to limit the number of active connections to satisfy both your infrastructure and external dependency limits https://learn.microsoft.com/en-us/dotnet/api/system.threading.semaphoreslim?view=net-5.0
That being said, 500 messages per second is a realistic amount. To improve it further, you can look at having multiple processes with independent resource limitations processing the queue.
Not familiar with your use case, or specifically with the tech you are using, but this sounds like a very common message handling scenario.
Few guidelines:
First, these are guidelines, your usecase might be very different then what the ones commenting here are used to.
Whenever you want to increase your throughput you need to identify
your bottlenecks, and thrive towards CPU bottleneck, making sure you
fully utilize it. CPU load is usually the most expensive, and
generally makes for a more reliable metric for autoscaling. Obviously, depending on your remote api calls and your DB you might reach other bottlenecks - SQS queue size also makes for a good autoscaling metric, but keep in mind that autoscalling isn't guaranteed to increase you throughput if your bottleneck is DB or API related.
I would not go for a fancy solution with complex data structures, again, not familiar with your usecase, so I might be wrong - but keep it simple. There should be one thread that is responsible for polling the queue, and when it finds new messages it should create a Task that processes a batch. There should generally be one Task per processing batch - let the ThreadPool handle the number of threads.
Not familiar with .net SQS library. However, I am familiar with other libraries for very similar solutions. Most Libraries for queues out there already do it all for you, and you don't really have to worry about it. You should probably just have a callback function that is called when the highly optimized library already finds new messages. Those libraries probably already create a new task for each of those batches - you just need to register to their callback, and make sure you await any I/O bound code.
Edit: The solution I am proposing does have a limitation in that a single message can block an entire batch, this is not necessarily a bad thing - if your solution requires different processing for different messages, and you don't want to create this inner batch dependency, a TPL DataFlow could definitely be a good solution for your usecase.
Yeah, this sounds very much like the task for TPL Dataflow, it is very versatile yet powerful instrument. Your first chain link would acquire messages from the queue (not neccessarily one-threaded-ly, you just pass some delegates in). You will also be in control of how many items are "queued" locally this way.
Then you "subscribe" your workers in any way you desire – you can even customize it so that "faulted" processings would be put back into your queue — and it woudn't even matter if your processing is IO bound or not. If it is — well, nice, TPL dataflow is asyncronous, if not — well, not a problem, TPL dataflow can also be syncronous. Or you can fire up some thread pool threads, no biggie.
An image speaks more than words, so here is basically what I want to achieve :
(I have also used a fruit analogy for the sake of genericity an simplicity)
I've done this kind of stuff many time in the past using different king of .Net classes (BackGroundWOrkers, ThreadPool, Self Made Stuff...)
I am asking here for the sake of advice and to get fresh ideas on how to do this efficiently.
This is a high computing program so I am receiving Millions of (similar in structure but not in content) data, that have to be queued in order to be processed according to its content type. Hence, I want to avoid creating a parallel task for each single data to be processed (this overloads the CPU and is poor design IMHO). That's why I got the idea of having only ONE thread running for EACH data TYPE, dedicated to processing it (knowing that the "Press Juice" method is generic and independent of the fruit to be pressed)
Any Ideas and implementation suggestions are welcome.
I am free to give any further details.
TPL DataFlow seems like a very strong candidate for this.
Take a read of the intro here.
If all you really want is one thread (or a constant number of threads) for each type of fruit, then the simplest solution might be to use a BlockingCollection for each type of fruit. Your data bus will deliver the fruit to those collections, and your processing threads will take from them. But this means if there are no apples for now, the thread will be blocked, doing nothing.
A more flexible and efficient approach would be to use TPL Dataflow. With that, you don't work with threads or tasks, you work with blocks. For example your Thread C could be represented as a TransformBlock<Apple, AppleJuice>.
By default, each block uses at most one thread, but they can be easily configured to use more of them (by setting MaxDegreeOfParallelism). Also, dataflow blocks work well with the new C# 5.0 async-await, which could be a big advantage.
There are also things you should be careful about. For example, TDF is by default optimized for throughput, not latency. So, if your thread pool is busy and you have lots of oranges incoming and only one apple, it's possible that the apple will be processed only after all of the oranges are. But this can be also fixed by configuring the blocks properly (by setting MaxMessagesPerTask).
I would caution against a "worker thread per data type" approach. This makes the assumption that actual input load will conform to the equivalence classes that are handy for developers. Do you know if bananas are 5x slower to juice than oranges? What happens if every Tuesday is "apple celebration day" and everyone juices more fruit than usual, and all of it is apples?
Running things in parallel is about performance, not about the domain. Don't model it after your domain, model it to provide the lowest average cycle time.
In an attempt to speed up processing of physics objects in C# I decided to change a linear update algorithm into a parallel algorithm. I believed the best approach was to use the ThreadPool as it is built for completing a queue of jobs.
When I first implemented the parallel algorithm, I queued up a job for every physics object. Keep in mind, a single job completes fairly quickly (updates forces, velocity, position, checks for collision with the old state of any surrounding objects to make it thread safe, etc). I would then wait on all jobs to be finished using a single wait handle, with an interlocked integer that I decremented each time a physics object completed (upon hitting zero, I then set the wait handle). The wait was required as the next task I needed to do involved having the objects all be updated.
The first thing I noticed was that performance was crazy. When averaged, the thread pooling seemed to be going a bit faster, but had massive spikes in performance (on the order of 10 ms per update, with random jumps to 40-60ms). I attempted to profile this using ANTS, however I could not gain any insight into why the spikes were occurring.
My next approach was to still use the ThreadPool, however instead I split all the objects into groups. I initially started with only 8 groups, as that was how any cores my computer had. The performance was great. It far outperformed the single threaded approach, and had no spikes (about 6ms per update).
The only thing I thought about was that, if one job completed before the others, there would be an idle core. Therefore, I increased the number of jobs to about 20, and even up to 500. As I expected, it dropped to 5ms.
So my questions are as follows:
Why would spikes occur when I made the job sizes quick / many?
Is there any insight into how the ThreadPool is implemented that would help me to understand how best to use it?
Using threads has a price - you need context switching, you need locking (the job queue is most probably locked when a thread tries to fetch a new job) - it all comes at a price. This price is usually small compared to the actual work your thread is doing, but if the work ends quickly, the price becomes meaningful.
Your solution seems correct. A reasonable rule of thumb is to have twice as many threads as there are cores.
As you probably expect yourself, the spikes are likely caused by the code that manages the thread pools and distributes tasks to them.
For parallel programming, there are more sophisticated approaches than "manually" distributing work across different threads (even if using the threadpool).
See Parallel Programming in the .NET Framework for instance for an overview and different options. In your case, the "solution" may be as simple as this:
Parallel.ForEach(physicObjects, physicObject => Process(physicObject));
Here's my take on your two questions:
I'd like to start with question 2 (how the thread pool works) because it actually holds the key to answering question 1. The thread pool is implemented (without going into details) as a (thread-safe) work queue and a group of worker threads (which may shrink or enlarge as needed). As the user calls QueueUserWorkItem the task is put into the work queue. The workers keep polling the queue and taking work if they are idle. Once they manage to take a task, they execute it and then return to the queue for more work (this is very important!). So the work is done by the workers on-demand: as the workers become idle they take more pieces of work to do.
Having said the above, it's simple to see what is the answer to question 1 (why did you see a performance difference with more fine-grained tasks): it's because with fine-grain you get more load-balancing (a very desirable property), i.e. your workers do more or less the same amount of work and all cores are exploited uniformly. As you said, with a coarse-grain task distribution, there may be longer and shorter tasks, so one or more cores may be lagging behind, slowing down the overall computation, while other do nothing. With small tasks the problem goes away. Each worker thread takes one small task at a time and then goes back for more. If one thread picks up a shorter task it will go to the queue more often, If it takes a longer task it will go to the queue less often, so things are balanced.
Finally, when the jobs are too fine-grained, and considering that the pool may enlarge to over 1K threads, there is very high contention on the queue when all threads go back to take more work (which happens very often), which may account for the spikes you are seeing. If the underlying implementation uses a blocking lock to access the queue, then context switches are very frequent which hurts performance a lot and makes it seem rather random.
answer of question 1:
this is because of Thread switching , thread switching (or context switching in OS concepts) is CPU clocks that takes to switch between each thread , most of times multi-threading increases the speed of programs and process but when it's process is so small and quick size then context switching will take more time than thread's self process so the whole program throughput decreases, you can find more information about this in O.S concepts books .
answer of question 2:
actually i have a overall insight of ThreadPool , and i cant explain what is it's structure exactly.
to learn more about ThreadPool start here ThreadPool Class
each version of .NET Framework adds more and more capabilities utilizing ThreadPool indirectly. such as Parallel.ForEach Method mentioned before added in .NET 4 along with System.Threading.Tasks which makes code more readable and neat. You can learn more on this here Task Schedulers as well.
At very basic level what it does is: it creates let's say 20 threads and puts them into a lits. Each time it receives a delegate to execute async it takes idle thread from the list and executes delegate. if no available threads found it puts it into a queue. every time deletegate execution completes it will check if queue has any item and if so peeks one and executes in the same thread.
I am trying to implement a load balancer at the moment and have hit a bit of a speed bump. The situation is as follows (simplified),
I have a queue of requests queue_a which are processed by worker_a
There is a second queue of requests queue_b which are processed by worker_b
And I have a third queue of requests queue_c that can go to either of the workers
The reason for this kind of setup is that each worker has unique requests that only it can process, but there are also general requests that anyone can process.
I was going to implement this basically using 3 instances of the C5 IntervalHeap. Each worker would have access to its local queue + the shared queues that it is a part of (e.g., worker_a could see queue_a & queue_c).
The problem with this idea is that if there is a request in the local queue and a request in the shared queue(s) with the same priority, it's impossible to know which one should be processed first (the IntervalHeap is normally first-come-first-serve when this happens).
EDIT: I have discovered IntervalHeap appears to not be first-come-first-server with same priority requests!
I would like to minimise locking across the queues as it will be relatively high throughput and time sensitive, but the only way I can think of at the moment would involve a lot more complexity where the third queue is removed and shared requests are placed into both queue_a and queue_b. When the request is sucked up it would know it is a shared request and have to remove it from the other queues.
Hope that explains it clearly enough!
It seems that you'll simply end up pushing the bubble around - no matter how you arrange it, in the worst case you'll have three things of equal priority to execute by only two workers. What sort of tie breaking criteria could you apply beyond priority in order to choose which queue to pull the next task from?
Here are two ideas:
Pick the queue at random. All priorities are equal so it shouldn't matter which one is chosen. On average in the worst case, all queues will be serviced at roughly the same rate.
Minimize queue length by taking from the queue that has the largest number of elements. This might cause some starvation of other queues if one queue's fill rate is consistently higher than others.
HTH
Synchronizing your workers can share the same pool of resources as well as their private queue. Of there is 1 item available in the queue for worker 1 and 1 item available in the shared queue, it would be a shame if worker 1 picks up the item of the shared queue first since this will limit parallel runs. Rather you want worker 1 to pick up the private item first, this however leads to new caveats, one being where worker 1 and worker 2 are both busy handling private items and therefore older shared items will not be picked up.
Finding a solution that addresses these problems will be very difficult when also trying to keep the complexity down. A simple implementation is only to handle shared items when the private queue is empty. This does not tackle the part where priorities are not handled correctly on high load scenario's. (e.g. where the shared queue wont be handled since the private queues are always full). To balance this, you might want to handle the private queue first, only if the other workers private queue is empty. This is still not a perfect solution since this will still prefer private queue items over shared items. Addressing this problem again can be achieved by setting up multiple strategies but here comes even more complexity.
It all depends on your requirements.
I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. The task is created by reading data from the database but a collection of all tasks would exceed the amount of memory on the machine so tasks have to be created, processed and disposed. I am wondering what would be a good approach to solve this problem? I am thinking the following two approaches:
Implement a synchronized task queue. Implement a producer (task creater) that read data from database and put task in the queue (limit the number of tasks currently in the queue to a constant value to make sure that the amount of memory is not exceeded). Have multiple consumer processes (task processor) that read task from the queue, process task, store the result and dispose the task. What would be a good number of consumer processes in this approach?
Use .NET Parallel extension (PLINQ or parallel for), but I understand that a collection of tasks have to be created (Can we add tasks to the collection while processing in the parallel for?). So we will create a batch of tasks -- say N tasks at a time and do process these batch of tasks and read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU bound then configure your system initially so that the number of threads in your system is equal to the number of hardware threads that your box can run.
If your tasks aren't CPU bound then you'll have to experiment with the pool size to get an optimal solution for your particular situation
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you're happy.
I've not had the opportunity to actually use PLINQ, however I do know that PLINQ (like vanilla LINQ) is based on IEnumerable. As such, I think this might be a case where it would make sense to implement the task producer via C# iterator blocks (i.e. the yield keyword).
Assuming you are not doing any operations where the entire set of tasks must be known in advance (e.g. ordering), I would expect that PLINQ would only consume as many tasks as it could process at once. Also, this article references some strategies for controlling just how PLINQ goes about consuming input (the section titled "Processing Query Output").
EDIT: Comparing PLINQ to a ThreadPool.
According to this MSDN article, efficiently allocating work to a thread pool is not at all trivial, and even when you do it "right", using the TPL generally exhibits better performance.
Use the ThreadPool.
Then you can queue up everything and items will be run as threads become available to the pool without overwhelming the system. The only trick is determining the optimum number of threads to run at a time.
Sounds like a job for Microsoft HPC Server 2008. Given that it's the number of tasks that's overwhelming, you need some kind of parallel process manager. That's what HPC server is all about.
http://www.microsoft.com/hpc/en/us/default.aspx
In order to give a good answer we need a few questions answered.
Is each individual task parallelizable? Or each task is the product of a parallelizable main task?
Also, is it the number of tasks that would cause the system to run out of memory, or is it the quantity of data each task holds and processes that would cause the system to run out of memory?
Sounds like Windows Workflow Foundation (WF) might be a good thing to use to do this. It might also give you some extra benefits such as pause/resume on your tasks.