I am trying to implement a load balancer at the moment and have hit a bit of a speed bump. The situation is as follows (simplified):
I have a queue of requests queue_a which are processed by worker_a
There is a second queue of requests queue_b which are processed by worker_b
And I have a third queue of requests queue_c that can go to either of the workers
The reason for this kind of setup is that each worker has unique requests that only it can process, but there are also general requests that anyone can process.
I was going to implement this basically using 3 instances of the C5 IntervalHeap. Each worker would have access to its local queue + the shared queues that it is a part of (e.g., worker_a could see queue_a & queue_c).
The problem with this idea is that if there is a request in the local queue and a request in the shared queue(s) with the same priority, it's impossible to know which one should be processed first (the IntervalHeap is normally first-come-first-serve when this happens).
EDIT: I have discovered that IntervalHeap appears not to be first-come-first-served with same-priority requests!
I would like to minimise locking across the queues as the system will have relatively high throughput and is time sensitive, but the only alternative I can think of at the moment involves a lot more complexity: remove the third queue and place shared requests into both queue_a and queue_b. When a shared request is picked up, the worker would know it is shared and would have to remove it from the other queue.
Hope that explains it clearly enough!
It seems that you'll simply end up pushing the bubble around - no matter how you arrange it, in the worst case you'll have three things of equal priority to be executed by only two workers. What sort of tie-breaking criteria could you apply beyond priority in order to choose which queue to pull the next task from?
Here are two ideas:
Pick the queue at random. All priorities are equal so it shouldn't matter which one is chosen. On average in the worst case, all queues will be serviced at roughly the same rate.
Minimize queue length by taking from the queue that has the largest number of elements. This might cause some starvation of other queues if one queue's fill rate is consistently higher than others.
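If you do need first-come-first-served among equal priorities (per the edit in the question), another tie-breaker is a global sequence number stamped at enqueue time. Here is a minimal sketch; the PrioritizedRequest wrapper and its priority convention are made up for illustration, and the heap's default comparer can then order items through IComparable:
using System;
using System.Threading;

// Hypothetical wrapper: breaks priority ties by arrival order.
class PrioritizedRequest : IComparable<PrioritizedRequest>
{
    static long _nextSequence;

    public int Priority { get; }      // lower value = higher priority (pick your convention)
    public long Sequence { get; }     // stamped once, at enqueue time, across all queues
    public object Payload { get; }

    public PrioritizedRequest(int priority, object payload)
    {
        Priority = priority;
        Payload = payload;
        Sequence = Interlocked.Increment(ref _nextSequence);
    }

    public int CompareTo(PrioritizedRequest other)
    {
        int byPriority = Priority.CompareTo(other.Priority);
        return byPriority != 0 ? byPriority : Sequence.CompareTo(other.Sequence);
    }
}
Because the sequence counter is shared by all queues, a worker that peeks its private heap and the shared heap can simply take whichever item compares lower, and ties resolve by arrival order.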
HTH
With some synchronization, your workers can share the same pool of resources as well as their private queues. If there is 1 item available in the queue for worker 1 and 1 item available in the shared queue, it would be a shame if worker 1 picked up the item from the shared queue first, since this will limit parallel runs. Rather, you want worker 1 to pick up the private item first. This, however, leads to new caveats, one being that when worker 1 and worker 2 are both busy handling private items, older shared items will not be picked up.
Finding a solution that addresses these problems will be very difficult when also trying to keep the complexity down. A simple implementation is to handle shared items only when the private queue is empty. This does not tackle the part where priorities are not handled correctly in high-load scenarios (e.g. where the shared queue won't be handled since the private queues are always full). To balance this, you might want to handle the private queue first only if the other worker's private queue is empty. This is still not a perfect solution since it will still prefer private queue items over shared items. Addressing this problem again can be achieved by setting up multiple strategies, but that brings even more complexity.
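As a rough illustration of the "private first, but don't starve the shared queue" idea, a worker's dequeue could look something like the sketch below. The Request type, its EnqueuedUtc stamp and the 5-second threshold are assumptions, and plain ConcurrentQueues stand in for whatever queue type you actually use:
using System;
using System.Collections.Concurrent;

// Sketch only: prefer private work, fall back to shared, and force a shared item
// through if it has waited too long (a crude anti-starvation guard).
static bool TryDequeue(ConcurrentQueue<Request> privateQueue,
                       ConcurrentQueue<Request> sharedQueue,
                       out Request request)
{
    // Assumed: Request.EnqueuedUtc is stamped when the item is added.
    if (sharedQueue.TryPeek(out var oldestShared) &&
        DateTime.UtcNow - oldestShared.EnqueuedUtc > TimeSpan.FromSeconds(5))
        return sharedQueue.TryDequeue(out request);    // shared item is getting stale

    return privateQueue.TryDequeue(out request)        // private work first
        || sharedQueue.TryDequeue(out request);        // otherwise help with shared work
}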
It all depends on your requirements.
Related
In this scenario, I have to poll AWS SQS messages from a queue; each async request can fetch up to 10 SQS items/messages. Once I poll the items, I have to process them on a Kubernetes pod. Item processing includes getting responses from a few API calls, which may take some time, and then saving the item to the DB and S3.
I did some R&D and reached the following conclusions:
Use a consumer-producer model: one thread polls items and another thread processes them, or use multi-threading for item processing
Maintain a data structure that contains the polled SQS items ready for processing; the DS could be a BlockingCollection or ConcurrentQueue
Use the Task Parallel Library for thread pooling and item processing
Channels can also be used
My Queries
What would be the best approach to achieve the best performance or increase TPS?
Can/should I use TPL Dataflow?
Multi-threaded, or single-threaded with async tasks?
This is very dependent on the specifics of your use case and how much effort you want to put in.
I will, however, explain the thought process I would use when making such a decision.
The naive solution to handle SQS messages would be to do it one at a time sequentially (i.e. without concurrency). It doesn't mean that you're limited to a single message at a time since you can add more pods to the cluster.
So even in that naive solution you have one concurrency point you can utilize but it has a lot of overhead. The way to reduce overhead is usually to utilize the same overhead but process more messages with it. That's why, for example, SQS allows you to get 1-10 messages in a single call and not just one. It spreads the call overhead over 10 messages. In the naive solution the overhead is the cost of starting a whole process. Using the process for more messages means concurrent processing.
I've found that for stable and flexible concurrency you want many points of concurrency, but have each of them capped at some configurable degree of parallelism (whether hardcoded or actual configuration). That way you can tweak each of them to achieve optimal output (increase when you have free CPU and memory and decrease otherwise).
So, where can the additional concurrency be introduced? This is a progression where each step utilizes resources better but requires more effort.
Fetch 10 messages instead of one for every SQS API call and process them concurrently. That way you have 2 points of concurrency you can control: the number of pods, and the number of messages (up to 10) processed concurrently.
Have a few tasks each fetching 1-10 messages and processing them concurrently. That's 3 concurrency points: pods, tasks, and messages per task. Both of these solutions suffer from messages with varying processing time, meaning that a single long-running message will "hold up" the other 1-9 "slots" of work, effectively reducing the concurrency to lower than configured.
Set up a TPL Dataflow block to process the messages concurrently and a task (or few) continuously fetching messages and pumping into the block. Keep in mind that SQS messages need to be explicitly deleted so the block needs to receive the message handle too so the message can be deleted after processing.
TPL Dataflow "pipe" consisting of a few blocks where each has it's own concurrency degree. That's useful when you have different steps of processing of the message where each step has different limitations (e.g. different APIs with different throttling configurations).
I personally am very fond of, and comfortable with, the Dataflow library so I would go straight to it. But simpler solutions are also valid when performance is less of an issue.
I'm not familiar with Kubernetes but there are many things to consider when maximising throughput.
All the things you have mentioned are IO bound, not CPU bound. So using TPL is over-complicating the design for marginal benefit. See: https://learn.microsoft.com/en-us/dotnet/csharp/async#recognize-cpu-bound-and-io-bound-work
Your Kubernetes pods are likely to have network limitations. For example, an Azure Function App on the Consumption Plan is limited to 1,200 outbound connections. Other services will have some defined limits, too. https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections?tabs=csharp#connection-limit. Due to the nature of your work, it is likely that you will reach these limits before you need to process IO work on multiple threads.
You may also need to consider limits of the services which you are dependent on and ensure they are able to handle the throughput.
You may want to consider using Semaphores to limit the number of active connections to satisfy both your infrastructure and external dependency limits https://learn.microsoft.com/en-us/dotnet/api/system.threading.semaphoreslim?view=net-5.0
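A minimal sketch of that throttling with SemaphoreSlim; the limit of 100 and the CallExternalApiAsync helper are placeholders:
using System.Threading;
using System.Threading.Tasks;

static class OutboundThrottle
{
    // Allow at most 100 concurrent outbound calls (the limit is a placeholder).
    static readonly SemaphoreSlim Slots = new SemaphoreSlim(100);

    public static async Task<string> CallWithThrottleAsync(string request)
    {
        await Slots.WaitAsync();                        // wait for a free slot
        try
        {
            return await CallExternalApiAsync(request); // placeholder for the dependency call
        }
        finally
        {
            Slots.Release();                            // always give the slot back
        }
    }
}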
That being said, 500 messages per second is a realistic amount. To improve it further, you can look at having multiple processes with independent resource limitations processing the queue.
Not familiar with your use case, or specifically with the tech you are using, but this sounds like a very common message handling scenario.
A few guidelines:
First, these are guidelines; your use case might be very different from what the people commenting here are used to.
Whenever you want to increase your throughput you need to identify your bottlenecks, and strive towards a CPU bottleneck, making sure you fully utilize it. CPU load is usually the most expensive, and generally makes for a more reliable metric for autoscaling. Obviously, depending on your remote API calls and your DB, you might reach other bottlenecks. SQS queue size also makes for a good autoscaling metric, but keep in mind that autoscaling isn't guaranteed to increase your throughput if your bottleneck is DB or API related.
I would not go for a fancy solution with complex data structures; again, not familiar with your use case, so I might be wrong - but keep it simple. There should be one thread that is responsible for polling the queue, and when it finds new messages it should create a Task that processes the batch, roughly as sketched below. There should generally be one Task per processing batch - let the ThreadPool handle the number of threads.
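A stripped-down sketch of that shape; PollBatchAsync and ProcessBatchAsync are placeholders for your SQS receive call and your per-batch processing:
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static async Task PollLoopAsync(CancellationToken cancellation)
{
    var inFlight = new List<Task>();
    while (!cancellation.IsCancellationRequested)
    {
        var batch = await PollBatchAsync();                         // placeholder: up to 10 SQS messages
        if (batch.Count > 0)
            inFlight.Add(Task.Run(() => ProcessBatchAsync(batch))); // one Task per batch

        inFlight.RemoveAll(t => t.IsCompleted);                     // forget finished batches
    }
    await Task.WhenAll(inFlight);                                   // drain remaining work on shutdown
}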
Not familiar with the .NET SQS library. However, I am familiar with other libraries for very similar solutions. Most libraries for queues out there already do all of this for you, and you don't really have to worry about it. You should probably just have a callback function that is called when the highly optimized library finds new messages. Those libraries probably already create a new task for each of those batches - you just need to register for their callback and make sure you await any I/O-bound code.
Edit: The solution I am proposing does have a limitation in that a single message can block an entire batch. This is not necessarily a bad thing, but if your solution requires different processing for different messages and you don't want to create this inner batch dependency, TPL Dataflow could definitely be a good solution for your use case.
Yeah, this sounds very much like a task for TPL Dataflow; it is a very versatile yet powerful instrument. Your first chain link would acquire messages from the queue (not necessarily single-threadedly, you just pass some delegates in). You will also be in control of how many items are "queued" locally this way.
Then you "subscribe" your workers in any way you desire - you can even customize it so that "faulted" processings are put back into your queue - and it wouldn't even matter whether your processing is IO bound or not. If it is - well, nice, TPL Dataflow is asynchronous; if not - well, not a problem, TPL Dataflow can also be synchronous. Or you can fire up some thread pool threads, no biggie.
I am trying to set up a concurrent queue that will enqueue data objects coming in from one thread while another thread dequeues the data objects and processes them. I have used a BlockingCollection<T> and used the GetConsumingEnumerable() method to create a solution that works pretty well in simple usage. My problem lies in the facts that:
the data is coming in quickly, data items being enqueued approximately every 50ms
processing each item will likely take significantly longer than 50ms
I must maintain the order of the data items while processing as some of the data items represent events that must be fired in the proper order.
On my development machine, which is a pretty powerful setup, it seems the cutoff is about 60ms of processing time for getting things to work right. Beyond that, I have problems either with having the queue grow continuously (not dequeuing fast enough) or having the data items processed in the wrong order depending on how I set up the processing with regard to whether/how much/where I parallelize. Does anyone have any tips/tricks/solutions or can point me to such that will help me here?
Edit: As pointed out below, my issue is most likely not with the queuing structure itself so much as it is with trying to dequeue and process the items faster. Are there tricks/tips/etc. for portioning out the processing work so that I can keep dequeuing quickly while still maintaining the order of the incoming data items?
Edit (again): Thanks for all your replies! It's obvious I need to put some more work into this. This is all great input, though and I think it will help point me in the right direction! I will reply again either with a solution that I came up with or a more detailed question and code sample! Thanks again.
Update: In the end, we went with a BlockingCollection backed by a ConcurrentQueue. The queue worked perfectly for what we wanted. In the end, as many mentioned, the key was making the processing side as fast and efficient as possible. There is really no way around that. We used parallelization where we found it helped (in some cases it actually hurt performance), cached data in certain areas, and tried to avoid locking scenarios. We did manage to get something working that performs well enough that the processing side can keep up with the data updates. Thanks again to everyone who kicked in a response!
If you are using TPL on .NET 4.0, you can investigate the TPL Dataflow library, as this library (it's not third party; it's a library from Microsoft distributed via NuGet) provides logic that preserves the order of the data being processed in your system.
As I understand it, you get some data which arrives in order, and you have to maintain that order after doing some work on each data item. For this you can use the TransformBlock class, or a BufferBlock linked with an ActionBlock: simply put the data on its input, set up the action you need to run on each item, and link this block with the classes you need (you can even make it IObservable to create a responsive UI).
As I said, TPL Dataflow blocks encapsulate FIFO queue logic, and they preserve the order of the results of their action. The code you write with them is multithreading-oriented (see more about maximum degree of parallelism in TPL Dataflow).
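A minimal sketch of that setup; the DataItem type, the Process and FireEvent methods, and incomingItem are placeholders for the question's own types:
using System.Threading.Tasks.Dataflow;

// Work runs in parallel inside the TransformBlock, but results leave it in arrival order.
var transform = new TransformBlock<DataItem, DataItem>(
    item => Process(item),                                            // placeholder: your per-item work
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// Fires the events strictly in the order the items were posted.
var fireInOrder = new ActionBlock<DataItem>(item => FireEvent(item)); // placeholder

transform.LinkTo(fireInOrder, new DataflowLinkOptions { PropagateCompletion = true });

// Producer side: post items as they arrive (roughly every 50 ms in this question).
transform.Post(incomingItem);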
I think that you are okay with the blocking queue. I enqueue thousands of messages per second into a BlockingCollection and the overhead is very small. I think you should do the following:
Add a synchronized sequence number when enqueuing the messages
Use multiple consumers to try to overload the queue
In general, focus on the processing time. The default collection type for BlockingCollection is ConcurrentQueue, so by default it is a FIFO (first in, first out) queue; something else seems to be wrong.
some of the data items represent events that must be fired in the proper order.
Then you may differentiate dependent items and process them in order while processing other items in parallel. Maybe you can build 2 separate queues: one for items to be processed in order, dequeued and processed with a single thread, and another dequeued by multiple threads.
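A rough sketch of that two-queue split; the DataItem type, its IsOrdered flag and the handlers are assumptions for illustration:
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var orderedQueue  = new BlockingCollection<DataItem>();   // events that must fire in order
var parallelQueue = new BlockingCollection<DataItem>();   // everything else

// A single consumer preserves ordering for the dependent items.
var orderedConsumer = Task.Run(() =>
{
    foreach (var item in orderedQueue.GetConsumingEnumerable())
        FireEvent(item);                                  // placeholder
});

// Several consumers chew through the independent items in parallel.
var parallelConsumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (var item in parallelQueue.GetConsumingEnumerable())
        ProcessItem(item);                                // placeholder
})).ToArray();

// Producer routes each incoming item to the right queue.
void Enqueue(DataItem item) =>
    (item.IsOrdered ? orderedQueue : parallelQueue).Add(item);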
We need to know more about input and expected processing.
I have a program which processes price data coming from the broker. The pseudo code is as follows:
Process[] process = new Process[50];
void tickEvent(object sender, EventArgs e)
{
int contractNumber = e.contractNumber;
doPriceProcess(process[contractNumber], e);
}
Now I would like to use multithreading to speed up my program. If the data are of different contract numbers, I would like to fire off different threads to speed up the process. However, if the data are from the same contract, I would like the program to wait until the current process finishes before I continue with the next data. How do I do it?
Can you provide some code please?
Thanks in advance~
You have many high-level architectural decisions to make here:
How many ticks do you expect to come from that broker?
After all, you should have some kind of dispatcher here.
Here is a simple description of what basically needs to be done:
Encapsulate the incoming ticks in packages, ideally single commands that have all the data needed
Have a queue where you can easily (and thread-safely) store those commands
Have a dispatcher that takes an item off the queue and assigns some worker to do the command (or lets the command execute itself)
For the workers, you can have multiple threads, processes or whatsoever to work multiple commands seamlessly
Maybe you want to do some dispatching already for the input queue, depending on how many requests you want to be able to complete per time unit.
Here is some more information that can be helpful:
Command pattern in C#
Reactor pattern (with sample code)
Rather than holding onto an array of Processes, I would hold onto an array of BlockingCollections. Each blocking collection can correspond to a particular contract. Then you can have producer threads that add work onto the end of the corresponding contract's queue, and consumer tasks that consume the items from those collections. You can ensure that each thread (I would use threads for this, not processes) handles 1-n different queues, but that each queue is handled by no more than one thread. That way you can ensure that no bits of work from the same contract are worked on in parallel.
The threading aspect of this can be handled effectively using C#'s Task class. For your consumers you can create a new task for each BlockingCollection. That task's body will pretty much just be:
foreach(SomeType item in blockingCollections[contractNumber].GetConsumingEnumerable())
processItem(item);
However, by using Tasks you will let the computer schedule them as it sees fit. If it notices most of them sitting around waiting on empty queues it will just have a few (or just one) actual thread rotating between the tasks that it's using. If they are trying to do enough, and your computer can clearly support the load of additional threads, it will add more (possibly adding/removing dynamically as it goes). By letting much smarter people than you or I handle that scheduling it's much more likely to be efficient without under or over parallelizing.
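Fleshing that out slightly, the whole setup might look roughly like this; the Tick type, ProcessTick and the contract count stand in for the question's own types:
using System.Collections.Concurrent;
using System.Threading.Tasks;

const int ContractCount = 50;
var queues = new BlockingCollection<Tick>[ContractCount];
var consumers = new Task[ContractCount];

for (int i = 0; i < ContractCount; i++)
{
    queues[i] = new BlockingCollection<Tick>();
    int contract = i;                               // capture a copy for the lambda
    consumers[contract] = Task.Run(() =>
    {
        // One task per contract, so ticks for the same contract never run in parallel.
        foreach (var tick in queues[contract].GetConsumingEnumerable())
            ProcessTick(tick);                      // placeholder for doPriceProcess
    });
}

// Producer side: the tick event just routes by contract number.
void OnTick(Tick tick) => queues[tick.ContractNumber].Add(tick);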
I'm implementing a service for tasks processing and I would like to manage the quality of the service giving greater priority to certain types of tasks. There are four types of tasks, and for this reason I would use four queues, one for each type of task.
Would it be convenient to create four processing threads (one for each queue) and assign different priorities to them?
Or should I make the processing thread deal mainly with the higher-priority queue?
Are there other approaches?
I would suggest having a single thread that is responsible for grabbing tasks.
There are many, many possible strategies. One is simply to have 4 queues, and try to cycle between them. Another is to stick tasks into a priority queue (typically implemented with a heap data structure), but if you do that then be aware that all of the higher priority tasks will be taken before any lower priority tasks. A third is to use a priority queue based on age so you can take the oldest request first - then make high priority requests artificially old. (I could suggest the age of the oldest thing in the queue plus a constant term.)
One general point to keep in mind. If you assign sufficient capacity, your queues will likely remain reasonably short. If your capacity is insufficient, then queues will grow without bound and in the long run you can think of your queueing problem as one of triage rather than prioritization. But if possible, it often works well to try to increase capacity instead of being clever in prioritization.
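One way to sketch the "artificially old" idea above: subtract a per-priority boost from the enqueue time and always take the smallest effective timestamp. The boost values below are arbitrary placeholders:
using System;

static class AgePriority
{
    // Higher-priority tasks get a larger age boost, so they look "older" sooner,
    // but a sufficiently old low-priority task will still win eventually.
    static readonly TimeSpan[] BoostByPriority =
    {
        TimeSpan.FromSeconds(30),   // priority 0 (highest) - placeholder values
        TimeSpan.FromSeconds(10),   // priority 1
        TimeSpan.FromSeconds(3),    // priority 2
        TimeSpan.Zero               // priority 3 (lowest)
    };

    // Use this as the ordering key of the priority queue and always take the smallest key:
    // old items and high-priority items both sort toward the front.
    public static DateTime EffectiveAgeKey(DateTime enqueuedUtc, int priority) =>
        enqueuedUtc - BoostByPriority[priority];
}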
I have a database table that contains some records to be processed. The table has a flag column that represents the following status values. 1 - ready to be processed, 2- successfully processed, 3- processing failed.
The .NET code (repeating process - console/service) will grab a list of records that are ready to be processed, loop through them, attempt to process them (not very lengthy), and update the status based on success or failure.
To have better performance, I want to enable multithreading for this process. I'm thinking of spawning, say, 6 threads, each thread grabbing a subset.
Obviously I want to avoid having different threads process the same records. I don't want to have a "Being processed" flag in the database, to handle the case where a thread crashes leaving the record hanging.
The only way I see doing this is to grab the complete list of available records and assigning a group (maybe ids) to each thread. If an individual thread fails, its unprocessed records will be picked up next time the process runs.
Are there any other alternatives to dividing the groups prior to assigning them to threads?
The most straightforward way to implement this requirement is to use the Task Parallel Library's Parallel.ForEach (or Parallel.For). Allow it to manage the individual worker threads.
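A minimal sketch of that, where LoadReadyRecords, ProcessRecord and SaveStatus are placeholders for your data access:
using System;
using System.Threading.Tasks;

var records = LoadReadyRecords();                          // placeholder: SELECT ... WHERE status = 1

Parallel.ForEach(
    records,
    new ParallelOptions { MaxDegreeOfParallelism = 6 },    // roughly the "6 threads" idea
    record =>
    {
        try
        {
            ProcessRecord(record);                         // placeholder for the real work
            SaveStatus(record, 2);                         // 2 = successfully processed
        }
        catch (Exception)
        {
            SaveStatus(record, 3);                         // 3 = processing failed
        }
    });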
From experience, I would recommend the following:
Have an additional status "Processing"
Have a column in the database that indicates when a record was picked up for processing, and a cleanup task/process that runs periodically looking for records that have been "Processing" for far too long (reset the status to "ready for processing").
Even though you don't want it, "being processed" will be essential to crash recovery scenarios (unless you can tolerate the same record being processed twice).
Alternatively
Consider using a transactional queue (MSMQ or Rabbit MQ come to mind). They are optimized for this very problem.
That would be my clear choice, having done both at massive scale.
Optimizing
If it takes a non-trivial amount of time to retrieve data from the database, you can consider a Producer/Consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern allows one thread (producer) to populate a queue with DB records to be processed, and multiple other threads (consumers) to process items off of that queue.
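A rough sketch of that producer/consumer shape; the Record type, FetchReadyRecords and ProcessRecord are placeholders:
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var pending = new BlockingCollection<Record>(boundedCapacity: 100);

// Producer: keeps the queue topped up from the database.
var producer = Task.Run(() =>
{
    foreach (var record in FetchReadyRecords())     // placeholder: records with status "ready"
        pending.Add(record);                        // blocks when the queue is full
    pending.CompleteAdding();
});

// Consumers: several workers drain the queue in parallel.
var consumers = Enumerable.Range(0, 6).Select(_ => Task.Run(() =>
{
    foreach (var record in pending.GetConsumingEnumerable())
        ProcessRecord(record);                      // update status to processed/failed in here
})).ToArray();

Task.WaitAll(consumers);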
A New Alternative
Given that several processing steps touch the record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.
I remember doing something like what you described... A thread checks from time to time if there is something new in the database that needs to be processed. It will load only the new ids, so if at time x the last id read is 1000, at x+1 it will read from id 1001.
Everything it reads goes into a thread-safe queue. When items are added to this queue, you notify the working threads (maybe use auto-reset events, or spawn threads here). Each thread will read from this thread-safe queue one item at a time, until the queue is emptied.
You should not assign the work to each thread beforehand (unless you know that for each item the process takes the same amount of time). If a thread finishes its work, it should take load off the ones that are left. Using this thread-safe queue, you make sure of this.
Here is one approach that does not rely on/use an additional database column (but see #4) or mandate an in-process queue. The premise of this approach is to "shard" records across workers based on some consistent value, much like a distributed cache.
Here are my assumptions:
Re-processing does not cause unwanted side-effects; at most some work "is wasted".
The number of threads is fixed upon start-up. This is not a requirement, but it does simplify the implementation and allows me to skip transitory details in the simple description below.
There is only one "worker process" (but see #1) controlling the "worker threads". This simplifies dealing with how the records are split between workers.
There is some [immutable] "ID" column which is "well distributed". This is required so each worker gets about the same amount of work.
Work can be done "out of order" as long as it is "eventually done". Also, workers might not always run "at 100%" due to each one effectively working on a different queue.
Assign each thread a unique bucket value from [0, thread_count). If a thread dies/is restarted it will take the same bucket as that which it vacated.
Then, each time a thread needs a new record, it will fetch it from the database:
SELECT *
FROM record
WHERE state = 'unprocessed'
AND (id % $thread_count) = $bucket
ORDER BY date
There could of course be other assumptions made about reading "this thread's tasks" in a batch and storing them locally. A local queue, however, would be per thread (and thus reloaded upon a new thread's startup), and thus it would only deal with records associated with the given bucket.
When the thread has finished processing a record, it should mark the record as processed using the appropriate isolation level and/or optimistic concurrency, and proceed to the next record.
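In C#, the per-thread loop might look roughly like this. It assumes SQL Server via Microsoft.Data.SqlClient, adds TOP 1 to the query above because the loop takes one record at a time, and the Record type, MapRecord, ProcessRecord and MarkProcessed are placeholders:
using Microsoft.Data.SqlClient;

static void WorkerLoop(string connectionString, int bucket, int threadCount)
{
    using var connection = new SqlConnection(connectionString);
    connection.Open();

    using var command = new SqlCommand(
        @"SELECT TOP 1 * FROM record
          WHERE state = 'unprocessed' AND (id % @threadCount) = @bucket
          ORDER BY date", connection);
    command.Parameters.AddWithValue("@threadCount", threadCount);
    command.Parameters.AddWithValue("@bucket", bucket);

    while (true)
    {
        Record record;                          // placeholder record type
        using (var reader = command.ExecuteReader())
        {
            if (!reader.Read())
                break;                          // nothing left for this bucket right now
            record = MapRecord(reader);         // placeholder: materialize the row
        }

        ProcessRecord(record);                  // placeholder: the actual work
        MarkProcessed(connection, record);      // placeholder: UPDATE ... SET state = 'processed'
    }
}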