I'm working on a project with the following workflow:
A background service consumes messages from a RabbitMQ queue.
The background service uses a background task queue (like this and here) to process the tasks in parallel.
Each task executes queries to retrieve some data and caches it in a collection.
If the collection grows beyond 1,000 objects, I would like to read the collection and then clear it. Since the tasks run in parallel, I don't want another thread to add data to the collection until it has been cleared.
There are BlockingCollection and ConcurrentDictionary (thread-safe collections), but I don't know which mechanism to use.
What's the best way to achieve this?
The collection that seems most suitable for your case is the Channel<T>. This is an asynchronous version of the BlockingCollection<T>, and internally it's based on the same storage (the ConcurrentQueue<T> collection). The similarities are:
They both can be configured to be bounded or unbounded.
A consumer can attempt to take a message even if none is currently available. In that case the Take/TakeAsync call will block, either synchronously or asynchronously, until a message can be consumed or the collection completes, whichever comes first.
A producer can push a message even if the collection is currently full. In that case the Add/WriteAsync call will block, either synchronously or asynchronously, until there is space available for the message or the collection completes, whichever comes first.
A consumer can enumerate the collection in a consuming fashion, with a foreach/await foreach loop. Each message received in the loop is consumed by this loop, and will never be available to other consuming loops that might be active by other consumers in parallel.
Some features of the Channel<T> that the BlockingCollection<T> lacks:
A Channel<T> exposes two facades, a Writer and a Reader, that allow a better separation between the roles of the producer and the consumer. In practice this can be more of an annoyance than a useful feature IMHO, but nonetheless it's part of the experience of working with a channel.
A ChannelWriter<T> can be optionally completed with an error. This error is propagated to the consumers of the channel.
A ChannelReader<T> has a Completion property of type Task.
A bounded Channel<T> can be configured to be lossy, so that it drops old buffered messages automatically in order to make space for new incoming messages.
Some features of the BlockingCollection<T> that the Channel<T> lacks:
There is no direct support for timeout when writing/reading messages. This can be achieved indirectly (but precariously, see below) with timer-based CancellationTokenSources.
The contents of a channel cannot be enumerated in a non-consuming fashion.
Some auxiliary features like the BlockingCollection<T>.TakeFromAny method are not available.
A channel cannot be backed by other internal collections, other than the ConcurrentQueue<T>. So it can't have, for example, the behavior of a stack instead of a queue.
Caveat:
There is a nasty memory leak issue that is triggered when a channel is idle (empty with an idle producer, or full with an idle consumer) and the consumer or the producer continuously attempts to read/write messages with timer-based CancellationTokenSources. Each such canceled operation leaks about 800 bytes. The leak is resolved automatically when the first read/write operation completes successfully. This issue has been known for more than two years, and Microsoft has not yet decided what to do about it.
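For the concrete workflow described in the question, a minimal sketch could look like the following (MyItem, the batch size of 1,000 and the bound of 5,000 are placeholders): the parallel tasks write to the channel, and a single consumer loop builds and flushes the batches.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

class BatchingExample
{
    // Hypothetical message type standing in for whatever the RabbitMQ handler produces.
    record MyItem(int Id);

    static async Task Main()
    {
        var channel = Channel.CreateBounded<MyItem>(new BoundedChannelOptions(5_000)
        {
            SingleReader = true,   // one consumer loop drains the channel
            SingleWriter = false   // many parallel tasks write concurrently
        });

        // Producers: each parallel task writes without any locking; WriteAsync waits
        // asynchronously (backpressure) whenever the channel is full.
        var producers = Task.WhenAll(Enumerable.Range(0, 4).Select(async p =>
        {
            for (int i = 0; i < 2_500; i++)
                await channel.Writer.WriteAsync(new MyItem(p * 10_000 + i));
        }));

        // Consumer: collect items and flush every 1,000, mirroring the
        // "read the collection, then clear it" requirement from the question.
        var consumer = Task.Run(async () =>
        {
            var batch = new List<MyItem>(1_000);
            await foreach (var item in channel.Reader.ReadAllAsync())
            {
                batch.Add(item);
                if (batch.Count >= 1_000)
                {
                    Console.WriteLine($"Flushing {batch.Count} items");
                    batch.Clear();
                }
            }
        });

        await producers;
        channel.Writer.Complete();   // lets ReadAllAsync finish once drained
        await consumer;
    }
}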
Check out ConcurrentQueue<T>. It appears to be suitable for the task you have described in your question. Documentation here: https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentqueue-1?view=net-6.0
There are other concurrent collection types as well - https://learn.microsoft.com/en-us/dotnet/standard/collections/thread-safe/
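As a rough sketch of how that could look for the scenario in the question (the names and the 1,000 threshold are placeholders), each parallel task enqueues its results, and whichever task pushes the count past the threshold drains a batch. Note that, unlike a locking approach, other tasks can keep enqueuing while the batch is being drained.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var results = new ConcurrentQueue<string>();

void AddResult(string data)
{
    results.Enqueue(data);
    if (results.Count >= 1_000)
    {
        // TryDequeue is safe to call while other tasks keep enqueuing.
        var batch = new List<string>(1_000);
        while (batch.Count < 1_000 && results.TryDequeue(out var item))
            batch.Add(item);
        Console.WriteLine($"Draining {batch.Count} cached items");
    }
}

for (int i = 0; i < 5_000; i++)
    AddResult($"row {i}");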
Related
In this scenario, I have to poll AWS SQS messages from a queue; each async request can fetch up to 10 SQS items/messages. Once I poll the items, I have to process them on a Kubernetes pod. Item processing includes getting responses from a few API calls, which may take some time, and then saving the item to the DB and S3.
I did some R&D and reached the following conclusions:
Use a producer-consumer model: one thread polls items and another thread processes them, or use multithreading for item processing.
Maintain a data structure that contains the polled SQS items ready for processing; the DS could be a BlockingCollection or a ConcurrentQueue.
Use the Task Parallel Library for thread pooling and item processing.
Channels can be used.
My queries:
What would be the best approach to achieve the best performance or increase TPS?
Can/should I use TPL Dataflow?
Multithreaded, or single-threaded with async tasks?
This is very dependent on the specifics of your use case and how much effort you want to put in.
I will, however, explain the thought process I would use when making such a decision.
The naive solution to handle SQS messages would be to do it one at a time sequentially (i.e. without concurrency). It doesn't mean that you're limited to a single message at a time since you can add more pods to the cluster.
So even in that naive solution you have one concurrency point you can utilize but it has a lot of overhead. The way to reduce overhead is usually to utilize the same overhead but process more messages with it. That's why, for example, SQS allows you to get 1-10 messages in a single call and not just one. It spreads the call overhead over 10 messages. In the naive solution the overhead is the cost of starting a whole process. Using the process for more messages means concurrent processing.
I've found that for stable and flexible concurrency you want many points of concurrency, but have each of them capped at some configurable degree of parallelism (whether hardcoded or actual configuration). That way you can tweak each of them to achieve optimal output (increase when you have free CPU and memory and decrease otherwise).
So, where can the additional concurrency be introduced? This is a progression where each step utilizes resources better but requires more effort.
Fetch 10 messages instead of one on every SQS API call and process them concurrently. That way you have 2 points of concurrency you can control: the number of pods, and the number of messages (up to 10) processed concurrently per pod.
Have a few tasks, each fetching 1-10 messages and processing them concurrently. That's 3 concurrency points: pods, tasks, and messages per task. Both of these solutions suffer from messages with varying processing times: a single long-running message will "hold up" the other 1-9 "slots" of work, effectively reducing the concurrency to lower than configured.
Set up a TPL Dataflow block to process the messages concurrently and a task (or few) continuously fetching messages and pumping into the block. Keep in mind that SQS messages need to be explicitly deleted so the block needs to receive the message handle too so the message can be deleted after processing.
TPL Dataflow "pipe" consisting of a few blocks where each has it's own concurrency degree. That's useful when you have different steps of processing of the message where each step has different limitations (e.g. different APIs with different throttling configurations).
I personally am very fond of, and comfortable with, the Dataflow library so I would go straight to it. But simpler solutions are also valid when performance is less of an issue.
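For illustration, here is a rough sketch of option 3 above, assuming the AWS SDK for .NET and TPL Dataflow; ProcessAsync, the queue URL and the numeric limits are placeholders for your own code and tuning.
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
using Amazon.SQS;
using Amazon.SQS.Model;

var sqs = new AmazonSQSClient();
var queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder
var cts = new CancellationTokenSource();          // cancel from elsewhere to shut down

// Worker block: processes messages concurrently and deletes each one after success.
var worker = new ActionBlock<Message>(async msg =>
{
    await ProcessAsync(msg.Body);                                // API calls, DB, S3...
    await sqs.DeleteMessageAsync(queueUrl, msg.ReceiptHandle);   // delete only after processing
},
new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 16,   // the knob you tune per pod
    BoundedCapacity = 100          // backpressure so the fetch loop can't run ahead
});

// Fetch loop: SendAsync waits whenever the block's buffer is full.
while (!cts.IsCancellationRequested)
{
    var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
    {
        QueueUrl = queueUrl,
        MaxNumberOfMessages = 10,
        WaitTimeSeconds = 20       // long polling
    });

    foreach (var msg in response.Messages)
        await worker.SendAsync(msg);
}

worker.Complete();
await worker.Completion;

static Task ProcessAsync(string body) => Task.CompletedTask;     // stand-in for the real work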
I'm not familiar with Kubernetes but there are many things to consider when maximising throughput.
Everything you have mentioned is I/O bound, not CPU bound, so using the TPL is over-complicating the design for marginal benefit. See: https://learn.microsoft.com/en-us/dotnet/csharp/async#recognize-cpu-bound-and-io-bound-work
Your Kubernetes pods are likely to have network limitations. For example, Azure Function Apps on the Consumption plan are limited to 1,200 outbound connections. Other services will have some defined limits too: https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections?tabs=csharp#connection-limit. Due to the nature of your work, it is likely that you will reach these limits before you need to process I/O work on multiple threads.
You may also need to consider limits of the services which you are dependent on and ensure they are able to handle the throughput.
You may want to consider using Semaphores to limit the number of active connections to satisfy both your infrastructure and external dependency limits https://learn.microsoft.com/en-us/dotnet/api/system.threading.semaphoreslim?view=net-5.0
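As a hedged sketch of that idea, a SemaphoreSlim can gate the number of in-flight calls to a dependency (the limit of 50 and the method names are arbitrary placeholders, not recommendations).
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledClient
{
    private static readonly SemaphoreSlim ApiGate = new(initialCount: 50, maxCount: 50);
    private static readonly HttpClient Http = new();

    public static async Task<string> CallDependencyAsync(string url)
    {
        await ApiGate.WaitAsync();   // wait for a free slot
        try
        {
            return await Http.GetStringAsync(url);
        }
        finally
        {
            ApiGate.Release();       // always free the slot
        }
    }
}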
That being said, 500 messages per second is a realistic amount. To improve it further, you can look at having multiple processes with independent resource limitations processing the queue.
Not familiar with your use case, or specifically with the tech you are using, but this sounds like a very common message handling scenario.
A few guidelines:
First, these are guidelines; your use case might be very different from what those commenting here are used to.
Whenever you want to increase your throughput you need to identify your bottlenecks and strive toward a CPU bottleneck, making sure you fully utilize it. CPU load is usually the most expensive and generally makes for a more reliable metric for autoscaling. Obviously, depending on your remote API calls and your DB, you might hit other bottlenecks. SQS queue size also makes for a good autoscaling metric, but keep in mind that autoscaling isn't guaranteed to increase your throughput if your bottleneck is DB or API related.
I would not go for a fancy solution with complex data structures. Again, I'm not familiar with your use case, so I might be wrong, but keep it simple: there should be one thread responsible for polling the queue, and when it finds new messages it should create a Task that processes a batch. There should generally be one Task per processing batch; let the ThreadPool handle the number of threads.
I'm not familiar with the .NET SQS library. However, I am familiar with other libraries for very similar solutions. Most queue libraries out there already do all of this for you, and you don't really have to worry about it. You should probably just have a callback function that is invoked when the (highly optimized) library finds new messages. Those libraries probably already create a new task for each of those batches; you just need to register your callback and make sure you await any I/O-bound code.
Edit: The solution I am proposing does have a limitation in that a single message can block an entire batch. This is not necessarily a bad thing, but if your solution requires different processing for different messages and you don't want to create this inner batch dependency, TPL Dataflow could definitely be a good fit for your use case.
Yeah, this sounds very much like a task for TPL Dataflow; it is a very versatile yet powerful instrument. Your first chain link would acquire messages from the queue (not necessarily single-threadedly; you just pass some delegates in). You will also be in control of how many items are "queued" locally this way.
Then you "subscribe" your workers in any way you desire (you can even customize it so that "faulted" processings are put back into your queue), and it wouldn't even matter whether your processing is I/O bound or not. If it is, nice: TPL Dataflow is asynchronous. If not, not a problem: TPL Dataflow can also be synchronous. Or you can fire up some thread pool threads, no biggie.
I'm building a trade engine for cryptocurrency, and since the trade engine is going to receive a lot of commands across multiple markets, I was wondering whether, instead of using a default pattern like spawning a thread for each command, there could be a database that saves commands for short periods of time and executes them in a batch.
I was wondering if it's a good idea and is there a pattern or delivery mechanism I can use for my database?
The question here seems to be focused around CPU blocking when issuing redis commands, with a mention of fire and forget.
In the general case, then yes: with any client, a blocking synchronous call such as:
redis.call("incr", "foo");
is going to block the current thread for the duration of the round trip; however, different clients have different capabilities. Let's assume that we're using StackExchange.Redis as the library, and compare to:
db.StringIncrement("foo"); // blocks current thread
We have a few options here, though:
// doesn't block thread, completes asynchronously, reactivated via task model with result
await db.StringIncrementAsync("foo");
// doesn't block thread; result is meaningless (default(T))
db.StringIncrement("foo", flags: CommandFlags.FireAndForget);
// doesn't block thread, completes synchronously via a completed task with
// meaningless result (default(T))
await db.StringIncrementAsync("foo", flags: CommandFlags.FireAndForget);
So there are multiple ways to avoid CPU blocking.
As for batching: in some cases batching can help reduce packet fragmentation, but since we're using "pipelines" as a buffer between the layers, we'll already be doing a fair job of creating dense packets if there is sufficient data, and if there isn't sufficient data to fill packets: there's not really any harm sending them earlier, and avoiding latency/buffering; unless you're sending multiple related commands in close proximity, the batching API is not usually immediately useful (and when you are sending multiple related commands - it may also be useful to compare/contrast the batching vs transaction APIs).
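For completeness, here is a small sketch of the batch API mentioned above (the keys are placeholders): commands added to a batch are buffered locally and flushed to the server as one contiguous write when Execute() is called.
using System.Threading.Tasks;
using StackExchange.Redis;

var connection = await ConnectionMultiplexer.ConnectAsync("localhost");
IDatabase db = connection.GetDatabase();

var batch = db.CreateBatch();
var t1 = batch.StringIncrementAsync("foo");
var t2 = batch.StringIncrementAsync("bar");
batch.Execute();                 // nothing is sent to the server until this point
await Task.WhenAll(t1, t2);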
Stephen Toub mentions in this Channel 9 video that a *Block creates a task when an item is pushed to its incoming queue, and that once all items in the queue have been processed the task gets destroyed.
If I use a lot of blocks to build up a mesh, the number of actually running tasks is not clear (and if the TaskScheduler is the default one, the number of active ThreadPool threads is also not clear).
Does TPL Dataflow offer a way for me to say: "OK, I want this kind of block to have a permanently running task (thread)"?
TL;DR: there is no way to dedicate a thread to a block, as that clearly conflicts with the purpose of TPL Dataflow, except by implementing your own TaskScheduler. Do measure before trying to improve your application's performance.
I just watched the video and can't find such a phrase in it:
creates a task when an item is pushed to its incoming queue, and once all items in the queue have been processed the task gets destroyed.
Maybe I'm missing something, but all Stephen says is: [at the beginning] We have a common producer-consumer problem, which can easily be implemented with the .NET 4.0 stack, but the problem is that if the data runs out, the consumer leaves the loop and never returns.
[After that] Stephen explains how such a problem can be solved with TPL Dataflow, and he says that the ActionBlock starts a Task if one isn't already running. Inside that task there is code which waits (in an async fashion) for a new message, freeing up the thread but not destroying the task.
Stephen also mentions tasks while explaining how messages are sent across linked blocks; there he says that the posting task will fade away if there is no data to send. That doesn't mean that the task corresponding to the block fades away; it's only about a child task being used to send the data, and that's it.
In TPL Dataflow the only way to tell a block that there won't be any more data is to call its Complete method or to complete one of the linked blocks. After that the consuming task will be stopped and, once all buffered data has been processed, the block will end its task.
According to the official GitHub repository for TPL Dataflow, all tasks for message handling inside blocks are created with DenyChildAttach, and sometimes with the PreferFairness flag. So there is no reason to provide a mechanism that pins one thread directly to a block, as it would sit idle and waste CPU resources whenever there is no data for the block. You may introduce a custom TaskScheduler for the blocks, but right now it's not obvious why you would need that.
If you're worried that some block may get more CPU time than others, there is a way to even that out. According to the official docs, you can set the MaxMessagesPerTask option, forcing the task to restart after a certain number of messages has been processed. Still, this should be done only after measuring actual execution times.
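For illustration, setting that option is a one-liner (the value of 100 is an arbitrary placeholder, to be chosen after measuring).
using System;
using System.Threading.Tasks.Dataflow;

// Recycle the block's processing task every 100 messages, giving other blocks
// scheduled on the same TaskScheduler a chance to run sooner.
var block = new ActionBlock<int>(n => Console.WriteLine(n),
    new ExecutionDataflowBlockOptions { MaxMessagesPerTask = 100 });

for (int i = 0; i < 1_000; i++) block.Post(i);
block.Complete();
await block.Completion;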
Now, back to your words:
number of actually running tasks is not clear
the number of active ThreadPool threads is also not clear
How did you profile your application? During debugging you can easily find all active tasks and all active threads. If that's not enough, you can profile your application, either with native Microsoft tools or with a specialized profiler such as dotTrace. Such a toolkit can easily give you information about what's going on in your app.
The talk is about the internal machinery of the TPL Dataflow library. As a mechanism it is quite efficient, and you shouldn't really worry about any overhead unless your intended throughput is in the order of 100,000 messages per second or more (in which case you should look for ways to chunkify your workload). Even with workloads of very small granularity, the difference between processing the messages using a single task for all messages, or a dedicated task for each one, should be hardly noticeable. A Task is an object that normally "weighs" a couple hundred bytes, and the .NET platform is able to create and recycle millions of objects of this size per second.
It would be a problem if each Task required its own dedicated 1MB thread in order to run, but this is not the case. Typically the tasks are executed using ThreadPool threads, and a single ThreadPool thread can potentially execute millions of short-lived tasks per second.
I should also mention that the TPL Dataflow supports asynchronous lambdas too (lambdas with Task return types), in which case the blocks essentially don't have to execute any code at all. They just await the generated promise-style tasks to complete, and for asynchronous waiting no thread is needed.
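A small sketch of such an asynchronous lambda (the URL and the degree of parallelism are placeholders): while the HTTP call is awaited, no thread is held by the block.
using System;
using System.Net.Http;
using System.Threading.Tasks.Dataflow;

var http = new HttpClient();
var downloader = new ActionBlock<string>(async url =>
{
    var html = await http.GetStringAsync(url);   // the thread is released during the await
    Console.WriteLine($"{url}: {html.Length} chars");
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

downloader.Post("https://example.com");
downloader.Complete();
await downloader.Completion;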
Inspired by my current problem, which is kind of identical to this:
Analogue of Queue.Peek() for BlockingCollection when listening to consuming IEnumerable<T>, with the difference that I am currently using ConcurrentQueue<T> instead of BlockingCollection<T>, I wonder what a use case for ConcurrentQueue<T>.TryPeek() might be.
Of course I mean a use case without manual lock(myQueue) code to serialize queue accesses, since the TPL is meant to improve on or replace that kind of locking.
I had an application that used ConcurrentQueue<T>.TryPeek to good effect. One thread was set up to monitor the queue. Mostly it was looking at queue size, but we also wanted to get an idea of latency. Because the items in the queue had a time stamp field that said what time they were put into the queue, my monitoring thread could call TryPeek to get the item at the head of the queue, subtract the insertion time from the current time, and tell me how long the item had been in the queue. Over time and many samples, that gave me a very clear picture of how long it was taking for a received item to be processed.
It didn't matter that some other thread might dequeue the item while my monitoring code was still examining it.
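A minimal sketch of that monitoring idea (WorkItem and its EnqueuedAt field are hypothetical; the producer stamps the time as it enqueues each item):
using System;
using System.Collections.Concurrent;

public record WorkItem(DateTime EnqueuedAt, string Payload);

public static class QueueMonitor
{
    public static TimeSpan? HeadOfQueueLatency(ConcurrentQueue<WorkItem> queue)
    {
        // TryPeek does not remove the item; another thread may dequeue it right
        // after we look, but for a latency sample that doesn't matter.
        if (queue.TryPeek(out var head))
            return DateTime.UtcNow - head.EnqueuedAt;
        return null;
    }
}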
I can think of a few other scenarios in which it would be useful to see what's at the head of the queue, even though it might be pulled off immediately.
I have a ConcurrentQueue where many threads may Enqueue, but I use a lock so that only one thread at a time does the TryPeek and TryDequeue:
lock (dequeueLock)
{
    if (queue.TryPeek(out item))
    {
        if (item.Value <= threshold)
        {
            queue.TryDequeue(out item);
        }
    }
}
Note: other threads may continue to Enqueue while this code runs.
It would be nicer to have some atomic peek - check - dequeue operation, but the lock is fine for my scenario.
TryPeek can be used to wait until a specific object is at the head of the queue, whereas TryDequeue will dequeue whatever object is there. For instance, I wrote a multithreaded web server, and when authorization is enabled for certain requests, those requests need at one point to be processed in the order they were received. I don't want to lock up the whole thread function, or half of it, just so that I can process some clients' requests in order.
So I created a Dictionary<string, ConcurrentQueue<HttpListenerContext>>. At the very beginning of the server thread I take a lock briefly and check whether authorization will be required; if so, I store the HttpListenerContext in a queue keyed by the client IP, so that different clients don't block each other's threads unnecessarily. Then I process the headers and compute the hashes as normal; since a page may make two or three requests over AJAX and WebSocket connections after the initial one, it is better to multithread the hashing of the authorization information (which is digest authorization I implemented for HttpListener myself, so that I am not restricted to using Active Directory). Then, when the authorization needs to be checked for the case where the client nonce count must be exactly one greater than that of the client session's last request (a security feature), I use the queue I created: I call TryPeek with Thread.Yield() to wait until that thread's HttpListenerContext is first in the queue, finish the authorization, and then dequeue it.
In short, it can be used when, for most of the work, you want things running in parallel to take advantage of different cores, but for a piece of some threads you need everything to get back in order.
My understanding is that you use this method when you want to do a peek but you are not sure there is an item in the queue. Normally Peek on an empty queue will throw an exception. TryPeek will return false if the item is not there. This can be extremely useful in multithreaded scenarios where another thread may dequeue the item in between checks for empty queue and actually peeking for the value.
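A tiny sketch of that difference:
using System;
using System.Collections.Concurrent;

var queue = new ConcurrentQueue<int>();

// TryPeek never throws; it just reports whether anything was there.
if (queue.TryPeek(out var next))
    Console.WriteLine($"Head of queue: {next}");
else
    Console.WriteLine("Queue was empty (or was emptied by another thread).");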
Try this:
T item = bc.GetConsumingEnumerable().FirstOrDefault();
if (item != null)
{
//...
}
Having a look at an object to see if it is valid before taking it out is an option; just remember that when you do this, the ConcurrentQueue keeps an internal reference and does not release the object from memory when you dequeue it. If you do this and profile memory, as I did with my ConcurrentQueue, you will see something like this:
Notice the ConcurrentQueueSegment with 11,060 instances while the queue only holds 8.
I have a thread which fills a queue, and another thread which processes this queue. My problem is that the first thread fills the queue very fast, the other thread can't process it as quickly, and my program keeps using more and more RAM. What is the optimal solution for this problem?
Sorry, I forgot to add something: I can't limit my queue or my producer thread. The producer thread can't wait, because it's capturing network packets and I shouldn't miss any packets. I have to process these packets as fast as the producer thread produces them.
Well, assuming that the order of processing of items in the queue is not important, you can run two (or more) threads processing the queue.
Unless there's some sort of contention between them, that should enable faster processing. This is known as a multi-consumer model.
Another possibility is to have your producer thread monitor the size of the queue and refuse to add entries until it drops below some threshold. Standard C# queues don't provide a way to stop expansion of the capacity (even using a 1.0 growth factor will not inhibit growth).
You could define a maximum queue size (let's say 2000) which when hit causes the queue to only accept more items when it's down to a lower size (let's say 1000).
I'd recommend using an EventWaitHandle or a ManualResetEvent in order not to busy-wait. http://msdn.microsoft.com/en-us/library/system.threading.manualresetevent.aspx
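A rough sketch of that watermark idea, using a ManualResetEventSlim so the producer doesn't busy-wait (the 2,000/1,000 limits are the placeholder numbers from above; the Count checks are not atomic with Set/Reset, so the pause and resume points are approximate, which is usually acceptable for flow control).
using System;
using System.Collections.Concurrent;
using System.Threading;

class PacketBuffer
{
    private readonly ConcurrentQueue<byte[]> _queue = new();
    private readonly ManualResetEventSlim _canProduce = new(initialState: true);

    // Producer (capture) thread calls this:
    public void Enqueue(byte[] packet)
    {
        _canProduce.Wait();                              // blocks only while the queue is "too full"
        _queue.Enqueue(packet);
        if (_queue.Count >= 2_000) _canProduce.Reset();
    }

    // Consumer (processing) thread calls this in a loop:
    public bool TryDrainOne()
    {
        if (!_queue.TryDequeue(out var packet)) return false;
        Console.WriteLine($"processing {packet.Length} bytes");   // real work goes here
        if (_queue.Count <= 1_000) _canProduce.Set();
        return true;
    }
}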
Unless you are already doing so, use BlockingCollection<T> as your queue and pass some reasonable limit to the boundedCapacity parameter of constructor (which is then reflected in BoundedCapacity property) - your producer will block on Add if this would make the queue too large and resume after consumer has removed some element from the queue.
According to MSDN documentation for BlockingCollection<T>.Add:
If a bounded capacity was specified when this instance of BlockingCollection<T> was initialized, a call to Add may block until space is available to store the provided item.
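A minimal sketch of that approach (the capacity of 1,000 and the item counts are placeholders): Add blocks the producer once the buffer is full, so memory use stays flat while the consumer catches up.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var queue = new BlockingCollection<int>(boundedCapacity: 1_000);

var producer = Task.Run(() =>
{
    for (int i = 0; i < 100_000; i++)
        queue.Add(i);            // blocks here whenever the collection is full
    queue.CompleteAdding();
});

var consumer = Task.Run(() =>
{
    foreach (var item in queue.GetConsumingEnumerable())
    {
        if (item % 10_000 == 0) Console.WriteLine($"processed {item}");
    }
});

await Task.WhenAll(producer, consumer);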
Another method is to new() X inter-thread comms instances at startup, put them on a queue and never create any more. Thread A pops objects off this pool queue, fills them with data and queues them to thread B. Thread B gets the objects, processes them and then returns them to the pool queue.
This provides flow control: if thread A tries to post too fast, the pool will dry up and A will have to wait on the pool queue until B returns objects. It has the potential to improve performance, since there are no mallocs and frees after the initial pool filling; the lock time on a queue push/pop will be less than that of a memory-manager call. There is no need for complex bounded queues; any old producer-consumer queue class will do. The pool can be used for inter-thread comms throughout a full app with many threads/thread pools, flow-controlling them all. Shutdown problems can be mitigated: if the pool queue is created by the main thread at startup, before any forms etc., and never freed, it is often possible to avoid explicit background-thread shutdowns on app close, a pain it would be nice to just forget about. Object leaks and/or double releases are easily detected by monitoring the pool level ('detected', not 'fixed' :).
The inevitable downsides: all the inter-thread comms instance memory is permanently allocated even if the app is completely idle. An object popped off the pool will be full of 'garbage' from its previous use. If the 'slowest' thread gets an object before releasing one, it is possible for the app to deadlock with the pool empty and all objects queued to the slowest thread. A very heavy burst of load may cause the app to throttle itself 'early', where a simpler 'new/queue/dispose' mechanism would just allocate more instances and so cope better with the burst of work.
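A rough sketch of that pooling idea (PooledBuffer, the pool size and the item counts are hypothetical): a fixed set of reusable buffers circulates between the pool queue, the producer and the consumer, and the pool itself provides the flow control.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PooledBuffer { public byte[] Data = new byte[1500]; public int Length; }

class PoolDemo
{
    static async Task Main()
    {
        const int PoolSize = 100;
        var pool = new BlockingCollection<PooledBuffer>(PoolSize);
        var work = new BlockingCollection<PooledBuffer>(PoolSize);

        for (int i = 0; i < PoolSize; i++)
            pool.Add(new PooledBuffer());           // allocate everything once, up front

        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 10_000; i++)
            {
                var buf = pool.Take();              // blocks when the consumer falls behind
                buf.Length = i % buf.Data.Length;   // "fill" the buffer
                work.Add(buf);
            }
            work.CompleteAdding();
        });

        var consumer = Task.Run(() =>
        {
            foreach (var buf in work.GetConsumingEnumerable())
            {
                // process buf.Data up to buf.Length here...
                pool.Add(buf);                      // return the buffer to the pool
            }
        });

        await Task.WhenAll(producer, consumer);
        Console.WriteLine("done");
    }
}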
The simplest possible solution would be for the producer thread to check whether the queue has reached a certain limit of pending items and, if so, go to sleep before pushing more work.
Other solutions depend on the actual problem you are trying to solve: is the processing more I/O bound or CPU bound, etc.? That may even allow you to design a solution which doesn't need a queue at all. For example, the producer thread can generate, let's say, 10 items and call a consumer "method" which processes them in parallel, and so on.