I want to write an application that needs a task queue. I should be able to add tasks to this queue, and these tasks can finish asynchronously (and should be removable from the queue once they are complete).
The data structure should also make it possible to get information about any task in the queue, given a unique queue-position identifier.
The data structure should also be able to provide the list of items in the queue at any time.
A LINQ interface to manage this queue would also be desirable.
Since this is a very common requirement for many applications (at least in my personal observation), I want to know whether there are any standard data structures available as part of the C# library, instead of writing something from scratch.
Any pointers?
Seems to me you are conflating the data structure and the async task that it is designed to track. Are you sure they need to be the same thing?
Does ThreadPool.QueueUserWorkItem not satisfy your need for running async tasks? You can maintain your own structure derived from List<TaskStatus> or HashSet<TaskStatus> to keep track of the results, and you can provide convenience methods to clear completed items, retrieve pending items, and so on.
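A minimal sketch of that idea, wrapping a List<TaskStatus> rather than deriving from it (the TaskStatus type and all the method names here are hypothetical, not framework types):

using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical per-item status record.
class TaskStatus
{
    public Guid Id { get; } = Guid.NewGuid();
    public volatile bool IsComplete;
}

class TaskTracker
{
    private readonly object _sync = new object();
    private readonly List<TaskStatus> _items = new List<TaskStatus>();

    // Queue the work on the thread pool and return a handle for later inspection.
    public TaskStatus Enqueue(Action work)
    {
        var status = new TaskStatus();
        lock (_sync) _items.Add(status);
        ThreadPool.QueueUserWorkItem(_ =>
        {
            work();
            status.IsComplete = true;
        });
        return status;
    }

    // Snapshot of everything currently tracked (LINQ-friendly).
    public List<TaskStatus> Snapshot()
    {
        lock (_sync) return new List<TaskStatus>(_items);
    }

    // Convenience method: drop items whose work has finished.
    public void ClearCompleted()
    {
        lock (_sync) _items.RemoveAll(s => s.IsComplete);
    }
}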
Related
This question talks about how the traditional queue pattern is somewhat antiquated in modern C# due to the TPL: Best way in .NET to manage queue of tasks on a separate (single) thread
The accepted answer proposes what appears to be a stateless solution. It is very elegant, but (and perhaps I'm a dinosaur or misunderstand the answer...) what if I want to pause the queue or save its state? What if, when enqueuing a task, the behaviour should depend on the queue's state, or if queued tasks can have different priorities?
How could one efficiently implement an ordered task queue, one with an explicit Queue object that you can inspect and even interact with, within the Task paradigm? Supporting single or parallel processing of enqueued tasks is a benefit, but for my purposes single-concurrency is acceptable if parallelism raises problems. I am not dealing with millions of tasks a second; in fact, my tasks are typically large and slow.
I am happy to accept that there are solutions with different scalability depending on requirements, and that we can often trade off scalability against coding effort and complexity.
What you are describing sounds essentially like the Channel<T> API. This already exists:
nuget: https://www.nuget.org/packages/System.Threading.Channels/
msdn: https://devblogs.microsoft.com/dotnet/an-introduction-to-system-threading-channels/
additional: https://www.stevejgordon.co.uk/an-introduction-to-system-threading-channels
It isn't literally a Queue<T>, but it acts as one. There is support for bounded vs. unbounded channels, and single vs. multiple readers/writers.
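A minimal producer/consumer sketch with the API from the links above (assumes a recent version of the System.Threading.Channels package and C# 8 for await foreach):

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // Unbounded channel, optimized for one reader and one writer.
        var channel = Channel.CreateUnbounded<int>(
            new UnboundedChannelOptions { SingleReader = true, SingleWriter = true });

        // Producer: write items, then signal that no more are coming.
        var producer = Task.Run(async () =>
        {
            for (int i = 0; i < 10; i++)
                await channel.Writer.WriteAsync(i);
            channel.Writer.Complete();
        });

        // Consumer: items come out in FIFO order, like a queue.
        await foreach (var item in channel.Reader.ReadAllAsync())
            Console.WriteLine(item);

        await producer;
    }
}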
I am creating a data cleansing application that loops through the tables in a database and cleanses NPI data from the different columns. I have created a class for each of the tables with a method that performs the cleansing operation. What I want to do is loop over the table classes and, using reflection, call each class's cleansing method. I would like to do 10 tables at a time and, as one table completes, spawn a new thread/task for the next table in the list.
I have a treeview where the user can select one or more tables from the database to cleanse. I have been able to loop over the selected tables and invoke the cleanse method for each table on its own thread, but I end up with over 100 threads (if all tables are selected) executing at the same time. Not an ideal situation.
Any suggestions on how to do this? I am using C# and .NET 4.6, so Task code would be preferred.
One simple approach is to use Parallel.ForEach with the MaxDegreeOfParallelism option set to the maximum number of threads you would like.
A more advanced and rather elegant framework for this kind of problem is the TPL Dataflow library. Use an ActionBlock to perform the work and set its degree of parallelism as desired.
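A sketch of the ActionBlock approach (CleanseTable is a stand-in for your per-table cleansing method; the NuGet package is System.Threading.Tasks.Dataflow):

using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Cleanser
{
    static async Task CleanseAllAsync(IEnumerable<string> selectedTables)
    {
        // Run at most 10 tables at once; as one finishes,
        // the block automatically starts the next.
        var cleanser = new ActionBlock<string>(
            table => CleanseTable(table),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

        foreach (var table in selectedTables)
            cleanser.Post(table);

        cleanser.Complete();
        await cleanser.Completion;
    }

    // Stand-in for the real per-table cleansing logic.
    static void CleanseTable(string table) { /* ... */ }
}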
Take a look at the Task Parallel Library; it has the types Task and Task<TResult> that I think will suit you just fine.
You could create 10 tasks, put them in a collection, and call Task.WhenAny(myTasks).Result. At that point you can figure out how many tasks are done (via the IsCompleted property) and put more into the collection.
Instead of WhenAny you could probably make it less complicated and use WhenAll(myTasks), and just do this all in batches of x (see the batching sketch after the example below). The Parallel.ForEach answer is also an excellent option; there's a whole world in the TPL for you to explore.
Very basic example; not sure of your full context here:
// SomeLongProcess is a stand-in for your own work
var myTasks = new List<Task>();
myTasks.Add(Task.Run(() => SomeLongProcess()));
myTasks.Add(Task.Run(() => SomeLongProcess()));
Task finished = Task.WhenAny(myTasks).Result;
// Check how many tasks are done (IsCompleted), then add more to your collection and repeat until you're done
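The batched WhenAll variant mentioned above might look like this (again, SomeLongProcess is a hypothetical stand-in for your own work):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Batching
{
    // Process workItems in batches of 10: start a batch, wait for the
    // whole batch with WhenAll, then move on to the next one.
    static void RunInBatches(IEnumerable<int> workItems)
    {
        foreach (var batch in workItems.Select((item, i) => new { item, i })
                                       .GroupBy(x => x.i / 10, x => x.item))
        {
            var tasks = batch.Select(item => Task.Run(() => SomeLongProcess(item)))
                             .ToList();
            Task.WhenAll(tasks).Wait();
        }
    }

    static void SomeLongProcess(int item) { /* per-item work */ }
}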
You are better off using Task, as it implements work sharing internally: basically, Task = work, and that work is mapped onto a hardware thread of your OS via a special task scheduler. Among the benefits:
Threads may not always be available.
There may be more work items than threads, so if you have a queue to process, the same thread can be reused for new data (spawning a thread has its own cost).
Possibly better false-sharing management (on CPU cache lines). You may be less concerned with this, but it is still worth knowing what it is.
...and more.
A lot of that is thought out, scheduled, and processed by the task scheduler to get good general performance without much hassle. To be clear, you will not get the best possible multithreaded performance this way, but most likely you do not need that either.
What you are asking about is Data Parallelism.
A simple example of how to use it can be found in: How to: Write a Simple Parallel.For Loop
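In the spirit of that walkthrough, a minimal Parallel.For sketch:

using System;
using System.Threading;
using System.Threading.Tasks;

class Example
{
    static void Main()
    {
        // Process 100 items; the runtime partitions the range across cores
        // and reuses pool threads rather than spawning one thread per item.
        Parallel.For(0, 100, i =>
        {
            Console.WriteLine(
                "Processing item {0} on thread {1}",
                i, Thread.CurrentThread.ManagedThreadId);
        });
    }
}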
I am trying to set up a concurrent queue that will enqueue data objects coming in from one thread while another thread dequeues the data objects and processes them. I have used a BlockingCollection<T> with its GetConsumingEnumerable() method to create a solution that works pretty well in simple usage. My problem lies in the facts that:
the data is coming in quickly, data items being enqueued approximately every 50ms
processing each item will likely take significantly longer than 50ms
I must maintain the order of the data items while processing, as some of the data items represent events that must be fired in the proper order.
On my development machine, which is a pretty powerful setup, it seems the cutoff is about 60ms of processing time for getting things to work right. Beyond that, I have problems either with the queue growing continuously (not dequeuing fast enough) or with the data items being processed in the wrong order, depending on how I set up the processing with regard to whether/how much/where I parallelize. Does anyone have any tips/tricks/solutions, or can you point me to some, that will help me here?
Edit: As pointed out below, my issue is most likely not with the queuing structure itself so much as with trying to dequeue and process the items faster. Are there tricks/tips/etc. for portioning out the processing work so that I can keep dequeuing quickly while still maintaining the order of the incoming data items?
Edit (again): Thanks for all your replies! It's obvious I need to put some more work into this. This is all great input, though, and I think it will help point me in the right direction! I will reply again either with a solution that I came up with or a more detailed question and code sample. Thanks again!
Update: In the end, we went with a BlockingCollection backed by a ConcurrentQueue. The queue worked perfectly for what we wanted. In the end, as many mentioned, the key was making the processing side as fast and efficient as possible. There is really no way around that. We used parallelization where we found it helped (in some cases it actually hurt performance), cached data in certain areas, and tried to avoid locking scenarios. We did manage to get something working that performs well enough that the processing side can keep up with the data updates. Thanks again to everyone who kicked in a response!
If you are using the TPL on .NET 4.0, you can investigate the TPL Dataflow library (it's not third-party; it's a library from Microsoft distributed via NuGet), as it provides logic that preserves the order of the data being processed in your system.
As I understand it, you have data that arrives in order, and that order must be maintained after some work is done on each item. For this you can use the TransformBlock class, or a BufferBlock linked to an ActionBlock: simply put the data on its input, set up the action to be run on each item, and link the block with whatever other classes you need (you can even make it IObservable to create a responsive UI).
As I said, TPL Dataflow blocks encapsulate FIFO queue logic, and they preserve the order of the results of their action. And the code you write with them is multithreading-oriented (see more about maximum degree of parallelism in TPL Dataflow).
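A minimal ordered-pipeline sketch (ExpensiveWork is a stand-in for your per-item processing):

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class OrderedPipeline
{
    static async Task Main()
    {
        // TransformBlock may process items in parallel internally,
        // but by default it emits results in the order items were posted.
        var process = new TransformBlock<int, int>(
            x => ExpensiveWork(x),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        var output = new ActionBlock<int>(result => Console.WriteLine(result));
        process.LinkTo(output, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 20; i++)
            process.Post(i);

        process.Complete();
        await output.Completion;
    }

    // Stand-in for the real per-item processing.
    static int ExpensiveWork(int x) => x * 2;
}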
I think that you are okay with the blocking queue. I enqueue thousands of messages per second into a BlockingCollection and the overhead is very small. I think you should do the following:
Add a synchronized sequence number when enqueuing the messages
Use multiple consumers to try to overload the queue
In general, focus on the processing time. The default collection type for BlockingCollection is ConcurrentQueue, so the default is a FIFO (first in, first out) queue, so something else seems to be wrong.
"some of the data items represent events that must be fired in the proper order."
Then you may differentiate the order-dependent items and process them in order, while processing other items in parallel. Maybe you can build 2 separate queues: one for items to be processed in order, dequeued and processed by a single thread, and another dequeued by multiple threads.
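A rough sketch of that two-queue split (the WorkItem shape and its IsOrderDependent flag are assumptions about your data):

using System.Collections.Concurrent;
using System.Threading.Tasks;

class WorkItem
{
    public bool IsOrderDependent;  // assumption: each item knows whether ordering matters
    public int Payload;
}

class TwoQueueProcessor
{
    readonly BlockingCollection<WorkItem> _ordered = new BlockingCollection<WorkItem>();
    readonly BlockingCollection<WorkItem> _unordered = new BlockingCollection<WorkItem>();

    public void Start()
    {
        // One consumer keeps order-dependent items strictly sequential.
        Task.Run(() =>
        {
            foreach (var item in _ordered.GetConsumingEnumerable())
                Process(item);
        });

        // Several consumers drain the order-independent items in parallel.
        for (int i = 0; i < 4; i++)
            Task.Run(() =>
            {
                foreach (var item in _unordered.GetConsumingEnumerable())
                    Process(item);
            });
    }

    // Route each incoming item to the appropriate queue.
    public void Enqueue(WorkItem item) =>
        (item.IsOrderDependent ? _ordered : _unordered).Add(item);

    void Process(WorkItem item) { /* your processing here */ }
}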
We need to know more about input and expected processing.
I'm really loving the TPL. Simply calling Task.Factory.StartNew() and not worrying about anything, is quite amazing.
But, is it possible to have multiple Factories running on the same thread?
Basically, I would like to have two different queues, executing different types of tasks.
One queue handles tasks of type A while the second queue handles tasks of type B.
If queue A has nothing to do, it should ignore tasks in queue B and vice versa.
Is this possible to do, without making my own queues, or running multiple threads for the factories?
To clarify what I want to do.
I read data from a network device. I want to do two things with this data, totally independent from each other.
I want to log to a database.
I want to send to another device over network.
Sometimes the database log will take a while, and I don't want the network send to be delayed because of this.
If you use .NET 4.0:
LimitedConcurrencyLevelTaskScheduler (with a concurrency level of 1; see here)
If you use .NET 4.5:
ConcurrentExclusiveSchedulerPair (take only the exclusive scheduler out of the pair; see here)
Create two schedulers and pass them to the appropriate StartNew calls. Or create two TaskFactories with these schedulers and use them to create and start the tasks.
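A sketch of the .NET 4.5 variant (LogToDatabase and SendToDevice are hypothetical stand-ins for the two kinds of work):

using System.Threading.Tasks;

class DualQueues
{
    // Each exclusive scheduler runs at most one task at a time,
    // so each factory behaves like its own serial queue.
    static readonly TaskFactory DbFactory =
        new TaskFactory(new ConcurrentExclusiveSchedulerPair().ExclusiveScheduler);
    static readonly TaskFactory NetFactory =
        new TaskFactory(new ConcurrentExclusiveSchedulerPair().ExclusiveScheduler);

    public static void Handle(byte[] data)
    {
        // A slow database log never delays the network send, and vice versa.
        DbFactory.StartNew(() => LogToDatabase(data));
        NetFactory.StartNew(() => SendToDevice(data));
    }

    static void LogToDatabase(byte[] data) { /* ... */ }
    static void SendToDevice(byte[] data) { /* ... */ }
}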
You can also define your own thread pool, using a queue of threads.
I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. Each task is created by reading data from the database, but a collection of all the tasks would exceed the amount of memory on the machine, so tasks have to be created, processed, and disposed of incrementally. I am wondering what would be a good approach to solving this problem? I am considering the following two approaches:
Implement a synchronized task queue. Implement a producer (task creator) that reads data from the database and puts tasks in the queue (limiting the number of tasks currently in the queue to a constant value to make sure the amount of memory is not exceeded). Have multiple consumer processes (task processors) that read tasks from the queue, process them, store the results, and dispose of the tasks. What would be a good number of consumer processes in this approach?
Use the .NET parallel extensions (PLINQ or Parallel.For), but I understand that a collection of tasks has to be created (can we add tasks to the collection while processing in the parallel for?). So we would create batches of tasks, say N tasks at a time, process the batch, and then read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU-bound, then configure your system initially so that the number of threads in your system is equal to the number of hardware threads your box can run.
If your tasks aren't CPU-bound, then you'll have to experiment with the pool size to find an optimal solution for your particular situation.
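A sketch of the bounded producer/consumer approach from option 1, using a BlockingCollection to cap memory (ReadTasksFromDatabase and ProcessTask are hypothetical stand-ins):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class BoundedPipeline
{
    static void Main()
    {
        // Bound the queue so the producer blocks instead of exhausting memory.
        var queue = new BlockingCollection<int>(boundedCapacity: 100);

        var producer = Task.Run(() =>
        {
            foreach (var task in ReadTasksFromDatabase())
                queue.Add(task);            // blocks when the queue is full
            queue.CompleteAdding();
        });

        // One consumer per hardware thread is a reasonable starting point
        // for CPU-bound work.
        var consumers = new Task[Environment.ProcessorCount];
        for (int i = 0; i < consumers.Length; i++)
            consumers[i] = Task.Run(() =>
            {
                foreach (var task in queue.GetConsumingEnumerable())
                    ProcessTask(task);
            });

        Task.WaitAll(consumers);
        producer.Wait();
    }

    static IEnumerable<int> ReadTasksFromDatabase() { yield break; }
    static void ProcessTask(int task) { /* process, store result, dispose */ }
}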
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you're happy.
I've not had the opportunity to actually use PLINQ; however, I do know that PLINQ (like vanilla LINQ) is based on IEnumerable. As such, I think this might be a case where it would make sense to implement the task producer via C# iterator blocks (i.e. the yield keyword).
Assuming you are not doing any operations where the entire set of tasks must be known in advance (e.g. ordering), I would expect that PLINQ would only consume as many tasks as it can process at once. Also, this article references some strategies for controlling just how PLINQ goes about consuming input (the section titled "Processing Query Output").
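A minimal sketch of that combination, streaming tasks from an iterator block into PLINQ (ReadTasks and Process are hypothetical stand-ins):

using System;
using System.Collections.Generic;
using System.Linq;

class PlinqStreaming
{
    // Iterator block: tasks are produced lazily, one at a time,
    // so the whole set is never materialized in memory.
    static IEnumerable<int> ReadTasks()
    {
        for (int i = 0; i < 1000000; i++)  // stand-in for a database cursor
            yield return i;
    }

    static void Main()
    {
        ReadTasks()
            .AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .ForAll(task => Process(task));  // ForAll: no ordering required
    }

    static void Process(int task) { /* per-task work */ }
}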
EDIT: Comparing PLINQ to a ThreadPool.
According to this MSDN article, efficiently allocating work to a thread pool is not at all trivial, and even when you do it "right", using the TPL generally exhibits better performance.
Use the ThreadPool.
Then you can queue up everything, and items will run as threads become available in the pool, without overwhelming the system. The only trick is determining the optimal number of threads to run at a time.
Sounds like a job for Microsoft HPC Server 2008. Given that it's the number of tasks that's overwhelming, you need some kind of parallel process manager. That's what HPC server is all about.
http://www.microsoft.com/hpc/en/us/default.aspx
In order to give a good answer we need a few questions answered.
Is each individual task parallelizable? Or is each task the product of a parallelizable main task?
Also, is it the number of tasks that would cause the system to run out of memory, or is it the quantity of data each task holds and processes that would cause the system to run out of memory?
Sounds like Windows Workflow Foundation (WF) might be a good fit for this. It might also give you some extra benefits, such as pause/resume on your tasks.