I want to process a list of 5000 items. For each item, the processing can be very quick (1 second) or take a long time (>1 minute), but I want to process the whole list as fast as possible.
I can't put these 5000 items into the .NET ThreadPool, and I also need to know when all the items have been processed, so I was thinking of having a specific number of threads and doing:
foreach(var item in items)
{
// wait for a Thread to be available
// give the item to process to the Thread
}
but what is the easiest way to do that in C#? Should I use Threads directly, or are there higher-level classes I could use?
I would start with Parallel.ForEach and measure your performance. That is a simple, powerful approach and the scheduling does a pretty decent job for a generic scheduler.
Parallel.ForEach(items, i => { /* your code here */ });
I can't put these 5000 items into the .NET ThreadPool
Nor would you want to. It is relatively expensive to create a thread, and context switches take time. If you had, say, 8 cores processing 5000 threads, a meaningful fraction of your execution time would be spent on context switches.
To do parallel processing, this is the structure to use:
Parallel.ForEach(items, item =>
{
    // process item
});
and if you don't want to overload the thread pool, you can use ParallelOptions:
var po = new ParallelOptions
{
    MaxDegreeOfParallelism = 5
};
Parallel.ForEach(items, po, item =>
{
    // process item
});
I agree with the answers recommending Parallel.ForEach. Without knowing all of the specifics (like what's going on in the loop) I can't say for certain, but as long as the iterations of the loop don't conflict with each other (like performing concurrent operations on some other object that isn't thread safe) it should be fine.
You mentioned in a comment that it's throwing an exception. That can be a problem because if one iteration throws an exception then the loop will terminate leaving your tasks only partially complete.
To avoid that, handle exceptions within each iteration of the loop. For example,
var exceptions = new ConcurrentQueue<Exception>();
Parallel.ForEach(items, i =>
{
try
{
//Your code to do whatever
}
catch(Exception ex)
{
exceptions.Enqueue(ex);
}
});
By using a ConcurrentQueue any iteration can safely add its own exception. When it's done you have a list of exceptions. Now you can decide what to do with them. You could throw a new exception:
if (exceptions.Count > 0) throw new AggregateException(exceptions);
Or if there's something that uniquely identifies each item you could do (for example)
var exceptions = new ConcurrentDictionary<Guid, Exception>();
And then when an exception is thrown,
exceptions.TryAdd(item.Id, ex); //making up the Id property
Now you know specifically which items succeeded and which failed.
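Putting those pieces together, a minimal sketch might look like this. The Item type with its Guid Id property and the ProcessItem method are made up for illustration; substitute your own types and work:
// Assumes: items is an IEnumerable<Item>, where the hypothetical
// Item type exposes a Guid Id property, and ProcessItem(item)
// does the per-item work.
var exceptions = new ConcurrentDictionary<Guid, Exception>();
Parallel.ForEach(items, item =>
{
    try
    {
        ProcessItem(item);
    }
    catch (Exception ex)
    {
        exceptions.TryAdd(item.Id, ex); // record which item failed
    }
});
// Every key in the dictionary identifies an item that failed;
// items absent from the dictionary succeeded.
foreach (var pair in exceptions)
    Console.WriteLine("Item {0} failed: {1}", pair.Key, pair.Value.Message);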
I have 2 Threads. One of them repeatedly modifies a List. The other tries to access the List but it usually throws a System.InvalidOperationException.
I've tried locking the List object but it did not help.
public static void Main()
{
List<int> list = new List<int>();
// A thread that repeatedly modifies the list
Thread listAddThread = new Thread(() =>
{
while(true)
{
list.Add(0);
Thread.Sleep(1);
}
});
listAddThread.Start();
// Wait to fill up the list
Thread.Sleep(50);
// Lock the list from other threads till the operation completes
lock(list)
{
foreach(var entry in list)
{
DoSomeStuffThatTakesALotOfTime(entry);
}
}
}
public static void DoSomeStuffThatTakesALotOfTime(int i)
{
Thread.Sleep(10);
Console.WriteLine(i);
}
I expected lock(list) to prevent other threads from accessing the list.
If two different blocks of code use lock(list) then each must acquire the lock. One won't acquire the lock and execute until the other has completed and released the lock. (That, of course, is why we use it - so that the two blocks of code won't execute simultaneously.)
But in this case you're only using the lock once, in the second part of the method. Since there is no other "competing" code that might ever attempt to acquire that lock, it doesn't do anything at all.
There's also no guarantee that listAddThread will complete before the second part of the method starts to iterate the list. That means you might start trying to iterate the list while that other thread you started is adding to it. Based on the exception, that's not a maybe - that's what is happening.
The while(true) loop never stops executing (true is always true), so regardless of whether the wait is 50 ms or 5000 ms you'll never stop adding to the list. But even if that loop did terminate, there's no guarantee that a 50 ms wait is the right amount of time: it could be too short, or wastefully long. It's better to write the code so that everything that needs to execute in sequence actually executes in sequence, rather than guessing how long something will take.
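For illustration, here is one way the example could be restructured so that both threads lock the list and the writer actually stops. The keepAdding flag (read and written via Volatile, available since .NET 4.5) is my addition, not part of the original code:
public static void Main()
{
    List<int> list = new List<int>();
    bool keepAdding = true; // hypothetical stop flag

    Thread listAddThread = new Thread(() =>
    {
        while (Volatile.Read(ref keepAdding))
        {
            lock (list) // same lock object the reader uses
            {
                list.Add(0);
            }
            Thread.Sleep(1);
        }
    });
    listAddThread.Start();

    // Wait to fill up the list
    Thread.Sleep(50);

    // Stop the writer and wait for it to finish
    Volatile.Write(ref keepAdding, false);
    listAddThread.Join();

    // The writer has stopped by now, but locking on both sides is
    // what makes concurrent access safe in general.
    lock (list)
    {
        foreach (var entry in list)
        {
            DoSomeStuffThatTakesALotOfTime(entry);
        }
    }
}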
I'm playing around with BlockingCollection to try to understand it better, but I'm struggling to understand why my code hangs after it finishes processing all my items when I use a Parallel.For.
I'm just adding a number to it (producer?):
var blockingCollection = new BlockingCollection<long>();
long count = 0;
Task.Factory.StartNew(() =>
{
    while (count <= 10000)
    {
        blockingCollection.Add(count);
        count++;
    }
});
Then I'm trying to process (Consumer?):
var total = new long[5]; // per-worker tallies
Parallel.For(0, 5, x =>
{
    foreach (long value in blockingCollection.GetConsumingEnumerable())
    {
        total[x] += 1;
        Console.WriteLine("Worker {0}: {1}", x, value);
    }
});
But when it finishes processing all the numbers, it just hangs there. What am I doing wrong?
Also, when I set my Parallel.For to 5, does that mean it's processing the data on 5 separate threads?
As its name implies, operations on BlockingCollection<T> block when they can't do anything, and this includes GetConsumingEnumerable().
The reason for this is that the collection can't tell if your producer is already done, or just busy producing the next item.
What you need to do is to notify the collection that you're done adding items to it by calling CompleteAdding(). For example:
while (count <= 10000)
{
blockingCollection.Add(count);
count++;
}
blockingCollection.CompleteAdding();
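Putting it all together, a minimal end-to-end sketch (with count starting at 0 and the same five workers as in the question) would be:
var blockingCollection = new BlockingCollection<long>();
var total = new long[5];
long count = 0;

Task.Factory.StartNew(() =>
{
    while (count <= 10000)
    {
        blockingCollection.Add(count);
        count++;
    }
    blockingCollection.CompleteAdding(); // lets the consumers finish
});

Parallel.For(0, 5, x =>
{
    // This foreach now ends once the producer calls CompleteAdding()
    // and the collection has drained.
    foreach (long value in blockingCollection.GetConsumingEnumerable())
    {
        total[x] += 1;
        Console.WriteLine("Worker {0}: {1}", x, value);
    }
});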
This is by design: GetConsumingEnumerable blocks the consumer thread whenever no items are available, until an item arrives or adding is marked complete. The MSDN documentation for GetConsumingEnumerable describes this behavior.
Also, Parallel.For(0, 5) doesn't guarantee that the data will be processed on 5 separate threads; the actual degree of parallelism is up to the scheduler and depends on factors such as Environment.ProcessorCount.
Also, when I set my Parallel.For to 5, does that mean it's processing the data on 5 separate threads?
No. Quoting from a previous answer on SO (How many threads Parallel.For(Foreach) will create? Default MaxDegreeOfParallelism?):
The default scheduler for Task Parallel Library and PLINQ uses the .NET Framework ThreadPool to queue and execute work. In the .NET Framework 4, the ThreadPool uses the information that is provided by the System.Threading.Tasks.Task type to efficiently support the fine-grained parallelism (short-lived units of work) that parallel tasks and queries often represent.
Put simply, the TPL creates Tasks, not threads; the framework decides how many threads will service them.
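If you want to observe this yourself, a quick sketch like the following prints the pool thread each iteration actually runs on; you will typically see the same thread ID appear for several iterations:
Parallel.For(0, 5, i =>
{
    Console.WriteLine("Iteration {0} ran on thread {1}",
        i, Thread.CurrentThread.ManagedThreadId);
    Thread.Sleep(100); // simulate a little work
});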
In a Parallel.For, is it possible to synchronize each threads with a 'WaitAll' ?
Parallel.For(0, maxIter, i =>
{
// Do stuffs
// Synchronisation : wait for all threads => ???
// Do another stuffs
});
Parallel.For, in the background, batches the iterations of the loop into one or more Tasks, which can execute in parallel. Unless you take ownership of the partitioning, the number of tasks (and threads) is (and should be!) abstracted away. Control only exits the Parallel.For loop once all the tasks have completed (i.e. there is no need for WaitAll).
The idea of course is that each loop iteration is independent and doesn't require synchronization.
If synchronization is required inside the tight loop, then you haven't isolated the Tasks correctly, or it means that Amdahl's Law is in effect and the problem can't be sped up through parallelization.
However, for an aggregation type pattern, you may need to synchronize after completion of each Task - use the overload with the localInit / localFinally to do this, e.g.:
// allTheStrings is a shared resource which isn't thread safe
var allTheStrings = new List<string>();
Parallel.For( // for (
0, // var i = 0;
numberOfIterations, // i < numberOfIterations;
() => new List<string> (), // localInit - Setup each task. List<string> --> localStrings
(i, parallelLoopState, localStrings) =>
{
// The "tight" loop. If you need to synchronize here, there is no point
// using parallel at all
localStrings.Add(i.ToString());
return localStrings;
},
(localStrings) => // local Finally for each task.
{
// Synchronization is needed here - this runs once per task
lock(allTheStrings)
{
allTheStrings.AddRange(localStrings);
}
});
In the above example, you could also have just declared allTheStrings as
var allTheStrings = new ConcurrentBag<string>();
In which case, we wouldn't have required the lock in the localFinally.
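For reference, a sketch of that variant, with the same structure but no lock in the localFinally:
var allTheStrings = new ConcurrentBag<string>();
Parallel.For(
    0,
    numberOfIterations,
    () => new List<string>(),          // localInit
    (i, parallelLoopState, localStrings) =>
    {
        localStrings.Add(i.ToString());
        return localStrings;
    },
    localStrings =>                    // localFinally
    {
        // ConcurrentBag is thread safe, so no lock is required
        foreach (var s in localStrings)
            allTheStrings.Add(s);
    });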
You shouldn't (for the reasons other users have stated), but if you really want to, you can use Barrier. A Barrier blocks threads at a certain point until the expected number of participants have arrived at it, at which point all of them unblock and proceed. The downside of this approach, as others have said, is the risk of deadlocks.
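For completeness, a minimal sketch of Barrier with explicitly created threads; note this deliberately avoids Parallel.For, which does not guarantee a fixed number of participant threads, so a Barrier inside it can deadlock:
const int participants = 4;
var barrier = new Barrier(participants);

var threads = new Thread[participants];
for (int t = 0; t < participants; t++)
{
    int id = t; // capture a copy of the loop variable
    threads[t] = new Thread(() =>
    {
        Console.WriteLine("Thread {0}: phase 1", id);
        barrier.SignalAndWait(); // block until all participants arrive
        Console.WriteLine("Thread {0}: phase 2", id);
    });
    threads[t].Start();
}

foreach (var thread in threads)
    thread.Join();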
I have a data table full of summary entries and my software needs to go through and reach out to a web service to get details, then record those details back to the database. Looping through the table synchronously while calling the web service and waiting for the response is too slow (there are thousands of entries) so I'd like to take the results (10 or so at a time) and thread it out so it performs 10 operations at the same time.
My experience with C# threads is limited to say the least, so what's the best approach? Does .NET have some sort of threadsafe queue system that I can use to make sure that the results get handled properly and in order?
Depending on which version of the .NET Framework you are on, you have two pretty good options.
You can use ThreadPool.QueueUserWorkItem in any version.
int pending = table.Rows.Count;
var finished = new ManualResetEvent(false);
foreach (DataRow row in table.Rows)
{
DataRow capture = row; // Required to close over the loop variable correctly.
ThreadPool.QueueUserWorkItem(
(state) =>
{
try
{
ProcessDataRow(capture);
}
finally
{
if (Interlocked.Decrement(ref pending) == 0)
{
finished.Set(); // Signal completion of all work items.
}
}
}, null);
}
finished.WaitOne(); // Wait for all work items to complete.
If you are using .NET Framework 4.0 you can use the Task Parallel Library.
var tasks = new List<Task>();
foreach (DataRow row in table.Rows)
{
DataRow capture = row; // Required to close over the loop variable correctly.
tasks.Add(
Task.Factory.StartNew(
() =>
{
ProcessDataRow(capture);
}));
}
Task.WaitAll(tasks.ToArray()); // Wait for all work items to complete.
There are many other reasonable ways to do this. I highlight the patterns above because they are easy and work well. In the absence of specific details I cannot say for certain that either will be a perfect match for your situation, but they should be a good starting point.
Update:
I had a short period of subpar cerebral activity. If you have the TPL available you could also use Parallel.ForEach as a simpler method than all of that Task hocus-pocus I mentioned above.
// DataRowCollection only implements the non-generic IEnumerable,
// so Cast<DataRow>() (from System.Linq) is needed here.
Parallel.ForEach(table.Rows.Cast<DataRow>(),
    row =>
    {
        ProcessDataRow(row);
    });
Does .NET have some sort of threadsafe queue system that I can use to make sure that the results get handled properly and in order?
This was added in .NET 4. The BlockingCollection<T> class, by default, acts as a thread-safe queue for producer/consumer scenarios.
It makes it fairly easy to create a number of consumers that take items from the collection and process them, with one or more producers adding to the collection.
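A rough sketch of that pattern applied here, reusing ProcessDataRow from the answer above and assuming 10 consumers to match the "10 at a time" goal. Items come out of the queue in FIFO order, though individual web calls may still finish out of order:
var queue = new BlockingCollection<DataRow>();

// Start 10 consumers.
var consumers = Enumerable.Range(0, 10)
    .Select(_ => Task.Factory.StartNew(() =>
    {
        // GetConsumingEnumerable ends once CompleteAdding() is
        // called and the queue has drained.
        foreach (DataRow row in queue.GetConsumingEnumerable())
        {
            ProcessDataRow(row);
        }
    }))
    .ToArray();

// Produce: feed every row into the queue.
foreach (DataRow row in table.Rows)
{
    queue.Add(row);
}
queue.CompleteAdding();

Task.WaitAll(consumers); // wait for all rows to be processed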
This isn't about the different methods I could or should be using to utilize the queues in the best manner; rather, it's about something I have seen happen that makes no sense to me.
void Runner() {
// member variable
queue = Queue.Synchronized(new Queue());
while (true) {
if (0 < queue.Count) {
queue.Dequeue();
}
}
}
This is run in a single thread:
var t = new Thread(Runner);
t.IsBackground = true;
t.Start();
Other events are "Enqueue"ing elsewhere. What I've seen happen is that, over a period of time, the Dequeue will actually throw an InvalidOperationException: queue empty. This should be impossible, seeing as how the count guarantees there is something there, and I'm positive that nothing else is "Dequeue"ing.
The question(s):
Is it possible that the Enqueue actually increases the count before the item is fully on the queue (whatever that means...)?
Is it possible that the thread is somehow restarting (expiring, reseting...) at the Dequeue statement, but immediately after it already removed an item?
Edit (clarification):
These code pieces are part of a Wrapper class that implements the background helper thread. The Dequeue here is the only Dequeue, and all Enqueue/Dequeue are on the Synchronized member variable (queue).
Using Reflector, you can see that no, the count does not get increased until after the item is added.
As Ben points out, it does seem as though you have multiple callers of Dequeue.
You say you are positive that nothing else is calling dequeue. Is that because you only have the one thread calling dequeue? Is dequeue called anywhere else at all?
EDIT:
I wrote a little sample code, but could not get the problem to reproduce. It just kept running and running without any exceptions.
How long was it running before you got errors? Maybe you can share a bit more of the code.
class Program
{
static Queue q = Queue.Synchronized(new Queue());
static volatile bool running = true; // volatile so the worker threads see the update
static void Main()
{
Thread producer1 = new Thread(() =>
{
while (running)
{
q.Enqueue(Guid.NewGuid());
Thread.Sleep(100);
}
});
Thread producer2 = new Thread(() =>
{
while (running)
{
q.Enqueue(Guid.NewGuid());
Thread.Sleep(25);
}
});
Thread consumer = new Thread(() =>
{
while (running)
{
if (q.Count > 0)
{
Guid g = (Guid)q.Dequeue();
Console.Write(g.ToString() + " ");
}
else
{
Console.Write(" . ");
}
Thread.Sleep(1);
}
});
consumer.IsBackground = true;
consumer.Start();
producer1.Start();
producer2.Start();
Console.ReadLine();
running = false;
}
}
Here is what I think the problematic sequence is:
(0 < queue.Count) evaluates to true, the queue is not empty.
This thread gets preempted and another thread runs.
The other thread removes an item from the queue, emptying it.
This thread resumes execution, but is now inside the if block, and attempts to dequeue an empty queue.
However, you say nothing else is dequeuing...
Try outputting the count inside the if block. If you see the count jump numbers downwards, someone else is dequeuing.
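Either way, the fix is to make the check and the dequeue atomic. A minimal sketch, locking the synchronized wrapper's SyncRoot around both operations:
// queue is the Queue.Synchronized wrapper; its SyncRoot is the
// same object the wrapper locks on internally.
lock (queue.SyncRoot)
{
    if (queue.Count > 0)
    {
        queue.Dequeue();
    }
}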
Here's a possible answer from the MSDN page on this topic:
Enumerating through a collection is intrinsically not a thread-safe procedure. Even when a collection is synchronized, other threads can still modify the collection, which causes the enumerator to throw an exception. To guarantee thread safety during enumeration, you can either lock the collection during the entire enumeration or catch the exceptions resulting from changes made by other threads.
My guess is that you're correct - at some point, there's a race condition happening, and you end up dequeuing something that isn't there.
A Mutex or a Monitor lock (the lock statement) is probably appropriate here.
Good luck!
Are the other areas that are "Enqueuing" data also using the same synchronized queue object? In order for the Queue.Synchronized to be thread-safe, all Enqueue and Dequeue operations must use the same synchronized queue object.
From MSDN:
To guarantee the thread safety of the Queue, all operations must be done through this wrapper only.
Edited:
If you are looping over many items that involve heavy computation, or if you are running a long-lived thread loop (communications, etc.), you should include a wait such as System.Threading.Thread.Sleep, System.Threading.WaitHandle.WaitOne, System.Threading.WaitHandle.WaitAll, or System.Threading.WaitHandle.WaitAny in the loop; otherwise the busy loop can kill system performance.
question 1: If you're using a synchronized queue, then no, you're safe! But you'll need to use the synchronized instance on both sides, the producer and the consumer.
question 2: Terminating your worker thread when there is no work to do is a simple job. However, you either need a monitoring thread, or you need the queue to start a background worker thread whenever it has something to do. The latter sounds more like the ActiveObject pattern than a simple queue (whose single responsibility says it should only do queueing).
In addition, I'd go for a blocking queue instead of your code above. The way your code works, it consumes CPU even when there is no work to do. A blocking queue lets your worker thread sleep whenever there is nothing to do; you can have multiple sleeping threads without burning CPU.
C# doesn't come with a blocking queue implementation, but there are many out there.
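A minimal sketch of such a blocking queue built on Monitor.Wait and Monitor.Pulse, just to illustrate the idea (not production code):
public class BlockingQueue<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private readonly object _lock = new object();

    public void Enqueue(T item)
    {
        lock (_lock)
        {
            _queue.Enqueue(item);
            Monitor.Pulse(_lock); // wake one waiting consumer
        }
    }

    public T Dequeue()
    {
        lock (_lock)
        {
            while (_queue.Count == 0)
                Monitor.Wait(_lock); // sleep, releasing the lock
            return _queue.Dequeue();
        }
    }
}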
Another option for making thread-safe use of queues is the ConcurrentQueue<T> class, introduced in .NET Framework 4 (after this question was asked in 2009). It can help you avoid writing your own synchronization code, or at least make it much simpler.
From .NET Framework 4.6 onward, ConcurrentQueue<T> also implements the interface IReadOnlyCollection<T>.
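A quick sketch of the same consumer with ConcurrentQueue<T>; TryDequeue checks and removes in one atomic step, so the count-then-dequeue race disappears entirely:
var queue = new ConcurrentQueue<int>();

// Producer side, from any thread:
queue.Enqueue(42);

// Consumer side - no separate Count check needed:
int item;
if (queue.TryDequeue(out item))
{
    // item was removed atomically
    Console.WriteLine(item);
}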