A weird thing is happening here. I thought Parallel.ForEach would wait until all of its tasks are complete before moving on. But then, I have something like this:
List<string> foo(List<A> list)
{
    Dictionary<string, bool> dictionary = new Dictionary<string, bool>();
    Parallel.ForEach(list, element =>
    {
        dictionary[element.Id] = true;
        if (element.SomeMethod())
        {
            dictionary[element.Id] = false;
        }
    });
    List<string> selectedIds = (from element in list where !dictionary[element.Id] select element.Id).ToList();
    return selectedIds;
}
and then I'm getting System.Collections.Generic.KeyNotFoundException (sometimes, not always) on the select line. As you can see, I'm initializing the dictionary for every possible key (the Ids of the list's elements) and still getting this exception, which made me think that this line might be reached before the execution of the Parallel.ForEach completes... Is that right? If so, how can I wait until all branches of this Parallel.ForEach complete?
Parallel.ForEach doesn't need to be awaited, because it doesn't return a Task and isn't asynchronous. When the call to that method returns, the iteration is already done.
However, Parallel.ForEach uses multiple threads concurrently, and Dictionary isn't thread safe.
You probably have a race condition on your hands, and you should be using the thread-safe ConcurrentDictionary instead.
This specific case can be solved in a simpler way by using PLINQ's AsParallel. Note that the selected Ids are exactly those elements for which SomeMethod() returns true, so the filter isn't negated here:
List<string> selectedIds = list.AsParallel().Where(element => element.SomeMethod()).Select(element => element.Id).ToList();
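If you'd rather keep the original dictionary-based structure, a minimal sketch using ConcurrentDictionary could look like this (the A type with its Id and SomeMethod members is taken from the question; everything else is assumed):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

List<string> foo(List<A> list)
{
    // ConcurrentDictionary is safe for concurrent writes from the parallel loop
    var dictionary = new ConcurrentDictionary<string, bool>();
    Parallel.ForEach(list, element =>
    {
        // single write per element: true means "not selected"
        dictionary[element.Id] = !element.SomeMethod();
    });
    // Parallel.ForEach has returned, so every key is guaranteed to be present here
    return (from element in list where !dictionary[element.Id] select element.Id).ToList();
}
```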
Related
I want to process a list of 5000 items. For each item, the processing can be very quick (1 sec) or take much longer (>1 min), but I want to process the list as fast as possible.
I can't put these 5000 items in the .NET ThreadPool, plus I need to know when all the items have been processed, so I was thinking of having a specific number of threads and doing:
foreach(var item in items)
{
// wait for a Thread to be available
// give the item to process to the Thread
}
but what is the easiest way to do that in C#? Should I use Threads, or are there some higher-level classes that I could use?
I would start with Parallel.ForEach and measure your performance. That is a simple, powerful approach and the scheduling does a pretty decent job for a generic scheduler.
Parallel.ForEach(items, i => { /* your code here */ });
I can't put this 5000 items in the .NET ThreadPool
Nor would you want to. It is relatively expensive to create a thread, and context switches take time. If you had, say, 8 cores processing 5000 threads, a meaningful fraction of your execution time would be spent on context switches.
To do parallel processing, this is the structure to use:
Parallel.ForEach(items, (item) =>
{
    ....
});
and if you don't want to overload the thread pool you can use ParallelOptions:
var po = new ParallelOptions
{
    MaxDegreeOfParallelism = 5
};
Parallel.ForEach(items, po, (item) =>
{
    ....
});
I agree with the answers recommending Parallel.ForEach. Without knowing all of the specifics (like what's going on in the loop) I can't say 100%, but as long as the iterations of the loop aren't doing anything that conflicts with each other (like non-thread-safe concurrent operations on some shared object), it should be fine.
You mentioned in a comment that it's throwing an exception. That can be a problem, because if one iteration throws an exception the loop will terminate, leaving your tasks only partially complete.
To avoid that, handle exceptions within each iteration of the loop. For example,
var exceptions = new ConcurrentQueue<Exception>();
Parallel.ForEach(items, i =>
{
try
{
//Your code to do whatever
}
catch(Exception ex)
{
exceptions.Enqueue(ex);
}
});
By using a ConcurrentQueue any iteration can safely add its own exception. When it's done you have a list of exceptions. Now you can decide what to do with them. You could throw a new exception:
if (exceptions.Count > 0) throw new AggregateException(exceptions);
Or, if there's something that uniquely identifies each item, you could do (for example):
var exceptions = new ConcurrentDictionary<Guid, Exception>();
And then when an exception is thrown,
exceptions.TryAdd(item.Id, ex); //making up the Id property
Now you know specifically which items succeeded and which failed.
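Putting those pieces together, a minimal sketch (the Process call and the Id property are made up for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var exceptions = new ConcurrentDictionary<Guid, Exception>();

Parallel.ForEach(items, item =>
{
    try
    {
        Process(item); // whatever each iteration does
    }
    catch (Exception ex)
    {
        exceptions.TryAdd(item.Id, ex); // record exactly which item failed
    }
});

// The loop has finished every iteration; now decide what to do with the failures
if (!exceptions.IsEmpty)
    throw new AggregateException(exceptions.Values);
```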
If I have code similar to this:
foreach (Item child in item.Children)
{
// Do stuff
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 3;
Parallel.ForEach(items, i => DoStuff());
}
Is the Parallel.ForEach going to finish all of its items before moving on to the next foreach item?
Yes - Parallel.ForEach will block. It's a synchronous method, which internally does its work in parallel.
I've gone with a slightly bizarre way to demonstrate the desired property below, because I can't find any nice excerpt from the documentation for e.g. Parallel.ForEach that just comes out and states that the loops are completed before the method returns:
Yes. Note that the return type of Parallel.ForEach is a ParallelLoopResult, which contains information that can only be available once all of the operations have completed, such as IsCompleted:
Gets whether the loop ran to completion, such that all iterations of the loop were executed and the loop didn't receive a request to end prematurely.
ParallelLoopResult is a struct, and so whatever value is returned from Parallel.ForEach cannot be altered after the return from that method.
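A quick way to see this blocking behaviour for yourself (a small sketch of my own, not from the documentation):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int completed = 0;
ParallelLoopResult result = Parallel.ForEach(Enumerable.Range(0, 100), i =>
{
    Thread.Sleep(10);                       // simulate some work
    Interlocked.Increment(ref completed);   // thread-safe counter
});

// Parallel.ForEach has already blocked until every iteration finished
Console.WriteLine(completed);              // 100
Console.WriteLine(result.IsCompleted);     // True
```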
My code does very simple stuff
list already has elements. I have approximately 25000 elements (and I'm expecting to have more) in the list, and each element is small (a DateTime).
List<DateTime> newList = new List<DateTime>();
Parallel.ForEach(list, l => newList.Add(new DateTime(l.Ticks + 5000)));
i.e, based on each element, I'm creating new elements and adding them to a different list.
But this doesn't seem to be a good programming approach; I hit these exceptions sometimes, but not every time.
IndexOutOfRangeException : {"Index was outside the bounds of the array."}
Can we add elements to a list using Parallel.ForEach()? If yes, why do I hit the error? If no, why?
What you would really want in this situation is more like this:
newList = list.AsParallel().Select(l => new DateTime(l.Ticks + 5000)).ToList();
Although you should measure the performance to see if this situation even benefits from parallelization.
Try a thread-local list: each thread collects its results into its own local list, and a final step merges every thread's local list into newList, as such...
Parallel.ForEach(list,
    () => new List<DateTime>(),
    (l, state, localList) =>
    {
        localList.Add(new DateTime(l.Ticks + 5000));
        return localList;
    },
    localList =>
    {
        lock (newList)
        {
            newList.AddRange(localList);
        }
    });
The first param is your old list. The second param is the initializer for each thread's local value (here, an empty per-thread list). The third param block is as follows: the l is the same as in your code; the state is a ParallelLoopState object with which you can exit the parallel loop if you choose; the last is the stand-in variable that represents the thread-local list. The final param represents the end result of each thread's local list and is called once per thread; it is there you can take a lock on newList and merge into the shared newList variable. I have used similar coding in my own code. Hope this helps you or someone else.
This will effectively call List<T>.Add concurrently, yet according to MSDN documentation for List<T>:
"Any instance members are not guaranteed to be thread safe."
Even if it were thread safe, this work is far too cheap to benefit from parallel execution (as opposed to the overhead of parallel execution). Did you actually measure your performance? 25000 elements is not that many.
As everyone has mentioned, there seems to be no case for doing this parallel. It will certainly be far, far slower. However, for completion, the reason this sometimes fails is there is no lock on the list object that's being written to by multiple threads. Add this:
object _locker = new object();
List<DateTime> newList = new List<DateTime>();
Parallel.ForEach(list, l => { lock (_locker) { newList.Add(new DateTime(l.Ticks + 5000)); } });
There simply is not enough work to do for this to warrant using Parallel.ForEach, and List<T> is not thread safe, so you would have to lock if you wanted to add to the same list in parallel. Just use a regular for loop.
Do you really need these in a list? If all you need is to enumerate the list in a foreach, you should probably do this instead, as it will use far less memory:
IEnumerable<DateTime> newSequence = list.Select(d => new DateTime(d.Ticks + 5000));
If you really need these in a list, just add .ToList() at the end:
var newSequence = list.Select(d => new DateTime(d.Ticks + 5000)).ToList();
This will almost certainly be fast enough that you don't need to parallelize it. In fact, this is probably faster than doing it in parallel, as it will have better memory performance.
I have a method that returns XML elements, but that method takes some time to finish and return a value.
What I have now is
foreach (var t in s)
{
    r.Add(method(t));
}
but this only runs the next statement after previous one finishes. How can I make it run simultaneously?
You should be able to use tasks for this:
//first start a task for each element in s, and add the tasks to the tasks collection
var tasks = new List<Task>();
foreach( var t in s)
{
tasks.Add(Task.Factory.StartNew(method(t)));
}
//then wait for all tasks to complete
Task.WaitAll(tasks);
//then add the result of all the tasks to r in a thread-safe fashion
foreach( var task in tasks)
{
r.Add(task.Result);
}
EDIT
There are some problems with the code above; see the code below for a working version. Here I have also rewritten the loops to use LINQ for readability (and, in the case of the first loop, to avoid problems caused by the closure over t inside the lambda expression).
var tasks = s.Select(t => Task<int>.Factory.StartNew(() => method(t))).ToArray();
//then wait for all tasks to complete
Task.WaitAll(tasks);
//then add the results of all the tasks to r in a thread-safe fashion
r = tasks.Select(task => task.Result).ToList();
You can use Parallel.ForEach which will utilize multiple threads to do the execution in parallel. You have to make sure that all code called is thread safe and can be executed in parallel.
Parallel.ForEach(s, t => r.Add(method(t)));
From what I'm seeing you are updating a shared collection inside the loop. This means that if you execute the loop in parallel a data race will occur because multiple threads will try to update a non-synchronized collection (assuming r is a List or something like this) at the same time, causing an inconsistent state.
To execute correctly in parallel, you will need to wrap that section of code inside a lock statement:
object locker = new object();
Parallel.ForEach(s,
    t =>
    {
        lock (locker) r.Add(method(t));
    });
However, this will make the execution effectively serial, because each thread needs to acquire the lock and two threads cannot hold it at the same time.
The better solution would be to have a local list for each thread, add the partial results to that list, and then merge the local lists when all threads have finished. Probably @Øyvind Knobloch-Bråthen's second solution is the best one, assuming method(t) is the real CPU hog in this case.
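That per-thread-list idea can be sketched with the localInit/localFinally overload of Parallel.ForEach (r, s and method are the names from the question; the XElement result type is an assumption, since the question only says the method returns XML elements):

```csharp
object locker = new object();
Parallel.ForEach(s,
    () => new List<XElement>(),              // localInit: one private list per thread
    (t, state, local) =>
    {
        local.Add(method(t));                // no lock needed on the thread-local list
        return local;
    },
    local =>
    {
        lock (locker) { r.AddRange(local); } // localFinally: merge once per thread
    });
```

This way the lock is taken only once per thread rather than once per item, so the iterations themselves still run in parallel.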
Modification to the correct answer for this question
change
tasks.Add(Task.Factory.StartNew(method(t)));
to
//solution will be the following code
tasks.Add(Task.Factory.StartNew(() => { method(t);}));
I am using the below code
var processed = new List<Guid>();
Parallel.ForEach(items, item =>
{
processed.Add(SomeProcessingFunc(item));
});
Is the above code thread safe? Is there a chance of the processed list getting corrupted? Or should I use a lock before adding?
var processed = new List<Guid>();
Parallel.ForEach(items, item =>
{
lock(items.SyncRoot)
processed.Add(SomeProcessingFunc(item));
});
thanks.
No! It is not safe at all, because processed.Add is not thread safe. You can do the following:
items.AsParallel().Select(item => SomeProcessingFunc(item)).ToList();
Keep in mind that Parallel.ForEach was created mostly for imperative operations on each element of a sequence. What you do is a map: projecting each value of a sequence. That is what Select was created for. AsParallel scales it across threads in the most efficient manner.
This code works correctly:
var processed = new List<Guid>();
Parallel.ForEach(items, item =>
{
lock(items.SyncRoot)
processed.Add(SomeProcessingFunc(item));
});
but it makes no sense in terms of multithreading. Locking at each iteration forces totally sequential execution; a bunch of threads will be waiting for a single thread.
Use:
var processed = new ConcurrentBag<Guid>();
See parallel foreach loop - odd behavior.
From Jon Skeet's Book C# in Depth:
As part of Parallel Extensions in .NET 4, there are several new collections in the new System.Collections.Concurrent namespace. These are designed to be safe in the face of concurrent operations from multiple threads, with relatively little locking.
These include:
IProducerConsumerCollection<T>
BlockingCollection<T>
ConcurrentBag<T>
ConcurrentQueue<T>
ConcurrentStack<T>
ConcurrentDictionary<TKey, TValue>
and others
As alternative to the answer of Andrey:
items.AsParallel().Select(item => SomeProcessingFunc(item)).ToList();
You could also write
items.AsParallel().ForAll(item => SomeProcessingFunc(item));
This makes the underlying query even more efficient, because no merge of the results is required (MSDN).
Make sure the SomeProcessingFunc function is thread-safe.
And I think, though I didn't test it, that you still need a lock if the list can be modified on another thread (adding or removing elements).
Using a ConcurrentBag of type Something:
var bag = new ConcurrentBag<Something>();
var items = GetAllItemsINeed();
Parallel.ForEach(items, i =>
{
    bag.Add(i.DoSomethingInEachI());
});
Reading is thread safe, but adding is not. You need a reader/writer lock setup, as adding may cause the internal array to resize, which would mess up a concurrent read.
If you can guarantee the array won't resize on add, you may be safe to add while reading, but don't quote me on that.
But really, a List<T> is just a thin wrapper around an array.
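A hedged sketch of such a reader/writer setup around a shared List<T> (the SafeList wrapper and its member names are made up for illustration):

```csharp
using System.Collections.Generic;
using System.Threading;

class SafeList<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public void Add(T item)
    {
        _lock.EnterWriteLock();          // exclusive: Add may resize the backing array
        try { _items.Add(item); }
        finally { _lock.ExitWriteLock(); }
    }

    public T Get(int index)
    {
        _lock.EnterReadLock();           // shared: many readers can hold this at once
        try { return _items[index]; }
        finally { _lock.ExitReadLock(); }
    }
}
```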