Before I start, I should mention that I feel like I've got the wrong end of the stick here. But here we go anyway:
Imagine we have the following class:
public class SomeObject {
public int SomeInt;
private SomeObject anotherObject;
public void DoStuff() {
if (SomeCondition()) anotherObject.SomeInt += 1;
}
}
Now, imagine that we have a collection of these SomeObjects:
IList<SomeObject> allObjects = new List<SomeObject>(1000);
// ... Pretend the list is populated with 1000 SomeObjects here
Let's say I call DoStuff() on each one, like so:
foreach (var #object in allObjects) #object.DoStuff();
All is good so far.
Now, let's assume that the order in which the objects have their DoStuff() called is not important. Assume that SomeCondition() is computationally expensive, perhaps. I could utilize all four cores on my machine (and potentially get a performance gain) with:
Parallel.For(0, 1000, i => allObjects[i].DoStuff());
Now, ignoring any issues with atomicity of variable access, I don't care whilst I am in the loop whether or not any given SomeObject sees an outdated version of anotherObject or SomeInt.* However, once the loop is done, I want to make sure that my main worker thread (i.e. the one that called Parallel.For) DOES see everything up-to-date.
Is there a guarantee of this (e.g. some sort of memory barrier?) with using Parallel.For? Or do I need to make some sort of guarantee myself? Or is there no way to make this guarantee?
Finally, if I call Parallel.For(...) again in the same way just after, will all worker threads be working with the new, up-to-date values for everything?
(*) The implementers of DoStuff() would be wrong to make assumptions about the order of processing anyway, right?
var locker = new object();
var total = 0.0;
Parallel.For(1, 10000000,
i => { lock (locker) total += (i + 1); });
Console.WriteLine("WithLocker" + total);
var total2 = 0.0;
Parallel.For(1, 10000000,
i => total2 += (i + 1));
Console.WriteLine("WithoutLocker" + total2);
Console.ReadKey();
// WithLocker 50000004999999
// WithoutLocker 28861729333278
I have made for you two example one with locker and one without look to the result!
There are two issues here.
However, once the loop is done, I want to make sure that my main worker thread (i.e. the one that called Parallel.For) DOES see everything up-to-date.
To answer your question. Yes, once your Parallel.For has completed all the calls to DoStuff will have completed and your array will not see any more updates.
Now, ignoring any issues with atomicity of variable access, I don't care whilst I am in the loop whether or not any given SomeObject sees an outdated version of anotherObject or SomeInt.*
I really doubt that you don't care about this if you want a correct answer. Bassam's answer addresses the potential data races in your code. If one thread is running DoSomething and this writes to another index in the array which is simultaneously being read by another thread then you will see nondeterministic results. Locking can solve this (as shown above) but at the expense of performance. Locking on every thread for every update effectively serializes your work. I suspect that Bassam's lock example actually runs no faster and possibly slower that the non-locking one, although it does produce the correct answer.
If SomeObject::anotherObject refers to anything other than this you have a potential race condition. Consider the case where anotherObject refers to the element in the array adjacent to the current object. What happens when these run concurrently? One thread's code will be trying to read an instance of SomeObject while another thread writes to it. The write not guaranteed to happen atomically, your read my return an object in a half written state.
This depends a bit on what is being updated in SomeObject and how it's being updated. For example if all you are doing is incrementing an single integer value you could use Interlocked Operations to increment the value in a thread safe way or use critical sections or locks to ensure that your SomeObject is actually thread safe. Adding synchronization operations usually impacts performance so if possible I would recommend looking for an approach that does not require adding synchronization.
You can fix this in one of two ways.
1) If each instance of anotherObject in the array is guaranteed to be only updated once by one call to allObjects[i].DoStuff() then you can modify your code to have an input and output array. This prevents any race conditions as reads and writes no longer conflict. It means you need two copies of your array and they both need to be initialized.
2) If you are updating array items multiple times, or having two arrays of SomeObject is not an option and SomeCondition() is the only computationally expensive part of your method then you could parallelize this and then update the array sequentially.
IList<bool> allConditions = new List<bool>(1000);
Parallel.For(0, 1000, i => SomeCondition(i)) // Write allConditions not allObjects
for (int i = 0; i < 1000; ++i) { #object.DoStuff(allConditions[i]); }
So your observation:
This is interesting. It means that Parallel.For is basically only useful for code that's already thread-safe... Damn
Is not entirely correct. The code within your Parallel.For must either be thread safe or not access data and resources in a non-thread safe way. In other words it doesn't have to lock if you can rearrange your code to guarantee that there are no race conditions (or deadlocks) because none of the threads write the same data or will read data that another thread may be writing to. Note that concurrent reads are OK.
Related
I tried to transform a simple sequential loop into a parallel computed loop with the System.Threading.Tasks library.
The code compiles, returns correct results, but It does not save any computational cost, otherwise, it takes longer.
EDIT: Sorry guys, I have probably oversimplified the question and made some errors doing that.
To append additional information, I am running the code on an i7-4700QM, and it is referenced in a Grasshopper script.
Here is the actual code. I also switched to a non thread-local variables
public static class LineNet
{
public static List<Ray> SolveCpu(List<Speaker> sources, List<Receiver> targets, List<Panel> surfaces)
{
ConcurrentBag<Ray> rays = new ConcurrentBag<Ray>();
for (int i = 0; i < sources.Count; i++)
{
Parallel.For(
0,
targets.Count,
j =>
{
Line path = new Line(sources[i].Position, targets[j].Position);
Ray ray = new Ray(path, i, j);
if (Utils.CheckObstacles(ray,surfaces))
{
rays.Add(ray);
}
}
);
}
}
}
The Grasshopper implementation just collects sources targets and surfaces, calls the method Solve and returns rays.
I understand that dispatching workload to threads is expensive, but is it so expensive?
Or is the ConcurrentBag just preventing parallel calculation?
Plus, my classes are immutable (?), but if I use a common List the kernel aborts the operation and throws an exception, is someone able to tell why?
Without a good Minimal, Complete, and Verifiable code example that reliably reproduces the problem, it is not possible to provide a definitive answer. The code you posted does not even appear to be an excerpt of real code, because the type declared as the return type of the method isn't the same as the value actually returned by the return statement.
However, certainly the code you posted does not seem like a good use of Parallel.For(). Your Line constructor would have be fairly expensive to justify parallelizing the task of creating the items. And to be clear, that's the only possible win here.
At the end, you still need to aggregate all of the Line instances that you created into a single list, so all those intermediate lists created for the Parallel.For() tasks are just pure overhead. And the aggregation is necessarily serialized (i.e. only one thread at a time can be adding an item to the result collection), and in the worst way (each thread only gets to add a single item before it gives up the lock and another thread has a chance to take it).
Frankly, you'd be better off storing each local List<T> in a collection, and then aggregating them all at once in the main thread after Parallel.For() returns. Not that that would be likely to make the code perform better than a straight-up non-parallelized implementation. But at least it would be less likely to be worse. :)
The bottom line is that you don't seem to have a workload that could benefit from parallelization. If you think otherwise, you'll need to explain the basis for that thought in a clearer, more detailed way.
if I use a common List the kernel aborts the operation and throws an exception, is someone able to tell why?
You're already using (it appears) List<T> as the local data for each task, and indeed that should be fine, as tasks don't share their local data.
But if you are asking why you get an exception if you try to use List<T> instead of ConcurrentBag<T> for the result variable, well that's entirely to be expected. The List<T> class is not thread safe, but Parallel.For() will allow each task it runs to execute the localFinally delegate concurrently with all the others. So you have multiple threads all trying to modify the same not-thread-safe collection concurrently. This is a recipe for disaster. You're fortunate you get the exception; the actual behavior is undefined, and it's just as likely you'll simply corrupt the data structure as cause a run-time exception.
I'm trying to implement the Parallel.ForEach pattern and track progress, but I'm missing something regarding locking. The following example counts to 1000 when the threadCount = 1, but not when the threadCount > 1. What is the correct way to do this?
class Program
{
static void Main()
{
var progress = new Progress();
var ids = Enumerable.Range(1, 10000);
var threadCount = 2;
Parallel.ForEach(ids, new ParallelOptions { MaxDegreeOfParallelism = threadCount }, id => { progress.CurrentCount++; });
Console.WriteLine("Threads: {0}, Count: {1}", threadCount, progress.CurrentCount);
Console.ReadKey();
}
}
internal class Progress
{
private Object _lock = new Object();
private int _currentCount;
public int CurrentCount
{
get
{
lock (_lock)
{
return _currentCount;
}
}
set
{
lock (_lock)
{
_currentCount = value;
}
}
}
}
The usual problem with calling something like count++ from multiple threads (which share the count variable) is that this sequence of events can happen:
Thread A reads the value of count.
Thread B reads the value of count.
Thread A increments its local copy.
Thread B increments its local copy.
Thread A writes the incremented value back to count.
Thread B writes the incremented value back to count.
This way, the value written by thread A is overwritten by thread B, so the value is actually incremented only once.
Your code adds locks around operations 1, 2 (get) and 5, 6 (set), but that does nothing to prevent the problematic sequence of events.
What you need to do is to lock the whole operation, so that while thread A is incrementing the value, thread B can't access it at all:
lock (progressLock)
{
progress.CurrentCount++;
}
If you know that you will only need incrementing, you could create a method on Progress that encapsulates this.
Old question, but I think there is a better answer.
You can report progress using Interlocked.Increment(ref progress) that way you do not have to worry about locking the write operation to progress.
The easiest solution would actually have been to replace the property with a field, and
lock { ++progress.CurrentCount; }
(I personally prefer the look of the preincrement over the postincrement, as the "++." thing clashes in my mind! But the postincrement would of course work the same.)
This would have the additional benefit of decreasing overhead and contention, since updating a field is faster than calling a method that updates a field.
Of course, encapsulating it as a property can have advantages too. IMO, since field and property syntax is identical, the ONLY advantage of using a property over a field, when the property is autoimplemented or equivalent, is when you have a scenario where you may want to deploy one assembly without having to build and deploy dependent assemblies anew. Otherwise, you may as well use faster fields! If the need arises to check a value or add a side effect, you simply convert the field to a property and build again. Therefore, in many practical cases, there is no penalty to using a field.
However, we are living in a time where many development teams operate dogmatically, and use tools like StyleCop to enforce their dogmatism. Such tools, unlike coders, are not smart enough to judge when using a field is acceptable, so invariably the "rule that is simple enough for even StyleCop to check" becomes "encapsulate fields as properties", "don't use public fields" et cetera...
Remove lock statements from properties and modify Main body:
object sync = new object();
Parallel.ForEach(ids, new ParallelOptions {MaxDegreeOfParallelism = threadCount},
id =>
{
lock(sync)
progress.CurrentCount++;
});
The issue here is that ++ is not atomic - one thread can read and increment the value between another thread reading the value and it storing the (now incorrect) incremented value. This is probably compounded by the fact there's a property wrapping your int.
e.g.
Thread 1 Thread 2
reads 5 .
. reads 5
. writes 6
writes 6! .
The locks around the setter and getter don't help this, as there's nothing to stop the lock blocks themseves being called out of order.
Ordinarily, I'd suggest using Interlocked.Increment, but you can't use this with a property.
Instead, you could expose _lock and have the lock block be around the progress.CurrentCount++; call.
It is better to store any database or file system operation in a local buffer variable instead of locking it. locking reduces performance.
There are a fair share of questions about Interlocked vs. volatile here on SO, I understand and know the concepts of volatile (no re-ordering, always reading from memory, etc.) and I am aware of how Interlocked works in that it performs an atomic operation.
But my question is this: Assume I have a field that is read from several threads, which is some reference type, say: public Object MyObject;. I know that if I do a compare exchange on it, like this: Interlocked.CompareExchange(ref MyObject, newValue, oldValue) that interlocked guarantees to only write newValue to the memory location that ref MyObject refers to, if ref MyObject and oldValue currently refer to the same object.
But what about reading? Does Interlocked guarantee that any threads reading MyObject after the CompareExchange operation succeeded will instantly get the new value, or do I have to mark MyObject as volatile to ensure this?
The reason I'm wondering is that I've implemented a lock-free linked list which updates the "head" node inside itself constantly when you prepend an element to it, like this:
[System.Diagnostics.DebuggerDisplay("Length={Length}")]
public class LinkedList<T>
{
LList<T>.Cell head;
// ....
public void Prepend(T item)
{
LList<T>.Cell oldHead;
LList<T>.Cell newHead;
do
{
oldHead = head;
newHead = LList<T>.Cons(item, oldHead);
} while (!Object.ReferenceEquals(Interlocked.CompareExchange(ref head, newHead, oldHead), oldHead));
}
// ....
}
Now after Prepend succeeds, are the threads reading head guaranteed to get the latest version, even though it's not marked as volatile?
I have been doing some empirical tests, and it seems to be working fine, and I have searched here on SO but not found a definitive answer (a bunch of different questions and comments/answer in them all say conflicting things).
Does Interlocked guarantee that any threads reading MyObject after the CompareExchange operation succeeded will instantly get the new value, or do I have to mark MyObject as volatile to ensure this?
Yes, subsequent reads on the same thread will get the new value.
Your loop unrolls to this:
oldHead = head;
newHead = ... ;
Interlocked.CompareExchange(ref head, newHead, oldHead) // full fence
oldHead = head; // this read cannot move before the fence
EDIT:
Normal caching can happen on other threads. Consider:
var copy = head;
while ( copy == head )
{
}
If you run that on another thread, the complier can cache the value of head and never see the update.
Your code should work fine. Though it is not clearly documented the Interlocked.CompareExchange method will produce a full-fence barrier. I suppose you could make one small change and omit the Object.ReferenceEquals call in favor of relying on the != operator which would perform reference equality by default.
For what it is worth the documentation for the InterlockedCompareExchange Win API call is much better.
This function generates a full memory barrier (or fence) to ensure
that memory operations are completed in order.
It is a shame the same level documenation does not exist on the .NET BCL counterpart Interlocked.CompareExchange because it is very likely they map to the exact same underlying mechanisms for the CAS.
Now after Prepend succeeds, are the threads reading head guaranteed to
get the latest version, even though it's not marked as volatile?
No, not necessarily. If those threads do not generate an acquire-fence barrier then there is no guarantee that they will read the latest value. Make sure that you perform a volatile read upon any use of head. You have already ensured that in Prepend with the Interlocked.CompareExchange call. Sure, that code may go through the loop once with an stale value of head, but the next iteration will be refreshed due to the Interlocked operation.
So if the context of your question was in regards to other threads who are also executing Prepend then nothing more needs to be done.
But if the context of your question was in regards to other threads executing another method on LinkedList then make sure you use Thread.VolatileRead or Interlocked.CompareExchange where appropriate.
Side note...there may be a micro-optimization that could be performed on the following code.
newHead = LList<T>.Cons(item, oldHead);
The only problem I see with this is that memory is allocated on every iteration of the loop. During periods of high contention the loop may spin around several times before it finally succeeds. You could probably lift this line outside of the loop as long as you reassign the linked reference to oldHead on every iteration (so that you get the fresh read). This way memory is only allocated once.
I've an application that makes use of parallelization for processing data.
The main program is in C#, while one of the routine for analyzing data is on an external C++ dll. This library scans data and calls a callback everytime a certain signal is found within the data. Data should be collected, sorted and then stored into HD.
Here is my first simple implementation of the method invoked by the callback and of the method for sorting and storing data:
// collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// method invoked by the callback
private void Collect(int type, long time)
{
lock(locker) { mySignalList.Add(new MySignal(type, time)); }
}
// store signals to disk
private void Store()
{
// sort the signals
mySignalList.Sort();
// file is a object that manages the writing of data to a FileStream
file.Write(mySignalList.ToArray());
}
Data is made up of a bidimensional array (short[][] data) of size 10000 x n, with n variable. I use parallelization in this way:
Parallel.For(0, 10000, (int i) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
}
Now for each of the 10000 arrays I estimate that 0 to 4 callbacks could be fired. I'm facing a bottleneck and given that my CPU resources are not over-utilized, I suppose that the lock (together with thousand of callbacks) is the problem (am I right or there could be something else?). I've tried the ConcurrentBag collection but performances are still worse (in line with other user findings).
I thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
[EDIT]
I've followed the Reed Copsey's suggestion and I used the following solution (according to the profiler of VS2010, before the burden for locking and adding to the list was taking 15% of the resources, while now only 1%):
// master collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// thread-local storage of data (each thread is working on its List<MySignal>)
ThreadLocal<List<MySignal>> threadLocal;
// analyze data
private void AnalizeData()
{
using(threadLocal = new ThreadLocal<List<MySignal>>(() =>
{ return new List<MySignal>(); }))
{
Parallel.For<int>(0, 10000,
() =>
{ return 0;},
(i, loopState, localState) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
return 0;
},
(localState) =>
{
lock(this)
{
// add thread-local lists to the master collection
mySignalList.AddRange(local.Value);
local.Value.Clear();
}
});
}
}
// method invoked by the callback
private void Collect(int type, long time)
{
local.Value.Add(new MySignal(type, time));
}
thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
You might want to look at using ThreadLocal<T> to hold your collections. This automatically allocates a separate collection per thread.
That being said, there are overloads of Parallel.For which work with local state, and have a collection pass at the end. This, potentially, would allow you to spawn your ProcessData wrapper, where each loop body was working on its own collection, and then recombine at the end. This would, potentially, eliminate the need for locking (since each thread is working on it's own data set) until the recombination phase, which happens once per thread (instead of once per task,ie: 10000 times). This could reduce the number of locks you're taking from ~25000 (0-4*10000) down to a few (system and algorithm dependent, but on a quad core system, probably around 10 in my experience).
For details, see my blog post on aggregating data with Parallel.For/ForEach. It demonstrates the overloads and explains how they work in more detail.
You don't say how much of a "bottleneck" you're encountering. But let's look at the locks.
On my machine (quad core, 2.4 GHz), a lock costs about 70 nanoseconds if it's not contended. I don't know how long it takes to add an item to a list, but I can't imagine that it takes more than a few microseconds. But let's it takes 100 microseconds (I would be very surprised to find that it's even 10 microseconds) to add an item to the list, taking into account lock contention. So if you're adding 40,000 items to the list, that's 4,000,000 microseconds, or 4 seconds. And I would expect one core to be pegged if this were the case.
I haven't used ConcurrentBag, but I've found the performance of BlockingCollection to be very good.
I suspect, though, that your bottleneck is somewhere else. Have you done any profiling?
The basic collections in C# aren't thread safe.
The problem you're having is due to the fact that you're locking the entire collection just to call an add() method.
You could create a thread-safe collection that only locks single elements inside the collection, instead of the whole collection.
Lets look at a linked list for example.
Implement an add(item (or list)) method that does the following:
Lock collection.
A = get last item.
set last item reference to the new item (or last item in new list).
lock last item (A).
unclock collection.
add new items/list to the end of A.
unlock locked item.
This will lock the whole collection for just 3 simple tasks when adding.
Then when iterating over the list, just do a trylock() on each object. if it's locked, wait for the lock to be free (that way you're sure that the add() finished).
In C# you can do an empty lock() block on the object as a trylock().
So now you can add safely and still iterate over the list at the same time.
Similar solutions can be implemented for the other commands if needed.
Any built-in solution for a collection is going to involve some locking. There may be ways to avoid it, perhaps by segregating the actual data constructs being read/written, but you're going to have to lock SOMEWHERE.
Also, understand that Parallel.For() will use the thread pool. While simple to implement, you lose fine-grained control over creation/destruction of threads, and the thread pool involves some serious overhead when starting up a big parallel task.
From a conceptual standpoint, I would try two things in tandem to speed up this algorithm:
Create threads yourself, using the Thread class. This frees you from the scheduling slowdowns of the thread pool; a thread starts processing (or waiting for CPU time) when you tell it to start, instead of the thread pool feeding requests for threads into its internal workings at its own pace. You should be aware of the number of threads you have going at once; the rule of thumb is that the benefits of multithreading are overcome by the overhead when you have more than twice the number of active threads as "execution units" available to execute threads. However, you should be able to architect a system that takes this into account relatively simply.
Segregate the collection of results, by creating a dictionary of collections of results. Each results collection is keyed to some token carried by the thread doing the processing and passed to the callback. The dictionary can have multiple elements READ at one time without locking, and as each thread is WRITING to a different collection within the Dictionary there shouldn't be a need to lock those lists (and even if you did lock them you wouldn't be blocking other threads). The result is that the only collection that has to be locked such that it would block threads is the main dictionary, when a new collection for a new thread is added to it. That shouldn't have to happen often if you're smart about recycling tokens.
When implementing a thread safe list or queue; does it require to lock on the List.Count property before returning the Count i.e:
//...
public int Count
{
lock (_syncObject)
{
return _list.Count;
}
}
//...
Is it necessary to do a lock because of the original _list.Count variable maybe not a volatile variable?
Yes, that is necessary, but mostly irrelevant. Without the lock you could be reading stale values or even, depending on the inner working and the type of _list.Count, introduce errors.
But note that using that Count property is problematic, any calling code cannot really rely on it:
if (myStore.Count > 0) // this Count's getter locks internally
{
var item = myStore.Dequeue(); // not safe, myStore could be empty
}
So you should aim for a design where checking the Count and acting on it are combined:
ItemType GetNullOrFirst()
{
lock (_syncObject)
{
if (_list.Count > 0)
{
....
}
}
}
Additional:
is it nessesary to do a lock because of the original _list.Count variable maybe not a volatile variable?
The _list.Count is not a variable but a property. It cannot be marked volatile. Whether it is thread-safe depends on the getter code of the property, but it will usually be safe. But unreliable.
It depends.
Now, theoretically, because the underlying value is not threadsafe, just about anything could go wrong because of something the implementation does. In practice though, it's a read from a variable of 32-bits and so guaranteed to be atomic. Hence it might be stale, but it will be a real stale value rather than garbage caused by reading half a value before a change and the other half after it.
So the question is, does staleness matter? Possibly it doesn't. The more you can put up with staleness the better for performance (because the more you can put up with it the less you need to do to ensure you don't have anything stale). E.g. if you were putting the count out to the UI and the collection was changing rapidly, then just the time it takes for the person to read the number and process it in their own brain is going to be enough to make it obsolete anyway, and hence stale.
If however, you needed it to ensure an operation was reasonable before it was attempted, then that staleness is going to cause problems.
However, that staleness is going to happen anyway, because in between your locked (and guaranteed to be fresh) read and the operation happening, there's the opportunity for the collection to change, so you've got a race condition even when locking.
Hence, if freshness is important, you are going to have to lock at a higher level. With this being the case, the lower-level lock is just a waste. Hence you can probably do without it at that point.
The important thing here is that even with a class that is threadsafe in every regard, the best that can guarantee is that each operation will be fresh (though even that may not be guaranteed; indeed when there are more than one core involved "fresh" begins to become meaningless as changes that are near to truly simultaneous can happen) and that each operation won't put the object into an invalid safe. It is still possible to write non-threadsafe code with threadsafe objects (indeed very many non-threadsafe objects are composed of ints, strings and bools or objects that are in turn composed of them, and each of those is threadsafe in and of itself).
One thing that can be useful with mutable classes intended for multithreaded use are synchronised "Try" methods. E.g. to use a List as a stack we need the following operations:
See if the list is empty, reporting failure if it is.
Obtain the "top" value.
Remove the top value.
Even if we synchronise each of those individually, the code doing so won't be threadsafe as something can happen between each step. However, we can provide a Pop() method that does each three in one synchronised method. Analogous approaches are also useful with lock-free classes (where different techniques are used to ensure the method either succeeds or fails completely, and without damage to the object).
No, you don't need to lock, but the caller should otherwise something like this might happen
count is n
thread asks for count
Count returns n
another thread dequeues
count is n - 1
thread asking for count sees count is n when count is actually n - 1