How to test a multi-threaded race scenario - C#

I'm just now starting to learn multi-threading and I came across this question:
public class Program1
{
    int variable;
    bool variableValueHasBeenSet = false;

    public void Func1()
    {
        variable = 1;
        variableValueHasBeenSet = true;
    }

    public void Func2()
    {
        if (variableValueHasBeenSet) Console.WriteLine(variable);
    }
}
The question is: determine all possible outputs (in the console) for the following code snippet if Func1() and Func2() are run in parallel on two separate threads. The answer given is nothing, 1, or 0. The first two options are obvious, but the third one surprised me, so I wanted to try to reproduce it. This is what I tried:
for (int i = 0; i < 100; i++)
{
    var prog1 = new Program1();
    List<Task> tasks = new List<Task>();
    tasks.Add(new Task(() => prog1.Func2(), TaskCreationOptions.LongRunning));
    tasks.Add(new Task(() => prog1.Func1(), TaskCreationOptions.LongRunning));
    Parallel.ForEach(tasks, t => t.Start());
}
I couldn't get 0, only nothing and 1, so I was wondering what I'm doing wrong and how I can test this specific problem?
This is the explanation they provided for 0:
0 - This might seem impossible but this is a probable output and an interesting one. .Net runtime, C# and the CPU take the liberty of reordering instructions for optimization. So it is possible that variableValueHasBeenSet is set to true but the value of the variable is still zero. Another reason for such an output is caching. Thread2 might cache the value for the variable as 0 and will not see the updated value when Thread1 updates it in Func1. For a single threaded program this is not an issue as the ordering is guaranteed, but not so in multithreaded code. If the code at both the places is surrounded by locks, this problem can be mitigated. Another advanced way is to use memory barriers.

.Net runtime, C# and the CPU take the liberty of reordering
instructions for optimization.
This bit of information is very important, because there is no guarantee the reordering will happen at all.
The optimizer will often reorder instructions, but usually this is triggered by code complexity and will typically only occur in a Release build (the optimizer looks for dependency chains and may decide to reorder the code if no dependency is broken AND it will result in faster/more compact code). The code complexity of your test is very low and may not trigger the reordering optimization.
The same thing may happen at the CPU level: if no dependency chains are found between CPU instructions, they may be reordered or at least run in parallel by a superscalar CPU, but other, simpler architectures will run the code in order.
Another reason for such an output is caching. Thread2 might cache the
value for the variable as 0 and will not see the updated value when
Thread1 updates it in Func1
Again, this is only a possibility. This type of optimization is typically triggered when a variable is accessed repeatedly in a loop. The optimizer may decide that it is faster to keep the variable in a CPU register instead of reading it from memory on every iteration.
In any case, the amount of control you have over how the C# compiler emits its code is very limited, same goes for how the IL code gets translated to machine code. For these reasons, it would be very difficult for you to produce a reproducible test on every architecture for the case you intend to prove.
What is really important is that you need to be aware that 1) the execution order of the instructions can never be taken for granted, and 2) variables may be temporarily stored in registers as a potential optimization. Once aware, you should write your code defensively around these possibilities.
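As a sketch of what "defensively" could look like here (my own illustration, not part of the original answer), the flag can be published with Volatile.Write and read with Volatile.Read, which rules out both the reordering and the register-caching problems for this particular pair of methods:

public class Program1
{
    int variable;
    bool variableValueHasBeenSet;

    public void Func1()
    {
        variable = 1;
        // Release semantics: the write to 'variable' becomes visible
        // to other threads no later than the flag itself.
        Volatile.Write(ref variableValueHasBeenSet, true);
    }

    public void Func2()
    {
        // Acquire semantics: if the flag is observed as true,
        // the earlier write to 'variable' is observed as well.
        if (Volatile.Read(ref variableValueHasBeenSet))
            Console.WriteLine(variable);
    }
}

A lock around the bodies of Func1 and Func2, as the quoted explanation suggests, achieves the same effect at a slightly higher cost.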

Related

Why my code does not speed up with a multithreaded Parallel.For loop?

I tried to transform a simple sequential loop into a parallel computed loop with the System.Threading.Tasks library.
The code compiles and returns correct results, but it does not save any computational cost; on the contrary, it takes longer.
EDIT: Sorry guys, I have probably oversimplified the question and made some errors doing that.
To add some more information: I am running the code on an i7-4700QM, and it is referenced from a Grasshopper script.
Here is the actual code. I also switched to non-thread-local variables.
public static class LineNet
{
    public static List<Ray> SolveCpu(List<Speaker> sources, List<Receiver> targets, List<Panel> surfaces)
    {
        ConcurrentBag<Ray> rays = new ConcurrentBag<Ray>();
        for (int i = 0; i < sources.Count; i++)
        {
            Parallel.For(
                0,
                targets.Count,
                j =>
                {
                    Line path = new Line(sources[i].Position, targets[j].Position);
                    Ray ray = new Ray(path, i, j);
                    if (Utils.CheckObstacles(ray, surfaces))
                    {
                        rays.Add(ray);
                    }
                }
            );
        }
    }
}
The Grasshopper implementation just collects sources, targets, and surfaces, calls the Solve method, and returns rays.
I understand that dispatching workload to threads is expensive, but is it really that expensive?
Or is the ConcurrentBag just preventing parallel calculation?
Plus, my classes are immutable (?), but if I use a common List the kernel aborts the operation and throws an exception; can someone tell me why?
Without a good Minimal, Complete, and Verifiable code example that reliably reproduces the problem, it is not possible to provide a definitive answer. The code you posted does not even appear to be an excerpt of real code, because the type declared as the return type of the method isn't the same as the value actually returned by the return statement.
However, certainly the code you posted does not seem like a good use of Parallel.For(). Your Line constructor would have to be fairly expensive to justify parallelizing the task of creating the items. And to be clear, that's the only possible win here.
At the end, you still need to aggregate all of the Line instances that you created into a single list, so all those intermediate lists created for the Parallel.For() tasks are just pure overhead. And the aggregation is necessarily serialized (i.e. only one thread at a time can be adding an item to the result collection), and in the worst way (each thread only gets to add a single item before it gives up the lock and another thread has a chance to take it).
Frankly, you'd be better off storing each local List<T> in a collection, and then aggregating them all at once in the main thread after Parallel.For() returns. Not that that would be likely to make the code perform better than a straight-up non-parallelized implementation. But at least it would be less likely to be worse. :)
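A minimal sketch of that thread-local aggregation idea (my own illustration, untested; it assumes the outer loop variable i and the sources/targets/surfaces lists from the question, plus System.Linq for SelectMany): each worker fills its own List<Ray> via localInit/localFinally, the finished lists are dropped into a shared collection, and the flattening happens on the calling thread after the loop returns.

// One private list per worker; no per-item locking at all.
var perThreadLists = new ConcurrentBag<List<Ray>>();
Parallel.For(
    0,
    targets.Count,
    () => new List<Ray>(),                     // localInit: a private list per worker
    (j, state, local) =>
    {
        Line path = new Line(sources[i].Position, targets[j].Position);
        Ray ray = new Ray(path, i, j);
        if (Utils.CheckObstacles(ray, surfaces)) local.Add(ray);
        return local;
    },
    local => perThreadLists.Add(local));       // localFinally: hand the list over once
// Aggregate on the calling thread, after the parallel loop has completed.
List<Ray> rays = perThreadLists.SelectMany(list => list).ToList();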
The bottom line is that you don't seem to have a workload that could benefit from parallelization. If you think otherwise, you'll need to explain the basis for that thought in a clearer, more detailed way.
if I use a common List the kernel aborts the operation and throws an exception; can someone tell me why?
You're already using (it appears) List<T> as the local data for each task, and indeed that should be fine, as tasks don't share their local data.
But if you are asking why you get an exception if you try to use List<T> instead of ConcurrentBag<T> for the result variable, well that's entirely to be expected. The List<T> class is not thread safe, but Parallel.For() will allow each task it runs to execute the localFinally delegate concurrently with all the others. So you have multiple threads all trying to modify the same not-thread-safe collection concurrently. This is a recipe for disaster. You're fortunate you get the exception; the actual behavior is undefined, and it's just as likely you'll simply corrupt the data structure as cause a run-time exception.

Locks vs Compare-and-swap

I've been reading about lock-free techniques, like Compare-and-swap and leveraging the Interlocked and SpinWait classes to achieve thread synchronization without locking.
I've run a few tests of my own, where I simply have many threads trying to append a character to a string. I tried using regular locks and compare-and-swap. Surprisingly (at least to me), locks showed much better results than using CAS.
Here's the CAS version of my code (based on this). It follows a copy->modify->swap pattern:
private string _str = "";

public void Append(char value)
{
    var spin = new SpinWait();
    while (true)
    {
        var original = Interlocked.CompareExchange(ref _str, null, null);
        var newString = original + value;
        if (Interlocked.CompareExchange(ref _str, newString, original) == original)
            break;
        spin.SpinOnce();
    }
}
And the simpler (and more efficient) lock version:
private object lk = new object();

public void AppendLock(char value)
{
    lock (lk)
    {
        _str += value;
    }
}
If I try adding 50,000 characters, the CAS version takes 1.2 seconds and the lock version 700 ms (average). For 100k characters, they take 7 seconds and 3.8 seconds, respectively.
This was run on a quad-core (i5 2500k).
I suspected the reason why CAS was displaying these results was that it was failing the last "swap" step a lot. I was right. When I try adding 50k chars (50k successful swaps), I was able to count between 70k (best case) and almost 200k (worst case) failed attempts. In the worst case, 4 out of every 5 attempts failed.
So my questions are:
What am I missing? Shouldn't CAS give better results? Where's the benefit?
Why exactly and when is CAS a better option? (I know this has been asked, but I can't find any satisfying answer that also explains my specific scenario).
It is my understanding that solutions employing CAS, although hard to code, scale much better and perform better than locks as contention increases. In my example, the operations are very small and frequent, which means high contention and high frequency. So why do my tests show otherwise?
I assume that longer operations would make the case even worse: the "swap" failure rate would increase even more.
PS: this is the code I used to run the tests:
Stopwatch watch = Stopwatch.StartNew();
var cl = new Class1();
Parallel.For(0, 50000, i => cl.Append('a'));
var time = watch.Elapsed;
Debug.WriteLine(time.TotalMilliseconds);
The problem is a combination of the failure rate on the loop and the fact that strings are immutable. I did a couple of tests on my own using the following parameters.
Ran 8 different threads (I have an 8 core machine).
Each thread called Append 10,000 times.
What I observed was that the final length of the string was 80,000 (8 x 10,000) so that was perfect. The number of append attempts averaged ~300,000 for me. So that is a failure rate of ~73%. Only 27% of the CPU time resulted in useful work. Now because strings are immutable that means a new instance of the string is created on the heap and the original contents plus the one extra character is copied into it. By the way, this copy operation is O(n) so it gets longer and longer as the string's length increases. Because of the copy operation my hypothesis was that the failure rate would increase as the length of the string increases. The reason being that as the copy operation takes more and more time there is a higher probability of a collision as the threads spend more time competing to finalize the ICX. My tests confirmed this. You should try this same test yourself.
The biggest issue here is that sequential string concatenations do not lend themselves to parallelism very well. Since the result of operation Xn depends on Xn-1, it is going to be faster to take the full lock, especially if it means you avoid all of the failures and retries. A pessimistic strategy wins the battle against an optimistic one in this case. Low-lock techniques work better when you can partition the problem into independent chunks that really can run unimpeded in parallel.
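For example, here is a sketch of that partitioning idea applied to the append benchmark (names are mine; the point is that no thread ever touches a shared string during the loop): each worker appends to its own StringBuilder, and the pieces are concatenated once at the end. The order of the pieces is nondeterministic, which does not matter here because every appended character is identical.

var parts = new ConcurrentBag<StringBuilder>();
Parallel.For(0, 50000,
    () => new StringBuilder(),                 // localInit: one builder per worker
    (i, state, sb) => sb.Append('a'),          // contention-free append
    sb => parts.Add(sb));                      // localFinally: collect each worker's piece
string result = string.Concat(parts);          // single-threaded merge at the end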
As a side note the use of Interlocked.CompareExchange to do the initial read of _str is unnecessary. The reason is that a memory barrier is not required for the read in this case. This is because the Interlocked.CompareExchange call that actually performs work (the second one in your code) will create a full barrier. So the worst case scenario is that the first read is "stale", the ICX operation fails the test, and the loop spins back around to try again. This time, however, the previous ICX forced a "fresh" read.1
The following code is how I generalize a complex operation using low-lock mechanisms. In fact, the code presented below allows you to pass a delegate representing the operation, so it is very generalized. Would you want to use it in production? Probably not, because invoking the delegate is slow, but at least you get the idea. You could always hard-code the operation.
public static class InterlockedEx
{
    public static T Change<T>(ref T destination, Func<T, T> operation) where T : class
    {
        T original, value;
        do
        {
            original = destination;
            value = operation(original);
        }
        while (Interlocked.CompareExchange(ref destination, value, original) != original);
        return original;
    }
}
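For what it's worth, the Append method from the question could then shrink to a one-liner on top of this helper (assuming the same _str field):

public void Append(char value)
{
    // All of the read/modify/CAS retry logic lives inside InterlockedEx.Change.
    InterlockedEx.Change(ref _str, s => s + value);
}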
1 I actually dislike the terms "stale" and "fresh" when discussing memory barriers because that is not really what they are about. It is more of a side effect as opposed to an actual guarantee. But in this case it illustrates my point better.

Thread Safety With Parallel Operations

Before I start, I should mention that I feel like I've got the wrong end of the stick here. But here we go anyway:
Imagine we have the following class:
public class SomeObject {
    public int SomeInt;
    private SomeObject anotherObject;

    public void DoStuff() {
        if (SomeCondition()) anotherObject.SomeInt += 1;
    }
}
Now, imagine that we have a collection of these SomeObjects:
IList<SomeObject> allObjects = new List<SomeObject>(1000);
// ... Pretend the list is populated with 1000 SomeObjects here
Let's say I call DoStuff() on each one, like so:
foreach (var @object in allObjects) @object.DoStuff();
All is good so far.
Now, let's assume that the order in which the objects have their DoStuff() called is not important. Assume that SomeCondition() is computationally expensive, perhaps. I could utilize all four cores on my machine (and potentially get a performance gain) with:
Parallel.For(0, 1000, i => allObjects[i].DoStuff());
Now, ignoring any issues with atomicity of variable access, I don't care whilst I am in the loop whether or not any given SomeObject sees an outdated version of anotherObject or SomeInt.* However, once the loop is done, I want to make sure that my main worker thread (i.e. the one that called Parallel.For) DOES see everything up-to-date.
Is there a guarantee of this (e.g. some sort of memory barrier?) with using Parallel.For? Or do I need to make some sort of guarantee myself? Or is there no way to make this guarantee?
Finally, if I call Parallel.For(...) again in the same way just after, will all worker threads be working with the new, up-to-date values for everything?
(*) The implementers of DoStuff() would be wrong to make assumptions about the order of processing anyway, right?
var locker = new object();
var total = 0.0;
Parallel.For(1, 10000000,
    i => { lock (locker) total += (i + 1); });
Console.WriteLine("WithLocker" + total);

var total2 = 0.0;
Parallel.For(1, 10000000,
    i => total2 += (i + 1));
Console.WriteLine("WithoutLocker" + total2);
Console.ReadKey();

// WithLocker 50000004999999
// WithoutLocker 28861729333278
I have made two examples for you, one with a lock and one without; look at the results!
There are two issues here.
However, once the loop is done, I want to make sure that my main worker thread (i.e. the one that called Parallel.For) DOES see everything up-to-date.
To answer your question. Yes, once your Parallel.For has completed all the calls to DoStuff will have completed and your array will not see any more updates.
Now, ignoring any issues with atomicity of variable access, I don't care whilst I am in the loop whether or not any given SomeObject sees an outdated version of anotherObject or SomeInt.*
I really doubt that you don't care about this if you want a correct answer. Bassam's answer addresses the potential data races in your code. If one thread is running DoStuff and it writes to another index in the array which is simultaneously being read by another thread, then you will see nondeterministic results. Locking can solve this (as shown above) but at the expense of performance. Locking on every thread for every update effectively serializes your work. I suspect that Bassam's lock example actually runs no faster, and possibly slower, than the non-locking one, although it does produce the correct answer.
If SomeObject::anotherObject refers to anything other than this, you have a potential race condition. Consider the case where anotherObject refers to the element in the array adjacent to the current object. What happens when these run concurrently? One thread's code will be trying to read an instance of SomeObject while another thread writes to it. The write is not guaranteed to happen atomically, so your read may return an object in a half-written state.
This depends a bit on what is being updated in SomeObject and how it's being updated. For example, if all you are doing is incrementing a single integer value, you could use interlocked operations to increment the value in a thread-safe way, or use critical sections or locks to ensure that your SomeObject is actually thread-safe. Adding synchronization operations usually impacts performance, so if possible I would recommend looking for an approach that does not require adding synchronization.
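For instance, a sketch of the interlocked approach (my own example; it assumes the increment really is the only shared write in DoStuff):

public void DoStuff()
{
    // Interlocked.Increment performs the read-modify-write atomically,
    // so concurrent callers cannot lose updates to SomeInt.
    if (SomeCondition()) Interlocked.Increment(ref anotherObject.SomeInt);
}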
You can fix this in one of two ways.
1) If each instance of anotherObject in the array is guaranteed to be only updated once by one call to allObjects[i].DoStuff() then you can modify your code to have an input and output array. This prevents any race conditions as reads and writes no longer conflict. It means you need two copies of your array and they both need to be initialized.
2) If you are updating array items multiple times, or having two arrays of SomeObject is not an option and SomeCondition() is the only computationally expensive part of your method then you could parallelize this and then update the array sequentially.
bool[] allConditions = new bool[1000];
Parallel.For(0, 1000, i => allConditions[i] = SomeCondition(i)); // write to allConditions, not allObjects
for (int i = 0; i < 1000; ++i) { allObjects[i].DoStuff(allConditions[i]); }
So your observation:
This is interesting. It means that Parallel.For is basically only useful for code that's already thread-safe... Damn
Is not entirely correct. The code within your Parallel.For must either be thread-safe or not access data and resources in a non-thread-safe way. In other words, it doesn't have to lock if you can rearrange your code to guarantee that there are no race conditions (or deadlocks), because none of the threads write the same data or read data that another thread may be writing to. Note that concurrent reads are OK.

Do we really need VOLATILE keyword in C#?

Here is the code that I was trying on my workstation.
class Program
{
    public static volatile bool status = true;

    public static void Main()
    {
        Thread FirstStart = new Thread(threadrun);
        FirstStart.Start();
        Thread.Sleep(200);
        Thread thirdstart = new Thread(threadrun2);
        thirdstart.Start();
        Console.ReadLine();
    }

    static void threadrun()
    {
        while (status)
        {
            Console.WriteLine("Waiting..");
        }
    }

    static void threadrun2()
    {
        status = false;
        Console.WriteLine("the bool value is now made FALSE");
    }
}
As you can see, I have fired three threads in Main. Then, using breakpoints, I tracked the threads. My initial assumption was that all three threads would be fired simultaneously, but my breakpoint flow showed that the thread execution followed one after the other (and so did the output format, i.e. top-to-bottom execution of threads). Why is that happening?
Additionally, I tried running the same program without the volatile keyword in the declaration, and I found no change in program execution. I suspect the volatile keyword is of no practical real-life use. Am I going wrong somewhere?
Your method of thinking is flawed.
The very nature of threading related issues is that they're non-deterministic. This means that what you have observed is potentially no indicator of what may happen in the future.
This is the very nature of why multithreaded programming is "hard." It often defies ad hoc testing, or even most unit testing. The only way to do it effectively is to understand your entire software and hardware stack, and diagram every possible occurrence through use of state machines.
In summary, threaded programming is not about what you've seen happen, it's about what might possibly happen, no matter how improbable.
Ok I will try to explain a very long story as short as possible:
Number 1: Trying to inspect the behavior of threads with the debugger is as useful as repeatedly running a multithreaded program and concluding that it works fine because out of 100 tests none failed: WRONG! Threads behave in a completely nondeterministic (some would say random) way and you need different methods to make sure such a program will run correctly.
Number 2: The use of volatile will become clear once you remove it, run your program in Debug mode, and then switch to Release mode. I think you will have a surprise... What happens in Release mode is that the compiler will optimize the code (this includes reordering instructions and caching of values). Now, if your two threads run on different processor cores, then the core executing the thread that is checking the value of status will cache its value instead of repeatedly reading it. The other thread will set it, but the first one will never see the change, so the loop never exits (effectively an infinite loop). volatile prevents this kind of situation from occurring.
In a sense, volatile is a guard in case the code does not actually (and most likely will not) run as you think it will in a multithreaded scenario.
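For completeness, here is a hedged alternative sketch that drops the volatile modifier and instead performs an explicitly volatile read inside the loop (equivalent for this particular pattern, based on the threadrun method from the question):

static bool status = true;   // no volatile modifier here

static void threadrun()
{
    // Volatile.Read keeps the JIT from hoisting 'status' into a register,
    // so the loop re-reads the field on every iteration.
    while (Volatile.Read(ref status))
    {
        Console.WriteLine("Waiting..");
    }
}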
The fact that your simple code doesn't behave differently with volatile doesn't mean anything. Your code is too simple and has nothing to do with volatile. You need to write very computation-intensive code to create a clearly visible memory race condition.
Also, the volatile keyword may be useful on platforms other than x86/x64 with weaker memory models (for example, Itanium).
Joe Duffy wrote interesting information about volatile on his blog. I strongly recommend reading it.
Then, using breakpoints, I tracked the threads. My initial assumption was that all three threads would be fired simultaneously, but my breakpoint flow showed that the thread execution followed one after the other (and so did the output format, i.e. top-to-bottom execution of threads). Why is that happening?
The debugger is temporarily suspending the threads to make it easier to debug.
I suspect the volatile keyword is of no practical real-life use. Am I going wrong somewhere?
The Console.WriteLine calls are very likely masking the problem. They are most likely generating the necessary memory barrier for you implicitly. Here is a really simple snippet of code that demonstrates that there is, in fact, a problem when volatile is not used to declare the stop variable.
Compile the following code with the Release configuration and run it outside of the debugger.
class Program
{
    static bool stop = false;

    public static void Main(string[] args)
    {
        var t = new Thread(() =>
        {
            Console.WriteLine("thread begin");
            bool toggle = false;
            while (!stop)
            {
                toggle = !toggle;
            }
            Console.WriteLine("thread end");
        });
        t.Start();
        Thread.Sleep(1000);
        stop = true;
        Console.WriteLine("stop = true");
        Console.WriteLine("waiting...");
        t.Join();
    }
}

Why we need Thread.MemoryBarrier()?

In "C# 4 in a Nutshell", the author shows that this class can write 0 sometimes without MemoryBarrier, though I can't reproduce in my Core2Duo:
public class Foo
{
    int _answer;
    bool _complete;

    public void A()
    {
        _answer = 123;
        //Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        //Thread.MemoryBarrier(); // Barrier 2
    }

    public void B()
    {
        //Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            //Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine(_answer);
        }
    }
}
private static void ThreadInverteOrdemComandos()
{
    Foo obj = new Foo();
    Task.Factory.StartNew(obj.A);
    Task.Factory.StartNew(obj.B);
    Thread.Sleep(10);
}
This need seems crazy to me. How can I recognize all the possible cases in which this can occur? I think that if the processor changes the order of operations, it needs to guarantee that the behavior doesn't change.
Do you bother to use Barriers?
You are going to have a very hard time reproducing this bug. In fact, I would go as far as saying you will never be able to reproduce it using the .NET Framework. The reason is because Microsoft's implementation uses a strong memory model for writes. That means writes are treated as if they were volatile. A volatile write has lock-release semantics which means that all prior writes must be committed before the current write.
However, the ECMA specification has a weaker memory model. So it is theoretically possible that Mono or even a future version of the .NET Framework might start exhibiting the buggy behavior.
So what I am saying is that it is very unlikely that removing barriers #1 and #2 will have any impact on the behavior of the program. That, of course, is not a guarantee, but an observation based on the current implementation of the CLR only.
Removing barriers #3 and #4 will definitely have an impact. This is actually pretty easy to reproduce. Well, not this example per se, but the following code is one of the better-known demonstrations. It has to be compiled using the Release build and run outside of the debugger. The bug is that the program does not end. You can fix the bug by placing a call to Thread.MemoryBarrier inside the while loop or by marking stop as volatile.
class Program
{
    static bool stop = false;

    public static void Main(string[] args)
    {
        var t = new Thread(() =>
        {
            Console.WriteLine("thread begin");
            bool toggle = false;
            while (!stop)
            {
                toggle = !toggle;
            }
            Console.WriteLine("thread end");
        });
        t.Start();
        Thread.Sleep(1000);
        stop = true;
        Console.WriteLine("stop = true");
        Console.WriteLine("waiting...");
        t.Join();
    }
}
The reason why some threading bugs are hard to reproduce is because the same tactics you use to simulate thread interleaving can actually fix the bug. Thread.Sleep is the most notable example because it generates memory barriers. You can verify that by placing a call inside the while loop and observing that the bug goes away.
You can see my answer here for another analysis of the example from the book you cited.
Odds are very good that the first task is completed by the time the second task even starts running. You can only observe this behavior if both threads run that code simultaneously and there are no intervening cache-synchronizing operations. There is one in your code: the StartNew() method will take a lock inside the thread pool manager somewhere.
Getting two threads to run this code simultaneously is very hard. This code completes in a couple of nanoseconds. You would have to try billions of times and introduce variable delays to have any odds. Not much point to this, of course; the real problem is when this happens randomly, when you don't expect it.
Stay away from this; use the lock statement to write sane multi-threaded code.
If you use volatile and lock, the memory barrier is built in. But, yes, you do need it otherwise. Having said that, I suspect that you need half as many as your example shows.
It's very difficult to reproduce multithreaded bugs - usually you have to run the test code many times (thousands) and have some automated check that will flag the bug if it occurs. You might try adding a short Thread.Sleep(10) between some of the lines, but again, that does not always guarantee that you will get the same issues as without it.
Memory barriers were introduced for people who need to do really hardcore low-level performance optimisation of their multithreaded code. In most cases you will be better off using other synchronisation primitives, i.e. volatile or lock.
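For example, here is a sketch (mine, not from the article quoted below) of the Foo class from this question written with a lock instead of explicit fences; the lock's acquire/release supplies the barriers:

class Foo
{
    readonly object _gate = new object();
    int _answer;
    bool _complete;

    public void A()
    {
        lock (_gate)   // releasing the lock publishes both writes
        {
            _answer = 123;
            _complete = true;
        }
    }

    public void B()
    {
        lock (_gate)   // acquiring the lock sees the published writes
        {
            if (_complete) Console.WriteLine(_answer);
        }
    }
}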
I'll just quote one of the great articles on multi-threading:
Consider the following example:
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        _complete = true;
    }

    void B()
    {
        if (_complete) Console.WriteLine (_answer);
    }
}
If methods A and B ran concurrently on different threads, might it be
possible for B to write “0”? The answer is yes — for the following
reasons:
The compiler, CLR, or CPU may reorder your program's instructions to
improve efficiency. The compiler, CLR, or CPU may introduce caching
optimizations such that assignments to variables won't be visible to
other threads right away. C# and the runtime are very careful to
ensure that such optimizations don’t break ordinary single-threaded
code — or multithreaded code that makes proper use of locks. Outside
of these scenarios, you must explicitly defeat these optimizations by
creating memory barriers (also called memory fences) to limit the
effects of instruction reordering and read/write caching.
Full fences
The simplest kind of memory barrier is a full memory
barrier (full fence) which prevents any kind of instruction reordering
or caching around that fence. Calling Thread.MemoryBarrier generates a
full fence; we can fix our example by applying four full fences as
follows:
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine (_answer);
        }
    }
}
All the theory behind Thread.MemoryBarrier and why we need to use it in non-blocking scenarios to make the code safe and robust is described nicely here: http://www.albahari.com/threading/part4.aspx
If you are ever touching data from two different threads, this can occur. This is one of the tricks that processors use to increase speed - you could build processors that didn't do this, but they would be much slower, so no one does that anymore. You should probably read something like Hennessy and Patterson to recognize all of the various types of race conditions.
I always use some sort of higher level tool like a monitor or a lock, but internally they are doing something similar or are implemented with barriers.
