Implicit limits of compiler instruction reordering? - c#

When learning about locking mechanisms and concurrency recently, I came across compiler instruction reordering. Since then, I am a lot more suspicious about the correctness of the code I write even if there is no concurrent access to fields. I just encountered a piece of code like this:
var now = DateTime.Now;
var newValue = CalculateCachedValue();
cachedValue = newValue;
lastUpdate = now;
Is it possible that lastUpdate = now is executed before cachedValue is assigned the new value? This would mean that if the thread running this code was cancelled I would have logged an update that was not saved. From what I know now I have to assume this is the case and I need a memory barrier.
But is it even possible that the first statement is executed after the second? This would mean now is the time after the calculation and not before. I guess this is not the case because a method call is involved. However, there is no other clear dependency that prevents reordering. Is a method call/property access an implicit barrier? Are there other implicit constraints for instruction reordering that I should be aware of?

The .NET jitter can reorder instructions, yes. Invariant code motion and common sub-expression elimination are important optimizations and can make code a great deal faster.
But that does not just happen willy-nilly. The optimizer will only ever contemplate such an optimization if it knows that reordering will not have any undesirable side-effects. In order for it to know, it first has to inline a method or property getter call. And that will never happen for DateTime.Now: it requires an operating system call, and those can never be inlined. So you have a hard guarantee that no statement ever moves before or after var now = DateTime.Now;
And it actually has to make sense to move code and result in a measurable benefit. There is no point in reordering the assignment statements; it does not make the code any faster. Invariant code motion is an optimization applied to statements inside a loop; moving such a statement ahead of the loop so it does not get executed repeatedly pays off. There is no risk of this at all in this snippet. Sub-expression elimination is also nowhere in sight here.
Being afraid of optimizer-induced bugs is a bit like being afraid to step outside because you might be struck by a bolt of lightning. That happens. Odds are just very, very low. A nice guarantee you get with the .NET jitter is that it gets tested millions of times every day.
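That said, if you would rather pin the store order down by rule than by the reasoning above, an explicit barrier between the two assignments does it. A minimal sketch, assuming cachedValue is an int and using a placeholder CalculateCachedValue (both are assumptions, not from the question):
using System;
using System.Threading;

class Cache
{
    // Field names follow the question; the value type is an assumption.
    private int cachedValue;
    private DateTime lastUpdate;

    public void Refresh()
    {
        var now = DateTime.Now;
        var newValue = CalculateCachedValue();

        cachedValue = newValue;
        Thread.MemoryBarrier(); // full fence: the store above cannot move below it...
        lastUpdate = now;       // ...so lastUpdate is never visibly updated before cachedValue
    }

    private int CalculateCachedValue() => 42; // placeholder for the real work
}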

Related

How to test a multi-threaded race scenario

I'm just now starting to learn multi-threading and I came across this question:
public class Program1
{
    int variable;
    bool variableValueHasBeenSet = false;

    public void Func1()
    {
        variable = 1;
        variableValueHasBeenSet = true;
    }

    public void Func2()
    {
        if (variableValueHasBeenSet) Console.WriteLine(variable);
    }
}
The question is: determine all possible outputs (in the console) for the following code snippet if Func1() and Func2() are run in parallel on two separate threads. The answer given is nothing, 1, or 0. The first two options are obvious, but the third one surprised me, so I wanted to try to reproduce it. This is what I tried:
for (int i = 0; i < 100; i++)
{
var prog1 = new Program1();
List<Task> tasks = new List<Task>();
tasks.Add(new Task(() => prog1.Func2(), TaskCreationOptions.LongRunning));
tasks.Add(new Task(() => prog1.Func1(), TaskCreationOptions.LongRunning));
Parallel.ForEach(tasks, t => t.Start());
}
I couldn't get 0, only nothing and 1, so I was wondering what I'm doing wrong and how I can test this specific problem.
this is the explanation they provided for 0:
0 - This might seem impossible but this is a probable output and an interesting one. .Net runtime, C# and the CPU take the liberty of reordering instructions for optimization. So it is possible that variableValueHasBeenSet is set to true but the value of the variable is still zero. Another reason for such an output is caching. Thread2 might cache the value for the variable as 0 and will not see the updated value when Thread1 updates it in Func1. For a single threaded program this is not an issue as the ordering is guaranteed, but not so in multithreaded code. If the code at both the places is surrounded by locks, this problem can be mitigated. Another advanced way is to use memory barriers.
.Net runtime, C# and the CPU take the liberty of reordering instructions for optimization.
This bit of information is very important, because there is no guarantee the reordering will happen at all.
The optimizer will often reorder the instructions, but usually this is triggered by code complexity and will typically only occur on a release build (the optimizer will look for dependency-chains and may decide to reorder the code if no dependency is broken AND it will result in faster/more compact code). The code complexity of your test is very low and may not trigger the reordering optimization.
The same thing may happen at the CPU level: if no dependency chains are found between CPU instructions, they may be reordered or at least run in parallel by a superscalar CPU, while other, simpler architectures will run code in-order.
Another reason for such an output is caching. Thread2 might cache the value for the variable as 0 and will not see the updated value when Thread1 updates it in Func1.
Again, this is only a possibility. This type of optimization is typically triggered when repeatedly accessing a variable in a loop. The optimizer may decide that it is faster to place the variable in a CPU register instead of accessing it from memory every iteration.
In any case, the amount of control you have over how the C# compiler emits its code is very limited; the same goes for how the IL code gets translated to machine code. For these reasons, it would be very difficult for you to produce a test that reproduces the case you intend to prove on every architecture.
What is really important is that you need to be aware that 1) the execution order of the instructions can never be taken for granted, and 2) variables may be temporarily stored in registers as a potential optimization. Once you are aware of that, you should write your code defensively around these possibilities.
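If you still want to try to observe the effect empirically, one common approach is to drop the Task/Parallel machinery and run the two methods on bare threads released at the same instant, many iterations per run. A rough sketch (the harness names and iteration count are arbitrary; on x86/x64 with a non-aggressive JIT it may still never print 0):
using System;
using System.Threading;

class RaceHarness
{
    static void Main()
    {
        for (int i = 0; i < 100000; i++)
        {
            var prog = new Program1();
            using (var gate = new Barrier(2)) // release both threads as close together as possible
            {
                var writer = new Thread(() => { gate.SignalAndWait(); prog.Func1(); });
                var reader = new Thread(() => { gate.SignalAndWait(); prog.Func2(); });

                reader.Start();
                writer.Start();
                writer.Join();
                reader.Join();
            }
        }
    }
}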

Is volatile needed with indirect access?

Please note I am not asking about replacing volatile with other means (like lock); I am asking about the nature of volatile -- thus I use "needed" in the sense of volatile versus no volatile.
Consider the case where one thread only writes to a variable x (Int32) and the other thread only reads it. The access in both cases is direct.
The volatile is needed to avoid caching, correct?
But what if the access to x is indirect -- for example via property:
int x;
int access_x { get { return x; } set { x = value; } }
So both threads now use only access_x, not x. Is x needed to be marked as volatile? If yes, is there some limit of indirection when volatile is not needed anymore?
Update: consider such code of the reader (no writes):
if (x>10)
...
// second thread changes `x`
if (x>10)
...
In the second if, the compiler could use the old value, because x could be cached (in a register, say) and without volatile there is no need to refetch it. My question is about this change:
if (access_x>10)
...
// second thread changes `x`
if (access_x>10)
...
And let's say I skip volatile for x. What will happen / what can happen?
Is x needed to be marked as volatile?
Yes, and no (but mostly yes).
"Yes", because technically you have no guarantees here. The compilers (C# and JIT) are permitted to make any optimization they see fit, as long as the optimization would not change the behavior of the code when executed in a single-thread. One obvious optimization is to omit the call to the property setter and getter and just directly access the field (i.e. inlining). Of course, the compiler is allowed to do whatever analysis it wants and make further optimizations.
"No", because in practice this usually is not a problem. With the access wrapped in a method, the C# compiler won't optimize away the field, and the JIT compiler is unlikely to do so (even if the methods are inlined…again, no guarantees, but AFAIK such an optimization isn't performed, and I think it not likely a future version would). So all you're left with is memory coherency issues (the other reason volatile is used…i.e. essentially dealing with optimizations performed at the hardware level).
As long as your code is only ever going to run on Intel x86-compatible hardware, that hardware treats all reads and writes as volatile.
However, other platforms may be different. Itanium and ARM being a couple of common examples which have different memory models.
Personally, I prefer to write to the technicalities. Writing code that just happens to work, in spite of a lack of guarantee, is just asking to find some time in the future that the code stops working for some mysterious reason.
So IMHO you really should mark the field as volatile.
If yes, is there some limit of indirection when volatile is not needed anymore?
No. If such a limit existed, it would have to be documented to be useful. In practice, the limit in terms of compiler optimizations is really just a "level of indirection" (as you put it). But no amount of levels of indirection avoid the hardware-level optimizations, and even the limit in terms of compiler optimizations is strictly "in practice". You have no guarantee that the compiler will never analyze the code more deeply and perform more aggressive optimizations, to any arbitrarily deep level of calls.
More generally, my rule of thumb is this: if I am trying to decide whether I should use some particular feature that I know is normally used to protect against bugs in concurrency-related scenarios, and I have to ask "do I really need this feature, or will the code work fine without it?", then I probably don't know enough about the way that feature works for me to safely avoid using it.
Consider it a variation on the old "if you have to ask how much it costs, you can't afford it."
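For reference, here is roughly what following that advice looks like; a sketch only, with Holder as an illustrative container name (Volatile lives in System.Threading):
using System.Threading;

class Holder
{
    // Option 1: volatile on the backing field; the reads/writes keep their
    // acquire/release semantics even when they go through the property.
    volatile int x;
    int access_x { get { return x; } set { x = value; } }

    // Option 2 (similar effect, and usable for types the volatile keyword
    // does not accept, such as long or double): route access through Volatile.
    long y;
    long access_y
    {
        get { return Volatile.Read(ref y); }
        set { Volatile.Write(ref y, value); }
    }
}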

Read Introduction in C# - how to protect against it?

An article in MSDN Magazine discusses the notion of Read Introduction and gives a code sample which can be broken by it.
public class ReadIntro {
private Object _obj = new Object();
void PrintObj() {
Object obj = _obj;
if (obj != null) {
Console.WriteLine(obj.ToString()); // May throw a NullReferenceException
}
}
void Uninitialize() {
_obj = null;
}
}
Notice this "May throw a NullReferenceException" comment - I never knew this was possible.
So my question is: how can I protect against read introduction?
I would also be really grateful for an explanation of exactly when the compiler decides to introduce reads, because the article doesn't include it.
Let me try to clarify this complicated question by breaking it down.
What is "read introduction"?
"Read introduction" is an optimization whereby the code:
public static Foo foo; // I can be changed on another thread!
void DoBar() {
Foo fooLocal = foo;
if (fooLocal != null) fooLocal.Bar();
}
is optimized by eliminating the local variable. The compiler can reason that if there is only one thread then foo and fooLocal are the same thing. The compiler is explicitly permitted to make any optimization that would be invisible on a single thread, even if it becomes visible in a multithreaded scenario. The compiler is therefore permitted to rewrite this as:
void DoBar() {
if (foo != null) foo.Bar();
}
And now there is a race condition. If foo turns from non-null to null after the check then it is possible that foo is read a second time, and the second time it could be null, which would then crash. From the perspective of the person diagnosing the crash dump this would be completely mysterious.
Can this actually happen?
As the article you linked to called out:
Note that you won’t be able to reproduce the NullReferenceException using this code sample in the .NET Framework 4.5 on x86-x64. Read introduction is very difficult to reproduce in the .NET Framework 4.5, but it does nevertheless occur in certain special circumstances.
x86/x64 chips have a "strong" memory model and the jit compilers are not aggressive in this area; they will not do this optimization.
If you happen to be running your code on a weak memory model processor, like an ARM chip, then all bets are off.
When you say "the compiler" which compiler do you mean?
I mean the jit compiler. The C# compiler never introduces reads in this manner. (It is permitted to, but in practice it never does.)
Isn't it a bad practice to be sharing memory between threads without memory barriers?
Yes. Something should be done here to introduce a memory barrier because the value of foo could already be a stale cached value in the processor cache. My preference for introducing a memory barrier is to use a lock. You could also make the field volatile, or use VolatileRead, or use one of the Interlocked methods. All of those introduce a memory barrier. (volatile introduces only a "half fence" FYI.)
Just because there's a memory barrier does not necessarily mean that read introduction optimizations are not performed. However, the jitter is far less aggressive about pursuing optimizations that affect code that contains a memory barrier.
Are there other dangers to this pattern?
Sure! Let's suppose there are no read introductions. You still have a race condition. What if another thread sets foo to null after the check, and also modifies global state that Bar is going to consume? Now you have two threads, one of which believes that foo is not null and the global state is OK for a call to Bar, and another thread which believes the opposite, and you're running Bar. This is a recipe for disaster.
So what's the best practice here?
First, do not share memory across threads. This whole idea that there are two threads of control inside the main line of your program is just crazy to begin with. It never should have been a thing in the first place. Use threads as lightweight processes; give them an independent task to perform that does not interact with the memory of the main line of the program at all, and just use them to farm out computationally intensive work.
Second, if you are going to share memory across threads then use locks to serialize access to that memory. Locks are cheap if they are not contended, and if you have contention, then fix that problem. Low-lock and no-lock solutions are notoriously difficult to get right.
Third, if you are going to share memory across threads then every single method you call that involves that shared memory must either be robust in the face of race conditions, or the races must be eliminated. That is a heavy burden to bear, and that is why you shouldn't go there in the first place.
My point is: read introductions are scary but frankly they are the least of your worries if you are writing code that blithely shares memory across threads. There are a thousand and one other things to worry about first.
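Following the answer's stated preference (a lock) for this particular sample, the serialized version looks roughly like this. It is a sketch, with _sync being a guard object added for illustration rather than anything from the article:
using System;

public class ReadIntro
{
    private readonly object _sync = new object();  // guard object added for the sketch
    private object _obj = new object();

    void PrintObj()
    {
        lock (_sync)   // all access to _obj is serialized through _sync
        {
            if (_obj != null)
            {
                Console.WriteLine(_obj.ToString());
            }
        }
    }

    void Uninitialize()
    {
        lock (_sync)
        {
            _obj = null;
        }
    }
}
Because Uninitialize now needs the same lock, it cannot run between the null check and the use, regardless of whether any read is introduced.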
You can't really "protect" against read introduction, as it's a compiler optimization (except by using Debug builds with no optimization, of course). It's pretty well documented that the optimizer will maintain the single-threaded semantics of the function, which, as the article notes, can cause issues in multi-threaded situations.
That said, I'm confused by his example. In Jeffrey Richter's book CLR via C# (v3 in this case), in the Events section he covers this pattern and notes that, in theory, the example snippet you have above wouldn't work. But it was a pattern recommended by Microsoft early in .Net's existence, and therefore the JIT compiler people he spoke to said that they would have to make sure that sort of snippet never breaks. (It's always possible they may decide that it's worth breaking for some reason though - I imagine Eric Lippert could shed light on that).
Finally, unlike the article, Jeffrey offers the "proper" way to handle this in multi-threaded situations (I've modified his example with your sample code):
Object temp = Interlocked.CompareExchange(ref _obj, null, null);
if(temp != null)
{
Console.WriteLine(temp.ToString());
}
I only skimmed the article, but it seems that what the author is looking for is that you need to declare the _obj member as volatile.
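For completeness, that suggestion amounts to a single keyword on the field declaration; a sketch (keeping in mind the caveat above that volatile is only a half fence):
// Sketch: every read of _obj now has acquire semantics and every write
// release semantics, which addresses stale reads of the field itself.
private volatile Object _obj = new Object();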

When to use volatile to counteract compiler optimizations in C#

I have spent an extensive number of weeks doing multithreaded coding in C# 4.0. However, there is one question that remains unanswered for me.
I understand that the volatile keyword prevents the compiler from storing variables in registers, thus avoiding inadvertently reading stale values. Writes are always volatile in .Net, so any documentation stating that it also avoids stale writes is redundant.
I also know that the compiler optimization is somewhat "unpredictable". The following code will illustrate a stall due to a compiler optimization (when running the release compile outside of VS):
using System.Threading;

class Test
{
    public struct Data
    {
        public int _loop;
    }

    public static Data data;

    public static void Main()
    {
        data._loop = 1;
        Test test1 = new Test();

        new Thread(() =>
        {
            data._loop = 0;
        }).Start();

        do
        {
            if (data._loop != 1)
            {
                break;
            }
            //Thread.Yield();
        } while (true);
        // will never terminate
    }
}
The code behaves as expected. However, if I uncomment the //Thread.Yield(); line, then the loop will exit.
Further, if I put a Sleep statement before the do loop, it will exit. I don't get it.
Naturally, decorating _loop with volatile will also cause the loop to exit (in its shown pattern).
My question is: What are the rules the compiler follows in order to determine when to implicitly perform a volatile read? And why can I still get the loop to exit with what I consider to be odd measures?
EDIT
IL for code as shown (stalls):
L_0038: ldsflda valuetype ConsoleApplication1.Test/Data ConsoleApplication1.Test::data
L_003d: ldfld int32 ConsoleApplication1.Test/Data::_loop
L_0042: ldc.i4.1
L_0043: beq.s L_0038
L_0045: ret
IL with Yield() (does not stall):
L_0038: ldsflda valuetype ConsoleApplication1.Test/Data ConsoleApplication1.Test::data
L_003d: ldfld int32 ConsoleApplication1.Test/Data::_loop
L_0042: ldc.i4.1
L_0043: beq.s L_0046
L_0045: ret
L_0046: call bool [mscorlib]System.Threading.Thread::Yield()
L_004b: pop
L_004c: br.s L_0038
What are the rules the compiler follows in order to determine when to implicitly perform a volatile read?
First, it is not just the compiler that moves instructions around. The big 3 actors in play that cause instruction reordering are:
Compiler (like C# or VB.NET)
Runtime (like the CLR or Mono)
Hardware (like x86 or ARM)
The rules at the hardware level are a little more cut and dry in that they are usually documented pretty well. But, at the runtime and compiler levels there are memory model specifications that provide constraints on how instructions can get reordered, but it is left up to the implementers to decide how aggressively they want to optimize the code and how closely they want to toe the line with respect to the memory model constraints.
For example, the ECMA specification for the CLI provides fairly weak guarantees. But Microsoft decided to tighten those guarantees in the .NET Framework CLR. Other than a few blog posts I have not seen much formal documentation on the rules the CLR adheres to. Mono, of course, might use a different set of rules that may or may not bring it closer to the ECMA specification. And of course, there may be some liberty in changing the rules in future releases as long as the formal ECMA specification is still considered.
With all of that said I have a few observations:
Compiling with the Release configuration is more likely to cause instruction reordering.
Simpler methods are more likely to have their instructions reordered.
Hoisting a read from inside a loop to outside of the loop is a typical type of reordering optimization.
And why can I still get the loop to exit with what I consider to be odd measures?
It is because those "odd measures" are doing one of two things:
generating an implicit memory barrier
circumventing the compiler's or runtime's ability to perform certain optimizations
For example, if the code inside a method gets too complex it may prevent the JIT compiler from performing certain optimizations that reorders instructions. You can think of it as sort of like how complex methods also do not get inlined.
Also, things like Thread.Yield and Thread.Sleep create implicit memory barriers. I have started a list of such mechanisms here. I bet if you put a Console.WriteLine call in your code it would also cause the loop to exit. I have also seen the "non terminating loop" example behave differently in different versions of the .NET Framework. For example, I bet if you ran that code in 1.0 it would terminate.
This is why using Thread.Sleep to simulate thread interleaving could actually mask a memory barrier problem.
Update:
After reading through some of your comments I think you may be confused as to what Thread.MemoryBarrier is actually doing. What it does is create a full-fence barrier. What does that mean exactly? A full-fence barrier is the composition of two half-fences: an acquire-fence and a release-fence. I will define them now.
Acquire fence: A memory barrier in which other reads & writes are not allowed to move before the fence.
Release fence: A memory barrier in which other reads & writes are not allowed to move after the fence.
So when you see a call to Thread.MemoryBarrier it will prevent all reads & writes from being moved either above or below the barrier. It will also emit whatever CPU specific instructions are required.
If you look at the code for Thread.VolatileRead here is what you will see.
public static int VolatileRead(ref int address)
{
int num = address;
MemoryBarrier();
return num;
}
Now you may be wondering why the MemoryBarrier call is after the actual read. Your intuition may tell you that to get a "fresh" read of address you would need the call to MemoryBarrier to occur before that read. But, alas, your intuition is wrong! The specification says a volatile read should produce an acquire-fence barrier. And per the definition I gave you above that means the call to MemoryBarrier has to be after the read of address to prevent other reads and writes from being moved before it. You see volatile reads are not strictly about getting a "fresh" read. It is about preventing the movement of instructions. This is incredibly confusing; I know.
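For symmetry, the write side is the mirror image: Thread.VolatileWrite places the barrier before the store, giving it release-fence behavior. Roughly, paraphrasing the same reference source (treat the exact body as illustrative rather than authoritative):
public static void VolatileWrite(ref int address, int value)
{
    MemoryBarrier(); // release fence: no earlier read or write may move below it...
    address = value; // ...so the store cannot be observed ahead of those operations
}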
Your sample runs unterminated (most of the time I think) because _loop can be cached.
Any of the 'solutions' you mentioned (Sleep, Yield) will involve a memory barrier, forcing the compiler to refresh _loop.
The minimal solution (untested):
do
{
System.Threading.Thread.MemoryBarrier();
if (data._loop != 1)
{
break;
}
} while (true);
It is not only a matter of the compiler; it can also be a matter of the CPU, which does its own optimizations. Granted, a consumer CPU generally does not have that much liberty, and usually the compiler is the one responsible for the above scenario.
A full fence is probably too heavy-weight for making a single volatile read.
Having said this, a good explanation of what optimization can occur is found here: http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/
There seems to be a lot of talk about memory barriers at the hardware level. Memory fences are irrelevant here. It's nice to tell the hardware not to do anything funny, but it wasn't planning to do so in the first place, because you are of course going to run this code on x86 or amd64. You don't need a fence here (and it is very rare that you do, though it can happen). All you need in this case is to reload the value from memory.
The problem here is that the JIT compiler is being funny, not the hardware.
In order to force the JIT to quit joking around, you need something that either (1) just plain happens to trick the JIT compiler into reloading that variable (but that's relying on implementation details) or that (2) generates a memory barrier or read-with-acquire of the kind that the JIT compiler understands (even if no fences end up in the instruction stream).
To address your actual question: the only real rules about what must happen are the ones that apply in case (2).
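Concretely, for the loop in the question, option (2) can be as small as replacing the naked field read with a read-with-acquire. A sketch (Volatile.Read exists from .NET 4.5; on 4.0, Thread.VolatileRead plays the same role at the cost of a full fence):
// Sketch: the acquire read forces the JIT to actually reload the field
// on each iteration instead of hoisting it into a register.
do
{
    if (Volatile.Read(ref data._loop) != 1)
    {
        break;
    }
} while (true);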

C# optimizations and side effects

Can optimizations done by the C# compiler or the JITter have visible side effects?
One example I've thought of:
var x = new Something();
A(x);
B(x);
When calling A(x) x is guaranteed to be kept alive to the end of A - because B uses the same parameter. But if B is defined as
public void B(Something x) { }
Then the B(x) call can be eliminated by the optimizer, and a GC.KeepAlive(x) call might be necessary instead.
Can this optimization actually be done by the JITter?
Are there other optimizations that might have visible side effects, except stack trace changes?
If your function B does not use the parameter x, then eliminating it and collecting x early does not have any visible side effects.
To be "visible side effects", they have to be visible to the program, not to an external tool like a debugger or object viewer.
When calling A(x) x is guaranteed to be kept alive to the end of A - because B uses the same parameter.
This statement is false. Suppose method A always throws an exception. The jitter could know that B will never be reached, and therefore x can be released immediately. Suppose method A goes into an unconditional infinite loop after its last reference to x; again, the jitter could know that via static analysis, determine that x will never be referenced again, and schedule it to be cleaned up. I do not know if the jitter actually performs these optimization; they seem dodgy, but they are legal.
Can this optimization (namely, doing early cleanup of a reference that is not used anywhere) actually be done by the JITter?
Yes, and in practice, it is done. That is not an observable side effect.
This is justified by section 3.9 of the specification, which I quote for your convenience:
If the object, or any part of it, cannot be accessed by any possible continuation of execution, other than the running of destructors, the object is considered no longer in use, and it becomes eligible for destruction. The C# compiler and the garbage collector may choose to analyze code to determine which references to an object may be used in the future. For instance, if a local variable that is in scope is the only existing reference to an object, but that local variable is never referred to in any possible continuation of execution from the current execution point in the procedure, the garbage collector may (but is not required to) treat the object as no longer in use.
Can optimizations done by the C# compiler or the JITter have visible side effects?
Your question is answered in section 3.10 of the specification, which I quote here for your convenience:
Execution of a C# program proceeds such that the side effects of each executing thread are preserved at critical execution points. A side effect is defined as a read or write of a volatile field, a write to a non-volatile variable, a write to an external resource, and the throwing of an exception. The critical execution points at which the order of these side effects must be preserved are references to volatile fields, lock statements, and thread creation and termination. The execution environment is free to change the order of execution of a C# program, subject to the following constraints:
Data dependence is preserved within a thread of execution. That is, the value of each variable is computed as if all statements in the thread were executed in original program order.
Initialization ordering rules are preserved.
The ordering of side effects is preserved with respect to volatile reads and writes.
Additionally, the execution environment need not evaluate part of an expression if it can deduce that that expression's value is not used and that no needed side effects are produced (including any caused by calling a method or accessing a volatile field).
When program execution is interrupted by an asynchronous event (such as an exception thrown by another thread), it is not guaranteed that the observable side effects are visible in the original program order.
The second-to-last paragraph is I believe the one you are most concerned about; that is, what optimizations is the runtime allowed to perform with respect to affecting observable side effects? The runtime is permitted to perform any optimization which does not affect an observable side effect.
Note that in particular data dependence is only preserved within a thread of execution. Data dependence is not guaranteed to be preserved when observed from another thread of execution.
If that doesn't answer your question, ask a more specific question. In particular, a careful and precise definition of "observable side effect" will be necessary to answer your question in more detail, if you do not consider the definition given above to match your definition of "observable side effect".
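As an aside on the practical remedy the question itself mentions: if you need to guarantee that x remains reachable past the call to A(x) even when B(x) is optimized away, GC.KeepAlive is the documented way to express that. A minimal sketch based on the question's snippet:
var x = new Something();
A(x);
B(x);
// GC.KeepAlive is defined as an opaque use of its argument, so the runtime
// must treat x as reachable up to this point even if the call to B(x) is
// eliminated or x is otherwise never referenced again.
GC.KeepAlive(x);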
Including B in your question just confuses the matter. Given this code:
var x = new Something();
A(x);
Assuming that A(x) is managed code, then calling A(x) maintains a reference to x, so the garbage collector can't collect x until after A returns. Or at least until A no longer needs it. The optimizations done by the JITer (absent bugs) will not prematurely collect x.
You should define what you mean by "visible side effects." One would hope that JITer optimizations at least have the side effect of making your code smaller or faster. Are those "visible"? Or do you mean "undesirable"?
Eric Lippert has started a great series about refactoring which leads me to believe that the C# Compiler and JITter makes sure not to introduce side effects. Part 1 and Part 2 are currently online.
