Is volatile needed with indirect access? - c#

Please note I am not asking about replacing volatile with other means (like lock); I am asking about the nature of volatile itself -- thus I use "needed" in the sense of: volatile or no volatile.
Consider the case where one thread only writes to a variable x (Int32) and the other thread only reads it. The access in both cases is direct.
The volatile keyword is needed to avoid caching, correct?
But what if the access to x is indirect -- for example via property:
int x;
int access_x { get { return x; } set { x = value; } }
So both threads now use only access_x, not x. Does x need to be marked as volatile? If yes, is there some limit of indirection beyond which volatile is no longer needed?
Update: consider such code of the reader (no writes):
if (x>10)
...
// second thread changes `x`
if (x>10)
...
In the second if, the compiler could use the old value, because x could be cached and without volatile there is no need to refetch it. My question is about this change:
if (access_x>10)
...
// second thread changes `x`
if (access_x>10)
...
And let's say I skip volatile for x. What will happen / what can happen?

Does x need to be marked as volatile?
Yes, and no (but mostly yes).
"Yes", because technically you have no guarantees here. The compilers (C# and JIT) are permitted to make any optimization they see fit, as long as the optimization would not change the behavior of the code when executed in a single thread. One obvious optimization is to omit the calls to the property setter and getter and just access the field directly (i.e. inlining). Of course, the compiler is allowed to do whatever analysis it wants and make further optimizations.
"No", because in practice this usually is not a problem. With the access wrapped in a method, the C# compiler won't optimize away the field, and the JIT compiler is unlikely to do so (even if the methods are inlined…again, no guarantees, but AFAIK such an optimization isn't performed, and I think it unlikely a future version would). So all you're left with is memory coherency issues (the other reason volatile is used…i.e. essentially dealing with optimizations performed at the hardware level).
As long as your code is only ever going to run on Intel x86-compatible hardware, that hardware treats all reads and writes as volatile.
However, other platforms may be different. Itanium and ARM being a couple of common examples which have different memory models.
Personally, I prefer to write to the technicalities. Writing code that just happens to work, in spite of a lack of guarantee, is just asking to find some time in the future that the code stops working for some mysterious reason.
So IMHO you really should mark the field as volatile.
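Applied to the example from the question, that just means marking the backing field. A minimal sketch (the wrapping class is only there to make the snippet self-contained):
class Shared
{
    // marking the backing field volatile covers the access whether it happens
    // directly or through the property, and whether or not the accessors are inlined
    private volatile int x;

    public int access_x
    {
        get { return x; }
        set { x = value; }
    }
}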
If yes, is there some limit of indirection beyond which volatile is no longer needed?
No. If such a limit existed, it would have to be documented to be useful. In practice, the limit in terms of compiler optimizations is really just a "level of indirection" (as you put it). But no amount of levels of indirection avoid the hardware-level optimizations, and even the limit in terms of compiler optimizations is strictly "in practice". You have no guarantee that the compiler will never analyze the code more deeply and perform more aggressive optimizations, to any arbitrarily deep level of calls.
More generally, my rule of thumb is this: if I am trying to decide whether I should use some particular feature that I know is normally used to protect against bugs in concurrency-related scenarios, and I have to ask "do I really need this feature, or will the code work fine without it?", then I probably don't know enough about the way that feature works for me to safely avoid using it.
Consider it a variation on the old "if you have to ask how much it costs, you can't afford it."

Related

Read int value in thread safe way [duplicate]

(This is a repeat of: How to correctly read an Interlocked.Increment'ed int field? but, after reading the answers and comments, I'm still not sure of the right answer.)
There's some code that I don't own and can't change to use locks, which increments an int counter (numberOfUpdates) from several different threads. All calls use:
Interlocked.Increment(ref numberOfUpdates);
I want to read numberOfUpdates in my code. Now since this is an int, I know that it can't tear. But what's the best way to ensure that I get the latest value possible? It seems like my options are:
int localNumberOfUpdates = Interlocked.CompareExchange(ref numberOfUpdates, 0, 0);
Or
int localNumberOfUpdates = Thread.VolatileRead(ref numberOfUpdates);
Will both work (in the sense of delivering the latest value possible regardless of optimizations, re-orderings, caching, etc.)? Is one preferred over the other? Is there a third option that's better?
I'm a firm believer that if you're using interlocked to increment shared data, then you should use interlocked everywhere you access that shared data. Likewise, if you use your favorite synchronization primitive to increment shared data, then you should use that same primitive everywhere you access that shared data.
int localNumberOfUpdates = Interlocked.CompareExchange(ref numberOfUpdates, 0, 0);
Will give you exactly what you're looking for. As others have said, interlocked operations are atomic. So Interlocked.CompareExchange will always return the most recent value. I use this all the time for accessing simple shared data like counters.
I'm not as familiar with Thread.VolatileRead, but I suspect it will also return the most recent value. I'd stick with interlocked methods, if only for the sake of being consistent.
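To sketch what "interlocked everywhere" looks like in this scenario (the Counter wrapper and its method names are made up for illustration; in the question the increment lives in code you don't own):
using System.Threading;

class Counter
{
    private int numberOfUpdates;

    // the writers (the code you don't own) effectively do this:
    public void RecordUpdate()
    {
        Interlocked.Increment(ref numberOfUpdates);
    }

    // the reader goes through an interlocked operation as well, so every access
    // to the shared field uses the same primitive; CompareExchange with (0, 0)
    // just returns the current value without changing it
    public int Read()
    {
        return Interlocked.CompareExchange(ref numberOfUpdates, 0, 0);
    }
}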
Additional info:
I'd recommend taking a look at Jon Skeet's answer for why you may want to shy away from Thread.VolatileRead(): Thread.VolatileRead Implementation
Eric Lippert discusses volatility and the guarantees made by the C# memory model in his blog at http://blogs.msdn.com/b/ericlippert/archive/2011/06/16/atomicity-volatility-and-immutability-are-different-part-three.aspx. Straight from the horse's mouth: "I don't attempt to write any low-lock code except for the most trivial usages of Interlocked operations. I leave the usage of "volatile" to real experts."
And I agree with Hans's point that the value will always be stale by at least a few nanoseconds, but if you have a use case where that is unacceptable, it's probably not well suited for a garbage-collected language like C# or a non-real-time OS. Joe Duffy has a good article on the timeliness of interlocked methods here: http://joeduffyblog.com/2008/06/13/volatile-reads-and-writes-and-timeliness/
Thread.VolatileRead(ref numberOfUpdates) is what you want. numberOfUpdates is an Int32, so you already have atomicity by default, and Thread.VolatileRead will ensure volatility is dealt with.
If numberOfUpdates is defined as volatile int numberOfUpdates; you don't have to do this, as all reads of it will already be volatile reads.
There seems to be confusion about whether Interlocked.CompareExchange is more appropriate. Consider the following two excerpts from the documentation.
From the Thread.VolatileRead documentation:
Reads the value of a field. The value is the latest written by any processor in a computer, regardless of the number of processors or the state of processor cache.
From the Interlocked.CompareExchange documentation:
Compares two 32-bit signed integers for equality and, if they are equal, replaces one of the values.
In terms of the stated behavior of these methods, Thread.VolatileRead is clearly more appropriate. You do not want to compare numberOfUpdates to another value, and you do not want to replace its value. You want to read its value.
Lasse makes a good point in his comment: you might be better off using simple locking. When the other code wants to update numberOfUpdates it does something like the following.
lock (state)
{
    state.numberOfUpdates++;
}
When you want to read it, you do something like the following.
int value;
lock (state)
{
    value = state.numberOfUpdates;
}
This will ensure your requirements of atomicity and volatility without delving into more-obscure, relatively low-level multithreading primitives.
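Put together, a self-contained version of that approach might look like the following (the State class and member names are hypothetical):
class State
{
    private readonly object _gate = new object();
    private int numberOfUpdates;

    // the updating code does this
    public void RecordUpdate()
    {
        lock (_gate)
        {
            numberOfUpdates++;
        }
    }

    // the reading code does this
    public int ReadNumberOfUpdates()
    {
        lock (_gate)
        {
            return numberOfUpdates;
        }
    }
}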
Will both work (in the sense of delivering the latest value possible regardless of optimizations, re-orderings, caching, etc.)?
No, the value you get is always stale. How stale the value might be is entirely unpredictable. The vast majority of the time it will be stale by a few nanoseconds, give or take, depending how quickly you act on the value. But there is no reasonable upper-bound:
your thread can lose the processor when the scheduler context-switches another thread onto the core. Typical delays are around 45 msec, with no firm upper bound. This does not mean that another thread in your process also gets switched out; it can keep motoring and continue to mutate the value.
just like any user-mode code, your code is subject to page faults as well, incurred when the operating system needs RAM for another process. On a heavily loaded machine that can and will page out active code, as sometimes happens to the mouse driver code for example, leaving a frozen mouse cursor.
managed threads are subject to near-random garbage collection pauses. Tends to be the lesser problem since it is likely that another thread that's mutating the value will be paused as well.
Whatever you do with the value needs to take this into account. Needless to say perhaps, that's very, very difficult. Practical examples are hard to come by. The .NET Framework is a very large chunk of battle-scarred code. You can see the cross-reference to usage of VolatileRead from the Reference Source. Number of hits: 0.
Well, any value you read will always be somewhat stale, as Hans Passant said. What you can control is a guarantee that other shared values are consistent with the one you've just read (i.e. at the same degree of "staleness"), by using memory fences in the middle of code that reads several shared values without locks.
Fences also have the effect of defeating some compiler optimizations and reordering, thus preventing unexpected behavior in release mode on different platforms.
Thread.VolatileRead will cause a full memory fence to be emitted so that no reads or writes can be reordered around your read of the int (in the method that's reading it). Obviously if you're only reading a single shared value (and you're not reading something else shared and the order and consistency of them both is important), then it may not seem necessary...
But I think that you will need it anyway to defeat some optimizations by the compiler or CPU so that you don't get the read more "stale" than necessary.
A dummy Interlocked.CompareExchange will do the same thing as Thread.VolatileRead (full fence and optimization defeating behavior).
There is a pattern followed in the framework used by CancellationTokenSource
http://referencesource.microsoft.com/#mscorlib/system/threading/CancellationTokenSource.cs#64
//m_state uses the pattern "volatile int32 reads, with cmpxch writes" which is safe for updates and cannot suffer torn reads.
private volatile int m_state;
public bool IsCancellationRequested
{
get { return m_state >= NOTIFYING; }
}
// ....
if (Interlocked.CompareExchange(ref m_state, NOTIFYING, NOT_CANCELED) == NOT_CANCELED) {
}
// ....
The volatile keyword has the effect of emitting a "half" fence. (ie: it blocks reads/writes from being moved before the read to it, and blocks reads/writes from being moved after the write to it).
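Applied to a plain counter, the same pattern looks roughly like this (a sketch only; it uses Interlocked.Increment for the write instead of a raw CompareExchange, and the class name is made up):
using System.Threading;

class UpdateTracker
{
    // volatile int32 reads, interlocked writes
    private volatile int numberOfUpdates;

    public int NumberOfUpdates
    {
        // reading a volatile field is an acquire (volatile) read
        get { return numberOfUpdates; }
    }

    public void Increment()
    {
        // the compiler warns (CS0420) about passing a volatile field by ref,
        // but the warning can be ignored for the Interlocked methods
#pragma warning disable 0420
        Interlocked.Increment(ref numberOfUpdates);
#pragma warning restore 0420
    }
}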
It seems like my options are:
int localNumberOfUpdates = Interlocked.CompareExchange(ref numberOfUpdates, 0, 0);
Or
int localNumberOfUpdates = Thread.VolatileRead(ref numberOfUpdates);
Starting from the .NET Framework 4.5, there is a third option:
int localNumberOfUpdates = Volatile.Read(ref numberOfUpdates);
The Interlocked methods impose full fences, while the Volatile methods impose half fences¹. So using the static methods of the Volatile class is a potentially more economic way of reading atomically the latest value of an int variable or field.
Alternatively, if numberOfUpdates is a field of a class, you could declare it as volatile. Reading a volatile field is equivalent to reading it with the Volatile.Read method.
I should mention one more option, which is to simply read numberOfUpdates directly, without the help of either Interlocked or Volatile. We are not supposed to do this, but demonstrating an actual problem caused by doing so might be impossible. The reason is that the memory models of the most commonly used CPUs are stronger than the C# memory model. So if your machine has such a CPU (for example x86 or x64), you won't be able to write a program that fails as a result of reading the field directly. Nevertheless, personally I never use this option, because I am not an expert in CPU architectures and memory protocols, nor do I have the desire to become one. So I prefer to use either the Volatile class or the volatile keyword, whatever is more convenient in each case.
¹ With some exceptions, like reading/writing an Int64 or double on a 32-bit machine.
Not sure why nobody mentioned int localNumberOfUpdates = Interlocked.Add(ref numberOfUpdates, 0), but it seems the simplest way to me...

Implicit limits of compiler instruction reordering?

When learning about locking mechanisms and concurrency recently, I came across compiler instruction reordering. Since then, I am a lot more suspicious about the correctness of the code I write even if there is no concurrent access to fields. I just encountered a piece of code like this:
var now = DateTime.Now;
var newValue = CalculateCachedValue();
cachedValue = newValue;
lastUpdate = now;
Is it possible that lastUpdate = now is executed before cachedValue is assigned the new value? This would mean that if the thread running this code was cancelled I would have logged an update that was not saved. From what I know now I have to assume this is the case and I need a memory barrier.
But is it even possible that the first statement is executed after the second? This would mean now is the time after the calculation and not before. I guess this is not the case because a method call is involved. However, there is no other clear dependency that prevents reordering. Is a method call/property access an implicit barrier? Are there other implicit constraints for instruction reordering that I should be aware of?
The .NET jitter can reorder instructions, yes. Invariant code motion and common sub-expression elimination are important optimizations and can make code a great deal faster.
But that does not just happen willy-nilly. The optimizer will only ever contemplate such an optimization if it knows that reordering will not have any undesirable side-effects. In order for it to know, it first has to inline a method or property getter call. And that will never happen for DateTime.Now; it requires an operating system call, and those can never be inlined. So you have a hard guarantee that no statement ever moves before or after var now = DateTime.Now;
And it actually has to make sense to move code and result in a measurable benefit. There is no point in reordering the assignment statements; it does not make the code any faster. Invariant code motion is an optimization that's applied to statements inside a loop; moving such a statement ahead of the loop so it does not get executed repeatedly pays off. No risk of this at all in this snippet. Sub-expression elimination is also nowhere in sight here.
Being afraid of optimizer-induced bugs is a bit like being afraid to step outside because you might be struck by a bolt of lightning. That happens. Odds are just very, very low. A nice guarantee you get with the .NET jitter is that it gets tested millions of times every day.
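If you would rather make the ordering explicit than rely on DateTime.Now being non-inlinable, a barrier between the two stores does it. A sketch (assuming using System.Threading; per the answer above it is not actually needed here):
var now = DateTime.Now;
var newValue = CalculateCachedValue();
cachedValue = newValue;
Thread.MemoryBarrier(); // the store to lastUpdate cannot be moved above this fence
lastUpdate = now;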

Variable freshness guarantee in .NET (volatile vs. volatile read)

I have read a lot of contradictory information (MSDN, SO, etc.) about volatile and VolatileRead (read-acquire fence).
I understand the memory-access reordering restrictions these imply - what I'm still completely confused about is the freshness guarantee - which is very important for me.
msdn doc for volatile mentions:
(...) This ensures that the most up-to-date value is present in the field at all times.
msdn doc for volatile fields mentions:
A read of a volatile field is called a volatile read. A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it in the instruction sequence.
.NET code for VolatileRead is:
public static int VolatileRead(ref int address)
{
    int ret = address;
    MemoryBarrier(); // Call MemoryBarrier to ensure the proper semantic in a portable way.
    return ret;
}
According to the MSDN MemoryBarrier doc, a memory barrier prevents reordering. However, this doesn't seem to have any implications for freshness - correct?
How then can one get a freshness guarantee?
And is there a difference between marking a field volatile and accessing it with VolatileRead and VolatileWrite semantics? I'm currently doing the latter in my performance-critical code that needs to guarantee freshness, but readers sometimes get a stale value. I'm wondering if marking the state volatile would make the situation different.
EDIT1:
What I'm trying to achieve: a guarantee that reader threads will get as recent a value of the shared variable (written by multiple writers) as possible - ideally no older than the cost of a context switch or other operations that may postpone the immediate write of state.
If volatile or a higher-level construct (e.g. lock) has this guarantee (does it?), then how is this achieved?
EDIT2:
The very condensed question should have been: how do I get a guarantee of reading as fresh a value as possible? Ideally without locking (as exclusive access is not needed and there is potential for high contention).
From what I learned here, I'm wondering if this might be the solution (the solving(?) line is marked with a comment):
private SharedState _sharedState;
private SpinLock _spinLock = new SpinLock(false);

public void Update(SharedState newValue)
{
    bool lockTaken = false;
    _spinLock.Enter(ref lockTaken);
    _sharedState = newValue;
    if (lockTaken)
    {
        _spinLock.Exit();
    }
}

public SharedState GetFreshSharedState
{
    get
    {
        Thread.MemoryBarrier(); // <---- This is added to give readers freshness guarantee
        var value = _sharedState;
        Thread.MemoryBarrier();
        return value;
    }
}
The MemoryBarrier call was added to make sure that both reads and writes are wrapped by full fences (the same as lock-based code, as indicated in the 'Memory barriers and locking' section here: http://www.albahari.com/threading/part4.aspx#_The_volatile_keyword).
Does this look correct or is it flawed?
EDIT3:
Thanks to very interesting discussions here I learned quite a few things and I actually was able to distill to the simplified unambiguous question that I have about this topic. It's quite different from the original one so I rather posted a new one here: Memory barrier vs Interlocked impact on memory caches coherency timing
I think this is a good question. But it is also difficult to answer. I am not sure I can give you a definitive answer to your questions. It is not your fault, really. It is just that the subject matter is complex and really requires knowing details that might not be feasible to enumerate. Honestly, it really seems like you have educated yourself on the subject quite well already. I have spent a lot of time studying the subject myself and I still do not fully understand everything. Nevertheless, I will still attempt some semblance of an answer here anyway.
So what does it mean for a thread to read a fresh value anyway? Does it mean the value returned by the read is guaranteed to be no older than 100ms, 50ms, or 1ms? Or does it mean the value is the absolute latest? Or does it mean that if two reads occur back-to-back then the second is guaranteed to get a newer value assuming the memory address changed after the first read? Or does it mean something else altogether?
I think you are going to have a hard time getting your readers to work correctly if you are thinking about things in terms of time intervals. Instead think of things in terms of what happens when you chain reads together. To illustrate my point consider how you would implement an interlocked-like operation using arbitrarily complex logic.
public static T InterlockedOperation<T>(ref T location, T operand, Func<T, T, T> op)
    where T : class
{
    T initial, computed;
    do
    {
        initial = location;
        computed = op(initial, operand); // op is the specific operation you want to make atomic
    }
    while (Interlocked.CompareExchange(ref location, computed, initial) != initial);
    return computed;
}
In the code above we can create any interlocked-like operation if we exploit the fact that the second read of location via Interlocked.CompareExchange will be guaranteed to return a newer value if the memory address received a write after the first read. This is because the Interlocked.CompareExchange method generates a memory barrier. If the value has changed between reads then the code spins around the loop repeatedly until location stops changing. This pattern does not require that the code use the latest or freshest value; just a newer value. The distinction is crucial.¹
A lot of lock-free code I have seen works on this principle. That is, the operations are usually wrapped in loops such that the operation is continually retried until it succeeds. It does not assume that the first attempt used the latest value. Nor does it assume every use of the value is the latest. It only assumes that the value is newer after each read.
Try to rethink how your readers should behave. Try to make them more agnostic about the age of the value. If that is simply not possible and all writes must be captured and processed then you may be forced into a more deterministic approach like placing all writes into a queue and having the readers dequeue them one-by-one. I am sure the ConcurrentQueue class would help in that situation.
If you can reduce the meaning of "fresh" to only "newer" then placing a call to Thread.MemoryBarrier after each read, using Volatile.Read, using the volatile keyword, etc. will absolutely guarantee that one read in a sequence will return a newer value than a previous read.
¹ The ABA problem opens up a new can of worms.
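As a concrete illustration of "newer, not necessarily latest", a polling reader might look like this (a sketch; Volatile.Read/Volatile.Write require .NET 4.5, and the class and field names are made up):
using System;
using System.Threading;

class Poller
{
    private int _sharedState;

    // writers publish new values with a release write
    public void Publish(int value)
    {
        Volatile.Write(ref _sharedState, value);
    }

    public void PollForChanges()
    {
        int previous = Volatile.Read(ref _sharedState);
        while (true)
        {
            int current = Volatile.Read(ref _sharedState);

            // each acquire read is guaranteed to be no older than the previous one,
            // but it is NOT guaranteed to be the absolute latest write
            if (current != previous)
            {
                Console.WriteLine("observed {0} -> {1}", previous, current);
                previous = current;
            }

            Thread.Yield(); // be polite to other threads; not needed for correctness
        }
    }
}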
A memory barrier does provide this guarantee. We can derive the "freshness" property that you are looking for from the reordering properties that a barrier guarantees.
By freshness you probably mean that a read returns the value of the most recent write.
Let's say we have these operations, each on a different thread:
x = 1
x = 2
print(x)
How could we possibly print a value other than 2? Without volatile the read can move one slot upwards and return 1. Volatile prevents reorderings, though. The write cannot move backwards in time.
In short, volatile guarantees you to see the most recent value.
Strictly speaking I'd need to differentiate between volatile and a memory barrier here. The latter one is a stronger guarantee. I have simplified this discussion because volatile is implemented using memory barriers, at least on x86/x64.

Read Introduction in C# - how to protect against it?

An article in MSDN Magazine discusses the notion of Read Introduction and gives a code sample which can be broken by it.
public class ReadIntro {
    private Object _obj = new Object();
    void PrintObj() {
        Object obj = _obj;
        if (obj != null) {
            Console.WriteLine(obj.ToString()); // May throw a NullReferenceException
        }
    }
    void Uninitialize() {
        _obj = null;
    }
}
Notice this "May throw a NullReferenceException" comment - I never knew this was possible.
So my question is: how can I protect against read introduction?
I would also be really grateful for an explanation exactly when the compiler decides to introduce reads, because the article doesn't include it.
Let me try to clarify this complicated question by breaking it down.
What is "read introduction"?
"Read introduction" is an optimization whereby the code:
public static Foo foo; // I can be changed on another thread!
void DoBar() {
    Foo fooLocal = foo;
    if (fooLocal != null) fooLocal.Bar();
}
is optimized by eliminating the local variable. The compiler can reason that if there is only one thread then foo and fooLocal are the same thing. The compiler is explicitly permitted to make any optimization that would be invisible on a single thread, even if it becomes visible in a multithreaded scenario. The compiler is therefore permitted to rewrite this as:
void DoBar() {
    if (foo != null) foo.Bar();
}
And now there is a race condition. If foo turns from non-null to null after the check then it is possible that foo is read a second time, and the second time it could be null, which would then crash. From the perspective of the person diagnosing the crash dump this would be completely mysterious.
Can this actually happen?
As the article you linked to called out:
Note that you won’t be able to reproduce the NullReferenceException using this code sample in the .NET Framework 4.5 on x86-x64. Read introduction is very difficult to reproduce in the .NET Framework 4.5, but it does nevertheless occur in certain special circumstances.
x86/x64 chips have a "strong" memory model and the jit compilers are not aggressive in this area; they will not do this optimization.
If you happen to be running your code on a weak memory model processor, like an ARM chip, then all bets are off.
When you say "the compiler" which compiler do you mean?
I mean the jit compiler. The C# compiler never introduces reads in this manner. (It is permitted to, but in practice it never does.)
Isn't it a bad practice to be sharing memory between threads without memory barriers?
Yes. Something should be done here to introduce a memory barrier because the value of foo could already be a stale cached value in the processor cache. My preference for introducing a memory barrier is to use a lock. You could also make the field volatile, or use VolatileRead, or use one of the Interlocked methods. All of those introduce a memory barrier. (volatile introduces only a "half fence" FYI.)
Just because there's a memory barrier does not necessarily mean that read introduction optimizations are not performed. However, the jitter is far less aggressive about pursuing optimizations that affect code that contains a memory barrier.
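For example, the lock-based variant of the sample from the question might look like this (a sketch; the _gate object is added purely for illustration):
public class ReadIntro {
    private readonly Object _gate = new Object();
    private Object _obj = new Object();
    void PrintObj() {
        Object obj;
        lock (_gate) {
            // copy under the lock; the lock is a full barrier, and the jitter is
            // far less aggressive about introducing reads around one
            obj = _obj;
        }
        if (obj != null) {
            Console.WriteLine(obj.ToString());
        }
    }
    void Uninitialize() {
        lock (_gate) {
            _obj = null;
        }
    }
}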
Are there other dangers to this pattern?
Sure! Let's suppose there are no read introductions. You still have a race condition. What if another thread sets foo to null after the check, and also modifies global state that Bar is going to consume? Now you have two threads, one of which believes that foo is not null and the global state is OK for a call to Bar, and another thread which believes the opposite, and you're running Bar. This is a recipe for disaster.
So what's the best practice here?
First, do not share memory across threads. This whole idea that there are two threads of control inside the main line of your program is just crazy to begin with. It never should have been a thing in the first place. Use threads as lightweight processes; give them an independent task to perform that does not interact with the memory of the main line of the program at all, and just use them to farm out computationally intensive work.
Second, if you are going to share memory across threads then use locks to serialize access to that memory. Locks are cheap if they are not contended, and if you have contention, then fix that problem. Low-lock and no-lock solutions are notoriously difficult to get right.
Third, if you are going to share memory across threads then every single method you call that involves that shared memory must either be robust in the face of race conditions, or the races must be eliminated. That is a heavy burden to bear, and that is why you shouldn't go there in the first place.
My point is: read introductions are scary but frankly they are the least of your worries if you are writing code that blithely shares memory across threads. There are a thousand and one other things to worry about first.
You can't really "protect" against read introduction, as it's a compiler optimization (except by using Debug builds with no optimization, of course). It's pretty well documented that the optimizer will maintain the single-threaded semantics of the function, which, as the article notes, can cause issues in multi-threaded situations.
That said, I'm confused by his example. In Jeffrey Richter's book CLR via C# (v3 in this case), in the Events section he covers this pattern, and notes that in the example snippet you have above, in THEORY it wouldn't work. But it was a pattern recommended by Microsoft early in .NET's existence, and therefore the JIT compiler people he spoke to said that they would have to make sure that sort of snippet never breaks. (It's always possible they may decide that it's worth breaking for some reason, though - I imagine Eric Lippert could shed light on that.)
Finally, unlike the article, Jeffrey offers the "proper" way to handle this in multi-threaded situations (I've modified his example with your sample code):
Object temp = Interlocked.CompareExchange(ref _obj, null, null);
if (temp != null)
{
    Console.WriteLine(temp.ToString());
}
I only skimmed the article, but it seems that what the author is looking for is that you need to declare the _obj member as volatile.

When to use volatile to counteract compiler optimizations in C#

I have spent an extensive number of weeks doing multithreaded coding in C# 4.0. However, there is one question that remains unanswered for me.
I understand that the volatile keyword prevents the compiler from storing variables in registers, thus avoiding inadvertently reading stale values. Writes are always volatile in .NET, so any documentation stating that it also avoids stale writes is redundant.
I also know that the compiler optimization is somewhat "unpredictable". The following code will illustrate a stall due to a compiler optimization (when running the release compile outside of VS):
class Test
{
    public struct Data
    {
        public int _loop;
    }

    public static Data data;

    public static void Main()
    {
        data._loop = 1;
        Test test1 = new Test();

        new Thread(() =>
        {
            data._loop = 0;
        }).Start();

        do
        {
            if (data._loop != 1)
            {
                break;
            }
            //Thread.Yield();
        } while (true);
        // will never terminate
    }
}
The code behaves as expected. However, if I uncomment the //Thread.Yield(); line, then the loop will exit.
Further, if I put a Sleep statement before the do loop, it will exit. I don't get it.
Naturally, decorating _loop with volatile will also cause the loop to exit (in its shown pattern).
My question is: What are the rules the compiler follows in order to determine when to implicitly perform a volatile read? And why can I still get the loop to exit with what I consider to be odd measures?
EDIT
IL for code as shown (stalls):
L_0038: ldsflda valuetype ConsoleApplication1.Test/Data ConsoleApplication1.Test::data
L_003d: ldfld int32 ConsoleApplication1.Test/Data::_loop
L_0042: ldc.i4.1
L_0043: beq.s L_0038
L_0045: ret
IL with Yield() (does not stall):
L_0038: ldsflda valuetype ConsoleApplication1.Test/Data ConsoleApplication1.Test::data
L_003d: ldfld int32 ConsoleApplication1.Test/Data::_loop
L_0042: ldc.i4.1
L_0043: beq.s L_0046
L_0045: ret
L_0046: call bool [mscorlib]System.Threading.Thread::Yield()
L_004b: pop
L_004c: br.s L_0038
What are the rules the compiler follows in order to determine when to implicitly perform a volatile read?
First, it is not just the compiler that moves instructions around. The big 3 actors in play that cause instruction reordering are:
Compiler (like C# or VB.NET)
Runtime (like the CLR or Mono)
Hardware (like x86 or ARM)
The rules at the hardware level are a little more cut and dry in that they are usually documented pretty well. But, at the runtime and compiler levels there are memory model specifications that provide constraints on how instructions can get reordered, but it is left up to the implementers to decide how aggressively they want to optimize the code and how closely they want to toe the line with respect to the memory model constraints.
For example, the ECMA specification for the CLI provides fairly weak guarantees. But Microsoft decided to tighten those guarantees in the .NET Framework CLR. Other than a few blog posts I have not seen much formal documentation on the rules the CLR adheres to. Mono, of course, might use a different set of rules that may or may not bring it closer to the ECMA specification. And of course, there may be some liberty in changing the rules in future releases as long as the formal ECMA specification is still considered.
With all of that said I have a few observations:
Compiling with the Release configuration is more likely to cause instruction reordering.
Simpler methods are more likely to have their instructions reordered.
Hoisting a read from inside a loop to outside of the loop is a typical type of reordering optimization.
And why can I still get the loop to exit with what I consider to be odd measures?
It is because those "odd measures" are doing one of two things:
generating an implicit memory barrier
circumventing the compiler's or runtime's ability to perform certain optimizations
For example, if the code inside a method gets too complex it may prevent the JIT compiler from performing certain optimizations that reorders instructions. You can think of it as sort of like how complex methods also do not get inlined.
Also, things like Thread.Yield and Thread.Sleep create implicit memory barriers. I have started a list of such mechanisms here. I bet if you put a Console.WriteLine call in your code it would also cause the loop to exit. I have also seen the "non terminating loop" example behave differently in different versions of the .NET Framework. For example, I bet if you ran that code in 1.0 it would terminate.
This is why using Thread.Sleep to simulate thread interleaving could actually mask a memory barrier problem.
Update:
After reading through some of your comments, I think you may be confused as to what Thread.MemoryBarrier is actually doing. What it does is create a full-fence barrier. What does that mean exactly? A full-fence barrier is the composition of two half-fences: an acquire-fence and a release-fence. I will define them now.
Acquire fence: A memory barrier in which other reads & writes are not allowed to move before the fence.
Release fence: A memory barrier in which other reads & writes are not allowed to move after the fence.
So when you see a call to Thread.MemoryBarrier it will prevent all reads & writes from being moved either above or below the barrier. It will also emit whatever CPU specific instructions are required.
If you look at the code for Thread.VolatileRead here is what you will see.
public static int VolatileRead(ref int address)
{
    int num = address;
    MemoryBarrier();
    return num;
}
Now you may be wondering why the MemoryBarrier call is after the actual read. Your intuition may tell you that to get a "fresh" read of address you would need the call to MemoryBarrier to occur before that read. But, alas, your intuition is wrong! The specification says a volatile read should produce an acquire-fence barrier. And per the definition I gave you above, that means the call to MemoryBarrier has to be after the read of address, to prevent other reads and writes from being moved before it. You see, volatile reads are not strictly about getting a "fresh" read; they are about preventing the movement of instructions. This is incredibly confusing, I know.
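For comparison, Thread.VolatileWrite puts the barrier on the other side of the access, which matches the release-fence definition above (this is the shape of the method as I recall it from the reference source):
public static void VolatileWrite(ref int address, int value)
{
    MemoryBarrier(); // prevents earlier reads and writes from being moved after (below) the write
    address = value;
}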
Your sample runs unterminated (most of the time I think) because _loop can be cached.
Any of the 'solutions' you mentioned (Sleep, Yield) will involve a memory barrier, forcing the compiler to refresh _loop.
The minimal solution (untested):
do
{
    System.Threading.Thread.MemoryBarrier();
    if (data._loop != 1)
    {
        break;
    }
} while (true);
It is not only a matter of the compiler; it can also be a matter of the CPU, which does its own optimizations. Granted, a consumer CPU generally does not have so much liberty, and usually the compiler is the one guilty of the above scenario.
A full fence is probably too heavy-weight for making a single volatile read.
Having said this, a good explanation of what optimization can occur is found here: http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/
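If the full fence feels too heavy, a half-fence read also works for this loop, assuming .NET 4.5+ is available (the question targets 4.0, so treat this as a sketch):
do
{
    // an acquire read; the JIT will not cache the field across this call
    if (Volatile.Read(ref data._loop) != 1)
    {
        break;
    }
} while (true);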
There seems to be a lot of talk about memory barriers at the hardware level. Memory fences are irrelevant here. It's nice to tell the hardware not to do anything funny, but it wasn't planning to do so in the first place, because you are of course going to run this code on x86 or amd64. You don't need a fence here (and it is very rare that you do, though it can happen). All you need in this case is to reload the value from memory.
The problem here is that the JIT compiler is being funny, not the hardware.
In order to force the JIT to quit joking around, you need something that either (1) just plain happens to trick the JIT compiler into reloading that variable (but that's relying on implementation details) or that (2) generates a memory barrier or read-with-acquire of the kind that the JIT compiler understands (even if no fences end up in the instruction stream).
To address your actual question, there are only actual rules about what should happen in case 2.
