In the following code sample, is the memory barrier in FuncA required to ensure that the most up-to-date value is read?
class Foo
{
    DateTime m_bar;

    void FuncA() // invoked by thread X
    {
        Thread.MemoryBarrier(); // is required?
        Console.WriteLine(m_bar);
    }

    void FuncB() // invoked by thread Y
    {
        m_bar = DateTime.Now;
    }
}
EDIT: If not, how can I ensure that FuncA will read the most recent value? (I want to make sure that the recent value is actually stored in the processor's cache) [without using locks]
Looks like a big "No" to me. Thread.MemoryBarrier() only constrains the ordering of memory accesses within the thread that calls it.
From MSDN:
The processor executing the current thread cannot reorder instructions in such a way that memory accesses prior to the call to MemoryBarrier execute after memory accesses that follow the call to MemoryBarrier.
I suggest you store the DateTime as a number of ticks (which is of type long, i.e. Int64); you can easily convert from ticks (new DateTime(ticks)) and to ticks (myDateTime.Ticks). Then you can use Interlocked.Read to read the value and Interlocked.Exchange to write the value in fast non-locking operations.
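A minimal sketch of that suggestion applied to the original example (the field name follows the question; the conversions are standard BCL calls):

class Foo
{
    long m_barTicks; // the DateTime stored as ticks so Interlocked can operate on it

    void FuncA() // invoked by thread X
    {
        // Interlocked.Read is an atomic 64-bit read (safe even on 32-bit
        // platforms) and acts as a full memory barrier.
        Console.WriteLine(new DateTime(Interlocked.Read(ref m_barTicks)));
    }

    void FuncB() // invoked by thread Y
    {
        Interlocked.Exchange(ref m_barTicks, DateTime.Now.Ticks);
    }
}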
Yes, the memory barrier is needed so that you can get the most up to date value.
If the memory barrier is not present then it is possible for thread X to read the value of m_bar from its own cache line while that value hasn't been written back to main memory (the change has been made local to thread Y). You can achieve the same effect by declaring the variable as volatile:
The volatile modifier is usually used for a field that is accessed by multiple threads without using the lock statement to serialize access. Using the volatile modifier ensures that one thread retrieves the most up-to-date value written by another thread.
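As an illustration only, a sketch of a volatile field (note that C# does not permit volatile on a DateTime or long field, as the next answer points out, so this sketch boxes the value into a reference-typed field):

class Foo
{
    volatile object m_bar; // volatile is legal here because the field is a reference type

    void FuncA() // invoked by thread X
    {
        Console.WriteLine(m_bar); // volatile read: acquire semantics
    }

    void FuncB() // invoked by thread Y
    {
        m_bar = DateTime.Now; // volatile write: release semantics (the DateTime is boxed)
    }
}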
A good entry on that matter (probably the best) is this one by Joe Duffy: Volatile reads and writes, and timeliness
A memory barrier in fact does the same thing that locking does: it guarantees the field will get its latest value from memory on entering the lock and be written back to memory before exiting the lock. Making sure a field's value is always read from or written to memory, and never optimized away by reading or writing it first to the CPU's cache, can also be achieved by using the volatile keyword. Unlike primitive integral types and reference types, DateTime cannot be cached in CPU registers and so need not (and cannot) be declared with the volatile keyword.

This actually doesn't matter, since on 32-bit architectures one can get a torn read in such a situation.
The following example comes from the MSDN.
public class ThreadSafe
{
    // Field totalValue contains a running total that can be updated
    // by multiple threads. It must be protected from unsynchronized
    // access.
    private float totalValue = 0.0F;

    // The Total property returns the running total.
    public float Total { get { return totalValue; } }

    // AddToTotal safely adds a value to the running total.
    public float AddToTotal(float addend)
    {
        float initialValue, computedValue;
        do
        {
            // Save the current running total in a local variable.
            initialValue = totalValue;

            // Add the new value to the running total.
            computedValue = initialValue + addend;

            // CompareExchange compares totalValue to initialValue. If
            // they are not equal, then another thread has updated the
            // running total since this loop started. CompareExchange
            // does not update totalValue. CompareExchange returns the
            // contents of totalValue, which do not equal initialValue,
            // so the loop executes again.
        }
        while (initialValue != Interlocked.CompareExchange(ref totalValue,
            computedValue, initialValue));
        // If no other thread updated the running total, then
        // totalValue and initialValue are equal when CompareExchange
        // compares them, and computedValue is stored in totalValue.
        // CompareExchange returns the value that was in totalValue
        // before the update, which is equal to initialValue, so the
        // loop ends.

        // The function returns computedValue, not totalValue, because
        // totalValue could be changed by another thread between
        // the time the loop ends and the function returns.
        return computedValue;
    }
}
Should totalValue not be declared as volatile to get the freshest value possible? I imagine that if you get a dirty value from a CPU cache, then the call to Interlocked.CompareExchange should take care of getting the freshest value and cause the loop to try again. Would the volatile keyword potentially save one unnecessary loop iteration?

I guess it isn't 100% necessary to have the volatile keyword, because the method has overloads that take data types such as long that don't support the volatile keyword.
No, volatile wouldn't be helpful at all, and certainly not for this reason. It would just give that first read "acquire" semantics instead of effectively relaxed, but either way will compile to similar asm that runs a load instruction.
if you get a dirty value from a CPU cache
CPU caches are coherent, so anything you read from CPU cache is the current globally agreed-on value for this line. "Dirty" just means it doesn't match DRAM contents, and will have to get written-back if / when evicted. A load value can also be forwarded from the store buffer, for a value this thread stored recently that isn't yet globally visible, but that's fine, Interlocked methods are full barriers that result in waiting for the store buffer to drain as well.
If you mean stale, then no, that's impossible, cache coherency protocols like MESI prevent that. This is why Interlocked things like CAS aren't horribly slow if the cache line is already owned by this core (MESI Modified or Exclusive state). See Myths Programmers Believe about CPU Caches which talks some about Java volatiles, which I think are similar to C# volatile.
This C++11 answer also explains some about cache coherency and asm. (Note that C++11 volatile is significantly different from C#, and doesn't imply any thread-safety or ordering, but does still imply the asm must do a load or a store, not optimize into a register.)
On non-x86, running extra barrier instructions after the initial read (to give those acquire semantics) before you even try a CAS just slows things down. (On x86 including x86-64, a volatile read compiles to the same asm as a plain read, except it prevents compile-time reordering).
A volatile read can't be optimized into just using a value in a register if the current thread just wrote something via a non-interlocked = assignment. That's not helpful either; if we just stored something and remember in a register what we stored, a load that does store-forwarding from the store buffer is morally equivalent to just using the register value.
Most of the good use-cases for lock-free atomics are when contention is lowish, so usually things can succeed without hardware having to wait a long time for access / ownership of the cache line. So you want to make the uncontended case as fast as possible. Avoid volatile even if there was anything to gain from it in highly-contended cases, which I don't think there is anyway.
If you ever did any plain stores (assignments with =, not interlocked RMW), volatile would have an effect on those, too. That might mean waiting for the store buffer to drain before later memory ops in this thread can run, if C# volatile gives semantics like C++ memory_order_seq_cst. In that case, you'd be slowing down the code involving the stores a lot, if you didn't need ordering wrt. other loads/stores. If you did such a store before this CAS code, yeah you'd be waiting until the store (and all previous stores) were globally visible to try reloading it. This would mean a reload + CAS the CPU is waiting to do right after are very likely to not have to spin because the CPU will have ownership of that line, but I think you'd effectively get similar behaviour from the full barrier that's part of an Interlocked CAS.
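To make the point concrete, here is a minimal sketch of the pattern under discussion (field and method names are made up; the first read is deliberately a plain, non-volatile read because the CAS supplies the full barrier):

class Counter
{
    private int _total; // deliberately not volatile

    public int Add(int addend)
    {
        int initial, computed;
        do
        {
            // Plain read: no acquire barrier needed here; a stale value
            // just costs one extra trip around the loop.
            initial = _total;
            computed = initial + addend;
        }
        while (Interlocked.CompareExchange(ref _total, computed, initial) != initial);
        return computed;
    }
}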
You could get some insights by studying the source code of the ImmutableInterlocked.Update method:
/// <summary>
/// Mutates a value in-place with optimistic locking transaction semantics
/// via a specified transformation function.
/// The transformation is retried as many times as necessary to win the
/// optimistic locking race.
/// </summary>
public static bool Update<T>(ref T location, Func<T, T> transformer)
    where T : class
{
    Requires.NotNull(transformer, "transformer");

    bool successful;
    T oldValue = Volatile.Read(ref location);
    do
    {
        T newValue = transformer(oldValue);
        if (ReferenceEquals(oldValue, newValue))
        {
            // No change was actually required.
            return false;
        }

        T interlockedResult = Interlocked.CompareExchange(ref location,
            newValue, oldValue);
        successful = ReferenceEquals(oldValue, interlockedResult);
        oldValue = interlockedResult; // we already have a volatile read
                                      // that we can reuse for the next loop
    }
    while (!successful);

    return true;
}
You can see that the method starts by making a volatile read of the location argument. I think that there are two reasons for that:

1. The method has a little twist: it skips the Interlocked.CompareExchange operation in case the new value happens to be the same as the already stored value.

2. The transformer delegate has an unknown computational complexity, so invoking it on a potentially stale value could be much more costly than the cost of the initial Volatile.Read.
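For context, a minimal usage sketch of ImmutableInterlocked.Update (the containing class, field, and method here are hypothetical, not from the original source):

using System.Collections.Immutable;

static class Stats
{
    private static ImmutableList<int> _values = ImmutableList<int>.Empty;

    public static void AddValue(int value)
    {
        // The lambda may run more than once under contention, so it must be
        // a pure function of its input.
        ImmutableInterlocked.Update(ref _values, list => list.Add(value));
    }
}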
It does not matter since Interlocked.CompareExchange inserts memory barriers.
initialValue = totalValue;
At this point totalValue could be anything: a stale value from cache, a value just replaced, who knows. While volatile would prevent reading a cached value, the value might become stale just after it was read, so volatile does not solve anything.
Interlocked.CompareExchange(ref totalValue, computedValue, initialValue)
At this point we have memory barriers that ensure totalValue is up to date. If it is equal to initialValue, then we also know that initialValue was not stale when we started the computation. If it is not equal, we try again, and since we have issued a memory barrier we do not risk getting the same stale value on the next iteration.
Edit:
I find it very unlikely that there would be any performance difference. If there is no contention there is little reason for the value to be stale. If there is high contention the time will be dominated by needing to loop.
I have read a lot of contradictory information (MSDN, SO, etc.) about volatile and VolatileRead (read-acquire fence). I understand the memory-access reordering restrictions they imply; what I'm still completely confused about is the freshness guarantee, which is very important to me.
The MSDN doc for volatile mentions:
(...) This ensures that the most up-to-date value is present in the field at all times.
The MSDN doc for volatile fields mentions:
A read of a volatile field is called a volatile read. A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it in the instruction sequence.
.NET code for VolatileRead is:
public static int VolatileRead(ref int address)
{
    int ret = address;
    MemoryBarrier(); // Call MemoryBarrier to ensure the proper semantic in a portable way.
    return ret;
}
According to the MSDN MemoryBarrier doc, a memory barrier prevents reordering. However, this doesn't seem to have any implications for freshness, correct?
How then can one get a freshness guarantee?
And is there a difference between marking a field volatile and accessing it with VolatileRead and VolatileWrite semantics? I'm currently doing the latter in my performance-critical code that needs to guarantee freshness, yet readers sometimes get a stale value. I'm wondering if marking the field volatile will make the situation different.
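For reference, the two approaches being compared look like this (a sketch with made-up field names):

class State
{
    private volatile int _a; // every access to _a is automatically a volatile read/write
    private int _b;

    public int ReadA() { return _a; }                          // volatile read (acquire semantics)
    public int ReadB() { return Thread.VolatileRead(ref _b); } // explicit volatile read
    public void WriteB(int v) { Thread.VolatileWrite(ref _b, v); }
}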
EDIT1:
What I'm trying to achieve: a guarantee that reader threads will get as recent a value of the shared variable (written by multiple writers) as possible, ideally no older than the cost of a context switch or other operations that may postpone the immediate write of the state.
If volatile or a higher-level construct (e.g. lock) has this guarantee (do they?), then how do they achieve this?
EDIT2:
The very condensed question should have been: how do I get a guarantee of as fresh a value as possible during reads? Ideally without locking (as exclusive access is not needed and there is potential for high contention).
From what I learned here, I'm wondering if this might be the solution (the solving(?) line is marked with a comment):
private SharedState _sharedState;
private SpinLock _spinLock = new SpinLock(false);

public void Update(SharedState newValue)
{
    bool lockTaken = false;
    _spinLock.Enter(ref lockTaken);
    _sharedState = newValue;
    if (lockTaken)
    {
        _spinLock.Exit();
    }
}

public SharedState GetFreshSharedState
{
    get
    {
        Thread.MemoryBarrier(); // <---- This is added to give readers freshness guarantee
        var value = _sharedState;
        Thread.MemoryBarrier();
        return value;
    }
}
The MemoryBarrier call was added to make sure that both reads and writes are wrapped by full fences (the same as lock-based code, as indicated in the 'Memory barriers and locking' section here: http://www.albahari.com/threading/part4.aspx#_The_volatile_keyword).
Does this look correct or is it flawed?
EDIT3:
Thanks to very interesting discussions here I learned quite a few things and I actually was able to distill to the simplified unambiguous question that I have about this topic. It's quite different from the original one so I rather posted a new one here: Memory barrier vs Interlocked impact on memory caches coherency timing
I think this is a good question. But it is also difficult to answer. I am not sure I can give you a definitive answer to your questions. It is not your fault really. It is just that the subject matter is complex and really requires knowing details that might not be feasible to enumerate. Honestly, it really seems like you have educated yourself on the subject quite well already. I have spent a lot of time studying the subject myself and I still do not fully understand everything. Nevertheless, I will still attempt some semblance of an answer here anyway.
So what does it mean for a thread to read a fresh value anyway? Does it mean the value returned by the read is guaranteed to be no older than 100ms, 50ms, or 1ms? Or does it mean the value is the absolute latest? Or does it mean that if two reads occur back-to-back then the second is guaranteed to get a newer value assuming the memory address changed after the first read? Or does it mean something else altogether?
I think you are going to have a hard time getting your readers to work correctly if you are thinking about things in terms of time intervals. Instead think of things in terms of what happens when you chain reads together. To illustrate my point consider how you would implement an interlocked-like operation using arbitrarily complex logic.
public static T InterlockedOperation<T>(ref T location, T operand)
{
    T initial, computed;
    do
    {
        initial = location;
        computed = op(initial, operand); // where op is replaced with a specific implementation
    }
    while (Interlocked.CompareExchange(ref location, computed, initial) != initial);
    return computed;
}
In the code above we can create any interlocked-like operation if we exploit the fact that the second read of location via Interlocked.CompareExchange will be guaranteed to return a newer value if the memory address received a write after the first read. This is because the Interlocked.CompareExchange method generates a memory barrier. If the value has changed between reads then the code spins around the loop repeatedly until location stops changing. This pattern does not require that the code use the latest or freshest value; just a newer value. The distinction is crucial.¹
A lot of lock-free code I have seen works on this principle. That is, the operations are usually wrapped in loops such that the operation is continually retried until it succeeds. It does not assume that the first attempt uses the latest value. Nor does it assume every use of the value is the latest. It only assumes that the value is newer after each read.
Try to rethink how your readers should behave. Try to make them more agnostic about the age of the value. If that is simply not possible and all writes must be captured and processed then you may be forced into a more deterministic approach like placing all writes into a queue and having the readers dequeue them one-by-one. I am sure the ConcurrentQueue class would help in that situation.
If you can reduce the meaning of "fresh" to only "newer" then placing a call to Thread.MemoryBarrier after each read, using Volatile.Read, using the volatile keyword, etc. will absolutely guarantee that one read in a sequence will return a newer value than a previous read.
¹ The ABA problem opens up a new can of worms.
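As a sketch of the guarantee described just before the footnote (the field name is made up):

class Sequencer
{
    private int _state;

    public int ReadNewer()
    {
        // Volatile.Read has acquire semantics; combined with the cache
        // coherency the hardware already provides, successive calls can
        // never observe values moving backwards in time.
        return Volatile.Read(ref _state);
    }
}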
A memory barrier does provide this guarantee. We can derive the "freshness" property that you are looking for from the reordering properties that a barrier guarantees.
By freshness you probably mean that a read returns the value of the most recent write.
Let's say we have these operations, each on a different thread:
x = 1
x = 2
print(x)
How could we possibly print a value other than 2? Without volatile, the read can move one slot upwards and return 1. Volatile prevents such reorderings, though: the write cannot move backwards in time.
In short, volatile guarantees you to see the most recent value.
Strictly speaking I'd need to differentiate between volatile and a memory barrier here. The latter one is a stronger guarantee. I have simplified this discussion because volatile is implemented using memory barriers, at least on x86/x64.
For the case below, when there is no competition for writes between the worker threads, are locks or volatile still required? Is there any difference in the answer if "Peek" access is not required at position G?
class A
{
    Object _o; // need volatile (position A)?
    int _i;    // need volatile (position B)?

    void Method()
    {
        Object o;
        int i;
        Task[] tasks = new Task[2]
        {
            Task.Factory.StartNew(() => {
                _o = f1(); // use lock() (position C)?
                o = f2();  // use lock() (position D)?
            }),
            Task.Factory.StartNew(() => {
                _i = g1(); // use lock() (position E)?
                i = g2();  // use lock() (position F)?
            })
        };

        // "Peek" at _o, _i, o, i (position G)?

        Task.WaitAll(tasks);

        // Use _o, _i, o, i (position H)?
    }
}
The safe thing to do is to not do this in the first place. Don't write a value on one thread and read the value on another thread in the first place. Make a Task<object> and a Task<int> that return the values to the thread that needs them, rather than making tasks that modify variables across threads.
If you are hell bent on writing to variables across threads then you need to guarantee two things. First, that the jitter does not choose optimizations that would cause reads and writes to be moved around in time, and second, that a memory barrier is introduced. The memory barrier limits the processor from moving reads and writes around in time in certain ways.
As Brian Gideon notes in his answer, you get a memory barrier from the WaitAll, but I do not recall offhand if that is a documented guarantee or just an implementation detail.
As I said, I would not do this in the first place. If I were forced to, I would at least make the variables I was writing to marked as volatile.
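A sketch of that first suggestion, reusing the question's placeholder functions f1 and g1:

Task<object> taskO = Task.Factory.StartNew(() => f1());
Task<int> taskI = Task.Factory.StartNew(() => g1());
Task.WaitAll(taskO, taskI);

// Reading .Result after WaitAll is safe: the task machinery provides the
// necessary ordering, so no volatile or lock is required.
object o = taskO.Result;
int i = taskI.Result;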
Writes to reference types (i.e. Object) and word-sized value types (e.g. int on a 32-bit system) are atomic. This means that when you peek at the values (position G) you can be sure that you get either the old value or the new value, but not something else (with a type such as a large struct the write could be torn, and you could read the value when it was halfway through being written). You don't need a lock or volatile, so long as you're willing to accept the potential risk of reading stale values.
Note that because there is no memory barrier introduced at this point (a lock or use of volatile both add one) it's possible that the variable has been updated in the other thread, but the current thread isn't observing that change; it can be reading a "stale" value for (potentially) quite some time after it has been changed in the other thread. The use of volatile will ensure that the current thread can observe changes to the variable sooner.
You can be sure that you'll have the appropriate value after the call to WaitAll, even without a lock or volatile.
Also note that while you can be sure the reference to the reference type is written atomically, your program makes no guarantee about the observed order of any changes to the actual object that the reference refers to. Even if, from the point of view of the background thread, the object is initialized before it is assigned to the instance field, it may not happen in that order. The other thread can therefore observe the write of the reference to the object but then follow that reference and find an object in an uninitialized, or partially initialized, state. Introducing a memory barrier (i.e. through the use of a volatile variable) can potentially prevent the runtime from making such re-orderings, thus ensuring that doesn't happen. This is why it's better to just not do this in the first place and to have the two tasks return the results that they generate rather than manipulating a closed-over variable.
WaitAll will introduce a memory barrier, in addition to ensuring that the two tasks are actually finished, which means that you know that the variables are up-to-date and will not have the old stale values.
At position G you may observe that _o and _i retain their initial values (null and 0, respectively), or they may contain the values written by the tasks. It is unpredictable at this position.
However, at position H you force the issue in two different ways. First, you have guaranteed that both tasks finished and thus the writes are completed. Second, Task.WaitAll will generate a memory barrier which will guarantee that the main thread will observe the new values published by the tasks.
So, in this particular example an explicit lock or memory barrier generator (volatile) is not technically required.
Simplified illustration below: how does .NET deal with such a situation?
And if it would cause problems, would I have to lock/gate access to each and every field/property that might at times be written to and accessed from different threads?
A field somewhere
public class CrossRoads
{
    public int _timeouts;
}
A background thread writer
public void TimeIsUp(CrossRoads crossRoads)
{
    crossRoads._timeouts++;
}
Possibly at the same time, trying to read elsewhere
public void HowManyTimeOuts(CrossRoads crossRoads)
{
    int timeOuts = crossRoads._timeouts;
}
The simple answer is that the above code has the ability to cause problems if accessed simultaneously from multiple threads.
The .NET Framework provides two solutions: interlocking and thread synchronization.
For simple data type manipulation (i.e. ints), interlocking using the Interlocked class will work correctly and is the recommended approach.
In fact, the Interlocked class provides specific methods (Increment and Decrement) that make this process easy:
Add an IncrementCount method to your CrossRoads class:
public void IncrementCount()
{
    Interlocked.Increment(ref _timeouts);
}
Then call this from your background worker:
public void TimeIsUp(CrossRoads crossRoads)
{
    crossRoads.IncrementCount();
}
The read of the value is atomic, unless it is a 64-bit value on a 32-bit OS. See the Interlocked.Read method documentation for more detail.
For class objects or more complex operations, you will need to use thread synchronization locking (lock in C# or SyncLock in VB.Net).
This is accomplished by creating a static synchronization object at the level the lock is to be applied (for example, inside your class), obtaining a lock on that object, and performing (only) the necessary operations inside that lock:
private static object SynchronizationObject = new Object();

public void PerformSomeCriticalWork()
{
    lock (SynchronizationObject)
    {
        // do some critical work
    }
}
The good news is that reads and writes to ints are guaranteed to be atomic, so no torn values. However, it is not guaranteed to do a safe ++, and the read could potentially be cached in registers. There's also the issue of instruction re-ordering.
I would use:
Interlocked.Increment(ref crossroads._timeouts);
For the write, which will ensure no values are lost, and;
int timeouts = Interlocked.CompareExchange(ref crossroads._timeouts, 0, 0);
For the read, since this observes the same rules as the increment. Strictly speaking "volatile" is probably enough for the read, but it is so poorly understood that the Interlocked seems (IMO) safer. Either way, we're avoiding a lock.
Well, I'm not a C# developer, but this is how it typically works at this level:
how does .NET deal with such a situation?
Unlocked. Not likely to be guaranteed to be atomic.
Would I have to lock/gate access to each and every field/property that might at times be written to and accessed from different threads?
Yes. An alternative would be to make a lock for the object available to the clients, then tell the clients they must lock the object while using the instance. This will reduce the number of lock acquisitions, and guarantee a more consistent, predictable state for your clients.
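A sketch of that alternative (the SyncRoot member is a made-up name for the client-visible lock):

public class CrossRoads
{
    public readonly object SyncRoot = new object();
    public int _timeouts;
}

// Clients agree to hold the lock for the duration of their use of the instance:
lock (crossRoads.SyncRoot)
{
    crossRoads._timeouts++;
}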
Forget dotnet. At the machine-language level, crossRoads._timeouts++ will be implemented as an INC [memory] instruction. This is known as a read-modify-write instruction. These instructions are atomic with respect to multi-threading on a single processor* (where threads are essentially implemented with time-slicing), but are not atomic with respect to multi-threading using multiple processors or multiple cores.
So:
If you can guarantee that only TimeIsUp() will ever modify crossRoads._timeouts, and if you can guarantee that only one thread will ever execute TimeIsUp(), then it will be safe to do this. The writing in TimeIsUp() will work fine, and the reading in HowManyTimeOuts() (and any place else) will work fine. But if you also modify crossRoads._timeouts elsewhere, or if you ever spawn one more background thread writer, you will be in trouble.
In either case, my advice would be to play it safe and lock it.
(*) They are atomic with respect to multi-threading on a single processor because context switches between threads happen on a periodic interrupt, and on the x86 architectures these instructions are atomic with respect to interrupts, meaning that if an interrupt occurs while the CPU is executing such an instruction, the interrupt will wait until the instruction completes. This does not hold true with more complex instructions, for example those with the REP prefix.
Although an int may be 'native' size to a CPU (dealing in 32 or 64 bits at a time), if you are reading and writing from different threads to the same variable, you are best off locking this variable and synchronizing access.
There is never a guarantee that reads/writes to an int will be atomic.
You can also use Interlocked.Increment for your purposes here.
I read some articles about the volatile keyword but I could not figure out its correct usage. Could you please tell me what it should be used for in C# and in Java?
Consider this example:
int i = 5;
System.out.println(i);
The compiler may optimize this to just print 5, like this:
System.out.println(5);
However, if there is another thread which can change i, this is the wrong behaviour. If another thread changes i to be 6, the optimized version will still print 5.
The volatile keyword prevents such optimization and caching, and thus is useful when a variable can be changed by another thread.
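In C#, the classic illustration is a polling loop on a flag (a sketch; the class and member names are made up):

class Worker
{
    private volatile bool _stop; // without volatile, the loop below could be
                                 // optimized to test a register-cached copy forever

    public void Poll()
    {
        while (!_stop)
        {
            // do work; _stop is re-read on every iteration
        }
    }

    public void Stop() { _stop = true; } // called from another thread
}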
For both C# and Java, "volatile" tells the compiler that the value of a variable must never be cached as its value may change outside of the scope of the program itself. The compiler will then avoid any optimisations that may result in problems if the variable changes "outside of its control".
Reads of volatile fields have acquire semantics. This means that it is guaranteed that the memory read from the volatile variable will occur before any following memory reads. It blocks the compiler from doing the reordering, and if the hardware requires it (a weakly ordered CPU), it will use a special instruction to make the hardware flush any reads that occur after the volatile read but were speculatively started early; alternatively, the CPU could prevent them from being issued early in the first place, by preventing any speculative load from occurring between the issue of the load-acquire and its retirement.
Writes of volatile fields have release semantics. This means that any memory writes to the volatile variable are guaranteed to be delayed until all previous memory writes are visible to other processors.
Consider the following example:
something.foo = new Thing();
If foo is a member variable in a class, and other CPUs have access to the object instance referred to by something, they might see the value foo change before the memory writes in the Thing constructor are globally visible! This is what "weakly ordered memory" means. This could occur even if the compiler has all of the stores in the constructor before the store to foo. If foo is volatile then the store to foo will have release semantics, and the hardware guarantees that all of the writes before the write to foo are visible to other processors before allowing the write to foo to occur.
How is it possible for the writes to foo to be reordered so badly? If the cache line holding foo is in the cache, and the stores in the constructor missed the cache, then it is possible for the store to complete much sooner than the writes to the cache misses.
The (awful) Itanium architecture from Intel had weakly ordered memory. The processor used in the original XBox 360 had weakly ordered memory. Many ARM processors, including the very popular ARMv7-A have weakly ordered memory.
Developers often don't see these data races because things like locks will do a full memory barrier, essentially the same thing as acquire and release semantics at the same time. No loads inside the lock can be speculatively executed before the lock is acquired, they are delayed until the lock is acquired. No stores can be delayed across a lock release, the instruction that releases the lock is delayed until all of the writes done inside the lock are globally visible.
A more complete example is the "Double-checked locking" pattern. The purpose of this pattern is to avoid having to always acquire a lock in order to lazy initialize an object.
Snagged from Wikipedia:
public class MySingleton
{
    private static object myLock = new object();
    private static volatile MySingleton mySingleton = null;

    private MySingleton()
    {
    }

    public static MySingleton GetInstance()
    {
        if (mySingleton == null) // 1st check
        {
            lock (myLock)
            {
                if (mySingleton == null) // 2nd (double) check
                {
                    mySingleton = new MySingleton();
                    // Write-release semantics are implicitly handled by marking
                    // mySingleton with 'volatile', which inserts the necessary memory
                    // barriers between the constructor call and the write to mySingleton.
                    // The barriers created by the lock are not sufficient because
                    // the object is made visible before the lock is released.
                }
            }
        }

        // The barriers created by the lock are not sufficient because not all threads
        // will acquire the lock. A fence for read-acquire semantics is needed between
        // the test of mySingleton (above) and the use of its contents. This fence
        // is automatically inserted because mySingleton is marked as 'volatile'.
        return mySingleton;
    }
}
In this example, the stores in the MySingleton constructor might not be visible to other processors before the store to mySingleton. If that happens, the other threads that peek at mySingleton will not acquire a lock and they will not necessarily pick up the writes to the constructor.
volatile never prevents caching. What it does is guarantee the order in which other processors "see" writes. A store release will delay a store until all pending writes are complete and a bus cycle has been issued telling other processors to discard/writeback their cache line if they happen to have the relevant lines cached. A load acquire will flush any speculated reads, ensuring that they won't be stale values from the past.
To understand what volatile does to a variable, it's important to understand what happens when the variable is not volatile.
Variable is Non-volatile
When two threads A and B are accessing a non-volatile variable, each thread may maintain a local copy of the variable in its local cache. Any changes done by thread A in its local cache won't be visible to thread B.
Variable is volatile
When variables are declared volatile, it essentially means that threads should not cache such a variable; in other words, threads should not trust the values of these variables unless they are read directly from main memory.
So, when to make a variable volatile?
When you have a variable which can be accessed by many threads and you want every thread to get the latest updated value of that variable even if the value is updated by any other thread/process/outside of the program.
The volatile keyword has different meanings in both Java and C#.
Java
From the Java Language Spec :
A field may be declared volatile, in which case the Java memory model ensures that all threads see a consistent value for the variable.
C#
From the C# Reference (retrieved 2021-03-31):
The volatile keyword indicates that a field might be modified by multiple threads that are executing at the same time. The compiler, the runtime system, and even hardware may rearrange reads and writes to memory locations for performance reasons. Fields that are declared volatile are not subject to these optimizations. (...)
In Java, "volatile" is used to tell the JVM that the variable may be used by multiple threads at the same time, so certain common optimizations cannot be applied.
Notably in the situation where the two threads accessing the same variable are running on separate CPUs in the same machine. It is very common for CPUs to aggressively cache the data they hold, because memory access is very much slower than cache access. This means that if the data is updated on CPU1, it must immediately go through all caches and out to main memory, instead of waiting for the cache to decide to flush itself, so that CPU2 can see the updated value (again bypassing all caches on the way).
When you are reading data that is non-volatile, the executing thread may or may not always get the updated value.
But if the object is volatile, the thread always gets the most up-to-date value.
Volatile addresses a concurrency problem: it keeps a value in sync across threads. The keyword is mostly used in threading scenarios where multiple threads update the same variable.