I was recently reading about the Compare And Swap atomic action (CMPXCHG, .NET's Interlocked.CompareExchange, whatever).
I understand how it works internally, and how it's used from a client.
What I can't quite figure out is when would someone use CAS?
Wikipedia says:
CAS is used for implementing synchronization primitives like
semaphores and mutexes, likewise more sophisticated lock-free and
wait-free algorithms.
So, can anyone give me a more generic real-world use case with code and description of CAS usage?
This question is meant to be language-agnostic, so any language will do (C-based or x86 assembly preferred).
Thanks!
This is easy to see by example. Say we want to atomically and concurrently set a bit on a shared variable:
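int shared = 0;
void Set(int index) {
    while (true) {
        int observed = shared;                  // read the shared value exactly once
        int desired = observed | (1 << index);  // compute the new value from that snapshot
        // CompareExchange stores 'desired' only if 'shared' still equals 'observed'
        if (Interlocked.CompareExchange(ref shared, desired, observed) == observed)
            break; //success
    }
}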
We detect failure if we see that the "old value" (which is the return value) has changed in the meantime.
If this did not happen we did not have a concurrent modification so our own modification went through successfully.
You can realize pretty complex stuff using this technique. The more complex the more performance loss through spinning, though.
I want to emphasize that a key property of CAS is that it can fail and that failure can be detected reliably.
You use CAS to set a value (a bit or a word) atomically in one thread or process, while testing that another thread/process has not already done so. So it's used to acquire a flag or counter in a multi-threaded environment.
Addendum (Feb 2023)
For example, multiple threads could each use a CAS instruction to swap their own ID (a thread ID or process ID) into a shared word of memory, which starts out holding a value of zero. The first thread that gets its ID stored into the word can then take ownership of whatever resource that shared word is guarding.
When the owning thread is done with the resource, it stores a zero into the word, releasing ownership of the resource and allowing other threads their turn to acquire it.
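The ownership scheme described above could be sketched in C# roughly as follows. This is only an illustration: the class name OwnedResource and its methods are made up for the example, and callers are assumed to pass a non-zero id. Release uses a CAS as well, so that only the current owner can clear the word; a plain store of zero, as described above, would also work if callers are trusted.

```csharp
using System.Threading;

// Hypothetical helper: one shared word guards a resource (0 = free).
public sealed class OwnedResource
{
    private int owner; // 0 means nobody owns the resource

    // Tries to claim the resource for the given non-zero id.
    // Succeeds only if the word still held 0 at the moment of the swap.
    public bool TryAcquire(int id)
    {
        return Interlocked.CompareExchange(ref owner, id, 0) == 0;
    }

    // Clears the word, but only if 'id' is the current owner.
    public bool Release(int id)
    {
        return Interlocked.CompareExchange(ref owner, 0, id) == id;
    }
}
```

A thread that fails TryAcquire can retry later or do other work; nothing blocks, which is the point of using CAS here.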
So, can anyone give me a more generic real-world use case with code and description of CAS usage?
This paper uses CAS to implement a thread safe queue without locks.
It has some pseudo code examples in it.
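To give a feel for what such lock-free structures look like, here is a minimal sketch of a lock-free stack (often called a Treiber stack). It is not the queue from the paper, just a simpler relative that uses the same CAS retry idea. The names LockFreeStack, Push and TryPop are chosen for the example, and it relies on the .NET garbage collector so that nodes are never reused while another thread still holds a reference to them.

```csharp
using System.Threading;

// Hypothetical example type: a lock-free stack built on CAS.
public sealed class LockFreeStack<T>
{
    private sealed class Node
    {
        public readonly T Item;
        public Node Next;
        public Node(T item) { Item = item; }
    }

    private Node head; // null when the stack is empty

    public void Push(T item)
    {
        var node = new Node(item);
        while (true)
        {
            Node observed = Volatile.Read(ref head);
            node.Next = observed;
            // Install the new node only if head has not changed since we read it.
            if (Interlocked.CompareExchange(ref head, node, observed) == observed)
                return;
        }
    }

    public bool TryPop(out T item)
    {
        while (true)
        {
            Node observed = Volatile.Read(ref head);
            if (observed == null)
            {
                item = default(T);
                return false;
            }
            // Unlink the top node only if it is still the top.
            if (Interlocked.CompareExchange(ref head, observed.Next, observed) == observed)
            {
                item = observed.Item;
                return true;
            }
        }
    }
}
```

Every failed CompareExchange means another thread won the race, so the loop re-reads head and tries again.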
Related
I've been wondering recently how lock (or more specific: Monitor) works internally in .NET with regards to the objects that are locked. Specifically, I'm wondering what the overhead is, if there are 'global' (Process) locks used, if it's possible to create more of those global locks if that's the case (for groups of monitors) and what happens to the objects that are passed to lock (they don't seem to introduce an extra memory overhead).
To clarify what I'm not asking about: I'm not asking here about what a Monitor is (I made one myself at University some time ago). I'm also not asking how to use lock, Monitor, how they compile to a try/finally, etc; I'm pretty well aware of that (and there are other SO questions related to that). This is about the inner workings of Monitor.Enter and Monitor.Exit.
For example, consider this code executed by ten threads:
for (int i=0; i<1000; ++i)
{
    lock (myArray[i])
    {
        // ...
    }
}
Is it bad to lock a thousand objects instead of one? What is the impact in terms of performance and memory pressure?
The underlying monitor creates a wait queue. Is it possible to have more than one wait queue and how would I create that?
Monitor.Enter is not a normal .NET method (can't be decompiled with ILSpy or similar). The method is implemented internally by the CLR, so strictly speaking, there is no one answer for .NET as different runtimes can have different implementations.
All objects in .NET have an object header containing a pointer to the type of the object, but also a SyncBlock index into a SyncTableEntry. Normally that index is zero (unused), but when you lock on the object it will contain an index into the SyncTableEntry, which then contains the reference to the actual lock object.
So locking of thousands of objects will indeed create a lot of locks which is an overhead.
The information I found was in this MSDN article: http://msdn.microsoft.com/en-us/magazine/cc163791.aspx
Here's a good place to read about monitors, memory barriers etc.
The problem with the below class is when reading myThreadSafe.Value it may not return the most up-to-date value.
public class ThreadSafe
{
    private int value;
    public int Value { get { return value; } }
    public void Update()
    {
        Interlocked.Add(ref value, 47); // UPDATE: use interlocked to not distract from the question being asked.
    }
}
I realise I could lock when reading it and writing it:
public int Value { get { lock(locker) return value; } }
public void Update()
{
    lock(locker)
    {
        value += 47;
    }
}
And I have followed this pattern of always using locks. However, I am trying to reduce the number of locks in my code (there are many and they are called frequently; I have profiled, and Monitor.Enter() is taking up more time than I would like because it is called so many times).
UPDATE: I now wonder whether the lock will indeed make any difference in ensuring I am reading the most up-to-date value. Couldn't it still come from one of the machine's CPU caches? (All the lock guarantees is mutually exclusive thread access.)
I thought volatile would be the answer. MSDN does say: "This ensures that the most up-to-date value is present in the field at all times". However, I read elsewhere that a write followed by a read can still be swapped at the CPU-instruction level when using volatile, in which case I could get a previous value for myThreadSafe.Value. Perhaps I could live with that, only being out by one update.
What is the most efficient way I will always get the most up-to-date value for myThreadSafe.Value?
UPDATE: This code will be compiled and run on CPU Architectures:
x86
AMD64 (though I can build as x86)
PowerPC
ARM (Little-endian only)
Using the runtimes:
CLR v4.0
Mono (I am not sure of the mono runtime versions but if they correspond to the Mono versions: 3.0 at least).
I am hoping to use the same code for all builds!
OK, I believe I found the answer and my concern is vindicated!
The code happens to be thread-safe on x86 and AMD64 because they invalidate a CPU's cache when the variable is written to, causing subsequent reads to read the variable from memory. To quote Shafqay Ahmed quoting Jeffrey Richter:
Since two processors can have different caches, which are copies of the ram, they can have different values. In x86 and x64 processors (according to Jeffrey’s book) are designed to sync the caches of different processors so we may not see the problem.
Incidentally using lock and Interlocked flushes the variable from cache, so using lock when reading the property would have been safe. From http://blogs.msdn.com/b/ericlippert/archive/2011/06/16/atomicity-volatility-and-immutability-are-different-part-three.aspx:
Locks guarantee that memory read or modified inside the lock is observed to be consistent, locks guarantee that only one thread accesses a given hunk of memory at a time, and so on.
However, the CLR specification gives no guarantee that a value updated by another thread will be the most recent when it is read without locking synchronization constructs. Indeed, on ARM I could well get an old value using the ThreadSafe class as it is. From http://msdn.microsoft.com/en-us/magazine/jj553518.aspx:
If your code relies on lock-free algorithms that depend on the implementation of the x86 CLR (rather than the ECMA CLR specification), you’ll want to add the volatile keyword to relevant variables as appropriate. Once you’ve marked shared state as volatile, the CLR will take care of everything for you. If you’re like most developers, you’re ready to run on ARM because you’ve already used locks to protect your shared data, properly marked volatile variables and tested your app on ARM.
So it seems the answer is that I can use a lock when reading or make my field volatile, though perhaps I should use lock and try to reduce the number of calls. As a man who worked on the compiler says:
The number of situations in which a lock is too slow is very small, and the probability that you are going to get the code wrong because you don't understand the exact memory model is very large. I don't attempt to write any low-lock code except for the most trivial usages of Interlocked operations. I leave the usage of "volatile" to real experts.
I'm not certain what you mean by "most up to date value". You can use locks to ensure that you don't read Value at the same time it is being written to, which may yield some oddities, but if you read it then write to it, you won't have the most up to date value.
To handle the oddities I referred to, you can use locks as you have done. But you seem to desire a different solution. If you don't want to lock the read, but you want to ensure that the write is atomic such that the read won't return an odd number or some other messy thing when doing a read during a multithreaded write, then I would recommend using the Interlocked class.
Simply:
Interlocked.Add(ref value, 47);
More Interlocked functions can be found at http://msdn.microsoft.com/en-us/library/system.threading.interlocked(v=vs.110).aspx
These functions are great when working with primitives. With more complicated objects, other solutions like ReaderWriterLockSlim and others will be needed.
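As a rough sketch of how this could look for the class in the question: writes go through Interlocked.Add, and the getter uses Volatile.Read so the read is not served from a stale register copy. The class name ThreadSafeCounter is made up here, and Volatile.Read assumes a runtime that offers the .NET 4.5 API surface, so treat it as an illustration rather than a verified answer for every target listed in the question.

```csharp
using System.Threading;

// Hypothetical variant of the ThreadSafe class from the question.
public class ThreadSafeCounter
{
    private int value;

    // Volatile.Read has acquire semantics and forces a real read of the field.
    public int Value
    {
        get { return Volatile.Read(ref value); }
    }

    // Interlocked.Add performs the read-modify-write as one atomic step.
    public void Update()
    {
        Interlocked.Add(ref value, 47);
    }
}
```

No lock is taken on either path, but this only works because the shared state is a single int; anything larger needs a different approach.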
I'm using such configuration:
.NET framework 4.5
Windows Server 2008 R2
HP DL360p Gen8 (2 * Xeon E5-2640, x64)
I have such field somewhere in my program:
protected int HedgeVolume;
I access this field from several threads. I assume that, as I have a multi-processor system, these threads may be executing on different processors.
What should I do to guarantee that any time I use this field the most recent value is "read"? And to make sure that when I "write" a value it becomes available to all other threads immediately?
What should I do?
just leave field as is.
declare it volatile
use Interlocked class to access the field
use .NET 4.5 Volatile.Read, Volatile.Write methods to access the field
use lock
I only need the simplest way to make my program work on this configuration; I don't need my program to work on other computers, servers or operating systems. I also want minimal latency, so I'm looking for the fastest solution that will always work on this standard configuration (multiprocessor Intel x64, .NET 4.5).
Your question is missing one key element... How important is the integrity of the data in that field?
volatile gives you performance, but if a thread is currently writing changes to the field, you won't get that data until it's done, so you might read out-of-date information and potentially overwrite changes another thread is making. If the data is sensitive, you might get bugs that are very hard to track down. However, if you are doing very quick updates, overwrite the value without reading it, and don't care that once in a while you get outdated (by a few ms) data, go for it.
lock guarantees that only one thread can access the field at a time. You can put it only on the methods that write the field and leave the reading method alone. The downside is that it is slow and may block a thread while another is performing its task. However, you are sure your data stays valid.
Interlocked exists to shield you from the scheduler's context switches. My opinion? Don't use it unless you know exactly why you would be using it and exactly how to use it. It gives you options, but with great options come great problems. It makes a single update to a variable atomic, so a context switch cannot leave that update half-done. It might not do what you think it does, and it won't prevent parallel threads from performing their tasks simultaneously.
You want to use Volatile.Read().
As you are running on x86, all writes in C# are the equivalent of Volatile.Write(); you would only need to use this on Itanium.
Volatile.Read() will ensure that you get the latest copy regardless of which thread last wrote it.
There is a fantastic write up here, C# Memory Model Explained
Summary of it includes,
On some processors, not only must the compiler avoid certain
optimizations on volatile reads and writes, it also has to use special
instructions. On a multi-core machine, different cores have different
caches. The processors may not bother to keep those caches coherent by
default, and special instructions may be needed to flush and refresh
the caches.
Hopefully that much is obvious: besides the need for volatile to stop the compiler from optimising the access away, the processor has to be considered as well.
However, in C# all writes are volatile (unlike say in Java),
regardless of whether you write to a volatile or a non-volatile field.
So, the above situation actually never happens in C#. A volatile write
updates the thread’s cache, and then flushes the entire cache to main
memory.
You do not need Volatile.Write(). A more authoritative source is here: Joe Duffy, CLR Memory Model. However, you may need it to stop the compiler reordering the write.
Since all C# writes are volatile, you can think of all writes as going
straight to main memory. A regular, non-volatile read can read the
value from the thread’s cache, rather than from main memory.
You need Volatile.Read()
When you start designing a concurrent program, you should consider these options in order of preference:
1) Isolation: each thread has its own private data
2) Immutability: threads can see shared state, but it never changes
3) Mutable shared state: protect all access to shared state with locks
If you get to (3), then how fast do you actually need this to be?
Acquiring an uncontested lock takes on the order of 10 ns (10^-8 seconds) - that's fast enough for most applications and is the easiest way to guarantee correctness.
Using any of the other options you mention takes you into the realm of low-lock programming, which is insanely difficult to get correct.
If you want to learn how to write concurrent software, you should read these:
Intro: Joe Albahari's free e-book - will take about a day to read
Bible: Joe Duffy's "Concurrent Programming on Windows" - will take about a month to read
Depends what you DO. For reading only, volatile is easiest; Interlocked allows a little more control. Lock is unnecessary as it is more granular than the problem you describe. Not sure about Volatile.Read/Write, never used them.
volatile - bad, there are some issues (see Joe Duffy's blog)
if all you do is read the value or unconditionally write a value - use Volatile.Read and Volatile.Write
if you need to read and subsequently write an updated value - use the lock syntax. You can, however, achieve the same effect without lock using the Interlocked class's functionality, but this is more complex (it involves CompareExchange calls to ensure that you are updating the value you read, i.e. that it has not been modified since the read operation, plus logic to retry if the value was modified since the read).
From this I understand that you want to be able to read the last value that was written to a field. Let's make an analogy with the SQL data concurrency problem. If you want to be able to read the last value of a field, you must make the instructions atomic. If someone is writing a field, all of the threads must be locked out of reading until that thread has finished the writing transaction. After that, every read on that thread will be safe. The problem is not so much with reading as it is with writing. A lock on that field whenever it's written should be enough if you ask me ...
First have a look here: Volatile vs. Interlocked vs. lock
The volatile modifier surely is a good option for a multi-core CPU.
But is this enough? It depends on how you calculate the new HedgeVolume value!
If your new HedgeVolume does not depend on the current HedgeVolume, then you're done with volatile.
But if HedgeVolume[x] = f(HedgeVolume[x-1]), then you need some thread synchronisation to guarantee that HedgeVolume doesn't change while you calculate and assign the new value. Both lock and Interlocked scenarios would be suitable in this case.
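For the Interlocked scenario, the dependent update can be written as a compare-and-swap retry loop. The sketch below is only one possible shape: AtomicUpdate.Apply is a made-up helper name, and f is assumed to be a cheap, side-effect-free function, because it may run more than once when threads collide.

```csharp
using System;
using System.Threading;

// Hypothetical helper for "new value = f(old value)" without a lock.
public static class AtomicUpdate
{
    // Applies f to the current value of 'location' and stores the result,
    // retrying whenever another thread changed the value in between.
    public static int Apply(ref int location, Func<int, int> f)
    {
        while (true)
        {
            int observed = Volatile.Read(ref location);
            int desired = f(observed);
            // The store happens only if 'location' still equals 'observed'.
            if (Interlocked.CompareExchange(ref location, desired, observed) == observed)
                return desired;
        }
    }
}
```

A call would look like AtomicUpdate.Apply(ref HedgeVolume, v => v + delta), where delta is whatever the caller wants to add. For a plain addition Interlocked.Add is simpler; the loop is only needed when f is something Interlocked does not offer directly.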
I had a similar question and found this article to be extremely helpful. It's a very long read, but I learned a LOT!
When I say atomic, I mean that a set of instructions will execute without any context switch to another thread in the same process (other kinds of switches have to happen, of course). The only solution I came up with is to suspend all threads except the currently executing one before the section and resume them after it. Is there a more elegant way?
The reason I want to do that is to collect a coherent state of objects running on multiple threads. However, their code cannot be changed (they're already compiled), so I cannot insert mutexes, semaphores, etc in it. The atomic operation is of course state collecting (i.e. copying some variables).
There are some atomic operations in the Interlocked class but it only provides a few very simple operations. It can't be used to create an entire atomic block of code.
I'd advise using locking carefully to make sure that your code will still work even if the context changes.
Well, you can use locks, but you can't prevent context switching exactly. But if your threads lock on the same object, then the threads waiting obviously won't be running, so there's no context switching involved since there's nothing to run.
You might want to look at this page too.
No. You can surround a block of code with a Monitor to make it thread-safe, but you cannot make general code snippets atomic.
object lck = new object();
lock(lck)
{
    // thread safe code goes in here
}
No, that's against multi-tasking.
Except for very simple operations like incrementing ... which are not the subject of your question.
It is possible to obtain a global state from a shared memory composed of a collection (array) of atomic one reader/multi writer registers. The solution is simple but not trivial. You can read the algorithm published in the paper "Atomic Snapshots of Shared Memory", or you can read chapter 4 of the book The Art of Multiprocessor Programming; there you can get ideas on the implementation in the Java language. Of course, once you are familiar with the idea you should be able to port it to any other language. Sorry if my English is not good enough.
I've been reading Joe Duffy's book on Concurrent programming. I have kind of an academic question about lockless threading.
First: I know that lockless threading is fraught with peril (if you don't believe me, read the sections in the book about the memory model).
Nevertheless, I have a question:
Suppose I have a class with an int property on it.
The value referenced by this property will be read very frequently by multiple threads
It is extremely rare that the value will change, and when it does it will be a single thread that changes it.
If it does change while another operation that uses it is in flight, no one is going to lose a finger (the first thing anyone using it does is copy it to a local variable)
I could use locks (or a readerwriterlockslim to keep the reads concurrent).
I could mark the variable volatile (lots of examples where this is done)
However, even volatile can impose a performance hit.
What if I use VolatileWrite when it changes, and leave the access normal for reads. Something like this:
public class MyClass
{
    private int _TheProperty;
    internal int TheProperty
    {
        get { return _TheProperty; }
        set { System.Threading.Thread.VolatileWrite(ref _TheProperty, value); }
    }
}
I don't think that I would ever try this in real life, but I'm curious about the answer (more than anything, as a checkpoint of whether I understand the memory model stuff I've been reading).
Marking a variable as "volatile" has two effects.
1) Reads and writes have acquire and release semantics, so that reads and writes of other memory locations will not "move forwards and backwards in time" with respect to reads and writes of this memory location. (This is a simplification, but you take my point.)
2) The code generated by the jitter will not "cache" a value that seems to logically be unchanging.
Whether the former point is relevant in your scenario, I don't know; you've only described one memory location. Whether or not it is important that you have only volatile writes but not volatile reads is something that is up to you to decide.
But it seems to me that the latter point is quite relevant. If you have a spin lock on a non-volatile variable:
while(this.prop == 0) {}
the jitter is within its rights to generate this code as though you'd written
if (this.prop == 0) { while (true) {} }
Whether it actually does so or not, I don't know, but it has the right to. If what you want is for the code to actually re-check the property on each go round the loop, marking it as volatile is the right way to go.
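As an illustration of a loop that really does re-check on every pass, here is a small sketch. It uses Volatile.Read and Volatile.Write (available from .NET 4.5) instead of the volatile keyword, which is a swap of technique with the same intent: the read cannot be hoisted out of the loop. The SpinFlag class and its member names are invented for the example.

```csharp
using System.Threading;

// Hypothetical example: one thread publishes a value, another spins until it sees it.
public class SpinFlag
{
    private int prop; // stays 0 until a non-zero value is published

    public void Publish(int v)
    {
        Volatile.Write(ref prop, v);
    }

    // Spins until a non-zero value is observed, then returns it.
    public int WaitForValue()
    {
        int seen;
        // Volatile.Read forces a fresh read of the field on each iteration.
        while ((seen = Volatile.Read(ref prop)) == 0)
        {
            Thread.Yield(); // let other threads run while we wait
        }
        return seen;
    }
}
```

With a plain field read in the loop condition, the jitter would be free to turn this into the "check once, then loop forever" form shown above.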
The question is whether the reading thread will ever see the change. It's not just a matter of whether it sees it immediately.
Frankly I've given up on trying to understand volatility - I know it doesn't mean quite what I thought it used to... but I also know that with no kind of memory barrier on the reading thread, you could be reading the same old data forever.
The "performance hit" of volatile is because the compiler now generates code to actually check the value instead of optimizing that away - in other words, you'll have to take that performance hit regardless of what you do.
At the CPU level, yes every processor will eventually see the change to the memory address. Even without locks or memory barriers. Locks and barriers would just ensure that it all happened in a relative ordering (w.r.t other instructions) such that it appeared correct to your program.
The problem isn't cache-coherency (I hope Joe Duffy's book doesn't make that mistake). The caches stay coherent - it is just that this takes time, and the processors don't bother to wait for that to happen - unless you enforce it. So instead, the processor moves on to the next instruction, which may or may not end up happening before the previous one (because each memory read/write may take a different amount of time. Ironically because of the time for the processors to agree on coherency, etc. - this causes some cachelines to be coherent faster than others (ie depending on whether the line was Modified, Exclusive, Shared, or Invalid it takes more or less work to get into the necessary state).)
So a read may appear old or from an out of date cache, but really it just happened earlier than expected (typically because of look-ahead and branch prediction). When it really was read, the cache was coherent, it has just changed since then. So the value wasn't old when you read it, but it is now when you need it. You just read it too soon. :-(
Or equivalently, it was written later than the logic of your code thought it would be written.
Or both.
Anyhow, if this was C/C++, even without locks/barriers, you would eventually get the updated values. (within a few hundred cycles typically, as memory takes about that long). In C/C++ you could use volatile (the weak non-thread volatile) to ensure that the value wasn't read from a register. (Now there's a non-coherent cache! ie the registers)
In C# I don't know enough about CLR to know how long a value could stay in a register, nor how to ensure you get a real re-read from memory. You've lost the 'weak' volatile.
I would suspect as long as the variable access doesn't completely get compiled away, you will eventually run out of registers (x86 doesn't have many to start with) and get your re-read.
But no guarantees that I see. If you could limit your volatile-read to a particular point in your code that was often, but not too often (ie start of next task in a while(things_to_do) loop) then that might be the best you can do.
This is the pattern I use when the 'last writer wins' pattern is applicable to the situation. I had used the volatile keyword, but after seeing this pattern in a code example from Jeffrey Richter, I started using it.
For normal things (like memory-mapped devices), the cache-coherency protocols going on within/between the CPU/CPUs are there to ensure that different threads sharing that memory get a consistent view of things (i.e., if I change the value of a memory location in one CPU, it will be seen by other CPUs that have the memory in their caches). In this regard volatile will help to ensure that the optimizer doesn't optimize away memory accesses (which are always going through cache anyway) by, say, reading the value cached in a register. The C# documentation seems pretty clear on this. Again, the application programmer doesn't generally have to deal with cache-coherency themselves.
I highly recommend reading the freely available paper "What Every Programmer Should Know About Memory". A lot of magic goes on under the hood that mostly prevents shooting oneself in the foot.
In C#, reads and writes of the int type are atomic.
Since you said that only one thread writes to it, you should never have contention as to what is the proper value, and as long as you are caching a local copy, you should never get dirty data.
You may, however, want to declare it volatile if an OS thread will be doing the update.
Also keep in mind that some operations are not atomic, and can cause problems if you have more than one writer. For example, even though the bool type won't corrupt if you have more than one writer, a statement like this:
a = !a;
is not atomic. If two threads read at the same time, you have a race condition.
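If a toggle like that has to be safe with several writers, one option is a compare-and-swap loop. The sketch below is an assumption-laden illustration: AtomicToggle is a made-up class, and an int (0 or 1) stands in for the bool because the Interlocked overloads used here operate on int.

```csharp
using System.Threading;

// Hypothetical example: an atomically togglable flag.
public class AtomicToggle
{
    private int state; // 0 = false, 1 = true

    public bool Value
    {
        get { return Volatile.Read(ref state) != 0; }
    }

    // Flips the flag atomically and returns the new value.
    public bool Toggle()
    {
        while (true)
        {
            int observed = Volatile.Read(ref state);
            int desired = observed == 0 ? 1 : 0;
            // Retry if another thread toggled between our read and our write.
            if (Interlocked.CompareExchange(ref state, desired, observed) == observed)
                return desired != 0;
        }
    }
}
```

With this, two threads toggling at the same time always produce two flips, whereas two racing a = !a statements can both read the same old value and produce only one.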