The documentation for Volatile.Write says the following:
Writes the specified object reference to the specified field. On
systems that require it, inserts a memory barrier that prevents the
processor from reordering memory operations as follows: If a read or
write appears before this method in the code, the processor cannot
move it after this method.
and
value T
The object reference to write. The reference is written
immediately so that it is visible to all processors in the computer.
But it seems like quotes 1 and 2 are contradictory.
For the second quote to be true, I would think that the first quote would have to be changed as follows:
If a read or write appears after this method in the code, the processor cannot move it before this method.
(That is, "before" replaced with "after", and "after" replaced with "before".)
Does Volatile.Write actually mean that other threads are guaranteed to pick up the write in a timely fashion, or is the second quote misleading?
It seems to me that all these "Volatile"/"memory barrier" features focus only on ensuring that if writes are exposed to other threads, they are exposed in the correct order, but I can't find what would actually force them to be exposed.
I understand that it may be hard/impossible to expose writes to other threads immediately, but without volatile writes/reads there are cases when the writes are exposed never. So it seems there must be a way to ensure that writes are exposed "eventually", but I'm unsure what that is. Is it that writes are always exposed in .NET but reads can be cached? And if so does Volatile.Read stop this caching behaviour?
(Note I have read through Joseph Albahari's Threading in C# which tends to suggest I need explicit memory barriers before my reads and after my writes, although it's not clear why even that should be effective as the documentation for Thread.MemoryBarrier doesn't seem to explicitly say that the writes are shown to other threads).
You are misunderstanding the concept of barriers a little bit. As you wrote
The object reference to write. The reference is written immediately so that it is visible to all processors in the computer.
So the really important unit here is a processor, not thread.
So, there are processors, processor caches, store buffers and invalidation queues involved.
When a processor writes something into the memory, it looks like that or similar to that
The subject is at the store buffer level. As you can see, a lot is going on when you write or read something, and it does not happen instantly for all the processors in the system. At the beginning, a read or write command is placed into the processor's store buffer, and those commands can be reordered, in other words, executed in a different order by the processor.
While that happens, other processors don't know about the changes (if the operation is a write), and the currently working processor doesn't know about changes other processors have made.
When you place a barrier, the operations in the store buffer or invalidation queue must be completed before any further read or write can be performed. That is necessary to bring CPU caches up to date across processors. So there is basically no mechanism to synchronize data across threads; we are syncing data across processors.
When thread A writes something on processor 1 and thread B reads something on processor 1, they both start by looking into the store buffer first, so they read the current data whether or not any barriers are placed.
It's just an overview of the mechanic involved, maybe I'm wrong in some details. You can find complete info if you read about MESI protocol, this PDF with explanation on invalidation queues and store buffers
I agree with you that the description in the MSDN documentation is a bit confusing. I would say that "immediately" is a strong word here, as it is for any subject related to parallel processes. The result won't be visible immediately, but the documentation doesn't say that. It says that the value will be written immediately, that is, as soon as the results of all prior load/store operations become globally visible, the store operation that writes the value is initiated.
As for memory barriers, they can only guarantee the exposure (global visibility) of prior operations, because in essence a memory barrier is an instruction that, when encountered by a CPU, makes the CPU "wait" until all pending load/store operations are globally visible. The moment at which the value written by Volatile.Write itself becomes globally visible is the concern of neither the barrier nor Volatile.Write.
Now about the suggestion to use barriers in lock-free programming. Of course it makes sense, because it ensures the order of global visibility, which matters on multi-core systems. When you cannot be sure that event B always happens after event A, you just can't build reliable logic meant to run in multi-core environments.
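To make the ordering guarantee discussed above concrete, here is a minimal sketch. The names OneShotBox, Publish and TryRead are made up for illustration and are not part of any library. It shows only the ordering contract: if the reader observes the flag, it also observes the payload written before it. It says nothing about how quickly the reader will observe the flag.

```csharp
using System.Threading;

// Hypothetical one-shot mailbox illustrating release/acquire ordering.
public sealed class OneShotBox
{
    private int _payload;   // plain field, written before the flag
    private bool _ready;    // flag, accessed only through Volatile.*

    // Writer side: the payload store cannot move after the flag store.
    public void Publish(int value)
    {
        _payload = value;
        Volatile.Write(ref _ready, true);
    }

    // Reader side: the payload load cannot move before the flag load.
    public bool TryRead(out int value)
    {
        if (Volatile.Read(ref _ready))
        {
            value = _payload;
            return true;
        }
        value = 0;
        return false;
    }
}
```

A reader would typically call TryRead in a loop or on each pass of its own work cycle; without the Volatile calls, the payload could in principle be observed stale even after the flag is seen as set.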
I have a relatively simple case where:
My program will be receiving updates via WebSockets, and will be using these updates to update its local state. These updates will be very small (usually about 1-1000 bytes of JSON, so < 1 ms to deserialize) but will be very frequent (up to ~1000/s).
At the same time, the program will be reading/evaluating from this local state and outputs its results.
Both of these tasks should run in parallel and will run for the duration of the program, i.e. never stop.
Local state size is relatively small, so memory usage isn't a big concern.
The tricky part is that updates need to happen "atomically", so that it does not read from a local state that has for example, written only half of an update. The state is not constrained to using primitives and could contain arbitrary classes AFAICT atm, so I cannot solve it by something simple like using Interlocked atomic operations. I plan on running each task on its own thread, so a total of two threads in this case.
To achieve this goal I thought to use a double buffer technique, where:
It keeps two copies of the state so one can be read from while the other is being written to.
The threads could communicate which copy they are using by using a lock. i.e. Writer thread locks copy when writing to it; reader thread requests access to lock after it's done with current copy; writer thread sees that reader thread is using it so it switches to other copy.
Writing thread keeps track of state updates it's done on the current copy so when it switches to the other copy it can "catch up".
That's the general gist of the idea, but the actual implementation will be a bit different of course.
I've tried to look up whether this is a common solution but couldn't really find much info, so it's got me wondering things like:
Is it viable, or am I missing something?
Is there a better approach?
Is it a common solution? If so what's it commonly referred to as?
(bonus) Is there a good resource I could read up on for topics related to this?
Pretty much I feel I've run into a dead-end where I cannot find (because I don't know what to search for) much more resources and info to see if this approach is "good". I plan on writing this in .NET C#, but I assume the techniques and solutions could translate to any language. All insights appreciated.
You actually need four buffers/objects. Two buffers/objects are owned by the reader, one by the writer, and one in the mailbox.
The reader -- each time he finishes a group of atomic operations on his newer object, he uses interlocked exchange to swap his older object handle (pointer or index doesn't matter) with the mailbox one. Then he looks at the newly obtained object and compares the sequence number to the object he just read (and is still holding) to find out which is newer.
The writer -- writes a complete copy of latest data into his object, then uses interlocked exchange to swap his newly written object with the mailbox one.
As you can see, the writer can steal the mailbox object at any time, but never the one that the reader is using, so read operations stay atomic. And the reader can steal the mailbox object at any time, but never the one the writer is using, so write operations stay atomic.
As long as the interlocked-exchange function produces the correct memory fence (release for the swap done in the writer thread, acquire for the reader thread), the objects can themselves be arbitrarily complex.
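A rough sketch of how the four-buffer mailbox described above could look in C#. StateBuffer, Mailbox, Publish and Read are hypothetical names, the payload is a plain string standing in for an arbitrarily complex state, and the sketch assumes exactly one writer thread and one reader thread. Interlocked.Exchange in .NET is a full fence, which is stronger than the release/acquire pair the scheme needs.

```csharp
using System.Threading;

// Hypothetical buffer type: any mutable state plus a sequence number.
public sealed class StateBuffer
{
    public long Sequence;
    public string Payload = "";
}

// Sketch of the four-buffer mailbox: one writer thread, one reader thread.
public sealed class Mailbox
{
    private StateBuffer _mailbox = new StateBuffer();    // shared slot
    private StateBuffer _writerBuf = new StateBuffer();  // owned by the writer
    private StateBuffer _newer = new StateBuffer();      // owned by the reader
    private StateBuffer _older = new StateBuffer();      // owned by the reader
    private long _nextSequence;                          // touched by the writer only

    // Call from the writer thread only.
    public void Publish(string payload)
    {
        _writerBuf.Payload = payload;            // write a complete copy
        _writerBuf.Sequence = ++_nextSequence;
        // Hand over the fresh buffer and take whatever was parked in the slot.
        _writerBuf = Interlocked.Exchange(ref _mailbox, _writerBuf);
    }

    // Call from the reader thread only. The returned buffer stays
    // untouched by the writer until the next call to Read.
    public StateBuffer Read()
    {
        _older = Interlocked.Exchange(ref _mailbox, _older);
        if (_older.Sequence > _newer.Sequence)
        {
            (_newer, _older) = (_older, _newer);  // the mailbox held fresher data
        }
        return _newer;
    }
}
```

The sequence comparison is what makes the fourth buffer necessary: when the writer has not published since the last Read, the reader gets back a stale buffer from the mailbox and must keep using the one it already holds.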
If I understand correctly, the writes themselves are synchronous. If so, then maybe it's not necessary to keep two copies or even to use locks.
Maybe something like this could work?
State state = populateInitialState();
...
// Reader thread
public State doRead() {
    return makeCopyOfState(state);
}
...
// Writer thread
public void updateState() {
    State newState = makeCopyOfState(state);
    // make changes in newState
    state = newState;
}
It looks like you are using the input-process-output pattern in a multithreaded pipeline. Sometimes the input and processing phases (or processing and output phases) are merged when the problem is simple.
You have added a C# tag so using something like a BlockingCollection might be a useful way to communicate between the input and output threads. Since the local state is relatively small (your words) then posting a data-object containing a copy of the local state from the input thread to the output thread could be a simple solution. This follows a share-nothing philosophy which satisfies the atomic requirement because a snapshot of the current state is queued. The "catch up" capability is satisfied because the queue contains the backlog of state changes.
Generally, Messaging Patterns and Conversation Patterns are useful resources when trying to work out what to communicate and how to communicate between 2 or more threads (or processes, services, servers, etc).
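A minimal sketch of the share-nothing queue idea from this answer, assuming a hypothetical immutable StateSnapshot type and a bounded capacity of 100 chosen arbitrarily. The producer stands in for the WebSocket thread and the consumer loop for the evaluating thread.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical immutable snapshot of the local state.
public sealed class StateSnapshot
{
    public StateSnapshot(int version, string payload)
    {
        Version = version;
        Payload = payload;
    }

    public int Version { get; }
    public string Payload { get; }
}

public static class SnapshotPipeline
{
    // Produces `count` snapshots on one task, consumes them on the caller's
    // thread, and returns the version of the last snapshot processed.
    public static int Run(int count)
    {
        using (var queue = new BlockingCollection<StateSnapshot>(boundedCapacity: 100))
        {
            Task producer = Task.Run(() =>
            {
                for (int i = 1; i <= count; i++)
                {
                    queue.Add(new StateSnapshot(i, "update " + i));  // blocks when full
                }
                queue.CompleteAdding();  // lets the consumer loop finish
            });

            int lastVersion = 0;
            foreach (StateSnapshot snapshot in queue.GetConsumingEnumerable())
            {
                lastVersion = snapshot.Version;  // evaluate the snapshot here
            }

            producer.Wait();
            return lastVersion;
        }
    }
}
```

Because each queued object is never modified after it is added, the consumer can never observe a half-applied update. The trade-off is one allocation per update and a backlog if the consumer is slower than the producer.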
Simplified question:
Is there a difference in timing of memory caches coherency (or "flushing") caused by Interlocked operations compared to Memory barriers? Let's consider in C# - any Interlocked operations vs Thread.MemoryBarrier(). I believe there is a difference.
Background:
I have read quite a lot about memory barriers and their effect on preventing specific types of reordering of memory instructions, but I couldn't find consistent information on whether they should cause immediate flushing of read/write queues.
I actually found a few sources mentioning that there is NO guarantee on the immediacy of the operation (only the prevention of specific reordering is guaranteed).
E.g.
Wikipedia:
"However, to be clear, it does not mean any operations WILL have completed by the time the barrier completes; only the ORDERING of the completion of operations (when they do complete) is guaranteed"
Freebsd.org (barriers are HW specific, so I guess a specific OS doesn't matter): "memory barriers simply determine relative order of memory operations; they do not make any guarantee about timing of memory operations"
On the other hand, Interlocked operations, from their definition, cause the memory subsystem to lock the entire cache line holding the value, to prevent access (including reads) from any other CPU/core until the operation is done.
Am I correct or am I mistaken?
Disclaimer:
This is an evolution of my original question here Variable freshness guarantee in .NET (volatile vs. volatile read)
EDIT1:
Fixed my statement about Interlocked operations - inline the text.
EDIT2:
Completely removed the demonstration code and its discussion (as some complained about too much information)
To understand C# interlocked operations, you need to understand Win32 interlocked operations.
The "pure" interlocked operations themselves only affect the freshness of the data directly referenced by the operation.
But in Win32, interlocked operations used to imply full memory barrier. I believe this is mostly to avoid breaking old programs on newer hardware. So InterlockedAdd does two things: interlocked add (very cheap, does not affect caches) and full memory barrier (rather heavy op).
Later, Microsoft realized this is expensive, and added versions of each operation that do no memory barrier or only a partial one.
So there are now (in Win32 world) four versions of almost everything: e.g. InterlockedAdd (full fence), InterlockedAddAcquire (read fence), InterlockedAddRelease (write fence), pure InterlockedAddNoFence (no fence).
In C# world, there is only one version, and it matches the "classic" InterlockedAdd - that also does the full memory fence.
Short answer: CAS (Interlocked) operations have been (and most likely will remain) the quickest cache flushers.
Background:
- CAS operations are supported in hardware by a single uninterruptible instruction. Compare this with a thread calling a memory barrier, which can be swapped out right after placing the barrier but just before performing any reads/writes (so the consistency guaranteed by the barrier is still met).
- CAS operations are the foundation for the majority (if not all) of high-level synchronization constructs (mutexes, semaphores, locks; look at their implementations and you will find CAS operations). They likely wouldn't be used if they didn't guarantee immediate cross-thread state consistency, or if there were other, faster mechanisms.
At least on Intel devices, a bunch of machinecode operations can be prefixed with a LOCK prefix, which ensures that the following operation is treated as atomic, even if the underlying datatype won't fit on the databus in one go, for example, LOCK REPNE SCASB will scan a string of bytes for a terminating zero, and won't be interrupted by other threads.
As far as I am aware, the Memory Barrier construct is basically a CAS based spinlock that causes a thread to wait for some Condition to be met, such as no other threads having any work to do. This is clearly a higher-level construct, but make no mistake there's a condition check in there, and it's likely to be atomic, and also likely to be CAS-protected, you're still going to pay the cache line price when you reach a memory barrier.
I'm using such configuration:
.NET framework 4.5
Windows Server 2008 R2
HP DL360p Gen8 (2 * Xeon E5-2640, x64)
I have such field somewhere in my program:
protected int HedgeVolume;
I access this field from several threads. Since I have a multi-processor system, I assume these threads may be executing on different processors.
What should I do to guarantee that any time I use this field the most recent value is "read"? And to make sure that when I "write" a value it becomes available to all other threads immediately?
What should I do?
just leave field as is.
declare it volatile
use Interlocked class to access the field
use .NET 4.5 Volatile.Read, Volatile.Write methods to access the field
use lock
I only need the simplest way to make my program work on this configuration; I don't need my program to work on other computers, servers, or operating systems. I also want minimal latency, so I'm looking for the fastest solution that will always work on this standard configuration (multiprocessor Intel x64, .NET 4.5).
Your question is missing one key element... How important is the integrity of the data in that field?
volatile gives you performance, but if a thread is currently writing changes to the field, you won't get that data until it's done, so you might access out of date information, and potentially overwrite changes another thread is currently doing. If the data is sensitive, you might get bugs that would get very hard to track. However, if you are doing very quick update, overwrite the value without reading it and don't care that once in a while you get outdated (by a few ms) data, go for it.
lock guarantees that only one thread can access the field at a time. You can put it only on the methods that write the field and leave the reading method alone. The downside is that it is slow and may block a thread while another is performing its task. However, you can be sure your data stays valid.
Interlocked exists to shield you from scheduler context switches. My opinion? Don't use it unless you know exactly why you would be using it and exactly how to use it. It gives options, but with great options come great problems. It prevents a context switch while a variable is being updated. It might not do what you think it does, and it won't prevent parallel threads from performing their tasks simultaneously.
You want to use Volatile.Read().
As you are running on x86, all writes in C# are the equivalent of Volatile.Write(), you only need to use this for Itanium.
Volatile.Read() will ensure that you get the latest copy regardless of which thread last wrote it.
There is a fantastic write up here, C# Memory Model Explained
Summary of it includes,
On some processors, not only must the compiler avoid certain
optimizations on volatile reads and writes, it also has to use special
instructions. On a multi-core machine, different cores have different
caches. The processors may not bother to keep those caches coherent by
default, and special instructions may be needed to flush and refresh
the caches.
Hopefully that much is obvious, other than the need for volatile to stop the compiler from optimising it, there is the processor as well.
However, in C# all writes are volatile (unlike say in Java),
regardless of whether you write to a volatile or a non-volatile field.
So, the above situation actually never happens in C#. A volatile write
updates the thread’s cache, and then flushes the entire cache to main
memory.
You do not need Volatile.Write(). More authoritative source here, Joe Duffy CLR Memory Model. However, you may need it to stop the compiler reordering it.
Since all C# writes are volatile, you can think of all writes as going
straight to main memory. A regular, non-volatile read can read the
value from the thread’s cache, rather than from main memory.
You need Volatile.Read()
When you start designing a concurrent program, you should consider these options in order of preference:
1) Isolation: each thread has its own private data
2) Immutability: threads can see shared state, but it never changes
3) Mutable shared state: protect all access to shared state with locks
If you get to (3), then how fast do you actually need this to be?
Acquiring an uncontested lock takes on the order of 10 ns (10^-8 seconds) - that's fast enough for most applications and is the easiest way to guarantee correctness.
Using any of the other options you mention takes you into the realm of low-lock programming, which is insanely difficult to get correct.
If you want to learn how to write concurrent software, you should read these:
Intro: Joe Albahari's free e-book - will take about a day to read
Bible: Joe Duffy's "Concurrent Programming on Windows" - will take about a month to read
Depends what you DO. For reading only, volatile is easiest; Interlocked allows a little more control. Lock is unnecessary as it is more granular than the problem you describe. Not sure about Volatile.Read/Write, never used them.
volatile - bad, there are some issues (see Joe Duffy's blog)
if all you do is read the value or unconditionally write a value - use Volatile.Read and Volatile.Write
if you need to read and subsequently write an updated value - use the lock syntax. You can however achieve the same effect without lock using the Interlocked classes functionality, but this is more complex (involves CompareExchange s to ensure that you are updating the read value i.e. has not been modified since the read operation + logic to retry if the value was modified since the read).
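For the read-then-write case, the CompareExchange retry logic mentioned above could be sketched roughly as follows. AtomicInt.Update is a hypothetical helper, not a framework API; the delegate it takes may run several times under contention, so it should have no side effects.

```csharp
using System;
using System.Threading;

public static class AtomicInt
{
    // Applies `update` to `location` atomically and returns the new value.
    // `update` may run more than once, so it should be side-effect free.
    public static int Update(ref int location, Func<int, int> update)
    {
        while (true)
        {
            int seen = Volatile.Read(ref location);
            int desired = update(seen);
            // Store `desired` only if nobody changed the value since we read it.
            if (Interlocked.CompareExchange(ref location, desired, seen) == seen)
            {
                return desired;
            }
            // Another thread won the race; re-read and try again.
        }
    }
}
```

For a field like HedgeVolume this would be called as AtomicInt.Update(ref HedgeVolume, v => v + delta). For a plain add, Interlocked.Add does the same thing in one call, so the loop only pays off for updates the Interlocked class does not provide directly.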
From this I understand that you want to be able to read the last value that was written to a field. Let's make an analogy with the SQL data concurrency problem. If you want to be able to read the last value of a field, you must make the instructions atomic. If someone is writing to a field, all other threads must be blocked from reading until that thread has finished the writing transaction. After that, every read on that thread will be safe. The problem is not so much with reading as with writing. A lock on that field whenever it's written should be enough if you ask me ...
First have a look here: Volatile vs. Interlocked vs. lock
The volatile modifier surely is a good option for a multi-core CPU.
But is this enough? It depends on how you calculate the new HedgeVolume value!
If your new HedgeVolume does not depend on the current HedgeVolume, then you're done with volatile.
But if HedgeVolume[x] = f(HedgeVolume[x-1]), then you need some thread synchronisation to guarantee that HedgeVolume doesn't change while you calculate and assign the new value. Both lock and Interlocked scenarios would be suitable in this case.
I had a similar question and found this article to be extremely helpful. It's a very long read, but I learned a LOT!
I've been reading Joe Duffy's book on Concurrent programming. I have kind of an academic question about lockless threading.
First: I know that lockless threading is fraught with peril (if you don't believe me, read the sections in the book about memory model)
Nevertheless, I have a question:
Suppose I have a class with an int property on it.
The value referenced by this property will be read very frequently by multiple threads
It is extremely rare that the value will change, and when it does it will be a single thread that changes it.
If it does change while another operation that uses it is in flight, no one is going to lose a finger (the first thing anyone using it does is copy it to a local variable)
I could use locks (or a ReaderWriterLockSlim to keep the reads concurrent).
I could mark the variable volatile (lots of examples where this is done)
However, even volatile can impose a performance hit.
What if I use VolatileWrite when it changes, and leave the access normal for reads. Something like this:
public class MyClass
{
    private int _TheProperty;
    internal int TheProperty
    {
        get { return _TheProperty; }
        set { System.Threading.Thread.VolatileWrite(ref _TheProperty, value); }
    }
}
I don't think that I would ever try this in real life, but I'm curious about the answer (more than anything, as a checkpoint of whether I understand the memory model stuff I've been reading).
Marking a variable as "volatile" has two effects.
1) Reads and writes have acquire and release semantics, so that reads and writes of other memory locations will not "move forwards and backwards in time" with respect to reads and writes of this memory location. (This is a simplification, but you take my point.)
2) The code generated by the jitter will not "cache" a value that seems to logically be unchanging.
Whether the former point is relevant in your scenario, I don't know; you've only described one memory location. Whether or not it is important that you have only volatile writes but not volatile reads is something that is up to you to decide.
But it seems to me that the latter point is quite relevant. If you have a spin lock on a non-volatile variable:
while(this.prop == 0) {}
the jitter is within its rights to generate this code as though you'd written
if (this.prop == 0) { while (true) {} }
Whether it actually does so or not, I don't know, but it has the right to. If what you want is for the code to actually re-check the property on each go round the loop, marking it as volatile is the right way to go.
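As an illustration of forcing the re-check on each pass of a spin loop without marking the field volatile, here is a small sketch using Volatile.Read and Volatile.Write. SpinWorker and its members are hypothetical names, and whether a given JIT would really hoist a plain read in your code is not something this example can prove.

```csharp
using System.Threading;

// Hypothetical worker that spins until another thread asks it to stop.
public sealed class SpinWorker
{
    private bool _stop;

    // Called from the controlling thread.
    public void RequestStop()
    {
        Volatile.Write(ref _stop, true);
    }

    // Spins until RequestStop is called and returns the number of passes.
    // Volatile.Read performs a fresh load on every pass, so the check
    // cannot be hoisted out of the loop the way a plain field read may be.
    public long Run()
    {
        long spins = 0;
        while (!Volatile.Read(ref _stop))
        {
            spins++;
        }
        return spins;
    }
}
```

In real code a spin loop like this would usually yield or back off (for example with SpinWait) rather than burn a core, but that is left out to keep the point about the read visible.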
The question is whether the reading thread will ever see the change. It's not just a matter of whether it sees it immediately.
Frankly I've given up on trying to understand volatility - I know it doesn't mean quite what I thought it used to... but I also know that with no kind of memory barrier on the reading thread, you could be reading the same old data forever.
The "performance hit" of volatile is because the compiler now generates code to actually check the value instead of optimizing that away - in other words, you'll have to take that performance hit regardless of what you do.
At the CPU level, yes every processor will eventually see the change to the memory address. Even without locks or memory barriers. Locks and barriers would just ensure that it all happened in a relative ordering (w.r.t other instructions) such that it appeared correct to your program.
The problem isn't cache coherency (I hope Joe Duffy's book doesn't make that mistake). The caches stay coherent; it is just that this takes time, and the processors don't bother to wait for that to happen unless you enforce it. So instead, the processor moves on to the next instruction, which may or may not end up happening before the previous one (because each memory read/write may take a different amount of time. Ironically, because of the time it takes the processors to agree on coherency, some cache lines become coherent faster than others, i.e. depending on whether the line was Modified, Exclusive, Shared, or Invalid, it takes more or less work to get into the necessary state).
So a read may appear old or from an out of date cache, but really it just happened earlier than expected (typically because of look-ahead and branch prediction). When it really was read, the cache was coherent, it has just changed since then. So the value wasn't old when you read it, but it is now when you need it. You just read it too soon. :-(
Or equivalently, it was written later than the logic of your code thought it would be written.
Or both.
Anyhow, if this was C/C++, even without locks/barriers, you would eventually get the updated values. (within a few hundred cycles typically, as memory takes about that long). In C/C++ you could use volatile (the weak non-thread volatile) to ensure that the value wasn't read from a register. (Now there's a non-coherent cache! ie the registers)
In C# I don't know enough about CLR to know how long a value could stay in a register, nor how to ensure you get a real re-read from memory. You've lost the 'weak' volatile.
I would suspect as long as the variable access doesn't completely get compiled away, you will eventually run out of registers (x86 doesn't have many to start with) and get your re-read.
But no guarantees that I see. If you could limit your volatile-read to a particular point in your code that was often, but not too often (ie start of next task in a while(things_to_do) loop) then that might be the best you can do.
This is the pattern I use when the 'last writer wins' pattern is applicable to the situation. I had used the volatile keyword, but after seeing this pattern in a code example from Jeffrey Richter, I started using it.
For normal things (like memory-mapped devices), the cache-coherency protocols going on within/between the CPU/CPUs is there to ensure that different threads sharing that memory get a consistent view of things (i.e., if I change the value of a memory location in one CPU, it will be seen by other CPUs that have the memory in their caches). In this regard volatile will help to ensure that the optimizer doesn't optimize away memory accesses (which are always going through cache anyway) by, say, reading the value cached in a register. The C# documentation seems pretty clear on this. Again, the application programmer doesn't generally have to deal with cache-coherency themselves.
I highly recommend reading the freely available paper "What Every Programmer Should Know About Memory". A lot of magic goes on under the hood that mostly prevents shooting oneself in the foot.
In C#, the int type is thread-safe.
Since you said that only one thread writes to it, you should never have contention as to what is the proper value, and as long as you are caching a local copy, you should never get dirty data.
You may, however, want to declare it volatile if an OS thread will be doing the update.
Also keep in mind that some operations are not atomic, and can cause problems if you have more than one writer. For example, even though the bool type won't corrupt if you have more than one writer, a statement like this:
a = !a;
is not atomic. If two threads read at the same time, you have a race condition.
I've just written a method that is called by multiple threads simultaneously and I need to keep track of when all the threads have completed. The code uses this pattern:
private void RunReport()
{
    _reportsRunning++;
    try
    {
        //code to run the report
    }
    finally
    {
        _reportsRunning--;
    }
}
This is the only place within the code that _reportsRunning's value is changed, and the method takes about a second to run.
Occasionally when I have more than six or so threads running reports together the final result for _reportsRunning can get down to -1. If I wrap the calls to _reportsRunning++ and _reportsRunning-- in a lock then the behaviour appears to be correct and consistent.
So, to the question: When I was learning multithreading in C++ I was taught that you didn't need to synchronize calls to increment and decrement operations because they were always one assembly instruction and therefore it was impossible for the thread to be switched out mid-call. Was I taught correctly, and if so, how come that doesn't hold true for C#?
A ++ operator is not atomic in C# (and I doubt it is guaranteed to be atomic in C++) so yes, your counting is subject to race conditions.
Use Interlocked.Increment and .Decrement
System.Threading.Interlocked.Increment(ref _reportsRunning);
try
{
    ...
}
finally
{
    System.Threading.Interlocked.Decrement(ref _reportsRunning);
}
So, to the question: When I was
learning multithreading in C++ I was
taught that you didn't need to
synchronize calls to increment and
decrement operations because they were
always one assembly instruction and
therefore it was impossible for the
thread to be switched out mid-call.
Was I taught correctly, and if so how
come that doesn't hold true for C#?
This is incredibly wrong.
On some architectures, like x86, there are single increment and decrement instructions. Many architectures do not have them and need to do separate loads and stores. Even on x86, there is no guarantee the compiler will generate the memory version of these instructions - it'll likely load into a register first, especially if it needs to do several operations with the result.
Even if the compiler could be guaranteed to always generate the memory version of increment and decrement on x86, that still does not guarantee atomicity: two CPUs could modify the variable simultaneously and get inconsistent results. The instruction would need the lock prefix to force it to be an atomic operation. Compilers never emit the lock variant by default, since guaranteeing atomicity makes it slower.
Consider the following x86 assembly instruction:
inc [i]
If i is initially 0 and the code is run on two threads on two cores, the value after both threads finish could legally be either 1 or 2, since there is no guarantee that one thread will complete its read before the other thread finishes its write, or that one thread's write will even be visible before the other thread's read.
Changing this to:
lock inc [i]
Will result in getting a final value of 2.
Win32's InterlockedIncrement and InterlockedDecrement and .NET's Interlocked.Increment and Interlocked.Decrement result in doing the equivalent (possibly the exact same machine code) of lock inc.
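The difference can be observed from C# with a small experiment along these lines. CounterDemo is a hypothetical class. The racy version may well return the full count on some runs, since lost updates depend on timing; only the Interlocked version is guaranteed to.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class CounterDemo
{
    // Increments a shared counter `iterations` times in parallel with a
    // plain ++, which is a separate read, add, and write. Updates can be lost.
    public static int RacyCount(int iterations)
    {
        int counter = 0;
        Parallel.For(0, iterations, i => { counter++; });
        return counter;
    }

    // Same work with Interlocked.Increment, an atomic read-modify-write.
    public static int InterlockedCount(int iterations)
    {
        int counter = 0;
        Parallel.For(0, iterations, i => { Interlocked.Increment(ref counter); });
        return counter;
    }
}
```

On a multi-core machine, calling CounterDemo.RacyCount with a large iteration count will usually come back short, while CounterDemo.InterlockedCount returns exactly the number of iterations.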
You were taught wrong.
There does exist hardware with atomic integer increment, so it's possible that what you were taught was right for the hardware and compiler you were using at the time. But in general in C++ you can't even guarantee that incrementing a non-volatile variable writes memory consecutively with reading it, let alone atomically with reading.
Incrementing the int is one instruction but what about loading the value in the register?
That's what i++ effectively does:
load i into a register
increment the register
unload the register into i
As you can see, there are 3 instructions (this may be different on other platforms), and at any stage the CPU can context switch to a different thread, leaving your variable in an unknown state.
You should use Interlocked.Increment and Interlocked.Decrement to solve that.
No, you need to synchronize access. On Windows you can do this easily with InterlockedIncrement() and InterlockedDecrement(). I'm sure there are equivalents for other platforms.
EDIT: Just noticed the C# tag. Do what the other guy said. See also: I've heard i++ isn't thread safe, is ++i thread-safe?
Any kind of increment/decrement operation in a higher level language (and yes, even C is higher level compared to machine instructions) is not atomic by nature. However, each processor platform usually has primitives that support various atomic operations.
If your lecturer was referring to machine instructions, Increment and Decrement operations are likely to be atomic. Yet, that is not always correct on the ever increasing multi-core platforms of today, unless they guarantee coherency.
The higher level languages usually implement support for atomic transactions using low level atomic machine instructions. This is provided as the interlock mechanism by the higher level API.
x++ probably isn't atomic, but ++x might be (not sure offhand, but if you consider the difference between post- and pre-increment it should be clear why pre- is more amenable to atomicity).
A bigger point is, if these runs take a second to run each, the amount of time added by a lock is going to be noise compared to the runtime of the method itself. It's probably not worth monkeying with trying to remove the lock in this case - you've got a correct solution with locking, that will likely not have a visible difference in performance from the non-locking solution.
On a single-processor machine, if one isn't using virtual memory, x++ (rvalue ignored) is likely to translate into a single atomic INC instruction on x86 architectures (if x is long, the operation is only atomic when using a 32-bit compiler). Also, movsb/movsw/movsl are atomic ways of moving a byte/word/longword; a compiler isn't apt to use those as the normal way of assigning variables, but one could have an atomic-move utility function. It would be possible for a virtual memory manager to be written in such a way that those instructions would behave atomically if a page fault occurs on the write, but I don't think that's normally guaranteed.
On a multi-processor machine, all bets are off unless one uses explicit interlocked instructions (invokable via special library calls). The most versatile instruction which is commonly available is CompareExchange. That instruction will alter a memory location only if it contains an expected value; it will return the value it had when it decided whether or not to alter it. If one wishes to "xor" a variable with 1, one could do something like (in vb.net)
Dim OldValue As Integer
Do
    OldValue = Variable
Loop While Threading.Interlocked.CompareExchange(Variable, OldValue Xor 1, OldValue) <> OldValue
This approach allows one to perform any sort of atomic update to a variable whose new value should depend on the old value. For certain common operations like increment and decrement, there are faster alternatives, but the CompareExchange allows one to implement other useful patterns as well.
Important caveats: (1) Keep the loop as short as possible; the longer the loop, the more likely it is that another task will hit the variable during the loop, and the more time will be wasted each time that happens; (2) a specified number of updates, divided arbitrarily among threads, will always complete, since the only way a thread can be forced to re-execute the loop is if some other thread has made useful progress; if some threads can perform updates without making forward progress toward completion, however, the code may become live-locked.