The C# spec states in section 5.5 that reads and writes on certain types (namely bool, char, byte, sbyte, short, ushort, uint, int, float, and reference types) are guaranteed to be atomic.
This has piqued my interest. How can you do that? I mean, my lowly personal experience only showed me to lock variables or to use barriers if I wanted reads and writes to look atomic; that would be a performance killer if it had to be done for every single read/write. And yet C# does something with a similar effect.
Perhaps other languages (like Java) do it. I seriously don't know. My question isn't really intended to be language-specific, it's just that I know C# does it.
I understand that it might have to deal with certain specific processor instructions, and may not be usable in C/C++. However, I'd still like to know how it works.
[EDIT] To tell the truth, I believed that reads and writes could be non-atomic under certain conditions, for example when one CPU accesses a memory location while another CPU is writing to it. Does this only happen when the CPU can't handle the whole object in a single operation, for instance because it's too big or because the memory isn't aligned on the proper boundary?
The reason those types have guaranteed atomicity is that they are all 32 bits or smaller. Since .NET only runs on 32-bit and 64-bit operating systems, the processor can read and write the entire value in a single operation. This is in contrast to, say, an Int64 on a 32-bit platform, which must be read and written using two 32-bit operations.
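As an illustrative sketch (the field name and thread setup here are mine, not from any spec), a 64-bit field written without synchronization on a 32-bit runtime can be observed half-updated:

using System;
using System.Threading;

class TearingDemo
{
    // 64-bit field written by one thread and read by another with no synchronization.
    static long _value;

    static void Main()
    {
        var writer = new Thread(() =>
        {
            while (true)
            {
                _value = 0L;   // halves: 0x00000000 / 0x00000000
                _value = -1L;  // halves: 0xFFFFFFFF / 0xFFFFFFFF
            }
        }) { IsBackground = true };
        writer.Start();

        while (true)
        {
            long observed = _value;
            if (observed != 0L && observed != -1L)
            {
                // On a 32-bit runtime this can print e.g. 0x00000000FFFFFFFF.
                Console.WriteLine($"Torn read: 0x{observed:X16}");
                break;
            }
        }
    }
}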
I'm not really a hardware guy, so I apologize if my terminology makes me sound like a buffoon, but that's the basic idea.
It is fairly cheap to implement the atomicity guarantee on x86 and x64 cores since the CLR only promises atomicity for variables that are 32-bit or smaller. All that's required is that the variable is properly aligned and doesn't straddle a cache line. The JIT compiler ensures this by allocating local variables on a 4-byte aligned stack offset. The GC heap manager does the same for heap allocations.
It's notable that the CLR guarantee is not a very strong one. The alignment promise is not good enough to write code that performs consistently for arrays of doubles, as nicely demonstrated in this thread. Interop with machine code that uses SIMD instructions is also very difficult for this reason.
On x86 reads and writes are atomic anyway. It's supported at the hardware level. This however does not mean that operations like addition and multiplication are atomic; they require a load, compute, then store, which means they can interfere. That's where the lock prefix comes in.
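For example (a minimal sketch; the counter and loop count are mine), a plain _counter++ compiles to a separate load, add, and store, so concurrent increments can be lost; Interlocked.Increment, which the JIT implements with a lock-prefixed instruction on x86/x64, makes the whole read-modify-write atomic:

using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static int _counter;

    static void Main()
    {
        Parallel.For(0, 1_000_000, _ =>
        {
            // _counter++ here could lose updates under contention.
            Interlocked.Increment(ref _counter);
        });
        Console.WriteLine(_counter); // Reliably prints 1000000.
    }
}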
You mentioned locking and memory barriers; they don't have anything to do with reads and writes being atomic. There is no way on x86 with or without using memory barriers that you're going to see a half-written 32-bit value.
Yes, C# and Java guarantee that loads and stores of some primitive types are atomic, like you say. This is cheap because the processors capable of running .NET or the JVM do guarantee that loads and stores of suitably aligned primitive types are atomic.
Now, what neither C# nor Java nor the processors they run on guarantee, and which is expensive, is issuing memory barriers so that those variables can be used for synchronization in a multi-threaded program. However, in Java and C# you can mark your variable with the "volatile" attribute, in which case the compiler takes care of issuing the appropriate memory barriers.
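A minimal sketch of that (the class and field names are mine): a volatile bool used as a stop flag, where the keyword makes the compiler emit the acquire/release semantics so the reading loop is guaranteed to eventually observe the write:

class Worker
{
    // volatile: every read and write of this field gets the appropriate barrier semantics.
    private volatile bool _stopRequested;

    public void Run()
    {
        while (!_stopRequested)
        {
            // ... do a unit of work ...
        }
    }

    // Called from another thread; the write becomes visible to Run's loop.
    public void Stop() => _stopRequested = true;
}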
You can't. Even going all the way down to assembly language you have to use special LOCK opcodes in order to guarantee that another core or even process isn't going to come around and wipe out all your hard work.
This is a continuation of the discussion on multithreading issues in C#.
In C++, unprotected access to shared data from multiple threads is undefined behavior* if a write operation is involved. What is it in C#? Since (the safe part of) C# doesn't contain undefined behavior, are there any guarantees? C# seems to have a kind of as-if rule as well, but after reading the mentioned part of the standard I fail to see what the consequences of unprotected data access are from the language's point of view.
In particular, it's interesting to know which kinds of optimizations, including load fusing and load invention, are prohibited by the language. That prohibition would imply the validity (or lack thereof) of several popular patterns in C# (including the one discussed in the original question).
[The details of the actual implementation in Microsoft CLR, despite being very interesting, are not the part of this question: only the guarantees given by the language itself (and therefore portable) are here under discussion.]
Normative references are very welcome, but I suspect the C# standard has enough information on the topic. Maybe someone from the language team can shed some light on which guarantees are going to be included in the standard later but can be relied upon right now.
I suspect that there are some implied guarantees like the absence of pointer reference tearing because this could easily lead to breaking the type safety. But I'm not an expert on the topic.
*Often shortened as UB. Undefined Behavior allows a C++ compiler to produce literally any code, including formatting the hard disk or whatever, or to crash at compile time.
The .NET runtime guarantees that reads and writes of certain variable types are atomic:
Reads and writes of the following data types shall be atomic: bool, char, byte, sbyte, short, ushort, uint, int, float, and reference types. In addition, reads and writes of enum types with an underlying type in the previous list shall also be atomic. Reads and writes of other types, including long, ulong, double, and decimal, as well as user-defined types, need not be atomic. Aside from the library functions designed for that purpose, there is no guarantee of atomic read-modify-write, such as in the case of increment or decrement.
Not mentioned is IntPtr, which I believe is also guaranteed to be atomic. Since reads and writes of references are atomic, references are guaranteed not to tear. See also C# - The C# Memory Model in Theory and Practice for more information.
There should also be a guarantee of memory safety, i.e. that any memory access will reference valid memory and that all memory is initialized before usage. With some exceptions for things like unmanaged resources, unsafe code and stackalloc.
The general rule with regard to optimization is that the compiler/jitter may perform any optimization as long as the result would be identical for a single-threaded program. So tearing, fusing, reordering, etc. would all be possible, absent any synchronization.
So always use appropriate synchronization whenever there is a possibility that multiple threads use the same memory concurrently for anything except reading. Note that ARM has weaker memory ordering guarantees than x86/x64, further emphasizing the need for synchronization.
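As a hedged sketch of that advice (all names here are mine), protect shared mutable state with a lock, or use Volatile.Read/Volatile.Write when only a simple flag is shared:

using System.Threading;

class SharedState
{
    private readonly object _gate = new object();
    private long _total;     // 64-bit, so unsynchronized access could tear on 32-bit runtimes
    private bool _ready;

    public void Add(long amount)
    {
        lock (_gate) { _total += amount; }   // lock gives both atomicity and ordering
    }

    public long Total
    {
        get { lock (_gate) { return _total; } }
    }

    public void MarkReady() => Volatile.Write(ref _ready, true);

    public bool IsReady => Volatile.Read(ref _ready);
}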
As mentioned by @JonasH, the C# spec only guarantees atomic access to values sized 32 bits or smaller.
But, assuming you can rely on C# always being implemented on a runtime conforming to ECMA-335, you can rely on that spec as well. This should be safe, as all implementations of .NET, including Mono and WASM, conform to ECMA-335 (it is not a Microsoft-only spec).
ECMA-335 guarantees access to native-sized values, which includes IntPtr and object references, as well as 64-bit integers on a 64-bit architecture.
ECMA-335 says: (my bold)
12.6.6 Atomic reads and writes
A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §12.6.2) when all the write accesses to a location are the same size. Atomic writes shall alter no bits other than those written. Unless explicit layout control (see Partition II (Controlling Instance Layout)) is used to alter the default behavior, data elements no larger than the natural word size (the size of a native int) shall be properly aligned. Object references shall be treated as though they are stored in the native word size.
[Note: There is no guarantee about atomic update (read-modify-write) of memory, except for methods provided for that purpose as part of the class library (see Partition IV). An atomic write of a "small data item" (an item no larger than the native word size) is required to do an atomic read/modify/write on hardware that does not support direct writes to small data items. end note]
You seem to be asking specifically about the atomicity of this code:
if (SomeEvent != null) SomeEvent(this, args);
This code is not guaranteed to be thread-safe, either by the C# spec or by the .NET spec. While it is true that an optimizing JIT compiler might generate thread-safe code, it's unsafe to rely on it.
Instead, use the better (and more concise) code below, which is guaranteed thread-safe:
SomeEvent?.Invoke(this, args);
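For reference, the null-conditional form is equivalent to the older local-copy idiom (a sketch using the question's hypothetical SomeEvent field), which also reads the delegate field exactly once:

// Pre-C# 6 equivalent: snapshot the delegate into a local before the null check,
// so a concurrent unsubscribe cannot cause a NullReferenceException.
var handler = SomeEvent;
if (handler != null)
    handler(this, args);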
I've been reading a lot about floating-point determinism in .NET, i.e. ensuring that the same code with the same inputs will give the same results across different machines. Since .NET lacks options like Java's strictfp and MSVC's fp:strict, the consensus seems to be that there is no way around this issue using pure managed code. The C# game AI Wars has settled on using fixed-point math instead, but this is a cumbersome solution.
The main issue appears to be that the CLR allows intermediate results to live in FPU registers that have higher precision than the type's native precision, leading to unpredictably higher-precision results. An MSDN article by CLR engineer David Notario explains the following:
Note that with current spec, it’s still a language choice to give ‘predictability’. The language may insert conv.r4 or conv.r8 instructions after every FP operation to get a ‘predictable’ behavior. Obviously, this is really expensive, and different languages have different compromises. C#, for example, does nothing; if you want narrowing, you will have to insert (float) and (double) casts by hand.
This suggests that one may achieve floating-point determinism simply by inserting explicit casts for every expression and sub-expression that evaluates to float. One might write a wrapper type around float to automate this task. This would be a simple and ideal solution!
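A minimal sketch of what such casting looks like (method and parameter names mine): every intermediate result is cast back to float, which is what inserting conv.r4 after each operation amounts to:

// May be evaluated at extended precision on some runtimes/JITs:
static float MulAdd(float a, float b, float c) => a * b + c;

// Explicit casts narrow each intermediate to a true 32-bit float:
static float MulAddNarrowed(float a, float b, float c) =>
    (float)((float)(a * b) + c);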
Other comments however suggest that it isn't so simple. Eric Lippert recently stated (emphasis mine):
in some version of the runtime, casting to float explicitly gives a different result than not doing so. When you explicitly cast to float, the C# compiler gives a hint to the runtime to say "take this thing out of extra high precision mode if you happen to be using this optimization".
Just what is this "hint" to the runtime? Does the C# spec stipulate that an explicit cast to float causes the insertion of a conv.r4 in the IL? Does the CLR spec stipulate that a conv.r4 instruction causes a value to be narrowed down to its native size? Only if both of these are true can we rely on explicit casts to provide floating point "predictability" as explained by David Notario.
Finally, even if we can indeed coerce all intermediate results to the type's native size, is this enough to guarantee reproducibility across machines, or are there other factors like FPU/SSE run-time settings?
Just what is this "hint" to the runtime?
As you conjecture, the compiler tracks whether a conversion to double or float was actually present in the source code, and if it was, it always inserts the appropriate conv opcode.
Does the C# spec stipulate that an explicit cast to float causes the insertion of a conv.r4 in the IL?
No, but I assure you that there are unit tests in the compiler test cases that ensure that it does. Though the specification does not demand it, you can rely on this behaviour.
The specification's only comment is that any floating point operation may be done in a higher precision than required at the whim of the runtime, and that this can make your results unexpectedly more accurate. See section 4.1.6.
Does the CLR spec stipulate that a conv.r4 instruction causes a value to be narrowed down to its native size?
Yes, in Partition I, section 12.1.3, which I note you could have looked up yourself rather than asking the internet to do it for you. These specifications are free on the web.
A question you didn't ask but probably should have:
Is there any operation other than casting that truncates floats out of high precision mode?
Yes. Assigning to a static field, instance field or element of a double[] or float[] array truncates.
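So, as an illustrative sketch (the type and field are hypothetical), round-tripping a value through a float field or array element is another way to force it down to 32-bit precision:

class Narrower
{
    // Assigning to an instance field of type float truncates any extended-precision value.
    private float _scratch;

    public float MultiplyTruncated(float a, float b)
    {
        _scratch = a * b;   // the store forces the product out of any 80-bit FPU register
        return _scratch;
    }
}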
Is consistent truncation enough to guarantee reproducibility across machines?
No. I encourage you to read section 12.1.3, which has much interesting to say on the subject of denormals and NaNs.
And finally, another question you did not ask but probably should have:
How can I guarantee reproducible arithmetic?
Use integers.
The 8087 Floating Point Unit chip design was Intel's billion-dollar mistake. The idea looks good on paper: give it an eight-register stack that stores values in extended precision, 80 bits, so that you can write calculations whose intermediate values are less likely to lose significant digits.
The beast is, however, impossible to optimize for. Storing a value from the FPU stack back to memory is expensive, so keeping values inside the FPU is a strong optimization goal. Inevitably, having only 8 registers will require a write-back if the calculation is deep enough. It is also implemented as a stack rather than as freely addressable registers, which requires gymnastics that may produce a write-back as well. Inevitably, a write-back truncates the value from 80 bits back to 64 bits, losing precision.
The consequence is that non-optimized code does not produce the same result as optimized code, and small changes to the calculation can have big effects on the result when an intermediate value ends up needing to be written back. The /fp:strict option is a hack around that: it forces the code generator to emit a write-back to keep the values consistent, but with an inevitable and considerable loss of performance.
This puts you squarely between a rock and a hard place. For the x86 jitter, they just didn't try to address the problem.
Intel didn't make the same mistake when it designed the SSE instruction set. The XMM registers are freely addressable and don't store extra bits. If you want consistent results, then compiling with the AnyCPU target and running on a 64-bit operating system is the quick solution: the x64 jitter uses SSE instead of FPU instructions for floating-point math. Albeit that this adds a third way that a calculation can produce a different result. If the calculation is wrong because it loses too many significant digits, at least it will be consistently wrong. Which is a bit of a bromide, really, but typically only as far as a programmer looks.
I am trying to understand what performance differences exist when running a C# / .NET 4.0 app as x64 vs. x86. I understand the memory considerations (x64 can address all memory, while x86 is limited to 2/4 GB), as well as the fact that an x64 app will use more memory (all pointers are 8 bytes instead of 4 bytes). As far as I can tell, none of these should affect clock-for-clock instruction throughput, as the x64 pipeline is wide enough to handle the wider instructions.
Is there a performance hit in context switching, due to the larger stack size for each thread? What performance considerations am I missing in evaluating the two?
Joe White has given you some good reasons why your app might be slower. Larger pointers (and therefore by extension larger references in .NET) will take up more space in memory, meaning less of your code and data will fit into the cache.
There are, however, plenty of beneficial reasons you might want to use x64:
The AMD64 calling convention is used by default in x64 and can be quite a bit faster than the standard cdecl or stdcall, with many arguments being passed in registers and using the XMM registers for floating point.
The CLR will emit scalar SSE instructions for dealing with floating point operations in 64-bit. In x86 it falls back on using the standard x87 FP stack, which is quite a bit slower, especially for things like converting between ints and floats.
Having more registers means that there is much less chance that the JIT will have to spill them due to register pressure. Spilling registers can be quite costly for fast inner loops, especially if a function gets inlined and introduces additional register pressure there.
Any operations on 64-bit integers can benefit tremendously by being able to fit into a single register instead of being broken up into two separate halves.
This may be obvious, but the additional memory your process can access can be quite useful if your application is memory-intensive, even if it isn't hitting the theoretical limit. Fragmentation can cause you to hit "out of memory" conditions long before you reach that mark.
RIP-relative addressing in x64 can, in some cases, reduce the size of an executable image. Although that doesn't really apply directly to .NET apps, it can have an effect on the sharing of DLLs which may otherwise have to be relocated. I'd be interested in knowing if anyone has any specific information on this with regards to .NET and managed applications.
Aside from these, the x64 version of the .NET runtime seems to, at least in the current versions, perform more optimizations than the x86 equivalent. Things like inlining and memory alignment seem to happen much more often. In fact, there was a bug a while back that prevented inlining of any method that took or returned a value type; I remember seeing it fixed in x64 and not the x86 version.
Really, the only way you'll be able to tell which is better for your app will be to do profiling and testing on both architectures and comparing real results. However, I personally just use Any CPU wherever possible and avoid anything inherently architecture-dependent. This makes it easy to build and deploy, and is hopefully more future proof when the majority of users start switching to x64 exclusively.
Closely related to "x64 app will use more memory" is the fact that, with a 64-bit app, your locality of reference is smaller (because all your pointer sizes are doubled), so you get less mileage out of the CPU's on-board (ultra-fast) cache. You have to retrieve data from system RAM more often, which is much slower than the L2 and even the L1 on-chip cache.
I've been using C# for a while and have recently started adding parallelism to a side project of mine. So, according to Microsoft, reads and writes to ints and even floats are atomic.
I'm sure these atomicity requirements work out just fine on x86 architectures. However, on architectures such as ARM (which may not have hardware floating-point support), it seems these guarantees will be hard to meet.
The problem is only made more significant by the fact that an 'int' is always 32 bits. There are many embedded devices that can't atomically perform a 32-bit write.
It seems this is a fundamental mistake in C#. Guaranteeing the atomicity of these data types can't be done portably.
How are these atomicity guarantees intended to be implemented on architectures where there are no FPUs or 32-bit writes?
It's not too difficult to guarantee the atomicity with runtime checks. Sure, you won't be as performant as you might be if your platform supported atomic reads and writes, but that's a platform tradeoff.
Bottom line: C# (the core language, not counting some platform-specific APIs) is just as portable as Java.
The future happened yesterday: C# has in fact been ported to a large number of embedded cores. The .NET Micro Framework is the typical deployment scenario. Model numbers I see listed as native targets are AT91, BF537, CortexM3, LPC22XX, LPC24XX, MC9328, PXA271 and SH2.
I don't know the exact implementation details of their instruction sets, but I'm fairly sure these are all 32-bit cores, and several of them are ARM cores. Writing threaded code for them requires a minimum set of guarantees, and atomic updates of properly aligned words is one of them. Given the supported list, and given that 4-byte atomic updates of aligned words are trivial to implement in 32-bit hardware, I trust they all do in fact support it.
There are two issues with regard to "portability":
Can a practical implementation of the language be produced for various platforms?
Will a program written in the language run correctly on various platforms without modification?
The stronger the guarantees made by a language, the harder it will be to port it to various platforms (some guarantees may make it impossible or impractical to implement the language on some platforms) but the more likely it is that programs written in the language will work without modification on any platform for which support exists.
For example, a lot of networking code relies upon the fact that (on most platforms) an unsigned char is eight bits, and a 32-bit integer is represented by four unsigned chars in ascending or descending sequence. I've used a platform where char was 16 bits, sizeof(int)==1, and sizeof(long)==2. The compiler author could have made the compiler simply use the bottom 8 bits of each address, or could have added a lot of extra code so that writing through a 'char' pointer would shift the address right one bit (saving the lsb), read that address, update the high or low half based upon the saved lsb, and write it back. Either of those approaches would have allowed the networking code to run without modification, but would have greatly impeded the compiler's usefulness for other purposes.
Some of the guarantees in the CLR mean that it is impractical to implement it on any platform with an atomic operation size smaller than 32 bits. So what? If a microcontroller needs more than a few dozen kilobytes of code space and RAM, the cost differential between 8-bit and 32-bit is pretty small. Since nobody's going to be running any variation of the CLR on a part with 32K of code space and 4K of RAM, who cares whether such a chip could satisfy its guarantees?
BTW, I do think it would be useful to have different levels of features defined in a C spec; a lot of processors, for example, do have 8-bit chars which can be assembled into longer words using unions, and there is a lot of practical code which exploits this. It would be good to define standards for compilers which work with such things. I would also like to see more standards at the low end of the system, making some language enhancements available for 8-bit processors. For example, it would be useful to define overloads for a function which can take a run-time-computed 16-bit integer, an 8-bit variable, or an inline-expanded version with a constant. For often-used functions, there can be a big difference in efficiency among those.
That's what the CLI is for. I doubt they will certify an implementation if it isn't compliant. So basically, C# is portable to any platform that has one.
Excessively weakening guarantees for the sake of portability defeats the purpose of portability. The stronger the guarantees, the more valuable the portability. The goal is to find the right balance between what the likely target platforms can efficiently support with the guarantees that will be the most useful for development.
Can anyone explain to me what the benefits and drawbacks of the two different approaches are?
When a double or long in Java is volatile, §17.7 of the Java Language Specification requires that they are read and written atomically. When they are not volatile, they can be written in multiple operations. This can result, for example, in the upper 32 bits of a long containing a new value, while the lower 32 bits still contain the old value.
Atomic reads and writes are easier for a programmer to reason about and write correct code with. However, support for atomic operations might impose a burden on VM implementers in some environments.
I don't know why volatile cannot be applied to 64-bit ints in C#, but you can use Thread.VolatileWrite to do what you want in C#.
The volatile keyword is just syntactic sugar on this call.
Excerpt:
Note: In C#, using the volatile modifier on a field guarantees that all access to that field uses Thread.VolatileRead or Thread.VolatileWrite.
The syntactic sugar (keyword) applies to 32-bit ints, but you can use the actual method calls on 64-bit ints.
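A hedged sketch of what that looks like for a 64-bit field (the class and field names are mine; on current frameworks the Volatile.Read/Volatile.Write overloads that take ref long serve the same purpose):

using System.Threading;

class Timestamp
{
    private long _lastTicks;

    public void Update(long ticks)
    {
        // Volatile, atomic 64-bit write even on a 32-bit runtime.
        Thread.VolatileWrite(ref _lastTicks, ticks);
    }

    public long Read()
    {
        // Volatile, atomic 64-bit read.
        return Thread.VolatileRead(ref _lastTicks);
    }
}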
I guess it comes down to what the memory model can guarantee. I don't know a vast amount about the CLI memory model (which C# has to use), but I know it'll guarantee 32 bits... but not 64 (although it will guarantee a 64-bit reference on x64; the full rules are in §17.4.3 of ECMA-334 v4). So it can't be volatile. You still have the Interlocked methods, though (such as long Interlocked.Exchange(ref long, long) and long Interlocked.Increment(ref long), etc.).
I'm guessing that longs can't be volatile in C# because they are larger than 32 bits and cannot be accessed in a single atomic operation. Even if they are not held in a register or CPU cache, because it takes more than one operation to read or write the value, it is possible for one thread to read the value while another is in the process of writing it.
I believe there is a difference between how Java implements volatile fields and how .NET does, but I'm not sure of the details. Java might use a lock on the field to prevent the problem that C# has.