This is a continuation of the discussion on multithreading issues in C#.
In C++, unprotected access to shared data from multiple threads is undefined behavior* if a write operation is involved. What is it in C#? As (the safe part of) C# doesn't contain undefined behavior, are there any guarantees? C# seems to have a kind of as-if rule as well, but after reading the mentioned part of the standard I fail to see what the consequences of unprotected data access are from the language's point of view.
In particular, it's interesting to know which kinds of optimizations, including load fusing and load invention, are prohibited by the language. This prohibition would imply the validity (or the lack thereof) of several popular patterns in C# (including the one discussed in the original question).
[The details of the actual implementation in the Microsoft CLR, despite being very interesting, are not part of this question: only the guarantees given by the language itself (and therefore portable) are under discussion here.]
Normative references are very welcome, but I suspect the C# standard doesn't yet have enough information on the topic. Maybe someone from the language team can shed some light on the actual guarantees that are going to be included in the standard later but can be relied upon right now.
I suspect that there are some implied guarantees, such as the absence of reference tearing, because tearing could easily break type safety. But I'm not an expert on the topic.
*Often shortened to UB. Undefined behavior allows a C++ compiler to produce literally any code, including formatting the hard disk or whatever, or to crash at compile time.
The .NET runtime guarantees that reads and writes of some variable types are atomic:
Reads and writes of the following data types shall be atomic: bool, char, byte, sbyte, short, ushort, uint, int, float, and reference types. In addition, reads and writes of enum types with an underlying type in the previous list shall also be atomic. Reads and writes of other types, including long, ulong, double, and decimal, as well as user-defined types, need not be atomic. Aside from the library functions designed for that purpose, there is no guarantee of atomic read-modify-write, such as in the case of increment or decrement.
Not mentioned is IntPtr, which I believe is also guaranteed to be atomic. Since reads and writes of references are atomic, references are guaranteed not to tear. See also "The C# Memory Model in Theory and Practice" for more information.
There should also be a guarantee of memory safety, i.e. that any memory access will reference valid memory and that all memory is initialized before use, with some exceptions for things like unmanaged resources, unsafe code, and stackalloc.
The general rule with regard to optimization is that the compiler/JIT may perform any optimization as long as the result would be identical for a single-threaded program. So tearing, fusing, reordering, etc. would all be possible, absent any synchronization.
So always use appropriate synchronization whenever there is a possibility that multiple threads use the same memory concurrently for anything except reading. Note that ARM has weaker memory ordering guarantees than x86/x64, further emphasizing the need for synchronization.
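For illustration, here is a minimal sketch of guarding shared state with a lock; the class and field names (Counter, _gate, _value) are made up for the example and are not from the discussion above.

public class Counter
{
    private readonly object _gate = new object();
    private long _value; // long: plain reads/writes are not guaranteed atomic

    public void Increment()
    {
        lock (_gate) // one thread at a time; entering/exiting also acts as a memory barrier
        {
            _value++; // the read-modify-write is safe only while the lock is held
        }
    }

    public long Read()
    {
        lock (_gate) // take the same lock for reads to avoid torn or stale values
        {
            return _value;
        }
    }
}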
As mentioned by @JonasH, the C# spec only guarantees atomic access to values sized 32 bits or smaller.
But assuming you can rely on C# always being implemented on a runtime conforming to ECMA-335, you can rely on that spec as well. This should be safe, as all implementations of .NET, including Mono and WASM, conform to ECMA-335 (it is not a Microsoft-only spec).
ECMA-335 guarantees access to native-sized values, which includes IntPtr and object references, as well as 64-bit integers on a 64-bit architecture.
ECMA-335 says:
12.6.6 Atomic reads and writes
A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §12.6.2) when all the write accesses to a location are the same size. Atomic writes shall alter no bits other than those written. Unless explicit layout control (see Partition II (Controlling Instance Layout)) is used to alter the default behavior, data elements no larger than the natural word size (the size of a native int) shall be properly aligned. Object references shall be treated as though they are stored in the native word size.
[Note: There is no guarantee about atomic update (read-modify-write) of memory, except for methods provided for that purpose as part of the class library (see Partition IV). An atomic write of a "small data item" (an item no larger than the native word size) is required to do an atomic read/modify/write on hardware that does not support direct writes to small data items. end note]
You seem to be asking specifically about the atomicity of this code:
if (SomeEvent != null) SomeEvent(this, args);
This code is not guaranteed to be thread-safe, either by the C# spec or by the .NET spec. While it is true that an optimizing JIT compiler might generate thread-safe code, it's unsafe to rely on it.
Instead, use the better (and more concise) form below, which is guaranteed to be thread-safe:
SomeEvent?.Invoke(this, args);
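The null-conditional form reads the event field only once. If you want the explicit equivalent, the classic pattern copies the delegate into a local before checking it; a sketch (assuming SomeEvent is an ordinary EventHandler-style event):

var handler = SomeEvent;   // single read of the shared field
if (handler != null)
{
    handler(this, args);   // invokes the snapshot; it cannot become null between the check and the call
}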
Related
I've been reading a lot about floating-point determinism in .NET, i.e. ensuring that the same code with the same inputs will give the same results across different machines. Since .NET lacks options like Java's strictfp and MSVC's fp:strict, the consensus seems to be that there is no way around this issue using pure managed code. The C# game AI Wars has settled on using fixed-point math instead, but this is a cumbersome solution.
The main issue appears to be that the CLR allows intermediate results to live in FPU registers that have higher precision than the type's native precision, leading to unpredictably higher-precision results. An MSDN article by CLR engineer David Notario explains the following:
Note that with current spec, it’s still a language choice to give ‘predictability’. The language may insert conv.r4 or conv.r8 instructions after every FP operation to get a ‘predictable’ behavior. Obviously, this is really expensive, and different languages have different compromises. C#, for example, does nothing, if you want narrowing, you will have to insert (float) and (double) casts by hand.
This suggests that one may achieve floating-point determinism simply by inserting explicit casts for every expression and sub-expression that evaluates to float. One might write a wrapper type around float to automate this task. This would be a simple and ideal solution!
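For illustration, such a wrapper might look roughly like this; the struct name (StrictFloat) and the operator set are assumptions of the sketch, and whether the explicit casts actually buy determinism is exactly what the rest of the question is about.

public struct StrictFloat
{
    private readonly float value;

    public StrictFloat(float value) { this.value = value; }

    public static StrictFloat operator +(StrictFloat a, StrictFloat b)
    {
        // The explicit (float) cast is the whole point: it asks the compiler
        // to narrow the intermediate result to 32 bits.
        return new StrictFloat((float)(a.value + b.value));
    }

    public static StrictFloat operator *(StrictFloat a, StrictFloat b)
    {
        return new StrictFloat((float)(a.value * b.value));
    }

    public static implicit operator float(StrictFloat s) { return s.value; }
}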
Other comments, however, suggest that it isn't so simple. Eric Lippert recently stated:
in some version of the runtime, casting to float explicitly gives a different result than not doing so. When you explicitly cast to float, the C# compiler gives a hint to the runtime to say "take this thing out of extra high precision mode if you happen to be using this optimization".
Just what is this "hint" to the runtime? Does the C# spec stipulate that an explicit cast to float causes the insertion of a conv.r4 in the IL? Does the CLR spec stipulate that a conv.r4 instruction causes a value to be narrowed down to its native size? Only if both of these are true can we rely on explicit casts to provide floating point "predictability" as explained by David Notario.
Finally, even if we can indeed coerce all intermediate results to the type's native size, is this enough to guarantee reproducibility across machines, or are there other factors like FPU/SSE run-time settings?
Just what is this "hint" to the runtime?
As you conjecture, the compiler tracks whether a conversion to double or float was actually present in the source code, and if it was, it always inserts the appropriate conv opcode.
Does the C# spec stipulate that an explicit cast to float causes the insertion of a conv.r4 in the IL?
No, but I assure you that there are unit tests in the compiler test cases that ensure that it does. Though the specification does not demand it, you can rely on this behaviour.
The specification's only comment is that any floating point operation may be done in a higher precision than required at the whim of the runtime, and that this can make your results unexpectedly more accurate. See section 4.1.6.
Does the CLR spec stipulate that a conv.r4 instruction causes a value to be narrowed down to its native size?
Yes, in Partition I, section 12.1.3, which I note you could have looked up yourself rather than asking the internet to do it for you. These specifications are free on the web.
A question you didn't ask but probably should have:
Is there any operation other than casting that truncates floats out of high precision mode?
Yes. Assigning to a static field, instance field or element of a double[] or float[] array truncates.
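So, if these rules are taken at face value, routing an intermediate through a float field or array element is another way to force narrowing; a sketch (the names are illustrative, and the observable effect depends on the jitter and architecture):

static float scratch; // static field: storing to it truncates to float precision

static float SumTruncated(float[] xs)
{
    var buffer = new float[1]; // array element: storing to it also truncates
    float acc = 0f;
    foreach (float x in xs)
    {
        buffer[0] = acc + x; // intermediate narrowed to 32-bit float here
        acc = buffer[0];
    }
    scratch = acc;           // narrowed again by the field store
    return scratch;
}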
Is consistent truncation enough to guarantee reproducibility across machines?
No. I encourage you to read section 12.1.3, which has much interesting to say on the subject of denormals and NaNs.
And finally, another question you did not ask but probably should have:
How can I guarantee reproducible arithmetic?
Use integers.
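If integers are acceptable, the fixed-point approach mentioned in the question might look roughly like this; the struct name, the choice of 32 fractional bits, and the absence of overflow handling are all assumptions of the sketch.

public struct Fixed32
{
    private const int FractionBits = 32;
    private readonly long raw; // value scaled by 2^32: integer math, fully reproducible

    private Fixed32(long raw) { this.raw = raw; }

    public static Fixed32 FromInt(int value)
    {
        return new Fixed32((long)value << FractionBits);
    }

    public static Fixed32 operator +(Fixed32 a, Fixed32 b)
    {
        return new Fixed32(a.raw + b.raw);
    }

    public static Fixed32 operator *(Fixed32 a, Fixed32 b)
    {
        // Rescale the product; done with a simple (lossy) pre-shift for brevity.
        return new Fixed32((a.raw >> 16) * (b.raw >> 16));
    }

    public double ToDouble() { return raw / (double)(1L << FractionBits); }
}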
The 8087 Floating Point Unit chip design was Intel's billion-dollar mistake. The idea looks good on paper: give it an 8-register stack that stores values in extended precision, 80 bits, so that you can write calculations whose intermediate values are less likely to lose significant digits.
The beast is however impossible to optimize for. Storing a value from the FPU stack back to memory is expensive, so keeping values inside the FPU is a strong optimization goal. Inevitably, having only 8 registers is going to require a write-back if the calculation is deep enough. It is also implemented as a stack, not as freely addressable registers, so that requires gymnastics as well, which may produce a write-back. Inevitably, a write-back truncates the value from 80 bits back to 64 bits, losing precision.
So the consequence is that non-optimized code does not produce the same result as optimized code, and small changes to the calculation can have big effects on the result when an intermediate value ends up needing to be written back. The /fp:strict option is a hack around that: it forces the code generator to emit a write-back to keep the values consistent, but with an inevitable and considerable loss of perf.
This is a complete rock-and-a-hard-place problem. For the x86 jitter, they just didn't try to address it.
Intel didn't make the same mistake when they designed the SSE instruction set. The XMM registers are freely addressable and don't store extra bits. If you want consistent results, then compiling with the AnyCPU target on a 64-bit operating system is the quick solution: the x64 jitter uses SSE instead of FPU instructions for floating-point math. Albeit that this adds a third way that a calculation can produce a different result. If the calculation is wrong because it loses too many significant digits, then at least it will be consistently wrong. Which is a bit of a bromide, really, but typically only as far as a programmer looks.
It's a simple-looking question:
Given that native-sized integers are the best for arithmetic, why doesn't C# (or any other .NET language) support arithmetic with the native-sized IntPtr and UIntPtr?
Ideally, you'd be able to write code like:
for (IntPtr i = 1; i < arr.Length; i += 2) //arr.Length should also return IntPtr
{
    arr[i - 1] += arr[i]; //something random like this
}
so that it would work on both 32-bit and 64-bit platforms. (Currently, you have to use long.)
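For comparison, a sketch of the workaround that compiles today, using long for the index (the extra 64-bit range is simply wasted on a 32-bit platform):

for (long i = 1; i < arr.Length; i += 2)
{
    arr[i - 1] += arr[i]; // C# accepts long array indices; they are converted to native int in the IL
}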
Edit:
I'm not using these as pointers (the word "pointer" wasn't even mentioned)! They can be just treated as the C# counterpart of native int in MSIL and of intptr_t in C's stdint.h -- which are integers, not pointers.
In .NET 4, arithmetic between a left-hand operand of type IntPtr and a right-hand int offset is supported (via the + and - operators and the IntPtr.Add/Subtract methods).
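A short sketch of what that looks like; note that this only covers offsetting an IntPtr by an int, which is still not general native-integer arithmetic.

using System;

class IntPtrArithmeticDemo
{
    static void Main()
    {
        IntPtr basePtr = new IntPtr(0x1000);

        // .NET 4+: IntPtr + int and IntPtr.Add(IntPtr, int) are available.
        IntPtr next = basePtr + 16;
        IntPtr same = IntPtr.Add(basePtr, 16);

        Console.WriteLine(next == same); // True
        Console.WriteLine(IntPtr.Size);  // 4 in a 32-bit process, 8 in a 64-bit process
    }
}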
[Edit]:
As other people have said, these types are designed to represent pointers in native code (as the name IntPtr implies). It's fine to claim you're using them as native integers rather than pointers, but you can't overlook the fact that one of the primary reasons the native size of an integer ever matters is its use as a pointer. If you're performing mathematical operations, or other general work that is independent of the processor and memory architecture your code runs on, it is arguably more useful and intuitive to use types such as int and long, whose fixed sizes and upper and lower bounds you know in every situation regardless of hardware.
Just as the type IntPtr is designed to represent a native pointer, its arithmetic operations are designed to represent the logical operations you would perform on a pointer: adding an integer offset to a native pointer to reach a new native pointer (note that adding two IntPtrs is not supported, and nor is using an IntPtr as the right-hand operand).
Maybe native-sized integers make for the fastest arithmetic, but they certainly don't make for the most error-free programs.
Personally I hate programming with integer types whose sizes I do not know when I sit down to start typing (I'm looking at you, C++), and I definitely prefer the peace of mind the CLR types give you over the very doubtful and certainly conditional performance benefit that using CPU instructions tailored to the platform might offer.
Consider also that the JIT compiler can optimize for the architecture the process is running on, in contrast to a "regular" compiler which has to generate machine code without having access to this information. The JIT compiler might therefore generate code just as fast because it knows more.
I imagine I'm not alone in thinking this, so it might count as a reason.
I can actually think of one reason why an IntPtr (or UIntPtr) would be useful: accessing elements of an array requires native-sized integers. Though native integers are never exposed to the programmer, they are used internally in IL. Something like some_array[index] in C# will actually compile down to some_array[(int)checked((IntPtr)index)] in IL. I noticed this after disassembling my own code with ILSpy. (The index variable is 64-bit in my code.) To verify that the disassembler wasn't making a mistake, Microsoft's own ILDASM tool shows conv.u and conv.i instructions within my assembly; those instructions convert integers to the system's native representation. I don't know what the performance implication of all these conversion instructions in the IL is, but hopefully the JIT is smart enough to optimize the penalty away; if not, the next best thing would be to allow manipulating native integers without conversions (which, in my opinion, might be the main motivation for a native type).
Currently, the F# language allows the use of nativeint and its unsigned counterpart for arithmetic. However, arrays can only be indexed by int in F#, which means nativeint is not very useful for indexing arrays.
If it really bothers you that much, write your own compiler that lifts restrictions on native integer use, create your own language, write your code in IL, or tweak the IL after compiling. Personally, I think it's a bad idea to squeeze out extra performance or save memory by using native int. If you wanted your code to fit the system like a glove, you'd best be using a lower level language with support for processor intrinsics.
.NET Framework tries not to introduce operations that can't be explained. For example, there is no DateTime + DateTime because there is no such concept as the sum of two dates. The same reasoning applies to pointer types: there is no concept of the sum of two pointers. The fact that IntPtr is stored as a platform-dependent int value does not really matter; there are a lot of other types that are internally stored as basic values (again, DateTime can be represented as a long).
Because that's not a "safe" way of handling memory addressing. Pointer arithmetic can lead to all sorts of bugs and memory addressing problems that C# is designed explicitly to avoid.
The C# spec states in section 5.5 that reads and writes on certain types (namely bool, char, byte, sbyte, short, ushort, uint, int, float, and reference types) are guaranteed to be atomic.
This has piqued my interest. How can you do that? I mean, my lowly personal experience only showed me to lock variables or to use barriers if I wanted reads and writes to look atomic; that would be a performance killer if it had to be done for every single read/write. And yet C# does something with a similar effect.
Perhaps other languages (like Java) do it. I seriously don't know. My question isn't really intended to be language-specific, it's just that I know C# does it.
I understand that it might have to deal with certain specific processor instructions, and may not be usable in C/C++. However, I'd still like to know how it works.
[EDIT] To tell the truth, I believed that reads and writes could be non-atomic in certain conditions, like a CPU could access a memory location while another CPU is writing there. Does this only happen when the CPU can't treat all the object at once, like because it's too big or because the memory is not aligned on the proper boundary?
The reason those types have guaranteed atomicity is that they are all 32 bits or smaller. Since .NET only runs on 32- and 64-bit operating systems, the processor can read and write the entire value in a single operation. This is in contrast to, say, an Int64 on a 32-bit platform, which must be read and written using two 32-bit operations.
I'm not really a hardware guy so I apologize if my terminology makes me sound like a buffoon but it's the basic idea.
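If you do need atomic 64-bit reads and writes on a 32-bit platform, the class library provides them explicitly; a sketch using Interlocked (the type and member names are illustrative):

using System.Threading;

class SharedTicks
{
    private long _value; // plain long reads/writes may tear on a 32-bit platform

    public void Set(long newValue)
    {
        Interlocked.Exchange(ref _value, newValue); // atomic 64-bit write, even on 32-bit hardware
    }

    public long Get()
    {
        return Interlocked.Read(ref _value); // atomic 64-bit read, even on 32-bit hardware
    }
}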
It is fairly cheap to implement the atomicity guarantee on x86 and x64 cores since the CLR only promises atomicity for variables that are 32-bit or smaller. All that's required is that the variable is properly aligned and doesn't straddle a cache line. The JIT compiler ensures this by allocating local variables on a 4-byte aligned stack offset. The GC heap manager does the same for heap allocations.
Notably, the CLR guarantee is not a very good one. The alignment promise is not strong enough to write code that performs consistently for arrays of doubles, as is very nicely demonstrated in this thread. Interop with machine code that uses SIMD instructions is also very difficult for this reason.
On x86, reads and writes are atomic anyway; it's supported at the hardware level. This, however, does not mean that operations like addition and multiplication are atomic; they require a load, a computation, and then a store, which means they can interfere. That's where the lock prefix comes in.
You mentioned locking and memory barriers; they don't have anything to do with reads and writes being atomic. There is no way on x86 with or without using memory barriers that you're going to see a half-written 32-bit value.
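The managed counterpart of that lock prefix is the Interlocked class; a quick sketch of the difference (the field names are made up):

using System.Threading;

class Counters
{
    private int _lossyCount;
    private int _safeCount;

    public void Touch()
    {
        // Load + add + store: each step is atomic, the combination is not,
        // so concurrent callers can lose increments.
        _lossyCount++;

        // Emitted as a locked add/xadd on x86/x64: the whole read-modify-write is atomic.
        Interlocked.Increment(ref _safeCount);
    }
}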
Yes, C# and Java guarantee that loads and stores of some primitive types are atomic, like you say. This is cheap because the processors capable of running .NET or the JVM do guarantee that loads and stores of suitably aligned primitive types are atomic.
Now, what neither C# nor Java (nor the processors they run on) guarantees by default, and what is expensive, is issuing the memory barriers needed so that those variables can be used for synchronization in a multi-threaded program. However, in Java and C# you can mark your variable with the volatile keyword, in which case the compiler takes care of issuing the appropriate memory barriers.
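A small sketch of that (the class and field names are illustrative):

class Worker
{
    // volatile: reads and writes get the required visibility/ordering guarantees,
    // so the flag is safe to use for cross-thread signalling.
    private volatile bool _stopRequested;

    public void RequestStop()
    {
        _stopRequested = true; // becomes visible to the worker thread
    }

    public void Run()
    {
        while (!_stopRequested) // re-read from memory on every iteration
        {
            // ... do a unit of work ...
        }
    }
}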
You can't. Even going all the way down to assembly language you have to use special LOCK opcodes in order to guarantee that another core or even process isn't going to come around and wipe out all your hard work.
Can anyone explain to me what the benefits and drawbacks of the two different approaches are?
When a double or long in Java is volatile, §17.7 of the Java Language Specification requires that they are read and written atomically. When they are not volatile, they can be written in multiple operations. This can result, for example, in the upper 32 bits of a long containing a new value, while the lower 32 bits still contain the old value.
Atomic reads and writes are easier for a programmer to reason about and write correct code with. However, support for atomic operations might impose a burden on VM implementers in some environments.
I don't know the reason why volatile cannot be applied to 64-bit ints in C#, but you can use Thread.VolatileWrite to do what you want.
The volatile keyword is just syntactic sugar on this call.
excerpt:
Note: In C#, using the volatile modifier on a field guarantees that all access to that field uses Thread.VolatileRead or Thread.VolatileWrite.
The syntactic sugar (keyword) applies to 32-bit ints, but you can use the actual method calls on 64-bit ints.
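For example, a sketch of what that looks like for a long field (on .NET 4.5 and later the same thing is usually written with Volatile.Read and Volatile.Write, which also accept long):

using System.Threading;

class Timestamp
{
    private long _lastTicks; // the volatile keyword cannot be applied to a long field

    public void Update(long ticks)
    {
        Thread.VolatileWrite(ref _lastTicks, ticks); // volatile write via the method call
    }

    public long ReadLatest()
    {
        return Thread.VolatileRead(ref _lastTicks); // matching volatile read
    }
}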
I guess it comes down to what the memory model can guarantee. I don't know a vast amount about the CLI memory model (that C# has to use), but I know it'll guarantee 32 bits... but not 64 (although it'll guarantee a 64-bit reference on x64 - the full rules are in §17.4.3 of ECMA 334v4). So it can't be volatile. You still have the Interlocked methods, though (such as long Interlocked.Exchange(ref long, long) and long Interlocked.Increment(ref long), etc.).
I'm guessing that longs can't be volatile in C# because they are larger than 32 bits and cannot be accessed in a single atomic operation. Even though they would not be stored in a register or CPU cache, because it takes more than one operation to read or write the value, it is possible for one thread to read the value while another is in the process of writing it.
I believe there is a difference between how Java implements volatile fields and how .NET does, but I'm not sure of the details. Java might use a lock on the field to prevent the problem that C# has.