I almost went crazy trying to debug a random 40x performance drop when running on x86 in an algorithm which makes heavy use of Interlocked.CompareExchange with an Int64.
I finally isolated the issue: it occurred only when the Int64 in question was not 8-byte aligned.
No matter how I explicitly position the field with a StructLayout, the alignment depends on the base address of the outer object on the heap. On x86 the base address will be either 4-byte or 8-byte aligned.
I thought of defining a 12-byte struct and placing the Int64 at offset 0 or offset 4 depending on the alignment, but that's kinda hacky.
Is there a good practice in C# for performing Interlocked operations on an Int64 on x86 that guarantees proper alignment?
EDIT
The code can be found here:
https://github.com/akkadotnet/akka.net/pull/1569#discussion-diff-47997213R520
It's a thread pool implementation based on the CLR ThreadPool. The issue is about storing the state of a custom semaphore in an 8-byte struct and modifying it with InterlockedCompareExchange64.
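One workaround often suggested for x86 (a sketch under my own assumptions, not code from the PR; AlignedInt64Box is a hypothetical name) is to take the Int64 off the managed heap entirely: over-allocate unmanaged memory and round the pointer up to an 8-byte boundary, so the alignment no longer depends on where the GC places the object.

using System;
using System.Runtime.InteropServices;
using System.Threading;

unsafe sealed class AlignedInt64Box : IDisposable
{
    private readonly IntPtr _raw;     // start of the over-allocated block
    private readonly long* _aligned;  // 8-byte-aligned slot inside it

    public AlignedInt64Box()
    {
        // Allocate 8 spare bytes so we can always round up to a multiple of 8.
        _raw = Marshal.AllocHGlobal(sizeof(long) + 8);
        _aligned = (long*)(((long)_raw + 7) & ~7L);
    }

    public long CompareExchange(long value, long comparand) =>
        Interlocked.CompareExchange(ref *_aligned, value, comparand);

    public void Dispose() => Marshal.FreeHGlobal(_raw);
}

The unmanaged block never moves, so the aligned slot stays aligned for the lifetime of the box; the cost is manual lifetime management and an unsafe context.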
Related
I am a tinkerer—no doubt about that. For this reason (and very little beyond that), I recently did a little experiment to confirm my suspicion that writing to a struct is not an atomic operation, which means that a so-called "immutable" value type which attempts to enforce certain constraints could hypothetically fail at its goal.
I wrote a blog post about this using the following type as an illustration:
struct SolidStruct
{
    public SolidStruct(int value)
    {
        X = Y = Z = value;
    }

    public readonly int X;
    public readonly int Y;
    public readonly int Z;
}
While the above looks like a type for which it could never be true that X != Y or Y != Z, in fact this can happen if a value is "mid-assignment" at the same time it is copied to another location by a separate thread.
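For illustration, here is a minimal sketch of the kind of experiment described (my reconstruction, not the blog's actual code): one thread keeps overwriting a shared SolidStruct while another copies it and checks the invariant X == Y == Z.

using System;
using System.Threading;

class TearingDemo
{
    static SolidStruct _shared;

    static void Main()
    {
        new Thread(() =>
        {
            var rng = new Random();
            while (true) _shared = new SolidStruct(rng.Next());
        }) { IsBackground = true }.Start();

        while (true)
        {
            SolidStruct copy = _shared;  // a non-atomic 12-byte copy
            if (copy.X != copy.Y || copy.Y != copy.Z)
            {
                Console.WriteLine($"Torn read: {copy.X}, {copy.Y}, {copy.Z}");
                return;
            }
        }
    }
}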
OK, big deal. A curiosity and little more. But then I had this hunch: my 64-bit CPU should actually be able to copy 64 bits atomically, right? So what if I got rid of Z and just stuck with X and Y? That's only 64 bits; it should be possible to overwrite those in one step.
Sure enough, it worked. (I realize some of you are probably furrowing your brows right now, thinking, Yeah, duh. How is this even interesting? Humor me.) Granted, I have no idea whether this is guaranteed or not given my system. I know next to nothing about registers, cache misses, etc. (I am literally just regurgitating terms I've heard without understanding their meaning); so this is all a black box to me at the moment.
The next thing I tried—again, just on a hunch—was a struct consisting of 32 bits using 2 short fields. This seemed to exhibit "atomic assignability" as well. But then I tried a 24-bit struct, using 3 byte fields: no go.
Suddenly the struct appeared to be susceptible to "mid-assignment" copies once again.
Down to 16 bits with 2 byte fields: atomic again!
Could someone explain to me why this is? I've heard of "bit packing", "cache line straddling", "alignment", etc.—but again, I don't really know what all that means, nor whether it's even relevant here. But I feel like I see a pattern, without being able to say exactly what it is; clarity would be greatly appreciated.
The pattern you're looking for is the native word size of the CPU.
Historically, the x86 family worked natively with 16-bit values (and before that, 8-bit values). For that reason, your CPU can handle these atomically: it's a single instruction to set these values.
As time progressed, the native element size increased to 32 bits, and later to 64 bits. In every case, an instruction was added to handle this specific amount of bits. However, for backwards compatibility, the old instructions were still kept around, so your 64-bit processor can work with all of the previous native sizes.
Since your struct elements are stored in contiguous memory (without padding, i.e. empty space), the runtime can exploit this knowledge to only execute that single instruction for elements of these sizes. Put simply, that creates the effect you're seeing, because the CPU can only execute one instruction at a time (although I'm not sure if true atomicity can be guaranteed on multi-core systems).
However, the native element size was never 24 bits. Consequently, there is no single instruction to write 24 bits, so multiple instructions are required for that, and you lose the atomicity.
The C# standard (ISO 23270:2006, ECMA-334) has this to say regarding atomicity:
12.5 Atomicity of variable references
Reads and writes of the following data types shall be atomic: bool, char, byte, sbyte, short, ushort, uint, int, float, and reference types. In addition, reads and writes of enum types with an underlying type in the previous list shall also be atomic. Reads and writes of other types, including long, ulong, double, and decimal, as well as user-defined types, need not be atomic. (emphasis mine) Aside from the library functions designed for that purpose, there is no guarantee of atomic read-modify-write, such as in the case of increment or decrement.

Your example X = Y = Z = value is shorthand for 3 separate assignment operations, each of which is defined to be atomic by 12.5. The sequence of 3 operations (assign value to Z, assign Z to Y, assign Y to X) is not guaranteed to be atomic.
Since the language specification doesn't mandate atomicity, X = Y = Z = value; might happen to execute as an atomic operation, but whether it does depends on a whole bunch of factors:
the whims of the compiler writers
what code-generation optimization options, if any, were selected at build time
the details of the JIT compiler responsible for turning the assembly's IL into machine language. Identical IL run under Mono, say, might exhibit different behaviour than when run under .NET 4.0 (and that might even differ from earlier versions of .NET).
the particular CPU on which the assembly is running.
One might also note that even a single machine instruction is not necessarily guaranteed to be an atomic operation; many are interruptible.
Further, visiting the CLI standard (ISO/IEC 23271:2006, ECMA-335), we find section 12.6.6:
12.6.6 Atomic reads and writes
A conforming CLI shall guarantee that read and write access to properly aligned memory locations no larger than the native word size (the size of type native int) is atomic (see §12.6.2) when all the write accesses to a location are the same size. Atomic writes shall alter no bits other than those written. Unless explicit layout control (see Partition II (Controlling Instance Layout)) is used to alter the default behavior, data elements no larger than the natural word size (the size of a native int) shall be properly aligned. Object references shall be treated as though they are stored in the native word size.

[Note: There is no guarantee about atomic update (read-modify-write) of memory, except for methods provided for that purpose as part of the class library (see Partition IV). (emphasis mine) An atomic write of a “small data item” (an item no larger than the native word size) is required to do an atomic read/modify/write on hardware that does not support direct writes to small data items. end note]

[Note: There is no guaranteed atomic access to 8-byte data when the size of a native int is 32 bits, even though some implementations might perform atomic operations when the data is aligned on an 8-byte boundary. end note]
x86 CPU operations take place in 8, 16, 32, or 64 bits; manipulating other sizes requires multiple operations.
The compiler and x86 CPU are going to be careful to move only exactly as many bytes as the structure defines. There are no x86 instructions that can move 24 bits in one operation, but there are single instruction moves for 8, 16, 32, and 64 bit data.
If you add another byte field to your 24 bit struct (making it a 32 bit struct), you should see your atomicity return.
Some compilers allow you to define padding on structs to make them behave like native register-sized data. If you pad your 24-bit struct, the compiler will add another byte to "round up" the size to 32 bits so that the whole structure can be moved in one atomic instruction. The downside is that your structure will always occupy about 33% more space in memory.
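In C#, a hedged sketch of that padding idea (ThreeBytesPadded is an illustrative name): the Size field of StructLayout asks the runtime to round the struct up to 4 bytes.

using System.Runtime.InteropServices;

// Without the attribute this struct would occupy 3 bytes; Size = 4 adds one
// padding byte so the whole value fits a single 32-bit move.
[StructLayout(LayoutKind.Sequential, Size = 4)]
struct ThreeBytesPadded
{
    public byte A;
    public byte B;
    public byte C;
}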
Note that alignment of the structure in memory is also critical to atomicity. If a multibyte structure does not begin at an aligned address, it may span multiple cache lines in the CPU cache. Reading or writing this data will require multiple clock cycles and multiple read/writes even though the opcode is a single move instruction. So, even single instruction moves may not be atomic if the data is misaligned. x86 does guarantee atomicity for native sized read/writes on aligned boundaries, even in multicore systems.
It is possible to achieve memory atomicity with multi-step moves using the x86 LOCK prefix. However, this should be avoided as it can be very expensive in multicore systems (LOCK not only blocks other cores from accessing memory, it also locks the system bus for the duration of the operation, which can impact disk I/O and video operations; LOCK may also force the other cores to purge their local caches).
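For completeness, a small hedged sketch: in C#, the Interlocked class is the managed route to these LOCK-prefixed instructions (the instructions named in the comments are typical JIT output, not guaranteed).

using System.Threading;

static class Counters
{
    static long _counter;

    static void Touch()
    {
        Interlocked.Increment(ref _counter);              // e.g. lock add / lock xadd
        Interlocked.CompareExchange(ref _counter, 1, 0);  // e.g. lock cmpxchg
    }
}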
This question is related to the physical memory of a C# program. As we know, a byte variable consumes 1 byte of memory while an int (32-bit) variable consumes 4 bytes. So, when we need variables with possibly smaller values (such as a counter variable i to iterate a loop 100 times), which one should we use in the below for loop: byte or int?
for (byte i = 0; i < 100; ++i)
Kindly give your opinion with reasons and share your precious knowledge. I shall be glad and thankful to you :-)
Note: I use byte instead of int in such cases. But I have seen that many experienced programmers use int even when the expected values are less than 255. Please let me know if I am wrong. :-)
In most cases, you won't get any benefit from using byte instead of int. The reason is:
If the loop variable is stored in a CPU register: since modern CPUs have a register width of at least 32 bits, and since you can't use only one fourth of a register, the resulting code would be pretty much the same either way.
If the loop variable is not stored in a CPU register, it will most likely be stored on the stack. Compilers try to align memory locations at addresses which are multiples of 4 for performance reasons, so the compiler would assign 4 bytes to your byte variable on the stack anyway.
Depending on the details of your code, the compiler may even have to add extra instructions to make sure the value (on the stack or in a register) never exceeds 255, which makes the byte version slower, not faster.
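A small hedged illustration of that last point: in C#, ++i on a byte compiles as (byte)(i + 1), so its IL carries a truncation the int loop does not need (the IL shown is typical compiler output).

for (byte i = 0; i < 100; ++i) { }  // IL: ldloc.0; ldc.i4.1; add; conv.u1; stloc.0
for (int i = 0; i < 100; ++i) { }   // IL: ldloc.0; ldc.i4.1; add; stloc.0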
It's a totally different story with 8-bit microcontrollers like those from Atmel and Microchip; there, your approach would make sense.
I can't really understand why the object header got twice as big in 64-bit applications.
The object header was 8 bytes, and in 64-bit it is 16. What are these additional bytes used for?
The object header is made up of two fields: the syncblk and the method table pointer (aka "type handle"). The second field is easy to understand; it is a pointer, so it must grow from 4 to 8 bytes in 64-bit mode.
The syncblk is the much less obvious case. It is a mix of flags and values (lock owner thread id, hash code, sync block index), so there is no reason to make it bigger in 64-bit mode. What matters is what happens after the object is collected by the GC. If the free space was not eliminated by compacting the heap, then the object space participates in the free block list, which works like a doubly-linked list. The second field is the forward pointer to the next free block. The object data space is used to store the size of the free block, the basic reason why an object is never less than 12 bytes. And the syncblk stores the back pointer to the previous free block. So now it must be big enough to store a pointer, and therefore needs to grow to 8 bytes. So it is 8 + 8 = 16 bytes.
Fwiw, the minimum object size in 64-bit mode is 24 bytes, even though 8 + 8 + 4 = 20 bytes would do just fine, just to ensure that everything is aligned to 8. Alignment matters a great deal; you'd never want a pointer value to straddle an L1 cache line, which makes accessing it about 3x slower. The <gcAllowVeryLargeObjects> option, added later, is another reason.
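A hedged way to observe that 24-byte minimum yourself (a rough sketch; the numbers are GC- and runtime-dependent, so treat the output as an approximation):

using System;

class Empty { }

class Program
{
    static void Main()
    {
        const int N = 1_000_000;
        var keep = new object[N];  // allocate the array before measuring
        long before = GC.GetTotalMemory(forceFullCollection: true);
        for (int i = 0; i < N; i++) keep[i] = new Empty();
        long after = GC.GetTotalMemory(forceFullCollection: true);
        // Prints roughly 24.0 on a 64-bit runtime.
        Console.WriteLine($"~{(after - before) / (double)N:F1} bytes/object");
        GC.KeepAlive(keep);
    }
}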
I'm pretty new to this, so if the question doesn't make sense, I apologize ahead of time.
An int in C# is 4 bytes, if I am correct. If I have the statement:
int x;
I would assume this is taking up 4 bytes of memory. If each memory address holds 1 byte, then this would take up four address slots? If so, how does x map to the four address locations?
If I have the statement int x; I would assume this is taking up 4 bytes of memory. How does x map to the address of the four bytes?
First off, Mike is correct. C# has been designed specifically so that you do not need to worry about this stuff. Let the memory manager take care of it for you; it does a good job.
Assuming you do want to see how the sausage is made for your own edification: your assumption is not warranted. This statement does not need to cause any memory to be consumed. If it does cause memory to be consumed, the int consumes four bytes of memory.
There are two ways in which the local variable (*) can consume no memory. The first is that it is never used:
void M()
{
    int x;
}
The compiler can be smart enough to know that x is never written to or read from, and it can be legally elided entirely. Obviously it then takes up no memory.
The second way that it can take up no memory is if the jitter chooses to enregister the local. It may assign a machine register specifically to that local variable. The variable then has no address associated with it because obviously registers do not have an address. (**)
Assuming that the local does take up memory, the jitter is responsible for keeping track of the location of that memory.
If the local is a perfectly normal local then the jitter will bump the stack pointer by four bytes, thereby reserving four bytes on the stack. It will then associate those four bytes with the local.
If the local is a closed-over outer local of an anonymous function, a local of an iterator block, or a local of an async method then the C# compiler will generate the local as a field of a class; the jitter asks the garbage collector to allocate the class instance and the jitter associates the local with a particular offset from the beginning of the memory buffer associated with that instance by the garbage collector.
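A hedged sketch of that last case (the generated-class name in the comment is illustrative of typical compiler output, not guaranteed):

using System;

class ClosureDemo
{
    static Func<int> M()
    {
        // x is captured by the lambda, so the compiler hoists it into a field
        // of a generated class (often named something like <>c__DisplayClass).
        int x = 42;      // lives on the GC heap via that instance, not the stack
        return () => x;  // the returned delegate keeps the instance alive
    }
}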
All of this is implementation detail subject to change at any time; do not rely upon it.
(*) We know it is a local variable because you said it was a statement. A field declaration is not a statement.
(**) If unsafe code takes the address of a local, obviously it cannot be enregistered.
There's a lot (and I mean a LOT) that can be said about this. Various topics you're hitting on are things like the stack, the symbol table, memory management, the memory hierarchy, ... I could go on.
BUT, since you're new, I'll try to give an easier answer:
When you create a variable in a program (such as an int), you are telling the compiler to reserve a space in memory for that data. An int is 4 bytes, so 4 consecutive bytes are reserved. The memory location you were referring to only points to the beginning. It is known afterwards that the length is 4 bytes.
Now that memory location (in the case you provided) is not really saved in the same way that a variable would be. Every time there is a command that needs x, the command is instead replaced with a command that explicitly grabs that memory location. In other words, the address is saved in the "code" section of your program, not the "data" section.
This is just a really, REALLY high overview. Hopefully it helps.
You really should not need to worry about these things, since there is no way in C# that you could write code that would make use of this information.
But if you must know, at the machine-code level when we instruct the CPU to access the contents of x, it will be referred to using the address of the first one of those four bytes. The machine instruction that will do this will also contain information about how many bytes to be accessed, in this case four.
If the int x; is declared within a function, then the variable will be allocated on the stack, rather than on the heap or in global memory. The address of x in the compiler's symbol table will refer to the first byte of the four-byte integer. However, since it is on the stack, the remembered address will be an offset on the stack rather than a physical address. The variable will then be referenced via an instruction using that offset from the current stack pointer.
Assuming a 32-bit run-time, the offset on the stack will be aligned so the address is a multiple of 4 bytes, i.e. the hex offset will end in 0, 4, 8, or C.
Furthermore because the 80x86 family is little-endian, the first byte of the integer will be the least significant, and the fourth byte will be the most significant, e.g. the decimal value 1,000,000 would be stored as the four bytes 0x40 0x42 0x0f 0x00.
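You can observe that byte order from C# directly; a small sketch (the output shown assumes a little-endian x86/x64 machine):

using System;

class Program
{
    static void Main()
    {
        Console.WriteLine(BitConverter.IsLittleEndian);   // True on x86/x64
        byte[] bytes = BitConverter.GetBytes(1_000_000);  // 1,000,000 = 0x000F4240
        Console.WriteLine(BitConverter.ToString(bytes));  // 40-42-0F-00
    }
}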
Why is Array.Length an int, and not a uint? This bothers me (just a bit) because a length value can never be negative.
This also forced me to use an int for a length property on my own class, because when you specify an int value, it needs to be cast explicitly...
So the ultimate question is: is there any use for an unsigned int (uint)? Even Microsoft seems not to use them.
An unsigned int isn't CLS-compliant and would therefore restrict usage of the property to those languages that do implement a UInt.
See here:
Framework 1.1
Introduction to the .NET Framework Class Library
Framework 2.0
.NET Framework Class Library Overview
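A hedged sketch of what that CLS restriction looks like in practice (Lengths is a made-up type): marking the assembly CLS-compliant makes the compiler flag public uint members.

using System;

[assembly: CLSCompliant(true)]

public class Lengths
{
    // Compiling this produces warning CS3003:
    // "Type of 'Lengths.Count' is not CLS-compliant"
    public uint Count { get; set; }
}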
Many reasons:
uint is not CLS compliant, thus making a built in type (array) dependent on it would have been problematic
The runtime as originally designed prohibits any object on the heap from occupying more than 2GB of memory. Since the maximum-sized array that would be less than or equal to this limit would be new byte[int.MaxValue], it would be puzzling for people to be able to write positive but illegal array lengths.
Note that this limitation has been somewhat removed in the 4.5 release, though the standard Length as int remains.
Historically C# inherits much of its syntax and convention from C and C++. In those languages, arrays are simply pointer arithmetic, so negative array indexing was possible (though normally illegal and dangerous). Since much existing code assumes that the array index is signed, this would have been a factor.
On a related note, the use of signed integers for array indexes in C/C++ means that interop with those languages and with unmanaged functions would require the use of ints in those circumstances anyway, which could confuse due to the inconsistency.
The BinarySearch implementation (a very useful component of many algorithms) relies on being able to use the negative range of the int to indicate that the value was not found, and the location at which such a value should be inserted to maintain sorting (see the sketch after this list).
When operating on an array, it is likely that you would want to take a negative offset from an existing index. If you used an offset which would take you past the start of the array, then with a uint the wrap-around behaviour would make your index possibly legal (in that it is positive). With an int the result would be illegal (but safe, since the runtime would guard against reading invalid memory).
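To illustrate the BinarySearch point above, a small sketch: a negative result encodes the insertion point as its bitwise complement, which only works because lengths and indexes are signed.

int[] sorted = { 1, 3, 5, 7 };
int pos = Array.BinarySearch(sorted, 4);  // 4 is absent, so pos is negative (-3)
if (pos < 0)
{
    int insertAt = ~pos;  // 2: the index where 4 would keep the array sorted
}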
I think it also might have to do with simplifying things at a lower level: Array.Length will, of course, be added to a negative number at some point, and if Array.Length were unsigned and added to a negative int (two's complement), there could be messy results.
Looks like nobody provided an answer to "the ultimate question".
I believe the primary use of unsigned ints is to provide easier interfacing with external systems (P/Invoke and the like) and to cover the needs of various languages being ported to .NET.
Typically, integer values are signed, unless you explicitly need an unsigned value. It's just the way they are used. I may not agree with that choice, but that's just the way it is.
For the time being, with today's typical memory constraints, if your array or similar data structure needs a UInt32 length, you should consider other data structures. With an array of bytes, Int32 will give you 2GB of values.