Why does the compiler optimize ldc.i8 and not ldc.r8?

I'm wondering why this C# code
long b = 20;
is compiled to
ldc.i4.s 0x14
conv.i8
(Because it takes 3 bytes instead of the 9 required by ldc.i8 20. See this for more information.)
while this code
double a = 20;
is compiled to the 9-byte instruction
ldc.r8 20
instead of this 3-byte sequence
ldc.i4.s 0x14
conv.r8
(Using mono 4.8.)
Is this a missed opportunity, or does the cost of the conv.r8 outweigh the gain in code size?

Because float is not a smaller double, and integer is not a float (or vice versa).
All int values have a 1:1 mapping on a long value. The same simply isn't true for float and double - floating point operations are tricky that way. Not to mention that int-float conversions aren't free - unlike pushing a 1 byte value on the stack / in a register; look at the x86-64 code produced by both approaches, not just the IL code. Size of the IL code is not the only factor to consider in optimisation.
This is in contrast to decimal, which is actually a base-10 number rather than a base-2 floating-point number. There, 20M maps perfectly to 20 and vice versa, so the compiler is free to emit this:
IL_0000: ldc.i4.s 14
IL_0002: newobj System.Decimal..ctor
The same approach simply isn't safe (or cheap!) for binary floating point numbers.
You might think that the two approaches are necessarily safe, because it doesn't really matter whether we do a conversion from an integer literal ("a string") to a double value in compile-time, or whether we do it in IL. But this simply isn't the case, as a bit of specification diving unveils:
ECMA CLR spec, III.1.1.1:
Storage locations for floating-point numbers (statics, array elements, and fields of classes) are of fixed size. The supported storage sizes are float32 and float64.
Everywhere else (on the evaluation stack, as arguments, as return types, and as local variables) floating-point numbers are represented using an internal floating-point type. In each such instance, the nominal type of the variable or expression is either float32 or float64, but its value might be represented internally with additional range and/or precision.
To keep things short, let's pretend float64 actually uses 4 binary digits, while the implementation defined floating type (F) uses 5 binary digits. We want to convert an integer literal that happens to have a binary representation that's more than four digits. Now compare how it's going to behave:
ldc.r8 0.1011E2 ; expanded to 0.10110E2
ldc.r8 0.1E2
mul ; 0.10110E2 * 0.10000E2 == 0.10110E3
conv.r8 converts to the F, not float64. So we actually get:
ldc.i4.s theSameLiteral
conv.r8 ; converted to 0.10111E2
mul ; 0.10111E2 * 0.10000E2 == 0.10111E3
Oops :)
Now, I'm pretty sure this isn't going to happen with an integer in the range of 0-255 on any reasonable platform. But since we're coding against the CLR specification, we can't make that assumption. The JIT compiler can, but that's too late. The language compiler may define the two to be equivalent, but the C# specification doesn't - a double local is considered a float64, not F. You can make your own language, if you so desire.
In any case, IL generators don't really optimise much. That's left to JIT compilation for the most part. If you want an optimised C#-IL compiler, write one - I doubt there's enough benefit to warrant the effort, especially if your only goal is to make the IL code smaller. Most IL binaries are already quite a bit smaller than the equivalent native code.
As for the actual code that runs, on my machine, both approaches result in exactly the same x86-64 assembly - load a double precision value from the data segment. The JIT can easily make this optimisation, since it knows what architecture the code is actually running on.

I doubt you will get a more satisfactory answer than "because no one thought it necessary to implement it."
The fact is, they could have done it this way, but as Eric Lippert has stated many times, features are chosen to be implemented rather than chosen not to be implemented. In this particular case, the feature's gain didn't outweigh the costs (e.g. additional testing and the non-trivial conversion between int and float), while in the case of ldc.i4.s it's not that much trouble. Also, it's better not to bloat the jitter with more optimization rules.
As shown by the Roslyn source code, the conversion is done only for long. All in all, it's entirely possible to add this feature for float or double as well, but it wouldn't be very useful except for producing shorter CIL code (useful when inlining is needed), and when you want a float constant, you usually use an actual floating-point number (i.e. not an integer).

First, let's consider correctness. The ldc.i4.s instruction can handle integers from -128 to 127, all of which can be exactly represented in float32. However, the CIL uses an internal floating-point type called F for some storage locations. The ECMA-335 standard says in III.1.1.1:
...the nominal type of the variable or expression is either float32 or
float64...The internal representation shall have the following
characteristics:
The internal representation shall have precision and range greater than or equal to the nominal type.
Conversions to and from the internal representation shall preserve value.
This all means that any float32 value is guaranteed to be safely represented in F no matter what F is.
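As a quick sanity check of the claim that every ldc.i4.s operand is exactly representable, here is a minimal C# sketch. The class and method names are made up for illustration. It round-trips each value in the range through float and double and reports whether any value changed:

```csharp
using System;

public static class SmallIntRoundTrip
{
    // Returns true if every value in the ldc.i4.s operand range (-128..127)
    // survives a round trip through float and through double unchanged.
    public static bool AllExact()
    {
        for (int i = sbyte.MinValue; i <= sbyte.MaxValue; i++)
        {
            float f = i;   // implicit int -> float conversion
            double d = i;  // implicit int -> double conversion
            if ((int)f != i || (int)d != i)
            {
                return false;
            }
        }
        return true;
    }
}
```

This only checks the storage formats float32 and float64; the guarantee for the internal type F comes from the specification text quoted above, not from this test.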
We conclude that the alternative sequence of instructions that you have proposed is correct. Now the question is: is it better in terms of performance?
To answer this question, let's see what the JIT compiler does when it sees both code sequences. When using ldc.r8 20, the answer given in the link you referenced explains nicely the ramifications of using long instructions.
Let's consider the 3-byte sequence:
ldc.i4.s 0x14
conv.r8
We can make an assumption here that is reasonable for any optimizing JIT compiler. We'll assume that the JIT is capable of recognizing such a sequence of instructions so that the two instructions can be compiled together. The compiler is given the value 0x14 represented in two's complement format and has to convert it to the float32 format (which is always safe, as discussed above). On relatively modern architectures, this can be done extremely efficiently. This tiny overhead is part of the JIT time and is therefore incurred only once. The quality of the generated native code is the same for both IL sequences.
So the 9-byte sequence has a size issue which could incur any amount of overhead from nothing to more (assuming that we use it everywhere) and the 3-byte sequence has the one-time tiny conversion overhead. Which one is better? Well, somebody has to do some scientifically-sound experimentation to measure the difference in performance to answer that question. I would like to stress that you should not care about this unless you are an engineer or researcher in compiler optimizations. Otherwise, you should be optimizing your code at a higher level (at the source code level).

Related

Immutable structs are thread safe they say [duplicate]

I am a tinkerer—no doubt about that. For this reason (and very little beyond that), I recently did a little experiment to confirm my suspicion that writing to a struct is not an atomic operation, which means that a so-called "immutable" value type which attempts to enforce certain constraints could hypothetically fail at its goal.
I wrote a blog post about this using the following type as an illustration:
struct SolidStruct
{
public SolidStruct(int value)
{
X = Y = Z = value;
}
public readonly int X;
public readonly int Y;
public readonly int Z;
}
While the above looks like a type for which it could never be true that X != Y or Y != Z, in fact this can happen if a value is "mid-assignment" at the same time it is copied to another location by a separate thread.
OK, big deal. A curiosity and little more. But then I had this hunch: my 64-bit CPU should actually be able to copy 64 bits atomically, right? So what if I got rid of Z and just stuck with X and Y? That's only 64 bits; it should be possible to overwrite those in one step.
Sure enough, it worked. (I realize some of you are probably furrowing your brows right now, thinking, Yeah, duh. How is this even interesting? Humor me.) Granted, I have no idea whether this is guaranteed or not given my system. I know next to nothing about registers, cache misses, etc. (I am literally just regurgitating terms I've heard without understanding their meaning); so this is all a black box to me at the moment.
The next thing I tried—again, just on a hunch—was a struct consisting of 32 bits using 2 short fields. This seemed to exhibit "atomic assignability" as well. But then I tried a 24-bit struct, using 3 byte fields: no go.
Suddenly the struct appeared to be susceptible to "mid-assignment" copies once again.
Down to 16 bits with 2 byte fields: atomic again!
Could someone explain to me why this is? I've heard of "bit packing", "cache line straddling", "alignment", etc.—but again, I don't really know what all that means, nor whether it's even relevant here. But I feel like I see a pattern, without being able to say exactly what it is; clarity would be greatly appreciated.

The pattern you're looking for is the native word size of the CPU.
Historically, the x86 family worked natively with 16-bit values (and before that, 8-bit values). For that reason, your CPU can handle these atomically: it's a single instruction to set these values.
As time progressed, the native element size increased to 32 bits, and later to 64 bits. In every case, an instruction was added to handle this specific amount of bits. However, for backwards compatibility, the old instructions were still kept around, so your 64-bit processor can work with all of the previous native sizes.
Since your struct elements are stored in contiguous memory (without padding, i.e. empty space), the runtime can exploit this knowledge to only execute that single instruction for elements of these sizes. Put simply, that creates the effect you're seeing, because the CPU can only execute one instruction at a time (although I'm not sure if true atomicity can be guaranteed on multi-core systems).
However, the native element size was never 24 bits. Consequently, there is no single instruction to write 24 bits, so multiple instructions are required for that, and you lose the atomicity.

The C# standard (ISO 23270:2006, ECMA-334) has this to say regarding atomicity:
12.5 Atomicity of variable references
Reads and writes of the following data types shall be atomic: bool, char, byte, sbyte, short, ushort,
uint, int, float, and reference types. In addition, reads and writes of enum types with an underlying type
in the previous list shall also be atomic. Reads and writes of other types, including long, ulong, double,
and decimal, as well as user-defined types, need not be atomic. (emphasis mine) Aside from the library functions designed
for that purpose, there is no guarantee of atomic read-modify-write, such as in the case of increment or
decrement.
Your example X = Y = Z = value is shorthand for 3 separate assignment operations, each of which is defined to be atomic by 12.5. The sequence of 3 operations (assign value to Z, assign Z to Y, assign Y to X) is not guaranteed to be atomic.
Since the language specification doesn't mandate atomicity, while X = Y = Z = value; might be an atomic operation, whether it is or not is dependent on a whole bunch of factors:
the whims of the compiler writers
what code generation optimizations options, if any, were selected at build time
the details of the JIT compiler responsible for turning the assembly's IL into machine language. Identical IL run under Mono, say, might exhibit different behaviour than when run under .Net 4.0 (and that might even differ from earlier versions of .Net).
the particular CPU on which the assembly is running.
One might also note that even a single machine instruction is not necessarily guaranteed to be an atomic operation; many are interruptible.
Further, visiting the CLI standard (ISO 23217:2006), we find section 12.6.6:
12.6.6 Atomic reads and writes
A conforming CLI shall guarantee that read and write access to properly
aligned memory locations no larger than the native word size (the size of type
native int) is atomic (see §12.6.2) when all the write accesses to a location are
the same size. Atomic writes shall alter no bits other than those written. Unless
explicit layout control (see Partition II (Controlling Instance Layout)) is used to
alter the default behavior, data elements no larger than the natural word size (the
size of a native int) shall be properly aligned. Object references shall be treated
as though they are stored in the native word size.
[Note: There is no guarantee
about atomic update (read-modify-write) of memory, except for methods provided for
that purpose as part of the class library (see Partition IV). (emphasis mine)
An atomic write of a “small data item” (an item no larger than the native word size)
is required to do an atomic read/modify/write on hardware that does not support direct
writes to small data items. end note]
[Note: There is no guaranteed atomic access to 8-byte data when the size of
a native int is 32 bits even though some implementations might perform atomic
operations when the data is aligned on an 8-byte boundary. end note]
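Both standards point to library methods as the only guaranteed way to get an atomic read-modify-write. A minimal sketch of that, assuming the System.Threading.Interlocked class from the .NET class library (the CounterDemo class and method name are hypothetical):

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class CounterDemo
{
    // Increments a shared counter 'iterations' times from multiple threads.
    // Interlocked.Increment performs the read-modify-write as one atomic step,
    // so no update is lost; a plain 'counter++' here could lose some.
    public static int CountAtomically(int iterations)
    {
        int counter = 0;
        Parallel.For(0, iterations, i =>
        {
            Interlocked.Increment(ref counter);
        });
        return counter;
    }
}
```

This covers a single int. It does not make a multi-field struct assignment atomic; for that, a lock or an immutable reference type swapped with a single reference write is the usual approach.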

x86 CPU operations take place in 8, 16, 32, or 64 bits; manipulating other sizes requires multiple operations.

The compiler and x86 CPU are going to be careful to move only exactly as many bytes as the structure defines. There are no x86 instructions that can move 24 bits in one operation, but there are single instruction moves for 8, 16, 32, and 64 bit data.
If you add another byte field to your 24 bit struct (making it a 32 bit struct), you should see your atomicity return.
Some compilers allow you to define padding on structs to make them behave like native register sized data. If you pad your 24 bit struct, the compiler will add another byte to "round up" the size to 32 bits so that the whole structure can be moved in one atomic instruction. The downside is your structure will always occupy about 33% more space in memory.
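In C#, one way to request such padding is the Size field of StructLayoutAttribute. The sketch below is only an illustration (the type names are hypothetical); it compares the size of a 3-byte struct with a copy rounded up to 4 bytes, using Marshal.SizeOf, which reports the marshaled size and matches the managed size for simple blittable structs like these:

```csharp
using System.Runtime.InteropServices;

// Three byte fields: 24 bits of data, no padding.
[StructLayout(LayoutKind.Sequential)]
public struct ThreeBytes
{
    public byte A, B, C;
}

// Same fields, but the total size is explicitly rounded up to 4 bytes.
[StructLayout(LayoutKind.Sequential, Size = 4)]
public struct PaddedThreeBytes
{
    public byte A, B, C;
}

public static class LayoutDemo
{
    public static int UnpaddedSize() => Marshal.SizeOf<ThreeBytes>();
    public static int PaddedSize() => Marshal.SizeOf<PaddedThreeBytes>();
}
```

Whether the runtime then copies the padded struct with a single 32-bit move is an implementation detail; the sketch shows only the size change, not an atomicity guarantee.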
Note that alignment of the structure in memory is also critical to atomicity. If a multibyte structure does not begin at an aligned address, it may span multiple cache lines in the CPU cache. Reading or writing this data will require multiple clock cycles and multiple read/writes even though the opcode is a single move instruction. So, even single instruction moves may not be atomic if the data is misaligned. x86 does guarantee atomicity for native sized read/writes on aligned boundaries, even in multicore systems.
It is possible to achieve memory atomicity with multi-step moves using the x86 LOCK prefix. However, this should be avoided as it can be very expensive in multicore systems. (LOCK not only blocks other cores from accessing memory, it also locks the system bus for the duration of the operation, which can impact disk I/O and video operations. LOCK may also force the other cores to purge their local caches.)

Most efficient way of multiplying and dividing fixed scale decimal numbers

Background
I work in the field of financial trading and am currently optimizing a real-time C# trading application.
Through extensive profiling I have identified that the performance of System.Decimal is now a bottleneck. As a result I am currently coding up a couple of more efficient fixed scale 64-bit 'decimal' structures (one signed, one unsigned) to perform base-10 arithmetic. Using a fixed scale of 9 (i.e. 9 digits after the decimal point) means the underlying 64-bit integer can be used to represent the values:
-9,223,372,036.854775808 to 9,223,372,036.854775807
and
0 to 18,446,744,073.709551615
respectively.
This makes most operations trivial (i.e. comparisons, addition, subtraction). However, for multiplication and division I am currently falling back on the implementation provided by System.Decimal. I assume the external FCallMultiply method it invokes for multiplication uses either the Karatsuba or Toom–Cook algorithm under the covers. For division, I'm not sure which particular algorithm it would use.
Question
Does anyone know if, due to the fixed scale of my decimal values, there are any faster multiplication and division algorithms I can employ which are likely to outperform System.Decimal?
I would appreciate your thoughts...

I have done something similar, using the Schönhage–Strassen algorithm.
I cannot find any sources now, but you can try to convert this code to the C# language.
P.S. I cannot say for sure about System.Decimal, but the Karatsuba algorithm is used by System.Numerics.BigInteger.

My take on fixed-point arithmetic in general, not knowing about C# or .NET in particular (VS Express acting up); see also "Fixed point math in c#?" and "Why no fixed point type in C#?":
The main point is a fixed scale - and that this is conceptual, first and foremost - the hardware couldn't care less about meaning/interpretation of numbers (or much anything) (unless it supports something, if for marketing reasons)
the easy: addition/subtraction - just ignore scaling
multiplication: compute the double-wide product, divide by scale
division: multiply (widened) dividend by scale and divide
the ugly - transcendental functions beyond exponentiation (exponentiate, multiply by scale to half that power)
in choosing a scale, don't forget conversion to and from digits, which may vastly outnumber multiplication&division (and give using a square a thought, see above …)
That said, "multiples of word size" and powers of two have been popular choices for scale due to speed in multiplying and dividing by such a scale. This still may make a difference with contemporary processors, if not for main ALUs of PCs - think SIMD extensions, GPUs, embedded …
Given what little I was able to discern of your application and requirements (consider disclosing more), three generic choices to consider are 10^9 (to the 9th power), 2^30 and 2^32. The latter representations may be called 34.30 and 32.32 for the bit lengths of their integral and fractional parts, respectively.
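The multiplication and division rules above can be sketched for a scale of 10^9 as follows. This is an illustration of the arithmetic only, not a tuned implementation: it uses System.Numerics.BigInteger for the double-wide intermediate because that is simple to read, whereas a performance-oriented version would use a 128-bit intermediate instead. The Fixed9 class and its members are hypothetical names, and values are passed as raw scaled 64-bit integers.

```csharp
using System;
using System.Numerics;

public static class Fixed9
{
    // Raw value 1_000_000_000 represents 1.0 (nine digits after the point).
    public const long Scale = 1_000_000_000L;

    // Multiplication: compute the double-wide product, then divide by the scale.
    // Truncates toward zero; the cast throws OverflowException if the result
    // does not fit in 64 bits.
    public static long Multiply(long a, long b)
    {
        BigInteger wide = (BigInteger)a * b;
        return (long)(wide / Scale);
    }

    // Division: multiply the widened dividend by the scale, then divide.
    // Throws DivideByZeroException when b is zero.
    public static long Divide(long a, long b)
    {
        BigInteger wide = (BigInteger)a * Scale;
        return (long)(wide / b);
    }
}
```

For example, Fixed9.Multiply(1_500_000_000, 2_250_000_000) computes 1.5 × 2.25 and returns 3_375_000_000, the raw form of 3.375. Rounding instead of truncating would need an extra adjustment before the final division.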
With a language that allows creating types (especially one supporting operators in addition to invokable procedures), I deem it important to design and implement that new type according to the principle of least surprise.

c# double value displayed as .9999998?

After further investigation, it all boils down to this:
(decimal)((object)my_4_decimal_place_double_value_20.9032)
after casting twice, it becomes 20.903199999999998
I have a double value, which is rounded to just 4 decimal points via Math.Round(...) the value is 20.9032
In my dev environment, it is displayed as is.
But in released environment, it is displayed as 20.903199999999998
There was no operation after Math.Round(...), but the value has been copied around and assigned.
How can this happen?
Updates:
Data is not loaded from a DB.
The returned value from Math.Round() is assigned to the original double variable.
Release and dev are the same architecture, if this information helps.

According to the CLR ECMA specification:
Storage locations for floating-point numbers (statics, array elements,
and fields of classes) are of fixed size. The supported storage sizes
are float32 and float64. Everywhere else (on the evaluation stack, as
arguments, as return types, and as local variables) floating-point
numbers are represented using an internal floating-point type. In each
such instance, the nominal type of the variable or expression is
either R4 or R8, but its value can be represented internally with
additional range and/or precision. The size of the internal
floating-point representation is implementation-dependent, can vary,
and shall have precision at least as great as that of the variable or
expression being represented. An implicit widening conversion to the
internal representation from float32 or float64 is performed when
those types are loaded from storage. The internal representation is
typically the native size for the hardware, or as required for
efficient implementation of an operation.
To translate, the IL generated will be the same (except that debug mode inserts nops in places to ensure a breakpoint is possible, it may also deliberately maintain a temporary variable that release mode deems unnecessary.)... but the JITter is less aggressive when dealing with an assembly marked as debug. Release builds tend to move more floating values into 80-bit registers; debug builds tend to read direct from 64-bit memory storage.
If you want a "precise" float number printing, use string.Substring(...) instead of Math.Round

An IEEE 754 double-precision floating-point number cannot represent 20.9032 exactly.
The most accurate representation is 2.09031999999999982264853315428E1 and that is what you see in your output.
Do not format numbers with Math.Round; instead, use a format string with the double.ToString(string format) method.
See the MSDN documentation of the Double.ToString(String) method.
The difference between Release and Debug builds may be due to some optimization that is done for the release build, but this is way too detailed in my opinion.
In my opinion the core issue is that you are trying to format text output with a mathematical operation. I'm sorry, but I don't know in detail what creates the different behavior.
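A minimal sketch of the formatting approach, assuming the standard numeric format strings "F4" and "G17" and the invariant culture (the DoubleFormatting class is a hypothetical helper, not part of the framework):

```csharp
using System;
using System.Globalization;

public static class DoubleFormatting
{
    // Formats with exactly four digits after the decimal point,
    // rounding the stored binary value for display only.
    public static string ToFourPlaces(double value) =>
        value.ToString("F4", CultureInfo.InvariantCulture);

    // Requests 17 significant digits, enough to expose the stored
    // binary value rather than the shortest-looking decimal.
    public static string ToFullPrecision(double value) =>
        value.ToString("G17", CultureInfo.InvariantCulture);
}
```

With this, DoubleFormatting.ToFourPlaces(20.9032) yields "20.9032" regardless of how the value was copied around, because the rounding happens at the point where the text is produced.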

Why is the division result between two integers truncated?

All experienced programmers in C# (I think this comes from C) are used to casting one of the integers in a division to get the decimal / double / float result instead of the int (the real result truncated).
I'd like to know why is this implemented like this? Is there ANY good reason to truncate the result if both numbers are integer?

C# traces its heritage to C, so the answer to "why is it like this in C#?" is a combination of "why is it like this in C?" and "was there no good reason to change?"
The approach of C is to have a fairly close correspondence between the high-level language and low-level operations. Processors generally implement integer division as returning a quotient and a remainder, both of which are of the same type as the operands.
(So my question would be, "why doesn't integer division in C-like languages return two integers", not "why doesn't it return a floating point value?")
The solution was to provide separate operations for division and remainder, each of which returns an integer. In the context of C, it's not surprising that the result of each of these operations is an integer. This is frequently more accurate than floating-point arithmetic. Consider the example from your comment of 7 / 3. This value cannot be represented by a finite binary number nor by a finite decimal number. In other words, on today's computers, we cannot accurately represent 7 / 3 unless we use integers! The most accurate representation of this fraction is "quotient 2, remainder 1".
So, was there no good reason to change? I can't think of any, and I can think of a few good reasons not to change. None of the other answers has mentioned Visual Basic which (at least through version 6) has two operators for dividing integers: / converts the integers to double, and returns a double, while \ performs normal integer arithmetic.
I learned about the \ operator after struggling to implement a binary search algorithm using floating-point division. It was really painful, and integer division came in like a breath of fresh air. Without it, there was lots of special handling to cover edge cases and off-by-one errors in the first draft of the procedure.
From that experience, I draw the conclusion that having different operators for dividing integers is confusing.
Another alternative would be to have only one integer operation, which always returns a double, and require programmers to truncate it. This means you have to perform two int->double conversions, a truncation and a double->int conversion every time you want integer division. And how many programmers would mistakenly round or floor the result instead of truncating it? It's a more complicated system, and at least as prone to programmer error, and slower.
Finally, in addition to binary search, there are many standard algorithms that employ integer arithmetic. One example is dividing collections of objects into sub-collections of similar size. Another is converting between indices in a 1-d array and coordinates in a 2-d matrix.
As far as I can see, no alternative to "int / int yields int" survives a cost-benefit analysis in terms of language usability, so there's no reason to change the behavior inherited from C.
In conclusion:
Integer division is frequently useful in many standard algorithms.
When the floating-point division of integers is needed, it may be invoked explicitly with a simple, short, and clear cast: (double)a / b rather than a / b
Other alternatives introduce more complication for the programmer and more clock cycles for the processor.
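The points above can be sketched in a few lines of C#. The DivisionDemo class is a made-up name for illustration; the operators themselves are the built-in ones:

```csharp
using System;

public static class DivisionDemo
{
    // int / int truncates toward zero and yields an int.
    public static int Quotient(int a, int b) => a / b;

    // The remainder is a separate operator; its sign follows the dividend.
    public static int Remainder(int a, int b) => a % b;

    // Casting one operand requests floating-point division explicitly.
    public static double RealQuotient(int a, int b) => (double)a / b;
}
```

For 7 and 3, Quotient returns 2 and Remainder returns 1, which together describe the fraction exactly, while RealQuotient returns only a binary approximation of 2.333...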

Is there ANY good reason to truncate the result if both numbers are integer?
Of course; I can think of a dozen such scenarios easily. For example: you have a large image, and a thumbnail version of the image which is 10 times smaller in both dimensions. When the user clicks on a point in the large image, you wish to identify the corresponding pixel in the scaled-down image. Clearly to do so, you divide both the x and y coordinates by 10. Why would you want to get a result in decimal? The corresponding coordinates are going to be integer coordinates in the thumbnail bitmap.
Doubles are great for physics calculations and decimals are great for financial calculations, but almost all the work I do with computers that does any math at all does it entirely in integers. I don't want to be constantly having to convert doubles or decimals back to integers just because I did some division. If you are solving physics or financial problems then why are you using integers in the first place? Use nothing but doubles or decimals. Use integers to solve finite mathematics problems.

Calculating on integers is faster (usually) than on floating point values. Besides, all other integer/integer operations (+, -, *) return an integer.
EDIT:
As per the request of the OP, here's some addition:
The OP's problem is that they think of / as division in the mathematical sense, and the / operator in the language performs some other operation (which is not the math. division). By this logic they should question the validity of all other operations (+, -, *) as well, since those have special overflow rules, which is not the same as would be expected from their math counterparts. If this is bothersome for someone, they should find another language where the operations perform as expected by the person.
As for the claim about a performance difference in favor of integer values: when I wrote the answer I only had "folk" knowledge and "intuition" to back up the claim (hence my "usually" disclaimer). Indeed, as Gabe pointed out, there are platforms where this does not hold. On the other hand, I found this link (point 12) that shows mixed performance on an Intel platform (the language used is Java, though).
The takeaway should be that with performance many claims and intuition are unsubstantiated until measured and found true.

Yes, if the end result needs to be a whole number. It would depend on the requirements.
If these are indeed your requirements, then you would not want to store a decimal and then truncate it. You would be wasting memory and processing time to accomplish something that is already built-in functionality.

The operator is designed to return the same type as its input.
Edit (comment response):
Why? I don't design languages, but I would assume most of the time you will be sticking with the data types you started with and in the remaining instance, what criteria would you use to automatically assume which type the user wants? Would you automatically expect a string when you need it? (sincerity intended)

If you add an int to an int, you expect to get an int. If you subtract an int from an int, you expect to get an int. If you multiply an int by an int, you expect to get an int. So why would you not expect an int result if you divide an int by an int? And if you expect an int, then you will have to truncate.
If you don't want that, then you need to cast your ints to something else first.
Edit: I'd also note that if you really want to understand why this is, then you should start looking into how binary math works and how it is implemented in an electronic circuit. It's certainly not necessary to understand it in detail, but having a quick overview of it would really help you understand how the low-level details of the hardware filter through to the details of high-level languages.

Why does .NET use int instead of uint in certain classes?

I always come across code that uses int for things like .Count, etc, even in the framework classes, instead of uint.
What's the reason for this?

UInt32 is not CLS compliant so it might not be available in all languages that target the Common Language Specification. Int32 is CLS compliant and therefore is guaranteed to exist in all languages.

int, in C, is specifically defined to be the default integer type of the processor, and is therefore held to be the fastest for general numeric operations.

Unsigned types only behave like whole numbers if the sum or product of a signed and unsigned value will be a signed type large enough to hold either operand, and if the difference between two unsigned values is a signed value large enough to hold any result. Thus, code which makes significant use of UInt32 will frequently need to compute values as Int64. Operations on signed integer types may fail to operate like whole numbers when the operands are overly large, but they'll behave sensibly when operands are small. Operations on unpromoted arguments of unsigned types pose problems even when operands are small. Given UInt32 x; for example, the inequality x-1 < x will fail for x==0 if the result type is UInt32, and the inequality x<=0 || x-1>=0 will fail for large x values if the result type is Int32. Only if the operation is performed on type Int64 can both inequalities be upheld.
While it is sometimes useful to define unsigned-type behavior in ways that differ from whole-number arithmetic, values which represent things like counts should generally use types that will behave like whole numbers--something unsigned types generally don't do unless they're smaller than the basic integer type.
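The first inequality from the paragraph above can be checked with a small sketch. The UnsignedDemo class and its method names are hypothetical; the unchecked keyword makes the wrap-around explicit rather than relying on compiler settings:

```csharp
using System;

public static class UnsignedDemo
{
    // Evaluates "x - 1 < x" with a UInt32 result type.
    // For x == 0 the subtraction wraps to uint.MaxValue, so this is false.
    public static bool DecrementIsSmallerAsUInt32(uint x) =>
        unchecked(x - 1u) < x;

    // Evaluates the same inequality after widening to Int64.
    // For x == 0 the left side is -1, so this is true.
    public static bool DecrementIsSmallerAsInt64(uint x) =>
        (long)x - 1 < x;
}
```

With x == 0, the UInt32 version returns false and the Int64 version returns true, which is the whole-number behavior one would expect from a count.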

UInt32 isn't CLS-Compliant. http://msdn.microsoft.com/en-us/library/system.uint32.aspx
I think that over the years people have come to the conclusions that using unsigned types doesn't really offer that much benefit. The better question is what would you gain by making Count a UInt32?

Some things use int so that they can return -1 as if it were "null" or something like that. For example, a ComboBox will return -1 for its SelectedIndex if it doesn't have any item selected.

If the number is truly unsigned by its intrinsic nature then I would declare it an unsigned int. However, if I just happen to be using a number (for the time being) in the positive range then I would call it an int.
The main reasons being that:
It avoids having to do a lot of type-casting as most methods/functions are written to take an int and not an unsigned int.
It eliminates possible truncation warnings.
You invariably end up wishing you could assign a negative value to the number that you had originally thought would always be positive.
These are just a few quick thoughts that came to mind.
I used to try and be very careful and choose the proper unsigned/signed and I finally realized that it doesn't really result in a positive benefit. It just creates extra work. So why make things hard by mixing and matching.

Some old libraries and even InStr use negative numbers to mean special cases. I believe either it's laziness or there are negative special values.
