What is the Standard for Handling Endianness With Interop? - c#
I created a very simple DLL for Speck in (granted, probably inefficient) ASM. I connected to it in C# using InteropServices.
When I tested this implementation against the test vectors provided in the paper describing the algorithm, I found that the only way to get them to come out right was to "flip" the key and the plaintext, and then to "flip" the resulting ciphertext at the end for a match. So an endianness issue, I guess. I have seen the same thing, for example, between a reference implementation of Serpent and TrueCrypt's version -- they produce the same result only with the bytes in the reverse order.
I will post my assembly code and my C# code for reference, though it may not be critical to see the code in order to understand my question. In the C# code is a click event handler that checks the DLL for consistency with the test vectors. As you can also see there, the program has to do a lot of array flipping in that handler to get the match.
So the question I have been working towards is this. Should I "flip" those arrays inside the DLL to account for endianness? Or should I leave it to the caller (also me, but C# side)? Or am I making mountains out of molehills and I should just ignore endianness at this point? I am not planning to sell the silly thing, so there is no worry about compatibility issues, but I am a stickler for doing things right, so I am hoping you all can guide me on the best practice here if there is one.
ASM:
.code ; the beginning of the code
; section
WinMainCRTStartup proc h:DWORD, r:DWORD, u:DWORD ; the dll entry point
mov rax, 1 ; if eax is 0, the dll won't
; start
ret ; return
WinMainCRTStartup Endp ; end of the dll entry
_DllMainCRTStartup proc h:DWORD, r:DWORD, u:DWORD ; the dll entry point
mov rax, 1 ; if eax is 0, the dll won't
; start
ret ; return
_DllMainCRTStartup Endp
SpeckEncrypt proc plaintText:QWORD, cipherText:QWORD, Key:QWORD
; Pass in 3 addresses pointing to the base of the plainText, cipherText, and Key arrays
; These come in as RCX, RDX, and R8, respectively
; I will use These, RAX, and R9 through R15 for my working space. Will do 128 bit block, 128 bit key sizes, but they will fit nicely in 64 bit registers
; simple prologue, pushing rbp, rbx, and the R# registers, and moving the value of rsp into rbp for the duration of the proc
push rbp
mov rbp,rsp
push rbx
push R9
push R10
push R11
push R12
push R13
push R14
push R15
; Move data into the registers for processing
mov r9,[rcx] ; rcx holds the memory location of the first 64 bits of plainText. Move this into R9. This is plainText[0]
mov r10,[rcx+8] ; put next 64 bits into R10. This is plainText[1]
;NOTE that the address of the cipherText is in RDX but we will fill r11 and r12 with values pointed at by RCX. This is per the algorithm. We will use RDX to output the final bytes
mov r11,[rcx] ; cipherText[0] = plainText[0]
mov r12,[rcx+8] ; cipherText[1] = plainText[1]
mov r13, [r8] ;First 64 bits of key. This is Key[0]
mov r14, [r8+8] ; Next 64 bits of key. This is Key[1]
push rcx ; I could get away without this and loop in another register, but I want to count my loop in rcx so I free it up for that
mov rcx, 0 ; going to count up from here to 32. Would count down but the algorithm uses the counter value in one permutation, so going to count up
EncryptRoundFunction:
ror r12,8
add r12,r11
xor r12,r13
rol r11,3
xor r11,r12
ror r14,8
add r14,r13
xor r14,rcx
rol r13,3
xor r13,r14
inc rcx
cmp rcx, 32
jne EncryptRoundFunction
pop rcx
; Move cipherText into memory pointed at by RDX. We won't bother copying the Key or plainText back out
mov [rdx],r11
mov [rdx+8],r12
; Now the epilogue, restoring the saved registers from the stack.
pop R15
pop R14
pop R13
pop R12
pop R11
pop R10
pop R9
pop rbx
pop rbp
ret ; return eax
SpeckEncrypt endp ; end of the function
SpeckDecrypt proc cipherText:QWORD, plainText:QWORD, Key:QWORD
; Pass in 3 addresses pointing to the base of the cipherText, plainText, and Key arrays
; These come in as RCX, RDX, and R8, respectively
; I will use These, RAX, and R9 through R15 for my working space. Will do 128 bit block, 128 bit key sizes, but they will fit nicely in 64 bit registers
; simple prologue, pushing rbp, rbx, and the R# registers, and moving the value of rsp into rbp for the duration of the proc
push rbp
mov rbp,rsp
push rbx
push R9
push R10
push R11
push R12
push R13
push R14
push R15
; Move data into the registers for processing
mov r9,[rcx] ; rcx holds the memory location of the first 64 bits of cipherText. Move this into R9. This is cipherText[0]
mov r10,[rcx+8] ; put next 64 bits into R10. This is cipherText[1]
;NOTE that the address of the plainText is in RDX but we will fill r11 and r12 with values pointed at by RCX. This is per the algorithm. We will use RDX to output the final bytes
mov r11,[rcx] ; plainText[0] = cipherText[0]
mov r12,[rcx+8] ; plainText[1] = cipherText[1]
mov r13, [r8] ;First 64 bits of key. This is Key[0]
mov r14, [r8+8] ; Next 64 bits of key. This is Key[1]
push rcx ; I could get away without this and loop in another register, but I want to count my loop in rcx so I free it up for that
mov rcx, 0 ; We will count up while making the round keys
DecryptMakeRoundKeys:
; On encrypt we could make each key just as we needed it. But here we need the keys in reverse order. To undo round 31 of encryption, for example, we need round key 31.
; So we will make them all and push them on the stack, pop them off again as we need them in the main DecryptRoundFunction
; I should pull this off and call it for encrypt and decrypt to save space, but for now will have it separate
; push r13 at the beginning of the process because we need a "raw" key by the time we reach decrypt round 0
; We will not push r14 because that half of the key is only used here in the round key generation function.
; We don't need it in the decrypt rounds
push r13
ror r14,8
add r14,r13
xor r14,rcx
rol r13,3
xor r13,r14
inc rcx
cmp rcx, 32
jne DecryptMakeRoundKeys
mov rcx, 32
DecryptRoundFunction:
dec rcx
pop r13
xor r11,r12
ror r11,3
xor r12,r13
sub r12,r11
rol r12,8
cmp rcx, 0
jne DecryptRoundFunction
pop rcx
; Move plainText into memory pointed at by RDX. We won't bother copying the Key or cipherText back out
mov [rdx],r11
mov [rdx+8],r12
; Now the epilogue, restoring the saved registers from the stack.
pop R15
pop R14
pop R13
pop R12
pop R11
pop R10
pop R9
pop rbx
pop rbp
ret ; return eax
SpeckDecrypt endp ; end of the function
End ; end of the dll
And the C#:
using System;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Windows.Forms;
namespace SpeckDLLTest
{
public partial class Form1 : Form
{
byte[] key = { 0x0f, 0x0e, 0x0d, 0x0c, 0x0b, 0x0a, 0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00 };
public Form1()
{
InitializeComponent();
Array.Reverse(key);
}
private void richTextBox1_TextChanged(object sender, EventArgs e)
{
textBox1.Text = richTextBox1.Text.Length.ToString();
if (richTextBox1.Text != "")
{
byte[] plainText = ASCIIEncoding.ASCII.GetBytes(richTextBox1.Text);
byte[] cipherText = new byte[plainText.Length];
Thread t = new Thread(() =>
{
cipherText = Encrypt(plainText);
BeginInvoke(new Action(() => richTextBox2.Text = Convert.ToBase64String(cipherText)));
});
t.Start();
t.Join();
t.Abort();
byte[] plainAgain = new byte[cipherText.Length];
t = new Thread(() =>
{
plainAgain = Decrypt(cipherText);
BeginInvoke(new Action(() => richTextBox3.Text = ASCIIEncoding.ASCII.GetString(plainAgain)));
});
t.Start();
t.Join();
t.Abort();
}
else
{
richTextBox2.Text = "";
richTextBox3.Text = "";
}
}
private byte[] Decrypt(byte[] cipherText)
{
int blockCount = cipherText.Length / 16;
if (cipherText.Length % 16 != 0) blockCount++;
Array.Resize(ref cipherText, blockCount * 16);
byte[] plainText = new byte[cipherText.Length];
unsafe
{
fixed (byte* plaintextPointer = plainText, ciphertextPointer = cipherText, keyPointer = key)
{
for (int i = 0; i < blockCount; i++)
{
for (int j = 0; j < 1; j++)
{
UnsafeMethods.SpeckDecrypt(ciphertextPointer + i * 16, plaintextPointer + i * 16, keyPointer);
}
}
}
}
return plainText;
}
private byte[] Encrypt(byte[] plainText)
{
int blockCount = plainText.Length / 16;
if (plainText.Length % 16 != 0) blockCount++;
Array.Resize(ref plainText, blockCount * 16);
byte[] cipherText = new byte[plainText.Length];
unsafe
{
fixed (byte* plaintextPointer = plainText, ciphertextPointer = cipherText, keyPointer = key)
{
for (int i = 0; i < blockCount; i++)
{
for (int j = 0; j < 1; j++)
{
UnsafeMethods.SpeckEncrypt(plaintextPointer + i * 16, ciphertextPointer + i * 16, keyPointer);
}
}
}
}
return cipherText;
}
private void button1_Click(object sender, EventArgs e)
{
byte[] plainText = { 0x6c, 0x61, 0x76, 0x69, 0x75, 0x71, 0x65, 0x20, 0x74, 0x69, 0x20, 0x65, 0x64, 0x61, 0x6d, 0x20 };
byte[] key = { 0x0f, 0x0e, 0x0d, 0x0c, 0x0b, 0x0a, 0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00 };
byte[] testVector = { 0xa6, 0x5d, 0x98, 0x51, 0x79, 0x78, 0x32, 0x65, 0x78, 0x60, 0xfe, 0xdf, 0x5c, 0x57, 0x0d, 0x18 };
Array.Reverse(key);
Array.Reverse(plainText);
byte[] cipherText = new byte[16];
unsafe
{
fixed (byte* plaintextPointer = plainText, ciphertextPointer = cipherText, keyPointer = key)
{
UnsafeMethods.SpeckEncrypt(plaintextPointer, ciphertextPointer, keyPointer);
Array.Reverse(cipherText);
bool testBool = true;
for (int i = 0; i < cipherText.Length; i++)
{
if (testVector[i] != cipherText[i]) testBool = false;
}
if (testBool == false) MessageBox.Show("Failed!");
else MessageBox.Show("Passed!");
}
}
}
}
public static class UnsafeMethods
{
[DllImport("Speck.dll")]
unsafe public extern static void SpeckEncrypt(byte* plainText, byte* cipherText, byte* Key);
[DllImport("Speck.dll")]
unsafe public extern static void SpeckDecrypt(byte* cipherText, byte* plainText, byte* Key);
}
}
Whether one likes it or not, the de facto standard for byte order in networking and cryptography is big-endian (most significant byte first, the "natural" order). This applies not only to serializing data for exchange between systems, but also to intra-system APIs and to any other case where the caller is not supposed to be aware of the callee's internals. The convention has nothing to do with the endianness of any particular hardware or with how popular that hardware is. It simply sets the default format for exchanged data, so that lower-level and higher-level programs can pass data around regardless of how much they know about what the data contains and how it is processed.
However, if the caller is meant to be tightly coupled with the callee, it may be more convenient, and better for performance, to pass the data in a more preprocessed form, especially if some of it stays constant across invocations. For example, in asymmetric cryptography it may be easier and faster to call the core functions with everything already converted to big integers, and for those we may prefer little-endian digit order (a "digit" or "limb" is usually half of the largest available register) even on big-endian hardware, simply because that digit order is more convenient for an arbitrary-precision math library. But those details should not be visible to the outside world: to everyone else we accept and return a big-endian byte stream.
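To make that boundary discipline concrete, here is a minimal sketch, assuming a runtime where System.Buffers.Binary.BinaryPrimitives is available (.NET Core 2.1 or later). The class name and the EncryptCore stand-in are hypothetical, not part of the question's code; callers see only big-endian bytes, while the internals work on host-order words.

using System;
using System.Buffers.Binary;

static class BlockCipherBoundary   // hypothetical wrapper
{
    public static byte[] EncryptBlock(byte[] bigEndianBlock, byte[] bigEndianKey)
    {
        // Parse the externally visible big-endian format into native 64-bit words.
        ulong x  = BinaryPrimitives.ReadUInt64BigEndian(bigEndianBlock.AsSpan(0, 8));
        ulong y  = BinaryPrimitives.ReadUInt64BigEndian(bigEndianBlock.AsSpan(8, 8));
        ulong k0 = BinaryPrimitives.ReadUInt64BigEndian(bigEndianKey.AsSpan(0, 8));
        ulong k1 = BinaryPrimitives.ReadUInt64BigEndian(bigEndianKey.AsSpan(8, 8));

        (x, y) = EncryptCore(x, y, k0, k1);   // internal representation stays hidden

        // Serialize the result back to the externally visible big-endian format.
        byte[] output = new byte[16];
        BinaryPrimitives.WriteUInt64BigEndian(output.AsSpan(0, 8), x);
        BinaryPrimitives.WriteUInt64BigEndian(output.AsSpan(8, 8), y);
        return output;
    }

    // Stand-in for the native call or a managed implementation of the rounds.
    private static (ulong, ulong) EncryptCore(ulong x, ulong y, ulong k0, ulong k1)
    {
        return (x, y);
    }
}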
Regarding your specific task.
As @RossRidge already pointed out, simply flipping entire arrays is probably wrong: you should byte-swap (BSWAP) each word being processed, not also reverse the order of the words on top of that.
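For illustration, here is a small sketch of the difference between reversing the whole 16-byte block and byte-swapping each 64-bit word in place. It assumes BinaryPrimitives is available (on older frameworks the swap can be done by hand); the block is the plaintext from the question's test vector.

using System;
using System.Buffers.Binary;

class SwapDemo
{
    static void Main()
    {
        byte[] block = { 0x6c, 0x61, 0x76, 0x69, 0x75, 0x71, 0x65, 0x20,
                         0x74, 0x69, 0x20, 0x65, 0x64, 0x61, 0x6d, 0x20 };

        // Whole-array reverse: byte-swaps each word AND exchanges the two words.
        byte[] reversed = (byte[])block.Clone();
        Array.Reverse(reversed);

        // Per-word swap (what BSWAP does): byte order inside each 64-bit word
        // is flipped, but the words keep their positions.
        ulong w0 = BinaryPrimitives.ReverseEndianness(BitConverter.ToUInt64(block, 0));
        ulong w1 = BinaryPrimitives.ReverseEndianness(BitConverter.ToUInt64(block, 8));

        Console.WriteLine(BitConverter.ToString(reversed));
        Console.WriteLine($"{w0:X16} {w1:X16}");
    }
}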
Chances are high that you are greatly overestimating your ability to write efficient machine code: for example, you do not interleave instructions that use unrelated registers to help out-of-order execution, your loop is not aligned, and you count the loop up to N instead of down to zero. Of course, that code will still be 10x faster than .NET anyway, but I strongly recommend writing an implementation in C and benchmarking it; you may be amazed at how good a compiler (MSVC, GCC) can be at optimizing even a straightforwardly written program (believe me, I once made the same mistake when trying to accomplish the same task). If performance is not a big issue, do not mess with unmanaged code at all: it is just an external, non-portable dependency that raises the trust level required by your .NET application.
Use the .NET functions that deal with bytes with caution, because they are inconsistent with regard to endianness: BitConverter uses the host byte order, BinaryReader and BinaryWriter always use little-endian, and String depends entirely on the encoding given (of all the UTF encodings, only UTF-8 is endian-agnostic).
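A short sketch of that difference, again assuming BinaryPrimitives is available: BitConverter's result depends on the machine it runs on, while the explicit readers do not.

using System;
using System.Buffers.Binary;

class EndiannessCheck
{
    static void Main()
    {
        byte[] bytes = { 0x01, 0x02, 0x03, 0x04 };

        // Host byte order: the result differs between little- and big-endian machines.
        uint hostOrder = BitConverter.ToUInt32(bytes, 0);

        // Explicit byte order: the same result on every machine.
        uint asBigEndian = BinaryPrimitives.ReadUInt32BigEndian(bytes);
        uint asLittleEndian = BinaryPrimitives.ReadUInt32LittleEndian(bytes);

        Console.WriteLine($"IsLittleEndian = {BitConverter.IsLittleEndian}");
        Console.WriteLine($"host   = 0x{hostOrder:X8}");
        Console.WriteLine($"big    = 0x{asBigEndian:X8}");
        Console.WriteLine($"little = 0x{asLittleEndian:X8}");
    }
}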
Those are the issues I noticed at first glance. There may be more of them.
Related
Confirmation of Reverse Reciprocal CRC-8 Value?
I've spent quite a bit of time trying to confirm the type of CRC-8 algorithm used in ASCII data communications between two devices. I have confirmed that the CRC is calculated on the 0x02 Start of Text byte + the next byte of data. An Interface Design Document that I have describing one device specifies the use of a 0xEA polynomial with an initial value of 0xFF. An example of one captured message is below:

Input Bytes: 0x02 0x41
CRC Result: b10011011 or 0x9B

Going into the problem, I had little to no knowledge of the inner workings of a typical CRC algorithm. Initially, I tried hand calculation against the input bytes to confirm my understanding of the algorithm before attempting a code solution. This involved XORing the first input byte with my 0xFF initial value and then skipping to the second input byte to continue the XOR operations. Having tried multiple times to confirm the CRC through typical XOR operations while shifting the MSB left out of the register during each step, I could never get the results I wanted. Today, I realized that the 0xEA polynomial is also considered to be a reversed reciprocal of the 0xD5 poly with an implied 1+x^8 that is commonly used in CRC-8 algorithms. How does this fact change how I would go about manually calculating the CRC? I've read that in some instances a reversal leads to the algorithm right-shifting bits instead of left-shifting?
The polynomial is x^8+x^7+x^5+x^3+x^2+x+1 => 0x1AF, bit reversed to x^8+x^7+x^6+x^5+x^3+x+1 => 0x1EB. Example code where the conditional XOR is done after the shift, so the XOR value is 0x1EB>>1 = 0xF5. A 256 byte table lookup could be used to replace the inner loop.

using System;

namespace crc8r
{
    class Program
    {
        private static byte crc8r(byte[] bfr, int bfrlen)
        {
            byte crc = 0xff;
            for (int j = 0; j < bfrlen; j++)
            {
                crc ^= bfr[j];
                for (int i = 0; i < 8; i++)
                    // assumes twos complement math
                    crc = (byte)((crc >> 1) ^ ((0 - (crc & 1)) & 0xf5));
            }
            return crc;
        }

        static void Main(string[] args)
        {
            byte[] data = new byte[3] { 0x02, 0x41, 0x00 };
            byte crc;
            crc = crc8r(data, 2);               // crc == 0x9b
            Console.WriteLine("{0:X2}", crc);
            data[2] = crc;
            crc = crc8r(data, 3);               // crc == 0x00
            Console.WriteLine("{0:X2}", crc);
            return;
        }
    }
}

Regarding "EA": if the polynomial is XOR'ed before the shift, 0x1EB (or 0x1EA, since bit 0 will be shifted off and doesn't matter) is used. XOR'ing before the shift requires 9 bits, or a post-shift OR or XOR of 0x80, while XOR'ing after the shift only requires 8 bits. Example line of code using 0x1eb before the shift:

crc = (byte)((crc ^ ((0 - (crc & 1)) & 0x1eb)) >> 1);
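The answer notes that a 256-entry table lookup could replace the inner loop. Here is a minimal sketch of that variant, using the same 0xFF initial value and post-shift 0xF5 constant as the code above; the class and method names are made up.

using System;

class Crc8Table
{
    // Precompute one table entry per possible byte value, using the same
    // shift-right / conditional-XOR step as the bitwise version above.
    static readonly byte[] Table = BuildTable();

    static byte[] BuildTable()
    {
        byte[] table = new byte[256];
        for (int n = 0; n < 256; n++)
        {
            byte crc = (byte)n;
            for (int i = 0; i < 8; i++)
                crc = (byte)((crc >> 1) ^ ((0 - (crc & 1)) & 0xF5));
            table[n] = crc;
        }
        return table;
    }

    static byte Crc8(byte[] bfr, int bfrlen)
    {
        byte crc = 0xFF;
        for (int j = 0; j < bfrlen; j++)
            crc = Table[crc ^ bfr[j]];   // one lookup replaces the 8-iteration loop
        return crc;
    }

    static void Main()
    {
        byte[] data = { 0x02, 0x41 };
        Console.WriteLine("{0:X2}", Crc8(data, 2));   // expected 0x9B, per the answer above
    }
}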
C# Performance on Small Functions
One of my co-workers has been reading Clean Code by Robert C Martin and got to the section about using many small functions as opposed to fewer large functions. This led to a debate about the performance consequence of this methodology. So we wrote a quick program to test the performance and are confused by the results. For starters, here is the normal version of the function.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = a + b + 1;
        }
    }
    return a;
}

Here is the version I made that breaks the functionality into small functions.

static double TinyFunctions()
{
    double a = 0;
    for (int i = 0; i < s_OuterLoopCount; i++)
    {
        a = Loop(a);
    }
    return a;
}
static double Loop(double a)
{
    for (int i = 0; i < s_InnerLoopCount; i++)
    {
        double b = Double(i);
        a = Add(a, Add(b, 1));
    }
    return a;
}
static double Double(double a) { return a * 2; }
static double Add(double a, double b) { return a + b; }

I use the Stopwatch class to time the functions, and when I ran it in debug I got the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 377 ms;
TinyFunctions Time = 1322 ms;

These results make sense to me, especially in debug, as there is additional overhead in function calls. It is when I run it in release that I get the following results.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 173 ms;
TinyFunctions Time = 98 ms;

These results confuse me. Even if the compiler was optimizing TinyFunctions by in-lining all the function calls, how could that make it ~57% faster? We have tried moving variable declarations around in NormalFunction and it had basically no effect on the run time. I was hoping that someone would know what is going on, and if the compiler can optimize TinyFunctions so well, why it can't apply similar optimizations to NormalFunction. In looking around we found where someone mentioned that having the functions broken out allows the JIT to better optimize what to put in the registers, but NormalFunction only has 4 variables, so I find it hard to believe that explains the massive performance difference. I'd be grateful for any insight someone can provide.

Update 1

As pointed out below by Kyle, changing the order of operations made a massive difference in the performance of NormalFunction.

static double NormalFunction()
{
    double a = 0;
    for (int j = 0; j < s_OuterLoopCount; ++j)
    {
        for (int i = 0; i < s_InnerLoopCount; ++i)
        {
            double b = i * 2;
            a = b + 1 + a;
        }
    }
    return a;
}

Here are the results with this configuration.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 91 ms;
TinyFunctions Time = 102 ms;

This is more what I expected, but it still leaves the question as to why the order of operations can have a ~56% performance hit. Furthermore, I then tried it with integer operations and we are back to not making any sense.

s_OuterLoopCount = 10000;
s_InnerLoopCount = 10000;
NormalFunction Time = 87 ms;
TinyFunctions Time = 52 ms;

And this doesn't change regardless of the order of operations.
I can make performance match much better by changing one line of code:

a = a + b + 1;

Change it to:

a = b + 1 + a;

Or:

a += b + 1;

Now you'll find that NormalFunction might actually be slightly faster, and you can "fix" that by changing the signature of the Double method to:

int Double( int a ) { return a * 2; }

I thought of these changes because this is what was different between the two implementations. After this, their performance is very similar, with TinyFunctions being a few percent slower (as expected).

The second change is easy to explain: the NormalFunction implementation actually doubles an int and then converts it to a double (with an fild opcode at the machine code level). The original Double method loads a double first and then doubles it, which I would expect to be slightly slower.

But that doesn't account for the bulk of the runtime discrepancy. That comes down almost entirely to the order change I made first. Why? I don't really have any idea. The difference in machine code looks like this:

Original                                    Changed
01070620 push ebp                           01390620 push ebp
01070621 mov ebp,esp                        01390621 mov ebp,esp
01070623 push edi                           01390623 push edi
01070624 push esi                           01390624 push esi
01070625 push eax                           01390625 push eax
01070626 fldz                               01390626 fldz
01070628 xor esi,esi                        01390628 xor esi,esi
0107062A mov edi,dword ptr ds:[0FE43ACh]    0139062A mov edi,dword ptr ds:[12243ACh]
01070630 test edi,edi                       01390630 test edi,edi
01070632 jle 0107065A                       01390632 jle 0139065A
01070634 xor edx,edx                        01390634 xor edx,edx
01070636 mov ecx,dword ptr ds:[0FE43B0h]    01390636 mov ecx,dword ptr ds:[12243B0h]
0107063C test ecx,ecx                       0139063C test ecx,ecx
0107063E jle 01070655                       0139063E jle 01390655
01070640 mov eax,edx                        01390640 mov eax,edx
01070642 add eax,eax                        01390642 add eax,eax
01070644 mov dword ptr [ebp-0Ch],eax        01390644 mov dword ptr [ebp-0Ch],eax
01070647 fild dword ptr [ebp-0Ch]           01390647 fild dword ptr [ebp-0Ch]
0107064A faddp st(1),st                     0139064A fld1
0107064C fld1                               0139064C faddp st(1),st
0107064E faddp st(1),st                     0139064E faddp st(1),st
01070650 inc edx                            01390650 inc edx
01070651 cmp edx,ecx                        01390651 cmp edx,ecx
01070653 jl 01070640                        01390653 jl 01390640
01070655 inc esi                            01390655 inc esi
01070656 cmp esi,edi                        01390656 cmp esi,edi
01070658 jl 01070634                        01390658 jl 01390634
0107065A pop ecx                            0139065A pop ecx
0107065B pop esi                            0139065B pop esi
0107065C pop edi                            0139065C pop edi
0107065D pop ebp                            0139065D pop ebp
0107065E ret                                0139065E ret

Which is opcode-for-opcode identical except for the order of the floating point operations. That makes a huge performance difference, but I don't know enough about x86 floating point operations to know why exactly.

Update:

With the new integer version we see something else curious. In this case it seems the JIT is trying to be clever and apply an optimization, because it turns this:

int b = 2 * i;
a = a + b + 1;

Into something like:

mov esi, eax              ; b = i
add esi, esi              ; b += b
lea ecx, [ecx + esi + 1]  ; a = a + b + 1

Where a is stored in the ecx register, i in eax, and b in esi. Whereas the TinyFunctions version gets turned into something like:

mov eax, edx
add eax, eax
inc eax
add ecx, eax

Where i is in edx, b is in eax, and a is in ecx this time around. I suppose for our CPU architecture this LEA "trick" (explained here) ends up being slower than just using the ALU proper.
It is still possible to change the code to get the performance between the two to line up:

int b = 2 * i + 1;
a += b;

This ends up forcing the NormalFunction approach to be compiled into mov, add, inc, add, just as it appears in the TinyFunctions approach.
What is address of logical operation's result?
I have a simple program written in C#:

static void Main(string[] args)
{
    int a = 0;
    for (int i = 0; i < 100; ++i)
        a = a + 1;
    Console.WriteLine(a);
}

I am a newbie in this field of programming and my purpose is just to understand the assembly code created by the JIT. Here is a piece of the asm code:

 7: int a = 0;
0000003c xor edx,edx
0000003e mov dword ptr [ebp-40h],edx
 8: for (int i = 0; i < 100; ++i)
00000041 xor edx,edx
00000043 mov dword ptr [ebp-44h],edx

I cannot understand the line 0000003c xor edx,edx. Where is the result of the operation stored? I found only this quote from the "Intel® 64 and IA-32 Architectures Software Developer's Manual":

The logical instructions AND, OR, XOR (exclusive or), and NOT perform the standard Boolean operations for which they are named. The AND, OR, and XOR instructions require two operands; the NOT instruction operates on a single operand.

EDIT: As I understand it, the result should be stored in edx (see the next code line). But that seems weird to me. I thought the result would be pushed onto the stack.
Logical operation instructions store their result in the first operand - in your case, that's edx. Note that XOR-ing a value with itself produces 0. Hence, XOR a, a is a common assembly idiom to clear a register.
xor edx,edx is the idiomatic way of clearing the edx register. (Note that a XOR a is zero for any value of a.)
Converting float NaN values from binary form and vice-versa results a mismatch
I make a conversion "bytes[4] -> float number -> bytes[4]" without any arithmetic. In the bytes I have a single-precision number in IEEE-754 format (4 bytes per number, little-endian order, as on the machine). I encounter an issue where bytes representing a NaN value are not converted verbatim. For example:

{ 0x1B, 0xC4, 0xAB, 0x7F } -> NaN -> { 0x1B, 0xC4, 0xEB, 0x7F }

Code for reproduction:

using System;
using System.Linq;

namespace StrangeFloat
{
    class Program
    {
        private static void PrintBytes(byte[] array)
        {
            foreach (byte b in array)
            {
                Console.Write("{0:X2}", b);
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            byte[] strangeFloat = { 0x1B, 0xC4, 0xAB, 0x7F };
            float[] array = new float[1];
            Buffer.BlockCopy(strangeFloat, 0, array, 0, 4);
            byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
            PrintBytes(strangeFloat);
            PrintBytes(bitConverterResult);
            bool isEqual = strangeFloat.SequenceEqual(bitConverterResult);
            Console.WriteLine("IsEqual: {0}", isEqual);
        }
    }
}

Result ( https://ideone.com/p5fsrE ):

1BC4AB7F
1BC4EB7F
IsEqual: False

This behaviour depends on platform and configuration: the code converts the number without errors on x64 in all configurations and on x86/Debug. On x86/Release the error exists. Also, if I change

byte[] bitConverterResult = BitConverter.GetBytes(array[0]);

to

float f = array[0];
byte[] bitConverterResult = BitConverter.GetBytes(f);

then it is erroneous on x86/Debug as well. I researched the problem and found that the compiler generates x86 code that uses the FPU registers (!) to hold the float value (FLD/FST instructions). But the FPU sets the high bit of the mantissa to 1 instead of 0, so it modifies the value, although the logic was just to pass the value through unchanged. On x64 an xmm0 register is used (SSE) and it works fine.

[Question] What is this: somewhere-documented undefined behavior for NaN values, or a JIT/optimization bug? And why does the compiler use the FPU and SSE at all when no arithmetic operations are performed?

Update 1

Debug configuration - passes the value via the stack without side effects - correct result:

byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
02232E45 mov eax,dword ptr [ebp-44h]
02232E48 cmp dword ptr [eax+4],0
02232E4C ja 02232E53
02232E4E call 71EAC65A
02232E53 push dword ptr [eax+8]        // eax+8 points to "1b c4 ab 7f" CORRECT!
02232E56 call 7136D8E4
02232E5B mov dword ptr [ebp-5Ch],eax   // eax points to managed array data
                                       // "fc 35 d7 70 04 00 00 00 __1b c4 ab 7f__" and this is correct
02232E5E mov eax,dword ptr [ebp-5Ch]
02232E61 mov dword ptr [ebp-48h],eax

Release configuration - the optimizer or the JIT does a strange pass via FPU registers and breaks the data - incorrect:

byte[] bitConverterResult = BitConverter.GetBytes(array[0]);
00B12DE8 cmp dword ptr [edi+4],0
00B12DEC jbe 00B12E3B
00B12DEE fld dword ptr [edi+8]         // edi+8 points to "1b c4 ab 7f"
00B12DF1 fstp dword ptr [ebp-10h]      // ebp-10h points to "1b c4 eb 7f" (FAIL)
00B12DF4 mov ecx,dword ptr [ebp-10h]
00B12DF7 call 70C75810
00B12DFC mov edi,eax
00B12DFE mov ecx,esi
00B12E00 call dword ptr ds:[4A70860h]
I am just posting @HansPassant's comment as an answer: "The x86 jitter uses the FPU to handle floating point values. This is not a bug. Your assumption that those byte values are a proper argument to a method that takes a float argument is just wrong." In other words, this is just a GIGO case (Garbage In, Garbage Out).
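If the goal is only to round-trip the raw IEEE-754 bytes, one way to sidestep the FPU quieting the signaling NaN is to keep the bits in an integer type. A small sketch, not taken from the original answer:

using System;

class NanBitsRoundTrip
{
    static void Main()
    {
        byte[] strangeFloat = { 0x1B, 0xC4, 0xAB, 0x7F };

        // The payload never touches a float, so no FPU register can rewrite it.
        uint bits = BitConverter.ToUInt32(strangeFloat, 0);
        byte[] roundTripped = BitConverter.GetBytes(bits);

        Console.WriteLine(BitConverter.ToString(strangeFloat));   // 1B-C4-AB-7F
        Console.WriteLine(BitConverter.ToString(roundTripped));   // 1B-C4-AB-7F
    }
}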
How to write hex values in byte array?
I have a hex value of 0x1047F71 and I want to put it in a byte array of 4 bytes. Is this the right way to do it:

byte[] sync_welcome_sent = new byte[4] { 0x10, 0x47, 0xF7, 0x01 };

or

byte[] sync_welcome_sent = new byte[4] { 0x01, 0x04, 0x7F, 0x71 };

I would appreciate any help.
If you want to be compatible with Intel little-endian, the answer is "none of the above", because the answer would be "71h, 7Fh, 04h, 01h". For big-endian, the second option above is correct: "01h, 04h, 7Fh, 71h". You can get the bytes with the following code:

uint test = 0x1047F71;
var bytes = BitConverter.GetBytes(test);

If you want big-endian, you can just reverse the bytes using Linq like so:

var bytes = BitConverter.GetBytes(test).Reverse();

However, if you are running the code on a big-endian system, reversing the bytes will not be necessary, since BitConverter.GetBytes() will return them as big-endian on a big-endian system. Therefore you should write the code as follows:

uint test = 0x1047F71;
var bytes = BitConverter.GetBytes(test);
if (BitConverter.IsLittleEndian)
    bytes = bytes.Reverse().ToArray();
// now bytes[] are big-endian no matter what system the code is running on.
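On runtimes that have System.Buffers.Binary.BinaryPrimitives (.NET Core 2.1 or later), a sketch of an alternative that names the byte order explicitly, so no IsLittleEndian check is needed:

using System;
using System.Buffers.Binary;

class HexToBytes
{
    static void Main()
    {
        uint value = 0x1047F71;

        byte[] bigEndian = new byte[4];
        byte[] littleEndian = new byte[4];
        BinaryPrimitives.WriteUInt32BigEndian(bigEndian, value);       // 01-04-7F-71
        BinaryPrimitives.WriteUInt32LittleEndian(littleEndian, value); // 71-7F-04-01

        Console.WriteLine(BitConverter.ToString(bigEndian));
        Console.WriteLine(BitConverter.ToString(littleEndian));
    }
}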