Does BitConverter handle little-endianness incorrectly? - c#

I'm currently writing something in C#/.NET that involves sending unsigned 16-bit integers in a network packet. The ordering of the bytes needs to be big endian.
At the bit level, my understanding of 'big endian' is that the most significant bit goes at the end, and in reverse for little endian.
And at the byte level, my understanding is the same -- if I'm converting a 16 bit integer to the two 8 bit integers that comprise it, and the architecture is little endian, then I would expect the most significant byte to go at the beginning.
However, BitConverter appears to put the byte with the smallest value at the end of the array, as opposed to the byte with the least-significant value, e.g.
ushort number = 4;
var bytes = BitConverter.GetBytes(number);
Debug.Assert(bytes[BitConverter.IsLittleEndian ? 0 : 1] == 0);
For clarity, if my understanding is correct, then on a little endian machine I would expect the above to return 0x00, 0x04 and on a big endian machine 0x04, 0x00. However, on my little endian Windows x86 workstation running .NET 5, it returns 0x04, 0x00
It's even documented that they've considered the endianness. From: https://learn.microsoft.com/en-us/dotnet/api/system.bitconverter.getbytes?view=net-5.0
The order of bytes in the array returned by the GetBytes method depends on whether the computer architecture is little-endian or big-endian.
Am I being daft or does this seem like the wrong behaviour?

I am indeed being daft. As #mjwills pointed out, and Microsoft's documentation explains (https://learn.microsoft.com/en-us/dotnet/api/system.bitconverter.islittleendian?view=net-5.0#remarks):
"Big-endian" means the most significant byte is on the left end of a word. "Little-endian" means the most significant byte is on the right end of a word.
Wikipedia has a slightly better explanation:
A big-endian system stores the most significant byte of a word at the smallest memory address and the least significant byte at the largest. A little-endian system, in contrast, stores the least-significant byte at the smallest address.
So, if you imagine the memory addresses, converting a 16-bit integer with a value of 4 becomes:
Address          0x00    0x01
Little-endian    0x04    0x00
Big-endian       0x00    0x04
Hopefully this'll help anyone equally daft in future!
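For the original goal (big-endian / network byte order on the wire), a minimal sketch along these lines should work on .NET 5, where System.Buffers.Binary.BinaryPrimitives is available:

using System;
using System.Buffers.Binary;

ushort number = 4;

// Produce big-endian bytes directly, regardless of the machine's native endianness.
byte[] bigEndian = new byte[2];
BinaryPrimitives.WriteUInt16BigEndian(bigEndian, number);
Console.WriteLine(BitConverter.ToString(bigEndian)); // 00-04

// Alternative: reverse BitConverter's native-order output when the host is little-endian.
byte[] bytes = BitConverter.GetBytes(number);
if (BitConverter.IsLittleEndian)
    Array.Reverse(bytes);
Console.WriteLine(BitConverter.ToString(bytes)); // 00-04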

Related

What does "int &= 0xFF" in a checksum do?

I implemented this checksum algorithm I found, and it works fine but I can't figure out what this "&= 0xFF" line is actually doing.
I looked up the bitwise & operator, and Wikipedia claims it's a logical AND of all the bits in A with B. I also read that 0xFF is equivalent to 255 -- which should mean that all of the bits are 1. If you take any number & 0xFF, wouldn't that be the identity of the number? So A & 0xFF produces A, right?
So then I thought, wait a minute, checksum in the code below is a 32 bit Int, but 0xFF is 8bit. Does that mean that the result of checksum &= 0xFF is that 24 bits end up as zeros and only the remaining 8 bits are kept? In which case, checksum is truncated to 8 bits. Is that what's going on here?
private int CalculateChecksum(byte[] dataToCalculate)
{
    int checksum = 0;
    for (int i = 0; i < dataToCalculate.Length; i++)
    {
        checksum += dataToCalculate[i];
    }
    // What does this line actually do?
    checksum &= 0xff;
    return checksum;
}
Also, if the result is getting truncated to 8 bits, is that because 32 bits is pointless in a checksum? Is it possible to have a situation where a 32 bit checksum catches corrupt data when 8 bit checksum doesn't?
It is masking off the higher bytes, leaving only the lower byte.
checksum &= 0xFF;
Is syntactically short for:
checksum = checksum & 0xFF;
Since the operation is performed on ints, the 0xFF is widened to an int:
checksum = checksum & 0x000000FF;
Which masks off the upper 3 bytes and returns the lower byte as an integer (not a byte).
To answer your other question: Since a 32-bit checksum is much wider than an 8-bit checksum, it can catch errors that an 8-bit checksum would not, but both sides need to use the same checksum calculations for that to work.
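To illustrate the masking described above (the sample bytes here are arbitrary):

using System;

byte[] sample = { 200, 200, 100 }; // arbitrary sample bytes

int sum = 0;
foreach (byte b in sample)
    sum += b;                      // 200 + 200 + 100 = 500 = 0x000001F4

Console.WriteLine(sum);            // 500
Console.WriteLine(sum & 0xFF);     // 244 (0xF4) -- the upper three bytes are masked off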
Seems like you have a good understanding of the situation.
Does that mean that the result of checksum &= 0xFF is that 24 bits end up as zeros and only the remaining 8 bits are kept?
Yes.
Is it possible to have a situation where a 32 bit checksum catches corrupt data when 8 bit checksum doesn't?
Yes.
This is performing a simple checksum on the bytes (8-bit values) by adding them and ignoring any overflow into higher-order bits. The final &= 0xFF, as you suspected, just truncates the value to the 8 least significant bits of the 32-bit int, resulting in an unsigned value between 0 and 255.
The truncation to 8 bits and throwing away the higher order bits is simply the algorithm defined for this checksum implementation. Historically this sort of check value was used to provide some confidence that a block of bytes had been transferred over a simple serial interface correctly.
To answer your last question then yes, a 32 bit check value will be able to detect an error that would not be detected with an 8 bit check value.
Yes, the checksum is truncated to 8 bits by the &= 0xFF: the lowest 8 bits are kept and all higher bits are set to 0.
Narrowing the checksum to 8 bits does decrease its reliability. Just think of two 32-bit checksums that are different but whose lowest 8 bits are equal: after truncating to 8 bits they compare as equal, while as 32-bit values they do not.
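A quick sketch of that collision, with the two sums chosen purely for illustration:

using System;

// Two hypothetical 32-bit sums that differ but share the same low byte (0xA7).
int sumA = 0x3A7; // 935
int sumB = 0x5A7; // 1447

Console.WriteLine(sumA == sumB);                   // False -- a 32-bit comparison catches the difference
Console.WriteLine((sumA & 0xFF) == (sumB & 0xFF)); // True  -- the 8-bit checksums collide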

Reading numbers represented in both-endian format?

Conceptually, I'm having a hard time understanding how a 32-bit unsigned integer (which is 4 bytes) can be represented as 8 bytes, the first four of which are encoded using the little-endian format and the last four of which are encoded using the big-endian format.
I'm specifically referring to the ISO 9660 format which encodes some 16-bit and 32-bit integers in this fashion.
I tried the following but this obviously does not work because the BitConverter.ToUInt32() method only takes the first four bytes from the starting index.
byte[] leastSignificant = reader.ReadBytes(4, Endianness.Little);
byte[] mostSignificant = reader.ReadBytes(4, Endianness.Big);
byte[] buffer = new byte[8];
Array.Copy(leastSignificant, 0, buffer, 0, 4);
Array.Copy(mostSignificant, 0, buffer, 4, 4);
uint actualValue = BitConverter.ToUInt32(buffer, 0);
What is the proper way to read a 32-bit unsigned integer represented as 8 bytes encoded in both-endian format?
This is very typical for an ISO standard. The organization is not very good at creating decent standards, only good at creating compromises among its members. There are two basic ways they do that: either pick a sucky standard that makes everybody equally unhappy, or pick more than one so that everybody can be happy. Encoding a number twice falls into the latter category.
There's some justification for doing it this way. Optical disks have lots of bits that are very cheap to duplicate, and their formats are often designed to keep the playback hardware as cheap as possible. Mastering a disk is often very convoluted because of that; the Blu-ray standard is particularly painful.
Since your machine is little-endian, you only care about the little-endian value. Simply ignore the big-endian equivalent. Technically you could add a check that they are the same but that's just wasted effort.
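If you'd rather not depend on the host's endianness at all, you can decode one half of the field explicitly. A minimal sketch using BinaryPrimitives (available from .NET Core 2.1 onwards); the sample field below encodes the value 2048 in the both-endian layout from the question:

using System;
using System.Buffers.Binary;

// Bytes 0-3 are the little-endian copy, bytes 4-7 the big-endian copy.
byte[] field = { 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00 };

uint value = BinaryPrimitives.ReadUInt32LittleEndian(field.AsSpan(0, 4));

// Optional sanity check against the big-endian copy (usually skipped).
uint check = BinaryPrimitives.ReadUInt32BigEndian(field.AsSpan(4, 4));

Console.WriteLine(value);          // 2048
Console.WriteLine(value == check); // True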

How to read/write unsigned byte array between C# and Java on a file?

This question is a bit similar to my previous one, where I asked a "cross-language" way to write and read integers between a Java and a C# program. Problem was the endianess.
Anyway, I'm facing a secondary problem. My goal is now to store and retrieve an array of unsigned bytes (values from 0 to 255) in a way that it can be processed by both Java and C#.
In C# it's easy, since unsigned byte[] exists:
BinaryWriterBigEndian writer = new BinaryWriterBigEndian(fs);
// ...
writer.Write(byteData, 0, byteData.Length);
BinaryWriterBigEndian is... well... a big-endian binary writer ;)
This way, the file will contain a sequence composed by, for example, the following values (in a big-endian representation):
[0][120][50][250][221]...
Now it's time to do the same thing under Java. Since unsigned byte[] does not exist here, the array is stored in memory as a (signed) int[] in order to have the possibility to represent values higher than 127.
How to write it as a sequence of unsigned byte values like C# does?
I tried with this:
ByteBuffer byteBuffer = ByteBuffer.allocate(4 * dataLength);
IntBuffer intBuffer = byteBuffer.asIntBuffer();
intBuffer.put(intData);
outputStream.write(byteBuffer.array());
Writing goes well, but C# is not able to read it in the proper way.
Since unsigned byte[] does not exist [...]
You don't care. Signed or unsigned, a byte is ultimately 8 bits. Just use a regular ByteBuffer and write your individual bytes in it.
In C# as well in Java, 1000 0000 (for instance) is exactly the same binary representation of a byte; the fact that in C# it can be treated as an unsigned value, and not in Java, is irrelevant as long as you don't do any arithmetic on the value.
When you need a readable representation of it and you'd like it to be unsigned, you can use (int) (theByte & 0xff) (you need the mask, otherwise casting will "carry" the sign bit).
Or, if you use Guava, you can use UnsignedBytes.toString().
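On the C# side nothing special is needed when reading the data back either, since C#'s byte is already unsigned; a minimal sketch (the file name is only an example):

using System;
using System.IO;

// C#'s byte is unsigned (0..255), so the values Java wrote come back directly.
byte[] data = File.ReadAllBytes("data.bin"); // "data.bin" is only an example file name

foreach (byte b in data)
    Console.Write(b + " "); // e.g. 0 120 50 250 221 ...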

How to get complete bytes for unicode characters in C#?

For the symbol greater than, >, I need to get the complete bytes, which I understand to be \u003E. Now C# only gives me 3E. Is there any way to get all of it, i.e. \u003E?
I am using the following line of code.
Encoding.UTF8.GetBytes(">");
In a text file I have the following
\u003c
Which I need to search down at the byte level
Thanks!
In UTF-8 the (ASCII range) char > is encoded into 1 byte.
If you want the string "003E" you can use:
Encoding.UTF8.GetBytes(">")[0].ToString("X4");
and maybe add "\u" in front.
If you want an array of 2 bytes containing { 0x00, 0x3E } then use
Encoding.BigEndianUnicode.GetBytes(">");
(Encoding.Unicode is UTF-16 little-endian and would give you { 0x3E, 0x00 }.)
What bytes make up > differs from encoding to encoding - in UTF-8 it really is only 0x3E, in UTF-16 it is 0x3E 0x00 (little-endian) or 0x00 0x3E (big-endian), so you need
Encoding.XXXX.GetBytes(">");
with XXXX being the encoding of your choice, e.g. UTF8, Unicode or BigEndianUnicode
The answer you get is correct - 3E is the hex representation of U+003E.
If you want the UTF-16 bytes (i.e. a 2-byte array) then simply use this encoding:
Encoding.Unicode.GetBytes(">");
I wrote a rather lengthy piece at http://www.hackcraft.net/xmlUnicode/#sect4 some years back, that says the following in more detail, but:
> is a character. It's a purely conceptual item that we understand as having one or more meanings, uses and ways of writing it depending on different lingual and textual contexts. It's an abstract concept rather than anything we can actually make use of in a computer.
U+003E (which in C# is represented as \u003E) is a code-point. It's a way of assigning a number to a character, but it's still a rather abstract thing. The number 0x3E (62) is still an abstract concept rather than something we can use in a computer.
00111110, 0000000000111110, 0011111000000000, 00000000000000000000000000111110 and 00111110000000000000000000000000 are all different ways commonly used to represent that code-point in actual 1s and 0s that computers can represent by pulses of electrical charge.
In between, as programmers we tend to think of those as either 0x3E, 0x003E or 0x0000003E, numbers mapped to the datatypes we actually use. The difference between 0000000000111110 and 0011111000000000 here is one of endianness, and mostly we don't think about it at this point, having already (if necessary) thought "must make sure the endianness is correct", because that "if necessary" tends to happen at a level where one doesn't think of characters at all.
Actually, as programmers we tend to think of it mostly as the > we started with. Abstractions are great.
Your code that uses UTF-8 is using one of the different ways of turning characters into bytes, the one that turns U+003E into 0x3E. There are others, though UTF-8 is the one most useful for most interchange. It is therefore one of the correct answers to "the complete bytes for '>'". The byte 0x00 followed by 0x3E and the byte 0x3E followed by 0x00 would be two other correct answers, both forms of UTF-16 with different endianness. The byte sequences 0x00, 0x00, 0x00, 0x3E and 0x3E, 0x00, 0x00, 0x00 would both be correct UTF-32.
If you have a reason for wanting a particular one of these, use the appropriate encoding. If in doubt, use UTF-8 as you were doing.
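A short sketch showing those alternatives side by side:

using System;
using System.Text;

// The same character '>' (U+003E) as bytes under different encodings.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(">")));             // 3E
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(">")));          // 3E-00 (UTF-16, little-endian)
Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(">"))); // 00-3E (UTF-16, big-endian)
Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(">")));            // 3E-00-00-00 (UTF-32, little-endian)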

Converting byte[] of binary fixed point to floating point value

I'm reading some data over a socket. The integral data types are no trouble, the System.BitConverter methods are correctly handling the conversion. (So there are no Endian issues to worry about, I think?)
However, BitConverter.ToDouble isn't working for the floating point parts of the data...the source specification is a bit low level for me, but talks about a binary fixed point representation with a positive byte offset in the more significant direction and negative byte offset in the less significant direction.
Most of the research I've done has been aimed at C++ or a full fixed-point library handling sines and cosines, which sounds like overkill for this problem. Could someone please help me with a C# function to produce a float from 8 bytes of a byte array with, say, a -3 byte offset?
Further details of format as requested:
The signed numerical value of fixed point data shall be represented using binary, two's-complement notation. For fixed point data, the value of each data parameter shall be defined in relation to the reference byte. The reference byte defines an eight-bit field, with the unit of measure in the LSB position. The value of the LSB of the reference byte is ONE.
Byte offset shall be defined by a signed integer indicating the position of the least significant byte of a data element relative to the reference byte.
The MSB of the data element represents the sign bit. Bit positions between the MSB of the parameter absolute value and the MSB of the most significant byte shall be equal in value to the sign bit.
Floating point data shall be represented as a binary floating point number in conformance with the IEEE ANSI/IEEE Std 754-2008. (This sentence is from a different section which may be a red herring).
OK, after asking some questions of a local expert on the source material, it turns out CodeInChaos was on the right track: if the value is 8 bytes with a -3 byte offset, then I can use BitConverter.ToInt64 / 256^3; if it is 4 bytes with a -1 byte offset, then BitConverter.ToInt32 / 256 produces the correct answer. I guess that means BitConverter.ToXXX, where XXX is a signed type, is smart enough to handle the two's-complement calculations!
Thanks to those who tried to help out. I thought it couldn't be too complicated, but getting that factor of 256 from the reference document's wording was very confusing :-)
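For reference, a sketch of that conversion as a reusable method; the name and parameters are only illustrative, and it assumes the bytes are already in host byte order, as in the question:

using System;

// Sketch only: byteOffset is the spec's signed byte offset of the data element's LSB
// relative to the reference byte, so an offset of -3 means "divide the raw
// two's-complement integer by 256^3".
static double FixedPointToDouble(byte[] buffer, int startIndex, int byteOffset)
{
    long raw = BitConverter.ToInt64(buffer, startIndex); // two's complement handled by ToInt64
    return raw * Math.Pow(256, byteOffset);              // e.g. -3 => raw / 16777216
}

// Example: raw value 0x01000000 (16777216) with a -3 byte offset comes out as 1.
byte[] sample = BitConverter.GetBytes(0x0000000001000000L);
Console.WriteLine(FixedPointToDouble(sample, 0, -3)); // 1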
System.BitConverter is quite slow, so if performance matters to you, I'd recommend converting the bytes to an int yourself (via logical shifts).
Also, please specify the exact format in which floats are sent in your protocol.
