C# Encoding.UTF8 messing up the byte[]

I am facing a very strange problem: I have a byte[] and when I pass it to the Encoding.UTF8.GetString(byte[] bytes) method, the encoding mangles my bytes, replacing the few special bytes (which I use as markers in my system) with a three-character replacement sequence.
[0] 70 byte
[1] 49 byte
[2] 45 byte
[3] 86 byte
[4] 49 byte
[5] 253 byte <-- Special byte
[6] 70 byte
[7] 49 byte
[8] 45 byte
[9] 86 byte
[10] 50 byte
[11] 253 byte <-- Special byte
[12] 70 byte
[13] 49 byte
[14] 45 byte
[15] 86 byte
[16] 51 byte
When I pass the above byte[] into the Encoding.UTF8.GetString(bytes) method I get the following output:
private Encoding _encoding = System.Text.Encoding.GetEncoding("UTF-8", new EncoderReplacementFallback("?"), new DecoderReplacementFallback("?"));
_encoding.GetString(bytes) "F1-V1�F1-V2�F1-V3" string
The actual value should not contain '�'; it means decoding failed and those special bytes were replaced with '�'. Is there any way I can get around this, i.e. convert to a string and keep each special byte represented as a single char?
I have the following special bytes which I am trying to use as markers:
byte AM = (byte) 254
byte VM = (byte) 253
byte SM = (byte) 252
Your help and comments will be appreciated.
Thanks,
--
Sheeraz

You cannot use those special values as markers inside a UTF-8 string, because the string ends up being invalid according to the UTF-8 encoding rules.
You could sneakily insert them and then take them back out before the data is fed to UTF-8-aware code like Encoding.GetString, but that's not a good idea exactly because it's sneaky (way confusing to anyone who does not already know what voodoo is happening in there, and thus very counter-productive).
A saner option would be to simply insert "special" but valid UTF-8 encoded characters into your string as markers. This technically requires that you also come up with a scheme to escape those characters when they occur naturally inside your payload (especially if you pick a character that encodes to 1 byte, as such characters are more likely to occur inside your actual payload as well).
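A minimal sketch of that idea, assuming you pick the ASCII unit separator (0x1F) as the marker and a backslash as the escape character (both choices, and the helper name, are illustrative, not anything from the question):

using System;
using System.Text;

class DelimiterSketch
{
    const char Delimiter = '\u001F';   // assumed marker: stays a single byte in UTF-8
    const char Escape = '\\';          // assumed escape character

    static string Join(params string[] fields)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < fields.Length; i++)
        {
            if (i > 0) sb.Append(Delimiter);
            foreach (char c in fields[i])
            {
                // Escape the marker and the escape char when they occur naturally in the payload.
                if (c == Delimiter || c == Escape) sb.Append(Escape);
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        string joined = Join("F1-V1", "F1-V2", "F1-V3");
        byte[] bytes = Encoding.UTF8.GetBytes(joined);       // valid UTF-8
        Console.WriteLine(Encoding.UTF8.GetString(bytes));   // round-trips without '�'
    }
}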

The data is only UTF-8 between the markers, so if it were me I would extract the delimited portions first and then UTF-8 decode each portion separately, i.e. read through the byte[] looking for the markers in your binary data, giving you 3 binary chunks (70,49,45,86,49; 70,49,45,86,50; 70,49,45,86,51) which are then decoded into 3 strings. You can't UTF-8 decode the entire binary sequence because it is not valid UTF-8.
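For illustration, a sketch of that extraction using the VM marker byte 253 from the question (the helper name is mine, not from the answer):

using System;
using System.Collections.Generic;
using System.Text;

class SplitOnMarker
{
    // Split the raw bytes on the marker and UTF-8 decode each chunk separately.
    static List<string> DecodeChunks(byte[] data, byte marker)
    {
        var result = new List<string>();
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            if (i == data.Length || data[i] == marker)
            {
                result.Add(Encoding.UTF8.GetString(data, start, i - start));
                start = i + 1;
            }
        }
        return result;
    }

    static void Main()
    {
        byte[] data = { 70, 49, 45, 86, 49, 253, 70, 49, 45, 86, 50, 253, 70, 49, 45, 86, 51 };
        foreach (string s in DecodeChunks(data, 253))
            Console.WriteLine(s);   // F1-V1, F1-V2, F1-V3
    }
}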
However, personally, I would say that using a delimiter is dangerous here; I would probably go for a length-prefix approach, so that:
- I know that I'm not accidentally conflating delimiters and real data
- I can process it more efficiently than byte-by-byte
For example, if we used a "varint" length prefix, that would be:
05,70,49,45,86,49,05,70,49,45,86,50,05,70,49,45,86,51
where the 05 is the "varint" length which we interpret as 5 bytes; this means we can process nicely:
// pseudo code
while (!EOF) {
    int len = ReadVarint();
    var blob = ReadBytes(len);
    string s = Utf8Decode(blob);
    // ...
}
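One concrete way that pseudo code might look in C#, reading from a Stream and using the sample bytes above (a sketch under those assumptions, not a full protocol implementation):

using System;
using System.IO;
using System.Text;

class LengthPrefixedReader
{
    // Reads a 7-bit "varint": 7 payload bits per byte, a set high bit means more bytes follow.
    static int ReadVarint(Stream stream)
    {
        int value = 0, shift = 0, b;
        do
        {
            b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Reads exactly count bytes, looping because Stream.Read may return fewer.
    static byte[] ReadBytes(Stream stream, int count)
    {
        var buffer = new byte[count];
        int read = 0;
        while (read < count)
        {
            int n = stream.Read(buffer, read, count - read);
            if (n <= 0) throw new EndOfStreamException();
            read += n;
        }
        return buffer;
    }

    static void Main()
    {
        byte[] payload = { 5, 70, 49, 45, 86, 49, 5, 70, 49, 45, 86, 50, 5, 70, 49, 45, 86, 51 };
        var stream = new MemoryStream(payload);
        while (stream.Position < stream.Length)
        {
            int len = ReadVarint(stream);
            byte[] blob = ReadBytes(stream, len);
            Console.WriteLine(Encoding.UTF8.GetString(blob));   // F1-V1, F1-V2, F1-V3
        }
    }
}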

Related

c# string to c++ wstring using Encoding.Unicode.Getbytes()

So the issue is that when using C# the char is 4 bytes, so "ABC" is (65 0 66 0 67 0).
When inputting that into a wstring in C++ by sending it over a socket, I get the following output: A.
How am I able to convert such a string to a C++ string?
Sounds like you need ASCII or UTF-8 encoding instead of Unicode.
65 0 66 0 67 0 is only going to get you the A, since the next zero is interpreted as a null termination character in C++.
Strategies for converting Unicode to ASCII can be found here.
using c# the char is 4 bytes
No, in C# strings are encoded as UTF-16. A code unit takes two bytes in UTF-16, and for simple characters a single code unit can represent a code point (e.g. 65 0).
On Windows wstring is usually UTF-16 encoded (2-4 bytes) too, but on Unix/Linux wstring usually uses UTF-32 encoding (always 4 bytes).
Unicode code points have the same numerical values as ASCII, which is why UTF-16 encoded ASCII text often looks like this: {num} 0 {num} 0 {num} 0...
See the details here: https://en.wikipedia.org/wiki/UTF-16
Could you show us some code for how you constructed your wstring object?
The null byte is critical here, because it was the end marker for ASCII / ANSI strings.
I have been able to solve the issue by using a std::u16string.
Here is some example code:
std::vector<char> data = { 65, 0, 66, 0, 67, 0 };
// std::u16string has no constructor taking char*, so reinterpret the buffer as char16_t
std::u16string str(reinterpret_cast<const char16_t*>(data.data()), data.size() / 2);
// now str holds u"ABC" (assuming the sender's byte order matches, i.e. little-endian UTF-16)
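For reference, a small C# sketch of what each encoding produces for "ABC" on the sending side (the actual sending code isn't shown in the question, so this only illustrates the byte layouts discussed above):

using System;
using System.Text;

class Sender
{
    static void Main()
    {
        // Encoding.Unicode is UTF-16LE: each simple character becomes 2 bytes.
        byte[] utf16 = Encoding.Unicode.GetBytes("ABC");
        Console.WriteLine(string.Join(" ", utf16));   // 65 0 66 0 67 0

        // If the receiver expects a narrow (char-based) string instead, encode as UTF-8/ASCII.
        byte[] utf8 = Encoding.UTF8.GetBytes("ABC");
        Console.WriteLine(string.Join(" ", utf8));    // 65 66 67
    }
}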

How does deserialising byte arrays to utf8 know when each character starts/ends?

I am a bit confused about how networking does this. I have a string in C# and I serialise it to UTF-8. But according to UTF-8, each character takes up "possibly" 1 to 4 bytes.
So if my server receives this byte array over the net and deserialises it, knowing it's a UTF-8 string of some size, how does it know how many bytes each character is in order to convert it properly?
Will I have to include the total bytes for each string in the protocol, e.g.:
[message length][char byte length=1][2][char byte length=2][56][123][ ... etc...]
Or is this unnecessary?
UTF-8 encodes the number of bytes required in the bits that make up the character. Read the description on Wikipedia: only single-byte code points start with a 0 bit, only the lead byte of a two-byte code point starts with the bits 110, and only continuation bytes inside a multi-byte code point start with 10.
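A small sketch showing those lead-byte patterns in practice; the decoder uses exactly these bits to find character boundaries, so no extra per-character lengths are needed in the protocol:

using System;
using System.Text;

class Utf8LeadBytes
{
    static void Main()
    {
        // "Aé€" mixes a 1-byte, a 2-byte and a 3-byte UTF-8 character.
        byte[] utf8 = Encoding.UTF8.GetBytes("Aé€");
        foreach (byte b in utf8)
        {
            string kind =
                (b & 0x80) == 0x00 ? "single-byte character (0xxxxxxx)" :
                (b & 0xE0) == 0xC0 ? "lead byte of a 2-byte character (110xxxxx)" :
                (b & 0xF0) == 0xE0 ? "lead byte of a 3-byte character (1110xxxx)" :
                (b & 0xF8) == 0xF0 ? "lead byte of a 4-byte character (11110xxx)" :
                                     "continuation byte (10xxxxxx)";
            Console.WriteLine($"{b:X2}: {kind}");
        }
    }
}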

Does it matter for the security of AES how the byte array is encoded in C#?

I have a byte array that was encrypted using AES, with the pass phrase hashed using SHA-256.
Everything works perfectly, but I'm wondering about the last part, where I have to encode the byte array that I get as a result of the encryption.
Does it matter, for the robustness of the end result, how the byte array is encoded: Base64, conversion to hexadecimal values, something else?
Logically speaking, it shouldn't matter, since there really aren't that many encoding methods and most of the time the most obvious one, Base64, is used.
But since I'm not that well versed in cryptography I just want to make sure.
Take a byte array as an example (a random array of bytes):
[0] 182
[1] 238
[2] 54
[3] 24
[4] 69
[5] 224
[6] 105
[7] 13
[8] 5
[9] 52
[10] 112
[11] 71
[12] 250
[13] 163
[14] 234
[15] 234
This gives a possible result in Base64 (random result, does not match above):
ou+yUEkilfrGIF3HBH08vu8A==
Using BitConverter to transform it to hexadecimal values gives (random result, does not match above):
A2EBCA1945E8BC920532F068D27BAEF1
It's simple to convert the above results back to the respective byte array and only then does the hard part start.
Does it matter, for the robustness of the end result, how the byte array is encoded: Base64, conversion to hexadecimal values, something else?
No, not at all. So long as you're encoding it in a lossless format (which both base64 and hex are) that's fine. Don't use something like Encoding.ASCII.GetString(...) - that would be lossy and inappropriate. (Don't use Encoding at all for this task.)
Just ask yourself whether you could reverse your encoding and get back to the original bytes - if so, you're fine. (And that's true for hex and base64, assuming it's properly implemented.)
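A quick sketch, using the sample bytes from the question, showing that both Base64 and hex round-trip losslessly, which is the only property that matters here:

using System;
using System.Linq;

class EncodingRoundTrip
{
    static void Main()
    {
        byte[] cipherText = { 182, 238, 54, 24, 69, 224, 105, 13,
                              5, 52, 112, 71, 250, 163, 234, 234 };

        // Base64 round trip: encode to text, decode back, compare.
        string base64 = Convert.ToBase64String(cipherText);
        byte[] fromBase64 = Convert.FromBase64String(base64);
        Console.WriteLine(cipherText.SequenceEqual(fromBase64));   // True

        // Hex round trip via BitConverter (strip the dashes before parsing back).
        string hex = BitConverter.ToString(cipherText).Replace("-", "");
        byte[] fromHex = Enumerable.Range(0, hex.Length / 2)
            .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
            .ToArray();
        Console.WriteLine(cipherText.SequenceEqual(fromHex));      // True
    }
}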
Look at it this way. The cipher must produce ciphertext, from which it should not be possible (in practice) to obtain plain text without knowing the key:
C=E(K,M) // K is key, M message
Now, what you have is ciphertext. Whether you encode it in Base64 or something else doesn't really matter, as the cipher already did its job when it produced the ciphertext: without the original key, it should not be possible (in practice) to retrieve the plain text from it.
So whatever you do afterwards, e.g. how you encode/decode the ciphertext, does not really matter.

BinaryReader ReadString specifying length?

I'm working on a parser to receive UDP information, parse it, and store it. To do so I'm using a BinaryReader since it will mostly be binary information. Some of it will be strings though. MSDN says for the ReadString() function:
Reads a string from the current stream. The string is prefixed with
the length, encoded as an integer seven bits at a time.
And I completely understand it up until "seven bits at a time" which I tried to simply ignore until I started testing. I'm creating my own byte array before putting it into a MemoryStream and attempting to read it with a BinaryReader. Here's what I first thought would work:
byte[] data = new byte[] { 3, 0, 0, 0, (byte)'C', (byte)'a', (byte)'t' };
BinaryReader reader = new BinaryReader(new MemoryStream(data));
String str = reader.ReadString();
Knowing an int is 4 bytes (and toying around long enough to find out that BinaryReader is Little Endian) I pass it the length of 3 and the corresponding letters. However str ends up holding \0\0\0. If I remove the 3 zeros and just have
byte[] data = new byte[] { 3, (byte)'C', (byte)'a', (byte)'t' };
Then it reads and stores Cat properly. To me this conflicts with the documentation saying that the length is supposed to be an integer. Now I'm beginning to think they simply mean a number with no decimal place and not the data type int. Does this mean that a BinaryReader can never read a string larger than 127 characters (since that would be 01111111 corresponding to the 7 bits part of the documentation)?
I'm writing up a protocol and need to completely understand what I'm getting into before I pass our documentation along to our clients.
I found the source code for BinaryReader. It uses a function called Read7BitEncodedInt() and after looking up that documentation and the documentation for Write7BitEncodedInt() I found this:
The integer of the value parameter is written out seven bits at a
time, starting with the seven least-significant bits. The high bit of
a byte indicates whether there are more bytes to be written after this
one. If value will fit in seven bits, it takes only one byte of space.
If value will not fit in seven bits, the high bit is set on the first
byte and written out. value is then shifted by seven bits and the next
byte is written. This process is repeated until the entire integer has
been written.
Also, Ralf found this link that better displays what's going on.
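Read7BitEncodedInt/Write7BitEncodedInt were protected members for a long time (they only became public on BinaryReader/BinaryWriter in newer .NET versions), so here is a stand-alone sketch of the algorithm the quoted documentation describes:

using System;
using System.Collections.Generic;

class SevenBitEncoding
{
    // Emit 7 bits at a time, least-significant first; the high bit of each
    // byte indicates whether another byte follows.
    static byte[] Encode7BitInt(int value)
    {
        var bytes = new List<byte>();
        uint v = (uint)value;
        while (v >= 0x80)
        {
            bytes.Add((byte)(v | 0x80));   // more bytes follow
            v >>= 7;
        }
        bytes.Add((byte)v);                // final byte, high bit clear
        return bytes.ToArray();
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encode7BitInt(3)));    // 03
        Console.WriteLine(BitConverter.ToString(Encode7BitInt(127)));  // 7F
        Console.WriteLine(BitConverter.ToString(Encode7BitInt(128)));  // 80-01
        Console.WriteLine(BitConverter.ToString(Encode7BitInt(300)));  // AC-02
    }
}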
Unless they specifically say 'int' or 'Int32', they just mean an integer as in a whole number.
By '7 bits at a time', they mean that it implements 7-bit length encoding, which seems a bit confusing at first but is actually rather straightforward. Here are some example values and how they are written out using 7-bit length encoding:
/*
decimal value    binary value                  ->  enc byte 1  enc byte 2  enc byte 3
           85    00000000 00000000 01010101    ->  01010101    n/a         n/a
        1,365    00000000 00000101 01010101    ->  11010101    00001010    n/a
      349,525    00000101 01010101 01010101    ->  11010101    10101010    00010101
*/
The table above uses big endian for no other reason than I simply had to pick one and it's what I'm most familiar with. The way 7-bit length encoding works, it is little endian by its very nature.
Note that 85 writes out to 1 byte, 1,365 writes out to 2 bytes, and 349,525 writes out to 3 bytes.
Here's the same table using letters to show how each value's bits were used in the written output (dashes are zero-value bits, and the 0s and 1s are what's added by the encoding mechanism to indicate if a subsequent byte is to be written/read)...
/*
decimal value    binary value                  ->  enc byte 1  enc byte 2  enc byte 3
           85    -------- -------- -AAAAAAA    ->  0AAAAAAA    n/a         n/a
        1,365    -------- -----BBB AAAAAAAA    ->  1AAAAAAA    0---BBBA    n/a
      349,525    -----CCC BBBBBBBB AAAAAAAA    ->  1AAAAAAA    1BBBBBBA    0--CCCBB
*/
So values in the range of 0 to 2^7-1 (127) will write out as 1 byte, values of 2^7 (128) to 2^14-1 (16,383) will use 2 bytes, 2^14 (16,384) to 2^21-1 (2,097,151) will take 3 bytes, and so on and so forth.
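Tying this back to the original question, a short sketch showing what BinaryWriter actually emits and that strings well past 127 characters round-trip fine (so there is no 127-character limit):

using System;
using System.IO;
using System.Text;

class ReadStringDemo
{
    static void Main()
    {
        var ms = new MemoryStream();
        var writer = new BinaryWriter(ms, Encoding.UTF8);
        writer.Write("Cat");                    // length prefix fits in 1 byte: 03 43 61 74
        writer.Write(new string('x', 300));     // 300 bytes needs a 2-byte prefix: AC 02
        writer.Flush();

        Console.WriteLine(BitConverter.ToString(ms.ToArray(), 0, 6));   // 03-43-61-74-AC-02

        ms.Position = 0;
        var reader = new BinaryReader(ms, Encoding.UTF8);
        Console.WriteLine(reader.ReadString());            // Cat
        Console.WriteLine(reader.ReadString().Length);     // 300
    }
}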

Can't read Integer in Stream after encoding it as UTF-8 instead of ASCII

I had problems with umlauts in ASCII, so I now encode my stream as UTF-8, which works but brings up a new problem. I normally read the 4 bytes before ARTIST to determine the length of ARTIST=WHOEVER using:
UTF8Encoding enc = new UTF8Encoding();
string response = enc.GetString(message, 0, bytesRead);
int posArtist = response.IndexOf("ARTIST");      // character index in the decoded string
BitConverter.ToInt32(message, posArtist - 4);    // used as a byte offset into the raw buffer
This works for ASCII perfectly.
The hex-editor examples are just to illustrate that reading the length no longer works like it did with ASCII.
Here is an example-screenshot from a hex-editor:
"ARTIST=M.A.N.D.Y. vs. Booka Shade" Length = 21
However that doesn't work for the UTF8-encoded stream.
Here is a screenshot:
"ARTIST=Paulseq" Length = E, but in the picture it's 2E.
What am I doing wrong here?
Your data is wrong - you actually have the character '\0' in the data where there should be binary zeroes.
The problem lies in how you created this data, not in the reading of it.
It is an utter mystery how you got 21 out of the ASCII data. The shaded byte is shown in hex; its real value is 33. There's no way you can get 21 out of BitConverter.ToInt32, that requires the byte values (in hex) 15 00 00 00.
This must have worked by accident but no idea what that accident might look like. Post more code, including the code that writes this.
My guess is that you are mixing tools. That is a binary stream; it should be read with a BinaryReader and written with a BinaryWriter. When writing text, use Encoding.GetBytes to get the raw bytes to write, and when reading use Encoding.GetString on the raw bytes read. BinaryWriter/BinaryReader have methods for values (like lengths) directly.
Only the strings should be UTF-8 encoded/decoded. If you're passing other (non-string) values in binary, the text encoders will destroy them.
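A sketch of what that BinaryWriter/BinaryReader pairing could look like for a field such as the one in the question (the framing here is illustrative; it is not the actual format of the stream being read):

using System;
using System.IO;
using System.Text;

class ArtistFrameSketch
{
    static void Main()
    {
        // Writing side: a real Int32 length, then the UTF-8 bytes of the string.
        var ms = new MemoryStream();
        var writer = new BinaryWriter(ms);
        byte[] payload = Encoding.UTF8.GetBytes("ARTIST=Paulseq");
        writer.Write(payload.Length);   // 4-byte little-endian length
        writer.Write(payload);          // raw UTF-8 bytes
        writer.Flush();

        // Reading side: read the length first, then exactly that many bytes.
        ms.Position = 0;
        var reader = new BinaryReader(ms);
        int len = reader.ReadInt32();
        string field = Encoding.UTF8.GetString(reader.ReadBytes(len));
        Console.WriteLine(len + ": " + field);   // 14: ARTIST=Paulseq
    }
}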
