BinaryReader ReadString specifying length? - c#

I'm working on a parser to receive UDP information, parse it, and store it. To do so I'm using a BinaryReader since it will mostly be binary information. Some of it will be strings though. MSDN says for the ReadString() function:
Reads a string from the current stream. The string is prefixed with
the length, encoded as an integer seven bits at a time.
And I completely understand it up until "seven bits at a time" which I tried to simply ignore until I started testing. I'm creating my own byte array before putting it into a MemoryStream and attempting to read it with a BinaryReader. Here's what I first thought would work:
byte[] data = new byte[] { 3, 0, 0, 0, (byte)'C', (byte)'a', (byte)'t' };
BinaryReader reader = new BinaryReader(new MemoryStream(data));
String str = reader.ReadString();
Knowing an int is 4 bytes (and toying around long enough to find out that BinaryReader is Little Endian) I pass it the length of 3 and the corresponding letters. However str ends up holding \0\0\0. If I remove the 3 zeros and just have
byte[] data = new byte[] { 3, (byte)'C', (byte)'a', (byte)'t' };
Then it reads and stores Cat properly. To me this conflicts with the documentation saying that the length is supposed to be an integer. Now I'm beginning to think they simply mean a number with no decimal place and not the data type int. Does this mean that a BinaryReader can never read a string larger than 127 characters (since that would be 01111111 corresponding to the 7 bits part of the documentation)?
I'm writing up a protocol and need to completely understand what I'm getting into before I pass our documentation along to our clients.

I found the source code for BinaryReader. It uses a function called Read7BitEncodedInt() and after looking up that documentation and the documentation for Write7BitEncodedInt() I found this:
The integer of the value parameter is written out seven bits at a
time, starting with the seven least-significant bits. The high bit of
a byte indicates whether there are more bytes to be written after this
one. If value will fit in seven bits, it takes only one byte of space.
If value will not fit in seven bits, the high bit is set on the first
byte and written out. value is then shifted by seven bits and the next
byte is written. This process is repeated until the entire integer has
been written.
Also, Ralf found this link that better displays what's going on.
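To answer the "never larger than 127 characters?" worry directly, here is a minimal round-trip sketch. The class and method names are made up for illustration; only BinaryWriter.Write(string) and BinaryReader.ReadString() are framework calls. It assumes the default UTF-8 encoding, so the prefix counts bytes, which equals the character count only for ASCII text.

```csharp
using System;
using System.IO;

public static class LongStringRoundTrip
{
    // Writes a string with BinaryWriter and returns the raw bytes produced:
    // a 7-bit encoded byte count followed by the encoded characters.
    public static byte[] Write(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream))
            {
                writer.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return stream.ToArray();
        }
    }

    // Reads the string back with BinaryReader.ReadString.
    public static string Read(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            return reader.ReadString();
        }
    }
}
```

For a 200-character ASCII string the prefix takes two bytes (0xC8, 0x01) instead of one, and ReadString returns the full string.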

Unless they specifically say 'int' or 'Int32', they just mean an integer as in a whole number.
By '7 bits at a time', they mean that it implements 7-bit length encoding, which seems a bit confusing at first but is actually rather straightforward. Here are some example values and how they are written out using 7-bit length encoding:
/*
decimal value   binary value                  ->  enc byte 1   enc byte 2   enc byte 3
85              00000000 00000000 01010101    ->  01010101     n/a          n/a
1,365           00000000 00000101 01010101    ->  11010101     00001010     n/a
349,525         00000101 01010101 01010101    ->  11010101     10101010     00010101
*/
The binary value column above is written big endian for no other reason than that I had to pick one and it's what I'm most familiar with. The 7-bit length encoding itself is little endian by its very nature: the seven least-significant bits are written first.
Note that 85 writes out to 1 byte, 1,365 writes out to 2 bytes, and 349,525 writes out to 3 bytes.
Here's the same table using letters to show how each value's bits were used in the written output (dashes are zero-value bits, and the 0s and 1s are what's added by the encoding mechanism to indicate if a subsequent byte is to be written/read)...
/*
decimal value   binary value                  ->  enc byte 1   enc byte 2   enc byte 3
85              -------- -------- -AAAAAAA    ->  0AAAAAAA     n/a          n/a
1,365           -------- -----BBB AAAAAAAA    ->  1AAAAAAA     0---BBBA     n/a
349,525         -----CCC BBBBBBBB AAAAAAAA    ->  1AAAAAAA     1BBBBBBA     0--CCCBB
*/
So values in the range of 0 to 2^7-1 (127) will write out as 1 byte, values of 2^7 (128) to 2^14-1 (16,383) will use 2 bytes, 2^14 (16,384) to 2^21-1 (2,097,151) will take 3 bytes, and so on and so forth.
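The scheme is small enough to write out by hand. The following is a sketch of the encoding as described above, not the framework's own source; the SevenBit class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SevenBit
{
    // Encodes a value seven bits at a time, least-significant group first.
    // The high bit of each byte is set when another byte follows.
    public static byte[] Encode(int value)
    {
        var output = new List<byte>();
        uint remaining = (uint)value;
        while (remaining >= 0x80)
        {
            output.Add((byte)((remaining & 0x7F) | 0x80));
            remaining >>= 7;
        }
        output.Add((byte)remaining); // last byte, high bit clear
        return output.ToArray();
    }

    // Reverses Encode: accumulates 7-bit groups until a byte with a clear high bit.
    public static int Decode(byte[] bytes)
    {
        int result = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            result |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) break;
        }
        return result;
    }
}
```

Encoding 85, 1,365 and 349,525 this way gives the byte sequences 55, D5 0A and D5 AA 15 (hex), matching the rows of the table.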

Related

Sending Int32 equal to 4, received as equal to 67108864

What's going on, I do this on the server:
var msg = Server.Api.CreateMessage();
msg.Write(2);
msg.Write(FreshChunks.Count());
Server.Api.SendMessage(msg, peer.Connection, NetDeliveryMethod.ReliableUnordered);
Then on the client it successfully reads the byte = 2, and the switch routes to a function that reads an Int32 (FreshChunks.Count), which was equal to 4 when sent but equals 67108864 when received. I've tried Int16 through Int64 and UInt16 through UInt64; none of them produce the correct value.
Given that:
In your usage of msg.Write(2), the compiler reads the 2 as an int (Int32)
You mentioned that you "successfully read the byte = 2".
It seems that one of these options is happening:
msg.Write is writing only the bytes that have at least one bit set (=1), to save space.
msg.Write is always casting the given argument to a byte.
When asking for 4 bytes (Int32),
You got:
0x04 00 00 00. The first byte is exactly the 4 you passed.
It seems that when you ask msg.Read for more bytes than the message has (you requested 4 bytes and it has only 1 due to the msg.Write logic),
it does one of these:
Pads the missing bytes with zeros
Keeps on reading, and in your case there were three zero bytes in the message's metadata that were returned to you.
For solving your problem, you should read the documentation of the Write and Read methods and understand how they behave.
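One part of this can be checked without knowing anything about the library: 67108864 is 0x04000000, which is the value 4 with its four bytes in the opposite order. The sketch below only confirms that arithmetic; it does not establish what msg.Write or the read call actually do. ByteOrderCheck is a hypothetical helper name.

```csharp
using System;

public static class ByteOrderCheck
{
    // Returns the value obtained by reversing the four bytes of a 32-bit integer.
    // The result is the same on little-endian and big-endian machines.
    public static int ReverseBytes(int value)
    {
        byte[] bytes = BitConverter.GetBytes(value);
        Array.Reverse(bytes);
        return BitConverter.ToInt32(bytes, 0);
    }
}
```

ByteOrderCheck.ReverseBytes(4) yields 67108864, so comparing the raw bytes on both ends of the connection is a reasonable first diagnostic step.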

How does deserialising byte arrays to utf8 know when each character starts/ends?

I am a bit confused about how networking does this. I have a string in C# and I serialise it to UTF-8. But according to UTF-8, each character takes up "possibly" 1 to 4 bytes.
So if my server receives this byte array over the net and deserialises it, knowing it's a UTF-8 string of some size, how does it know how many bytes each character takes so it can convert it properly?
Will I have to include the total bytes for each string in the protocol, e.g.:
[message length][char byte length=1][2][char byte length=2][56][123][ ... etc...]
Or is this unnecessary ?
UTF-8 encodes the number of bytes required in the leading bits of each character's first byte. Read the description on Wikipedia: single-byte code points start with a 0 bit, two-byte sequences start with the bits 110, three-byte sequences with 1110, four-byte sequences with 11110, and every byte after the first inside a multi-byte sequence starts with 10.
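As a rough illustration of how a decoder can tell where each character starts, here is a sketch that classifies a byte by its leading bits. Utf8Lead is a hypothetical helper; it only inspects bit patterns and does not perform the full validation a real decoder such as Encoding.UTF8 does.

```csharp
using System;

public static class Utf8Lead
{
    // Returns the total length in bytes of the UTF-8 sequence that starts with
    // this byte, or 0 if the byte cannot start a sequence (a continuation byte
    // or a pattern UTF-8 does not use).
    public static int SequenceLength(byte b)
    {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;                         // 10xxxxxx or 11111xxx
    }
}
```

So no per-character length field is needed in the protocol; only the total byte length of the string has to be known.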

C# How to Encode a string to minimize bytes

Is it possible to Encode a string in a certain way to minimize the number of bytes? basically i need to get 29 characters down to 11 bytes of data.
var myString = "usmiaanzaklaacn40879005900133";
byte[] bytes = Encoding.UTF8.GetBytes(myString);
Console.WriteLine(bytes.Length); //Output = 29, 1 byte per character
Console.ReadKey();
This shows that encoding with UTF-8 turns the 29-character string into 29 bytes. I need the 29-character string to fit in 11 bytes or less. Is this possible? I was thinking I could possibly have some sort of lookup or binary mapping algorithm, but I am a little unsure how to go about this in C#.
EDIT:
So I have a chip that has a custom data payload of 11 bytes. I want to be able to compress a 29-character string (that is unique) into bytes, assign it to the "custom data", and then receive the custom data bytes and decompress them back to the 29-character string. I don't know if this is possible, but any help would be greatly appreciated. Thanks :)
the string itself [usmia]-[anzakl]-[aacn40879005900]-[133] = [origin]-[dest]-[random/unique]-[weight]
Ok the last 14 characters are integers.
I have access to all the origins and destinations. Would it be feasible to create a key-value store with the origin (e.g. usmia) as the key and a particular byte as the value? I guess that would mean I could only have 256 different origins and destinations, and then I could just make the last 14 characters an integer?
15 lg(26) + 14 lg(10) ~= 117 bits ~= 14.6 bytes. (lg = log base 2)
So even if I were optimistic and assumed that your strings were always 15 lowercase letters followed by 14 digits, it would still take a minimum of 15 bytes to represent them.
Unless there are more restrictions, like only the lower case letters a, c, i, k, l, m, n, s, u, and z are allowed, then no, you can't code that into 11 bytes. Whoops, wait, not even then. Even that would take a little over 12 bytes.
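The arithmetic above is easy to reproduce. This sketch assumes, as the answer does, that every string is a fixed number of letters from a fixed alphabet followed by a fixed number of decimal digits; PayloadBound and its parameters are illustrative names.

```csharp
using System;

public static class PayloadBound
{
    // Information content, in bits, of `letters` symbols drawn from an alphabet
    // of `alphabetSize` followed by `digits` decimal digits.
    public static double Bits(int letters, int alphabetSize, int digits)
    {
        return letters * Math.Log(alphabetSize, 2) + digits * Math.Log(10, 2);
    }

    // Smallest whole number of bytes that can hold that many bits.
    public static int MinBytes(double bits)
    {
        return (int)Math.Ceiling(bits / 8.0);
    }
}
```

With 26 possible letters the bound is about 117 bits, i.e. 15 bytes; with only 10 possible letters it is about 96.3 bits, which still needs 13 bytes. Getting to 11 bytes would require exploiting extra structure, such as a known, limited set of origins and destinations.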

Is it possible to deal with individual bits in C#? Trying to implement a SHA256 generator

Just doing this for fun. I was reading the pseudo-code on Wikipedia, and it says that when pre-processing you append the bit '1' to the message and then append enough '0' bits so that the resulting message length modulo 512 is 448. Then you append the length of the message in bits as a 64-bit big-endian integer.
Okay. I'm not sure how to append just a '1' bit but I figure it could be possible to just append 128 (1000 0000) but that wouldn't work in the off chance the resulting message length modulus 512 was already 448 without all those extra 0's. In which case I'm not sure how to append just a 1 because I'd need to deal with at least bytes. Is it possible in C#?
Also, is there a built-in way to append a big-endian integer because I believe my system is little-endian by default.
It's defined in such a way that you only need to deal with bytes if the message is a whole number of bytes. If the message length in bytes (mod 64) is 56, then append one byte of 0b10000000, followed by 63 zero bytes, followed by the length. Otherwise, append one byte of 0b10000000, followed by 0 to 62 zero bytes, followed by the length.
You might check out the BitArray class in System.Collections. One of the ctor overloads takes an array of bytes, etc.
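For byte-aligned messages the padding rule above can be sketched without touching individual bits. This is an illustration of the pre-processing step only, not a SHA-256 implementation; Sha256Padding is a hypothetical name, and the length is written big endian by hand so the result does not depend on the machine's byte order.

```csharp
using System;

public static class Sha256Padding
{
    // Pads a byte-aligned message: 0x80, then zero bytes until the length is
    // 56 (mod 64), then the original length in bits as a 64-bit big-endian integer.
    public static byte[] Pad(byte[] message)
    {
        // Zero bytes needed so that (length + 1 + zeros) % 64 == 56.
        int zeros = (55 - message.Length % 64 + 64) % 64;
        var padded = new byte[message.Length + 1 + zeros + 8];
        Array.Copy(message, padded, message.Length);
        padded[message.Length] = 0x80; // the '1' bit followed by seven '0' bits

        ulong bitLength = (ulong)message.Length * 8;
        for (int i = 0; i < 8; i++)
        {
            // Least-significant byte goes last (big endian).
            padded[padded.Length - 1 - i] = (byte)((bitLength >> (8 * i)) & 0xFF);
        }
        return padded;
    }
}
```

A 56-byte message comes out as 128 bytes (one 0x80, 63 zero bytes, 8 length bytes), while a 55-byte message comes out as exactly 64.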

C# Encoding.UTF8 messing up the bytes[]

I am facing a very strange problem: I have a byte[], and when I pass it to the Encoding.UTF8.GetString(byte[] bytes) method, the encoding messes with my bytes, replacing just a few special bytes (which I use as markers in my system) with a three-character string representation.
[0] 70 byte
[1] 49 byte
[2] 45 byte
[3] 86 byte
[4] 49 byte
[5] 253 byte <-- Special byte
[6] 70 byte
[7] 49 byte
[8] 45 byte
[9] 86 byte
[10] 50 byte
[11] 253 byte <-- Special byte
[12] 70 byte
[13] 49 byte
[14] 45 byte
[15] 86 byte
[16] 51 byte
When I am passing above byte[] into Encoding.UTF8.GetString(bytes) method I am getting following output;
private Encoding _encoding = System.Text.Encoding.GetEncoding("UTF-8", new EncoderReplacementFallback("?"), new DecoderReplacementFallback("?"));
_encoding.GetString(bytes) "F1-V1�F1-V2�F1-V3" string
The actual value should not contain '�', as this means decoding failed and those special bytes were replaced with '�'. Is there any way I can get around this, i.e. convert to a string and keep each special byte represented as a single char?
I have following special bytes which I am trying to use as markers;
byte AM = (byte) 254;
byte VM = (byte) 253;
byte SM = (byte) 252;
Your help and comments will be appreciated.
Thanks,
--
Sheeraz
You cannot use those special values as markers inside a UTF-8 string, because the string ends up being invalid according to the UTF-8 encoding rules.
You could sneakily insert them and then take them back out before the data is fed to UTF-8-aware code like Encoding.GetString, but that's not a good idea exactly because it's sneaky (way confusing to anyone who does not already know what voodoo is happening in there, and thus very counter-productive).
A more sane option would be to simply insert "special" UTF-8 encoded characters inside your string. This would technically require (especially if you pick a character that encodes to 1 byte as those would be more likely to occur inside your actual payload as well) that you also come up with a scheme to escape these characters when they occur naturally inside your payload.
The data is only UTF-8 between the markers, so if it were me I would extract the delimited portions first and then UTF-8 decode each portion separately, i.e. read through the byte[] looking for the markers in your binary data, giving you 3 binary chunks (70,49,45,86,49; 70,49,45,86,50; 70,49,45,86,51) which are then decoded into 3 strings. You can't UTF-8 decode the entire binary sequence because it is not valid UTF-8.
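A minimal sketch of that split-then-decode approach, assuming a single marker byte (253 here) and UTF-8 text between markers. MarkerSplit is a hypothetical helper, not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MarkerSplit
{
    // Scans the raw bytes for the marker and UTF-8 decodes each chunk on its own,
    // so the marker byte itself never reaches the decoder.
    public static List<string> Split(byte[] data, byte marker)
    {
        var parts = new List<string>();
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            // A chunk ends at a marker or at the end of the buffer.
            if (i == data.Length || data[i] == marker)
            {
                parts.Add(Encoding.UTF8.GetString(data, start, i - start));
                start = i + 1;
            }
        }
        return parts;
    }
}
```

Called on the 17 bytes from the question with marker 253, this gives the three strings F1-V1, F1-V2 and F1-V3.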
However, personally, I would say that using a delimiter is dangerous here; I would probably go for a length-prefix approach, so that
I know that I'm not accidentally conflating delimiters and real data
I can process it more efficiently than byte-by-byte
For example, if we used a "varint" length prefix, that would be:
05,70,49,45,86,49,05,70,49,45,86,50,05,70,49,45,86,51
where the 05 is the "varint" length which we interpret as 5 bytes; this means we can process nicely:
// pseudocode
while(!EOF) {
    int len = ReadVarint();
    var blob = ReadBytes(len);
    string s = Utf8Decode(blob);
    // ...
}
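The pseudocode can be turned into something runnable along these lines. This is a sketch assuming the whole message is already in a byte[] and that the prefix is a base-128 varint (low seven bits first, high bit meaning "more bytes follow"); LengthPrefixed, ReadVarint and ReadAll are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LengthPrefixed
{
    // Reads a base-128 varint from the stream.
    static int ReadVarint(Stream stream)
    {
        int result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
            if (shift > 28) throw new InvalidDataException("Varint too long.");
        }
    }

    // Reads [length][payload] records until the buffer is exhausted and
    // UTF-8 decodes each payload separately.
    public static List<string> ReadAll(byte[] data)
    {
        var strings = new List<string>();
        using (var stream = new MemoryStream(data))
        {
            while (stream.Position < stream.Length)
            {
                int length = ReadVarint(stream);
                var blob = new byte[length];
                if (stream.Read(blob, 0, length) != length)
                    throw new EndOfStreamException();
                strings.Add(Encoding.UTF8.GetString(blob));
            }
        }
        return strings;
    }
}
```

This layout, a 7-bit encoded byte count followed by UTF-8 bytes, is also what BinaryReader.ReadString expects with its default encoding, so a BinaryReader could read the same records.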
