C# string to C++ wstring using Encoding.Unicode.GetBytes()

The issue is that in C# a char is 4 bytes, so "ABC" is (65 0 66 0 67 0).
When I put that into a wstring in C++ after sending it through a socket, I get the following output: A.
How can I convert such a string to a C++ string?

Sounds like you need ASCII or UTF-8 encoding instead of Unicode.
65 0 66 0 67 0 is only going to get you the A, since the next zero is interpreted as a null termination character in C++.
Strategies for converting Unicode to ASCII can be found here.

using c# the char is 4 bytes
No, in C# strings are encoded in UTF-16. A UTF-16 code unit is two bytes. For simple characters a single code unit can represent a code point (e.g. 65 0).
On Windows, wstring is usually UTF-16 encoded, too (2 or 4 bytes per character). But on Unix/Linux, wstring usually uses UTF-32 (always 4 bytes).
For ASCII characters the Unicode code point has the same numerical value as the ASCII code, so UTF-16 encoded ASCII text often looks like this: {num} 0 {num} 0 {num} 0...
See the details here: (https://en.wikipedia.org/wiki/UTF-16)
Could you show us some code showing how you constructed your wstring object?
The null byte is critical here, because it is the end marker for ASCII / ANSI strings.

I have been able to solve the issue by using a std::u16string.
Here is some example code
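// needs <cstring>, <string>, <vector>; assumes a little-endian host
std::vector<char> data = { 65, 0, 66, 0, 67, 0 };
std::u16string str(data.size() / 2, u'\0');
// copy the raw UTF-16LE bytes; a char* cannot be passed to the u16string constructor directly
std::memcpy(&str[0], data.data(), str.size() * sizeof(char16_t));
// now str should be encoded right (u"ABC")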
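If the receiving machine's byte order is not known, the bytes can also be combined by hand. The following is only a sketch: decode_utf16le is a hypothetical helper name, and it assumes the sender used Encoding.Unicode (UTF-16, little endian) and that a trailing odd byte can be ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: combine each pair of bytes (low byte first) into one
// UTF-16 code unit. This does not depend on the host's endianness.
std::u16string decode_utf16le(const std::vector<unsigned char>& bytes) {
    std::u16string out;
    out.reserve(bytes.size() / 2);
    // i + 1 < size() skips a dangling last byte if the length is odd.
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

Calling it with the bytes 65 0 66 0 67 0 gives a three-character string instead of one that stops at the first zero byte.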

Related

Converting memory hex string to an int

I'm working on serial port communication and I have some info in a bat file which is encoded. I need to extract the file size, which I translated to hex, but it's flipped (something to do with memory) and I need to get the correct size.
Here is the hex I have in my bat file (converted to decimal it's 1178534144)
So I'm having a lot of problems converting it...
and here is the hex number I need to get (in decimal it's 81734)
**EDIT
Here's 64 bytes out of the bat file, which I converted to hex because in ASCII it's unreadable. Focus on the part marked with red (the whole hex) and the part in blue (it's the hex number I need to convert, from 46 3f 01 00 to 00013f46).
Use the Convert.ToInt32(String, Int32) method with the base as parameter:
The base of the number in value, which must be 2, 8, 10, or 16.
So the code would be (16 for base 16 aka hex)
int result = Convert.ToInt32("463F0100", 16); // 1178534144
The decimal number 1178534144 is 0x463F0100. To get decimal 81734 you need to reverse the order of the 4 bytes to get 0x00013F46.
Under Windows you can include winsock.h and use the function ntohl.
https://learn.microsoft.com/en-us/windows/desktop/api/winsock/nf-winsock-ntohl
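The byte reversal itself is small enough to write by hand. This is a minimal sketch; swap_bytes32 is a hypothetical name, and unlike ntohl (which only swaps on little-endian hosts) it always reverses the four bytes.

```cpp
#include <cstdint>

// Hypothetical helper: reverse the byte order of a 32-bit value,
// e.g. 0x463F0100 becomes 0x00013F46.
std::uint32_t swap_bytes32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Applied to 0x463F0100 (decimal 1178534144) this yields 0x00013F46, which is decimal 81734.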

C# How to Encode a string to minimize bytes

Is it possible to encode a string in a certain way to minimize the number of bytes? Basically I need to get 29 characters down to 11 bytes of data.
var myString = "usmiaanzaklaacn40879005900133";
byte[] bytes = Encoding.UTF8.GetBytes(myString);
Console.WriteLine(bytes.Length); //Output = 29, 1 byte per character
Console.ReadKey();
This shows that when encoding with UTF8 a 29 character string results in 29 bytes... I need a 29 character string resulting in 11 bytes or less. Is this possible? I was thinking I could possibly have some sort of lookup or binary mapping algorithm, but I am a little unsure how to go about this in C#.
EDIT:
So I have a chip that has a custom data payload of 11 bytes. I want to be able to compress a 29 character string (that is unique) into bytes, assign it to the "custom data" and then receive the custom data bytes and decompress them back to the 29 character string... Now I don't know if this is possible, but any help would be greatly appreciated. Thanks :)
The string itself: [usmia]-[anzakl]-[aacn40879005900]-[133] = [origin]-[dest]-[random/unique]-[weight]
OK, the last 14 characters are integers.
I have access to all the origins and destinations... Would it be feasible to create a key-value store with the origin (e.g. usmia) as the key and a particular byte as the value? I guess that would mean I could only have 256 different origins and destinations, and then just make the last 14 characters an integer?
15 lg(26) + 14 lg(10) ~= 117 bits ~= 14.6 bytes. (lg = log base 2)
So even if I were optimistic and assumed that your strings were always 15 lower case letters followed by 14 digits, it would still take a minimum of 15 bytes to represent.
Unless there are more restrictions, like only the lower case letters a, c, i, k, l, m, n, s, u, and z are allowed, then no, you can't code that into 11 bytes. Whoops, wait, not even then. Even that would take a little over 12 bytes.
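The arithmetic above can be checked with a few lines. This is only a sketch of the counting argument; min_bits and min_bytes are hypothetical names, and the model assumes every letter position is independent and drawn from an alphabet of the given size.

```cpp
#include <cmath>

// Minimum number of bits needed to distinguish every possible string made of
// `letters` symbols from an alphabet of `alphabet_size`, followed by `digits`
// decimal digits: letters * lg(alphabet_size) + digits * lg(10).
double min_bits(int letters, int alphabet_size, int digits) {
    return letters * std::log2(static_cast<double>(alphabet_size)) +
           digits * std::log2(10.0);
}

// The same bound rounded up to whole bytes.
int min_bytes(int letters, int alphabet_size, int digits) {
    return static_cast<int>(std::ceil(min_bits(letters, alphabet_size, digits) / 8.0));
}
```

min_bytes(15, 26, 14) covers the full lower case alphabet and min_bytes(15, 10, 14) the restricted ten-letter case; both stay above 11 bytes.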

BinaryReader ReadString specifying length?

I'm working on a parser to receive UDP information, parse it, and store it. To do so I'm using a BinaryReader since it will mostly be binary information. Some of it will be strings though. MSDN says for the ReadString() function:
Reads a string from the current stream. The string is prefixed with
the length, encoded as an integer seven bits at a time.
And I completely understand it up until "seven bits at a time" which I tried to simply ignore until I started testing. I'm creating my own byte array before putting it into a MemoryStream and attempting to read it with a BinaryReader. Here's what I first thought would work:
byte[] data = new byte[] { 3, 0, 0, 0, (byte)'C', (byte)'a', (byte)'t' };
BinaryReader reader = new BinaryReader(new MemoryStream(data));
String str = reader.ReadString();
Knowing an int is 4 bytes (and toying around long enough to find out that BinaryReader is Little Endian) I pass it the length of 3 and the corresponding letters. However str ends up holding \0\0\0. If I remove the 3 zeros and just have
byte[] data = new byte[] { 3, (byte)'C', (byte)'a', (byte)'t' };
Then it reads and stores Cat properly. To me this conflicts with the documentation saying that the length is supposed to be an integer. Now I'm beginning to think they simply mean a number with no decimal place and not the data type int. Does this mean that a BinaryReader can never read a string larger than 127 characters (since that would be 01111111 corresponding to the 7 bits part of the documentation)?
I'm writing up a protocol and need to completely understand what I'm getting into before I pass our documentation along to our clients.
I found the source code for BinaryReader. It uses a function called Read7BitEncodedInt() and after looking up that documentation and the documentation for Write7BitEncodedInt() I found this:
The integer of the value parameter is written out seven bits at a
time, starting with the seven least-significant bits. The high bit of
a byte indicates whether there are more bytes to be written after this
one. If value will fit in seven bits, it takes only one byte of space.
If value will not fit in seven bits, the high bit is set on the first
byte and written out. value is then shifted by seven bits and the next
byte is written. This process is repeated until the entire integer has
been written.
Also, Ralf found this link that better displays what's going on.
Unless they specifically say 'int' or 'Int32', they just mean an integer as in a whole number.
By '7 bits at time', they mean that it implements 7-bit length encoding, which seems a bit confusing at first but is actually rather straightforward. Here are some example values and how they are written out using 7-bit length encoding:
/*
decimal value   binary value                    enc byte 1   enc byte 2   enc byte 3
85              00000000 00000000 01010101  ->  01010101     n/a          n/a
1,365           00000000 00000101 01010101  ->  11010101     00001010     n/a
349,525         00000101 01010101 01010101  ->  11010101     10101010     00010101
*/
The binary values in the table above are written big endian for no other reason than that I simply had to pick one and it's what I'm most familiar with. The way 7-bit length encoding works, it is little endian by its very nature.
Note that 85 writes out to 1 byte, 1,365 writes out to 2 bytes, and 349,525 writes out to 3 bytes.
Here's the same table using letters to show how each value's bits were used in the written output (dashes are zero-value bits, and the 0s and 1s are what's added by the encoding mechanism to indicate if a subsequent byte is to be written/read)...
/*
decimal value   binary value                    enc byte 1   enc byte 2   enc byte 3
85              -------- -------- -AAAAAAA  ->  0AAAAAAA     n/a          n/a
1,365           -------- -----BBB AAAAAAAA  ->  1AAAAAAA     0---BBBA     n/a
349,525         -----CCC BBBBBBBB AAAAAAAA  ->  1AAAAAAA     1BBBBBBA     0--CCCBB
*/
So values in the range of 0 to 2^7-1 (127) will write out as 1 byte, values of 2^7 (128) to 2^14-1 (16,383) will use 2 bytes, 2^14 (16,384) to 2^21-1 (2,097,151) will take 3 bytes, and so on and so forth.
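The scheme in the tables can be sketched in a few lines. This is not the BinaryReader source, just an illustration of the encoding as described; encode_7bit and decode_7bit are hypothetical names, and the sketch is limited to 32-bit values.

```cpp
#include <cstdint>
#include <vector>

// Write `value` seven bits at a time, least-significant group first.
// The high bit of each byte is set when more bytes follow.
std::vector<std::uint8_t> encode_7bit(std::uint32_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));  // last byte, high bit clear
    return out;
}

// Reverse of encode_7bit: accumulate 7-bit groups until a byte with a clear
// high bit is seen. Stops after five bytes so the shift stays below 32.
std::uint32_t decode_7bit(const std::vector<std::uint8_t>& bytes) {
    std::uint32_t value = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        value |= static_cast<std::uint32_t>(b & 0x7F) << shift;
        shift += 7;
        if ((b & 0x80) == 0 || shift > 28) break;
    }
    return value;
}
```

A length of 3, as in the Cat example, encodes to the single byte 3, which is why the array without the three zero bytes reads correctly.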

Is it possible to deal with individual bits in C#? Trying to implement a SHA256 generator

Just doing this for fun and I was reading the pseudo-code on Wikipedia and it says when pre-processing to append the bit '1' to the message and then append enough '0' bits so that the resulting message length modulo 512 is 448. Then append the length of the message in bits as a 64-bit big-endian integer.
Okay. I'm not sure how to append just a '1' bit but I figure it could be possible to just append 128 (1000 0000) but that wouldn't work in the off chance the resulting message length modulus 512 was already 448 without all those extra 0's. In which case I'm not sure how to append just a 1 because I'd need to deal with at least bytes. Is it possible in C#?
Also, is there a built-in way to append a big-endian integer because I believe my system is little-endian by default.
It's defined in such a way that you only need to deal with bytes if the message is a whole number of bytes. If the message length (mod 64) is 56, then append one byte of 0b10000000, followed by 63 zero bytes, followed by the length. Otherwise, append one byte of 0b10000000, followed by 0 to 62 zero bytes, followed by the length.
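The byte-level rule above, together with the big-endian length the question asks about, can be sketched as follows. sha256_pad is a hypothetical name, and the sketch assumes the message is a whole number of bytes; it performs only the padding step, not the hash.

```cpp
#include <cstdint>
#include <vector>

// Pad a byte-aligned message as described above: one 0x80 byte, zero bytes
// until the length is 56 (mod 64), then the original length in bits as a
// 64-bit big-endian integer.
std::vector<std::uint8_t> sha256_pad(std::vector<std::uint8_t> msg) {
    const std::uint64_t bit_length = static_cast<std::uint64_t>(msg.size()) * 8;
    msg.push_back(0x80);  // the '1' bit followed by seven '0' bits
    while (msg.size() % 64 != 56) {
        msg.push_back(0x00);
    }
    // Big endian: emit the most significant byte first, independent of the host.
    for (int shift = 56; shift >= 0; shift -= 8) {
        msg.push_back(static_cast<std::uint8_t>(bit_length >> shift));
    }
    return msg;
}
```

Writing the length by shifting avoids any dependence on the machine's native byte order, so no separate big-endian conversion is needed.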
You might check out the BitArray class in System.Collections. One of the ctor overloads takes an array of bytes, etc.

C# Encoding.UTF8 messing up the bytes[]

I am facing a very strange problem: I have a byte[], and when I pass it to the Encoding.UTF8.GetString(byte[] bytes) method, the encoding messes with my bytes and replaces only a few special bytes (which I am using as markers in my system) with some three-char string representation.
[0] 70 byte
[1] 49 byte
[2] 45 byte
[3] 86 byte
[4] 49 byte
[5] 253 byte <-- Special byte
[6] 70 byte
[7] 49 byte
[8] 45 byte
[9] 86 byte
[10] 50 byte
[11] 253 byte <-- Special byte
[12]70 byte
[13]49 byte
[14]45 byte
[15]86 byte
[16]51 byte
When I am passing above byte[] into Encoding.UTF8.GetString(bytes) method I am getting following output;
private Encoding _encoding = System.Text.Encoding.GetEncoding("UTF-8", new EncoderReplacementFallback("?"), new DecoderReplacementFallback("?"));
_encoding.GetString(bytes) "F1-V1�F1-V2�F1-V3" string
The actual value should not have '�', as this means it failed to decode and replaced those special bytes with '�'. Is there any way I can get around this, i.e. convert to a string and keep each special byte represented as a single char?
I have following special bytes which I am trying to use as markers;
byte AM = (byte) 254;
byte VM = (byte) 253;
byte SM = (byte) 252;
Your help and comments will be appreciated.
Thanks,
--
Sheeraz
You cannot use those special values as markers inside a UTF-8 string, because the string ends up being invalid according to the UTF-8 encoding rules.
You could sneakily insert them and then take them back out before the data is fed to UTF-8-aware code like Encoding.GetString, but that's not a good idea exactly because it's sneaky (way confusing to anyone who does not already know what voodoo is happening in there, and thus very counter-productive).
A more sane option would be to simply insert "special" UTF-8 encoded characters inside your string. This would technically require (especially if you pick a character that encodes to 1 byte as those would be more likely to occur inside your actual payload as well) that you also come up with a scheme to escape these characters when they occur naturally inside your payload.
The data is only UTF-8 between the markers, so if it were me I would extract the delimited portions first, and then UTF-8 decode each portion separately, i.e. read through the byte[] looking for the markers in your binary data, giving you 3 binary chunks (70,49,45,86,49; 70,49,45,86,50; 70,49,45,86,51) which are then decoded into 3 strings. You can't UTF-8 decode the entire binary sequence because it is not valid UTF-8.
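A sketch of that splitting step, under the assumption that only the three marker values from the question (252, 253, 254) act as delimiters; split_on_markers is a hypothetical name. Those byte values never occur in valid UTF-8, so they cannot collide with the payload. Each returned string holds the raw UTF-8 bytes of one chunk.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Split the buffer at every marker byte (SM = 252, VM = 253, AM = 254).
// The markers themselves are dropped; each part keeps its UTF-8 bytes as-is.
std::vector<std::string> split_on_markers(const std::vector<std::uint8_t>& data) {
    std::vector<std::string> parts(1);  // start with one empty part
    for (std::uint8_t b : data) {
        if (b >= 252 && b <= 254) {
            parts.emplace_back();       // marker: begin the next part
        } else {
            parts.back().push_back(static_cast<char>(b));
        }
    }
    return parts;
}
```

A real parser would probably also record which marker was seen, since the three values presumably mean different things; this sketch treats them alike.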
However, personally, I would say that using a delimiter is dangerous here; I would probably go for a length-prefix approach, so that (1) I know that I'm not accidentally conflating delimiters and real data, and (2) I can process it more efficiently than byte-by-byte.
For example, if we used a "varint" length prefix, that would be:
05,70,49,45,86,49,05,70,49,45,86,50,05,70,49,45,86,51
where the 05 is the "varint" length which we interpret as 5 bytes; this means we can process nicely:
// pseudo code
while(!EOF) {
int len = ReadVarint();
var blob = ReadBytes(len);
string s = Utf8Decode(blob);
// ...
}
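The pseudo code above could be fleshed out roughly like this. It is a sketch under assumptions: the "varint" is taken to be the usual base-128 encoding (seven bits per byte, least-significant group first, high bit set when more bytes follow), read_length_prefixed is a hypothetical name, and truncated input simply ends the loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Read a sequence of [varint length][payload bytes] records.
// Each returned string holds the raw UTF-8 bytes of one payload.
std::vector<std::string> read_length_prefixed(const std::vector<std::uint8_t>& data) {
    std::vector<std::string> parts;
    std::size_t pos = 0;
    while (pos < data.size()) {
        // Decode the varint length prefix (at most five bytes).
        std::uint32_t len = 0;
        int shift = 0;
        while (pos < data.size()) {
            const std::uint8_t b = data[pos++];
            len |= static_cast<std::uint32_t>(b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0 || shift > 28) break;
        }
        if (len > data.size() - pos) break;  // truncated record: stop reading
        parts.emplace_back(reinterpret_cast<const char*>(data.data() + pos), len);
        pos += len;
    }
    return parts;
}
```

Because the payload is copied in one step per record, no byte inside it is ever inspected, so it cannot be mistaken for a delimiter.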
