How to get complete bytes for unicode characters in C#?

For the greater-than symbol > I need to get the complete bytes, which I understand to be \u003E. Now C# only gives me 3E. Is there any way to get all of the bytes, i.e. \u003E?
I am using the following line of code.
Encoding.UTF8.GetBytes(">");
In a text file I have the following:
\u003c
which I need to search for at the byte level.
Thanks!

In UTF-8 the (ASCII range) char > is encoded into 1 byte.
If you want the string "003E" you can use:
Encoding.UTF8.GetBytes(">")[0].ToString("X4");
and maybe add "\u" in front.
If you want an array of 2 bytes containing { 0x00, 0x3E } then use
Encoding.BigEndianUnicode.GetBytes(">");
(Encoding.Unicode is little-endian UTF-16 and returns { 0x3E, 0x00 } instead.)

What bytes make up > differs from encoding to encoding - in UTF-8 it really is only 0x3E; in .NET's Encoding.Unicode (little-endian UTF-16) it is 0x3E 0x00, so you need
Encoding.XXXX.GetBytes(">");
with XXXX being the encoding of your choice, e.g. UTF8 or Unicode

The answer you get is correct - 3E is the hex representation of U+003E.
If you want the UTF-16 bytes (i.e. a 2-byte array) then simply use this encoding:
Encoding.Unicode.GetBytes(">");

I wrote a rather lengthy piece at http://www.hackcraft.net/xmlUnicode/#sect4 some years back, that says the following in more detail, but:
> is a character. It's a purely conceptual item that we understand as having one or more meanings, uses and ways of writing it depending on different lingual and textual contexts. It's an abstract concept rather than anything we can actually make use of in a computer.
U+003E (which in C# is represented as \u003E) is a code-point. It's a way of assigning a number to a character, but it's still a rather abstract thing. The number 0x3E (62) is still an abstract concept rather than something we can use in a computer.
00111110, 0000000000111110, 0011111000000000, 00000000000000000000000000111110 and 00111110000000000000000000000000 are all different ways commonly used to represent that code-point in actual 1s and 0s that computers can represent by pulses of electrical charge.
In between, as programmers we tend to think of those as either 0x3E, 0x003E or 0x0000003E, which are numbers mapped to datatypes we actually use. The difference between 0000000000111110 and 0011111000000000 here is one of endianness, and mostly we don't think of it at this point, having already (if necessary) thought "must make sure the endianness is correct", because that "if necessary" tends to happen at a level where one doesn't think of characters at all.
Actually, as programmers we tend to think of it mostly as the > we started with. Abstractions are great.
Your code that uses UTF-8 is using one of the different ways of turning characters into bytes, the one that turns U+003E into 0x3E. There are others, though UTF-8 is the one most useful for most interchange. It is therefore one of the correct answers to "the complete bytes for '>'". The byte 0x00 followed by 0x3E and the byte 0x3E followed by 0x00 would be two other correct answers, both forms of UTF-16 with different endianness. The byte sequences 0x00, 0x00, 0x00, 0x3E and 0x3E, 0x00, 0x00, 0x00 would both be correct UTF-32.
If you have a reason for wanting a particular one of these, use the appropriate encoding. If in doubt, use UTF-8 as you were doing.
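As a quick illustration of the answers above (a minimal sketch; the byte values in the comments assume .NET's built-in encodings, where Encoding.Unicode means little-endian UTF-16):
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // Each encoding turns the single code point U+003E into a different byte sequence.
        Print(Encoding.UTF8.GetBytes(">"));             // 3E
        Print(Encoding.Unicode.GetBytes(">"));          // 3E-00  (UTF-16, little-endian)
        Print(Encoding.BigEndianUnicode.GetBytes(">")); // 00-3E  (UTF-16, big-endian)
        Print(Encoding.UTF32.GetBytes(">"));            // 3E-00-00-00  (UTF-32, little-endian)
    }

    static void Print(byte[] bytes) =>
        Console.WriteLine(BitConverter.ToString(bytes));
}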

Decompress .Z files (LZW Compression) in C#

I am looking to implement the Rosetta Code LZW Decompression method in C# and I need some help. The original code is available here: http://rosettacode.org/wiki/LZW_compression#C.23
I am only focusing on the Decompress method as I "simply" (if only) want to decompress .Z-files in my C# program in .NET 6.
I want my version to take a byte[] as input and return a byte[] (as I am reading .ReadAllBytes() from file and want to create a new file with the decompressed result).
My problem comes from the fact that in C#, chars are 16-bit (2 bytes) and not 8-bit (1 byte). This really messes with my head, as that consequently (in my mind) means that each char should be represented by two bytes. In the code at Rosetta Code, the initial dictionary created only contains integer keys of 0 -> 255, i.e. up to 1 byte, not two. I am wondering whether this is an error in their implementation. What do you think? And how would you go about converting this algorithm to a method with the signature: byte[] Decompress(byte[])?
Thanks
No, there is no error. No, you don't need to convert the algorithm to work on 16-bit values. The usual lossless compression libraries operate on sequences of bytes. Your string of characters would first need to be converted to a sequence of bytes, e.g. to UTF-8: byte[] bs = Encoding.UTF8.GetBytes(str);. UTF-8 would be the right choice, since that encoding gives the compressor the best shot at compressing. In fact, just encoding UTF-16 to UTF-8 will almost always compress the strings, which makes it a good starting point. (In fact, using UTF-16 as the standard for character strings in .NET is a terrible choice for exactly this reason, but I digress.)
Any data you compress would first be serialized to a sequence of bytes for these compressors, if it isn't bytes already, in a way that permits reversing the transformation after decompression on the other end.
Since you are decompressing, someone encoded the characters into a sequence of bytes, so you need to first find out what they did. It may just be a sequence of ASCII characters, which are already one byte per character. Then you would use System.Text.Encoding.ASCII.GetString(bs); to make a character string out of it.
When compressing data we usually talk about sequences of symbols. A symbol in this context might be a byte, a character, or something completely different.
Your example obviously uses characters as its symbols, but there should not be any real problem just replacing this with bytes instead. The more difficult part will be its use of strings to represent sequences of characters. You will need an equivalent representation of byte sequences (a minimal sketch follows the list below) that provides functionality like:
Concatenation/appending
Equality
GetHashCode (for performance)
Immutability (i.e. appending a byte should produce a new sequence, not modify the existing sequence)
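As a rough sketch of such a representation (a hypothetical ByteSequence type; the name and members are my own, not from any library), an immutable wrapper over byte[] with value equality and a cached hash could look like this:
using System;
using System.Linq;

// Hypothetical immutable byte sequence, usable as a dictionary key.
sealed class ByteSequence : IEquatable<ByteSequence>
{
    private readonly byte[] _bytes;
    private readonly int _hash;

    public ByteSequence(params byte[] bytes)
    {
        _bytes = (byte[])bytes.Clone();
        // Cache the hash once, since the contents never change.
        _hash = _bytes.Aggregate(17, (h, b) => unchecked(h * 31 + b));
    }

    public byte[] ToArray() => (byte[])_bytes.Clone();

    // Appending produces a new sequence; the existing sequence is never modified.
    public ByteSequence Append(byte b)
    {
        var copy = new byte[_bytes.Length + 1];
        _bytes.CopyTo(copy, 0);
        copy[^1] = b;
        return new ByteSequence(copy);
    }

    public bool Equals(ByteSequence other) =>
        other != null && _bytes.AsSpan().SequenceEqual(other._bytes);

    public override bool Equals(object obj) => Equals(obj as ByteSequence);

    public override int GetHashCode() => _hash;
}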
Note that LZW implementations have to agree on some specific things to be compatible, so implementing the posted example may or may not allow you to decode .Z files encoded with another implementation. If your goal is to decode actual files you may have better luck asking on Software Recommendations for a preexisting decompression library.

Does BitConverter handle little-endianness incorrectly?

I'm currently writing something in C#/.NET that involves sending unsigned 16-bit integers in a network packet. The ordering of the bytes needs to be big endian.
At the bit level, my understanding of 'big endian' is that the most significant bit goes at the end, and in reverse for little endian.
And at the byte level, my understanding is the same -- if I'm converting a 16 bit integer to the two 8 bit integers that comprise it, and the architecture is little endian, then I would expect the most significant byte to go at the beginning.
However, BitConverter appears to put the byte with the smallest value at the end of the array, as opposed to the byte with the least-significant value, e.g.
ushort number = 4;
var bytes = BitConverter.GetBytes(number);
Debug.Assert(bytes[BitConverter.IsLittleEndian ? 0 : 1] == 0);
For clarity, if my understanding is correct, then on a little endian machine I would expect the above to return 0x00, 0x04 and on a big endian machine 0x04, 0x00. However, on my little endian Windows x86 workstation running .NET 5, it returns 0x04, 0x00
It's even documented that they've considered the endianness. From: https://learn.microsoft.com/en-us/dotnet/api/system.bitconverter.getbytes?view=net-5.0
The order of bytes in the array returned by the GetBytes method depends on whether the computer architecture is little-endian or big-endian.
Am I being daft or does this seem like the wrong behaviour?
I am indeed being daft. As @mjwills pointed out, and Microsoft's documentation explains (https://learn.microsoft.com/en-us/dotnet/api/system.bitconverter.islittleendian?view=net-5.0#remarks):
"Big-endian" means the most significant byte is on the left end of a word. "Little-endian" means the most significant byte is on the right end of a word.
Wikipedia has a slightly better explanation:
A big-endian system stores the most significant byte of a word at the smallest memory address and the least significant byte at the largest. A little-endian system, in contrast, stores the least-significant byte at the smallest address.
So, if you imagine the memory addresses, converting a 16-bit integer with a value of 4 becomes:
Address         0x00    0x01
Little-endian   0x04    0x00
Big-endian      0x00    0x04
Hopefully this'll help anyone equally daft in future!
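For the original goal of writing a 16-bit value in big-endian (network) order regardless of the machine's architecture, a minimal sketch, assuming .NET Core 2.1+/.NET 5 where System.Buffers.Binary is available:
using System;
using System.Buffers.Binary;

class Program
{
    static void Main()
    {
        ushort number = 4;
        var bytes = new byte[2];

        // Explicitly request big-endian byte order,
        // independent of BitConverter.IsLittleEndian.
        BinaryPrimitives.WriteUInt16BigEndian(bytes, number);

        Console.WriteLine(BitConverter.ToString(bytes)); // 00-04
    }
}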

How does c# decoder know the exact number of bytes it should use for one char?

For example, a stream has four bytes: D8 00 DC 05. How does the decoder (e.g. System.Text.Decoder) know whether it should treat them as one char \uD800\uDC05 or as two separate chars \uD800 and \uDC05? Thanks.
Perhaps I didn't describe my question clearly. My original intention was to understand how a UTF-8 decoder knows the exact number of bytes it should use for one char, as a UTF-8 character can take one to four bytes and handling that variable length is the magic. A UTF-16 decoder does not have this problem, except for surrogate pairs. The above example is not appropriate for my question.
Your question is really about UTF-16 and surrogate pairs.
The two code units U+D800 and U+DC05 always form a surrogate pair. These two code units combine into one single code point, that is, one character.
C# calls the code units char which can be a bit misleading since it sometimes takes two char values (a pair of surrogates) to create one "character", as you have noticed.
Any code unit (char) value between U+D800 and U+DBFF is always the first (high) half of a surrogate pair, while any code unit between U+DC00 and U+DFFF is the corresponding second (low) half of the pair.
Code units outside this domain, i.e. either in U+0000 through U+D7FF or in U+E000 through U+FFFF stand for themselves, so in those ranges one UTF-16 code unit corresponds to one Unicode code point.
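As a small illustration of how this works in C# (a minimal sketch using the built-in char helpers):
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+D800 U+DC05 is a surrogate pair encoding a single code point.
        string s = "\uD800\uDC05";

        Console.WriteLine(s.Length);                   // 2 (two UTF-16 code units)
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True

        // Combine the pair into the actual Unicode code point.
        int codePoint = char.ConvertToUtf32(s[0], s[1]);
        Console.WriteLine($"U+{codePoint:X4}");        // U+10005
    }
}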
EDIT: The question was changed to ask about UTF-8 instead.
I will use the word octet for a word of exactly 8 bits (so an octet is what most people call a byte).
In UTF-8 you can see from the position of the first 0 bit within the octet where this octet belongs in a UTF-8 sequence.
0xxxxxxx: If the first bit is 0, this octet constitutes a 1-octet sequence (ASCII value)
10xxxxxx: If the octet starts with 10, this is a continuation octet, i.e. not initial in the sequence
110xxxxx: This is the initial octet in a 2-octet sequence
1110xxxx: This is the initial octet in a 3-octet sequence
11110xxx: This is the initial octet in a 4-octet sequence
Since modern UTF-8 does not allow 5-octet sequences, or longer, it is illegal for an octet to start with five ones, 11111xxx. But in early versions, the above scheme would be extended to allow 5-octet and 6-octet sequences (sometimes also longer).
When comparing UTF-16 and UTF-8, note that code points that require only a single 16-bit code unit in UTF-16, correspond exactly to code points that can be made with 1-, 2-, or 3-octet sequences in UTF-8. While code points that require a surrogate pair in UTF-16 (i.e. two UTF-16 code units) correspond exactly to those that require a 4-octet sequence in UTF-8.
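A minimal sketch of that lead-octet rule in C# (my own helper, purely illustrative):
using System;

class Utf8LeadByte
{
    // Returns how many octets the UTF-8 sequence starting with this octet occupies,
    // or -1 for a continuation octet or an invalid lead octet.
    static int SequenceLength(byte b)
    {
        if ((b & 0b1000_0000) == 0b0000_0000) return 1; // 0xxxxxxx
        if ((b & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((b & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((b & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        return -1;                                      // 10xxxxxx continuation, or 11111xxx invalid
    }

    static void Main()
    {
        Console.WriteLine(SequenceLength(0x3E)); // 1  ('>')
        Console.WriteLine(SequenceLength(0xC3)); // 2  (e.g. start of 'é')
        Console.WriteLine(SequenceLength(0xE2)); // 3  (e.g. start of '€')
        Console.WriteLine(SequenceLength(0xF0)); // 4  (e.g. start of an emoji)
        Console.WriteLine(SequenceLength(0xBF)); // -1 (continuation octet)
    }
}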
The .NET Framework reference source is available, so you can take a look. The source code of System.Text.Decoder is published there; you can find everything you want to know about your question in it.

Implementing DbDataReader.GetChars() efficiently when underlying data is not UTF-16

I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16, in fact may be any one of a number of different encodings.
The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way, it is simply binary data for it.
I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings, there is no direct way to translate the request "get me UTF-16 codeunits X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".
Some 'requirements' that eliminate obvious implementations:
I do not wish to retrieve all data for a given cell to the client at any one time, unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't have to deal with hundreds of megs of memory being allocated to satisfy the request.
I wish to support the random-access exposed by GetChars() relatively efficiently. In the case of the first request asking for codeunits 1 billion to 1 billion + 10, I don't see any way of avoiding retrieving all data in the cell from the server up until the requested codepoints, but subsequently asking for codeunits 1 billion + 10 to 1 billion + 20, or even codepoints 999 million 999 thousand to 1 billion should not imply re-retrieving all that data.
I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.
My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server, rather than retrieving from the beginning every time. (On a side note, the lack of something similar to C++'s std::map::lower_bound in the .NET Framework frustrates me quite a bit.) Unfortunately, I found it very difficult to generate this mapping!
I've been trying to use the Decoder class, specifically Decoder.Convert() to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of bytes of the source data maps to exactly X UTF-16 codeunits, as the 'bytesUsed' parameter seems to include source bytes which were just stashed into the object's internal state, and not output as Chars. This causes me problems in trying to decode starting from or ending in the middle of a partial codepoint and giving me garbage.
So, my question is, is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #codeunits, without resorting to something like converting in a loop, decreasing the size of the source byte-by-byte)?
Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", which means that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',' according to the "shift state" -- the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS" Japanese encodings, these shift states can appear anywhere in the string and will apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".
Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than is generally realized due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a given character such as 𩸽; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, such as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully: 𩸽̂ , which means that a single abstract UTF-16 "text element" can be made up of one or two "base" characters plus some number of diacriticals or other combining characters. All of these make up just one single character from the point of view of the user, and so should never be split or orphaned.
So the general algorithm would be, when you want to fetch N characters from the server starting at offset K, to fetch N+E starting at K-E for some "large enough" E, then scan backwards until the first text element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method
internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)
In StringInfo.cs.
A bit of nuisance, but doable.
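Alternatively, for data that has already been decoded into a string, the public StringInfo API can locate text element boundaries without touching internal methods; a minimal sketch:
using System;
using System.Globalization;

class TextElementDemo
{
    static void Main()
    {
        // "Ĥ=T̂+V̂" uses combining circumflexes: several chars per visible character.
        string s = "H\u0302=T\u0302+V\u0302";

        // Indexes where each text element (user-perceived character) starts.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        Console.WriteLine(string.Join(", ", starts)); // 0, 2, 3, 5, 6

        // Enumerate the text elements themselves.
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
            Console.WriteLine(enumerator.GetTextElement());
    }
}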
For other, stateful, encodings, I would not know how to do this, and the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family you may well need to scan back arbitrarily far.
Not really an answer but way too long for a comment.
Update
You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:
var singleByteEncodings = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).ToList(); // 95 found.
var singleByteEncodingNames = Encoding.GetEncodings().Where((enc) => enc.GetEncoding().IsSingleByte).Select((enc) => enc.DisplayName).ToList(); // 95 names displayed.
Encoding.GetEncoding("iso-8859-1").IsSingleByte // returns true.
This might be useful in practice because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. The default character encoding for a SQL Server database is iso_1, a.k.a. ISO 8859-1, for instance. But see this caution from a Microsoft blogger:
Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.
I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping to use when restarting from the associated offset. This way I don't lose any partial codepoints it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with potential problems with encodings such as Shift-JIS that dbc brought up.
Decoder is not cloneable, so I use serialization + deserialization to make the copy.
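A rough sketch of that checkpoint idea (DecodeCheckpoint, CellReader, Record and FindStart are my own names, not from any library; the actual Decoder cloning and the byte-fetching from the server are left out):
using System.Collections.Generic;
using System.Text;

// Hypothetical checkpoint: after producing CharCount UTF-16 code units,
// ByteCount bytes of server data had been consumed, and Decoder holds any
// partially decoded code point that was pending at that boundary.
sealed class DecodeCheckpoint
{
    public long CharCount;
    public long ByteCount;
    public Decoder Decoder; // a private copy, e.g. cloned via serialization
}

sealed class CellReader
{
    private readonly List<DecodeCheckpoint> _checkpoints = new();

    // Record a checkpoint after each decoded chunk so later GetChars() calls
    // can resume from the nearest preceding boundary instead of byte 0.
    public void Record(long chars, long bytes, Decoder snapshot) =>
        _checkpoints.Add(new DecodeCheckpoint
        {
            CharCount = chars,
            ByteCount = bytes,
            Decoder = snapshot
        });

    // Poor man's lower_bound: the last checkpoint at or before the requested offset.
    public DecodeCheckpoint FindStart(long charOffset)
    {
        DecodeCheckpoint best = null;
        foreach (var cp in _checkpoints)
            if (cp.CharCount <= charOffset && (best == null || cp.CharCount > best.CharCount))
                best = cp;
        return best;
    }
}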

Can I mix UTF-16 conversion with UTF-8 conversion between bytes and string?

Short version
Is this an identity function?
f = (gₐ · hᵤ · gᵤ · hₐ)
where:
hₐ is the UTF-16 conversion from bytes to string,
gₐ is the UTF-16 conversion from string to bytes,
gᵤ is the Encoding.UTF8.GetBytes(),
hᵤ is the Encoding.UTF8.GetString().
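Rendered as code, the composition in question would look roughly like this (a sketch of my reading of the notation, applied right to left; it is not taken from the linked project):
using System.Text;

static class RoundTrip
{
    // f = gₐ · hᵤ · gᵤ · hₐ, applied right to left to a byte array.
    public static byte[] F(byte[] original)
    {
        string asUtf16 = Encoding.Unicode.GetString(original); // hₐ: bytes -> string (UTF-16)
        byte[] asUtf8  = Encoding.UTF8.GetBytes(asUtf16);      // gᵤ: string -> bytes (UTF-8)
        string back    = Encoding.UTF8.GetString(asUtf8);      // hᵤ: bytes -> string (UTF-8)
        return Encoding.Unicode.GetBytes(back);                // gₐ: string -> bytes (UTF-16)
    }
}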
Long version
I'm using WebSocket4Net to send and receive messages through WebSockets between a C# application and a C# service.
Since some messages are binary, I have to convert them to and from strings when interacting with the library: while its Send() method can send an array of bytes, its MessageReceived event only communicates the received message as a string.
To convert bytes to string and string to bytes, I follow the answer by Mehrdad where the internal encoding of .NET Framework, i.e. UTF-16, is used.
On the other hand, according to the code source (see for example DraftHybi10Processor.cs, line 114), WebSocket4Net uses UTF-8 to convert string to bytes and bytes to string.
Would it cause issues? Is data loss possible?
If you need to send binary data as a string, well, that's what Base-64 and similar encodings are for. If you need to send a string as string... well, send it as a string. If you need to send a string as bytes, Unicode (UTF-16) or UTF-8 will do just fine. Strings aren't simple byte arrays (even if they can be represented that way if necessary). Unicode especially is quite a complicated encoding (see http://www.joelonsoftware.com/articles/Unicode.html; read it - it's a must). Did you know that you can get a unicode normalization that splits a single character into 5 bytes? The same character could also be interpreted as 2. Or a completely different number. I haven't observed it, but I'd expect that some byte arrays will be outright invalid in UTF-16 (which is the current default string encoding in .NET).
I'm not going to go through the proof that your "double-encoding" is flawed. I'm not sure, it might even work. However, the string you're going to get is going to be pretty silly and you'll have a lot of trouble encoding it to make sure that you're not sending commands or something.
The more important thing is - you're not showing intent. You're doing micro-optimizations, and sacrificing readability. Worse, you're relying on implementation details, which aren't necessarily portable or stable with respect to later versions of .NET, not to mention other environments.
Unless you have a very, very good reason (based on actual performance analysis, not a "gut feeling"), go with the simple, readable solution. You can always improve if you have to.
EDIT: A sample code to show why using Unicode to encode non-unicode bytes is a bad idea:
Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(new byte[] { 200, 8 }))
The two bytes on input turned into four bytes, { 239, 191, 189, 8 }. Not quite what you wanted.
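For completeness, the Base-64 approach mentioned above round-trips arbitrary bytes safely; a minimal sketch:
using System;

class Base64Demo
{
    static void Main()
    {
        byte[] original = { 200, 8 };

        // Base-64 turns arbitrary bytes into a plain ASCII string...
        string text = Convert.ToBase64String(original);   // "yAg="

        // ...and decoding gives back exactly the original bytes.
        byte[] roundTripped = Convert.FromBase64String(text);

        Console.WriteLine(text);
        Console.WriteLine(BitConverter.ToString(roundTripped)); // C8-08
    }
}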
