Conversion of a Unicode character from bytes - C#

In our API, we use byte[] to send over data across the network. Everything worked fine, until the day our "foreign" clients decided to pass/receive Unicode characters.
As far as I know, Unicode characters occupy 2 bytes; however, we only allocate 1 byte in the byte array for them.
Here is how we read the character from the byte[] array:
// buffer is a byte[6553] and index is a current location in the buffer
char c = System.BitConverter.ToChar(buffer, m_index);
m_index += SIZEOF_BYTE;
return c;
So the current issue is that the API is receiving a strange Unicode character: when I look at its hexadecimal value, I find that the least significant byte is correct but the most significant byte has a value when it's supposed to be 0. A quick workaround, thus far, has been to mask the character with 0x00FF to filter out the most significant byte.
What is the correct approach to deal with Unicode characters coming from the socket?
Thanks.
Solution:
Kudos to Jon:
char c = (char) buffer[m_index];
And as he mentioned, the reason it works is that the client API receives a character occupying only one byte, while BitConverter.ToChar reads two, hence the issue in converting it. I am still startled as to why it worked for some characters and not others, as it should have failed in all cases.
Thanks Guys, great responses!

You should use Encoding.GetString, using the most appropriate encoding.
I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.
Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?
EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:
char c = (char) buffer[m_index];
I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
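To illustrate the difference, here is a small sketch (the buffer contents are made up, not from the original code):
byte[] buffer = { 0x41, 0x42 };                  // 'A' followed by a non-zero byte
char twoBytes = BitConverter.ToChar(buffer, 0);  // reads 0x4241 -> U+4241, a CJK character
char oneByte = (char) buffer[0];                 // reads 0x41 -> 'A'
As long as the byte following the character happened to be zero, the two calls agreed, which is why the old code appeared to work for some inputs and not others.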

You should look at the System.Text.Encoding.ASCII.GetString function, which takes a byte[] array and converts it to a string (for ASCII).
And System.Text.Encoding.UTF8 or System.Text.Encoding.Unicode for Unicode strings in the UTF-8 or UTF-16 encodings.
There are also functions for converting strings to byte[] in the ASCIIEncoding, UTF8Encoding and UnicodeEncoding classes: see their GetBytes(String) methods.
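For example, a minimal round trip with the Encoding class (the string and the choice of UTF-8 here are just placeholders; use whatever encoding your protocol actually specifies):
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("héllo");  // string -> byte[] for sending
string text = System.Text.Encoding.UTF8.GetString(bytes);    // byte[] -> string on the receiving side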

Unicode characters can take up to four bytes, but rarely are messages encoded on the wire using 4 bytes for each character. Rather, schemes like UTF8 or UTF16 are used that only bring in extra bytes when required.
Have a look at the Encoding class guidance.

Text streams should contain a byte-order mark (BOM) that will allow you to determine how to treat the binary data.

It's unclear what exactly your goal is here. From what I can tell, there are 2 routes you can take:
Ignore all data sent in Unicode
Process both Unicode and ASCII strings
IMHO, #1 is the way to go. But it sounds like your protocol is not necessarily set up to deal with a Unicode string. You will have to do some detection logic to determine whether the string coming in is a Unicode version. If it is, you can use the Encoding.Unicode.GetString method to convert that particular byte array.
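A rough sketch of such detection logic, assuming the sending side prepends a byte-order mark (your protocol would have to guarantee that; the method name is made up):
static string DecodeMessage(byte[] buffer, int count)
{
    // UTF-16 little-endian BOM
    if (count >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
        return System.Text.Encoding.Unicode.GetString(buffer, 2, count - 2);
    // UTF-16 big-endian BOM
    if (count >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
        return System.Text.Encoding.BigEndianUnicode.GetString(buffer, 2, count - 2);
    // UTF-8 BOM
    if (count >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
        return System.Text.Encoding.UTF8.GetString(buffer, 3, count - 3);
    // no BOM: fall back to ASCII for legacy clients
    return System.Text.Encoding.ASCII.GetString(buffer, 0, count);
}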

What encoding are your customers using? If some of your customers are still using ASCII, then you'll need your international customers to use something which maps the ASCII set (0-127) to itself, such as UTF-8. After that, use the UTF8 encoding's GetString method.

My only solution is to fix the API. Either tell the users to use only ASCII strings in the byte[], or fix it to support ASCII and any other encoding you need to use.
Determining which encoding the foreign clients supplied from just the byte[] can be a bit tricky.

Related

Cryptotext in Unicode from Byte Array Seems to Contain Invalid Characters

My encryption application (written in C# & GTK# and using Rijndael) takes a string from a textview to encrypt, and returns the result in a byte array. I then use Encoding.Unicode.GetString() to convert it to a string, but my output doesn't look right; it seems to contain invalid characters: `zźr[� ��ā�֖�Z�_����
W��h�.
I'm assuming that the encoding for the textview is not Unicode, but ASCII doesn't work either. How can I ensure that the output is not invalid? Or is my approach wrong to begin with?
I'm new to C# and not very experienced with programming in general (I have decent skill in PHP and know a little JavaScript, but that's about it) so if you could baby-down your answers it would be much appreciated.
Thank you in advance for taking the time to assist me.
While every string can be represented as a sequence of bytes using UTF-16, not every sequence of bytes represents a UTF-16 encoded string. Especially if the sequence of bytes is the result of an encryption process.
You can use the Convert.ToBase64String Method to convert the sequence of bytes to a Base64 string.
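A minimal sketch of that approach (Encrypt and Decrypt here stand in for your Rijndael code; they are not real methods):
byte[] cipherBytes = Encrypt(plainText);                    // opaque binary output from your cipher
string storable = Convert.ToBase64String(cipherBytes);      // safe to display, store, or send as text
byte[] roundTripped = Convert.FromBase64String(storable);   // exact same bytes back
string plainAgain = Decrypt(roundTripped);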

Is Base64 standardized?

In the company where I work, we have a class to convert to and from a Base64 string. When I first saw the code, I asked why we don't use the Convert.ToBase64String that comes with .NET.
Then I modified the method body to just call Convert.ToBase64String, but it doesn't generate the same string.
I tried using ASCII, UTF8, Unicode and UTF32.
I don't remember exactly, but I think ASCII generates a string of the same length with some characters different, and the other encodings generate longer strings.
Maybe our implementation was wrong, but I found a JavaScript implementation that matches ours.
Isn't Base64 portable?
Edit: I found this at Wikipedia, but I don't know if it was the cause: http://en.wikipedia.org/wiki/Base64#Implementations_and_history
Edit2: I mentioned encodings because we are converting one string to another, so I first need to convert my original string to a byte array using some encoding.
RFC 3548. No, wait! There's a newer one: RFC 4648. (Thanks dtb!)
BTW, you seem to be mixing up base64, which turns any binary stream into an ASCII stream, and character encodings, which are totally different concepts. You may want to read this introduction article about encodings.
There are at least 10 different implementations. They mostly differ by the characters used for indices 62 and 63.
http://en.wikipedia.org/wiki/Base64#Implementations_and_history
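To see the effect of the text encoding (the Base64 alphabet itself is standard, but the bytes you feed it are not), compare the output for the same string in a small sketch:
string s = "héllo";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes(s);    // 68 3F 6C 6C 6F ('é' became '?')
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(s);      // 68 C3 A9 6C 6C 6F
byte[] utf16 = System.Text.Encoding.Unicode.GetBytes(s);  // 68 00 E9 00 6C 00 6C 00 6F 00
Console.WriteLine(Convert.ToBase64String(ascii));   // aD9sbG8=
Console.WriteLine(Convert.ToBase64String(utf8));    // aMOpbGxv
Console.WriteLine(Convert.ToBase64String(utf16));   // aADpAGwAbABvAA==
If the JavaScript implementation operates on a different byte representation of the string, the Base64 strings will differ even though both codecs are correct.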

How do I encode a Binary blob as Unicode blob?

I'm trying to store a Gzip-serialized object in Active Directory's "Extension Attribute", more info here. This field is a Unicode string according to its oM syntax of 64.
What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.
There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that ActiveDirectory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.
My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. The reason I recommend this is because there is already a built-in .NET implementation for this:
var base64Encoded = Convert.ToBase64String(byteArray);
var original = Convert.FromBase64String(base64Encoded);
In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.
Normally, this would be the way to convert between bytes and Unicode text:
// string from bytes
string text = System.Text.Encoding.Unicode.GetString(bytes);
// bytes from string
byte[] encoded = System.Text.Encoding.Unicode.GetBytes(text);
EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:
// string from bytes
Convert.ToBase64String(byteArray);
// bytes from string
Convert.FromBase64String(base64Encoded);
(Thanks to @Timwi, who pointed this out!)

Force C# to use ASCII

I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: in the file, a particular character (prints as the euro sign) gets converted to a ? when I load/save the files. That's not much of an issue in text, but if it occurred in a record length, it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are as follows:
Original file: 0x80 (euro)
Encodings:
ASCII: 0x3F (?)
UTF8: 0xC2 0x80 (A-hat, euro)
Neither of those results will work, since this byte can occur anywhere in the file (if an 0x80 changed to 0x3F in a record-length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue, but it's now adding that second byte, which is even worse.
C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader(fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.
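A sketch of the write side (the file name and the string here are placeholders, not from the question):
string record = "HEADER,0042,END";
using (StreamWriter writer = new StreamWriter("data.dat", false, System.Text.Encoding.ASCII))
{
    writer.Write(record);  // anything outside the ASCII range would be written as '?'
}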
Internally, strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Text.Encoding.ASCII class to do your conversions from string -> byte[] and byte[] -> string.
If you have a file format that mixes text in single-byte characters with binary values such as lengths and control characters, a good encoding to use is code page 28591, aka Latin-1, aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that byte values up to 255 are converted unchanged to the Unicode character with the same value (e.g. the byte 0x80 becomes the character U+0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
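A small sketch of the round trip (the byte values are arbitrary examples):
System.Text.Encoding latin1 = System.Text.Encoding.GetEncoding(28591);
byte[] original = { 0x80, 0x41, 0xFF };
string asText = latin1.GetString(original);   // "\u0080A\u00FF" - nothing replaced with '?'
byte[] restored = latin1.GetBytes(asText);    // { 0x80, 0x41, 0xFF } again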
If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B

Base64 String throwing invalid character error

I keep getting a Base64 invalid character error even though I shouldn't.
The program takes an XML file and exports it to a document. If the user wants, it will compress the file as well. The compression works fine and returns a Base64 String which is encoded into UTF-8 and written to a file.
When it's time to reload the document into the program, I have to check whether it's compressed or not; the code is simply:
byte[] gzBuffer = System.Convert.FromBase64String(text);
return "1F-8B-08" == BitConverter.ToString(new List<Byte>(gzBuffer).GetRange(4, 3).ToArray());
It checks the beginning of the data to see if it has GZip's magic number in it.
Now the thing is, all my tests work. I take a string, compress it, decompress it, and compare it to the original. The problem is when I get the string returned from an ADO Recordset. The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything; even with it trimmed off, it still throws). I even copied and pasted the entire string into a test method and compressed/decompressed that. Works fine.
The tests pass but the code fails using the exact same string? The only difference is that instead of just declaring a regular string and passing it in, I'm getting one returned from a recordset.
Any ideas on what am I doing wrong?
You say:
"The string is exactly what was written to the file (with the addition of a "\0" at the end, but I don't think that even does anything)."
In fact, it does do something (it causes your code to throw a FormatException: "Invalid character in a Base-64 string"), because Convert.FromBase64String does not consider "\0" to be a valid Base64 character.
byte[] data1 = Convert.FromBase64String("AAAA\0"); // Throws exception
byte[] data2 = Convert.FromBase64String("AAAA"); // Works
Solution: Get rid of the zero termination. (Maybe call .TrimEnd('\0'))
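For example (a quick sketch; "AAAA" just stands in for your real Base64 data):
string fromRecordset = "AAAA\0";                                      // what the recordset hands back
byte[] data = Convert.FromBase64String(fromRecordset.TrimEnd('\0'));  // works once the terminator is gone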
Notes:
The MSDN docs for Convert.FromBase64String say it will throw a FormatException when:
"The length of s, ignoring white space characters, is not zero or a multiple of 4. -or- The format of s is invalid. s contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters."
and that:
"The base 64 digits in ascending order from zero are the uppercase characters 'A' to 'Z', lowercase characters 'a' to 'z', numerals '0' to '9', and the symbols '+' and '/'."
Whether the null char is allowed or not really depends on the Base64 codec in question.
Given the vagueness of the Base64 standard (there is no authoritative exact specification), many implementations would just ignore it as white space, others would flag it as a problem, and the buggiest ones wouldn't notice and would happily try decoding it... :-/
But it sounds like the C# implementation does not like it (which is one valid approach), so if removing it helps, that should be done.
One minor additional comment: UTF-8 is not a requirement; ISO-8859-x (aka Latin-x) and 7-bit ASCII would work as well, because Base64 was specifically designed to use only a 7-bit subset that works with all ASCII-compatible encodings.
string stringToDecrypt = HttpContext.Current.Request.QueryString.ToString();
// change to the following so URL-escaped characters are decoded before the Base64 conversion
string stringToDecrypt = HttpUtility.UrlDecode(HttpContext.Current.Request.QueryString.ToString());
If removing the \0 from the end of the string is impossible, you can append your own character to each string you encode and remove it on decode.
One gotcha with converting Base64 from a string is that some conversion functions expect the preceding "data:image/jpg;base64," prefix, while others only accept the actual data.
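For example, a quick way to strip such a prefix before decoding (a sketch; the data URI here is shortened):
string dataUri = "data:image/jpg;base64,AAAA";
string justData = dataUri.Substring(dataUri.IndexOf(',') + 1);
byte[] bytes = Convert.FromBase64String(justData);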
