Two-byte character or one-byte character - C#

How can I tell whether a character in the input string is a two-byte or a one-byte character, and which encoding system the character is coming from?
I am using C# and Silverlight. I assume I could find out which encoding the computer is running and then examine the character? Any code snippet?
Thank you,
Rune
// Get a UTF-32 encoding by codepage.
Encoding Encoding_12000_instance = Encoding.GetEncoding(12000);
// Get a UTF-32 encoding by name.
Encoding Encoding_UTF32_instance = Encoding.GetEncoding("utf-32");

Everything that is a string in .NET is UTF-16. If you are getting input from other sources, you need to get the encoding name from that source.
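As a hedged illustration of that point (the class and variable names below are my own, and this targets the full desktop framework; Silverlight's API surface is narrower), the following sketch reports how many bytes each character of a string occupies in UTF-16 and in UTF-8, which is one way to answer the "one byte or two bytes" question per character:
using System;
using System.Text;

class CharByteCountDemo
{
    static void Main()
    {
        string input = "aé€"; // one-byte, two-byte and three-byte characters in UTF-8

        // BMP characters only; surrogate pairs would need char.IsSurrogatePair handling.
        foreach (char c in input)
        {
            string s = c.ToString();
            int utf16Bytes = Encoding.Unicode.GetByteCount(s); // always 2 for a single BMP char
            int utf8Bytes = Encoding.UTF8.GetByteCount(s);     // 1, 2 or 3 for BMP characters
            Console.WriteLine("'{0}': UTF-16 = {1} bytes, UTF-8 = {2} bytes", c, utf16Bytes, utf8Bytes);
        }
    }
}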

Related

Strange results when converting from byte array to string

I get strange results when converting a byte array to a string and then converting the string back to a byte array.
Try this:
byte[] b = new byte[1];
b[0] = 172;
string s = Encoding.ASCII.GetString(b);
byte[] b2 = Encoding.ASCII.GetBytes(s);
MessageBox.Show(b2[0].ToString());
And the result for me is not 172 as I'd expect but... 63.
Why does it happen?
Because ASCII only contains values up to 127.
When faced with binary data which is invalid for the given encoding, Encoding.GetString can provide a replacement character, or throw an exception. Here, it's using a replacement character of ?.
It's not clear exactly what you're trying to achieve, but:
If you're converting arbitrary binary data to text, use Convert.ToBase64String instead; do not try to use an encoding, as you're not really representing text. You can use Convert.FromBase64String to then decode.
Encoding.ASCII is usually a bad choice, and binary data that includes a byte value of 172 is certainly not ASCII text.
You need to work out which encoding you're actually using. Personally I dislike using Encoding.Default unless you really know the data is in the default encoding for the platform you're working on. If you get the choice, using UTF-8 is a good one.
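A minimal sketch of that contrast, using Convert.ToBase64String for the lossless round trip alongside the lossy ASCII round trip from the question (example values are my own):
using System;
using System.Text;

class Base64RoundTripDemo
{
    static void Main()
    {
        byte[] original = { 172, 0, 255 }; // arbitrary binary data, not text

        // Lossless: Base64 can represent any byte sequence as text.
        string base64 = Convert.ToBase64String(original);
        byte[] restored = Convert.FromBase64String(base64);
        Console.WriteLine(restored[0]); // 172

        // Lossy: ASCII replaces any byte above 127 with '?' (63).
        string asAscii = Encoding.ASCII.GetString(original);
        Console.WriteLine(Encoding.ASCII.GetBytes(asAscii)[0]); // 63
    }
}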
ASCII is a 7-bit encoding. If you take a look at the generated string, it contains "?", the replacement for an unrecognized character. You might choose Encoding.Default instead.
ASCII is a seven-bit character encoding, so 172 falls outside that range; when converting to a string, the byte becomes "?", which is used for characters that cannot be represented.

Reading special characters from Byte[]

I'm writing to and reading from Mifare RFID cards.
To WRITE to the card, I'm using a byte[] like this:
byte[] buffer = Encoding.ASCII.GetBytes(txt_IDCard.Text);
Then, when I READ from the card, I get errors with special characters: where it's supposed to show é, ã, õ, á, à... I get ? instead:
string result = System.Text.Encoding.UTF8.GetString(buffer);
string result2 = System.Text.Encoding.ASCII.GetString(buffer, 0, buffer.Length);
string result3 = Encoding.UTF7.GetString(buffer);
E.g.: instead of Àgua, amanhã, você I receive/read ?gua, amanh?, voc?.
How can I solve it?
ASCII by its very definition only supports 128 characters.
What you need is an ANSI encoding if you are reading legacy text.
You can use Encoding.Default instead of Encoding.ASCII to interpret characters in the current locale's default ANSI code page.
Ideally, you would know exactly which code page the ANSI characters use and specify it explicitly via the Encoding.GetEncoding(int codePage) overload, for example:
string result = System.Text.Encoding.GetEncoding(1252).GetString(buffer);
Here's a very good reference page on Unicode: http://www.joelonsoftware.com/articles/Unicode.html
And another here: http://msdn.microsoft.com/en-us/library/b05tb6tz%28v=vs.90%29.aspx
But maybe you can just use UTF8 when reading and writing
I don't know the details of the card reader. Is the data you read and write to the card just a load of bytes?
If so, you can just use UTF8 for both reading and writing and it will all just work. It's only necessary to use ANSI if you are working with a legacy device which is expecting (or providing) ANSI text. If the device just stores bytes blindly without implying any particular format, you can do what you like - in this case, just always use UTF8.
It seems like you're using characters that aren't mapped in 7-bit ASCII but in the "extended" encodings ISO-8859-1 or ISO-8859-15. You'll need to choose a specific encoding when mapping to your byte array, and things should work fine:
byte[] buffer = Encoding.GetEncoding("ISO-8859-1").GetBytes(txt_IDCard.Text);
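The read side then has to use that same encoding; a brief sketch, assuming buffer here holds the bytes read back from the card:
// Decode with the same encoding that was used to write.
string result = Encoding.GetEncoding("ISO-8859-1").GetString(buffer);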
You have two problems there:
ASCII supports only a limited number of characters.
You're currently using two different Encodings for reading and writing.
You should write with the same Encoding as you read.
Writing
byte[] buffer = Encoding.UTF8.GetBytes(txt_IDCard.Text);
Reading
string result = Encoding.UTF8.GetString(buffer);

Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

I am getting ÐиÑилл ÐаÑанник from a C++ component and I need to decode it. The string is always UTF-8 encoded. After much R&D, I figured out the following way to decode it:
String text = Encoding.UTF8.GetString(
    Encoding.GetEncoding("iso-8859-1").GetBytes("ÐиÑилл ÐаÑанник"));
But isn't this hardcoding "iso-8859-1"? What if characters other than Cyrillic come up? I want a generic method for decoding a UTF-8 string.
Thanks in advance.
When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into your C++ program, the computer converts each character to its corresponding UTF-8 encoded bytes.
string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
Then your C++ program comes along, looks at the bytes and thinks they are ISO-8859-1 encoded.
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!
Not much you can do about that. Then you get the wrongly encoded string and have to assume it is UTF-8 that was incorrectly decoded as ISO-8859-1. This assumption proves to be correct, but you cannot verify it from the string itself.
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// Привет мир!
Note that ISO-8859-1 is the ISO West-European encoding, and has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded, your C++ program would still interpret it as ISO-8859-1:
string typedByUser = "こんにちは、世界!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// こんにちは、世界!
The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.
However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the Operating System. For example, the C++ program made the assumption that the original input is ISO-8859-1 encoded, which was wrong.
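If you do want to keep that assumption in one place, a small helper along these lines is one option (the class and method names are my own, and it only works for as long as the C++ side keeps mis-decoding UTF-8 as ISO-8859-1):
using System.Text;

static class MojibakeHelper
{
    // Recovers the original bytes by re-encoding as ISO-8859-1
    // (a byte-for-byte mapping for code points 0-255), then decodes them as UTF-8.
    public static string DecodeMisreadUtf8(string misdecoded)
    {
        byte[] originalBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(misdecoded);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
Usage would then just be string text = MojibakeHelper.DecodeMisreadUtf8(inputFromCpp);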
By the way, character encodings have always been problematic. A well-known example is a letter from a French student to his Russian friend, where the Cyrillic address was printed on the envelope as ISO-8859-1 mojibake and had to be decoded by hand by the postal employees.
A source of characters should only be transferred in one encoding; that means it's either ISO-8859-1 or something else, but not both at the same time
(which means you might be wrong about the reverse-engineered Cyrillic in the first place).
Could you post the expected UTF-8 output of your input?

Which encoding do Alt+Numpad keys generate?

In short:
For this code:
Encoding.ASCII.GetBytes("‚")
I want the output to be 130, but this gives me 63.
I am typing the string using Alt+0130.
On my setup:
Encoding.ASCII.GetBytes("‚"); // 63
Encoding.Default.GetBytes("‚"); // 130
Of course 'default' could very well be environment-dependent...
When you try to encode the string using the ASCII encoding, it will be converted to a question mark as there is no such character in the ASCII character set. The character code for the question mark is 63.
You need to use an encoding that supports the character to get its actual character code.
One option is to use the Encoding.Default property to get the encoding for the system codepage, as David suggested. However as the system codepage can differ, it's not guaranteed to give the same result on all computers.
The Unicode code point is 8218, which you can get by simply casting the character to an int:
int characterCode = (int)'‚';
As this does not depend on any system settings, you should consider whether you can use that instead of the encoded byte value.
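As a hedged illustration (this assumes the machine's ANSI code page is Windows-1252, which is what Alt+0130 targets on typical Western setups, and that the code page is available on the runtime you use):
using System;
using System.Text;

class AltNumpadDemo
{
    static void Main()
    {
        char c = '\u201A'; // the character Alt+0130 inserts: ‚ (single low-9 quotation mark)

        Console.WriteLine((int)c);                                                // 8218 (Unicode code point)
        Console.WriteLine(Encoding.ASCII.GetBytes(c.ToString())[0]);              // 63  ('?' replacement)
        Console.WriteLine(Encoding.GetEncoding(1252).GetBytes(c.ToString())[0]);  // 130 (Windows-1252 byte)
    }
}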

Conversion of a Unicode character from a byte

In our API, we use byte[] to send over data across the network. Everything worked fine, until the day our "foreign" clients decided to pass/receive Unicode characters.
As far as I know, Unicode characters occupy 2 bytes; however, we only allocate 1 byte in the byte array for them.
Here is how we read the character from the byte[] array:
// buffer is a byte[6553] and m_index is the current location in the buffer
char c = System.BitConverter.ToChar(buffer, m_index);
m_index += SIZEOF_BYTE;
return c;
So the current issue is that the API receives a strange Unicode character. When I look at the Unicode value in hexadecimal, I find that the least significant byte is correct but the most significant byte has a value when it is supposed to be 0. A quick workaround, thus far, has been to mask the character with 0x00FF to filter out the MSB.
What is the correct approach to deal with Unicode characters coming from the socket?
Thanks.
Solution:
Kudos to Jon:
char c = (char) buffer[m_index];
And as he mentioned, the reason it works is that the client API receives a character occupying only one byte, while BitConverter.ToChar uses two, hence the issue in converting it. I am still startled as to why it worked for some sets of characters and not others, as it should have failed in all cases.
Thanks Guys, great responses!
You should use Encoding.GetString, using the most appropriate encoding.
I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.
Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?
EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:
char c = (char) buffer[m_index];
I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
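A small sketch of that failure mode (the example bytes are my own, and it assumes a little-endian machine, since BitConverter follows the machine's byte order):
using System;

class BitConverterCharDemo
{
    static void Main()
    {
        byte[] buffer = { 0x41, 0x00, 0x41, 0x42 };

        // Two-byte read: combines buffer[i] and buffer[i + 1].
        Console.WriteLine(BitConverter.ToChar(buffer, 0)); // 'A', i.e. (char)0x0041 (next byte is zero)
        Console.WriteLine(BitConverter.ToChar(buffer, 2)); // (char)0x4241, an unexpected CJK character

        // One-byte read: uses only the byte that was actually written.
        Console.WriteLine((char)buffer[2]);                // 'A'
    }
}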
You should look at the System.Text.Encoding.ASCII.GetString method, which takes a byte[] array and converts it to a string (for ASCII).
Use System.Text.Encoding.UTF8 or System.Text.Encoding.Unicode for Unicode strings in the UTF-8 or UTF-16 encodings.
There are also methods for converting strings to byte[] in the ASCIIEncoding, UTF8Encoding and UnicodeEncoding classes: see the GetBytes(String) methods.
Unicode characters can take up to four bytes, but messages are rarely encoded on the wire using 4 bytes per character. Rather, schemes like UTF-8 or UTF-16 are used that only bring in extra bytes when required.
Have a look at the Encoding class guidance.
Text streams should contain a byte-order mark that will allow you to determine how to treat the binary data.
It's unclear what exactly your goal is here. From what I can tell, there are 2 routes that you can take
Ignore all data sent in Unicode
Process both unicode and ASCII strings
IMHO, #1 is the way to go. But it sounds like your protocol is not necessarily set up to deal with a Unicode string. You will have to do some detection logic to determine whether the incoming string is a Unicode version. If it is, you can use the Encoding.Unicode.GetString method to convert that particular byte array.
What encoding are your customers using? If some of your customers are still using ASCII, then you'll need your international customers to use something which maps the ASCII set (0-127) to itself, such as UTF-8. After that, use the UTF-8 encoding's GetString method.
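A brief sketch of that compatibility point (the example strings are my own): pure-ASCII bytes decode identically under ASCII and UTF-8, while non-ASCII text only survives the UTF-8 path:
using System;
using System.Text;

class AsciiUtf8CompatibilityDemo
{
    static void Main()
    {
        // ASCII bytes are valid UTF-8, so UTF-8 decodes them unchanged.
        byte[] asciiBytes = Encoding.ASCII.GetBytes("Hello");
        Console.WriteLine(Encoding.UTF8.GetString(asciiBytes));  // Hello

        // Non-ASCII text survives a UTF-8 round trip but not an ASCII one.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Привет");
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));   // Привет
        Console.WriteLine(Encoding.ASCII.GetString(utf8Bytes));  // ???????????? (every byte replaced)
    }
}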
My only solution is to fix the API. Either tell the users to use only ASCII strings in the byte[] or fix it to support ASCII and any other encoding you need to use.
Deciding which encoding the foreign clients supplied from just the byte[] can be a bit tricky.
