Reading special characters from Byte[] - c#

I'm writing and readingfrom Mifare - RFID cards.
To WRITE into the card, i'm using a Byte[] like this:
byte[] buffer = Encoding.ASCII.GetBytes(txt_IDCard.Text);
Then, to READ from the card, I'm getting some error with special characters, when it's supposed to show me é, ã, õ, á, à... I get ? instead:
string result = System.Text.Encoding.UTF8.GetString(buffer);
string result2 = System.Text.Encoding.ASCII.GetString(buffer, 0, buffer.Length);
string result3 = Encoding.UTF7.GetString(buffer);
e.g: Instead I get Àgua, amanhã, você I receive/read ?gua, amanh?, voc?.
How may I solve it ?

ASCII by its very definition only supports 128 characters.
What you need is ANSI characters if you are reading legacy text.
You can use Encoding.Default instead of Encoding.ASCII to interpret characters in the current locale's default ANSI code page.
Ideally, you would know exactly which code page you are expecting the ANSI characters to use and specify the code page explicitly using this overload of Encoding.GetEncoding(int codePage), for example:
string result = System.Text.Encoding.GetEncoding(1252).GetString(buffer);
Here's a very good reference page on Unicode: http://www.joelonsoftware.com/articles/Unicode.html
And another here: http://msdn.microsoft.com/en-us/library/b05tb6tz%28v=vs.90%29.aspx
But maybe you can just use UTF8 when reading and writing
I don't know the details of the card reader. Is the data you read and write to the card just a load of bytes?
If so, you can just use UTF8 for both reading and writing and it will all just work. It's only necessary to use ANSI if you are working with a legacy device which is expecting (or providing) ANSI text. If the device just stores bytes blindly without implying any particular format, you can do what you like - in this case, just always use UTF8.

It seems like you're using characters that aren't mapped in the 7 bits ASCII, but in the "extensions" ISO-8859-1 or ISO-8859-15. You'll need to choose a specific encoding for mapping to your byte array and things should work fine;
byte[] buffer = Encoding.GetEncoding("ISO-8859-1").GetBytes(txt_IDCard.Text);

You have two problems there:
ASCII supports only a limited amount of characters.
You're currently using two different Encodings for reading and writing.
You should write with the same Encoding as you read.
Writing
byte[] buffer = Encoding.UTF8.GetBytes(txt_IDCard.Text);
Reading
string result = Encoding.UTF8.GetString(buffer);

Related

How do I read hex sequences like xD0 into a C# string?

I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a streamreader which takes the encoding to be UTF-8. All working well. But there are some characters used to replace the punctuation marks that just aren't working. I can see them as hex sequences in notepad++ (encoding set to UTF-8) but when I read them, they all get reduced down to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
(Full size image)
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16. This is how they are stored in memory. Because of this you should be able to read the string into memory and replace the characters without any issues. You can then write those characters back to a file (UTF8 is the default character encoding for reading and writing to file if I'm not mistaken). The ?'s just means the console you outputed the string to does not support those characters or the bytes are not of a valid encoding.
Here is a good article by Jon Skeet about C#/.NET strings.

ansi to unicode conversion

While parsing certain documents, I get the character code 146, which is actually an ANSI number. While writing the char to text file, nothing is shown. If we write the char as Unicode number- 8217, the character is displayed fine.
Can anyone give me advice on how to convert the ANSI number 146 to Unicode 8217 in C#.
reference: http://www.alanwood.net/demos/ansi.html
Thanks
"ANSI" is really a misnomer - there are many encodings often known as "ANSI". However, if you're sure you need code page 1252, you can use:
Encoding encoding = Encoding.GetEncoding(1252);
using (TextReader reader = File.OpenText(filename, encoding))
{
// Read text and use it
}
or
Encoding encoding = Encoding.GetEncoding(1252);
string text = File.ReadAllText(filename, encoding);
That's for reading a file - writing a file is the same idea. Basically when you're converting from binary (e.g. file contents) to text, use an appropriate Encoding object.
My recommendation would be to read Joel's "Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets. There's quite a lot involved in your question and my experience has been that you'll just struggle against the simple answers if you don't understand these basics. It takes around 15 minutes to read.

Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this before in a round-about manner before here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows in what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for my solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
Encoding encoding = Encoding.Default;
String original = String.Empty;
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
original = sr.ReadToEnd();
encoding = sr.CurrentEncoding;
sr.Close();
}
if (encoding == Encoding.UTF8)
return original;
byte[] encBytes = encoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF8. This happens only some of the time, UTF8 works the majority of the time. Also, the latest of the text data DOES contain UTF8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
const int WindowsCodepage1252 = 1252;
string text;
try
{
var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, utf8Encoding);
}
catch (DecoderFallbackException dfe)//then text is not entirely valid UTF8, contains Codepage 1252 characters that can't be correctly decoded to UTF8
{
var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, codepage1252Encoding);
}
return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
// If the destination's parent doesn't exist, create it.
String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
if (!Directory.Exists(parent))
{
Directory.CreateDirectory(parent);
}
// Convert the file.
String tempName = null;
try
{
tempName = Path.GetTempFileName();
using (StreamReader sr = new StreamReader(sourcePath))
{
using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
{
int charsRead;
char[] buffer = new char[128 * 1024];
while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
sw.Write(buffer, 0, charsRead);
}
}
}
File.Delete(destPath);
File.Move(tempName, destPath);
}
finally
{
File.Delete(tempName);
}
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for my solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have vCard file (*.vcf) - 200 contacts in one file in russian language...
I opened it with vCardOrganizer 2.1 program then made Split to divide it on 200....and what I see - contacts with messy symbols, only thing I can read it numbers :-) ...
Steps: (when you do this steps be patient, sometimes it takes time)
Open vCard file (my file size was 3mb) with "notepad"
Then go from Menu - File-Save As..in opened window choose file name, dont forget put .vcf , and encoding - ANSI or UTF-8...and finally click Save..
I converted filename.vcf (UTF-8) to filename.vcf (ANSI) - nothing lost and perfect readable russian language...if you have quest write: yoshidakatana#gmail.com
Good Luck !!!

How do I encode a Binary blob as Unicode blob?

I'm trying to store a Gzip serialized object into Active Directory's "Extension Attribute", more info here. This field is a Unicode string according to it's oM syntax of 64.
What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.
There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that ActiveDirectory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.
My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. The reason I recommend this is because there is already a built-in .NET implementation for this:
var base64Encoded = Convert.ToBase64String(byteArray);
var original = Convert.FromBase64String(base64Encoded);
In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.
Normally, this would be the way to convert between bytes and Unicode text:
// string from bytes
System.Text.Encoding.Unicode.GetString(bytes);
// bytes from string
System.Text.Encoding.Unicode.GetBytes(bytes);
EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:
// string from bytes
Convert.ToBase64String(byteArray);
// bytes from string
Convert.FromBase64String(base64Encoded);
(Thanks to #Timwi who pointed this out!)

Conversion of a unicode character from byte

In our API, we use byte[] to send over data across the network. Everything worked fine, until the day our "foreign" clients decided to pass/receive Unicode characters.
As far as I know, Unicode characters occupy 2 bytes, however, we only allocate 1 byte in the byte array for them.
Here is how we read the character from the byte[] array:
// buffer is a byte[6553] and index is a current location in the buffer
char c = System.BitConverter.ToChar(buffer, m_index);
index += SIZEOF_BYTE;
return c;
So the current issue is the API is receiving a strange Unicode character, when I look at the Unicode hexadecimal. I found that the last significant byte is correct but the most significant byte has a value when it’s supposed to be 0. A quick workaround, thus far, has been to 0x00FF & c to filter the msb.
Please suggest the correct approach to deal with Unicode characters coming from the socket?
Thanks.
Solution:
Kudos to Jon:
char c = (char) buffer[m_index];
And as he mentioned, the reason it works, is because the client api receives a character occupying only one byte, and BitConverter.ToChar uses two, hence the issue in converting it. I am still startled as to why it worked for some set of characters and not the others, as it should have failed in all cases.
Thanks Guys, great responses!
You should use Encoding.GetString, using the most appropriate encoding.
I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.
Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?
EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:
char c = (char) buffer[m_index];
I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
You should look at the System.Text.ASCIIEncoder.ASCII.GetString function which takes a byte[] array and converts it to a string (for ascii).
And System.Text.UTF8Encoder or System.Text.UTF16Encoder for Unicode strings in the UTF8 or UTF16 encodings.
There are also functions for converting Strings to Byte[] in the ASCIIEncoding, UTF8Encoding and UTF16Encoding classes: see the GetBytes(String) functions.
Unicode characters can take up to four bytes, but rarely are messages encoded on the wire using 4 bytes for each character. Rather, schemes like UTF8 or UTF16 are used that only bring in extra bytes when required.
Have a look at the Encoding class guidance.
Test streams should contain a byte-order marker that will allow you to determine how to treat the binary data.
It's unclear what exactly your goal is here. From what I can tell, there are 2 routes that you can take
Ignore all data sent in Unicode
Process both unicode and ASCII strings
IMHO, #1 is the way to go. But it sounds like your protocol is not necessarily setup to deal with a unicode string. You will have to do some detection logic to determine if the string coming in is a Unicode version. If it is you can use the Enconding.Unicode.GetString method to convert that particular byte array.
What encoding are your customers using? If some of your customers are still using ASCII, then you'll need your international customers to use something which maps the ASCII set (1-127) to itself, such as UTF8. After that, use the UTF8 encoding's GetString method.
My only solution is to fix the API. Either tell the users to use only ASCII string in the Byte[] or fix it to support ASCII and any other encoding you need to use.
Deciding what encoding is supplied by the foreign clients from just the byte[] can be a bit tricky.

Categories

Resources