Decoding UTF-16LE with Encoding.Unicode garbles text - C#

I have the following code:
var reader = new StreamReader(inputSubtitle, encoding);
string str;
var list = new List<String>();
try
{
    str = reader.ReadLine();
    while (str != null)
    {
        list.Add(str);
        str = reader.ReadLine();
    }
    return list;
}
finally
{
    // release the reader when done
    reader.Dispose();
}
The encoding is based on the Byte Order Mark. The charset detector code (can provide it if necessary) simply looks at the hex value of the first couple of bytes in the file. The files are usually UTF-8, Windows ANSI (Codepage-1252) or UTF-16LE. The last one currently fails, and I have no clue why.
Previewing the text in Notepad says it's encoded as Unicode (by which it means UTF-16LE, as far as I know), opening it in Firefox says it's UTF-16LE, and the BOM starts with the bytes FF FE.
Take this example text:
1
00:04:05,253 --> 00:04:11,886
<i>This is the first line</i>
- This is the second line.
I send this file as a FileStream to the charset detector (I use a FileStream as input in the backend), where I added the following debug lines:
byte[] dataFromFileStream = new byte[(input.Length)];
input.Read(dataFromFileStream, 0, (int)input.Length);
Console.WriteLine(BitConverter.ToString(dataFromFileStream));
This produces the following hexcode:
"FF-FE-31-00-(...)"
FF-FE is the byte order mark of UTF-16LE.
Reading this data with the StreamReader and the encoding set to Encoding.Unicode turns it into a single garbled string:
"\u0d00\u0a00  㨀 㐀㨀 㔀Ⰰ㈀㔀㌀ ⴀⴀ㸀   㨀 㐀㨀\u3100\u3100Ⰰ㠀㠀㘀\u0d00\u0a00㰀椀㸀吀栀椀猀 椀猀 琀栀攀 昀椀爀猀琀 氀椀渀攀㰀⼀椀㸀\u0d00\u0a00ⴀ 吀栀椀猀 椀猀 琀栀攀 猀攀挀漀渀搀 氀椀渀攀⸀"
Setting the encoding to Encoding.GetEncoding(1201), i.e. UTF-16BE, opens the file properly and decodes it into the expected 4 lines in the list.
I've noticed this bug for a couple of weeks; before then the code worked properly. Did something change in an update? Encoding.Unicode is described as UTF-16LE in the documentation.
For the moment I changed my code to use UTF-16BE as the decoder to make it work again, but that just feels wrong.

Turns out I made a stupid mistake; the Charset detector reads a couple of bytes to determine the encoding of the file. To make sure I have a UTF-16LE file (which starts with FF-FE) and not a UTF-32LE (which starts with FF-FE-00-00) I read the third byte as well. After reading those bytes, however, I did not reset the position of the FileStream back to 0. An earlier version of my code, with a different constructor, did reset the starting position. Adding the code for resetting the position fixed it.
Explanation:
The StreamReader class does not require a BOM to be present, so it simply starts reading from wherever the charset detector left the stream position. Detecting UTF-8 or ANSI caused no issues, since those encodings use single-byte code units and don't care about alignment. UTF-16 uses two bytes per code unit, so starting at an odd-numbered byte made every character look like it had its bytes swapped, which is why decoding as UTF-16BE 'fixed' the issue.
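For reference, here is a minimal sketch of the kind of detector-plus-rewind fix described above (the method name and the ANSI fallback are illustrative assumptions, not the actual detector code):

static Encoding DetectEncodingAndRewind(FileStream input)
{
    // Read up to 4 bytes so UTF-32LE (FF FE 00 00) can be told apart from UTF-16LE (FF FE).
    var bom = new byte[4];
    int read = input.Read(bom, 0, 4);

    Encoding detected;
    if (read >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        detected = Encoding.UTF8;
    else if (read >= 4 && bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0x00 && bom[3] == 0x00)
        detected = Encoding.UTF32;                   // UTF-32LE
    else if (read >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
        detected = Encoding.Unicode;                 // UTF-16LE
    else if (read >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
        detected = Encoding.BigEndianUnicode;        // UTF-16BE
    else
        detected = Encoding.GetEncoding(1252);       // no BOM: assume Windows ANSI (an assumption)

    // The crucial step that was missing: rewind so the StreamReader starts reading at byte 0.
    input.Seek(0, SeekOrigin.Begin);
    return detected;
}

The StreamReader skips a matching BOM by itself, so rewinding all the way to 0 is safe.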

Related

Reading special characters from Byte[]

I'm writing to and reading from Mifare RFID cards.
To WRITE to the card, I'm using a byte[] like this:
byte[] buffer = Encoding.ASCII.GetBytes(txt_IDCard.Text);
Then, when I READ from the card, I get errors with special characters: where it's supposed to show me é, ã, õ, á, à... I get ? instead:
string result = System.Text.Encoding.UTF8.GetString(buffer);
string result2 = System.Text.Encoding.ASCII.GetString(buffer, 0, buffer.Length);
string result3 = Encoding.UTF7.GetString(buffer);
E.g. instead of Àgua, amanhã, você I receive/read ?gua, amanh?, voc?.
How can I solve it?
ASCII by its very definition only supports 128 characters.
What you need is an ANSI encoding if you are reading legacy text.
You can use Encoding.Default instead of Encoding.ASCII to interpret characters in the current locale's default ANSI code page.
Ideally, you would know exactly which code page the ANSI characters are expected to use, and specify it explicitly using the Encoding.GetEncoding(int codePage) overload, for example:
string result = System.Text.Encoding.GetEncoding(1252).GetString(buffer);
Here's a very good reference page on Unicode: http://www.joelonsoftware.com/articles/Unicode.html
And another here: http://msdn.microsoft.com/en-us/library/b05tb6tz%28v=vs.90%29.aspx
But maybe you can just use UTF8 when reading and writing
I don't know the details of the card reader. Is the data you read and write to the card just a load of bytes?
If so, you can just use UTF8 for both reading and writing and it will all just work. It's only necessary to use ANSI if you are working with a legacy device which is expecting (or providing) ANSI text. If the device just stores bytes blindly without implying any particular format, you can do what you like - in this case, just always use UTF8.
It seems like you're using characters that aren't mapped in 7-bit ASCII but in the "extended" ISO-8859-1 or ISO-8859-15 character sets. You'll need to choose a specific encoding when mapping to your byte array, and things should work fine:
byte[] buffer = Encoding.GetEncoding("ISO-8859-1").GetBytes(txt_IDCard.Text);
You have two problems there:
ASCII supports only a limited amount of characters.
You're currently using two different Encodings for reading and writing.
You should write with the same Encoding as you read.
Writing
byte[] buffer = Encoding.UTF8.GetBytes(txt_IDCard.Text);
Reading
string result = Encoding.UTF8.GetString(buffer);
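A quick round-trip sketch (assuming the card simply stores the bytes verbatim; the sample text is made up):

byte[] buffer = Encoding.UTF8.GetBytes("amanhã, você");   // write these bytes to the card
string result = Encoding.UTF8.GetString(buffer);          // read the same bytes back
Console.WriteLine(result);                                 // prints: amanhã, você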

Before reading a file, do I have to check ANSI encoding?

I'm reading some CSV files. The files are really simple, because the separator is always ";" and there are no ", ', or anything like that.
So it's possible to read the file line by line and split the strings, and that works fine. Now people have told me: maybe you should check the encoding of the file; it should always be ANSI, and if it's not, your output may be different and corrupted. So non-ANSI files should be flagged somehow.
I just said okay! But thinking about it: do I really have to check the file's encoding in this case? I changed the encoding of the file to something else and I'm still able to read it without any problems. My code is simple:
using (TextReader reader = new StreamReader(myFileStream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // read the line, separate by ';' and other stuff...
    }
}
So again: do I really need to check the files for ANSI encoding? Could somebody give me an example of when I could get into trouble, or when I would get corrupted output after reading a non-ANSI file? Thank you!
That particular constructor of StreamReader will assume that the data is UTF-8; that is compatible with ASCII, but can fail if data uses bytes in the 128-255 range for single-byte codepages (you'll get the wrong characters in strings, etc), or could fail completely (i.e. throw an exception) if the data is actually something very different like UTF-7, UTF-32, etc.
In some cases (the minority) you might be able to use the byte-order-mark to detect the encoding, but this is a circular problem: in most cases, if you don't already know the encoding, you can't really detect the encoding (robustly). So a better approach would be: to know the encoding in the first place. Then you can pass in the correct encoding to use via one of the other constructors.
Here's an example of it failing:
// we'll write UTF-32, big-endian, without a byte-order-mark
File.WriteAllText("my.txt", "Hello world", new UTF32Encoding(true, false));
using (var reader = new StreamReader("my.txt"))
{
    string s = reader.ReadLine();
}
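And, assuming you do know the encoding up front, the fix is simply to pass it to the constructor (a sketch based on the failing example above):

// Pass the known encoding explicitly so StreamReader doesn't assume UTF-8.
using (var reader = new StreamReader("my.txt", new UTF32Encoding(true, false)))
{
    string s = reader.ReadLine();   // now decodes "Hello world" correctly
}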
You can stay with the default UTF-8 encoding, because UTF-8 has a wonderful property: it supports ASCII characters with a single byte (as you'd expect), and expands to multiple bytes only when it needs to represent other Unicode characters.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Handling non-English characters in C#

I need to get my understanding of character sets and encoding right. Can someone point me to a good write-up on handling different character sets in C#?
Here's one of the problems I'm facing -
using (StreamReader reader = new StreamReader("input.txt"))
using (StreamWriter writer = new StreamWriter("output.txt"))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
This simple code snippet does not always preserve the encoding -
For example -
Aukéna in the input is turned into Auk�na in the output.
You just have an encoding problem. You have to remember that all you're really reading is a stream of bits. You have to tell your program how to properly interpret those bits.
To fix your problem, just use the constructors that take an encoding as well, and set it to whatever encoding your text uses.
http://msdn.microsoft.com/en-us/library/ms143456.aspx
http://msdn.microsoft.com/en-us/library/3aadshsx.aspx
I guess when reading a file, you should know which encoding the file has. Otherwise you can easily fail to read it correctly.
When you know the encoding of a file, you may do the following:
using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.GetEncoding(1251)))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
Another question comes up, if you want to change the original encoding of a file.
The following article may give you a good basis of what encodings are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
And here is an MSDN article from which you could start:
Encoding Class
StreamReader.ReadLine() attempts to read the file using UTF-8 encoding. If that's not the format your file uses, StreamReader will not read the characters correctly.
This article details the problem and suggests passing System.Text.Encoding.Default to the constructor.
You could always create your own parser. What I use is:
var ANSI = (Encoding)Encoding.GetEncoding(1252).Clone();
ANSI.EncoderFallback = new EncoderReplacementFallback(string.Empty);
The first line of this creates a clone of the Win-1252 encoding (as the database I deal with works with Win-1252, you'd probably want to use UTF-8 or ASCII). The second line - when parsing characters - returns an empty string if there is no equivalent to the original character.
After this you'd preferably want to filter out all control characters (excluding tabs, spaces, line feeds and carriage returns, depending on what you need).
Below is my personal encoding-parser which I set up to correct data entering our database.
private string RetainOnlyPrintableCharacters(char c)
{
    // even if the character comes from a different codepage altogether,
    // if the character exists in 1252 it will be returned in 1252 format.
    var ansiBytes = _ansiEncoding.GetBytes(new char[] { c });
    if (ansiBytes.Any())
    {
        if (ansiBytes.First().In(_printableCharacters))
        {
            return _ansiEncoding.GetString(ansiBytes);
        }
    }
    return string.Empty;
}
_ansiEncoding comes from the var ANSI = (Encoding)Encoding.GetEncoding(1252).Clone(); line above, with the fallback value set.
If ansiBytes is not empty, it means there is an encoding available for the particular character being passed in, so it is compared against a list of all the printable characters; if it is in that list, it is an acceptable character and is returned.
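Note that In() is not a standard BCL method; it is presumably a custom extension method in the author's codebase. Something along these lines would make the snippet compile (an assumption, not the original code; requires System.Linq):

public static class ByteExtensions
{
    // Hypothetical helper: true if the byte appears in the allowed set.
    public static bool In(this byte value, IEnumerable<byte> set)
    {
        return set.Contains(value);
    }
}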

Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this in a round-about manner before here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8 while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all the UTF-8 characters that ANSI also supports and substitute the rest with a ? or something.)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF-8 (and/or the other way round), or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
    Encoding encoding = Encoding.Default;
    String original = String.Empty;
    using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
    {
        original = sr.ReadToEnd();
        encoding = sr.CurrentEncoding;
        sr.Close();
    }
    if (encoding == Encoding.UTF8)
        return original;
    byte[] encBytes = encoding.GetBytes(original);
    byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
    return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
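For completeness, a sketch of the byte-level conversion described here, using File.ReadAllBytes rather than a StreamReader (the file names are placeholders):

// Read the raw bytes, decode them as Windows-1252, then write the text back out as UTF-8.
byte[] raw = File.ReadAllBytes("input-1252.txt");
string text = Encoding.GetEncoding(1252).GetString(raw);
File.WriteAllText("output-utf8.txt", text, Encoding.UTF8);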
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
    string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
    File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well-formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF-8. This happens only some of the time; UTF-8 works the majority of the time. Also, the latest of the text data DOES contain UTF-8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
    const int WindowsCodepage1252 = 1252;
    string text;
    try
    {
        var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, utf8Encoding);
    }
    catch (DecoderFallbackException) // the text is not entirely valid UTF-8; it contains Codepage 1252 bytes that can't be decoded as UTF-8
    {
        var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, codepage1252Encoding);
    }
    return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
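For example, a StreamReader with the same strict UTF-8 behaviour would look something like this (a sketch; path is a placeholder):

var strictUtf8 = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
using (var reader = new StreamReader(path, strictUtf8))
{
    // Throws DecoderFallbackException on the first byte sequence that is not valid UTF-8,
    // so the caller can fall back to Codepage 1252 just like ReadAllTextFromFile above.
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process line...
    }
}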
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
    // If the destination's parent doesn't exist, create it.
    String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
    if (!Directory.Exists(parent))
    {
        Directory.CreateDirectory(parent);
    }
    // Convert the file.
    String tempName = null;
    try
    {
        tempName = Path.GetTempFileName();
        using (StreamReader sr = new StreamReader(sourcePath))
        {
            using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
            {
                int charsRead;
                char[] buffer = new char[128 * 1024];
                while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
                {
                    sw.Write(buffer, 0, charsRead);
                }
            }
        }
        File.Delete(destPath);
        File.Move(tempName, destPath);
    }
    finally
    {
        // guard against Path.GetTempFileName() having thrown before tempName was assigned
        if (tempName != null)
        {
            File.Delete(tempName);
        }
    }
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for my solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have a vCard file (*.vcf) with 200 contacts in one file, in Russian...
I opened it with the vCardOrganizer 2.1 program and then used Split to divide it into 200 files... and what I saw were contacts with messy symbols; the only thing I could read was the numbers :-) ...
Steps (when you do these steps, be patient; sometimes it takes time):
Open the vCard file (my file size was 3 MB) with Notepad.
Then go to Menu - File - Save As... In the window that opens, choose a file name (don't forget to put .vcf) and the encoding - ANSI or UTF-8... and finally click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI) - nothing was lost and the Russian text is perfectly readable... if you have questions, write to: yoshidakatana#gmail.com
Good luck!!!

File.Copy and character encoding

I noticed a strange behaviour of File.Copy() in .NET 3.5SP1. I don't know if that's a bug or a feature. But I know it's driving me crazy. We use File.Copy() in a custom build step, and it screws up the character encoding.
When I copy an ASCII-encoded text file over a UTF-8 encoded text file, the destination file still is UTF-8 encoded, but has the content of the new file plus the 3 prefix characters for UTF-8. That's fine for ASCII characters, but incorrect for the remaining characters (128-255) of the ANSI code page.
Here's the code to reproduce. I first copy a UTF-8 file to the destination, then I copy an ANSI file to the same destination. Notice the second console output: Content of copy.txt : this is ASCII encoded: / Encoding: utf-8
File.WriteAllText("ANSI.txt", "this is ANSI encoded: é", Encoding.GetEncoding(0));
File.WriteAllText("UTF8.txt", "this is UTF8 encoded: é", Encoding.UTF8);
File.Copy("UTF8.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
reader.CurrentEncoding.BodyName);
}
File.Copy("ANSI.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
reader.CurrentEncoding.BodyName);
}
Any ideas why this happens? Is there a mistake in my code? Any ideas how to fix this (my current idea is to delete the file beforehand if it exists)?
EDIT: correct ANSI/ASCII confusion
Where are you writing ASCII.txt? You're writing ANSI.txt in the first line, but that's certainly not ASCII as ASCII doesn't contain any accented characters. The ANSI file won't contain any preamble indicating that it's ANSI rather than ASCII or UTF-8.
You seem to have changed your mind between ASCII and ANSI half way through writing the example, basically.
I'd expect any ASCII file to be "detected" as UTF-8, but the encoding detection relies on the file having a byte order mark for it to be anything other than UTF-8. It's not like it reads the whole file and then guesses at what the encoding is.
From the docs for StreamReader:
This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer to the default size.
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
Now File.Copy is just copying the raw bytes from place to place - it shouldn't change anything in terms of character encodings, because it doesn't try to interpret the file as a text file in the first place.
It's not quite clear to me where you see a problem (partly due to the ANSI/ASCII part). I suggest you separate out the issues of "does File.Copy change things?" and "what character encoding is detected by StreamReader?" in both your mind and your question. The answers should be:
File.Copy shouldn't change the contents of the file at all
StreamReader can only detect UTF-8 and UTF-16; if you need to read a file encoded with any other encoding, you should state that explicitly in the constructor. (I would personally recommend using Encoding.Default instead of Encoding.GetEncoding(0) by the way. I think it's clearer.)
I doubt this has anything to do with File.Copy. I think what you're seeing is that StreamReader uses UTF-8 by default to decode, and since UTF-8 is backwards compatible with ASCII, StreamReader never has any reason to stop using UTF-8 to read the ANSI-encoded file.
If you open ASCII.txt and copy.txt in a hex editor, are they identical?
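A quick way to check that in code rather than a hex editor (a sketch; requires System.Linq, and uses the file names from the code in the question):

byte[] a = File.ReadAllBytes("ANSI.txt");
byte[] b = File.ReadAllBytes("copy.txt");
// File.Copy is a raw byte copy, so after copying ANSI.txt over copy.txt these should be identical.
Console.WriteLine(a.SequenceEqual(b) ? "Identical bytes" : "Bytes differ");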
