Handling non-English characters in C#

I need to get my understanding of character sets and encoding right. Can someone point me to a good write-up on handling different character sets in C#?
Here's one of the problems I'm facing -
using (StreamReader reader = new StreamReader("input.txt"))
using (StreamWriter writer = new StreamWriter("output.txt"))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
This simple code snippet does not always preserve the encoding -
For example -
Aukéna in the input is turned into Auk�na in the output.

You just have an encoding problem. You have to remember that all you're really reading is a stream of bits. You have to tell your program how to properly interpret those bits.
To fix your problem, just use the constructors that take an encoding as well, and set it to whatever encoding your text uses.
http://msdn.microsoft.com/en-us/library/ms143456.aspx
http://msdn.microsoft.com/en-us/library/3aadshsx.aspx

I guess when reading a file, you should know which encoding the file has. Otherwise you can easily fail to read it correctly.
When you know the encoding of a file, you may do the following:
using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.GetEncoding(1251)))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
A different question arises if you want to change a file's original encoding.
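If you do need to do that, the pattern is the same, just with two different encodings; a minimal sketch, assuming the source really is Windows-1251 and you want UTF-8 output (both encodings here are only examples):
    // Read with the file's original encoding, write with the new one.
    using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
    using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }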
The following article may give you a good basis of what encodings are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
And this is an MSDN article you could start from:
Encoding Class

StreamReader.ReadLine() attempts to read the file using UTF-8 encoding by default. If that's not the format your file uses, StreamReader will not read the characters correctly.
This article details the problem and suggests passing System.Text.Encoding.Default to the constructor.

You could always create your own parser. What I use is:
var ANSI = (Encoding) Encoding.GetEncoding(1252).Clone();
ANSI.EncoderFallback = new EncoderReplacementFallback(string.Empty);
The first line creates a clone of the Windows-1252 encoding (I use Windows-1252 because that's what the database I deal with works with; you'd probably want UTF-8 or ASCII instead). The second line tells the encoder to return an empty string, when parsing characters, if there is no equivalent to the original character.
After this you'd preferably want to filter out all control characters (excluding tabs, spaces, line feeds and carriage returns, depending on what you need).
Below is my personal encoding-parser which I set up to correct data entering our database.
private string RetainOnlyPrintableCharacters(char c)
{
    // Even if the character comes from a different codepage altogether,
    // if the character exists in 1252 it will be returned in 1252 format.
    var ansiBytes = _ansiEncoding.GetBytes(new char[] { c });
    if (ansiBytes.Any())
    {
        // .In() is a custom membership test against the set of allowed byte values.
        if (ansiBytes.First().In(_printableCharacters))
        {
            return _ansiEncoding.GetString(ansiBytes);
        }
    }
    return string.Empty;
}
_ansiEncoding is the var ANSI = (Encoding) Encoding.GetEncoding(1252).Clone(); from above, with the replacement fallback set.
If ansiBytes is not empty, it means there is a 1252 representation of the character being passed in. That byte is then compared against a list of all the printable characters, and if it is in the list, the character is acceptable and is returned.
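For context, here is a rough sketch of how the surrounding pieces might be wired up. The field names match the answer, but the exact byte ranges, the CleanString wrapper, and the reading of .In() as a simple membership test are my assumptions, not the answer's own code:
    // Requires System.Collections.Generic, System.Linq and System.Text.
    // Clone of Windows-1252 that silently drops characters it cannot encode.
    private static readonly Encoding _ansiEncoding = BuildAnsiEncoding();

    private static Encoding BuildAnsiEncoding()
    {
        var ansi = (Encoding)Encoding.GetEncoding(1252).Clone();
        ansi.EncoderFallback = new EncoderReplacementFallback(string.Empty);
        return ansi;
    }

    // Bytes treated as printable here: tab, LF, CR and everything from space (0x20) upward.
    // The .In() call above is presumably equivalent to _printableCharacters.Contains(b).
    private static readonly HashSet<byte> _printableCharacters =
        new HashSet<byte>(Enumerable.Range(0x20, 0xE0).Select(i => (byte)i)) { 0x09, 0x0A, 0x0D };

    // Apply the per-character filter to a whole string.
    private string CleanString(string input)
    {
        return string.Concat(input.Select(RetainOnlyPrintableCharacters));
    }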

Related

Encoding "œ" with iso-8859-1 in C#

I am using the code below to read a CSV file:
using (StreamReader readfile = new StreamReader(FilePath, Encoding.GetEncoding("iso-8859-1")))
{
// some code will go here
}
There is a character œ in a column of the CSV file, which is getting converted to ? in the output. How can I get this œ read correctly so that the output contains the same œ character and not a question mark?
This is an encoding problem. Many non-Unicode encodings are either incomplete and translate many characters to "?", or have subtly different behavior on different platforms. Consider using UTF-8 or UTF-16 as the default. At least, if you can.
"windows-1252" is a superset of "ISO-8859-1". Try with Encoding.GetEncoding(1252).
Demo:
public static void Main()
{
    System.IO.File.AppendAllText("test", "œ", System.Text.Encoding.GetEncoding(1252));
    var content = System.IO.File.ReadAllText("test", System.Text.Encoding.GetEncoding(1252));
    Console.WriteLine(content);
}
The iso-8859-15 character set contains those symbols, as does the Windows-1252 codepage. However, be aware that 8859-15 redefines six other rarely used (or ASCII-duplicate) characters that are in 8859-1, and Windows-1252 also differs from 8859-1. A quick web search will reveal those differences.
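If you go the ISO-8859-15 route instead, only the encoding name you pass in changes; a minimal sketch reusing the question's FilePath:
    // "œ" exists in both Windows-1252 and ISO-8859-15, but at different byte values,
    // so pick whichever one the file was actually written with.
    using (StreamReader readfile = new StreamReader(FilePath, Encoding.GetEncoding("iso-8859-15")))
    {
        string content = readfile.ReadToEnd();
    }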

Detect special symbols in c#

I'm working on a c# project in which some data contains characters which are not recognised by the encoding.
They are displayed like that:
"Some text � with special � symbols in it".
I have no control over the encoding process; the data also comes from files of various origins and various formats.
I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:
if (myString.Contains("�"))
{
    // Do stuff
}
While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains function. Isn't there a cleaner way to do this?
EDIT:
After checking back with the team responsible for reading the files, this is how they do it:
var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();
Passing true as a second parameter of StreamReader is supposed to detect the encoding from the file's BOM, and use it to read the content. It doesn't always work though, as some files don't bear that information, hence why their data is read incorrectly.
We've made some tests and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all files we had issues with. Expectedly, files that were working before no longer work because they do not use the default encoding.
So the best solution for us would be to do the following: read the file trying to detect its encoding, then if it wasn't successful read it again with the default encoding.
The problem remains the same though: how do we check, after trying to detect the file's encoding, if data has been read incorrectly ?
The � character is not a special symbol. It's the Unicode Replacement Character. It means the code tried to decode the bytes using the wrong encoding; any byte sequences that had no match in that encoding were replaced with �.
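As an aside, if you do keep that Contains check as a sanity test, you can at least refer to the replacement character by its escape sequence instead of pasting the glyph into the source; a small sketch:
    // "\uFFFD" is the Unicode replacement character that renders as �.
    if (myString.Contains("\uFFFD"))
    {
        // flag the record as suspect
    }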
The solution is to read the file using the correct encoding. The default encoding used by the File methods or StreamReader is UTF-8. You can pass a different encoding using the appropriate constructor, e.g. StreamReader(Stream, Encoding, Boolean). To use the system locale's codepage, you need to use Encoding.Default:
var sr = new StreamReader(filePath,Encoding.Default);
You can use the StreamReader(Stream, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fallback to a different encoding.
Assuming the files are either some type of Unicode or match your system locale, you can use:
var sr = new StreamReader(filePath,Encoding.Default, true);
StreamReader's source shows that the DetectEncoding method checks the first bytes of a file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO because the method works on the class's internal buffer.
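One way to flag files that were read incorrectly, as asked in the edit, is to make the decoder throw instead of silently substituting characters. This is a sketch of the idea, not this answer's own code:
    // A strict UTF-8 decoder raises DecoderFallbackException on byte sequences
    // that are not valid UTF-8, instead of replacing them with �.
    var strictUtf8 = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
    try
    {
        using (var sr = new StreamReader(filePath, strictUtf8, true)) // true = still honour a BOM if present
        {
            var content = sr.ReadToEnd();
        }
    }
    catch (DecoderFallbackException)
    {
        // Not valid UTF-8 (and no usable BOM) - re-read with Encoding.Default or flag the file.
    }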
EDIT
I just realized you can't actually load the raw file into a .NET string and still retain full information about the original bytes.
The project linked here uses the MLang API, which does a better job because it doesn't load the file into a .NET string before guessing. There is also a related SO question.

Before reading a file, do I have to check ANSI encoding?

I'm reading some CSV files. The files are really easy, because there is always just ";" as the separator and there are no ", ', or anything like that.
So it's possible to read the file line by line and separate the strings. That's working fine. Now people have told me: maybe you should check the encoding of the file; it should always be ANSI, and if it's not, your output may be different and corrupted. So non-ANSI files should be marked somehow.
I just said, okay! But if I think about it: do I really have to check the file's encoding in this case? I just changed the encoding of the file to something else and I'm still able to read the file without any problems. My code is simple:
string line;
using (TextReader reader = new StreamReader(myFileStream))
{
    while ((line = reader.ReadLine()) != null)
    {
        // read the line, separate by ; and other stuff...
    }
}
So again: do I really need to check whether the files are ANSI encoded? Could somebody give me an example of when I could get into trouble, or when I would get corrupted output, after reading a non-ANSI file? Thank you!
That particular constructor of StreamReader will assume that the data is UTF-8; that is compatible with ASCII, but can fail if data uses bytes in the 128-255 range for single-byte codepages (you'll get the wrong characters in strings, etc), or could fail completely (i.e. throw an exception) if the data is actually something very different like UTF-7, UTF-32, etc.
In some cases (the minority) you might be able to use the byte-order-mark to detect the encoding, but this is a circular problem: in most cases, if you don't already know the encoding, you can't really detect the encoding (robustly). So a better approach would be: to know the encoding in the first place. Then you can pass in the correct encoding to use via one of the other constructors.
Here's an example of it failing:
// we'll write UTF-32, big-endian, without a byte-order-mark
File.WriteAllText("my.txt", "Hello world", new UTF32Encoding(true, false));
using (var reader = new StreamReader("my.txt"))
{
    string s = reader.ReadLine();
}
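And the matching fix is simply to pass the same encoding back in when reading; a short sketch:
    // Reading the file back with the encoding it was actually written in works as expected.
    using (var reader = new StreamReader("my.txt", new UTF32Encoding(true, false)))
    {
        string s = reader.ReadLine(); // "Hello world"
    }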
You can read it with UTF-8 encoding, because UTF-8 has a wonderful property: it supports ASCII characters with a single byte (as you would expect) and expands to multiple bytes only when it needs to represent other Unicode characters.
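A tiny illustration of that property (the strings are arbitrary examples):
    // ASCII characters stay one byte each in UTF-8; other characters take more.
    byte[] ascii = Encoding.UTF8.GetBytes("Hello"); // 5 bytes: 48 65 6C 6C 6F
    byte[] accented = Encoding.UTF8.GetBytes("é");  // 2 bytes: C3 A9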
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

.NET : StreamReader does not recognize ° characters

I am trying to run a RegEx to locate degree characters (\u00B0|\u00BA degrees in addition to locating the other form of ' --> \u00B4). I am reading latitude and longitude DMS coordinates like this one: 12º30'23.256547"S
The problem is with the way I am reading the file as I can manually inject a string like the one below (format is latitude, longitude, description):
const string myTestString = @"12º30'23.256547""S, 12º30'23.256547""W, Somewhere";
and my regex matches as expected. I can also see the º values, whereas when I am using the StreamReader I see a � for all unrecognized characters (the º symbol being included as one of those unrecognized characters).
I've tried:
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.Unicode);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.BigEndianUnicode);
in addition to the default ASCII.
Either way I read the file, I end up with these special characters. Any advice would be greatly appreciated!!
You've tried various encodings... but presumably not the right one. You shouldn't just be guessing at encodings - find out what encoding it's really using, and use that. StreamReader itself is absolutely fine. It can deal with any encoding you give it, but it does have to match the encoding used when writing the file out.
Where does the file come from? What has written it out?
If it was written out with Notepad, it may well be using Encoding.Default, which is the system's default encoding (i.e. it will vary from machine to machine). If at all possible, change whatever is creating the file to use a single standard encoding - personally I'm a big fan of UTF-8.
You need to identify what encoding the file was saved in, and use that when you read it with your streamreader.
If it was created using a regular text editor, I'm guessing the default encoding is either Windows-1252 or ISO-8859-1.
The degree symbol is 0xBA in ISO-8859-1 and falls outside the 7-bit ASCII table. I don't know how Encoding.ASCII interprets it.
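For what it's worth, you can check that directly; a quick sketch using 0xBA as the example byte:
    // The same byte decodes differently depending on the encoding you assume.
    byte[] b = { 0xBA };
    Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(b)); // º
    Console.WriteLine(Encoding.ASCII.GetString(b));                     // ? (0xBA is outside 7-bit ASCII)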
Otherwise, it might be easier to just make sure to save the file as UTF-8 if you have that possibility.
The reason it works when you define the string in code is that .NET always works with strings in its internal encoding (UTF-16), so what StreamReader does is convert the bytes it reads from the file into that internal encoding, using the encoding you specify when you create the StreamReader.
You can open the file you are reading in an editor like Notepad++ to see its encoding, and convert it to UTF-8. Then reading it as you are doing,
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
will work. I could read the degree symbol by doing this.

Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this in a round-about manner before here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
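A rough sketch of that byte-level check, assuming the only two candidates are UTF-8 and Windows-1252; GuessEncoding is a hypothetical helper and this is a heuristic, not a general-purpose detector:
    // If the raw bytes decode cleanly as strict UTF-8, treat the file as UTF-8;
    // otherwise assume it is Windows-1252.
    static Encoding GuessEncoding(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(false, true); // throw on invalid bytes
        try
        {
            strictUtf8.GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding(1252);
        }
    }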
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
    Encoding encoding = Encoding.Default;
    String original = String.Empty;
    using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
    {
        original = sr.ReadToEnd();
        encoding = sr.CurrentEncoding;
        sr.Close();
    }
    if (encoding == Encoding.UTF8)
        return original;
    byte[] encBytes = encoding.GetBytes(original);
    byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
    return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
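For completeness, a minimal sketch of that byte-level route (the file names are placeholders):
    // Read the raw bytes, decode them as Windows-1252, then write the text back out as UTF-8.
    byte[] raw = File.ReadAllBytes("contacts-1252.vcf");
    string text = Encoding.GetEncoding(1252).GetString(raw);
    File.WriteAllText("contacts-utf8.vcf", text, Encoding.UTF8);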
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
    string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
    File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding as UTF-8. This happens only some of the time; UTF-8 works the majority of the time. Also, the latest of the text data DOES contain UTF-8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
    const int WindowsCodepage1252 = 1252;
    string text;
    try
    {
        var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, utf8Encoding);
    }
    catch (DecoderFallbackException)
    {
        // The text is not entirely valid UTF-8: it contains Codepage 1252 bytes that can't be decoded as UTF-8.
        var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, codepage1252Encoding);
    }
    return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
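For example, a sketch of wiring the same strict fallbacks into a StreamReader (the file name is a placeholder):
    // The strict encoding makes the reader throw on malformed UTF-8 instead of substituting characters.
    var strictUtf8 = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    using (var reader = new StreamReader("legacy.txt", strictUtf8))
    {
        string text = reader.ReadToEnd();
    }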
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
    // If the destination's parent doesn't exist, create it.
    String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
    if (!Directory.Exists(parent))
    {
        Directory.CreateDirectory(parent);
    }
    // Convert the file.
    String tempName = null;
    try
    {
        tempName = Path.GetTempFileName();
        using (StreamReader sr = new StreamReader(sourcePath))
        {
            using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
            {
                int charsRead;
                char[] buffer = new char[128 * 1024];
                while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
                {
                    sw.Write(buffer, 0, charsRead);
                }
            }
        }
        File.Delete(destPath);
        File.Move(tempName, destPath);
    }
    finally
    {
        File.Delete(tempName);
    }
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have a vCard file (*.vcf) - 200 contacts in one file, in Russian...
I opened it with the vCardOrganizer 2.1 program, then used Split to divide it into 200... and what I saw were contacts with messy symbols; the only thing I could read was the numbers :-) ...
Steps (when you do these steps be patient, sometimes it takes time):
Open the vCard file (my file size was 3 MB) with Notepad.
Then go to the menu - File - Save As... In the window that opens choose a file name, don't forget to put .vcf, and the encoding - ANSI or UTF-8... and finally click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI) - nothing was lost and the Russian text is perfectly readable... if you have questions write to: yoshidakatana#gmail.com
Good Luck !!!
