I'm opening a text file and removing the first line to prepare it for import into a database using bulk insert. Here is my code:
string tempFile = Path.GetTempFileName();
using (var sr = new StreamReader("F:\\Upload\\File.txt", System.Text.Encoding.UTF8))
{
    using (var sw = new StreamWriter(tempFile, true, System.Text.Encoding.UTF8))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            if (line.Substring(0, 8) != "Nr. Crt.")
                sw.WriteLine(line);
        }
    }
}
System.IO.File.Delete("F:\\Upload\\File.txt");
System.IO.File.Move(tempFile, "F:\\Upload\\File.txt");
After this, if I open the resulting file, Unicode characters are replaced with other characters. For example, strings containing a non-breaking space (Unicode U+00A0), such as Value followed by the non-breaking space, are transformed into Value�.
How can I avoid this?
Edit:
Notepad++ is set to 'Encode in UTF-8'
Here is a picture of how it looks: the affected strings are transformed into Value�.
The byte values for those three odd characters are 0xEF 0xBF 0xBD, which is the UTF-8 encoding of code point U+FFFD, the replacement character �. It is substituted when UTF-8 encoded text is read and the input contains an invalid byte sequence.
That points squarely at an issue with File.txt: it was probably not encoded in UTF-8. If you have no idea what encoding was used for that file, then the first guess is to pass Encoding.Default to the StreamReader constructor.
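For instance, a minimal sketch of that change applied to the loop from the question (the path and the "Nr. Crt." check come from the original code; StartsWith is used here instead of Substring so short lines don't throw, and the temp file is opened for overwrite rather than append):
string tempFile = Path.GetTempFileName();
using (var sr = new StreamReader("F:\\Upload\\File.txt", System.Text.Encoding.Default))
using (var sw = new StreamWriter(tempFile, false, System.Text.Encoding.UTF8))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // StartsWith avoids the exception Substring(0, 8) would throw on lines shorter than 8 characters.
        if (!line.StartsWith("Nr. Crt."))
            sw.WriteLine(line);
    }
}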
It looks to me like it is writing fine, but the tool you are reading with is not expecting UTF-8. In many cases, you need to explicitly tell the tool what encoding to expect. However, a common approach is to prepend a BOM ("byte order mark"). This is simple: just use new UTF8Encoding(true) as the encoding and it will happen automatically. In tools that don't expect a BOM this will display as a few mangled characters at the start, but most modern tools will know what it means and will switch to UTF-8 automatically. The point is: the BOMs for UTF-8, UTF-16 LE, UTF-16 BE etc. are all slightly different, but recognisable. A more complete list is on Wikipedia.
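A sketch of that suggestion, reusing the tempFile path from the question's code:
using (var sw = new StreamWriter(tempFile, false, new System.Text.UTF8Encoding(true)))
{
    // The encoder writes the UTF-8 BOM (EF BB BF) at the start of the file automatically.
    sw.WriteLine("some text");
}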
Related
I am using the below code to read a CSV file:
using (StreamReader readfile = new StreamReader(FilePath, Encoding.GetEncoding("iso-8859-1")))
{
// some code will go here
}
There is a character œ in a column of the CSV file, which is getting converted to ? in the output. How can I get this œ decoded correctly so that in the output I get the same œ character, not a question mark?
This is an encoding problem. Many non-Unicode encodings are incomplete and translate unsupported characters to "?", or behave subtly differently on different platforms. Consider using UTF-8 or UTF-16 as the default, at least if you can.
"windows-1252" is a superset of "ISO-8859-1". Try with Encoding.GetEncoding(1252).
Demo:
public static void Main()
{
    System.IO.File.AppendAllText("test", "œ", System.Text.Encoding.GetEncoding(1252));
    var content = System.IO.File.ReadAllText("test", System.Text.Encoding.GetEncoding(1252));
    Console.WriteLine(content);
}
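Note that on .NET Core and .NET 5 or later, the Windows code page encodings are not available out of the box, so Encoding.GetEncoding(1252) throws unless the provider from the System.Text.Encoding.CodePages package is registered first:
// Register once at startup, before any call to Encoding.GetEncoding(1252).
System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);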
The ISO-8859-15 character set contains those symbols, as does the Windows-1252 code page. However, be aware that 8859-15 redefines a handful of rarely used (or ASCII-duplicate) characters that are in 8859-1, and so does Windows-1252. A quick web search will reveal those differences.
I have a string with a few characters in Thai. This string uses Unicode characters, but I don't see the Thai characters in the IDE, or even if I write the string to a text file. If I want to see the Thai characters properly, I have to write the following code:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
After applying the above logic, I can see the string correctly with Thai characters. Here is the output:
// notice the thai character เดี่ยว in the string
M_M-150 150CC. เดี่ยว (2 For 18 Save 2)
I am not sure why I need to apply the above logic to see the Thai characters, even though the string is Unicode.
What exactly is Encoding.Default doing in this case?
From MSDN, here is what the documentation says about the Encoding.Default property:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
The string's bytes are produced by Encoding.Default, but they are then decoded using UTF8. So the key part is not Encoding.Default; it's Encoding.UTF8, which takes the bytes and converts them back to a string correctly.
The same applies if you try to print it in the console. Take a look at both cases in the screenshot: the second line is printed with the UTF-8 configuration. You can configure your console to support UTF-8 by adding this line:
Console.OutputEncoding = Encoding.UTF8;
Even with your code, the result in a file will look like this:
Whereas, when converting the string to bytes with Encoding.UTF8:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
the result will be :
If you take a look at Supported Scripts, you'll see that UTF-8 supports all Unicode characters, including Thai.
Note that Encoding.Default will not be able to handle Chinese or Japanese, for example. Take this example:
var text = "漢字";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
Here is the output in a text file: if you try to write it out this way, it will not be converted successfully. So you have to read and write it using UTF-8:
var text = "漢字";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
and you'll get the expected characters. So, as I said, the whole process depends on UTF-8, not the Default encoding.
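To make that concrete, here is a small sketch (the file name is only illustrative) that writes and reads the text with an explicit UTF-8 encoding, so no round trip through Encoding.Default is needed:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
File.WriteAllText("thai.txt", text, Encoding.UTF8);                // write with an explicit encoding
string roundTripped = File.ReadAllText("thai.txt", Encoding.UTF8); // read with the same encoding
Console.OutputEncoding = Encoding.UTF8;                            // so the console can display Thai
Console.WriteLine(roundTripped);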
I need to get my understanding of character sets and encodings right. Can someone point me to a good write-up on handling different character sets in C#?
Here's one of the problems I'm facing -
using (StreamReader reader = new StreamReader("input.txt"))
using (StreamWriter writer = new StreamWriter("output.txt"))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
This simple code snippet does not always preserve the encoding. For example, Aukéna in the input is turned into Auk�na in the output.
You just have an encoding problem. You have to remember that all you're really reading is a stream of bits. You have to tell your program how to properly interpret those bits.
To fix your problem, just use the constructors that take an encoding as well, and set it to whatever encoding your text uses.
http://msdn.microsoft.com/en-us/library/ms143456.aspx
http://msdn.microsoft.com/en-us/library/3aadshsx.aspx
I guess when reading a file, you should know which encoding the file has. Otherwise you can easily fail to read it correctly.
When you know the encoding of a file, you may do the following:
using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.GetEncoding(1251)))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
Another question comes up if you want to change the original encoding of a file.
The following article may give you a good foundation on what encodings are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
And this is a link msdn article, from which you could start:
Encoding Class
StreamReader.ReadLine() attempts to read the file using UTF-8 encoding by default. If that's not the format your file uses, StreamReader will not read the characters correctly.
This article details the problem and suggests passing this encoding to the constructor: System.Text.Encoding.Default.
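In other words, something along these lines (the file name is just an example):
using (var reader = new StreamReader("input.txt", System.Text.Encoding.Default))
{
    string firstLine = reader.ReadLine(); // decoded with the system's ANSI code page instead of UTF-8
}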
You could always create your own parser. What I use is:
var ANSI = (Encoding) Encoding.GetEncoding(1252).Clone();
ANSI.EncoderFallback = new EncoderReplacementFallback(string.Empty);
The first line creates a clone of the Windows-1252 encoding (the database I deal with works with Windows-1252; you'd probably want UTF-8 or ASCII instead). The second line makes the encoder return an empty string when it encounters a character that has no equivalent in the target code page.
After this, you'd preferably want to filter out all control characters (keeping tabs, spaces, line feeds and carriage returns, depending on what you need).
Below is my personal encoding-parser which I set up to correct data entering our database.
private string RetainOnlyPrintableCharacters(char c)
{
    // Even if the character comes from a different code page altogether,
    // if the character exists in 1252 it will be returned in 1252 format.
    var ansiBytes = _ansiEncoding.GetBytes(new char[] { c });
    if (ansiBytes.Any())
    {
        if (ansiBytes.First().In(_printableCharacters))
        {
            return _ansiEncoding.GetString(ansiBytes);
        }
    }
    return string.Empty;
}
_ansiEncoding comes from the var ANSI = (Encoding)Encoding.GetEncoding(1252).Clone(); line above, with the fallback value set.
If ansiBytes is not empty, it means an encoding is available for the character being passed in; its byte is then compared against a list of all printable characters, and if it is found there, the character is acceptable and is returned.
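For reference, the snippet above assumes a _printableCharacters collection and an In extension method that are not shown. A minimal sketch of what they might look like (these definitions are illustrative guesses, not the author's actual code; the usual System.Linq, System.Text and System.Collections.Generic usings are assumed):
// Illustrative only: byte values treated as printable (0x20..0xFF plus tab, LF, CR).
private static readonly HashSet<byte> _printableCharacters =
    new HashSet<byte>(Enumerable.Range(0x20, 0xE0).Select(i => (byte)i)
                                .Concat(new byte[] { 0x09, 0x0A, 0x0D }));

// The cloned Windows-1252 encoding with the empty-string fallback shown earlier.
private static readonly Encoding _ansiEncoding = CreateAnsiEncoding();

private static Encoding CreateAnsiEncoding()
{
    var ansi = (Encoding)Encoding.GetEncoding(1252).Clone();
    ansi.EncoderFallback = new EncoderReplacementFallback(string.Empty);
    return ansi;
}

// An "In" extension of the kind the snippet appears to rely on.
public static class ByteExtensions
{
    public static bool In(this byte value, IEnumerable<byte> set) => set.Contains(value);
}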
I've asked this in a round-about manner before here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8 while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all the characters that ANSI supports and substitute the rest with a ? or something.)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem: converting ANSI to UTF-8 (and/or the other way round), or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for byte values that are specific to either encoding. If the file contains no special characters, either encoding will work, as the characters 32..127 are the same in both encodings.
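For example, one simple heuristic (a sketch that assumes the only two candidates are UTF-8 and Windows-1252, as in this question): try a strict UTF-8 decode first and fall back to 1252 if it fails:
static Encoding GuessEncoding(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    // throwOnInvalidBytes: true makes GetString throw on malformed UTF-8 sequences.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(bytes);
        return Encoding.UTF8;
    }
    catch (DecoderFallbackException)
    {
        return Encoding.GetEncoding(1252); // "ANSI" in the sense used by this question
    }
}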
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
    Encoding encoding = Encoding.Default;
    String original = String.Empty;
    using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
    {
        original = sr.ReadToEnd();
        encoding = sr.CurrentEncoding;
        sr.Close();
    }
    if (encoding == Encoding.UTF8)
        return original;
    byte[] encBytes = encoding.GetBytes(original);
    byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
    return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in UTF-8 as demanded by the spec in chapter 3.4. You need to take this seriously; the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters, then work from the assumption that this is a bug in that app, or, more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems, because your version of the card will never match the original.
You convert from 1252 to UTF-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method; that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
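If you do need the raw byte-level conversion described here, it looks roughly like this (the file names are illustrative only):
byte[] ansiBytes = File.ReadAllBytes("contacts-1252.vcf");              // read raw bytes, no decoding guesswork
string text = Encoding.GetEncoding(1252).GetString(ansiBytes);          // decode as Windows-1252
File.WriteAllBytes("contacts-utf8.vcf", Encoding.UTF8.GetBytes(text));  // re-encode as UTF-8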
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
    string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
    File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well-formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF-8. This happens only some of the time; UTF-8 works the majority of the time. Also, the latest of the text data DOES contain UTF-8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
    const int WindowsCodepage1252 = 1252;
    string text;
    try
    {
        var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, utf8Encoding);
    }
    catch (DecoderFallbackException) // the text is not entirely valid UTF-8; it contains Codepage 1252 bytes that can't be decoded as UTF-8
    {
        var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, codepage1252Encoding);
    }
    return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
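For instance, a reader that throws instead of silently substituting characters might look like this (a sketch, not code from the answer above; file is a path string as in the method earlier):
var strictUtf8 = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
using (var reader = new StreamReader(file, strictUtf8))
{
    // ReadLine/ReadToEnd will now throw DecoderFallbackException on invalid UTF-8 input.
    string firstLine = reader.ReadLine();
}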
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
    // If the destination's parent doesn't exist, create it.
    String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
    if (!Directory.Exists(parent))
    {
        Directory.CreateDirectory(parent);
    }
    // Convert the file.
    String tempName = null;
    try
    {
        tempName = Path.GetTempFileName();
        using (StreamReader sr = new StreamReader(sourcePath))
        {
            using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
            {
                int charsRead;
                char[] buffer = new char[128 * 1024];
                while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
                {
                    sw.Write(buffer, 0, charsRead);
                }
            }
        }
        File.Delete(destPath);
        File.Move(tempName, destPath);
    }
    finally
    {
        File.Delete(tempName);
    }
}
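One thing worth noting about the snippet above: the StreamReader is constructed without an encoding, so it defaults to UTF-8 with BOM detection. If the source file is actually in an ANSI code page, pass that encoding explicitly, for example (the code page number is only an example):
using (StreamReader sr = new StreamReader(sourcePath, Encoding.GetEncoding(1252))) // source is Windows-1252
{
    // ... same copy loop as above ...
}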
Isn't there an industry standard for vCard files' character encoding?
Which is easier for my solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have a vCard file (*.vcf) with 200 contacts in one file, in Russian...
I opened it with the vCardOrganizer 2.1 program, then used Split to divide it into 200 files, and what I saw were contacts with messy symbols; the only thing I could read was the numbers :-) ...
Steps (when you do these steps, be patient; sometimes it takes time):
Open the vCard file (my file size was 3 MB) with Notepad.
Then go to the menu, File > Save As...; in the dialog that opens, choose a file name (don't forget the .vcf extension) and an encoding, ANSI or UTF-8, and finally click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI); nothing was lost and the Russian text is perfectly readable. If you have questions, write to: yoshidakatana#gmail.com
Good Luck !!!
I have an ANSI-encoded file, and I want to convert the lines I read from the file to ASCII.
How do I go about doing this in C#?
EDIT: What if I used BinaryReader?
BinaryReader reader = new BinaryReader(input, Encoding.Default);
But this reader takes (Stream, Encoding), and Stream is abstract! Where should I put the path of the file it will read from?
A direct conversion from ANSI to ASCII might not always be possible, since ANSI is a superset of ASCII.
You can try converting to UTF-8 using Encoding, though:
Encoding ANSI = Encoding.GetEncoding(1252);
byte[] ansiBytes = ANSI.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert(ANSI, Encoding.UTF8, ansiBytes);
String utf8String = Encoding.UTF8.GetString(utf8Bytes);
Of course you can replace UTF8 with ASCII, but that doesn't really make sense, since:
if the original string doesn't contain any byte above 127, then it's already ASCII
if the original string does contain one or more bytes above 127, then those characters will be lost
UPDATE:
In response to the updated question, you can use BinaryReader like this:
BinaryReader reader = new BinaryReader(File.Open("foo.txt", FileMode.Open),
Encoding.GetEncoding(1252));
Basically, you need to specify an Encoding when reading/writing the file. For example:
// read with the **local** system default ANSI page
string text = File.ReadAllText(path, Encoding.Default);
// ** I'm not sure you need to do this next bit - it sounds like
// you just want to read it? **
// write as ASCII (if you want to do this)
File.WriteAllText(path2, text, Encoding.ASCII);
Note that once you have read it, text is actually Unicode while in memory.
You can choose different code-pages using Encoding.GetEncoding.
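For example (the code page choices below are only illustrative):
string latin1Text = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1")); // Latin-1
string cp1252Text = File.ReadAllText(path, Encoding.GetEncoding(1252));         // Windows-1252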