File.Copy and character encoding


I noticed a strange behaviour of File.Copy() in .NET 3.5SP1. I don't know if that's a bug or a feature. But I know it's driving me crazy. We use File.Copy() in a custom build step, and it screws up the character encoding.
When I copy an ASCII-encoded text file over a UTF-8 encoded text file, the destination file is still UTF-8 encoded, but has the content of the new file plus the 3 prefix bytes (the BOM) for UTF-8. That's fine for ASCII characters, but incorrect for the remaining characters (128-255) of the ANSI code page.
Here's the code to reproduce. I first copy a UTF-8 file to the destination, then I copy an ANSI file to the same destination. Notice the output of the second console output: Content of copy.txt : this is ASCII encoded: / Encoding: utf-8
File.WriteAllText("ANSI.txt", "this is ANSI encoded: é", Encoding.GetEncoding(0));
File.WriteAllText("UTF8.txt", "this is UTF8 encoded: é", Encoding.UTF8);
File.Copy("UTF8.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
    Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
        reader.CurrentEncoding.BodyName);
}
File.Copy("ANSI.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
    Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
        reader.CurrentEncoding.BodyName);
}
Any ideas why this happens? Is there a mistake in my code? Any ideas how to fix this? (My current idea is to delete the destination file first if it exists.)
EDIT: correct ANSI/ASCII confusion

Where are you writing ASCII.txt? You're writing ANSI.txt in the first line, but that's certainly not ASCII as ASCII doesn't contain any accented characters. The ANSI file won't contain any preamble indicating that it's ANSI rather than ASCII or UTF-8.
You seem to have changed your mind between ASCII and ANSI half way through writing the example, basically.
I'd expect any ASCII file to be "detected" as UTF-8, but the encoding detection relies on the file having a byte order mark for it to be anything other than UTF-8. It's not like it reads the whole file and then guesses at what the encoding is.
From the docs for StreamReader:
This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer to the default size.
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
Now File.Copy is just copying the raw bytes from place to place - it shouldn't change anything in terms of character encodings, because it doesn't try to interpret the file as a text file in the first place.
It's not quite clear to me where you see a problem (partly due to the ANSI/ASCII part). I suggest you separate out the issues of "does File.Copy change things?" and "what character encoding is detected by StreamReader?" in both your mind and your question. The answers should be:
File.Copy shouldn't change the contents of the file at all
StreamReader can only detect UTF-8 and UTF-16; if you need to read a file encoded with any other encoding, you should state that explicitly in the constructor. (I would personally recommend using Encoding.Default instead of Encoding.GetEncoding(0) by the way. I think it's clearer.)
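The two points above can be checked separately. The following is a minimal sketch, not the asker's build step: the class and method names are made up for illustration, and ISO-8859-1 is assumed as a stand-in for the machine's ANSI code page so that the result doesn't depend on system settings.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class CopyCheck
{
    // Copies source over destination and reports whether the two files
    // are byte-for-byte identical afterwards.
    public static bool CopyPreservesBytes(string source, string destination)
    {
        File.Copy(source, destination, true);
        return File.ReadAllBytes(source).SequenceEqual(File.ReadAllBytes(destination));
    }

    // Decodes a file with an explicitly stated encoding. BOM detection is
    // left switched on, so a BOM (if present) still takes precedence.
    public static string ReadWithEncoding(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a BOM-less file that contains the byte 0xE9, ReadWithEncoding(path, Encoding.GetEncoding("iso-8859-1")) gives back "é", whereas new StreamReader(path, true) falls back to UTF-8 and produces the replacement character U+FFFD for that byte.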

I doubt this has anything to do with File.Copy. I think what you're seeing is that StreamReader uses UTF-8 by default to decode, and since UTF-8 is backwards compatible with ASCII and the ANSI file has no byte order mark, StreamReader never has any reason to stop using UTF-8 to read the ANSI-encoded file.
If you open ASCII.txt and copy.txt in a hex editor, are they identical?

Related

Read multiple files with different encoding, preserving all characters

I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How can I do this? Do I need to detect the input file's encoding (which seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
I am looking for a way to read the file regardless of which of the two encodings it uses, and write it out correctly without knowing the input file's encoding beforehand.
File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 and ANSI file and writes the output as ANSI with all chars preserved so I'm looking to do this but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt

Q: The following code reads ANSI file and writes output as UTF-8 but there are some gibberish characters "�".
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind it's so hard to do something in C# that the command prompt can do easily
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If the "�" you posted (hex 0xEFBFBD) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The Replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character.
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.
UPDATE:
The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));

C# WriteAllBytes ignores character encoding

I'm using the following code:
File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetBytes("THIS IS A TEST"));
Which should in theory write a UTF8 file, but I just get an ANSI file. I also tried this, just to be especially verbose:
File.WriteAllBytes("c:\\test.xml", ASCIIEncoding.Convert(ASCIIEncoding.ASCII, UTF8Encoding.UTF8, Encoding.UTF8.GetBytes("THIS IS A TEST")));
Still the same issue though.
I am testing the outputted files by loading in TextPad which reads the format correctly (I tested with a sample file as I know these things can be a bit weird sometimes)

WriteAllBytes isn't ignoring the encoding - rather: you already did the encoding, when you called GetBytes. The entire point of WriteAllBytes is that it writes bytes. Bytes don't have an encoding; rather: encoding is the process of converting from text (string here) to bytes (byte[] here).
UTF-8 is identical to ASCII for all ASCII characters - i.e. 0-127. All of "THIS IS A TEST" is pure ASCII, so the UTF-8 and ASCII for that are identical.
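To make that concrete, here is a small sketch (the class and method names are illustrative, not part of any library). It shows that the UTF-8 and ASCII byte sequences for pure-ASCII text are the same, and that the one thing an editor could use to tell the two apart, the UTF-8 preamble, has to be written explicitly when you work with raw bytes.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8BytesDemo
{
    // True when the text encodes to the same bytes in UTF-8 and in ASCII.
    public static bool SameBytesAsAscii(string text)
    {
        return Encoding.UTF8.GetBytes(text).SequenceEqual(Encoding.ASCII.GetBytes(text));
    }

    // Returns the UTF-8 preamble (EF BB BF) followed by the encoded text.
    public static byte[] EncodeWithPreamble(string text)
    {
        return Encoding.UTF8.GetPreamble().Concat(Encoding.UTF8.GetBytes(text)).ToArray();
    }
}
```

Passing the result of Utf8BytesDemo.EncodeWithPreamble("THIS IS A TEST") to File.WriteAllBytes would give an editor a BOM to detect; File.WriteAllText with Encoding.UTF8 writes the same preamble for you.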

Encode text in c# utf-8 without BOM

I tried this but it didn't work. I want to encode without a BOM, but even with the option set to false the output is still UTF-8 with a BOM.
Here is my code
System.Text.Encoding outputEnc = new System.Text.UTF8Encoding(false);
return File(outputEnc.GetBytes(" <?xml version=\"1.0\" encoding=\"utf-8\"?>" + xmlString), "application/xml", id);

This question is more than two years old, but I've found the answer. The reason you were seeing a BOM in the output is because there's a BOM in your input. What appears to be a space at the start of your XML declaration is actually a BOM followed by a space. To prove it, select the text " < from your XML encoding (the opening double-quote, the space following it, and the opening < character) and paste that into any tool that tells you Unicode codepoints. For example, pasting that text into http://www.babelstone.co.uk/Unicode/whatisit.html gave me the following result:
U+0022 : QUOTATION MARK
U+FEFF : ZERO WIDTH NO-BREAK SPACE [ZWNBSP] (alias BYTE ORDER MARK [BOM])
U+0020 : SPACE [SP]
U+003C : LESS-THAN SIGN
You can also copy and paste from the " < that I put in this answer: I copied those characters from your question, so they contain the invisible BOM immediately before the space character.
This is why I often refer to the BOM as a BOM(b) -- because it sits there silently, hidden, waiting to blow up on you when you least expect it. You were using System.Text.UTF8Encoding(false) correctly. It didn't add a BOM, but the source that you copied and pasted your XML from contained a BOM, so you got one in your output anyway because you had one in your input.
Personal rant: It's a very good idea to leave BOMs out of your UTF-8 encoded text. However, some broken tools (Microsoft, I'm looking at you since you're the ones who made most of them) will misinterpret text if it doesn't contain a BOM, so adding a BOM to UTF-8 encoded text is sometimes necessary. But it should really be avoided as much as possible. UTF-8 is now the de facto default encoding for the Internet, so any text file whose encoding is unknown should be parsed as UTF-8 first, falling back to "legacy" encodings like Windows-1252, Latin-1, etc. only if parsing the document as UTF-8 fails.
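One defensive option, sketched here under the assumption that the stray BOM sits at the very start of the string, is to strip U+FEFF before encoding. The helper names are hypothetical and this is not what the asker's code did.

```csharp
using System;
using System.Text;

public static class BomStripper
{
    // Removes any U+FEFF characters from the start of the text.
    public static string StripLeadingBom(string text)
    {
        return text.TrimStart('\uFEFF');
    }

    // Drops a pasted-in leading BOM, then encodes as UTF-8.
    // GetBytes itself never emits a preamble.
    public static byte[] EncodeWithoutBom(string text)
    {
        return new UTF8Encoding(false).GetBytes(StripLeadingBom(text));
    }
}
```

This only handles a BOM at the start of the string; one buried after other characters would still need to be found with a tool like the one mentioned above.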

.NET : StreamReader does not recognize ° characters

I am trying to run a RegEx to locate degree characters (\u00B0|\u00BA degrees in addition to locating the other form of ' --> \u00B4). I am reading latitude and longitude DMS coordinates like this one: 12º30'23.256547"S
The problem is with the way I am reading the file as I can manually inject a string like the one below (format is latitude, longitude, description):
const string myTestString = @"12º30'23.256547""S, 12º30'23.256547""W, Somewhere";
and my regex matches as expected. I can also see the º values there, whereas when I use the StreamReader I see a � for every unrecognized character (the º symbol being one of them).
I've tried:
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.Unicode);
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.BigEndianUnicode);
in addition to the default ASCII.
Either way I read the file, I end up with these special characters. Any advice would be greatly appreciated!!

You've tried various encodings... but presumably not the right one. You shouldn't just be guessing at encodings - find out what encoding it's really using, and use that. StreamReader itself is absolutely fine. It can deal with any encoding you give it, but it does have to match the encoding used when writing the file out.
Where does the file come from? What has written it out?
If it was written out with Notepad, it may well be using Encoding.Default, which is the system's default encoding (i.e. it will vary from machine to machine). If at all possible, change whatever is creating the file to use a single standard encoding - personally I'm a big fan of UTF-8.

You need to identify what encoding the file was saved in, and use that when you read it with your streamreader.
If it was created using a regular text editor, I'm guessing the default encoding is either Windows-1252 or ISO-8859-1.
The º in your example (strictly the masculine ordinal indicator; the true degree sign ° is 0xB0) is 0xBA in ISO-8859-1 and lies outside the 7-bit ASCII table. Encoding.ASCII can't represent it and decodes any byte above 0x7F as "?".
Otherwise, it might be easier to just make sure to save the file as UTF-8 if you have that possibility.
The reason it works when you define the string in code is that .NET always works with strings in its internal encoding (UTF-16), so what StreamReader does is convert the bytes it reads from the file into that internal encoding, using the encoding that you specify when you create the StreamReader.
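A small sketch of the point made in these answers: the same bytes give different strings depending on the encoding used to decode them, and the regex only matches when the decoder was the right one. It assumes the file was saved as ISO-8859-1; the class name and the pattern are illustrative only.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DegreeDemo
{
    // Matches the degree sign (U+00B0) or the masculine ordinal indicator (U+00BA).
    private static readonly Regex DegreeLike = new Regex("[\u00B0\u00BA]");

    // Decodes raw bytes with the given encoding and reports whether a
    // degree-like character is present in the resulting string.
    public static bool ContainsDegree(byte[] raw, Encoding encoding)
    {
        return DegreeLike.IsMatch(encoding.GetString(raw));
    }
}
```

For the bytes 31 32 BA 33 30 27 (the text 12º30' in ISO-8859-1), ContainsDegree returns true with Encoding.GetEncoding("iso-8859-1") and false with Encoding.UTF8, because a lone 0xBA is not valid UTF-8 and is decoded as U+FFFD.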

You can open the file in an editor like Notepad++ to see its encoding and convert it to UTF-8. Then reading it the way you already do,
var sr = new StreamReader(dlg.File.OpenRead(), Encoding.UTF8);
will work. I was able to read the degree symbol this way.

Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this before in a round-about manner here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something.)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows in what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.

You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
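One way to sketch this kind of detection, under the assumption that the only two candidates are UTF-8 and a single-byte ANSI code page, is to test whether the bytes form valid UTF-8. Note that this swaps in a strict UTF-8 decoder rather than scanning for individual character codes by hand; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // True when the bytes decode as UTF-8 without any invalid sequences.
    // Pure ASCII content also returns true, since ASCII is valid UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        // Second argument: throw on invalid bytes instead of substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

This is a heuristic, not proof: a short ANSI file can happen to be valid UTF-8 as well, in which case the check says "UTF-8" and the text is decoded differently from what its author intended.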

This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
    Encoding encoding = Encoding.Default;
    String original = String.Empty;
    // Decode with the system default code page unless a BOM says otherwise.
    using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
    {
        original = sr.ReadToEnd();
        // CurrentEncoding reflects any BOM detected during the read.
        encoding = sr.CurrentEncoding;
    }
    if (encoding.Equals(Encoding.UTF8))
        return original;
    // Re-encode with the source encoding, convert those bytes to UTF-8, then decode again.
    byte[] encBytes = encoding.GetBytes(original);
    byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
    return Encoding.UTF8.GetString(utf8Bytes);
}

VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
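The byte-level route described here might look like the following sketch. The source encoding is passed in as a parameter, since whether it really is code page 1252 is exactly the thing you'd need to know; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class ByteConverter
{
    // Re-encodes raw bytes from the stated source encoding to UTF-8.
    public static byte[] ToUtf8(byte[] source, Encoding sourceEncoding)
    {
        return Encoding.Convert(sourceEncoding, Encoding.UTF8, source);
    }

    // Reads the input as raw bytes (no StreamReader involved) and writes
    // the converted UTF-8 bytes; no BOM is added.
    public static void ConvertFile(string inputPath, string outputPath, Encoding sourceEncoding)
    {
        File.WriteAllBytes(outputPath, ToUtf8(File.ReadAllBytes(inputPath), sourceEncoding));
    }
}
```

On .NET Framework, Encoding.GetEncoding(1252) can be passed in directly. On .NET Core and .NET 5+, code page 1252 is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).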

I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
    string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
    File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}

I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF8. This happens only some of the time, UTF8 works the majority of the time. Also, the latest of the text data DOES contain UTF8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
    const int WindowsCodepage1252 = 1252;
    string text;
    try
    {
        var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, utf8Encoding);
    }
    // The text is not entirely valid UTF8: it contains Codepage 1252 characters that can't be correctly decoded as UTF8.
    catch (DecoderFallbackException)
    {
        var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        text = File.ReadAllText(file, codepage1252Encoding);
    }
    return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.

I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
    // If the destination's parent doesn't exist, create it.
    String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
    if (!Directory.Exists(parent))
    {
        Directory.CreateDirectory(parent);
    }
    // Convert the file.
    String tempName = null;
    try
    {
        tempName = Path.GetTempFileName();
        // Without an explicit encoding, StreamReader decodes as UTF-8 unless a BOM says otherwise.
        using (StreamReader sr = new StreamReader(sourcePath))
        {
            using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
            {
                int charsRead;
                char[] buffer = new char[128 * 1024];
                while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
                {
                    sw.Write(buffer, 0, charsRead);
                }
            }
        }
        File.Delete(destPath);
        File.Move(tempName, destPath);
    }
    finally
    {
        // Guard against GetTempFileName having failed; deleting a file that was already moved is a no-op.
        if (tempName != null)
        {
            File.Delete(tempName);
        }
    }
}

Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have a vCard file (*.vcf) with 200 contacts in one file, in Russian.
I opened it with the vCardOrganizer 2.1 program and used Split to divide it into 200 files. The result was contacts full of garbled symbols; the only thing I could read was the numbers :-)
Steps (be patient when doing these, it sometimes takes a while):
Open the vCard file (mine was 3 MB) with Notepad.
Then choose File > Save As from the menu. In the dialog, enter the file name (don't forget the .vcf extension), choose the encoding (ANSI or UTF-8), and click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI): nothing was lost and the Russian text is perfectly readable. If you have questions, write to: yoshidakatana#gmail.com
Good luck!
