I'm using the following code:
File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetBytes("THIS IS A TEST"))
This should, in theory, write a UTF-8 file, but I just get an ANSI file. I also tried the following, just to be especially verbose:
File.WriteAllBytes("c:\\test.xml", ASCIIEncoding.Convert(ASCIIEncoding.ASCII, UTF8Encoding.UTF8, Encoding.UTF8.GetBytes("THIS IS A TEST")))
Still the same issue though.
I am testing the output files by loading them in TextPad, which reads the format correctly (I tested with a sample file, as I know these things can be a bit weird sometimes).
WriteAllBytes isn't ignoring the encoding; rather, you already did the encoding when you called GetBytes. The entire point of WriteAllBytes is that it writes bytes. Bytes don't have an encoding; encoding is the process of converting from text (string here) to bytes (byte[] here).
UTF-8 is identical to ASCII for all ASCII characters, i.e. 0-127. All of "THIS IS A TEST" is pure ASCII, so the UTF-8 and ASCII bytes for that string are identical, and there is nothing in the file for an editor to recognize as specifically UTF-8.
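Since the bytes are identical, an editor like TextPad has nothing it can use to tell UTF-8 from ANSI. If you want the file to be recognized as UTF-8, the usual trick is to write a byte order mark (BOM) in front of the payload. A minimal sketch (the path is the one from the question):

using System.IO;
using System.Text;

class Demo
{
    static void Main()
    {
        // A BOM-emitting UTF-8 encoding: prepends EF BB BF, which is
        // what most editors key on when labelling a file "UTF-8".
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

        byte[] bom = utf8WithBom.GetPreamble();              // EF BB BF
        byte[] payload = utf8WithBom.GetBytes("THIS IS A TEST");

        byte[] all = new byte[bom.Length + payload.Length];
        bom.CopyTo(all, 0);
        payload.CopyTo(all, bom.Length);
        File.WriteAllBytes("c:\\test.xml", all);

        // Simpler alternative: WriteAllText applies the encoding (and its BOM) for you.
        // File.WriteAllText("c:\\test.xml", "THIS IS A TEST", utf8WithBom);
    }
}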
Related
I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How do I do this? Do I need to detect the input file's encoding (which seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
I am looking for a way to read the file no matter which of the two encodings it uses, and to write it out correctly, without knowing the encoding of the input file beforehand.
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 file and an ANSI file and writes the output as ANSI with all characters preserved, so I'm looking to do the same thing in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind that it's so hard to do something in C# that the command prompt can do easily.
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If "?" you posted (hex 0xefbfbd) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character.
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.
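As a small illustration of that fallback machinery (my sketch, not taken from the linked article): Encoding.ASCII silently substitutes "?" on encode, and you can swap in an exception fallback to make the data loss visible instead of silent.

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // Encoding.ASCII uses a replacement fallback: anything outside
        // 0x00-0x7F is encoded as '?' (0x3F).
        byte[] ascii = Encoding.ASCII.GetBytes("é");
        Console.WriteLine((char)ascii[0]);                   // ?

        // An exception fallback turns silent substitution into an error.
        Encoding strict = Encoding.GetEncoding("us-ascii",
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strict.GetBytes("é");
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("no ASCII mapping for 'é'");
        }
    }
}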
UPDATE:
The offending character was "»" (UTF-8 bytes 0xC2 0xBB). This is a right-pointing angle quote, a guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise Roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
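If the input might be either UTF-8 or ANSI, one common heuristic (my sketch, not part of the answer above) is to try a strict UTF-8 decode first and fall back to iso-8859-1 only when the bytes aren't valid UTF-8:

using System.IO;
using System.Text;

static class Transcoder
{
    // Strict UTF-8: invalid byte sequences throw instead of silently
    // becoming U+FFFD, which lets us detect "this is not UTF-8".
    public static string ReadAnsiOrUtf8(string path)
    {
        byte[] raw = File.ReadAllBytes(path);
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(raw);   // valid UTF-8 (pure ASCII included)
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: decode each byte 1:1 as iso-8859-1.
            return Encoding.GetEncoding("iso-8859-1").GetString(raw);
        }
    }
}

Then File.WriteAllText(outputfile, Transcoder.ReadAnsiOrUtf8(inputfile), Encoding.UTF8) produces UTF-8 output either way. The heuristic can misfire on an ANSI file whose bytes happen to form valid UTF-8, which is rare but possible.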
I need to process CSV files that are kept as base64 strings. I never know what format they were created in (usually it'll be ANSI or UTF-8). I have been struggling to achieve anything useful; I still receive messed-up characters when I test my code on a CSV file that was saved in ANSI. The code to read is just a two-liner:
byte[] dataToDecode = Convert.FromBase64String(base64Content);
string csvContentInUTF8 = Encoding.UTF8.GetString(dataToDecode);
I do not have access to the code that saves files.
Sample line that's in the input CSV:
;;;superÆ/æ Ø/ø and even Å/å Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
What I get after decoding (the second line of code):
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Following this question, I tried changing the code to read with a Scandinavian encoding, so:
string csvContentInUTF8x = Encoding.GetEncoding("iso-8859-1").GetString(dataToDecode);
The output is:
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
It looks exactly the same with Encoding.Default.
If what you wrote is correct, the text was corrupted before it was written to the CSV file.
Now... Encoding.GetEncoding("iso-8859-1") is an identity encoding that doesn't do any remapping: its 256 characters are mapped 1:1 to the first 256 code points (0-255) of Unicode.
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com
You see the � repeated six times? Normally each one should be different, because you want six different characters (Æ/æ, Ø/ø, Å/å). But here they are always the same, and that is because in UTF-8 � is the Unicode REPLACEMENT CHARACTER (U+FFFD), used when a byte sequence can't be decoded. So the error is already present in your dataToDecode.
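You can verify that reasoning in a few lines (a sketch; 0xC6 is the iso-8859-1 byte for Æ):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // 0xC6 is 'Æ' in iso-8859-1, but on its own it is an invalid
        // UTF-8 sequence (a lead byte with no continuation byte).
        byte[] ansiBytes = { 0xC6 };

        Console.WriteLine(Encoding.UTF8.GetString(ansiBytes) == "\uFFFD");          // True
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(ansiBytes)); // Æ
    }
}

So if even the iso-8859-1 decode shows �, the bytes EF BF BD were already in the base64 payload, and the corruption happened in whatever produced the file.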
I'm trying to encode a TIFF file manually. Whenever byte values greater than 0x7F are encoded they should be written to file as follows:
However, whenever I try to write the value of a character greater than 0x7F to a file with ASCII encoding, it's written as "?" (0x3F).
Does anyone know what the encoding for byte value in the shown image is?
For reference, I'm using C#, writing single chars at a time to a .tif file using the StreamWriter class (StreamWriter.Write(char)).
Problem solved: using the BinaryWriter class instead of StreamWriter allows for encoding of values above 127 as shown in the problem statement.
Going to leave this up as I wasn't able to find this solution through Google using the keywords I mentioned.
I stated I needed ASCII encoding despite the necessity of encoding values above 127 (which ASCII doesn't support) because all values below 128 were still given their ASCII representation. I didn't know what encoding was needed, just that it was a superset of ASCII.
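The difference boils down to text output versus raw bytes. A minimal sketch (the file names are made up):

using System.IO;
using System.Text;

class Demo
{
    static void Main()
    {
        // StreamWriter encodes text: with ASCII, any char above 0x7F
        // falls back to '?' (0x3F) on disk.
        using (var sw = new StreamWriter("ascii.tif", false, Encoding.ASCII))
        {
            sw.Write('\u0095');        // stored as 0x3F
        }

        // BinaryWriter writes the byte value itself; no encoding step.
        using (var bw = new BinaryWriter(File.Open("raw.tif", FileMode.Create)))
        {
            bw.Write((byte)0x95);      // stored as 0x95
        }
    }
}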
I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a StreamReader which takes the encoding to be UTF-8. All working well. But there are some characters, used to replace the punctuation marks, that just aren't working. I can see them as hex sequences in Notepad++ (encoding set to UTF-8), but when I read them, they all get reduced to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16; this is how they are stored in memory. Because of this, you should be able to read the string into memory and replace the characters without any issues. You can then write those characters back to a file (UTF-8 is the default character encoding for reading and writing to a file, if I'm not mistaken). The ?'s just mean that the console you output the string to does not support those characters, or that the bytes are not a valid encoding.
Here is a good article by Jon Skeet about C#/.NET strings.
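Putting that together, a round trip that stays UTF-8 on disk and UTF-16 in memory might look like this (a sketch; the output path is invented):

using System.IO;
using System.Text;

class Demo
{
    static void Main()
    {
        string norm;
        using (var srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8))
        {
            norm = srnorm.ReadToEnd();
        }

        // The replacement happens on UTF-16 chars in memory, so nothing is lost here.
        norm = norm.Replace("e]", "ἓ");

        // Write back as UTF-8 so the Greek characters survive on disk.
        File.WriteAllText("C:\\Users\\John\\Desktop\\bgt_out.txt", norm, Encoding.UTF8);
    }
}

If U+FFFD still appears after reading this way, the bytes in the source file are not valid UTF-8 for those characters, and the file's real encoding needs to be identified first.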
I noticed a strange behaviour of File.Copy() in .NET 3.5SP1. I don't know if that's a bug or a feature. But I know it's driving me crazy. We use File.Copy() in a custom build step, and it screws up the character encoding.
When I copy an ASCII-encoded text file over a UTF-8 encoded text file, the destination file is still UTF-8 encoded, but has the content of the new file plus the 3 prefix bytes of the UTF-8 byte order mark. That's fine for ASCII characters, but incorrect for the remaining characters (128-255) of the ANSI code page.
Here's the code to reproduce it. First I copy a UTF-8 file to the destination, then I copy an ANSI file to the same destination. Notice the second console output: Content of copy.txt : this is ASCII encoded: / Encoding: utf-8
File.WriteAllText("ANSI.txt", "this is ANSI encoded: é", Encoding.GetEncoding(0));
File.WriteAllText("UTF8.txt", "this is UTF8 encoded: é", Encoding.UTF8);
File.Copy("UTF8.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
reader.CurrentEncoding.BodyName);
}
File.Copy("ANSI.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
reader.CurrentEncoding.BodyName);
}
Any ideas why this happens? Is there a mistake in my code? Any ideas how to fix this? (My current idea is to delete the destination file first if it exists.)
EDIT: correct ANSI/ASCII confusion
Where are you writing ASCII.txt? You're writing ANSI.txt in the first line, but that's certainly not ASCII as ASCII doesn't contain any accented characters. The ANSI file won't contain any preamble indicating that it's ANSI rather than ASCII or UTF-8.
You seem to have changed your mind between ASCII and ANSI half way through writing the example, basically.
I'd expect any ASCII file to be "detected" as UTF-8, but the encoding detection relies on the file having a byte order mark for it to be anything other than UTF-8. It's not like it reads the whole file and then guesses at what the encoding is.
From the docs for StreamReader:
This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer to the default size.

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
Now File.Copy is just copying the raw bytes from place to place - it shouldn't change anything in terms of character encodings, because it doesn't try to interpret the file as a text file in the first place.
It's not quite clear to me where you see a problem (partly due to the ANSI/ASCII part). I suggest you separate out the issues of "does File.Copy change things?" and "what character encoding is detected by StreamReader?" in both your mind and your question. The answers should be:
File.Copy shouldn't change the contents of the file at all
StreamReader can only detect UTF-8 and UTF-16; if you need to read a file encoded with any other encoding, you should state that explicitly in the constructor. (I would personally recommend using Encoding.Default instead of Encoding.GetEncoding(0) by the way. I think it's clearer.)
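Both points are easy to demonstrate (my sketch): write the same character with and without a BOM and let StreamReader's detection run.

using System;
using System.IO;
using System.Text;

class Demo
{
    static void Main()
    {
        File.WriteAllText("bom.txt", "é", new UTF8Encoding(true));     // EF BB BF + payload
        File.WriteAllText("nobom.txt", "é", new UTF8Encoding(false));  // payload only
        File.WriteAllText("utf16.txt", "é", Encoding.Unicode);         // FF FE + payload

        foreach (string path in new[] { "bom.txt", "nobom.txt", "utf16.txt" })
        {
            using (var reader = new StreamReader(path, true))
            {
                reader.Peek();   // force the reader to inspect the stream
                Console.WriteLine(path + ": " + reader.CurrentEncoding.BodyName);
            }
        }
        // bom.txt: utf-8; nobom.txt: utf-8 (the default, not a detection);
        // utf16.txt: utf-16. Detection only ever looks at the BOM.
    }
}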
I doubt this has anything to do with File.Copy. I think what you're seeing is that StreamReader uses UTF-8 by default to decode and since UTF-8 is backwards compatible, StreamReader never has any reason to stop using UTF-8 to read the ANSI-encoded file.
If you open ANSI.txt and copy.txt in a hex editor, are they identical?
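A quick way to answer that programmatically (a sketch):

using System;
using System.IO;
using System.Linq;

class Demo
{
    static void Main()
    {
        byte[] source = File.ReadAllBytes("ANSI.txt");
        byte[] copy = File.ReadAllBytes("copy.txt");

        // Expect True: File.Copy duplicates the raw bytes, so any encoding
        // difference must come from how the files are read, not copied.
        Console.WriteLine(source.SequenceEqual(copy));
    }
}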