I need to generate a file that will be read by another system. For this reason, it should be in binary, not text with some encoding.
Here's the code I'm using:
using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
using (BinaryWriter writer = new BinaryWriter(stream))
{
writer.Write("Some text" + Environment.NewLine);
writer.Write("Some more text" + Environment.NewLine);
}
}
When I open the file and look at it, I can see some special character at the start of each line, similar to this (hard to paste it here, since it doesn't show the same):
~Some text
~Some more text
What am I doing wrong/forgetting?
Thanks for your help.
There's no such concept as text without an encoding. It's like wanting to save an abstract image to disk without specifying any image format. (Even "raw" is a kind of encoding for images - you need to agree on a way of communicating the width, height, byte order, colour depth somehow.)
I suggest you just fix on one encoding (e.g. Encoding.Unicode or Encoding.UTF8) and write the text that way.
As for why BinaryWriter.Write(text) is putting "special characters" at the start of each line, did you check the documentation for what it does?
Writes a length-prefixed string to this stream in the current encoding of the BinaryWriter, and advances the current position of the stream in accordance with the encoding used and the specific characters being written to the stream.
and
A length-prefixed string represents the string length by prefixing to the string a single byte or word that contains the length of that string. This method first writes the length of the string as a UTF-7 encoded unsigned integer, and then writes that many characters to the stream by using the BinaryWriter instance's current encoding.
So what you're seeing is the length-prefix... but then it will use whatever encoding you've set up for the BinaryWriter.
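To make the length prefix concrete, here is a minimal sketch comparing BinaryWriter.Write(string) with writing bytes you have encoded yourself. The class and method names (LengthPrefixDemo, WithPrefix, WithoutPrefix) are made up for illustration, and UTF-8 is only assumed as the encoding agreed with the other system. As far as I know the prefix is a 7-bit encoded integer holding the encoded byte count, so for short strings it is a single byte.

```csharp
using System;
using System.IO;
using System.Text;

public static class LengthPrefixDemo
{
    // Bytes produced by BinaryWriter.Write(string):
    // a 7-bit encoded length, followed by the encoded text.
    public static byte[] WithPrefix(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.UTF8))
            {
                writer.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return stream.ToArray();
        }
    }

    // Bytes produced when the text is encoded explicitly and written
    // through the byte[] overload: no length prefix is added.
    public static byte[] WithoutPrefix(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.UTF8))
            {
                writer.Write(Encoding.UTF8.GetBytes(text));
            }
            return stream.ToArray();
        }
    }
}
```

For "Some text", WithPrefix should return ten bytes (a 0x09 prefix plus the nine characters), while WithoutPrefix returns only the nine UTF-8 bytes. If the other system just wants the text, the second form (or a StreamWriter) is probably what you are after.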
Related
I'm opening a text file and removing the first line to prepare it for importing in a database using bulk insert. Here is my code:
string tempFile = Path.GetTempFileName();
using (var sr = new StreamReader("F:\\Upload\\File.txt", System.Text.Encoding.UTF8))
{
using (var sw = new StreamWriter(tempFile,true, System.Text.Encoding.UTF8))
{
string line;
while ((line = sr.ReadLine()) != null)
{
if (line.Substring(0, 8) != "Nr. Crt.")
sw.WriteLine(line);
}
}
}
System.IO.File.Delete("F:\\Upload\\File.txt");
System.IO.File.Move(tempFile, "F:\\Upload\\File.txt");
After this if I open the resulting file, Unicode characters are replaced with other characters. For example strings containing non-breaking space (unicode U+00A0): Value (note the unicode char ) are transformed in Value� .
How can I avoid this?
Edit:
Notepad++ is set to 'Encode in UTF-8'
Here is a picture of how it looks :
are transformed in Value�
The byte values for those 3 odd characters are 0xef 0xbf 0xbd, which is the UTF-8 encoding of code point \ufffd, the replacement character �. It is substituted when reading UTF-encoded text that contains an invalid byte sequence.
That points squarely at an issue with File.txt: it was probably not encoded in UTF-8. If you have no idea what encoding was used for that file, then the first guess is to pass Encoding.Default to the StreamReader constructor.
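A small sketch of that failure mode, assuming the file actually holds a single-byte encoding such as Windows-1252, where a non-breaking space is the lone byte 0xA0. The class and helper names (ReplacementCharDemo, DecodeAsUtf8, EncodeAsUtf8) are hypothetical.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes raw bytes as UTF-8. Encoding.UTF8 substitutes U+FFFD
    // for invalid byte sequences instead of throwing.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Encodes text as UTF-8 without a byte order mark.
    public static byte[] EncodeAsUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }
}
```

Decoding the bytes 56 61 6C 75 65 A0 ("Value" followed by 0xA0) this way gives "Value" plus U+FFFD, and encoding U+FFFD again gives EF BF BD. The original byte is already gone at read time, so nothing on the writing side can bring it back.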
It looks to me like it is writing fine, but the tool you are reading with is not expecting UTF-8. In many cases, you need to explicitly tell the tool what encoding to expect. However, a common approach is to prepend a BOM ("byte order mark"). This is simple - just use new UTF8Encoding(true) as the encoding and it will happen automatically. In tools that don't expect a BOM this will display as a few mangled chars at the start - but most modern tools will know what it means, and will switch to UTF-8 automatically. The point is: the BOM for UTF-8, UTF-16 LE and UTF-16 BE etc are all slightly different, but recognisable. A more complete list is on wikipedia.
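As a rough illustration of the BOM approach, the sketch below writes text through a StreamWriter configured with new UTF8Encoding(true) and returns the resulting bytes. It uses a MemoryStream instead of a real file purely to keep the example self-contained; BomDemo and EncodeWithBom are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Writes the text as UTF-8 with a byte order mark and returns the bytes.
    public static byte[] EncodeWithBom(string text)
    {
        // true = emit the UTF-8 identifier (BOM) at the start of the output.
        var utf8WithBom = new UTF8Encoding(true);
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, utf8WithBom))
            {
                writer.Write(text);
            }
            return stream.ToArray();
        }
    }
}
```

The output should start with EF BB BF, the UTF-8 byte order mark, followed by the encoded text. Passing false to the UTF8Encoding constructor omits the mark.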
I've asked this before in a round-about manner here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
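One way to sketch that byte-level check is below. It is a heuristic under stated assumptions, not a real detector: it only distinguishes "pure ASCII", "valid UTF-8" and "something else", and the names EncodingSniffer, IsPureAscii and IsValidUtf8 are invented for the example. Valid UTF-8 does not prove the file was meant as UTF-8, although Windows-1252 text with accented characters rarely happens to form valid multi-byte sequences.

```csharp
using System;
using System.Text;

public static class EncodingSniffer
{
    // True when every byte is below 0x80, in which case
    // Windows-1252 and UTF-8 decode the data identically.
    public static bool IsPureAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b >= 0x80) return false;
        }
        return true;
    }

    // True when the bytes form well-formed UTF-8.
    // The second constructor argument makes the decoder throw
    // on invalid sequences instead of substituting U+FFFD.
    public static bool IsValidUtf8(byte[] data)
    {
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A caller could read the file with File.ReadAllBytes, treat it as UTF-8 when IsValidUtf8 returns true, and otherwise fall back to whichever single-byte code page is expected.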
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
Encoding encoding = Encoding.Default;
String original = String.Empty;
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
original = sr.ReadToEnd();
encoding = sr.CurrentEncoding;
sr.Close();
}
if (encoding == Encoding.UTF8)
return original;
byte[] encBytes = encoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
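A minimal sketch of that byte-first approach follows. The names Transcoder, ToUtf8 and ConvertFile are hypothetical, and the source encoding is passed in by the caller, for example Encoding.GetEncoding(1252). As far as I know, on .NET Core and later that call only works after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called, whereas on .NET Framework it works out of the box.

```csharp
using System;
using System.IO;
using System.Text;

public static class Transcoder
{
    // Decodes the raw bytes with the encoding they were written in,
    // then encodes the resulting string as UTF-8 (no BOM).
    public static byte[] ToUtf8(byte[] sourceBytes, Encoding sourceEncoding)
    {
        string text = sourceEncoding.GetString(sourceBytes);
        return new UTF8Encoding(false).GetBytes(text);
    }

    // Reads the input file as bytes, never as a string,
    // and writes the UTF-8 version to the output path.
    public static void ConvertFile(string inputPath, string outputPath, Encoding sourceEncoding)
    {
        byte[] raw = File.ReadAllBytes(inputPath);
        File.WriteAllBytes(outputPath, ToUtf8(raw, sourceEncoding));
    }
}
```

The point of the design is that the bytes are decoded exactly once, with the encoding they were really written in; no string is ever pushed back into a byte[] just to be converted.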
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF8. This happens only some of the time, UTF8 works the majority of the time. Also, the latest of the text data DOES contain UTF8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
const int WindowsCodepage1252 = 1252;
string text;
try
{
var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, utf8Encoding);
}
catch (DecoderFallbackException dfe)//then text is not entirely valid UTF8, contains Codepage 1252 characters that can't be correctly decoded to UTF8
{
var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, codepage1252Encoding);
}
return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
// If the destination's parent doesn't exist, create it.
String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
if (!Directory.Exists(parent))
{
Directory.CreateDirectory(parent);
}
// Convert the file.
String tempName = null;
try
{
tempName = Path.GetTempFileName();
using (StreamReader sr = new StreamReader(sourcePath))
{
using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
{
int charsRead;
char[] buffer = new char[128 * 1024];
while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
sw.Write(buffer, 0, charsRead);
}
}
}
File.Delete(destPath);
File.Move(tempName, destPath);
}
finally
{
// tempName is still null if GetTempFileName itself threw.
if (tempName != null) File.Delete(tempName);
}
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have a vCard file (*.vcf) with 200 contacts in one file, in Russian.
I opened it with the vCardOrganizer 2.1 program and used Split to divide it into 200 files. The contacts came out as garbled symbols; the only thing I could read was the numbers :-)
Steps (be patient when you do these steps, sometimes it takes time):
Open the vCard file (my file size was 3 MB) with Notepad.
Then go to the menu File > Save As. In the window that opens, choose a file name (don't forget to add .vcf) and an encoding (ANSI or UTF-8), and finally click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI): nothing was lost and the Russian text is perfectly readable. If you have questions, write: yoshidakatana#gmail.com
Good Luck !!!
I have some questions about editing files with c#.
I have managed to read a file into a byte[]. How can I get the ASCII code of each byte and show it in the text area of my form?
Also, how can I change the bytes and then write them back into a file?
For example:
I have a file and I know the first three bytes are letters. How can I change say, the second letter, to "A", then save the file?
Thanks!
If the file is ASCII, then each byte IS the ASCII code. To print the value of the byte to, say, a label, is as simple as this.
If you have read your file into byte[] file;
label1.Text = file[1].ToString();
To change the second letter to A:
file[1] = (byte)'A';
Or
file[1] = (byte)(int)'A';
I'm not sure, I don't have C# on my Mac to test.
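For the "write them back into a file" part, a minimal sketch using File.ReadAllBytes and File.WriteAllBytes might look like the following. ByteEditDemo and SetAsciiByte are hypothetical names, and the file is assumed to be small enough to load into memory in one go.

```csharp
using System;
using System.IO;

public static class ByteEditDemo
{
    // Replaces the byte at the given zero-based index with an ASCII character
    // and writes the whole buffer back to the same file.
    public static void SetAsciiByte(string path, int index, char asciiChar)
    {
        if (asciiChar > 127)
        {
            throw new ArgumentOutOfRangeException(nameof(asciiChar), "Not an ASCII character.");
        }

        byte[] bytes = File.ReadAllBytes(path);
        bytes[index] = (byte)asciiChar; // index 1 is the second byte
        File.WriteAllBytes(path, bytes);
    }
}
```

Calling SetAsciiByte(path, 1, 'A') on a file that starts with "xyz" should leave it starting with "xAz".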
But seriously, if it is a text file, you are better off reading it in as text, not as a byte[]. And you would probably want to manipulate it using a StringBuilder.
Firstly, to read it in as a string:
// Read the file as one string.
System.IO.StreamReader myFile =
new System.IO.StreamReader("c:\\test.txt");
string myString = myFile.ReadToEnd();
myFile.Close();
And this will work if the file is unicode as well.
Then, you can get the Unicode values (which for most latin characters is the same as the ASCII value) like so: int value = (int)myString[5]; or so.
You can then write back to a file like so:
System.IO.File.WriteAllText("c:\\test.txt", myString);
If you are going to do heavy modifications on the text, you should use a StringBuilder, otherwise, normal string operations would be fine.
I can only assume that you want to practice writing to/from files by the byte. You need to look into the class BitConverter, there is a lot of help out there for this class. To read in a value you would take in each byte into a byte[]. Once you have your byte[] it would look something like this.
string s = BitConverter.ToString(byteArray);
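You can then make your adjustments to the string value. Note that BitConverter.ToString returns hyphen-separated hex pairs (for example "41-42-43"), and BitConverter has no GetBytes overload that takes a string, so to write back to the file you have to parse the hex pairs yourself:
byte[] newByteArray = Array.ConvertAll(s.Split('-'), h => Convert.ToByte(h, 16));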
Then you could write your bytes back to your file.
I noticed a strange behaviour of File.Copy() in .NET 3.5SP1. I don't know if that's a bug or a feature. But I know it's driving me crazy. We use File.Copy() in a custom build step, and it screws up the character encoding.
When I copy an ASCII encoding text file over a UTF-8 encoded text file, the destination file still is UTF-8 encoded, but has the content of the new file plus the 3 prefix characters for UTF-8. That's fine for ASCII characters, but incorrect for the remaining characters (128-255) of the ANSI code page.
Here's the code to reproduce. I first copy a UTF-8 file to the destination, then I copy an ANSI file to the same destination. Notice the output of the second console output: Content of copy.txt : this is ASCII encoded: / Encoding: utf-8
File.WriteAllText("ANSI.txt", "this is ANSI encoded: é", Encoding.GetEncoding(0));
File.WriteAllText("UTF8.txt", "this is UTF8 encoded: é", Encoding.UTF8);
File.Copy("UTF8.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
reader.CurrentEncoding.BodyName);
}
File.Copy("ANSI.txt", "copy.txt", true);
using (StreamReader reader = new StreamReader("copy.txt", true))
{
Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
reader.CurrentEncoding.BodyName);
}
Any ideas why this happens? Is there a mistake in my code? Any ideas how to fix this (my current idea is to delete the file before if it exists)
EDIT: correct ANSI/ASCII confusion
Where are you writing ASCII.txt? You're writing ANSI.txt in the first line, but that's certainly not ASCII as ASCII doesn't contain any accented characters. The ANSI file won't contain any preamble indicating that it's ANSI rather than ASCII or UTF-8.
You seem to have changed your mind between ASCII and ANSI half way through writing the example, basically.
I'd expect any ASCII file to be "detected" as UTF-8, but the encoding detection relies on the file having a byte order mark for it to be anything other than UTF-8. It's not like it reads the whole file and then guesses at what the encoding is.
From the docs for StreamReader:
This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer to the default size.
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
Now File.Copy is just copying the raw bytes from place to place - it shouldn't change anything in terms of character encodings, because it doesn't try to interpret the file as a text file in the first place.
It's not quite clear to me where you see a problem (partly due to the ANSI/ASCII part). I suggest you separate out the issues of "does File.Copy change things?" and "what character encoding is detected by StreamReader?" in both your mind and your question. The answers should be:
File.Copy shouldn't change the contents of the file at all
StreamReader can only detect UTF-8 and UTF-16; if you need to read a file encoded with any other encoding, you should state that explicitly in the constructor. (I would personally recommend using Encoding.Default instead of Encoding.GetEncoding(0) by the way. I think it's clearer.)
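To see the detection behaviour in isolation, without File.Copy involved at all, here is a small sketch that feeds raw bytes to a StreamReader with BOM detection switched on and reports what it settled on. BomDetectionDemo and DetectedEncodingName are made-up names, and a MemoryStream stands in for the file.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Returns the WebName of the encoding StreamReader ends up using
    // for the given bytes when BOM detection is enabled.
    public static string DetectedEncodingName(byte[] fileBytes)
    {
        using (var reader = new StreamReader(new MemoryStream(fileBytes), true))
        {
            // CurrentEncoding is only meaningful after the first read.
            reader.ReadToEnd();
            return reader.CurrentEncoding.WebName;
        }
    }
}
```

A lone 0xE9 byte (an ANSI "é" with no BOM) is reported as utf-8, simply because nothing told the reader otherwise; bytes starting with FF FE are reported as utf-16. That matches the observation in the question: the "utf-8" in the output describes the reader's default, not the file.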
I doubt this has anything to do with File.Copy. I think what you're seeing is that StreamReader uses UTF-8 by default to decode and since UTF-8 is backwards compatible, StreamReader never has any reason to stop using UTF-8 to read the ANSI-encoded file.
If you open ASCII.txt and copy.txt in a hex editor, are they identical?
I have an ANSI-encoded file, and I want to convert the lines I read from the file to ASCII.
How do I go about doing this in C#?
EDIT : What if i used "BinaryReader"
BinaryReader reader = new BinaryReader(input, Encoding.Default);
but this reader takes (Stream, Encoding)
but "Stream" is an abstract! And where should I put the path of the file which he will read from ?
A direct conversion from ANSI to ASCII might not always be possible, since ANSI is a superset of ASCII.
You can try converting to UTF-8 using Encoding, though:
Encoding ANSI = Encoding.GetEncoding(1252);
byte[] ansiBytes = ANSI.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert(ANSI, Encoding.UTF8, ansiBytes);
String utf8String = Encoding.UTF8.GetString(utf8Bytes);
Of course you can replace UTF8 with ASCII, but that doesn't really make sense since:
if the original string doesn't contain any byte > 127, then it's already ASCII
if the original string does contain one or more bytes > 127, then those characters will be lost
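What "lost" means in practice can be shown with a short sketch. AsciiDemo and ToAsciiLossy are hypothetical names; the behaviour relied on is that Encoding.ASCII substitutes a question mark for any character it cannot represent.

```csharp
using System;
using System.Text;

public static class AsciiDemo
{
    // Round-trips a string through ASCII. Characters outside
    // the 0..127 range come back as '?'.
    public static string ToAsciiLossy(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }
}
```

Plain text such as "cafe" survives unchanged, while "café" comes back as "caf?".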
UPDATE:
In response to the updated question, you can use BinaryReader like this:
BinaryReader reader = new BinaryReader(File.Open("foo.txt", FileMode.Open),
Encoding.GetEncoding(1252));
Basically, you need to specify an Encoding when reading/writing the file. For example:
// read with the **local** system default ANSI page
string text = File.ReadAllText(path, Encoding.Default);
// ** I'm not sure you need to do this next bit - it sounds like
// you just want to read it? **
// write as ASCII (if you want to do this)
File.WriteAllText(path2, text, Encoding.ASCII);
Note that once you have read it, text is actually unicode when in memory.
You can choose different code-pages using Encoding.GetEncoding.