C# StreamReader/StreamWriter encoding oddity

I have a very simple C# console app that reads a text file and writes the same file back out with a particular string replaced on each line where it appears, using StreamReader and StreamWriter. I do not know the encoding of the source file. The file contains a character (extended ASCII decimal 166, broken pipe) that gets "mangled" when run through this app with the default encoding: in the output file it ends up as a "box" character. Since I do not know the source file encoding, I tried several options to see which would give an unaltered result, and oddly the only combination that works is reading as UTF-7 and writing as UTF-8.
UTF-7 to UTF-7 causes problems such as & changing to +AC. UTF-8 to UTF-8 (which I believe is the default) converts the character in question to the "box". ASCII to ASCII turns it into ?. Unicode to Unicode results in gibberish. Shouldn't reading and writing with the same encoding give an unchanged result? Simplified code example below:
using (var fileStream = new FileStream(fileName, FileMode.Open))
using (var fileReader = new StreamReader(fileStream, Encoding.UTF7))
using (var fileStreamOut = new FileStream(tempFileName, FileMode.Create))
using (var fileWriter = new StreamWriter(fileStreamOut, Encoding.UTF8))
{
    while (!fileReader.EndOfStream)
    {
        var inputLine = fileReader.ReadLine();
        if (inputLine != null)
        {
            inputLine = inputLine.Substring(0, 3) + newRdfi + inputLine.Substring(12);
            fileWriter.WriteLine(inputLine);
        }
    }
    fileWriter.Flush();
}

After the developer of the source system clarified how the file is created, and knowing which server produces it, I concluded that the encoding was Windows-1252. Changing both my read and write streams to use Encoding.GetEncoding(1252) resulted in all characters being read and written as expected.
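For illustration, here is a minimal sketch of that fix as a self-contained round trip. The sample bytes, the temp files, and the "AB" to "XY" replacement are made up for the example. It also assumes .NET Core 3.0 or later, where code page 1252 is only available after registering CodePagesEncodingProvider; on .NET Framework the registration line should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

// Assumption: .NET Core / .NET 5+, where legacy code pages must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding win1252 = Encoding.GetEncoding(1252);

// Hypothetical sample line: "AB", then byte 0xA6 (decimal 166), then "C" and CRLF.
string inPath = Path.GetTempFileName();
string outPath = Path.GetTempFileName();
File.WriteAllBytes(inPath, new byte[] { 0x41, 0x42, 0xA6, 0x43, 0x0D, 0x0A });

// Read and write with the same single-byte encoding, so every byte maps to
// exactly one char on the way in and back to the same byte on the way out.
using (var reader = new StreamReader(inPath, win1252))
using (var writer = new StreamWriter(outPath, false, win1252))
{
    while (reader.ReadLine() is string line)
    {
        writer.WriteLine(line.Replace("AB", "XY"));
    }
}

byte[] result = File.ReadAllBytes(outPath);
// The output starts with 58-59-A6-43: the replacement happened and 0xA6 survived.
Console.WriteLine(BitConverter.ToString(result));
```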

Related

C# Streamwriter - Problem with the encoding

I have some product data that I want to write into a csv file. First I have a function that writes the header into the csv file:
using (StreamWriter streamWriter = new StreamWriter(path))
{
    string[] headerContent = {"banana","apple","orange"};
    string header = string.Join(",", headerContent);
    streamWriter.WriteLine(header);
}
Another function goes over the products and writes their data into the csv file:
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Open), Encoding.UTF8))
{
    foreach (var product in products)
    {
        await streamWriter.WriteLineAsync(product.ToString());
    }
}
When I write the products into the csv file with FileMode.Open and Encoding.UTF8, the encoding is applied correctly, meaning that special characters in German or French are shown correctly. The problem is that this overwrites my header.
The solution I tried was to use FileMode.Append instead of FileMode.Open, which works, but then for some reason the encoding seems to be ignored.
What could I do to append the data while maintaining the encoding? And why is this happening in the first place?
EDIT:
Example with FileMode.Open:
Fußpflegecreme
Example with FileMode.Append:
Fußpflegecreme
The important question here is: what does the file actually contain; for example, if I use the following:
using System.Text;
string path = "my.txt";
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Create), Encoding.UTF8))
{
    streamWriter.WriteLine("Fußpflegecreme 1");
}
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Append), Encoding.UTF8))
{
    streamWriter.WriteLine("Fußpflegecreme 2");
}
// this next line is lazy and inefficient; only good for quick tests
Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path)));
then the output is (re-formatted a little):
EF-BB-BF-
46-75-C3-9F-70-66-6C-65-67-65-63-72-65-6D-65-20-31-0D-0A-
46-75-C3-9F-70-66-6C-65-67-65-63-72-65-6D-65-20-32-0D-0A
The first line (note: there aren't any "lines" in the original hex) is the UTF-8 BOM; the second and third lines are the correctly UTF-8 encoded payloads. It would help if you could show the exact bytes that get written in your case. I wonder if the real problem is that in your version there is no BOM, but the rest of the data is correct. Some tools, in the absence of a BOM, will choose the wrong encoding. Conversely, some tools, in the presence of a BOM, will incorrectly show garbage at the start of the file (and, because they are clearly not using the BOM, may also use the wrong encoding). The preferred option is to specify the encoding explicitly when reading the file, and to use a tool that can handle the presence or absence of a BOM.
Whether or not to include a BOM (especially in the case of UTF-8) is a complex question, and there are pros/cons of each - and there are tools that will work better, or worse, with each. A lot of UTF-8 text files do not include a BOM, but: there is no universal answer. The actual content is still correctly UTF-8 encoded whether or not there is a BOM - but how that is interpreted (in either case) is up to the specific tool that you're using to read the data (and how that tool is configured).
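I think this will be solved once you explicitly choose the UTF-8 encoding when writing the header, because new StreamWriter(path) without an encoding argument writes UTF-8 without a BOM. Passing Encoding.UTF8 will prefix the file with a BOM.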
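A minimal sketch of that suggestion, using a temp file and a made-up product row in place of the real data: the header is written with an explicit Encoding.UTF8 so the BOM goes in first, and the rows are appended afterwards with the same encoding.

```csharp
using System;
using System.IO;
using System.Text;

string path = Path.GetTempFileName();

// Header: the file is created/truncated and Encoding.UTF8 is passed explicitly,
// so the writer emits the BOM (EF BB BF) before the first line.
using (var writer = new StreamWriter(path, false, Encoding.UTF8))
{
    writer.WriteLine("banana,apple,orange");
}

// Rows: opened in append mode. The stream is not at position 0, so the writer
// does not emit a second BOM; the text itself is still encoded as UTF-8.
using (var writer = new StreamWriter(path, true, Encoding.UTF8))
{
    writer.WriteLine("Fußpflegecreme");
}

byte[] bytes = File.ReadAllBytes(path);
Console.WriteLine(BitConverter.ToString(bytes, 0, 3)); // EF-BB-BF

// Reading back with BOM detection strips the BOM and decodes both lines.
string text = File.ReadAllText(path, Encoding.UTF8);
Console.WriteLine(text);
```

If the consuming tool prefers files without a BOM, the same pattern should work with new UTF8Encoding(false) in both places; which variant displays correctly depends on the tool, as discussed above.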

C# Stream Reader does not differentiate between UTF-16 and UTF-8 [duplicate]

This question already has answers here:
Is it possible to reliably auto-decode user files to Unicode? [C#]
(5 answers)
Closed 2 years ago.
I am building an app that downloads a csv file in plain text from an e-mail server and writes it to the local file system. I am developing this app in C# using .NET Core 3.1.
The problem is that I don't know the encoding of the files that I am receiving, so I decided to use the StreamReader class to convert the bytes that I downloaded from the e-mail to a string.
Here is the code
foreach (var data in loadedData)
{
    if (IsValidData(data))
    {
        logger.Info($"Writing data from: {data.FileName}");
        using var stream = new MemoryStream(data.FileContent);
        using var reader = new StreamReader(stream, true);
        var csvData = new CSVData
        {
            FileName = data.FileName,
            FileContent = reader.ReadToEnd(),
        };
        dataWriter.WriteData(csvData);
        logger.Info($"Writing data from: {data.FileName} was successfully written");
    }
    else
    {
        logger.Warn($"Invalid format: {data.FileName}");
    }
}
And to write the data to the actual files I am using:
public void WriteData(CSVData data)
{
    logger.Debug($"Writing received file: {data.FileName}");
    var outputDir = config.GetReceivedFilesPath();
    string fileName = this.GetOutputPath(data.FileName, outputDir);
    Directory.CreateDirectory(outputDir);
    using var writer = new StreamWriter(fileName, false, Encoding.UTF8);
    writer.Write(data.FileContent);
    logger.Debug($"The received data was successfully written to: {data.FileName}");
}
The problem is that some of the files I receive are encoded in UTF-16 (I believe this is the encoding being used, because there is a \0 after each char), but the StreamReader interprets them as UTF-8: the reader.CurrentEncoding property returns UTF-8.
The end result is that instead of my files being output as UTF-8, my app outputs them as UTF-16, even though I explicitly specified UTF-8 as the output encoding.
What am I doing wrong?
You might be able to use this method
File.ReadAllText(string path, System.Text.Encoding encoding)
Based on the documentation it tries to figure this out automatically.
The below text is from the documentation
This method opens a file, reads all the text in the file, and returns
it as a string. It then closes the file.
This method attempts to automatically detect the encoding of a file
based on the presence of byte order marks. Encoding formats UTF-8 and
UTF-32 (both big-endian and little-endian) can be detected.
The file handle is guaranteed to be closed by this method, even if
exceptions are raised.
To use the encoding settings as configured for your operating system,
specify the Encoding.Default property for the encoding parameter.
The full document can be found here
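Note that the detection quoted above relies on a byte order mark, and the StreamReader in the question already had BOM detection switched on, so the files in question presumably have no BOM. In that case something has to guess. The sketch below is one possible heuristic, not anything from the documentation: DecodeGuessingUtf16 is a hypothetical helper that falls back to little-endian UTF-16 when most odd-indexed bytes are zero (the "\0 after each char" pattern described in the question) and to UTF-8 otherwise. It will misjudge some inputs, for example UTF-16 text that is mostly non-Latin.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: a BOM always wins; without one, guess UTF-16 LE when
// more than half of the odd-indexed bytes are 0x00, else assume UTF-8.
static string DecodeGuessingUtf16(byte[] bytes)
{
    Encoding fallback = Encoding.UTF8;
    int pairs = bytes.Length / 2;
    int zeroHighBytes = 0;
    for (int i = 1; i < bytes.Length; i += 2)
    {
        if (bytes[i] == 0) zeroHighBytes++;
    }
    if (pairs > 0 && zeroHighBytes * 2 > pairs)
    {
        fallback = Encoding.Unicode; // UTF-16 little-endian
    }

    using var stream = new MemoryStream(bytes);
    // detectEncodingFromByteOrderMarks: true lets a BOM override the guess.
    using var reader = new StreamReader(stream, fallback, true);
    return reader.ReadToEnd();
}

// Encoding.Unicode.GetBytes does not prepend a BOM, which mimics the problem files.
byte[] utf16NoBom = Encoding.Unicode.GetBytes("id;name");
Console.WriteLine(DecodeGuessingUtf16(utf16NoBom)); // id;name
```

The decoded string could then be handed to the existing WriteData method, which already writes UTF-8.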

Unable to convert special characters in UTF-8 file into ANSI

I have a file that needs to be read and have some text added at the end.
The program failed due to the character "í".
On opening the file in notepad++ (UTF-8) encoding, I could see
In my C# code I tried to convert it to Default encoding, but the application changes it to "?" instead of "í".
Sample code:
string processFilePath = @"D:\Test\File1.txt";
string outfile = @"D:\Test\File2.txt";
using (StreamReader reader = new StreamReader(processFilePath))
{
    using (StreamWriter writer = new StreamWriter(outfile, false, Encoding.Default))
    {
        writer.WriteLine(reader.ReadToEnd());
    }
}
I looked into similar questions on SO (the code snippet above is a modified version from here):
UTF-8 to ANSI Conversion using C#
I tried different types of encoding available in the "System.Text.Encoding" - ASCII/ UTF*/ Default but the best I could get is a "?" instead of "í".
I had also gone through : http://kunststube.net/encoding/ , I did learn a lot, but was still unable to resolve the issue.
What I am getting:
What I need:
On Microsoft website:
What else am I missing? (It should have been easy if System.Text.Encoding.ANSI existed.)
MSDN:
StreamReader defaults to UTF-8 encoding unless specified otherwise,
instead of defaulting to the ANSI code page for the current system.
That is, when you open StreamReader(processFilePath), it reads the data as UTF-8, which does not seem to be what the file actually contains. If the source text is ANSI, most likely Windows-1252 for Spanish, use:
using (StreamReader reader = new StreamReader(processFilePath, Encoding.GetEncoding(1252)))
{
    using (StreamWriter writer = new StreamWriter(outfile, false, Encoding.UTF8))
    {
        writer.WriteLine(reader.ReadToEnd());
    }
}
Note the explicit 1252 for reading and UTF8 for writing.
P.S. Also note that false in StreamWriter will not append to the end, but overwrite.
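To make the cause of the "?" concrete, here is a small in-memory sketch. The word "día" is just a made-up sample, and the provider registration assumes .NET Core 3.0 or later. The byte for "í" in Windows-1252 is not valid UTF-8 on its own, so a UTF-8 reader replaces it with U+FFFD at read time, and no choice of output encoding can bring it back afterwards.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 1252 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding win1252 = Encoding.GetEncoding(1252);

// "día" as stored in a Windows-1252 file: 'í' is the single byte 0xED.
byte[] ansiBytes = { 0x64, 0xED, 0x61 };

// Read as UTF-8 (the StreamReader default): 0xED followed by 'a' is not a valid
// sequence, so the decoder substitutes U+FFFD, the replacement character.
string misread = Encoding.UTF8.GetString(ansiBytes);
Console.WriteLine(misread.IndexOf('\uFFFD') >= 0); // True

// Encoding the damaged string to an ANSI code page cannot restore the original
// byte; this is presumably where the "?" in the output file comes from.
Console.WriteLine(BitConverter.ToString(win1252.GetBytes(misread)));

// Read with the encoding the file really uses, and the character survives.
string correct = win1252.GetString(ansiBytes);
Console.WriteLine(correct == "d\u00EDa"); // True
```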

How to set the BOM for a file being read

I have been having issues reading a file that contains a mix of Arabic and Western text. I read the file into a TextBox as follows:
tbx1.Text = File.ReadAllText(fileName.Text, Encoding.UTF8);
No matter what value I tried instead of "Encoding.UTF8" I got garbled characters displayed in place of the Arabic. The western text was displayed fine.
I thought it might have been an issue with the way the TextBox was defined, but on start up I write some mixed Western/Arabic text to the textbox and this displays fine:
tbx1.Text = "Start السلا عليكم" + Environment.NewLine + "Here";
Then I opened Notepad and copied the above text into it, then saved the file, at which point Notepad save dialogue asked for which encoding to use.
I then presented the saved file to my code and it displayed all the content correctly.
I examined the file and found 3 binary bytes at the beginning (not visible in Notepad):
The 3 bytes, I subsequently found through research, represent the BOM, and this enables File.ReadAllText(fileName.Text, Encoding.UTF8) to read and display the data as desired.
What puzzles me is that specifying the Encoding.UTF8 value should take care of this.
The only way I can think of is to code up a step that adds this data to a copy of the file, then process that file. But this seems rather long-winded. Just wondering if there is a better way to do it, or why Encoding.UTF8 is not yielding the desired result.
Edit:
Still no luck despite trying the suggestion in the answer.
I cut the test data down to containing just Arabic as follows:
Code as follows:
FileStream fs = new FileStream(fileName.Text, FileMode.Open);
StreamReader sr = new StreamReader(fs, Encoding.UTF8, false);
tbx1.Text = sr.ReadToEnd();
sr.Close();
fs.Close();
Tried with both "true" and "false" on the 2nd line, but both give the same result.
If I open the file in Notepad++, and specify the Arabic ISO-8859-6 Character set it displays fine.
Here is what it looks like in Notepad++ (and what I would like the textbox to display):
Not sure if the issue is in the reading from file, or the writing to the textbox.
I will try inspecting the data post read to see. But at the moment, I'm puzzled.
The StreamReader class has a constructor that will take care of testing for the BOM for you:
using (var stream = new FileStream(fileName.Text, FileMode.Open, FileAccess.Read))
{
    using (var sr = new StreamReader(stream, Encoding.UTF8, true))
    {
        var text = sr.ReadToEnd();
    }
}
The final true parameter is detectEncodingFromByteOrderMarks:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
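Since the edit in the question reports that the file displays correctly in Notepad++ as ISO-8859-6, it may simply not be UTF-8 at all, in which case BOM detection cannot help and the reader needs that encoding explicitly. A minimal sketch under that assumption follows; the sample word and temp file are made up, and the provider registration assumes .NET Core 3.0 or later.

```csharp
using System;
using System.IO;
using System.Text;

// Assumption: .NET Core / .NET 5+, where ISO-8859-6 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding arabic = Encoding.GetEncoding("iso-8859-6");

// Hypothetical sample: the Arabic word "salam" (seen, lam, alef, meem),
// stored on disk as ISO-8859-6, one byte per letter and no BOM.
string sample = "\u0633\u0644\u0627\u0645";
string path = Path.GetTempFileName();
File.WriteAllBytes(path, arabic.GetBytes(sample));

// Read as UTF-8, the bytes are invalid sequences and come back garbled.
string asUtf8 = File.ReadAllText(path, Encoding.UTF8);
Console.WriteLine(asUtf8 == sample); // False

// Read with the encoding the file actually uses, the text round-trips.
string asArabic = File.ReadAllText(path, arabic);
Console.WriteLine(asArabic == sample); // True
```

In the question's code this would amount to passing Encoding.GetEncoding("iso-8859-6") instead of Encoding.UTF8, assuming the Notepad++ observation reflects the real encoding of the file.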

DOS-CSV import via C# and German Umlaute (Ä,Ü,Ö,ä,ü,ö)

I have the following problem: I have some Excel sheets and must export them into DOS-CSV format (for some reasons). As a consequence, the German umlauts (Ä,Ü,Ö,ä,ü,ö) are not exported correctly. In the next step these CSV files must be imported into a WinForms application. Is it possible to get back the correct characters Ä,Ü,Ö,ä,ü,ö during the import?
If you choose the DOS-CSV format, Excel is going to encode the document using the 437 codepage (found that here). You can convert it back to UTF-8 using a little bit of code:
Encoding dosEncoding = Encoding.GetEncoding(437);
string original = String.Empty;
// Decode the file using the DOS code page it was saved in
using (StreamReader sr = new StreamReader(@"D:\Path\To\output.csv", dosEncoding))
{
    original = sr.ReadToEnd();
    sr.Close();
}
// Re-encode the text as UTF-8 bytes and decode those back into a string
byte[] encBytes = dosEncoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(dosEncoding, Encoding.UTF8, encBytes);
string converted = Encoding.UTF8.GetString(utf8Bytes);
I tested this by putting Ä,Ü,Ö,ä,ü,ö into a cell and then saving it as a DOS formatted CSV file in Excel. Looking at the document, Excel turned it into Ž,š,™,„,,”.
Running it through the above code turned it back into Ä,Ü,Ö,ä,ü,ö.
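As a cross-check, the six garbled characters listed above correspond to the code page 437 bytes 8E, 9A, 99, 84, 81 and 94 when viewed as Windows-1252. The sketch below writes those bytes to a temp file (a stand-in for the Excel export) and converts the file by reading it with code page 437 and writing it out as UTF-8. It assumes .NET Core 3.0 or later for the provider registration. Once the text has been decoded with the right code page it is already an ordinary .NET string, so this shorter route should give the same result as the byte-level conversion above.

```csharp
using System;
using System.IO;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 437 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding dos = Encoding.GetEncoding(437);

// Stand-in for the Excel export: the umlauts as code page 437 bytes.
byte[] dosBytes = { 0x8E, 0x9A, 0x99, 0x84, 0x81, 0x94 };
string inPath = Path.GetTempFileName();
string outPath = Path.GetTempFileName();
File.WriteAllBytes(inPath, dosBytes);

// Decode with the DOS code page, then save the same text as UTF-8.
string text = File.ReadAllText(inPath, dos);
File.WriteAllText(outPath, text, Encoding.UTF8);

// Compare against the expected umlauts, written as escapes so the check does
// not depend on the encoding of this source file.
Console.WriteLine(text == "\u00C4\u00DC\u00D6\u00E4\u00FC\u00F6"); // True
```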
