I have the following problem: I have some Excel sheets and must export them to the DOS-CSV format (for various reasons). The consequence is that the German umlauts (Ä, Ü, Ö, ä, ü, ö) are not exported correctly. In the next step these CSV files must be imported into a WinForms application. Is it possible to get the correct characters Ä, Ü, Ö, ä, ü, ö back during the import?
If you choose the DOS-CSV format, Excel encodes the document using code page 437 (found that here). You can convert it back to UTF-8 with a little bit of code:
// Read the file using the DOS (OEM) code page 437.
Encoding dosEncoding = Encoding.GetEncoding(437);
string original = String.Empty;

using (StreamReader sr = new StreamReader(@"D:\Path\To\output.csv", dosEncoding))
{
    original = sr.ReadToEnd();
}

// Re-encode the decoded text as UTF-8.
byte[] encBytes = dosEncoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(dosEncoding, Encoding.UTF8, encBytes);
string converted = Encoding.UTF8.GetString(utf8Bytes);
I tested this by putting Ä,Ü,Ö,ä,ü,ö into a cell and then saving it as a DOS formatted CSV file in Excel. Looking at the document, Excel turned it into Ž,š,™,„,,”.
Running it through the above code turned it back into Ä,Ü,Ö,ä,ü,ö.
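Note, my addition rather than part of the original answer: on the .NET Framework code page 437 is available out of the box, but on .NET Core / .NET 5+ it is not registered by default, so Encoding.GetEncoding(437) throws unless you add the System.Text.Encoding.CodePages package and register the provider once at startup:

// Requires the System.Text.Encoding.CodePages NuGet package.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);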
Related
I have a CSV file which contains Latin characters outside the ASCII range (byte values > 127). The file can be uploaded with any encoding and shows the right data after uploading, but it is converted to UTF-8 automatically after operations are performed on it.
After that conversion to UTF-8 I am no longer able to see the same characters.
I believe that if I upload the files with UTF-8 encoding only, then I will see the same characters that were present when the file was uploaded. So I want to encode the file with UTF-8.
I get an IFormFile from the function. I tried these methods to change the encoding, but they do not affect the file in any way.
First method
// 'file' is the IFormFile
string[] filecontent;
StreamReader sr = new StreamReader(file.FileName);
string data = sr.ReadLine();
filecontent = data.Split(",");
File.WriteAllLines(file.FileName, filecontent, Encoding.UTF8);
Second method
var fileStream2 = File.OpenWrite(file.FileName);
var sw = new StreamWriter(fileStream2, Encoding.UTF8, 1024, false);
sw.Write(fileStream2);
sw.Close();
Is there any other method to do this, or another library that can encode the CSV file as UTF-8 directly?
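Not from the original thread, but one possible approach, sketched here under the assumption that the source encoding is either announced by a BOM or known up front (Windows-1252 is used purely as an example fallback, and SaveAsUtf8 is a hypothetical helper name): decode the uploaded stream with that source encoding and write the text back out as UTF-8.

using System.IO;
using System.Text;
using Microsoft.AspNetCore.Http;

// Hypothetical helper: decode the uploaded file with its source encoding
// and save it as UTF-8.
public static void SaveAsUtf8(IFormFile file, string outputPath)
{
    // Assumption: replace Windows-1252 with whatever encoding the uploads
    // actually use when no BOM is present.
    Encoding sourceEncoding = Encoding.GetEncoding(1252);

    using (var reader = new StreamReader(file.OpenReadStream(), sourceEncoding,
                                         detectEncodingFromByteOrderMarks: true))
    {
        string content = reader.ReadToEnd();
        File.WriteAllText(outputPath, content, Encoding.UTF8);
    }
}

The key difference from the two attempts above is that the text is decoded from the upload stream itself (file.OpenReadStream()) rather than from file.FileName, and the decoded string, not the stream object, is what gets written out.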
I am building an app that downloads a CSV file in plain text from an e-mail server and writes it to the local file system. I am developing this app in C# using .NET Core 3.1.
The problem is that I don't know the encoding of the files I am receiving, so I decided to use the StreamReader class to convert the bytes I downloaded from the e-mail into a string.
Here is the code:
foreach (var data in loadedData)
{
if (IsValidData(data))
{
logger.Info($"Writing data from: {data.FileName}");
using var stream = new MemoryStream(data.FileContent);
using var reader = new StreamReader(stream, true);
var csvData = new CSVData
{
FileName = data.FileName,
FileContent = reader.ReadToEnd(),
};
dataWriter.WriteData(csvData);
logger.Info($"Writing data from: {data.FileName} was successfully written");
}
else
{
logger.Warn($"Invalid format: {data.FileName}");
}
}
And to write the data to the actual files I am using:
public void WriteData(CSVData data)
{
logger.Debug($"Writing received file: {data.FileName}");
var outputDir = config.GetReceivedFilesPath();
string fileName = this.GetOutputPath(data.FileName, outputDir);
Directory.CreateDirectory(outputDir);
using var writer = new StreamWriter(fileName, false, Encoding.UTF8);
writer.Write(data.FileContent);
logger.Debug($"The received data was successfully written to: {data.FileName}");
}
The problem is that some of the files I am receiving are encoded in UTF-16 (I believe this is the encoding being used, because there is a \0 after each character), but the StreamReader interprets the file as UTF-8: the reader.CurrentEncoding property returns UTF-8.
The end result is that instead of having my files output as UTF-8, my app outputs them as UTF-16, even though I explicitly specified UTF-8 as the output encoding.
What am I doing wrong?
You might be able to use this method
File.ReadAllText(string path, System.Text.Encoding encoding)
Based on the documentation, it tries to detect the encoding automatically.
The text below is from the documentation:
This method opens a file, reads all the text in the file, and returns it as a string. It then closes the file.
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
The file handle is guaranteed to be closed by this method, even if exceptions are raised.
To use the encoding settings as configured for your operating system, specify the Encoding.Default property for the encoding parameter.
The full documentation can be found here.
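BOM detection only helps when the sender actually writes one, though. For the BOM-less UTF-16 case described above (every other byte is \0), one possible heuristic, my own sketch rather than part of this answer, is to sample the raw bytes for NUL characters before choosing the decoder (GuessEncoding is a hypothetical helper name):

using System;
using System.Text;

// Rough heuristic, not a guaranteed detector: BOM-less UTF-16LE text that is
// mostly Latin script contains a NUL byte after almost every character.
static Encoding GuessEncoding(byte[] content)
{
    // Honor a BOM first if one is present.
    if (content.Length >= 3 && content[0] == 0xEF && content[1] == 0xBB && content[2] == 0xBF)
        return Encoding.UTF8;
    if (content.Length >= 2 && content[0] == 0xFF && content[1] == 0xFE)
        return Encoding.Unicode;            // UTF-16 little-endian
    if (content.Length >= 2 && content[0] == 0xFE && content[1] == 0xFF)
        return Encoding.BigEndianUnicode;   // UTF-16 big-endian

    // No BOM: if a large share of the sampled bytes are NUL, assume UTF-16LE.
    int sample = Math.Min(content.Length, 4096);
    int zeros = 0;
    for (int i = 0; i < sample; i++)
        if (content[i] == 0) zeros++;

    return (sample > 0 && zeros > sample / 3) ? Encoding.Unicode : Encoding.UTF8;
}

With something like this in place, the reader in the loop above could be constructed as new StreamReader(stream, GuessEncoding(data.FileContent)), and the StreamWriter would then emit genuine UTF-8.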
I need to open a .prn file and replace some strings.
In the .prn file I included an image, which contains a string like this:
When I open the .prn file, C# is not able to read the string as it is.
It probably needs a specific encoding, but I am not sure which one.
I tried different encodings, but without success.
Here is the code that opens the file in read mode:
string text = File.ReadAllText(root + @"testImage.prn");
C# reads that string this way:
and I'm not able to print the file with the image included.
Thanks in advance for your help.
Most PRN files use an ISO encoding (typically ISO-8859-1). So try reading the file with System.IO.StreamReader, explicitly specifying that encoding.
The following example worked perfectly in my case:
System.Text.Encoding encoding = System.Text.Encoding.GetEncoding("ISO-8859-1");
string text;
using (System.IO.StreamReader sr = new System.IO.StreamReader(path, encoding))
{
text = sr.ReadToEnd();
}
In Java, it worked this way for me, using a Stream and the ISO-8859-1 charset:
Stream<String> stream = Files.lines(Paths.get(filePath), Charset.forName("ISO-8859-1"));
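For completeness, the same read can also be done in a single call in C#, equivalent to the StreamReader version above (path is the same file path used there):

string text = System.IO.File.ReadAllText(path, System.Text.Encoding.GetEncoding("ISO-8859-1"));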
I have been having issues reading a file that contains a mix of Arabic and Western text. I read the file into a TextBox as follows:
tbx1.Text = File.ReadAllText(fileName.Text, Encoding.UTF8);
No matter what value I tried instead of Encoding.UTF8, I got garbled characters displayed in place of the Arabic. The Western text was displayed fine.
I thought it might have been an issue with the way the TextBox was defined, but on start up I write some mixed Western/Arabic text to the textbox and this displays fine:
tbx1.Text = "Start السلا عليكم" + Environment.NewLine + "Here";
Then I opened Notepad and copied the above text into it, then saved the file, at which point Notepad's save dialog asked which encoding to use.
I then presented the saved file to my code and it displayed all the content correctly.
I examined the file and found 3 binary bytes at the beginning (not visible in Notepad): EF BB BF.
These 3 bytes, I subsequently found through research, represent the BOM, and their presence is what enables File.ReadAllText(fileName.Text, Encoding.UTF8) to read/display the data as desired.
What puzzles me is that specifying Encoding.UTF8 should already take care of this.
The only way I can think of is to code up a step that adds these bytes to a copy of the file, then process that file. But this seems rather long-winded. I am just wondering if there is a better way to do this, or why Encoding.UTF8 is not yielding the desired result.
Edit:
Still no luck despite trying the suggestion in the answer.
I cut the test data down to just the Arabic text, as follows:
Code as follows:
FileStream fs = new FileStream(fileName.Text, FileMode.Open);
StreamReader sr = new StreamReader(fs, Encoding.UTF8, false);
tbx1.Text = sr.ReadToEnd();
sr.Close();
fs.Close();
Tried with both "true" and "false" on the 2nd line, but both give the same result.
If I open the file in Notepad++, and specify the Arabic ISO-8859-6 Character set it displays fine.
Here is what it looks like in Notepad++ (and what I would like the textbox to display):
Not sure if the issue is in the reading from file, or the writing to the textbox.
I will try inspecting the data post read to see. But at the moment, I'm puzzled.
The StreamReader class has a constructor that will take care of testing for the BOM for you:
using (var stream = new FileStream(fileName.Text, FileMode.Open, FileAccess.Read))
{
    using (var sr = new StreamReader(stream, Encoding.UTF8, true))
    {
        var text = sr.ReadToEnd();
    }
}
The final true parameter is detectEncodingFromByteOrderMarks:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
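Worth noting (my addition, not part of the quoted documentation): this detection only works when the file actually starts with a BOM. If the file was saved in a legacy code page such as the Arabic ISO-8859-6 that Notepad++ shows in the edit above, StreamReader cannot detect that, and the encoding has to be passed explicitly, for example (on .NET Core / .NET 5+ this code page again requires the CodePages encoding provider):

tbx1.Text = File.ReadAllText(fileName.Text, Encoding.GetEncoding("ISO-8859-6"));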
I have a very simple C# console app that reads through a text file and outputs the same file, but with a particular string replaced on each line where it appears, using StreamReader and StreamWriter. I do not know the encoding of the source file. I have encountered a situation where there is a character in the file (extended ASCII decimal 166, the broken pipe ¦) that gets "mangled" when run through this app with the default encoding (in the output file it ends up as a "box" character). Since I do not know the source file's encoding, I tried multiple options to see which would produce an unaltered result, and oddly the only combination that works is reading as UTF-7 and writing as UTF-8.
UTF-7 to UTF-7 causes problems such as & changing to +AC. UTF-8 to UTF-8 (which I believe is the default) converts the character in question to the "box". ASCII to ASCII turns it into ?. Unicode to Unicode results in gibberish. Shouldn't reading and writing with the same encoding produce the same results? Simplified code example below:
using (var fileStream = new FileStream(fileName, FileMode.Open))
using (var fileReader = new StreamReader(fileStream,Encoding.UTF7))
using (var fileStreamOut = new FileStream(tempFileName,FileMode.Create))
using (var fileWriter = new StreamWriter(fileStreamOut,Encoding.UTF8))
{
while (!fileReader.EndOfStream)
{
var inputLine = fileReader.ReadLine();
if (inputLine != null)
{
inputLine = inputLine.Substring(0, 3) + newRdfi + inputLine.Substring(12);
fileWriter.WriteLine(inputLine);
}
}
fileWriter.Flush();
}
After receiving clarification on how the file is created from the developer of the source system, and knowing which server it is produced on, I concluded that the encoding was Windows-1252. Changing my read and write streams to use Encoding.GetEncoding(1252) resulted in all characters reading and outputting as expected.
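A minimal sketch of that fix, reusing fileName, tempFileName and newRdfi from the snippet above (and assuming the input really is Windows-1252; on .NET Core the code page has to be registered via the CodePages encoding provider first):

// Read and write with Windows-1252 so the 0xA6 (broken pipe) character
// round-trips unchanged.
Encoding win1252 = Encoding.GetEncoding(1252);

using (var fileReader = new StreamReader(fileName, win1252))
using (var fileWriter = new StreamWriter(tempFileName, false, win1252))
{
    string inputLine;
    while ((inputLine = fileReader.ReadLine()) != null)
    {
        fileWriter.WriteLine(inputLine.Substring(0, 3) + newRdfi + inputLine.Substring(12));
    }
}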