Handling special characters in C#

In a C# console app, I am using a StringBuilder to write data to a local file. It seems to be mishandling special characters:
Muñoz
outputs to the file as
Muñoz
I'm at a bit of a loss as to how to handle this correctly.

Your C# code is correctly writing a UTF-8 file, in which ñ is encoded as two bytes (0xC3 0xB1).
You're incorrectly reading the file back in a different encoding (most likely Windows-1252), which shows those two bytes as the two unwanted characters ñ.
You need to read the file as UTF-8.
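A minimal sketch of the fix (the file name is a placeholder; this assumes the file was written with UTF-8, which is the default for StreamWriter and File.WriteAllText):
// Read the bytes back as UTF-8 so that ñ decodes to a single character again.
string text = File.ReadAllText("output.txt", Encoding.UTF8);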

Related

C# replaces special characters with question marks

I'm having a problem with encoding in C#.
I'm downloading an XML file encoded in Windows-1250, and when it is saved to a file, special characters like Š and Đ are replaced with ?, even though the file is, as far as I can tell, saved using the Windows-1250 encoding.
This is an example of my code (simplified):
var res = Encoding.GetEncoding("Windows-1250").GetBytes(client.DownloadString("http://link/file.xml"));
var result = Encoding.GetEncoding("Windows-1250").GetString(res);
File.AppendAllText("file.xml", result);
The XML file is in fact encoded using Windows-1250, and it reads just fine when I download it in the browser.
Does anyone know what's going on here?
The problem could result from two different sources, one at the beginning and one at the end of your snippet.
And as has been pointed out, the encoding and decoding you are doing in your code is actually useless, because the origin (what DownloadString returns) and the target (the variable result) are both C# Unicode strings.
Source 1: DownloadString
DownloadString may not have decoded the Windows-1250 response correctly, because either the server did not send the correct charset in the Content-Type header, or DownloadString doesn't support that code page (unlikely, but I'm not familiar with DownloadString).
Source 2: File.AppendAllText
The string was downloaded correctly, then encoded in memory to Windows-1250, then decoded to a Unicode string again and everything worked well.
But then it was written by File.AppendAllText in a different, default encoding. AppendAllText has an overload with a third parameter for the encoding; set it to Windows-1250 to actually write a file in Windows-1250 encoding.
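A minimal sketch addressing both potential sources (the URL is the one from the question; setting WebClient.Encoding only matters for source 1, when the server omits the charset):
var client = new WebClient();
client.Encoding = Encoding.GetEncoding("Windows-1250"); // source 1: how DownloadString decodes the response
string result = client.DownloadString("http://link/file.xml");
File.AppendAllText("file.xml", result, Encoding.GetEncoding("Windows-1250")); // source 2: write as Windows-1250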
Also, make sure that whatever editor you use to open the file uses the same encoding - this is often not very easy to guarantee, so I'd suggest you open it in a "developer-friendly" editor that lets you specify the encoding when opening a text file. (Vim, Emacs, Notepad++, Visual Studio, ...).

Reading accented characters (á) C# UTF-8 / Windows 1252

I am trying to read a file that has some letters that aren't showing up correctly when I convert the file to XML. The letters come up as blocks when I open the file in Notepad++, but in the original document they are correct. An example letter is á.
I am using UTF-8 to read the file, so the character should be covered, but for some reason it isn't. If I change the encoding to Windows-1252, the character shows up correctly.
Why is the character not readable as UTF-8 but fine as Windows-1252?
If you need any more information, just ask. Thanks in advance.
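For what it's worth: in Windows-1252, á is the single byte 0xE1; in UTF-8 that byte would have to start a three-byte sequence, so followed by an ordinary letter it is invalid, which is why a UTF-8 reader shows blocks or replacement characters. A minimal sketch of reading the file with the encoding observed to work (file name is a placeholder):
string text = File.ReadAllText("input.txt", Encoding.GetEncoding(1252));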

Encoding text file to appear on IBM Mainframe

I have a C++ program that sends data via FTP in ASCII mode to an IBM mainframe. I am now doing the same thing from C#.
When the file gets there and is viewed, it looks like garbage.
I cannot see anything in the C++ code that encodes the file into something like EBCDIC. When the C++ files are sent, they are viewed OK. The only difference I see is \015 and \012 for line ends in the C++ output, whereas C# is using \r\n.
Would these characters have an effect, and if so, how can I get my C# app to use \015?
Do I have to do any special encoding to make it appear OK?
It sounds like you should indeed be using an EBCDIC encoding, and then probably transferring the text in binary. I have an EBCDIC encoding class you can use, should you wish.
Note that \015\012 is \r\n - they're characters 13 and 10 in decimal (\015 and \012 are the octal notation), just different ways of writing the same thing. If you think the C++ code really is producing the same files as the C# code, compare two files that should be identical in a binary/hex editor.
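The EBCDIC encoding class mentioned above isn't reproduced here, but as a sketch of the same idea, .NET can expose EBCDIC code page 37 ("IBM037", US/Canada); on .NET Core/5+ this assumes the System.Text.Encoding.CodePages package is referenced:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // not needed on .NET Framework
Encoding ebcdic = Encoding.GetEncoding("IBM037");              // EBCDIC, code page 37
File.WriteAllBytes("upload.dat", ebcdic.GetBytes("HELLO MAINFRAME"));
// ...then FTP upload.dat in binary mode so the transfer doesn't re-convert it.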
Make sure the transfer uses TYPE A (text/ASCII) rather than TYPE I (binary) before you send the file.
If you are truly sending the files in ASCII mode, then the mainframe itself will convert them to EBCDIC (it's receiver-makes-good).
The fact that you're getting apparent garbage at the mainframe end, and can see the character codes \015 and \012 (which are CR and LF respectively) in the data, means that you're not actually transferring in ASCII mode.
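A minimal C# sketch of an ASCII-mode upload with FtpWebRequest (host, path and credentials are placeholders); UseBinary = false makes the client issue TYPE A, so the mainframe performs the ASCII-to-EBCDIC conversion:
var request = (FtpWebRequest)WebRequest.Create("ftp://mainframe.example.com/USER.DATA");
request.Method = WebRequestMethods.Ftp.UploadFile;
request.UseBinary = false; // ASCII mode (TYPE A) instead of binary (TYPE I)
request.Credentials = new NetworkCredential("user", "password");
using (var upload = request.GetRequestStream())
using (var file = File.OpenRead("local.txt"))
{
    file.CopyTo(upload);
}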
As an aside, the ISPF editor has been able to view ASCII data sets for quite a few versions now. Open up the file and enter the commands source ascii and lf.
The first converts the characters from ASCII to EBCDIC so you can see what they are; the second goes through and pads out "lines" so that line-feed markers are replaced with enough spaces to reach the record length.
Invaluable commands when dealing with mixed-encoding environments, which is where I do a lot of my work.

How to convert the encoding of a string to UTF-8 without knowing the original encoding in C#?

I'm reading a CSV file with Fast CSV Reader (on CodeProject). When I print the content of the fields, the console shows the character '?' in some words. How can I fix it?
The short version is that you have to know the encoding of any text file you're going to read up front. You could use things like byte order marks and other heuristics if you really aren't going to know, but you should always allow for the value to be tweaked (in the same way that Excel does if you're importing CSV).
It's also worth double-checking the values in the debugger, as it may be the output that is wrong rather than the reading - bear in mind that all strings are Unicode internally, and '?' characters suggest the conversion from Unicode to the console's code page is failing for those characters.
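A minimal sketch covering both suspects (file name is a placeholder): let StreamReader detect a BOM if one is present, and give the console an encoding that can represent the characters:
Console.OutputEncoding = Encoding.UTF8; // stop the console degrading characters to '?'
using (var reader = new StreamReader("data.csv", Encoding.Default, true)) // true = detect BOM
{
    Console.WriteLine(reader.ReadToEnd());
}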

How do I convert from a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But some of this output is becoming mangled; specifically, the symbol '£' is output as the Unicode U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows-1252 ('ANSI') encoded.
The question is,
How do I determine whether the file is encoded as Windows-1252 or UTF-8? It could be either, and
How do I convert it to UTF-8 if it is in Windows-1252, preserving symbols such as £?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert it to UTF-8, write it out to a file that you've opened with that encoding.
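A minimal sketch of that conversion (file names are placeholders):
using (var reader = new StreamReader("input.txt", Encoding.GetEncoding(1252)))
using (var writer = new StreamWriter("output.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}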
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are ones I created with .NET's StreamWriter, and they're in UTF-8 with the BOM. The other files I get are typically written by tools that don't understand Unicode or code pages, and I just treat them as single-byte text, which Encoding.Default handles well.
