C#: Characters do not display well in the Console, why?

The screenshot below shows the problem (image link): http://img133.imageshack.us/img133/4206/accentar9.png
The variable textInput comes from File.ReadAllText(path); and characters like é and è do not display. When I run my unit test, everything is fine and I can see them. Why?

The .NET classes (System.IO.StreamReader and the like) use UTF-8 as the default encoding. If you want to read a different encoding you have to pass it explicitly to the appropriate constructor overload.
Also note that there's not one single encoding called “ANSI”. You're probably referring to the Windows codepage 1252 aka “Western European”. Notice that this is different from the Windows default encoding in other countries. This is relevant when you try to use System.Text.Encoding.Default because this actually differs from system to system.
/EDIT: It seems you misunderstood both my answer and my comment:
The problem in your code is that you need to tell .NET what encoding you're using.
The other remark, saying that “ANSI” may refer to different encodings, didn't have anything to do with your problem. It was just a “by the way” remark to prevent misunderstandings (well, that one backfired).
So, finally: The solution to your problem should be the following code:
string text = System.IO.File.ReadAllText("path", Encoding.GetEncoding(1252));
The important part here is the usage of an appropriate System.Text.Encoding instance.
However, this assumes that your encoding is indeed Windows-1252 (but I believe that's what Notepad++ means by “ANSI”). I have no idea why your text gets displayed correctly when read by NUnit. I suppose that NUnit either has some kind of autodiscovery for text encodings or that NUnit uses some weird defaults (i.e. not UTF-8).
Oh, and by the way: “ANSI” really refers to the “American National Standards Institute”. There are a lot of completely different standards that have “ANSI” as part of their names. For example, C++ is (among others) also an ANSI standard.
Only in some contexts is it (imprecisely) used to refer to the Windows encodings. But even there, as I've tried to explain, it usually doesn't refer to a specific encoding but rather to a class of encodings that Windows uses as defaults for different countries. One of these is Windows-1252.
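To make the difference concrete, here is a small sketch (file path hypothetical) of three ways the same bytes can be decoded:
using System;
using System.IO;
using System.Text;

class ReadDemo
{
    static void Main()
    {
        string path = "input.txt"; // hypothetical

        // Which decoder is applied to the bytes is the whole question:
        string asUtf8    = File.ReadAllText(path);                             // default: UTF-8
        string asDefault = File.ReadAllText(path, Encoding.Default);           // varies from system to system
        string as1252    = File.ReadAllText(path, Encoding.GetEncoding(1252)); // explicit Windows-1252

        Console.WriteLine(asUtf8);
        Console.WriteLine(asDefault);
        Console.WriteLine(as1252);
    }
}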

Try setting your console session's output code page using the chcp command. The code pages supported by Windows are listed here, here, and here. Remember, fundamentally the console is pretty simple: it displays Unicode or DBCS characters by using a code page to determine the glyph that will be displayed.
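If you would rather control this from code than with chcp, setting Console.OutputEncoding changes the console's output code page in much the same way; a minimal sketch (code page 1252 assumed; on .NET Core/.NET 5+ you would first have to register CodePagesEncodingProvider):
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // Roughly equivalent to running "chcp 1252" before starting the program:
        Console.OutputEncoding = Encoding.GetEncoding(1252);
        Console.WriteLine("é è à");
    }
}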

I do not know why it works with NUnit, but I opened the file with Notepad++ and it showed ANSI as the format. I converted it to UTF-8 and now it works.
I am still wondering why it worked with NUnit and not in the console, but at least it works now.
Update
I do not get why I am being downvoted on the question and on this answer, because the question is still valid: why can I read an ANSI file with NUnit but not in a console?

Related

How to show emoji in c# console output?

I have a problem outputting emoji in the console.
A string built from a Unicode escape, like "\u263A", works well.
However, if I simply copy and paste an emoji into a string, like "🎁", it does not work.
Test code below:
using System;
using System.Text;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.OutputEncoding = Encoding.UTF8;
            string s1 = "🎁";
            string s1_uni = "\ud83c\udf81"; // unicode code for s1
            string s2 = "☺";
            string s2_uni = "\u263A"; // unicode code for s2
            Console.WriteLine(s1);
            Console.WriteLine(s1_uni);
            Console.WriteLine(s2);
            Console.WriteLine(s2_uni);
            Console.ReadLine();
        }
    }
}
s2 and s2_uni are output successfully, while s1 and s1_uni are not.
I want to know how to fix this problem.
By the way, the font applied is 'Consolas', which works perfectly in Visual Studio.
Update:
Please note that I did some searching on Stack Overflow before asking this question. The most common advice is to set the console encoding to UTF-8, which is done in the first line of Main.
That approach (Console.OutputEncoding = Encoding.UTF8) does not fully cover the situation I presented.
Also, the reason I mention the console font in the question is to make clear that the Consolas font shows emoji perfectly in VS but fails in the console. The first emoji failed to show.
Please do not close this question. Thanks.
Update2:
This emoji can be shown in the VS terminal.
Update3:
Thanks to Peter Duniho for the help. And you are right.
While we were discussing, I looked through the Microsoft document on Unicode support for the console:
Display of characters outside the Basic Multilingual Plane (that is, of surrogate pairs) is not supported, even if they are defined in a linked font file.
The code point of the emoji that can't be shown in the console is outside the BMP, and the console does not support displaying code points outside the BMP. Therefore, this emoji is not shown.
To find a running context that does support this emoji, I did some experiments.
I tried CMD, PowerShell, and Windows Terminal (screenshots omitted). You can see that Windows Terminal supports it.
Strictly speaking, the problem I met is not a duplicate of another Stack Overflow question, because my code already does everything that can be done in code. The problem is the running context, not the code.
Thanks again to Peter Duniho for the help.
The current Windows command line console, cmd.exe, still uses GDI+ to render text. And the GDI+ API it uses does not correctly handle combining/surrogate pair characters like the emoji you want to display.
This is true even when using a font that includes the glyph for the character you want, and even when you have correctly set the output encoding for the Console class to a Unicode encoding (both of which you've done in your example).
Microsoft appears to be working on improvements to the command prompt code, to upgrade it to use the DirectWrite API instead of GDI+. If and when these improvements are released, the console window should be able to display your emoji correctly. See Github issue UTF-8 rendering woes #75
In the meantime, you can run your program in a context that is able to render these characters correctly, such as Windows Terminal or PowerShell.
Additional details regarding the limitations of the GDI+ font rendering can be found in Github issues Add emoji support to Windows Console #190 and emoji/unicode support mostly broken in windows #2693 (the latter isn't about a Windows component per se, but still relates to this problem).
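If you want to detect up front whether a string contains code points outside the BMP (and may therefore hit this limitation), a minimal sketch:
using System;

class BmpCheck
{
    // True if any code point in s lies outside the Basic Multilingual
    // Plane, i.e. is stored as a surrogate pair in the UTF-16 string.
    static bool ContainsNonBmp(string s)
    {
        for (int i = 0; i < s.Length; i++)
            if (char.IsSurrogatePair(s, i))
                return true;
        return false;
    }

    static void Main()
    {
        Console.WriteLine(ContainsNonBmp("🎁")); // True  (U+1F381, outside the BMP)
        Console.WriteLine(ContainsNonBmp("☺"));  // False (U+263A, inside the BMP)
    }
}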

What does "Beta: Use Unicode UTF-8 for worldwide language support" actually do?

In some Windows 10 builds (insiders starting April 2018 and also "normal" 1903) there is a new option called "Beta: Use Unicode UTF-8 for worldwide language support".
You can see this option by going to Settings and then:
All Settings -> Time & Language -> Language -> "Administrative Language Settings"
This is what it looks like (screenshot omitted).
When this checkbox is checked I observe some irregularities (below) and I would like to know what exactly this checkbox does and why the below happens.
Create a brand new Windows Forms application in Visual Studio 2019. On the main form, specify the Paint event handler as follows:
private void Form1_Paint(object sender, PaintEventArgs e)
{
    Font buttonFont = new Font("Webdings", 9.25f);
    TextRenderer.DrawText(e.Graphics, "0r", buttonFont, new Point(), Color.Black);
}
Run the program; here is what you will see if the checkbox is NOT checked (screenshot omitted).
However, if you check the checkbox (and reboot as asked) this changes (screenshot omitted).
You can look up the Webdings font on Wikipedia. According to the character table given there, the codes for these two characters are "\U0001F5D5\U0001F5D9". If I use them instead of "0r", it works with the checkbox checked, but with the checkbox unchecked it now looks wrong (screenshot omitted).
I would like to find a solution that always works, regardless of whether the box is checked or unchecked.
Can this be done?
You can see it in ProcMon.
It seems to set the REG_SZ values ACP, MACCP, and OEMCP under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage to 65001.
I'm not entirely sure but it might be related to the variable gAnsiCodePage in KernelBase.dll, which GetACP reads. If you really want to, you might be able to change it dynamically for your program regardless of the system setting by dynamically disassembling GetACP to find the instruction sequence that reads gAnsiCodePage and obtaining a pointer to it, then updating the variable directly.
(Actually, I see references to an undocumented function named SetCPGlobal that would've done the job, but I can't find that function on my system. Not sure if it still exists.)
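If you want to inspect those values from C# rather than ProcMon, a small sketch using the Microsoft.Win32 registry API (on .NET Core this needs the Windows-only registry package):
using System;
using Microsoft.Win32;

class CodePageDump
{
    static void Main()
    {
        const string path = @"SYSTEM\CurrentControlSet\Control\Nls\CodePage";
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(path))
        {
            // All three read "65001" when the beta checkbox is enabled.
            Console.WriteLine("ACP   = " + key?.GetValue("ACP"));
            Console.WriteLine("MACCP = " + key?.GetValue("MACCP"));
            Console.WriteLine("OEMCP = " + key?.GetValue("OEMCP"));
        }
    }
}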
Most Windows C APIs come in two different variants:
"A" variant that uses 8-bit strings with whatever the systems configured encoding is. This varies depending on the configured country/language.
(Microsoft calls the configured encoding the "ANSI Code Page", but it's not really anything to do with ANSI).
"W" variant that uses 16-bit strings in a fixed almost-UTF-16 encoding. (The "almost" is because "unpaired surrogates" are allowed; if you don't know what those are then don't worry about them).
The official Microsoft advice is not to use the "A" versions, but to ensure your code always uses the "W" variants. That way you're supposed to get consistent behaviour no matter what the user's country/language is configured as.
However, it looks like that checkbox is doing more than one thing. It's clear it's supposed to change the "ANSI Code Page" to 65001, which means UTF-8. It looks like it's also changing font rendering to be more Unicody.
I suggest you detect if GetACP() == 65001, then draw the Unicode version of your strings, otherwise draw the old "0r" version. I'm not sure how you do that from .NET...
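As for doing that from .NET, a minimal sketch using P/Invoke (my assumption of one workable approach, not part of the answer above):
using System;
using System.Runtime.InteropServices;

class GlyphPicker
{
    // GetACP is a documented kernel32 function returning the active ANSI code page.
    [DllImport("kernel32.dll")]
    static extern uint GetACP();

    // Pick the string variant that renders correctly for the current code page.
    static string WebdingsText()
    {
        return GetACP() == 65001
            ? "\U0001F5D5\U0001F5D9"  // when the UTF-8 beta option is on
            : "0r";                   // when a legacy ANSI code page is active
    }
}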
Please look at this question to see what it solves when it is enabled: How to save to file non-ascii output of program in Powershell?
Also I found explanation written by Ghisler helpful (source):
If you check this option, Windows will use codepage 65001 (Unicode UTF-8) instead of the local codepage like 1252 (Western Latin1) for all plain text files. The advantage is that text files created in e.g. a Russian locale can also be read in other locales like Western or Central Europe. The downside is that ANSI-only programs (most older programs) will show garbage instead of accented characters.
I leave here two ways to enable it, I think they will be helpful for many users:
Win+R -> intl.cpl
Administrative tab
Click the Change system locale button.
Enable Beta: Use Unicode UTF-8 for worldwide language support
Reboot
or alternatively via reg file:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"
On my Windows machine, when I checked Beta: Use Unicode UTF-8 for worldwide language support, the following registry values in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage changed:
ACP: 936 -> 65001
MACCP: 10008 -> 65001
OEMCP: 936 -> 65001
If I leave it unchecked, the Visual Studio compilation fails with Exception: Bad UTF-8 encoding (U+FFFD; REPLACEMENT CHARACTER) found while decoding string: .... If I check it, the compilation succeeds, but the OS is full of unreadable text.

Visual C# Character Encoding Mix Up

I've been working on a song metadata program for a short while now, and have run into what seems like quite a simple but challenging issue.
As the program reads in some pieces of data from songs, it may come across an unusual symbol, such as ℗. I noticed that some things happened that I didn't expect, so I stepped through my code line by line. I noticed that said character was actually being read as "â„—". I tried this with another character, Ü, to see the result, and this time I got "Ãœ".
I did a quick Google search which led me to this page:
http://www.i18nqa.com/debug/utf8-debug.html
From this I would assume that my desired characters are actually being interpreted using the mentioned Windows-1252 encoding, and not UTF-8?
To investigate further, I tried writing the Windows-1252 versions into the metadata using my program, and these came out in Windows itself as the correct ℗ and Ü characters. Can anybody tell me how to ensure that both read and write functions use UTF-8, and also why this really happens if I was not correct? Thanks.
I am using C#.
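For what it's worth, "â„—" is exactly what you get when the UTF-8 bytes of ℗ (E2 84 97) are decoded as Windows-1252, which supports the assumption above. A minimal sketch of the diagnosis and repair (assuming the underlying bytes really are UTF-8):
using System;
using System.Text;

class MojibakeRepair
{
    static void Main()
    {
        // Re-encode the garbled text back to its original bytes using the
        // wrong decoder (Windows-1252), then decode those bytes as UTF-8.
        byte[] raw = Encoding.GetEncoding(1252).GetBytes("â„—"); // E2 84 97
        Console.WriteLine(Encoding.UTF8.GetString(raw));          // "℗"
    }
}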

µ and é in namespace

We have developed a C# program. The program is distributed in Europe without problems on miscellaneous hardware configurations. Some of the namespaces in our program contain a 'µ' or an 'é' character. When deploying our program on 'non-European' (i.e. Chinese or some US) systems, a problem occurs: somewhere in the process the 'µ' is changed into 'Âµ', causing lots of problems. What is causing this problem, and how can we work around it (preferably without changing the name of the namespace)?
edit 2015.08.07
Thanks all for your comments, but to clarify: the source files are not distributed as such. The program is compiled to an exe and then distributed using NSIS. Source control is done using SVN. How can I verify the presence of the BOM in my source files?
Either you or the recipient or both are using a character encoding other than UTF-8.
People shouldn't do that, but they do.
Some tools will default to a legacy encoding unless you include a BOM at the start of each file, so include a BOM at the start of each file.
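As for the edit's follow-up question on verifying the BOM: the UTF-8 BOM is the three bytes EF BB BF at the start of the file. A minimal sketch of a check you could run over the source tree:
using System;
using System.IO;

class BomCheck
{
    // True if the file starts with the UTF-8 byte order mark (EF BB BF).
    static bool HasUtf8Bom(string path)
    {
        byte[] buf = new byte[3];
        using (FileStream fs = File.OpenRead(path))
            return fs.Read(buf, 0, 3) == 3
                && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
    }

    static void Main(string[] args)
    {
        foreach (string path in args)
            Console.WriteLine(path + ": " + (HasUtf8Bom(path) ? "UTF-8 BOM" : "no UTF-8 BOM"));
    }
}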
You are hitting a difference in the character sets used by the different systems. Your software was probably running on systems assuming ISO-8859, most often used in European languages, while the Chinese and US systems you are encountering are probably using the Universal Character Set (ISO/IEC 10646). The mapping between the two is not a simple 1-to-1, so you run into the problems you are having. W3.org has a good article on this topic at http://www.w3.org/International/articles/definitions-characters/
Pay special attention to the sections on "Character sets, coded character sets, and encodings", and "The Document Character Set". If this is a web app, "The HTTP Header" might be particularly useful.
This isn't an answer, but I want to point out that this can be an encoding problem even on the same system, showing up only when running code that (my guess here) reads a byte at a time, as opposed to explicitly reading text in a particular encoding.
I have a C program (32-bit, if that matters) which reads in a file using fgetc and saves characters to be used as "illegal" characters in names. It isn't fancy, just meant to prevent a few ASCII characters, like an apostrophe, from coming in accidentally in the name of a thing/object/label. Someone asked me to test µ (mu, which appears as a single character in this interface to Stack Overflow). I generated it (without examining the underlying encoding in MS Word) using Insert-Symbol in MS Word, cut it from MS Word, and inserted it into a text file using Notepad++. In Notepad++ and MS Word it seems to be the same symbol. BUT fgetc, taking one int/char at a time, sees this in my debug output for a test case:
About to check for illegal characters in =>NameOfItemµ<=
Illegal character =>Â<= was found. Illegal characters are: '`µ
Illegal character =>µ<= was found. Illegal characters are: '`µ
I am compiling with Visual C++ Express 2013.
I'm happy it catches the illegal characters, and hope this isn't just noise to readers of this topic.
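For illustration, the same effect reproduced in C# (my sketch, not the poster's C code): µ is U+00B5, which is two bytes (C2 B5) in UTF-8, and viewed byte-by-byte those map to Â and µ under Latin-1/Windows-1252, matching the debug output above:
using System;
using System.Text;

class ByteAtATime
{
    static void Main()
    {
        byte[] mu = Encoding.UTF8.GetBytes("µ"); // { 0xC2, 0xB5 }
        foreach (byte b in mu)
            Console.WriteLine((char)b);          // prints 'Â' then 'µ'
    }
}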

How to hex edit an exe file safely?

I am working on a small puzzle/wargame which involves coding Windows Forms in C#.
To get to a certain level I need a password which is stored in an exe. The same exe allows me to send that password to a default person, which is stored in a variable. The password sending is accomplished by updating the given user's data in a MySQL database.
The challenge is that a user should hex edit the exe and change the default recipient to the desired username. But when I hex edited the file, put in the desired username, and tried to run it, it showed the error "x.exe is not a valid win32 application".
Is there a way to safely hex edit a file without encountering this error? Or is there a way to modify the source so that just one variable can be safely edited using a hex editor?
Editing a PE image in hex is going to be difficult, since you will need to update various parts of the PE image if you change the length of a section, and if the EXE is signed you would also invalidate the signature. The PE image spec can be found here if you want to understand all the fields you will need to update. If you want a nice UI, I would use something like CFF Explorer to edit the PE image correctly.
You could also use ildasm (only for .NET assemblies) to disassemble the EXE, edit the IL, and then use ilasm to reassemble and run it. This would eliminate the need to edit the PE image and be safer.
Assuming this is not an illegal alteration of an executable... (It sounds like a challenge in a contest, the way you have it worded.)
Most likely your change caused the program to no longer be able to verify the checksum. If you wish to successfully alter the exe, you need to recalculate the checksum. (This is just one possible explanation for why the exe was corrupted.)
Altering a compiled executable and having it work is tricky to say the least. It's a pretty advanced topic and not likely something that can be answered fully here.
When I was doing something similar before, I remember I had to replace variables with same-length strings for it to work properly, e.g. "someone@example.com" could be replaced with "another@example.net" or "myname@anexample.us". If you're using Gmail this would be easier, because "mynameis@gmail.com" is the same as "my.name.is...+slim.shady@gmail.com".
Though, I think @David Stratton's idea is probably more relevant to exes. I'm pretty sure the files I edited were just data files (it was a long time ago), but I know everything worked for me then as long as I didn't add or remove any bytes in the middle of the file.
When modifying strings inside EXE/DLL files it is important that the length of the string you are editing is kept the same. For example, if I changed "Hello User" to "Welcome User", I would overrun the original string's space by 2 bytes.
This will result in an error. For the edit to succeed, the modified string must not overflow the string you are replacing.
TLDR;
If the string you are editing is 12 characters long, you can only change 12 characters in total.
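As a sketch of what a same-length patch can look like in code (file name and strings hypothetical; note that string literals in .NET assemblies are stored as UTF-16, so the byte pattern you search for must match how the string is actually stored):
using System;
using System.IO;
using System.Text;

class SameLengthPatch
{
    static void Main()
    {
        byte[] image = File.ReadAllBytes("x.exe");

        // The replacement must be exactly as long as the original.
        byte[] find = Encoding.Unicode.GetBytes("someone@example.com");
        byte[] repl = Encoding.Unicode.GetBytes("another@example.net");
        if (find.Length != repl.Length)
            throw new InvalidOperationException("lengths must match");

        // Naive byte search; fine for a one-off patch tool.
        for (int i = 0; i <= image.Length - find.Length; i++)
        {
            int j = 0;
            while (j < find.Length && image[i + j] == find[j]) j++;
            if (j == find.Length)
                Array.Copy(repl, 0, image, i, repl.Length);
        }

        File.WriteAllBytes("x-patched.exe", image);
    }
}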
