Vietnamese character in .NET Console Application (UTF-8) - c#

I'm trying to write a UTF-8 (Vietnamese) string to the C# console, but with no success. I'm running on Windows 7.
I tried using the Encoding class to convert the string to char[], then to byte[], and back to a string, but that didn't help; the string comes directly from the database.
Here is some example
Tôi tên là Đức, cuộc sống thật vui vẻ
tuyệt vời
It does not show special characters like Đ or ứ; instead it shows ?, which is even worse than what I got with the Encoding class.
Can anyone try this out, or does anyone know about this problem?
My code
static void Main(string[] args)
{
    XDataContext _new = new XDataContext();
    Console.OutputEncoding = Encoding.GetEncoding("UTF-8");
    string srcString = _new.Posts.First().TITLE;
    Console.WriteLine(srcString);

    // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
    byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
    byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

    // Write the UTF-8 and ASCII encoded byte arrays.
    Console.WriteLine("UTF-8 Bytes: {0}", BitConverter.ToString(utf8String));
    Console.WriteLine("ASCII Bytes: {0}", BitConverter.ToString(asciiString));

    // Convert the UTF-8 and ASCII encoded bytes back to UTF-16 encoded
    // strings and write.
    Console.WriteLine("UTF-8 Text : {0}", Encoding.UTF8.GetString(utf8String));
    Console.WriteLine("ASCII Text : {0}", Encoding.ASCII.GetString(asciiString));
    Console.WriteLine(Encoding.UTF8.GetString(utf8String));
    Console.WriteLine(Encoding.ASCII.GetString(asciiString));
}
and here is the resulting output:
Nhà báo đi hội báo Xuân
UTF-8 Bytes: 4E-68-C3-A0-20-62-C3-A1-6F-20-C4-91-69-20-68-E1-BB-99-69-20-62-C3-
A1-6F-20-58-75-C3-A2-6E
ASCII Bytes: 4E-68-3F-20-62-3F-6F-20-3F-69-20-68-3F-69-20-62-3F-6F-20-58-75-3F-
6E
UTF-8 Text : Nhà báo đi hội báo Xuân
ASCII Text : Nh? b?o ?i h?i b?o Xu?n
Nhà báo đi hội báo Xuân
Nh? b?o ?i h?i b?o Xu?n
Press any key to continue . . .

class Program
{
    [DllImport("kernel32.dll")]
    static extern bool SetConsoleOutputCP(uint wCodePageID);

    static void Main(string[] args)
    {
        SetConsoleOutputCP(65001);
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("tést, тест, τεστ, ←↑→↓∏∑√∞①②③④, Bài viết chọn lọc");
        Console.ReadLine();
    }
}
Screenshot of the output (use Consolas or another font that has all the above characters):

You will need to set Console.OutputEncoding to match UTF-8.
Probably something like:
Console.OutputEncoding = System.Text.Encoding.UTF8;

Does the font you use in the Console window support the characters you are trying to display?

It is a problem with the cmd.exe console, which doesn't fully support Unicode. [Nothing to do with C#/.NET]
Try changing it to a GUI app if you can, or write to a file instead.

Related

Best way to transform string to valid encoding in C#

Sorry in advance if this is a duplicate or a simple question! I can't find the answer.
I'm working with a DLL made in Delphi. Data can be sent to a device using the DLL. However, when the data is sent, some strings are not accepted or are written blank. The data sent to the device is stored in a txt file, which was generated by a third-party program.
That is, I think the strings are in an unknown format. If I send them in UTF-8 format, the device receives all the information, but some strings still come out as ???? ????.
Many of my texts are in the Cyrillic alphabet.
What I did:
// string that send to device
[MarshalAsAttribute(UnmanagedType.LPStr, SizeConst = 36)]
public string Name;
When I did this, the device received only 10 out of 100 records.
If I encode with UTF-8:
byte[] bytes = Encoding.Default.GetBytes(getDvsName[1].ToString());
string res = Encoding.UTF8.GetString(bytes);
I got all the data this way, but too many strings came out as ??? ????.
Also i tried like this:
static private string Win1251ToUTF8(string source)
{
    Encoding utf8 = Encoding.GetEncoding("utf-8");
    Encoding win1251 = Encoding.GetEncoding("windows-1251");
    byte[] utf8Bytes = win1251.GetBytes(source);
    byte[] win1251Bytes = Encoding.Convert(win1251, utf8, utf8Bytes);
    source = win1251.GetString(win1251Bytes);
    return source;
}
All of the above methods did not help. How can I receive incoming information in the correct format? Are there other ways?
Hi there, here is what went wrong: you encoded the string with Encoding.Default instead of UTF-8.
string tom = "ටොම් හැන්ක්ස්";
byte[] bytes = Encoding.UTF8.GetBytes(tom);
string res = Encoding.UTF8.GetString(bytes);

How can I get bytes to display their character representations from different codepages?

I've been trying to create a consistent method to take the bytes of characters and display the byte representation in alternative text codepages. For example, hex D1 in Windows-1251, KOI8-U, etc. The idea is to take text that appears scrambled, because it is being interpreted and displayed in the wrong character set, and transform it to the correct display. Below is a shortened portion of the code I've used. I've gotten it to work on ideone, but I can't get it to work with Add-Type in PowerShell or when compiling with csc. I just get question marks or incorrect characters.
The output of the below code from ideone, which is the correct transformation, is:
D1-00-C1-00
СБ
windows-1251
When compiled with PowerShell or csc it is (incorrect):
D1-00-C1-00
?A
windows-1251
Is there a way to make this work in the Windows environment?
using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        string str = "ÑÁ";
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        Encoding enc = Encoding.GetEncoding(1251);
        char[] ca = enc.GetChars(bytes);
        Console.WriteLine(BitConverter.ToString(bytes));
        Console.WriteLine(ca);
        Console.WriteLine(enc.HeaderName);
    }
}
First of all, the best way to solve this problem is to avoid it—make sure that when you have bytes, you always know which character set was used to encode these bytes.
To answer the question: you can't. There is no consistent method to make this work anywhere. This will always involve guesswork.
What you see is a string which was encoded to bytes with some encoding, and then decoded using a different encoding. Here is how you fix these strings:
1. Figure out (or guess) which encoding was originally used to encode the string to bytes.
2. Figure out (or guess) which encoding was used when displaying the string.
3. Reverse the operations: encode the mojibake using the encoding from step (2) and decode the resulting bytes with the encoding from step (1).
If you already have bytes, you only do step (1) and use that decoding to decode the bytes to a string.
A program doing that would look like this:
using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        // our corrupted string
        string str = "ÑÁ";
        // encoding from step (2)
        Encoding enc1 = Encoding.GetEncoding(1252);
        byte[] bytes = enc1.GetBytes(str);
        // encoding from step (1)
        Encoding enc2 = Encoding.GetEncoding(1251);
        string originalStr = enc2.GetString(bytes);
        Console.WriteLine(originalStr);
    }
}
UPDATE/SOLUTION
As roeland notes, there's quite a bit of guesswork involved here. Solving this in C# on Windows also has two parts. The console's display encoding doesn't change automatically with the encoding object (it seems to on Mac with the Mono framework); it has to be set manually with SetConsoleCP and SetConsoleOutputCP. I also had to create multiple encodings and use an inner loop to get the proper intersection of codepages. The link below pointed toward the display issue's resolution.
UTF-8 output from PowerShell
Example below is focused on scenario where Russian is the suspected language.
CODE
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Language
{
    public class Test
    {
        // Imports to set the console display codepage
        [DllImport("kernel32.dll")]
        public static extern bool SetConsoleCP(int codepage);
        [DllImport("kernel32.dll")]
        public static extern bool SetConsoleOutputCP(int codepage);

        public static void Main()
        {
            string s = "ÑÁÅ";
            byte[] bytes = new byte[s.Length * sizeof(char)];
            System.Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
            Console.WriteLine(BitConverter.ToString(bytes));

            // produce possible combinations
            foreach (Encoding encw in Russian.GetCps())
            {
                bool cp = SetConsoleOutputCP(encw.CodePage);
                bool cp2 = SetConsoleCP(encw.CodePage);
                foreach (Encoding enc in Russian.GetCps())
                {
                    char[] ca = enc.GetChars(bytes);
                    Console.WriteLine(ca);
                }
            }
        }
    }

    public class Russian
    {
        public static Encoding[] GetCps()
        {
            // get applicable Cyrillic codepages
            Encoding[] russian = new Encoding[8];
            russian[0] = Encoding.GetEncoding(855);
            russian[1] = Encoding.GetEncoding(866);
            russian[2] = Encoding.GetEncoding(1251);
            russian[3] = Encoding.GetEncoding(10007);
            russian[4] = Encoding.GetEncoding(20866);
            russian[5] = Encoding.GetEncoding(21866);
            russian[6] = Encoding.GetEncoding(20880);
            russian[7] = Encoding.GetEncoding(28595);
            return russian;
        }
    }
}
The output is lengthy, but gives a string with the correct output as one member of a list.
I made a shorter version in PowerShell, which appears to change the display codepage automatically and requires fewer iterations:
function Get-Language ([string]$source) {
    $encodings = [System.Text.Encoding]::GetEncoding(855),
                 [System.Text.Encoding]::GetEncoding(866),
                 [System.Text.Encoding]::GetEncoding(1251),
                 [System.Text.Encoding]::GetEncoding(10007),
                 [System.Text.Encoding]::GetEncoding(20866),
                 [System.Text.Encoding]::GetEncoding(21866),
                 [System.Text.Encoding]::GetEncoding(20880),
                 [System.Text.Encoding]::GetEncoding(28595)
    $C = ""
    $bytes = gc $source -Encoding Byte
    for ($i = 0; $i -le $encodings.Length - 1; $i++) {
        $bytes | %{ $C = $C + $encodings[$i].GetChars($_) }
        Write-Host $C
        $C = ""
    }
}

Sending a string containing special characters through a TcpClient (byte[])

I'm trying to send a string containing special characters through a TcpClient (byte[]). Here's an example:
Client enters "amé" in a textbox
Client converts string to byte[] using a certain encoding (I've tried all the predefined ones plus some like "iso-8859-1")
Client sends byte[] through TCP
Server receives and outputs the string reconverted with the same encoding (to a listbox)
Edit :
I forgot to mention that the resulting string was "am?".
Edit-2 (as requested, here's some code):
@DJKRAZE, here's a bit of code:
byte[] buffer = Encoding.ASCII.GetBytes("amé");
(TcpClient)server.Client.Send(buffer);
On the server side:
byte[] buffer = new byte[1024];
Client.Receive(buffer);
string message = Encoding.ASCII.GetString(buffer);
ListBox1.Items.Add(message);
The string that appears in the listbox is "am?"
=== Solution ===
Encoding encoding = Encoding.GetEncoding("iso-8859-1");
byte[] message = encoding.GetBytes("babé");
Update:
Simply using Encoding.UTF8.GetBytes("ééé"); works like a charm.
Never too late to answer a question, I think; I hope someone will find answers here.
C# uses 16-bit chars, and ASCII truncates them to 8 bits to fit in a byte. After some research, I found UTF-8 to be the best encoding for special characters.
// data to send via TCP or any stream/file
byte[] string_to_send = UTF8Encoding.UTF8.GetBytes("amé");
// when receiving, pass the array to this to get the string back
string received_string = UTF8Encoding.UTF8.GetString(string_to_send);
Your problem appears to be the Encoding.ASCII.GetBytes("amé"); and Encoding.ASCII.GetString(buffer); calls, as hinted at by '500 - Internal Server Error' in his comments.
The é character is a multi-byte character which is encoded in UTF-8 with the byte sequence C3 A9. When you use the Encoding.ASCII class to encode and decode, the é character is converted to a question mark since it does not have a direct ASCII encoding. This is true of any character that has no direct coding in ASCII.
Change your code to use Encoding.UTF8.GetBytes() and Encoding.UTF8.GetString() and it should work for you.
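To make the fix concrete, here is a minimal sketch (names and values are illustrative, not from the original post) showing why ASCII drops the é while UTF-8 round-trips it intact:

```csharp
using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        string original = "amé";

        // ASCII has no code for 'é', so the encoder substitutes '?'
        byte[] asciiBytes = Encoding.ASCII.GetBytes(original);
        Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // am?

        // UTF-8 encodes 'é' as the two bytes C3 A9 and decodes it back losslessly
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(BitConverter.ToString(utf8Bytes));     // 61-6D-C3-A9
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));   // amé
    }
}
```

The same asymmetry explains the TCP result above: whichever encoding you pick, it must be the same on both the sending and receiving side.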
Your question and your error are not clear to me, but using a Base64 string may solve the problem.
Something like this
static public string EncodeTo64(string toEncode)
{
    byte[] toEncodeAsBytes = System.Text.ASCIIEncoding.ASCII.GetBytes(toEncode);
    string returnValue = System.Convert.ToBase64String(toEncodeAsBytes);
    return returnValue;
}

static public string DecodeFrom64(string encodedData)
{
    byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
    string returnValue = System.Text.ASCIIEncoding.ASCII.GetString(encodedDataAsBytes);
    return returnValue;
}

C# writes a ZERO WIDTH NO-BREAK SPACE at the beginning of a txt file

I have a text file that is written in C# using ASCII encoding, and when I attempt to read the file from a Java project I get a ZERO WIDTH NO-BREAK SPACE character at the beginning of the file. Has anybody ever had this happen to them?
private static void SavePrivateKey(object key)
{
    if (logger.IsInfoEnabled) logger.Info("SavePrivateKey - Begin");
    string privatekey = (string)key;
    string strDirName = Utility.RCTaskDirectory;
    string strFileName = "PrivateKey.PPK";
    string strKeyPathandName = Path.Combine(strDirName, strFileName);
    //if (File.Exists(strKeyPathandName))
    //{
    //    File.Create(strKeyPathandName);
    //}
    if (!string.IsNullOrEmpty(privatekey))
    {
        // Save private key file
        if (!Directory.Exists(strDirName))
            Directory.CreateDirectory(strDirName);
        FileStream fileStream = new FileStream(strKeyPathandName, FileMode.OpenOrCreate);
        //TODO: Save file as ASCII
        using (StreamWriter sw = new StreamWriter(fileStream, Encoding.ASCII))
        {
            if (logger.IsDebugEnabled) logger.DebugFormat("Saving the private key to {0}.", strKeyPathandName);
            sw.Write(privatekey);
            sw.Close();
            if (logger.IsDebugEnabled) logger.DebugFormat("Saved private key to {0}.", strKeyPathandName);
        }
    }
    if (logger.IsInfoEnabled) logger.Info("SavePrivateKey() - End");
}
It seems that the text is written with a BOM, which is usually done when writing Unicode files. This specific character is the BOM for UTF-16 files, so something in your C# code must be writing this file as UTF-16.
See http://de.wikipedia.org/wiki/Byte_Order_Mark
As others have said, it is almost certainly a Unicode Byte Order Mark. If you have a look at the actual bytes in the file (not the characters) you can tell which encoding was used to write the file:
UTF-8 -> EF BB BF
UTF-16 BE -> FE FF
UTF-16 LE -> FF FE
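Based on that table, a small sketch (not from the original answer; the helper name is made up) that sniffs the first bytes of a file to guess the BOM could look like:

```csharp
using System;
using System.IO;

class BomSniffer
{
    // Map the leading bytes to the encoding that the BOM indicates.
    static string GuessBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        return "no BOM detected";
    }

    static void Main(string[] args)
    {
        // Read just the first three bytes of the file given on the command line.
        byte[] head = new byte[3];
        using (FileStream fs = File.OpenRead(args[0]))
            fs.Read(head, 0, head.Length);
        Console.WriteLine(GuessBom(head));
    }
}
```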
Yes, it's quite normal; see Wikipedia. It's an optional character, which you simply should handle. So most likely you didn't write the file correctly as ASCII, since a BOM should only appear if the file is encoded as Unicode.
That's a Byte Order Mark, indicating it's a UTF-16 encoded text file.
Clearly it's not writing the file in true ASCII; probably your code is simply copying bytes, even though they are outside of the ASCII range. Can you post your code?
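If the goal is to guarantee that no BOM ends up in the file, one hedged sketch of a fix (assuming the file can be rewritten from scratch; the method name is illustrative) is to pass an encoding that emits no BOM and to truncate the file rather than open-or-create it:

```csharp
using System.IO;
using System.Text;

class NoBomWriter
{
    public static void Save(string path, string text)
    {
        // new UTF8Encoding(false) emits no BOM, unlike Encoding.UTF8.
        // FileMode.Create truncates existing content; FileMode.OpenOrCreate
        // could leave stale bytes (including an old BOM) at the start of the file.
        using (var fs = new FileStream(path, FileMode.Create))
        using (var sw = new StreamWriter(fs, new UTF8Encoding(false)))
        {
            sw.Write(text);
        }
    }
}
```

One possible cause here is exactly that FileMode.OpenOrCreate pattern: if an older UTF-16 version of the file existed, its BOM bytes would survive a shorter ASCII rewrite.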

converting .txt files into unicode

Is there a way I can convert a .txt file into Unicode using C#?
Only if you know the original encoding used to produce the .txt file (and that's not a restriction of C# or .NET either; it's a general problem).
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) to learn why "plain text" is meaningless if you don't know the encoding.
Provided you're only using ASCII characters in your text file, they're already Unicode, encoded as UTF-8.
If you want a different encoding of the characters (UTF-16/UCS-2, etc.), any language that supports Unicode should be able to read in one encoding and write out another.
The System.Text.Encoding stuff will do it as per the following example - it outputs UTF16 as both UTF8 and ASCII and then back again (code gratuitously stolen from here).
using System;
using System.IO;
using System.Text;

class Test {
    public static void Main() {
        using (StreamWriter output = new StreamWriter("practice.txt")) {
            string srcString = "Area = \u03A0r^2"; // PI.R.R

            // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
            byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
            byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

            // Write the UTF-8 and ASCII encoded byte arrays.
            output.WriteLine("UTF-8 Bytes: {0}", BitConverter.ToString(utf8String));
            output.WriteLine("ASCII Bytes: {0}", BitConverter.ToString(asciiString));

            // Convert the UTF-8 and ASCII encoded bytes back to UTF-16 encoded
            // strings and write.
            output.WriteLine("UTF-8 Text : {0}", Encoding.UTF8.GetString(utf8String));
            output.WriteLine("ASCII Text : {0}", Encoding.ASCII.GetString(asciiString));
            Console.WriteLine(Encoding.UTF8.GetString(utf8String));
            Console.WriteLine(Encoding.ASCII.GetString(asciiString));
        }
    }
}
Here is an example:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace utf16
{
    class Program
    {
        static void Main(string[] args)
        {
            using (StreamReader sr = new StreamReader(args[0], Encoding.UTF8))
            using (StreamWriter sw = new StreamWriter(args[1], false, Encoding.Unicode))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    sw.WriteLine(line);
                }
            }
        }
    }
}
There is a nice page on MSDN about this, including a whole example:
// Specify the code page to correctly interpret byte values
Encoding encoding = Encoding.GetEncoding(737); // (DOS) Greek code page
byte[] codePageValues = System.IO.File.ReadAllBytes(@"greek.txt");
// Same content is now encoded as UTF-16
string unicodeValues = encoding.GetString(codePageValues);
If you do really need to change the encoding (see Pax's answer about UTF-8 being valid Unicode), then yes, you can do that quite easily. Check out the System.Text.Encoding class.
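For a whole-file conversion, a minimal sketch (file names are placeholders, and the source codepage is an assumption you must verify for your own files) using System.Text.Encoding with the File helpers:

```csharp
using System.IO;
using System.Text;

class Converter
{
    static void Main()
    {
        // Read assuming the original encoding is known (here: Windows-1252),
        // then write the same text back out as UTF-16 (Encoding.Unicode).
        Encoding source = Encoding.GetEncoding(1252);
        string text = File.ReadAllText("input.txt", source);
        File.WriteAllText("output.txt", text, Encoding.Unicode);
    }
}
```

Note that File.WriteAllText with Encoding.Unicode will prepend a UTF-16 LE BOM (FF FE), which is what most editors use to detect the encoding.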
