Thai character issues in unicode string - c#

I have a string with few characters in Thai. This string is using unicode characters. But I don't see thai characters in IDE or even if I write the string in text file. If I want to see thai characters properly I have to write the following code
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
After applying above logic, I can see string correctly with thai characters. Here is output
// notice the thai character เดี่ยว in the string
M_M-150 150CC. เดี่ยว (2 For 18 Save 2)
I am not sure why I need to apply above logic to see the thai characters even the string is Unicode?
What exactly Encoding.Default is doing in this case?

From MSDN
Here is what Encoding.Default Property is:
Different computers can use different encodings as the default, and
the default encoding can even change on a single computer. Therefore,
data streamed from one computer to another or even retrieved at
different times on the same computer might be translated incorrectly.
In addition, the encoding returned by the Default property uses
best-fit fallback to map unsupported characters to characters
supported by the code page. For these two reasons, using the default
encoding is generally not recommended. To ensure that encoded bytes
are decoded properly, you should use a Unicode encoding, such as
UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to
use a higher-level protocol to ensure that the same format is used for
encoding and decoding.
The string is coming in by Encoding.Default, but then Decoded using UTF8
So the bottleneck is not the Encoding.Default. It's Encoding.UTF8
It's taking the bytes and convert it to string correctly.
Even if you tried to print it in the Console.
Take a look at both cases :
The second line, printed with utf8 configuration
You can config your console to support utf8 by adding this line :
Console.OutputEncoding = Encoding.UTF8;
Even with your code :
the result in a file will be looks like :
while converting the string to byte with Encoding.UTF8
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
the result will be :
If you take a look at Supported Scripts you'll see that UTF8 supports all Unicode characters
including Thai.
Note that the Encoding.Default will not be able to read chinese or japanese for an example,
take this example :
var text = "漢字";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
Here is the output from a text file :
Here if you try to write it to text, it'll not be converted successfully.
So you have to read and write it using UTF8
var text = "漢字";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
and you'll get this :
So as I said, the whole process depending on UTF8 not Default encoding.

Related

How to unescape multibyte unicode in c# [duplicate]

This question already has answers here:
How to unescape unicode string in C#
(2 answers)
Closed 2 years ago.
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = #"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s awesome
This is UTF8. Try UTF8 Encoding
using System.Text;
using System.Text.RegularExpressions;
string test = "It\u00e2\u0080\u0099s working";
byte[] bytes = Encoding.GetEncoding(28591)
.GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes);//It’s working
try this to parse file :
private static Regex _regex = new Regex(#"\\u(?<Value>[a-zA-Z0-9]{4})", RegexOptions.Compiled);
public string decodeString(string value)
{
return _regex.Replace(
value,
m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
);
}
That is javascript unicode encoding. Use a C# javascript deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an encoding used by JavaScript and C# (didn't know about C# this until now) to encode 16 bit Unicode characters in string literals. 16 bit - 4 hex characters, so \uXXXX, each X representing one Hexadecimal digit.
Note this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory or what not. It is an older style of encoding due to modern source code editors usually support UTF-8 or UTF-16 or some other encoding to be able to store unicode characters in source code files, and then they are also able to display the unicode character symbol, and allow it being typed right at the editor. So \uXXXX typing is not needed, and going out of style.
So that is why I asked where did you get the string initially? You wrote in one comment you read it from a file? What generated the file?
If each \uXXXX is taken alone by itself as unicode characters, which is what \uXXXX means, doesn't make sense being there. 00e2 is a character a with cap on it, 0080 and 0099 are control characters, not printable.
If e28099 are taken together as three single bytes, i.e. dropping off 00 valued first bytes of each as they are in the form of \u00XX then it fits as a UTF8 character representation of a Unicode character with decimal value 2019, which is "Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)"
Then that is what you are looking for, but this doesn't seem correct usage of encoding that generated that string. If you end up with those strings and have to evaluate them, then comments above by "C# Novice" is working, but it may not work in every case.
You could convert string literals that uses \uXXXX encoding in its strings using a javascript script evaluator, or CSharpScript.Run() to make a string literal with those and assign to a variable, and then look at its bytes. But I tried that later and due to those byte values/characters not making sense I don't get anything meaningful from them. I get an a with a cap, and the next two, CSharpScript refuses to decode and leaves as is. Becuase those are control characters when decoded.
Here three different ways using C# avaliable libraries doing \uXXXX decoding. The first two uses NewtonSoft.JSON package, the last uses Roslyn/CSharpScript, both avalilable from Nuget. Note none of these print single aposthrope, due to what I described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints on the debug output window this Japanese text: "こんにちは世界!" , which is what Google translate told me is Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those scripts, doesn't seem to be doing standard things.
string test = #"It\u00e2\u0080\u0099s working";
// Using JSON deserialization, since \uXXXX is valid encoding JavaScript string literals
// Have to add starting and ending quotes to make it a script literal definition, then deserialize as string
var d = Newtonsoft.Json.JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);
// Another way of JavaScript deserialization. If you are using a stream like reading from file this maybe better:
TextReader reader = new StringReader("\"" + test + "\"");
Newtonsoft.Json.JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);
// lastly overkill and too heavy: Using Roslyn CSharpScript, and letting C# compiler to decode \uXXXX's in string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => { return CSharpScript.RunAsync<string>("string str = \"" + test + "\".ToString();", opt); }).Result;
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);

Encoding "œ" with iso-8859-1 in C#

I am using below code to read a CSV file:
using (StreamReader readfile = new StreamReader(FilePath, Encoding.GetEncoding("iso-8859-1")))
{
// some code will go here
}
There is a character œin a column of the CSV file. Which is getting converted to ? in output. How can i get this œ encoded correctly so that in output i will get same œ character not a question mark.
This is an encoding problem. Many non-Unicode encodings are either incomplete and translate many characters to "?", or have subtly different behavior on different platforms. Consider using UTF-8 or UTF-16 as the default. At least, if you can.
"windows-1252" is a superset of "ISO-8859-1". Try with Encoding.GetEncoding(1252).
Demo:
public static void Main()
{
System.IO.File.AppendAllText("test","œ", System.Text.Encoding.GetEncoding(1252));
var content = System.IO.File.ReadAllText("test", System.Text.Encoding.GetEncoding(1252));
Console.WriteLine(content);
}
Try it online!
The iso-8859-15 character set contain those symbols, as does the Windows-1252 codepage. However, be aware that 8859-15 redefines six other RARELY USED (or ASCII duplicate) chars that are in 8859-1, but so does Windows 1252. A quick web search will reveal those differences.

How do I read hex sequences like xD0 into a C# string?

I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a streamreader which takes the encoding to be UTF-8. All working well. But there are some characters used to replace the punctuation marks that just aren't working. I can see them as hex sequences in notepad++ (encoding set to UTF-8) but when I read them, they all get reduced down to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
(Full size image)
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16. This is how they are stored in memory. Because of this you should be able to read the string into memory and replace the characters without any issues. You can then write those characters back to a file (UTF8 is the default character encoding for reading and writing to file if I'm not mistaken). The ?'s just means the console you outputed the string to does not support those characters or the bytes are not of a valid encoding.
Here is a good article by Jon Skeet about C#/.NET strings.

Reading special characters from Byte[]

I'm writing and readingfrom Mifare - RFID cards.
To WRITE into the card, i'm using a Byte[] like this:
byte[] buffer = Encoding.ASCII.GetBytes(txt_IDCard.Text);
Then, to READ from the card, I'm getting some error with special characters, when it's supposed to show me é, ã, õ, á, à... I get ? instead:
string result = System.Text.Encoding.UTF8.GetString(buffer);
string result2 = System.Text.Encoding.ASCII.GetString(buffer, 0, buffer.Length);
string result3 = Encoding.UTF7.GetString(buffer);
e.g: Instead I get Àgua, amanhã, você I receive/read ?gua, amanh?, voc?.
How may I solve it ?
ASCII by its very definition only supports 128 characters.
What you need is ANSI characters if you are reading legacy text.
You can use Encoding.Default instead of Encoding.ASCII to interpret characters in the current locale's default ANSI code page.
Ideally, you would know exactly which code page you are expecting the ANSI characters to use and specify the code page explicitly using this overload of Encoding.GetEncoding(int codePage), for example:
string result = System.Text.Encoding.GetEncoding(1252).GetString(buffer);
Here's a very good reference page on Unicode: http://www.joelonsoftware.com/articles/Unicode.html
And another here: http://msdn.microsoft.com/en-us/library/b05tb6tz%28v=vs.90%29.aspx
But maybe you can just use UTF8 when reading and writing
I don't know the details of the card reader. Is the data you read and write to the card just a load of bytes?
If so, you can just use UTF8 for both reading and writing and it will all just work. It's only necessary to use ANSI if you are working with a legacy device which is expecting (or providing) ANSI text. If the device just stores bytes blindly without implying any particular format, you can do what you like - in this case, just always use UTF8.
It seems like you're using characters that aren't mapped in the 7 bits ASCII, but in the "extensions" ISO-8859-1 or ISO-8859-15. You'll need to choose a specific encoding for mapping to your byte array and things should work fine;
byte[] buffer = Encoding.GetEncoding("ISO-8859-1").GetBytes(txt_IDCard.Text);
You have two problems there:
ASCII supports only a limited amount of characters.
You're currently using two different Encodings for reading and writing.
You should write with the same Encoding as you read.
Writing
byte[] buffer = Encoding.UTF8.GetBytes(txt_IDCard.Text);
Reading
string result = Encoding.UTF8.GetString(buffer);

Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this before in a round-about manner before here on Stack Overflow, and want to get it right this time. How do I convert ANSI (Codepage 1252) to UTF-8, while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all UTF-8 characters that are supported by ANSI and substitute the rest with a ? or something)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows in what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for my solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
Encoding encoding = Encoding.Default;
String original = String.Empty;
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
original = sr.ReadToEnd();
encoding = sr.CurrentEncoding;
sr.Close();
}
if (encoding == Encoding.UTF8)
return original;
byte[] encBytes = encoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF8. This happens only some of the time, UTF8 works the majority of the time. Also, the latest of the text data DOES contain UTF8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
const int WindowsCodepage1252 = 1252;
string text;
try
{
var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, utf8Encoding);
}
catch (DecoderFallbackException dfe)//then text is not entirely valid UTF8, contains Codepage 1252 characters that can't be correctly decoded to UTF8
{
var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, codepage1252Encoding);
}
return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
// If the destination's parent doesn't exist, create it.
String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
if (!Directory.Exists(parent))
{
Directory.CreateDirectory(parent);
}
// Convert the file.
String tempName = null;
try
{
tempName = Path.GetTempFileName();
using (StreamReader sr = new StreamReader(sourcePath))
{
using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
{
int charsRead;
char[] buffer = new char[128 * 1024];
while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
sw.Write(buffer, 0, charsRead);
}
}
}
File.Delete(destPath);
File.Move(tempName, destPath);
}
finally
{
File.Delete(tempName);
}
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for my solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I have vCard file (*.vcf) - 200 contacts in one file in russian language...
I opened it with vCardOrganizer 2.1 program then made Split to divide it on 200....and what I see - contacts with messy symbols, only thing I can read it numbers :-) ...
Steps: (when you do this steps be patient, sometimes it takes time)
Open vCard file (my file size was 3mb) with "notepad"
Then go from Menu - File-Save As..in opened window choose file name, dont forget put .vcf , and encoding - ANSI or UTF-8...and finally click Save..
I converted filename.vcf (UTF-8) to filename.vcf (ANSI) - nothing lost and perfect readable russian language...if you have quest write: yoshidakatana#gmail.com
Good Luck !!!

Categories

Resources