How to unescape multibyte Unicode in C# [duplicate]

This question already has answers here:
How to unescape unicode string in C#
(2 answers)
Closed 2 years ago.
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s working

This is UTF-8. Each \uXXXX escape here holds one UTF-8 byte, so map the characters back to bytes via Latin-1 (code page 28591) and decode those bytes as UTF-8:
using System.Text;

string test = "It\u00e2\u0080\u0099s working";
byte[] bytes = Encoding.GetEncoding(28591).GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes); // It’s working
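Since the file contains the literal characters \u00e2 and so on (not already-decoded characters), a minimal end-to-end sketch, assuming the raw line from the question, would first unescape and then apply the same Latin-1 round-trip:
using System;
using System.Text;
using System.Text.RegularExpressions;

class Unescape
{
    static void Main()
    {
        // Raw text exactly as it appears in the file: literal backslash-u sequences.
        string raw = @"It\u00e2\u0080\u0099s working";

        // Step 1: turn each \uXXXX into the char it names (â, U+0080, U+0099, ...).
        string unescaped = Regex.Unescape(raw);

        // Step 2: chars U+0000..U+00FF map 1:1 to bytes in Latin-1 (code page 28591),
        // recovering the original UTF-8 byte sequence 0xE2 0x80 0x99.
        byte[] utf8Bytes = Encoding.GetEncoding(28591).GetBytes(unescaped);

        // Step 3: decode those bytes as UTF-8.
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // It’s working
    }
}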

Try this to parse the file:
using System.Globalization;
using System.Text.RegularExpressions;

private static readonly Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);

public string decodeString(string value)
{
    // Replace each \uXXXX escape with the single UTF-16 char it names.
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}
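A possible usage sketch for this helper (assuming the string from the question); note the unescaped result still carries the three UTF-8 bytes as separate chars, so the Latin-1/UTF-8 step from the answer above is still needed:
string line = @"It\u00e2\u0080\u0099s working";
string unescaped = decodeString(line); // "Itâ\u0080\u0099s working" (chars 0xE2, 0x80, 0x99)
byte[] bytes = Encoding.GetEncoding(28591).GetBytes(unescaped);
Console.WriteLine(Encoding.UTF8.GetString(bytes)); // It’s working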

That is JavaScript Unicode escaping. Use a C# JavaScript/JSON deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an escape used by JavaScript and C# (I didn't know C# supported it until now) to encode 16-bit Unicode characters in string literals. 16 bits = 4 hex digits, hence \uXXXX, each X representing one hexadecimal digit.
Note this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory or whatnot. It is an older style of encoding, because modern source code editors usually support UTF-8 or UTF-16 or some other encoding and can store Unicode characters in source files directly; they can also display the Unicode character symbols and let you type them right in the editor. So \uXXXX escapes are rarely needed and are going out of style.
That is why I asked where you got the string initially. You wrote in one comment that you read it from a file? What generated the file?
If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, the sequence makes no sense: 00e2 is the character â (a with a circumflex), and 0080 and 0099 are control characters, not printable.
If e2 80 99 are taken together as three single bytes, i.e. dropping the zero-valued first byte of each \u00XX, then they fit the UTF-8 representation of the Unicode code point U+2019, which is "Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)".
Then that is what you are looking for, but whatever generated that string is not using the encoding in the standard way. If you end up with those strings and have to evaluate them, the comment above by "C# Novice" works, but it may not work in every case.
You could convert string literals that use \uXXXX escapes using a JavaScript script evaluator, or CSharpScript.RunAsync() to build a string literal from them, assign it to a variable, and then look at its bytes. But I tried that, and because those byte values/characters don't make sense together, I don't get anything meaningful from them: I get an â, and the next two CSharpScript refuses to decode and leaves as-is, because they are control characters when decoded.
Here are three different ways of doing \uXXXX decoding using available C# libraries. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript, both available from NuGet. Note that none of these print a single apostrophe, due to what I described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints this Japanese text in the debug output window: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those strings doesn't seem to be doing standard things.
using System;
using System.IO;
using System.Text;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using Newtonsoft.Json;

string test = @"It\u00e2\u0080\u0099s working";

// Using JSON deserialization, since \uXXXX is valid escaping in JavaScript/JSON string literals.
// Starting and ending quotes have to be added to make it a string literal definition, then deserialize as string.
var d = JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);

// Another way of JSON deserialization. If you are using a stream, like reading from a file, this may be better:
TextReader reader = new StringReader("\"" + test + "\"");
JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);

// Lastly, overkill and too heavy: using Roslyn CSharpScript and letting the C# compiler decode the \uXXXX's in a string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
ScriptState<string> s = CSharpScript.RunAsync<string>("string str = \"" + test + "\";", opt).Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);

Related

Base64EncodedString does not include NewLines

I'm using a .NET Core 3.0 project on Windows 10. I'm trying to encode a string to Base64 with the code below:
var stringvalue = "Row1" + Environment.NewLine + "\n\n" + "Row2";
var encodedString = Convert.ToBase64String(Encoding.UTF8.GetBytes(stringvalue));
encodedString then contains:
Um93MQ0KCgpSb3cy
stringvalue is:
Row1\r\n\n\nRow2
However, if I'm passing the same value to this site (https://www.base64encode.org/), I'm getting another result:
Um93MVxyXG5cblxuUm93Mg==
In Visual Studio, I tried to re-save the file with Unix line endings, but without any luck.
I want the string to be encoded the way it's done at https://www.base64encode.org. Any ideas how to get this done?
From the screenshot, I can see that you entered a different string on the website than the one you used in your C# code. The string you used at https://www.base64encode.org is represented as a C# string literal like this:
"Row1\\r\\n\\n\\nRow2"
// or
@"Row1\r\n\n\nRow2"
So to answer your question:
I want the string to be encoded the way it's done at https://www.base64encode.org. Any ideas how to get this done?
You should do:
var encodedString = Convert.ToBase64String(Encoding.UTF8.GetBytes("Row1\\r\\n\\n\\nRow2"));
But that's probably not what you actually want. Your first attempt in the C# code is more likely what you intended, because it contains an actual carriage return character followed by 3 newline characters. The string you entered at https://www.base64encode.org is simply the backslash character followed by the letter r (or n).
You can't really make the output on https://www.base64encode.org match the C# output, because you can only choose one kind of line separator there. You can only encode either Row1\r\n\r\n\r\nRow2 or Row1\n\n\nRow2. Nevertheless, you can check that the C# result is correct by decoding the output using https://www.base64decode.org.
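For what it's worth, a small console sketch reproduces both outputs from the question side by side (on Windows, where Environment.NewLine is \r\n):
using System;
using System.Text;

class Base64Demo
{
    static void Main()
    {
        // Real control characters: carriage return + three line feeds (the C# code from the question).
        string withNewlines = "Row1" + Environment.NewLine + "\n\n" + "Row2";
        Console.WriteLine(Convert.ToBase64String(Encoding.UTF8.GetBytes(withNewlines)));
        // prints Um93MQ0KCgpSb3cy

        // Literal backslashes and letters, i.e. what was typed into the website.
        string withLiterals = @"Row1\r\n\n\nRow2";
        Console.WriteLine(Convert.ToBase64String(Encoding.UTF8.GetBytes(withLiterals)));
        // prints Um93MVxyXG5cblxuUm93Mg==
    }
}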
The \r\n will be encoded literally by the website: it is not a newline there, it is four characters. The newline separator checkbox only says which line ending (e.g. Windows-style \r\n) the site should use for the real line breaks in your input value:
Row1
Row2
I guess your \r\n\n\n is just a mistake; the website is only prepared to convert real line breaks, to \r\n.

Encoding "œ" with iso-8859-1 in C#

I am using the code below to read a CSV file:
using (StreamReader readfile = new StreamReader(FilePath, Encoding.GetEncoding("iso-8859-1")))
{
// some code will go here
}
There is a character œ in one column of the CSV file, which is getting converted to ? in the output. How can I get this œ decoded correctly, so that the output contains the same œ character and not a question mark?
This is an encoding problem. Many non-Unicode encodings are incomplete and translate unsupported characters to "?", or have subtly different behavior on different platforms. Consider using UTF-8 or UTF-16 as the default, at least if you can.
"windows-1252" is a superset of "ISO-8859-1". Try with Encoding.GetEncoding(1252).
Demo:
public static void Main()
{
    System.IO.File.AppendAllText("test", "œ", System.Text.Encoding.GetEncoding(1252));
    var content = System.IO.File.ReadAllText("test", System.Text.Encoding.GetEncoding(1252));
    Console.WriteLine(content);
}
The ISO-8859-15 character set contains that symbol, as does the Windows-1252 code page. However, be aware that 8859-15 redefines eight rarely used (or ASCII-duplicate) characters that are in 8859-1, and Windows-1252 likewise replaces characters from 8859-1. A quick web search will reveal those differences.
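A minimal sketch of the fix for the original reader, assuming the file was actually saved as Windows-1252 (where œ is byte 0x9C; ISO-8859-1 maps 0x80-0x9F to invisible control characters instead); FilePath is the variable from the question:
using System.IO;
using System.Text;

// On .NET Core / .NET 5+, register the System.Text.Encoding.CodePages package first:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

using (StreamReader readfile = new StreamReader(FilePath, Encoding.GetEncoding(1252)))
{
    // byte 0x9C now decodes to 'œ' instead of a control character that prints as '?'
    string content = readfile.ReadToEnd();
}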

Thai character issues in unicode string

I have a string with a few Thai characters. The string uses Unicode characters, but I don't see the Thai characters in the IDE, or even if I write the string to a text file. To see the Thai characters properly, I have to write the following code:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
After applying the logic above, I can see the string correctly, with Thai characters. Here is the output:
// notice the Thai character เดี่ยว in the string
M_M-150 150CC. เดี่ยว (2 For 18 Save 2)
I am not sure why I need to apply the logic above to see the Thai characters when the string is already Unicode.
What exactly is Encoding.Default doing in this case?
From MSDN, here is what the documentation for the Encoding.Default property says:
Different computers can use different encodings as the default, and
the default encoding can even change on a single computer. Therefore,
data streamed from one computer to another or even retrieved at
different times on the same computer might be translated incorrectly.
In addition, the encoding returned by the Default property uses
best-fit fallback to map unsupported characters to characters
supported by the code page. For these two reasons, using the default
encoding is generally not recommended. To ensure that encoded bytes
are decoded properly, you should use a Unicode encoding, such as
UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to
use a higher-level protocol to ensure that the same format is used for
encoding and decoding.
The string's bytes are produced by Encoding.Default, but they are then decoded using UTF-8.
So the crucial part is not Encoding.Default; it's Encoding.UTF8, which takes the bytes and converts them to a string correctly.
The same holds if you try to print it in the console (the original answer compared both cases in a screenshot, with the second line printed under a UTF-8 console configuration).
You can configure your console to support UTF-8 by adding this line:
Console.OutputEncoding = Encoding.UTF8;
Even with your code, the result in a file looks garbled (screenshot omitted), while converting the string to bytes with Encoding.UTF8 instead:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
the result will be the Thai text rendered correctly (screenshot omitted).
If you take a look at Supported Scripts you'll see that UTF-8 supports all Unicode characters,
including Thai.
Note that Encoding.Default will not be able to handle Chinese or Japanese, for example.
Take this example:
var text = "漢字";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
Here, writing the output to a text file does not convert it successfully; the file shows mojibake instead of 漢字 (screenshot omitted).
So you have to read and write it using UTF-8:
var text = "漢字";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
and you'll get the characters back correctly.
So, as I said, the whole process depends on UTF-8, not on the default encoding.
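Putting this together, a minimal self-contained sketch (the file name thai.txt is just an example) that round-trips the Thai string using explicit UTF-8 everywhere:
using System;
using System.IO;
using System.Text;

class ThaiDemo
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8; // so the console can render Thai

        var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";

        // Write and read with an explicit UTF-8 encoding; no Encoding.Default anywhere.
        File.WriteAllText("thai.txt", text, Encoding.UTF8);
        var roundTripped = File.ReadAllText("thai.txt", Encoding.UTF8);

        Console.WriteLine(roundTripped);         // M_M-150 150CC. เดี่ยว (2 For 18 Save 2)
        Console.WriteLine(text == roundTripped); // True
    }
}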

How to retrieve the unicode decimal representation of the chars in a string containing hindi text?

I am using Visual Studio 2010 with C# to convert text into Unicode values. For example, I have a string abc = "मेरा".
There are 4 characters in this string, and I need all four Unicode values.
Please help me.
When you write code like string abc = "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that with the normal indexer: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).
If you want to see the numeric representations of those characters, just cast them to integers. For example
abc.Select(c => (int)c)
gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():
abc.Select(c => ((int)c).ToString("x4"))
returns the sequence of strings "092e", "0947", "0930", "093e".
Note that when I said numeric representations, I actually meant their encoding in UTF-16. For characters in the Basic Multilingual Plane, this is the same as their Unicode code point. The vast majority of characters in use lie in the BMP, including the 4 Hindi characters presented here.
If you wanted to handle characters in other planes too, you could use code like the following.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
for (int i = 0; i < codePointCount; i++)
codePoints[i] = BitConverter.ToInt32(bytes, i * 4);
Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)
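For what it's worth, a possibly more direct route (available since .NET 2.0) is char.ConvertToUtf32, which combines surrogate pairs for you; a small sketch:
var abc = "मेरा";
for (int i = 0; i < abc.Length; i++)
{
    int codePoint = char.ConvertToUtf32(abc, i); // combines a surrogate pair if one starts here
    Console.WriteLine("U+{0:X4} ({1})", codePoint, codePoint);
    if (char.IsHighSurrogate(abc[i]))
        i++; // skip the low surrogate that was just consumed
}
For the four Hindi characters this prints U+092E (2350), U+0947 (2375), U+0930 (2352), U+093E (2366).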
Since a .NET char is a Unicode character (at least for BMP code points), you can simply enumerate all characters in a string:
var abc = "मेरा";
foreach (var c in abc)
{
Console.WriteLine((int)c);
}
resulting in
2350
2375
2352
2366
Use
System.Text.Encoding.UTF8.GetBytes(abc)
That will return the UTF-8 encoded bytes of the string (note: these are UTF-8 code units, not Unicode code points).
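To make concrete what that call returns (three UTF-8 bytes per Devanagari character), a quick check:
var abc = "मेरा";
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(abc);
Console.WriteLine(utf8.Length);                  // 12: three UTF-8 bytes per character
Console.WriteLine(BitConverter.ToString(utf8));  // E0-A4-AE-E0-A5-87-E0-A4-B0-E0-A4-BE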
If you are trying to convert files from a legacy encoding into Unicode:
Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.
using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding(57002))) // 57002 = ISCII Devanagari
using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}
If you are looking for a mapping of Devanagari characters to the Unicode code points:
You can find the chart at the Unicode Consortium website here.
Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
If you have the string s = "मेरा" then you already have the answer.
This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.
If you want the underlying 8 bytes you can access them as so:
string str = "मेरा";
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);
// arr is { 0x2E, 0x09, 0x47, 0x09, 0x30, 0x09, 0x3E, 0x09 }: the UTF-16LE code units

Arabic presentation forms B support in c#

I was trying to convert a file from UTF-8 to the Arabic windows-1256 encoding using the Encoding APIs in C#, but I faced a strange problem: some characters are not converted correctly, such as "لا" in the statement "ﻣﺣﻣد ﺻﻼ ح عادل", which appears as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters are from Arabic Presentation Forms-B. I created the file using Notepad++ and saved it as UTF-8.
Here is the code I use:
StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
string str = sr.ReadLine();
StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
sw.Write(str);
sw.Flush();
sw.Close();
But I don't know how to convert the file correctly with these presentation forms in C#.
Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:
str = str.Normalize(NormalizationForm.FormKD);
sw.Write(str);
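A minimal sketch of the whole conversion with the fix applied, assuming the same file paths as the question:
using System.IO;
using System.Text;

string str;
using (var sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8))
{
    str = sr.ReadToEnd();
}

// Decompose presentation-form ligatures (e.g. U+FEFB) into the plain Arabic
// letters (U+0644 followed by U+0627) that windows-1256 can represent.
str = str.Normalize(NormalizationForm.FormKD);

using (var sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256")))
{
    sw.Write(str);
}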
To give a more general answer:
The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).
the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single codepoint (U+FEFB) despite being two characters.
When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.
When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.
As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?). Furthermore, those Presentation-Form characters which do have a normal-text equivalent, will change their ligation behaviour, because that is what normal Arabic text does.
First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are U+0644 and U+0627, which are from the standard Arabic block. However, just to be sure, I tried the character U+FEFB, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.
Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.
So I tried the following:
var input = "لا";
var encoding = Encoding.GetEncoding("windows-1256");
var result = encoding.GetBytes(input);
Console.WriteLine(string.Join(", ", result));
The output I get is 225, 199. So let’s try to turn it back:
var bytes = new byte[] { 225, 199 };
var result2 = encoding.GetString(bytes);
Console.WriteLine(result2);
Fair enough, the Console does not display the result correctly — but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.
Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.
My recommendation:
Write a short piece of code that exhibits the problem.
Post a new question with that piece of code.
In that question, describe exactly what result you get, and what result you expected instead.
