This isn't a duplicate of JSON and escaping characters - That doesn't answer this, because those code samples are in JavaScript - I need C#.
What C# method/library converts a bullet point (•) into \u2022? The same converter would convert a newline char into \n. Those are just 2 examples, but the overall solution I'm looking for is to pass in a string (containing a combination of ASCII and special chars), and it converts all that to the same ASCII, but with the special chars escaped. For example, I need the following string:
• 3 Ply 330 3/16in x 1/16in(#77)
• 25 ft Long X 22 in Wide
• 2022 (2) Beltwall Blk Standard 4in (102mm)
...converted to this:
\u2022 3 Ply 330 3/16in x 1/16in(#77)\n\u2022 25 ft Long X 22 in Wide\n\u2022 (2) Beltwall Blk Standard 4in (102mm)
...so it can become a valid JSON string value.
I have been down a dozen rabbit holes trying to find the answer to this, though I have no doubt it's something ridiculously simple.
You need to set which characters are escaped. If you are using Newtonsoft (comments indicate you are) then by default it will only escape control characters (newlines, etc).
You can pass the option StringEscapeHandling.EscapeNonAscii to have it escape all possible characters.
public string EncodeNonAsciiCharacters(string value) {
return JsonConvert.SerializeObject(value, Newtonsoft.Json.Formatting.None,
new JsonSerializerSettings { StringEscapeHandling = StringEscapeHandling.EscapeNonAscii }
);
}
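To make the effect of that option concrete, here is a minimal hand-rolled sketch of the same kind of escaping. It is not Newtonsoft's implementation; the class name JsonAsciiEscaper and its Escape method are hypothetical, and it returns only the body of the JSON string, without the surrounding quotes that SerializeObject adds.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not part of Newtonsoft.Json).
public static class JsonAsciiEscaper
{
    // Returns the input with JSON escapes applied and every char outside
    // printable ASCII written as \uXXXX. No surrounding quotes are added.
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '"':  sb.Append("\\\""); break;
                case '\\': sb.Append("\\\\"); break;
                case '\n': sb.Append("\\n"); break;
                case '\r': sb.Append("\\r"); break;
                case '\t': sb.Append("\\t"); break;
                default:
                    if (c < 0x20 || c > 0x7E)
                    {
                        // Each UTF-16 code unit is escaped on its own, so a character
                        // outside the BMP becomes two \uXXXX escapes (a surrogate pair).
                        sb.Append("\\u").Append(((int)c).ToString("x4"));
                    }
                    else
                    {
                        sb.Append(c);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Calling JsonAsciiEscaper.Escape on a string with a bullet and a line feed gives back plain ASCII with \u2022 and \n in their place.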
I can't seem to find any sort of posts or videos online about this topic, so I'm starting to wonder if it's just not possible. Everything about "emojis" in Unity is just a simple implementation of a spritesheet and then manually indexing them with like <sprite=0>. I'm trying to pull tweets from Twitter and then display their text with emojis, so clearly this isn't feasible to do with the 1500+ emojis that unicode supports.
I believe I've correctly created a TMP font asset using the default Windows emoji font, Segoe UI Emoji, and it looks like using some unicode hex ranges I found on an online unicode database, I was able to detect 1505 emojis in the font.
I then set the emoji font as a fall-back font in the Project Settings:
But upon running the game, I still get the same error that The character with Unicode value \uD83D was not found in the [SEGOEUI SDF] font asset or any potential fallbacks. It was replaced by Unicode character \u25A1 in text object
In the console an output of the tweet text looks something like this: #cat #cats #CatsOfTwitter #CatsOnTwitter #pet \nLike & share , Thanks!\uD83D\uDE4F\uD83D\uDE4F\uD83D\uDE4F
From some looking around online and extremely basic knowledge of unicode, I theorize that the issue is that in the tweet body, the emojis are in UTF-16 surrogate pairs or whatever, where \uD83D\uDE4F is one emoji, but my emoji font is in UTF-32, so it's looking for u+0001f64f. So would I need to find a way to get it to read the full surrogate pair and then convert to UTF-32 to get the correct emoji to render?
Any help would be greatly appreciated, I've tried asking around the Unity Discord server, but nobody else knows how to solve this issue either.
Intro
TMPro is natively able to do this, but only with UTF-32 formatted unicode. For example, \U0001F600 is '😀'. Your emojis are formatted in what I believe is UTF-8 (correct me if I'm wrong), being \u1F600, which is still '😀'. The only difference between these two is the capital U and the 3 zeros prepended to it. This makes it very easy to convert. Typing the UTF-32 version into TMPro shows the emoji as normal. What you are looking for is converting UTF-16 surrogate pairs into UTF-32, which is included further down.
Luckily, this solution does not require any font modification, the default font is able to do this, and I didn't change any settings in the inspector.
UTF-8 Solution
This solution below is for non-surrogate pair UTF-8 code.
To convert UTF-8 to UTF-32, we just need to change the 'u' to uppercase and prepend a few zeros to the hex digits. To do so, we can use System.Text.RegularExpressions.Regex.Replace.
public string ToUTF32(string input)
{
string output = input;
Regex pattern = new Regex(@"\\u[a-zA-Z0-9]*");
while (output.Contains(@"\u"))
{
output = pattern.Replace(output, @"\U000" + output.Substring(output.IndexOf(@"\u", StringComparison.Ordinal) + 2, 5), 1);
}
return output;
}
input being the string that contains the emoji unicode. The function converts all of the unicode in the string, and keeps everything else as it was.
Explanation
This code is pretty long, so this is the explanation.
First, the code takes the input string, for example, blah blah \u1F600 blah \u1F603 blah, which contains 2 of the unicode emojis, and replaces the unicode with another long string of code, which is the next section.
Secondly, it takes the input and Substrings everything after "\u", 5 characters ahead. It replaces the text with "\U000" + the aforementioned string.
It repeats the above steps until all of the unicode is translated.
This outputs the correct string to do the job.
If anyone thinks the above information is incorrect, please let me know. My vocabulary on this subject is not the best, so I am willing to take corrections.
Surrogate Pairs Solution
I have tinkered for a little while and come up with the function below.
public string ToUTF32FromPair(string input)
{
var output = input;
Regex pattern = new Regex(@"\\u[a-zA-Z0-9]*\\u[a-zA-Z0-9]*");
// Loop only while a pair is left; checking for any "\u" would never end on an unpaired escape
while (pattern.IsMatch(output))
{
output = pattern.Replace(output,
m => {
var pair = m.Value;
var first = pair.Substring(0, 6);
var second = pair.Substring(6, 6);
var firstInt = Convert.ToInt32(first.Substring(2), 16);
var secondInt = Convert.ToInt32(second.Substring(2), 16);
var codePoint = (firstInt - 0xD800) * 0x400 + (secondInt - 0xDC00) + 0x10000;
return @"\U" + codePoint.ToString("X8");
},
1
);
}
return output;
}
This does basically the same thing as before except it takes in the input that has surrogate pairs in it and translates it.
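As a quick check of the arithmetic used in that function, the sketch below applies the same formula to the pair from the question and compares it with the framework's own char.ConvertToUtf32. The class name SurrogateDemo and the method Combine are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SurrogateDemo
{
    // Same formula as in the answer: combine a high and a low surrogate
    // into a single code point above U+FFFF.
    public static int Combine(char high, char low)
    {
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
}
```

For \uD83D\uDE4F this gives 0x1F64F, which is also what char.ConvertToUtf32('\uD83D', '\uDE4F') returns, so when the text is available as an actual string rather than as escape sequences, the built-in method can replace the manual formula.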
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s working
This is UTF8. Try UTF8 Encoding
using System.Text;
using System.Text.RegularExpressions;
string test = "It\u00e2\u0080\u0099s working";
byte[] bytes = Encoding.GetEncoding(28591)
.GetBytes(test);
var converted = Encoding.UTF8.GetString(bytes); // It’s working
Try this to parse the file (it also needs using System.Globalization; for NumberStyles):
private static Regex _regex = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
public string decodeString(string value)
{
return _regex.Replace(
value,
m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
);
}
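The two answers above can be combined when the text arrives with literal \uXXXX escapes, as in the question: first unescape, then reinterpret the resulting characters as UTF-8 bytes. The following is a minimal sketch; the class name MojibakeFixer and its method are hypothetical, and it assumes every character after unescaping is at most U+00FF (anything higher would be replaced by '?' in the Latin-1 step).

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class MojibakeFixer
{
    private static readonly Regex Escape =
        new Regex(@"\\u([0-9a-fA-F]{4})", RegexOptions.Compiled);

    public static string UnescapeThenReinterpret(string value)
    {
        // Step 1: turn each literal \uXXXX into the char with that value.
        string unescaped = Escape.Replace(
            value,
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());

        // Step 2: ISO-8859-1 (code page 28591) maps chars 0x00..0xFF to the same byte values,
        // which recovers the original bytes; those bytes are then decoded as UTF-8.
        byte[] bytes = Encoding.GetEncoding(28591).GetBytes(unescaped);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Passing the verbatim string @"It\u00e2\u0080\u0099s working" yields the text with a single U+2019 apostrophe.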
That is javascript unicode encoding. Use a C# javascript deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an encoding used by JavaScript and C# (I didn't know this about C# until now) to encode 16-bit Unicode characters in string literals. 16 bits is 4 hex characters, so \uXXXX, with each X representing one hexadecimal digit.
Note this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory. It is an older style of encoding: modern source code editors usually support UTF-8, UTF-16 or some other encoding that can store Unicode characters directly in source files, can display the character itself, and let you type it right in the editor. So typing \uXXXX is rarely needed and is going out of style.
So that is why I asked where did you get the string initially? You wrote in one comment you read it from a file? What generated the file?
If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, the sequence doesn't make sense. 00e2 is â (a with a circumflex), and 0080 and 0099 are control characters, which are not printable.
If e2 80 99 are taken together as three single bytes, i.e. dropping the 00-valued first byte of each since they are all of the form \u00XX, then it fits the UTF-8 representation of the Unicode character with hexadecimal code point 2019, which is "Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)".
Then that is what you are looking for, but whatever generated that string doesn't seem to be using the encoding correctly. If you end up with those strings and have to evaluate them, then the approach in the comments above by "C# Novice" works, but it may not work in every case.
You could convert string literals that use \uXXXX encoding with a JavaScript evaluator, or with CSharpScript.Run(), to make a string literal from them, assign it to a variable, and then look at its bytes. But I tried that later, and because those byte values/characters don't make sense I don't get anything meaningful from them. I get an a with a circumflex, and CSharpScript refuses to decode the next two and leaves them as is, because those are control characters when decoded.
Here are three different ways of doing \uXXXX decoding using available C# libraries. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript; both are available from NuGet. Note that none of these prints a single apostrophe, for the reasons I described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints this Japanese text in the debug output window: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those strings doesn't seem to be doing standard things.
string test = @"It\u00e2\u0080\u0099s working";
// Using JSON deserialization, since \uXXXX is valid encoding JavaScript string literals
// Have to add starting and ending quotes to make it a script literal definition, then deserialize as string
var d = Newtonsoft.Json.JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);
// Another way of JavaScript deserialization. If you are using a stream like reading from file this maybe better:
TextReader reader = new StringReader("\"" + test + "\"");
Newtonsoft.Json.JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);
// lastly overkill and too heavy: Using Roslyn CSharpScript, and letting C# compiler to decode \uXXXX's in string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => { return CSharpScript.RunAsync<string>("string str = \"" + test + "\".ToString();", opt); }).Result;
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);
From a REST API I receive text in this kind of style, for example:
𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰?
𝐻𝑜𝓌 𝓉𝑜 𝓇𝑒𝓂𝑜𝓋𝑒 𝓉𝒽𝒾𝓈 𝒻𝑜𝓃𝓉 𝒻𝓇𝑜𝓂 𝒶 𝓈𝓉𝓇𝒾𝓃𝑔?
нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?
But this is not italic, bold or underlined, since its type is string.
This kind of text makes my Regex ^[a-zA-Z0-9._]*$ fail.
I would like to normalize the received string into a standard one so that my Regex still works.
You can use Unicode Compatibility normalization forms, which use Unicode's own (lossy) character mappings to transform letter-like characters (among other things) to their simplified equivalents.
In python, for instance:
>>> from unicodedata import normalize
>>> normalize('NFKD','𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰')
'How to remove this font from a string'
# EDIT: This one wouldn't work
>>> normalize('NFKD','нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?')
'нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?'
Interactive example here.
EDIT: Note that this only applies to stylistic forms (superscripts, blackletter, full-width, etc.), so your third example, which uses non-Latin characters, can't be decomposed to ASCII.
EDIT2: I didn't realize your question was specific to C#, here's the documentation for String.Normalize, which does just that:
string s1 = "𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰";
string s2 = s1.Normalize(NormalizationForm.FormKD);
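Tying this back to the Regex from the question, a minimal sketch could normalize first and validate afterwards. The class name UsernameNormalizer and its method are hypothetical, the styled letters are written as \U escapes to avoid source-file encoding problems, and the sketch assumes the runtime has full Unicode normalization data available.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class UsernameNormalizer
{
    // The pattern from the question.
    private static readonly Regex Allowed = new Regex(@"^[a-zA-Z0-9._]*$");

    public static bool IsValidAfterNormalizing(string input)
    {
        // Compatibility decomposition maps styled letters to their plain forms.
        string normalized = input.Normalize(NormalizationForm.FormKD);
        return Allowed.IsMatch(normalized);
    }
}
```

Note that the pattern allows neither spaces nor '?', so the full sample sentences would still be rejected after normalizing; a single styled word such as "\U0001D4D7\U0001D4F8\U0001D500" (bold script "How") is expected to pass, while the look-alike Cyrillic and Greek letters of the third example are expected to fail because they are distinct letters, not styled Latin ones.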
I'm trying to understand which encoding in C# best fulfills a requirement from a new SMS provider.
The text I want to send is:
Bäste Björn
The encoded text that the provider say it needs is:
B%E4ste+Bj%F6rn
so ä is %E4 and ö is %F6
From this answer, I got that, for such conversion I need to use HttpUtility.HtmlAttributeEncode as the normal HttpUtility.UrlEncode will output:
B%c3%a4ste+Bj%c3%b6rn
and that outputs weird chars on the mobile phone :/
as several chars are not converted, I tried this:
private string specialEncoding(string text)
{
StringBuilder r = new StringBuilder();
foreach (char c in text.ToCharArray())
{
string e = System.Web.HttpUtility.UrlEncode(c.ToString());
if (e.StartsWith("%") && e.ToLower() != "%0a") // %0a == Linefeed
{
string attr = System.Web.HttpUtility.HtmlAttributeEncode(c.ToString());
r.Append(attr);
}
else
{
r.Append(e);
}
}
return r.ToString();
}
verbose so I could breakpoint and test each char, and found out that:
System.Web.HttpUtility.HtmlAttributeEncode("ä") is actually equal to ä... so there is no %E4 as output...
What am I missing? And is there a simple way to do the encoding without manipulating the text char by char and still get the required output?
that the provider say it needs
Ask the provider in which age they are living. According to Wikipedia: Percent-encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Granted, this RFC talks about "new URI schemes", which HTTP obviously is not, but adhering to this standard prevents headaches like this. See also What is the proper way to URL encode Unicode characters?.
They seem to want you to encode characters according to the Windows-1250 code page (or a comparable one, like ISO-8859-1 or -2, check alternatives here) instead, as in that code page E4 (decimal 228) maps to ä and F6 (decimal 246) maps to ö. As @Simon points out in his comment, you should ask the provider exactly which code page they want you to use.
Assuming Windows-1250, you can implement it like this, according to URL encode ASCII/UTF16 characters:
var windows1250 = Encoding.GetEncoding(1250);
var percentEncoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);
The value of percentEncoded is:
B%e4ste+Bj%f6rn
If they insist on using uppercase, see .net UrlEncode - lowercase problem.
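If uppercase hex digits are required and System.Web is not available, a small hand-written encoder is one option. The sketch below is an assumption-laden illustration, not the provider's specification: the class name LegacyPercentEncoder is hypothetical, the set of characters left unescaped is a guess, and the usage note uses ISO-8859-1 (code page 28591), which encodes ä and ö as E4 and F6 just like Windows-1250 does.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class LegacyPercentEncoder
{
    // Percent-encodes the bytes of 'text' in the given single-byte encoding,
    // using uppercase hex digits and '+' for spaces.
    public static string Encode(string text, Encoding encoding)
    {
        var sb = new StringBuilder();
        foreach (byte b in encoding.GetBytes(text))
        {
            char c = (char)b;
            bool unreserved =
                (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.';

            if (unreserved)
                sb.Append(c);
            else if (c == ' ')
                sb.Append('+');
            else
                sb.Append('%').Append(b.ToString("X2")); // uppercase hex
        }
        return sb.ToString();
    }
}
```

LegacyPercentEncoder.Encode("Bäste Björn", Encoding.GetEncoding(28591)) produces B%E4ste+Bj%F6rn. On .NET Core and later, Windows-1250 itself is only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package.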
I am using Visual Studio 2010 and C# to convert text into Unicode values. For example, I have a string abc = "मेरा".
There are 4 characters in this string. I need all four Unicode characters.
Please help me.
When you write code like string abc = "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that using a normal index: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).
If you want to see the numeric representations of those characters, just cast them to integers. For example
abc.Select(c => (int)c)
gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():
abc.Select(c => ((int)c).ToString("x4"))
returns the sequence of strings "092e", "0947", "0930", "093e".
Note that when I said numeric representations, I actually meant their encoding using UTF-16. For characters in the Basic Multilingual Plane, this is the same as their Unicode code point. The vast majority of used characters lie in BMP, including those 4 Hindi characters presented here.
If you wanted to handle characters in other planes too, you could use code like the following.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
for (int i = 0; i < codePointCount; i++)
codePoints[i] = BitConverter.ToInt32(bytes, i * 4);
Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)
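One more direct alternative, offered as a sketch rather than as the method the answer above had in mind, is char.ConvertToUtf32, which reads a whole code point (including a surrogate pair) at a given index. The class name CodePoints and the method Of are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class CodePoints
{
    // Returns the Unicode code points of a string, treating each surrogate pair as one value.
    // char.ConvertToUtf32 throws ArgumentException if the string contains an unpaired surrogate.
    public static List<int> Of(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            result.Add(char.ConvertToUtf32(s, i));
            if (char.IsHighSurrogate(s[i]))
            {
                i++; // skip the low surrogate that was just consumed
            }
        }
        return result;
    }
}
```

For the four Devanagari characters this returns 2350, 2375, 2352, 2366, and for a string holding the pair \uD83D\uDE4F it returns the single value 0x1F64F.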
Since a .Net char is a Unicode character (at least, for the BMP code point), you can simply enumerate all characters in a string:
var abc = "मेरा";
foreach (var c in abc)
{
Console.WriteLine((int)c);
}
resulting in
2350
2375
2352
2366
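Use
System.Text.Encoding.UTF8.GetBytes(abc)
That returns the UTF-8 encoded bytes of the string, which for these characters is three bytes each rather than the code point values themselves.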
If you are trying to convert files from a legacy encoding into Unicode:
Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.
// "x-iscii-de" is the .NET name for ISCII Devanagari (code page 57002)
using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("x-iscii-de")))
using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
{
writer.Write(reader.ReadToEnd());
}
If you are looking for a mapping of Devanagari characters to the Unicode code points:
You can find the chart at the Unicode Consortium website here.
Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
If you have the string s = "मेरा" then you already have the answer.
This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.
If you want the underlying 8 bytes you can access them as so:
string str = @"मेरा";
// GetBytes is an instance method; Encoding.Unicode is the UTF-16 (little-endian) encoding
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);