How to convert emoticons to their UTF-32/escaped Unicode values? - c#

I am working on a chat application in WPF and I want to use emoticons in it. I want to read emoticons coming from Android/iOS devices and show the corresponding images.
On WPF, I am getting a black emoticon looking like . I have obtained a library of emoji icons which are saved with their respective hex/escaped Unicode values.
So, I want to convert these emoticon symbols into UTF-32/escaped Unicode so that I can directly replace them with the related emoji icons.
I have tried to convert an emoticon to its Unicode value but ended up with a different string containing a couple of symbols, which have different code points.
string unicodeString = "\u1F642"; // represents 🙂
Encoding unicode = Encoding.Unicode;
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
char[] unicodeChars = new char[unicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length)];
unicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, unicodeChars, 0);
string asciiString = new string(unicodeChars);
Any help is appreciated!!

Your escaped Unicode String is invalid in C#.
string unicodeString = "\u1F642"; // represents 🙂
This piece of code doesn't represent the "slightly smiling face", because the \u escape in C# only takes the first 4 hex digits - a single UTF-16 code unit (2 bytes).
So what you actually get is the character U+1F64 followed by a plain 2.
http://www.fileformat.info/info/unicode/char/1f64/index.htm
So this: ὤ2
If you want to enter a code point that needs 4 bytes (outside the Basic Multilingual Plane) and get the corresponding string, you have to use:
var unicodeString = char.ConvertFromUtf32(0x1F642);
https://msdn.microsoft.com/en-us/library/system.char.convertfromutf32(v=vs.110).aspx
or you could write it like this:
\uD83D\uDE42
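Just to illustrate that the two forms produce the same string, a small sketch (the variable names are only illustrative):
var fromLiteral = "\uD83D\uDE42";                   // surrogate-pair literal
var fromCodePoint = char.ConvertFromUtf32(0x1F642); // "🙂"
// fromLiteral == fromCodePoint is true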
This string can then be processed like this to get your desired result, which again is the hex value that we started with:
var x = char.ConvertFromUtf32(0x1F642);
var enc = new UTF32Encoding(true, false);
var bytes = enc.GetBytes(x);
var hex = new StringBuilder();
for (int i = 0; i < bytes.Length; i++)
{
    hex.AppendFormat("{0:x2}", bytes[i]);
}
var o = hex.ToString();
//result is 0001F642
(The result has the leading zeros, since a UTF-32 code unit is always 4 bytes.)
Instead of the for loop you can also use BitConverter.ToString(byte[]) https://msdn.microsoft.com/en-us/library/3a733s97(v=vs.110).aspx; the result will then look like:
var x = char.ConvertFromUtf32(0x1F642);
var enc = new UTF32Encoding(true, false);
var bytes = enc.GetBytes(x);
var o = BitConverter.ToString(bytes);
//result is 00-01-F6-42
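Going the other way, from a hex key like the "0001F642" above back to the emoticon string, is just a hex parse followed by ConvertFromUtf32. A minimal sketch (the key format is assumed to match the output shown above):
using System.Globalization;

var codePoint = int.Parse("0001F642", NumberStyles.HexNumber); // 128578
var emoji = char.ConvertFromUtf32(codePoint);                  // "🙂" as a UTF-16 surrogate pair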

Please be aware that Encoding.Unicode is UTF-16 in C#. For 32-bit Unicode there is Encoding.UTF32 (see the MSDN documentation for Encoding.UTF32).
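A short sketch of what that looks like for the smiley (note that Encoding.UTF32 is little-endian by default, so the bytes come out reversed compared to the big-endian example above):
using System;
using System.Text;

var utf32Bytes = Encoding.UTF32.GetBytes("\U0001F642");
Console.WriteLine(BitConverter.ToString(utf32Bytes)); // 42-F6-01-00 (little-endian)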

Since C# string literals support \U escapes for 32-bit code points, there is no need to use any encodings for this task.
Example 1.
var rgch = "\U0001F642".ToCharArray();
var str = $"\\u{(ushort)rgch[0]:X4}\\u{(ushort)rgch[1]:X4}";
Result: "\uD83D\uDE42"         Length of string str is 12 UTF-16 code points (24 bytes)
Example 2.
var rgch = "\U0001F642".ToCharArray();
var str = rgch[0] + "" + rgch[1];
Result: "🙂"             Length of string str is 2 UTF-16 code points (4 bytes)

Related

C# String to Byte Array (With preset string format)

I am working on a problem in C# and I am having issues converting my string of multiple hex values to a byte[].
string word = "\xCD\x01\xEF\xD7\x30";
(\x starts each new value, so I have: CD 01 EF D7 30)
This is my first time asking a question here, so please let me know if you need anything extra from me.
More information on the project:
I need to be able to change both
"apple" and "\xCD\x01\xEF\xD7\x30" to a byte array.
For the normal string "apple" I use
byte[] data = Encoding.ASCII.GetBytes(word);
but this does not seem to work with "\xCD\x01\xEF\xD7\x30"; I am getting the values
63, 1, 63, 63, 48
Ok... You were trying to directly "downcast"/"upcast" char <-> byte (where char is the C# char that is 16 bits long, and byte is 8 bits long).
There are various ways to do it. The simplest (probably not the most performant) is to use the iso-8859-1 encoding, which "maps" the byte values 0-255 to the Unicode code points 0-255 (and back).
Encoding enc = Encoding.GetEncoding("iso-8859-1");
string str = "apple";
byte[] bytes = enc.GetBytes(str);
string str2 = enc.GetString(bytes);
You can even do a little LINQ:
string str = "apple";
// This is "bad" if the string contains codepoints > 255
byte[] bytes = str.Select(x => (byte)x).ToArray();
// This is always safe, because by definition any value of a byte
// is a legal unicode character
string str2 = string.Concat(bytes.Select(x => (char)x));
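Applied to the string from the question, a quick sketch (each \x escape here stays below 0x100, so the iso-8859-1 round trip is lossless):
using System.Text;

var enc = Encoding.GetEncoding("iso-8859-1");
byte[] bytes = enc.GetBytes("\xCD\x01\xEF\xD7\x30");
// bytes is { 0xCD, 0x01, 0xEF, 0xD7, 0x30 }, unlike Encoding.ASCII.GetBytes,
// which replaces everything above 0x7F with '?' (0x3F)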

Base64 Encoding Javascript to C#

I am trying to port some JavaScript to C# and I'm having a bit of trouble. The JavaScript I am porting calls this:
var binary = out.map(function (c) {
    return String.fromCharCode(c);
}).join("");
return btoa(binary);
out is an array of numbers. I understand that it is taking the numbers and using fromCharCode to add characters to a string. At first I wasn't sure if my C# equivalent of btoa was working correctly, but the only characters I'm having issues with are the first 6 or 8. My encoded string outputs the same except for the first few characters.
At first in C# I was doing this
String binary = "";
foreach (int val in output) {
    binary += ((char)val);
}
And then I tried
foreach (int val in output) {
    System.Text.ASCIIEncoding convertor = new System.Text.ASCIIEncoding();
    char o = convertor.GetChars(new byte[] { (byte)val })[0];
    binary += o;
}
Both work fine on the later characters of the String but not the start. I've researched but I don't know what I'm missing.
My array of numbers is as follows: { 10, 135, 3, 10, 182, ....}
I know the 10s are newline characters, the 3 is end of text, the 182 is ¶, but what's confusing me is that the 135 should be the double dagger ‡. The JavaScript does not show it when I print the string.
So what ends up happening is that when the string is converted to Base64, my string looks like Cj8DCj8CRFF.... while the JavaScript string looks like CocDCrYCRFF.... The rest of the strings are the same and the int arrays used are identical.
Any ideas?
It's important to understand that binary data does not always represent valid text in a given encoding, and that some encodings have variable numbers of bytes to represent different characters. In short: binary data and text are not the same at all, and you can only convert between the two in some cases and by following clear, accurate rules. Treating them incorrectly will cause pain.
That said, if you have a list of ints, that are always within the range 0-255, that should become a base64 string, here is a way to do it:
var output = new[] { 0, 1, 2, 68, 69, 70, 254, 255 };
var binary = new List<byte>();
foreach (int val in output) {
    binary.Add((byte)val);
}
var result = Convert.ToBase64String(binary.ToArray());
If you have text that should be encoded as a base64 string, generally I'd recommend UTF-8 encoding, unless you need it to match the JS implementation.
var str = "Hello, world!";
var result = Convert.ToBase64String(Encoding.UTF8.GetBytes(str));
The encoding that JS uses appears to be the same as casting between byte and char (chars > 255 are invalid), which isn't one of the standard Encodings available.
Here's how you might combine raw numbers and strings, then convert that to base64.
checked // ensures that values outside of byte's range do not fail silently
{
    var output = new int[] { 10, 135, 3, 10, 182 };
    var binary = output.Select(x => (byte)x)
        .Concat("Hello, world".Select(c => (byte)c)).ToArray();
    var result = Convert.ToBase64String(binary);
}
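As a quick sanity check, a sketch using just the five numbers quoted in the question (the real array is longer, so the trailing padding differs):
using System;
using System.Linq;

var output = new[] { 10, 135, 3, 10, 182 };
var result = Convert.ToBase64String(output.Select(x => (byte)x).ToArray());
// result is "CocDCrY=", whose first characters match the JS output "CocDCrYC..." from the question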

C# ByteString to ASCII String

I am looking for a smart way to convert a string of hex byte values into a string of 'real text' (ASCII characters).
For example I have the word "Hello" written in Hexadecimal ASCII: 48 45 4C 4C 4F. And using some method I want to receive the ASCII text of it (in this case "Hello").
// I have this hex string and want to convert it to the text it encodes ("Hello").
string strHexa = "48454C4C4F";
// I want to convert the strHexa to an ASCII string.
string strResult = ConvertToASCII(strHexa);
I am sure there is a framework method. If this is not the case of course I could implement my own method.
Thanks!
var str = Encoding.UTF8.GetString(SoapHexBinary.Parse("48454C4C4F").Value); //HELLO
PS: SoapHexBinary is in System.Runtime.Remoting.Metadata.W3cXsd2001 namespace
I am sure there is a framework method.
As a single framework method: no.
However, the second part of this (converting a byte array containing ASCII-encoded text into a .NET string, which is UTF-16 encoded Unicode) does exist: System.Text.ASCIIEncoding, and specifically its GetString method:
string result = Encoding.ASCII.GetString(byteArray);
The first part is easy enough to do yourself: take two hex digits at a time, parse them as hex and store the resulting byte in the array. Something like:
byte[] HexStringToByteArray(string input) {
    Debug.Assert(input.Length % 2 == 0, "Must have two digits per byte");
    var res = new byte[input.Length / 2];
    for (var i = 0; i < input.Length / 2; i++) {
        var h = input.Substring(i * 2, 2);
        res[i] = Convert.ToByte(h, 16);
    }
    return res;
}
Edit: Note: L.B.'s answer identifies a method in .NET that will do the first part more easily; this is a better approach than writing it yourself (while in a perhaps obscure namespace, it is implemented in mscorlib rather than needing an additional reference).
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hexStr.Length; i += 2)
{
    string hs = hexStr.Substring(i, 2);
    sb.Append((char)Convert.ToByte(hs, 16)); // cast to char, otherwise the byte's numeric value is appended
}
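For completeness, a usage sketch combining the two parts (assuming the HexStringToByteArray helper defined above and a using for System.Text):
byte[] data = HexStringToByteArray("48454C4C4F");
string strResult = Encoding.ASCII.GetString(data); // "HELLO"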

How to convert a string with character codes above 127 to a byte array properly?

I am retrieving ASCII strings encoded with code page 437 from another system which I need to transform to Unicode so they can be mixed with other Unicode strings.
This is what I am working with:
var asciiString = "\u0094"; // 94 corresponds represents 'ö' in code page 437.
var asciiEncoding = Encoding.GetEncoding(437);
var unicodeEncoding = Encoding.Unicode;
// This is what I attempted to do, but it does not seem to support the eighth bit. Characters using the eighth bit are replaced with '?' (0x3F).
var asciiBytes = asciiEncoding.GetBytes(asciiString);
// This work-around does the job, but surely there is built-in functionality for this?
//var asciiBytes = asciiString.Select(c => (byte)c).ToArray();
// This piece of code happily converts the character correctly to Unicode: { 0x94 } => { 0xF6, 0x00 }.
var unicodeBytes = Encoding.Convert(asciiEncoding, unicodeEncoding, asciiBytes);
var unicodeString = unicodeEncoding.GetString(unicodeBytes); // I want this to be 'ö'.
What I am struggling with is that I cannot find a suitable method in the .NET Framework to transform a string with character codes above 127 to a byte array. This seems strange since there is support for transforming a byte array with values above 127 into a Unicode string.
So my question is, is there any built in method to do this conversion properly or is my work-around the proper way to do it?
var asciiString = "\u0094";
Whatever you name it, this will always be a Unicode string. .NET only has Unicode strings.
I am retrieving ASCII strings encoded with code page 437 from another system
Treat the incoming data as byte[], not as string.
var asciiBytes = new byte[] { 0x94 }; // 0x94 represents 'ö' in code page 437.
var asciiEncoding = Encoding.GetEncoding(437);
var unicodeString = asciiEncoding.GetString(asciiBytes);
\u0094 is Unicode code-point 0094, which is a control character; it is not ö. If you wanted ö, the correct string is
string s = "ö";
which is LATIN SMALL LETTER O WITH DIAERESIS, aka code-point 00F6.
So:
var s = "\u00F6"; // Identical to "ö"
Now we get our encoding:
var enc = Encoding.GetEncoding(437);
var bytes = enc.GetBytes(s);
And we find that it is a single-byte decimal 148, which is hex 94 - i.e. what you were after.
The significance here is that in C# when you use the "\uXXXX" syntax, the XXXX is always referring to Unicode code-points, not the encoded value in some particular encoding.
You have to look earlier in the code. Once you have the data as a string, it has already been decoded. Any characters lost in that decoding are impossible to get back.
You need the input as bytes, so that you can use your encoding object for code page 437 to decode it into a string.
byte[] asciiData = new byte[] { 0x94 }; // character ö in codepage 437
Encoding asciiEncoding = Encoding.GetEncoding(437);
string unicodeString = asciiEncoding.GetString(asciiData);
Console.WriteLine(unicodeString);
Output:
ö

C# unicode characters not getting displayed

I'm trying the MSDN example, using Windows XP and .NET 4.0.
using System;
using System.Text;
class Example
{
    static void Main()
    {
        string unicodeString = "This string contains the unicode character Pi (\u03a0)";
        // Create two different encodings.
        Encoding ascii = Encoding.ASCII;
        Encoding unicode = Encoding.Unicode;
        // Convert the string into a byte array.
        byte[] unicodeBytes = unicode.GetBytes(unicodeString);
        // Perform the conversion from one encoding to the other.
        byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
        // Convert the new byte[] into a char[] and then into a string.
        char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
        ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
        string asciiString = new string(asciiChars);
        // Display the strings created before and after the conversion.
        Console.WriteLine("Original string: {0}", unicodeString);
        Console.WriteLine("Ascii converted string: {0}", asciiString);
    }
}
Expected:
// The example displays the following output:
// Original string: This string contains the unicode character Pi (Π)
// Ascii converted string: This string contains the unicode character Pi (?)
But I'm getting
// The example displays the following output:
// Original string: This string contains the unicode character Pi (?)
// Ascii converted string: This string contains the unicode character Pi (?)
