C# Windows not converting some emoji

In the .Net level, emoji codes are directly converted to the black and white emoji.
Example: when .Net receives
:punch: it is converted to 👊
:+1: it is converted to 👍
:-1: it is converted to 👎
These are converted at the system level; we receive the black and white emoji directly.
But some emoji with different names are not converted. How can I add these at the system level so that they are converted as well? Some platforms send :+1: and some send :thumbsup:.
These emoji are not converted:
:facepunch: is not converted to 👊
:thumbsup: is not converted to 👍
:thumbsdown: is not converted to 👎
These are the same emoji under different names, and all of them are converted on other platforms such as Android and iOS.

in the Net level emoji code is directly converted to the black and White emoji
There's no such thing as the "Net level". You deal with strings that contain Unicode code points; how these are rendered is a totally different affair.
You're not mentioning what software does this, but I assure you that .NET doesn't take your string containing :emoji: and convert it to a single Unicode code point. That's the job of whatever you use to enter these strings, or of whatever takes these strings and renders them.
At any rate, it's not a bug that a text console doesn't convert :squirtgun: to a picture of a squirt gun.
So, go wild. Build your own :emoji: conversion routines. It's a moving target.
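If you do roll your own, the core of it is just an alias table in which several shortcodes map to the same code point. A minimal sketch (shown in Python for brevity; the same dictionary-plus-regex idea ports directly to a C# Dictionary<string, string> with Regex.Replace, and the alias list here is only illustrative):

```python
import re

# Alias table: multiple shortcode names for the same emoji code point.
# These are the names from the question; real platforms send many more.
EMOJI_ALIASES = {
    ":punch:": "\U0001F44A",
    ":facepunch:": "\U0001F44A",
    ":+1:": "\U0001F44D",
    ":thumbsup:": "\U0001F44D",
    ":-1:": "\U0001F44E",
    ":thumbsdown:": "\U0001F44E",
}

# One alternation regex, longest names first so no shortcode is
# half-matched inside a longer one.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(EMOJI_ALIASES, key=len, reverse=True))
)

def replace_shortcodes(text: str) -> str:
    """Replace every known :shortcode: with its emoji in a single pass."""
    return _PATTERN.sub(lambda m: EMOJI_ALIASES[m.group(0)], text)

print(replace_shortcodes("Nice work :thumbsup: :+1:"))  # → 'Nice work 👍 👍'
```

Doing the substitution in a single regex pass also means replaced text is never re-scanned, which avoids accidental double replacement.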

Related

Convert KaTeX to UTF-8 in C#

I have a third-party API that returns mathematical equations in KaTeX form. Is there a way to convert KaTeX into UTF-8 in C#?
Thanks.
No. You can't. And there is a very good reason why.
Let's start with the fact that KaTeX is (I'm quoting the katex tag): KaTeX is a fast, easy-to-use JavaScript library for TeX math rendering on the web.
So it is a renderer (written in JavaScript, but that is irrelevant) that transforms TeX commands (and some plain text) into formatted text. Some (many) of those TeX commands print a character (\xi prints ξ, for example). Others control the formatting (similar, we could say, to the formatting tags of HTML). Now: the first kind could be extracted and converted; we look for \xi and we print ξ. The second kind we can't, in the same way that we can't put "bold" formatting in a C# string. Formatting is part of the presentation.
For example:
\widetilde{\f\relax{x} = \int_{-\infty}^\infty \f\hat\xi\,e^{2 \pi i \xi x}\,d\xi}
is formatted by KaTeX into a typeset equation (rendered image not reproduced here).
If we convert all the TeX commands that simply print a character we will obtain:
f(x)=∫−∞∞f^(ξ) e2πiξx dξ
I don't think that would be very useful. We have lost the formatting, and we have lost the undulated line that was above the equation (the equation was taken from the homepage of the KaTeX project, but I've added the \widetilde{} undulated line to show this point). The line isn't a Unicode character; it is drawn with SVG.
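The first kind of command really can be converted by table lookup. A toy sketch of that idea (in Python; the table and helper are mine for illustration, and a real mapping would need hundreds of entries while still being unable to express commands like \widetilde):

```python
import re

# Illustrative table of TeX commands that simply print a character.
TEX_CHARS = {
    r"\xi": "ξ",
    r"\pi": "π",
    r"\int": "∫",
    r"\infty": "∞",
}

# Try longer command names first so \infty is not half-matched as \int.
_PATTERN = re.compile(
    "|".join(re.escape(c) for c in sorted(TEX_CHARS, key=len, reverse=True))
)

def tex_chars_to_unicode(tex: str) -> str:
    """Replace character-printing TeX commands; formatting commands remain."""
    return _PATTERN.sub(lambda m: TEX_CHARS[m.group(0)], tex)

print(tex_chars_to_unicode(r"\int_{-\infty}^\infty"))  # → '∫_{-∞}^∞'
```

Note how the subscript/superscript syntax survives untranslated: that is exactly the "lost formatting" the answer describes.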

C# / Python Encoding difference

Basically I am doing some conversions of PDFs into text, then analyzing and clipping parts of that text using a library in Python. The Python "clipping" doesn't actually cut the text into separate files; it just records a start and end character position for string extraction. For example:
the quick brown fox jumped over the lazy dog
My Python code might cut out "quick" by specifying 4, 9. Then I use C# for a GUI program and take these values assigned by Python, and it works... for the most part. It appears the optical character recognition program that turned the PDF into a text file included some odd UTF characters, which change the counts on the C# side.
The odd characters from the PDF-to-text conversion include a "fi" ligature character instead of separate "f" and "i" characters (possibly other characters too; they are large files). This wouldn't be a problem, except C# says this is one character while Python (as well as Notepad++) counts it as 3 character positions.
C#: "fi" length = 1 character.
Python/Notepad++: "fi" length = 3 characters.
What this ends up doing is giving me an offset clip due to the difference in character counts. Like I said, when I run it in Python (Linux) and output the clipping it's perfect, and when I transferred the text file to Windows, Notepad++ confirmed the positions are correct. C# counts the "fi" as one character, while Notepad++ and Python count it as 3 characters for some reason.
I need a way to bridge this discrepancy from the Python side OR the C# side.
You have to distinguish between characters and bytes. UTF-8 is a character encoding in which one character can take up to 4 bytes. Notepad++ probably displays byte positions, and Python can work with both byte strings and character strings. In C# you have probably read the file as a text file, which produces character strings.
To read character strings in python use:
import codecs
with codecs.open(filename, encoding="utf-8") as inp:
    text = inp.read()
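The "fi" in question is the single code point U+FB01 (LATIN SMALL LIGATURE FI), which is exactly why the counts disagree: one character, but three bytes in UTF-8. A quick check:

```python
lig = "\ufb01"  # the 'fi' ligature, U+FB01

# One character (one code point). C#'s string.Length reports 1 as well,
# since U+FB01 fits in a single UTF-16 code unit.
print(len(lig))                     # → 1

# Three bytes in UTF-8, which is what byte-position tools count.
print(len(lig.encode("utf-8")))     # → 3
```

So the practical fix is to make both sides count the same thing: either have Python produce character offsets (by reading the file as text, as above) or translate between byte and character positions explicitly.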

How do I use 32-bit Unicode characters in C#?

Maybe I don't need 32-bit strings, but I need to represent 32-bit characters:
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Now, I grabbed the Symbola font and can see the character when I paste it (in the URL bar or any text area), so I know I have font support for it.
But how do I support it in my C#/.NET app?
Edit: I'll add something. When I pasted the said character into my .NET WinForms app I do NOT see the character correctly. When pasting it into Firefox I do see it correctly. How do I see the character correctly in my WinForms apps?
I am not sure I understand your question:
Strings in .NET are UTF-16 encoded, and there is nothing you can do about this. If you want to get the UTF-32 version of a string, you will have to convert it into a byte array with the UTF32Encoding class.
Characters in .NET are thus 16 bits long, and there is nothing you can do about this either. A UTF-32 encoded character can only be represented by a byte array (with 4 items). You can use the UTF32Encoding class for this purpose.
Every UTF-32 character has an equivalent UTF-16 representation, and vice-versa. So in this context we could only speak of characters, and of their different representations (encodings), UTF-16 being the representation of choice on the .NET platform.
You didn't say what exactly you mean by "support", but there is nothing special you need to do to work with characters that don't fit into one 16-bit char, unless you do string manipulation. They will just be represented as surrogate pairs, but you shouldn't need to know about that if you treat the string as a whole.
One exception is that some string manipulation methods won't work correctly. For example "\U0001F4A9".Substring(1) will return the second half of the surrogate pair, which is not a valid string.
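To make the surrogate-pair point concrete: in C#, "\U0001F4A9".Length is 2, because the character is stored as the UTF-16 pair D83D DCA9. The same pair is easy to inspect from Python by encoding to UTF-16 (illustration only; in C# itself you would build the string with char.ConvertFromUtf32(0x1F4A9) and walk text elements with System.Globalization.StringInfo):

```python
s = "\U0001F4A9"  # PILE OF POO, the character from the linked page

# Big-endian UTF-16 without a BOM exposes the two 16-bit code units
# that a .NET string stores for this single code point.
units = s.encode("utf-16-be")
print(units.hex())  # → 'd83ddca9' (high surrogate D83D, low surrogate DCA9)
```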

Encoding conversion from RSS feed chars

I am trying to show a simple text RSS feed from a CodePlex project in a window.
My problem is that the feed text contains a lot of character sequences that look like:
&#58;
&#45;
etc.
I know that they represent punctuation and some special chars, with some kind of encoding, but I do not know how I can convert them back to simple ASCII chars... I mean, without a switch/case covering each special char, of course.
Thank you!
Sum-up: how can I convert "My name is &#58; Aurelien" to "My name is : Aurelien"?
As you can see by the question generated by your markup, those are HTML encoded characters.
All you have to do is pass them through HttpUtility.HtmlDecode().
If you're using .NET 4.0, you could also use System.Net.WebUtility.HtmlDecode() which would allow you to continue to target the Client Profile rather than the full framework.
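For illustration, the same decoding via Python's standard library, where html.unescape plays the role that HtmlDecode plays in .NET:

```python
import html

encoded = "My name is &#58; Aurelien"

# Numeric character references are decoded back to plain characters,
# just as HttpUtility.HtmlDecode / WebUtility.HtmlDecode do in .NET.
print(html.unescape(encoded))   # → 'My name is : Aurelien'
```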

Two encodings used in RTF string won't display correct in RichTextBox?

I am trying to parse some RTF that I get back from the server. For most text this works fine (and using a RichTextBox control will do the job), however some of the RTF seems to contain an additional "encoding" and some of the characters get corrupted.
The original string is as follows (and contains some of the characters used in Polish):
ąćęłńóśźż
The RTF string with hex-encoded characters that is sent back looks like this:
{\lang1045\langfe1045\f16383 {\'b9\'e6\'ea\'b3{\f7 \'a8\'bd\'a8\'ae}\'9c\'9f\'bf}}
I am having problems decoding the ńó characters in the returned string; they seem to be represented by two hex values each, whereas the rest of the string is represented (as expected) by single hex values.
Using a RichTextBox control to "parse" the RTF results in corrupted text (the two characters in question are displayed as four different unwanted characters).
If I encoded the plain string to hex myself using the expected code page (1250, Latin 2, the ANSI code page for LCID 1045), I would get the following:
\'B9\'E6\'EA\'B3\'F1\'F3\'9C\'9F\'BF
I am lost as to how I can correctly decode the {\f7 \'a8\'bd\'a8\'ae} part of the returned string that should correspond to ńó.
Note that there is no font definition for \f7 in the RTF header, and the string looks fine when viewed directly on the server, meaning that if the characters are corrupted, they are corrupted in the conversion before sending.
I am not sure if the problem is on the server side (I have no control over that), but since the server is used for a lot of translation work, I assume that the returned string is OK.
I have been going through the RTF specs but cannot find any hint regarding this type of combination of encodings.
I don't know why it's happening, but the encoding appears to be GBK (or something sufficiently similar).
Perhaps the server tries to do some "clever" matching to find the characters, or the server's default character encoding is GBK or so, and those characters (and only those) also occur in GBK so it prefers that.
I found out by adding the offending hex codes (A8 BD A8 AE) as bytes into a simple HTML file, so I could go through my browser's encodings and see if anything matched:
<html><body>¨½¨®</body></html>
To my surprise, my browser came up with "ńó" straight away.
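The byte interpretation is easy to reproduce (Python shown for convenience; "gbk" is the codec name for Windows code page 936):

```python
# The four bytes behind \'a8\'bd\'a8\'ae in the returned RTF.
mystery = bytes.fromhex("a8bda8ae")

# Read as GBK, they are exactly the two missing Polish characters.
print(mystery.decode("gbk"))       # → 'ńó'

# Read byte-by-byte (as the browser treated the HTML test above,
# falling back to Latin-1), they are four junk characters instead.
print(mystery.decode("latin-1"))   # → '¨½¨®'
```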
