Convert UTF-8 hexadecimal to regular character

Does anyone know how to convert a string in C# that contains UTF-8 hexadecimal characters into regular characters?
For example:
hell%c3%b3 to hello
Chart:
UTF-8    ASCII   Character   Flattened
%c3%b3   %f3     ó           o
There are many UTF-8 hexadecimal sequences I need to convert. Is there a way to do this with a built-in method in .NET?

This is called URL encoding, and it can be undone with:
using System.Web;
HttpUtility.UrlDecode("hell%c3%b3");
This gives helló rather than hello, but that is probably what you wanted.
The second part, removing the diacritics, is not so simple; see "How do I remove diacritics" here on SO.
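
As a rough sketch of both steps together, the following uses System.Net.WebUtility.UrlDecode (a sibling of HttpUtility.UrlDecode that does not need a System.Web reference) and then strips diacritics by decomposing the string and dropping combining marks. The class and method names (UrlTextHelper, DecodeUrl, RemoveDiacritics) are made up for illustration, and the diacritics step is one common approach, not necessarily the one recommended in the linked question.

```csharp
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class UrlTextHelper
{
    // Decodes percent-encoded UTF-8 sequences, e.g. "hell%c3%b3" -> "helló".
    // Note that UrlDecode also turns '+' into a space.
    public static string DecodeUrl(string encoded)
    {
        return WebUtility.UrlDecode(encoded);
    }

    // Splits each accented letter into base letter + combining mark (Form D),
    // drops the combining marks, then recomposes what is left (Form C).
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(c);
            }
        }
        return result.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Calling UrlTextHelper.RemoveDiacritics(UrlTextHelper.DecodeUrl("hell%c3%b3")) should yield hello. Letters that have no canonical decomposition (such as ø or ß) pass through this sketch unchanged.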

Related

Serialize a string in binary with C# and deserialize it with C++

I'm struggling to find an effective way to serialize a string that could contain both Unicode and non-Unicode characters into a binary array, which I then write to a file that I have to deserialize using C++.
I have already implemented a serializer/deserializer in C++ that I use for most of my serialization and that can handle both Unicode and non-Unicode characters (basically I convert non-Unicode characters into their Unicode equivalents and serialize everything as a Unicode string; this is not the most efficient approach, since every string now takes 2 bytes per character, but it works).
What I'm trying to achieve is to transform an arbitrary string into a 2-byte-per-character string that I can then deserialize from C++.
What would be the most effective way to achieve what I'm looking for?
Also, any suggestions regarding the way I'm serializing strings are welcome, of course.

Encoding.Unicode.GetBytes("my string") encodes the string as UTF-16 (little-endian), which uses 2 bytes for each character in the Basic Multilingual Plane; characters outside it take 4 bytes as a surrogate pair. So if you are still looking for an alternative, consider this encoding.
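
A minimal sketch of how this could be used for the file format follows. The layout (a 4-byte little-endian byte count followed by the UTF-16LE payload) and the names Utf16Serializer and Serialize are assumptions made for illustration; the question does not describe the actual format the C++ deserializer expects.

```csharp
using System.IO;
using System.Text;

// Hypothetical serializer; the length-prefixed layout is an assumption.
public static class Utf16Serializer
{
    // Returns: [Int32 payload length in bytes, little-endian][UTF-16LE payload].
    public static byte[] Serialize(string text)
    {
        // Encoding.Unicode is UTF-16 little-endian; GetBytes emits no byte order mark.
        byte[] payload = Encoding.Unicode.GetBytes(text);
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(payload.Length); // BinaryWriter writes Int32 as 4 little-endian bytes
            writer.Write(payload);
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

On the C++ side, a reader for this assumed layout would read the 4-byte length and then copy that many bytes into a buffer of 16-bit code units (for example a std::u16string).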

Is it possible to display (convert?) the Unicode hex \u0092 as a Unicode HTML entity in .NET?

I have some string that contains the following code/value:
"You won\u0092t find a ...."
It looks like that string contains the Right Apostrophe special character.
ref1: Unicode control 0092
ref2: ASCII chart (both 127 + extra extended ascii)
I'm not sure how to display this in the web browser. It keeps displaying the tofu (square box) character instead. I'm under the impression that the Unicode (hex) value 0092 can be converted to Unicode (HTML) 
Is my understanding correct?
Update 1:
It was suggested by #sam-axe that I HtmlEncode the unicode. That didn't work. Here it is...
Note the ampersand got correctly encoded....

It looks like there's an encoding mix-up. In .NET, strings are normally encoded as UTF-16, and a right apostrophe should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252. If you include your string in a Unicode document, the character \x92 won't be interpreted properly.
You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:
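using System.Linq;
using System.Text;
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
// Every char in this string is below 256, so each one fits in a single byte.
byte[] bytes = title.Select(c => (byte)c).ToArray();
// Reinterpret those bytes as Windows-1252 text; byte 0x92 becomes the right apostrophe (U+2019).
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"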

Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).
You're correct that "" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.
According to the reference code, WebUtility.HtmlEncode() doesn't transform characters between 128 and 160; all characters in this range are control characters (the ampersand is special-cased in the code, as are a few other special HTML symbols).
My guess is that because these are control characters, they're output unchanged, since transforming them would change the meaning of the string. (I tried running some examples in LINQPad, and this character was not rendered.)
If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
Hope this helps.
Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.
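
For the case where the input encoding is not known, the custom function mentioned in the answer above might be sketched as follows. It rewrites any character in the C1 control range (U+0080 to U+009F) as a decimal numeric character reference; browsers following the HTML parsing rules generally map such references through Windows-1252, so the reference for 0x92 is typically shown as a right apostrophe. The names C1ControlEncoder and EncodeC1Controls are hypothetical, and the function is meant to run on the output of HtmlEncode(), not in place of it.

```csharp
using System.Text;

// Hypothetical post-processing step; not part of System.Web or System.Net.
public static class C1ControlEncoder
{
    // Replaces each char in U+0080..U+009F with "&#NNN;" and copies everything else as-is.
    public static string EncodeC1Controls(string text)
    {
        var result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c >= '\u0080' && c <= '\u009F')
            {
                result.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

A call would look like C1ControlEncoder.EncodeC1Controls(WebUtility.HtmlEncode(title)). Because the helper runs after HtmlEncode(), the ampersands it introduces are not encoded a second time.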

Encoding.ASCII.GetBytes converts "∞" to "?"

I want an infinity symbol in my string. I used the following code to get the infinity symbol:
char.ConvertFromUtf32(8734)
The string is then converted to JSON, and the JSON is encoded, i.e.
Encoding.ASCII.GetBytes(json)
This converts "∞" to the "?" symbol.
How can I resolve this problem? Please help me.
Thanks.

The infinity sign ∞ is not part of the ASCII character set. So by using Encoding.ASCII.GetBytes() you explicitly exclude it from the string, effectively replacing it with a placeholder, in this case ?.
Since you use the resulting byte array for a JSON reply, you might want to consider using UTF8 instead of ASCII.
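
To make the difference concrete, here is a small sketch that encodes the same text both ways. The class and method names (JsonBytes, EncodeAscii, EncodeUtf8) are invented for this example.

```csharp
using System.Text;

// Hypothetical wrapper used only to compare the two encodings.
public static class JsonBytes
{
    // ASCII covers only code points 0-127; any other character is replaced with '?' (byte 63).
    public static byte[] EncodeAscii(string json)
    {
        return Encoding.ASCII.GetBytes(json);
    }

    // UTF-8 can represent every Unicode character; "∞" (U+221E) becomes the bytes E2 88 9E.
    public static byte[] EncodeUtf8(string json)
    {
        return Encoding.UTF8.GetBytes(json);
    }
}
```

Decoding the UTF-8 bytes with Encoding.UTF8.GetString() returns the original string, including the ∞, whereas the ASCII bytes can only ever decode to a question mark in its place.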

Encoding conversion from RSS feed chars

I am trying to show a simple text RSS feed from a CodePlex project in a window.
My problem is that the feed text contains a lot of character sequences that look like:
:
-
etc..
I know that they represent the punctuation and some special chars, with some kind of encoding, but I do not know how I can convert them back to simple ascii chars... I mean, without a switch/case covering each special char of course.
Thank you !
Sum-up: How can I convert "My name is : Aurelien" to "My name is : Aurelien" ?

As you can see by the question generated by your markup, those are HTML encoded characters.
All you have to do to decode them is call HttpUtility.HtmlDecode().
If you're using .NET 4.0, you could also use System.Net.WebUtility.HtmlDecode() which would allow you to continue to target the Client Profile rather than the full framework.
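
A minimal sketch of the Client Profile variant follows. The wrapper name FeedTextDecoder is hypothetical, and since the exact entities in the feed are not shown in the question, the sample input in the note below (a decimal reference for the colon) is only an assumption.

```csharp
using System.Net;

// Hypothetical wrapper around the framework call.
public static class FeedTextDecoder
{
    // Turns named and numeric HTML character references back into plain characters.
    public static string Decode(string feedText)
    {
        return WebUtility.HtmlDecode(feedText);
    }
}
```

For example, FeedTextDecoder.Decode("My name is &#58; Aurelien") should return "My name is : Aurelien", with no per-character switch/case needed.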

Which encoding does Alt+Numpad keys generate?

In short:
For this code:
Encoding.ASCII.GetBytes("‚")
I want the output to be 130, but this gives me 63.
I am typing the string using Alt+0130.

On my setup:
Encoding.ASCII.GetBytes("‚"); // 63
Encoding.Default.GetBytes("‚"); // 130
Of course 'default' could very well be environment-dependent...

When you try to encode the string using the ASCII encoding, it will be converted to a question mark, as there is no such character in the ASCII character set. The character code for the question mark is 63.
You need to use an encoding that supports the character to get its actual character code.
One option is to use the Encoding.Default property to get the encoding for the system code page, as David suggested. However, as the system code page can differ, it's not guaranteed to give the same result on all computers.
The Unicode character code is 8218, which you can get by simply converting the character to an int:
int characterCode = (int)'‚';
As this does not depend on any system settings, you should consider whether you can use that instead of the encoded byte value.
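
The two values discussed above can be compared side by side with a short sketch like the one below. The class and method names (CharacterCodes, CodePoint, AsciiByte) are made up for illustration.

```csharp
using System.Text;

// Hypothetical helper contrasting the Unicode code point with the ASCII-encoded byte.
public static class CharacterCodes
{
    // The UTF-16 code unit of the char; for '‚' (U+201A) this is 8218 on every machine.
    public static int CodePoint(char c)
    {
        return (int)c;
    }

    // The byte produced by the ASCII encoding; any char above 127 comes back as 63 ('?').
    public static byte AsciiByte(char c)
    {
        return Encoding.ASCII.GetBytes(new[] { c })[0];
    }
}
```

Getting 130 back requires an encoding in which the character actually sits at that position, such as Windows-1252, which is what Encoding.Default happened to be on the setup in the earlier answer.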
