Convert Numeric Reference Character to actual character - c#

I have text that needs to be converted into actual characters. We use Swedish characters, but the data we receive is encoded as numeric character references like this: &#246; &#196; &#229;
which are supposed to be converted into normal characters: ö, Ä, å respectively.
I searched here but had no luck. There was one answer, but it uses a Python library, so I can't convert it to C#.
Does anyone have sample code showing how to convert these to normal characters?

The WebUtility.HtmlDecode(string) method (namespace System.Net) can do the trick for you, or you can use the HttpUtility.HtmlDecode(string) method (namespace System.Web). Note that a numeric character reference ends with a semicolon.
var res1 = WebUtility.HtmlDecode("&#246;");
OR
var res2 = HttpUtility.HtmlDecode("&#246;");
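As a minimal sketch of the decoding described above, the following wraps WebUtility.HtmlDecode so it can be applied to decimal, hexadecimal, and named references alike. The class and method names are illustrative assumptions, not part of the original answer.

```csharp
using System.Net;

public static class ReferenceDecoder
{
    // Decodes HTML character references such as "&#246;" into the
    // characters they stand for. Text without references is returned as-is.
    public static string Decode(string encoded) => WebUtility.HtmlDecode(encoded);
}
```

For example, ReferenceDecoder.Decode("&#246; &#196; &#229;") returns "ö Ä å".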

Related

How to check if first character in a string is a bullet (Doesn't exist in ASCII)

I want to check if the first character of a string in C# is a •. How would I do this, since it has no ASCII value?
There are a few different Unicode characters that might look like bullets.
Use https://unicodelookup.com to look up the code of the one you're trying to match.
Specify the Unicode character in C# using escape notation with its hex code, such as "\u2022".
For example:
bool found = text.StartsWith("\u2022");
if (inputString.Length > 0 && inputString[0] == '•') Console.WriteLine("True.");
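Because several code points can look like a bullet, one option is to test the first character against a small set. This is only a sketch: the set below is an assumed selection of bullet-like characters, and the class and method names are hypothetical.

```csharp
using System;

public static class BulletCheck
{
    // Assumed selection of bullet-like code points; extend as needed.
    private static readonly char[] BulletLike =
    {
        '\u2022', // BULLET
        '\u00B7', // MIDDLE DOT
        '\u2023', // TRIANGULAR BULLET
        '\u25E6', // WHITE BULLET
        '\u2043', // HYPHEN BULLET
        '\u25CF', // BLACK CIRCLE
    };

    // Returns true when the string is non-empty and its first char is in the set.
    public static bool StartsWithBullet(string text) =>
        !string.IsNullOrEmpty(text) && Array.IndexOf(BulletLike, text[0]) >= 0;
}
```

Checking for an empty string first avoids the exception that indexing the first character would otherwise throw.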

Best way to parse ASCII(?) from a hex string in C#

The string I get in the application includes ASCII(?) characters like !, dP, \b, (, s#.
These are supposed to be equivalent.
Value in database:
\x01\x01\x03!\xea\x01\x00\x00dP\x00\x00\x1f\x8b\b\x00\x00\x00\x00\x00\x04\x00\xe3\xe6\x10\x11\x98\xc3(\xc1\xa2\xc0\xa8\xc0\xa0 \x02\xc4\x0c\x1a\x8c\x1a\x0c\x1as#\x04\x18\xf2\b\x1de\xe6\xe6\xe2\xe2b604\x14`\x94\x98\xc3\ba\x9b\"\xb1M\x80\xec\xc9\x10\xb6\x81\x05\x90=\t\xca6Ab[\x02\xd9\x13\xa1\xea\x8d\x80\xec.\xa8\xb8)\x12\xdb\x0c\xc8n\x81\xaa1\x06\xb2\x1b\x19\xb98A\xe2 \xf5\xb5\x10\xa6\x01\x90Y\rf\x1a\x9a#\x98\x16\b&\xc8\x8cJ\x88Z\x90\x11\xa5\x10Q\x90\xb6\x12\x88(H[1\x84\t\xf2O\xb6\xc0&v\tF\x1e\xa1\a\x8c\xc3\xd9\x8f\x8f\x8d%\x18\x01\xa1\x98\x8d\x97\xea\x01\x00\x00
Value I get in my app, which includes characters I don't want:
01010321ea010000645000001f8b0800000000000400e3e6101198c328c1a2c0a8c0a02002c40c1a8c1a0c1a73400418f2081d65e6e6e2e26236303414609498c308619b22b14d80ecc910b68105903d09ca3641625b02d913a1ea8d80ec2ea8b82912db0cc86e81aa3106b21b19b93841e220f5b510a60190590d661a9a2398160826c88c4a885a9011a5105190b6128828485b318409f24fb6c0267609461ea1078cc3d98f8f8d251801a1988d97ea0100000a\n\n"3a1ea8d80ec2ea8b82912db0cc86e81aa3106b21b19b93841e220f5b510a60190590d661a9a2398160826c88c4a885a9011a5105190b6128828485b318409f24fb6c0267609461ea1078cc3d98f8f8d251801a1988d97ea0100000a\n\n"3a1ea8d80ec2ea8b82912db0cc86e81aa3106b21b19b93841e220f5b510a60190590d661a9a2398160826c88c4a885a9011a5105190b6128828485b318409f24fb6c0267609461ea1078cc3d98f8f8d251801a1988d97ea0100000a\n\n
You can see that \x01 is 01, then \x03 is 03, then ! is 21. I want to take out all the non-hex values in the second string.
1. What are characters like ! and dP? Are they ASCII?
2. I can remove characters such as newlines with hexString = hexString.Replace("\n", ""); but I'm not sure that's the best way to handle all of them.
3. Comparing the two strings, I see that (=28 and s#=7340. Is there a conversion table for this?
My guess, given the quotes around the output, is that the database is displaying non-printable characters as hex (e.g. \x03) and that the actual string contains a single character for each hex-formatted item. In that case there is no difference to pick out: the character d is also the hex value \x64, and the database just chooses to output visible characters as their normal letters. The same goes for \t, which could be output as \x09, but they chose to use the standard C control-character abbreviations.
Found this:
When it is displayed on screen, redis-cli escapes non-printable characters using the \xHH encoding format, where HH is hexadecimal notation.
In other words,
The CLI is just using three different methods to display the values in the database field:
1. The character is printable: output the character (e.g. d, P, !, ").
2. The character is not printable but has a C language standard escape sequence: output the escape sequence (e.g. \b, \t, \n).
3. The character is not printable and has no escape sequence: output the hex value of the character (e.g. \x03, \x01, \x00).
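The three display rules can be sketched in C# to show that both strings in the question describe the same bytes. This is an approximation written for illustration, not redis-cli's actual code; the class and method names are hypothetical, and escaping of quotes and backslashes is an assumption based on the \" visible in the sample value.

```csharp
using System;
using System.Text;

public static class ByteDisplay
{
    // Renders bytes in the escaped style described above.
    public static string Escape(byte[] data)
    {
        var sb = new StringBuilder();
        foreach (byte b in data)
        {
            switch (b)
            {
                // Assumption: quote and backslash are escaped to keep output unambiguous.
                case (byte)'\\': sb.Append("\\\\"); break;
                case (byte)'"': sb.Append("\\\""); break;
                // Non-printable bytes that have a C escape sequence.
                case (byte)'\n': sb.Append("\\n"); break;
                case (byte)'\r': sb.Append("\\r"); break;
                case (byte)'\t': sb.Append("\\t"); break;
                case 0x07: sb.Append("\\a"); break;
                case 0x08: sb.Append("\\b"); break;
                default:
                    if (b >= 0x20 && b <= 0x7E)
                        sb.Append((char)b);                          // printable ASCII
                    else
                        sb.Append("\\x").Append(b.ToString("x2"));   // everything else as hex
                    break;
            }
        }
        return sb.ToString();
    }

    // Renders the same bytes as a plain lowercase hex string.
    public static string ToHex(byte[] data) =>
        BitConverter.ToString(data).Replace("-", "").ToLowerInvariant();
}
```

Given the bytes 01 01 03 21, Escape produces \x01\x01\x03! while ToHex produces 01010321, so neither form contains extra characters that need removing.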

(Char)174 returning the value of (Char)0174, why?

I am working with string delimiters and one of them is « or 174. However, when I step through my code it looks like this in the debugger, ®, which is 0174. See here for Codes.
This is how I'm doing it in code for reference:
string fvDelimiter = ((char)174).ToString();
It is all about character encoding. 174 (AE in hex) is ® in Unicode, which C# strings use internally. But it is « in the extended ASCII set (an OEM code page such as 437), which is what ALT+174 uses.
Please note this difference in the article you've provided:
Inserting ASCII characters
To insert an ASCII character, press and hold down ALT while typing the character code. For example, to insert the degree (º) symbol, press and hold down ALT while typing 0176 on the numeric keypad.
Inserting Unicode characters
To insert a Unicode character, type the character code, press ALT, and then press X. For example, to type a dollar symbol ($), type 0024, press ALT, and then press X.
From what I have seen it is based on the font being used. On Windows I use "Character Map" to check them out. For example the "«" character in the font Calibri is 171.
The simple reason is that char in C# represents a Unicode code unit, which matches 0174 (®), not the extended ASCII code 174 («).
Here is some code to illustrate what I mean:
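// C# has no octal literals, so 0174 would simply be decimal 174 (0xAE).
// ASCII is 7-bit, so byte 174 has no mapping and decodes to '?'.
var chrs = System.Text.Encoding.ASCII.GetChars(new byte[] { 174 });
// A lone 0xAE byte is not valid UTF-8, so it decodes to the replacement character U+FFFD.
var chrsUtf = System.Text.Encoding.UTF8.GetChars(new byte[] { 174 });
// UTF-16 (little-endian) needs two bytes per code unit: AE 00 is U+00AE, ®.
var chrsUnicode = System.Text.Encoding.Unicode.GetChars(new byte[] { 174, 0 });
Debug.WriteLine(chrs[0].ToString());
Debug.WriteLine(chrsUtf[0].ToString());
Debug.WriteLine(chrsUnicode[0].ToString());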
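To get « as the delimiter regardless of code page, the character can be named by its Unicode code point (171, U+00AB) rather than 174. A small sketch follows; the class and member names are illustrative assumptions.

```csharp
public static class Delimiters
{
    public const char LeftGuillemet = '\u00AB';   // «, decimal 171
    public const char RegisteredSign = '\u00AE';  // ®, decimal 174

    // Formats a char as "U+XXXX (decimal)" to make its numeric value visible.
    public static string Describe(char c) => $"U+{(int)c:X4} ({(int)c})";
}
```

With this, string fvDelimiter = Delimiters.LeftGuillemet.ToString(); yields «, and Delimiters.Describe((char)174) returns "U+00AE (174)", which is the registered sign seen in the debugger.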

Replace broken characters

I have a small program that replaces strings that contain umlauts, apostrophes, etc.
But sometimes I have broken strings that contain, for example, A¶ for ü, A¼ (or ü) for ö, and so on.
Is there a way to fix these strings?
I just tried to use another replace statement:
str = str.Replace("A¶", "ü");
str = str.Replace("A¼", "ö");
str = str.Replace("ü", "ö");
But this does not work for me.
It looks like the match is failing because these are non-ASCII characters. You will probably have to use Regex.Replace and reference the Unicode values of the characters in your regex. See: How can you strip non-ASCII characters from a string? (in C#)
Unicode/UTF8 reference: http://www.utf8-chartable.de/
Complete Unicode character set: http://www.unicode.org/charts/
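A sketch of referencing characters by their Unicode values is shown below. It assumes the broken text is the common case of UTF-8 bytes that were read as ISO-8859-1, where ö appears as U+00C3 U+00B6 and ü as U+00C3 U+00BC; the question's actual sequences may differ. It uses string.Replace with escapes instead of Regex.Replace, which is equivalent for fixed sequences. The second method is a different technique (re-encoding rather than replacing) and only works under the same assumption. All names are hypothetical.

```csharp
using System.Text;

public static class MojibakeRepair
{
    // Replaces two known broken sequences. Writing them as \u escapes means
    // the encoding of the source file itself cannot alter the literals.
    public static string ReplaceKnown(string s) =>
        s.Replace("\u00C3\u00B6", "\u00F6")   // broken form of ö
         .Replace("\u00C3\u00BC", "\u00FC");  // broken form of ü

    // Alternative: turn the chars back into the original bytes via ISO-8859-1,
    // then decode those bytes as UTF-8. Assumes every char is below U+0100.
    public static string Reinterpret(string s)
    {
        byte[] raw = Encoding.GetEncoding("ISO-8859-1").GetBytes(s);
        return Encoding.UTF8.GetString(raw);
    }
}
```

The replacement approach is safer on mixed input, since it touches only the listed sequences, while the re-encoding approach repairs every affected character at once but can damage text that was already correct.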

Convert &#char(w); to \uxxxx C#

I am working on a Korean document, and the HTML source code contains special symbols starting with &#char(w), e.g. 껰. Now I would like to convert such a symbol to its Unicode representation.
Is there a way to do so?
First, get the codepoint by converting it to int. Then, use String.Format to obtain the Unicode code string:
string result = string.Format("\\u{0:x4}", (int) chr);
or:
string result = "\\u" + ((int) chr).ToString("x4");
HTML uses the &# and &#x notations to encode Unicode characters, so your document already contains the characters in one possible Unicode notation.
If the sequence starts with &#x, the following characters are the hex code of the character. If it starts with &#, the following digits are the decimal code of the character.
Convert these codes to hex using ToString("x4") as in Konrad's answer.
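Putting the two answers together, a rough sketch could find each &#...; or &#x...; reference with a regular expression, parse the decimal or hex number, and format it with ToString("x4"). The class and method names are hypothetical, and code points above U+FFFF are written as two \u escapes (a UTF-16 surrogate pair), which is an assumption about the desired output.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class NumericReferenceConverter
{
    // Group 1 captures hex digits after "&#x"; group 2 captures decimal digits after "&#".
    private static readonly Regex Reference =
        new Regex(@"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    public static string ToUnicodeEscapes(string html)
    {
        return Reference.Replace(html, m =>
        {
            int cp;
            bool ok = m.Groups[1].Success
                ? int.TryParse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out cp)
                : int.TryParse(m.Groups[2].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out cp);

            // Leave the reference untouched if it is not a valid Unicode scalar value.
            if (!ok || cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return m.Value;

            var sb = new StringBuilder();
            foreach (char unit in char.ConvertFromUtf32(cp))
                sb.Append("\\u").Append(((int)unit).ToString("x4"));
            return sb.ToString();
        });
    }
}
```

For example, ToUnicodeEscapes("&#246;") returns \u00f6 and ToUnicodeEscapes("&#xAC00;") returns \uac00.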
