Convert &#char(w); to \uxxxx C# - c#

I am working on a Korean document, and the HTML source code contains special symbols written as numeric character references (&# followed by a number and a semicolon), e.g. 껰. I would like to convert each such symbol to its Unicode escape representation (\uxxxx).
Is there a way to do so?

First, get the codepoint by converting it to int. Then, use String.Format to obtain the Unicode code string:
string result = string.Format("\\u{0:x4}", (int) chr);
or:
string result = "\\u" + ((int) chr).ToString("x4");

HTML uses the &# and &#x notations to encode Unicode characters, so your document already contains the characters in one possible Unicode notation.
If the sequence starts with &#x, the characters that follow are the hexadecimal code of the character. If the sequence starts with &#, the digits that follow are the decimal code of the character.
Parse the decimal codes to an integer, then convert them to hex using ToString("x4") as in Konrad's answer. The hexadecimal codes only need to be padded to four digits.
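Putting the two answers together, a minimal sketch might look like the following. The class name NcrConverter and the method ToUnicodeEscapes are hypothetical names chosen for illustration, and the sketch assumes every reference ends with a semicolon. Code points above U+FFFF are written as a surrogate pair, since a single \uxxxx escape only holds four hex digits.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answers): rewrites HTML numeric
// character references such as &#44784; or &#xAC00; as \uxxxx escapes.
public static class NcrConverter
{
    // Group 1 captures a hex code, group 2 a decimal code. The digit limits
    // keep int.Parse from overflowing on absurdly long inputs.
    private static readonly Regex Ncr =
        new Regex(@"&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    public static string ToUnicodeEscapes(string html)
    {
        return Ncr.Replace(html, m =>
        {
            int codePoint = m.Groups[1].Success
                ? int.Parse(m.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture)
                : int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);

            // Leave references that are not valid Unicode scalar values untouched.
            if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
                return m.Value;

            // ConvertFromUtf32 yields one or two UTF-16 code units;
            // each one becomes its own \uxxxx escape.
            var sb = new StringBuilder();
            foreach (char c in char.ConvertFromUtf32(codePoint))
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            return sb.ToString();
        });
    }
}
```

For example, NcrConverter.ToUnicodeEscapes("&#44784;") returns the six-character string \uaef0.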

Related

Convert Numeric Reference Character to actual character

I have text that needs to be converted into actual characters. We use Swedish characters, but the data we receive is encoded as numeric character references, like this: &#246; &#196; &#229;
These are supposed to be converted into normal characters such as ö, Ä, å respectively.
I searched here without luck. There was one answer, but it uses a Python library, so I can't convert it to C#.
Does anyone have sample code showing how to convert these to normal characters?
The WebUtility.HtmlDecode(string) method (namespace System.Net) will do the trick. Alternatively, you can use the HttpUtility.HtmlDecode(string) method from the System.Web namespace. Note that the reference must include its trailing semicolon.
var res1 = WebUtility.HtmlDecode("&#246;");
OR
var res1 = HttpUtility.HtmlDecode("&#246;");
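As a small self-contained sketch of the first option, the wrapper below decodes a whole string of references at once. The class name DecodeDemo and the method Decode are hypothetical; only WebUtility.HtmlDecode comes from the framework.

```csharp
using System;
using System.Net;

// Hypothetical wrapper around the framework call, for illustration only.
public static class DecodeDemo
{
    // Decodes decimal (&#246;) and hexadecimal (&#xF6;) character references.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

Calling DecodeDemo.Decode("&#246; &#196; &#229;") gives "ö Ä å".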

How to check if first character in a string is a bullet (Doesn't exist in ASCII)

I want to check whether the first character in a string in C# is a •. How would I do this, since it has no ASCII value?
There are a few different Unicode characters that might look like bullets.
Use https://unicodelookup.com to look up the code point you're trying to match.
Specify the Unicode character in C# using escape notation with the character's hexadecimal code, such as "\u2022".
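For example:
bool found = text.StartsWith("\u2022");
Or compare the first character directly (this needs using System.Linq and throws on an empty string):
if (inputString.First() == '•') Console.WriteLine("True.");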
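Since several code points can look like a bullet, one possible sketch checks the first character against a small set. The class name BulletCheck and the choice of code points are assumptions for illustration; the list is not exhaustive.

```csharp
using System;

// Hypothetical helper: the set of "bullet-like" characters below is an
// illustrative assumption, not a complete list.
public static class BulletCheck
{
    private static readonly char[] BulletLike =
    {
        '\u2022', // BULLET
        '\u2023', // TRIANGULAR BULLET
        '\u2043', // HYPHEN BULLET
        '\u25E6', // WHITE BULLET
        '\u00B7', // MIDDLE DOT
    };

    // Returns false for null or empty input instead of throwing.
    public static bool StartsWithBullet(string text)
    {
        return !string.IsNullOrEmpty(text)
            && Array.IndexOf(BulletLike, text[0]) >= 0;
    }
}
```

Comparing a single char this way is an ordinal comparison, so it does not depend on the current culture.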

Best way to parse ASCII(?) from a hex string in C#

The string I get in the application includes ASCII(?) characters like !, dP, \b, (, s@.
The two values below are supposed to be equivalent.
value in database-
\x01\x01\x03!\xea\x01\x00\x00dP\x00\x00\x1f\x8b\b\x00\x00\x00\x00\x00\x04\x00\xe3\xe6\x10\x11\x98\xc3(\xc1\xa2\xc0\xa8\xc0\xa0 \x02\xc4\x0c\x1a\x8c\x1a\x0c\x1as@\x04\x18\xf2\b\x1de\xe6\xe6\xe2\xe2b604\x14`\x94\x98\xc3\ba\x9b\"\xb1M\x80\xec\xc9\x10\xb6\x81\x05\x90=\t\xca6Ab[\x02\xd9\x13\xa1\xea\x8d\x80\xec.\xa8\xb8)\x12\xdb\x0c\xc8n\x81\xaa1\x06\xb2\x1b\x19\xb98A\xe2 \xf5\xb5\x10\xa6\x01\x90Y\rf\x1a\x9a#\x98\x16\b&\xc8\x8cJ\x88Z\x90\x11\xa5\x10Q\x90\xb6\x12\x88(H[1\x84\t\xf2O\xb6\xc0&v\tF\x1e\xa1\a\x8c\xc3\xd9\x8f\x8f\x8d%\x18\x01\xa1\x98\x8d\x97\xea\x01\x00\x00
value I get in my app that includes characters I don't want-
01010321ea010000645000001f8b0800000000000400e3e6101198c328c1a2c0a8c0a02002c40c1a8c1a0c1a73400418f2081d65e6e6e2e26236303414609498c308619b22b14d80ecc910b68105903d09ca3641625b02d913a1ea8d80ec2ea8b82912db0cc86e81aa3106b21b19b93841e220f5b510a60190590d661a9a2398160826c88c4a885a9011a5105190b6128828485b318409f24fb6c0267609461ea1078cc3d98f8f8d251801a1988d97ea0100000a\n\n
You can see that \x01 is 01, then \x03 is 03, then ! is 21. I want to take out all the non-hex values in the second string.
1. What are characters like ! and dP? Are they ASCII?
2. I can remove characters such as newlines with hexString = hexString.Replace("\n", ""); but I'm not sure that is the best way to handle all of them.
3. Comparing the two strings, I see that (=28 and s@=7340. Is there a table for this conversion?
My guess, given the quotes around the output, is that the database is displaying non-printable (non-ASCII? Unicode?) characters as hex escapes (e.g. \x03) and that the actual string contains a single character for each escape shown. In that case there is no difference to pick out: the character d is also the hex value \x64, and the database simply chooses to output visible characters as their normal letter. The same goes for \t, which could be output as \x09, but the database uses the standard C control-character abbreviations instead.
Found this:
When it is displayed on screen, redis-cli escapes non-printable characters using the \xHH encoding format, where HH is hexadecimal notation.
In other words, the CLI is just using three different methods to display the values in the database field:
1. The character is printable: output the character (e.g. d, P, !, ").
2. The character is not printable, but has a C language standard escape sequence: output the escape sequence (e.g. \b, \t, \n).
3. The character is not printable and has no escape sequence: output the hex for the value of the character (e.g. \x03, \x01, \x00).
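The three display rules above can be sketched in C# as follows. The class name HexDump and its methods are hypothetical, and Escape only approximates what redis-cli prints. ParseHex assumes that anything that is not a hex digit (newlines, quotes) is junk to discard. That assumption breaks if the junk itself contains the letters a to f.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper illustrating the answer; not part of any Redis client.
public static class HexDump
{
    // Drops every character that is not a hex digit, then reads byte pairs.
    public static byte[] ParseHex(string text)
    {
        string hex = new string(text.Where(c => Uri.IsHexDigit(c)).ToArray());
        if (hex.Length % 2 != 0)
            throw new FormatException("Odd number of hex digits.");

        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return bytes;
    }

    // Renders bytes roughly the way the CLI shows them: printable ASCII as-is,
    // C escape sequences where they exist, \xHH for everything else.
    public static string Escape(byte[] bytes)
    {
        var sb = new StringBuilder();
        foreach (byte b in bytes)
        {
            switch (b)
            {
                case 0x07: sb.Append("\\a"); break;
                case 0x08: sb.Append("\\b"); break;
                case 0x09: sb.Append("\\t"); break;
                case 0x0a: sb.Append("\\n"); break;
                case 0x0d: sb.Append("\\r"); break;
                case 0x22: sb.Append("\\\""); break;
                case 0x5c: sb.Append("\\\\"); break;
                default:
                    if (b >= 0x20 && b < 0x7f) sb.Append((char)b);
                    else sb.Append("\\x").Append(b.ToString("x2"));
                    break;
            }
        }
        return sb.ToString();
    }
}
```

For instance, HexDump.Escape(HexDump.ParseHex("1a73400418f2081d65")) gives \x1as@\x04\x18\xf2\b\x1de, which matches the corresponding stretch of the database output. So both values describe the same bytes and there is no separate conversion table to look up.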

Insufficient Hexadecimal Digits Regex Exception?

I am writing a regex that should match all letters (including Chinese) and some chosen punctuation (also including Chinese).
Here's my regex:
"^[\p{L}\x{FF01}-\x{FF1E}\x{3008}-\x{30A9}0-9\s@#$^&*()+=,.?`~_:;|""-{}[]+$"
It throws an exception saying "insufficient hexadecimal digits". Can anybody please tell me what is wrong with it? I tried some online regex testers and it works there.
I'm using the Regex class of C# to parse it.
From the docs:
\x nn Uses hexadecimal representation to specify a character (nn consists of exactly two digits).
I think what you want is \u:
\u nnnn Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn).
Try this:
@"^[\p{L}\uFF01-\uFF1E\u3008-\u30A90-9\s@#$^&*()+=,.?`~_:;|""-{}[]+$"
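A small sketch to check the difference between the two escape forms. The class name CjkPattern is hypothetical, and the pattern is a simplified version of the one above, with the ASCII punctuation left out for readability.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; Pattern is a simplified form of the regex above.
public static class CjkPattern
{
    // Letters, fullwidth punctuation U+FF01..U+FF1E, U+3008..U+30A9,
    // ASCII digits and whitespace.
    public const string Pattern = @"^[\p{L}\uFF01-\uFF1E\u3008-\u30A90-9\s]+$";

    public static bool IsAllowed(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    // Returns false when the .NET regex parser rejects the pattern.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Here CjkPattern.Compiles(@"[\x{FF01}-\x{FF1E}]") returns false, because .NET expects exactly two hex digits after \x, while the \u form compiles.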

Java char literal to C# char literal

I am maintaining some Java code that I am currently converting to C#.
The Java code is doing this:
sendString(somedata + '\000');
And in C# I am trying to do the same:
sendString(somedata + '\000');
But on the '\000' VS2010 tells me that "Too many characters in character literal". How can I use '\000' in C#? I have tried to find out what the character is, but it seems to be " " or some kind of newline-character.
Do you know anything about the issue?
Thanks!
'\0' will be just fine in C#.
What's happening is that C# sees \0 and converts that to a NUL character with an ASCII value of 0; then it sees two more 0s, which is illegal inside a character literal (since you used single quotes, not double quotes). The NUL character is typically not printable, which is why it looked like an empty string when you tried to print it.
What you've typed in Java is a character literal supporting an octal number. C# does not support octal literals in characters or numbers, in an effort to reduce programming mistakes.*
C# does support Unicode escapes of the form '\u0000', where 0000 is exactly four hexadecimal digits. The related '\x0' form accepts one to four hexadecimal digits.
* In PHP, for example, if you type in a number with a leading zero that is a valid octal number, it gets translated. If it's not a legal octal number, it doesn't get translated correctly. <? echo 017; echo ", "; echo 018; ?> outputs 15, 1 on my machine.
That's a null character, also known as NUL. You can write it as '\0' in C#.
In C# the string "\000" represents three characters: the null character, followed by two zero digits. Since a character literal can only contain one character, this is why you get the error "Too many characters in character literal".
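A short sketch of the points above. The class name NulDemo and the method Terminate are hypothetical stand-ins for whatever builds the argument to sendString.

```csharp
using System;

// Hypothetical demo of the NUL character literal in C#.
public static class NulDemo
{
    // '\0', '\u0000' and '\x0' all denote the same character, code 0.
    public const char Nul = '\0';

    // Appends a NUL terminator, as the Java code does with '\000'.
    public static string Terminate(string data)
    {
        return data + Nul;
    }

    // "\000" is NUL followed by the two digit characters '0' and '0'.
    public static int LengthOfOctalLookalike()
    {
        return "\000".Length;
    }
}
```

NulDemo.Terminate("abc") has length 4, and NulDemo.LengthOfOctalLookalike() returns 3.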
