This question already has answers here:
Sending a string containing special characters through a TcpClient (byte[])
(3 answers)
Closed 5 years ago.
I want to encode and then decode a string that contains multilingual characters, where the language, length and character positions (for example, Chinese characters at indexes 8-10) are unknown.
Is it even possible to have a "universal" encoder? Or some algorithm that knows how to decode this?
Searching the web turned up only solutions that require knowing where the special characters are and what language they belong to, and I can't even know the language itself.
Any ideas?
EDIT:
Example: a string that consists of several languages, such as:
"Hello {CHINESE} my {LATIN} is rusted"
which consists of English, Chinese, and Latin.
But when I do
var test = ASCIIEncoding.ASCII.GetBytes(someStr);
and then
ASCIIEncoding.ASCII.GetString(test)
the "special characters" (i.e., the non-English characters) are converted to question marks.
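Don't use ASCII encoding: it only covers the 128 characters in the 0-127 range, so it cannot represent characters from other languages, and anything outside that range is replaced with a question mark.
Use a Unicode encoding instead (in .NET, Encoding.Unicode is UTF-16 little-endian; Encoding.UTF8 works as well):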
var test = UnicodeEncoding.Unicode.GetBytes(someStr);
var test1 = UnicodeEncoding.Unicode.GetString(test);
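A minimal sketch of the round trip, assuming a C# 9+ top-level program. The helper name RoundTrip and the sample string are made up for illustration. It suggests that UTF-8 and UTF-16 preserve the text without knowing which languages are present, while ASCII loses every character outside its range.

```csharp
using System;
using System.Text;

// Hypothetical helper: encode to bytes with the given encoding, then decode back.
static string RoundTrip(string text, Encoding encoding)
{
    byte[] bytes = encoding.GetBytes(text);
    return encoding.GetString(bytes);
}

// Sample string (an assumption for illustration): English, Chinese and an accented Latin letter.
string sample = "Hello \u4F60\u597D my caf\u00E9";

Console.WriteLine(RoundTrip(sample, Encoding.UTF8) == sample);    // lossless
Console.WriteLine(RoundTrip(sample, Encoding.Unicode) == sample); // lossless (UTF-16)
Console.WriteLine(RoundTrip(sample, Encoding.ASCII));             // non-ASCII chars become '?'
```

The only requirement is that the same encoding is used on both sides; the decoder does not need to know the language or the position of any character.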
This question already has answers here:
How to convert from unicode to ASCII
(7 answers)
Closed 3 years ago.
I have about 1000 strings (the set can change at any time: more can be added, some can be removed) containing special characters from various languages, such as ñ. Is there a way to write a function that replaces every special character in a given string with its plain equivalent (rather than removing it), so that ñ becomes n? For example, the string ññoolpę would turn into nnoolpe.
I've found an answer elsewhere on SO. If anyone wants to see, here it is:
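using System.Globalization;
using System.Linq;
using System.Text;

string Normalize(string input)
{
    // FormD splits each accented letter into a base letter plus combining marks; the filter drops the marks.
    return string.Concat(input.Normalize(NormalizationForm.FormD).Where(
        c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
}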
It's from How to convert from unicode to ASCII.
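A small usage sketch for the function above, written as a C# 9+ top-level program. The name RemoveDiacritics is a hypothetical rename, chosen only to avoid confusion with string.Normalize. One caveat worth knowing: letters that have no canonical decomposition, such as ł or ø, pass through unchanged.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical rename of the function from the answer.
static string RemoveDiacritics(string input)
{
    // FormD splits "ñ" into "n" + U+0303 (combining tilde); the filter drops the mark.
    var stripped = input.Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
    return string.Concat(stripped);
}

Console.WriteLine(RemoveDiacritics("ññoolpę")); // nnoolpe
Console.WriteLine(RemoveDiacritics("łódź"));    // łodz (ł has no decomposition)
```

If characters like ł must also be mapped, they would need an explicit lookup table on top of the normalization step.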
This question already has answers here:
How can you strip non-ASCII characters from a string? (in C#)
(15 answers)
C# regex to remove non - printable characters, and control characters, in a text that has a mix of many different languages, unicode letters
(4 answers)
Closed 4 years ago.
I'm reading data from a file, and sometimes the file contains funky stuff, like:
"䉌Āᜊ»ç‰ç•‡ï¼ƒè¸²æœ€ä²’Bíœë¨¿ä„€å•²ï²ä‹¾é¥˜BéŒé“‡ä„€â²ä‹¾â¢"
I need to strip/replace these characters as JSON has no idea what to do with them.
They aren't control characters (I think), so my current regex of
Regex.Replace(value, @"\p{C}+", string.Empty);
isn't catching them.
A lot of the strings read in are going to be long, upwards of 256 characters, so I'd rather not loop through each char checking it.
Is there a simple solution to this? I'm thinking regular expressions would solve it, but I'm not sure.
If all you want is ASCII, then you could do:
Regex.Replace(value, @"[^\x00-\x7F]+", string.Empty);
and if all you want are the "normal" (printable) ASCII characters, you could do:
Regex.Replace(value, @"[^\x20-\x7E]+", string.Empty);
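A short sketch of the second pattern in use, assuming a C# 9+ top-level program. The helper name KeepPrintableAscii and the sample input are made up for illustration. Note that the printable range \x20-\x7E also excludes tabs and newlines, so those are removed too.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the pattern from the answer.
static string KeepPrintableAscii(string value)
{
    // [^\x20-\x7E] matches anything outside space..tilde; '+' removes whole runs in one replacement.
    return Regex.Replace(value, @"[^\x20-\x7E]+", string.Empty);
}

// Sample input (an assumption): accented letters, a CJK character and a tab mixed into ASCII text.
Console.WriteLine(KeepPrintableAscii("B\u00ED\u0153ok\u4E00\t!")); // Bok!
```

The regex engine still scans every character internally, but the call is a single pass and avoids a hand-written loop.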
This question already has answers here:
Show UTF-8 characters in console
(5 answers)
How to write Unicode characters to the console?
(5 answers)
Closed 4 years ago.
I can't seem to find a general explanation of how to use extended ASCII characters. I am specifically trying to use the different pipe variations for a minimap in my roguelike game.
static Encoding e = Encoding.GetEncoding("iso-8859-1");
string shown = e.GetString(new byte[] { 185 });
This code displays "1" even though all of the extended ascii tables show the pipe going in the top, left, bottom directions. Please help!
You should change your Console.OutputEncoding:
Encoding e = Encoding.GetEncoding("iso-8859-1");
Console.OutputEncoding = e;
then Console.WriteLine(shown); should display the desired result.
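One possible explanation for the "1", offered as a sketch rather than a definitive diagnosis: in ISO-8859-1, byte 185 is the superscript one (U+00B9), which many consoles render as "1". The box-drawing character with arms going up, left and down sits at position 185 in the IBM code page 437 tables, and in Unicode it is U+2563. Assuming the console font supports box-drawing characters, using the Unicode character directly with a UTF-8 output encoding avoids the byte lookup altogether.

```csharp
using System;
using System.Text;

// In ISO-8859-1, byte 185 decodes to U+00B9 (superscript one), not a box-drawing character.
string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(new byte[] { 185 });
Console.WriteLine(latin1 == "\u00B9");

// Use the Unicode box-drawing character directly and emit UTF-8.
Console.OutputEncoding = Encoding.UTF8;
string pipe = "\u2563"; // BOX DRAWINGS DOUBLE VERTICAL AND LEFT
Console.WriteLine(pipe);
```

Whether the glyph actually appears still depends on the terminal and its font.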
This question already has answers here:
How do I remove diacritics (accents) from a string in .NET?
(22 answers)
C# equivalence of python maketrans and translate
(1 answer)
Closed 5 years ago.
I want to write an F# or C# function that removes Spanish accents from a string like so:
in: "a stríng withóut áccents"
->
out: "a string without accents"
I know how to achieve this in Python 3:
trans_table = str.maketrans( "áéíóúñÁÉÍÓÚÑàèìòùäëïöü", "aeiounAEIOUNaeiouaeiou" )
# trans_table now contains a dictionary of code points: { ord("á") : ord("a"), ord("é") : ord("e"), ... }
"A stríng withóut áccents".translate( trans_table )
# result is: A string without accents
Mimicking this solution would be rather straightforward if the characters to be translated were regular ASCII characters (in the 0-127 range). However, because of the way .NET encodes strings, it doesn't seem straightforward to do it for actual Unicode characters outside this range...
I would like a solution that does not imply doing a regex-replace for each one of the many accented characters but that hopefully loops over the string only once...
Any ideas?
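A minimal sketch of a maketrans/translate equivalent in C#, assuming a C# 9+ top-level program. MakeTrans and Translate are hypothetical helper names, not .NET APIs. .NET strings are sequences of UTF-16 chars, so accented letters in their precomposed form are ordinary char values and can be used as dictionary keys; the sketch assumes the input is precomposed (NFC) and that every mapped character fits in a single char.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper mirroring Python's str.maketrans for two equal-length strings.
static Dictionary<char, char> MakeTrans(string from, string to)
{
    if (from.Length != to.Length)
        throw new ArgumentException("Both strings must have the same length.");
    var table = new Dictionary<char, char>();
    for (int i = 0; i < from.Length; i++)
        table[from[i]] = to[i];
    return table;
}

// Hypothetical helper mirroring str.translate: a single pass over the input.
static string Translate(string input, Dictionary<char, char> table)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
        sb.Append(table.TryGetValue(c, out char mapped) ? mapped : c);
    return sb.ToString();
}

var transTable = MakeTrans("áéíóúñÁÉÍÓÚÑàèìòùäëïöü", "aeiounAEIOUNaeiouaeiou");
Console.WriteLine(Translate("A stríng withóut áccents", transTable)); // A string without accents
```

The table is built once and each lookup is a dictionary hit, so the string is walked only once and no per-character regex replacement is needed.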
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How do I remove diacritics (accents) from a string in .NET?
Our project generates a string (Mērā nāma nitina hai) in a web page, and when we read it using the Regex.Match function we get a string in which these special characters are converted into HTML character references such as &#257; in place of ā. We want to convert each one to 'a' or 'ā' so that we can use the string in the rest of the program.
Thanks
I'm not sure that my method is absolutely right, but it works for me:
[EDIT]
string first = @"M&#275;r&#257; n&#257;ma nitina hai";
first = System.Web.HttpUtility.HtmlDecode(first);
byte[] ansi = System.Text.Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), Encoding.Unicode.GetBytes(first));
string output = Encoding.Unicode.GetString(System.Text.Encoding.Convert(Encoding.GetEncoding(1252), Encoding.Unicode, ansi));
MessageBox.Show(output);
The main idea of this code is to convert the string to ANSI (code page 1252) and back to Unicode. After this round trip, all the diacritics are gone.
How about this:
var correctStr = HttpUtility.HtmlDecode(@"M&#275;r&#257; n&#257;ma nitina hai");
Explanation: &#257; is an HTML character reference for ā, the accented character with Unicode code point 257.
You need to use the String.Normalize method.
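A sketch that combines the two suggestions above, assuming a C# 9+ top-level program and assuming the scraped text contains numeric character references such as &#257;. It uses WebUtility.HtmlDecode from System.Net, which does the same decoding as HttpUtility.HtmlDecode without needing a reference to System.Web, and then strips the combining marks after normalizing to FormD.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Net;
using System.Text;

// Assumed input: the page text as scraped, with numeric character references.
string raw = "M&#275;r&#257; n&#257;ma nitina hai";

// Step 1: turn the references back into real characters (keeps the accents).
string decoded = WebUtility.HtmlDecode(raw);

// Step 2 (optional): decompose and drop the combining marks to get plain letters.
string plain = string.Concat(decoded.Normalize(NormalizationForm.FormD)
    .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));

Console.WriteLine(decoded);
Console.WriteLine(plain); // Mera nama nitina hai
```

Stop after step 1 if the accented form (ā) is what the rest of the program needs.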