Remove certain Unicode (garbage) JSON characters [duplicate] - c#

This question already has answers here:
How can you strip non-ASCII characters from a string? (in C#)
(15 answers)
C# regex to remove non - printable characters, and control characters, in a text that has a mix of many different languages, unicode letters
(4 answers)
Closed 4 years ago.
I'm reading data from a file, and sometimes the file contains funky stuff, like:
"䉌Āᜊ»ç‰ç•‡ï¼ƒè¸²æœ€ä²’Bíœë¨¿ä„€å•²ï²ä‹¾é¥˜BéŒé“‡ä„€â²ä‹¾â¢"
I need to strip/replace these characters as JSON has no idea what to do with them.
They aren't control characters (I think), so my current regex of
Regex.Replace(value, @"\p{C}+", string.Empty);
isn't catching them.
Many of the strings being read in are long, upwards of 256 characters, so I'd rather not loop through each char checking it.
Is there a simple solution to this? I'm thinking regular expressions would solve it, but I'm not sure.

If all you want is ASCII, then you could do:
Regex.Replace(value, @"[^\x00-\x7F]+", string.Empty);
and if all you want are the printable ("normal") ASCII characters, you could do:
Regex.Replace(value, @"[^\x20-\x7E]+", string.Empty);
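To see how the two patterns differ, here is a minimal sketch. The class name, method names and sample string are made up for illustration and are not from the original answer. The first pattern keeps ASCII control characters such as tabs, while the second keeps only the range from space to tilde.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiFilterDemo
{
    // Removes every character outside the 7-bit ASCII range (U+0000 to U+007F).
    public static string StripNonAscii(string value) =>
        Regex.Replace(value, @"[^\x00-\x7F]+", string.Empty);

    // Removes everything except printable ASCII (space through tilde),
    // so tabs, newlines and other control characters are dropped as well.
    public static string StripNonPrintableAscii(string value) =>
        Regex.Replace(value, @"[^\x20-\x7E]+", string.Empty);

    public static void Main()
    {
        // Hypothetical sample: "caf", e-acute, a tab, one CJK character, "ok".
        string sample = "caf\u00E9\t\u4F60ok";

        Console.WriteLine(StripNonAscii(sample).Length);   // 6 (the tab survives)
        Console.WriteLine(StripNonPrintableAscii(sample)); // cafok
    }
}
```

Note that Regex.Replace returns a new string, so the result has to be assigned or passed on; the input is not modified.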

Related

Replace special characters with non-special equivalents in a string in C# [duplicate]

This question already has answers here:
How to convert from unicode to ASCII
(7 answers)
Closed 3 years ago.
I have about 1000 strings containing special characters from various languages, such as ñ. The set can change at any time: more can be added and some can be removed. Is there a way to write a function that changes every special character in a given string to its plain equivalent rather than removing it? For example, ñ would become n, and the string ññoolpę would turn into nnoolpe.
I've found an answer elsewhere on SO. If anyone wants to see, here it is:
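// Requires: using System.Globalization; using System.Linq; using System.Text;
string Normalize(string input)
{
    // FormD splits each accented letter into a base letter plus combining marks;
    // dropping the non-spacing marks leaves only the base letters.
    return string.Concat(input.Normalize(NormalizationForm.FormD).Where(
        c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
}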
It's from How to convert from unicode to ASCII.

Regex - Minimum 6 chars and no whitespace. Everything else allowed [duplicate]

This question already has answers here:
Regular expression for not allowing spaces in the input field
(5 answers)
Closed 3 years ago.
I'm trying to use regex which checks only two things
Minimum 10 characters (No Max)
No whitespace allowed
I'm able to check for a minimum of 10 chars with @"^[a-zA-Z0-9]{10,}$" and disallow white space with ^[^0-9 ]+$
Now the problem is how to combine both of these and allow everything (alphanumeric, including special characters) except white space.
You could try to use a simpler regex pattern just to accept anything that is not a white-space: ^\S{10,}$
\S - matches any non-white-space character. More details here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
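A quick way to check the suggested pattern is a small sketch like the one below. The class name, method name and sample inputs are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoWhitespaceDemo
{
    // At least 10 characters, none of which may be white space.
    private static readonly Regex Pattern = new Regex(@"^\S{10,}$");

    public static bool IsValid(string input) => Pattern.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(IsValid("abc!@#1234"));   // True: 10 non-space characters
        Console.WriteLine(IsValid("abc def 1234")); // False: contains spaces
        Console.WriteLine(IsValid("short"));        // False: fewer than 10 characters
    }
}
```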

How to remove "accents" from strings in C#? [duplicate]

This question already has answers here:
How do I remove diacritics (accents) from a string in .NET?
(22 answers)
C# equivalence of python maketrans and translate
(1 answer)
Closed 5 years ago.
I want to write a F# or C# function that removes Spanish accents from a string like so:
in: "a stríng withóut áccents"
->
out: "a string without accents"
I know how to achieve this in Python 3:
trans_table = str.maketrans( "áéíóúñÁÉÍÓÚÑàèìòùäëïöü", "aeiounAEIOUNaeiouaeiou" )
# trans_table now contains a dictionary keyed by code point: { ord("á") : ord("a"), ord("é") : ord("e"), ... }
"A stríng withóut áccents".translate( trans_table )
# result is: A string without accents
Mimicking this solution would be rather straightforward if the characters to be translated were regular ASCII characters (in the 0-127 range). However, because of the way .NET encodes strings, it doesn't seem straightforward to do it for actual Unicode characters outside this range...
I would like a solution that does not involve doing a regex replace for each one of the many accented characters but that, hopefully, loops over the string only once...
Any ideas?
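One possible single-pass approach, sketched here as an assumption rather than an answer taken from the linked duplicates, is to build a Dictionary<char, char> once and walk the string with a StringBuilder. The MakeTrans and Translate names are hypothetical and simply mirror the Python functions; the sketch assumes every accented letter is a single precomposed UTF-16 code unit, both in the table and in the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TranslateDemo
{
    // Builds a lookup table mapping from[i] to to[i], similar in spirit to str.maketrans.
    public static Dictionary<char, char> MakeTrans(string from, string to)
    {
        if (from.Length != to.Length)
            throw new ArgumentException("Both strings must have the same length.");

        var table = new Dictionary<char, char>(from.Length);
        for (int i = 0; i < from.Length; i++)
            table[from[i]] = to[i];
        return table;
    }

    // Walks the input once, replacing mapped characters and copying the rest unchanged.
    public static string Translate(string input, Dictionary<char, char> table)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
            sb.Append(table.TryGetValue(c, out char mapped) ? mapped : c);
        return sb.ToString();
    }

    public static void Main()
    {
        var table = MakeTrans("áéíóúñÁÉÍÓÚÑàèìòùäëïöü", "aeiounAEIOUNaeiouaeiou");
        Console.WriteLine(Translate("a stríng withóut áccents", table));
        // prints: a string without accents
    }
}
```

Unlike the normalization-based approach, this only handles the characters listed in the table, which is the same trade-off the Python version makes.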

Encode and Decode multilingual string c# [duplicate]

This question already has answers here:
Sending a string containing special characters through a TcpClient (byte[])
(3 answers)
Closed 5 years ago.
I want to encode and then decode a string that contains multilingual characters, where the language, length and character positions (for example, Chinese characters at indexes 8-10) are unknown.
Is it even possible to have a "universal" encoder? Or some algorithm that knows how to decode this?
Searching the web turned up only solutions that require knowing where the special characters are and what language they belong to, and I can't even know the language itself.
Any ideas?
EDIT:
Example: a string that consists of several languages, such as:
"Hello {CHINESE} my {LATIN} is rusted"
which consists of English, Chinese, and Latin.
But when I do
var test = ASCIIEncoding.ASCII.GetBytes(someStr);
and then
ASCIIEncoding.ASCII.GetString(test)
the "special characters" (i.e., non-English characters) are converted to question marks.
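Don't use ASCII encoding: it only covers 7-bit characters, so it cannot represent characters from multiple languages in the same string, and anything it cannot represent is replaced with a question mark.
Use Unicode (UTF-16 in .NET) instead: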
var test = UnicodeEncoding.Unicode.GetBytes(someStr);
var test1 = UnicodeEncoding.Unicode.GetString(test);
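The difference is easy to see with a round trip. The sketch below uses a made-up sample string and a hypothetical RoundTrip helper; Encoding.UTF8 is included as another Unicode encoding that round-trips the same text, which goes slightly beyond what the answer above states.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Encodes the text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string text, Encoding encoding) =>
        encoding.GetString(encoding.GetBytes(text));

    public static void Main()
    {
        // Hypothetical sample mixing English, two Chinese characters and a Latin letter with a macron.
        string original = "Hello \u4F60\u597D my lat\u012Bna is rusted";

        Console.WriteLine(RoundTrip(original, Encoding.Unicode) == original); // True
        Console.WriteLine(RoundTrip(original, Encoding.UTF8) == original);    // True
        Console.WriteLine(RoundTrip(original, Encoding.ASCII));
        // prints: Hello ?? my lat?na is rusted
    }
}
```

Whichever Unicode encoding is chosen, the sender and the receiver have to use the same one.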

Regular expression for characters after '.' [duplicate]

This question already has answers here:
How do I match an entire string with a regex?
(8 answers)
Closed 6 years ago.
I need to detect the following format when I enter a serial number like
CK123456.789
I used Regex with pattern of
^(CV[0-9]{6}\.[0-9]{3}
to match but if I enter
CK123456.7890
it is still able to proceed without flagging an error. Is there a better regular expression to detect exactly 3 trailing digits after '.'?
Depending on how you use the regular expression matcher, you might need to enclose it in ^...$ which forces the pattern to be the whole string, i.e.
^CK[0-9]{6}\.[0-9]{3}$ (Note the CK prefix).
I've also removed your leading (mismatched) parenthesis.
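The effect of the anchors can be checked with a short sketch like this one. The class and method names are illustrative assumptions; the unanchored pattern is included only to show why the extra digit slips through.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SerialNumberDemo
{
    // Anchored: the whole input must be CK, six digits, a dot, then exactly three digits.
    private static readonly Regex Anchored = new Regex(@"^CK[0-9]{6}\.[0-9]{3}$");

    // Unanchored: succeeds as soon as the pattern is found anywhere in the input.
    private static readonly Regex Unanchored = new Regex(@"CK[0-9]{6}\.[0-9]{3}");

    public static bool IsValidSerial(string input) => Anchored.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(IsValidSerial("CK123456.789"));       // True
        Console.WriteLine(IsValidSerial("CK123456.7890"));      // False: one digit too many
        Console.WriteLine(Unanchored.IsMatch("CK123456.7890")); // True: matches the first 12 characters
    }
}
```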
