.Net Regex that Matches Strings With Any non-ASCII char in it - c#

Looking for some black magic that will match any string with "weird" characters in it. Standard ASCII characters are fine. Everything else isn't.
This is for sanitizing various web forms.

This matches any character outside the ASCII range:
[^\x00-\x7F]
Some "weird" characters, such as \x00 (NUL), still get through, because they are valid ASCII.
For reference, see the ASCII table

[^\p{IsBasicLatin}] for what is asked for, [^\x00-\x7F] for concision over self-documentation, or \p{C} for clearing out formatters and controls without hurting other non-ASCIIs (and with greater concision yet).
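As a rough sketch of how the patterns above could be used from C# (the helper names are made up for illustration and are not part of any library):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helpers wrapping the patterns from the answers above.

// True if the input has at least one character outside U+0000..U+007F.
bool ContainsNonAscii(string input) => Regex.IsMatch(input, @"[^\x00-\x7F]");

// Same check, written with the named Unicode block (\P is the negated form of \p).
bool ContainsNonBasicLatin(string input) => Regex.IsMatch(input, @"\P{IsBasicLatin}");

// Removes characters in the "Other" categories (controls, format characters, ...)
// and leaves everything else, including non-ASCII letters, in place.
string StripControlAndFormat(string input) => Regex.Replace(input, @"\p{C}", "");

Console.WriteLine(ContainsNonAscii("plain text"));
Console.WriteLine(ContainsNonAscii("caf\u00E9"));           // "café"
Console.WriteLine(ContainsNonBasicLatin("caf\u00E9"));
Console.WriteLine(StripControlAndFormat("a\u0007b\u200Ec")); // BEL and LEFT-TO-RIGHT MARK
```

One caveat, as far as I can tell: .NET regexes work on UTF-16 code units, and surrogates fall under \p{C}, so the last helper would also strip characters outside the Basic Multilingual Plane (most emoji, for example).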

Related

French/Portuguese extended ASCII symbols in regex

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçǵ]. The mask should accept both uppercase and lowercase symbols.
I found two suggestions:
[\p{L}]
and
[a-zA-Z0-9\u0080-\u009F]
What is the correct way to write such a regular expression?
Update:
My question is about forming a regexp that should match (not filter) French and Portuguese characters in order to display it in the edit control. Case insensitive solution won't help me.
[\p{L}] seems to be a Unicode character class, I need an ASCII regexp.
Digits are allowed, but special characters such as !@#$%^&*)_+}{|"?>< are disallowed (should be filtered).
The variant that works best for me is [a-zA-Z0-9\u00B5-\u00FF]
https://regex101.com/r/EPF1rg/2
The question is why the range for [ùàçéèçǵ] is \u00B5-\u00FF and not \u0080-\u009F ?
As I see from CP860 (Portuguese code page) and from CP863 (French code page) it should be in range \u0080-\u009F.
https://www.ascii-codes.com/cp860.html
Can anyone explain it?
The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard says so. The "old" range (\u0080-\u009F, as in the 860 Portuguese code page) was just one of many possible mappings of the 128 extended characters available in an 8-bit code page, where you would sometimes find the same character at different code points depending on the code page.
C# strings are Unicode, and so are its regex features:
https://stackoverflow.com/a/20641460/1132334
If you really must specify a fixed range of characters, in C# you can just as well include them literally:
[a-zA-Z0-9µùàçéèçÇ]
Or, as others have suggested already, use the "letter" matching. So it won't be up to you to define what a letter is in each alphabet, and you don't need to keep up with future changes of that definition yourself:
\p{L}
A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
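To make the options above concrete, here is a minimal sketch. The helper names are hypothetical, and including \p{S} (symbols) in the inverted check is my own assumption, since characters such as $, ^ and + are classified as symbols rather than punctuation:

```csharp
using System;
using System.Text.RegularExpressions;

// Letters from any alphabet plus ASCII digits; \z anchors at the true end of input.
bool IsAllowed(string input) => Regex.IsMatch(input, @"^[\p{L}0-9]+\z");

// Inverted specification: flag punctuation, symbols and control/format characters.
bool HasDisallowed(string input) => Regex.IsMatch(input, @"[\p{P}\p{S}\p{C}]");

// Print where the characters from the question actually live in Unicode.
foreach (char c in "\u00B5\u00F9\u00E0\u00E7\u00E9\u00E8\u00C7") // µ ù à ç é è Ç
    Console.WriteLine($"U+{(int)c:X4}");

Console.WriteLine(IsAllowed("Fa\u00E7ade2024")); // "Façade2024"
Console.WriteLine(IsAllowed("50%"));
Console.WriteLine(HasDisallowed("a+b"));
```

The loop is only there to show that every one of these characters falls between U+00B5 and U+00FF, which is why the \u0080-\u009F range from the code page tables never matched.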

How to display non-printable Ascii characters?

I came across this simple code to output ascii to the console:
Console.Write((char)1); //Output ☺
The thing is, it only works when I change the fonts of the console to RasterFonts, and it's ugly. I mean, look at those old text-based games, how did they draw some ascii art like this?
The Amazing Adventures of ANSI Dude, Snipes
How can I draw nice Ascii on that console?
Unless you are restricted to ASCII characters for some reason, you should use proper Unicode characters. This avoids potential conflicts from mapping control characters (0-31) to printable characters, and it lets you use lines and borders directly with the .NET String type without going through encodings (lines and borders are part of "extended ASCII" and, unlike the regular 7-bit ASCII codes 0-127, do not map directly to Unicode code points with the same values).
Unicode "\u263a" produces the face you are looking for. For borders and lines, use characters from the Unicode box drawing range; for more characters, see the overall table at http://unicode.org/charts/.
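For illustration, a small sketch of drawing a framed line with box drawing characters. BuildBox is a made-up helper, and this assumes the console font actually contains these glyphs and that switching the output encoding to UTF-8 is acceptable in your environment:

```csharp
using System;
using System.Text;

// Ask the console to accept UTF-8 so the characters are not replaced with '?'.
Console.OutputEncoding = Encoding.UTF8;

// Builds a three-line frame around the text using characters from U+2500..U+257F.
string BuildBox(string text)
{
    string horizontal = new string('\u2500', text.Length + 2);   // ─
    return $"\u250C{horizontal}\u2510\n" +                       // ┌──┐
           $"\u2502 {text} \u2502\n" +                           // │  │
           $"\u2514{horizontal}\u2518";                          // └──┘
}

Console.WriteLine(BuildBox("\u263A Hello")); // U+263A is the smiling face
```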

Removing non alpha characters

What is the best way in order to remove all non-alpha characters in C#? I have looked up Regex but it doesn't seem to recognise Regex when I do:
string cleanString = "";
string dirtyString = "I don't_8 really know what ! 6 non alpha- is?";
cleanString = Regex.Replace(dirtyString, "[^A-Za-z0-9]", "");
Regex comes with a red wiggly line underneath. Is there a way I can simply remove non-alpha characters, and if so, can someone provide me with a sample? I'm not sure if loops and arrays are the way to go, and also how can I get all non-alpha characters? I'm assuming I have to do something like: if it doesn't equal A-Z or 0-9, then replace it with ""?
You can do it using LINQ like so:
var cleanString = new string(dirtyString.Where(Char.IsLetter).ToArray());
You can check other Char checks on MSDN.
"Regex comes with a red wiggly line underneath."
Then either:
The compilation prediction isn't working correctly (it does sometimes get things wrong).
You don't have a using System.Text.RegularExpressions in the code, so it can't work out you mean System.Text.RegularExpressions.Regex when you say Regex.
To return to your original question:
"What is the best way in order to remove all non-alpha characters in C#?"
The approach you take is good for small strings, though note that [^A-Za-z0-9] removes non-alphanumerics while [^A-Za-z] removes non-alphabetical characters. This assumes you are already restricted to (or want to add a restriction to) US-ASCII characters. To include letters like á, œ, ß or δ because you're dealing with real words rather than computer code, I'd use @"\P{L}" to keep only letters, or @"[^\p{L}\p{N}]" to keep all letters and numbers.
If you are dealing with a very large piece of text (many kilobytes), then you are better off reading it through a filtering stream that strips the characters you don't want as you go.
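A minimal sketch of both ideas, using made-up helper names. The streaming version is just one way to read "filtering stream": it copies from a TextReader to a TextWriter one character at a time. Note that char.IsLetterOrDigit only accepts decimal digits, so it is slightly narrower than \p{N}:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Whole-string version: drop everything that is not a Unicode letter or number.
string RemoveNonAlphanumerics(string input) =>
    Regex.Replace(input, @"[^\p{L}\p{N}]", "");

// Streaming version: never holds more than one character of the input in memory.
void CopyLettersAndDigits(TextReader reader, TextWriter writer)
{
    int next;
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;
        if (char.IsLetterOrDigit(c))
            writer.Write(c);
    }
}

string dirtyString = "I don't_8 really know what ! 6 non alpha- is?";
Console.WriteLine(RemoveNonAlphanumerics(dirtyString));

using (var reader = new StringReader(dirtyString))
{
    CopyLettersAndDigits(reader, Console.Out);
    Console.WriteLine();
}
```

For a real file you would pass a StreamReader and StreamWriter instead of the StringReader and Console.Out used here.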

C# special characters

I need to verify that a string doesn't contain any special characters like #,%...™ etc. Basically it's a Name/surname (and some similar) strings, however, sticking to [a-zA-Z] wouldn't do as symbols like ščřž... are allowed.
At the moment I'd go with something like
bool NonSpecial(string text){
return !Regex.Match(Regex.Escape("!@#$%^&......")).Success;
}
but that just seems to be too complicated and clumsy.
Is there any simpler and/or more elegant way?
Update:
So after reading all the replies I decided to go with
private bool IsName( string text ) {
return Regex.Match( text, @"^[\p{L}\p{Nd}'\.\- ]+$" ).Success && !Regex.Match( text, @"['\-\.]{2}" ).Success && !Regex.Match( text, "  " ).Success;
}
Basically the name can contain letters, numbers, ', ., -, and spaces; any of "'.-" must be separated by at least one other allowed character, and there cannot be 2 spaces in a row.
Hope that's correct.
Have you tried text.All(Char.IsLetter)?
PS http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
You can use the Unicode category for letters:
Regex.Match(text, @"\p{L}+");
See Supported Unicode Categories.
This problem is worse than you imagine.
There are literally thousands of allowable characters that can legitimately be part of a name, spread over hundreds of ranges in the various unicode alphabets.
There are also literally tens of thousands of characters that will never be part of a name. Think of all the emoji and ascii art characters. These are also spread over hundreds of separate ranges of unicode characters.
Sifting the wheat from the chaff via manual code, even regular expressions, just isn't going to work well.
Thankfully, this work has been done for you. Look at the char.IsLetter() method.
You may also want to make an exception for the various allowed separator characters and accents that are not letters but can be part of a name: hyphens, apostrophes, and periods are legitimate, and all have more than one allowed Unicode encoding. Unfortunately, I don't have a quick solution for you here. This may have to be a best-effort approach that looks at just some of the more common ones.
Try using LINQ with a lambda as well; it is pretty straightforward.
This returns true if the string contains at least one character that is not a letter:
bool result = text.Any(x => !char.IsLetter(x));
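Putting the char.IsLetter suggestion together with a short list of separators, a best-effort check might look like the sketch below. LooksLikeName and the AllowedSeparators list are assumptions for illustration only; real names can contain characters that this rejects:

```csharp
using System;
using System.Linq;

// Assumption: only these four separators are accepted besides letters.
const string AllowedSeparators = "'.- ";

// True if the text is non-empty and every character is a letter or an allowed separator.
bool LooksLikeName(string text) =>
    !string.IsNullOrEmpty(text) &&
    text.All(c => char.IsLetter(c) || AllowedSeparators.Contains(c));

Console.WriteLine(LooksLikeName("Ren\u00E9e O'Brien-\u0160karda")); // "Renée O'Brien-Škarda"
Console.WriteLine(LooksLikeName("name@example"));
```

Unlike the regex in the question's update, this sketch does not reject doubled separators or doubled spaces; that rule would need an extra check.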

c# character encoding that treats accented characters like simple ones

Is there an encoding where an accented character like á or ä is treated as a single character?
and if not then what would be the most commonly used encoding today? I'm using UTF7 currently, how compatible is that with other types of encoding?
Thanks
You might think about what you're asking for. You're asking for an encoding that will recognize 'á' and turn it into 'a'. That's a converter, not an encoding. It would have to know what encoding the source is in so that it can convert to whatever encoding you're using.
Wait, maybe that's not what you're asking. There are encodings that treat those as single bytes. For example, the ISO-8859-1 encoding (also called Latin-1) treats many accented characters as a single byte.
(The following paragraph is struck out because I was talking about ASCII, not UTF-7 ... long day.)
UTF-7 isn't particularly compatible with many other encodings. It has 128 possible values: just enough space for the 52 letters (upper and lower case, combined) used in the Latin alphabet, the 10 numerals, 32 control characters, and various punctuation marks. But it's not sufficient for Spanish, for example, which has upside-down question marks and exclamation points as well as other things.
UTF-7 is "compatible" with other encodings in that it can represent the entire Unicode character set. But only some characters (known as the "direct characters") and a few control characters can be directly encoded as single ASCII bytes. Those characters will be the same as in UTF-8 and in many single-byte character sets. All other characters are represented by sequences, and will be different from any other encoding.
The most commonly used encoding today? On the Web, UTF-8 is used a lot. It's also the default encoding used when you create a StreamWriter. For the work I do (mostly English, and Western European character sets), it works better than anything else.
Now, it's possible that what you're looking for is something that will treat 'á' and 'a' as the same in comparisons. That's a different question. See Performing Culture-Insensitive String Comparisons for information on that.
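If comparison is indeed what you are after, one possible sketch uses CompareOptions.IgnoreNonSpace, which tells the comparison to ignore combining (non-spacing) marks such as accents. The helper name is made up, the choice of the invariant culture is an assumption, and the result depends on the runtime having full globalization data available:

```csharp
using System;
using System.Globalization;

// Compares two strings while ignoring diacritics; case is still significant.
bool EqualsIgnoringAccents(string left, string right) =>
    string.Compare(left, right, CultureInfo.InvariantCulture,
                   CompareOptions.IgnoreNonSpace) == 0;

Console.WriteLine(EqualsIgnoringAccents("\u00E1", "a")); // "á" vs "a"
Console.WriteLine(EqualsIgnoringAccents("a", "b"));
```

This does not change the strings themselves; it only affects how they compare.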
This doesn't seem to have anything to do with encodings. In C# it doesn't matter what encoding you use for storage and transmission, the strings of characters are always internally in UTF-16 and ä is always 1 char long in composed form.
If "ä".Length is giving 2 to you, your string is in decomposed form and all you need to do is
string str = "a\u0308"; // 'a' followed by U+0308 (combining diaeresis), .Length == 2
str = str.Normalize(NormalizationForm.FormC); //just ä now, with Length == 1
Sorry for the confusion over this issue. I finally found what I was looking for: I needed my text to use the Windows-1250 (Central European (Windows)) code page, because that is what a lot of other programs use, and it correctly supports characters like €đłŁ¤...etc
Thanks for all the help i got, it was a useful learning experience.
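For anyone landing here later, a minimal sketch of working with that code page. It assumes .NET Core or .NET 5+, where the legacy code pages have to be registered through CodePagesEncodingProvider before Encoding.GetEncoding(1250) will find them; the variable names are illustrative:

```csharp
using System;
using System.Text;

// Make the legacy Windows code pages available to Encoding.GetEncoding.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding windows1250 = Encoding.GetEncoding(1250);

// Each of these characters fits in a single byte in Windows-1250.
string original = "\u20AC\u0111\u0142\u0141\u00A4"; // € đ ł Ł ¤
byte[] bytes = windows1250.GetBytes(original);
string roundTripped = windows1250.GetString(bytes);

Console.WriteLine(BitConverter.ToString(bytes));
Console.WriteLine(bytes.Length == original.Length);
Console.WriteLine(roundTripped == original);
```

The same Encoding instance can be passed to StreamReader or StreamWriter when exchanging files with programs that expect this code page.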
