I need to verify that a string doesn't contain any special characters like #, %, ™ etc. The strings are names, surnames and similar, but sticking to [a-zA-Z] won't do, because letters like ščřž... are allowed.
At the moment I'd go with something like
bool NonSpecial(string text){
return !Regex.Match(text, "[" + Regex.Escape("!@#$%^&......") + "]").Success;
}
but that just seems to be too complicated and clumsy.
Is there any simpler and/or more elegant way?
Update:
So after reading all the replies I decided to go with
private bool IsName( string text ) {
return Regex.Match( text, @"^[\p{L}\p{Nd}'\.\- ]+$" ).Success && !Regex.Match( text, @"['\-\.]{2}" ).Success && !Regex.Match( text, "  " ).Success;
}
Basically the name can contain letters, numbers, ', ., - and spaces; any of the ' . - characters must be separated by at least one other allowed character, and there cannot be two spaces in a row.
Hope that's correct.
Have you tried text.All(Char.IsLetter)?
PS http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
You can use the Unicode category for letters:
Regex.Match(text, @"\p{L}+");
See Supported Unicode Categories.
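As a rough sketch of how the letter category could be used for validation: the pattern above is unanchored, so it succeeds as soon as the text contains one letter somewhere. Anchoring it with \A and \z (my addition, not part of the answer) makes it require that every character is a letter. The class and method names here are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterCheck
{
    // Hypothetical helper: true only when the whole string consists of Unicode letters.
    // \A and \z anchor the match to the very start and very end of the input.
    public static bool IsAllLetters(string text) =>
        Regex.IsMatch(text, @"\A\p{L}+\z");

    public static void Main()
    {
        Console.WriteLine(IsAllLetters("Novak"));             // True
        Console.WriteLine(IsAllLetters("R2D2"));              // False
        // The unanchored pattern from the answer still succeeds, because it finds "R".
        Console.WriteLine(Regex.IsMatch("R2D2", @"\p{L}+"));  // True
    }
}
```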
This problem is worse than you imagine.
There are literally thousands of allowable characters that can legitimately be part of a name, spread over hundreds of ranges in the various unicode alphabets.
There are also literally tens of thousands of characters that will never be part of a name. Think of all the emoji and ascii art characters. These are also spread over hundreds of separate ranges of unicode characters.
Sifting the wheat from the chaff via manual code, even regular expressions, just isn't going to work well.
Thankfully, this work has been done for you. Look at the char.IsLetter() method.
You may also want to make an exception for the various separator characters and accents that are not letters but can be part of a name: hyphens, apostrophes, and periods are legitimate, and all have more than one allowed Unicode encoding. Unfortunately, I don't have a quick solution for you here. This may have to be a best-effort approach that looks at just some of the more common ones.
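A best-effort version of that idea might look something like the sketch below. The separator list is an assumption of mine and deliberately incomplete (space, hyphen, two apostrophe variants, period), and NameCheck / LooksLikeName are hypothetical names. Combining accents are let through via the NonSpacingMark category.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class NameCheck
{
    // Assumption: a small, incomplete list of separators that may appear in names.
    private static readonly char[] Separators = { ' ', '-', '\'', '.', '\u2019' };

    // Hypothetical helper: every character must be a letter, a listed separator,
    // or a combining accent (non-spacing mark).
    public static bool LooksLikeName(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return false;
        return text.All(c =>
            char.IsLetter(c) ||
            Separators.Contains(c) ||
            char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeName("Anne-Marie O'Neil")); // True
        Console.WriteLine(LooksLikeName("Tom\u2122"));         // False (trademark sign)
    }
}
```

This does not enforce any ordering rules (for example, it accepts a string made only of hyphens), so it would need tightening for real use.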
Try using LINQ with a lambda as well; it's pretty straightforward.
This will return true if the text contains any character that is not a letter:
bool result = text.Any(x => !char.IsLetter(x));
Related
I'm trying to modify a fairly basic regex pattern in C# that tests for phone numbers.
The pattern is -
[0-9]+(\.[0-9][0-9]?)?
I have two questions -
1) The existing expression does work (although it is fairly restrictive) but I can't quite understand how it works. Regexps for similar issues seem to look more like this one -
/^[0-9()]+$/
2) How could I extend this pattern to allow brackets, periods and a single space to separate numbers. I tried a few variations to include -
[0-9().+\s?](\.[0-9][0-9]?)?
Although I can't seem to create a valid pattern.
Any help would be much appreciated.
Thanks,
[0-9]+(\.[0-9][0-9]?)?
First of all, I recommend checking out either regexr.com or regex101.com, so you yourself get an understanding of how regex works. Both websites will give you a step-by-step explanation of what each symbol in the regex does.
Now, one of the main things you have to understand is that regex has special characters. This includes, among others, the following: []().-+*?\^$. So, if you want your regex to match a literal ., for example, you would have to escape it, since it's a special character. To do so, either use \. or [.]. Backslashes serve to escape other characters, while [] means "match any one of the characters in this set". Some special characters don't have a special meaning inside these brackets and don't require escaping.
Therefore, the regex above will match any combination of digits of length 1 or more, followed by an optional suffix (foobar)?, which has to be a dot, followed by one or two digits. In fact, this regex seems more like it's supposed to match decimal numbers with up to two digits behind the dot - not phone numbers.
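To see that behaviour concretely, here is a small sketch that runs the pattern from the question against a few inputs. The class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DecimalPatternDemo
{
    // Pattern copied from the question.
    private const string Pattern = @"[0-9]+(\.[0-9][0-9]?)?";

    // Returns the first substring the pattern matches.
    public static string FirstMatch(string input) =>
        Regex.Match(input, Pattern).Value;

    public static void Main()
    {
        Console.WriteLine(FirstMatch("12.34"));    // 12.34
        Console.WriteLine(FirstMatch("12.345"));   // 12.34 (only two digits after the dot)
        Console.WriteLine(FirstMatch("555-1212")); // 555   (stops at the hyphen)
    }
}
```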
/^[0-9()]+$/
What this does is pretty simple - match any combination of digits or round brackets that has the length 1 or greater.
[0-9().+\s?](\.[0-9][0-9]?)?
What you're matching here is:
one of: a digit, round bracket, dot, plus sign, whitespace or question mark; but exactly once only!
optionally followed by a dot and one or two digits
A suitable regex for your purpose could be:
(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+
Enter this in one of the websites mentioned above to understand how it works. I modified it a little to also allow, for example +49 123 4567890.
Also, for simplicity, I didn't include spaces - so when using this regex, you have to remove all the spaces in your input first. In C#, that should be possible with yourString.Replace(" ", ""); (simply replacing all spaces with nothing = deleting spaces)
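Put together, the check could be sketched roughly like this. The ^ and $ anchors are my addition (so that the whole input has to be a phone number, rather than just containing one), and the names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneCheck
{
    // Regex from above, wrapped in ^...$ (an assumption) to require a full match.
    private const string Pattern = @"^(\+\d{2})?((\(0\)\d{2,3})|\d{2,3})?\d+$";

    public static bool IsPhoneNumber(string input)
    {
        // Delete all spaces first, as described above.
        string compact = input.Replace(" ", "");
        return Regex.IsMatch(compact, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsPhoneNumber("+49 123 4567890"));    // True
        Console.WriteLine(IsPhoneNumber("+49 (0)123 4567890")); // True
        Console.WriteLine(IsPhoneNumber("12a34"));              // False
    }
}
```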
The + after the character set is a quantifier: the preceding character, character set or group is repeated at least once and up to an unlimited number of times. It is also greedy (it matches as much as possible).
Then [0-9().+\s]+ will match any character in set one or more times.
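A quick way to see the difference the quantifier makes (sample input and names are mine):

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the first match of the given pattern in the input.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        // Without +, the character set matches exactly one character.
        Console.WriteLine(FirstMatch("(201) 555", @"[0-9().+\s]"));  // (
        // With +, it greedily matches the whole run of allowed characters.
        Console.WriteLine(FirstMatch("(201) 555", @"[0-9().+\s]+")); // (201) 555
    }
}
```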
What is the best way in order to remove all non-alpha characters in C#? I have looked up Regex but it doesn't seem to recognise Regex when I do:
string cleanString = "";
string dirtyString = "I don't_8 really know what ! 6 non alpha- is?";
cleanString = Regex.Replace(dirtyString, "[^A-Za-z0-9]", "");
Regex comes with a red wiggly line underneath. Is there a way I can simply remove non-alpha characters, and if so can someone provide me with a sample? I'm not sure if loops and arrays are the way to go, and how can I get all non-alpha characters? I'm assuming I have to do something like: if it doesn't equal A-Z or 0-9, then replace with ""?
You can do it using LINQ like so:
var cleanString = new string(dirtyString.Where(Char.IsLetter).ToArray());
You can check other Char checks on MSDN.
Regex comes with a red wiggly line underneath.
Then either:
The compilation prediction isn't working correctly (it does sometimes get things wrong).
You don't have a using System.Text.RegularExpressions in the code, so it can't work out you mean System.Text.RegularExpressions.Regex when you say Regex.
To return to your original question:
What is the best way in order to remove all non-alpha characters in C#?
The approach you take is good for small strings, though note that [^A-Za-z0-9] removes non-alphanumerics, while [^A-Za-z] removes non-alphabetical characters. This is assuming you are already restricted to (or want to add a restriction to) US-ASCII characters. To include letters like á, œ, ß or δ because you're dealing with real words rather than computer code, I'd use @"\P{L}" to keep all letters, or @"[^\p{L}\p{N}]" to keep all letters and numbers.
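For example, a minimal sketch of the two Unicode-aware patterns (the helper names and sample text are made up):

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeClean
{
    // Removes everything that is not a Unicode letter.
    public static string LettersOnly(string s) =>
        Regex.Replace(s, @"\P{L}", "");

    // Removes everything that is neither a Unicode letter nor a Unicode number.
    public static string LettersAndNumbers(string s) =>
        Regex.Replace(s, @"[^\p{L}\p{N}]", "");

    public static void Main()
    {
        string dirty = "I don't_8 really know what ! 6 non alpha- is?";
        Console.WriteLine(LettersOnly(dirty));       // Idontreallyknowwhatnonalphais
        Console.WriteLine(LettersAndNumbers(dirty)); // Idont8reallyknowwhat6nonalphais
    }
}
```

Note that spaces are removed too, since they are neither letters nor numbers.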
If you are dealing with very large pieces of text (many kilobytes) then you are better off reading them through a filtering stream that strips the characters you don't want as you go.
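One way that streaming idea might be sketched is a plain copy loop between a TextReader and a TextWriter, so the whole text never has to sit in memory at once. This is only an illustration; LetterFilter and CopyLettersOnly are hypothetical names, and a real filtering stream could equally be a TextReader subclass.

```csharp
using System;
using System.IO;

public static class LetterFilter
{
    // Copies only letters from reader to writer, one buffer at a time.
    // Caveat: char.IsLetter(char) returns false for surrogate halves, so letters
    // outside the Basic Multilingual Plane are dropped by this sketch.
    public static void CopyLettersOnly(TextReader reader, TextWriter writer)
    {
        var buffer = new char[4096];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (char.IsLetter(buffer[i]))
                    writer.Write(buffer[i]);
            }
        }
    }

    public static void Main()
    {
        var output = new StringWriter();
        CopyLettersOnly(new StringReader("a1b2 c!"), output);
        Console.WriteLine(output.ToString()); // abc
    }
}
```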
I googled for an answer and I found some questions here on Stack Exchange asking similar question but they didn't help me. For example, I found C# regex - not matching my string but the answers given are way too complicated for me to understand. I don't know or understand regex. All I want to do is strip a double quote from a string.
To put my question simply, I have a string "\"123.456\"" and I need to remove the "\""
so I made my expression "[^\\w\\\"]" and after calling
string myString = Regex.Replace("\"123.456\"", "[^\\w\\\"]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
myString is "\"123.456\"". I just need to know what my expression should be. I won't be able to understand any lengthy discussions or lectures on learning regex.
I got my example directly from Microsoft at http://msdn.microsoft.com/en-us/library/844skk0h(v=vs.110).aspx so basically all I did was replace the ".@-" with "\"".
UPDATE
Apparently trying to ask a simple question only attracts trolls. I didn't want to get too complicated because I didn't want all you hard working busy people to spend too much time answering the wrong question. I was trying to be nice.
We have a situation where we need to parse input files from several clients. Going forward, the number of clients will increase, and so will the number of files from each client.
We found that in several of our clients' transmitted files many fields will have various extra characters. We don't know how or why those characters are in there and our clients aren't telling. (if you want to know why they aren't telling, please move along, these aren't the questions you are looking for)
So, we have many files from many clients each with many rows with many fields of data and we need to strip out "bad" characters.
I took Microsoft's method and changed it a bit to be more dynamic.
private string CleanInput(string strIn, string chars)
{
// Replace invalid characters with empty strings.
try
{
string regexString = string.Format(@"[^\w\{0}]", chars);
return Regex.Replace(strIn, regexString, "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException)
{
return string.Empty;
}
}
The goal here is to be able to strip out, dynamically, any characters that don't belong. But we can't just hard-code those characters, because not all fields will have any of these characters and, more importantly, some fields will have some bad characters along with other characters which are not to be considered bad for that field but may be considered bad for other fields.
With me so far?
So, in trying to get my work done by Friday (yes, tomorrow), I decided to start slowly with only a couple of known bad characters from 3 input files. So far, those characters are single quote, dash, double quote, dollar sign, comma. But not all the fields in my 3 files need these characters stripped, so I intend to call the CleanInput method only on those fields that need it, and only for the characters that we need stripped.
OK, so while I was testing, I discovered that on one field, where we want to strip the comma, single quote, double quote and dollar sign, it was not removing the double quotes (and apparently the backslashes too). So I debugged this issue by first passing in only the comma - that worked. Then I tried passing in only the single quote - that worked. Then I passed in the dollar sign - that worked. Then I passed in the escaped double quote - and that didn't work - the double quotes are still in the string. So I simplified my test in a new console project: I hard-coded the string and called my method, just to make sure nothing else could be interfering with it.
I hope and pray no one spends hours of their precious time trying to reconfigure my input files or attempting to teach me the end all be all of regex programming. I have to get this done by tomorrow. Please, I only want to know how to strip the double quote (and apparently the backslashes too) from the given string.
Rather than getting regex involved, perhaps you can just use Replace?
var myString = "\\\"123.456\\\"";
var myCleanString = myString.Replace(@"\""", "");
You are matching on a negated character group (the [^] bit). This matches any character not in the square brackets and replaces it. You want to replace anything that is in the group, which you can do by placing the characters you wish to replace inside the square brackets and removing the negation (^):
private static string CleanInput(string strIn, string chars)
{
// Replace invalid characters with empty strings.
try
{
string regexString = string.Format(@"[{0}]", chars);
return Regex.Replace(strIn, regexString, "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException)
{
return string.Empty;
}
}
You would use the negative version if you knew what you wanted to include rather than exclude. For example if you knew you only wanted numbers and the period character you could do:
string myString = Regex.Replace("\"123.456\"", "[^\\d.]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
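To make the difference between the two forms concrete, here is a small side-by-side sketch on the sample string from the question. The class and method names are invented for the comparison, and the timeout argument is left out for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteStripDemo
{
    // The sample value from the question: "123.456" including the quotes.
    public const string Input = "\"123.456\"";

    // Non-negated set: removes exactly the listed character (the double quote).
    public static string RemoveListed(string s) =>
        Regex.Replace(s, "[\"]", "");

    // Negated set: removes everything except digits and the period.
    public static string KeepDigitsAndDot(string s) =>
        Regex.Replace(s, "[^\\d.]", "");

    // The question's pattern: removes everything except word characters and quotes.
    public static string QuestionPattern(string s) =>
        Regex.Replace(s, "[^\\w\\\"]", "");

    public static void Main()
    {
        Console.WriteLine(RemoveListed(Input));     // 123.456
        Console.WriteLine(KeepDigitsAndDot(Input)); // 123.456
        Console.WriteLine(QuestionPattern(Input));  // "123456" (quotes kept, period removed)
    }
}
```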
Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means I only allow users to search for 1-character strings. If the user tries to search for multi-character strings, an error message will be displayed.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the final matched strings' length is okay, though it's going to be kind of slow.
Yet, the original question is very rare and tedious.
a regexp would be .{1}
This will allow any char though. If you only want alphanumeric characters then you can use [a-z0-9]{1} or the shorthand \w{1} (note that \w is broader: it also matches uppercase letters and the underscore).
Another option is to limit the number of chars a user can type in an input field: set a maxlength on it.
Yet another option is to save the forms input field to a char and not a string although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char.
You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.
Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
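A rough sketch of what applying the pattern one character at a time could look like. Wrapping the user's pattern in \A(?:...)\z is my assumption about how to force a whole-character match, and the names are hypothetical. A pattern that needs more than one character simply never matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SingleCharSearch
{
    // Returns the indexes of all characters in text that the user's pattern
    // matches when each character is tested on its own.
    // An invalid userPattern makes the Regex constructor throw ArgumentException.
    public static List<int> FindPositions(string text, string userPattern)
    {
        var regex = new Regex(@"\A(?:" + userPattern + @")\z");
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (regex.IsMatch(text[i].ToString()))
                positions.Add(i);
        }
        return positions;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", FindPositions("a1b22", @"\d")));  // 1,3,4
        Console.WriteLine(FindPositions("a1b22", @"\d\d").Count);            // 0
    }
}
```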
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
eg:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z matches [AB]|[0-9]|[x-z]|\w
which cases do you need to support?
This would be somewhat easy to parse and validate.
Basically, the input field is just a string. People input their phone number in various formats. I need a regular expression to find and convert those numbers into links.
Input examples:
(201) 555-1212
(201)555-1212
201-555-1212
555-1212
Here's what I want:
(201) 555-1212 - Notice the space is gone
(201)555-1212
201-555-1212
555-1212
I know it should be more robust than just removing spaces, but it is for an internal web site that my employees will be accessing from their iPhone. So, I'm willing to "just get it working."
Here's what I have so far in C# (which should show you how little I know about regular expressions):
strchk = Regex.Replace(strchk, @"\b([\d{3}\-\d{4}|\d{3}\-\d{3}\-\d{4}|\(\d{3}\)\d{3}\-\d{4}])\b", "<a href='tel:$&'>$&</a>", RegexOptions.IgnoreCase);
Can anyone help me by fixing this or suggesting a better way to do this?
EDIT:
Thanks everyone. Here's what I've got so far:
strchk = Regex.Replace(strchk, @"\b(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b", "<a href='tel:$1'>$1</a>", RegexOptions.IgnoreCase);
It is picking up just about everything EXCEPT those with (nnn) area codes, with or without spaces between it and the 7 digit number. It does pick up the 7 digit number and link it that way. However, if the area code is specified it doesn't get matched. Any idea what I'm doing wrong?
Second Edit:
Got it working now. All I did was remove the \b from the start of the string.
Remove the [] and add \s* (zero or more whitespace characters) around each \-.
Also, you don't need to escape the -. (You can take out the \ from \-)
Explanation: [abcA-Z] is a character group, which matches a, b, c, or any character between A and Z.
It's not what you're trying to do.
Edits
In response to your updated regex:
Change [-\.\s] to [-\.\s]+ to match one or more of any of those characters (eg, a - with spaces around it)
The problem is that \b doesn't match the boundary between a space and a (.
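The \b behaviour is easy to check in isolation. In this sketch (sample text and names are mine), a space and an opening bracket are both non-word characters, so there is no word boundary between them:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        // No boundary between " " and "(", so the leading \b makes this fail.
        Console.WriteLine(Matches("call (201) 555-1212", @"\b\(\d{3}\)")); // False
        // Without the leading \b the bracketed area code is found.
        Console.WriteLine(Matches("call (201) 555-1212", @"\(\d{3}\)"));   // True
        // Between " " and a digit there is a boundary, so this works.
        Console.WriteLine(Matches("call 201-555-1212", @"\b\d{3}-"));      // True
    }
}
```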
Afaik, no phone enters the other characters, so why not replace [^0-9] with '' ?
Here's a regex I wrote for finding phone numbers:
(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}
It's pretty flexible... allows a variety of formats.
Then, instead of killing yourself trying to replace it w/out spaces using a bunch of back references, instead pass the match to a function and just strip the spaces as you wanted.
C#/.net should have a method that allows a function as the replace argument...
Edit: They call it a MatchEvaluator. That example uses a delegate, but I'm pretty sure you could use the slightly less verbose
(m) => m.Value.Replace(" ", "")
or something. working from memory here.
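For what it's worth, a sketch of how the MatchEvaluator approach could be wired up with the regex above. The link format (space-stripped number in the href, original text as the label) is my guess at what the question wants, and PhoneLinker / Linkify are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneLinker
{
    // Phone pattern quoted from this answer.
    private static readonly Regex Phone = new Regex(
        @"(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}");

    // The lambda is converted to a MatchEvaluator and called once per match.
    public static string Linkify(string text) =>
        Phone.Replace(text, m =>
        {
            string compact = m.Value.Replace(" ", ""); // strip spaces for the href
            return "<a href='tel:" + compact + "'>" + m.Value + "</a>";
        });

    public static void Main()
    {
        Console.WriteLine(Linkify("Call (201) 555-1212 today"));
        // Call <a href='tel:(201)555-1212'>(201) 555-1212</a> today
    }
}
```

Note that this particular pattern requires an area code, so a bare 7-digit number like 555-1212 is left untouched.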