Check String Only Contain Characters - c#

I know this has been asked before, but my code isn't working.
The scenario is I need to check if a string ONLY contains letters, numbers and spaces. I need to fail if it contains anything else.
I've tried the RegEx method, but I don't understand regular expressions, so I need to use a LINQ method for my assessment.
Here is my code:
if (!CSVItemArray[count].All(Char.IsLetterOrDigit) && !CSVItemArray[count].Contains(" "))
{
return false;
}

Just combine the check for letter, digit, or whitespace in the All query:
if (!CSVItemArray[count].All(c => Char.IsLetterOrDigit(c) || Char.IsWhiteSpace(c)))
{
return false;
}

Your logic is a little confused. The following returns true if the string in CSVItemArray[count] only contains letters, digits and white spaces:
return CSVItemArray[count].All(c => Char.IsLetterOrDigit(c) || Char.IsWhiteSpace(c));
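A minimal sketch of the same check wrapped in a helper, so it can be tried on a few inputs. The names TextChecks and IsLettersDigitsOrSpaces are made up for illustration and are not part of the original code. Two things worth knowing before relying on it: All returns true for an empty string, and Char.IsWhiteSpace also accepts tabs and newlines, not just the space character.

```csharp
using System;
using System.Linq;

public static class TextChecks
{
    // Hypothetical helper: true when every character is a letter,
    // a digit or whitespace. An empty string also returns true.
    public static bool IsLettersDigitsOrSpaces(string value)
    {
        return value.All(c => char.IsLetterOrDigit(c) || char.IsWhiteSpace(c));
    }
}
```

For example, TextChecks.IsLettersDigitsOrSpaces("abc 123") is true, while "abc,123" fails because of the comma.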

Avoiding something (e.g. Regex) because you don't understand it is a bad thing, at least for developers, particularly if what you want to do can easily be achieved with a regex.
Having said this you may simply use this:
Regex r = new Regex("^[A-Za-z0-9\\s]*$");
var valid = r.IsMatch(myString);
This matches any number of upper- or lowercase letters, digits and whitespace characters. The set of allowed characters is enclosed in [], and the following * sets how many times a character from the set may occur (in your case from none up to infinitely many). The ^ and $ mark the start and end of your string respectively. This prevents a string such as %asdfgh12345 // from matching, for instance.
EDIT: If you need Umlauts also (ä, ö, ü, ß, ...) you may have a look at this post which handles special characters also.

Related

Why is the result of the match different from the expression?

LeetCode has a question: "Given a list of words, return the words that can be typed using letters of the alphabet on only one row of an American keyboard." To solve this, I tried using a regular expression in C# like this:
public string[] FindWords(string[] words)
{
return words.Where(x => Regex.Match(
x, @"[qwertyuiop]*|[asdfghjkl]*|[zxcvbnm]*",
RegexOptions.IgnoreCase).Value == x).ToArray();
}
But I still cannot get it right. For example, when the input is:
["a", "b", "p", "hello"]
I only get "p" returned.
What am I doing wrong?
Your regex pattern is a bit off for what you are trying to achieve. Let's look at it and analyze it.
First, we need to indicate that we're actually trying to match a word, which has a start and an end. It means that we need to prepend the regex with an ^ and add $ at the end to indicate string start and end.
Then we need to make sure that we actually have a word, which means there's at least one character. To enforce "one or more character" rule we will need to use + quantifier instead of *.
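Lastly, the pattern you're using does not ensure that all characters come from a single row. Each alternative (the sections between the | operators) is restricted to one row, but without anchors the regex is free to match only part of the word, and because of * the first alternative can even match the empty string. That is why "a" and "b" fail in your code: [qwertyuiop]* matches zero characters at the start, so Value is "" rather than the word. Switching to + alone is not enough either, because a word like the following can still be matched piece by piece:
today
With + in place of *, Regex.Matches would find three separate matches here: "to", "da" and "y". Instead, we need to group the alternatives explicitly and anchor the group as a whole.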
I've ended up with the following pattern:
^([qwertyuiop]+|[asdfghjkl]+|[zxcvbnm]+)$
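As a rough sketch, the corrected pattern could be dropped into the original method like this. The class name KeyboardRows and the field name OneRow are illustrative only; RegexOptions.IgnoreCase is kept from the question so that capitalized words are accepted too.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyboardRows
{
    // Anchored pattern: the whole word must consist of letters from one row.
    private static readonly Regex OneRow = new Regex(
        @"^([qwertyuiop]+|[asdfghjkl]+|[zxcvbnm]+)$",
        RegexOptions.IgnoreCase);

    public static string[] FindWords(string[] words)
    {
        // IsMatch is enough here, because the anchors already force
        // the match to cover the entire word.
        return words.Where(w => OneRow.IsMatch(w)).ToArray();
    }
}
```

With the input from the question, "a", "b" and "p" are returned and "hello" is rejected.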

Regex to ensure that in a string such as "05123:12315", the first number is less than the second?

I must have strings in the format x:y where x and y have to be five digits (zero padded) and x <= y.
Example:
00515:02152
What Regex will match this format?
If possible, please explain the solution briefly to help me learn.
EDIT: Why do I need Regex? I've written a generic tool that takes input and validates it according to a configuration file. An unexpected requirement popped up that would require me to validate a string in the format I've shown (using the configuration file). I was hoping to solve this problem using the existing configuration framework I've coded up, as splitting and parsing would be out of the scope of this tool. For an outstanding requirement such as this, I don't mind having some unorthodox/messy regex, as long as it's not 10000 lines long. Any intelligent solutions using Regex are appreciated! Thanks.
Description
This expression will validate that the first 5-digit number is smaller than the second 5-digit number, where the zero-padded 5-digit numbers are in a : delimited string formatted as 01234:23456. Note that it tests for strictly smaller, so equal values such as 00515:00515 do not match.
^
(?:
(?=0....:[1-9]|1....:[2-9]|2....:[3-9]|3....:[4-9]|4....:[5-9]|5....:[6-9]|6....:[7-9]|7....:[8-9]|8....:[9])
|(?=(.)(?:0...:\1[1-9]|1...:\1[2-9]|2...:\1[3-9]|3...:\1[4-9]|4...:\1[5-9]|5...:\1[6-9]|6...:\1[7-9]|7...:\1[8-9]|8...:\1[9]))
|(?=(..)(?:0..:\2[1-9]|1..:\2[2-9]|2..:\2[3-9]|3..:\2[4-9]|4..:\2[5-9]|5..:\2[6-9]|6..:\2[7-9]|7..:\2[8-9]|8..:\2[9]))
|(?=(...)(?:0.:\3[1-9]|1.:\3[2-9]|2.:\3[3-9]|3.:\3[4-9]|4.:\3[5-9]|5.:\3[6-9]|6.:\3[7-9]|7.:\3[8-9]|8.:\3[9]))
|(?=(....)(?:0:\4[1-9]|1:\4[2-9]|2:\4[3-9]|3:\4[4-9]|4:\4[5-9]|5:\4[6-9]|6:\4[7-9]|7:\4[8-9]|8:\4[9]))
)
\d{5}:\d{5}$
Live demo: http://www.rubular.com/r/w1QLZhNoEa
Note that this uses the x option to ignore all whitespace and allow comments; if you use it without x, the expression needs to be all on one line.
The language you want to recognize is finite, so the easiest thing to do is just list all the cases separated by "or". The regexp you want is:
(00000:(00000|00001| ... |99999))| ... |(99998:(99998|99999))|(99999:99999)
That regexp will be several billion characters long and take quite some time to execute, but it is what you asked for: a regular expression that matches the stated language.
Obviously that's impractical. Now is it clear why regular expressions are the wrong tool for this job? Use a regular expression to match 5 digits - colon - five digits, and then once you know you have that, split up the string and convert the two sets of digits to integers that you can compare.
x <= y.
Well, you are using the wrong tool. Really, regex can't help you here. Even if you do get a solution, it will be too complex and too difficult to extend.
Regex is a text-processing tool for matching patterns in regular languages. It is very weak when it comes to semantics: it cannot identify the meaning of the given string. To check your x <= y condition, you need to know the numerical values of x and y.
For example, it can match digits in a sequence, or a mix of digits and characters, but what it cannot do is things like:
match a number greater than 15 and less than 1245, or
match a pattern which is a date between two given dates.
So wherever matching a pattern involves applying semantics to the matched string, regex is not an option.
The appropriate way here is to split the string on the colon and then compare the numbers. The leading zeros need no special handling in C#, since int.Parse accepts them.
You can't generally* do this with regex. You can use regex to match the pattern and extract the numbers, then compare the numbers in your code.
For example to match such format (without comparing the numbers) and get the numbers you could use:
^(\d{5}):(\d{5})\z
*) You probably could in this case (as the numbers are always 5 digits and zero padded), but it wouldn't be nice.
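The match-then-compare approach described above might look roughly like this. The names RangeCheck and IsValid are invented for the example. One deliberate deviation from the pattern above: the sketch uses [0-9] instead of \d, because \d in .NET also matches non-ASCII digits, which int.Parse would not accept.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RangeCheck
{
    // Same shape as the pattern above, with [0-9] in place of \d (see lead-in).
    private static readonly Regex Format = new Regex(@"^([0-9]{5}):([0-9]{5})\z");

    public static bool IsValid(string s)
    {
        Match m = Format.Match(s);
        if (!m.Success)
        {
            return false; // wrong shape, nothing to compare
        }

        // The regex has already guaranteed five ASCII digits per group,
        // so parsing cannot fail or overflow here.
        int x = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        int y = int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
        return x <= y;
    }
}
```

The regex only decides whether the string has the right format; the ordering rule stays in ordinary code, where it is a one-line comparison.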
You should do something like this instead:
bool IsCorrect(string s)
{
string[] split = s.Split(':');
int number1, number2;
if (split.Length == 2 && split[0].Length == 5 && split[1].Length == 5)
{
if (int.TryParse(split[0], out number1) && int.TryParse(split[1], out number2) && number1 <= number2)
{
return true;
}
}
return false;
}
With regex you can't make comparisons to see if a number is bigger than another number.
Let me show you a good example of why you shouldn't try to do this. This is a regex that (nearly) does the same job.
https://gist.github.com/anonymous/ad74e73f0350535d09c1
Raw file:
https://gist.github.com/anonymous/ad74e73f0350535d09c1/raw/03ea835b0e7bf7ac3c5fb6f9c7e934b83fb09d95/gistfile1.txt
Except it's just for 3 digits. For 4, the program that generates these fails with an OutOfMemoryException, even with gcAllowVeryLargeObjects enabled. It grew to 5GB before it crashed. You don't want most of your app to be a Regex, right?
This is not a Regex's job.
This is a two-step process, because regex parses text but does not analyze it. That said, regex is perfect for validating that we have the 5:5 digit pattern: \d\d\d\d\d:\d\d\d\d\d tells us whether the string has the right shape. If that shape is not found, the match fails and the whole validation fails. If it is found, we can use regex/LINQ to parse out the numbers and then check them for validity.
This code would be inside a method to do the check
var data = "00515:02151";
var pattern = @"
^ # starting from the beginning of the string...
(?=[\d:]{11}) # Is there a run of at least 11 characters containing only digits and a ':'? Fail if not
(?=\d{5}:\d{5}) # Does it fall into our pattern? If not fail the match
((?<Values>[^:]+)(?::?)){2}
";
// IgnorePatternWhitespace only allows us to comment the pattern, it does not affect the regex parsing
var result = Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace)
.OfType<Match>()
.Select (mt => mt.Groups["Values"].Captures
.OfType<Capture>()
.Select (cp => int.Parse(cp.Value)))
.FirstOrDefault();
// Two values at this point: 515 and 2151
bool valid = ((result != null) && (result.First () <= result.Last ())); // <= because the question asks for x <= y
Console.WriteLine (valid); // True
Using Javascript this can work.
var string = "00515:02152";
string.replace(/(\d{5})\:(\d{5})/, function($1,$2,$3){
return (parseInt($2, 10)<=parseInt($3, 10))?$1:null;
});
FIDDLE http://jsfiddle.net/VdzF7/

Check Formatting of a String

This has probably been answered somewhere before, but there are millions of unrelated posts about string formatting, so I couldn't find it.
Take the following string:
24:Something(true;false;true)[0,1,0]
I want to be able to do two things in this case. I need to check whether or not all the following conditions are true:
There is only one : (done: achieved using Split(), which I needed to use anyway to separate the two parts)
The integer before the : is a 1-3 digit int (done: simple int.Parse logic)
The () exists, and the "Something", in this case any string less than 10 characters, is there
The [] exists and has at least 1 integer in it. Also, make sure the elements in the [] are integers separated by ,
How can I best do this?
EDIT: I have marked what I've achieved so far as done.
A regular expression is the quickest way. Depending on the complexity it may also be the most computationally expensive.
This seems to do what you need (I'm not that good so there might be better ways to do this):
^\d{1,3}:\w{1,9}\((true|false)(;true|;false)*\)\[\d(,[\d])*\]$
Explanation
\d{1,3}
1 to 3 digits
:
followed by a colon
\w{1,9}
followed by a 1-9 character alpha-numeric string,
\((true|false)(;true|;false)*\)
followed by parenthesis containing "true" or "false" followed by any number of ";true" or ";false",
\[\d(,[\d])*\]
followed by a set of square brackets containing a digit, followed by any number of comma+digit.
The ^ and $ at the beginning and end of the string indicate the start and end of the string which is important since we're trying to verify the entire string matches the format.
Code Sample
var input = "24:Something(true;false;true)[0,1,0]";
var regex = new System.Text.RegularExpressions.Regex(@"^\d{1,3}:.{1,9}\(.*\)\[\d(,[\d])*\]$");
bool isFormattedCorrectly = regex.IsMatch(input);
Credit @Ian Nelson
This is one of those cases where your only sensible option is to use a Regular Expression.
My hasty attempt is something like:
var input = "24:Something(true;false;true)[0,1,0]";
var regex = new System.Text.RegularExpressions.Regex(@"^\d{1,3}:.{1,9}\(.*\)\[\d(,[\d])*\]$");
System.Diagnostics.Debug.Assert(regex.IsMatch(input));
This online RegEx tester should help refine the expression.
I think, the best way is to use regular expressions like this:
string s = "24:Something(true;false;true)[0,1,0]";
Regex pattern = new Regex(@"^\d{1,3}:[a-zA-Z]{1,9}\((true|false)(;true|;false)*\)\[\d(,\d)*\]$");
if (pattern.IsMatch(s))
{
// s is valid
}
If you want anything inside (), you can use following regex:
@"^\d{1,3}:[a-zA-Z]{1,9}\([^:\(]*\)\[\d(,\d)*\]$"

C# - Splitting on a pipe with an escaped pipe in the data?

I've got a pipe delimited file that I would like to split (I'm using C#). For example:
This|is|a|test
However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:
This|is|a|pip\|ed|test (this is a pip|ed test)
I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.
Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.
I do a lot of parsing like this, and it is really pretty straightforward. My approach, if done correctly, will also tend to run faster.
EDIT
In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.
public List<string> ParseWords(string s)
{
List<string> words = new List<string>();
int pos = 0;
while (pos < s.Length)
{
// Get word start
int start = pos;
// Get word end
pos = s.IndexOf('|', pos);
while (pos > 0 && s[pos - 1] == '\\')
{
pos++;
pos = s.IndexOf('|', pos);
}
// Adjust for pipe not found
if (pos < 0)
pos = s.Length;
// Extract this word
words.Add(s.Substring(start, pos - start));
// Skip over pipe
if (pos < s.Length)
pos++;
}
return words;
}
This oughta do it:
string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");
The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this though, I just hijacked the regular expression from this post and simplified it.
EDIT
In terms of performance, compared to the manual parsing method provided in this thread, I found that this Regex implementation is 3 to 5 times slower than Jonathon Wood's implementation using the longer test string provided by the OP.
With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test, I instead created a new Regex instance with the expression above outside of my test loop and then called its Split method.
UPDATE
Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the use of regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms)
The results were the same using the extended regular expression from the post I linked to.
I came across a similar scenario. For me the number of pipes was fixed (not counting the escaped "\|" pipes). This is how I handled it:
string sPipeSplit = "This|is|a|pip\\|ed|test (this is a pip|ed test)";
string sTempString = sPipeSplit.Replace("\\|", "¬"); // replace \| with a placeholder character that must not occur in the data
string[] sSplitString = sTempString.Split('|');
//string sFirstString = sSplitString[0].Replace("¬", "\\|"); // If you have a fixed number of fields and copy each one to another field, do the replace while copying.
/* Or you could use a loop to replace everything at once.
Strings are immutable, so the result of Replace has to be assigned back:
for (int i = 0; i < sSplitString.Length; i++)
{
sSplitString[i] = sSplitString[i].Replace("¬", "\\|");
}
*/
Here is another solution.
One of the most beautiful things about programming is that there are several ways to solve the same problem:
string text = @"This|is|a|pip\|ed|test"; //The original text
string parsed = ""; //Where you will store the parsed string
bool flag = false;
foreach (var x in text.Split('|')) {
bool endsWithArroba = x.EndsWith(@"\");
parsed += flag ? "|" + x + " " : endsWithArroba ? x.Substring(0, x.Length-1) : x + " ";
flag = endsWithArroba;
}
Cory's solution is pretty good. But if you prefer not to work with Regex, you can simply search for "\|" and replace it with some other character, then do your split, then replace it again with the "\|".
Another option is to do the split, then examine all the strings, and if the last character of one is a \, join it with the next string.
Of course, all this ignores what happens if you need an escaped backslash before a pipe.. like "\\|".
Overall, I lean towards regex though.
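The split-then-rejoin option mentioned above could be sketched as follows. PipeSplitter and SplitAndRejoin are hypothetical names. The sketch assumes a backslash only ever escapes a pipe, so, as noted above, an escaped backslash directly before a pipe is not handled.

```csharp
using System;
using System.Collections.Generic;

public static class PipeSplitter
{
    // Splits on '|', then glues any piece ending in '\' to the piece after it.
    // Assumption: '\' is only used to escape '|' (no escaped backslashes).
    public static List<string> SplitAndRejoin(string line)
    {
        var fields = new List<string>();
        string pending = "";
        bool hasPending = false;

        foreach (string piece in line.Split('|'))
        {
            string current = hasPending ? pending + "|" + piece : piece;

            if (current.EndsWith("\\", StringComparison.Ordinal))
            {
                // The pipe that followed was escaped: drop the backslash and keep going.
                pending = current.Substring(0, current.Length - 1);
                hasPending = true;
            }
            else
            {
                fields.Add(current);
                hasPending = false;
            }
        }

        if (hasPending)
        {
            fields.Add(pending + "\\"); // a lone trailing backslash is kept as data
        }
        return fields;
    }
}
```

For the sample line from the question this produces the fields This, is, a, pip|ed and test.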
Frankly, I prefer to use FileHelpers because, even though this isn't comma delimited, it's basically the same thing. And they have a great story about why you shouldn't write this stuff yourself.
You can do this with a regex. Once you decide to use a backslash as your escape character, you have two escape cases to account for:
Escaping a pipe: \|
Escaping a backslash that you want interpreted literally.
Both of these can be done in the same regex. Escaped backslashes will always be two \ characters together. Consecutive, escaped backslashes will always be even numbers of \ characters. If you find an odd-numbered sequence of \ before a pipe, it means you have several escaped backslashes, followed by an escaped pipe. So you want to use something like this:
/^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*/
Confusing, perhaps, but it should work. Explanation:
^ #The start of a line
(?:...
[^|\\] #A character other than | or \ OR
(?:\\{2})* #An even number of \ characters OR
\\\| #A literal \ followed by a literal |
...)+ #Repeat the preceding at least once
(?:$|\|) #Either a literal | or the end of a line
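For comparison, the same two escape rules can be handled without a regex by scanning the line once. This is a hand-written alternative, not a translation of the pattern above, and the names EscapedPipeParser and Split are made up for the sketch. It also unescapes as it goes, which the regex does not do.

```csharp
using System.Collections.Generic;
using System.Text;

public static class EscapedPipeParser
{
    // Splits on unescaped '|'. "\|" becomes a literal pipe and "\\" a literal
    // backslash; a backslash before any other character is kept unchanged.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            bool escapesNext = c == '\\' && i + 1 < line.Length
                && (line[i + 1] == '|' || line[i + 1] == '\\');

            if (escapesNext)
            {
                current.Append(line[i + 1]); // keep the escaped character only
                i++;                         // and skip over it
            }
            else if (c == '|')
            {
                fields.Add(current.ToString()); // unescaped pipe ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field has no trailing pipe
        return fields;
    }
}
```

Because pairs of backslashes are consumed together, an odd run of backslashes before a pipe leaves exactly one to escape the pipe, which matches the even/odd reasoning above.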

Regex which ensures no character is repeated

I need to ensure that a input string follows these rules:
It should contain upper case characters only.
NO character should be repeated in the string.
eg. ABCA is not valid because 'A' is being repeated.
For the upper case thing, [A-Z] should be fine.
But I am lost as to how to ensure no repeating characters.
Can someone suggest some method using regular expressions?
You can do this with .NET regular expressions although I would advise against it:
string s = "ABCD";
bool result = Regex.IsMatch(s, @"^(?:([A-Z])(?!.*\1))*$");
Instead I'd advise checking that the length of the string is the same as the number of distinct characters, and checking the A-Z requirement separately:
bool result = s.Cast<char>().Distinct().Count() == s.Length;
Alternatively, if performance is a critical issue, iterate over the characters one by one and keep a record of which you have seen.
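A minimal sketch of that character-by-character approach, with both rules checked in a single pass. UniqueUppercase and IsValid are illustrative names. Like the regex above, it accepts the empty string; add a length check if that should be rejected.

```csharp
using System.Collections.Generic;

public static class UniqueUppercase
{
    // True when every character is in A-Z and none occurs twice.
    public static bool IsValid(string s)
    {
        var seen = new HashSet<char>();
        foreach (char c in s)
        {
            if (c < 'A' || c > 'Z')
            {
                return false; // not an upper case ASCII letter
            }
            if (!seen.Add(c))
            {
                return false; // Add returns false if c was already present
            }
        }
        return true;
    }
}
```

The loop stops at the first offending character, so invalid input is rejected without scanning the rest of the string.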
This cannot be done via regular expressions, because they are context-free. You need at least a context-sensitive grammar, so the only way to achieve this is by writing the function by hand.
See formal grammar for background theory.
Why not check for a character which is repeated or not in uppercase instead? With something like ([A-Z])?.*?([^A-Z]|\1)
Use negative lookahead and backreference.
string pattern = @"^(?!.*(.).*\1)[A-Z]+$";
string s1 = "ABCDEF";
string s2 = "ABCDAEF";
string s3 = "ABCDEBF";
Console.WriteLine(Regex.IsMatch(s1, pattern));//True
Console.WriteLine(Regex.IsMatch(s2, pattern));//False
Console.WriteLine(Regex.IsMatch(s3, pattern));//False
\1 matches the first captured group. Thus the negative lookahead fails if any character is repeated.
This isn't regex, and would be slow, but You could create an array of the contents of the string, and then iterate through the array comparing n to n++
=Waldo
It can be done using what is called a backreference.
I am a Java programmer, so I will show you how it is done in Java (for C#, see here).
final Pattern aPattern = Pattern.compile("([A-Z]).*\\1");
final Matcher aMatcher1 = aPattern.matcher("ABCDA");
System.out.println(aMatcher1.find());
final Matcher aMatcher2 = aPattern.matcher("ABCDA");
System.out.println(aMatcher2.find());
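The regular expression is ([A-Z]).*\\1, which translates to: anything between 'A' and 'Z' as group 1 ('([A-Z])'), then anything else (.*), then group 1 again.
In C# the backreference inside the pattern is written \1 as well; $1 is only used in replacement strings.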
Hope this helps.
