Regex: Matching individual characters only (no characters contained in strings)

Aim: To split a string with Regex.Split(...) based on a pattern of individual characters, keeping the matched character at the beginning of each resulting item.
Problem: One of the characters can appear in other parts of the string I don't wish to split on and I'm getting more list items than intended.
Example of string to split: T 2 TBS PO And > Qd PRN MIX X A 3 TB \ A 4 TB Xmon UG
Outcome desired:
T 2 TBS PO And
> Qd PRN MIX
X A 3 TB
\ A 4 TB Xmon UG
Pattern: (?=[#\+X\\>])
This works for everything except the X. Instead of the desired outcome, I'm getting it split in undesired places.
Current outcome:
T 2 TBS PO And
> Qd PRN MI
X
X A 3 TB
\ A 4 TB X
mon UG
Basically, I need it to split on a character only when that character stands on its own, not when it is part of a longer string of characters.
Thanks in advance for your help
UPDATE: Oops! I seem to have forgotten to mention that the centre of the pattern, the characters to split by, has been pulled from a table and technically I don't know beforehand that there's an X in there (the characters may also change).
For this reason, Jonny 5/Jerry's suggestion seems the most viable to me. I'll test when I get into work.

You could put some \s in there to make sure the characters you are matching are alone:
(?<=\s)(?=[#\+\\>X]\s)
(?<=\s) makes sure the character is preceded by a space, and the space that follows makes sure the character is followed by a space.
Note: where 'space' is mentioned above, it actually means whitespace, tab, newline, carriage return.
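Since the update to the question says the delimiter characters come from a table, here is a rough sketch of how the pattern could be built at run time. The names DelimiterSplitter and SplitOnStandalone are made up for illustration. It is a variation on the pattern above, not the same pattern: it uses (?<!\S) and (?!\S) instead of requiring literal whitespace, so a delimiter at the very start or end of the string also counts as standalone, and it joins the escaped delimiters with | rather than putting them in a character class, because Regex.Escape does not escape ] or -.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimiterSplitter
{
    // Splits before every delimiter that stands alone (whitespace or a string
    // boundary on both sides). Hypothetical helper, not part of the question.
    public static string[] SplitOnStandalone(string source, char[] delimiters)
    {
        // Escape each delimiter so regex metacharacters such as + or \ are literal.
        string alternatives = string.Join("|",
            delimiters.Select(d => Regex.Escape(d.ToString())));

        // (?<!\S)  not preceded by a non-whitespace character
        // (?= ...) zero-width, so the delimiter stays at the start of the next item
        // (?!\S)   not followed by a non-whitespace character
        string pattern = @"(?<!\S)(?=(?:" + alternatives + @")(?!\S))";

        return Regex.Split(source, pattern)
                    .Select(part => part.Trim())
                    .Where(part => part.Length > 0)
                    .ToArray();
    }
}
```

Called with the example string and the delimiters #, +, X, \ and >, this should give the four desired items, with MIX and Xmon left intact.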

Just split your regex into two and pipe them together:
(?=[#\+\\>])|(?=\bX\b)
(?=[#\+\\>]) checking for your regular characters.
(?=\bX\b) is checking for alone X

Why not roll your own instead of using a regular expression:
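// Requires: using System.Collections.Generic; using System.Text;
public IEnumerable<string> CustomSplit( string source )
{
    StringBuilder buf = new StringBuilder();
    for ( int i = 0 ; i < source.Length ; ++i )
    {
        char curr = source[i] ;
        // Treat the string boundaries as whitespace so a leading or trailing X counts as standalone.
        char prev = i > 0 ? source[i-1] : ' ' ;
        char next = i+1 < source.Length ? source[i+1] : ' ' ;
        bool isDelimiter = curr == '#'
                        || curr == '+'
                        || curr == '\\'
                        || curr == '>'
                        // X is a delimiter only when it stands alone, so not the X in "MIX" or "Xmon".
                        || ( curr == 'X' && char.IsWhiteSpace(prev) && char.IsWhiteSpace(next) )
                        ;
        if ( isDelimiter )
        {
            if ( buf.Length > 0 ) yield return buf.ToString() ;
            buf.Length = 0 ;
        }
        buf.Append(curr) ;
    }
    // return the last element, if there is one.
    if ( buf.Length > 0 ) yield return buf.ToString() ;
}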

Related

How to write REGEX to get the particular string in C# ASP.NET?

Need to get three strings from the below mentioned string, need the possible solution in C# and ASP.NET:
"componentStatusId==2|3,screeningOwnerId>0"
I need to get '2','3' and '0' using a regular expression in C#

If all you want is the numbers from a string then you could use the regex in this code:
string re = "(?:\\b(\\d+)\\b[^\\d]*)+";
Regex regex = new Regex(re);
string input = "componentStatusId==2|3,screeningOwnerId>0";
MatchCollection matches = regex.Matches(input);
for (int ii = 0; ii < matches.Count; ii++)
{
    Console.WriteLine("Match[{0}] // of 0..{1}:", ii, matches.Count - 1);
    DisplayMatchResults(matches[ii]);
}
Function DisplayMatchResults is taken from this Stack Overflow answer.
The Console output from the above is:
Match[0] // of 0..0:
Match has 1 captures
Group 0 has 1 captures '2|3,screeningOwnerId>0'
Capture 0 '2|3,screeningOwnerId>0'
Group 1 has 3 captures '0'
Capture 0 '2'
Capture 1 '3'
Capture 2 '0'
match.Groups[0].Value == "2|3,screeningOwnerId>0"
match.Groups[1].Value == "0"
match.Groups[0].Captures[0].Value == "2|3,screeningOwnerId>0"
match.Groups[1].Captures[0].Value == "2"
match.Groups[1].Captures[1].Value == "3"
match.Groups[1].Captures[2].Value == "0"
Hence the numbers can be seen in match.Groups[1].Captures[...].
Another possibility is to use Regex.Split where the pattern is "non digits". The results from the code below will need post processing to remove empty strings. Note that Regex.Split does not have the StringSplitOptions.RemoveEmptyEntries of the string Split method.
string input = "componentStatusId==2|3,screeningOwnerId>0";
string[] numbers = Regex.Split(input, "[^\\d]+");
for (int ii = 0; ii < numbers.Length; ii++)
{
    Console.WriteLine("{0}: '{1}'", ii, numbers[ii]);
}
The output from this is:
0: ''
1: '2'
2: '3'
3: '0'
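The post-processing mentioned above could be done with LINQ, something like the sketch below. NumberExtractor and ExtractNumbers are just illustrative names, not anything from the original answer.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Splits on runs of non-digits and drops the empty entries that
    // Regex.Split leaves when the input starts or ends with a non-digit.
    public static string[] ExtractNumbers(string input)
    {
        return Regex.Split(input, @"[^\d]+")
                    .Where(piece => piece.Length > 0)
                    .ToArray();
    }
}
```

For the input from the question this leaves just the three numbers as strings, which can then be passed to int.Parse if needed.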

Use following regex and capture your values from group 1, 2 and 3.
componentStatusId==(\d+)\|(\d+),screeningOwnerId>(\d+)
For generalizing componentStatusId and screeningOwnerId with any string, you can use \w+ in the regex and make it more general.
\w+==(\d+)\|(\d+),\w+>(\d+)

Removal of colon and carriage returns and replace with colon

I'm working on a project where I have an HTML fragment which needs to be cleaned up. The HTML has been removed and, as a result of a table being removed, there are some strange line ends where they shouldn't be :-)
the characters as they appear are
a space at the beginning of a line
a colon, carriage return and linefeed at the end of the line - which needs to be replaced simply with the colon;
I am presently using regex as follows:
s = Regex.Replace(s, @"(:[\r\n])", ":", RegexOptions.Multiline | RegexOptions.IgnoreCase);
// gets rid of the leading space
s = Regex.Replace(s, @"(^[( )])", "", RegexOptions.Multiline | RegexOptions.IgnoreCase);
Example of what I am dealing with:
Tomas Adams
Solicitor
APLawyers
p:
1800 995 718
f:
07 3102 9135
a:
22 Fultam Street
PO Box 132, Booboobawah QLD 4113
which should look like:
Tomas Adams
Solicitor
APLawyers
p:1800 995 718
f:07 3102 9135
a:22 Fultam Street
PO Box 132, Booboobawah QLD 4113
The regex above is my attempt to clean the string, but the result is far from perfect. Can someone help me correct the error and achieve my goal?
[EDIT]
the offending characters
f:\r\n07 3102 9135\r\na:\r\n22
the combination of :\r\n should be replaced by a single colon.
MTIA
Darrin

You may use
var result = Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
Details
(?m) - equal to RegexOptions.Multiline - makes ^ match the start of any line here
^ - start of a line
\s+ - 1+ whitespaces
| - or
(?<=:)(?:\r?\n)+ - a position that is immediately preceded with : (matched with (?<=:) positive lookbehind) followed with 1+ occurrences of an optional CR and LF (those are removed)
| - or
(\r?\n){2,} - two or more consecutive occurrences of an optional CR followed with an LF symbol. Only the last occurrence is saved in Group 1 memory buffer, thus the $1 replacement pattern inserts that last, single, occurrence.
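To see the three alternatives working together, here is a small sketch that wraps the replacement in a helper and runs it on a shortened version of the sample data. ContactCleaner and Clean are hypothetical names, and the sample assumes CRLF line endings as shown in the question's edit.

```csharp
using System.Text.RegularExpressions;

public static class ContactCleaner
{
    // Removes leading whitespace on each line, removes line breaks that
    // directly follow a colon, and collapses runs of line breaks into one.
    public static string Clean(string s)
    {
        return Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
    }
}
```

For example, Clean(" Tomas Adams\r\n Solicitor\r\np:\r\n1800 995 718") should come back as "Tomas Adams\r\nSolicitor\r\np:1800 995 718": the leading spaces match the first alternative and the break after p: matches the second, and in both cases $1 is empty so the match is simply dropped.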

A basic solution without Regex:
var lines = input.Split(new []{"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries);
var output = new StringBuilder();
for (var i = 0; i < lines.Length; i++)
{
    // merge a line ending in ":" into the next line, if there is one
    if (lines[i].TrimEnd().EndsWith(":") && i + 1 < lines.Length) // feel free to also check for the size
    {
        lines[i + 1] = lines[i].Trim() + lines[i + 1].TrimStart();
        continue;
    }
    output.AppendLine(lines[i].Trim()); // remove space before or after a line
}

I tried to use your regular expression. I was able to replace "\n" and ":" with the following regular expression. Note that this removes the ":" as well as the "\r" and "\n" at the end of the line.
@"([:\r\n])"

A Linq solution without Regex:
var tmp = string.Empty;
var output = input.Split(new []{"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries).Aggregate(new StringBuilder(), (a,b) => {
    if (b.TrimEnd().EndsWith(":")) { // feel free to also check for the size
        tmp = b.Trim();
    }
    else {
        a.AppendLine(tmp + b.Trim()); // remove space before or after a line
        tmp = string.Empty;
    }
    return a;
});

How do I check the data type for each char in a string?

I'm new to C# so expect some mistakes ahead. Any help / guidance would be greatly appreciated.
I want to limit the accepted inputs for a string to just:
a-z
A-Z
hyphen
Period
If the character is a letter, a hyphen, or period, it's to be accepted. Anything else will return an error.
The code I have so far is
string foo = "Hello!";
foreach (char c in foo)
{
    /* Is there a similar way
    To do this in C# as
    I am basing the following
    Off of my Python 3 knowledge
    */
    if (c.IsLetter == true) // *Q: Can I cut out the == true part ?*
    {
        // Do what I want with letters
    }
    else if (c.IsDigit == true)
    {
        // Do what I want with numbers
    }
    else if (c.Isletter == "-") // Hyphen | If there's an 'or', include period as well
    {
        // Do what I want with symbols
    }
}
I know that's a pretty poor set of code.
I had a thought whilst writing this:
Is it possible to create a list of the allowed characters and check the variable against that?
Something like:
foreach (char c in foo)
{
    if (c != list)
    {
        // Unaccepted message here
    }
    else if (c == list)
    {
        // Accepted
    }
}
Thanks in advance!

Easily accomplished with a Regex:
using System.Text.RegularExpressions;
var isOk = Regex.IsMatch(foo, @"^[A-Za-z0-9\-\.]+$");
Rundown:
match from the start
| set of possible matches
||
|+-------------+
||             |any number of matches is ok
||             ||match until the end of the string
||             |||
vv             vvv
^[A-Za-z0-9\-\.]+$
  ^  ^  ^  ^ ^
  |  |  |  | |
  |  |  |  | match dot
  |  |  |  match hyphen
  |  |  match 0 to 9
  |  match a-z (lowercase)
  match A-Z (uppercase)

You can do this in a single line with regular expressions:
Regex.IsMatch(myInput, @"^[a-zA-Z0-9\.\-]*$")
^ -> match start of input
[a-zA-Z0-9\.\-] -> match any of a-z , A-Z , 0-9, . or -
* -> 0 or more times (you may prefer + which is 1 or more times)
$ -> match the end of input
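If digits should be rejected as well (the question only lists a-z, A-Z, hyphen and period), the character class can simply be narrowed. Below is a minimal sketch with made-up names (InputValidator, IsValidName). It also swaps ^ and $ for \A and \z, because in .NET $ will also match just before a trailing newline, which would let "Hello\n" through.

```csharp
using System.Text.RegularExpressions;

public static class InputValidator
{
    // \A and \z anchor to the absolute start and end of the string.
    private static readonly Regex LettersHyphenPeriod =
        new Regex(@"\A[A-Za-z.\-]+\z");

    // True when the input is non-empty and contains only ASCII letters,
    // hyphens and periods.
    public static bool IsValidName(string input)
    {
        return input != null && LettersHyphenPeriod.IsMatch(input);
    }
}
```

Use * instead of + in the pattern if an empty string should be accepted too.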

You can use Regex.IsMatch function and specify your regular expression.
Or define manually chars what you need. Something like this:
string foo = "Hello!";
char[] availableSymbols = {'-', ',', '!'};
char[] availableLetters = {'A', 'a', 'H'}; //etc.
char[] availableNumbers = {'1', '2', '3'}; //etc
foreach (char c in foo)
{
    if (availableLetters.Contains(c)) // Contains on an array needs using System.Linq;
    {
        // Do what I want with letters
    }
    else if (availableNumbers.Contains(c))
    {
        // Do what I want with numbers
    }
    else if (availableSymbols.Contains(c))
    {
        // Do what I want with symbols
    }
}

Possible solution
You can use the CharUnicodeInfo.GetUnicodeCategory(char) method (in System.Globalization). It returns the UnicodeCategory of a character. The following Unicode categories might be what you're looking for:
UnicodeCategory.DecimalDigitNumber
UnicodeCategory.LowercaseLetter and UnicodeCategory.UppercaseLetter
An example:
string foo = "Hello!";
foreach (char c in foo)
{
    UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
    if (cat == UnicodeCategory.LowercaseLetter || cat == UnicodeCategory.UppercaseLetter)
    {
        // Do what I want with letters
    }
    else if (cat == UnicodeCategory.DecimalDigitNumber)
    {
        // Do what I want with numbers
    }
    else if (c == '-' || c == '.')
    {
        // Do what I want with symbols
    }
}
Answers to your other questions
Can I cut out the == true part?:
Yes, you can cut the == true part, it is not required in C#
If there's an 'or', include period as well.:
To create 'or' expressions, use the || operator, as I've done in the example above.

Whenever you have some kind of collection of similar things (an array, a list, a string of characters, whatever), you'll see in the definition of the collection that it implements IEnumerable<T>:
public class String : ..., IEnumerable<char>,
Here T is a char. It means that you can ask the class: "give me your first T", "give me your next T", "give me your next T" and so on until there are no more elements.
This is the basis for all of LINQ. LINQ has about 40 functions that act upon sequences. If you need to do something with a sequence of the same kind of items, consider using LINQ.
The functions in LINQ can be found in class Enumerable. One of the function is Contains. You can use it to find out if a sequence contains a character.
char[] allowedChars = "abcdefgh....XYZ.-".ToCharArray();
Now you have a sequence of allowed characters. Suppose you have a character x and want to know if x is allowed:
char x = ...;
bool xIsAllowed = allowedChars.Contains(x);
Now Suppose you don't have one character x, but a complete string and you want only the characters in this string that are allowed:
string str = ...
var allowedInStr = str
.Where(characterInString => allowedChars.Contains(characterInString));
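Putting those pieces together, a whole-string check could look roughly like this. AllowedCharacters, IsAllowed and Rejected are names invented for the sketch, and a HashSet<char> is used in place of the char array purely so that each lookup is cheap. The allowed set here is the one from the question (letters, period, hyphen).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class AllowedCharacters
{
    // The full set of characters the question wants to accept.
    private static readonly HashSet<char> Allowed = new HashSet<char>(
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-");

    // True when every character of the text is in the allowed set.
    // Note that All returns true for an empty string.
    public static bool IsAllowed(string text)
    {
        return text.All(c => Allowed.Contains(c));
    }

    // The characters that caused the text to be rejected, in order.
    public static char[] Rejected(string text)
    {
        return text.Where(c => !Allowed.Contains(c)).ToArray();
    }
}
```

Rejected is handy for the error message: for "Hello!" it returns just the '!' character.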
If you are going to do a lot with sequences of things, consider spending some time to familiarize yourself with LINQ.

You can use Regex.IsMatch with "^[a-zA-Z.-]*$" to check for valid characters.
string foo = "Hello!";
if (!Regex.IsMatch(foo, @"^[a-zA-Z.-]*$"))
{
    throw new ArgumentException("Exception description here");
}
Other than that you can create a list of chars and use string.Contains method to check if it is ok.
string validChars = "abcABC./";
foreach (char c in foo)
{
    if (!validChars.Contains(c))
    {
        // Throw exception
    }
}
Also, you don't need to compare against true/false in the if condition. The two statements below are equivalent:
if (boolvariable) { /* do something */ }
if (boolvariable == true) { /* do something */ }

C# string.IsNullOrWhiteSpace("\t") == true

I have a line of code
var delimiter = string.IsNullOrWhiteSpace(foundDelimiter) ? "," : foundDelimiter;
when foundDelimiter is "\t", string.IsNullOrWhiteSpace returns true.
Why? And what is the appropriate way to work around this?

\t is the tab character, which is whitespace. In C# you can do either of these to get a tab:
var tab1 = "\t";
var tab2 = " ";
var areEqual = tab1 == tab2; //returns true
Edit: As noted by Magus, SO is converting my tab character into spaces when the answer gets rendered. If you're in your IDE you'd just hit quote, tab, quote.
As far as a workaround goes, I'd suggest you just add a check for tabs in your conditional.
var delimiter = string.IsNullOrWhiteSpace(foundDelimiter) && foundDelimiter != "\t" ? "," : foundDelimiter;
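If any whitespace delimiter (not only tab) should be accepted, another option is to test with string.IsNullOrEmpty instead, which only treats null and "" as missing. A small sketch, with DelimiterChooser and ChooseDelimiter as made-up names:

```csharp
public static class DelimiterChooser
{
    // Falls back to a comma only when no delimiter was found at all;
    // whitespace delimiters such as "\t" or " " are kept as they are.
    public static string ChooseDelimiter(string foundDelimiter)
    {
        return string.IsNullOrEmpty(foundDelimiter) ? "," : foundDelimiter;
    }
}
```

Whether that is appropriate depends on whether a single space should count as a real delimiter in your data.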

Welcome to Unicode.
What did you expect would happen? HT (horizontal tab) has been a whitespace character for decades. The "classic" C-language definition of white-space characters consists of the US-ASCII characters:
SP: space (0x20,' ')
HT: horizontal tab (0x09,'\t')
LF: line feed (0x0A, '\n')
VT: vertical tab (0x0B, '\v')
FF: form feed (0x0C, '\f')
CR: carriage return (0x0D, '\r')
Unicode is a little more...ecumenical in its approach: its definition of white-space characters is this set:
Members of the Unicode category SpaceSeparator:
SPACE (U+0020)
OGHAM SPACE MARK (U+1680)
MONGOLIAN VOWEL SEPARATOR (U+180E)
EN QUAD (U+2000)
EM QUAD (U+2001)
EN SPACE (U+2002)
EM SPACE (U+2003)
THREE-PER-EM SPACE (U+2004)
FOUR-PER-EM SPACE (U+2005)
SIX-PER-EM SPACE (U+2006)
FIGURE SPACE (U+2007)
PUNCTUATION SPACE (U+2008)
THIN SPACE (U+2009)
HAIR SPACE (U+200A)
NARROW NO-BREAK SPACE (U+202F)
MEDIUM MATHEMATICAL SPACE (U+205F)
IDEOGRAPHIC SPACE (U+3000)
Members of the Unicode category LineSeparator, which consists solely of
LINE SEPARATOR (U+2028)
Member of the Unicode category ParagraphSeparator, which consists solely of
PARAGRAPH SEPARATOR (U+2029)
These Basic Latin/C0 Controls/US-ASCII characters:
CHARACTER TABULATION (U+0009)
LINE FEED (U+000A)
LINE TABULATION (U+000B)
FORM FEED (U+000C)
CARRIAGE RETURN (U+000D)
These C1 Controls and Latin-1 Supplement characters
NEXT LINE (U+0085)
NO-BREAK SPACE (U+00A0)
If you don't like the definition, roll your own along these lines (plug in your own character set):
public static bool IsNullOrCLanguageWhitespace( this string s )
{
    bool value = ( s == null || rxWS.IsMatch(s) ) ;
    return value ;
}
private static Regex rxWS = new Regex( @"^[ \t\n\v\f\r]*$") ;
You might want to add a char analog as well:
public static bool IsCLanguageWhitespace( this char c )
{
    bool value ;
    switch ( c )
    {
    case ' '  : value = true  ; break ;
    case '\t' : value = true  ; break ;
    case '\n' : value = true  ; break ;
    case '\v' : value = true  ; break ;
    case '\f' : value = true  ; break ;
    case '\r' : value = true  ; break ;
    default   : value = false ; break ;
    }
    return value ;
}

regular expression for 2 string arguments having numeric values with range constraint

I need to validate console input arguments. User can pass only 2 arguments separated by Space.
First argument should be between 1 to 100
Second argument should be between 1 to 750.
I need a regular expression to validate the input. Please help.

Description
this regex will match 1-100 space 1-750
^\b([1-9][0-9]?|100)\b\s+\b([1-9][0-9]?|[1-6][0-9]{2}|7[0-4][0-9]|750)\b$
Expanded
^ match the start of the string
\b match the word boundary
( open capture group 1
[1-9] match any single digit not including zero followed by
[0-9]? match any single digit or no digit
| or
100 match the number one hundred
) close the capture group 1
\b\s+\b require a word break, space, and word break.
( start capture group 2
[1-9] match any single digit not including zero followed by
[0-9]? match any single digit or no digit
| or
[1-6] match any digits 1 thru 6 followed by
[0-9]{2} match two of any digits
| or
7 match a seven followed by
[0-4] match digits 0 thru 4 followed by
[0-9] match any single digit
| or
750 match the number seven hundred and fifty
) close the capture group
\b$ require a word break and end of string.
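A quick way to convince yourself the pattern behaves at the boundaries is to wrap it in a helper and try the edge values. ArgumentPattern and IsValid are hypothetical names used only for this sketch; the pattern itself is the one from above, unchanged.

```csharp
using System.Text.RegularExpressions;

public static class ArgumentPattern
{
    // First number 1-100, whitespace, second number 1-750.
    private static readonly Regex TwoNumbers = new Regex(
        @"^\b([1-9][0-9]?|100)\b\s+\b([1-9][0-9]?|[1-6][0-9]{2}|7[0-4][0-9]|750)\b$");

    public static bool IsValid(string input)
    {
        return TwoNumbers.IsMatch(input);
    }
}
```

"1 1" and "100 750" should be accepted, while "0 1", "101 1", "1 0" and "1 751" should all be rejected.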

It sounds like you want a pattern like this:
^([1-9]\d?|100)\s+([1-9]\d?|[1-6]\d\d|7[0-4]\d|750)$
However, you are probably better off verifying the inputs via normal integer comparison:
int int1, int2;
if (int.TryParse(param1, out int1) && int.TryParse(param2, out int2))
{
    if (int1 >= 1 && int1 <= 100 && int2 >= 1 && int2 <= 750)
    {
        ...
    }
}

As others have said, regex isn't the best option, but if you really want to use it, this seems to work...
^(?:100|[1-9]\d?) (?:750|7[0-4]\d|[1-6]\d\d|[1-9]\d?)$

I rather recommend not using regex but something like this:
int a=0,b=0;
if(args.Length != 2){
    // not 2 arguments
}else{
    if(!int.TryParse(args[0], out a) || !int.TryParse(args[1], out b)){
        // not numbers
    }else{
        if(a < 1 || a > 100 || b < 1 || b > 750){
            // out of ranges
        }else{
            // everything fine
        }
    }
}
and you'll have your numbers right there.
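If you don't need a separate message for each failure, the same checks can be folded into one helper. This is only a sketch and the names (ArgumentParser, TryParseArguments) are invented here. Keep in mind that int.TryParse is a bit more lenient than the regex answers: it also accepts things like surrounding whitespace or a leading plus sign.

```csharp
public static class ArgumentParser
{
    // Returns true only when there are exactly two arguments, both parse
    // as integers, the first is in 1..100 and the second is in 1..750.
    public static bool TryParseArguments(string[] args, out int first, out int second)
    {
        first = 0;
        second = 0;
        return args != null
            && args.Length == 2
            && int.TryParse(args[0], out first)
            && int.TryParse(args[1], out second)
            && first >= 1 && first <= 100
            && second >= 1 && second <= 750;
    }
}
```

In Main you would call it as if (ArgumentParser.TryParseArguments(args, out a, out b)) { ... } with a and b declared beforehand as int.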
