Seperating numbers, punctuation and letters trought the whole case - c#

What I'm trying to achieve:
Split the string into separate parts of numbers, punctuation(except the . and , these should be found in the number part), and letters.
Example:
Case 1:
C_TANTALB
Result:
Alpha[]: C, TANTALB
Beta[]:
Zeta[]: _
Case 2:
BGA-100_T0.8
Result:
Alpha[]: BGA, T
Beta[]: 100, 0.8
Zeta[]: -, _
Case 3: C0201
Result:
Alpha[]: C
Beta[]: 0201
Zeta[]:
I've found this post but it doesn't the entire job for me as it fails on example 1 not returning even the alpha part. And it doesn't find the punctuation.
Any help would be appricated.

If iterating the string an test with IsDigit and IsLetter a bit to complexe,
You can use Regex for this : (?<Alfas>[a-zA-Z]+)|(?<Digits>\d+)|(?<Others>[^a-zA-Z\d])
1/. Named Capture Group Alfas (?<Alfas>[a-zA-Z]+)
Match a single character present in the list below [a-zA-Z]+
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2/. Named Capture Group Digits (?<Digits>[\d,.]+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
3/. Named Capture Group Others (?<Others>[^a-zA-Z\d]+)
Match a single character not present in the list below [^a-zA-Z\d]
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Then to get one goup values:
var matches = Regex.Matches(testInput, pattern).Cast<Match>();
var alfas = matches.Where(x => !string.IsNullOrEmpty(x.Groups["Alfas"].Value))
.Select(x=> x.Value)
.ToList();
LiveDemo

Probably the simplest way to do this is with 3 separate regular expressions; one for each class of characters.
[A-Za-z]+ for letter sequences
[\d.,]+ for numbers
[-_]+ for punctuation (incomplete for now; please feel free to extend the list)
Example:
using System;
using System.Linq;
using System.Text.RegularExpressions;
class MainClass
{
private static readonly Regex _regexAlpha = new Regex(#"[A-Za-z]+");
private static readonly Regex _regexBeta = new Regex(#"[\d.,]+");
private static readonly Regex _regexZeta = new Regex(#"[-_]+");
public static void Main (string[] args)
{
Console.Write("Input: ");
string input = Console.ReadLine();
var resultAlpha = _regexAlpha.Matches(input).Select(m => m.Value);
var resultBeta = _regexBeta.Matches(input).Select(m => m.Value);
var resultZeta = _regexZeta.Matches(input).Select(m => m.Value);
Console.WriteLine($"Alpha: {string.Join(", ", resultAlpha)}");
Console.WriteLine($"Beta: {string.Join(", ", resultBeta)}");
Console.WriteLine($"Zeta: {string.Join(", ", resultZeta)}");
}
}
Sample output:
Input: ABC_3.14m--nop
Alpha: ABC, m, nop
Beta: 3.14
Zeta: _, --
Live demo: https://repl.it/repls/LopsidedUsefulBucket

Related

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+", #"$1");
dosage_value = Regex.Replace(dosage_value, #"(\d)%\s+", #"$1%");
dosage_value = Regex.Replace(dosage_value, #"(\d+(\.\d+)?)", #"$1 ");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+%", #"$1% ");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+:", #"$1:");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+e", #"$1e");
dosage_value = Regex.Replace(dosage_value, #"(\d)\s+E", #"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use is in the replacement, else add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non capture group to match a as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
"10ANYUNIT",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
string pattern = #"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
Regex.Replace(
s, pattern, m =>
m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
)
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
As in .NET \d can also match digits from other languages, \s can also match a newline and the start of the pattern might be a partial match, a bit more precise match can be:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, #"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", #"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 2 - ((E|e|%|:)+)
Any of special symbols like E e % :
Group 1 and Group 2 could be separated with any number of whitespaces.
If it's not working as you asking, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
// First look for exponential notation.
// Pattern is: match zero or more whitespaces \s*
// Then match one or more digits and store it in first capturing group (\d+)
// Then match one ore more whitespaces again.
// Then match part with exponent ([eE][-+]?\d+) and store it in second capturing group.
// It will match lower or uppercase 'e' with optional (due to ? operator) dash/plus sign and one ore more digits.
// Then match zero or more white spaces.
var expForMatch = Regex.Match(input, #"\s*(\d+)\s+([eE][-+]?\d+)\s*");
if(expForMatch.Success)
{
return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
}
var matchWithColon = Regex.Match(input, #"\s*(\d+)\s*:\s*(\w+)");
if (matchWithColon.Success)
{
return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
}
var matchWithPercent = Regex.Match(input, #"\s*(\d+)\s*%");
if (matchWithPercent.Success)
{
return $"{matchWithPercent.Groups[1].Value}%";
}
var matchWithUnit = Regex.Match(input, #"\s*(\d+)\s*(\w+)");
if (matchWithUnit.Success)
{
return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
}
return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

How should be the regex for alphanumeric where alphabet only with combinations of"w" or "d" or "h"

I need the regex to validate the user entered string, the string may be of following formats.
ex: 1w 2d 1h or 2w 1h or 1w 2d. Likewise combinations of numbers and w,d and h.
I am looking for the regex to allow the combinations number and w or d or h.
Is it possible to have a regex like that way?
You could say you want
\d+ any number, 1 or more times
[wdh] one of w, d, h
(?: |$) space or end of string
Then put this in a group, loop it 1 or more times
(?: start of non-capture group
)+ end of non-capture group, 1 or more times
^ and $ start and end of string respectively
Result
^(?:\d+[wdh](?: |$))+$
Edit: I see you've added more requirements in the comments of your question, this regex will not fulfil your most recent requirements
We can try writing a rudimentary parser to check the input string.
string input = "1w 2d 1h";
string[] parts = Regex.Split(input, #"\s+");
bool success = true;
if (parts.Length > 3)
{
success = false;
}
else
{
Regex regex = new Regex(#"\d+(?:w|d|h)");
foreach (string part in parts)
{
Match match = regex.Match(part);
if (!match.Success)
{
success = false;
}
}
}
if (success)
{
Console.WriteLine("MATCH");
}
else
{
Console.WriteLine("NO MATCH");
}
This answer might carry its own weight if, in your C# application, you also needed to extract the numerical values of each component.
Try ^(?:\d+w ?)?(?:\d+d ?)?(?:\d+h)?$
Explaantion:
^ - match beginning of a string
(?:...) - non-capturing group
\d+ - match one or more digits
w, d, h - match literally w, d, h respectively
? - match preceeding pattern zero or one time
Demo
Finally, the following string seems to be working fine for me...
^(?:\d+w)?((?:\d+d)|(?: \d+d))?((?:\d+h)|(?: \d+h))?$
Thanks, everyone for helping.
This works for your requirement. Hope this helps! Ordering (w->d->h) guaranteed.
^(\d+w){0,1}\s*(\d+d){0,1}\s*(\d+h){0,1}$
Test cases:
10w 12d 11h
1w 2d
2w 1h
1d 1h
1d 1h
3w 2h

Replace floating numbers in math equation with letter variables

I want to replace all the floating numbers from a mathematical expression with letters using regular expressions. This is what I've tried:
Regex rx = new Regex("[-]?([0-9]*[.])?[0-9]+");
string expression = "((-30+5.2)*(2+7))-((-3.1*2.5)-9.12)";
char letter = 'a';
while (rx.IsMatch(expression))
{
expression = rx.Replace(expression , letter.ToString(), 1);
letter++;
}
The problem is that if I have for example (5-2)+3 it will replace it to: (ab)+c
So it gets the -2 as a number but I don't want that.
I am not experienced with Regex but I think I need something like this:
Check for '-', if there is a one, check if there is a number or right parenthesis before it. If there is NOT then save the '-'.
After that check for digits + dot + digits
My above Regex also works with values like: .2 .3 .4 but I don't need that, it should be explicit: 0.2 0.3 0.4
Following the suggested logic, you may consider
(?:(?<![)0-9])-)?[0-9]+(?:\.[0-9]+)?
See the regex demo.
Regex details
(?:(?<![)0-9])-)? - an optional non-capturing group matching 1 or 0 occurrences of
(?<![)0-9]) - a place in string that is not immediately preceded with a ) or digit
- - a minus
[0-9]+ - 1+ digits
(?:\.[0-9]+)? - an optional non-capturing group matching 1 or 0 occurrences of a . followed with 1+ digits.
In code, it is better to use a match evaluator (see the C# demo online):
Regex rx = new Regex(#"(?:(?<![)0-9])-)?[0-9]+(?:\.[0-9]+)?");
string expression = "((-30+5.2)*(2+7))-((-3.1*2.5)-9.12)";
char letter = (char)96; // char before a in ASCII table
string result = rx.Replace(expression, m =>
{
letter++; // char is incremented
return letter.ToString();
}
);
Console.WriteLine(result); // => ((a+b)*(c+d))-((e*f)-g)

Regex match 2 out of 4 groups

I want a single Regex expression to match 2 groups of lowercase, uppercase, numbers or special characters. Length needs to also be grater than 7.
I currently have this expression
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$
It, however, forces the string to have lowercase and uppercase and digit or special character.
I currently have this implemented using 4 different regex expressions that I interrogate with some C# code.
I plan to reuse the same expression in JavaScript.
This is sample console app that shows the difference between 2 approaches.
class Program
{
private static readonly Regex[] Regexs = new[] {
new Regex("[a-z]", RegexOptions.Compiled), //Lowercase Letter
new Regex("[A-Z]", RegexOptions.Compiled), // Uppercase Letter
new Regex(#"\d", RegexOptions.Compiled), // Numeric
new Regex(#"[^a-zA-Z\d\s:]", RegexOptions.Compiled) // Non AlphaNumeric
};
static void Main(string[] args)
{
Regex expression = new Regex(#"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$", RegexOptions.ECMAScript & RegexOptions.Compiled);
string[] testCases = new[] { "P#ssword", "Password", "P2ssword", "xpo123", "xpo123!", "xpo123!123##", "Myxpo123!123##", "Something_Really_Complex123!#43#2*333" };
Console.WriteLine("{0}\t{1}\t", "Single", "C# Hack");
Console.WriteLine("");
foreach (var testCase in testCases)
{
Console.WriteLine("{0}\t{2}\t : {1}", expression.IsMatch(testCase), testCase,
(testCase.Length >= 8 && Regexs.Count(x => x.IsMatch(testCase)) >= 2));
}
Console.ReadKey();
}
}
Result Proper Test String
------- ------- ------------
True True : P#ssword
False True : Password
True True : P2ssword
False False : xpo123
False False : xpo123!
False True : xpo123!123##
True True : Myxpo123!123##
True True : Something_Really_Complex123!#43#2*333
For javascript you can use this pattern that looks for boundaries between different character classes:
^(?=.*(?:.\b.|(?i)(?:[a-z]\d|\d[a-z])|[a-z][A-Z]|[A-Z][a-z]))[^:\s]{8,}$
if a boundary is found, you are sure to have two different classes.
pattern details:
\b # is a zero width assertion, it's a boundary between a member of
# the \w class and an other character that is not from this class.
.\b. # represents the two characters with the word boundary.
boundary between a letter and a number:
(?i) # make the subpattern case insensitive
(?:
[a-z]\d # a letter and a digit
| # OR
\d[a-z] # a digit and a letter
)
boundary between an uppercase and a lowercase letter:
[a-z][A-Z] | [A-Z][a-z]
since all alternations contains at least two characters from two different character classes, you are sure to obtain the result you hope.
You could use possessive quantifiers (emulated using atomic groups), something like this:
((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}
Since using possessive matching will prevent backtracking, you won't run into the two groups being two consecutive groups of lowercase letters, for instance. So the full regex would be something like:
^(?=.*((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}).{8,}$
Though, were it me, I'd cut the lookahead, just use the expression ((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}, and check the length separately.

What RegEx string will find the last (rightmost) group of digits in a string?

Looking for a regex string that will let me find the rightmost (if any) group of digits embedded in a string. We only care about contiguous digits. We don't care about sign, commas, decimals, etc. Those, if found should simply be treated as non-digits just like a letter.
This is for replacement/incrementing purposes so we also need to grab everything before and after the detected number so we can reconstruct the string after incrementing the value so we need a tokenized regex.
Here's examples of what we are looking for:
"abc123def456ghi" should identify the'456'
"abc123def456ghi789jkl" should identify the'789'
"abc123def" should identify the'123'
"123ghi" should identify the'123'
"abc123,456ghi" should identify the'456'
"abc-654def" should identify the'654'
"abcdef" shouldn't return any match
As an example of what we want, it would be something like starting with the name 'Item 4-1a', extracting out the '1' with everything before being the prefix and everything after being the suffix. Then using that, we can generate the values 'Item 4-2a', 'Item 4-3a' and 'Item 4-4a' in a code loop.
Now If I were looking for the first set, this would be easy. I'd just find the first contiguous block of 0 or more non-digits for the prefix, then the block of 1 or more contiguous digits for the number, then everything else to the end would be the suffix.
The issue I'm having is how to define the prefix as including all (if any) numbers except the last set. Everything I try for the prefix keeps swallowing that last set, even when I've tried anchoring it to the end by basically reversing the above.
How about:
^(.*?)(\d+)(\D*)$
then increment the second group and concat all 3.
Explanation:
^ : Begining of string
( : start of 1st capture group
.*? : any number of any char not greedy
) : end group
( : start of 2nd capture group
\d+ : one or more digits
) : end group
( : start of 3rd capture group
\D* : any number of non digit char
) : end group
$ : end of string
The first capture group will match all characters until the first digit of last group of digits before the end of the string.
or if you can use named group
^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$
Try next regex:
(\d+)(?!.*\d)
Explanation:
(\d+) # One or more digits.
(?!.*\d) # (zero-width) Negative look-ahead: Don't find any characters followed with a digit.
EDIT (OFFTOPIC of the question):: This answer is incorrect but this question has already been answered in other posts so to avoid delete this one I will use this same regex other way, for example in Perl could be used like this to get same result as in C# (increment last digit):
s/(\d+)(?!.*\d)/$1 + 1/e;
You can also try little bit simpler version:
(\d+)[^\d]*$
This should do it:
Regex regexObj = new Regex(#"
# Grab last set of digits, prefix and suffix.
^ # Anchor to start of string.
(.*) # $1: Stuff before last set of digits.
(?<!\d) # Anchor start of last set of digits.
(\d+) # $2: Last set of one or more digits.
(\D*) # $3: Zero or more trailing non digits.
$ # Anchor to end of string.
", RegexOptions.IgnorePatternWhitespace);
What about not using Regex. Here's code snippet (for console)
string[] myStringArray = new string[] { "abc123def456ghi", "abc123def456ghi789jkl", "abc123def", "123ghi", "abcdef","abc-654def" };
char[] numberSet = new char[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
char[] filterSet = new char[] {'a','b','c','d','e','f','g','h','i','j','k','l','m',
'n','o','p','q','r','s','t','u','v','w','x','y','z','-'};
foreach (string myString in myStringArray)
{
Console.WriteLine("your string - {0}",myString);
int index1 = myString.LastIndexOfAny(numberSet);
if (index1 == -1)
Console.WriteLine("no number");
else
{
string mySubString = myString.Substring(0,index1 + 1);
string prefix = myString.Substring(index1 + 1);
Console.WriteLine("prefix - {0}", prefix);
int index2 = mySubString.LastIndexOfAny(filterSet);
string suffix = myString.Substring(0, index2 + 1);
Console.WriteLine("suffix - {0}",suffix);
mySubString = mySubString.Substring(index2 + 1);
Console.WriteLine("number - {0}",mySubString);
Console.WriteLine("_________________");
}
}
Console.Read();

Categories

Resources