RegEx split string into words by space and containing chars - c#

How can one perform this split with the Regex.Split(input, pattern) method?
This is a [normal string ] made up of # different types # of characters
Array of strings output:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
Also, it should keep the leading spaces; I want to preserve everything. If the string contains 20 chars, the array of strings should total 20 chars across all elements.
What I have tried:
Regex.Split(text, @"(?<=[ ]|# #)")
Regex.Split(text, @"(?<=[ ])(?<=# #")

I suggest matching, i.e. extracting words, not splitting:
string source = @"This is a [normal string ] made up of # different types # of characters";
// Three possibilities:
// - plain word [A-Za-z]+
// - # ... # quotation
// - [ ... ] quotation
string pattern = @"[A-Za-z]+|(#.*?#)|(\[.*?\])";
var words = Regex
.Matches(source, pattern)
.OfType<Match>()
.Select(match => match.Value)
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, words
.Select((w, i) => $"{i + 1}. {w}")));
Outcome:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
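Since the question also asks that no characters be lost, here is a minimal sketch of a matching variant that keeps the leading whitespace attached to each token. The pattern is my own adaptation, not the one from the answer above, and it assumes the input does not end with whitespace (trailing whitespace would not be captured).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string source = "This is a [normal string ] made up of # different types # of characters";

// Leading whitespace stays with the token that follows it, so nothing is dropped.
// Alternatives, tried in order: a [ ... ] group, a # ... # group, a run of non-whitespace.
string pattern = @"\s*(?:\[[^\[\]]*\]|#[^#]*#|\S+)";

string[] parts = Regex.Matches(source, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

foreach (var part in parts)
    Console.WriteLine($"'{part}'");

// Sanity check: the pieces concatenate back to the original string.
bool lossless = string.Concat(parts) == source;
Console.WriteLine(lossless);
```

For the sample input this yields 10 pieces, each (except the first) starting with the space that preceded it.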

You may use
var res = Regex.Split(s, @"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
See the regex demo
The (\[[^][]*]|#[^#]*#) part is a capturing group whose value is output to the resulting list along with the split items.
Pattern details
(\[[^][]*]|#[^#]*#) - Group 1: either of the two patterns:
\[[^][]*] - [, followed with 0+ chars other than [ and ] and then ]
#[^#]*# - a #, then 0+ chars other than # and then #
| - or
\s+ - 1+ whitespaces
C# demo:
var s = "This is a [normal string ] made up of # different types # of characters";
var results = Regex.Split(s, @"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
Console.WriteLine(string.Join("\n", results));
Result:
This
is
a
[normal string ]
made
up
of
# different types #
of
characters

It would be easier with the matching approach; however, it can also be done with negative lookaheads:
[ ](?![^\]\[]*\])(?![^#]*\#([^#]*\#{2})*[^#]*$)
matches a space not followed by
any character sequence except [ or ] followed by ]
# followed by an even number of #

Related

What would be the regex for this value? C#

I'm trying to get the regex for this value:
<5fond[3.4,550],[5.4,6.4,7.4, 8.4, 32.4],[ 9.4, 239.8662]
The numbers (minus the second one which appears to just be an integer) can be any decimal value.
I have tried the following but it doesn't seem to work.
private static readonly Regex RegexExp = new Regex(@"<5fond\[[0-9]*\.[0-9]+,[0-9]*\.[0-9]+],\[[0-9]*\.[0-9]+,[0-9]*\.[0-9]+,[0-9]*\.[0-9]+,[0-9]*\.[0-9]+\],\[[0-9]*\.[0-9]+,[0-9]*\.[0-9]+\]", RegexOptions.IgnorePatternWhitespace);
Any idea what I might be doing wrong?
You can use
<5fond\[\s*\d*\.?\d+(?:,\s*\d*\.?\d+)*\s*](?:,\s*\[\s*\d*\.?\d+(?:,\s*\d*\.?\d+)*\s*])*
See the regex demo. Details:
<5fond\[ - a <5fond[ string
\s* - zero or more whitespaces
\d*\.?\d+ - an int/float number pattern
(?:,\s*\d*\.?\d+)* - zero or more sequences of a comma, zero or more whitespaces, an int/float number
\s* - zero or more whitespaces
] - a ] char
(?:,\s*\[\s*\d*\.?\d+(?:,\s*\d*\.?\d+)*\s*])* - zero or more occurrences of
, - a comma
\s*\[\s* - a [ char enclosed with zero or more whitespaces
\d*\.?\d+(?:,\s*\d*\.?\d+)* - an int/float and then zero or more occurrences of a comma, zero or more whitespaces, an int/float number
\s* - zero or more whitespaces
] - a ] char.
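A minimal sketch of the pattern above applied to the value from the question. The second step, which pulls the numbers out of each bracket group as strings, is my own addition for illustration and is not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "<5fond[3.4,550],[5.4,6.4,7.4, 8.4, 32.4],[ 9.4, 239.8662]";
string pattern = @"<5fond\[\s*\d*\.?\d+(?:,\s*\d*\.?\d+)*\s*](?:,\s*\[\s*\d*\.?\d+(?:,\s*\d*\.?\d+)*\s*])*";

Match m = Regex.Match(input, pattern);
bool coversWholeInput = m.Success && m.Value == input;
Console.WriteLine(coversWholeInput);

// Illustrative follow-up: split each [ ... ] group of the matched text into trimmed number strings.
string[][] groups = Regex.Matches(m.Value, @"\[([^\]]*)\]")
    .Cast<Match>()
    .Select(g => g.Groups[1].Value.Split(',').Select(n => n.Trim()).ToArray())
    .ToArray();

foreach (var g in groups)
    Console.WriteLine(string.Join(" | ", g));
```

The numbers are kept as strings on purpose; parsing them with double.Parse would additionally need CultureInfo.InvariantCulture to treat the dot as the decimal separator regardless of locale.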

Extracting integer ranges separated with hyphen

I try to filter some strings I streamed for some useful information in C#.
I got two possible string structures:
string examplestring1 = "from - to (mm) no. 1\r\n\r\nna 570 - 590\r\n60 18.12.20\r\nna 5390 - 5410\r\n60 18.12.20\r\nna 11380 - 11390 60 18.12.20\r\nPage 1/1";
string examplestring2 = "e ne 570 - 590 ne 5390 - 5410 ne 11380 - 11390 e";
I'd like to get an array or a List of strings in the format of "xxx - xxx". Like:
string[] example = new string[]{"570 - 590","5390 - 5410","11380 - 11390"};
I tried to use Regex:
List<string> numbers = new List<string>();
numbers.AddRange(Regex.Split(examplestring2, @"\D+"));
At least I get a list containing only the numbers. But that doesn't work for examplestring1, since there is a date in it.
I also tried to play around with the Regex pattern, but things like the following do not work.
Regex.Split(examplestring1, @"\D+" + " - " + @"\D+");
I'd be grateful for a solution or at least some hint how to solve that matter.
You can use
var results = Regex.Matches(text, @"\d+\s*-\s*\d+").Cast<Match>().Select(x => x.Value);
See the regex demo. If there must be a single regular space on both ends of the -, you can use \d+ - \d+ regex.
If you want to match any -, you can use [\p{Pd}\xAD] instead of -.
Note that \d in .NET matches any Unicode digit; to match only ASCII digits, use the RegexOptions.ECMAScript option: Regex.Matches(text, @"\d+\s*-\s*\d+", RegexOptions.ECMAScript).
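For reference, a small self-contained sketch that runs the suggested pattern over both example strings from the question. The helper name ExtractRanges and the normalization of the spacing around the hyphen are my own additions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string examplestring1 = "from - to (mm) no. 1\r\n\r\nna 570 - 590\r\n60 18.12.20\r\nna 5390 - 5410\r\n60 18.12.20\r\nna 11380 - 11390 60 18.12.20\r\nPage 1/1";
string examplestring2 = "e ne 570 - 590 ne 5390 - 5410 ne 11380 - 11390 e";

// Hypothetical helper: find "digits - digits" and normalize the spacing to "xxx - xxx".
List<string> ExtractRanges(string text) =>
    Regex.Matches(text, @"\d+\s*-\s*\d+", RegexOptions.ECMAScript)
         .Cast<Match>()
         .Select(m => Regex.Replace(m.Value, @"\s*-\s*", " - "))
         .ToList();

List<string> ranges1 = ExtractRanges(examplestring1);
List<string> ranges2 = ExtractRanges(examplestring2);

Console.WriteLine(string.Join(", ", ranges1));
Console.WriteLine(string.Join(", ", ranges2));
```

The dates such as 18.12.20 are skipped because they contain no hyphen between two digit runs.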

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use it in the replacement, else add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non capture group to match as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
"10ANYUNIT",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
Regex.Replace(
s, pattern, m =>
m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
)
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
As \d in .NET can also match digits from other scripts, \s can also match a newline, and the start of the pattern might be a partial match, a slightly more precise pattern could be:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
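One detail to watch with the callback approach: in an input such as 40 e-5, the digits after the exponent sign form a second number match with no exception char after it, so the callback appends a space there and the result ends with a trailing space. A possible variant, sketched below under the assumption that e and E only matter as exponent markers, captures the whole exponent in group 2. The pattern and the helper name Normalize are mine, not from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] inputs = { "10ANYUNIT", "10:something", "10 : something", "10 %", "40 e-5", "40 E-05" };

// Group 1: the number. Group 2 (optional): a whole exponent such as e-5, or % or :.
// Whitespace right after the number is consumed in either case.
string pattern = @"(\d+(?:\.\d+)?)\s*([eE][-+]?\d+|[%:])?";

// Hypothetical helper: glue the exception onto the number, otherwise add a single space.
string Normalize(string s) =>
    Regex.Replace(s, pattern, m =>
        m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " "));

string[] outputs = inputs.Select(Normalize).ToArray();

foreach (var o in outputs)
    Console.WriteLine($"'{o}'");
```

A number at the very end of the input still receives a trailing space with this sketch, so a final Trim() may be wanted depending on the data.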
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 or 1241.23
Group 3 - ((E|e|%|:)+)
Any of the special symbols E, e, % or :
Group 1 and Group 3 can be separated by any number of whitespaces.
If it's not working as you expect, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
// First look for exponential notation.
// Pattern is: match zero or more whitespaces \s*
// Then match one or more digits and store it in first capturing group (\d+)
// Then match one or more whitespaces again.
// Then match part with exponent ([eE][-+]?\d+) and store it in second capturing group.
// It will match lower or uppercase 'e' with optional (due to ? operator) dash/plus sign and one or more digits.
// Then match zero or more white spaces.
var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
if(expForMatch.Success)
{
return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
}
var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
if (matchWithColon.Success)
{
return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
}
var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
if (matchWithPercent.Success)
{
return $"{matchWithPercent.Groups[1].Value}%";
}
var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
if (matchWithUnit.Success)
{
return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
}
return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

How to capture groups

In C# with the .NET regex engine, I have an input line like this, terminated by \n
1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR
I have to obtain
First capture
1. num=1
2. surname=ROSS
3. name=SVETA
4. name=JAMIE
5. title=MRS
6. other=T02XT
Second capture
1. num=2
2. surname=WHITE
3. name=VIKA
4. title=MS
Third capture
1. num=3
2. surname=GREEN
3. name=ANDY
4. title=MR
The first group has two names and there is no space within ANDY and MR in the third group. I am unable to solve this problem. I started using
(^\d|\s\d)
to detect the groups and it works, but after that I do not know how to capture up to the end of each group and split the inner data into subgroups.
If the title values are set to MR, MRS or MS, you may use
\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)
See the regex demo
Details
\b - word boundary
(?<num>\d) - Group "num": a digit (replace with \d+ if there can be more than 1)
(?<surname>\p{L}+) - Group "surname": 1+ letters
(?:/(?<name>\p{L}+?))+ - one or more sequences of / followed with Group "name": 1+ letters, as few as possible
(?:\s*(?<title>M(?:RS?|S)))? - an optional sequence of
\s* - 0+ whitespaces
(?<title>M(?:RS?|S)) - Group "title": M followed with R and optional S or followed with S
\b - word boundary
\s* - 0+ whitespaces
(?<other>.*?) - Group "other": 0 or more chars, as few as possible
(?=\b\d\p{L}+/\p{L}|$) - up to the first occurrence of the initial pattern (word boundary, digit, 1+ letters, / and a letter) or end of string.
C# demo:
var text = "1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR";
var pattern = @"\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)";
var result = Regex.Matches(text, pattern);
foreach (Match m in result) {
Console.WriteLine("Num: {0}", m.Groups["num"].Value);
Console.WriteLine("Surname: {0}", m.Groups["surname"].Value);
Console.WriteLine("Names: {0}", string.Join(", ", m.Groups["name"].Captures.Cast<Capture>().Select(x => x.Value)));
Console.WriteLine("Title: {0}", m.Groups["title"].Value);
Console.WriteLine("Other: {0}", m.Groups["other"].Value);
Console.WriteLine("===== NEXT MATCH ======");
}
Output:
Num: 1
Surname: ROSS
Names: SVETA, JAMIE
Title: MRS
Other: T02XT
===== NEXT MATCH ======
Num: 2
Surname: WHITE
Names: VIKA
Title: MS
Other:
===== NEXT MATCH ======
Num: 3
Surname: GREEN
Names: ANDY
Title: MR
Other:
===== NEXT MATCH ======

Regex with balancing groups

I need to write a regex that captures the generic arguments (which can themselves be generic) of a type name in a special notation like this:
System.Action[Int32,Dictionary[Int32,Int32],Int32]
Let's assume the type name is [\w.]+ and a parameter is [\w.,\[\]]+
So I need to grab only Int32, Dictionary[Int32,Int32] and Int32.
Basically I need to capture something when the balancing group stack is empty, but I don't really understand how.
UPD
The answer below helped me solve the problem fast (but without proper validation and with depth limitation = 1), but I've managed to do it with group balancing:
^[\w.]+ #Type name
\[(?<delim>) #Opening bracket and first delimiter
[\w.]+ #Minimal content
(
[\w.]+
((?(open)|(?<param-delim>)),(?(open)|(?<delim>)))* #Cutting param if balanced before comma and placing delimiter
((?<open>\[))* #Counting [
((?<-open>\]))* #Counting ]
)*
(?(open)|(?<param-delim>))\] #Cutting last param if balanced
(?(open)(?!) #Checking balance
)$
Demo
UPD2 (Last optimization)
^[\w.]+
\[(?<delim>)
[\w.]+
(?:
(?:(?(open)|(?<param-delim>)),(?(open)|(?<delim>))[\w.]+)?
(?:(?<open>\[)[\w.]+)?
(?:(?<-open>\]))*
)*
(?(open)|(?<param-delim>))\]
(?(open)(?!)
)$
I suggest capturing those values using
\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*
See the regex demo.
Details:
\w+(?:\.\w+)* - match 1+ word chars followed with 0 or more sequences of a . and 1+ word chars
\[ - a literal [
(?:,?(?<res>\w+(?:\[[^][]*])?))* - 0 or more sequences of:
,? - an optional comma
(?<res>\w+(?:\[[^][]*])?) - Group "res" capturing:
\w+ - one or more word chars (perhaps, you would like [\w.]+)
(?:\[[^][]*])? - 1 or 0 (change ? to * to match 0 or more) sequences of a [, 0+ chars other than [ and ], and a closing ].
A C# demo below:
var line = "System.Action[Int32,Dictionary[Int32,Int32],Int32]";
var pattern = @"\w+(?:\.\w+)*\[(?:,?(?<res>\w+(?:\[[^][]*])?))*";
var result = Regex.Matches(line, pattern)
.Cast<Match>()
.SelectMany(x => x.Groups["res"].Captures.Cast<Capture>()
.Select(t => t.Value))
.ToList();
foreach (var s in result) // DEMO
Console.WriteLine(s);
UPDATE: To account for unknown depth [...] substrings, use
\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^][]+|(?<o>\[)|(?<-o>]))*(?(o)(?!))])?))*
See the regex demo
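A minimal sketch of the unknown-depth pattern in use. The brackets inside the pattern are escaped here for readability, the helper name TopLevelArgs is mine, and the second input is a made-up type name used only to exercise deeper nesting.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// (?<o>\[) pushes onto the "o" stack, (?<-o>\]) pops it, and (?(o)(?!)) fails
// if an opening bracket is still unmatched when the argument's closing ] is reached.
string pattern = @"\w+(?:\.\w+)*\[(?:\s*,?\s*(?<res>\w+(?:\[(?>[^\[\]]+|(?<o>\[)|(?<-o>\]))*(?(o)(?!))\])?))*";

// Hypothetical helper: every capture of the "res" group is one top-level argument.
string[] TopLevelArgs(string typeName) =>
    Regex.Match(typeName, pattern)
         .Groups["res"].Captures
         .Cast<Capture>()
         .Select(c => c.Value)
         .ToArray();

string[] flat = TopLevelArgs("System.Action[Int32,Dictionary[Int32,Int32],Int32]");
string[] deep = TopLevelArgs("Outer[Inner[Leaf[A,B],C],D]"); // hypothetical deeper input

Console.WriteLine(string.Join(" | ", flat));
Console.WriteLine(string.Join(" | ", deep));
```

Because the group sits inside a repeated construct, the individual arguments come from Group.Captures rather than Group.Value, which would only hold the last one.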
