Removing whitespace between consecutive numbers - c#

I have a string, from which I want to remove the whitespaces between the numbers:
string test = "Some Words 1 2 3 4";
string result = Regex.Replace(test, @"(\d)\s(\d)", @"$1$2");
the expected/desired result would be:
"Some Words 1234"
but I get the following:
"Some Words 12 34"
What am I doing wrong here?
Further examples:
Input: "Some Words That Should not be replaced 12 9 123 4 12"
Output: "Some Words That Should not be replaced 129123412"
Input: "test 9 8"
Output: "test 98"
Input: "t e s t 9 8"
Output: "t e s t 98"
Input: "Another 12 000"
Output: "Another 12000"

Regex.Replace continues to search after the previous match:
Some Words 1 2 3 4
           ^^^
           first match, replace by "12"
Some Words 12 3 4
             ^
             +-- continue searching here
Some Words 12 3 4
              ^^^
              next match, replace by "34"
You can use a zero-width positive lookahead assertion to avoid that:
string result = Regex.Replace(test, @"(\d)\s(?=\d)", @"$1");
Now the final digit is not part of the match:
Some Words 1 2 3 4
           ^^?
           first match, replace by "1"
Some Words 12 3 4
            ^
            +-- continue searching here
Some Words 12 3 4
            ^^?
            next match, replace by "2"
...
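The difference between the two patterns can be checked with a small sketch like the one below. The helper names ReplaceConsuming and ReplaceWithLookahead are made up for illustration; the patterns are the ones from the question and from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the question: the right-hand digit is consumed by the match.
string ReplaceConsuming(string input) =>
    Regex.Replace(input, @"(\d)\s(\d)", "$1$2");

// Pattern from this answer: the right-hand digit is only asserted by the lookahead,
// so it can still act as the left-hand digit of the next match.
string ReplaceWithLookahead(string input) =>
    Regex.Replace(input, @"(\d)\s(?=\d)", "$1");

string test = "Some Words 1 2 3 4";

// Print where each pattern matches in the original string.
foreach (Match m in Regex.Matches(test, @"(\d)\s(\d)"))
    Console.WriteLine($"consuming: index {m.Index}, value \"{m.Value}\"");
foreach (Match m in Regex.Matches(test, @"(\d)\s(?=\d)"))
    Console.WriteLine($"lookahead: index {m.Index}, value \"{m.Value}\"");

Console.WriteLine(ReplaceConsuming(test));     // Some Words 12 34
Console.WriteLine(ReplaceWithLookahead(test)); // Some Words 1234
```

The consuming pattern finds two matches (at indexes 11 and 15), while the lookahead version finds three (at indexes 11, 13 and 15), which is why only the latter removes every space.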

Your regex consumes the digit on the right. (\d)\s(\d) matches and captures 1 in Some Words 1 2 3 4 into Group 1, then matches one whitespace character, and then matches and consumes 2 (i.e. adds it to the match value and advances the regex index). The regex engine then looks for the next match starting from the current index, which is already past 1 2. So the regex does not match 2 3, but finds 3 4.
Use lookarounds instead that are non-consuming:
(?<=\d)\s+(?=\d)
See the regex demo
Details
(?<=\d) - a positive lookbehind that matches a location in string immediately preceded with a digit
\s+ - 1+ whitespaces
(?=\d) - a positive lookahead that matches a location in string immediately followed with a digit.
C# demo:
string test = "Some Words 1 2 3 4";
string result = Regex.Replace(test, @"(?<=\d)\s+(?=\d)", "");
See the online demo:
var strs = new List<string> {"Some Words 1 2 3 4", "Some Words That Should not be replaced 12 9 123 4 12", "test 9 8", "t e s t 9 8", "Another 12 000" };
foreach (var test in strs)
{
    Console.WriteLine(Regex.Replace(test, @"(?<=\d)\s+(?=\d)", ""));
}
Output:
Some Words 1234
Some Words That Should not be replaced 129123412
test 98
t e s t 98
Another 12000

Related

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use it in the replacement, else add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non-capturing group, matched as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
    "10ANYUNIT",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
    Regex.Replace(
        s, pattern, m =>
            m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
    )
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
In .NET, \d can also match digits from other scripts, \s can also match a newline, and without an anchor the pattern might start matching in the middle of a word. A slightly more precise pattern is:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
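A minimal sketch of how the stricter pattern could be plugged into the same match evaluator. The helper name AddUnitSpace and the TrimEnd call are assumptions added for illustration, not part of the answer: TrimEnd is there because a number at the very end of the input (such as the 5 in 40e-5) has no unit after it and would otherwise also get a space appended.

```csharp
using System;
using System.Text.RegularExpressions;

// Stricter pattern from the answer: ASCII digits only, horizontal whitespace only,
// and a leading word boundary so that matching cannot start inside a word.
string strictPattern = @"\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?";

// Hypothetical helper: keeps the number, then either the captured special
// character (group 2) or a single space when no special character follows.
string AddUnitSpace(string input) =>
    Regex.Replace(
        input, strictPattern,
        m => m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
    ).TrimEnd(); // assumption: a trailing space after a final number is unwanted

foreach (var s in new[] { "10ANYUNIT", "10 : something", "10 %", "40 e-5", "12.5mg", "H2O" })
    Console.WriteLine($"'{s}' -> '{AddUnitSpace(s)}'");
```

With the word boundary in place, an input such as H2O is left unchanged, because the 2 is not at the start of a word.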
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 3 - ((E|e|%|:)+)
One or more of the special symbols E, e, % or :
Group 1 and Group 3 may be separated by any amount of whitespace.
If it does not work as you expect, please provide some samples to test.
For me, this is too complex to be handled by just one regex. I suggest splitting it into separate checks. See the code example below: I used four different regexes; the first is described in detail, and the rest can be deduced from that explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
    // First look for exponential notation.
    // Pattern is: match zero or more whitespaces \s*
    // Then match one or more digits and store them in the first capturing group (\d+)
    // Then match one or more whitespaces.
    // Then match the exponent part ([eE][-+]?\d+) and store it in the second capturing group.
    // It matches a lower- or uppercase 'e' with an optional (due to the ? operator) dash/plus sign and one or more digits.
    // Then match zero or more whitespaces.
    var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
    if (expForMatch.Success)
    {
        return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
    }
    var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
    if (matchWithColon.Success)
    {
        return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
    }
    var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
    if (matchWithPercent.Success)
    {
        return $"{matchWithPercent.Groups[1].Value}%";
    }
    var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
    if (matchWithUnit.Success)
    {
        return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
    }
    return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

How to capture groups

In C# with the .NET regex engine, I have an input line like this, terminated by \n:
1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR
I have to obtain
First capture
1. num=1
2. surname=ROSS
3. name=SVETA
4. name=JAMIE
5. title=MRS
6. other=T02XT
Second capture
1. num=2
2. surname=WHITE
3. name=VIKA
4. title=MS
Third capture
1. num=3
2. surname=GREEN
3. name=ANDY
4. title=MR
The first group has two names and there is no space between ANDY and MR in the third group. I am unable to solve this problem. I started using
(^\d|\s\d)
to detect the groups and it works, but I do not know how to capture up to the end of each group and split the data inside into subgroups.
If the title values are set to MR, MRS or MS, you may use
\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)
See the regex demo
Details
\b - word boundary
(?<num>\d) - Group "num": a digit (replace with \d+ if there can be more than 1)
(?<surname>\p{L}+) - Group "surname": 1+ letters
(?:/(?<name>\p{L}+?))+ - one or more sequences of / followed with Group "name": 1+ letters, as few as possible
(?:\s*(?<title>M(?:RS?|S)))? - an optional sequence of
\s* - 0+ whitespaces
(?<title>M(?:RS?|S)) - Group "title": M followed with R and optional S or followed with S
\b - word boundary
\s* - 0+ whitespaces
(?<other>.*?) - Group "other": 0 or more chars, as few as possible
(?=\b\d\p{L}+/\p{L}|$) - up to the first occurrence of the initial pattern (word boundary, digit, 1+ letters, / and a letter) or end of string.
C# demo:
var text = "1ROSS/SVETA/JAMIE MRS T02XT 2WHITE/VIKA MS 3GREEN/ANDYMR";
var pattern = @"\b(?<num>\d)(?<surname>\p{L}+)(?:/(?<name>\p{L}+?))+(?:\s*(?<title>M(?:RS?|S)))?\b\s*(?<other>.*?)(?=\b\d\p{L}+/\p{L}|$)";
var result = Regex.Matches(text, pattern);
foreach (Match m in result) {
Console.WriteLine("Num: {0}", m.Groups["num"].Value);
Console.WriteLine("Surname: {0}", m.Groups["surname"].Value);
Console.WriteLine("Names: {0}", string.Join(", ", m.Groups["name"].Captures.Cast<Capture>().Select(x => x.Value)));
Console.WriteLine("Title: {0}", m.Groups["title"].Value);
Console.WriteLine("Other: {0}", m.Groups["other"].Value);
Console.WriteLine("===== NEXT MATCH ======");
}
Output:
Num: 1
Surname: ROSS
Names: SVETA, JAMIE
Title: MRS
Other: T02XT
===== NEXT MATCH ======
Num: 2
Surname: WHITE
Names: VIKA
Title: MS
Other:
===== NEXT MATCH ======
Num: 3
Surname: GREEN
Names: ANDY
Title: MR
Other:
===== NEXT MATCH ======

How to delete first text? (c#)

I want to code
var text = "14. hello my friends we meet 1 test, 2 baby 3 wiki 4 marvel";
string[] split = text.Split('14.', 1, 2, 3, 4);
var needText = split[0].Replace('14.', '');
The separators "1", "2", "3" and "4" are static text,
but the leading "14." is dynamic text.
For example:
var text2 = "1972. google youtube. 1 phone, 2 star 3 tv 4 mouse";
string[] split = text.Split('1972.', 1, 2, 3, 4);
var needText = split[0].Replace('1972.', '');
If you have dynamic separators like this, String.Split is not suitable. Use Regex.Split instead.
You can give a pattern to Regex.Split and it will treat every substring that matches the pattern as a separator.
In this case, you need a pattern like this:
\d+\. |1|2|3|4
| is the alternation (or) operator. \d matches any digit character. + means match one or more times. \. matches the dot literally, because an unescaped . has a special meaning in regex.
Usage:
var split = Regex.Split(text, "\\d+\\. |1|2|3|4");
And I think the text you need is at index 1 of split.
Remember to add a using directive to System.Text.RegularExpressions!
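As a rough, self-contained sketch of the approach above: the helper name SplitParts is hypothetical, and trimming the pieces and dropping empty ones is an extra step that the answer does not include. Because the empty first piece is dropped here, the wanted text ends up at index 0 of the helper's result rather than index 1.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on a "<digits>. " prefix or on any of the static
// separators 1-4, then trims each piece and drops the empty ones.
string[] SplitParts(string text) =>
    Regex.Split(text, @"\d+\. |1|2|3|4")
         .Select(part => part.Trim())
         .Where(part => part.Length > 0)
         .ToArray();

var parts = SplitParts("14. hello my friends we meet 1 test, 2 baby 3 wiki 4 marvel");
Console.WriteLine(string.Join(" | ", parts));
// hello my friends we meet | test, | baby | wiki | marvel
```

Note that any other 1, 2, 3 or 4 inside the text would also act as a separator, so this only suits input shaped like the examples in the question.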
If you use IndexOf() with Substring(), you can very easily grab the information you need. If it's any more complex than your examples then use Regex.
var text = "14. hello my friends we meet 1 test, 2 baby 3 wiki 4 marvel";
var strArr = text.Substring(text.IndexOf(' ')).Split('1', '2', '3', '4');

RegEx to find numbers sequence in string separated by space with predefined maximum length

Sorry for the confusing title, I'll try to explain this with example. Currently we have this expression to find number sequence in a string
\b((\d[ ]{0,1}){13,19})\b
Now I'd like to modify it so that it fulfills these rules:
- The length should be between 13 and 19 digits, excluding whitespace
- Each number cluster must have minimum 3 digits
The expression should mark these as matched:
1234567890123
1234 5678 9012 345
Not match:
123456789012 3
123 12 123 1 23134
Current expression that I have will mark all of them as match.
This is possible using look-around.
The regex can be changed to the following:
\b(?<!\d )(?=(?:\d ?){13,19}(?! ?\d))(?:\d{3,} ?)+\b(?! ?\d)
This works by looking ahead to make sure the number is between 13 and 19 digits long. It then matches groups of 3 or more digits. Finally, it uses a negative lookahead after it has found all groups of 3 or more digits to make sure there aren't any digits left. If there are, we've found a group smaller than 3. This works on the examples you've provided.
\b Makes sure that it's the start of a "word".
(?<!\d ) Makes sure there is no digit followed by a space directly behind.
(?=(?:\d ?){13,19}(?! ?\d)) Looks ahead to make sure the number is between 13 and 19 digits long
(?:\d ?){13,19} From the original. ?: added to make the group non-capturing
(?! ?\d) Negative lookahead: if there are still digits left after getting 19 digits, the number is too long, so discard the current match
(?:\d{3,} ?)+ Match any number of clusters of 3 or more digits (min 13, max 19 handled by the first lookahead)
\b(?! ?\d) Looks for the end of a cluster. If there are still digits left after the end of the cluster, there must be a cluster that is too small.
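The pattern can be tried against the four sample inputs from the question with a short sketch like this one. The names clusterPattern and IsValidSequence are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from this answer: 13-19 digits in total, every cluster at least 3 digits long.
var clusterPattern = new Regex(@"\b(?<!\d )(?=(?:\d ?){13,19}(?! ?\d))(?:\d{3,} ?)+\b(?! ?\d)");

// Hypothetical helper: true when the input contains a sequence accepted by the pattern.
bool IsValidSequence(string input) => clusterPattern.IsMatch(input);

var samples = new[]
{
    "1234567890123",      // expected to match
    "1234 5678 9012 345", // expected to match
    "123456789012 3",     // trailing cluster of 1 digit
    "123 12 123 1 23134", // clusters of 2 and 1 digits
};

foreach (var sample in samples)
    Console.WriteLine($"{sample,-20} -> {IsValidSequence(sample)}");
```

The first two samples should print True and the last two False, in line with the expected results listed in the question.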
I suggest the following solution also based on lookarounds:
\b\d(?!\d?\b)(?: ?\d(?!(?<= \d)\d?\b)){12,18}\b
See the regex demo
The main point is that we only match the next digit if it is not a part of a 1- or 2-digit group.
Pattern explanation
\b - starting word boundary
\d(?!\d?\b) - a digit that is not followed with 1 or 0 digits and then a word boundary (that is, if the group is only 1 or 2 digits long, like 1 or 12, the match fails)
(?: ?\d(?!(?<= \d)\d?\b)){12,18} - 12 to 18 occurrences of:
? - 1 or 0 spaces
\d(?!(?<= \d)\d?\b) - any single digit that is not followed with 1 or 0 digits and then a word boundary (the \d?\b part); this check only applies when the digit just matched is preceded with a space (the (?<= \d) lookbehind does that), i.e. when it starts a new group
\b - a trailing word boundary.
NOTE that in case you want to match these strings in a non-numeric context (that means, if you do not want to allow any digits on the left and on the right) you might also consider adding (?<!\d *) at the front and (?! *\d) at the end of the pattern.
Note that to match any whitespace, you may replace a literal space with \s in the pattern.
If you can use Linq, this will be way easier to maintain:
var myList = new List<string>
{
"1234567890123",
"1234 5678 9012 345",
"123456789012 3",
"123 12 123 1 23134"
};
foreach(var input in myList)
{
    var splitted = Regex.Split(input, @"\s+"); // Split on whitespace
    var length = splitted.Sum(x => x.Length); // Compute the total length
    var smallestGroupSize = splitted.Min(x => x.Length); // Compute the length of the smallest chunk
    Console.WriteLine($"Total length: {length}, smallest group size: {smallestGroupSize}");
    if (length < 13 || length > 19 || smallestGroupSize < 3)
    {
        Console.WriteLine($"Input '{input}' is incorrect{Environment.NewLine}");
        continue;
    }
    Console.WriteLine($"Input '{input}' is correct!{Environment.NewLine}");
}
which produces:
Total length: 13, smallest group size: 13
Input '1234567890123' is correct!
Total length: 15, smallest group size: 3
Input '1234 5678 9012 345' is correct!
Total length: 13, smallest group size: 1
Input '123456789012 3' is incorrect
Total length: 14, smallest group size: 1
Input '123 12 123 1 23134' is incorrect

Regex to get numbers after a period in a string

I'm trying to find the right regex to extract the numbers after the . in the strings below. E.g., the first line should return an array of 1 1 1 1 1, and the second should return 2 1 0 1 2. I can't seem to figure out the correct regex to achieve this. Any help would be appreciated.
line = 0.1, 1.1, 2.1, 3.1, 4.1 // payline 0
line = 0.2, 1.1, 2.0, 3.1, 4.2 // payline 1
So far, I have the code below, but it just returns all the numbers in the string instead. E.g., the first line returns 0 1 1 1 2 1 3 1 4 1 0 and the second returns 0 2 1 1 2 0 3 1 4 2 1.
foreach (var line in Paylines)
{
    int[] lines = (from Match m in Regex.Matches(line.ToString(), @"\d+")
                   select int.Parse(m.Value)).ToArray();
    foreach (var x in lines)
    {
        Console.WriteLine(x.ToString());
    }
}
You may use a lookbehind-based regex solution:
@"(?<=\.)\d+"
It matches 1+ digits after a dot without placing the dot into a match value.
See the regex demo.
In C#, you may use
var myVals = Regex.Matches(line, @"(?<=\.)\d+", RegexOptions.ECMAScript)
.Cast<Match>()
.Select(m => int.Parse(m.Value))
.ToList();
See the C# demo.
The RegexOptions.ECMAScript option is passed for the \d to only match ASCII digits in the [0-9] range and avoid matching other Unicode digits.
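A small sketch to illustrate both points, assuming the sample lines from the question. The helper name DigitsAfterDot is hypothetical, and U+0663 (ARABIC-INDIC DIGIT THREE) is just one example of a non-ASCII digit.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the integers that directly follow a dot,
// restricted to ASCII digits by RegexOptions.ECMAScript.
int[] DigitsAfterDot(string line) =>
    Regex.Matches(line, @"(?<=\.)\d+", RegexOptions.ECMAScript)
         .Cast<Match>()
         .Select(m => int.Parse(m.Value))
         .ToArray();

Console.WriteLine(string.Join(" ", DigitsAfterDot("0.1, 1.1, 2.1, 3.1, 4.1 // payline 0"))); // 1 1 1 1 1
Console.WriteLine(string.Join(" ", DigitsAfterDot("0.2, 1.1, 2.0, 3.1, 4.2 // payline 1"))); // 2 1 0 1 2

// By default \d matches any Unicode decimal digit; with ECMAScript it only matches 0-9.
bool matchesByDefault = Regex.IsMatch("\u0663", @"\d");
bool matchesWithEcma = Regex.IsMatch("\u0663", @"\d", RegexOptions.ECMAScript);
Console.WriteLine($"default: {matchesByDefault}, ECMAScript: {matchesWithEcma}");
```

The trailing payline number in the comment is not returned, because it is not preceded by a dot.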
