How to make balancing group capturing?


Let's say I have this text input.
tes{}tR{R{abc}aD{mnoR{xyz}}}
I want to extract the following output:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
Currently, I can only extract what's inside the {} groups, using the balancing-group approach described on MSDN. Here's the pattern:
^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$
Does anyone know how to include the R{} and D{} in the output?

I think a different approach is required here. Once you match the first, larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you can't get at the subgroups inside it: the match has already consumed them, so the regex cannot go back and capture the individual R{ ... } groups.
So there has to be a way to capture without consuming, and the obvious way to do that is a positive lookahead. Inside it you can put the expression you used, with some changes to fit the new focus. I came up with:
(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))
[I also renamed 'Open' to 'O' and removed the named capture for the closing brace, to make the pattern shorter and avoid noise in the matches.]
On regexhero.net (the only free .NET regex tester I know so far), I got the following capture groups:
1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}
Breakdown of regex:
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
(?: # First non-capture group opening
(?: # Second non-capture group opening
(?'O'{) # Get the named opening brace
[^{}]* # Any non-brace
)+ # Close of second non-capture group and repeat over as many times as necessary
(?: # Third non-capture group opening
(?'-O'}) # Removal of named opening brace when encountered
[^{}]*? # Any other non-brace characters in case there are more nested braces
)+ # Close of third non-capture group and repeat over as many times as necessary
)+ # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
(?(O)(?!)) # Condition to prevent unbalanced braces
) # Close capture group
) # Close positive lookahead
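A minimal sketch of how the lookahead pattern above might be used from C#. It assumes C# 9 or later (top-level statements); the variable names are only illustrative. Because the whole pattern sits inside a lookahead, every match is zero-length and the text of interest is read from capture group 1.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";

// The lookahead pattern from the answer, unchanged.
string pattern = @"(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))";

// Each match consumes nothing; group 1 holds the balanced X{...} text
// that starts at the match position.
string[] found = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

foreach (string s in found)
    Console.WriteLine(s);
```

The groups come out in the order of their starting positions in the input, which is why the outermost R{...} is reported first.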
The following will not work in C#.
I also wanted to see how this would work on the PCRE engine, since it supports recursive patterns. I'm more familiar with it, and it yields a shorter regex :)
(?=([A-Z]{(?:[^{}]|(?1))+}))
regex101 demo
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
{ # Opening brace
(?: # Opening non-capture group
[^{}] # Matches non braces
| # OR
(?1) # Recurse first capture group
)+ # Close non-capture group and repeat as many times as necessary
} # Closing brace
) # Close of capture group
) # Close of positive lookahead

I'm not sure a single regex would be able to suit your needs: these nested substrings always mess it up.
One solution could be the following algorithm (written in Java, but I guess the translation to C# won't be that hard):
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
 *
 * @param input
 *            The input string.
 * @param regex
 *            The regex pattern. It has to target the most nested substrings. For example, given the following input string
 *            <code>A{01B{23}45C{67}89}</code>, if you want to catch every <code>X{*}</code> substring (where <code>X</code> is a capital letter),
 *            you have to use <code>[A-Z][{][^{]+?[}]</code> or <code>[A-Z][{][^{}]+[}]</code> instead of <code>[A-Z][{].+?[}]</code>.
 * @param format
 *            The format must follow the <a href= "http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html#syntax" >format string
 *            syntax</a>. It will be given one single integer as argument, so it has to contain one (and only one) <code>%d</code> specifier. The
 *            formatted placeholder must not occur anywhere in the input string. If <code>null</code>, <code>ééé%dèèè</code> will be used.
 * @return The list of all the matches of the regex in the input string.
 */
public static List<String> findAllMatches(String input, String regex, String format) {
    if (format == null) {
        format = "ééé%dèèè";
    }
    int counter = 0;
    Map<String, String> matches = new LinkedHashMap<String, String>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    // if a substring has been found
    while (matcher.find()) {
        // create a unique replacement string using the counter
        String replace = String.format(format, counter++);
        // store the relation "replacement string --> initial substring" in a queue
        matches.put(replace, matcher.group());
        String end = input.substring(matcher.end(), input.length());
        String start = input.substring(0, matcher.start());
        // replace the found substring by the created unique replacement string
        input = start + replace + end;
        // reiterate on the new input string (faking the original matcher.find() implementation)
        matcher = pattern.matcher(input);
    }
    List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());
    // for each relation "replacement string --> initial substring" of the queue
    for (int i = 0; i < entries.size(); i++) {
        Entry<String, String> current = entries.get(i);
        // for each relation that could have been found before the current one (i.e. more nested)
        for (int j = 0; j < i; j++) {
            Entry<String, String> previous = entries.get(j);
            // if the current initial substring contains the previous replacement string
            if (current.getValue().contains(previous.getKey())) {
                // replace the previous replacement string by the previous initial substring in the current initial substring
                current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
            }
        }
    }
    return new LinkedList<String>(matches.values());
}
Thus, in your case:
String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
String regex = "[A-Z][{][^{}]+[}]";
findAllMatches(input, regex, null);
Returns:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
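Since the question is about C#, here is a rough sketch of how the same placeholder algorithm might be translated. It is an adaptation, not the answer's own code: the name FindAllMatches, the two parallel lists, and the composite-format placeholder "ééé{0}èèè" (in place of Java's %d) are assumptions made for illustration. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> found = FindAllMatches("tes{}tR{R{abc}aD{mnoR{xyz}}}", "[A-Z][{][^{}]+[}]");
foreach (string s in found)
    Console.WriteLine(s);

// Hypothetical C# counterpart of findAllMatches. The pattern must target the
// innermost substrings, and the placeholder must not occur in the input.
static List<string> FindAllMatches(string input, string pattern, string format = "ééé{0}èèè")
{
    var regex = new Regex(pattern);
    var placeholders = new List<string>();
    var values = new List<string>();

    // Repeatedly replace the leftmost innermost match with a unique placeholder.
    Match match = regex.Match(input);
    while (match.Success)
    {
        string placeholder = string.Format(format, placeholders.Count);
        placeholders.Add(placeholder);
        values.Add(match.Value);
        input = input.Substring(0, match.Index) + placeholder
              + input.Substring(match.Index + match.Length);
        match = regex.Match(input);
    }

    // Expand earlier (more nested) placeholders inside each later match.
    for (int i = 0; i < values.Count; i++)
        for (int j = 0; j < i; j++)
            values[i] = values[i].Replace(placeholders[j], values[j]);

    return values;
}
```

Two lists are used instead of a dictionary so that the discovery order is preserved without relying on dictionary enumeration order.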

Balancing groups in .Net regular expressions give you control over exactly what to capture, and the .Net regex engine keeps a full history of all captures of the group (unlike most other flavors that only capture the last occurrence of each group).
The MSDN example is a little too complicated. A simpler approach for matching nested structures would be:
(?>
(?<O>)\p{Lu}\{ # Push to the O stack, and match an upper-case letter and {
| # OR
\}(?<-O>) # Match } and pop from the stack
| # OR
\p{Ll} # Match a lower-case letter
)+
(?(O)(?!)) # Make sure the stack is empty
or in a single line:
(?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!))
Working example on Regex Storm
In your example it also matches the "tes" at the start of the string, but don't worry about that, we're not done.
With a small correction we can also capture the occurrences between the R{...} pairs:
(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))
Each Match will have a Group called "Target", and each such Group will have a Capture for each occurrence - you only care about these captures.
Working example on Regex Storm - Click on Table tab and examine the 4 captures of ${Target}
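A short sketch of reading those captures from C#, assuming C# 9 or later (top-level statements); the variable names are illustrative. It walks every Match, ignores the match text itself, and collects only the captures of the Target group.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";

// The single-line pattern from the answer, with the Target-O balancing group.
string pattern = @"(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))";

// A match such as "tes" has no Target captures and contributes nothing.
// Each pop of O stores the text from the pushed position up to and
// including the closing brace as one capture of Target.
string[] targets = Regex.Matches(input, pattern)
    .Cast<Match>()
    .SelectMany(m => m.Groups["Target"].Captures.Cast<Capture>())
    .Select(c => c.Value)
    .ToArray();

foreach (string t in targets)
    Console.WriteLine(t);
```

Note that m.Groups["Target"].Value alone would return only the last capture; the full history is available through the Captures collection.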
See also:
What are regular expression Balancing Groups?

Related

I need help for building a regex

It is my first time working with regex and I am a little lost. To give you a little background, I am making a program that reads a text file line by line and saves each line in a string called "line". If the line starts with either a tab or a space, followed by a number or numbers and dots (such as 1 or 1.2.1, for instance), followed by another tab or space, it copies the line to another file.
So far I have built this regex, but it does not work:
string pattern = @"(\t| ) *[0-9.] (\t| )";
if (line.StartsWith(pattern))
{
//copy line
}
Also, is line.StartsWith correct? Or should I use something like rgx.Matches(pattern)?

Your pattern contains a character class without a quantifier, which will match either a single digit or dot.
To prevent matching for example only dots you could first match digits followed by an optional part which matches a dot and then again digits [0-9]+(?:\.[0-9]+)*
Note that in this part (\t| ) there are 2 characters expected to match as the space in that part has meaning.
You could simplify the pattern to use a character class to match either a tab or space instead of using an alternation and if you don't need the capturing group you could omit it.
Instead of using StartsWith, which compares against a literal string rather than a pattern, you could use, for example, IsMatch:
^[ \t][0-9]+(?:\.[0-9]+)*[ \t]
^ Start of string
[ \t] Match a single tab or space
[0-9]+ Match 1+ digits 0-9
(?:\.[0-9]+)* Repeat 0+ times a dot and 1+ digits
[ \t] Match a single tab or space
Regex demo | C# demo
For example
string s = "\t1.2.1 ";
Regex regex = new Regex(@"^[ \t][0-9]+(?:\.[0-9]+)*[ \t]");
if (regex.IsMatch(s)) {
//copy line
}

Get each item within a capturing group

If you have a string like this:
[hello world] this is [the best .Home] is nice place.
How do you extract each word (separated by spaces) within the brackets [] only?
Right now I have this working https://regex101.com/r/Tgokeq/2
Which returns:
hello world
the best .Home
But I want:
hello
world
the
best
.Home
PS: I know I could just split the string in a foreach, but I don't want that. I want it in the regex itself, just like this one, which gets every word, except that I want only the words within the brackets [ ].
https://regex101.com/r/eweRWj/2

Use this Pattern ([^\[\] ]+)(?=[^\[\]]*\]) Demo
( # Capturing Group (1)
[^\[\] ] # Character not in [\[\] ] Character Class
+ # (one or more)(greedy)
) # End of Capturing Group (1)
(?= # Look-Ahead
[^\[\]] # Character not in [\[\]] Character Class
* # (zero or more)(greedy)
\] # "]"
) # End of Look-Ahead
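As a quick illustration, the pattern above could be applied from C# roughly like this. The sketch assumes C# 9 or later (top-level statements), and the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "[hello world] this is [the best .Home] is nice place.";

// A word (no brackets, no spaces) that is followed, without crossing
// another bracket, by a closing "]".
string pattern = @"([^\[\] ]+)(?=[^\[\]]*\])";

string[] words = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

foreach (string w in words)
    Console.WriteLine(w);
```

Words outside the brackets are rejected because the lookahead meets an opening bracket, or the end of the string, before it can reach a closing one.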

This pattern may not seem as elegant, since it does not match individual words separately. The full solution takes advantage of the .Net regex library to get the individual words. However, it avoids the excessive backtracking of alpha bravo's solution. How much that matters will largely depend on how many lines you search and on whether you are matching large chunks of text or only individual lines at a time.
This approach also lets you identify exactly how many bracket pairs and which words were captured in each pair. A simple pattern-only solution will just get you the matched words without context.
The pattern:
\[\s*((?<word>[^[\]\s]+)\s*)+]
Then some brief code demonstrating how to get captured words via the .Net regex object model:
using System.Text.RegularExpressions;
...
Regex rx = new Regex(@"\[\s*((?<word>[^[\]\s]+)\s*)+]");
MatchCollection matches = rx.Matches(searchText);
foreach(Match m in matches) {
foreach(Capture c in m.Groups["word"].Captures) {
System.Console.WriteLine(c.Value);
}
}
Breakdown of pattern:
\[ # Opening bracket
\s* # Optional white space
( # Group for word delimited by space
(?<word> # Named capture group
[^[\]\s] # Negative character class: no brackets, no white space
+ # one or more greedy
) # End named capture group
\s* # Match white space after word
) # End of word+space grouping
+ # Match multiple occurrences of word+space
] # Literal closing bracket (no need to escape outside character class)
The above will match line feeds between the brackets. If you don't want that then use
\[\ *((?<word>[^[\]\s]+)\ *)+]

Regex - how to match multiple properly quoted substrings

I am trying to use a Regex to extract quote-wrapped strings from within a (C#) string which is a comma-separated list of such strings. I need to extract all properly quoted substrings, and ignore those that are missing a quote mark
eg given this string
"animal,dog,cat","ecoli, verification,"streptococcus"
I need to extract "animal,dog,cat" and "streptococcus".
I've tried various regex solutions in this forum but they all seem to find the first substring only, or incorrectly match "ecoli, verification," and ignore "streptococcus"
Is this solvable?
TIA

Try this:
string input = "\"animal,dog,cat\",\"ecoli, verification,\"streptococcus\"";
string pattern = "\"([^\"]+?[^,])\"";
var matches = Regex.Matches(input, pattern);
foreach (Match m in matches)
Console.WriteLine(m.Groups[1].Value);
P.S. But I agree with the commentators: fix the source.

I suggest this:
"(?>[^",]*(?>,[^",]+)*)"
Explanation:
" # Match a starting quote
(?> # Start an atomic (non-capturing) group to avoid catastrophic backtracking:
[^",]* # - any number of characters except commas or quotes
(?> # - optionally followed by another (atomic) group:
, # - which starts with a comma
[^",]+ # - and contains at least one character besides comma or quotes.
)* # - (as said above, that group is optional but may occur many times)
) # End of the outer atomic group
" # Match a closing quote
Test it live on regex101.com.
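A minimal sketch of running this pattern from C#, assuming C# 9 or later (top-level statements); the variable names are illustrative. Inside a verbatim string each double quote of the pattern has to be doubled.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "\"animal,dog,cat\",\"ecoli, verification,\"streptococcus\"";

// Same pattern as above; "" is how a verbatim string spells one quote.
string pattern = @"""(?>[^"",]*(?>,[^"",]+)*)""";

// Each match value includes its surrounding quotes.
string[] quoted = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

foreach (string q in quoted)
    Console.WriteLine(q);
```

The badly quoted "ecoli, verification," segment is skipped because a comma directly before a quote cannot be matched by either part of the atomic group.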

How to extract numbers from a string using regular expressions?

This little challenge just screams regular expressions to me, but so far I am stumped.
I have an arbitrary string that contains two numbers embedded in it. I need to extract those two numbers, which will be n and m digits long (n,m are unknown in advance). The format of the string is always
FixedWord[n digits]anotherfixedword[m digits]alotmorestuffontheend
The first number is of the format 1.2.3.4 (the number of digits varying) eg 5.3.20 or 5.3.10.1 or 5.4.
and the second is a simpler 'm' digits (eg 25 or 2)
eg "AppName5.2.6dbVer44Oracle.Group"
It shouts 'pattern matching' and hence "extraction using regexes". Can anyone guide me further?
TIA

The following pattern:
(\d+(?>\.\d+)*)\w+?(\d+)
Will match this:
AppName5.2.6dbVer44Oracle.Group
\__________/ <-- match
\___/ \/ <-- captures
Demo
And will capture the two values you're interested in in capture groups.
Use it like this:
var match = Regex.Match(input, @"(\d+(?>\.\d+)*)\w+?(\d+)");
if (match.Success)
{
var first = match.Groups[1].Value;
var second = match.Groups[2].Value;
// ...
}
Pattern explanation:
( # Start of group 1
\d+ # a series of digits
(?> # start of atomic group
\.\d+ # dot followed by digits
)* # .. 0 to n times
)
\w+? # some word characters (as few as possible)
(\d+) # a series of digits captured in group 2

Try this:
\w*?([\d|\.]+)\w*?([\d{1,4}]+).*

You could start from the following:
^[a-zA-Z]+((?:\d+\.)+\d+)[a-zA-Z]+(\d+).*$
I assumed that the fixed words are just made of letters and that you want to match the entire string. If you prefer, you could substitute the parts not in parentheses with the actual fixed words or change the character sets as desired. I recommend using a tool like https://regex101.com to fine-tune the expression.

Keep it basic by specifying a match ( ) that looks for a digit \d, then zero or more * digits or periods in a set [\d.] (the set is \d -or- a literal period):
var data = "AppName5.2.6dbVer44Oracle.Group";
var pattern = @"(\d[\d.]*)";
// Outputs:
// 5.2.6
// 44
// Join the values explicitly; writing the sequence itself would only print its type name.
Console.WriteLine (string.Join(Environment.NewLine, Regex.Matches(data, pattern)
.OfType<Match>()
.Select (mt => mt.Groups[1].Value)));
Each match will be a number within the sentence. So if the total set of numbers changes, the pattern will not fail and will dutifully report 1 to N numbers.

Simply look for the numbers, since you only care for the numbers and don't want to check the syntax of the whole input string.
MatchCollection matches = Regex.Matches(input, @"\d+(\.\d+)*");
if (matches.Count >= 2) {
string number1 = matches[0].Value;
string number2 = matches[1].Value;
} else {
// Less than two numbers found
}
The expression \d+(\.\d+)* means:
\d+ one or more digits.
( )* repeat zero, one or more times.
\.\d+ one decimal point (escaped with \) followed by one or more digits.
and
\d one digit.
( ) grouping.
+ repeat the expression to the left one or more times.
* repeat the expression to the left zero, one or more times.
\ escapes characters that have a special meaning in regex.
. any character (without escaping).
\. period character (".").

Matching repeating patterns

I'm currently trying to match and capture text in the following input:
field: one two three field: "moo cow" field: +this
I can match the field: with [a-z]*\: but I can't seem to match the rest of the content. So far my attempts have only resulted in capturing everything, which is not what I want.

If you know that it is always going to be literally field: there is absolutely no need for a regular expression:
var delimiters = new String[] {"field:"};
string[] values = input.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
However, from your regex I assume that the name field can vary, as long as it's in front of a colon. You could try to capture a word followed by : and then everything up to the next of those words (using a lookahead).
foreach(Match match in Regex.Matches(input, @"([a-z]+):((?:(?![a-z]+:).)*)"))
{
string fieldName = match.Groups[1].Value;
string value = match.Groups[2].Value;
}
An explanation of the regular expression:
( # opens a capturing group; the content can later be accessed with Groups[1]
[a-z] # lower-case letter
+ # one or more of them
) # end of capturing group
: # a literal colon
( # opens a capturing group; the content can later be accessed with Groups[2]
(?: # opens a non-capturing group; just a necessary subpattern which we do not
# need later any more
(?! # negative lookahead; this will NOT match if the pattern inside matches
[a-z]+:
# a word followed by a colon; just the same as we used at the beginning of
# the regex
) # end of negative lookahead (note that this does not consume any characters;
# it LOOKS ahead)
. # any character (except for line breaks)
) # end of non-capturing group
* # 0 or more of those
) # end of capturing group
So first we match anylowercaseword:. And then we match one more character at a time, for each one checking that this character is not the start of anotherlowercaseword:. With the capturing groups we can then later separately find the field's name and the field's value.

Don't forget that you can actually match literal strings in regexes. If your pattern is like this:
field\:
You will match "field:" literally, and nothing else.
