Matching repeating patterns - c#

I'm currently trying to match and capture text in the following input:
field: one two three field: "moo cow" field: +this
I can match the field: with [a-z]*\: but I can't seem to match the rest of the content. So far my attempts have only resulted in capturing everything, which is not what I want.

If you know that it is always going to be literally field: there is absolutely no need for a regular expression:
var delimiters = new String[] {"field:"};
string[] values = input.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
However, from your regex I assume that the name field can vary, as long as it's in front of a colon. You could try to capture a word followed by : and then everything up to the next of those words (using a lookahead).
foreach (Match match in Regex.Matches(input, @"([a-z]+):((?:(?![a-z]+:).)*)"))
{
    string fieldName = match.Groups[1].Value;
    string value = match.Groups[2].Value;
}
An explanation of the regular expression:
( # opens a capturing group; the content can later be accessed with Groups[1]
[a-z] # lower-case letter
+ # one or more of them
) # end of capturing group
: # a literal colon
( # opens a capturing group; the content can later be accessed with Groups[2]
(?: # opens a non-capturing group; just a necessary subpattern which we do not
# need later any more
(?! # negative lookahead; this will NOT match if the pattern inside matches
[a-z]+:
# a word followed by a colon; just the same as we used at the beginning of
# the regex
) # end of negative lookahead (note that this does not consume any characters;
# it LOOKS ahead)
. # any character (except for line breaks)
) # end of non-capturing group
* # 0 or more of those
) # end of capturing group
So first we match anylowercaseword:. And then we match one more character at a time, for each one checking that this character is not the start of anotherlowercaseword:. With the capturing groups we can then later separately find the field's name and the field's value.
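To see the pattern end to end, here is a minimal sketch that runs it over the input from the question. The class and method names (FieldParser, Parse) are made up for illustration, and trimming the value is my own assumption, since the raw capture keeps the spaces around each value.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Same pattern as in the answer: a lower-case name, a colon, then every
    // character that does not start another "name:" token.
    private static readonly Regex FieldPattern =
        new Regex(@"([a-z]+):((?:(?![a-z]+:).)*)");

    // Returns (name, value) pairs in input order. Values are trimmed
    // (an assumption; drop Trim() to keep the raw capture).
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match match in FieldPattern.Matches(input))
        {
            string name = match.Groups[1].Value;
            string value = match.Groups[2].Value.Trim();
            fields.Add(new KeyValuePair<string, string>(name, value));
        }
        return fields;
    }
}
```

Calling FieldParser.Parse on the question's input should give three pairs, all named field, with the values one two three, "moo cow" and +this.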

Don't forget that you can actually match literal strings in regexes. If your pattern is like this:
field\:
You will match "field:" literally, and nothing else.

Related

C# Regex Match NOT inside self defined tags

I use tags in the form of
[[MyTag]]Some Text[[/MyTag]]
To find these tags within the whole text I use the following expression (this is not related to this question here, but for info):
\[\[(?<key>.*\w)]\](?<keyvalue>.*?)\[\[/\1\]\]
Now I would like to match and replace only text (MYSEARCHTEXT) which is NOT inside these self-defined tags.
Example:
[[Tag1]]Here I don't want to replace MYSEARCHTEXT[[/Tag1]]
But here MYSEARCHTEXT (1) should be replaced. And here MYSEARCHTEXT (2) needs to be replaced too.
[[AnotherTag]]Here I don't want to replace MYSEARCHTEXT[[/AnotherTag]]
And here I need to replace MYSEARCHTEXT (3) also.
MYSEARCHTEXT is a word or phrase and needs to be found 3 times in this example.
Maybe this can work? If I understood the problem correctly, this will match MYSEARCHTEXT outside of your tags, and your matches will be in the groups. It uses a positive lookahead:
https://regex101.com/r/C8Kuiz/2
(?:\[\[Tag1.*?\/Tag1\]\])\n?(?:.*)(?=(MYSEARCHTEXT))
I have an idea that can simplify this. Use the following regular expression to match the tagged text:
\[.+?\][^\[\]]*?MYSEARCHTEXT[^\[\]]*?\[.+?\]\]
Then replace the MYSEARCHTEXT within the string preserving the captured groups.
You may use the following solution that uses your pattern version with an added alternative in a Regex.Replace method where a match evaluator is used as the replacement argument:
var pat = @"(?s)(\[\[(\w+)]].*?\[\[/\2]])|MYSEARCHTEXT";
var s = "[[Tag1]]Here I don't want to replace MYSEARCHTEXT[[/Tag1]]\nBut here MYSEARCHTEXT (1) should be replaced. And here MYSEARCHTEXT (2) needs to be replaced too.\n[[AnotherTag]]Here I don't want to replace MYSEARCHTEXT[[/AnotherTag]]\nAnd here I need to replace MYSEARCHTEXT (3) also.";
var res = Regex.Replace(s, pat, m =>
m.Groups[1].Success ? m.Groups[1].Value : "NEW_VALUE");
Console.WriteLine(res);
See the C# demo
Result:
[[Tag1]]Here I don't want to replace MYSEARCHTEXT[[/Tag1]]
But here NEW_VALUE (1) should be replaced. And here NEW_VALUE (2) needs to be replaced too.
[[AnotherTag]]Here I don't want to replace MYSEARCHTEXT[[/AnotherTag]]
And here I need to replace NEW_VALUE (3) also.
Pattern details
(?s) - a RegexOptions.Singleline inline modifier option (a . matches any char now)
(\[\[(\w+)]].*?\[\[/\2]]) - Group 1:
\[\[ - a [[ substring
(\w+) - Group 2: one or more word chars
]] - a ]] substring
.*? - any 0+ chars, as few as possible
\[\[/ - a [[/ substring
\2 - same text as captured into Group 2
]] - a literal ]] substring
| - or
MYSEARCHTEXT - some pattern to replace.
When Group 1 matches (m.Groups[1].Success ?) this value is put back, else the NEW_VALUE is inserted into the resulting string.
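If the search text is not a fixed literal, the same idea can be wrapped in a small helper. This is only a sketch: the names TagAwareReplacer and ReplaceOutsideTags are hypothetical, and escaping the search text with Regex.Escape is my assumption about how a plain word or phrase should be treated.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAwareReplacer
{
    // Replaces every occurrence of `search` that is not inside a
    // [[Tag]]...[[/Tag]] block. Tagged blocks are matched first (Group 1)
    // and written back unchanged by the match evaluator.
    public static string ReplaceOutsideTags(string text, string search, string replacement)
    {
        if (string.IsNullOrEmpty(search))
            throw new ArgumentException("search must not be empty", nameof(search));

        // Regex.Escape makes the search text match literally.
        string pattern = @"(?s)(\[\[(\w+)]].*?\[\[/\2]])|" + Regex.Escape(search);

        return Regex.Replace(text, pattern,
            m => m.Groups[1].Success ? m.Groups[1].Value : replacement);
    }
}
```

Because the replacement comes from the evaluator rather than a replacement pattern, characters such as $ in the replacement text are inserted as-is.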
The best way is to match both separately as a positive match.
Then decide which to replace and which to write back based on which
matched. (Someone posted this solution already, so I won't duplicate it)
The alternative is to forego that entirely and qualify the text
in the form of a lookahead after searchtext.
This shows how to do it that way.
var pat = @"(?s)MYSEARCHTEXT(?=(?:(?!\[\[/?\w+\]\]).)*?(?:\[\[\w+\]\]|$))";
var res = Regex.Replace(s, pat, "NEW_VALUE");
Demo: https://ideone.com/KOtNik
Formatted:
(?s) # Dot-all modifier
MYSEARCHTEXT
(?= # Qualify the text with an assertion
(?: # Get non-tag characters
(?! \[\[ /? \w+ \]\] )
.
)*?
(?: # Up to -
\[\[ \w+ \]\] # An open tag
| $ # or, end of string
)
)

Get each item within a capturing group

If you have a string like this:
[hello world] this is [the best .Home] is nice place.
How do you extract each word (separated by a space) within the brackets [] only?
Right now I have this working https://regex101.com/r/Tgokeq/2
Which returns:
hello world
the best .Home
But I want:
hello
world
the
best
.Home
PS: I know I could just split the string in a foreach, but I don't want that. I want it done in the regex itself, just like this one, which gets every word, except that I want only the words within the brackets [ ].
https://regex101.com/r/eweRWj/2
Use this Pattern ([^\[\] ]+)(?=[^\[\]]*\]) Demo
( # Capturing Group (1)
[^\[\] ] # Character not in [\[\] ] Character Class
+ # (one or more)(greedy)
) # End of Capturing Group (1)
(?= # Look-Ahead
[^\[\]] # Character not in [\[\]] Character Class
* # (zero or more)(greedy)
\] # "]"
) # End of Look-Ahead
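A short sketch of how this pattern could be used from C#; the class and method names (BracketWords, Extract) are placeholders of mine, not part of the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketWords
{
    // A run of non-bracket, non-space characters that is followed, with no
    // other bracket in between, by a closing bracket.
    private static readonly Regex WordInBrackets =
        new Regex(@"([^\[\] ]+)(?=[^\[\]]*\])");

    // Returns every word found inside a [ ... ] pair, in input order.
    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match match in WordInBrackets.Matches(input))
        {
            words.Add(match.Groups[1].Value);
        }
        return words;
    }
}
```

For the string in the question this should return hello, world, the, best and .Home. Note that the lookahead only checks for a closing bracket ahead, so on unbalanced input such as foo] bar it would also report foo.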
This pattern may not seem as elegant since it does not match individual words separately. The full solution takes advantage of the .Net regex library to get individual words. However, it avoids the excessive backtracking of alpha bravo's solution. The importance of that will largely depend on how many lines you search and/or whether you are matching large chunks of text or only individual lines at a time.
This approach also lets you identify exactly how many bracket pairs and which words were captured in each pair. A simple pattern-only solution will just get you the matched words without context.
The pattern:
\[\s*((?<word>[^[\]\s]+)\s*)+]
Then some brief code demonstrating how to get captured words via the .Net regex object model:
using System.Text.RegularExpressions;
...
Regex rx = new Regex(@"\[\s*((?<word>[^[\]\s]+)\s*)+]");
MatchCollection matches = rx.Matches(searchText);
foreach (Match m in matches) {
    foreach (Capture c in m.Groups["word"].Captures) {
        System.Console.WriteLine(c.Value);
    }
}
Breakdown of pattern:
\[ # Opening bracket
\s* # Optional white space
( # Group for word delimited by space
(?<word> # Named capture group
[^[\]\s] # Negative character class: no brackets, no white space
+ # one or more greedy
) # End named capture group
\s* # Match white space after word
) # End of word+space grouping
+ # Match multiple occurrences of word+space
] # Literal closing bracket (no need to escape outside character class)
The above will match line feeds between the brackets. If you don't want that then use
\[\ *((?<word>[^[\]\s]+)\ *)+]
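To illustrate the point about keeping the context of each bracket pair, here is a sketch that returns one list of words per pair instead of printing a flat list. BracketGroups and Extract are hypothetical names; the pattern is the first one given above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketGroups
{
    private static readonly Regex BracketPair =
        new Regex(@"\[\s*((?<word>[^[\]\s]+)\s*)+]");

    // One inner list per [ ... ] pair; each inner list holds the words
    // captured by the repeated "word" group inside that pair.
    public static List<List<string>> Extract(string input)
    {
        var groups = new List<List<string>>();
        foreach (Match match in BracketPair.Matches(input))
        {
            var words = new List<string>();
            foreach (Capture capture in match.Groups["word"].Captures)
            {
                words.Add(capture.Value);
            }
            groups.Add(words);
        }
        return groups;
    }
}
```

The number of inner lists is the number of bracket pairs, and each inner list keeps the word order within its pair.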

Regex - how to match multiple properly quoted substrings

I am trying to use a Regex to extract quote-wrapped strings from within a (C#) string which is a comma-separated list of such strings. I need to extract all properly quoted substrings, and ignore those that are missing a quote mark.
E.g. given this string:
"animal,dog,cat","ecoli, verification,"streptococcus"
I need to extract "animal,dog,cat" and "streptococcus".
I've tried various regex solutions in this forum but they all seem to find the first substring only, or incorrectly match "ecoli, verification," and ignore "streptococcus"
Is this solvable?
TIA
Try this:
string input = "\"animal,dog,cat\",\"ecoli, verification,\"streptococcus\"";
string pattern = "\"([^\"]+?[^,])\"";
var matches = Regex.Matches(input, pattern);
foreach (Match m in matches)
    Console.WriteLine(m.Groups[1].Value);
P.S. But I agree with the commentators: fix the source.
I suggest this:
"(?>[^",]*(?>,[^",]+)*)"
Explanation:
" # Match a starting quote
(?> # Match inside an atomic (non-capturing) group to avoid catastrophic backtracking:
[^",]* # - any number of characters except commas or quotes
(?> # - optionally followed by another (atomic) group:
, # - which starts with a comma
[^",]+ # - and contains at least one character besides comma or quotes.
)* # - (as said above, that group is optional but may occur many times)
) # End of the outer atomic group
" # Match a closing quote
Test it live on regex101.com.
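For completeness, a small sketch of the atomic-group pattern used from C#. The wrapper names (QuotedItems, Extract) are mine; the matches include the surrounding quotes, as asked for in the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedItems
{
    // Opening quote, comma-separated runs of non-quote/non-comma characters,
    // closing quote. A run that ends in a dangling comma cannot match.
    private static readonly Regex ProperlyQuoted =
        new Regex("\"(?>[^\",]*(?>,[^\",]+)*)\"");

    // Returns each properly quoted substring, quotes included.
    public static List<string> Extract(string input)
    {
        var items = new List<string>();
        foreach (Match match in ProperlyQuoted.Matches(input))
        {
            items.Add(match.Value);
        }
        return items;
    }
}
```

On the sample string this should return "animal,dog,cat" and "streptococcus", skipping the fragment that starts at "ecoli.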

What regex for matching words with keyword '('?

In my c# code I need to get a word if the words before match specific words:
var match= Regex.Match(someLine, @"^(FIRST WORDS) (\w+) (SECOND WORDS | PROBLEM KEYWORD \() (\w+)", RegexOptions.IgnoreCase);
var neededWord= match.Groups[4].Value;
If the string equals "FIRST WORDS SOME WORDS PROBLEM KEYWORD (SOMETHING AGAIN)", I would like to get 'SOMETHING' as my needed word. But this does not work. It returns an empty string.
What am I doing wrong?
RegEx Demo
^FIRST WORDS[^\(]+\(([^\)]+)\)
Debuggex Demo
Description
^ assert position at start of the string
FIRST WORDS matches the characters FIRST WORDS literally (case sensitive)
[^\(]+ match a single character not present in the list below
    Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    \( matches the character ( literally
\( matches the character ( literally
1st Capturing group ([^\)]+)
    [^\)]+ match a single character not present in the list below
        Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        \) matches the character ) literally
\) matches the character ) literally
Note: if you need only the word SOMETHING I can edit the RegEx, also Group 1 will contain your requested results.
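As a rough sketch of how this could look in C#: the first method uses the pattern above as-is, and the second is a hypothetical variant of mine (not the answerer's promised edit) that keeps only the first word after the opening parenthesis. RegexOptions.IgnoreCase is carried over from the question's code; the class and method names are placeholders.

```csharp
using System.Text.RegularExpressions;

public static class ParenthesisCapture
{
    // The answer's pattern: everything between the first "(" and the next ")".
    // Returns an empty string when the line does not match.
    public static string InsideParentheses(string line)
    {
        return Regex.Match(line, @"^FIRST WORDS[^\(]+\(([^\)]+)\)",
            RegexOptions.IgnoreCase).Groups[1].Value;
    }

    // Assumed variant: only the first word after "(".
    public static string FirstWordInParentheses(string line)
    {
        return Regex.Match(line, @"^FIRST WORDS[^\(]+\((\w+)",
            RegexOptions.IgnoreCase).Groups[1].Value;
    }
}
```

With the line from the question, InsideParentheses should give SOMETHING AGAIN and FirstWordInParentheses should give SOMETHING.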

How to make balancing group capturing?

Let's say I have this text input.
tes{}tR{R{abc}aD{mnoR{xyz}}}
I want to extract the following output:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
Currently, I can only extract what's inside the {}groups using balanced group approach as found in msdn. Here's the pattern:
^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$
Does anyone know how to include the R{} and D{} in the output?
I think that a different approach is required here. Once you match the first larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you won't be able to get the subgroups inside as the regex doesn't allow you to capture the individual R{ ... } groups.
So, there had to be some way to capture and not consume and the obvious way to do that was to use a positive lookahead. From there, you can put the expression you used, albeit with some changes to adapt to the new change in focus, and I came up with:
(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))
[I also renamed the 'Open' to 'O' and removed the named capture for the close brace to make it shorter and avoid noise in the matches]
On regexhero.net (the only free .NET regex tester I know so far), I got the following capture groups:
1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}
Breakdown of regex:
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
(?: # First non-capture group opening
(?: # Second non-capture group opening
(?'O'{) # Get the named opening brace
[^{}]* # Any non-brace
)+ # Close of second non-capture group and repeat over as many times as necessary
(?: # Third non-capture group opening
(?'-O'}) # Removal of named opening brace when encountered
[^{}]*? # Any other non-brace characters in case there are more nested braces
)+ # Close of third non-capture group and repeat over as many times as necessary
)+ # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
(?(O)(?!)) # Condition to prevent unbalanced braces
) # Close capture group
) # Close positive lookahead
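A minimal C# sketch of how the lookahead version could be consumed. Every match is zero-width, so the useful text is in Group 1 of each match. The names NestedGroups and FindAll are made up for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedGroups
{
    // The lookahead pattern from above: the match itself is empty, the
    // balanced X{...} text is captured in Group 1.
    private static readonly Regex Nested = new Regex(
        @"(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))");

    // Returns every balanced X{...} substring, ordered by start position.
    public static List<string> FindAll(string input)
    {
        var result = new List<string>();
        foreach (Match match in Nested.Matches(input))
        {
            result.Add(match.Groups[1].Value);
        }
        return result;
    }
}
```

Since the engine tries each start position in turn, the results come back ordered by where they begin, which is the order of the capture list shown above rather than the innermost-first order in the question.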
The following will not work in C#.
I also wanted to try this on the PCRE engine, which supports recursive patterns. I'm more familiar with it, and it yields a shorter regex :)
(?=([A-Z]{(?:[^{}]|(?1))+}))
regex101 demo
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
{ # Opening brace
(?: # Opening non-capture group
[^{}] # Matches non braces
| # OR
(?1) # Recurse first capture group
)+ # Close non-capture group and repeat as many times as necessary
} # Closing brace
) # Close of capture group
) # Close of positive lookahead
I'm not sure a single regex would be able to suit your needs: these nested substrings always mess it up.
One solution could be the following algorithm (written in Java, but I guess the translation to C# won't be that hard):
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
 *
 * @param input
 *            The input string.
 * @param regex
 *            The regex pattern. It has to target the most nested substrings. For example, given the following input string
 *            <code>A{01B{23}45C{67}89}</code>, if you want to catch every <code>X{*}</code> substrings (where <code>X</code> is a capital letter),
 *            you have to use <code>[A-Z][{][^{]+?[}]</code> or <code>[A-Z][{][^{}]+[}]</code> instead of <code>[A-Z][{].+?[}]</code>.
 * @param format
 *            The format must follow the <a href= "http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html#syntax" >format string
 *            syntax</a>. It will be given one single integer as argument, so it has to contain (and to contain only) a <code>%d</code> flag. The
 *            formatted string must not occur anywhere in the input string. If <code>null</code>, <code>ééé%dèèè</code> will be used.
 * @return The list of all the matches of the regex in the input string.
 */
public static List<String> findAllMatches(String input, String regex, String format) {
    if (format == null) {
        format = "ééé%dèèè";
    }
    int counter = 0;
    Map<String, String> matches = new LinkedHashMap<String, String>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    // if a substring has been found
    while (matcher.find()) {
        // create a unique replacement string using the counter
        String replace = String.format(format, counter++);
        // store the relation "replacement string --> initial substring" in an insertion-ordered map
        matches.put(replace, matcher.group());
        String end = input.substring(matcher.end(), input.length());
        String start = input.substring(0, matcher.start());
        // replace the found substring by the created unique replacement string
        input = start + replace + end;
        // reiterate on the new input string (faking the original matcher.find() implementation)
        matcher = pattern.matcher(input);
    }
    List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());
    // for each relation "replacement string --> initial substring" of the map
    for (int i = 0; i < entries.size(); i++) {
        Entry<String, String> current = entries.get(i);
        // for each relation that could have been found before the current one (i.e. more nested)
        for (int j = 0; j < i; j++) {
            Entry<String, String> previous = entries.get(j);
            // if the current initial substring contains the previous replacement string
            if (current.getValue().contains(previous.getKey())) {
                // replace the previous replacement string by the previous initial substring in the current initial substring
                current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
            }
        }
    }
    return new LinkedList<String>(matches.values());
}
Thus, in your case:
String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
String regex = "[A-Z][{][^{}]+[}]";
findAllMatches(input, regex, null);
Returns:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
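Since the question is about C#, here is one possible port of the algorithm above. It is a sketch rather than a line-by-line translation: it uses two parallel lists instead of a LinkedHashMap, a composite format string ({0}) instead of %d, and the name NestedMatchFinder is made up. As with the Java version, it assumes the placeholder text never occurs in the input and never matches the pattern itself, and that the pattern cannot match an empty string (otherwise the loop would not terminate).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedMatchFinder
{
    // Repeatedly matches the innermost substrings, swaps each one for a
    // unique placeholder, and finally expands the placeholders again.
    public static List<string> FindAllMatches(
        string input, string pattern, string format = "ééé{0}èèè")
    {
        var regex = new Regex(pattern);
        var placeholders = new List<string>();
        var originals = new List<string>();

        Match match = regex.Match(input);
        while (match.Success)
        {
            // Unique replacement string built from the running counter.
            string placeholder = string.Format(format, placeholders.Count);
            placeholders.Add(placeholder);
            originals.Add(match.Value);

            // Swap the matched text for the placeholder and search again.
            input = input.Substring(0, match.Index) + placeholder
                  + input.Substring(match.Index + match.Length);
            match = regex.Match(input);
        }

        // Earlier matches are more deeply nested, so expand them inside
        // every later match. originals[j] is already fully expanded here.
        for (int i = 0; i < originals.Count; i++)
        {
            for (int j = 0; j < i; j++)
            {
                originals[i] = originals[i].Replace(placeholders[j], originals[j]);
            }
        }
        return originals;
    }
}
```

Called with the same input and regex as the Java example, this should produce the same four strings in the same order.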
Balancing groups in .Net regular expressions give you control over exactly what to capture, and the .Net regex engine keeps a full history of all captures of the group (unlike most other flavors that only capture the last occurrence of each group).
The MSDN example is a little too complicated. A simpler approach for matching nested structures would be:
(?>
(?<O>)\p{Lu}\{ # Push to the O stack, and match an upper-case letter and {
| # OR
\}(?<-O>) # Match } and pop from the stack
| # OR
\p{Ll} # Match a lower-case letter
)+
(?(O)(?!)) # Make sure the stack is empty
or in a single line:
(?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!))
Working example on Regex Storm
In your example it also matches the "tes" at the start of the string, but don't worry about that, we're not done.
With a small correction we can also capture the occurrences between the R{...} pairs:
(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))
Each Match will have a Group called "Target", and each such Group will have a Capture for each occurrence - you only care about these captures.
Working example on Regex Storm - Click on Table tab and examine the 4 captures of ${Target}
See also:
What are regular expression Balancing Groups?
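A short sketch of reading those captures in C#. The wrapper names (TargetCaptures, Extract) are placeholders; the pattern is the single-line Target version from above. Captures from all matches are collected, so the leading "tes" match simply contributes nothing.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TargetCaptures
{
    // (?<O>) pushes the position just before an upper-case letter;
    // (?<Target-O>) pops it after a "}" and stores the text in between.
    private static readonly Regex Balanced = new Regex(
        @"(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))");

    // Returns every capture of the Target group, across all matches.
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match match in Balanced.Matches(input))
        {
            foreach (Capture capture in match.Groups["Target"].Captures)
            {
                result.Add(capture.Value);
            }
        }
        return result;
    }
}
```

A capture is recorded each time a closing brace is reached, so inner groups come out before the groups that contain them, which is the order requested in the question.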
