Capture all groups that fit regex - c#

I have a regex that does pretty much exactly what I want: \.?(\w+[\s|,]{1,}\w+[\s|,]{1,}\w+){1}\.?
Meaning it captures occurrences of 3 words in a row that are separated by nothing except spaces and commas (so parts of sentences only). However, I want it to match every instance of 3 consecutive words in a sentence, including overlapping ones.
So in this ultra simple example:
Hi this is Bob.
There should be 2 captures - "Hi this is" and "this is Bob". I can't seem to figure out how to get the regex engine to parse the entire statement this way. Any thoughts?

You cannot just get overlapping texts in capturing groups, but you can obtain overlapping matches with capturing groups holding the substrings you need.
Use
(?=\b(\w+(?:[\s,]+\w+){2})\b)
See the regex demo
The unanchored positive lookahead tests for an empty string match at every position of a string. It does not consume characters, but can still return submatches obtained with capturing groups.
Regex breakdown:
\b - a word boundary
(\w+(?:[\s,]+\w+){2}) - 3 "words" separated with , or a whitespace.
\w+ - 1 or more word characters (letters, digits, or _) followed by
(?:[\s,]+\w+){2} - 2 sequences of 1 or more whitespace characters or commas followed by 1 or more word characters.
This pattern is just put into a capturing group (...) that is placed inside the lookahead (?=...).
Word boundaries are important in this expression because \b prevents matching inside a word (between two word characters). As the lookahead is not anchored, it tests all positions inside the input string, and \b restricts where a match can be returned.
In C#, you just need to collect all the match.Groups[1].Value strings, e.g. like this:
// requires: using System.Linq; using System.Text.RegularExpressions;
var s = "Hi this is Bob.";
var results = Regex.Matches(s, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)")
    .Cast<Match>()
    .Select(p => p.Groups[1].Value)
    .ToList();
See the IDEONE demo
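To see what the word boundaries buy you, here is a minimal sketch that runs the pattern above next to a variant with both \b removed. The class name OverlapDemo and the helper Triples are made up for illustration, and the boundary-free pattern is only there for comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Collects Group 1 from every zero-width lookahead match (hypothetical helper).
    public static List<string> Triples(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToList();

    public static void Main()
    {
        var s = "Hi this is Bob.";

        // With \b the lookahead can only succeed at the start of a word.
        var bounded = Triples(s, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)");
        Console.WriteLine(string.Join(" | ", bounded)); // Hi this is | this is Bob

        // Without \b it also succeeds in the middle of "Hi" and "this",
        // so fragments such as "i this is" show up as well.
        var unbounded = Triples(s, @"(?=(\w+(?:[\s,]+\w+){2}))");
        Console.WriteLine(string.Join(" | ", unbounded));
    }
}
```

The lookahead itself matches an empty string at each position, which is why only Groups[1] carries text.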

Related

Regex Match all characters until reach character, but also include last match

I'm trying to find all Color Hex codes using Regex.
I have this string value for example - #FF0000FF#0038FFFF#51FF00FF#F400FFFF and I use this:
#.+?(?=#)
pattern to match all characters until it reaches the next #, but it never returns the last value, which should be the final match.
I'm kind of new to this Regex stuff. How could I also get the last match?
Your regex does not match the last value because your regex (with the positive lookahead (?=#)) requires a # to appear after an already consumed value, and there is no # at the end of the string.
You may use
#[^#]+
See the regex demo
The [^#] negated character class matches any char but # (+ means 1 or more occurrences) and does not require a # to appear immediately to the right of the currently matched value.
In C#, you may collect all matches using
var result = Regex.Matches(s, @"#[^#]+")
.Cast<Match>()
.Select(x => x.Value)
.ToList();
A more precise pattern you may use is #[A-Fa-f0-9]{8}. It matches a # and then exactly 8 hex chars: digits, or letters from a to f and A to F.
Don't rely upon any characters after the #; match hex characters and it will work every time.
(?i)#[a-f0-9]+
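As a rough check of the difference, the sketch below runs the original lookahead pattern and the stricter #[A-Fa-f0-9]{8} pattern on the sample string from the question. The names HexDemo and Colors are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HexDemo
{
    // Returns every full match of the given pattern (hypothetical helper).
    public static List<string> Colors(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    public static void Main()
    {
        var s = "#FF0000FF#0038FFFF#51FF00FF#F400FFFF";

        // The original pattern needs a # after each value, so the last one is lost.
        Console.WriteLine(Colors(s, @"#.+?(?=#)").Count);        // 3

        // Matching exactly 8 hex digits does not depend on what follows.
        Console.WriteLine(Colors(s, @"#[A-Fa-f0-9]{8}").Count);  // 4
    }
}
```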

Regex to return the word before the match

I've been trying to extract the word before the match. For example, I have the following sentence:
"Allatoona was a town located in extreme southeastern Bartow County, Georgia."
I want to extract the word before "County", i.e. "Bartow".
I've tried the following regex to extract that word:
\w\sCounty,
What I get returned is "w County" when what I wanted is just the word Bartow.
Any assistance would be greatly appreciated. Thanks!
You can use this regex with a lookahead to find word before County:
\w+(?=\s+County)
(?=\s+County) is a positive lookahead that asserts presence of 1 or more whitespaces followed by word County ahead of current match.
RegEx Demo
If you want to avoid lookahead then you can use a capture group:
(\w+)\s+County
and extract captured group #1 from match result.
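Both variants can be tried side by side in C#. This is only a sketch; the class name WordBeforeDemo and its two method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBeforeDemo
{
    // Lookahead variant: the match itself is the word.
    public static string WithLookahead(string input) =>
        Regex.Match(input, @"\w+(?=\s+County)").Value;

    // Capture-group variant: the word is in Group 1.
    public static string WithGroup(string input) =>
        Regex.Match(input, @"(\w+)\s+County").Groups[1].Value;

    public static void Main()
    {
        var s = "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";
        Console.WriteLine(WithLookahead(s)); // Bartow
        Console.WriteLine(WithGroup(s));     // Bartow
    }
}
```

If there is no match, both methods return an empty string, so check Match.Success first when that case matters.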
Your \w\sCounty, regex returns w County because \w matches a single character that is either a letter, digit, or _. It does not match a whole word.
To match 1 or more symbols, you need to use a + quantifier and to capture the part you need to extract you can rely on capturing groups, (...).
So, you can fix your pattern by merely replacing \w with (\w+) and then, after getting a match, accessing Match.Groups[1].Value.
However, if the county name contains a non-word symbol, like a hyphen, \w+ won't match it. A \S+ matching 1 or more non-whitespace symbols might turn out a better option in that case.
See a C# demo:
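var s = "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";
var m = Regex.Match(s, @"(\S+)\s+County");
if (m.Success)
{
    // Group 1 holds the non-whitespace run right before "County"
    Console.WriteLine(m.Groups[1].Value);
}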
See a regex demo.
You can use this regex to find the word before County:
([\w]*.?\s+).?County
[\w]* matches zero or more word characters.
The .? allows for one optional character after the word, in case there is a special character in the sentence like (,.!).
The \s+ matches the blank spaces (it still works if there is a double blank space in the sentence).
The .? before County allows for one optional special character placed there.
If you want to match more than one word, just add {n} after the group, like this: ([\w]*.?\s+){3}.?County

Regex pattern to separate string with semicolon and plus

I have used the code below.
MatchCollection matches = Regex.Matches(cellData, @"(^\[.*\]$|^\[.*\]_[0-9]*$)");
The only thing this pattern is not doing is separating the semicolon and plus from the main string.
A sample string is
[dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount]
I am trying to extract
[dbServer]
[ciDBNAME]
[dbLogin]
[dbPasswd]
[SIM_ErrorFound#1]
[#IterationCount]
from the string.
To extract the stuff in square brackets from [dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount] (which is what I assume you're trying to do),
the regular expression (shown unquoted, i.e. not as a C# string literal) should be
\[([^\]]*)\]
You should not use ^ and $, as you're not interested in the start and end of the string. The parentheses capture the zero or more characters inside each pair of square brackets.
If you want to be more specific about what you're capturing in the brackets, you'll need to change the [^\]] to something else.
Your regex - (^\[.*\]$|^\[.*\]_[0-9]*$) - matches any full string that starts with [, then contains zero or more chars other than a newline, and ends with ] (\]$) or with _ followed with 0+ digits (_[0-9]*$). You could also write the pattern as ^\[.*](?:_[0-9]*)?$ and it would work the same.
However, you need to match multiple substrings inside a larger string. Thus, you should have removed the ^ and $ anchors and retried. Then, you would find out that .* is too greedy and matches from the first [ up to the last ]. To fix that, it is best to use a negated character class solution. E.g. you may use [^][]* that matches 0+ chars other than [ and ].
Edit: It seems you need to get only the text inside square brackets.
You need to use a capturing group, a pair of unescaped parentheses around the part of the pattern you need to get and then access the value by the group ID (unnamed groups are numbered starting with 1 from left to right):
var results = Regex.Matches(s, @"\[([^][]+)]")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
See the .NET regex demo
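The question lists the expected values with their brackets, so here is a small sketch that returns the whole match instead of Group 1. It assumes a slightly more explicit spelling of the same negated class, \[[^\]\[]+\]; the names BracketDemo and Tokens are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Returns each [...] token including its brackets (hypothetical helper).
    public static List<string> Tokens(string input) =>
        Regex.Matches(input, @"\[[^\]\[]+\]")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    public static void Main()
    {
        var s = "[dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount]";
        foreach (var token in Tokens(s))
        {
            Console.WriteLine(token); // one token per line, [dbServer] first
        }
    }
}
```

Because the class excludes both [ and ], a match can never run across two bracketed tokens the way .* does.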

C# regex repeating group of the digit in pattern

I use this regex pattern:
var pattern = "ID\\d+.*?ID\\d+";
var input = "ID1...sometxt1...ID1...sometxt2...ID3...sometxt3...ID50";
input = Regex.Replace(input, pattern, "");
Console.WriteLine(input);
The output is "...sometxt2...",
but I need the output
...sometxt2...ID3...sometxt3...ID50
I need the regex to find only groups that have the same digits after ID. ID3 != ID50, so this group must remain; ID1 == ID1, so this group must be removed.
Thanks!
If you need to replace the whole substrings from ID having the same digits after them, you need to use a capturing group with a backreference:
var pattern = @"\bID(\d+).*?\bID\1\b";
See the regex demo
Explanation:
\bID - "ID" at the start of a word
(\d+) - one or more digits captured into Group 1
.*? - any characters but a newline, as few as possible up to the closest
\bID - "ID" at the start of a word followed with....
\1 - backreference to the matched digits in Group 1
\b - followed with a word boundary (so that we do not match 10 if we have 1 in Group 1).
Note that you will need RegexOptions.Singleline modifier if you have newline characters in your input strings.
Also, do not forget to assign the replacement result to a variable:
var res = Regex.Replace(input, pattern, string.Empty);
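Put together, a minimal sketch of the whole replacement could look like this. The class and method names are hypothetical, and RegexOptions.Singleline is added only on the assumption that the input may contain newlines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SameIdDemo
{
    // Removes every span that runs from IDn to the next IDn with the same digits.
    public static string RemoveSameIdSpans(string input)
    {
        var pattern = @"\bID(\d+).*?\bID\1\b";
        // Singleline lets . match newlines too (assumption: input may be multi-line).
        return Regex.Replace(input, pattern, string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        var input = "ID1...sometxt1...ID1...sometxt2...ID3...sometxt3...ID50";
        Console.WriteLine(RemoveSameIdSpans(input));
        // ...sometxt2...ID3...sometxt3...ID50
    }
}
```

ID3 and ID50 are left alone because no later ID carries the same digits for the backreference to match.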

How To get text between 2 strings?

The string from which I want to extract the text is given below.
String:
Hello Mr John and Hello Ms Rita
Regex
Hello(.*?)Rita
I am trying to get the text between 2 strings, "Hello" and "Rita". I am using the regex given above, but it is giving me
Mr John and Hello Ms
which is wrong. I need only "Ms". Can anyone help me write a proper regex for this situation?
Use a tempered greedy token:
Hello((?:(?!Hello|Rita).)*)Rita
      ^^^^^^^^^^^^^^^^^^^^
See regex demo here
The (?:(?!Hello|Rita).)* is the tempered greedy token: it matches any character other than a newline, one at a time, as long as that character does not start a Hello or Rita sequence. You may add word boundaries \b if you need to check for whole words.
In order to get a Ms without spaces on both ends, use this regex variation:
Hello\s*((?:(?!Hello|Rita).)*?)\s*Rita
Adding the ? to * will form a lazy quantifier *? that matches as few characters as needed to find a match, and \s* will match zero or more whitespaces.
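In C#, the trimmed variant can be tried roughly like this. It is a sketch only; TemperedDemo and Between are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemperedDemo
{
    // Returns the text between the closest Hello ... Rita pair, or "" if none.
    public static string Between(string input)
    {
        var m = Regex.Match(input, @"Hello\s*((?:(?!Hello|Rita).)*?)\s*Rita");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // The first Hello cannot pair with Rita because the token refuses
        // to step over the second Hello, so the match starts at the second one.
        Console.WriteLine(Between("Hello Mr John and Hello Ms Rita")); // Ms
    }
}
```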
To get the match closest to the ending word, put a greedy .* in front of the initial word so that it consumes as much as possible before Hello is tried.
.*Hello(.*?)Rita
See demo at regex101
Or without whitespace in captured: .*Hello\s*(.*?)\s*Rita
Or with use of two capture groups: .*(Hello\s*(.*?)\s*Rita)
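A quick sketch of the greedy-prefix idea in C#, showing the plain and the whitespace-trimming variants. The class name GreedyPrefixDemo and the helper Capture are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyPrefixDemo
{
    // Returns Group 1 of the first match for the given pattern, or "" if none.
    public static string Capture(string input, string pattern)
    {
        var m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        var s = "Hello Mr John and Hello Ms Rita";

        // The leading .* backtracks from the end, so the last Hello is used.
        Console.WriteLine("[" + Capture(s, @".*Hello(.*?)Rita") + "]");       // [ Ms ]
        Console.WriteLine("[" + Capture(s, @".*Hello\s*(.*?)\s*Rita") + "]"); // [Ms]
    }
}
```

Note that the overall match (Groups[0]) now covers the whole string up to Rita, so only the capture group is useful here.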
Your (.*?) is picking up too much text because the regex engine starts at the leftmost "Hello", and .*? then matches any characters up to the first "Rita" it can reach. So it grabs everything from the first "Hello" to "Rita" at the end.
One easy way you could get what you want is with this regular expression:
Hello (\S+) Rita
\S matches any non-whitespace character, so \S+ matches any consecutive string of non-whitespace characters, i.e. a single word.
This would be a bit more robust, allowing for multiple spaces or other whitespace between the words:
Hello\s+(\S+)\s+Rita
Demo
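You can also use a lookbehind and a lookahead, (?<=Hello).*?(?=Rita), so that the match itself excludes both delimiters. Note that on this input it still starts right after the first Hello, so it returns " Mr John and Hello Ms " rather than just "Ms".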
