I'm new to regex and was hoping for a pointer towards finding matches for words which are between { } brackets which are words and the first letter is uppercase and the second is lowercase. So I want to ignore any numbers also words which contain numbers
{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}
so I would only want to bring back matches for:
Test
Tesgd
Abc
I've looked at using \b and \w for words that are bound and [Az] for upper followed by lower but not sure how to only get the words which are between the brackets only as well.
Here is your solution:
Regex r = new Regex(#"(?<={[^}]*?({(?<depth>)[^}]*?}(?<-depth>))*?[^}]*?)(?<myword>[A-Z][a-z]+?)(?=,|}|\Z)", RegexOptions.ExplicitCapture);
string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
var m = r.Matches(s);
foreach (Match match in m)
Console.WriteLine(match.Groups["myword"].Value);
I assumed it is OK to match inside but not the deepest level paranthesis.
Let's dissect the regex a bit. AAA means an arbitrary expression. www means an arbitrary identifier (sequence of letters)
. is any character
[A-Z] is as you can guess any upper case letter.
[^}] is any character but }
,|}|\Z means , or } or end-of-string
*? means match what came before 0 or more times but lazily (Do a minimal match if possible and spit what you swallowed to make as many matches as possible)
(?<=AAA) means AAA should match on the left before you really try
to match something.
(?=AAA) means AAA should match on the right
after you really match something.
(?<www>AAA) means match AAA and give the string you matched the name www. Only used with ExplicitCapture option.
(?<depth>) matches everything but also pushes "depth" on the stack.
(?<-depth>) matches everything but also pops "depth" from the stack. Fails if the stack is empty.
We use the last two items to ensure that we are inside a paranthesis. It would be much simpler if there were no nested paranthesis or matches occured only in the deepest paranthesis.
The regular expression works on your example and probably has no bugs. However I tend to agree with others, you should not blindly copy what you cannot understand and maintain. Regular expressions are wonderful but only if you are willing to spend effort to learn them.
Edit: I corrected a careless mistake in the regex. (replaced .*? with [^}]*? in two places. Morale of the story: It's very easy to introduce bugs in Regex's.
In answer your original question, I would have offered this regex:
\b[A-Z][a-z]+\b(?=[^{}]*})
The last part is a positive lookahead; it notes the current match position, tries to match the enclosed subexpression, then returns the match position to where it started. In this case, it starts at the end of the word that was just matched and gobbles up as many characters it can as long as they're not { or }. If the next character after that is }, it means the word is inside a pair of braces, so the lookahead succeeds. If the next character is {, or if there's no next character because it's at the end of the string, the lookahead fails and the regex engine moves on to try the next word.
Unfortunately, that won't work because (as you mentioned in a comment) the braces may be nested. Matching any kind of nested or recursive structure is fundamentally incompatible with the way regexes work. Many regex flavors offer that capability anyway, but they tend to go about it in wildly different ways, and it's always ugly. Here's how I would do this in C#, using Balanced Groups:
Regex r = new Regex(#"
\b[A-Z][a-z]+\b
(?!
(?>
[^{}]+
|
{ (?<Open>)
|
} (?<-Open>)
)*
$
(?(Open)(?!))
)", RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
foreach (Match m in r.Matches(s))
{
Console.WriteLine(m.Value);
}
output:
Testc
Testf
Testh
I'm still using a lookahead, but this time I'm using the group named Open as a counter to keep track of the number of opening braces relative to the number of closing braces. If the word currently under consideration is not enclosed in braces, then by the time the lookahead reaches the end of the string ($), the value of Open will be zero. Otherwise, whether it's positive or negative, the conditional construct - (?(Open)(?!)) - will interpret it as "true" and try to try to match (?!). That's a negative lookahead for nothing, which is guaranteed to fail; it's always possible to match nothing.
Nested or not, there's no need to use a lookbehind; a lookahead is sufficient. Most flavors place such severe restrictions on lookbehinds that nobody would even think to try using them for a job like this. .NET has no such restrictions, so you could do this in a lookbehind, but it wouldn't make much sense. Why do all that work when the other conditions--uppercase first letter, no digits, etc--are so much cheaper to test?
Do the filtering in two steps. Use the regular expression
#"\{(.*)\}"
to pull out the pieces between the brackets, and the regular expression
#"\b([A-Z][a-z]+)\b"
to pull out each of the words that begins with a capital letter and is followed by lower case letters.
Related
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I'm stumped on how to even go about this.
I am trying to match the string "ashi" but not if the word containing it is in a small list of known false positives like "flashing", "lashing", "smashing". The false positive words can appear in the string as well as long as the string "ashi" (not as part of one of the false positive words) is in the string it should return true.
I'm using C# and I was trying to go about it without using regular expressions, but I am having no luck.
These strings should return true
...somethingashisomething...
...something2!ashi*&something...
... something ashi something flashing...
These strings should return false
...somethingflashingsomething...
...smashingthesomething...
...the lashings are too tight...
Another option might be to use a negative lookbehind with a nested lookahead to match words that start with fl but not if they are followed by ashing to match ashi but not flashing.
(?<!\bfl(?=ashing\b))ashi
Explanation
(?<! Negative lookbehind, assert what is directly on the right is not
\bfl Word boundary, match fb
(?= Positive lookahead, assert what is directly on the right is
ashing\b Match ashing and word boundary
) Close positive lookahead
) Close positive lookbehind.
ashi Match literally
.NET Regex demo
Update
If you want to match and not match the updated values, you could use an alternation (?:sm|f?l) in the negative lookbehind to match sm or an optional f followed by l
(?<!(?:sm|f?l)(?=ashing))ashi
.NET regex demo | C# demo
You can make use of a capturing group:
(flashing)|ashi
If the first group is not empty, you matched flashing literally
The following will match ashi but not within flashing. I interpreted "word" loosely, so flashing is not required to be isolated as a separate word with space/punctuation delimiters.
(?<=(?<prefix>fl)|)ashi(?(prefix)(?!ng))
It is sufficient to return true/false over the entire pattern and won't require checking specific capture groups. In other words, it is usable with Regex.IsMatch().
Pattern details:
(?<= # Zero-width positive lookbehind: match but don't consume characters
(?<prefix>fl) # Named capture group to match "fl" at start of "flashing"
| # Alternate blank capture - will succeed if "fl" is not present
) # End lookbehind
ashi # match literal "ashi"
(?(prefix) # Conditional: Only match if named group prefix has successful capture (i.e. "fl" was matched)
(?!ng) # Zero-width negative loohahead: Fail match if "ng" follows
) # Close conditional (there is no false part, so match succeeds if "fl" was not present)
If flashing is only excluded as an isolated word, just add word boundary operators. This will match something like flashingwithnospace, whereas the first pattern would fail on that string:
(?<=(?<prefix>\bfl)|)ashi(?(prefix)(?!ng\b))
(FYI, the pattern will work in isolation, but if it is combined within another pattern, especially inside a repeating construction, it may not work due to the conditional on the named capture group. Once the named capture group has succeeded, the conditional will remain true while matching the larger pattern, even if it were to encounter another occurrence of ashi.)
The question gives the examples
...somethingashisomething...
...something2!ashi*&something...
... something ashi something...
The second and third examples can be found by including the word boundary \b in the search, i.e. search for \bashi\b. Finding the first example requires more knowledge of what the two enclosing somethings are. If they are alphanumeric then you need to specify the problem in much more detail.
I'm trying to capture everything inside curly bracers, but in some cases there may be multiple bracers and I want the external ones.
For example: I want to capture {{this}} part
I'll need {{this}} as the capture.
So I went with ({[^}]+}+) to capture the inner text, but of course this will yield multiple captures {{this} and {{this}}.
So I tried telling the regex to search for the phrase but only if the next character is not curly bracers: ({[^}]+}+)[^}]. This works, unless the capture is at the end of the input, in which case it doesn't work cause it expects a non } character at the end.
So I tried adding end of string option ({[^}]+}+)[$|^}], but for some reason, this will capture {{this} again. I have no idea why, it should only capture if the next char is end of input or not curly bracers...
Suggestions?
Edit:
Just to be clear, I'm not searching for valid nested parenthesis, only for text between { and the first matching } (no nesting!), however there may be cases where instead of one open/close brace there are two (so {something} and {{something}} both need to be caught).
The reason for this, is that the original text always has double braces {{ }}, but sometimes before the regex the text undergoes string.Format, in which case the double braces become single braces.
Generally, regex is not powerful enough to do this. However, .NET regex engine supports so-called Atomic Grouping, which let you process groups with balanced parentheses:
{(?>{(?<DEPTH>)|}(?<-DEPTH>)|[^}]+)*}(?(DEPTH)(?!))
If you want to match all text between braces, I think this should do the trick:
{+.*?}+
This matches everything between braces, taking all consecutive braces and as few internal characters as possible.
Further explanation: matches 1 or more { ({+), then any amount of any character (.*) but gives you the shortest string that does it (?), and finally matches 1+ } (}+). Without that ?, if you have {a} {b} it would match the whole thing instead of {a} and {b} separately.
If you won't want spaces between the braces, you can use this:
{+\S*?}+
If you only want letters, use \w instead of \S.
The only thing this is not validating is that the same amount of braces are used. Do you need that?
Result comparison (should be a comment).
Considering {{{{{{this}}}}}Blabla, I get this:
Regex author: c0d3rman
Matched string: {{{{{{this}}}}}B
Groups: 2 ({{{{{{this}}}}}B and {{{{{{this}}}}})
Captures: {{{{{{this}}}}}
Regex author: dasblinkenlight
Matched string: {{{{{this}}}}}
Groups: 2 ({{{{{this}}}}} and {})
Captures: {{{{{this}}}}}
Note: symmetric braces
Regex author: Andrew
Matched string: {{{{{{this}}}}}
Groups: {{{{{{this}}}}}
Captures: {{{{{{this}}}}}
You seem to have used a character class at the end instead of a non-capturing group. Try:
({[^}]+}+)(?:$|[^}])
This is a very small modification to your final attempt, that just uses correct syntax. In your final attempt you have [$|^}]. The issue with this is that you can't have an or | inside a character class []. Most special characters are escaped inside a character class, with a couple exceptions, one of which is ^ if it is the first character. So [$|^}] means any of the four literal characters $, |, ^, or }. What I did is change the syntax to what you intended by using a non-capturing group (?:stuff) this group does not save its contents and is purely for grouping. As such (?:$|[^}]) means an end-of-line or a non-}, as you wanted.
Note that this makes no effort to balance the curly braces (match the number of braces at the beginning and end).
If this
(°[0-5])
matches °4
and this
((°[0-5][0-9]))
matches °44
Why does this
((°[0-5])|(°[0-5][0-9]))
match °4 but not °44?
Because when you use logical OR in regex the regex engine returns the first match when it find a match with first part of regex (here °[0-5]), and in this case since °[0-5] match °4 in °44 it returns °4 and doesn't continue to match the other case (here °[0-5][0-9]):
((°[0-5])|(°[0-5][0-9]))
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
You are using shorter match first in regex alternation. Better use this regex to match both strings:
°[0-5][0-9]?
RegEx Demo
Because the alternation operator | tries the alternatives in the order specified and selects the first successful match. The other alternatives will never be tried unless something later in the regular expression causes backtracking. For instance, this regular expression
(a|ab|abc)
when fed this input:
abcdefghi
will only ever match a. However, if the regular expression is changed to
(a|ab|abc)d
It will match a. Then since the next characyer is not d it backtracks and tries then next alternative, matching ab. And since the next character is still not d it backtracks again and matches abc...and since the next character is d, the match succeeds.
Why would you not reduce your regular expression from
((°[0-5])|(°[0-5][0-9]))
to this?
°[0-5][0-9]?
It's simpler and easier to understand.
Since I am new of regular expression I well versed with Regex.
Can someone please help with the meaning of this Regex?
^(.)\1+$
^(.)\1+$ is a clever regular expression that matches to any full line that has two or more letters all of which are identical. E.g.:
aaa
BBBBBBB
See comments for explanation of what each part of the regular expression does.
The short answer: The first character in the subject is repeated one or more times and occupies the entire test subject. Put differently, the entire subject is occupied by two or more of the same character.
Since you're learning:
'^' is the begin of subject "anchor." It does not consume any data, just asserts a position. Similar to the metacharacter sequence \A if you encounter that, although the latter is not affected by line mode (research regex "mode modifiers").
'$' is the end of subject anchor. Again, a non consuming assertion. Similar to \Z metacharacter, although the latter is not affected by line mode. Close cousin of \z (although the latter has no regard for newlines and line modes). Anytime you see a regex framed with ^...$ it's asserting that the match condition is front to back.
"()" are capturing parenthesis. This means you can "refer back" to what was captured using \N where N is 1-9 corresponding to the order of capturing parenthesis, left to right. In your example, we only have one capture group, so it's referred to as \1. In addition to capturing for reference, groups are used for quantification--for repeating a pattern a specified number of times, and for alternation of strings or patterns, e.g.: (^|$), where "|" is a logical "or," this regex would test to see if the subject begins or ends with an underscore.
'.' (dot) is a metacharacter that represents any single character. (Roughly speaking, look up various match "modes" to see how you can control whether or not dot matches line break. Don't confuse "character" with octet or byte. Locales and character set encodings are too grand a subject to elaborate, but just be aware that the definition of a "character" is somewhat contextual.)
'+' is the one-or-more quantifier. \1+ means whatever character was captured by capture group 1 repeats one-or-more times (i.e., occurs two or more times). "aa" would match, where the first 'a' is matched by the dot captured in group \1, and the second 'a' matches because it satisfies the one-or-more quantification.
[SOME_WORDS:200:1000]
Trying to match just the last 1000 part. Both numbers are variable and can contain an unknown number of characters (although they are expected to contain digits, I cannot rule out that they may also contain other characters). The SOME_WORDS part is known and does not change.
So I begin by doing a positive lookbehind for [SOME_WORDS: followed by a positive lookahead for the trailing ]
That gives us the pattern (?<=\[SOME_WORDS:).*(?=])
And captures the part 200:1000
Now because I don't know how many characters are after SOME_WORDS:, but I know that it ends with another : I use .*: to indicate any character any amount of time followed by :
That gives us the pattern (?<=\[SOME_WORDS:.*:).*(?=])
However at this point the pattern no longer matches anything and this is where I become confused. What am I doing wrong here?
If I assume that the first number will always be 3 characters long I can replace .* with ... to get the pattern (?<=\[SOME_WORDS:...:).*(?=]) and this correctly captures just the 1000 part. However I don't understand why replacing ... with .* makes the pattern not capture anything.
EDIT:
It seems like the online tool I was using to test the regex pattern wasn't working correctly. The pattern (?<=\[SOME_WORDS:.*:).*(?=]) matches the 1000 with no issues when actually done in .net
You usually cannot use a + or a * in a lookbehind, only in a lookahead.
If c# does allow these than you could use a .*? instead of a .* as the .* will eat the second :
Try this:
(?<=\[SOME_WORDS:)(?=\d+:(\d+)])
The match wil be in the first capture group
Quote from http://www.regular-expressions.info/lookaround.html
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.
As Robert Smit mentions this is due to the * being a greedy operator. Greedy operators consume as many characters as they possibly can when they are matched first. They only give up characters if the match fails. If you make the greedy operator lazy(*?), then matching consumes as little number of characters as possible for the match to succeed, so the : is not consumed by *. You can also use [^:]* which is match any character other than :.