I'm trying to capture everything inside curly braces, but in some cases there may be multiple braces and I want the outermost ones.
For example: I want to capture {{this}} part
I'll need {{this}} as the capture.
So I went with ({[^}]+}+) to capture the inner text, but of course this yields multiple captures: {{this} and {{this}}.
So I tried telling the regex to match the phrase only if the next character is not a curly brace: ({[^}]+}+)[^}]. This works unless the capture is at the end of the input, in which case it fails because it expects a non-} character at the end.
So I tried adding an end-of-string option, ({[^}]+}+)[$|^}], but for some reason this captures {{this} again. I have no idea why; it should only capture if the next character is the end of input or not a curly brace...
Suggestions?
Edit:
Just to be clear, I'm not searching for valid nested parenthesis, only for text between { and the first matching } (no nesting!), however there may be cases where instead of one open/close brace there are two (so {something} and {{something}} both need to be caught).
The reason for this, is that the original text always has double braces {{ }}, but sometimes before the regex the text undergoes string.Format, in which case the double braces become single braces.
Generally, regular expressions are not powerful enough to do this. However, the .NET regex engine supports so-called balancing groups, which let you process text with balanced delimiters (the pattern below wraps them in an atomic group, (?>...)):
{(?>{(?<DEPTH>)|}(?<-DEPTH>)|[^}]+)*}(?(DEPTH)(?!))
If you want to match all text between braces, I think this should do the trick:
{+.*?}+
This matches everything between braces, taking all consecutive braces and as few internal characters as possible.
Further explanation: matches 1 or more { ({+), then any amount of any character (.*) but gives you the shortest string that does it (?), and finally matches 1+ } (}+). Without that ?, if you have {a} {b} it would match the whole thing instead of {a} and {b} separately.
If you don't want spaces between the braces, you can use this:
{+\S*?}+
If you only want letters, use \w instead of \S.
The only thing this is not validating is that the same amount of braces are used. Do you need that?
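As a rough illustration of the {+.*?}+ pattern above, here is a minimal C# sketch. The class name BraceRuns and the sample sentence are made up for this example; only the pattern comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BraceRuns
{
    // Returns every match of {+.*?}+ : a run of '{', the shortest possible
    // body, then a run of '}'. Brace counts are not checked for symmetry.
    public static List<string> Find(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, @"{+.*?}+"))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical input mixing double and single braces.
        foreach (string s in Find("I want {{this}} and {that} part"))
        {
            Console.WriteLine(s); // prints {{this}} and then {that}
        }
    }
}
```

Because the body is lazy, the two brace groups come back as separate matches instead of one long match.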
Result comparison (should be a comment).
Considering {{{{{{this}}}}}Blabla, I get this:
Regex author: c0d3rman
Matched string: {{{{{{this}}}}}B
Groups: 2 ({{{{{{this}}}}}B and {{{{{{this}}}}})
Captures: {{{{{{this}}}}}
Regex author: dasblinkenlight
Matched string: {{{{{this}}}}}
Groups: 2 ({{{{{this}}}}} and {})
Captures: {{{{{this}}}}}
Note: symmetric braces
Regex author: Andrew
Matched string: {{{{{{this}}}}}
Groups: {{{{{{this}}}}}
Captures: {{{{{{this}}}}}
You seem to have used a character class at the end instead of a non-capturing group. Try:
({[^}]+}+)(?:$|[^}])
This is a very small modification to your final attempt that just uses the correct syntax. In your final attempt you have [$|^}]. The issue is that you can't have an alternation | inside a character class []. Most special characters are treated as literals inside a character class, with a couple of exceptions, one of which is ^ when it is the first character. So [$|^}] means any one of the four literal characters $, |, ^, or }. I changed the syntax to what you intended by using a non-capturing group, (?:stuff), which does not save its contents and is purely for grouping. As such, (?:$|[^}]) means end of input or a non-} character, as you wanted.
Note that this makes no effort to balance the curly braces (match the number of braces at the beginning and end).
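To make the difference concrete, here is a small C# sketch that runs both patterns against an input ending in the braces. The class name TrailingBrace and the input string are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingBrace
{
    // Returns the text of capture group 1, or an empty string if nothing matched.
    public static string FirstCapture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string input = "capture {{this}}"; // braces sit at the very end of the input

        // Character class: must consume one of $ | ^ } literally, so the
        // engine gives back one '}' and the group ends up one brace short.
        Console.WriteLine(FirstCapture(@"({[^}]+}+)[$|^}]", input));     // {{this}

        // Non-capturing group: end of input OR a non-} character.
        Console.WriteLine(FirstCapture(@"({[^}]+}+)(?:$|[^}])", input)); // {{this}}
    }
}
```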
Related
I'm trying to match a string that does not allow two special characters in a row.
my regular expression is:
[RegularExpression(@"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
This solves all my requirements except the two issues below.
this is my string : bracks
acceptable :
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (by the way above regular expression solved this)
not acceptable:
issue 1: b.. or bra..cks, b..racks, bra...cks (two or more special characters together),
issue 2: bra  cks (two or more whitespace characters together)
You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1
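A quick way to try the lookahead pattern against the question's samples is a sketch like the following. The class name NameCheck is invented here; the pattern is the one above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameCheck
{
    // The negative lookahead rejects any string that contains two adjacent
    // characters from the set . & ' space -
    private static readonly Regex NoAdjacentSpecials =
        new Regex(@"^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$");

    public static bool IsValid(string s)
    {
        return NoAdjacentSpecials.IsMatch(s);
    }

    public static void Main()
    {
        // Samples taken from the question; the last one has two spaces.
        string[] samples = { "bra-cks", "b.r.a.c.ks", "bra cks", "bra..cks", "bra  cks" };
        foreach (string s in samples)
        {
            Console.WriteLine("\"" + s + "\" -> " + IsValid(s));
        }
    }
}
```

The first three samples are accepted and the last two are rejected.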
The goal
From what I can tell from your description and pattern, you are trying to match text that starts and ends with an alphanumeric character (due to ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ in your original pattern) and that does not contain two consecutive (adjacent) special characters, which, again guessing from the regex, are . & ' - and space.
What was wrong
^+ - I think you wanted to ensure that the match starts at the beginning of the line/string, so you don't need the + here
[a-zA-Z0-9.&' '-] - in this character class you doubled ' which is totally unnecessary
Solution
Please try pattern
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more repetitions of the preceding pattern
$ - anchor, match the end of the string
Regex demo
Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.
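The two corrected patterns differ only in how they treat a delimiter with spaces around it. The sketch below compares them; the class name DelimiterRules and the test strings are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterRules
{
    // First correction: exactly one delimiter between alphanumeric runs.
    public const string Strict = @"^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$";

    // Follow-up: a delimiter may also have one optional space on each side.
    public const string Spaced = @"^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$";

    public static bool Matches(string pattern, string s)
    {
        return Regex.IsMatch(s, pattern);
    }

    public static void Main()
    {
        // The last sample contains two consecutive spaces.
        string[] samples = { "bra-cks", "bra - cks", "bra  cks", "bra--cks" };
        foreach (string s in samples)
        {
            Console.WriteLine("\"" + s + "\": strict=" + Matches(Strict, s)
                + ", spaced=" + Matches(Spaced, s));
        }
    }
}
```

Only "bra - cks" is treated differently: the strict pattern rejects it, while the follow-up pattern accepts it.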
Here I have used the below mentioned code.
MatchCollection matches = Regex.Matches(cellData, @"(^\[.*\]$|^\[.*\]_[0-9]*$)");
The only thing this pattern is not doing is separating the semicolon and plus from the main string.
A sample string is
[dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount]
I am trying to extract
[dbServer]
[ciDBNAME]
[dbLogin]
[dbPasswd]
[SIM_ErrorFound#1]
[#IterationCount]
from the string.
To extract the stuff in square brackets from [dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount] (which is what I assume you're trying to do),
the regular expression (I haven't quoted it) should be
\[([^\]]*)\]
You should not use ^ and $, as you're not interested in the start and end of the string. The parentheses will capture every instance of zero or more characters inside square brackets.
If you want to be more specific about what you're capturing in the brackets, you'll need to change the [^\]] to something else.
Your regex - (^\[.*\]$|^\[.*\]_[0-9]*$) - matches any full string that starts with [, then contains zero or more chars other than a newline, and ends with ] (\]$) or with _ followed with 0+ digits (_[0-9]*$). You could also write the pattern as ^\[.*](?:_[0-9]*)?$ and it would work the same.
However, you need to match multiple substrings inside a larger string. Thus, you should have removed the ^ and $ anchors and retried. Then, you would find out that .* is too greedy and matches from the first [ up to the last ]. To fix that, it is best to use a negated character class solution. E.g. you may use [^][]* that matches 0+ chars other than [ and ].
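The greediness problem described above can be seen with a short C# sketch on the question's sample string. The class name BracketTokens is made up, and the negated class is written here in a fully escaped spelling, [^\]\[], which excludes the same two characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketTokens
{
    // Collects the full text of every match of the given pattern.
    public static List<string> Find(string pattern, string input)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string s = "[dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount]";

        // Greedy .* runs from the first [ to the last ], giving a single match.
        Console.WriteLine(Find(@"\[.*\]", s).Count); // 1

        // A negated class cannot cross a bracket, so each token is matched on its own.
        foreach (string token in Find(@"\[[^\]\[]*\]", s))
        {
            Console.WriteLine(token);
        }
    }
}
```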
Edit: It seems you need to get only the text inside square brackets.
You need to use a capturing group, a pair of unescaped parentheses around the part of the pattern you need to get and then access the value by the group ID (unnamed groups are numbered starting with 1 from left to right):
var results = Regex.Matches(s, @"\[([^][]+)]")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
See the .NET regex demo
I know the regex for excluding words, roughly anyway, It would be (!?wordToIgnore|wordToIgnore2|wordToIgnore3)
But I have an existing, complicated regex that I need to add this to, and I am a bit confused about how to go about that. I'm still pretty new to regex, and it took me a very long time to make this particular one, but I'm not sure where to insert it or how ...
The regex I have is ...
^(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:/\p{L}'-]{1,64}$)$
This should only allow the person typing to insert between 1 and 64 letters that match that pattern, cannot start with a space, quote, double quote, special character, a dash, an escape character, etc, and only allows a-z both upper and lowercase, can include a space, ":", a dash, and a quote anywhere but the beginning.
But I want to forbid them from using certain words, so I have this list of words that I want to be forbidden, I just cannot figure out how to get that to fit into here.. I tried just pasting the whole .. "block" in, and that didn't work.
?!the|and|or|a|given|some|that|this|then|than
Has anyone encountered this before?
ciel, first off, congratulations for getting this far trying to build your regex rule. If you want to read something detailed about all kinds of exclusions, I suggest you have a look at Match (or replace) a pattern except in situations s1, s2, s3 etc
Next, in your particular situation, here is how we could approach your regex.
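For concision, let's make the negative lookaheads more compact, replacing them with a single (?!.*(?: |-|'){2}). Note that this is slightly stricter than the original three: it also rejects mixed pairs, such as a space followed by a dash.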
In your character class, the \: just escapes the colon, needlessly so as : is enough. I assume you wanted to add a backslash character, and if so we need to use \\
\p{L} includes [a-zA-Z], so you can drop [a-zA-Z]. But are you sure you want to match all letters in any script? (Thai etc). If so, remember to set the u flag after the regex string.
For your "bad word exclusion" applying to the whole string, place it at the same position as the other lookarounds, i.e., at the head of the string, but using the .* as in your other exclusions: (?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3)) It does not matter which lookahead comes first because lookarounds do not change your position in the string. For more on this, see Mastering Lookahead and Lookbehind
This gives us this glorious regex (I added the case-insensitive flag):
^(?i)(?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3))(?!.*(?: |-|'){2})(?:[\\0-9 :/\p{L}'-]{1,64}$)$
Of course if you don't want unicode letters, replace \p{L} with a-z
Also, if you want to make sure that the wordToIgnore is a real word, as opposed to an embedded string (for instance you don't want cat but you are okay with catalog), add boundaries to the lookahead rule: (?!.*\b(?:wordToIgnore|wordToIgnore2|wordToIgnore3)\b)
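The effect of the word boundaries can be sketched in C# as follows. This is a simplified variant, not the full regex above: the class name ForbiddenWords is invented, the banned list is cut down to three words from the question's list (the, and, or), and the double-character checks are left out to keep the focus on the word lookahead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ForbiddenWords
{
    // Allowed characters: letters, digits, space, colon, slash, quote, dash.
    private const string Allowed = @"[\p{L}0-9 :/'-]{1,64}$";

    // wholeWordsOnly = true adds \b around the banned words, so "Band" is
    // fine while "and" on its own is rejected.
    public static bool IsAllowed(string s, bool wholeWordsOnly)
    {
        string ban = wholeWordsOnly
            ? @"(?!.*\b(?:the|and|or)\b)"
            : @"(?!.*(?:the|and|or))";
        return Regex.IsMatch(s, "^" + ban + Allowed, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string[] samples = { "Band practice", "Salt and pepper", "Plain title" };
        foreach (string s in samples)
        {
            Console.WriteLine("\"" + s + "\": whole words=" + IsAllowed(s, true)
                + ", substrings=" + IsAllowed(s, false));
        }
    }
}
```

"Band practice" passes only with the boundaries in place, since "and" occurs inside "Band".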
use this:
^(?!.*(the|and|or|a|given|some|that|this|then|than))(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:\p{L}'-]{1,64}$)$
see demo
I'm new to regex and was hoping for a pointer towards finding words between { } brackets whose first letter is uppercase and whose second is lowercase. I want to ignore any numbers, and also words that contain numbers.
{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}
so I would only want to bring back matches for:
Test
Tesgd
Abc
I've looked at using \b and \w for words that are bound and [Az] for upper followed by lower but not sure how to only get the words which are between the brackets only as well.
Here is your solution:
Regex r = new Regex(@"(?<={[^}]*?({(?<depth>)[^}]*?}(?<-depth>))*?[^}]*?)(?<myword>[A-Z][a-z]+?)(?=,|}|\Z)", RegexOptions.ExplicitCapture);
string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
var m = r.Matches(s);
foreach (Match match in m)
Console.WriteLine(match.Groups["myword"].Value);
I assumed it is OK to match words that are inside braces but not at the deepest nesting level.
Let's dissect the regex a bit. AAA means an arbitrary expression. www means an arbitrary identifier (sequence of letters)
. is any character
[A-Z] is as you can guess any upper case letter.
[^}] is any character but }
,|}|\Z means , or } or end-of-string
*? means match what came before 0 or more times, but lazily (match as little as possible, giving back what was consumed so the rest of the pattern can match)
(?<=AAA) means AAA must match to the left of the current position before the engine tries to match anything there.
(?=AAA) means AAA must match to the right of the current position, after what has been matched so far.
(?<www>AAA) means match AAA and give the matched string the name www. With the ExplicitCapture option, only named groups like this one capture.
(?<depth>) matches the empty string and pushes a capture onto the "depth" stack.
(?<-depth>) matches the empty string and pops a capture from the "depth" stack. It fails if the stack is empty.
We use the last two items to ensure that we are inside braces. It would be much simpler if there were no nested braces, or if matches occurred only at the deepest level.
The regular expression works on your example and probably has no bugs. However, I tend to agree with the others: you should not blindly copy what you cannot understand and maintain. Regular expressions are wonderful, but only if you are willing to spend the effort to learn them.
Edit: I corrected a careless mistake in the regex (replaced .*? with [^}]*? in two places). Moral of the story: it's very easy to introduce bugs in regexes.
In answer to your original question, I would have offered this regex:
\b[A-Z][a-z]+\b(?=[^{}]*})
The last part is a positive lookahead; it notes the current match position, tries to match the enclosed subexpression, then returns the match position to where it started. In this case, it starts at the end of the word that was just matched and gobbles up as many characters it can as long as they're not { or }. If the next character after that is }, it means the word is inside a pair of braces, so the lookahead succeeds. If the next character is {, or if there's no next character because it's at the end of the string, the lookahead fails and the regex engine moves on to try the next word.
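Here is a minimal C# sketch of that lookahead on the question's sample, plus a nested input to show where it falls short. The class name InBraces and the nested string are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InBraces
{
    // Capitalized all-letter words that are followed by a '}' with no
    // '{' or '}' in between.
    public static List<string> Words(string input)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\b[A-Z][a-z]+\b(?=[^{}]*})"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        // Sample from the question: prints Test, Tesgd, Abc.
        string flat = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(", ", Words(flat)));

        // Hypothetical nested input: Aa is inside braces, but the next
        // brace after it is '{', so the lookahead fails. Prints Bb, Cc.
        Console.WriteLine(string.Join(", ", Words("{ Aa { Bb } Cc }")));
    }
}
```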
Unfortunately, that won't work because (as you mentioned in a comment) the braces may be nested. Matching any kind of nested or recursive structure is fundamentally incompatible with the way regexes work. Many regex flavors offer that capability anyway, but they tend to go about it in wildly different ways, and it's always ugly. Here's how I would do this in C#, using balancing groups:
Regex r = new Regex(@"
\b[A-Z][a-z]+\b
(?!
(?>
[^{}]+
|
{ (?<Open>)
|
} (?<-Open>)
)*
$
(?(Open)(?!))
)", RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
foreach (Match m in r.Matches(s))
{
Console.WriteLine(m.Value);
}
output:
Testc
Testf
Testh
I'm still using a lookahead, but this time I'm using the group named Open as a counter to keep track of the number of opening braces relative to the number of closing braces. If the word currently under consideration is not enclosed in braces, then by the time the lookahead reaches the end of the string ($), the value of Open will be zero. Otherwise, whether it's positive or negative, the conditional construct - (?(Open)(?!)) - will interpret it as "true" and try to match (?!). That's a negative lookahead for nothing, which is guaranteed to fail; it's always possible to match nothing.
Nested or not, there's no need to use a lookbehind; a lookahead is sufficient. Most flavors place such severe restrictions on lookbehinds that nobody would even think to try using them for a job like this. .NET has no such restrictions, so you could do this in a lookbehind, but it wouldn't make much sense. Why do all that work when the other conditions--uppercase first letter, no digits, etc--are so much cheaper to test?
Do the filtering in two steps. Use the regular expression
@"\{(.*)\}"
to pull out the pieces between the brackets, and the regular expression
@"\b([A-Z][a-z]+)\b"
to pull out each of the words that begins with a capital letter and is followed by lower case letters.
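A rough C# sketch of the two-step idea follows. It swaps in a negated class, \{([^{}]*)\}, for the first step so that each brace pair is captured separately (the greedy .* would run from the first { to the last }). It assumes the braces are not nested, and the class name TwoStep is invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStep
{
    public static List<string> CapitalizedWords(string input)
    {
        var words = new List<string>();

        // Step 1: pull out the text between each (non-nested) pair of braces.
        foreach (Match block in Regex.Matches(input, @"\{([^{}]*)\}"))
        {
            // Step 2: within that text, keep words made of one uppercase
            // letter followed by lowercase letters only.
            foreach (Match word in Regex.Matches(block.Groups[1].Value, @"\b[A-Z][a-z]+\b"))
            {
                words.Add(word.Value);
            }
        }
        return words;
    }

    public static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        foreach (string w in CapitalizedWords(s))
        {
            Console.WriteLine(w); // Test, Tesgd, Abc (one per line)
        }
    }
}
```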
I have an input like the following
[a href=http://twitter.com/suddentwilight][font][b][i]#suddentwilight[/font][/a] My POV: Rakhi Sawant hits below the belt & does anything for attention... [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained the grace/decency :)
Now I need to get the strings #suddentwilight and http://www.test.com that come inside the anchor tags. There might be some tags like [b] or [i] wrapping the actual text; I need to ignore those.
Basically, I need to match a string that starts with [a], then get the string/URL before the closing [/a] tag.
Please suggest.
I don't know C#, but here's a regex:
/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/
This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[tag2][tag1][/a] and capture text.
To explain:
the /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes - in which case the forward slashes aren't necessary.
\[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
[^\]] is an example of a character class - here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
* and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
\s is a shorthand for the character class of whitespace characters
(?:...) allows you to group the contents into a single unit (so that a suffix like * applies to the whole group) without capturing anything.
(...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the linktext.
. matches any single character.
*? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite - it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
This will only remove [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:
/\[[^\]]+\]/
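Since the question asks about C#, here is a hedged sketch of both passes using the .NET Regex class. The class name LinkText is made up; the patterns are the two above without the /.../ delimiters, and the forward slash needs no escaping in a .NET pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkText
{
    // First pass: capture the text of each [a ...]...[/a] element, skipping
    // tags directly after the opening tag and directly before the closing one.
    private static readonly Regex Anchor =
        new Regex(@"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[/a\]");

    // Second pass: any remaining [tag] in the middle of the captured text.
    private static readonly Regex AnyTag = new Regex(@"\[[^\]]+\]");

    public static List<string> Extract(string input)
    {
        var texts = new List<string>();
        foreach (Match m in Anchor.Matches(input))
        {
            texts.Add(AnyTag.Replace(m.Groups[1].Value, ""));
        }
        return texts;
    }

    public static void Main()
    {
        string input = "[a href=http://twitter.com/suddentwilight][font][b][i]#suddentwilight[/font][/a] "
            + "My POV [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained";
        foreach (string text in Extract(input))
        {
            Console.WriteLine(text); // #suddentwilight, then http://www.test.com
        }
    }
}
```

The sample input in Main is a shortened version of the one in the question.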