Regex pattern to separate string with semicolon and plus - c#

Here is the code I have used:
MatchCollection matches = Regex.Matches(cellData, @"(^\[.*\]$|^\[.*\]_[0-9]*$)");
The only thing this pattern is not doing is splitting the main string at the semicolons and the plus.
A sample string is
[dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount]
I am trying to extract
[dbServer]
[ciDBNAME]
[dbLogin]
[dbPasswd]
[SIM_ErrorFound#1]
[#IterationCount]
from the string.

To extract the stuff in square brackets from [dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount] (which is what I assume you're trying to do),
the regular expression (I haven't quoted it) should be
\[([^\]]*)\]
You should not use ^ and $, as you're not interested in the start and end of the string. The parentheses capture each run of zero or more characters inside square brackets.
If you want to be more specific about what you're capturing in the brackets, you'll need to change the [^\]]* to something else.
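A minimal C# sketch of the pattern above, applied to the sample string from the question. The class and method names (BracketTokens, Extract) are made up for illustration. It returns the whole match, brackets included, since that is the output the question lists; Groups[1] would give the text without the brackets.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketTokens
{
    // Hypothetical helper: returns every "[...]" token, brackets included.
    public static string[] Extract(string input) =>
        Regex.Matches(input, @"\[([^\]]*)\]")
             .Cast<Match>()
             .Select(m => m.Value) // m.Groups[1].Value would drop the brackets
             .ToArray();

    public static void Main()
    {
        var s = "[dbServer];[ciDBNAME];[dbLogin];[dbPasswd] AND [SIM_ErrorFound#1]_+[#IterationCount]";
        foreach (var token in Extract(s))
            Console.WriteLine(token); // one bracketed token per line
    }
}
```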

Your regex - (^\[.*\]$|^\[.*\]_[0-9]*$) - matches only a full string that starts with [, then contains zero or more chars other than a newline, and ends either with ] (\]$) or with ]_ followed by 0+ digits (\]_[0-9]*$). You could also write the pattern as ^\[.*](?:_[0-9]*)?$ and it would work the same.
However, you need to match multiple substrings inside a larger string. Thus, you should have removed the ^ and $ anchors and retried. Then, you would find out that .* is too greedy and matches from the first [ up to the last ]. To fix that, it is best to use a negated character class solution. E.g. you may use [^][]* that matches 0+ chars other than [ and ].
Edit: It seems you need to get only the text inside square brackets.
You need to use a capturing group, a pair of unescaped parentheses around the part of the pattern you need to get and then access the value by the group ID (unnamed groups are numbered starting with 1 from left to right):
// Requires: using System.Linq; using System.Text.RegularExpressions;
var results = Regex.Matches(s, @"\[([^][]+)]")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

Related

Regex Match all characters until reach character, but also include last match

I'm trying to find all Color Hex codes using Regex.
I have this string value for example - #FF0000FF#0038FFFF#51FF00FF#F400FFFF and I use this:
#.+?(?=#)
pattern to match all characters until it reaches #, but it stops at the last character, which should be the last match.
I'm kind of new to this Regex stuff. How could I also get the last match?
Your regex does not match the last value because your regex (with the positive lookahead (?=#)) requires a # to appear after an already consumed value, and there is no # at the end of the string.
You may use
#[^#]+
The [^#] negated character class matches any char but # (+ means 1 or more occurrences) and does not require a # to appear immediately to the right of the currently matched value.
In C#, you may collect all matches using
// Requires: using System.Linq; using System.Text.RegularExpressions;
var result = Regex.Matches(s, @"#[^#]+")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
A more precise pattern you may use is #[A-Fa-f0-9]{8}, it matches a # and then any 8 hex chars, digits or letters from a to f and A to F.
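As a rough sketch, the stricter pattern can be tried on the string from the question like this. The HexColors class and Find method are hypothetical names, and the sketch assumes every color has exactly 8 hex digits, as in the sample.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HexColors
{
    // Hypothetical helper: finds "#" followed by exactly 8 hex digits.
    public static string[] Find(string input) =>
        Regex.Matches(input, @"#[A-Fa-f0-9]{8}")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        foreach (var color in Find("#FF0000FF#0038FFFF#51FF00FF#F400FFFF"))
            Console.WriteLine(color); // the last value, #F400FFFF, is included
    }
}
```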
Don't rely on whatever characters come after the value; match the hex characters themselves and it will work every time.
(?i)#[a-f0-9]+

.Net Regex - last of repeating characters

I'm trying to capture everything inside curly braces, but in some cases there may be multiple braces and I want the outermost ones.
For example: I want to capture {{this}} part
I'll need {{this}} as the capture.
So I went with ({[^}]+}+) to capture the inner text, but of course this will yield multiple captures {{this} and {{this}}.
So I tried telling the regex to search for the phrase but only if the next character is not a curly brace: ({[^}]+}+)[^}]. This works, unless the capture is at the end of the input, in which case it fails because it expects a non-} character at the end.
So I tried adding an end-of-string option, ({[^}]+}+)[$|^}], but for some reason this captures {{this} again. I have no idea why; it should only capture if the next char is the end of input or not a curly brace...
Suggestions?
Edit:
Just to be clear, I'm not searching for valid nested braces, only for text between { and the first matching } (no nesting!). However, there may be cases where instead of one open/close brace there are two (so {something} and {{something}} both need to be caught).
The reason for this, is that the original text always has double braces {{ }}, but sometimes before the regex the text undergoes string.Format, in which case the double braces become single braces.
Generally, regular expressions are not powerful enough to match nested constructs. However, the .NET regex engine supports so-called balancing groups, which let you match text with balanced braces:
{(?>{(?<DEPTH>)|}(?<-DEPTH>)|[^}]+)*}(?(DEPTH)(?!))
If you want to match all text between braces, I think this should do the trick:
{+.*?}+
This matches everything between braces, taking all consecutive braces and as few internal characters as possible.
Further explanation: matches 1 or more { ({+), then any amount of any character (.*) but gives you the shortest string that does it (?), and finally matches 1+ } (}+). Without that ?, if you have {a} {b} it would match the whole thing instead of {a} and {b} separately.
If you don't want spaces between the braces, you can use this:
{+\S*?}+
If you only want word characters (letters, digits, and underscore), use \w instead of \S.
The only thing this is not validating is that the same amount of braces are used. Do you need that?
Result comparison (should be a comment).
Considering {{{{{{this}}}}}Blabla, I get this:
Regex author: c0d3rman
Matched string: {{{{{{this}}}}}B
Groups: 2 ({{{{{{this}}}}}B and {{{{{{this}}}}})
Captures: {{{{{{this}}}}}
Regex author: dasblinkenlight
Matched string: {{{{{this}}}}}
Groups: 2 ({{{{{this}}}}} and {})
Captures: {{{{{this}}}}}
Note: symmetric braces
Regex author: Andrew
Matched string: {{{{{{this}}}}}
Groups: {{{{{{this}}}}}
Captures: {{{{{{this}}}}}
You seem to have used a character class at the end instead of a non-capturing group. Try:
({[^}]+}+)(?:$|[^}])
This is a very small modification to your final attempt that just uses the correct syntax. In your final attempt you have [$|^}]. The issue with this is that you can't have an "or" (|) inside a character class []. Most special characters are treated as literals inside a character class, with a couple of exceptions, one of which is ^ when it is the first character. So [$|^}] means any one of the four literal characters $, |, ^, or }. What I did is change the syntax to what you intended by using a non-capturing group (?:stuff); this group does not save its contents and is purely for grouping. As such, (?:$|[^}]) means the end of the input or a non-} character, as you wanted.
Note that this makes no effort to balance the curly braces (match the number of braces at the beginning and end).
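To see the difference between the character class and the non-capturing group, here is a small C# sketch. The BraceCapture class and Capture method are hypothetical names, and the braces are escaped in the patterns only to make the literals easier to read; that does not change what they match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BraceCapture
{
    // Hypothetical helper: returns group 1 of the first match, or "" if none.
    public static string Capture(string input, string pattern)
    {
        var m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        const string charClass = @"(\{[^}]+\}+)[$|^}]";     // the question's last attempt
        const string grouped   = @"(\{[^}]+\}+)(?:$|[^}])"; // the corrected version

        // The class needs one literal char, so the engine gives back one '}' to it.
        Console.WriteLine(Capture("{{this}}", charClass)); // {{this}
        Console.WriteLine(Capture("{{this}}", grouped));   // {{this}}
        Console.WriteLine(Capture("I want to capture {this} part", grouped)); // {this}
    }
}
```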

Capture all groups that fit regex

I have a regex that does pretty much exactly what I want: \.?(\w+[\s|,]{1,}\w+[\s|,]{1,}\w+){1}\.?
Meaning it captures incidences of 3 words in a row that are not separated by anything except spaces and commas (so parts of sentences only). However I want this to match every instance of 3 words in a sentence.
So in this ultra simple example:
Hi this is Bob.
There should be 2 captures - "Hi this is" and "this is Bob". I can't seem to figure out how to get the regex engine to parse the entire statement this way. Any thoughts?
You cannot just get overlapping texts in capturing groups, but you can obtain overlapping matches with capturing groups holding the substrings you need.
Use
(?=\b(\w+(?:[\s,]+\w+){2})\b)
The unanchored positive lookahead tests for an empty string match at every position of a string. It does not consume characters, but can still return submatches obtained with capturing groups.
Regex breakdown:
\b - a word boundary
(\w+(?:[\s,]+\w+){2}) - 3 "words" separated by commas or whitespace:
\w+ - 1 or more word characters (letters, digits, or underscore), followed by
(?:[\s,]+\w+){2} - 2 sequences of 1 or more whitespace characters or commas followed by 1 or more word characters.
This pattern is just put into a capturing group (...) that is placed inside the lookahead (?=...).
Word boundaries are important in this expression because \b prevents matching inside a word (between two alphanumeric characters). As the lookahead is not anchored it tests all positions inside input string, and \b serves as a restriction on where a match can be returned.
In C#, you just need to collect all match.Groups[1].Values, e.g. like this:
var s = "Hi this is Bob.";
var results = Regex.Matches(s, @"(?=\b(\w+(?:[\s,]+\w+){2})\b)")
    .Cast<Match>()
    .Select(p => p.Groups[1].Value)
    .ToList();

Regular expression match text between tag

I need help with a regular expression, as I don't know regex well.
I have regular expression as:
Regex myregex = new Regex("testValue=\"(.+?)\"");
What does (.+?) indicate?
The string it matches is "testValue=123e4567" and returns 123e4567 as output.
Now I need help in regular expression to match a string "<helpMe>123e4567</helpMe>" where I need 123e4567 as output. How do I write a regular expression for it?
This means:
( Begin captured group
. Match any character
+ One or more times
? Non-greedy quantifier
) End captured group
In the case of your regex, the non-greedy quantifier ? means that your captured group will begin after the first double-quote, and then end immediately before the very next double-quote it encounters. If it were greedy (without the ?), the group would extend to the very last double-quote it encounters on that line (i.e., "greedily" consuming as much of the line as possible).
For your "helpMe" example, you'd want this regex:
<helpMe>(.+?)</helpMe>
Given this string:
<div>Something<helpMe>ABCDE</helpMe></div>
You'd get this match:
ABCDE
The value of the non-greedy quantifier is evident in this variation:
Regex: <helpMe>(.+)</helpMe>
String: <div>Something<helpMe>ABCDE</helpMe><helpMe>FGHIJ</helpMe></div>
The greedy capture would look like this:
ABCDE</helpMe><helpMe>FGHIJ
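The lazy and greedy variants can be compared side by side in C#. This is only a sketch; LazyVsGreedy and FirstCapture are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedy
{
    // Hypothetical helper: group 1 of the first match ("" when nothing matches).
    public static string FirstCapture(string input, string pattern) =>
        Regex.Match(input, pattern).Groups[1].Value;

    public static void Main()
    {
        var s = "<div>Something<helpMe>ABCDE</helpMe><helpMe>FGHIJ</helpMe></div>";

        // Lazy: stops at the first closing tag.
        Console.WriteLine(FirstCapture(s, "<helpMe>(.+?)</helpMe>")); // ABCDE

        // Greedy: runs on to the last closing tag.
        Console.WriteLine(FirstCapture(s, "<helpMe>(.+)</helpMe>"));  // ABCDE</helpMe><helpMe>FGHIJ
    }
}
```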
There are some useful interactive tools to play with these variations:
Regex Tester
Regex Pal
Ken Redler has a great answer regarding your first question. For the second question try:
<(helpMe)>(.*?)</\1>
Using the back reference \1 you can find values between the set of matching tags. The first group finds the tag name, the second group matches the content itself, and the \1 back reference re-uses the first group's match (in this case the tag name).
Also, in C# you can use named groups, like: <(helpMe)>(?<value>.*?)</\1> where now match.Groups["value"].Value contains your value.
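A short sketch of the named-group version, assuming the input from the question. TagValue and Read are hypothetical names. In .NET, unnamed groups are numbered before named ones, so \1 still refers to the (helpMe) group here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagValue
{
    // Hypothetical helper: returns the text between <helpMe> and </helpMe>,
    // or "" when the tag pair is not found.
    public static string Read(string input)
    {
        var m = Regex.Match(input, @"<(helpMe)>(?<value>.*?)</\1>");
        return m.Success ? m.Groups["value"].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(Read("<helpMe>123e4567</helpMe>")); // 123e4567
    }
}
```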
What does (.+?) indicate?
It means match any character (.) one or more times, but as few times as possible (+?).
A simple regex to match your second string would be
<helpMe>([a-z0-9]+)<\/helpMe>
This will match lowercase letters a-z and digits between <helpMe> and </helpMe>.
The parentheses are used to capture a group. This is useful if you need to reference the value inside this group later.

regex to fetch string between [a] and [/a] excluding any other tag like [b][/b] that comes in between

I have an input like the following
[a href=http://twitter.com/suddentwilight][font][b][i]#suddentwilight[/font][/a] My POV: Rakhi Sawant hits below the belt & does anything for attention... [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained the grace/decency :)
Now I need to get the strings #suddentwilight and http://www.test.com that come inside the anchor tags. There might be some tags like [b] or [i] wrapping the actual text; I need to ignore those.
Basically, I need to match from the opening [a ...] tag and get the string/URL that comes before the closing [/a] tag.
Please suggest.
I don't know C#, but here's a regex:
/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/
This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[tag2][tag1][/a] and capture text.
To explain:
the /.../ are common regex delimiters (like double quotes for strings). C# initializes regexes from plain strings, in which case the forward slashes aren't needed.
\[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
[^\]] is an example of a character class - here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
* and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
\s is a shorthand for the character class of whitespace characters
(?:...) allows you to group the contents into a single unit, so that a following suffix such as * applies to the whole group, without capturing anything.
(...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the linktext.
. matches any single character.
*? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite - it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
This will only remove [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:
/\[[^\]]+\]/
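For C#, the two passes might look like the sketch below. The delimiting slashes are dropped and the patterns are written as verbatim strings; LinkText and Extract are hypothetical names, and the input in Main is a shortened version of the sample from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkText
{
    // First pass: capture the text between [a ...] and [/a], skipping wrapper tags.
    private static readonly Regex Anchor = new Regex(
        @"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[/a\]");

    // Second pass: any remaining [tag] inside the captured text.
    private static readonly Regex AnyTag = new Regex(@"\[[^\]]+\]");

    // Hypothetical helper: returns the cleaned link text of every anchor.
    public static string[] Extract(string input) =>
        Anchor.Matches(input)
              .Cast<Match>()
              .Select(m => AnyTag.Replace(m.Groups[1].Value, ""))
              .ToArray();

    public static void Main()
    {
        var s = "[a href=http://twitter.com/suddentwilight][font][b][i]#suddentwilight[/font][/a] My POV "
              + "[a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained";
        foreach (var text in Extract(s))
            Console.WriteLine(text);
    }
}
```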
