Regex - how to match multiple properly quoted substrings - c#

I am trying to use a Regex to extract quote-wrapped strings from within a (C#) string which is a comma-separated list of such strings. I need to extract all properly quoted substrings and ignore those that are missing a quote mark.
E.g., given this string:
"animal,dog,cat","ecoli, verification,"streptococcus"
I need to extract "animal,dog,cat" and "streptococcus".
I've tried various regex solutions in this forum, but they all seem to find only the first substring, or they incorrectly match "ecoli, verification," and ignore "streptococcus".
Is this solvable?
TIA

Try this:
string input = "\"animal,dog,cat\",\"ecoli, verification,\"streptococcus\"";
string pattern = "\"([^\"]+?[^,])\"";
var matches = Regex.Matches(input, pattern);
foreach (Match m in matches)
Console.WriteLine(m.Groups[1].Value);
P.S. But I agree with the commentators: fix the source.

I suggest this:
"(?>[^",]*(?>,[^",]+)*)"
Explanation:
" # Match a starting quote
(?> # Use an atomic (non-capturing) group to avoid catastrophic backtracking:
[^",]* # - any number of characters except commas or quotes
(?> # - optionally followed by another (atomic) group:
, # - which starts with a comma
[^",]+ # - and contains at least one character besides comma or quotes.
)* # - (as said above, that group is optional but may occur many times)
) # End of the outer atomic group
" # Match a closing quote
Test it live on regex101.com.
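As a minimal sketch of how the atomic-group pattern above could be applied to the string from the question (the helper name ExtractQuoted is hypothetical, not part of the original answer):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every properly quoted substring, quotes included.
// The character classes reject a field whose closing quote is missing;
// the atomic groups (?>...) only keep the engine from backtracking into it.
static List<string> ExtractQuoted(string input)
{
    var results = new List<string>();
    foreach (Match m in Regex.Matches(input, "\"(?>[^\",]*(?>,[^\",]+)*)\""))
        results.Add(m.Value);
    return results;
}

string sample = "\"animal,dog,cat\",\"ecoli, verification,\"streptococcus\"";
foreach (string quoted in ExtractQuoted(sample))
    Console.WriteLine(quoted);
// Prints:
// "animal,dog,cat"
// "streptococcus"
```

The broken field "ecoli, verification," is skipped because its text ends in a comma that is followed directly by a quote, which the pattern does not allow.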

Related

Regexp: Match value if condition occurs

I have a string like
Value = ('1 OR 2') OR Value = ('THREE OR FOUR')
and I want to split it by OR (that one is not in quotes).
How can I do it with regexp? It has to match only if I have an even number of quotes before OR.
Is it possible?
I tried using [\w\W]*?'[\w\W]*(\sOR\s), but it works incorrectly: it takes only the last OR, even if it is inside quotes.
Using [\w\W] can match any character including '
You could make use of lookaround with an infinite quantifier in C# and match optional pairs of single quotes.
If you want all pairs of single quotes in the whole string, you can also assert them to the right.
If you don't want to cross matching newline, you can use [^'\r\n]* instead of [^']*
(?<=^(?:[^']*'[^']*')*[^']*)\bOR\b(?=(?:[^']*'[^']*')*[^']*$)
(?<= Positive lookbehind
^(?:[^']*'[^']*')*[^']* Match optional pairs of single quotes from the start of the string
) Close lookbehind
\bOR\b Match OR between word boundaries
(?= Positive lookahead
(?:[^']*'[^']*')*[^']*$ Match optional pairs of quotes till the end of the string
) Close lookahead
Regex demo
Using a positive lookbehind ensures that OR is only matched if it is preceded by an even number of single quotes (and surrounded by whitespace as in your regex).
(?<=^(?:[^']*'[^']*')*[^']*)\sOR\s
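A short sketch of how the lookbehind pattern above might be used with Regex.Split on the string from the question. The helper name SplitOnUnquotedOr is an assumption for illustration, and the approach presumes the single quotes in the input are balanced:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on " OR " only when an even number of single
// quotes precedes it, i.e. when the OR sits outside a quoted literal.
// This relies on .NET allowing unbounded quantifiers inside a lookbehind.
static string[] SplitOnUnquotedOr(string input)
{
    return Regex.Split(input, @"(?<=^(?:[^']*'[^']*')*[^']*)\sOR\s");
}

string expr = "Value = ('1 OR 2') OR Value = ('THREE OR FOUR')";
foreach (string part in SplitOnUnquotedOr(expr))
    Console.WriteLine(part);
// Prints:
// Value = ('1 OR 2')
// Value = ('THREE OR FOUR')
```

Because the pattern contains only non-capturing groups, Regex.Split does not add any captured text to the result array.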
How about trying to match everything that is valid and use Regex.Matches to get all the sub-strings?
var splitRE = new Regex(@"([^'OR]+|O[^R]|'[^']*'|(?<!O)R|(?<=\w)OR|OR(?=\w))+", RegexOptions.Compiled);
var ans = splitRE.Matches(s);
Basically, the pattern matches one or more of the following: any character that is not a single quote, O, or R; an O followed by a character that is not R; a single-quoted string; an R not preceded by an O; an OR preceded by a word character; or an OR followed by a word character.

C# Regex Match NOT inside self defined tags

I use tags in the form of
[[MyTag]]Some Text[[/MyTag]]
To find these tags within the whole text I use the following expression (this is not related to this question here, but for info):
\[\[(?<key>.*\w)]\](?<keyvalue>.*?)\[\[/\1\]\]
Now I would like to match and replace only text (MYSEARCHTEXT) which is NOT inside of these self-defined tags.
Example:
[[Tag1]]Here I don't want to replace MYSEARCHTEXT[[/Tag1]]
But here MYSEARCHTEXT (1) should be replaced. And here MYSEARCHTEXT (2) needs to be replaced too.
[[AnotherTag]]Here I don't want to replace MYSEARCHTEXT[[/AnotherTag]]
And here I need to replace MYSEARCHTEXT (3) also.
MYSEARCHTEXT is a word or phrase and needs to be found 3 times in this example.
Maybe this can work? If I understood the problem correctly, this will match MYSEARCHTEXT outside of your tags, and your matches will be in the groups. This uses a positive lookahead:
https://regex101.com/r/C8Kuiz/2
(?:\[\[Tag1.*?\/Tag1\]\])\n?(?:.*)(?=(MYSEARCHTEXT))
I have an idea that can simplify this. Use the following regular expression to match the tagged text:
\[.+?\][^\[\]]*?MYSEARCHTEXT[^\[\]]*?\[.+?\]\]
Then replace the MYSEARCHTEXT within the string preserving the captured groups.
You may use the following solution that uses your pattern version with an added alternative in a Regex.Replace method where a match evaluator is used as the replacement argument:
var pat = @"(?s)(\[\[(\w+)]].*?\[\[/\2]])|MYSEARCHTEXT";
var s = "[[Tag1]]Here I don't want to replace MYSEARCHTEXT[[/Tag1]]\nBut here MYSEARCHTEXT (1) should be replaced. And here MYSEARCHTEXT (2) needs to be replaced too.\n[[AnotherTag]]Here I don't want to replace MYSEARCHTEXT[[/AnotherTag]]\nAnd here I need to replace MYSEARCHTEXT (3) also.";
var res = Regex.Replace(s, pat, m =>
m.Groups[1].Success ? m.Groups[1].Value : "NEW_VALUE");
Console.WriteLine(res);
See the C# demo
Result:
[[Tag1]]Here I don't want to replace MYSEARCHTEXT[[/Tag1]]
But here NEW_VALUE (1) should be replaced. And here NEW_VALUE (2) needs to be replaced too.
[[AnotherTag]]Here I don't want to replace MYSEARCHTEXT[[/AnotherTag]]
And here I need to replace NEW_VALUE (3) also.
Pattern details
(?s) - a RegexOptions.Singleline inline modifier option (a . matches any char now)
(\[\[(\w+)]].*?\[\[/\2]]) - Group 1:
\[\[ - a [[ substring
(\w+) - Group 2: one or more word chars
]] - a ]] substring
.*? - any 0+ chars, as few as possible
\[\[/ - a [[/ substring
\2 - same text as captured into Group 2
]] - a literal ]] substring
| - or
MYSEARCHTEXT - some pattern to replace.
When Group 1 matches (m.Groups[1].Success ?) this value is put back, else the NEW_VALUE is inserted into the resulting string.
The best way is to match both separately as a positive match.
Then decide which to replace and which to write back based on which
matched. (Someone posted this solution already, so I won't duplicate it)
The alternative is to forego that entirely and qualify the text
in the form of a lookahead after searchtext.
This shows how to do it that way.
var pat = @"(?s)MYSEARCHTEXT(?=(?:(?!\[\[/?\w+\]\]).)*?(?:\[\[\w+\]\]|$))";
var res = Regex.Replace(s, pat, "NEW_VALUE");
Demo: https://ideone.com/KOtNik
Formatted:
(?s) # Dot-all modifier
MYSEARCHTEXT
(?= # Qualify the text with an assertion
(?: # Get non-tag characters
(?! \[\[ /? \w+ \]\] )
.
)*?
(?: # Up to -
\[\[ \w+ \]\] # An open tag
| $ # or, end of string
)
)

Get each item within a capturing group

If you have a string like this:
[hello world] this is [the best .Home] is nice place.
How do you extract each word (separated by spaces) within the brackets [] only?
Right now I have this working https://regex101.com/r/Tgokeq/2
Which returns:
hello world
the best .Home
But I want:
hello
world
the
best
.Home
PS: I know I could just do a string split in a foreach, but I don't want that. I want it in the regex itself, just like this one, which gets every word, except that I want only the words within the brackets [ ].
https://regex101.com/r/eweRWj/2
Use this Pattern ([^\[\] ]+)(?=[^\[\]]*\]) Demo
( # Capturing Group (1)
[^\[\] ] # Character not in [\[\] ] Character Class
+ # (one or more)(greedy)
) # End of Capturing Group (1)
(?= # Look-Ahead
[^\[\]] # Character not in [\[\]] Character Class
* # (zero or more)(greedy)
\] # "]"
) # End of Look-Ahead
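A minimal sketch of the lookahead pattern above applied to the sample string from the question. The helper name WordsInBrackets is hypothetical, and the sketch assumes brackets are balanced and not nested (a stray ] later in the text would make earlier words match too):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects each space-separated word that is followed,
// without crossing another bracket, by a closing ']'.
static List<string> WordsInBrackets(string input)
{
    var words = new List<string>();
    foreach (Match m in Regex.Matches(input, @"([^\[\] ]+)(?=[^\[\]]*\])"))
        words.Add(m.Groups[1].Value);
    return words;
}

string text = "[hello world] this is [the best .Home] is nice place.";
Console.WriteLine(string.Join("\n", WordsInBrackets(text)));
// Prints:
// hello
// world
// the
// best
// .Home
```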
This pattern may not seem as elegant, since it does not match individual words separately. The full solution takes advantage of the .NET regex library to get individual words. However, it avoids the excessive backtracking of alpha bravo's solution. The importance of that will largely depend on how many lines you search and/or whether you are matching large chunks of text or only individual lines at a time.
This approach also lets you identify exactly how many bracket pairs and which words were captured in each pair. A simple pattern-only solution will just get you the matched words without context.
The pattern:
\[\s*((?<word>[^[\]\s]+)\s*)+]
Then some brief code demonstrating how to get captured words via the .Net regex object model:
using System.Text.RegularExpressions;
...
Regex rx = new Regex(@"\[\s*((?<word>[^[\]\s]+)\s*)+]");
MatchCollection matches = rx.Matches(searchText);
foreach(Match m in matches) {
foreach(Capture c in m.Groups["word"].Captures) {
System.Console.WriteLine(c.Value);
}
}
Breakdown of pattern:
\[ # Opening bracket
\s* # Optional white space
( # Group for word delimited by space
(?<word> # Named capture group
[^[\]\s] # Negative character class: no brackets, no white space
+ # one or more greedy
) # End named capture group
\s* # Match white space after word
) # End of word+space grouping
+ # Match multiple occurrences of word+space
] # Literal closing bracket (no need to escape outside character class)
The above will match line feeds between the brackets. If you don't want that then use
\[\ *((?<word>[^[\]\s]+)\ *)+]

Regex to match spaces within quotes only

I need to match any space within double quotes ONLY - not outside. I've tried a few things, but none of them work.
[^"]\s[^"] - matches spaces outside of quotes
[^"] [^"] - see above
\s+ - see above
For example I want to match "hello world" but not "helloworld" and not hello world (without quotes). I will specifically be using this regex inside of Visual Studio via the Find feature.
With .net and pcre regex engines, you can use the \G feature that matches the position after a successful match or the start of the string to build a pattern that returns contiguous occurrences from the start of the string:
((?:\G(?!\A)|\A[^"]*")[^\s"]*)\s([^\s"]*"[^"]*")?
example for a replacement with #: demo
pattern details:
( # capture group 1
(?: # two possible beginnings
\G(?!\A) # contiguous to a previous match
| # OR
\A[^"]*" # start of the string and reach the first quote
) # at this point you are sure to be inside quotes
[^\s"]* # all that isn't a white-space or a quote
)
\s # the white-space
([^\s"]*"[^"]*")? # optional capture group 2: useful for the last quoted white-space
# since it reaches an eventual next quoted part.
Notice: with the .net regex engine you can also use the lookbehind to test if the number of quotes before a space is even or odd, but this way isn't efficient. (same thing for a lookahead that checks remaining quotes until the end, but in addition this approach may be wrong if the quotes aren't balanced).
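The \G-based pattern above can be tried out with Regex.Replace, roughly as follows. This is only a sketch: the helper name MarkQuotedSpaces and the sample sentence are assumptions, while the replacement string $1#$2 follows the "replacement with #" idea from the answer:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces each white-space character that lies inside
// double quotes with '#'. Group 1 is the text before the space; optional
// group 2 carries the match over to the inside of the next quoted part.
static string MarkQuotedSpaces(string input)
{
    return Regex.Replace(
        input,
        @"((?:\G(?!\A)|\A[^""]*"")[^\s""]*)\s([^\s""]*""[^""]*"")?",
        "$1#$2");
}

string sentence = "say \"hello big world\" and \"a b\" end";
Console.WriteLine(MarkQuotedSpaces(sentence));
// Prints: say "hello#big#world" and "a#b" end
```

Spaces outside the quotes are left alone because, once a match attempt fails, \G no longer holds at any later position and \A only holds at the very start.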

Matching repeating patterns

I'm currently trying to match and capture text in the following input:
field: one two three field: "moo cow" field: +this
I can match the field: with [a-z]*\: but I can't seem to match the rest of the content. So far my attempts have only resulted in capturing everything, which is not what I want.
If you know that it is always going to be literally field: there is absolutely no need for a regular expression:
var delimiters = new String[] {"field:"};
string[] values = input.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
However, from your regex I assume that the name field can vary, as long as it's in front of a colon. You could try to capture a word followed by : and then everything up to the next of those words (using a lookahead).
foreach(Match match in Regex.Matches(input, @"([a-z]+):((?:(?![a-z]+:).)*)"))
{
string fieldName = match.Groups[1].Value;
string value = match.Groups[2].Value;
}
An explanation of the regular expression:
( # opens a capturing group; the content can later be accessed with Groups[1]
[a-z] # lower-case letter
+ # one or more of them
) # end of capturing group
: # a literal colon
( # opens a capturing group; the content can later be accessed with Groups[2]
(?: # opens a non-capturing group; just a necessary subpattern which we do not
# need later any more
(?! # negative lookahead; this will NOT match if the pattern inside matches
[a-z]+:
# a word followed by a colon; just the same as we used at the beginning of
# the regex
) # end of negative lookahead (note that this does not consume any characters;
# it LOOKS ahead)
. # any character (except for line breaks)
) # end of non-capturing group
* # 0 or more of those
) # end of capturing group
So first we match anylowercaseword:. And then we match one more character at a time, for each one checking that this character is not the start of anotherlowercaseword:. With the capturing groups we can then later separately find the field's name and the field's value.
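Put together, the approach could look roughly like the sketch below, run against the input from the question. The helper name ParseFields and the Trim() call on the value are assumptions added for illustration; the pattern itself is the one explained above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns (name, value) pairs. Group 1 is the field name,
// group 2 is everything up to the next "word:" or the end of the line.
static List<KeyValuePair<string, string>> ParseFields(string input)
{
    var fields = new List<KeyValuePair<string, string>>();
    foreach (Match m in Regex.Matches(input, @"([a-z]+):((?:(?![a-z]+:).)*)"))
    {
        // Trim() is an addition here: the raw capture keeps surrounding spaces.
        fields.Add(new KeyValuePair<string, string>(
            m.Groups[1].Value, m.Groups[2].Value.Trim()));
    }
    return fields;
}

string line = "field: one two three field: \"moo cow\" field: +this";
foreach (var pair in ParseFields(line))
    Console.WriteLine(pair.Key + " => " + pair.Value);
// Prints:
// field => one two three
// field => "moo cow"
// field => +this
```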
Don't forget that you can actually match literal strings in regexes. If your pattern is like this:
field\:
You will match "field:" literally, and nothing else.
