Detect Two Consecutive Single Quotes Inside Single Quotes - c#

I'm struggling to get this regex pattern exactly right, and am open to other options outside of regex if someone has a better alternative.
The situation:
I'm basically looking to parse a T-SQL "in" clause against a text column in C#. So, I need to take a string value like this:
"'don''t', 'do', 'anything', 'stupid'"
And interpret that as a list of values (I'll take care of the double single quotes later):
"don''t"
"do"
"anything"
"stupid"
I have a regex that works for most cases, but I'm struggling to generalize it to the point where it will accept any character OR a doubled-up single quote inside my group: (?:')([a-z0-9\s(?:'(?='))]+)(?:')[,\w]*
I'm fairly experienced with regexes, but have rarely, if ever, found a need for look-arounds (so downgrade my assessment of my regex experience accordingly).
So, to put this another way, I'm wanting to take a string of comma-delimited values, each enclosed in single quotes but can contain doubled single quotes, and output each such value.
EDIT
Here's a non-working example with my current regex (my problem is I need to handle all characters in my grouping and stop when I encounter a single quote not followed by a second single quote):
"'don''t', 'do?', 'anything!', '#stupid$'"

If you still think about a regex-based solution, you can use the following regex:
'(?:''|[^'])*'
Or an "un-rolled" version suggested by @sln:
'[^']*(?:''[^']*)*'
It is fairly simple: it matches doubled single quotation marks OR anything that is not a single quotation mark. There is no need for look-behinds or look-aheads. It does not handle any other escaped entities, but I do not see this requirement in your question.
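As a minimal sketch of how the matched values could then be cleaned up, the version below adds a capturing group around the content so the enclosing quotes are dropped, and collapses each doubled quote into one. The class name `InClauseParser` and method name `Parse` are hypothetical, chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class InClauseParser
{
    // One quoted value; a doubled quote ('') inside is treated as content.
    // Group 1 holds the text between the enclosing quotes.
    private static readonly Regex QuotedValue = new Regex(@"'((?:''|[^'])*)'");

    // Returns each value without its enclosing quotes and with '' collapsed to '.
    public static List<string> Parse(string inClause)
    {
        return QuotedValue.Matches(inClause)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value.Replace("''", "'"))
            .ToList();
    }
}
```

Calling `InClauseParser.Parse("'don''t', 'do?', 'anything!', '#stupid$'")` should yield the four values don't, do?, anything! and #stupid$. The sketch assumes well-formed input and does not validate the commas between values.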
Moreover, this regex will return matches that are easy to access and deal with:
var text = "'don''t', 'do', 'anything', 'stupid'";
var re = new Regex(@"'[^']*(?:''[^']*)*'"); // Updated thanks to @sln, previous (@"'(?:''|[^'])*'");
var match_values = re.Matches(text).Cast<Match>().Select(p => p.Value).ToList();
Output:
'don''t'
'do'
'anything'
'stupid'

If you want to use the Capture Collection feature, you can grab them all in a
single pass.
# @"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+"""
"
\s*
(?:
'
( # (1 start)
[^']*
(?:
'' [^']*
)*
) # (1 end)
'
\s*
(?:
, \s*
| (?= " )
)
)+
"
C# code:
string strSrc = "\"'don''t', 'do', 'anything', 'stupid'\"";
Regex rx = new Regex(@"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+""");
Match srcMatch = rx.Match(strSrc);
if (srcMatch.Success)
{
CaptureCollection cc = srcMatch.Groups[1].Captures;
for (int i = 0; i < cc.Count; i++)
Console.WriteLine("{0} = '{1}'", i, cc[i].Value);
}
Output:
0 = 'don''t'
1 = 'do'
2 = 'anything'
3 = 'stupid'

Why don't you split on ', ':
Regex regex = new Regex(@"'\s*,\s*'");
string[] substrings = regex.Split(str);
Then take care of the extra single quotes at the start and end by trimming.
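A minimal sketch of the split-and-trim idea, assuming the input is well formed. The names `SplitInClause` and `Parse` are hypothetical. One enclosing quote is removed from each end before splitting, so that values which themselves begin or end with an escaped quote are not over-trimmed.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitInClause
{
    public static string[] Parse(string inClause)
    {
        string trimmed = inClause.Trim();

        // Drop exactly one enclosing quote from each end of the whole list.
        if (trimmed.Length >= 2 && trimmed[0] == '\'' && trimmed[trimmed.Length - 1] == '\'')
            trimmed = trimmed.Substring(1, trimmed.Length - 2);

        // Split on the quote-comma-quote separator, then unescape doubled quotes.
        return Regex.Split(trimmed, @"'\s*,\s*'")
            .Select(v => v.Replace("''", "'"))
            .ToArray();
    }
}
```

A caveat of this approach: a value that itself contains a quote, a comma and another quote in sequence (for example the escaped text `a'', ''b`) would be split in the wrong place, so the pattern-matching answers are safer for arbitrary data.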

Looks to me like you're over-thinking the problem. A quoted string with an escaped quote looks just like two strings without escaped quotes, one right after the other (not even spaces between them).
(?:'[^']*')+
Of course, you'll have to remove the enclosing quotes, but you probably had to do some post-processing anyway, to unescape the escaped quotes.
Also note that I'm not trying to validate the input or work around possible errors; for example, I don't bother matching the commas between the strings. If the input is well formed, this regex should be all you need.

In the interest of maintainability, I decided against a regex and followed the advice of using a state machine. Here's the crux of my implementation:
string currentTerm = string.Empty;
State currentState = State.BetweenTerms;
int index = 0; // position of c within valueToParse
foreach (char c in valueToParse)
{
switch (currentState)
{
// if between terms, only need to do something if we encounter a single quote, signalling to start a new term
// encloser is client-specified char to look for (e.g. ')
case State.BetweenTerms:
if (c == encloser)
{
currentState = State.InTerm;
}
break;
case State.InTerm:
if (c == encloser)
{
if (valueToParse.Length > index + 1 && valueToParse[index + 1] == encloser && valueToParse.Length > index + 2)
{
// if next character is also encloser then add it and move on
currentTerm += c;
}
else if (currentTerm.Length > 0 && currentTerm[currentTerm.Length - 1] != encloser)
{
// on an encloser and didn't just add encloser, so we are done
// converterFunc is a client-specified Func<string,T> to return terms in the specified type (to allow for converting to int, for example)
yield return converterFunc(currentTerm);
currentTerm = string.Empty;
currentState = State.BetweenTerms;
}
}
else
{
currentTerm += c;
}
break;
}
index++;
}
if (currentTerm.Length > 0)
{
yield return converterFunc(currentTerm);
}

Related

Regex and proper capture using .matches .Concat in C#

I have the following regex:
@"{thing:(?:((\w)\2*)([^}]*?))+}"
I'm using it to find matches within a string:
MatchCollection matches = regex.Matches(string);
IEnumerable formatTokens = matches[0].Groups[3].Captures
.OfType<Capture>()
.Where(i => i.Length > 0)
.Select(i => i.Value)
.Concat(matches[0].Groups[1].Captures.OfType<Capture>().Select(i => i.Value));
This used to yield the results I wanted; however, my goal has since changed. This is the desired behavior now:
Suppose the string entered is 'stuff/{thing:aa/bb/cccc}{thing:cccc}'
I want formatTokens to be:
formatTokens[0] == "aa/bb/cccc"
formatTokens[1] == "cccc"
Right now, this is what I get:
formatTokens[0] == "/"
formatTokens[1] == "/"
formatTokens[2] == "cccc"
formatTokens[3] == "bb"
formatTokens[4] == "aa"
Note especially that "cccc" does not appear twice even though it was entered twice.
I think the problems are 1) the recapture in the regex and 2) the concat configuration (which is from when I wanted everything separated), but so far I haven't been able to find a combination that yields what I want. Can someone shed some light on the proper regex/concat combination to yield the desired results above?
You may use
Regex.Matches(s, @"{thing:([^}]*)}")
.Cast<Match>()
.Select(x => x.Groups[1].Value)
.ToList()
Details
{thing: - a literal {thing: substring
([^}]*) - Capturing group #1 (when a match is obtained, its value can be accessed via match.Groups[1].Value): 0+ chars other than }
} - a } char.
This way, you find multiple matches and only collect Group 1 values in the resulting list/array.
Mod update
I'm not sure why you settled for Stringnuts regex because it matches
anything inside braces {}.
The meek on SO will not get the satisfaction of deep knowledge,
so that may be your real problem.
Let's analyze your regex.
{thing:
(?:
( # (1 start)
( \w ) # (2)
\2*
) # (1 end)
( [^}]*? ) # (3)
)+
}
This reduces to this
{thing:
(?: \w [^}]*? )+
}
The only constraint is that right after {thing: there must be a word.
After which there can be anything else, because this clause [^}]*? accepts
anything.
Also, even though that clause is not greedy, the surrounding cluster will only run one iteration (?: )+
So, basically, it does almost nothing except for the single word requirement.
Your regex can be used as is to get convoluted matches,
and because you've captured all the parts in Capture Collections,
with each match you can piece that together using the code below.
I would try to understand regex a little better, before you go on to other stuff since it is likely much more important than the
language tricks used to extract data.
Here is how you would piece it all together using your unaltered regex.
Regex regex = new Regex(@"{thing:(?:((\w)\2*)([^}]*?))+}");
string str = "stuff/{thing:aa/bb/cccc}{thing:cccc}";
foreach (Match match in regex.Matches(str))
{
CaptureCollection cc1 = match.Groups[1].Captures;
CaptureCollection cc3 = match.Groups[3].Captures;
string token = "";
for (int i = 0; i < cc1.Count; i++)
token += cc1[i].Value + cc3[i].Value;
Console.WriteLine("{0}", token);
}
Output
aa/bb/cccc
cccc
Note that for example, your regex will match almost anything inside
of the braces as long as the first character is a word.
For example, it matches {thing:Z,,,*()(((asgassgasg,asgfasgafg\/\=99.239 }
You may want to think about the requirements of what actually is allowed
inside the braces.
Good Luck!

Replace Single WhiteSpace without Replacing Multiple WhiteSpace

I have a string in the format:
abc def ghi    xyz
I would like to end with it in format:
abcdefghi xyz
What is the best way to do this? In this particular case, I could just strip off the last three characters, remove spaces, and then add them back at the end, but this won't work for cases in which the multiple spaces are in the middle of the string.
In Short, I want to remove all single whitespaces, and then replace all multiple whitespaces with a single. Each of those steps is easy enough by itself, but combining them seems a bit less straightforward.
I'm willing to use regular expressions, but I would prefer not to.
This approach uses regular expressions but hopefully in a way that's still fairly readable. First, split your input string on multiple spaces
var pattern = @" {2,}"; // match two or more spaces
var groups = Regex.Split(input, pattern);
Next, remove the (individual) spaces from each token:
var tokens = groups.Select(group => group.Replace(" ", String.Empty));
Finally, join your tokens with single spaces
var result = String.Join(" ", tokens.ToArray());
This example uses a literal space character rather than 'whitespace' (which includes tabs, linefeeds, etc.) - substitute \s for ' ' if you need to split on multiple whitespace characters rather than actual spaces.
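As a rough sketch of the `\s` variant mentioned above, the three steps can be combined as follows. The class and method names (`WhitespaceCollapser`, `Collapse`) are made up for illustration, and the output always uses a plain space as the separator, which is an assumption about what is wanted.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitespaceCollapser
{
    public static string Collapse(string input)
    {
        // Split on runs of two or more whitespace characters (spaces, tabs, newlines).
        var groups = Regex.Split(input, @"\s{2,}");

        // Any whitespace left inside a group is a single character: remove it.
        var tokens = groups.Select(g => Regex.Replace(g, @"\s", string.Empty));

        // Rejoin the groups with exactly one space.
        return string.Join(" ", tokens);
    }
}
```

For example, `WhitespaceCollapser.Collapse("abc def ghi \t xyz")` should return `"abcdefghi xyz"`. Leading or trailing whitespace runs produce empty tokens, so the input may need a `Trim()` first if that matters.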
Well, Regular Expressions would probably be the fastest here, but you could implement some algorithm that uses a lookahead for single spaces and then replaces multiple spaces in a loop:
// Replace all single whitespaces
for (int i = 0; i < sourceString.Length; i++)
{
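if (sourceString[i] == ' ')
{
// a single space has no space directly before or after it
bool prevIsSpace = i > 0 && sourceString[i - 1] == ' ';
bool nextIsSpace = i < sourceString.Length - 1 && sourceString[i + 1] == ' ';
if (!prevIsSpace && !nextIsSpace)
sourceString = sourceString.Remove(i, 1);
}
}
// Replace multiple whitespaces
while (sourceString.Contains("  ")) // Two spaces here!
sourceString = sourceString.Replace("  ", " ");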
But hey, that code is pretty ugly and slow compared to a proper regular expression...
For a Non-REGEX option you can use:
string str = "abc def ghi    xyz";
var result = str.Split(); //This will remove single spaces from the result
StringBuilder sb = new StringBuilder();
bool ifMultipleSpacesFound = false;
for (int i = 0; i < result.Length;i++)
{
if (!String.IsNullOrWhiteSpace(result[i]))
{
sb.Append(result[i]);
ifMultipleSpacesFound = false;
}
else
{
if (!ifMultipleSpacesFound)
{
ifMultipleSpacesFound = true;
sb.Append(" ");
}
}
}
string output = sb.ToString();
The output would be:
output = "abcdefghi xyz"
Here's an approach which uses some fairly subtle logic:
public static string RemoveUnwantedSpaces(string text)
{
var sb = new StringBuilder();
char lhs = '\0';
char mid = '\0';
foreach (char rhs in text)
{
if (rhs != ' ' || (mid == ' ' && lhs != ' '))
sb.Append(rhs);
lhs = mid;
mid = rhs;
}
return sb.ToString().Trim();
}
How it works:
We will examine each possible three-character subsequence linearly across the string (in a kind of three-character sliding window). These three characters will be represented, in order, by the variables lhs, mid and rhs.
For each rhs character in the string:
If it's not a space we should output it.
If it is a space, and the previous character was also space but the one before that isn't, then this is the second in a sequence of at least two spaces, and therefore we should output one space.
Otherwise, don't output a space because this is either the first or the third (or later) space in a sequence of two or more spaces and in either case we don't want to output a space: If this happens to be the first in a sequence of two or more spaces, a space will be output when the second space comes along. If this is the third or later, we've already output a space for it.
The subtlety here is that I've avoided special casing the beginning of the sequence by initialising the lhs and mid variables with non-space characters. It doesn't matter what those values are, as long as they are not spaces, but I made them \0 to indicate that they are special values.
After second thought here is one line regex solution:
Regex.Replace("abc def ghi    xyz", "( )( )*([^ ])", "$2$3")
the result of this is "abcdefghi xyz"
ORIGINAL ANSWER:
Two lines of code regex solution:
var tmp = Regex.Replace("abc def ghi    xyz", "( )([^ ])", "$2");
tmp is "abcdefghi   xyz"
then:
var result = Regex.Replace(tmp, "( )+", " ");
result is "abcdefghi xyz"
Explanation:
The first line of code removes single whitespaces and removes one whitespace for multiple whitespaces (so there are 3 spaces in tmp between letters i and x).
The second line just replaces multiple whitespaces with one.
In-depth explanation of first line:
We match the input string against a regex that matches one space and the non-space character next to it. We also put these two characters in separate groups (we use ( ) for anonymous grouping).
So for the "abc def ghi    xyz" string we have these matches and groups:
match: " d" group1: " " group2: "d"
match: " g" group1: " " group2: "g"
match: " x" group1: " " group2: "x"
We are using substitution syntax for Regex.Replace method to replace match with the content of second group (which is non-whitespace character)

Replace one character but not two in a string

I want to replace single occurrences of a character but not two in a string using C#.
For example, I want to replace & by an empty string but not when the occurrence is &&. Another example, a&b&&c would become ab&&c after the replacement.
If I use a regex like &[^&], it will also match the character after the & and I don't want to replace it.
Another solution I found is to iterate over the string characters.
Do you know a cleaner solution to do that?
To only match one & (not preceded or followed by &), use look-arounds (?<!&) and (?!&):
(?<!&)&(?!&)
You tried to use a negated character class that still matches a character, and you need to use a look-ahead/look-behind to just check for some character absence/presence, without consuming it.
See regular-expressions.info:
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u).
Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt.
You can match both & and && (or any number of repetition) and only replace the single one with an empty string:
str = Regex.Replace(str, "&+", m => m.Value.Length == 1 ? "" : m.Value);
You can use this regex: @"(?<!&)&(?!&)"
var str = Regex.Replace("a&b&&c", @"(?<!&)&(?!&)", "");
Console.WriteLine(str); // ab&&c
You can go with this:
public static string replacement(string oldString, char charToRemove)
{
string newString = "";
bool found = false;
foreach (char c in oldString)
{
if (c == charToRemove && !found)
{
found = true;
continue;
}
newString += c;
}
return newString;
}
Which is as generic as possible
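Note that the method above skips only the first occurrence of the character. For the behavior asked for in the question (a&b&&c becoming ab&&c), a loop-based sketch without regex could look like the following. The names `IsolatedCharRemover` and `RemoveIsolated` are hypothetical.

```csharp
using System.Text;

public static class IsolatedCharRemover
{
    // Removes charToRemove only where it is not adjacent to another charToRemove.
    public static string RemoveIsolated(string source, char charToRemove)
    {
        var sb = new StringBuilder(source.Length);
        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            bool prevSame = i > 0 && source[i - 1] == charToRemove;
            bool nextSame = i + 1 < source.Length && source[i + 1] == charToRemove;

            if (c == charToRemove && !prevSame && !nextSame)
                continue; // a lone occurrence: skip it

            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

`IsolatedCharRemover.RemoveIsolated("a&b&&c", '&')` should return `"ab&&c"`; runs of two or more are left untouched, which mirrors the look-around regex `(?<!&)&(?!&)`.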
I would use something like this, which IMO should be better than using Regex:
public static class StringExtensions
{
public static string ReplaceFirst(this string source, char oldChar, char newChar)
{
if (string.IsNullOrEmpty(source)) return source;
int index = source.IndexOf(oldChar);
if (index < 0) return source;
var chars = source.ToCharArray();
chars[index] = newChar;
return new string(chars);
}
}
I'll contribute to this statement from the comments:
in this case, only the substring with odd number of '&' will be replaced by all the "&" except the last "&" . "&&&" would be "&&" and "&&&&" would be "&&&&"
This is a pretty neat solution using balancing groups (though I wouldn't call it particularly clean nor easy to read).
Code:
string str = "11&222&&333&&&44444&&&&55&&&&&";
str = Regex.Replace(str, "&((?:(?<2>&)(?<-2>&)?)*)", "$1$2");
Output:
11222&&333&&44444&&&&55&&&&
It always matches the first & (not captured).
If it's followed by an even number of &, they're matched and stored in $1. The second group is captured by the first of the pair, but then it's subtracted by the second.
However, if there's an odd number of &, the optional group (?<-2>&)? does not match, and the group is not subtracted. Then, $2 will capture an extra &
For example, matching the subject "&&&&", the first char is consumed and it isn't captured (1). The second and third chars are matched, but $2 is subtracted (2). For the last char, $2 is captured (3). The last 3 chars were stored in $1, and there's an extra & in $2.
Then, the substitution "$1$2" == "&&&&".

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this. See the demo. Replace by an empty string. You can use a lookahead here to check whether the quoted text contains the word dark.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
Something like this would work. You can add all the strings you want to keep to the excludedString array.
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method through regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace each match with $1
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) called capturing group.
'[^']*\bdark\b[^']*' would match all the single quoted strings which contains the substring dark . [^']* matches any character but not of ', zero or more times.
('[^']*\bdark\b[^']*'), because the regex is within a capturing group, all the matched characters are stored inside the group index 1.
| Next comes the regex alternation operator.
'[^']*' Now this matches all the remaining (except the one contains dark) single quoted strings. Note that this won't match the single quoted string which contains the substring dark because we already matched those strings with the pattern exists before to the | alternation operator.
Finally replacing all the matched characters with the chars inside group index 1 will give you the desired output.
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); //trim the trailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); //trim the trailing space again

Regular expression to check if a string is within certain pattern that may contain nested parentheses in c#

I have been trying to write a code that will check if the given string contains certain strings with certain pattern.
To be precise, for example:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};
Now, I want to extract
"homo sapiens", "human" and "woman" but NOT "man"
from the above list as they follow the pattern, i.e. a string directly preceded by ~, or one of the strings inside a parenthesized group that starts with ~.
So far I have come up with:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};
var prunedList = new List<string>();
foreach(var term in checkList)
{
var pattern = @"~(\s)*(\(\s*)?(\(?\w\s*\)?)*" + term + @"(\s*\))?";
Match m = Regex.Match(mainString, pattern);
if(m.Success)
{
prunedList.Add(term);
}
}
But this pattern is not working for all cases...
Can any one suggest me how this can be done?
I wrote a simple parser that works well for the example you gave.
I don't know what the expected behavior is for a string that ends in this pattern: ~(some words (ie, no closing parenthesis with valid opening)
I'm sure you could clean this up some...
private bool Contains(string source, string given)
{
return ExtractValidPhrases(source).Any(p => RegexMatch(p, given));
}
private bool RegexMatch(string phrase, string given)
{
return Regex.IsMatch(phrase, string.Format(@"\b{0}\b", given), RegexOptions.IgnoreCase);
}
private IEnumerable<string> ExtractValidPhrases(string source)
{
bool valid = false;
var parentheses = new Stack<char>();
var phrase = new StringBuilder();
for(int i = 0; i < source.Length; i++)
{
if (valid) phrase.Append(source[i]);
switch (source[i])
{
case '~':
valid = true;
break;
case ' ':
if (valid && parentheses.Count == 0)
{
yield return phrase.ToString();
phrase.Clear();
}
if (parentheses.Count == 0) valid = false;
break;
case '(':
if (valid)
{
parentheses.Push('(');
}
break;
case ')':
if (valid)
{
parentheses.Pop();
}
break;
}
}
//if (valid && !parentheses.Any()) yield return phrase.ToString();
if (valid) yield return phrase.ToString();
}
Here are the tests I used:
// NUnit tests
[Test]
[TestCase("Homo Sapiens", true)]
[TestCase("human", true)]
[TestCase("woman", true)]
[TestCase("man", false)]
public void X(string given, bool shouldBeFound)
{
const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
Assert.AreEqual(shouldBeFound, Contains(mainString, given));
}
[Test]
public void Y()
{
const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
var checkList = new List<string> {"homo sapiens", "human", "man", "woman"};
var expected = new List<string> { "homo sapiens", "human", "woman" };
var filtered = checkList.Where(s => Contains(mainString, s));
CollectionAssert.AreEquivalent(expected, filtered);
}
The language of balanced parenthesis is not regular and as a result you cannot accomplish what you want using RegEx. A better approach would be to use traditional string parsing with a couple of counters - one for open paren and one for close parens - or a stack to create a model similar to a Push Down Automaton.
To get a better idea of the concept check out PDA's on Wikipedia. http://en.wikipedia.org/wiki/Pushdown_automaton
Below is an example using a stack to get strings inside the out most parens (pseudo code).
Stack stack = new Stack();
char[] stringToParse = originalString.toCharArray();
for (int i = 0; i < stringToParse.Length; i++)
{
if (stringToParse[i] == '(')
stack.push(i);
if (stringToParse[i] == ')')
string StringBetweenParens = originalString.GetSubstring(stack.pop(), i);
}
Now of course this is a contrived example and would need some work to do more serious parsing, but it gives you the basic idea of how to do it. I've left out things like; the correct function names (don't feel like looking them up right now), how to get text in nested parens like getting "inner" out of the string "(outer (inner))" (that function would return "outer (inner)"), and how to store the strings you get back.
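For reference, a compilable C# sketch of the same stack idea is given below. It collects the text inside every balanced pair, so both "inner" and "outer (inner)" are returned. The names `ParenExtractor` and `Extract` are hypothetical, and unmatched closing parentheses are simply ignored, which is an assumption rather than a requirement from the question.

```csharp
using System.Collections.Generic;

public static class ParenExtractor
{
    // Returns the text inside every balanced pair of parentheses.
    // Inner groups appear before the groups that enclose them, because they close first.
    public static List<string> Extract(string text)
    {
        var openPositions = new Stack<int>();
        var results = new List<string>();

        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '(')
            {
                // Remember where this group started.
                openPositions.Push(i);
            }
            else if (text[i] == ')' && openPositions.Count > 0)
            {
                // Pair this ')' with the most recent unmatched '('.
                int open = openPositions.Pop();
                results.Add(text.Substring(open + 1, i - open - 1));
            }
        }
        return results;
    }
}
```

`ParenExtractor.Extract("(outer (inner))")` should return the two strings `inner` and `outer (inner)`, in that order.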
Simply for academic reasons, I would like to present the regex solution, too. Mostly, because you are probably using the only regex engine that is capable of solving this.
After clearing up some interesting issues about the combination of .NET's unique features, here is the code that gets you the desired results:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string> { "homo sapiens", "human", "man", "woman" };
// build subpattern "(?:homo sapiens|human|man|woman)"
string searchAlternation = "(?:" + String.Join("|", checkList.ToArray()) + ")";
MatchCollection matches = Regex.Matches(
mainString,
@"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)"+searchAlternation,
RegexOptions.IgnoreCase
);
Now how does this work? Firstly, .NET supports balancing groups, which allow for detection of correctly nested patterns. Every time we capture something with a named capturing group
(like (?<Depth>somepattern)) it does not overwrite the last capture, but instead is pushed onto a stack. We can pop one capture from that stack with (?<-Depth>). This will fail, if the stack is empty (just like something that does not match at the current position). And we can check whether the stack is empty or not with (?(Depth)patternIfNotEmpty|patternIfEmpty).
In addition to that, .NET has the only regex engine that supports variable-length lookbehinds. If we can use these two features together, we can look to the left of one of our desired strings and see whether there is a ~( somewhere outside the current nesting structure.
But here is the catch (see the link above). Lookbehinds are executed from right to left in .NET, which means that we need to push closing parens and pop on encountering opening parens, instead of the other way round.
So here is for some explanation of that murderous regex (it's easier to understand if you read the lookbehind from bottom to top, just like .NET would do):
(?<= # lookbehind
~ # if there is a literal ~ to the left of our string, we're good
| # OR
(?(Depth)(?!)) # if there is something left on the stack, we started outside
# of the parentheses that were opened by "~("
~[(] # match a literal ~(
(?> # subpattern to analyze parentheses. the > makes the group
# atomic, i.e. suppresses backtracking. Note: we can only do
# this, because the three alternatives are mutually exclusive
[^()]+ # consume any non-parens characters without caring about them
| # OR
(?<-Depth>)? # pop the top of stack IF possible. the last ? is necessary for
# cases like "human" where we start with a ( before there was a )
# which could be popped.
[(] # match a literal (
| # OR
(?<Depth>[)]) # match a literal ) and push it onto the stack
)* # repeat for as long as possible
) # end of lookbehind
(?:homo sapiens|human|man|woman)
# match one of the words in the check list
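To see the push, pop and conditional constructs in isolation, here is a smaller sketch that only checks whether a whole string has balanced parentheses. It runs left to right (no lookbehind), so opening parentheses are pushed and closing ones pop. This is a simplified illustration rather than the pattern from the answer, and the names `BalancedParens` and `IsBalanced` are made up.

```csharp
using System.Text.RegularExpressions;

public static class BalancedParens
{
    // [^()]           : any non-parenthesis character
    // (?<Depth>\()    : match "(" and push a capture onto the Depth stack
    // (?<-Depth>\))   : match ")" and pop one capture; fails if the stack is empty
    // (?(Depth)(?!))  : at the end, fail if anything is still on the stack
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<Depth>\()|(?<-Depth>\)))*(?(Depth)(?!))$");

    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }
}
```

With this sketch, `IsBalanced("(a(b))")` is expected to be true, while `IsBalanced("(a")` and `IsBalanced(")(")` are expected to be false.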
Parenthesis checking is a context-free language or grammar which requires a stack for checking. Regular expressions are suitable for regular languages. They do not have memory, therefore they cannot be used for such purposes.
To check this you need to scan the string and count the parentheses:
initialize count to 0
scan the string
if current character is ( then increment count
if current character is ) then decrement count
if count is negative, raise an error that parentheses are inconsistent; e.g., )(
In the end, if count is positive, then there are some unclosed parentheses
If count is zero, then the test is passed
Or in C#:
public static bool CheckParentheses(string input)
{
int count = 0;
foreach (var ch in input)
{
if (ch == '(') count++;
if (ch == ')') count--;
// if a parenthesis is closed without being opened return false
if(count < 0)
return false;
}
// in the end the test is passed only if count is zero
return count == 0;
}
You see, since regular expressions are not capable of counting, then they cannot check such patterns.
This is not possible using regular expressions.
You should abandon idea of using them and use normal string operations like IndexOf.
