C# Regex for a match outside a specific region

I have to find occurrences of a certain string (needle) within another string (haystack) that don't occur between specific "braces".
For example consider this haystack:
"BEGIN something END some other thing BEGIN something else END yet some more things."
And this needle:
"some"
With the braces "BEGIN" and "END"
I want to find all needles that are not between braces.
(there are two matches: the "some" followed by "other" and the "some" followed by "more")
I figured I could solve this with a Regex with negative lookahead/lookbehind, but how?
I have tried
(?<!(BEGIN))some(?!(END))
which gives me 4 matches (obviously because no "some" is directly enclosed between "BEGIN" and "END")
I also tried
(?<!(BEGIN.*))some(?!(.*END))
but this gives me no matches at all (obviously because each needle is somehow preceded by a "BEGIN")
Now I'm stuck.
Here's the latest C# code I used:
string input = "BEGIN something END some other thing BEGIN something else END yet some more things.";
global::System.Text.RegularExpressions.Regex re = new Regex(@"(?<!(BEGIN.*))some(?!(.*END))");
global::System.Text.RegularExpressions.MatchCollection matches = re.Matches(input);
global::NUnit.Framework.Assert.AreEqual(2, matches.Count);

Would something like this work for you:
(?:^|END)((?!BEGIN).*?)(some)(.*?)(?:BEGIN|$)
This appears to match your text, as I tested using RegExDesigner.NET.

One simple option is to skip the parts you don't want to match, and capture only the needles you need:
MatchCollection matches = Regex.Matches(input, "BEGIN.*?END|(?<Needle>some)");
You'll get the two "some"s you're after by taking the successful "Needle" groups out of all matches:
IEnumerable<Group> needles = matches.Cast<Match>()
.Select(m => m.Groups["Needle"])
.Where(g => g.Success);
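The skip-and-capture idea above can be sketched as a small helper. The class and method names (NeedleFinder, FindOutsideRegions) are made up for illustration; the pattern is the one from the answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are not from the original answer.
public static class NeedleFinder
{
    // Returns the start index of every "some" that is not inside a BEGIN...END region.
    public static List<int> FindOutsideRegions(string input)
    {
        // First alternative consumes a whole region; second alternative captures a needle.
        MatchCollection matches = Regex.Matches(input, "BEGIN.*?END|(?<Needle>some)");
        return matches.Cast<Match>()
                      .Select(m => m.Groups["Needle"])
                      .Where(g => g.Success)   // region matches leave the group unset
                      .Select(g => g.Index)
                      .ToList();
    }
}
```

For the haystack in the question this should yield the indexes 20 and 66. Note that . does not match a line feed by default, so regions that span several lines would probably need RegexOptions.Singleline.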

You might try splitting the string on occurrences of BEGIN or END so that you can ensure that there is only one BEGIN and one END in the string that you apply your regex to. Also, if you are looking for occurrences of "some" that are outside your BEGIN/END braces, then I think you'd want to look behind for END and look ahead for BEGIN (positive lookahead/lookbehind), the opposite of what you have.
Hope this helps.

What if you just process the entire haystack and ignore the hay that is in between the braces (am I pushing the metaphor too far?)
For example, look through all the tokens (or characters, if you need to go to that level) and look for your braces. When the opening one is found, you loop through until you find the closing brace. At that point, you start looking for your needles until you find another opening brace. It's a bit more code than a Regex, but might be more readable and easier to troubleshoot.
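A minimal sketch of that manual scan, assuming plain ordinal string comparison and non-nested regions. RegionScanner and FindOutside are hypothetical names, and an unterminated opening brace is treated here as hiding the rest of the input, which is only one possible choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not from the original answer.
public static class RegionScanner
{
    // Returns the start index of every needle found outside open...close regions.
    public static List<int> FindOutside(string haystack, string needle, string open, string close)
    {
        if (string.IsNullOrEmpty(needle)) throw new ArgumentException("needle must not be empty");
        var hits = new List<int>();
        int pos = 0;
        while (pos < haystack.Length)
        {
            // The searchable stretch runs from pos up to the next opening brace (or the end).
            int nextOpen = haystack.IndexOf(open, pos, StringComparison.Ordinal);
            int regionEnd = nextOpen < 0 ? haystack.Length : nextOpen;

            int hit = haystack.IndexOf(needle, pos, regionEnd - pos, StringComparison.Ordinal);
            while (hit >= 0)
            {
                hits.Add(hit);
                int from = hit + needle.Length;
                hit = haystack.IndexOf(needle, from, regionEnd - from, StringComparison.Ordinal);
            }

            if (nextOpen < 0) break;   // no more regions
            int nextClose = haystack.IndexOf(close, nextOpen + open.Length, StringComparison.Ordinal);
            if (nextClose < 0) break;  // assumption: an unterminated region hides the rest
            pos = nextClose + close.Length;
        }
        return hits;
    }
}
```

Called as RegionScanner.FindOutside(input, "some", "BEGIN", "END") on the haystack from the question, this should report the same two positions as the regex approaches.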

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

So I have the following RegEx for the purpose of finding and adding whitespace:
(\S)(\()
So for a string like "SomeText(Somemoretext)" I want to update this to "SomeText (Somemoretext)". It matches "t(", and so my replace eliminates the "t" from the string, which is not good. I also do not know what the character could be; I'm merely trying to find the non-existence of whitespace.
Is there a better expression to use, or is there a way to exclude the found character from the match returned so that I can safely replace without catching characters I do not want to replace?
Thanks

I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
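For comparison, here is a small sketch that puts the $1 substitution next to a lookbehind variant, which matches only the parenthesis itself. The class and method names are invented for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are for illustration only.
public static class ParenSpacing
{
    // Capture the preceding non-space character and put it back via $1.
    public static string WithCaptureGroup(string s) =>
        Regex.Replace(s, @"(\S)\(", "$1 (");

    // Assert the preceding non-space character with a lookbehind,
    // so the match consists of the "(" alone.
    public static string WithLookbehind(string s) =>
        Regex.Replace(s, @"(?<=\S)\(", " (");
}
```

Both variants should turn "SomeText(Somemoretext)" into "SomeText (Somemoretext)" and leave "a (b)" alone. They can differ on runs of consecutive parentheses, because the capture-group version consumes the character before each "(" while the lookbehind does not.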

If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words ([a-zA-Z0-9_]) that are followed by a bracket without a space.
(If I got your problem correctly.) You can replace the full match with " (" and get exactly what you need.
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (that is, the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).

Prevent Regex from devouring optional part of the match

I've searched extensively but I can't find a simple answer to this and my Regex experience is limited. I'd appreciate a simple solution that is explained, please.
I have a very large string and I need to substitute certain words in it as follows:
Example: wherever you find the string "LINK-ABC" make it "LINK_ABC".
I wrote my Regex Match and Replace strings:
@"LINK-ABC", @"LINK_ABC" and it worked.
But there were a couple of things I had not recognized.
There COULD be words in the file like this:
LINK-ABC-DEF LINK-ABC-GHI-JKL ... and so on.
So I get "LINK_ABC-DEF" etc. (which is NOT what I want; this should have remained intact...)
Once I realized the problem it seemed that what I REALLY wanted was to recognize ONLY the word being matched and leave any cases where it was in combination with something else, unchanged. It seemed to me that if I checked for a space or period on the Match word, that should do it, so...
@"LINK-ABC[ |\\.]", @"LINK_ABC"
... and now I have stumbled.
Sample string:
link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx.
Match/Replace string:
link-xxx[ |\\.],link_xxx
Result string:
link_xxxlink-aaa-sss link-xxx-bbb link_xxxlink_xxx
The replacements are correct, BUT the trailing space or period has been "devoured" and so the result string is wrong.
Is there a way that I can match so that if it matches on space, the replacement will have a space and if it matches on a period, the replacement will have a period? I s'pose I could do 2 separate matches but I'd like to increase my understanding of Regex and do it more elegantly if it is possible.

You should be able to achieve the behavior you want with "capture groups"
var matchstring = @"link-xxx([ \.]|$)";
var fixstr = @"link_xxx$1";
The parentheses around the last part of the matchstring will retain whatever matched inside them, and the $1 in the fixstr will substitute whatever was captured by that group.
I've also modified your punctuation section a little bit, presuming you want to replace a match if it happens to be the last word in the input (by adding the |$). A | inside a character class [] is a literal | character, so I removed it assuming you don't actually expect that in your input.
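Put together, the capture-group fix might look like the sketch below. The lookahead variant is an addition of mine, not part of the answer above; it checks for the trailing space, period, or end of input without consuming it, so nothing has to be put back. LinkRenamer and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are for illustration only.
public static class LinkRenamer
{
    // Capture the trailing space/period (or end of input) and re-insert it with $1.
    public static string WithCaptureGroup(string input) =>
        Regex.Replace(input, @"link-xxx([ \.]|$)", "link_xxx$1");

    // Alternative (an assumption, not from the answer): only look ahead at the
    // delimiter, so it is never part of the match.
    public static string WithLookahead(string input) =>
        Regex.Replace(input, @"link-xxx(?=[ .]|$)", "link_xxx");
}
```

On the sample string from the question, both methods should produce "link_xxx link-aaa-sss link-xxx-bbb link_xxx link_xxx." with the delimiters intact.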

Regex not capturing date

I have a regex that works fine currently. But now I want to add on to it to capture dates.
Current regex:
(?<GeneralHelp>^/help\s*)?
(?:/client:)
(?<Client>\w*)
(?:(?:\s*/(?<ClientHelp>help))*)*
(?:(?:\s*/)(?<Modules>createHistory)(?:(?:\s*/(?<ModuleHelp>help))*)*)*
I added to the end:
(?:(?:\s*/)(?<StartDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
(?:(?:\s*/)(?<EndDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
Using the below example, it just won't get the dates, but it does match everything else.
/client:testClient/createHistory/11-11-2013/11.11.2013
This regex is used to break up the single string in the string array parameter of Main in a console app. No one on my team is "fluent" in regex, nor do we have time to become fluent. We work with what we can, and this addition is something I thought of today that may help with bigger problems that we have with our project, and we are running low on time. So any help would be appreciated.

First, the ^ in your regex means "start of string", that is you only want to match a date at the start of the string (which is not true for you). So remove it. Same with "$" which means "end of string".
Secondly, [0|1] means "match characters 0, | or 1". You probably want [01] meaning "match characters 0 or 1".
Thirdly, you have an extra closing bracket with an unmatched opening bracket in both your regexes.
Fourthly as a general style point, [0] is the same as 0 so the square brackets are redundant here.
So your (not quite!) "fixed" regex is:
(?:(?:\s*/)(?<StartDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:(?:\s*/)(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
However, this will not match your test string because of the extra "/testModule" in the string which is not in your working regex anywhere.
You could modify your original regex to allow extra slashes in between the two parts of regex?
<original regex>
(?:/[^/]+)* # <-- for the /testModule and any other similar tokens that appear in between
<date regex>
Also as a general point
- You have a few occurrences of (?:(?:regex)*)*. I am not sure what the point is of doubling the outer *, besides making the regex parser work much harder than it should for no good reason (the outer (?: )* is redundant here).
- There is no point in writing (?:\s*/), as you are not doing anything with the brackets, so just write \s*/.
- The same goes for things like (?:/client:). Why have non-capturing brackets if you are not doing anything with them? /client: will do.
- (?:regex)* means "match 0 to infinity occurrences of regex". With things like (?:\s*/(?<ClientHelp>help))*, do you really expect this to occur infinitely many times in your string, or will it appear just once or not at all? Consider replacing * with ?, which means "match 0 or 1 occurrences" (if you know that the token will appear either once or not at all), or replace it with (say) {0,100} if you know that the token will appear at most 100 times (and at least 0 times). This can improve performance.
So I recommend changing your regex like this:
(?<GeneralHelp>^/help\s*)?
/client:
(?<Client>\w*)
(?:\s*/(?<ClientHelp>help))*
(?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
(?:/[^/]+)*
(?:\s*/(?<StartDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:\s*/(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
You can fiddle around with your regex at regexr, where I've created an example with your regex/test string. (Edit: the < and > in the regex seem to have been changed to &lt; and &gt; in regexr, so the link won't work unless you copy/paste the regex I've written directly.)
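One thing worth checking when trying the combined pattern in C#: as far as I can tell, the greedy (?:/[^/]+)* token will happily consume the two date segments itself, and since the date groups are optional the match still succeeds with StartDate and EndDate left empty. The sketch below builds the pattern with RegexOptions.IgnorePatternWhitespace and compares it with an adjusted variant. The adjustments are my own assumptions, not part of the answer: a lazy skip token (*?), ? instead of * on the date groups so StartDate cannot take both dates, and a trailing $ to force the engine to account for the whole argument string. ArgPatternSketch and its members are hypothetical names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical sketch; names and the "Adjusted" variant are assumptions.
public static class ArgPatternSketch
{
    // Day, month, year with '.', ' ' or '-' as separator (taken from the answer).
    const string Date =
        @"(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})";

    // skip: token that absorbs unknown /segments; dateQuantifier: quantifier on
    // the date groups; tail: optional end anchor.
    static Regex Build(string skip, string dateQuantifier, string tail) => new Regex(
        @"(?<GeneralHelp>^/help\s*)?
          /client:
          (?<Client>\w*)
          (?:\s*/(?<ClientHelp>help))*
          (?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
          " + skip + @"
          (?:\s*/(?<StartDate>" + Date + "))" + dateQuantifier + @"
          (?:\s*/(?<EndDate>" + Date + "))" + dateQuantifier + @"
          " + tail,
        RegexOptions.IgnorePatternWhitespace); // spaces inside [. -] stay literal

    // The pattern as recommended above: greedy skip, * on the dates, no anchor.
    public static readonly Regex AsRecommended = Build(@"(?:/[^/]+)*", "*", "");

    // Adjusted variant (assumption): lazy skip, optional dates, anchored at the end.
    public static readonly Regex Adjusted = Build(@"(?:/[^/]+)*?", "?", "$");
}
```

With the test string /client:testClient/createHistory/11-11-2013/11.11.2013, Adjusted.Match(...).Groups["StartDate"].Value should be 11-11-2013 and Groups["EndDate"].Value should be 11.11.2013, while AsRecommended matches but leaves both date groups unset.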

If you're sure these two last fields are dates, you could simply add something like
(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$
(or even (?<StartDate>[^/]+)/(?<EndDate>.+)$ if your cases are all in the same pattern and it fits your needs).
Also as already pointed out by mathematical.coffee, the first regex can be improved.

Replacing text between single quotes

I'm trying to replace some text in an Excel formula using FlexCel, C# and Regex, but single quotes appear to be tricky from the googling I've done.
The string is like this;
"=RIGHT(CELL(\"filename\",'C.Whittaker'!$A$1),LEN(CELL(\"filename\",'C.Whittaker'!$A$1))-FIND(\"]\",CELL(\"filename\",'C.Whittaker'!$A$1)))"
I want to replace C.Whittaker with another name. It's always going to be First Initial . Last Name and it's always going to be inside single quotes.
I've got this regex matching; (?:')[^\"]+(?:')
I thought the (?:') means that regex matches it, but then ignores it for a replace, yet this doesn't seem to be the case. Suspect the issue is to do with how strings are handled, but it's a bit beyond me.

(?:...) is a non-capturing group; it doesn't capture what it matches (that is, it doesn't make that part of the string available later by means of a backreference), but it does consume it. What you're thinking of is lookarounds, which don't consume what they match.
(?<=')[^']+(?=')
But do you really need them? You might find it simpler to consume the quotes and then add them to the replacement string. In other words, replace '[^']+' with 'X.Newname'.
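A short sketch of the "consume the quotes and add them back" suggestion, with hypothetical names (SheetNameReplacer and its methods). It also counts matches for both patterns, because with several quoted names in one formula the lookaround pattern appears to match the text between a closing quote and the next opening quote as well: the quotes are never consumed, so every stretch between two quotes qualifies.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are for illustration only.
public static class SheetNameReplacer
{
    // Replace every '...' section, re-adding the quotes around the new name.
    // A MatchEvaluator is used so that a '$' in newName is not treated as a substitution.
    public static string ReplaceQuotedName(string formula, string newName) =>
        Regex.Replace(formula, "'[^']+'", m => "'" + newName + "'");

    // Number of matches when the quotes are consumed as part of the match.
    public static int CountConsuming(string formula) =>
        Regex.Matches(formula, "'[^']+'").Count;

    // Number of matches when the quotes are only asserted by lookarounds.
    public static int CountLookaround(string formula) =>
        Regex.Matches(formula, "(?<=')[^']+(?=')").Count;
}
```

For the formula in the question, the consuming pattern should find 3 matches (the three sheet names) and the lookaround pattern 5, which is another reason to prefer consuming the quotes here.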

I believe you need to escape the character if you want Regex to take it literally...
ie: (?:')
becomes (?:\')

Do you really need to use regex? Regex is good for complex cases where speed is important, but it leads to difficult to maintain code.
Try this instead:
var formula2 =
String.Join("'", formula
.Split(new [] { '\'' })
.Select((x, n) => n % 2 == 1 ? "new name here" : x));

Finding strings that begin with -- and end with a linefeed

I am using the following regex to locate in a document any series of characters that begins with two dashes (--) and ends with a line feed character (\n).
return @"(^--).*?(?=\r|\n)";
Almost works but only when there is a space between the -- and the next character.
return @"(?:--\s).*?(?=\r|\n)";
Almost works but only when there is no space between the -- and the next character.
How do I get my return whether a space is following the -- or not?
I know nothing of regex other than what it's capable of. I found both of these sample patterns online. Thanks for your assistance.

You need to use \s? to capture either 0 or 1 spaces.
One use of the question mark in regex is to indicate that 0 or one matches of the previous character (or group of characters) will be matched, but not more than one.
Also, if you ever have the desire to learn regex for yourself, visit http://www.regular-expressions.info to learn and http://www.regexpal.com to practice.

Assuming that you are searching for substrings in a larger string and want to capture the substring between -- and \n, you could use an expression like:
--(.*)\r?\n
Which can be quoted in C# like this:
@"--(.*)\r?\n"
If you just want to make sure that a string starts with -- and ends with \n you could use:
(?s)^--.*\n\z
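As a rough sketch of how these suggestions could be combined, the helper below collects the text of every -- run up to the end of its line. Two details are my own choices rather than something stated in the answers: [ \t]? is used for the optional space instead of \s?, because \s also matches line breaks, and the capture is lazy (.*?) so that a \r from a Windows line ending does not end up inside the captured text. DashLineFinder and FindDashLines are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are for illustration only.
public static class DashLineFinder
{
    // Returns the text after each "--" (minus one optional space or tab),
    // up to but not including the line break.
    public static List<string> FindDashLines(string text)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(text, @"--[ \t]?(.*?)\r?\n"))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Given "-- first\r\nSELECT 1 --second\nno comment here\n", this should return "first" and "second", whether or not a space follows the dashes.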
