Prevent Regex from devouring optional part of the match

Prevent Regex from devouring optional part of the match - c#

I'v searched extensively but I can't find a simple answer to this and my Regex experience is limited. I'd appreciate a simple solution that is explained, please.
I have a very large string and I need to substitute certain words in it as follows:
Example: wherever you find the string "LINK-ABC" make it "LINK_ABC".
I wrote my Regex Match and Replace strings:
#"LINK-ABC", #"LINK_ABC" and it worked.
But there were a couple of things I had not recognized.
There COULD be words in the file like this:
LINK-ABC-DEF LINK-ABC-GHI-JKL ... and so on.
So I get "LINK_ABC-DEF" etc. (which is NOT what I want; this should have remained intact...)
Once I realized the problem it seemed that what I REALLY wanted was to recognize ONLY the word being matched and leave any cases where it was in combination with something else, unchanged. It seemed to me that if I checked for a space or period on the Match word, that should do it, so...
#"LINK-ABC[ |\\.]",#"LINK_ABC"
... and now I have stumbled.
Sample string:
link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx.
Match/Replace string:
link-xxx[ |\\.],link_xxx
Result string:
link_xxxlink-aaa-sss link-xxx-bbb link_xxxlink_xxx
The replacements are correct, BUT the trailing comma or period has been "devoured" and so the result string is wrong.
Is there a way that I can match so that if it matches on space, the replacement will have a space and if it matches on a period, the replacement will have a period? I s'pose I could do 2 separate matches but I'd like to increase my understanding of Regex and do it more elegantly if it is possible.

You should be able to achieve the behavior you want with "capture groups"
var matchstring = #"link-xxx([ \.]|$)";
var fixstr = #"link_xxx$1";
The parenthesis around the last part of the matchstring will retain whatever matched inside it, and the $1 in the fixstr will substitute whatever was captured by that group.
I've also modified your punctuation section a little bit, presuming you want to replace a match if it happens to be the last word in the input (by adding the |$). A | inside a character class [] is a literal | character, so I removed it assuming you don't actually expect that in your input.

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

So i have the following RegEx for the purpose of finding and adding whitespace:
(\S)(\()
So for a string like "SomeText(Somemoretext)" I want to update this to "SomeText (Somemoretext)" it matches "t(" and so my replace eliminates the "t" from the string which is not good. I also do not know what the character could be, I'm merely trying to find the non-existence of whitespace.
Is there a better expression to use or is there a way to exclude the found character from the match returned so that I can safely replace without catching characters i do not want to replace?
Thanks

I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", #"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.

If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words ( [a-zA-z0-9_] that are followed by a bracket, without a space).
(if I got your problem correctly) you can replace the full match with ( and get exactly what you need.
In case you need to look for absence spaces before a symbol (like a bracket) in any kind of text (as in the text may be non-word, such as punctuation) you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).

Capture Text Surrounding Regex Match .NET

I am building an application, and I have a requirement to capture characters before and after matches. This seems to work okay, except when there are multiple matches within the surrounding capture.
Regex:
.{0,10}(?=abc)
This should capture up to 10 characters before the string "abc" is found.
The issue comes up if there is a recurrence of the match in the preceding text:
"qqqqabcabcqqq"
With the above text, I would expect two captures:
qqqq (the 4 characters before the first abc occurrence)
qqqqabc (the 7 characters before the second abc occurrence)
I am not, however getting these matches. The only match I get is:
qqqqabc
I am certain that I am missing something, but I am not sure what. I believe that my regex is somehow being too greedy, and so it is overlooking the first match in favor of the larger, second one. Here is what I need:
I need a regex that:
1. Is for .NET
2. Looks within a string for X characters before an exact match on string S.
3. Includes any secondary match on S (call S') that is found within X characters before S
4. does not care in the slightest what these characters are.
I assure you, I tried looking for similar answers but I wasn't able to find anything that directly answers this question (which has been plaguing me for two days. Yes, I have to use regular expression). As for Regex flavor, I am working in .NET.
Thank you so much for any help.

Here it is:
(?<=(?<CharsBefore>.{0,10}))(?=abc)
Took me a while to remember that .NET allows positive lookbehinds with variability.
Regex test
Demo in C#
I changed the way your initial version worked a bit.
Hope it helps!
PS: I've named the group, but you are obviously free to keep it nameless and work with numbered groups if you want a less cluttered regex, like so:
(?<=(.{0,10}))(?=abc)

Regex not capturing date

I have a regex that works fine currently. But now I want to add on to it to capture dates.
Current regex:
(?<GeneralHelp>^/help\s*)?
(?:/client:)
(?<Client>\w*)
(?:(?:\s*/(?<ClientHelp>help))*)*
(?:(?:\s*/)(?<Modules>createHistory)(?:(?:\s*/(?<ModuleHelp>help))*)*)*
I added to the end:
(?:(?:\s*/)(?<StartDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
(?:(?:\s*/)(?<EndDate>^([0]?[1-9]|[1|2][0-9]|[3][0|1])[. -]([0]?[1-9]|[1][0-2])[. -]([0-9]{4}|[0-9]{2})$))*)*
Using the below example, it just won't get the dates, but it does match everything else.
/client:testClient/createHistory/11-11-2013/11.11.2013
This regex is used to break up the Main one string in the string array parameter from a console app. No one on my team in "fluent" in regex, nor do we have time to become fluent. We work with what we can and this addition is something I thought of today that may have with bigger problems what we have with our project and we are running low on time. So any help would be appreciated.

First, the ^ in your regex means "start of string", that is you only want to match a date at the start of the string (which is not true for you). So remove it. Same with "$" which means "end of string".
Secondly, [0|1] means "match characters 0, | or 1". You probably want [01] meaning "match characters 0 or 1".
Thirdly, you have an extra closing bracket with an unmatched opening bracket in both your regexes.
Fourthly as a general style point, [0] is the same as 0 so the square brackets are redundant here.
So your (not quite!) "fixed" regex is:
(?:(?:\s*/)(?<StartDate>(0?[1-9]|[12][0-9]|[3][01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:(?:\s*/)(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
However, this will not match your test string because of the extra "/testModule" in the string which is not in your working regex anywhere.
You could modify your original regex to allow extra slashes in between the two parts of regex?
<original regex>
(?:/[^/]+)* # <-- for the /testModule and any other similar tokens that appear in between
<date regex>
Also as a general point
you have a few occurences (?:(?:regex)*)*. I am not sure what the point is of doubling the outer * besides making the regex parser work much harder than it should for no good reason (the outer (?: )* is redundant here).
there is no point doing (?:/\s*) as you are not doing anything with the brackets, so just do /\s*
same with things like (?:/client:). Why have non-capturing brackets if you are not doing anything with them. /client: will do.
(?:regex)* means "match 0 to infinity occurences of regex". With things like (?:\s*/(?<ClientHelp>help))*, do you really expect this to occur infinitely many times in your string, or will it appear just once or not at all? Consider replacing * with ? which means "match 0 or 1 occurences" (if you know that that token will appear either once or not at all), or replace it with (say) {0, 100} if you know that that token will appear at most 100 times (and at least 0 times). This can improve performance.
So I recommend changing your regex like this:
(?<GeneralHelp>^/help\s*)?
/client:
(?<Client>\w*)
(?:\s*/(?<ClientHelp>help))*
(?:\s*/(?<Modules>createHistory)(?:\s*/(?<ModuleHelp>help))*)*
(?:/[^/]+)*
(?:\s*/(?<StartDate>(0?[1-9]|[12][0-9]|[3][01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
(?:\s*/(?<EndDate>(0?[1-9]|[12][0-9]|3[01])[. -](0?[1-9]|1[0-2])[. -]([0-9]{4}|[0-9]{2})))*
You can fiddle around with your regex at regexr where I've created an example with your regex/test string. (Edit: the < and > in the regex seem to have been changed to < and > in regexr so the link won't work unless you copy/paste the regex I've written directly)

If you're sure these two last fields are dates, you could simply add something like
(?<StartDate>(?:\d+[. -]?){3})/(?<EndDate>.*)$
(or even (?<StartDate>[^/]+)/(?<EndDate>.+)$ if your cases are all in the same pattern and it fits your needs).
Also as already pointed out by mathematical.coffee, the first regex can be improved.

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string r = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.

You are getting only one match because, by default, .NET regular expressions are "greedy"; they try to match as much as possible with a single match.
So if your value is [BName][BAddress] you will have one match - which will match the entire string; so it will match from the [B at the beginning all the way to the last ] - instead of the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible... leaving the second group to be its own match.
Slaks also noted an excellent option; specifying specifically that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
On a side note, I recommend using the C# "literal string" specifier # for regular expression patterns, so that you do not need to double-escape things in regex patterns; So I would set the pattern like so:
string pattern = #"\[B.+?\]";
This makes it much easier to figure out regular expressions that are more complex

Try the regex string \\[B.+?\\] instead. .+ on it's own (same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.

.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].

I need a regular expression to convert US tel number to link

Basically, the input field is just a string. People input their phone number in various formats. I need a regular expression to find and convert those numbers into links.
Input examples:
(201) 555-1212
(201)555-1212
201-555-1212
555-1212
Here's what I want:
(201) 555-1212 - Notice the space is gone
(201)555-1212
201-555-1212
555-1212
I know it should be more robust than just removing spaces, but it is for an internal web site that my employees will be accessing from their iPhone. So, I'm willing to "just get it working."
Here's what I have so far in C# (which should show you how little I know about regular expressions):
strchk = Regex.Replace(strchk, #"\b([\d{3}\-\d{4}|\d{3}\-\d{3}\-\d{4}|\(\d{3}\)\d{3}\-\d{4}])\b", "<a href='tel:$&'>$&</a>", RegexOptions.IgnoreCase);
Can anyone help me by fixing this or suggesting a better way to do this?
EDIT:
Thanks everyone. Here's what I've got so far:
strchk = Regex.Replace(strchk, #"\b(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b", "<a href='tel:$1'>$1</a>", RegexOptions.IgnoreCase);
It is picking up just about everything EXCEPT those with (nnn) area codes, with or without spaces between it and the 7 digit number. It does pick up the 7 digit number and link it that way. However, if the area code is specified it doesn't get matched. Any idea what I'm doing wrong?
Second Edit:
Got it working now. All I did was remove the \b from the start of the string.

Remove the [] and add \s* (zero or more whitespace characters) around each \-.
Also, you don't need to escape the -. (You can take out the \ from \-)
Explanation: [abcA-Z] is a character group, which matches a, b, c, or any character between A and Z.
It's not what you're trying to do.
Edits
In response to your updated regex:
Change [-\.\s] to [-\.\s]+ to match one or more of any of those characters (eg, a - with spaces around it)
The problem is that \b doesn't match the boundary between a space and a (.

Afaik, no phone enters the other characters, so why not replace [^0-9] with '' ?

Here's a regex I wrote for finding phone numbers:
(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}
It's pretty flexible... allows a variety of formats.
Then, instead of killing yourself trying to replace it w/out spaces using a bunch of back references, instead pass the match to a function and just strip the spaces as you wanted.
C#/.net should have a method that allows a function as the replace argument...
Edit: They call it a `MatchEvaluator. That example uses a delegate, but I'm pretty sure you could use the slightly less verbose
(m) => m.Value.Replace(' ', '')
or something. working from memory here.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.