I have two problems, one of them is a regex

I have two problems, one of them is a regex - c#

I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5

The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end from getting in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string

?: makes the parentheses non-grouping. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.

What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer

You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/

Related

Variable-length lookbehind for backslashes

What seemed to be a simple task ended up to not work as expected...
I'm trying to match \$\w+\b, unless it's preceded by an uneven number of backslashes.
Examples (only $result should be in the match):
This $result should be matched
This \$result should not be matched
This \\$result should be matched
This \\\$result should not be matched
etc...
The following pattern works:
(?<!\\)(\\\\)*\$\w+\b
However, even repeats of backslashes are included in the match, which is unwanted, so I'm trying to achieve this purely with a variable-length lookbehind, but nothing I tried so far seems to work.
Any regex virtuoso here can lend a hand?

You may use the following pattern:
(?<!(?:^|[^\\])\\(?:\\\\)*)\$\w+\b
Demo.
Breakdown of the Lookbehind; i.e., not preceded by:
(?:^|[^\\]) - Beginning of string/line or any character other than backslash.
\\ - Then, one backslash character.
(?:\\\\)* Then, any even number of backslash characters (including zero).

Looks like asking the question helped me answer my own question.
The part I don't want to be matched has to be wrapped with a positive lookbehind.
(?<=(?<!\\)(\\\\)*)\$\w+\b
Also works if the $result is at the start of the line.
If anyone has more optimal solutions, shoot!

This regular expression gets the wanted text in the third capture group:
(^| )(\\\\)*(\$\w+\b)
Explanation:
(^| ) Either beginning of line or a space
(\\\\)* An even number of backslash characters, including none
( Start of capture group 3
\$\w+\b The wanted text
) End of capture group 3

The value of regex match groups remain empty [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Removing comments using regex

I am building a parser, and I would like to remove comments from various lines. For example,
variable = "some//thing" ////actual comment
Comment marker is //. In this case, variable would contain "some//thing" and everything else would be ignored. I plan to do it using regex replace. Currently I am using (".*"|[ \t])*(\/\/.*) as regex. However replacing it replaces "some//thing" ////actual comment entirely.
I can not figure out the regex which I should use instead. Thanks for any help.
Additional info - I am using C# with netcoreapp 1.1.0
Edit - some cases might be of a line with just comment like //line comment. Strings also might contain escaped quotes.

Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.
The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.
When used the pattern will return the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if you system interprets newlines in some other fashion, for example as \r or \r\n. To use it in single line mode you can remove that if you choose. It is at characters 17 and 18 in the one-liner and is on the fifth line, 6th and 7th printing characters in the multi-line version. You can safely leave it there, however, as in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank, or have a comment beginning in the first column. That will keep the line numbers the same in the original version and the stipped version if you write the results to a new file. Makes comparison easy.
One major caveat for this pattern: It uses a grouping construct that has varying level of support in regex engines. I believe as used here, with a lookaround, it's only the .NET and PCRE engines that will accept it YMMV. It is a tertiary type: (?(_condition_)_then_|_else_). The _condition_ pattern is treated as a zero-width assertion. If the pattern matches, then the _then_ pattern is used in the attempted match, otherwise the _else_ pattern is used. Without that construct, the pattern was growing to uncommon lengths, and was still failing on some of my pathological test cases.
The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.
This is the one-liner pattern to use:
^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
If you want to use the ignore pattern whitespace option, you can use this version:
(?x) # Turn on the ignore white space option
^( # Start the only capturing group
(?: # A non-capturing group to allow for repeating the logic
(?: # Capture either of the two options below
[^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
| # OR
/(?!/) # Capture a slash not followed by a slash [slash an negative look-ahead slash]
)* # As many times as possible, even if none
(?(" # Start a conditional match for double-quoted strings
(?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
) # Then
(?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
| # Otherwise
(?(' # Start a conditional match for single-quoted strings
(?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
) # Then
(?:'(?:\\\\|\\'|[^'])*') # Capture the whole double-quoted string
| # Otherwise
(?([^/]) # If next character is not a slash
.) # Capture that character, it is either a single quote, or a double quote not part of a properly closed
) # end the conditional match for single-quoted strings
) # End the conditional match for double-quoted strings
)* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group
This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.
As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I choose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice Reference too. Given the proper selections on the side, you can test either version above, and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.
Just to redeem this small pattern, there is a much bigger pattern for testing email address that is 78 columns by 81 lines, with a couple dozen characters to spare. (Which I do not recommend using, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!

"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(\/\/.*)
Flags: global
Matches full strings or a comment.
Group 1: comment.
So if there's no comment, replace with the same matching text. Otherwise, do your thing on the comment itself.

Regex lookahead discard a match

I am trying to make a regex match which is discarding the lookahead completely.
\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*
This is the match and this is my regex101 test.
But when an email starts with - or _ or . it should not match it completely, not just remove the initial symbols. Any ideas are welcome, I've been searching for the past half an hour, but can't figure out how to drop the entire email when it starts with those symbols.

You can use the word boundary near # with a negative lookbehind to check if we are at the beginning of a string or right after a whitespace, then check if the 1st symbol is not inside the unwanted class [^\s\-_.]:
(?<=^|\s)[^\s\-_.]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
See demo
List of matches:
support#github.com
s.miller#mit.edu
j.hopking#york.ac.uk
steve.parker#soft.de
info#company-hotels.org
kiki#hotmail.co.uk
no-reply#github.com
s.peterson#mail.uu.net
info-bg#software-software.software.academy
Additional notes on usage and alternative notation
Note that it is best practice to use as few escaped chars as possible in the regex, so, the [^\s\-_.] can be written as [^\s_.-], with the hyphen at the end of the character class still denoting a literal hyphen, not a range. Also, if you plan to use the pattern in other regex engines, you might find difficulties with the alternation in the lookbehind, and then you can replace (?<=\s|^) with the equivalent (?<!\S). See this regex:
(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
And last but not least, if you need to use it in JavaScript or other languages not supporting lookarounds, replace the (?<!\S)/(?<=\s|^) with a (non)capturing group (\s|^), wrap the whole email pattern part with another set of capturing parentheses and use the language means to grab Group 1 contents:
(\s|^)([^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)
See the regex demo.

I use this for multiple email addresses, separate with ‘;':
([A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4};)*
For a single mail:
[A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

C# Regex: only letters followed by an optional

I am looking for a way to get words out of a sentence. I am pretty far with the following expression:
\b([a-zA-Z]+?)\b
but there are some occurrences that it counts a word when I want it not to. E.g a word followed by more than one period like "text..". So, in my regex I want to have the period to be at the end of a word zero or one time. Inserting \.? did not do the trick, and variations on this have not yielded anything fruitful either.
Hope someone can help!

A single dot means any character. You must escape it as
\.?
Maybe you want an expression like this:
\w+\.?
or
\p{L}+\.?

You need to add \.? (and not .?) because the period has special meaning in regexes.

to avoid a match on your example "test.." you ask for you not only need to put the \.? for checking first character after the word to be a dot but also look one character further to check the second character after the word.
I did end up with something like this
\w{2,}\.?[^.]
You should also consider that a sentence not always ends with a . but also ! or ? and alike.
I usually use rubulator.com to quick test a regexp

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

I have two problems, one of them is a regex - c#

?: makes the parentheses non-grouping. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.

You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn. http://krijnhoetmer.nl/stuff/regex/cheat-sheet/

Related

Variable-length lookbehind for backslashes

The value of regex match groups remain empty [duplicate]

Removing comments using regex

Regex lookahead discard a match

C# Regex: only letters followed by an optional

Categories

Resources