RegEx to capture text between two delimiter characters including 'shared' - c#

If I have the following text...
The quick :brown:fox: jumped over the lazy :dog:.
I would like a regular expression to capture all the words that are between 2 : characters. In the above example it should return :brown:, :fox:, :dog:.
So far, I have this (\:{1}.\w*\s*\:{1}) which returns :brown: and :dog:. I can't quite figure out how to share the : between the 2 matching groups so that it will also return ':fox:'.

Here is a simple pattern which can be made to work:
(?<=:)(\w+)(?=:)
This uses lookarounds to make sure that one or more word characters are preceded and followed by a colon.
The match is available as the first capture group. It is also available as the entire match itself, because lookarounds do not consume anything.
I like the above lookaround approach because it is clean and simple (at least in my mind). If, for some reason, you don't want any lookarounds, then just use the following pattern:
:(\w+):
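But note that now you explicitly have to access the first capture group to obtain the matching word without colons on either side. Also note that this pattern consumes both colons, so in :brown:fox: the colon after brown is used up and fox is skipped, which is the same problem as in the question.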
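To make the difference between the two patterns concrete, here is a small sketch that runs both against the sample sentence. It assumes C# 9 or later (top-level statements); the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "The quick :brown:fox: jumped over the lazy :dog:.";

// Lookarounds assert the colons without consuming them,
// so the colon between "brown" and "fox" can serve both matches.
string[] shared = Regex.Matches(text, @"(?<=:)(\w+)(?=:)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// The consuming pattern swallows the closing colon of ":brown:",
// so "fox" has no opening colon left to match.
string[] consumed = Regex.Matches(text, @":(\w+):")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", shared.Select(w => ":" + w + ":")));   // :brown:, :fox:, :dog:
Console.WriteLine(string.Join(", ", consumed.Select(w => ":" + w + ":"))); // :brown:, :dog:
```

With the lookaround version the colons are not part of the match, so they are added back when printing.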

Related

Don't match string in specific context

For my internship I've been asked to create a tool that creates a regular expression from a few examples. I got it working: it generates multiple regular expressions, sorted by how greedy they are, but I want more.
The regex generator works by replacing parts of a string with regular expression character classes. For example, GOM178 would turn into [A-Z]+178 (letters replaced) or GOM\d+ (numbers replaced). The hard part is combining multiple character classes in one expression. For example, at one point \p{P} is tried as well, and it replaces [],/\- and more. That corrupts the character classes that were already inserted: it would turn [A-Z] into \p{P}A\p{P}Z\p{P}. Replacing \p{P} before [A-Z] wouldn't work either, because that would replace the P in \p{P}, causing this: \p{[A-Z]}.
I've already tried negative lookaheads, but that didn't work out too well. The only reason it currently works is that I test each result before saving it. This is the regular expression I've used for that:
(?:(?!(?:\[a-z\]|\[A-Z\]|\[a-zA-Z\]|\\d\+\[\\\.,\]\?\\d\*|\\d|\\s|\\p\{P\}|\\w|\\n|\.)(?:\*|\?|\+|\+\?|\*\?)?)(<The character class to match goes here>))
Here is an example of the regex in action: Link to example.
As you can see, it also matches the - and ] in the character class. It should ignore them because they are part of [a-z], which is listed in the negative lookahead.
Long story short, a string should not be replaced when it's in a specific context. Does anyone have an idea on how to fix this, or perhaps a better idea on how to do this?
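The corruption described above can be reproduced in a few lines. The second half of the sketch shows one possible way around it, which is not something the question proposes: a single Regex.Replace pass over an alternation, with a MatchEvaluator choosing the replacement. Because each input character is visited once, text produced by a replacement is never scanned again. Generalize is a hypothetical helper name, and the three alternatives are only a sample of the classes the generator might try. C# 9 or later is assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Sequential passes: the second pass re-scans the output of the first.
string step1 = Regex.Replace("GOM178", "[A-Z]+", "[A-Z]+"); // [A-Z]+178
string step2 = Regex.Replace(step1, @"\p{P}", @"\p{P}");    // brackets and hyphen are punctuation too
Console.WriteLine(step2); // \p{P}A\p{P}Z\p{P}+178

// Single pass (hypothetical helper): the alternation is matched against the
// original sample only, so inserted character classes are left alone.
string Generalize(string sample) =>
    Regex.Replace(sample, @"[A-Z]+|\d+|\p{P}", m =>
        char.IsUpper(m.Value[0]) ? "[A-Z]+" :
        char.IsDigit(m.Value[0]) ? @"\d+" :
        @"\p{P}");

Console.WriteLine(Generalize("GOM-178")); // [A-Z]+\p{P}\d+
```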

Regex to detect hyphen character

Live demo: http://regex101.com/r/wW6wC4
I'm trying to write a regex that allows email addresses like:
asdf.asdf@test-dom-a.com
([\w+\.]+@[\w]{1,})(\.)([0-9a-zA-Z\.\-]{1,})
^---- Thought this would allow hyphens...
What am I missing here?
Your pattern only allows a hyphen after the first period of the domain. Try this instead:
([\w+.]+@[\w-]{1,})(\.)([0-9a-zA-Z.-]+)
Or more simply:
([\w+.]+@[\w.-]+)
Note that the second pattern doesn't require the domain part of the address to contain a period.
Your hyphen appears only in the segment that checks characters after the first period in the domain name. You need to add it to the character class before that period as well:
([\w+\.]+@[\w\-]{1,})(\.)([0-9a-zA-Z\.\-]{1,})
^^---- check here as well.
In reality, I would search for a more comprehensive email regex; the one you have doesn't seem robust enough IMHO.
Your regex:
([\w+\.]+@[\w]{1,})(\.)([0-9a-zA-Z\.\-]{1,})
This allows a hyphen only in the last group, after the first period.
To allow it anywhere use:
^([\w+.-]+@[\w-]+)(\.)([0-9a-zA-Z.-]+)$
Or, to allow it only in between (except first and last position), use:
^[\w+.-]*@\w[\w-]*\.[\w-]*[0-9a-zA-Z.]+$
Working Demo: http://regex101.com/r/lQ1nV7
Your pattern matches strings of the form "asd@fge.hj-kl", which as you can see is not what you want. Try one of these instead:
([\w+\.]+)@([0-9a-zA-Z\.\-]{1,})\.com
([\w+\.]+)@([0-9a-zA-Z\.\-]{1,})\.([\w]{1,})
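A quick way to check where the hyphen is allowed is to run the question's pattern and one of the adjusted patterns against the sample address. This is only a sketch: the address is written with @ as the separator, the ^ and $ anchors are added here so that the whole string must match, and C# 9 or later is assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Question's pattern: hyphens are only permitted after the first period of the domain.
var original = new Regex(@"^([\w+\.]+@[\w]{1,})(\.)([0-9a-zA-Z\.\-]{1,})$");
// Adjusted pattern: the hyphen is also in the class before the first period.
var adjusted = new Regex(@"^([\w+.]+@[\w-]{1,})(\.)([0-9a-zA-Z.-]+)$");

string address = "asdf.asdf@test-dom-a.com";
Console.WriteLine(original.IsMatch(address)); // False
Console.WriteLine(adjusted.IsMatch(address)); // True
```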

Finding optional groups with random order using regex

I'm trying to get the following using Regex.
This is sample input:
-emto=USER@HOST.COM -emfrom=USER@HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER@HOST.COM -emfrom=USER@HOST.COM -emcc=ME@HOST.COM -embcc=YOU@HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear, but sometimes they do, and then I need to capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))
Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
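The loop described above might look roughly like this. It is a sketch under assumptions: the set of accepted keys, the EMAIL_ prefix on the dictionary keys, and the error list are illustrative choices rather than anything fixed by the question, and C# 9 or later is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string commandLine = "-emto=USER@HOST.COM -emfrom=USER@HOST.COM -emsubject=\"MYSUBJECT\"";

// Keys this hypothetical program accepts; anything else is reported as an error.
var known = new HashSet<string> { "to", "from", "cc", "bcc", "subject" };
var options = new Dictionary<string, string>();
var errors = new List<string>();

// Group 1 is the key, group 2 the (optionally quoted) value.
foreach (Match m in Regex.Matches(commandLine, @"-em([^\s=]+)=(""[^""]*""|[^\s]+)"))
{
    string key = m.Groups[1].Value;
    string value = m.Groups[2].Value.Trim('"'); // drop surrounding quotes, if any
    if (known.Contains(key))
        options["EMAIL_" + key.ToUpperInvariant()] = value;
    else
        errors.Add("Unknown option: -em" + key);
}

Console.WriteLine(options["EMAIL_TO"]);      // USER@HOST.COM
Console.WriteLine(options["EMAIL_SUBJECT"]); // MYSUBJECT
```

Because the options end up in a dictionary, their order on the command line does not matter, and absent ones such as cc and bcc are simply missing keys.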
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.

Regex.Matches returns one match per line, not per "word"

I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') (in a variable length HTML string Markup that contains no line breaks) that are prefixed by B and are enclosed in square brackets.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, quantifiers in .NET regular expressions are "greedy"; they try to match as much as possible.
So if your value is [BName][BAddress] you will get one match covering the entire string: it runs from the [B at the beginning all the way to the last ], not the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible, leaving the second tag to be its own match.
Slaks also noted an excellent option: specifying that you do not wish to match the ending ] as part of the content, like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference, but it's an important thing to keep in mind depending on the data and patterns you are dealing with.
On a side note, I recommend using the C# verbatim string specifier @ for regular expression patterns, so that you do not need to double-escape things. I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes more complex regular expressions much easier to read.
Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match as many characters as possible, whereas .+? (or .*?) will match the bare minimum number of characters while still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible.
In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\].
[^\]] matches every character except ], so it is forced to stop before the first ].
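The three variants discussed in these answers can be compared side by side. The sketch below assumes C# 9 or later and uses verbatim strings for the patterns; Find is a hypothetical local helper.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string markup = "[BName] [BAddress]";

// Local helper (hypothetical name) that returns the matched text for a pattern.
string[] Find(string pattern) =>
    Regex.Matches(markup, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string[] greedy  = Find(@"\[B.+\]");     // one match spanning both tags
string[] lazy    = Find(@"\[B.+?\]");    // stops at the first closing bracket
string[] negated = Find(@"\[B[^\]]+\]"); // content may not contain ']'

Console.WriteLine(greedy.Length);               // 1
Console.WriteLine(lazy.Length);                 // 2
Console.WriteLine(string.Join(" | ", negated)); // [BName] | [BAddress]
```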

I have two problems, one of them is a regex

I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end to be captured. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string.
?: makes the parentheses non-capturing. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
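The cases listed above can be checked directly. This is a sketch that assumes C# 9 or later; TryExtract is a hypothetical helper, not part of the code in the question.

```csharp
using System;
using System.Text.RegularExpressions;

var urlTag = new Regex(@"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]");

// Hypothetical helper: reports whether the tag matched and hands back group 1.
bool TryExtract(string input, out string host)
{
    Match m = urlTag.Match(input);
    host = m.Success ? m.Groups[1].Value : "";
    return m.Success;
}

bool matched = TryExtract("[url ]www.foo.com[/url]", out string fooHost);
Console.WriteLine(matched + " " + fooHost); // True foo.com

// The (?:\s*) groups are non-capturing, so only group 0 (the whole match)
// and group 1 (the text after "www.") exist.
Console.WriteLine(urlTag.GetGroupNumbers().Length); // 2

Console.WriteLine(TryExtract("[url]stackoverflow.com[/url]", out _)); // False
```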
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
