I have a regex where
%word% can occur multiple times, separated by a "<"
%word% is defined as ".*?"|[a-zA-Z]+
so i wrote
(".*"|[a-zA-Z]+)([<](".*"|[a-zA-Z]+))*
Is there any way i can shrink it using capturing groups?
(".*"|[a-zA-Z]+)([<]\1)*,
But i don't think \1 can be used as it'd mean repeat the first capture, as i would not know what was captured as it can be a quoted string or a word.
Any thing similar i can use to refer matching the previously written group. I'm working in C#.
using String.Format to avoid repetition and no there is no way to repeat the regex group literally
String.Format("{0}([<]{0})*", #"("".*""|[a-zA-Z]+)")
As the support is not there yet for the feature, i made a string replacer, where i wrote the specific words i need to replaced by regex using %% and then wrote the program to replace it by the regular expression defined for the text.
Related
If I have the following text...
The quick :brown:fox: jumped over the lazy :dog:.
I would like a regular expression to capture all the words that are between 2 : characters. In the above example it should return :brown:, :fox:, :dog:.
So far, I have this (\:{1}.\w*\s*\:{1}) which returns :brown: and :dog:. I can't quite figure out how to share the : between the 2 matching groups so that it will also return ':fox:'.
Here is a simple pattern which can be made to work:
(?<=:)(\w+)(?=:)
This uses lookarounds to make sure that one or more word characters are surrounded before and after by colons. Check the demo below to see it working.
The match would be available as the first capture group. Actually, it should also be available as the entire match itself, because lookarounds do not consume anything.
Demo
I like the above lookaround approach because it is clean and simple (at least in my mind). If, for some reason, you don't want any lookarounds, then just use the following pattern:
:(\w+):
But note that now you explicitly have to access the first capture group to obtain the matching word without colons on either side.
I'm trying to get the following using Regex.
This is sample input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emsubject="MYSUBJECT"
Other input:
-emto=USER#HOST.COM -emfrom=USER#HOST.COM -emcc=ME#HOST.COM -embcc=YOU#HOST.COM -emsubject="MYSUBJECT"
What I would like to achieve is get named groups using the text after -em.
So I'd like to have for example group EMAIL_TO, EMAIL_FROM, EMAIL_CC, ...
Note that I could concat groupname and capture using code, no problem.
Problem is that I don't know how to capture optional groups with "random" positions.
For example, CC and BCC do not always appear but sometimes they do and then I need to
capture them.
Can anybody help me out on this one?!
What I have so far: (?:-em(?<EMAIL_>to|cc|bcc|from|subject)=(.*))
Just do something like:
-em([^\s=]+)=([^\s]+)
If you need to support quoting of values, so that they can contain spaces:
-em([^\s=]+)=("[^"]*"|[^\s]+)
And iterate over all the matches in the command line arg string. For each match, look at the "key" (first capturing group) and see if it is one you recognize. If not, display an error message and exit. If it is, set the option accordingly (the second capturing group is the "value").
POSTSCRIPT: This reminds me of a situation which often comes up when writing a grammar for a computer language.
It is possible (perhaps even natural) to write a grammar which only works for syntactically perfect programs. But for good error reporting, it is much better to write a grammar which accepts a superset of syntactically correct programs. After you get the parse tree, you can run over it, look for errors, and report them using application-specific code.
In this case, you could write a regex which will only match the options which you actually accept. But then if someone mistypes an option, the regex will simply fail to match. Your program will not be able to provide any specific error messages, regardless of whether the command line args are -emsubjcet=something or if they are something completely off the wall like ###$*(#&U*REJDFFKDSJ**&#(*$&##.
POST-POSTSCRIPT: Note the very common regex pattern of matching "delimiter + any number of characters which are not a delimiter". In my above regexes, you can see this here: ([^\s=]+)= -- 1 or more chars which are not whitespace OR =, followed by =. This allows us to easily eat everything which is part of the key, but not go too far and match the delimiting =. You can see it again here: "[^"]*" -- a quote mark, followed by 0 or more chars which are not a quote mark, followed by a closing quote mark.
Hi I need to do like this.
Actually **ctu** is a good university but **ctu's** is not. There are many **,ctus,** present.
What I want to do is, I want to replace ctu in the string like this.
Actually **<s>ctu<e>** is a good university but **<s>ctu's<e>** is not. There are many **,<s>ctus<e>,** present.
But with the following pattern
**\\bctu*(?:['\\\\|""\\\\]*)\\w+\\b**
I'm getting the out put as:
A**<s>ctu<e>**ally **<s>ctu<e>** is a good university but **<s>ctu's<e>** is not. There are many **,ctus,** present.
I dont want to replace ctu inside words Actually. and also I need to replace " ,ctus, " with " ,<s>ctus<e>, "
How do I achieve this using regex. I need this in c#. csharp.
Thanks in advance.
The following regex matches all the cases listed in your example:
#"(\bctu(?:'\w+)?\w*\b)"
Then just replace the match with #"<s>\1<e>" where \1 is the backreference to the match above.
Are you looking for #"\bctu\b" ("ctu" with word boundaries on both sides, so it matches ctu but not Actually, ctu's, or ,ctus,) for the first search pattern and ",ctus," (exactly the string ,ctus,, regardless of where it might fall in a word) as the second search pattern? To search for both of these at once, you could use #"(\bctu\b|,ctus,)".
As a slight aside, in C# you can write regex literals much easier by using the #"" notation (verbatim strings) instead of "". E.g. to get regex to understand a word boundary, it must see \b, which can be represented as #"\b" or "\\b", and a literal \ is "\\\\" or #"\\". The first is easier to read, especially in more complex cases.
If this doesn't answer your question, please give a clear example of expected input/output.
I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end from getting in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string
?: makes the parentheses non-grouping. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
I have a C# Regex class matching multiple subgroups such as
(?<g1>abc)|(?<g2>def)|(?<g3>ghi)
but with much more complicated sub-patterns. I basically want to match anything that doesn't belong to any of those groups, in addition to existing groups.
I tried
(?<g1>abc)|(?<g2>def)|(?<g3>ghi)|(.+?)
but it turned out too slow. I can't do negation because I don't want to copy those complex subpatterns redundantly. Using just (.+) overrides all other groups as expected.
Is there any other way? If that doesn't work I'll have to write an ad-hoc parser.
Additional details: All these groups are evaluated against a MatchEvaluator. So a Regex class behavior that sends "unmatched strings" to the MatchEvaluator will also work.
A sample text would be
.......abc........ghi.....def.....abc....def...ghi......abc.......
I want to catch parts inbetween.
Your regex generates separate match for every single character outside g1,g2,g3. So when you use it with MatchEvaluator it generates lots of evaluator calls. Thats why its slow.
If you try following regex:
(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)
you will get single "rest" group match for entire fragment of text that doesnt contain "g" group.
Regex C# code:
Regex regex = new Regex(
#"(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)",
RegexOptions.Singleline
| RegexOptions.Compiled
);
but it turned out too slow. I can't do
negation because I don't want to copy
those complex subpatterns redundantly.
Why not something like:
const string COMPLEX_REGEX_PATTERN = "\Gobbel[dy]go0\k"
Have you tried setting the regex option to be compiled? I find using a static compiled regex can speed things up considerably.
If your regex is four pages long, writing a state machine yourself would probably be a better idea...