Regex including what is supposed to be non-capturing group in result - c#

I have the following simple test where I'm trying to write a Regex pattern that extracts the executable name without the ".exe" suffix.
It appears my non-capturing group (?:\.exe) isn't working, or I'm misunderstanding how it's intended to work.
Both regex101 and regexstorm.net show the same result, and the former confirms that "(?:\.exe)" is a non-capturing group.
Any thoughts on what I'm doing wrong?
// test variable for what i would otherwise acquire from Environment.CommandLine
var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = Regex.Match(testEcl, @"[^\\]+(?:\.exe)", RegexOptions.IgnoreCase).Value;
// expecting "MyApp" but I get "MyApp.exe"
I have been able to extract the value I wanted by using a pattern with named groups, as shown below, but I would like to understand why the non-capturing group approach didn't work the way I expected it to.
// test variable for what i would otherwise acquire from Environment.CommandLine
var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = Regex.Match(Environment.CommandLine, @"(?<fname>[^\\]+)(?<ext>\.exe)",
RegexOptions.IgnoreCase).Groups["fname"].Value;
// get the desired "MyApp" result

A (?:...) is a non-capturing group that matches and still consumes the text. It means the part of text this group matches is still added to the overall match value.
In general, if you want to match something but not consume, you need to use lookarounds. So, if you need to match something that is followed with a specific string, use a positive lookahead, (?=...) construct:
some_pattern(?=specific string) // if specific string comes immediately after pattern
some_pattern(?=.*specific string) // if specific string comes anywhere after pattern
If you need to match but "exclude from match" some specific text before, use a positive lookbehind:
(?<=specific string)some_pattern // if specific string comes immediately before pattern
(?<=specific string.*?)some_pattern // if specific string comes anywhere before pattern
Note that .*? or .* (that is, patterns with *, +, ?, {2,} or even {1,3} quantifiers) in lookbehind patterns are not always supported by regex engines; luckily, the C# .NET regex engine supports them. They are also supported by the Python PyPI regex module, Vim, JGSoft software and now by ECMAScript 2018 compliant JavaScript environments.
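As a rough illustration of the two lookaround forms above, the following sketch applies them to the path from the question. The class and method names (LookaroundSketch, NameViaLookahead, NameViaLookbehind) are made up for this example, and the lookbehind pattern assumes the executable sits directly under a Debug folder.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundSketch
{
    // Lookahead: ".exe" must follow the name, but it is not consumed.
    public static string NameViaLookahead(string commandLine) =>
        Regex.Match(commandLine, @"[^\\]+(?=\.exe)", RegexOptions.IgnoreCase).Value;

    // Lookbehind: "Debug\" must come immediately before the name (an assumption
    // made for this example only).
    public static string NameViaLookbehind(string commandLine) =>
        Regex.Match(commandLine, @"(?<=Debug\\)[^\\.]+").Value;

    public static void Main()
    {
        var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
        Console.WriteLine(NameViaLookahead(testEcl));  // MyApp
        Console.WriteLine(NameViaLookbehind(testEcl)); // MyApp
    }
}
```

In both cases the context (".exe" or "Debug\") has to be present for the match to succeed, yet it never appears in Match.Value.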
In this case, you may capture what you need to get and just match the context without capturing:
var testEcl = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
var asmName = string.Empty;
var m = Regex.Match(testEcl, @"([^\\]+)\.exe", RegexOptions.IgnoreCase);
if (m.Success)
{
asmName = m.Groups[1].Value;
}
Console.WriteLine(asmName);
See the C# demo
Details
([^\\]+) - Capturing group 1: one or more chars other than \
\. - a literal dot
exe - a literal exe substring.
Since we are only interested in capturing group #1 contents, we grab m.Groups[1].Value, and not the whole m.Value (that contains .exe).

You're using a non-capturing group. The emphasis is on the word group here; the group does not capture the .exe, but the overall match still includes it.
You probably want a positive lookahead, which only asserts that the text that follows meets a criterion for the match to be valid; the text it checks is not consumed and is not part of the match.
In other words, you want (?=, not (?:, at the start of your group.
The (?: form only makes a difference when you enumerate the Groups property of the Match object; in your case, you're just using the Value property, so there's no distinction between a normal group (\.exe) and a non-capturing group (?:\.exe).
To see the distinction, consider this test program:
static void Main(string[] args)
{
var positiveInput = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.exe\" /?";
Test(positiveInput, @"[^\\]+(\.exe)");
Test(positiveInput, @"[^\\]+(?:\.exe)");
Test(positiveInput, @"[^\\]+(?=\.exe)");
var negativeInput = "\"D:\\src\\repos\\myprj\\bin\\Debug\\MyApp.dll\" /?";
Test(negativeInput, @"[^\\]+(?=\.exe)");
}
static void Test(String input, String pattern)
{
Console.WriteLine($"Input: {input}");
Console.WriteLine($"Regex pattern: {pattern}");
var match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
if (match.Success)
{
Console.WriteLine("Matched: " + match.Value);
for (int i = 0; i < match.Groups.Count; i++)
{
Console.WriteLine($"Groups[{i}]: {match.Groups[i]}");
}
}
else
{
Console.WriteLine("No match.");
}
Console.WriteLine("---");
}
The output of this is:
Input: "D:\src\repos\myprj\bin\Debug\MyApp.exe" /?
Regex pattern: [^\\]+(\.exe)
Matched: MyApp.exe
Groups[0]: MyApp.exe
Groups[1]: .exe
---
Input: "D:\src\repos\myprj\bin\Debug\MyApp.exe" /?
Regex pattern: [^\\]+(?:\.exe)
Matched: MyApp.exe
Groups[0]: MyApp.exe
---
Input: "D:\src\repos\myprj\bin\Debug\MyApp.exe" /?
Regex pattern: [^\\]+(?=\.exe)
Matched: MyApp
Groups[0]: MyApp
---
Input: "D:\src\repos\myprj\bin\Debug\MyApp.dll" /?
Regex pattern: [^\\]+(?=\.exe)
No match.
---
The first regex (@"[^\\]+(\.exe)") has \.exe as just a normal group.
When we enumerate the Groups property, we see that .exe is indeed a group captured in our input.
(Note that the entire match is itself a group, hence Groups[0] is equal to Value.)
The second regex (@"[^\\]+(?:\.exe)") is the one provided in your question.
The only difference compared to the previous scenario is that the Groups property doesn't contain .exe as one of its entries.
The third regex (@"[^\\]+(?=\.exe)") is the one I'm suggesting you use.
Now the .exe part of the input isn't part of the match at all, but the regex still won't match unless the name is immediately followed by .exe, as the fourth scenario illustrates.

The regex still matches the text inside the non-capturing group; it just doesn't store it as a separate group. So if you want only the part outside that group, read the capturing group instead of the whole match.
You can access the groups like this:
var asmName = Regex.Match(testEcl, @"([^\\]+)(?:\.exe)", RegexOptions.IgnoreCase);
asmName.Groups[1].Value
the demo for the regex can be found here

Related

Regex working in Regexr but not C#, why?

From the below mentioned input string, I want to extract the values specified in {} for s:ds field. I have attached my regex pattern. Now the pattern I used for testing on http://www.regexr.com/ is:
s:ds=\\\"({[\d\w]{8}\-([\d\w]{4}\-){3}[\d\w]{12}})\\\"
and it works absolutely fine.
But the same pattern in C# code does not work. I have also used \\ instead of \ for the C# code and replaced \" with \"". Let me know if I'm doing something wrong. Below is the code snippet.
string inputString = "s:ds=\"{46C01EB7-6D43-4E2A-9267-608DE8AFA311}\" s:ds=\"{37BA4BA0-581C-40DC-A542-FFD9E99BC345}\" s:id=\"{C091E71D-4817-49BC-B120-56CE88BC52C2}\"";
string regex = @"s:ds=\\\""({[\d\w]{8}\-(?:[\d\w]{4}\-){3}[\d\w]{12}})\\\""";
MatchCollection matchCollection = Regex.Matches(inputString, regex);
if (matchCollection.Count > 1)
{
Log.Info("Collection Found.", this);
}
If you only want to match the values...
You should be able to just use ([\d\w]{8}\-([\d\w]{4}\-){3}[\d\w]{12}) for your expression if you only want to match the values within your curly braces:
string input = "s:ds=\"{46C01EB7-6D43-4E2A-9267-608DE8AFA311} ...";
// Use the following expression to just match your GUID values
string regex = @"([\d\w]{8}\-([\d\w]{4}\-){3}[\d\w]{12})";
// Store your matches
MatchCollection matchCollection = Regex.Matches(input, regex);
// Iterate through each one
foreach(var match in matchCollection)
{
// Output the match
Console.WriteLine("Collection Found : {0}", match);
}
You can see a working example of this in action here.
If you want to only match those following s:ds...
If you only want to capture the values for s:ds sections, you could consider prepending (?<=(s:ds=""{)) to your expression, which is a lookbehind that only matches values preceded by s:ds="{ :
string regex = @"(?<=(s:ds=""{))([\d\w]{8}\-([\d\w]{4}\-){3}[\d\w]{12})";
You can see an example of this approach here (notice it doesn't match the s:id element).
Another Consideration
Currently you are using \w to match "word" characters within your expression, and while this might work for your uses, it will match all digits \d, letters a-zA-Z and underscores _. It's unlikely that you need all of these, so you may want to narrow your character sets to just what you expect, such as [A-Z\d] to match only uppercase letters and digits, or [0-9A-Fa-f] if you are only expecting GUID values (i.e. hex).
Looks like you might be over-escaping.
Give this a shot:
@"s:ds=\""({[\d\w]{8}\-([\d\w]{4}\-){3}[\d\w]{12}})\"""
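To make the over-escaping point concrete, here is a small sketch that counts matches for the pattern from the question and for a less-escaped variant. Inside a C# verbatim string a backslash is passed to the regex engine as-is and "" stands for one quote, so \\\"" becomes the regex \\\" (a literal backslash followed by a quote), which the input never contains. The class name, method name and the hex-only character class are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingSketch
{
    // Pattern from the question: requires a literal backslash before each quote.
    public const string OverEscaped =
        @"s:ds=\\\""({[\d\w]{8}\-(?:[\d\w]{4}\-){3}[\d\w]{12}})\\\""";

    // Hypothetical fix: "" is simply a quote inside a verbatim string.
    public const string Simplified =
        @"s:ds=""(\{[0-9A-Fa-f]{8}-(?:[0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}\})""";

    public static int CountMatches(string input, string pattern) =>
        Regex.Matches(input, pattern).Count;

    public static void Main()
    {
        var input = "s:ds=\"{46C01EB7-6D43-4E2A-9267-608DE8AFA311}\" " +
                    "s:ds=\"{37BA4BA0-581C-40DC-A542-FFD9E99BC345}\" " +
                    "s:id=\"{C091E71D-4817-49BC-B120-56CE88BC52C2}\"";
        Console.WriteLine(CountMatches(input, OverEscaped)); // 0
        Console.WriteLine(CountMatches(input, Simplified));  // 2
    }
}
```

The s:id entry is skipped by both patterns because the literal prefix s:ds= does not match it.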

Regex to capture groups of parentheses including inner and outer parentheses

I want to match all parentheses including the inner and outer parentheses.
Input: abc(test)def(rst(another test)uv)xy
Desired Output: (test)
(rst(another test)uv)
(another test)
My following c# code returns only (test) and (rst(another test)uv):
string input = "abc(test)def(rst(another test)uv)xy";
Regex regex = new Regex(@"\(([^()]+| (?<Level>\()| (?<-Level>\)))+(?(Level)(?!))\)", RegexOptions.IgnorePatternWhitespace);
foreach (Match c in regex.Matches(input))
{
Console.WriteLine(c.Value);
}
You are looking for overlapping matches. Thus, just place your regex into a capturing group and put it inside a non-anchored positive lookahead:
Regex regex = new Regex(@"(?=(\(([^()]+| (?<Level>\()| (?<-Level>\)))+(?(Level)(?!))\)))", RegexOptions.IgnorePatternWhitespace);
The value you need will be inside match.Groups[1].Value.
See the IDEONE demo:
using System;
using System.Text.RegularExpressions;
using System.IO;
using System.Linq;
public class Test
{
public static void Main()
{
var input = "abc(test)def(rst(another test)uv)xy";
var regex = new Regex(@"(?=(\(([^()]+| (?<Level>\()| (?<-Level>\)))+(?(Level)(?!))\)))", RegexOptions.IgnorePatternWhitespace);
var results = regex.Matches(input).Cast<Match>()
.Select(p => p.Groups[1].Value)
.ToList();
Console.WriteLine(String.Join(", ", results));
}
}
Results: (test), (rst(another test)uv), (another test).
Note that unanchored positive lookaheads can be used to find overlapping matches with capturing in place because they do not consume text (i.e. the regex engine index stays at its current position while it tries the subpatterns inside the lookahead), and the regex engine automatically advances its index after each match or failure, which makes the matching process "global" (i.e. it tests for a match at every position in the input string).
Although lookahead subexpressions do not consume any text, they can still capture into groups.
Thus, when the lookahead reaches the first (, it matches a zero-width string and places the value you need into Group 1. Then the engine moves on, finds another ( inside the first (...), and can capture a substring there as well.
You could use this one : \((?>[^()]+|\((?<P>)|(?<C-P>)\))*(?(P)(?!))\) but you'll have to dig through captures, groups and groups' captures to get what you want (see demo)
Edit: This answer is flat out wrong for .Net regular expressions - see nam's comment below.
Original answer:
Regular expressions match regular languages. Nested parentheses are not a regular language, they require a context-free grammar to match. So the short answer is there is no way to do what you're attempting.
https://stackoverflow.com/a/133684/361631

Lookbehind with equal sign

I want to match
===Something===
but not
====Something====
I've come up with the following regular expression
Regex.Match("====Something====", @"^\s*===\s*(?<!=====\s*)(?<Title>.*?)\s*===\s*$").Groups["Title"]
but it returns
=Something=
please help what's the issue with the lookbehind pattern.
Match for the full word! The angle brackets are all important. Translated, as if we were talking to the computer, the expression below says: computer, search for a word starting with three = signs, then any number of letters, then end the word with three equals signs.
Hence if 4 equals signs are there at the start of the word, it won't match.
string regExpression = @"<={3}(\w+)={3}>";
static void Main(string[] args)
{
// searches for the first specified instance.
string textToSearchThrough = "===Something===";
string textToSearchThrough2 = "====Something====";
// add in \s+ to the below if you wish
string regexExpression = @"<={3}(\w+)={3}>";
Regex r = new Regex(regexExpression);
// change the text to search through to the second variable textToSearchThrough2 if you wish to check
Match m = r.Match(textToSearchThrough);
Console.WriteLine(m.Success.ToString());
Console.ReadLine();
}
One more possible solution:
(?<!=)===(?!=)(?<Title>.*?)(?<!=)===(?!=)
Your regex misbehaves because you use .*?, which can also match =. So it finds ===, then accepts anything (including further = characters), and then looks for a match that ends with === again. It will therefore also match === inside a ========= string, which is not what you are looking for. However, if you change . (match any character) to \w (match a word character), it should work and match only ===Something===, even without a lookbehind. It is also better to use \w+ instead of \w* to avoid matching a bare ====== with no word in it (if you don't want that), like:
^\s*===\s*(?<Title>\w+?)\s*===\s*$
Try it HERE.
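As a quick check of the two patterns suggested above, the sketch below runs both against the two sample inputs from the question. The class and method names are invented for this example; each method returns null when the input does not match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingSketch
{
    // Variant with \w+ for the title, so "=" can never be part of it.
    public static string TitleWordOnly(string line)
    {
        var m = Regex.Match(line, @"^\s*===\s*(?<Title>\w+?)\s*===\s*$");
        return m.Success ? m.Groups["Title"].Value : null;
    }

    // Variant with lookarounds: each "===" must not touch another "=".
    public static string TitleWithLookarounds(string line)
    {
        var m = Regex.Match(line, @"(?<!=)===(?!=)(?<Title>.*?)(?<!=)===(?!=)");
        return m.Success ? m.Groups["Title"].Value : null;
    }

    public static void Main()
    {
        foreach (var line in new[] { "===Something===", "====Something====" })
        {
            Console.WriteLine(TitleWordOnly(line) ?? "(no match)");
            Console.WriteLine(TitleWithLookarounds(line) ?? "(no match)");
        }
        // Prints "Something" twice for the first input
        // and "(no match)" twice for the second.
    }
}
```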

RegEx - Match using symbols but don't replace them

I would like to use a symbols in a RegEx pattern to find matches, but I don't want them replaced. This is for class and namespace manipulation in C#.
For example:
MyNamespaceLib.EntityDataModelTests.TestsMyClassTests+MyInnerClassTests
must be replaced as:
MyNamespaceLib.EntityDataModel.TestsMyClass+MyInnerClass
(Note, only "Tests" is replace when it appears at the end of the namespace part, and not when it's part of the class/namespace name)
I've managed to get the first part right in finding the matches, but I'm battling to keep the symbols in the replaced match.
So far I have:
var input = "MyNamespaceLib.EntityDataModelTests.TestsMyClassTests+MyInnerClassTests";
var output = Regex.Replace(input, "Tests[.+]|$", "");
I've tried using a non-capturing group, but I suspect it's not meant for the way I'm trying to use it.
Thanks!
So what you want is to replace Tests only when it is followed by a . or a +, or by the end of the string, without replacing that symbol? Use a lookahead:
@"Tests(?=[.+]|$)"
You can use the MatchEvaluator overload of the Regex.Replace method, where the replacement string is generated on the fly. I put the special symbol in a capturing group (the first capturing group is always Groups[1] of the match) and replace the match with that group's value, like this:
var output = Regex.Replace(input, @"Tests([.+]|$)", m => m.Groups[1].Value);
Also, per minitech's comment, you can use $1 for the first capturing group in the (string, string) overload of Regex.Replace, like:
var output = Regex.Replace(input, @"Tests([.+]|$)", "$1");
That said, a regex is often write-only code, so you can always do a dumb and simple replace that puts the symbol back:
var output = input.Replace("Tests+", "+").Replace("Tests.", ".") ...;
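For comparison, here is a small sketch that runs two regex-based variants on the sample input from the question: re-inserting the captured symbol with $1, and asserting the symbol with a positive lookahead so it is never consumed. The class and method names are hypothetical, and the lookahead pattern is one possible way to write it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TestsSuffixSketch
{
    // Capture the trailing symbol (or end of string) and put it back with $1.
    public static string StripWithGroup(string typeName) =>
        Regex.Replace(typeName, @"Tests([.+]|$)", "$1");

    // Only assert the trailing symbol; it is never part of the replaced text.
    public static string StripWithLookahead(string typeName) =>
        Regex.Replace(typeName, @"Tests(?=[.+]|$)", "");

    public static void Main()
    {
        var input = "MyNamespaceLib.EntityDataModelTests.TestsMyClassTests+MyInnerClassTests";
        Console.WriteLine(StripWithGroup(input));
        Console.WriteLine(StripWithLookahead(input));
        // Both print: MyNamespaceLib.EntityDataModel.TestsMyClass+MyInnerClass
    }
}
```

The leading "Tests" in TestsMyClass survives in both variants because it is followed by "M", not by a ".", a "+" or the end of the string.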

need to create a C# Regex similar to this perl expression

I was wondering if it is possible to build equivalent C# regular expression for finding this pattern in a filename. For example, this is the expr in perl /^filer_(\d{10}).txt(.gz)?$/i Could we find or extract the \d{10} part so I can use it in processing?
To create a Regex object that will ignore character casing and match your filter try the following:
Regex fileFilter = new Regex(@"^filer_(\d{10})\.txt(\.gz)?$", RegexOptions.IgnoreCase);
To perform the match:
Match match = fileFilter.Match(filename);
And to get the value (number here):
if (match.Success)
{
string id = match.Groups[1].Value;
}
The matched groups work similar to Perl's matches, [0] references the whole match, [1] the first sub pattern/match, etc.
Note: In your initial perl code you didn't escape the . characters so they'd match any character, not just real periods!
Yes, you can. See the Groups property of the Match class that is returned by a call to Regex.Match.
In your case, it would be something along the lines of the following:
Regex yourRegex = new Regex(@"^filer_(\d{10}).txt(.gz)?$");
Match match = yourRegex.Match(input);
if(match.Success)
result = match.Groups[1].Value;
I don't know, what the /i means at the end of your regex, so I removed it in my sample code.
As daniel shows, you can access the content of the matched input via groups. But instead of the default indexed groups you can also use named groups. The following shows how, and also that you can use the static version of Match.
Match m = Regex.Match(input, @"^(?i)filer_(?<fileID>\d{10}).txt(?:.gz)?$");
if (m.Success)
{
string s = m.Groups["fileID"].Value;
}
The /i in perl means IgnoreCase as also shown by Mario. This can also be set inline in the regex statement using (?i) as shown above.
The last part (?:.gz) creates a non-capturing group, which means that it’s used in the match but no group is created.
I'm not sure if that's what you want, this is how you can do that.
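Putting the pieces from these answers together, a minimal sketch might look like the following. It combines the escaped dots, the named group and RegexOptions.IgnoreCase; the class name, method name and sample file names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileIdSketch
{
    // Dots are escaped so they match only a literal period;
    // (?:\.gz)? makes the .gz suffix optional without creating a group.
    private static readonly Regex FilePattern = new Regex(
        @"^filer_(?<fileID>\d{10})\.txt(?:\.gz)?$", RegexOptions.IgnoreCase);

    // Returns the ten-digit ID, or null when the name does not match.
    public static string ExtractId(string fileName)
    {
        Match m = FilePattern.Match(fileName);
        return m.Success ? m.Groups["fileID"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractId("filer_1234567890.txt.gz"));          // 1234567890
        Console.WriteLine(ExtractId("FILER_0987654321.TXT"));             // 0987654321
        Console.WriteLine(ExtractId("filer_123.txt") ?? "(no match)");    // (no match)
    }
}
```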
