Improve RegEx search

Using DirectoryServices.AccountManagement I'm getting users DistinguishedName which looks like so:
CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu
I need to get first OU value from this.
I found similar solution: C# Extracting a name from a string
And using some tweaks I created this code:
string input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
Match m = Regex.Match(input, @"OU=([a-zA-Z\\]+)\,.*$");
Console.WriteLine(m.Groups[1].Value);
This code returns STORE as expected, but if I change Groups[1] to Groups[0] I get almost same result as input string:
OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu
How can I change this regex so it returns only the values of OU, so that in this example I get an array of 2 matches? If I had more OUs in my string, the array would be longer.
EDIT:
I've converted my code (using @dasblinkenlight's suggestions) into a function:
private static List<string> GetOUs()
{
var input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
var mm = Regex.Matches(input, @"OU=([a-zA-Z\\]+)");
return (from Match m in mm select m.Groups[1].Value).ToList();
}
Is that correct?

Your regular expression is fine (almost), you are just using a wrong API.
Remove the parts of the regexp that match up to the ending anchor $, and change the call of Match for a call of Matches, and get the matches in a loop, like this:
var input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
var mm = Regex.Matches(input, @"OU=([a-zA-Z\\]+)");
foreach (Match m in mm)
{
    Console.WriteLine(m.Groups[1].Value);
}

Your existing regex:
@"OU=([a-zA-Z\\]+)\,.*$"
Matches OU=, then some letters and backslashes ([a-zA-Z\\]+), then a comma, then any characters (.*) to the end of the line ($).
Thus a single match will always match the entire line after the first OU section.
Modify your regex by removing the \,.*$ at the end, and it will match each OU group:
@"OU=([a-zA-Z\\]+)"
Also note that the parentheses are a capturing group. They are useful if you also want to capture just the value part by itself, but if you are not using that, they are not necessary, and you can just have this:
@"OU=[a-zA-Z\\]+"

It's because you are mixing up matches and groups:
string input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
MatchCollection mc = Regex.Matches(input, @"OU=([a-zA-Z\\]+),");
foreach(Match m in mc)
{
Console.WriteLine(m.Result("$1"));
}

Groups[0] returns the full match.
Groups[1] returns the first captured group in the match (i.e. everything in the first pair of parentheses).
So if you wanted to get exactly those 2 occurrences of OU, you could do this:
Match m = Regex.Match(input, @"OU=([a-zA-Z\\]+)\,OU=([a-zA-Z\\]+)\,.*$");
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Groups[0] returns the full match (which you don't want).
Groups[1] returns the first captured group in the match (i.e. everything in the first pair of parentheses).
Groups[2] returns the second captured group in the match (i.e. everything in the second pair of parentheses).
Giving:
STORE
COMPANY
But I'm assuming you don't want to be so explicit with your Regex for each Pattern you are interested in.
If you want to get multiple matches, then you need to call Regex.Matches, which returns a MatchCollection.
MatchCollection ms = Regex.Matches(...);
This still won't work with your current regex though, because everything from STORE to the end of the line will be in the first match. If you only want to get the pattern "1-or-more-letters" after an "OU=",
you only need:
@"OU=([a-zA-Z\\]+)"
So your code would be:
string input = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
MatchCollection ms = Regex.Matches(input, @"OU=([a-zA-Z\\]+)");
foreach (Match m in ms)
{
Console.WriteLine(m.Groups[1].Value);// get the string in the first "(" ")"
}
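The answers above can be folded into one small helper. This is only a sketch: the class and method names are made up for illustration, and the character class [^,]+ is an assumption that replaces [a-zA-Z\\]+ so that OU names containing digits or spaces are also accepted. It does not handle escaped commas inside a value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OuExtractor
{
    // Hypothetical helper: returns every OU value in the order it appears.
    public static List<string> GetOus(string distinguishedName)
    {
        // [^,]+ takes any run of non-comma characters after "OU="
        // (assumption: values never contain an escaped comma).
        return Regex.Matches(distinguishedName, @"OU=([^,]+)")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value) // group 1 = value without "OU="
                    .ToList();
    }

    public static void Main()
    {
        var dn = @"CN=Adam West,OU=STORE,OU=COMPANY,DC=mycompany,DC=group,DC=eu";
        Console.WriteLine(string.Join(", ", GetOus(dn))); // prints: STORE, COMPANY
    }
}
```

Each Match in the collection is one OU, and Groups[1] is the part inside the parentheses, so the list grows with the number of OU entries in the string.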

Related

Find String Between Two Identical Control Separators?

I'm reading from a file, and need to find a string that is enclosed by two identical non-printable control separators, in this case 'RS'.
How would I go about doing this? Would I need some form of regex?

RS stands for Record Separator, and it has a value of 30 (or 0x1E in hexadecimal). You can use this regular expression:
\x1E([\w\s]*?)\x1E
That matches the RS, then matches any letter, number, underscore or whitespace, and then the RS again. The ? makes the regex match as few characters as possible, in case there are more RS characters afterwards.
If you prefer not to match numbers, you could use [a-zA-Z\s] instead of [\w\s].
Example:
string fileContents = "Something \u001Eyour string\u001E more things \u001Eanother text\u001E end.";
MatchCollection matches = Regex.Matches(fileContents, @"\x1E([\w\s]*?)\x1E");
if (matches.Count == 0)
return; // Not found, display an error message and exit.
foreach (Match match in matches)
{
if (match.Groups.Count > 1)
Console.WriteLine(match.Groups[1].Value);
}
As you can see, you get a collection of Match objects, and each match.Value holds the whole matched string, including the separators. match.Groups holds all matched groups: the first one is again the whole matched string (that's the default), followed by each of your groups (those between parentheses). In this case the regex has only one group, so you just need the second entry in that list.

Using regex you can do something like this:
// (.*?) is lazy, so each match stops at the nearest secondString.
string pattern = string.Format("{0}(.*?){1}", firstString, secondString);
var matches = Regex.Matches(myString, pattern);
foreach (Match match in matches)
{
    foreach (Capture capture in match.Captures)
    {
        // capture.Value still includes firstString and secondString; strip them,
        // or read match.Groups[1].Value to get only the text in between.
    }
}
Regex.Matches finds every substring that matches the pattern built above.
Remember to escape any characters in firstString and secondString that are special in a regex.
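One way to handle the escaping is Regex.Escape, which turns regex metacharacters into literals. The following is a minimal sketch under that assumption; the class name, method name and sample text are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BetweenDelimiters
{
    // Hypothetical helper: returns the text found between each first/second pair.
    public static List<string> Find(string text, string first, string second)
    {
        // Regex.Escape makes delimiters such as "(" or "." match literally.
        string pattern = Regex.Escape(first) + "(.*?)" + Regex.Escape(second);
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value) // only the part between the delimiters
                    .ToList();
    }

    public static void Main()
    {
        // Parentheses are special in a regex, so they must be escaped.
        foreach (string s in Find("f(x) and g(y)", "(", ")"))
            Console.WriteLine(s); // prints x, then y
    }
}
```

Reading Groups[1] avoids having to strip the delimiters from the match afterwards.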

You can use Regex.Matches, I'm using X as the separator in this example:
var fileContents = "Xsomething1X Xsomething2X Xsomething3X";
var results = Regex.Matches(fileContents, @"(X).*?(\1)");
Then you can loop over results to do anything you want with the matches.
The \1 in the regex means "reference the first group". I've put X between () so it becomes group 1, then I use \1 to say that the match at this place must be exactly the same text as group 1.

You don't need a regular expression for that.
Read the contents of the file (File.ReadAllText).
Split on the separator character (String.Split).
If you know there's only one occurrence of your string, take the second array element (result[1]). Otherwise, take every other entry (result.Where((x, i) => i % 2 == 1)).
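The split-based steps above can be sketched as follows. The class and method names are hypothetical, and the sample string stands in for the result of File.ReadAllText; the sketch also assumes the separators always come in pairs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SeparatorSplit
{
    // Hypothetical helper: returns the pieces that sit between pairs of separators.
    public static List<string> Between(string contents, char separator)
    {
        // After splitting, the odd-indexed pieces are the ones enclosed by
        // a separator on both sides (assumption: separators are balanced).
        return contents.Split(separator)
                       .Where((part, index) => index % 2 == 1)
                       .ToList();
    }

    public static void Main()
    {
        // In real use this string would come from File.ReadAllText(path).
        string contents = "Something \u001Eyour string\u001E more things \u001Eanother text\u001E end.";
        foreach (string s in Between(contents, '\u001E'))
            Console.WriteLine(s); // prints "your string", then "another text"
    }
}
```

If the file could contain an odd number of separators, the last odd-indexed piece would be unterminated text, so a real implementation may want to check for that.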

Regex Matchcollection groups

I have spent two days trying to solve this problem: I have a MatchCollection, the pattern contains a group, and I want a list of the values captured by that group (there are two or more matches).
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "^<tr>$?<td>$?[D-M][i-r],[' '][0-3][1-9].[0-1][1-9].[0-9][0-9]$?</td>$?<td>$?([1-9][0-2]?)$?</td>$?";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
string s = groups[1].Value;
Datum2.Text = s;
}
But only the last match (2) appears in the TextBox "Datum2".
I know that I have to use e.g. a listbox, but the Groups[1].Value is a string...
Thanks for your help and time.
Dieter

The first thing to correct in the code is Datum2.Text = s;, which overwrites the text in Datum2 on every iteration, so only the last match remains.
Now, about your regex:
^ forces a match at the beginning of the input, so there is really only 1 match. If you remove it, it'll match twice.
I can't seem to understand what was intended with $? all over the pattern (just take them out).
[' '] matches either a quote or a space (there is no need to repeat characters in a character class).
All dots in [0-3][1-9].[0-1][1-9].[0-9][0-9] need to be escaped. A dot matches any character otherwise.
[0-1][1-9] matches all months except "10", and [0-3][1-9] likewise misses the days 10, 20 and 30. The second character should be [0-9] (or \d) in both places.
Code:
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
string s= "";
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
s = s + " " + groups[1].Value;
}
Datum2.Text = s;
Output:
1 2
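If a list of the captured values is wanted rather than one concatenated string, group 1 of every match can be collected into a List<string>. This is a sketch only: the class and method names are invented, and the pattern is the corrected one from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RowNumbers
{
    // Hypothetical helper: collects group 1 of every match into a list.
    public static List<string> Extract(string html)
    {
        // Same corrected pattern as above, written as a verbatim string.
        string pattern = @"<tr><td>[D-M][i-r],[' ][0-3][0-9]\.[0-1][0-9]\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
        return Regex.Matches(html, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }

    public static void Main()
    {
        string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr>"
                     + "<tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
        foreach (string value in Extract(input))
            Console.WriteLine(value); // prints 1, then 2
    }
}
```

Each entry of the returned list is still a string, so it can be added to a list control one item at a time.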
You should know that regex is not the tool to parse HTML. It'll work for simple cases, but for real cases do consider using HTML Agility Pack.

regex pattern for tags needed

Howzit,
I need help with the following please.
I need to find tags in a string. These tags start with {{ and end with }}, there will be multiple tags in the string I receive.
So far I have this, but it doesn't find any matches, what am I missing here?
List<string> list = new List<string>();
string pattern = "{{*}}";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = r.Match(text);
while (m.Success)
{
list.Add(m.Groups[0].Value);
m = m.NextMatch();
}
return list;
even tried string pattern = "{{[A-Za-z0-9]}}";
thanx
PS. I know close to nothing about regex.

Not only do you want to use {{.+?}} as your regex, you also need to pass RegexOptions.Singleline. That will treat your entire string as a single line, so . will also match \n (which it normally does not).
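The effect of the option is easiest to see on a tag that contains a line break. A small sketch, with an invented class name and sample text:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultiLineTags
{
    // Hypothetical helper: counts {{...}} tags using the given options.
    public static int CountTags(string text, RegexOptions options)
    {
        return Regex.Matches(text, "{{.+?}}", options).Count;
    }

    public static void Main()
    {
        // The first tag spans two lines.
        string text = "a {{first\ntag}} b {{second}}";

        // Without Singleline, '.' stops at '\n', so only {{second}} is found.
        Console.WriteLine(CountTags(text, RegexOptions.None));       // prints 1

        // With Singleline, '.' also matches '\n', so both tags are found.
        Console.WriteLine(CountTags(text, RegexOptions.Singleline)); // prints 2
    }
}
```

If the tags never contain line breaks, the option makes no difference to the result.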

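Try {{.+}}. The .+ means there has to be at least one character as part of the tag. Note that .+ is greedy, so when the string contains several tags it matches from the first {{ to the last }}; use {{.+?}} to match each tag separately.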
EDIT:
To capture the string containing your tags you can do {{(.+)}} and then tokenize your match with the Tokenize or Scanner class?
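The difference between the greedy and the lazy form shows up as soon as a string holds more than one tag. A quick sketch to compare them; the class name, method name and sample text are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: returns the full text of every match.
    public static List<string> FindAll(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        string text = "Hello {{name}}, order {{id}} shipped";

        // Greedy: one match running from the first {{ to the last }}.
        Console.WriteLine(string.Join(" | ", FindAll(text, "{{.+}}")));
        // prints: {{name}}, order {{id}}

        // Lazy: one match per tag.
        Console.WriteLine(string.Join(" | ", FindAll(text, "{{.+?}}")));
        // prints: {{name}} | {{id}}
    }
}
```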

I would recommend trying something like the following:
List<string> list = new List<string>();
string pattern = "{{(.*?)}}";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = r.Match(text);
while (m.Success)
{
list.Add(m.Groups[1].Value);
m = m.NextMatch();
}
return list;
the regex specifies:
{{ # match {{ literally
( # begin capturing into group #1
.*? # match any characters, from zero to infinite, but be lazy*
) # end capturing group
}} # match }} literally
"Lazy" means the engine first tries to match the }} that follows; only when that fails does it go back to .*? and reluctantly add one more character to the capturing group. The group therefore ends at the first }} it reaches.
I changed your code by modifying the regex and by extracting the first capturing group from the match object (m.Groups[1].Value) instead of the entire match.

{{.*?}} or
{{.+?}}
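. - means any character (except a newline, by default)
? - after a quantifier, makes it lazy (it matches as few characters as possible, so the match stops at the next }})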

A probably simple regex expression

I am a complete newb when it comes to regex, and would like help to make an expression to match in the following:
{ValidFunctionName}({parameter}:"{value}")
{ValidFunctionName}({parameter}:"{value}",
{parameter}:"{value}")
{ValidFunctionName}()
Where {x} is what I want to match, {parameter} can be anything $%"$ for example and {value} must be enclosed in quotation marks.
ThisIsValid_01(a:"40")
would be "ThisIsValid_01", "a", "40"
ThisIsValid_01(a:"40", b:"ZOO")
would be "ThisIsValid_01", "a", "40", "b", "ZOO"
01_ThisIsntValid(a:"40")
wouldn't return anything
ThisIsntValid_02(a:40)
wouldn't return anything, as 40 is not enclosed in quotation marks.
ThisIsValid_02()
would return "ThisIsValid_02"
For a valid function name I came across: "[A-Za-z_][A-Za-z_0-9]*"
But I can't for the life of me figure out how to match the rest.
I've been playing around on http://regexpal.com/ to try to get valid matches to all conditions, but to no avail :(
It would be nice if you kindly explained the regex too, so I can learn :)

EDIT: This will work, uses 2 regexs. The first get the function name and everything inside it, the second extracts each pair of params and values from what's inside the function's brackets. You cannot do this with a single regex. Add some [ \t\n\r]* for whitespace.
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
List<List<string>> matches = new List<List<string>>();
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<string>();
l.Add(match.Groups["function"].Value);
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
{
l.Add(m.Groups["param"].Value);
l.Add(m.Groups["value"].Value);
}
matches.Add(l);
}
(Old) Solution
(?<function>\w[\w\d]*?)\((?<param>.+?):"(?<value>[^"]*?)"\)
(Old) Explanation
Let's remove the group captures so it is easier to understand: \w[\w\d]*?\(.+?:"[^"]*?"\)
\w is the word class; in ASCII terms it is short for [a-zA-Z0-9_] (in .NET it also matches Unicode letters and digits).
\d is the digit class; in ASCII terms it is short for [0-9].
\w[\w\d]*? matches one word character for the start of the function name, and then zero or more further word or digit characters. Because \w includes digits, this also accepts a name that starts with a digit; use [A-Za-z_] as the first character to rule that out.
\(.+? matches a left bracket, then one or more of any character (for the parameter).
:"[^"]*?"\) matches a colon, then the opening quote, then zero or more of any character except quotes (for the value), then the closing quote and right bracket.
Brackets (or parens, as some people call them) are escaped with backslashes because otherwise they would be capturing groups.
(?<name> ) is a named capturing group; it captures the text it matches under that name.
The ? after each of the * and + operators makes them non-greedy, meaning that they will match the least, rather than the most, amount of text.
(Old) Use
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(aa%£$!:\"lolololol\") _test1(ghgasghe:\"asjkdgh\")";
List<string[]> matches = new List<string[]>();
if(r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
matches.Add(new[] { match.Groups["function"].Value, match.Groups["param"].Value, match.Groups["value"].Value });
}
EDIT: Now you've added an undefined number of multiple parameters, I would recommend making your own parser rather than using regexs. The above example only works with one parameter and strictly no whitespace. This will match multiple parameters with strict whitespace but will not return the parameters and values:
\w[\w\d]*?\(.+?:"[^"]*?"(,.+?:"[^"]*?")*\)
Just for fun, like above but with whitespace:
\w[\w\d]*?[ \t\r\n]*\([ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?"([ \t\r\n]*,[ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?")*[ \t\r\n]*\)
Capturing the text you want will be hard, because you don't know how many captures you are going to have and as such regexs are unsuited.

Someone else has already given an answer that gives you a flat list of strings, but in the interest of strong typing and proper class structure, I’m going to provide a solution that encapsulates the data properly.
First, declare two classes:
public class ParamValue // For a parameter and its value
{
public string Parameter;
public string Value;
}
public class FunctionInfo // For a whole function with all its parameters
{
public string FunctionName;
public List<ParamValue> Values;
}
Then do the matching and populate a list of FunctionInfos:
(By the way, I’ve made some slight fixes to the regexes... it will now match identifiers correctly, and it will not include the double-quotes as part of the “value” of each parameter.)
Regex r = new Regex(@"(?<function>[\p{L}_]\w*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
var matches = new List<FunctionInfo>();
if (r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<ParamValue>();
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
l.Add(new ParamValue
{
Parameter = m.Groups["param"].Value,
Value = m.Groups["value"].Value
});
matches.Add(new FunctionInfo
{
FunctionName = match.Groups["function"].Value,
Values = l
});
}
}
Then you can access the collection nicely with identifiers like FunctionName:
foreach (var match in matches)
{
Console.WriteLine("{0}({1})", match.FunctionName,
string.Join(", ", match.Values.Select(val =>
string.Format("{0}: \"{1}\"", val.Parameter, val.Value))));
}

Try this:
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\)
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*) matches the function name. ^ means start of the line, so the first non-whitespace character in the string must be a letter. You can remove the \s* whitespace match if you don't need it; I just added it to make the match a little more flexible.
The next set \(((?<parameter>[^:]*):"(?<value>[^"]+)",?)*\) means capture each parameter-value pair inside the parentheses. You have to escape the parentheses of the function call since they are symbols within the regex syntax.
The ?<> inside parenthesis are named capture groups, which when supported by a library, as they are in .NET, make grabbing the groups in the matches a little easier.
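In .NET, a group that sits inside a repeated part of the pattern keeps every value it matched in its Captures collection, so the pattern above can return all parameter-value pairs from a single match. The following is a sketch of reading them; the class and method names are invented, and the flat list layout simply mirrors the output format asked for in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FunctionCallParser
{
    // The pattern from the answer above, written as a verbatim string.
    private static readonly Regex Pattern = new Regex(
        @"^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):""(?<value>[^""]+)"",?\s*)*\)");

    // Hypothetical helper: returns name, param1, value1, param2, value2, ...
    // or an empty list when the input does not match.
    public static List<string> Parse(string input)
    {
        var result = new List<string>();
        Match m = Pattern.Match(input);
        if (!m.Success)
            return result;

        result.Add(m.Groups["FunctionName"].Value);

        // One capture per repetition of the parameter/value group.
        CaptureCollection parameters = m.Groups["parameter"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < parameters.Count; i++)
        {
            result.Add(parameters[i].Value);
            result.Add(values[i].Value);
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Parse("ThisIsValid_01(a:\"40\", b:\"ZOO\")")));
        // prints: ThisIsValid_01, a, 40, b, ZOO
        Console.WriteLine(Parse("ThisIsntValid_02(a:40)").Count); // prints 0
    }
}
```

Groups["parameter"].Value on its own would only give the last repetition, which is why the Captures collection is used here.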

Here:
\w[\w\d]*\s*\(\s*(?:(\w[\w\d]*):("[^"]*"|\d+))*\s*\)

For problems like that I always suggest not looking for a single regex but writing multiple regexes that share the work.
But here is my quick shot:
(?<funcName>[A-Za-z_][A-Za-z_0-9]*)
\(
(?<ParamGroup>
(?<paramName>[^(]+?)
:
"(?<paramValue>[^"]*)"
((,\s*)|(?=\)))
)*
\)
The whitespaces are there for better readability. Remove them or set the option to ignore pattern whitespaces.

This regex passes all your test cases:
^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):"(?<value>[^"]*?)",{0,1}\s*)*\)$
This works on multiple parameters and no parameters. It also handles special characters in the param name and whitespace after the comma. There may need to be some adjustments as your test cases do not cover everything you indicate in your text.
Please note that \w usually includes digits and is not appropriate as the leading character of the function name. Reference: http://www.regular-expressions.info/charclass.html#shorthand

Regular expression capturing more than expected

New dad, so my eyes are tired and I'm trying to figure out why this code:
var regex = new Regex(@"(https:)?\/");
Console.WriteLine (regex.Replace("https://foo.com", ""));
Emits:
foo.com
I only have the one forward slash, so why are both being captured in the group for the replacement?

Regex.Replace:
In a specified input string, replaces all strings that match a regular expression pattern with a specified replacement string.
Every single / matches the regular expression pattern @"(https:)?\/". If you try e.g. "https://foo/./com/", all /s would be removed.

If you check what matches are generated, it becomes clear. Add this to your code:
var matches = regex.Matches("https://foo.com");
foreach (Match match in matches)
{
Console.WriteLine(match.Value);
}
And you'll see that https:/ is matched and replaced, / is matched and replaced (because https: is optional), and foo.com remains.
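If the intent was to replace only the first match, or only a prefix at the start of the string, both can be expressed directly. The question does not say which was wanted, so this is a sketch of the two options with invented class and method names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstMatchOnly
{
    // Replaces only the first match by passing a count of 1 to the
    // instance method Regex.Replace(input, replacement, count).
    public static string StripFirst(string url)
    {
        var regex = new Regex(@"(https:)?/");
        return regex.Replace(url, "", 1);
    }

    // Anchors the pattern with ^ so that only a leading "https://" or "//"
    // is removed and later slashes are left alone.
    public static string StripPrefix(string url)
    {
        return Regex.Replace(url, @"^(https:)?//", "");
    }

    public static void Main()
    {
        Console.WriteLine(StripFirst("https://foo.com"));      // prints: /foo.com
        Console.WriteLine(StripPrefix("https://foo.com/a/b")); // prints: foo.com/a/b
    }
}
```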
