Multiline Regex matches first occurrence but can't match second - c#

I have a string in the format below. (I added the markers to get the newlines to show up correctly)
-- START BELOW THIS LINE --
2013-08-28 00:00:00 - Tom Smith (Work notes)
Blah blah
b;lah blah
2013-08-27 00:00:00 - Tom Smith (Work notes)
ZXcZXCZXCZX
ZXcZXCZX
ZXCZXcZXc
ZXCZXC
-- END ABOVE THIS LINE --
I am trying to get a regular expression that will allow me to extract the information from the two separate parts of the string.
The following expression matches the first portion successfully:
^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)(?=\n\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)
I am trying to figure out a way that I can modify it to get the second part of the string. I have tried things like what is below, but it ends up extending the match all the way to the end of the string. It is like it is giving preference to the expression following the OR.
^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)(?:(?=\n\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)|\n\\Z)
Any help would be appreciated
-- EDIT --
Here is a copy of the test program I created to try and get this correct. I also added a 3rd message and my RegEx above breaks in that case.
using System;
using System.Text.RegularExpressions;
namespace RegExTest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
            var regex = new Regex ("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)\n\n(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)",RegexOptions.Multiline);
            foreach (Match match in regex.Matches(str))
            {
                if (match.Success)
                {
                    for (var i = 0; i < match.Groups.Count; i++)
                    {
                        Console.WriteLine('>'+match.Groups [i].Value);
                    }
                }
            }
            Console.ReadKey();
        }
    }
}
-- EDIT --
Just to make it clear, the data I am trying to extract is the Date and Timestamp (as one item), the name, and the "body" from each "paragraph".

This is a pretty beefy piece of regex you've got here.
While you can do regex over multiple lines, it just complicates things. Additionally, because you have repetitive patterns, it would be cleaner to split your string on the newline character, and then just match each line.
Eventually, if you intend to ingest this from a file, it will be easy to match each line of the file, rather than reading in the whole file and then matching.
Here's what I would do:
var regex = new Regex ("(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*?) \\(Work notes\\)");
var lines = str.Split(new char[] {'\n'});
foreach (var line in lines)
{
    var match = regex.Match(line);
    if (match.Success)
    {
        for (var i = 0; i < match.Groups.Count; i++)
        {
            Console.WriteLine('>' + match.Groups[i].Value);
        }
        // will preface the body after each header
        Console.WriteLine(">");
    }
    else
    {
        Console.WriteLine(line);
    }
}
As far as your regex goes, I maintained the original groups you had, so we get the Date/timestamp in one group, and the name in the other. The body does not get matched to a group, but it would be trivial to construct a string that is the body.
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Matching Group 1.
- Matched, but not grouped.
(.*?) Matching Group 2.
\(Work notes\) Matched, but not grouped.
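To make the point about building the body concrete, here is a minimal sketch that collects each body while walking the lines. The WorkNote and WorkNoteParser names are hypothetical, and the sketch assumes \n line endings and that blank lines only separate entries (so they are dropped from the body).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one entry: timestamp, name, and body lines.
public sealed class WorkNote
{
    public string Timestamp = "";
    public string Name = "";
    public List<string> BodyLines = new List<string>();
    public string Body { get { return string.Join("\n", BodyLines); } }
}

public static class WorkNoteParser
{
    // Same header pattern as above, anchored to a whole line.
    private static readonly Regex Header = new Regex(
        @"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)$");

    public static List<WorkNote> Parse(string text)
    {
        var notes = new List<WorkNote>();
        WorkNote current = null;
        foreach (var line in text.Split('\n'))
        {
            var m = Header.Match(line);
            if (m.Success)
            {
                // A header line starts a new entry.
                current = new WorkNote { Timestamp = m.Groups[1].Value, Name = m.Groups[2].Value };
                notes.Add(current);
            }
            else if (current != null && line.Length > 0)
            {
                // Assumption: non-empty lines after a header belong to its body.
                current.BodyLines.Add(line);
            }
        }
        return notes;
    }
}

public static class Program
{
    public static void Main()
    {
        string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
        foreach (var note in WorkNoteParser.Parse(str))
            Console.WriteLine("{0} | {1} | {2}", note.Timestamp, note.Name, note.Body.Replace("\n", "\\n"));
    }
}
```

If blank lines inside a body matter, the line.Length check would have to go, and trailing blank lines would need trimming per entry instead.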

Regex is not really the right solution for this, but if you must...
Your problem is a combination of regex greediness and starting the match with ^. Without RegexOptions.Multiline, ^ matches only at the very start of the string, so the pattern can't match anywhere else; with that option set, it matches at the start of each line.
The greediness of .* can be fixed by making it .*? instead.
Try this:
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\n([\w\W]*?)((?=\n\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - .*? \(Work notes\)\n)|((\s{0,})$))
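As a rough check of the pattern above, the following sketch runs it over the three-message string from the question. The WorkNoteRegexDemo class and FindNotes method are hypothetical names; the pattern itself is copied unchanged and only split across two verbatim strings for readability.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WorkNoteRegexDemo
{
    // The pattern from this answer, written as verbatim strings.
    public const string Pattern =
        @"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\n([\w\W]*?)" +
        @"((?=\n\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - .*? \(Work notes\)\n)|((\s{0,})$))";

    public static MatchCollection FindNotes(string text)
    {
        // No RegexOptions.Multiline here, so $ only matches at the end of the input.
        return Regex.Matches(text, Pattern);
    }

    public static void Main()
    {
        string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
        foreach (Match m in FindNotes(str))
        {
            // Group 1 = timestamp, group 2 = name, group 3 = body.
            Console.WriteLine("[{0}] [{1}] [{2}]",
                m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value.Replace("\n", "\\n"));
        }
    }
}
```

The lazy [\w\W]*? stops at the first position where either the lookahead sees the next header or only whitespace remains before the end of the input, which is what lets the last entry match as well.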

I was able to get an expression working but it looks a bit scary I guess:
@"([0-9\s:-]+)(?>\s-\s)(?>[^\n\r]+[\r\n]*)((?=[^0-9]+(\d{4}-\d{2}-\d{2}|$))[\s\S])+"
The @ before the expression makes this a verbatim string, so you won't have to double-escape everything.
Note: This is by no means the right way to go about doing this, but I wanted to try out anyway.

Related

Extract Enums with brackets in comments?

The Enum I want to extract is like following:
...
other code
...
enum A
{
a,
b=2,
c=3,
d//{x}
}
...
More Enums like the above.
...
First, I have tried using the Option Singleline with Regex:
enum\s*\w+\s*{.*?\}
However, since the comments contain brackets, the regex doesn't work: it stops when it reaches the closing bracket inside the comment.
So I tried excluding the brackets that follow a comment. Based on what I have found so far, it seems I need a negative lookahead together with the Multiline grouping construct.
Then I tried parsing only the brackets that are not preceded by a comment.
The substep is to find brackets after comments:
(?m:^.*?//.*?}.*?$).
However, it seems the . still matches any character, including newline, even in inline multiline mode.
Then I tried using multiline in the first place. Since the main problem is the brackets in comments, I tried:
(?!//.*)}
Negative look ahead doesn't work the way I expected.
Here is a csharp-regex-test-link for you to test.
To summarize, I need to parse enums from a C# source code file.
The main problem to me is the brackets in comments.
Edit:
To clarify
1. Brackets in comments are in pairs. For example:
xxx=xxx; //{xx}
2. Comments are only in the form of //
3. I can't rely on indentation.
You may use
@"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*(?(o)(?!)|)}"
See the regex demo
Details
\benum - a whole word enum
\s* - 0+ whitespaces
\w+ - 1+ word chars
\s* - 0+ whitespaces
{ - a { char
(?>[^{}]+|(?<o>){|(?<-o>)})* - either 1+ chars other than { and }, or a { with an empty string pushed onto the Group o stack, or } with a value popped from Group o stack
(?(o)(?!)|) - a conditional yes-no construct that fails the match and makes the regex engine backtrack at the current location if Group o still has any items left on the stack
} - a } char.
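A small sketch of that balancing-group pattern in use. The EnumExtractor class and the sample source text are hypothetical; the sample assumes, as the question states, that braces inside // comments always come in pairs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EnumExtractor
{
    // The balancing-group pattern described above: group "o" acts as a stack
    // of open braces, so nested {...} pairs are skipped as a unit.
    private static readonly Regex EnumBlock = new Regex(
        @"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*(?(o)(?!)|)}");

    public static string[] Extract(string source)
    {
        MatchCollection matches = EnumBlock.Matches(source);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Value;
        return result;
    }

    public static void Main()
    {
        // Hypothetical input: two enums, one with paired braces in a // comment.
        string source =
            "int before = 0;\n" +
            "enum A\n{\na,\nb=2,\nc=3,\nd//{x}\n}\n" +
            "int between = 1; //{y}\n" +
            "enum B { p, q }\n";

        foreach (string block in Extract(source))
        {
            Console.WriteLine(block);
            Console.WriteLine("----");
        }
    }
}
```

Because the pattern only counts braces, an unpaired brace inside a comment or string literal would still throw the match off; it relies on the pairing guarantee from the question.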
I don't think it is possible to do your task with a single regex. What if you have a string that looks like
var notEnum = "enum A {a, b, c}";
However, you can capture your enums in a few passes. Take a look at this algorithm:
Clear the contents of string literals
Drop single-line comments
Drop multiline comments
Use your original regex
Example:
var code = ...
var stringLiterals = new Regex("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", RegexOptions.Compiled);
var multilineComments = new Regex("/\\*.*?\\*/", RegexOptions.Compiled | RegexOptions.Singleline);
var singlelineComments = new Regex("//.*$", RegexOptions.Compiled | RegexOptions.Multiline);
var @enum = new Regex("enum\\s*\\w+\\s*{.*?}", RegexOptions.Compiled | RegexOptions.Singleline);
code = stringLiterals.Replace(code, m => "\"\"");
code = multilineComments.Replace(code, m => "");
code = singlelineComments.Replace(code, m => "");
var enums = @enum.Matches(code).Cast<Match>().ToArray();
foreach (var match in enums)
    Console.WriteLine(match.Value);

Regex instance can not find more than one match even though it exists

I was using Regex and I wrote this:
static void Main(string[] args)
{
    string test = "this a string meant to test long space recognition     n     a";
    Regex regex = new Regex(@"[a-z][\s]{4,}[a-z]$");
    MatchCollection matches = regex.Matches(test);
    if (matches.Count > 1)
        Console.WriteLine("yes");
    else
    {
        Console.WriteLine("no");
        Console.WriteLine("the number of matches is "+matches.Count);
    }
}
In my opinion the Matches method should find both "n n" and "n a". Nevertheless, it only manages to find "n n" and I just do not understand why that is.
The $ in your regular expression means that the pattern must occur at the end of the line. If you want to find all the long spaces, this simple expression suffices:
\s{4,}
If you really need to know whether the spaces are enclosed by a-z, you can search like this
(?<=[a-z])\s{4,}(?=[a-z])
This uses the pattern...
(?<=prefix)find(?=suffix)
...and finds positions enclosed between a prefix and a suffix. The prefix and suffix are not part of the match, i.e. match.Value contains only the contiguous spaces. Therefore you don't get the "n is consumed" problem mentioned by Jon Skeet.
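The difference between the three patterns can be checked with a short sketch. The LongSpaceDemo class and Count helper are hypothetical names, and the input is built with runs of exactly five spaces, which is an assumption (any run of four or more behaves the same).

```csharp
using System;
using System.Text.RegularExpressions;

public static class LongSpaceDemo
{
    // Number of non-overlapping matches of pattern in input.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string gap = new string(' ', 5);                       // assumed run length
        string test = "recognition" + gap + "n" + gap + "a";

        // Anchored with $: only the final "n     a" can match.
        Console.WriteLine(Count(@"[a-z]\s{4,}[a-z]$", test));          // 1
        // No anchor: the middle "n" is consumed by the first match.
        Console.WriteLine(Count(@"[a-z]\s{4,}[a-z]", test));           // 1
        // Lookarounds do not consume the letters, so both gaps are found.
        Console.WriteLine(Count(@"(?<=[a-z])\s{4,}(?=[a-z])", test));  // 2
    }
}
```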
You have two problems:
1) You're anchoring the match to the end of the string. So actually, the value that's matched is "n...a", not "n...n"
2) The middle "n" is consumed by the first match, so can't be part of the second match. If you change that "n" to "nx" (and remove the $) you'll see "n...n" and "x...a"
Short but complete example:
using System;
using System.Text.RegularExpressions;
public class Test
{
    static void Main(string[] args)
    {
        string text = "ignored a     bc     d";
        Regex regex = new Regex(@"[a-z][\s]{4,}[a-z]");
        foreach (Match match in regex.Matches(text))
        {
            Console.WriteLine(match);
        }
    }
}
Result:
a     b
c     d
I just do not understand why is that..
I think the 'why' it is consumed by the first match is to prevent regexes like "\\w+s", designed to get every word that ends with an 's' from returning "ts", "ats" and "cats" when matched against "cats".
The regex machinery does one match at a time; if you want more, you have to restart it yourself after the first match.

Regex to extract Variable Part

I have a string containing this: @[User::RootPath]+"Dim_MyPackage10.dtsx" and I need to extract the [User::RootPath] part using a regex. So far I have this regex: [a-zA-Z0-9]*\.dtsx but I don't know how to proceed further.
For the variable, why not consume what is needed by using a negated set [^ ] to extract everything except what is in the set?
The ^ inside the brackets negates the set, so it matches anything that is not listed, such as here, where it seeks everything that is not a ] or a quote (").
Then we can place the actual matches in named capture groups (?<{NameHere}> ) and extract accordingly:
string pattern = @"(?:@\[)(?<Path>[^\]]+)(?:\]\+\"")(?<File>[^\""]+)(?:"")";
// Pattern is (?:@\[)(?<Path>[^\]]+)(?:\]\+\")(?<File>[^\"]+)(?:")
// without the doubled quotes needed by the C# parser
string text = @"@[User::RootPath]+""Dim_MyPackage10.dtsx""";
var result = Regex.Match(text, pattern);
Console.WriteLine ("Path: {0}{1}File: {2}",
    result.Groups["Path"].Value,
    Environment.NewLine,
    result.Groups["File"].Value
);
/* Outputs
Path: User::RootPath
File: Dim_MyPackage10.dtsx
*/
(?: ) means match but don't capture; we use those groups as de facto anchors for the pattern without placing them into the match capture groups.
Use this regex pattern:
\[[^[\]]*\]
Check this demo.
Your regex will match any number of alphanumeric characters, followed by .dtsx. In your example, it would match MyPackage10.dtsx.
If you want to match Dim_MyPackage10.dtsx you need to add an underscore to your list of allowed characters in the regex: [a-zA-Z0-9_]*\.dtsx
If you want to match the [User::RootPath], you need a regex that will stop at the last / (or \, depends on which type of slashes you use in the paths): something like this: .*\/ (or .*\\)
From the answers and comments - and the fact that none has been 'accepted' so far - it appears to me that the question/problem is not completely clear. If you're looking for the pattern [User::SomeVariable] where only 'SomeVariable' is, well, variable, then you may try:
\[User::\w+]
to capture the full expression.
Furthermore, if you wish to detect that pattern, but then need only the "SomeVariable" part, you may try:
(?<=\[User::)\w+(?=])
which uses look-arounds.
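A short sketch comparing the two patterns. The UserVariableDemo class and its method names are hypothetical, and the sample expression is only assumed to have the shape described in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserVariableDemo
{
    // Matches the whole token, brackets included.
    public static string FullToken(string expression)
    {
        return Regex.Match(expression, @"\[User::\w+]").Value;
    }

    // Lookbehind and lookahead check the surroundings without including them.
    public static string NameOnly(string expression)
    {
        return Regex.Match(expression, @"(?<=\[User::)\w+(?=])").Value;
    }

    public static void Main()
    {
        string expression = "@[User::RootPath]+\"Dim_MyPackage10.dtsx\"";
        Console.WriteLine(FullToken(expression)); // [User::RootPath]
        Console.WriteLine(NameOnly(expression));  // RootPath
    }
}
```

Both helpers return an empty string when nothing matches, since Match.Value is empty for a failed match.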
Here it is bro
using System;
using System.Text.RegularExpressions;
namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            String sourcestring = "source string to match with pattern";
            Regex re = new Regex(@"\[\S+\]");
            MatchCollection mc = re.Matches(sourcestring);
            int mIdx = 0;
            foreach (Match m in mc)
            {
                for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                {
                    Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
                }
                mIdx++;
            }
        }
    }
}

.NET Regex.Matches behaves not as expected when finding multiple matches to pattern containing * ()

My goal is to find all matches to some pattern in text.
Let's say my pattern is:
h.*o
This means I am searching for any text starting with 'h' ending with 'o' and having any number of chars in between (also zero).
My understanding was that the Matches() method would deliver multiple matches according to its description (see MSDN).
const string input = "hello hllo helo";
Regex regex = new Regex("h.*o");
var result = regex.Matches(input);
foreach (Match match in result)
{
    Console.WriteLine(match.Value);
}
My expectation was:
1. "hello"
2. "hllo"
3. "helo"
4. "hello hllo"
5. "hello hllo helo"
To my surprise returned matches contain only one string - the whole input string.
"hello hllo helo"
Questions:
Which one is wrong: my expectation, my regex or usage of class?
How to achieve the result as shown in my example?
Thanks in advance.
The * is greedy: it will try to match as many characters as it possibly can. You can make it reluctant by following it with a question mark, but a better solution is to exclude o from the list of characters the . matches, like this:
h[^o]*o
Here is a link to a very good explanation of greedy vs. reluctant.
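To see the three variants side by side, here is a small sketch (the GreedyDemo class and Find helper are hypothetical names) that runs the greedy, reluctant, and negated-class patterns over the input from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns the values of all non-overlapping matches.
    public static string[] Find(string pattern, string input)
    {
        MatchCollection mc = Regex.Matches(input, pattern);
        var values = new string[mc.Count];
        for (int i = 0; i < mc.Count; i++)
            values[i] = mc[i].Value;
        return values;
    }

    public static void Main()
    {
        const string input = "hello hllo helo";
        // Greedy, reluctant, and negated-class versions of the same idea.
        foreach (string pattern in new[] { "h.*o", "h.*?o", "h[^o]*o" })
            Console.WriteLine("{0} -> {1}", pattern, string.Join(" | ", Find(pattern, input)));
    }
}
```

The greedy pattern yields a single match covering the whole input, while the other two each yield "hello", "hllo", and "helo".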
Besides the fact that * is greedy, the Matches method only finds non-overlapping matches; that is, it looks for each subsequent match starting from the position where the last match left off. From MSDN Library:
Usually, the regular expression engine begins the search for the next match exactly where the previous match left off.
Thus, even if you used *? or h[^o]*o instead of *, it would still only find "hello", "hllo", and "helo".
I don't know if Regex has a built-in method to efficiently find all the possible substrings that match a specified pattern, but you could loop through all the possible substrings yourself and check if each one is a match:
const string input = "hello hllo helo";
Regex regex = new Regex("^h.*o$");
for (int startIndex = 0; startIndex < input.Length - 1; startIndex++)
{
    for (int endIndex = startIndex + 1; endIndex <= input.Length; endIndex++)
    {
        string substring = input.Substring(startIndex, endIndex - startIndex);
        if (regex.IsMatch(substring))
            Console.WriteLine(substring);
    }
}
Output:
hello
hello hllo
hello hllo helo
hllo
hllo helo
helo
Note that I added ^ and $ to the regex to ensure it matches the entire substring, not just a substring of the substring.

A probably simple regex expression

I am a complete newb when it comes to regex, and would like help to make an expression to match in the following:
{ValidFunctionName}({parameter}:"{value}")
{ValidFunctionName}({parameter}:"{value}",
{parameter}:"{value}")
{ValidFunctionName}()
Where {x} is what I want to match, {parameter} can be anything $%"$ for example and {value} must be enclosed in quotation marks.
ThisIsValid_01(a:"40")
would be "ThisIsValid_01", "a", "40"
ThisIsValid_01(a:"40", b:"ZOO")
would be "ThisIsValid_01", "a", "40", "b", "ZOO"
01_ThisIsntValid(a:"40")
wouldn't return anything
ThisIsntValid_02(a:40)
wouldn't return anything, as 40 is not enclosed in quotation marks.
ThisIsValid_02()
would return "ThisIsValid_02"
For a valid function name I came across: "[A-Za-z_][A-Za-z_0-9]*"
But I can't for the life of me figure out how to match the rest.
I've been playing around on http://regexpal.com/ to try to get valid matches to all conditions, but to no avail :(
It would be nice if you kindly explained the regex too, so I can learn :)
EDIT: This will work, uses 2 regexs. The first get the function name and everything inside it, the second extracts each pair of params and values from what's inside the function's brackets. You cannot do this with a single regex. Add some [ \t\n\r]* for whitespace.
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
List<List<string>> matches = new List<List<string>>();
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
    var l = new List<string>();
    l.Add(match.Groups["function"].Value);
    foreach (Match m in inner.Matches(match.Groups["inner"].Value))
    {
        l.Add(m.Groups["param"].Value);
        l.Add(m.Groups["value"].Value);
    }
    matches.Add(l);
}
(Old) Solution
(?<function>\w[\w\d]*?)\((?<param>.+?):"(?<value>[^"]*?)"\)
(Old) Explanation
Let's remove the group captures so it is easier to understand: \w[\w\d]*?\(.+?:"[^"]*?"\)
\w is the word class; it matches letters, digits, and the underscore (roughly [a-zA-Z0-9_], plus other Unicode word characters in .NET)
\d is the digit class; for ASCII input it is short for [0-9]
\w[\w\d]*? Makes sure there is a valid word character at the start of the function name, and then matches zero or more further word or digit characters. Note that, because \w includes digits, this also accepts a leading digit.
\(.+? Matches a left bracket then one or more of any characters (for the parameter)
:"[^"]*?"\) Matches a colon, then the opening quote, then zero or more of any character except quotes (for the value) then the close quote and right bracket.
Brackets (or parens, as some people call them) are escaped with backslashes because otherwise they would be capturing groups.
The (?<name> ) captures some text.
The ? after each of the * and + operators makes them non-greedy, meaning that they will match the least, rather than the most, amount of text.
(Old) Use
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(aa%£$!:\"lolololol\") _test1(ghgasghe:\"asjkdgh\")";
List<string[]> matches = new List<string[]>();
if(r.IsMatch(input))
{
    MatchCollection mc = r.Matches(input);
    foreach (Match match in mc)
        matches.Add(new[] { match.Groups["function"].Value, match.Groups["param"].Value, match.Groups["value"].Value });
}
EDIT: Now you've added an undefined number of multiple parameters, I would recommend making your own parser rather than using regexs. The above example only works with one parameter and strictly no whitespace. This will match multiple parameters with strict whitespace but will not return the parameters and values:
\w[\w\d]*?\(.+?:"[^"]*?"(,.+?:"[^"]*?")*\)
Just for fun, like above but with whitespace:
\w[\w\d]*?[ \t\r\n]*\([ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?"([ \t\r\n]*,[ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?")*[ \t\r\n]*\)
Capturing the text you want will be hard, because you don't know how many captures you are going to have and as such regexs are unsuited.
Someone else has already given an answer that gives you a flat list of strings, but in the interest of strong typing and proper class structure, I’m going to provide a solution that encapsulates the data properly.
First, declare two classes:
public class ParamValue // For a parameter and its value
{
    public string Parameter;
    public string Value;
}
public class FunctionInfo // For a whole function with all its parameters
{
    public string FunctionName;
    public List<ParamValue> Values;
}
Then do the matching and populate a list of FunctionInfos:
(By the way, I’ve made some slight fixes to the regexes... it will now match identifiers correctly, and it will not include the double-quotes as part of the “value” of each parameter.)
Regex r = new Regex(@"(?<function>[\p{L}_]\w*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
var matches = new List<FunctionInfo>();
if (r.IsMatch(input))
{
    MatchCollection mc = r.Matches(input);
    foreach (Match match in mc)
    {
        var l = new List<ParamValue>();
        foreach (Match m in inner.Matches(match.Groups["inner"].Value))
            l.Add(new ParamValue
            {
                Parameter = m.Groups["param"].Value,
                Value = m.Groups["value"].Value
            });
        matches.Add(new FunctionInfo
        {
            FunctionName = match.Groups["function"].Value,
            Values = l
        });
    }
}
Then you can access the collection nicely with identifiers like FunctionName:
foreach (var match in matches)
{
    Console.WriteLine("{0}({1})", match.FunctionName,
        string.Join(", ", match.Values.Select(val =>
            string.Format("{0}: \"{1}\"", val.Parameter, val.Value))));
}
Try this:
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\)
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*) matches the function name. ^ means start of the line, so the function name must be the first non-whitespace text in the string. You can remove the leading whitespace match if you don't need it; I just added it to make the match a little more flexible.
The next set \(((?<parameter>[^:]*):"(?<value>[^"]+)",?)*\) means capture each parameter-value pair inside the parenthesis. You have to escape the parenthesis for the function since they are symbols within the regex syntax.
The ?<> inside parenthesis are named capture groups, which when supported by a library, as they are in .NET, make grabbing the groups in the matches a little easier.
Here:
\w[\w\d]*\s*\(\s*(?:(\w[\w\d]*):("[^"]*"|\d+))*\s*\)
Visualization of that regex here.
For problems like that I always suggest people not try to find a single regex but write multiple regexes that share the work.
But here is my quick shot:
(?<funcName>[A-Za-z_][A-Za-z_0-9]*)
\(
(?<ParamGroup>
(?<paramName>[^(]+?)
:
"(?<paramValue>[^"]*)"
((,\s*)|(?=\)))
)*
\)
The whitespaces are there for better readability. Remove them or set the option to ignore pattern whitespaces.
This regex passes all your test cases:
^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):"(?<value>[^"]*?)",{0,1}\s*)*\)$
This works on multiple parameters and no parameters. It also handles special characters in the param name and whitespace after the comma. There may need to be some adjustments as your test cases do not cover everything you indicate in your text.
Please note that \w usually includes digits and is not appropriate as the leading character of the function name. Reference: http://www.regular-expressions.info/charclass.html#shorthand
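One .NET-specific detail is worth a sketch here: a group that sits inside a repeated construct keeps every value it captured, available through Group.Captures, so the repeated parameter/value pairs can be read back from a single match. The FunctionCallParser class below is a hypothetical wrapper around the pattern from this answer; like that pattern, it does not accept a leading underscore in the function name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FunctionCallParser
{
    // The pattern from this answer, as a verbatim string.
    private static readonly Regex Call = new Regex(
        @"^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):""(?<value>[^""]*?)"",{0,1}\s*)*\)$");

    // Returns function name, then param/value pairs; empty list when invalid.
    public static List<string> Parse(string call)
    {
        var result = new List<string>();
        Match m = Call.Match(call);
        if (!m.Success)
            return result;

        result.Add(m.Groups["function"].Value);
        // Each loop iteration of the repeated group adds one capture to both groups.
        CaptureCollection names = m.Groups["param"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            result.Add(names[i].Value);
            result.Add(values[i].Value);
        }
        return result;
    }

    public static void Main()
    {
        string[] samples =
        {
            "ThisIsValid_01(a:\"40\")",
            "ThisIsValid_01(a:\"40\", b:\"ZOO\")",
            "01_ThisIsntValid(a:\"40\")",
            "ThisIsntValid_02(a:40)",
            "ThisIsValid_02()"
        };
        foreach (string sample in samples)
            Console.WriteLine("{0} => [{1}]", sample, string.Join(", ", Parse(sample)));
    }
}
```

Most other regex flavours keep only the last capture of a repeated group, so this approach is an assumption about staying on .NET.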
