I hope you can help me out.
I'm using C# with .NET 4.0.
I want to validate a file structure like this:
const string dataFileScr = @"
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple */
}
PZ 11
{
IA_return()
}
GDC 7
{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
";
So far I have managed to build this regex pattern:
const string patternFileScr = @"
^
((?:\[|\s)*
(?<Section>[^\]\r\n]*)
(?:\])*
(?:[\r\n]{0,}|\Z))
(
(?:\{) ### !! improve for .ini file, dont take {
(?:[\r\n]{0,}|\Z)
( # Begin capture groups (Key Value Pairs)
(?!\}|\[) # Stop capture groups if a } is found; new section
(?:\s)* # Line with space
(?<Key>[^=]*?) # Any text before the =, matched few as possible
(?:[\s]*=[\s]*) # Get the = now
(?<Value>[^\r\n]*) # Get everything that is not an Line Changes
(?:[\r\n]{0,})
)* # End Capture groups
(?:[\r\n]{0,})
(?:\})?
(?:[\r\n\s]{0,}|\Z)
)*
";
and this C# code:
Dictionary<string, Dictionary<string, string>> DictDataFileScr
= (from Match m in Regex.Matches(dataFileScr,
patternFileScr,
RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline)
select new
{
Section = m.Groups["Section"].Value,
kvps = (from cpKey in m.Groups["Key"].Captures.Cast<Capture>().Select((a, i) => new { a.Value, i })
join cpValue in m.Groups["Value"].Captures.Cast<Capture>().Select((b, i) => new { b.Value, i }) on cpKey.i equals cpValue.i
select new KeyValuePair<string, string>(cpKey.Value, cpValue.Value)).OrderBy(_ => _.Key)
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value)
}).ToDictionary(itm => itm.Section, itm => itm.kvps);
It works for
const string dataFileScr = @"
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple */
}
GDC 7
{
Message = 6
RepeatCount = 2
ErrorMessage = 10
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
";
in other words
Section1
{
key1=value1
key2=value2
}
Section2
{
key1=value1
key2=value2
}
but it has three problems:
1. It fails when a key name repeats. I want to group by key and output
DictDataFileScr["GDC 7"]["Message"] = "6|7|8|8"
DictDataFileScr["GDC 7"]["ErrorMessage"] = "10|11"
2. It does not work for an .ini file like
....
[Section1]
key1 = value1
key2 = value2
[Section2]
key1 = value1
key2 = value2
...
3. It does not see the next section after
....
PZ 11
{
IA_return()
}
.....
Here is a complete rework of the regex in C#.
Assumptions (tell me if any of them are false):
An INI file section can only have key/value pair lines in its body.
In a non-INI file section, function calls can't have any parameters.
Regex flags:
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.Singleline
Input test:
const string dataFileScr = @"
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple */
}
PZ 11
{
IA_return()
}
GDC 7
{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
[Section1]
key1 = value1
key2 = value2
[Section2]
key1 = value1
key2 = value2
";
Reworked regex:
const string patternFileScr = @"
(?<Section> (?# Start of a non ini file section)
(?<SectionName>[\w ]+)\s* (?# Capture section name)
{ (?# Match but don't capture beginning of section)
(?<SectionBody> (?# Capture section body. Section body can be empty)
(?<SectionLine>\s* (?# Capture zero or more line(s) in the section body)
(?: (?# A line can be either a key/value pair, a comment or a function call)
(?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]*)) (?# Capture key/value pair. Key and value are sub-captured separately)
|
(?<Comment>/\*.+?\*/) (?# Capture comment)
|
(?<FunctionCall>[\w]+\(\)) (?# Capture function call. A function can't have parameters though)
)\s* (?# Match but don't capture white characters)
)* (?# Zero or more line(s), previously mentioned in comments)
)
} (?# Match but don't capture end of section)
)
|
(?<Section> (?# Start of an ini file section)
\[(?<SectionName>[\w ]+)\] (?# Capture section name)
(?<SectionBody> (?# Capture section body. Section body can be empty)
(?<SectionLine> (?# Capture zero or more line(s) in the section body. Only key/value pair allowed.)
\s*(?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]+))\s* (?# Capture key/value pair. Key and value are sub-captured separately)
)* (?# Zero or more line(s), previously mentioned in comments)
)
)
";
Discussion
The regex is built to match either non-INI file sections (1) or INI file sections (2).
(1) Non-INI file sections
These sections consist of a section name followed by a body enclosed in { and }.
The section name can contain letters, digits or spaces.
The section body consists of zero or more lines. A line can be a key/value pair (key = value), a comment (/* Here is a comment */) or a function call with no parameters (my_function()).
(2) INI file sections
These sections consist of a section name enclosed in [ and ] followed by zero or more key/value pairs. Each pair is on one line.
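The question also asked for repeated keys to be merged (for example Message = "6|7|8|8"), which the regex alone does not do. A minimal sketch of one way to do it, assuming the section body has already been extracted as a string; GroupedKeysDemo and GroupKeys are hypothetical names, and the key and value character sets are copied from the reworked regex:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupedKeysDemo
{
    // Hypothetical helper: collapses repeated keys of one section body into "v1|v2|...".
    public static Dictionary<string, string> GroupKeys(string sectionBody)
    {
        // One key/value pair per line; comment lines and function calls do not match.
        var pairs = Regex.Matches(
            sectionBody,
            @"^[ \t]*(?<Key>[\w\[\]]+)[ \t]*=[ \t]*(?<Value>[\w-]*)",
            RegexOptions.Multiline);

        return pairs.Cast<Match>()
                    .GroupBy(m => m.Groups["Key"].Value, m => m.Groups["Value"].Value)
                    .ToDictionary(g => g.Key, g => string.Join("|", g));
    }

    public static void Main()
    {
        string body = "Message = 6\nMessage = 7\nMessage = 8\nMessage = 8\n" +
                      "RepeatCount = 2\nErrorMessage = 10\nErrorMessage = 11\n";
        foreach (var kv in GroupKeys(body))
            Console.WriteLine("{0} = {1}", kv.Key, kv.Value);
    }
}
```

GroupBy keeps the values of each key in the order they appear, so no index join between the Key and Value captures is needed.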
Do yourself and your sanity a favor and learn how to use GPLex and GPPG. They are the closest thing that C# has to Lex and Yacc (or Flex and Bison, if you prefer) which are the proper tools for this job.
Regular expressions are great tools for performing robust string matching, but when you want to match structures of strings that's when you need a "grammar". This is what a parser is for. GPLex takes a bunch of regular expressions and generates a super-fast lexer. GPPG takes the grammar you write and generates a super-fast parser.
Trust me, learn how to use these tools ... or any other tools like them. You'll be glad you did.
# 2. not work for .ini file like
It won't work because, as your regular expression is written, a { is required after [Section].
Your regex will match only if you have something like this:
[Section]
{
key = value
}
Here is a sample in Perl. Perl doesn't have named capture arrays, probably because of backtracking.
Maybe you can pick something out of the regex, though. This assumes there is no nesting of {} brackets.
Edit: never content to leave well enough alone, a revised version is below.
use strict;
use warnings;
my $str = '
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple
*/
}
asdfasdf
PZ 11
{
IA_return()
}
[ section 5 ]
this = that
[ section 6 ]
this_ = _that{hello() hhh = bbb}
TOC{}
GDC 7
{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
';
use re 'eval';
my $rx = qr/
\s*
( \[ [^\S\n]* )? # Grp 1 optional ini section delimiter '['
(?<Section> \w+ (?:[^\S\n]+ \w+)* ) # Grp 2 'Section'
(?(1) [^\S\n]* \] |) # Condition, if we matched '[' then look for ']'
\s*
(?<Body> # Grp 3 'Body' (for display only)
(?(1)| \{ ) # Condition, if we're not a ini section then look for '{'
(?{ print "Section: '$+{Section}'\n" }) # SECTION debug print, remove in production
(?: # _grp_
\s* # whitespace
(?: # _grp_
\/\* .*? \*\/ # some comments
| # OR ..
# Grp 4 'Key' (tested with print, Perl doesn't have named capture arrays)
(?<Key> \w[\w\[\]]* (?:[^\S\n]+ [\w\[\]]+)* )
[^\S\n]* = [^\S\n]* # =
(?<Value> [^\n]* ) # Grp 5 'Value' (tested with print)
(?{ print " k\/v: '$+{Key}' = '$+{Value}'\n" }) # KEY,VALUE debug print, remove in production
| # OR ..
(?(1)| [^{}\n]* ) # any chars except newline and [{}] on the condition we're not a ini section
) # _grpend_
\s* # whitespace
)* # _grpend_ do 0 or more times
(?(1)| \} ) # Condition, if we're not a ini section then look for '}'
)
/x;
while ($str =~ /$rx/xsg)
{
print "\n";
print "Body:\n'$+{Body}'\n";
print "=========================================\n";
}
__END__
Output
Section: 'Start 0'
k/v: 'Next' = '1'
k/v: 'Author' = 'rk'
k/v: 'Date' = '2011-03-10'
Body:
'{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple
*/
}'
=========================================
Section: 'PZ 11'
Body:
'{
IA_return()
}'
=========================================
Section: 'section 5'
k/v: 'this' = 'that'
Body:
'this = that
'
=========================================
Section: 'section 6'
k/v: 'this_' = '_that{hello() hhh = bbb}'
Body:
'this_ = _that{hello() hhh = bbb}
'
=========================================
Section: 'TOC'
Body:
'{}'
=========================================
Section: 'GDC 7'
k/v: 'Message' = '6'
k/v: 'Message' = '7'
k/v: 'Message' = '8'
k/v: 'Message' = '8'
k/v: 'RepeatCount' = '2'
k/v: 'ErrorMessage' = '10'
k/v: 'ErrorMessage' = '11'
k/v: 'onKey[5]' = '6'
k/v: 'onKey[6]' = '4'
k/v: 'onKey[9]' = '11'
Body:
'{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}'
=========================================
Related
I'm looking for a .NET Regex pattern that matches the following:
string starts with the [ character
followed by an integer or decimal number
followed by .. (space character, dot, dot, space character)
followed by an integer or decimal number
followed by the last character of the string which is )
*- the decimal numbers have a decimal separator, the . character
*- the integer numbers or the integer value of the decimal numbers should have a maximum of 4 digits
*- the decimal numbers should have a maximum of 4 fractional digits
*- the numbers can be negative
*- if a number is positive then the + sign is missing
*- doesn't matter which one of the two numbers is smaller (first number can be bigger than the second one, "[56 .. 55)" for instance)
The pattern should match the following:
"[10 .. 15)"
"[100 .. 15.2)"
"[10.431 .. 15)"
"[-10.3 .. -5)"
"[-10.4 .. 5.12)"
"[10.4312 .. -5.1232)"
I'd also like to obtain the 2 numbers as strings from the string in case the pattern matches:
obtain "10" and "15" from "[10 .. 15)"
obtain "-10.4" and "5.12" from "[-10.4 .. 5.12)"
The following regex should be fine.
^\[-?\d+(?:\.\d+)? \.\. -?\d+(?:\.\d+)?\)$
var pattern = @"^\[-?\d+(?:\.\d+)? \.\. -?\d+(?:\.\d+)?\)$";
var inputs = new[]{"[10 .. 15)", "[100 .. 15.2)", "[10.431 .. 15)", "[-10.3 .. -5)", "[-10.4 .. 5.12)", "[10.4312 .. -5.1232)", };
foreach (var input in inputs)
{
Console.WriteLine(input + " = " + Regex.IsMatch(input, pattern));
}
// [10 .. 15) = True
// [100 .. 15.2) = True
// [10.431 .. 15) = True
// [-10.3 .. -5) = True
// [-10.4 .. 5.12) = True
// [10.4312 .. -5.1232) = True
https://dotnetfiddle.net/LpswtI
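Note that \d+ puts no limit on the number of digits, so this pattern also accepts inputs that break the "maximum of 4 digits" rules in the question. A small sketch to compare it against a bounded variant; the {1,4} quantifiers and the names DigitLimitDemo, Unbounded and Bounded are assumptions added for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitLimitDemo
{
    // The pattern from the answer above: any number of digits is accepted.
    public const string Unbounded = @"^\[-?\d+(?:\.\d+)? \.\. -?\d+(?:\.\d+)?\)$";

    // Assumed variant: {1,4} enforces at most 4 integer digits and 4 fractional digits.
    public const string Bounded = @"^\[-?\d{1,4}(?:\.\d{1,4})? \.\. -?\d{1,4}(?:\.\d{1,4})?\)$";

    public static void Main()
    {
        foreach (var input in new[] { "[10 .. 15)", "[12345 .. 0)", "[1.23456 .. 0)" })
        {
            Console.WriteLine("{0}: unbounded={1}, bounded={2}", input,
                Regex.IsMatch(input, Unbounded), Regex.IsMatch(input, Bounded));
        }
    }
}
```

Only the first input satisfies both patterns; the other two pass the unbounded pattern but are rejected by the bounded one.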
You can use
^\[(-?\d{1,4}(?:\.\d{1,4})?) \.\. (-?\d{1,4}(?:\.\d{1,4})?)\)$
See the regex demo. Details:
^ - start of string
\[ - a [ char
(-?\d{1,4}(?:\.\d{1,4})?) - Group 1: an optional -, one to four digits and then an optional sequence of a . and one to four digits
\.\. - a .. string
(-?\d{1,4}(?:\.\d{1,4})?) - Group 2: an optional -, one to four digits and then an optional sequence of a . and one to four digits
\) - a ) char
$ - end of string (use \z if you need to check for the very end of string).
See the C# demo:
var texts = new List<string> { "[10 .. 15)", "[100 .. 15.2)", "[10.431 .. 15)", "[-10.3 .. -5)", "[-10.4 .. 5.12)", "[10.4312 .. -5.1232)", "[12345.1234 .. 0)", "[1.23456 .. 0" };
var pattern = new Regex(@"^\[(-?\d{1,4}(?:\.\d{1,4})?) \.\. (-?\d{1,4}(?:\.\d{1,4})?)\)$");
foreach (var s in texts)
{
Console.WriteLine($"---- {s} ----");
var match = pattern.Match(s);
if (match.Success)
{
Console.WriteLine($"Group 1: {match.Groups[1].Value}, Group 2: {match.Groups[2].Value}");
}
else
{
Console.WriteLine($"No match found in '{s}'.");
}
}
Output:
---- [10 .. 15) ----
Group 1: 10, Group 2: 15
---- [100 .. 15.2) ----
Group 1: 100, Group 2: 15.2
---- [10.431 .. 15) ----
Group 1: 10.431, Group 2: 15
---- [-10.3 .. -5) ----
Group 1: -10.3, Group 2: -5
---- [-10.4 .. 5.12) ----
Group 1: -10.4, Group 2: 5.12
---- [10.4312 .. -5.1232) ----
Group 1: 10.4312, Group 2: -5.1232
---- [12345.1234 .. 0) ----
No match found in '[12345.1234 .. 0)'.
---- [1.23456 .. 0 ----
No match found in '[1.23456 .. 0'.
This works (see this .NET Fiddle):
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
Match m = rx.Match("[123 .. -9876.5432)");
if (!m.Success )
{
Console.WriteLine("No Match");
}
else
{
Console.WriteLine(@"left: {0}", m.Groups[ "left" ] );
Console.WriteLine(@"right: {0}", m.Groups[ "right" ] );
}
}
private static readonly Regex rx = new Regex(@"
^ # anchor match at start-of-text
[[] # a left square bracket followed by
(?<left> # a named capturing group, containing a number, consisting of
-?[0-9]{1,4} # - a mandatory integer portion followed by
([.][0-9]{1,4})? # - an optional fractional portion
) # the whole of which is followed by
[ ][.][.][ ] # a separator (' .. '), followed by
(?<right> # another named capturing group containing a number, consisting of
-?[0-9]{1,4} # - a mandatory integer portion followed by
([.][0-9]{1,4})? # - an optional fractional portion
) # the whole of which is followed by
[)] # a right parenthesis (the closing character the question asks for), followed by
$ # end-of-text
",
RegexOptions.IgnorePatternWhitespace|RegexOptions.ExplicitCapture
);
}
I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups, where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use it in the replacement; otherwise add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non capture group to match as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
"10ANYUNIT",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
Regex.Replace(
s, pattern, m =>
m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
)
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Because in .NET \d can also match digits from other scripts, \s can also match a newline, and an unanchored pattern can start matching in the middle of a word, a slightly more precise pattern is:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
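A minimal sketch of how the stricter pattern could be used with the same match evaluator as above. DosageDemo and Normalize are hypothetical names. One detail is visible only when you inspect the string itself: a number at the very end of the input (the 5 in 40 e-5) has no group 2, so it also gets a space appended, which is why this sketch trims the result:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DosageDemo
{
    // ASCII digits only, horizontal whitespace only, word boundary before the number.
    private static readonly Regex Rx =
        new Regex(@"\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?");

    // Hypothetical helper: keeps %, :, e, E glued to the number, otherwise adds one space.
    public static string Normalize(string s)
    {
        string replaced = Rx.Replace(
            s,
            m => m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " "));
        return replaced.TrimEnd();   // drop the space added after a trailing number
    }

    public static void Main()
    {
        foreach (var s in new[] { "10ANYUNIT", "10 : something", "10 %", "40 e-5" })
            Console.WriteLine("'{0}' -> '{1}'", s, Normalize(s));
    }
}
```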
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 3 - ((E|e|%|:)+)
Any of the special symbols E e % :
Group 1 and Group 3 can be separated by any amount of whitespace.
If it's not working the way you want, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
// First look for exponential notation.
// Pattern is: match zero or more whitespaces \s*
// Then match one or more digits and store it in first capturing group (\d+)
// Then match one or more whitespaces again.
// Then match part with exponent ([eE][-+]?\d+) and store it in second capturing group.
// It will match lower or uppercase 'e' with optional (due to ? operator) dash/plus sign and one or more digits.
// Then match zero or more white spaces.
var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
if(expForMatch.Success)
{
return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
}
var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
if (matchWithColon.Success)
{
return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
}
var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
if (matchWithPercent.Success)
{
return $"{matchWithPercent.Groups[1].Value}%";
}
var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
if (matchWithUnit.Success)
{
return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
}
return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'
I'm attempting to parse key-value pairs from strings which look suspiciously like markup using .Net Core 2.1.
Considering the sample Program.cs file below...
My Questions Are:
1.
How can I write the pattern kvp to behave as "Key and Value if exists" instead of "Key or Value" as it currently behaves?
For example, in the test case 2 output, instead of:
=============================
input = <tag KEY1="vAl1">
--------------------
kvp[0] = KEY1
key = KEY1
value =
--------------------
kvp[1] = vAl1
key =
value = vAl1
=============================
I want to see:
=============================
input = <tag KEY1="vAl1">
--------------------
kvp[0] = KEY1="vAl1"
key = KEY1
value = vAl1
=============================
Without breaking test case 9:
=============================
input = <tag noValue1 noValue2>
--------------------
kvp[0] = noValue1
key = noValue1
value =
--------------------
kvp[1] = noValue2
key = noValue2
value =
=============================
2.
How can I write the pattern value to stop matching at the next character matched by the group named "quotes"? In other words, the very next balancing quote. I obviously misunderstand how backreferencing works, my understanding is that \k<quotes> will be replaced by the value matched at runtime (not the pattern defined at design time) by (?<quotes>[""'`]).
For example, in the test case 5 output, instead of:
--------------------
kvp[4] = key3='hello,
key =
value = key3='hello,
--------------------
kvp[5] = experts
key =
value = experts
=============================
I want to see (notwithstanding a solution to question 1):
--------------------
kvp[4] = key3
key = key3
value =
--------------------
kvp[5] = hello, "experts"
key =
value = hello, "experts"
=============================
3.
How can I write the pattern value to stop matching before />? In test case 7, the value for key2 should be thing-1. I can't remember all that I've attempted, but I haven't found a pattern that works without breaking test case 6, in which the / is part of the value.
Program.cs
using System;
using System.Reflection;
using System.Text.RegularExpressions;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
RegExTest();
Console.ReadLine();
}
static void RegExTest()
{
// Test Cases
var case1 = @"<tag>";
var case2 = @"<tag KEY1=""vAl1"">";
var case3 = @"<tag kEy2='val2'>";
var case4 = @"<tag key3=`VAL3`>";
var case5 = @"<tag key1='val1'
key2=""http://www.w3.org"" key3='hello, ""experts""'>";
var case6 = @"<tag :key1 =some/thing>";
var case7 = @"<tag key2=thing-1/>";
var case8 = @"<tag key3 = thing-2>";
var case9 = @"<tag noValue1 noValue2>";
var case10 = @"<tag/>";
var case11 = @"<tag />";
// A key may begin with a letter, underscore or colon, followed by
// zero or more of those, or numbers, periods, or dashes.
string key = @"(?<key>(?<=\s+)[a-z_:][a-z0-9_:\.-]*?(?=[\s=>]+))";
// A value may contain any character, and must be wrapped in balanced quotes (double, single,
// or back) if the value contains any quote, whitespace, equal, or greater- or less- than
// character.
string value = @"(?<value>((?<=(?<quotes>[""'`])).*?(?=\k<quotes>)|(?<=[=][\s]*)[^""'`\s=<>]+))";
// A key-value pair must contain a key,
// a value is optional
string kvp = $"(?<kvp>{key}|{value})"; // Without the | (pipe), it doesn't match any test case...
// ...value needs to be optional (case9), tried:
//kvp = $"(?<kvp>{key}{value}?)";
//kvp = $"(?<kvp>{key}({value}?))";
//kvp = $"(?<kvp>{key}({value})?)";
// ...each only matches key, but also matches value in case8 as key
Regex getKvps = new Regex(kvp, RegexOptions.IgnoreCase);
FormatMatches(getKvps.Matches(case1)); // OK
FormatMatches(getKvps.Matches(case2)); // OK
FormatMatches(getKvps.Matches(case3)); // OK
FormatMatches(getKvps.Matches(case4)); // OK
FormatMatches(getKvps.Matches(case5)); // Backreference and/or lazy qualifier doesn't work.
FormatMatches(getKvps.Matches(case6)); // OK
FormatMatches(getKvps.Matches(case7)); // The / is not part of the value.
FormatMatches(getKvps.Matches(case8)); // OK
FormatMatches(getKvps.Matches(case9)); // OK
FormatMatches(getKvps.Matches(case10)); // OK
FormatMatches(getKvps.Matches(case11)); // OK
}
static void FormatMatches(MatchCollection matches)
{
Console.WriteLine(new string('=', 78));
var _input = matches.GetType().GetField("_input",
BindingFlags.NonPublic |
BindingFlags.Instance)
.GetValue(matches);
Console.WriteLine($"input = {_input}");
Console.WriteLine();
if (matches.Count < 1)
{
Console.WriteLine("[kvp not matched]");
return;
}
for (int i = 0; i < matches.Count; i++)
{
Console.WriteLine(new string('-', 20));
Console.WriteLine($"kvp[{i}] = {matches[i].Groups["kvp"]}");
Console.WriteLine($"\t key\t=\t{matches[i].Groups["key"]}");
Console.WriteLine($"\tvalue\t=\t{matches[i].Groups["value"]}");
}
}
}
}
You may use
\s(?<key>[a-z_:][a-z0-9_:.-]*)(?:\s*=\s*(?:(?<q>[`'"])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'"<>])+)))?
See the regex demo with groups highlighted and a .NET regex demo (proof).
C# usage:
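var pattern = @"\s(?<key>[a-z_:][a-z0-9_:.-]*)(?:\s*=\s*(?:(?<q>[`'""])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'""<>])+)))?";
// input is one of the test strings (case is a reserved word in C#, so it cannot be a variable name)
var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);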
foreach (Match m in matches)
{
Console.WriteLine(m.Value); // The whole match
Console.WriteLine(m.Groups["key"].Value); // Group "key" value
Console.WriteLine(m.Groups["value"].Value); // Group "value" value
}
Details
\s - a whitespace
(?<key>[a-z_:][a-z0-9_:.-]*) - Group "key": a letter, _ or : and then 0+ letters, digits, _, :, . or -
(?:\s*=\s*(?:(?<q>[`'"])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'"<>])+)))? - one or zero occurrences of (the value is thus optional):
\s*=\s* - a = enclosed with 0+ whitespaces
(?: - start of a non-capturing group:
(?<q>[`'"]) - Group "q": a delimiter, `, ' or "
(?<value>.*?) - Group "value" matching any 0+ chars other than line break chars as few as possible
\k<q> - backreference to Group "q", same value must match
| - or
(?<value>(?:(?!/>)[^\s`'"<>])+) - Group "value": a char other than whitespace, `, ', ", < and >, 1 or more occurrences, that does not start a /> char sequence
) - end of the non-capturing group.
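To check the pattern against the test cases from the question, a self-contained sketch along these lines can be used. TagAttributeDemo and Parse are hypothetical names; the pattern itself is the one explained above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagAttributeDemo
{
    private static readonly Regex Rx = new Regex(
        @"\s(?<key>[a-z_:][a-z0-9_:.-]*)(?:\s*=\s*(?:(?<q>[`'""])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'""<>])+)))?",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every key with its value (empty string when the key has no value).
    public static List<KeyValuePair<string, string>> Parse(string tag)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Rx.Matches(tag))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value, m.Groups["value"].Value));
        }
        return result;
    }

    public static void Main()
    {
        string[] inputs =
        {
            @"<tag KEY1=""vAl1"">",
            @"<tag key3='hello, ""experts""'>",
            @"<tag key2=thing-1/>",
            @"<tag noValue1 noValue2>",
        };
        foreach (var input in inputs)
            foreach (var kv in Parse(input))
                Console.WriteLine("{0} => key = {1}, value = {2}", input, kv.Key, kv.Value);
    }
}
```

Because both alternatives write to the same group name "value", the calling code does not need to know whether the value was quoted.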
According to the Regex documentation, using RegexOptions.ExplicitCapture makes the Regex capture only named groups like (?<groupName>...), but in practice it does something slightly different.
Consider these lines of code:
static void Main(string[] args) {
Regex r = new Regex(
@"(?<code>^(?<l1>[\d]{2})/(?<l2>[\d]{3})/(?<l3>[\d]{2})$|^(?<l1>[\d]{2})/(?<l2>[\d]{3})$|(?<l1>^[\d]{2}$))"
, RegexOptions.ExplicitCapture
);
var x = r.Match("32/123/03");
r.GetGroupNames().ToList().ForEach(gn => {
Console.WriteLine("GroupName:{0,5} --> Value: {1}", gn, x.Groups[gn].Success ? x.Groups[gn].Value : "");
});
}
When you run this snippet you'll see the result contains a group named 0 while I don't have a group named 0 in my regex!
GroupName: 0 --> Value: 32/123/03
GroupName: code --> Value: 32/123/03
GroupName: l1 --> Value: 32
GroupName: l2 --> Value: 123
GroupName: l3 --> Value: 03
Press any key to continue . . .
Could somebody please explain this behavior to me?
You always have group 0: that's the entire match. Numbered groups are relative to 1 based on the ordinal position of the opening parenthesis that defines the group. Your regular expression (formatted for clarity):
(?<code>
^
(?<l1> [\d]{2} )
/
(?<l2> [\d]{3} )
/
(?<l3> [\d]{2} )
$
|
^
(?<l1>[\d]{2})
/
(?<l2>[\d]{3})
$
|
(?<l1> ^[\d]{2} $ )
)
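To illustrate the point about group 0: it is present for every Regex, with or without ExplicitCapture, and it can be filtered out when listing group names. A minimal sketch, where GroupZeroDemo and NamedGroups are hypothetical names and the short pattern is a made-up example:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupZeroDemo
{
    // Returns the user-defined group names, skipping the implicit group 0 (the whole match).
    public static string[] NamedGroups(Regex r)
    {
        return r.GetGroupNames()
                .Where(name => r.GroupNumberFromName(name) != 0)
                .ToArray();
    }

    public static void Main()
    {
        var r = new Regex(@"^(?<l1>\d{2})(/(?<l2>\d{3}))?$", RegexOptions.ExplicitCapture);
        Match m = r.Match("32/123");

        // Group 0 always holds the same text as the match itself.
        Console.WriteLine(m.Groups[0].Value == m.Value);
        Console.WriteLine(string.Join(", ", NamedGroups(r)));
    }
}
```

With ExplicitCapture the unnamed parentheses around /(?<l2>...) do not create a numbered group, so only group 0 and the named groups remain.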
Your expression will backtrack, so you might consider simplifying your regular expression. This is probably clearer and more efficient:
static Regex rxCode = new Regex(@"
^ # match start-of-line, followed by
(?<code> # a mandatory group ('code'), consisting of
(?<g1> \d\d ) # - 2 decimal digits ('g1'), followed by
( # - an optional group, consisting of
/ # - a literal '/', followed by
(?<g2> \d\d\d ) # - 3 decimal digits ('g2'), followed by
( # - an optional group, consisting of
/ # - a literal '/', followed by
(?<g3> \d\d ) # - 2 decimal digits ('g3')
)? # - END: optional group
)? # - END: optional group
) # - END: named group ('code'), followed by
$ # - end-of-line
" , RegexOptions.IgnorePatternWhitespace|RegexOptions.ExplicitCapture );
Once you have that, something like this:
string[] texts = { "12" , "12/345" , "12/345/67" , } ;
foreach ( string text in texts )
{
Match m = rxCode.Match( text ) ;
Console.WriteLine("{0}: match was {1}" , text , m.Success ? "successful" : "NOT successful" ) ;
if ( m.Success )
{
Console.WriteLine( " code: {0}" , m.Groups["code"].Value ) ;
Console.WriteLine( " g1: {0}" , m.Groups["g1"].Value ) ;
Console.WriteLine( " g2: {0}" , m.Groups["g2"].Value ) ;
Console.WriteLine( " g3: {0}" , m.Groups["g3"].Value ) ;
}
}
produces the expected
12: match was successful
code: 12
g1: 12
g2:
g3:
12/345: match was successful
code: 12/345
g1: 12
g2: 345
g3:
12/345/67: match was successful
code: 12/345/67
g1: 12
g2: 345
g3: 67
Try this (I removed the outer group from your regex):
^(?<l1>[\d]{2})/(?<l2>[\d]{3})/(?<l3>[\d]{2})$|^(?<l1>[\d]{2})/(?<l2>[\d]{3})$|(?<l1>^[\d]{2}$)
I'm trying to parse a file that has the following format:
BEGIN:VEVENT
CREATED:20120504T163940Z
DTEND;TZID=America/Chicago:20120504T130000
DTSTAMP:20120504T164000Z
DTSTART;TZID=America/Chicago:20120504T120000
LAST-MODIFIED:20120504T163940Z
SEQUENCE:0
SUMMARY:Test 1
TRANSP:OPAQUE
UID:21F61281-FB76-467F-A2CC-A666688BD9B5
X-RADICALE-NAME:21F61281-FB76-467F-A2CC-A666688BD9B5.ics
END:VEVENT
I need to take the values found after the colon or semi colon on each line and put them into props in an object. I'm attempting to do this with Regex, but I basically forget everything I know about Regex after I use it (which is maybe twice a year). Any help would be appreciated.
edit
This post got me thinking about the iCal format.
Before yesterday, I didn't know what the iCal format was. But after reading the 1998 spec, it's painfully obvious that none of the answers on this page is adequate to parse the content. And it's really too sophisticated even for my general regex below.
With that in mind, here is a solution that parses just the line content, as gleaned from the spec for general line content parsing. It's a step in the right direction, and hopefully someone can benefit. It doesn't do line continuation and does not validate.
C# code
// Note: \p{Cc} (Unicode control characters, which include \n) is used below
// because .NET does not support POSIX bracket classes such as [:cntrl:].
Regex iCalMainRx = new Regex(
@" ^ (?<name> [^\p{Cc}"";:,]+ )
(?<parameter>
;
(?<param_name> [^\p{Cc}"";:,]+ )
=
(?<param_value>
(?: (?:[^\S\n]|[^\p{Cc}"";:,])* | "" (?:[^\S\n]|[^\p{Cc}""])* "" )
(?: , (?: (?:[^\S\n]|[^\p{Cc}"";:,])* | "" (?:[^\S\n]|[^\p{Cc}""])* "" ) )*
)
)*
:
(?<value> (?:[^\S\n]|[^\p{Cc}])* )
$ ", RegexOptions.IgnorePatternWhitespace);
Regex iCalPvalRx = new Regex(
@" ^ (?<pvals> (?:[^\S\n]|[^\p{Cc}"";:,])* | "" (?:[^\S\n]|[^\p{Cc}""])* "" )
(?: ,+ (?<pvals> (?:[^\S\n]|[^\p{Cc}"";:,])* | "" (?:[^\S\n]|[^\p{Cc}""])* "" ) )*
$ ", RegexOptions.IgnorePatternWhitespace);
string[] lines = {
"BEGIN:VEVENT",
"CREATED:20120504T163940Z",
"DTEND;TZID=America/Chicago:20120504T130000",
"DTSTAMP:20120504T164000Z",
"DTSTART;TZID=,,,America/Chicago;Next=;last=\"this:;;;:=\";final=:20120504T120000",
"LAST-MODIFIED:20120504T163940Z",
"SEQUENCE:0",
"SUMMARY:Test 1",
"TRANSP:OPAQUE",
"UID:21F61281-FB76-467F-A2CC-A666688BD9B5",
"X-RADICALE-NAME:21F61281-FB76-467F-A2CC-A666688BD9B5.ics",
"END:VEVENT",
};
foreach (string str in lines)
{
Match m_content = iCalMainRx.Match( str );
if (m_content.Success)
{
Console.WriteLine("Key = " + m_content.Groups["name"].Value);
Console.WriteLine("Value = " + m_content.Groups["value"].Value);
CaptureCollection cc_pname = m_content.Groups["param_name"].Captures;
CaptureCollection cc_pvalue = m_content.Groups["param_value"].Captures;
if (cc_pname.Count > 0)
{
Console.WriteLine("Parameters: ");
for (int i = 0; i < cc_pname.Count; i++)
{
// Console.WriteLine("\t'" + cc_pname[i].Value + "' = '" + cc_pvalue[i].Value + "'");
Console.WriteLine("\t'" + cc_pname[i].Value + "' =");
Match m_vals = iCalPvalRx.Match( cc_pvalue[i].Value );
if (m_vals.Success)
{
CaptureCollection cc_vals = m_vals.Groups["pvals"].Captures;
for (int j = 0; j < cc_vals.Count; j++)
{
Console.WriteLine("\t\t'" + cc_vals[j].Value + "'");
}
}
}
}
Console.WriteLine("-------------------------");
}
}
Output
Key = BEGIN
Value = VEVENT
-------------------------
Key = CREATED
Value = 20120504T163940Z
-------------------------
Key = DTEND
Value = 20120504T130000
Parameters:
'TZID' =
'America/Chicago'
-------------------------
Key = DTSTAMP
Value = 20120504T164000Z
-------------------------
Key = DTSTART
Value = 20120504T120000
Parameters:
'TZID' =
''
'America/Chicago'
'Next' =
''
'last' =
'"this:;;;:="'
'final' =
''
-------------------------
Key = LAST-MODIFIED
Value = 20120504T163940Z
-------------------------
Key = SEQUENCE
Value = 0
-------------------------
Key = SUMMARY
Value = Test 1
-------------------------
Key = TRANSP
Value = OPAQUE
-------------------------
Key = UID
Value = 21F61281-FB76-467F-A2CC-A666688BD9B5
-------------------------
Key = X-RADICALE-NAME
Value = 21F61281-FB76-467F-A2CC-A666688BD9B5.ics
-------------------------
Key = END
Value = VEVENT
-------------------------
Splitting into lines and using IndexOf(":") may be enough for simple iCal files, without any regex.
Also check whether an existing iCal parser already does what you need; see the related questions tagged ical and C#.
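A minimal sketch of the IndexOf approach, assuming one property per line, no line folding, and no quoted parameter value that contains a colon. ICalLineDemo and SplitLine are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

public static class ICalLineDemo
{
    // Hypothetical helper: splits "NAME;PARAMS:VALUE" at the first ':'
    // and drops any ";PARAMS" part from the name.
    public static KeyValuePair<string, string> SplitLine(string line)
    {
        int colon = line.IndexOf(':');
        if (colon < 0)
            return new KeyValuePair<string, string>(line, string.Empty);

        string name = line.Substring(0, colon);
        int semi = name.IndexOf(';');
        if (semi >= 0)
            name = name.Substring(0, semi);   // e.g. strip ";TZID=America/Chicago"

        return new KeyValuePair<string, string>(name, line.Substring(colon + 1));
    }

    public static void Main()
    {
        foreach (var line in new[] { "SUMMARY:Test 1", "DTEND;TZID=America/Chicago:20120504T130000" })
        {
            var kv = SplitLine(line);
            Console.WriteLine("{0} = {1}", kv.Key, kv.Value);
        }
    }
}
```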
Try:
(?<key>[^:;]*)[:;](?<value>[^\s]*)
C# snippet:
Regex regex = new Regex(
@"(?<key>[^:;]*)[:;](?<value>[^\s]*)",
RegexOptions.None
);
It takes a string of any character but a colon or semicolon as the key, and then anything else but whitespace as the value.
If you want to test it or make changes, check out the regex checker I have on my blog: http://blog.stevekonves.com/2012/01/an-even-better-regex-tester/ (requires silverlight)
Run this with a few examples and see if it does what you want. I get the other comments about splitting or IndexOf but if you're expecting that the delimiter is either a colon or a semicolon then a regex is probably better.
string line = "LAST-MODIFIED:20120504T163940Z";
var p = Regex.Match(line, "(.*)?(:|;)(.*)$", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Singleline);
Console.WriteLine(p.Groups[0].Value);
Console.WriteLine(p.Groups[1].Value);
Console.WriteLine(p.Groups[2].Value);
Console.WriteLine(p.Groups[3].Value);
I'd personally use string.Split(':') on each line in the file. It also has the benefit of being easy to read and understand if you don't want to relearn regular expressions!
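One caveat with a plain Split(':') is that a value which itself contains a colon gets cut into extra pieces. A small sketch using the Split overload that takes a maximum count; SplitLimitDemo and SplitOnce are hypothetical names and the SUMMARY line is a made-up sample, not taken from the file above:

```csharp
using System;

public static class SplitLimitDemo
{
    // Splits at the first ':' only, so the value keeps any later colons.
    public static string[] SplitOnce(string line)
    {
        return line.Split(new[] { ':' }, 2);
    }

    public static void Main()
    {
        string line = "SUMMARY:Lunch: team meeting";   // hypothetical sample line
        Console.WriteLine(line.Split(':').Length);      // number of pieces without a limit
        string[] parts = SplitOnce(line);
        Console.WriteLine("{0} = {1}", parts[0], parts[1]);
    }
}
```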