How can I match the given pattern using Regex in C#? - c#

I have the following input:
-key1:"val1" -key2: "val2" -key3:(val3) -key4: "(val4)" -key5: val5 -key6: "val-6" -key-7: val7 -key-eight: "val 8"
With only the following assumption about the pattern:
Keys always start with a - followed by a value delimited by :
How can I match and extract each key and its corresponding value?
I have so far come up with the following regex:
-(?<key>\S*):\s?(?<val>\S*)
However, it does not match the complete value of the last argument, because that value contains a space, and I cannot figure out how to handle that.
The expected output should be:
key1 "val1"
key2 "val2"
key3 (val3)
key4 "(val4)"
key5 val5
key6 "val-6"
key-7 val7
key-eight val 8
Any help is much appreciated.

Guessing that you want to only allow whitespace characters that are not at the beginning or end, change your regex to:
-(?<key>\S*):\s?(?<val>\S+(\s*[^-\s])*)
This assumes that a - character preceded by whitespace always means a new key is beginning; it cannot be part of any value.
For this example:
-key: value -key2: value with whitespace -key3: value-with-hyphens -key4: v
The matches are:
-key: value, -key2: value with whitespace, -key3: value-with-hyphens, -key4: v.
It also works perfectly well on your provided example.
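To check the pattern against the input from the question, a minimal C# sketch could look like the following. Only the pattern comes from the answer above; the class and method names (KeyValueRegexDemo, Parse) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueRegexDemo
{
    // Pattern from the answer: a value may contain inner whitespace,
    // but a '-' that follows whitespace starts the next key.
    private static readonly Regex Pattern =
        new Regex(@"-(?<key>\S*):\s?(?<val>\S+(\s*[^-\s])*)");

    // Hypothetical helper: returns the key/value pairs in input order.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(input))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value, m.Groups["val"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string input =
            "-key1:\"val1\" -key2: \"val2\" -key3:(val3) -key4: \"(val4)\" " +
            "-key5: val5 -key6: \"val-6\" -key-7: val7 -key-eight: \"val 8\"";

        foreach (var pair in Parse(input))
        {
            Console.WriteLine(pair.Key + " " + pair.Value);
        }
    }
}
```

The values keep their surrounding quotes; if they are not wanted, something like pair.Value.Trim('"') can strip them afterwards.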

A low tech (non regex) solution, just for an alternative. Trim guff, ToDictionary if you need
var results = input.Split(new[] { " -" }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim('-').Split(':'));
Output
key1 -> "val1"
key2 -> "val2"
key3 -> (val3)
key4 -> "(val4)"
key5 -> val5
key6 -> "val-6"
key-7 -> val7
key8 -> "val 8"
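The "trim and ToDictionary" step mentioned above could be sketched as follows. This is only an illustration under two assumptions: every piece contains at least one colon, and no value contains the sequence " -". The names SplitToDictionaryDemo and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitToDictionaryDemo
{
    // Splits on " -" to isolate each "key: value" piece, then splits each
    // piece on the first ':' only, so a value may itself contain colons.
    public static Dictionary<string, string> Parse(string input)
    {
        return input
            .Split(new[] { " -" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(part => part.TrimStart('-').Split(new[] { ':' }, 2))
            .ToDictionary(kv => kv[0].Trim(), kv => kv[1].Trim());
    }

    public static void Main()
    {
        string input =
            "-key1:\"val1\" -key2: \"val2\" -key3:(val3) -key4: \"(val4)\" " +
            "-key5: val5 -key6: \"val-6\" -key-7: val7 -key-eight: \"val 8\"";

        foreach (var pair in Parse(input))
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value);
        }
    }
}
```

Limiting the second split to two parts is a design choice that differs slightly from the snippet above, which splits on every colon.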

Try this regex using Replace function:
(?:^|(?!\S)\s*)-|\s*:\s*
and replace with "\n". You should get key values in separate lines.

I presume you're wanting to keep the brackets and quotation marks as that's what you're doing in the example you gave? If so then the following should work:
-(?<key>\S+):+\s?(?<val>\S+\s?\d+\)?\"?)
This does presume that all values end with a number, though.
EDIT:
Given that the val doesn't always end with a number, but I'm guessing it always starts with val, this is what I have:
-(?<key>\S+):+\s?(?<val>\"?\(?(val)+\s?\S+)
Seems to be working properly...

This should do the trick
-(?<key>\S*):\s*(?<value>(?(?=")((")(?:(?=(\\?))\2.)*?\1))(\S*))
Basically, it uses an if/then/else conditional to detect whether the value starts with a quote: the condition is (?=") in (?(?=")(true regex)(false regex). The false regex is your \S*, while the true regex tries to match the opening and closing quote: (")(?:(?=(\\?))\2.)*?\1).

Related

Regex expression - match specific characters (multiple times) and ignore comments

I'm not an expert on regex and need some help to set up one.
I'm using PowerShell and its [regex] type, which is a .NET class. The final objective is to read a TOML file (sample data at the bottom), in which I need to:
match some values (values between "__")
ignore comments. (a comment starts with "#")
To match the values and put them in a capture group the following regex works:
match the template value (values between "__" ):
__(?<tokenName>[\w\.]+)__
I also want to ignore the commented lines, and I came up with this:
Ignore lines that start with a comment (even if "#" is preceded by spaces or tabs):
^(?!\s*\t*#).*
The problem starts when I put them together
^(?!\s*\t*#).*__(?<tokenName>[\w\.]+)__
this expression has the following problems:
up to one match per line, the last one (ie: in the line with "Prop5 = ..." I get one match instead of two)
Comments at the end of a line are not considered (ie: line with "Prop4 = ..." has two matches instead of one)
I've also tried to
add this at the end of the expression, it should stop the match on the first occurrence of the character
[^#]
add this at the beginning, which should check if the matched string has the given char before it and exclude it
(?<!^#)
This is a sample of my data
#templateFile
[Agent]
Prop1 = "__Data.Agent.Prop1__"
Prop2 = [__Data.Agent.Prop2__]
#I'm a comment
#Prop3 = "__NotUsed__"
Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__
Prop5 = ["__Data.Agent.Prop5a__","__Data.Agent.Prop5b__"]
I think the easier solution will be to match the given string, only if there is no "#" before it on the same line.
Is it possible?
EDIT:
The first expression proposed by @the-fourth-bird works perfectly; it just needs the multiline modifier to be specified.
The final (runnable) result looks like this in PowerShell.
[regex]$reg = "(?m)(?<!^.*#.*)__(?<tokenName>[\w.]+)__"
$text = '
#templateFile
[Agent]
Prop1 = "__Data.Agent.Prop1__"
Prop2 = [__Data.Agent.Prop2__]
Prop5 = ["__Data.Agent.Prop5a__","__Data.Agent.Prop5b__"]
#a comment
#Prop3 = "__Data.Agent.Prop3__"
Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__
'
$reg.Matches($text) | Format-Table
#This returns
Groups Success Name Captures Index Length Value
------ ------- ---- -------- ----- ------ -----
{0, tokenName} True 0 {0} 31 20 __Data.Agent.Prop1__
{0, tokenName} True 0 {0} 62 20 __Data.Agent.Prop2__
{0, tokenName} True 0 {0} 94 21 __Data.Agent.Prop5a__
{0, tokenName} True 0 {0} 118 21 __Data.Agent.Prop5b__
{0, tokenName} True 0 {0} 194 20 __Data.Agent.Prop4__
I think you could make use of infinite repetition to check if what precedes does not contain a # to also account for the comment in Prop4
(?<!^.*#.*)__(?<tokenName>[\w.]+)__
If Prop4 should have 2 matches, you might use:
(?<!^[ \t]*#.*)__(?<tokenName>[\w.]+)__
Both expressions need the multiline modifier to work properly.
It can be specified inline by adding (?m) at the beginning, or by passing RegexOptions.Multiline to a constructor that supports it:
(?m)(?<!^.*#.*)__(?<tokenName>[\w.]+)__
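For comparison, here is a minimal C# sketch of both lookbehind variants, using RegexOptions.Multiline instead of the inline (?m). The patterns are the ones from the answer; the class, constant and method names are hypothetical, and the sample text is the question's data joined with "\n".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemplateTokenDemo
{
    // Skips tokens that have a '#' anywhere before them on the same line.
    public const string IgnoreAfterAnyHash =
        @"(?<!^.*#.*)__(?<tokenName>[\w.]+)__";

    // Skips tokens only on lines that start with '#'
    // (after optional spaces or tabs).
    public const string IgnoreCommentLinesOnly =
        @"(?<!^[ \t]*#.*)__(?<tokenName>[\w.]+)__";

    public static readonly string Sample = string.Join("\n", new[]
    {
        "#templateFile",
        "[Agent]",
        "Prop1 = \"__Data.Agent.Prop1__\"",
        "Prop2 = [__Data.Agent.Prop2__]",
        "#I'm a comment",
        "#Prop3 = \"__NotUsed__\"",
        "Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__",
        "Prop5 = [\"__Data.Agent.Prop5a__\",\"__Data.Agent.Prop5b__\"]",
    });

    public static string[] FindTokens(string text, string pattern)
    {
        // RegexOptions.Multiline makes '^' match at the start of every line,
        // which keeps the lookbehind confined to the current line.
        return Regex.Matches(text, pattern, RegexOptions.Multiline)
            .Cast<Match>()
            .Select(m => m.Groups["tokenName"].Value)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FindTokens(Sample, IgnoreAfterAnyHash)));
        Console.WriteLine(string.Join(", ", FindTokens(Sample, IgnoreCommentLinesOnly)));
    }
}
```

With the first pattern the token after the trailing comment on the Prop4 line is skipped; with the second it is reported as well.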

Can we use StringComparer to sort all kind of strings including special characters?

I have used StringComparer.Ordinal to sort a list of strings. It sorts the strings including special characters except \\.
Are there any other options to sort \\ without writing user-defined code?
Tried the following code:
var strings = new[] { "#a", "\\b", "c", "1" };
Array.Sort(strings, StringComparer.Ordinal);
I expect output as
#a \\b 1 c
but the actual output is
#a 1 c \\b
The code-point of # is 35, 1 is 49, \ is 92, a/b/c is 97/98/99
The output from:
var arr = new[] { "#a", "\\b", "c", "1" };
Array.Sort(arr, StringComparer.Ordinal);
Console.WriteLine(string.Join(" ", arr));
is:
#a 1 \b c
So... it is working as expected, sorting them by their ordinal values.

RegEx split string into words by space and containing chars

How can one perform this split with the Regex.Split(input, pattern) method?
This is a [normal string ] made up of # different types # of characters
Array of strings output:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
It should also keep the leading spaces, so I want to preserve everything: if the string contains 20 characters, the array elements should total 20 characters.
What I have tried:
Regex.Split(text, @"(?<=[ ]|# #)")
Regex.Split(text, @"(?<=[ ])(?<=# #")
I suggest matching, i.e. extracting words, not splitting:
string source = @"This is a [normal string ] made up of # different types # of characters";
// Three possibilities:
// - plain word [A-Za-z]+
// - # ... # quotation
// - [ ... ] quotation
string pattern = @"[A-Za-z]+|(#.*?#)|(\[.*?\])";
var words = Regex
.Matches(source, pattern)
.OfType<Match>()
.Select(match => match.Value)
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, words
.Select((w, i) => $"{i + 1}. {w}")));
Outcome:
1. This
2. is
3. a
4. [normal string ]
5. made
6. up
7. of
8. # different types #
9. of
10. characters
You may use
var res = Regex.Split(s, @"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
The (\[[^][]*]|#[^#]*#) part is a capturing group whose value is output to the resulting list along with the split items.
Pattern details
(\[[^][]*]|#[^#]*#) - Group 1: either of the two patterns:
\[[^][]*] - [, followed with 0+ chars other than [ and ] and then ]
#[^#]*# - a #, then 0+ chars other than # and then #
| - or
\s+ - 1+ whitespaces
C# demo:
var s = "This is a [normal string ] made up of # different types # of characters";
var results = Regex.Split(s, @"(\[[^][]*]|#[^#]*#)|\s+")
.Where(x => !string.IsNullOrEmpty(x));
Console.WriteLine(string.Join("\n", results));
Result:
This
is
a
[normal string ]
made
up
of
# different types #
of
characters
It would be easier using a matching approach; however, it can be done using negative lookaheads:
[ ](?![^\]\[]*\])(?![^#]*\#([^#]*\#{2})*[^#]*$)
matches a space not followed by
any character sequence except [ or ] followed by ]
# followed by an even number of #
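A small C# sketch of how this lookahead pattern might be used with Regex.Split on the sample sentence. The class and method names are hypothetical, and RegexOptions.ExplicitCapture is added here only as a precaution, so that the unnamed group inside the lookahead can never contribute extra items to the result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSplitDemo
{
    // Pattern from the answer: a space that is neither inside [ ... ]
    // nor inside a # ... # pair.
    private const string Pattern =
        @"[ ](?![^\]\[]*\])(?![^#]*\#([^#]*\#{2})*[^#]*$)";

    public static string[] SplitWords(string text)
    {
        // ExplicitCapture turns the unnamed group into a non-capturing one.
        return Regex.Split(text, Pattern, RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        string s = "This is a [normal string ] made up of # different types # of characters";
        foreach (string part in SplitWords(s))
        {
            Console.WriteLine(part);
        }
    }
}
```

Note that the separating spaces themselves are consumed by the split, so this variant does not preserve every character of the input.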

Regex for matching season and episode

I'm making a small app for myself, and I want to find strings that match a pattern, but I could not find the right regular expression.
Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi
That is an example of the strings I have, and I only want to know whether it contains a substring of the form S[NUMBER]E[NUMBER], with each number at most 2 digits long.
Can you give me a clue?
Regex
Here is the regex using named groups:
S(?<season>\d{1,2})E(?<episode>\d{1,2})
Usage
Then, you can get named groups (season and episode) like this:
string sample = "Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi";
Regex regex = new Regex(@"S(?<season>\d{1,2})E(?<episode>\d{1,2})");
Match match = regex.Match(sample);
if (match.Success)
{
string season = match.Groups["season"].Value;
string episode = match.Groups["episode"].Value;
Console.WriteLine("Season: " + season + ", Episode: " + episode);
}
else
{
Console.WriteLine("No match!");
}
Explanation of the regex
S // match 'S'
( // start of a capture group
?<season> // name of the capture group: season
\d{1,2} // match 1 to 2 digits
) // end of the capture group
E // match 'E'
( // start of a capture group
?<episode> // name of the capture group: episode
\d{1,2} // match 1 to 2 digits
) // end of the capture group
There's a great online test site here: http://gskinner.com/RegExr/
Using that, here's the regex you'd want:
S\d\dE\d\d
You can do lots of fancy tricks beyond that though!
Take a look at some of the media software like XBMC; they all have pretty robust regex filters for TV shows.
The regex I would put for S[NUMBER1]E[NUMBER2] is
S(\d\d?)E(\d\d?) // (\d\d?) means one or two digit
You can get NUMBER1 by <matchresult>.group(1), NUMBER2 by <matchresult>.group(2).
I would like to propose a slightly more complex regex. It does not handle ". : - _"
because I replace them with spaces first:
str_replace(
array('.', ':', '-', '_', '(', ')'), ' ',
This is the capture regex that splits title to title season and episode
(.*)\s(?:s?|se)(\d+)\s?(?:e|x|ep)\s?(\d+)
e.g. Da Vinci's Demons se02ep04 and variants
https://regex101.com/r/UKWzLr/3
The only case I can't cover is a space between the season marker and the number, because the letter s or se then becomes part of the title, which does not work for me. I haven't seen such a case, but it is still an issue.
Edit:
I managed to get around it with a second line
$title = $matches[1];
$title = preg_replace('/(\ss|\sse)$/i', '', $title);
This way I remove a trailing ' s' or ' se' from the title when the name is part of a series.

parse text into key/value pair or json

I have text in the following format, I was wondering what the best approach might be to create a user object from it with the fields as its properties.
I don't know regular expressions that well. I was looking at the string methods in C#, particularly IndexOf and LastIndexOf, but I think that would be too messy, as there are approximately 15 fields.
I am trying to do this in c sharp
Some characteristics:
The keys/fields are fixed and known beforehand, so I know that I have to look for things like title, company etc.
The address part is single-valued, and following that there are some multi-valued fields
The multi-valued fields may or may not end with a comma (,)
There are one or two line breaks between the fields, e.g. "country" is followed by 2 line breaks before we encounter "interest"
Title: Mr
Company: abc capital
Address1: 42 mystery lane
Zip: 112312
Country: Ireland
Interest: Biking, Swimming, Hiking,
Topic of Interest: Europe, Asia, Capital
This will split the data up into key/value pairs and store them in a dictionary. You may have to modify it further for additional requirements.
var dictionary = data
.Split(
new[] {"\r\n"},
StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Split(':'))
.ToDictionary(
k => k[0].Trim(),
v => v[1].Trim());
I'd probably go with something like this:
private Dictionary<string, IEnumerable<string>> ParseValues(string providedValues)
{
Dictionary<string, IEnumerable<string>> parsedValues = new Dictionary<string, IEnumerable<string>>();
string[] lines = providedValues.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries); //Your newline character here might differ, being '\r', '\n', '\r\n'...
foreach (string line in lines)
{
string[] lineSplit = line.Split(':');
string key = lineSplit[0].Trim();
IEnumerable<string> values = lineSplit[1].Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries).Select(x => x.Trim()); //Removing empty entries here will ensure you don't get an empty for the "Interest" line, where you have 'Hiking' followed by a comma, followed by nothing else
parsedValues.Add(key, values);
}
return parsedValues;
}
or if you subscribe to the notion that readability and maintainability are not as cool as a great big chain of calls:
private static Dictionary<string, IEnumerable<string>> ParseValues(string providedValues)
{
return providedValues.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Select(x => x.Split(':')).ToDictionary(key => key[0].Trim(), value => value[1].Split(new char[]{ ','}, StringSplitOptions.RemoveEmptyEntries).Select(x => x.Trim()));
}
I strongly recommend getting more familiar with regular expressions for cases like this. Parsing semi-structured text with them is easy and logical.
For example, this (the patterns below are just variants; there are many ways to do it depending on what you need):
title:\s*(.*)\s+comp.*?:\s*(.*)\s+addr.*?:\s*(.*)\s+zip:\s*(.*)\s+country:\s*(.*)\s+inter.*?:\s*(.*)\s+topic.*?:\s*(.*)
gives result
1. Mr
2. abc capital
3. 42 mystery lane
4. 112312
5. Ireland
6. Biking, Swimming, Hiking,
7. Europe, Asia, Capital
or - more open to anything:
\s(.*?):\s(.*)
parses your input into nice groups like this:
Match 1
1. Title
2. Mr
Match 2
1. Company
2. abc capital
Match 3
1. Address1
2. 42 mystery lane
Match 4
1. Zip
2. 112312
Match 5
1. Country
2. Ireland
Match 6
1. Interest
2. Biking, Swimming, Hiking,
Match 7
1. Topic of Interest
2. Europe, Asia, Capital
I am not familiar with C# (and its regex dialect); I just wanted to spark your interest ...
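For readers who want to try the "open to anything" idea in C#, here is a minimal sketch. It does not use the exact pattern above: the regex is an adapted, line-anchored variant (an assumption on my part) that does not need a whitespace character before the first key and does not pick up a trailing \r. The names UserFieldParserDemo and Parse are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UserFieldParserDemo
{
    // Adapted pattern (not the answer's exact regex): the key is everything
    // up to the first ':' on a line, the value is the rest of that line.
    private static readonly Regex Line = new Regex(
        @"^(?<key>[^:\r\n]+):[ \t]*(?<value>[^\r\n]*)",
        RegexOptions.Multiline);

    public static Dictionary<string, string[]> Parse(string text)
    {
        var result = new Dictionary<string, string[]>();
        foreach (Match m in Line.Matches(text))
        {
            // Split multi-valued fields on ',' and drop the empty entry
            // left behind by a trailing comma.
            string[] values = m.Groups["value"].Value
                .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(v => v.Trim())
                .Where(v => v.Length > 0)
                .ToArray();
            result[m.Groups["key"].Value.Trim()] = values;
        }
        return result;
    }

    public static void Main()
    {
        string data =
            "Title: Mr\nCompany: abc capital\nAddress1: 42 mystery lane\n" +
            "Zip: 112312\nCountry: Ireland\n\n" +
            "Interest: Biking, Swimming, Hiking,\n\n" +
            "Topic of Interest: Europe, Asia, Capital";

        foreach (var field in Parse(data))
        {
            Console.WriteLine(field.Key + " = [" + string.Join(" | ", field.Value) + "]");
        }
    }
}
```

Because the keys are known beforehand, the resulting dictionary can then be mapped onto the properties of a user object by name.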
