C# RegEx.Split delimiter followed by specific words

I am trying to split using Regex.Split strings like this one:
string criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
We have the following 'reserved words': NAME, COURSE, TEACHER, SCHEDULE, CAMPUS. It is required to split the original string into:
NAME='Eduard O' Brian'
COURSE='Math II'
TEACHER = 'Chris Young'
SCHEDULE='3'
CAMPUS='C-1'
The split criterion is: a single quote, followed by one or more spaces, followed by a 'reserved word'.
The closest expression I have come up with is:
var match = Regex.Split(criteria, @"'[\s+]([NAME]|[COURSE]|[TEACHER]|[SCHEDULE]|[CAMPUS])", RegexOptions.CultureInvariant);
This is the complete source code:
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
            var match = Regex.Split(criteria, @"'[\s+]([NAME]|[COURSE]|[TEACHER]|[SCHEDULE]|[CAMPUS])", RegexOptions.CultureInvariant);
            foreach (var item in match)
                Console.WriteLine(item.ToString());
            Console.Read();
        }
    }
}
My code is doing this:
NAME='Eduard O' Brian' COURSE='Math II
T
EACHER = 'Chris Young
S
CHEDULE='3
C
AMPUS='C-1
It drops the closing single quote and keeps only the first letter of the reserved word. COURSE, which is preceded by more than one space in this sample, is not split at all.
Thanks in advance!

You can simply split on one or more whitespace characters that are followed by one of your reserved words and then an = sign:
var results = Regex.Split(s, @"\s+(?=(?:NAME|COURSE|TEACHER|SCHEDULE|CAMPUS)\s*=)");
Pattern details
\s+ - 1 or more whitespace chars
(?= - start of a positive lookahead that, immediately to the right of the current location, requires the following text:
(?:NAME|COURSE|TEACHER|SCHEDULE|CAMPUS) - any of the alternative literal texts
\s* - 0 or more whitespace chars (as there can be space(s) between reserved words and =)
= - an equal sign
) - end of the lookahead.
C# demo:
var criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
var match = Regex.Split(criteria, @"\s+(?=(?:NAME|COURSE|TEACHER|SCHEDULE|CAMPUS)\s*=)");
Console.WriteLine(string.Join("\n", match));
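If the set of reserved words is not fixed, the same lookahead can be assembled from a list instead of being hard-coded. The following is a minimal sketch under that assumption; the CriteriaSplitter class and the SplitCriteria method are hypothetical names introduced for illustration, and trimming the input first is my own addition to avoid a trailing space on the last item.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CriteriaSplitter
{
    // Hypothetical helper: splits on whitespace that precedes "<reserved word> =".
    public static string[] SplitCriteria(string criteria, params string[] reservedWords)
    {
        // Regex.Escape protects against reserved words containing regex metacharacters.
        string alternation = string.Join("|", reservedWords.Select(Regex.Escape));
        string pattern = @"\s+(?=(?:" + alternation + @")\s*=)";
        // Trim so the last item does not keep the trailing space of the input.
        return Regex.Split(criteria.Trim(), pattern);
    }

    public static void Main()
    {
        string criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
        foreach (string part in SplitCriteria(criteria, "NAME", "COURSE", "TEACHER", "SCHEDULE", "CAMPUS"))
            Console.WriteLine(part);
    }
}
```

Because the whitespace is consumed by the split and the reserved word sits inside a lookahead, neither the closing quote nor any letter of the reserved word is lost.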

Related

c# Regex of value after certain words

I have a regex question. I have a string that looks like:
Slot:0 Module:No module in slot
What I need is a regex that will get the values after Slot and Module. Slot will always be a number, but I have a problem with Module (it can be several words separated by spaces). I tried:
var pattern = "(?<=:)[a-zA-Z0-9]+";
foreach (string config in backplaneConfig)
{
    List<string> values = Regex.Matches(config, pattern).Cast<Match>().Select(x => x.Value).ToList();
    modulesInfo.Add(new ModuleIdentyfication { ModuleSlot = Convert.ToInt32(values.First()), ModuleType = values.Last() });
}
The slot part works, but the module part works only when it is a single word with no spaces; in my example it gives me only "No". Is there a way to do that?

You may use a regex to capture the necessary details in the input string:
var pattern = @"Slot:(\d+)\s*Module:(.+)";
foreach (string config in backplaneConfig)
{
    var values = Regex.Match(config, pattern);
    if (values.Success)
    {
        modulesInfo.Add(new ModuleIdentyfication { ModuleSlot = Convert.ToInt32(values.Groups[1].Value), ModuleType = values.Groups[2].Value });
    }
}
Group 1 is the ModuleSlot and Group 2 is the ModuleType.
Details
Slot: - literal text
(\d+) - Capturing group 1: one or more digits
\s* - 0+ whitespaces
Module: - literal text
(.+) - Capturing group 2: the rest of the string to the end.

The simplest way would be to add a space to the character class in your pattern:
var pattern = "(?<=:)[a-zA-Z0-9 ]+";
But the best solution is probably the answer from @Wiktor Stribiżew.

Another option is to match either 1+ digits followed by a word boundary or match a repeating pattern using your character class but starting with [a-zA-Z]
(?<=:)(?:\d+\b|[a-zA-Z][a-zA-Z0-9]*(?: [a-zA-Z0-9]+)*)
(?<=:) Assert a : on the left
(?: Non capturing group
\d+\b Match 1+ digits followed by a word boundary
| Or
[a-zA-Z][a-zA-Z0-9]* Start a match with a-zA-Z
(?: [a-zA-Z0-9]+)* Optionally repeat a space and what is listed in the character class
) Close non-capturing group
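To show how this pattern could be applied from C#, here is a small sketch using Regex.Matches on the sample line from the question. The SlotModuleValues class and Extract method are hypothetical names; the pattern itself is the one described above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SlotModuleValues
{
    // Either digits ending at a word boundary, or words separated by single spaces.
    const string Pattern = @"(?<=:)(?:\d+\b|[a-zA-Z][a-zA-Z0-9]*(?: [a-zA-Z0-9]+)*)";

    // Hypothetical helper: returns every value that directly follows a colon.
    public static string[] Extract(string config)
    {
        return Regex.Matches(config, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string[] values = Extract("Slot:0 Module:No module in slot");
        Console.WriteLine(string.Join(" | ", values)); // 0 | No module in slot
    }
}
```

This relies on the slot value being numeric, as stated in the question; a slot value made of letters would be matched by the second alternative and could run on into the following word.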

Please replace your pattern with this:
// regular exp.
(\d+)\s*(.+)

You don't need regex for such simple parsing. Try this:
var str = "Slot:0 Module:No module in slot";
var values = str.Split(new string[] { "Slot:", "Module:" }, StringSplitOptions.RemoveEmptyEntries)
    .Select(s => s.Trim())
    .ToList(); // "0", "No module in slot"

RegEx for capturing a word in between = and ;

I want to select word2 from the following :
word2;word3
word2 is the text between the start of the line and the ;, unless there is a = in between. In that case, I want to start from the = instead of the start of the line,
like word2 from
word1=word2;word3
I have tried using this regex
(?<=\=|^).*?(?=;)
which selects word2 from
word2;word3
but also the whole word1=word2 from
word1=word2;word3

You can use an optional group to check for a word followed by an equals sign and capture the value in the first capturing group:
^(?:\w+=)?(\w+);
Explanation
^ Start of string
(?:\w+=)? Optional non capturing group matching 1+ word chars followed by =
(\w+) Capture in the first capturing group 1+ word chars
; Match ;
In .NET you might also use:
(?<=^(?:\w+=)?)\w+(?=;)
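As a rough illustration of both patterns in C#, the sketch below reads the capture group for the first pattern and the whole match for the lookbehind variant. The Word2Extractor class and its method names are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Word2Extractor
{
    // Capture-group variant: the wanted word is in group 1.
    public static string WithGroup(string line)
    {
        Match m = Regex.Match(line, @"^(?:\w+=)?(\w+);");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Lookaround variant (.NET supports variable-length lookbehind):
    // the wanted word is the whole match.
    public static string WithLookbehind(string line)
    {
        Match m = Regex.Match(line, @"(?<=^(?:\w+=)?)\w+(?=;)");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        foreach (string line in new[] { "word2;word3", "word1=word2;word3" })
        {
            Console.WriteLine(WithGroup(line));      // word2
            Console.WriteLine(WithLookbehind(line)); // word2
        }
    }
}
```

Both helpers return null when the line has no ; at all, so callers should check for that case.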

There are many ways to do this, and a regular expression is probably among the last ones to reach for.
But if we wish to use an expression for this problem, let's start with a simple one and explore other options, maybe something similar to:
(.+=)?(.+?);
or
(.+=)?(.+?)(?:;.+)
where the second capturing group has our desired word2.
Example 1
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"(.+=)?(.+?);";
        string input = @"word1=word2;word3
word2;word3";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            // Group 2 holds the desired word2.
            Console.WriteLine("'{0}' found at index {1}.", m.Groups[2].Value, m.Groups[2].Index);
        }
    }
}
Example 2
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"(.+=)?(.+?)(?:;.+)";
        string substitution = @"$2";
        string input = @"word1=word2;word3
word2;word3";
        RegexOptions options = RegexOptions.Multiline;
        Regex regex = new Regex(pattern, options);
        // Each matched line is replaced by the content of group 2.
        string result = regex.Replace(input, substitution);
        Console.WriteLine(result);
    }
}

Instead of using regular expressions you can solve the problem with String class methods.
string[] words = str.Split(';');
string word2 = words[0].Substring(words[0].IndexOf('=') + 1);
The first line splits the input at ';'. Assuming there is a single ';', this gives two strings. The second line returns the substring of the first part (words[0]) that starts one character after the first occurrence of '=' (words[0].IndexOf('=') + 1) and runs to the end. If the line has no '=' at all, IndexOf returns -1, so the substring starts at index 0, the beginning of the string.
Related documentation:
https://learn.microsoft.com/en-us/dotnet/api/system.string.split?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.substring?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.indexof?view=netframework-4.8

How to make regex only match with patterns that have exactly one letter before a =

I am trying to get the regex to match only when there is exactly one letter from A-Z followed by a =, like A=, a=, B=. Currently it picks up any number of letters before the =, like hem=, ac2=. Usually ^[a-zA-Z] works just fine, but it's not working in this case since I'm using named capture groups.
String pattern = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
var regex = new Regex("(?<label>[a-zA-Z]+)=(?<value>[^,]+)");
Other ways I've tried
var regex = new Regex("(?<label>^[a-zA-Z]+)=(?<value>[^,]+)");
var regex = new Regex("(?<label>[^a-zA-Z]+)=(?<value>[^,]+)");

If you want to match l= but not word=, you need a negative look-behind assertion.
new Regex("(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)")

If the string pattern you have in your question is really the "haystack" in which you're looking for "needles", a really easy way to solve the problem would be to first split the string on ,, then use RegEx. Then you can use a simpler pattern ^(?<label>[a-zA-Z])=(?<value>.+)$ on each item in the list you get from splitting the string, and only keep the matches.
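A minimal sketch of this split-then-match idea, assuming the sample string from the question is the input. LabelsBySplit and Parse are hypothetical names, and collecting the results in a Dictionary is my own choice for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelsBySplit
{
    // Anchored pattern applied to one comma-separated item at a time.
    static readonly Regex Item = new Regex(@"^(?<label>[a-zA-Z])=(?<value>.+)$");

    // Hypothetical helper: keeps only items of the form <single letter>=<value>.
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>(); // keys are case-sensitive
        foreach (string item in input.Split(','))
        {
            Match m = Item.Match(item);
            if (m.Success)
                result[m.Groups["label"].Value] = m.Groups["value"].Value;
        }
        return result;
    }

    public static void Main()
    {
        string input = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
        foreach (var pair in Parse(input))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```

Items such as hem=0.5 are rejected because the ^ anchor leaves room for exactly one letter before the =.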

It's because you have a + after [a-zA-Z], which makes it match one or more characters in that character class. If you remove the +, it will only match one character before the =.
If you want it to only match in situations where there is exactly one alphabetical character before the equals sign, you will want to add to the beginning of the regex to make sure that the character before the letter you want to match is not a letter, like this:
(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)
(notice though that this only matters in the case where you don't put a ^ before [a-zA-Z], in the case where you want matches that don't start at the beginning of a line)
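For comparison, here is a short sketch that runs the negative lookbehind pattern over the whole sample string without splitting it first. LabelsByLookbehind and Find are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelsByLookbehind
{
    // (?<![a-zA-Z]) rejects a label letter that is preceded by another letter.
    static readonly Regex Pattern =
        new Regex(@"(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)");

    // Hypothetical helper: returns label/value pairs in order of appearance.
    public static List<KeyValuePair<string, string>> Find(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(input))
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["label"].Value, m.Groups["value"].Value));
        return pairs;
    }

    public static void Main()
    {
        string input = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
        foreach (var pair in Find(input))
            Console.WriteLine(pair.Key + "=" + pair.Value);
    }
}
```

In hem=0.5 the only letter directly before the = is m, and it is preceded by e, so the lookbehind fails and the item is skipped.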

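Have you tried
var regex = new Regex("(?<label>^[a-zA-Z]?)=(?<value>[^,]+)");
"+" means 1 or more and "?" means 0 or 1, so "?" also accepts a bare = with no letter before it.
Exactly 1 is {1}, which .NET supports as well (it is equivalent to writing no quantifier at all):
var regex = new Regex("(?<label>^[a-zA-Z]{1})=(?<value>[^,]+)");
Note that ^ anchors the label to the start of the string, so both patterns only match when the letter is the very first character.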

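Assuming that each label is preceded by a comma or starts the string (which seems to be the case based on your example and code), you can use the following. The alternation is wrapped in a group so that it applies only to the prefix rather than to the whole pattern:
(?:^|,)(?<label>[A-Za-z])=(?<value>[^,]+)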

I recommend Regex.Matches over capture groups here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace Rextester
{
    public class Program
    {
        public static void Main(string[] args)
        {
            string content = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
            // Inside a character class | is a literal, so it is not needed as a separator.
            const string regexPattern = "(?<=[, ])[a-zA-Z]=[0-9.-]+";
            string singleMatch = new Regex(regexPattern).Match(content).ToString();
            Console.WriteLine(singleMatch); // a=1.875
            MatchCollection matchList = Regex.Matches(content, regexPattern);
            var matches = matchList.Cast<Match>().Select(match => match.Value).ToList();
            Console.WriteLine(string.Join(", ", matches)); // a=1.875, A=90.0, b=3.625, B=95.0, c=1.375, C=175.0, d=2.5
        }
    }
}

Simple regex-matching

I have a String
String test = @"Lists/Versions/2_.000";
I'm a bit confused about how to use regex to do this.
I'm using the pattern
String pattern = @"\D+";
The MSDN page for regular expressions says \D "Matches any character other than a decimal digit".
So shouldn't it be returning 'Lists/Versions/', '2'?
However, it's returning
'', '2', '000'
I would like the pattern to match only the 2 (or any integer). How would I do that?
String url = @"Lists/Versions/2_.000";
String pattern = @"\D+";
string[] substrings = Regex.Split(url, pattern);
foreach (string match in substrings)
{
    Console.WriteLine("'{0}'", match);
}

The reason you're seeing this is that \D matches non-digits, and Regex.Split treats every run of them as a delimiter. The string therefore yields two separate numeric values (2 and 000) because of the _. between them, plus an empty string in front of the leading delimiter Lists/Versions/. So you have a couple of choices:
Break the string into manageable portions, then index into the resulting array.
Build a better pattern to separate.
So the question is: what are you trying to parse? 2.00? Or are you trying to separate the numbers in your string?
I'm also assuming you may have a typo, so for reference:
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\w Matches any word character including underscore. Equivalent to "[A-Za-z0-9_]".
\W Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
You should be able to do the following:
string url = @"Lists/Versions/2_.000";
var data = Regex.Split(url, @"\D+");
// data[0] is the empty string before the leading run of non-digits.
Console.WriteLine(@"Value: {0} and Secondary Value: {1}", data[1], data[2]);
That splits on every run of non-digits, so after the leading empty string the array holds:
2
000
The result comes back as a normal string[]. My syntax or expression may be off, so a regular expression cheat sheet is worth consulting. You'll also want to check the bounds of the array before indexing into it.

https://dotnetfiddle.net/BU6gp2
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        String url = @"Lists/Versions/2_.000";
        String pattern = @"\D+";
        string[] substrings = Regex.Split(url, pattern);
        Console.WriteLine("'{0}'", substrings[1]);
    }
}

Please try the following:
// using System.Linq;
String url = @"Lists/Versions/2_.000";
String pattern = @"(?<=/)\d+";
string[] substrings = Regex.Matches(url, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
foreach (string match in substrings)
{
    Console.WriteLine("'{0}'", match);
}
Alternatively, if you don't need an array:
String url = @"Lists/Versions/2_.000";
String pattern = @"(?<=/)\d+";
Console.WriteLine("'{0}'", Regex.Match(url, pattern).Value);

Multiline Regex matches first occurrence but can't match second

I have a string in the format below. (I added the markers to get the newlines to show up correctly)
-- START BELOW THIS LINE --
2013-08-28 00:00:00 - Tom Smith (Work notes)
Blah blah
b;lah blah
2013-08-27 00:00:00 - Tom Smith (Work notes)
ZXcZXCZXCZX
ZXcZXCZX
ZXCZXcZXc
ZXCZXC
-- END ABOVE THIS LINE --
I am trying to get a regular expression that will allow me to extract the information from the two separate parts of the string.
The following expression matches the first portion successfully:
^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)(?=\n\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)
I am trying to figure out a way that I can modify it to get the second part of the string. I have tried things like what is below, but it ends up extending the match all the way to the end of the string. It is like it is giving preference to the expression following the OR.
^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)(?:(?=\n\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)|\n\\Z)
Any help would be appreciated
-- EDIT --
Here is a copy of the test program I created to try and get this correct. I also added a 3rd message and my RegEx above breaks in that case.
using System;
using System.Text.RegularExpressions;
namespace RegExTest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
            var regex = new Regex ("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*) \\(Work notes\\)\n([\\w\\W]*)\n\n(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} - .* \\(Work notes\\)\n)",RegexOptions.Multiline);
            foreach (Match match in regex.Matches(str))
            {
                if (match.Success)
                {
                    for (var i = 0; i < match.Groups.Count; i++)
                    {
                        Console.WriteLine('>' + match.Groups[i].Value);
                    }
                }
            }
            Console.ReadKey();
        }
    }
}
-- EDIT --
Just to make it clear, the data I am trying to extract is the Date and Timestamp (as one item), the name, and the "body" from each "paragraph".

This is a pretty beefy piece of regex you've got here.
While you can do regex over multiple lines, it just complicates things. Additionally, because you have repetitive patterns, it would be cleaner to split your string on the newline character, and then just match each line.
Eventually, if you intend to ingest this from a file, it will be easy to match each line of the file, rather than reading in the whole file and then matching.
Here's what I would do:
var regex = new Regex ("(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) - (.*?) \\(Work notes\\)");
var lines = str.Split(new char[] {'\n'});
foreach (var line in lines)
{
    var match = regex.Match(line);
    if (match.Success)
    {
        for (var i = 0; i < match.Groups.Count; i++)
        {
            Console.WriteLine('>' + match.Groups[i].Value);
        }
        // will preface the body after each header
        Console.WriteLine(">");
    }
    else
    {
        Console.WriteLine(line);
    }
}
As far as your regex goes, I maintained the original groups you had, so we get the Date/timestamp in one group, and the name in the other. The body does not get matched to a group, but it would be trivial to construct a string that is the body.
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Matching Group 1.
- Matched, but not grouped.
(.*?) Matching Group 2.
\(Work notes\) Matched, but not grouped.
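One way the body could be collected with this line-by-line approach is sketched below. It is an illustration under my own assumptions, not the answer's code: the WorkNote class, the WorkNotesByLine class and the Parse method are hypothetical, and the header pattern is anchored with ^ and \s*$ so that only complete header lines count.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed entry.
public class WorkNote
{
    public string Timestamp;
    public string Name;
    public string Body;
}

public static class WorkNotesByLine
{
    static readonly Regex Header = new Regex(
        @"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\s*$");

    public static List<WorkNote> Parse(string text)
    {
        var notes = new List<WorkNote>();
        var body = new List<string>();
        WorkNote current = null;

        foreach (string line in text.Split('\n'))
        {
            Match m = Header.Match(line);
            if (m.Success)
            {
                // A new header closes the body of the previous entry.
                if (current != null) current.Body = string.Join("\n", body).Trim();
                current = new WorkNote { Timestamp = m.Groups[1].Value, Name = m.Groups[2].Value };
                notes.Add(current);
                body.Clear();
            }
            else if (current != null)
            {
                body.Add(line); // any non-header line belongs to the current body
            }
        }
        if (current != null) current.Body = string.Join("\n", body).Trim();
        return notes;
    }

    public static void Main()
    {
        string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
        foreach (WorkNote note in Parse(str))
            Console.WriteLine("[{0}] {1}: {2}", note.Timestamp, note.Name, note.Body);
    }
}
```

Trimming the joined body removes the blank separator lines between entries; drop the Trim call if leading or trailing whitespace in the body matters.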

Regex is not really the right solution for this, but if you must...
Your problem is a combination of regex greediness and starting the match with ^. A leading ^ anchors the match to the start of the string (or to the start of a line under RegexOptions.Multiline), so it won't match anywhere else.
The greediness of .* can be fixed by making it .*? instead.
Try this:
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\n([\w\W]*?)((?=\n\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - .*? \(Work notes\)\n)|((\s{0,})$))
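The sketch below shows the same idea (lazy quantifiers plus a lookahead that stops at the next header or at the end of the input) with a slightly simplified pattern of my own, written as a verbatim string. It is not the exact expression above; the WorkNotesByRegex class and Parse method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WorkNotesByRegex
{
    // Body is lazy and ends before "\n\n<next timestamp> - " or before trailing whitespace at the very end.
    static readonly Regex Entry = new Regex(
        @"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.*?) \(Work notes\)\n([\s\S]*?)" +
        @"(?=\n\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - |\s*\z)");

    // Hypothetical helper: each result is { timestamp, name, body }.
    public static List<string[]> Parse(string text)
    {
        var entries = new List<string[]>();
        foreach (Match m in Entry.Matches(text))
            entries.Add(new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value });
        return entries;
    }

    public static void Main()
    {
        string str = "2013-08-28 10:50:13 - Tom Smith (Work notes)\nWhat's up? \nHow you been?\n\n2013-08-19 10:21:03 - Tom Smith (Work notes)\nWork Notes\n\n2013-08-19 10:10:48 - Tom Smith (Work notes)\nGood day\n\n";
        foreach (string[] entry in Parse(str))
            Console.WriteLine("[{0}] {1}: {2}", entry[0], entry[1], entry[2]);
    }
}
```

The \s*\z alternative is what lets the last entry match, since no further header follows it. The sketch assumes \n line endings as in the test string; input with \r\n would need the pattern adjusted.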

I was able to get an expression working, but it looks a bit scary I guess:
@"([0-9\s:-]+)(?>\s-\s)(?>[^\n\r]+[\r\n]*)((?=[^0-9]+(\d{4}-\d{2}-\d{2}|$))[\s\S])+"
The @ before the expression makes this a verbatim string, so you won't have to double-escape everything.
Note: This is by no means the right way to go about doing this, but I wanted to try out anyway.
