RegEx for capturing a word in between = and ; - C#

I want to select word2 from the following:
word2;word3
That is, word2 is the text between the start of the line and the ;, unless there is a = in between. In that case, I want to start from the = instead of the start of the line, like word2 from
word1=word2;word3
I have tried using this regex:
(?<=\=|^).*?(?=;)
which selects word2 from
word2;word3
but also the whole word1=word2 from
word1=word2;word3

You can use an optional group to check for a word followed by an equals sign and capture the value in the first capturing group:
^(?:\w+=)?(\w+);
Explanation
^ Start of string
(?:\w+=)? Optional non capturing group matching 1+ word chars followed by =
(\w+) Capture in the first capturing group 1+ word chars
; Match ;
In .NET you might also use:
(?<=^(?:\w+=)?)\w+(?=;)
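As a minimal sketch of how the first pattern above could be used from C#, the helper below returns the contents of capturing group 1. The class name WordBetween and the method name Extract are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBetween
{
    // Optional "name=" prefix, then the wanted word in group 1, then ';'.
    private static readonly Regex Pattern = new Regex(@"^(?:\w+=)?(\w+);");

    // Returns the captured word, or null when the line does not match.
    public static string Extract(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling WordBetween.Extract on either word1=word2;word3 or word2;word3 yields word2, because the prefix group is optional and not captured.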

There are many ways to do this, and regular expressions are probably among the last options to reach for.
But, if we wish to use an expression for this problem, let's start with a simple one and explore other options, maybe something similar to:
(.+=)?(.+?);
or
(.+=)?(.+?)(?:;.+)
where the second capturing group has our desired word2.
Example 1
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"(.+=)?(.+?);";
        // Two input lines: one with and one without the "word1=" prefix.
        string input = "word1=word2;word3\nword2;word3";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            // Group 2 holds the desired word; m.Value would also include the prefix and the ';'.
            Console.WriteLine("'{0}' found at index {1}.", m.Groups[2].Value, m.Groups[2].Index);
        }
    }
}
Example 2
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"(.+=)?(.+?)(?:;.+)";
        string substitution = @"$2";
        string input = "word1=word2;word3\nword2;word3";
        RegexOptions options = RegexOptions.Multiline;
        Regex regex = new Regex(pattern, options);
        // Each matched line is replaced by the contents of its second group.
        string result = regex.Replace(input, substitution);
        Console.WriteLine(result);
    }
}

Instead of using regular expressions, you can solve the problem with String class methods.
string[] words = str.Split(';');
string word2 = words[0].Substring(words[0].IndexOf('=') + 1);
The first line splits the input at ';'. Assuming there is only a single ';', this splits your line into two strings. The second line returns the substring of the first part (words[0]) that starts one character after the first occurrence of '=' (words[0].IndexOf('=') + 1) and runs to the end. If your line doesn't contain any '=' characters, the substring starts from the beginning, because IndexOf returns -1 and adding 1 gives index 0.
Related documentation:
https://learn.microsoft.com/en-us/dotnet/api/system.string.split?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.substring?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.indexof?view=netframework-4.8

Related

Regex match pattern plus rest of the string until next dot, comma or space

Let's say I have a string WORK-232-3213-2323. Known possible case scenarios:
WORK-232-3213-2323, some text
WORK-232-3213-2323. some text
WORK-232-3213-2323.xlsx
WORK-232-3213-2323 some text
WORK-232-3213-2323/some text
Format WORK-232-3213-2323-some text may also occur, but there is no need to handle this case
My current regex is able to catch needed strings with WORK-232-3213-2323 pattern, but as an output I get -232-3213-2323. How to make it so that it would catch WORK- in string plus rest of the text until next whitespace, dot, slash or comma?
Current regex: WORK-(.*?)[\s]
C#:
Regex pattern = new Regex(@"WORK-(.*?)[\s]");
string result = pattern.Match(myString).Groups[1].Value;

You might use a match without a capture group and a negated character class that excludes a comma, dot or whitespace char.
\bWORK-[^.,\s]+
\bWORK- Match WORK- preceded by a word boundary to prevent a partial match
[^.,\s]+ Negated character class to match 1+ times any char except . , or a whitespace char
string[] strings = {
    "WORK-232-3213-2323, some text",
    "WORK-232-3213-2323. some text",
    "WORK-232-3213-2323.xlsx",
    "WORK-232-3213-2323 some text",
    "WORK-232-3213-2323/some text"
};
string pattern = @"\bWORK-[^.,\s]+";
foreach (String s in strings) {
    Console.WriteLine(Regex.Match(s, pattern).Value);
}
Output
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323/some
If you don't want to match the last line, you could use a capture group and match a . , or whitespace char after it:
\b(WORK-[^.,\s\/]+)[.,\s]
For example, using the same example strings:
string pattern = @"\b(WORK-[^.,\s\/]+)[.,\s]";
foreach (String s in strings) {
    Console.WriteLine(Regex.Match(s, pattern).Groups[1].Value);
}
Output
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
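Since the question asks to stop at whitespace, a dot, a slash or a comma, one possible variation is to add / to the negated character class so that no capture group is needed. The sketch below assumes that reading of the requirement; the names TicketFinder and Find are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TicketFinder
{
    // "WORK-" at a word boundary, then everything up to the next
    // dot, comma, whitespace or slash.
    private static readonly Regex Pattern = new Regex(@"\bWORK-[^.,\s/]+");

    // Returns the first identifier found, or null when there is none.
    public static string Find(string text)
    {
        Match m = Pattern.Match(text);
        return m.Success ? m.Value : null;
    }
}
```

With this variant, the input WORK-232-3213-2323/some text yields WORK-232-3213-2323 rather than an empty result, and a string such as REWORK-1 yields null because of the word boundary.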

It looks to me like you can use the following pattern to handle all your cases, including the one that may occur:
\bWORK(?:-[0-9]+)+
I'm no hero in C#, so I used some code I could find to test this:
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        var s = @"WORK-232-3213-2323, some text";
        var pattern = @"\bWORK(?:-[0-9]+)+";
        Regex r = new Regex(pattern);
        Match m = r.Match(s);
        if (m.Success)
        {
            Console.WriteLine(m.Value);
        }
    }
}
Alternatively, you can use \bWORK(?:-\d+)+ together with the ECMAScript option, Regex r = new Regex(pattern, RegexOptions.ECMAScript);, so that \d matches only the ASCII digits 0-9.

Find pattern to solve regex in one step

I am having trouble finding a pattern that solves this in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want is: take up to 4x Text. If there are more than "4xText", take only the last character of each additional Text.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this I do a substitution with the first pattern.
New string: Text5$Text6
The second pattern is:
([^\$])\b
Result: 56
Combine both to get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to form one pattern. Is there something like an anchor that tells the engine to start matching from here, as it would if it were the only pattern?

You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, repeat 0-3 times: 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
Example
// Requires using System.Linq; and using System.Text.RegularExpressions;
string pattern = @"(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)";
string[] strings = {
    "Text1",
    "Text1$Text2$Text3",
    "Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
    Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56

I strongly believe a regular expression isn't the way to do that, mostly because of readability.
You may consider using a simple algorithm like this one to reach your goal:
using System;
public class Program
{
    public static void Main()
    {
        var input = "Text1$Text2$Text3$Text4$Text5$Text6";
        var parts = input.Split('$');
        var result = "";
        for (var i = 0; i < parts.Length; i++)
        {
            // Keep the first four parts whole (indexes 0-3); from the fifth part on,
            // drop the 4-character "Text" prefix and keep only what follows it.
            result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4));
        }
        Console.WriteLine(result);
    }
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
    public static void Main()
    {
        var input = "Text1$Text2$Text3$Text4$Text5$Text6";
        var parts = input.Split('$');
        var first4 = parts.Take(4);
        var remainings = parts.Skip(4);
        var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select(r => r.Substring(4)));
        Console.WriteLine(result2);
    }
}
It has to be adjusted to the actual needs, but the idea is there.
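One way the adjustment could look is sketched below. It assumes that "the last sign" means the last character of each additional part, and that inputs with four or fewer parts should come back unchanged, without a trailing separator. The names Shortener and KeepFour are hypothetical.

```csharp
using System;
using System.Linq;

public static class Shortener
{
    // Keeps the first four separator-delimited parts whole and reduces
    // every later part to its last character.
    public static string KeepFour(string input, char separator = '$')
    {
        string[] parts = input.Split(separator);
        string head = string.Join(separator.ToString(), parts.Take(4));
        if (parts.Length <= 4)
        {
            return head; // nothing to shorten, so no trailing separator
        }

        // Last character of each remaining non-empty part, concatenated.
        string tail = string.Concat(
            parts.Skip(4)
                 .Where(p => p.Length > 0)
                 .Select(p => p.Substring(p.Length - 1)));
        return head + separator + tail;
    }
}
```

Unlike Substring(4), taking the last character does not depend on every part starting with a four-character prefix such as Text.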

Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
    .Select(s => Regex.Replace(s,
        @"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
        (match) => match.Groups[1].Value + "$" + match.Groups[2].Value.Replace("Text", "").Replace("$", "")
    )).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
The solution uses the regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}){0,3} - match Text literally, then \d{1,3}, which is one to three digits; \$ matches $ literally
The rest is just repetition of it. Basically, the first group captures the first four pieces and the second group captures the rest, if any.
We also use a MatchEvaluator here, which is a delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such a method:
(match) => match.Groups[1].Value + "$" + match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: it takes the first capturing group and concatenates it with the second, removing the unnecessary text.

It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
// Requires: using System.Linq; using System.Text; using System.Text.RegularExpressions;
// and using static System.Console;
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
    for (int i = 1; i <= 6; i++)
    {
        string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
        WriteLine(KeepFour(text));
    }
}
private static string KeepFour(string text)
{
    Match match = regex1.Match(text);
    if (!match.Success)
    {
        return "[NO MATCH]";
    }
    StringBuilder result = new StringBuilder();
    result.Append(match.Groups[1].Value);
    if (match.Groups[2].Captures.Count > 0)
    {
        result.Append("&");
        // Have to iterate (join), because we don't want the whole match,
        // just the captured text.
        result.Append(JoinCaptures(match.Groups[2]));
    }
    return result.ToString();
}
private static string JoinCaptures(Group group)
{
    return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, and the second captures the digit of every later item. The code then extracts the captured text and composes the result from it.

C# RegEx.Split delimiter followed by specific words

I am trying to use Regex.Split to split strings like this one:
string criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
We have the following 'reserved words': NAME, COURSE, TEACHER, SCHEDULE, CAMPUS. It is required to split the original string into:
NAME='Eduard O' Brian'
COURSE='Math II'
TEACHER = 'Chris Young'
SCHEDULE='3'
CAMPUS='C-1'
The criterion for the split is: a single quote, followed by one or more spaces, followed by a 'reserved word'.
The closest expression I achieved is:
var match = Regex.Split(criteria, @"'[\s+]([NAME]|[COURSE]|[TEACHER]|[SCHEDULE]|[CAMPUS])", RegexOptions.CultureInvariant);
This is the complete source code:
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
            var match = Regex.Split(criteria, @"'[\s+]([NAME]|[COURSE]|[TEACHER]|[SCHEDULE]|[CAMPUS])", RegexOptions.CultureInvariant);
            foreach (var item in match)
                Console.WriteLine(item.ToString());
            Console.Read();
        }
    }
}
My code is doing this:
NAME='Eduard O' Brian' COURSE='Math II
T
EACHER = 'Chris Young
S
CHEDULE='3
C
AMPUS='C-1
It is deleting the last single quote and taking only the first letter of the reserved word. Also, COURSE in this sample is preceded by more than one space, and the split does not work for it.
Thanks in advance!

You may simply split on 1+ whitespace chars that are followed by one of your reserved words and then =:
var results = Regex.Split(s, @"\s+(?=(?:NAME|COURSE|TEACHER|SCHEDULE|CAMPUS)\s*=)");
Pattern details
\s+ - 1 or more whitespace chars
(?= - start of a positive lookahead that, immediately to the right of the current location, requires the following text:
(?:NAME|COURSE|TEACHER|SCHEDULE|CAMPUS) - any of the alternative literal texts
\s* - 0 or more whitespace chars (as there can be space(s) between reserved words and =)
= - an equal sign
) - end of the lookahead.
C# demo:
var criteria = "NAME='Eduard O' Brian' COURSE='Math II' TEACHER = 'Chris Young' SCHEDULE='3' CAMPUS='C-1' ";
var match = Regex.Split(criteria, @"\s+(?=(?:NAME|COURSE|TEACHER|SCHEDULE|CAMPUS)\s*=)");
Console.WriteLine(string.Join("\n", match));
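If the list of reserved words is not fixed, the same lookahead can be assembled at run time. The following is only a sketch under that assumption: the names CriteriaSplitter and Split are hypothetical, Regex.Escape guards against words containing regex metacharacters, and Trim() removes the trailing space that would otherwise stay on the last item.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CriteriaSplitter
{
    // Splits on whitespace that is followed by one of the reserved words,
    // optional whitespace, and an equals sign.
    public static string[] Split(string criteria, params string[] reservedWords)
    {
        string alternation = string.Join("|", reservedWords.Select(Regex.Escape));
        string pattern = @"\s+(?=(?:" + alternation + @")\s*=)";
        return Regex.Split(criteria.Trim(), pattern);
    }
}
```

Because the whitespace is consumed by the split while the reserved word sits inside a lookahead, each resulting item keeps its full keyword and its closing quote.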

How to make regex only match with patterns that have exactly one letter before a =

I am trying to get the regex to match only when there is exactly one letter from A-Z followed by a =, like A=, a=, B=. Currently it is picking up any number of letters before the =, like hem=, ac2=. Usually ^[a-zA-Z] works just fine, but it's not working in this case since I'm using named capture groups.
String pattern = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
var regex = new Regex("(?<label>[a-zA-Z]+)=(?<value>[^,]+)");
Other ways I've tried
var regex = new Regex("(?<label>^[a-zA-Z]+)=(?<value>[^,]+)");
var regex = new Regex("(?<label>[^a-zA-Z]+)=(?<value>[^,]+)");

If you want to match l= but not word=, you need a negative look-behind assertion.
new Regex("(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)")

If the string pattern you have in your question is really the "haystack" in which you're looking for "needles", a really easy way to solve the problem would be to first split the string on ,, then use RegEx. Then you can use a simpler pattern ^(?<label>[a-zA-Z])=(?<value>.+)$ on each item in the list you get from splitting the string, and only keep the matches.

It's because you have a + after [a-zA-Z], which makes it match one or more characters in that character class. If you remove the +, it will only match one character before the =.
If you want it to only match in situations where there is exactly one alphabetical character before the equals sign, you will want to add to the beginning of the regex to make sure that the character before the letter you want to match is not a letter, like this:
(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)
(notice though that this only matters in the case where you don't put a ^ before [a-zA-Z], in the case where you want matches that don't start at the beginning of a line)
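A minimal sketch of how the lookbehind pattern above might be applied to the whole input string, collecting every single-letter label with its value. The names LabelParser and Parse are hypothetical, and the use of a dictionary assumes each label occurs at most once (a repeated label would overwrite the earlier value).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelParser
{
    // A single letter that is not preceded by another letter, then '=',
    // then everything up to the next comma.
    private static readonly Regex Pattern =
        new Regex(@"(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)");

    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result[m.Groups["label"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

On the sample string from the question this picks up a, A, b, B, c, C and d, and skips hem=0.5 because the m is preceded by another letter.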

Have you tried
var regex = new Regex("(?<label>^[a-zA-Z]?)=(?<value>[^,]+)");
I believe the "+" means 1 or more
"?" means 0 or 1
or exactly 1 should be {1} (at least in python, not sure about C#)
var regex = new Regex("(?<label>^[a-zA-Z]{1})=(?<value>[^,]+)");

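Assuming that the label is preceded by a comma or the start of the string (which seems to be the case based on your example and code), you can use the pattern below. The alternation is wrapped in a non-capturing group so that it applies only to the prefix and not to the whole expression:
(?:^|,)(?<label>[A-Za-z])=(?<value>[^,]+)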

I recommend Regex.Matches over capture groups here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace Rextester
{
    public class Program
    {
        public static void Main(string[] args)
        {
            string content = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
            // Inside a character class | is a literal, so it is not needed as a separator.
            const string regexPattern = "(?<=[, ])[a-zA-Z]=[0-9.-]+";
            string singleMatch = new Regex(regexPattern).Match(content).ToString();
            Console.WriteLine(singleMatch); // a=1.875
            MatchCollection matchList = Regex.Matches(content, regexPattern);
            var matches = matchList.Cast<Match>().Select(match => match.Value).ToList();
            Console.WriteLine(string.Join(", ", matches)); // a=1.875, A=90.0, b=3.625, B=95.0, c=1.375, C=175.0, d=2.5
        }
    }
}

How to use (?!...) regex pattern to skip the whole unmatched part?

I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        var separator = "__";
        var pattern = String.Format("((?!{0}).)*", separator);
        var regex = new Regex(pattern);
        foreach (var item in regex.Matches("first__second"))
            Console.WriteLine(item);
    }
}
It works fine when the SEPARATOR is a single character, but when it is longer than 1 character I get an unexpected result. In the code above the second matched string is "_second" instead of "second". How should I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines where I should skip line separators inside quotes. My line separator is not a predefined value; it can be, for example, "\r\n".

You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string input = "plum--pear";
        string pattern = "-"; // Split on hyphens
        string[] substrings = Regex.Split(input, pattern);
        foreach (string match in substrings)
        {
            Console.WriteLine("'{0}'", match);
        }
    }
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'
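For a separator that is longer than one character and only known at run time, the same Regex.Split call can take the escaped separator as its pattern. This is a sketch under that assumption; the names SeparatorSplitter and Split are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorSplitter
{
    // Splits on the literal separator; Regex.Escape keeps characters such
    // as '.' or '|' in the separator from being read as regex syntax.
    public static string[] Split(string input, string separator)
    {
        return Regex.Split(input, Regex.Escape(separator));
    }
}
```

Splitting first__second on __ this way gives first and second, with the whole separator consumed rather than only its first character.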

The .NET regex does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use the (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. You could use PCRE.NET, but .NET regex can usually handle those scenarios well with Regex.Split.
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, @"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
It matches 1+ (due to the final +) occurrences of either ", 0+ chars other than " and then " (the "[^"]*" branch), or (|) any char other than CR, LF and " (see [^\r\n"]).
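The line-matching pattern above can be wrapped in a small helper to see its effect on input that has a line break inside quotes. This is only a sketch: the names QuotedLineSplitter and SplitLines are hypothetical, and it assumes quotes are always balanced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedLineSplitter
{
    // A "line" is a run of quoted sections (which may contain CR/LF)
    // and unquoted characters other than CR, LF and the quote itself.
    private static readonly Regex LinePattern =
        new Regex("(?:\"[^\"]*\"|[^\r\n\"])+");

    public static List<string> SplitLines(string text)
    {
        return LinePattern.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToList();
    }
}
```

Given the text a,"b followed by a CR LF, then c" and another CR LF, then d, the helper returns two lines: the first keeps the quoted line break intact and the second is d. Empty lines produce no match, so they are dropped.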
