How to use (?!...) regex pattern to skip the whole unmatched part? - c#

I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var separator = "__";
var pattern = String.Format("((?!{0}).)*", separator);
var regex = new Regex(pattern);
foreach (var item in regex.Matches("first__second"))
Console.WriteLine(item);
}
}
It works fine when the SEPARATOR is a single character, but when it is longer than 1 character I get an unexpected result. In the code above the second matched string is "_second" instead of "second". How should I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines where I should skip line separators inside quotes. My line separator is not a predefined value and it can be for example "\r\n".

You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "plum--pear";
string pattern = "-"; // Split on hyphens
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
}
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'

The .NET regex engine does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use the (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. You could use PCRE.NET, but .NET regex can usually handle those scenarios well with Regex.Split.
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, @"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
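A minimal sketch of the String.Split route for a fixed multi-character separator, using the question's "first__second" input. The class and method names are made up for illustration.

```csharp
using System;

public static class LiteralSplitDemo
{
    // Splits on a literal separator of any length; no regex is involved.
    public static string[] SplitOnLiteral(string input, string separator)
    {
        return input.Split(new[] { separator }, StringSplitOptions.None);
    }

    public static void Main()
    {
        foreach (var item in SplitOnLiteral("first__second", "__"))
            Console.WriteLine(item);
        // first
        // second
    }
}
```

The string[] overload is used because it treats the whole separator as one unit, which is the behavior the lookahead pattern was trying to emulate.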
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
It matches 1+ (due to the final +) occurrences of ", 0+ chars other than " and then " (the "[^"]*" branch) or (|) any char but CR, LF or/and " (see [^\r\n"]).
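As a rough illustration of the pattern above, the following sketch runs it on a made-up input in which one CRLF sits inside quotes and another separates two records. The input string and the class name are assumptions, not part of the question. Note that empty lines produce no match at all with this pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedLineSplitDemo
{
    // Returns the logical lines; line breaks inside double quotes are kept.
    public static List<string> SplitLines(string s)
    {
        return Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        string input = "a,\"b\r\nc\",d\r\ne,f";
        foreach (var line in SplitLines(input))
        {
            // Make the embedded CRLF visible when printing.
            Console.WriteLine("[{0}]", line.Replace("\r\n", "\\r\\n"));
        }
        // [a,"b\r\nc",d]
        // [e,f]
    }
}
```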

Related

Regex match pattern plus rest of the string until next dot, comma or space

Let's say I have a string WORK-232-3213-2323. Known possible case scenarios:
WORK-232-3213-2323, some text
WORK-232-3213-2323. some text
WORK-232-3213-2323.xlsx
WORK-232-3213-2323 some text
WORK-232-3213-2323/some text
Format WORK-232-3213-2323-some text may also occur, but there is no need to handle this case
My current regex is able to catch needed strings with WORK-232-3213-2323 pattern, but as an output I get -232-3213-2323. How to make it so that it would catch WORK- in string plus rest of the text until next whitespace, dot, slash or comma?
Current regex: WORK-(.*?)[\s]
C#:
Regex pattern = new Regex(@"WORK-(.*?)[\s]");
string result = pattern.Match(myString).Groups[1].Value;
You might use a match without a capture group, together with a negated character class that excludes a comma, dot or whitespace char.
\bWORK-[^.,\s]+
\bWORK- Match WORK preceded by a word boundary to prevent a partial match
[^.,\s]+ Negated character class to match 1+ times any char except . , or a whitespace char
string[] strings = {
"WORK-232-3213-2323, some text",
"WORK-232-3213-2323. some text",
"WORK-232-3213-2323.xlsx",
"WORK-232-3213-2323 some text",
"WORK-232-3213-2323/some text"
};
string pattern = @"\bWORK-[^.,\s]+";
foreach (String s in strings) {
Console.WriteLine(Regex.Match(s, pattern).Value);
}
Output
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323/some
If you don't want to match the last line, you could use the capture group and match a . , or whitespace char after it
\b(WORK-[^.,\s\/]+)[.,\s]
For example using the same example strings:
string pattern = @"\b(WORK-[^.,\s\/]+)[.,\s]";
foreach (String s in strings) {
Console.WriteLine(Regex.Match(s, pattern).Groups[1].Value);
}
Output
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
WORK-232-3213-2323
Looks to me you can use the following pattern to handle all your cases, also the one that may occur:
\bWORK(?:-[0-9]+)+
I'm no hero in c# so I used some code I could find to test this:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var s = @"WORK-232-3213-2323, some text";
var pattern = @"\bWORK(?:-[0-9]+)+";
Regex r = new Regex(pattern);
Match m = r.Match(s);
if (m.Success)
{
Console.WriteLine(m.Value);
}
}
}
Alternatively you can use \bWORK(?:-\d+)+ and use Regex r = new Regex(pattern, RegexOptions.ECMAScript); with the ECMAScript option set.
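The difference between the two spellings is that, by default, \d in .NET matches any Unicode decimal digit, while RegexOptions.ECMAScript restricts it to [0-9]. A small sketch of that difference, with a made-up test string of Arabic-Indic digits and hypothetical method names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClassDemo
{
    // Default behavior: \d matches any Unicode decimal digit.
    public static bool IsDigitsDefault(string s)
    {
        return Regex.IsMatch(s, @"^\d+$");
    }

    // With RegexOptions.ECMAScript, \d is equivalent to [0-9].
    public static bool IsDigitsEcma(string s)
    {
        return Regex.IsMatch(s, @"^\d+$", RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        string arabicIndic = "\u0663\u0662"; // Arabic-Indic digits 3 and 2
        Console.WriteLine(IsDigitsDefault(arabicIndic)); // True
        Console.WriteLine(IsDigitsEcma(arabicIndic));    // False
        Console.WriteLine(IsDigitsEcma("232"));          // True
    }
}
```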

Find pattern to solve regex in one step

I am having trouble finding a pattern that solves the problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want to get is: take up to 4x Text. If there are more than "4xText", take only the last character of each additional item.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this i will do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to form one pattern. Is there something like an anchor that tells the engine to start matching from here, as it would if this were the only pattern?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, repeat 0-3 times: 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
// Indexes 0..3 are the first four parts; later parts keep only the text after "Text"
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4));
}
Console.WriteLine(result);
}
}
There are also linq alternatives :
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
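One possible adjustment, sketched under the assumption that "$" is the wanted separator before the shortened tail (as in the question's combined result) and that "the last sign" means the last character of each extra item. The method name KeepFirstFour is hypothetical. Unlike the loops above, it does not leave a trailing "$" when there are four or fewer items.

```csharp
using System;
using System.Linq;

public static class KeepFourDemo
{
    public static string KeepFirstFour(string input)
    {
        var parts = input.Split('$');

        // The first four parts are kept unchanged.
        var head = string.Join("$", parts.Take(4));

        // Every further part contributes only its last character.
        var tail = string.Concat(parts.Skip(4)
            .Where(p => p.Length > 0)
            .Select(p => p.Substring(p.Length - 1)));

        return tail.Length == 0 ? head : head + "$" + tail;
    }

    public static void Main()
    {
        Console.WriteLine(KeepFirstFour("Text1"));
        // Text1
        Console.WriteLine(KeepFirstFour("Text1$Text2$Text3"));
        // Text1$Text2$Text3
        Console.WriteLine(KeepFirstFour("Text1$Text2$Text3$Text4$Text5$Text6"));
        // Text1$Text2$Text3$Text4$56
    }
}
```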
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}) - match Text literally, then \d{1,3}, which is one to three digits; \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: it takes the first capturing group and concatenates it with the second, removing the unnecessary text.
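To show the delegate on its own, here is a small unrelated sketch in which a named method with the MatchEvaluator signature doubles every number in a string. The method names and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorDemo
{
    // Has the MatchEvaluator signature: takes a Match, returns the replacement text.
    private static string DoubleNumber(Match match)
    {
        return (int.Parse(match.Value) * 2).ToString();
    }

    public static string DoubleAll(string input)
    {
        // Regex.Replace calls DoubleNumber once per match.
        return Regex.Replace(input, "[0-9]+", new MatchEvaluator(DoubleNumber));
    }

    public static void Main()
    {
        Console.WriteLine(DoubleAll("a1b22")); // a2b44
    }
}
```

A lambda such as the one in the answer is converted to the same delegate type implicitly.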
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, the second captures the digit of each later item. Then it extracts the captured text and composes the result from it.

RegEx for capturing a word in between = and ;

I want to select word2 from the following :
word2;word3
word2 that is between ; and start of the line unless there is a = in between. In that case, I want start from the = instead of the start of the line
like word2 from
word1=word2;word3
I have tried using this regex
(?<=\=|^).*?(?=;)
which select the word2 from
word2;word3
but also the whole word1=word2 from
word1=word2;word3
You can use an optional group to check for a word followed by an equals sign and capture the value in the first capturing group:
^(?:\w+=)?(\w+);
Explanation
^ Start of string
(?:\w+=)? Optional non capturing group matching 1+ word chars followed by =
(\w+) Capture in the first capturing group 1+ word chars
; Match ;
In .NET you might also use:
(?<=^(?:\w+=)?)\w+(?=;)
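A short sketch of how both patterns could be called from C#, using the two sample lines from the question. The class and method names are hypothetical, and returning an empty string on no match is an arbitrary choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBeforeSemicolonDemo
{
    // Optional "word=" prefix, then the wanted word in capture group 1.
    public static string WithGroup(string line)
    {
        Match m = Regex.Match(line, @"^(?:\w+=)?(\w+);");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // Same idea with lookarounds, so the whole match is the wanted word.
    public static string WithLookarounds(string line)
    {
        Match m = Regex.Match(line, @"(?<=^(?:\w+=)?)\w+(?=;)");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(WithGroup("word1=word2;word3"));       // word2
        Console.WriteLine(WithGroup("word2;word3"));             // word2
        Console.WriteLine(WithLookarounds("word1=word2;word3")); // word2
        Console.WriteLine(WithLookarounds("word2;word3"));       // word2
    }
}
```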
There are many options here, and regular expressions are probably among the last ones to reach for.
But, if we wish to use an expression for this problem, let's start with a simple one and explore other options, maybe something similar to:
(.+=)?(.+?);
or
(.+=)?(.+?)(?:;.+)
where the second capturing group has our desired word2.
Example 1
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"(.+=)?(.+?);";
string input = @"word1=word2;word3
word2;word3";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
Example 2
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"(.+=)?(.+?)(?:;.+)";
string substitution = @"$2";
string input = @"word1=word2;word3
word2;word3";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
}
}
Instead of using regular expressions, you can solve the problem with String class methods.
string[] words = str.Split(';');
string word2 = words[0].Substring(words[0].IndexOf('=') + 1);
The first line splits the input on ';'. Assuming there is a single ';', this yields two strings. The second line returns the substring of the first part (words[0]) that starts one character after the first occurrence of '=' (words[0].IndexOf('=') + 1) and runs to the end. If the line has no '=' character, IndexOf returns -1, so the substring starts at index 0, the beginning.
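Wrapped in a method and run on both sample lines from the question, those two statements might look like the sketch below. The class and method names are made up for illustration.

```csharp
using System;

public static class StringMethodsDemo
{
    public static string ExtractWord(string str)
    {
        string[] words = str.Split(';');
        // IndexOf returns -1 when there is no '=', so + 1 yields 0 (the start).
        return words[0].Substring(words[0].IndexOf('=') + 1);
    }

    public static void Main()
    {
        Console.WriteLine(ExtractWord("word1=word2;word3")); // word2
        Console.WriteLine(ExtractWord("word2;word3"));       // word2
    }
}
```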
Related documentation:
https://learn.microsoft.com/en-us/dotnet/api/system.string.split?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.substring?view=netframework-4.8
https://learn.microsoft.com/en-us/dotnet/api/system.string.indexof?view=netframework-4.8

Regex in C# to process a text

I am trying to remove some text and keep only small text from the string.
Actually I am very new to regex, I have read an article and did not get it very well.
Here is an example of my text (every line in separate string object)
2015-03-08 10:30:00 /user841/column-width
2015-03-08 10:30:01 /user849/connect
2015-03-08 10:30:01 /user262/open-level2-price/some other text
2015-03-08 10:30:01 /user839/open-detailed-quotes
I want to process them using regex in c# and have the following output:
column-width
connect
open-level2-price/some other text
open-detailed-quotes
I have used the following line to do that but it throws an exception:
Match match = Regex.Match(line, @"*./user\d+/*.");
The Exception:
System.ArgumentException: 'parsing "*./user\d+/*." - Quantifier {x,y} following nothing.'
could anyone help please!
The error you get is caused by the fact that you try to quantify the start of the pattern, which is considered an error in a .NET regex. Perhaps, you meant to use .* instead of the *. (to match any 0+ chars greedily, as many as possible), but it is certainly not what you need judging by the expected results.
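The parse error can be reproduced without any input text, because the pattern is rejected when the Regex is constructed. A small sketch with a hypothetical helper, IsValidPattern, that catches ArgumentException (the exception type shown in the question):

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheckDemo
{
    // Returns false when the regex parser rejects the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // A leading * has nothing to quantify, so parsing fails.
        Console.WriteLine(IsValidPattern(@"*./user\d+/*.")); // False
        // With the dot and star swapped, the pattern parses.
        Console.WriteLine(IsValidPattern(@".*/user\d+/.*")); // True
    }
}
```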
You need
/user\d+/(.*)
Details:
/user - a literal substring /user
\d+ - 1 or more digits (use RegexOptions.ECMAScript option to only match ASCII digits with \d in a .NET regex)
/ - a literal /
(.*) - A capturing group #1 that matches any 0+ chars other than a newline (replace * with + to match at least 1 char).
C#:
var results = Regex.Matches(s, @"/user\d+/(.*)")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
Instead of using Regex, just split on the '/' character and use the last index of the array (using LINQ):
string inputString = "2015-03-08 10:30:01 /user262/open-level2-price";
inputString.Split('/').Last();
Split returns an array of strings, in your case with the sample input above the string array would look like:
array[0] = "2015-03-08 10:30:01 "
array[1] = "user262"
array[2] = "open-level2-price"
You indicate you always want the last part so just use LINQ to take the .Last() index of the array.
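One of the question's sample lines has a further '/' in the part to keep ("open-level2-price/some other text"), and taking only the last element would drop "open-level2-price" there. A possible variation, assuming every line has the "timestamp /userNNN/rest" shape shown in the question, is to limit the split to three parts. The class and method names are hypothetical.

```csharp
using System;

public static class SplitWithCountDemo
{
    public static string AfterUser(string line)
    {
        // At most 3 parts: timestamp, user segment, and everything after it.
        string[] parts = line.Split(new[] { '/' }, 3);
        return parts.Length == 3 ? parts[2] : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(AfterUser("2015-03-08 10:30:00 /user841/column-width"));
        // column-width
        Console.WriteLine(AfterUser("2015-03-08 10:30:01 /user262/open-level2-price/some other text"));
        // open-level2-price/some other text
    }
}
```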
Here's a simple example of how to use the Regex.Replace static method.
https://dotnetfiddle.net/JuUF9E
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string[] lines = new string[] {
"2015-03-08 10:30:00 /user841/column-width",
"2015-03-08 10:30:01 /user849/connect",
"2015-03-08 10:30:01 /user262/open-level2-price",
"2015-03-08 10:30:01 /user839/open-detailed-quotes"
};
string pattern = @"(.*/.*/)(.*)";
string replacement = "$2";
foreach(var line in lines)
{
Console.WriteLine(Regex.Replace(line, pattern, replacement));
}
}
}
I don't know why you're trying to do this with regex; you can just read the lines, split each one on '/', then select the last element. For example, if you have that data in a file you can use something like this:
string newString = "";
using (StreamReader sr = new StreamReader("log.txt")) // requires using System.IO;
{
string line;
// ReadLine returns null at the end of the file
while ((line = sr.ReadLine()) != null)
{
string[] splitted = line.Split('/');
if (splitted.Length > 0)
newString += splitted[splitted.Length - 1];
}
}
At the end, the newString variable will contain what you want. Alternatively, you can add every line to a list if you need to process the data further.
How about using Look around
var line = "2015-03-08 10:30:01 /user839/open-detailed otes/dsada/dsa/das/dsadsa";
// dsadsa
var match = Regex.Match(line, @"(?!.*/).*").Value;

Simple regex-matching

I have a String
String test = @"Lists/Versions/2_.000";
I'm a bit confused on how to use regex to do this.
I'm using the pattern
String pattern = @"\D+";
The msdn page for regular expression says \D is "Matches any character other than a decimal digit"
So shouldn't it be returning 'Lists/Versions/' , '2'?
However its returning
'' , '2', '000'
I would like the string to only match the 2(Or any Integer). How would I do that?
String url = @"Lists/Versions/2_.000";
String pattern = @"\D+";
string[] substrings = Regex.Split(url, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
The reason you're seeing this is that \D matches non-digits, and Regex.Split treats each run of non-digits as a separator. It therefore finds two separate numeric values (2 and 000), split by the "_." between them, plus an empty string before the leading "Lists/Versions/" separator. So you have a couple of choices:
Break the string into manageable portions, then anchor to the array.
Build a better pattern to separate.
So the question will be, what are you trying to parse? 2.00 ? Or are you trying to separate numeric numbers in your string?
I'm assuming you have a typo also:
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\w Matches any word character including underscore. Equivalent to "[A-Za-z0-9_]".
\W Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
You should be able to use the following:
string url = @"Lists/Versions/2_.000";
var data = Regex.Split(url, @"\D+");
Console.WriteLine(@"Value: {0} and Secondary Value: {1}", data[1], data[2]);
That should find all integer values. Because the string starts with non-digit characters, data[0] is an empty string, so the values of interest are data[1] and data[2], and the output is:
Value: 2 and Secondary Value: 000
Split returns a normal string[]. My syntax or expression may be off, but a regular expression cheat sheet can help. You'll also want to ensure you check the bounds of the array.
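A rough sketch of the bounds issue: Regex.Split on \D+ yields an empty first element for this input, so filtering out empty entries before indexing avoids off-by-one surprises. The helper name DigitRuns is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDigitsDemo
{
    // Splits on runs of non-digits and drops the empty entries.
    public static string[] DigitRuns(string input)
    {
        return Regex.Split(input, @"\D+")
            .Where(s => s.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string url = @"Lists/Versions/2_.000";

        // The raw split keeps an empty string before the leading separator.
        string[] raw = Regex.Split(url, @"\D+");
        Console.WriteLine(raw.Length); // 3

        foreach (string run in DigitRuns(url))
            Console.WriteLine("'{0}'", run);
        // '2'
        // '000'
    }
}
```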
https://dotnetfiddle.net/BU6gp2
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
String url = @"Lists/Versions/2_.000";
String pattern = @"\D+";
string[] substrings = Regex.Split(url, pattern);
Console.WriteLine("'{0}'", substrings[1]);
}
}
Please try the following:
// using System.Linq;
String url = @"Lists/Versions/2_.000";
String pattern = @"(?<=/)\d+";
string[] substrings = Regex.Matches(url, pattern)
.Cast<Match>()
.Select(_ => _.Value)
.ToArray();
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
Alternatively, if you don't need an array.
String url = @"Lists/Versions/2_.000";
String pattern = @"(?<=/)\d+";
Console.WriteLine("'{0}'", Regex.Match(url, pattern).Value);
