Splitting a string with a Regex results in wrong output [duplicate]


I'm trying to extract the values that appear between << and >> in a string. There can be several of them per string.
Can anyone help with a regular expression to match these:
this is a test for <<bob>> who like <<books>>
test 2 <<frank>> likes nothing
test 3 <<what>> <<on>> <<earth>> <<this>> <<is>> <<too>> <<much>>.
I then want to foreach the GroupCollection to get all the values.
Any help greatly received.
Thanks.

Use a positive look ahead and look behind assertion to match the angle brackets, use .*? to match the shortest possible sequence of characters between those brackets. Find all values by iterating the MatchCollection returned by the Matches() method.
Regex regex = new Regex("(?<=<<).*?(?=>>)");
foreach (Match match in regex.Matches(
"this is a test for <<bob>> who like <<books>>"))
{
Console.WriteLine(match.Value);
}

While Peter's answer is a good example of using lookarounds to check the left-hand and right-hand context, I'd also like to add a LINQ (lambda) way to access matches/groups and show the use of simple numbered capturing groups, which come in handy when you want to extract only a part of the pattern:
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;
// ...
var results = Regex.Matches(s, @"<<(.*?)>>", RegexOptions.Singleline)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Same approach with Peter's compiled regex where the whole match value is accessed via Match.Value:
var results = regex.Matches(s).Cast<Match>().Select(x => x.Value);
Note:
<<(.*?)>> is a regex matching <<, then capturing any 0 or more chars as few as possible (due to the non-greedy *? quantifier) into Group 1 and then matching >>
RegexOptions.Singleline makes . match newline (LF) chars, too (it does not match them by default)
Cast<Match>() casts the match collection to an IEnumerable<Match> that you may further access using a lambda
Select(x => x.Groups[1].Value) only returns the Group 1 value from the current x match object
Note you may further create a list or array of the obtained values by adding .ToList() or .ToArray() after Select.
In the demo C# code, string.Join(", ", results) generates a comma-separated string of the Group 1 values:
var strs = new List<string> { "this is a test for <<bob>> who like <<books>>",
"test 2 <<frank>> likes nothing",
"test 3 <<what>> <<on>> <<earth>> <<this>> <<is>> <<too>> <<much>>." };
foreach (var s in strs)
{
var results = Regex.Matches(s, @"<<(.*?)>>", RegexOptions.Singleline)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Console.WriteLine(string.Join(", ", results));
}
Output:
bob, books
frank
what, on, earth, this, is, too, much

You can try one of these:
(?<=<<)[^>]+(?=>>)
(?<=<<)\w+(?=>>)
However you will have to iterate the returned MatchCollection.
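
As a minimal sketch of that iteration, assuming the first pattern above and values that never contain a > character (the class and method names are made up for illustration):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Returns the text found between each << and >> pair.
    // Assumes the values themselves never contain '>'.
    public static List<string> Extract(string input)
    {
        var values = new List<string>();
        // Regex.Matches returns a MatchCollection; each Match is one value.
        foreach (Match match in Regex.Matches(input, @"(?<=<<)[^>]+(?=>>)"))
        {
            values.Add(match.Value);
        }
        return values;
    }
}
```

Because the angle brackets are matched only by the lookarounds, match.Value holds just the inner text, so no group lookup is needed.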

Something like this:
(<<(?<element>[^>]*)>>)*
This program might be useful:
http://sourceforge.net/projects/regulator/

Related

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "aabbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?

Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]

I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();

There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find the last character that matches the one preceding it but does not match the one following it.
2. It creates a capture group for every match, and the captured text is then fed into the Split method as a delimiter. The capture group is required by the negative lookahead, specifically the \1 backreference, which means "the value of the first capture group in the pattern", so it cannot be omitted.
3. Regex.Split, when the pattern contains one or more capture groups, includes the captured text in the returned array for every individual split.
Number 3 is why your string array looks weird: Split splits after the last a of the run, and the text before that point becomes split[0]. This is followed by the captured delimiter at split[1], and so on.
There is no way to override this behaviour when calling Split.
Either compensating as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
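
The effect of a capture group on Regex.Split can be seen in isolation with a simpler pattern. This is only an illustrative sketch: the digit delimiter and the class name are assumptions, not part of the question.

```csharp
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // No capture group: the delimiters are dropped from the result.
    public static string[] WithoutGroup(string input) => Regex.Split(input, @"\d");

    // Capture group: the text captured by each delimiter match is
    // inserted into the result between the surrounding pieces.
    public static string[] WithGroup(string input) => Regex.Split(input, @"(\d)");
}
```

For the input "a1b2c", WithoutGroup returns a, b, c, while WithGroup returns a, 1, b, 2, c. The same mechanism is what interleaves the extra single characters in the question's output.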

To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();

You could do this easily with LINQ, but I don't think its runtime will be as good as the regex.
A whole lot easier to read though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
returns the values ['aaa', 'bb', 'cccc', 'd', 'eee']
However this won't work if you have a string like "aabbaa", you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"]
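
If repeated runs such as "aabbaa" must stay separate, a plain loop that compares each character with the previous one is one option. This is a sketch rather than part of the answer above, and the names are hypothetical:

```csharp
using System.Collections.Generic;
using System.Text;

public static class RunSplitter
{
    // Splits the input into runs of identical consecutive characters.
    public static List<string> SplitRuns(string input)
    {
        var runs = new List<string>();
        if (string.IsNullOrEmpty(input)) return runs;

        var current = new StringBuilder();
        current.Append(input[0]);
        for (int i = 1; i < input.Length; i++)
        {
            // A change of character closes the current run.
            if (input[i] != input[i - 1])
            {
                runs.Add(current.ToString());
                current.Clear();
            }
            current.Append(input[i]);
        }
        runs.Add(current.ToString());
        return runs;
    }
}
```

Unlike GroupBy, which groups by key across the whole string, this only ever looks at neighbouring characters, so "aabbaa" yields aa, bb, aa.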

Get specific text from textbox using regexp

I'm looking for a regex in C#.net to extract printers from a list in a script.
This is an example:
@set nr=2
@if not exist "%userprofile%\Version%nr%.txt" goto reload
@goto koppla
:reload
@echo skrivare>"%userprofile%\Version%nr%.txt"
@del "%userprofile%\zxy-*.txt"
@call skrivare.cmd
@exit
:koppla
@%connect1% \\%Print2%\Lund-M1
@%connect2% \\%Print2%\MAR-M1
@%connect2% \\%Print2%\MAR-M2
I would like to get the names of the printers (Lund-M1, MAR-M1, MAR-M2) into an array that I can foreach over.
I'd really appreciate any help on this; my mind doesn't work with regex.
Thank you in advance!

You could do something quite simple, like searching for the Print2 prefix:
\\\\%Print2%\\(.*)
You can test this on http://www.regexer.com. You'd then need to access the first group of each Match object to grab the part of the string you are after.
Edit
If you want to encapsulate different print numbers use the following which allows the 2 to be exchanged with any other number.
\\\\%Print[0-9]%\\(.*)

(?m:(?<=^@\%connect\d\% \\\\(.*?\\)*)[^\\]+$)
will give three matches over your script, with values
Lund-M1
MAR-M1
MAR-M2
So
Regex.Matches(input, @"(?m:(?<=^@\%connect\d\% \\\\(.*?\\)*)[^\\]+$)")
.Cast<Match>()
.Select(m => m.Value)
.ToArray()
gives you what you need.
This checks for a line starting with @%connect, then any digit followed by %, and then pulls the last segment of any path of the form \\something\something\something\AnyNonBackslashChars

foreach (Match match in Regex.Matches(text,
@"^@%connect\d+%\s+\\\\%Print2%\\(.*?)\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline))
{
// Group 1 holds the printer name, e.g. Lund-M1.
var name = match.Groups[1].Value;
}

Pattern Matching c#

Let's say I have a text file containing the line below. I want to take both values inside the quotation marks by matching between (" and "), so that I retrieve ABC and DEF and put them in a string list or something similar. What's the best way of doing this? It's so annoying.
If EXAMPLEA("ABC") AND EXAMPLEB("DEF")

Assuming the value between the double quotes cannot contain escaped double quotes, it might work like this:
var text = "If EXAMPLEA(\"ABC\") AND EXAMPLEB(\"DEF\")";
Regex pattern = new Regex("\"[^\"]*\"");
foreach (Match match in pattern.Matches(text))
{
Console.WriteLine(match.Value.Trim('"'));
}
But this is only one of the many ways you could do it and maybe not the smartest way out there. Try something yourself!

Best way...
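// Note: these lookarounds also match the text between a closing quote and the next opening quote.
List<string> matches=Regex.Matches(File.ReadAllText(yourPath),@"(?<="")[^""]*(?="")")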
.Cast<Match>()
.Select(x=>x.Value)
.ToList();

This pattern should do the trick:
\"([^"]*)\"
string str = "If EXAMPLEA(\"ABC\") AND EXAMPLEB(\"DEF\")";
MatchCollection matched = Regex.Matches(str, @"\""([^\""]*)\""");
foreach (Match match in matched)
{
Console.WriteLine(match.Groups[1].Value);
}
Note that the quotation marks are doubled in the actual code in order to escape them. And the code refers to group [1] to get just the part inside the parentheses.

IEnumerable<string> matches =
from Match match
in Regex.Matches(File.ReadAllText(filepath), @"\""([^\""]*)\""")
select match.Groups[1].Value;
Others have already posted answers, but mine takes into account that you just want ABC and DEF from your example, without the quotation marks, and saves them in an IEnumerable<string>.

Regex to remove all (non numeric OR period)

I need for text like "joe ($3,004.50)" to be filtered down to 3004.50 but am terrible at regex and can't find a suitable solution. So only numbers and periods should stay - everything else filtered. I use C# and VS.net 2008 framework 3.5

This should do it:
string s = "joe ($3,004.50)";
s = Regex.Replace(s, "[^0-9.]", "");

The regex is:
[^0-9.]
You can cache the regex:
Regex not_num_period = new Regex("[^0-9.]");
then use:
string result = not_num_period.Replace("joe ($3,004.50)", "");
However, you should keep in mind that some cultures have different conventions for writing monetary amounts, such as: 3.004,50.

You are dealing with a string, and a string is an IEnumerable<char>, so you can use LINQ:
var input = "joe ($3,004.50)";
var result = String.Join("", input.Where(c => Char.IsDigit(c) || c == '.'));
Console.WriteLine(result); // 3004.50

For the accepted answer, MatthewGunn raises a valid point in that all digits, commas, and periods in the entire string will be condensed together. This will avoid that:
string s = "joe.smith ($3,004.50)";
Regex r = new Regex(@"(?:^|[^\w.,])(\d[\d,.]+)(?=\W|$)");
Match m = r.Match(s);
string v = null;
if (m.Success) {
v = m.Groups[1].Value;
v = Regex.Replace(v, ",", "");
}

The approach of removing offending characters is potentially problematic. What if there's another . in the string somewhere? It won't be removed, though it should!
Removing non-digits or periods, the string joe.smith ($3,004.50) would transform into the unparseable .3004.50.
Imho, it is better to match a specific pattern, and extract it using a group. Something simple would be to find all contiguous commas, digits, and periods with regexp:
[\d,\.]+
Sample test run:
Pattern understood as:
[\d,\.]+
Enter string to check if matches pattern
> a2.3 fjdfadfj34 34j3424 2,300 adsfa
Group 0 match: "2.3"
Group 0 match: "34"
Group 0 match: "34"
Group 0 match: "3424"
Group 0 match: "2,300"
Then, for each match, remove all commas and send the result to the parser. To handle a case like 12.323.344, you could add another check that a matching substring contains at most one period.
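
A rough sketch of those steps, assuming the [\d,.]+ pattern from above and invariant-culture parsing where the period is the decimal separator (the class and method names are made up for illustration):

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Finds runs of digits, commas and periods, strips the commas,
    // and keeps only the candidates that parse as a decimal.
    public static List<decimal> Extract(string input)
    {
        var numbers = new List<decimal>();
        foreach (Match match in Regex.Matches(input, @"[\d,.]+"))
        {
            string candidate = match.Value.Replace(",", "");

            // Skip candidates such as 12.323.344 with more than one period.
            if (candidate.Count(c => c == '.') > 1) continue;

            // TryParse also rejects leftovers with no digits, e.g. a lone ".".
            if (decimal.TryParse(candidate, NumberStyles.AllowDecimalPoint,
                    CultureInfo.InvariantCulture, out decimal value))
            {
                numbers.Add(value);
            }
        }
        return numbers;
    }
}
```

With the input joe.smith ($3,004.50), the stray period in the name is matched by the regex but rejected by the parser, so only 3004.50 is returned.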

Determining which pattern matched using Regex.Matches

I'm writing a translator, not as any serious project, just for fun and to become a bit more familiar with regular expressions. From the code below I think you can work out where I'm going with this (cheezburger anyone?).
I'm using a dictionary which uses a list of regular expressions as the keys and the dictionary value is a List<string> which contains a further list of replacement values. If I'm going to do it this way, in order to work out what the substitute is, I obviously need to know what the key is, how can I work out which pattern triggered the match?
var dictionary = new Dictionary<string, List<string>>
{
{"(?!e)ight", new List<string>(){"ite"}},
{"(?!ues)tion", new List<string>(){"shun"}},
{"(?:god|allah|buddah?|diety)", new List<string>(){"ceiling cat"}},
..
}
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
substitute = GetRandomReplacement(dictionary[ ????? ]);
input = input.Replace(metamatch.Value, substitute);
}
Is what I'm attempting possible, or is there a better way to achieve this insanity?

You can name each capture group in a regular expression and then query the value of each named group in your match. This should allow you to do what you want.
For example, using the regular expression below,
(?<Group1>(?!e))ight
you can then extract the group matches from your match result:
match.Groups["Group1"].Captures

You've got another problem. Check this out:
string s = @"My weight is slight.";
Regex r = new Regex(@"(?<!e)ight\b");
foreach (Match m in r.Matches(s))
{
s = s.Replace(m.Value, "ite");
}
Console.WriteLine(s);
output:
My weite is slite.
String.Replace is a global operation, so even though weight doesn't match the regex, it gets changed anyway when slight is found. You need to do the match, lookup, and replace at the same time; Regex.Replace(String, MatchEvaluator) will let you do that.
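
One way to combine the match, lookup, and replace in a single pass is sketched below. It is only an illustration: the rule table, the generated group names g0, g1, ... and the class name are assumptions, and a fixed replacement stands in for the question's GetRandomReplacement.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class Translator
{
    // Hypothetical rule table: each pattern is paired with its replacement.
    private static readonly (string Pattern, string Replacement)[] Rules =
    {
        (@"(?<!e)ight\b", "ite"),
        (@"(?<!ues)tion\b", "shun"),
    };

    public static string Translate(string input)
    {
        // Wrap each rule in a named group (g0, g1, ...) so the evaluator
        // can tell which alternative produced the match.
        string combined = string.Join("|",
            Rules.Select((r, i) => $"(?<g{i}>{r.Pattern})"));

        return Regex.Replace(input, combined, m =>
        {
            for (int i = 0; i < Rules.Length; i++)
            {
                if (m.Groups["g" + i].Success) return Rules[i].Replacement;
            }
            return m.Value; // not reached: one alternative always succeeds
        }, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }
}
```

Because each replacement is applied only at the position where its pattern matched, "My weight is slight." becomes "My weight is slite." and weight is left alone.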

Using named groups like Jeff says is the most robust way.
You can also access the groups by number, in the order they appear in your pattern.
(first)|(second)
can be accessed with
match.Groups[1] // group 1 -> first
match.Groups[2] // group 2 -> second
Of course, if you have more parentheses that you don't want to capture, use the non-capturing group (?:...):
((?:f|F)irst)|((?:s|S)econd)
match.Groups[2].Value // still group 2 -> second
