C# Regular expression returns group multiple times

I have a very simple regex like this in C#:
(var \= 0\;)
But when I try to match this against a string that has only one occurrence of the pattern, I get multiple groups returned. The input string is:
foo bar
var = 0;
foo
I get 1 match returned by the Regex object, but inside I see two groups, each has 1 capture, which is the string I want.
I need the grouping parentheses in the regex because this is part of a bigger regex, and I need this to be captured as a group.
What am I doing wrong?
EDIT
This is the C# code I'm using:
private static readonly Regex REGEX = new Regex("(var \\= [0]\\;)");
MatchCollection matches = REGEX.Matches(inputStr);
foreach (Match m in matches)
{
foreach (Group g in m.Groups)
{
Console.WriteLine("group[" + g.Captures.Count + "]: '" + g.ToString() + "'");
}
}
This is what I get:
group[1]: 'var = 0;'
group[1]: 'var = 0;'
My question is, why do I get two groups and not one?
EDIT #2:
A more complicated pattern shows the problem. The pattern:
# preceding comment
class
{
(param1 = "val1", param2 = "val2", param3 = val3)
}
[
# inside comment
setting1 = 0;
setting2 = 0;
]
The regex I'm using: (it's probably not the most obvious, but you can paste it in a regex viewer if you want to check it out)
(\#[^\n]*)?(?:[\s\r\n]*)domain(?:[\s\r\n]*)\{(?:[\s\r\n]*)\((?:[\s\r\n]*)(((?:[\s\r\n]*)(accountName(?:[\s\r\n]*)\=(?:[\s\r\n]*)\"[^"]+\"[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(tableName(?:[\s\r\n]*)\=(?:[\s\r\n]*)\"[^"]+\"[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(cap(?:[\s\r\n]*)\=(?:[\s\r\n]*)[\d]+[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(MinPartitionCount(?:[\s\r\n]*)\=(?:[\s\r\n]*)[\d]+[,]?)(?:[\s\r\n]*)))+\)(?:[\s\r\n]*)\}(?:[\s\r\n]*)\[(?:[\s\r\n]*)(\#[^\n]*)?(?:[\s\r\n]*)((?:[\s\r\n]*)(IsSplitEnabled(?:[\s\r\n]*)\=(?:[\s\r\n]*)[0|1](?:[\s\r\n]*)\;)(?:[\s\r\n]*)|(?:[\s\r\n]*)(IsMergeEnabled(?:[\s\r\n]*)\=(?:[\s\r\n]*)[0|1](?:[\s\r\n]*)\;)(?:[\s\r\n]*))*(?:[\s\r\n]*)\]
And I'm getting:
group:1: '# preceding comment
domain
{
(param1 = "val1", param2 = "val2", param3 = val3)
}
[
# inside comment
setting1 = 0;
setting2 = 0;
]'
'roup:1: '# preceding comment
group:3: 'cap = 1200'
group:1: 'param1 = "val1", '
group:1: 'param1 = "val1",'
group:1: 'param2 = "val2", '
group:1: 'param2 = "val2",'
group:1: 'param3 = val3'
group:1: 'param3 = val3'
'roup:1: '# inside comment
group:2: 'setting1 = 0;
'
group:1: 'setting1 = 0;'
group:1: 'setting2 = 0;'

According to the documentation, the first element of the GroupCollection is the entire match, not the first group created by ().
From near the bottom of the Remarks section here:
If the regular expression engine can find a match, the first element
of the GroupCollection object returned by the Groups property contains
a string that matches the entire regular expression pattern. Each subsequent element
represents a captured group, if the regular expression includes capturing groups.
Due to this, both items 0 and 1 are identical given the RegEx you are currently using. To only see the actual group matches, you could skip the first element of the GroupCollection, and only process the groups you have defined in the RegEx.
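As a minimal sketch of skipping the first element, the loop below starts at index 1 so that only the groups defined with () are collected. The class and method names (GroupDemo, CapturedGroups) are made up for illustration and are not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Hypothetical helper: returns only the explicitly captured groups,
    // skipping Groups[0], which always holds the entire match.
    public static List<string> CapturedGroups(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            // Start at 1: index 0 is the whole match, not a () group.
            for (int i = 1; i < m.Groups.Count; i++)
            {
                results.Add(m.Groups[i].Value);
            }
        }
        return results;
    }

    public static void Main()
    {
        string input = "foo bar\nvar = 0;\nfoo";
        foreach (string s in CapturedGroups(input, @"(var \= 0\;)"))
        {
            Console.WriteLine("group: '" + s + "'");
        }
    }
}
```

With the question's input and simple regex, this collects a single entry, since the pattern defines only one capturing group.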
EDIT
After investigating the additional data, I think I may have found the cause of your duplicates.
I believe that you are seeing more than one Match, and so the outer foreach loop runs twice, not once. This is because there are 2 separate lines with "= 0;" in the example.
Here is LinqPad example code that shows 2 matches being found, and therefore multiple duplicate groups being output. (note, I used the simple regex you provided to test, since the long regex didn't provide any matches)
static string inputStr = "# preceding comment \r\n" +
"class\r\n" +
"{\r\n" +
" (param1 = \"val1\", param2 = \"val2\", param3 = val3)\r\n" +
"}\r\n" +
"[\r\n" +
" # inside comment\r\n" +
" setting1 = 0;\r\n" +
" setting2 = 0;\r\n" +
"]\r\n";
const string REGEX = "(\\= [0]\\;)";
void Main()
{
var regex = new System.Text.RegularExpressions.Regex(REGEX);
MatchCollection matches = regex.Matches(inputStr);
Console.WriteLine("Matches:{0}", matches.Count);
int matchCnt = 0;
foreach (Match m in matches)
{
int groupCnt = 0;
foreach (Group g in m.Groups)
{
Console.WriteLine("match[{0}] group[{1}]: Captures:{2} '{3}'", matchCnt, groupCnt, g.Captures.Count, g);
//g.Dump();
groupCnt++;
}
matchCnt++;
}
Console.WriteLine("Done!");
}
And here is the output generated by LinqPad when this code runs:
Matches:2
match[0] group[0]: Captures:1 '= 0;'
match[0] group[1]: Captures:1 '= 0;'
match[1] group[0]: Captures:1 '= 0;'
match[1] group[1]: Captures:1 '= 0;'
Done!

Related

C# Regex returning multiple lines of text

I have the following function:
public static string ReturnEmailAddresses(string input)
{
string regex1 = @"\[url=";
string regex2 = @"mailto:([^\?]*)";
string regex3 = @".*?";
string regex4 = @"\[\/url\]";
Regex r = new Regex(regex1 + regex2 + regex3 + regex4, RegexOptions.IgnoreCase | RegexOptions.Multiline);
MatchCollection m = r.Matches(input);
if (m.Count > 0)
{
StringBuilder sb = new StringBuilder();
int i = 0;
foreach (var match in m)
{
if (i > 0)
sb.Append(Environment.NewLine);
string shtml = match.ToString();
var innerString = shtml.Substring(shtml.IndexOf("]") + 1, shtml.IndexOf("[/url]") - shtml.IndexOf("]") - 1);
sb.Append(innerString); //just titles
i++;
}
return sb.ToString();
}
return string.Empty;
}
As you can see I define a url in the "markdown" format:
[url = http://sample.com]sample.com[/url]
In the same way, emails are written in that format too:
[url=mailto:service@paypal.com.au]service@paypal.com.au[/url]
However, when I pass in a multiline string with multiple email addresses, it returns only the first email. I would like it to return multiple matches, but I cannot get that working.
For example:
[url=mailto:service@paypal.com.au]service@paypal.com.au[/url] /r/n a whole bunch of text here /r/n more stuff here [url=mailto:anotheremail@paypal.com.au]anotheremail@paypal.com.au[/url]
This returns only the first email above.

The mailto:([^\?]*) part of your pattern is matching everything in your input string. You need to add the closing bracket ] to the inside of your excluded characters to restrict that portion from overflowing outside of the "mailto" section and into the text within the "url" tags:
\[url=mailto:([^\?\]]*).*?\[\/url\]
See this link for an example: https://regex101.com/r/zcgeW8/1
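A rough sketch of how the corrected pattern could be used with Matches() follows. It reads the address from capture group 1 rather than slicing the match text as the original function does, which is a simplification on my part; the names EmailDemo and ExtractEmails are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailDemo
{
    // Hypothetical helper: one entry per [url=mailto:...]...[/url] tag.
    public static List<string> ExtractEmails(string input)
    {
        // ']' is in the excluded set, so group 1 stops at the end of the opening tag.
        var regex = new Regex(@"\[url=mailto:([^\?\]]*).*?\[\/url\]", RegexOptions.IgnoreCase);
        var emails = new List<string>();
        foreach (Match m in regex.Matches(input))
        {
            emails.Add(m.Groups[1].Value);
        }
        return emails;
    }

    public static void Main()
    {
        string input = "[url=mailto:service@paypal.com.au]service@paypal.com.au[/url] /r/n a whole bunch of text here /r/n more stuff here [url=mailto:anotheremail@paypal.com.au]anotheremail@paypal.com.au[/url]";
        Console.WriteLine(string.Join(Environment.NewLine, ExtractEmails(input)));
    }
}
```

Because each match now ends at its own [/url], the sample string yields two matches instead of one.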

You can extract desired result with help of positive lookahead and positive lookbehind. See http://www.rexegg.com/regex-lookarounds.html
Try regex: (?<=\[url=mailto:).*?(?=\])
Above regex will capture two email addresses from sample string
[url=mailto:service@paypal.com.au]service@paypal.com.au[/url] /r/n a whole bunch of text here /r/n more stuff here [url=mailto:anotheremail@paypal.com.au]anotheremail@paypal.com.au[/url]
Result:
service@paypal.com.au
anotheremail@paypal.com.au

C# Enumerate Regex Matches

What is the best way to number the replacements made by a regex in C#?
For example, say I want every "<intent-filter" match to be replaced by "<intent-filter android:label=label#", where the # sign is an incrementing number. What would be the best way to code it?

You can use an incremented counter in the anonymous method specified as the MatchEvaluator callback. The (?<=…) is positive lookbehind; it is matched by the regex evaluator, but not removed.
string input = "a <intent-filter data=a /> <intent-filter data=b />";
int count = 0;
string result = Regex.Replace(input, @"(?<=\<intent-filter)",
_ => " android:label=label" + count++);
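For reference, here is the same idea wrapped in a small self-contained program. The class and method names (LabelDemo, AddLabels) are invented for this sketch, and the lookbehind is written without the backslash before <, which should behave the same in .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelDemo
{
    // Hypothetical helper: inserts a numbered label after each "<intent-filter".
    public static string AddLabels(string input)
    {
        int count = 0;
        // The lookbehind matches the empty position right after "<intent-filter",
        // so the replacement text is inserted there and nothing is removed.
        return Regex.Replace(input, @"(?<=<intent-filter)",
            _ => " android:label=label" + count++);
    }

    public static void Main()
    {
        string input = "a <intent-filter data=a /> <intent-filter data=b />";
        Console.WriteLine(AddLabels(input));
        // Prints: a <intent-filter android:label=label0 data=a /> <intent-filter android:label=label1 data=b />
    }
}
```

The counter lives in the enclosing method, so each call to AddLabels starts numbering from 0 again.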

Don't bother with Regexes for this one. Do something along the lines of:
var pieces = text.Split(new string[] { "<intent-filter" }, StringSplitOptions.None);
var sb = new StringBuilder();
var idx = 0;
foreach (var piece in pieces)
{
sb.Append(piece);
// Split() drops the delimiter, so put it back along with the numbered label.
sb.Append("<intent-filter android:label=label");
sb.Append(idx++);
}
// oops, homework assignment: remove the last "<intent-filter android:label=label#"

Find all occurrences of substring in string

I have a text file that includes things such as the following:
_vehicle_12 = objNull;
if (true) then
{
_this = createVehicle ["Land_Mil_Guardhouse", [13741.654, 2926.7075, 3.8146973e-006], [], 0, "CAN_COLLIDE"];
_vehicle_12 = _this;
_this setDir -92.635818;
_this setPos [13741.654, 2926.7075, 3.8146973e-006];
};
I want to find all occurrences between { and }; and assign the following strings:
string direction = "_this setDir" value, in example _vehicle_12 it would mean that:
string direction = "-92.635818";
string position = "_this setPos" value, in example _vehicle_12 it would be:
string position = "[13741.654, 2926.7075, 3.8146973e-006]";
I have multiple occurrences of these types and would like to figure out the best way each time the { }; occurs to set direction and position and move onto the next occurrence.
The following code reads the file into one large string and finds the first occurrence fine, but I would like to adapt it to find every occurrence of { and };
string alltext = File.ReadAllText(@file);
string re1 = ".*?"; // Non-greedy match on filler
string re2 = "(\\{.*?\\})"; // Curly Braces 1
Regex r = new Regex(re1 + re2, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(alltext);
if (m.Success)
{
String cbraces1 = m.Groups[1].ToString();
MessageBox.Show("Found vehicle: " + cbraces1.ToString() + "\n");
}

1. Think of a regex that might work.
2. Test it.
3. If it does not work, modify and return to step 2.
4. You have a working regex :-)
To get you started:
\{\n([0-z\[\]" ,-\.=]+;\n)+\}
should return the individual lines inside the curly braces.
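One possible way to walk every { ... }; block is to switch from Match() to Matches() and then pull the two values out of each block body. This is only a sketch under assumptions: it uses its own patterns rather than the regex suggested above, it assumes blocks never contain a nested }; sequence, and the names VehicleBlock and VehicleParser are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for the two values of interest.
public sealed class VehicleBlock
{
    public string Direction { get; set; }
    public string Position { get; set; }
}

public static class VehicleParser
{
    // Singleline lets '.' cross line breaks; the lazy quantifier stops at the first "};".
    static readonly Regex BlockRegex = new Regex(@"\{(.*?)\};", RegexOptions.Singleline);
    static readonly Regex DirRegex = new Regex(@"_this\s+setDir\s+([^;]+);");
    static readonly Regex PosRegex = new Regex(@"_this\s+setPos\s+([^;]+);");

    public static List<VehicleBlock> Parse(string allText)
    {
        var blocks = new List<VehicleBlock>();
        foreach (Match block in BlockRegex.Matches(allText))
        {
            string body = block.Groups[1].Value;
            Match dir = DirRegex.Match(body);
            Match pos = PosRegex.Match(body);
            blocks.Add(new VehicleBlock
            {
                // A missing setDir or setPos line leaves the value null.
                Direction = dir.Success ? dir.Groups[1].Value.Trim() : null,
                Position = pos.Success ? pos.Groups[1].Value.Trim() : null
            });
        }
        return blocks;
    }

    public static void Main()
    {
        string text = "if (true) then\n{\n_this setDir -92.635818;\n_this setPos [13741.654, 2926.7075, 3.8146973e-006];\n};\n";
        foreach (VehicleBlock b in Parse(text))
        {
            Console.WriteLine("direction=" + b.Direction + " position=" + b.Position);
        }
    }
}
```

Splitting the work into one block-level pattern and two small value patterns keeps each regex easy to test on its own.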

Why Group.Value always the last matched group string?

Recently, I found one C# Regex API really annoying.
I have regular expression (([0-9]+)|([a-z]+))+. I want to find all matched string. The code is like below.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456defFOO";
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
Console.WriteLine("Match" + (++matchCount));
Console.WriteLine("Match group count = {0}", match.Groups.Count);
for (int i = 0; i < match.Groups.Count; i++)
{
Group group = match.Groups[i];
Console.WriteLine("Group" + i + "='" + group.Value + "'");
}
match = match.NextMatch();
Console.WriteLine("go to next match");
Console.WriteLine();
}
The output is:
Match1
Match group count = 4
Group0='abc123xyz456def'
Group1='def'
Group2='456'
Group3='def'
go to next match
It seems that every group.Value is the last matched string ("def" and "456"). It took me some time to figure out that I should rely on group.Captures instead of group.Value.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456def";
//Console.WriteLine(str);
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
Console.WriteLine("Match" + (++matchCount));
Console.WriteLine("Match group count = {0}", match.Groups.Count);
for (int i = 0; i < match.Groups.Count; i++)
{
Group group = match.Groups[i];
Console.WriteLine("Group" + i + "='" + group.Value + "'");
CaptureCollection cc = group.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine(" Capture" + j + "='" + c + "', Position=" + c.Index);
}
}
match = match.NextMatch();
Console.WriteLine("go to next match");
Console.WriteLine();
}
This will output:
Match1
Match group count = 4
Group0='abc123xyz456def'
Capture0='abc123xyz456def', Position=0
Group1='def'
Capture0='abc', Position=0
Capture1='123', Position=3
Capture2='xyz', Position=6
Capture3='456', Position=9
Capture4='def', Position=12
Group2='456'
Capture0='123', Position=3
Capture1='456', Position=9
Group3='def'
Capture0='abc', Position=0
Capture1='xyz', Position=6
Capture2='def', Position=12
go to next match
Now I am wondering why the API is designed like this. Why does Group.Value return only the last matched string? This design doesn't look good.

The primary reason is historical: regexes have always worked that way, going back to Perl and beyond. But it's not really bad design. Usually, if you want every match like that, you just leave off the outermost quantifier (+ in this case) and use the Matches() method instead of Match(). Every regex-enabled language provides a way to do that: in Perl or JavaScript you do the match in /g mode; in Ruby you use the scan method; in Java you call find() repeatedly until it returns false. Similarly, if you're doing a replace operation, you can plug the captured substrings back in as you go with placeholders ($1, $2 or \1, \2, depending on the language).
On the other hand, I know of no other Perl 5-derived regex flavor that provides the ability to retrieve intermediate capture-group matches like .NET does with its CaptureCollections. And I'm not surprised: it's actually very seldom that you really need to capture all the matches in one go like that. And think of all the storage and/or processing power it can take to keep track of all those intermediate matches. It is a nice feature though.
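To illustrate the suggestion of dropping the outermost quantifier and calling Matches(), here is a small sketch using the question's input. The names TokenDemo and Tokens are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Hypothetical helper: each run of digits or lowercase letters becomes its own match.
    public static List<string> Tokens(string input)
    {
        var tokens = new List<string>();
        // Same alternation as in the question, but without the outer (...)+ wrapper.
        foreach (Match m in Regex.Matches(input, "([0-9]+)|([a-z]+)"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string t in Tokens("abc123xyz456defFOO"))
        {
            Console.WriteLine(t);
        }
    }
}
```

Each token arrives as a separate Match, so Group.Value is enough and the CaptureCollection is not needed.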

Using Regex to edit a string in C#

I'm just beginning to use Regex so bear with my terminology. I have a regex pattern that is working properly on a string. The string could be in the format "text [pattern] text". Therefore, I also have a regex pattern that negates the first pattern. If I print out the results from each of the matches everything is shown correctly.
The problem I'm having is that I want to add text into the string, and doing so changes the indexes of the matches in the regex MatchCollection. For example, if I wanted to enclose each found match in <td> and </td> tags, I have the following code:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
if (mc.Count > 0)
{
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index, mc[i].Length);
text = text.Insert(mc[i].Index, "<td>" + mc[i].Value + "</td>");
}
}
This works great for the first match. But as you'd expect the mc[i].Index is no longer valid because the string has changed. Therefore, I tried to search for just a single match in the for loop for the amount of matches I would expect (mc.Count), but then I keep finding the first match.
So hopefully without introducing more regex to make sure it's not the first match and with keeping everything in one string, does anybody have any input on how I could accomplish this? Thanks for your input.
Edit: Thank you all for your responses, I appreciate all of them.

It can be as simple as:-
string newString = Regex.Replace("abc", "b", "<td>${0}</td>");
Results in a<td>b</td>c.
In your case:-
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
text = r.Replace(text, "<td>${0}</td>");
This replaces every occurrence of negRegexPattern with the content of that match surrounded by the td element.

Although I agree that the Regex.Replace answer above is the best choice, just to answer the question you asked: how about replacing from the last match to the first? That way the string only grows after the "previous" match, so the indexes of the earlier matches remain valid.
for (int i = mc.Count - 1; i >= 0; --i)
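A fuller sketch of the back-to-front loop might look like the following. Note the condition i >= 0, so that the match at index 0 is processed too. The names ReverseReplaceDemo and WrapMatches are hypothetical, and the pattern in Main is only a stand-in for negRegexPattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseReplaceDemo
{
    // Hypothetical helper: wraps every match in <td>...</td>, working from the end.
    public static string WrapMatches(string text, string pattern)
    {
        MatchCollection mc = Regex.Matches(text, pattern);
        // Edits at higher positions never shift the indexes of earlier matches.
        for (int i = mc.Count - 1; i >= 0; --i)
        {
            text = text.Remove(mc[i].Index, mc[i].Length);
            text = text.Insert(mc[i].Index, "<td>" + mc[i].Value + "</td>");
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(WrapMatches("a b c", "[a-c]"));
        // Prints: <td>a</td> <td>b</td> <td>c</td>
    }
}
```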

static string Tabulate(Match m)
{
return "<td>" + m.ToString() + "</td>";
}
static void Replace()
{
string text = "your text";
string result = Regex.Replace(text, "your_regexp", new MatchEvaluator(Tabulate));
}

You can try something like this:
Regex.Replace(input, pattern, match =>
{
return "<tr>" + match.Value + "</tr>";
});

Keep a counter before the loop starts, and add the amount of characters you inserted every time. IE:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
int counter = 0;
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index + counter, mc[i].Length);
text = text.Insert(mc[i].Index + counter, "<td>" + mc[i].Value + "</td>");
counter += ("<td>" + "</td>").Length;
}
I haven't tested this, but it SHOULD work.
