Recently, I found one C# Regex API really annoying.
I have the regular expression (([0-9]+)|([a-z]+))+ and I want to find every matched string. The code looks like this:
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456defFOO";
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
Console.WriteLine("Match" + (++matchCount));
Console.WriteLine("Match group count = {0}", match.Groups.Count);
for (int i = 0; i < match.Groups.Count; i++)
{
Group group = match.Groups[i];
Console.WriteLine("Group" + i + "='" + group.Value + "'");
}
match = match.NextMatch();
Console.WriteLine("go to next match");
Console.WriteLine();
}
The output is:
Match1
Match group count = 4
Group0='abc123xyz456def'
Group1='def'
Group2='456'
Group3='def'
go to next match
It seems that every group.Value holds only the last matched string ("def" and "456"). It took me some time to figure out that I should rely on group.Captures instead of group.Value.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456def";
//Console.WriteLine(str);
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
Console.WriteLine("Match" + (++matchCount));
Console.WriteLine("Match group count = {0}", match.Groups.Count);
for (int i = 0; i < match.Groups.Count; i++)
{
Group group = match.Groups[i];
Console.WriteLine("Group" + i + "='" + group.Value + "'");
CaptureCollection cc = group.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine(" Capture" + j + "='" + c + "', Position=" + c.Index);
}
}
match = match.NextMatch();
Console.WriteLine("go to next match");
Console.WriteLine();
}
This will output:
Match1
Match group count = 4
Group0='abc123xyz456def'
Capture0='abc123xyz456def', Position=0
Group1='def'
Capture0='abc', Position=0
Capture1='123', Position=3
Capture2='xyz', Position=6
Capture3='456', Position=9
Capture4='def', Position=12
Group2='456'
Capture0='123', Position=3
Capture1='456', Position=9
Group3='def'
Capture0='abc', Position=0
Capture1='xyz', Position=6
Capture2='def', Position=12
go to next match
Now I am wondering why the API is designed like this. Why does Group.Value return only the last matched string? This design doesn't look good.
The primary reason is historical: regexes have always worked that way, going back to Perl and beyond. But it's not really bad design. Usually, if you want every match like that, you just leave off the outermost quantifier (+ in this case) and use the Matches() method instead of Match(). Every regex-enabled language provides a way to do that: in Perl or JavaScript you do the match in /g mode; in Ruby you use the scan method; in Java you call find() repeatedly until it returns false. Similarly, if you're doing a replace operation, you can plug the captured substrings back in as you go with placeholders ($1, $2 or \1, \2, depending on the language).
On the other hand, I know of no other Perl 5-derived regex flavor that provides the ability to retrieve intermediate capture-group matches like .NET does with its CaptureCollections. And I'm not surprised: it's actually very seldom that you really need to capture all the matches in one go like that. And think of all the storage and/or processing power it can take to keep track of all those intermediate matches. It is a nice feature though.
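As a minimal sketch of the "leave off the outermost quantifier and use Matches()" advice, the following collects every token from the question's input. The class and method names (TokenDemo, Tokens) are made up for illustration; only Regex.Matches and Match.Value are real API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Hypothetical helper: returns every run of digits or lowercase letters,
    // in order of appearance.
    public static List<string> Tokens(string input)
    {
        var tokens = new List<string>();
        // Same alternation as in the question, but without the outer "+",
        // so each run becomes its own Match instead of a Capture.
        foreach (Match m in Regex.Matches(input, "[0-9]+|[a-z]+"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Tokens("abc123xyz456defFOO")));
        // prints: abc,123,xyz,456,def
    }
}
```

Each token arrives as a separate Match here, so Match.Value is enough and no CaptureCollection is needed.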
This might sound like a very basic question, but it's one that's given me quite a lot of trouble in C#.
Assume I have, for example, the following Strings known as my chosenTarget.titles:
2008/SD128934 - Wordz aaaaand more words (1233-26-21)
20998/AD1234 - Wordz and less words (1263-21-21)
208/ASD12345 - Wordz and more words (1833-21-21)
Now as you can see, all three Strings are different in some ways.
What I need is to extract a very specific part of these Strings, but getting the subtleties right is what confuses me, and I was wondering if some of you knew better than I.
What I know is that the Strings will always come in the following pattern:
yearNumber + "/" + aFewLetters + theDesiredNumber + " - " + descriptiveText + " (" + someDate + ")"
In the above example, what I would want to return to me would be:
128934
1234
12345
I need to extract theDesiredNumber.
Now, I'm not (that) lazy so I have made a few attempts myself:
var a = chosenTarget.title.Substring(chosenTarget.title.IndexOf("/") + 1, chosenTarget.title.Length - chosenTarget.title.IndexOf("/"));
What this has done is sliced out yearNumber and the /, leaving me with aFewLetter before theDesiredNumber.
I have a hard time properly removing the rest however, and I was wondering if any of you could aid me in the matter?
It sounds as if you only need to extract the number behind the first / which ends at -. You could use a combination of string methods and LINQ:
int startIndex = str.IndexOf("/");
string number = null;
if (startIndex >= 0 )
{
int endIndex = str.IndexOf(" - ", startIndex);
if (endIndex >= 0)
{
startIndex++;
string token = str.Substring(startIndex, endIndex - startIndex); // SD128934
number = String.Concat(token.Where(char.IsDigit)); // 128934
}
}
Another mainly LINQ approach using String.Split:
number = String.Concat(
str.Split(new[] { " - " }, StringSplitOptions.None)[0]
.Split('/')
.Last()
.Where(char.IsDigit));
Try this:
int indexSlash = chosenTarget.title.IndexOf("/");
int indexDash = chosenTarget.title.IndexOf("-");
string result = new string(chosenTarget.title.Substring(indexSlash, indexDash - indexSlash).Where(c => Char.IsDigit(c)).ToArray()); // "out" is a reserved word in C#, so the variable is named result
You can use a regex:
var pattern = @"(?:[0-9]+/\w+)[0-9]"; // verbatim string, so \w reaches the regex engine unchanged
var matcher = new Regex(pattern);
var result = matcher.Matches(yourEntireSetOfLinesInAString);
Or you can loop every line and use Match instead of Matches. In this case you don't need to build a "matcher" in every iteration but build it outside the loop
Regex is your friend:
(new [] {"2008/SD128934 - Wordz aaaaand more words (1233-26-21)",
"20998/AD1234 - Wordz and less words (1263-21-21)",
"208/ASD12345 - Wordz and more words (1833-21-21)"})
.Select(x => new Regex(@"\d+/[A-Z]+(\d+)").Match(x).Groups[1].Value)
The pattern you had recognized is very important, here is the solution:
const string pattern = @"\d+\/[a-zA-Z]+(\d+).*$";
string s1 = @"2008/SD128934 - Wordz aaaaand more words(1233-26-21)";
string s2 = @"20998/AD1234 - Wordz and less words(1263-21-21)";
string s3 = @"208/ASD12345 - Wordz and more words(1833-21-21)";
var strings = new List<string> { s1, s2, s3 };
var desiredNumber = string.Empty;
foreach (var s in strings)
{
var match = Regex.Match(s, pattern);
if (match.Success)
{
desiredNumber = match.Groups[1].Value;
}
}
I would use a RegEx for this, the string you're looking for is in Match.Groups[1]
string composite = "2008/SD128934 - Wordz aaaaand more words (1233-26-21)";
Match m = Regex.Match(composite, @"^\d{4}\/[a-zA-Z]+(\d+)");
if (m.Success) Console.WriteLine(m.Groups[1]);
The breakdown of the RegEx is as follows
"^\d{4}\/[a-zA-Z]+(\d+)"
^ - Indicates that it's the beginning of the string
\d{4} - Four digits
\/ - /
[a-zA-Z]+ - One or more letters
(\d+) - One or more digits (the parentheses indicate that this part is captured as a group, in this case group 1)
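The sample titles in the question have year parts of three to five digits (208, 2008, 20998), so a pattern anchored on \d{4} would only match the first of them. The sketch below uses \d+ instead; TitleParser and ExtractNumber are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleParser
{
    // Hypothetical helper: returns the digits that follow the letters after
    // the '/', or null when the title does not fit the expected shape.
    public static string ExtractNumber(string title)
    {
        // \d+ (not \d{4}) because the year part varies in length in the samples.
        Match m = Regex.Match(title, @"^\d+/[A-Za-z]+(\d+)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string[] titles =
        {
            "2008/SD128934 - Wordz aaaaand more words (1233-26-21)",
            "20998/AD1234 - Wordz and less words (1263-21-21)",
            "208/ASD12345 - Wordz and more words (1833-21-21)"
        };
        foreach (string t in titles)
        {
            Console.WriteLine(ExtractNumber(t));
        }
        // prints 128934, 1234 and 12345, one per line
    }
}
```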
I am seeking a way to search a string for an exact match or whole word match. RegEx.Match and RegEx.IsMatch don't seem to get me where I want to be. Consider the following scenario:
namespace test
{
class Program
{
static void Main(string[] args)
{
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
int indx = str.IndexOf("TOTAL");
string amount = str.Substring(indx + "TOTAL".Length, 10);
string strAmount = Regex.Replace(amount, "[^.0-9]", "");
Console.WriteLine(strAmount);
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
}
}
The output of the above code is:
// 34.37
// Press any key to continue...
The problem is, I don't want SUBTOTAL, but IndexOf finds the first occurrence of the word TOTAL which is in SUBTOTAL which then yields the incorrect value of 34.37.
So the question is, is there a way to force IndexOf to find only an exact match or is there another way to force that exact whole word match so that I can find the index of that exact match and then perform some useful function with it. RegEx.IsMatch and RegEx.Match are, as far as I can tell, simply boolean searches. In this case, it isn't enough to just know the exact match exists. I need to know where it exists in the string.
Any advice would be appreciated.
You can use Regex
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var indx = Regex.Match(str, @"\WTOTAL\W").Index; // will be 18
My method is faster than the accepted answer because it does not use Regex.
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var indx = str.IndexOfWholeWord("TOTAL");
public static int IndexOfWholeWord(this string str, string word)
{
for (int j = 0; j < str.Length &&
(j = str.IndexOf(word, j, StringComparison.Ordinal)) >= 0; j++)
if ((j == 0 || !char.IsLetterOrDigit(str, j - 1)) &&
(j + word.Length == str.Length || !char.IsLetterOrDigit(str, j + word.Length)))
return j;
return -1;
}
You can use word boundaries, \b, and the Match.Index property:
var text = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var idx = Regex.Match(text, @"\bTOTAL\b").Index;
// => 19
See the C# demo.
The \bTOTAL\b matches TOTAL when it is not enclosed with any other letters, digits or underscores.
If you need to count a word as a whole word if it is enclosed with underscores, use
var idx = Regex.Match(text, @"(?<![^\W_])TOTAL(?![^\W_])").Index;
where (?<![^\W_]) is a negative lookbehind that fails the match if there is a letter or digit immediately to the left of the current location (so the word may be preceded by the start of the string, or by any character that is neither a letter nor a digit, underscore included), and (?![^\W_]) is the matching negative lookahead, which succeeds only at the end of the string or before a character that is neither a letter nor a digit.
If the boundaries are whitespaces or start/end of string use
var idx = Regex.Match(text, @"(?<!\S)TOTAL(?!\S)").Index;
where (?<!\S) requires start of string or a whitespace immediately on the left, and (?!\S) requires the end of string or a whitespace on the right.
NOTE: \b, (?<!...) and (?!...) are non-consuming patterns, that is the regex index does not advance when matching these patterns, thus, you get the exact positions of the word you search for.
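To illustrate the note about non-consuming patterns, here is a small sketch that compares the index reported by a consuming pattern (\W on both sides) with the zero-width variants. BoundaryIndexDemo and IndexOfPattern are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryIndexDemo
{
    // Hypothetical helper: index of the first match of pattern in text, or -1.
    public static int IndexOfPattern(string text, string pattern)
    {
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        string text = "SUBTOTAL 34.37 TAX TOTAL 37.43";

        // \W consumes the space before TOTAL, so the match starts one char early.
        Console.WriteLine(IndexOfPattern(text, @"\WTOTAL\W"));          // 18

        // Zero-width assertions consume nothing, so the index is that of 'T'.
        Console.WriteLine(IndexOfPattern(text, @"\bTOTAL\b"));          // 19
        Console.WriteLine(IndexOfPattern(text, @"(?<!\S)TOTAL(?!\S)")); // 19

        // No match: the helper reports -1 instead of Match.Empty's index 0.
        Console.WriteLine(IndexOfPattern(text, @"\bGRAND\b"));          // -1
    }
}
```

Checking Match.Success matters here because a failed match has Index 0, which would otherwise look like a hit at the start of the string.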
To make the accepted answer a little bit safer (since IndexOf returns -1 for unmatched):
string pattern = String.Format(@"\b{0}\b", Regex.Escape(findTxt)); // Regex.Escape keeps metacharacters in findTxt literal
Match mtc = Regex.Match(queryTxt, pattern);
if (mtc.Success)
{
return mtc.Index;
}
else
return -1;
While this may be a hack that just works for only your example, try
int indx = str.IndexOf(" TOTAL");
string amount = str.Substring(indx + " TOTAL".Length, 10);
giving an extra space before total. As this will not occur with SUBTOTAL, it should skip over the word you don't want and just look for an isolated TOTAL.
I'd recommend the Regex solution from L.B. too, but if you can't use Regex, then you could use String.LastIndexOf("TOTAL"). Assuming the TOTAL always comes after SUBTOTAL?
http://msdn.microsoft.com/en-us/library/system.string.lastindexof(v=vs.110).aspx
I have a very simple regex like this in C#:
(var \= 0\;)
But when I try to match this against a string that has only one occurrence of the pattern, I get multiple groups returned. The input string is:
foo bar
var = 0;
foo
I get 1 match returned by the Regex object, but inside I see two groups, each has 1 capture, which is the string I want.
I need the grouping parentheses in the regex because this is part of a bigger regex, and I need this to be captured as a group.
What am I doing wrong?
EDIT
This is the C# code I'm using:
private const string REGEX = "(var \\= [0]\\;)";
MatchCollection matches = Regex.Matches(inputStr, REGEX);
foreach (Match m in matches)
{
foreach (Group g in m.Groups)
{
Console.WriteLine("group[" + g.Captures.Count + "]: '" + g.ToString() + "'");
}
}
This is what I get:
group[1]: 'var = 0;'
group[1]: 'var = 0;'
My question is, why do I get two groups and not one?
EDIT #2:
A more complicated pattern shows the problem. The pattern:
# preceding comment
class
{
(param1 = "val1", param2 = "val2", param3 = val3)
}
[
# inside comment
setting1 = 0;
setting2 = 0;
]
The regex I'm using: (it's probably not the most obvious, but you can paste it in a regex viewer if you want to check it out)
(\#[^\n]*)?(?:[\s\r\n]*)domain(?:[\s\r\n]*)\{(?:[\s\r\n]*)\((?:[\s\r\n]*)(((?:[\s\r\n]*)(accountName(?:[\s\r\n]*)\=(?:[\s\r\n]*)\"[^"]+\"[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(tableName(?:[\s\r\n]*)\=(?:[\s\r\n]*)\"[^"]+\"[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(cap(?:[\s\r\n]*)\=(?:[\s\r\n]*)[\d]+[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(MinPartitionCount(?:[\s\r\n]*)\=(?:[\s\r\n]*)[\d]+[,]?)(?:[\s\r\n]*)))+\)(?:[\s\r\n]*)\}(?:[\s\r\n]*)\[(?:[\s\r\n]*)(\#[^\n]*)?(?:[\s\r\n]*)((?:[\s\r\n]*)(IsSplitEnabled(?:[\s\r\n]*)\=(?:[\s\r\n]*)[0|1](?:[\s\r\n]*)\;)(?:[\s\r\n]*)|(?:[\s\r\n]*)(IsMergeEnabled(?:[\s\r\n]*)\=(?:[\s\r\n]*)[0|1](?:[\s\r\n]*)\;)(?:[\s\r\n]*))*(?:[\s\r\n]*)\]
And I'm getting:
group:1: '# preceding comment
domain
{
(param1 = "val1", param2 = "val2", param3 = val3)
}
[
# inside comment
setting1 = 0;
setting2 = 0;
]'
'roup:1: '# preceding comment
group:3: 'cap = 1200'
group:1: 'param1 = "val1", '
group:1: 'param1 = "val1",'
group:1: 'param2 = "val2", '
group:1: 'param2 = "val2",'
group:1: 'param3 = val3'
group:1: 'param3 = val3'
'roup:1: '# inside comment
group:2: 'setting1 = 0;
'
group:1: 'setting1 = 0;'
group:1: 'setting2 = 0;'
According to the documentation, the first element of the GroupCollection is the entire match, not the first group created by ().
From near the bottom of the Remarks section here:
If the regular expression engine can find a match, the first element
of the GroupCollection object returned by the Groups property contains
a string that matches the entire regular expression pattern. Each subsequent element
represents a captured group, if the regular expression includes capturing groups.
Due to this, both items 0 and 1 are identical given the RegEx you are currently using. To only see the actual group matches, you could skip the first element of the GroupCollection, and only process the groups you have defined in the RegEx.
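A minimal sketch of skipping element 0, using the simple pattern and input from the question. GroupZeroDemo and ExplicitGroups are illustrative names, not part of the .NET API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupZeroDemo
{
    // Hypothetical helper: values of the groups written with parentheses,
    // i.e. everything except Groups[0] (the whole match).
    public static List<string> ExplicitGroups(Match m)
    {
        var values = new List<string>();
        for (int i = 1; i < m.Groups.Count; i++)
        {
            values.Add(m.Groups[i].Value);
        }
        return values;
    }

    public static void Main()
    {
        Match m = Regex.Match("foo bar\nvar = 0;\nfoo", @"(var = 0;)");
        Console.WriteLine(m.Groups.Count);                      // 2
        Console.WriteLine(m.Groups[0].Value);                   // var = 0;
        Console.WriteLine(string.Join("|", ExplicitGroups(m))); // var = 0;
    }
}
```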
EDIT
After investigating the additional data, I think I may have found the cause of your duplicates.
I believe that you are seeing more than one Match, and so the outer foreach loop runs twice, not once. This is because there are 2 separate lines with "= 0;" in the example.
Here is LinqPad example code that shows 2 matches being found, and therefore multiple duplicate groups being output. (note, I used the simple regex you provided to test, since the long regex didn't provide any matches)
static string inputStr = "# preceding comment \r\n" +
"class\r\n" +
"{\r\n" +
" (param1 = \"val1\", param2 = \"val2\", param3 = val3)\r\n" +
"}\r\n" +
"[\r\n" +
" # inside comment\r\n" +
" setting1 = 0;\r\n" +
" setting2 = 0;\r\n" +
"]\r\n";
const string REGEX = "(\\= [0]\\;)";
void Main()
{
var regex = new System.Text.RegularExpressions.Regex(REGEX);
MatchCollection matches = regex.Matches(inputStr);
Console.WriteLine("Matches:{0}", matches.Count);
int matchCnt = 0;
foreach (Match m in matches)
{
int groupCnt = 0;
foreach (Group g in m.Groups)
{
Console.WriteLine("match[{0}] group[{1}]: Captures:{2} '{3}'", matchCnt, groupCnt, g.Captures.Count, g);
//g.Dump();
groupCnt++;
}
matchCnt++;
}
Console.WriteLine("Done!");
}
And here is the output generated by LinqPad when this code runs:
Matches:2
match[0] group[0]: Captures:1 '= 0;'
match[0] group[1]: Captures:1 '= 0;'
match[1] group[0]: Captures:1 '= 0;'
match[1] group[1]: Captures:1 '= 0;'
Done!
I want to match following pattern:
key="value" key="value" key="value" key="value" ...
where key and value are [a-z0-9]+, both should be grouped (2 groups, the " - chars can be matched or skipped)
input that should not be matched:
key="value"key="value" (no space between pairs)
So far I have this (not .NET syntax):
([a-z0-9]+)=(\"[a-z0-9]+\")(?=\s|$)
The problem with that is that it matches key4="value4" in the input:
key3="value3"key4="value4"
The spec isn't very clear, but you can try:
(?<!\S)([a-z0-9]+)=("[a-z0-9]+")(?!\S)
Or, as a C# string literal:
"(?<!\\S)([a-z0-9]+)=(\"[a-z0-9]+\")(?!\\S)"
This uses negative lookarounds to ensure that the key-value pair is neither preceded nor followed by non-whitespace characters.
Here's an example snippet (as seen on ideone.com):
var input = "key1=\"value1\" key2=\"value2\"key3=\"value3\" key4=\"value4\"";
Console.WriteLine(input);
// key1="value1" key2="value2"key3="value3" key4="value4"
Regex r = new Regex("(?<!\\S)([a-z0-9]+)=(\"[a-z0-9]+\")(?!\\S)");
foreach (Match m in r.Matches(input)) {
Console.WriteLine(m);
}
// key1="value1"
// key4="value4"
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
On validating the entire input
You can use Regex.IsMatch to see if the input string matches against what should be the correct input pattern. You can also use the same pattern to extract the keys/values, thanks to the fact that .NET regex lets you access individual captures.
string[] inputs = {
"k1=\"v1\" k2=\"v2\" k3=\"v3\" k4=\"v4\"",
"k1=\"v1\" k2=\"v2\"k3=\"v3\" k4=\"v4\"",
" k1=\"v1\" k2=\"v2\" k3=\"v3\" k4=\"v4\" ",
" ",
" what is this? "
};
Regex r = new Regex("^\\s*(?:([a-z0-9]+)=\"([a-z0-9]+)\"(?:\\s+|$))+$");
foreach (string input in inputs) {
Console.Write(input);
if (r.IsMatch(input)) {
Console.WriteLine(": MATCH!");
Match m = r.Match(input);
CaptureCollection keys = m.Groups[1].Captures;
CaptureCollection values = m.Groups[2].Captures;
int N = keys.Count;
for (int i = 0; i < N; i++) {
Console.WriteLine(i + "[" + keys[i] + "]=>[" + values[i] + "]");
}
} else {
Console.WriteLine(": NO MATCH!");
}
}
The above prints (as seen on ideone.com):
k1="v1" k2="v2" k3="v3" k4="v4": MATCH!
0[k1]=>[v1]
1[k2]=>[v2]
2[k3]=>[v3]
3[k4]=>[v4]
k1="v1" k2="v2"k3="v3" k4="v4": NO MATCH!
k1="v1" k2="v2" k3="v3" k4="v4" : MATCH!
0[k1]=>[v1]
1[k2]=>[v2]
2[k3]=>[v3]
3[k4]=>[v4]
: NO MATCH!
what is this? : NO MATCH!
References
Is there a regex flavor that allows me to count the number of repetitions matched by the * and + operators?
Explanation of the pattern
The pattern to validate the entire input is essentially:
  maybe leading
  spaces      ___ end of string anchor
    |        /
^\s*(entry)+$
|           \
beginning    \__ one or more entry
of string
anchor
Where each entry is:
key=value(\s+|$)
That is, a key/value pair followed by either spaces or the end of the string.
I think SilentGhost's proposal is about using String.Split()
Like this :
String keyValues = "...";
foreach(String keyValuePair in keyValues.Split(' '))
Console.WriteLine(keyValuePair);
This is definitely faster and simpler.
Use a lookbehind like you used your lookahead:
(?<=\s|^)([a-z0-9]+)=(\"[a-z0-9]+\")(?=\s|$)
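A rough sketch of that lookbehind/lookahead pair in C#, run against the mixed input used earlier in the thread. KeyValueDemo and Pairs are hypothetical names, and as an assumption the quotes are moved outside the second group so the value is captured without them (the question allows either).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueDemo
{
    // Hypothetical helper: returns "key=value" for every pair that is
    // delimited by whitespace or by the start/end of the string.
    public static List<string> Pairs(string input)
    {
        var pairs = new List<string>();
        // Assumption: quotes sit outside group 2 so the value comes back bare.
        var r = new Regex("(?<=\\s|^)([a-z0-9]+)=\"([a-z0-9]+)\"(?=\\s|$)");
        foreach (Match m in r.Matches(input))
        {
            pairs.Add(m.Groups[1].Value + "=" + m.Groups[2].Value);
        }
        return pairs;
    }

    public static void Main()
    {
        string input = "key1=\"value1\" key2=\"value2\"key3=\"value3\" key4=\"value4\"";
        Console.WriteLine(string.Join(",", Pairs(input)));
        // prints: key1=value1,key4=value4
    }
}
```

The pairs key2 and key3 are rejected because they touch each other with no whitespace in between.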
I second Jens' answer (but am still too puny to comment on others' answers).
Also, I've found this Regular Expressions Reference site to be quite awesome. There's a section on Lookaround about halfway down on the Advanced page, and some further notes about Lookbehind.
I'm just beginning to use Regex so bear with my terminology. I have a regex pattern that is working properly on a string. The string could be in the format "text [pattern] text". Therefore, I also have a regex pattern that negates the first pattern. If I print out the results from each of the matches everything is shown correctly.
The problem I'm having is that I want to add text to the string, and doing so changes the indexes of the matches in a regex MatchCollection. For example, if I wanted to enclose each found match in <td> and </td> tags, I have the following code:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
if (mc.Count > 0)
{
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index, mc[i].Length);
text = text.Insert(mc[i].Index, "<td>" + mc[i].Value + "</td>");
}
}
This works great for the first match. But as you'd expect the mc[i].Index is no longer valid because the string has changed. Therefore, I tried to search for just a single match in the for loop for the amount of matches I would expect (mc.Count), but then I keep finding the first match.
So hopefully without introducing more regex to make sure it's not the first match and with keeping everything in one string, does anybody have any input on how I could accomplish this? Thanks for your input.
Edit: Thank you all for your responses, I appreciate all of them.
It can be as simple as:-
string newString = Regex.Replace("abc", "b", "<td>${0}</td>");
Results in a<td>b</td>c.
In your case:-
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
text = r.Replace(text, "<td>${0}</td>");
This will replace all occurrences of negRegexPattern with the content of that match surrounded by the td element.
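As a self-contained sketch of the same substitution, with a made-up digit pattern standing in for negRegexPattern (TdWrapper and WrapInTd are hypothetical names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class TdWrapper
{
    // Hypothetical helper: wraps every match of pattern in <td>...</td>.
    // ${0} in the replacement string stands for the whole match.
    public static string WrapInTd(string text, string pattern)
    {
        return Regex.Replace(text, pattern, "<td>${0}</td>");
    }

    public static void Main()
    {
        Console.WriteLine(WrapInTd("a1b22c", "[0-9]+"));
        // prints: a<td>1</td>b<td>22</td>c
    }
}
```

Because Regex.Replace builds the result in one pass, there are no match indexes to keep in sync.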
Although I agree that the Regex.Replace answer above is the best choice, just to answer the question you asked: how about replacing from the last match to the first? That way the string only grows after the matches you have not processed yet, so the indexes of the earlier matches are still valid.
for (int i = mc.Count - 1; i >= 0; --i)
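Filled out into a complete method, the back-to-front loop could look roughly like this. ReverseWrapDemo and WrapMatches are illustrative names, and the digit pattern in Main is only a stand-in for the real one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReverseWrapDemo
{
    // Hypothetical helper: wraps each match in <td>...</td>, working from the
    // last match to the first so earlier indexes are never shifted.
    public static string WrapMatches(string text, string pattern)
    {
        // The collection is computed against the original string.
        MatchCollection mc = Regex.Matches(text, pattern);
        for (int i = mc.Count - 1; i >= 0; --i)
        {
            Match m = mc[i];
            text = text.Remove(m.Index, m.Length)
                       .Insert(m.Index, "<td>" + m.Value + "</td>");
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(WrapMatches("a1b22c", "[0-9]+"));
        // prints: a<td>1</td>b<td>22</td>c
    }
}
```

Note the loop condition is i >= 0; stopping at i > 0 would leave the first match unwrapped.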
static string Tabulate(Match m)
{
return "<td>" + m.ToString() + "</td>";
}
static void Replace()
{
string text = "your text";
string result = Regex.Replace(text, "your_regexp", new MatchEvaluator(Tabulate));
}
You can try something like this:
Regex.Replace(input, pattern, match =>
{
return "<td>" + match.Value + "</td>";
});
Keep a counter that starts at zero before the loop, and add the number of characters you inserted on every iteration. For example:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
int counter = 0;
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index + counter, mc[i].Length);
text = text.Insert(mc[i].Index + counter, "<td>" + mc[i].Value + "</td>");
counter += ("<td>" + "</td>").Length;
}
I haven't tested this, but it SHOULD work.