C# Regex string parsing - c#

I have the expression already written, but whenever I run the code I get the entire string and a whole bunch of null values:
Regex regex = new Regex(@"y=\([0-9]\)\([0-9]\)(\s|)\+(\s+|)[0-9]");
Match match = regex.Match("y=(4)(5)+6");
for (int i = 0; i < match.Length; i++)
{
MessageBox.Show(i+"---"+match.Groups[i].Value);
}
Expected output: 4, 5, 6 (in different MessageBoxes)
Actual output: y=(4)(5)+6
It finds if the entered string is correct, but once it does I can't get the specific values (the 4, 5, and 6). What can I do to possibly get that code? This is probably something very simple, but I've tried looking at the MSDN match.NextMatch article and that doesn't seem to help either.
Thank you!

As it currently stands, you don't have any groups specified (except around the spaces).
You specify groups using parentheses. The parentheses you are currently using are escaped with backslashes, so they match literal characters instead of creating groups. Add an extra, unescaped set of parentheses inside them.
Like so:
new Regex(@"y=\(([0-9]+)\)\(([0-9]+)\)\+([0-9]+)");
And with spaces:
new Regex(@"y\s*=\s*\(([0-9]+)\)\s*\(([0-9]+)\)\s*\+\s*([0-9]+)");
This will also allow for spaces between the parts to be optional, since * means 0 or more. This is better than (?:\s+|) that was given above, since you don't need a group for the spaces. It is also better since the pipe means 'or'. What you are saying with \s+| is "One or more spaces OR nothing". This is the same as \s*, which would be "Zero or more spaces".
Also, I used [0-9]+, which means one or more digits, so multi-digit numbers like 10 or 100 are matched. As a side note, [0-9] is safer than \d here: in .NET, \d matches any Unicode decimal digit (not just 0-9) unless RegexOptions.ECMAScript is set.
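For a quick check of the grouping outside of .NET, here is a minimal sketch in Python, whose re module treats these particular constructs (escaped vs. unescaped parentheses, \s*, [0-9]+) the same way. The helper name parse_terms is made up for illustration; it is not part of any library.

```python
import re

# Same pattern as the "with spaces" version above. Only the unescaped
# parentheses create capture groups; \( and \) match literal characters.
PATTERN = re.compile(r"y\s*=\s*\(([0-9]+)\)\s*\(([0-9]+)\)\s*\+\s*([0-9]+)")

def parse_terms(text):
    """Return the three captured numbers as strings, or None if nothing matches."""
    m = PATTERN.search(text)  # search() scans the string, like Regex.Match in .NET
    return m.groups() if m else None

print(parse_terms("y=(4)(5)+6"))
print(parse_terms("y = (10)(200) + 3"))
```

In C# the same values would be read from match.Groups[1] through match.Groups[3].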

You need to name your groups so that you can pull them out later (see "How do I access named capturing groups in a .NET Regex?").
Regex regex = new Regex(@"y=\((?<left>[0-9])\)\((?<right>[0-9])\)(\s|)\+(\s+|)(?<offset>[0-9])");
Then you can pull them out like this:
regex.Match("y=(4)(5)+6").Groups["left"];

Use (named) capturing groups. You will also need to use (?:) instead of () for the groups you don't want to capture. Otherwise, they will be in the result groups, too.
Regex regex = new Regex(@"y=\(([0-9])\)\(([0-9])\)(?:\s|)\+(?:\s+|)([0-9])");
Match match = regex.Match("y=(4)(5)+6");
Console.WriteLine("1: " + match.Groups[1] + ", 2: " + match.Groups[2] + ", 3: " + match.Groups[3]);
If the pattern found a match, its groups are available through the Groups property, which can be accessed by index (index 0 contains the complete match).
You can also name those groups to have more readable code:
Regex regex = new Regex(@"y=\((?<first>[0-9])\)\((?<second>[0-9])\)(?:\s|)\+(?:\s+|)(?<third>[0-9])");
Now, you can access the capturing groups by using match.Groups["first"] and so on.

C# is outside my area of expertise, but this may work:
@"y=\(([0-9])\)\(([0-9])\)(?:\s|)\+(?:\s+|)([0-9])"
It's basically your original regex, but with capturing groups around the numbers, and with the undesired capturing groups changed into non-capturing groups: (?: ... )

Groups[0] will always give you the string that was matched; the empty values are coming from (\s|).
This will work: y=\((\d)\)\((\d)\)\s*\+\s*(\d)
Groups are numbered from 1 by counting the brackets you use, but escaped brackets don't count (you're telling the engine they're just text to match), so those digits need their own brackets. It's also not a good idea to use (x|) when something like ? or * would be more suitable, since you're not interested in capturing that bit.
This will probably be even better: y=\((\d+)\)\((\d+)\)\s*\+\s*(\d+), because it supports values of ten or more.
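To see where the empty values come from, the two variants can be compared side by side. This is a small Python sketch rather than C#; the group numbering rules involved are the same, and groups_for is a hypothetical helper name.

```python
import re

def groups_for(pattern, text):
    """Return the tuple of captured groups, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.groups() if m else None

# (\s|) and (\s+|) are capturing groups that match the empty string here,
# so they show up as empty entries between the digits.
with_space_groups = r"y=\(([0-9])\)\(([0-9])\)(\s|)\+(\s+|)([0-9])"
# \s* needs no group at all, so only the digits are captured.
digits_only = r"y=\(([0-9])\)\(([0-9])\)\s*\+\s*([0-9])"

print(groups_for(with_space_groups, "y=(4)(5)+6"))
print(groups_for(digits_only, "y=(4)(5)+6"))
```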

Related

Regex only returns value when anchor is provided

I'm using the following pattern to match numbers in a string. I know there is only one number in a given string I'm trying to match.
var str = "Store # 100";
var regex = new Regex(@"[0-9]*");
When I call regex.Match(str).Value, this returns an empty string. However, if I change it to:
var regex = new Regex(@"[0-9]*$");
It does return the value I wanted. What gives?
Ok I figured it out.
The problem with [0-9]* (or, more simply, \d*) is that * makes the digits optional, so the pattern also produces a zero-length match at every position before the '100'; Match returns the first of those, at the start of the string.
To rectify this you could use \d\d* (equivalent to \d+), which requires at least one digit and so rules out zero-length matches.
Edit: The dollar version, e.g. \d*$ will only work if your number is at the end of the input string.
According to MSDN,
The quantifiers *, +, and {n,m} and their lazy counterparts never
repeat after an empty match when the minimum number of captures has
been found. This rule prevents quantifiers from entering infinite
loops on empty subexpression matches when the maximum number of
possible group captures is infinite or near infinite.
So, as the minimum number of captures is zero, the [0-9]* pattern returns empty matches at every position where there is no digit. [0-9]+ will capture 100 without any problems.
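The difference between the three patterns is easy to reproduce. Below is a minimal sketch in Python, where the first-match behaviour for these patterns is the same as in .NET; first_match is a hypothetical helper name.

```python
import re

text = "Store # 100"

def first_match(pattern, s):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, s)
    return None if m is None else m.group()

# [0-9]* succeeds immediately at index 0 with a zero-length match.
print(repr(first_match(r"[0-9]*", text)))    # ''
# The $ anchor rejects every empty match that is not at the end of the string.
print(repr(first_match(r"[0-9]*$", text)))   # '100'
# + requires at least one digit, so no anchor is needed.
print(repr(first_match(r"[0-9]+", text)))    # '100'
```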

Regex retrieve second capture group

I have following string (CrLf might be inserted outside {} and ())
{item1}, {item2} (2), {item3} (4), {item4}
(1), {item5},{item6}(5)
I am trying to separate each item to their components and create a JSON from it using regular expression.
the output should look like this
{"name":"item1", "count":""}, {"name":"item2", "count":""}, {"name":"item3", "count":""}, {"name":"item4", "count":""}, {"name":"item5", "count":""},{"name":"item6", "count":""}
So far I have following regex, but it does not capture second group.
\{(.[^,\n\]]*)\}\s*[\((.\d)\)]*
I am replacing the matches with
{\"name\":\"${1}\", \"count\":\"${2}\"}
What am I doing wrong?
Second question
Is it possible to change items without count to zero such that my second capture group read as 0?
For example Instead of changing {item1} to {"name":"item1", "count":""}, it should change to {"name":"item1", "count":"0"}
Your second capture group, [\((.\d)\)], is not valid for capturing numeric information: the square brackets turn it into a character class, so the parentheses inside are literal characters rather than a group, which is why nothing is caught. Also, when capturing numbers it's recommended to use [0-9], because \d can also match other Unicode digits.
The following regex captures only the 2 groups (unlike @revo's answer, which captures an unnecessary group in between):
\{(.[^,\n\]]*)\}(?:\s*\(([0-9]+)\))?
As for the second requirement, regex is used for capturing information from existing data, as far as I am aware it's not possible to inject information that isn't already present. The simplest approach there would be to fix up the JSON after the regex has run.
Or alternatively, you could include a 0 at the start of your replace, that way any empty captures will always have a value of 0 and any captured ones will still be valid but just include a 0 at the beginning e.g. 04/035 etc.
{\"name\":\"$1\", \"count\":\"0$2\"}
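Another option for the default of 0 is a replacement callback, which decides per match what to emit (in .NET, Regex.Replace has an overload that takes a MatchEvaluator for this purpose). A minimal sketch of the idea in Python, using the two-group pattern from this answer; the function name to_json_items is made up for illustration.

```python
import re

# Group 1 is the item name, group 2 the optional count.
ITEM = re.compile(r"\{(.[^,\n\]]*)\}(?:\s*\(([0-9]+)\))?")

def to_json_items(text):
    """Rewrite each {name} (count) item, using "0" when the count is missing."""
    def replace(m):
        count = m.group(2) or "0"  # group 2 is None when there was no (n)
        return '{{"name":"{}", "count":"{}"}}'.format(m.group(1), count)
    return ITEM.sub(replace, text)

source = "{item1}, {item2} (2), {item3} (4), {item4}\n(1), {item5},{item6}(5)"
print(to_json_items(source))
```

This avoids the leading zero (04, 035) that the "0$2" replacement string produces.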
1- Your regular expression is malformed: it uses capturing groups inside a character class [], where parentheses are literal characters.
2- You're not including the second captured group in your replacement pattern.
I updated your Regex to:
\{(.[^,\n\]]*)\}\s*(\((\d*)\))?
I'm going to offer a better regex for this problem.
Update:
{(\w+)}\s*(\((\d+)[),])?
A solution without regex: I extracted the data with Split and Substring, and it seems to work fine.
string a = "{item1}, {item2} (2), {item3} (4), {item4}(1), {item5},{item6}(5)";
foreach (string item in a.Split(','))
{
    Console.WriteLine(item);
    // The name is whatever sits between '{' and '}'.
    int start = item.IndexOf('{') + 1;
    int end = item.IndexOf('}');
    Console.WriteLine(" \t Name : " + item.Substring(start, end - start));
    // The count, if present, sits between '(' and ')'; taking the whole
    // span keeps multi-digit counts intact.
    int open = item.IndexOf('(');
    if (open != -1)
    {
        int close = item.IndexOf(')', open);
        Console.WriteLine(" \t Count : " + item.Substring(open + 1, close - open - 1));
    }
}

Simple regex doesn't work

I want to match the strings "F1" to "F12". I only need the number. I'm out of training - my first try:
var r = new Regex(@"^(?:[F])[\d]{1,2}$");
It matches, but returns "F1", whereas I expected to get "1".
What have I done wrong?
Maybe you want to use lookbehind:
var r = new Regex(@"(?<=^F)\d\d?$");
Even though you are using a non-capturing group for the "F", the overall match for your Regex will return the entire string it matched. Groups are used to outline sub-expressions within your regular expression that you want to be able to extract the value of. Non-capturing groups are used if you want to specify a sub-expression without having it be stored in a group. They allow you to apply quantifiers to your sub-expression, but do not allow you to extract their resulting value after running the regex against a string. They are typically used for performance gains, since capturing groups add extra overhead.
If you want to get just the number, you need to put the number portion in a capturing group and look at the Groups property of the resulting Match (assuming you are calling the r.Match function).
The updated Regex would be:
var r = new Regex(@"^(?:[F])([\d]{1,2})$");
Since our number is inside the first set of parentheses associated with a capturing group, it will be group 1. You could also name your group to avoid confusion or possible errors if the regex gets updated at a later date.
Alternately, you can just use look-behind as M42 has suggested.
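Both approaches can be sketched in a few lines. The example below uses Python rather than C#, on the assumption that the constructs involved (a capturing group, and a one-character lookbehind that contains the ^ anchor) behave the same in both; the function names are hypothetical.

```python
import re

def number_via_group(key):
    """Capture the digits in group 1; the F is matched but not returned."""
    m = re.search(r"^F([0-9]{1,2})$", key)
    return m.group(1) if m else None

def number_via_lookbehind(key):
    """Require '^F' just before the digits, so the whole match is the number."""
    m = re.search(r"(?<=^F)[0-9]{1,2}$", key)
    return m.group() if m else None

print(number_via_group("F12"), number_via_lookbehind("F12"))
# The anchor must sit inside the lookbehind: with ^ in front of it, the
# lookbehind would need an F before the start of the string and never match.
print(re.search(r"^(?<=F)[0-9]{1,2}$", "F12"))
```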

Why doesn't finite repetition in lookbehind work in some flavors?

I want to parse the 2 digits in the middle from a date in dd/mm/yy format but also allowing single digits for day and month.
This is what I came up with:
(?<=^[\d]{1,2}\/)[\d]{1,2}
I want a 1 or 2 digit number [\d]{1,2} with a 1 or 2 digit number and slash ^[\d]{1,2}\/ before it.
This doesn't work on many combinations, I have tested 10/10/10, 11/12/13, etc...
But to my surprise (?<=^\d\d\/)[\d]{1,2} worked.
But the [\d]{1,2} should also match if \d\d did, or am I wrong?
On lookbehind support
Major regex flavors differ in their support for lookbehind: some impose restrictions, and some don't support it at all.
JavaScript: not supported before ES2018
Python: fixed length only
Java: finite length only
.NET: no restriction
References
regular-expressions.info/Flavor comparison
Python
In Python, where only fixed length lookbehind is supported, your original pattern raises an error because \d{1,2} obviously does not have a fixed length. You can "fix" this by alternating on two different fixed-length lookbehinds, e.g. something like this:
(?<=^\d\/)\d{1,2}|(?<=^\d\d\/)\d{1,2}
Or perhaps you can put both lookbehinds as alternates of a non-capturing group:
(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}
(note that you can just use \d without the brackets).
That said, it's probably much simpler to use a capturing group instead:
^\d{1,2}\/(\d{1,2})
Note that findall returns what group 1 captures if you only have one group. Capturing groups are more widely supported than lookbehind, and often lead to a more readable pattern (such as in this case).
This snippet illustrates all of the above points:
import re
p = re.compile(r'(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}')
print(p.findall("12/34/56")) # ['34']
print(p.findall("1/23/45")) # ['23']
p = re.compile(r'^\d{1,2}\/(\d{1,2})')
print(p.findall("12/34/56")) # ['34']
print(p.findall("1/23/45")) # ['23']
p = re.compile(r'(?<=^\d{1,2}\/)\d{1,2}')
# raises re.error: look-behind requires fixed-width pattern
References
regular-expressions.info/Lookarounds, Character classes, Alternation, Capturing groups
Java
Java supports only finite-length lookbehind, so you can use \d{1,2} like in the original pattern. This is demonstrated by the following snippet:
String text =
"12/34/56 date\n" +
"1/23/45 another date\n";
Pattern p = Pattern.compile("(?m)(?<=^\\d{1,2}/)\\d{1,2}");
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group());
} // "34", "23"
Note that (?m) is the embedded Pattern.MULTILINE so that ^ matches the start of every line. Note also that since \ is an escape character for string literals, you must write "\\" to get one backslash in Java.
C-Sharp
C# supports full regex on lookbehind. The following snippet shows how you can use + repetition on a lookbehind:
var text = @"
1/23/45
12/34/56
123/45/67
1234/56/78
";
Regex r = new Regex(@"(?m)(?<=^\d+/)\d{1,2}");
foreach (Match m in r.Matches(text)) {
Console.WriteLine(m);
} // "23", "34", "45", "56"
Note that unlike Java, in C# you can use an @-quoted (verbatim) string so that you don't have to escape \.
For completeness, here's how you'd use the capturing group option in C#:
Regex r = new Regex(@"(?m)^\d+/(\d{1,2})");
foreach (Match m in r.Matches(text)) {
Console.WriteLine("Matched [" + m + "]; month = " + m.Groups[1]);
}
Given the previous text, this prints:
Matched [1/23]; month = 23
Matched [12/34]; month = 34
Matched [123/45]; month = 45
Matched [1234/56]; month = 56
Related questions
How can I match on, but exclude a regex pattern?
Unless there's a specific reason for using the lookbehind which isn't noted in the question, how about simply matching the whole thing and only capturing the bit you're interested in instead?
JavaScript example:
>>> /^\d{1,2}\/(\d{1,2})\/\d{1,2}$/.exec("12/12/12")[1]
"12"
To quote regular-expressions.info:
The bad news is that most regex
flavors do not allow you to use just
any regex inside a lookbehind, because
they cannot apply a regular expression
backwards. Therefore, the regular
expression engine needs to be able to
figure out how many steps to step back
before checking the lookbehind.
Therefore, many regex flavors,
including those used by Perl and
Python, only allow fixed-length
strings. You can use any regex of
which the length of the match can be
predetermined. This means you can use
literal text and character classes.
You cannot use repetition or optional
items. You can use alternation, but
only if all options in the alternation
have the same length.
In other words your regex does not work because you're using a variable-width expression inside a lookbehind and your regex engine does not support that.
In addition to those listed by @polygenelubricants, there are two more exceptions to the "fixed length only" rule. In PCRE (the regex engine for PHP, Apache, et al) and Oniguruma (Ruby 1.9, Textmate), a lookbehind may consist of an alternation in which each alternative may match a different number of characters, as long as the length of each alternative is fixed. For example:
(?<=\b\d\d/|\b\d/)\d{1,2}(?=/\d{2}\b)
Note that the alternation has to be at the top level of the lookbehind subexpression. You might, like me, be tempted to factor out the common elements, like this:
(?<=\b(?:\d\d|\d)/)\d{1,2}(?=/\d{2}\b)
...but it wouldn't work; at the top level, the subexpression now consists of a single alternative with a non-fixed length.
The second exception is much more useful: \K, supported by Perl and PCRE. It effectively means "pretend the match really started here." Whatever appears before it in the regex is treated as a positive lookbehind. As with .NET lookbehinds, there are no restrictions; whatever can appear in a normal regex can be used before the \K.
\b\d{1,2}/\K\d{1,2}(?=/\d{2}\b)
But most of the time, when someone has a problem with lookbehinds, it turns out they shouldn't even be using them. As @insin pointed out, this problem can be solved much more easily by using a capturing group.
EDIT: Almost forgot JGSoft, the regex flavor used by EditPad Pro and PowerGrep; like .NET, it has completely unrestricted lookbehinds, positive and negative.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurrence of a ",". What I want to know is, say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
for (int i = 1; i < m.Groups.Count; i++) {
Group g = m.Groups[i];
if (g.Success) {
// matched text: g.Value
// match start: g.Index
// match length: g.Length
}
}
m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas, and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
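The two-pass idea (match the prefix once, then split the remainder) can be sketched as follows. This is a Python illustration, not C#; parse_values is a hypothetical name, and the input format is assumed to be exactly "value_name" followed by whitespace and a comma-separated list.

```python
import re

def parse_values(line):
    """Return the comma-separated values after 'value_name', or [] if absent."""
    # Pass 1: check the prefix and grab everything after it in one group.
    m = re.match(r"value_name\s+(.+)$", line)
    if m is None:
        return []
    # Pass 2: split on commas, tolerating whitespace around them.
    return [v for v in re.split(r"\s*,\s*", m.group(1).strip()) if v]

print(parse_values("value_name v1,v2, v3 ,v4"))
```

Splitting in a second step sidesteps the problem that a repeated group such as ([^,]+)* exposes only a single captured value per match in most engines.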
