Regex retrieve second capture group - c#

I have the following string (a CRLF may appear outside the {} and () pairs):
{item1}, {item2} (2), {item3} (4), {item4}
(1), {item5},{item6}(5)
I am trying to split each item into its components and build JSON from them using a regular expression.
The output should look like this:
{"name":"item1", "count":""}, {"name":"item2", "count":""}, {"name":"item3", "count":""}, {"name":"item4", "count":""}, {"name":"item5", "count":""},{"name":"item6", "count":""}
So far I have the following regex, but it does not capture the second group:
\{(.[^,\n\]]*)\}\s*[\((.\d)\)]*
I am replacing the matches with
{\"name\":\"${1}\", \"count\":\"${2}\"}
What am I doing wrong?
Second question:
Is it possible to treat items without a count as zero, so that my second capture group reads as 0?
For example, instead of changing {item1} to {"name":"item1", "count":""}, it should change it to {"name":"item1", "count":"0"}.

Your second capture group cannot capture numeric information: [\((.\d)\)] is a character class, so the parentheses inside it are literal characters rather than a group, which is why nothing is captured. Also, when capturing numbers it is safer to use [0-9], because in .NET \d also matches Unicode digits from other scripts (unless RegexOptions.ECMAScript is set).
The following regex captures only the two groups (unlike @revo's answer, which captures an unnecessary group in between):
\{(.[^,\n\]]*)\}(?:\s*\(([0-9]+)\))?
As for the second requirement: regex captures information that is already in the data, and as far as I am aware a replacement pattern cannot inject text that isn't present. The simplest approach would be to fix up the JSON after the regex has run.
Alternatively, you could put a 0 in front of the group in your replacement string. Empty captures then always yield 0, and captured counts remain valid numbers with a leading zero, e.g. 04 or 035.
{\"name\":\"$1\", \"count\":\"0$2\"}
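If the leading zero is not acceptable, one possible alternative in C# is to pass a MatchEvaluator to Regex.Replace and decide per match what the count should be. This is only a sketch under assumptions: the class and method names (ItemJson, ToJson) are made up for illustration, and the pattern is the two-group one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemJson
{
    // Group 1 is the item name, group 2 the optional count.
    private static readonly Regex ItemPattern =
        new Regex(@"\{(.[^,\n\]]*)\}(?:\s*\(([0-9]+)\))?");

    // Hypothetical helper: rewrites every {name} (count) into a JSON object,
    // using "0" when the count group did not take part in the match.
    public static string ToJson(string input)
    {
        return ItemPattern.Replace(input, m =>
        {
            string count = m.Groups[2].Success ? m.Groups[2].Value : "0";
            return "{\"name\":\"" + m.Groups[1].Value + "\", \"count\":\"" + count + "\"}";
        });
    }

    public static void Main()
    {
        string input = "{item1}, {item2} (2), {item3} (4), {item4}\n(1), {item5},{item6}(5)";
        Console.WriteLine(ToJson(input));
    }
}
```

Group.Success distinguishes "the group did not match" from "the group matched an empty string", which is why it is checked here instead of testing for an empty Value.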

1- Your regular expression is malformed: capture groups placed inside a character class [] are treated as literal characters.
2- You're not including the second captured group in your replacement pattern.
I updated your regex to:
\{(.[^,\n\]]*)\}\s*(\((\d*)\))?
I'm going to offer a better regex for this problem.
Update:
{(\w+)}\s*(\((\d+)[),])?

A solution without regex: I extracted the data from the string using the Substring method, and it seems to work fine.
string a = "{item1}, {item2} (2), {item3} (4), {item4}(1), {item5},{item6}(5)";
foreach (string item in a.Split(','))
{
    Console.WriteLine(item);
    // The name is everything between '{' and '}'.
    int start = item.IndexOf('{') + 1;
    int end = item.IndexOf('}');
    Console.WriteLine(" \t Name : " + item.Substring(start, end - start));
    int open = item.IndexOf('(');
    if (open != -1)
    {
        // Take everything between the parentheses so multi-digit counts work too.
        int close = item.IndexOf(')', open);
        Console.WriteLine(" \t Count : " + item.Substring(open + 1, close - open - 1));
    }
}

Related

Failure To Get Specific Text From Regex Group

My pattern works fine with greedy matching when I capture the whole string plus a group (Groups[1] only) enclosed by a single pair of single quotes.
But when the string contains several pairs of single quotes, the group captures only the text enclosed by the last pair, not everything between the first and the last single quote.
string val1 = "Content:abc'23'asad";
string val2 = "Content:'Scale['#13212']'ta";
Match match1 = Regex.Match(val1, @".*'(.*)'.*");
Match match2 = Regex.Match(val2, @".*'(.*)'.*");
if (match1.Success)
{
string value1 = match1.Value;
string GroupValue1 = match1.Groups[1].Value;
Console.WriteLine(value1);
Console.WriteLine(GroupValue1);
string value2 = match2.Value;
string GroupValue2 = match2.Groups[1].Value;
Console.WriteLine(value2);
Console.WriteLine(GroupValue2);
Console.ReadLine();
// With the greedy pattern, val1 gives exactly what I want:
// value1--->Content:abc'23'asad
// GroupValue1--->23
// But for val2 I get only the text enclosed by the last pair of single quotes:
// value2--->Content:'Scale['#13212']'ta
// GroupValue2---> ]
// What I want is GroupValue2--->Scale['#13212']
}
The problem with your existing regex is that you are using too many greedy modifiers. That first one is going to grab everything it can until it runs into the second to last apostrophe in the string. That's why your end result of the second example is just the stuff within the last pair of quotes.
There are a few ways to approach this. The simplest way is to use Slai's suggestion - just a pattern to grab anything and everything within the most "apart" apostrophes available:
'(.*)'
A more explicitly defined approach would be to slightly tweak the pattern you are currently using. Just change the first greedy modifier into a lazy one:
.*?'(.*)'.*
Alternatively, you could change the dot in that first and last section to instead match every character other than an apostrophe:
[^']*'(.*)'[^']*
Which one you end up using depends on what you're personally going after. One thing of note, though, is that according to Regex101, the first option involves the fewest steps, so it will be the most efficient method. However, it also dumps the rest of the string, but I don't know if that matters to you.
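To see the difference between these patterns side by side, a small C# check can run each one against the second input from the question. The helper name FirstGroup and the class name are illustrative only, not anything from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteCapture
{
    // Hypothetical helper: returns Groups[1] of the first match, or null if there is none.
    public static string FirstGroup(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string val2 = "Content:'Scale['#13212']'ta";
        string[] patterns =
        {
            @".*'(.*)'.*",        // original: leading .* is greedy
            @"'(.*)'",            // widest pair of apostrophes
            @".*?'(.*)'.*",       // leading .* made lazy
            @"[^']*'(.*)'[^']*"   // leading/trailing parts exclude apostrophes
        };
        foreach (string p in patterns)
        {
            Console.WriteLine(p + " -> " + FirstGroup(val2, p));
        }
    }
}
```

The original pattern yields ] for the group, while the three alternatives all yield Scale['#13212'].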
First off, use named capture groups such as (?<Data> ... ); you can then access the group by name in C#, e.g. match1.Groups["Data"].Value.
Secondly, try not to use *, which means zero or more. Is there really going to be no data? In the majority of cases the answer is no, there is data.
Use +, which means one or more, instead.
IMHO * screws up more patterns than anything else: because it is allowed to match nothing, it can end up skipping ungodly amounts of data. When you know there is data, use +.
It is better to match on what is known, than unknown and we will create a pattern to what is known. Also in that light use the negation set [^ ] to capture text such as [^']+ which says capture everything that is not a ', one to many times.
Pattern
Content:\x27?[^\x27?]+\x27(?<Data>[^\x27]+?)\x27
The results on your two sets of data are 23 and #13212, available as Groups[1] and Groups["Data"].
Note \x27 is the hex escape of the single quote '. \x22 is for the double quote ", which I bet is what you are really running into.
I use the hex escapes when dealing with quotes so not to have to mess with the C# compiler thinking they are quotes while parsing.

C# Regex Retrieve multiple values based on a single pattern

I have the following string:
"483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)"
What I am trying to do, is to retrieve all values (if they exist) for the following letters (I highlighted necessary values in bold below):
483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)
I tried to retrieve values one by one with the following regex:
string regex = "NP: ^[0-9]^[\\s\\d]([.,\\s\\d][0-9]{1,4})?$";
But with no luck either (I am a newbie in Regex patterns).
Is it possible to retrieve all values in a one string (and then simply loop through the results), or do I have to go one key at the time?
Here is my full code:
string sTest = "483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)";
string regex = "NP: ^[0-9]^[\\s\\d]([.,\\s\\d][0-9]{1,4})?$";
System.Text.RegularExpressions.MatchCollection coll = System.Text.RegularExpressions.Regex.Matches(sTest, regex);
String result = coll[0].Groups[1].Value;
You can't get them all with one regex, unless you are absolutely sure that they will all appear next to each other. Also, what would be the point of getting them all and having to split the result afterwards anyway? Here is a regex which would find the values you wanted:
(ZP|NP|SP|SDS|IP|PvN|ÚP|GP|PRFS): ([^;)]+)
Now the first group will be the key and the second group will be the value.
The idea is:
(x|y|z) matches either x or y or z
[^;)]+ matches something, which is not ; (because this is how they are currently delimited) or ) (for the last position) one or more times
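Used with Regex.Matches, that pattern yields one match per key, so a single loop collects everything. A minimal sketch, assuming a dictionary is a convenient result type (the class and method names are invented for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueExtractor
{
    // Hypothetical helper: maps each recognised key to its raw value text.
    public static Dictionary<string, string> Extract(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(input, @"(ZP|NP|SP|SDS|IP|PvN|ÚP|GP|PRFS): ([^;)]+)"))
        {
            // Group 1 is the key, group 2 the value up to the next ';' or ')'.
            result[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return result;
    }

    public static void Main()
    {
        string sTest = "483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)";
        foreach (KeyValuePair<string, string> pair in Extract(sTest))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

The values come back as text exactly as written in the input (with the space as thousands separator and the comma as decimal separator), so any numeric parsing is a separate step.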
I tried to retrieve values one by one with the following regex:
Let's fix your one-by-one regex:
Caret ^ outside the [] character class means "start of line", so your expression with two carets in different places will not match anything.
Use \d instead of [0-9] and \D instead of [^0-9]
Here is one expression that matches NP: pattern (demo):
NP: \d+\D\d+([.,]\d{1,4})?
Now convert it to an expression that matches other tags like this:
(NP|ZP|SP|...): \d+\D\d+([.,]\d{1,4})?
Applying this pattern in a loop repeatedly will let you extract the tags one by one.

C# Regex string parsing

I have the expression already written, but whenever I run the code I get the entire string and a whole bunch of null values:
Regex regex = new Regex(@"y=\([0-9]\)\([0-9]\)(\s|)\+(\s+|)[0-9]");
Match match = regex.Match("y=(4)(5)+6");
for (int i = 0; i < match.Length; i++)
{
MessageBox.Show(i+"---"+match.Groups[i].Value);
}
Expected output: 4, 5, 6 (in different MessageBoxes)
Actual output: y=(4)(5)+6
It finds if the entered string is correct, but once it does I can't get the specific values (the 4, 5, and 6). What can I do to possibly get that code? This is probably something very simple, but I've tried looking at the MSDN match.NextMatch article and that doesn't seem to help either.
Thank you!
As it currently is, you don't have any groups specified. (Except for around the spaces.)
You can specify groups using parentheses. The parentheses you are currently using are escaped with backslashes, so they are matched as literal characters. Add an extra set of parentheses inside of those.
Like so:
new Regex(@"y=\(([0-9]+)\)\(([0-9]+)\)\+([0-9]+)");
And with spaces:
new Regex(@"y\s*=\s*\(([0-9]+)\)\s*\(([0-9]+)\)\s*\+\s*([0-9]+)");
This will also allow for spaces between the parts to be optional, since * means 0 or more. This is better than (?:\s+|) that was given above, since you don't need a group for the spaces. It is also better since the pipe means 'or'. What you are saying with \s+| is "One or more spaces OR nothing". This is the same as \s*, which would be "Zero or more spaces".
Also, I used [0-9]+, because that means 1 or more digits. This allows numbers with multiple digits, like 10 or 100, to be matched. And another side note, using [0-9] is better than \d since \d refers to more than just the numbers we are used to.
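Putting the pattern with optional spaces to work might look like the sketch below. Note that the loop runs to match.Groups.Count rather than match.Length: Length is the number of characters in the matched text, so the loop in the question most likely walked past the last group and got empty values back. The class and method names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EquationParser
{
    private static readonly Regex Pattern =
        new Regex(@"y\s*=\s*\(([0-9]+)\)\s*\(([0-9]+)\)\s*\+\s*([0-9]+)");

    // Hypothetical helper: returns the three captured numbers, or an empty array on no match.
    public static string[] Parse(string input)
    {
        Match match = Pattern.Match(input);
        if (!match.Success)
        {
            return new string[0];
        }
        // Groups[0] is the whole match, so the captured values start at index 1.
        var values = new string[match.Groups.Count - 1];
        for (int i = 1; i < match.Groups.Count; i++)
        {
            values[i - 1] = match.Groups[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Parse("y=(4)(5)+6")));
    }
}
```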
You need to name your groups so that you can pull them out later (see "How do I access named capturing groups in a .NET Regex?").
Regex regex = new Regex(@"y=\((?<left>[0-9])\)\((?<right>[0-9])\)(\s|)\+(\s+|)(?<offset>[0-9])");
Then you can pull them out like this:
regex.Match("y=(4)(5)+6").Groups["left"];
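As a fuller sketch of the named-group approach, the version below reads all three values by name. It takes the liberty of using [0-9]+ and \s* instead of the single-digit and (\s|) parts, which is an assumption on my side rather than part of the pattern above; the class and method names are also illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroups
{
    private static readonly Regex Pattern = new Regex(
        @"y=\((?<left>[0-9]+)\)\((?<right>[0-9]+)\)\s*\+\s*(?<offset>[0-9]+)");

    // Hypothetical helper: joins the three named captures, or reports that nothing matched.
    public static string Describe(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            return "no match";
        }
        return m.Groups["left"].Value + "," +
               m.Groups["right"].Value + "," +
               m.Groups["offset"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("y=(4)(5)+6"));
    }
}
```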
Use (named) capturing groups. You will also need to use (?:) instead of () for the groups you don't want to capture. Otherwise, they will be in the result groups, too.
Regex regex = new Regex(@"y=(\([0-9]\))(\([0-9]\))(?:\s|)\+(?:\s+|)([0-9])");
Match match = regex.Match("y=(4)(5)+6");
Console.WriteLine("1: " + match.Groups[1] + ", 2: " + match.Groups[2] + ", 3: " + match.Groups[3]);
If the pattern finds a match, its groups are stored in the Groups property, which can be accessed by index (index 0 contains the complete match).
You can also name those groups to have more readable code:
Regex regex = new Regex(@"y=(?<first>\([0-9]\))(?<second>\([0-9]\))(?:\s|)\+(?:\s+|)(?<third>[0-9])");
Now, you can access the capturing groups by using match.Groups["first"] and so on.
C# is outside my area of expertise, but this may work:
@"y=\(([0-9])\)\(([0-9])\)(?:\s|)\+(?:\s+|)([0-9])"
It's basically your original regex, but with capturing groups around the numbers, and with the undesired capturing groups changed into non-capturing groups: (?: ... )
Groups[0] will always give you the string that was matched; the null values are coming from (\s|).
This will work: y=\((\d)\)\((\d)\)\s*\+\s*(\d)
Groups from 1 onward correspond to the brackets you use, but escaped brackets don't count (because you're telling the engine they're just text to match), so those digits need their own brackets. It's also not a good idea to use (x|) when something like ? or * would be more suitable, since you're not capturing that bit.
This will probably be even better: y=\((\d+)\)\((\d+)\)\s*\+\s*(\d+), because it supports multi-digit values.

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions.
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
Use a token parser to split the input into tokens. Use regex to find string patterns.
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and alternatively match the text alone, a regular expression does not give you an indefinite number of groups. Or better said, you can only reference the groups you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one by one. The outer group will correctly match the whole text, and the inner group will match each word in turn, but in most engines you will only be able to reference the last one (.NET is an exception: Group.Captures keeps every capture of a repeated group).
Apart from that, there is no general way to reference a group of repeated matches. The portable approach is to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If an opening quote is found, it matches non-quote characters until the closing quote. Otherwise it matches word characters. Your results are in the groups named "quot" and "words".
You'll have a hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky: .NET returns a collection of Match objects. Each Match has several Groups. The first Group holds the whole match ('hello world'), and the rest hold the sub-matches (', hello world, '). You also get many empty, unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
from g in match.Groups.Cast<Group>().Skip(1)
where g.Success
select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, @"('[^']+')")) {
// Check first if s is a quote.
// If so, split out the quotes.
// If not, do what you intend to do.
}
(Note: you need the brackets in the pattern to make sure Regex.Split returns those too)
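One way the three commented steps could be filled in is sketched below. It assumes that unquoted text should simply be split on spaces and that a quoted part always starts and ends with a single quote; the names QuoteTokenizer and Tokenize are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteTokenizer
{
    // Hypothetical helper: quoted parts become three tokens (', text, '),
    // everything else is split on spaces.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        // The capturing brackets make Regex.Split return the quoted parts as well.
        foreach (string part in Regex.Split(input, @"('[^']+')"))
        {
            bool quoted = part.Length >= 2
                          && part[0] == '\''
                          && part[part.Length - 1] == '\'';
            if (quoted)
            {
                tokens.Add("'");
                tokens.Add(part.Substring(1, part.Length - 2));
                tokens.Add("'");
            }
            else
            {
                tokens.AddRange(part.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return tokens;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Tokenize("'hello world'")));
        Console.WriteLine(string.Join("|", Tokenize("hello world")));
    }
}
```

Regex.Split also returns empty strings before and after a match at the edge of the input; RemoveEmptyEntries drops those, so they never show up as tokens.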
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds zero or more single quotes at the beginning and end of a token. Between them it finds one or more characters in the a-z set (unless you make it case insensitive, it will only find lowercase characters). Group 1 holds the leading ', group 2 holds the word (words are split by anything that is not a character in a-z), and the last group holds the trailing single quote if it exists.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurrence of a ",". What I want to know is, say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespace characters, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times. Remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence, so you have to build it not only to match the 'value_name' at the start of the string, but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
for (int i = 1; i < m.Groups.Count; i++) {
Group g = m.Groups[i];
if (g.Success) {
// matched text: g.Value
// match start: g.Index
// match length: g.Length
}
}
m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas, and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
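In .NET specifically, the repeated-group pattern suggested at the end can be read in a single pass, because every iteration of a group is kept in Group.Captures rather than only the last one. A small sketch, with class and method names invented for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValueList
{
    // Hypothetical helper: returns every value captured by the repeated group.
    public static string[] Values(string line)
    {
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        if (!m.Success)
        {
            return new string[0];
        }
        // Groups[1].Value would hold only the last iteration;
        // Captures holds one entry per iteration.
        CaptureCollection captures = m.Groups[1].Captures;
        var values = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            values[i] = captures[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Values("value_name v1,v2,v3,v4")));
    }
}
```

Other regex engines generally do not expose per-iteration captures, so this does not carry over to the JavaScript variants discussed in this thread.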
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
