How to add a new named group? - c#

I have the following working pattern:
String pattern = #"<(?<field>[^/>]+)>(?<data>.*)</\k<field>>";
String tcSQML = "<Pri_Key1>62</Pri_Key1><First1>SAM</First1><Last1>SPADE</Last1> <GstNo1></GstNo1><Pri_Key2>63</Pri_Key2><First2>TONY</First2><Last2>TUNE</Last2><GstNo2></GstNo2><Pri_Key3>64</Pri_Key3><First3>FRANK</First3><Last3>FAST</Last3><GstNo3></GstNo3><Pri_Key4>65</Pri_Key4><First4>BILLIE</First4><Last4>BLADES</Last4><GstNo4></GstNo4>";
MatchCollection matches = Regex.Matches(tcSQML, pattern, RegexOptions.Singleline);
foreach (Match m in matches)
{
Console.WriteLine(m.Groups["field"].ToString());
Console.WriteLine(m.Groups["number"].ToString());
}
What I want is to be able to also capture number after the field. E.g. in pri_key1 pri_key should be the field and 1 should be the number. I can not figure out how to introduce this new Number group into this pattern. I tried a few variations and nothing works. I am not good with RegEx so help and explanation is appreciated.

That's untested:
String pattern = #"<(?<field>[^0-9/>]+)(?<number>[^/>]*)>(?<data>.*)</\k<field>\k<number>>";
I just added a new group behind the field element. The new field element matches any string that does not contain a digit or the / or the > characters. The number element matches anything that is left between the field and the end of the tag.

Related

How to use Regex.Matches with a start index AND RegexOptions

There doesn't seem to be a way to specify both RegexOptions and a start index when using Regex.Matches.
According to the docs, there is a way to do both individually, but not together.
In the example below, I want matches to contain only the second hEllo in the string text
string pattern = #"\bhello\b";
string text = "hello world. hEllo";
Regex r = new Regex(pattern);
MatchCollection matches;
// matches nothing
matches = r.Matches(text, 5)
// matches the first occurence
matches = Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
Is there a different way to accomplish this?
I don't believe you can. You should instead instantiate Regex using the desired options:
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
and then you can simply use your existing code from the first sample, which should now match since we're using the IgnoreCase option:
matches = r.Matches(text, 5);
Applicable constructor docs
Try it online

Regex Matchcollection groups

I already tried two days to solve the Problem, that I have a MatchCollection. In the patter is a Group and I want to have a list with the Solutions of the Group (there were two or more Solutions).
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "^<tr>$?<td>$?[D-M][i-r],[' '][0-3][1-9].[0-1][1-9].[0-9][0-9]$?</td>$?<td>$?([1-9][0-2]?)$?</td>$?";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
string s = groups[1].Value;
Datum2.Text = s;
}
But only the last match (2) appears in the TextBox "Datum2".
I know that I have to use e.g. a listbox, but the Groups[1].Value is a string...
Thanks for your help and time.
Dieter
First thing you need to correct in the code is Datum2.Text = s; would overwrite the text in Datum2 if it were more than one match.
Now, about your regex,
^ forces a match at the begging of the line, so there is really only 1 match. If you remove it, it'll match twice.
I can't seem to understand what was intended with $? all over the pattern (just take them out).
[' '] matches "either a quote, a space or a quote (no need to repeat characters in a character class.
All dots in [0-3][1-9].[0-1][1-9].[0-9][0-9] need to be escaped. A dot matches any character otherwise.
[0-1][1-9] matches all months except "10". The second character shoud be [0-9] (or \d).
Code:
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
string s= "";
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
s = s + " " + groups[1].Value;
}
Datum2.Text = s;
Output:
1 2
DEMO
You should know that regex is not the tool to parse HTML. It'll work for simple cases, but for real cases do consider using HTML Agility Pack

Missing matches in match collection

I have few user inputs which I am trying to compare via MatchCollection with different patterns I have i.e
string user = "\b(?:I|would|like|to|see|id|of|bought|good)\b";
but When I am matching string
string pattern = "^(?=.*\bgoods?|items?|things?\b)(?=.*\bbought\b)(?=.*\bid\b)(?=.*\btest\b).*$";
MatchCollection mat = Regex.Matches(pattern , user );
foreach (var item in mat)
{
Console.WriteLine(item.ToString());
}
When matching the two above, the match collection only yields/collects match for (id, bought) but ignores "goods" from the first block of pattern. It seems to be ignoring ? lookahead mentioned goods?|items?|things?.
Any suggestions.
Thanks

A probably simple regex expression

I am a complete newb when it comes to regex, and would like help to make an expression to match in the following:
{ValidFunctionName}({parameter}:"{value}")
{ValidFunctionName}({parameter}:"{value}",
{parameter}:"{value}")
{ValidFunctionName}()
Where {x} is what I want to match, {parameter} can be anything $%"$ for example and {value} must be enclosed in quotation marks.
ThisIsValid_01(a:"40")
would be "ThisIsValid_01", "a", "40"
ThisIsValid_01(a:"40", b:"ZOO")
would be "ThisIsValid_01", "a", "40", "b", "ZOO"
01_ThisIsntValid(a:"40")
wouldn't return anything
ThisIsntValid_02(a:40)
wouldn't return anything, as 40 is not enclosed in quotation marks.
ThisIsValid_02()
would return "ThisIsValid_02"
For a valid function name I came across: "[A-Za-z_][A-Za-z_0-9]*"
But I can't for the life of me figure out how to match the rest.
I've been playing around on http://regexpal.com/ to try to get valid matches to all conditions, but to no avail :(
It would be nice if you kindly explained the regex too, so I can learn :)
EDIT: This will work, uses 2 regexs. The first get the function name and everything inside it, the second extracts each pair of params and values from what's inside the function's brackets. You cannot do this with a single regex. Add some [ \t\n\r]* for whitespace.
Regex r = new Regex(#"(?<function>\w[\w\d]*?)\((?<inner>.*?)\)");
Regex inner = new Regex(#",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
List<List<string>> matches = new List<List<string>>();
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<string>();
l.Add(match.Groups["function"].Value);
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
{
l.Add(m.Groups["param"].Value);
l.Add(m.Groups["value"].Value);
}
matches.Add(l);
}
(Old) Solution
(?<function>\w[\w\d]*?)\((?<param>.+?):"(?<value>[^"]*?)"\)
(Old) Explanation
Let's remove the group captures so it is easier to understand: \w[\w\d]*?\(.+?:"[^"]?"\)
\w is the word class, it is short for [a-zA-Z_]
\d is the digit class, it is short for [0-9]
\w[\w\d]*? Makes sure there is valid word character for the start of the function, and then matches zero or more further word or digit characters.
\(.+? Matches a left bracket then one or more of any characters (for the parameter)
:"[^"]*?"\) Matches a colon, then the opening quote, then zero or more of any character except quotes (for the value) then the close quote and right bracket.
Brackets (or parens, as some people call them) as escaped with the backslashes because otherwise they are capturing groups.
The (?<name> ) captures some text.
The ? after each the * and + operators makes them non-greedy, meaning that they will match the least, rather than the most, amount of text.
(Old) Use
Regex r = new Regex(#"(?<function>\w[\w\d]*?)\((?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(aa%£$!:\"lolololol\") _test1(ghgasghe:\"asjkdgh\")";
List<string[]> matches = new List<string[]>();
if(r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
matches.Add(new[] { match.Groups["function"].Value, match.Groups["param"].Value, match.Groups["value"].Value });
}
EDIT: Now you've added an undefined number of multiple parameters, I would recommend making your own parser rather than using regexs. The above example only works with one parameter and strictly no whitespace. This will match multiple parameters with strict whitespace but will not return the parameters and values:
\w[\w\d]*?\(.+?:"[^"]*?"(,.+?:"[^"]*?")*\)
Just for fun, like above but with whitepace:
\w[\w\d]*?[ \t\r\n]*\([ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?"([ \t\r\n]*,[ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?")*[ \t\r\n]*\)
Capturing the text you want will be hard, because you don't know how many captures you are going to have and as such regexs are unsuited.
Someone else has already given an answer that gives you a flat list of strings, but in the interest of strong typing and proper class structure, I’m going to provide a solution that encapsulates the data properly.
First, declare two classes:
public class ParamValue // For a parameter and its value
{
public string Parameter;
public string Value;
}
public class FunctionInfo // For a whole function with all its parameters
{
public string FunctionName;
public List<ParamValue> Values;
}
Then do the matching and populate a list of FunctionInfos:
(By the way, I’ve made some slight fixes to the regexes... it will now match identifiers correctly, and it will not include the double-quotes as part of the “value” of each parameter.)
Regex r = new Regex(#"(?<function>[\p{L}_]\w*?)\((?<inner>.*?)\)");
Regex inner = new Regex(#",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
var matches = new List<FunctionInfo>();
if (r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<ParamValue>();
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
l.Add(new ParamValue
{
Parameter = m.Groups["param"].Value,
Value = m.Groups["value"].Value
});
matches.Add(new FunctionInfo
{
FunctionName = match.Groups["function"].Value,
Values = l
});
}
}
Then you can access the collection nicely with identifiers like FunctionName:
foreach (var match in matches)
{
Console.WriteLine("{0}({1})", match.FunctionName,
string.Join(", ", match.Values.Select(val =>
string.Format("{0}: \"{1}\"", val.Parameter, val.Value))));
}
Try this:
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\)
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*) matches the function name, ^ means start of the line, so that the first character in string must match. You can keep you remove the whitespace capture if you don't need it, I just added it to make the match a little more flexible.
The next set \(((?<parameter>[^:]*):"(?<value>[^"]+)",?)*\) means capture each parameter-value pair inside the parenthesis. You have to escape the parenthesis for the function since they are symbols within the regex syntax.
The ?<> inside parenthesis are named capture groups, which when supported by a library, as they are in .NET, make grabbing the groups in the matches a little easier.
Here:
\w[\w\d]*\s*\(\s*(?:(\w[\w\d]*):("[^"]*"|\d+))*\s*\)
Visualization of that regex here.
For Problems like that I always suggest people not to "find" a single regex but to write multiple regex sharing the work.
But here is my quick shot:
(?<funcName>[A-Za-z_][A-Za-z_0-9]*)
\(
(?<ParamGroup>
(?<paramName>[^(]+?)
:
"(?<paramValue>[^"]*)"
((,\s*)|(?=\)))
)*
\)
The whitespaces are there for better readability. Remove them or set the option to ignore pattern whitespaces.
This regex passes all your test cases:
^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):"(?<value>[^"]*?)",{0,1}\s*)*\)$
This works on multiple parameters and no parameters. It also handles special characters in the param name and whitespace after the comma. There may need to be some adjustments as your test cases do not cover everything you indicate in your text.
Please note that \w usually includes digits and is not appropriate as the leading character of the function name. Reference: http://www.regular-expressions.info/charclass.html#shorthand

Need some quick C# regex help

I have this html:
This is the content.
I just need to get rid of the anchor tag html around the content text, so that all I end up with is "This is the content".
Can I do this using Regex.Replace?
Your regex: <a[^>]+?>(.*?)</a>
Check this Regex with the Regex-class and iterate through the result collection
and you should get your inner text.
String text = "test";
Regex rx = new Regex("<a[^>]+?>(.*?)</a>");
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
Console.WriteLine("{0} matches found. \n", matches.Count);
// Report on each match.
foreach (Match match in matches)
{
Console.WriteLine(match.Value);
Console.WriteLine("Groups:");
foreach (var g in match.Groups)
{
Console.WriteLine(g.ToString());
}
}
Console.ReadLine();
Output:
1 matches found.
test
Groups:
test
test
The match expression in () is stored in the second item of match's Groups collection (the first item is the whole match itself). Each expression in () gets into the Groups collection. See the MSDN for further information.
If you had to use Replace, this'd work for simple string content inside the tag:
Regex r = new Regex("<[^>]+>");
string result = r.Replace(#"This is the content.", "");
Console.WriteLine("Result = \"{0}\"", result);
Good luck
You could also use groups in Regex.
For example, the following would give you the content of any tag.
Regex r = new Regex(#"<a.*>(.*)</a>");
// Regex r = new Regex(#"<.*>(.*)</.*>"); or any kind of tag
var m = r.Match(#"This is the content.");
string content = m.Groups[1].Value;
you use groups in regexes by using the parenthesis, although group 0 is the whole match, not just the group.

Categories

Resources