Determining which pattern matched using Regex.Matches - c#

I'm writing a translator, not as any serious project, just for fun and to become a bit more familiar with regular expressions. From the code below I think you can work out where I'm going with this (cheezburger anyone?).
I'm using a dictionary which uses a list of regular expressions as the keys and the dictionary value is a List<string> which contains a further list of replacement values. If I'm going to do it this way, in order to work out what the substitute is, I obviously need to know what the key is, how can I work out which pattern triggered the match?
var dictionary = new Dictionary<string, List<string>>
{
{"(?!e)ight", new List<string>(){"ite"}},
{"(?!ues)tion", new List<string>(){"shun"}},
{"(?:god|allah|buddah?|diety)", new List<string>(){"ceiling cat"}},
..
}
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
substitute = GetRandomReplacement(dictionary[ ????? ]);
input = input.Replace(metamatch.Value, substitute);
}
Is what I'm attempting possible, or is there a better way to achieve this insanity?

You can name each capture group in a regular expression and then query the value of each named group in your match. This should allow you to do what you want.
For example, using the regular expression below,
(?<Group1>(?!e))ight
you can then extract the group matches from your match result:
match.Groups["Group1"].Captures

You've got another problem. Check this out:
string s = @"My weight is slight.";
Regex r = new Regex(@"(?<!e)ight\b");
foreach (Match m in r.Matches(s))
{
s = s.Replace(m.Value, "ite");
}
Console.WriteLine(s);
output:
My weite is slite.
String.Replace is a global operation, so even though weight doesn't match the regex, it gets changed anyway when slight is found. You need to do the match, lookup, and replace at the same time; Regex.Replace(String, MatchEvaluator) will let you do that.
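A minimal sketch of how the two suggestions above could be combined: each pattern is wrapped in a generated named group so the match reveals which rule fired, and Regex.Replace with a MatchEvaluator does the lookup and substitution in one pass. The class name Translator, the group names r0, r1, ... and the small rule table are assumptions made for illustration, and the first rule uses the lookbehind form (?<!e) from the example above rather than the question's original pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Translator
{
    // Hypothetical rule table: pattern -> candidate replacements.
    private static readonly KeyValuePair<string, string[]>[] Rules =
    {
        new KeyValuePair<string, string[]>(@"(?<!e)ight\b", new[] { "ite" }),
        new KeyValuePair<string, string[]>(@"(?:god|allah|buddah?|diety)", new[] { "ceiling cat" }),
    };

    private static readonly Random Rng = new Random();

    // Wrap rule i in a named group "r<i>" and join the rules with '|'.
    private static readonly Regex Combined = new Regex(
        string.Join("|", Rules.Select((rule, i) => "(?<r" + i + ">" + rule.Key + ")")),
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static string Translate(string input)
    {
        // The evaluator runs once per match, so only text that really matched is replaced.
        return Combined.Replace(input, m =>
        {
            for (int i = 0; i < Rules.Length; i++)
            {
                if (m.Groups["r" + i].Success)
                {
                    string[] options = Rules[i].Value;
                    return options[Rng.Next(options.Length)];
                }
            }
            return m.Value; // not expected: some group always succeeds on a match
        });
    }
}
```

With these rules, Translator.Translate("My weight is slight.") leaves weight alone and returns "My weight is slite.".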

Using named groups like Jeff says is the most robust way.
You can also access the groups by number, as they are expressed in your pattern.
(first)|(second)
can be accessed with
match.Groups[1] // first alternative; match.Groups[2] holds the second
Of course, if you have more parentheses that you don't want to capture, use the non-capturing group syntax (?:...):
((?:f|F)irst)|((?:s|S)econd)
match.Groups[1].Value // still the first alternative; match.Groups[2].Value is the second

Related

Select dynamically a list of char in a string from a DLL

I'm new here and my English will not be the best you'll read today.
I just imported from a DLL a list of "key"
(#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py),
we will name it this way it's a simple string.
I need to select the "key" that compose this string after each # but it has to be done dynamically and not like you choose in an ArrayList [0,1,2 ...].
The end result should look like 8yg54w and after u got this one it's a loop and u get the next one, which means 95jz#e. The first "#" is a separator for each key.
I wanna know how can I proceed to get each key after the first separator.
I'll try to answer your questions because I think that there will be some, this is probably poorly explained, I apologize in advance! Thanks
Your solution may be a simple function that returns an IEnumerable<string>. You can accomplish this by splitting the string and using the yield keyword to return an iterator. E.g.
// Define the splitting function
public IEnumerable<string> GetKeys(string source) {
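// Drop the leading '#' so the first key does not keep it, then split on the "-#" separator.
var splitted = source.TrimStart('#').Split(new[] { "-#" }, StringSplitOptions.None);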
foreach (var key in splitted)
yield return key;
}
// Use it in your code
var myKeys = GetKeys("#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py");
foreach(var k in myKeys) {
// This will print your keys in the console one per line.
Console.WriteLine(k);
}
You can use this approach, but I suggest hiding the logic for getting the next key if you need it to be a unique global key generator, for example in a static class with only a GetNextKey() method built from the code above.
This should return an array of keys.
string.Split("-#");
When you just need the string:
string x = "(#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py)";
Console.WriteLine(string.Join("-", x.Split("-#")));
You can use a Regular expression.
string input = "(#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py)";
MatchCollection matches = Regex.Matches(input, @"(?<=\#)[A-Za-z1-9#]+(?=-)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
Output:
8yg54w
95jz#e
#9ixop
7ps
ny#9qv
bp669x
bp6696
bp6696
bp6696
bp6696
bp6696
fbhstu
ehddtk
Explanation of the regular expression (?<=\#)[A-Za-z1-9#]+(?=-)
General form (?<=prefix)find(?=suffix) finds the pattern find between a prefix and a suffix.
(?<=\#) prefix # (escaped with \).
[A-Za-z1-9#] character set to match (upper and lower case letters + the digits 1 to 9 + #; note that 0 is not included).
+ quantifier: At least one character.
(?=-) suffix -.
I am not sure whether the ) is part of the string. To get the last key ###9py if the string contains ), use (?<=\#)[A-Za-z1-9#]+(?=-|\)) where \) is the escaped right parenthesis. If ) is not in there, use (?<=\#)[A-Za-z1-9#]+(?=-|$) where $ is the end of the string. | means OR, i.e., the suffix is either - OR ), or it is - OR $ (end of the string).
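As a rough sketch, the two lookahead variants can be merged into one pattern whose suffix is -, ) or the end of the string, so the last key is returned either way. The class name KeyExtractor and the shortened sample input in the usage note are assumptions for illustration; the character class is kept exactly as in the answer, so keys containing 0 or + are still not matched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyExtractor
{
    // Prefix '#', then the key characters, then a suffix of '-', ')' or end of string.
    private static readonly Regex KeyPattern =
        new Regex(@"(?<=#)[A-Za-z1-9#]+(?=-|\)|$)");

    public static List<string> Extract(string source)
    {
        return KeyPattern.Matches(source)
                         .Cast<Match>()          // MatchCollection -> IEnumerable<Match>
                         .Select(m => m.Value)
                         .ToList();
    }
}
```

For the shortened input "(#8yg54w-#95jz#e-####9py)", KeyExtractor.Extract returns 8yg54w, 95jz#e and ###9py.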

Splitting a string with a Regex results in wrong output [duplicate]

I'm trying to extract values from a string which are between << and >>. But they could happen multiple times.
Can anyone help with the regular expression to match these;
this is a test for <<bob>> who like <<books>>
test 2 <<frank>> likes nothing
test 3 <<what>> <<on>> <<earth>> <<this>> <<is>> <<too>> <<much>>.
I then want to foreach the GroupCollection to get all the values.
Any help greatly received.
Thanks.
Use a positive look ahead and look behind assertion to match the angle brackets, use .*? to match the shortest possible sequence of characters between those brackets. Find all values by iterating the MatchCollection returned by the Matches() method.
Regex regex = new Regex("(?<=<<).*?(?=>>)");
foreach (Match match in regex.Matches(
"this is a test for <<bob>> who like <<books>>"))
{
Console.WriteLine(match.Value);
}
LiveDemo in DotNetFiddle
While Peter's answer is a good example of using lookarounds for left and right hand context checking, I'd like to also add a LINQ (lambda) way to access matches/groups and show the use of simple numeric capturing groups that come in handy when you want to extract only a part of the pattern:
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;
// ...
var results = Regex.Matches(s, @"<<(.*?)>>", RegexOptions.Singleline)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Same approach with Peter's compiled regex where the whole match value is accessed via Match.Value:
var results = regex.Matches(s).Cast<Match>().Select(x => x.Value);
Note:
<<(.*?)>> is a regex matching <<, then capturing any 0 or more chars as few as possible (due to the non-greedy *? quantifier) into Group 1 and then matching >>
RegexOptions.Singleline makes . match newline (LF) chars, too (it does not match them by default)
Cast<Match>() casts the match collection to an IEnumerable<Match> that you may further access using a lambda
Select(x => x.Groups[1].Value) only returns the Group 1 value from the current x match object
Note you may further create a list or array of obtained values by adding .ToList() or .ToArray() after Select.
In the demo C# code, string.Join(", ", results) generates a comma-separated string of the Group 1 values:
var strs = new List<string> { "this is a test for <<bob>> who like <<books>>",
"test 2 <<frank>> likes nothing",
"test 3 <<what>> <<on>> <<earth>> <<this>> <<is>> <<too>> <<much>>." };
foreach (var s in strs)
{
var results = Regex.Matches(s, @"<<(.*?)>>", RegexOptions.Singleline)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Console.WriteLine(string.Join(", ", results));
}
Output:
bob, books
frank
what, on, earth, this, is, too, much
You can try one of these:
(?<=<<)[^>]+(?=>>)
(?<=<<)\w+(?=>>)
However you will have to iterate the returned MatchCollection.
Something like this:
(<<(?<element>[^>]*)>>)*
This program might be useful:
http://sourceforge.net/projects/regulator/

Receiving one set of numbers with regex

I have info like the following
"id":"456138988365628440_103920","user"657852231654
and I would like to return,
456138988365628440_103920
I know using
"id":"[0-9_]*","user"
will return
"id":"456138988365628440_103920","user"
but I just want the id itself.
You can use capture groups by placing the part you want between parentheses and calling it back using match.Groups[1].Value:
string msg = @"""id"":""456138988365628440_103920"",""user""657852231654""";
var reg = new Regex(@"""id"":""([0-9_]*)"",""user""", RegexOptions.IgnoreCase);
var results = reg.Matches(msg);
foreach (Match match in results)
{
Console.WriteLine(match.Groups[1].Value);
}
ideone demo.
Or you could just use String.Split (if regex are not mandatory):
var input = @"""id"":""456138988365628440_103920"",""user""657852231654""";
// Trim('"') strips the surrounding quotes so only the id itself remains.
var idValue = input.Split(',')[0].Split(':')[1].Trim('"');
Console.WriteLine(idValue);
Output:
456138988365628440_103920
What you need is a kind of conditional statement in your regular expression, which is called a zero-width positive lookbehind assertion.
In other words, you need a statement that says only match numbers which are after the id property.
"id":"456138988365628440_103920","user"657852231654
(?<="id":")[\d_]*
This regular expression would only return the requested number for you.
You can test it here.
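A minimal sketch of using that lookbehind pattern from C#, assuming the input has the shape shown in the question. The class and method names (IdExtractor, ExtractId) are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdExtractor
{
    // Returns the digits/underscores that directly follow "id":" or null if there is no such prefix.
    public static string ExtractId(string text)
    {
        // In a verbatim string, "" stands for one literal double quote.
        Match m = Regex.Match(text, @"(?<=""id"":"")[\d_]*");
        return m.Success ? m.Value : null;
    }
}
```

Because the lookbehind is zero-width, m.Value contains only the id, so no capture group is needed.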

C# sort and put back Regex.matches

Is there any way of using RegEx.Matches to find, and write back matched values but in different (alphabetical) order?
For now I have something like:
var pattern = @"(KEY `[\w]+?` \(`.*`*\))";
var keys = Regex.Matches(line, pattern);
Console.WriteLine("\n\n");
foreach (Match match in keys)
{
Console.WriteLine(match.Index + " = " + match.Value.Replace("\n", "").Trim());
}
But what I really need is to take table.sql dump and sort existing INDEXES alphabetically, example code:
line = "...PRIMARY KEY (`communication_auto`),\n KEY `idx_current` (`current`),\n KEY `idx_communication` (`communication_id`,`current`),\n KEY `idx_volunteer` (`volunteer_id`,`current`),\n KEY `idx_template` (`template_id`,`current`)\n);"
Thanks
J
Update:
Thanks, m.buettner solution gave me basics that I could use to move on. I'm not so good at RegEx sadly, but I ended up with code that I believe can be still improved:
...
//sort INDEXES definitions alphabetically
if (line.Contains(" KEY `")) line = Regex.Replace(
line,
@"[ ]+(KEY `[\w]+` \([\w`,]+\),?\s*)+",
ReplaceCallbackLinq
);
static string ReplaceCallbackLinq(Match match)
{
var result = String.Join(",\n ",
from Capture item in match.Groups[1].Captures
orderby item.Value.Trim()
select item.Value.Trim().Replace("),", ")")
);
return " " + result + "\n";
}
Update:
There is also a case when index field is longer than 255 chars mysql trims index up to 255 and writes it like this:
KEY `idx3` (`app_property_definition_id`,`value`(255),`audit_current`),
so, in order to match this case too I had to change some code:
in ReplaceCallbackLinq:
select item.Value.Trim().Replace("`),", "`)")
and regex definition to:
@"[ ]+(KEY `[\w]+` \([\w`(\(255\)),]+\),?\s*)+",
This cannot be done with regex alone. But you could use a callback function and make use of .NET's unique capability of capturing multiple things with the same capturing group. This way you avoid using Matches and writing everything back by yourself. Instead you can use the built-in Replace function. My example below simply sorts the KEY phrases and puts them back as they were (so it does nothing but sort the phrases within the SQL statement). If you want a different output you can easily achieve that by capturing different parts of the pattern and adjusting the Join operation at the very end.
First we need a match evaluator to pass the callback:
MatchEvaluator evaluator = new MatchEvaluator(ReplaceCallback);
Then we write a regex that matches the whole set of indices at once, capturing the index-names in a capturing group. We put this in the overload of Replace that takes an evaluator:
output = Regex.Replace(
input,
@"(KEY `([\w]+)` \(`[^`]*`(?:,`[^`]*`)*\),?\s*)+",
evaluator
);
Now in most languages this would not be useful, because due to the repetition capturing group 1 would always contain only the first or last thing that was captured (same as capturing group 2). But luckily, you are using C#, and .NET's regex engine is just one powerful beast. So let's have a look at the callback function and how to use the multiple captures:
static string ReplaceCallback(Match match)
{
int captureCount = match.Groups[1].Captures.Count;
string[] indexNameArray = new string[captureCount];
string[] keyBlockArray = new string[captureCount];
for (int i = 0; i < captureCount; i++)
{
keyBlockArray[i] = match.Groups[1].Captures[i].Value;
indexNameArray[i] = match.Groups[2].Captures[i].Value;
}
Array.Sort(indexNameArray, keyBlockArray);
return String.Join("\n ", keyBlockArray);
}
match.Groups[i].Captures lets us access the multiple captures of a single group. Since these are Capture objects, which are not very convenient to sort directly, we build two string arrays from their values. Then we use Array.Sort, which sorts two arrays based on the values of one (which is considered the key). As the "key" we use the capture of the index name. As the "value" we use the full capture of one complete KEY ..., block. This sorts the full blocks by their names. Then we can simply join the blocks together, add in the whitespace separator that was used before and return them.
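A compact variant of the same idea, offered as a sketch rather than a drop-in replacement: here the trailing comma and whitespace are kept outside group 1, so the separators can be re-created after sorting and the last entry does not end up with a stray comma. The class name IndexSorter, the two-space indent and the ordinal comparer are assumptions; index definitions with a prefix length such as `value`(255) are not handled by this pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IndexSorter
{
    // Group 1: one "KEY `name` (...)" block without its trailing comma; group 2: the index name.
    private static readonly Regex KeyRun = new Regex(
        @"(?:(KEY `(\w+)` \(`[^`]*`(?:,`[^`]*`)*\)),?\s*)+");

    public static string SortKeys(string sql)
    {
        return KeyRun.Replace(sql, m =>
        {
            // Every repetition of the outer group adds one capture to groups 1 and 2.
            string[] names = m.Groups[2].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
            string[] blocks = m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();

            // Sort the blocks using the index names as keys.
            Array.Sort(names, blocks, StringComparer.Ordinal);

            // Re-create the separators: comma + newline between blocks, newline after the last.
            return string.Join(",\n  ", blocks) + "\n";
        });
    }
}
```

For example, SortKeys("  KEY `b` (`x`),\n  KEY `a` (`y`,`z`)\n);") returns "  KEY `a` (`y`,`z`),\n  KEY `b` (`x`)\n);".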
Not sure if I fully understand the question, but does changing the foreach to:
foreach (Match match in keys.Cast<Match>().OrderBy(m => m.Value))
do what you want?

A probably simple regex expression

I am a complete newb when it comes to regex, and would like help to make an expression to match in the following:
{ValidFunctionName}({parameter}:"{value}")
{ValidFunctionName}({parameter}:"{value}",
{parameter}:"{value}")
{ValidFunctionName}()
Where {x} is what I want to match, {parameter} can be anything $%"$ for example and {value} must be enclosed in quotation marks.
ThisIsValid_01(a:"40")
would be "ThisIsValid_01", "a", "40"
ThisIsValid_01(a:"40", b:"ZOO")
would be "ThisIsValid_01", "a", "40", "b", "ZOO"
01_ThisIsntValid(a:"40")
wouldn't return anything
ThisIsntValid_02(a:40)
wouldn't return anything, as 40 is not enclosed in quotation marks.
ThisIsValid_02()
would return "ThisIsValid_02"
For a valid function name I came across: "[A-Za-z_][A-Za-z_0-9]*"
But I can't for the life of me figure out how to match the rest.
I've been playing around on http://regexpal.com/ to try to get valid matches to all conditions, but to no avail :(
It would be nice if you kindly explained the regex too, so I can learn :)
EDIT: This will work, uses 2 regexs. The first get the function name and everything inside it, the second extracts each pair of params and values from what's inside the function's brackets. You cannot do this with a single regex. Add some [ \t\n\r]* for whitespace.
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
List<List<string>> matches = new List<List<string>>();
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<string>();
l.Add(match.Groups["function"].Value);
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
{
l.Add(m.Groups["param"].Value);
l.Add(m.Groups["value"].Value);
}
matches.Add(l);
}
(Old) Solution
(?<function>\w[\w\d]*?)\((?<param>.+?):"(?<value>[^"]*?)"\)
(Old) Explanation
Let's remove the group captures so it is easier to understand: \w[\w\d]*?\(.+?:"[^"]*?"\)
\w is the word class; for ASCII text it is equivalent to [a-zA-Z0-9_], so it covers letters, digits and the underscore
\d is the digit class; for ASCII text it is short for [0-9]
\w[\w\d]*? Matches one word character for the start of the function name, and then zero or more further word or digit characters. Because \w also matches digits, this alone does not reject a name that starts with a digit.
\(.+? Matches a left bracket then one or more of any characters (for the parameter)
:"[^"]*?"\) Matches a colon, then the opening quote, then zero or more of any character except quotes (for the value) then the close quote and right bracket.
Brackets (or parens, as some people call them) are escaped with the backslashes because otherwise they are capturing groups.
The (?<name> ) captures some text.
The ? after each of the * and + operators makes them non-greedy, meaning that they will match the least, rather than the most, amount of text.
(Old) Use
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(aa%£$!:\"lolololol\") _test1(ghgasghe:\"asjkdgh\")";
List<string[]> matches = new List<string[]>();
if(r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
matches.Add(new[] { match.Groups["function"].Value, match.Groups["param"].Value, match.Groups["value"].Value });
}
EDIT: Now you've added an undefined number of multiple parameters, I would recommend making your own parser rather than using regexs. The above example only works with one parameter and strictly no whitespace. This will match multiple parameters with strict whitespace but will not return the parameters and values:
\w[\w\d]*?\(.+?:"[^"]*?"(,.+?:"[^"]*?")*\)
Just for fun, like above but with whitespace:
\w[\w\d]*?[ \t\r\n]*\([ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?"([ \t\r\n]*,[ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?")*[ \t\r\n]*\)
Capturing the text you want will be hard, because you don't know how many captures you are going to have and as such regexs are unsuited.
Someone else has already given an answer that gives you a flat list of strings, but in the interest of strong typing and proper class structure, I’m going to provide a solution that encapsulates the data properly.
First, declare two classes:
public class ParamValue // For a parameter and its value
{
public string Parameter;
public string Value;
}
public class FunctionInfo // For a whole function with all its parameters
{
public string FunctionName;
public List<ParamValue> Values;
}
Then do the matching and populate a list of FunctionInfos:
(By the way, I’ve made some slight fixes to the regexes... it will now match identifiers correctly, and it will not include the double-quotes as part of the “value” of each parameter.)
Regex r = new Regex(@"(?<function>[\p{L}_]\w*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
var matches = new List<FunctionInfo>();
if (r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<ParamValue>();
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
l.Add(new ParamValue
{
Parameter = m.Groups["param"].Value,
Value = m.Groups["value"].Value
});
matches.Add(new FunctionInfo
{
FunctionName = match.Groups["function"].Value,
Values = l
});
}
}
Then you can access the collection nicely with identifiers like FunctionName:
foreach (var match in matches)
{
Console.WriteLine("{0}({1})", match.FunctionName,
string.Join(", ", match.Values.Select(val =>
string.Format("{0}: \"{1}\"", val.Parameter, val.Value))));
}
Try this:
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\)
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*) matches the function name. ^ means start of the line, so the first non-whitespace character in the string must be a letter. You can remove the leading \s* if you don't need it; I just added it to make the match a little more flexible.
The next set \(((?<parameter>[^:]*):"(?<value>[^"]+)",?)*\) means capture each parameter-value pair inside the parentheses. You have to escape the parentheses for the function since they are symbols within the regex syntax.
The ?<> inside parenthesis are named capture groups, which when supported by a library, as they are in .NET, make grabbing the groups in the matches a little easier.
Here:
\w[\w\d]*\s*\(\s*(?:(\w[\w\d]*):("[^"]*"|\d+))*\s*\)
Visualization of that regex here.
For Problems like that I always suggest people not to "find" a single regex but to write multiple regex sharing the work.
But here is my quick shot:
(?<funcName>[A-Za-z_][A-Za-z_0-9]*)
\(
(?<ParamGroup>
(?<paramName>[^(]+?)
:
"(?<paramValue>[^"]*)"
((,\s*)|(?=\)))
)*
\)
The whitespaces are there for better readability. Remove them or set the option to ignore pattern whitespaces.
This regex passes all your test cases:
^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):"(?<value>[^"]*?)",{0,1}\s*)*\)$
This works on multiple parameters and no parameters. It also handles special characters in the param name and whitespace after the comma. There may need to be some adjustments as your test cases do not cover everything you indicate in your text.
Please note that \w usually includes digits and is not appropriate as the leading character of the function name. Reference: http://www.regular-expressions.info/charclass.html#shorthand
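Since .NET keeps every capture of a repeated group, a single anchored pattern can also return the flat list the question asks for. The following is a sketch under assumptions: the class name FunctionCallParser is made up, the function-name part is the [A-Za-z_][A-Za-z_0-9]* pattern quoted in the question, and the parameter part follows the anchored regex above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FunctionCallParser
{
    // One function call per input string; each repetition of the inner group
    // adds one capture to "param" and one to "value".
    private static readonly Regex Call = new Regex(
        @"^(?<function>[A-Za-z_][A-Za-z_0-9]*)\((?:(?<param>[^:]*?):""(?<value>[^""]*?)"",?\s*)*\)$");

    // Returns name, param1, value1, param2, value2, ... or an empty list if the text is not valid.
    public static List<string> Parse(string text)
    {
        var parts = new List<string>();
        Match m = Call.Match(text);
        if (!m.Success)
            return parts;

        parts.Add(m.Groups["function"].Value);
        CaptureCollection names = m.Groups["param"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            parts.Add(names[i].Value);
            parts.Add(values[i].Value);
        }
        return parts;
    }
}
```

Parse("ThisIsValid_01(a:\"40\", b:\"ZOO\")") yields ThisIsValid_01, a, 40, b, ZOO, while 01_ThisIsntValid(a:"40") and ThisIsntValid_02(a:40) yield an empty list. This relies on the Captures collection, which is specific to .NET; most other regex engines keep only the last capture of a repeated group.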
