named groups splitting regardless of position of match - c#

Having a hard time explaining what I mean, so here is what I want to do
I want any sentence to be parsed along the pattern of
text #something a few words [someothertext]
for this, the matching sentence would be
Jeremy is trying #20 times to [understand this]
And I would name 4 groups, as text, time, who, subtitle
However, I could also write
#20 Jeremy is trying [understand this] times to
and still get the tokens
#20
Jeremy is trying
times to
understand this
corresponding to the right groups
As long as the delimited tokens can separate the 2 text only tokens, I'm fine.
Is this even possible? I've tried a few regex's and failed miserably (am still experimenting but finding myself spending way too much time learning it)
Note: The order of the tokens can be random. If this isn't possible with regex then I guess I can live with a fixed order.
edit: fixed a typo. clarified further what I wanted.

You can alternate on the different types of text. Using named groups means that one group would have a Success value equal to true for each match.
This pattern should do what you need:
@"(?<Number>#\d+\b)|(?<Subtitle>\[.+?])|\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s*"
(?<Number>#\d+\b) - matches # followed by one or more digits, up to a word boundary
(?<Subtitle>\[.+?]) - non-greedy matching of text between square brackets
\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s* - the \s* on either side consumes surrounding whitespace, so it is normally left out of the capture. The named group matches one character at a time, and each character is accepted only if the negative look-ahead confirms that it is not immediately followed by text matching one of the other 2 patterns (numbers and subtitles).
Example usage:
var inputs = new[]
{
    "Jeremy is trying #20 times to [understand this]",
    "#20 Jeremy is trying [understand this] times to"
};
string pattern = @"(?<Number>#\d+\b)|(?<Subtitle>\[.+?])|\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s*";
foreach (var input in inputs)
{
    Console.WriteLine("Input: " + input);
    foreach (Match m in Regex.Matches(input, pattern))
    {
        // skip first group, which is the entire matched text
        var group = m.Groups.Cast<Group>().Skip(1).First(g => g.Success);
        Console.WriteLine(group.Value);
    }
    Console.WriteLine();
}
Alternately, this example demonstrates how to pair the named groups to the matches:
var re = new Regex(pattern);
foreach (var input in inputs)
{
    Console.WriteLine("Input: " + input);
    var query = from Match m in re.Matches(input)
                from g in re.GetGroupNames().Skip(1)
                where m.Groups[g].Success
                select new
                {
                    GroupName = g,
                    Value = m.Groups[g].Value
                };
    foreach (var item in query)
    {
        Console.WriteLine("{0}: {1}", item.GroupName, item.Value);
    }
    Console.WriteLine();
}

So if I understand this correctly, you're looking for four phrases:
1) 1+ words of normal text
2) 1 word of text prefixed by a #
3) 1+ words of normal text
4) 1+ words of text wrapped by [ ]
My (admittedly slow and regex-less) suggestion would be to find the indexes of the #, [, and ] characters, then use several calls to string.Substring().
This would be acceptable for relatively small strings and a relatively small number of iterations, although with much larger strings this would be extremely slow.
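As a rough illustration of the index-based approach described above, here is a minimal sketch. The class and method names (SubstringTokenizer, Parse) are made up for this example, and it assumes exactly one #token and one [bracketed] block per sentence, with the #token ending at the next space and no # inside the brackets.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringTokenizer
{
    // Hypothetical helper, not from the original answer.
    // Splits "text #number text [subtitle]" (tokens in any order) without regex.
    public static (string Number, string Subtitle, string[] Texts) Parse(string input)
    {
        int hash = input.IndexOf('#');
        int open = input.IndexOf('[');
        int close = open < 0 ? -1 : input.IndexOf(']', open + 1);
        if (hash < 0 || close < 0)
            throw new FormatException("Expected one #token and one [subtitle].");

        // Assumption: the #token runs until the next space (or the end of the string).
        int hashEnd = input.IndexOf(' ', hash);
        if (hashEnd < 0) hashEnd = input.Length;

        string number = input.Substring(hash, hashEnd - hash);
        string subtitle = input.Substring(open + 1, close - open - 1);

        // Whatever lies outside the two delimited spans is plain text.
        var spans = new[] { (Start: hash, End: hashEnd), (Start: open, End: close + 1) };
        Array.Sort(spans, (a, b) => a.Start.CompareTo(b.Start));

        var texts = new List<string>();
        int pos = 0;
        foreach (var span in spans)
        {
            string piece = input.Substring(pos, span.Start - pos).Trim();
            if (piece.Length > 0) texts.Add(piece);
            pos = span.End;
        }
        string tail = input.Substring(pos).Trim();
        if (tail.Length > 0) texts.Add(tail);

        return (number, subtitle, texts.ToArray());
    }

    public static void Main()
    {
        var result = Parse("#20 Jeremy is trying [understand this] times to");
        Console.WriteLine(result.Number);
        Console.WriteLine(result.Subtitle);
        Console.WriteLine(string.Join(" | ", result.Texts));
    }
}
```

Unlike the regex answer, this version simply throws if either delimited token is missing, which may or may not be what you want.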


Extract version number

I am trying to get just the version number from an HTML link.
Take this for example
firefox-10.0.2.bundle
I have got it to take everything after the - with
string versionNum = name.Split('-').Last();
versionNum = Regex.Replace(versionNum, "[^0-9.]", "");
which gives you an output of
10.0.2
However, if the link is like this
firefox-10.0.2.source.tar.bz2
the output will look like
10.0.2...2
How can I make it so that it just chops everything off after the third .? Or can I make it so that when first letter is detected it cuts that and everything that follows?
You could solve this with a single regex match.
Here is an example:
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        // the dots are escaped so that they match a literal '.' only
        Regex regex = new Regex(@"\d+\.\d+\.\d+");
        Match match = regex.Match("firefox-10.0.2.source.tar.bz2");
        if (match.Success)
        {
            Console.WriteLine(match.Value);
        }
    }
}
After you split "firefox-10.0.2.source.tar.bz2" down to "10.0.2.source.tar.bz2":
string a = "10.0.2.source.tar.bz2";
string[] list = a.Split('.');
string output = "";
foreach (var item in list)
{
    // keep only the parts that are whole numbers
    if (int.TryParse(item, out _))
        output += item + ".";
}
// after that, remove the trailing dot from output
output = output.TrimEnd('.');
Although late, I feel that this answer would be much more apt:
Regex r = new Regex(@"[\d\.]+(?![a-zA-Z\-])");
Match m = r.Match(name);
Console.WriteLine(m.Value);
Improvements -
Though @Samuel's answer works, what happens if the build is 10.2.2.3? His regex would give 10.2.2 - a partial answer, and therefore, wrong.
With the regex I have posted, the match would be complete.
Explanation -
[\d\.]+ matches all the combination of numbers and dots such as 10.2.2.34.56.78 and even just 10 if the build is 10.bundle
(?![a-zA-Z\-]) is a negative look-ahead which ensures that the match is not followed by any letter or dash.
Being robust is absolutely vital to any code, so my posted answer should work pretty well under any circumstances (because the link could be anything).
Here's a version which can handle 1-4 numbers (not just digits) in the input string, and returns a version number:
public static Version ExtractVersionNumber(string input)
{
    Match match = Regex.Match(input, @"(\d+\.){0,3}\d+");
    if (match.Success)
    {
        // System.Version needs at least major.minor, so pad a lone number with ".0"
        string value = match.Value.Contains(".") ? match.Value : match.Value + ".0";
        return new Version(value);
    }
    return null;
}
void Main()
{
    Console.WriteLine(ExtractVersionNumber("firefox-10.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.6.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.6source.tar.bz2"));
}
Explanation
There are essentially 2 parts:
(\d+\.){0,3} -match a number (uninterrupted sequence of 1 or more digits) immediately followed by a dot. Match this 0 to 3 times.
\d+ - match a number (sequence of 1 or more digits).
These work as follows:
When there's only 1 number (or even if there's only 1 number followed by a dot), the first part will match nothing, the second part will match the number
when there are 2 numbers separated by a dot, the first part matches the first number and the dot, the second part matches the second number.
for 3 numbers separated by dots, the first part gets the first 2 numbers & dots, the last the third number
for 4 or more numbers separated by dots, the first part gets the first 3 numbers and dots, the second gets the fourth number. Any subsequent numbers and dots are ignored.
ps. If you wanted to ensure that you only got the number after the hyphen (e.g. to avoid getting 4.0.1 given the string "firefox4.0.1-10.0.2.source.tar.bz2") you could add a positive look-behind to say "the character immediately before the version number must be a hyphen": (?<=-)(\d+\.){0,3}\d+.
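To see the look-behind variant in action, here is a small sketch. The class and method names are invented for the example, and it returns the matched text as a plain string rather than a Version.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenVersionDemo
{
    // Hypothetical helper: returns the first dotted number that directly
    // follows a hyphen, or null when there is none.
    public static string ExtractAfterHyphen(string input)
    {
        // (?<=-) asserts that the previous character is '-' without consuming it.
        Match match = Regex.Match(input, @"(?<=-)(\d+\.){0,3}\d+");
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractAfterHyphen("firefox4.0.1-10.0.2.source.tar.bz2"));
        Console.WriteLine(ExtractAfterHyphen("firefox-10.0.2.bundle"));
    }
}
```

Both calls in Main should print 10.0.2; the 4.0.1 in the first string is skipped because no hyphen precedes it.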

C# sort and put back Regex.matches

Is there any way of using Regex.Matches to find matched values and write them back in a different (alphabetical) order?
For now I have something like:
var pattern = @"(KEY `[\w]+?` \(`.*`*\))";
var keys = Regex.Matches(line, pattern);
Console.WriteLine("\n\n");
foreach (Match match in keys)
{
    Console.WriteLine(match.Index + " = " + match.Value.Replace("\n", "").Trim());
}
But what I really need is to take a table.sql dump and sort the existing indexes alphabetically. Example input:
line = "...PRIMARY KEY (`communication_auto`),\n KEY `idx_current` (`current`),\n KEY `idx_communication` (`communication_id`,`current`),\n KEY `idx_volunteer` (`volunteer_id`,`current`),\n KEY `idx_template` (`template_id`,`current`)\n);";
Thanks
J
Update:
Thanks, m.buettner solution gave me basics that I could use to move on. I'm not so good at RegEx sadly, but I ended up with code that I believe can be still improved:
...
//sort INDEXES definitions alphabetically
if (line.Contains(" KEY `")) line = Regex.Replace(
    line,
    @"[ ]+(KEY `[\w]+` \([\w`,]+\),?\s*)+",
    ReplaceCallbackLinq
);
static string ReplaceCallbackLinq(Match match)
{
    var result = String.Join(",\n ",
        from Capture item in match.Groups[1].Captures
        orderby item.Value.Trim()
        select item.Value.Trim().Replace("),", ")")
    );
    return " " + result + "\n";
}
Update:
There is also the case where an indexed field is longer than 255 characters: MySQL trims the index to 255 characters and writes it like this:
KEY `idx3` (`app_property_definition_id`,`value`(255),`audit_current`),
To match this case too, I had to change some code.
In ReplaceCallbackLinq:
select item.Value.Trim().Replace("`),", "`)")
and the regex definition to:
@"[ ]+(KEY `[\w]+` \([\w`(\(255\)),]+\),?\s*)+",
This cannot be done with regex alone. But you could use a callback function and make use of .NET's unique capability of capturing multiple things with the same capturing group. This way you avoid using Matches and writing everything back by yourself. Instead you can use the built-in Replace function. My example below simply sorts the KEY phrases and puts them back as they were (so it does nothing but sort the phrases within the SQL statement). If you want a different output you can easily achieve that by capturing different parts of the pattern and adjusting the Join operation at the very end.
First we need a match evaluator to pass the callback:
MatchEvaluator evaluator = new MatchEvaluator(ReplaceCallback);
Then we write a regex that matches the whole set of indices at once, capturing the index-names in a capturing group. We put this in the overload of Replace that takes an evaluator:
output = Regex.Replace(
    input,
    @"(KEY `([\w]+)` \(`[^`]*`(?:,`[^`]*`)*\),?\s*)+",
    evaluator
);
Now in most languages this would not be useful, because due to the repetition capturing group 1 would always contain only the first or last thing that was captured (same as capturing group 2). But luckily, you are using C#, and .NET's regex engine is just one powerful beast. So let's have a look at the callback function and how to use the multiple captures:
static string ReplaceCallback(Match match)
{
    int captureCount = match.Groups[1].Captures.Count;
    string[] indexNameArray = new string[captureCount];
    string[] keyBlockArray = new string[captureCount];
    for (int i = 0; i < captureCount; i++)
    {
        keyBlockArray[i] = match.Groups[1].Captures[i].Value;
        indexNameArray[i] = match.Groups[2].Captures[i].Value;
    }
    Array.Sort(indexNameArray, keyBlockArray);
    return String.Join("\n ", keyBlockArray);
}
match.Groups[i].Captures lets us access the multiple captures of a single group. Since these are Capture objects, which are not very convenient to sort directly, we build two string arrays from their values. Then we use Array.Sort, which sorts two arrays based on the values of one (which is considered the key). As the "key" we use the capture of the index name. As the "value" we use the full capture of one complete KEY ..., block. This sorts the full blocks by their names. Then we can simply join the blocks together, adding back the whitespace separator that was used before, and return them.
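To make the two-array form of Array.Sort more concrete, here is a minimal standalone sketch. The helper name and the sample data are made up, and it passes StringComparer.Ordinal explicitly (the callback above relies on the default comparer) so that the ordering does not depend on the current culture.

```csharp
using System;

public static class ParallelSortDemo
{
    // Hypothetical helper: returns a copy of 'blocks' reordered so that the
    // corresponding entries of 'names' are in ascending ordinal order.
    public static string[] SortBlocksByName(string[] names, string[] blocks)
    {
        // Work on copies so the caller's arrays are left untouched.
        string[] keys = (string[])names.Clone();
        string[] items = (string[])blocks.Clone();

        // Sorts 'keys' and applies the same reordering to 'items'.
        Array.Sort(keys, items, StringComparer.Ordinal);
        return items;
    }

    public static void Main()
    {
        string[] names = { "idx_volunteer", "idx_current", "idx_template" };
        string[] blocks =
        {
            "KEY `idx_volunteer` (`volunteer_id`,`current`)",
            "KEY `idx_current` (`current`)",
            "KEY `idx_template` (`template_id`,`current`)"
        };

        foreach (string block in SortBlocksByName(names, blocks))
            Console.WriteLine(block);
    }
}
```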
Not sure if I fully understand the question, but does changing the foreach to:
foreach (Match match in keys.Cast<Match>().OrderBy(m => m.Value))
do what you want?

How to do this Regex in C#?

I've been trying to do this for quite some time but for some reason never got it right.
There will be texts like these:
12325 NHGKF
34523 KGJ
29302 MMKSEIE
49504EFDF
The rule is there will be EXACTLY 5 digit number (no more or less) after that a 1 SPACE (or no space at all) and some text after as shown above. I would like to have a MATCH using a regex pattern and extract THE NUMBER and SPACE and THE TEXT.
Is this possible? Thank you very much!
Since you seem to need each component of the input text on a successful match, here is a pattern with the named groups number, space and text, so you can get them easily if the regex matches:
(?<number>\d{5})(?<space>\s?)(?<text>\w+)
On the returned Match, if Success==true then you can do:
string number = match.Groups["number"].Value;
string text = match.Groups["text"].Value;
bool hadSpace = match.Groups["space"].Length > 0; // the group always participates, so check whether it captured anything
The expression is relatively simple:
^([0-9]{5}) ?([A-Z]+)$
That is, 5 digits, an optional space, and one or more upper-case letters. The anchors at both ends ensure that the entire input is matched.
The parentheses around the digits pattern and the letters pattern designate capturing groups one and two. Access them to get the number and the word.
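A short sketch of reading those two capturing groups, assuming the anchored pattern above; the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FiveDigitParser
{
    // Hypothetical helper: returns true and fills 'number' and 'text' when
    // the whole line is 5 digits, an optional space, then upper-case letters.
    public static bool TryParse(string line, out string number, out string text)
    {
        Match m = Regex.Match(line, @"^([0-9]{5}) ?([A-Z]+)$");
        number = m.Success ? m.Groups[1].Value : null; // group 1: the digits
        text = m.Success ? m.Groups[2].Value : null;   // group 2: the letters
        return m.Success;
    }

    public static void Main()
    {
        foreach (string line in new[] { "12325 NHGKF", "49504EFDF", "123456 ABC" })
        {
            if (TryParse(line, out string number, out string text))
                Console.WriteLine("{0} -> number={1}, text={2}", line, number, text);
            else
                Console.WriteLine("{0} -> no match", line);
        }
    }
}
```

The third input has six digits, so the anchored pattern rejects it.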
string test = "12345 SOMETEXT";
string[] result = Regex.Split(test, @"(\d{5})\s*(\w+)");
You could use the Split method:
public class Program
{
    static void Main()
    {
        var values = new[]
        {
            "12325 NHGKF",
            "34523 KGJ",
            "29302 MMKSEIE",
            "49504EFDF"
        };
        foreach (var value in values)
        {
            var tokens = Regex.Split(value, @"(\d{5})\s*(\w+)");
            Console.WriteLine("key: {0}, value: {1}", tokens[1], tokens[2]);
        }
    }
}

A probably simple regex expression

I am a complete newb when it comes to regex, and would like help to make an expression to match in the following:
{ValidFunctionName}({parameter}:"{value}")
{ValidFunctionName}({parameter}:"{value}",
{parameter}:"{value}")
{ValidFunctionName}()
Where {x} is what I want to match, {parameter} can be anything $%"$ for example and {value} must be enclosed in quotation marks.
ThisIsValid_01(a:"40")
would be "ThisIsValid_01", "a", "40"
ThisIsValid_01(a:"40", b:"ZOO")
would be "ThisIsValid_01", "a", "40", "b", "ZOO"
01_ThisIsntValid(a:"40")
wouldn't return anything
ThisIsntValid_02(a:40)
wouldn't return anything, as 40 is not enclosed in quotation marks.
ThisIsValid_02()
would return "ThisIsValid_02"
For a valid function name I came across: "[A-Za-z_][A-Za-z_0-9]*"
But I can't for the life of me figure out how to match the rest.
I've been playing around on http://regexpal.com/ to try to get valid matches to all conditions, but to no avail :(
It would be nice if you kindly explained the regex too, so I can learn :)
EDIT: This will work; it uses 2 regexes. The first gets the function name and everything inside its brackets, the second extracts each parameter/value pair from what's inside the brackets. You cannot do this with a single regex. Add some [ \t\n\r]* for whitespace.
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
List<List<string>> matches = new List<List<string>>();
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
    var l = new List<string>();
    l.Add(match.Groups["function"].Value);
    foreach (Match m in inner.Matches(match.Groups["inner"].Value))
    {
        l.Add(m.Groups["param"].Value);
        l.Add(m.Groups["value"].Value);
    }
    matches.Add(l);
}
(Old) Solution
(?<function>\w[\w\d]*?)\((?<param>.+?):"(?<value>[^"]*?)"\)
(Old) Explanation
Let's remove the group captures so it is easier to understand: \w[\w\d]*?\(.+?:"[^"]*?"\)
\w is the word class; for ASCII input it is equivalent to [a-zA-Z0-9_], so it also matches digits
\d is the digit class; for ASCII input it is equivalent to [0-9]
\w[\w\d]*? Matches one word character for the start of the function name, and then zero or more further word or digit characters.
\(.+? Matches a left bracket then one or more of any characters (for the parameter)
:"[^"]*?"\) Matches a colon, then the opening quote, then zero or more of any character except quotes (for the value) then the close quote and right bracket.
Brackets (or parens, as some people call them) are escaped with backslashes because otherwise they would denote capturing groups.
The (?<name> ) captures some text.
The ? after each of the * and + operators makes them non-greedy, meaning that they will match the least, rather than the most, amount of text.
(Old) Use
Regex r = new Regex(@"(?<function>\w[\w\d]*?)\((?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(aa%£$!:\"lolololol\") _test1(ghgasghe:\"asjkdgh\")";
List<string[]> matches = new List<string[]>();
if (r.IsMatch(input))
{
    MatchCollection mc = r.Matches(input);
    foreach (Match match in mc)
        matches.Add(new[] { match.Groups["function"].Value, match.Groups["param"].Value, match.Groups["value"].Value });
}
EDIT: Now that you've added an undefined number of parameters, I would recommend writing your own parser rather than using regexes. The above example only works with one parameter and strictly no whitespace. This will match multiple parameters with strict whitespace but will not return the parameters and values:
\w[\w\d]*?\(.+?:"[^"]*?"(,.+?:"[^"]*?")*\)
Just for fun, like above but with whitespace:
\w[\w\d]*?[ \t\r\n]*\([ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?"([ \t\r\n]*,[ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?")*[ \t\r\n]*\)
Capturing the text you want will be hard, because you don't know how many captures you are going to have, and regexes are poorly suited to that.
Someone else has already given an answer that gives you a flat list of strings, but in the interest of strong typing and proper class structure, I’m going to provide a solution that encapsulates the data properly.
First, declare two classes:
public class ParamValue // For a parameter and its value
{
    public string Parameter;
    public string Value;
}
public class FunctionInfo // For a whole function with all its parameters
{
    public string FunctionName;
    public List<ParamValue> Values;
}
Then do the matching and populate a list of FunctionInfos:
(By the way, I’ve made some slight fixes to the regexes... it will now match identifiers correctly, and it will not include the double-quotes as part of the “value” of each parameter.)
Regex r = new Regex(@"(?<function>[\p{L}_]\w*?)\((?<inner>.*?)\)");
Regex inner = new Regex(@",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
var matches = new List<FunctionInfo>();
if (r.IsMatch(input))
{
    MatchCollection mc = r.Matches(input);
    foreach (Match match in mc)
    {
        var l = new List<ParamValue>();
        foreach (Match m in inner.Matches(match.Groups["inner"].Value))
            l.Add(new ParamValue
            {
                Parameter = m.Groups["param"].Value,
                Value = m.Groups["value"].Value
            });
        matches.Add(new FunctionInfo
        {
            FunctionName = match.Groups["function"].Value,
            Values = l
        });
    }
}
Then you can access the collection nicely with identifiers like FunctionName:
foreach (var match in matches)
{
    Console.WriteLine("{0}({1})", match.FunctionName,
        string.Join(", ", match.Values.Select(val =>
            string.Format("{0}: \"{1}\"", val.Parameter, val.Value))));
}
Try this:
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\)
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*) matches the function name; ^ means start of the line, so the match must begin at the start of the string (after any leading whitespace). You can remove the \s* if you don't need it; I just added it to make the match a little more flexible.
The next part, \(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\), captures each parameter-value pair inside the parentheses. You have to escape the parentheses of the function call since they are symbols within the regex syntax.
The ?<> inside parenthesis are named capture groups, which when supported by a library, as they are in .NET, make grabbing the groups in the matches a little easier.
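In .NET, a named group that sits inside a repetition keeps every capture it made, so the pairs can be read back from a single match through Group.Captures. The sketch below illustrates this with a pattern written for this example (it is a variation, not the exact pattern above): it uses the function-name class from the question, and it is lenient in that it would also accept a trailing comma before the closing parenthesis. The class and method names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FunctionCallParser
{
    // Assumed pattern for illustration: name, '(', zero or more
    // param:"value" pairs separated by commas, ')'.
    private static readonly Regex Pattern = new Regex(
        @"^(?<function>[A-Za-z_][A-Za-z_0-9]*)\((?:(?<param>[^:,]+):""(?<value>[^""]*)""(?:,\s*|(?=\))))*\)$");

    // Returns { function, param1, value1, param2, value2, ... } or null if
    // the input does not match.
    public static List<string> Parse(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;

        var result = new List<string> { m.Groups["function"].Value };
        CaptureCollection names = m.Groups["param"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        // Each iteration of the repeated group added one capture to both.
        for (int i = 0; i < names.Count; i++)
        {
            result.Add(names[i].Value);
            result.Add(values[i].Value);
        }
        return result;
    }

    public static void Main()
    {
        List<string> parts = Parse("ThisIsValid_01(a:\"40\", b:\"ZOO\")");
        Console.WriteLine(string.Join(", ", parts));
        Console.WriteLine(Parse("ThisIsntValid_02(a:40)") == null ? "no match" : "match");
    }
}
```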
Here:
\w[\w\d]*\s*\(\s*(?:(\w[\w\d]*):("[^"]*"|\d+))*\s*\)
For problems like this I always suggest not trying to find a single regex, but writing multiple regexes that share the work.
But here is my quick shot:
(?<funcName>[A-Za-z_][A-Za-z_0-9]*)
\(
(?<ParamGroup>
(?<paramName>[^(]+?)
:
"(?<paramValue>[^"]*)"
((,\s*)|(?=\)))
)*
\)
The whitespace is there for better readability. Remove it, or set the option to ignore pattern whitespace (RegexOptions.IgnorePatternWhitespace in .NET).
This regex passes all your test cases:
^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):"(?<value>[^"]*?)",{0,1}\s*)*\)$
This works on multiple parameters and no parameters. It also handles special characters in the param name and whitespace after the comma. There may need to be some adjustments as your test cases do not cover everything you indicate in your text.
Please note that \w usually includes digits and is not appropriate as the leading character of the function name. Reference: http://www.regular-expressions.info/charclass.html#shorthand

Regex to match multiple strings

I need to create a regex that can match multiple strings. For example, I want to find all the instances of "good" or "great". I found some examples, but what I came up with doesn't seem to work:
\b(good|great)\w*\b
Can anyone point me in the right direction?
Edit: I should note that I don't want to just match whole words. For example, I may want to match "ood" or "reat" as well (parts of the words).
Edit 2: Here is some sample text: "This is a really great story."
I might want to match "this" or "really", or I might want to match "eall" or "reat".
If you can guarantee that there are no reserved regex characters in your word list (or if you escape them), you could just use this code to make a big word list into @"(a|big|word|list)". There's nothing wrong with the | operator as you're using it, as long as those () surround it. It sounds like the \w* and the \b patterns are what are interfering with your matches.
String[] pattern_list = whatever;
String regex = String.Format("({0})", String.Join("|", pattern_list));
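If the word list might contain reserved regex characters, each entry can be passed through Regex.Escape before joining. A minimal sketch, with invented class and method names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationBuilder
{
    // Hypothetical helper: builds "(a|b|c)" from a word list, escaping each
    // entry so characters such as '+' or '.' are matched literally.
    public static Regex Build(IEnumerable<string> words)
    {
        string body = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex("(" + body + ")");
    }

    // Returns every match of any of the words (or word fragments) in 'text'.
    public static List<string> FindAll(IEnumerable<string> words, string text)
    {
        return Build(words).Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }

    public static void Main()
    {
        var found = FindAll(new[] { "eall", "reat" }, "This is a really great story.");
        Console.WriteLine(string.Join(", ", found));
    }
}
```

Because there are no \b anchors, fragments such as "eall" and "reat" are found inside longer words.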
(good)*(great)*
after your edit:
\b(g*o*o*d*)*(g*r*e*a*t*)*\b
I think you are asking for something you don't really mean.
If you want to search for any part of a word, you are literally searching for letters.
e.g. searching for {Jack, Jim} in "John and Shelly are cool"
means searching for all the letters in the names {J,a,c,k,i,m}:
*J*ohn *a*nd Shelly *a*re
and for that you don't need regex :)
In my opinion, a suffix tree can help you with that:
http://en.wikipedia.org/wiki/Suffix_tree#Functionality
Enjoy.
I'm not sure I understand the problem correctly:
If you want to match "great" or "reat" you can express this by a pattern like:
"g?reat"
This simply says that the "reat"-part must exist and the "g" is optional.
This would match "reat" and "great" but not "eat", because the first "r" in "reat" is required.
If you have the two words "great" and "good" and you want to match them both with an optional "g", you can write it like this:
(g?reat|g?ood)
And if you want to include a word-boundary like:
\b(g?reat|g?ood)
You should be aware that this would not match anything like "breat" because you have the "reat" but the "r" is not at the word boundary because of the "b".
So if you want to match whole words that contain a substring like "reat" or "ood" then you should try:
"\b\w*?(reat|ood)\w*\b"
This reads:
1. Beginning at a word boundary, match any number of word characters, but don't be greedy.
2. Match "reat" or "ood"; this ensures that only words containing one of them are matched.
3. Match any number of word characters following "reat" or "ood" until the next word boundary is reached.
This will match:
"goodness", "good", "ood" (if a complete word)
It can be read as: Give me all complete words that contain "ood" or "reat".
Is that what you are looking for?
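For reference, a minimal C# sketch of the "whole words that contain a substring" idea. It uses \w* (zero or more) after the alternation so that words ending in the substring, such as "good", also match; the class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFinder
{
    // Hypothetical helper: returns every complete word in 'text' that
    // contains "reat" or "ood".
    public static List<string> FindWords(string text)
    {
        return Regex.Matches(text, @"\b\w*?(reat|ood)\w*\b")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FindWords("This is a really great story.")));
        Console.WriteLine(string.Join(", ", FindWords("goodness is good food")));
    }
}
```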
I'm not entirely sure that regex alone offers a solution for what you're trying to do. You could, however, use the following (PHP) code to build a regex for a given word, although the resulting pattern has the potential to become very long and slow:
function wordPermutations( $word, $minLength = 2 )
{
    $perms = array( );
    for ($start = 0; $start < strlen( $word ); $start++)
    {
        for ($end = strlen( $word ); $end > $start; $end--)
        {
            $perm = substr( $word, $start, ($end - $start));
            if (strlen( $perm ) >= $minLength)
            {
                $perms[] = $perm;
            }
        }
    }
    return $perms;
}
Test Code:
$perms = wordPermutations( 'great', 3 ); // get all permutations of "great" that are 3 or more chars in length
var_dump( $perms );
echo ( '/\b('.implode( '|', $perms ).')\b/' );
Example Output:
array
  0 => string 'great' (length=5)
  1 => string 'grea' (length=4)
  2 => string 'gre' (length=3)
  3 => string 'reat' (length=4)
  4 => string 'rea' (length=3)
  5 => string 'eat' (length=3)
/\b(great|grea|gre|reat|rea|eat)\b/
Just check the boolean that Regex.IsMatch() returns:
if (Regex.IsMatch(line, "condition") && Regex.IsMatch(line, "condition2"))
If both calls return true, the line matches both patterns.
