Extract Enums with brackets in comments? - c#

The Enum I want to extract is like following:
...
other code
...
enum A
{
a,
b=2,
c=3,
d//{x}
}
...
More Enums like the above.
...
First, I have tried using the Option Singleline with Regex:
enum\s*\w+\s*{.*?\}
However, since the comments have brackets.The regex doesn't work. It will stop when it runs to the bracket in comments.
So I tried excluding the bracket after comments. Based on what I have searched so far,it seems I need Negative look ahead with grouping construct Multiline.
Then I tried parsing the brackets without comments ahead.
The substep is to find brackets after comments:
(?m:^.*?//.*?}.*?$).
However, it seems the . still match anychar including newline even in inline multiline mode.
Then I tried using multiline in the first place. Since the main problem is the brackets in comments.I tried:
(?!//.*)}
Negative look ahead doesn't work the way I expected.
Here is a csharp-regex-test-link for you to test.
To summarize, I need parse enum from a csharp source code file.
The main problem to me is the brackets in comments.
Edit:
To clarify
1.brackets in comments are in pairs. For example:
xxx=xxx; //{xx}
2.comments are only in the form of //
3.I can't rely on indentations.

You may use
#"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*(?(o)(?!)|)}"
See the regex demo
Details
\benum - a whole word enum
\s* - 0+ whitespaces
\w+ - 1+ word chars
\s* - 0+ whitespaces
{ - a { char
(?>[^{}]+|(?<o>){|(?<-o>)})* - either 1+ chars other than { and }, or a { with an empty string pushed onto the Group o stack, or } with a value popped from Group o stack
(?(o)(?!)|) - a conditional yes-no construct that fails the match and makes the regex engine backtrack at the current location if Group o still has any items left on the stack
} - a } char.

I don't think it is possible to do your task with a single regex. What if you have a string that looks like
var notEnum = "enum A {a, b, c}";
Hovewer you can capture your enums with few passes. Take a look at this algorithm
Clear strings content
Drop singleline comments
Drop muliline comments
Use you original regex
Example:
var code = ...
var stringLiterals = new Regex("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", RegexOptions.Compiled);
var multilineComments = new Regex("/\\*.*?\\*/", RegexOptions.Compiled | RegexOptions.Singleline);
var singlelineComments = new Regex("//.*$", RegexOptions.Compiled | RegexOptions.Multiline);
var #enum = new Regex("enum\\s*\\w+\\s*{.*?}", RegexOptions.Compiled | RegexOptions.Singleline);
code = stringLiterals.Replace(code, m => "\"\"");
code = multilineComments.Replace(code, m => "");
code = singlelineComments.Replace(code, m => "");
var enums = #enum.Matches(code).Cast<Match>().ToArray();
foreach (var match in enums)
Console.WriteLine(match.Value);

Related

c# Regex of value after certain words

I have a question at regex I have a string that looks like:
Slot:0 Module:No module in slot
And what I need is a regex that well get values after slot and module, slot will allways be a number but i have a problem with module (this can be word with spaces), I tried:
var pattern = "(?<=:)[a-zA-Z0-9]+";
foreach (string config in backplaneConfig)
{
List<string> values = Regex.Matches(config, pattern).Cast<Match>().Select(x => x.Value).ToList();
modulesInfo.Add(new ModuleIdentyfication { ModuleSlot = Convert.ToInt32(values.First()), ModuleType = values.Last() });
}
So slot part works, but module works only if it is a word with no spaces, in my example it will give me only "No". Is there a way to do that
You may use a regex to capture the necessary details in the input string:
var pattern = #"Slot:(\d+)\s*Module:(.+)";
foreach (string config in backplaneConfig)
{
var values = Regex.Match(config, pattern);
if (values.Success)
{
modulesInfo.Add(new ModuleIdentyfication { ModuleSlot = Convert.ToInt32(values.Groups[1].Value), ModuleType = values.Groups[2].Value });
}
}
See the regex demo. Group 1 is the ModuleSlot and Group 2 is the ModuleType.
Details
Slot: - literal text
(\d+) - Capturing group 1: one or more digits
\s* - 0+ whitespaces
Module: - literal text
(.+) - Capturing group 2: the rest of the string to the end.
The most simple way would be to add 'space' to your pattern
var pattern = "(?<=:)[a-zA-Z0-9 ]+";
But the best solution would probably the answer from #Wiktor Stribiżew
Another option is to match either 1+ digits followed by a word boundary or match a repeating pattern using your character class but starting with [a-zA-Z]
(?<=:)(?:\d+\b|[a-zA-Z][a-zA-Z0-9]*(?: [a-zA-Z0-9]+)*)
(?<=:) Assert a : on the left
(?: Non capturing group
\d+\b Match 1+ digits followed by a word boundary
| Or
[a-zA-Z][a-zA-Z0-9]* Start a match with a-zA-Z
(?: [a-zA-Z0-9]+)* Optionally repeat a space and what is listed in the character class
) Close on capturing group
Regex demo
Plase replace this:
// regular exp.
(\d+)\s*(.+)
You don't need to use regex for such simple parsing. Try below:
var str = "Slot:0 Module:No module in slot";
str.Split(new string[] { "Slot:", "Module:"},StringSplitOptions.RemoveEmptyEntries)
.Select(s => s.Trim());

C# Regex - starts with pattern1 not contain pattern2

for the following input string contains all of these:
a1.aaa[SUBSCRIBED]
a1.bbb
a1.ccc
b1.ddd
d1.ddd[SUBSCRIBED]
I want to get the output:
bbb
ccc
which means: all the words that come after "a1." And not contain the substring "[SUBSCRIBED]"
all the words comes after "a1." And not contains the substring
"[SUBSCRIBED]"
Why regex? Following is crystal clear:
var result = strings
.Where(s => s.StartsWith("a1.") && !s.Contains("[SUBSCRIBED]"))
.Select(s => s.Substring(3));
Tim's answer makes sense. However if you insist on it I would venture that a Regex would look like this though.
^a1\.(.*)(?<!\[SUBSCRIBED\])$
with ^a1 meaning starts with a1
\.(.*) taking any number of character
and the negative lookbehind (?<!\[SUBSCRIBED\])$ would refuse text ending with [SUBSCRIBED]
You may use
^a1\.(?!.*\[SUBSCRIBED])(.*)
See the regex demo.
Details
^ - start of string
a1\. - a literal a1. substring
(?!.*\[SUBSCRIBED]) - a negative lookahead that fails the match if there is a [SUBSCRIBED] substring is present after any 0+ chars (other than newline if the RegexOptions.Singleline option is not used)
(.*) - Group 1: the rest of the line up to the end (if you use RegexOptions.Singleline option, . will match newlines as well).
C# code:
var result = string.Empty;
var m = Regex.Match(s, #"^a1\.(?!.*\[SUBSCRIBED])(.*)");
if (m.Success)
{
result = m.Groups[1].Value;
}

regex pattern for tags needed

Howzit,
I need help with the following please.
I need to find tags in a string. These tags start with {{ and end with }}, there will be multiple tags in the string I receive.
So far I have this, but it doesn't find any matches, what am I missing here?
List<string> list = new List<string>();
string pattern = "{{*}}";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = r.Match(text);
while (m.Success)
{
list.Add(m.Groups[0].Value);
m = m.NextMatch();
}
return list;
even tried string pattern = "{{[A-Za-z0-9]}}";
thanx
PS. I know close to nothing about regex.
Not only do you want to use {{.+?}} as your regex, you also need to pass RegexOptions.SingleLine. That will treat your entire string as a single line and the . will match \n (which it normally will not do).
Try {{.+}}. The .+ means there has to be at least one character as part of the tag.
EDIT:
To capture the string containing your tags you can do {{(.+)}} and then tokenize your match with the Tokenize or Scanner class?
I would recommend trying something like the following:
List<string> list = new List<string>();
string pattern = "{{(.*?)}}";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = r.Match(text);
while (m.Success)
{
list.Add(m.Groups[1].Value);
m = m.NextMatch();
}
return list;
the regex specifies:
{{ # match {{ literally
( # begin capturing into group #1
.*? # match any characters, from zero to infinite, but be lazy*
) # end capturing group
}} # match }} literally
"lazy" means to attempt to continue matching the pattern afterwards "}}" before backtracking to the .*? and reluctantly adding a character to the capturing group only if the character does not match }} - hope that made sense.
I changed your code by modifying the regex and to extract the first matching group from the regex match object (m.Groups[1].value) instead of the entire match.
{{.*?}} or
{{.+?}}
. - means any symbol
? - means lazy(don't capute nextpattern)

Regexp skip pattern

Problem
I need to replace all asterisk symbols('*') with percent symbol('%'). The asterisk symbols in square brackets should be ignored.
Example
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "Hel[*o], w*rld!";
var output = Regex.Replace(input, "What_pattern_should_be_there?", "%")
Assert.AreEqual("Hel[*o], w%rld!", output));
}
Try using a look ahead:
\*(?![^\[\]]*\])
Here's a bit stronger solution, which takes care of [] blocks better, and even escaped \[ characters:
string text = #"h*H\[el[*o], w*rl\]d!";
string pattern = #"
\\. # Match an escaped character. (to skip over it)
|
\[ # Match a character class
(?:\\.|[^\]])* # which may also contain escaped characters (to skip over it)
\]
|
(?<Asterisk>\*) # Match `*` and add it to a group.
";
text = Regex.Replace(text, pattern,
match => match.Groups["Asterisk"].Success ? "%" : match.Value,
RegexOptions.IgnorePatternWhitespace);
If you don't care about escaped characters you can simplify it to:
\[ # Skip a character class
[^\]]* # until the first ']'
\]
|
(?<Asterisk>\*)
Which can be written without comments as: #"\[[^\]]*\]|(?<Asterisk>\*)".
To understand why it works we need to understand how Regex.Replace works: for every position in the string it tries to match the regex. If it fails, it moves one character. If it succeeds, it moves over the whole match.
Here, we have dummy matches for the [...] blocks so we may skip over the asterisks we don't want to replace, and match only the lonely ones. That decision is made in a callback function that checks if Asterisk was matched or not.
I couldn't come up with a pure RegEx solution. Therefore I am providing you with a pragmatic solution. I tested it and it works:
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "H*]e*l[*o], w*rl[*d*o] [o*] [o*o].";
var actual = ReplaceAsterisksNotInSquareBrackets(input);
var expected = "H%]e%l[*o], w%rl[*d*o] [o*] [o*o].";
Assert.AreEqual(expected, actual);
}
private static string ReplaceAsterisksNotInSquareBrackets(string s)
{
Regex rx = new Regex(#"(?<=\[[^\[\]]*)(?<asterisk>\*)(?=[^\[\]]*\])");
var matches = rx.Matches(s);
s = s.Replace('*', '%');
foreach (Match match in matches)
{
s = s.Remove(match.Groups["asterisk"].Index, 1);
s = s.Insert(match.Groups["asterisk"].Index, "*");
}
return s;
}
EDITED
Okay here is my final attempt ;)
Using negative lookbehind (?<!) and negative lookahead (?!).
var output = Regex.Replace(input, #"(?<!\[)\*(?!\])", "%");
This also passes the test in the comment to another answer "Hel*o], w*rld!"

A probably simple regex expression

I am a complete newb when it comes to regex, and would like help to make an expression to match in the following:
{ValidFunctionName}({parameter}:"{value}")
{ValidFunctionName}({parameter}:"{value}",
{parameter}:"{value}")
{ValidFunctionName}()
Where {x} is what I want to match, {parameter} can be anything $%"$ for example and {value} must be enclosed in quotation marks.
ThisIsValid_01(a:"40")
would be "ThisIsValid_01", "a", "40"
ThisIsValid_01(a:"40", b:"ZOO")
would be "ThisIsValid_01", "a", "40", "b", "ZOO"
01_ThisIsntValid(a:"40")
wouldn't return anything
ThisIsntValid_02(a:40)
wouldn't return anything, as 40 is not enclosed in quotation marks.
ThisIsValid_02()
would return "ThisIsValid_02"
For a valid function name I came across: "[A-Za-z_][A-Za-z_0-9]*"
But I can't for the life of me figure out how to match the rest.
I've been playing around on http://regexpal.com/ to try to get valid matches to all conditions, but to no avail :(
It would be nice if you kindly explained the regex too, so I can learn :)
EDIT: This will work, uses 2 regexs. The first get the function name and everything inside it, the second extracts each pair of params and values from what's inside the function's brackets. You cannot do this with a single regex. Add some [ \t\n\r]* for whitespace.
Regex r = new Regex(#"(?<function>\w[\w\d]*?)\((?<inner>.*?)\)");
Regex inner = new Regex(#",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
List<List<string>> matches = new List<List<string>>();
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<string>();
l.Add(match.Groups["function"].Value);
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
{
l.Add(m.Groups["param"].Value);
l.Add(m.Groups["value"].Value);
}
matches.Add(l);
}
(Old) Solution
(?<function>\w[\w\d]*?)\((?<param>.+?):"(?<value>[^"]*?)"\)
(Old) Explanation
Let's remove the group captures so it is easier to understand: \w[\w\d]*?\(.+?:"[^"]?"\)
\w is the word class, it is short for [a-zA-Z_]
\d is the digit class, it is short for [0-9]
\w[\w\d]*? Makes sure there is valid word character for the start of the function, and then matches zero or more further word or digit characters.
\(.+? Matches a left bracket then one or more of any characters (for the parameter)
:"[^"]*?"\) Matches a colon, then the opening quote, then zero or more of any character except quotes (for the value) then the close quote and right bracket.
Brackets (or parens, as some people call them) as escaped with the backslashes because otherwise they are capturing groups.
The (?<name> ) captures some text.
The ? after each the * and + operators makes them non-greedy, meaning that they will match the least, rather than the most, amount of text.
(Old) Use
Regex r = new Regex(#"(?<function>\w[\w\d]*?)\((?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(aa%£$!:\"lolololol\") _test1(ghgasghe:\"asjkdgh\")";
List<string[]> matches = new List<string[]>();
if(r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
matches.Add(new[] { match.Groups["function"].Value, match.Groups["param"].Value, match.Groups["value"].Value });
}
EDIT: Now you've added an undefined number of multiple parameters, I would recommend making your own parser rather than using regexs. The above example only works with one parameter and strictly no whitespace. This will match multiple parameters with strict whitespace but will not return the parameters and values:
\w[\w\d]*?\(.+?:"[^"]*?"(,.+?:"[^"]*?")*\)
Just for fun, like above but with whitepace:
\w[\w\d]*?[ \t\r\n]*\([ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?"([ \t\r\n]*,[ \t\r\n]*.+?[ \t\r\n]*:[ \t\r\n]*"[^"]*?")*[ \t\r\n]*\)
Capturing the text you want will be hard, because you don't know how many captures you are going to have and as such regexs are unsuited.
Someone else has already given an answer that gives you a flat list of strings, but in the interest of strong typing and proper class structure, I’m going to provide a solution that encapsulates the data properly.
First, declare two classes:
public class ParamValue // For a parameter and its value
{
public string Parameter;
public string Value;
}
public class FunctionInfo // For a whole function with all its parameters
{
public string FunctionName;
public List<ParamValue> Values;
}
Then do the matching and populate a list of FunctionInfos:
(By the way, I’ve made some slight fixes to the regexes... it will now match identifiers correctly, and it will not include the double-quotes as part of the “value” of each parameter.)
Regex r = new Regex(#"(?<function>[\p{L}_]\w*?)\((?<inner>.*?)\)");
Regex inner = new Regex(#",?(?<param>.+?):""(?<value>[^""]*?)""");
string input = "_test0(a:\"lolololol\",b:\"2\") _test1(ghgasghe:\"asjkdgh\")";
var matches = new List<FunctionInfo>();
if (r.IsMatch(input))
{
MatchCollection mc = r.Matches(input);
foreach (Match match in mc)
{
var l = new List<ParamValue>();
foreach (Match m in inner.Matches(match.Groups["inner"].Value))
l.Add(new ParamValue
{
Parameter = m.Groups["param"].Value,
Value = m.Groups["value"].Value
});
matches.Add(new FunctionInfo
{
FunctionName = match.Groups["function"].Value,
Values = l
});
}
}
Then you can access the collection nicely with identifiers like FunctionName:
foreach (var match in matches)
{
Console.WriteLine("{0}({1})", match.FunctionName,
string.Join(", ", match.Values.Select(val =>
string.Format("{0}: \"{1}\"", val.Parameter, val.Value))));
}
Try this:
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*)\(((?<parameter>[^:]*):"(?<value>[^"]+)",?\s*)*\)
^\s*(?<FunctionName>[A-Za-z][A-Za-z_0-9]*) matches the function name, ^ means start of the line, so that the first character in string must match. You can keep you remove the whitespace capture if you don't need it, I just added it to make the match a little more flexible.
The next set \(((?<parameter>[^:]*):"(?<value>[^"]+)",?)*\) means capture each parameter-value pair inside the parenthesis. You have to escape the parenthesis for the function since they are symbols within the regex syntax.
The ?<> inside parenthesis are named capture groups, which when supported by a library, as they are in .NET, make grabbing the groups in the matches a little easier.
Here:
\w[\w\d]*\s*\(\s*(?:(\w[\w\d]*):("[^"]*"|\d+))*\s*\)
Visualization of that regex here.
For Problems like that I always suggest people not to "find" a single regex but to write multiple regex sharing the work.
But here is my quick shot:
(?<funcName>[A-Za-z_][A-Za-z_0-9]*)
\(
(?<ParamGroup>
(?<paramName>[^(]+?)
:
"(?<paramValue>[^"]*)"
((,\s*)|(?=\)))
)*
\)
The whitespaces are there for better readability. Remove them or set the option to ignore pattern whitespaces.
This regex passes all your test cases:
^(?<function>[A-Za-z][\w]*?)\(((?<param>[^:]*?):"(?<value>[^"]*?)",{0,1}\s*)*\)$
This works on multiple parameters and no parameters. It also handles special characters in the param name and whitespace after the comma. There may need to be some adjustments as your test cases do not cover everything you indicate in your text.
Please note that \w usually includes digits and is not appropriate as the leading character of the function name. Reference: http://www.regular-expressions.info/charclass.html#shorthand

Categories

Resources