What's the regex for literals intermixed with variables containing literals? - c#

I have a custom C# app (.NET 4.7.1) that needs to evaluate various and sundry text strings. As one of many cases, I have the following string in the midst of other text:
OR S:D00Q0600 ) OR
I need to find these precise situations (each string segment will be surrounded by a single space, or be at the beginning or end of a line) in which there is an OR followed by a string containing a :, followed by a ), followed by another OR. The ORs are literal and the : within the string is literal, and the ) is literal -- but the D00Q0600 is variable and will be different every time.
And when that precise situation occurs I need to replace the string with:
OR S:D00Q0600 OR
(Simply remove the ) - from that little snippet only - not the whole string)
So to break it down a little cleaner:
Find an OR (always uppercase)
...followed by a space followed by a string with a :
...followed by a space followed by a )
...followed by a space followed by an OR
When found, remove the ) in that position
Do not remove any other )s which will often exist in the entire string
In many cases, the ) is correct and must remain; only in the case described above should it be removed.
S:D00Q0600 can be of variable length. It could also be (for example) S:D00Q or S:D00Q0600XYZ, etc.
How can I construct the type of C# regex that would solve this?

You can use this regex and replace each match with the contents of group 1 followed by group 2. Because the replacement only happens where the whole pattern matches, every other ) is left alone.
(OR [A-Z]:[A-Z0-9]+ )\) (OR)
Check here,
https://regex101.com/r/0EZiu6/1/
Edit 1:
Modified your c# code and now this works.
string pattern = @"(OR [A-Z]:[A-Z0-9]+ )\) (OR)";
string substitution = @"$1$2";
string input = @"OR S:D00Q0600 ) OR ok sir how )r u OR S:D11Q06 ) OR i ()am fine OR D:D67Q06S0A23DR ) OR";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
Console.WriteLine("Before Replace: " + input);
Console.WriteLine("After Replace: " + result);
I have just replaced \1\2 with $1$2 (.NET replacement strings refer to groups as $n) and added print statements at the end to show the string before and after the replacement.
Following is the output of this program which is exactly as you desired.
Before Replace: OR S:D00Q0600 ) OR ok sir how )r u OR S:D11Q06 ) OR i ()am fine OR D:D67Q06S0A23DR ) OR
After Replace: OR S:D00Q0600 OR ok sir how )r u OR S:D11Q06 OR i ()am fine OR D:D67Q06S0A23DR OR

For the single example of
OR S:D00Q0600 ) OR
... this regex works:
(\bOR S:........ )\)( OR\b)
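with $1$2 as the replacement. Note that group 1 ends with a space and group 2 begins with one, so the result contains two spaces before the second OR; move one of those spaces outside its group if you need exactly one.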
The regex assumes that the length of the middle string will always be eight characters (one per dot). If you have more/different input data, please update your question with examples where this regex fails.
Explanation
(\bOR S:........ )\)( OR\b)
\b assert position at a word boundary (transition from non-word to word, or from word to non-word)
OR S: matches the characters literally (case sensitive)
. matches any character (except for line terminators)
(a space) matches the space character literally (case sensitive)
\) matches the character ) literally (case sensitive)
Regex101
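If the token after the first OR is not always eight characters long, a more general variant can be sketched as follows. This is only a sketch under assumptions that are not in the answers above: the names ParenFixer and RemoveStrayParen are hypothetical, the middle token is taken to be any run of non-space characters containing a colon (\S*:\S*), and the trailing OR is checked with a lookahead so that two occurrences sharing the same OR are both handled.

```csharp
using System.Text.RegularExpressions;

public static class ParenFixer
{
    // "OR", a space, a non-space token containing ':', then " )" which
    // must be followed by " OR" (lookahead only, so the OR is not consumed).
    private static readonly Regex StrayParen =
        new Regex(@"\b(OR \S*:\S*) \)(?= OR\b)");

    // Hypothetical helper: removes " )" only in the "OR x:y ) OR" situation
    // and leaves every other ")" in the input untouched.
    public static string RemoveStrayParen(string input)
    {
        return StrayParen.Replace(input, "$1");
    }
}
```

Calling ParenFixer.RemoveStrayParen("OR S:D00Q0600 ) OR") should give "OR S:D00Q0600 OR" with a single space, because the space before the ) is matched outside the group.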

Related

Regex match text not preceded by quotation mark (ignore whitespaces)

I have following text:
SELECT
U_ArrObjJson(
s."Description", s."DateStart", sp.*
) as "Result"
FROM "Supplier" s
OUTER APPLY(
SELECT
U_ArrObjJson,
'U_ArrObjJson(',
' <- THE PROBLEM IS HERE
U_ArrObjJson(
p."Id", p."Description", p."Price"
) as "Products"
FROM "Products" p
WHERE p."SupplierId" = s."Id"
) sp
What I need to do is find instances of the U_ArrObjJson function that are not preceded by a quotation mark. I ended up with the following expression:
(?<!\')\bU_ArrObjJson\b[\n\r\s]*[\(]+
The problem is that the last occurrence of U_ArrObjJson is preceded by a single quotation mark, but there are spaces and newline characters between the quotation mark and the name I am looking for.
I need to use this expression with the .NET Regex class in my method:
var matches = new Regex(@"(?<!\')\bU_ArrObjJson\b[\n\r\s]*[\(]+", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant).Matches(template);
How can I modify my expression to ignore the preceding whitespace?
Since .NET's regex supports non-fixed width Lookbehinds, you can just add \s* to the Lookbehind:
(?<!\'\s*)\bU_ArrObjJson\s*\(+
Demo.
Notes:
[\n\r\s] can be replaced with just \s here because the latter matches any whitespace character (including EOL). So, \n\r is redundant here.
As indicated by Wiktor Stribiżew in the comments, the second \b is also redundant because the function name will either be followed by a whitespace or a ( character. In both cases, a word boundary is implicitly required.
Unless you actually want to match the function name followed by multiple ( characters, you probably should also remove the + at the end.
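A minimal sketch of the variable-width lookbehind in use, with the trailing + dropped as suggested in the notes. The class and method names (FunctionCallFinder, CountUnquotedCalls) are hypothetical and only there to make the snippet self-contained; the options are the ones from the question.

```csharp
using System.Text.RegularExpressions;

public static class FunctionCallFinder
{
    // Negative lookbehind: the name must not be preceded by a single quote,
    // even when whitespace (including newlines) sits between quote and name.
    private static readonly Regex Call = new Regex(
        @"(?<!'\s*)\bU_ArrObjJson\s*\(",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: counts the calls that are not preceded by a quote.
    public static int CountUnquotedCalls(string sql)
    {
        return Call.Matches(sql).Count;
    }
}
```

One caveat of this approach: a closing quote of an earlier string literal followed only by whitespace also suppresses the match, so it is a heuristic rather than a real SQL tokenizer.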

How to make a Regular expression in C# for parsing python string Literals?

I've been trying to parse Python string literals with a C# Regex.
Python strings look like this:
"string1"
"string2\" it is still string till now"
"""This is a \r
\na multiline\r
\npython string"""
""" this is also a multiline string\""" but it doesnt end here,
\n it ends with all three quotes together without escape sequence so it ends here ->"""
I have to check whether my string matches one of these forms:
if (Regex.IsMatch(input, "^\"" + @"[\w\s\W]*" + "[^\\]"+"\\" +"$") || Regex.IsMatch(input, "^\"\"\"" + @"[\w\s\W]*" + "\"\"\"$"))
{ /* do something then */ }
Try
(?:u|r|ur|ru)?(?:(?P<q1>'''|""")(?:[^'"\\]*(?:\\.|(?!\1)['"]))*[^'"\\]*(?P=q1)|(?P<q2>'|")(?:[^'"\\\n]*(?:\\.|(?!\1)['"]))*[^'"\\\n]*(?P=q2))
Demo.
Explanation:
(?: // first, any combination of "r" and "u" (optionally)
u|r|ur|ru
)?
(?: // next, either a multi- or single line string
(?P<q1> // create a named capturing group for the quotes
'''|"""
)
(?:
[^'"\\]* // then match anything except quotes and backslashes
(?: // if there's a quote or backslash, check if the string ends here
\\. // if there's a backslash next, match the next two characters unconditionally
|
(?!\1)['"] // otherwise, if there is NOT a closing quote, match any quote
)
)* // do this as many times as possible, then...
[^'"\\]* //...match anything that's no quote or backslash one last time...
(?P=q1) //...and end with the quote the string started with.
|
// down below the same thing for single line strings.
(?P<q2>
'|"
)
(?:
[^'"\\\n]*
(?:
\\.
|
(?!\1)['"]
)
)*
[^'"\\\n]*
(?P=q2)
)
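The pattern above uses Python-style named groups ((?P<q1>...) and (?P=q1)), which the .NET engine does not accept; .NET writes them as (?<q1>...) and \k<q1>. A possible .NET translation is sketched below. It is an adaptation, not the answer's exact regex: the second branch checks \k<q2> where the original has \1 (which appears to be the intent), the whole thing is anchored with ^ and \z to test a complete input, RegexOptions.Singleline is assumed so that a backslash followed by a newline counts as an escape, and the names PythonStringLiterals and IsStringLiteral are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class PythonStringLiterals
{
    // In a verbatim string a double quote is written as "".
    private const string Pattern =
        @"(?:u|r|ur|ru)?(?:" +
        // triple-quoted branch
        @"(?<q1>'''|"""""")(?:[^'""\\]*(?:\\.|(?!\k<q1>)['""]))*[^'""\\]*\k<q1>" +
        @"|" +
        // single-quoted branch (no raw newline allowed)
        @"(?<q2>'|"")(?:[^'""\\\n]*(?:\\.|(?!\k<q2>)['""]))*[^'""\\\n]*\k<q2>" +
        @")";

    private static readonly Regex Literal =
        new Regex("^(?:" + Pattern + @")\z", RegexOptions.Singleline);

    // Hypothetical helper: true if the whole input is one string literal.
    public static bool IsStringLiteral(string input)
    {
        return Literal.IsMatch(input);
    }
}
```

With this sketch, an input such as "string2\" it is still string till now" (including its quotes) is accepted as one literal, while two literals joined by + are rejected because the match has to span the entire input.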
The below regex would match all the python string literals which are enclosed within " or """.
"""(?:(?!(?<![\\\/])""").)*"""|"(?:(?!(?<![\\\/])").)*"
DEMO
Note that I have included the s (DOTALL) modifier in the above regex. It won't work for incomplete quotes.
^(".*")$
You can use this as well. See the demo.
http://regex101.com/r/hJ7nT4/2

How to make balancing group capturing?

Let's say I have this text input.
tes{}tR{R{abc}aD{mnoR{xyz}}}
I want to extract the following output:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
Currently, I can only extract what's inside the {} groups using the balancing group approach found on MSDN. Here's the pattern:
^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$
Does anyone know how to include the R{} and D{} in the output?
I think that a different approach is required here. Once you match the first larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you won't be able to get the subgroups inside as the regex doesn't allow you to capture the individual R{ ... } groups.
So, there had to be some way to capture and not consume and the obvious way to do that was to use a positive lookahead. From there, you can put the expression you used, albeit with some changes to adapt to the new change in focus, and I came up with:
(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))
[I also renamed the 'Open' to 'O' and removed the named capture for the close brace to make it shorter and avoid noises in the matches]
On regexhero.net (the only free .NET regex tester I know so far), I got the following capture groups:
1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}
Breakdown of regex:
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
(?: # First non-capture group opening
(?: # Second non-capture group opening
(?'O'{) # Get the named opening brace
[^{}]* # Any non-brace
)+ # Close of second non-capture group and repeat over as many times as necessary
(?: # Third non-capture group opening
(?'-O'}) # Removal of named opening brace when encountered
[^{}]*? # Any other non-brace characters in case there are more nested braces
)+ # Close of third non-capture group and repeat over as many times as necessary
)+ # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
(?(O)(?!)) # Condition to prevent unbalanced braces
) # Close capture group
) # Close positive lookahead
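To show how the captures could be collected in C#, here is a small sketch. It uses the lookahead pattern from above with the literal braces escaped (an adjustment made here for safety, not part of the answer); the names LookaheadNested and FindAll are hypothetical. Because the whole pattern is a lookahead, every match is empty and the text of interest is in group 1.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadNested
{
    // Lookahead pattern from the answer, with { and } escaped outside classes.
    private static readonly Regex Nested = new Regex(
        @"(?=([A-Z](?:(?:(?'O'\{)[^{}]*)+(?:(?'-O'\})[^{}]*?)+)+(?(O)(?!))))");

    // Hypothetical helper: returns group 1 of every (zero-width) match,
    // in the order the matches start in the input.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match m in Nested.Matches(input))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

For the input tes{}tR{R{abc}aD{mnoR{xyz}}} this should return the four groups in the order listed above, outermost first.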
The following will not work in C#
I also wanted to try this on the PCRE engine, since it supports recursive patterns; I'm more familiar with it, and it yields a shorter regex :)
(?=([A-Z]{(?:[^{}]|(?1))+}))
regex101 demo
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
{ # Opening brace
(?: # Opening non-capture group
[^{}] # Matches non braces
| # OR
(?1) # Recurse first capture group
)+ # Close non-capture group and repeat as many times as necessary
} # Closing brace
) # Close of capture group
) # Close of positive lookahead
I'm not sure a single regex would be able to suit your needs: these nested substrings always mess it up.
One solution could be the following algorithm (written in Java, but I guess the translation to C# won't be that hard):
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
*
* @param input
* The input string.
* @param regex
* The regex pattern. It has to target the most nested substrings. For example, given the following input string
* <code>A{01B{23}45C{67}89}</code>, if you want to catch every <code>X{*}</code> substring (where <code>X</code> is a capital letter),
* you have to use <code>[A-Z][{][^{]+?[}]</code> or <code>[A-Z][{][^{}]+[}]</code> instead of <code>[A-Z][{].+?[}]</code>.
* @param format
* The format must follow the <a href= "http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html#syntax" >format string
* syntax</a>. It will be given one single integer as argument, so it has to contain (and to contain only) a <code>%d</code> flag. The
* formatted result must not occur anywhere in the input string. If <code>null</code>, <code>ééé%dèèè</code> will be used.
* @return The list of all the matches of the regex in the input string.
*/
public static List<String> findAllMatches(String input, String regex, String format) {
if (format == null) {
format = "ééé%dèèè";
}
int counter = 0;
Map<String, String> matches = new LinkedHashMap<String, String>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
// if a substring has been found
while (matcher.find()) {
// create a unique replacement string using the counter
String replace = String.format(format, counter++);
// store the relation "replacement string --> initial substring" in a queue
matches.put(replace, matcher.group());
String end = input.substring(matcher.end(), input.length());
String start = input.substring(0, matcher.start());
// replace the found substring by the created unique replacement string
input = start + replace + end;
// reiterate on the new input string (faking the original matcher.find() implementation)
matcher = pattern.matcher(input);
}
List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());
// for each relation "replacement string --> initial substring" of the queue
for (int i = 0; i < entries.size(); i++) {
Entry<String, String> current = entries.get(i);
// for each relation that could have been found before the current one (i.e. more nested)
for (int j = 0; j < i; j++) {
Entry<String, String> previous = entries.get(j);
// if the current initial substring contains the previous replacement string
if (current.getValue().contains(previous.getKey())) {
// replace the previous replacement string by the previous initial substring in the current initial substring
current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
}
}
}
return new LinkedList<String>(matches.values());
}
Thus, in your case:
String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
String regex = "[A-Z][{][^{}]+[}]";
findAllMatches(input, regex, null);
Returns:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
Balancing groups in .Net regular expressions give you control over exactly what to capture, and the .Net regex engine keeps a full history of all captures of the group (unlike most other flavors that only capture the last occurrence of each group).
The MSDN example is a little too complicated. A simpler approach for matching nested structures would be:
(?>
(?<O>)\p{Lu}\{ # Push to the O stack, and match an upper-case letter and {
| # OR
\}(?<-O>) # Match } and pop from the stack
| # OR
\p{Ll} # Match a lower-case letter
)+
(?(O)(?!)) # Make sure the stack is empty
or in a single line:
(?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!))
Working example on Regex Storm
In your example it also matches the "tes" at the start of the string, but don't worry about that, we're not done.
With a small correction we can also capture the occurrences between the R{...} pairs:
(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))
Each Match will have a Group called "Target", and each such Group will have a Capture for each occurrence - you only care about these captures.
Working example on Regex Storm - Click on Table tab and examine the 4 captures of ${Target}
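Reading those captures from C# could look like the sketch below. The pattern is the single-line Target version from above; the names TargetCaptures and FindAll are hypothetical. The first match ("tes") simply contributes no Target captures, so it needs no special handling.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TargetCaptures
{
    // Each "}" pops the matching empty O capture and records everything
    // between that position and the current one as a Target capture.
    private static readonly Regex Balanced = new Regex(
        @"(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))");

    // Hypothetical helper: collects every Target capture of every match.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match m in Balanced.Matches(input))
        {
            foreach (Capture c in m.Groups["Target"].Captures)
            {
                results.Add(c.Value);
            }
        }
        return results;
    }
}
```

Captures are recorded as each closing brace is reached, so for tes{}tR{R{abc}aD{mnoR{xyz}}} the list should come out innermost first: R{abc}, R{xyz}, D{mnoR{xyz}}, R{R{abc}aD{mnoR{xyz}}}.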
See also:
What are regular expression Balancing Groups?

Replace with wildcards

I need some advice. Suppose I have the following string: Read Variable
I want to find all pieces of text like this in a string and turn each of them into the following: Variable = MessageBox.Show. Some additional examples:
"Read Dog" --> "Dog = MessageBox.Show"
"Read Cat" --> "Cat = MessageBox.Show"
Can you help me? I need some quick advice on using Regex in C#. I think it is a job involving wildcards, but I do not know how to use them very well... Also, I need this for a school project tomorrow... Thanks!
Edit: This is what I have done so far and it does not work: Regex.Replace(String, "Read ", " = Messagebox.Show").
You can do this
string ns = Regex.Replace(yourString, @"Read\s+(.*?)(?:\s|$)", "$1 = MessageBox.Show");
\s+ matches 1 to many space characters
(.*?)(?:\s|$) matches 0 to many characters till the first space (i.e \s) or till the end of the string is reached(i.e $)
$1 represents the first captured group i.e (.*?)
You might want to clarify your question... but here goes:
If you want to match the next word after "Read " in regex, use Read (\w*) where \w is the word character class and * is the greedy match operator.
If you want to match everything after "Read " in regex, use Read (.*)$ where . will match all characters and $ means end of line.
With either regex, you can use a replace of $1 = MessageBox.Show as $1 will reference the first matched group (which was denoted by the parenthesis).
Complete code:
replacedString = Regex.Replace(inStr, @"Read (.*)$", "$1 = MessageBox.Show");
The problem with your attempt is that it cannot know that the replacement string should be inserted after your variable. Let's assume that valid variable names contain letters, digits and underscores (which can be conveniently matched with \w). That means any other character ends the variable name. Then you could match the variable name, capture it (using parentheses) and put it in the replacement string with $1:
output = Regex.Replace(input, @"Read\s+(\w+)", "$1 = MessageBox.Show");
Note that \s+ matches one or more arbitrary whitespace characters. \w+ matches one or more letters, digits and underscores. If you want to restrict variable names to letters only, this is the place to change it:
output = Regex.Replace(input, @"Read\s+([a-zA-Z]+)", "$1 = MessageBox.Show");
Here is a good tutorial.
Finally, note that in C# it is advisable to write regular expressions as verbatim strings (@"..."). Otherwise, you have to double every backslash so that it gets through to the regex engine, and that really lessens the readability of the regex.
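The point about verbatim strings can be illustrated with a short sketch. The two constants below hold the same pattern, written once as a verbatim string and once as a regular string with doubled backslashes; the names ReadRewriter and Rewrite are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ReadRewriter
{
    // The same regex text, written two ways.
    public const string VerbatimPattern = @"Read\s+(\w+)";
    public const string EscapedPattern = "Read\\s+(\\w+)";

    // Hypothetical helper: turns "Read X" into "X = MessageBox.Show".
    public static string Rewrite(string input)
    {
        return Regex.Replace(input, VerbatimPattern, "$1 = MessageBox.Show");
    }
}
```

Both constants compare equal at run time, and ReadRewriter.Rewrite("Read Dog") should return "Dog = MessageBox.Show".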

Matching repeating patterns

I'm currently trying to match and capture text in the following input:
field: one two three field: "moo cow" field: +this
I can match the field: with [a-z]*\: but I can't seem to match the rest of the content. So far my attempts have only resulted in capturing everything, which is not what I want.
If you know that it is always going to be literally field: there is absolutely no need for a regular expression:
var delimiters = new String[] {"field:"};
string[] values = input.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
However, from your regex I assume that the name field can vary, as long as it's in front of a colon. You could try to capture a word followed by : and then everything up to the next of those words (using a lookahead).
foreach(Match match in Regex.Matches(input, @"([a-z]+):((?:(?![a-z]+:).)*)"))
{
string fieldName = match.Groups[1].Value;
string value = match.Groups[2].Value;
}
An explanation of the regular expression:
( # opens a capturing group; the content can later be accessed with Groups[1]
[a-z] # lower-case letter
+ # one or more of them
) # end of capturing group
: # a literal colon
( # opens a capturing group; the content can later be accessed with Groups[2]
(?: # opens a non-capturing group; just a necessary subpattern which we do not
# need later any more
(?! # negative lookahead; this will NOT match if the pattern inside matches
[a-z]+:
# a word followed by a colon; just the same as we used at the beginning of
# the regex
) # end of negative lookahead (note that this does not consume any characters;
# it LOOKS ahead)
. # any character (except for line breaks)
) # end of non-capturing group
* # 0 or more of those
) # end of capturing group
So first we match anylowercaseword:. And then we match one more character at a time, for each one checking that this character is not the start of anotherlowercaseword:. With the capturing groups we can then later separately find the field's name and the field's value.
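Putting the loop and the pattern together, a self-contained sketch might look like this. The names FieldParser and Parse are hypothetical, and trimming the surrounding whitespace from each value is an assumption added here; the pattern itself is the one explained above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // A lower-case word, a colon, then every character that does not
    // start another "word:" sequence.
    private static readonly Regex Field =
        new Regex(@"([a-z]+):((?:(?![a-z]+:).)*)");

    // Hypothetical helper: returns (name, value) pairs in input order.
    // Values are trimmed, which the original loop does not do.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Field.Matches(input))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return pairs;
    }
}
```

For the sample input from the question this should produce three pairs, all named field, with the values one two three, "moo cow" (quotes included) and +this.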
Don't forget that you can actually match literal strings in regexes. If your pattern is like this:
field\:
You will match "field:" literally, and nothing else.
