Regexp skip pattern - c#

Problem
I need to replace all asterisk symbols ('*') with the percent symbol ('%'). Asterisks inside square brackets should be left alone.
Example
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "Hel[*o], w*rld!";
var output = Regex.Replace(input, "What_pattern_should_be_there?", "%");
Assert.AreEqual("Hel[*o], w%rld!", output);
}

Try using a look ahead:
\*(?![^\[\]]*\])
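A minimal sketch of how this lookahead could be dropped into the test from the question. The class and method names are made up for illustration, and it assumes brackets are not nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Replaces '*' with '%' unless a ']' follows before any other bracket,
    // which is taken to mean "we are still inside [...]".
    public static string ReplaceOutsideBrackets(string input)
    {
        return Regex.Replace(input, @"\*(?![^\[\]]*\])", "%");
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceOutsideBrackets("Hel[*o], w*rld!")); // Hel[*o], w%rld!
    }
}
```

Because only the text to the right is inspected, a stray ] without an opening [ (as in "Hel*o], w*rld!") also protects the asterisk before it.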
Here's a bit stronger solution, which takes care of [] blocks better, and even escaped \[ characters:
string text = @"h*H\[el[*o], w*rl\]d!";
string pattern = @"
\\. # Match an escaped character. (to skip over it)
|
\[ # Match a character class
(?:\\.|[^\]])* # which may also contain escaped characters (to skip over it)
\]
|
(?<Asterisk>\*) # Match `*` and add it to a group.
";
text = Regex.Replace(text, pattern,
match => match.Groups["Asterisk"].Success ? "%" : match.Value,
RegexOptions.IgnorePatternWhitespace);
If you don't care about escaped characters you can simplify it to:
\[ # Skip a character class
[^\]]* # until the first ']'
\]
|
(?<Asterisk>\*)
Which can be written without comments as: @"\[[^\]]*\]|(?<Asterisk>\*)".
To understand why it works we need to understand how Regex.Replace works: at every position in the string it tries to match the regex. If it fails, it moves forward one character. If it succeeds, it continues after the end of the whole match.
Here, we have dummy matches for the [...] blocks so we may skip over the asterisks we don't want to replace, and match only the lonely ones. That decision is made in a callback function that checks if Asterisk was matched or not.
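To make the "dummy match" idea visible, here is a small sketch that lists every match of the simplified pattern and says whether the callback would replace it or pass it through. The class and method names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipPatternDemo
{
    private const string Pattern = @"\[[^\]]*\]|(?<Asterisk>\*)";

    // Describes each match as "index:value:action".
    public static List<string> Trace(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            string action = m.Groups["Asterisk"].Success ? "replace" : "skip";
            result.Add(m.Index + ":" + m.Value + ":" + action);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string line in Trace("Hel[*o], w*rld!"))
            Console.WriteLine(line);
        // 3:[*o]:skip
        // 10:*:replace
    }
}
```

The whole [*o] block is consumed as one match, so the engine never gets a chance to start a match at the asterisk inside it.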

I couldn't come up with a pure RegEx solution. Therefore I am providing you with a pragmatic solution. I tested it and it works:
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "H*]e*l[*o], w*rl[*d*o] [o*] [o*o].";
var actual = ReplaceAsterisksNotInSquareBrackets(input);
var expected = "H%]e%l[*o], w%rl[*d*o] [o*] [o*o].";
Assert.AreEqual(expected, actual);
}
private static string ReplaceAsterisksNotInSquareBrackets(string s)
{
Regex rx = new Regex(@"(?<=\[[^\[\]]*)(?<asterisk>\*)(?=[^\[\]]*\])");
var matches = rx.Matches(s);
s = s.Replace('*', '%');
foreach (Match match in matches)
{
s = s.Remove(match.Groups["asterisk"].Index, 1);
s = s.Insert(match.Groups["asterisk"].Index, "*");
}
return s;
}

EDITED
Okay here is my final attempt ;)
Using negative lookbehind (?<!) and negative lookahead (?!).
var output = Regex.Replace(input, @"(?<!\[)\*(?!\])", "%");
This also passes the test in the comment to another answer "Hel*o], w*rld!"
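A quick sketch for trying this pattern on a few inputs (the class name is illustrative). One thing to keep in mind: the lookarounds only inspect the characters directly next to the asterisk, so an asterisk in the middle of a bracketed block is still replaced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdjacentLookaroundDemo
{
    // Replaces '*' unless it is immediately preceded by '['
    // or immediately followed by ']'.
    public static string Replace(string input)
    {
        return Regex.Replace(input, @"(?<!\[)\*(?!\])", "%");
    }

    public static void Main()
    {
        Console.WriteLine(Replace("Hel[*o], w*rld!")); // Hel[*o], w%rld!
        Console.WriteLine(Replace("Hel*o], w*rld!"));  // Hel%o], w%rld!
        Console.WriteLine(Replace("[a*b]"));           // [a%b]
    }
}
```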

Related

How do I regex match each individual word within backticks?

I am trying to get results for each individual word within backticks. For example,
if I have something like this text
some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`
I want the search results to be:
match
these_words
th_is_wor
THIS_WOR
thi_sqw
word_snake
I'm essentially trying to get each "word", word being one or more english letter or underscore characters, between each set of backticks.
I currently have the following regex that seems to match ALL the text between each set of backticks:
/(?<=`)(\b([^`\]|\w|_)*\b)(?=`)/gi
This uses a positive lookbehind to find text that comes after a ` character: (?<=`)
Followed by a capture group for one or more things such that the thing is not a `, not a \, is a word character, or is an _ character within word boundaries: (\b([^`\]|\w|_)*\b)
Followed by a positive lookahead for another ` character to ensure we're enclosed within backticks.
This sort of works, but captures ALL the text between backticks instead of each individual word. This would require further processing which I'd like to avoid. My regex results right now are:
match these_words th_is_wor
THIS_WOR thi_sqw
word_snake
If there is a generic formula for getting each individual word within backticks or within quotes, that would be fantastic. Thank you!
Note: Much appreciated if the answer could be formatted in C#, but not required, as I can do that bit myself if needed.
Edit: Thank you Mr. إين from Ben Awad's Discord server for the quickest response! This is the solution as proposed by him. Also thank you to everyone who responded to my post, you guys are all AWESOME!
using System;
using System.Text.RegularExpressions;
class Program {
static void Main(string[] args) {
string backtickSentence = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
string quoteSentence = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";
string quotePattern = "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*\"(?:[^\"]* )*)\\w+";
// Call Matches method without specifying any options.
try {
foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Console.WriteLine();
foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
}
catch (RegexMatchTimeoutException) {} // Do Nothing: Assume that timeout represents no match.
Console.WriteLine();
// Call Matches method for case-insensitive matching.
try {
foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.IgnoreCase))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Console.WriteLine();
foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.IgnoreCase))
Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
}
catch (RegexMatchTimeoutException) {}
}
}
His explanation for this was as follows, but you can paste his regex into regexr.com for more info
var NOT_BACKTICK = @"[^`]*";
var WORD = @"(\w+)";
var START = $@"^{NOT_BACKTICK}"; // match anything before the first backtick
var INSIDE_BACKTICKS = $@"`{NOT_BACKTICK}`"; // match a pair of backticks
var ODD_NUM_BACKTICKS_BEFORE = $@"{START}({INSIDE_BACKTICKS}{NOT_BACKTICK})*`"; // match anything before the first backtick, then any amount of paired backticks with anything afterwards, then a single opening backtick
var CONDITION = $@"(?<={ODD_NUM_BACKTICKS_BEFORE})";
var CONDITION_TRUE = $@"(?: *{WORD})"; // match any spaces then a word
var CONDITION_FALSE = $@"(?:(?<={ODD_NUM_BACKTICKS_BEFORE}{NOT_BACKTICK} ){WORD})"; // match up to an opening backtick, then anything up to a space before the current word
// uses conditional matching
// see https://learn.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr
var pattern = $@"(?{CONDITION}{CONDITION_TRUE}|{CONDITION_FALSE})";
// refined backtick pattern
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
With C# you can use the Group.Captures Property and then get the capture group values.
Note that \w also matches _
`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`
Explanation
` Match literally
(?: Non capture group to repeat as a whole part
[\p{Zs}\t]* Match optional spaces
(\w+) Capture group 1, match 1+ word characters
[\p{Zs}\t]* Match optional spaces
)+ Close the non capture group and repeat it 1 or more times
` Match literally
See a .NET regex demo and a C# demo.
For example:
string s = @"some description ` match these_words th_is_wor ` or `THIS_WOR thi_sqw` a `word_snake`";
string pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";
foreach (Match m in Regex.Matches(s, pattern))
{
string[] result = m.Groups[1].Captures.Select(c => c.Value).ToArray();
Console.WriteLine(String.Join(',', result));
}
Output
match,these_words,th_is_wor
THIS_WOR,thi_sqw
word_snake

Regex exclude ":" and a whitespace if they exist

So I have a regex here:
var text = new Regex(@"(?<=Paybacks).*", RegexOptions.IgnoreCase);
This looks for the line where it starts with Paybacks. Now it currently prints ": blah".
The context can sometimes be "Paybacks" or "Paybacks:" or "Paybacks " or, I don't know, "Paybacks" followed by thousands of whitespaces. How can I modify this regex so that, after "Paybacks", it ignores a colon and any whitespace that may or may not exist?
I've been playing with it in regex101 and this seems to be working, but is there a better way?
(?<=Volatility(:\s)).*
In these situations, you'd better use a regex with a capturing group:
var pattern = new Regex(@"Paybacks[\s:]*(.*)", RegexOptions.IgnoreCase);
Then, you can use
var output = pattern.Match(text).Groups[1].Value;
See the .NET regex demo:
See the C# demo:
var texts = new List<string> { "Paybacks: blah","Paybacks:blah","Paybacks blah"};
var pattern = new Regex(@"Paybacks[\s:]*(.*)", RegexOptions.IgnoreCase);
texts.ForEach(text => Console.WriteLine(pattern.Match(text)?.Groups[1].Value));
printing 3 blahs.
You might also match optional colons and whitespace chars in the lookbehind, and start the match at the first char that is neither whitespace nor a colon:
(?<=Paybacks[:\s]*)[^\s:].*
The pattern matches:
(?<= Positive lookbehind, assert what is on the left is
Paybacks Match literally
[:\s]* Optionally match either : or a whitespace char using a character class
) Close lookbehind
[^\s:].* Match a single non whitespace char other than : and the rest of the line
var regex = new Regex(@"(?<=Paybacks[:\s]*)[^\s:].*", RegexOptions.IgnoreCase);
string[] strings = {"Paybacks: blah", "Paybacks blah", "Paybacks blah"};
foreach (String s in strings)
{
Console.WriteLine(regex.Match(s)?.Value);
}
Output
blah
blah
blah
If the order should be a single optional colon and optional whitespace chars, you can make the colon optional and the quantifier for the whitespace chars 0 or more using :?\s*
(?<=Paybacks:?\s*)[^\s:].*
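As a rough sketch of the difference this stricter variant makes (the helper name ValueAfterLabel is hypothetical), the following returns the text after the label, or null when the label is followed by anything other than at most one colon plus whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PaybacksDemo
{
    private static readonly Regex Rx =
        new Regex(@"(?<=Paybacks:?\s*)[^\s:].*", RegexOptions.IgnoreCase);

    // Returns the value after "Paybacks", or null if the pattern does not match.
    public static string ValueAfterLabel(string line)
    {
        Match m = Rx.Match(line);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ValueAfterLabel("Paybacks: blah"));        // blah
        Console.WriteLine(ValueAfterLabel("paybacks   blah"));       // blah
        Console.WriteLine(ValueAfterLabel("Paybacks::x") == null);   // True
    }
}
```

With the [:\s]* version shown earlier, the last input would yield x instead, because any mix of colons and whitespace is allowed there.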

Replace one character but not two in a string

I want to replace single occurrences of a character but not two in a string using C#.
For example, I want to replace & by an empty string but not when the occurrence is &&. Another example, a&b&&c would become ab&&c after the replacement.
If I use a regex like &[^&], it will also match the character after the & and I don't want to replace it.
Another solution I found is to iterate over the string characters.
Do you know a cleaner solution to do that?
To only match one & (not preceded or followed by &), use look-arounds (?<!&) and (?!&):
(?<!&)&(?!&)
See regex demo
You tried to use a negated character class that still matches a character, and you need to use a look-ahead/look-behind to just check for some character absence/presence, without consuming it.
See regular-expressions.info:
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u).
Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt.
You can match both & and && (or any number of repetition) and only replace the single one with an empty string:
str = Regex.Replace(str, "&+", m => m.Value.Length == 1 ? "" : m.Value);
You can use this regex: @"(?<!&)&(?!&)"
var str = Regex.Replace("a&b&&c", @"(?<!&)&(?!&)", "");
Console.WriteLine(str); // ab&&c
You can go with this:
public static string replacement(string oldString, char charToRemove)
{
string newString = "";
bool found = false;
foreach (char c in oldString)
{
if (c == charToRemove && !found)
{
found = true;
continue;
}
newString += c;
}
return newString;
}
Which is as generic as possible
I would use something like this, which IMO should be better than using Regex:
public static class StringExtensions
{
public static string ReplaceFirst(this string source, char oldChar, char newChar)
{
if (string.IsNullOrEmpty(source)) return source;
int index = source.IndexOf(oldChar);
if (index < 0) return source;
var chars = source.ToCharArray();
chars[index] = newChar;
return new string(chars);
}
}
I'll contribute to this statement from the comments:
in this case, only the substring with odd number of '&' will be replaced by all the "&" except the last "&" . "&&&" would be "&&" and "&&&&" would be "&&&&"
This is a pretty neat solution using balancing groups (though I wouldn't call it particularly clean nor easy to read).
Code:
string str = "11&222&&333&&&44444&&&&55&&&&&";
str = Regex.Replace(str, "&((?:(?<2>&)(?<-2>&)?)*)", "$1$2");
Output:
11222&&333&&44444&&&&55&&&&
ideone demo
It always matches the first & (not captured).
If it's followed by an even number of &, they're matched and stored in $1. The second group is captured by the first of the pair, but then it's subtracted by the second.
However, if there's an odd number of &, the optional group (?<-2>&)? does not match, and the group is not subtracted. Then, $2 will capture an extra &
For example, matching the subject "&&&&", the first char is consumed and it isn't captured (1). The second and third chars are matched, but $2 is subtracted (2). For the last char, $2 is captured (3). The last 3 chars were stored in $1, and there's an extra & in $2.
Then, the substitution "$1$2" == "&&&&".

Getting the substring after a character in C# using regex

I have the following input string:
string val = "[01/02/70]\nhello world ";
I want to get all the words after the last ] character.
Example output for a sample string above:
\nhello world
In C#, use Substring() with IndexOf:
val = val.Substring(val.IndexOf(']') + 1);
If you have multiple ] symbols, and you want to get all the string after the last one, use LastIndexOf:
string val = "[01/02/70]\nhello [01/02/80] world ";
val = val.Substring(val.LastIndexOf(']') + 1); // => " world "
If you are a fan of Regex, you might want to use a Regex.Replace like
string val = "[01/02/70]\nhello [01/02/80] world ";
val = Regex.Replace(val, @"^.*\]", string.Empty, RegexOptions.Singleline); // => " world "
See demo
Notes on REGEX:
RegexOptions.Singleline makes . match a linebreak
^ - matches beginning of string
.* - matches 0 or more characters but as many as possible (greedy matching)
\] - matches literal ] (as it is a special regex metacharacter, it must be escaped).
You need to use a lookbehind assertion. You also have to enable the DOTALL modifier, so that it also matches the newline character in between.
"(?s)(?<=\\]).*"
(?s) - DOTALL modifier.
(?<=\\]) - lookbehind which asserts that the match must be preceded by a close bracket
.* - Matches any character zero or more times.
or
"(?s)(?<=\\])[\\s\\S]*"
Try this if you don't want to match the following newline character.
@"(?<=\][\n\r]*).*"

Regex to extract Variable Part

I have a string containing this: @[User::RootPath]+"Dim_MyPackage10.dtsx" and I need to extract the [User::RootPath] part using a regex. So far I have this regex: [a-zA-Z0-9]*\.dtsx but I don't know how to proceed further.
For the variable, why not consume what is needed by using the not set [^ ] to extract everything except in the set?
The ^ inside the brackets negates the set, so the class matches any character that is not listed; here it matches everything that is not a ] or a quote (").
Then we can place the actual matches in named capture groups (?<{NameHere}> ) and extract accordingly
string pattern = @"(?:@\[)(?<Path>[^\]]+)(?:\]\+\"")(?<File>[^\""]+)(?:"")";
// Pattern is (?:@\[)(?<Path>[^\]]+)(?:\]\+\")(?<File>[^\"]+)(?:")
// w/o the "'s escapes for the C# parser
string text = @"@[User::RootPath]+""Dim_MyPackage10.dtsx""";
var result = Regex.Match(text, pattern);
Console.WriteLine ("Path: {0}{1}File: {2}",
result.Groups["Path"].Value,
Environment.NewLine,
result.Groups["File"].Value
);
/* Outputs
Path: User::RootPath
File: Dim_MyPackage10.dtsx
*/
(?: ) means match but don't capture; we use those groups as de facto anchors for our pattern, so they are not placed into the match capture groups.
Use this regex pattern:
\[[^[\]]*\]
Check this demo.
Your regex will match any number of alphanumeric characters, followed by .dtsx. In your example, it would match MyPackage10.dtsx.
If you want to match Dim_MyPackage10.dtsx you need to add an underscore to your list of allowed characters in the regex: [a-zA-Z0-9_]*\.dtsx
If you want to match the [User::RootPath], you need a regex that will stop at the last / (or \, depends on which type of slashes you use in the paths): something like this: .*\/ (or .*\\)
From the answers and comments - and the fact that none has been 'accepted' so far - it appears to me that the question/problem is not completely clear. If you're looking for the pattern [User::SomeVariable] where only 'SomeVariable' is, well, variable, then you may try:
\[User::\w+]
to capture the full expression.
Furthermore, if you wish to detect that pattern, but then need only the "SomeVariable" part, you may try:
(?<=\[User::)\w+(?=])
which uses look-arounds.
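Both patterns can be tried on the string from the question with a sketch like the one below. The class and method names are illustrative, and the sample input leaves out whatever precedes the opening bracket, since neither pattern depends on it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserVariableDemo
{
    // Returns the whole "[User::Name]" expression, or null if absent.
    public static string FullExpression(string input)
    {
        Match m = Regex.Match(input, @"\[User::\w+]");
        return m.Success ? m.Value : null;
    }

    // Returns only the variable name, using look-arounds so that
    // "[User::" and "]" are required but not part of the match.
    public static string VariableName(string input)
    {
        Match m = Regex.Match(input, @"(?<=\[User::)\w+(?=])");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string input = "[User::RootPath]+\"Dim_MyPackage10.dtsx\"";
        Console.WriteLine(FullExpression(input)); // [User::RootPath]
        Console.WriteLine(VariableName(input));   // RootPath
    }
}
```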
Here it is bro
using System;
using System.Text.RegularExpressions;
namespace myapp
{
class Class1
{
static void Main(string[] args)
{
String sourcestring = "source string to match with pattern";
Regex re = new Regex(@"\[\S+\]");
MatchCollection mc = re.Matches(sourcestring);
int mIdx=0;
foreach (Match m in mc)
{
for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
{
Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
}
mIdx++;
}
}
}
}
