Regex: only replace non-nested matches

Given text such as:
This is my [position].
Here are some items:
[items]
[item]
Position within the item: [position]
[/item]
[/items]
Once again, my [position].
I need to match the first and last [position], but not the [position] within [items]...[/items]. Is this doable with a regular expression? So far, all I have is:
Regex.Replace(input, @"\[position\]", "replacement value")
But that is replacing more than I want.

As Wug mentioned, regular expressions aren't great at counting. An easier option would be to just find the locations of all of the tokens you're looking for, and then iterate over them and construct your output accordingly. Perhaps something like this:
public string Replace(string input, string replacement)
{
    // find all the tags (verbatim string so the backslashes reach the regex engine)
    var regex = new Regex(@"\[(?:position|/?item)\]");
    var matches = regex.Matches(input);
    // loop through the tags and build up the output string
    var builder = new StringBuilder();
    int lastIndex = 0;
    int nestingLevel = 0;
    foreach (Match match in matches)
    {
        // append everything between the previous tag and this one
        builder.Append(input.Substring(lastIndex, match.Index - lastIndex));
        switch (match.Value)
        {
            case "[item]":
                nestingLevel++;
                builder.Append(match.Value);
                break;
            case "[/item]":
                nestingLevel--;
                builder.Append(match.Value);
                break;
            case "[position]":
                // Append the replacement text if we're outside of any [item]/[/item] pairs
                // Otherwise append the tag
                builder.Append(nestingLevel == 0 ? replacement : match.Value);
                break;
        }
        lastIndex = match.Index + match.Length;
    }
    // append whatever follows the last tag
    builder.Append(input.Substring(lastIndex));
    return builder.ToString();
}
(Disclaimer: Have not tested. Or even attempted to compile. Apologies in advance for inevitable bugs.)
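The same counting idea can also be sketched with a match evaluator, so Regex.Replace does the string building. This is only a sketch under some assumptions: the class and method names are made up for illustration, it tracks the outer [items]/[/items] pair from the question rather than [item], and it relies on Regex.Replace visiting matches from left to right (the default when RegexOptions.RightToLeft is not set).

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedReplaceSketch
{
    // Hypothetical helper: replaces [position] only when it is outside
    // every [items]...[/items] block.
    public static string ReplaceOutsideItems(string input, string replacement)
    {
        int depth = 0; // how many [items] blocks are currently open
        return Regex.Replace(input, @"\[/?items\]|\[position\]", m =>
        {
            if (m.Value == "[items]") { depth++; return m.Value; }
            if (m.Value == "[/items]") { if (depth > 0) depth--; return m.Value; }
            // m.Value is "[position]" here
            return depth == 0 ? replacement : m.Value;
        });
    }

    public static void Main()
    {
        string input = "my [position]. [items][item]in item: [position][/item][/items] again, my [position].";
        Console.WriteLine(ReplaceOutsideItems(input, "X"));
    }
}
```

Because the depth is a counter rather than a flag, [items] blocks nested inside other [items] blocks are handled as well.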

You could maaaaaybe get away with:
Regex.Replace(input,@"(?=\[position\])(!(\[item\].+\[position\].+\[/item\]))","replacement value");
I dunno, I hate ones like this. But this is a job for XML parsing, not regex. If your brackets are really square brackets, just search and replace them with angle brackets, then parse the result as XML.

What if you do it in two passes? For example:
s1 = Regex.Replace(input, @"(\[items\])(\w|\W)*(\[\/items\])", "")
This will give you:
This is my [position].
Here are some items:
Once again, my [position].
As you can see, the items section has been removed. Then, on s1, you can replace the positions you want. For example:
s2 = Regex.Replace(s1, @"\[position\]", "replacement_value")
This might not be the best solution. I tried very hard to solve it with a single regex but was not successful.
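If the [items] section has to survive in the output, one variation on the two-pass idea is to split the text around the blocks and only touch the pieces in between. The sketch below is an assumption-laden illustration, not a tested production routine: the names are hypothetical, and the lazy pattern does not cope with [items] blocks nested inside other [items] blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitReplaceSketch
{
    // Hypothetical helper: keeps [items]...[/items] blocks untouched and
    // replaces [position] everywhere else.
    public static string ReplaceOutsideBlocks(string input, string replacement)
    {
        // The capturing group makes Regex.Split keep each block in the result,
        // so even indexes are text outside blocks and odd indexes are blocks.
        string[] parts = Regex.Split(input, @"(\[items\].*?\[/items\])", RegexOptions.Singleline);
        for (int i = 0; i < parts.Length; i += 2)
        {
            parts[i] = parts[i].Replace("[position]", replacement);
        }
        return string.Concat(parts);
    }

    public static void Main()
    {
        string input = "my [position].\n[items]\n[item]\n[position]\n[/item]\n[/items]\nagain, my [position].";
        Console.WriteLine(ReplaceOutsideBlocks(input, "X"));
    }
}
```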

Related

Replace Nth regex match occurrence in string

I know there are quite a few of these questions on SO, but I can't find one that breaks down how the pattern is built to return the Nth match. All the answers I looked at just give the code to the OP with minimal explanation.
What I know is that you need to put {X} in the pattern, where X is the number of the occurrence you want to return.
So I am trying to match a string between two chars and I seemed to have been able to get that working.
The string to be tested looks something like this,
"=StringOne&=StringTwo&=StringThree&=StringFour&"
"[^/=]+(?=&)"
Again, after reading as much as I could, this pattern will also return all matches,
[^/=]+(?=&){1}
Due to {1} being the default and therefore redundant in the above pattern.
But I can't do this,
[^/=]+(?=&){2}
As it will not return the 3rd match as I was expecting it to.
So could someone please shove me in the right direction and explain how to get the pattern needed to find the occurrence of the match that will be needed?

A pure regex way is possible, but is not really very efficient if your pattern is complex.
var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
var idx = 2; // Replace this occurrence
var result = Regex.Replace(s, $@"^(=(?:[^=&]+&=){{{idx-1}}})[^=&]+", "${1}REPLACED");
Console.WriteLine(result); // => =StringOne&=REPLACED&=StringThree&=StringFour&
Regex details
^ - start of string
(=(?:[^=&]+&=){1}) - Group 1 capturing:
= - a = symbol
(?:[^=&]+&=){1} - 1 occurrence (this number is generated dynamically) of
[^=&]+ - 1 or more chars other than = and & (NOTE that in case the string may contain = and &, it is safer to replace it with .*? and pass RegexOptions.Singleline option to the regex compiler)
&= - a &= substring.
[^=&]+ - 1 or more chars other than = and &
The ${1} in the replacement pattern inserts the contents of Group 1 back into the resulting string.
As an alternative, I can suggest introducing a counter, incrementing it upon each match, and only replacing the match when the counter is equal to the occurrence you specify.
Use
var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
var idx_to_replace = 2; // Replace this occurrence
var cnt = 0; // Counter
var result = Regex.Replace(s, "[^=]+(?=&)", m => { // Match evaluator
cnt++; return cnt == idx_to_replace ? "REPLACED" : m.Value; });
Console.WriteLine(result);
// => =StringOne&=REPLACED&=StringThree&=StringFour&
The cnt is incremented inside the match evaluator inside Regex.Replace and m is assigned the current Match object. When cnt is equal to idx_to_replace the replacement occurs, else, the whole match is pasted back (with m.Value).
Another approach is to iterate through the matches, and once the Nth match is found, replace it by splitting the string into parts before the match and after the match breaking out of the loop once the replacement is done:
var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
var idx_to_replace = 2; // Replace this occurrence
var cnt = 0; // Counter
var result = string.Empty; // Final result variable
var rx = "[^=]+(?=&)"; // Pattern
for (var m=Regex.Match(s, rx); m.Success; m = m.NextMatch())
{
cnt++;
if (cnt == idx_to_replace) {
result = $"{s.Substring(0, m.Index)}REPLACED{s.Substring(m.Index+m.Length)}";
break;
}
}
Console.WriteLine(result); // => =StringOne&=REPLACED&=StringThree&=StringFour&
This might be quicker since the engine does not have to find all matches.
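For reuse, the loop-based variant could be wrapped in a small helper along these lines. This is a sketch only: ReplaceNth and its class are hypothetical names, the occurrence number is 1-based, the replacement is inserted literally (no $1-style substitutions), and the input is returned unchanged when there are fewer than n matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NthMatchSketch
{
    // Hypothetical helper: replaces only the n-th (1-based) match of pattern.
    public static string ReplaceNth(string input, string pattern, int n, string replacement)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // Walk forward match by match and stop as soon as the n-th one is reached.
        Match m = Regex.Match(input, pattern);
        for (int i = 1; i < n && m.Success; i++)
        {
            m = m.NextMatch();
        }
        if (!m.Success) return input; // fewer than n matches: nothing to replace

        return input.Substring(0, m.Index) + replacement + input.Substring(m.Index + m.Length);
    }

    public static void Main()
    {
        var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
        Console.WriteLine(ReplaceNth(s, "[^=]+(?=&)", 2, "REPLACED"));
    }
}
```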

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);

'(?![^']*\bdark\b)[^']*'
Try this. See the demo. Replace with an empty string. You can use a lookahead here to check whether the quoted substring contains the word dark.
https://www.regex101.com/r/rG7gX4/12

While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.

Something like this would work. You can add all the strings you want to keep to the excludedString array (calling Contains on an array needs using System.Linq).
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}

Another method uses the regex alternation operator |:
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace the matched characters with $1.
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) is a capturing group.
'[^']*\bdark\b[^']*' matches every single-quoted string that contains the word dark. [^']* matches any character except ', zero or more times.
('[^']*\bdark\b[^']*'): because this regex is inside a capturing group, the matched characters are stored in group 1.
| is the regex alternation operator.
'[^']*' matches all the remaining single-quoted strings (the ones that do not contain dark). It won't match a single-quoted string containing dark, because those were already matched by the pattern before the | operator.
Finally, replacing every match with the contents of group 1 (which is empty when the second alternative matched) gives the desired output.

Here is an attempt at what I think you had in mind (a solution using Split and Contains, without regex):
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); // trim the surrounding spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); // trim the trailing space again

Get partial string from string

I have the following string:
This isMyTest testing
I want to get isMyTest as a result. I only have the first two characters available ("is"). The rest of the word can vary.
Basically, I need to select the first word delimited by spaces which starts with "is".
I've started with the following:
if (text.Contains(" is"))
{
text.LastIndexOf(" is"); //Should give me index.
}
now I cannot find the right bound of the word since I need to match on something like

You can use regular expressions:
string pattern = @"\bis\w*"; // \w* extends the match to the end of the word
string input = "This isMyTest testing";
return Regex.Matches(input, pattern);

You can use IndexOf to get the index of the next space:
int startPosition = text.LastIndexOf(" is");
if (startPosition != -1)
{
    startPosition++; // skip the leading space
    int endPosition = text.IndexOf(' ', startPosition); // find the next space
    if (endPosition == -1)
        endPosition = text.Length; // the word runs to the end of the string
    string word = text.Substring(startPosition, endPosition - startPosition);
}

What about using a regex match? Generally, if you are searching for a pattern in a string (i.e. a space followed by some other characters), regexes are perfectly suited to this. They really only fall apart in context-sensitive areas (such as HTML) but are perfect for a regular string search.
// First we define the input string.
string input = "This isMyTest testing";
// Here we call Regex.Match; the group captures the word without the leading space.
Match match = Regex.Match(input, @" (is[A-Za-z0-9]*)", RegexOptions.IgnoreCase);
// Here we check the Match instance.
if (match.Success)
{
    // Finally, we get the Group value and display it.
    string key = match.Groups[1].Value;
    Console.WriteLine(key);
}
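If the prefix is only known at run time, the pattern can be assembled from it. The following is a rough sketch rather than a definitive answer: the helper name is invented, "delimited by spaces" is interpreted as "delimited by any whitespace or the string boundaries", and Regex.Escape is used so a prefix containing regex metacharacters is matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixWordSketch
{
    // Hypothetical helper: returns the first whitespace-delimited word that
    // starts with prefix, or an empty string when there is none.
    public static string FirstWordStartingWith(string text, string prefix)
    {
        // (?<!\S) asserts that the word starts at the beginning of the string
        // or right after whitespace; \S* consumes the rest of the word.
        string pattern = @"(?<!\S)" + Regex.Escape(prefix) + @"\S*";
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(FirstWordStartingWith("This isMyTest testing", "is"));
    }
}
```

Note that the "is" inside "This" is skipped because it is not preceded by whitespace.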

Splitting a string by another string

I got a string which I need to separate by another string which is a substring of the original one. Let's say I got the following text:
string s = "<DOC>something here <TEXT> and some stuff here </TEXT></DOC>";
And I want to retrieve:
"and some stuff here"
I need to get the string between "<TEXT>" and its closing tag "</TEXT>".
I can't manage to do so with the common Split method of string, even though one of its overloads takes a parameter of type string[]. What I am trying is:
Console.Write(s.Split("<TEXT>")); // Which doesn't compile
Thanks in advance for your kind help.

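const string open = "<TEXT>";
var start = s.IndexOf(open);
var end = start >= 0 ? s.IndexOf("</TEXT>", start + open.Length) : -1;
string res;
if (start >= 0 && end >= 0) {
    // skip past the opening tag so it is not included in the result
    start += open.Length;
    res = s.Substring(start, end - start).Trim();
} else {
    res = "NOT FOUND";
}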

Splitting on "<TEXT>" isn't going to help you in this case anyway, since the close tag is "</TEXT>".
The most robust solution would be to parse it properly as XML. C# provides functionality for doing that. The second example at http://msdn.microsoft.com/en-us/library/cc189056%28v=vs.95%29.aspx should put you on the right track.
However, if you're just looking for a quick-and-dirty one-time solution your best bet is going to be to hand-code something, such as dasblinkenlight's solution above.

var output = new List<String>();
foreach (Match match in Regex.Matches(source, "<TEXT>(.*?)</TEXT>")) {
output.Add(match.Groups[1].Value);
}

string s = "<DOC>something here <TEXT> and some stuff here </TEXT></DOC>";
string result = Regex.Match(s, "(?<=<TEXT>).*?(?=</TEXT>)").Value;
EDIT: I am using the regex pattern (?<=prefix)find(?=suffix), which matches the text find only when it sits between the prefix and the suffix, without including either of them in the match.
EDIT 2:
Find several results:
MatchCollection matches = Regex.Matches(s, "(?<=<TEXT>).*?(?=</TEXT>)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}

If the last tag is </DOC>, then you could use XElement.Load to load the XML and then walk through it to find the wanted element (you could also use LINQ to XML).
If this is not necessarily a well-formed XML string, you could always go with regular expressions to find the desired part of the text. In this case the expression should not be too hard to write yourself.
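A minimal sketch of the XML route, assuming the input really is well-formed XML (XElement.Parse throws an XmlException otherwise). XElement.Parse is used here instead of XElement.Load because the XML is already in a string; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlTextSketch
{
    // Hypothetical helper: returns the trimmed content of every <TEXT> element.
    public static List<string> ExtractText(string xml)
    {
        XElement root = XElement.Parse(xml); // throws if the string is not well-formed XML
        return root.Descendants("TEXT")
                   .Select(e => e.Value.Trim())
                   .ToList();
    }

    public static void Main()
    {
        string s = "<DOC>something here <TEXT> and some stuff here </TEXT></DOC>";
        foreach (string text in ExtractText(s))
        {
            Console.WriteLine(text);
        }
    }
}
```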

How can I find a string after a specific string/character using regex

I am hopeless with regex (c#) so I would appreciate some help:
Basically, I need to parse a text and find the following information inside it:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan: Sorry for reopening the question but there is quite an important thing missing in that regex. As I wrote in my last comment, Is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to put the "words to look for" count as a constant inside the regex.

The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of spaces
\w* = 0 or more word characters (non-space, basically)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(fyi) The "@" before a string means that \ no longer means something special, so you can type @"c:\fun.txt" instead of "c:\\fun.txt"

Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
RegexOptions.IgnoreCase);
foreach (Match m in matches) {
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found) {
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something

/KeywordB\: (\w+)/
This matches the word that comes after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.
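As for making the number of words a variable, one possibility is to build the quantifier into the pattern at run time. This is just a sketch under assumptions: WordsAfter is a hypothetical helper, a "word" is taken to be a run of \w characters, words are separated by whitespace, and the colon after the keyword is optional as the question describes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordWordsSketch
{
    // Hypothetical helper: returns the first wordCount words that follow
    // keyword (and an optional ':'), or null when there is no such match.
    public static string WordsAfter(string source, string keyword, int wordCount)
    {
        if (wordCount < 1) throw new ArgumentOutOfRangeException(nameof(wordCount));

        // One word, followed by exactly (wordCount - 1) further
        // whitespace-separated words.
        string pattern = Regex.Escape(keyword) + @":?\s*(\w+(?:\s+\w+){" + (wordCount - 1) + "})";
        Match m = Regex.Match(source, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string source = "KeywordA: TextToFind the rest KeywordB: Text ToFindB and then some more text.";
        Console.WriteLine(WordsAfter(source, "KeywordA", 1));
        Console.WriteLine(WordsAfter(source, "KeywordB", 2));
    }
}
```

A per-keyword word count could then live in a dictionary next to the keyword list, with one call per keyword.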
