Using Regex to edit a string in C#


I'm just beginning to use Regex, so bear with my terminology. I have a regex pattern that works properly on a string. The string could be in the format "text [pattern] text", so I also have a regex pattern that negates the first pattern. If I print out the results from each of the matches, everything is shown correctly.
The problem I'm having is that I want to add text into the string, and doing so changes the indexes of the matches in a regex MatchCollection. For example, if I wanted to enclose each found match in <td>match</td> tags, I have the following code:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
if (mc.Count > 0)
{
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index, mc[i].Length);
text = text.Insert(mc[i].Index, "<td>" + mc[i].Value + "</td>");
}
}
This works great for the first match. But as you'd expect, mc[i].Index is no longer valid afterwards because the string has changed. Therefore, I tried searching for just a single match inside the for loop, once for each match I would expect (mc.Count), but then I keep finding the first match.
So, hopefully without introducing more regex to make sure it's not the first match, and while keeping everything in one string, does anybody have any input on how I could accomplish this? Thanks for your input.
Edit: Thank you all for your responses, I appreciate all of them.

It can be as simple as:-
string newString = Regex.Replace("abc", "b", "<td>${0}</td>");
Results in a<td>b</td>c.
In your case:-
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
text = r.Replace(text, "<td>${0}</td>");
This replaces every occurrence of negRegexPattern with the content of that match surrounded by the td element.

Although I agree that the Regex.Replace answer above is the best choice, just to answer the question you asked: how about replacing from the last match to the first? This way the string only changes after the position of each earlier match, so the earlier matches' indexes remain valid.
for (int i = mc.Count - 1; i >= 0; --i)
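To spell out the back-to-front idea, here is a minimal sketch. The class name ReverseReplaceSketch and the method WrapMatches are made up for illustration, and the pattern in the usage note is only a stand-in for negRegexPattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReverseReplaceSketch
{
    // Hypothetical helper: wraps every match of `pattern` in <td>...</td>,
    // editing from the last match to the first so that the indexes of the
    // matches still to be processed are not shifted by earlier edits.
    public static string WrapMatches(string text, string pattern)
    {
        MatchCollection mc = Regex.Matches(text, pattern);
        for (int i = mc.Count - 1; i >= 0; i--)
        {
            Match m = mc[i];
            text = text.Remove(m.Index, m.Length)
                       .Insert(m.Index, "<td>" + m.Value + "</td>");
        }
        return text;
    }
}
```

For example, WrapMatches("a1b22c", "[0-9]+") should give a<td>1</td>b<td>22</td>c. Note the loop condition is i >= 0, so the first match (index 0) is processed too.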

static string Tabulate(Match m)
{
return "<td>" + m.ToString() + "</td>";
}
static void Replace()
{
string text = "your text";
string result = Regex.Replace(text, "your_regexp", new MatchEvaluator(Tabulate));
}

You can try something like this:
Regex.Replace(input, pattern, match =>
{
return "<tr>" + match.Value + "</tr>";
});

Declare a counter before the loop starts, and add to it the number of characters you inserted on each iteration, i.e.:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
int counter = 0;
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index + counter, mc[i].Length);
text = text.Insert(mc[i].Index + counter, "<td>" + mc[i].Value + "</td>");
counter += ("<td>" + "</td>").Length;
}
I haven't tested this, but it SHOULD work.

Related

My regex is not matching if a line break is found

I have a large string separated by line breaks.
Example:
This is my first sentence and here i will search for the word my
This is my second sentence
Using the code below, if I search for 'my' it will only return the 2 instances of 'my' from the first sentence and not the one from the second.
I wish to display the sentence the phrase is found in, which works fine, except that it does not search anything after the first line break.
Code:
var regex = new Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere, RegexOptions.Singleline));
var results = regex.Matches(largeStringInHere);
for (int i = 0; i < results.Count; i++)
{
searchCriteriaFound.Append((results[i].Value.Trim()));
searchCriteriaFound.Append(Environment.NewLine);
}
Code Edit:
string pattern = @".*(" + userSearchCriteraInHere + ")+.*";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(largeStringInHere, pattern, options))
{
searchCriteriaFound.Append(m.Value);
}

var userSearchCriteraInHere = "my";
var largeStringInHere = @"This is my first sentence and here i will search for the word my.
This is my second sentence.";
var regex = new Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere), RegexOptions.Singleline);
var results = regex.Matches(largeStringInHere);
Console.WriteLine(results.Count);
var searchCriteriaFound = new StringBuilder();
for (int i = 0; i < results.Count; i++)
{
searchCriteriaFound.Append((results[i].Value.Trim()));
searchCriteriaFound.Append(Environment.NewLine);
}
Console.Write(searchCriteriaFound.ToString());
This returns the following output:
2
This is my first sentence and here i will search for the word my.
This is my second sentence.
I did need to add periods at the end of your sentences, as your regex expects them.

Is there a particular reason not to just search for the word "my" multiple times in the following way:
(my)+
You can test it over at the following URL on Regex101: https://regex101.com/r/QIHWKf/1
If you want to match the whole sentence that has "my" you can use the following:
.*(my)+.*
https://regex101.com/r/QIHWKf/2
Here your full match is the whole sentence, and your first group match is the "my".

Change
Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere, RegexOptions.Singleline)
To
Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere, RegexOptions.Multiline)
RegexOptions.Multiline changes the meaning of the anchors ^ and $ so that they match at the beginning and end of each line, rather than only at the beginning and end of the entire string.

You could use a word boundary \b to prevent 'my' from matching as part of a larger word such as mystery, and change the option to RegexOptions.Multiline instead of RegexOptions.Singleline so that ^ and $ match at the start and end of each line.
^.*\bmy\b.*$
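A minimal sketch of how that pattern could be used from C#. The names LineSearchSketch and LinesContainingWord are hypothetical, and the TrimEnd('\r') call is my own addition for text with Windows line endings, where . would otherwise also match the \r before each \n.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LineSearchSketch
{
    // Hypothetical helper: returns every line of `text` that contains
    // `word` as a whole word.
    public static List<string> LinesContainingWord(string text, string word)
    {
        // Regex.Escape keeps user input from being read as regex syntax.
        string pattern = @"^.*\b" + Regex.Escape(word) + @"\b.*$";
        var lines = new List<string>();
        // Multiline makes ^ and $ match at the start and end of each line.
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Multiline))
        {
            lines.Add(m.Value.TrimEnd('\r'));
        }
        return lines;
    }
}
```

Called with the text "This is my first sentence\nNo mystery here\nThis is my second sentence" and the word "my", this should return the first and third lines only, since \b stops "mystery" from matching.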

To get all lines containing 'my' word, you can try this:
Code
static string GetSentencesContainMyWord(StreamReader file)
{
int counter = 0;
string line;
var sb = new StringBuilder();
while ((line = file.ReadLine()) != null)
{
if (line.Contains("my"))
sb.Append(line + Environment.NewLine);
counter++;
}
return sb.ToString();
}

Regex pattern BBCode to Wiki Notation, C#

I am tasked with converting BB code to wiki notation, and thanks to the many examples on SO I have cracked most of the tougher nuts. This is my first foray into Regex and I'm trying to learn it as I go (I would prefer StringBuilder, but it doesn't seem to work with BB code). There are 4 items I need replaced for which I cannot seem to create the proper pattern: (original string on left, what I need on right after double dash)
The first item is a problem child because the wiki engine adds a new line where the spaces are. It is not a separate field but part of a larger string, so I can't TRIM() it. I am currently using
result = result.Replace("[b]", "*").Replace("[/b]", "*");
the img issue is a need to somehow include the attributes if possible in the given format.
for the last 2 I am stumped. I have used
Regex r = new Regex(@"<a .*?href=['""](.+?)['""].*?>(.+?)</a>");
foreach (var match in r.Matches(multistring).Cast<Match>().OrderByDescending(m => m.Index))
{
string href = match.Groups[1].Value;
string txt = match.Groups[2].Value;
string wikilink = "[" + txt + "|" + href + "]";
sb.Remove(match.Groups[2].Index, match.Groups[2].Length);
sb.Insert(match.Groups[2].Index, wikilink);
}
in the past for HTML, but I can't seem to refactor it for my current needs. Suggestions and links to resources would all be appreciated.
EDIT
solved the img issue, though it's not pretty and I still risk removing a closing [/img] tag that may not be caught earlier. The [img] code is fairly consistent, so I used:
Regex imgparser = new Regex(@"\[img[^\]]*\]([^\[]*)");
foreach (var itag in imgparser.Matches(multistring).Cast<Match>().OrderByDescending(m => m.Index))
{
string isrc = itag.Groups[1].Value;
string wikipic = itag.ToString().Replace("[img ", "!" + isrc).Replace("width=", "!width=").Replace("height=", ",height=").Replace("]" + isrc, string.Empty);
result = result.Replace(itag.ToString(), wikipic);
}
result = result.Replace("[/img]", "!");

I can give you a little example for the last case :
string str1 = "[url=http://aadqsdqsd]link[/url]";
var pattern = @"^\[url=(.*)\](.*)\[\/url\]$";
var match = Regex.Match(str1, pattern);
var result = string.Format("[{0}| {1}]", match.Groups[2].Value, match.Groups[1].Value);
//[link| http://aadqsdqsd]
Is it what you want ?
EDIT
if you want to match a larger string you can do :
var strTomatch = "[url=http://1]link1[/url][url=http://2]link2[/url]" + Environment.NewLine +
"[url = http://3]link3[/url]" + Environment.NewLine +
"[url=http://4]link4[/url]";
var match = Regex.Match(strTomatch, @"\[url\s*=\s*(.*?)\](.*?)\[\/url\]", RegexOptions.Multiline);
while (match.Success)
{
var result = string.Format("[{0}| {1}]", match.Groups[2].Value, match.Groups[1].Value);
Debug.WriteLine(result);
match = match.NextMatch();
}
Output
[link1| http://1]
[link2| http://2]
[link3| http://3]
[link4| http://4]
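If the goal is to rewrite the links in place rather than print them, the same pattern can be handed to Regex.Replace with group references in the replacement string. This is only a sketch: BbUrlSketch and ConvertUrls are made-up names, and the output format [text|href] follows the wikilink format used in the question (no space after the |), which is an assumption about what the wiki engine expects.

```csharp
using System;
using System.Text.RegularExpressions;

static class BbUrlSketch
{
    // Hypothetical helper: turns [url=HREF]TEXT[/url] into [TEXT|HREF].
    // $1 is the captured href and $2 the captured link text.
    public static string ConvertUrls(string input)
    {
        return Regex.Replace(
            input,
            @"\[url\s*=\s*(.*?)\](.*?)\[/url\]",
            "[$2|$1]");
    }
}
```

Because Regex.Replace handles every match in one pass, there is no need to track indexes or loop with NextMatch(). Text outside the [url] tags is left untouched.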

Regex split and replace

I need to replace words that start with %.
For example: Welcome to home | %brand %productName
I'm hoping to split on words beginning with %, which would give me { brand, productName }.
My regex is less than average, so I would appreciate help with this.

The following code might help you:
string[] splits = "Welcome to home | %brand %productName".Split(' ');
List<string> lstdata = new List<string>();
for (int i = 0; i < splits.Length; i++)
{
if (splits[i].StartsWith("%"))
lstdata.Add(splits[i].Replace("%", ""));
}

There's nothing wrong with the string.Split approach, mind you, but here's a regex approach:
string input = @"Welcome to home | %brand %productName";
string pattern = @"%\S+";
var matches = Regex.Matches(input, pattern);
string result = string.Empty;
for (int i = 0; i < matches.Count; i++)
{
result += "match " + i + ",value:" + matches[i].Value + "\n";
}
Console.WriteLine(result);

Try this:
(?<=%)\w+
This looks for any combination of word characters immediately preceded by a percent symbol.
Now, if you're doing search and replace on these matches, you'll probably want to remove the % sign as well, so you'd need to remove the lookbehind group and just have this:
%\w+
But in doing so, your replacement code would need to trim off the % sign from each match to get the word by itself.
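One way to do the replacement without trimming by hand is to capture the name in a group and look it up in a MatchEvaluator. The sketch below is an assumption about what "replace" means here: the names PlaceholderSketch and Fill are made up, as are the dictionary and the sample values in the usage note.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PlaceholderSketch
{
    // Hypothetical helper: replaces each %name in `template` with the
    // value stored under "name"; unknown placeholders are left as-is.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Regex.Replace(template, @"%(\w+)", m =>
        {
            // Group 1 holds the word without the leading % sign.
            string name = m.Groups[1].Value;
            string value;
            return values.TryGetValue(name, out value) ? value : m.Value;
        });
    }
}
```

With a dictionary mapping brand to "Acme" and productName to "Widget", Fill("Welcome to home | %brand %productName", values) should give Welcome to home | Acme Widget.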

Regex matching key="value" pattern

I want to match following pattern:
key="value" key="value" key="value" key="value" ...
where key and value are [a-z0-9]+, both should be grouped (2 groups, the " - chars can be matched or skipped)
input that should not be matched:
key="value"key="value" (no space between pairs)
So far I have this (not .NET syntax):
([a-z0-9]+)=(\"[a-z0-9]+\")(?=\s|$)
The problem is that it matches key4="value4" in this input:
key3="value3"key4="value4"

The spec isn't very clear, but you can try:
(?<!\S)([a-z0-9]+)=("[a-z0-9]+")(?!\S)
Or, as a C# string literal:
"(?<!\\S)([a-z0-9]+)=(\"[a-z0-9]+\")(?!\\S)"
This uses negative lookarounds to ensure that the key-value pair is neither preceded nor followed by non-whitespace characters.
Here's an example snippet (as seen on ideone.com):
var input = "key1=\"value1\" key2=\"value2\"key3=\"value3\" key4=\"value4\"";
Console.WriteLine(input);
// key1="value1" key2="value2"key3="value3" key4="value4"
Regex r = new Regex("(?<!\\S)([a-z0-9]+)=(\"[a-z0-9]+\")(?!\\S)");
foreach (Match m in r.Matches(input)) {
Console.WriteLine(m);
}
// key1="value1"
// key4="value4"
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
On validating the entire input
You can use Regex.IsMatch to see if the input string matches against what should be the correct input pattern. You can also use the same pattern to extract the keys/values, thanks to the fact that .NET regex lets you access individual captures.
string[] inputs = {
"k1=\"v1\" k2=\"v2\" k3=\"v3\" k4=\"v4\"",
"k1=\"v1\" k2=\"v2\"k3=\"v3\" k4=\"v4\"",
" k1=\"v1\" k2=\"v2\" k3=\"v3\" k4=\"v4\" ",
" ",
" what is this? "
};
Regex r = new Regex("^\\s*(?:([a-z0-9]+)=\"([a-z0-9]+)\"(?:\\s+|$))+$");
foreach (string input in inputs) {
Console.Write(input);
if (r.IsMatch(input)) {
Console.WriteLine(": MATCH!");
Match m = r.Match(input);
CaptureCollection keys = m.Groups[1].Captures;
CaptureCollection values = m.Groups[2].Captures;
int N = keys.Count;
for (int i = 0; i < N; i++) {
Console.WriteLine(i + "[" + keys[i] + "]=>[" + values[i] + "]");
}
} else {
Console.WriteLine(": NO MATCH!");
}
}
The above prints (as seen on ideone.com):
k1="v1" k2="v2" k3="v3" k4="v4": MATCH!
0[k1]=>[v1]
1[k2]=>[v2]
2[k3]=>[v3]
3[k4]=>[v4]
k1="v1" k2="v2"k3="v3" k4="v4": NO MATCH!
k1="v1" k2="v2" k3="v3" k4="v4" : MATCH!
0[k1]=>[v1]
1[k2]=>[v2]
2[k3]=>[v3]
3[k4]=>[v4]
: NO MATCH!
what is this? : NO MATCH!
References
Is there a regex flavor that allows me to count the number of repetitions matched by the * and + operators?
Explanation of the pattern
The pattern to validate the entire input is essentially:
maybe leading
spaces       ___ end of string anchor
  |         /
^\s*(entry)+$
|           \
beginning    \__ one or more entry
of string
anchor
Where each entry is:
key=value(\s+|$)
That is, a key/value pair followed by either spaces or the end of the string.

I think SilentGhost's proposal is about using String.Split(),
like this:
String keyValues = "...";
foreach(String keyValuePair in keyValues.Split(' '))
Console.WriteLine(keyValuePair);
This is definitely faster and simpler.

Use a lookbehind like you used your lookahead:
(?<=\s|^)([a-z0-9]+)=(\"[a-z0-9]+\")(?=\s|$)
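As a C# sketch of how this pattern behaves on the problem input from the question (the names LookaroundSketch and FindPairs are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaroundSketch
{
    // Hypothetical helper: returns each key="value" pair that is bounded
    // by whitespace or the string boundaries on both sides.
    public static List<string> FindPairs(string input)
    {
        var r = new Regex("(?<=\\s|^)([a-z0-9]+)=(\"[a-z0-9]+\")(?=\\s|$)");
        var pairs = new List<string>();
        foreach (Match m in r.Matches(input))
        {
            pairs.Add(m.Value);
        }
        return pairs;
    }
}
```

For the input key1="value1" key2="value2"key3="value3" key4="value4", only key1="value1" and key4="value4" should be returned: key2 fails the lookahead because it is followed directly by k, and key3 fails the lookbehind because it is preceded by a quote.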

I second Jens' answer (but am still too puny to comment on others' answers).
Also, I've found this Regular Expressions Reference site to be quite awesome. There's a section on Lookaround about halfway down on the Advanced page, and some further notes about Lookbehind.

Why Group.Value always the last matched group string?

Recently, I found one C# Regex API really annoying.
I have the regular expression (([0-9]+)|([a-z]+))+ and I want to find all the matched strings. The code is shown below.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456defFOO";
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
Console.WriteLine("Match" + (++matchCount));
Console.WriteLine("Match group count = {0}", match.Groups.Count);
for (int i = 0; i < match.Groups.Count; i++)
{
Group group = match.Groups[i];
Console.WriteLine("Group" + i + "='" + group.Value + "'");
}
match = match.NextMatch();
Console.WriteLine("go to next match");
Console.WriteLine();
}
The output is:
Match1
Match group count = 4
Group0='abc123xyz456def'
Group1='def'
Group2='456'
Group3='def'
go to next match
It seems that each group.Value holds only the last string the group matched ("def" and "456"). It took me some time to figure out that I should rely on group.Captures instead of group.Value.
string regularExp = "(([0-9]+)|([a-z]+))+";
string str = "abc123xyz456def";
//Console.WriteLine(str);
Match match = Regex.Match(str, regularExp, RegexOptions.None);
int matchCount = 0;
while (match.Success)
{
Console.WriteLine("Match" + (++matchCount));
Console.WriteLine("Match group count = {0}", match.Groups.Count);
for (int i = 0; i < match.Groups.Count; i++)
{
Group group = match.Groups[i];
Console.WriteLine("Group" + i + "='" + group.Value + "'");
CaptureCollection cc = group.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine(" Capture" + j + "='" + c + "', Position=" + c.Index);
}
}
match = match.NextMatch();
Console.WriteLine("go to next match");
Console.WriteLine();
}
This will output:
Match1
Match group count = 4
Group0='abc123xyz456def'
Capture0='abc123xyz456def', Position=0
Group1='def'
Capture0='abc', Position=0
Capture1='123', Position=3
Capture2='xyz', Position=6
Capture3='456', Position=9
Capture4='def', Position=12
Group2='456'
Capture0='123', Position=3
Capture1='456', Position=9
Group3='def'
Capture0='abc', Position=0
Capture1='xyz', Position=6
Capture2='def', Position=12
go to next match
Now I am wondering why the API is designed like this. Why does Group.Value return only the last matched string? This design doesn't look good.

The primary reason is historical: regexes have always worked that way, going back to Perl and beyond. But it's not really bad design. Usually, if you want every match like that, you just leave off the outermost quantifier (+ in this case) and use the Matches() method instead of Match(). Every regex-enabled language provides a way to do that: in Perl or JavaScript you do the match in /g mode; in Ruby you use the scan method; in Java you call find() repeatedly until it returns false. Similarly, if you're doing a replace operation, you can plug the captured substrings back in as you go with placeholders ($1, $2 or \1, \2, depending on the language).
On the other hand, I know of no other Perl 5-derived regex flavor that provides the ability to retrieve intermediate capture-group matches like .NET does with its CaptureCollections. And I'm not surprised: it's actually very seldom that you really need to capture all the matches in one go like that. And think of all the storage and/or processing power it can take to keep track of all those intermediate matches. It is a nice feature though.
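For the regex in the question, the Matches() approach described above could look roughly like this. It is a sketch only; the names TokenSketch and Tokens are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenSketch
{
    // Hypothetical helper: the question's pattern without the outermost
    // group and + quantifier, so each run of digits or lowercase letters
    // becomes its own Match instead of one capture among many.
    public static List<string> Tokens(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, "([0-9]+)|([a-z]+)"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

For "abc123xyz456def" this should yield abc, 123, xyz, 456 and def, in that order. In each Match, whichever of group 1 (digits) or group 2 (letters) succeeded tells you the kind of token.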
