advanced regex.replace handling - c#

I am using a Regex with a MatchEvaluator delegate to process a format string, e.g. "Time: %t, bytes: %b" would replace the "%t" with a time stamp, and the "%b" with a bytes count. Needless to say, there are loads of other options!
To do this, I use:
Regex regex = new Regex("%((?<BytesCompleted>b)|(?<BytesRemaining>B)|(?<TimeStamp>t))");
string s = "%bhello%t(HH:MM:SS)%P";
string r = regex.Replace(s, new MatchEvaluator(ReplaceMatch));
and
string ReplaceMatch(Match m)
{
... Handle the match replacement.
}
What would be nice is if I could use the Regex group name (or even the number, I'm not that fussy) in the delegate instead of comparing against the raw match to find out which match this is:
string ReplaceMatch(Match m)
{
...
case "%t":
...
case "%b":
...
}
Is pretty ugly; I would like to use
string ReplaceMatch(Match m)
{
...
case "BytesCompleted":
...
case "TimeStamp":
...
}
I can't see anything obvious in the debugger or via Google. Any ideas?

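It would be nice to be able to use the group name in the switch. Unfortunately, in older versions of the framework the Group object doesn't have a Name property (and neither does its base class Capture); Group.Name was only added in .NET Framework 4.7 and .NET Core. Without it, the best you'll be able to do is the following: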
string ReplaceMatch(Match m)
{
if (m.Groups["BytesCompleted"].Success) // ...
else if (m.Groups["BytesRemaining"].Success) // ...
// ...
}

You can use the Regex.GroupNameFromNumber instance method. Unfortunately, this means that the match-evaluator method needs a reference to the Regex object:
string ReplaceMatch(Match m)
{
    // Start at 1: group 0 is always the whole match.
    for (int i = 1; i < m.Groups.Count; i++)
    {
        // Skip the alternatives that did not take part in this match.
        if (!m.Groups[i].Success)
            continue;
        string groupName = _regex.GroupNameFromNumber(i);
        switch (groupName)
        {
            case "BytesCompleted":
                // do something
                break;
            case "BytesRemaining":
                // do something else
                break;
            ...
        }
    }
    ...
}
Here I assumed that the Regex object is accessible through the _regex variable.
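As a rough sketch of how the pieces above could fit together, the evaluator below walks the names returned by Regex.GetGroupNames and switches on the first named group that succeeded. The class FormatExpander, the method Expand and its parameters are hypothetical names made up for illustration, and the outer group of the question's pattern is made non-capturing here so that only the named groups remain.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class FormatExpander
{
    // Same alternatives as in the question; the outer group is non-capturing
    // (an assumption of this sketch) so only the named groups are reported.
    private static readonly Regex Pattern =
        new Regex("%(?:(?<BytesCompleted>b)|(?<BytesRemaining>B)|(?<TimeStamp>t))");

    public static string Expand(string format, long completed, long remaining, DateTime now)
    {
        return Pattern.Replace(format, m =>
        {
            foreach (string name in Pattern.GetGroupNames())
            {
                // "0" is the whole match; skip it and any group that did not participate.
                if (name == "0" || !m.Groups[name].Success)
                    continue;

                switch (name)
                {
                    case "BytesCompleted":
                        return completed.ToString(CultureInfo.InvariantCulture);
                    case "BytesRemaining":
                        return remaining.ToString(CultureInfo.InvariantCulture);
                    case "TimeStamp":
                        return now.ToString("HH:mm:ss", CultureInfo.InvariantCulture);
                }
            }
            return m.Value; // no known group matched: leave the text unchanged
        });
    }
}
```

Because the lambda closes over the static Pattern field, the evaluator no longer needs a separate _regex variable. Placeholders the pattern does not know, such as %P, are never matched and therefore stay in the output as they are.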

We need to combine both Sina's and James's answers: enumerate the groups to get each index and check whether the group succeeded, then use the index to look up the group name. I have expanded this into a self-explanatory test that uses a dictionary to replace the matched substrings. With a little more support for group names in the framework, this would have been much easier.
Also see another answer that might work for you: https://stackoverflow.com/a/1381163/481812
[Test]
public void ExploreRxReplace()
{
var txt = " ABC XYZ DEF ";
var exp = " *** XYZ xxx ";
var rx = new Regex(@"\s*(?<abc>ABC)\s*|\s*(?<def>DEF)\s*");
var data = new Dictionary<string, string>() { { "abc", "***" }, { "def", "xxx" } };
var txt2 = rx.Replace(txt, (m) =>
{
var sb = new StringBuilder();
var pos = m.Index;
for (var idx = 1; idx < m.Groups.Count; idx++)
{
var group = m.Groups[idx];
if (!group.Success)
{
// ignore non-matching group
continue;
}
var name = rx.GroupNameFromNumber(idx);
// append what's before
sb.Append(txt.Substring(pos, group.Index - pos));
string value = null;
if (group.Success && data.TryGetValue(name, out value))
{
// append dictionary value
sb.Append(value);
}
else
{
// append group value
sb.Append(group.Value);
}
pos = group.Index + group.Length;
}
// append what's after
sb.Append(txt.Substring(pos, m.Index + m.Length - pos));
return sb.ToString();
});
Assert.AreEqual(exp, txt2);
}

Related

How to split a string, keeping order and the reason for the split?

I am trying to split an HTML string into a Dictionary, where I keep the text, and what the HTML element was
For example, with this input
var input = "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";
I would like the following output
{"This is ", "None"},
{"bold", "Bold"},
{" where as ", "None"},
{"this is italic", "italic"},
{". This is the last sentence", "None"},
I can share my effort, but it's fairly pointless as I can't get it to work, and my approach feels impossible to scale.
internal Dictionary<string, string> SplitTextByHtmlTags(string input)
{
var result = new Dictionary<string, string>();
var splitText = new List<string>();
var split = Split(input, "b");
foreach (var bold in split)
{
var italics = Split(bold, "i");
splitText.AddRange(italics);
}
foreach (var bold in splitText)
{
var underlines= Split(bold, "u");
splitText.AddRange(underlines);
}
return result;
}
private IEnumerable<string> Split(string input, string htmlEleName)
{
return input.Split("<"+htmlEleName+">").Select(s => s.Split("</"+htmlEleName+">")).ToList();
}
As I said, the above does not return the right value nor does it work.

Assuming the input text is always this simple (no nested tags, no attributes, no comments, etc.), this is fairly easy to achieve with a regular expression. Otherwise, I would stick to using an HTML parser.
Here's a full example:
var result = new List<(string text, string styling)>();
string input =
"This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";
var matches = Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>");
foreach (Match match in matches)
{
// If neither `<b>` nor `<i>` was found.
if (!match.Groups[1].Success)
{
result.Add((match.Value, "None"));
}
else
{
string styling = (match.Groups[1].Value == "b" ? "Bold" : "Italic");
result.Add((match.Groups[2].Value, styling));
}
}
The example above creates a list of ValueTuple instead of a dictionary (which won't work in this case for reasons mentioned in the comments). The ValueTuple here has two string items. You might consider using an enum instead of a string for the styling.
Explanation of the Regex pattern:
[^<]+ - Match one or more characters other than '<'.
| - Or.
<([bi])> - Match either 'b' or 'i' enclosed in angle brackets and capture the letter in group 1.
([^<]+) - Match one or more characters other than '<' and capture them in group 2.
</\1> - Match a closing HTML tag (i.e., </..>) with the letter that was captured in group 1.
If you need to support other HTML tags, replace [bi] with something like (?:[biu]|div|span|etc) in the pattern above (or simply use \w+ to support any arbitrary tag). Then, you can have a dictionary that returns the "nice name" for each tag name:
var tags = new Dictionary<string, string>()
{
{ "b", "Bold" },
{ "i", "Italic" },
{ "u", "Underline" },
};
Then, you can use it in the else branch like this:
if (!tags.TryGetValue(match.Groups[1].Value, out string tag))
tag = match.Groups[1].Value;
result.Add((match.Groups[2].Value, tag));
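Putting the pattern, the loop and the tag dictionary together might look roughly like the sketch below. The class name HtmlSplitter and the method name Split are made up for illustration, and the pattern uses the \w+ variant mentioned above so that any tag name is accepted.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlSplitter
{
    // "Nice names" for known tags; unknown tags fall back to the raw tag name.
    private static readonly Dictionary<string, string> Tags = new Dictionary<string, string>
    {
        { "b", "Bold" },
        { "i", "Italic" },
        { "u", "Underline" },
    };

    public static List<(string Text, string Styling)> Split(string input)
    {
        var result = new List<(string Text, string Styling)>();
        foreach (Match match in Regex.Matches(input, @"[^<]+|<(\w+)>([^<]+)</\1>"))
        {
            // Group 1 only succeeds when the second alternative (a tag pair) matched.
            if (!match.Groups[1].Success)
            {
                result.Add((match.Value, "None"));
                continue;
            }

            string name = match.Groups[1].Value;
            if (!Tags.TryGetValue(name, out string styling))
                styling = name;
            result.Add((match.Groups[2].Value, styling));
        }
        return result;
    }
}
```

The same limits apply as before: nested tags, attributes and unbalanced tags are not handled, and text the pattern cannot match is skipped silently rather than reported.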

Try something like this:
internal Dictionary<string, string> SplitTextByHtmlTags(string input)
{
    var result = new Dictionary<string, string>();
    // Iterating through the string
    for (var i = 0; i < input.Length; i++)
    {
        // Detecting the opening of a tag
        if (input[i] != '<')
            continue;
        // While alphabetic characters follow, they are most likely the name of the tag
        int j = i + 1;
        while (j < input.Length && char.IsLetter(input[j]))
            j++;
        string tag = input.Substring(i + 1, j - i - 1);
        // Looking for the end of the tag
        int close = input.IndexOf('>', j);
        // Closing tags ("</b>") yield an empty name and are skipped
        if (tag.Length == 0 || close < 0)
            continue;
        // The content runs up to the next tag opening symbol
        int next = input.IndexOf('<', close + 1);
        if (next < 0)
            next = input.Length;
        // We put the found values in the map (a repeated tag overwrites the earlier entry)
        result[tag] = input.Substring(close + 1, next - close - 1);
        // We move the "cursor" of the main loop to just before the next tag opening symbol
        i = next - 1;
    }
    return result;
}

Increment string if exists

I need a piece of code that increments the number in the "[]" brackets at the end of a string, but it is giving me a headache.
The thing is, if the name "test" exists in the given collection, the algorithm should return "test_[0]"; if both exist, then "test_[1]", and so on. That works so far. But when I pass "test_[something]" as currentName, the algorithm creates something like test_[0]_[0], test_[0]_[1] instead of test_[something+someNumber]. Does anyone know how to change this behavior?
// test test, test_[2], test_[3]
protected string GetDistinctName2(string currentName, IEnumerable<string> existingNames)
{
int iteration = 0;
if (existingNames.Any(n => n.Equals(currentName)))
{
do
{
if (!currentName.EndsWith($"({iteration})"))
{
currentName = $"{currentName}_[{++iteration}]";
}
}
while (existingNames.Any(n => n.Equals(currentName)));
}
return currentName;
}
EDIT:
The best solution so far is the one below (I am fairly sure I saw it posted here, but someone must have deleted it):
public static void Main()
{
var currentOriginal = "test";
var existingNamesOriginal = new[] { "test", "test_[2]", "test_[3]" };
string outputOriginal = GetDistinctNameFromSO(currentOriginal, existingNamesOriginal);
Console.WriteLine("original : " + outputOriginal);
Console.ReadLine();
}
protected static string GetDistinctNameFromSO(string currentName,
IEnumerable<string> existingNames)
{
if (null == currentName)
throw new ArgumentNullException(nameof(currentName));
else if (null == existingNames)
throw new ArgumentNullException(nameof(existingNames));
string pattern = $@"^{Regex.Escape(currentName)}(?:_\[(?<Number>[0-9]+)\])?$";
Regex regex = new Regex(pattern);
var next = existingNames
.Select(item => regex.Match(item))
.Where(match => match.Success)
.Select(match => string.IsNullOrEmpty(match.Groups["Number"].Value)
? 1
: int.Parse(match.Groups["Number"].Value))
.DefaultIfEmpty()
.Max() + 1;
if (next == 1)
return currentName; // No existingNames - return currentName
else
return $"{currentName}_[{next}]";
}
For the given string "test" it returns "test_[4]", which is excellent. But if the given string is, say, "test_[2]", it should also return "test_[4]" (the given pattern with the first free number); instead it returns "test_[2]_[2]".

I will try to answer with minimal adjustments to your code:
Use square brackets when checking whether the name exists.
Use a local variable to avoid appending [0] over and over.
Increment iteration on every pass of the do/while loop.
If the result should never be "test", exclude it from the existing names.
The result looks like this (not tested, but it should get you on your way):
// test test, test_[2], test_[3]
protected string GetDistinctName2(string currentName, IEnumerable<string> existingNames)
{
int iteration = 0;
// Use a different variable this will prevent you from adding [0] again and again
var result = currentName;
// Only search for a free suffix when the name is already taken
if (existingNames.Any(n => n.Equals(result)))
{
do
{
// Use square brackets
if (!result.EndsWith($"[{iteration}]"))
{
result = $"{currentName}_[{iteration}]";
}
iteration++; // Increment with every step
}
while (existingNames.Any(n => n.Equals(result)));
}
return result;
}

Here's a simpler rewrite:
protected string GetDistinctName2(string currentName, IEnumerable<string> existingNames)
{
int iteration = 0;
var name = currentName;
while(existingNames.Contains(name))
{
name = currentName + "_[" + (iteration++) + "]";
}
return name;
}
Tests:
GetDistinctName2("test", new List<string> {"test", "test_[0]", "test_[2]", "test_[3]"}).Dump();//Out: test_[1]
GetDistinctName2("test", new List<string> {"test", "test_[0]", "test_[1]", "test_[2]", "test_[3]"}).Dump();//Out: test_[4]
GetDistinctName2("test", new List<string> {}).Dump();//Out: test
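For the case raised in the question's edit, where the incoming name already carries a "_[n]" suffix, one possible approach is to strip that suffix before searching. The sketch below is only an illustration: NameHelper and GetDistinctName are made-up names, and it follows the "highest existing number plus one" rule of the regex version in the question (a bare name counts as 1) rather than looking for the first gap.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NameHelper
{
    public static string GetDistinctName(string currentName, IEnumerable<string> existingNames)
    {
        var names = existingNames.ToList();
        if (!names.Contains(currentName))
            return currentName; // nothing to disambiguate

        // "test_[2]" -> "test"; a name without a numeric suffix is left as it is.
        string baseName = Regex.Replace(currentName, @"_\[[0-9]+\]$", "");
        var regex = new Regex($@"^{Regex.Escape(baseName)}(?:_\[(?<Number>[0-9]+)\])?$");

        // currentName itself is in the list and matches, so the sequence is never empty.
        int highest = names
            .Select(n => regex.Match(n))
            .Where(m => m.Success)
            .Select(m => m.Groups["Number"].Success ? int.Parse(m.Groups["Number"].Value) : 1)
            .Max();

        return $"{baseName}_[{highest + 1}]";
    }
}
```

With the existing names "test", "test_[2]" and "test_[3]", both "test" and "test_[2]" would come back as "test_[4]", while a name that is not in the list is returned untouched.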

Your description of the problem and your code are completely different. The code here increments a number inside the square brackets without appending additional text.
The change from the initial code addresses a problem mentioned in the question about including text inside the square brackets with a number. Below you can replace something with other text.
protected string GetDistinctName2(string currentName, IEnumerable<string> existingNames)
{
int iteration = 0;
string nextName = currentName;
while (existingNames.Contains(nextName))
{
nextName = $"{currentName}_[something{iteration}]";
iteration++;
}
return nextName;
}
C# interactive shell example:
> GetDistinctName2("test", new List<string>() { "test", "test_[something0]", "test_[something1]"})
"test_[something2]"

c# contains word exception number

I need to search a string and see if it contains "<addnum(x)>"
I have used .Contains for the other words I search for. The easiest way I could think of would be to somehow make an exception for numbers; or do I need different code for that?
My code so far:
public List<string> arguments = new List<string>();
public void Custom_naming(string name_code)
{
arguments.Add("Changing the name to " + name_code); // Sets the new name.
if( name_code.Contains("<addnum>") )
{
Add_number();
}
if (name_code.Contains("<addnum(x)>"))
{// X = any number.
}
}
private void Add_number()
{
arguments.Add("Replaces all <addnum> with a number");
}
private void Add_number(int zeros)
{
arguments.Add("Replaces all <addnumxx> with a number with length of");
}

You probably need to use a regular expression:
var match = Regex.Match(name_code, @"<addnum(?:\((\d+)\))?>");
if (match.Success)
{
int zeros;
if (int.TryParse(match.Groups[1].Value, out zeros))
{
Add_number(zeros);
}
else
{
Add_number();
}
}
This will invoke the appropriate Add_number method if name_code contains <addnum> or anything like <addnum(123)>.
If there could be more than one such token in name_code, e.g. <addnum(1)><addnum(2)>, you'll want to use a loop to analyze each match, like this:
var matches = Regex.Matches(name_code, @"<addnum(?:\((\d+)\))?>");
foreach (Match match in matches)
{
int zeros;
if (int.TryParse(match.Groups[1].Value, out zeros))
{
Add_number(zeros);
}
else
{
Add_number();
}
}

Use regular expression:
string s = "Foo <addnum(8)> bar.";
var contains = Regex.IsMatch(s, @"<addnum\(\d+\)>");
If you also want to extract the number:
string s = "Foo <addnum(42)> bar.";
var match = Regex.Match(s, @"<addnum\((\d+)\)>");
if (match.Success)
{
// assume you have valid integer number
var number = Int32.Parse(match.Groups[1].Value);
}
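If the goal is to actually substitute the tokens rather than just detect them, Regex.Replace with a match evaluator can do both in one pass. The sketch below assumes, as a guess based on the question's Add_number(int zeros) overload, that the x in <addnum(x)> is a minimum width to pad with zeros; NameCodes and ApplyNumber are hypothetical names, and number is assumed to be non-negative.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class NameCodes
{
    // Replaces every <addnum> or <addnum(x)> token with the given number.
    // Assumption of this sketch: x is the minimum width, padded with zeros.
    public static string ApplyNumber(string nameCode, int number)
    {
        return Regex.Replace(nameCode, @"<addnum(?:\(([0-9]+)\))?>", m =>
        {
            // Group 1 is only set for the <addnum(x)> form.
            int width = m.Groups[1].Success
                ? int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture)
                : 1;
            return number.ToString(CultureInfo.InvariantCulture).PadLeft(width, '0');
        });
    }
}
```

For example, ApplyNumber("img_<addnum(3)>_<addnum>", 7) would give "img_007_7".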

How can I search through a string in C# and replace areas bounded by a pattern?

We tried a few solutions now that try and use XML parsers. All fail because the strings are not always 100% valid XML. Here's our problem.
We have strings that look like this:
var a = "this is a testxxx of my data yxxx and of these xxx parts yxxx";
var b = "hello testxxx world yxxx ";
and we would like them to become:
"this is a testxxx3yxxx and of these xxx1yxxx";
"hello testxxx1yxxx ";
The key here is that we want to do something to the data between xxx and yxxx. In the example above I would need a function that counts words and replaces the strings with a word count.
Is there a way we can process the string a and apply a function to change the data that's between the xxx and yxxx? Any function right now as we're just trying to get an idea of how to code this.

You can use the Split method:
var parts = a.Split(new[] {"xxx", "yxxx"}, StringSplitOptions.None)
.Select((s, index) =>
{
string s1 = index%2 == 1 ? string.Format("{0}{2}{1}", "xxx", "yxxx", s + "1") : s;
return s1;
});
var result = string.Join("", parts);

If the delimiters are always going to be xxx and yxxx, you can use a regex as suggested.
var stringBuilder = new StringBuilder();
Regex regex = new Regex("xxx(.*?)yxxx");
foreach (Match match in regex.Matches(a))
{
    // Group 1 holds the text between the delimiters
    var value = match.Groups[1].Value;
    // do something to value and then append it to string builder
    stringBuilder.Append(string.Format("{0}{1}{2}", "xxx", value, "yxxx"));
}
I suppose this is as basic as it gets.

Using Regex.Replace will replace all the matches with your choice of text, something like this:
// Lazy quantifier, so that each xxx...yxxx pair is matched on its own
Regex rgx = new Regex("xxx.+?yxxx");
string cleaned = rgx.Replace(a, "replacementtext");

This code will process each of the parts delimited by "xxx". It preserves the "xxx" separators. If you do not want to preserve the "xxx" separators, remove the two lines that say "result.Append(separator);".
Given:
"this is a testxxx of my data yxxx and there are many of these xxx parts yxxx"
It prints:
"this is a testxxx>> of my data y<<xxx and there are many of these xxx>> parts y<<xxx"
I'm assuming that's the kind of thing you want. Add your own processing to "processPart()".
using System;
using System.Text;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main(string[] args)
{
string text = "this is a testxxx of my data yxxx and there are many of these xxx parts yxxx";
string separator = "xxx";
var result = new StringBuilder();
int index = 0;
while (true)
{
int start = text.IndexOf(separator, index);
if (start < 0)
{
result.Append(text.Substring(index));
break;
}
result.Append(text.Substring(index, start - index));
int end = text.IndexOf(separator, start + separator.Length);
if (end < 0)
{
throw new InvalidOperationException("Unbalanced separators.");
}
start += separator.Length;
result.Append(separator);
result.Append(processPart(text.Substring(start, end-start)));
result.Append(separator);
index = end + separator.Length;
}
Console.WriteLine(result);
}
private static string processPart(string part)
{
return ">>" + part + "<<";
}
}
}
[EDIT] Here's the code amended to work with two different separators:
using System;
using System.Text;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main(string[] args)
{
string text = "this is a test<pre> of my data y</pre> and there are many of these <pre> parts y</pre>";
string separator1 = "<pre>";
string separator2 = "</pre>";
var result = new StringBuilder();
int index = 0;
while (true)
{
int start = text.IndexOf(separator1, index);
if (start < 0)
{
result.Append(text.Substring(index));
break;
}
result.Append(text.Substring(index, start - index));
int end = text.IndexOf(separator2, start + separator1.Length);
if (end < 0)
{
throw new InvalidOperationException("Unbalanced separators.");
}
start += separator1.Length;
result.Append(separator1);
result.Append(processPart(text.Substring(start, end-start)));
result.Append(separator2);
index = end + separator2.Length;
}
Console.WriteLine(result);
}
private static string processPart(string part)
{
return "|" + part + "|";
}
}
}

The IndexOf() method returns the index of the first occurrence of a given substring.
(My indices might be a bit off, but) I would suggest doing something like this:
var searchme = "this is a testxxx of my data yxxx and there are many of these xxx parts yxxx";
var startindex = searchme.IndexOf("xxx");
var endindex = searchme.IndexOf("yxxx") + 3; // added 3 to find the index of the last 'x' instead of the index of the 'y' character
var stringpiece = searchme.Substring(startindex, endindex - startindex + 1);
and you can repeat that while startindex != -1.
Like I said, the indices might be slightly off and you might have to add a +1 or -1 somewhere, but this will get you along nicely (I think).
Here is a little sample program that counts chars instead of words. But you should just need to change the processor function.
var a = "this is a testxxx of my data yxxx and there are many of these xxx parts yxxx";
a = ProcessString(a, CountChars);
string CountChars(string a)
{
return a.Length.ToString();
}
string ProcessString(string a, Func<string, string> processor)
{
int idx_start, idx_end = -4;
while ((idx_start = a.IndexOf("xxx", idx_end + 4)) >= 0)
{
idx_end = a.IndexOf("yxxx", idx_start + 3);
if (idx_end < 0)
break;
var string_in_between = a.Substring(idx_start + 3, idx_end - idx_start - 3);
var newString = processor(string_in_between);
a = a.Substring(0, idx_start + 3) + newString + a.Substring(idx_end, a.Length - idx_end);
idx_end -= string_in_between.Length - newString.Length;
}
return a;
}

I would use Regex Groups:
Here is my solution to get the parts in the string:
private static IEnumerable<string> GetParts( string searchFor, string begin, string end ) {
string exp = string.Format("({0}(?<searchedPart>.+?){1})+", begin, end);
Regex regex = new Regex(exp);
MatchCollection matchCollection = regex.Matches(searchFor);
foreach (Match match in matchCollection) {
Group @group = match.Groups["searchedPart"];
yield return @group.ToString();
}
}
You can use it like this to get the parts:
string a = "this is a testxxx of my data yxxx and there are many of these xxx parts yxxx";
IEnumerable<string> parts = GetParts(a, "xxx", "yxxx");
To replace the parts in the original string, you can use the Regex Group to determine the start position and length (@group.Index, @group.Length).
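As an alternative to tracking positions by hand, Regex.Replace accepts a match evaluator and rebuilds the string itself. The following is a minimal sketch of that idea applied to the word-count example from the question; BoundedReplace, ReplaceBetween and CountWords are hypothetical names, and words are assumed to be separated by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundedReplace
{
    // Applies processor to the text between each begin...end pair and keeps the delimiters.
    public static string ReplaceBetween(string input, string begin, string end,
                                        Func<string, string> processor)
    {
        // Escape the delimiters so they are treated literally; (.*?) is lazy,
        // so every begin...end pair is handled on its own.
        string pattern = Regex.Escape(begin) + "(.*?)" + Regex.Escape(end);
        return Regex.Replace(
            input,
            pattern,
            m => begin + processor(m.Groups[1].Value) + end,
            RegexOptions.Singleline);
    }

    // Counts whitespace-separated words.
    public static string CountWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length.ToString();
    }
}
```

Calling BoundedReplace.ReplaceBetween(a, "xxx", "yxxx", BoundedReplace.CountWords) on the first string from the question should produce "this is a testxxx3yxxx and of these xxx1yxxx". Text outside the delimiters is copied through unchanged, and an opening xxx without a closing yxxx is simply left alone.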

Find a substring, replace a substring according the case

What's the easiest and fastest way to find a sub-string(template) in a string and replace it with something else following the template's letter case (if all lower case - replace with lowercase, if all upper case - replace with uppercase, if begins with uppercase and so on...)
so if the substring is in curly braces
"{template}" becomes "replaced content"
"{TEMPLATE}" becomes "REPLACED CONTENT" and
"{Template}" becomes "Replaced content" but
"{tEMPLATE}" becomes "rEPLACED CONTENT"

Well, you could use regular expressions and a match evaluator callback like this:
var regex = new Regex(@"\{(?<value>.*?)\}",
RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);
string replacedText = regex.Replace(<text>,
new MatchEvaluator(this.EvaluateMatchCallback));
And your evaluator callback would do something like this:
private string EvaluateMatchCallback(Match match) {
string templateInsert = match.Groups["value"].Value;
// or whatever
string replacedText = GetReplacementTextBasedOnTemplateValue(templateInsert);
return replacedText;
}
Once you get the regex match value you can just do a case-sensitive comparison and return the correct replacement value.
EDIT: I had assumed you were trying to find the placeholders in a block of text rather than worrying about the casing per se. If your pattern is always valid, you can just check the first two characters of the placeholder itself; that tells you the casing to use in the replacement:
string foo = "teMPLATE";
if (char.IsLower(foo[0])) {
if (char.IsLower(foo[1])) {
// first lower and second lower
}
else {
// first lower and second upper
}
}
else {
if (char.IsLower(foo[1])) {
// first upper and second lower
}
else {
// first upper and second upper
}
}
I would still use a regular expression to match the replacement placeholder, but that's just me.
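To make the idea concrete, here is a small sketch that combines the match evaluator with the two-character check: the case of the first letter of the placeholder decides the first letter of the replacement, and the case of the second letter decides the rest. CaseTemplate and Fill are made-up names, and the sketch assumes placeholders consist of ASCII letters only and that every placeholder gets the same replacement text.

```csharp
using System.Text.RegularExpressions;

public static class CaseTemplate
{
    public static string Fill(string text, string replacement)
    {
        if (string.IsNullOrEmpty(replacement))
            return Regex.Replace(text, @"\{[A-Za-z]+\}", "");

        return Regex.Replace(text, @"\{(?<value>[A-Za-z]+)\}", m =>
        {
            string key = m.Groups["value"].Value;
            bool firstUpper = char.IsUpper(key[0]);
            // A one-letter placeholder has no second character; reuse the first.
            bool restUpper = key.Length > 1 ? char.IsUpper(key[1]) : firstUpper;

            string first = replacement.Substring(0, 1);
            string rest = replacement.Substring(1);
            return (firstUpper ? first.ToUpperInvariant() : first.ToLowerInvariant())
                 + (restUpper ? rest.ToUpperInvariant() : rest.ToLowerInvariant());
        });
    }
}
```

With the replacement "replaced content", the placeholders {template}, {TEMPLATE}, {Template} and {tEMPLATE} would come out as "replaced content", "REPLACED CONTENT", "Replaced content" and "rEPLACED CONTENT" respectively.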

You can check the case of the first two letters of the placeholder and choose one of the four case transforming strategies for the inserted text.
public static string Convert(string input, bool firstIsUpper, bool restIsUpper)
{
string firstLetter = input.Substring(0, 1);
firstLetter = firstIsUpper ? firstLetter.ToUpper() : firstLetter.ToLower();
string rest = input.Substring(1);
rest = restIsUpper ? rest.ToUpper() : rest.ToLower();
return firstLetter + rest;
}
public static string Replace(string input, Dictionary<string, string> valueMap)
{
var ms = Regex.Matches(input, "{(\\w+?)}");
int i = 0;
var sb = new StringBuilder();
for (int j = 0; j < ms.Count; j++)
{
string pattern = ms[j].Groups[1].Value;
string key = pattern.ToLower();
bool firstIsUpper = char.IsUpper(pattern[0]);
bool restIsUpper = char.IsUpper(pattern[1]);
sb.Append(input.Substring(i, ms[j].Index - i));
sb.Append(Convert(valueMap[key], firstIsUpper, restIsUpper));
i = ms[j].Index + ms[j].Length;
}
// Append whatever follows the last placeholder.
sb.Append(input.Substring(i));
return sb.ToString();
}
public static void DoStuff()
{
Console.WriteLine(Replace("--- {aAA} --- {AAA} --- {Aaa}", new Dictionary<string,string> {{"aaa", "replacement"}}));
}

I ended up doing this:
public static string ReplaceWithTemplate(this string original, string pattern, string replacement)
{
var template = Regex.Match(original, pattern, RegexOptions.IgnoreCase).Value.Remove(0, 1);
template = template.Remove(template.Length - 1);
var chars = new List<char>();
var isLetter = false;
for (int i = 0; i < replacement.Length; i++)
{
if (i < (template.Length)) isLetter = Char.IsUpper(template[i]);
chars.Add(Convert.ToChar(
isLetter ? Char.ToUpper(replacement[i])
: Char.ToLower(replacement[i])));
}
return new string(chars.ToArray());
}
