Regex to first match, then replace found matches - c#

In my C# program I am using Regular expressions to:
Loop through a list of possible words in need of replacing.
For each word, to find out if a string I am given has any matches.
If it does, I perform some (slightly costly) logic to create the replacement.
I then perform the actual replacement.
My current code looks roughly as follows:
string toSearchInside; // The actual string I'm going to be replacing within
List<string> searchStrings; // The list of words to look for via regex
string pattern = #"([:#?]{0})";
string replacement;
foreach (string toMatch in searchStrings)
{
var regex = new Regex(
string.Format(pattern, toMatch),
RegexOptions.IgnoreCase
);
var matches = regex.Matches(toSearchInside);
if (matches.Count == 0)
continue;
replacement = CreateReplacement(toMatch);
toSearchInside = regex.Replace(toSearchInside, replacement);
And I can get this working, but it seems somewhat inefficient in that it is using the regex engine twice - Once to find the matches (regex.Matches()) and once for the replacing regex.Replace()). I was wondering if there was a way to simply say replace the matches you already found?

you could get all the matches from the first match - and for each match you have its index, that you could iterate through the matches and replace it in the string itself - since it is more efficient than regex replace.
Though I would measure the performance with small unit test ( and having NCrunch running in background makes it faster)

Related

Extract groups with regex and construct URL in a single line

I am currently trying to extract values from a string and construct a URL that includes those values. I went through a dozen regex question, but I am not quite satisfied with the answers.
I have custom encoded strings with more than one information and I want to construct a new URL that contains those information.
For example 35afe06d-8393-4559-b6d7-74d35ce131d8|Master should become http://my-server/media/guid/35afe06d-8393-4559-b6d7-74d35ce131d8?v=Master. My first assumption was
var input = "35afe06d-8393-4559-b6d7-74d35ce131d8|Master"
var pattern = #"((?:[a-f0-9]+-?){5})|(\w+)"
var replacement = "http://my-server/media/guid/$1?v=$2"
var output = Regex.Replace(input, pattern, replacement)
However this replaces each group with the full URL. Limitation is, that I am not aware of input, pattern, replacement or output. pattern and replacement are two config values and I don't want to make it x pairs of config values, input comes from somewhere else in the application and could have any custom encoding (pipe, colon, ...) output depends on the use case. It can have any number of groups in the pattern and doesn't even have to be a URL in the end.
I can think of different ways to do this, like parsing the string myself, or trying to create a replacement dictionary, or using regex to find the groups and then string replace for $1 => match.Groups[0]. I just feel like there must be an obvious 1-liner solution for that in .NET since I even remember doing that in PHP.
Answer: It's not a .NET limitation, it was simply the unescaped pipe.
In your pattern (([a-f0-9]+-?){5})|\w+ the second group should be capturing the word characters after the pipe (escape the pipe to match it literally).
If you repeat this part ([a-f0-9]+-?) 5 times, the match could also end on a hyphen.
To match the values separated by the dash, you could match the character class [a-f0-9]+ and repeat matching that {4} times prepended by a -
([a-f0-9]+(?:-[a-f0-9]+){4})\|(\w+)
.NET Regex demo | C# demo
var input = "35afe06d-8393-4559-b6d7-74d35ce131d8|Master";
var pattern = #"([a-f0-9]+(?:-[a-f0-9]+){4})\|(\w+)";
var replacement = "http://my-server/media/guid/$1?v=$2";
var output = Regex.Replace(input, pattern, replacement);
Console.WriteLine(output);
Result
http://my-server/media/guid/35afe06d-8393-4559-b6d7-74d35ce131d8?v=Master
This expression might also work here:
^(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b)\s*\|\s*(.*?)\s*$
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = #"^(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b)\s*\|\s*(.*?)\s*$";
string substitution = #"http://my-server/media/guid/\1?v=$2";
string input = #"35afe06d-8393-4559-b6d7-74d35ce131d8|Master
35afe06d-8393-4559-b6d7-74d35ce131d8| Master ";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
}
}
Reference
Searching for UUIDs in text with regex

Regex - C# - Get non matching part of string

The regex pattern I wrote below is matching the string before "FinalFolder".
How can I get the folder name (in this case "FinalFolder") just after the string matching the regex?
EDIT : Pretty sure I got my Regex wrong. My intent was to match upto "C:\FolderA\FolderB\FolderC\FolderD\Test 1.0\FolderE\FolderF" and then find the folder after that. So, in this case, the folder I am looking for is "FinalFolder"
[TestMethod]
public void TestRegex()
{
string pattern = #"[A-Za-z:]\\[A-Za-z]{1,}\\[A-Za-z]{1,}\\[A-Za-z0-9]{1,}\\[A-Za-z0-9]{1,}\\[A-Za-z0-9._s]{1,}\\[A-Za-z]{1,}\\[A-Za-z]{1,}";
string textToMatch = #"C:\FolderA\FolderB\FolderC\FolderD\Test 1.0\FolderE\FolderF\FinalFolder\Subfolder\Test.txt";
string[] matches = Regex.Split(textToMatch, pattern);
Console.WriteLine(matches[0]);
}
There are plenty of other hints and advice that will lead you to getting the desired folder and I recommend considering them. But since it looks like you would still benefit from learning more regex skills, here is the answer you asked for: Getting non-matching part of string.
Let's imagine that your Regex actually matched the given path, for instance a pattern like: [A-Za-z]:\\[A-Za-z]+\\[A-Za-z]+\\[A-Za-z0-9]+\\[A-Za-z0-9]+\\[A-Za-z0-9._\s]+\\[A-Za-z]+\\[A-Za-z]+
You could get the matched string, its position and length, then determine where in the original source string the next folder name would start. But then you would also need to determine where the next folder name ends.
MatchCollection matches = Regex.Matches(textToMatch, pattern);
if (matches.Count > 0 ) {
Match m = matches[0];
var remaining = textToMatch.Substring(m.Index + m.Length);
//Now find the next backslash and grab the leftmost part...
}
That answers your most general question, but that approach defeats the entire utility of using regex. Instead, just extend your pattern to match the next folder!
Regex patterns already provide the ability to capture certain portions of a match. The default regex construct for capturing text is a set of parenthesis. Even better, .Net regex supports named capture groups using (?<name>).
//using System.Text.RegularExpressions;
string pattern = #"(?<start>"
+ #"[A-Za-z]:\\[A-Za-z]+\\[A-Za-z]+\\[A-Za-z0-9]+\\[A-Za-z0-9]+\\[A-Za-z0-9._\s]+\\[A-Za-z]+\\[A-Za-z]+"
+ #")\\(?<next>[A-Za-z0-9._\s]+)(\\|$)";
string textToMatch = #"C:\FolderA\FolderB\FolderC\FolderD\Test 1.0\FolderE\FolderF\FinalFolder\Subfolder\Test.txt";
MatchCollection matches = Regex.Matches(textToMatch, pattern);
if (matches.Count > 0 ) {
var nextFolderName = matches[0].Groups["next"];
Console.WriteLine(nextFolderName);
}
As posted in a comment, your regex seems to be matching the entire string. But in this particular case, since you are dealing with a filename, I would use FileInfo.
FileInfo fi = new FileInfo(textToMatch);
Console.WriteLine(fi.DirectoryName);
Console.WriteLine(fi.Directory.Name);
DirectoryName will be the full path, while Directory.Name will be just the subfolder in question.
So, using FileInfo, something like this?
(new FileInfo(textToMatch)).Directory.Parent.Name

Find String Between To Identical Control Separators?

I'm reading from a file, and need to find a string that is encapsulated by two identical non-ascii values/control seperators, in this case 'RS'
How would I go about doing this? Would I need some form of regex?
RS stands for Record Separator, and it has a value of 30 (or 0x1E in hexadecimal). You can use this regular expression:
\x1E([\w\s]*?)\x1E
That matches the RS, then matches any letter, number or space, and then again the RS. The ? is to make the regex match as less characters as possible, in case there are more RS characters afterwards.
If you prefer not to match numbers, you could use [a-zA-Z\s] instead of [\w\s].
Example:
string fileContents = "Something \u001Eyour string\u001E more things \u001Eanother text\u001E end.";
MatchCollection matches = Regex.Matches(fileContents, #"\x1E([\w\s]*?)\x1E");
if (matches.Count == 0)
return; // Not found, display an error message and exit.
foreach (Match match in matches)
{
if (match.Groups.Count > 1)
Console.WriteLine(match.Groups[1].Value);
}
As you can see, you get a collection of Match, and each match.Value will have the whole matched string including the separators. match.Groups will have all matched groups, being the first one again the whole matched string (that's by default) and then each of your groups (those between parenthesis). In this case, you only have one in your regex, so you just need the second one on that list.
Using regex you can do something like this:
string pattern = string.Format("{0}(.*){1}",firstString,secondString);
var matches = Regex.Matches(myString, pattern);
foreach (Match match in matches)
{
foreach (Capture capture in match.Captures)
{
//Do stuff, with the current you should remove firstString and secondString from the capture.Value
}
}
After that use Regex.match to find the string that match with the pattern built before.
Remember to escape all the special char for regex.
You can use Regex.Matches, I'm using X as the separator in this example:
var fileContents = "Xsomething1X Xsomething2X Xsomething3X";
var results = Regex.Matches(fileContents, #"(X).*?(\1)");
The you can loop on results to do anything you want with the matches.
The \1 in the regex means "reference first group". I've put X between () so it is going to be group 1, the I use \1 to say that the match in this place should be exactly the same as the group 1.
You don't need a regular expression for that.
Read the contents of the file (File.ReadAllText).
Split on the separator character (String.Split).
If you know there's only one occurrence of your string, take the second array element (result[1]). Otherwise, take every other entry (result.Where((x, i) => i % 2 == 1)).

match first digits before # symbol

How to match all first digits before # in this line
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
Im trying to get this number 26909578
My try
string text = #"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, #"(.+?)#", RegexOptions.Singleline);
but then its outputs all text
Make it explicit that it has to start at the beginning of the string:
#"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
#"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.
Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.
If you dont want to/dont like to use regex, use a string builder and just loop until you hit the #.
so like this
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
while(yourdata[i]!='#')
{
sb.Append(yourdata[i]);
i++;
}
//when you get to that # your stringbuilder will have the number you want in it so return it with .toString();
string answer = sb.toString();
The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)

C# Regular Expressions

I have a string that has multiple regular expression groups, and some parts of the string that aren't in the groups. I need to replace a character, in this case ^ only within the groups, but not in the parts of the string that aren't in a regex group.
Here's the input string:
STARTDONTREPLACEME^ENDDONTREPLACEME~STARTREPLACEME^ENDREPLACEME~STARTREPLACEME^BLAH^ENDREPLACEME~STARTDONTREPLACEME^BLAH^ENDDONTREPLACEME~
Here's what the output string should look like:
STARTDONTREPLACEME^ENDDONTREPLACEME~STARTREPLACEMEENDREPLACEME~STARTREPLACEMEBLAHENDREPLACEME~STARTDONTREPLACEME^BLAH^ENDDONTREPLACEME~
I need to do it using C# and can use regular expressions.
I can match the string into groups of those that should and shouldn't be replaced, but am struggling on how to return the final output string.
I'm not sure I get exactly what you're having trouble with, but it didn't take long to come up with this result:
string strRegex = #"STARTREPLACEME(.+)ENDREPLACEME";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = #"STARTDONTREPLACEME^ENDDONTREPLACEME~STARTREPLACEME^ENDREPLACEME~STARTREPLACEME^BLAH^ENDREPLACEME~STARTDONTREPLACEME^BLAH^ENDDONTREPLACEME~";
string strReplace = "STARTREPLACEMEENDREPLACEME";
return myRegex.Replace(strTargetString, strReplace);
By using my favorite online Regex tool: http://regexhero.net/tester/
Is that helpful?
Regex rgx = new Regex(
#"\^(?=(?>(?:(?!(?:START|END)(?:DONT)?REPLACEME).)*)ENDREPLACEME)");
string s1 = rgx.Replace(s0, String.Empty);
Explanation: Each time a ^ is found, the lookahead scans ahead for an ending delimiter (ENDREPLACEME). If it finds one without seeing any of the other delimiters first, the match must have occurred inside a REPLACEME group. If the lookahead reports failure, it indicates that the ^ was found either between groups or within a DONTREPLACEME group.
Because lookaheads are zero-width assertions, only the ^ will actually be consumed in the event of a successful match.
Be aware that this will only work if delimiters are always properly balanced and groups are never nested within other groups.
If you are able to separate into groups that should be replaced and those that shouldn't, then instead of providing a single replacement string, you should be able to use a MatchEvaluator (a delegate that takes a Match and returns a string) to make the decision of which case it is currently dealing with and return the replacement string for that group alone.
You may also use an additional regex inside the MatchEvaluator. This solution produces the expected output:
Regex outer = new Regex(#"STARTREPLACEME.+ENDREPLACEME", RegexOptions.Compiled);
Regex inner = new Regex(#"\^", RegexOptions.Compiled);
string replaced = outer.Replace(start, m =>
{
return inner.Replace(m.Value, String.Empty);
});

Categories

Resources