Extract groups with regex and construct URL in a single line - c#

I am currently trying to extract values from a string and construct a URL that includes those values. I went through a dozen regex questions, but I am not quite satisfied with the answers.
I have custom-encoded strings that carry more than one piece of information, and I want to construct a new URL that contains that information.
For example 35afe06d-8393-4559-b6d7-74d35ce131d8|Master should become http://my-server/media/guid/35afe06d-8393-4559-b6d7-74d35ce131d8?v=Master. My first assumption was
var input = "35afe06d-8393-4559-b6d7-74d35ce131d8|Master";
var pattern = @"((?:[a-f0-9]+-?){5})|(\w+)";
var replacement = "http://my-server/media/guid/$1?v=$2";
var output = Regex.Replace(input, pattern, replacement);
However, this replaces each group with the full URL. The constraint is that my code knows nothing in advance about input, pattern, replacement, or output. pattern and replacement are two config values, and I don't want to turn them into x pairs of config values. input comes from somewhere else in the application and could use any custom encoding (pipe, colon, ...). output depends on the use case. The pattern can have any number of groups, and the result doesn't even have to be a URL.
I can think of different ways to do this, like parsing the string myself, or trying to create a replacement dictionary, or using regex to find the groups and then string replace for $1 => match.Groups[0]. I just feel like there must be an obvious 1-liner solution for that in .NET since I even remember doing that in PHP.
Answer: It's not a .NET limitation, it was simply the unescaped pipe.

In your pattern (([a-f0-9]+-?){5})|\w+, the unescaped pipe acts as alternation, so the regex matches either the GUID or the word characters, never both in one match. Escape the pipe to match it literally; the second group then captures the word characters after it.
Because ([a-f0-9]+-?) is repeated 5 times with an optional hyphen, the match could also end on a hyphen.
To match the values separated by hyphens, match the character class [a-f0-9]+ once and then repeat it {4} times, each time preceded by a -:
([a-f0-9]+(?:-[a-f0-9]+){4})\|(\w+)
.NET Regex demo | C# demo
var input = "35afe06d-8393-4559-b6d7-74d35ce131d8|Master";
var pattern = @"([a-f0-9]+(?:-[a-f0-9]+){4})\|(\w+)";
var replacement = "http://my-server/media/guid/$1?v=$2";
var output = Regex.Replace(input, pattern, replacement);
Console.WriteLine(output);
Result
http://my-server/media/guid/35afe06d-8393-4559-b6d7-74d35ce131d8?v=Master
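To see why the unescaped pipe produced one URL per group, it can help to count the matches. The following is a minimal sketch; PipeDemo and CountMatches are hypothetical names used only for illustration. An unescaped | is alternation, so the GUID and the word are matched separately, while \| yields a single match that contains both groups.

```csharp
using System.Text.RegularExpressions;

public static class PipeDemo
{
    // Hypothetical helper: returns how many separate matches a pattern produces.
    // Regex.Replace applies the replacement template once per match.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }
}
```

With the input from the question, the original pattern @"((?:[a-f0-9]+-?){5})|(\w+)" should give two matches (each one expanded with the other group empty), whereas the escaped pattern gives exactly one.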

This expression might also work here:
^(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b)\s*\|\s*(.*?)\s*$
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Test
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"^(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b)\s*\|\s*(.*?)\s*$";
        // .NET substitutions use $1, not \1; a backslash is literal in a replacement string.
        string substitution = @"http://my-server/media/guid/$1?v=$2";
        // Two test lines: one without and one with whitespace around the value.
        string input = "35afe06d-8393-4559-b6d7-74d35ce131d8|Master\n"
            + "35afe06d-8393-4559-b6d7-74d35ce131d8| Master ";
        // Multiline makes ^ and $ match at the start and end of each line.
        RegexOptions options = RegexOptions.Multiline;
        Regex regex = new Regex(pattern, options);
        string result = regex.Replace(input, substitution);
        Console.WriteLine(result);
    }
}
Reference
Searching for UUIDs in text with regex

Related

Use RegEx to uppercase and lowercase the string

I am trying to convert a string to uppercase and lowercase based on the index.
My string is a LanguageCode like cc-CC, where cc is the language code and CC is the country code. The user can enter it in any case, like "cC-Cc". I am using a regular expression to check whether the data is in the cc-CC format.
var regex = new Regex("^[a-z]{2}-[A-Z]{2}$", RegexOptions.IgnoreCase);
// I could use CultureInfo from the .NET Framework to check whether the code is valid,
// but the requirement is that invalid language codes are also allowed, as long as
// the entered code is in the cc-CC format.
Now when the user enters something like cC-Cc, I'm trying to lowercase the first two characters and uppercase the last two.
I can split the string using - and then concatenate them.
var languageDetails = languageCode.Split('-');
var languageCodeUpdated = $"{languageDetails[0].ToLowerInvariant()}-{languageDetails[1].ToUpperInvariant()}";
I wondered whether I could avoid creating multiple strings and use Regex itself to uppercase and lowercase accordingly.
While searching, I found some solutions that use \L and \U, but I am not able to use them because the C# compiler reports an error. Also, Regex.Replace() has a MatchEvaluator delegate parameter, which I don't understand.
Is there any way in C# to use Regex to replace uppercase with lowercase and vice versa?
.NET regex does not support case modifying operators.
You may use MatchEvaluator:
var result = Regex.Replace(s, @"(?i)^([a-z]{2})-([a-z]{2})$", m =>
    $"{m.Groups[1].Value.ToLower()}-{m.Groups[2].Value.ToUpper()}");
See the C# demo.
Details
(?i) - the inline version of the RegexOptions.IgnoreCase modifier
^ - start of the string
([a-z]{2}) - Capturing group #1: 2 ASCII letters
- - a hyphen
([a-z]{2}) - Capturing group #2: 2 ASCII letters
$ - end of string.
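The MatchEvaluator above can be wrapped in a small helper. This is only a sketch: the class and method names (LanguageCode, Normalize) are made up for illustration, and it assumes that input which does not have the two-letter/two-letter shape should be returned unchanged.

```csharp
using System.Text.RegularExpressions;

public static class LanguageCode
{
    // Hypothetical helper around the MatchEvaluator approach.
    // Regex.Replace returns the input unchanged when the pattern does not match.
    public static string Normalize(string code)
    {
        return Regex.Replace(
            code,
            @"(?i)^([a-z]{2})-([a-z]{2})$",
            // The lambda is the MatchEvaluator: it receives each Match
            // and returns the text that should replace it.
            m => m.Groups[1].Value.ToLowerInvariant() + "-" + m.Groups[2].Value.ToUpperInvariant());
    }
}
```

The sketch uses ToLowerInvariant/ToUpperInvariant rather than ToLower/ToUpper so the result does not depend on the current culture (for example, the Turkish dotless i).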
TLDR: This is Regex.Replace with \U and \L support.
private static string EnhancedReplace(string input, string pattern, string replacement, RegexOptions options)
{
    replacement = Regex.Replace(replacement, @"(?<mode>\\[UL])(?<group>\$((\d+)|({[^}]+})))", @"<!<mode:${mode}>%&${group}&%>");
    var output = Regex.Replace(input, pattern, replacement, options);
    output = Regex.Replace(output, @"<!<mode:\\L>%&(?<value>[\w\W]*?)&%>", x => x.Groups["value"].Value.ToLower());
    output = Regex.Replace(output, @"<!<mode:\\U>%&(?<value>[\w\W]*?)&%>", x => x.Groups["value"].Value.ToUpper());
    return output;
}
How To Use
Call the function with \U followed by the group to be uppercase
var result = EnhancedReplace(input, @"(public \w+ )(\w)", @"$1\U$2", RegexOptions.None);
Will replace this:
public string test12 { get; set; } = "test3";
With that:
public string Test12 { get; set; } = "test3";
Details
I'm currently working on an app which allows the user to define a batch of Regex Replace operations.
For example the user enters json and the batch converts it to a C#-Class.
Therefore, speed is no key requirement. But it would be very handy to be able to use \U and \L.
This method will apply Regex.Replace 3 times to the whole content and one time to the replacement string. Therefore it’s at least three times slower than Regex.Replace without \U \L support.
Step by Step
1. The first Regex.Replace enhances the replacement string.
It replaces: \U$1 with <!<mode:\U>%&$1&%>
(Also works for named groups: ${groupName})
2. The new replacement is applied to the content.
3. & 4. The inserted placeholder is now relatively unique. That allows you to search only for the pattern <!<mode:\\U>%&Actual Value&%> and use the MatchEvaluator to replace it with its uppercase version. The same is done for \L.
Regex101 Demo:
Step 1: Enhance pattern with placeholder
https://regex101.com/r/ZtqigN/1
Step 2: Use new replacement pattern
https://regex101.com/r/PWLTFD/1
Step 3 & 4: Resolve new placeholders
https://regex101.com/r/5DIIUo/1
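The same steps can also be traced in C# by splitting EnhancedReplace into its stages. This is a sketch for inspection only; the class and method names are hypothetical, and the regexes are copied from the function above.

```csharp
using System.Text.RegularExpressions;

public static class EnhancedReplaceTrace
{
    // Step 1: wrap every \U$n or \L$n in the replacement string in a placeholder.
    public static string EnhanceReplacement(string replacement)
    {
        return Regex.Replace(
            replacement,
            @"(?<mode>\\[UL])(?<group>\$((\d+)|({[^}]+})))",
            @"<!<mode:${mode}>%&${group}&%>");
    }

    // Step 2: run the user's pattern with the enhanced replacement.
    // The placeholder text is literal; only $n is substituted.
    public static string ApplyReplacement(string input, string pattern, string enhanced)
    {
        return Regex.Replace(input, pattern, enhanced);
    }

    // Steps 3 and 4: resolve the \L placeholders, then the \U placeholders.
    public static string ResolvePlaceholders(string text)
    {
        text = Regex.Replace(text, @"<!<mode:\\L>%&(?<value>[\w\W]*?)&%>",
            m => m.Groups["value"].Value.ToLower());
        return Regex.Replace(text, @"<!<mode:\\U>%&(?<value>[\w\W]*?)&%>",
            m => m.Groups["value"].Value.ToUpper());
    }
}
```

For the replacement @"$1\U$2", step 1 yields $1<!<mode:\U>%&$2&%>; applying it to "public string test12" with the pattern (public \w+ )(\w) yields public string <!<mode:\U>%&t&%>est12, which the last stage turns into public string Test12.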
Answer
var result = EnhancedReplace(input, @"(cc)(-)(cc)", @"\L$1$2\U$3", RegexOptions.IgnoreCase);

Regex - C# - Get non matching part of string

The regex pattern I wrote below is matching the string before "FinalFolder".
How can I get the folder name (in this case "FinalFolder") just after the string matching the regex?
EDIT: Pretty sure I got my Regex wrong. My intent was to match up to "C:\FolderA\FolderB\FolderC\FolderD\Test 1.0\FolderE\FolderF" and then find the folder after that. So, in this case, the folder I am looking for is "FinalFolder".
[TestMethod]
public void TestRegex()
{
string pattern = @"[A-Za-z:]\\[A-Za-z]{1,}\\[A-Za-z]{1,}\\[A-Za-z0-9]{1,}\\[A-Za-z0-9]{1,}\\[A-Za-z0-9._s]{1,}\\[A-Za-z]{1,}\\[A-Za-z]{1,}";
string textToMatch = @"C:\FolderA\FolderB\FolderC\FolderD\Test 1.0\FolderE\FolderF\FinalFolder\Subfolder\Test.txt";
string[] matches = Regex.Split(textToMatch, pattern);
Console.WriteLine(matches[0]);
}
There are plenty of other hints and advice that will lead you to getting the desired folder and I recommend considering them. But since it looks like you would still benefit from learning more regex skills, here is the answer you asked for: Getting non-matching part of string.
Let's imagine that your Regex actually matched the given path, for instance a pattern like: [A-Za-z]:\\[A-Za-z]+\\[A-Za-z]+\\[A-Za-z0-9]+\\[A-Za-z0-9]+\\[A-Za-z0-9._\s]+\\[A-Za-z]+\\[A-Za-z]+
You could get the matched string, its position and length, then determine where in the original source string the next folder name would start. But then you would also need to determine where the next folder name ends.
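MatchCollection matches = Regex.Matches(textToMatch, pattern);
if (matches.Count > 0)
{
    Match m = matches[0];
    // Everything after the match, without the backslash that follows it.
    var remaining = textToMatch.Substring(m.Index + m.Length).TrimStart('\\');
    // Find the next backslash and grab the leftmost part.
    int end = remaining.IndexOf('\\');
    var nextFolder = end < 0 ? remaining : remaining.Substring(0, end);
}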
That answers your most general question, but that approach defeats the entire utility of using regex. Instead, just extend your pattern to match the next folder!
Regex patterns already provide the ability to capture certain portions of a match. The default regex construct for capturing text is a pair of parentheses. Even better, .NET regex supports named capture groups using (?<name>).
//using System.Text.RegularExpressions;
string pattern = @"(?<start>"
    + @"[A-Za-z]:\\[A-Za-z]+\\[A-Za-z]+\\[A-Za-z0-9]+\\[A-Za-z0-9]+\\[A-Za-z0-9._\s]+\\[A-Za-z]+\\[A-Za-z]+"
    + @")\\(?<next>[A-Za-z0-9._\s]+)(\\|$)";
string textToMatch = @"C:\FolderA\FolderB\FolderC\FolderD\Test 1.0\FolderE\FolderF\FinalFolder\Subfolder\Test.txt";
MatchCollection matches = Regex.Matches(textToMatch, pattern);
if (matches.Count > 0)
{
    var nextFolderName = matches[0].Groups["next"];
    Console.WriteLine(nextFolderName);
}
As posted in a comment, your regex seems to be matching the entire string. But in this particular case, since you are dealing with a filename, I would use FileInfo.
FileInfo fi = new FileInfo(textToMatch);
Console.WriteLine(fi.DirectoryName);
Console.WriteLine(fi.Directory.Name);
DirectoryName will be the full path, while Directory.Name will be just the subfolder in question.
So, using FileInfo, something like this?
(new FileInfo(textToMatch)).Directory.Parent.Name

Regex to first match, then replace found matches

In my C# program I am using Regular expressions to:
Loop through a list of possible words in need of replacing.
For each word, to find out if a string I am given has any matches.
If it does, I perform some (slightly costly) logic to create the replacement.
I then perform the actual replacement.
My current code looks roughly as follows:
string toSearchInside; // The actual string I'm going to be replacing within
List<string> searchStrings; // The list of words to look for via regex
string pattern = @"([:#?]{0})";
string replacement;
foreach (string toMatch in searchStrings)
{
    var regex = new Regex(
        string.Format(pattern, toMatch),
        RegexOptions.IgnoreCase
    );
    var matches = regex.Matches(toSearchInside);
    if (matches.Count == 0)
        continue;
    replacement = CreateReplacement(toMatch);
    toSearchInside = regex.Replace(toSearchInside, replacement);
}
I can get this working, but it seems somewhat inefficient in that it uses the regex engine twice: once to find the matches (regex.Matches()) and once for the replacing (regex.Replace()). I was wondering if there was a way to simply say "replace the matches you already found"?
You could take all the matches from the first call. Each match carries its index and length, so you can iterate through the matches and splice the replacement into the string yourself instead of calling regex.Replace(), which may be more efficient because the text is not scanned a second time.
That said, I would measure the performance with a small unit test (having NCrunch running in the background makes that faster).
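A different way to avoid the second scan, not the index-based splicing described above, is the Regex.Replace overload that takes a MatchEvaluator together with a lazily built replacement. The sketch below makes some assumptions: SinglePassReplacer and ReplaceAll are hypothetical names, createReplacement stands in for the question's CreateReplacement, and the pattern template is copied from the question as shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SinglePassReplacer
{
    // createReplacement is a stand-in for the costly CreateReplacement(toMatch)
    // from the question; its real signature is an assumption.
    public static string ReplaceAll(
        string text,
        IEnumerable<string> searchStrings,
        Func<string, string> createReplacement)
    {
        foreach (string toMatch in searchStrings)
        {
            var regex = new Regex(
                string.Format(@"([:#?]{0})", toMatch),
                RegexOptions.IgnoreCase);

            // Built only when the first match is found, then reused for later matches.
            var replacement = new Lazy<string>(() => createReplacement(toMatch));

            // The evaluator runs once per match, so with zero matches
            // the costly logic is never executed.
            text = regex.Replace(text, m => replacement.Value);
        }
        return text;
    }
}
```

One difference to keep in mind: a string returned from a MatchEvaluator is inserted literally, so substitutions such as $1 inside it are not expanded the way they are in the string overload of Replace.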

match first digits before # symbol

How can I match the first digits before # in this line?
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
I'm trying to get this number: 26909578
My try:
string text = @"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, @"(.+?)#", RegexOptions.Singleline);
but then it outputs all the text.
Make it explicit that it has to start at the beginning of the string:
@"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
@"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.
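A minimal sketch of the Match variant, combined with the anchored digit pattern suggested earlier. The class and method names (FirstNumber, Extract) are illustrative only.

```csharp
using System.Text.RegularExpressions;

public static class FirstNumber
{
    // Returns the leading run of digits, or an empty string
    // if the text does not start with a digit.
    public static string Extract(string text)
    {
        // Match stops after the first hit; ^ pins it to the start of the string.
        Match m = Regex.Match(text, @"^\d+");
        return m.Success ? m.Value : string.Empty;
    }
}
```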
Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.
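The IndexOf() idea can be sketched as follows. HashPrefix and BeforeFirstHash are hypothetical names, and returning the whole string when no '#' is present is an assumption about the desired behavior.

```csharp
using System;

public static class HashPrefix
{
    // Returns everything before the first '#'.
    // Assumption: if there is no '#', the whole string is returned.
    public static string BeforeFirstHash(string text)
    {
        int hash = text.IndexOf('#');
        return hash < 0 ? text : text.Substring(0, hash);
    }
}
```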
If you don't want to use regex, use a StringBuilder and just loop until you hit the #.
Like this:
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
// Stop at the first '#', or at the end of the string if there is none.
while (i < yourdata.Length && yourdata[i] != '#')
{
    sb.Append(yourdata[i]);
    i++;
}
// When you reach the #, the StringBuilder holds the number you want, so return it with ToString().
string answer = sb.ToString();
The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)

C# Regular Expressions

I have a string that has multiple regular expression groups, and some parts of the string that aren't in the groups. I need to replace a character, in this case ^ only within the groups, but not in the parts of the string that aren't in a regex group.
Here's the input string:
STARTDONTREPLACEME^ENDDONTREPLACEME~STARTREPLACEME^ENDREPLACEME~STARTREPLACEME^BLAH^ENDREPLACEME~STARTDONTREPLACEME^BLAH^ENDDONTREPLACEME~
Here's what the output string should look like:
STARTDONTREPLACEME^ENDDONTREPLACEME~STARTREPLACEMEENDREPLACEME~STARTREPLACEMEBLAHENDREPLACEME~STARTDONTREPLACEME^BLAH^ENDDONTREPLACEME~
I need to do it using C# and can use regular expressions.
I can match the string into groups of those that should and shouldn't be replaced, but am struggling on how to return the final output string.
I'm not sure I get exactly what you're having trouble with, but it didn't take long to come up with this result:
string strRegex = @"STARTREPLACEME(.+)ENDREPLACEME";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = @"STARTDONTREPLACEME^ENDDONTREPLACEME~STARTREPLACEME^ENDREPLACEME~STARTREPLACEME^BLAH^ENDREPLACEME~STARTDONTREPLACEME^BLAH^ENDDONTREPLACEME~";
string strReplace = "STARTREPLACEMEENDREPLACEME";
return myRegex.Replace(strTargetString, strReplace);
By using my favorite online Regex tool: http://regexhero.net/tester/
Is that helpful?
Regex rgx = new Regex(
    @"\^(?=(?>(?:(?!(?:START|END)(?:DONT)?REPLACEME).)*)ENDREPLACEME)");
string s1 = rgx.Replace(s0, String.Empty);
Explanation: Each time a ^ is found, the lookahead scans ahead for an ending delimiter (ENDREPLACEME). If it finds one without seeing any of the other delimiters first, the match must have occurred inside a REPLACEME group. If the lookahead reports failure, it indicates that the ^ was found either between groups or within a DONTREPLACEME group.
Because lookaheads are zero-width assertions, only the ^ will actually be consumed in the event of a successful match.
Be aware that this will only work if delimiters are always properly balanced and groups are never nested within other groups.
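To check the lookahead against the sample from the question, the regex can be wrapped in a small helper. This is a sketch under the same assumptions as stated above (balanced, non-nested delimiters); CaretStripper and Strip are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class CaretStripper
{
    // Matches a ^ only when the next delimiter after it is ENDREPLACEME.
    private static readonly Regex CaretInsideReplaceMe = new Regex(
        @"\^(?=(?>(?:(?!(?:START|END)(?:DONT)?REPLACEME).)*)ENDREPLACEME)");

    // Removes every ^ that sits inside a STARTREPLACEME...ENDREPLACEME group.
    public static string Strip(string s0)
    {
        return CaretInsideReplaceMe.Replace(s0, string.Empty);
    }
}
```

Calling CaretStripper.Strip on the input string from the question should produce the expected output string given there.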
If you are able to separate into groups that should be replaced and those that shouldn't, then instead of providing a single replacement string, you should be able to use a MatchEvaluator (a delegate that takes a Match and returns a string) to make the decision of which case it is currently dealing with and return the replacement string for that group alone.
You may also use an additional regex inside the MatchEvaluator. This solution produces the expected output:
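// Lazy .+? keeps each STARTREPLACEME...ENDREPLACEME group a separate match,
// so nothing between two groups is touched.
Regex outer = new Regex(@"STARTREPLACEME.+?ENDREPLACEME", RegexOptions.Compiled);
Regex inner = new Regex(@"\^", RegexOptions.Compiled);
string replaced = outer.Replace(start, m =>
{
    return inner.Replace(m.Value, String.Empty);
});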
