Need to extract text in a C# Regex - c#

I have a collection of strings such as Some song [FEAT. John Doe] and I'm trying to extract the 'featured' part. It could be identified by one of several different words "FEAT|FEAT\\.|Featuring" and may or may not be enclosed by brackets. I'm using a Regex for this and here is what I've got so far:
[TestMethod]
public void ExtractFeaturedPerformers()
{
IEnumerable<string> titles = new string[]
{
"abc [FEAT one two] 123",
"def(FEAT. three'four) 456",
"ghi Featuring five",
"jkl"
};
// Must be able to use an arbitrary array of words
var arrayOfWords = new string[] { "FEAT", "FEAT.", "Featuring" };
string options = string.Join("|", arrayOfWords.Select(s => Regex.Escape(s)));
var result = new List<string>();
foreach(string title in titles)
{
var _ = Regex.Match(title, $@"(?<=({options})\s*)(.*?)(?=[\]\)]|$)");
if (_.Success)
result.Add(_.Value);
}
Assert.AreEqual(3, result.Count());
Assert.IsTrue(result.Contains("one two"));
Assert.IsTrue(result.Contains("three'four"));
Assert.IsTrue(result.Contains("five"));
}
I have it mostly working but there are two limitations. My main problem is that the second result includes the leading '.':
. three'four
How can I remove this as part of the Regex so that I can accept an arbitrary options string rather than stripping it away later? Dealing with the . is my main concern but I would also appreciate suggestions for removing the leading and trailing whitespace from the result so that I don't have to call Trim() later.

You need
(?:FEAT\.?|Featuring)\s*([^])]*)
See the regex demo
Details
(?:FEAT\.?|Featuring) - FEAT and an optional . or Featuring
\s* - zero or more whitespace
([^])]*) - Capturing group 1: zero or more chars other than ] and ).
You need to amend the C# code to get Group 1 values.
Here is the full C# demo:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
IEnumerable<string> titles = new string[]
{
"abc [FEAT one two] 123",
"def(FEAT. three'four) 456",
"ghi Featuring five",
"jkl"
};
var keys = new List<string> { "FEAT", "FEAT.", "Featuring" };
keys = keys.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?:{string.Join("|", keys.Select(z => Regex.Escape(z)))})\s*([^])]*)";
Console.WriteLine(pattern);
var result = new List<string>();
foreach(string title in titles)
{
var _ = Regex.Match(title, pattern);
if (_.Success)
result.Add(_.Groups[1].Value);
}
Console.WriteLine( result.Count()); // Assert.AreEqual(3, result.Count());
Console.WriteLine( result.Contains("one two") ); //Assert.IsTrue(result.Contains("one two"));
Console.WriteLine( result.Contains("three'four") ); //Assert.IsTrue(result.Contains("three'four"));
Console.WriteLine( result.Contains("five") ); // Assert.IsTrue(result.Contains("five"));
}
}
The output is
(?:Featuring|FEAT\.|FEAT)\s*([^])]*)
3
True
True
True
Note how the regex pattern is built:
var keys = new List<string> { "FEAT", "FEAT.", "Featuring" }; initializes the keys string list with the search phrases
keys = keys.OrderByDescending(x => x.Length).ToList(); - sorts the items in the list by length in the descending order
var pattern = $@"(?:{string.Join("|", keys.Select(z => Regex.Escape(z)))})\s*([^])]*)"; - creates the regex pattern by putting the escaped search phrases into a non-capturing group with the | alternation operator in between, (?:Featuring|FEAT\.|FEAT).
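The question also asked about dropping surrounding whitespace without calling Trim() afterwards. The following is a minimal sketch of one way that might be done, assuming the same three keywords: the capture is made lazy, and \s* plus a lookahead consume any trailing whitespace before the closing bracket or the end of the string. The FeaturedExtractor class and its Extract method are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FeaturedExtractor
{
    // Hypothetical helper: returns the featured part of a title,
    // or null when none of the keywords occurs in it.
    public static string Extract(string title, params string[] keywords)
    {
        // Longest keywords first, so "FEAT." is tried before "FEAT".
        var alternatives = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(k => Regex.Escape(k)));

        // Lazy group 1 stops as soon as optional whitespace is followed
        // by ']' or ')' or by the end of the string.
        var pattern = $@"(?:{alternatives})\s*([^\])]*?)\s*(?=[\])]|$)";

        var match = Regex.Match(title, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        var keywords = new[] { "FEAT", "FEAT.", "Featuring" };
        var titles = new[]
        {
            "abc [FEAT one two ] 123",
            "def(FEAT. three'four) 456",
            "ghi Featuring five  ",
            "jkl"
        };
        foreach (var title in titles)
        {
            // Brackets make leading or trailing whitespace visible.
            Console.WriteLine("[" + (Extract(title, keywords) ?? "no match") + "]");
        }
    }
}
```

Note that this sketch, like the pattern above, would also match a keyword that appears inside a longer word; adding \b before the group is one possible adjustment.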

Related

Regex for obtaining numeric values within a string in C#

I have the following example strings:
TAR:100
TAR:100|LED:50
TAR:30|LED:30|ASO:40
I need a regex that obtains the numeric values after the colon, which are always in the range 0 to 100 inclusive.
The result after the regex is applied to any of the above strings should be:
for TAR:100 the result should be 100
for TAR:100|LED:50 the result should be the array [100,50]
for TAR:30|LED:30|ASO:40 the result should be the array [30,30,40]
The word before the colon can have any length and both upper and lowercase.
I have tried with the following but it doesn't yield the result I need:
String text = "TAR:100|LED:50";
String pattern = "\\|?([a-zA-Z]{1,}:)";
string[] values= Regex.Split(text, pattern);
The regex should work whether the string is TAR:100 or TAR:100|LED:50 if possible.
You added () which makes the text parts that you want to remove also be returned.
Below is my solution, with a slightly changed regex.
Note that the loop over the values must start at i = 1. This is purely a consequence of using Split on a string that starts with a separator sequence; it has nothing to do with the regex itself.
Explanation: if we used a plain str.Split with the separator "#", then "a#b#c" would produce ["a", "b", "c"], whereas "#b#c" would produce ["", "b", "c"]. In general, if Split removes N separator sequences from the string, the result contains N+1 strings. All the strings we deal with here are of the form "#b#c", so the first result is always empty.
Accepting that as a given fact, the results are usable by starting from i = 1:
var pattern = @"\|?[a-zA-Z]+:";
var testCases = new[] { "TAR:100", "TAR:100|LED:50", "TAR:30|LED:30|ASO:40" };
foreach (var text in testCases)
{
string[] values = Regex.Split(text, pattern);
for (var i = 1; i < values.Length; i++)
Console.WriteLine(values[i]);
Console.WriteLine("------------");
}
Output:
100
------------
100
50
------------
30
30
40
------------
Working DotNetFiddle: https://dotnetfiddle.net/i9kH8n
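The empty first element described in this answer can be seen in isolation with a small sketch. It assumes the same separator pattern as above; the SplitDemo class and its Values helper are hypothetical names, and Skip(1) is used here as an alternative to starting the loop at i = 1.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: splits on the "NAME:" separators and drops the
    // empty first element produced by the separator at the start.
    public static string[] Values(string text)
    {
        return Regex.Split(text, @"\|?[a-zA-Z]+:").Skip(1).ToArray();
    }

    public static void Main()
    {
        // A plain Split shows the same effect: a leading separator
        // yields an empty first element.
        string[] plain = "#b#c".Split('#');
        Console.WriteLine(plain.Length);     // 3
        Console.WriteLine(plain[0].Length);  // 0

        // The regex split starts with an empty element for the same reason.
        string[] raw = Regex.Split("TAR:100|LED:50", @"\|?[a-zA-Z]+:");
        Console.WriteLine(raw.Length);       // 3

        Console.WriteLine(string.Join(",", Values("TAR:100|LED:50")));  // 100,50
    }
}
```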
In .NET you can give two capture groups the same name and read every captured value through Group.Captures, while the pattern matches the whole format of the string.
\b[a-zA-Z]+:(?<numbers>[0-9]+)(?:\|[a-zA-Z]+:(?<numbers>[0-9]+))*\b
Regex demo | C# demo
string[] strings = {
"TAR:100",
"TAR:100|LED:50",
"TAR:30|LED:30|ASO:40"
};
string pattern = @"\b[a-zA-Z]+:(?<numbers>[0-9]+)(?:\|[a-zA-Z]+:(?<numbers>[0-9]+))*\b";
foreach (String str in strings)
{
Match match = Regex.Match(str, pattern);
if (match.Success)
{
string[] result = match.Groups["numbers"].Captures.Select(c => c.Value).ToArray();
Console.WriteLine(String.Join(',', result));
}
}
Output
100
100,50
30,30,40
Another option could be making use of the \G anchor and have the value in capture group 1.
\b(?:[a-zA-Z]+:|\G(?!^))([0-9]+)(?:\||$)
Regex demo | C# demo
string[] strings = {
"TAR:100",
"TAR:100|LED:50",
"TAR:30|LED:30|ASO:40"
};
string pattern = @"\b(?:[a-zA-Z]+:|\G(?!^))([0-9]+)(?:\||$)";
foreach (String str in strings)
{
MatchCollection matches = Regex.Matches(str, pattern);
string[] result = matches.Select(m => m.Groups[1].Value).ToArray();
Console.WriteLine(String.Join(',', result));
}
Output
100
100,50
30,30,40

Find pattern to solve regex in one step

I am having trouble finding a pattern that solves the problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want to get is: take up to 4 Text items. If there are more than 4, take only the last character of each remaining item.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this I do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to form one pattern. Is there something like an anchor that tells the engine to start the pattern from here, as it would if it were the only pattern?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, repeat 0-3 times: 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
.NET regex demo | C# demo
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4)); // the first four parts keep their "$"; later parts keep only what follows "Text"
}
Console.WriteLine(result);
}
}
There are also linq alternatives :
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
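As one possible adjustment, the sketch below avoids the trailing "$" when there are four or fewer parts and keeps the last character of each remaining part instead of relying on the fixed "Text" prefix. It assumes that "the last sign" in the question means the last character; the TextShortener class and its KeepFour method are hypothetical names.

```csharp
using System;
using System.Linq;

public static class TextShortener
{
    // Hypothetical helper: keeps the first four '$'-separated parts and
    // appends only the last character of every further part.
    public static string KeepFour(string input)
    {
        var parts = input.Split('$');

        // The first four parts are kept unchanged.
        var head = string.Join("$", parts.Take(4));

        // From the fifth part on, keep only the last character of each part.
        var tail = string.Concat(parts.Skip(4)
            .Where(p => p.Length > 0)
            .Select(p => p.Substring(p.Length - 1)));

        // Add the extra separator only when something follows it.
        return tail.Length == 0 ? head : head + "$" + tail;
    }

    public static void Main()
    {
        Console.WriteLine(KeepFour("Text1"));
        Console.WriteLine(KeepFour("Text1$Text2$Text3"));
        Console.WriteLine(KeepFour("Text1$Text2$Text3$Text4$Text5$Text6"));
    }
}
```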
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}) - match Text literally, then match \d{1,3}, which is one to three digits; \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: take the first capturing group and concatenate it with the second, removing the unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
Console.WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, and the second captures the digit of each further item. The code then composes the result from the captured text.

Replacing Parts of a String C#

I have a series of strings that look like "WORD1: JUNK1 WORD2: JUNK2" and I want to remove the junk from the string while preserving the number of characters between the words (including those taken up by the junk).
I have a list of what words will be used but not junk
The words, number of spaces between everything, and junk all change every line
So far I've been using a regex like (word|word|word)(.*)(word|word|word)(.*) but I don't know how to maintain the formatting that way.
EDITED
Sorry, you were right, WORD1/WORD2 and JUNK1/JUNK2 are meant to be variables for the actual values I've been seeing. It's all alphanumeric characters and slashes.
Input Examples:
"CATEGORY:(4 spaces)SIDES(3 spaces)DATE CREATED:(3 spaces)03/12/16"
"PRODUCT:(6 spaces)CARROTS(4 spaces)DATE DELETED:(4 spaces)05/11/17"
Output Examples:
"CATEGORY:(12 spaces)DATE CREATED:(11 spaces)"
"PRODUCT:(17 spaces)DATE DELETED:(12 spaces)"
I am trying to replace the word "SIDES" as well as "03/12/16" with spaces. Rather, I want the number of characters between CATEGORY and DATE CREATED to remain the same and all be spaces.
I suggest a solution that is based on a Regex.Split operation:
var s = "CATEGORY: SIDES DATE CREATED: 03/12/16";
var rx = @"(\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):)";
var chunks = Regex.Split(s, rx);
Console.WriteLine(string.Concat(
chunks.Select(
x => Regex.IsMatch(x, $"^{rx}$") ? x : new String(' ', x.Length))
)
);
See the C# demo
The (\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):) regex is the delimiter pattern inside a capturing group so that Regex.Split could add the matches to the resulting array. It matches whole words CATEGORY, PRODUCT, DATE CREATED and DATE DELETED, and then a :. If the item matches this delimiter fully (see ^ and $ anchors in Regex.IsMatch(x, $"^{rx}$")) then it must stay as is, else, a string of spaces is built new String(' ', x.Length).
If you need a purely regex solution, you may use
var delim = @"\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):";
Console.WriteLine(Regex.Replace(s, $@"(\G(?!^)\s*|{delim}\s*)(?!{delim})\S", "$1 "));
See this regex demo.
Details
(\G(?!^)\s*|{delim}\s*) - Group 1 ($1 in the replacement pattern): the end of the previous match (\G(?!^)) followed with 0+ whitespaces (\s*) or (|) the delim pattern with 0+ whitespaces
(?!{delim})\S - any non-whitespace char that is not a starting char of a delim sequence
I'm sure someone will give you a nice clean answer using regex but here's a quick solution off the top of my head:
string msg = "this is a silly test message";
string[] junk = new string[] { "silly", "message" };
foreach(string j in junk)
{
msg = Regex.Replace(msg, j, string.Empty.PadRight(j.Length));
}
I thought this was an interesting experiment and I came up with what appears to be a very different method than the other answers.
public class WordStripper
{
public string StripWords(string input)
{
var ignoreWords = new List<string>
{
"CATEGORY:",
"DATE CREATED:",
"PRODUCT:",
"DATE DELETED:"
};
var deliminator = string.Join("|", ignoreWords);
var splitInput = Regex.Split(input, $"({deliminator})");
var sb = new StringBuilder();
foreach (var word in splitInput)
{
if (ignoreWords.Contains(word))
{
sb.Append(word);
}
else
{
var wordLength = word.Length;
sb.Append(new string(' ', wordLength));
}
}
return sb.ToString();
}
}
And a unit test to validate it in case you're interested (uses NUnit)
[TestFixture]
public class Test
{
[Test]
[TestCase("CATEGORY:    SIDES   DATE CREATED:   03/12/16", "CATEGORY:            DATE CREATED:           ")]
[TestCase("PRODUCT:      CARROTS    DATE DELETED:    05/11/17", "PRODUCT:                 DATE DELETED:            ")]
public void TestMethod(string input, string expectedResult)
{
//arrange
var uut = new WordStripper();
//act
var actualResults = uut.StripWords(input);
//assert
Assert.AreEqual(expectedResult, actualResults);
}
}

Split string at first alphabetic character

I have the following string array, as shown in the image. While looping through the array, I need to separate the numeric value from the alphabetic value.
eg:
35.00MY to 35.00 and MY
2.10D8 to 2.10 and D8
80.00YRI to 80.00 and YRI
4.00G8 to 4.00 and G8
I tried the following code, but that didn't help:
foreach (string taxText in taxSplit) {
Regex re = new Regex(@"([a-zA-Z]+)(\d+)");
Match result = re.Match(taxText);
string alphaPart = result.Groups[1].ToString();
string numberPart = result.Groups[2].ToString(); }
Both returned empty
You can bastardize a Split and use a lookahead (?= ... ) and a lookbehind (?<= ... ):
string original = "35.00ab3500bc";
Regex reg = new Regex("(?<=[0-9])(?=[A-Za-z])");
string[] parts = reg.Split(original, 2);
Here, we have to instantiate a new Regex instance because this version of Split isn't available as a static method. The pattern we pass says to find a void where the left side of the void is a number (i.e. the lookbehind), and the right side of the void is a letter (i.e. the lookahead). We pass a 2 to say that we want at most two items in parts.
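Applied to the values from the question, the lookaround split described above might look like the following sketch. The TaxSplitter class and its SplitAtFirstLetter method are hypothetical names; the pattern and the count of 2 are taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TaxSplitter
{
    // Zero-width boundary: a digit on the left, a letter on the right.
    private static readonly Regex Boundary = new Regex("(?<=[0-9])(?=[A-Za-z])");

    // Hypothetical helper: splits a value such as "35.00MY" at the first
    // digit-to-letter boundary, returning at most two parts.
    public static string[] SplitAtFirstLetter(string value)
    {
        return Boundary.Split(value, 2);
    }

    public static void Main()
    {
        foreach (var value in new[] { "35.00MY", "2.10D8", "80.00YRI", "4.00G8" })
        {
            string[] parts = SplitAtFirstLetter(value);
            Console.WriteLine(string.Join(" and ", parts));
        }
    }
}
```

Because the count is 2, a later digit-to-letter boundary (as in "35.00ab3500bc") does not produce a third part.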
var lst = new List<string>() { "35.00MY", "2.10D8", "80.00YRI", "4.00G8" };
var res = new List<string>();
lst.ForEach(v =>
{
res.Add(new string(v.TakeWhile(c => !Char.IsLetter(c)).ToArray()));
res.Add(v.TrimStart("01234567890.".ToCharArray()));
} );

Get only wild card value using regular expression

I want to extract only wild card tokens using regular expressions in dotnet (C#).
For example, if I use a pattern like Book_* (similar to a directory wildcard), it should extract the values that match the *.
For Example:
For a string "Book_1234" and pattern "Book_*"
I want to extract "1234"
For a string "Book_1234_ABC" and pattern "Book_*_*"
I should be able to extract 1234 and ABC
This should do it : (DEMO)
string input = "Book_1234_ABC";
MatchCollection matches = Regex.Matches(input, @"_([A-Za-z0-9]*)");
foreach (Match m in matches)
if (m.Success)
Console.WriteLine(m.Groups[1].Value);
The approach to your scenario would be to:
Get the list of strings that appear between the wildcards (*).
Join that list with the regex alternation operator (|).
Replace the matches of that expression with a char that you do not expect in your string (I suppose a space should be adequate here).
Trim and then split the returned string by the char you used in the previous step, which gives you the list of wildcard values.
var str = "Book_1234_ABC";
var inputPattern = "Book_*_*";
var patterns = inputPattern.Split('*');
if (patterns.Last().Equals(""))
patterns = patterns.Take(patterns.Length - 1).ToArray();
string expression = string.Join("|", patterns);
var wildCards = Regex.Replace(str, expression, " ").Trim().Split(' ');
I would first convert the '*' wildcard in an equivalent Regex, ie:
* becomes \w+
then I use this regex to extract the matches.
When I run this code using your input strings:
using System;
using System.Text.RegularExpressions;
namespace SampleApplication
{
public class Test
{
static Regex reg = new Regex(@"Book_([^_]+)_*(.*)");
static void DoMatch(String value) {
Console.WriteLine("Input: " + value);
foreach (Match item in reg.Matches(value)) {
for (int i = 0; i < item.Groups.Count; ++i) {
Console.WriteLine(String.Format("Group: {0} = {1}", i, item.Groups[i].Value));
}
}
Console.WriteLine("\n");
}
static void Main(string[] args) {
// For a string "Book_1234" and pattern "Book_*" I want to extract "1234"
DoMatch("Book_1234");
// For a string "Book_1234_ABC" and pattern "Book_*_*" I should be able to extract 1234 and ABC
DoMatch("Book_1234_ABC");
}
}
}
I get this console output:
Input: Book_1234
Group: 0 = Book_1234
Group: 1 = 1234
Group: 2 =
Input: Book_1234_ABC
Group: 0 = Book_1234_ABC
Group: 1 = 1234
Group: 2 = ABC
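The idea of converting the wildcard into an equivalent regex can also be done programmatically rather than by hand. The sketch below is one way that might look: it escapes the wildcard pattern, turns each escaped * into a lazy capturing group, and anchors the result. It uses (.+?) instead of the \w+ mentioned above, which is an assumption made so that the captured values are not limited to word characters; the WildcardExtractor class and its Extract method are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardExtractor
{
    // Hypothetical helper: returns the text matched by each '*' in the
    // wildcard pattern, or an empty array when the input does not match.
    public static string[] Extract(string input, string wildcardPattern)
    {
        // Regex.Escape turns '*' into "\*"; each one becomes a lazy group.
        var regexPattern = "^" + Regex.Escape(wildcardPattern).Replace(@"\*", "(.+?)") + "$";

        var match = Regex.Match(input, regexPattern);
        if (!match.Success)
        {
            return new string[0];
        }

        // Group 0 is the whole match, so skip it.
        return match.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Extract("Book_1234", "Book_*")));
        Console.WriteLine(string.Join(",", Extract("Book_1234_ABC", "Book_*_*")));
    }
}
```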
