Regular expression with specific word boundary - c#

Let's say I have a string of type
(Price+Discounted_Price)*2-Max.Price
and a dictionary containing what to replace for each element
Price: A1 Discounted_Price: A2 Max.Price:A3
How can I replace each phrase exactly, without touching the others? That is, a search for Price should not modify the Price inside Discounted_Price. The result should be (A1+A2)*2-A3 and not (A1+Discounted_A1)*2-Max.A1 or anything else.
Thank you.

If your variables can consist of alphanumeric/underscore/dot characters, you can match them with [\w.]+ regex pattern, and add boundaries that include .:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var s = "(Price+Discounted_Price)*2-Max.Price";
var dct = new Dictionary<string, string>();
dct.Add("Price", "A1");
dct.Add("Discounted_Price", "A2");
dct.Add("Max.Price","A3");
var res = Regex.Replace(s, @"(?<![\w.])[\w.]+(?![\w.])", // Find all matches with the regex inside s
x => dct.ContainsKey(x.Value) ? // Does the dictionary contain the key that equals the matched text?
dct[x.Value] : // Use the value for the key if it is present to replace current match
x.Value); // Otherwise, insert the match found back into the result
Console.WriteLine(res);
}
}
See the IDEONE demo
The (?<![\w.]) negative lookbehind fails the match if the match is preceded with a word or a dot char, and the (?![\w.]) negative lookahead will fail the match if it is followed with a word or dot char.
Note that [\w.]+ allows a dot in the leading and trailing positions, so you might want to replace it with \w+(?:\.\w+)* and use it as @"(?<![\w.])\w+(?:\.\w+)*(?![\w.])".
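As a minimal sketch of that stricter variant (assuming C# 9 or later for top-level statements; the helper name ReplaceTokens is hypothetical and not part of the original answer), the dictionary lookup can be done with TryGetValue so that each match needs only one lookup:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var map = new Dictionary<string, string>
{
    ["Price"] = "A1",
    ["Discounted_Price"] = "A2",
    ["Max.Price"] = "A3",
};

// Stricter token pattern: word chars, optionally followed by ".word" segments,
// not preceded or followed by a word char or a dot.
var token = new Regex(@"(?<![\w.])\w+(?:\.\w+)*(?![\w.])");

// Hypothetical helper: replace a token only when it is a dictionary key,
// otherwise put the matched text back unchanged.
string ReplaceTokens(string input) =>
    token.Replace(input, m => map.TryGetValue(m.Value, out var v) ? v : m.Value);

Console.WriteLine(ReplaceTokens("(Price+Discounted_Price)*2-Max.Price")); // (A1+A2)*2-A3
```

Tokens that are not keys, such as the literal 2, are matched too and simply written back as they were.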
UPDATE
Since you have already extracted the keywords to replace as a list, you need to use a more sophisticated word boundary excluding dots:
var listAbove = new List<string> { "Price", "Discounted_Price", "Max.Price" };
var result = s;
foreach (string phrase in listAbove)
{
result = Regex.Replace(result, @"\b(?<![\w.])" + Regex.Escape(phrase) + @"\b(?![\w.])", dct[phrase]);
}
See IDEONE demo.

For word boundaries, you can use \b
Use: \bPrice\b
But this will replace Price in Max.Price.
Maybe you want to use regular string replace with:
"Price+" --> A1 + "+"
Example:
string test = "(Price+Discounted_Price)*2-Max.Price";
string a1 = "7";
string a2 = "3";
string a3 = "4";
test = test.Replace("(Price", "(" + a1);
test = test.Replace("Discounted_Price", a2);
test = test.Replace("Max.Price", a3);
Result:
test is: (7+3)*2-4

Related

Find pattern to solve regex in one step

I am having trouble finding a pattern that solves the problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want is: take up to 4 Text items. If there are more than 4, take only the last character of each remaining item.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this I do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to get one pattern. Is there something like an anchor that tells the engine to start the pattern from here, as it would if it were the only pattern?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, match 0-3 repetitions of 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
.NET regex demo | C# demo
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4)); // first four parts are kept whole
}
Console.WriteLine(result);
}
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
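One possible adjustment, sketched under the assumption that "the last sign" means the last character of each item after the fourth (the helper name TakeFourThenLastChars is hypothetical, and the code assumes C# 9 or later for top-level statements). Unlike the snippets above, it does not depend on the items starting with a four-character prefix and it adds no trailing $ for short inputs:

```csharp
using System;
using System.Linq;

// Keeps the first four '$'-separated parts and appends the last character
// of every further part. The helper name is hypothetical.
string TakeFourThenLastChars(string input)
{
    var parts = input.Split('$');
    var head = string.Join("$", parts.Take(4));
    var tail = new string(parts.Skip(4)
                               .Where(p => p.Length > 0)
                               .Select(p => p[p.Length - 1])
                               .ToArray());
    // Only add the separator when there is something after the fourth part.
    return tail.Length == 0 ? head : head + "$" + tail;
}

foreach (var s in new[] { "Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" })
    Console.WriteLine(TakeFourThenLastChars(s));
// Text1
// Text1$Text2$Text3
// Text1$Text2$Text3$Text4$56
```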
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}) - match Text literally, then \d{1,3}, which is one to three digits; \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: take the first capturing group and concatenate it with the second, after removing the unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, and the second captures the digit of each later item. It then extracts the captured text and composes the result from it.

Regular expression split string, extract string value before and numeric value between square brackets

I need to parse a string that looks like "Abc[123]". The numerical value between the brackets is needed, as well as the string value before the brackets.
Most of the examples I tested work fine, but some special cases fail to parse.
This code seems to work fine for "normal" cases, but has some problems handling "special" cases:
var pattern = @"\[(.*[0-9])\]";
var query = "Abc[123]";
var numVal = Regex.Matches(query, pattern).Cast<Match>().Select(m => m.Groups[1].Value).FirstOrDefault();
var stringVal = Regex.Split(query, pattern)
.Select(x => x.Trim())
.FirstOrDefault();
How should the code be adjusted to handle also some special cases?
For instance, for the string "Abc[]" the parser should correctly return "Abc" as the string value and indicate an empty numeric value (which could eventually be defaulted to 0).
For the string "Abc[xy33]" the parser should return "Abc" as the string value and indicate an invalid numeric value.
For the string "Abc" the parser should return "Abc" as the string value and indicate a missing numeric value. The blanks before/after or inside the brackets should be trimmed "Abc [ 123 ] ".
Try this pattern: ^([^\[]+)\[([^\]]*)\]
Explanation of a pattern:
^ - match beginning of a string
([^\[]+) - match one or more of any character except [ and store it inside the first capturing group
\[ - match [ literally
([^\]]*) - match zero or more of any character except ] and store inside second capturing group
\] - match ] literally
Here's tested code:
var pattern = @"^([^\[]+)\[([^\]]*)\]";
var queries = new string[]{ "Abc[123]", "Abc[xy33]", "Abc[]", "Abc[ 33 ]", "Abc" };
foreach (var query in queries)
{
string beforeBrackets;
string insideBrackets;
var match = Regex.Match(query, pattern);
if (match.Success)
{
beforeBrackets = match.Groups[1].Value;
insideBrackets = match.Groups[2].Value.Trim();
if (insideBrackets == "")
insideBrackets = "0";
else if (!int.TryParse(insideBrackets, out int i))
insideBrackets = "incorrect value!";
}
else
{
beforeBrackets = query;
insideBrackets = "no value";
}
Console.WriteLine($"Input string {query} : before brackets: {beforeBrackets}, inside brackets: {insideBrackets}");
}
Console.ReadKey();
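Output:
Input string Abc[123] : before brackets: Abc, inside brackets: 123
Input string Abc[xy33] : before brackets: Abc, inside brackets: incorrect value!
Input string Abc[] : before brackets: Abc, inside brackets: 0
Input string Abc[ 33 ] : before brackets: Abc, inside brackets: 33
Input string Abc : before brackets: Abc, inside brackets: no value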
We can try doing a regex replacement on the input, for a one-liner solution:
string input = "Abc[123]";
string letters = Regex.Replace(input, "\\[.*\\]", "");
string numbers = Regex.Replace("Abc[123]", ".*\\[(\\d+)\\]", "$1");
Console.WriteLine(letters);
Console.WriteLine(numbers);
This prints:
Abc
123
Pretty sure there'd be some language-based techniques for that, which I wouldn't know, yet with a regular expression, we'd capture everything using capturing groups and check for things one by one, maybe:
^([A-Za-z]+)\s*(\[?)\s*([A-Za-z]*)(\d*)\s*(\]?)\s*$
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
You can achieve that easily without using regex
string temp = "Abc[123]";
string[] arr = temp.Split('[');
string name = arr[0];
string value = arr[1].ToString().TrimEnd(']');
output name = Abc, and value = 123
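The Split call above throws for an input without brackets, such as "Abc". As a hedged sketch of how the same regex-free idea could cover the special cases from the question (the helper name ParseBracketed and the status strings are hypothetical; assumes C# 9 or later for top-level statements):

```csharp
using System;

// Hypothetical helper: returns the text before '[' plus a status describing
// the bracketed part: "ok", "empty", "invalid" or "missing".
(string Name, string Status, int Value) ParseBracketed(string s)
{
    s = s.Trim();
    int open = s.IndexOf('[');
    if (open < 0) return (s, "missing", 0);            // no brackets at all

    string name = s.Substring(0, open).Trim();
    int close = s.IndexOf(']', open + 1);
    if (close < 0) return (name, "invalid", 0);        // unbalanced bracket

    string inner = s.Substring(open + 1, close - open - 1).Trim();
    if (inner.Length == 0) return (name, "empty", 0);  // "Abc[]"

    return int.TryParse(inner, out int n) ? (name, "ok", n) : (name, "invalid", 0);
}

foreach (var q in new[] { "Abc[123]", "Abc[]", "Abc[xy33]", "Abc", "Abc [ 123 ] " })
    Console.WriteLine(ParseBracketed(q));
```

Returning a status alongside the value lets the caller decide whether an empty value should default to 0, as the question suggests.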

Replacing Parts of a String C#

I have a series of strings that look like "WORD1: JUNK1 WORD2: JUNK2" and I want to remove the junk from the string while preserving the number of characters between the words (including those taken up by the junk).
I have a list of what words will be used but not junk
The words, number of spaces between everything, and junk all change every line
So far I've been using a regex like (word|word|word)(.*)(word|word|word)(.*) but I don't know how to maintain the formatting that way.
EDITED
Sorry, you were right, WORD1/WORD2 and JUNK1/JUNK2 are meant to be placeholders for the actual values I've been seeing. It's all alphanumeric characters and slashes.
Input Examples:
"CATEGORY:(4 spaces)SIDES(3 spaces)DATE CREATED:(3 spaces)03/12/16"
"PRODUCT:(6 spaces)CARROTS(4 spaces)DATE DELETED:(4 spaces)05/11/17"
Output Examples:
"CATEGORY:(12 spaces)DATE CREATED:(11 spaces)"
"PRODUCT:(17 spaces)DATE DELETED:(12 spaces)"
I am trying to replace the word "SIDES" as well as "03/12/16" with spaces. Rather, I want the number of characters between CATEGORY and DATE CREATED to remain the same and all be spaces.
I suggest a solution that is based on a Regex.Split operation:
var s = "CATEGORY: SIDES DATE CREATED: 03/12/16";
var rx = @"(\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):)";
var chunks = Regex.Split(s, rx);
Console.WriteLine(string.Concat(
chunks.Select(
x => Regex.IsMatch(x, $"^{rx}$") ? x : new String(' ', x.Length))
)
);
See the C# demo
The (\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):) regex is the delimiter pattern inside a capturing group so that Regex.Split could add the matches to the resulting array. It matches whole words CATEGORY, PRODUCT, DATE CREATED and DATE DELETED, and then a :. If the item matches this delimiter fully (see ^ and $ anchors in Regex.IsMatch(x, $"^{rx}$")) then it must stay as is, else, a string of spaces is built new String(' ', x.Length).
If you need a purely regex solution, you may use
var delim = @"\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):";
Console.WriteLine(Regex.Replace(s, $@"(\G(?!^)\s*|{delim}\s*)(?!{delim})\S", "$1 "));
See this regex demo.
Details
(\G(?!^)\s*|{delim}\s*) - Group 1 ($1 in the replacement pattern): the end of the previous match (\G(?!^)) followed with 0+ whitespaces (\s*) or (|) the delim pattern with 0+ whitespaces
(?!{delim})\S - any non-whitespace char that is not a starting char of a delim sequence
I'm sure someone will give you a nice clean answer using regex but here's a quick solution off the top of my head:
string msg = "this is a silly test message";
string[] junk = new string[] { "silly", "message" };
foreach(string j in junk)
{
msg = Regex.Replace(msg, j, string.Empty.PadRight(j.Length));
}
I thought this was an interesting experiment and I came up with what appears to be a very different method than the other answers.
public class WordStripper
{
public string StripWords(string input)
{
var ignoreWords = new List<string>
{
"CATEGORY:",
"DATE CREATED:",
"PRODUCT:",
"DATE DELETED:"
};
var deliminator = string.Join("|", ignoreWords);
var splitInput = Regex.Split(input, $"({deliminator})");
var sb = new StringBuilder();
foreach (var word in splitInput)
{
if (ignoreWords.Contains(word))
{
sb.Append(word);
}
else
{
var wordLength = word.Length;
sb.Append(new string(' ', wordLength));
}
}
return sb.ToString();
}
}
And a unit test to validate it in case you're interested (uses NUnit)
[TestFixture]
public class Test
{
[Test]
[TestCase("CATEGORY: SIDES DATE CREATED: 03/12/16", "CATEGORY: DATE CREATED: ")]
[TestCase("PRODUCT: CARROTS DATE DELETED: 05/11/17", "PRODUCT: DATE DELETED: ")]
public void TestMethod(string input, string expectedResult)
{
//arrange
var uut = new WordStripper();
//act
var actualResults = uut.StripWords(input);
//assert
Assert.AreEqual(expectedResult, actualResults);
}
}

Get only wild card value using regular expression

I want to extract only wild card tokens using regular expressions in dotnet (C#).
For example, if I use a pattern like Book_* (similar to a directory wildcard), it should extract the values that match the *.
For Example:
For a string "Book_1234" and pattern "Book_*"
I want to extract "1234"
For a string "Book_1234_ABC" and pattern "Book_*_*"
I should be able to extract 1234 and ABC
This should do it : (DEMO)
string input = "Book_1234_ABC";
MatchCollection matches = Regex.Matches(input, @"_([A-Za-z0-9]*)");
foreach (Match m in matches)
if (m.Success)
Console.WriteLine(m.Groups[1].Value);
The approach to your scenario would be to
Get the list of strings which appear in between the wildcards (*).
Join the list with the regex alternation divider (|).
Replace matches of that regular expression with a char which you do not expect in your string (I suppose a space should be adequate here).
Trim and then split the returned string by the char you used in the previous step, which gives you the list of wildcard values.
var str = "Book_1234_ABC";
var inputPattern = "Book_*_*";
var patterns = inputPattern.Split('*');
if (patterns.Last().Equals(""))
patterns = patterns.Take(patterns.Length - 1).ToArray();
string expression = string.Join("|", patterns);
var wildCards = Regex.Replace(str, expression, " ").Trim().Split(' ');
I would first convert the '*' wildcard into an equivalent regex, i.e.:
* becomes \w+
then I use this regex to extract the matches.
When I run this code using your input strings:
using System;
using System.Text.RegularExpressions;
namespace SampleApplication
{
public class Test
{
static Regex reg = new Regex(@"Book_([^_]+)_*(.*)");
static void DoMatch(String value) {
Console.WriteLine("Input: " + value);
foreach (Match item in reg.Matches(value)) {
for (int i = 0; i < item.Groups.Count; ++i) {
Console.WriteLine(String.Format("Group: {0} = {1}", i, item.Groups[i].Value));
}
}
Console.WriteLine("\n");
}
static void Main(string[] args) {
// For a string "Book_1234" and pattern "Book_*" I want to extract "1234"
DoMatch("Book_1234");
// For a string "Book_1234_ABC" and pattern "Book_*_*" I should be able to extract 1234 and ABC
DoMatch("Book_1234_ABC");
}
}
}
I get this console output:
Input: Book_1234
Group: 0 = Book_1234
Group: 1 = 1234
Group: 2 =
Input: Book_1234_ABC
Group: 0 = Book_1234_ABC
Group: 1 = 1234
Group: 2 = ABC
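The regex above is written by hand for the Book_ case. As a rough sketch of the wildcard-to-regex conversion described earlier (each * becomes a captured \w+, literal parts are escaped; the helper name ExtractWildcards and the ^/$ anchoring are assumptions, and the code assumes C# 9 or later for top-level statements):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds a regex from a '*' wildcard pattern and
// returns the text matched by each '*', or an empty array on no match.
string[] ExtractWildcards(string input, string wildcardPattern)
{
    // Escape the literal pieces and join them with a capturing group per '*'.
    var literals = wildcardPattern.Split('*').Select(p => Regex.Escape(p));
    var regexText = "^" + string.Join(@"(\w+)", literals) + "$";

    var m = Regex.Match(input, regexText);
    return m.Success
        ? m.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToArray()
        : Array.Empty<string>();
}

Console.WriteLine(string.Join(", ", ExtractWildcards("Book_1234", "Book_*")));       // 1234
Console.WriteLine(string.Join(", ", ExtractWildcards("Book_1234_ABC", "Book_*_*"))); // 1234, ABC
```

Because \w also matches an underscore, inputs with more underscores than the pattern has separators are split greedily; a narrower class such as [^_]+ may suit such data better.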

C# regular expression trouble

Problem!
I have the following input (rules) from a flat file (talking about numeric input):
Input might be a natural number (below 1000): 1, 10, 100, 999, ...
Input might be a comma separated number surrounded by quotes (above 1000): "1,000", "2,000", "3,000", "10,000", ...
I have the following regular expression to validate the input: (?:(\d+)|\x22([0-9]+(?:,[0-9]+)*)\x22). For an input like 10 I expect 10 in the first matching group, which is exactly what I get. But for an input like "10,000" I expect 10,000 in the first matching group, and it is stored in the second matching group instead.
Example
string text1 = "\"" + "10,000" + "\"";
string text2 = "50";
string pattern = @"(\d+)|\x22([0-9]+(?:,[0-9]+){0,})\x22";
Match match1 = Regex.Match(text1, pattern);
Match match2 = Regex.Match(text2, pattern);
if (match1.Success)
{
Console.WriteLine("Match#1 Group#1: " + match1.Groups[1].Value);
Console.WriteLine("Match#1 Group#2: " + match1.Groups[2].Value);
// Outputs
// Match#1 Group#1:
// Match#1 Group#2: 10,000
}
if (match2.Success)
{
Console.WriteLine("Match#2 Group#1: " + match2.Groups[1].Value);
Console.WriteLine("Match#2 Group#2: " + match2.Groups[2].Value);
// Outputs
// Match#2 Group#1: 50
// Match#2 Group#2:
}
Expected Result
Both results on the same matching group, in this case 1
Questions?
What am I doing wrong? I'm just getting bad grouping from the regular expression matches.
Also, I'm using FileHelpers .NET to parse the file; is there any other way to resolve this problem? Actually, I'm trying to implement a custom converter.
Object File
[FieldConverter(typeof(OOR_Quantity))]
public Int32 Quantity;
OOR_Quantity
internal class OOR_Quantity : ConverterBase
{
public override object StringToField(string from)
{
string pattern = @"(?:(\d+)|\x22([0-9]+(?:,[0-9]+)*)\x22)";
Regex regex = new Regex(pattern);
if (regex.IsMatch(from))
{
Match match = regex.Match(from);
return int.Parse(match.Groups[1].Value);
}
throw new ...
}
}
Group numbers are assigned purely on the basis of their positions in the regex--specifically, the relative position of the opening bracket, (. In your regex, (\d+) is the first group and ([0-9]+(?:,[0-9]+)*) is the second.
If you want to refer to them both with the same identifier, use named groups and give them both the same name:
@"(?:(?<NUMBER>\d+)|\x22(?<NUMBER>[0-9]+(?:,[0-9]+)*)\x22)"
Now you can retrieve the captured value as match.Groups["NUMBER"].Value.
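A small sketch of how the shared group name might be used when converting the field (the helper name ParseQuantity, the added ^ and $ anchors, and the comma stripping are assumptions, not part of the original converter; assumes C# 9 or later for top-level statements). The commas are removed because int.Parse rejects thousands separators by default:

```csharp
using System;
using System.Text.RegularExpressions;

// Both alternatives capture into the same named group, NUMBER.
// The ^ and $ anchors are an added assumption so that partial matches are rejected.
var number = new Regex(@"^(?:(?<NUMBER>\d+)|\x22(?<NUMBER>[0-9]+(?:,[0-9]+)*)\x22)$");

// Hypothetical helper: parses either 50 or "10,000" (with quotes) into an int.
int ParseQuantity(string field)
{
    var m = number.Match(field);
    if (!m.Success) throw new FormatException("Not a quantity: " + field);
    // Strip the separators before parsing.
    return int.Parse(m.Groups["NUMBER"].Value.Replace(",", ""));
}

Console.WriteLine(ParseQuantity("50"));         // 50
Console.WriteLine(ParseQuantity("\"10,000\"")); // 10000
```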
I tested the regex below with Ruby:
text1 = "\"10,000\""
text2 = "50"
regex = /"?([0-9]+(?:,[0-9]+){0,})"?/
text1 =~ regex
puts "#$1"
text2 =~ regex
puts "#$1"
The result is:
10,000
50
I think you can rewrite in C#. Isn't it enough for you?
