I have a series of strings that look like "WORD1: JUNK1 WORD2: JUNK2" and I want to remove the junk from the string while preserving the number of characters between the words (including those taken up by the junk).
I have a list of what words will be used but not junk
The words, number of spaces between everything, and junk all change every line
So far I've been using a regex like (word|word|word)(.\*)(word|word|word)(.*) but I don't know how to maintain the formatting that way.
EDITED
Sorry, you were right, WORD1/WORD2 and JUNK1/JUNK2 are meant to be variables for the actual values I've been seeing. Its all alphanumeric characters and slashes.
Input Examples:
"CATEGORY:(4 spaces)SIDES(3 spaces)DATE CREATED:(3 spaces)03/12/16"
"PRODUCT:(6 spaces)CARROTS(4 spaces)DATE DELETED:(4 spaces)05/11/17"
Output Examples:
"CATEGORY:(12 spaces)DATE CREATED:(11 spaces)"
"PRODUCT:(17 spaces)DATE DELETED:(12 spaces"
I am trying to replace the word "SIDES" as well as "03/12/16" with spaces. Rather, I want the number of characters between CATEGORY and DATE CREATED to remain the same and all be spaces.
I suggest a solution that is based on a Regex.Split operation:
var s = "CATEGORY: SIDES DATE CREATED: 03/12/16";
var rx = #"(\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):)";
var chunks = Regex.Split(s, rx);
Console.WriteLine(string.Concat(
chunks.Select(
x => Regex.IsMatch(x, $"^{rx}$") ? x : new String(' ', x.Length))
)
);
See the C# demo
The (\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):) regex is the delimiter pattern inside a capturing group so that Regex.Split could add the matches to the resulting array. It matches whole words CATEGORY, PRODUCT, DATE CREATED and DATE DELETED, and then a :. If the item matches this delimiter fully (see ^ and $ anchors in Regex.IsMatch(x, $"^{rx}$")) then it must stay as is, else, a string of spaces is built new String(' ', x.Length).
If you need a purely regex solution, you may use
var delim = #"\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):";
Console.WriteLine(Regex.Replace(s, $#"(\G(?!^)\s*|{delim}\s*)(?!{delim})\S", "$1 "));
See this regex demo.
Details
(\G(?!^)\s*|{delim}\s*) - Group 1 ($1 in the replacement pattern): the end of the previous match (\G(?!^)) followed with 0+ whitespaces (\s*) or (|) the delim pattern with 0+ whitespaces
(?!{delim})\S - any non-whitespace char that is not a starting char of a delim sequence
I'm sure someone will give you a nice clean answer using regex but here's a quick solution off the top of my head:
string msg = "this is a silly test message";
string[] junk = new string[] { "silly", "message" };
foreach(string j in junk)
{
msg = Regex.Replace(msg, j, string.Empty.PadRight(j.Length));
}
I thought this was an interesting experiment and I came up with what appears to be a very different method than the other answers.
public class WordStripper
{
public string StripWords(string input)
{
var ignoreWords = new List<string>
{
"CATEGORY:",
"DATE CREATED:",
"PRODUCT:",
"DATE DELETED:"
};
var deliminator = string.Join("|", ignoreWords);
var splitInput = Regex.Split(input, $"({deliminator})");
var sb = new StringBuilder();
foreach (var word in splitInput)
{
if (ignoreWords.Contains(word))
{
sb.Append(word);
}
else
{
var wordLength = word.Length;
sb.Append(new string(' ', wordLength));
}
}
return sb.ToString();
}
}
And a unit test to validate it in case you're interested (uses NUnit)
[TestFixture]
public class Test
{
[Test]
[TestCase("CATEGORY: SIDES DATE CREATED: 03/12/16", "CATEGORY: DATE CREATED: ")]
[TestCase("PRODUCT: CARROTS DATE DELETED: 05/11/17", "PRODUCT: DATE DELETED: ")]
public void TestMethod(string input, string expectedResult)
{
//arrange
var uut = new WordStripper();
//act
var actualResults = uut.StripWords(input);
//assert
Assert.AreEqual(expectedResult, actualResults);
}
}
Related
I have a problem to find the pattern that solves the problem in onestep.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What i want to get is: Take up to 4x Text. If there are more than "4xText" take only the last sign.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this i will do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
For me it is not clear why i cant easily put the second pattern after the first pattern into one pattern. Is there something like an anchor that tells the engine to start the pattern from here like it would do if is would be the only pattern ?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} Match 0-3 times any char except $ followed by an optional $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
.NET regex demo | C# demo
Example
string pattern = #"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
result += (i <= 4 ? parts[i] + "$" : parts[i].Substring(4));
}
Console.WriteLine(result);
}
}
There are also linq alternatives :
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
#"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3} - match Text literally, then match \d{1,3}, which is 1 up to three digits, \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate match, so takee first capturing group and concatenate with second, removing unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(#"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into three different capture groups in a regex. Then it extracts the captured text, composing the result based on the results.
Let's say I have a string of type
(Price+Discounted_Price)*2-Max.Price
and a dictionary containing what to replace for each element
Price: A1 Discounted_Price: A2 Max.Price:A3
How can I replace exactly each phrases, without touching the other. Meaning search for Price should not modify Price in Discounted_Price. The result should be (A1+A2)*2-A3 and not (A1+Discounted_A1) - Max.A1 or anything else
Thank you.
If your variables can consist of alphanumeric/underscore/dot characters, you can match them with [\w.]+ regex pattern, and add boundaries that include .:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var s = "(Price+Discounted_Price)*2-Max.Price";
var dct = new Dictionary<string, string>();
dct.Add("Price", "A1");
dct.Add("Discounted_Price", "A2");
dct.Add("Max.Price","A3");
var res = Regex.Replace(s, #"(?<![\w.])[\w.]+(?![\w.])", // Find all matches with the regex inside s
x => dct.ContainsKey(x.Value) ? // Does the dictionary contain the key that equals the matched text?
dct[x.Value] : // Use the value for the key if it is present to replace current match
x.Value); // Otherwise, insert the match found back into the result
Console.WriteLine(res);
}
}
See the IDEONE demo
The (?<![\w.]) negative lookbehind fails the match if the match is preceded with a word or a dot char, and the (?![\w.]) negative lookahead will fail the match if it is followed with a word or dot char.
Note that [\w.]+ allows a dot in the leading and trailing positions, thus, you might want to replace it with \w+(?:\.\w+)* and use as #"(?<![\w.])\w+(?:\.\w+)*(?![\w.])".
UPDATE
Since you have already extracted the keywords to replace as a list, you need to use a more sophisticated word boundary excluding dots:
var listAbove = new List<string> { "Price", "Discounted_Price", "Max.Price" };
var result = s;
foreach (string phrase in listAbove)
{
result = Regex.Replace(result, #"\b(?<![\w.])" + Regex.Escape(phrase) + #"\b(?![\w.])", dct[phrase]);
}
See IDEONE demo.
For word boundaries, you can use \b
Use: \bPrice\b
But this will replace Price in Max.Price.
Maybe you want to use regular string replace with:
"Price+" --> A1 + "+"
Example:
string test = "(Price+Discounted_Price)*2-Max.Price";
string a1 = "7";
string a2 = "3";
string a3 = "4";
test = test.Replace("(Price", "(" + a1);
test = test.Replace("Discounted_Price", a2);
test = test.Replace("Max.Price", a3);
Result:
test is: (7+3)*2-4
I want to split camelCase or PascalCase words to space separate collection of words.
So far, I have:
Regex.Replace(value, #"(\B[A-Z]+?(?=[A-Z][^A-Z])|\B[A-Z]+?(?=[^A-Z]))", " $0", RegexOptions.Compiled);
It works fine for converting "TestWord" to "Test Word" and for leaving single words untouched, e.g. Testing remains Testing.
However, ABCTest gets converted to A B C Test when I would prefer ABC Test.
Try:
[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+
An example on Regex101
How is it used in CS?
string strText = " TestWord asdfDasdf ABCDef";
string[] matches = Regex.Matches(strText, #"[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
string result = String.Join(" ", matches);
result = 'Test Word asdf Dasdf ABC Def'
How it works
In the example string:
TestWord qwerDasdf
ABCTest Testing ((*&^%$CamelCase!"£$%^^))
asdfAasdf
AaBbbCD
[A-Z][a-z]+ matches:
[0-4] Test
[4-8] Word
[13-18] Dasdf
[22-26] Test
[27-34] Testing
[45-50] Camel
[50-54] Case
[68-73] Aasdf
[74-76] Aa
[76-79] Bbb
[A-Z]+(?=[A-Z][a-z]) matches:
[19-22] ABC
[a-z]+ matches:
[9-13] qwer
[64-68] asdf
[A-Z]+ matches:
[79-81] CD
Here is my attempt:
(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b)|(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b)
This regex can be used with Regex.Replace and $0 as a replacement string.
Regex.Replace(value, #"(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b)|(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b)", " $0", RegexOptions.Compiled);
See demo
Regex Explanation:
Contains 2 alternatives to account for a chain of capital letters before or after lowercase letters.
(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b) - first alternative that matches several uppercase letters that are not preceded with a start of string, word boundary or another uppercase letter, and that are followed by a lowercase letter or a word boundary,
(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b) - the second alternative that matches a single capital letter that is not preceded with a start of string with optional uppercase letters right after, or word boundary and is followed by a lowercase letter or a word boundary that is not preceded by optional uppercase letters.
Do you have a requirement to use Regex? To be honest, I wouldn't use Regex for this at all. They're hard to debug and not especially readable.
You also sometimes end up with all sorts of fun like this: Regex problem: IsMatch method never returns
The regex above will not deal with the wonderful world of unicode - e.g. Cyrillics (http://en.wikipedia.org/wiki/Cyrillic_script) (not that your specific problem domain probably needs this, but for completeness...)
I would go with a small, reusable, easily testable extension method:
class Program
{
static void Main(string[] args)
{
string[] inputs = new[]
{
"ABCTest",
"HelloWorld",
"testTest$Test",
"aaҚbb"
};
var output = inputs.Select(x => x.SplitWithSpaces(CultureInfo.CurrentUICulture));
foreach (string x in output)
{
Console.WriteLine(x);
}
Console.Read();
}
}
public static class StringExtensions
{
public static bool IsLowerCase(this TextInfo textInfo, char input)
{
return textInfo.ToLower(input) == input;
}
public static string SplitWithSpaces(this string input, CultureInfo culture = null)
{
if (culture == null)
{
culture = CultureInfo.InvariantCulture;
}
TextInfo textInfo = culture.TextInfo;
StringBuilder sb = new StringBuilder(input);
for (int i = 1; i < sb.Length; i++)
{
int previous = i - 1;
if (textInfo.IsLowerCase(sb[previous]))
{
int insertLocation = previous - 1;
if (insertLocation > 0)
{
sb.Insert(insertLocation, ' ');
}
while (i < sb.Length && textInfo.IsLowerCase(sb[i]))
{
i++;
}
}
}
return sb.ToString();
}
}
If I have a string like MCCORMIC 3H R Final 08-26-2011.dwg or even MCCORMIC SMITH 2N L Final 08-26-2011.dwg and I wanted to capture the R in the first string or the L in the second string in a variable, what is the best method for doing so? I was thinking about trying the below statement but it does not work.
string filename = "MCCORMIC 3H R Final 08-26-2011.dwg"
string WhichArea = "";
int WhichIndex = 0;
WhichIndex = filename.IndexOf("Final");
WhichArea = filename.Substring(WhichIndex - 1,1); //Trying to get the R in front of word Final
Just split by space:
var parts = filename.Split(new [] {' '},
StringSplitOptions.RemoveEmptyEntries);
WhichArea = parts[parts.Length - 3];
It looks like the file names have a very specific format, so this will work just fine.
Even with any number of spaces, using StringSplitOptions.RemoveEmptyEntries means spaces will not be part of the split result set.
Code updated to deal with both examples - thanks Nikola.
I had to do something similar, but with Mirostation drawings instead of Autocad. I used regex in my case. Here's what I did, just in case you feel like making it more complex.
string filename = "MCCORMIC 3H R Final 08-26-2011.dwg";
string filename2 = "MCCORMIC SMITH 2N L Final 08-26-2011.dwg";
Console.WriteLine(TheMatch(filename));
Console.WriteLine(TheMatch(filename2));
public string TheMatch(string filename) {
Regex reg = new Regex(#"[A-Za-z0-9]*\s*([A-Z])\s*Final .*\.dwg");
Match match = reg.Match(filename);
if(match.Success) {
return match.Groups[1].Value;
}
return String.Empty;
}
I don't think Oded's answer covers all cases. The first example has two words before the wanted letter, and the second one has three words before it.
My opinion is that the best way to get this letter is by using RegEx, assuming that the word Final always comes after the letter itself, separated by any number of spaces.
Here's the RegEx code:
using System.Text.RegularExpressions;
private string GetLetter(string fileName)
{
string pattern = "\S(?=\s*?Final)";
Match match = Regex.Match(fileName, pattern);
return match.Value;
}
And here's the explanation of RegEx pattern:
\S(?=\s*?Final)
\S // Anything other than whitespace
(?=\s*?Final) // Positive look-ahead
\s*? // Whitespace, unlimited number of repetitions, as few as possible.
Final // Exact text.
I have a problem dealing with the # symbol in Regex, I am trying to remove #sometext
from a text string can't seem to find anywhere where it uses the # as a literal. I have tried myself but doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "#text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurences of "#sometext" from string test via the method
Regex.Replace(test, "#sometext", "")
or for any word starting with "#" you can use
Regex.Replace(test, "#\\w+", "")
If you need specifically a separate word (i.e. nothing like #comp within tom#comp.com) you may preceed the regex with a special word boundary (\b does not work here):
Regex.Replace(test, "(^|\\W)#\\w+", "")
You can use:
^\s#([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove #something from this string: I want to remove #something from this string.
var regex = new Regex("#\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B#\w+
This will match any string starting with an # character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"\B#\w+",
#"redacted");
is the following string:
redacted redacted this2#3that redacted redacted#beta#gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}",
#"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.