I want to split camelCase or PascalCase words into a space-separated collection of words.
So far, I have:
Regex.Replace(value, @"(\B[A-Z]+?(?=[A-Z][^A-Z])|\B[A-Z]+?(?=[^A-Z]))", " $0", RegexOptions.Compiled);
It works fine for converting "TestWord" to "Test Word" and for leaving single words untouched, e.g. Testing remains Testing.
However, ABCTest gets converted to A B C Test when I would prefer ABC Test.
Try:
[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+
An example on Regex101
How is it used in C#?
string strText = " TestWord asdfDasdf ABCDef";
string[] matches = Regex.Matches(strText, @"[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
string result = String.Join(" ", matches);
// result == "Test Word asdf Dasdf ABC Def"
How it works
In the example string:
TestWord qwerDasdf
ABCTest Testing ((*&^%$CamelCase!"£$%^^))
asdfAasdf
AaBbbCD
[A-Z][a-z]+ matches:
[0-4] Test
[4-8] Word
[13-18] Dasdf
[22-26] Test
[27-34] Testing
[45-50] Camel
[50-54] Case
[68-73] Aasdf
[74-76] Aa
[76-79] Bbb
[A-Z]+(?=[A-Z][a-z]) matches:
[19-22] ABC
[a-z]+ matches:
[9-13] qwer
[64-68] asdf
[A-Z]+ matches:
[79-81] CD
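The breakdown above can be reproduced in code. The following is a minimal sketch, not part of the original answer: it wraps each of the four alternatives in a named group so that every match can report which alternative produced it. The class name, method names, and group names (Capitalized, UpperRunBeforeWord, Lower, UpperRun) are hypothetical labels chosen for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CamelTokens
{
    // Same four alternatives as the pattern above, each wrapped in a named group.
    // The group names are illustrative and not part of the original answer.
    private static readonly Regex Pattern = new Regex(
        @"(?<Capitalized>[A-Z][a-z]+)" +
        @"|(?<UpperRunBeforeWord>[A-Z]+(?=[A-Z][a-z]))" +
        @"|(?<Lower>[a-z]+)" +
        @"|(?<UpperRun>[A-Z]+)");

    // Returns "value:GroupName" for every match, in order of appearance.
    public static string[] Describe(string text)
    {
        return Pattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value + ":" + MatchedGroupName(m))
            .ToArray();
    }

    private static string MatchedGroupName(Match m)
    {
        // Group "0" is the whole match; skip it and report the named group that succeeded.
        return Pattern.GetGroupNames().First(name => name != "0" && m.Groups[name].Success);
    }
}
```

For example, string.Join(", ", CamelTokens.Describe("qwerDasdf ABCTest CD")) gives qwer:Lower, Dasdf:Capitalized, ABC:UpperRunBeforeWord, Test:Capitalized, CD:UpperRun.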
Here is my attempt:
(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b)|(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b)
This regex can be used with Regex.Replace and " $0" (a space followed by the whole match) as the replacement string.
Regex.Replace(value, @"(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b)|(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b)", " $0", RegexOptions.Compiled);
See demo
Regex Explanation:
Contains 2 alternatives to account for a chain of capital letters before or after lowercase letters.
(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b) - first alternative that matches several uppercase letters that are not preceded with a start of string, word boundary or another uppercase letter, and that are followed by a lowercase letter or a word boundary,
(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b) - the second alternative that matches a single capital letter that is not preceded with a start of string with optional uppercase letters right after, or word boundary and is followed by a lowercase letter or a word boundary that is not preceded by optional uppercase letters.
Do you have a requirement to use Regex? To be honest, I wouldn't use Regex for this at all. They're hard to debug and not especially readable.
You also sometimes end up with all sorts of fun like this: Regex problem: IsMatch method never returns
The regex above will not deal with the wonderful world of Unicode, e.g. Cyrillic (http://en.wikipedia.org/wiki/Cyrillic_script) (not that your specific problem domain probably needs this, but for completeness...)
I would go with a small, reusable, easily testable extension method:
using System;
using System.Globalization;
using System.Linq;
using System.Text;
class Program
{
static void Main(string[] args)
{
string[] inputs = new[]
{
"ABCTest",
"HelloWorld",
"testTest$Test",
"aaҚbb"
};
var output = inputs.Select(x => x.SplitWithSpaces(CultureInfo.CurrentUICulture));
foreach (string x in output)
{
Console.WriteLine(x);
}
Console.Read();
}
}
public static class StringExtensions
{
public static bool IsLowerCase(this TextInfo textInfo, char input)
{
return textInfo.ToLower(input) == input;
}
public static string SplitWithSpaces(this string input, CultureInfo culture = null)
{
if (culture == null)
{
culture = CultureInfo.InvariantCulture;
}
TextInfo textInfo = culture.TextInfo;
StringBuilder sb = new StringBuilder(input);
for (int i = 1; i < sb.Length; i++)
{
int previous = i - 1;
if (textInfo.IsLowerCase(sb[previous]))
{
int insertLocation = previous - 1;
if (insertLocation > 0)
{
sb.Insert(insertLocation, ' ');
}
while (i < sb.Length && textInfo.IsLowerCase(sb[i]))
{
i++;
}
}
}
return sb.ToString();
}
}
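For comparison, here is a minimal sketch of the same non-regex idea that decides where to insert a space by looking at neighbouring characters with char.IsUpper and char.IsLower (both Unicode-aware) rather than a TextInfo. It is not the code from the answer above. The class and method names are hypothetical, and treating the last capital of an uppercase run as the start of the next word is an assumption taken from the ABCTest example in the question.

```csharp
using System.Text;

public static class CaseSplitter
{
    // Inserts a space before an uppercase letter when it follows a lowercase letter,
    // or when it is the last letter of an uppercase run followed by a lowercase letter.
    public static string InsertSpaces(string input)
    {
        var sb = new StringBuilder(input.Length + 8);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (i > 0 && char.IsUpper(c))
            {
                char prev = input[i - 1];
                bool afterLower = char.IsLower(prev);
                bool lastOfUpperRun = char.IsUpper(prev)
                    && i + 1 < input.Length
                    && char.IsLower(input[i + 1]);
                if (afterLower || lastOfUpperRun)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this sketch, CaseSplitter.InsertSpaces("ABCTest") returns "ABC Test" and "Testing" is returned unchanged. Characters that are neither uppercase nor lowercase, such as $, never trigger a split.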
Let's say we have the string:
Hello
The user enters a char input "e"
What is the correct way of returning the string as the following using a regex method:
-e---
Code tried:
public static string updatedWord(char guess, string word)
{
string result = Regex.Replace(word, guess, "-");
console.writeline(result);
return result;
}
Assuming the input were e, you could build the following regex pattern:
[^e]
Then, do a global replacement on this pattern, which matches any single character which is not e, and replace it with a single dash.
string word = "Hello";
char guess = 'e';
string regex = "[^" + guess + "]";
string result = Regex.Replace(word, regex, "-");
Console.WriteLine(result);
This prints:
-e---
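Note that to handle regex metacharacters correctly, should they be allowed as inputs, we should escape the guessed character with Regex.Escape before building the pattern. Escaping the whole pattern would also escape the brackets and break the character class:
string regex = "[^" + Regex.Escape(guess.ToString()) + "]";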
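As a minimal sketch of the escaping point, the helper below builds the negated character class from an escaped copy of the guessed character only. The class and method names are hypothetical. One caveat that this sketch does not cover: Regex.Escape does not escape a closing bracket, so a guess of ] would need separate handling.

```csharp
using System.Text.RegularExpressions;

public static class WordMask
{
    // Replaces every character that is not the guessed one with a dash.
    public static string Mask(string word, char guess)
    {
        // Escape only the guessed character; escaping the whole "[^...]" pattern
        // would also escape the brackets and turn the class into literal text.
        string pattern = "[^" + Regex.Escape(guess.ToString()) + "]";
        return Regex.Replace(word, pattern, "-");
    }
}
```

WordMask.Mask("Hello", 'e') returns "-e---", and a metacharacter such as '.' is treated literally, so WordMask.Mask("a.b", '.') returns "-.-".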
This can be done without Regex. You need to loop over the characters of the secret word and replace the ones not yet guessed with -. A regex loops over the letters too, but plain C# methods are easier to follow ;)
You need to keep a collection of the letters already guessed.
public class Guess
{
private readonly string _word;
private readonly HashSet<char> _guessed;
public Guess(string word)
{
_word = word;
_guessed = new HashSet<char>();
}
public string Try(char letter)
{
_guessed.Add(letter);
var maskedLetters = _word.Select(c => _guessed.Contains(c) ? c : '-').ToArray();
return new string(maskedLetters);
}
}
Usage
var game = new Guess("Hello");
var result = game.Try('e');
Console.WriteLine(result); // "-e---"
I am having trouble finding a pattern that solves the problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want is: take up to four Text items. If there are more than four, take only the last character of each additional item.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this I do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to form one pattern. Is there something like an anchor that tells the engine to start matching from that point, as it would if the second pattern were used on its own?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, match 0-3 repetitions of 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
.NET regex demo | C# demo
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4)); // first four parts are kept whole; Substring(4) drops the leading "Text"
}
Console.WriteLine(result);
}
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
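One possible adjustment, sketched under assumptions rather than taken from the answer: keep the first four parts whole, reduce each later part to its final character (instead of relying on the fixed "Text" prefix length), and add the extra $ only when there is something to append. The class and method names are hypothetical.

```csharp
using System.Linq;

public static class PartTrimmer
{
    public static string KeepFirstFour(string input)
    {
        string[] parts = input.Split('$');

        // The first four parts are kept as they are.
        string head = string.Join("$", parts.Take(4));

        // From the fifth part on, keep only the final character of each non-empty part.
        char[] tail = parts.Skip(4)
            .Where(p => p.Length > 0)
            .Select(p => p[p.Length - 1])
            .ToArray();

        return tail.Length == 0 ? head : head + "$" + new string(tail);
    }
}
```

PartTrimmer.KeepFirstFour("Text1$Text2$Text3$Text4$Text5$Text6") returns "Text1$Text2$Text3$Text4$56", while inputs with four or fewer parts come back unchanged.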
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3} - match Text literally, then match \d{1,3}, which is 1 up to three digits, \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: take the first capturing group and concatenate it with the second, removing the unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
Console.WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first four items, and the trailing digit of each later item. It then extracts the captured text and composes the output from it.
I need to split a CamelCase string into an array of words based on the case of the letters. The rules for dividing the string are as follows:
Break the string in all places where a lowercase letter is followed by an uppercase letter, and break before the uppercase letter.
e.g.: aB -> { "a", "B" }
e.g.: helloWorld -> { "hello", "World" }
Break the string in all places where an uppercase letter is followed by a lowercase letter, and break before the uppercase letter.
e.g.: ABc -> { "A", "Bc" }
e.g.: HELLOWorld -> { "HELLO", "World" }
Some edge cases deserve examples of expected output:
FYYear -> { "FY", "Year" }
CostCenter -> { "Cost", "Center" }
cosTCenter -> { "cos", "T", "Center" }
CostcenteR -> { "Costcente", "R" }
COSTCENTER -> { "COSTCENTER" }
I've tried using a regular expression as shown in the code below:
updateCaption = string.Join(" ", Regex.Split(updateCaption, @"(?<!^)(?=[A-Z])"));
But this doesn't work.
This RegEx should do the trick:
private string ToUppercase(string input) {
var regex = new Regex(@"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])");
return regex.Replace(input, " ");
}
I copied the formatting from https://regex101.com/r/ahah3D/2 for further explanation:
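The pattern has two alternatives, each built from zero-width lookarounds. The first, (?<=[A-Z])(?=[A-Z][a-z]), matches the position after an uppercase letter and before an uppercase letter that is itself followed by a lowercase letter, which separates an acronym from the word after it. The second, (?<=[^A-Z])(?=[A-Z]), matches the position between any non-uppercase character and an uppercase letter, i.e. your standard case of a lowercase letter followed by an uppercase letter.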
Let me know if that solves your question.
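Because both alternatives are zero-width, the same pattern can also be passed to Regex.Split to get the array of words the question asks for, rather than a space-joined string. The following is a minimal sketch; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class BoundarySplit
{
    // The pattern from the answer above: it matches positions, not characters.
    private static readonly Regex Boundary =
        new Regex(@"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])");

    // Splits at every matched position; no characters are consumed.
    public static string[] Split(string input)
    {
        return Boundary.Split(input);
    }
}
```

For the edge cases in the question, BoundarySplit.Split("cosTCenter") yields { "cos", "T", "Center" } and BoundarySplit.Split("COSTCENTER") yields { "COSTCENTER" }.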
Here's my approach:
static IEnumerable<string> SplitCamelCase(string input)
{
return Regex.Split(input, @"([A-Z]?[a-z]+)").Where(str => !string.IsNullOrEmpty(str));
}
It works by splitting the string using "an uppercase letter followed by one or more lowercase letters" (or just one or more lowercase letters) as a delimiter. Regex.Split will include the delimiters in the result array if they are captured in parentheses (and they are, in my example). And this leaves only the spans of capital letters (all but the last) occurring between delimiters, which Regex.Split will include in the array naturally. It does produce superfluous empty strings in some cases, but they can be filtered out; I did so with a .Where clause.
It's not bad. I only wish there were a nicer way to eliminate the empty strings more easily.
By the way, I elected to return IEnumerable<string> because I feel like that format is more reusable. But you can always .ToArray() the result if you prefer an array, or the result can be joined with spaces using string.Join(" ", result) to form your corrected string.
Here's a complete demonstration:
class Program
{
static IEnumerable<string> SplitCamelCase(string input)
{
return Regex.Split(input, @"([A-Z]?[a-z]+)").Where(str => !string.IsNullOrEmpty(str));
}
static void Main(string[] args)
{
string[] examples = new string[] {
"FYYear",
"CostCenter",
"cosTCenter",
"CostcenteR",
"COSTCENTER"
};
foreach (string str in examples) {
Console.WriteLine("{0, 10} -> {1}", str, String.Join(" ", SplitCamelCase(str)));
}
}
}
Output:
FYYear -> FY Year
CostCenter -> Cost Center
cosTCenter -> cos T Center
CostcenteR -> Costcente R
COSTCENTER -> COSTCENTER
I have a series of strings that look like "WORD1: JUNK1 WORD2: JUNK2" and I want to remove the junk from the string while preserving the number of characters between the words (including those taken up by the junk).
I have a list of what words will be used but not junk
The words, number of spaces between everything, and junk all change every line
So far I've been using a regex like (word|word|word)(.*)(word|word|word)(.*) but I don't know how to maintain the formatting that way.
EDITED
Sorry, you were right: WORD1/WORD2 and JUNK1/JUNK2 are placeholders for the actual values I've been seeing. It's all alphanumeric characters and slashes.
Input Examples:
"CATEGORY:(4 spaces)SIDES(3 spaces)DATE CREATED:(3 spaces)03/12/16"
"PRODUCT:(6 spaces)CARROTS(4 spaces)DATE DELETED:(4 spaces)05/11/17"
Output Examples:
"CATEGORY:(12 spaces)DATE CREATED:(11 spaces)"
"PRODUCT:(17 spaces)DATE DELETED:(12 spaces)"
I am trying to replace the word "SIDES" as well as "03/12/16" with spaces. Rather, I want the number of characters between CATEGORY and DATE CREATED to remain the same and all be spaces.
I suggest a solution that is based on a Regex.Split operation:
var s = "CATEGORY: SIDES DATE CREATED: 03/12/16";
var rx = @"(\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):)";
var chunks = Regex.Split(s, rx);
Console.WriteLine(string.Concat(
chunks.Select(
x => Regex.IsMatch(x, $"^{rx}$") ? x : new String(' ', x.Length))
)
);
See the C# demo
The (\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):) regex is the delimiter pattern inside a capturing group so that Regex.Split could add the matches to the resulting array. It matches whole words CATEGORY, PRODUCT, DATE CREATED and DATE DELETED, and then a :. If the item matches this delimiter fully (see ^ and $ anchors in Regex.IsMatch(x, $"^{rx}$")) then it must stay as is, else, a string of spaces is built new String(' ', x.Length).
If you need a purely regex solution, you may use
var delim = @"\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):";
Console.WriteLine(Regex.Replace(s, $@"(\G(?!^)\s*|{delim}\s*)(?!{delim})\S", "$1 "));
See this regex demo.
Details
(\G(?!^)\s*|{delim}\s*) - Group 1 ($1 in the replacement pattern): the end of the previous match (\G(?!^)) followed with 0+ whitespaces (\s*) or (|) the delim pattern with 0+ whitespaces
(?!{delim})\S - any non-whitespace char that is not a starting char of a delim sequence
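Both variants above keep the line length intact by turning every non-label character into a space. Another way to get the same effect, sketched here as an alternative and not as part of this answer, is to start from a buffer of spaces and copy only the matched labels back in at their original offsets. The label pattern is the one used above; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LabelKeeper
{
    private static readonly Regex Labels =
        new Regex(@"\b(?:CATEGORY|PRODUCT|DATE (?:CREA|DELE)TED):");

    public static string BlankOutsideLabels(string line)
    {
        // Start with a line of spaces that has the same length as the input.
        char[] buffer = new string(' ', line.Length).ToCharArray();

        // Copy each label back to the position where it was found.
        foreach (Match m in Labels.Matches(line))
        {
            line.CopyTo(m.Index, buffer, m.Index, m.Length);
        }
        return new string(buffer);
    }
}
```

Since the output buffer is created with the input's length and labels are written in place, the result always has the same length as the input.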
I'm sure someone will give you a nice clean answer using regex but here's a quick solution off the top of my head:
string msg = "this is a silly test message";
string[] junk = new string[] { "silly", "message" };
foreach(string j in junk)
{
msg = Regex.Replace(msg, j, string.Empty.PadRight(j.Length));
}
I thought this was an interesting experiment and I came up with what appears to be a very different method than the other answers.
public class WordStripper
{
public string StripWords(string input)
{
var ignoreWords = new List<string>
{
"CATEGORY:",
"DATE CREATED:",
"PRODUCT:",
"DATE DELETED:"
};
var deliminator = string.Join("|", ignoreWords);
var splitInput = Regex.Split(input, $"({deliminator})");
var sb = new StringBuilder();
foreach (var word in splitInput)
{
if (ignoreWords.Contains(word))
{
sb.Append(word);
}
else
{
var wordLength = word.Length;
sb.Append(new string(' ', wordLength));
}
}
return sb.ToString();
}
}
And a unit test to validate it in case you're interested (uses NUnit)
[TestFixture]
public class Test
{
[Test]
[TestCase("CATEGORY:    SIDES   DATE CREATED:   03/12/16", "CATEGORY:            DATE CREATED:           ")]
[TestCase("PRODUCT:      CARROTS    DATE DELETED:    05/11/17", "PRODUCT:                 DATE DELETED:            ")]
public void TestMethod(string input, string expectedResult)
{
//arrange
var uut = new WordStripper();
//act
var actualResults = uut.StripWords(input);
//assert
Assert.AreEqual(expectedResult, actualResults);
}
}
I have a string which will have the word "TAG" followed by an integer,underscore and another word.
Eg: "TAG123_Sample"
I need to cut the "TAGXXX_" pattern and get only the word Sample. Meaning I will have to cut the word "TAG" and the integer followed by and the underscore.
I wrote the following code but it doesn't work. What have I done wrong? How can I do this? Please advise.
static void Main(string[] args)
{
String sentence = "TAG123_Sample";
String pattern=@"TAG[^\d]_";
String replacement = "";
Regex r = new Regex(pattern);
String res = r.Replace(sentence,replacement);
Console.WriteLine(res);
Console.ReadLine();
}
Your pattern uses [^\d], a negated class that matches exactly one character that is NOT a digit, so it can never match the digits in "TAG123_". Match one or more digits with \d+ instead:
String s = "TAG123_Sample";
String r = Regex.Replace(s, @"TAG\d+_", "");
Console.WriteLine(r); //=> "Sample"
Explanation:
TAG match 'TAG'
\d+ digits (0-9) (1 or more times)
_ '_'
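The breakdown above can be wrapped in a small helper. This is a minimal sketch: anchoring the pattern with ^ so that only a leading tag is removed is an added assumption, not something the answer specifies, and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class TagPrefix
{
    // Removes a leading "TAG<digits>_" prefix; input without such a prefix is returned unchanged.
    public static string Strip(string s)
    {
        return Regex.Replace(s, @"^TAG\d+_", "");
    }
}
```

TagPrefix.Strip("TAG123_Sample") returns "Sample". By contrast, the original pattern TAG[^\d]_ only matches inputs such as "TAGx_Sample", where exactly one non-digit sits between TAG and the underscore.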
You can use String.Split for this:
string[] s = "TAG123_Sample".Split('_');
Console.WriteLine(s[1]);
https://msdn.microsoft.com/en-us/library/b873y76a.aspx
Try this; it will work in this case:
resultString = Regex.Replace(sentence ,
@"^ # Match start of string
[^_]* # Match 0 or more characters except underscore
_ # Match the underscore", "", RegexOptions.IgnorePatternWhitespace);
No regex is necessary if your string contains 1 underscore and you need to get a substring after it.
Here is a Substring+IndexOf-based approach:
var res = sentence.Substring(sentence.IndexOf('_') + 1); // => Sample
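A minimal sketch of the same Substring+IndexOf idea with the edge cases spelled out. When there is no underscore, IndexOf returns -1; the explicit check below makes the "return the input unchanged" behaviour visible. When there are several underscores, everything after the first one is kept. The class and method names are hypothetical.

```csharp
public static class UnderscoreSuffix
{
    public static string AfterFirstUnderscore(string s)
    {
        int index = s.IndexOf('_');

        // No underscore: nothing to cut, return the input as it is.
        if (index < 0)
        {
            return s;
        }
        return s.Substring(index + 1);
    }
}
```

UnderscoreSuffix.AfterFirstUnderscore("TAG123_Sample") returns "Sample", and "TAG1_Some_Thing" becomes "Some_Thing".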
See IDEONE demo