Escape character in C#'s Split() - c#

I am parsing some delimiter separated values, where ? is specified as the escape character in case the delimiter appears as part of one of the values.
For instance: if : is the delimiter, and a certain field has the value 19:30, this needs to be written as 19?:30.
Currently, I use string[] values = input.Split(':'); in order to get an array of all values, but after learning about this escape character, this won't work anymore.
Is there a way to make Split take escape characters into account? I have checked the overload methods, and there does not seem to be such an option directly.

string[] substrings = Regex.Split("aa:bb:00?:99:zz", @"(?<!\?):");
This yields:
aa
bb
00?:99
zz
Alternatively, since you will probably want to unescape ?: at some point anyway, replace that sequence in the input with another token, split, and then substitute the token back.
(This requires the System.Text.RegularExpressions namespace to be used.)
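A minimal sketch of splitting and unescaping in one step, assuming : as the delimiter and ? as the escape character as in the question. The class and method names are hypothetical. It does not handle an escaped escape character (??) and is only meant to illustrate the lookbehind approach.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedSplitSketch
{
    // Hypothetical helper: splits on ':' unless it is preceded by '?',
    // then turns each escaped "?:" back into a plain ':'.
    public static string[] SplitAndUnescape(string input)
    {
        return Regex.Split(input, @"(?<!\?):")
                    .Select(field => field.Replace("?:", ":"))
                    .ToArray();
    }
}
```

Called with "aa:bb:00?:99:zz", this returns the four fields aa, bb, 00:99 and zz.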

This kind of stuff is always fun to code without using Regex.
The following does the trick, with one caveat: the escape character always escapes whatever follows it; there is no check that the escaped character is a valid one (? or ;). So the string one?two;three??;four?;five will be split into onetwo, three? and four;five.
public static IEnumerable<string> Split(this string text, char separator, char escapeCharacter, bool removeEmptyEntries)
{
    string buffer = string.Empty;
    bool escape = false;
    foreach (var c in text)
    {
        if (!escape && c == separator)
        {
            // unescaped separator: the current field is complete
            if (!removeEmptyEntries || buffer.Length > 0)
            {
                yield return buffer;
            }
            buffer = string.Empty;
        }
        else if (!escape && c == escapeCharacter)
        {
            // unescaped escape character: remember it, but do not emit it
            escape = true;
        }
        else
        {
            // ordinary character, or any character that follows an escape
            buffer = string.Concat(buffer, c);
            escape = false;
        }
    }
    if (buffer.Length != 0)
    {
        yield return buffer;
    }
}

No, there's no way to do that with Split itself. You will need to use a regex (which one depends on exactly how you want your "escape character" to behave). In the worst case, I suppose you'll have to do the parsing manually.

Related

Read Text after custom word

My goal is to find a way to read the text that follows a word in a file. An example of this is:
Word("Text")
The output would be Text.
Is this achievable?
Cut your problem into small pieces:
Read a text file
Divide the text into a sequence of words
Skip all words in the sequence until the word that you are looking for
Use the rest of the sequence of words
Read the characters in a text file as a sequence of characters
public IEnumerable<char> ReadTextFile(string fileName)
{
    using (TextReader textReader = new StreamReader(fileName))
    {
        // read the characters one by one until there are no more characters (= -1)
        int readResult = textReader.Read();
        while (readResult != -1)
        {
            yield return (char)readResult;
            readResult = textReader.Read();
        }
    }
}
I decided to return a sequence of char instead of a string. This is a small optimization: if you stop enumerating before the end, the rest of the file is never read.
Divide a sequence of characters into a sequence of words.
The problem is: what is a word? Probably something like: all characters between two white spaces, with special handling for the beginning and the end of the sequence.
But how about this: "Hello..World" Are these only the words "Hello" and "World", or is there also an empty word between the two dots? And what if the sequence of characters starts with a dot: ".Hello"?
I'll write this part as an extension method, so you can use it in a LINQ statement. If you are not familiar with extension methods, consider reading Extension methods demystified.
public static IEnumerable<string> ToWords(this IEnumerable<char> source)
{
    string word = String.Empty;
    foreach (char c in source)
    {
        if (Char.IsWhiteSpace(c))
        {
            // white space: only a word if something was already read
            if (word.Length != 0)
            {
                yield return word;
                word = String.Empty;
            }
            // else: the sequence starts with a white space: not a word
        }
        else
        {
            // not a white space: add the character to the word
            word = word + c;
        }
    }
    // if word is not empty, then there were characters after the last whitespace
    // like: "Hello World". "Hello" already returned. "World" not yet
    if (word.Length != 0)
        yield return word;
}
Consider optimizing the word = word + c; part, for example with a StringBuilder.
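One way to do that optimization is to collect the characters in a StringBuilder instead of concatenating strings. This is only a sketch; the names WordExtensionsSketch and ToWordsBuffered are made up for illustration, and the definition of a word is the same whitespace-based one as above.

```csharp
using System.Collections.Generic;
using System.Text;

public static class WordExtensionsSketch
{
    // Same logic as ToWords, but the current word is built in a reusable buffer.
    public static IEnumerable<string> ToWordsBuffered(this IEnumerable<char> source)
    {
        var word = new StringBuilder();
        foreach (char c in source)
        {
            if (char.IsWhiteSpace(c))
            {
                // white space ends the current word, if there is one
                if (word.Length != 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }
        // characters after the last white space form the final word
        if (word.Length != 0)
            yield return word.ToString();
    }
}
```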
Use LINQ to concatenate what you want
string fileName = ...
string wordToFind = "Hello";
// in this example ignore case:
IEqualityComparer<string> stringComparer = StringComparer.CurrentCultureIgnoreCase;
IEnumerable<string> wordsAfterHello = ReadTextFile(fileName)
    .ToWords()
    // skip every word until the one we are looking for ...
    .SkipWhile(word => !stringComparer.Equals(word, wordToFind))
    // ... and then skip that word itself
    .Skip(1);
Of course if you plan to use this often, you could write extension methods for this.
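Such an extension method could look roughly like the sketch below. The name WordsAfter and its signature are assumptions, not part of any library; it simply wraps the SkipWhile logic so it can be chained after any sequence of words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceExtensionsSketch
{
    // Hypothetical helper: returns every word that comes after the first
    // occurrence of wordToFind, compared with the supplied comparer.
    public static IEnumerable<string> WordsAfter(
        this IEnumerable<string> words,
        string wordToFind,
        IEqualityComparer<string> comparer)
    {
        return words
            .SkipWhile(word => !comparer.Equals(word, wordToFind)) // up to the match
            .Skip(1);                                              // the match itself
    }
}
```

If the word never occurs, the result is simply an empty sequence.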

Replace regular expression with regular expression

Consider two regular expressions:
var regex_A = @"Main\.(.+)\.Value";
var regex_B = "M_(.+)_Sp";
I want to be able to replace a string using regex_A as input, and regex_B as the replacement string. But also the other way around. And without supplying additional information like a format string per regex.
Specifically I want to create a replaced_B string from an input_A string. So:
var input_A = "Main.Rotating.Value";
var replaced_B = input_A.RegEx_Awesome_Replace(regex_A, regex_B);
Assert.AreEqual("M_Rotating_Sp", replaced_B);
And this should also work in reverse (that's the reason I can't use a simple string.Format for regex_B), because I don't want to supply a format string for every regular expression (I'm lazy).
var input_B = "M_Skew_Sp";
var replaced_A = input_B.RegEx_Awesome_Replace(regex_B, regex_A);
Assert.AreEqual("Main.Skew.Value", replaced_A);
I have no clue if this exists, or how to call it. Google search finds me all kinds of other regex replaces... not this one.
Update:
So basically I need a way to convert a regular expression to a format string.
var regex_A_format = Regex2Format(regex_A);
Assert.AreEqual("Main.$1.Value", regex_A_format);
and
var regex_B_format = Regex2Format(regex_B);
Assert.AreEqual("M_$1_Sp", regex_B_format);
So what should the RegEx_Awesome_Replace and/or Regex2Format function look like?
Update 2:
I guess the RegEx_Awesome_Replace should look something like (using some code from answers below):
public static class StringExtenstions
{
public static string RegExAwesomeReplace(this string inputString,string searchPattern,string replacePattern)
{
return Regex.Replace(inputString, searchPattern, Regex2Format(replacePattern));
}
}
Which would leave the Regex2Format as an open question.
There is no defined way for one regex to refer to a match found in another regex. Regexes are not format strings.
What you can do is to use Tuples of a format string together with its regex. e.g.
var a = new Tuple<Regex,string>(new Regex(@"(?<=Main\.).+(?=\.Value)"), @"Main.{0}.Value");
var b = new Tuple<Regex,string>(new Regex(@"(?<=M_).+(?=_Sp)"), @"M_{0}_Sp");
Then you can pass these objects to a common replacement method in any order, like this:
private string RegEx_Awesome_Replace(string input, Tuple<Regex,string> toFind, Tuple<Regex,string> replaceWith)
{
return string.Format(replaceWith.Item2, toFind.Item1.Match(input).Value);
}
You will notice that I have used zero-width positive lookahead assertion and zero-width positive lookbehind assertions in my regexes, to ensure that Value contains exactly the text that I want to replace.
You may also want to add error handling for cases where the match cannot be found. Maybe read about Regex.Match.
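As a rough sketch of that error handling, the replacement can check Match.Success before formatting. The TryReplace name and the choice to return the input unchanged on failure are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternPairSketch
{
    // Hypothetical variant of the replacement method: reports whether the
    // source pattern matched instead of formatting an empty match value.
    public static bool TryReplace(
        string input,
        Tuple<Regex, string> toFind,
        Tuple<Regex, string> replaceWith,
        out string result)
    {
        Match match = toFind.Item1.Match(input);
        if (!match.Success)
        {
            result = input; // nothing found: leave the input untouched
            return false;
        }
        result = string.Format(replaceWith.Item2, match.Value);
        return true;
    }
}
```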
Since you have already reduced your problem to where you need to change a Regex into a string format (implementing Regex2Format) I will focus my answer just on that part. Note that my answer is incomplete because it doesn't address the full breadth of parsing regex capturing groups, however it works for simple cases.
First, we need a regex that matches regex capture groups. A negative lookbehind keeps it from matching escaped parentheses. Other cases still break this regex, e.g. non-capturing groups, wildcard symbols, and anything between square brackets.
private static readonly Regex CaptureGroupMatcher = new Regex(@"(?<!\\)\([^\)]+\)");
The implementation of Regex2Format here basically writes everything outside of capture groups into the output string, and replaces the capture group value by {x}.
static string Regex2Format(string pattern)
{
var targetBuilder = new StringBuilder();
int previousEndIndex = 0;
int formatIndex = 0;
foreach (Match match in CaptureGroupMatcher.Matches(pattern))
{
var group = match.Groups[0];
int endIndex = group.Index;
AppendPart(pattern, previousEndIndex, endIndex, targetBuilder);
targetBuilder.Append('{');
targetBuilder.Append(formatIndex++);
targetBuilder.Append('}');
previousEndIndex = group.Index + group.Length;
}
AppendPart(pattern, previousEndIndex, pattern.Length, targetBuilder);
return targetBuilder.ToString();
}
This helper function writes pattern string values into the output, it currently writes everything except \ characters used to escape something.
static void AppendPart(string pattern, int previousEndIndex, int endIndex, StringBuilder targetBuilder)
{
for (int i = previousEndIndex; i < endIndex; i++)
{
char c = pattern[i];
if (c == '\\' && i < pattern.Length - 1 && pattern[i + 1] != '\\')
{
//backslash not followed by another backslash - it's an escape char
}
else
{
targetBuilder.Append(c);
}
}
}
Test cases
static void Test()
{
    var cases = new Dictionary<string, string>
    {
        { @"Main\.(.+)\.Value", @"Main.{0}.Value" },
        { @"M_(.+)_Sp(.*)", "M_{0}_Sp{1}" },
        { @"M_\(.+)_Sp", @"M_(.+)_Sp" },
    };
    foreach (var kvp in cases)
    {
        if (Regex2Format(kvp.Key) != kvp.Value)
        {
            Console.WriteLine("Test failed for {0} - expected {1} but got {2}", kvp.Key, kvp.Value, Regex2Format(kvp.Key));
        }
    }
}
To wrap up, here is the usage:
private static string AwesomeRegexReplace(string input, string sourcePattern, string targetPattern)
{
    var targetFormat = Regex2Format(targetPattern);
    return Regex.Replace(input, sourcePattern, match =>
    {
        // group 0 is the whole match, so skip it and pass only the capture groups
        var args = match.Groups.OfType<Group>().Skip(1).Select(g => g.Value).ToArray<object>();
        return string.Format(targetFormat, args);
    });
}
Something like this might work
var replaced_B = Regex.Replace(input_A, @"Main\.(.+)\.Value", @"M_$1_Sp");
Are you looking for something like this?
public static class StringExtenstions
{
public static string RegExAwesomeReplace(this string inputString,string searchPattern,string replacePattern)
{
Match searchMatch = Regex.Match(inputString,searchPattern);
Match replaceMatch = Regex.Match(inputString, replacePattern);
if (!searchMatch.Success || !replaceMatch.Success)
{
return inputString;
}
return inputString.Replace(searchMatch.Value, replaceMatch.Value);
}
}
The string extension method returns the string with replaced value for search pattern and replace pattern.
This is how you call:
input_A.RegExAwesomeReplace(regex_A, regex_B);

How to rewrite a string by pattern

I have a string, where the "special areas" are enclosed in curly braces:
{intIncG}/{intIncD}/02-{yy}
I need to iterate through all of these elements in between {} and replace them based on their content. What is the best code structure to do it in C#?
I can't just do a replace since I need to know the index of each "special area {}" in order to replace it with the correct value.
Regex rgx = new Regex(@"(\{[^\}]*\})");
string output = rgx.Replace(input, new MatchEvaluator(DoStuff));
static string DoStuff(Match match)
{
    // Here you have access to match.Index and match.Value, so you can do something different for Match1, Match2, etc.
    // You can easily strip the braces off the value:
    string value = match.Value.Substring(1, match.Value.Length - 2);
    // Then call a function that takes value and index to get the string to be substituted.
    // Placeholder: returning value unchanged only removes the braces.
    return value;
}
string.Replace will do just fine.
var updatedString = myString.Replace("{intIncG}", "something");
Do once for every different string.
Update:
Since you need the index of { in order to produce the replacement string (as you commented), you can use Regex.Matches to find the indices of { - each Match object in the Matches collection will include the index in the string.
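A small sketch of that idea: Regex.Matches returns one Match per placeholder, and each Match carries its Index in the input. The PlaceholderSketch class and the "name@index" output format are invented here purely to make the result visible.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderSketch
{
    // Lists every {placeholder} in the input together with the position
    // of its opening brace, formatted as "name@index".
    public static List<string> Describe(string input)
    {
        var found = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\{([^{}]*)\}"))
        {
            // Groups[1] is the text between the braces; Index is where '{' sits
            found.Add(match.Groups[1].Value + "@" + match.Index);
        }
        return found;
    }
}
```

For the string from the question, {intIncG}/{intIncD}/02-{yy}, this produces intIncG@0, intIncD@10 and yy@23.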
Use Regex.Replace:
Replaces all occurrences of a character pattern defined by a regular expression with a specified replacement character string.
(from MSDN)
You can define a function and join its output, so you only need to traverse the parts once rather than once per replace rule.
private IEnumerable<string> Traverse(string input)
{
int index = 0;
string[] parts = input.Split(new[] {'/'});
foreach(var part in parts)
{
index++;
string retVal = string.Empty;
switch(part)
{
case "{intIncG}":
retVal = "a"; // or something based on index!
break;
case "{intIncD}":
retVal = "b"; // or something based on index!
break;
...
}
yield return retVal;
}
}
string replaced = string.Join("/", Traverse(inputString));

Regular Expression To Split On Comma Except If Quoted

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:
max,emily,john = ["max", "emily", "john"]
BUT
max,"emily,kate",john = ["max", "emily,kate", "john"]
Looking to use in C#: Regex.Split(string, "PATTERN-HERE");
Thanks.
Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.
You might try something like this instead:
public static IEnumerable<string> SplitCSV(string csvString)
{
    var sb = new StringBuilder();
    bool quoted = false;
    foreach (char c in csvString) {
        if (quoted) {
            // inside quotes: everything except the closing quote is data
            if (c == '"')
                quoted = false;
            else
                sb.Append(c);
        } else {
            if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                // unquoted comma: the current field is complete
                yield return sb.ToString();
                sb.Length = 0;
            } else {
                sb.Append(c);
            }
        }
    }
    if (quoted)
        throw new ArgumentException("Unterminated quotation mark.", "csvString");
    yield return sb.ToString();
}
It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.
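One such tweak, sketched under the assumption that a doubled quote ("") inside a quoted field stands for a single literal quote, as is common in CSV files. The names CsvSketch and SplitCsvLine are hypothetical, and this is still not a full CSV parser (no multi-line fields, for instance).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    public static IEnumerable<string> SplitCsvLine(string line)
    {
        var field = new StringBuilder();
        bool quoted = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (quoted)
            {
                if (c != '"')
                {
                    field.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    field.Append('"'); // "" inside quotes is a literal quote
                    i++;               // skip the second quote of the pair
                }
                else
                {
                    quoted = false;    // closing quote
                }
            }
            else if (c == '"')
            {
                quoted = true;
            }
            else if (c == ',')
            {
                yield return field.ToString();
                field.Clear();
            }
            else
            {
                field.Append(c);
            }
        }
        if (quoted)
            throw new ArgumentException("Unterminated quotation mark.", nameof(line));
        yield return field.ToString();
    }
}
```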
This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():
You could use the regex (please don't!)
(?<=^(?:[^"]*"[^"]*")*[^"]*) # assert that there is an even number of quotes before...
\s*,\s* # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$) # as well as after the comma.
if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves becoming part of the match.
This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
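For comparison, here is a sketch of the built-in route, assuming that "NET's own CSV parsing capabilities" refers to TextFieldParser from the Microsoft.VisualBasic.FileIO namespace (the project needs a reference to the Microsoft.VisualBasic assembly). The wrapper class and method names are made up for the example.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class BuiltInCsvSketch
{
    // Parses a single comma-separated line, honoring double-quoted fields.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is no more data
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```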
A little late maybe but I hope I can help someone else
String[] cols = Regex.Split("max, emily, john", @"\s*,\s*");
foreach ( String s in cols ) {
Console.WriteLine(s);
}
Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc.
Here's our simple regex:
"[^"]*"|(,)
The left side of the alternation matches complete "quoted strings" tags. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.
This program shows how to use the regex:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program
{
static void Main() {
string s1 = @"max,""emily,kate"",john";
var myRegex = new Regex(@"""[^""]*""|(,)");
string replaced = myRegex.Replace(s1, delegate(Match m) {
if (m.Groups[1].Value == "") return m.Value;
else return "SplitHere";
});
string[] splits = Regex.Split(replaced,"SplitHere");
foreach (string split in splits) Console.WriteLine(split);
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

check content of string input

How can I check if my input is a particular kind of string. So no numeric, no "/",...
Well, to check that an input is actually an object of type System.String, you can simply do:
bool IsString(object value)
{
return value is string;
}
To check that a string contains only letters, you could do something like this:
bool IsAllAlphabetic(string value)
{
foreach (char c in value)
{
if (!char.IsLetter(c))
return false;
}
return true;
}
If you wanted to combine these, you could do so:
bool IsAlphabeticString(object value)
{
string str = value as string;
return str != null && IsAllAlphabetic(str);
}
If you mean "is the string completely letters", you could do:
string myString = "RandomStringOfLetters";
bool allLetters = myString.All( c => Char.IsLetter(c) );
This is based on LINQ and the Char.IsLetter method.
It's not entirely clear what you want, but you can probably do it with a regular expression. For example to check that your string contains only letters in a-z or A-Z you can do this:
string s = "dasglakgsklg";
if (Regex.IsMatch(s, "^[a-z]+$", RegexOptions.IgnoreCase))
{
Console.WriteLine("Only letters in a-z.");
}
else
{
// Not only letters in a-z.
}
If you also want to allow spaces, underscores, or other characters simply add them between the square brackets in the regular expression. Note that some characters have a special meaning inside regular expression character classes and need to be escaped with a backslash.
You can also use \p{L} instead of [a-z] to match any Unicode character that is considered to be a letter, including letters in foreign alphabets.
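A short sketch of the \p{L} variant. The helper name IsOnlyLetters is hypothetical, and \z is used instead of $ so that a trailing newline is not silently accepted.

```csharp
using System.Text.RegularExpressions;

public static class LetterCheckSketch
{
    // True when the value consists of one or more Unicode letters and nothing else.
    public static bool IsOnlyLetters(string value)
    {
        return Regex.IsMatch(value, @"^\p{L}+\z");
    }
}
```

With this pattern, strings such as "abc" or "Zoë" pass, while "abc1", "a/b" and the empty string do not.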
using System.Linq;
...
bool onlyAlphas = s.All(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
Something like this (have not tested) may fit your (vague) requirement.
if (input is string text)
{
    // test for legal characters?
    string pattern = "^[A-Za-z]+$";
    if (Regex.IsMatch(text, pattern))
    {
        // legal string? do something
    }
    // or
    if (text.Any(c => !char.IsLetter(c)))
    {
        // NOT legal string
    }
}
