Parse a custom file with a list of commands - c#

There is a list of commands in a file that look like this:
command1 argument1 argument2
command2 argument3 argument4
And the result should look like
//Dictionary<command name,list of arguments>
Dictionary<string,List<string>>
Of course, there can be any amount of arguments, not just 2 of them. Parsing it is a piece of cake. But the thing is, there can be multi-line arguments.
command {some
amount
of random text
} {and the second
argument} and_here_goes_argument_3
This is where it gets tricky. I've created a while loop with if conditions to parse this file, but it took me like 200+ lines of code and was totally unreadable. I bet there is a better way to do this.
Of course, I am not asking you to write the code for me. All I need is a basic approach.
As for the language- it can be either C# or C++.

Showing how much pain doing it with regexes would be:
string text = #"command1 argument1 argument2
command2 argument3 argument4
command {some
amount
of random text
} {and the second
argument} and_here_goes_argument_3";
var rx = new Regex(#"^(?<command>(?:(?!\r|$)[^ ])*) +(?:(?<argument>{[^}]*}|(?!\r?$|{)(?:(?!\r|$)[^ ])+)(?: +\r?$?|\r?$))*", RegexOptions.Multiline | RegexOptions.ExplicitCapture);
var matches = rx.Matches(text);
foreach (Match match in matches)
{
Console.WriteLine($"Command: {match.Groups["command"].Value}");
foreach (Capture capture in match.Groups["argument"].Captures)
{
Console.WriteLine($" - arg: [{capture.Value}]");
}
Console.WriteLine();
}
The problem is that this regex is both unreadable and brittle. Try adding a x just after the argument}, like argument}x. Handling malformed text is very difficult.
The only interesting part is that I use the RegexOptions.Multiline to handle multiple lines of text, and that $ matches the \n but not the \r that I handle manually.
Paradoxically a small grammar using a library could be the "simplest" solution...
Ok now some "real" code:
private static readonly string[] commandDelimiters = new[] { " ", "\r", "\n" };
// We don't want the { to be used inside arguments that aren't in the form {...}
// Note that at this time there is no way to "escape" the }
private static readonly string[] argumentDelimiters = new[] { " ", "\r", "\n", "{" };
public static IEnumerable<Tuple<string, string[]>> ParseCommands(string str)
{
int ix = 0;
int line = 0;
int ixStartLine = 0;
var args = new List<string>();
while (ix < str.Length)
{
string command = ParseWord(str, ref ix, commandDelimiters);
if (command.Length == 0)
{
throw new Exception($"No command, at line {line}, col {ix - ixStartLine}");
}
while (true)
{
SkipSpaces(str, ref ix);
if (IsEOL(str, true, ref ix))
{
line++;
ixStartLine = ix;
break;
}
if (str[ix] == '{')
{
int ix2 = str.IndexOf('}', ix + 1);
if (ix2 == -1)
{
throw new Exception($"Unclosed {{ at line {line}, col {ix - ixStartLine}");
}
// Skipping the {
ix++;
// Skipping the }, because we don't do ix2 - ix -1
string arg = str.Substring(ix, ix2 - ix);
// We count the new lines "inside" the { }
for (int i = 0; i < arg.Length; )
{
if (IsEOL(arg, true, ref i))
{
line++;
ixStartLine = ix + i + 1;
}
else
{
i++;
}
}
// Skipping the }
ix = ix2 + 1;
// If there is no space of eol after the } then error
if (ix < str.Length && str[ix] != ' ' && !IsEOL(str, false, ref ix))
{
throw new Exception($"Unexpected character at line {line}, col {ix - ixStartLine}");
}
args.Add(arg);
}
else
{
string arg = ParseWord(str, ref ix, commandDelimiters);
// If the terminator is {, then error.
if (ix < str.Length && str[ix] == '{')
{
throw new Exception($"Unexpected character at line {line}, col {ix - ixStartLine}");
}
args.Add(arg);
}
}
var args2 = args.ToArray();
args.Clear();
yield return Tuple.Create(command, args2);
}
}
// Stops at any of terminators, doesn't "consume" it advancing ix
public static string ParseWord(string str, ref int ix, string[] terminators)
{
int start = ix;
int curr = ix;
while (curr < str.Length && !terminators.Any(x => string.CompareOrdinal(str, curr, x, 0, x.Length) == 0))
{
curr++;
}
ix = curr;
return str.Substring(start, curr - start);
}
public static bool SkipSpaces(string str, ref int ix)
{
bool atLeastOne = false;
while (ix < str.Length && str[ix] == ' ')
{
atLeastOne = true;
ix++;
}
return atLeastOne;
}
// \r\n, \r, \n, end-of-string == true
public static bool IsEOL(string str, bool advance, ref int ix)
{
if (ix == str.Length)
{
return true;
}
if (str[ix] == '\r')
{
if (advance)
{
if (ix + 1 < str.Length && str[ix + 1] == '\n')
{
ix += 2;
}
ix += 2;
}
return true;
}
if (str[ix] == '\n')
{
if (advance)
{
ix++;
}
return true;
}
return false;
}
It is quite long, but I do think it is quite clear to read. The errors should be very exact (line and col given). Note that there is no possible escaping for the }. Doing it in an elegant way is complex.
Use it like:
var res = ParseCommands(text).ToArray();

Related

Best way to split string into lines with maximum length, without breaking words

I want to break a string up into lines of a specified maximum length, without splitting any words, if possible (if there is a word that exceeds the maximum line length, then it will have to be split).
As always, I am acutely aware that strings are immutable and that one should preferably use the StringBuilder class. I have seen examples where the string is split into words and the lines are then built up using the StringBuilder class, but the code below seems "neater" to me.
I mentioned "best" in the description and not "most efficient" as I am also interested in the "eloquence" of the code. The strings will never be huge, generally splitting into 2 or three lines, and it won't be happening for thousands of lines.
Is the following code really bad?
private static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
stringToSplit = stringToSplit.Trim();
var lines = new List<string>();
while (stringToSplit.Length > 0)
{
if (stringToSplit.Length <= maximumLineLength)
{
lines.Add(stringToSplit);
break;
}
var indexOfLastSpaceInLine = stringToSplit.Substring(0, maximumLineLength).LastIndexOf(' ');
lines.Add(stringToSplit.Substring(0, indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine : maximumLineLength).Trim());
stringToSplit = stringToSplit.Substring(indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine + 1 : maximumLineLength);
}
return lines.ToArray();
}
Even when this post is 3 years old I wanted to give a better solution using Regex to accomplish the same:
If you want the string to be splitted and then use the text to be displayed you can use this:
public string SplitToLines(string stringToSplit, int maximumLineLength)
{
return Regex.Replace(stringToSplit, #"(.{1," + maximumLineLength +#"})(?:\s|$)", "$1\n");
}
If on the other hand you need a collection you can use this:
public MatchCollection SplitToLines(string stringToSplit, int maximumLineLength)
{
return Regex.Matches(stringToSplit, #"(.{1," + maximumLineLength +#"})(?:\s|$)");
}
NOTES
Remember to import regex (using System.Text.RegularExpressions;)
You can use string interpolation on the match:
$#"(.{{1,{maximumLineLength}}})(?:\s|$)"
The MatchCollection works almost like an Array
Matching example with explanation here
How about this as a solution:
IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
var words = stringToSplit.Split(' ').Concat(new [] { "" });
return
words
.Skip(1)
.Aggregate(
words.Take(1).ToList(),
(a, w) =>
{
var last = a.Last();
while (last.Length > maximumLineLength)
{
a[a.Count() - 1] = last.Substring(0, maximumLineLength);
last = last.Substring(maximumLineLength);
a.Add(last);
}
var test = last + " " + w;
if (test.Length > maximumLineLength)
{
a.Add(w);
}
else
{
a[a.Count() - 1] = test;
}
return a;
});
}
I reworked this as prefer this:
IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
var words = stringToSplit.Split(' ');
var line = words.First();
foreach (var word in words.Skip(1))
{
var test = $"{line} {word}";
if (test.Length > maximumLineLength)
{
yield return line;
line = word;
}
else
{
line = test;
}
}
yield return line;
}
I don't think your solution is too bad. I do, however, think you should break up your ternary into an if else because you are testing the same condition twice. Your code might also have a bug. Based on your description, it seems you want lines <= maxLineLength, but your code counts the space after the last word and uses it in the <= comparison resulting in effectively < behavior for the trimmed string.
Here is my solution.
private static IEnumerable<string> SplitToLines(string stringToSplit, int maxLineLength)
{
string[] words = stringToSplit.Split(' ');
StringBuilder line = new StringBuilder();
foreach (string word in words)
{
if (word.Length + line.Length <= maxLineLength)
{
line.Append(word + " ");
}
else
{
if (line.Length > 0)
{
yield return line.ToString().Trim();
line.Clear();
}
string overflow = word;
while (overflow.Length > maxLineLength)
{
yield return overflow.Substring(0, maxLineLength);
overflow = overflow.Substring(maxLineLength);
}
line.Append(overflow + " ");
}
}
yield return line.ToString().Trim();
}
It is a bit longer than your solution, but it should be more straightforward. It also uses a StringBuilder so it is much faster for large strings. I performed a benchmarking test for 20,000 words ranging from 1 to 11 characters each split into lines of 10 character width. My method completed in 14ms compared to 1373ms for your method.
Try this (untested)
private static IEnumerable<string> SplitToLines(string value, int maximumLineLength)
{
var words = value.Split(' ');
var line = new StringBuilder();
foreach (var word in words)
{
if ((line.Length + word.Length) >= maximumLineLength)
{
yield return line.ToString();
line = new StringBuilder();
}
line.AppendFormat("{0}{1}", (line.Length>0) ? " " : "", word);
}
yield return line.ToString();
}
~6x faster than the accepted answer
More than 1.5x faster than the Regex version in Release Mode (dependent on line length)
Optionally keep the space at the end of the line or not (the regex version always keeps it)
static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength, bool removeSpace = true)
{
int start = 0;
int end = 0;
for (int i = 0; i < stringToSplit.Length; i++)
{
char c = stringToSplit[i];
if (c == ' ' || c == '\n')
{
if (i - start > maximumLineLength)
{
string substring = stringToSplit.Substring(start, end - start); ;
start = removeSpace ? end + 1 : end; // + 1 to remove the space on the next line
yield return substring;
}
else
end = i;
}
}
yield return stringToSplit.Substring(start); // remember last line
}
Here is the example code used to test speeds (again, run on your own machine and test in Release mode to get accurate timings)
https://dotnetfiddle.net/h5I1GC
Timings on my machine in release mode .Net 4.8
Accepted Answer: 667ms
Regex: 368ms
My Version: 117ms
My requirement was to have a line break at the last space before the 30 char limit.
So here is how i did it. Hope this helps anyone looking.
private string LineBreakLongString(string input)
{
var outputString = string.Empty;
var found = false;
int pos = 0;
int prev = 0;
while (!found)
{
var p = input.IndexOf(' ', pos);
{
if (pos <= 30)
{
pos++;
if (p < 30) { prev = p; }
}
else
{
found = true;
}
}
outputString = input.Substring(0, prev) + System.Environment.NewLine + input.Substring(prev, input.Length - prev).Trim();
}
return outputString;
}
An approach using recursive method and ReadOnlySpan (Tested)
public static void SplitToLines(ReadOnlySpan<char> stringToSplit, int index, ref List<string> values)
{
if (stringToSplit.IsEmpty || index < 1) return;
var nextIndex = stringToSplit.IndexOf(' ');
var slice = stringToSplit.Slice(0, nextIndex < 0 ? stringToSplit.Length : nextIndex);
if (slice.Length <= index)
{
values.Add(slice.ToString());
nextIndex++;
}
else
{
values.Add(slice.Slice(0, index).ToString());
nextIndex = index;
}
if (stringToSplit.Length <= index) return;
SplitToLines(stringToSplit.Slice(nextIndex), index, ref values);
}

C# line break every n characters

Suppose I have a string with the text: "THIS IS A TEST". How would I split it every n characters? So if n was 10, then it would display:
"THIS IS A "
"TEST"
..you get the idea. The reason is because I want to split a very big line into smaller lines, sort of like word wrap. I think I can use string.Split() for this, but I have no idea how and I'm confused.
Any help would be appreciated.
Let's borrow an implementation from my answer on code review. This inserts a line break every n characters:
public static string SpliceText(string text, int lineLength) {
return Regex.Replace(text, "(.{" + lineLength + "})", "$1" + Environment.NewLine);
}
Edit:
To return an array of strings instead:
public static string[] SpliceText(string text, int lineLength) {
return Regex.Matches(text, ".{1," + lineLength + "}").Cast<Match>().Select(m => m.Value).ToArray();
}
Maybe this can be used to handle efficiently extreme large files :
public IEnumerable<string> GetChunks(this string sourceString, int chunkLength)
{
using(var sr = new StringReader(sourceString))
{
var buffer = new char[chunkLength];
int read;
while((read= sr.Read(buffer, 0, chunkLength)) == chunkLength)
{
yield return new string(buffer, 0, read);
}
}
}
Actually, this works for any TextReader. StreamReader is the most common used TextReader. You can handle very large text files (IIS Log files, SharePoint Log files, etc) without having to load the whole file, but reading it line by line.
You should be able to use a regex for this. Here is an example:
//in this case n = 10 - adjust as needed
List<string> groups = (from Match m in Regex.Matches(str, ".{1,10}")
select m.Value).ToList();
string newString = String.Join(Environment.NewLine, lst.ToArray());
Refer to this question for details:
Splitting a string into chunks of a certain size
Probably not the most optimal way, but without regex:
string test = "my awesome line of text which will be split every n characters";
int nInterval = 10;
string res = String.Concat(test.Select((c, i) => i > 0 && (i % nInterval) == 0 ? c.ToString() + Environment.NewLine : c.ToString()));
Coming back to this after doing a code review, there's another way of doing the same without using Regex
public static IEnumerable<string> SplitText(string text, int length)
{
for (int i = 0; i < text.Length; i += length)
{
yield return text.Substring(i, Math.Min(length, text.Length - i));
}
}
Some code that I just wrote:
string[] SplitByLength(string line, int len, int IsB64=0) {
int i;
if (IsB64 == 1) {
// Only Allow Base64 Line Lengths without '=' padding
int mod64 = (len % 4);
if (mod64 != 0) {
len = len + (4 - mod64);
}
}
int parts = line.Length / len;
int frac = line.Length % len;
int extra = 0;
if (frac != 0) {
extra = 1;
}
string[] oline = new string[parts + extra];
for(i=0; i < parts; i++) {
oline[i] = line.Substring(0, len);
line = line.Substring(len);
}
if (extra == 1) {
oline[i] = line;
}
return oline;
}
string CRSplitByLength(string line, int len, int IsB64 = 0)
{
string[] lines = SplitByLength(line, len, IsB64);
return string.Join(System.Environment.NewLine, lines);
}
string m = "1234567890abcdefghijklmnopqrstuvwxhyz";
string[] r = SplitByLength(m, 6, 0);
foreach (string item in r) {
Console.WriteLine("{0}", item);
}

C# string comparison ignoring spaces, carriage return or line breaks

How can I compare 2 strings in C# ignoring the case, spaces and any line-breaks. I also need to check if both strings are null then they are marked as same.
Thanks!
You should normalize each string by removing the characters that you don't want to compare and then you can perform a String.Equals with a StringComparison that ignores case.
Something like this:
string s1 = "HeLLo wOrld!";
string s2 = "Hello\n WORLd!";
string normalized1 = Regex.Replace(s1, #"\s", "");
string normalized2 = Regex.Replace(s2, #"\s", "");
bool stringEquals = String.Equals(
normalized1,
normalized2,
StringComparison.OrdinalIgnoreCase);
Console.WriteLine(stringEquals);
Here Regex.Replace is used first to remove all whitespace characters. The special case of both strings being null is not treated here but you can easily handle that case before performing the string normalization.
This may also work.
String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreCase | CompareOptions.IgnoreSymbols) == 0
Edit:
IgnoreSymbols: Indicates that the string comparison must ignore symbols, such as
white-space characters, punctuation, currency symbols, the percent
sign, mathematical symbols, the ampersand, and so on.
Remove all the characters you don't want and then use the ToLower() method to ignore case.
edit: While the above works, it's better to use StringComparison.OrdinalIgnoreCase. Just pass it as the second argument to the Equals method.
First replace all whitespace via regular expression from both string and then use the String.Compare method with parameter ignoreCase = true.
string a = System.Text.RegularExpressions.Regex.Replace("void foo", #"\s", "");
string b = System.Text.RegularExpressions.Regex.Replace("voidFoo", #"\s", "");
bool isTheSame = String.Compare(a, b, true) == 0;
If you need performance, the Regex solutions on this page run too slow for you. Maybe you have a large list of strings you want to sort. (A Regex solution is more readable however)
I have a class that looks at each individual char in both strings and compares them while ignoring case and whitespace. It doesn't allocate any new strings. It uses the char.IsWhiteSpace(ch) to determine whitespace, and char.ToLowerInvariant(ch) for case-insensitivity (if required). In my testing, my solution runs about 5x - 8x faster than a Regex-based solution. My class also implements IEqualityComparer's GetHashCode(obj) method using this code in another SO answer. This GetHashCode(obj) also ignores whitespace and optionally ignores case.
Here's my class:
private class StringCompIgnoreWhiteSpace : IEqualityComparer<string>
{
public bool Equals(string strx, string stry)
{
if (strx == null) //stry may contain only whitespace
return string.IsNullOrWhiteSpace(stry);
else if (stry == null) //strx may contain only whitespace
return string.IsNullOrWhiteSpace(strx);
int ix = 0, iy = 0;
for (; ix < strx.Length && iy < stry.Length; ix++, iy++)
{
char chx = strx[ix];
char chy = stry[iy];
//ignore whitespace in strx
while (char.IsWhiteSpace(chx) && ix < strx.Length)
{
ix++;
chx = strx[ix];
}
//ignore whitespace in stry
while (char.IsWhiteSpace(chy) && iy < stry.Length)
{
iy++;
chy = stry[iy];
}
if (ix == strx.Length && iy != stry.Length)
{ //end of strx, so check if the rest of stry is whitespace
for (int iiy = iy + 1; iiy < stry.Length; iiy++)
{
if (!char.IsWhiteSpace(stry[iiy]))
return false;
}
return true;
}
if (ix != strx.Length && iy == stry.Length)
{ //end of stry, so check if the rest of strx is whitespace
for (int iix = ix + 1; iix < strx.Length; iix++)
{
if (!char.IsWhiteSpace(strx[iix]))
return false;
}
return true;
}
//The current chars are not whitespace, so check that they're equal (case-insensitive)
//Remove the following two lines to make the comparison case-sensitive.
chx = char.ToLowerInvariant(chx);
chy = char.ToLowerInvariant(chy);
if (chx != chy)
return false;
}
//If strx has more chars than stry
for (; ix < strx.Length; ix++)
{
if (!char.IsWhiteSpace(strx[ix]))
return false;
}
//If stry has more chars than strx
for (; iy < stry.Length; iy++)
{
if (!char.IsWhiteSpace(stry[iy]))
return false;
}
return true;
}
public int GetHashCode(string obj)
{
if (obj == null)
return 0;
int hash = 17;
unchecked // Overflow is fine, just wrap
{
for (int i = 0; i < obj.Length; i++)
{
char ch = obj[i];
if(!char.IsWhiteSpace(ch))
//use this line for case-insensitivity
hash = hash * 23 + char.ToLowerInvariant(ch).GetHashCode();
//use this line for case-sensitivity
//hash = hash * 23 + ch.GetHashCode();
}
}
return hash;
}
}
private static void TestComp()
{
var comp = new StringCompIgnoreWhiteSpace();
Console.WriteLine(comp.Equals("abcd", "abcd")); //true
Console.WriteLine(comp.Equals("abCd", "Abcd")); //true
Console.WriteLine(comp.Equals("ab Cd", "Ab\n\r\tcd ")); //true
Console.WriteLine(comp.Equals(" ab Cd", " A b" + Environment.NewLine + "cd ")); //true
Console.WriteLine(comp.Equals(null, " \t\n\r ")); //true
Console.WriteLine(comp.Equals(" \t\n\r ", null)); //true
Console.WriteLine(comp.Equals("abcd", "abcd h")); //false
Console.WriteLine(comp.GetHashCode(" a b c d")); //-699568861
//This is -699568861 if you #define StringCompIgnoreWhiteSpace_CASE_INSENSITIVE
// Otherwise it's -1555613149
Console.WriteLine(comp.GetHashCode("A B c \t d"));
}
Here's my testing code (with a Regex example):
private static void SpeedTest()
{
const int loop = 100000;
string first = "a bc d";
string second = "ABC D";
var compChar = new StringCompIgnoreWhiteSpace();
Stopwatch sw1 = Stopwatch.StartNew();
for (int i = 0; i < loop; i++)
{
bool equals = compChar.Equals(first, second);
}
sw1.Stop();
Console.WriteLine(string.Format("char time = {0}", sw1.Elapsed)); //char time = 00:00:00.0361159
var compRegex = new StringCompIgnoreWhiteSpaceRegex();
Stopwatch sw2 = Stopwatch.StartNew();
for (int i = 0; i < loop; i++)
{
bool equals = compRegex.Equals(first, second);
}
sw2.Stop();
Console.WriteLine(string.Format("regex time = {0}", sw2.Elapsed)); //regex time = 00:00:00.2773072
}
private class StringCompIgnoreWhiteSpaceRegex : IEqualityComparer<string>
{
public bool Equals(string strx, string stry)
{
if (strx == null)
return string.IsNullOrWhiteSpace(stry);
else if (stry == null)
return string.IsNullOrWhiteSpace(strx);
string a = System.Text.RegularExpressions.Regex.Replace(strx, #"\s", "");
string b = System.Text.RegularExpressions.Regex.Replace(stry, #"\s", "");
return String.Compare(a, b, true) == 0;
}
public int GetHashCode(string obj)
{
if (obj == null)
return 0;
string a = System.Text.RegularExpressions.Regex.Replace(obj, #"\s", "");
return a.GetHashCode();
}
}
I would probably start by removing the characters you don't want to compare from the string before comparing. If performance is a concern, you might look at storing a version of each string with the characters already removed.
Alternatively, you could write a compare routine that would skip over the characters you want to ignore. But that just seems like more work to me.
You can also use the following custom function
public static string ExceptChars(this string str, IEnumerable<char> toExclude)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.Length; i++)
{
char c = str[i];
if (!toExclude.Contains(c))
sb.Append(c);
}
return sb.ToString();
}
public static bool SpaceCaseInsenstiveComparision(this string stringa, string stringb)
{
return (stringa==null&&stringb==null)||stringa.ToLower().ExceptChars(new[] { ' ', '\t', '\n', '\r' }).Equals(stringb.ToLower().ExceptChars(new[] { ' ', '\t', '\n', '\r' }));
}
And then use it following way
"Te st".SpaceCaseInsenstiveComparision("Te st");
Another option is the LINQ SequenceEquals method which according to my tests is more than twice as fast as the Regex approach used in other answers and very easy to read and maintain.
public static bool Equals_Linq(string s1, string s2)
{
return Enumerable.SequenceEqual(
s1.Where(c => !char.IsWhiteSpace(c)).Select(char.ToUpperInvariant),
s2.Where(c => !char.IsWhiteSpace(c)).Select(char.ToUpperInvariant));
}
public static bool Equals_Regex(string s1, string s2)
{
return string.Equals(
Regex.Replace(s1, #"\s", ""),
Regex.Replace(s2, #"\s", ""),
StringComparison.OrdinalIgnoreCase);
}
Here the simple performance test code I used:
var s1 = "HeLLo wOrld!";
var s2 = "Hello\n WORLd!";
var watch = Stopwatch.StartNew();
for (var i = 0; i < 1000000; i++)
{
Equals_Linq(s1, s2);
}
Console.WriteLine(watch.Elapsed); // ~1.7 seconds
watch = Stopwatch.StartNew();
for (var i = 0; i < 1000000; i++)
{
Equals_Regex(s1, s2);
}
Console.WriteLine(watch.Elapsed); // ~4.6 seconds
An approach not optimized for performance, but for completeness.
normalizes null
normalizes unicode, combining characters, diacritics
normalizes new lines
normalizes white space
normalizes casing
code snippet:
public static class StringHelper
{
public static bool AreEquivalent(string source, string target)
{
if (source == null) return target == null;
if (target == null) return false;
var normForm1 = Normalize(source);
var normForm2 = Normalize(target);
return string.Equals(normForm1, normForm2);
}
private static string Normalize(string value)
{
Debug.Assert(value != null);
// normalize unicode, combining characters, diacritics
value = value.Normalize(NormalizationForm.FormC);
// normalize new lines to white space
value = value.Replace("\r\n", "\n").Replace("\r", "\n");
// normalize white space
value = Regex.Replace(value, #"\s", string.Empty);
// normalize casing
return value.ToLowerInvariant();
}
}
I would Trim the string using Trim() to remove all the
whitespace.
Use StringComparison.OrdinalIgnoreCase to ignore case sensitivity ex. stringA.Equals(stringB, StringComparison.OrdinalIgnoreCase)

Algorithm for expanding a character set?

Are there any ready-made functions for expanding a C# regex-style character set?
For example, expand("a-z1") would return a string containing all the characters a to z, followed by the number 1.
Here's what I've got so far:
public static string ExpandCharacterSet(string set)
{
var sb = new StringBuilder();
int start = 0;
while (start < set.Length - 1)
{
int dash = set.IndexOf('-', start + 1);
if (dash <= 0 || dash >= set.Length - 1)
break;
sb.Append(set.Substring(start, dash - start - 1));
char a = set[dash - 1];
char z = set[dash + 1];
for (var i = a; i <= z; ++i)
sb.Append(i);
start = dash + 2;
}
sb.Append(set.Substring(start));
return sb.ToString();
}
Is there anything I'm overlooking?
PS: Let's ignore negative character sets for now.
Thought my example was quite clear... let's try that again. This is what I want:
ExpandCharacterSet("a-fA-F0-9") == "abcdefABCDEF0123456789"
It took a bit of work to get this but here's what I was able to muster. Of course this is not going to be portable since I'm messing with internals. But it works well enough for simple test cases. It will accept any regex character class but will not work for negated classes. The range of values is way too broad without any restrictions. I don't know if it will be correct for all cases and it doesn't handle repetition at all but it's a start. At least you won't have to roll out your own parser. As of .NET Framework 4.0:
public static class RegexHelper
{
public static string ExpandCharClass(string charClass)
{
var regexParser = new RegexParser(CultureInfo.CurrentCulture);
regexParser.SetPattern(charClass);
var regexCharClass = regexParser.ScanCharClass(false);
int count = regexCharClass.RangeCount();
List<string> ranges = new List<string>();
// range 0 can be skipped
for (int i = 1; i < count; i++)
{
var range = regexCharClass.GetRangeAt(i);
ranges.Add(ExpandRange(range));
}
return String.Concat(ranges);
}
static string ExpandRange(SingleRange range)
{
char first = range._first;
char last = range._last;
return String.Concat(Enumerable.Range(first, last - first + 1).Select(i => (char)i));
}
internal class RegexParser
{
static readonly Type RegexParserType;
static readonly ConstructorInfo RegexParser_Ctor;
static readonly MethodInfo RegexParser_SetPattern;
static readonly MethodInfo RegexParser_ScanCharClass;
static RegexParser()
{
RegexParserType = Assembly.GetAssembly(typeof(Regex)).GetType("System.Text.RegularExpressions.RegexParser");
var flags = BindingFlags.NonPublic | BindingFlags.Instance;
RegexParser_Ctor = RegexParserType.GetConstructor(flags, null, new[] { typeof(CultureInfo) }, null);
RegexParser_SetPattern = RegexParserType.GetMethod("SetPattern", flags, null, new[] { typeof(String) }, null);
RegexParser_ScanCharClass = RegexParserType.GetMethod("ScanCharClass", flags, null, new[] { typeof(Boolean) }, null);
}
private readonly object instance;
internal RegexParser(CultureInfo culture)
{
instance = RegexParser_Ctor.Invoke(new object[] { culture });
}
internal void SetPattern(string pattern)
{
RegexParser_SetPattern.Invoke(instance, new object[] { pattern });
}
internal RegexCharClass ScanCharClass(bool caseInsensitive)
{
return new RegexCharClass(RegexParser_ScanCharClass.Invoke(instance, new object[] { caseInsensitive }));
}
}
internal class RegexCharClass
{
static readonly Type RegexCharClassType;
static readonly MethodInfo RegexCharClass_RangeCount;
static readonly MethodInfo RegexCharClass_GetRangeAt;
static RegexCharClass()
{
RegexCharClassType = Assembly.GetAssembly(typeof(Regex)).GetType("System.Text.RegularExpressions.RegexCharClass");
var flags = BindingFlags.NonPublic | BindingFlags.Instance;
RegexCharClass_RangeCount = RegexCharClassType.GetMethod("RangeCount", flags, null, new Type[] { }, null);
RegexCharClass_GetRangeAt = RegexCharClassType.GetMethod("GetRangeAt", flags, null, new[] { typeof(Int32) }, null);
}
private readonly object instance;
internal RegexCharClass(object regexCharClass)
{
if (regexCharClass == null)
throw new ArgumentNullException("regexCharClass");
if (regexCharClass.GetType() != RegexCharClassType)
throw new ArgumentException("not an instance of a RegexCharClass object", "regexCharClass");
instance = regexCharClass;
}
internal int RangeCount()
{
return (int)RegexCharClass_RangeCount.Invoke(instance, new object[] { });
}
internal SingleRange GetRangeAt(int i)
{
return new SingleRange(RegexCharClass_GetRangeAt.Invoke(instance, new object[] { i }));
}
}
internal struct SingleRange
{
static readonly Type RegexCharClassSingleRangeType;
static readonly FieldInfo SingleRange_first;
static readonly FieldInfo SingleRange_last;
static SingleRange()
{
RegexCharClassSingleRangeType = Assembly.GetAssembly(typeof(Regex)).GetType("System.Text.RegularExpressions.RegexCharClass+SingleRange");
var flags = BindingFlags.NonPublic | BindingFlags.Instance;
SingleRange_first = RegexCharClassSingleRangeType.GetField("_first", flags);
SingleRange_last = RegexCharClassSingleRangeType.GetField("_last", flags);
}
internal char _first;
internal char _last;
internal SingleRange(object singleRange)
{
if (singleRange == null)
throw new ArgumentNullException("singleRange");
if (singleRange.GetType() != RegexCharClassSingleRangeType)
throw new ArgumentException("not an instance of a SingleRange object", "singleRange");
_first = (char)SingleRange_first.GetValue(singleRange);
_last = (char)SingleRange_last.GetValue(singleRange);
}
}
}
// usage:
RegexHelper.ExpandCharClass(#"[\-a-zA-F1 5-9]");
// "-abcdefghijklmnopqrstuvwxyzABCDEF1 56789"
Seems like a pretty unusual requirement, but since there are only about 96 characters that you can match (unless you include high chars), you might as well just test your regular expression against all of them, and output the matches:
public static string expando(string input_re) {
// add more chars in s as needed, such as ,.?/|=+_-éñ etc.
string s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
string output = "";
Regex exp = new Regex(input_re);
for (int i = 0; i < s.Length; i++) {
if (exp.IsMatch(s.Substring(i, 1))) {
output += s[i];
}
}
return output;
}
By using an actual regex to determine your character class, you can expand whatever regular expression you want, [^A-B]|[0123a-cg-h], for example.
Something like this?
var input = "a-fA-F0-9!";
var matches = Regex.Matches(input,#".-.|.");
var list = new StringBuilder();
foreach (Match m in matches)
{
var value = m.Value;
if (value.Length == 1)
list.Append(value);
else
{
if (value[2] < value[0]) throw new ArgumentException("invalid format"); // or switch, if you want.
for (char c = value[0]; c <= value[2]; c++)
list.Append(c);
}
}
Console.WriteLine(list);
Output:
abcdefABCDEF0123456789!
The moral, of course, is to solve your regex problems with more regex!
Here's a version with support for escape characters. It all depends how robust you want it to be... for example, I don't do anything special here to handle surrogates, so that probably won't work. Also, if you're trying to match the performance of a current regex engine exactly you'll need to know exactly what all the parameters are, which would be a fairly big job.
void Main()
{
//these are all equivalent:
var input = #"\x41-\0x46\u41";
var input2 = #"\65-\70\65";
var input3 = "A-FA";
// match hex as \0x123 or \x123 or \u123, or decimal \412, or the escapes \n\t\r, or any character
var charRegex = #"(\\(0?x|u)[0-9a-fA-F]+|\\[0-9]+|\\[ntr]|.)";
var matches = Regex.Matches(input, charRegex + "-" + charRegex + "|" + charRegex);
var list = new StringBuilder();
foreach (Match m in matches)
{
var dashIndex = m.Value.IndexOf('-', 1); //don't look at 0 (in case it's a dash)
if (dashIndex > 0) // this means we have two items: a range
{
var charLeft = Decode(m.Value.Substring(0,dashIndex));
var charRight = Decode(m.Value.Substring(dashIndex+1));
if (charRight < charLeft) throw new ArgumentException("invalid format (left bigger than right)"); // or switch, if you want.
for (char c = charLeft; c <= charRight; c++)
list.Append(c);
}
else // just one item
{
list.Append(Decode(m.Value));
}
}
Console.WriteLine(list);
}
char Decode(string s)
{
if (s.Length == 1)
return s[0];
// here, s[0] == '\', because of the regex
if (s.Length == 2)
switch (s[1])
{
// incomplete; add more as wished
case 'n': return '\n';
case 't': return '\t';
case 'r': return '\r';
default: break;
}
if (s[1] == 'u' || s[1] == 'x')
return (char)Convert.ToUInt16(s.Substring(2), 16);
else if (s.Length > 2 && s[1] == '0' && s[2] == 'x')
return (char)Convert.ToUInt16(s.Substring(3), 16);
else
return (char)Convert.ToUInt16(s.Substring(1)); // will fail from here if invalid escape (e.g. \g)
}
private static readonly IEnumerable<char> CharacterSet = Enumerable.Range(0, char.MaxValue + 1).Select(Convert.ToChar).Where(c => !char.IsControl(c));
public static string ExpandCharacterSet(string set)
{
var sb = new StringBuilder();
int start = 0;
bool invertSet = false;
if (set.Length == 0)
return "";
if (set[0] == '[' && set[set.Length - 1] == ']')
set = set.Substring(1, set.Length - 2);
if (set[0] == '^')
{
invertSet = true;
set = set.Substring(1);
}
while (start < set.Length - 1)
{
int dash = set.IndexOf('-', start + 1);
if (dash <= 0 || dash >= set.Length - 1)
break;
sb.Append(set.Substring(start, dash - start - 1));
char a = set[dash - 1];
char z = set[dash + 1];
for (var i = a; i <= z; ++i)
sb.Append(i);
start = dash + 2;
}
sb.Append(set.Substring(start));
if (!invertSet) return sb.ToString();
var A = new HashSet<char>(CharacterSet);
var B = new HashSet<char>(sb.ToString());
A.ExceptWith(B);
return new string(A.ToArray());
}

Building a smart string trimming function in C#

I am attempting to build a string extension method to trim a string to a certain length but with not breaking a word. I wanted to check to see if there was anything built into the framework or a more clever method than mine. Here's mine so far (not thoroughly tested):
public static string SmartTrim(this string s, int length)
{
StringBuilder result = new StringBuilder();
if (length >= 0)
{
if (s.IndexOf(' ') > 0)
{
string[] words = s.Split(' ');
int index = 0;
while (index < words.Length - 1 && result.Length + words[index + 1].Length <= length)
{
result.Append(words[index]);
result.Append(" ");
index++;
}
if (result.Length > 0)
{
result.Remove(result.Length - 1, 1);
}
}
else
{
result.Append(s.Substring(0, length));
}
}
else
{
throw new ArgumentOutOfRangeException("length", "Value cannot be negative.");
}
return result.ToString();
}
I'd use string.LastIndexOf - at least if we only care about spaces. Then there's no need to create any intermediate strings...
As yet untested:
public static string SmartTrim(this string text, int length)
{
if (text == null)
{
throw new ArgumentNullException("text");
}
if (length < 0)
{
throw new ArgumentOutOfRangeException();
}
if (text.Length <= length)
{
return text;
}
int lastSpaceBeforeMax = text.LastIndexOf(' ', length);
if (lastSpaceBeforeMax == -1)
{
// Perhaps define a strategy here? Could return empty string,
// or the original
throw new ArgumentException("Unable to trim word");
}
return text.Substring(0, lastSpaceBeforeMax);
}
Test code:
public class Test
{
static void Main()
{
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(20));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(3));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(4));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(5));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(7));
}
}
Results:
'foo bar baz'
'foo'
'foo'
'foo'
'foo bar'
How about a Regex based solution ? You will probably want to test some more, and do some bounds checking; but this is what spring to my mind:
using System;
using System.Text.RegularExpressions;
namespace Stackoverflow.Test
{
static class Test
{
private static readonly Regex regWords = new Regex("\\w+", RegexOptions.Compiled);
static void Main()
{
Console.WriteLine("The quick brown fox jumped over the lazy dog".SmartTrim(8));
Console.WriteLine("The quick brown fox jumped over the lazy dog".SmartTrim(20));
Console.WriteLine("Hello, I am attempting to build a string extension method to trim a string to a certain length but with not breaking a word. I wanted to check to see if there was anything built into the framework or a more clever method than mine".SmartTrim(100));
}
public static string SmartTrim(this string s, int length)
{
var matches = regWords.Matches(s);
foreach (Match match in matches)
{
if (match.Index + match.Length > length)
{
int ln = match.Index + match.Length > s.Length ? s.Length : match.Index + match.Length;
return s.Substring(0, ln);
}
}
return s;
}
}
}
Try this out. It's null-safe, won't break if length is longer than the string, and involves less string manipulation.
Edit: Per recommendations, I've removed the intermediate string. I'll leave the answer up as it could be useful in cases where exceptions are not wanted.
public static string SmartTrim(this string s, int length)
{
if(s == null || length < 0 || s.Length <= length)
return s;
// Edit a' la Jon Skeet. Removes unnecessary intermediate string. Thanks!
// string temp = s.Length > length + 1 ? s.Remove(length+1) : s;
int lastSpace = s.LastIndexOf(' ', length + 1);
return lastSpace < 0 ? string.Empty : s.Remove(lastSpace);
}
string strTemp = "How are you doing today";
int nLength = 12;
strTemp = strTemp.Substring(0, strTemp.Substring(0, nLength).LastIndexOf(' '));
I think that should do it. When I ran that, it ended up with "How are you".
So your function would be:
public static string SmartTrim(this string s, int length)
{
return s.Substring(0, s.Substring(0, length).LastIndexOf(' '));;
}
I would definitely add some exception handling though, such as making sure the integer length is no greater than the string length and not less than 0.
Obligatory LINQ one liner, if you only care about whitespace as word boundary:
return new String(s.TakeWhile((ch,idx) => (idx < length) || (idx >= length && !Char.IsWhiteSpace(ch))).ToArray());
Use like this
var substring = source.GetSubstring(50, new string[] { " ", "." })
This method can get a sub-string based on one or many separator characters
public static string GetSubstring(this string source, int length, params string[] options)
{
if (string.IsNullOrWhiteSpace(source))
{
return string.Empty;
}
if (source.Length <= length)
{
return source;
}
var indices =
options.Select(
separator => source.IndexOf(separator, length, StringComparison.CurrentCultureIgnoreCase))
.Where(index => index >= 0)
.ToList();
if (indices.Count > 0)
{
return source.Substring(0, indices.Min());
}
return source;
}
I'll toss in some Linq goodness even though others have answered this adequately:
public string TrimString(string s, int maxLength)
{
var pos = s.Select((c, idx) => new { Char = c, Pos = idx })
.Where(item => char.IsWhiteSpace(item.Char) && item.Pos <= maxLength)
.Select(item => item.Pos)
.SingleOrDefault();
return pos > 0 ? s.Substring(0, pos) : s;
}
I left out the parameter checking that others have merely to accentuate the important code...

Categories

Resources