Building a smart string trimming function in C#

I am attempting to build a string extension method that trims a string to a certain length without breaking a word. I wanted to check whether there is anything built into the framework, or a more clever method than mine. Here's mine so far (not thoroughly tested):
public static string SmartTrim(this string s, int length)
{
StringBuilder result = new StringBuilder();
if (length >= 0)
{
if (s.IndexOf(' ') > 0)
{
string[] words = s.Split(' ');
int index = 0;
while (index < words.Length - 1 && result.Length + words[index + 1].Length <= length)
{
result.Append(words[index]);
result.Append(" ");
index++;
}
if (result.Length > 0)
{
result.Remove(result.Length - 1, 1);
}
}
else
{
result.Append(s.Substring(0, length));
}
}
else
{
throw new ArgumentOutOfRangeException("length", "Value cannot be negative.");
}
return result.ToString();
}

I'd use string.LastIndexOf - at least if we only care about spaces. Then there's no need to create any intermediate strings...
As yet untested:
public static string SmartTrim(this string text, int length)
{
if (text == null)
{
throw new ArgumentNullException("text");
}
if (length < 0)
{
throw new ArgumentOutOfRangeException();
}
if (text.Length <= length)
{
return text;
}
int lastSpaceBeforeMax = text.LastIndexOf(' ', length);
if (lastSpaceBeforeMax == -1)
{
// Perhaps define a strategy here? Could return empty string,
// or the original
throw new ArgumentException("Unable to trim word");
}
return text.Substring(0, lastSpaceBeforeMax);
}
Test code:
public class Test
{
static void Main()
{
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(20));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(3));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(4));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(5));
Console.WriteLine("'{0}'", "foo bar baz".SmartTrim(7));
}
}
Results:
'foo bar baz'
'foo'
'foo'
'foo'
'foo bar'
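The comment in the code above leaves open what to do when there is no space before the limit. Here is a minimal sketch of one way to make that choice explicit. The names SmartTrimOr and NoSpaceStrategy are made up for this illustration, and the HardCut option is my own addition rather than something suggested in the answer:

```csharp
using System;

// Hypothetical strategy enum: what to return when no space is found before the limit.
public enum NoSpaceStrategy { ReturnEmpty, ReturnOriginal, HardCut }

public static class SmartTrimSketch
{
    // Hypothetical variant of SmartTrim with a configurable fallback.
    public static string SmartTrimOr(this string text, int length,
        NoSpaceStrategy strategy = NoSpaceStrategy.ReturnEmpty)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        if (text.Length <= length) return text;

        // text.Length > length here, so 'length' is a valid start index.
        int lastSpace = text.LastIndexOf(' ', length);
        if (lastSpace >= 0) return text.Substring(0, lastSpace);

        switch (strategy)
        {
            case NoSpaceStrategy.ReturnOriginal: return text;
            case NoSpaceStrategy.HardCut: return text.Substring(0, length); // breaks the word
            default: return string.Empty;
        }
    }
}
```

With this sketch, "foo bar baz".SmartTrimOr(5) gives "foo", while "foobar".SmartTrimOr(3) gives an empty string unless another strategy is passed.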

How about a Regex-based solution? You will probably want to test it some more and add some bounds checking, but this is what springs to mind:
using System;
using System.Text.RegularExpressions;
namespace Stackoverflow.Test
{
static class Test
{
private static readonly Regex regWords = new Regex("\\w+", RegexOptions.Compiled);
static void Main()
{
Console.WriteLine("The quick brown fox jumped over the lazy dog".SmartTrim(8));
Console.WriteLine("The quick brown fox jumped over the lazy dog".SmartTrim(20));
Console.WriteLine("Hello, I am attempting to build a string extension method to trim a string to a certain length but with not breaking a word. I wanted to check to see if there was anything built into the framework or a more clever method than mine".SmartTrim(100));
}
public static string SmartTrim(this string s, int length)
{
var matches = regWords.Matches(s);
foreach (Match match in matches)
{
if (match.Index + match.Length > length)
{
int ln = match.Index + match.Length > s.Length ? s.Length : match.Index + match.Length;
return s.Substring(0, ln);
}
}
return s;
}
}
}

Try this out. It's null-safe, won't break if length is longer than the string, and involves less string manipulation.
Edit: Per recommendations, I've removed the intermediate string. I'll leave the answer up as it could be useful in cases where exceptions are not wanted.
public static string SmartTrim(this string s, int length)
{
if(s == null || length < 0 || s.Length <= length)
return s;
// Edit a' la Jon Skeet. Removes unnecessary intermediate string. Thanks!
// string temp = s.Length > length + 1 ? s.Remove(length+1) : s;
int lastSpace = s.LastIndexOf(' ', length);
return lastSpace < 0 ? string.Empty : s.Remove(lastSpace);
}

string strTemp = "How are you doing today";
int nLength = 12;
strTemp = strTemp.Substring(0, strTemp.Substring(0, nLength).LastIndexOf(' '));
I think that should do it. When I ran that, it ended up with "How are you".
So your function would be:
public static string SmartTrim(this string s, int length)
{
return s.Substring(0, s.Substring(0, length).LastIndexOf(' '));
}
I would definitely add some exception handling though, such as making sure the integer length is no greater than the string length and not less than 0.

Obligatory LINQ one liner, if you only care about whitespace as word boundary:
return new String(s.TakeWhile((ch,idx) => (idx < length) || (idx >= length && !Char.IsWhiteSpace(ch))).ToArray());

Use it like this:
var substring = source.GetSubstring(50, new string[] { " ", "." });
This method returns a substring that ends at the first of one or more separator strings found at or after the given length:
public static string GetSubstring(this string source, int length, params string[] options)
{
if (string.IsNullOrWhiteSpace(source))
{
return string.Empty;
}
if (source.Length <= length)
{
return source;
}
var indices =
options.Select(
separator => source.IndexOf(separator, length, StringComparison.CurrentCultureIgnoreCase))
.Where(index => index >= 0)
.ToList();
if (indices.Count > 0)
{
return source.Substring(0, indices.Min());
}
return source;
}

I'll toss in some LINQ goodness even though others have answered this adequately:
public string TrimString(string s, int maxLength)
{
var pos = s.Select((c, idx) => new { Char = c, Pos = idx })
.Where(item => char.IsWhiteSpace(item.Char) && item.Pos <= maxLength)
.Select(item => item.Pos)
.LastOrDefault();
return pos > 0 ? s.Substring(0, pos) : s;
}
I left out the parameter checking that others have merely to accentuate the important code...

Related

Best way to compare two strings when one is not an exact reverse of the other C#

So I have 2 string first is RJPDLLDHLDHAFASR and the second one is ASRAFLDHLDHDLRJP.
What could be the best way of comparing these two?
If one segregates this string into sub-strings then it can be observed how these 2 strings are similar. Sub strings RJP DL LDH LDH AF ASR.
That's true that I need a pattern where I should find above mentioned sub strings as a string in both the bigger strings.

So I gave this a try on my lunch break. I used the rotation method I mentioned in the comments. This seems to work but there's probably room for improvement.
// Rotate a string 1 character to the right
// ex: "abc" -> "cab"
static string RotateRight(string s)
{
return s[s.Length - 1] + s.Substring(0, s.Length - 1);
}
// Compare 2 strings using first n letters
static bool StrNCmp(string s1, string s2, int n)
{
if (n == 0 || n > s1.Length || n > s2.Length) return false;
return s1.Substring(0, n) == s2.Substring(0, n);
}
// Rotate s2 until a match with s1 is found.
// Return number of rotations or -1 if no match found
static int FindMatch(string s1, ref string s2)
{
var count = 0;
while (!StrNCmp(s1, s2, count))
{
s2 = RotateRight(s2);
count += 1;
// Gone all the way around - stop
if (count > s2.Length) return -1;
}
return count;
}
static void Main(string[] args)
{
var s1 = "RJPDLLDHLDHAFASR";
var s2 = "ASRAFLDHLDHDLRJP";
while (s1.Length != 0)
{
var count = FindMatch(s1, ref s2);
if (count == -1)
{
Console.WriteLine("FAIL");
break;
}
Console.WriteLine(s1.Substring(0, count));
// Remove matched chars
s1 = s1.Substring(count);
s2 = s2.Substring(count);
}
}
Output:
RJP
DL
LDH
LDH
AF
ASR
Version 2 using a stack. You can do it without a stack, just lop off the last char in s2, but a stack makes it easier.
static int FindMatch(string s1, Stack<char> stack)
{
string built = "";
do
{
char prev = stack.Pop();
built = prev + built;
if (s1.StartsWith(built))
{
return built.Length;
}
} while (stack.Count() > 0) ;
return -1;
}
static void Main(string[] args)
{
var s1 = "RJPDLLDHLDHAFASR";
var s2 = "ASRAFLDHLDHDLRJP";
Stack<char> stack = new Stack<char>();
foreach (var c in s2) stack.Push(c);
while (s1.Length != 0)
{
var count = FindMatch(s1, stack);
if (count == -1)
{
Console.WriteLine("FAIL");
break;
}
Console.WriteLine(s1.Substring(0, count));
// Remove matched chars
s1 = s1.Substring(count);
}
}

Take first letter in the first string. Iterate second string from the start, searching for this letter. When you find it try to match the substring made in the second string from that letter to the end with the substring of the same length in the first string. If they match remove the substrings and repeat the process. If you get two empty strings then the original strings are matched by this "reverse substring" criteria.
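The steps described above could be sketched roughly like this. The class and method names are made up for the illustration, and returning null on failure is an assumption on my part:

```csharp
using System;
using System.Collections.Generic;

public static class ReverseBlockMatcher
{
    // Hypothetical helper: returns the matched blocks in s1 order,
    // or null when the strings do not match by the "reverse substring" criterion.
    public static List<string> MatchBlocks(string s1, string s2)
    {
        var blocks = new List<string>();
        while (s1.Length > 0)
        {
            bool matched = false;
            // Search s2 from the start for the first letter of s1.
            for (int i = s2.IndexOf(s1[0]); i >= 0; i = s2.IndexOf(s1[0], i + 1))
            {
                string tail = s2.Substring(i); // from that letter to the end of s2
                if (s1.StartsWith(tail, StringComparison.Ordinal))
                {
                    blocks.Add(tail);
                    s1 = s1.Substring(tail.Length); // remove the block from the front of s1
                    s2 = s2.Substring(0, i);        // remove the block from the end of s2
                    matched = true;
                    break;
                }
            }
            if (!matched) return null;
        }
        // Both strings must be fully consumed.
        return s2.Length == 0 ? blocks : null;
    }
}
```

Because s2 is scanned from the start, the longest matching tail wins, so for the strings in the question this sketch reports RJP, DL, LDHLDH, AF, ASR (the two LDH blocks come out as one).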

How can I convert a escaped unicode to regular format unicode

I have this code to help parse the unicode for an emoji:
public string DecodeEncodedNonAsciiCharacters(string value)
{
return Regex.Replace(
value,
@"\\u(?<Value>[a-fA-F0-9]{4})",
m =>
((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
);
}
so I put my code as such
DecodeEncodedNonAsciiCharacters("\uD83C\uDFCB\uD83C\uDFFF\u200D\u2642\uFE0F");
into Console.WriteLine(); which gives me this emoji 🏋🏿‍♂️ so my question is how can I turn this
"\uD83C\uDFCB\uD83C\uDFFF\u200D\u2642\uFE0F"
into these code points?
U+1F3CB, U+1F3FF, U+200D, U+2642, U+FE0F
The code points above are from Emojipedia.org.

It seems that you want to combine each pair of surrogate characters into one UTF-32 code point:
\uD83C\uDFCB => \U0001F3CB
If it's your case, you can put it like this:
Code:
public static IEnumerable<int> CombineSurrogates(string value) {
if (null == value)
yield break; // or throw new ArgumentNullException(nameof(value));
for (int i = 0; i < value.Length; ++i) {
char current = value[i];
char next = i < value.Length - 1 ? value[i + 1] : '\0';
if (char.IsSurrogatePair(current, next)) {
yield return (char.ConvertToUtf32(current, next));
i += 1;
}
else
yield return (int)current;
}
}
public static string DecodeEncodedNonAsciiCharacters(string value) =>
string.Join(" ", CombineSurrogates(value).Select(code => $"U+{code:X4}"));
Demo:
string data = "\uD83C\uDFCB\uD83C\uDFFF\u200D\u2642\uFE0F";
// If you want codes, uncomment the line below
//int[] codes = CombineSurrogates(data).ToArray();
string result = DecodeEncodedNonAsciiCharacters(data);
Console.Write(result);
Outcome:
U+1F3CB U+1F3FF U+200D U+2642 U+FE0F
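As an alternative sketch, assuming you are on .NET Core 3.0 or later, string.EnumerateRunes() already combines surrogate pairs for you. The class and method names below are made up for the example. Note that, unlike the loop above, a lone surrogate is reported as U+FFFD (the replacement character) rather than as its raw value:

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePointSketch
{
    // Hypothetical helper: formats every Unicode scalar value in the string as U+XXXX.
    public static string ToCodePoints(string value) =>
        string.Join(" ", value.EnumerateRunes().Select(rune => $"U+{rune.Value:X4}"));
}
```

Calling CodePointSketch.ToCodePoints("\uD83C\uDFCB\uD83C\uDFFF\u200D\u2642\uFE0F") should produce the same outcome as above.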

C# Determine if a char at index is between two characters in a string

In C# how do I determine if a char at a certain index is between two characters in a string. I'm trying to do this to remove all spaces between quotes in a string.
Example syntax: isBetween(string str, int index, char start, char end)
Thanks in advance
Edit: I also need it to work if start and end are the same character
Edit2: To clarify, I need it to work not only directly between, but it needs to work for other strings like isBetween("as((sup)hello)as", 5, '(', ')')

Based on the information given, and from what I understand about your question, you want an extension method for strings.
Something like this:
public class Program
{
public static void Main()
{
var isBetween = "abc".IsBetween(1, 'a', 'c');
Console.WriteLine(isBetween); //True
}
}
public static class Extensions
{
public static bool IsBetween(this String str, int index, char start, char end)
{
var left = str[index - 1];
var right = str[index + 1];
return left == start && right == end;
}
}
The code above will check if the character at index 1 (which is b), is between two characters (a and c). This returns true.
(Note that this does not account for index out of bound exceptions)

This is what I understood from your question; I think you are looking for something like this:
public class Program {
public static void Main()
{
var isBetween = "abc".IsBetween('b', 'a', 'c', out int i);
Console.WriteLine(isBetween); //True
Console.WriteLine(i); //1
}
}
public static class Extensions {
public static bool IsBetween(this String str, char middle, char start, char end, out int index)
{
index = -1;
var left = str.IndexOf(start);
var right = str.IndexOf(end);
// The middle character must sit directly between the first start and the first end.
if (left < 0 || right < 0 || left + 1 != right - 1)
return false;
index = left + 1;
return str[index] == middle;
}
}
@ThePerplexedOne I did reuse your code.

You can do this using regular expressions:
public class Program {
public static void Main()
{
String xs = "aBc";
var x= xs.ReplaceInBetween('a', 'c', 'B', 'b'); }
}
public static class Extensions {
public static string ReplaceInBetween(this String str, char start, char end, char middle, char replacewith)
{
Regex x = new Regex($"([{start}])({middle})([{end}])");
str= x.Replace(str, "$1" + replacewith + "$3");
return str;
}
}

Well, for a public method we should validate str and index, and then check the chars at index - 1 and index + 1:
public static bool IsBetween(this String str, int index, char start, char end) {
return str != null && // not null
index >= 1 && index < str.Length - 1 && // valid index
str[index - 1] == start && // char at index is between
str[index + 1] == end;
}
However, if you want to remove some characters (say, enquoted spaces), I suggest building a new string, e.g.
// "abc " 123 456 789" pq" -> "abc "123456789" pq"
public static string RemoveQuotedSpaces(String str) {
if (string.IsNullOrEmpty(str))
return str;
StringBuilder sb = new StringBuilder(str.Length);
bool inQuotation = false;
foreach (char c in str) {
if (c == '"')
inQuotation = !inQuotation;
if (!inQuotation || c != ' ')
sb.Append(c);
}
return sb.ToString();
}

try this
public static bool IsBetween(String str, int index, char start, char end)
{
var startIndex = str.Substring(0, index).LastIndexOf(start);
var LastIndex = str.Substring(index).IndexOf(end);
if (startIndex == -1 || LastIndex == -1)
return false;
LastIndex = LastIndex + index;
return startIndex <= (index - 1) && LastIndex >= (index - 1);
}
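None of the answers above really covers the second edit in the question (nested delimiters, or start and end being the same character), so here is a rough sketch of one possible approach: count the delimiters still open before the index, then check that a matching close follows. The name IsEnclosed is made up, and treating the character at the index itself as plain content is an assumption:

```csharp
using System;

public static class BetweenSketch
{
    // Hypothetical helper: true when str[index] lies inside an open start/end pair.
    public static bool IsEnclosed(this string str, int index, char start, char end)
    {
        if (str == null || index < 0 || index >= str.Length)
            return false;

        // Count delimiters that are still open just before index.
        int depth = 0;
        for (int i = 0; i < index; i++)
        {
            char c = str[i];
            if (start == end)
            {
                if (c == start) depth = 1 - depth; // a quote toggles open/closed
            }
            else if (c == start) depth++;
            else if (c == end && depth > 0) depth--;
        }
        if (depth == 0)
            return false;

        // Look for the delimiter that closes the enclosing pair.
        int inner = 0;
        for (int i = index + 1; i < str.Length; i++)
        {
            char c = str[i];
            if (c == end)
            {
                if (start == end || inner == 0) return true;
                inner--; // closes a pair that was opened after index
            }
            else if (c == start) inner++;
        }
        return false;
    }
}
```

With this sketch, "as((sup)hello)as".IsEnclosed(5, '(', ')') is true, and for quotes the same call works with '"' passed as both start and end.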

Best way to split string into lines with maximum length, without breaking words

I want to break a string up into lines of a specified maximum length, without splitting any words, if possible (if there is a word that exceeds the maximum line length, then it will have to be split).
As always, I am acutely aware that strings are immutable and that one should preferably use the StringBuilder class. I have seen examples where the string is split into words and the lines are then built up using the StringBuilder class, but the code below seems "neater" to me.
I mentioned "best" in the description and not "most efficient" as I am also interested in the "eloquence" of the code. The strings will never be huge, generally splitting into 2 or three lines, and it won't be happening for thousands of lines.
Is the following code really bad?
private static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
stringToSplit = stringToSplit.Trim();
var lines = new List<string>();
while (stringToSplit.Length > 0)
{
if (stringToSplit.Length <= maximumLineLength)
{
lines.Add(stringToSplit);
break;
}
var indexOfLastSpaceInLine = stringToSplit.Substring(0, maximumLineLength).LastIndexOf(' ');
lines.Add(stringToSplit.Substring(0, indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine : maximumLineLength).Trim());
stringToSplit = stringToSplit.Substring(indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine + 1 : maximumLineLength);
}
return lines.ToArray();
}

Even though this post is 3 years old, I wanted to offer a better solution that uses Regex to accomplish the same thing.
If you want to split the string and then display the resulting text, you can use this:
public string SplitToLines(string stringToSplit, int maximumLineLength)
{
return Regex.Replace(stringToSplit, @"(.{1," + maximumLineLength + @"})(?:\s|$)", "$1\n");
}
If on the other hand you need a collection you can use this:
public MatchCollection SplitToLines(string stringToSplit, int maximumLineLength)
{
return Regex.Matches(stringToSplit, @"(.{1," + maximumLineLength + @"})(?:\s|$)");
}
NOTES
Remember to import regex (using System.Text.RegularExpressions;)
You can use string interpolation on the match:
$@"(.{{1,{maximumLineLength}}})(?:\s|$)"
The MatchCollection works almost like an Array
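If you would rather get plain strings than a MatchCollection, a small sketch along these lines should work. It uses the same pattern as above and takes the first capture group of each match, so the trailing whitespace is dropped. The class name is made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWrapSketch
{
    // Returns each wrapped line as a string, without the whitespace that ended it.
    public static List<string> SplitToLines(string stringToSplit, int maximumLineLength)
    {
        string pattern = $@"(.{{1,{maximumLineLength}}})(?:\s|$)";
        return Regex.Matches(stringToSplit, pattern)
            .Cast<Match>()                  // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value) // group 1 is the line itself
            .ToList();
    }
}
```

For example, splitting "The quick brown fox" with a maximum of 10 gives "The quick" and "brown fox". One caveat: a word longer than the maximum cannot match in full, so part of it is silently skipped rather than split.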

How about this as a solution:
IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
var words = stringToSplit.Split(' ').Concat(new [] { "" });
return
words
.Skip(1)
.Aggregate(
words.Take(1).ToList(),
(a, w) =>
{
var last = a.Last();
while (last.Length > maximumLineLength)
{
a[a.Count() - 1] = last.Substring(0, maximumLineLength);
last = last.Substring(maximumLineLength);
a.Add(last);
}
var test = last + " " + w;
if (test.Length > maximumLineLength)
{
a.Add(w);
}
else
{
a[a.Count() - 1] = test;
}
return a;
});
}
I reworked this, as I prefer this version:
IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
var words = stringToSplit.Split(' ');
var line = words.First();
foreach (var word in words.Skip(1))
{
var test = $"{line} {word}";
if (test.Length > maximumLineLength)
{
yield return line;
line = word;
}
else
{
line = test;
}
}
yield return line;
}

I don't think your solution is too bad. I do, however, think you should break up your ternary into an if else because you are testing the same condition twice. Your code might also have a bug. Based on your description, it seems you want lines <= maxLineLength, but your code counts the space after the last word and uses it in the <= comparison resulting in effectively < behavior for the trimmed string.
Here is my solution.
private static IEnumerable<string> SplitToLines(string stringToSplit, int maxLineLength)
{
string[] words = stringToSplit.Split(' ');
StringBuilder line = new StringBuilder();
foreach (string word in words)
{
if (word.Length + line.Length <= maxLineLength)
{
line.Append(word + " ");
}
else
{
if (line.Length > 0)
{
yield return line.ToString().Trim();
line.Clear();
}
string overflow = word;
while (overflow.Length > maxLineLength)
{
yield return overflow.Substring(0, maxLineLength);
overflow = overflow.Substring(maxLineLength);
}
line.Append(overflow + " ");
}
}
yield return line.ToString().Trim();
}
It is a bit longer than your solution, but it should be more straightforward. It also uses a StringBuilder so it is much faster for large strings. I performed a benchmarking test for 20,000 words ranging from 1 to 11 characters each split into lines of 10 character width. My method completed in 14ms compared to 1373ms for your method.

Try this (untested)
private static IEnumerable<string> SplitToLines(string value, int maximumLineLength)
{
var words = value.Split(' ');
var line = new StringBuilder();
foreach (var word in words)
{
if ((line.Length + word.Length) >= maximumLineLength)
{
yield return line.ToString();
line = new StringBuilder();
}
line.AppendFormat("{0}{1}", (line.Length>0) ? " " : "", word);
}
yield return line.ToString();
}

~6x faster than the accepted answer
More than 1.5x faster than the Regex version in Release Mode (dependent on line length)
Optionally keep the space at the end of the line or not (the regex version always keeps it)
static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength, bool removeSpace = true)
{
int start = 0;
int end = 0;
for (int i = 0; i < stringToSplit.Length; i++)
{
char c = stringToSplit[i];
if (c == ' ' || c == '\n')
{
if (i - start > maximumLineLength)
{
string substring = stringToSplit.Substring(start, end - start);
start = removeSpace ? end + 1 : end; // + 1 to remove the space on the next line
yield return substring;
}
else
end = i;
}
}
yield return stringToSplit.Substring(start); // remember last line
}
Here is the example code used to test speeds (again, run on your own machine and test in Release mode to get accurate timings)
https://dotnetfiddle.net/h5I1GC
Timings on my machine in release mode .Net 4.8
Accepted Answer: 667ms
Regex: 368ms
My Version: 117ms

My requirement was to have a line break at the last space before the 30 char limit.
So here is how I did it. Hope this helps anyone looking.
private string LineBreakLongString(string input)
{
var outputString = string.Empty;
var found = false;
int pos = 0;
int prev = 0;
while (!found)
{
var p = input.IndexOf(' ', pos);
{
if (pos <= 30)
{
pos++;
if (p < 30) { prev = p; }
}
else
{
found = true;
}
}
outputString = input.Substring(0, prev) + System.Environment.NewLine + input.Substring(prev, input.Length - prev).Trim();
}
return outputString;
}

An approach using recursive method and ReadOnlySpan (Tested)
public static void SplitToLines(ReadOnlySpan<char> stringToSplit, int index, ref List<string> values)
{
if (stringToSplit.IsEmpty || index < 1) return;
var nextIndex = stringToSplit.IndexOf(' ');
var slice = stringToSplit.Slice(0, nextIndex < 0 ? stringToSplit.Length : nextIndex);
if (slice.Length <= index)
{
values.Add(slice.ToString());
nextIndex++;
}
else
{
values.Add(slice.Slice(0, index).ToString());
nextIndex = index;
}
if (stringToSplit.Length <= index) return;
SplitToLines(stringToSplit.Slice(nextIndex), index, ref values);
}

C# line break every n characters

Suppose I have a string with the text: "THIS IS A TEST". How would I split it every n characters? So if n was 10, then it would display:
"THIS IS A "
"TEST"
..you get the idea. The reason is because I want to split a very big line into smaller lines, sort of like word wrap. I think I can use string.Split() for this, but I have no idea how and I'm confused.
Any help would be appreciated.

Let's borrow an implementation from my answer on code review. This inserts a line break every n characters:
public static string SpliceText(string text, int lineLength) {
return Regex.Replace(text, "(.{" + lineLength + "})", "$1" + Environment.NewLine);
}
Edit:
To return an array of strings instead:
public static string[] SpliceText(string text, int lineLength) {
return Regex.Matches(text, ".{1," + lineLength + "}").Cast<Match>().Select(m => m.Value).ToArray();
}

Maybe this can be used to handle extremely large files efficiently:
public static IEnumerable<string> GetChunks(this string sourceString, int chunkLength)
{
using(var sr = new StringReader(sourceString))
{
var buffer = new char[chunkLength];
int read;
// Read returns 0 at the end of the input; a shorter final chunk is still returned.
while((read = sr.Read(buffer, 0, chunkLength)) > 0)
{
yield return new string(buffer, 0, read);
}
}
}
Actually, this works for any TextReader. StreamReader is the most commonly used TextReader. You can handle very large text files (IIS log files, SharePoint log files, etc.) without loading the whole file, by reading it one chunk at a time.

You should be able to use a regex for this. Here is an example:
//in this case n = 10 - adjust as needed
List<string> groups = (from Match m in Regex.Matches(str, ".{1,10}")
select m.Value).ToList();
string newString = String.Join(Environment.NewLine, groups.ToArray());
Refer to this question for details:
Splitting a string into chunks of a certain size

Probably not the most optimal way, but without regex:
string test = "my awesome line of text which will be split every n characters";
int nInterval = 10;
string res = String.Concat(test.Select((c, i) => i > 0 && (i % nInterval) == 0 ? Environment.NewLine + c : c.ToString()));

Coming back to this after doing a code review, there's another way of doing the same without using Regex
public static IEnumerable<string> SplitText(string text, int length)
{
for (int i = 0; i < text.Length; i += length)
{
yield return text.Substring(i, Math.Min(length, text.Length - i));
}
}
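On .NET 6 or later there is also a built-in LINQ method, Enumerable.Chunk, that does the slicing for you. A minimal sketch, assuming that target framework is available (the class and method names are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkSketch
{
    // Splits text into pieces of at most n characters; the last piece may be shorter.
    public static IEnumerable<string> SplitEvery(string text, int n) =>
        text.Chunk(n).Select(chars => new string(chars));
}
```

For the example in the question, ChunkSketch.SplitEvery("THIS IS A TEST", 10) yields "THIS IS A " and "TEST". This allocates a char array per chunk, so the Substring loop above is probably still the leaner option for large inputs.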

Some code that I just wrote:
string[] SplitByLength(string line, int len, int IsB64=0) {
int i;
if (IsB64 == 1) {
// Only Allow Base64 Line Lengths without '=' padding
int mod64 = (len % 4);
if (mod64 != 0) {
len = len + (4 - mod64);
}
}
int parts = line.Length / len;
int frac = line.Length % len;
int extra = 0;
if (frac != 0) {
extra = 1;
}
string[] oline = new string[parts + extra];
for(i=0; i < parts; i++) {
oline[i] = line.Substring(0, len);
line = line.Substring(len);
}
if (extra == 1) {
oline[i] = line;
}
return oline;
}
string CRSplitByLength(string line, int len, int IsB64 = 0)
{
string[] lines = SplitByLength(line, len, IsB64);
return string.Join(System.Environment.NewLine, lines);
}
string m = "1234567890abcdefghijklmnopqrstuvwxhyz";
string[] r = SplitByLength(m, 6, 0);
foreach (string item in r) {
Console.WriteLine("{0}", item);
}
