C# Best way to break up a long string

This question is not related to:
Best way to break long strings in C# source code
That one is about source code; this one is about processing long output. If someone enters:
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
As a comment, it breaks the container and makes the entire page really wide. Is there any clever regexp that can say, define a maximum word length of 20 chars and then force a whitespace character?
Thanks for any help!

There's probably no need to involve regexes in something this simple. Take this extension method:
public static string Abbreviate(this string text, int length) {
if (text.Length <= length) {
return text;
}
char[] delimiters = new char[] { ' ', '.', ',', ':', ';' };
int index = text.LastIndexOfAny(delimiters, length - 3);
if (index > (length / 2)) {
return text.Substring(0, index) + "...";
}
else {
return text.Substring(0, length - 3) + "...";
}
}
If the string is short enough, it's returned as-is. Otherwise, if a "word boundary" is found in the second half of the string, it's "gracefully" cut off at that point. If not, it's cut off the hard way at just under the desired length.
If the string is cut off at all, an ellipsis ("...") is appended to it.
If you expect the string to contain non-natural-language constructs (such as URLs), you'd need to tweak this to ensure nice behavior in all circumstances. In that case working with a regex might be better.

You could try using a regular expression that uses a positive look-ahead like this:
string outputStr = Regex.Replace(inputStr, @"([\S]{20}(?=\S+))", "$1\n");
This should "insert" a line break into all words that are longer than 20 characters.
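A minimal sketch of the look-ahead approach above, assuming the goal is to insert a separator into over-long words rather than to truncate them. The names LongWordBreaker and BreakLongWords, and the separator parameter, are illustrative and not part of the original answer:

```csharp
using System.Text.RegularExpressions;

public static class LongWordBreaker
{
    // Inserts `separator` after every run of `maxWordLength` non-whitespace
    // characters that is followed by at least one more non-whitespace character.
    public static string BreakLongWords(string input, int maxWordLength, string separator = " ")
    {
        // e.g. (\S{20})(?=\S) for maxWordLength = 20
        string pattern = @"(\S{" + maxWordLength + @"})(?=\S)";

        // "${1}" re-emits the captured run; '$' in the separator is doubled
        // so that it is inserted literally.
        return Regex.Replace(input, pattern, "${1}" + separator.Replace("$", "$$"));
    }
}
```

Calling BreakLongWords(comment, 20) on a run of 45 W characters should yield chunks of 20, 20 and 5 characters separated by single spaces, while ordinary words pass through untouched.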

Yes, you can use this regex:
string pattern = @"^([\w]{1,20})$";
This regex allows no more than 20 characters to be entered:
string strRegex = @"^([\w]{1,20})$";
string strTargetString = @"asdfasfasfasdffffff";
if(Regex.IsMatch(strTargetString, strRegex))
{
//do something
}
If you need only a length constraint, you should use this regex:
^(.{1,20})$
because \w matches only alphanumeric characters and the underscore.

Related

C# - Auto-detect Escape and Argument curly braces in string.Format

starting from this C# example:
string myStringFormat = "I want to surround string {0} with {{ and }}";
string myStringArgs = "StringToSurround";
string myFinalString = string.Format(myStringFormat, myStringArgs);
I'd like to know if there is a quick and simple way to distinguish between escape character/sequence and arguments for curly braces/brackets.
The reasons why I am asking this are:
- I want to provide some logging functionality and I don't want to require users to be aware of the double curly braces/brackets escape rule
- I want to be very fast in doing this distinction for performance requirements
Currently the only solution I can think about is to scan the string looking for curly braces/brackets and do some check (number parsing) on subsequent characters. Probably regex can be helpful but I cannot find a way to use them in this scenario.
Btw, the final situation I'd like to achieve is the user being allowed to do this without getting exceptions:
string myStringFormat = "I want to surround string {0} with { and }";
string myStringArgs = "StringToSurround";
//string myFinalString = string.Format(myStringFormat, myStringArgs); throwing exception
string myFinalString = MyCustomizedStringFormat(myStringFormat, myStringArgs);
EDIT:
sorry the word "surround" was tricky and misleading, please consider this example:
string myStringFormat = "I want to append to string {0} these characters {{ and }}";
string myStringArgs = "StringToAppendTo";
string myFinalString = string.Format(myStringFormat, myStringArgs);
giving output
I want to append to string StringToAppendTo these characters { and }
Use this Regex to find the Argument substrings:
{\d+}
This regex skips {}, {1a}, etc. and only selects {1}, {11}, etc.
Now you need to handle either Arguments (replace them with their values) or the Escaped curly braces (replace them with double braces). The choice is yours and it depends on the number of occurrences of each case. (I chose to replace arguments in my code below)
Now you need to actually replace the characters. Again the choice is yours to use a StringBuilder or not. It depends on the size of your input and the number of replacements. In any case I suggest StringBuilder.
var m = Regex.Matches(input, @"{\d+}");
if (m.Count > 0)
{
// before any arg
var sb = new StringBuilder(input.Substring(0, m[0].Index));
for (int i = 0; i < m.Count; i++)
{
// arg itself: look it up by the index written between the braces
sb.Append(args[int.Parse(m[i].Value.Trim('{', '}'))]);
// from right after arg
int start = m[i].Index + m[i].Value.Length;
if (i < m.Count - 1)
{
// until next arg
int length = m[i + 1].Index - start;
sb.Append(input.Substring(start, length));
}
else
// until end of input
sb.Append(input.Substring(start));
}
}
I believe this is the most robust and cleanest way to do it, and it does not have any performance (memory or speed) issues.
Edit
If you don't have access to args[] then you can first replace {/} with {{/}} and then simply do these modifications to the code:
use this pattern: @"{{\d+}}"
write m[i].Value.Substring(1, m[i].Value.Length - 2) instead of the value taken from args
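The idea of doubling every brace that is not part of a {n} argument and then handing the result to string.Format could be sketched as follows. This is only an illustration under stated assumptions: the names LenientFormatter and Format are hypothetical, only plain numeric placeholders such as {0} are recognised (no alignment or format specifiers like {0:x}), and callers are assumed not to use the {{ }} escape rule themselves:

```csharp
using System.Text.RegularExpressions;

public static class LenientFormatter
{
    // Matches either a numeric placeholder such as {0} (digits captured in group 1)
    // or any single brace that is not part of such a placeholder.
    private static readonly Regex BraceRx = new Regex(@"\{(\d+)\}|[{}]");

    public static string Format(string format, params object[] args)
    {
        // Keep placeholders as they are; double every other brace so that
        // string.Format treats it as a literal character.
        string escaped = BraceRx.Replace(format, m =>
            m.Groups[1].Success ? m.Value : m.Value + m.Value);

        return string.Format(escaped, args);
    }
}
```

With this sketch, LenientFormatter.Format("I want to surround string {0} with { and }", "StringToSurround") produces the text the question asks for without throwing. Whether the extra regex pass is fast enough for a logging path would need measuring.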

convert non alphanumeric glyphs to unicode while preserving alphanumeric

I need to convert non alpha-numeric glyphs in a string to their unicode value, while preserving the alphanumeric characters. Is there a method to do this in C#?
As an example, I need to convert this string:
"hello world!"
To this:
"hello_x0020_world_x0021_"
To get a string that is safe to use as an XML node name, you should use XmlConvert.EncodeName.
Note that if you need to encode all non-alphanumeric characters you'd need to write it yourself as "_" is not encoded by that method.
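For reference, a short sketch of calling XmlConvert.EncodeName and reversing it with XmlConvert.DecodeName. The wrapper class XmlNameDemo is hypothetical and exists only to make the example self-contained:

```csharp
using System.Xml;

public static class XmlNameDemo
{
    // Encodes characters that are not valid in an XML name as _xHHHH_ escapes.
    public static string ToXmlName(string raw)
    {
        return XmlConvert.EncodeName(raw);
    }

    // Reverses the encoding performed by ToXmlName.
    public static string FromXmlName(string encoded)
    {
        return XmlConvert.DecodeName(encoded);
    }
}
```

For example, ToXmlName("Order Details") gives "Order_x0020_Details", and FromXmlName restores the original text. Characters that are already valid in a name, such as the underscore, are left alone, which is why this does not cover the "all non-alphanumeric characters" requirement.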
You could start with this code using LINQ Select extension method:
string str = "hello world!";
string a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
a += a.ToLower();
char[] alphabet = a.ToCharArray();
str = string.Join("",
str.Select(ch => alphabet.Contains(ch) ?
ch.ToString() : String.Format("_x{0:x4}_", (int)ch)).ToArray()
);
Now clearly it has some problems:
it does a linear search in the list of characters
it misses the digits
once digits are added, you need to decide whether the first character may be a digit (assuming yes)
the code creates a large number of strings that are immediately discarded (one per character)
alphanumeric is limited to ASCII (assuming that is OK; if not, Char.IsLetterOrDigit helps)
it does too much work for purely alphanumeric strings
The first two are easy: we can use a HashSet (O(1) Contains) initialized with the full list of characters (if any alphanumeric character is acceptable, it is more readable to use the existing method Char.IsLetterOrDigit):
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
To avoid ch.ToString() that really pointlessly produces strings for immediate GC we need to figure out how to construct string from mix of char and string. String.Join does not work because it wants strings to start with, regular new string(...) does not have option for mix of char and string. So we are left with StringBuilder that happily takes both to Append. Consider starting with initial size str.Length if most strings don't have other characters.
So for each character we just need to either builder.Append(ch) or builder.AppendFormat("_x{0:x4}_", (int)ch). To perform the iteration it is easier to just use a regular foreach, but if one really wants LINQ, Enumerable.Aggregate is the way to go.
string ReplaceNonAlphaNum(string str)
{
var builder = new StringBuilder();
foreach (var ch in str)
{
if (asciiAlphaNum.Contains(ch))
builder.Append(ch);
else
builder.AppendFormat("_x{0:x4}_", (int)ch);
}
return builder.ToString();
}
string ReplaceNonAlphaNumLinq(string str)
{
return str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
As for the last point: we don't need to do anything if there is nothing to convert, so a check like the one in "check alphanumeric characters in string in c#" would help to avoid extra strings.
Thus final version (LINQ as it is a bit shorter and fancier):
private static readonly Regex asciiAlphaNumRx = new Regex(@"^[a-zA-Z0-9]*$");
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
string ReplaceNonAlphaNumLinq(string str)
{
return asciiAlphaNumRx.IsMatch(str) ? str :
str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
Alternatively whole thing could be done with Regex - see Regex replace: Transform pattern with a custom function for starting point.
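A minimal sketch of that regex-only alternative, using the Regex.Replace overload that takes a match evaluator. The names RegexEncoder and ReplaceNonAlphaNum are illustrative, and "alphanumeric" is assumed to mean ASCII letters and digits, as in the code above:

```csharp
using System.Text.RegularExpressions;

public static class RegexEncoder
{
    // Any single character that is not an ASCII letter or digit.
    private static readonly Regex NonAlphaNumRx = new Regex("[^a-zA-Z0-9]");

    public static string ReplaceNonAlphaNum(string str)
    {
        // Each offending character is replaced by its UTF-16 code unit
        // formatted as four lowercase hex digits.
        return NonAlphaNumRx.Replace(str,
            m => string.Format("_x{0:x4}_", (int)m.Value[0]));
    }
}
```

ReplaceNonAlphaNum("hello world!") returns "hello_x0020_world_x0021_", and a purely alphanumeric input comes back unchanged because nothing matches.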

regex match partial or whole word

I am trying to figure out a regular expression which can match either the whole word or a (predefined in length, e.g first 4 chars) part of the word.
For example, if I am trying to match the word between and my offset is set to 4, then
between betwee betwe betw
are matches, but not the
bet betweenx bet12 betw123 beta
I have created an example in regex101, where I am trying (with no luck) a combination of positive lookahead (?=) and a non-word boundary \B.
I found a similar question which proposes a workaround in its accepted answer. As I understand it, it overrides the matcher somehow, to run all the possible regular expressions, based on the word and an offset.
My code has to be written in C#, so I am trying to convert the aforementioned code. As I see it, Regex.Replace (and I assume Regex.Match also) can accept delegates to override the default functionality, but I cannot make it work.
You could take the first 4 characters, and make the remaining ones optional.
Then wrap these in word boundaries and parenthesis.
So in the case of "between", it would be
@"\b(betw)(?:(e|ee|een)?)\b"
The code to achieve that would be:
public string createRegex(string word, int count)
{
var mandatory = word.Substring(0, count);
var optional = "(" + String.Join("|", Enumerable.Range(1, word.Length - count).Select(i => word.Substring(count, i))) + ")?";
var regex = @"\b(" + mandatory + ")(?:" + optional + @")\b";
return regex;
}
The code in the answer you mentioned simply builds up this:
betw|betwe|betwee|between
So all you need is to write a function that builds a string from the prefixes of the given word, starting at a given minimum length.
static String BuildRegex(String word, int min_len)
{
String toReturn = "";
for(int i = 0; i < word.Length - min_len +1; i++)
{
toReturn += word.Substring(0, min_len+i);
toReturn += "|";
}
toReturn = toReturn.Substring(0, toReturn.Length-1);
return toReturn;
}
Demo
You can use this regex
\b(bet(?:[^\s]){1,4})\b
And replace bet and the 4 dynamically like this:
public static string CreateRegex(string word, int minLen)
{
string token = word.Substring(0, minLen - 1);
string pattern = @"\b(" + token + @"(?:[^\s]){1," + minLen + @"})\b";
return pattern;
}
Here's a demo: https://regex101.com/r/lH0oL2/1
EDIT: as for the bet1122 match, you can edit the pattern this way:
\b(bet(?:[^\s0-9]){1,4})\b
If you don't want to match certain characters, just add them to the negated [] character class.
Demo: https://regex101.com/r/lH0oL2/2
For more info, see http://www.regular-expressions.info/charclass.html
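Another way to express "a mandatory prefix plus an optional, in-order remainder" is to nest optional groups, which avoids listing every alternative. The sketch below is an assumption-laden illustration rather than any of the answers' code: PrefixPattern and Build are made-up names, and the word is assumed to be at least minLen characters long:

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class PrefixPattern
{
    // Build("between", 4) returns \bbetw(?:e(?:e(?:n)?)?)?\b
    public static string Build(string word, int minLen)
    {
        var sb = new StringBuilder(@"\b");

        // The first minLen characters are mandatory.
        sb.Append(Regex.Escape(word.Substring(0, minLen)));

        // Every later character opens a group that may only match
        // if all the characters before it matched as well.
        for (int i = minLen; i < word.Length; i++)
        {
            sb.Append("(?:").Append(Regex.Escape(word[i].ToString()));
        }

        // Close the groups and make each of them optional.
        for (int i = minLen; i < word.Length; i++)
        {
            sb.Append(")?");
        }

        return sb.Append(@"\b").ToString();
    }
}
```

With Regex.IsMatch(candidate, PrefixPattern.Build("between", 4)), the candidates between, betwee, betwe and betw match, while bet, betweenx, bet12, betw123 and beta do not, because the trailing \b rejects any extra word character.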

Cut the string to be <= 80 characters AND must keep the words without cutting them

I am new to C#, but I have a requirement to cut the strings to be <= 80 characters AND they must keep the words integrity (without cutting them)
Examples
Before: I have a requirenment to cut the strings to be <= 80 characters AND must keep the words without cutting them (length=108)
After: I have a requirenment to cut the strings to be <= 80 characters AND must keep (length=77)
Before: requirenment to cut the strings to be <= 80 characters AND must keep the words without cutting them (length=99)
After: requirenment to cut the strings to be <= 80 characters AND must keep the words (length=78)
Before: I have a requirenment the strings to be <= 80 characters AND must keep the words without cutting them (length=101)
After: I have a requirenment the strings to be <= 80 characters AND must keep the words (length=80)
I want to use a regex, but I don't know anything about regexes. It would be a hassle to write the else-ifs for this.
I would appreciate if you could point me to the right article which I could use to create this expression.
this is my function that I want to cut to one line:
public String cutTitleto80(String s){
String[] words = Regex.Split(s, "\\s+");
String finalResult = "";
foreach (String word in words)
{
String tmp = finalResult + " " + word;
if (tmp.Length > 80)
{
return finalResult;
}
finalResult = tmp;
}
return finalResult;
}
Try
^(.{0,80})(?: |$)
This is a capturing greedy match which must be followed by a space or end of string. You could also use a zero-width lookahead assertion, as in
^.{0,80}(?= |$)
If you use a live test tool like http://regexhero.net/tester/ it's pretty cool, you can actually see it jump back to the word boundary as you type beyond 80 characters.
And here's one which will simply truncate at the 80th character if there are no word boundaries (spaces) to be found:
^(.{1,80}(?: |$)|.{80})
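As a rough usage sketch, the look-ahead variant can be wrapped in a one-line helper. RegexTruncator and Truncate are hypothetical names, and the input is assumed to be a single line, since . does not match a newline by default:

```csharp
using System.Text.RegularExpressions;

public static class RegexTruncator
{
    // Returns the longest prefix of at most `max` characters that is
    // followed by a space or by the end of the string.
    public static string Truncate(string s, int max)
    {
        string pattern = "^.{0," + max + "}(?= |$)";

        // When nothing matches (no space within the first max + 1 characters),
        // Match.Value is the empty string.
        return Regex.Match(s, pattern).Value;
    }
}
```

For instance, Truncate("the quick brown fox", 10) gives "the quick", and a string that already fits is returned whole. An input with no usable space comes back empty, which is the case the last pattern above handles by falling back to a hard cut.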
Here's an approach without using Regex: just split the string (however you'd like) into whatever you consider "words" to be. Then, just start concatenating them together using a StringBuilder, checking for your desired length, until you can't add the next "word". Then, just return the string that you have built up so far.
(Untested code ahead)
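public string TruncateWithPreservation(string s, int len)
{
string[] parts = s.Split(' ');
StringBuilder sb = new StringBuilder();
foreach (string part in parts)
{
// account for the separating space before every word except the first
int separator = sb.Length > 0 ? 1 : 0;
if (sb.Length + separator + part.Length > len)
break;
if (separator == 1)
sb.Append(' ');
sb.Append(part);
}
return sb.ToString();
}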
string truncatedText = text.Length > 80 ? text.Substring(0, 80) : text; // truncate to 80 characters
int lastSpace = truncatedText.LastIndexOf(' ');
if (text.Length > 80 && text[80] != ' ' && lastSpace >= 0) // the last word was cut off in the middle
truncatedText = truncatedText.Substring(0, lastSpace); // remove the cut-off word
Updated to fix issue from comments where last word could get cut off even if it is complete.
This isn't using regex but this is how I would do it:
Use String.LastIndexOf to get the last space before the 81st char.
If the 81st char is a space, then take everything up to the 80th.
If it returns a number > -1, cut it off there.
If it's -1, you-have-a-really-long-word-or-someone-messing-with-the-system, so you do whatever you like.
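The steps above could be sketched like this. WordTruncator and Truncate are illustrative names, and the hard cut used when no space is found is only one possible choice for the "do whatever you like" case:

```csharp
public static class WordTruncator
{
    public static string Truncate(string text, int max)
    {
        // Already short enough: nothing to do.
        if (text.Length <= max)
            return text;

        // The character right after the limit is a space,
        // so the first `max` characters end on a complete word.
        if (text[max] == ' ')
            return text.Substring(0, max);

        // Otherwise search backwards for the last space inside the limit.
        int cut = text.LastIndexOf(' ', max - 1);

        // No space at all: fall back to a hard cut (an assumption of this sketch).
        return cut > -1 ? text.Substring(0, cut) : text.Substring(0, max);
    }
}
```

Truncate("the quick brown fox", 10) returns "the quick", as does a limit of 9, because the tenth character is a space.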

Regular expression to split long strings in several lines

I'm not an expert in regular expressions, and today in my project I need to split a long string into several lines in order to check whether the text fits the page height.
I need a C# regular expression to split long strings into several lines at "\n" and "\r\n", keeping at most 150 characters per line. If character 150 falls in the middle of a word, the entire word should be moved to the next line.
Can any one help me?
It's actually a quite simple problem. Look for any characters up to 150, followed by a space. Since Regex is greedy by nature it will do exactly what you want it to. Replace it by the Match plus a newline:
.{0,150}(\s+|$)
Replace with
$0\r\n
See also: http://regexhero.net/tester/?id=75645133-1de2-4d8d-a29d-90fff8b2bab5
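A hedged sketch of how such a replacement might be called from C#. It deliberately varies the pattern slightly: it requires at least one character per line ({1,width} instead of {0,150}) and captures the text without its trailing whitespace, so that no empty match at the end of the input adds an extra blank line. RegexWrapper and Wrap are made-up names:

```csharp
using System.Text.RegularExpressions;

public static class RegexWrapper
{
    public static string Wrap(string text, int width)
    {
        // Up to `width` characters (greedy), followed by whitespace or the end of input.
        string pattern = "(.{1," + width + @"})(?:\s+|$)";

        // Re-emit each captured line followed by a line break,
        // then drop the break after the final line.
        return Regex.Replace(text, pattern, "$1\r\n").TrimEnd('\r', '\n');
    }
}
```

Wrap("aaa bbb ccc", 7) gives "aaa bbb" and "ccc" on separate lines. A single word longer than the width is not broken by this pattern; it simply ends up on an over-long line.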
var regex = new Regex(@".{0,150}", RegexOptions.Multiline);
var strings = regex.Replace(sourceString, "$0\r\n");
Here you go:
^.{1,150}\n
This will match the longest initial string like this.
If you just want to split a long string into lines of 150 chars, then I'm not sure why you'd need a regular expression:
private string stringSplitter(string inString)
{
int lineLength = 150;
StringBuilder sb = new StringBuilder();
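while (inString.Length > 0)
{
// the remainder already fits on one line, so emit it whole and stop
if (inString.Length <= lineLength)
{
sb.AppendLine(inString);
break;
}
var curLength = inString.Length >= lineLength ? lineLength : inString.Length;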
var lastGap = inString.Substring(0, curLength).LastIndexOfAny(new char[] {' ', '\n'});
if (lastGap == -1)
{
sb.AppendLine(inString.Substring(0, curLength));
inString = inString.Substring(curLength);
}
else
{
sb.AppendLine(inString.Substring(0, lastGap));
inString = inString.Substring(lastGap + 1);
}
}
return sb.ToString();
}
edited to account for word breaks
This code should help you. It checks the length of the current string. If it is greater than maxLength (150 in this case), it starts at the 150th character and, going backwards, finds the first whitespace character (the OP describes a word as a sequence of non-space characters). It then stores the string up to that character and starts over with the remaining string, repeating until the remainder is no longer than maxLength characters. Finally, it joins the pieces back together into a final string.
string line = "This is a really long run-on sentence that should go for longer than 150 characters and will need to be split into two lines, but only at a word boundary.";
int maxLength = 150;
string delimiter = "\r\n";
List<string> lines = new List<string>();
// As long as we still have more than 'maxLength' characters, keep splitting
while (line.Length > maxLength)
{
// Starting at this character and going backwards, find the last
// whitespace character and split the line there.
int charIndex = maxLength;
while (charIndex > 0 && !char.IsWhiteSpace(line[charIndex]))
{
charIndex--;
}
// No whitespace found: hard-split at maxLength so the loop always makes progress
if (charIndex == 0)
{
charIndex = maxLength - 1;
}
// Split the line after this character
// and continue on with the remainder
lines.Add(line.Substring(0, charIndex + 1));
line = line.Substring(charIndex + 1);
}
lines.Add(line);
// Join the list back together with delimiter ("\r\n") between each line
string final = string.Join(delimiter , lines);
// Check the results
Console.WriteLine(final);
Note: If you run this code in a console application, you may want to change "maxLength" to a smaller number so that the console doesn't wrap on you.
Note: This code does not take tab characters into account. If tabs are also included, your situation gets a bit more complicated.
Update: I fixed a bug where new lines were starting with a space.
