How to remove words based on a word count

Here is what I'm trying to accomplish. I have an object coming back from
the database with a string description. This description can be up to 1000
characters long, but we only want to display a short view of it. So I coded
up the following, but I'm having trouble actually removing the extra
words once the regular expression has found the total word count. Does anyone
have a good way of displaying only the first words, up to a limit below the Regex.Matches count?
Thanks!
if (!string.IsNullOrEmpty(myObject.Description))
{
string original = myObject.Description;
MatchCollection wordColl = Regex.Matches(original, @"[\S]+");
if (wordColl.Count < 70) // 70 words?
{
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", myObject.Description);
}
else
{
string shortendText = original.Remove(200); // 200 characters?
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", shortendText);
}
}
EDIT:
So this is what I got working on my own:
else
{
int count = 0;
StringBuilder builder = new StringBuilder();
string[] workingText = original.Split(' ');
foreach (string word in workingText)
{
if (count < 70)
{
builder.AppendFormat("{0} ", word);
}
count++;
}
string shortendText = builder.ToString();
}
It's not pretty, but it worked. I would call it a pretty naive way of doing this. Thanks for all of the suggestions!

I would opt to go by a strict character count rather than a word count because you might happen to have a lot of long words.
I might do something like (pseudocode)
if text.Length > someLimit
    find first whitespace after someLimit (or perhaps last whitespace immediately before)
    display substring of text
else
    display text
Possible code implementation:
string TruncateText(string input, int characterLimit)
{
if (input.Length > characterLimit)
{
// find last whitespace immediately before limit
int whitespacePosition = input.Substring(0, characterLimit).LastIndexOf(" ");
// or find first whitespace after limit (what is spec?)
// int whitespacePosition = input.IndexOf(" ", characterLimit);
if (whitespacePosition > -1)
return input.Substring(0, whitespacePosition);
}
return input;
}

One method, if you're using at least C# 3.0, would be a LINQ query like the following. This assumes you're going strictly by word count, not character count.
if (wordColl.Count > 70)
{
foreach (var subWord in wordColl.Cast<Match>().Select(r => r.Value).Take(70))
{
//Build string here out of subWord
}
}
I did a test using a simple Console.WriteLine with your Regex and your question body (which is over 70 words, it turns out).
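The "build string here" step can be filled in with string.Join. The following is only a sketch: the class name WordLimiter and the method name FirstWords are made up for illustration, and it assumes that collapsing runs of whitespace into single spaces is acceptable.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WordLimiter
{
    // Returns the first maxWords whitespace-delimited words of text,
    // joined by single spaces (original spacing is not preserved).
    public static string FirstWords(string text, int maxWords)
    {
        var words = Regex.Matches(text, @"\S+")
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .Take(maxWords);
        return string.Join(" ", words);
    }
}
```

Because Take stops at the end of the sequence, the same call also works when the text has fewer than maxWords words, so the separate word-count check becomes optional.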

You can use Regex Capture Groups to hold the match and access it later.
For your application, I'd recommend instead simply splitting the string by spaces and returning the first n elements of the array:
if (!string.IsNullOrEmpty(myObject.Description))
{
string original = myObject.Description;
string[] words = original.Split(' ');
if (words.Length < 70)
{
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", original);
}
else
{
string shortDesc = string.Empty;
for(int i = 0; i < 70; i++) shortDesc += words[i] + " ";
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", shortDesc.Trim());
}
}

Do you want to remove 200 characters, or to truncate starting at the 200th character? When you call original.Remove(200), everything from index 200 to the end of the string is removed. To remove a specific number of characters, pass a start index and a count:
string shortendText = original.Remove(0,200);
This starts at the first character and removes 200 characters, which I imagine is not what you're trying to do, since you're shortening a description. It only illustrates the two-argument form of Remove().
Instead of using a Regex MatchCollection, why not just split the string? It's a lot easier and more straightforward. You can set the delimiter to a space character and split that way. I'm not sure what the data in your description looks like, so this may not completely meet your need, but you can split like this:
String[] wordArray = original.Split(' ');
From there you can determine the word count with wordArray's Length property value.

If I were you I would go by characters, as you may have many one-letter words or many long words in your text.
Go through the text until you reach your character limit, then either find the next space and copy everything up to it into a new string (possibly using the Substring method), or cut at the limit and append a few full stops. The latter could look unprofessional, I suppose.
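A rough sketch of the first option, combined with the trailing full stops, might look like this. The names Excerpt and AtNextSpace are hypothetical, and note that the result can run past the limit because the cut happens at the next space after it.

```csharp
public static class Excerpt
{
    // Cuts text at the first space at or after limit and appends "..."
    // when something was actually removed.
    public static string AtNextSpace(string text, int limit)
    {
        if (text.Length <= limit) return text;

        int cut = text.IndexOf(' ', limit);
        if (cut < 0) return text; // no later space: the last word runs to the end

        return text.Substring(0, cut) + "...";
    }
}
```

For example, Excerpt.AtNextSpace("the quick brown fox", 6) returns "the quick...".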

Related

Reverse the words in a string without using builtin functions like split and substring in C#

I wrote the following:
class Program
{
public static void Main(string[] args)
{
string name = "Hello World";
StringBuilder builder = new StringBuilder();
for (int i = name.Length - 1; i >= 0; i--)
{
builder.Append(name[i]);
}
string newName = builder.ToString();
Console.WriteLine(newName);
}
}
I am getting "dlroW olleH" as output. I want "World Hello".

I wrote a quick solution which uses 2 loops.
The first loop is iterating over the string from left to right.
When it encounters a whitespace (which delimits a word) the second loop starts.
The second loop is looping through the word from right to left. The second loop can not go from left to right because we have no idea where the word will begin (unless we would remember where we met the previous whitespace). Hence the second loop iterates from right to left through the string until it encounters a whitespace (beginning of the word), or until the index becomes 0 (which is a corner case). The same corner case can be observed in the first loop.
Here it is :
String name = "Hello World";
StringBuilder builder = new StringBuilder();
for (int i = 0; i < name.length(); i++)
{
char tmp = name.charAt(i);
if(tmp == ' ' || i == name.length() - 1) {
// Found a whitespace, what is preceding must be a word
builder.insert(0, ' ');
for(int j = i - 1; j >= 0; j--) {
if(i == name.length() - 1 && j == i - 1) {
// Exceptional case (necessary because after the last word there is no whitespace)
builder.insert(0, name.charAt(i));
}
char curr = name.charAt(j);
if(curr == ' ') {
// We passed the beginning of the word
break;
}
else if(j == 0) {
builder.insert(0, curr);
break;
}
//builder.append(curr);
builder.insert(0, curr);
}
}
}
String newName = builder.toString();
System.out.println(newName);
Note that this is java code but it should be straightforward to translate it to C#.

In pseudo code, a solution to this problem might look something like this:
result = empty string
last = input.length
for i from input.length - 1 to 0
    if input[i] is whitespace or i is 0
        for j from i to last - 1
            append input[j] to result
        last = i
This starts at the back of the string and loops backwards. When it finds a whitespace character (or gets to the beginning of the string), it knows it has found a complete word and appends it to the result. The variable last keeps track of where the most recently added word started.
You will have to tweak this a bit to get the spaces between the words right.
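One possible C# rendering of the pseudocode above, with the spacing tweak applied, is sketched below. The class name WordReverser is made up, only spaces are treated as separators, and runs of spaces collapse into one; it uses nothing but the string indexer and a StringBuilder.

```csharp
using System.Text;

public static class WordReverser
{
    // Walks the input from the end; each time a space (or index 0) is reached,
    // the word that starts there is copied into the result.
    public static string ReverseWords(string input)
    {
        var result = new StringBuilder();
        int last = input.Length; // one past the end of the word being scanned

        for (int i = input.Length - 1; i >= 0; i--)
        {
            if (input[i] == ' ' || i == 0)
            {
                int start = input[i] == ' ' ? i + 1 : i; // skip the space itself
                if (start < last)
                {
                    if (result.Length > 0) result.Append(' ');
                    for (int j = start; j < last; j++) result.Append(input[j]);
                }
                last = i;
            }
        }
        return result.ToString();
    }
}
```

WordReverser.ReverseWords("Hello World") returns "World Hello", without the trailing space.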

Just because you can't use "built-in" functions, doesn't mean you can't write your own. What you should be learning from a task like this is how to break a problem down into a set of sub problems.
string Reverse(string pInput) {
// iterate backwards over the input, assemble a new string, and return it
}
List<string> Split(string pInput, char pSplitOn) {
// iterate forwards over the input
// build up a new string until the split char is found
// when it is found, add the current string to a list
// and start the building string over at empty
// maybe only add strings to the list if they aren't empty
// (although that wouldn't preserve extra whitespace, which you may want)
// make sure to add the end of the string since it probably
// doesn't end with the split char
// return the list
}
string Join(List<string> pWords, char pSeparator) {
// build up a new string consisting of each of the words separated by the separator
}
string ReverseWords(string pInput) {
// reverse the whole input, which also reverses the order of the words
// split the reversed input on a space
// for each "word" in the resulting list, reverse it to restore its spelling
// join the words back together into one string, with spaces separating them
}
This assumes that the only whitespace that you will encounter consists of spaces. Also assumes you're allowed to use List<T>.
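A filled-in version of these helpers could look roughly like the following. It is a sketch, not the only way to do it: the wrapper class WordTools is made up, empty pieces are dropped when splitting (so extra spaces are not preserved), and ReverseWords reverses the whole input first and then each word, which leaves the words in reverse order with their spelling intact.

```csharp
using System.Collections.Generic;
using System.Text;

public static class WordTools
{
    // Builds a new string by walking the input backwards.
    public static string Reverse(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = input.Length - 1; i >= 0; i--) sb.Append(input[i]);
        return sb.ToString();
    }

    // Collects the pieces between occurrences of splitOn, skipping empty pieces.
    public static List<string> Split(string input, char splitOn)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        foreach (char c in input)
        {
            if (c == splitOn)
            {
                if (current.Length > 0) parts.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        if (current.Length > 0) parts.Add(current.ToString()); // trailing word
        return parts;
    }

    // Concatenates the words with the separator between neighbours.
    public static string Join(List<string> words, char separator)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < words.Count; i++)
        {
            if (i > 0) sb.Append(separator);
            sb.Append(words[i]);
        }
        return sb.ToString();
    }

    // "Hello World" -> "dlroW olleH" -> ["dlroW", "olleH"] -> "World Hello"
    public static string ReverseWords(string input)
    {
        var words = new List<string>();
        foreach (string word in Split(Reverse(input), ' '))
            words.Add(Reverse(word));
        return Join(words, ' ');
    }
}
```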

string.IndexOf search for whole word match

I am seeking a way to search a string for an exact match or whole word match. RegEx.Match and RegEx.IsMatch don't seem to get me where I want to be. Consider the following scenario:
namespace test
{
class Program
{
static void Main(string[] args)
{
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
int indx = str.IndexOf("TOTAL");
string amount = str.Substring(indx + "TOTAL".Length, 10);
string strAmount = Regex.Replace(amount, "[^.0-9]", "");
Console.WriteLine(strAmount);
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
}
}
The output of the above code is:
// 34.37
// Press any key to continue...
The problem is, I don't want SUBTOTAL, but IndexOf finds the first occurrence of the word TOTAL which is in SUBTOTAL which then yields the incorrect value of 34.37.
So the question is, is there a way to force IndexOf to find only an exact match or is there another way to force that exact whole word match so that I can find the index of that exact match and then perform some useful function with it. RegEx.IsMatch and RegEx.Match are, as far as I can tell, simply boolean searches. In this case, it isn't enough to just know the exact match exists. I need to know where it exists in the string.
Any advice would be appreciated.

You can use Regex
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var indx = Regex.Match(str, @"\WTOTAL\W").Index; // 18: the index of the space before TOTAL; the word itself starts at 19

My method is faster than the accepted answer because it does not use Regex.
string str = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var indx = str.IndexOfWholeWord("TOTAL");
public static int IndexOfWholeWord(this string str, string word)
{
for (int j = 0; j < str.Length &&
(j = str.IndexOf(word, j, StringComparison.Ordinal)) >= 0; j++)
if ((j == 0 || !char.IsLetterOrDigit(str, j - 1)) &&
(j + word.Length == str.Length || !char.IsLetterOrDigit(str, j + word.Length)))
return j;
return -1;
}

You can use word boundaries, \b, and the Match.Index property:
var text = "SUBTOTAL 34.37 TAX TOTAL 37.43";
var idx = Regex.Match(text, @"\bTOTAL\b").Index;
// => 19
See the C# demo.
The \bTOTAL\b matches TOTAL when it is not enclosed with any other letters, digits or underscores.
If you need to count a word as a whole word when it is enclosed with underscores, use
var idx = Regex.Match(text, @"(?<![^\W_])TOTAL(?![^\W_])").Index;
where (?<![^\W_]) is a negative lookbehind that fails the match if there is a letter or digit immediately to the left of the current location (so there can be the start of the string, or a character that is neither a letter nor a digit), and (?![^\W_]) is a similar negative lookahead that only matches if there is the end of the string or a character other than a letter or digit immediately to the right of the current location.
If the boundaries are whitespaces or start/end of string use
var idx = Regex.Match(text, @"(?<!\S)TOTAL(?!\S)").Index;
where (?<!\S) requires start of string or a whitespace immediately on the left, and (?!\S) requires the end of string or a whitespace on the right.
NOTE: \b, (?<!...) and (?!...) are non-consuming patterns, that is the regex index does not advance when matching these patterns, thus, you get the exact positions of the word you search for.
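To see how the three boundary styles differ, a small helper such as the one below can be used. BoundaryDemo and IndexOfPattern are illustrative names, not part of any library; the helper returns -1 when nothing matches, because Match.Index on a failed match is 0.

```csharp
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Index of the first match of pattern in text, or -1 when there is none.
    public static int IndexOfPattern(string text, string pattern)
    {
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Index : -1;
    }
}
```

With the text "SUB_TOTAL 1 TOTAL 2", the pattern \bTOTAL\b gives 12 (the underscore is a word character, so there is no boundary inside SUB_TOTAL), (?<![^\W_])TOTAL(?![^\W_]) gives 4 (the underscore counts as a separator), and (?<!\S)TOTAL(?!\S) gives 12 again. With the original "SUBTOTAL 34.37 TAX TOTAL 37.43", all three give 19.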

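To make the accepted answer a little safer (Match.Index is 0 when nothing matches, whereas IndexOf returns -1 for no match):
string pattern = String.Format(@"\b{0}\b", Regex.Escape(findTxt)); // escape any regex metacharacters in the search text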
Match mtc = Regex.Match(queryTxt, pattern);
if (mtc.Success)
{
return mtc.Index;
}
else
return -1;

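While this may be a hack that works only for your example, try searching for the word with a leading space:
int indx = str.IndexOf(" TOTAL");
string amount = str.Substring(indx + " TOTAL".Length);
Because " TOTAL" (with the extra space) does not occur inside SUBTOTAL, the search skips the word you don't want and finds only an isolated TOTAL. The Substring call takes the rest of the string, since fewer than 10 characters follow the final TOTAL in this example.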

I'd recommend the Regex solution from L.B. too, but if you can't use Regex, then you could use String.LastIndexOf("TOTAL"). Assuming the TOTAL always comes after SUBTOTAL?
http://msdn.microsoft.com/en-us/library/system.string.lastindexof(v=vs.110).aspx

Remove a single space from a string

"How do I do this? "
Let's say I have this string. How do I remove only one space from the end? The code shown below gives me an error saying the count is out of range.
string s = "How do I do this? ";
s = s.Remove(s.Length, 1);

You just have to use this instead :
string s = "How do I do this? ";
s = s.Remove(s.Length-1, 1);
As stated here:
Remove(Int32) Returns a new string in which all the characters in the current
instance, beginning at a specified position and continuing through the
last position, have been deleted.
In a string, positions range from 0 to Length - 1, so starting at s.Length is out of range and throws an ArgumentOutOfRangeException at run time.

Indexing in C# is zero-based.
s = s.Remove(s.Length - 1, 1);

Just take a substring that starts at the first character (string indexes are zero-based) and whose length is one less than the string's length:
s = s.Substring(0, s.Length - 1);

This is a little safer, just in case the last character is not a space
string s = "How do I do this? ";
s = Regex.Replace(s, @" $", "");
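The same safety check can be done without Regex. This is a minimal sketch with made-up names (SpaceTrimmer, RemoveOneTrailingSpace); it removes exactly one trailing space and leaves any other string, including an empty one, untouched.

```csharp
public static class SpaceTrimmer
{
    // Removes exactly one trailing space, and only when the string ends with one.
    public static string RemoveOneTrailingSpace(string s)
    {
        if (s.Length > 0 && s[s.Length - 1] == ' ')
            return s.Substring(0, s.Length - 1);
        return s;
    }
}
```

Unlike TrimEnd(), which strips all trailing whitespace, this keeps any further spaces: "abc  " (two spaces) becomes "abc " (one space).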

You have to write something along the lines of
string s = "How do I do this? ";
s = s.Remove(s.Length-1, 1);
The reason is that in C# indexes are zero-based: the first element is always at position 0 and the last at Length - 1. Length tells you how long the string is, but it is not itself a valid index.

Another way to do it is;
string s = "How do I do this? ";
s = s.Substring(0, s.Length - 1);
Additional :
If you would like do some additional checking for the last character being a space or anything,you can do it in this way;
string s = "How do I do this? a";//Just for example,i've added a 'a' at the end.
int index = s.Length - 1;//Get last Char index.
if (index >= 0)//If index exists (the string is not empty).
{
if (s[index] == ' ')//If the character at 'index' is a space.
{
MessageBox.Show("Its a space.");
}
else if (char.IsLetter(s[index]))//If the character at 'index' is a letter.
{
MessageBox.Show("Its a letter.");
}
else if(char.IsDigit(s[index]))//If the character at 'index' is a digit.
{
MessageBox.Show("Its a digit.");
}
}
This gives you a MessageBox with message "Its a letter".
One more thing that might be helpful,if you want to create a string with equal no. of spaces between each word,then you can try this.
string s = "How do I do this? ";
string[] words = s.Split(new char[] {' '},StringSplitOptions.RemoveEmptyEntries);//Break the string into individual words.
StringBuilder sb = new StringBuilder();
foreach (string word in words)//Iterate through each word.
{
sb.Append(word);//Append the word.
sb.Append(" ");//Append a single space.
}
MessageBox.Show(sb.ToString());//Resultant string 'sb.ToString()'.
This gives you "How do I do this? " (equal spaces between words).

Cut the string to be <= 80 characters AND must keep the words without cutting them

I am new to C#, but I have a requirement to cut the strings to be <= 80 characters AND they must keep the words integrity (without cutting them)
Examples
Before: I have a requirenment to cut the strings to be <= 80 characters AND must keep the words without cutting them (length=108)
After: I have a requirenment to cut the strings to be <= 80 characters AND must keep (length=77)
Before: requirenment to cut the strings to be <= 80 characters AND must keep the words without cutting them (length=99)
After: requirenment to cut the strings to be <= 80 characters AND must keep the words (length=78)
Before: I have a requirenment the strings to be <= 80 characters AND must keep the words without cutting them (length=101)
After: I have a requirenment the strings to be <= 80 characters AND must keep the words (length=80)
I want to use a regex, but I don't know anything about regular expressions. It would be a hassle to do this with else-ifs.
I would appreciate it if you could point me to the right article which I could use to create this expression.
This is my function that I want to cut down to one line:
public String cutTitleto80(String s){
String[] words = Regex.Split(s, "\\s+");
String finalResult = "";
foreach (String word in words)
{
String tmp = finalResult + " " + word;
if (tmp.Length > 80)
{
return finalResult;
}
finalResult = tmp;
}
return finalResult;
}

Try
^(.{0,80})(?: |$)
This is a capturing greedy match which must be followed by a space or end of string. You could also use a zero-width lookahead assertion, as in
^.{0,80}(?= |$)
If you use a live test tool like http://regexhero.net/tester/ it's pretty cool, you can actually see it jump back to the word boundary as you type beyond 80 characters.
And here's one which will simply truncate at the 80th character if there are no word boundaries (spaces) to be found:
^(.{1,80}(?: |$)|.{80})
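In C#, the lookahead variant could be wrapped up roughly as follows. RegexTruncator and Truncate are hypothetical names, the input is assumed to be a single line (the dot does not match a newline), and the hard cut when no space is found is just one possible fallback.

```csharp
using System.Text.RegularExpressions;

public static class RegexTruncator
{
    // Returns the longest prefix of at most maxLength characters that is
    // followed by a space or by the end of the string.
    public static string Truncate(string s, int maxLength)
    {
        string pattern = "^.{0," + maxLength + "}(?= |$)";
        Match m = Regex.Match(s, pattern);
        if (m.Success) return m.Value;

        // No usable space within the limit: fall back to a hard cut.
        return s.Length > maxLength ? s.Substring(0, maxLength) : s;
    }
}
```

RegexTruncator.Truncate("alpha beta gamma", 10) returns "alpha beta", while a limit of 9 backs off to "alpha".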

Here's an approach without using Regex: just split the string (however you'd like) into whatever you consider "words" to be. Then, just start concatenating them together using a StringBuilder, checking for your desired length, until you can't add the next "word". Then, just return the string that you have built up so far.
(Untested code ahead)
public string TruncateWithPreservation(string s, int len)
{
string[] parts = s.Split(' ');
StringBuilder sb = new StringBuilder();
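foreach (string part in parts)
{
// a separating space is needed before every part except the first
int separator = sb.Length == 0 ? 0 : 1;
if (sb.Length + separator + part.Length > len)
break;
if (separator == 1) sb.Append(' ');
sb.Append(part);
}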
return sb.ToString();
}

string truncatedText = text.Length > 80 ? text.Substring(0, 80) : text; // truncate to 80 characters
// don't remove the last word if a space occurs after it in the original string (in other words, the last word is already complete)
if (text.Length > 80 && text[80] != ' ' && truncatedText.LastIndexOf(' ') > -1)
truncatedText = truncatedText.Substring(0, truncatedText.LastIndexOf(' ')); // remove any cut-off words
Updated to fix issue from comments where last word could get cut off even if it is complete.

This isn't using regex but this is how I would do it:
Use String.LastIndexOf to find the last space at or before the 81st character.
If the 81st character itself is a space, take the first 80 characters.
Otherwise, if the search returns a number greater than -1, cut the string off there.
If it returns -1, you-have-a-really-long-word-or-someone-messing-with-the-system, so you do whatever you like.
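These steps can be sketched as follows. SpaceTruncator and Truncate are illustrative names, and the hard cut for a single very long word is only one way to "do whatever you like".

```csharp
public static class SpaceTruncator
{
    public static string Truncate(string text, int maxLength)
    {
        if (text.Length <= maxLength) return text;

        // Search backwards starting at index maxLength (the 81st character
        // when maxLength is 80), so a space in that position is found too.
        int lastSpace = text.LastIndexOf(' ', maxLength);
        if (lastSpace > -1) return text.Substring(0, lastSpace);

        // One very long word: hard cut at the limit.
        return text.Substring(0, maxLength);
    }
}
```

Starting the backward search at index maxLength folds the "81st character is a space" case into the general one, so no separate branch is needed.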

Regular expression to split long strings in several lines

I'm not an expert in regular expressions, and today in my project I need to split a long string into several lines in order to check whether the text fits the page height.
I need a C# regular expression that splits long strings into several lines at "\n" and "\r\n" while keeping a maximum of 150 characters per line. If character 150 falls in the middle of a word, the entire word should be moved to the next line.
Can anyone help me?

It's actually a quite simple problem. Look for any characters up to 150, followed by a space. Since Regex is greedy by nature it will do exactly what you want it to. Replace it by the Match plus a newline:
.{0,150}(\s+|$)
Replace with
$0\r\n
See also: http://regexhero.net/tester/?id=75645133-1de2-4d8d-a29d-90fff8b2bab5
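A C# sketch of this replace-based approach is shown below, with two deliberate deviations that are my own assumptions rather than part of the answer: the quantifier starts at 1 so that no empty match is produced at the end of the input, and a capture group keeps the separating whitespace out of the output line. LineWrapper and Wrap are made-up names.

```csharp
using System.Text.RegularExpressions;

public static class LineWrapper
{
    // Breaks text into lines of at most width characters at whitespace.
    // A single word longer than width is left intact on an overlong line.
    public static string Wrap(string text, int width)
    {
        string pattern = "(.{1," + width + "})(?:\\s+|$)";
        // "$1" is the captured line; the whitespace that followed it is dropped.
        return Regex.Replace(text, pattern, "$1\r\n").TrimEnd();
    }
}
```

LineWrapper.Wrap("aaa bbb ccc", 7) returns "aaa bbb\r\nccc". The final TrimEnd() removes the line break added after the last line, along with any trailing whitespace in the input.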

var regex = new Regex(@".{0,150}", RegexOptions.Multiline);
var strings = regex.Replace(sourceString, "$0\r\n");

Here you go:
^.{1,150}\n
This will match the longest initial string like this.

If you just want to split a long string into lines of 150 chars then I'm not sure why you'd need a regular expression:
private string stringSplitter(string inString)
{
int lineLength = 150;
StringBuilder sb = new StringBuilder();
while (inString.Length > 0)
{
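// the remainder already fits on one line: emit it whole instead of splitting it at its last space
if (inString.Length <= lineLength) { sb.AppendLine(inString); break; }
var curLength = lineLength;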
var lastGap = inString.Substring(0, curLength).LastIndexOfAny(new char[] {' ', '\n'});
if (lastGap == -1)
{
sb.AppendLine(inString.Substring(0, curLength));
inString = inString.Substring(curLength);
}
else
{
sb.AppendLine(inString.Substring(0, lastGap));
inString = inString.Substring(lastGap + 1);
}
}
return sb.ToString();
}
edited to account for word breaks

This code should help you. It checks the length of the current string. If it is greater than maxLength (150 in this case), it starts at the character at index maxLength and, going backwards, finds the first whitespace character. It then stores the string up to and including that character and starts over with the remaining string, repeating until the remainder is no longer than maxLength characters. Finally, it joins all the pieces back together into a final string.
string line = "This is a really long run-on sentence that should go for longer than 150 characters and will need to be split into two lines, but only at a word boundary.";
int maxLength = 150;
string delimiter = "\r\n";
List<string> lines = new List<string>();
// As long as we still have more than 'maxLength' characters, keep splitting
while (line.Length > maxLength)
{
// Starting at this character and going backwards, look for whitespace to split after.
// Fallback: hard split at maxLength when no whitespace is found, so the loop always terminates.
int splitAt = maxLength;
for (int charIndex = maxLength; charIndex > 0; charIndex--)
{
if (char.IsWhiteSpace(line[charIndex]))
{
splitAt = charIndex + 1; // split the line just after this character
break;
}
}
// Store this piece and continue on with the remainder
lines.Add(line.Substring(0, splitAt));
line = line.Substring(splitAt);
}
lines.Add(line);
// Join the list back together with delimiter ("\r\n") between each line
string final = string.Join(delimiter , lines);
// Check the results
Console.WriteLine(final);
Note: If you run this code in a console application, you may want to change "maxLength" to a smaller number so that the console doesn't wrap on you.
Note: This code does not take any tab characters into account. If tabs are also included, your situation gets a bit more complicated.
Update: I fixed a bug where new lines were starting with a space.
