how to count words correctly in text [duplicate]

how to count words correctly in text [duplicate] - c#

This question already has answers here:
C# Word Count function
(5 answers)
Closed 8 years ago.
I want give my program a text and it count the words correctly
I tried to use an array to save words in it :
string[] words = richTextBox1.Text.Split(' ');
But this code has a problem and it is count the spaces in text
so I tried the following code :
string[] checkwords= richTextBox1.Text.Split(' ');
for (int i = 0; i < checkwords.Length; i++)
{
if (richTextBox1.Text.EndsWith(" ") )
{
return;
}
else
{
string[] words = richTextBox1.Text.Split(' ');
toolStripStatusLabel1.Text = "Words" + " = " + words.Length.ToString();
but now it wont work correctly .

I'd recommend using Regex here, using the 'word boundary' anchor
Otherwise your code may not correctly take into account things like Tabs and New Lines - \b will take care of that for you
var words = Regex
.Split("hello world", #"\b")
.Where(s => !string.IsNullOrWhiteSpace(s));
var wordCount = words.Count();

You can use the overload of String.Split with StringSplitOptions.RemoveEmptyEntries to ignore multiple consecutive spaces.
string text = "a b c d"; // 4 "words"
int words = text.Split(new char[]{}, StringSplitOptions.RemoveEmptyEntries).Length;
I'm using an empty char[] (you can also use new string[]{}) because that takes all white-space characters into account, so not only ' ' but also tabs or new-line characters.

I do not know why you wish to return if the textbox ends in " ". Maybe it should be a next or continue instead.
If multiple spaces is a possibility.
Regex myRege = new Regex(#"[ ]{2,}");
string myText = regex.Replace(richTextBox1.Text, #" ");
string[] words= myText.Split(" ");
toolStripStatusLabel1.Text = "Words" + " = " + words.Length.ToString();
Just for fun
private string[] GetCount(string bodyText)
{
bodyText = bodyText.Replace(" "," ");
if(bodyText.Contains(" ")
GetCount(bodyText)
return bodyText.Split(' ');
}
string[] words = GetCount(richTextBox1.Text)
toolStripStatusLabel1.Text = "Words" + " = " + words.Length.ToString();

Related

Count chars and words in text, can it be optimized?

I want to count chars in a big text, I do it with this code:
string s = textBox.Text;
int chars = 0;
int words = 0;
foreach(var v in s.ToCharArray())
chars++;
foreach(var v in s.Split(' '))
words++;
this code works but it seems pretty slow with large text, so how can i improve this?

You don't need another char-array, you can use String.Length directly:
int chars = s.Length;
int words = s.Split().Length;
Side-note: if you call String.Split without an argument all white-space characters are used as delimiter. Those include spaces, tab-characters and new-line characters. This is not a complete list of possible word delimiters but it's better than " ".
You are also counting consecutive spaces as different "words". Use StringSplitOptions.RemoveEmptyEntries:
string[] wordSeparators = { "\r\n", "\n", ",", ".", "!", "?", ";", ":", " ", "-", "/", "\\", "[", "]", "(", ")", "<", ">", "#", "\"", "'" }; // this list is probably too extensive, tim.schmelter#myemail.com would count as 4 words, but it should give you an idea
string[] words = s.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
int wordCount = words.Length;

You can do this in a single pass through without making a copy of your string:
int chars = 0;
int words = 0;
//keep track of spaces so as to only count nonspace-space-nonspace transitions
//it is initialized to true to count the first word only when we come to it
bool lastCharWasSpace = true;
foreach (var c in s)
{
chars++;
if (c == ' ')
{
lastCharWasSpace = true;
}
else if (lastCharWasSpace)
{
words++;
lastCharWasSpace = false;
}
}
Note the reason I do not use string.Split here is that it does a bunch of string copies under the hood to return the resulting array. Since you're not using the contents but instead are only interested in the count, this is a waste of time and memory - especially if you have a big enough text that has to be shuffled off to main memory, or worse yet swap space.
Do be aware that string.Split does on the other hand by default use a longer list of delimiters than just ' ', so you may want to add other conditions to the if statement.

You can simply use
int numberOfLetters = textBox.Length;
or use LINQ
int numberOfLetters = textBox.ToCharArray().Count();
or
int numberOfLetters = 0;
foreach (char letter in textBox)
{
numberOfLetters++;
}

var chars = textBox.Text.Length;
var words = textbox.Text.Count(c => c == ' ') + 1;

How to count only letters in a string?

At the next code I'm splitting text to words, inserting them into a table separately and counting the numbers of letters in each word.
The problem is that counter is also counting spaces at the beginning of each line, and give me wrong value for some of the words.
How can I count only the letters of each word exactly?
var str = reader1.ReadToEnd();
char[] separators = new char[] {' ', ',', '/', '?'}; //Clean punctuation from copying
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries).ToArray(); //Insert all the song words into "words" string
string constring1 = "datasource=localhost;port=3306;username=root;password=123";
using (var conDataBase1 = new MySqlConnection(constring1))
{
conDataBase1.Open();
for (int i = 0; i < words.Length; i++)
{
int numberOfLetters = words[i].ToCharArray().Length; //Calculate the numbers of letters in each word
var songtext = "insert into myproject.words (word_text,word_length) values('" + words[i] + "','" + numberOfLetters + "');"; //Insert words list and length into words table
MySqlCommand cmdDataBase1 = new MySqlCommand(songtext, conDataBase1);
try
{
cmdDataBase1.ExecuteNonQuery();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
}

This will be a simple and fast way of doing so:
int numberOfLetters = words[i].Count(word => !Char.IsWhiteSpace(word));
Another simple solution that will save you the above and rest of the answers here, will be to Trim() first, and than do your normal calculation, due your statement that it is happening just in the beginning of every line.
var words = str.Trim().Split(separators, StringSplitOptions.RemoveEmptyEntries);
Than all you will need is: (Without the redundant conversion)
int numberOfLetters = words[i].Length;

See String.Trim()
int numberOfLetters = words[i].Trim().ToCharArray().Length; //Calculate the numbers of letters in each word

instead of ' ' use '\s+' since it matches one or more whitespace at once, so it splits on any number of whitespace characters.
Regex.Split(myString, #"\s+");

How to find the second word in a string

When I run this code, it's purpose is to obtain the first, second, and third word in a 3 word sentence only and have each of those words display in a text box. The sentence I use is 'kerry tells stories', look at pic to see what I mean,
The problem however, is that when i run it, the third word is 'stori', what happened to 'es' making 'stories'
var tbt = textBox.Text;
var firstWord = tbt.Substring(0, tbt.IndexOf(" "));
var indexword = tbt.IndexOf(" ");
var indexnumber = indexword +1;
string myString = indexnumber.ToString();
var secondWord = tbt.Substring(indexnumber, tbt.IndexOf(" "));
var indexword2 = tbt.IndexOf(" ", indexnumber);
var indexnumber2 = indexword2 + 1;
string myString2 = indexnumber2.ToString();
var thirdWord = tbt.Substring(indexnumber2, tbt.IndexOf(" "));
var indexword3 = tbt.IndexOf(" ", indexnumber2);
var indexnumber3 = indexword3 + 1;
string mystring3 = indexnumber3.ToString();
textBox6.Text = firstWord;
textBox7.Text = secondWord;
textBox8.Text = thirdWord;
Where is the problem?

You can try splitting the sentence via String.Split() method:
string test = "kelly tell stories";
string[] split = test.Split(' '); //Use empty space between word(s) as split character
for(int i=0; i< split.Length; i++)
{
Console.WriteLine(split[i]);
}
Console.ReadLine();
This yields 3 string elements in one-dimensional array, each element can be accessed via index. 0 = first word, 1 = second word... so on.

Why make it too complicated ? can do it simply by the following code :
string input = "kerry tells stories";
string[] output=input.Split(' ');
textBox6.Text = output[0];
textBox7.Text = output[1];
textBox8.Text = output[2];

The IndexOf is using the index of the first space, it doesn't move across.
Reports the zero-based index of the first occurrence of the specified string in this instance.
You can do this much more simply by using Split
var words = tbt.Split(new char[]{' '});
1 = words[0]
2 = words[1]
3 = words[2]

Here is an approach using split with multiple tokens:
Dim a() As String = txtSource.Text.Split({vbCr, vbLf, ".", "?", "!"}, StringSplitOptions.RemoveEmptyEntries)
txtSentences.Text = String.Join(vbCrLf, a)
a = txtSource.Text.Split({" ", ",", ";", ":", vbCr, vbLf, ".", "?", "!"}, StringSplitOptions.RemoveEmptyEntries)
txtWords.Text = String.Join(vbCrLf, a)
Won't work with all cases, but a lot is them. Add three multiline textboxes to test: txtSource, txtSentences, TxtWords

How to parse a text removing a char

Let's start with that i have a txtProbe(textbox) and there i have 12-34-56-78-90. I want to parse them in different labels or textboxes ... For now just in one textbox - txtParse. I tried with that code - I'm removing the "-" and then tried to display them but nothing happens:
{
char[] delemiterChars = { '-' };
string text = txtProbe.Text;
string[] words = text.Split(delemiterChars);
txtParse.Text = text;
foreach (string s in words)
{
txtParse.Text = s;
}
}
EDIT:
I want to set the received information in different labels:
12-label1
34-label2
56-label3
78-label4
90-label5

You could just use String.Replace:
txtParse.Text = txtProbe.Text.Replace("-", " ");

The following 'll do the trick more "semantically":
var parsed = text.Replace("-", " ");
You might be changing the value too early in the Page Life Cycle (in the case of Web Forms), with regards to why you're not seeing the parsed value in the server control.

For your specific sample it seems that you could just replace '-' by ' '.
txtParse.Text = txtProbe.Text.Replace('-', ' ');
But in order to join an array of strings using a white space separator, you could use this
txtParse.Text = string.Join(" ", words);
Your logic is not appropiated for the task you're trying to acheive but just for learning purposes I'll write the correct version of your snippet
string separator = string.Empty; // starts empty so doesn't apply for first element
foreach (string s in words)
{
txtParse.Text += separator + s; // You need to use += operator to append content
separator = " "; // from second element will append " "
}
EDIT
This is for the case of using different labels
Label[] labelList = new Label[] {label1, label2, label3, label4, label5};
for (int i = 0; i < words.Length; i++)
{
labelList[i].Text = words[i];
}

You can use String.Join to join collection of strings as you want. In your case:
txtParse.Text = String.Join(" ", words);

You can use the delimiter character directly in .split('-') to return an array of strings representing that data you want.

Your problem is that you keep assigning s to the Text property. The end result will be the last s in your array.
You can use TextBox.AppendText() instead.
char[] delemiterChars = { '-' };
string text = txtProbe.Text;
string[] words = text.Split(delemiterChars);
txtParse.Text = text;
foreach (string s in words)
{
txtParse.AppendText(s + " ");
}

You can just put txtParse.Text = txtProbe.Text.Replace('-', " ");
Or by modifying your code:
char[] delemiterChars = { '-' };
string text = txtProbe.Text;
string[] words = text.Split(delemiterChars,StringSplitOptions.RemoveEmptyEntries);
foreach (string s in words)
{
txtParse.Text += " " + s;
}

How do I replace multiple spaces with a single space in C#?

How can I replace multiple spaces in a string with only one space in C#?
Example:
1 2 3 4 5
would be:
1 2 3 4 5

I like to use:
myString = Regex.Replace(myString, #"\s+", " ");
Since it will catch runs of any kind of whitespace (e.g. tabs, newlines, etc.) and replace them with a single space.

string sentence = "This is a sentence with multiple spaces";
RegexOptions options = RegexOptions.None;
Regex regex = new Regex("[ ]{2,}", options);
sentence = regex.Replace(sentence, " ");

string xyz = "1 2 3 4 5";
xyz = string.Join( " ", xyz.Split( new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries ));

I think Matt's answer is the best, but I don't believe it's quite right. If you want to replace newlines, you must use:
myString = Regex.Replace(myString, #"\s+", " ", RegexOptions.Multiline);

Another approach which uses LINQ:
var list = str.Split(' ').Where(s => !string.IsNullOrWhiteSpace(s));
str = string.Join(" ", list);

It's much simpler than all that:
while(str.Contains(" ")) str = str.Replace(" ", " ");

Regex can be rather slow even with simple tasks. This creates an extension method that can be used off of any string.
public static class StringExtension
{
public static String ReduceWhitespace(this String value)
{
var newString = new StringBuilder();
bool previousIsWhitespace = false;
for (int i = 0; i < value.Length; i++)
{
if (Char.IsWhiteSpace(value[i]))
{
if (previousIsWhitespace)
{
continue;
}
previousIsWhitespace = true;
}
else
{
previousIsWhitespace = false;
}
newString.Append(value[i]);
}
return newString.ToString();
}
}
It would be used as such:
string testValue = "This contains too much whitespace."
testValue = testValue.ReduceWhitespace();
// testValue = "This contains too much whitespace."

myString = Regex.Replace(myString, " {2,}", " ");

For those, who don't like Regex, here is a method that uses the StringBuilder:
public static string FilterWhiteSpaces(string input)
{
if (input == null)
return string.Empty;
StringBuilder stringBuilder = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
char c = input[i];
if (i == 0 || c != ' ' || (c == ' ' && input[i - 1] != ' '))
stringBuilder.Append(c);
}
return stringBuilder.ToString();
}
In my tests, this method was 16 times faster on average with a very large set of small-to-medium sized strings, compared to a static compiled Regex. Compared to a non-compiled or non-static Regex, this should be even faster.
Keep in mind, that it does not remove leading or trailing spaces, only multiple occurrences of such.

This is a shorter version, which should only be used if you are only doing this once, as it creates a new instance of the Regex class every time it is called.
temp = new Regex(" {2,}").Replace(temp, " ");
If you are not too acquainted with regular expressions, here's a short explanation:
The {2,} makes the regex search for the character preceding it, and finds substrings between 2 and unlimited times.
The .Replace(temp, " ") replaces all matches in the string temp with a space.
If you want to use this multiple times, here is a better option, as it creates the regex IL at compile time:
Regex singleSpacify = new Regex(" {2,}", RegexOptions.Compiled);
temp = singleSpacify.Replace(temp, " ");

You can simply do this in one line solution!
string s = "welcome to london";
s.Replace(" ", "()").Replace(")(", "").Replace("()", " ");
You can choose other brackets (or even other characters) if you like.

no Regex, no Linq... removes leading and trailing spaces as well as reducing any embedded multiple space segments to one space
string myString = " 0 1 2 3 4 5 ";
myString = string.Join(" ", myString.Split(new char[] { ' ' },
StringSplitOptions.RemoveEmptyEntries));
result:"0 1 2 3 4 5"

// Mysample string
string str ="hi you are a demo";
//Split the words based on white sapce
var demo= str .Split(' ').Where(s => !string.IsNullOrWhiteSpace(s));
//Join the values back and add a single space in between
str = string.Join(" ", demo);
// output: string str ="hi you are a demo";

Consolodating other answers, per Joel, and hopefully improving slightly as I go:
You can do this with Regex.Replace():
string s = Regex.Replace (
" 1 2 4 5",
#"[ ]{2,}",
" "
);
Or with String.Split():
static class StringExtensions
{
public static string Join(this IList<string> value, string separator)
{
return string.Join(separator, value.ToArray());
}
}
//...
string s = " 1 2 4 5".Split (
" ".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries
).Join (" ");

I just wrote a new Join that I like, so I thought I'd re-answer, with it:
public static string Join<T>(this IEnumerable<T> source, string separator)
{
return string.Join(separator, source.Select(e => e.ToString()).ToArray());
}
One of the cool things about this is that it work with collections that aren't strings, by calling ToString() on the elements. Usage is still the same:
//...
string s = " 1 2 4 5".Split (
" ".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries
).Join (" ");

Many answers are providing the right output but for those looking for the best performances, I did improve Nolanar's answer (which was the best answer for performance) by about 10%.
public static string MergeSpaces(this string str)
{
if (str == null)
{
return null;
}
else
{
StringBuilder stringBuilder = new StringBuilder(str.Length);
int i = 0;
foreach (char c in str)
{
if (c != ' ' || i == 0 || str[i - 1] != ' ')
stringBuilder.Append(c);
i++;
}
return stringBuilder.ToString();
}
}

Use the regex pattern
[ ]+ #only space
var text = Regex.Replace(inputString, #"[ ]+", " ");

I know this is pretty old, but ran across this while trying to accomplish almost the same thing. Found this solution in RegEx Buddy. This pattern will replace all double spaces with single spaces and also trim leading and trailing spaces.
pattern: (?m:^ +| +$|( ){2,})
replacement: $1
Its a little difficult to read since we're dealing with empty space, so here it is again with the "spaces" replaced with a "_".
pattern: (?m:^_+|_+$|(_){2,}) <-- don't use this, just for illustration.
The "(?m:" construct enables the "multi-line" option. I generally like to include whatever options I can within the pattern itself so it is more self contained.

I can remove whitespaces with this
while word.contains(" ") //double space
word = word.Replace(" "," "); //replace double space by single space.
word = word.trim(); //to remove single whitespces from start & end.

Without using regular expressions:
while (myString.IndexOf(" ", StringComparison.CurrentCulture) != -1)
{
myString = myString.Replace(" ", " ");
}
OK to use on short strings, but will perform badly on long strings with lots of spaces.

try this method
private string removeNestedWhitespaces(char[] st)
{
StringBuilder sb = new StringBuilder();
int indx = 0, length = st.Length;
while (indx < length)
{
sb.Append(st[indx]);
indx++;
while (indx < length && st[indx] == ' ')
indx++;
if(sb.Length > 1 && sb[0] != ' ')
sb.Append(' ');
}
return sb.ToString();
}
use it like this:
string test = removeNestedWhitespaces("1 2 3 4 5".toCharArray());

Here is a slight modification on Nolonar original answer.
Checking if the character is not just a space, but any whitespace, use this:
It will replace any multiple whitespace character with a single space.
public static string FilterWhiteSpaces(string input)
{
if (input == null)
return string.Empty;
var stringBuilder = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
char c = input[i];
if (i == 0 || !char.IsWhiteSpace(c) || (char.IsWhiteSpace(c) &&
!char.IsWhiteSpace(strValue[i - 1])))
stringBuilder.Append(c);
}
return stringBuilder.ToString();
}

How about going rogue?
public static string MinimizeWhiteSpace(
this string _this)
{
if (_this != null)
{
var returned = new StringBuilder();
var inWhiteSpace = false;
var length = _this.Length;
for (int i = 0; i < length; i++)
{
var character = _this[i];
if (char.IsWhiteSpace(character))
{
if (!inWhiteSpace)
{
inWhiteSpace = true;
returned.Append(' ');
}
}
else
{
inWhiteSpace = false;
returned.Append(character);
}
}
return returned.ToString();
}
else
{
return null;
}
}

Mix of StringBuilder and Enumerable.Aggregate() as extension method for strings:
using System;
using System.Linq;
using System.Text;
public static class StringExtension
{
public static string CondenseSpaces(this string s)
{
return s.Aggregate(new StringBuilder(), (acc, c) =>
{
if (c != ' ' || acc.Length == 0 || acc[acc.Length - 1] != ' ')
acc.Append(c);
return acc;
}).ToString();
}
public static void Main()
{
const string input = " (five leading spaces) (five internal spaces) (five trailing spaces) ";
Console.WriteLine(" Input: \"{0}\"", input);
Console.WriteLine("Output: \"{0}\"", StringExtension.CondenseSpaces(input));
}
}
Executing this program produces the following output:
Input: " (five leading spaces) (five internal spaces) (five trailing spaces) "
Output: " (five leading spaces) (five internal spaces) (five trailing spaces) "

Old skool:
string oldText = " 1 2 3 4 5 ";
string newText = oldText
.Replace(" ", " " + (char)22 )
.Replace( (char)22 + " ", "" )
.Replace( (char)22 + "", "" );
Assert.That( newText, Is.EqualTo( " 1 2 3 4 5 " ) );

You can create a StringsExtensions file with a method like RemoveDoubleSpaces().
StringsExtensions.cs
public static string RemoveDoubleSpaces(this string value)
{
Regex regex = new Regex("[ ]{2,}", RegexOptions.None);
value = regex.Replace(value, " ");
// this removes space at the end of the value (like "demo ")
// and space at the start of the value (like " hi")
value = value.Trim(' ');
return value;
}
And then you can use it like this:
string stringInput =" hi here is a demo ";
string stringCleaned = stringInput.RemoveDoubleSpaces();

I looked over proposed solutions, could not find the one that would handle mix of white space characters acceptable for my case, for example:
Regex.Replace(input, #"\s+", " ") - it will eat your line breaks, if they are mixed with spaces, for example \n \n sequence will be replaced with
Regex.Replace(source, #"(\s)\s+", "$1") - it will depend on whitespace first character, meaning that it again might eat your line breaks
Regex.Replace(source, #"[ ]{2,}", " ") - it won't work correctly when there's mix of whitespace characters - for example "\t \t "
Probably not perfect, but quick solution for me was:
Regex.Replace(input, #"\s+",
(match) => match.Value.IndexOf('\n') > -1 ? "\n" : " ", RegexOptions.Multiline)
Idea is - line break wins over the spaces and tabs.
This won't handle windows line breaks correctly, but it would be easy to adjust to work with that too, don't know regex that well - may be it is possible to fit into single pattern.

The following code remove all the multiple spaces into a single space
public string RemoveMultipleSpacesToSingle(string str)
{
string text = str;
do
{
//text = text.Replace(" ", " ");
text = Regex.Replace(text, #"\s+", " ");
} while (text.Contains(" "));
return text;
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

how to count words correctly in text [duplicate] - c#

Related

Count chars and words in text, can it be optimized?

How to count only letters in a string?

How to find the second word in a string

How to parse a text removing a char

How do I replace multiple spaces with a single space in C#?

Categories

Resources