I want to count the characters and words in a large text. I currently do it with this code:
string s = textBox.Text;
int chars = 0;
int words = 0;
foreach(var v in s.ToCharArray())
chars++;
foreach(var v in s.Split(' '))
words++;
This code works, but it seems pretty slow with large text. How can I improve it?
You don't need another char-array, you can use String.Length directly:
int chars = s.Length;
int words = s.Split().Length;
Side-note: if you call String.Split without an argument all white-space characters are used as delimiter. Those include spaces, tab-characters and new-line characters. This is not a complete list of possible word delimiters but it's better than " ".
You are also counting consecutive spaces as different "words". Use StringSplitOptions.RemoveEmptyEntries:
string[] wordSeparators = { "\r\n", "\n", ",", ".", "!", "?", ";", ":", " ", "-", "/", "\\", "[", "]", "(", ")", "<", ">", "@", "\"", "'" }; // this list is probably too extensive, tim.schmelter@myemail.com would count as 4 words, but it should give you an idea
string[] words = s.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
int wordCount = words.Length;
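To make the difference between the two calls concrete, here is a minimal sketch. The class and method names (SplitComparison, CountBySpace, CountByWhiteSpace) are made up for illustration; only String.Split and StringSplitOptions come from the framework.

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class SplitComparison
{
    // Splits on a single space only, so every extra space adds an empty "word"
    // and tabs or line breaks do not separate words at all.
    public static int CountBySpace(string s) => s.Split(' ').Length;

    // A null separator array means "split on any white-space character";
    // RemoveEmptyEntries drops the empty strings produced by consecutive delimiters.
    public static int CountByWhiteSpace(string s) =>
        s.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

For the input "one  two\tthree\nfour" (two spaces after "one"), CountBySpace should give 3 and CountByWhiteSpace 4.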
You can do this in a single pass through without making a copy of your string:
int chars = 0;
int words = 0;
//keep track of spaces so as to only count nonspace-space-nonspace transitions
//it is initialized to true to count the first word only when we come to it
bool lastCharWasSpace = true;
foreach (var c in s)
{
chars++;
if (c == ' ')
{
lastCharWasSpace = true;
}
else if (lastCharWasSpace)
{
words++;
lastCharWasSpace = false;
}
}
Note the reason I do not use string.Split here is that it does a bunch of string copies under the hood to return the resulting array. Since you're not using the contents but instead are only interested in the count, this is a waste of time and memory - especially if you have a big enough text that has to be shuffled off to main memory, or worse yet swap space.
Do be aware that string.Split does on the other hand by default use a longer list of delimiters than just ' ', so you may want to add other conditions to the if statement.
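As a rough sketch of that last point, the same loop can test char.IsWhiteSpace instead of comparing against ' ', so tabs and line breaks also end a word. TextStats and Count are names I picked for the example, and punctuation is still treated as part of a word here.

```csharp
using System;

// Hypothetical helper illustrating the single-pass idea with a wider delimiter set.
public static class TextStats
{
    public static (int Chars, int Words) Count(string s)
    {
        int words = 0;
        // Starts as true so that the first non-space character opens the first word.
        bool lastCharWasSpace = true;

        foreach (char c in s)
        {
            if (char.IsWhiteSpace(c))
            {
                lastCharWasSpace = true;
            }
            else if (lastCharWasSpace)
            {
                // Transition from white space to non-white space: a new word begins.
                words++;
                lastCharWasSpace = false;
            }
        }

        // The character count needs no loop at all; Length is already stored on the string.
        return (s.Length, words);
    }
}
```

Calling TextStats.Count(textBox.Text) would then give both numbers without allocating any arrays.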
You can simply use
int numberOfLetters = textBox.Text.Length;
or use LINQ
int numberOfLetters = textBox.Text.ToCharArray().Count();
or
int numberOfLetters = 0;
foreach (char letter in textBox.Text)
{
numberOfLetters++;
}
var chars = textBox.Text.Length;
var words = textBox.Text.Count(c => c == ' ') + 1;
I need to split a string into paragraphs and count those paragraphs (paragraphs separated by 2 or more empty lines).
In addition, I need to read each word from the text and be able to tell which paragraph the word belongs to.
For example (each paragraph is more than one line, and two empty lines separate the paragraphs):
This is
the first
paragraph
This is
the second
paragraph
This is
the third
paragraph
Something like this should work for you:
var paragraphMarker = Environment.NewLine + Environment.NewLine;
var paragraphs = fileText.Split(new[] {paragraphMarker},
StringSplitOptions.RemoveEmptyEntries);
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
You may need to change the line delimiter, since files can use different variants such as "\n", "\r" or "\r\n".
You can also pass specific characters to Trim to remove symbols such as '.', ',', '!', '"' and others.
Edit: For more flexibility, you can use a regular expression to split the paragraphs:
// (?>...) is an atomic group, so a single "\r\n" is never counted as two line breaks
var paragraphs = Regex.Split(fileText, @"(?>\r\n?|\n){2}")
.Where(p => p.Any(char.IsLetterOrDigit));
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
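Neither snippet above keeps track of which paragraph a word came from, which the question also asks for. A minimal sketch of one way to do that follows. ParagraphWords and Index are hypothetical names, and I am assuming that any run of one or more blank lines separates paragraphs and that paragraphs are numbered from 1.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: pairs every word with the number of its paragraph.
public static class ParagraphWords
{
    public static List<(int Paragraph, string Word)> Index(string text)
    {
        var result = new List<(int Paragraph, string Word)>();

        // Normalize "\r\n" and "\r" to "\n" so the pattern only has to deal with one variant.
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // A line break followed by one or more (possibly blank-padded) line breaks
        // is treated as a paragraph separator (assumption).
        string[] paragraphs = Regex.Split(normalized, @"\n(?:[ \t]*\n)+");

        int number = 0;
        foreach (string paragraph in paragraphs)
        {
            // Null separator: split on any white space, including the line breaks inside a paragraph.
            string[] words = paragraph.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length == 0) continue; // skip empty chunks at the start or end of the text

            number++;
            foreach (string word in words)
                result.Add((number, word));
        }
        return result;
    }
}
```

The paragraph count is then the Paragraph value of the last entry, or 0 if the list is empty.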
I think you want to split the text into paragraphs, but do you have a delimiter that tells you where to split the string? For example, if each paragraph ends with ".", this should do the trick:
string paragraphs="My first paragraph. Once upon a time";
string[] words = paragraphs.Split('.');
foreach (string word in words)
{
Console.WriteLine(word);
}
The result for this will be:
My first paragraph
Once upon a time
Just remember that the "." character is removed.
public static List<string> SplitLine(string isstr, int size = 100)
{
var words = isstr.Split(new[] { ' ' },
StringSplitOptions.RemoveEmptyEntries);
List<string> lo = new List<string>();
string tmp = "";
int i = 0;
for (i = 0; i < words.Length; i++)
{
if ((tmp.Length + words[i].Length) > size)
{
lo.Add(tmp);
tmp = "";
}
tmp += " " + words[i];
}
if (!String.IsNullOrWhiteSpace(tmp))
{
lo.Add(tmp);
}
return lo;
}
In the following code I split text into words, insert them into a table one by one, and count the number of letters in each word.
The problem is that the counter also counts the white space at the beginning of each line, which gives me the wrong value for some of the words.
How can I count only the letters of each word?
var str = reader1.ReadToEnd();
char[] separators = new char[] {' ', ',', '/', '?'}; //Clean punctuation from copying
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries).ToArray(); //Insert all the song words into "words" string
string constring1 = "datasource=localhost;port=3306;username=root;password=123";
using (var conDataBase1 = new MySqlConnection(constring1))
{
conDataBase1.Open();
for (int i = 0; i < words.Length; i++)
{
int numberOfLetters = words[i].ToCharArray().Length; //Calculate the numbers of letters in each word
var songtext = "insert into myproject.words (word_text,word_length) values('" + words[i] + "','" + numberOfLetters + "');"; //Insert words list and length into words table
MySqlCommand cmdDataBase1 = new MySqlCommand(songtext, conDataBase1);
try
{
cmdDataBase1.ExecuteNonQuery();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
}
This will be a simple and fast way of doing so:
int numberOfLetters = words[i].Count(c => !Char.IsWhiteSpace(c));
Another simple option is to call Trim() first and then do your normal calculation, since you say the problem only occurs at the beginning of each line. Note that Trim() only strips white space from the start and end of the whole string, so line breaks inside the text still end up in the words unless '\r' and '\n' are added to separators.
var words = str.Trim().Split(separators, StringSplitOptions.RemoveEmptyEntries);
Then all you need is (without the redundant conversion):
int numberOfLetters = words[i].Length;
See String.Trim()
int numberOfLetters = words[i].Trim().ToCharArray().Length; //Calculate the numbers of letters in each word
Instead of ' ', use the pattern \s+, which matches one or more white-space characters at once, so the text is split on any run of white space.
Regex.Split(myString, @"\s+");
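One thing to watch with this approach: if the text starts or ends with white space, Regex.Split returns an empty string at that end. A small sketch that trims first (WhitespaceSplit and Words are names invented for the example):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper around Regex.Split with the \s+ pattern.
public static class WhitespaceSplit
{
    public static string[] Words(string s)
    {
        // Without Trim(), leading or trailing white space yields an empty first or last entry.
        string trimmed = s.Trim();

        // Regex.Split on an empty input returns one empty string, so handle that case separately.
        return trimmed.Length == 0 ? new string[0] : Regex.Split(trimmed, @"\s+");
    }
}
```

With the words produced this way, words[i].Length should already be the letter count, because no entry contains spaces or line breaks.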
How do I get the first 250 words of a string?
You need to split the string. You can use the overload without parameters (white space is assumed as the delimiter).
IEnumerable<string> words = str.Split().Take(250);
Note that you need to add using System.Linq for Enumerable.Take.
You can use ToList() or ToArray() to create a new collection from the query, or save memory and enumerate it directly:
foreach(string word in words)
Console.WriteLine(word);
Update
Since it seems to be quite popular I'm adding following extension which is more efficient than the Enumerable.Take approach and also returns a collection instead of the (deferred executed) query.
It uses String.Split, where white-space characters are assumed to be the delimiters if the separator parameter is null or contains no characters. The method also lets you pass different delimiters:
public static string[] GetWords(
this string input,
int count = -1,
string[] wordDelimiter = null,
StringSplitOptions options = StringSplitOptions.None)
{
if (string.IsNullOrEmpty(input)) return new string[] { };
if(count < 0)
return input.Split(wordDelimiter, options);
string[] words = input.Split(wordDelimiter, count + 1, options);
if (words.Length <= count)
return words; // not so many words found
// remove last "word" since that contains the rest of the string
Array.Resize(ref words, words.Length - 1);
return words;
}
It can be used easily:
string str = "A B C D E F";
string[] words = str.GetWords(5, null, StringSplitOptions.RemoveEmptyEntries); // A,B,C,D,E
yourString.Split(' ').Take(250);
I guess. You should provide more info.
String.Join(" ", yourstring.Split().Take(250).ToArray())
In addition to Tim's answer, this is what you can try:
IEnumerable<string> words = str.Split().Take(250);
StringBuilder firstwords = new StringBuilder();
foreach(string s in words)
{
firstwords.Append(s + " ");
}
firstwords.Append("...");
Console.WriteLine(firstwords.ToString());
Try this one:
public string TakeWords(string str,int wordCount)
{
char lastChar='\0';
int spaceFound=0;
var strLen= str.Length;
int i=0;
for(; i<strLen; i++)
{
if(str[i]==' ' && lastChar!=' ')
{
spaceFound++;
}
lastChar=str[i];
if(spaceFound==wordCount)
break;
}
return str.Substring(0,i);
}
It's possible without calling Take(), by passing a maximum count to Split. Note that the last element of the result holds the whole remainder of the string, so ask for 251 parts and ignore the final one if the text may contain more than 250 words.
string[] separatedWords = stringToProcess.Split(new char[] {' '}, 251, StringSplitOptions.RemoveEmptyEntries);
This also allows splitting on more than just the space character, and therefore handles input where spaces are incorrectly missing.
string[] separatedWords = stringToProcess.Split(new char[] {' ', '.', ',', ':', ';'}, 251, StringSplitOptions.RemoveEmptyEntries);
Use StringSplitOptions.None if you want empty strings to be returned as well.
I'm stuck on how to count how many words are in each sentence, an example of this is: string sentence = "hello how are you. I am good. that's good."
and have it come out like:
//sentence1: 4 words
//sentence2: 3 words
//sentence3: 2 words
I can get the number of sentences
public int GetNoOfWords(string s)
{
return s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
label2.Text = (GetNoOfWords(sentance).ToString());
and I can get the number of words in the whole string
public int CountWord (string text)
{
int count = 0;
for (int i = 0; i < text.Length; i++)
{
if (text[i] != ' ')
{
if ((i + 1) == text.Length)
{
count++;
}
else
{
if(text[i + 1] == ' ')
{
count++;
}
}
}
}
return count;
}
then button1
int words = CountWord(sentance);
label4.Text = (words.ToString());
But I can't count how many words are in each sentence.
Instead of looping over the string as you do in CountWord, I would just use:
int words = s.Split(' ').Length;
It's much more clean and simple. You split on white spaces which returns an array of all the words, the length of that array is the number of words in the string.
Why not use Split instead?
var sentences = "hello how are you. I am good. that's good.";
foreach (var sentence in sentences.TrimEnd('.').Split('.'))
Console.WriteLine(sentence.Trim().Split(' ').Count());
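If you want the number of words in each sentence, you need to split the text into sentences first and then split each sentence into words:
string s = "This is a sentence. Also this counts. This one is also a thing.";
string[] sentences = s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string sentence in sentences)
{
// Trim() first: every sentence after the first starts with a space, which would add an empty "word"
Console.WriteLine(sentence.Trim().Split(' ').Length + " words in sentence *" + sentence + "*");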
}
Use CountWord on each element of the array returned by s.Split:
string text = "hello how are you. I am good. that's good.";
string[] sentences = text.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sentence in sentences)
{
int noOfWordsInSentence = CountWord(sentence);
}
string text = "hello how are you. I am good. that's good.";
string[] sentences = text.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
IEnumerable<int> wordsPerSentence = sentences.Select(s => s.Trim().Split(' ').Length);
As noted in several answers here, look at String functions like Split, Trim, Replace, etc. to get you going. All answers here will solve your simple example, but here are some sentences which they may fail to analyse correctly:
"Hello, how are you?" (no '.' to parse on)
"That apple costs $1.50." (a '.' used as a decimal)
"I like whitespace . "
"Word"
If you only need a count, I'd avoid Split() -- it takes up unnecessary space. Perhaps:
static int WordCount(string s)
{
int wordCount = 0;
for(int i = 0; i < s.Length - 1; i++)
if (Char.IsWhiteSpace(s[i]) && !Char.IsWhiteSpace(s[i + 1]) && i > 0)
wordCount++;
return ++wordCount;
}
public static void Main()
{
Console.WriteLine(WordCount(" H elloWor ld g ")); // prints "4"
}
It counts based on the number of spaces (1 space = 2 words). Consecutive spaces are ignored.
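For the per-sentence count the question asks about, and with the awkward inputs listed earlier in mind (no '.', a decimal point, stray white space, a single word), here is a slightly more tolerant sketch. SentenceStats and WordsPerSentence are made-up names. I am assuming that a sentence ends at '.', '!' or '?' only when white space follows, and that a token counts as a word only if it contains a letter or digit. It is still a heuristic: an abbreviation such as "e.g." followed by a space would be treated as a sentence end.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: number of words in each sentence of a text.
public static class SentenceStats
{
    public static int[] WordsPerSentence(string text)
    {
        // The lookbehind keeps the terminator with its sentence and only splits
        // where white space follows it, so "$1.50" is not cut in two.
        return Regex.Split(text, @"(?<=[.!?])\s+")
            // Null separator: split each sentence on any white space, dropping empty entries.
            .Select(sentence => sentence.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            // Ignore tokens that are pure punctuation, such as a lone ".".
            .Select(words => words.Count(w => w.Any(char.IsLetterOrDigit)))
            // Drop empty chunks, for example after trailing white space.
            .Where(count => count > 0)
            .ToArray();
    }
}
```

For "hello how are you. I am good. that's good." this should give 4, 3 and 2.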
Does your spelling of sentence in:
int words = CountWord(sentance);
have anything to do with it?
I have a text
Category2,"Something with ,comma"
When I split this by ',' it should give me two strings
Category2
"Something with ,comma"
but it actually splits the string at every comma.
How can I achieve my expected result?
Thanks
Just call variable.Split(new char[] { ',' }, 2). Complete documentation in MSDN.
There are a number of things that you could be wanting to do here so I will address a few:
Split on the first comma
string[] parts = text.Split(new char[] { ',' }, 2);
Split on every comma
string[] parts = text.Split(new char[] { ',' });
Split on a comma that is not inside double quotes
var result = Regex.Split(samplestring, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
Last one taken from C# Regex Split
Specify the maximum number of strings you want in the array:
string[] parts = text.Split(new char[] { ',' }, 2);
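Applied to the text from the question, that looks roughly like the sketch below. FirstCommaSplit and SplitOnFirstComma are names chosen for the example. This only helps when the quoted field comes last; a line such as "a,b",c would still be cut inside the quotes.

```csharp
using System;

// Hypothetical helper: split a line into at most two parts at the first comma.
public static class FirstCommaSplit
{
    public static string[] SplitOnFirstComma(string line)
    {
        // The count argument caps the number of returned substrings;
        // the last one keeps the rest of the line, later commas included.
        return line.Split(new[] { ',' }, 2);
    }
}
```

For Category2,"Something with ,comma" this should return Category2 and "Something with ,comma" (quotes kept).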
String.Split works at the simplest, fastest level - so it splits the text on all of the delimiters you pass into it, and it has no concept of special rules like double-quotes.
If you need a CSV parser which understands double-quotes, then you can write your own or there are some excellent open source parsers available - e.g. http://www.codeproject.com/KB/database/CsvReader.aspx - this is one I've used in several projects and recommend.
Try this:
public static class StringExtensions
{
public static IEnumerable<string> SplitToSubstrings(this string str)
{
int startIndex = 0;
bool isInQuotes = false;
for (int index = 0; index < str.Length; index++ )
{
if (str[index] == '\"')
isInQuotes = !isInQuotes;
bool isStartOfNewSubstring = (!isInQuotes && str[index] == ',');
if (isStartOfNewSubstring)
{
yield return str.Substring(startIndex, index - startIndex).Trim();
startIndex = index + 1;
}
}
yield return str.Substring(startIndex).Trim();
}
}
Usage is pretty simple:
foreach(var str in text.SplitToSubstrings())
Console.WriteLine(str);