Get first 250 words of a string? - c#

How do I get the first 250 words of a string?

You need to split the string. You can use the parameterless overload, which assumes white-space characters as delimiters.
IEnumerable<string> words = str.Split().Take(250);
Note that you need to add using System.Linq for Enumerable.Take.
You can use ToList() or ToArray() to create a new collection from the query, or save memory and enumerate it directly:
foreach(string word in words)
Console.WriteLine(word);
Update
Since this seems to be quite popular, I'm adding the following extension method, which is more efficient than the Enumerable.Take approach and returns a collection instead of the (deferred-execution) query.
It uses String.Split, where white-space characters are assumed to be the delimiters if the separator parameter is null or contains no characters. The method also lets you pass different delimiters:
public static string[] GetWords(
this string input,
int count = -1,
string[] wordDelimiter = null,
StringSplitOptions options = StringSplitOptions.None)
{
if (string.IsNullOrEmpty(input)) return new string[] { };
if(count < 0)
return input.Split(wordDelimiter, options);
string[] words = input.Split(wordDelimiter, count + 1, options);
if (words.Length <= count)
return words; // not so many words found
// remove last "word" since that contains the rest of the string
Array.Resize(ref words, words.Length - 1);
return words;
}
It can be used easily:
string str = "A B C D E F";
string[] words = str.GetWords(5, null, StringSplitOptions.RemoveEmptyEntries); // A,B,C,D,E

yourString.Split(' ').Take(250);
I guess. You should provide more info.

String.Join(" ", yourstring.Split().Take(250).ToArray())
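The one-liner above can be wrapped into a small helper. This is only a sketch: the names WordHelpers and FirstWords are made up for illustration, and it assumes that runs of white space should count as a single separator (hence RemoveEmptyEntries), which the original one-liner does not do.

```csharp
using System;
using System.Linq;

public static class WordHelpers
{
    // Hypothetical helper: returns the first maxWords words of text,
    // joined by single spaces.
    public static string FirstWords(string text, int maxWords)
    {
        if (string.IsNullOrEmpty(text) || maxWords <= 0)
            return string.Empty;

        // A null separator array means "split on any white-space character".
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words.Take(maxWords).ToArray());
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(WordHelpers.FirstWords("one  two\tthree four", 3)); // one two three
    }
}
```

For the question as asked, the call would be WordHelpers.FirstWords(yourString, 250).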

In addition to Tim's answer, this is what you can try:
IEnumerable<string> words = str.Split().Take(250);
StringBuilder firstwords = new StringBuilder();
foreach(string s in words)
{
firstwords.Append(s + " ");
}
firstwords.Append("...");
Console.WriteLine(firstwords.ToString());

Try this one:
public string TakeWords(string str,int wordCount)
{
char lastChar='\0';
int spaceFound=0;
var strLen= str.Length;
int i=0;
for(; i<strLen; i++)
{
if(str[i]==' ' && lastChar!=' ')
{
spaceFound++;
}
lastChar=str[i];
if(spaceFound==wordCount)
break;
}
return str.Substring(0,i);
}

It's possible without calling Take(), using the Split overload that takes a maximum count. Note that the count caps the number of returned substrings and the last one holds the unsplit remainder of the string, so ask for one more than you need; if 251 elements come back, discard the last one:
string[] separatedWords = stringToProcess.Split(new char[] {' '}, 251, StringSplitOptions.RemoveEmptyEntries);
This also lets you split on more than just a space " ", which handles input where spaces are incorrectly missing:
string[] separatedWords = stringToProcess.Split(new char[] {' ', '.', ',', ':', ';'}, 251, StringSplitOptions.RemoveEmptyEntries);
Use StringSplitOptions.None if you want empty strings to be returned as well.
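A short sketch of how the count argument of String.Split behaves may help here: the last returned element keeps the rest of the string, so a helper that wants exactly n words has to request n + 1 and drop the tail. The names SplitDemo and FirstTokens are hypothetical.

```csharp
using System;

public static class SplitDemo
{
    // Hypothetical helper: returns at most n tokens, without the unsplit remainder.
    public static string[] FirstTokens(string input, int n)
    {
        char[] separators = { ' ' };
        // Request one extra element; if it comes back, it is the remainder.
        string[] parts = input.Split(separators, n + 1, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length > n)
            Array.Resize(ref parts, n);
        return parts;
    }
}

public static class Program
{
    public static void Main()
    {
        string input = "A B C D E F";

        // count = 3: the third element holds everything that was not split.
        string[] capped = input.Split(new[] { ' ' }, 3, StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(string.Join("|", capped)); // A|B|C D E F

        Console.WriteLine(string.Join("|", SplitDemo.FirstTokens(input, 3))); // A|B|C
    }
}
```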

Related

String.IndexOf() returns unexpected index of string

String.IndexOf() method is not acting as I expected.
I expected it not to find a match, since the exact word you is not in str.
string str = "I am your Friend";
int index = str.IndexOf("you",0,StringComparison.OrdinalIgnoreCase);
Console.WriteLine(index);
Output: 5
My Expected Result is -1 because the string doesn't contain you.
The issue you're facing is because IndexOf matches a single character, or sequence of characters (a search string) within the greater string. Therefore "I am your friend" contains the sequence "you". To match words only, you have to consider things at a word level.
For example, you could use regular expressions to match around the word boundaries:
private static int IndexOfWord(string val, int startAt, string search)
{
// escape the match expression in case it contains any characters meaningful
// to regular expressions, and then create an expression with the \b boundary
// characters
var escapedMatch = string.Format(@"\b{0}\b", Regex.Escape(search));
// create a case-insensitive regular expression object using the pattern
var exp = new Regex(escapedMatch, RegexOptions.IgnoreCase);
// perform the match from the start position
var match = exp.Match(val, startAt);
// if it's successful, return the match index
if (match.Success)
{
return match.Index;
}
// if it's unsuccessful, return -1
return -1;
}
// overload without startAt, for when you just want to start from the beginning
private static int IndexOfWord(string val, string search)
{
return IndexOfWord(val, 0, search);
}
In your example you would try to match \byou\b, which because of the boundary requirements won't match your.
See more about word boundaries in Regular Expressions here.
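To see the effect of the \b boundaries on the string from the question, here is a condensed, self-contained restatement of the helper above with a few calls. The class name WordSearch and the optional startAt parameter are choices made for this sketch, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordSearch
{
    // Returns the index of the first whole-word, case-insensitive match, or -1.
    public static int IndexOfWord(string val, string search, int startAt = 0)
    {
        var exp = new Regex(@"\b" + Regex.Escape(search) + @"\b", RegexOptions.IgnoreCase);
        Match match = exp.Match(val, startAt);
        return match.Success ? match.Index : -1;
    }
}

public static class Program
{
    public static void Main()
    {
        string str = "I am your Friend";
        Console.WriteLine(WordSearch.IndexOfWord(str, "you"));    // -1
        Console.WriteLine(WordSearch.IndexOfWord(str, "your"));   // 5
        Console.WriteLine(WordSearch.IndexOfWord(str, "friend")); // 10
    }
}
```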
"you" is a valid substring of "I am your Friend". If you would like to find out whether a word is in a string, you can parse the string with the Split method.
char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string[] words = text.Split(delimiterChars);
And then look inside the array. Or turn it into more lookup-friendly data structure.
If you would like to search case-insensitive you can use the following code:
char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "I am your Friend";
// HashSet allows faster lookups in case of big strings
var words = text.Split(delimiterChars).ToHashSet(StringComparer.OrdinalIgnoreCase);
Console.WriteLine(words.Contains("you"));
Console.WriteLine(words.Contains("friend"));
False
True
By building a dictionary as in the following snippet, you can quickly look up all positions of every word.
char[] delimiterChars = { ' ', ',', '.', ':', '\t' };
string text = "i am your friend. I Am Your Friend.";
var words = text.Split(delimiterChars);
var dict = new Dictionary<string, List<int>>(StringComparer.InvariantCultureIgnoreCase);
for (int i = 0; i < words.Length; ++i)
{
if (dict.ContainsKey(words[i])) dict[words[i]].Add(i);
else dict[words[i]] = new List<int>() { i };
}
Console.WriteLine("youR: ");
dict["youR"].ForEach(i => Console.WriteLine("\t{0}", i));
Console.WriteLine("friend");
dict["friend"].ForEach(i => Console.WriteLine("\t{0}", i));
youR:
2
7
friend
3
8

Remove last character from array in c#

I have an array that holds 19 records. The last value comes through as "" and I want to remove it.
This is where I declare the array and fill it with values:
string[] arrS = hidRateA.Value.Split(new char[] { ',' });
Kindly let me know how to remove the last value from an array.
You can trim the last comma , before splitting your string if only the last item will be empty:
string[] arrS = hidRateA.Value.TrimEnd(',').Split(new char[] { ',' });
Or you could just "take" all but the last item, if you know that the last item will always be empty and you don't need to detect that:
var result = arrS.Take(arrS.Length - 1);
You could specify StringSplitOptions.RemoveEmptyEntries
hidRateA.Value.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
Also note that " " (white space) is by definition not empty, so it will not be removed from the resulting array.
If your values can contain white space, you could use the code below to filter out white-space-only entries:
hidRateA.Value.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries)
.Where(x => !string.IsNullOrWhiteSpace(x))
.Select(s => s.Trim());
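The difference between these options is easiest to see side by side. The sketch below uses a made-up input string and a hypothetical helper name (CsvCleanup.SplitAndClean); it is not taken from the question's hidRateA field.

```csharp
using System;
using System.Linq;

public static class CsvCleanup
{
    // Hypothetical helper: drops empty and white-space-only entries and trims the rest.
    public static string[] SplitAndClean(string value)
    {
        return value.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                    .Where(x => !string.IsNullOrWhiteSpace(x))
                    .Select(x => x.Trim())
                    .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        string value = "1,,2, ,3,"; // empty entry, white-space entry, trailing comma

        Console.WriteLine(value.Split(',').Length);              // 6: keeps "" twice and " "
        Console.WriteLine(value.TrimEnd(',').Split(',').Length); // 5: only the trailing "" is gone
        Console.WriteLine(value.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries).Length); // 4: " " survives
        Console.WriteLine(string.Join("|", CsvCleanup.SplitAndClean(value))); // 1|2|3
    }
}
```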
This code does pretty much exactly what you're asking for in the question:
void Main()
{
string firstTest = "1,2,3,"; // ends with a comma
string secondTest = "a,b,c"; // ends with a character
string[] firstArray = CreateArray(firstTest);
PrintArray(firstArray);
Console.WriteLine();
string[] secondArray = CreateArray(secondTest);
PrintArray(secondArray);
}
string[] CreateArray(string str)
{
string[] array = str.Split(',');
return array.Where((s, i) => // return only those cells in the array where
i < array.Length - 1 || // it's not the last element, or
s != string.Empty) // it's not empty
.ToArray(); // in array form
}
void PrintArray(string[] array)
{
for (int i = 0; i < array.Length; i++)
{
Console.WriteLine("array[{0}] = '{1}'", i, array[i]);
}
}
But you haven't told us much about why you're doing what you're doing. Do you really only want to remove the last item, if it is empty? If you just want to remove any empty items then str.Split(',', StringSplitOptions.RemoveEmptyEntries) will work better for you.

Count chars and words in text, can it be optimized?

I want to count chars in a big text, I do it with this code:
string s = textBox.Text;
int chars = 0;
int words = 0;
foreach(var v in s.ToCharArray())
chars++;
foreach(var v in s.Split(' '))
words++;
this code works but it seems pretty slow with large text, so how can i improve this?
You don't need another char-array, you can use String.Length directly:
int chars = s.Length;
int words = s.Split().Length;
Side-note: if you call String.Split without an argument all white-space characters are used as delimiter. Those include spaces, tab-characters and new-line characters. This is not a complete list of possible word delimiters but it's better than " ".
You are also counting consecutive spaces as different "words". Use StringSplitOptions.RemoveEmptyEntries:
string[] wordSeparators = { "\r\n", "\n", ",", ".", "!", "?", ";", ":", " ", "-", "/", "\\", "[", "]", "(", ")", "<", ">", "@", "\"", "'" }; // this list is probably too extensive, tim.schmelter@myemail.com would count as 4 words, but it should give you an idea
string[] words = s.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
int wordCount = words.Length;
You can do this in a single pass through without making a copy of your string:
int chars = 0;
int words = 0;
//keep track of spaces so as to only count nonspace-space-nonspace transitions
//it is initialized to true to count the first word only when we come to it
bool lastCharWasSpace = true;
foreach (var c in s)
{
chars++;
if (c == ' ')
{
lastCharWasSpace = true;
}
else if (lastCharWasSpace)
{
words++;
lastCharWasSpace = false;
}
}
Note the reason I do not use string.Split here is that it does a bunch of string copies under the hood to return the resulting array. Since you're not using the contents but instead are only interested in the count, this is a waste of time and memory - especially if you have a big enough text that has to be shuffled off to main memory, or worse yet swap space.
Do be aware that string.Split does on the other hand by default use a longer list of delimiters than just ' ', so you may want to add other conditions to the if statement.
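One way to add those other conditions is to test char.IsWhiteSpace instead of comparing against ' '. The following is a sketch of that variant; the class and method names are made up, and it still treats punctuation as part of a word.

```csharp
using System;

public static class TextStats
{
    // Single pass, no allocations: counts transitions from white space to non-white-space.
    public static int CountWords(string s)
    {
        int words = 0;
        bool lastCharWasSpace = true; // so the first word is counted when we reach it
        foreach (char c in s)
        {
            if (char.IsWhiteSpace(c)) // space, tab, CR, LF, and other Unicode white space
            {
                lastCharWasSpace = true;
            }
            else if (lastCharWasSpace)
            {
                words++;
                lastCharWasSpace = false;
            }
        }
        return words;
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(TextStats.CountWords("one  two\tthree\r\nfour")); // 4
        Console.WriteLine(TextStats.CountWords("   "));                    // 0
    }
}
```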
You can simply use
int numberOfLetters = textBox.Text.Length;
or use LINQ
int numberOfLetters = textBox.Text.ToCharArray().Count();
or
int numberOfLetters = 0;
foreach (char letter in textBox.Text)
{
numberOfLetters++;
}
var chars = textBox.Text.Length;
var words = textBox.Text.Count(c => c == ' ') + 1;

Count how many words in each sentence

I'm stuck on how to count how many words are in each sentence, an example of this is: string sentence = "hello how are you. I am good. that's good."
and have it come out like:
//sentence1: 4 words
//sentence2: 3 words
//sentence3: 2 words
I can get the number of sentences
public int GetNoOfWords(string s)
{
return s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
label2.Text = (GetNoOfWords(sentance).ToString());
and i can get the number of words in the whole string
public int CountWord (string text)
{
int count = 0;
for (int i = 0; i < text.Length; i++)
{
if (text[i] != ' ')
{
if ((i + 1) == text.Length)
{
count++;
}
else
{
if(text[i + 1] == ' ')
{
count++;
}
}
}
}
return count;
}
then button1
int words = CountWord(sentance);
label4.Text = (words.ToString());
But I can't count how many words are in each sentence.
Instead of looping over the string as you do in CountWord, I would just use:
int words = s.Split(' ').Length;
It's much cleaner and simpler. You split on spaces, which returns an array of all the words; the length of that array is the number of words in the string.
Why not use Split instead?
var sentences = "hello how are you. I am good. that's good.";
foreach (var sentence in sentences.TrimEnd('.').Split('.'))
Console.WriteLine(sentence.Trim().Split(' ').Count());
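If you want the number of words in each sentence, you need to split the text into sentences first and then count the words in each one:
string s = "This is a sentence. Also this counts. This one is also a thing.";
string[] sentences = s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string sentence in sentences)
{
// Trim first: every sentence after the first starts with a space, which would count as an extra empty "word"
Console.WriteLine(sentence.Trim().Split(' ').Length + " words in sentence *" + sentence + "*");
}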
Use CountWord on each element of the array returned by Split:
string text = "hello how are you. I am good. that's good.";
string[] sentences = text.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sentence in sentences)
{
int noOfWordsInSentence = CountWord(sentence);
}
string text = "hello how are you. I am good. that's good.";
string[] sentences = text.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
IEnumerable<int> wordsPerSentence = sentences.Select(s => s.Trim().Split(' ').Length);
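The split-then-count approach from the answers above can be made a little more tolerant. The sketch below is one possible variant, with hypothetical names: it assumes '.', '!' and '?' end a sentence, treats any run of white space as one separator, and skips fragments that contain no words. It still miscounts a decimal such as "$1.50", because the dot is taken as a sentence end.

```csharp
using System;
using System.Collections.Generic;

public static class SentenceStats
{
    // Hypothetical helper: returns the word count of each non-empty sentence, in order.
    public static List<int> WordsPerSentence(string text)
    {
        var counts = new List<int>();
        char[] terminators = { '.', '!', '?' }; // assumed sentence terminators

        foreach (string sentence in text.Split(terminators, StringSplitOptions.RemoveEmptyEntries))
        {
            // A null separator array splits on any white-space character.
            int words = sentence.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
            if (words > 0)
                counts.Add(words);
        }
        return counts;
    }
}

public static class Program
{
    public static void Main()
    {
        List<int> counts = SentenceStats.WordsPerSentence("hello how are you. I am good. that's good.");
        for (int i = 0; i < counts.Count; i++)
            Console.WriteLine("sentence" + (i + 1) + ": " + counts[i] + " words");
        // sentence1: 4 words
        // sentence2: 3 words
        // sentence3: 2 words
    }
}
```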
As noted in several answers here, look at String functions like Split, Trim, Replace, etc. to get you going. All answers here will solve your simple example, but here are some sentences which they may fail to analyse correctly:
"Hello, how are you?" (no '.' to parse on)
"That apple costs $1.50." (a '.' used as a decimal)
"I like whitespace . "
"Word"
If you only need a count, I'd avoid Split() -- it takes up unnecessary space. Perhaps:
static int WordCount(string s)
{
int wordCount = 0;
for(int i = 0; i < s.Length - 1; i++)
if (Char.IsWhiteSpace(s[i]) && !Char.IsWhiteSpace(s[i + 1]) && i > 0)
wordCount++;
return ++wordCount;
}
public static void Main()
{
Console.WriteLine(WordCount(" H elloWor ld g ")); // prints "4"
}
It counts based on the number of spaces (1 space = 2 words). Consecutive spaces are ignored.
Does your spelling of sentence in:
int words = CountWord(sentance);
have anything to do with it?

Using String Split

I have a text
Category2,"Something with ,comma"
when I split this by ',' it should give me two string
Category2
"Something with ,comma"
but it actually splits the string at every comma.
How can I achieve my expected result?
Thanx
Just call variable.Split(new char[] { ',' }, 2). Complete documentation in MSDN.
There are a number of things that you could be wanting to do here so I will address a few:
Split on the first comma
string[] parts = text.Split(new char[] { ',' }, 2);
Split on every comma
string[] parts = text.Split(new char[] {','});
Split on commas that are not inside double quotes
var result = Regex.Split(samplestring, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
Last one taken from C# Regex Split
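Applied to the text from the question, a double-quote variant of that lookahead pattern behaves as sketched below. The class and method names are hypothetical, and the pattern assumes balanced quotes with no escaped quotes inside a field; for real CSV data a dedicated parser is the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedSplit
{
    // Matches a comma only if an even number of double quotes follows it,
    // i.e. the comma is not inside a quoted section.
    // Verbatim string: "" stands for one literal double quote.
    private const string Pattern = @",(?=(?:[^""]*""[^""]*"")*[^""]*$)";

    public static string[] SplitOutsideQuotes(string line)
    {
        return Regex.Split(line, Pattern);
    }
}

public static class Program
{
    public static void Main()
    {
        string line = "Category2,\"Something with ,comma\"";
        foreach (string part in QuotedSplit.SplitOutsideQuotes(line))
            Console.WriteLine(part);
        // Category2
        // "Something with ,comma"
    }
}
```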
Specify the maximum number of strings you want in the array:
string[] parts = text.Split(new char[] { ',' }, 2);
String.Split works at the simplest, fastest level - so it splits the text on all of the delimiters you pass into it, and it has no concept of special rules like double-quotes.
If you need a CSV parser which understands double-quotes, then you can write your own or there are some excellent open source parsers available - e.g. http://www.codeproject.com/KB/database/CsvReader.aspx - this is one I've used in several projects and recommend.
Try this:
public static class StringExtensions
{
public static IEnumerable<string> SplitToSubstrings(this string str)
{
int startIndex = 0;
bool isInQuotes = false;
for (int index = 0; index < str.Length; index++ )
{
if (str[index] == '\"')
isInQuotes = !isInQuotes;
bool isStartOfNewSubstring = (!isInQuotes && str[index] == ',');
if (isStartOfNewSubstring)
{
yield return str.Substring(startIndex, index - startIndex).Trim();
startIndex = index + 1;
}
}
yield return str.Substring(startIndex).Trim();
}
}
Usage is pretty simple:
foreach(var str in text.SplitToSubstrings())
Console.WriteLine(str);
