How to split string preserving whole words? - c#

I need to split long sentence into parts preserving whole words. Each part should have given maximum number of characters (including space, dots etc.).
For example:
int partLenght = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon."
Output:
1 part: "Silver badges are awarded for"
2 part: "longer term goals. Silver badges are"
3 part: "uncommon."

Try this:
static void Main(string[] args)
{
int partLength = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
string[] words = sentence.Split(' ');
var parts = new Dictionary<int, string>();
string part = string.Empty;
int partCounter = 0;
foreach (var word in words)
{
if (part.Length + word.Length < partLength)
{
part += string.IsNullOrEmpty(part) ? word : " " + word;
}
else
{
parts.Add(partCounter, part);
part = word;
partCounter++;
}
}
parts.Add(partCounter, part);
foreach (var item in parts)
{
Console.WriteLine("Part {0} (length = {2}): {1}", item.Key, item.Value, item.Value.Length);
}
Console.ReadLine();
}

I knew there had to be a nice LINQ-y way of doing this, so here it is for the fun of it:
var input = "The quick brown fox jumps over the lazy dog.";
var charCount = 0;
var maxLineLength = 11;
var lines = input.Split(' ', StringSplitOptions.RemoveEmptyEntries)
.GroupBy(w => (charCount += w.Length + 1) / maxLineLength)
.Select(g => string.Join(" ", g));
// That's all :)
foreach (var line in lines) {
Console.WriteLine(line);
}
Obviously this code works only as long as the query is not parallel, since it depends on charCount to be incremented "in word order".

I've been testing Jon's and Lessan's answers, but they don't work properly if your max length needs to be absolute, rather than approximate. As their counter increments, it doesn't count the empty space left at the end of a line.
Running their code against the OP's example, you get:
1 part: "Silver badges are awarded for " - 29 Characters
2 part: "longer term goals. Silver badges are" - 36 Characters
3 part: "uncommon. " - 13 Characters
The "are" on line two, should be on line three. This happens because the counter does not include the 6 characters from the end of line one.
I came up with the following modification of Lessan's answer to account for this:
public static class ExtensionMethods
{
public static string[] Wrap(this string text, int max)
{
var charCount = 0;
var lines = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
return lines.GroupBy(w => (charCount += (((charCount % max) + w.Length + 1 >= max)
? max - (charCount % max) : 0) + w.Length + 1) / max)
.Select(g => string.Join(" ", g.ToArray()))
.ToArray();
}
}

Split the string with a (space), that build up new strings from the resulting array, stopping before your limit for each new segment.
Untested pseudo-code:
string[] words = sentence.Split(new char[] {' '});
IList<string> sentenceParts = new List<string>();
sentenceParts.Add(string.Empty);
int partCounter = 0;
foreach (var word in words)
{
if(sentenceParts[partCounter].Length + word.Length > myLimit)
{
partCounter++;
sentenceParts.Add(string.Empty);
}
sentenceParts[partCounter] += word + " ";
}

It seems like everyone is using some form of "Split then rebuild the sentence"...
I thought I would take a stab at this the way my brain would logically think about doing this manually, which is:
Split on length
Go backwards to the nearest space and use that chunk
Remove the used chunk and start over
The code ended up being a little more complex than I was hoping for, however I believe it handles most (all?) edge cases - including words that are longer than maxLength, when the words end exactly on the maxLength, etc.
Here's my function:
private static List<string> SplitWordsByLength(string str, int maxLength)
{
List<string> chunks = new List<string>();
while (str.Length > 0)
{
if (str.Length <= maxLength) //if remaining string is less than length, add to list and break out of loop
{
chunks.Add(str);
break;
}
string chunk = str.Substring(0, maxLength); //Get maxLength chunk from string.
if (char.IsWhiteSpace(str[maxLength])) //if next char is a space, we can use the whole chunk and remove the space for the next line
{
chunks.Add(chunk);
str = str.Substring(chunk.Length + 1); //Remove chunk plus space from original string
}
else
{
int splitIndex = chunk.LastIndexOf(' '); //Find last space in chunk.
if (splitIndex != -1) //If space exists in string,
chunk = chunk.Substring(0, splitIndex); // remove chars after space.
str = str.Substring(chunk.Length + (splitIndex == -1 ? 0 : 1)); //Remove chunk plus space (if found) from original string
chunks.Add(chunk); //Add to list
}
}
return chunks;
}
Test usage:
string testString = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
int length = 35;
List<string> test = SplitWordsByLength(testString, length);
foreach (string chunk in test)
{
Console.WriteLine(chunk);
}
Console.ReadLine();

At first I was thinking this might be a Regex kind of thing but here's my shot at it:
List<string> parts = new List<string>();
int partLength = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
string[] pieces = sentence.Split(' ');
StringBuilder tempString = new StringBuilder("");
foreach(var piece in pieces)
{
if(piece.Length + tempString.Length + 1 > partLength)
{
parts.Add(tempString.ToString());
tempString.Clear();
}
tempString.Append(" " + piece);
}

Expanding on jon's answer above; I needed to switch g with g.toArray(), and also change max to (max + 2) to get an exact wrapping on the max'th character.
public static class ExtensionMethods
{
public static string[] Wrap(this string text, int max)
{
var charCount = 0;
var lines = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
return lines.GroupBy(w => (charCount += w.Length + 1) / (max + 2))
.Select(g => string.Join(" ", g.ToArray()))
.ToArray();
}
}
And here is sample usage as NUnit tests:
[Test]
public void TestWrap()
{
Assert.AreEqual(2, "A B C".Wrap(4).Length);
Assert.AreEqual(1, "A B C".Wrap(5).Length);
Assert.AreEqual(2, "AA BB CC".Wrap(7).Length);
Assert.AreEqual(1, "AA BB CC".Wrap(8).Length);
Assert.AreEqual(2, "TEST TEST TEST TEST".Wrap(10).Length);
Assert.AreEqual(2, " TEST TEST TEST TEST ".Wrap(10).Length);
Assert.AreEqual("TEST TEST", " TEST TEST TEST TEST ".Wrap(10)[0]);
}

Joel there is a little bug in your code that I've corrected here:
public static string[] StringSplitWrap(string sentence, int MaxLength)
{
List<string> parts = new List<string>();
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
string[] pieces = sentence.Split(' ');
StringBuilder tempString = new StringBuilder("");
foreach (var piece in pieces)
{
if (piece.Length + tempString.Length + 1 > MaxLength)
{
parts.Add(tempString.ToString());
tempString.Clear();
}
tempString.Append((tempString.Length == 0 ? "" : " ") + piece);
}
if (tempString.Length>0)
parts.Add(tempString.ToString());
return parts.ToArray();
}

This works:
int partLength = 35;
string sentence = "Silver badges are awarded for longer term goals. Silver badges are uncommon.";
List<string> lines =
sentence
.Split(' ')
.Aggregate(new [] { "" }.ToList(), (a, x) =>
{
var last = a[a.Count - 1];
if ((last + " " + x).Length > partLength)
{
a.Add(x);
}
else
{
a[a.Count - 1] = (last + " " + x).Trim();
}
return a;
});
It gives me:
Silver badges are awarded for
longer term goals. Silver badges
are uncommon.

While CsConsoleFormat† was primarily designed to format text for console, it supports generating plain text as well.
var doc = new Document().AddChildren(
new Div("Silver badges are awarded for longer term goals. Silver badges are uncommon.") {
TextWrap = TextWrapping.WordWrap
}
);
var bounds = new Rect(0, 0, 35, Size.Infinity);
string text = ConsoleRenderer.RenderDocumentToText(doc, new TextRenderTarget(), bounds);
And, if you actually need trimmed strings like in your question:
List<string> lines = text.Trim()
.Split(new[] { Environment.NewLine }, StringSplitOptions.None)
.Select(s => s.Trim())
.ToList();
In addition to word wrap on spaces, you get proper handling of hyphens, zero-width spaces, no-break spaces etc.
† CsConsoleFormat was developed by me.

Related

best way of splitting numbers from text and keeping text [duplicate]

This question already has answers here:
how can i split a string by multiple delimiters and keep the delimiters?
(1 answer)
RegEx - Match Numbers of Variable Length
(4 answers)
Closed 4 years ago.
I have a text file. One of the columns contains a field which contains text along with numbers.
I'm trying to figure out the best way to split the numbers and text.
Below is an example of the typical values in the field.
.2700 Aqr sh./Tgt sh.
USD 2.4700/Tgt sh.
Currently I'm making use of the Split function (code below) however feel there is probably a smarter way of doing this.
My assumption is there will only ever be one number in the text (I'm 99% sure this is the case) however I have only seen a few examples so its possible my code below will not work.
I have read a little on regex. But not sure I tested it properly as it didn't quite get the output I wanted. For example
string input = "USD 2.4700/Tgt sh.";
string[] numbers = Regex.Split(input, #"\D+");
foreach (string value in numbers)
{
if (!string.IsNullOrEmpty(value))
{
int i = int.Parse(value);
Console.WriteLine("Number: {0}", i);
}
}
But the output is,
2
47
Whereas I was expecting 2.47 and I also don't want to lose the text. My desired result is
myText = "USD Tgt sh."
myNum = 2.47
For the other example
myText = "Aqr sh./Tgt sh."
myNum = 0.27
My Code
string[] sData = sTerms.Split(' ');
double num;
bool isNum = double.TryParse(sData[0], out num);
if(isNum)
{
ma.StockTermsNum = num;
StringBuilder sb = new StringBuilder();
for (int i = 1; i < sData.Length; i++)
sb = sb.Append(sData[i] + " ");
ma.StockTerms = sb.ToString();
}
else
{
string[] sNSplit = sData[1].Split('/');
ma.StockTermsNum = Convert.ToDouble(sNSplit[0]);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < sData.Length; i++)
{
if (i == 1)
sb = sb.Append(sNSplit[i] + " ");
else
sb = sb.Append(sData[i] + " ");
}
ma.StockTerms = sb.ToString();
}
I suggest spliting by group, (...) in order to preserve delimiter:
string source = #".2700 Aqr sh./Tgt sh.";
//string source = "USD 2.4700/Tgt sh.";
// please, notice "(...)" in the pattern - group
string[] parts = Regex.Split(source, #"([0-9]*\.?[0-9]+)");
// combining all texts
string myText = string.Concat(parts.Where((v, i) => i % 2 == 0));
// combining all numbers
string myNumber = string.Concat(parts.Where((v, i) => i % 2 != 0));
Tests:
string[] tests = new string[] {
#".2700 Aqr sh./Tgt sh.",
#"USD 2.4700/Tgt sh.",
};
var result = tests
.Select(test => new {
text = test,
parts = Regex.Split(test, #"([0-9]*\.?[0-9]+)"),
})
.Select(item => new {
text = item.text,
myText = string.Concat(item.parts.Where((v, i) => i % 2 == 0)),
myNumber = string.Concat(item.parts.Where((v, i) => i % 2 != 0)),
})
.Select(item => $"{item.text,-25} : {item.myNumber,-15} : {item.myText}");
Console.WriteLine(string.Join(Environment.NewLine, result));
Outcome:
.2700 Aqr sh./Tgt sh. : Aqr sh./Tgt sh. : .2700
USD 2.4700/Tgt sh. : USD /Tgt sh. : 2.4700
Could by something like this regex:
string input = "USD 2.4700/Tgt sh.";
var numbers = Regex.Matches(input, #"[\d]+\.?[\d]*");
foreach (Match res in numbers)
{
if (!string.IsNullOrEmpty(res.Value))
{
decimal i = decimal.Parse(res.Value);
Console.WriteLine("Number: {0}", i);
}
}
I would suggest you to use System.Text.RegularExpressions.RegEx. Here is example how you can achieve it:
static void Main(string[] args)
{
string a1 = ".2700 Aqr sh./Tgt sh.";
string a2 = "USD 2.4700/Tgt sh.";
var firstStringNums = GetNumbersFromString(ref a1);
Console.Write("My Text: {0}",a1);
Console.Write("myNums: ");
foreach(double a in firstStringNums)
{
Console.Write(a +"\t");
}
var secondStringNums = GetNumbersFromString(ref a2);
Console.Write("My Text: {0}", a2);
Console.Write("myNums: ");
foreach (double a in secondStringNums)
{
Console.Write(a + "\t");
}
}
public static List<double> GetNumbersFromString(ref string input)
{
List<double> result = new List<double>();
Regex r = new Regex("[0-9.,]+");
var numsFromString = r.Matches(input);
foreach(Match a in numsFromString)
{
if(double.TryParse(a.Value,out double val))
{
result.Add(val);
input =input.Replace(a.Value, "");
}
}
return result;
}
The pattern is just an example and off course will not cover every case that you will imagine.

English to PigLatin Conversion of a Sentence

I am making a program to convert English to PigLatin. However, my solution only seems to work with one word. If I enter in more than ond word, only the last is translated.
testing one translation
Would simply output:
translationway
I've looked at some solutions, but most are in the same fashion as mine, or use "simplified" solutions beyond the scope of my knowledge.
Code:
static void Main(string[] args)
{
Console.WriteLine("Enter a sentence to convert to PigLatin:");
string sentence = Console.ReadLine();
string pigLatin = ToPigLatin(sentence);
Console.WriteLine(pigLatin);
}
static string ToPigLatin (string sentence)
{
string firstLetter,
restOfWord,
vowels = "AEIOUaeio";
int currentLetter;
foreach (string word in sentence.Split())
{
firstLetter = sentence.Substring(0, 1);
restOfWord = sentence.Substring(1, sentence.Length - 1);
currentLetter = vowels.IndexOf(firstLetter);
if (currentLetter == -1)
{
sentence = restOfWord + firstLetter + "ay";
}
else
{
sentence = word + "way";
}
}
return sentence;
All help is greatly appreciated!
Edit
Thanks to great feedback, I've updated my code:
static string ToPigLatin (string sentence)
{
const string vowels = "AEIOUaeio";
List<string> pigWords = new List<string>();
foreach (string word in sentence.Split(' '))
{
string firstLetter = word.Substring(0, 1);
string restOfWord = word.Substring(1, word.Length - 1);
int currentLetter = vowels.IndexOf(firstLetter);
if (currentLetter == -1)
{
pigWords.Add(restOfWord + firstLetter + "ay");
}
else
{
pigWords.Add(word + "way");
}
}
return string.Join(" ", pigWords);
}
Would it be very complex to adapt this code to work with consonant clusters?
For example, right now testing one translation prints as:
estingtay oneway ranslationtay
While, as I understand PigLatin rules, it should read:
estingtay oneway anslationtray
Just place += instead of = here:
if (currentLetter == -1)
{
sentence += restOfWord + firstLetter + "ay";
}
else
{
sentence += word + "way";
}
On your version, you were overriding the sentence in each iteration of your loop
Edit
I've made a lot of changes to the code:
public static string ToPigLatin(string sentence)
{
const string vowels = "AEIOUaeio";
List<string> newWords = new List<string>();
foreach (string word in sentence.Split(' '))
{
string firstLetter = word.Substring(0, 1);
string restOfWord = word.Substring(1, word.Length - 1);
int currentLetter = vowels.IndexOf(firstLetter);
if (currentLetter == -1)
{
newWords.Add(restOfWord + firstLetter + "ay");
}
else
{
newWords.Add(word + "way");
}
}
return string.Join(" ", newWords);
}
As Panagiotis-Kanavos said, and he's damn right, don't build your output on your input but with your input. Thus, I added the newWords list (some might prefer a StringBuilder, I don't).
You were misusing your variable in your loop, especially with the Substrings calls, it's now fixed.
If you have any question on this, don't hesitate.
private static void Main(string[] args)
{
Console.WriteLine("Enter a sentence to convert to PigLatin:");
string sentence = Console.ReadLine();
var pigLatin = GetSentenceInPigLatin(sentence);
Console.WriteLine(pigLatin);
Console.ReadLine();
}
private static string GetSentenceInPigLatin(string sentence)
{
const string vowels = "AEIOUaeio";
var returnSentence = "";
foreach (var word in sentence.Split())
{
var firstLetter = word.Substring(0, 1);
var restOfWord = word.Substring(1, word.Length - 1);
var currentLetter = vowels.IndexOf(firstLetter, StringComparison.Ordinal);
if (currentLetter == -1)
{
returnSentence += restOfWord + firstLetter + "ay ";
}
else
{
returnSentence += word + "way ";
}
}
return returnSentence;
}
I came up with a short LINQ implementation for this:
string.Join(" ", "testing one translation".Split(' ')
.Select(word => "aeiouy".Contains(word[0])
? word.Skip(1).Concat(word.Take(1))
: word.ToCharArray())
.Select(word => word.Concat("way".ToCharArray()))
.Select(word => string.Concat(word)));
Output: "testingway neoway translationway"
Of course, I'd probably refactor it to something like this:
"testing one translation"
.Split(' ')
.Select(word => word.ToCharsWithStartingVowelLast())
.Select(word => word.WithEnding("way"))
.Select(word => string.Concat(word))
.Join(' ');
static class Extensions {
public static IEnumerable<char> ToCharsWithStartingVowelLast(this string word)
{
return "aeiouy".Contains(word[0])
? word.Skip(1).Concat(word.Take(1))
: word.ToCharArray();
}
public static IEnumerable<char> WithEnding(this IEnumerable<char> word, string ending)
{
return word.Concat(ending.ToCharArray())
}
public static string Join(this IEnumerable<IEnumerable<char>> words, char separator)
{
return string.Join(separator, words.Select(word => string.Concat(word)));
}
}
Update:
With your edit, you asked about consonant clusters. One of the things I like about doing this with LINQ is that it's pretty simple to just update that part of the pipeline, and make it all work:
public static IEnumerable<char> ToCharsWithStartingConsonantsLast(this string word)
{
return word.SkipWhile(c => c.IsConsonant()).Concat(word.TakeWhile(c => c.IsConsonant()));
}
public static bool IsConsonant(this char c)
{
return !"aeiouy".Contains(c);
}
The entire pipeline, without refactoring to extension methods, now looks like this:
string.Join(" ", "testing one translation".Split(' ')
.Select(word => word.SkipWhile(c => !"aeiouy".Contains(c)).Concat(word.TakeWhile(c => !"aeiou".Contains(c))))
.Select(word => word.Concat("way".ToCharArray()))
.Select(word => string.Concat(word)))
and outputs "estingtway oneway anslationtrway".
Update 2:
I noticed I wasn't handling word endings correctly. Here's an update that takes care of only adding w to the ending when the word (without the ending) ends with a vowel:
string.Join(" ", "testing one translation".Split(' ')
.Select(word => word.SkipWhile(c => !"aeiouy".Contains(c)).Concat(word.TakeWhile(c => !"aeiou".Contains(c))))
.Select(word =>
{
var ending = "aeiouy".Contains(word.Last()) ? "way" : "ay";
return word.Concat(ending.ToCharArray());
})
.Select(word => string.Concat(word)))
Output: "estingtay oneway anslationtray". Note how it's only the step that handles adding the ending that changed - all other parts of the algorithm were unchanged.
Given how simple this now is, I'd probably only use two extension methods: Join(this IEnumerable<IEnumerable<char>> words, char separator) and IsConsonant(this char c) (the implementation of the latter should be easy given the code samples above). This yields the following final implementation:
"testing one translation"
.Split(' ')
.Select(word => word.SkipWhile(c => !c.IsVowel()).Concat(word.TakeWhile(c => c.IsVowel())))
.Select(word => word.Concat((word.Last().IsVowel() ? "way" : "ay").ToCharArray()))
.Select(word => string.Concat(word))
.Join(" ")
It's also really easy to see here what we do to translate:
Split the sentence into words
Shuffle any consonants to the end of the word (it's admittedly not apparent at first sight that this is what happens, but I can't find a simpler way to express it except by wrapping it in an extension method)
Add the ending
Convert IEnumerable<char>s to strings
Re-join the words into a sentence

How do i find the third longest word in string

Hello im a bit stuck i dont know how to find the third longest word in a string, i have got my code to find the longest but i cant manage to get it to find the third longest. any help?
public void longestWord()
{
string sentance, word;
word = " ";
char[] a = new char[] { ' ' };
sentance = textBox1.Text; //<--string here
foreach (string s1 in sentance.Split(a))
{
if (word.Length < s1.Length)
{
word = s1;
}
}
label9.Text = ("The longest word is " + word + " and its length is " + word.Length + " characters long");
}
P.S an example of the string im testing is:
1.
DarkN3ss is my most experienced provider of Windows based business solutions. I focus on delivering my business value in best possible understanding of this technologies and directions.
DarkN3ss recognising me as an “elite business partner” for implementing solutions based on my capabilities and experience with Windows and Linux products.
how about using linq?
sentance.Split(' ').OrderByDescending(w => w.Length).Skip(2).FirstOrDefault()
in a function :
public void nthLongestWord(int index = 0)
{
string word = null;
if(index <= 0)
{
word = sentance.Split(' ').OrderByDescending(w => w.Length).FirstOrDefault();
}
else
{
word = sentance.Split(' ').OrderByDescending(w => w.Length).Skip(index - 1).FirstOrDefault();
}
if(!string.IsNullOrWhitespace(word))
{
label9.Text = ("The longest word is " + word + " and its length is " + word.Length + " characters long");
}
else
{
// display something else?
}
}
Solution: To get all the third largest words
string[] splitStr = sentence.Split(' ');
if (splitStr.Length > 2)
{
List<int> allLengths = splitStr.Select(x => x.Length).Distinct().ToList();
int thirdLargestWordLength = allLengths.OrderByDescending(x => x)
.Skip(2).Distinct().Take(1).FirstOrDefault();
if (splitStr[0].Length != thirdLargestWordLength &&
splitStr[1].Length != thirdLargestWordLength)
{
string[] theThirdLargestWords = splitStr.Where(x => x.Length == thirdLargestWordLength)
.ToArray();
if (theThirdLargestWords.Length == 1)
{
label9.Text = "The third longest word is " + theThirdLargestWords[0];
}
else
{
string words = "";
for (int i = 0; i < theThirdLargestWords.Length; i++)
{
if (i == 0)
{
words = theThirdLargestWords[i];
}
//else if ((i + 1) == theThirdLargestWords.Length)
//{
// words += " and " + theThirdLargestWords[i];
//}
else
{
words += ", " + theThirdLargestWords[i];
}
}
label9.Text = "The third longest words are " + words;
}
}
}
I commented out the "and" part as i'm not sure if you want that in your string. I also added the first if statement so you don't get an error if you have less then 3 words.
If you'd like to make minimal changes to your current code, then what you should do is store the three longest words (i.e., instead of word, have word1, word2, and word3, or an array if you'd prefer).
Then, in your if statement, set word3=word2, word2=word1, and word1=s1.
That way, the third largest word will end up in word3.
Not the most efficient, but you'll be able to keep your current code, to a degree.
If you feel confortable using a temporary array, you should copy your words there, sort them and take the 3rd longest one.
This might have compile error, but just to give you an idea.
var words = sentence.split();
words.OrderBy (w => w.length ).ToArray ()[2]

Count how many words in each sentence

I'm stuck on how to count how many words are in each sentence, an example of this is: string sentence = "hello how are you. I am good. that's good."
and have it come out like:
//sentence1: 4 words
//sentence2: 3 words
//sentence3: 2 words
I can get the number of sentences
public int GetNoOfWords(string s)
{
return s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
label2.Text = (GetNoOfWords(sentance).ToString());
and i can get the number of words in the whole string
public int CountWord (string text)
{
int count = 0;
for (int i = 0; i < text.Length; i++)
{
if (text[i] != ' ')
{
if ((i + 1) == text.Length)
{
count++;
}
else
{
if(text[i + 1] == ' ')
{
count++;
}
}
}
}
return count;
}
then button1
int words = CountWord(sentance);
label4.Text = (words.ToString());
But I can't count how many words are in each sentence.
Instead of looping over the string as you do in CountWords I would just use;
int words = s.Split(' ').Length;
It's much more clean and simple. You split on white spaces which returns an array of all the words, the length of that array is the number of words in the string.
Why not use Split instead?
var sentences = "hello how are you. I am good. that's good.";
foreach (var sentence in sentences.TrimEnd('.').Split('.'))
Console.WriteLine(sentence.Trim().Split(' ').Count());
If you want number of words in each sentence, you need to
string s = "This is a sentence. Also this counts. This one is also a thing.";
string[] sentences = s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string sentence in sentences)
{
Console.WriteLine(sentence.Split(' ').Length + " words in sentence *" + sentence + "*");
}
Use CountWord on each element of the array returned by s.Split:
string sentence = "hello how are you. I am good. that's good.";
string[] words = sentence.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries).Length;
for (string sentence in sentences)
{
int noOfWordsInSentence = CountWord(sentence);
}
string text = "hello how are you. I am good. that's good.";
string[] sentences = s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
IEnumerable<int> wordsPerSentence = sentences.Select(s => s.Trim().Split(' ').Length);
As noted in several answers here, look at String functions like Split, Trim, Replace, etc to get you going. All answers here will solve your simple example, but here are some sentences which they may fail to analyse correctly;
"Hello, how are you?" (no '.' to parse on)
"That apple costs $1.50." (a '.' used as a decimal)
"I like whitespace . "
"Word"
If you only need a count, I'd avoid Split() -- it takes up unnecessary space. Perhaps:
static int WordCount(string s)
{
int wordCount = 0;
for(int i = 0; i < s.Length - 1; i++)
if (Char.IsWhiteSpace(s[i]) && !Char.IsWhiteSpace(s[i + 1]) && i > 0)
wordCount++;
return ++wordCount;
}
public static void Main()
{
Console.WriteLine(WordCount(" H elloWor ld g ")); // prints "4"
}
It counts based on the number of spaces (1 space = 2 words). Consecutive spaces are ignored.
Does your spelling of sentence in:
int words = CountWord(sentance);
have anything to do with it?

C# split string but keep split chars / separators [duplicate]

I would like to split a string with delimiters but keep the delimiters in the result.
How would I do this in C#?
If the split chars were ,, ., and ;, I'd try:
using System.Text.RegularExpressions;
...
string[] parts = Regex.Split(originalString, #"(?<=[.,;])")
(?<=PATTERN) is positive look-behind for PATTERN. It should match at any place where the preceding text fits PATTERN so there should be a match (and a split) after each occurrence of any of the characters.
If you want the delimiter to be its "own split", you can use Regex.Split e.g.:
string input = "plum-pear";
string pattern = "(-)";
string[] substrings = Regex.Split(input, pattern); // Split on hyphens
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
// The method writes the following to the console:
// 'plum'
// '-'
// 'pear'
So if you are looking for splitting a mathematical formula, you can use the following Regex
#"([*()\^\/]|(?<!E)[\+\-])"
This will ensure you can also use constants like 1E-02 and avoid having them split into 1E, - and 02
So:
Regex.Split("10E-02*x+sin(x)^2", #"([*()\^\/]|(?<!E)[\+\-])")
Yields:
10E-02
*
x
+
sin
(
x
)
^
2
Building off from BFree's answer, I had the same goal, but I wanted to split on an array of characters similar to the original Split method, and I also have multiple splits per string:
public static IEnumerable<string> SplitAndKeep(this string s, char[] delims)
{
int start = 0, index;
while ((index = s.IndexOfAny(delims, start)) != -1)
{
if(index-start > 0)
yield return s.Substring(start, index - start);
yield return s.Substring(index, 1);
start = index + 1;
}
if (start < s.Length)
{
yield return s.Substring(start);
}
}
Just in case anyone wants this answer aswell...
Instead of string[] parts = Regex.Split(originalString, #"(?<=[.,;])") you could use string[] parts = Regex.Split(originalString, #"(?=yourmatch)") where yourmatch is whatever your separator is.
Supposing the original string was
777- cat
777 - dog
777 - mouse
777 - rat
777 - wolf
Regex.Split(originalString, #"(?=777)") would return
777 - cat
777 - dog
and so on
This version does not use LINQ or Regex and so it's probably relatively efficient. I think it might be easier to use than the Regex because you don't have to worry about escaping special delimiters. It returns an IList<string> which is more efficient than always converting to an array. It's an extension method, which is convenient. You can pass in the delimiters as either an array or as multiple parameters.
/// <summary>
/// Splits the given string into a list of substrings, while outputting the splitting
/// delimiters (each in its own string) as well. It's just like String.Split() except
/// the delimiters are preserved. No empty strings are output.</summary>
/// <param name="s">String to parse. Can be null or empty.</param>
/// <param name="delimiters">The delimiting characters. Can be an empty array.</param>
/// <returns></returns>
public static IList<string> SplitAndKeepDelimiters(this string s, params char[] delimiters)
{
var parts = new List<string>();
if (!string.IsNullOrEmpty(s))
{
int iFirst = 0;
do
{
int iLast = s.IndexOfAny(delimiters, iFirst);
if (iLast >= 0)
{
if (iLast > iFirst)
parts.Add(s.Substring(iFirst, iLast - iFirst)); //part before the delimiter
parts.Add(new string(s[iLast], 1));//the delimiter
iFirst = iLast + 1;
continue;
}
//No delimiters were found, but at least one character remains. Add the rest and stop.
parts.Add(s.Substring(iFirst, s.Length - iFirst));
break;
} while (iFirst < s.Length);
}
return parts;
}
Some unit tests:
text = "[a link|http://www.google.com]";
result = text.SplitAndKeepDelimiters('[', '|', ']');
Assert.IsTrue(result.Count == 5);
Assert.AreEqual(result[0], "[");
Assert.AreEqual(result[1], "a link");
Assert.AreEqual(result[2], "|");
Assert.AreEqual(result[3], "http://www.google.com");
Assert.AreEqual(result[4], "]");
A lot of answers to this! One I knocked up to split by various strings (the original answer caters for just characters i.e. length of 1). This hasn't been fully tested.
public static IEnumerable<string> SplitAndKeep(string s, params string[] delims)
{
var rows = new List<string>() { s };
foreach (string delim in delims)//delimiter counter
{
for (int i = 0; i < rows.Count; i++)//row counter
{
int index = rows[i].IndexOf(delim);
if (index > -1
&& rows[i].Length > index + 1)
{
string leftPart = rows[i].Substring(0, index + delim.Length);
string rightPart = rows[i].Substring(index + delim.Length);
rows[i] = leftPart;
rows.Insert(i + 1, rightPart);
}
}
}
return rows;
}
This seems to work, but its not been tested much.
public static string[] SplitAndKeepSeparators(string value, char[] separators, StringSplitOptions splitOptions)
{
List<string> splitValues = new List<string>();
int itemStart = 0;
for (int pos = 0; pos < value.Length; pos++)
{
for (int sepIndex = 0; sepIndex < separators.Length; sepIndex++)
{
if (separators[sepIndex] == value[pos])
{
// add the section of string before the separator
// (unless its empty and we are discarding empty sections)
if (itemStart != pos || splitOptions == StringSplitOptions.None)
{
splitValues.Add(value.Substring(itemStart, pos - itemStart));
}
itemStart = pos + 1;
// add the separator
splitValues.Add(separators[sepIndex].ToString());
break;
}
}
}
// add anything after the final separator
// (unless its empty and we are discarding empty sections)
if (itemStart != value.Length || splitOptions == StringSplitOptions.None)
{
splitValues.Add(value.Substring(itemStart, value.Length - itemStart));
}
return splitValues.ToArray();
}
Recently I wrote an extension method do to this:
public static class StringExtensions
{
public static IEnumerable<string> SplitAndKeep(this string s, string seperator)
{
string[] obj = s.Split(new string[] { seperator }, StringSplitOptions.None);
for (int i = 0; i < obj.Length; i++)
{
string result = i == obj.Length - 1 ? obj[i] : obj[i] + seperator;
yield return result;
}
}
}
I'd say the easiest way to accomplish this (except for the argument Hans Kesting brought up) is to split the string the regular way, then iterate over the array and add the delimiter to every element but the last.
To avoid adding character to new line try this :
string[] substrings = Regex.Split(input,#"(?<=[-])");
result = originalString.Split(separator);
for(int i = 0; i < result.Length - 1; i++)
result[i] += separator;
(EDIT - this is a bad answer - I misread his question and didn't see that he was splitting by multiple characters.)
(EDIT - a correct LINQ version is awkward, since the separator shouldn't get concatenated onto the final string in the split array.)
Iterate through the string character by character (which is what regex does anyway.
When you find a splitter, then spin off a substring.
pseudo code
int hold, counter;
List<String> afterSplit;
string toSplit
for(hold = 0, counter = 0; counter < toSplit.Length; counter++)
{
if(toSplit[counter] = /*split charaters*/)
{
afterSplit.Add(toSplit.Substring(hold, counter));
hold = counter;
}
}
That's sort of C# but not really. Obviously, choose the appropriate function names.
Also, I think there might be an off-by-1 error in there.
But that will do what you're asking.
veggerby's answer modified to
have no string items in the list
have fixed string as delimiter like "ab" instead of single character
var delimiter = "ab";
var text = "ab33ab9ab"
var parts = Regex.Split(text, $#"({Regex.Escape(delimiter)})")
.Where(p => p != string.Empty)
.ToList();
// parts = "ab", "33", "ab", "9", "ab"
The Regex.Escape() is there just in case your delimiter contains characters which regex interprets as special pattern commands (like *, () and thus have to be escaped.
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace ConsoleApplication9
{
class Program
{
static void Main(string[] args)
{
string input = #"This;is:a.test";
char sep0 = ';', sep1 = ':', sep2 = '.';
string pattern = string.Format("[{0}{1}{2}]|[^{0}{1}{2}]+", sep0, sep1, sep2);
Regex regex = new Regex(pattern);
MatchCollection matches = regex.Matches(input);
List<string> parts=new List<string>();
foreach (Match match in matches)
{
parts.Add(match.ToString());
}
}
}
}
I wanted to do a multiline string like this but needed to keep the line breaks so I did this
string x =
#"line 1 {0}
line 2 {1}
";
foreach(var line in string.Format(x, "one", "two")
.Split("\n")
.Select(x => x.Contains('\r') ? x + '\n' : x)
.AsEnumerable()
) {
Console.Write(line);
}
yields
line 1 one
line 2 two
I came across same problem but with multiple delimiters. Here's my solution:
public static string[] SplitLeft(this string #this, char[] delimiters, int count)
{
var splits = new List<string>();
int next = -1;
while (splits.Count + 1 < count && (next = #this.IndexOfAny(delimiters, next + 1)) >= 0)
{
splits.Add(#this.Substring(0, next));
#this = new string(#this.Skip(next).ToArray());
}
splits.Add(#this);
return splits.ToArray();
}
Sample with separating CamelCase variable names:
var variableSplit = variableName.SplitLeft(
Enumerable.Range('A', 26).Select(i => (char)i).ToArray());
I wrote this code to split and keep delimiters:
private static string[] SplitKeepDelimiters(string toSplit, char[] delimiters, StringSplitOptions splitOptions = StringSplitOptions.None)
{
var tokens = new List<string>();
int idx = 0;
for (int i = 0; i < toSplit.Length; ++i)
{
if (delimiters.Contains(toSplit[i]))
{
tokens.Add(toSplit.Substring(idx, i - idx)); // token found
tokens.Add(toSplit[i].ToString()); // delimiter
idx = i + 1; // start idx for the next token
}
}
// last token
tokens.Add(toSplit.Substring(idx));
if (splitOptions == StringSplitOptions.RemoveEmptyEntries)
{
tokens = tokens.Where(token => token.Length > 0).ToList();
}
return tokens.ToArray();
}
Usage example:
string toSplit = "AAA,BBB,CCC;DD;,EE,";
char[] delimiters = new char[] {',', ';'};
string[] tokens = SplitKeepDelimiters(toSplit, delimiters, StringSplitOptions.RemoveEmptyEntries);
foreach (var token in tokens)
{
Console.WriteLine(token);
}

Categories

Resources