I need to split a string into paragraphs and count them (paragraphs are separated by two or more empty lines).
In addition, I need to read each word of the text and be able to tell which paragraph the word belongs to.
For example (each paragraph spans more than one line, and two empty lines separate the paragraphs):
This is
the first
paragraph


This is
the second
paragraph


This is
the third
paragraph
Something like this should work for you:
var paragraphMarker = Environment.NewLine + Environment.NewLine;
var paragraphs = fileText.Split(new[] {paragraphMarker},
StringSplitOptions.RemoveEmptyEntries);
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
You may need to change the line delimiter, because files can use different variants such as "\n", "\r", or "\r\n".
You can also pass specific characters to the Trim function to remove symbols such as '.', ',', '!', '"' and others.
Edit: To add more flexibility you can use regexp for splitting paragraphs:
var paragraphs = Regex.Split(fileText, @"(\r\n?|\n){2}")
.Where(p => p.Any(char.IsLetterOrDigit));
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
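The question also asks which paragraph each word belongs to. A minimal sketch of that, assuming that a paragraph ends at a line break followed by at least one blank line and that words are separated by whitespace. The names ParagraphIndexer and IndexWords are hypothetical and do not come from the answer above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ParagraphIndexer
{
    // Returns every word together with its one-based paragraph number.
    public static List<(int Paragraph, string Word)> IndexWords(string text)
    {
        // A line break followed by one or more blank (whitespace-only) lines ends a paragraph.
        // Non-capturing groups keep the separators out of the Regex.Split result.
        var paragraphs = Regex.Split(text, @"(?:\r\n?|\n)(?:[ \t]*(?:\r\n?|\n))+")
            .Where(p => p.Any(char.IsLetterOrDigit))
            .ToList();

        var result = new List<(int Paragraph, string Word)>();
        for (int i = 0; i < paragraphs.Count; i++)
        {
            // A null separator array makes Split break on any whitespace, including line breaks.
            var words = paragraphs[i].Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                result.Add((i + 1, word));
            }
        }
        return result;
    }
}
```

The paragraph count is the Paragraph value of the last entry (or 0 for an empty list). If paragraphs must be separated by exactly two empty lines, the pattern would need to be tightened accordingly.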
I think you want to split the text into paragraphs, but do you have a delimiter that tells you where to split the string? For example, if you want to use "." as the delimiter, this should do the trick:
string paragraphs="My first paragraph. Once upon a time";
string[] words = paragraphs.Split('.');
foreach (string word in words)
{
Console.WriteLine(word);
}
The result for this will be:
My first paragraph
Once upon a time
Just remember that the "." character is removed by the split.
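public static List<string> SplitLine(string isstr, int size = 100)
{
    // Break the input into words, ignoring repeated spaces.
    var words = isstr.Split(new[] { ' ' },
        StringSplitOptions.RemoveEmptyEntries);
    List<string> lo = new List<string>();
    string tmp = "";
    for (int i = 0; i < words.Length; i++)
    {
        // Start a new chunk when adding the next word (plus a space) would exceed the limit.
        if (tmp.Length > 0 && (tmp.Length + 1 + words[i].Length) > size)
        {
            lo.Add(tmp);
            tmp = "";
        }
        // Join words with a single space, without a leading space on each chunk.
        tmp += (tmp.Length > 0 ? " " : "") + words[i];
    }
    if (!String.IsNullOrWhiteSpace(tmp))
    {
        lo.Add(tmp);
    }
    return lo;
}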
I have a text file from which I read the text in lines. Also from all that text I need to find the longest sentence and find in which line it begins. I have no trouble finding the longest sentence but the problem arises when I need to find where it begins.
The contents of the text file are:
V. M. Putinas
Margi sakalai
Lydėdami gęstančią žarą vėlai
Pakilo į dangų;;, margi sakalai.
Paniekinę žemės vylingus sapnus,
Padangėje ištiesė,,; savo sparnus.
Ir tarė margieji: negrįšim į žemę,
Kol josios kalnai ir pakalnės aptemę.
My code:
static void Sakiniai (string fv, string skyrikliai)
{
char[] skyrikliaiSak = { '.', '!', '?' };
string naujas = "";
string[] lines = File.ReadAllLines(fv, Encoding.GetEncoding(1257));
foreach (string line in lines)
{
// Add lines into a string so I can separate them into sentences
naujas += line;
}
// Separating into sentences
string[] sakiniai = naujas.Split(skyrikliaiSak);
// This method finds the longest sentence
string ilgiausiasSak = RastiIlgiausiaSakini(sakiniai);
}
From the text file the longest sentence is: "Margi sakalai Lydėdami gęstančią žarą vėlai Pakilo į dangų;;, margi sakalai"
How can I find the exact line where the sentence begins?
What about a nested for loop? If two sentences are the same length, this just finds the first one.
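var lines = File.ReadAllLines(fv, Encoding.GetEncoding(1257));
var terminators = new HashSet<char> { '.', '?', '!' };
var currentLength = 0;
var currentSentence = new StringBuilder();
var currentStartLine = default(int?); // line on which the current sentence began
var maxLength = 0;
var maxLine = default(int?);          // zero-based line on which the longest sentence begins
var maxSentence = "";
for (var currentLine = 0; currentLine < lines.Length; currentLine++)
{
    foreach (var character in lines[currentLine])
    {
        if (terminators.Contains(character))
        {
            if (currentLength > maxLength)
            {
                maxLength = currentLength;
                maxLine = currentStartLine;
                maxSentence = currentSentence.ToString();
            }
            currentLength = 0;
            currentSentence.Clear();
            currentStartLine = null;
        }
        else
        {
            // Remember the line of the first non-whitespace character of each sentence.
            if (currentStartLine == null && !char.IsWhiteSpace(character))
            {
                currentStartLine = currentLine;
            }
            currentLength++;
            currentSentence.Append(character);
        }
    }
}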
First, find the start index of the longest sentence in the whole text:
int startIdx = naujas.IndexOf(ilgiausiasSak);
Then loop over the lines to find out which line startIdx falls in:
int i = 0;
while (i < lines.Length && startIdx >= 0)
{
startIdx -= lines[i].Length;
i++;
}
// do stuff with i
After the loop, i is the one-based number of the line where the longest sentence starts; for example, i = 2 means it starts on the second line.
Build an index that solves your problem.
We can make a straightforward modification of your existing code:
var lineOffsets = new List<int>();
lineOffsets.Add(0);
foreach (string line in lines)
{
// Add lines into a string so I can separate them into sentences
naujas += line;
lineOffsets.Add(naujas.Length);
}
All right; now you have a list of the character offset in your final string corresponding to each line.
You have a substring of the big string. You can use IndexOf to find the offset of the substring in the big string. Then you can search through the list to find the index of the last element that is less than or equal to that offset. That's the line number.
If the list is large, you can binary search it.
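The lookup described above can be sketched as a hand-written binary search. This assumes lineOffsets is sorted in ascending order, as it is when built by the loop above. The names LineIndex and LineOfOffset are hypothetical:

```csharp
using System;
using System.Collections.Generic;

static class LineIndex
{
    // Returns the zero-based index of the last entry in lineOffsets that is
    // less than or equal to offset, or -1 if every entry is larger.
    public static int LineOfOffset(IReadOnlyList<int> lineOffsets, int offset)
    {
        int lo = 0;
        int hi = lineOffsets.Count; // search the half-open range [lo, hi)
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (lineOffsets[mid] <= offset)
            {
                lo = mid + 1; // the answer is at mid or to its right
            }
            else
            {
                hi = mid;     // the answer is to the left of mid
            }
        }
        // lo is now the first index whose value is greater than offset.
        return lo - 1;
    }
}
```

A call such as LineIndex.LineOfOffset(lineOffsets, naujas.IndexOf(ilgiausiasSak)) would then give the zero-based line. Searching for the first larger entry, rather than for an exact match, also gives a well-defined result when empty lines produce repeated offsets.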
How about the following approach:
identify the lines in the text
split the text into sentences
split the sentences into sections based on the line breaks (this could also work by splitting on words if needed)
verify that the sections of a sentence are on consecutive rows
In the end, some sections of a sentence might also occur on other lines as parts of other sentences, so you need to correctly identify the sentences that span consecutive rows.
// define separators for various contexts
var separator = new
{
Lines = new[] { '\n' },
Sentences = new[] { '.', '!', '?' },
Sections = new[] { '\n' },
};
// isolate the lines and their corresponding number
var lines = paragraph
.Split(separator.Lines, StringSplitOptions.RemoveEmptyEntries)
.Select((text, number) => new
{
Number = number + 1,
Text = text,
})
.ToList();
// isolate the sentences with corresponding sections and line numbers
var sentences = paragraph
.Split(separator.Sentences, StringSplitOptions.RemoveEmptyEntries)
.Select(sentence => sentence.Trim())
.Select(sentence => new
{
Text = sentence,
Length = sentence.Length,
Sections = sentence
.Split(separator.Sections)
.Select((section, index) => new
{
Index = index,
Text = section,
Lines = lines
.Where(line => line.Text.Contains(section))
.Select(line => line.Number)
})
.OrderBy(section => section.Index)
})
.OrderByDescending(p => p.Length)
.ToList();
// build the possible combinations of sections within a sentence
// and filter only those that are on consecutive lines
var results = from sentence in sentences
let occurences = sentence.Sections
.Select(p => p.Lines)
.Cartesian()
.Where(p => p.Consecutive())
.SelectMany(p => p)
select new
{
Text = sentence.Text,
Length = sentence.Length,
Lines = occurences,
};
and the end results would look like this
where .Cartesian and .Consecutive are just helper extension methods over IEnumerable (see the associated gist for the entire source code in a LINQPad-ready format)
public static IEnumerable<T> Yield<T>(this T instance)
{
yield return instance;
}
public static IEnumerable<IEnumerable<T>> Cartesian<T>(this IEnumerable<IEnumerable<T>> instance)
{
var seed = Enumerable.Empty<T>().Yield();
return instance.Aggregate(seed, (accumulator, sequence) =>
{
var results = from vector in accumulator
from item in sequence
select vector.Concat(new[]
{
item
});
return results;
});
}
public static bool Consecutive(this IEnumerable<int> instance)
{
var distinct = instance.Distinct().ToList();
return distinct
.Zip(distinct.Skip(1), (a, b) => a + 1 == b)
.All(p => p);
}
I want to compare two strings. The first comes from a dateTimePicker and the second from a file.
string firtsdate = dateTimePicker1.Value.ToString("yyyy-MM-dd");
string seconddate = dateTimePicker2.Value.ToString("yyyy-MM-dd");
string FilePath = path;
string fileContent = File.ReadAllText(FilePath);
string[] integerStrings = fileContent.Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
int count = 0;
for (int n = 0; n < integerStrings.Length;)
{
count = integerStrings[n].Length;
//Console.Write(count + "\n");
count--;
if (count > 2)
{
string datastart;
string dataend;
if (integerStrings[n] == firtsdate)
{
datastart = integerStrings[n];
Console.Write(datastart);
dataend = (DateTime.Parse(datastart).AddDays(1)).ToShortDateString();
Console.Write(dataend + "\n");
}
else
{
n = n + 7;
}
}
}
File looks like this:
2016-07-01
2016-07-02
2016-07-06
...
The problem is that two identical values do not compare as equal, for example 2016-07-02 == 2016-07-02 (from the file).
I suspect this is the problem:
string fileContent = File.ReadAllText(FilePath);
string[] integerStrings = fileContent.Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
A line break on Windows is "\r\n" - so each line in your split will end in a \r. The simplest way to fix this is to just replace those two lines with:
string[] integerStrings = File.ReadAllLines(FilePath);
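If the text is already in memory and File.ReadAllLines is not an option, splitting on all three line-break variants avoids the leftover "\r". A minimal sketch; the names LineEndings, SplitLines, and HasStrayCarriageReturn are hypothetical:

```csharp
using System;

static class LineEndings
{
    // Splits on Windows ("\r\n"), classic Mac ("\r"), and Unix ("\n") line breaks.
    // "\r\n" is listed first so that it is consumed as a single separator.
    public static string[] SplitLines(string content)
    {
        return content.Split(new[] { "\r\n", "\r", "\n" },
            StringSplitOptions.RemoveEmptyEntries);
    }

    // True when a line still carries a trailing carriage return,
    // which is what makes "2016-07-02\r" differ from "2016-07-02".
    public static bool HasStrayCarriageReturn(string line)
    {
        return line.EndsWith("\r", StringComparison.Ordinal);
    }
}
```

For the content "2016-07-01\r\n2016-07-02\r\n", splitting on '\n' alone yields "2016-07-01\r" as the first element, which is why the equality test in the question fails, whereas SplitLines returns the dates without the "\r".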
If you are sure about your date format and the strings are correct, you can compare the two strings with Equals or Compare.
The end-of-line sequence is \n (line feed) on Linux, \r (carriage return) on classic Mac OS, and \r\n on Windows, so you should split on all of these or read the file line by line.
Let's start with this: I have a textbox, txtProbe, that contains 12-34-56-78-90. I want to parse the numbers into different labels or textboxes; for now, into just one textbox, txtParse. I tried the code below, which removes the "-" and then tries to display the parts, but nothing happens:
{
char[] delemiterChars = { '-' };
string text = txtProbe.Text;
string[] words = text.Split(delemiterChars);
txtParse.Text = text;
foreach (string s in words)
{
txtParse.Text = s;
}
}
EDIT:
I want to set the received information in different labels:
12-label1
34-label2
56-label3
78-label4
90-label5
You could just use String.Replace:
txtParse.Text = txtProbe.Text.Replace("-", " ");
The following will do the trick more "semantically":
var parsed = text.Replace("-", " ");
You might be changing the value too early in the Page Life Cycle (in the case of Web Forms), with regards to why you're not seeing the parsed value in the server control.
For your specific sample it seems that you could just replace '-' by ' '.
txtParse.Text = txtProbe.Text.Replace('-', ' ');
But in order to join an array of strings using a white space separator, you could use this
txtParse.Text = string.Join(" ", words);
Your logic is not appropriate for the task you're trying to achieve, but for learning purposes here is a corrected version of your snippet:
string separator = string.Empty; // starts empty so doesn't apply for first element
foreach (string s in words)
{
txtParse.Text += separator + s; // You need to use += operator to append content
separator = " "; // from second element will append " "
}
EDIT
This is for the case of using different labels
Label[] labelList = new Label[] {label1, label2, label3, label4, label5};
for (int i = 0; i < words.Length; i++)
{
labelList[i].Text = words[i];
}
You can use String.Join to join collection of strings as you want. In your case:
txtParse.Text = String.Join(" ", words);
You can pass the delimiter character directly, as in .Split('-'), to get an array of strings representing the data you want.
Your problem is that you keep assigning s to the Text property. The end result will be the last s in your array.
You can use TextBox.AppendText() instead.
char[] delemiterChars = { '-' };
string text = txtProbe.Text;
string[] words = text.Split(delemiterChars);
txtParse.Text = string.Empty; // start from an empty box so the original text is not kept in front
foreach (string s in words)
{
txtParse.AppendText(s + " ");
}
You can just put txtParse.Text = txtProbe.Text.Replace('-', ' ');
Or by modifying your code:
char[] delemiterChars = { '-' };
string text = txtProbe.Text;
string[] words = text.Split(delemiterChars,StringSplitOptions.RemoveEmptyEntries);
foreach (string s in words)
{
txtParse.Text += " " + s;
}
I'm stuck on how to count the words in each sentence. For example, given string sentence = "hello how are you. I am good. that's good."
I want output like:
//sentence1: 4 words
//sentence2: 3 words
//sentence3: 2 words
I can get the number of sentences
public int GetNoOfWords(string s)
{
return s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
label2.Text = (GetNoOfWords(sentance).ToString());
and I can get the number of words in the whole string
public int CountWord (string text)
{
int count = 0;
for (int i = 0; i < text.Length; i++)
{
if (text[i] != ' ')
{
if ((i + 1) == text.Length)
{
count++;
}
else
{
if(text[i + 1] == ' ')
{
count++;
}
}
}
}
return count;
}
then button1
int words = CountWord(sentance);
label4.Text = (words.ToString());
But I can't count how many words are in each sentence.
Instead of looping over the string as you do in CountWord, I would just use:
int words = s.Split(' ').Length;
It's much cleaner and simpler. Splitting on spaces returns an array of all the words, and the length of that array is the number of words in the string (provided the words are separated by single spaces).
Why not use Split instead?
var sentences = "hello how are you. I am good. that's good.";
foreach (var sentence in sentences.TrimEnd('.').Split('.'))
Console.WriteLine(sentence.Trim().Split(' ').Count());
If you want the number of words in each sentence, you need to split the text into sentences first and then count the words in each one:
string s = "This is a sentence. Also this counts. This one is also a thing.";
string[] sentences = s.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string sentence in sentences)
{
Console.WriteLine(sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length + " words in sentence *" + sentence + "*");
}
Use CountWord on each element of the array returned by s.Split:
string text = "hello how are you. I am good. that's good.";
// Split returns the sentences; the original assigned .Length (an int) to a string[]
string[] sentences = text.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sentence in sentences)
{
    int noOfWordsInSentence = CountWord(sentence);
}
string text = "hello how are you. I am good. that's good.";
string[] sentences = text.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
IEnumerable<int> wordsPerSentence = sentences.Select(s => s.Trim().Split(' ').Length);
As noted in several answers here, look at String methods such as Split, Trim, and Replace to get you going. All the answers here solve your simple example, but here are some sentences that they may fail to analyse correctly:
"Hello, how are you?" (no '.' to parse on)
"That apple costs $1.50." (a '.' used as a decimal)
"I like whitespace . "
"Word"
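One way to cope with the cases listed above is to treat a '.', '!' or '?' as a sentence end only when whitespace follows it. This is a rough sketch under that assumption, not a full sentence parser: abbreviations such as "V. M." would still be split. The names SentenceStats and WordsPerSentence are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SentenceStats
{
    // Returns the number of words in each sentence, in order of appearance.
    public static List<int> WordsPerSentence(string text)
    {
        // Split on whitespace that follows '.', '!' or '?', so "$1.50" stays in one piece.
        var sentences = Regex.Split(text, @"(?<=[.!?])\s+")
            .Where(s => s.Any(char.IsLetterOrDigit));

        return sentences
            .Select(s => s.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                // Ignore tokens that are only punctuation, such as a stray ".".
                .Count(token => token.Any(char.IsLetterOrDigit)))
            .ToList();
    }
}
```

For the question's input, "hello how are you. I am good. that's good.", this returns the counts 4, 3 and 2.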
If you only need a count, I'd avoid Split() -- it takes up unnecessary space. Perhaps:
static int WordCount(string s)
{
int wordCount = 0;
for(int i = 0; i < s.Length - 1; i++)
if (Char.IsWhiteSpace(s[i]) && !Char.IsWhiteSpace(s[i + 1]) && i > 0)
wordCount++;
return ++wordCount;
}
public static void Main()
{
Console.WriteLine(WordCount(" H elloWor ld g ")); // prints "4"
}
It counts based on the number of spaces (1 space = 2 words). Consecutive spaces are ignored.
Does your spelling of sentence in:
int words = CountWord(sentance);
have anything to do with it?
How can I add a string after some text if the string is not already there?
I have a textbox with the following lines:
name:username thumbnail:example.com message:hello
name:username message:hi
name:username message:hey
How can I add thumbnail:example.com after name:username on the second and third lines, but not on the first?
Edit: I didn't notice that you are reading from a textbox; you'll have to join the textbox lines into one string to use my example. You can do that with string.Join().
Try this... this assumes that there are no spaces allowed in the username. There are probably plenty of better/more efficient ways to do this, but this should work.
var sbOut = new StringBuilder();
var combined = String.Join(Environment.NewLine, textbox1.Lines);
//split string on "name:" rather than on lines
string[] lines = combined.Split(new string[] { "name:" }, StringSplitOptions.RemoveEmptyEntries);
foreach (var item in lines)
{
//add name back in as split strips it out
sbOut.Append("name:");
//find first space
var found = item.IndexOf(" ");
//add username IMPORTANT assumes no spaces in username
sbOut.Append(item.Substring(0, found + 1));
//Add thumbnail:example.com if it doesn't exist
if (!item.Substring(found + 1).StartsWith("thumbnail:example.com"))
sbOut.Append("thumbnail:example.com ");
//Add the rest of the string
sbOut.Append(item.Substring(found + 1));
}
var lines = textbox.Text.Split(new string[] { Environment.NewLine.ToString() }, StringSplitOptions.RemoveEmptyEntries);
textbox.Text = string.Empty;
for (int i = 0; i < lines.Length; i++)
{
if (!lines[i].Contains("thumbnail:example.com") && lines[i].Contains("name:"))
{
lines[i] = lines[i].Insert(lines[i].IndexOf(' '), " thumbnail:example.com");
}
}
textbox.Text = string.Join(Environment.NewLine, lines);
Hope this helps.