Having trouble finding the start of a sentence inside string array - c#

I have a text file that I read line by line. From that text I need to find the longest sentence and the line on which it begins. Finding the longest sentence is no trouble; the problem is finding where it begins.
The contents of the text file are:
V. M. Putinas
Margi sakalai
Lydėdami gęstančią žarą vėlai
Pakilo į dangų;;, margi sakalai.
Paniekinę žemės vylingus sapnus,
Padangėje ištiesė,,; savo sparnus.
Ir tarė margieji: negrįšim į žemę,
Kol josios kalnai ir pakalnės aptemę.
My code:
static void Sakiniai (string fv, string skyrikliai)
{
char[] skyrikliaiSak = { '.', '!', '?' };
string naujas = "";
string[] lines = File.ReadAllLines(fv, Encoding.GetEncoding(1257));
foreach (string line in lines)
{
// Add lines into a string so I can separate them into sentences
naujas += line;
}
// Separating into sentences
string[] sakiniai = naujas.Split(skyrikliaiSak);
// This method finds the longest sentence
string ilgiausiasSak = RastiIlgiausiaSakini(sakiniai);
}
From the text file the longest sentence is: "Margi sakalai Lydėdami gęstančią žarą vėlai Pakilo į dangų;;, margi sakalai"
How can I find the exact line where the sentence begins?

What about a nested for loop? If two sentences are the same length, this just finds the first one.
var lines = File.ReadAllLines(fv, Encoding.GetEncoding(1257));
var terminators = new HashSet<char> { '.', '?', '!' };
var currentLength = 0;
var currentSentence = new StringBuilder();
var maxLength = 0;
var maxLine = default(int?);
var maxSentence = "";
// Line on which the sentence currently being read started
var sentenceStartLine = 0;
for (var currentLine = 0; currentLine < lines.Length; currentLine++)
{
foreach (var character in lines[currentLine])
{
if (terminators.Contains(character))
{
if (currentLength > maxLength)
{
maxLength = currentLength;
// Record where the sentence began, not where it ended
maxLine = sentenceStartLine;
maxSentence = currentSentence.ToString();
}
currentLength = 0;
currentSentence.Clear();
}
else
{
if (currentLength == 0)
{
sentenceStartLine = currentLine;
}
currentLength++;
currentSentence.Append(character);
}
}
}

First, find the start index of the longest sentence in the concatenated text:
int startIdx = naujas.IndexOf(ilgiausiasSak);
Then loop over the lines to find out which line startIdx falls in:
int i = 0;
while (i < lines.Length && startIdx >= 0)
{
startIdx -= lines[i].Length;
i++;
}
// do stuff with i
When the loop ends, i is the 1-based number of the line on which the longest sentence starts; for example, i = 2 means it starts on the second line.

Build an index that solves your problem.
We can make a straightforward modification of your existing code:
var lineOffsets = new List<int>();
lineOffsets.Add(0);
foreach (string line in lines)
{
// Add lines into a string so I can separate them into sentences
naujas += line;
lineOffsets.Add(naujas.Length);
}
All right; now you have a list of the character offset in your final string corresponding to each line.
You have a substring of the big string. You can use IndexOf to find the offset of the substring in the big string. Then you can search the list for the last element that is less than or equal to that offset; its list index is the (zero-based) line number.
If the list is large, you can binary search it.
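A minimal sketch of that lookup, assuming the lines are concatenated without separators as in the question. LineIndex, FindStartLine and LineOfOffset are illustrative names, not part of the original code; the search itself uses List<T>.BinarySearch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineIndex
{
    // Returns the zero-based index of the line that contains the character at
    // 'offset', given the start offset of every line in the concatenated text.
    public static int LineOfOffset(List<int> lineStarts, int offset)
    {
        int pos = lineStarts.BinarySearch(offset);
        if (pos < 0)
        {
            // No exact match: ~pos is the index of the first larger start offset.
            return ~pos - 1;
        }
        // Empty lines share a start offset with the line that follows them,
        // so move forward to the last line that starts at this offset.
        while (pos + 1 < lineStarts.Count && lineStarts[pos + 1] == offset)
        {
            pos++;
        }
        return pos;
    }

    // Returns the zero-based line on which 'sentence' begins, or -1 if not found.
    public static int FindStartLine(string[] lines, string sentence)
    {
        var lineStarts = new List<int>();
        var text = new StringBuilder();
        foreach (string line in lines)
        {
            lineStarts.Add(text.Length);
            text.Append(line);
        }
        int offset = text.ToString().IndexOf(sentence, StringComparison.Ordinal);
        return offset < 0 ? -1 : LineOfOffset(lineStarts, offset);
    }
}
```

The sentence passed in must be exactly the substring produced by Split (untrimmed); otherwise IndexOf may not find it.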

How about
identify the lines in the text
split the text into sentences
split the sentences into sections based on the line breaks (could work also with splitting on words as well if needed)
verify the sections of the sentence are on consecutive rows
In the end, some sections of a sentence might also occur on other lines as part of other sentences, so you need to correctly identify the sentences that span consecutive rows.
// define separators for various contexts
var separator = new
{
Lines = new[] { '\n' },
Sentences = new[] { '.', '!', '?' },
Sections = new[] { '\n' },
};
// isolate the lines and their corresponding number
var lines = paragraph
.Split(separator.Lines, StringSplitOptions.RemoveEmptyEntries)
.Select((text, number) => new
{
Number = number + 1,
Text = text,
})
.ToList();
// isolate the sentences with corresponding sections and line numbers
var sentences = paragraph
.Split(separator.Sentences, StringSplitOptions.RemoveEmptyEntries)
.Select(sentence => sentence.Trim())
.Select(sentence => new
{
Text = sentence,
Length = sentence.Length,
Sections = sentence
.Split(separator.Sections)
.Select((section, index) => new
{
Index = index,
Text = section,
Lines = lines
.Where(line => line.Text.Contains(section))
.Select(line => line.Number)
})
.OrderBy(section => section.Index)
})
.OrderByDescending(p => p.Length)
.ToList();
// build the possible combinations of sections within a sentence
// and filter only those that are on consecutive lines
var results = from sentence in sentences
let occurences = sentence.Sections
.Select(p => p.Lines)
.Cartesian()
.Where(p => p.Consecutive())
.SelectMany(p => p)
select new
{
Text = sentence.Text,
Length = sentence.Length,
Lines = occurences,
};
Here .Cartesian and .Consecutive are helper extension methods over enumerables (see the associated gist for the entire source code in LINQPad-ready format):
public static IEnumerable<T> Yield<T>(this T instance)
{
yield return instance;
}
public static IEnumerable<IEnumerable<T>> Cartesian<T>(this IEnumerable<IEnumerable<T>> instance)
{
var seed = Enumerable.Empty<T>().Yield();
return instance.Aggregate(seed, (accumulator, sequence) =>
{
var results = from vector in accumulator
from item in sequence
select vector.Concat(new[]
{
item
});
return results;
});
}
public static bool Consecutive(this IEnumerable<int> instance)
{
var distinct = instance.Distinct().ToList();
return distinct
.Zip(distinct.Skip(1), (a, b) => a + 1 == b)
.All(p => p);
}

Related

Remove Identical Words from a string array

The goal is to remove a common prefix from every string in a string array. For example, given ["Market1", "Market2", "Market3"], the prefix Market is shared by all elements, so removing it should give ["1", "2", "3"]. Note that the prefix could be anything, not just Market.
Look for the first character that is not identical among all strings and select a substring starting at that position to remove the prefix.
string[] words = new string[] { "Market1", "Market2", "Market3" };
int i = 0;
while (words.All(word => word.Length > i && word[i] == words[0][i])) ++i;
var wordsWithoutPrefixes = words.Select(word => word.Substring(i)).ToArray();
Join the array into a delimited string, replace every occurrence of Market with an empty string, and then split the string back into an array.
string[] arr = new string[] { "Market1", "Market2", "Market3" };
string[] result = string.Join(".", arr).Replace("Market", "").Split('.');
Loop through each item in the array and, if the item starts with the prefix, chop the prefix off the beginning.
var commonPrefix = "Market";
for (int i = 0; i < arr.Length; i++) {
if (arr[i].IndexOf(commonPrefix) == 0) {
arr[i] = arr[i].Substring(commonPrefix.Length);
}
}
You can use LINQ:
string[] myArray = { "Market1", "Market2", "Market3" };
string prefix = myArray[0];
foreach (var s in myArray)
{
while (!s.StartsWith(prefix))
prefix = prefix.Substring(0, prefix.Length - 1);
}
string[] result = myArray
.Select(s => s.Substring(prefix.Length))
.ToArray();
Loop through the array of strings and replace the prefix with an empty string.
string[] s = new string[] { "Market1", "Market2", "Market3" };
string prefix = "Market";
// A foreach iteration variable is read-only, so index into the array instead
for (int i = 0; i < s.Length; i++)
{
if (s[i].Contains(prefix))
{
s[i] = s[i].Replace(prefix, "");
}
}

How to split text into paragraphs?

I need to split a string into paragraphs and count those paragraphs (paragraphs are separated by 2 or more empty lines).
In addition, I need to read each word from the text and be able to tell which paragraph the word belongs to.
For example (each paragraph spans more than one line, and two empty lines separate the paragraphs):
This is
the first
paragraph
This is
the second
paragraph
This is
the third
paragraph
Something like this should work for you:
var paragraphMarker = Environment.NewLine + Environment.NewLine;
var paragraphs = fileText.Split(new[] {paragraphMarker},
StringSplitOptions.RemoveEmptyEntries);
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
You may need to change the line delimiter: files can use different variants such as "\n", "\r", or "\r\n".
You can also pass specific characters to Trim to remove symbols such as '.', ',', '!', '"' and others.
Edit: For more flexibility, you can use a regular expression to split the paragraphs:
// The atomic group (?>...) keeps a single "\r\n" from being counted as two line breaks
var paragraphs = Regex.Split(fileText, @"(?>\r\n|\r|\n){2,}")
.Where(p => p.Any(char.IsLetterOrDigit));
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
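The question also asks which paragraph each word belongs to. A minimal sketch of one way to do that, assuming a paragraph break is any run of two or more line breaks once "\r\n" and "\r" have been normalized to "\n". ParagraphWords and Index are illustrative names, and the set of trimmed punctuation characters is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphWords
{
    // Returns (paragraph number, word) pairs; paragraph numbers start at 1.
    public static List<(int Paragraph, string Word)> Index(string text)
    {
        var result = new List<(int Paragraph, string Word)>();
        // Normalize all line-ending variants to "\n" first.
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');
        string[] paragraphs = Regex.Split(normalized, @"\n{2,}");
        int number = 0;
        foreach (string paragraph in paragraphs)
        {
            string[] words = paragraph.Split(
                new[] { ' ', '\t', '\n' },
                StringSplitOptions.RemoveEmptyEntries);
            if (words.Length == 0)
            {
                continue; // skip blank chunks at the start or end of the text
            }
            number++;
            foreach (string word in words)
            {
                result.Add((number, word.Trim('.', ',', '!', '?', '"')));
            }
        }
        return result;
    }
}
```

The paragraph count is then the Paragraph value of the last entry, or 0 when the list is empty.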
I assume you want to split the text into paragraphs, but do you have a delimiter that tells you where to split? For example, if each paragraph ends with ".", this should do the trick:
string paragraphs="My first paragraph. Once upon a time";
string[] words = paragraphs.Split('.');
foreach (string word in words)
{
Console.WriteLine(word);
}
The result for this will be:
My first paragraph
Once upon a time
Just remember that the "." character is removed!
public static List<string> SplitLine(string isstr, int size = 100)
{
var words = isstr.Split(new[] { ' ' },
StringSplitOptions.RemoveEmptyEntries);
List<string> lo = new List<string>();
string tmp = "";
int i = 0;
for (i = 0; i < words.Length; i++)
{
if ((tmp.Length + words[i].Length) > size)
{
lo.Add(tmp);
tmp = "";
}
tmp += " " + words[i];
}
if (!String.IsNullOrWhiteSpace(tmp))
{
lo.Add(tmp);
}
return lo;
}

split string in to several strings at specific points

I have a text file with lines of text laid out like so
12345MLOL68
12345MLOL68
12345MLOL68
I want to read the file, insert commas after the 5th, 6th and 9th characters of each line, and write the result to a different text file, so the result would be:
12345,M,LOL,68
12345,M,LOL,68
12345,M,LOL,68
This is what I have so far
public static void ToCSV(string fileWRITE, string fileREAD)
{
int count = 0;
string x = "";
StreamWriter commas = new StreamWriter(fileWRITE);
string FileText = new System.IO.StreamReader(fileREAD).ReadToEnd();
var dataList = new List<string>();
IEnumerable<string> splitString = Regex.Split(FileText, "(.{1}.{5})").Where(s => s != String.Empty);
foreach (string y in splitString)
{
dataList.Add(y);
}
foreach (string y in dataList)
{
x = (x + y + ",");
count++;
if (count == 3)
{
x = (x + "NULL,NULL,NULL,NULL");
commas.WriteLine(x);
x = "";
count = 0;
}
}
commas.Close();
}
The problem I'm having is trying to figure out how to split the original string lines I read in at several points. The line
IEnumerable<string> splitString = Regex.Split(FileText, "(.{1}.{5})").Where(s => s != String.Empty);
is not working the way I want it to. It just adds up the 1 and the 5 and splits all strings at the 6th character.
Can anyone help me split each string at specific points?
Simpler code:
public static void ToCSV(string fileWRITE, string fileREAD)
{
string[] lines = File.ReadAllLines(fileREAD);
string[] splitLines = lines.Select(s => Regex.Replace(s, "(.{5})(.)(.{3})(.*)", "$1,$2,$3,$4")).ToArray();
File.WriteAllLines(fileWRITE, splitLines);
}
Just insert at the right place in descending order like this.
string str = "12345MLOL68";
int[] indices = {5, 6, 9};
indices = indices.OrderByDescending(x => x).ToArray();
foreach (var index in indices)
{
str = str.Insert(index, ",");
}
We insert in descending order because inserting in ascending order would shift the later indices and make them hard to track.
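The same idea can be wrapped in a small helper and applied to every line of a file. This is only a sketch: CommaInserter and InsertSeparators are hypothetical names, and skipping indices that fall outside the line is an assumption about how short lines should be handled.

```csharp
using System;
using System.Linq;

public static class CommaInserter
{
    // Inserts 'separator' at each of the given positions of 'line'.
    // Positions refer to the original, unmodified line.
    public static string InsertSeparators(string line, string separator, params int[] indices)
    {
        int length = line.Length;
        // Work from the highest index down so earlier inserts
        // do not shift the positions that are still pending.
        var ordered = indices
            .Where(i => i >= 0 && i <= length)
            .OrderByDescending(i => i)
            .ToArray();
        foreach (int index in ordered)
        {
            line = line.Insert(index, separator);
        }
        return line;
    }
}
```

For a whole file, something like File.ReadAllLines(...).Select(l => CommaInserter.InsertSeparators(l, ",", 5, 6, 9)) could be passed to File.WriteAllLines.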
Why not use Substring? For example:
editedstring = input.Substring(0, 5) + "," + input.Substring(5, 1) + "," + input.Substring(6, 3) + "," + input.Substring(9);
This should suit your needs.

C# text file search for specific word and delete whole line of text that contains that word

Basically I have a text file that I read in and display in a rich text box, which is fine, but I then want to be able to search through the text for a specific word and delete the whole line of text that contains this word. I can search through the text to see if the word exists or not but I cannot figure out how to delete the whole line. Any help would be great.
The easiest is to rewrite the whole file without the line(s) that contain the word. You can use LINQ for that:
var oldLines = System.IO.File.ReadAllLines(path);
var newLines = oldLines.Where(line => !line.Contains(wordToDelete));
System.IO.File.WriteAllLines(path, newLines);
If you want to delete only the lines that contain the word as a whole word (not just as a sequence of characters inside another word), you need to split each line on ' ':
var newLines = oldLines.Select(line => new {
Line = line,
Words = line.Split(' ')
})
.Where(lineInfo => !lineInfo.Words.Contains(wordToDelete))
.Select(lineInfo => lineInfo.Line);
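Splitting on ' ' misses a word that is followed by punctuation, such as "cat." at the end of a sentence. One possible alternative, sketched here under the assumption that the word starts and ends with a letter or digit, is a regular expression with \b word boundaries. LineFilter and WithoutWord are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineFilter
{
    // Returns the lines that do not contain 'word' as a whole word.
    public static IEnumerable<string> WithoutWord(IEnumerable<string> lines, string word)
    {
        // Regex.Escape guards against regex metacharacters inside the word.
        var wholeWord = new Regex(@"\b" + Regex.Escape(word) + @"\b");
        return lines.Where(line => !wholeWord.IsMatch(line));
    }
}
```

The result can be written back with System.IO.File.WriteAllLines(path, LineFilter.WithoutWord(oldLines, wordToDelete)), using the variables from the snippet above.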
You can do it easily without LINQ:
string search_text = text;
string old;
string n="";
StreamReader sr = File.OpenText(FileName);
while ((old = sr.ReadLine()) != null)
{
if (!old.Contains(search_text))
{
n += old+Environment.NewLine;
}
}
sr.Close();
File.WriteAllText(FileName, n);
Code:
"using System.Linq;" is required.
Write your own extension method IsNotAnyOf (put it in a static class) and call it from .Where(n => n.IsNotAnyOf(...)). The for loop returns false as soon as the line contains one of the words; otherwise the method returns true:
static void aMethod()
{
string[] wordsToDelete = { "aa", "bb" };
string[] Lines = System.IO.File.ReadAllLines(TextFilePath)
.Where(n => n.IsNotAnyOf(wordsToDelete)).ToArray();
System.IO.File.WriteAllLines(TextFilePath, Lines);
}
static private bool IsNotAnyOf(this string n, string[] wordsToDelete)
{
for (int ct = 0; ct < wordsToDelete.Length; ct++)
{
// Drop the line if it contains any of the words
if (n.Contains(wordsToDelete[ct])) return false;
}
return true;
}
We can convert the string to an array of lines, work over it, and convert back:
string[] dados_em_lista = dados_em_txt.Split(
new[] { "\r\n", "\r", "\n" },
StringSplitOptions.None
);
var nova_lista = dados_em_lista.Where(line => !line.Contains(line_to_remove)).ToArray();
dados_em_txt = String.Join("\n", nova_lista);

C# split string but keep split chars / separators [duplicate]

I would like to split a string with delimiters but keep the delimiters in the result.
How would I do this in C#?
If the split chars were ,, ., and ;, I'd try:
using System.Text.RegularExpressions;
...
string[] parts = Regex.Split(originalString, @"(?<=[.,;])");
(?<=PATTERN) is positive look-behind for PATTERN. It should match at any place where the preceding text fits PATTERN so there should be a match (and a split) after each occurrence of any of the characters.
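As a small illustration of the look-behind split (the KeepDelimiters class and SplitAfter method are names made up for this sketch): the match is zero-width, so each delimiter stays attached to the text before it. Regex.Split can return an empty trailing string when the input ends with a delimiter, so the sketch filters empty entries out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeepDelimiters
{
    // Splits after every '.', ',' or ';' and keeps the delimiter
    // at the end of the piece that precedes it.
    public static string[] SplitAfter(string input)
    {
        return Regex.Split(input, @"(?<=[.,;])")
            .Where(part => part.Length > 0)
            .ToArray();
    }
}
```

For example, KeepDelimiters.SplitAfter("a,b;c.") gives the three pieces "a,", "b;" and "c.".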
If you want the delimiter to be its "own split", you can use Regex.Split e.g.:
string input = "plum-pear";
string pattern = "(-)";
string[] substrings = Regex.Split(input, pattern); // Split on hyphens
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
// The method writes the following to the console:
// 'plum'
// '-'
// 'pear'
So if you want to split a mathematical formula, you can use the following regex:
@"([*()\^\/]|(?<!E)[\+\-])"
This ensures that constants like 1E-02 are kept whole instead of being split into 1E, - and 02.
So:
Regex.Split("10E-02*x+sin(x)^2", @"([*()\^\/]|(?<!E)[\+\-])")
Yields:
10E-02
*
x
+
sin
(
x
)
^
2
Building off from BFree's answer, I had the same goal, but I wanted to split on an array of characters similar to the original Split method, and I also have multiple splits per string:
public static IEnumerable<string> SplitAndKeep(this string s, char[] delims)
{
int start = 0, index;
while ((index = s.IndexOfAny(delims, start)) != -1)
{
if(index-start > 0)
yield return s.Substring(start, index - start);
yield return s.Substring(index, 1);
start = index + 1;
}
if (start < s.Length)
{
yield return s.Substring(start);
}
}
Just in case anyone wants this answer as well...
Instead of string[] parts = Regex.Split(originalString, @"(?<=[.,;])") you could use string[] parts = Regex.Split(originalString, @"(?=yourmatch)"), where yourmatch is whatever your separator is.
Supposing the original string was
777 - cat
777 - dog
777 - mouse
777 - rat
777 - wolf
Regex.Split(originalString, @"(?=777)") would return
777 - cat
777 - dog
and so on
This version does not use LINQ or Regex and so it's probably relatively efficient. I think it might be easier to use than the Regex because you don't have to worry about escaping special delimiters. It returns an IList<string> which is more efficient than always converting to an array. It's an extension method, which is convenient. You can pass in the delimiters as either an array or as multiple parameters.
/// <summary>
/// Splits the given string into a list of substrings, while outputting the splitting
/// delimiters (each in its own string) as well. It's just like String.Split() except
/// the delimiters are preserved. No empty strings are output.</summary>
/// <param name="s">String to parse. Can be null or empty.</param>
/// <param name="delimiters">The delimiting characters. Can be an empty array.</param>
/// <returns></returns>
public static IList<string> SplitAndKeepDelimiters(this string s, params char[] delimiters)
{
var parts = new List<string>();
if (!string.IsNullOrEmpty(s))
{
int iFirst = 0;
do
{
int iLast = s.IndexOfAny(delimiters, iFirst);
if (iLast >= 0)
{
if (iLast > iFirst)
parts.Add(s.Substring(iFirst, iLast - iFirst)); //part before the delimiter
parts.Add(new string(s[iLast], 1));//the delimiter
iFirst = iLast + 1;
continue;
}
//No delimiters were found, but at least one character remains. Add the rest and stop.
parts.Add(s.Substring(iFirst, s.Length - iFirst));
break;
} while (iFirst < s.Length);
}
return parts;
}
Some unit tests:
text = "[a link|http://www.google.com]";
result = text.SplitAndKeepDelimiters('[', '|', ']');
Assert.IsTrue(result.Count == 5);
Assert.AreEqual(result[0], "[");
Assert.AreEqual(result[1], "a link");
Assert.AreEqual(result[2], "|");
Assert.AreEqual(result[3], "http://www.google.com");
Assert.AreEqual(result[4], "]");
A lot of answers to this! One I knocked up to split by various strings (the original answer caters for just characters i.e. length of 1). This hasn't been fully tested.
public static IEnumerable<string> SplitAndKeep(string s, params string[] delims)
{
var rows = new List<string>() { s };
foreach (string delim in delims)//delimiter counter
{
for (int i = 0; i < rows.Count; i++)//row counter
{
int index = rows[i].IndexOf(delim);
if (index > -1
&& rows[i].Length > index + 1)
{
string leftPart = rows[i].Substring(0, index + delim.Length);
string rightPart = rows[i].Substring(index + delim.Length);
rows[i] = leftPart;
rows.Insert(i + 1, rightPart);
}
}
}
return rows;
}
This seems to work, but it's not been tested much.
public static string[] SplitAndKeepSeparators(string value, char[] separators, StringSplitOptions splitOptions)
{
List<string> splitValues = new List<string>();
int itemStart = 0;
for (int pos = 0; pos < value.Length; pos++)
{
for (int sepIndex = 0; sepIndex < separators.Length; sepIndex++)
{
if (separators[sepIndex] == value[pos])
{
// add the section of string before the separator
// (unless its empty and we are discarding empty sections)
if (itemStart != pos || splitOptions == StringSplitOptions.None)
{
splitValues.Add(value.Substring(itemStart, pos - itemStart));
}
itemStart = pos + 1;
// add the separator
splitValues.Add(separators[sepIndex].ToString());
break;
}
}
}
// add anything after the final separator
// (unless its empty and we are discarding empty sections)
if (itemStart != value.Length || splitOptions == StringSplitOptions.None)
{
splitValues.Add(value.Substring(itemStart, value.Length - itemStart));
}
return splitValues.ToArray();
}
Recently I wrote an extension method to do this:
public static class StringExtensions
{
public static IEnumerable<string> SplitAndKeep(this string s, string seperator)
{
string[] obj = s.Split(new string[] { seperator }, StringSplitOptions.None);
for (int i = 0; i < obj.Length; i++)
{
string result = i == obj.Length - 1 ? obj[i] : obj[i] + seperator;
yield return result;
}
}
}
I'd say the easiest way to accomplish this (except for the argument Hans Kesting brought up) is to split the string the regular way, then iterate over the array and add the delimiter to every element but the last.
To avoid adding the character to a new line, try this:
string[] substrings = Regex.Split(input, @"(?<=[-])");
result = originalString.Split(separator);
for(int i = 0; i < result.Length - 1; i++)
result[i] += separator;
(EDIT - this is a bad answer - I misread his question and didn't see that he was splitting by multiple characters.)
(EDIT - a correct LINQ version is awkward, since the separator shouldn't get concatenated onto the final string in the split array.)
Iterate through the string character by character (which is what a regex does anyway).
When you find a splitter, spin off a substring.
A sketch (the values of toSplit and splitCharacters are placeholders):
int hold = 0;
var afterSplit = new List<string>();
string toSplit = "a,b;c";
char[] splitCharacters = { ',', ';' };
for (int counter = 0; counter < toSplit.Length; counter++)
{
if (Array.IndexOf(splitCharacters, toSplit[counter]) >= 0)
{
// Everything since the previous split, including the delimiter itself
afterSplit.Add(toSplit.Substring(hold, counter - hold + 1));
hold = counter + 1;
}
}
// Whatever follows the last delimiter
if (hold < toSplit.Length)
{
afterSplit.Add(toSplit.Substring(hold));
}
Each piece keeps its trailing delimiter, so the delimiters are preserved in the result.
veggerby's answer modified to
have no empty string items in the list
have a fixed string as the delimiter, like "ab", instead of a single character
var delimiter = "ab";
var text = "ab33ab9ab";
var parts = Regex.Split(text, $@"({Regex.Escape(delimiter)})")
.Where(p => p != string.Empty)
.ToList();
// parts = "ab", "33", "ab", "9", "ab"
The Regex.Escape() call is there in case your delimiter contains characters that regex interprets as special pattern commands (such as * or parentheses) and that therefore have to be escaped.
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace ConsoleApplication9
{
class Program
{
static void Main(string[] args)
{
string input = @"This;is:a.test";
char sep0 = ';', sep1 = ':', sep2 = '.';
string pattern = string.Format("[{0}{1}{2}]|[^{0}{1}{2}]+", sep0, sep1, sep2);
Regex regex = new Regex(pattern);
MatchCollection matches = regex.Matches(input);
List<string> parts=new List<string>();
foreach (Match match in matches)
{
parts.Add(match.ToString());
}
}
}
}
I wanted to split a multiline string like this but needed to keep the line breaks, so I did this:
string x =
@"line 1 {0}
line 2 {1}
";
foreach (var line in string.Format(x, "one", "two")
.Split('\n')
.Select(part => part.Contains('\r') ? part + '\n' : part)
.AsEnumerable()
) {
Console.Write(line);
}
yields
line 1 one
line 2 two
I came across same problem but with multiple delimiters. Here's my solution:
public static string[] SplitLeft(this string @this, char[] delimiters, int count)
{
var splits = new List<string>();
int next = -1;
while (splits.Count + 1 < count && (next = @this.IndexOfAny(delimiters, next + 1)) >= 0)
{
splits.Add(@this.Substring(0, next));
@this = new string(@this.Skip(next).ToArray());
}
splits.Add(@this);
return splits.ToArray();
}
Sample with separating CamelCase variable names:
var variableSplit = variableName.SplitLeft(
Enumerable.Range('A', 26).Select(i => (char)i).ToArray());
I wrote this code to split and keep delimiters:
private static string[] SplitKeepDelimiters(string toSplit, char[] delimiters, StringSplitOptions splitOptions = StringSplitOptions.None)
{
var tokens = new List<string>();
int idx = 0;
for (int i = 0; i < toSplit.Length; ++i)
{
if (delimiters.Contains(toSplit[i]))
{
tokens.Add(toSplit.Substring(idx, i - idx)); // token found
tokens.Add(toSplit[i].ToString()); // delimiter
idx = i + 1; // start idx for the next token
}
}
// last token
tokens.Add(toSplit.Substring(idx));
if (splitOptions == StringSplitOptions.RemoveEmptyEntries)
{
tokens = tokens.Where(token => token.Length > 0).ToList();
}
return tokens.ToArray();
}
Usage example:
string toSplit = "AAA,BBB,CCC;DD;,EE,";
char[] delimiters = new char[] {',', ';'};
string[] tokens = SplitKeepDelimiters(toSplit, delimiters, StringSplitOptions.RemoveEmptyEntries);
foreach (var token in tokens)
{
Console.WriteLine(token);
}
