Splitting text into sentences - c#

I ran into a problem while trying to parse my text into sentences.
Everything works fine if the text is formatted this way (random text):
Much did had call new drew that kept. Limits expect wonder law she.
Now has you views woman noisy match money rooms.
The program parses this text into 3 sentences.
But as soon as there is a line break in the middle of a sentence, my program splits the text incorrectly:
Much did had call new drew that kept. Limits (new line here) expect wonder law she.
Now has you views woman noisy match money rooms.
The program parses this text as 4 sentences.
My code:
public static void ReadData()
{
    char[] sentenceSeparators = {'.', '!', '?'};
    using (StreamReader reader = new StreamReader(dataFile))
    {
        string line = null;
        while (null != (line = reader.ReadLine()))
        {
            var split = line.Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
            foreach (var i in split)
            {
                Console.WriteLine(i);
            }
        }
    }
}
Input #1:
Much did had call new drew that kept. Limits expect wonder law she.
Now has you views woman noisy match money rooms.
Output #1:
Much did had call new drew that kept
Limits expect wonder law she
Now has you views woman noisy match money rooms
Input #2:
Much did had call new drew that kept. Limits expect
wonder law she.
Now has you views woman noisy match money rooms.
Output #2:
Much did had call new drew that kept
Limits expect
wonder law she
Now has you views woman noisy match money rooms

It's because you are using ReadLine, which hands you one line at a time, so a sentence that spans a line break is split in two. Use ReadToEnd instead.
public static void ReadData()
{
    char[] sentenceSeparators = {'.', '!', '?'};
    using (StreamReader reader = new StreamReader(dataFile))
    {
        string line = reader.ReadToEnd();
        var split = line.Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
        foreach (var i in split)
        {
            Console.WriteLine(i);
        }
    }
}

As already mentioned, don't read the file line by line if you don't want \n to influence your splitting. Here is a version which does the job in one line:
string[] split = File.ReadAllText(dataFile).Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
Also: the console display is misleading. The "bad" sentence is printed on 2 lines because it still contains the line break, but in the split array it occupies a single element:
Console.WriteLine(split.Length); // will display 3, provided the file does not end with a trailing line break
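If the embedded line breaks also need to go away, one option is to normalize whitespace after splitting. This is only a minimal sketch: SentenceSplitter and its Split method are hypothetical names, not part of the code above, and it assumes that a line break inside a sentence should be treated as an ordinary space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    private static readonly char[] SentenceSeparators = { '.', '!', '?' };

    // Splits raw text into sentences, treating line breaks as ordinary spaces.
    public static string[] Split(string text)
    {
        return text
            .Split(SentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
            // Collapse runs of whitespace (including \r and \n) into a single space.
            .Select(s => Regex.Replace(s, @"\s+", " ").Trim())
            // Drop fragments that were only whitespace, e.g. a trailing line break.
            .Where(s => s.Length > 0)
            .ToArray();
    }
}
```

It could be called as SentenceSplitter.Split(File.ReadAllText(dataFile)). Filtering out whitespace-only fragments also takes care of a file that ends with a line break.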

Related

Finding multiple semi predictable patterns in a string

Alright, so I'm writing an application that needs to be able to extract a VAT-Number from an invoice (https://en.wikipedia.org/wiki/VAT_identification_number)
The biggest challenge to overcome here is that, as is apparent from the Wikipedia article I have linked to, each country uses its own format for these VAT numbers (the Netherlands uses a 14-character number while Germany uses an 11-character number).
In order to extract these numbers, I throw every line from the invoice into an array of strings, and for each string I test if it has a length that is equal to one of the VAT formats, and if that checks out, I check if said string also contains a country code ("NL", "DE", etc).
string[] ProcessedFile = Reader.ProcessFile(Input);
foreach (string S in ProcessedFile)
{
    RtBEditor.AppendText(S + "\n");
}
foreach (string X in ProcessedFile)
{
    string S = X.Replace(" ", string.Empty);
    if (S.Length == 7)
    {
        if (S.Contains("GBGD"))
        {
            // Dutch for "Country = Great Britain (Government)"
            MessageBox.Show("Land = Groot Britanie (Regering)");
        }
    }
    /*
    repeat for all other lengths and country codes.
    */
}
This code has two problems.
1st: if a string happens to have the same length as one of the VAT formats and has a country code embedded in it, the code will incorrectly conclude that it has found the VAT number.
2nd: in some cases, the VAT number is included as "VAT-number: [VAT-number]". The text that precedes the actual number is then counted in the length, so the program fails to detect the VAT number.
I assume the best fix is to somehow isolate the VAT number from the surrounding string altogether, but I have yet to find a way to actually do this.
Does anyone know a potential solution?
Many thanks in advance!
EDIT:
Added a dummy invoice to clarify what kind of data is contained within the invoices.
As someone in the comments had pointed out, the best way to fix this is by using Regex. After trying around a bit I came to the following solution:
public Regex FilterNormaal = new Regex(@"[A-Z]{2}(\d)+B?\d*");
private void BtnUitlezen_Click(object sender, EventArgs e)
{
    RtBEditor.Clear();
    /*
    Temp dummy vatcodes for initial testing.
    */
    Form1.Dummy1.VAT = "NL855291886B01";
    Form1.Dummy2.VAT = "DE483270846";
    Form1.Dummy3.VAT = "SE482167803501";
    OCR Reader = new OCR();
    /*
    Grab and process image
    */
    if (openFileDialog1.ShowDialog() == DialogResult.OK)
    {
        try
        {
            Input = new Bitmap(openFileDialog1.FileName);
        }
        catch
        {
            MessageBox.Show("Please open an image file.");
        }
    }
    string[] ProcessedFile = Reader.ProcessFile(Input);
    foreach (string S in ProcessedFile)
    {
        string X = S.Replace(" ", string.Empty);
        RtBEditor.AppendText(X + "\n");
    }
    foreach (Match M in FilterNormaal.Matches(RtBEditor.Text))
    {
        MessageBox.Show(M.Value);
    }
}
At first, I attempted to iterate through my array of strings to find a match, but for reasons unknown, this did not yield any results. When applying the regex to the entire textbox, it did output the results I needed.
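For comparison, here is a sketch of how the same pattern could be applied line by line outside the form. VatExtractor and Extract are hypothetical names, and the sample lines in the usage note are made up. It does not explain why the per-line attempt above returned nothing on the real OCR output; it only shows that the pattern copes with a "VAT-number:" prefix. Note that the pattern is loose and would also match any two capital letters followed by a digit.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VatExtractor
{
    // Same pattern as FilterNormaal: two capital letters, one or more digits,
    // an optional 'B', then optional trailing digits.
    private static readonly Regex VatPattern = new Regex(@"[A-Z]{2}(\d)+B?\d*");

    // Returns every substring of every line that matches the pattern.
    public static List<string> Extract(IEnumerable<string> lines)
    {
        var found = new List<string>();
        foreach (string line in lines)
        {
            // Strip spaces first, as the original code does.
            string compact = line.Replace(" ", string.Empty);
            foreach (Match m in VatPattern.Matches(compact))
            {
                found.Add(m.Value);
            }
        }
        return found;
    }
}
```

Calling VatExtractor.Extract(new[] { "VAT-number: NL855291886B01", "Total: 100", "DE 483270846" }) yields NL855291886B01 and DE483270846, because the label text never forms two capitals directly followed by a digit.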

Find a delimiter of csv or text files in c#

I want to find a delimiter being used to separate the columns in csv or text files.
I am using TextFieldParser class to read those files.
Below is my code,
String path = @"c:\abc.csv";
DataTable dt = new DataTable();
if (File.Exists(path))
{
    using (Microsoft.VisualBasic.FileIO.TextFieldParser parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        if (path.Contains(".txt"))
        {
            parser.SetDelimiters("|");
        }
        else
        {
            parser.SetDelimiters(",");
        }
        parser.HasFieldsEnclosedInQuotes = true;
        bool firstLine = true;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            if (firstLine)
            {
                // The first row holds the column names.
                foreach (var val in fields)
                {
                    dt.Columns.Add(val);
                }
                firstLine = false;
                continue;
            }
            dt.Rows.Add(fields);
        }
    }
}
lblCount.Text = "Count of total rows in the file: " + dt.Rows.Count.ToString();
dgvTextFieldParser1.DataSource = dt;
Instead of passing the delimiters manually based on the file type, I want to read the delimiter from the file and then pass it.
How can I do that?
Mathematically correct but totally useless answer: It's not possible.
Pragmatic answer: It's possible, but it depends on how much you know about the file's structure. It boils down to a set of assumptions, and the answer will vary depending on which ones we make. And if you can't make any assumptions, well... see the mathematically correct answer.
For instance, can we assume that the delimiter is one or any of the elements in the set below?
List<char> delimiters = new List<char>{' ', ';', '|'};
Or can we assume that the delimiter is such that it produces elements of equal length?
Should we try to find a delimiter that's a single character or can a word be one?
Etc.
Based on the question, I'll assume that it's the first option and that we have a limited set of possible characters, precisely one of which is the delimiter for a given file.
How about you count the number of occurrences of each such character and assume that the one occurring most frequently is the delimiter? Is that sufficiently rigorous, or do you need to be more sure than that?
List<char> delimiters = new List<char>{' ', ';', '-'};
Dictionary<char, int> counts = delimiters.ToDictionary(key => key, value => 0);
// textArray is assumed to hold the file's characters, e.g. File.ReadAllText(path).ToCharArray()
foreach (char c in delimiters)
    counts[c] = textArray.Count(t => t == c);
I'm not in front of a computer so I can't verify, but the last step would be returning the dictionary key whose value is the largest.
You'll need to take special cases into consideration: no delimiter is detected, two delimiter types occur equally often, etc.
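The counting idea, including the last step and the two special cases, could be sketched roughly like this. DelimiterGuesser and Guess are hypothetical names, and returning null for "nothing found" or "tie" is just one possible policy, not something the question prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelimiterGuesser
{
    // Returns the candidate that occurs most often in text, or null when
    // no candidate occurs at all or when the top two candidates are tied.
    public static char? Guess(string text, IEnumerable<char> candidates)
    {
        // Count occurrences of each candidate character.
        var counts = candidates
            .Distinct()
            .ToDictionary(c => c, c => text.Count(t => t == c));

        // Sort by count, highest first.
        var ranked = counts.OrderByDescending(kv => kv.Value).ToList();

        if (ranked.Count == 0 || ranked[0].Value == 0)
            return null; // no delimiter detected

        if (ranked.Count > 1 && ranked[1].Value == ranked[0].Value)
            return null; // ambiguous: two candidates occur equally often

        return ranked[0].Key;
    }
}
```

For example, DelimiterGuesser.Guess("a;b;c\n1;2;3", new[] { ',', ';', '|' }) gives ';'. A raw count like this can still be fooled by delimiter characters inside quoted fields.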
Very simple guessing approach using LINQ:
using System.IO;
using System.Linq;
static class CsvSeperatorDetector
{
    private static readonly char[] SeparatorChars = {';', '|', '\t', ','};
    public static char DetectSeparator(string csvFilePath)
    {
        string[] lines = File.ReadAllLines(csvFilePath);
        return DetectSeparator(lines);
    }
    public static char DetectSeparator(string[] lines)
    {
        var q = SeparatorChars.Select(sep => new
            {
                Separator = sep,
                // Group the lines by how many times the separator occurs in each.
                Found = lines.GroupBy(line => line.Count(ch => ch == sep))
            })
            // Prefer the separator that occurs on the most lines...
            .OrderByDescending(res => res.Found.Where(grp => grp.Key > 0).Sum(grp => grp.Count()))
            // ...and among ties, the one with the fewest distinct per-line counts.
            .ThenBy(res => res.Found.Count())
            .First();
        return q.Separator;
    }
}
What this does is read the file line by line (note that CSV fields may themselves contain line breaks, which this approach ignores), then check for each potential separator how often it occurs in each line.
Then we check which separator occurs on the most lines, and of those which occur on the same number of lines, we take the one with the most even distribution (e.g. 5 occurrences on every line rank higher than one occurrence on one line and 10 on another).
Of course you might have to tweak this for your own purposes, add error handling, fallback logic and so forth. I'm sure it's not perfect, but it's good enough for me.
You could probably take n bytes from the file, count the possible delimiter characters (or all characters found) using a hash map/dictionary, and then treat the most frequent character as the delimiter you're looking for. It makes sense to me that the characters used as delimiters would be the ones used the most. When done, you reset the stream, but since you're using a text reader you would probably have to initialize another text reader. This gets slightly more hairy if the CSV uses more than one delimiter. You would probably have to ignore some characters, such as letters and digits.
In Python this is easy with csv.Sniffer. It works on text files and also when you only want to read some bytes from the file.

Text file line by line into string array

I need help, trying to take a large text document ~1000 lines and put it into a string array, line by line.
Example:
string[] s = {firstLineHere, Secondline, etc};
I also want a way to look at the first word of each line, and only the first word, and once that first word is found, copy the entire line.
You can accomplish this with File.ReadAllLines combined with a little LINQ (this also covers the addition to the question stated in the comments on Praveen's answer).
string[] identifiers = { /*Your identifiers for needed lines*/ };
string[] allLines = File.ReadAllLines(@"C:\test.txt");
// Split(' ')[0] is the text before the first space, or the whole line if it contains no space.
string[] neededLines = allLines.Where(c => identifiers.Contains(c.Split(' ')[0])).ToArray();
Or make it more of a one-liner:
string[] lines = File.ReadAllLines("your path").Where(c => identifiers.Contains(c.Split(' ')[0])).ToArray();
This will give you an array of all the lines in your document that start with the keywords you define in your identifiers string array.
There is a built-in method that does what you need:
string[] lines = System.IO.File.ReadAllLines(@"C:\sample.txt");
If you want to read the file line by line:
List<string> lines = new List<string>();
using (StreamReader reader = new StreamReader(@"C:\sample.txt"))
{
    while (reader.Peek() >= 0)
    {
        string line = reader.ReadLine();
        // Add your conditional logic to add the line to the list
        if (line.Contains(searchTerm))
        {
            lines.Add(line);
        }
    }
}
Another option would be to read each line, split it into segments, and compare only the first element against the provided search term. I have provided a complete working demonstration below:
Solution:
using System;
using System.Collections.Generic;
using System.IO;
class Program
{
    static void Main(string[] args)
    {
        // Get all lines that start with a given word from a file
        var result = GetLinesWithWord("The", "temp.txt");
        // Display the results.
        foreach (var line in result)
        {
            Console.WriteLine(line);
        }
        Console.ReadLine();
    }
    public static List<string> GetLinesWithWord(string word, string filename)
    {
        // The lines whose first word is the provided search term.
        List<string> result = new List<string>();
        // Create a stream reader object to read a text file.
        using (StreamReader reader = new StreamReader(filename))
        {
            string line; // Holds a single line returned by the stream reader.
            // While there are lines in the file, read a line into the line variable.
            while ((line = reader.ReadLine()) != null)
            {
                // Split the line into parts by a white space delimiter.
                // An empty or all-space line yields no parts and is skipped below.
                var parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                // Compare the first word to the search term, ignoring case.
                // If they are the same, add the line to the results list.
                if (parts.Length > 0 &&
                    string.Equals(parts[0].Trim(), word.Trim(), StringComparison.OrdinalIgnoreCase))
                {
                    result.Add(line);
                }
            }
        }
        return result;
    }
}
Where the sample text file may contain:
How shall I know thee in the sphere which keeps
The disembodied spirits of the dead,
When all of thee that time could wither sleeps
And perishes among the dust we tread?
For I shall feel the sting of ceaseless pain
If there I meet thy gentle presence not;
Nor hear the voice I love, nor read again
In thy serenest eyes the tender thought.
Will not thy own meek heart demand me there?
That heart whose fondest throbs to me were given?
My name on earth was ever in thy prayer,
Shall it be banished from thy tongue in heaven?
In meadows fanned by heaven's life-breathing wind,
In the resplendence of that glorious sphere,
And larger movements of the unfettered mind,
Wilt thou forget the love that joined us here?
The love that lived through all the stormy past,
And meekly with my harsher nature bore,
And deeper grew, and tenderer to the last,
Shall it expire with life, and be no more?
A happier lot than mine, and larger light,
Await thee there; for thou hast bowed thy will
In cheerful homage to the rule of right,
And lovest all, and renderest good for ill.
For me, the sordid cares in which I dwell,
Shrink and consume my heart, as heat the scroll;
And wrath has left its scar--that fire of hell
Has left its frightful scar upon my soul.
Yet though thou wear'st the glory of the sky,
Wilt thou not keep the same beloved name,
The same fair thoughtful brow, and gentle eye,
Lovelier in heaven's sweet climate, yet the same?
Shalt thou not teach me, in that calmer home,
The wisdom that I learned so ill in this--
The wisdom which is love--till I become
Thy fit companion in that land of bliss?
And you wanted to retrieve every line where the first word of the line is the word 'the' by calling the method like so:
var result = GetLinesWithWord("The", "temp.txt");
Your result should then be the following:
The disembodied spirits of the dead,
The love that lived through all the stormy past,
The same fair thoughtful brow, and gentle eye,
The wisdom that I learned so ill in this--
The wisdom which is love--till I become
Hopefully this answers your question adequately enough.

TextElement Enumerator Class Bug or (Tamil) Unicode Bug

Why does the TextElementEnumerator not parse this Tamil Unicode text properly?
using System;
using System.Collections.Generic;
using System.Globalization;
namespace Glyphtest
{
    internal class Program
    {
        private static void Main()
        {
            const string unicodetxt1 = "ஊரவர் கெளவை";
            List<string> output = Syllabify(unicodetxt1);
            Console.WriteLine(output.Count);
            const string unicodetxt2 = "கௌவை";
            output = Syllabify(unicodetxt2);
            Console.WriteLine(output.Count);
        }
        public static List<string> Syllabify(string unicodetext)
        {
            if (string.IsNullOrEmpty(unicodetext)) return null;
            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(unicodetext);
            var data = new List<string>();
            while (enumerator.MoveNext())
                data.Add(enumerator.Current.ToString());
            return data;
        }
    }
}
The code sample above deals with the following Unicode sequences:
'கௌ' -> 0x0b95 (க) + 0x0bcc (ௌ) (correct form)
'கெள' -> 0x0b95 (க) + 0x0bc6 (ெ) + 0x0bb3 (ள) (incorrect form)
Is this a bug in the TextElementEnumerator class? Why does it not enumerate the string properly?
That is,
கெளவை => 'கெள' + 'வை' is how it should be enumerated;
கெளவை => 'கெ' + 'ள' + 'வை' is how it should not be enumerated.
If this is a bug, how can I work around it?
This is not a bug in Unicode or in the TextElementEnumerator class.
It is specific to the language (Tamil): a letter is formed by a Tamil consonant followed by a visual glyph (vowel sign).
For example,
க -\u0b95
ெ -\u0bc6
ள -\u0bb3
render as 'கெள', which looks just like the character formed by
க -\u0b95
ௌ-\u0bcc
and the latter is the correct form.
Hence, before enumerating Tamil text, we have to replace the irregular sequence.
According to a rule of Tamil grammar (ஔகாரக் குறுக்கம்), the visual glyph (ௌ) occurs in the starting letter of a word,
so the code above should be changed as follows:
// In addition to the usings above, this needs: using System.Text.RegularExpressions;
internal class Program
{
    private static void Main()
    {
        const string unicodetxt1 = "ஊரவர் கெளவை";
        List<string> output = Syllabify(unicodetxt1);
        Console.WriteLine(output.Count);
        const string unicodetxt2 = "கௌவை";
        output = Syllabify(unicodetxt2);
        Console.WriteLine(output.Count);
    }
    public static string CheckVisualGlyphPattern(string txt)
    {
        string[] data = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        string list = string.Empty;
        // Matches the first U+0BC6 U+0BB3 pair in a word, capturing whatever precedes it.
        var rx = new Regex("^(.*?){1}(\u0bc6){1}(\u0bb3){1}");
        foreach (string s in data)
        {
            // Replace the U+0BC6 U+0BB3 pair with the single vowel sign U+0BCC.
            string outputs = rx.Replace(s, match => string.Format("{0}\u0bcc", match.Groups[1].Value));
            list += string.Format("{0} ", outputs);
        }
        return list.Trim();
    }
    public static List<string> Syllabify(string unicodetext)
    {
        var processdata = CheckVisualGlyphPattern(unicodetext);
        if (string.IsNullOrEmpty(processdata)) return null;
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
        var data = new List<string>();
        while (enumerator.MoveNext())
            data.Add(enumerator.Current.ToString());
        return data;
    }
}
It produces the appropriate visual glyph while enumerating.
U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ has Grapheme_Cluster_Break=XX (Other). This makes the grapheme clusters <U+0B95 U+0BC6><U+0BB3> the correct ones, since there is always a grapheme cluster break before characters with Grapheme_Cluster_Break equal to Other.
<U+0B95 U+0BCC> has no internal grapheme cluster breaks because U+0BCC has Grapheme_Cluster_Break=SpacingMark and there are usually no breaks before such characters (exceptions are at the start of text or when preceded by a control character).
Well, at least this is what the Unicode standard has to say (http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
Now, I have no idea of how Tamil works, so take what follows with a pinch of salt.
U+0BCC decomposes into <U+0BC6 U+0BD7>, meaning the two sequences (<U+0B95 U+0BC6 U+0BB3> and <U+0B95 U+0BCC>) are not canonically equivalent, so there is no requirement for grapheme cluster segmentation to yield the same results.
When I look at it with my Tamil-ignorant eyes, it seems U+0BD7 ᴛᴀᴍɪʟ ᴀᴜ ʟᴇɴɢᴛʜ ᴍᴀʀᴋ (the second half of that decomposition) and U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ look exactly the same. However, U+0BCC is a spacing mark, but U+0BB3 isn't. If you use U+0BCC in the input instead of <U+0BC6 U+0BB3>, the result is what you expected.
Going out on a limb, I will say that you are using the wrong character but, again, I don't know Tamil at all so I can't be sure.
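Since the two spellings look identical on screen, it can help to dump the code points of the input before blaming the enumerator. A minimal sketch follows; CodePointDump and Describe are hypothetical names. It prints UTF-16 code units, which is enough here because the characters discussed above are all in the Basic Multilingual Plane.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats each UTF-16 code unit of the input as U+XXXX, separated by spaces.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(ch => "U+" + ((int)ch).ToString("X4")));
    }
}
```

CodePointDump.Describe("\u0b95\u0bc6\u0bb3") gives "U+0B95 U+0BC6 U+0BB3", while the single vowel sign form "\u0b95\u0bcc" gives "U+0B95 U+0BCC".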

reading a CSV issue

I am trying to read a CSV file.
The following is a sample line:
"0734306547 ","9780734306548 ","Jane Eyre Pink PP ","Bronte Charlotte ","FRONT LIST",20/03/2013 0:00:00,0,"PAPERBACK","Y","Pen"
Here is the code I am using to read the CSV:
public void readCSV()
{
    List<string> ISBN = new List<String>();
    // The using block makes sure the reader and the underlying file are closed.
    using (StreamReader reader = new StreamReader(File.OpenRead(@"C:\abc\21-08-2013\PNZdatafeed.csv"), Encoding.ASCII))
    {
        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            if (!String.IsNullOrWhiteSpace(line))
            {
                string[] values = line.Split(',');
                if (values[9] == "Pen")
                {
                    ISBN.Add(values[1]);
                }
            }
        }
    }
    MessageBox.Show(ISBN.Count().ToString());
}
I am not able to compare the value with if (values[9] == "Pen"), because when I debug the code it shows the value of values[9] as "\"Pen\"".
How do I get rid of the special characters?
The problem here is that you're splitting the line every time you find , and leaving the data like that. For example, if this is the line you're reading in:
"A","B","C"
and you split it at commas, you'll get "A", "B", and "C" as your data, with the quotes still included. According to your description, you don't want quotes around the data.
To throw away quotes around a string:
Check if the leftmost character is ".
If so, check if the rightmost character is ".
If so, remove the leftmost and rightmost characters.
In pseudocode:
if (data.left(1) == "\"" && data.right(1) == "\"") {
    data = data.trimleft(1).trimright(1)
}
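In C#, that pseudocode could look roughly like the sketch below. CsvField and Unquote are hypothetical names, and the extra Trim calls are an assumption based on the padded sample values such as "9780734306548 ".

```csharp
using System;

public static class CsvField
{
    // Removes one pair of surrounding double quotes, if present,
    // then trims padding such as the trailing space in "9780734306548 ".
    public static string Unquote(string field)
    {
        string s = field.Trim();
        if (s.Length >= 2 && s[0] == '"' && s[s.Length - 1] == '"')
        {
            s = s.Substring(1, s.Length - 2);
        }
        return s.Trim();
    }
}
```

The comparison in the question would then read if (CsvField.Unquote(values[9]) == "Pen"). Keep in mind that line.Split(',') still breaks on commas inside quoted fields; a parser such as Microsoft.VisualBasic.FileIO.TextFieldParser with HasFieldsEnclosedInQuotes set to true handles that case.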
At this point you might have a few questions (I'm not sure how much experience you have). If any of these apply to you, feel free to ask them, and I'll explain further.
What does "\"" mean?
How do I extract the leftmost/rightmost character of a string?
How do I extract the middle of a string?
