I'm trying to parse a text file that has a heading and a body. In the heading of this file, there are line-number references to sections of the body. For example:
SECTION_A 256
SECTION_B 344
SECTION_C 556
This means that SECTION_A starts at line 256.
What would be the best way to parse this heading into a dictionary, and then read the sections when necessary?
Typical scenarios would be:
Parse the header and read only section SECTION_B
Parse the header and read the first paragraph of each section.
The data file is quite large and I definitely don't want to load all of it into memory before operating on it.
I'd appreciate your suggestions. My environment is VS 2008 and C# 3.5 SP1.
You can do this quite easily.
There are three parts to the problem.
1) How to find where a line in the file starts. The only way to do this is to read through the file once, keeping a list that records the start position of each line, e.g.:
List<long> lineMap = new List<long>();
lineMap.Add(0); // Line 0 starts at location 0 in the data file (just a dummy entry)
lineMap.Add(0); // Line 1 starts at location 0 in the data file
using (StreamReader sr = new StreamReader("DataFile.txt"))
{
    while (sr.ReadLine() != null)
        lineMap.Add(sr.BaseStream.Position); // note: the reader buffers, so this can overshoot the true line start
}
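One caveat: StreamReader buffers its input, so sr.BaseStream.Position typically points at the end of the read buffer rather than the end of the line just returned. A sketch that counts bytes directly sidesteps that (the class and method names here are my own, and it assumes a single-byte encoding such as ASCII, with '\n' as the last byte of each line terminator):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class LineMapBuilder
{
    // Build a map of 1-based line number -> byte offset of that line's start.
    // Assumes a single-byte encoding; "\r\n" endings also work because '\n'
    // is still the final byte of the terminator.
    public static List<long> BuildLineMap(string path)
    {
        var map = new List<long> { 0, 0 }; // dummy "line 0"; line 1 starts at 0
        long offset = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                offset++;
                if (b == '\n') map.Add(offset); // the next line begins after the newline
            }
        }
        return map;
    }
}
```

Reading byte-by-byte is slow for huge files; a buffered variant is straightforward, but this shows the idea.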
2) Read and parse your index file into a dictionary.
Dictionary<string, int> index = new Dictionary<string, int>();
using (StreamReader sr = new StreamReader("IndexFile.txt"))
{
String line;
while ((line = sr.ReadLine()) != null)
{
string[] parts = line.Split(' '); // Break the line into the name & line number
index.Add(parts[0], Convert.ToInt32(parts[1]));
}
}
Then to find a line in your file, use:
int lineNumber = index["SECTION_B"]; // Convert section name into the line number
long offsetInDataFile = lineMap[lineNumber]; // Convert line number into file offset
Then open a new FileStream on DataFile.txt, Seek(offsetInDataFile, SeekOrigin.Begin) to move to the start of the line, and use a StreamReader (as above) to read line(s) from it.
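A minimal, self-contained sketch of that last step (file contents and offsets here are illustrative):

```csharp
using System;
using System.IO;

class SeekDemo
{
    // Read the single line that starts at the given byte offset.
    public static string ReadLineAt(string path, long offset)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin); // jump straight to the line start
            using (var sr = new StreamReader(fs))
                return sr.ReadLine();
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "header\nSECTION_A body\nSECTION_B body\n");
        // "header\n" is 7 bytes, "SECTION_A body\n" is 15, so SECTION_B starts at 22
        Console.WriteLine(ReadLineAt(path, 22)); // prints "SECTION_B body"
    }
}
```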
Well, obviously you can store the name + line number in a dictionary, but that alone isn't going to do you much good.
Sure, it tells you which line to start reading from, but the problem is: where in the file is that line? The only way to know is to start from the beginning and count.
The best approach would be to write a wrapper that decodes the text contents (in case you have encoding issues) and can give you a line-number-to-byte-position mapping. Then you could take that line number, 256, look it up in a dictionary to learn that line 256 starts at position 10000 in the file, and start reading from there.
Is this a one-off processing situation? If not, have you considered loading the entire file into a local database, such as SQLite? That would give you a direct mapping between a line number and its contents. Of course, the database file would be even bigger than your original file, and you'd need to copy the data from the text file into it, so there's some overhead either way.
Just read the file one line at a time and ignore the data until you get to the ones you need. You won't have any memory issues, but performance probably won't be great. You can do this easily in a background thread though.
Read the file until the end of the header, assuming you know where that is. Split the strings you've stored on whitespace, like so:
Dictionary<string, int> sectionIndex = new Dictionary<string, int>();
List<string> headers = new List<string>(); // fill these with readline
foreach(string header in headers) {
var s = header.Split(new[]{' '});
sectionIndex.Add(s[0], Int32.Parse(s[1]));
}
Find the dictionary entry you want, keep a count of the lines read from the file, and loop until you hit that line number; then read until you reach the next section's starting line. Since a Dictionary doesn't guarantee the order of its keys, you'll probably need both the current and the next section's names.
Be sure to do some error checking to make sure the section you're reading to isn't before the section you're reading from, and any other error cases you can think of.
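A sketch of that counting loop (the start line numbers would come from the parsed header; the names here are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;

class SectionReader
{
    // Read the lines of one section, given its starting line number and the
    // next section's starting line number (1-based, as in the header).
    public static List<string> ReadSection(string path, int startLine, int nextStartLine)
    {
        var lines = new List<string>();
        int current = 0;
        using (var sr = new StreamReader(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                current++;
                if (current >= nextStartLine) break;       // reached the next section
                if (current >= startLine) lines.Add(line); // inside the wanted section
            }
        }
        return lines;
    }
}
```

For the last section you'd pass int.MaxValue as nextStartLine so the loop simply runs to end of file.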
You could read line by line until all the heading information is captured and stop (assuming all section pointers are in the heading). You would have the section and line numbers for use in retrieving the data at a later time.
string dataRow;
try
{
    using (TextReader tr = new StreamReader("filename.txt"))
    {
        while ((dataRow = tr.ReadLine()) != null)
        {
            if (!dataRow.StartsWith("SECTION_"))
                break;
            //Parse line for section code and line number and log the values
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
I have a CSV that looks like this. My goal is to extract each entry (notice I said entry, not line), where an entry starts from the first column, stretches to the last column, and may span multiple lines. I'd like to extract an entry without ruining the formatting. For example, I do not want the following to be considered four separate lines,
Eg. 1, One Column Multiple Lines
...,"1. copy ctor
2. copy ctor
3. declares function
4. default ctor",... // Where ... represents the columns before and after
but rather a column in one entry that can be represented as such
Eg. 2, One Column Single Line
"1. copy ctor\n2.copy ctor\ndeclares function\n4.default ctor"
When I iterate over the CSV like this, I get Eg. 1. I'm not sure why the split is treating a line break inside a field as the end of an entry.
using (var streamReader = new StreamReader("results-survey111101.csv"))
{
string line;
while ((line = streamReader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
foreach (var column in splitLine)
Console.WriteLine(column);
}
}
If someone can show me what I need to do to get these multi line CSV columns into one line that maintains the formatting (e.g. adds \t or \n where necessary) that would be great. Thanks!
Assuming your source file is valid CSV, variability in the data is really hard to account for. That's all I'll say, but I'll link you to another SO answer if you need convincing that writing your own CSV parser is a horrible task. Reading CSV files using C#
Let's assume you are going to take advantage of an existing CSV reader library. I'll use TextFieldParser from the Microsoft.VisualBasic library as is used in the example answer I linked.
Your task is to read your source file line by line, and validate whether the line is a complete CSV entry on its own, or if it forms part of a broken line.
If it forms part of a broken line, we need to remember the line and add the next line to it before attempting validation again.
For this we need to know one thing:
What is the expected number of fields each data entry row should have?
// requires: using Microsoft.VisualBasic.FileIO; (add a reference to Microsoft.VisualBasic)
int expectedFieldCount = 7;
string brokenLine = "";
using (var streamReader = new StreamReader("results-survey111101.csv"))
{
string line;
while ((line = streamReader.ReadLine()) != null) // read the next line
{
// if the previous line was incomplete, add it to the current line,
// otherwise use the current line
string csvLineData = (brokenLine.Length > 0) ? brokenLine + line : line;
try
{
using (StringReader stringReader = new StringReader(csvLineData))
using (TextFieldParser parser = new TextFieldParser(stringReader))
{
parser.SetDelimiters(",");
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields(); // tests if the line is valid csv
if (expectedFieldCount == fields.Length)
{
// do whatever you want with the fields now.
foreach (var field in fields)
{
Console.WriteLine(field);
}
brokenLine = ""; // reset the brokenLine
}
else // it was valid csv, but we don't have the required number of fields yet
{
brokenLine += line + @"\r\n";
break;
}
}
}
}
catch (Exception ex) // the current line is NOT valid csv, update brokenLine
{
brokenLine += (line + @"\r\n");
}
}
}
I am replacing the line breaks inside broken lines with literal \r\n sequences. You can display these in the resulting one-liner field however you want, but you shouldn't expect to be able to copy-paste the result into Notepad and see real line breaks.
This assumes you have the same number of columns in each record. In your code, where you do your Split, you can keep adding splitLine.Length to a running columnsReadCount until it reaches the desired columnsPerRecordCount. At that point you have read the whole record and can reset columnsReadCount back to zero, ready for the next record.
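A sketch of that running-count idea, with one adjustment: each physical line break splits one column into two pieces, so the number of joins has to be subtracted from the part count. All names here are mine, and the usual caveat applies: a plain Split still miscounts if a quoted field contains a literal comma, so this only works for data like the example, where the multi-line field has no embedded commas.

```csharp
using System.Collections.Generic;

class EntryReader
{
    // Group physical lines into logical entries by counting comma-separated
    // parts. Each line break splits one column in two, so the join count
    // (entryLines.Count - 1) is subtracted from the running part total.
    public static List<string> ReadEntries(IEnumerable<string> lines, int columnsPerRecord)
    {
        var entries = new List<string>();
        var entryLines = new List<string>();
        int partsRead = 0;
        foreach (var line in lines)
        {
            entryLines.Add(line);
            partsRead += line.Split(',').Length;
            if (partsRead - (entryLines.Count - 1) == columnsPerRecord)
            {
                entries.Add(string.Join("\n", entryLines.ToArray()));
                entryLines.Clear();
                partsRead = 0;
            }
        }
        return entries;
    }
}
```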
I have a text file which contains lines that I need to process. Here is the format of the lines in my text file:
07 IVIN 15:37 06/03 022 00:00:14 600 2265507967 0:03
08 ITRS 15:37 06/03 022 00:00:09 603 7878787887 0:03
08 ITRS 15:37 06/03 022 00:00:09 603 2265507967 0:03
As per my requirement, I have to read this text file line by line. As soon as I get ITRS in any line, I have to search the earlier lines of the file for the number 2265507967, and read the line that contains it.
Right now I am reading the lines into strings and breaking them into parts based on spaces. Here is my code:
var strings = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
My problem is that I can't find a way to traverse back up the text file and search for the substring, i.e. 2265507967. Please help.
I am not aware of any way to go backwards when reading a file (other than using the Seek() method), but I might be wrong...
A simpler approach would be to:
Create a dictionary, key value being the long numeric values while the value being the line to which it belongs: <2265507967,07 IVIN 15:37 06/03 022 00:00:14 600 2265507967 0:03>
Go through the file one line at a time and:
a. If the line contains ITRS, get the number from the line and look it up in your dictionary. Once you have found it, clear the dictionary and go back to step 1.
b. If it does not contain ITRS, simply add the number and the line as key-value pairs.
This should be quicker than going through one line at a time and also simpler. The drawback would be that it could be quite memory intensive.
EDIT: I do not have a .NET compiler handy, so I will provide some pseudo code to better explain my answer:
//Initialization
Dictionary<string, string> previousLines = new Dictionary<string, string>();
using (TextReader tw = new StreamReader(filePath))
{
    string line;
    //Read the file one line at a time
    while ((line = tw.ReadLine()) != null)
    {
        //The number is the 8th whitespace-separated field in the sample lines
        string number = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)[7];
        if (line.Contains("ITRS"))
        {
            //Use the dictionary to retrieve a line you have previously read
            string previousLine = previousLines[number];
            //Clear the stored elements so that they do not interfere with the next
            //searches. I am assuming you want to search between ITRS lines; if you
            //do not, simply omit this call.
            previousLines.Clear();
            //... Do your logic here.
        }
        else
        {
            //Indexer rather than Add, in case the same number appears twice
            previousLines[number] = line;
        }
    }
}
So I have code that needs to check if the file has already been split every 50 characters. 99% of the time it will come to me already split, where each line is 50 characters, however there is an off chance that it may come to me as a single line, and I need to add a linebreak every 50 characters. This file will always come to me as a stream.
Once I have the properly formatted file, I process it as needed.
However, I am uncertain how I can check if the stream is properly formatted.
Here is the code I have to check if the first line is larger than 50 characters (an indicator that the file may need to be split).
var streamReader = new StreamReader(s);
var firstLine = streamReader.ReadLine();
if (firstLine != null && firstLine.Length > 50)
{
    //code to add line breaks
}
//once the file is good
using(var trackReader = new TrackingTextReader(streamReader))
{
//do biz logic
}
How can I add linebreaks to a stream reader?
I would add all the lines to a List<string>, line by line.
Then do the check for each item in the list (using for, not foreach, because we will be inserting items):
If an item in the list has more than 50 characters,
insert a new item at the next index using item.Substring(50) (everything after the 50th character),
and cut the item at the current index with yourList[i] = yourList[i].Substring(0, 50).
A comment someone made also helped here:
You can also create a StreamWriter to write the Stream you're reading with the corrections.
Then you get the produced Stream and pass it forward to what you need.
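A sketch of that copy-and-correct idea, writing into a MemoryStream so the corrected stream can be handed onward (the class and method names are mine):

```csharp
using System;
using System.IO;

class LineBreaker
{
    // Copy the input to a new stream, breaking any over-long line into
    // 50-character pieces, and rewind the result so the caller can read it.
    public static Stream Normalize(Stream input)
    {
        var output = new MemoryStream();
        var writer = new StreamWriter(output);
        using (var reader = new StreamReader(input))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0)
                {
                    writer.WriteLine(); // preserve blank lines
                    continue;
                }
                for (int i = 0; i < line.Length; i += 50)
                    writer.WriteLine(line.Substring(i, Math.Min(50, line.Length - i)));
            }
        }
        writer.Flush();
        output.Position = 0; // rewind so the caller reads from the start
        return output;
    }
}
```

Lines that are already 50 characters or shorter pass through unchanged, so it is safe to run on files that arrive pre-split.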
You can't write anything to TextReader, because... it is a reader. The option here is to make a well-formed copy of data:
private IEnumerable<string> GetWellFormedData(Stream s)
{
using (var reader = new StreamReader(s))
{
while (!reader.EndOfStream)
{
var nextLine = reader.ReadLine();
if (nextLine.Length > 50)
{
// break the line into 50-char fragments and yield each one
for (int i = 0; i < nextLine.Length; i += 50)
    yield return nextLine.Substring(i, Math.Min(50, nextLine.Length - i));
}
else
yield return nextLine;
}
}
}
I am currently developing an application that reads a text file of about 50000 lines. For each line, I need to check if it contains a specific String.
At the moment, I use the conventional System.IO.StreamReader to read my file line by line.
The problem is that the size of the text file changes each time. I ran several performance tests and noticed that the bigger the file, the longer it takes to read a line.
For example :
Reading a txt file that contains 5000 lines : 0:40
Reading a txt file that contains 10000 lines : 2:54
It takes 4 times longer to read a file that is 2 times larger. I can't imagine how long it will take to read a 100000-line file.
Here's my code :
using (StreamReader streamReader = new StreamReader(this.MyPath))
{
while (streamReader.Peek() >= 0)
{
string line = streamReader.ReadLine();
if (line.Contains(Resources.Constants.SpecificString))
{
// Do some action with the string.
}
}
}
Is there a way to avoid the situation: bigger File = more time to read a single line?
Try this:
var toSearch = Resources.Constants.SpecificString;
foreach (var str in File.ReadLines(MyPath).Where(s => s.Contains(toSearch))) {
// Do some action with the string
}
This avoids accessing the resources on each iteration by caching value before the loop. If this does not help, try writing your own Contains based on an advanced string searching algorithm, such as the KMP.
Note: be sure to use File.ReadLines which reads lines lazily (unlike similarly looking File.ReadAllLines that reads all lines at once).
Use Regex.IsMatch and you should see some performance improvement.
using (StreamReader streamReader = new StreamReader(this.MyPath))
{
var regEx = new Regex(MyPattern, RegexOptions.Compiled);
while (streamReader.Peek() > 0)
{
string line = streamReader.ReadLine();
if (regEx.IsMatch(line))
{
// Do some action with the string.
}
}
}
Please remember to use a compiled RegEx, however. Here's a pretty good article with some benchmarks you can look at.
Happy coding!
I have a text file that contains a fixed length table that I am trying to parse. However, the beginning of the file is general information about when this table was generated (IE Time, Data, etc).
To read this I have attempted to make a FileStream, then read the first part of this file with a StreamReader. I parse out what I need from the top part of the document, and then when I am done, set the stream's position to the first line of the structured data.
Then I attach a TextFieldParser to the stream (with appropriate settings for the fixed length table), and then attempt to read the file. On the first row, it fails, and in the ErrorLine property, it lists off the last half of the third row of the table. I stepped through it and it was on the first row to read, yet the ErrorLine property suggests otherwise.
When debugging, I found that if I tried using my StreamReader.ReadLine() method after I had attached the TextFieldParser to the stream, the first 2 row show up fine. When I read the third row however, it returns a line where it starts with the first half of the third row (and stops right where the text in ErrorLine would be) appends some part from much later in the document. If I try this before I attach the TextFieldParser, it reads all 3 rows fine.
I have a feeling this has to do with my tying 2 readers to the same stream. I'm not sure how to read this with a structured part and an unstructured part, without just tokenizing the lines myself. I can do that but I assume I am not the first person to want to read part of a stream one way, and a later part of a stream in another.
Why is it skipping like this, and how would you read a text file with different formats?
Example:
Date: 3/1/2013
Time: 3:00 PM
Sensor: Awesome Thing
Seconds X Y Value
0 5.1 2.8 55
30 4.9 2.5 33
60 5.0 5.3 44
Code tailored for this simplified example:
Boolean setupInfo = true;
DataTable result = new DataTable();
String[] fields;
Double[] dFields;
FileStream stream = File.Open(filePath,FileMode.Open);
StreamReader reader = new StreamReader(stream);
String tempLine;
for(int j = 1; j <= 7; j++)
{
result.Columns.Add(("Column" + j));
}
//Parse the unstructured part
while(setupInfo)
{
tempLine = reader.ReadLine();
if( tempLine.StartsWith("Date: "))
{
result.Rows.Add(tempLine);
}
else if (tempLine.StartsWith("Time: "))
{
result.Rows.Add(tempLine);
}
else if (tempLine.StartsWith("Seconds"))
{
//break out of this loop because the
//next line to be read is the unstructured part
setupInfo = false;
}
}
//Parse the structured part
TextFieldParser parser = new TextFieldParser(stream);
parser.TextFieldType = FieldType.FixedWidth;
parser.HasFieldsEnclosedInQuotes = false;
parser.SetFieldWidths(10, 10, 10, 10);
while (!parser.EndOfData)
{
if (reader.Peek() == '*')
{
break;
}
else
{
fields = parser.ReadFields();
if (parseStrings(fields, out dFields))
{
result.Rows.Add(dFields);
}
}
}
return result;
The reason it's skipping is that the StreamReader is reading blocks of data from the FileStream, rather than reading character-by-character. For example, the StreamReader might read 4 kilobytes from the FileStream and then parse out the lines as required to respond to ReadLine() calls. So when you attach the TextFieldParser to the FileStream, it's going to read from the current file position -- which is where the StreamReader left it.
The solution should be pretty simple: just connect the TextFieldParser to the StreamReader:
TextFieldParser parser = new TextFieldParser(reader);
See TextFieldParser(TextReader reader)
Generally speaking, most streams are consuming - that is, once read, it's no longer available. You could fork off to multiple streams by writing an intermediary class that derives from Stream and either raises an event, republished to other streams, etc.
In your case you don't need the StreamReader at all. The best choice is to read the file with the File.ReadLines method instead. It does not load the whole file content, just the lines, until you've found everything you need:
foreach (string line in File.ReadLines(filePath))
{
if( line.StartsWith("Date: "))
{
result.Rows.Add(line);
}
else if (line.StartsWith("Time: "))
{
result.Rows.Add(line);
}
else if (line.StartsWith("Seconds"))
{
break;
}
}
EDIT
You can do it even more simply using LINQ:
var d = from line in File.ReadLines(filePath) where line.Contains("Date: ") select line;
foreach (var line in d)
    result.Rows.Add(line);