C# Parsing text file overhead


I'm trying to parse a pretty simple text file into some structures. For that, I need the whole text split on every newline and on every whitespace character. The code is pretty straightforward:
string path = "C:/file.ext";
string fileString = File.ReadAllText (path);
string[] splitFile = fileString.Split (' ', '\n', '/');
After profiling the above code (using the game engine's built-in profiler), I noticed that while parsing a 40KB file, 280KB of memory is allocated by File.ReadAllText and 310KB by string.Split, which adds up to almost 15 times the size of the file.
Is this normal?
Is there any way to read text files while avoiding such big allocations (maybe unsafe code)?
NOTE:
The main point is whether allocations several times bigger than the file's size are normal when reading it. I understand that reading line by line will let the GC collect garbage from the previous ReadLine. It just doesn't seem normal, and, since the target device is an old Android phone, I was worried that parsing a 50MB file might actually make the application crash. What I mean by question 2 is how to minimize allocations, not how to split those allocations up.

I do not have enough reputation to comment on the above post, but have you tried reading the file in binary form using the BinaryReader class, reading 8 bytes at a time?
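A minimal sketch of the low-allocation idea, assuming the goal is just to get the tokens out without ever holding the whole file as one string. It uses a TextReader with one reusable char buffer rather than BinaryReader, so decoding is still handled by the framework. The class and method names (TokenReader, ReadTokens) are made up for illustration, and unlike string.Split it skips empty entries between consecutive separators.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TokenReader
{
    // Yields tokens separated by any of the given characters, reading the
    // input through one reusable char buffer instead of one big string.
    public static IEnumerable<string> ReadTokens(TextReader reader, params char[] separators)
    {
        var buffer = new char[4096];      // reused for every read
        var token = new StringBuilder();  // reused for every token
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (Array.IndexOf(separators, c) >= 0)
                {
                    // A separator ends the current token (empty tokens are skipped).
                    if (token.Length > 0)
                    {
                        yield return token.ToString();
                        token.Length = 0;
                    }
                }
                else
                {
                    token.Append(c);
                }
            }
        }
        // The last token may not be followed by a separator.
        if (token.Length > 0)
            yield return token.ToString();
    }
}
```

Usage would be something like foreach (string t in TokenReader.ReadTokens(new StreamReader(path), ' ', '\n', '/')). Each token is still a new string, but the file-sized string and the file-sized string[] are gone, so peak memory no longer grows with the file.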

Related

How to load file fully and process record csvreader?

I use the CSV reader and found that it takes a lot of time to parse the data. How can I load the entire CSV file into memory and then process it record by record? I have to do custom mapping of the records.
TextReader tr = new StreamReader(File.Open(@"C:\MarketData\" + symbol + ".txt", FileMode.Open));
CsvReader csvr = new CsvReader(tr);
while (csvr.Read())
{
// do your magic
}

Create a class that exactly represents/mirrors your CSV file. Then read all the contents into a list of that class. The following snip is from CsvHelper's documentation.
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>().ToList();
The important part is the .ToList(), as this will force the load of all the data into your list, rather than yielding results as you access them.
You can then perform additional mapping / extraction on that list, which will be in memory.
If you're already doing this, you may benefit from loading your CSV into a HashSet rather than a List (via ToHashSet()). See HashSet vs List Performance

To answer your question directly: You can load the file fully into a memory stream and then re-read from that stream using your CsvReader. Similarly, you can create a bigger read buffer for your filestream, eg, 15MB, which would read the entire file into the buffer in one hit. I doubt either of these will actually improve performance for 10MB files.
Find your real performance bottleneck: Time to read file content from disk, time to parse CSV into fields, or time to process a record? A 10MB file looks really small. I'm processing sets of 250MB+ csv files with a custom csv reader with no complaints.
If processing is the bottleneck and you have several threads available and your csv file format does not need to support escaped line breaks, then you could read the entire file into a list of lines (System.IO.File.ReadAllLines / .ReadLines) and parse each line using a different Task. For example:
System.IO.File.ReadLines(path) // path: the csv file to read
.Skip(1) // header line. Assume trusted to be correct.
.AsParallel()
.Select(ParseRecord) // RecordClass ParseRecord(string line)
.ForAll(ProcessRecord); // void ProcessRecord(RecordClass)
If you have many files to parse, you could process each file in a different Task and use async methods to maximise throughput. If they all come from the same physical disk then your mileage will vary, and performance may even be worse than with a single-threaded approach.
More advanced:
If you know your files to contain 8-bit characters only, then you can operate on byte arrays and skip the StreamReader overheads to cast bytes into chars. This way you can read the entire file into a byte array in a single call and scan for line breaks assuming no line break escapes need to be supported. In that case scanning for line breaks can be done by multiple threads, each looking at a part of the byte array.
If you don't need to support field escapes (a,"b,c",d), then you can write a faster parser, simply looking for field separators (typically comma). You can also split field-demarcation parsing and field content parsing into threads if that's a bottleneck, though memory access locality may negate any benefits.
Under certain circumstances you may not need to parse fields into intermediate data structures (eg doubles, strings) and can process directly off references to the start/end of fields and save yourself some intermediate data structure creation.
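To illustrate the byte-array idea from the "More advanced" notes, here is a rough single-threaded sketch that scans a byte array for line breaks and records where each line starts. It assumes 8-bit text and no line breaks inside quoted fields, as stated above. ByteLineScanner and FindLineStarts are hypothetical names, not part of any library.

```csharp
using System.Collections.Generic;

public static class ByteLineScanner
{
    // Returns the offset of the first byte of every line in data.
    // Assumes '\n' ends a line (a preceding '\r' stays part of the line)
    // and that line breaks never appear inside quoted fields.
    public static List<int> FindLineStarts(byte[] data)
    {
        var starts = new List<int>();
        if (data.Length == 0)
            return starts;

        starts.Add(0);
        // Stop one byte early: a '\n' in the last position ends the final
        // line but does not start a new one.
        for (int i = 0; i < data.Length - 1; i++)
        {
            if (data[i] == (byte)'\n')
                starts.Add(i + 1);
        }
        return starts;
    }
}
```

The input could come from File.ReadAllBytes. To spread the scan over several threads, each thread would run the same loop over its own slice of the array and the resulting lists would be concatenated in slice order.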

How to read specific line from large file?

I have a problem reading a single line from a large file encoded in UTF-8. Lines in the file have a constant length.
The file has 300k lines on average. Time is the main constraint, so I want to do it the fastest way possible.
I've tried LINQ
File.ReadLines("file.txt").Skip(noOfLines).Take(1).First();
But the time is not satisfactory.
My biggest hope was using the stream and setting its position to the start of the desired line, but the problem is that the lines' sizes in bytes differ.
Any ideas, how to do it?

Now this is where you don't want to use LINQ (-:
You actually want to find the nth occurrence of a newline in the file and read from there until the next newline.
You probably want to check out this documentation on memory mapped files as well:
https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile(v=vs.110).aspx
There is also a post comparing different access methods
http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files
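A rough sketch of the "find the nth newline" approach using a plain stream instead of a memory mapped file. It counts '\n' bytes in raw blocks and only decodes the one line that is wanted. This works on UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence. LineSeeker and ReadLineAt are names invented for this example, and a byte order mark at the start of the file is not handled.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineSeeker
{
    // Returns the line with the given 0-based index from a UTF-8 stream,
    // or null if the stream has fewer lines. Only that line is decoded.
    public static string ReadLineAt(Stream stream, int lineIndex)
    {
        var buffer = new byte[64 * 1024];
        var lineBytes = new MemoryStream();
        int currentLine = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                if (b == (byte)'\n')
                {
                    if (currentLine == lineIndex)
                        return Decode(lineBytes);
                    currentLine++;
                }
                else if (currentLine == lineIndex)
                {
                    lineBytes.WriteByte(b);   // collect bytes of the wanted line only
                }
            }
        }
        // The last line may not end with a newline.
        if (currentLine == lineIndex && lineBytes.Length > 0)
            return Decode(lineBytes);
        return null;
    }

    private static string Decode(MemoryStream bytes)
    {
        // Strip the '\r' left over from Windows-style "\r\n" line endings.
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}
```

If the same file is queried many times, one pass that stores the byte offset of every line start would let later lookups seek directly with Stream.Position instead of rescanning.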

How do I write an obscene amount of data to file?

I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader and StreamWriter.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[] using StreamWriter. After exploring StackOverflow and the rest of the internet, I came across File.WriteAllLines, which allows me to write string[]s to file, but I don't think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder and using the AppendLine method to add each line to make a gigantic string. Then I could convert that StringBuilder to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?

Two things will increase the speed of output using StreamWriter.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader reads into a buffer and parses lines out of it, and the StreamWriter writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader and StreamWriter is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
const int BufferSize = 64 * 1024;
var reader = new StreamReader(inputFilename, Encoding.UTF8, true, BufferSize);
var writer = new StreamWriter(outputFilename, false, Encoding.UTF8, BufferSize); // false = overwrite instead of append
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems, you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance--to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than they do with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time) to something very close to max(input-time, process-time, output-time). And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.
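The blog posts are not reproduced here, so the following is only a sketch of how a reader/processor/writer pipeline could look with BlockingCollection queues; it is not necessarily what those posts do. LinePipeline, Run and the queue capacity of 10000 are assumptions made for the example. Error handling is deliberately thin: if a later stage throws, an earlier stage can block on a full queue, so real code would also want cancellation.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class LinePipeline
{
    // Reads lines from input, transforms each with process, and writes the
    // results to output, with each stage running on its own task.
    public static void Run(TextReader input, TextWriter output, Func<string, string> process)
    {
        // Bounded queues keep the reader from racing far ahead of the writer.
        var toProcess = new BlockingCollection<string>(10000);
        var toWrite = new BlockingCollection<string>(10000);

        Task reader = Task.Run(() =>
        {
            try
            {
                string line;
                while ((line = input.ReadLine()) != null)
                    toProcess.Add(line);
            }
            finally { toProcess.CompleteAdding(); }   // tells the processor to finish
        });

        Task processor = Task.Run(() =>
        {
            try
            {
                foreach (string line in toProcess.GetConsumingEnumerable())
                    toWrite.Add(process(line));
            }
            finally { toWrite.CompleteAdding(); }     // tells the writer to finish
        });

        Task writer = Task.Run(() =>
        {
            foreach (string line in toWrite.GetConsumingEnumerable())
                output.WriteLine(line);
        });

        Task.WaitAll(reader, processor, writer);
    }
}
```

Because each stage has a single producer and a single consumer, lines come out in the same order they went in.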

According to the docs, StreamWriter does not automatically flush after every write by default, so it is buffered.
You could also use some of the lazy methods on the File class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine is declared like so:
private string ProcessLine(string input) {
string result = input; // placeholder: do some calculation on input
return result;
}
Since ReadLines is lazy and WriteAllLines has a lazy overload, it will stream the file rather than attempting to read the whole thing.

What about building strings to write?
Something like
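// reader (StreamReader) and writer (StreamWriter) are assumed to be open already.
int cnt = 0;
var s = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null)
{
    cnt++;
    string x = Manipulate(line); // Manipulate: your own per-line formatting
    s.Append(x).Append('\n');
    if (cnt % 10000 == 0)
    {
        // Write the accumulated batch and start a new one.
        writer.Write(s.ToString());
        s.Clear();
    }
}
// Write whatever is left after the last full batch.
writer.Write(s.ToString());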
Edited because comment below is right, should have used stringbuilder.

Reading file into RichTextBox without using LoadFile

I want to read a file into a RichTextBox without using LoadFile (I might want to display the progress). The file contains only ASCII characters.
I was thinking of reading the file in chunks.
I have done the following (which is working):
const int READ_BUFFER_SIZE = 4 * 1024;
BinaryReader reader = new BinaryReader(File.Open("file.txt", FileMode.Open));
byte[] buf = new byte[READ_BUFFER_SIZE];
do {
int ret = reader.Read(buf, 0, READ_BUFFER_SIZE);
if (ret <= 0) {
break;
}
// Decode only the bytes actually read; the last chunk is usually shorter than the buffer.
string text = Encoding.ASCII.GetString(buf, 0, ret);
richTextBox.AppendText(text);
} while (true);
reader.Close();
My concern is:
string text = Encoding.ASCII.GetString(buf, 0, ret);
I have seen that it is not possible to add a byte[] to a RichTextBox.
My questions are:
Will a new string object be allocated for every chunk which is read?
Isn't there a better way not to have to create a string object just for appending the text to the RichTextBox?
Or, is it more efficient to read lines from the file (StreamReader.ReadLine) and just add to the RichTextBox the string returned?

Will a new string object be allocated for every chunk which is read?
Yes.
Isn't there a better way not to have to create a string object just for appending the text to the RichTextBox?
No, AppendText requires a string
Or, is it more efficient to read lines from the file (StreamReader.ReadLine) and just add to the RichTextBox the string returned?
No, that's considerably less efficient. You'll now create a new string object much more frequently. That is okay from the perspective of the garbage collected heap, since you don't create more garbage. But it is absolute murder on the RichTextBox, which constantly needs to re-allocate its own buffer, and that includes moving all the text previously read. What you have is already good; you should just use a much larger READ_BUFFER_SIZE.
Unfortunately there are conflicting goals here. You don't want to make the buffer larger than 39,999 bytes, or the strings end up in the Large Object Heap and clog it up until a gen 2 garbage collection happens (each decoded character takes two bytes in a .NET string, and objects of 85,000 bytes or more are allocated on the LOH). But the RTB will be much happier if you go considerably past that size, like a megabyte, if the file is so large that you need a progress bar.
If you want to make it really efficient then you need to replace RichTextBox.LoadFile(). The underlying Windows message is EM_STREAMIN, which uses a callback mechanism to stream in the text. You can technically replace the callback to do what the default one does in RichTextBox, plus update a progress bar. It does permit getting rid of the strings, by the way. The P/Invoke is pretty unfriendly, so use the Reference Source for guidance.
Take the easy route first and increase the buffer size. Only consider the P/Invoke route when your code is considerably slower than using File.ReadAllText().
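As a sketch of the easy route, here is the chunked read with a configurable buffer size and a progress callback. To keep it independent of WinForms, the RichTextBox is represented by an Action<string>; ChunkedTextLoader and its parameter names are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedTextLoader
{
    // Reads ASCII text from source in chunks of bufferSize bytes, passing
    // each decoded chunk to appendText and the running byte count to
    // reportBytesRead.
    public static void Load(Stream source, int bufferSize,
                            Action<string> appendText, Action<long> reportBytesRead)
    {
        var buffer = new byte[bufferSize];   // allocated once, reused per chunk
        long total = 0;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Decode only the bytes that were actually read.
            appendText(Encoding.ASCII.GetString(buffer, 0, read));
            total += read;
            reportBytesRead(total);
        }
    }
}
```

In a form this could be called as ChunkedTextLoader.Load(stream, 32 * 1024, richTextBox.AppendText, n => progressBar.Value = (int)(100 * n / stream.Length)), where progressBar is whatever progress control the form has. One string per chunk is still allocated, as the answer above explains.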

Try this:
richTextBox.AppendText(File.ReadAllText("file.txt"));
or
richTextBox.AppendText(File.ReadAllText("file.txt", Encoding.ASCII));

You can use a StreamReader. Then you can read each row of the file and display the progress while reading.

Fast/low-memory method to parse first two columns in a large csv file using c#

I'm parsing a large csv file, about 500 MB (many rows, many columns). I only need the first two columns (so up to the second comma on each line). Also, multiple threads need access to this file at the same time, so I can't take an exclusive lock.
What's the fastest/least memory consuming approach to this problem? What classes/methods should I be looking at? I assume that I should stay as low-level as possible - reading character by character, line by line?
Perhaps this is a way to allow simultaneous access?
using ( var filestream = new FileStream( filePath , FileMode.Open , FileAccess.Read , FileShare.Read ) )
{
using ( var reader = new StreamReader( filestream ) )
{
...
}
}
Edit
Decided to check out http://www.codeproject.com/KB/database/CsvReader.aspx
which seems to give me the ability to read just two columns and then skip to the next line.
They also have some benchmarks showing fast performance and low memory profile.

If you want low memory use, you'll probably want a StreamReader and to read line by line with ReadLine.
In a similar case the other day, I was able to skip the first 20,000,000 lines in a 500 MB file and build a string (using StringBuilder) for the next 1,000,000 lines in about 7 seconds.

Assuming that the file contains ASCII encoded text (would be typical for csv), your best bet may be to use Stream directly and the Stream.Read method, which allows you to read into a pre-allocated buffer. This has a few advantages:
You only allocate a buffer once, whereas ReadLine() will create a new String for every line.
You don't have to perform the Unicode conversion for the entire line; you can either do this only for the portion up to the second comma or (if you're severely time-constrained), you can write your own numeric parser that operates on the ASCII string data in the buffer (I'm sure there are well-documented algorithms for doing this.) This is assuming you need numeric data, of course.
Additional methods you'll likely need include the ASCII Encoding methods, particularly Encoding.ASCII.GetString.
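A minimal sketch of that Stream.Read approach for the two-column case, assuming ASCII content and no quoted fields containing commas or line breaks. TwoColumnReader is a hypothetical name. Bytes past the second comma are skipped without ever being turned into characters.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TwoColumnReader
{
    // Yields the first two comma-separated fields of every line, reading
    // the stream through one pre-allocated byte buffer.
    public static IEnumerable<KeyValuePair<string, string>> Read(Stream stream)
    {
        var buffer = new byte[64 * 1024];
        var first = new StringBuilder();
        var second = new StringBuilder();
        int column = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                if (b == (byte)'\n')
                {
                    yield return new KeyValuePair<string, string>(first.ToString(), second.ToString());
                    first.Length = 0;
                    second.Length = 0;
                    column = 0;
                }
                else if (b == (byte)',')
                {
                    column++;
                }
                else if (b != (byte)'\r')
                {
                    // ASCII bytes map directly to chars; later columns are ignored.
                    if (column == 0) first.Append((char)b);
                    else if (column == 1) second.Append((char)b);
                }
            }
        }
        // Last line without a trailing newline.
        if (first.Length > 0 || second.Length > 0 || column > 0)
            yield return new KeyValuePair<string, string>(first.ToString(), second.ToString());
    }
}
```

The stream can be opened with FileAccess.Read and FileShare.Read as in the question, so that several threads can each read the file through their own FileStream.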
