How do I write an obscene amount of data to file?

How do I write an obscene amount of data to file? - c#

I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader and StreamWriter.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[] using StreamWriter. After exploring StackOverflow and the rest of the internet, I came across File.WriteAllLines, which allows me to write string[]s to file, but I dont think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder and using the AppendLine method to add each line to make a gigantic string. Then I could convert that StringBuilder to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?

Two things will increase the speed of output using StreamWriter.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader reads into a buffer and parses lines out of it, and the StreamWriter writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader and StreamWriter is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
const int BufferSize = 64 * 1024;
var reader = new StreamReader(filename, Encoding.UTF8, true, BufferSize);
var writer = new StreamWriter(filename, Encoding.UTF8, BufferSize);
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems, you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance--to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than they do with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time) to something very close to max(input-time, process-time, output-time). And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.

According to the docs, StreamWriter does not automatically flush after every write by default, so it is buffered.
You could also use some of the lazy methods on the File class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine is declared like so:
private string ProcessLine(string input) {
string result = // do some calculation on input
return result;
}
Since ReadLines is lazy and WriteAllLines has a lazy overload, it will stream the file rather than attempting to read the whole thing.

What about building strings to write?
Something like
int cnt = 0;
StringBuilder s = new StringBuilder();
while(line = reader.readLine())
{
cnt++;
String x = (manipulate line);
s.append(x+"\n");
if(cnt%10000 == 0)
{
StreamWriter.write(s);
s=new StringBuilder();
}
}
Edited because comment below is right, should have used stringbuilder.

Related

Loop on all lines of very large file C# [duplicate]

This question already has answers here:
What's the fastest way to read a text file line-by-line?
(9 answers)
Closed 4 years ago.
I want to loop on all the lines of a very large file (10GB for example) using foreach
I am currently using File.ReadLines like that:
var lines = File.ReadLines(fileName);
foreach (var line in lines) {
// Process line
}
But this is very slow if the file is larger than 2MB and it will do the loop very slowly.
How can I loop on very large files?
Any help would be appreciated.
Thanks!

The way you do it is the best way available given that
you don't want to read your whole file into RAM at once
your line processing is independent of previous lines
Sorry, reading stuff from a hard disk is just slow.
Improvements will likely come from other sources:
store your file on a faster device (SSD?)
get more RAM to read your file into memory to at least speed up processing

First of all do you need to read the whole file or only the section of the file.
If you only need to read the section of the file
const int chunkSize = 1024; // read the file by chunks of 1KB
using (var file = File.OpenRead("yourfile"))
{
int bytesRead;
var buffer = new byte[chunkSize];
while ((bytesRead = file.Read(buffer, 0 /* start offset */, buffer.Length)) > 0)
{
// TODO: Process bytesRead number of bytes from the buffer
// not the entire buffer as the size of the buffer is 1KB
// whereas the actual number of bytes that are read are
// stored in the bytesRead integer.
}
}
If you need to load the whole file to the memory.
Use this method repeatedly instead of directly loading to the memory since you have control over what you are doing and at any time you can stop the process.
Or you can use MemoryMappedFile https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx?f=255&MSPPError=-2147217396
Memory mapped files will give a view to the program as beign accessed from the Memory, but it will load from the Disk for the first time only.
long offset = 0x10000000; // 256 megabytes
long length = 0x20000000; // 512 megabytes
// Create the memory-mapped file.
using (var mmf = MemoryMappedFile.CreateFromFile(#"c:\ExtremelyLargeImage.data", FileMode.Open,"ImgA"))
{
// Create a random access view, from the 256th megabyte (the offset)
// to the 768th megabyte (the offset plus length).
using (var accessor = mmf.CreateViewAccessor(offset, length))
{
//Your process
}
}

The looping will always be slow because of the sheer number of items that you have to loop through. Im pretty sure that its not the looping but the actual work you are doing on each one of those lines that slows it down. A file with 10GB of lines could literally have trillions of lines and anything but the most simple of tasks will take a lot of time.
You could always try making the job threaded so that a different thread is working on a different line. That way at least you have more cores working on the problem.
Set up a for loop and have them increment at different amounts.
Also, im not 100% but I think that you could get a huge increase in speed by splitting the whole thing into an array of string by splitting based on new lines and then working through those since everything is stored in the memory.
string lines = "your huge text";
string[] words = lines.Split('\n');
foreach(string singleLine in lines)
{
}
** Added based on comments **
So there's massive downsides and will take a huge amount of memory. At least the amount that the original file used but this gets round the problem of a slow hard drive and all the data will be read directly into the RAM of the machine, which will be far far faster than reading from the hard drive in small chunks.
There is also an issue here of having a limit of about 2 billion lines, since that the is the maximum number of entries in an array that you can have.

C#: Reading Huge CSV File

I'm parsing a 40MB CSV file.
It works nicely right now, and it's rather easy to parse, the only problem I have is performance, which of course is rather slow.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and then stop looping, so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
I could balance this by giving it a random start line, but the algorithm would still be O(n)... So I'm not sure if it's really worth it.
Is there a way I can improve my sequential parsing algorithm?

First: "Reading Huge CSV File" and "So I'm parsing a 40MB CSV file.". Ihave space delimited files here of 10+ GIGAbyte - what would you call those?
Also: the size of the file is irrelevant, you process them normally anyway line by line.
the only problem I have is performance, which of course is rather slow
Define. What do you think is slow? Parsing them is quite fast when done properly.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and
then stop looping, so if the entry is at the beggining of the file it
finishes quickly, but if it's at the end it takes a while.
Do NOT use a CSV file? More than 60 years ago people invented databases for this.
Is there a way I can improve my secuential parsing algorithm?
YOu mean except pulling the parsing into a separate thread, and using an efficient code (which you may not have - noone knows).
Theoretically you could:
Read on one thread, with a decent buffer (less IO = faster)
Move field split into thread 2 (optional)
Use tasks to parse the fields (one per field per line) so you use all processors).
I am currently processing some (around 10.000) files (with sizes in double digit gigabte sadly) and... I go this way (Have to process them in a specific order) to use my computer fully.
That should give you a lot - and seriously, a 40mb file should load in 0.x seconds (0.5 - 0.6).
STILL that is very inefficient. Any reason you do not load the file into a database like all people do? CSV is good as some transport format, it sucks as a database.

Why don't you convert your csv to a normal database. Even sqlexpress will be fine.

Of course.
Say you order it alphabetically.
Then, start in the middle.
Each iteration, move to the middle of the top or bottom; whichever has the appropriate key.
This algorithm has O(log n).
This is called a "binary search," and is what "Mike Christianson" suggests in his comment.

Will suggest you to break one 40Mb File into smaller size few files.
And using Parallel.ForEach you could improve file processing performace

You can load the CSV into DataTable and use available operations that could be faster than looping through
Loading it to database and perform your operation on that is another option

This, I believe, is the fastest way to read a CSV file sequentially. There may be other ways to extract data from CSV, but if you limited to this approach, then this solution might work for you.
const int BUFFER_SIZE = 0x8000; //represents 32768 bytes
public unsafe void parseCSV(string filePath)
{
byte[] buffer = new byte[BUFFER_SIZE];
int workingSize = 0; //store how many bytes left in buffer
int bufferSize = 0; //how many bytes were read by the file stream
StringBuilder builder = new StringBuilder();
char cByte; //character representation of byte
using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
do
{
bufferSize = fs.Read(buffer, 0, BUFFER_SIZE);
workingSize = bufferSize;
fixed (byte* bufferPtr = buffer)
{
byte* workingBufferPtr = bufferptr;
while (workingSize-- > 0)
{
switch (cByte = (char)*workingBufferPtr++)
{
case '\n':
break;
case '\r':
case ',':
builder.ToString();
builder.Clear();
break;
default:
builder.Append(cByte);
break;
}
}
}
} while (bufferSize != 0);
}
}
Explanation:
Reading the file into a byte buffer. This will be done using the basic Filestream class, which gives access to the always fast Read()
Unsafe code. While I generally recommend not using unsafe code, when traversing any kind of buffer, using pointers can bring a speedup.
StringBuilder since we will be concatenating bytes into workable strings to test againt the key. StringBuilder is by far the fastest way to append bytes together and get a workable string out them.
Note that this method fairly complaint with RFC 4180, but if you deal with quotes, you can easily modify the code I posted to handle trimming.

Parsing large data file from disk significantly slower than parsing in memory?

While writing a simple library to parse a game's data files, I noticed that reading an entire data file into memory and parsing from there was significantly faster (by up to 15x, 106s v 7s).
Parsing is usually sequential but seeks will be done every now and then to read some data stored elsewhere in a file, linked by an offset.
I realise that parsing from memory will definitely be faster, but something is wrong if the difference is so significant. I wrote some code to simulate this:
public static void Main(string[] args)
{
Stopwatch n = new Stopwatch();
n.Start();
byte[] b = File.ReadAllBytes(#"D:\Path\To\Large\File");
using (MemoryStream s = new MemoryStream(b, false))
RandomRead(s);
n.Stop();
Console.WriteLine("Memory read done in {0}.", n.Elapsed);
b = null;
n.Reset();
n.Start();
using (FileStream s = File.Open(#"D:\Path\To\Large\File", FileMode.Open))
RandomRead(s);
n.Stop();
Console.WriteLine("File read done in {0}.", n.Elapsed);
Console.ReadLine();
}
private static void RandomRead(Stream s)
{
// simulate a mostly sequential, but sometimes random, read
using (BinaryReader br = new BinaryReader(s)) {
long l = s.Length;
Random r = new Random();
int c = 0;
while (l > 0) {
l -= br.ReadBytes(r.Next(1, 5)).Length;
if (c++ <= r.Next(10, 15)) continue;
// simulate seeking
long o = s.Position;
s.Position = r.Next(0, (int)s.Length);
l -= br.ReadBytes(r.Next(1, 5)).Length;
s.Position = o;
c = 0;
}
}
}
I used one of the game's data files as input to this. That file was about 102 MB, and it produced this result (Memory read done in 00:00:03.3092618. File read done in 00:00:32.6495245.) which has memory reading about 11x faster than file.
The memory read was done before the file read to try and improve its speed via the file cache. It's still that much slower.
I've tried increasing or decreasing FileStream's buffer size; nothing produced significantly better results, and increasing or decreasing it too much just worsened the speed.
Is there something I'm doing wrong, or is this to be expected? Is there any way to at least make the slowdown less significant?
Why is reading the entire file at once and then parsing it so much faster than reading and parsing simultaneously?
I've actually compared to a similar library written in C++, which uses the Windows native CreateFileMapping and MapViewOfFile to read files, and it's very fast. Could it be the constant switching from managed to unmanaged and the involved marshaling that causes this?
I've also tried MemoryMappedFiles present in .NET 4; the speed gain was only about one second.

Is there something I'm doing wrong, or is this to be expected?
No, nothing wrong. This is entirely expected. That accessing the disk is an order of magnitude slower than accessing memory is more than reasonable.
Update:
That a single read of the file followed by processing is faster than processing while reading is also expected.
Doing a large IO operation and processing in memory would be faster than getting a bit from disk, processing it, calling the disk again (waiting for the IO to complete), processing that bit etc...

Is there something I'm doing wrong, or is this to be expected?
A harddisk has, compared to RAM, huge access times. Sequential reads are pretty speedy, but as soon as the heads have to move (because data is fragmented) it takes lots of milliseconds to get the next bit of data, during which your application is idling.
Is there any way to at least make the slowdown less significant?
Buy an SSD.
You also can take a look at Memory-Mapped Files for .NET:
MemoryMappedFile.CreateFromFile().
As for your edit: I'd go with #Oded and say that reading the file on beforehand adds a penalty. However, that should not cause the method that first reads the whole file to be seven times as slow as 'process-as-you-read'.

I decided to do some benchmarks comparing various ways of reading a file in C++ and C#. First I created a 256mb file. In the c++ benchmarks, buffered means I first copied the entire file to a buffer then read the data from the buffer. All the benchmarks read the file, directly or indirectly, byte by byte sequentially and calculate a checksum. All times are measured from the moment I open the file until I am completely done and the file is closed. All benchmarks were run multiple times to maintain consistent OS file caching.
C++
Unbuffered memory mapped file: 300ms
Buffered memory mapped file: 500ms
Unbuffered fread: 23,000ms
Buffered fread: 500ms
Unbuffered ifstream: 26,000ms
Buffered ifstream: 700ms
C#
MemoryMappedFile: 112,000ms
FileStream: 2,800ms
MemoryStream: 2,300ms
ReadAllBytes: 600ms
Interpret the data as you wish. C#'s memory mapped files are slower than even the worst case c++ code, whereas c++'s memory mapped files are the fastest things around. C#'s ReadAllBytes is decently fast, only twice as slow as c++'s memory mapped files. So if you want the best performance I recommend you use ReadAllBytes and then access the data directly from the array without using a stream.

How to increase the writing speed of the streamwriter?

How to raise the speed of the streamwriter to write a 83MB csv file.
I have inceresed the buffersize to 65536 but its also consuming more time. How to improve the speed.
StreamWriter writer =new streamWriter(
new FileStream(filePath, FileMode.CreateNew), Encoding.UTF8, 65536))
string str=string.Empty;
while((str = reader.ReadLine())!=null)
writer.WriteLine(str)}
writer.Close()

Depending on the number of lines your CSV file contains, you could possibly end up with a loop that executes millions of times. Not a good idea to access the disk that many times.
The easy way out (if you have memory) is to read the entire CSV file in a string[] in memory (using File.ReadAll() i think), do your processing and write it all once (File.WriteAll() i think).
This will greatly increase your performance.
The other way out is to use asynchronous read/writes, increase buffer size AND create a mechanism to read bigger chunks of data. Having a big buffer if you are only reading 1 line will not help you.

I would try raising the buffer even more. At that buffer size it's going to take over 8200 writes to create the whole file. Try a buffer around 256K or 512K.

Based on your comment showing what your code is actualy doing:
File.WriteAllText( filePath, reader.ReadToEnd() );

Preallocating file space in C#?

I am creating a downloading application and I wish to preallocate room on the harddrive for the files before they are actually downloaded as they could potentially be rather large, and noone likes to see "This drive is full, please delete some files and try again." So, in that light, I wrote this.
// Quick, and very dirty
System.IO.File.WriteAllBytes(filename, new byte[f.Length]);
It works, atleast until you download a file that is several hundred MB's, or potentially even GB's and you throw Windows into a thrashing frenzy if not totally wipe out the pagefile and kill your systems memory altogether. Oops.
So, with a little more enlightenment, I set out with the following algorithm.
using (FileStream outFile = System.IO.File.Create(filename))
{
// 4194304 = 4MB; loops from 1 block in so that we leave the loop one
// block short
byte[] buff = new byte[4194304];
for (int i = buff.Length; i < f.Length; i += buff.Length)
{
outFile.Write(buff, 0, buff.Length);
}
outFile.Write(buff, 0, f.Length % buff.Length);
}
This works, well even, and doesn't suffer the crippling memory problem of the last solution. It's still slow though, especially on older hardware since it writes out (potentially GB's worth of) data out to the disk.
The question is this: Is there a better way of accomplishing the same thing? Is there a way of telling Windows to create a file of x size and simply allocate the space on the filesystem rather than actually write out a tonne of data. I don't care about initialising the data in the file at all (the protocol I'm using - bittorrent - provides hashes for the files it sends, hence worst case for random uninitialised data is I get a lucky coincidence and part of the file is correct).

FileStream.SetLength is the one you want. The syntax:
public override void SetLength(
long value
)

If you have to create the file, I think that you can probably do something like this:
using (FileStream outFile = System.IO.File.Create(filename))
{
outFile.Seek(<length_to_write>-1, SeekOrigin.Begin);
OutFile.WriteByte(0);
}
Where length_to_write would be the size in bytes of the file to write. I'm not sure that I have the C# syntax correct (not on a computer to test), but I've done similar things in C++ in the past and it's worked.

Unfortunately, you can't really do this just by seeking to the end. That will set the file length to something huge, but may not actually allocate disk blocks for storage. So when you go to write the file, it will still fail.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.