How to increase the writing speed of the streamwriter? - c#

How to raise the speed of the streamwriter to write a 83MB csv file.
I have inceresed the buffersize to 65536 but its also consuming more time. How to improve the speed.
StreamWriter writer =new streamWriter(
new FileStream(filePath, FileMode.CreateNew), Encoding.UTF8, 65536))
string str=string.Empty;
while((str = reader.ReadLine())!=null)
writer.WriteLine(str)}
writer.Close()

Depending on the number of lines your CSV file contains, you could possibly end up with a loop that executes millions of times. Not a good idea to access the disk that many times.
The easy way out (if you have memory) is to read the entire CSV file in a string[] in memory (using File.ReadAll() i think), do your processing and write it all once (File.WriteAll() i think).
This will greatly increase your performance.
The other way out is to use asynchronous read/writes, increase buffer size AND create a mechanism to read bigger chunks of data. Having a big buffer if you are only reading 1 line will not help you.

I would try raising the buffer even more. At that buffer size it's going to take over 8200 writes to create the whole file. Try a buffer around 256K or 512K.

Based on your comment showing what your code is actualy doing:
File.WriteAllText( filePath, reader.ReadToEnd() );

Related

Read text file with lots of line in C#

I have a text file, that could potentially have up to 1 Million lines in it, and I have a code for reading the file one line at a time, but this is taking a lot of time...lots and lots of time. Is there a method in c# to potentially optimize this process, and improve the reading. This is the code I'm using.
using(var file = new StreamReader(filePath))
{
while((line = file.ReadLine()) != null)
{
//do something.
}
}
Any suggestions on reading these lines in bulk or improving the process?
Thanks.
Thanks for all your comments. The issue had to do with the \do something where I was using the SmartXls library to write to Excel, which was causing the bottle neck. I have contacted the developers to address the issue. All the suggested solutions will work in other scenarios.
Well, this code would be simpler, if you're using .NET 4 or later you can use File.ReadLines:
foreach (var line in File.ReadLines())
{
// Do something
}
Note that this is not the same as ReadAllLines, as ReadLines returns an IEnumerable<string> which reads lines lazily, instead of reading the whole file in one go.
The effect at execution time will be broadly the same as your original code (it won't improve performance) - this is just simpler to read.
Fundamentally, if you're reading a large file, that can take a long time - but reading just a million lines shouldn't take "lots and lots of time". My guess is that whatever you're doing with the lines takes a long time. You might want to parallelize that, potentially using a producer/consumer queue (e.g. via BlockingCollection) or TPL Dataflow, or just use Parallel LINQ, Parallel.ForEach etc.
You should use a profiler to work out where the time is being spent. If you're reading from a very slow file system, then it's possible that it really is the reading which is taking the time. We don't have enough information to guide you on that, but you should be able to narrow it down yourself.
Try to use streamreader, see if it's faster
string filePath = "";
string fileData = "";
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
byte[] data = new byte[fs.Length];
fs.Seek(0, SeekOrigin.Begin);
fs.Read(data, 0, int.Parse(fs.Length.ToString()));
fileData = System.Text.Encoding.Unicode.GetString(data);
}
You can read more data at once using StreamReader's int ReadBlock(char[] buffer, int index, int count) rather than line by line. This avoids reading reading the entire file at once (File.ReadAllLines) but allows you to process larger chunks in RAM at a time.
To improve performance, consider performing whatever work you are currently doing in your loop by spawning another thread to handle the load.
Parallel.ForEach(file.ReadLines(), (line) =>
{
// do your business
});
If space is not an issue..Create a buffer of around 1mb..
using(BufferedStream bs=new BufferedStream(File.OpenRead(path),1024*1024))
{
int read=-1;
byte[] buffer=new byte[1024*1024];
while((read=bs.Read(buffer,0,buffer.Length))!=0)
{
//play with buffer
}
}
You can also use ReadAllLines(filepath) and load the file into an array of lines, like this:
string[] lines = System.IO.File.ReadAllLines(#"path");

How do I write an obscene amount of data to file?

I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader and StreamWriter.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[] using StreamWriter. After exploring StackOverflow and the rest of the internet, I came across File.WriteAllLines, which allows me to write string[]s to file, but I dont think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder and using the AppendLine method to add each line to make a gigantic string. Then I could convert that StringBuilder to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?
Two things will increase the speed of output using StreamWriter.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader reads into a buffer and parses lines out of it, and the StreamWriter writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader and StreamWriter is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
const int BufferSize = 64 * 1024;
var reader = new StreamReader(filename, Encoding.UTF8, true, BufferSize);
var writer = new StreamWriter(filename, Encoding.UTF8, BufferSize);
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems, you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance--to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than they do with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time) to something very close to max(input-time, process-time, output-time). And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.
According to the docs, StreamWriter does not automatically flush after every write by default, so it is buffered.
You could also use some of the lazy methods on the File class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine is declared like so:
private string ProcessLine(string input) {
string result = // do some calculation on input
return result;
}
Since ReadLines is lazy and WriteAllLines has a lazy overload, it will stream the file rather than attempting to read the whole thing.
What about building strings to write?
Something like
int cnt = 0;
StringBuilder s = new StringBuilder();
while(line = reader.readLine())
{
cnt++;
String x = (manipulate line);
s.append(x+"\n");
if(cnt%10000 == 0)
{
StreamWriter.write(s);
s=new StringBuilder();
}
}
Edited because comment below is right, should have used stringbuilder.

Read Extremely large file efficiently in C#. Currently using StreamReader

I have a Json file that is sized 50GB and beyond.
Following is what I have written to read a very small chunk of the Json. I now need to modify this to read the large file.
internal static IEnumerable<T> ReadJson<T>(string filePath)
{
DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
using (StreamReader sr = new StreamReader(filePath))
{
String line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
byte[] jsonBytes = Encoding.UTF8.GetBytes(line);
XmlDictionaryReader jsonReader = JsonReaderWriterFactory.CreateJsonReader(jsonBytes, XmlDictionaryReaderQuotas.Max);
var myPerson = ser.ReadObject(jsonReader);
jsonReader.Close();
yield return (T)myPerson;
}
}
}
Would it suffice if I specify the buffer size while constructing the StreamReader in the current code?
Please correct me if I am wrong here.. The buffer size basically specifies how much data is read from disk to memory at a time. So if File is 100MB in size with buffer size as 5MB, it reads 5MB at a time to memory, until entire file is read.
Assuming my understanding of point 3 is right, what would be the ideal buffer size with such a large text file? Would int.Max size be a bad idea? In 64-bit PC int.Max size is 2147483647. I presume buffer size is in bytes, which evaluates to about 2GB. This itself could consume time. I was looking at something like 100MB - 300MB as buffer size.
It is going to read a line at a time (of the input file), which could be 10 bytes, and could be all 50GB. So it comes down to : how is the input file structured? And if the input JSON has newlines other than cleanly at the breaks between objects, this could get really ill.
The buffer size might impact how much it reads while looking for the end of each line, but ultimately: it needs to find a new-line each time (at least, how it is written currently).
I think you should first compare different parsers before worrying about details as the buffer size.
The differences between DataContractJsonSerializer, Raven JSON or Newtonsoft JSON will be quite significant.
So your main issue with this is where are your boundaries, and given that your doc is a JSON doc it seems to me likely that your boundaries are likely to be classes, I assume (or hope) that you don't have 1 honking great class that is 50gb large. I also assume that you don't really need all those classes in memory but you may need to search the whole thing for your subset...does that sound roughly right? if so I think that your pseudo code is something like
using a Json parser that accepts a streamreader (newtonsoft?)
read and parse until eof
yield return your parsed class that matches criteria
read and parse next class
end

C#: Reading Huge CSV File

I'm parsing a 40MB CSV file.
It works nicely right now, and it's rather easy to parse, the only problem I have is performance, which of course is rather slow.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and then stop looping, so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
I could balance this by giving it a random start line, but the algorithm would still be O(n)... So I'm not sure if it's really worth it.
Is there a way I can improve my sequential parsing algorithm?
First: "Reading Huge CSV File" and "So I'm parsing a 40MB CSV file.". Ihave space delimited files here of 10+ GIGAbyte - what would you call those?
Also: the size of the file is irrelevant, you process them normally anyway line by line.
the only problem I have is performance, which of course is rather slow
Define. What do you think is slow? Parsing them is quite fast when done properly.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and
then stop looping, so if the entry is at the beggining of the file it
finishes quickly, but if it's at the end it takes a while.
Do NOT use a CSV file? More than 60 years ago people invented databases for this.
Is there a way I can improve my secuential parsing algorithm?
YOu mean except pulling the parsing into a separate thread, and using an efficient code (which you may not have - noone knows).
Theoretically you could:
Read on one thread, with a decent buffer (less IO = faster)
Move field split into thread 2 (optional)
Use tasks to parse the fields (one per field per line) so you use all processors).
I am currently processing some (around 10.000) files (with sizes in double digit gigabte sadly) and... I go this way (Have to process them in a specific order) to use my computer fully.
That should give you a lot - and seriously, a 40mb file should load in 0.x seconds (0.5 - 0.6).
STILL that is very inefficient. Any reason you do not load the file into a database like all people do? CSV is good as some transport format, it sucks as a database.
Why don't you convert your csv to a normal database. Even sqlexpress will be fine.
Of course.
Say you order it alphabetically.
Then, start in the middle.
Each iteration, move to the middle of the top or bottom; whichever has the appropriate key.
This algorithm has O(log n).
This is called a "binary search," and is what "Mike Christianson" suggests in his comment.
Will suggest you to break one 40Mb File into smaller size few files.
And using Parallel.ForEach you could improve file processing performace
You can load the CSV into DataTable and use available operations that could be faster than looping through
Loading it to database and perform your operation on that is another option
This, I believe, is the fastest way to read a CSV file sequentially. There may be other ways to extract data from CSV, but if you limited to this approach, then this solution might work for you.
const int BUFFER_SIZE = 0x8000; //represents 32768 bytes
public unsafe void parseCSV(string filePath)
{
byte[] buffer = new byte[BUFFER_SIZE];
int workingSize = 0; //store how many bytes left in buffer
int bufferSize = 0; //how many bytes were read by the file stream
StringBuilder builder = new StringBuilder();
char cByte; //character representation of byte
using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
do
{
bufferSize = fs.Read(buffer, 0, BUFFER_SIZE);
workingSize = bufferSize;
fixed (byte* bufferPtr = buffer)
{
byte* workingBufferPtr = bufferptr;
while (workingSize-- > 0)
{
switch (cByte = (char)*workingBufferPtr++)
{
case '\n':
break;
case '\r':
case ',':
builder.ToString();
builder.Clear();
break;
default:
builder.Append(cByte);
break;
}
}
}
} while (bufferSize != 0);
}
}
Explanation:
Reading the file into a byte buffer. This will be done using the basic Filestream class, which gives access to the always fast Read()
Unsafe code. While I generally recommend not using unsafe code, when traversing any kind of buffer, using pointers can bring a speedup.
StringBuilder since we will be concatenating bytes into workable strings to test againt the key. StringBuilder is by far the fastest way to append bytes together and get a workable string out them.
Note that this method fairly complaint with RFC 4180, but if you deal with quotes, you can easily modify the code I posted to handle trimming.

Filestream only reading the first 4 characters of the file

Hey there! I'm trying to read a 150mb file with a file stream but every time I do it all I get is: |zl instead of the whole stream. Note that it has some special characters in it.
Does anybody know what the problem could be? here is my code:
using (FileStream fs = File.OpenRead(path))
{
byte[] buffer = new byte[fs.Length];
fs.Read(buffer, 0, buffer.Length);
extract = Encoding.Default.GetString(buffer);
}
Edit:
I tried to read all text but it still returned the same four characters. It works fine on any other file except for these few. When I use read all lines it only gets the first line.
fs.Read() does not read the whole smash of bytes all at once, it reads some number of bytes and returns the number of bytes read. MSDN has an excellent example of how to use it to get the whole file:
http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx
For what it's worth, reading the entire 150MB of data into memory is really going to put a drain on your client's system -- the preferred option would be to optimize it so that you don't need the whole file all at once.
If you want to read text this way File.ReadAllLine (or ReadAllText) - http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx is better option.
My guess the file is not text file to start with and the way you display resulting string does stop at 0 characters.
As debracey pointed out Read returns number of bytes read - check that out. Also for file operations it is unlikely to stop at 4 characters...

Categories

Resources