I want to read a file into a RichTextBox without using LoadFile (I might want to display the progress). The file contains only ASCII characters.
I was thinking of reading the file in chunks.
I have done the following (which is working):
const int READ_BUFFER_SIZE = 4 * 1024;
BinaryReader reader = new BinaryReader(File.Open("file.txt", FileMode.Open));
byte[] buf = new byte[READ_BUFFER_SIZE];
do
{
    int ret = reader.Read(buf, 0, READ_BUFFER_SIZE);
    if (ret <= 0)
    {
        break;
    }
    // Convert only the bytes actually read; the last chunk is usually shorter than the buffer.
    string text = Encoding.ASCII.GetString(buf, 0, ret);
    richTextBox.AppendText(text);
} while (true);
My concern is:
string text = Encoding.ASCII.GetString(buf, 0, ret);
I have seen that it is not possible to add a byte[] to a RichTextBox.
My questions are:
Will a new string object be allocated for every chunk which is read?
Isn't there a better way to avoid creating a string object just to append the text to the RichTextBox?
Or, is it more efficient to read lines from the file (StreamReader.ReadLine) and just add the returned string to the RichTextBox?
Will a new string object be allocated for every chunk which is read?
Yes.
Isn't there a better way to avoid creating a string object just to append the text to the RichTextBox?
No; AppendText requires a string.
Or, is it more efficient to read lines from the file (StreamReader.ReadLine) and just add the returned string to the RichTextBox?
No, that's considerably less efficient. You'll now create a new string object much more frequently. That's okay from the garbage-collected-heap perspective; you don't create more garbage. But it is absolute murder on the RichTextBox: it constantly needs to re-allocate its own buffer, which includes moving all the text previously read. What you have is already good; you should just use a much larger READ_BUFFER_SIZE.
Unfortunately there are conflicting goals here. You don't want to make the buffer larger than 39,999 bytes, or the strings end up in the Large Object Heap (each char takes two bytes, so a string that size sits just under the 85,000-byte LOH threshold) and clog it up until a gen #2 garbage collection happens. But the RTB will be much happier if you go considerably past that size, like a megabyte, if the file is so large that you need a progress bar.
If you want to make it really efficient then you need to replace RichTextBox.LoadFile(). The underlying Windows message is EM_STREAMIN; it uses a callback mechanism to stream in the text. You can technically replace the callback to do what the default one in RichTextBox does, plus update a progress bar. It does permit getting rid of the strings, by the way. The P/Invoke is pretty unfriendly; use the Reference Source for guidance.
Take the easy route first, increase the buffer size. Only consider using the pinvoke route when your code is considerably slower than using File.ReadAllText().
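For illustration, here's a rough sketch of the easy route: the same chunked loop with a much larger buffer plus progress reporting. It assumes a Windows Forms form with richTextBox and progressBar controls; the names and the one-megabyte buffer are just examples, not anything prescribed by the answer.
void LoadWithProgress(string path)
{
    const int ReadBufferSize = 1024 * 1024; // ~1 MB keeps RTB re-allocations down
    using (FileStream stream = File.OpenRead(path))
    {
        byte[] buf = new byte[ReadBufferSize];
        long total = stream.Length;
        long done = 0;
        int ret;
        while ((ret = stream.Read(buf, 0, buf.Length)) > 0)
        {
            // Convert only the bytes actually read, not the whole buffer.
            richTextBox.AppendText(Encoding.ASCII.GetString(buf, 0, ret));
            done += ret;
            progressBar.Value = total > 0 ? (int)(100 * done / total) : 100;
            progressBar.Refresh(); // force a repaint; a real app might do the I/O on a worker thread
        }
    }
}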
Try this:
richTextBox.AppendText(File.ReadAllText("file.txt"));
or
richTextBox.AppendText(File.ReadAllText("file.txt", Encoding.ASCII));
You can use a StreamReader. Then you can read each line of the file and display the progress while reading.
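A rough sketch of that suggestion, estimating progress from the underlying stream position (the control names are illustrative, and note the position advances in buffer-sized jumps because StreamReader reads ahead):
using (var reader = new StreamReader("file.txt", Encoding.ASCII))
{
    long total = Math.Max(1, reader.BaseStream.Length);
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        richTextBox.AppendText(line + Environment.NewLine);
        progressBar.Value = (int)(100 * reader.BaseStream.Position / total);
    }
}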
Related
I'm trying to parse a pretty simple text file into some structures. For that, I need the whole text split on every newline and on every whitespace. The code is pretty straightforward:
string path = "C:/file.ext";
string fileString = File.ReadAllText (path);
string[] splitFile = fileString.Split (' ', '\n', '/');
After profiling the above code (using the built-in game engine profiler), I noticed that while parsing a 40 KB file, there are 280 KB allocated by File.ReadAllText and 310 KB by string's Split, which sums to almost 15 times the size of the file.
Is it normal?
Is there any way to read text files while avoiding such big allocations (maybe unsafe code?) ?
NOTE:
The main point is whether allocations x times bigger than the file's size are normal when reading it. I understand that reading line by line will let the GC collect the garbage from the previous ReadLine. It just doesn't seem normal, and, since the target device is an old Android phone, I was worried whether parsing a 50 MB file wouldn't actually make the application crash. What I mean by question 2 is how to minimize the allocations, not how to split them up.
I do not have enough reputation to comment on the above post, but have you tried reading the file in binary form using the BinaryReader class, reading in 8 bytes at a time?
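A quick sketch of that idea; only one small buffer is allocated up front, instead of one big string plus a split array (the file path and the 8-byte chunk size are just examples):
using (var reader = new BinaryReader(File.OpenRead("C:/file.ext")))
{
    byte[] chunk = new byte[8];
    int read;
    while ((read = reader.Read(chunk, 0, chunk.Length)) > 0)
    {
        // Scan chunk[0..read) for ' ', '\n' and '/' and assemble tokens here.
    }
}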
I have a StringBuilder that appends all the pixels in an image, and this amount is extremely large. Every time I run my program everything goes well, but once I change a pixel color (ArGB) I get an OutOfMemoryException at the spot where I clear the StringBuilder. The problem is that I need to create an instance of StreamWriter, add my text to it, and then set the file path. My current code is:
StringBuilder PixelFile = new StringBuilder("", 5000);

private void Render()
{
    // On the second run, I get an OutOfMemoryException here
    PixelFile.Clear();

    // This is inside a for loop; cut out for brevity.
    PixelFile.Append(ArGBFormat);
}
I do not know what is causing this. I have tried PixelFile.Length = 0; and PixelFile.Capacity = 0;
OutOfMemory probably means you're building a string bigger than StringBuilder can handle; it is designed for a very different type of operation.
While I'm at a loss for how to make StringBuilder work, let me point you at a more intuitive implementation that will be less likely to fail.
You can read and write from a file using direct binary through the BinaryReader and BinaryWriter classes. This can also save you a lot of effort since you can make sure you're serializing bytes instead of character strings or entire words.
If you absolutely must use plaintext, consider the StreamReader and StreamWriter classes directly, as they won't throw exceptions for size. Remember, streams are intended for this sort of operation, StringBuilder is not, so Streams are far more likely to work with far less effort on your part.
EDIT:
When the maximum capacity is reached, no further memory can be allocated for the StringBuilder object, and trying to add characters or expand it beyond its maximum capacity throws either an ArgumentOutOfRangeException or an OutOfMemoryException exception.
Therefore, this is a limitation of the StringBuilder class and cannot be overcome with your current implementation.
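You can inspect that limit at run time through the MaxCapacity property (a real property on StringBuilder; it defaults to Int32.MaxValue, so in practice the exception usually comes from failing to allocate the huge backing buffer contiguously):
var sb = new StringBuilder();
Console.WriteLine(sb.MaxCapacity); // 2147483647 (Int32.MaxValue) by default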
EDIT: Additional implementation
In addition to StreamWriters which can write directly to files, you can also use the MemoryStream class to pipe information to memory instead of disk. Be aware this could lead to slow performance of the program, and I recommend instead trying to refactor the process to only need to perform a stream once.
That being said, it is still possible.
var mem = new MemoryStream();
var memWriter = new StreamWriter(mem);
// TODO: use memWriter.Write as per StreamWriter
memWriter.Flush();  // push any buffered text into the MemoryStream first
mem.Position = 0;   // this ensures you are copying your stream from the beginning
// TODO: Show your file save dialog
var fileStream = new FileStream(fileNameFromDialog, FileMode.Create);
mem.CopyTo(fileStream); // Perform the copy
I have a file that I'm opening into a stream and passing to another method. However, I'd like to replace a string in the file before passing the stream to the other method. So:
string path = "C:/...";
Stream s = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
//need to replace all occurrences of "John" in the file to "Jack" here.
CallMethod(s);
The original file should not be modified, only the stream. What would be the easiest way to do this?
Thanks...
It's a lot easier if you just read the file in as lines and then deal with those, instead of forcing yourself to stick with a Stream. A stream deals with both text and binary files and needs to be able to read one character at a time, which makes such replacement very hard. If you read in a whole line at a time (as long as you don't need multi-line replacement), it's quite easy.
var lines = File.ReadLines(path)
.Select(line => line.Replace("John", "Jack"));
Note that ReadLines still does stream the data, and Select doesn't need to materialize the whole thing, so you're still not reading the whole file into memory at one time when doing this.
If you don't actually need to stream the data you can easily just load it all as one big string, do the replace, and then create a stream based on that one string:
string data = File.ReadAllText(path)
.Replace("John", "Jack");
byte[] bytes = Encoding.ASCII.GetBytes(data);
Stream s = new MemoryStream(bytes);
This question probably has many good answers. I'll try one I've used and has always worked for me and my peers.
I suggest you create a separate stream, say a MemoryStream. Read from your file stream and write into the memory one. You can then extract strings from either one and replace stuff, and you would pass the memory stream ahead. That makes doubly sure that you are not messing up the original stream, and you can even read the original values from it whenever you need, though you are basically using twice as much memory with this method.
If the file has extremely long lines, if the replaced string may contain a newline, or if other constraints prevent the use of File.ReadLines() while still requiring streaming, there is an alternative solution using streams only, even though it is not trivial.
Implement your own stream decorator (wrapper) that performs the replacement. I.e. a class based on Stream that takes another stream in its constructor, reads data from the stream in its Read(byte[], int, int) override and performs the replacement in the buffer. See notes to Stream implementers for further requirements and suggestions.
Let's call the string being replaced "needle", the source stream "haystack" and the replacement string "replacement".
Needle and replacement need to be encoded using the encoding of the haystack contents (typically Encoding.UTF8.GetBytes()). Inside streams, the data is not converted to string, unlike in StreamReader.ReadLine(). Thus unnecessary memory allocation is prevented.
Simple cases: If both needle and replacement are just a single byte, the implementation is a simple loop over the buffer, replacing all occurrences. If needle is a single byte and replacement is empty (i.e. deleting the byte, e.g. deleting carriage returns for line-ending normalization), it is a simple loop maintaining from and to indexes into the buffer, rewriting it byte by byte.
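As a minimal sketch of the single-byte case, here is a read-only decorator; the class name and shape are mine, not from any library, and the multi-byte case needs the internal buffering described next:
using System;
using System.IO;

// Sketch only: a read-only Stream decorator that replaces every occurrence
// of one byte with another in the buffer handed back by Read().
public class ByteReplacingStream : Stream
{
    private readonly Stream inner;
    private readonly byte needle;
    private readonly byte replacement;

    public ByteReplacingStream(Stream inner, byte needle, byte replacement)
    {
        this.inner = inner;
        this.needle = needle;
        this.replacement = replacement;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Read from the wrapped stream, then rewrite matches in place.
        int read = inner.Read(buffer, offset, count);
        for (int i = offset; i < offset + read; i++)
        {
            if (buffer[i] == needle)
                buffer[i] = replacement;
        }
        return read;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => inner.Length;
    public override long Position
    {
        get => inner.Position;
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
For instance, CallMethod(new ByteReplacingStream(s, (byte)'\r', (byte)'\n')) would rewrite carriage returns on the fly without materializing any strings.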
In more complex cases, implement the KMP algorithm to perform the replacement.
Read the data from the underlying stream (haystack) to an internal buffer that is at least as long as needle and perform the replacement while rewriting the data to the output buffer. The internal buffer is needed so that data from a partial match are not published before a complete match is detected -- then, it would be too late to go back and delete the match completely.
Process the internal buffer byte by byte, feeding each byte into the KMP automaton. With each automaton update, write the bytes it releases to the appropriate position in output buffer.
When a match is detected by KMP, replace it: reset the automaton keeping the position in the internal buffer (which deletes the match) and write the replacement in the output buffer.
When end of either buffer is reached, keep the unwritten output and unprocessed part of the internal buffer including current partial match as a starting point for next call to the method and return the current output buffer. Next call to the method writes the remaining output and starts processing the rest of haystack where the current one stopped.
When end of haystack is reached, release the current partial match and write it to the output buffer.
Just be careful not to return an empty output buffer before processing all the data of haystack -- that would signal end of stream to the caller and therefore truncate the data.
I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader and StreamWriter.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[] using StreamWriter. After exploring Stack Overflow and the rest of the internet, I came across File.WriteAllLines, which allows me to write string[]s to file, but I don't think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2.5 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder and using the AppendLine method to add each line to make a gigantic string. Then I could convert that StringBuilder to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?
Two things will increase the speed of output using StreamWriter.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader reads into a buffer and parses lines out of it, and the StreamWriter writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader and StreamWriter is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
const int BufferSize = 64 * 1024;
// 'inputFilename' and 'outputFilename' stand in for your two paths,
// ideally on different physical disks as described above.
var reader = new StreamReader(inputFilename, Encoding.UTF8, true, BufferSize);
var writer = new StreamWriter(outputFilename, false, Encoding.UTF8, BufferSize);
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance, to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than it does with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time) to something very close to max(input-time, process-time, output-time). And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.
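A minimal sketch of that three-thread layout, using BlockingCollection&lt;string&gt; as the queues. The method name, the bounded capacities, and the manipulate delegate are illustrative stand-ins, not anything from the blog posts:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static void ProcessPipelined(string inputFile, string outputFile,
                             Func<string, string> manipulate)
{
    var readQueue = new BlockingCollection<string>(boundedCapacity: 10000);
    var writeQueue = new BlockingCollection<string>(boundedCapacity: 10000);

    var reader = Task.Run(() =>
    {
        foreach (string line in File.ReadLines(inputFile))
            readQueue.Add(line);
        readQueue.CompleteAdding(); // tell the processor the input is done
    });

    var processor = Task.Run(() =>
    {
        foreach (string line in readQueue.GetConsumingEnumerable())
            writeQueue.Add(manipulate(line));
        writeQueue.CompleteAdding(); // tell the writer the processing is done
    });

    var writer = Task.Run(() =>
    {
        using (var w = new StreamWriter(outputFile))
            foreach (string line in writeQueue.GetConsumingEnumerable())
                w.WriteLine(line);
    });

    Task.WaitAll(reader, processor, writer);
}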
According to the docs, StreamWriter does not automatically flush after every write by default, so it is buffered.
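A quick way to confirm this (AutoFlush is a real StreamWriter property; the file name is just an example):
using (var writer = new StreamWriter("output.txt"))
{
    Console.WriteLine(writer.AutoFlush); // False: writes accumulate in the buffer
    writer.Write("hello");               // buffered until Flush() or Dispose()
}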
You could also use some of the lazy methods on the File class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine is declared like so:
private string ProcessLine(string input) {
    string result = input; // do some calculation on input here
    return result;
}
Since ReadLines is lazy and WriteAllLines has a lazy overload, it will stream the file rather than attempting to read the whole thing.
What about building strings to write?
Something like
int cnt = 0;
StringBuilder s = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null)
{
    cnt++;
    string x = Manipulate(line); // placeholder for your per-line manipulation
    s.AppendLine(x);
    if (cnt % 10000 == 0)
    {
        writer.Write(s);
        s.Clear();
    }
}
writer.Write(s); // write out whatever is left after the loop
Edited because the comment below is right; I should have used StringBuilder.
I'm parsing a 40MB CSV file.
It works nicely right now, and it's rather easy to parse, the only problem I have is performance, which of course is rather slow.
I'd like to know if there is a way I can improve this, as I only need to find one entry by its key and then stop looping; so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
I could balance this by giving it a random start line, but the algorithm would still be O(n)... So I'm not sure if it's really worth it.
Is there a way I can improve my sequential parsing algorithm?
First: "Reading Huge CSV File" and "So I'm parsing a 40MB CSV file.". Ihave space delimited files here of 10+ GIGAbyte - what would you call those?
Also: the size of the file is irrelevant, you process them normally anyway line by line.
the only problem I have is performance, which of course is rather slow
Define. What do you think is slow? Parsing them is quite fast when done properly.
I'd like to know if there is a way I can improve this, as I only need to find one entry by its key and then stop looping; so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
Do NOT use a CSV file? More than 60 years ago people invented databases for this.
Is there a way I can improve my sequential parsing algorithm?
You mean other than pulling the parsing into a separate thread and using efficient code (which you may not have; no one knows)?
Theoretically you could:
Read on one thread, with a decent buffer (less IO = faster)
Move field split into thread 2 (optional)
Use tasks to parse the fields (one per field per line) so you use all processors.
I am currently processing some (around 10,000) files (sadly with sizes in the double-digit gigabytes) and I go this way (I have to process them in a specific order) to use my computer fully.
That should give you a lot. And seriously, a 40 MB file should load in 0.x seconds (0.5 - 0.6).
STILL that is very inefficient. Any reason you do not load the file into a database like all people do? CSV is good as some transport format, it sucks as a database.
Why don't you convert your CSV to a normal database? Even SQL Server Express will be fine.
Of course.
Say you order it alphabetically.
Then, start in the middle.
Each iteration, move to the middle of the top or bottom; whichever has the appropriate key.
This algorithm has O(log n).
This is called a "binary search," and is what "Mike Christianson" suggests in his comment.
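A minimal sketch of that idea, assuming the file fits in memory, the lines are sorted by their first comma-separated column, and FindByKey is just an illustrative name:
static string FindByKey(string[] sortedLines, string key)
{
    int lo = 0, hi = sortedLines.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        string midKey = sortedLines[mid].Split(',')[0];
        int cmp = string.CompareOrdinal(midKey, key);
        if (cmp == 0) return sortedLines[mid]; // found it
        if (cmp < 0) lo = mid + 1;             // key is in the upper half
        else hi = mid - 1;                     // key is in the lower half
    }
    return null; // not found
}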
I will suggest you break the one 40 MB file into a few smaller files.
Then, using Parallel.ForEach, you could improve file-processing performance, as sketched below.
You can also load the CSV into a DataTable and use the available operations, which could be faster than looping through it yourself.
Loading it into a database and performing your operation on that is another option.
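A rough sketch of the Parallel.ForEach suggestion, assuming the big file has already been split into the pieces listed in chunkPaths (the directory name and pattern are illustrative):
using System.IO;
using System.Threading.Tasks;

string[] chunkPaths = Directory.GetFiles("chunks", "*.csv");
Parallel.ForEach(chunkPaths, path =>
{
    foreach (string line in File.ReadLines(path))
    {
        // Parse the line and check it against the key here.
    }
});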
This, I believe, is the fastest way to read a CSV file sequentially. There may be other ways to extract data from a CSV, but if you are limited to this approach, then this solution might work for you.
const int BUFFER_SIZE = 0x8000; // represents 32768 bytes

public unsafe void parseCSV(string filePath)
{
    byte[] buffer = new byte[BUFFER_SIZE];
    int workingSize = 0; // how many bytes are left in the buffer
    int bufferSize = 0;  // how many bytes were read by the file stream
    StringBuilder builder = new StringBuilder();
    char cByte; // character representation of byte
    using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        do
        {
            bufferSize = fs.Read(buffer, 0, BUFFER_SIZE);
            workingSize = bufferSize;
            fixed (byte* bufferPtr = buffer)
            {
                byte* workingBufferPtr = bufferPtr;
                while (workingSize-- > 0)
                {
                    switch (cByte = (char)*workingBufferPtr++)
                    {
                        case '\n':
                            // assumes \r\n line endings: '\r' already flushed the field
                            break;
                        case '\r':
                        case ',':
                            string field = builder.ToString(); // test the field against your key here
                            builder.Clear();
                            break;
                        default:
                            builder.Append(cByte);
                            break;
                    }
                }
            }
        } while (bufferSize != 0);
    }
}
Explanation:
Reading the file into a byte buffer. This is done using the basic FileStream class, which gives access to the always-fast Read().
Unsafe code. While I generally recommend not using unsafe code, when traversing any kind of buffer, using pointers can bring a speedup.
StringBuilder, since we will be concatenating bytes into workable strings to test against the key. StringBuilder is by far the fastest way to append bytes together and get a workable string out of them.
Note that this method is fairly compliant with RFC 4180, but if you deal with quotes, you can easily modify the code I posted to handle trimming.