Read Extremely large file efficiently in C#. Currently using StreamReader - c#

I have a Json file that is sized 50GB and beyond.
Following is what I have written to read a very small chunk of the Json. I now need to modify this to read the large file.
internal static IEnumerable<T> ReadJson<T>(string filePath)
{
    DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
    using (StreamReader sr = new StreamReader(filePath))
    {
        String line;
        // Read and display lines from the file until the end of
        // the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            byte[] jsonBytes = Encoding.UTF8.GetBytes(line);
            XmlDictionaryReader jsonReader = JsonReaderWriterFactory.CreateJsonReader(jsonBytes, XmlDictionaryReaderQuotas.Max);
            var myPerson = ser.ReadObject(jsonReader);
            jsonReader.Close();
            yield return (T)myPerson;
        }
    }
}
Would it suffice if I specify the buffer size while constructing the StreamReader in the current code?
Please correct me if I am wrong here.. The buffer size basically specifies how much data is read from disk to memory at a time. So if File is 100MB in size with buffer size as 5MB, it reads 5MB at a time to memory, until entire file is read.
Assuming my understanding of point 3 is right, what would be the ideal buffer size for such a large text file? Would int.MaxValue be a bad idea? int.MaxValue is 2147483647, and I presume the buffer size is in bytes, which comes to about 2 GB. Filling a buffer that large could itself take time. I was looking at something like 100 MB to 300 MB as the buffer size.

It is going to read a line at a time (of the input file), which could be 10 bytes, and could be all 50GB. So it comes down to : how is the input file structured? And if the input JSON has newlines other than cleanly at the breaks between objects, this could get really ill.
The buffer size might impact how much it reads while looking for the end of each line, but ultimately: it needs to find a new-line each time (at least, how it is written currently).

I think you should first compare different parsers before worrying about details as the buffer size.
The differences between DataContractJsonSerializer, Raven JSON or Newtonsoft JSON will be quite significant.

So your main issue is where your boundaries are. Given that your document is JSON, it seems likely to me that your boundaries are classes; I assume (or hope) that you don't have one honking great class that is 50 GB large. I also assume that you don't really need all those classes in memory at once, but you may need to search the whole file for your subset. Does that sound roughly right? If so, I think your pseudocode is something like:
using a Json parser that accepts a streamreader (newtonsoft?)
read and parse until eof
yield return your parsed class that matches criteria
read and parse next class
end
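The pseudocode above could be sketched roughly as follows. This is only an illustration under assumptions the question does not confirm: it assumes the Newtonsoft Json.NET package is available and that the file is one top-level JSON array of objects. The class name JsonStreaming and method name ReadArrayItems are made up for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

static class JsonStreaming
{
    // Lazily yields each object of a top-level JSON array, so only one item
    // (plus the reader's internal buffer) is held in memory at a time.
    // Assumption: the root of the document is an array of objects.
    public static IEnumerable<T> ReadArrayItems<T>(TextReader text)
    {
        var serializer = new JsonSerializer();
        using (var reader = new JsonTextReader(text))
        {
            while (reader.Read())
            {
                // Deserialize consumes the whole object, including nested
                // objects, so every StartObject seen here is a top-level item.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<T>(reader);
                }
            }
        }
    }
}
```

A caller would pass something like new StreamReader(filePath) and filter the items inside a foreach loop; because the method uses yield return, items that do not match the criteria become garbage as soon as the loop moves on.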

Related

Write large data into MemoryStream with C# [duplicate]

This question already has answers here:
Failed to write large amount of data to stream
(2 answers)
Closed 1 year ago.
I have a big stream (4 GB) and I need to replace some characters in it (one specific character has to be replaced with 2 or 3 others). I get the stream from a service and I have to return a stream.
This is what I'm doing
private static Stream UpdateStream(Stream stream, string oldCharacters, string newCharacters, int size = 2048)
{
    stream.Position = 0;
    StreamReader reader = new StreamReader(stream);
    MemoryStream outputStream = new MemoryStream();
    StreamWriter writer = new StreamWriter(outputStream);
    writer.AutoFlush = true;
    char[] buffer = new char[size];
    while (!reader.EndOfStream)
    {
        reader.Read(buffer, 0, buffer.Length);
        if (buffer != null)
        {
            string line = new string(buffer);
            if (!string.IsNullOrEmpty(line))
            {
                string newLine = line.Replace(oldCharacters, newCharacters);
                writer.Write(newLine);
            }
        }
    }
    return outputStream;
}
But I'm getting an OutOfMemory exception at some point on this line, even though the computer still has plenty of memory available:
writer.Write(newLine);
Any advice?
This is not an answer, but I couldn't possibly fit it into a comment.
Currently your problem is not solvable without making some assumptions. The problem, as I hear it, is that you want to replace some parts of a large body of text saved in a file and save the modified text in the file again.
Some unknown variables:
How long are those strings you are replacing?
How long are those strings you are replacing it with? The same length as the replaced strings?
What kinds of strings are you looking to replace? A single word? A whole sentence? A whole paragraph?
A solution to your problem would be to read the file into memory in chunks, replace the necessary text and save the "updated" text in a new file and then finally rename the "new file" to the name of the old file. However, without knowing the answers to the above points, you could potentially be wanting to replace a string as long as all text in the file (unlikely, yes). This means in order to do the "replacing" I would have to read the whole file into memory before I can replace any of the text, which causes an OutOfMemoryException. (Yes, you could do some clever scanning to replace such large strings without reading it all into memory at once, but I doubt such a solution is necessary here).
Please edit your question to address the above points.
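For the specific case described in the question (one character replaced by a short string), the chunked approach might look like the sketch below. It is not the asker's final solution: the names StreamRewriter and ReplaceChar are invented here, UTF-8 is assumed as the encoding, and the output is meant to be a FileStream on a temporary file rather than a MemoryStream. Because the needle is a single character, a match can never straddle two chunks, which keeps the loop simple.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamRewriter
{
    // Copies input to output in fixed-size chunks, replacing every
    // occurrence of oldChar with replacement. Memory use stays bounded
    // by the chunk size regardless of the stream length.
    public static void ReplaceChar(Stream input, Stream output, char oldChar,
                                   string replacement, int bufferChars = 64 * 1024)
    {
        var encoding = new UTF8Encoding(false); // assumption: UTF-8, no BOM written
        using var reader = new StreamReader(input, encoding, true, bufferChars, leaveOpen: true);
        using var writer = new StreamWriter(output, encoding, bufferChars, leaveOpen: true);

        var buffer = new char[bufferChars];
        int read;
        // Read returns how many chars were actually filled; only that
        // prefix of the buffer is valid on each pass.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            int start = 0;
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] != oldChar) continue;
                writer.Write(buffer, start, i - start); // text before the match
                writer.Write(replacement);
                start = i + 1;
            }
            writer.Write(buffer, start, read - start);   // tail of the chunk
        }
    }
}
```

Writing to a temporary file and returning a FileStream opened on it avoids holding the multi-gigabyte result in memory at all; replacing longer strings would additionally need to handle matches that cross chunk boundaries.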
So to make it work I had to:
use the HugeMemoryStream class from the post "Failed to write large amount of data to stream",
set the gcAllowVeryLargeObjects parameter to true,
and set the build to 64-bit ("Prefer 32-bit" unchecked).

Replacing a string within a stream in C# (without overwriting the original file)

I have a file that I'm opening into a stream and passing to another method. However, I'd like to replace a string in the file before passing the stream to the other method. So:
string path = "C:/...";
Stream s = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
//need to replace all occurrences of "John" in the file to "Jack" here.
CallMethod(s);
The original file should not be modified, only the stream. What would be the easiest way to do this?
Thanks...
It's a lot easier if you just read in the file as lines, and then deal with those, instead of forcing yourself to stick with a Stream, simply because stream deals with both text and binary files, and needs to be able to read in one character at a time (which makes such replacement very hard). If you read in a whole line at a time (so long as you don't have multi-line replacement) it's quite easy.
var lines = File.ReadLines(path)
.Select(line => line.Replace("John", "Jack"));
Note that ReadLines still does stream the data, and Select doesn't need to materialize the whole thing, so you're still not reading the whole file into memory at one time when doing this.
If you don't actually need to stream the data you can easily just load it all as one big string, do the replace, and then create a stream based on that one string:
string data = File.ReadAllText(path)
.Replace("John", "Jack");
byte[] bytes = Encoding.ASCII.GetBytes(data);
Stream s = new MemoryStream(bytes);
This question probably has many good answers. I'll try one I've used and has always worked for me and my peers.
I suggest you create a separate stream, say a MemoryStream. Read from your file stream and write into the memory one. You can then extract strings from either, replace what you need, and pass the memory stream on. That makes doubly sure that you are not messing with the original stream, and you can always read the original values from it whenever you need, though this method uses roughly twice as much memory.
If the file has extremely long lines, the replaced string may contain a newline or there are other constraints preventing the use of File.ReadLines() while requiring streaming, there is an alternative solution using streams only, even though it is not trivial.
Implement your own stream decorator (wrapper) that performs the replacement. I.e. a class based on Stream that takes another stream in its constructor, reads data from the stream in its Read(byte[], int, int) override and performs the replacement in the buffer. See notes to Stream implementers for further requirements and suggestions.
Let's call the string being replaced "needle", the source stream "haystack" and the replacement string "replacement".
Needle and replacement need to be encoded using the encoding of the haystack contents (typically Encoding.UTF8.GetBytes()). Inside streams, the data is not converted to string, unlike in StreamReader.ReadLine(). Thus unnecessary memory allocation is prevented.
Simple cases: If both needle and replacement are just a single byte, the implementation is just a simple loop over the buffer, replacing all occurrences. If needle is a single byte and replacement is empty (i.e. deleting the byte, e.g. deleting carriage return for line ending normalization), it is a simple loop maintaining from and to indexes to the buffer, rewriting the buffer byte by byte.
In more complex cases, implement the KMP algorithm to perform the replacement.
Read the data from the underlying stream (haystack) to an internal buffer that is at least as long as needle and perform the replacement while rewriting the data to the output buffer. The internal buffer is needed so that data from a partial match are not published before a complete match is detected -- then, it would be too late to go back and delete the match completely.
Process the internal buffer byte by byte, feeding each byte into the KMP automaton. With each automaton update, write the bytes it releases to the appropriate position in output buffer.
When a match is detected by KMP, replace it: reset the automaton keeping the position in the internal buffer (which deletes the match) and write the replacement in the output buffer.
When end of either buffer is reached, keep the unwritten output and unprocessed part of the internal buffer including current partial match as a starting point for next call to the method and return the current output buffer. Next call to the method writes the remaining output and starts processing the rest of haystack where the current one stopped.
When end of haystack is reached, release the current partial match and write it to the output buffer.
Just be careful not to return an empty output buffer before processing all the data of haystack -- that would signal end of stream to the caller and therefore truncate the data.
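As an illustration of the decorator idea, here is a sketch of only the simplest case described above, where both needle and replacement are a single byte. The class name ByteReplacingStream is made up for this example, and the general KMP-based variant with an internal buffer is not shown.

```csharp
using System;
using System.IO;

// Read-only wrapper that swaps one byte value for another on the fly.
// Assumption: needle and replacement are single bytes (for UTF-8 text,
// ASCII characters), so the data length never changes.
sealed class ByteReplacingStream : Stream
{
    private readonly Stream _inner;
    private readonly byte _needle;
    private readonly byte _replacement;

    public ByteReplacingStream(Stream inner, byte needle, byte replacement)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _needle = needle;
        _replacement = replacement;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Fill the caller's buffer from the haystack, then patch it in place.
        int read = _inner.Read(buffer, offset, count);
        for (int i = offset; i < offset + read; i++)
        {
            if (buffer[i] == _needle) buffer[i] = _replacement;
        }
        return read; // 0 only when the underlying stream is exhausted
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

The wrapped stream can be handed to any consumer that expects a Stream, and the original file is never modified. Replacing a multi-byte needle needs the buffering described above, because a match may be split across two Read calls.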

How do I write an obscene amount of data to file?

I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader and StreamWriter.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[] using StreamWriter. After exploring StackOverflow and the rest of the internet, I came across File.WriteAllLines, which allows me to write string[]s to file, but I dont think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder and using the AppendLine method to add each line to make a gigantic string. Then I could convert that StringBuilder to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?
Two things will increase the speed of output using StreamWriter.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader reads into a buffer and parses lines out of it, and the StreamWriter writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader and StreamWriter is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
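const int BufferSize = 64 * 1024;
// inputFilename and outputFilename are placeholders for your own paths.
var reader = new StreamReader(inputFilename, Encoding.UTF8, true, BufferSize);
// The path-based StreamWriter overload with a buffer size also takes an append flag.
var writer = new StreamWriter(outputFilename, false, Encoding.UTF8, BufferSize);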
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems, you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance--to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than they do with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time) to something very close to max(input-time, process-time, output-time). And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.
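A rough sketch of such a three-stage pipeline, using BlockingCollection as the queues, is shown here. It is not taken from the blog posts mentioned; the names LinePipeline, Run and queueCapacity are invented for illustration, and error handling is deliberately minimal (a failure in a later stage can leave an earlier stage blocked on a full queue).

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class LinePipeline
{
    // Runs reading, processing and writing concurrently. The bounded
    // queues keep memory use flat when one stage is faster than the next.
    public static void Run(TextReader input, TextWriter output,
                           Func<string, string> transform, int queueCapacity = 10_000)
    {
        using var rawLines = new BlockingCollection<string>(queueCapacity);
        using var processedLines = new BlockingCollection<string>(queueCapacity);

        var readerTask = Task.Run(() =>
        {
            try
            {
                string line;
                while ((line = input.ReadLine()) != null) rawLines.Add(line);
            }
            finally { rawLines.CompleteAdding(); } // tells the processor to stop
        });

        var processorTask = Task.Run(() =>
        {
            try
            {
                foreach (var line in rawLines.GetConsumingEnumerable())
                    processedLines.Add(transform(line));
            }
            finally { processedLines.CompleteAdding(); } // tells the writer to stop
        });

        var writerTask = Task.Run(() =>
        {
            foreach (var line in processedLines.GetConsumingEnumerable())
                output.WriteLine(line);
        });

        Task.WaitAll(readerTask, processorTask, writerTask);
    }
}
```

Since each stage is a single task and the queues are first-in, first-out, lines come out in the same order they went in. In practice input and output would be a StreamReader and StreamWriter created with the buffer size discussed above.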
According to the docs, StreamWriter does not automatically flush after every write by default, so it is buffered.
You could also use some of the lazy methods on the File class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine is declared like so:
private string ProcessLine(string input) {
    string result = input; // placeholder: do some calculation on input
    return result;
}
Since ReadLines is lazy and WriteAllLines has a lazy overload, it will stream the file rather than attempting to read the whole thing.
What about building strings to write?
Something like
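// reader, writer and Manipulate are placeholders for your own
// StreamReader, StreamWriter and per-line transformation.
int cnt = 0;
StringBuilder s = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null)
{
    cnt++;
    string x = Manipulate(line);
    s.Append(x).Append('\n');
    if (cnt % 10000 == 0)
    {
        // Write out the batch and start a new one.
        writer.Write(s.ToString());
        s.Clear();
    }
}
writer.Write(s.ToString()); // write the final partial batch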
Edited because comment below is right, should have used stringbuilder.

Parsing a big CSV file C# .net 4

I know this question has been asked before, but I can't seem to get it working with the answers I've read. I've got a CSV file of ~1.2 GB. If I run the process as 32-bit I get an OutOfMemoryException; it works if I run it as a 64-bit process, but it still takes 3.4 GB of memory. I do know that I'm storing a lot of data in my CustomData class, but still, 3.4 GB of RAM? Am I doing something wrong when reading the file?
dict is a dictionary that maps each column index to the name of the property the value should be saved in. Am I doing the reading the right way?
StreamReader reader = new StreamReader(File.OpenRead(path));
while(!reader.EndOfStream) {
    String line = reader.ReadLine();
    String[] values = line.Split(';');
    CustomData data = new CustomData();
    string value;
    for (int i = 0; i < values.Length; i++) {
        dict.TryGetValue(i, out value);
        Type targetType = data.GetType();
        PropertyInfo prop = targetType.GetProperty(value);
        if(values[i]==null)
        {
            prop.SetValue(data, "NULL",null);
        }
        else
        {
            prop.SetValue(data, values[i], null);
        }
    }
    dataList.Add(data);
}
There doesn't seem to be anything wrong in your usage of the stream reader, you read a line in memory, then forget it.
However, in C# a string is encoded in memory as UTF-16 so on the average a character consumes 2 bytes in memory.
If your CSV contains also a lot of empty fields that you convert to "NULL" you add up to 7 bytes for each empty field.
So on the whole, since you basically store all the data from your file in memory, it's not really surprising that you require almost 3 times the size of the file in memory.
The actual solution is to parse your data in chunks of N lines, process them, and free them from memory.
Note: Consider using a CSV parser; there is more to CSV than just commas or semicolons. What if one of your fields contains a semicolon, a newline, a quote...?
Edit
Actually, each string takes up to 20+(N/2)*4 bytes in memory; see C# in Depth.
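Processing the file in chunks of N lines could look something like the sketch below. The names CsvBatches and ReadInBatches are hypothetical, and the naive Split on the separator is kept only to mirror the question; it does not handle quoted fields, which is exactly why a real CSV parser is recommended above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CsvBatches
{
    // Lazily yields lists of at most batchSize parsed rows. Each batch
    // becomes collectable as soon as the caller stops referencing it.
    public static IEnumerable<List<string[]>> ReadInBatches(
        TextReader reader, int batchSize, char separator = ';')
    {
        // Pre-sizing the list avoids the repeated doubling and copying
        // that happens when a List grows one element at a time.
        var batch = new List<string[]>(batchSize);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            batch.Add(line.Split(separator)); // naive: no quoted-field support
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string[]>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch; // final partial batch
    }
}
```

The caller iterates the batches with foreach, handles each one (for example, inserts it into a database) and lets it go out of scope before the next batch is read, so peak memory is proportional to batchSize rather than to the file size.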
Ok a couple of points here.
As pointed out in the comments, .NET under x86 can only consume 1.5GBytes per process, so consider that your maximum memory in 32 bit
The StreamReader itself will have an overhead. I don't know if it caches the entire file in memory, or not (maybe someone can clarify?). If so, reading and processing the file in chunks might be a better solution
The CustomData class, how many fields does it have, and how many instances are created? Note you will need 32bits for each reference in x86 and 64 bits for each reference in x64. So if you have CustomData class, which has 10 fields of type System.Object, each CustomData class before storing any data requires 88 bytes.
The dataList.Add at the end. I assume you are adding to a generic List? If so, note that List employs a doubling algorithm to resize. If you have 1 GB in a List and it requires 1 more byte, it will create a 2 GB array and copy the 1 GB into it on resize. So all of a sudden the 1 GB + 1 byte actually requires 3 GB to manipulate. An alternative is to use a pre-sized array.

Best approach for in memory manipulation of text file in memory: read as byte[] first? read as File.ReadAllText() then save as binary?

I need to change a file in memory, and currently I read the file to memory into a byte[] using a filestream and a binaryreader.
I was wondering what's the best approach to change that file in memory: convert the byte[] to a string, make changes and call Encoding.GetBytes()? Or read the file as a string first using File.ReadAllText() and then call Encoding.GetBytes()? Or will any approach work without caveats?
Any special approaches? I need to replace specific text inside files with additional chars or replacement strings, several 100,000 of files. Reliability is preferred over efficiency. Files are text like HTML, not binary files.
Read the files using File.ReadAllText(), modify them, then do byte[] byteData = Encoding.UTF8.GetBytes(your_modified_string_from_file). Use the encoding with which the files were saved. This will give you a byte[]. You can convert the byte[] to a stream like this:
MemoryStream stream = new MemoryStream();
stream.Write(byteData, 0, byteData.Length);
Edit:
It looks like one of the Add methods in the API can take a byte array, so you don't have to use a stream.
You're definitely making things harder on yourself by reading into bytes first. Just use a StreamReader. You can probably get away with using ReadLine() and processing a line at a time. This can seriously reduce your app's memory usage, especially if you're working with that many files.
using (var reader = File.OpenText(originalFile))
using (var writer = File.CreateText(tempFile))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var temp = DoMyStuff(line);
        writer.WriteLine(temp);
    }
}
File.Delete(originalFile);
File.Move(tempFile, originalFile);
Based on the size of the files, I would use File.ReadAllText to read them and File.WriteAllText to write them. This frees you from the responsibility of having to call Close or Dispose on either the reader or the writer.
You generally don't want to read a text file at the binary level. Just use File.ReadAllText() and supply it with the correct encoding used in the file (there's an overload for that). If the file encoding is UTF8 or UTF32, the method can usually detect and use the correct encoding automatically. The same applies to writing it back: if it's not UTF8, specify which encoding you want.
