Byte[stream.length] - out of memory exception, best way to solve? - c#

I am trying to read the stream of bytes from a file. However, when I try to read the bytes I get
The function evaluation was disabled because of an out of memory exception
Quite straightforward. However, what is the best way of getting around this problem? Is it to loop over the length, reading 1024 bytes at a time? Or is there a better way?
The C# I am using:
// fs is a FileStream opened on the file
BinaryReader br = new BinaryReader(fs);
// The length is around 600000000
long Length = fs.Length;
// Error here
byte[] bytes = new byte[Length];
for (int i = 0; i < Length; i++)
{
    bytes[i] = br.ReadByte();
}
Thanks

Well, first of all: imagine a file with a size of, e.g., 2 GB. Your code would allocate 2 GB of memory. Just read the part of the file you really need instead of the whole file at once.
Second, don't do something like this:
for (int i = 0; i < Length; i++)
{
    bytes[i] = br.ReadByte();
}
It is quite inefficient. To read the raw bytes of a stream you should use something like this:
using (var stream = File.OpenRead(filename))
{
    int bytesToRead = 1234;
    byte[] buffer = new byte[bytesToRead];
    int read = stream.Read(buffer, 0, buffer.Length);
    // do something with the read data, e.g.:
    for (int i = 0; i < read; i++)
    {
        // ...
    }
}
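If you genuinely need to process the whole file, a loop over fixed-size chunks keeps memory bounded no matter how large the file is. Here is a minimal sketch along those lines; the 64 KB chunk size and the ProcessChunk handler are placeholders, not part of the original question:
// Assumes: using System.IO;
const int chunkSize = 64 * 1024;            // 64 KB per read; tune as needed
byte[] buffer = new byte[chunkSize];
using (var stream = File.OpenRead(filename))
{
    int read;
    // Read until the stream is exhausted; only one chunk is held in memory at a time.
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        ProcessChunk(buffer, read);         // hypothetical handler; only buffer[0..read) is valid
    }
}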

When you try to allocate an array, the CLR lays it out contiguously in virtual memory given to it by the OS. Virtual memory can be fragmented, however, so a contiguous 1 GB block may not be available, hence the OutOfMemoryException. It doesn't matter how much physical RAM your machine has, and this problem is not limited to managed code (try allocating a huge array in native C and you'll find similar results).
Instead of allocating one huge array, I recommend using several smaller arrays held in an ArrayList or a List<byte[]>; that way the Framework can allocate the data in chunks.
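A rough sketch of that approach (not from the original answer), reading the file into a List<byte[]> of moderately sized chunks so the runtime never needs a single contiguous block; the 16 MB chunk size is an arbitrary example:
// Assumes: using System; using System.Collections.Generic; using System.IO;
const int chunkSize = 16 * 1024 * 1024;     // 16 MB per chunk; arbitrary example value
var chunks = new List<byte[]>();
using (var fs = File.OpenRead(filename))
{
    var buffer = new byte[chunkSize];
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Copy only the bytes actually read so the last chunk isn't padded with zeros.
        var chunk = new byte[read];
        Buffer.BlockCopy(buffer, 0, chunk, 0, read);
        chunks.Add(chunk);
    }
}
// Each chunk is a separate, smaller allocation, so a fragmented address space
// no longer has to supply one contiguous ~600 MB block.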
Hope that helps

I believe that instantiating the stream object already reads the file (into a cache). Your loop then copies the bytes in memory to another array.
So why not use the data in "br" instead of making a further copy?

Related

C# BinaryReader ReadBytes(len) returns different results than Read(bytes, 0, len)

I've got a BinaryReader reading a number of bytes into an array. The underlying Stream for the reader is a BufferedStream (whose underlying stream is a network stream). I noticed that sometimes the reader.Read(arr, 0, len) method returns different (wrong) results than reader.ReadBytes(len).
Basically my setup code looks like this:
var httpClient = new HttpClient();
var reader = new BinaryReader(new BufferedStream(await httpClient.GetStreamAsync(url).ConfigureAwait(false)));
Later on down the line, I'm reading a byte array from the reader. I can confirm the sz variable is the same for both scenarios.
int sz = ReadSize(reader); //sz of the array to read
if (bytes == null || bytes.Length <= sz)
{
bytes = new byte[sz];
}
//reader.Read will return different results than reader.ReadBytes sometimes
//everything else is the same up until this point
//var tempBytes = reader.ReadBytes(sz); <- this will return right results
reader.Read(bytes, 0, sz); // <- this will not return the right results sometimes
It seems like the reader.Read method is reading further into the stream than it needs to or something, because the rest of the parsing will break after this happens. Obviously I could stick with reader.ReadBytes, but I want to reuse the byte array to go easy on the GC here.
Would there ever be any reason that this would happen? Is a setting wrong or something?
Make sure you clear out the bytes array before calling this function, because Read(bytes, 0, len) does NOT clear the given byte array, so bytes left over from a previous read may get mixed in with the new data. (Note also that Read may return fewer bytes than requested, whereas ReadBytes keeps reading until it has len bytes or reaches the end of the stream.) I had this problem long ago in one of my parsers. Either set all elements to zero, or make sure that you only read (parse) up to the given len.
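Since Read can return fewer bytes than requested, a small helper that loops until the buffer holds exactly the expected count lets you keep reusing the array safely. A sketch (ReadExact is a hypothetical helper name, not part of the question's code):
// Assumes: using System.IO;
// Fills buffer[0..count) by looping, because a single BinaryReader.Read call
// over a network-backed stream may return fewer bytes than requested.
static void ReadExact(BinaryReader reader, byte[] buffer, int count)
{
    int total = 0;
    while (total < count)
    {
        int read = reader.Read(buffer, total, count - total);
        if (read == 0)
            throw new EndOfStreamException("Stream ended before the expected number of bytes arrived.");
        total += read;
    }
}
With that in place, ReadExact(reader, bytes, sz) can stand in for reader.Read(bytes, 0, sz) while still reusing the bytes array.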

Memory growing with byte array copy

I have a problem with a simple byte[] copy. In a console application I load a 75MB DAT file into a byte[]. After that I would like to cut the array with the function below.
public static byte[] SubArray(this byte[] Data, int Index, int Length = 0)
{
    // Length == 0 means "copy from Index to the end of the array"
    if (Length == 0) Length = Data.Length - Index;
    byte[] Result = new byte[Length];
    Array.Copy(Data, Index, Result, 0, Length);
    return Result;
}
If I use Data = Data.SubArray(32) only once, memory grows from 100 to 180MB, but if I do a test with Data = Data.SubArray(32) three times, memory triples to 340MB. I suppose the old array is still in memory. How do I release the old array from memory? I don't need it anymore, and with more SubArray calls in the code memory grows to 2GB.
You need to let the Garbage Collector do its thing. To make it easier for the GC, you would normally set the old, unused reference to null or replace it with a new reference value. The GC needs some time to kick in.
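To illustrate (this snippet is not from the question): once the reference is replaced, the old array becomes unreachable. Large arrays live on the Large Object Heap, which is only reclaimed during a Gen 2 collection, so the drop may not show up immediately. Forcing a collection is normally unnecessary, but it can help when you are just measuring:
// Assumes: using System;
Data = Data.SubArray(32);       // the old 75 MB array is now unreachable

// For measurement/diagnostics only; don't call GC.Collect() routinely in production code.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
Console.WriteLine(GC.GetTotalMemory(forceFullCollection: true));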

Is there a way to use Parallel processing to read chunks from a file and join together the string in order?

I see many examples of how to add numbers using Parallel; however, I have not found anything that demonstrates reading multiple chunks (say, 512 bytes per chunk) in parallel from a stream and having the results joined together.
I would like to know if it is possible to read multiple parts of a stream and have them concatenated together in proper order.
For example
Assume the following text file
Bird
Cats
Dogs
And reading with a chunk size of 5 bytes from a normal stream would be something like:
byte[] buffer = new byte[5];
int bytesRead = 0;
StringBuilder sb = new StringBuilder();
using (Stream stream = new FileStream("animals.txt", FileMode.Open, FileAccess.Read)) {
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0) {
        // decode only the bytes actually read, not the whole buffer
        sb.Append(Encoding.UTF8.GetString(buffer, 0, bytesRead));
    }
}
Would read in each line (all lines are 5 bytes) and join them together in order so the resulting string would be identical to the file.
However, something like this solution seems like it would potentially join them out of order. I also don't know how it would apply in the above context to replace the while loop.
How can I read those chunks simultaneously and have the bytes from each iteration appended to the StringBuilder - not in the order the iterations happen to complete, but in the proper order - so I don't end up with something like
Cats
Bird
Dogs
Sorry, I don't have any parallel code to show, as that is the reason for this post. It seems easy if you want to sum up numbers, but having it work in the following manner:
Reading from a stream in byte chunks (say, 512 bytes per chunk)
Appending to a master result in the order in which the chunks appear in the stream, not necessarily the order in which they are processed
... seems to be a daunting challenge.
By their nature, streams are not compatible with parallel processing. The abstraction of a stream is sequential access.
You can read the stream content sequentially into an array, then launch parallel processing on it, which has the desired effect (processing is parallelized). You can even spawn the parallel tasks as chunks of the stream arrive.
var tasks = new List<Task>();
int read;
do {
    var buffer = new byte[blockSize];
    var location = stream.Position;
    read = stream.Read(buffer, 0, buffer.Length);
    if (read > 0)
        tasks.Add(ProcessAsync(buffer, location)); // the final buffer may be only partially filled
} while (read > 0);
await Task.WhenAll(tasks);
Or, if you have random access, you can spawn parallel tasks, each with instructions to read from a particular portion of the input, process it, and store the result in the corresponding part of the output. Note, though, that although random access to files is possible, the access still has to go through a single disk controller, and hard disks are not truly random access even though they expose a random-access interface: non-sequential read patterns waste a lot of time seeking, lowering efficiency far below what you get from streaming. (SSDs don't seek, so there's not much penalty for random request patterns, but there's not much benefit either.)
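A sketch of that random-access variant, under the assumption that the input is a seekable file; filename is a placeholder, Process is the hypothetical per-chunk handler used in the example below, and each task opens its own FileStream so there is no shared Position to race on:
// Assumes: using System; using System.IO; using System.Threading.Tasks;
int blockSize = 512;
long fileLength = new FileInfo(filename).Length;
int blockCount = (int)((fileLength + blockSize - 1) / blockSize);
var results = new byte[blockCount][];

Parallel.For(0, blockCount, i =>
{
    long offset = (long)i * blockSize;
    int length = (int)Math.Min(blockSize, fileLength - offset);
    var buffer = new byte[length];

    // A separate stream per task; FileShare.Read allows the concurrent readers.
    using (var fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        int total = 0;
        while (total < length)
        {
            int read = fs.Read(buffer, total, length - total);
            if (read == 0) break;
            total += read;
        }
    }

    // Index i determines the slot, so the final order matches the file order.
    results[i] = Process(buffer);
});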
Thanks to #Kraang for collaborating on the following example, matching the case of parallel processing binary data.
If you are working from a byte array alone, you could use parallel processing to handle the chunks as follows:
// the byte array goes here
byte[] data = new byte[N];
// the block size
int blockSize = 5;
// find how many chunks there are
int blockCount = 1 + (data.Length - 1) / blockSize;
byte[][] processedChunks = new byte[blockCount][];
Parallel.For(0, blockCount, (i) => {
    var offset = i * blockSize;
    // set the buffer size to the block size or the remaining bytes, whichever is smaller
    var buffer = new byte[Math.Min(blockSize, data.Length - offset)];
    // copy the bytes from data into the buffer
    Buffer.BlockCopy(data, offset, buffer, 0, buffer.Length);
    // store the processed buffer into position `i`, preserving order
    processedChunks[i] = Process(buffer);
});
// recombine chunks using e.g. LINQ SelectMany
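To turn processedChunks back into a single result in the correct order, as the final comment suggests, something like the following works; the Encoding.UTF8 step only applies if the chunks represent text, as in the question:
// Assumes: using System.Linq; using System.Text;
byte[] combined = processedChunks.SelectMany(chunk => chunk).ToArray();
string text = Encoding.UTF8.GetString(combined);   // only if the data is UTF-8 text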

Why is writing many small byte arrays to a file faster than writing one big array?

I made a test to see if there is a difference between the time it takes to write a 1GB file to disk from a single byte array and the time it takes to write another 1GB file from 1024 arrays (1MB each).
Test Writing many arrays
331.6902 ms
Test Writing big array
14756.7559 ms
For this test, the "many arrays" is actually a single byte[1024 * 1024] array that I write 1024 times using a for loop.
The "big array" is just a 1GB byte array filled with random values.
Here's what the code looks like:
var sw1 = new Stopwatch();   // requires using System.Diagnostics
TimeSpan s1;
Console.WriteLine("Test Writing many arrays");
byte[] data = new byte[1048576];
for (int i = 0; i < 1048576; i++)
    data[i] = (byte)(i % 255);
FileStream file = new FileStream("test.txt", FileMode.Create);
sw1.Restart();
for (int i = 0; i < 1024; i++)
    file.Write(data, 0, 1048576);
file.Close();
sw1.Stop();
s1 = sw1.Elapsed;
Console.WriteLine(s1.TotalMilliseconds);
Console.WriteLine("Test Writing big array");
byte[] data2 = new byte[1073741824];
for (int i = 0; i < 1073741824; i++)
    data2[i] = (byte)(i % 255);
FileStream file2 = new FileStream("test2.txt", FileMode.Create);
sw1.Restart();
file2.Write(data2, 0, 1073741824);
file2.Close();
sw1.Stop();
s1 = sw1.Elapsed;
Console.WriteLine(s1.TotalMilliseconds);
I included the file.Close() inside the timed part, since it calls the Flush() method and writes the stream to the disk.
The resulting files are the exact same size.
I thought maybe C# could see that I always use the same array and might optimize the iteration/writing process, but the result is not 2-3 times faster, it's about 45 times faster... Why?
I think the major reason for the big difference is that the OS manages to cache almost the entire 1GB write that you do in small chunks.
You need to change the way your benchmark is set up: the code should write the same data, the first time in 1024 chunks and the second time in one chunk. You also need to turn off the OS's caching of the data by specifying FileOptions.WriteThrough, like this:
var sw1 = new Stopwatch();
Console.WriteLine("Test Writing many arrays");
var data = new byte[1073741824];
for (var i = 0; i < 1073741824; i++)
    data[i] = (byte)(i % 255);
// this FileStream overload (with FileSystemRights) needs "using System.Security.AccessControl;"
var file = new FileStream("c:\\temp\\__test1.txt", FileMode.Create, FileSystemRights.WriteData, FileShare.None, 8, FileOptions.WriteThrough);
sw1.Restart();
for (int i = 0; i < 1024; i++)
    file.Write(data, i * 1048576, 1048576); // write the i-th 1MB slice so both tests write identical data
file.Close();
sw1.Stop();
var s1 = sw1.Elapsed;
Console.WriteLine(s1.TotalMilliseconds);
Console.WriteLine("Test Writing big array");
var file2 = new FileStream("c:\\temp\\__test2.txt", FileMode.Create, FileSystemRights.WriteData, FileShare.None, 8, FileOptions.WriteThrough);
sw1.Restart();
file2.Write(data, 0, 1073741824);
file2.Close();
sw1.Stop();
s1 = sw1.Elapsed;
Console.WriteLine(s1.TotalMilliseconds);
When you run this code, the results look as follows:
Test Writing many arrays
5234.5885
Test Writing big array
5032.3626
The reason is likely to be that the single 1MB array is being held in main memory, but the 1GB array was swapped out to disk.
Therefore, when writing the single array 1024 times, you were writing from memory to disk. If the destination file is contiguous, the HDD head doesn't have to move far during this process.
When writing the 1GB array once, you were reading from disk into memory and then writing to disk, in all likelihood resulting in at least two HDD head movements for each write: first to read the block from the swap file, then back to the destination file to write it.
It could be related to how the OS handles file writes. When writing 1GB using a single write call, the OS will have to pause the write many times to allow other processes to use disk I/O. Also, you are not buffering the writes; you may be able to improve the speed by specifying a larger bufferSize:
public FileStream(
    SafeFileHandle handle,
    FileAccess access,
    int bufferSize
)
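For the scenario in the question, the path-based FileStream constructor that also takes a buffer size is probably the simpler route; here is a sketch with an arbitrary 1 MB buffer (the value is an example, not a recommendation from the original answer):
// Assumes: using System.IO;
const int bufferSize = 1024 * 1024;   // 1 MB internal buffer; arbitrary example value
using (var file = new FileStream("test.txt", FileMode.Create, FileAccess.Write,
                                 FileShare.None, bufferSize))
{
    file.Write(data, 0, data.Length); // data is the byte[] from the question
}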

Copying a part of a byte[] array into a PDFReader

This is a continuation of the ongoing struggle to reduce my memory load mentioned in
How do you refill a byte array using SqlDataReader?
So I have a byte array that is a set size; for this example, I'll say new byte[400000]. Inside of this array, I'll be placing PDFs of different sizes (less than 400000).
Pseudo code would be:
public void Run()
{
    byte[] fileRetrievedFromDatabase = new byte[400000];
    foreach (var document in documentArray)
    {
        // Refill the file with data from the database
        var currentDocumentSize = PopulateFileWithPDFDataFromDatabase(fileRetrievedFromDatabase);
        var reader = new iTextSharp.text.pdf.PdfReader(fileRetrievedFromDatabase.Take(currentDocumentSize).ToArray());
        pageCount = reader.NumberOfPages;
        // DO ADDITIONAL WORK
    }
}
private int PopulateFileWithPDFDataFromDatabase(byte[] fileRetrievedFromDatabase)
{
    // Data access code goes here
    int documentSize = 0;
    int bufferSize = 100;                    // Size of the BLOB buffer.
    byte[] outbyte = new byte[bufferSize];   // The BLOB byte[] buffer to be filled by GetBytes.
    long startIndex = 0;
    long retval = 0;
    var myReader = logoCMD.ExecuteReader(CommandBehavior.SequentialAccess);
    Array.Clear(fileRetrievedFromDatabase, 0, fileRetrievedFromDatabase.Length);
    if (myReader == null)
    {
        return 0;
    }
    while (myReader.Read())
    {
        // GetBytes with a null buffer returns the total length of the BLOB.
        documentSize = (int)myReader.GetBytes(0, 0, null, 0, 0);
        // Reset the starting byte for the new BLOB.
        startIndex = 0;
        // Read the bytes into outbyte[] and retain the number of bytes returned.
        retval = myReader.GetBytes(0, startIndex, outbyte, 0, bufferSize);
        // Continue reading and writing while there are bytes beyond the size of the buffer.
        while (retval == bufferSize)
        {
            Array.Copy(outbyte, 0, fileRetrievedFromDatabase, startIndex, retval);
            // Reposition the start index to the end of the last buffer and fill the buffer.
            startIndex += retval;
            retval = myReader.GetBytes(0, startIndex, outbyte, 0, bufferSize);
        }
    }
    return documentSize;
}
The problem with the above code is that I keep getting a "Rebuild trailer not found. Original Error: PDF startxref not found" error when I try to access the PdfReader. I believe it's because the byte array is too long and has trailing 0's. But since I'm reusing the byte array so that I'm not continuously building new objects on the LOH, I need to do it this way.
So how do I get just the piece of the array that I need and send it to the PdfReader?
Updated
So I looked at the source and realized I had some variables from my actual code that were confusing. I'm basically reusing the fileRetrievedFromDatabase object in each iteration of the loop. Since it's passed by reference, it gets cleared (set to all zeros) and then filled in PopulateFileWithPDFDataFromDatabase. This object is then used to create a new PDF.
If I didn't do it this way, a new large byte array would be created in every iteration, the Large Object Heap would fill up, and eventually an OutOfMemory exception would be thrown.
You have at least two options:
Treat your buffer like a circular buffer with two indexes for the starting and ending positions. You need an index of the last byte written into outbyte, and you have to stop reading when you reach that index.
Simply read the same number of bytes as you have in your data array, to avoid reading into the "unknown" parts of the buffer which don't belong to the same file.
In other words, instead of passing bufferSize as the last parameter, pass data.Length.
// Read the bytes into outbyte[] and retain the number of bytes returned.
retval = myReader.GetBytes(0, startIndex, outbyte, 0, data.Length);
If the data length is 10 and your outbyte buffer is 15, then you should only read data.Length bytes, not bufferSize bytes.
However, I still don't see how you're reusing the outbyte "buffer", if that's what you're doing... I'm simply not following based on what you've provided in your answer. Maybe you can clarify exactly what is being reused.
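As a side note on the question's Take(...).ToArray() call: wrapping just the populated slice of the reused array in a MemoryStream avoids that extra copy entirely, assuming your iTextSharp version exposes the PdfReader constructor that accepts a Stream. A sketch:
// Assumes: using System.IO; using iTextSharp.text.pdf;
// currentDocumentSize is the count returned by PopulateFileWithPDFDataFromDatabase.
using (var slice = new MemoryStream(fileRetrievedFromDatabase, 0, currentDocumentSize, writable: false))
{
    var reader = new PdfReader(slice);    // reads only the populated bytes; the array is not copied
    int pageCount = reader.NumberOfPages;
    // DO ADDITIONAL WORK
}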
Apparently, the way the while loop is currently structured, it wasn't copying the data on its last iteration. I needed to add this:
if (outbyte != null && outbyte.Length > 0 && retval > 0)
{
    Array.Copy(outbyte, 0, currentDocument.Data, startIndex, retval);
}
It's now working, although I will definitely need to refactor.
