BinaryReader reads from FileStream which loads in chunks - C#

I'm reading values from a huge file (> 10 GB) using the following code:
FileStream fs = new FileStream(fileName, FileMode.Open);
BinaryReader br = new BinaryReader(fs);
int count = br.ReadInt32();
List<long> numbers = new List<long>(count);
for (int i = count; i > 0; i--)
{
    numbers.Add(br.ReadInt64());
}
Unfortunately, the read speed from my SSD is stuck at a few MB/s. I guess the limit is the IOPS of the SSD, so it might be better to read from the file in larger chunks.
Question
Does the FileStream in my code really read only 8 bytes from the file every time the BinaryReader calls ReadInt64()?
If so, is there a transparent way for the BinaryReader to provide a stream that reads in larger chunks from the file to speed up the procedure?
Test-Code
Here's a minimal example to create a test file and measure the read performance.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

namespace TestWriteRead
{
    class Program
    {
        static void Main(string[] args)
        {
            System.IO.File.Delete("test");
            CreateTestFile("test", 1000000000);

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            IEnumerable<long> test = Read("test");
            stopwatch.Stop();
            Console.WriteLine("File loaded within " + stopwatch.ElapsedMilliseconds + "ms");
        }

        private static void CreateTestFile(string filename, int count)
        {
            FileStream fs = new FileStream(filename, FileMode.CreateNew);
            BinaryWriter bw = new BinaryWriter(fs);
            bw.Write(count);
            for (int i = 0; i < count; i++)
            {
                long value = i;
                bw.Write(value);
            }
            fs.Close();
        }

        private static IEnumerable<long> Read(string filename)
        {
            FileStream fs = new FileStream(filename, FileMode.Open);
            BinaryReader br = new BinaryReader(fs);
            int count = br.ReadInt32();
            List<long> values = new List<long>(count);
            for (int i = 0; i < count; i++)
            {
                long value = br.ReadInt64();
                values.Add(value);
            }
            fs.Close();
            return values;
        }
    }
}

You should configure the stream to use FileOptions.SequentialScan to indicate that you will read the stream from start to finish. It should improve the speed significantly.
Indicates that the file is to be accessed sequentially from beginning
to end. The system can use this as a hint to optimize file caching. If
an application moves the file pointer for random access, optimum
caching may not occur; however, correct operation is still guaranteed.
using (
    var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
        FileOptions.SequentialScan))
{
    var br = new BinaryReader(fs);
    var count = br.ReadInt32();
    var numbers = new List<long>();
    for (int i = count; i > 0; i--)
    {
        numbers.Add(br.ReadInt64());
    }
}
Try reading blocks instead:
using (
    var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
        FileOptions.SequentialScan))
{
    var br = new BinaryReader(fs);
    var numbersLeft = br.ReadInt32(); // the count is written as an Int32 in the question's file format
    byte[] buffer = new byte[8192];
    var bufferOffset = 0;
    var bytesLeftToReceive = sizeof(long) * (long)numbersLeft; // use long math; 8 * count overflows int for files > 2 GB
    var numbers = new List<long>();
    while (true)
    {
        // Do not read more than possible
        var bytesToRead = (int)Math.Min(bytesLeftToReceive, buffer.Length - bufferOffset);
        if (bytesToRead == 0)
            break;
        var bytesRead = fs.Read(buffer, bufferOffset, bytesToRead);
        if (bytesRead == 0)
            break; //TODO: Continue to read if file is not ready?
        //move forward in read counter
        bytesLeftToReceive -= bytesRead;
        bytesRead += bufferOffset; //include bytes from previous read.
        //decide how many complete numbers we got
        var numbersToCrunch = bytesRead / sizeof(long);
        //crunch them
        for (int i = 0; i < numbersToCrunch; i++)
        {
            numbers.Add(BitConverter.ToInt64(buffer, i * sizeof(long)));
        }
        // move the last incomplete number to the beginning of the buffer.
        var remainder = bytesRead % sizeof(long);
        Buffer.BlockCopy(buffer, bytesRead - remainder, buffer, 0, remainder);
        bufferOffset = remainder;
    }
}
Update in response to a comment:
May I know what's the reason that manual reading is faster than the other one?
I don't know how the BinaryReader is actually implemented, so these are just assumptions.
The actual read from the disk is not the expensive part. The expensive part is moving the read arm into the correct position on the disk.
As your application isn't the only one reading from the hard drive, the disk has to reposition itself every time an application requests a read.
Thus if the BinaryReader just reads the requested int, it has to wait on the disk for every read (if some other application makes a read in between).
As I read a much larger buffer directly (which is faster), I can process more integers without having to wait for the disk between reads.
Caching will of course speed things up a bit, and that's why it's "just" three times faster.
(future readers: If something above is incorrect, please correct me).

You can use a BufferedStream to increase the read buffer size.
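For example, a minimal sketch of wrapping the FileStream from the question in a BufferedStream before handing it to the BinaryReader (the 1 MB buffer size here is just an illustrative value):
// Sketch: each ReadInt64() is now served from a 1 MB in-memory buffer
// instead of triggering a small read on the underlying stream.
using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
using (var bs = new BufferedStream(fs, 1024 * 1024))
using (var br = new BinaryReader(bs))
{
    int count = br.ReadInt32();
    var numbers = new List<long>(count);
    for (int i = 0; i < count; i++)
    {
        numbers.Add(br.ReadInt64());
    }
}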

In theory, memory-mapped files should help here. You could load the file into memory in several very large chunks. Not sure, though, how relevant this is when using SSDs.
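A rough sketch of what that could look like, assuming the same file layout as in the question (an Int32 count followed by Int64 values) and reading through a view accessor; this is just one of several possible approaches:
// Sketch: map the whole file and read values through a view accessor.
// Requires System.IO.MemoryMappedFiles.
using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
{
    int count = accessor.ReadInt32(0);
    var numbers = new List<long>(count);
    for (long i = 0; i < count; i++)
    {
        numbers.Add(accessor.ReadInt64(4 + i * sizeof(long)));
    }
}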

Related

Channels & Memory Management Strategies for Large Objects

I'm trying to determine how to best implement .Net Core 3 Channels and whether it's a good idea to pass very large objects between tasks. In my example, one task that is very fast can read in a 1GB chunk from a very large file. A number of consumer tasks can read a chunk from the channel and process them in parallel, as processing is much slower and needs parallel (multi-threaded) execution.
In testing my code, there is a massive amount of GC happening, and the total RAM used far exceeds the sum of all data waiting in one bounded channel plus all executing tasks. I've simplified my code down to the most basic example, hoping someone can give me some tips on how to better allocate/manage memory, or tell me whether this approach is a good idea.
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;
namespace MergeSort
{
public class Example
{
private Channel<byte[]> _channelProcessing;
public async Task DoSort(int queueDepth, int parallelTaskCount)
{
// Hard-code some values so we can talk about details
queueDepth = 2;
parallelTaskCount = 8;
_channelProcessing = Channel.CreateBounded<byte[]>(queueDepth);
Task[] processingTasks = new Task[parallelTaskCount];
int outputBufferSize = 1024 * 1024;
for (int x = 0; x < parallelTaskCount; x++)
{
string outputFile = $"C:\\Output.{x:00000000}.txt";
processingTasks[x] = Task.Run(() => ProcessChunkAsync(outputBufferSize));
}
// Task put unsorted chunks on the channel
string inputFile = "C:\\Input.txt";
int chunkSize = 1024 * 1024 * 1024; // 1GiB
Task inputTask = Task.Run(() => ReadInputAsync(inputFile, chunkSize));
// Wait for all tasks building chunk files to complete before continuing
await inputTask;
await Task.WhenAll(processingTasks);
}
private async Task ReadInputAsync(string inputFile, int chunkSize)
{
int bytesRead = 0;
byte[] chunkBuffer = new byte[chunkSize];
using (FileStream fileStream = File.Open(inputFile, FileMode.Open, FileAccess.Read, FileShare.Read))
{
// Read chunks until input EOF
while (fileStream.Position != fileStream.Length)
{
bytesRead = fileStream.Read(chunkBuffer, 0, chunkBuffer.Length);
// Fake code here to simulate the work I need to do, showing that outBuffer.Length is calculated at runtime
Random rnd = new Random();
int runtimeCalculatedAmount = rnd.Next(100, 600);
byte[] tempBuffer = new byte[runtimeCalculatedAmount];
// Create the buffer with a slightly variable size that needs to be passed to the channel for next task
byte[] outBuffer = new byte[1024 * 1024 * 1024 + runtimeCalculatedAmount];
Array.Copy(chunkBuffer, outBuffer, bytesRead);
Array.Copy(tempBuffer, 0, outBuffer, bytesRead, tempBuffer.Length);
await _channelProcessing.Writer.WriteAsync(outBuffer);
outBuffer = null;
}
}
// Not sure if it's safe to .Complete() before consumers have read all data from channel?
_channelProcessing.Writer.Complete();
}
private async Task ProcessChunkAsync(int outputBufferSize)
{
while (await _channelProcessing.Reader.WaitToReadAsync())
{
if (_channelProcessing.Reader.TryRead(out byte[] inBuffer))
{
// myBigThing is also a very large object (result of processing inBuffer and slightly larger)
MyBigThing myBigThing = new MyBigThing(inBuffer);
inBuffer = null;
// Create file and write all rows
using (FileStream fileStream = File.Create("C:\\Output.txt", outputBufferSize, FileOptions.SequentialScan))
{
// Write myBigThing to output file
fileStream.Write(myBigThing.Data);
}
myBigThing = null;
}
}
}
}
}
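A general technique that may help with the buffer churn described above (not from the question, just an illustration) is renting the chunk buffers from ArrayPool<byte> and having the consumer return them once it is done, so the large arrays get reused instead of repeatedly allocated and collected. A minimal, self-contained sketch under those assumptions; the channel capacity, chunk size, and "processing" step are placeholders:
using System.Buffers;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

// Sketch: the producer rents buffers from the shared ArrayPool, the consumer
// returns them after processing, so the big arrays are reused instead of
// being garbage collected. Names and sizes here are illustrative only.
class PooledChunkExample
{
    private static readonly Channel<(byte[] Buffer, int Length)> _channel =
        Channel.CreateBounded<(byte[] Buffer, int Length)>(2);

    static async Task ProduceAsync(Stream input, int chunkSize)
    {
        while (true)
        {
            // Rent may hand back a larger array than requested; only chunkSize bytes are used.
            byte[] buffer = ArrayPool<byte>.Shared.Rent(chunkSize);
            int read = await input.ReadAsync(buffer, 0, chunkSize);
            if (read == 0)
            {
                ArrayPool<byte>.Shared.Return(buffer);
                break;
            }
            await _channel.Writer.WriteAsync((buffer, read));
        }
        _channel.Writer.Complete();
    }

    static async Task ConsumeAsync()
    {
        while (await _channel.Reader.WaitToReadAsync())
        {
            while (_channel.Reader.TryRead(out var chunk))
            {
                // ... process chunk.Buffer[0 .. chunk.Length] here ...
                ArrayPool<byte>.Shared.Return(chunk.Buffer); // hand the buffer back to the pool
            }
        }
    }
}
Note that a rented buffer must not be returned to the pool until the consumer is finished with it, which is why the Return call lives on the consumer side.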

C# Garbage Collection Weird Behavior

I have a block of code that loads a custom storage file (data.00x) and dumps its file contents (several files...) [for this example we'll say the referenced index only contains data.001 files for export]
Example:
public void ExportFileEntries(ref List<IndexEntry> filteredIndex, string dataDirectory, string buildDirectory, int chunkSize)
{
OnTotalMaxDetermined(new TotalMaxArgs(8));
// For each set of dataId files in the filteredIndex
for (int dataId = 1; dataId < 8; dataId++)
{
OnTotalProgressChanged(new TotalChangedArgs(dataId, string.Format("Exporting selected files from data.00{0}", dataId)));
// Filter only entries with current dataId into temp index
List<IndexEntry> tempIndex = GetEntriesByDataId(ref filteredIndex, dataId, SortType.Offset);
// Determine the path of the data.xxx file being exported from
string dataPath = string.Format(@"{0}\data.00{1}", dataDirectory, dataId);
if (File.Exists(dataPath))
{
// Load the data.xxx into filestream
using (FileStream dataFs = new FileStream(dataPath, FileMode.Open, FileAccess.Read))
{
// Loop through files to export
foreach (IndexEntry indexEntry in tempIndex)
{
int fileLength = indexEntry.Length;
OnCurrentMaxDetermined(new CurrentMaxArgs(fileLength));
// Set the filestreams position to the file entries offset
dataFs.Position = indexEntry.Offset;
// Read the file into a byte array (buffer)
byte[] fileBytes = new byte[indexEntry.Length];
dataFs.Read(fileBytes, 0, fileBytes.Length);
// Define some information about the file being exported
string fileExt = Path.GetExtension(indexEntry.Name).Remove(0, 1);
string buildPath = string.Format(@"{0}\{1}\{2}", buildDirectory, fileExt.ToUpper(), indexEntry.Name);
// If needed unencrypt the data (fileBytes buffer)
if (XOR.Encrypted(fileExt)) { byte b = 0; XOR.Cipher(ref fileBytes, ref b); }
// If no chunkSize is provided, generate default
if (chunkSize == 0) { chunkSize = Math.Max(64000, (int)(fileBytes.Length * .02)); }
// If the build directory doesn't exist yet, create it.
if (!Directory.Exists(Path.GetDirectoryName(buildPath))) { Directory.CreateDirectory(Path.GetDirectoryName(buildPath)); }
using (FileStream buildFs = new FileStream(buildPath, FileMode.Create, FileAccess.Write))
{
using (BinaryWriter bw = new BinaryWriter(buildFs, encoding))
{
for (int byteCount = 0; byteCount < fileLength; byteCount += Math.Min(fileLength - byteCount, chunkSize))
{
bw.Write(fileBytes, byteCount, Math.Min(fileLength - byteCount, chunkSize));
OnCurrentProgressChanged(new CurrentChangedArgs(byteCount, ""));
}
}
}
OnCurrentProgressReset(EventArgs.Empty);
fileBytes = null;
}
}
}
else { OnError(new ErrorArgs(string.Format("[ExportFileEntries] Cannot locate: {0}", dataPath))); }
}
OnTotalProgressReset(EventArgs.Empty);
GC.Collect();
}
The data.001 stores about 12k files, most of them very small .jpg pictures, etc. For about the first half of the export process the GC collects just fine, but out of nowhere, toward the last half of the export process, the GC just stops giving a crap.
If I don't issue GC.Collect() at the end of the method, the tool sits at around 255 MB of RAM, but if I do call it, usage goes down to about 14 MB. What I'm asking is: are there any obvious improvements over the way I coded the method (to increase GC performance)?
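Not an answer to the GC question itself, but one pattern stands out in the method above: a fresh fileBytes array is allocated for every one of the ~12k entries, and any entry over 85 KB lands on the large object heap, which is only reclaimed by a full (gen 2) collection. A hedged sketch of streaming an entry through one reused buffer instead, for entries that don't need the in-memory XOR pass (dataFs, buildFs, and indexEntry are the names from the code above; the 64 KB buffer size is just an illustrative value):
// Sketch: copy one index entry from dataFs to buildFs through a single
// reused buffer instead of allocating a per-file byte[].
// copyBuffer would be allocated once, before the foreach loop.
byte[] copyBuffer = new byte[64 * 1024];
dataFs.Position = indexEntry.Offset;
int remaining = indexEntry.Length;
while (remaining > 0)
{
    int read = dataFs.Read(copyBuffer, 0, Math.Min(copyBuffer.Length, remaining));
    if (read == 0) break;
    buildFs.Write(copyBuffer, 0, read);
    remaining -= read;
}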

Write file directly into TcpClient without storing it in memory

I have a 1 GB file that I need to write to a TcpClient object. What's the best way to do this without reading the entire file into memory?
You have to read it into memory at some point, though you obviously don't need to do it all at once!
Just use BinaryReader.Read and read in "n" number of bytes at a time, something like:
BinaryReader reader = new BinaryReader(new FileStream("test.dat", FileMode.Open));
byte[] buffer = new byte[100];
int bytesRead;
// Note: the second argument of Read is the offset into the buffer, not into the file,
// so it stays 0 here; the reader advances through the file on its own.
while ((bytesRead = reader.Read(buffer, 0, buffer.Length)) > 0)
{
    //Send the first bytesRead bytes of buffer
}
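If the goal is specifically to push the file into a TcpClient without buffering it all, Stream.CopyTo can also do the chunked copy for you. A minimal sketch (host, port, file name, and buffer size are placeholder values, not from the question):
// Sketch: stream the file straight into the TcpClient's NetworkStream
// in 64 KB chunks, so the whole file is never held in memory.
// Requires System.Net.Sockets and System.IO.
using (var client = new TcpClient("example.com", 12345))
using (NetworkStream netStream = client.GetStream())
using (var file = new FileStream("test.dat", FileMode.Open, FileAccess.Read))
{
    file.CopyTo(netStream, 64 * 1024); // buffer size for each chunked read/write
}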

Can't read FileStream into byte[] correctly

I have some C# code to call as TF(true, @"C:\input.txt", @"C:\noexistsyet.file"), but when I run it, it breaks on FileStream.Read() for reading the last chunk of the file into the buffer, getting an index-out-of-bounds ArgumentException.
To me, the code seems logical with no overflow for trying to write to the buffer. I thought I had all that set up with rdlen and _chunk, but maybe I'm looking at it wrong. Any help?
My error: ArgumentException was unhandled: Offset and length were out of bounds for the array or count is greater than the number of elements from index to the end of the source collection.
public static bool TF(bool tf, string filepath, string output)
{
long _chunk = 16 * 1024; //buffer count
long total_size = 0;
long rdlen = 0;
long wrlen = 0;
long full_chunks = 0;
long end_remain_buf_len = 0;
FileInfo fi = new FileInfo(filepath);
total_size = fi.Length;
full_chunks = total_size / _chunk;
end_remain_buf_len = total_size % _chunk;
fi = null;
FileStream fs = new FileStream(filepath, FileMode.Open);
FileStream fw = new FileStream(output, FileMode.Create);
for (long chunk_pass = 0; chunk_pass < full_chunks; chunk_pass++)
{
int chunk = (int)_chunk * ((tf) ? (1 / 3) : 3); //buffer count for xbuffer
byte[] buffer = new byte[_chunk];
byte[] xbuffer = new byte[(buffer.Length * ((tf) ? (1 / 3) : 3))];
//Read chunk of file into buffer
fs.Read(buffer, (int)rdlen, (int)_chunk); //ERROR occurs here
//xbuffer = do stuff to make it *3 longer or *(1/3) shorter;
//Write xbuffer into chunk of completed file
fw.Write(xbuffer, (int)wrlen, chunk);
//Keep track of location in file, for index/offset
rdlen += _chunk;
wrlen += chunk;
}
if (end_remain_buf_len > 0)
{
byte[] buffer = new byte[end_remain_buf_len];
byte[] xbuffer = new byte[(buffer.Length * ((tf) ? (1 / 3) : 3))];
fs.Read(buffer, (int)rdlen, (int)end_remain_buf_len); //error here too
//xbuffer = do stuff to make it *3 longer or *(1/3) shorter;
fw.Write(xbuffer, (int)wrlen, (int)end_remain_buf_len * ((tf) ? (1 / 3) : 3));
rdlen += end_remain_buf_len;
wrlen += chunk;
}
//Close opened files
fs.Close();
fw.Close();
return false; //no functionality yet lol
}
The Read() method of Stream (the base class of FileStream) returns an int indicating the number of bytes read, and 0 when it has no more bytes to read, so you don't even need to know the file size beforehand:
public static void CopyFileChunked(int chunkSize, string filepath, string output)
{
byte[] chunk = new byte[chunkSize];
using (FileStream reader = new FileStream(filepath, FileMode.Open))
using (FileStream writer = new FileStream(output, FileMode.Create))
{
int bytes;
while ((bytes = reader.Read(chunk, 0, chunkSize)) > 0)
{
writer.Write(chunk, 0, bytes);
}
}
}
Or even File.Copy() may do the trick, if you can live with letting the framework decide about the chunk size.
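That would be roughly (a one-line sketch, reusing the filepath/output names from above):
// Let the framework pick the buffer size and do the chunked copy.
File.Copy(filepath, output, overwrite: true);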
I think it's failing on this line:
fw.Write(xbuffer, (int)wrlen, chunk);
You are declaring xbuffer as
byte[] xbuffer = new byte[(buffer.Length * ((tf) ? (1 / 3) : 3))];
Since 1 / 3 is an integer division, it returns 0, so you are declaring xbuffer with a size of 0, hence the error. You can fix it by casting one of the operands to a floating-point type or by using literals, but then you still need to cast the result back to an integer.
byte[] xbuffer = new byte[(int)(buffer.Length * ((tf) ? (1m / 3) : 3))];
The same problem is also present in the chunk declaration.

Reading a file one byte at a time in reverse order

Hi, I am trying to read a file one byte at a time in reverse order. So far I have only managed to read the file from beginning to end and write it to another file.
I need to be able to read the file from the end to the beginning and print it to another file.
This is what I have so far:
string fileName = Console.ReadLine();
using (FileStream file = new FileStream(fileName ,FileMode.Open , FileAccess.Read))
{
//file.Seek(endOfFile, SeekOrigin.End);
int bytes;
using (FileStream newFile = new FileStream("newsFile.txt" , FileMode.Create , FileAccess.Write))
{
while ((bytes = file.ReadByte()) >= 0)
{
Console.WriteLine(bytes.ToString());
newFile.WriteByte((byte)bytes);
}
}
}
I know that I have to use the Seek method on the FileStream, and that gets me to the end of the file. I already did that at the commented portion of the code, but I do not know how to read the file now in the while loop.
How can I achieve this?
string fileName = Console.ReadLine();
using (FileStream file = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
byte[] output = new byte[file.Length]; // reversed file
// read the file backwards using SeekOrigin.Current
//
long offset;
file.Seek(0, SeekOrigin.End);
for (offset = 0; offset < file.Length; offset++)
{
file.Seek(-1, SeekOrigin.Current);
output[offset] = (byte)file.ReadByte();
file.Seek(-1, SeekOrigin.Current);
}
// write entire reversed file array to new file
//
File.WriteAllBytes("newsFile.txt", output);
}
You could do it by reading one byte at a time, or you could read a larger buffer, write it to the output file in reverse, and continue like that until you've reached the beginning of the file. For example:
string inputFilename = "inputFile.txt";
string outputFilename = "outputFile.txt";
using (var ofile = File.OpenWrite(outputFilename))
{
using (var ifile = File.OpenRead(inputFilename))
{
int bufferSize = 4096;
byte[] buffer = new byte[bufferSize];
long filePos = ifile.Length;
do
{
long newPos = Math.Max(0, filePos - bufferSize);
int bytesToRead = (int)(filePos - newPos);
ifile.Seek(newPos, SeekOrigin.Begin);
int bytesRead = ifile.Read(buffer, 0, bytesToRead);
// write the buffer to the output file, in reverse
for (int i = bytesRead-1; i >= 0; --i)
{
ofile.WriteByte(buffer[i]);
}
filePos = newPos;
} while (filePos > 0);
}
}
An obvious optimization would be to reverse the buffer after you've read it, and then write it in one whole chunk to the output file.
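That optimization would look roughly like this inside the do/while loop above, replacing the byte-by-byte inner loop (Array.Reverse only touches the bytesRead bytes that were actually read):
// Sketch of the optimization: reverse just the portion that was read,
// then write it out in one call instead of one byte at a time.
Array.Reverse(buffer, 0, bytesRead);
ofile.Write(buffer, 0, bytesRead);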
And if you know that the file will fit into memory, it's really easy:
var buffer = File.ReadAllBytes(inputFilename);
// now, reverse the buffer
int i = 0;
int j = buffer.Length-1;
while (i < j)
{
byte b = buffer[i];
buffer[i] = buffer[j];
buffer[j] = b;
++i;
--j;
}
// and write it
File.WriteAllBytes(outputFilename, buffer);
If the file is small (fits in your RAM) then this would work:
public static IEnumerable<byte> Reverse(string inputFilename)
{
var bytes = File.ReadAllBytes(inputFilename);
Array.Reverse(bytes);
foreach (var b in bytes)
{
yield return b;
}
}
Usage:
foreach (var b in Reverse("smallfile.dat"))
{
}
If the file is large (bigger than your RAM) then this would work:
using (var inputFile = File.OpenRead("bigfile.dat"))
using (var inputFileReversed = new ReverseStream(inputFile))
using (var binaryReader = new BinaryReader(inputFileReversed))
{
while (binaryReader.BaseStream.Position != binaryReader.BaseStream.Length)
{
var b = binaryReader.ReadByte();
}
}
It uses the ReverseStream class which can be found here.
