MemoryMappedFiles: How much memory can be allocated for files - c#

I have large CT raw-data files which can reach 20 to 30 GB. Most of our current computers in the department have only 3 GB of memory, but for processing we need to go through all of the available data. Of course we could do this by working through the data sequentially with read and write functions, but sometimes it is necessary to keep some of the data in memory.
Currently I have my own memory management built around a so-called MappableObject. Each raw-data file contains, say, 20,000 structs, each holding different data, and each MappableObject refers to a location in the file.
In C# I created a partially working mechanism which automatically maps and unmaps the data as necessary. I have known about memory-mapped files for years, but under .NET 3.5 I refused to use them because I knew they would be available natively in .NET 4.0.
So today I tried the MemoryMappedFile class and found out that it is not possible to allocate as much memory as I need. On a 32-bit system, allocating 20 GB doesn't work because it exceeds the size of the logical address space. That much is clear to me.
But is there a way to process files as large as mine? What other options do I have? How do you solve such things?
Thanks
Martin

The only limitation I'm aware of is the size of the largest view of a file you can map, which is limited by address space. A memory-mapped file itself can be larger than the address space. Windows needs to map a file view into a contiguous chunk of your process's address space, so the size of the largest possible view equals the size of the largest free chunk of address space. The only limit on total file size is imposed by the file system itself.
Take a look at this article: Working with Large Memory-Mapped Files

"Memory mapped", you can't map 20 gigabytes into a 2 gigabyte virtual address space. Getting 500 MB on a 32-bit operating system is tricky. Beware that it is not a good solution unless you need heavy random access to the file data. Which ought to be difficult when you have to partition the views. Sequential access through a regular file is unbeatable with very modest memory usage. Also beware the cost of marshaling the data from the MMF, you're still paying for the copy of the managed struct or the marshaling cost.

You can still read through the file sequentially; you just can't keep more than 2 GB of it in memory at once.
You can map blocks of the file one at a time, preferably with block sizes that are a multiple of your struct size.
E.g. the file is 32 GB. Memory-map 32 MB of the file at a time and parse it. Once you hit the end of those 32 MB, map the next 32 MB and continue until you've reached the end of the file.
I'm not sure what the optimal mapping size is, but the sketch below shows how it might be done.
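A rough sketch of that approach, assuming the .NET 4.0 MemoryMappedFile API (the path, the 32 MB view size, and the per-view processing are only placeholders):
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class ChunkedMapping
{
    public static void ProcessInViews(string path)
    {
        const long viewSize = 32 * 1024 * 1024; // map 32 MB at a time
        long fileSize = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        {
            for (long offset = 0; offset < fileSize; offset += viewSize)
            {
                long size = Math.Min(viewSize, fileSize - offset);
                using (var view = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                {
                    // Parse the structs contained in this view here.
                    byte first = view.ReadByte(0);
                }
            }
        }
    }
}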

You are both right. What I tried first was to use a MemoryMappedFile without a backing file, and that doesn't work. With an existing file I can map as much memory as I want. The reason I wanted to use MemoryMappedFile without a real file is that the data should be deleted automatically when the stream gets disposed, and that is not supported by MemoryMappedFile directly.
What I found now is that I can do the following to get the expected result:
// Create the stream; FileOptions.DeleteOnClose removes the file again when the stream is closed.
FileStream stream = new FileStream(
    "D:\\test.dat",
    FileMode.Create,
    FileAccess.ReadWrite,
    FileShare.ReadWrite,
    8,
    FileOptions.DeleteOnClose // This is the necessary part for me.
);

// Create a file mapping with a capacity of 10 GB.
MemoryMappedFile x = MemoryMappedFile.CreateFromFile(
    stream,
    "File1",
    10000000000,
    MemoryMappedFileAccess.ReadWrite,
    new MemoryMappedFileSecurity(),
    System.IO.HandleInheritability.None,
    false
);

// Dispose the stream; thanks to FileOptions.DeleteOnClose the file is gone now.
stream.Dispose();
At least judging by a first test, this looks fine to me.
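For example (just an illustrative sketch; the offset, size, and value are arbitrary), data can then be written and read back through a view accessor created from the mapping:
// Create a small view into the mapping and round-trip an int through it.
using (MemoryMappedViewAccessor accessor = x.CreateViewAccessor(0, 4096))
{
    accessor.Write(0, 42);                // write an int at offset 0
    int readBack = accessor.ReadInt32(0); // readBack == 42
}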
Thank you.

Related

Efficient continuous data writes on HDD

In my application I need to continuously write data chunks (around 2 MB each) about every 50 ms to a large file (around 2-7 GB). This is done in a sequential, circular way: I write chunk after chunk into the file, and when I reach the end of the file I start again at the beginning.
Currently I'm doing it as follows:
In C# I call File.OpenWrite once to open the file for writing and set the size of the file with SetLength. When I need to write a chunk, I pass the safe file handle to the unmanaged WriteFile (kernel32.dll), along with an OVERLAPPED structure to specify the position within the file where the chunk has to be written (see the sketch below). The chunk I need to write is stored in unmanaged memory, so I have an IntPtr which I can pass to WriteFile.
Now I'd like to know if and how I can make this process more efficient. Any ideas?
Some questions in detail:
Will changing from file I/O to memory-mapped file help?
Can I include some optimizations for NTFS?
Are there some useful parameters when creating the file that I'm missing? (maybe an unmanaged call with special parameters)
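For reference, a stripped-down sketch of the write path described above (the P/Invoke declaration is the usual one for kernel32's WriteFile; the class, method, and parameter names are only illustrative):
using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class PositionedWriter
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool WriteFile(
        SafeFileHandle hFile,
        IntPtr lpBuffer,
        uint nNumberOfBytesToWrite,
        out uint lpNumberOfBytesWritten,
        ref NativeOverlapped lpOverlapped);

    // Writes 'count' bytes from the unmanaged buffer to the given byte offset in the file.
    public static void WriteChunkAt(FileStream file, IntPtr buffer, uint count, long position)
    {
        var overlapped = new NativeOverlapped
        {
            OffsetLow = unchecked((int)(position & 0xFFFFFFFF)),
            OffsetHigh = (int)(position >> 32)
        };

        uint written;
        if (!WriteFile(file.SafeFileHandle, buffer, count, out written, ref overlapped))
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }
}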
Using better hardware will probably be the most cost efficient way to increase file writing efficiency.
There is a paper from Microsoft research that will answer most of your questions: Sequential File Programming Patterns and Performance with .NET and the downloadable source code (C#) if you want to run the tests from the paper on your machine.
In short:
The default behavior provides excellent performance on a single disk.
Unbuffered IO should be tested if you have a disk array; it could improve write speed by a factor of eight.
This thread on social.msdn might also be of interest.

Efficient way to transfer many binary files into SQL Server database

We have a requirement for a Winforms app to read thousands of files from a local filesystem (or a network location) and store them in a database.
I am wondering what would be the most efficient way to load the files; there could potentially be many gigabytes of data in total.
File.ReadAllBytes is currently used but the application eventually locks up as the computer's memory is used up.
The current code loops through a table containing file paths, which are used to read the binary data:
protected CustomFile ConvertFile(string path)
{
    try
    {
        byte[] file = File.ReadAllBytes(path);
        return new CustomFile { FileValue = file };
    }
    catch
    {
        return null;
    }
}
The data is then saved to the database (either SQL Server 2008 R2 or 2012) using NHibernate as ORM.
First, let me state that my knowledge is pre-.NET 4.0, so this information may be outdated; I know they were going to make improvements in this area.
Do not use File.ReadAllBytes to read large files (larger than 85 KB), especially when you are doing it to many files sequentially. I repeat, do not.
Use a stream and BinaryReader.Read instead to buffer your reading. Even if this may sound less efficient, since you won't blast the CPU through a single buffer, doing it with ReadAllBytes simply won't work, as you discovered.
The reason is that ReadAllBytes reads the whole thing into a single byte array. If that byte array is larger than 85 KB in memory (there are other considerations, such as the number of array elements), it goes onto the Large Object Heap, which is fine, BUT the LOH doesn't move memory around or defragment the released space, so, simplifying, this can happen:
Read a 1 GB file: you have a 1 GB chunk in the LOH; save the file (no GC cycle yet).
Read a 1.5 GB file: you request a 1.5 GB chunk of memory, and it goes onto the end of the LOH. Say a GC cycle happens, so the 1 GB chunk you previously used is cleared, but now the LOH spans 2.5 GB, with the first 1 GB free.
Read a 1.6 GB file: the 1 GB free block at the beginning is too small, so the allocator goes to the end. Now the LOH spans 4.1 GB.
Repeat.
You are running out of memory, but you surely aren't actually using it all; fragmentation is probably killing you. You can also hit a real OOM situation if a single file is very large (I think the process address space on 32-bit Windows is 2 GB?).
If the files aren't ordered or dependent on each other, a few threads reading them with buffered BinaryReaders might get the job done.
References:
http://www.red-gate.com/products/dotnet-development/ants-memory-profiler/learning-memory-management/memory-management-fundamentals
https://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/
If you have many files, you should read them one-by-one.
If you have big files, and the database allows it, you should read them block by block into a buffer and write them block by block to the database. If you use File.ReadAllBytes, you may get an OutOfMemoryException when the file is too big to fit into memory in one piece. The upper limit is less than 2 GiB, and even less once memory becomes fragmented after the application has been running for a while.
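A minimal sketch of such block-by-block reading (the 80 KB buffer size and the processBlock callback are placeholders for whatever is done with each block, e.g. appending it to the database record):
using System;
using System.IO;

static class ChunkedReader
{
    // Reads 'path' in fixed-size blocks and hands each block to 'processBlock'.
    public static void ReadInBlocks(string path, Action<byte[], int> processBlock)
    {
        // Stay under the 85 KB large-object-heap threshold.
        byte[] buffer = new byte[80 * 1024];

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                processBlock(buffer, read); // e.g. append to the database blob
            }
        }
    }
}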

Storing file in byte array vs reading and writing with file stream?

I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1 MB in size, so I just calculate its location once, read it into a byte array, and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to read an entire file, regardless of its size, into a byte array in memory. It is the sole purpose of my program, though, and is about the only memory it takes up.
This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)
If you require random read/write access to the file in order to modify it then reading it into memory is probably ok as long as you can be sure the file will never ever exceed a certain size (you don't want to read a few hundred MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.
It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.

Create contiguous file using C#?

Is it possible to create a large (physically) contiguous file on a disk using C#? Preferably using only managed code, but if it's not possible then an unmanaged solution would be appreciated.
Any file you create will be logically contiguous.
If you want it physically contiguous, you are in OS and file-system territory, really (far) beyond the control of the normal I/O APIs.
But what will probably come close is to claim the space up front: create an empty FileStream and call SetLength (or set the Position) to the size you need.
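For example (the path and size are placeholders):
// Claim 2 GB up front; the file system can then try to allocate it in one piece.
using (var fs = new FileStream("D:\\big.dat", FileMode.CreateNew, FileAccess.Write))
{
    fs.SetLength(2L * 1024 * 1024 * 1024);
}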
Writing a defragger?
It sounds like you're after the defragmentation API anyway:
http://msdn.microsoft.com/en-us/library/aa363911%28v=vs.85%29.aspx
The link at the bottom, because it seems you've missed the C# wrapper that someone has kindly produced.
http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx
With modern file systems it is hard to ensure a contiguous file on the hard disk. Logically the file is always contiguous, but the physical blocks that keep the data vary from file system to file system.
The best bet would be to use an old file system (ext2, FAT32, etc.), ask for a large file by seeking to the size you want, and then flush the file. More up-to-date file systems will probably just record the large file size but won't actually write anything to the hard disk, returning zeros on a future read without actually reading.
int fileSize = 1024 * 1024 * 512;
FileStream file = new FileStream("C:\\MyFile", FileMode.Create, FileAccess.Write);
file.Seek(fileSize, SeekOrigin.Begin);
// Seeking alone does not extend the file; SetLength (or writing a byte at the end) actually reserves the space.
file.SetLength(fileSize);
file.Close();
To build a database, you will need to use the scatter-gather I/O functions provided by the Windows API. This is a special type of file I/O that allows you to either "scatter" data from a file into memory or "gather" data from memory and write it to a contiguous region of a file. While the buffers into which the data is scattered or from which it is gathered need not be contiguous, the source or destination file region is always contiguous.
This functionality consists of two primary functions, both of which work asynchronously. The ReadFileScatter function reads contiguous data from a file on disk and writes it into an array of non-contiguous memory buffers. The WriteFileGather function reads non-contiguous data from memory buffers and writes it to a contiguous file on disk. Of course, you'll also need the OVERLAPPED structure that is used by both of these functions.
This is exactly what SQL Server uses when it reads and writes to the database and/or its log files, and in fact this functionality was added to an early service pack for NT 4.0 specifically for SQL Server's use.
Of course, this is pretty advanced level stuff, and hardly for the faint of heart. Surprisingly, you can actually find the P/Invoke definitions on pinvoke.net, but I have an intensely skeptical mistrust of the site. Since you'll need to spend a lot of quality time with the documentation just to understand how these functions work, you might as well write the declarations yourself. And doing it from C# will create a whole host of additional problems for you, such that I don't even recommend it. If this kind of I/O performance is important to you, I think you're using the wrong tool for the job.
The poor man's solution is contig.exe, a single-file defragmenter available for free download here.
In short: no, then.
The OS will do this in the background. What I would do is make the file as big as you expect it to be; that way the OS is likely to place it contiguously. And if you need to grow the file, grow it by about 10% each time.
This is similar to how SQL Server keeps its database files.
When opening the FileStream, open it with FileMode.Append.
Example:
FileStream fwriter = new FileStream("C:\\test.txt", FileMode.Append, FileAccess.Write, FileShare.Read);

Reading huge amounts of small files in sequence

I have this problem: I have a collection of small files that are about 2000 bytes each (they are all exactly the same size), and there are about ~100,000 of them, which equals about 200 megabytes of space. I need to be able, in real time, to select a range of these files, say file 1000 to 1100 (100 files total), read them, and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk, I'm going to have one large file containing all the other files at even 2048-byte intervals. The first 2 bytes of each 2048-byte block hold the actual byte size of the file contained in the following 2046 bytes (the files range between roughly 1800 and 1950 bytes in size), and I then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I just seek to X*2048, read the first two bytes, and then read the bytes from (X*2048)+2 up to the size contained in the first two bytes. This large 200 MB file will be append-only, so it's safe to read from it even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2k files.
I think your idea is probably the best you can do with a reasonable amount of work.
Alternatively you could buy a solid-state disk and not care about the file size.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
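A rough sketch of that idea, assuming the 2048-byte-slot layout from the question (a 2-byte length prefix followed by the payload); the path and range arguments are placeholders:
using System;
using System.IO;

static class PackedFileReader
{
    const int SlotSize = 2048;

    // Reads files [firstIndex, firstIndex + count) from the packed file in one disk read
    // and returns only the real payload of each one.
    public static byte[][] ReadRange(string path, long firstIndex, int count)
    {
        byte[] raw = new byte[count * SlotSize];

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            stream.Seek(firstIndex * SlotSize, SeekOrigin.Begin);
            int read = 0, n;
            while (read < raw.Length && (n = stream.Read(raw, read, raw.Length - read)) > 0)
                read += n;
        }

        var result = new byte[count][];
        for (int i = 0; i < count; i++)
        {
            int offset = i * SlotSize;
            // The first two bytes of each slot hold the payload length.
            int length = BitConverter.ToUInt16(raw, offset);
            result[i] = new byte[length];
            Buffer.BlockCopy(raw, offset + 2, result[i], 0, length);
        }
        return result;
    }
}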
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files #1000 to #1100, you can use the built-in (C#) APIs to get a collection of files meeting those criteria.
You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file 'index', you can save the position of all the small files in 'dbase'. This index file, as very small, can be cached completely in memory.
This scheme allows you to read the required files quickly and to add new ones at the end of your collection.
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend on how fast you can read the files versus how fast you can transmit them over the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead some number of files into a queue, and another thread reads from the queue and sends them over the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
AFAIK there was no direct support for memory mapping in the framework before .NET 4.0, but here is some example of how to wrap the Win32 API calls from C#.
See also here for a related question on SO.
Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?
