Is it possible to create a large (physically) contiguous file on a disk using C#? Preferably using only managed code, but if it's not possible then an unmanaged solution would be appreciated.
Any file you create will be logically contiguous.
If you want physically contiguous you are on OS and FS territory. Really (far) beyond the control of normal I/O API's.
But what will probably come close is to claim the space up-front: create an empty stream and set its Length (or Position) property to what you need.
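A minimal sketch of that idea, assuming the file name and size are just placeholders:

using (var fs = new FileStream(@"C:\bigfile.dat", FileMode.CreateNew, FileAccess.Write, FileShare.None))
{
    // Claim the space up-front; whether the blocks end up physically
    // contiguous is still up to the file system.
    fs.SetLength(10L * 1024 * 1024 * 1024); // reserve 10 GB in one go
}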
Writing a defragger?
It sounds like you're after the defragmentation API anyway:-
http://msdn.microsoft.com/en-us/library/aa363911%28v=vs.85%29.aspx
The link at the bottom is the important one, because it seems you've missed the C# wrapper that someone has kindly produced.
http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx
With modern file systems it is hard to ensure a contiguous file on the hard disk. Logically the file is always contiguous, but the physical blocks that keep the data vary from file system to file system.
The best bet for this would be to use an old file system (ext2, FAT32, etc.) and just ask for a large file by seeking to the file size you want and then flushing the file. More up-to-date file systems will probably record the large file size but won't actually write anything to the disk, instead returning zeros on a future read without actually reading.
int fileSize = 1024 * 1024 * 512;
using (FileStream file = new FileStream("C:\\MyFile", FileMode.Create, FileAccess.Write))
{
    // Seeking alone does not allocate anything; write a byte at the end
    // (or call file.SetLength(fileSize)) so the space is actually claimed.
    file.Seek(fileSize - 1, SeekOrigin.Begin);
    file.WriteByte(0);
}
To build a database, you will need to use the scatter-gather I/O functions provided by the Windows API. This is a special type of file I/O that allows you to either "scatter" data from a file into memory or "gather" data from memory and write it to a contiguous region of a file. While the buffers into which the data is scattered or from which it is gathered need not be contiguous, the source or destination file region is always contiguous.
This functionality consists of two primary functions, both of which work asynchronously. The ReadFileScatter function reads contiguous data from a file on disk and writes it into an array of non-contiguous memory buffers. The WriteFileGather function reads non-contiguous data from memory buffers and writes it to a contiguous file on disk. Of course, you'll also need the OVERLAPPED structure that is used by both of these functions.
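For illustration only, a P/Invoke declaration for WriteFileGather might look roughly like this; treat the marshaling details as assumptions to verify against the Win32 documentation rather than a ready-made definition:

// Sketch only: each FILE_SEGMENT_ELEMENT is a 64-bit value holding a pointer to a
// page-aligned, page-sized buffer, and the array must end with a zero entry. The file
// handle must be opened with FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING.
[DllImport("kernel32.dll", SetLastError = true)]
static extern unsafe bool WriteFileGather(
    SafeFileHandle hFile,
    long[] aSegmentArray,          // FILE_SEGMENT_ELEMENT[], zero-terminated
    uint nNumberOfBytesToWrite,    // must be a multiple of the volume sector size
    IntPtr lpReserved,             // must be IntPtr.Zero
    NativeOverlapped* lpOverlapped);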
This is exactly what SQL Server uses when it reads and writes to the database and/or its log files, and in fact this functionality was added to an early service pack for NT 4.0 specifically for SQL Server's use.
Of course, this is pretty advanced level stuff, and hardly for the faint of heart. Surprisingly, you can actually find the P/Invoke definitions on pinvoke.net, but I have an intensely skeptical mistrust of the site. Since you'll need to spend a lot of quality time with the documentation just to understand how these functions work, you might as well write the declarations yourself. And doing it from C# will create a whole host of additional problems for you, such that I don't even recommend it. If this kind of I/O performance is important to you, I think you're using the wrong tool for the job.
The poor man's solution is contig.exe, a single-file defragmenter available for free download here.
In short: no.
The OS will do this in the background. What I would do is make the file as big as you expect it to be; that way the OS will place it contiguously. If you need to grow the file later, grow it by around 10% each time (a rough sketch follows after the example below).
This is similar to how SQL Server keeps its database files.
When opening the FileStream, open it with append.
Example:
FileStream fwriter = new FileStream("C:\\test.txt", FileMode.Append, FileAccess.Write, FileShare.Read);
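A rough sketch of the pre-allocate-and-grow idea described above (path, sizes and growth factor are made up for illustration):

const long expectedSize = 1L * 1024 * 1024 * 1024; // assumed expected final size
const long chunkSize = 4096;                       // assumed size of each write

using (var fs = new FileStream(@"C:\test.dat", FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read))
{
    if (fs.Length < expectedSize)
        fs.SetLength(expectedSize);                // claim the expected size up-front

    // ... write chunks ...

    // when the file is about to run out of room, grow it in ~10% steps
    if (fs.Position + chunkSize > fs.Length)
        fs.SetLength((long)(fs.Length * 1.1));
}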
Related
Let's assume that exactly 1 byte after the File-1-EOF another file (file2) starts.
If I open up file 1 and use FileStream Filemode.Append, does it overwrite file2 or does it make another copy at a place where there is enough memory?
Thanks in advance!
Edit:
For everyone after me: I forgot that you have a file system, which is split into chunks. Making this question nonsense!
You appear to be laboring under the misapprehension that files are stored sequentially on disk, and that extending one file might overwrite parts of another file. This doesn't happen when you go via a FileStream append in C#. The operating system will write the bytes you add however it likes, wherever it likes (and it likes to not overwrite other files), which is how files end up broken into smaller chunks scattered all over the disk (and why defragging is a thing). None of this is of any concern to you, because the OS presents those scattered file fragments as a single contiguous stream of bytes to any program that wants to read them.
Of course, if you wrote a program that bypassed the OS and performed low-level disk access, located the end of the file and then blindly wrote more bytes into the locations after it, then you would end up damaging other files, and even the OS's carefully curated filesystem... but a .NET FileStream won't make that possible.
TL;DR: add your bytes and don't worry about it. Keeping the filesystem in order is not your job.
If I open up file 1 and use FileStream Filemode.Append, does it overwrite file2 or does it make another copy at a place where there is enough memory?
Thankfully no.
Here's a brief overview why:
Your .NET C# code does not have direct OS level interaction.
Your code is compiled into bytecode (IL), which the .NET runtime JIT-compiles and executes at run time.
At run time your bytecode is executed by the .NET runtime, which is built mostly in a combination of C#, C, and C++.
The runtime secures what it calls SafeHandles, which are wrappers around the file handles provided by what I assume is the Win32 API (for Windows applications at least), or whatever OS-level provider of file handles your architecture is running on. (A minimal sketch of getting at this handle follows after this overview.)
The runtime uses these handles to read and write data using the OS level API.
It is the OS's job to ensure that changes to yourfile.txt, made through the handle it has provided to the runtime, only affect that file.
Files are not generally stored in memory, and as such are not subject to buffer overflows.
The runtime may use a buffer in memory to, well, buffer your reads and writes, but that is implemented by the runtime and has no effect on the file or the operating system.
Any attempt to overflow this buffer is guarded against by the runtime itself, and the execution of your code will stop. Even if such an overflow somehow succeeded, no extra bytes would be written to the underlying handle; rather, the runtime would most likely halt with a memory access violation or other unspecified behavior.
The handle you're given is little more than a token that the OS uses to keep track which file you want to read or write bytes to.
If you attempt to write more bytes to a file than the architecture allows, most operating systems have safeguards in place to end your process, close the file, or simply fail the write.
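A minimal sketch of the SafeHandle point above, just to show that a FileStream wraps an OS file handle (the file name is only an example):

using System;
using System.IO;
using Microsoft.Win32.SafeHandles;

class HandleDemo
{
    static void Main()
    {
        using (var fs = new FileStream("yourfile.txt", FileMode.OpenOrCreate))
        {
            SafeFileHandle handle = fs.SafeFileHandle; // the runtime's wrapper around the OS handle
            Console.WriteLine("OS handle value: " + handle.DangerousGetHandle());
        }
    }
}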
In my application I need to continuously write data chunks (around 2MB) about every 50ms in a large file (around 2-7 GB). This is done in a sequential, circular way, so I write chunk after chunk into the file and when I'm at the end of the file I start again at the beginning.
Currently I'm doing it as follows:
In C# I call File.OpenWrite once to open the file with write access and set the size of the file with SetLength. When I need to write a chunk, I pass the safe file handle to the unmanaged WriteFile (kernel32.dll), along with an OVERLAPPED structure to specify the position within the file where the chunk has to be written. The chunk I need to write is stored in unmanaged memory, so I have an IntPtr which I can pass to WriteFile.
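A hedged sketch of that call, with names and error handling assumed rather than taken from the actual application (needs System.Runtime.InteropServices, System.Threading and Microsoft.Win32.SafeHandles):

[DllImport("kernel32.dll", SetLastError = true)]
static extern unsafe bool WriteFile(
    SafeFileHandle hFile,
    IntPtr lpBuffer,                 // chunk held in unmanaged memory
    uint nNumberOfBytesToWrite,
    out uint lpNumberOfBytesWritten,
    NativeOverlapped* lpOverlapped);

static unsafe void WriteChunkAt(SafeFileHandle handle, IntPtr chunk, uint size, long offset)
{
    // The OVERLAPPED structure carries the file offset at which the chunk is written.
    var overlapped = new NativeOverlapped
    {
        OffsetLow = (int)(offset & 0xFFFFFFFF),
        OffsetHigh = (int)(offset >> 32)
    };
    if (!WriteFile(handle, chunk, size, out uint written, &overlapped))
        throw new System.ComponentModel.Win32Exception();
}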
Now I'd like to know if and how I can make this process more efficient. Any ideas?
Some questions in detail:
Will changing from file I/O to memory-mapped file help?
Can I include some optimizations for NTFS?
Are there some useful parameters when creating the file that I'm missing? (maybe an unmanaged call with special parameters)
Using better hardware will probably be the most cost-efficient way to increase file-writing performance.
There is a paper from Microsoft research that will answer most of your questions: Sequential File Programming Patterns and Performance with .NET and the downloadable source code (C#) if you want to run the tests from the paper on your machine.
In short:
The default behavior provides excellent performance on a single disk.
Unbuffered I/O should be tested if you have a disk array; it can improve write speed by a factor of eight (a hedged sketch follows below).
This thread on social.msdn might also be of interest.
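For the unbuffered I/O point above, a commonly cited (but unsupported) trick is to pass the FILE_FLAG_NO_BUFFERING value through FileOptions; treat the flag value and the alignment requirements as assumptions to verify:

// 0x20000000 is FILE_FLAG_NO_BUFFERING; .NET does not expose it by name.
const FileOptions NoBuffering = (FileOptions)0x20000000;

using (var fs = new FileStream(@"D:\bigfile.dat", FileMode.Create, FileAccess.Write,
                               FileShare.None, 4096,
                               FileOptions.WriteThrough | NoBuffering))
{
    // With no buffering, every write must be a multiple of the volume sector size.
    byte[] chunk = new byte[2 * 1024 * 1024];
    fs.Write(chunk, 0, chunk.Length);
}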
I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1 MB in size, so I just calculate its location once, read it into a byte array, and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to read an entire file, despite its size, into a byte array in memory. It is the sole purpose of my program, though, and is about the only memory it takes up.
This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)
If you require random read/write access to the file in order to modify it, then reading it into memory is probably OK, as long as you can be sure the file will never exceed a certain size (you don't want to read a few-hundred-MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.
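A small sketch contrasting the two approaches (path and block size are made up):

// Whole-file approach: fine for a ~1 MB file.
byte[] whole = File.ReadAllBytes(@"C:\data\blocks.bin");

// Streaming approach: process the data as you go instead of holding it all.
using (var reader = new BinaryReader(File.OpenRead(@"C:\data\blocks.bin")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        byte[] block = reader.ReadBytes(4096); // assumed block size
        // ... process block ...
    }
}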
It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.
I have large CT raw-data files which can exceed 20 to 30 GB. Most of the current computers in our department have only 3 GB of RAM, but for processing we need to go through all of the available data. Of course we could do this by sequentially going through the data via read and write functions, but it's sometimes necessary to keep some of the data in memory.
Currently I have my own memory management where I created a so-called MappableObject. Each raw-data file contains, say, 20,000 structs, each holding different data, and each MappableObject refers to a location in the file.
In C# I created a partially working mechanism which automatically maps and unmaps the data as necessary. I have known about MemoryMappedFiles for years, but in .NET 3.5 I refused to use them because I knew they would be available natively in .NET 4.0.
So today I tried MemoryMappedFiles and found out that it is not possible to allocate as much memory as I need. On a 32-bit system, allocating 20 GB doesn't work because it exceeds the size of the logical address space. That much is clear to me.
But is there a way to process such large files as I have? What other chances do I have? How do you guys solve such things?
Thanks
Martin
The only limitation I'm aware of is the size of the largest view of a file you can map, which is limited by address space. A memory-mapped file can be larger than the address space; Windows just needs to map each file view into a contiguous chunk of your process's address space, so the size of the largest mapping equals the size of the largest free chunk of address space. The only limit on total file size is imposed by the file system itself.
Take a look at this article: Working with Large Memory-Mapped Files
"Memory mapped", you can't map 20 gigabytes into a 2 gigabyte virtual address space. Getting 500 MB on a 32-bit operating system is tricky. Beware that it is not a good solution unless you need heavy random access to the file data. Which ought to be difficult when you have to partition the views. Sequential access through a regular file is unbeatable with very modest memory usage. Also beware the cost of marshaling the data from the MMF, you're still paying for the copy of the managed struct or the marshaling cost.
You can still sequentially read through the file, you just can't store more than 2GB of data in memory.
You can map blocks of the file at a time, preferably blocks whose size is a multiple of your struct size.
E.g. the file is 32 GB. Memory-map 32 MB of the file at a time and parse it. Once you hit the end of those 32 MB, map the next 32 MB and continue until you've reached the end of the file.
I'm not sure what the optimal mapping size is, but this is an example of how it may be done.
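A hedged sketch of that windowed approach (file name, view size and struct access are assumptions; needs System.IO and System.IO.MemoryMappedFiles):

const long viewSize = 32L * 1024 * 1024; // 32 MB windows

long fileSize = new FileInfo(@"D:\huge.raw").Length;
using (var mmf = MemoryMappedFile.CreateFromFile(@"D:\huge.raw", FileMode.Open))
{
    for (long offset = 0; offset < fileSize; offset += viewSize)
    {
        long size = Math.Min(viewSize, fileSize - offset);
        using (var view = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
        {
            // ... read structs out of this window, e.g. view.Read<MyStruct>(pos, out s) ...
        }
    }
}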
You are both right. What I tried first was to use a MemoryMappedFile without a backing file; that doesn't work. With an existing file I can map as much memory as I want. The reason I wanted to use MemoryMappedFiles without a real existing file is that the file should be deleted automatically when the stream gets disposed, and that is not supported by MemoryMappedFile.
What I saw now is that I can do the following to get the expected result:
// Create the stream
FileStream stream = new FileStream(
    "D:\\test.dat",
    FileMode.Create,
    FileAccess.ReadWrite,
    FileShare.ReadWrite,
    8,
    FileOptions.DeleteOnClose // This is the necessary part for me.
);

// Create a file mapping backed by that stream
MemoryMappedFile x = MemoryMappedFile.CreateFromFile(
    stream,
    "File1",
    10000000000,
    MemoryMappedFileAccess.ReadWrite,
    new MemoryMappedFileSecurity(),
    System.IO.HandleInheritability.None,
    false
);

// Dispose the stream; thanks to FileOptions.DeleteOnClose the file disappears
// once the last handle to it (including the mapping's) is closed.
stream.Dispose();
At least when looking at the first results, it looks fine for me.
Thank you.
I am trying to read a few text files (around 300 KB each). Until now I've been using a FileStream to open and read each file (they are tab-delimited). However, I heard about memory-mapped files in .NET 4.0. Would they make my reads any faster?
Is there any sample code that reads a simple file and compares the performance?
If the files are on disk and just need to be read into memory, then using a memory mapped file will not help at all, as you still need to read them from disk.
If all you are doing is reading the files, there is no point in memory mapping them.
Memory mapped files are for use when you are doing intensive work with the file (reading, writing, changing) and want to avoid the disk IO.
If you're just reading once then memory-mapped files don't make sense; it still takes the same amount of time to load the data from disk. Memory-mapped files excel when many random reads and/or writes must be performed on a file since there's no need to interrupt the read or write operations with seek operations.
With your amount of data, MMFs don't give any advantage. In general, however, if one bothers to carry out the tests, one will find that copying large (huge) files using MMFs is faster than calling ReadFile/WriteFile sequentially. This is caused by the different mechanisms Windows uses internally for MMF management and for file I/O.
Processing data in memory is always faster than doing something similar via disk I/O. If your processing is sequential and the data fits easily into memory, you can use File.ReadLines() to get the data line by line and process it quickly without much memory overhead. Here's an example: How to open a large text file in C#
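A tiny sketch of the File.ReadLines approach (path assumed; the question's files are tab-delimited):

foreach (string line in File.ReadLines(@"C:\data\input.txt"))
{
    string[] fields = line.Split('\t'); // split each tab-delimited line into fields
    // ... process fields ...
}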
Check this answer too: When to use memory-mapped files?
A memory-mapped file is not recommended for reading text files; to read a text file, you are doing it right with FileStream. MMFs are best for reading binary data.