I am trying to read a few text files (around 300 KB each). Until now I've been using a FileStream to open and read them (they are tab-delimited). However, I heard about memory-mapped files in .NET 4.0. Would they make my reads any faster?
Is there any sample code that reads a simple file both ways and compares the performance?
If the files are on disk and just need to be read into memory, then using a memory mapped file will not help at all, as you still need to read them from disk.
If all you are doing is reading the files, there is no point in memory mapping them.
Memory mapped files are for use when you are doing intensive work with the file (reading, writing, changing) and want to avoid the disk IO.
If you're just reading once then memory-mapped files don't make sense; it still takes the same amount of time to load the data from disk. Memory-mapped files excel when many random reads and/or writes must be performed on a file since there's no need to interrupt the read or write operations with seek operations.
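For completeness, here is a minimal sketch of a memory-mapped read in .NET 4.0; the file name and offset are placeholders for illustration.

using System;
using System.IO.MemoryMappedFiles;

class MmfReadExample
{
    static void Main()
    {
        // "data.bin" and the offset below are placeholders.
        using (var mmf = MemoryMappedFile.CreateFromFile("data.bin"))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Random access: read a 4-byte value at an arbitrary offset
            // without any explicit Seek call.
            int value = accessor.ReadInt32(1024);
            Console.WriteLine(value);
        }
    }
}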
With your amount of data, MMFs don't give any advantage. In general, however, if you bother to run the tests, you will find that copying large (huge) files using MMFs is faster than calling ReadFile/WriteFile sequentially. This is caused by the different mechanisms Windows uses internally for MMF management and for file I/O.
Processing data in memory is always faster than doing something similar via disk I/O. If your processing is sequential and the data easily fits into memory, you can use File.ReadLines() to get the data line by line and process it quickly without a heavy memory overhead. Here is an example: How to open a large text file in C#
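For a sequential pass over a tab-delimited file, a minimal sketch might look like this (the file name and column handling are purely illustrative):

using System;
using System.IO;

class ReadLinesExample
{
    static void Main()
    {
        // File.ReadLines streams the file lazily, so only one line
        // is held in memory at a time.
        foreach (string line in File.ReadLines("input.txt"))
        {
            string[] fields = line.Split('\t');   // tab-delimited
            // ... process the fields here ...
            Console.WriteLine(fields[0]);
        }
    }
}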
Check this answer too: When to use memory-mapped files?
Memory-mapped files are not recommended for reading text files. For reading a text file you are doing the right thing with FileStream. MMFs are best suited to reading binary data.
In my application I need to continuously write data chunks (around 2MB) about every 50ms in a large file (around 2-7 GB). This is done in a sequential, circular way, so I write chunk after chunk into the file and when I'm at the end of the file I start again at the beginning.
Currently I'm doing it as follows:
In C# I call File.OpenWrite once to open the file with write access and set the size of the file with SetLength. When I need to write a chunk, I pass the safe file handle to the unmanaged WriteFile (kernel32.dll). In that call I pass an OVERLAPPED structure to specify the position within the file where the chunk has to be written. The chunk I need to write is stored in unmanaged memory, so I have an IntPtr which I can pass to WriteFile.
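Roughly, the call looks like this (a simplified sketch: the handle and the unmanaged buffer are set up elsewhere, and error handling is minimal):

using System;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class NativeFileWriter
{
    // WriteFile with an OVERLAPPED structure used here only to specify
    // the file offset; the call completes synchronously on a handle
    // that was not opened for overlapped I/O.
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool WriteFile(
        SafeFileHandle hFile,
        IntPtr lpBuffer,
        uint nNumberOfBytesToWrite,
        out uint lpNumberOfBytesWritten,
        ref NativeOverlapped lpOverlapped);

    // Writes 'count' bytes from unmanaged memory at 'buffer' to 'offset' in the file.
    public static void WriteChunk(SafeFileHandle handle, IntPtr buffer, uint count, long offset)
    {
        var overlapped = new NativeOverlapped
        {
            OffsetLow = (int)(offset & 0xFFFFFFFF),
            OffsetHigh = (int)(offset >> 32)
        };

        uint written;
        if (!WriteFile(handle, buffer, count, out written, ref overlapped))
            throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
    }
}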
Now I'd like to know if and how I can make this process more efficient. Any ideas?
Some questions in detail:
Will changing from file I/O to memory-mapped file help?
Can I include some optimizations for NTFS?
Are there some useful parameters when creating the file that I'm missing? (maybe an unmanaged call with special parameters)
Using better hardware will probably be the most cost efficient way to increase file writing efficiency.
There is a paper from Microsoft research that will answer most of your questions: Sequential File Programming Patterns and Performance with .NET and the downloadable source code (C#) if you want to run the tests from the paper on your machine.
In short:
The default behavior provides excellent performance on a single disk.
Unbuffered I/O should be tested if you have a disk array; it could improve write speed by a factor of eight (a rough sketch follows below).
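As a rough illustration of the unbuffered/write-through idea from the paper: FILE_FLAG_NO_BUFFERING is not exposed by the FileOptions enum, so the hand-cast flag value below is an assumption to verify, and unbuffered I/O also requires write sizes that are multiples of the sector size.

using System.IO;

class UnbufferedWriteExample
{
    // 0x20000000 is the Win32 FILE_FLAG_NO_BUFFERING value (not part of FileOptions;
    // verify before relying on it).
    const FileOptions NoBuffering = (FileOptions)0x20000000;

    static void Main()
    {
        byte[] buffer = new byte[64 * 1024];   // write sizes must be sector-size multiples

        using (var fs = new FileStream(
            "output.dat",
            FileMode.Create,
            FileAccess.Write,
            FileShare.None,
            4096,
            FileOptions.WriteThrough | NoBuffering))
        {
            fs.Write(buffer, 0, buffer.Length);
        }
    }
}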
This thread on social.msdn might also be of interest.
Can I use ghostscript API to convert PDF to some other format without reading data from disk or writing results to disk?
The disk I/O has a big overhead!
I need something like this:
public static byte[][] ConvertPDF(byte[] pdfData)
{
    // Returns an array of byte arrays, one per page of output data
}
Using the Ghostscript API you can send input from anywhere you like. Depending on the output device you choose you may be able to send the output to stdout, or to retrieve a bitmap in memory.
If you want TIFF output then you have to have an output file (Tagged Image File Format, the clue is in the name...)
Similarly, you can't do this with PDF files as input; those have to be available as a file, because PDF is a random-access format.
What leads you to think that this is a performance problem ?
Since there still isn't a correct answer here all these years later, I'll provide one.
Ghostscript performs its operations on disk. It doesn't use an input and output path merely to load the file into memory, perform operations, and write it back. It actually reads and writes parts of the file to disk as it goes (using multiple threads). While this IS slower, it also uses much less memory (bearing in mind that these files could potentially be quite large).
Because the operations are performed on disk, there was not (at the time of this question) any way to pass in or retrieve a byte array/memory stream because to do so would be "dishonest"--it might imply that it was a "shortcut" to prevent disk IO when in fact it would not. Later, support was added to accept & return memory streams, but it's important to note that this support merely accepted the memory stream, wrote it to a temporary file, performed the operations, and then read it back to a new memory stream.
If that still meets your needs (for example, if you want the inevitable IO to be handled by the library rather than your business logic), here are a couple links demonstrating how to go about it (your exact needs do change the mechanics).
Image to pdf (memory stream to memory stream via rasterizer)
Image to pdf (file to memory stream via processor)
Pdf to image (memory stream to memory stream via rasterizer)
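To make the memory-stream route concrete, here is a rough sketch using the Ghostscript.NET rasterizer. The class and method names (GhostscriptRasterizer, Open, GetPage) come from that wrapper, and their exact signatures differ between versions, so check them against the links above rather than treating this as definitive:

using System.Collections.Generic;
using System.Drawing.Imaging;
using System.IO;
using Ghostscript.NET.Rasterizer;

static class PdfToImages
{
    // Rasterizes each page of a PDF held in memory to a PNG byte array.
    // Note: the wrapper still spools to a temporary file internally (see above).
    public static List<byte[]> Convert(byte[] pdfData, int dpi)
    {
        var pages = new List<byte[]>();

        using (var pdfStream = new MemoryStream(pdfData))
        using (var rasterizer = new GhostscriptRasterizer())
        {
            rasterizer.Open(pdfStream);

            for (int pageNumber = 1; pageNumber <= rasterizer.PageCount; pageNumber++)
            {
                using (var image = rasterizer.GetPage(dpi, dpi, pageNumber))
                using (var ms = new MemoryStream())
                {
                    image.Save(ms, ImageFormat.Png);
                    pages.Add(ms.ToArray());
                }
            }
        }

        return pages;
    }
}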
Hopefully these will, collectively, provide enough information to solve this issue for others who, like me & OP, mostly found people saying it was impossible and that I shouldn't even be trying.
I am writing an application to read and parse files which may be 1 KB to 200 MB in size.
I have to parse it in two passes:
Extract an image contained in the file.
Parse that image to extract its contents.
I generally use FileStream, BufferedStream, BinaryReader, and BinaryWriter to read and write the contents.
Now, I want to know the fastest and most efficient way to read the file and extract the contents...
Is there a good method or a good class library?
NOTE: Unsafe code is OK!
The fastest and simplest way to read the file is simply:
var file = File.ReadAllBytes(fileName);
That will read the entire file as a byte array into memory. You can then go through it looking for what you need at memory array access speed (which is to say, extremely fast). This will almost certainly be faster than trying to process the file as you read it.
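For example, scanning the array for a marker sequence (the JPEG start-of-image bytes are used here purely as an example):

using System;
using System.IO;

class ScanExample
{
    static void Main()
    {
        byte[] file = File.ReadAllBytes("input.dat");

        // Find the first occurrence of a two-byte marker (0xFF 0xD8, the JPEG SOI marker).
        for (int i = 0; i < file.Length - 1; i++)
        {
            if (file[i] == 0xFF && file[i + 1] == 0xD8)
            {
                Console.WriteLine("Marker found at offset " + i);
                break;
            }
        }
    }
}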
However, if the file will not comfortably fit in memory (and at the sizes you mention it will), then you will need to do this in chunks. If that's not needed, we can safely avoid that tricky discussion. The solutions in this case will be either:
If using .NET 4.0, use memory mapped files (more in What are the advantages of memory-mapped files?).
If not, you'll need to read in chunks, caching and keeping around what you think you'll need in memory (for efficiency), or re-reading whatever you simply can't keep in memory. This can become messy and slow (a minimal read loop is sketched below).
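If you do end up chunking, a minimal sketch of the read loop (buffer size picked arbitrarily) looks like this:

using System.IO;

class ChunkedReadExample
{
    static void Main()
    {
        byte[] buffer = new byte[4 * 1024 * 1024];   // 4 MB chunks, arbitrary size

        using (var fs = new FileStream("input.dat", FileMode.Open, FileAccess.Read))
        {
            int bytesRead;
            while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Process buffer[0..bytesRead) here, keeping only what you need.
            }
        }
    }
}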
I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1 MB in size, so I just calculate its location once, read it into a byte array, and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to read an entire file, regardless of its size, into a byte array in memory. It is the sole purpose of my program, though, and is about the only memory it takes up.
This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)
If you require random read/write access to the file in order to modify it then reading it into memory is probably ok as long as you can be sure the file will never ever exceed a certain size (you don't want to read a few hundred MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.
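A minimal sketch of that streaming style (the length-prefixed record layout is invented purely for illustration):

using System.IO;

class StreamingReadExample
{
    static void Main()
    {
        using (var stream = File.OpenRead("blocks.dat"))
        using (var reader = new BinaryReader(stream))
        {
            // Hypothetical record layout: a 4-byte length prefix followed by data.
            while (stream.Position < stream.Length)
            {
                int length = reader.ReadInt32();
                byte[] block = reader.ReadBytes(length);
                // ... process the block without keeping the whole file in memory ...
            }
        }
    }
}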
It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.
Is it possible to create a large (physically) contiguous file on a disk using C#? Preferably using only managed code, but if it's not possible then an unmanaged solution would be appreciated.
Any file you create will be logically contiguous.
If you want it physically contiguous, you are in OS and file-system territory, really (far) beyond the control of normal I/O APIs.
But what will probably come close is to claim the space up-front: create an empty stream and set its Length (or Position) property to the size you need; a sketch follows below.
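A minimal sketch of claiming the space up-front (the path and size are placeholders):

using System.IO;

class PreallocateExample
{
    static void Main()
    {
        const long desiredSize = 2L * 1024 * 1024 * 1024;   // 2 GB, for illustration

        // Creating the file and setting its length up-front gives the file
        // system its best chance of allocating one contiguous run.
        using (var fs = new FileStream("large.dat", FileMode.Create, FileAccess.Write))
        {
            fs.SetLength(desiredSize);
        }
    }
}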
Writing a defragger?
It sounds like you're after the defragmentation API anyway:-
http://msdn.microsoft.com/en-us/library/aa363911%28v=vs.85%29.aspx
The link at the bottom, because it seems you've missed the C# wrapper that someone has kindly produced:
http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx
With modern file systems it is hard to ensure a contiguous file on the hard disk. Logically the file is always contiguous, but the physical blocks that keep the data vary from file system to file system.
The best bet for this would be to use an old file system (ext2, FAT32, etc.) and just ask for a large file by seeking to the size you want, writing a byte there, and then flushing the file. More up-to-date file systems will probably just record the large file size but won't actually write anything to the hard disk, instead returning zeros on a future read without actually reading.
int fileSize = 1024 * 1024 * 512;
using (FileStream file = new FileStream("C:\\MyFile", FileMode.Create, FileAccess.Write))
{
    file.Seek(fileSize - 1, SeekOrigin.Begin);
    file.WriteByte(0);   // Seek alone does not extend the file; writing the final byte does
}
To build a database, you will need to use the scatter-gather I/O functions provided by the Windows API. This is a special type of file I/O that allows you to either "scatter" data from a file into memory or "gather" data from memory and write it to a contiguous region of a file. While the buffers into which the data is scattered or from which it is gathered need not be contiguous, the source or destination file region is always contiguous.
This functionality consists of two primary functions, both of which work asynchronously. The ReadFileScatter function reads contiguous data from a file on disk and writes it into an array of non-contiguous memory buffers. The WriteFileGather function reads non-contiguous data from memory buffers and writes it to a contiguous file on disk. Of course, you'll also need the OVERLAPPED structure that is used by both of these functions.
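For reference, rough P/Invoke declarations might look like the following. The FILE_SEGMENT_ELEMENT marshalling is simplified to a zero-terminated array of 64-bit buffer pointers, every buffer must be page-sized and page-aligned, and the handle must be opened with FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED, so treat this strictly as a sketch:

using System;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class ScatterGatherNative
{
    // Each array element holds the address of one page-aligned, page-sized buffer;
    // the array must be terminated with a zero element.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool ReadFileScatter(
        SafeFileHandle hFile,
        [In] ulong[] aSegmentArray,
        uint nNumberOfBytesToRead,
        IntPtr lpReserved,                 // must be IntPtr.Zero
        ref NativeOverlapped lpOverlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool WriteFileGather(
        SafeFileHandle hFile,
        [In] ulong[] aSegmentArray,
        uint nNumberOfBytesToWrite,
        IntPtr lpReserved,                 // must be IntPtr.Zero
        ref NativeOverlapped lpOverlapped);
}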
This is exactly what SQL Server uses when it reads and writes to the database and/or its log files, and in fact this functionality was added to an early service pack for NT 4.0 specifically for SQL Server's use.
Of course, this is pretty advanced level stuff, and hardly for the faint of heart. Surprisingly, you can actually find the P/Invoke definitions on pinvoke.net, but I have an intensely skeptical mistrust of the site. Since you'll need to spend a lot of quality time with the documentation just to understand how these functions work, you might as well write the declarations yourself. And doing it from C# will create a whole host of additional problems for you, such that I don't even recommend it. If this kind of I/O performance is important to you, I think you're using the wrong tool for the job.
The poor man's solution is contig.exe, a single-file defragmenter available for free download here.
In short: no.
The OS will handle this in the background. What I would do is make the file as big as you expect it to be; that way the OS is likely to place it contiguously. If you need to grow the file, grow it by something like 10% each time.
This is similar to how SQL Server manages its database files.
When opening the FileStream, open it in append mode.
Example:
FileStream fwriter = new FileStream("C:\\test.txt", FileMode.Append, FileAccess.Write, FileShare.Read);