I am writing an application to read and parse files which may be 1 KB to 200 MB in size.
I have to parse it in two passes:
Extract an image contained in the file.
Parse that image to extract its contents.
I generally use the file stream, buffered stream, binary reader and binary writer to read and write the contents.
Now, I want to know the fastest and most efficient way to read the file and extract the contents...
Is there a good method or a good class library?
NOTE: Unsafe code is OK!
The fastest and simplest way to read the file is simply:
var file = File.ReadAllBytes(fileName);
That will read the entire file as a byte array into memory. You can then go through it looking for what you need at memory array access speed (which is to say, extremely fast). This will almost certainly be faster than trying to process the file as you read it.
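As a rough sketch of what "going through it looking for what you need" could look like, here is a naive byte-pattern search over the array. The start and end markers are hypothetical, since the question does not say how the embedded image is delimited, and the class and method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class InMemoryScan
{
    // Returns the index of the first occurrence of `pattern` in `data`
    // at or after `start`, or -1 if it is not found. Naive O(n*m) scan.
    public static int IndexOf(byte[] data, byte[] pattern, int start = 0)
    {
        for (int i = start; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Reads the whole file, then copies out everything from the start marker
    // through the end marker (both markers are assumptions, not a real format).
    public static byte[] ExtractBetween(string fileName, byte[] startMarker, byte[] endMarker)
    {
        byte[] file = File.ReadAllBytes(fileName);
        int begin = IndexOf(file, startMarker);
        if (begin < 0) return null;
        int end = IndexOf(file, endMarker, begin + startMarker.Length);
        if (end < 0) return null;

        int length = end + endMarker.Length - begin;
        byte[] image = new byte[length];
        Buffer.BlockCopy(file, begin, image, 0, length);
        return image;
    }
}
```

If the format stores an explicit offset and length for the image instead of markers, the search is unnecessary and a single Buffer.BlockCopy is enough.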
However, if the file will not comfortably fit in memory (and a 200 MB file usually will), then you will need to process it in chunks. If chunking is not needed, we can safely skip that tricky discussion. Otherwise, the options are:
If using .NET 4.0, use memory mapped files (more in What are the advantages of memory-mapped files?).
If not, you'll need to read in chunks, caching what you think you'll need in memory (for efficiency) or re-reading it if you simply can't keep it in memory. This can become messy and slow.
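For the memory-mapped option, a minimal sketch (assuming .NET 4.0 or later and a non-empty file) might look like the following. The class and method names are made up for illustration; the calls themselves come from System.IO.MemoryMappedFiles.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedRead
{
    // Copies `count` bytes starting at `offset` out of the file. Only the
    // pages backing the requested view are brought into memory by the OS.
    public static byte[] ReadSlice(string fileName, long offset, int count)
    {
        // Capacity 0 means "use the current file size"; mapName null means unnamed.
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   fileName, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(offset, count, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[count];
            view.ReadArray(0, buffer, 0, count);
            return buffer;
        }
    }
}
```

In a real parser you would probably keep the mapping open for the lifetime of the parse rather than re-creating it per slice.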
Can I use ghostscript API to convert PDF to some other format without reading data from disk or writing results to disk?
Going through the disk has a big overhead!
I need something like this:
public static byte[][] ConvertPDF(byte[] pdfData)
{
    // Returns an array of byte arrays, one per page
    throw new NotImplementedException();
}
Using the Ghostscript API you can send input from anywhere you like. Depending on the output device you choose you may be able to send the output to stdout, or to retrieve a bitmap in memory.
If you want TIFF output then you have to have an output file (Tagged Image File Format, the clue is in the name...).
Similarly, you can't do this with PDF files as input, those have to be available as a file, because PDF is a random access format.
What leads you to think that this is a performance problem?
Since there still isn't a correct answer here all these years later, I'll provide one.
Ghostscript performs its operations on disk. It doesn't use an input & output path merely to load the file into memory, perform operations, and write it back. It actually reads and writes parts of the file to disk as it goes (using multiple threads). While this IS slower, it also uses much less memory (bearing in mind that these files could potentially be quite large).
Because the operations are performed on disk, there was not (at the time of this question) any way to pass in or retrieve a byte array/memory stream because to do so would be "dishonest"--it might imply that it was a "shortcut" to prevent disk IO when in fact it would not. Later, support was added to accept & return memory streams, but it's important to note that this support merely accepted the memory stream, wrote it to a temporary file, performed the operations, and then read it back to a new memory stream.
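To make the "write to a temporary file, process, read back" pattern concrete, here is a generic sketch. It does not call Ghostscript at all: the `convert` delegate is a hypothetical stand-in for whatever file-based tool or wrapper you use, and the class name is made up.

```csharp
using System;
using System.IO;

public static class TempFileBridge
{
    // Gives a file-based converter a byte[]-in, byte[]-out shape.
    // `convert` receives (inputPath, outputPath) and must write outputPath.
    // The disk I/O still happens; it is only hidden from the caller.
    public static byte[] Run(byte[] input, Action<string, string> convert)
    {
        string inPath = Path.GetTempFileName();
        string outPath = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(inPath, input);
            convert(inPath, outPath);
            return File.ReadAllBytes(outPath);
        }
        finally
        {
            // Clean up both temporary files even if the converter throws.
            File.Delete(inPath);
            File.Delete(outPath);
        }
    }
}
```

A multi-page result would need one output file per page (or a container format), which this sketch does not attempt.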
If that still meets your needs (for example, if you want the inevitable IO to be handled by the library rather than your business logic), here are a couple links demonstrating how to go about it (your exact needs do change the mechanics).
Image to pdf (memory stream to memory stream via rasterizer)
Image to pdf (file to memory stream via processor)
Pdf to image (memory stream to memory stream via rasterizer)
Hopefully these will, collectively, provide enough information to solve this issue for others who, like me & OP, mostly found people saying it was impossible and that I shouldn't even be trying.
I'm currently working with a lot of different file types (txt, binary, office, etc). I typically use a byte[] or string to hold the file data in memory (while it is being written/parsed) and in order to read/write it into files I write the entire data using a FileStream after the data has been completely processed.
Should I be using a TextStream instead of a string while generating data for a text file?
Should I be using a FileStream instead of a byte[] while generating data for a binary file?
Would using streams give me better performance instead of calculating the entire data and outputting it in one go at the end?
Is it a general rule that File I/O should always use streams or is my approach fine in some cases?
The advantage of a byte[]/string vs a stream may be that the byte[]/string is in memory, and accessing it may be faster. If the file is very large, however, you may end up paging thus reducing performance. Another advantage of the byte[]/string approach is that the parsing may be a little easier (simply use File.ReadAllText, say).
If your parsing allows (particularly if you don't need to seek randomly), using a FileStream can be more efficient, especially if the file is rather large. Also, you can make use of C# 5.0's async/await features (available with .NET 4.5) to very easily read/write the file asynchronously and process chunks as you read them in.
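A minimal sketch of chunked asynchronous reading, assuming .NET 4.5 or later, could look like this. The class name, the `process` callback and the default chunk size are illustrative choices, not anything prescribed by the framework.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class ChunkedReader
{
    // Reads the file sequentially in fixed-size chunks and hands each chunk
    // to `process` as (buffer, validByteCount). The buffer is reused, so the
    // callback must not keep a reference to it. Returns the total bytes read.
    public static async Task<long> ProcessInChunksAsync(
        string fileName, Action<byte[], int> process, int chunkSize = 81920)
    {
        long total = 0;
        var buffer = new byte[chunkSize];
        using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, chunkSize, useAsync: true))
        {
            int read;
            // ReadAsync may return fewer bytes than requested; 0 means end of file.
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                process(buffer, read);
                total += read;
            }
        }
        return total;
    }
}
```

Note that a record can straddle two chunks, so a real parser has to carry leftover bytes from one callback to the next.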
Personally, I'd probably just read the file into memory if I'm not too worried about performance, or the file is very small. Otherwise I'd consider using streams.
Ultimately, if you're worried about the performance differences, write some simple test programs and time each approach. That should give you your best answer.
Apart from talking about the size of the data, another important question is the purpose of the data. Manipulation is easier to perform when working with strings and arrays. If both strings and arrays are equally convenient then an array of bytes would be preferred. Strings have to be interpreted which brings in complexity (Encoding, BOM etc) and therefore increases the likelihood of a bug. Use strings only for text. Binary data should always be handled by byte arrays or streams.
Streams should be considered each time you either don't have to perform any manipulation or the subjected data is very large or the subjected data is coming in very slowly. Streams are a natural way of processing data part by part whereas strings and arrays in general expect the data to be there in its entirety before processing it.
Working with streams will generally yield better performance, since it opens up the possibility of having different channels reading and writing asynchronously.
while generating data for a text file
If the data should be flushed to the file as it is generated, use a StreamWriter over the FileStream. If not, build it up in a StringBuilder.
while generating data for a binary file?
A MemoryStream is one choice, preferably with a BinaryWriter over it.
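A short sketch of both suggestions follows. The record layout (an int followed by a string) and all names are invented for illustration only.

```csharp
using System;
using System.IO;
using System.Text;

public static class OutputSketch
{
    // Text: write each line to the file as it is produced instead of
    // accumulating one big string first.
    public static void WriteLines(string fileName, int lineCount)
    {
        using (var writer = new StreamWriter(fileName, false, Encoding.UTF8))
        {
            for (int i = 0; i < lineCount; i++)
                writer.WriteLine("row\t" + i);
        }
    }

    // Binary: build the data in a MemoryStream through a BinaryWriter,
    // then hand back the finished byte[].
    public static byte[] BuildRecord(int id, string name)
    {
        using (var memory = new MemoryStream())
        {
            using (var writer = new BinaryWriter(memory, Encoding.UTF8))
            {
                writer.Write(id);   // 4 bytes, little-endian
                writer.Write(name); // length-prefixed UTF-8 string
            }
            // ToArray still works after the writer has closed the stream.
            return memory.ToArray();
        }
    }
}
```

If the binary output is large, the same BinaryWriter can wrap a FileStream directly so the data never has to sit in memory as a whole.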
I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1 MB in size, so I just calculate its location once, read it into a byte array, and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to read an entire file, regardless of its size, into a byte array in memory. It is the sole purpose of my program though, and is about the only memory it takes up.
This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)
If you require random read/write access to the file in order to modify it then reading it into memory is probably ok as long as you can be sure the file will never ever exceed a certain size (you don't want to read a few hundred MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.
It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.
I am trying to read a few tab-delimited text files (around 300 KB each). Until now I've been using a FileStream to open and read each file. However, I heard about memory-mapped files in .NET 4.0. Would they make my reads any faster?
Is there any sample code that reads a simple file both ways and compares the performance?
If the files are on disk and just need to be read into memory, then using a memory mapped file will not help at all, as you still need to read them from disk.
If all you are doing is reading the files, there is no point in memory mapping them.
Memory mapped files are for use when you are doing intensive work with the file (reading, writing, changing) and want to avoid the disk IO.
If you're just reading once then memory-mapped files don't make sense; it still takes the same amount of time to load the data from disk. Memory-mapped files excel when many random reads and/or writes must be performed on a file since there's no need to interrupt the read or write operations with seek operations.
With your amount of data, MMFs don't give any advantage. However, in general, anyone who bothers to run the tests will find that copying large (huge) files using MMFs is faster than calling ReadFile/WriteFile sequentially. This is caused by the different mechanisms Windows uses internally for MMF management and for file IO.
Processing data in memory is always faster than doing something similar via disk IO. If your processing is sequential and the data easily fits into memory, you can use File.ReadLines() to get the data line by line and process it quickly without much memory overhead. Here is an example: How to open a large text file in C#
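For the tab-delimited files in the question, a minimal File.ReadLines sketch could be as simple as this (the class and method names are made up):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TabFile
{
    // Lazily yields one string[] of fields per line. File.ReadLines streams
    // the file, so only the current line needs to be held in memory.
    public static IEnumerable<string[]> ReadRows(string fileName)
    {
        return File.ReadLines(fileName).Select(line => line.Split('\t'));
    }
}
```

This does no quoting or escaping, so it only fits files where a tab never appears inside a field.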
Check this answer too: When to use memory-mapped files?
Memory-mapped files are not recommended for reading text files. Reading text files with a FileStream, as you are doing, is the right approach. MMFs are best for reading binary data.
I have this problem: I have a collection of small files that are about 2000 bytes each (they are all the exact same size), and there are about 100,000 of them, which comes to about 200 megabytes of space. I need to be able to, in real time, select a range in these files. Say file 1000 to 1100 (100 files total), read them and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all the other files at even 2048-byte intervals. The first 2 bytes of each 2048-byte block hold the actual byte size of the file contained in the next 2046 bytes (the files range between 1800 and 1950 bytes or so in size). I then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read as many bytes as they specify, starting at (X*2048)+2. This large 200 MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2 KB files.
I think your idea is probably the best you can do with decent work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
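Here is a rough sketch of that approach for the 2048-byte slot layout described in the question. The question does not say how the 2-byte size is encoded, so little-endian is assumed here, and the class and method names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SlotFile
{
    public const int SlotSize = 2048;
    public const int HeaderSize = 2; // assumed little-endian length prefix

    // Appends one payload as a full 2048-byte slot (prefix + data + padding).
    public static void Append(string path, byte[] payload)
    {
        if (payload.Length > SlotSize - HeaderSize)
            throw new ArgumentException("payload too large for one slot");
        var slot = new byte[SlotSize];
        slot[0] = (byte)(payload.Length & 0xFF);
        slot[1] = (byte)(payload.Length >> 8);
        Buffer.BlockCopy(payload, 0, slot, HeaderSize, payload.Length);
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read))
            fs.Write(slot, 0, slot.Length);
    }

    // Reads `count` consecutive slots starting at slot `first` with a single
    // seek, then slices the real payloads out of the block in memory.
    public static List<byte[]> ReadRange(string path, int first, int count)
    {
        var block = new byte[count * SlotSize];
        int got = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            fs.Seek((long)first * SlotSize, SeekOrigin.Begin);
            int n;
            while (got < block.Length && (n = fs.Read(block, got, block.Length - got)) > 0)
                got += n;
        }

        var result = new List<byte[]>();
        // Only decode slots that were read completely.
        for (int offset = 0; offset + SlotSize <= got; offset += SlotSize)
        {
            int length = block[offset] | (block[offset + 1] << 8);
            if (length > SlotSize - HeaderSize)
                throw new InvalidDataException("corrupt length prefix");
            var payload = new byte[length];
            Buffer.BlockCopy(block, offset + HeaderSize, payload, 0, length);
            result.Add(payload);
        }
        return result;
    }
}
```

For 100 slots this is a single read of roughly 200 KB, which is cheap compared with 100 separate seek-and-read pairs.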
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to #1100, you can use the built-in (C#) code to get a collection of files meeting those criteria.
You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file 'index', you can save the position of all the small files in 'dbase'. This index file, as very small, can be cached completely in memory.
This scheme allows you to fast read the required files, and to add new ones at the end of your collection.
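A minimal sketch of the 'dbase' plus 'index' idea is below. The index format (one 8-byte end offset per item) is an assumption made for this example, as are all the names; it also assumes a single writer and that the two files are always created together.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class PackedStore
{
    private readonly string dataPath;
    private readonly string indexPath;
    // offsets[i] is where item i starts; the last entry is the end of the data.
    private readonly List<long> offsets = new List<long> { 0 };

    public PackedStore(string dataPath, string indexPath)
    {
        this.dataPath = dataPath;
        this.indexPath = indexPath;
        if (File.Exists(indexPath))
        {
            // The index is small, so load all of it into memory.
            using (var reader = new BinaryReader(File.OpenRead(indexPath)))
                while (reader.BaseStream.Position < reader.BaseStream.Length)
                    offsets.Add(reader.ReadInt64());
        }
    }

    public int Count { get { return offsets.Count - 1; } }

    // Appends the payload to the data file and its end offset to the index.
    public void Append(byte[] payload)
    {
        using (var data = new FileStream(dataPath, FileMode.Append, FileAccess.Write))
            data.Write(payload, 0, payload.Length);
        long end = offsets[offsets.Count - 1] + payload.Length;
        using (var index = new BinaryWriter(new FileStream(indexPath, FileMode.Append, FileAccess.Write)))
            index.Write(end);
        offsets.Add(end);
    }

    // Reads items [first, first + count) with one contiguous read.
    public List<byte[]> ReadRange(int first, int count)
    {
        long start = offsets[first];
        int total = (int)(offsets[first + count] - start);
        var block = new byte[total];
        using (var data = new FileStream(dataPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            data.Seek(start, SeekOrigin.Begin);
            int got = 0, n;
            while (got < total && (n = data.Read(block, got, total - got)) > 0) got += n;
        }

        var items = new List<byte[]>(count);
        for (int i = first; i < first + count; i++)
        {
            var item = new byte[(int)(offsets[i + 1] - offsets[i])];
            Buffer.BlockCopy(block, (int)(offsets[i] - start), item, 0, item.Length);
            items.Add(item);
        }
        return items;
    }
}
```

Compared with fixed 2048-byte slots, this wastes no padding but makes the index a second file that has to stay consistent with the data.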
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend upon how fast you can read the files vs. how fast you can transmit them on the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead x number of files into a queue. Another thread would read from the queue and send them on the network.
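One way to sketch such a bounded read-ahead buffer is with BlockingCollection (available since .NET 4.0; Task.Run needs .NET 4.5). The `readItem` and `send` callbacks are hypothetical stand-ins for the disk read and the network send, and the default capacity is an arbitrary choice.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ReadAheadPipeline
{
    // Reads items on a background task while the calling thread sends them.
    // At most `capacity` items are buffered between the two sides.
    public static void Run(IEnumerable<int> ids, Func<int, byte[]> readItem,
                           Action<byte[]> send, int capacity = 16)
    {
        using (var queue = new BlockingCollection<byte[]>(capacity))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (int id in ids)
                        queue.Add(readItem(id)); // blocks while the queue is full
                }
                finally
                {
                    // Lets the consumer loop end, even if a read failed.
                    queue.CompleteAdding();
                }
            });

            // Items come out in the order they were added (FIFO by default).
            foreach (byte[] item in queue.GetConsumingEnumerable())
                send(item);

            producer.Wait(); // surfaces any exception thrown by the reader
        }
    }
}
```

This only pays off if reading and sending really overlap; if the disk is much slower than the network, the queue will simply stay empty.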
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it. This might perform a bit better, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
Afaik, there's no direct support for memory mapping in .NET before 4.0, but here is an example of how to wrap the WIN32 API calls for C#.
See also here for a related question on SO.
Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?