I have what I would consider to be a fairly common problem, but have not managed to find a good solution on my own or by browsing this forum.
Problem
I have written a tool to get a file listing of a folder with some additional information such as file name, file path, file size, hash, etc.
The biggest problem that I have is that some of the folders contain millions of files (maybe 50 million in the structure).
Possible Solutions
I have two solutions, but neither of them are ideal.
Every time a file is read, the information is written straight to a file. This is OK, but it means I can't multi-thread the enumeration without running into issues with threads contending for the lock on the output file.
Every time a file is read, the information is added to some form of collection, such as a ConcurrentBag. This means I can multi-thread the enumeration of the files and add them to the collection. Once the enumeration is done, I can write the whole collection to a file using File.WriteAllLines; however, adding 50 million entries to the collection makes most machines run out of memory.
Other Options
Is there any way to add items to a collection and then write them out to a file once the collection reaches a certain number of records, or something like that?
I looked into a BlockingCollection, but I think that would just fill up really quickly, as the producer side would be multi-threaded while the consumer would only be single-threaded.
Create a FileStream that is shared by all threads. Before writing to that FileStream, a thread must lock it. FileStream has some buffer (4096 bytes if I remember right), so it doesn't actually write to disk every time. You may use a BufferedStream around it if 4096 bytes is still not enough.
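Here is a rough sketch of what I mean (DescribeFile is a placeholder for whatever builds your per-file info line):

object writeLock = new object();

using (var fs = new FileStream("listing.txt", FileMode.Create, FileAccess.Write))
using (var buffered = new BufferedStream(fs, 1 << 20)) // 1 MB buffer on top of FileStream's own
using (var writer = new StreamWriter(buffered))
{
    Parallel.ForEach(files, path => // 'files' is your enumeration of paths
    {
        string line = DescribeFile(path); // placeholder: builds "name, path, size, hash, ..."
        lock (writeLock)                  // StreamWriter is not thread-safe
        {
            writer.WriteLine(line);
        }
    });
}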
BlockingCollection is precisely what you need. You can create one with a large buffer and have a single writer thread writing to a file that it keeps open for the duration of the run.
If reading is the dominant operation time-wise, the queue will stay near empty the whole time and the total time will be just slightly more than the read time.
If writing is the dominant operation time-wise, the queue will fill up until it reaches the limit you set (to prevent out-of-memory situations) and producers will only advance as the writer advances. The total time will be the time needed to write all the records to a single file sequentially, and you cannot do better than that (when the writer is the slowest part).
You may be able to get slightly better performance by pipelining through multiple blocking collections, e.g. separating the hash calculation (a CPU-bound operation) from the read and write operations. If you want to do that, though, consider the TPL Dataflow library.
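To illustrate the basic single-writer setup, a minimal sketch (ComputeHash is a placeholder for your hashing code, and root for the folder being listed):

var queue = new BlockingCollection<string>(boundedCapacity: 100000); // the cap prevents OOM

var writerTask = Task.Run(() =>
{
    using (var output = new StreamWriter("listing.txt"))
        foreach (var line in queue.GetConsumingEnumerable())
            output.WriteLine(line);
});

Parallel.ForEach(Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories), path =>
{
    var info = new FileInfo(path);
    queue.Add(path + "\t" + info.Length + "\t" + ComputeHash(path)); // blocks while the queue is full
});

queue.CompleteAdding(); // lets the writer's foreach finish
writerTask.Wait();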
Related
I've been doing some work with loading multiple image files into an HTML document that is then converted into a PDF.
I am unsure of the specifics, but I was under the impression it's better to read a single file at a time and keep the memory footprint low, rather than loading all the files into memory (in a dictionary) at once (there are so many images that the collection can be as big as 500MB!).
I was wondering which is faster, though. Is it quicker to read, say, 100MB worth of files into memory, process them, then load another 100MB? Or is it better to do it a single file at a time (surely the number of disk I/O operations would be similar either way)?
It's better to read the files one by one, as it is more memory efficient. If you can, you should work only with streams rather than in-memory buffers.
When you use more memory, your data may end up in the page file, resulting in more disk I/O operations.
You should avoid working with large memory blocks if you don't want to see an OutOfMemoryException.
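A minimal sketch of the stream-only, one-at-a-time approach (outputStream stands in for wherever the image bytes ultimately go):

foreach (var path in imagePaths) // your list of image files
{
    using (var input = File.OpenRead(path)) // only one file's stream is live at a time
    {
        input.CopyTo(outputStream); // copies through a small internal buffer, never the whole file
    }
}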
This depends on a number of things, but fundamentally, disk is a lot slower than memory, so you can gain by reading ahead, if you do it right.
First, a warning: if you do not have plenty of memory to fit the files you attempt to load, then your operating system will page memory to disk, which will slow your system down far more than reading one file at a time, so be careful.
The key to improving disk I/O performance is to keep the disk busy. Reading one file at a time leaves the disk idle while you are processing the file in memory. Reading a set of files into a large block of memory, but still reading one at a time, and then processing the block of files, probably won't improve performance except in very unusual conditions.
If your goal is to reduce the time from start to finish of processing these files, you will probably want to run on multiple threads. The system calls to open and read a file still take time to queue, so depending on the capabilities of your disk, you can usually get better overall I/O throughput by having at least one read request queued while the disk is servicing another; this minimizes idle time between requests and keeps the disk at its absolute maximum. Note that having too many requests queued can slow performance.
Since processing in memory is likely to be faster, you could have at least 2 threads set up to read files, and at least 1 thread set up to process the files that have already been loaded into memory by the other threads.
A better way than managing your own threads is to use a thread pool; this would naturally limit the number of I/O requests to the number of concurrent threads allowed, and wouldn't require you to manage the threads yourself. This may not be quite optimal, but a thread pool should be faster than processing the files one at a time, and easier/safer than managing threads.
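As a rough sketch of that idea, Parallel.ForEach with a capped degree of parallelism lets the thread pool bound the number of in-flight requests (Process is a placeholder for your per-file work):

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // tune to your disk

Parallel.ForEach(Directory.EnumerateFiles(folder), options, path =>
{
    byte[] data = File.ReadAllBytes(path); // keeps a read queued while other threads process
    Process(data);                         // placeholder for the CPU-bound processing step
});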
Note that if you don't understand what I mean by threads and a thread pool, or you haven't done much multi-threaded development relating to disk I/O, you are probably better off sticking with one file at a time, unless improving the total processing time is a requirement that you can't get around. There are plenty of examples of how to use threads on MSDN, but if you haven't done it much, this probably isn't a good first project for threading.
I have some 2TB read-only (no writing once created) files on a RAID 5 (4 x 7.2k @ 3TB) system.
Now I have some threads that want to read portions of those files.
Every thread has an array of chunks it needs.
Every chunk is addressed by file offset (position) and size (mostly about 300 bytes) to read from.
What is the fastest way to read this data?
I don't care about CPU cycles, (disk) latency is what counts.
So if possible I want to take advantage of the NCQ of the hard disks.
As the files are highly compressed, will be accessed randomly, and I know the exact positions, I have no other way to optimize it.
Should I pool the file reading to one thread?
Should I keep the file open?
Should every thread (maybe about 30) keep every file open simultaneously, and what about new threads that come along (from the web server)?
Will it help if I wait 100ms and sort my reads by file offset (lowest first)?
What is the best way to read the data? Do you have experiences, tips, hints?
The optimum number of parallel requests depends highly on factors outside your app (e.g. disk count = 4, NCQ depth = ?, driver queue depth = ?, ...), so you might want to use a system that can adapt or be adapted. My recommendation is:
Write all your read requests into a queue, together with some metadata that allows the worker to notify the requesting thread
Have N threads dequeue from that queue, synchronously read the chunk, and notify the requesting thread
Make N runtime-changeable
Since CPU is not your concern, your worker threads can calculate a floating latency average (and/or maximum, depending on your needs)
Slide N up and down, until you hit the sweet point
Why sync reads? They have lower latency than async reads.
Why waste latency on a queue? A good lockless queue implementation starts at less than 10ns latency, much less than two thread switches
Update: Some Q/A
Should the read threads keep the files open? Yes, definitely so.
Would you use a FileStream with FileOptions.RandomAccess? Yes
You write "synchronously read the chunk". Does this mean every single read thread should start reading a chunk from disk as soon as it dequeues an order to read a chunk? Yes, that's what I meant. The queue depth of read requests is managed by the thread count.
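A minimal sketch of that recommendation (ReadRequest is a made-up type for this example; path is the big file, and n is the thread count you slide up and down):

class ReadRequest
{
    public long Offset;
    public int Count;
    public TaskCompletionSource<byte[]> Completion = new TaskCompletionSource<byte[]>();
}

var requests = new BlockingCollection<ReadRequest>();

for (int i = 0; i < n; i++) // n = current reader-thread count
{
    new Thread(() =>
    {
        // each reader keeps its own handle open for the duration of the run
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, FileOptions.RandomAccess))
        {
            foreach (var req in requests.GetConsumingEnumerable())
            {
                var buffer = new byte[req.Count];
                int total = 0;
                while (total < req.Count) // synchronous read; may take several calls
                {
                    int read = fs.Read(buffer, total, req.Count - total);
                    if (read == 0) break; // hit end of file
                    total += read;
                }
                req.Completion.SetResult(buffer); // notifies the requesting thread
            }
        }
    }) { IsBackground = true }.Start();
}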
Disks are "single threaded" because there is only one head. It won't go faster no matter how many threads you use... in fact, more threads will probably just slow things down. Just get yourself the list and arrange (sort) it in the app.
You can of course use many threads, which would probably make use of NCQ more efficiently, but arranging the reads in the app and using one thread should work better.
If the file is fragmented, use NCQ and a couple of threads, because then you can't know the exact position on disk, so only NCQ can optimize the reads. If it's contiguous, use sorting.
You may also try direct I/O to bypass OS caching and read the whole file sequentially... it can sometimes be faster, especially if you have no other load on this array.
Will ReadFileScatter do what you want?
I have the following code:
if (!this.writeDataStore.Exists(mat))
{
BlockingCollection<ImageFile> imageFiles = new BlockingCollection<ImageFile>();
Parallel.ForEach(fileGrouping, fi => DecompressAndReadGzFile(fi, imageFiles));
this.PushIntoDb(mat, imageFiles.ToList());
}
DecompressAndReadGzFile is a static method in the same class that this method is contained in. As per the method name, I am decompressing and reading gz files, lots of them, i.e. up to 1000, so the overhead of parallelisation should be worth it for the benefits. However, I'm not seeing the benefits. When I use the ANTS performance profiler, I see that they are running at exactly the same times as if no parallelisation is occurring. I also checked the CPU cores with Process Explorer and it looks like there is possibly work being done on two cores, but one core seems to be doing most of the work. What am I not understanding as far as getting Parallel.ForEach to decompress and read files in parallel?
UPDATED QUESTION: What is the fastest way to read information in from a list of files?
The Problem (simplified):
There is a large list of .gz files (1200).
Each file has a line containing "DATA: "; the location and line number are not static and can vary from file to file.
We need to retrieve the first number after "DATA: " (just for simplicity's sake) and store it in an object in memory (e.g. a List)
In the initial question, I was using the Parallel.ForEach loop, but I didn't seem to be CPU-bound on more than one core.
Is it possible that the threads are spending most of their time waiting for IO? By reading multiple files at a time, you may be making the disk thrash more than it would with a single operation. It's possible that you could improve performance by using a single thread reading sequentially, but then doling out the CPU-bound decompression to separate threads... but you may actually find that you only really need one thread performing the decompression anyway, if the disk is slower than the decompression process itself.
One way to test this would be to copy the files requiring decompression onto a ramdisk first and still use your current code. I suspect you'll then find you're CPU-bound, and all the processors are busy almost all the time.
(You should also consider what you're doing with the decompressed files. Are you writing those back to disk? If so, again there's the possibility that you're basically waiting for a thrashing disk.)
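One way to structure that test in code: a single reader thread feeding a bounded queue, with a fixed set of consumers doing the CPU-bound decompression (ProcessLines is a placeholder for the "DATA: " scan):

var compressed = new BlockingCollection<byte[]>(boundedCapacity: 64);

var reader = Task.Run(() =>
{
    foreach (var path in gzPaths)                // your list of .gz files
        compressed.Add(File.ReadAllBytes(path)); // disk access stays sequential
    compressed.CompleteAdding();
});

var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var bytes in compressed.GetConsumingEnumerable())
            using (var gz = new GZipStream(new MemoryStream(bytes), CompressionMode.Decompress))
            using (var text = new StreamReader(gz))
                ProcessLines(text);              // placeholder: finds the "DATA: " line
    }))
    .ToArray();

Task.WaitAll(consumers);
reader.Wait();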
Is there any chance your static method is sharing a global resource among its calls? Because in that case the calls to this static method will effectively be serialized and there is no parallel benefit.
Can you post your fileGrouping class code?
I would like the most efficient way of copying, say, 1MB from an offset in file 1 to the same or a different offset in file 2. Is it possible to have multiple threads doing the reads and the writes at the same time safely?
To clarify, I of course want to handle file-to-file as stated. However, I also want to have multiple threads reading from a network-IO-bound location (the internet, etc.) and then reassemble those pieces back into a single file locally. If it makes more sense for the write operation to be single-threaded, that is fine too.
You almost certainly don't want multiple threads doing this - you're going to be IO-bound, and you don't want the overhead of the disk seeking all over the place.
I'd just do it in one thread (a sketch follows the steps below):
Open file 1 for reading
Seek
Open file 2 for writing
Seek
Repeatedly copy a buffer at a time (e.g. 32K) from one stream to the other.
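A minimal sketch of those steps, assuming the paths and offsets are given:

using (var src = new FileStream(sourcePath, FileMode.Open, FileAccess.Read,
                                FileShare.Read, 32 * 1024, FileOptions.SequentialScan))
using (var dst = new FileStream(destPath, FileMode.OpenOrCreate, FileAccess.Write))
{
    src.Seek(sourceOffset, SeekOrigin.Begin);
    dst.Seek(destOffset, SeekOrigin.Begin);

    var buffer = new byte[32 * 1024];
    long remaining = 1024 * 1024; // the 1MB to copy
    while (remaining > 0)
    {
        int read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
        if (read == 0) break; // source shorter than expected
        dst.Write(buffer, 0, read);
        remaining -= read;
    }
}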
You may find that some of the FileOptions (e.g. SequentialScan for the reader) could make a difference.
EDIT: As noted in comments, it may be worth using two threads - one to read and one to write, particularly if you're using two separate drives. However, it's also possible that with the operating system automatically doing prefetching etc, that wouldn't be helpful. It would certainly complicate the code.
Do you have a target time for this operation? How fast does a simple implementation take compared with that target time? I definitely wouldn't venture into multiple threads or async operations until you've established how long the simple approach takes.
I have this problem: I have a collection of small files, each about 2000 bytes large (they are all the exact same size), and there are about ~100,000 of them, which equals about 200 megabytes of space. I need to be able to, in real time, select a range of these files, say file 1000 to 1100 (100 files total), read them, and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk, I'm going to have one large file containing all the others at even 2048-byte intervals, with the first 2 bytes of each 2048-byte block being the actual byte size of the file contained in the next 2046 bytes (the files range between roughly 1800 and 1950 bytes in size), and then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read the bytes from (X*2048)+2 up to the size contained in those first two bytes. This large 200MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.
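For reference, a minimal C# sketch of the lookup just described (assuming a little-endian two-byte length prefix and an already-open FileStream over the big file):

byte[] ReadFileAt(FileStream packed, int index)
{
    packed.Seek((long)index * 2048, SeekOrigin.Begin);

    var header = new byte[2];
    packed.Read(header, 0, 2);
    int size = header[0] | (header[1] << 8); // assumed little-endian size prefix

    var data = new byte[size];
    int total = 0;
    while (total < size)
    {
        int read = packed.Read(data, total, size - total);
        if (read == 0) break; // truncated block
        total += read;
    }
    return data;
}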
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2k files.
I think your idea is probably the best you can do with a reasonable amount of work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload all of the data into an in-memory collection, if you don't depend on keeping RAM usage low (this will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data" and read the whole lot into memory (i.e. the 2048-byte buffers for all the files) in one go. That will get the file I/O down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
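A sketch of that range read, reusing the layout above (first and count define the range; Send is a placeholder for the network write):

byte[] block = new byte[count * 2048];
packed.Seek((long)first * 2048, SeekOrigin.Begin);

int got = 0;
while (got < block.Length)
{
    int read = packed.Read(block, got, block.Length - got);
    if (read == 0) break; // reached the current end of the file
    got += read;
}

for (int i = 0; i < count; i++)
{
    int offset = i * 2048;
    int size = block[offset] | (block[offset + 1] << 8);
    Send(block, offset + 2, size); // placeholder: transmit only the real payload
}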
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to #1100, you can use the built-in (C#) APIs to get a collection of files meeting those criteria.
You can simply concatenate all the files into one big file 'dbase' without any header or footer.
In another file, 'index', you can save the position of every small file in 'dbase'. Since this index file is very small, it can be cached completely in memory.
This scheme allows you to quickly read the required files and to add new ones at the end of your collection.
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend upon how fast you can read the files vs how fast you can transmit them over the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead x number of files into a queue. Another thread would then read from the queue and send them over the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
Afaik, there's no direct support for memory mapping, but here is an example of how to wrap the WIN32 API calls for C#.
See also here for a related question on SO.
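Note that .NET 4.0 and later do include built-in support via System.IO.MemoryMappedFiles, so the WIN32 wrapping is only needed on older frameworks. A minimal sketch against the single-big-file layout (the file name is made up):

using (var mmf = MemoryMappedFile.CreateFromFile("dbase.bin", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    long position = (long)index * 2048;
    ushort size = accessor.ReadUInt16(position);     // the two-byte length prefix
    var data = new byte[size];
    accessor.ReadArray(position + 2, data, 0, size); // the actual file payload
}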
Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?