We have an application that extracts data from several hardware devices. Each device's data should be stored in a separate file.
Currently we have one FileStream per file and do a write whenever data arrives, and that's it.
We have a lot of data coming in and the disk (an HDD, not an SSD) is struggling. I suspect flash would be faster, but mostly I think it's because the disk has to keep seeking between different file locations.
Some metrics for the default case: 400 different data sources (each with its own file), each producing ~50 KB/s (so ~20 MB/s total). Each data source's acquisition runs concurrently, and in total we use ~6% of the CPU.
Is there a way to organize the flushes to disk to ensure better throughput?
We will also consider improving the hardware, but that's not really the subject here, even though it would be a good way to improve our read/write performance.
Windows and NTFS handle multiple concurrent sequential IO streams to the same disk terribly inefficiently. Probably, you are suffering from random IO. You need to schedule the IO yourself in bigger chunks.
You might also see extreme fragmentation. In such cases NTFS sometimes allocates every Nth sector to each of the N files. It is hard to believe how bad NTFS is in such scenarios.
Buffer the data for each file until you have around 16 MB, then flush it out. Do not write to multiple files at the same time. That way you pay one disk seek per 16 MB segment, which reduces seek overhead to near zero.
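That buffering scheme might look roughly like this (a minimal sketch; the ChunkedWriter class and per-file dictionary are illustrative, and the 16 MB threshold comes from the paragraph above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Accumulate incoming data per source in memory and only touch the disk
// once a buffer reaches a large threshold, so each flush is one long
// sequential write instead of many interleaved seeks.
class ChunkedWriter : IDisposable
{
    const int FlushThreshold = 16 * 1024 * 1024; // 16 MB per file
    readonly Dictionary<string, MemoryStream> buffers =
        new Dictionary<string, MemoryStream>();

    public void Append(string path, byte[] data)
    {
        MemoryStream buffer;
        if (!buffers.TryGetValue(path, out buffer))
            buffers[path] = buffer = new MemoryStream();
        buffer.Write(data, 0, data.Length);
        if (buffer.Length >= FlushThreshold)
            Flush(path, buffer);
    }

    void Flush(string path, MemoryStream buffer)
    {
        // One sequential append; no other file is written concurrently.
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
            buffer.WriteTo(fs);
        buffer.SetLength(0);
    }

    public void Dispose()
    {
        // Flush whatever remains on shutdown.
        foreach (var pair in buffers)
            Flush(pair.Key, pair.Value);
    }
}
```

Note that 400 sources × 16 MB means up to ~6.4 GB of buffer in the worst case, so you may need a smaller threshold or a global cap depending on available RAM.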
Related
I need to write a stream that can change at any time to a file, so others can read it elsewhere; but if the disk is modified too frequently, I'm worried this will damage the disk.
It's said that a temporary file will use memory as much as possible without actually writing to disk.
But I've found it "seems" to still write to disk.
Can anyone clear up my confusion?
Modern HDDs/SSDs have a buffer cache (typically several megabytes) which has been designed for this very issue:
caching hot spots, i.e. frequently read and modified data. It's much faster to read/write data in memory than on disk.
SSDs have an additional problem: a limited number of write cycles, so overly frequent writes should be avoided. In case of power loss, a capacitor
(or the HDD platters' remaining rotation) provides enough energy to safely write all the cached data back to the HDD/SSD.
In your case, the hot-spot data will (with high probability) stay in the cache.
Summary: please don't reinvent the wheel; let the hardware manufacturers do their job and solve this (very typical) issue for you.
I've been doing some work with loading multiple image files into an HTML document that is then converted into a PDF.
I am unsure of the specifics, but I was under the impression it's better to read a single file at a time and keep the memory footprint low, rather than loading all the files into memory (in a dictionary) at once (there are so many images that the collection can be as big as 500 MB!).
I was wondering which is faster, though. Is it quicker to read, say, 100 MB worth of files into memory, process them, then load another 100 MB? Or is it better to do a single file at a time (surely the number of disk I/O operations would be similar either way)?
It's better to read files one by one, as that is more memory efficient. If you can, you should work with streams rather than in-memory buffers.
When you use more memory, your data may end up in the page file, resulting in more disk I/O operations.
You should also avoid working with large memory blocks if you don't want to see an OutOfMemoryException.
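For instance, pushing each file through a fixed-size buffer keeps memory flat no matter how large the files are (a generic sketch; the 80 KB buffer size mirrors the default Stream.CopyTo uses):

```csharp
using System.IO;

static class StreamProcessor
{
    // Process one file at a time through a fixed-size buffer: memory
    // use stays constant regardless of file size.
    public static void ProcessFile(string inputPath, Stream output)
    {
        var buffer = new byte[81920]; // ~80 KB
        using (var input = File.OpenRead(inputPath))
        {
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                output.Write(buffer, 0, read); // replace with real processing
        }
    }
}
```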
This depends on a number of things, but fundamentally, disk is a lot slower than memory, so you can gain by reading, if you do it right.
First, a warning: if you do not have plenty of memory to fit the files you attempt to load, then your operating system will page memory to disk, which will slow your system down far more than reading one file at a time, so be careful.
The key to improving disk io performance is to keep the disk busy. Reading one at a time leaves the disk idle while you are processing the file in memory. Reading a set of files into a large block of memory, but still reading one at a time, and then processing the block of files, probably won't improve performance except in very unusual conditions.
If your goal is to reduce the total time from start to finish of processing these files, you will probably want to run on multiple threads. The system calls to open and read a file still take time to queue, so depending on the capabilities of your disk, you can usually get better overall I/O throughput by having at least one read request queued while the disk is serving another; this minimizes idle time between requests and keeps the disk at its absolute maximum. Note that having too many requests queued can slow performance.
Since processing in memory is likely to be faster, you could have at least 2 threads set up to read files, and at least 1 thread set up to process the files that have already been loaded into memory by the other threads.
A better way than managing your own threads is to use a thread pool; this would naturally limit the number of io requests to the number of concurrent threads allowed, and wouldn't require you to manage the threads yourself. This may not be quite optimal, but a thread pool should be faster than processing the files one at a time, and easier/safer than managing threads.
Note that if you don't understand what I mean by threads and a threadpool, or you haven't done much multi-threaded development relating to disk io, you are probably better off sticking with one file at a time, unless improving the total processing time is a requirement that you can't get around. There are plenty of examples of how to use threads on MSDN, but if you haven't done it much, this probably isn't a good first project for threading.
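A hedged sketch of that reader/processor split, assuming .NET 4.5+ (the bounded BlockingCollection caps how many files sit in memory at once; the two-reader/one-processor split and the capacity of 4 are illustrative numbers, not tuned values):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class Pipeline
{
    // Two reader tasks keep the disk busy while one processor task
    // works on files already loaded into memory.
    public static void ProcessAll(string[] paths, Action<byte[]> process)
    {
        var loaded = new BlockingCollection<byte[]>(boundedCapacity: 4);
        var queue = new ConcurrentQueue<string>(paths);

        var readers = new Task[2];
        for (int i = 0; i < readers.Length; i++)
            readers[i] = Task.Run(() =>
            {
                string path;
                while (queue.TryDequeue(out path))
                    loaded.Add(File.ReadAllBytes(path)); // blocks when full
            });

        var processor = Task.Run(() =>
        {
            foreach (var bytes in loaded.GetConsumingEnumerable())
                process(bytes);
        });

        Task.WaitAll(readers);
        loaded.CompleteAdding();
        processor.Wait();
    }
}
```

The bounded capacity is what prevents the readers from racing ahead and triggering the paging problem warned about above.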
My program has to write hundreds of files to disk, received from external resources (the network).
Each file is a simple document that I currently store with a GUID as its name in a specific folder, but creating, writing, and closing hundreds of files is a lengthy process.
Is there a better way to store this many files on disk?
I've come up with a solution, but I don't know if it is the best.
First, I create two files: one of them acts as an allocation table, and the second is one huge file storing the content of all my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30 GB or more be a problem?
Edit: What is the fastest way to store 1000 text files on disk? (Write operations are frequent.)
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except there is a good chance for the files to either become fragmented or be located far apart from each other. Subversion allows you to pack each 1000-revision folder into a single file (which works nicely since the revisions are not modified once created).
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single-disk system, then the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above 6.
Verify that you are indeed write bound. If so, the best way to speed up massive writes is to buy disks with very fast write performance. Note that RAID setups that use parity (such as RAID 5) will typically decrease write performance.
If write performance is critical, then spreading the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e. for each received "file", enqueue a write function in a thread pool thread that actually persists the data to a file on disk.
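A minimal sketch of that idea (the fileName and data parameters are assumed to come from your network layer; error handling and backpressure are omitted for brevity):

```csharp
using System.IO;
using System.Threading;

static class Writer
{
    // Hand each received payload to the thread pool so the receiving
    // code never blocks on disk I/O. Fire-and-forget: a real version
    // should report or log write failures.
    public static void PersistAsync(string fileName, byte[] data)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            File.WriteAllBytes(fileName, data);
        });
    }
}
```

One caveat: if files arrive faster than the disk can absorb them, the queued work items will accumulate payloads in memory, so some form of bounding is advisable under sustained load.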
Is there such a thing as an optimum chunk size for processing large files? I have an upload service (WCF) that accepts file uploads ranging up to several hundred megabytes.
I've experimented with chunk sizes from 4 KB and 8 KB up to 1 MB. Bigger chunk sizes are good for performance (faster processing), but they come at the cost of memory.
So, is there a way to work out the optimum chunk size at the moment of uploading a file? How would one go about such a calculation? Would it be a combination of available memory on the client, CPU, and network bandwidth that determines the optimum size?
Cheers
EDIT: I should probably mention that the client app will be in Silverlight.
If you are concerned about running out of resources, then the optimum is probably best determined by evaluating your peak upload concurrency against your system's available memory. How many simultaneous uploads you have in progress at a time would be the key critical variable in any calculation you might do. All you have to do is make sure you have enough memory to handle the upload concurrency, and that's rather trivial to achieve. Memory is cheap and you will likely run out of network bandwidth long before you get to the point where your concurrency would overrun your memory availability.
On the performance side, this isn't the kind of thing you can really optimize much during app design and development. You have to have the system in place, users uploading files for real, and then you can monitor actual runtime performance.
Try a chunk size that matches your network's TCP/IP window size. That's about as optimal as you'd really need to get at design time.
I have this problem: I have a collection of small files, each about 2000 bytes (they are all exactly the same size), and there are about ~100,000 of them, which equals about 200 megabytes of space. I need to be able to select a range of these files in real time, say files 1000 to 1100 (100 files total), read them, and send them over the network reasonably fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk, I'm going to have one large file containing all the others at even 2048-byte intervals. The first 2 bytes of each 2048-byte block hold the actual byte size of the file contained in the following 2046 bytes (the files range between roughly 1800 and 1950 bytes in size), and I seek inside this file instead of opening a new file handle for each file I need to read.
So when I need the file at position X, I just seek to X*2048, read the first two bytes, and then read the bytes from (X*2048)+2 up to the size contained in those first two bytes. This large 200 MB file will be append-only, so it's safe to read even while the serialized input thread/process (I haven't decided which yet) appends more data to it.
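That lookup might be sketched like this (it reads the whole 2048-byte slot in one go and trims afterwards; the little-endian length prefix is an assumption about the format):

```csharp
using System;
using System.IO;

static class PackedFile
{
    const int SlotSize = 2048;

    // Read one record: grab the whole slot in a single read, then trim
    // to the payload length stored in the first two bytes.
    public static byte[] ReadRecord(FileStream packed, long index)
    {
        var slot = new byte[SlotSize];
        packed.Seek(index * SlotSize, SeekOrigin.Begin);
        int total = 0, read;
        while (total < SlotSize &&
               (read = packed.Read(slot, total, SlotSize - total)) > 0)
            total += read;
        int length = slot[0] | (slot[1] << 8); // little-endian length prefix
        var payload = new byte[length];
        Buffer.BlockCopy(slot, 2, payload, 0, length);
        return payload;
    }
}
```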
This has to be doable on Windows, C is an option but I would prefer C#.
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2K files.
I think your idea is probably the best you can do with a reasonable amount of effort.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for a range, I'd be quite tempted to seek to the start of the block and read the whole lot (i.e. the 2048-byte buffers for all the files in the range) into memory in one go. That gets the file I/O down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files #1000 to #1100, you can use the built-in (C#) APIs to get a collection of files meeting that criterion.
You can simply concatenate all the files into one big file, 'dbase', without any header or footer.
In another file, 'index', you save the position of each small file within 'dbase'. Being very small, this index file can be cached completely in memory.
This scheme lets you read the required files quickly and add new ones at the end of your collection.
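A sketch of that dbase + index scheme (names taken from the answer; persisting the index to its own file is omitted for brevity, as is any locking for concurrent access):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// One concatenated data file plus an in-memory index of record offsets.
class PackedStore : IDisposable
{
    readonly FileStream dbase;
    readonly List<long> offsets = new List<long>(); // the 'index', tiny enough to cache

    public PackedStore(string path)
    {
        dbase = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }

    // Append a record at the end and return its index.
    public int Append(byte[] record)
    {
        dbase.Seek(0, SeekOrigin.End);
        offsets.Add(dbase.Position);
        dbase.Write(record, 0, record.Length);
        return offsets.Count - 1;
    }

    // Read a record back; its length is implied by the next offset.
    public byte[] Read(int index)
    {
        long start = offsets[index];
        long end = index + 1 < offsets.Count ? offsets[index + 1] : dbase.Length;
        var record = new byte[end - start];
        dbase.Seek(start, SeekOrigin.Begin);
        int total = 0, n;
        while (total < record.Length &&
               (n = dbase.Read(record, total, record.Length - total)) > 0)
            total += n;
        return record;
    }

    public void Dispose() { dbase.Dispose(); }
}
```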
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend on how fast you can read the files versus how fast you can transmit them over the network. Assuming you can read tons of individual files faster than you can send them, you could set up a bounded buffer where you read ahead X files into a queue, and another thread reads from the queue and sends them over the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
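For the fixed 2048-byte slots from the question, combining a range into one read might look like this (a sketch; the caller slices out individual records using the per-slot length prefixes):

```csharp
using System.IO;

static class PackedRange
{
    const int SlotSize = 2048;

    // One seek plus one large sequential read covering the whole
    // requested range of slots, instead of 'count' small reads.
    public static byte[] ReadRange(FileStream packed, long firstIndex, int count)
    {
        var block = new byte[count * SlotSize];
        packed.Seek(firstIndex * SlotSize, SeekOrigin.Begin);
        int total = 0, read;
        while (total < block.Length &&
               (read = packed.Read(block, total, block.Length - total)) > 0)
            total += read;
        return block;
    }
}
```

For the asker's "100 files at a time" case, this turns 100 seeks into one 200 KB sequential read.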
You could stick with your solution of one big file but use memory mapping to access it (see here, e.g.). This might be a bit more performant, since you avoid extra buffer copies and the virtual memory manager is optimized for transferring 4096-byte pages.
As far as I know, older versions of .NET have no direct support for memory mapping (it was added in .NET 4 as System.IO.MemoryMappedFiles), but here is an example of how to wrap the Win32 API calls from C#.
See also here for a related question on SO.
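As a sketch, assuming .NET 4+ and the fixed 2048-byte slot layout from the question, reading one slot through a memory-mapped view might look like this:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedReader
{
    const int SlotSize = 2048; // layout assumed from the question

    // Map the packed file and copy one slot out of the view.
    public static byte[] ReadSlot(string path, long index)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(index * SlotSize, SlotSize))
        {
            var slot = new byte[SlotSize];
            view.ReadArray(0, slot, 0, SlotSize);
            return slot;
        }
    }
}
```

In practice you would map the file once and keep the mapping open across reads rather than re-creating it per call as this sketch does.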
Interestingly, this problem reminds me of the question in this older SO question:
Is this an over-the-top question for Senior Java developer role?