Reading huge amounts of small files in sequence - c#

I have this problem: I have a collection of small files that are about 2000 bytes each (they are all the exact same size), and there are about 100,000 of them, which adds up to roughly 200 megabytes of space. I need to be able to, in real time, select a range of these files, say file 1000 to 1100 (100 files total), read them and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all the others at even 2048-byte intervals. The first 2 bytes of each 2048-byte block hold the actual byte size of the file stored in the following 2046 bytes (the files range between roughly 1800 and 1950 bytes in size). I'll then seek inside this one file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read from (X*2048)+2 up to the size contained in those first two bytes. This large 200 MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
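A minimal sketch of that lookup in C#, assuming the container is already open as a FileStream (with FileShare.ReadWrite so the appender can keep writing) and that the 2-byte length prefix is little-endian; names are illustrative:
using System.IO;

static byte[] ReadFileAt(FileStream container, long x)
{
    container.Seek(x * 2048L, SeekOrigin.Begin);

    // First two bytes of the slot hold the payload size.
    byte[] sizeBytes = new byte[2];
    if (container.Read(sizeBytes, 0, 2) != 2)
        throw new EndOfStreamException("Slot " + x + " is past the end of the container.");
    int size = sizeBytes[0] | (sizeBytes[1] << 8);   // little-endian by assumption

    // The next 'size' bytes are the actual file content.
    byte[] payload = new byte[size];
    int read = 0;
    while (read < size)
    {
        int n = container.Read(payload, read, size - read);
        if (n == 0) throw new EndOfStreamException("Slot " + x + " is truncated.");
        read += n;
    }
    return payload;
}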
This has to be doable on Windows, C is an option but I would prefer C#.

Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2k files

I think your idea is probably the best you can do with decent work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload all the data into an in-memory collection if you don't depend on keeping RAM usage low (this will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.

That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
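A rough sketch of that batch read, assuming the same 2048-byte-slot layout described in the question (identifiers are made up):
using System.Collections.Generic;
using System.IO;

// Read slots [first, first + count) in a single sequential read, then slice out
// the real payloads using the 2-byte length prefix in each slot.
static List<byte[]> ReadRange(FileStream container, long first, int count)
{
    byte[] block = new byte[count * 2048];
    container.Seek(first * 2048L, SeekOrigin.Begin);

    int read = 0;
    while (read < block.Length)
    {
        int n = container.Read(block, read, block.Length - read);
        if (n == 0) break;   // range ran past the end of the container
        read += n;
    }

    var files = new List<byte[]>(count);
    for (int i = 0; i < count; i++)
    {
        int offset = i * 2048;
        int size = block[offset] | (block[offset + 1] << 8);   // little-endian by assumption
        byte[] payload = new byte[size];
        Array.Copy(block, offset + 2, payload, 0, size);
        files.Add(payload);
    }
    return files;
}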
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?

Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files #1000 to #1100, you can use the built-in (C#) APIs to get a collection of files meeting that criterion.
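For example, if the files happened to be named by their sequence number (an assumption; adjust to the real naming scheme), something like this would enumerate a range:
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical naming scheme: one file per record, named 0001000.bin, 0001001.bin, ...
static IEnumerable<string> FilesInRange(string directory, int first, int count)
{
    return Enumerable.Range(first, count)
                     .Select(i => Path.Combine(directory, i.ToString("D7") + ".bin"))
                     .Where(File.Exists);
}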

You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file 'index', you can save the position of all the small files in 'dbase'. This index file, being very small, can be cached completely in memory.
This scheme allows you to read the required files quickly, and to add new ones at the end of your collection.
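A minimal sketch of that layout, assuming each index entry is simply an (offset, length) pair of 64-bit values appended whenever a small file is added; names are illustrative:
using System.Collections.Generic;
using System.IO;

struct IndexEntry { public long Offset; public long Length; }

// Append one small file to 'dbase' and record its (offset, length) in 'index'.
static void Append(string dbasePath, string indexPath, byte[] smallFile)
{
    using (var dbase = new FileStream(dbasePath, FileMode.Append, FileAccess.Write))
    using (var index = new BinaryWriter(new FileStream(indexPath, FileMode.Append, FileAccess.Write)))
    {
        long offset = dbase.Position;              // current end of the data file
        dbase.Write(smallFile, 0, smallFile.Length);
        index.Write(offset);                       // 8 bytes
        index.Write((long)smallFile.Length);       // 8 bytes
    }
}

// The whole index is tiny (16 bytes per file), so cache it completely in memory.
static List<IndexEntry> LoadIndex(string indexPath)
{
    var entries = new List<IndexEntry>();
    using (var reader = new BinaryReader(File.OpenRead(indexPath)))
        while (reader.BaseStream.Position < reader.BaseStream.Length)
            entries.Add(new IndexEntry { Offset = reader.ReadInt64(), Length = reader.ReadInt64() });
    return entries;
}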

Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with the implementation, or are you looking for a better way to do it?
Whether there is a better way might depend on how fast you can read the files versus how fast you can transmit them over the network. Assuming that you can read tons of individual files faster than you can send them, you could set up a bounded buffer, where you read ahead X number of files into a queue. Another thread would then read from the queue and send the files over the network.
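A rough sketch of such a bounded buffer, assuming .NET 4's BlockingCollection and Task are available; send is a placeholder for whatever writes to the NetworkStream:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Bounded read-ahead: one task reads files into a queue of at most 100 buffers,
// the current thread drains the queue and sends each buffer over the network.
static void StreamRange(IEnumerable<string> paths, Action<byte[]> send)
{
    var queue = new BlockingCollection<byte[]>(100);   // bounded capacity

    var reader = Task.Factory.StartNew(() =>
    {
        foreach (string path in paths)
            queue.Add(File.ReadAllBytes(path));        // blocks when the queue is full
        queue.CompleteAdding();
    });

    foreach (byte[] data in queue.GetConsumingEnumerable())
        send(data);                                    // e.g. write to a NetworkStream

    reader.Wait();
}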

I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.

You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
AFAIK, there's no direct support for memory mapping in the framework, but here is an example of how to wrap the Win32 API calls for C#.
See also here for a related question on SO.
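If targeting .NET 4.0 or later is an option, System.IO.MemoryMappedFiles provides this without any P/Invoke; a minimal sketch using the 2048-byte-slot layout from the question:
using System.IO;
using System.IO.MemoryMappedFiles;

// Read the 2048-byte slot at index x via a memory-mapped view.
// (In a real implementation you would keep the mapping open across calls.)
static byte[] ReadSlotMapped(MemoryMappedFile mmf, long x)
{
    using (var view = mmf.CreateViewAccessor(x * 2048L, 2048, MemoryMappedFileAccess.Read))
    {
        ushort size = view.ReadUInt16(0);          // 2-byte length prefix, as in the question
        byte[] payload = new byte[size];
        view.ReadArray(2, payload, 0, size);
        return payload;
    }
}

// Usage: var mmf = MemoryMappedFile.CreateFromFile("container.dat", FileMode.Open);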

Interestingly, this problem reminds me of this older SO question:
Is this an over-the-top question for Senior Java developer role?

Related

.Net Write continuously data to the disk in different files

We have an application that extract data from several hardware devices. Each device's data should be stored in a different file.
Currently we have one FileStream per file and do a write whenever data comes in, and that's it.
We have a lot of data coming in and the disk is struggling to keep up; it's an HDD (not an SSD). I guess an SSD would do better because flash is faster, but also because it wouldn't have to jump to different file locations all the time.
Some metrics for the default case: 400 different data sources (each should have its own file), and we receive ~50 KB/s from each (so ~20 MB/s in total). Each data source's acquisition runs concurrently, and in total we are using ~6% of the CPU.
Is there a way to organize the flushes to the disk in order to ensure a better flow?
We will also consider improving the hardware, but that's not really the subject here, even though it would be a good way to improve our read/write throughput.
Windows and NTFS handle multiple concurrent sequential IO streams to the same disk terribly inefficiently. Probably, you are suffering from random IO. You need to schedule the IO yourself in bigger chunks.
You might also see extreme fragmentation. In such cases NTFS sometimes allocates every Nth sector to each of the N files. It is hard to believe how bad NTFS is in such scenarios.
Buffer data for each file until you have like 16MB. Then, flush it out. Do not write to multiple files at the same time. That way you have one disk seek for each 16MB segment which reduces seek overhead to near zero.
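A rough sketch of that batching, with one in-memory buffer per source and a single lock so only one file is flushed to disk at a time; the 16 MB threshold and the names are illustrative:
using System.IO;

class BufferedSink
{
    const int FlushThreshold = 16 * 1024 * 1024;     // ~16 MB per file before touching the disk
    readonly string _path;
    readonly MemoryStream _buffer = new MemoryStream();
    static readonly object DiskLock = new object();  // serialize flushes across all sinks

    public BufferedSink(string path) { _path = path; }

    // Called from the one acquisition thread that owns this sink.
    public void Write(byte[] data, int offset, int count)
    {
        _buffer.Write(data, offset, count);
        if (_buffer.Length >= FlushThreshold)
            Flush();
    }

    public void Flush()
    {
        byte[] chunk = _buffer.ToArray();
        _buffer.SetLength(0);
        lock (DiskLock)                              // one big sequential write at a time
        {
            using (var fs = new FileStream(_path, FileMode.Append, FileAccess.Write))
                fs.Write(chunk, 0, chunk.Length);
        }
    }
}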

Storing file in byte array vs reading and writing with file stream?

I'm working on a program that modifies a file, and I'm wondering if the way I'm working with it is wrong.
The file is stored in blocks inside another file and is separated by a bunch of hashes. It's only about 1mb in size, so I just calculate its location once and read it into a byte array and work with it like that.
I'm wondering if it's some kind of horrendous programming habit to read an entire file, regardless of its size, into a byte array in memory. It is the sole purpose of my program, though, and is about the only memory it takes up.
This depends entirely on the expected size (range) of the files you will be reading in. If your input files can reach over a hundred MB in size, this approach doesn't make much sense.
If your input files are small relative to the memory of machines your software will run on, and your program design benefits from having the entire contents in memory, then it's not horrendous; it's sensible.
However, if your software doesn't actually require the entire file's contents in memory, then there's not much of an argument for doing this (even for smaller files.)
If you require random read/write access to the file in order to modify it then reading it into memory is probably ok as long as you can be sure the file will never ever exceed a certain size (you don't want to read a few hundred MB file into memory).
Usually using a stream reader (like a BinaryReader) and processing the data as you go is a better option.
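For the cases where streaming does make sense, a minimal sketch of that pattern; the record format and the Process method are made up for illustration:
using System.IO;

// Hypothetical format: each record is a 4-byte length prefix followed by that many bytes.
using (var reader = new BinaryReader(File.OpenRead("data.bin")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        int length = reader.ReadInt32();
        byte[] record = reader.ReadBytes(length);
        Process(record);                // placeholder for whatever the program does with it
    }
}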
It's horrendous -- like most memory-/CPU-hogging activities -- if you don't have to do it.

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7 GB is too much to load into memory, I obviously need to read the files as streams. The best method I've come up with so far is to make several passes over both files: one pass for each letter of the alphabet, loading, for example, all the domains that start with 'a' in the first pass.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is that, even reading char by char and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds just to collect all the domains for the current pass's letter. So I figure that, in its current state, I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad-core Xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code (the unwritten kind) and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
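A rough sketch of that lock-step walk, assuming the prepare step has already collapsed each file to one sorted line per domain of the form "domain<TAB>ns1,ns2,..." (that collapsing is an extra assumption, not something sort or diff does for you); the Report* methods are placeholders:
using System;
using System.IO;

using (var oldFile = new StreamReader("prepared-yesterday"))
using (var newFile = new StreamReader("prepared-today"))
{
    string oldLine = oldFile.ReadLine();
    string newLine = newFile.ReadLine();

    while (oldLine != null || newLine != null)
    {
        string oldDomain = oldLine == null ? null : oldLine.Split('\t')[0];
        string newDomain = newLine == null ? null : newLine.Split('\t')[0];

        int cmp = oldLine == null ? 1
                : newLine == null ? -1
                : string.CompareOrdinal(oldDomain, newDomain);

        if (cmp < 0)      { ReportRemoved(oldDomain); oldLine = oldFile.ReadLine(); }
        else if (cmp > 0) { ReportAdded(newDomain);   newLine = newFile.ReadLine(); }
        else
        {
            if (oldLine != newLine) ReportChanged(newDomain);  // same domain, different NS list
            oldLine = oldFile.ReadLine();
            newLine = newFile.ReadLine();
        }
    }
}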
The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort the data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing the data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write-ahead log - yeah, I know that the second "log" is redundant) and updates data in memory. When it gets around to it, it writes the changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate the indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort the facts, partition them into key/value sets, and send them off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, nameserver). The reducer would then get, for a single domain, the one or two files it came from with full details, and emit only the facts of interest: an add if the domain is in the new file but not the old, a drop if it is in the old but not the new, and a change if it is in both but the nameservers differ.
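A very rough streaming-style sketch of such a mapper, not tied to any particular framework; it assumes lines arrive on stdin and that the source tag ("old" or "new") is passed as a command-line argument:
using System;

// Mapper sketch: reads zone-file lines from stdin, writes "domain<TAB>source,nameserver" to stdout.
// The source tag ("old" or "new") comes from args[0]; a real framework would supply this itself.
static void Main(string[] args)
{
    string source = args[0];
    string line;
    while ((line = Console.ReadLine()) != null)
    {
        // Expected form: "mydomain NS ns.mynameserver.com."
        string[] parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length >= 3 && parts[1] == "NS")
            Console.WriteLine(parts[0].ToLowerInvariant() + "\t" + source + "," + parts[2]);
    }
    // A reducer then receives all values for one domain (grouped by the sort) and emits
    // "added", "removed", or "changed" by comparing the old and new nameserver sets.
}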
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

How can I quickly create large (>1gb) text+binary files with "natural" content? (C#)

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.
The content of the files should be neither completely random nor uniform.
A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
I'd like to keep the number of files at a manageable level, let's say O(10).
For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:
// size, sz, Filename, zeroes and _rnd are fields of the surrounding class.
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    fileStream.Close();
}
With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.
For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.
Neither of these is quite satisfactory for me.
I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.
The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.
What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
Currently I have an approach that sort of works but it takes too long to run.
Has anyone else solved this?
Is there a much faster way to write a text file than via StreamWriter?
Suggestions?
EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
For text, you could use the Stack Overflow community dump; there are about 300 MB of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and XML mixed in).
You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.
If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.
Edit
Mark mentions the project gutenberg download, this is also a really good source for text (and audio) which is available for download via bittorrent.
You could always code yourself a little web crawler...
UPDATE
Calm down guys, this would be a good answer, if he hadn't said that he already had a solution that "takes too long".
A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.
I think you might be looking for something like a Markov chain process to generate this data. It's both stochastic (randomised), but also structured, in that it operates based on a finite state machine.
Indeed, Markov chains have been used for generating semi-realistic-looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. It is well worth the effort if you need these enormous amounts of test data.
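A minimal word-level sketch of the idea (an order-1 chain, so each next word depends only on the current one), assuming a plain-text training file is available:
using System;
using System.Collections.Generic;
using System.IO;

// Order-1 word-level chain: map each word to the list of words observed to follow it.
string[] words = File.ReadAllText("training.txt")
    .Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);

var followers = new Dictionary<string, List<string>>();
for (int i = 0; i < words.Length - 1; i++)
{
    List<string> list;
    if (!followers.TryGetValue(words[i], out list))
        followers[words[i]] = list = new List<string>();
    list.Add(words[i + 1]);
}

// Walk the chain: repeatedly pick a random successor of the current word.
var rnd = new Random();
string current = words[rnd.Next(words.Length)];
using (var output = new StreamWriter("generated.txt"))
{
    for (int i = 0; i < 1000000; i++)                  // roughly a few MB of pseudo-natural text
    {
        output.Write(current);
        output.Write(' ');
        List<string> next;
        if (followers.TryGetValue(current, out next) && next.Count > 0)
            current = next[rnd.Next(next.Count)];
        else
            current = words[rnd.Next(words.Length)];   // dead end: restart at a random word
    }
}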
I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them copying them to your output file as many times as needed to get the right size file.
You could then use a similiar approach for binary files by looking for .exes or .dlls.
For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.
For a more structured approach you could use a Markov chain trained on some large free english text.
Why don't you just take Lorem Ipsum and build a long string in memory before writing the output? If you double the amount of text you have each time, you only need O(log n) concatenations to reach the target size. You can even calculate the total length of the data beforehand, so you never suffer from having to copy the contents to a new string/array.
Since your buffer is only 512k or whatever you set it to be, you only need to generate that much data before writing it, since that is only the amount you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
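A tiny sketch of that doubling trick, assuming the seed text is already in a string named loremIpsum (illustrative):
using System.Text;

// Double the text until it reaches the buffer size: O(log n) appends instead of O(n).
var sb = new StringBuilder(loremIpsum, 512 * 1024);
while (sb.Length < 512 * 1024)
    sb.Append(sb.ToString());              // each pass doubles the length
string block = sb.ToString(0, 512 * 1024);
// 'block' can now be written to the file repeatedly until the target size is reached.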
Wikipedia is excellent for compression testing for mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100mb of Wikipedia. The current record is a 6.26 ratio, 16 mb.
Thanks for all the quick input.
I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple ideas.
To generate text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
I randomly select and download one document of that subset.
I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
I wrote a new Markov chain in C# using Pike's economical Perl implementation as an example. It generates text one word at a time.
For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together.
UPDATE: As for the second problem, speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400 rpm mini-spindle. That led me to redefine the problem entirely: rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around a Markov chain, I can generate text in memory and stream it straight to the compressor, eliminating 8 GB of writes and 8 GB of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach sped things up massively; it cut 80% of the time required.
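A rough sketch of that idea; GZipStream stands in for whatever compressor is actually under test, and NextChunk() is a placeholder for the Markov generator:
using System;
using System.IO;
using System.IO.Compression;

// Stream ~1 GB of generated text into the compressor without ever putting the
// uncompressed data on disk.
long remaining = 1L << 30;
using (var outFile = File.Create("compressed.gz"))
using (var gzip = new GZipStream(outFile, CompressionMode.Compress))
{
    while (remaining > 0)
    {
        byte[] chunk = NextChunk();                      // e.g. ~1 MB of Markov-generated text
        int count = (int)Math.Min(chunk.Length, remaining);
        gzip.Write(chunk, 0, count);
        remaining -= count;
    }
}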
I haven't yet figured out how to do the binary generation, but it will likely be something analogous.
Thank you all, again, for all the helpful ideas.

Is there a more efficient way to reconcile large data sets?

I've been tasked with reconciling two big data sets (two big lists of transactions). Basically I extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts at achieving this (click if interested).
The gist of it is to load both data sets into one big hash table, with the keys being the rows, and the value incremented by 1 each time a row appears in file A and decremented by 1 each time it appears in file B. At the end, I look for any key/value pairs where the value != 0.
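A minimal sketch of that counting approach (file names are illustrative):
using System;
using System.Collections.Generic;
using System.IO;

// +1 for every row in A, -1 for every row in B; anything non-zero at the end
// is present in one file but not the other (or appears a different number of times).
var counts = new Dictionary<string, int>();

Action<string, int> tally = (path, delta) =>
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int current;
            counts.TryGetValue(line, out current);   // defaults to 0 when the row is new
            counts[line] = current + delta;
        }
    }
};

tally("fileA.txt", +1);
tally("fileB.txt", -1);

foreach (var pair in counts)
    if (pair.Value != 0)
        Console.WriteLine((pair.Value > 0 ? "only in A: " : "only in B: ") + pair.Key);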
My algorithm seems fast enough (10 seconds for two 100 MB files); however, it's a bit memory-intensive: 280 MB to compare two sets of 100 MB files. I would hope to get it down to 100 MB peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.
Any ideas?
Also, let me know if this is too open ended for SO.
I have done something similar to this, only in scripts on Unix using shell and Perl; however, the theory may carry over.
Step 1, sort both files so they are in order by the same criteria. I used the Unix sort command to do this (I required the unique flag, but you just need some sort of memory-efficient file sort). This is likely the tricky part to figure out on your own.
Step 2, open both files, and essentially scan them line by line (or record by record if binary format). If the line in the left file is equal to the one in the right file, then the lines match and move on (remember we already sorted the file, so the smallest record should be first).
If the left record is greater than the right record, then your right record is missing from the left file; add it to your list, read the next line from the right file, and simply do your check again. The same applies the other way: if your right record is greater, then your left record is missing; report it and keep going.
Scanning the records this way should be very memory efficient. It may not be as fast, but I was able to crunch several gigs of data, with multiple passes looking at different fields, within a couple of minutes.
The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time, it would reduce your memory footprint but increase your disk IO, which would probably result in a longer processing time.
One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.
Another option may be use some kind of external program to sort the rows -- then you can do a simple scan of the two files in-order looking for differences.
Back to your question though, 280mb sounds high for comparing a pair of 100mb files though -- you are only loading one into memory (the smaller one) and just scrolling through the other one, right? As you describe it, I don't think you'll need to have the full contents of both in memory at once.
Using this method you would have to keep the contents of one of the files in memory at all times, though. It would be more efficient, as far as memory goes, to simply read half of the file into memory, compare it line by line against the second file, then load the second half and do the same. This overlapping would ensure that no records are missed, and it would eliminate the need for an entire file to be held in memory at once.
