Efficient random enumeration of files from a huge directory - C#

I want to be able to enumerate files matching a specific search pattern (e.g., *.txt) recursively from a directory, but with a couple of constraints:
The mechanism should be very efficient. The goal is to enumerate files one by one (using IEnumerable), so that if there is a huge list of files, it shouldn't take forever to get one file for processing.
The enumeration should return files randomly, so that if two instances of my program are enumerating the directory, they should not see the files in the same sequence.
Given the requirements, DirectoryInfo.EnumerateFiles looks promising, except that it does not fulfill the second requirement. If I remove the performance consideration, the solution is straightforward (just get the entire collection and randomize the sequence before accessing).
Can someone suggest possible approaches for a C# implementation in .NET 3.5/4.0?

What you are asking for is impossible.
A truly "random" enumeration (in the sense that the order likely changes each time) requires a "pick without replacement" strategy. Such a strategy necessarily requires two pools: one of "chosen" files, and one of "unchosen." The "unchosen" list has to be populated before anything from it can be "chosen" randomly. This breaks your #1 requirement.
Two thoughts on how to solve your problem:
What is the problem with two instances seeing the files in the same order? If it's a file locking issue, choose a read-only lock.
You might be able to get away with a "holding pile" approach. Here, you would create your own enumerator class that starts by reading a small number of FileInfo records into a "Hold" collection. Then, each time your calling code requests a file, it either yields the next file from EnumerateFiles directly, or it takes that next file, swaps it with a randomly chosen entry in the "Hold" pile, and returns the held file instead. The decision is made at random until EnumerateFiles is exhausted, at which point you empty out the Hold pile. That won't provide a truly random selection order, but maybe it will add enough fuzziness to the order to meet your needs. The maximum size of the "Hold" collection can be adjusted to taste to balance your need for "randomness" against the need to get the first file quickly.
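A minimal sketch of that idea, assuming .NET 4.0's DirectoryInfo.EnumerateFiles; the class name, hold-pile size, and time-seeded Random are my own choices, not anything from the question:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FuzzyFileEnumerator
{
    // Lazily enumerates files, but returns them in a "fuzzed" order by keeping a small
    // hold pile and randomly swapping incoming files with held ones.
    public static IEnumerable<FileInfo> EnumerateFuzzy(DirectoryInfo dir, string pattern, int holdSize)
    {
        var rng = new Random();   // time-seeded, so instances started at different moments see different orders
        var hold = new List<FileInfo>(holdSize);

        foreach (var file in dir.EnumerateFiles(pattern, SearchOption.AllDirectories))
        {
            if (hold.Count < holdSize)
            {
                hold.Add(file);           // fill the hold pile before yielding anything
                continue;
            }
            int i = rng.Next(holdSize);
            var released = hold[i];       // release a randomly chosen held file...
            hold[i] = file;               // ...and hold the incoming one in its place
            yield return released;
        }

        // Source exhausted: flush the hold pile in random order.
        foreach (var file in hold.OrderBy(_ => rng.Next()))
            yield return file;
    }
}

Note that nothing is returned until holdSize files have been read, and the result is only a locally shuffled order within a sliding window rather than a uniform shuffle; that is exactly the trade-off described above.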

Related

Load Full Collection in RavenDB

At the startup of an application that uses RavenDB, I need to load the full collection of documents of a specific type and loop through them. The number of documents should always be small (< 1000).
I can do this via:
session.Query<MyType>();
but I want to ensure that the results I am getting are immediately consistent.
It seems that this case falls between a Load() and a Query(), since (I think) the query is intended to be eventually consistent, and the load is intended to be immediately consistent?
But in this case, there should be no index involved (no filtering or sorting), so will using Query() be immediately consistent?
No, Query() is not immediately consistent. session.SaveChanges() writes only to the document store; the indexes are always updated asynchronously afterwards, although for the most part very, very quickly!
This is generally a poor modeling design and a code smell with a document database. Since you mention that this happens at application startup and involves a relatively small amount of data, it sounds like reference information that changes infrequently. Could you enclose all of it in a single document that contains a List<MyType> instead?
Failing that, you can try the LoadStartingWith() command, which is still fully ACID, and give it the prefix that fits all the necessary documents, although up to 1000 is still a lot and I don't know if that method will return ALL results or cut it off at some point.
If you must use Query(), you will have to use one of the variations of .WaitForNonStaleResults() (the other variants take into consideration what you are waiting for rather than requiring all indexing to be complete) to get consistent results, but that should be fairly safe if the documents change infrequently. That said, I really dislike using this method in nearly all its forms and prefer any of the approaches above.
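To make the single-document suggestion concrete, here is a rough sketch. The document id, class names, and the shape of MyType are hypothetical, and the only session calls used are the basic Load/Store/SaveChanges trio, shown against the pre-4.0 Raven.Client.IDocumentSession interface:

using System.Collections.Generic;
using Raven.Client;   // pre-4.0 RavenDB client namespace (assumption)

public class MyType
{
    public string Name { get; set; }   // hypothetical payload
}

// One well-known document that holds the entire (small) collection.
public class MyTypeContainer
{
    public string Id { get; set; }              // e.g. "config/mytypes" (hypothetical well-known id)
    public List<MyType> Items { get; set; }
}

public static class MyTypeStartup
{
    // Load() by id goes straight to the document store (no index), so it is immediately consistent.
    public static List<MyType> LoadAll(IDocumentSession session)
    {
        var container = session.Load<MyTypeContainer>("config/mytypes");
        return container != null ? container.Items : new List<MyType>();
    }

    // Writes go back into the same single document.
    public static void SaveAll(IDocumentSession session, List<MyType> items)
    {
        session.Store(new MyTypeContainer { Id = "config/mytypes", Items = items });
        session.SaveChanges();
    }
}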

Sorting algorithm for large set of Server and Path

Working in C#, I would like to write an efficient sorting algorithm that takes as input a text file containing an unsorted list of server and path combinations and outputs a sorted file.
As an exercise, I am working under the assumption that the input data size will exceed available memory, so I am thinking of reading the file into memory a chunk at a time, doing a Quick sort (or a Heap sort, maybe?), outputting sorted chunks to temporary files, then doing a merge sort to produce the final output.
The format of the input file is up to my discretion. It can be just a list of UNC paths (server and path as single string) or it can be a CSV with servers and paths as separate fields.
My question is whether there is any benefit to be had from having server and path be separate entities in my data structure and evaluating them separately?
Having server and path separate would eliminate having to compare the server names during the path comparison run, but would require an additional run to sort by server and, given the available memory constraint, would require me to somehow cache the sorted server lists, increasing disk I/O overhead.
Is there some technique I can leverage to optimize performance of such an application by providing server and path as separate fields in my input?
Any other optimization techniques that I might consider given the nature of the dataset?
EDIT: This is a one-time task. I do not need to look up the entries again later.
I am thinking of reading the file into memory a chunk at a time, doing a Quick sort (or a Heap sort, maybe?), outputting sorted chunks to temporary files, then doing a merge sort to produce the final output.
That's a perfectly reasonable plan.
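For concreteness, a compact sketch of that plan (sorted in-memory chunks spilled to temp files, then a k-way merge); the chunk size, ordinal comparison, and temp-file handling are arbitrary choices of mine:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSort
{
    // Sorts a huge text file by spilling sorted chunks to temp files, then k-way merging them.
    public static void Sort(string inputPath, string outputPath, int linesPerChunk)
    {
        var chunkFiles = new List<string>();

        // Pass 1: read a chunk, sort it in memory, spill it to a temp file.
        foreach (var chunk in ReadChunks(inputPath, linesPerChunk))
        {
            chunk.Sort(StringComparer.Ordinal);
            string tmp = Path.GetTempFileName();
            File.WriteAllLines(tmp, chunk);
            chunkFiles.Add(tmp);
        }

        // Pass 2: k-way merge of the sorted chunks into the output file.
        var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
        try
        {
            var heads = readers.Select(r => r.ReadLine()).ToList();
            using (var writer = new StreamWriter(outputPath))
            {
                while (true)
                {
                    int min = -1;
                    for (int i = 0; i < heads.Count; i++)
                        if (heads[i] != null && (min < 0 || StringComparer.Ordinal.Compare(heads[i], heads[min]) < 0))
                            min = i;
                    if (min < 0) break;                    // every chunk is exhausted
                    writer.WriteLine(heads[min]);
                    heads[min] = readers[min].ReadLine();  // advance the chunk we took from
                }
            }
        }
        finally
        {
            foreach (var r in readers) r.Dispose();
            foreach (var f in chunkFiles) File.Delete(f);
        }
    }

    static IEnumerable<List<string>> ReadChunks(string path, int linesPerChunk)
    {
        var chunk = new List<string>(linesPerChunk);
        foreach (var line in File.ReadLines(path))
        {
            chunk.Add(line);
            if (chunk.Count == linesPerChunk)
            {
                yield return chunk;
                chunk = new List<string>(linesPerChunk);
            }
        }
        if (chunk.Count > 0) yield return chunk;
    }
}

The linear scan over the chunk heads is fine for a handful of chunks; with many chunks a priority queue would be the more usual choice for the merge step.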
An alternate solution would be: create an on-disk b-tree, and insert all your data one record at a time into the b-tree. You never need to have more than a few pages of the b-tree in memory and you can read the records one at a time from the unsorted list. Once it's in the b-tree, read it back out in order.
Having server and path separate would eliminate having to compare the server names during the path comparison run, but would require an additional run to sort by server and, given the available memory constraint, would require me to somehow cache the sorted server lists, increasing disk I/O overhead.
OK.
My question is whether there is any benefit to be had from having server and path be separate entities in my data structure and evaluating them separately?
You just said what the pros and cons are. Why are you asking this question if you already know the trade-offs?
Is there some technique I can leverage to optimize performance of such an application by providing server and path as separate fields in my input?
Probably, yes.
How can I know for sure?
Write the code both ways and run it. The one that is better will be observed to be better.
Any other optimization techniques that I might consider given the nature of the dataset?
Your question and speculations are premature.
Start by setting a performance goal.
Then implement the code as clearly and correctly as you can.
Then carefully measure to see if you met your goal.
If you did, knock off early and go to the beach.
If you did not, get a profiler and use it to analyze your program to find the worst-performing part. Then optimize that part.
Keep doing that until either you meet your goal, or you give up.
I'm certainly not going to out-answer Eric Lippert, but from a novice's perspective I wonder whether you're reaching for the most complex answer first.
With File.ReadLines you don't need to read the file into memory all at once; your input arrives one line at a time. Use the Uri class for quick parsing of each string into its component parts: host and path.
If you are thinking of an OO approach, then how about a 'serverUri' object that implements IComparable and holds a SortedList of path strings? Make a SortedList of the serverUri objects, so that part of the string is stored only once, and for each path with that server URI, add it to the sub-collection. Voila... it's all sorted... spit it out to disk.
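A rough sketch of that idea, substituting SortedDictionary/SortedSet for the IComparable-plus-SortedList pairing described above (the effect is the same: each host string is stored once, with its paths kept sorted under it). Note that, as written, this holds the whole data set in memory, so it only works if the data fits:

using System;
using System.Collections.Generic;
using System.IO;

static class UncSorter
{
    public static void SortFile(string inputPath, string outputPath)
    {
        // host -> sorted set of paths; the SortedDictionary keeps the hosts themselves in order.
        var servers = new SortedDictionary<string, SortedSet<string>>(StringComparer.OrdinalIgnoreCase);

        foreach (var line in File.ReadLines(inputPath))   // one line at a time
        {
            var uri = new Uri(line.Trim());               // e.g. \\server\share\dir\file.txt
            SortedSet<string> paths;
            if (!servers.TryGetValue(uri.Host, out paths))
            {
                paths = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);
                servers.Add(uri.Host, paths);             // the host string is stored only once
            }
            paths.Add(uri.AbsolutePath);                  // "/share/dir/file.txt" (forward slashes, escaped)
        }

        using (var writer = new StreamWriter(outputPath))
            foreach (var server in servers)
                foreach (var path in server.Value)
                    writer.WriteLine(@"\\" + server.Key + Uri.UnescapeDataString(path).Replace('/', '\\'));
    }
}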

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load into memory at once, I obviously need to read it as a stream. The method I've currently thought up as the best way to do it is to make several passes over both files, one pass for each letter of the alphabet: loading all the domains that start with 'a' in the first pass, for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds while collecting all the domains for the current pass's letter. So I figure that in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad-core Xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code (the unwritten kind) and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all, to take the two sorted input files and just run through them in lock-step, per zone. When a zone is found only in the TODAY file, that's an addition. When a zone is found only in the YESTERDAY file (missing from today's), that's a removal. When the same zone is found in both files, compare the NS records: that's either no change, or a change in nameservers.
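A sketch of that lock-step walk in C#. It assumes both inputs were lower-cased and sorted byte-wise beforehand (e.g. the tr | sort preparation above run with LC_ALL=C), so that string.CompareOrdinal agrees with the file order and all NS lines for a domain are adjacent; output handling is simplified to Console.WriteLine:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ZoneDiff
{
    // Reads a sorted zone file and yields (domain, list of NS records) one domain at a time.
    static IEnumerable<KeyValuePair<string, List<string>>> ReadZones(string path)
    {
        string currentDomain = null;
        var ns = new List<string>();
        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 3) continue;                  // skip malformed lines
            string domain = parts[0], server = parts[2];
            if (currentDomain != null && domain != currentDomain)
            {
                yield return new KeyValuePair<string, List<string>>(currentDomain, ns);
                ns = new List<string>();
            }
            currentDomain = domain;
            ns.Add(server);
        }
        if (currentDomain != null)
            yield return new KeyValuePair<string, List<string>>(currentDomain, ns);
    }

    public static void Compare(string yesterdayFile, string todayFile)
    {
        using (var oldE = ReadZones(yesterdayFile).GetEnumerator())
        using (var newE = ReadZones(todayFile).GetEnumerator())
        {
            bool hasOld = oldE.MoveNext(), hasNew = newE.MoveNext();
            while (hasOld || hasNew)
            {
                int cmp = !hasOld ? 1 : !hasNew ? -1
                        : string.CompareOrdinal(oldE.Current.Key, newE.Current.Key);

                if (cmp < 0)      { Console.WriteLine("REMOVED " + oldE.Current.Key); hasOld = oldE.MoveNext(); }
                else if (cmp > 0) { Console.WriteLine("ADDED " + newE.Current.Key);   hasNew = newE.MoveNext(); }
                else
                {
                    if (!oldE.Current.Value.OrderBy(s => s).SequenceEqual(newE.Current.Value.OrderBy(s => s)))
                        Console.WriteLine("CHANGED " + newE.Current.Key);
                    hasOld = oldE.MoveNext();
                    hasNew = newE.MoveNext();
                }
            }
        }
    }
}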
The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work across machines, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets you stream data at, let's be pessimistic, 20 MB/second. At that rate you can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort the data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory, with 100 million rows you'll need 27 passes with a mergesort (since 2^27 is about 134 million). But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing the data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL (write-ahead log - yeah, I know that saying "WAL log" would be redundant) and updates data in memory. When it gets around to it, it writes the in-memory changes to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched, so maintaining indexes is still an inherently expensive operation. That is why the standard advice for large data loads in databases is to drop all indexes, load the table, then recreate the indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take a large body of data and emit a stream of key/value facts.
Sort the facts, partition them into groups by key, and send each group off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, nameserver). The reducer would then receive, for a single domain, the records from the 1 or 2 files it appears in, with full details, and emit only the facts of interest: an addition means the domain is in the new file but not the old, a removal means it is in the old but not the new, and a nameserver change means it is in both but the nameservers changed.
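To make the mapper/reducer shapes concrete, here is a toy single-machine simulation that uses LINQ's GroupBy as the sort/shuffle stage. It buffers everything in memory, so it only illustrates the contract of the two functions, not the disk- and cluster-friendly behaviour a real MapReduce framework provides; the file tags and record formats are my own:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ZoneMapReduce
{
    // Map: one zone-file line -> (domain, "fileTag|nameserver") key/value pair.
    static IEnumerable<KeyValuePair<string, string>> Map(string file, string tag)
    {
        foreach (var line in File.ReadLines(file))
        {
            var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length >= 3)
                yield return new KeyValuePair<string, string>(
                    parts[0].ToLowerInvariant(), tag + "|" + parts[2].ToLowerInvariant());
        }
    }

    // Reduce: all values for one domain -> at most one "fact of interest".
    static string Reduce(string domain, IEnumerable<string> values)
    {
        var oldNs = values.Where(v => v.StartsWith("old|")).Select(v => v.Substring(4)).OrderBy(s => s);
        var newNs = values.Where(v => v.StartsWith("new|")).Select(v => v.Substring(4)).OrderBy(s => s);
        if (!oldNs.Any()) return "ADDED " + domain;
        if (!newNs.Any()) return "REMOVED " + domain;
        return oldNs.SequenceEqual(newNs) ? null : "CHANGED " + domain;
    }

    public static void Run(string yesterdayFile, string todayFile)
    {
        var facts = Map(yesterdayFile, "old")
            .Concat(Map(todayFile, "new"))
            .GroupBy(kv => kv.Key, kv => kv.Value)   // stands in for the framework's sort/shuffle
            .Select(g => Reduce(g.Key, g))
            .Where(fact => fact != null);

        foreach (var fact in facts)
            Console.WriteLine(fact);
    }
}

In a real job the GroupBy would be replaced by the framework's distributed sort, and Map would be fed (filename, block_start) splits rather than whole files.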
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like: "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM; how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

Is there a more efficient way to reconcile large data sets?

I've been tasked with reconciling two big data sets (two big lists of transactions). Basically I extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts at achieving this (click if interested).
The gist of it is to load both data sets into a big hash table, with the keys being the rows, and the values being a counter that is incremented each time a row appears in file A and decremented each time it appears in file B. Then at the end, I look for any key/value pairs where the value != 0.
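In code, the approach described above is roughly the following sketch (file handling simplified; the real version would extract the relevant fields per row rather than use the raw line):

using System;
using System.Collections.Generic;
using System.IO;

static class Reconciler
{
    // Returns every row whose occurrence counts differ between the two files.
    public static Dictionary<string, int> FindDifferences(string fileA, string fileB)
    {
        var counts = new Dictionary<string, int>();

        foreach (var row in File.ReadLines(fileA))
        {
            int n;
            counts.TryGetValue(row, out n);
            counts[row] = n + 1;                    // +1 for every appearance in A
        }
        foreach (var row in File.ReadLines(fileB))
        {
            int n;
            counts.TryGetValue(row, out n);
            counts[row] = n - 1;                    // -1 for every appearance in B
        }

        var differences = new Dictionary<string, int>();
        foreach (var pair in counts)
            if (pair.Value != 0)                    // net count != 0 -> in A but not B, or vice versa
                differences[pair.Key] = pair.Value;
        return differences;
    }
}

Most of the peak memory here is the row strings kept as dictionary keys, which is the part the question is trying to shrink.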
My algorithm seems fast enough (10 seconds for two 100 MB files); however, it's a bit memory-intensive: 280 MB to compare two 100 MB files. I would hope to get it down to 100 MB peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.
Any ideas?
Also, let me know if this is too open ended for SO.
I have done something similar to this, though only in scripts on Unix using shell and Perl; however, the theory may carry over.
Step 1: sort both files so they are in order by the same criteria. I used the Unix sort command to do this (I required the unique flag, but you just need some sort of memory-efficient file sort). This is likely the tricky part to figure out on your own.
Step 2: open both files and essentially scan them line by line (or record by record if it's a binary format). If the line in the left file is equal to the one in the right file, the lines match, so move on (remember we already sorted the files, so the smallest record should come first).
If the left record is greater than the right record, your right record is missing from the left file: add it to your list, read the next line from the right file, and simply do your check again. The same thing applies the other way around: if the right record is greater, your left record is missing, so report it and keep going.
Scanning the records this way should be very memory-efficient. It may not be as fast, but I was able to crunch several gigs of data, with multiple passes looking at different fields, within a couple of minutes.
The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time, it would reduce your memory footprint but increase your disk IO, which would probably result in a longer processing time.
One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.
Another option may be to use some kind of external program to sort the rows -- then you can do a simple scan of the two files in order, looking for differences.
Back to your question, though: 280 MB sounds high for comparing a pair of 100 MB files -- are you only loading one into memory (the smaller one) and just scrolling through the other one? As you describe it, I don't think you need the full contents of both in memory at once.
Using this method you would still have to keep the contents of one of the files in memory at all times, though. It would be more efficient, as far as memory goes, to read in only half of that file, compare it line by line against the second file, then bring the second half into memory and do the same. Going through the second file once for each half ensures that no records are missed, and it eliminates the need to hold an entire file in memory at once.

Removing Duplicates in Large Text Files

I've been trying to calculate all the unique permutations of a very long word (antidisestablishmentarianism), and although I can calculate the permutations of the word, I am having problems preventing the production of duplicates.
Normally I would just run the List<T>.Contains() method on my string, but the list of permutations becomes so large I can't keep it in memory. I made that mistake earlier and managed to use up all 8GB of memory in my computer. In order to prevent that from happening again, I changed the code to append the calculated permutation to a file and release it from memory.
My main question is this: How can I prevent duplicate permutations from being added to my file without loading the whole thing in memory? Is it possible to selectively load, for example, the first few megabytes, scan that, and move on until the file is completed, or should I be looking in a different direction?
This is not homework; my math homework gave a hypothetical situation where a computer could calculate 30 permutations per second and asked me to figure out how long it would take to calculate all the permutations. That wasn't a problem, and I don't need help with it; I just wanted to know how long it would take a modern computer to perform the same task.
How about using an algorithm that generates all permutations without duplicates? That way you wouldn't have to check for them in the first place.
A Google search for "algorithm generate permutations" turns up dozens of references to get you started. e.g. Permutation Generation Methods
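One classic example is the lexicographic "next permutation" algorithm: start from the word's characters in sorted order and repeatedly step to the next arrangement in lexicographic order. On a multiset (a word with repeated letters) this visits each distinct permutation exactly once, so there is nothing to de-duplicate and nothing to keep in memory beyond the current arrangement. A sketch:

using System;
using System.IO;

static class DistinctPermutations
{
    // Writes every distinct permutation of `word` to `outputPath`,
    // generating them in lexicographic order, one at a time.
    public static void WriteAll(string word, string outputPath)
    {
        var chars = word.ToCharArray();
        Array.Sort(chars);                          // start from the smallest arrangement

        using (var writer = new StreamWriter(outputPath))
        {
            do
            {
                writer.WriteLine(new string(chars));
            } while (NextPermutation(chars));
        }
    }

    // Standard "next permutation" step; returns false once the last arrangement is reached.
    static bool NextPermutation(char[] a)
    {
        int i = a.Length - 2;
        while (i >= 0 && a[i] >= a[i + 1]) i--;     // find the rightmost ascent
        if (i < 0) return false;

        int j = a.Length - 1;
        while (a[j] <= a[i]) j--;                   // rightmost element greater than a[i]
        Swap(a, i, j);
        Array.Reverse(a, i + 1, a.Length - i - 1);  // restore smallest order in the suffix
        return true;
    }

    static void Swap(char[] a, int i, int j)
    {
        char t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

Be aware that for a 28-letter word the number of distinct permutations is still astronomically large, so actually writing them all out is impractical; the value of this approach is simply that it never has to remember what it has already produced.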
