Is there a more efficient way to reconcile large data sets?

Is there a more efficient way to reconcile large data sets? - c#

I've been tasked with reconciling two big data sets (two big lists of transactions). Basically i extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts achieving this (click if interested).
The gist of it is to load both data sets into a big hash table, with the keys being the rows, and the values being +1 each time it appears in file A, and -1 each time it appears in file B. Then at the end, i look for any key/value pairs where the value != 0.
My algorithm seems fast enough (10 seconds for 2*100mb files), however its a bit memory-intensive: 280mb to compare two sets of 100mb files, i would hope to get it down to 100mb peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.
Any ideas?
Also, let me know if this is too open ended for SO.

I have done something similar to this only in scripts on unix using shell and perl, however the theory may cary over.
Step 1, sort both files so they are in order by the same criteria. I used the unix sort command to do this (i required the unique flag, but you just need some sort of memory efficient file sort). This is likely the tricky part to figure out on you're own.
Step 2, open both files, and essentially scan them line by line (or record by record if binary format). If the line in the left file is equal to the one in the right file, then the lines match and move on (remember we already sorted the file, so the smallest record should be first).
If left record is greater than right record, you're right record is missing, add it to you're list, and read the next line on the right file. And simply do you're check again. Same thing applies if you're right record is greater, than you left record is missing, report it and keep going.
The scanning the records should be very memory efficient. It may not be as fast, but for me I was able to crunch several gigs of data with multiple passes looking at different fields witihn a couple minutes.

The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time it would reduce your memory foot print but increase your disk IO which would probably result in a longer processing time.

One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.
Another option may be use some kind of external program to sort the rows -- then you can do a simple scan of the two files in-order looking for differences.
Back to your question though, 280mb sounds high for comparing a pair of 100mb files though -- you are only loading one into memory (the smaller one) and just scrolling through the other one, right? As you describe it, I don't think you'll need to have the full contents of both in memory at once.

Using this method you would have to have the contents of one of the files in memory at all times though. It would be more efficient, as far as memory goes, to simply take half of the file in. Compare it line by line against the second file. Then take the second half to memory and do the same. This overlapping would ensure that there are no records missed. And would eliminate the need for the entire file to be stored temporarily.

Related

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load the entire file into memory, Obviously I need to read in a stream. The method I've currently thought up as the best way to do it is to make several passes over both files. One pass for each letter of the alphabet, loading all the domains in the first pass that start with 'a' for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, with collecting all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad core xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!

Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.

The question has been already answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute , with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write access log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort facts, partition them them into key/values, and send off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get for a single domain the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are that it is in the new but not the old. Drops are that it is in the old but not the new. Registrar changes are that it is in both but the registrar changed.
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.

This is very similar to a Google interview question that goes something like "say you have a list on one-million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.

You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

Need a way to sort a 100 GB log file by date [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
So, for some strange reason I end up with a 100GB log file that is unsorted (actually it's partially sorted), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like so
data <date> data data more data
I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that merge-sort of some kind would be best here, but short of implementing these algorithms myself - I want to ask if there's some kind of a shortcut I could take.
Incidentally parsing the date string with DateTime.Parse() is very slow and takes up a lot of CPU time - The chugging-rate is measly 10 MB/sec. Is there a faster way than the following?
public static DateTime Parse(string data)
{
int year, month, day;
int.TryParse(data.Substring(0, 4), out year);
int.TryParse(data.Substring(5, 2), out month);
int.TryParse(data.Substring(8, 2), out day);
return new DateTime(year, month, day);
}
I wrote that to speed up DateTime.Parse() and it actually works well, but is still taking a bucket-load of cycles.
Note that for the current log-file I'm interested in hours, minutes and seconds also. I know that I can provide DateTime.Parse() with format, but that doesn't seem to speed it up all that much.
I'm looking for a nudge in the right direction, thanks in advance.
EDIT: Some people have suggested that I use string comparison in order to compare dates. That would work for the sorting phase, but I do need to parse dates for the algorithms. I still have no idea how to sort 100GB file on 4GB of free ram, without doing it manually.
EDIT 2 : Well, thanks to several suggestions that I use windows sort, I found out that there's a similar tool for Linux. Basically you call sort and it fixes everything for you. As we speak it's doing something, and I hope it'll finish soon. The command I'm using is
sort -k 2b 2008.log > 2008.sorted.log
-k specifies that I want to sort on the second row, which is an date-time string in the usual YYYY-MM-DD hh:mm:ss.msek format. I must admit that the man-pages are lacking explaining all the options, but I found a lot of examples by running info coreutils 'sort invocation'.
I'll report back with results and timings. This part of the log is about 27GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the sort -m option.
Edit 3 Well, checking iotop suggests that it's reading in small chunks of the data file and then furiously doing something in order to process them. This process seems to be quite slow. =(
sort isn't using any memory, and only a single core. When it does read data from the drive it's not processing anything. Am I doing something wrong?
Edit 4 Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with parameters of the function, but I'm three hours invested... I'll abort in in about 4 hours, and try to put it for overnight computation with smarter memory and space parameters...
Edit 5 Before I went home, I restarted the process with the following command:
sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log
It returned this, this morning:
sort: write failed: /media/My Passport/sortQAUKdT: File too large
Wraawr! I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB-drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me that the USB drive is a "wrong device"... no kidding. I'll try to give it another go later, for now let's put this project into the maybe failed pile.
Final Notice
This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!
Benchmarking
Using 2 workstation grade (at least 70mb/sec read/write IO) hard-disks on the same SATA controller, it took me 162 minutes to sort a 30GB log file. I will need to sort another 52 GB file tonight, I'll post how that goes.

Code like this is completely bound by how fast you can get the data off the disk. The file simply can never fit in the file system cache so you're always waiting on the disk to supply the data. You're doing fairly well at 10 MB/sec, optimizing the code is never going to have a discernible effect.
Get a faster disk. Defrag the one you've got as an intermediate step.

If a string sort will work for you, then just use the Windows SORT command. Sort the file and be done with it. It'll happily sort your 100GB file, and it's simple to use.
If you need to filter and convert the file, specifically the date field, then I would simply write a small conversion program that converts the data field in to a 0 filled integer (like # of seconds since 1970, or whatever you like), and rewrites the record. Then you can pipe (|) the output in to the sort command, then you have a final, sorted file thats more readily parsed by your utility program.
I think the mistake you're making is simply trying to do this all in one go. 100GB of data is a lot, and it takes some time to copy, but it doesn't take THAT long. Since you have to sort it, you already have to deal with a copy of the file at some point (i.e. you need as much free space on your machine to handle both copies at some time), even with an external sorting routine like merge sort.
Writing a simple reformatter and piping it in to sort will save you a couple trips through the file, and save space on disk, since you'll inevitably just need the two copies.
I would also tweak the formatter in to pulling only the fields I'm really interested in, and do all of the "heavy" parsing at that point so that what you end up with is essentially a formatted file that easily handled by your reporting routines. That way you'll save time later when potentially running your reports more than once.
Use a simple CSV or, even better, a fixed length file format for output if possible.
Make sure your date information, if you choose to use an integer, has all of the fields the same length. Otherwise the SORT utility won't sort them correctly (you end up with 1 10 2 3 instead of 1 2 3 10. You're better to have 01 02 03 10.).
Edit --
Let's approach it from a different tact.
The biggest question is "do you need all this data". This relates to the earlier suggestion about doing the heavy parsing first. Obviously, the more you can reduce the initial set the better. For example, simply removing 10% of the data is 10GB.
Something I like to think about as a rule of thumb, especially when dealing with a lot of data: "If you have 1 Million of something, then every millisecond saved, is 20 minutes off the bottom line."
Normally, we really don't think in terms of milliseconds for our work, it's more "seat of the pants", "that feels faster". But the 1ms == 20min/million is a good measure to get a grasp of how much data you're dealing with, and how long stuff should/could take.
For you case, 100GB of data. With a swag of 100 bytes per record, you're taking 1 Billion rows. 20,000 minutes per millisecond. -- 5 1/2 hours. gulp (It's a rule of thumb, if you do the math it doesn't quite work out to this.)
So, you can appreciate the desire to reduce the raw data if at all possible.
That was one reason I deferred to the Windows SORT command. It's a basic process, but one affected by nuance, and one that can use some optimization. The folks who wrote SORT had time and opportunity to make it "optimal", in many ways. Whether they did or did not, I can't say. But its a fair assumption that they would put more time and attention in to this process to make their SORT as good as practical, versus you who are under a tight deadline.
There are 3rd party sorting utilities for large data sets, that probably (ideally) work better for that case. But, those are unavailable to you (you can get them but I don't think you wanted to rush out and get some other utility right away). So, SORT is our best guess for now.
That said, reducing the data set will gain more than any sort utility.
How much detail do you really need? And how much information are you really tracking? For example, if it were, say, web statistics, you may have 1000 pages on your site. But even with hourly numbers for a year, 365 * 24 * 1000, that's only 8.7M "buckets" of information -- a far cry from 1B.
So, is there any preprocessing you can do that does not require sorting? Summarizing the information into a coarser granularity? You can do that without sorting, simply using memory based hash maps. Even if you don't have "enough memory" to process all 100GB of data in one throw, you probably have enough to do it in chunks (5 chunks, 10 chunks), and write out the intermediary results.
You may also have a lot better luck splitting the data as well. Into monthly, or weekly file chunks. Maybe that's not easily done because the data is "mostly" sorted. But, in that case, if it's by date, the offenders (i.e. the data that's out of sort) may well be clustered within the file, with the "out of order" stuff being just mixed up on the barriers of the time periods (like around day transitions, maybe you have rows like 11:58pm, 11:59pm, 00:00am, 00:01am, 11:58pm, 00:02pm). You might be able to leverage that heuristic as well.
The goal being that if you can somewhat deterministically determine the subset that's out of order, and break the file up in to chunks of "in order data" and "out of order data", your sorting task may be MUCH MUCH smaller. Sort the few rows that are out of order, and then you have a merge problem (much simpler than a sorting problem).
So, those are tactics you can take approaching the problem. Summarization is obviously the best one as anything that reduces this data load in any measurable, is likely worth the trouble. Of course it all boils down to what you really want from the data, clearly the reports will drive that. This is also a nice point about "pre-mature optimization". If they're not reporting on it, don't process it :).

Short answer - load the data into a relational database eg Sql Express, create an index, and use a cursor based solution eg DataReader to read each record off and write it to disk.

Why don't you try this relatively unkown tool from microsoft called logparser. It basically allows you to do an SQL query over a CSV file (or any other formatted textfile).
Saves you the trouble of pumping it into a database, doing your sort, and pumping it back out again

Just to answer your question about sorting a long file that doesn't fit into the memory - you'll need to use some external sorting algorithm such as Merge sort. The process is roughly following:
Partition the input into several parts that fit into memory and can be sorted using standard in-memory sorting algorithms (e.g. 100 MB or larger - you'll need to keep ~4 parts in memory at once). Sort all the parts and write them back to disk.
Read two parts from the disk (they are both sorted) and merge them, which can be done just by simultaneously iterating over the two inputs. Write the merged data set to another place in the disk. Note that you don't need to read the whole part into memory - just read it/write it in blocks as you go.
Repeat merging of parts until you have only a single part (which will be sorted file with all the data from your original input data set).
You mentioned that the data is partially sorted already, so it would be a good idea to pick some algorithm for in-memory sorting (in the first phase) that is efficient in this case. You can see some suggestions in this question (though I'm not sure if the answer will be the same for very large data sets - and it depends on how much partially sorted the input is).

The best way of optimising the parsing of the dates is to not parse them at all.
As the dates is in ISO 8601 format, you can just compare them as strings. There is no parsing needed at all.
Regarding the sorting, you should be able to effectively use the fact that it's partially sorted. One approach could be to read the file and write into separate files divided on time ranges, for example daily or hourly. If you make each file small enough you can then read them into memory and sort them, and then just merge all the files.
Another approach could be to read the file and write the records that are in order into one file, and the the other ones into another file. Sort the second file (possibly using this process recursively if it's large) and zip the two files together. I.e. a modified merge sort.

For sorting you could implement a file-based bucket sort:
Open input file
Read file line by line
Get date as string from line
Append line to file <date>.log
The result would be a separate log file for each day, or separate for each hour. Choose so that you get files of a size that you can easily sort.
The remaining task would be to sort the created files and possibly merge the file again.

I do need to parse dates for the algorithms.
On *NIX, I generally would have first converted dates into something simple, suitable for text comparison and made it first word on the string. It's too early for date/time object creation. My usual date presentation is YYYYMMDD-hhmmss.millis. Make it that all files would have same date format.
I still have no idea how to sort 100GB file on 4GB of free ram, without doing it manually.
As you have figured it out already, merge sort is the only option.
So to me the tasks falls into the following step:
dumb conversion to make dates sortable. Complexity: read/write sequentially 100GB.
split data in chunks of usable size, e.g. 1GB and sort every chunk using plain quick sort before writing it to disk. Complexity: read/write sequentially 100GB; memory for quick sort.
merge-sort the small files into one large. One can do it step-wise, using a program which takes two files and merges them into new one. Complexity: read/write sequentially 100GB log(N) times (where N is the number of files). HDD space requirement: 2*100GB (last merge of 2 x 50GB files into single 100GB file).
A program to automate the previous step: pick two (e.g. smallest) files, start program to sort-merge them into a new file, remove the two original files. Repeat until number of files is greater than 1.
(Optional) split the 100GB sorted file into smaller chunks of manageable size. After all you are going to do something with them. Number them sequentially or put first and last time stamps into the file name.
General concept: do not try to find a way to do it fast, piping 100GB would take time anyway; plan for the programs one every step to run over-night as a batch, without your attention.
On Linux that is all doable with shell/sort/awk/Perl, and I do not think that it is a problem to write it all in any other programming language. This is potentially 4 programs - but all of them are rather simple to code.

Assuming that your log file only has 1-2% of the rows out of order, you could make a single pass through the complete log, outputing two files: one file that is in order and another file containing the 1-2% of rows that are out of order. Then sort the out-of-order rows in memory and perform a single merge of the formerly out-of-order rows with the in-order rows. This will be much faster than a full mergesort which will do many more passes.
Assuming that your log file has no row more than N rows out of place, you could make a single pass through the log with a sorted queue N rows deep. Whenever you encounter a log row that is out of order, just insert it into the proper place in the queue. Since this only requires a single pass through the log, it's going to be as fast as you can get.

Actually I don`t have many ideas about the date conversion, but the things that I would try to use to do that is:
A database with a Index in the Date Column (to be easy to search in this data after).
To Insert in this base use Bulk Insert.
And some way to parallel the reading (In think parallel LINQ would be good and is very easy to use).
Lots of patience (the most important/hard thing)

Pre-emptive comment: My answer only addresses the sub-problem of parsing date time values.
DateTime.Parse contains checks for all possible date formats. If you have a fix format you can optimize parsing quite well. A simple optimization would be to convert the characters directly:
class DateParserYyyyMmDd
{
static void Main(string[] args)
{
string data = "2010-04-22";
DateTime date = Parse(data);
}
struct Date
{
public int year;
public int month;
public int day;
}
static Date MyDate;
static DateTime Parse2(string data)
{
MyDate.year = (data[0] - '0') * 1000 + (data[1] - '0') * 100
+ (data[2] - '0') * 10 + (data[3] - '0');
MyDate.month = (data[5] - '0') * 10 + (data[6] - '0');
MyDate.day = (data[8] - '0') * 10 + (data[9] - '0');
return new DateTime(MyDate.year, MyDate.month, MyDate.day);
}
}

Apart from whatever you are doing (probably, willw's suggestion is helpful), your parsing could be done over multiple threads provided you have multiple processors or processor cores.

Not really as a solution, but just out of interest, one way to do it like this:
First break the file down into 1GB files
Then reading 2 files at a time, load the contents into a list of string and sort it
Write it back down to the individual files.
The problem is that you would need to read/write 100 files on each pass and do 100 passes to make sure that the data is sorted.
If my maths is correct: That is 10 000 GB read and 10 000 GB write, at an average 10MB/sec that is 20 000 000 sec which is 231 days
One way that is might work is that you scan the file once and write to smaller files, one for each time period for example day or hour. Then sort these individual files.

You can try implement radix sort algorithm. Because radix scans the whole list sequentially and only few times, it can help here to prevent gigant number of scans and seeks of your 100 GB file.
Radix sort intend to classificate your records each iteration by one part. This part can be a digit, or a datetime part like year, month, day. in this case you dont even need to convert the string to DateTime, you can convert only the specific part to int.
Edit:
For sorting purposes, you can create temp binary file with only 2 columns: DateTime (DateTime.ToBinary() as Int64) and line address in the source file (as Int64).
Then you getting a much smaller file with fixed size records, only 16 bytes per record, then you can sort it much faster (IO operations will be faster at least).
Once finished sorting the temp file, you can create back the full sorted 100 GB log file.

Wow. First of all, that is a whole new level of documenting-obssession.
My actual edvice would be, try to consider how neccessary this file really is.
About sorting, I have no idea if this will work or not, but you might want to try to build an Enumerator that returns the data directly from the Hard Disk (not saving anything but few pointers maybe), and then trying to use LINQ's OrderBy, which returns IEnumerator as well, which you, hopefuly, can Enamurate and save directly back to the disk.
The only question is whether or not OrderBy saves anything in the RAM.

Boot up a Linux flavor from USB
And use the while command to read
The file. Utilize grep, filters and
Pipes to segregate the data.
This can all be done in
3 lines of a BASH script.
Grep will rip through the data in
No time. I've grepped through
7 million lines in 45 seconds

Dealing with huge SQL resultset

I am working with a rather large mysql database (several million rows) with a column storing blob images. The application attempts to grab a subset of the images and runs some processing algorithms on them. The problem I'm running into is that, due to the rather large dataset that I have, the dataset that my query is returning is too large to store in memory.
For the time being, I have changed the query to not return the images. While iterating over the resultset, I run another select which grabs the individual image that relates to the current record. This works, but the tens of thousands of extra queries have resulted in a performance decrease that is unacceptable.
My next idea is to limit the original query to 10,000 results or so, and then keep querying over spans of 10,000 rows. This seems like the middle of the road compromise between the two approaches. I feel that there is probably a better solution that I am not aware of. Is there another way to only have portions of a gigantic resultset in memory at a time?
Cheers,
Dave McClelland

One option is to use a DataReader. It streams the data, but it's at the expense of keeping an open connection to the database. If you're iterating over several million rows and performing processing for each one, that may not be desirable.
I think you're heading down the right path of grabbing the data in chunks, probably using MySql's Limit method, correct?

When dealing with such large datasets it is important not to need to have it all in memory at once. If you are writing the result out to disk or to a webpage, do that as you read in each row. Don't wait until you've read all rows before you start writing.
You also could have set the images to DelayLoad = true so that they are only fetched when you need them rather than implementing this functionality yourself. See here for more info.

I see 2 options.
1) if this is a windows app (as opposed to a web app) you can read each image using a data reader and dump the file to a temp folder on the disk, then you can do whatever processing you need to against the physical file.
2) Read and process the data in small chunks. 10k rows can still be a lot depending on how large the images are and how much process you want to do. Returning 5k worth of rows at a time and reading more in a separate thread when you are down to 1k remaining to process can make for a seamless process.
Also while not always recommended, forcing garbage collection before processing the next set of rows can help to free up memory.

I've used a solution like one outlined in this tutorial before:
http://www.asp.net/(S(pdfrohu0ajmwt445fanvj2r3))/learn/data-access/tutorial-25-cs.aspx
You could use multi-threading to pre-pull a portion of the next few datasets (at first pull 1-10,000 and in the background pull 10,001 - 20,000 and 20,001-30,000 rows; and delete the previous pages of the data (say if you are at 50,000 to 60,000 delete the first 1-10,000 rows to conserve memory if that is an issue). And use the user's location of the current "page" as a pointer to pull next range of data or delete some out-of-range data.

Reading huge amounts of small files in sequence

I have this problem: I have a collection of small files that are about 2000 bytes large each (they are all the exact same size) and there are about ~100.000 of em which equals about 200 megabytes of space. I need to be able to, in real time, select a range in these files. Say file 1000 to 1100 (100 files total), read them and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: No file is larger then 2000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all other files at even 2048 byte intervals with the 2 first bytes of each 2048 block being the actual byte size of the file contained in the next 2046 bytes (the files range between 1800 and 1950 bytes or so in size) and then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get file at position X i will just do X*2048, read the first two bytes and then read the bytes from (X*2048)+2 to the size contained in the first two bytes. This large 200mb file will be append only so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.

Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch fo 2k files

I think your idea is probably the best you can do with decent work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.

That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?

Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to 1100, you can use the built in (c#) code to get a collection of files meeting that criteria.

You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file 'index', you can save the position of all the small files in 'dbase'. This index file, as very small, can be cached completely in memory.
This scheme allows you to fast read the required files, and to add new ones at the end of your collection.

Your plan sounds workable. It seems like a filestream can peform the seeks and reads that you need. Are you running into specific problems with implementation, or are you looking for a better way to do it?
Whether there is a better way might depend upon how fast you can read the files vs how fast you can transmit them on the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead x number of files into a queue. Another thread would be reading from the queue and sending them on the network

I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.

You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
Afaik, there's no direct support for memory mapping, but here is some example how to wrap the WIN32 API calls for C#.
See also here for a related question on SO.

Interestingly, this problem reminds me of the question in this older SO question:
Is this an over-the-top question for Senior Java developer role?

Removing Duplicates in Large Text Files

I've been trying to calculate all the unique permutations for a very long word (antidisestablishmentarianism), and although I can calculate the permutations for the words, I am having problems with stopping the production of duplications.
Normally I would just run the List<T>.Contains() method on my string, but the list of permutations becomes so large I can't keep it in memory. I made that mistake earlier and managed to use up all 8GB of memory in my computer. In order to prevent that from happening again, I changed the code to append the calculated permutation to a file and release it from memory.
My main question is this: How can I prevent duplicate permutations from being added to my file without loading the whole thing in memory? Is it possible to selectively load, for example, the first few megabytes, scan that, and move on until the file is completed, or should I be looking in a different direction?
This is not homework, my math homework gave a hypothetical situation where a computer could calculate 30 permutations per second and made me figure out how long it would take to calculate all the permutations. That wasn't a problem, and I don't need help with that, I just wanted to know how long it would take a modern computer to perform the same task.

How about using an algorithm that generates all permutations without duplicates? That way you wouldn't have to check for them in the first place.
A Google search for "algorithm generate permutations" turns up dozens of references to get you started. e.g. Permutation Generation Methods

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.