Removing Duplicates in Large Text Files

Removing Duplicates in Large Text Files - c#

I've been trying to calculate all the unique permutations for a very long word (antidisestablishmentarianism), and although I can calculate the permutations for the words, I am having problems with stopping the production of duplications.
Normally I would just run the List<T>.Contains() method on my string, but the list of permutations becomes so large I can't keep it in memory. I made that mistake earlier and managed to use up all 8GB of memory in my computer. In order to prevent that from happening again, I changed the code to append the calculated permutation to a file and release it from memory.
My main question is this: How can I prevent duplicate permutations from being added to my file without loading the whole thing in memory? Is it possible to selectively load, for example, the first few megabytes, scan that, and move on until the file is completed, or should I be looking in a different direction?
This is not homework, my math homework gave a hypothetical situation where a computer could calculate 30 permutations per second and made me figure out how long it would take to calculate all the permutations. That wasn't a problem, and I don't need help with that, I just wanted to know how long it would take a modern computer to perform the same task.

How about using an algorithm that generates all permutations without duplicates? That way you wouldn't have to check for them in the first place.
A Google search for "algorithm generate permutations" turns up dozens of references to get you started. e.g. Permutation Generation Methods

Related

Efficiently random enumeration of files from huge directory

I want to be able to enumerate files with a specific search pattern (e.g., *.txt) recursively from a directory. But with couple of constraints:
The mechanism should be very efficient. The goal is to enumerate file one by one (using IEnumerable), so that if there is a huge list of files, then it shouldn't take forever to get one file for processing.
The enumeration should return files randomly, so that if two instances of my program are trying to enumerate the directory, both should not be seeing the files in the same sequence.
Given the requirements, DirectoryInfo.EnumerateFiles looks promising, except that it does not fulfill the second requirement. If I remove the performance consideration, the solution is straightforward (just get the entire collection and randomize the sequence before accessing).
Can someone suggest possible choices for C# implementation in .net 3.5/4.0 ?

What you are asking for is impossible.
A truly "random" enumeration (in the sense that the order likely changes each time) requires a "pick without replacement" strategy. Such a strategy necessarily requires two pools: one of "chosen" files, and one of "unchosen." The "unchosen" list has to be populated before anything from it can be "chosen" randomly. This breaks your #1 requirement.
Two thoughts on how to solve your problem:
What is the problem with two instances seeing the files in the same order? If it's a file locking issue, choose a read-only lock.
You might be able to get away with a "holding pile" approach. Here, you would create your own enumerator class that starts by reading a small number of FileInfo records into a "Hold" collection. Then, each time your calling code requests a file, it either feeds one directly from the EnumerateFiles, or it reads one from there but swaps it out with one in your "Hold" pile and returns that one instead. The decision would be random until the EnumerateFiles returns nothing, at which point you would empty out your Hold pile. That won't provide a truly random selection order, but maybe it will add enough fuzziness to the order to meet your needs. The max size of the "Hold" collection can be adjusted to taste to balance your need for "randomness" with the need to quickly get the first file.

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load the entire file into memory, Obviously I need to read in a stream. The method I've currently thought up as the best way to do it is to make several passes over both files. One pass for each letter of the alphabet, loading all the domains in the first pass that start with 'a' for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, with collecting all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad core xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!

Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.

The question has been already answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute , with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write access log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort facts, partition them them into key/values, and send off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get for a single domain the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are that it is in the new but not the old. Drops are that it is in the old but not the new. Registrar changes are that it is in both but the registrar changed.
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.

This is very similar to a Google interview question that goes something like "say you have a list on one-million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.

You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

Need a way to sort a 100 GB log file by date [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
So, for some strange reason I end up with a 100GB log file that is unsorted (actually it's partially sorted), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like so
data <date> data data more data
I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that merge-sort of some kind would be best here, but short of implementing these algorithms myself - I want to ask if there's some kind of a shortcut I could take.
Incidentally parsing the date string with DateTime.Parse() is very slow and takes up a lot of CPU time - The chugging-rate is measly 10 MB/sec. Is there a faster way than the following?
public static DateTime Parse(string data)
{
int year, month, day;
int.TryParse(data.Substring(0, 4), out year);
int.TryParse(data.Substring(5, 2), out month);
int.TryParse(data.Substring(8, 2), out day);
return new DateTime(year, month, day);
}
I wrote that to speed up DateTime.Parse() and it actually works well, but is still taking a bucket-load of cycles.
Note that for the current log-file I'm interested in hours, minutes and seconds also. I know that I can provide DateTime.Parse() with format, but that doesn't seem to speed it up all that much.
I'm looking for a nudge in the right direction, thanks in advance.
EDIT: Some people have suggested that I use string comparison in order to compare dates. That would work for the sorting phase, but I do need to parse dates for the algorithms. I still have no idea how to sort 100GB file on 4GB of free ram, without doing it manually.
EDIT 2 : Well, thanks to several suggestions that I use windows sort, I found out that there's a similar tool for Linux. Basically you call sort and it fixes everything for you. As we speak it's doing something, and I hope it'll finish soon. The command I'm using is
sort -k 2b 2008.log > 2008.sorted.log
-k specifies that I want to sort on the second row, which is an date-time string in the usual YYYY-MM-DD hh:mm:ss.msek format. I must admit that the man-pages are lacking explaining all the options, but I found a lot of examples by running info coreutils 'sort invocation'.
I'll report back with results and timings. This part of the log is about 27GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the sort -m option.
Edit 3 Well, checking iotop suggests that it's reading in small chunks of the data file and then furiously doing something in order to process them. This process seems to be quite slow. =(
sort isn't using any memory, and only a single core. When it does read data from the drive it's not processing anything. Am I doing something wrong?
Edit 4 Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with parameters of the function, but I'm three hours invested... I'll abort in in about 4 hours, and try to put it for overnight computation with smarter memory and space parameters...
Edit 5 Before I went home, I restarted the process with the following command:
sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log
It returned this, this morning:
sort: write failed: /media/My Passport/sortQAUKdT: File too large
Wraawr! I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB-drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me that the USB drive is a "wrong device"... no kidding. I'll try to give it another go later, for now let's put this project into the maybe failed pile.
Final Notice
This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!
Benchmarking
Using 2 workstation grade (at least 70mb/sec read/write IO) hard-disks on the same SATA controller, it took me 162 minutes to sort a 30GB log file. I will need to sort another 52 GB file tonight, I'll post how that goes.

Code like this is completely bound by how fast you can get the data off the disk. The file simply can never fit in the file system cache so you're always waiting on the disk to supply the data. You're doing fairly well at 10 MB/sec, optimizing the code is never going to have a discernible effect.
Get a faster disk. Defrag the one you've got as an intermediate step.

If a string sort will work for you, then just use the Windows SORT command. Sort the file and be done with it. It'll happily sort your 100GB file, and it's simple to use.
If you need to filter and convert the file, specifically the date field, then I would simply write a small conversion program that converts the data field in to a 0 filled integer (like # of seconds since 1970, or whatever you like), and rewrites the record. Then you can pipe (|) the output in to the sort command, then you have a final, sorted file thats more readily parsed by your utility program.
I think the mistake you're making is simply trying to do this all in one go. 100GB of data is a lot, and it takes some time to copy, but it doesn't take THAT long. Since you have to sort it, you already have to deal with a copy of the file at some point (i.e. you need as much free space on your machine to handle both copies at some time), even with an external sorting routine like merge sort.
Writing a simple reformatter and piping it in to sort will save you a couple trips through the file, and save space on disk, since you'll inevitably just need the two copies.
I would also tweak the formatter in to pulling only the fields I'm really interested in, and do all of the "heavy" parsing at that point so that what you end up with is essentially a formatted file that easily handled by your reporting routines. That way you'll save time later when potentially running your reports more than once.
Use a simple CSV or, even better, a fixed length file format for output if possible.
Make sure your date information, if you choose to use an integer, has all of the fields the same length. Otherwise the SORT utility won't sort them correctly (you end up with 1 10 2 3 instead of 1 2 3 10. You're better to have 01 02 03 10.).
Edit --
Let's approach it from a different tact.
The biggest question is "do you need all this data". This relates to the earlier suggestion about doing the heavy parsing first. Obviously, the more you can reduce the initial set the better. For example, simply removing 10% of the data is 10GB.
Something I like to think about as a rule of thumb, especially when dealing with a lot of data: "If you have 1 Million of something, then every millisecond saved, is 20 minutes off the bottom line."
Normally, we really don't think in terms of milliseconds for our work, it's more "seat of the pants", "that feels faster". But the 1ms == 20min/million is a good measure to get a grasp of how much data you're dealing with, and how long stuff should/could take.
For you case, 100GB of data. With a swag of 100 bytes per record, you're taking 1 Billion rows. 20,000 minutes per millisecond. -- 5 1/2 hours. gulp (It's a rule of thumb, if you do the math it doesn't quite work out to this.)
So, you can appreciate the desire to reduce the raw data if at all possible.
That was one reason I deferred to the Windows SORT command. It's a basic process, but one affected by nuance, and one that can use some optimization. The folks who wrote SORT had time and opportunity to make it "optimal", in many ways. Whether they did or did not, I can't say. But its a fair assumption that they would put more time and attention in to this process to make their SORT as good as practical, versus you who are under a tight deadline.
There are 3rd party sorting utilities for large data sets, that probably (ideally) work better for that case. But, those are unavailable to you (you can get them but I don't think you wanted to rush out and get some other utility right away). So, SORT is our best guess for now.
That said, reducing the data set will gain more than any sort utility.
How much detail do you really need? And how much information are you really tracking? For example, if it were, say, web statistics, you may have 1000 pages on your site. But even with hourly numbers for a year, 365 * 24 * 1000, that's only 8.7M "buckets" of information -- a far cry from 1B.
So, is there any preprocessing you can do that does not require sorting? Summarizing the information into a coarser granularity? You can do that without sorting, simply using memory based hash maps. Even if you don't have "enough memory" to process all 100GB of data in one throw, you probably have enough to do it in chunks (5 chunks, 10 chunks), and write out the intermediary results.
You may also have a lot better luck splitting the data as well. Into monthly, or weekly file chunks. Maybe that's not easily done because the data is "mostly" sorted. But, in that case, if it's by date, the offenders (i.e. the data that's out of sort) may well be clustered within the file, with the "out of order" stuff being just mixed up on the barriers of the time periods (like around day transitions, maybe you have rows like 11:58pm, 11:59pm, 00:00am, 00:01am, 11:58pm, 00:02pm). You might be able to leverage that heuristic as well.
The goal being that if you can somewhat deterministically determine the subset that's out of order, and break the file up in to chunks of "in order data" and "out of order data", your sorting task may be MUCH MUCH smaller. Sort the few rows that are out of order, and then you have a merge problem (much simpler than a sorting problem).
So, those are tactics you can take approaching the problem. Summarization is obviously the best one as anything that reduces this data load in any measurable, is likely worth the trouble. Of course it all boils down to what you really want from the data, clearly the reports will drive that. This is also a nice point about "pre-mature optimization". If they're not reporting on it, don't process it :).

Short answer - load the data into a relational database eg Sql Express, create an index, and use a cursor based solution eg DataReader to read each record off and write it to disk.

Why don't you try this relatively unkown tool from microsoft called logparser. It basically allows you to do an SQL query over a CSV file (or any other formatted textfile).
Saves you the trouble of pumping it into a database, doing your sort, and pumping it back out again

Just to answer your question about sorting a long file that doesn't fit into the memory - you'll need to use some external sorting algorithm such as Merge sort. The process is roughly following:
Partition the input into several parts that fit into memory and can be sorted using standard in-memory sorting algorithms (e.g. 100 MB or larger - you'll need to keep ~4 parts in memory at once). Sort all the parts and write them back to disk.
Read two parts from the disk (they are both sorted) and merge them, which can be done just by simultaneously iterating over the two inputs. Write the merged data set to another place in the disk. Note that you don't need to read the whole part into memory - just read it/write it in blocks as you go.
Repeat merging of parts until you have only a single part (which will be sorted file with all the data from your original input data set).
You mentioned that the data is partially sorted already, so it would be a good idea to pick some algorithm for in-memory sorting (in the first phase) that is efficient in this case. You can see some suggestions in this question (though I'm not sure if the answer will be the same for very large data sets - and it depends on how much partially sorted the input is).

The best way of optimising the parsing of the dates is to not parse them at all.
As the dates is in ISO 8601 format, you can just compare them as strings. There is no parsing needed at all.
Regarding the sorting, you should be able to effectively use the fact that it's partially sorted. One approach could be to read the file and write into separate files divided on time ranges, for example daily or hourly. If you make each file small enough you can then read them into memory and sort them, and then just merge all the files.
Another approach could be to read the file and write the records that are in order into one file, and the the other ones into another file. Sort the second file (possibly using this process recursively if it's large) and zip the two files together. I.e. a modified merge sort.

For sorting you could implement a file-based bucket sort:
Open input file
Read file line by line
Get date as string from line
Append line to file <date>.log
The result would be a separate log file for each day, or separate for each hour. Choose so that you get files of a size that you can easily sort.
The remaining task would be to sort the created files and possibly merge the file again.

I do need to parse dates for the algorithms.
On *NIX, I generally would have first converted dates into something simple, suitable for text comparison and made it first word on the string. It's too early for date/time object creation. My usual date presentation is YYYYMMDD-hhmmss.millis. Make it that all files would have same date format.
I still have no idea how to sort 100GB file on 4GB of free ram, without doing it manually.
As you have figured it out already, merge sort is the only option.
So to me the tasks falls into the following step:
dumb conversion to make dates sortable. Complexity: read/write sequentially 100GB.
split data in chunks of usable size, e.g. 1GB and sort every chunk using plain quick sort before writing it to disk. Complexity: read/write sequentially 100GB; memory for quick sort.
merge-sort the small files into one large. One can do it step-wise, using a program which takes two files and merges them into new one. Complexity: read/write sequentially 100GB log(N) times (where N is the number of files). HDD space requirement: 2*100GB (last merge of 2 x 50GB files into single 100GB file).
A program to automate the previous step: pick two (e.g. smallest) files, start program to sort-merge them into a new file, remove the two original files. Repeat until number of files is greater than 1.
(Optional) split the 100GB sorted file into smaller chunks of manageable size. After all you are going to do something with them. Number them sequentially or put first and last time stamps into the file name.
General concept: do not try to find a way to do it fast, piping 100GB would take time anyway; plan for the programs one every step to run over-night as a batch, without your attention.
On Linux that is all doable with shell/sort/awk/Perl, and I do not think that it is a problem to write it all in any other programming language. This is potentially 4 programs - but all of them are rather simple to code.

Assuming that your log file only has 1-2% of the rows out of order, you could make a single pass through the complete log, outputing two files: one file that is in order and another file containing the 1-2% of rows that are out of order. Then sort the out-of-order rows in memory and perform a single merge of the formerly out-of-order rows with the in-order rows. This will be much faster than a full mergesort which will do many more passes.
Assuming that your log file has no row more than N rows out of place, you could make a single pass through the log with a sorted queue N rows deep. Whenever you encounter a log row that is out of order, just insert it into the proper place in the queue. Since this only requires a single pass through the log, it's going to be as fast as you can get.

Actually I don`t have many ideas about the date conversion, but the things that I would try to use to do that is:
A database with a Index in the Date Column (to be easy to search in this data after).
To Insert in this base use Bulk Insert.
And some way to parallel the reading (In think parallel LINQ would be good and is very easy to use).
Lots of patience (the most important/hard thing)

Pre-emptive comment: My answer only addresses the sub-problem of parsing date time values.
DateTime.Parse contains checks for all possible date formats. If you have a fix format you can optimize parsing quite well. A simple optimization would be to convert the characters directly:
class DateParserYyyyMmDd
{
static void Main(string[] args)
{
string data = "2010-04-22";
DateTime date = Parse(data);
}
struct Date
{
public int year;
public int month;
public int day;
}
static Date MyDate;
static DateTime Parse2(string data)
{
MyDate.year = (data[0] - '0') * 1000 + (data[1] - '0') * 100
+ (data[2] - '0') * 10 + (data[3] - '0');
MyDate.month = (data[5] - '0') * 10 + (data[6] - '0');
MyDate.day = (data[8] - '0') * 10 + (data[9] - '0');
return new DateTime(MyDate.year, MyDate.month, MyDate.day);
}
}

Apart from whatever you are doing (probably, willw's suggestion is helpful), your parsing could be done over multiple threads provided you have multiple processors or processor cores.

Not really as a solution, but just out of interest, one way to do it like this:
First break the file down into 1GB files
Then reading 2 files at a time, load the contents into a list of string and sort it
Write it back down to the individual files.
The problem is that you would need to read/write 100 files on each pass and do 100 passes to make sure that the data is sorted.
If my maths is correct: That is 10 000 GB read and 10 000 GB write, at an average 10MB/sec that is 20 000 000 sec which is 231 days
One way that is might work is that you scan the file once and write to smaller files, one for each time period for example day or hour. Then sort these individual files.

You can try implement radix sort algorithm. Because radix scans the whole list sequentially and only few times, it can help here to prevent gigant number of scans and seeks of your 100 GB file.
Radix sort intend to classificate your records each iteration by one part. This part can be a digit, or a datetime part like year, month, day. in this case you dont even need to convert the string to DateTime, you can convert only the specific part to int.
Edit:
For sorting purposes, you can create temp binary file with only 2 columns: DateTime (DateTime.ToBinary() as Int64) and line address in the source file (as Int64).
Then you getting a much smaller file with fixed size records, only 16 bytes per record, then you can sort it much faster (IO operations will be faster at least).
Once finished sorting the temp file, you can create back the full sorted 100 GB log file.

Wow. First of all, that is a whole new level of documenting-obssession.
My actual edvice would be, try to consider how neccessary this file really is.
About sorting, I have no idea if this will work or not, but you might want to try to build an Enumerator that returns the data directly from the Hard Disk (not saving anything but few pointers maybe), and then trying to use LINQ's OrderBy, which returns IEnumerator as well, which you, hopefuly, can Enamurate and save directly back to the disk.
The only question is whether or not OrderBy saves anything in the RAM.

Boot up a Linux flavor from USB
And use the while command to read
The file. Utilize grep, filters and
Pipes to segregate the data.
This can all be done in
3 lines of a BASH script.
Grep will rip through the data in
No time. I've grepped through
7 million lines in 45 seconds

Memory Efficient Recursion

I have written an application in C# that generates all the words that can be existed in the combination of alphabets, numbers and few special characters.
The problem is that it isn't memory efficient as it is adapting Recursion and also some collection like List.
Is there any way I can make it to run in limited memory environment?
Umair

Convert it to an iterative function.

Unfortunately C# compiler does not perform tail call optimization, which is something that you want to happen in this case. CLR supports it, kinda, but you shouldn't rely on it.
Perhaps left of field, but maybe you can write the recursive part of your program in F#? This way you can leverage guaranteed tail call optimization and reuse bits of your C# code. Whilst a steep learning curve, F# is a more suitable language for these combinatorial tasks.

Well...I am not sure whom with I go amongst you but I got the solution. I am using more than one process one that is interacting with user and other for finding the words combination. The other process finds 5000 words, save them and quit. Communication is being achieved through WCF. This looks pretty fine as when process quits = frees memory.

Well, you obviously cannot store the intermediate results in memory (unless you've got some sort of absurd computer at your disposal); you will have to write the results to disk.
The recursion depth isn't a result of the number of considered characters - its determined by what the maximum string length you're willing to consider.
For instance, my install of python 2.6.2 has it's default recursion limit set to 1000. Arguable, I should be able to generate all possible 1-1000 length strings given a character set within this limitation (now, I think the recursion limit applies to total stack depth, so the actual limit may be less than 1000).
Edit (added python sample):
The following python snippet will produce what you're asking for (limiting itself to the given runtime stack limits):
from string import ascii_lowercase
def generate(base="", charset=ascii_lowercase):
for c in charset:
next = base + c
yield next
try:
for s in generate(next, charset):
yield s
except:
continue
for s in generate():
print s
One could produce essentially the same in C# by try/catching on StackOverflowException. As I'm typing this update, the script is running, chewing up one of my cores. However, memory usage is constant at less than 7MB. Now, I'm only print to stdout since I'm not interested in capturing the result, but I think it proves the point above. ;)
Addendum to the example:
Interesting note: Looking closer at running processes, python is actually I/O bound with the above example. It's only using 7% of my CPU, while the rest of the core is bound rending the results in my command window. Minimizing the window allows python to climb to 40% of total CPU usage, this is on a 2 core machine.

One more consideration: When you concatenate or use some other method to generate a string in C#, it occupies its own memory and may stick around for a while. If you are generating millions of strings, you are likely to notice some performance drag.
If you don't need to keep your many strings around, I would see if there's away to avoid generating the strings. For example, maybe you have a character array that you keep updating as you move through the character combinations, and if you're outputting them to a file, you would output them one character at a time so you don't have to build the string.

Is there a more efficient way to reconcile large data sets?

I've been tasked with reconciling two big data sets (two big lists of transactions). Basically i extract the relevant fields from the two data sources into two files of the same format, then compare the files to find any records that are in A but not in B, or vice versa, and report on them. I wrote a blog entry on my best efforts achieving this (click if interested).
The gist of it is to load both data sets into a big hash table, with the keys being the rows, and the values being +1 each time it appears in file A, and -1 each time it appears in file B. Then at the end, i look for any key/value pairs where the value != 0.
My algorithm seems fast enough (10 seconds for 2*100mb files), however its a bit memory-intensive: 280mb to compare two sets of 100mb files, i would hope to get it down to 100mb peak memory usage, and possibly lower if the two data sets are sorted in roughly the same order.
Any ideas?
Also, let me know if this is too open ended for SO.

I have done something similar to this only in scripts on unix using shell and perl, however the theory may cary over.
Step 1, sort both files so they are in order by the same criteria. I used the unix sort command to do this (i required the unique flag, but you just need some sort of memory efficient file sort). This is likely the tricky part to figure out on you're own.
Step 2, open both files, and essentially scan them line by line (or record by record if binary format). If the line in the left file is equal to the one in the right file, then the lines match and move on (remember we already sorted the file, so the smallest record should be first).
If left record is greater than right record, you're right record is missing, add it to you're list, and read the next line on the right file. And simply do you're check again. Same thing applies if you're right record is greater, than you left record is missing, report it and keep going.
The scanning the records should be very memory efficient. It may not be as fast, but for me I was able to crunch several gigs of data with multiple passes looking at different fields witihn a couple minutes.

The only way I can think of is to not load all of the data into memory at once. If you change the way you process it so that it grabs a bit of each file at a time it would reduce your memory foot print but increase your disk IO which would probably result in a longer processing time.

One option may be to change the in-memory format of your data. If your data is a series of numbers stored as text, storing them as integers in memory may lower your memory footprint.
Another option may be use some kind of external program to sort the rows -- then you can do a simple scan of the two files in-order looking for differences.
Back to your question though, 280mb sounds high for comparing a pair of 100mb files though -- you are only loading one into memory (the smaller one) and just scrolling through the other one, right? As you describe it, I don't think you'll need to have the full contents of both in memory at once.

Using this method you would have to have the contents of one of the files in memory at all times though. It would be more efficient, as far as memory goes, to simply take half of the file in. Compare it line by line against the second file. Then take the second half to memory and do the same. This overlapping would ensure that there are no records missed. And would eliminate the need for the entire file to be stored temporarily.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.