We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening a text viewer (Notepad++ etc) and looking for specific strings and data depending on the problem.
I am building an application which will help the analysis. It will enable a user to read files, search, highlight specific strings, and perform other operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (or RichTextBox) doesn't handle display of large text very well. I managed to implement a viewer using DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas, or can highlight problems with my approach, then thank you.
I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily etc.
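For example, a minimal sketch of that kind of LINQ search (the file name and the search term are placeholders):

using System;
using System.IO;
using System.Linq;

class LogSearch
{
    static void Main()
    {
        string[] lines = File.ReadAllLines("file.txt");

        // Pair each line with its 1-based line number, then filter with LINQ.
        var matches = lines
            .Select((text, index) => new { Line = index + 1, Text = text })
            .Where(x => x.Text.Contains("ERROR"));      // placeholder search term

        foreach (var m in matches)
            Console.WriteLine(m.Line + ": " + m.Text);
    }
}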
Here is an approach that scales well on modern CPUs with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
IEnumerable<String> GetLines(String fileName) {
    using (var streamReader = File.OpenText(fileName))
        while (!streamReader.EndOfStream)
            yield return streamReader.ReadLine();
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
    .AsParallel()
    .AsOrdered()
    .Where(line => ...)
    .ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application and you cannot perform the search on the main thread. You will have to start a background task for this. If you want this task to be cancellable you need to check a cancellation token in the while loop in the GetLines method.
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
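As a rough, hedged sketch of the cancellable variant described above: the predicate and the addToUi callback are placeholders, and it assumes Task.Run (.NET 4.5) and a SynchronizationContext captured on the UI thread.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class SearchSketch
{
    IEnumerable<string> GetLines(string fileName, CancellationToken token)
    {
        using (var reader = File.OpenText(fileName))
        {
            while (!reader.EndOfStream)
            {
                token.ThrowIfCancellationRequested();   // cooperative cancellation
                yield return reader.ReadLine();
            }
        }
    }

    // Call this from the UI thread; isMatch and addToUi are supplied by the caller.
    void StartSearch(string fileName, Func<string, bool> isMatch,
                     Action<string> addToUi, CancellationToken token)
    {
        var uiContext = SynchronizationContext.Current;  // capture the UI thread context

        Task.Run(() =>
            GetLines(fileName, token)
                .AsParallel()
                .AsOrdered()
                .WithCancellation(token)
                .Where(isMatch)
                .ForAll(line => uiContext.Post(_ => addToUi(line), null)),  // marshal back to UI
            token);
    }
}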
This solution assumes that you can extract the lines you need by doing a single forward pass of the file. If you need to do multiple passes, perhaps based on user input, you may need to cache all lines from the file in memory instead. Caching 10 MB is not much, but let's say you decide to search multiple files. Caching 1 GB can strain even a powerful computer, but using less memory and more CPU, as I suggest, will allow you to search very big files within a reasonable time on a modern desktop PC.
I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday, or do you have the luxury of time to dig into the problem and learn something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the stream's BaseStream.CanSeek value and, if that is true, access the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air-compressor is broken and they don't know how to use a hammer. Wax-on, wax-off, teach a man to fish, etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
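If it helps, here is a rough sketch of that line-offset index idea. It is only a sketch: it assumes single-byte-per-character text, and a real implementation would buffer the reads and handle encodings and BOMs properly.

using System;
using System.Collections.Generic;
using System.IO;

class LineIndexedFile
{
    readonly string path;
    readonly List<long> lineOffsets = new List<long>();

    public LineIndexedFile(string path)
    {
        this.path = path;
        lineOffsets.Add(0);                        // line 0 starts at offset 0
        using (var stream = File.OpenRead(path))
        {
            long position = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)  // buffered reads would be faster
            {
                position++;
                if (b == '\n')                     // the next line starts after the newline
                    lineOffsets.Add(position);
            }
        }
    }

    // Read a small window of lines without holding the whole file in memory.
    public IEnumerable<string> ReadWindow(int firstLine, int count)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(lineOffsets[firstLine], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream))
            {
                for (int i = 0; i < count; i++)
                {
                    string line = reader.ReadLine();
                    if (line == null) yield break;
                    yield return line;
                }
            }
        }
    }
}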
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]
I would suggest using MemoryMappedFile in .NET 4 (or via DllImport in earlier versions) to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
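For .NET 4, a minimal sketch of reading just one slice through a memory-mapped view might look like this (the offset/length handling and the UTF-8 assumption are illustrative):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

class MappedSlice
{
    // Read `length` bytes starting at `offset` without loading the whole file.
    static string ReadSlice(string path, long offset, int length)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read))
        {
            var bytes = new byte[length];
            view.ReadArray(0, bytes, 0, length);
            return Encoding.UTF8.GetString(bytes);   // assumes UTF-8 text
        }
    }
}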
I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load the entire file into memory, obviously I need to read it as a stream. The method I've currently thought up as the best way to do it is to make several passes over both files: one pass for each letter of the alphabet, loading, for example, all the domains that start with 'a' in the first pass.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds to collect all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad-core Xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code (the unwritten kind) and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
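A sketch of that lock-step walk over the two prepared (sorted) files follows. It assumes each file has been pre-aggregated to one line per domain, with the nameservers collapsed onto that line; with multiple NS lines per domain you would additionally group consecutive lines by domain before comparing.

using System;
using System.IO;

class ZoneDiff
{
    static void Main()
    {
        using (var oldFile = new StreamReader("prepared-yesterday"))
        using (var newFile = new StreamReader("prepared-today"))
        {
            string oldLine = oldFile.ReadLine();
            string newLine = newFile.ReadLine();

            while (oldLine != null || newLine != null)
            {
                int cmp = oldLine == null ? 1
                        : newLine == null ? -1
                        : string.CompareOrdinal(GetDomain(oldLine), GetDomain(newLine));

                if (cmp < 0)
                {
                    Console.WriteLine("REMOVED " + oldLine);   // only in yesterday's file
                    oldLine = oldFile.ReadLine();
                }
                else if (cmp > 0)
                {
                    Console.WriteLine("ADDED   " + newLine);   // only in today's file
                    newLine = newFile.ReadLine();
                }
                else
                {
                    if (oldLine != newLine)
                        Console.WriteLine("CHANGED " + newLine);  // same domain, different NS
                    oldLine = oldFile.ReadLine();
                    newLine = newFile.ReadLine();
                }
            }
        }
    }

    // The domain is the first whitespace-delimited token of a zone line.
    static string GetDomain(string line)
    {
        return line.Split(' ', '\t')[0];
    }
}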
The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets you stream data at, let's be pessimistic, 20 MB/second. It can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write-ahead log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort the facts, partition them into key/values, and send them off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, nameserver). The reducer would then get, for a single domain, the details from the one or two files it came from, and it only emits the facts of interest: an add is a domain present in the new file but not the old, a drop is a domain present in the old file but not the new, and a nameserver change is a domain present in both with different nameservers.
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
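A very rough sketch of that mapper/reducer pair, written as plain functions: the types, the "old"/"new" file tag, and the framework glue are all assumptions; a real job would plug these into Hadoop streaming or a similar framework.

using System;
using System.Collections.Generic;
using System.Linq;

class ZoneMapReduceSketch
{
    // Mapper: read lines from one day's zone file and emit (domain -> "fileTag|nameserver").
    static IEnumerable<KeyValuePair<string, string>> Map(string fileTag, IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length >= 3 && parts[1] == "NS")
                yield return new KeyValuePair<string, string>(
                    parts[0].ToLowerInvariant(), fileTag + "|" + parts[2]);
        }
    }

    // Reducer: for one domain, decide whether it was added, dropped or changed.
    // Returns null when nothing changed.
    static string Reduce(string domain, IList<string> values)
    {
        var oldNs = values.Where(v => v.StartsWith("old|"))
                          .Select(v => v.Substring(4)).OrderBy(v => v).ToList();
        var newNs = values.Where(v => v.StartsWith("new|"))
                          .Select(v => v.Substring(4)).OrderBy(v => v).ToList();

        if (oldNs.Count == 0) return "ADDED " + domain;
        if (newNs.Count == 0) return "REMOVED " + domain;
        return oldNs.SequenceEqual(newNs) ? null : "CHANGED " + domain;
    }
}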
This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
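If you go the sorted-chunks route, a rough sketch of the chunking step might look like this; the chunk size is an illustrative guess, and the temporary chunk files would later be merged or walked in lock-step against the other file's chunks.

using System;
using System.Collections.Generic;
using System.IO;

class ChunkSorter
{
    // Split a huge file into sorted temporary chunk files.
    public static List<string> SortIntoChunks(string inputPath, int linesPerChunk)
    {
        var chunkFiles = new List<string>();
        var chunk = new List<string>(linesPerChunk);

        foreach (string line in File.ReadLines(inputPath))
        {
            chunk.Add(line);
            if (chunk.Count >= linesPerChunk)
                FlushChunk(chunk, chunkFiles);
        }
        FlushChunk(chunk, chunkFiles);
        return chunkFiles;
    }

    static void FlushChunk(List<string> chunk, List<string> chunkFiles)
    {
        if (chunk.Count == 0) return;
        chunk.Sort(StringComparer.Ordinal);          // sort this chunk in memory
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, chunk);             // stream the sorted chunk to disk
        chunkFiles.Add(path);
        chunk.Clear();
    }
}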
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.
I have a .JSON file which is approx. 1.5MB in size containing around 1500 JSON objects that I want to convert into domain objects at the start-up of my app.
Currently my process on the Phone (not on my development PC) takes around 23 seconds, which is far too slow for me and is forcing me to write the list of objects into ApplicationSettings so that I don't have to do it each time the app loads (just the first time), but even that takes 15-odd seconds to write to, and 16 seconds to read from, all of which is not really good enough.
I have not had a lot of serialization experience and I don't really know the fastest way to get it done.
Currently, I am using the System.Runtime.Serialization namespace with DataContract and DataMember approach.
Any ideas on performance with this type of data loading?
I found the Json.NET library to be more performant and to have better options than the standard JSON serializer.
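A minimal sketch of deserializing straight off a stream with Json.NET; DomainObject stands in for whatever the app's real type is.

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Placeholder for the app's real domain type.
public class DomainObject
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class JsonLoading
{
    public static List<DomainObject> Load(Stream jsonStream)
    {
        using (var reader = new StreamReader(jsonStream))
        using (var jsonReader = new JsonTextReader(reader))
        {
            var serializer = new JsonSerializer();
            // Reading straight off the stream avoids building one huge string first.
            return serializer.Deserialize<List<DomainObject>>(jsonReader);
        }
    }
}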
One performance issue I encountered in my app was that my domain objects implemented INotifyPropertyChanged with code to support dispatching the event back to the UI thread. Since the deserialization code populated those properties I was doing a lot of thread marshalling that didn't need to be there. Cutting out the notifications during deserialization substantially increased performance.
Update: I was using Caliburn Micro which has a property on PropertyChangedBase that can turn off property changed notifications. I then added the following:
[OnDeserializing]
public void OnDeserializing(StreamingContext context)
{
    IsNotifying = false;
}

[OnDeserialized]
public void OnDeserialized(StreamingContext context)
{
    IsNotifying = true;
}
Try profiling your app with the free EQATEC Profiler for WP7. The real issue could be something completely unexpected and easy to fix, like the INotifyPropertyChanged-example Nigel mentions.
You can quickly shoot yourself in the foot using the application settings. The issue is that these are always serialized/deserialized "in bulk" and loaded in memory, so unless your objects are extremely small this can cause memory and performance issues down the road.
I am still wondering about the need for 1500 objects. Do you really need 1500 of the entire object, and if so, why - ultimately the phone is showing something to the user, and no user can process 1500 bits of information at once. They can only process information that is presented, no? So are there parts of the object that you can show, and wait to load the other parts until later? For example, if I have 2000 contacts I will never load 2000 contacts. I might load 2000 names, let the user filter/sort names, and then when they select a name load the contact.
I would suggest serializing this to isolated storage as a file. The built-in JSON serializer has the smallest footprint on disk and performs quite well.
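A sketch of caching the list as a JSON file in isolated storage with the built-in serializer; DomainObject again stands in for the real type, and the file name is a placeholder.

using System.Collections.Generic;
using System.IO;
using System.IO.IsolatedStorage;
using System.Runtime.Serialization.Json;

public static class JsonCache
{
    public static void Save(List<DomainObject> items, string fileName)
    {
        using (var store = IsolatedStorageFile.GetUserStoreForApplication())
        using (var stream = store.CreateFile(fileName))
        {
            new DataContractJsonSerializer(typeof(List<DomainObject>))
                .WriteObject(stream, items);
        }
    }

    public static List<DomainObject> Load(string fileName)
    {
        using (var store = IsolatedStorageFile.GetUserStoreForApplication())
        using (var stream = store.OpenFile(fileName, FileMode.Open, FileAccess.Read))
        {
            return (List<DomainObject>)new DataContractJsonSerializer(typeof(List<DomainObject>))
                .ReadObject(stream);
        }
    }
}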
Here is a post about serialization. Use binary or Json.Net.
Storing/restoring into ApplicationSettings is going to involve serialization as well (pretty sure it's Xml) so I don't think you are ever going to get any faster than the 16 seconds you are seeing.
Moving that amount of data around is just not going to be fast no matter how good the deserializer. My recommendation would be to look at why you are storing that many objects. If you can't reduce the set of objects you need to store, look at breaking them up into logical groups so that you can load on demand rather than up front.
Have you tried using multiple smaller files and [de]serializing in parallel to see if that will be faster?
I have this problem: I have a collection of small files that are about 2000 bytes each (they are all the exact same size) and there are about ~100,000 of them, which equals about 200 megabytes of space. I need to be able to, in real time, select a range in these files. Say file 1000 to 1100 (100 files total), read them and send them over the network decently fast.
The good thing is the files will always be read in sequence, i.e. it's always going to be a range of say "from this file and a hundred more" and not "this file here, and that file over there, etc.".
Files can also be added to this collection during runtime, so it's not a fixed amount of files.
The current scheme I've come up with is this: no file is larger than 2000 bytes, so instead of having several files allocated on the disk I'm going to have one large file containing all the other files at even 2048-byte intervals, with the first 2 bytes of each 2048-byte block being the actual byte size of the file contained in the next 2046 bytes (the files range between 1800 and 1950 bytes or so in size), and then seek inside this file instead of opening a new file handle for each file I need to read.
So when I need to get the file at position X, I will just seek to X*2048, read the first two bytes, and then read the bytes from (X*2048)+2 up to the size contained in the first two bytes. This large 200MB file will be append-only, so it's safe to read even while the serialized input thread/process (haven't decided yet) appends more data to it.
This has to be doable on Windows, C is an option but I would prefer C#.
Do you have anything against storing these files in a database?
A simple RDBMS would drastically speed up the searching and sorting of a bunch of 2k files
I think your idea is probably the best you can do with decent work.
Alternatively you could buy a solid state disk and not care about the filesize.
Or you could just preload the entire data into a collection into memory if you don't depend on keeping RAM usage low (will also be the fastest option).
Or you could use a database, but the overhead here will be substantial.
That sounds like a reasonable option.
When reading the data for the range, I'd be quite tempted to seek to the start of the "block of data", and read the whole lot into memory (i.e. the 2048 byte buffers for all the files) in one go. That will get the file IO down to a minimum.
Once you've got all the data in memory, you can decode the sizes and send just the bits which are real data.
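A sketch of that bulk-read-then-decode approach, under the assumptions of the question: 2048-byte slots with a 2-byte length prefix (the little-endian byte order of the prefix is an assumption).

using System;
using System.IO;

class PackedStore
{
    const int SlotSize = 2048;

    // Read `count` consecutive records starting at record index `first`.
    public static byte[][] ReadRange(string path, long first, int count)
    {
        var result = new byte[count][];
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.ReadWrite))   // the writer may still be appending
        {
            var block = new byte[count * SlotSize];
            stream.Seek(first * SlotSize, SeekOrigin.Begin);
            stream.Read(block, 0, block.Length);   // a robust version would loop until all bytes arrive

            for (int i = 0; i < count; i++)
            {
                int offset = i * SlotSize;
                int length = block[offset] | (block[offset + 1] << 8);  // little-endian length prefix
                result[i] = new byte[length];
                Buffer.BlockCopy(block, offset + 2, result[i], 0, length);
            }
        }
        return result;
    }
}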
Loading all of it into memory may well be a good idea, but that will entirely depend on how often it's modified and how often it's queried.
Was there anything more to the question than just "is this a sane thing to do"?
Are you sure you will never want to delete files from, say, 1200 to 1400? What happens when you are done transferring? Is the data archived or will it continuously grow?
I really don't see why appending all of the data to a single file would improve performance. Instead it's likely to cause more issues for you down the line. So, why would you combine them?
Other things to consider are, what happens if the massive file gets some corruption in the middle from bad sectors on the disk? Looks like you lose everything. Keeping them separate should increase their survivability.
You can certainly work with large files without loading the entire thing in memory, but that's not exactly easy and you will ultimately have to drop down to some low level coding to do it. Don't constrain yourself. Also, what if the file requires a bit of hand editing? Most programs would force you to load and lock the entire thing.
Further, having a single large file would mean that you can't have multiple processes reading / writing the data. This limits scalability.
If you know you need files from #1000 to 1100, you can use the built-in (C#) code to get a collection of files meeting those criteria.
You can simply concatenate all the files in one big file 'dbase' without any header or footer.
In another file, 'index', you can save the position of all the small files in 'dbase'. This index file, being very small, can be cached completely in memory.
This scheme allows you to quickly read the required files and to add new ones at the end of your collection.
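A rough sketch of that dbase + index scheme; the index format (an 8-byte offset and 4-byte length per record) is an assumption, and appending the new index entry to the index file is left as a comment.

using System.Collections.Generic;
using System.IO;

class ConcatenatedStore
{
    readonly List<long> offsets = new List<long>();
    readonly List<int> lengths = new List<int>();
    readonly string dbasePath;

    public ConcatenatedStore(string dbasePath, string indexPath)
    {
        this.dbasePath = dbasePath;
        using (var reader = new BinaryReader(File.OpenRead(indexPath)))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                offsets.Add(reader.ReadInt64());
                lengths.Add(reader.ReadInt32());
            }
        }
    }

    public void Append(byte[] record)
    {
        using (var dbase = new FileStream(dbasePath, FileMode.Append))
        {
            offsets.Add(dbase.Position);
            lengths.Add(record.Length);
            dbase.Write(record, 0, record.Length);
        }
        // The new (offset, length) pair would also be appended to the index file here.
    }

    public byte[] Read(int recordIndex)
    {
        using (var dbase = File.OpenRead(dbasePath))
        {
            dbase.Seek(offsets[recordIndex], SeekOrigin.Begin);
            var buffer = new byte[lengths[recordIndex]];
            dbase.Read(buffer, 0, buffer.Length);
            return buffer;
        }
    }
}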
Your plan sounds workable. It seems like a FileStream can perform the seeks and reads that you need. Are you running into specific problems with implementation, or are you looking for a better way to do it?
Whether there is a better way might depend upon how fast you can read the files vs how fast you can transmit them on the network. Assuming that you can read tons of individual files faster than you can send them, perhaps you could set up a bounded buffer, where you read ahead x number of files into a queue. Another thread would be reading from the queue and sending them on the network.
I would modify your scheme in one way: instead of reading the first two bytes, then using those to determine the size of the next read, I'd just read 2KiB immediately, then use the first two bytes to determine how many bytes you transmit.
You'll probably save more time by using only one disk read than by avoiding transferring the last ~150 bytes from the disk into memory.
The other possibility would be to pack the data for the files together, and maintain a separate index to tell you the start position of each. For your situation, this has the advantage that instead of doing a lot of small (2K) reads from the disk, you can combine an arbitrary number into one large read. Getting up to around 64-128K per read will generally save a fair amount of time.
You could stick with your solution of one big file but use memory mapping to access it (see here e.g.). This might be a bit more performant, since you also avoid paging and the virtual memory management is optimized for transferring chunks of 4096 bytes.
Afaik, there's no direct support for memory mapping, but here is some example how to wrap the WIN32 API calls for C#.
See also here for a related question on SO.
Interestingly, this problem reminds me of the question in this older SO question:
Is this an over-the-top question for Senior Java developer role?
For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.
The content of the files should be neither completely random nor uniform.
A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
I'd like to keep the number of files at a manageable level, let's say o(10).
For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    fileStream.Close();
}
With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.
For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.
Neither of these is quite satisfactory for me.
I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.
The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.
What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
Currently I have an approach that sort of works but it takes too long to run.
Has anyone else solved this?
Is there a much faster way to write a text file than via StreamWriter?
Suggestions?
EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
For text, you could use the Stack Overflow community dump; there are 300 MB of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files, which would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and XML mixed in).
You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.
If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.
Edit
Mark mentions the project gutenberg download, this is also a really good source for text (and audio) which is available for download via bittorrent.
You could always code yourself a little web crawler...
UPDATE
Calm down guys, this would be a good answer, if he hadn't said that he already had a solution that "takes too long".
A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.
I think you might be looking for something like a Markov chain process to generate this data. It's both stochastic (randomised), but also structured, in that it operates based on a finite state machine.
Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see Properties of Markov chains section of the page.) Hopefully you should see how to design one, however - to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth going to the effort, if you need these enormous lengths of test data.
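A toy sketch of such a word-level, order-1 Markov chain; the training text is whatever sample you choose to feed it, and a serious implementation would handle punctuation and higher orders.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class MarkovTextGenerator
{
    readonly Dictionary<string, List<string>> successors = new Dictionary<string, List<string>>();
    readonly Random random = new Random();

    // Record, for each word in the sample, which words follow it.
    public void Train(string sampleText)
    {
        string[] words = sampleText.Split(new[] { ' ', '\r', '\n', '\t' },
                                          StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> next;
            if (!successors.TryGetValue(words[i], out next))
                successors[words[i]] = next = new List<string>();
            next.Add(words[i + 1]);
        }
    }

    // Take a random walk through the chain to produce natural-looking filler text.
    public string Generate(int wordCount)
    {
        List<string> keys = successors.Keys.ToList();
        var output = new StringBuilder();
        string current = keys[random.Next(keys.Count)];

        for (int i = 0; i < wordCount; i++)
        {
            output.Append(current).Append(' ');
            List<string> next;
            if (successors.TryGetValue(current, out next))
                current = next[random.Next(next.Count)];
            else
                current = keys[random.Next(keys.Count)];   // dead end: restart at a random word
        }
        return output.ToString();
    }
}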
I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them copying them to your output file as many times as needed to get the right size file.
You could then use a similar approach for binary files by looking for .exes or .dlls.
For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.
For a more structured approach you could use a Markov chain trained on some large free English text.
Why don't you just take Lorem Ipsum and create a long string in memory before your output? Reaching the target length only takes O(log n) steps if you double the amount of text you have every time. You can even calculate the total length of the data beforehand, allowing you to avoid having to copy contents to a new string/array.
Since your buffer is only 512k or whatever you set it to be, you only need to generate that much data before writing it, since that is only the amount you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
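A minimal sketch of that doubling idea; for example, ExpandByDoubling(loremIpsum, 512 * 1024) would give a 512k block you can then write over and over.

using System.Text;

class LoremExpander
{
    // Build a large block from a small seed in O(log n) append passes,
    // then reuse that block for every buffered write.
    public static string ExpandByDoubling(string seed, int targetLength)
    {
        var sb = new StringBuilder(seed, targetLength);
        while (sb.Length < targetLength)
            sb.Append(sb.ToString());          // double the accumulated text
        return sb.ToString(0, targetLength);
    }
}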
Wikipedia is excellent for compression testing for mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100 MB of Wikipedia. The current record is a 6.26 ratio, 16 MB.
Thanks for all the quick input.
I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple of ideas.
To generate text, I start with a few text files from the project gutenberg catalog, as suggested by Mark Rushakoff.
I randomly select and download one document of that subset.
I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
I wrote a new Markov Chain in C# using Pike's economical Perl implementation as an example. It generates a text one word at a time.
For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together.
UPDATE: As for the second problem, the speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400rpm mini-spindle. Which led me to redefine the problem entirely - rather than generating a FILE with random content, what I really want is the random content. Using a Stream wrapped around a Markov Chain, I can generate text in memory and stream it to the compressor, eliminating 8g of write and 8g of read. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. So the streaming approach worked well to speed things up massively. It cut 80% of the time required.
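A rough sketch of that "stream the generated text straight into the compressor" idea: a read-only Stream that pulls its bytes from any text generator on demand. The generator delegate is an assumption standing in for the Markov chain above, e.g. new GeneratedTextStream(() => markov.Generate(1000), 8L * 1024 * 1024 * 1024) handed to the compressor as its input.

using System;
using System.IO;
using System.Text;

class GeneratedTextStream : Stream
{
    readonly Func<string> generateMore;   // produces the next chunk of text
    readonly long totalBytes;             // how many bytes the virtual "file" should contain
    byte[] pending = new byte[0];
    int pendingOffset;
    long produced;

    public GeneratedTextStream(Func<string> generateMore, long totalBytes)
    {
        this.generateMore = generateMore;
        this.totalBytes = totalBytes;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (produced >= totalBytes) return 0;                       // end of the virtual file
        if (pendingOffset >= pending.Length)
        {
            pending = Encoding.UTF8.GetBytes(generateMore());       // refill from the generator
            pendingOffset = 0;
        }
        int toCopy = (int)Math.Min(count, Math.Min(pending.Length - pendingOffset,
                                                   totalBytes - produced));
        Buffer.BlockCopy(pending, pendingOffset, buffer, offset, toCopy);
        pendingOffset += toCopy;
        produced += toCopy;
        return toCopy;
    }

    public override bool CanRead  { get { return true;  } }
    public override bool CanSeek  { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length   { get { return totalBytes; } }
    public override long Position
    {
        get { return produced; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}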
I haven't yet figured out how to do the binary generation, but it will likely be something analogous.
Thank you all, again, for all the helpful ideas.