I have a SQL Server table with around 300,000,000 absolute UNC paths and I'm trying to (quickly) validate each one to make sure the path in the SQL Server table actually exists as a file on disk.
At face value, I'm querying the table in batches of 50,000 and incrementing a counter to advance my batch as I go.
Then, I'm using a data reader object to store my current batch set and loop through the batch, checking each file with a File.Exists(path) command, like in the following example.
Problem is, I'm processing at approximately 1,000 files per second max on a quad-core 3.4 GHz i5 with 16 GB RAM, which is going to take days. Is there a faster way to do this?
I do have a columnstore index on the SQL Server table and I've profiled it. I get batches of 50k records in <1s, so it's not a SQL bottleneck when issuing batches to the .NET console app.
while (counter <= MaxRowNum)
{
    command.CommandText = "SELECT id, dbname, location FROM table WHERE ID BETWEEN " + counter + " AND " + (counter + 50000).ToString();
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        var indexOfColumn1 = reader.GetOrdinal("ID");
        var indexOfColumn2 = reader.GetOrdinal("dbname");
        var indexOfColumn3 = reader.GetOrdinal("location");
        while (reader.Read())
        {
            var ID = reader.GetValue(indexOfColumn1);
            var DBName = reader.GetValue(indexOfColumn2);
            var Location = reader.GetValue(indexOfColumn3);
            if (!File.Exists(Location.ToString()))
            {
                // log entry to logging table
            }
        }
    }
    // increment counter to grab next batch
    counter += 50000;
    // report on progress, I realize this might be off and should be incremented based on ID
    Console.WriteLine("Last Record Processed: " + counter.ToString());
    connection.Close();
}
Console.WriteLine("Done");
Console.Read();
EDIT: Adding some additional info:
I thought about doing this all via the database itself; it's SQL Server Enterprise with 2 TB of RAM and 64 cores. The problem is that the SQL Server service account doesn't have access to the NAS paths hosting the data, so running it through xp_cmdshell from a stored procedure failed (I don't control the AD stuff). Also, the UNC paths have hundreds of thousands of individual subdirectories based on an MD5 hash of the file, so enumerating the contents of directories ends up not being useful because you may have a file 10 directories deep housing only 1 file. That's why I have to do a literal full-path match/check.
Oh, and the paths are very long in general. I actually tried loading them all into a list in memory before I realized it was the equivalent of 90 GB of data (lol, oops). Totally agree with the other comments on threading it out. The database is super fast, not worried at all there. Hadn't considered SMB chatter though, that very well may be what I'm running up against. – JRats
Oh! And I'm also only updating the database if a file doesn't exist. If it does, I don't care. So my database runs are minimized to grabbing batches of paths. Basically, we migrated a bunch of data from slower storage to this nimble appliance and I was asked to make sure everything actually made it over by writing something to verify existence per file.
Threading helped quite a bit. I spanned the file check over 4 threads and got my processing power up to about 3,300 records / second, which is far better, but I'm still hoping to get even quicker if I can. Is there a good way to tell if I'm getting bound by SMB traffic? I noticed once I tried to bump up my thread count to 4 or 5, my speed dropped down to a trickle; I thought maybe I was deadlocking somewhere, but no.
Oh, and I can't do a FilesOnNetwork check for the exact reason you said, there's 3 or 4x as many files actually hosted there compared to what I want to check. There's probably 1.5b files or so on that nimble appliance.
Optimizing the SQL side is moot here because you are file IO bound.
I would use Directory.EnumerateFiles to obtain a list of all files that exist. Enumerating the files in a directory should be much faster than testing each file individually.
You can even invert the problem entirely and bulk insert that file list into a database temp table so that you can do SQL based set processing right in the database.
If you want to go ahead and test individually you probably should do this in parallel. It is not clear that the process is really disk bound. Might be network or CPU bound.
Parallelism will help here by overlapping multiple requests. It's the network latency, not the bandwidth that's likely to be the problem. At DOP 1 at least one machine is idle at any given time. There are times where both are idle.
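For illustration, a hedged sketch of overlapping the checks with a capped degree of parallelism; the collection name pathsInBatch and the degree of 8 are assumptions to tune, not taken from the question:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Minimal sketch: "pathsInBatch" stands in for the locations read from the
// current 50k batch. More parallelism is not always better over SMB, so the
// degree is something to measure and tune.
var missing = new ConcurrentBag<string>();

Parallel.ForEach(
    pathsInBatch,
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    path =>
    {
        if (!File.Exists(path))
            missing.Add(path);   // log these to the logging table in one batch afterwards
    });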
there's 3 or 4x as many files actually hosted there compared to what I want to check
Use the dir /b command to pipe a list of all file names into a .txt file. Execute that locally on the machine that has the files; if that's impossible, execute it remotely. Then use bcp to bulk insert the list into a table in the database. Then you can do a fast existence check in a single SQL query, which will be highly optimized. You'll be getting a hash join.
If you want to parallelize the dir phase of this strategy you can write a program for that. But maybe there is no need to, and dir is fast enough despite being single-threaded.
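To illustrate the final step, here is a hedged sketch of running that existence check from the console app, assuming the dir output (full UNC paths) was bcp'd into a staging table; dbo.FilesOnDisk, dbo.PathsToVerify and connectionString are invented names for the example:

using System;
using System.Data.SqlClient;

// Sketch only: dbo.FilesOnDisk(Path) is assumed to hold the bcp'd dir listing,
// and dbo.PathsToVerify(id, location) stands in for the original table. The
// anti-join returns every row whose file was not found in the listing.
const string sql = @"
    SELECT t.id, t.location
    FROM dbo.PathsToVerify AS t
    WHERE NOT EXISTS (
        SELECT 1 FROM dbo.FilesOnDisk AS f
        WHERE f.Path = t.location)";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(sql, connection))
{
    command.CommandTimeout = 0;      // a full scan of 300M rows can take a while
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            Console.WriteLine("Missing: " + reader.GetString(1));
        }
    }
}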
The bottleneck most likely is network traffic, or more specifically: SMB traffic. Your machine talks SMB to retrieve the file info from the network storage. SMB traffic is "chatty", you need a few messages to check a file's existence and your permission to read it.
For what it's worth, on my network I can query the existence of about a hundred files per second over SMB, while listing 15K files recursively takes 10 seconds.
What can be faster is to retrieve the remote directory listing beforehand. This will be trivial if the directory structure is predictable - and if the storage does not contain many irrelevant files in those directories.
Then your code will look like this:
HashSet<string> filesOnNetwork = new HashSet<string>(
    Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories),
    StringComparer.OrdinalIgnoreCase);   // Windows paths are case-insensitive

foreach (var fileToCheck in filesFromDatabase)
{
    bool fileToCheckExists = filesOnNetwork.Contains(fileToCheck);
}
This may work adversely if there are many more files on the network than you need to check, as the filling of and searching through filesOnNetwork will become the bottleneck of your application.
On your current solution, getting batches of 50,000 and opening and closing the connection serves NO purpose but to slow things down. A DataReader streams: just open it once and read the rows one at a time. Under the covers the reader fetches batches at a time; it won't try to jam the client with 300,000,000 rows when you have only read 10.
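For illustration, a hedged rework of the question's loop along those lines, reusing the same command and connection objects and streaming every row through one reader:

// Sketch: one open connection, one reader streaming all rows. The provider
// fetches rows over the network in its own internal batches as you Read().
command.CommandText = "SELECT id, dbname, location FROM table";
connection.Open();
using (var reader = command.ExecuteReader())
{
    int locationOrdinal = reader.GetOrdinal("location");
    long processed = 0;
    while (reader.Read())
    {
        string location = reader.GetString(locationOrdinal);
        if (!File.Exists(location))
        {
            // log entry to logging table
        }
        if (++processed % 50000 == 0)
            Console.WriteLine("Rows processed: " + processed);
    }
}
connection.Close();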
I think you are worried about optimizing the fastest step - reading from SQL
Validating a file path is going to be the slowest step
I like the answer from CodeCaster, but at 350 million you are going to run into object size limits with .NET. And by reading everything into a HashSet first, the work does not start until that load is done.
I would use a BlockingCollection with two collections
enumerate files
write to db
The slowest step is reading the file names, so do that as fast as possible and don't interrupt it. Do it on a device close to the storage device: run the program on a SAN-attached device.
I know you are going to say the write to the db is slow, but it only has to be faster than enumerating files. Just have a binary column for found - don't write the full filename to a #temp table. I will bet dollars to donuts an (optimized) update is faster than enumerating files. Chunk your updates, say 10,000 rows at a time, to keep the round trips down. And I would do the update async so you can build up the next update while the current one is processing.
Then in the end you have check the DB for any file that was not marked as found.
Don't go to an intermediate collection first. Process the enumeration directly. This lets you start doing the work immediately and keeps memory down.
var foundFiles = new BlockingCollection<string>(boundedCapacity: 100000);

foreach (string fileName in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
{
    foundFiles.Add(fileName);    // hand each name straight to the DB-update consumer
}
foundFiles.CompleteAdding();     // signal that enumeration is finished
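And a hedged sketch of the consuming side; MarkFoundInDb is a hypothetical helper standing in for whatever chunked update you choose (table-valued parameter, SqlBulkCopy into a #temp table joined back to the main table, and so on). Needs System.Collections.Generic and System.Threading.Tasks:

// Sketch of the consumer: drain the blocking collection, accumulate ~10,000
// file names, then push one chunked update to the database.
var consumer = Task.Run(() =>
{
    var batch = new List<string>(10000);
    foreach (string fileName in foundFiles.GetConsumingEnumerable())
    {
        batch.Add(fileName);
        if (batch.Count == 10000)
        {
            MarkFoundInDb(batch);   // hypothetical helper, see note above
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        MarkFoundInDb(batch);       // flush the final partial batch
});

consumer.Wait();                    // finishes once CompleteAdding() has been called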
A quick idea if CodeCaster's approach doesn't work due to there being too many files on the remote servers, and if you are able to install new programs on the remote servers: Write a program that you install on every server, and that listens to some port for HTTP requests (or whichever web service technology you prefer). The program that queries the database should batch up the file names per server, and send a request to each server with all the file names that are located on that server. The web service checks the file existence (which should be fast since it is now a local operation) and responds with e.g. a list containing only the file names that actually did exist. This should eliminate most of the protocol overhead and network latency, since the number of requests is greatly reduced.
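If installing something on the servers is an option, here is a minimal sketch of such a checker using HttpListener; the port, URL prefix and newline-delimited request format are all assumptions, and the prefix may need a URL ACL or admin rights:

using System;
using System.IO;
using System.Net;
using System.Text;

// Sketch: the client POSTs one full path per line; the service replies with
// only the paths that exist locally (no SMB round trips). Run on each file server.
var listener = new HttpListener();
listener.Prefixes.Add("http://+:8080/checkfiles/");   // assumed prefix/port
listener.Start();

while (true)
{
    HttpListenerContext context = listener.GetContext();
    var found = new StringBuilder();

    using (var reader = new StreamReader(context.Request.InputStream, Encoding.UTF8))
    {
        string path;
        while ((path = reader.ReadLine()) != null)
        {
            if (File.Exists(path))          // purely local check
                found.AppendLine(path);
        }
    }

    byte[] response = Encoding.UTF8.GetBytes(found.ToString());
    context.Response.OutputStream.Write(response, 0, response.Length);
    context.Response.Close();
}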
If I were doing this task, I know that the bottlenecks would be:
disk access latency (~1 ms)
network access latency (~0.2 ms for 100 Mbps)
the database, which is limited by disk
The fastest thing is the CPU cache, the second fastest is RAM.
I assume I could use an additional database table to store temporary data. I'll call the database where the data lives now the main database.
I would run two tasks in parallel:
recursively read directories and save the results to a second database in chunks of about 50k files.
get chunks of records from the main database and compare ONE chunk to ONE table from the second database - all files not found get written to a third database (and files that do exist get marked in the main database).
After all chunks from the main database have been compared to the second database, check all chunks from the third database against the second database and delete the files that are found there.
At the end only non-existent files will be left in the third database, so I can just read the paths from it and mark the data in the main database.
There could be additional improvements; I could discuss them if there is interest.
How about sorting the locations as they are retrieved from the DB (databases are good at sorting)? Then the checks can benefit from cached directory info in the CIFS client:
you could get the directory listing for the next row in the result set, check that row for existence in the listing, then keep checking whether the next row in the result set is in the same directory; if so, check the already-fetched listing, if not, repeat the outer loop.
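A hedged sketch of that idea, assuming the rows come back ORDER BY location so paths in the same directory are adjacent; locationsSortedByPath is an invented name for that stream of paths:

using System;
using System.Collections.Generic;
using System.IO;

// Sketch: one directory listing (one SMB round trip burst) serves every
// consecutive path that lives in the same directory.
string currentDir = null;
HashSet<string> listing = null;

foreach (string location in locationsSortedByPath)
{
    string dir = Path.GetDirectoryName(location);
    if (!string.Equals(dir, currentDir, StringComparison.OrdinalIgnoreCase))
    {
        currentDir = dir;
        listing = Directory.Exists(dir)
            ? new HashSet<string>(Directory.EnumerateFiles(dir), StringComparer.OrdinalIgnoreCase)
            : new HashSet<string>(StringComparer.OrdinalIgnoreCase);  // missing dir: nothing exists
    }

    if (!listing.Contains(location))
    {
        // log entry to logging table
    }
}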
The MongoDB documentation says:
To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).
in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
How can I make MongoDB free deleted disk space automatically?
p.s. We stored many download tasks in MongoDB, up to 20 GB, and finished them within half an hour.
In general if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation and the more space that MongoDB can allocate in datafiles the less fragmentation you will have.
So, you should try to provide as much disk-space as possible for the database.
However if you must shrink the database you should keep two things in mind.
1. MongoDB grows its data files by doubling, so the datafiles may be 64 MB, then 128 MB, etc., up to 2 GB (at which point it stops doubling and caps files at 2 GB).
2. As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.
Serverside Javascript
You can use server side Javascript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service) ...
Assuming a collection called foo you would save the javascript below into a file called bar.js and run ...
$ mongo foo bar.js
The javascript file would look something like ...
// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();
print('Storage Size: ' + tojson(storage));
print('TotalSize: ' + tojson(total));
print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');
// Run repair
db.repairDatabase()
// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();
print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));
This will run and return something like ...
MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153
Run this on a schedule (during non-peak hours) and you are good to go.
Capped Collections
However there is one other option, capped collections.
Capped collections are fixed-sized collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.
In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
Basically you can limit the size of (or number of documents in) a collection to, say, 20 GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.
This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk-space used.
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.
The process is time consuming, but it should only cost a few seconds of down time, when you do the rs.stepDown().
Also this can not be automated. Well it could, but I don't think I'm willing to try.
Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.
As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.
use cleanup_database
db.dropDatabase();
use oversize_database
db.collection.find({},{}).forEach(function(doc){
db = db.getSiblingDB("cleanup_database");
db.collection_subset.insert(doc);
});
use oversize_database
db.dropDatabase();
use cleanup_database
db.collection_subset.find({},{}).forEach(function(doc){
db = db.getSiblingDB("oversize_database");
db.collection.insert(doc);
});
use oversize_database
<add indexes>
db.collection.ensureIndex({field:1});
use cleanup_database
db.dropDatabase();
An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.
Also as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost, and your db and collections will be recreated when the inserts for the processes run next. Just make sure your jobs include creating any necessary indexes at an appropriate time.
If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.
To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.
By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:
NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES
#!/bin/bash
# First arg is host MongoDB is running on, second arg is the MongoDB port
MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath
# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster')" = "true" ]
do
$MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'
sleep 2
done
echo "Node is no longer primary!\n"
# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop
# Wipe the data files for that server
ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH
# Start up server again
# similar to shutdown something like
ssh -t user@$MONGOHOST sudo service mongodb start
I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7 GB is too much to load into memory, obviously I need to read it as a stream. The method I've currently thought up as the best way to do it is to make several passes over both files, one pass for each letter of the alphabet: loading all the domains that start with 'a' in the first pass, for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, collecting all the domains for the current pass's letter. So I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad-core Xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code (the unwritten kind) and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
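A hedged sketch of that lock-step walk, assuming both prepared files were sorted with the same ordinal-style collation (e.g. sort under LC_ALL=C); it works at the line level, and grouping consecutive lines by domain to report nameserver changes is a small extension:

using System;
using System.IO;

// Sketch: walk the two sorted, lowercased files in lock-step and classify
// lines that appear in only one of them.
using (var today = new StreamReader("prepared-today"))
using (var yesterday = new StreamReader("prepared-yesterday"))
{
    string t = today.ReadLine();
    string y = yesterday.ReadLine();

    while (t != null || y != null)
    {
        int cmp = t == null ? 1
                : y == null ? -1
                : string.CompareOrdinal(t, y);

        if (cmp == 0)              // same record in both files: no change
        {
            t = today.ReadLine();
            y = yesterday.ReadLine();
        }
        else if (cmp < 0)          // only in today's file: addition
        {
            Console.WriteLine("ADDED: " + t);
            t = today.ReadLine();
        }
        else                       // only in yesterday's file: removal
        {
            Console.WriteLine("REMOVED: " + y);
            y = yesterday.ReadLine();
        }
    }
}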
The question has been already answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute , with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called the WAL log (write-ahead log - yeah, I know the second "log" is redundant) and updates data in memory. When it gets around to it it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort the facts, partition them into key/value sets, and send them off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, nameserver). The reducer would then get, for a single domain, the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are domains in the new file but not the old. Drops are domains in the old file but not the new. Nameserver changes are domains in both where the nameservers changed.
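To make that concrete, a hedged sketch of what a streaming-style mapper could look like; the source label passed as a command-line argument and the 'domain NS nameserver.' line format are assumptions based on the question (a real framework such as Hadoop Streaming may supply the input file name differently, e.g. via an environment variable), and the reducer is left out:

using System;

// Hypothetical streaming mapper: reads zone-file lines from stdin and writes
// "domain <tab> source,nameserver" pairs to stdout. The framework sorts by key
// and hands each domain's values to a reducer, which decides added / removed /
// nameserver-changed.
class ZoneMapper
{
    static void Main(string[] args)
    {
        string sourceFile = args.Length > 0 ? args[0] : "unknown";
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 3) continue;          // skip malformed lines

            string domain = parts[0].ToLowerInvariant();
            string nameserver = parts[2];
            Console.WriteLine(domain + "\t" + sourceFile + "," + nameserver);
        }
    }
}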
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.
A project I have been working on for the past year writes out log files to a network drive every time a task is performed. We currently have about 1300 folders, each with a number of files and folders. The important part here is that we have about 3-4 XML files on average within each folder that contain serial numbers and other identifying data for our products. Early on it was easy to just use Windows file search, but now it can take over 10 minutes to search that way. The need for a log viewer/searcher has arisen and I need to figure out how to make it fast. We are considering a number of ideas:
1) A locally stored XML index file, but I think that will eventually get too big to be fast.
2) A folder-watching service that writes index information to a SQL database and links to the files.
3) Have the program writing the log files also write index information to the database.
The second (database) option is sounding like the best one right now since we already have a bunch of history that will need to be indexed, but to me it sounds a little convoluted. So my question in short is: how can I quickly search for information contained in XML files in a large and constantly growing number of directories?
We ended up using an SQL database to do this, for many reasons:
It maintained much faster query times than XML in a simulated 10 years of data growth (1.5-2 million entries still returned in under 2 seconds, versus about 20-30 seconds for XML).
We can have the application directly publish data to the database, this removes the need for a crawler (other than for initial legacy data).
There could have been potential issues with file locking if we decided to host the file on the network somewhere.
I have a program that receives real time data on 1000 topics. It receives -- on average -- 5000 messages per second. Each message consists of a two strings, a topic, and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.
I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.
I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.
I've considered a few approaches:
1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.
2) Write into one file and -- somehow -- process it later to produce 1000 files.
3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough ram although I might need to move to 64 bit to get over the 2 GB limit.
How would you approach this problem?
I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.
If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.
Seriously. Database it.
Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.
Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.
Like n8wrl, I also would recommend a DB. But if you really dislike that option ...
Let's find another solution ;-)
As a minimum I would take two threads. The first is a worker, receiving all the data and putting each object (timestamp, two strings) into a queue.
Another thread will check this queue (maybe notified by an event, or by checking the Count property). This thread will dequeue each object, open the specific file, write to it, close the file and proceed to the next item.
I would start with this first approach and take a look at the performance. If it sucks, do some measuring to find where the problem is and try to address it (e.g. put open files into a dictionary of (name, StreamWriter), etc.).
But on the other side a DB would be so fine for this problem...
One table, four columns (id, timestamp, topic, message), one additional index on topic, ready.
First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec, or 600 MB per minute. Well, you could keep that in RAM and then flush it every hour.
Edit: corrected mistake. Sorry, my bad.
I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
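A hedged sketch of that layout; the queue types, the threshold of 100 and the file naming are illustrative rather than prescriptive:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Sketch of the per-topic-queue idea: the receiving thread calls Enqueue();
// one (or more) writer threads sweep the queues and only touch a file once
// enough messages have piled up, so each open/close pays for many writes.
static class TopicWriter
{
    static readonly ConcurrentDictionary<string, ConcurrentQueue<string>> Queues =
        new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    // Called by the receiving thread: timestamp the message and park it in its topic queue.
    public static void Enqueue(string topic, string message)
    {
        string line = DateTime.UtcNow.ToString("o") + "\t" + message;
        Queues.GetOrAdd(topic, _ => new ConcurrentQueue<string>()).Enqueue(line);
    }

    // Run on a dedicated writer thread.
    public static void WriterLoop()
    {
        while (true)
        {
            foreach (var pair in Queues)
            {
                if (pair.Value.Count < 100) continue;   // write threshold: batch the I/O

                var lines = new List<string>();
                string line;
                while (pair.Value.TryDequeue(out line))
                    lines.Add(line);

                File.AppendAllLines(pair.Key + ".log", lines);   // one open/append/close per batch
            }
            Thread.Sleep(50);   // don't spin when traffic is quiet
        }
    }
}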
I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...
1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.
This is close to db-ish-ness! Also sounds like more trouble than it's worth.
RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.
How would I approach this? DB! Because then I can query index, analyze, etc. etc.
:)
I would agree with Kyle and go with a package product like PI. Be aware PI is quite expensive.
If you're looking for a custom solution I'd go with Stephen's, with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file, though, to hand off the messages to the other process, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure about the speed of that.
I would also recommend using a db to store your data. You'll want to do bulk inserts into the db though, as I think you would need some hefty hardware to allow SQL Server to do 5000 transactions a second. You're better off doing a bulk insert every, say, 10,000 messages that accumulate in the queue (see the sketch after the data sizes below).
DATA SIZES:
Average message ~50 bytes:
smalldatetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.
50 bytes * 5000/sec = ~244 KB/sec -> ~14 MB/min -> ~858 MB/hour
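To make the bulk-insert suggestion concrete, here is a hedged sketch using SqlBulkCopy; the table name, column names, connectionString and the shape of the queued message object are assumptions for the example:

using System;
using System.Data;
using System.Data.SqlClient;

// Sketch: accumulate ~10,000 queued messages into a DataTable, then push them
// in one round trip instead of 10,000 individual INSERTs.
var table = new DataTable();
table.Columns.Add("ReceivedAt", typeof(DateTime));
table.Columns.Add("Topic", typeof(string));
table.Columns.Add("Message", typeof(string));

foreach (var msg in batch)                     // "batch" = the ~10k dequeued messages
    table.Rows.Add(msg.ReceivedAt, msg.Topic, msg.Message);

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.TopicMessages";   // assumed table
    bulk.WriteToServer(table);
}
table.Clear();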
Perhaps you don't want the overhead of a DB install?
In that case, you could try a filesystem-based database like sqlite:
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
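For illustration, a minimal sketch of that route, assuming the System.Data.SQLite ADO.NET provider; the file, table and column names are made up for the example:

using System;
using System.Data.SQLite;   // assumes the System.Data.SQLite provider is installed

// Minimal sketch of file-based storage with SQLite: one local .db file, one
// table, parameterized inserts. For 5000 messages/sec you would batch many
// inserts inside a single transaction, since each standalone insert pays for
// its own disk sync.
using (var connection = new SQLiteConnection("Data Source=messages.db"))
{
    connection.Open();

    using (var create = new SQLiteCommand(
        "CREATE TABLE IF NOT EXISTS messages (ts TEXT, topic TEXT, body TEXT)", connection))
    {
        create.ExecuteNonQuery();
    }

    using (var tx = connection.BeginTransaction())
    using (var insert = new SQLiteCommand(
        "INSERT INTO messages (ts, topic, body) VALUES (@ts, @topic, @body)", connection))
    {
        insert.Transaction = tx;
        // In the real program this would loop over a batch drained from the queue.
        insert.Parameters.AddWithValue("@ts", DateTime.UtcNow.ToString("o"));
        insert.Parameters.AddWithValue("@topic", "topic-001");
        insert.Parameters.AddWithValue("@body", "example message");
        insert.ExecuteNonQuery();
        tx.Commit();
    }
}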
I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out. Doing things this way allows you to minimize the number of open file handles while still handling the incoming requests in real time. If you make the first program format its output correctly then processing it into the individual files should be simple.
I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file - append operations are as fast as they can be - and use a separate process/service to split up the file into the 1000 files. You could even roll over the file every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting it up into 1000 separate files.
All this does beg the question of why not a DB, and why you need 1000 different files - you may have a very good reason - but then again, perhaps you should re-think your strategy and make sure the reasoning is sound before you go too far down this path.
I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried to do things like this with files and an MS SQL database before and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs and they even have packages where you can make queries against the data just like it was SQL.
It wouldn't allow me to post Hyperlinks so just google those 2 products and you will find information on them.
EDIT
If you do use a database like most people are suggesting I would recommend a table for each topic for historical data and consider table partitioning, indexes, and how long you are going to store the data.
For example, if you are going to store a day's worth and it's one table for each topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432,000 records a day per table. After exporting the data I would imagine you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock, and you'll need MS SQL Enterprise Edition for online index rebuilding. If you don't clear the data every day you will have to make sure you have plenty of disk space to throw at it.
Basically, what I'm saying is: weigh the cost of purchasing a reliable product against building your own.