Reading and aggregating thousands of files using C# and SQL Server

I have a lot of files lying out on random file shares. I have to copy them into my SQL Server 2008 database and sum up all of the points. The overhead of copying each file from the network, through C#, and into the database makes this process slow, and I have thousands of very large files to process.
File 1 example
Player | Points
---------------
Bean | 10
Ender | 15
File 2 example
Player | Points
---------------
Ender | 20
Peter | 5
Result
Player | Points
---------------
Bean | 10
Ender | 35
Peter | 5
Current approach: using C#, read each file into the database and merge into the master table.
MERGE INTO Points as Target
USING Source as Source
ON Target.Player = Source.Player
WHEN MATCHED THEN
UPDATE SET Target.Points = Target.Points + Source.Points
WHEN NOT MATCHED THEN
INSERT (Player, Points) VALUES (Source.Player, Source.Points);
This approach works, but it's a bit slow and I'm looking for ideas for improvement.
Proposed solution:
Read each file into a SQLite database (based on what I've read, this should be very fast), bulk load the entire database into my SQL Server database, and do all of the processing there. I should be able to assign a rank to each player, thus speeding up the grouping since I'm not comparing on a text column. The downside of the proposed solution is that it can't run on multiple threads.
What's the fastest way to get all of these files into the database and then aggregate them?
Edit: A little more background on the files I forgot to mention
These files are located on several servers
I need to keep the "impact" of this task to a minimum - so no installing of apps
The files can be huge - as much as 1 GB per file, so doing anything in memory is not an option
There are thousands of files to process

So, if you can't/don't want to run code to start the parsing operation on the individual servers containing these files, and transferring the gigs and gigs of them is slow, then whether this is multithreaded is probably irrelevant - the performance bottleneck in your process is the file transfer.
So to make some assumptions:
There's the one master server and only it does any work.
It has immediate (if slow) access to all the file shares necessary, accessible by a simple path, and you know those paths.
The master tally server has a local database sitting on it to store player scores.
If you can transfer multiple files just as fast as you can transfer one, I'd write code that did the following:
Gather the list of files that need to be aggregated - this at least should be a small and cheap list. Gather them into a ConcurrentBag.
Spin up as many Tasks as the bandwidth on the machine will allow you to run copy operations. You'll need to test to determine what this is.
Each Task takes the ConcurrentBag as an argument. It begins with a loop running TryTake() until it succeeds - once it's successfully removed a filepath from the bag it begins reading directly from the file location and parsing, adding each user's score to whatever is currently in the local database for that user.
Once a Task finishes working on a file it resumes trying to get the next filepath from the ConcurrentBag and so forth.
Eventually all filepaths have been worked on and the Tasks end.
So the code would be roughly:
public void Start()
{
    var bag = new ConcurrentBag<string>();
    // TODO: fill the bag with the paths of the files to aggregate

    // COPY_OPERATIONS = the number of concurrent copy/parse tasks; tune this by testing
    for (var i = 0; i < COPY_OPERATIONS; i++)
    {
        Task.Factory.StartNew(() =>
        {
            StartCopy(bag);
        });
    }
}

public void StartCopy(ConcurrentBag<string> bag)
{
    // Keep taking paths until the bag has nothing left to hand us
    string path;
    while (bag.TryTake(out path))
    {
        // Access the file via a stream and begin parsing it, dumping scores to the db
    }
}
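For the "parsing it, dumping scores to the db" step, here is a rough sketch (not the poster's code) of one way to do it: stream each file line by line, buffer rows into a DataTable, bulk copy them into a staging table, and fold them into Points with the MERGE from the question. The staging table name (Source), the pipe-delimited layout from the examples, and the 50,000-row batch size are all assumptions; it needs System.Data, System.Data.SqlClient, and System.IO.
private void ParseAndUpsert(string path, string connectionString)
{
    // Buffer parsed rows, then push them to SQL Server in batches.
    var batch = new DataTable();
    batch.Columns.Add("Player", typeof(string));
    batch.Columns.Add("Points", typeof(long));

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // Stream the file so a 1 GB file never has to fit in memory.
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split('|');
                if (parts.Length != 2 || !long.TryParse(parts[1].Trim(), out var points))
                    continue; // skip headers and separator lines

                batch.Rows.Add(parts[0].Trim(), points);
                if (batch.Rows.Count >= 50000)
                    Flush(connection, batch);
            }
        }

        if (batch.Rows.Count > 0)
            Flush(connection, batch);
    }
}

private void Flush(SqlConnection connection, DataTable batch)
{
    // Bulk copy the batch into the staging table, then merge it into Points.
    using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "Source" })
    {
        bulkCopy.WriteToServer(batch);
    }

    const string mergeSql = @"
        MERGE INTO Points AS Target
        USING (SELECT Player, SUM(Points) AS Points FROM Source GROUP BY Player) AS Source
        ON Target.Player = Source.Player
        WHEN MATCHED THEN UPDATE SET Target.Points = Target.Points + Source.Points
        WHEN NOT MATCHED THEN INSERT (Player, Points) VALUES (Source.Player, Source.Points);
        TRUNCATE TABLE Source;";

    using (var merge = new SqlCommand(mergeSql, connection))
    {
        merge.ExecuteNonQuery();
    }

    batch.Clear();
}
If several tasks run this concurrently, give each one its own staging table (or pass the batch as a table-valued parameter) so they don't truncate each other's rows mid-merge.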
By streaming you keep the copy operations running full tilt (in fact, most likely the OS will read ahead a bit for you to really ensure you max out the copy speed) and still avoid blowing out memory with the size of these files.
By not using multiple intermediary steps you skip the repeated cost of transferring and reprocessing all that data - this way you do it just once.
And by using the above approach you can easily account for the optimal number of copy operations.
There are optimizations you can make here to make it easily restartable, like having all threads receive a signal to stop what they're doing and record in the database the files they've completed, the one they were working on, and the line they left off on. You could even have them write these values to the database continuously, at a small cost to performance, to make the process crash-proof (by making the line-number and score writes part of a single transaction).
Original answer
You forgot to specify this in your question but it appears these scattered files log the points scored by players playing a game on a cluster of webservers?
This sounds like an embarrassingly parallel problem. Instead of copying massive files off of each machine, why not write a simple app that can run on all of them and distribute it to them? It just sums the points there on the machine and sends back a single number and player id per player over the network, solving the slow network issue.
If this is an on-going task you can timestamp the sums so you never count the same point twice and just run it in batch periodically.
I'd write the webserver apps as a simple webapp that only responds to one IP (the master tally server you were originally going to do everything on), and in response to a request, runs the tally locally and responds with the sum. That way the master server just sends requests out to all the score servers, and waits for them to send back their sums. Done.
You can keep the client apps very simple by just storing the sum data in memory as a Dictionary mapping player id to sum - no SQL necessary.
The tally software can also likely do everything in RAM then dump it all to SQL Server as totals complete to save time.
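A minimal sketch of that per-server tally, assuming the score files use the pipe-delimited layout from the question, System.Collections.Generic and System.IO are imported, and localScoreFile is a placeholder path:
// Map player id to a running point total; no SQL needed on the score servers.
var totals = new Dictionary<string, long>();

foreach (var line in File.ReadLines(localScoreFile))
{
    var parts = line.Split('|');
    if (parts.Length != 2 || !long.TryParse(parts[1].Trim(), out var points))
        continue; // skip headers and separator lines

    var player = parts[0].Trim();
    totals.TryGetValue(player, out var current);
    totals[player] = current + points;
}

// Reply to the master tally server with the contents of 'totals'.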
Fun problem.

Related

Validating the existence of 350 million files over a network

I have a SQL Server table with around 300,000,000 absolute UNC paths and I'm trying to (quickly) validate each one to make sure the path in the SQL Server table actually exists as a file on disk.
At face value, I'm querying the table in batches of 50,000 and incrementing a counter to advance my batch as I go.
Then, I'm using a data reader object to store my current batch set and loop through the batch, checking each file with a File.Exists(path) command, like in the following example.
The problem is, I'm processing at approximately 1,000 files per second max on a quad-core 3.4 GHz i5 with 16 GB RAM, which is going to take days. Is there a faster way to do this?
I do have a columnstore index on the SQL Server table and I've profiled it. I get batches of 50k records in under a second, so it's not a SQL bottleneck when issuing batches to the .NET console app.
while (counter <= MaxRowNum)
{
    command.CommandText = "SELECT id, dbname, location FROM table WHERE ID BETWEEN " + counter + " AND " + (counter + 50000).ToString();
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        var indexOfColumn1 = reader.GetOrdinal("ID");
        var indexOfColumn2 = reader.GetOrdinal("dbname");
        var indexOfColumn3 = reader.GetOrdinal("location");
        while (reader.Read())
        {
            var ID = reader.GetValue(indexOfColumn1);
            var DBName = reader.GetValue(indexOfColumn2);
            var Location = reader.GetValue(indexOfColumn3);
            if (!File.Exists(Location.ToString()))
            {
                // log entry to logging table
            }
        }
    }
    // increment counter to grab next batch
    counter += 50000;
    // report on progress; I realize this might be off and should be incremented based on ID
    Console.WriteLine("Last Record Processed: " + counter.ToString());
    connection.Close();
}
Console.WriteLine("Done");
Console.Read();
EDIT: Adding some additional info:
I thought about doing this all via the database itself; it's SQL Server Enterprise with 2 TB of RAM and 64 cores. The problem is that the SQL Server service account doesn't have access to the NAS paths hosting the data, so my attempts to run xp_cmdshell via a stored procedure failed (I don't control the AD stuff), and the UNC paths have hundreds of thousands of individual subdirectories based on an MD5 hash of the file. So enumerating the contents of directories ends up not being useful, because you may have a directory 10 levels deep housing only 1 file. That's why I have to do a literal full-path match/check.
Oh, and the paths are very long in general. I actually tried loading them all into a list in memory before I realized it was the equivalent of 90 GB of data (lol, oops). Totally agree with other comments on threading it out. The database is super fast, not worried at all there. Hadn't considered SMB chatter though, that very well may be what I'm running up against. – JRats 13 hours ago
Oh! And I'm also only updating the database if a file doesn't exist. If it does, I don't care. So my database runs are minimized to grabbing batches of paths. Basically, we migrated a bunch of data from slower storage to this nimble appliance and I was asked to make sure everything actually made it over by writing something to verify existence per file.
Threading helped quite a bit. I spanned the file check over 4 threads and got my processing power up to about 3,300 records / second, which is far better, but I'm still hoping to get even quicker if I can. Is there a good way to tell if I'm getting bound by SMB traffic? I noticed once I tried to bump up my thread count to 4 or 5, my speed dropped down to a trickle; I thought maybe I was deadlocking somewhere, but no.
Oh, and I can't do a FilesOnNetwork check for the exact reason you said, there's 3 or 4x as many files actually hosted there compared to what I want to check. There's probably 1.5b files or so on that nimble appliance.
Optimizing the SQL side is moot here because you are file IO bound.
I would use Directory.EnumerateFiles to obtain a list of all files that exist. Enumerating the files in a directory should be much faster than testing each file individually.
You can even invert the problem entirely and bulk insert that file list into a database temp table so that you can do SQL based set processing right in the database.
If you want to go ahead and test individually you probably should do this in parallel. It is not clear that the process is really disk bound. Might be network or CPU bound.
Parallelism will help here by overlapping multiple requests. It's the network latency, not the bandwidth that's likely to be the problem. At DOP 1 at least one machine is idle at any given time. There are times where both are idle.
there's 3 or 4x as many files actually hosted there compared to what I want to check
Use the dir /b command to pipe a list of all file names into a .txt file. Execute that locally on the machine that has the files, but if that's impossible, execute it remotely. Then use bcp to bulk insert the list into a table in the database. Then you can do a fast existence check in a single SQL query, which will be highly optimized. You'll be getting a hash join.
If you want to parallelize the dir phase of this strategy you can write a program for that. But maybe there is no need to, and dir is fast enough despite being single-threaded.
The bottleneck most likely is network traffic, or more specifically: SMB traffic. Your machine talks SMB to retrieve the file info from the network storage. SMB traffic is "chatty", you need a few messages to check a file's existence and your permission to read it.
For what it's worth, on my network I can query the existence of about a hundred files per second over SMB, while listing 15K files recursively takes 10 seconds.
What can be faster is to retrieve the remote directory listing on beforehand. This will be trivial if the directory structure is predictable - and if the storage does not contain many irrelevant files in those directories.
Then your code will look like this:
HashSet<string> filesOnNetwork = new HashSet<string>(Directory.EnumerateFiles(
    baseDirectory, "*.*", SearchOption.AllDirectories));

foreach (var fileToCheck in filesFromDatabase)
{
    bool fileToCheckExists = filesOnNetwork.Contains(fileToCheck);
}
This may work adversely if there are many more files on the network than you need to check, as the filling of and searching through filesOnNetwork will become the bottleneck of your application.
In your current solution, getting batches of 50,000 and opening and closing the connection serves NO purpose but to slow things down. DataReader streams: just open it once and read the rows one at a time. Under the covers the reader will fetch batches at a time; a DataReader won't try to jam the client with 300,000,000 rows when you have only read 10.
I think you are worried about optimizing the fastest step - reading from SQL
Validating a file path is going to be the slowest step
I like the answer from CodeCaster, but at 350 million paths you are going to run into object size limits with .NET. And by reading everything into a HashSet, it doesn't start working until that step is done.
I would use a BlockingCollection with two stages:
enumerate files
write to db
The slowest step is reading the file names, so do that as fast as possible and don't interrupt it. Do it on a device close to the storage device - run the program on a SAN-attached machine.
I know you are going to say the write to the db is slow, but it only has to be faster than enumerating the files. Just have a binary column for found - don't write the full filename to a #temp table. I will bet dollars to donuts an (optimized) update is faster than enumerating the files. Chunk your updates, say 10,000 rows at a time, to keep the round trips down. And I would do the update asynchronously so you can build up the next update while the current one is processing.
Then in the end you check the DB for any file that was not marked as found.
Don't go to an intermediate collection first. Process the enumeration directly. This lets you start doing the work immediately and keeps memory down.
foreach (string fileName in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
{
    // write fileName to the blocking collection
}
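Putting those pieces together, a rough sketch of the two-stage pipeline. The dbo.FileList table, its Found bit column, and the 10,000-row chunk size are illustrative; a table-valued parameter or staging table would cut the per-row round trips further.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Threading.Tasks;

class FoundFileMarker
{
    public void Run(string baseDirectory, string connectionString)
    {
        // Bounded so the enumerator can't race miles ahead of the database writer.
        var queue = new BlockingCollection<string>(boundedCapacity: 100000);

        // Consumer: mark existing files as found, in chunks, off the enumeration thread.
        var writer = Task.Run(() =>
        {
            var chunk = new List<string>(10000);
            foreach (var path in queue.GetConsumingEnumerable())
            {
                chunk.Add(path);
                if (chunk.Count == 10000)
                {
                    MarkFound(connectionString, chunk);
                    chunk.Clear();
                }
            }
            if (chunk.Count > 0)
                MarkFound(connectionString, chunk);
        });

        // Producer: enumerate the share as fast as possible and do nothing else here.
        foreach (var file in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
        {
            queue.Add(file);
        }

        queue.CompleteAdding();
        writer.Wait();
    }

    private void MarkFound(string connectionString, List<string> paths)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var command = new SqlCommand(
                "UPDATE dbo.FileList SET Found = 1 WHERE location = @path", connection))
            {
                var param = command.Parameters.Add("@path", SqlDbType.NVarChar, 4000);
                foreach (var path in paths)
                {
                    param.Value = path;
                    command.ExecuteNonQuery();
                }
            }
        }
    }
}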
A quick idea if CodeCaster's approach doesn't work due to there being too many files on the remote servers, and if you are able to install new programs on the remote servers: Write a program that you install on every server, and that listens to some port for HTTP requests (or whichever web service technology you prefer). The program that queries the database should batch up the file names per server, and send a request to each server with all the file names that are located on that server. The web service checks the file existence (which should be fast since it is now a local operation) and responds with e.g. a list containing only the file names that actually did exist. This should eliminate most of the protocol overhead and network latency, since the number of requests is greatly reduced.
If I were doing such a task, I know the bottlenecks would be:
disk access latency (~1 ms)
network access latency (~0.2 ms for 100 Mbps)
the database being limited by its disk
The fastest thing is the CPU cache; the second fastest is RAM.
I assume I could use an additional database table to store temporary data. I'll call the database where the data is now the main database.
I would run two tasks in parallel:
recursive directory reading, saving the results into a second database in chunks of about 50k files;
getting chunks of records from the main database and comparing ONE chunk at a time to ONE table in the second database - all files not found get written to a third database (and files that are found get marked in the main database).
After all chunks from the main database have been compared to the second database, check all chunks from the third database against the second database and delete the files that are found.
At the end, only the non-existent files will be left in the third database, so I can just take the paths from it and mark the data in the main database.
There could be additional improvements; I could discuss them if there is interest.
How about sorting the locations as they are retrieved from the DB (databases are good at sorting)? Then the checks can benefit from cached directory info in the CIFS client.
You could fetch the directory listing for the next row in the result set, then check that row for existence in the listing. While the following rows in the result set fall in the same directory, check them against the already-fetched listing; when the directory changes, fetch the next listing and repeat.
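A sketch of that loop, assuming 'locations' is the stream of paths coming back from the database already sorted, and that System, System.Collections.Generic, and System.IO are imported:
// Cache the listing of the current directory and reuse it while consecutive
// rows stay in the same directory.
string currentDirectory = null;
HashSet<string> listing = null;

foreach (string location in locations)
{
    string directory = Path.GetDirectoryName(location);

    if (directory != currentDirectory)
    {
        currentDirectory = directory;
        listing = Directory.Exists(directory)
            ? new HashSet<string>(Directory.EnumerateFiles(directory), StringComparer.OrdinalIgnoreCase)
            : new HashSet<string>();
    }

    if (!listing.Contains(location))
    {
        // log entry to logging table
    }
}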

MongoDB Disk Usage [duplicate]

The mongodb document says that
To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).
in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
I wonder how to make MongoDB free deleted disk space automatically?
P.S. We stored many download tasks in MongoDB, up to 20 GB, and finished these in half an hour.
In general if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation and the more space that MongoDB can allocate in datafiles the less fragmentation you will have.
So, you should try to provide as much disk-space as possible for the database.
However if you must shrink the database you should keep two things in mind.
MongoDB grows its data files by doubling, so the datafiles may be 64 MB, then 128 MB, etc., up to 2 GB (at which point it stops doubling and each new file is 2 GB).
As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.
Server-side JavaScript
You can use server-side JavaScript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service)...
Assuming a collection called foo, you would save the JavaScript below into a file called bar.js and run ...
$ mongo foo bar.js
The JavaScript file would look something like ...
// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();
print('Storage Size: ' + tojson(storage));
print('TotalSize: ' + tojson(total));
print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');
// Run repair
db.repairDatabase()
// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();
print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));
This will run and return something like ...
MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153
Run this on a schedule (during non-peak hours) and you are good to go.
Capped Collections
However there is one other option, capped collections.
Capped collections are fixed-size collections that have a very high performance auto-FIFO age-out feature (age-out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.
In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
Basically you can limit the size of (or number of documents in) a collection to, say, 20 GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.
This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk-space used.
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.
The process is time consuming, but it should only cost a few seconds of down time, when you do the rs.stepDown().
Also this can not be automated. Well it could, but I don't think I'm willing to try.
Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.
As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.
use cleanup_database
db.dropDatabase();
use oversize_database
db.collection.find({},{}).forEach(function(doc){
db = db.getSiblingDB("cleanup_database");
db.collection_subset.insert(doc);
});
use oversize_database
db.dropDatabase();
use cleanup_database
db.collection_subset.find({},{}).forEach(function(doc){
db = db.getSiblingDB("oversize_database");
db.collection.insert(doc);
});
use oversize_database
<add indexes>
db.collection.ensureIndex({field:1});
use cleanup_database
db.dropDatabase();
An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.
Also, as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost, and your db and collections will be recreated when the inserts for the processes run next. Just make sure your jobs include creating any necessary indexes at an appropriate time.
If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.
To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.
By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:
NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES
#!/bin/bash
# First arg is the host MongoDB is running on, second arg is the MongoDB port
MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath
# make sure the node we are connecting to is not the primary
while (`$MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster'`)
do
    `$MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'`
    sleep 2
done
echo "Node is no longer primary!"
# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop
# Wipe the data files for that server
ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH
# Start up the server again
# similar to shutdown, something like
ssh -t user@$MONGOHOST sudo service mongodb start

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7 GB is too much to load into memory, obviously I need to read it in a stream. The best method I've currently thought up is to make several passes over both files - one pass for each letter of the alphabet, loading, for example, all the domains that start with 'a' in the first pass.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is that even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds while collecting all the domains for the current pass's letter. So I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad-core Xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
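A sketch of that lock-step pass, assuming both prepared files were lower-cased and sorted in plain byte order (e.g. sort with LC_ALL=C) so the domains arrive in a consistent order, and that each line looks like "domain NS nameserver":
using System;
using System.Collections.Generic;
using System.IO;

class ZoneEntry
{
    public string Domain;
    public HashSet<string> NameServers;
}

class ZoneDiff
{
    // Stream a prepared zone file as one entry per domain. Assumes sorting made
    // all lines for a domain adjacent.
    static IEnumerable<ZoneEntry> ReadZones(string path)
    {
        ZoneEntry current = null;
        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 3) continue;

            if (current == null || parts[0] != current.Domain)
            {
                if (current != null) yield return current;
                current = new ZoneEntry { Domain = parts[0], NameServers = new HashSet<string>() };
            }
            current.NameServers.Add(parts[2]);
        }
        if (current != null) yield return current;
    }

    // Walk both files in lock-step and report adds, removals, and nameserver changes.
    static void Compare(string yesterdayPath, string todayPath)
    {
        using (var oldE = ReadZones(yesterdayPath).GetEnumerator())
        using (var newE = ReadZones(todayPath).GetEnumerator())
        {
            bool hasOld = oldE.MoveNext();
            bool hasNew = newE.MoveNext();

            while (hasOld || hasNew)
            {
                int cmp = !hasOld ? 1
                        : !hasNew ? -1
                        : string.CompareOrdinal(oldE.Current.Domain, newE.Current.Domain);

                if (cmp < 0)
                {
                    Console.WriteLine("REMOVED " + oldE.Current.Domain);
                    hasOld = oldE.MoveNext();
                }
                else if (cmp > 0)
                {
                    Console.WriteLine("ADDED " + newE.Current.Domain);
                    hasNew = newE.MoveNext();
                }
                else
                {
                    if (!oldE.Current.NameServers.SetEquals(newE.Current.NameServers))
                        Console.WriteLine("CHANGED " + newE.Current.Domain);
                    hasOld = oldE.MoveNext();
                    hasNew = newE.MoveNext();
                }
            }
        }
    }
}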
The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort the data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing the data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write-ahead log - yeah, I know the second "log" is redundant) and updates data in memory. When it gets around to it, it writes the changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate the indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort the facts, partition them into key/value sets, and send them off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get, for a single domain, the 1 or 2 files it came from with full details. It then emits only the facts of interest: an add means the domain is in the new file but not the old, a drop means it is in the old but not the new, and a registrar change means it is in both but the registrar changed.
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
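A sketch of the chunking half of that approach; the chunk size is an arbitrary assumption, and the resulting sorted chunk files can then be merged (or read in lock-step) for the comparison:
using System;
using System.Collections.Generic;
using System.IO;

class Chunker
{
    // Split a huge unsorted file into sorted chunk files that each fit comfortably in memory.
    static List<string> SortIntoChunks(string inputPath, int linesPerChunk)
    {
        var chunkFiles = new List<string>();
        var buffer = new List<string>(linesPerChunk);

        foreach (var line in File.ReadLines(inputPath))
        {
            buffer.Add(line);
            if (buffer.Count == linesPerChunk)
            {
                chunkFiles.Add(WriteSortedChunk(buffer));
                buffer.Clear();
            }
        }
        if (buffer.Count > 0)
            chunkFiles.Add(WriteSortedChunk(buffer));

        return chunkFiles;
    }

    static string WriteSortedChunk(List<string> lines)
    {
        lines.Sort(StringComparer.Ordinal); // in-memory sort of one chunk
        var path = Path.GetTempFileName();  // temp file holds the sorted chunk
        File.WriteAllLines(path, lines);
        return path;
    }
}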
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

How do I quickly search for data stored in an XML file within a large number of directories?

A project I have been working on for the past year writes out log files to a network drive every time a task is performed. We currently have about 1300 folders, each with a number of files and folders. The important part here is that we have about 3-4 XML files on average within each folder that contain serial numbers and other identifying data for our products. Early on it was easy to just use Windows file search, but now it can take over 10 minutes to search that way. The need for a log viewer/searcher has arisen and I need to figure out how to make it fast. We are considering a number of ideas. One is a locally stored XML index file, but I think that will eventually get too big to be fast. The second is to create a folder-watching service that writes index information to a SQL database and links to the files. The third idea is to have the program writing the log files also write index information to the database. The second (database) option sounds like the best one right now, since we already have a bunch of history that will need to be indexed, but to me it sounds a little convoluted. So my question in short is: How can I quickly search for information contained in XML files in a large and constantly growing number of directories?
We ended up using an SQL database to do this, for many reasons:
It maintained much faster query times under a simulated 10 years of data growth (1.5-2 million entries still came back in under 2 seconds) than XML (about 20-30 seconds).
We can have the application directly publish data to the database, this removes the need for a crawler (other than for initial legacy data).
There could have been potential issues with file locking if we decided to host the file on the network somewhere.

Storing Real Time data into 1000 files

I have a program that receives real time data on 1000 topics. It receives -- on average -- 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.
I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.
I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.
I've considered a few approaches:
1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.
2) Write into one file and -- somehow -- process it later to produce 1000 files.
3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough RAM, although I might need to move to 64-bit to get over the 2 GB limit.
How would you approach this problem?
I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.
If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.
Seriously. Database it.
Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.
Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.
Like n8wrl, I would also recommend a DB. But if you really dislike that idea ...
Let's find another solution ;-)
As a minimal approach I would take two threads. The first is a worker, receiving all the data and putting each object (timestamp plus the two strings) into a queue.
Another thread checks this queue (maybe by being notified via an event or by checking the Count property). This thread dequeues each object, opens the specific file, writes it out, closes the file, and proceeds to the next item.
I would start with this first approach and take a look at the performance. If it sucks, do some measuring to see where the problem is and try to address it (put open files into a dictionary (name, StreamWriter), etc.).
But on the other side a DB would be so fine for this problem...
One table, four columns (id, timestamp, topic, message), one additional index on topic, ready.
First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec. Each minute - 600 MB. Well, you could drop that in RAM. Then flush each hour.
Edit: corrected mistake. Sorry, my bad.
I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
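A rough sketch of that layout; the queues are created on demand (one per topic), and the write threshold, output directory, and file naming are assumptions:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;

class TopicLogger
{
    private const int WriteThreshold = 100; // defer a topic until it has this many messages

    private readonly ConcurrentDictionary<string, ConcurrentQueue<string>> _queues =
        new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    // Called by the receiving thread for every incoming message.
    public void Enqueue(string topic, string message)
    {
        var line = DateTime.UtcNow.ToString("o") + "\t" + message;
        _queues.GetOrAdd(topic, _ => new ConcurrentQueue<string>()).Enqueue(line);
    }

    // Run on one (or more) dedicated writer threads.
    public void WriteLoop(CancellationToken token)
    {
        Directory.CreateDirectory("logs");

        while (!token.IsCancellationRequested)
        {
            foreach (var pair in _queues)
            {
                if (pair.Value.Count < WriteThreshold)
                    continue;

                var lines = new List<string>();
                while (lines.Count < WriteThreshold && pair.Value.TryDequeue(out var line))
                    lines.Add(line);

                // Open, append a batch, close - so 1000 file handles are never held at once.
                File.AppendAllLines(Path.Combine("logs", pair.Key + ".txt"), lines);
            }

            Thread.Sleep(50); // brief pause so an idle loop doesn't spin the CPU
        }
        // Note: a final drain of partially filled queues on shutdown is omitted for brevity.
    }
}
The topic is assumed to be safe to use as a file name here; in practice you would sanitize it first.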
I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...
1) 1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.
2) This is close to db-ish-ness! Also sounds like more trouble than it's worth.
3) RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.
How would I approach this? DB! Because then I can query, index, analyze, etc., etc.
:)
I would agree with Kyle and go with a packaged product like PI. Be aware PI is quite expensive.
If you're looking for a custom solution I'd go with Stephen's, with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file, though, to hand off the messages to the other process, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure on the speed of that.
I would also recommend using a db to store your data. You'll want to do bulk inserts of data into the db, though, as I think you would need some hefty hardware to allow SQL to do 5,000 transactions a second. You're better off doing a bulk insert of every, say, 10,000 messages that accumulate in the queue.
DATA SIZES:
Average message ~50 bytes:
small datetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.
50 bytes * 5000 messages/sec = ~244 KB/sec -> ~14 MB/min -> ~858 MB/hour
Perhaps you don't want the overhead of a DB install?
In that case, you could try a filesystem-based database like sqlite:
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out. Doing things this way allows you to minimize the number of open file handles while still handling the incoming requests in real time. If you make the first program format its output correctly then splitting it into the individual files should be simple.
I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file (append operations are as fast as they can be) and use a separate process/service to split the file up into the 1000 files. You could even roll over the file every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the completed one up into 1000 separate files.
All this does beg the question of why not a DB, and why you need 1000 different files - you may have a very good reason - but then again, perhaps you should re-think your strategy and make sure the reasoning is sound before you go too far down this path.
I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried things like this with files and an MS SQL database before and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs and they even have packages where you can query the data just as if it were SQL.
It wouldn't allow me to post hyperlinks, so just Google those two products and you will find information on them.
EDIT
If you do use a database like most people are suggesting I would recommend a table for each topic for historical data and consider table partitioning, indexes, and how long you are going to store the data.
For example, if you are going to store a day's worth and it's one table for each topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432,000 records a day. After exporting the data I would imagine you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock (and you'll need MS SQL Enterprise Edition for online index rebuilding). If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.
Basically what I'm saying is: weigh the cost of purchasing a reliable product against building your own.
