The MongoDB documentation says:
To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).
in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
I wonder how to make MongoDB free the deleted disk space automatically?
P.S. We store many download tasks in MongoDB, up to 20 GB, and finish them within half an hour.
In general, if you don't need to shrink your datafiles, you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation, and the more space MongoDB can allocate in datafiles, the less fragmentation you will have.
So, you should try to provide as much disk-space as possible for the database.
However if you must shrink the database you should keep two things in mind.
MongoDB grows its data files by doubling, so the datafiles may be 64MB, then 128MB, etc., up to 2GB (at which point it stops doubling and keeps each file at 2GB).
As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So, you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.
Server-side JavaScript
You can use server-side JavaScript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service) ...
Assuming a collection called foo, you would save the JavaScript below into a file called bar.js and run ...
$ mongo foo bar.js
The JavaScript file would look something like ...
// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();
print('Storage Size: ' + tojson(storage));
print('TotalSize: ' + tojson(total));
print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');
// Run repair
db.repairDatabase()
// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();
print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));
This will run and return something like ...
MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153
Run this on a schedule (during non-peak hours) and you are good to go.
Capped Collections
However, there is one other option: capped collections.
Capped collections are fixed-size collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.
In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
Basically you can limit the size of (or the number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.
This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk space used.
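For example, a capped collection of roughly that size could be created through the official C# driver; this is a minimal sketch assuming the legacy 1.x driver API, with hypothetical database/collection names (the mongo shell equivalent is db.createCollection with capped: true and a size in bytes):
using MongoDB.Driver;
using MongoDB.Driver.Builders;

var server = MongoServer.Create("mongodb://localhost:27017"); // hypothetical connection string
var db = server.GetDatabase("downloads");                     // hypothetical database name
if (!db.CollectionExists("tasks"))
{
    var options = CollectionOptions
        .SetCapped(true)
        .SetMaxSize(20L * 1024 * 1024 * 1024); // cap the collection at roughly 20GB on disk
    db.CreateCollection("tasks", options);
}
Once created, inserts into the capped collection simply overwrite the oldest documents when the size limit is hit, so no scheduled shrink job is needed.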
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is that once you've removed all of the excess data that's gobbling your disk, you stop a secondary, wipe its data directory, start it up, and let it resynchronize with the master.
The process is time-consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().
Also, this cannot easily be automated. Well, it could, but I don't think I'm willing to try.
Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.
As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.
use cleanup_database
db.dropDatabase();

use oversize_database
// copy only the documents you need to keep into the clean database
db.collection.find({},{}).forEach(function(doc){
    db = db.getSiblingDB("cleanup_database");
    db.collection_subset.insert(doc);
});

use oversize_database
db.dropDatabase();

use cleanup_database
// move the data back into a fresh database with the original name
db.collection_subset.find({},{}).forEach(function(doc){
    db = db.getSiblingDB("oversize_database");
    db.collection.insert(doc);
});

use oversize_database
// <add indexes>
db.collection.ensureIndex({field:1});

use cleanup_database
db.dropDatabase();
An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.
Also, as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost and your db and collections will be recreated when the inserts for the processes run next. Just make sure your jobs include creating any necessary indexes at an appropriate time.
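As a rough sketch of that pattern with the official C# driver (legacy 1.x API; the database, collection and field names here are hypothetical), the cleanup step after a job could look like this:
using MongoDB.Driver;
using MongoDB.Driver.Builders;

var server = MongoServer.Create("mongodb://localhost:27017"); // hypothetical connection string
// Throw the transient data away once the job is done...
server.DropDatabase("processing_db");
// ...and recreate the indexes the next run expects; the database and collection
// are created lazily again on first use.
var tasks = server.GetDatabase("processing_db").GetCollection("tasks");
tasks.EnsureIndex(IndexKeys.Ascending("status"));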
If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.
To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.
By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:
NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES
#!/bin/bash
# First arg is host MongoDB is running on, second arg is the MongoDB port
MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath
# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster')" = "true" ]
do
    $MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'
    sleep 2
done
echo "Node is no longer primary!"
# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop
# Wipe the data files for that server
ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH
# Start up server again
# similar to shutdown something like
ssh -t user@$MONGOHOST sudo service mongodb start
We have an application that over time stores immense amounts of data for our users (talking hundreds of TB or more here). Due to new EU directives, should a user decide to discontinue using our services, all their data must be available for export for the next 80 days, after which it MUST be eradicated completely. The data is stored in Azure storage block blobs, and the metadata in an SQL database.
Sadly, the data cannot be exported as-is (it is in a proprietary format), so it would need to be processed and converted to PDF for export. A file is approximately 240KB in size, so imagine the amount of PDFs for the TB value stated above.
We tried using Functions to split the job into tiny 50-value chunks, but it went haywire at some point and created enormous costs, spinning out of control.
So what we're looking for is this:
Can be run on demand from a web trigger/queue/db entry
Is pay-per-use, as this will occur at random times and (so we hope) rarely.
Can process extreme amounts of data fairly effectively at minimum cost.
Is easy to maintain and keep track of. The Functions jobs were just fire and pray (utter chaos) due to their number and parallel processing.
Does anyone know of a service fitting our requirements?
Here's a getting started link for .NET, Python, or Node.js:
https://learn.microsoft.com/en-us/azure/batch/batch-dotnet-get-started
The concept in batch is pretty simple, although it takes a bit of fiddling to get it working the first time, in my experience. I'll try to explain what it involves, to the best of my ability. Any suggestions or comments are welcome.
The following concepts are important:
The Pool. This is an abstraction of all the nodes (i.e. virtual machines) that you provision to do work. These could be running Linux, Windows Server or any of the other offerings that Azure has. You can provision a pool through the API.
The Jobs. This is an abstraction where you place the 'Tasks' you need executed. Each task is a command-line execution of your executable, possibly with some arguments.
Your tasks are picked up one by one by an available node in the pool, which executes the command that the task specifies. Available on the node are your executables and a file that you assigned to the task, containing data identifying, in your case, which users should be processed by that task.
So suppose in your case that you need to perform the processing for 100 users. Each individual processing job is an execution of some executable you create, e.g. ProcessUserData.exe.
As an example, suppose your executable takes, in addition to a userId, an argument specifying whether this should be performed in test or prod, so e.g.
ProcessUserData.exe "path to file containing user ids to process" --environment test.
We'll assume that your executable doesn't need other input than the user id and the environment in which to perform the processing.
You upload all the application files to a blob (named "application blob" in the following). This consists of your main executable along with any dependencies. It will all end up in a folder on each node (virtual machine) in your pool, once provisioned. The folder is identified through an environment variable created on each node in your pool so that you can find it easily.
See https://learn.microsoft.com/en-us/azure/batch/batch-compute-node-environment-variables
In this example, you create 10 input files, each containing 10 userIds (100 userIds total) that should be processed, one file for each of the command line tasks. Each file could contain 1 user id or 10 userIds; it's entirely up to you how you want your main executable to parse this file and process the input. You upload these to the 'input' blob container.
These will also end up in a directory identified by an environment variable on each node, so it is also easy to construct a path to them in the command line for each task.
When uploaded to the input container, you will receive a reference (ResourceFile) to each input file. One input file should be associated with one "Task" and each task is passed to an available node as the job executes.
The details of how to do this are clear from the getting started link, I'm trying to focus on the concepts, so I won't go into much detail.
You now create the tasks (CloudTask) to be executed, specify what it should run on the command line, and add them to the job. Here you reference the input file that each task should take as input.
An example (assuming Windows cmd):
cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserdata.exe %AZ_BATCH_TASK_DIR%\userIds1.txt --environment test
Here, userIds1.txt is the filename your first ResourceFile returned when you uploaded the input files. The next command will specify userIds2.txt, etc.
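A rough C# sketch of that step, assuming inputFiles holds the ResourceFile references you collected when uploading the input files (names and task IDs here are illustrative):
using System.Collections.Generic;
using Microsoft.Azure.Batch;

var tasks = new List<CloudTask>();
for (int i = 0; i < inputFiles.Count; i++) // inputFiles: List<ResourceFile> from the upload step
{
    string inputFileName = inputFiles[i].FilePath; // e.g. "userIds1.txt"
    string commandLine = @"cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserdata.exe %AZ_BATCH_TASK_DIR%\"
                         + inputFileName + " --environment test";

    var task = new CloudTask("processUsers" + (i + 1), commandLine)
    {
        // Attach the input file so Batch downloads it onto the node before the task runs
        ResourceFiles = new List<ResourceFile> { inputFiles[i] }
    };
    tasks.Add(task);
}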
When you've created your list of CloudTask objects containing the commands, you add them to the job, e.g. in C#:
await batchClient.JobOperations.AddTaskAsync(jobId, tasks);
And now you wait for the job to finish.
What happens now is that Azure Batch looks at the nodes in the pool and, while there are more tasks in the task list, assigns a task to each available (idle) node.
Once completed (which you can poll for through the API), you can delete the pool, the job and pay only for the compute that you've used.
One final note: Your tasks may depend on external packages, i.e. an execution environment that is not installed by default on the OS you've selected, so there are a few possible ways of resolving this:
1. Upload an application package, which will be distributed to each node as it enters the pool (again, there's an environment variable pointing to it). This can be done through the Azure Portal.
2. Use a command line tool to get what you need, e.g. apt-get install on Ubuntu.
Hope that gives you an overview of what Batch is. In my opinion the best way to get started is to do something very simple, i.e. print environment variables in a single task on a single node.
You can inspect the stdout and stderr of each node while the execution is underway, again through the portal.
There's obviously a lot more to it than this, but this is a basic guide. You can create linked tasks and a lot of other nifty things, but you can read up on that if you need it.
Since a lot of people are looking for a solution to this kind of requirement: the new release of ADLA (Azure Data Lake Analytics) now supports the Parquet format, along with U-SQL. With less than 100 lines of code you can now combine these small files into large files, and with fewer resources (vertices) you can compress the data into Parquet files. For example, you can store 3 TB of data in 10,000 Parquet files. Reading these files is also very simple, and you can create CSV files from them as required in no time. This will certainly save you a lot of cost and time.
I have the following challenge:
I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. In each thread, it needs to read some metadata about how to process the thread from 3 objects. The objects/data reside in a remote RavenDB and even though RavenDB is very fast at retrieving the objects via HTTP, it is still under considerable load from 30+ workers that are hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDB does not change.
I've decided to implement local storage caching. First, I read a tiny record which indicates if the metadata has changed (it changes VERY rarely), and then I read from local file storage instead of RavenDb, if local storage has the object cached. I'm using File.ReadAllText()
This approach appears to be bogging the machine down and processing slows down considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.
Is there anyway, I can have OS help me out and cache those files? Perhaps there is an alternative to caching of this data?
I'm looking at about ~1000 files of varying sizes ranging from 100k to 10mb in size stored on each Cloud Role instance
Not a straight answer, but three possible options:
Use the built-in RavenDB caching mechanism
My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching)
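For instance, a minimal sketch of aggressive caching with the 3.5 .NET client (the URL, database name and document ID below are assumptions; check the exact API against your client version):
using System;
using Raven.Client;
using Raven.Client.Document;

var store = new DocumentStore
{
    Url = "http://your-ravendb-server:8080", // assumption: your RavenDB URL
    DefaultDatabase = "Metadata"             // assumption: your database name
}.Initialize();

// Everything loaded inside this block is served from the client cache for up to 5 minutes,
// so the three metadata reads per thread stop hammering the server.
using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = store.OpenSession())
{
    var metadata = session.Load<object>("config/processing"); // hypothetical document id
}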
The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.
One possible option you could implement is divide the workload. For example:
Server A => fetch files that start with A-D
Server B => fetch files that start with E-H
Server C => ...
This would ensure that you optimize the cache on each server.
Get a bigger machine
If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:
Disk access
Deserialization of the JSON
For these issues, the only thing I can imagine would be to get bigger resources:
If it's the disk, use premium storage with SSDs.
If it's deserialization, get VMs with a bigger CPU.
Cache files in RAM
Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You shouldn't need that much RAM, since 1,000 files at up to 10 MB each is at most about 10 GB. Doing this would eliminate disk access and deserialization.
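A rough sketch of that idea using MemoryCache from System.Runtime.Caching (the cache duration and method shape are assumptions; the point is just to keep file contents in RAM after the first read):
using System;
using System.IO;
using System.Runtime.Caching;

// inside your worker role class:
static readonly MemoryCache Cache = MemoryCache.Default;

static string GetMetadataJson(string path)
{
    // Serve from RAM if we already have it; otherwise read it once from disk and keep it around.
    var cached = Cache.Get(path) as string;
    if (cached != null)
        return cached;

    string json = File.ReadAllText(path);
    Cache.Set(path, json, new CacheItemPolicy { SlidingExpiration = TimeSpan.FromHours(1) });
    return json;
}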
But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.
I have a SQL Server table with around ~300,000,000 absolute UNC paths and I'm trying to (quickly) validate each one to make sure the path in the SQL Server table actually exists as a file on disk.
At face value, I'm querying the table in batches of 50,000 and incrementing a counter to advance my batch as I go.
Then, I'm using a data reader object to store my current batch set and loop through the batch, checking each file with a File.Exists(path) command, like in the following example.
Problem is, I'm processing at approx. 1,000 files per second max on a quad-core 3.4 GHz i5 with 16 GB RAM, which is going to take days. Is there a faster way to do this?
I do have a columnstore index on the SQL Server table and I've profiled it. I get batches of 50k records in <1s, so it's not a SQL bottleneck when issuing batches to the .NET console app.
while (counter <= MaxRowNum)
{
command.CommandText = "SELECT id, dbname, location FROM table where ID BETWEEN " + counter + " AND " + (counter+50000).ToString();
connection.Open();
using (var reader = command.ExecuteReader())
{
var indexOfColumn1 = reader.GetOrdinal("ID");
var indexOfColumn2 = reader.GetOrdinal("dbname");
var indexOfColumn3 = reader.GetOrdinal("location");
while (reader.Read())
{
var ID = reader.GetValue(indexOfColumn1);
var DBName = reader.GetValue(indexOfColumn2);
var Location = reader.GetValue(indexOfColumn3);
if (!File.Exists(Location.ToString()))
{
//log entry to logging table
}
}
}
// increment counter to grab next batch
counter += 50000;
// report on progress, I realize this might be off and should be incremented based on ID
Console.WriteLine("Last Record Processed: " + counter.ToString());
connection.Close();
}
Console.WriteLine("Done");
Console.Read();
EDIT: Adding some additional info:
I thought about doing this all via the database itself; it's SQL Server Enterprise with 2 TB of RAM and 64 cores. The problem is the SQL Server service account doesn't have access to the NAS paths hosting the data, so my cmdshell run via an SP failed (I don't control the AD stuff), and the UNC paths have hundreds of thousands of individual subdirectories based on an MD5 hash of the file. So enumerating the contents of directories ends up not being useful, because you may have a path 10 directories deep housing only 1 file. That's why I have to do a literal full path match/check.
Oh, and the paths are very long in general. I actually tried loading them all into a list in memory before I realized it was the equivalent of 90 GB of data (lol, oops). Totally agree with other comments on threading it out. The database is super fast, not worried at all there. Hadn't considered SMB chatter though, that very well may be what I'm running up against.
Oh! And I'm also only updating the database if a file doesn't exist. If it does, I don't care. So my database runs are minimized to grabbing batches of paths. Basically, we migrated a bunch of data from slower storage to this Nimble appliance and I was asked to make sure everything actually made it over by writing something to verify existence per file.
Threading helped quite a bit. I spanned the file check over 4 threads and got my processing power up to about 3,300 records/second, which is far better, but I'm still hoping to get even quicker if I can. Is there a good way to tell if I'm getting bound by SMB traffic? I noticed that once I tried to bump up my thread count to 4 or 5, my speed dropped to a trickle; I thought maybe I was deadlocking somewhere, but no.
Oh, and I can't do a FilesOnNetwork check for the exact reason you said: there are 3 or 4x as many files actually hosted there compared to what I want to check. There are probably 1.5B files or so on that Nimble appliance.
Optimizing the SQL side is moot here because you are file IO bound.
I would use Directory.EnumerateFiles to obtain a list of all files that exist. Enumerating the files in a directory should be much faster than testing each file individually.
You can even invert the problem entirely and bulk insert that file list into a database temp table so that you can do SQL based set processing right in the database.
If you want to go ahead and test individually, you should probably do this in parallel. It is not clear that the process is really disk bound; it might be network or CPU bound.
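For example, a minimal parallel check over one batch could look like this (locations is assumed to be the list of paths you just read from SQL; the degree of parallelism is something to tune by measuring):
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var missing = new ConcurrentBag<string>();
Parallel.ForEach(locations, new ParallelOptions { MaxDegreeOfParallelism = 16 }, path =>
{
    // File.Exists against a UNC path is mostly network latency, so overlapping requests helps
    if (!File.Exists(path))
        missing.Add(path);
});
// then log the contents of `missing` to the logging table in one batch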
Parallelism will help here by overlapping multiple requests. It's the network latency, not the bandwidth, that's likely to be the problem. At DOP 1, at least one machine is idle at any given time; there are times when both are idle.
there's 3 or 4x as many files actually hosted there compared to what I want to check
Use the dir /b command to pipe a list of all file names into a .txt file. Execute that locally on the machine that has the files, but if impossible execute remotely. Then use bcp to bulk insert them into a table into the database. Then, you can do a fast existence check in a single SQL query which will be highly optimized. You'll be getting a hash join.
If you want to parallelize the dir phase of this strategy, you can write a program for that. But maybe there is no need to, and dir is fast enough despite being single-threaded.
The bottleneck most likely is network traffic, or more specifically: SMB traffic. Your machine talks SMB to retrieve the file info from the network storage. SMB traffic is "chatty", you need a few messages to check a file's existence and your permission to read it.
For what it's worth, on my network I can query the existence of about a hundred files per second over SMB, while listing 15K files recursively takes 10 seconds.
What can be faster is to retrieve the remote directory listing beforehand. This will be trivial if the directory structure is predictable, and if the storage does not contain many irrelevant files in those directories.
Then your code will look like this:
HashSet<string> filesOnNetwork = new HashSet<string>(Directory.EnumerateFiles(
    baseDirectory, "*.*", SearchOption.AllDirectories));

foreach (var fileToCheck in filesFromDatabase)
{
    bool fileToCheckExists = filesOnNetwork.Contains(fileToCheck);
}
This may work adversely if there are many more files on the network than you need to check, as the filling of and searching through filesOnNetwork will become the bottleneck of your application.
In your current solution, getting batches of 50,000 and opening and closing the connection serves NO purpose but to slow things down. DataReader streams: just open it once and read the rows one at a time. Under the covers the reader fetches batches at a time; it won't try to jam the client with 300,000,000 rows when you have only read 10.
I think you are worried about optimizing the fastest step - reading from SQL
Validating a file path is going to be the slowest step
I like the answer from CodeCaster, but at 350 million paths you are going to run into object size limits in .NET. And by reading everything into a HashSet first, the work does not start until that load is done.
I would use a BlockingCollection connecting two tasks:
enumerate files
write to db
The slowest step is reading the file names, so do that as fast as possible and don't interrupt it. Do it on a device close to the storage device; run the program on a SAN-attached machine.
I know you are going to say writing to the db is slow, but it only has to be faster than enumerating the files. Just have a binary column for found; don't write the full filename to a #temp table. I will bet dollars to donuts an (optimized) update is faster than the file enumeration. Chunk your updates, like 10,000 rows at a time, to keep the round trips down. And I would do the update asynchronously so you can build up the next update while the current one is processing.
Then at the end you check the DB for any file that was not marked as found.
Don't go to an intermediate collection first. Process the enumeration directly. This lets you start doing the work immediately and keeps memory down.
foreach (string fileName in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
{
// write filename to blocking collection
}
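Fleshed out a bit, the producer/consumer shape could look like this (baseDirectory as above; MarkFoundInDatabase is a hypothetical helper that flags those rows as found in chunks):
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

var found = new BlockingCollection<string>(boundedCapacity: 100000);

var producer = Task.Run(() =>
{
    // Slowest step: stream file names straight off the storage into the collection
    foreach (string fileName in Directory.EnumerateFiles(baseDirectory, "*.*", SearchOption.AllDirectories))
        found.Add(fileName);
    found.CompleteAdding();
});

var consumer = Task.Run(() =>
{
    // Drain the collection and update the database 10,000 rows at a time to keep round trips down
    var chunk = new List<string>(10000);
    foreach (string fileName in found.GetConsumingEnumerable())
    {
        chunk.Add(fileName);
        if (chunk.Count == 10000)
        {
            MarkFoundInDatabase(chunk); // hypothetical helper: UPDATE ... SET found = 1 for these paths
            chunk.Clear();
        }
    }
    if (chunk.Count > 0)
        MarkFoundInDatabase(chunk);
});

Task.WaitAll(producer, consumer);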
A quick idea if CodeCaster's approach doesn't work due to there being too many files on the remote servers, and if you are able to install new programs on the remote servers: Write a program that you install on every server, and that listens to some port for HTTP requests (or whichever web service technology you prefer). The program that queries the database should batch up the file names per server, and send a request to each server with all the file names that are located on that server. The web service checks the file existence (which should be fast since it is now a local operation) and responds with e.g. a list containing only the file names that actually did exist. This should eliminate most of the protocol overhead and network latency, since the number of requests is greatly reduced.
If I were doing such a task, I know that the bottlenecks would be:
disk access latency (~1 ms)
network access latency (~0.2 ms for 100 Mbps)
the database, limited by disk
The fastest storage is the CPU cache; the second fastest is RAM.
I assume I could use an additional database table to store temporary data.
I will call the database where the data currently lives the main database.
I would do two tasks in parallel:
recursive directory reading, saving the results to a second database in chunks of about 50k files.
getting chunks of records from the main database and comparing ONE chunk to ONE table from the second database - all files not found are written to a third database (and files that do exist are marked in the main database).
After all chunks from the main database have been compared to the second database, check all chunks from the third database against the second database and delete the files that are found.
At the end, only non-existent files are left in the third database, so I can just take the paths from it and mark the data in the main database.
There could be additional improvements; happy to discuss if there is interest.
How about sorting the locations as they are retrieved from the DB (databases are good at sorting)? Then the checks can benefit from cached directory info in the CIFS client.
You could get the directory listing for the next row in the result set, then check that row for existence in the listing, then check whether the next row in the result set is in the same directory; if so, check the already-fetched listing, and if not, repeat the outer loop.
I have a lot of files lying out on random file shares. I have to copy them into my SQL Server 2008 database and sum up all of the points. The overhead of copying the files from the network into C# and then into the database makes this process slow, and I have thousands of very large files to process.
File 1 example
Player | Points
---------------
Bean | 10
Ender | 15
File 2 example
Player | Points
---------------
Ender | 20
Peter | 5
Result
Player | Points
---------------
Bean | 10
Ender | 35
Peter | 5
Current approach: using C#, read each file into the database and merge into the master table.
MERGE INTO Points as Target
USING Source as Source
ON Target.Player = Source.Player
WHEN MATCHED THEN
UPDATE SET Target.Points = Target.Points + Source.Points
WHEN NOT MATCHED THEN
INSERT (Player, Points) VALUES (Source.Player, Source.Points);
This approach is fine, but I'm looking for ideas for improvement (it's kind of slow).
Proposed solution:
Read each file into a SQLite database (based on my reading, this should be very fast), bulk load the entire database into my SQL Server database, and do all of the processing there. I should be able to assign a rank to each player, thus speeding up the grouping since I'm not comparing based on a text column. The downside of the proposed solution is that it can't work on multiple threads.
What's the fastest way to get all of these files into the database and then aggregate them?
Edit: A little more background on the files I forgot to mention
These files are located on several servers
I need to keep the "impact" of this task to a minimum - so no installing of apps
The files can be huge - as much as 1 GB per file, so doing anything in memory is not an option
There are thousands of files to process
So, if you can't/don't want to run code to start the parsing operation on the individual servers containing these files, and transferring the gigs and gigs of them is slow, then whether this is multithreaded is probably irrelevant - the performance bottleneck in your process is the file transfer.
So to make some assumptions:
There's the one master server and only it does any work.
It has immediate (if slow) access to all the file shares necessary, accessible by a simple path, and you know those paths.
The master tally server has a local database sitting on it to store player scores.
If you can transfer multiple files just as fast as you can transfer one, I'd write code that did the following:
Gather the list of files that need to be aggregated - this at least should be a small and cheap list. Gather them into a ConcurrentBag.
Spin up as many Tasks as the bandwidth on the machine will allow you to run copy operations. You'll need to test to determine what this is.
Each Task takes the ConcurrentBag as an argument. It begins with a loop running TryTake() until it succeeds - once it's successfully removed a filepath from the bag it begins reading directly from the file location and parsing, adding each user's score to whatever is currently in the local database for that user.
Once a Task finishes working on a file it resumes trying to get the next filepath from the ConcurrentBag and so forth.
Eventually all filepaths have been worked on and the Tasks end.
So the code would be roughly:
public void Start()
{
    var bag = new ConcurrentBag<string>();
    // Populate the bag here with the file paths gathered in step 1.
    for(var i = 0; i < COPY_OPERATIONS; i++)
    {
        Task.Factory.StartNew(() =>
        {
            StartCopy(bag);
        });
    }
}

public void StartCopy(ConcurrentBag<string> bag)
{
    while (true)
    {
        // Loop until the bag hands us a path to work on
        string path = null;
        while (!bag.IsEmpty && !bag.TryTake(out path))
        {}

        // Bag is empty: nothing left to process, so this task is done
        if (path == null)
            return;

        // Access the file via a stream and begin parsing it, dumping scores to the db
    }
}
By streaming you keep the copy operations running full tilt (in fact most likely the OS will read ahead a bit for you to really ensure you max out the copy speed) and still avoid blowing out memory with the size of these files.
By not using multiple intermediary steps you skip the repeated cost of transferring and considering all that data - this way you do it just the once.
And by using the above approach you can easily account for the optimal number of copy operations.
There are optimizations you can make here to make it easily restartable, like having all threads receive a signal to stop what they're doing and record in the database the files they've worked on, the one they are working on now, and the line they left off on. You could have them constantly write these values to the database at a small cost to performance to make it crash-proof (by making the line number and score writes part of a single transaction).
Original answer
You forgot to specify this in your question but it appears these scattered files log the points scored by players playing a game on a cluster of webservers?
This sounds like an embarrassingly parallel problem. Instead of copying massive files off of each machine, why not write a simple app that can run on all of them and distribute it to them? It just sums the points there on the machine and sends back a single number and player id per player over the network, solving the slow network issue.
If this is an on-going task you can timestamp the sums so you never count the same point twice and just run it in batch periodically.
I'd write the webserver apps as a simple webapp that only responds to one IP (the master tally server you were originally going to do everything on), and in response to a request, runs the tally locally and responds with the sum. That way the master server just sends requests out to all the score servers, and waits for them to send back their sums. Done.
You can keep the client apps very simple by just storing the sum data in memory as a Dictionary mapping player id to sum - no SQL necessary.
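A tiny sketch of that in-memory tally, assuming the pipe-separated "Player | Points" layout shown above with a two-line header (logPath is a local points file):
using System.Collections.Generic;
using System.IO;
using System.Linq;

var totals = new Dictionary<string, long>();
foreach (string line in File.ReadLines(logPath).Skip(2)) // skip the header and separator rows
{
    var parts = line.Split('|');
    string player = parts[0].Trim();
    long points = long.Parse(parts[1].Trim());

    long current;
    totals[player] = totals.TryGetValue(player, out current) ? current + points : points;
}
// respond to the master tally server with the contents of `totals`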
The tally software can also likely do everything in RAM then dump it all to SQL Server as totals complete to save time.
Fun problem.
I am having performance issues with MongoDB.
Running on:
MongoDB 2.0.1
Windows 2008 R2
12 GB RAM
2 TB HDD (5400 rpm)
I've written a daemon which removes and inserts records asynchronously. Each hour most of the collections are cleared and they get newly inserted data (10-12 million deletes and 10-12 million inserts). The daemon uses ~60-80% of the CPU while inserting the data (due to calculating 1+ million knapsack problems). When I fire up the daemon it can do its job for about 1-2 minutes until it crashes due to a socket timeout (writing data to the MongoDB server).
When I look in the logs I see it takes about 30 seconds to remove data from a collection. It seems to have something to do with the CPU load and memory usage, because when I run the daemon on a different PC everything goes fine.
Is there any optimization possible, or am I just bound to using a separate PC for running the daemon (or picking another document store)?
UPDATE 11/13/2011 18:44 GMT+1
Still having problems. I've made some modifications to my daemon: I've decreased the number of concurrent writes. However, the daemon still crashes when the memory is getting full (11.8 GB of 12 GB) and it receives more load (loading data into the frontend). It crashes due to a long insert/remove in MongoDB (30 seconds). The crash of the daemon happens because MongoDB responds slowly (socket timeout exception). Of course there should be try/catch statements to catch such exceptions, but they should not happen in the first place. I'm looking for a solution to solve this issue instead of working around it.
Total storage size is: 8.1 GB
Index size is: 2.1 GB
I guess the problem is that the working set + indexes are too large to fit in memory and MongoDB needs to access the HDD (which is a slow 5,400 rpm disk). However, why would this be a problem? Aren't there other strategies to store the collections (e.g. in separate files instead of large chunks of 2 GB)? If a relational database can read/write data from disk in an acceptable amount of time, why can't MongoDB?
UPDATE 11/15/2011 00:04 GMT+1
Log file to illustrate the issue:
00:02:46 [conn3] insert bargains.auction-history-eu-bloodhoof-horde 421ms
00:02:47 [conn6] insert bargains.auction-history-eu-blackhand-horde 1357ms
00:02:48 [conn3] insert bargains.auction-history-eu-bloodhoof-alliance 577ms
00:02:48 [conn6] insert bargains.auction-history-eu-blackhand-alliance 499ms
00:02:49 [conn4] remove bargains.crafts-eu-agamaggan-horde 34881ms
00:02:49 [conn5] remove bargains.crafts-eu-aggramar-horde 3135ms
00:02:49 [conn5] insert bargains.crafts-eu-aggramar-horde 234ms
00:02:50 [conn2] remove bargains.auctions-eu-aerie-peak-horde 36223ms
00:02:52 [conn5] remove bargains.auctions-eu-aegwynn-horde 1700ms
UPDATE 11/18/2011 10:41 GMT+1
After posting this issue in the mongodb user group we found out that "drop" wasn't being issued. Drop is much faster than a full remove of all records.
I am using the official mongodb-csharp-driver. I issued the command collection.Drop();. However, it didn't work, so for the time being I used this:
public void Clear()
{
if (collection.Exists())
{
var command = new CommandDocument {
{ "drop", collectionName }
};
collection.Database.RunCommand(command);
}
}
The daemon is quite stable now, yet I have to find out why the collection.Drop() method doesn't work as it is supposed to, since the driver uses the native drop command as well.
Some optimizations may be possible:
Make sure your MongoDB is not running in verbose mode; this will ensure minimal logging and hence minimal I/O. Otherwise it writes every operation to a log file.
If possible by application logic, convert your inserts to bulk inserts. Bulk inserts are supported in most MongoDB drivers (see the sketch after this list).
http://www.mongodb.org/display/DOCS/Inserting#Inserting-Bulkinserts
Instead of one remove operation per record, try to remove in bulk: e.g. collect the "_id" values of 1000 documents, then fire a single remove query using the $in operator (also sketched below). You will have 1000 times fewer queries to MongoDB.
If you are removing/inserting the same document to refresh data, consider an update instead.
What kind of daemon are you running? If you can share more info on that, it may be possible to optimize that too to reduce CPU load.
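A sketch of the bulk-insert and $in-based bulk-remove points above, using the same legacy official C# driver the question already uses (collection, helper and field names are illustrative; exact overloads vary a little between driver versions):
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

// `collection` is the same MongoCollection the Clear() method above already uses.

// Bulk insert: send many documents in one round trip instead of one insert per document.
IEnumerable<BsonDocument> hourlyDocs = BuildHourlyDocuments(); // hypothetical helper producing the new data
collection.InsertBatch(hourlyDocs);

// Bulk remove: collect the _id values of up to ~1000 stale documents,
// then remove them all with a single $in query.
var idsToDelete = new List<BsonValue>(); // fill with the _id values of the documents to delete
collection.Remove(Query.In("_id", new BsonArray(idsToDelete)));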
It could be totally unrelated, but there was an issue in 2.0.0 that had to do with CPU consumption: "after upgrade to 2.0.0 mongo starts consuming all cpu resources locking the system, complains of memory leak".
Unless I have misunderstood, your application is crashing, not mongod. Have you tried to remove MongoDB from the picture and replace writes to MongoDB with, perhaps, writes to the file system?
Maybe this will bring light to some other issue inside your application that is not related specifically to MongoDB.
I had something similar happen with SQL Server 2008 on Windows Server 2008 R2. For me, it ended up being the network card. The NIC was set to auto-sense the connection speed, which was leading to occasional dropped/lost packets, which was leading to the socket timeout problems. To test, you can ping the box from your local workstation and kick off your process to load the Windows 2008 R2 server. If it is this problem, eventually you'll start to see the timeouts on your ping command:
ping yourWin2008R2Server -n 1000
The solution ended up being to explicitly set the NIC connection speed:
Manage Computer > Device Manager > Network Adapters > Properties, and then depending on the NIC you'll either have a link speed setting tab or have to go into another menu. You'll want to set this to exactly the speed of the network it is connected to. In my DEV environment it ended up being 100 Mbps half duplex.
These types of problems, as you know, can be a real pain to track down!
Best to you in figuring it out.
The daemon is stable now. After posting this issue in the mongodb user group, we found out that "drop" wasn't being issued. Drop is much faster than a full remove of all records.