I have the following challenge:
I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. In each thread, it needs to read some metadata about how to process the thread from 3 objects. The objects/data reside in a remote RavenDb, and even though RavenDb is very fast at retrieving the objects via HTTP, it is still under considerable load from 30+ workers hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDb does not change.
I've decided to implement local storage caching. First, I read a tiny record which indicates whether the metadata has changed (it changes VERY rarely), and then I read the object from local file storage instead of RavenDb, if local storage has it cached. I'm using File.ReadAllText().
This approach appears to be bogging the machine down, and processing slows considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.
Is there any way I can have the OS help me out and cache those files? Perhaps there is an alternative to caching this data?
I'm looking at roughly 1000 files, ranging from 100 KB to 10 MB in size, stored on each Cloud Role instance.
Not a straight answer, but three possible options:
Use the built-in RavenDB caching mechanism
My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching)
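For reference, a minimal sketch of enabling aggressive caching on the client (the store settings, the ProcessingMetadata type, and the 5-minute window are assumptions, and the exact namespaces/method names vary between RavenDB client versions):

using System;
using Raven.Client;
using Raven.Client.Document;

IDocumentStore store = new DocumentStore
{
    Url = "http://your-ravendb-server:8080",
    DefaultDatabase = "Workers"
}.Initialize();

// Loads inside this scope may be served from the client's HTTP cache for up to
// 5 minutes, so the three metadata reads per thread stop hitting the server every run.
using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = store.OpenSession())
{
    var metadata = session.Load<ProcessingMetadata>("metadata/1");
    // ... configure the thread from metadata ...
}

class ProcessingMetadata { /* placeholder for your metadata fields */ }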
The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.
One possible option you could implement is to divide the workload. For example:
Server A => fetch files that start with A-D
Server B => fetch files that start with E-H
Server C => ...
This would ensure that you optimize the cache on each server.
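If you go that route, a sketch of one way to keep the assignment stable without hard-coding letter ranges is below; IsMine, instanceIndex, and instanceCount are hypothetical names (on Azure they could be derived from the role instance ID and instance count):

using System;

// Decide whether this instance should handle a given document key.
// A stable hash keeps the assignment (and therefore each server's cache) consistent.
static bool IsMine(string documentKey, int instanceIndex, int instanceCount)
{
    // Don't use string.GetHashCode() across machines; compute a stable FNV-1a hash instead.
    uint hash = 2166136261;
    foreach (char c in documentKey)
    {
        hash = (hash ^ c) * 16777619;
    }
    return (int)(hash % (uint)instanceCount) == instanceIndex;
}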
Get a bigger machine
If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:
Disk access
Deserialization of the JSON
For these issues, the only thing I can imagine would be to get bigger resources:
If it's the disk, use Premium Storage with SSDs.
If it's deserialization, get VMs with more CPU.
Cache files in RAM
Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You shouldn't need THAT much RAM: even in the worst case, 1000 files at 10 MB each is only about 10 GB, and most of your files are smaller. Doing this would eliminate disk access and deserialization.
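A minimal sketch of that idea using System.Runtime.Caching (GetMetadataFromRaven and the 10-minute expiry are placeholders, not your actual code):

using System;
using System.Runtime.Caching;

static readonly MemoryCache Cache = MemoryCache.Default;

static string GetMetadata(string key)
{
    // Serve from memory when possible...
    var cached = Cache.Get(key) as string;
    if (cached != null)
        return cached;

    // ...and only hit RavenDB on a miss.
    string fresh = GetMetadataFromRaven(key);   // placeholder for the existing load
    Cache.Set(key, fresh, DateTimeOffset.UtcNow.AddMinutes(10));
    return fresh;
}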
But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.
I have a lot of files lying out on random file shares. I have to copy them into my SQL Server 2008 database and sum up all of the points. The overhead to copying the file from the network to C# to database makes this process slow and I have thousands of very large files to process.
File 1 example
Player | Points
---------------
Bean | 10
Ender | 15
File 2 example
Player | Points
---------------
Ender | 20
Peter | 5
Result
Player | Points
---------------
Bean | 10
Ender | 35
Peter | 5
Current approach: using C#, read each file into the database and merge into the master table.
MERGE INTO Points AS Target
USING Source AS Source
    ON Target.Player = Source.Player
WHEN MATCHED THEN
    UPDATE SET Target.Points = Target.Points + Source.Points
WHEN NOT MATCHED THEN
    INSERT (Player, Points) VALUES (Source.Player, Source.Points);
This approach works, but it's kind of slow and I'm looking for ideas for improvement.
Proposed solution:
Read each file into a SQLite database (based on my reading, this should be very fast), bulk load the entire database into my SQL Server database, and do all of the processing there. I should be able to assign a rank to each player, thus speeding up the grouping since I'm not comparing based on a text column. The downside of the proposed solution is that it can't work on multiple threads.
What's the fastest way to get all of these files into the database and then aggregate them?
Edit: A little more background on the files I forgot to mention
These files are located on several servers
I need to keep the "impact" of this task to a minimum - so no installing of apps
The files can be huge - as much as 1 GB per file, so doing anything in memory is not an option
There are thousands of files to process
So, if you can't/don't want to run code to start the parsing operation on the individual servers containing these files, and transferring the gigs and gigs of them is slow, then whether this is multithreaded is probably irrelevant - the performance bottleneck in your process is the file transfer.
So to make some assumptions:
There's the one master server and only it does any work.
It has immediate (if slow) access to all the file shares necessary, accessible by a simple path, and you know those paths.
The master tally server has a local database sitting on it to store player scores.
If you can transfer multiple files just as fast as you can transfer one, I'd write code that did the following:
Gather the list of files that need to be aggregated - this at least should be a small and cheap list. Put the paths into a ConcurrentBag.
Spin up as many Tasks as the bandwidth on the machine will allow you to run copy operations. You'll need to test to determine what this is.
Each Task takes the ConcurrentBag as an argument. It begins with a loop running TryTake() until it succeeds - once it's successfully removed a filepath from the bag it begins reading directly from the file location and parsing, adding each user's score to whatever is currently in the local database for that user.
Once a Task finishes working on a file it resumes trying to get the next filepath from the ConcurrentBag and so forth.
Eventually all filepaths have been worked on and the Tasks end.
So the code would be roughly:
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public void Start()
{
    // Fill the bag up front with every file path that needs to be aggregated
    // (GatherFilePaths() stands in for however you collect that list).
    var bag = new ConcurrentBag<string>(GatherFilePaths());
    var tasks = new List<Task>();

    for (var i = 0; i < COPY_OPERATIONS; i++)
    {
        tasks.Add(Task.Factory.StartNew(() => StartCopy(bag)));
    }

    // Wait until every file has been processed
    Task.WaitAll(tasks.ToArray());
}

public void StartCopy(ConcurrentBag<string> bag)
{
    // Keep taking paths until the bag is exhausted, then let the Task end
    string path;
    while (bag.TryTake(out path))
    {
        // Access the file via a stream and begin parsing it, dumping scores to the db
    }
}
By streaming you keep the copy operations running at full tilt (in fact the OS will most likely read ahead a bit for you to really ensure you max out the copy speed) and still avoid blowing out memory with the size of these files.
By not using multiple intermediary steps you skip the repeated cost of transferring and considering all that data - this way you do it just the once.
And by using the above approach you can easily account for the optimal number of copy operations.
There are optimizations you can make here to make it easily restartable, like having all threads receive a signal to stop what they're doing and record in the database the files they've finished, the one they are currently working on, and the line they left off at. You could also have them constantly write these values to the database, at a small cost to performance, to make it crash-proof (by making the line number and score writes part of a single transaction).
Original answer
You forgot to specify this in your question but it appears these scattered files log the points scored by players playing a game on a cluster of webservers?
This sounds like an embarrassingly parallel problem. Instead of copying massive files off of each machine, why not write a simple app that can run on all of them and distribute it to them? It just sums the points there on the machine and sends back a single number and player id per player over the network, solving the slow network issue.
If this is an on-going task you can timestamp the sums so you never count the same point twice and just run it in batch periodically.
I'd write the webserver apps as a simple webapp that only responds to one IP (the master tally server you were originally going to do everything on), and in response to a request, runs the tally locally and responds with the sum. That way the master server just sends requests out to all the score servers, and waits for them to send back their sums. Done.
You can keep the client apps very simple by just storing the sum data in memory as a Dictionary mapping player id to sum - no SQL necessary.
The tally software can also likely do everything in RAM then dump it all to SQL Server as totals complete to save time.
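A rough sketch of what each score server could run locally; the folder, the pipe-delimited format, and the player/points layout are assumptions based on the sample files above:

using System.Collections.Generic;
using System.IO;

// Sum points per player from the local log files.
var totals = new Dictionary<string, long>();

foreach (var path in Directory.EnumerateFiles(@"D:\scores", "*.log"))
{
    foreach (var line in File.ReadLines(path))   // streams, never loads the whole file
    {
        var parts = line.Split('|');
        if (parts.Length != 2 || !long.TryParse(parts[1].Trim(), out var points))
            continue;                            // skip headers / malformed lines

        var player = parts[0].Trim();
        totals[player] = totals.TryGetValue(player, out var sum) ? sum + points : points;
    }
}
// totals now maps player -> sum; send it back to the master tally server as one small payload.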
Fun problem.
My program has to write hundreds of files to disk, received from external resources over the network.
Each file is a simple document that I currently store under a GUID filename in a specific folder, but creating, writing, and closing hundreds of files is a lengthy process.
Is there a better way to store this many files on disk?
I've come to a solution, but I don't know if it is the best.
First, I create two files: one is like an allocation table and the second is a huge file storing all the content of my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30 GB or more create a problem?
Edit: What is the fastest way to store 1000 text files on disk? (Write operations happen frequently.)
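For reference, reading one document back out of such a consolidated file through a memory-mapped view might look roughly like this; the path, offset, and length are hypothetical and would in practice come from the allocation-table file:

using System;
using System.IO.MemoryMappedFiles;
using System.Text;

// Open the large consolidated file as a memory-mapped file and read one document from it.
using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\storage\documents.dat", System.IO.FileMode.Open))
{
    long offset = 1024;   // where this document starts (from the allocation table)
    int length = 4096;    // how long it is (from the allocation table)

    using (var accessor = mmf.CreateViewAccessor(offset, length))
    {
        var buffer = new byte[length];
        accessor.ReadArray(0, buffer, 0, length);
        string document = Encoding.UTF8.GetString(buffer);
        Console.WriteLine(document.Substring(0, Math.Min(100, document.Length)));
    }
}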
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except there is a good chance for the files to either become fragmented or be located far apart from each other. Subversion allows you to pack each 1000-revision folder into a single file (this works nicely since the revisions are not modified once created).
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single-disk system, then the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above 6.
Verify that you are indeed write bound. If so then the best way to speed up performance of doing massive writes is to buy disks with very fast write performance. Note that most RAID setups will result in decreased performance.
If write performance is critical, then spreading the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e. for each received "file", enqueue a write function in a thread pool thread that actually persists the data to a file on disk.
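A minimal sketch of that idea; the folder and the OnDocumentReceived signature are assumptions:

using System.IO;
using System.Threading;

// For each document received from the network, hand the disk write off to the thread pool.
void OnDocumentReceived(string fileName, byte[] content)
{
    ThreadPool.QueueUserWorkItem(_ =>
    {
        // Folder and naming scheme are assumptions; the original used a GUID per file.
        var path = Path.Combine(@"D:\incoming", fileName);
        File.WriteAllBytes(path, content);
    });
}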
A project I have been working on for the past year writes log files to a network drive every time a task is performed. We currently have about 1300 folders, each with a number of files and folders. The important part here is that we have about 3-4 XML files on average within each folder that contain serial numbers and other identifying data for our products. Early on it was easy to just use Windows file search, but now a search that way can take over 10 minutes. The need for a log viewer/searcher has arisen and I need to figure out how to make it fast.
We are considering a number of ideas:
A locally stored XML index file, but I think that will eventually get too big to be fast.
A folder-watching service that writes index information to a SQL database and links to the files.
Having the program that writes the log files also write index information to the database.
The second (database) option sounds like the best option right now, since we already have a bunch of history that will need to be indexed, but to me it sounds a little convoluted. So my question in short is: how can I quickly search for information contained in XML files in a large and constantly growing number of directories?
We ended up using an SQL database to do this, for many reasons:
It maintained much faster query times under simulated 10-year data growth (1.5-2 million entries still returned in under 2 seconds) than XML (about 20-30 seconds).
We can have the application directly publish data to the database, this removes the need for a crawler (other than for initial legacy data).
There could have been potential issues with file locking if we decided to host the file on the network somewhere.
I have a program that receives real-time data on 1000 topics. It receives, on average, 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.
I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.
I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.
I've considered a few approaches:
1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.
2) Write into one file and -- somehow -- process it later to produce 1000 files.
3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough ram although I might need to move to 64 bit to get over the 2 GB limit.
How would you approach this problem?
I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.
If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.
Seriously. Database it.
Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.
Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.
Like n8wrl, I would also recommend a DB. But if you really don't want one ...
Let's find another solution ;-)
As a minimum I would use two threads. The first is a worker thread, receiving all the data and putting each object (timestamp, two strings) into a queue.
Another thread checks this queue (perhaps signalled by an event, or by polling the Count property). This thread dequeues each object, opens the specific file, writes it out, closes the file, and proceeds to the next item.
I would start with this first approach and take a look at the performance. If it's poor, do some measuring to find where the problem is and address it (e.g. keep the open files in a dictionary of name to StreamWriter, etc.).
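A rough sketch of that two-thread setup using a BlockingCollection, which handles the signalling for you; the class name, folder, and file format are assumptions:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// The receiver calls Enqueue(); a single background task drains the queue
// and appends each message to its topic file.
class QueuedFileWriter
{
    private readonly BlockingCollection<(DateTime Stamp, string Topic, string Value)> _queue
        = new BlockingCollection<(DateTime, string, string)>();

    public QueuedFileWriter()
    {
        Task.Run(() =>
        {
            foreach (var msg in _queue.GetConsumingEnumerable())
            {
                File.AppendAllText(Path.Combine(@"D:\topics", msg.Topic + ".txt"),
                                   $"{msg.Stamp:o}\t{msg.Value}\n");
            }
        });
    }

    public void Enqueue(string topic, string value) =>
        _queue.Add((DateTime.UtcNow, topic, value));

    public void Shutdown() => _queue.CompleteAdding();   // lets the writer task finish
}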
But on the other hand, a DB would suit this problem so well...
One table, four columns (id, timestamp, topic, message), one additional index on topic, ready.
First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec. Each minute - 600 MB. Well, you could drop that in RAM. Then flush each hour.
Edit: corrected mistake. Sorry, my bad.
I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
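A sketch of that layout, assuming one ConcurrentQueue per topic and an arbitrary batch threshold; the class name, folder, and threshold are made up for illustration:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text;
using System.Threading;

// One queue per topic; the writer thread only opens a topic's file once it has
// at least BatchSize messages waiting, so files aren't opened for one or two lines.
class PerTopicWriter
{
    private const int BatchSize = 50;
    private readonly ConcurrentDictionary<string, ConcurrentQueue<string>> _queues
        = new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    public void Enqueue(string topic, string value) =>
        _queues.GetOrAdd(topic, _ => new ConcurrentQueue<string>())
               .Enqueue($"{DateTime.UtcNow:o}\t{value}");

    public void WriterLoop()       // run this on its own thread
    {
        while (true)
        {
            foreach (var pair in _queues)
            {
                if (pair.Value.Count < BatchSize) continue;

                var sb = new StringBuilder();
                string line;
                while (pair.Value.TryDequeue(out line)) sb.AppendLine(line);

                File.AppendAllText(Path.Combine(@"D:\topics", pair.Key + ".txt"), sb.ToString());
            }
            Thread.Sleep(100);     // avoid spinning when the queues are quiet
        }
    }
}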
I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...
1000 open file handles doesn't sound good. Forget disk fragmentation - OS resources will suffer.
This is close to db-ish-ness! Also sounds like more trouble than it's worth.
RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.
How would I approach this? DB! Because then I can query index, analyze, etc. etc.
:)
I would agree with Kyle and go with a packaged product like PI. Be aware that PI is quite expensive.
If you're looking for a custom solution, I'd go with Stephen's with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file to hand off the messages to the other process, though, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure about its speed.
I would also recommend using a DB to store your data. You'll want to do bulk inserts into the DB, though, as I think you would need some hefty hardware to allow SQL Server to do 5000 transactions a second. You're better off doing a bulk insert of, say, every 10000 messages that accumulate in the queue.
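A sketch of the bulk-insert side using SqlBulkCopy; the table and column names are assumptions:

using System;
using System.Data;
using System.Data.SqlClient;

// Bulk-insert a batch of queued messages in one round trip instead of
// thousands of single-row transactions per second.
static void BulkInsert(SqlConnection connection, (DateTime Stamp, string Topic, string Value)[] batch)
{
    var table = new DataTable();
    table.Columns.Add("Timestamp", typeof(DateTime));
    table.Columns.Add("Topic", typeof(string));
    table.Columns.Add("Message", typeof(string));

    foreach (var m in batch)
        table.Rows.Add(m.Stamp, m.Topic, m.Value);

    using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Messages" })
    {
        bulk.WriteToServer(table);
    }
}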
DATA SIZES:
Average message ~50 bytes:
smalldatetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.
50 bytes * 5000/sec = ~244 KB/sec -> ~14 MB/min -> ~858 MB/hour
Perhaps you don't want the overhead of a DB install?
In that case, you could try a filesystem-based database like sqlite:
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
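A minimal sketch of appending messages to a single local SQLite file from C#; this assumes the Microsoft.Data.Sqlite provider and a made-up schema (System.Data.SQLite would work similarly):

using Microsoft.Data.Sqlite;

// Append incoming messages to one local SQLite file instead of 1000 flat files.
using (var connection = new SqliteConnection("Data Source=messages.db"))
{
    connection.Open();

    var create = connection.CreateCommand();
    create.CommandText = "CREATE TABLE IF NOT EXISTS Messages (Stamp TEXT, Topic TEXT, Value TEXT)";
    create.ExecuteNonQuery();

    var insert = connection.CreateCommand();
    insert.CommandText = "INSERT INTO Messages (Stamp, Topic, Value) VALUES ($stamp, $topic, $value)";
    insert.Parameters.AddWithValue("$stamp", System.DateTime.UtcNow.ToString("o"));
    insert.Parameters.AddWithValue("$topic", "topic-001");
    insert.Parameters.AddWithValue("$value", "example payload");
    insert.ExecuteNonQuery();
}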
I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out. Doing things this way allows you to minimize the number of open file handles while still handling the incoming requests in real time. If you make the first program format its output correctly, then processing it into the individual files should be simple.
I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file (append operations are as fast as they can be) and use a separate process/service to split the file up into the 1000 files. You could even roll the file over every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the previous one up into 1000 separate files.
All this does beg the question of why not a DB, and why you need 1000 different files - you may have a very good reason - but then again, perhaps you should rethink your strategy and make sure it is sound reasoning before you go too far down this path.
I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried to do things like this with files and an MS SQL database before, and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs, and they even have packages where you can make queries against the data just as if it were SQL.
It wouldn't allow me to post hyperlinks, so just Google those 2 products and you will find information on them.
EDIT
If you do use a database like most people are suggesting I would recommend a table for each topic for historical data and consider table partitioning, indexes, and how long you are going to store the data.
For example, if you are going to store a day's worth and it's one table per topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432000 records a day per table. After exporting the data, I would imagine you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock, and you need MS SQL Enterprise Edition for online index rebuilding. If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.
Basically, what I'm saying is: weigh the cost of purchasing a reliable product against building your own.