I have the following challenge:
I have an Azure Cloud Worker Role with many instances. Every minute, each instance spins up about 20-30 threads. In each thread, it needs to read some metadata about how to process the thread from 3 objects. The objects/data reside in a remote RavenDb, and even though RavenDb is very fast at retrieving the objects via HTTP, it is still under considerable load from 30+ workers hitting it 3 times per thread per minute (about 45 requests/sec). Most of the time (like 99.999%) the data in RavenDb does not change.
I've decided to implement local storage caching. First, I read a tiny record which indicates whether the metadata has changed (it changes VERY rarely), and then I read the object from local file storage instead of RavenDb, if local storage has it cached. I'm using File.ReadAllText().
This approach appears to be bogging the machine down, and processing slows considerably. I'm guessing the disks on "Small" Worker Roles are not fast enough.
Is there any way I can have the OS help me out and cache those files? Perhaps there is an alternative to caching this data?
I'm looking at roughly 1000 files, ranging from 100 KB to 10 MB in size, stored on each Cloud Role instance.
Not a straight answer, but three possible options:
Use the built-in RavenDB caching mechanism
My initial guess is that your caching mechanism is actually hurting performance. The RavenDB client has caching built-in (see here for how to fine-tune it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching)
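For reference, a minimal sketch of enabling aggressive caching on the client (the store settings, the ProcessingMetadata type, and the 5-minute window are assumptions, and the exact namespaces/method names vary between RavenDB client versions):

using System;
using Raven.Client;
using Raven.Client.Document;

IDocumentStore store = new DocumentStore
{
    Url = "http://your-ravendb-server:8080",
    DefaultDatabase = "Workers"
}.Initialize();

// Loads inside this scope may be served from the client's HTTP cache for up to
// 5 minutes, so the three metadata reads per thread stop hitting the server every run.
using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
using (var session = store.OpenSession())
{
    var metadata = session.Load<ProcessingMetadata>("metadata/1");
    // ... configure the thread from metadata ...
}

class ProcessingMetadata { /* placeholder for your metadata fields */ }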
The problem you have is that the cache is local to each server. If server A downloaded a file before, server B will still have to fetch it if it happens to process that file the next time.
One possible option you could implement is to divide the workload. For example:
Server A => fetch files that start with A-D
Server B => fetch files that start with E-H
Server C => ...
This would ensure that you optimize the cache on each server.
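If you go that route, a sketch of one way to keep the assignment stable without hard-coding letter ranges is below; IsMine, instanceIndex, and instanceCount are hypothetical names (on Azure they could be derived from the role instance ID and instance count):

using System;

// Decide whether this instance should handle a given document key.
// A stable hash keeps the assignment (and therefore each server's cache) consistent.
static bool IsMine(string documentKey, int instanceIndex, int instanceCount)
{
    // Don't use string.GetHashCode() across machines; compute a stable FNV-1a hash instead.
    uint hash = 2166136261;
    foreach (char c in documentKey)
    {
        hash = (hash ^ c) * 16777619;
    }
    return (int)(hash % (uint)instanceCount) == instanceIndex;
}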
Get a bigger machine
If you still want to employ your own caching mechanism, there are two things that I imagine could be the bottleneck:
Disk access
Deserialization of the JSON
For these issues, the only thing I can imagine would be to get bigger resources:
If it's the disk, use Premium Storage with SSDs.
If it's deserialization, get VMs with more CPU.
Cache files in RAM
Alternatively, instead of writing the files to disk, store them in memory and get a VM with more RAM. You shouldn't need THAT much RAM: even in the worst case, 1000 files at 10 MB each is only about 10 GB, and most of your files are smaller. Doing this would eliminate disk access and deserialization.
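A minimal sketch of that idea using System.Runtime.Caching (GetMetadataFromRaven and the 10-minute expiry are placeholders, not your actual code):

using System;
using System.Runtime.Caching;

static readonly MemoryCache Cache = MemoryCache.Default;

static string GetMetadata(string key)
{
    // Serve from memory when possible...
    var cached = Cache.Get(key) as string;
    if (cached != null)
        return cached;

    // ...and only hit RavenDB on a miss.
    string fresh = GetMetadataFromRaven(key);   // placeholder for the existing load
    Cache.Set(key, fresh, DateTimeOffset.UtcNow.AddMinutes(10));
    return fresh;
}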
But ultimately, it's probably best to first measure where the bottleneck is and see if it can be mitigated by using RavenDB's built-in caching mechanism.
I have a lot of files lying out on random file shares. I have to copy them into my SQL Server 2008 database and sum up all of the points. The overhead to copying the file from the network to C# to database makes this process slow and I have thousands of very large files to process.
File 1 example
Player | Points
---------------
Bean | 10
Ender | 15
File 2 example
Player | Points
---------------
Ender | 20
Peter | 5
Result
Player | Points
---------------
Bean | 10
Ender | 35
Peter | 5
Current approach: using C#, read each file into the database and merge into the master table.
MERGE INTO Points AS Target
USING Source AS Source
    ON Target.Player = Source.Player
WHEN MATCHED THEN
    UPDATE SET Target.Points = Target.Points + Source.Points
WHEN NOT MATCHED THEN
    INSERT (Player, Points) VALUES (Source.Player, Source.Points);
This approach works, but it's kind of slow and I'm looking for ideas for improvement.
Proposed solution:
Read each file into a SQLite database (based on my reading, this should be very fast), bulk load the entire database into my SQL Server database, and do all of the processing there. I should be able to assign a rank to each player, thus speeding up the grouping since I'm not comparing based on a text column. The downside of the proposed solution is that it can't work on multiple threads.
What's the fastest way to get all of these files into the database and then aggregate them?
Edit: A little more background on the files I forgot to mention
These files are located on several servers
I need to keep the "impact" of this task to a minimum - so no installing of apps
The files can be huge - as much as 1 GB per file, so doing anything in memory is not an option
There are thousands of files to process
So, if you can't/don't want to run code to start the parsing operation on the individual servers containing these files, and transferring the gigs and gigs of them is slow, then whether this is multithreaded is probably irrelevant - the performance bottleneck in your process is the file transfer.
So to make some assumptions:
There's the one master server and only it does any work.
It has immediate (if slow) access to all the file shares necessary, accessible by a simple path, and you know those paths.
The master tally server has a local database sitting on it to store player scores.
If you can transfer multiple files just as fast as you can transfer one, I'd write code that did the following:
Gather the list of files that need to be aggregated - this at least should be a small and cheap list. Put the paths into a ConcurrentBag.
Spin up as many Tasks as the bandwidth on the machine will allow you to run copy operations. You'll need to test to determine what this is.
Each Task takes the ConcurrentBag as an argument. It begins with a loop running TryTake() until it succeeds - once it's successfully removed a filepath from the bag it begins reading directly from the file location and parsing, adding each user's score to whatever is currently in the local database for that user.
Once a Task finishes working on a file it resumes trying to get the next filepath from the ConcurrentBag and so forth.
Eventually all filepaths have been worked on and the Tasks end.
So the code would be roughly:
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public void Start()
{
    // Fill the bag up front with every file path that needs to be aggregated
    // (GatherFilePaths() stands in for however you collect that list).
    var bag = new ConcurrentBag<string>(GatherFilePaths());
    var tasks = new List<Task>();

    for (var i = 0; i < COPY_OPERATIONS; i++)
    {
        tasks.Add(Task.Factory.StartNew(() => StartCopy(bag)));
    }

    // Wait until every file has been processed
    Task.WaitAll(tasks.ToArray());
}

public void StartCopy(ConcurrentBag<string> bag)
{
    // Keep taking paths until the bag is exhausted, then let the Task end
    string path;
    while (bag.TryTake(out path))
    {
        // Access the file via a stream and begin parsing it, dumping scores to the db
    }
}
By streaming you keep the copy operations running at full tilt (in fact the OS will most likely read ahead a bit for you to really ensure you max out the copy speed) and still avoid blowing out memory with the size of these files.
By not using multiple intermediary steps you skip the repeated cost of transferring and considering all that data - this way you do it just the once.
And by using the above approach you can easily account for the optimal number of copy operations.
There are optimizations you can make here to make it easily restartable, like having all threads receive a signal to stop what they're doing and record in the database the files they've finished, the one they are currently working on, and the line they left off at. You could also have them constantly write these values to the database, at a small cost to performance, to make it crash-proof (by making the line number and score writes part of a single transaction).
Original answer
You forgot to specify this in your question but it appears these scattered files log the points scored by players playing a game on a cluster of webservers?
This sounds like an embarrassingly parallel problem. Instead of copying massive files off of each machine, why not write a simple app that can run on all of them and distribute it to them? It just sums the points there on the machine and sends back a single number and player id per player over the network, solving the slow network issue.
If this is an on-going task you can timestamp the sums so you never count the same point twice and just run it in batch periodically.
I'd write the webserver apps as a simple webapp that only responds to one IP (the master tally server you were originally going to do everything on), and in response to a request, runs the tally locally and responds with the sum. That way the master server just sends requests out to all the score servers, and waits for them to send back their sums. Done.
You can keep the client apps very simple by just storing the sum data in memory as a Dictionary mapping player id to sum - no SQL necessary.
The tally software can also likely do everything in RAM then dump it all to SQL Server as totals complete to save time.
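A rough sketch of what each score server could run locally; the folder, the pipe-delimited format, and the player/points layout are assumptions based on the sample files above:

using System.Collections.Generic;
using System.IO;

// Sum points per player from the local log files.
var totals = new Dictionary<string, long>();

foreach (var path in Directory.EnumerateFiles(@"D:\scores", "*.log"))
{
    foreach (var line in File.ReadLines(path))   // streams, never loads the whole file
    {
        var parts = line.Split('|');
        if (parts.Length != 2 || !long.TryParse(parts[1].Trim(), out var points))
            continue;                            // skip headers / malformed lines

        var player = parts[0].Trim();
        totals[player] = totals.TryGetValue(player, out var sum) ? sum + points : points;
    }
}
// totals now maps player -> sum; send it back to the master tally server as one small payload.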
Fun problem.
My program has to write hundreds of files to disk, received from external resources over the network.
Each file is a simple document that I currently store under a GUID filename in a specific folder, but creating, writing, and closing hundreds of files is a lengthy process.
Is there a better way to store this many files on disk?
I've come to a solution, but I don't know if it is the best.
First, I create two files: one is like an allocation table and the second is a huge file storing all the content of my documents. But reading from this file would be a nightmare; maybe a memory-mapped file technique could help. Could working with 30 GB or more create a problem?
Edit: What is the fastest way to store 1000 text files on disk? (Write operations happen frequently.)
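For reference, reading one document back out of such a consolidated file through a memory-mapped view might look roughly like this; the path, offset, and length are hypothetical and would in practice come from the allocation-table file:

using System;
using System.IO.MemoryMappedFiles;
using System.Text;

// Open the large consolidated file as a memory-mapped file and read one document from it.
using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\storage\documents.dat", System.IO.FileMode.Open))
{
    long offset = 1024;   // where this document starts (from the allocation table)
    int length = 4096;    // how long it is (from the allocation table)

    using (var accessor = mmf.CreateViewAccessor(offset, length))
    {
        var buffer = new byte[length];
        accessor.ReadArray(0, buffer, 0, length);
        string document = Encoding.UTF8.GetString(buffer);
        Console.WriteLine(document.Substring(0, Math.Min(100, document.Length)));
    }
}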
This is similar to how Subversion stores its repositories on disk. Each revision in the repository is stored as a file, and the repository uses a folder for each 1000 revisions. This seems to perform rather well, except there is a good chance for the files to either become fragmented or be located far apart from each other. Subversion allows you to pack each 1000-revision folder into a single file (this works nicely since the revisions are not modified once created).
If you plan on modifying these documents often, you could consider using an embedded database to manage the solid file for you (Firebird is a good one that doesn't have any size limitations). This way you don't have to manage the growth and organization of the files yourself (which can get complicated when you start modifying files inside the solid file). This will also help with the issues of concurrent access (reading / writing) if you use a separate service / process to manage the database and communicate with it. The new version of Firebird (2.5) supports multiple process access to a database even when using an embedded server. This way you can have multiple accesses to your file storage without having to run a database server.
The first thing you should do is profile your app. In particular you want to get the counters around Disk Queue Length. Your queue length shouldn't be any more than 1.5 to 2 times the number of disk spindles you have.
For example, if you have a single-disk system, then the queue length shouldn't go above 2. If you have a RAID array with 3 disks, it shouldn't go above 6.
Verify that you are indeed write bound. If so then the best way to speed up performance of doing massive writes is to buy disks with very fast write performance. Note that most RAID setups will result in decreased performance.
If write performance is critical, then spreading the storage across multiple drives could work. Of course, you would have to take this into consideration for any app that needs to read that information. And you'll still have to buy fast drives.
Note that not all drives are created equal and some are better suited for high performance than others.
What about using the ThreadPool for that?
I.e. for each received "file", enqueue a write function in a thread pool thread that actually persists the data to a file on disk.
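A minimal sketch of that idea; the folder and the OnDocumentReceived signature are assumptions:

using System.IO;
using System.Threading;

// For each document received from the network, hand the disk write off to the thread pool.
void OnDocumentReceived(string fileName, byte[] content)
{
    ThreadPool.QueueUserWorkItem(_ =>
    {
        // Folder and naming scheme are assumptions; the original used a GUID per file.
        var path = Path.Combine(@"D:\incoming", fileName);
        File.WriteAllBytes(path, content);
    });
}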
A project I have been working on for the past year writes log files to a network drive every time a task is performed. We currently have about 1300 folders, each with a number of files and folders. The important part here is that we have about 3-4 XML files on average within each folder that contain serial numbers and other identifying data for our products. Early on it was easy to just use Windows file search, but now a search that way can take over 10 minutes. The need for a log viewer/searcher has arisen and I need to figure out how to make it fast.
We are considering a number of ideas:
A locally stored XML index file, but I think that will eventually get too big to be fast.
A folder-watching service that writes index information to a SQL database and links to the files.
Having the program that writes the log files also write index information to the database.
The second (database) option sounds like the best option right now, since we already have a bunch of history that will need to be indexed, but to me it sounds a little convoluted. So my question in short is: how can I quickly search for information contained in XML files in a large and constantly growing number of directories?
We ended up using an SQL database to do this, for many reasons:
It maintained much faster query times under simulated 10-year data growth (1.5-2 million entries still returned in under 2 seconds) than XML (about 20-30 seconds).
We can have the application directly publish data to the database, this removes the need for a crawler (other than for initial legacy data).
There could have been potential issues with file locking if we decided to host the file on the network somewhere.
I have a program that receives real-time data on 1000 topics. It receives, on average, 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.
I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.
I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.
I've considered a few approaches:
1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.
2) Write into one file and -- somehow -- process it later to produce 1000 files.
3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough ram although I might need to move to 64 bit to get over the 2 GB limit.
How would you approach this problem?
I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.
If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.
Seriously. Database it.
Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.
Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.
Like n8wrl, I would also recommend a DB. But if you really don't want one ...
Let's find another solution ;-)
As a minimum I would use two threads. The first is a worker thread, receiving all the data and putting each object (timestamp, two strings) into a queue.
Another thread checks this queue (perhaps signalled by an event, or by polling the Count property). This thread dequeues each object, opens the specific file, writes it out, closes the file, and proceeds to the next item.
I would start with this first approach and take a look at the performance. If it's poor, do some measuring to find where the problem is and address it (e.g. keep the open files in a dictionary of name to StreamWriter, etc.).
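A rough sketch of that two-thread setup using a BlockingCollection, which handles the signalling for you; the class name, folder, and file format are assumptions:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// The receiver calls Enqueue(); a single background task drains the queue
// and appends each message to its topic file.
class QueuedFileWriter
{
    private readonly BlockingCollection<(DateTime Stamp, string Topic, string Value)> _queue
        = new BlockingCollection<(DateTime, string, string)>();

    public QueuedFileWriter()
    {
        Task.Run(() =>
        {
            foreach (var msg in _queue.GetConsumingEnumerable())
            {
                File.AppendAllText(Path.Combine(@"D:\topics", msg.Topic + ".txt"),
                                   $"{msg.Stamp:o}\t{msg.Value}\n");
            }
        });
    }

    public void Enqueue(string topic, string value) =>
        _queue.Add((DateTime.UtcNow, topic, value));

    public void Shutdown() => _queue.CompleteAdding();   // lets the writer task finish
}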
But on the other hand, a DB would suit this problem so well...
One table, four columns (id, timestamp, topic, message), one additional index on topic, ready.
First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec. Each minute - 600 MB. Well, you could drop that in RAM. Then flush each hour.
Edit: corrected mistake. Sorry, my bad.
I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
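A sketch of that layout, assuming one ConcurrentQueue per topic and an arbitrary batch threshold; the class name, folder, and threshold are made up for illustration:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text;
using System.Threading;

// One queue per topic; the writer thread only opens a topic's file once it has
// at least BatchSize messages waiting, so files aren't opened for one or two lines.
class PerTopicWriter
{
    private const int BatchSize = 50;
    private readonly ConcurrentDictionary<string, ConcurrentQueue<string>> _queues
        = new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    public void Enqueue(string topic, string value) =>
        _queues.GetOrAdd(topic, _ => new ConcurrentQueue<string>())
               .Enqueue($"{DateTime.UtcNow:o}\t{value}");

    public void WriterLoop()       // run this on its own thread
    {
        while (true)
        {
            foreach (var pair in _queues)
            {
                if (pair.Value.Count < BatchSize) continue;

                var sb = new StringBuilder();
                string line;
                while (pair.Value.TryDequeue(out line)) sb.AppendLine(line);

                File.AppendAllText(Path.Combine(@"D:\topics", pair.Key + ".txt"), sb.ToString());
            }
            Thread.Sleep(100);     // avoid spinning when the queues are quiet
        }
    }
}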
I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...
1000 open file handles doesn't sound good. Forget disk fragmentation - OS resources will suffer.
This is close to db-ish-ness! Also sounds like more trouble than it's worth.
RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.
How would I approach this? DB! Because then I can query index, analyze, etc. etc.
:)
I would agree with Kyle and go with a packaged product like PI. Be aware that PI is quite expensive.
If you're looking for a custom solution, I'd go with Stephen's with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file to hand off the messages to the other process, though, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure about its speed.
I would also recommend using a DB to store your data. You'll want to do bulk inserts into the DB, though, as I think you would need some hefty hardware to allow SQL Server to do 5000 transactions a second. You're better off doing a bulk insert of, say, every 10000 messages that accumulate in the queue.
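A sketch of the bulk-insert side using SqlBulkCopy; the table and column names are assumptions:

using System;
using System.Data;
using System.Data.SqlClient;

// Bulk-insert a batch of queued messages in one round trip instead of
// thousands of single-row transactions per second.
static void BulkInsert(SqlConnection connection, (DateTime Stamp, string Topic, string Value)[] batch)
{
    var table = new DataTable();
    table.Columns.Add("Timestamp", typeof(DateTime));
    table.Columns.Add("Topic", typeof(string));
    table.Columns.Add("Message", typeof(string));

    foreach (var m in batch)
        table.Rows.Add(m.Stamp, m.Topic, m.Value);

    using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Messages" })
    {
        bulk.WriteToServer(table);
    }
}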
DATA SIZES:
Average message ~50 bytes:
smalldatetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.
50 bytes * 5000/sec = ~244 KB/sec -> ~14 MB/min -> ~858 MB/hour
Perhaps you don't want the overhead of a DB install?
In that case, you could try a filesystem-based database like sqlite:
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
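A minimal sketch of appending messages to a single local SQLite file from C#; this assumes the Microsoft.Data.Sqlite provider and a made-up schema (System.Data.SQLite would work similarly):

using Microsoft.Data.Sqlite;

// Append incoming messages to one local SQLite file instead of 1000 flat files.
using (var connection = new SqliteConnection("Data Source=messages.db"))
{
    connection.Open();

    var create = connection.CreateCommand();
    create.CommandText = "CREATE TABLE IF NOT EXISTS Messages (Stamp TEXT, Topic TEXT, Value TEXT)";
    create.ExecuteNonQuery();

    var insert = connection.CreateCommand();
    insert.CommandText = "INSERT INTO Messages (Stamp, Topic, Value) VALUES ($stamp, $topic, $value)";
    insert.Parameters.AddWithValue("$stamp", System.DateTime.UtcNow.ToString("o"));
    insert.Parameters.AddWithValue("$topic", "topic-001");
    insert.Parameters.AddWithValue("$value", "example payload");
    insert.ExecuteNonQuery();
}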
I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out. Doing things this way allows you to minimize the number of open file handles while still handling the incoming requests in real time. If you make the first program format its output correctly, then processing it into the individual files should be simple.
I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file (append operations are as fast as they can be) and use a separate process/service to split the file up into the 1000 files. You could even roll the file over every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the previous one up into 1000 separate files.
All this does beg the question of why not a DB, and why you need 1000 different files - you may have a very good reason - but then again, perhaps you should rethink your strategy and make sure it is sound reasoning before you go too far down this path.
I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried to do things like this with files and an MS SQL database before, and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs, and they even have packages where you can make queries against the data just as if it were SQL.
It wouldn't allow me to post hyperlinks, so just Google those 2 products and you will find information on them.
EDIT
If you do use a database like most people are suggesting I would recommend a table for each topic for historical data and consider table partitioning, indexes, and how long you are going to store the data.
For example, if you are going to store a day's worth and it's one table per topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432000 records a day per table. After exporting the data, I would imagine you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock, and you need MS SQL Enterprise Edition for online index rebuilding. If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.
Basically, what I'm saying is: weigh the cost of purchasing a reliable product against building your own.