Performance for reading files and inserting contents into database - c#

I'm developing a system that isn't real time, but there's an intervening standalone server between the end-user machines and the database. The idea is that instead of burdening the database server every time a user sends something up, a Windows service on the database machine sweeps the relay server at regular intervals and updates the database, deleting the temporary files on the relay box.
There is a scenario where the client software installed on thousands of machines sends up information at nearly the same time. The following hold true:
The above scenario won't occur often but could occur once every other week.
For each machine, 24 bytes of data (4 KB on disk) is written on the relay server, which we then want to pick up and use to update the database. So although it's fine while the user base is only a few thousand, it may grow to millions over time.
I was thinking of a batch operation that picks up only some 15,000 - 20,000 files at a time and runs at a configurable interval (amendable from app.config). The problem is that if the user base grows to a few million, that will take days to complete. Yes, it doesn't have to be real-time information, but waiting days for all the data to reach the database isn't ideal either.
I think there will always be a bottleneck if the relay box is hammered, but are there better ways to improve performance and get the data across at a reasonable time (a day, two tops)?
Regards,
F.

To avoid hammering the disk, you might consider having only one thread read the files, hand the processing off to multiple threads that write to the database, and then return control to the disk thread to delete the files after commit. The number of DB threads could be "amendable from app.config" so you can find the best value for your hardware configuration.
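A rough sketch of that layout, assuming a BlockingCollection-based hand-off, a hypothetical "DbWriterThreads" app.config key, and placeholder paths and database code (not the actual implementation):

using System;
using System.Collections.Concurrent;
using System.Configuration;
using System.IO;
using System.Threading.Tasks;

class RelaySweeper
{
    static void Main()
    {
        // "DbWriterThreads" is a hypothetical app.config key.
        int dbThreads = int.Parse(ConfigurationManager.AppSettings["DbWriterThreads"] ?? "4");

        var work = new BlockingCollection<Tuple<string, byte[]>>(boundedCapacity: 20000);
        var done = new BlockingCollection<string>();

        // DB writers: insert the payload, then report the file as committed.
        var writers = new Task[dbThreads];
        for (int i = 0; i < dbThreads; i++)
        {
            writers[i] = Task.Run(() =>
            {
                foreach (var item in work.GetConsumingEnumerable())
                {
                    WriteToDatabase(item.Item2);   // placeholder for the real insert + commit
                    done.Add(item.Item1);
                }
            });
        }

        // When every writer has finished, close the "done" queue so the deleter can drain and exit.
        Task.WhenAll(writers).ContinueWith(_ => done.CompleteAdding());

        // Deletes a file only after its DB commit has been reported.
        var deleter = Task.Run(() =>
        {
            foreach (var path in done.GetConsumingEnumerable())
                File.Delete(path);
        });

        // Single disk reader: the only thread that touches the relay share.
        foreach (var path in Directory.EnumerateFiles(@"\\relay\incoming", "*.dat"))
            work.Add(Tuple.Create(path, File.ReadAllBytes(path)));

        work.CompleteAdding();
        deleter.Wait();
    }

    static void WriteToDatabase(byte[] payload)
    {
        // In practice these 24-byte records would be batched into one transaction.
    }
}

The single reader keeps disk access sequential, and the bounded collection stops the reader from racing too far ahead of the database writers.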
Just my 2 cents to get you thinking.

Related

What is the best Method for monitoring a large number of clients reliably with good performance

This is more of a programming strategy and direction question, than the actual code itself.
I am programming in C-Sharp.
I have an application that remotely starts processes on many different clients on the network, could be up to 1000 clients in theory.
It then monitors the status of the remote processes by reading a log file on each client.
I currently do this by running one thread that loops through all of the clients in a list, and reading the log file. It works fine for 10 or 20 machines, but 1000 would probably be untenable.
There are several problems with this approach:
First, if the thread doesn’t finish reading all of the client statuses before it’s called again, the client statuses at the end of the list might not be read and updated.
Secondly, if any client in the list goes offline during this period, the updating hangs, until that client is back online again.
So I require a different approach, and have thought up a few possible ways to resolve this.
1. Spawn a separate thread for each client, to read their log file and update its progress.
a. However, I'm not sure if having 1000 threads running on my machine is something that would be acceptable.
2. Test the connection for each machine first, before trying to read the file, and if it cannot connect, then just ignore it for that iteration and move on to the next client in the list.
a. This still has the same problem of not getting through the list before the next call, and causes more delay as it tries to test the connection via a port first. With 1000 clients, this would be noticeable.
3. Have each client send the data to the machine running the application whenever there is an update.
a. This could create a lot of chatter with 1000 machines trying to send data repeatedly.
So I'm trying to figure out whether there is another, more efficient and reliable method that I haven't considered, or which one of these would be best.
Right now I'm leaning towards having the clients send updates to the application, instead of having the application pull the data.
Looking for thoughts, concerns, ideas and recommendations.
In my opinion, you are approaching this monitoring the wrong way. Instead of keeping the logs in text files on each client, you'd be better off preserving them in a central data repository, which could be of any kind. Given that you are monitoring the performance of those systems, your design and the mechanism behind it must not negatively impact the performance of the target systems; with the current design the disk and CPU can become so heavily involved in certain cases that it turns into a performance issue itself.
I recommend creating a log repository server using a fast in-memory database like Redis and sending the logged data directly to that server. Keep in mind that this database should run on a different (virtual) machine. You can then tune Redis to persist the received data to physical disk once a certain number of changes have accumulated or a certain interval has elapsed. The in-memory aspect is advantageous because you may need to query the information frequently in a monitoring application like this, and Redis's performance is high enough to comfortably handle millions of entries.
The blueprint is:
1- Centralize all log data in a single repository.
2- Configure clients to send monitored information to the centralized repository.
3- Have the main server (the monitoring system) read the data from the centralized repository when required.
I'm not trying to advertise a particular tool here; I'm only sharing my own experience. There are many more tools you can use for this purpose, such as Elasticsearch.
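If you go the Redis route, a minimal client-side sketch might look like the following; it assumes the StackExchange.Redis package, a hypothetical "log-server" host, and a one-list-per-client key layout:

using System;
using StackExchange.Redis;   // NuGet: StackExchange.Redis

class LogShipper
{
    // "log-server:6379" and the key layout are assumptions for this sketch.
    static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("log-server:6379");

    public static void Send(string clientId, string message)
    {
        IDatabase db = Redis.GetDatabase();

        // One Redis list per client; the monitoring app reads (and trims) these lists on demand.
        string key = "logs:" + clientId;
        string entry = DateTime.UtcNow.ToString("o") + " " + message;

        // Fire-and-forget keeps the impact on the monitored client negligible.
        db.ListRightPush(key, entry, flags: CommandFlags.FireAndForget);
    }
}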

What is the fastest way to persistently increment a list of numbers from multiple threads?

My application has different tasks, each one posting an XML document via HTTP POST to a different endpoint. For every task I need to keep count of the messages I send, each identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint will save the last message id sent, so that if there is a problem and the application needs to restart it won't send the same message again, and will restart from where it currently was.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server to save the counters, with a row for every task, gives me a 50% decrease in throughput. Saving the counter value in a text file for every task is a bit faster, but still far from my goal. I am looking for a way to persist this information so that I can be as close as possible to my goal. I thought that maybe appending the last processed id rather than updating it could help me avoid possible write locks, but the bottom line is that I don't care if, for the sake of performance, I have to waste disk space or accept a higher startup time for reading the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with an ESENT storage, via the ManagedEsent - PersistentDictionary wrapper.
The PersistentDictionary class is concurrent and provides real concurrent access to the ESENT backend. You would represent everything in key-value pair format.
Give it a try, it is not much code to write.
ESENT is an in-process database engine, disk based + in-memory caching, used throughout several Windows components (Search, Exchange, etc). It does provide transactional support, which is what you're after.
It has been included in all versions of Windows since 2000 so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
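As a rough sketch of how the counter store could look on top of PersistentDictionary (the Guid key per endpoint and the "ThreadStorage" directory follow the snippet above; the Flush-per-save policy is an assumption that trades some throughput for durability):

using System;
using Microsoft.Isam.Esent.Collections.Generic;   // NuGet: ManagedEsent

class CounterStore : IDisposable
{
    // ESENT creates its database files inside the "ThreadStorage" directory.
    private readonly PersistentDictionary<Guid, int> _counters =
        new PersistentDictionary<Guid, int>("ThreadStorage");

    // Call after the endpoint has acknowledged the message.
    public void Save(Guid endpointId, int lastMessageId)
    {
        _counters[endpointId] = lastMessageId;
        _counters.Flush();   // push the update to disk so a crash cannot lose it
    }

    // On restart: resume from the last persisted id (0 if this endpoint is new).
    public int Load(Guid endpointId)
    {
        int id;
        return _counters.TryGetValue(endpointId, out id) ? id : 0;
    }

    public void Dispose()
    {
        _counters.Dispose();
    }
}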
I faced the same issue as the OP.
I used SQL Server sequence numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution if you want to avoid using SQL Server.
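For completeness, a minimal sketch of the sequence approach (SQL Server 2012 or later); the sequence name and connection handling here are illustrative, not the poster's actual code:

using System;
using System.Data.SqlClient;

class SequenceCounter
{
    // Assumes a sequence created once, e.g.:
    //   CREATE SEQUENCE dbo.MessageSeq START WITH 1 INCREMENT BY 1;
    public static int NextMessageId(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT NEXT VALUE FOR dbo.MessageSeq", conn))
        {
            conn.Open();
            return Convert.ToInt32(cmd.ExecuteScalar());
        }
    }
}

Because the sequence value is handed out by the server itself, a restart simply continues from the next value; the trade-off is one extra round trip per message.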

MongoDB slow writes causes socket time out exception

I am having performance issues with MongoDB.
Running on:
MongoDB 2.0.1
Windows 2008 R2
12 GB RAM
2 TB HDD (5400 rpm)
I've written a daemon which removes and inserts records asynchronously. Each hour most of the collections are cleared and then get newly inserted data (10-12 million deletes and 10-12 million inserts). The daemon uses ~60-80% of the CPU while inserting the data (due to calculating 1+ million knapsack problems). When I fire up the daemon it can do its job for about 1-2 minutes until it crashes due to a socket timeout (while writing data to the MongoDB server).
When I look in the logs I see it takes about 30 seconds to remove data from the collection. It seems to have something to do with the CPU load and memory usage, because when I run the daemon on a different PC everything goes fine.
Is there any optimization possible, or am I just bound to using a separate PC to run the daemon (or picking another document store)?
UPDATE 11/13/2011 18:44 GMT+1
Still having problems. I've made some modifications to my daemon: I've decreased the number of concurrent writes. However, the daemon still crashes when memory is getting full (11.8 GB of 12 GB) and it receives more load (loading data into the frontend). It crashes due to a long (30 second) insert/remove in MongoDB, i.e. MongoDB responding slowly (socket timeout exception). Of course there should be try/catch statements to catch such exceptions, but that should not happen in the first place. I'm looking for a solution that solves this issue instead of working around it.
Total storage size is: 8.1 GB
Index size is: 2.1 GB
I guess the problem is that the working set plus indexes are too large to fit in memory, so MongoDB needs to access the HDD (which is a slow 5400 rpm drive). However, why would this be a problem? Aren't there other strategies for storing the collections (e.g. in separate files instead of large 2 GB chunks)? If a relational database can read/write data from the disk in an acceptable amount of time, why can't MongoDB?
UPDATE 11/15/2011 00:04 GMT+1
Log file to illustrate the issue:
00:02:46 [conn3] insert bargains.auction-history-eu-bloodhoof-horde 421ms
00:02:47 [conn6] insert bargains.auction-history-eu-blackhand-horde 1357ms
00:02:48 [conn3] insert bargains.auction-history-eu-bloodhoof-alliance 577ms
00:02:48 [conn6] insert bargains.auction-history-eu-blackhand-alliance 499ms
00:02:49 [conn4] remove bargains.crafts-eu-agamaggan-horde 34881ms
00:02:49 [conn5] remove bargains.crafts-eu-aggramar-horde 3135ms
00:02:49 [conn5] insert bargains.crafts-eu-aggramar-horde 234ms
00:02:50 [conn2] remove bargains.auctions-eu-aerie-peak-horde 36223ms
00:02:52 [conn5] remove bargains.auctions-eu-aegwynn-horde 1700ms
UPDATE 11/18/2011 10:41 GMT+1
After posting this issue in the mongodb user group we found out that "drop" wasn't being issued. Drop is much faster than a full remove of all records.
I am using the official mongodb-csharp-driver. I issued the command collection.Drop();. However, it didn't work, so for the time being I used this:
public void Clear()
{
    if (collection.Exists())
    {
        var command = new CommandDocument {
            { "drop", collectionName }
        };
        collection.Database.RunCommand(command);
    }
}
The daemon is quite stable now, yet I have to find out why the collection.Drop() method doesn't work as it's supposed to, since the driver uses the native drop command as well.
Some optimizations may be possible:
Make sure your mongod is not running in verbose mode; this ensures minimal logging and hence minimal I/O. Otherwise it writes every operation to a log file.
If possible by application logic, convert your inserts to bulk inserts. Bulk insert is supported in most MongoDB drivers:
http://www.mongodb.org/display/DOCS/Inserting#Inserting-Bulkinserts
Instead of one remove operation per record, try to remove in bulk: e.g. collect the "_id" values of 1000 documents, then fire a single remove query using the $in operator. You will send 1000 times fewer queries to MongoDB (a sketch follows this list).
If you are removing and re-inserting the same document to refresh its data, consider an update instead.
What kind of daemon are you running? If you can share more info on that, it may be possible to optimize it too, to reduce the CPU load.
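A hedged sketch of the bulk insert and the $in-based bulk remove, written against the legacy 1.x C# driver API that matches the CommandDocument code in the question (names and batch size are illustrative):

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;   // legacy 1.x API

class BulkHelpers
{
    // Insert many documents in one round trip instead of one insert per record.
    public static void BulkInsert(MongoCollection<BsonDocument> collection,
                                  IEnumerable<BsonDocument> documents)
    {
        collection.InsertBatch(documents);
    }

    // Remove documents in batches of ~1000 using $in on _id.
    public static void BulkRemove(MongoCollection<BsonDocument> collection,
                                  IEnumerable<BsonValue> ids)
    {
        foreach (var batch in Chunk(ids, 1000))
            collection.Remove(Query.In("_id", batch));
    }

    private static IEnumerable<List<BsonValue>> Chunk(IEnumerable<BsonValue> source, int size)
    {
        var bucket = new List<BsonValue>(size);
        foreach (var id in source)
        {
            bucket.Add(id);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<BsonValue>(size);
            }
        }
        if (bucket.Count > 0)
            yield return bucket;
    }
}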
It could be totally unrelated, but there was an issue in 2.0.0 that had to do with CPU consumption: after upgrading to 2.0.0, mongod starts consuming all CPU resources and locking the system, with complaints of a memory leak.
Unless I have misunderstood, your application is crashing, not mongod. Have you tried to remove MongoDB from the picture and replacing writes to MongoDB with perhaps writes to the file system?
Maybe this will bring light to some other issue inside your application that is not related specifically to MongoDB.
I had something similar happen with SQL Server 2008 on Windows Server 2008 R2. For me, it ended up being the network card. The NIC was set to auto-sense the connection speed, which led to occasional dropped/lost packets, which in turn led to the socket timeout problems. To test, you can ping the box from your local workstation and kick off your process to load the Windows 2008 R2 server. If it is this problem, you'll eventually start to see timeouts on your ping command:
ping yourWin2008R2Server -n 1000
The solution ended up being to explicitly set the NIC connection speed:
Manage Computer > Device Manager > Network Adapters > Properties, and then, depending on the NIC, you'll either have a link speed setting tab or have to go into another menu. You'll want to set this to exactly the speed of the network it is connected to. In my DEV environment it ended up being 100 Mbps half duplex.
These types of problems, as you know, can be a real pain to track down!
Best to you in figuring it out.
The daemon is stable now. After posting this issue in the mongodb user group we found out that "drop" wasn't being issued; drop is much faster than a full remove of all records.

Caching architecture advice for a specific scenario

SETUP:
We have a .NET application that is distributed over 6 local servers, each with a local database (Oracle), plus 1 main server and 1 load-balancing machine. Requests come to the load balancer, which redirects them to one of the 6 local servers. At certain time intervals the data is gathered on the main server and redistributed to the 6 local servers so that decisions can be made with the complete data.
Each local server has a cache component that caches the incoming requests based on different parameters (location, incoming parameters, etc.). With each request, a local server decides whether to go to the database (Oracle) or get the response from the cache. However, in both cases the local server has to go to the database to do 1 insert and 1 update per request.
PROBLEM:
On a peak day each local server receives 2000 requests per second and the system starts slowing down (CPU: 90%). I am trying to increase capacity before adding another local server to the mix. After running some benchmarks, the bottleneck, as it always is, seems to be the inevitable 1 insert and 1 update per request to the database.
TRIED METHODS
To decrease the frequency, I created a Windows service that sits between the DB and the .NET application. It contains a pipe server, receives each insert and update from the main .NET application, and saves them in a Hashtable. The service then goes to the database at certain time intervals to do batch inserts and updates. The point was to hit the database less frequently. Although this had a positive effect, it didn't reduce the system load as much as I expected; most of the CPU load comes from oracle.exe as the requests per second increase.
I am trying to avoid going to the database as much as I can, and the only way to avoid the DB seems to be increasing the cache hit ratio (other than the solution mentioned above). My cache hit ratio is around 81% currently. Because each local machine has its own cache, I am actually missing lots of cacheable requests: when two similar requests are redirected to different servers, the second request cannot benefit from the cached result of the first one.
I don't have a lot of experience in system architecture so I would appreciate any help to this problem. Any suggestions on different caching architectures or setup, or any tools are welcome.
Thank you in advance, hopefully I made my question clear.
To me this looks like an application for a TimesTen solution. In that case you can eliminate the local databases and return to just one. Where you now have the local Oracle databases, you can implement a cache grid. Most likely this is going to be an AWT (Async, Write Through) cache. See Oracle In-Memory Database Cache Concepts.
It's not a cheap option, but it could be worth investigating.
You can keep concentrating on the business logic and have no worries about speed. This of course only works well if the application code is already tuned and the SQL is performant and scalable. The SQL has to be prepared (using bind variables) to get the best performance.
Your application connects to the cache and no longer to the database. You create the cache tables in the cache group for the tables you want cached. All tables in a SQL statement should be cached; otherwise, the complete statement is passed through to the Oracle database. In the grid a cache fusion mechanism is in place, so you have no worries about where the data in your grid is located.
The current release includes support for .NET.
The data is consistent and asynchronously updated to the Oracle database. If the data that is needed is in the cache and you take the Oracle database down, the app can keep running. As soon as the database is back, the synchronization picks up again. Very powerful.
2000 requests per second per server across 6 servers, each doing an insert and an update, is roughly 24,000 operations per second hitting the database. That's a HUGE load for a DB.
Try to optimize, scale up, or cluster the database.
Maybe a NoSQL DB (Redis/Raven/Mongo) used as middleware would suit you: the local servers read/write a sharded NoSQL store, and the aggregated data is synchronized with Oracle during off-peak times.
I know the question is old now, but I wanted to let everyone know how we solved our issue.
After trying many optimizations it turned out that all we needed was solid state drives for the 6 local machines. The CPU dropped to 30% immediately after we installed them. This is the first time I've seen a hardware upgrade contribute this much to performance.
If you have a high-load setup, before making any software or architecture changes, try upgrading to an SSD.
Thanks everyone for your answers.

Storing Real Time data into 1000 files

I have a program that receives real-time data on 1000 topics. It receives, on average, 5000 messages per second. Each message consists of two strings (a topic and a message value). I'd like to save these strings along with a timestamp indicating the message arrival time.
I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.
I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.
I've considered a few approaches:
1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.
2) Write into one file and -- somehow -- process it later to produce 1000 files.
3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough ram although I might need to move to 64 bit to get over the 2 GB limit.
How would you approach this problem?
I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.
If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.
Seriously. Database it.
Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.
Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.
Like n8wrl, I would also recommend a DB. But if you really dislike that idea...
Let's find another solution ;-)
At a minimum I would use two threads. The first is a worker that receives all the data and puts each object (timestamp plus the two strings) into a queue.
Another thread checks this queue (perhaps notified by an event, or by polling the Count property). It dequeues each object, opens the specific file, writes the data, closes the file, and proceeds to the next item.
I would start with this first approach and take a look at the performance. If it isn't good enough, do some measuring to find where the problem is and address it (e.g. keep the open files in a dictionary of name → StreamWriter, etc.).
But on the other side a DB would be so fine for this problem...
One table, four columns (id, timestamp, topic, message), one additional index on topic, ready.
First calculate the bandwidth: 5000 messages/sec at 2 KB each = 10 MB/sec. Each minute, that's 600 MB. Well, you could drop that in RAM, then flush each hour.
Edit: corrected mistake. Sorry, my bad.
I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
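A minimal sketch of that design: a ConcurrentQueue per topic, a write threshold, a receiving method that only timestamps and enqueues, and a single writer loop. Names and the threshold value are illustrative, and file naming details are glossed over:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class TopicWriter
{
    // One queue per topic; the receiving thread only timestamps and enqueues.
    static readonly ConcurrentDictionary<string, ConcurrentQueue<string>> Queues =
        new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    const int WriteThreshold = 100;   // defer writing until a queue has this many messages

    public static void OnMessageReceived(string topic, string value)
    {
        string line = DateTime.UtcNow.ToString("o") + "\t" + value;
        Queues.GetOrAdd(topic, _ => new ConcurrentQueue<string>()).Enqueue(line);
    }

    // Writer thread: rotate through the queues and flush any that crossed the threshold.
    public static void WriterLoop(CancellationToken cancel)
    {
        while (!cancel.IsCancellationRequested)
        {
            foreach (var pair in Queues)
            {
                if (pair.Value.Count < WriteThreshold) continue;

                using (var writer = File.AppendText(pair.Key + ".log"))
                {
                    string line;
                    while (pair.Value.TryDequeue(out line))
                        writer.WriteLine(line);
                }
            }
            Thread.Sleep(100);   // avoid spinning when traffic is light
        }
    }
}

Additional writer threads could each take a subset of the queues if one writer can't keep up.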
I'd like to explore a bit more why you don't want to use a DB; they're GREAT at things like this! But on to your options...
1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.
This is close to db-ish-ness! Also sounds like more trouble than it's worth.
RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.
How would I approach this? DB! Because then I can query index, analyze, etc. etc.
:)
I would agree with Kyle and go with a packaged product like PI. Be aware that PI is quite expensive.
If you're looking for a custom solution, I'd go with Stephen's approach with some modifications: have one server receive the messages and drop them into a queue. You can't use a file to hand the messages off to the other process, though, because you're going to have constant locking issues. Probably use something like MSMQ (MS Message Queuing), but I'm not sure of its speed.
I would also recommend using a DB to store your data. You'll want to do bulk inserts into the DB though, as I think you would need some hefty hardware for SQL Server to handle 5000 transactions a second. You're better off doing a bulk insert every, say, 10,000 messages that accumulate in the queue (see the SqlBulkCopy sketch after the size estimate below).
DATA SIZES:
Average message ~50 bytes:
small datetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.
~50 bytes * 5000 msg/sec = ~244 KB/sec -> ~14 MB/min -> ~858 MB/hour
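A minimal sketch of the bulk-insert idea using SqlBulkCopy; the dbo.Messages table, its columns, and the batch size are assumptions:

using System;
using System.Data;
using System.Data.SqlClient;

class MessageBatcher
{
    // Assumed table: dbo.Messages(ReceivedAt datetime, Topic varchar(10), Message varchar(50))
    public static DataTable CreateBatchTable()
    {
        var table = new DataTable();
        table.Columns.Add("ReceivedAt", typeof(DateTime));
        table.Columns.Add("Topic", typeof(string));
        table.Columns.Add("Message", typeof(string));
        return table;
    }

    // Flush an accumulated batch (e.g. every ~10,000 queued messages) in one bulk operation.
    public static void FlushBatch(string connectionString, DataTable batch)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Messages";
            bulk.BatchSize = 10000;
            bulk.WriteToServer(batch);
        }
        batch.Clear();
    }
}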
Perhaps you don't want the overhead of a DB install?
In that case, you could try a filesystem-based database like sqlite:
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
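If you try SQLite, a rough sketch of batched, transactional inserts could look like this (shown with the Microsoft.Data.Sqlite package; System.Data.SQLite is very similar, and the table layout is an assumption):

using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;   // NuGet: Microsoft.Data.Sqlite

class TopicMessage
{
    public DateTime ReceivedAt;
    public string Topic;
    public string Body;
}

class SqliteStore
{
    public static void WriteBatch(string dbPath, IEnumerable<TopicMessage> batch)
    {
        using (var conn = new SqliteConnection("Data Source=" + dbPath))
        {
            conn.Open();

            using (var create = conn.CreateCommand())
            {
                create.CommandText =
                    "CREATE TABLE IF NOT EXISTS messages(ts TEXT, topic TEXT, body TEXT)";
                create.ExecuteNonQuery();
            }

            // One transaction per batch: commits are the expensive part in SQLite,
            // so per-row commits would throttle a 5000 msg/sec stream badly.
            using (var tx = conn.BeginTransaction())
            {
                using (var insert = conn.CreateCommand())
                {
                    insert.Transaction = tx;
                    insert.CommandText =
                        "INSERT INTO messages(ts, topic, body) VALUES($ts, $topic, $body)";
                    insert.Parameters.Add("$ts", SqliteType.Text);
                    insert.Parameters.Add("$topic", SqliteType.Text);
                    insert.Parameters.Add("$body", SqliteType.Text);

                    foreach (var m in batch)
                    {
                        insert.Parameters["$ts"].Value = m.ReceivedAt.ToString("o");
                        insert.Parameters["$topic"].Value = m.Topic;
                        insert.Parameters["$body"].Value = m.Body;
                        insert.ExecuteNonQuery();
                    }
                }
                tx.Commit();
            }
        }
    }
}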
I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out to the individual files. Doing things this way allows you to minimize the number of open file handles while still handling the incoming requests in real time. If you make the first program format its output correctly, then processing it into the individual files should be simple.
I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file (append operations are as fast as they can be) and use a separate process/service to split the file up into the 1000 files. You could even roll the file over every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the previous one up into the 1000 separate files.
All this does beg the questions of why not a DB, and why you need 1000 different files. You may have a very good reason, but then again, perhaps you should re-think your strategy and make sure the reasoning is sound before you go too far down this path.
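A rough sketch of such a roll-over appender, assuming a single receiving thread and a separate splitter service that picks up the closed files later (names and the file naming scheme are illustrative):

using System;
using System.IO;

class RollingAppender
{
    private readonly string _directory;
    private readonly TimeSpan _rollInterval;
    private StreamWriter _writer;
    private DateTime _openedAt;

    public RollingAppender(string directory, TimeSpan rollInterval)
    {
        _directory = directory;
        _rollInterval = rollInterval;
    }

    // Called from the single receiving thread; appends are sequential and cheap.
    public void Append(DateTime timestamp, string topic, string message)
    {
        if (_writer == null || DateTime.UtcNow - _openedAt > _rollInterval)
            Roll();

        _writer.WriteLine("{0:o}\t{1}\t{2}", timestamp, topic, message);
    }

    private void Roll()
    {
        if (_writer != null) _writer.Dispose();   // the splitter service picks up closed files

        _openedAt = DateTime.UtcNow;
        string path = Path.Combine(_directory,
            "messages-" + _openedAt.ToString("yyyyMMdd-HHmmss") + ".log");
        _writer = new StreamWriter(path, append: true);
    }
}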
I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried doing things like this with files and an MS SQL database before, and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs and they even have packages that let you query the data just as if it were SQL.
It wouldn't allow me to post hyperlinks, so just google those 2 products and you will find information on them.
EDIT
If you do use a database, as most people are suggesting, I would recommend a table for each topic for the historical data, and consider table partitioning, indexes, and how long you are going to store the data.
For example, if you are going to store a day's worth of data and it's one table per topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432,000 records a day per table. After exporting the data, I would imagine that you would have to clear it out for the next day, which will cause a lock, so you will have to queue your writes to the database. Then, if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock (online index rebuilding requires MS SQL Enterprise Edition). If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.
Basically, what I'm saying is: weigh the cost of purchasing a reliable product against the cost of building your own.
