I have a program that receives real time data on 1000 topics. It receives -- on average -- 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.
I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.
I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.
I've considered a few approaches:
1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.
2) Write into one file and -- somehow -- process it later to produce 1000 files.
3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough ram although I might need to move to 64 bit to get over the 2 GB limit.
How would you approach this problem?
I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.
If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.
Seriously. Database it.
Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.
Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.
Like n8wrl, I would also recommend a DB. But if you really dislike that option ...
Let's find another solution ;-)
As a minimal first step I would use two threads. The first is a worker that receives all the data and puts each object (timestamp, two strings) into a queue.
A second thread checks this queue (either notified by an event or by polling the Count property). This thread dequeues each object, opens the corresponding file, writes the record, closes the file, and proceeds to the next object.
I would start with this first approach and take a look at the performance. If it is too slow, measure where the bottleneck is and address it (for example, keep open files in a dictionary of (name, StreamWriter)).
But on the other hand, a DB would be a very good fit for this problem...
One table, four columns (id, timestamp, topic, message), one additional index on topic, ready.
First calculate the bandwidth! 5000 messages/sec at 2 KB each = 10 MB/sec. That is 600 MB each minute. Well, you could drop that in RAM. Then flush each hour.
Edit: corrected mistake. Sorry, my bad.
I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
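The per-topic queues with a write threshold can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, and the class name `TopicBatcher`, the default threshold of 100, and the tab-separated line format are all assumptions made for the example. It also assumes topic names are safe to use as file names.

```python
import os
import threading
from collections import defaultdict
from datetime import datetime, timezone


class TopicBatcher:
    """Buffers messages per topic and appends them to one file per topic."""

    def __init__(self, directory, threshold=100):
        self.directory = directory
        self.threshold = threshold          # hypothetical batch size N
        self.buffers = defaultdict(list)    # topic -> pending lines
        self.lock = threading.Lock()
        os.makedirs(directory, exist_ok=True)

    def add(self, topic, value):
        """Called by the receiving thread: timestamp the message and queue it."""
        stamp = datetime.now(timezone.utc).isoformat()
        with self.lock:
            self.buffers[topic].append(f"{stamp}\t{value}\n")

    def flush(self, force=False):
        """Called by the writer thread: write out every buffer that is full.

        Returns the number of topic files that were written.
        """
        with self.lock:
            ready = {t: lines for t, lines in self.buffers.items()
                     if lines and (force or len(lines) >= self.threshold)}
            for topic in ready:
                self.buffers[topic] = []
        # File I/O happens outside the lock so the receiver is not blocked.
        for topic, lines in ready.items():
            # Assumes the topic name is a valid file name.
            path = os.path.join(self.directory, f"{topic}.log")
            with open(path, "a", encoding="utf-8") as f:
                f.writelines(lines)
        return len(ready)
```

The writer thread would call `flush()` in a loop and `flush(force=True)` on shutdown, so each file is opened once per batch instead of once per message.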
I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...
1) 1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.
2) This is close to db-ish-ness! Also sounds like more trouble than it's worth.
3) RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.
How would I approach this? DB! Because then I can query index, analyze, etc. etc.
:)
I would agree with Kyle and go with a package product like PI. Be aware PI is quite expensive.
If you're looking for a custom solution I'd go with Stephen's with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file to hand off the messages to the other process, though, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure about its speed.
I would also recommend using a db to store your data. You'll want to do bulk inserts of data into the db though, as I think you would need some hefty hardware to allow SQL to do 5000 transactions a second. You're better off doing a bulk insert for every, say, 10000 messages that accumulate in the queue.
DATA SIZES:
Average message ~50 bytes ->
small datetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.
50 bytes * 5000/sec = 244 KB/sec -> 14 MB/min -> 858 MB/hour
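The size estimate above can be reproduced with a few lines of arithmetic. This is only a check of the figures, assuming the ~50 bytes per message and 5000 messages per second stated in the estimate, and using binary units (1 KiB = 1024 bytes):

```python
# Assumed sizes from the estimate above: ~50 bytes per message, 5000 messages/s.
BYTES_PER_MESSAGE = 50
MESSAGES_PER_SECOND = 5000

bytes_per_second = BYTES_PER_MESSAGE * MESSAGES_PER_SECOND
kib_per_second = bytes_per_second / 1024
mib_per_minute = bytes_per_second * 60 / 1024**2
mib_per_hour = bytes_per_second * 3600 / 1024**2

# Prints: 244 KiB/s, 14.3 MiB/min, 858 MiB/hour
print(f"{kib_per_second:.0f} KiB/s, {mib_per_minute:.1f} MiB/min, {mib_per_hour:.0f} MiB/hour")
```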
Perhaps you don't want the overhead of a DB install?
In that case, you could try a filesystem-based database like sqlite:
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out. Doing things this way allows you to minimize the number of file handles open while still handling the incoming requests in realtime. If you make the first program format its output correctly then processing it to the individual files should be simple.
I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file (append operations are as fast as they can be) and use a separate process/service to split up the file into the 1000 files. You could even roll over the file every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the old one up into 1000 separate files.
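A rough sketch of the splitting step, in Python for brevity. The function name `split_by_topic` and the line format `timestamp<TAB>topic<TAB>value` are assumptions for illustration, and the sketch assumes one rolled-over file fits in memory.

```python
import os
from collections import defaultdict


def split_by_topic(combined_path, out_dir):
    """Split a combined log into one file per topic; returns the topic count."""
    os.makedirs(out_dir, exist_ok=True)
    by_topic = defaultdict(list)
    with open(combined_path, encoding="utf-8") as src:
        for line in src:
            # Assumed line format: timestamp<TAB>topic<TAB>value
            stamp, topic, value = line.rstrip("\n").split("\t", 2)
            by_topic[topic].append(f"{stamp}\t{value}\n")
    # Write each topic in one go, so only one output file is open at a time.
    for topic, lines in by_topic.items():
        path = os.path.join(out_dir, f"{topic}.log")
        with open(path, "a", encoding="utf-8") as out:
            out.writelines(lines)
    return len(by_topic)
```

Because the output files are opened in append mode, running this on each rolled-over file in order keeps every per-topic file in arrival order.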
All this does beg the question of why not a DB, and why do you need 1000 different files - you may have a very good reason - but then again, perhaps you should re-think your strategy and make sure it is sound reasoning before you go too far down this path.
I would look into purchasing a real-time data historian package, something like a PI System or Wonderware Data Historian. I have tried to do things like this in files and an MS SQL database before and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs, and they even have packages where you can query the data just as if it were SQL.
It wouldn't allow me to post Hyperlinks so just google those 2 products and you will find information on them.
EDIT
If you do use a database like most people are suggesting, I would recommend a table for each topic for historical data, and consider table partitioning, indexes, and how long you are going to store the data.
For example, if you are going to store a day's worth and it's one table for each topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432000 records a day. After exporting the data I would imagine that you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then, if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock and require MS SQL Enterprise Edition for online index rebuilding. If you don't clear the data every day you will have to make sure you have plenty of disk space to throw at it.
Basically what I'm saying is: weigh the cost of purchasing a reliable product against building your own.
Related
I'm trying to record log files in my database. My question is which approach puts less load on the database when writing logs. I'm thinking of storing long-term log files, maybe 3-5 years maximum, for an inventory program.
Process: I'll be using a barcode scanner.
After scanning a barcode, I'll get all the details of who is logged in, date and time, product details then saved per piece.
I came up with two ideas.
1) After the scanning event, the record will be saved in a DataTable; after finishing a batch, the DataTable will be written to a *.txt file and then uploaded to my database.
2) After every scanned barcode, an INSERT query will be executed. I suspect this option will be heavy on the server side since I'm not the only one using this server.
What are the pros and cons of the two options?
Are there more efficient ways of storing logs?
Based on your use case, I also think you need to consider at least 2 additional factors. The first is how important it is that the scanned item is logged in the database immediately. If you need the scanned item to be logged because you'll be checking whether it has been scanned, for example to prevent other scans, then doing a single insert is probably a very good idea. The second thing to consider is whether you will ever need to "unscan" an item, and at which part of the process. If the person scanning needs the ability to revert the scan immediately, it might be a good idea to wait until they're done with all their scans before dumping the data to the database, as this will let you avoid ever having to delete from the table.
Overall I wouldn't worry too much about what the database can handle; SQL Server is very good at handling simultaneous single inserts into a table that's designed for that use case. If you're only going to be inserting new data at the end of the table, and not updating or deleting existing records, performance is going to scale very well. The same goes for larger batch inserts: they're very efficient no matter how many rows you want to bring in, assuming your table is designed for that purpose.
So overall I would probably pick the more efficient solution from the application side for your specific use case, and then once you have decided that, you can shape the database around the code, rather than trying to shape your code around suspected limitations of the database.
What are the pros and cons of the two options?
Basically your question is which way is more efficient (bulk insert or multiple single inserts)?
The answer always depends on the situation, so unfortunately I don't think there's a single right answer for you. It depends on things like:
The way you structure the log table.
If you choose bulk insert, how many rows do you want to insert at 1 time?
Is it read-only table? And if you want to read from it, how often do you do the read?
Do you need to scale it up?
etc...
Are there more efficient ways of storing logs?
There are some possible improvements I can think of (not all of them can work together):
If you go with the first option, maybe you can schedule the insert to non-peak hours
If you go with the first option, chunk the log files and do the insert
Use another database to do the logging
If you go with the second option, do some load testing
Personally, I prefer to go with the second option if the project is small to medium sized and the logging is a critical part of the project.
hope it helps.
Go with the second option, and use transactions. This way the data will not be sent to the db until you call the transaction to complete. (Which can be scheduled.) This will also prevent broken data from getting into your database when a crash or something occurs.
Transactions in .net
Transaction Tutorial in C#
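As a rough illustration of grouping the per-scan inserts into one transaction, here is a minimal sketch. It uses Python and an in-memory SQLite database purely as a stand-in for the real server, and the table `scan_log`, its columns, and the function `save_batch` are hypothetical names chosen for the example.

```python
import sqlite3


def save_batch(conn, scans):
    """Insert a batch of scans atomically: all rows commit, or none do."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO scan_log (user_name, scanned_at, barcode) VALUES (?, ?, ?)",
            scans,
        )


# Stand-in database; the schema is an assumption for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_log (user_name TEXT, scanned_at TEXT, barcode TEXT NOT NULL)"
)
save_batch(conn, [("ann", "2024-01-01T09:00:00", "0001"),
                  ("ann", "2024-01-01T09:00:05", "0002")])
```

If any row in a batch is rejected, the whole batch is rolled back, so a crash or bad record cannot leave half a batch in the table.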
My application has different tasks, each one posting an XML document via HTTP POST to a different endpoint. For every thread I need to keep count of the messages I have sent; each message is identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint will save the last message id sent, so that if there is a problem and the application needs to restart it won't send the same message again, and will restart from where it currently was.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server for saving the counters, with a row for every task, gives me a 50% decrease in throughput. Saving the counter value in a text file for every task is a bit faster but still far from my goal. I am looking for a way to persist such information so that I can be as close as possible to my goal. I thought that maybe appending the last processed Id rather than updating it could help me avoid possible write locks, but the bottom line is that I don't care if, for the sake of performance, I have to waste disk space or have a higher startup time for reading the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with an ESENT storage, via the ManagedEsent - PersistentDictionary wrapper.
The PersistentDictionary class is concurrent and provides real concurrent access to the ESENT backend. You would represent everything in key-value pair format.
Give it a try, it is not much code to write.
ESENT is an in-process database engine, disk based + in-memory caching, used throughout several Windows components (Search, Exchange, etc). It does provide transactional support, which is what you're after.
It has been included in all versions of Windows since 2000 so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
I faced the same issue as the OP.
I used SQL server Sequence Numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution to avoid using SQL server.
I have a C# application that generates data every 1 second (stock tick data) which can be discarded after each iteration.
I would like to pass this data to a ColdFusion (10) application. I have considered having the C# application write the data to a file every second and then having the ColdFusion application read that data, but this is most likely going to cause issues, with the potential for both applications trying to read or write to the file at the same time.
I was wondering if using memory-mapped files would be a better approach? If so, how could I access the memory-mapped file from ColdFusion?
Any advice would be greatly appreciated. Thanks.
We have produced a number of stock applications that include tick by tick tracking of watchlists, charting etc. I think the idea of a file is probably not a great idea unless you are talking about a single stock with regular intervals. In my experience a change every "second" is probably way understating the case. Some stocks (AAPL or GOOG are good examples) have hundreds of "ticks" per second during peak times.
So if you are NOT taking every tick but really are "updating the file" every 1 second then your idea has some merit in that you could use a file watching gateway to fire events for you and "see" that the file is updated.
But keep in mind that you are in effect introducing something "in the middle". A file now stands between your Java or CF applications and the quote engine. That's going to introduce latency no matter what you choose to do (file handles getting and releasing etc). And the locks of one process may interfere with the other.
When you are dealing with Facebook updates, milliseconds don't really matter much - in spite of all the teenage girls who probably disagree with me :) With stock quotes, however, half of the task is shaving off milliseconds to get your processes as close to real time as possible.
Our choice is usually to choose sockets instead of something in the middle bridging the data. The quote engine then keeps its watchlist and updates its arrays like normal but also sends any updates downstream to the socket engine, which pushes them to something that can handle them (a chart application, watchlist, socket gateway for a web page, etc.).
Hope this helps - it's not a clear answer but more of a clarification to the hurdles you face.
I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load the entire file into memory, obviously I need to read it as a stream. The method I've currently thought up as the best way to do it is to make several passes over both files, one pass for each letter of the alphabet, loading all the domains that start with 'a' in the first pass, for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, with collecting all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad core xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
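The lock-step walk over the two sorted files might look roughly like this. It is a sketch in Python, and the function names `zones` and `compare` are made up for the example. It assumes each line has the `domain NS nameserver` shape shown in the question, and that both inputs were sorted in plain byte order (for example with `LC_ALL=C` set for `sort`), so that Python's string comparison agrees with the file order.

```python
from itertools import groupby


def zones(lines):
    """Group sorted 'domain NS nameserver' lines into (domain, {nameservers})."""
    records = (line.split() for line in lines if line.strip())
    for domain, group in groupby(records, key=lambda r: r[0]):
        yield domain, {r[2] for r in group}


def compare(yesterday_lines, today_lines):
    """Walk two sorted zone listings in lock-step and yield (change, domain)."""
    old, new = zones(yesterday_lines), zones(today_lines)
    o, n = next(old, None), next(new, None)
    while o is not None or n is not None:
        if n is None or (o is not None and o[0] < n[0]):
            yield "removed", o[0]          # only in yesterday's file
            o = next(old, None)
        elif o is None or n[0] < o[0]:
            yield "added", n[0]            # only in today's file
            n = next(new, None)
        else:
            if o[1] != n[1]:               # same zone, different NS set
                yield "changed", o[0]
            o, n = next(old, None), next(new, None)
```

Both arguments can be open file objects, so each file is streamed once and only the nameservers of the current zone are held in memory.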
The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
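The same back-of-the-envelope numbers, written out as a short calculation. All inputs are the assumptions stated above (7 GB file, 20 MB/second streaming, 70-character lines, 6000 rpm):

```python
FILE_BYTES = 7 * 10**9               # 7 GB text file
STREAM_BYTES_PER_SEC = 20 * 10**6    # pessimistic 20 MB/second
AVG_LINE_BYTES = 70                  # assumed average line length
RPM = 6000

stream_seconds = FILE_BYTES / STREAM_BYTES_PER_SEC
rows = FILE_BYTES // AVG_LINE_BYTES
rotation_seconds = 60 / RPM              # one full rotation
avg_seek_seconds = rotation_seconds / 2  # on average, half a rotation
seek_seconds = rows * avg_seek_seconds

print(f"streaming: {stream_seconds:.0f} s")   # streaming: 350 s
print(f"seeking:   {seek_seconds:.0f} s = {seek_seconds / 86400:.1f} days")
# seeking:   500000 s = 5.8 days
```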
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write-ahead log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it, it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort facts, partition them into key/values, and send off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get for a single domain the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are that it is in the new but not the old. Drops are that it is in the old but not the new. Registrar changes are that it is in both but the registrar changed.
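To make the mapper/reducer split concrete, here is a single-process sketch in Python. It is not Hadoop code: `map_reduce` is a hypothetical stand-in for the framework that simply maps, sorts by key, and reduces in memory. The file labels `"old"` and `"new"` are assumptions, and the value emitted per line is the nameserver column from the zone-file format shown in the question.

```python
from itertools import groupby
from operator import itemgetter


def mapper(filename, line):
    """Emit (domain, (filename, nameserver)) for one zone-file line."""
    domain, _, nameserver = line.split()
    yield domain.lower(), (filename, nameserver.lower())


def reducer(domain, values):
    """Given all facts for one domain, emit only the interesting change."""
    old = {ns for name, ns in values if name == "old"}
    new = {ns for name, ns in values if name == "new"}
    if not old:
        yield "added", domain
    elif not new:
        yield "removed", domain
    elif old != new:
        yield "changed", domain


def map_reduce(inputs):
    """Single-process stand-in for the framework: map, sort by key, reduce."""
    facts = [kv for name, lines in inputs.items()
             for line in lines for kv in mapper(name, line)]
    facts.sort(key=itemgetter(0))
    for domain, group in groupby(facts, key=itemgetter(0)):
        yield from reducer(domain, [value for _, value in group])
```

A real framework would run the mappers on separate blocks of the files, do the sort and partitioning across machines, and hand each reducer only the keys in its partition.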
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
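The chunk-sort-merge steps could be sketched like this in Python. The function names and the `max_lines` chunk size are assumptions for illustration; `heapq.merge` does the streaming merge of the already sorted chunk files.

```python
import heapq
import os
import tempfile
from itertools import islice


def sorted_chunks(path, max_lines):
    """Sort the file in pieces of at most max_lines; return chunk file paths."""
    chunk_paths = []
    with open(path, encoding="utf-8") as src:
        while True:
            # Make sure every line ends with a newline so merged output is clean.
            chunk = [line if line.endswith("\n") else line + "\n"
                     for line in islice(src, max_lines)]
            if not chunk:
                break
            chunk.sort()
            fd, chunk_path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            chunk_paths.append(chunk_path)
    return chunk_paths


def merge_chunks(chunk_paths, out_path):
    """Stream-merge the sorted chunks into one sorted file, then clean up."""
    files = [open(p, encoding="utf-8") for p in chunk_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

Only one chunk is held in memory during the first pass, and the merge reads each chunk file sequentially, which matches the streaming access pattern the answers above recommend.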
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.
This is more or less a design question. We have to process about 1 million rows and send an XML file to a third party. Initially we have to send about 1 million records; later we will send only the deltas.
Right now the stored procedure is taking approximately 15 to 20 minutes to return the data. It's a console app. I know it's not a good way to get 1 million records at a time.
I want to know the following things
1) Is a console app in C# that connects to the database the right approach or not?
2) Are there any other ways of doing this?
I'd appreciate your guidance on this. There is no need for any code; we just need some advice on how to proceed.
Thanks in advance.
My thoughts:
don't fetch all the data then process it; but process it as it arrives - via IDataReader or LINQ
use equally streaming approach for the file; perhaps XmlWriter directly, or maybe XStreamingElement - in either case reading from the source above
this vastly reduces the amount of memory you need, and allows your machine to do something useful while waiting on the network IO
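The streaming idea in the points above, sketched in Python rather than with IDataReader and XmlWriter. The function `export_rows` and the element names `rows` and `row` are assumptions for the example; it accepts any DB-API cursor and writes each record as soon as it is read, so the full result set is never held in memory.

```python
from xml.sax.saxutils import XMLGenerator


def export_rows(cursor, out):
    """Write one <row> element per record as it is read from the cursor."""
    xml = XMLGenerator(out, encoding="utf-8")
    xml.startDocument()
    xml.startElement("rows", {})
    # Column names come from the query; they are assumed to be valid XML names.
    columns = [d[0] for d in cursor.description]
    for record in cursor:                  # rows are fetched as we iterate
        xml.startElement("row", {})
        for name, value in zip(columns, record):
            xml.startElement(name, {})
            xml.characters(str(value))     # special characters are escaped
            xml.endElement(name)
        xml.endElement("row")
    xml.endElement("rows")
    xml.endDocument()
```

Here `out` can be an open file or a network stream wrapper, so the XML is produced while the database is still returning rows.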
Re 1: Depends on your architecture. That simple. It is a VIABLE approach.
Re 2: Yes, tons. All viable. You could make a system service handling data generation upon request. You could have a web application.
In general, a console app will work fine, and 1 million rows in a result set are not exactly a lot either. Totally workable.
15-20 minutes is odd, though. Where is the time spent? 1 million rows to transfer and write out should not take more than 2-3 minutes.
1) Yeah, why not.
2) Yes.
Use cursors.
You will need to be a little more specific on what you are doing during the 15 to 20 minutes.
You are asking about the "right" way to do things - what are you optimising for?
Speed? A 15 - 20 minute stored proc sounds dangerous. What is it doing?
Maintenance / Readability? A console app will work. It would also be easier to test (unit testing etc) than a stored proc.
I have never liked long running stored procedures because it's not easy to see progress. At least with a console app you can output something
Trust me, 1 million records isn't a big deal for a well-known commercial database; it shouldn't take 15 to 20 minutes to return the records. Something else is wrong! Are you building the XML file in the stored procedure? If yes, please remove that and implement the XML building in C#. The SP has only one simple task: fetching data. It won't take a long time if you are not joining 1 million records to another 1 million records. After the data comes into the application (a console application is OK in this case), build the XML with maybe LINQ-to-XML. If you are still not satisfied with the performance, make your code parallel.
EDIT Your SP is time consuming; you need to optimize it. An example: in the SP, T_Data with 1m records joins T_User with 1m records, which costs a lot of time. After optimization: in the SP, T_Data joins one record in T_User (almost a WHERE expression, which is very fast), and in the C# code you get the records from T_User and, for each record, call the SP to get the data, then build one section/*piece* of your XML. All of them can be processed concurrently. At the end, you merge all the pieces of XML into one.