Database with a table containing 700 million records [duplicate] - c#

This question already has answers here:
Possible Duplicate: What are the performance characteristics of sqlite with very large database files?
Closed 10 years ago.
I want to create a .Net application that uses a database that will contain around 700 million records in one of its tables. I wonder whether SQLite's performance would satisfy this scenario, or whether I should use SQL Server instead. I like the portability that SQLite gives me.

Go for SQL Server for sure. 700 million records in SQLite is too much.
With SQLite you have the following limitations:
Single-process writes (only one writer at a time)
No mirroring
No replication
Check out this thread: What are the performance characteristics of sqlite with very large database files?

700m is a lot.
To give you an idea: if your record size were 4 bytes (essentially storing a single value), your DB would already be over 2GB. If your record size is closer to 100 bytes then it's closer to 65GB (and that's not including space used by indexes, transaction log files, etc.).
We do a lot of work with large databases and I'd never consider SQLite for anything of that size. Quite frankly, "portability" is the least of your concerns here. In order to query a DB of that size with any sort of responsiveness you will need an appropriately sized database server. I'd start with 32GB of RAM and fast drives.
If it's write-heavy (90%+ writes), you might get away with less RAM. If it's read-heavy then you will want to build it out so that the machine can keep as much of the DB (or at least the indexes) in RAM as possible. Otherwise you'll be dependent on disk spindle speeds.

SQLite SHOULD be able to handle this much data. However, you may have to configure it to allow it to grow to this size, and you shouldn't have this much data in an "in-memory" instance of SQLite, just on general principles.
For more detail, see this page which explains the practical limits of the SQLite engine. The relevant config settings are the page size (up to 64KB) and the page count (up to roughly 2.1 billion, the max value of a signed 32-bit int). Do the math, and the entire database can grow to roughly 140TB. A database consisting of a single table with 700m rows would be on the order of tens of gigabytes; easily manageable.
However, just because SQLite CAN store that much data doesn't mean you SHOULD. The biggest drawback of SQLite for large datastores is that the SQLite code runs as part of your process, using the thread on which it's called and taking up memory in your sandbox. You don't get the tools available in server-oriented DBMSes to "divide and conquer" large queries or datastores, such as replication or clustering. With a table this large, an insertion or deletion will take a very long time to put the row in the right place and update all the indexes. Selection MAY be livable, but only for indexed queries; a page or table scan will absolutely kill you.
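If you do go the SQLite route anyway, here is a minimal sketch of adjusting those limits from C# (this assumes the System.Data.SQLite ADO.NET provider; the file name and table are placeholders):

using System.Data.SQLite; // assumption: the System.Data.SQLite provider

class SqliteSetup
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=huge.db"))
        {
            conn.Open();

            // The page size must be set before the first table is created
            // (or be followed by a VACUUM on an existing database).
            using (var cmd = new SQLiteCommand("PRAGMA page_size = 8192;", conn))
                cmd.ExecuteNonQuery();

            // Optionally cap how large the file may grow, in pages.
            using (var cmd = new SQLiteCommand("PRAGMA max_page_count = 2147483646;", conn))
                cmd.ExecuteNonQuery();

            using (var cmd = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS readings (id INTEGER PRIMARY KEY, value REAL);", conn))
                cmd.ExecuteNonQuery();
        }
    }
}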

I've had tables with similar record counts and no problems retrieval-wise.
For starters, look at the hardware and the memory allocated to the server. See this for examples: http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/
Regardless of size or number of records, as long as you:
create indexes on foreign key(s),
store common queries in views (http://en.wikipedia.org/wiki/View_%28database%29),
and maintain the database and tables regularly,
you should be fine. Also, setting the proper column type/size for each column will help.
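As a rough illustration of the first two points, a hedged sketch from C# (the table, column, and view names are invented for the example):

using System.Data.SqlClient;

class SchemaTuning
{
    static void Main()
    {
        // Hypothetical schema: Orders.CustomerId is a foreign key to Customers.
        using (var conn = new SqlConnection("Server=.;Database=Sales;Integrated Security=true"))
        {
            conn.Open();

            // Index the foreign key so joins and lookups don't scan the table.
            using (var cmd = new SqlCommand(
                "CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);", conn))
                cmd.ExecuteNonQuery();

            // Capture a common query as a view (CREATE VIEW must be its own batch).
            using (var cmd = new SqlCommand(
                "CREATE VIEW dbo.CustomerOrderCounts AS " +
                "SELECT CustomerId, COUNT(*) AS OrderCount FROM dbo.Orders GROUP BY CustomerId;", conn))
                cmd.ExecuteNonQuery();
        }
    }
}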

Related

C#, Npgsql, NpgsqlCommandBuilder and transferred data

I am developing a C# app that connects to a remote Postgresql database.
The database size is ~50-60Mb (this is approximately the size of the "data\base" folder and the size returned by "select pg_database_size"), but if I perform a "SELECT * FROM " on all tables, the data transferred over the LAN is ~600Mb (ten times bigger!).
I checked that most of the data transfer is due to the CommandBuilder:
NpgsqlCommandBuilder cBuilder = new NpgsqlCommandBuilder(_dAdapter);
_dAdapter.DeleteCommand = cBuilder.GetDeleteCommand();
_dAdapter.InsertCommand = cBuilder.GetInsertCommand();
_dAdapter.UpdateCommand = cBuilder.GetUpdateCommand();
Where is the issue? Is there a way to minimize the data transferred when performing a "SELECT *" query?
This has nothing to do with C# and very little to do with Npgsql.
What it has to do with is not comparing like with like.
On the one hand you have the database's internal storage. This is data on disk stored in a way designed primarily for quick querying and updating, along with a goal of taking up little disk space when possible. In particular, large values stored using TOAST are compressed internally.
See the documentation on Database Physical Storage for more.
This you say is about 50-60 Mb (though I guess you probably mean MB) in your case.
On the other hand you have the database's Frontend/Backend Protocol, which is how postgresql servers and clients communicate with each other.
While reducing the number of bytes involved is also a goal here, so too is ease of translation for the applications involved. The problems around representing the boundaries between different values are completely different, and there is also SSL overhead and so on if you are going over the wire.
See the documentation on Frontend/Backend Protocol for more.
As such, while some of what is stored in the files is irrelevant here, we would expect the size of a full transfer to be much, much larger. You say 600Mb (again, I'm guessing you mean MB), so that matches.
Where is the issue?
There is no issue.
Is there a way to minimize the data transferred when performing a "SELECT *" query?
Well, one will hopefully be coming for free in a bit, as the Npgsql team were working on some reductions to transfer size through more optimised use of the protocol last time I checked. (I contributed to Npgsql some years ago, but these days I only occasionally have a look at what they're up to).
This optimisation will be worth doing, along with other optimisation work, but it's still not going to make a big difference: even if you hand-optimised every use of the Frontend/Backend protocol (and even if you did so with prior knowledge of the actual data, letting you make the perfect choice in cases where one approach gives smaller transfers sometimes and larger ones at other times), doing SELECT * on all tables is still going to involve a much greater size than the storage does.*
Beyond that, the best way to reduce the cost of doing SELECT * is not to do it:
Avoid grabbing entire tables in the first place; engineer as much as possible to select only meaningful data to a given use.
If you really need to grab everything, use COPY. This postgresql extension to SQL is supported by Npgsql and is optimised for bulk transfer rather than queries.
Using COPY may or may not be meaningfully smaller, so it's worth seeing how it fares with your particular set of data.
*Just how much larger would depend on how much space is taken up by indices and how much is saved by large values being compressed in TOAST; there could potentially be a case where the transfer size was in fact smaller than the storage size, but that would be a side effect of the data stored rather than something one could deliberately engineer for.
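For the COPY suggestion above, a minimal sketch (this assumes a reasonably recent Npgsql, which exposes the COPY protocol via BeginTextExport; the table name is a placeholder):

using System;
using Npgsql;

class CopyExportExample
{
    static void Main()
    {
        using (var conn = new NpgsqlConnection("Host=myserver;Database=mydb;Username=me"))
        {
            conn.Open();

            // COPY ... TO STDOUT streams rows in PostgreSQL's text format,
            // which is optimised for bulk transfer rather than per-row queries.
            using (var reader = conn.BeginTextExport("COPY my_table TO STDOUT"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Each line is one row, with columns separated by tabs.
                    Console.WriteLine(line);
                }
            }
        }
    }
}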

Best way to store 10 - 100 million simulation outputs from .net (SQL vs. flat file)

I've been working on a project that is generating on the order of 10 - 100 million outputs from a simulation that I would like to store for future analyses. There are several natural levels of organization to the data, e.g. classrooms hold students who take tests, which have a handful of different performance metrics.
It seems like my data is borderline in terms of being able to fit in memory all at once (given that running the simulations already requires a fair amount of data in memory), but I don't have any immediate need for all of the data to be available to my program at once.
I am considering whether it would be better to output the calculated values to a SQL database or a flat text file. I am looking for advice about which approach might be faster/easier to maintain (or, if you have an alternate suggestion for storing the data, I am open to that).
I don't need to be able to share the data with anyone else or worry about accessing the data years down the line. I just need a convenient way to avoid regenerating the simulations every time I want to tweak the analysis of the values.
I'd consider using a database - 100 million files is too many for a file system without some kind of classification scheme, while a database can easily handle this many rows. You could just serialize the output into a BLOB column so you don't have to map it. Also, consider that SQL Server has FILESTREAM storage, so this could essentially be a hybrid approach where SQL manages the files for you.
Offhand, it sounds like you would be better off saving the results of each simulation run into a flat file. It need not be a text file - it could be binary.
After one or more simulation runs, the files could be read and placed into a data warehouse for later analysis.
The back-of-the-envelope rate for loading the data from an RDBMS server into memory is roughly 10K records per second. If you have 100M records, and if you must use all data at some point, you are looking at roughly three hours to load the data. That is before you do any calculations!
Plain files can be orders of magnitude faster. You can get pretty fast with a text-based file; going binary would improve your speed some more at the expense of readability of your data file.
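A minimal sketch of the flat binary file approach (the record layout here is invented; adapt it to whatever your simulation actually produces):

using System.IO;

class SimulationOutputFile
{
    static void Main()
    {
        // Write one fixed-size record per simulated test result.
        using (var writer = new BinaryWriter(File.Create("run_0001.dat")))
        {
            for (int i = 0; i < 1000; i++)
            {
                writer.Write(i);        // classroomId (int)
                writer.Write(i % 30);   // studentId   (int)
                writer.Write(0.85);     // score       (double)
            }
        }

        // Reading it back later for analysis is just as direct.
        using (var reader = new BinaryReader(File.OpenRead("run_0001.dat")))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                int classroomId = reader.ReadInt32();
                int studentId = reader.ReadInt32();
                double score = reader.ReadDouble();
            }
        }
    }
}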
Take a look at MongoDB, which is apparently around 30x-50x faster than SQL Server 2008 for this kind of workload (according to the benchmark below):
http://blog.michaelckennedy.net/2010/04/29/mongodb-vs-sql-server-2008-performance-showdown/

Optimal storage of data structure for fast lookup and persistence

Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking storage on the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third party API, GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item ids are of the form 2007123456, 2010001234 (10 digits where the first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be sub-second. Also, if there is an update on an existing itemId, I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a bit to indicate whether the item is present or not. That would limit the array length to a little over 1 MB per user and give fast lookups as well as an easy way to update the list per user. By persisting this as memory-mapped files with the .NET 4 framework, I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL Server 2008 under the following conditions.
Table with two columns (userid, itemid), both int
Clustered index on the two columns
Added ~800,000 items for each of 180 users - a total of 144 million rows
Allocated 4 GB RAM for SQL Server
Dual-core 2.66 GHz laptop
SSD disk
Use a SqlDataReader to read all itemids into a List
Loop over all users
If I run one thread it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade. Adding a third thread brings a lot of the queries up to 2 seconds. A fourth thread, up to 4 seconds; a fifth spikes some of the queries up to 50 seconds.
The CPU is pegged while this is going on, even with one thread. My test app takes some of it due to the tight loop, and SQL Server the rest.
Which leads me to the conclusion that it won't scale very well, at least not on the hardware I tested. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? That would make it harder to remove items, though.
[Update 2010-03-31 #2]
I did a quick test with the same data, putting it as bits in memory-mapped files. It performs much better. Six threads yield access times between 0.02s and 0.06s - purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And where the SQL database took 4 GB, the files on disk took 23 MB.
After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, essentially treated as a bit array. And if the id sequence suddenly changes, it will allocate in another part of the file.
In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then check each bit in those sequences. Checking whether a particular id is set is also very fast: if it falls outside the allocated blocks, it's not there; if it falls within, it's merely one byte read and a bit-mask check to see if the correct bit is set.
So for the particular scenario where you have many id's which you want to check on with as much speed as possible, this is the most optimal way I've found so far.
And the good part is that the memory-mapped files can be shared with Java as well (which turned out to be needed). Java also has support for memory-mapped files on Windows, and implementing the read/write logic is fairly trivial.
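A minimal sketch of the bit-per-id scheme described above, using .NET memory-mapped files (this omits the P/Invoke call that marks the file as sparse on NTFS; the capacity and file name are placeholders):

using System.IO;
using System.IO.MemoryMappedFiles;

class UserItemBitmap
{
    // One file per user; each item id maps to a single bit at byte offset id / 8.
    const long Capacity = 2110000000L / 8; // enough for ids up to ~2.11 billion

    readonly MemoryMappedViewAccessor _view;

    public UserItemBitmap(string path)
    {
        var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, Capacity);
        _view = mmf.CreateViewAccessor();
    }

    public void SetItem(long itemId)
    {
        long byteOffset = itemId / 8;
        byte current = _view.ReadByte(byteOffset);
        _view.Write(byteOffset, (byte)(current | (1 << (int)(itemId % 8))));
    }

    public bool HasItem(long itemId)
    {
        long byteOffset = itemId / 8;
        return (_view.ReadByte(byteOffset) & (1 << (int)(itemId % 8))) != 0;
    }
}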
I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.
2000 users isn't too bad, but with 10 million related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need, and they perform very well.
They also allow for better scalability into the future. If you suddenly need to deal with two million users and billions of settings having a good db in place will make scaling a non-issue.

Dealing with large number of text strings

My project, when it is running, will collect a large number of text blocks (about 20K of them, and the largest run I have seen is about 200K) in a short span of time and store them in a relational database. Each text block is relatively small, averaging about 15 short lines (about 300 characters). The current implementation is C# (VS2008) on .NET 3.5, and the backend DBMS is Microsoft SQL Server 2005.
Performance and storage are both important concerns for the project, but the priority is performance first, then storage. I am looking for answers to these questions:
Should I compress the text before storing it in the DB, or let SQL Server worry about compacting the storage?
Do you know what would be the best compression algorithm/library to use in this context for the best performance? Currently I just use the standard GZip in the .NET framework.
Do you know of any best practices for dealing with this? I welcome outside-the-box suggestions as long as they are implementable in the .NET framework (it is a big project and this requirement is only a small part of it).
EDITED: I will keep adding to this to clarify points raised
I don't need text indexing or searching on these texts. I just need to be able to retrieve them at a later stage for display as a text block, using the primary key.
I have a working solution implemented as above, and SQL Server has no issue at all handling it. This program will run quite often and needs to work with large amounts of data, so you can imagine the size will grow very rapidly; hence every optimization I can do will help.
The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column and use (on average) 300 bytes each.
Then you have up to 200,000 of these to store in a database.
That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.
At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.
And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with "compressed" data that is actually larger than the original data. I'm guessing that most of the time, the compressed size will probably be very close to the original size.
SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no, definitely don't bother trying to compress individual values or rows, it's not going to be worth the effort and may actually make things worse.
If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx
As an example, you can create a compressed table like this:
CREATE TABLE T1
(c1 int, c2 nvarchar(50) )
WITH (DATA_COMPRESSION = PAGE);
If you can't use compression in the database, your strings (at no more than ~300 chars each) are unfortunately not going to be worth compressing with something like System.IO.Compression. I suppose you could try it, though.
Compression will consume resources and typically will hurt performance where significant time is just local communication and processing.
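If you do want to measure it yourself, a minimal sketch using the framework's GZipStream (the sample string is a stand-in for one of your text blocks; for values this small, don't be surprised if the "compressed" output is no smaller):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressionSizeCheck
{
    static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // The GZipStream must be closed before the compressed bytes are complete.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }

    static void Main()
    {
        string sample = "Line 1 of a typical block of text.\r\nLine 2, a little longer than the first.\r\n";
        byte[] compressed = Compress(sample);
        Console.WriteLine("Original: {0} bytes, compressed: {1} bytes",
            Encoding.UTF8.GetByteCount(sample), compressed.Length);
    }
}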
Not entirely clear on what you are asking.
In regard to performance - if you are compressing the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL worry about it later. The trade-off is that the SQL database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?
Based on your numbers (200K rows by 300 bytes) you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the bulk copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table, this should be straightforward.
This would be an alternative to having something like EF generating essentially 200K insert statements.
UPDATE
Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
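A minimal SqlBulkCopy sketch (the connection string, table, and column names are invented for the example; the destination table must already exist):

using System.Data;
using System.Data.SqlClient;

class BulkInsertExample
{
    static void Main()
    {
        // Stage the rows in a DataTable that mirrors the destination table.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Body", typeof(string));

        for (int i = 0; i < 200000; i++)
            table.Rows.Add(i, "text block " + i);

        using (var bulk = new SqlBulkCopy("Server=.;Database=TextStore;Integrated Security=true"))
        {
            bulk.DestinationTableName = "dbo.TextBlocks";
            bulk.BatchSize = 10000; // send rows in batches rather than one round trip per row
            bulk.WriteToServer(table);
        }
    }
}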
I wouldn't worry about compressing them. For strings this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 does not have a native way of doing it, which means you would have to write something yourself. If you do it in the application, that is going to hurt your performance; you could write a CLR routine to do it in the database, but it is still going to be an extra step to actually use the compressed string in your application (or any other application that uses it, for that matter).
Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your biggest problem is going to be keeping a large number of strings in your application's memory. If you are routinely going back to the database to load some of them and not trying to cache all of them at the same time, I wouldn't worry about it unless you are actually seeing problems.
Sounds like you would benefit from using Large-Value Data Types
These data types will store up to 2^31-1 bytes of data
If all of your strings are smallish, there are diminishing returns to be gained by compressing them. Without native SQL compression, they will not be searchable anyway if you compress them.
It sounds like you are trying to solve a decidedly non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.

SQL Server CE or Access for a portable database with c#?

I'm developing an application which needs to store large amounts of data.
I cannot use SQL Server Express edition since it requires a separate installation, and our target customers have already loaded us with complaints about our previous release, which used SQL Server Express.
Now my choices are between SQL Server Compact and Access.
We store huge amounts of reporting data (3 million a week). Which database can I use?
The application is portable and is a product-based application.
Our company is asking us to provide the application in such a way that it can be downloaded and used by anyone from our website. Please help.
Thanks.
Edit: 40,000 records within an hour is the approximate rate at which they are stored. The data stored is just normal varchar, datetime, nvarchar, etc. No images and no binary or other special types.
What is "3 million data"/ 3 million large images? 3 million bytes? There could be a vast difference there.
At any rate, I'd probably choose SQL CE over Access if the actual data size isn't going to exceed what SQL CE supports (4 GB). I've used SQL CE for applications that collect a few hundred thousand records in a week without problems. The database is a single file, is portable, and has the huge benefit that full SQL Server can just attach to it and use it as a data source, even in replication scenarios.
40,000 records per hour is about 11 records per second. I'd suggest creating the various tables and indexes required in both and testing first. Let the test run for a solid 8 hours and see what happens.
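A minimal sketch of such a test against SQL Server Compact (this assumes the System.Data.SqlServerCe provider; the schema is a stand-in for your reporting rows):

using System.Data.SqlServerCe; // assumption: the SQL Server Compact ADO.NET provider

class SqlCeInsertTest
{
    static void Main()
    {
        const string connStr = "Data Source=reports.sdf";

        new SqlCeEngine(connStr).CreateDatabase();

        using (var conn = new SqlCeConnection(connStr))
        {
            conn.Open();

            using (var cmd = new SqlCeCommand(
                "CREATE TABLE Report (Id int IDENTITY(1,1) PRIMARY KEY, Label nvarchar(100), LoggedAt datetime)", conn))
                cmd.ExecuteNonQuery();

            // Simulate roughly 8 hours at 40,000 rows/hour = 320,000 inserts.
            using (var cmd = new SqlCeCommand(
                "INSERT INTO Report (Label, LoggedAt) VALUES (@label, @at)", conn))
            {
                cmd.Parameters.Add("@label", System.Data.SqlDbType.NVarChar);
                cmd.Parameters.Add("@at", System.Data.SqlDbType.DateTime);
                for (int i = 0; i < 320000; i++)
                {
                    cmd.Parameters["@label"].Value = "row " + i;
                    cmd.Parameters["@at"].Value = System.DateTime.UtcNow;
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}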
It's quite possible that the first x records insert reasonably well but then get slower and slower, x being some number between 10K and 1M. "Slower and slower" is quite subjective and depends on the app. In Access I'd suggest doing a compact on a regular basis, i.e. maybe after every 100K records, to clean up the indexes. However, if the app wants to insert records for 8 hours straight without a break, then clearly this won't work well for you.
Or you could try deleting the indexes, doing the record inserts, and recreating the indexes. However, if the users want to query the data while the records are being inserted, then this too won't work.
Also, Access can work significantly faster if the database isn't shared. Again, that may not be practical.
Finally, if you still don't get decent performance, or even if you do, consider having the user install a solid-state drive and placing your database file on it. A 32 GB SSD for a few hundred dollars buys a lot of developer time otherwise spent mucking around with these issues.
If you are tightly coupled to .NET, SQL Server Compact would be the better choice.
If not, consider using SQLite.
