While it is running, my project collects a large number of text blocks (about 20K on a typical run; the largest I have seen is about 200K of them) in a short span of time and stores them in a relational database. Each text block is relatively small, averaging about 15 short lines (roughly 300 characters). The current implementation is C# (VS2008) on .NET 3.5, and the backend DBMS is MS SQL Server 2005.
Performance and storage are both important concerns for the project, but performance takes priority over storage. I am looking for answers to these questions:
Should I compress the text before storing it in the database, or let SQL Server worry about compacting the storage?
What would be the best compression algorithm/library to use in this context for the best performance? Currently I just use the standard GZip support in the .NET Framework (see the sketch after these questions).
Are there any best practices for dealing with this? I welcome outside-the-box suggestions as long as they can be implemented in the .NET Framework (it is a big project and this requirement is only a small part of it).
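For reference, compressing and decompressing one block with the framework's GZipStream looks roughly like this (a simplified sketch, not my exact code):

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] CompressText(string text)
{
    byte[] raw = Encoding.UTF8.GetBytes(text);
    using (var output = new MemoryStream())
    {
        // The GZipStream must be closed before reading the buffer,
        // otherwise the trailing GZip footer is missing.
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        return output.ToArray();
    }
}

static string DecompressText(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip, Encoding.UTF8))
    {
        return reader.ReadToEnd();
    }
}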
EDITED: I will keep adding to this to clarify points raised
I don't need text indexing or searching on these blocks. I just need to be able to retrieve them at a later stage, by primary key, to display as a block of text.
I have a working solution implemented as above, and SQL Server has no trouble handling it. However, this program will run quite often and needs to work with a large data context, so the size will grow very rapidly; hence every optimization I can make will help.
The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column and use (on average) 300 bytes each.
Then you have up to 200,000 of these to store in a database.
That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.
At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.
And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with "compressed" data that is actually larger than the original data. I'm guessing that most of the time, the compressed size will probably be very close to the original size.
SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no, definitely don't bother trying to compress individual values or rows, it's not going to be worth the effort and may actually make things worse.
If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx
As an example, you can create a compressed table like this:
CREATE TABLE T1
(c1 int, c2 nvarchar(50) )
WITH (DATA_COMPRESSION = PAGE);
If you can't use compression in the database, your strings (around 300 characters each) are unfortunately not going to be worth compressing with something like System.IO.Compression. I suppose you could try it, though.
Compression consumes resources and will typically hurt performance when most of the time is spent on local communication and processing anyway.
It's not entirely clear what you are asking.
In regard to performance: if you compress the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL Server worry about it later. The trade-off is that the database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?
Based on your numbers (200K times 300 bytes) you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the bulk copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table, this should be fine.
This would be an alternative to having something like EF generating essentially 200K insert statements.
UPDATE
Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
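For illustration, a minimal SqlBulkCopy sketch might look like this (the TextBlocks table and column names are made up; adapt them to your schema):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Hypothetical example: bulk-load (Id, Body) pairs into a dbo.TextBlocks table.
static void BulkInsert(string connectionString, IEnumerable<KeyValuePair<int, string>> rows)
{
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    table.Columns.Add("Body", typeof(string));

    foreach (var row in rows)
        table.Rows.Add(row.Key, row.Value);

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.TextBlocks";  // made-up table name
        bulk.BatchSize = 5000;                         // tune for your workload
        bulk.WriteToServer(table);
    }
}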
I wouldn't worry about compressing them. For strings of this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 has no native way of doing it, which means you would have to write something yourself. Doing it in the application will hurt your performance; you could write a CLR routine to do it in the database, but it is still going to be an extra step to actually use the compressed string in your application (or any other application that uses it, for that matter).
Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your biggest problem is going to be keeping a large number of strings in your application's memory. If you are routinely going back to the database to load some of them and not trying to cache all of them at the same time, I wouldn't worry about it unless you are actually seeing problems.
It sounds like you would benefit from using large-value data types.
These data types (varchar(max), nvarchar(max), varbinary(max)) can store up to 2^31-1 bytes of data.
If all of your strings are smallish, there are diminishing returns to be gained by compressing them. Without native SQL compression, they will not be searchable anyway if you compress them.
It sounds like you are trying to solve a decidedly non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.
Related
I am developing a C# app that connects to a remote Postgresql database.
The database size is ~50-60Mb (this is approximately the size of the "data\base" folder and the size returned by "select pg_database_size"), but if I perform a "SELECT * FROM " on all tables, the data transferred through the LAN is ~600Mb (ten times bigger!).
I checked, and most of the data transfer is due to the CommandBuilder:
NpgsqlCommandBuilder cBuilder = new NpgsqlCommandBuilder(_dAdapter);
_dAdapter.DeleteCommand = cBuilder.GetDeleteCommand();
_dAdapter.InsertCommand = cBuilder.GetInsertCommand();
_dAdapter.UpdateCommand = cBuilder.GetUpdateCommand();
Where is the issue? Is there a way to minimize the data transferred when performing a "SELECT *" query?
This has nothing to do with C# and very little to do with Npgsql.
What it has to do with is not comparing like with like.
On the one hand you have the database's internal storage. This is data on disk, stored in a way designed primarily for quick querying and updating, along with a goal of taking up little disk space where possible. In particular, large values stored using TOAST are compressed internally.
See the documentation on Database Physical Storage for more.
This you say is about 50-60 Mb (though I guess you probably mean MB) in your case.
On the other hand you have the database's Frontend/Backend Protocol, which is how postgresql servers and clients communicate with each other.
While reducing the number of bytes involved is also a goal here, so too is ease of translation for the applications involved. The problem of representing the boundaries between different values is completely different, and there is also SSL overhead and so on if you are going over the wire.
See the documentation on Frontend/Backend Protocol for more.
As such, even though some of what is stored in the files is irrelevant here, we would expect the size of a full transfer to be much, much larger. You say 600Mb (again, I'm guessing you mean MB), and that matches.
Where is the issue?
There is no issue.
Is there a way to minimize the data transferred when performing a "SELECT *" query?
Well, one will hopefully be coming for free in a bit, as the Npgsql team were working on some reductions to transfer size through more optimised use of the protocol last time I checked. (I contributed to Npgsql some years ago, but these days I only occasionally have a look at what they're up to).
That optimisation will be worth having, along with other optimisation work, but it still won't make a big difference: even if you hand-optimised every use of the Frontend/Backend protocol (and even if you did so with prior knowledge of the actual data, allowing you to make the perfect choice in cases where one approach results in smaller transfers sometimes and larger ones at other times), doing SELECT * on all tables is still going to involve much more data than the storage does.*
Beyond that, the best ways to reduce the cost of doing SELECT * are not to do so:
Avoid grabbing entire tables in the first place; engineer things so that, as much as possible, you select only the data that is meaningful to a given use.
If you really need to grab everything, use COPY. This postgresql extension to SQL is supported by Npgsql and is optimised for bulk transfer rather than queries.
Using COPY may or may not result in a meaningfully smaller transfer, so it's worth seeing how it fares with your particular set of data.
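For what it's worth, a rough sketch of a text-format COPY export with Npgsql might look like the following. I'm assuming a reasonably recent Npgsql here (BeginTextExport); the COPY API has changed between versions, and my_table is just a placeholder.

using Npgsql;

static void ExportWholeTable(string connectionString)
{
    using (var conn = new NpgsqlConnection(connectionString))
    {
        conn.Open();
        // COPY ... TO STDOUT streams the rows in CSV form rather than as a result set.
        using (var reader = conn.BeginTextExport("COPY my_table TO STDOUT (FORMAT CSV)"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // each line is one CSV row; parse or forward it as needed
            }
        }
    }
}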
*Just how much larger depends on how much space is taken up by indices and how much is saved by large values being compressed in TOAST; there could even be a case where the transfer size was in fact smaller than the storage size, but that would be a side effect of the data stored rather than something one could deliberately engineer for.
I find myself faced with a conundrum whose answer probably falls outside my expertise. I'm hoping someone can help.
I have an optimised and efficient query for fetching table (and linked) data, the actual contents of which are unimportant. However, upon each read that data then needs to be processed to present it in JSON format. In typical examples, a few hundred rows can have a few hundred thousand associated rows, so this takes time. With multi-threading and a powerful CPU (i7 3960X) the processing takes around 400-800 ms at 100% CPU. It's not a lot, I know, but why process it each time in the first place?
In this particular case, although everything I've ever read points to not doing so (as I understood it), I'm considering storing the computed JSON in a VARCHAR(MAX) column for fast reading.
Why? Well, the data is read 100 times or more for every single write (change). Given those numbers, it seems to me it would be far better to store the JSON for optimised retrieval and to re-compute and update it on the odd occasion the associations change - adding perhaps 10 to 20 ms to the time taken to write changes, but improving the reads by some large factor.
Your opinions on this would be much appreciated.
Yes, storing redundant information for performance reasons is pretty common. The first step is to measure the overhead - and it sounds like you've done that already (although I would also ask: what json serializer are you using? have you tried others?)
But fundamentally, yes that's ok, when the situation warrants it. To give an example: stackoverflow has a similar scenario - the markdown you type is relatively expensive to process into html. We could do that on every read, but we have insanely more reads than writes, so we cook the markdown at write, and store the html as well as the source markdown - then it is just a simple "data in, data out" exercise for most of the "show" code.
It would be unusual for this to be a common problem with json, though, since json serialization is a bit simpler and lots of meta-programming optimization is performed by most serializers. Hence my suggestion to try a different serializer before going this route.
Note also that the rendered JSON may need more network bandwidth than the original source data in TDS - so your data transfer between the DB server and the application server may increase; another thing to consider.
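To make the cook-at-write idea concrete, here is a minimal sketch (the Parents table, CachedJson column, and the use of Json.NET are all assumptions for illustration):

using System.Data.SqlClient;
using Newtonsoft.Json;

// Hypothetical: recompute and store the JSON whenever the underlying rows change.
static void SaveCachedJson(string connectionString, int parentId, object graph)
{
    // The expensive serialization now happens once per write instead of on every read.
    string json = JsonConvert.SerializeObject(graph);

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "UPDATE dbo.Parents SET CachedJson = @json WHERE Id = @id", conn))
    {
        cmd.Parameters.AddWithValue("@json", json);
        cmd.Parameters.AddWithValue("@id", parentId);
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}

// Reads then simply SELECT CachedJson and return it verbatim.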
I've been working on a project that is generating on the order of 10-100 million outputs from a simulation, which I would like to store for future analyses. There are several natural levels of organization to the data, e.g. Classrooms hold Students who take Tests, which each have a handful of different performance metrics.
It seems like my data is borderline in terms of fitting in memory all at once (given that the simulation itself requires a fair amount of data in memory for its calculations), but I don't have any immediate need for all of the data to be available to my program at once.
I am considering whether it would be better to output the calculated values to a SQL database or to a flat text file. I am looking for advice about which approach might be faster/easier to maintain (and if you have an alternative suggestion for storing the data, I am open to that).
I don't need to be able to share the data with anyone else or worry about accessing the data years down the line. I just need a convenient way to avoid regenerating the simulations every time I want to tweak the analysis of the values.
I'd consider using a database - 100 million separate files would be too many for a file system without some kind of classification scheme, while a database can easily handle that many rows. You could just serialize the output into a BLOB column so you don't have to map it. Also, consider that SQL Server has FILESTREAM access, so this could essentially be a hybrid approach where SQL Server manages the files for you.
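As a rough sketch of the serialize-into-a-BLOB idea (the SimulationResults table and the choice of BinaryFormatter are just assumptions for illustration):

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Hypothetical sketch: serialize one simulation result into a varbinary(max) column.
static void SaveResult(string connectionString, int runId, object result)
{
    byte[] blob;
    using (var ms = new MemoryStream())
    {
        new BinaryFormatter().Serialize(ms, result);   // result type must be [Serializable]
        blob = ms.ToArray();
    }

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "INSERT INTO dbo.SimulationResults (RunId, Payload) VALUES (@runId, @payload)", conn))
    {
        cmd.Parameters.AddWithValue("@runId", runId);
        cmd.Parameters.Add("@payload", SqlDbType.VarBinary, -1).Value = blob;  // -1 = varbinary(max)
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}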
Offhand, it sounds like you would be better off saving the results of each simulation run into a flat file. It need not be a text file - it could be binary.
After one or more simulation runs, the files could be read and placed into a data warehouse for later analysis.
The back-of-the-envelope rate for loading data from an RDBMS server into memory is roughly 10K records per second. If you have 100M records and must use all of the data at some point, that is 10,000 seconds - roughly three hours - just to load the data. That is before you do any calculations!
Plain files can be orders of magnitude faster. You can get pretty fast with a text-based file; going binary would improve your speed some more at the expense of readability of your data file.
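For example, appending fixed-layout binary records with a BinaryWriter is about as simple as it gets (the record layout here is invented):

using System.Collections.Generic;
using System.IO;

// Invented record layout: one row per (classroom, student, test) with a score.
struct TestResult
{
    public int ClassroomId;
    public int StudentId;
    public int TestId;
    public double Score;
}

static void AppendResults(string path, IEnumerable<TestResult> results)
{
    using (var stream = new FileStream(path, FileMode.Append, FileAccess.Write))
    using (var writer = new BinaryWriter(stream))
    {
        foreach (var r in results)
        {
            writer.Write(r.ClassroomId);
            writer.Write(r.StudentId);
            writer.Write(r.TestId);
            writer.Write(r.Score);
        }
    }
}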
Take a look at MongoDB, which is apparently around 30x-50x faster than SQL Server 2008 in some benchmarks:
http://blog.michaelckennedy.net/2010/04/29/mongodb-vs-sql-server-2008-performance-showdown/
I want to create a .NET application that uses a database containing around 700 million records in one of its tables. I wonder whether SQLite's performance would satisfy this scenario, or whether I should use SQL Server. I like the portability that SQLite gives me.
Go for SQL Server for sure. 700 million records in SQLite is too much.
With SQLite you have the following limitations:
Single-process writes
No mirroring
No replication
Check out this thread: What are the performance characteristics of sqlite with very large database files?
700m is a lot.
To give you an idea: if your record size were 4 bytes (essentially storing a single value), your DB would already be over 2GB. If your record size is closer to 100 bytes, then it's closer to 65GB... (and that's not including space used by indexes, transaction log files, etc.).
We do a lot of work with large databases, and I'd never consider SQLite for anything of that size. Quite frankly, "portability" is the least of your concerns here. In order to query a DB of that size with any sort of responsiveness you will need an appropriately sized database server. I'd start with 32GB of RAM and fast drives.
If it's 90%+ write-heavy, you might get away with less RAM. If it's read-heavy, then you will want to build it out so that the machine can keep as much of the DB (or at least the indexes) in RAM as possible. Otherwise you'll be dependent on disk spindle speeds.
SQLite SHOULD be able to handle this much data. However, you may have to configure it to allow it to grow to this size, and you shouldn't have this much data in an "in-memory" instance of SQLite, just on general principles.
For more detail, see this page, which explains the practical limits of the SQLite engine. The relevant limits are the maximum page size (64KB) and the maximum page count (approx 2.1 billion, i.e. a signed 32-bit int's max value). Do the math, and an entire database can take up more than 140TB. A database consisting of a single table with 700m rows would be on the order of tens of gigs; easily manageable.
However, just because SQLite CAN store that much data doesn't mean you SHOULD. The biggest drawback of SQLite for large datastores is that the SQLite code runs as part of your process, using the thread on which it's called and taking up memory in your sandbox. You don't get the tools that are available in server-oriented DBMSes to "divide and conquer" large queries or datastores, like replication/clustering. In dealing with a large table like this, insertion/deletion will take a very long time to put it in the right place and update all the indexes. Selection MAY be livable, but only in indexed queries; a page or table scan will absolutely kill you.
I've had tables with similar record counts and no problems retrieval wise.
The hardware and the memory allocated to the server are where you can start. See this for suggested settings: http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/
Regardless of the size or number of records, as long as you:
create indexes on foreign key(s),
store common queries in Views (http://en.wikipedia.org/wiki/View_%28database%29),
and maintain the database and tables regularly
you should be fine. Also, setting the proper column type/size for each column will help.
We currently use List<T> to store events from a simulation project we are running. We need to optimise memory utilisation and the time it takes to process the events in order to derive certain key metrics.
We have thought of moving the event log to a SQL Server Compact database table and then possibly using Linq to calculate the metrics. In your experience, do you think SQL Server Compact will be faster than C#'s built-in data structures, or are we going to have issues?
Some ideas.
MSMQ (Microsoft Message Queue)
You can have a thread dequeuing from MSMQ and updating metrics on the fly. If you need to store these events for later perusal, you can put them into the database as you dequeue them. MSMQ scales much better in these scenarios - especially when the publisher and subscriber have asymmetric processing speeds and binary data is being used (SQL can get bogged down allocating space for VARBINARY, or allocating/splitting pages for indexes).
The two other SQL scenarios below are complementary to this one - you can still use dequeuing to insert into SQL, to avoid any hiccups in your simulation while SQL allocates space.
You can side-step what #Aliostad said using this one, to a certain degree.
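A rough sketch of that consumer thread using System.Messaging (the queue path, the SimulationEvent type and the UpdateMetrics method are placeholders, defined elsewhere in your project):

using System.Messaging;

static void ConsumeEvents()
{
    using (var queue = new MessageQueue(@".\private$\simulationEvents"))
    {
        // Tell MSMQ how to deserialize the message body.
        queue.Formatter = new XmlMessageFormatter(new[] { typeof(SimulationEvent) });

        while (true)
        {
            Message message = queue.Receive();        // blocks until a message arrives
            var evt = (SimulationEvent)message.Body;

            UpdateMetrics(evt);                       // cheap in-memory aggregation
            // optionally insert evt into SQL here, off the simulation's hot path
        }
    }
}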
OLAP (Online Analytical Processing)
It sounds like you might benefit from OLAP (cubes etc.). This will increase the overall runtime of your simulation but will improve the value of the data. Unfortunately, it means forking out cash for one of the bigger SQL Server editions.
Stored Procedures
While Linq-to-SQL is great for 'your average developer', please keep away from it in scientific projects. There are a host of great tricks you can use in raw T-SQL, in addition to being able to inspect the query plan. If you want the best possible performance, plan your DB carefully and create stored procedures/UDFs to aggregate your data.
If some of the metrics can only be calculated in C#, do as much work in SQL beforehand - and then feel free to use Linq-to-SQL to grab the data.
Also remember that if you are inserting from the tail of an MSMQ queue, you can index aggressively, which will speed up your metric calculations without impacting your simulation.
I would only involve SQL if there is a real need for better memory utilization (i.e. you are actually running out of it).
Memory Mapped Files
These allow you to offload memory pressure onto disk, at a performance penalty if the data needs to be 'paged' back in.
Overall
I would steer clear of Linq for defining basic metrics - do it in SQL. MSMQ is without a doubt a huge winner in this case. Don't overcomplicate the memory issue; keep it in .NET if you are not running out of memory.
If you need to process all of the events, a C# List<T> will be faster than SQL Server. A plain array (T[]) will perform even better, especially if the elements are structs rather than classes, since structs are stored inline in the array whereas class instances are only referenced from it. Having the structs inside the array reduces garbage collection pressure and increases cache locality.
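To illustrate the struct-versus-class point, a tiny sketch (the fields are invented):

// A value type: an array of these is one contiguous block of memory,
// with no per-event object header and nothing extra for the GC to track.
struct SimEvent
{
    public long Timestamp;
    public int Kind;
    public double Value;
}

// One contiguous allocation; sequential scans are cache-friendly.
static readonly SimEvent[] Events = new SimEvent[200000];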
If you only need to process part of the events, I think the solutions rank in this order when it comes to speed:
C# data structures, crafted especially for your needs.
Sql Server
Naive C# data structures, traversing a list searching for the right elements.
It sounds like you're thinking you need to have them in a database in order to use Linq. This isn't the case. You can use Linq with C#'s built-in structures.
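For example, deriving a couple of metrics straight from the in-memory list with Linq-to-Objects (the event shape here is invented):

using System;
using System.Collections.Generic;
using System.Linq;

// Invented event shape for illustration.
class EventRecord
{
    public string Kind;
    public double DurationMs;
}

static void PrintMetrics(List<EventRecord> events)
{
    double averageDuration = events.Average(e => e.DurationMs);

    var countsByKind = events
        .GroupBy(e => e.Kind)
        .Select(g => new { Kind = g.Key, Count = g.Count() });

    Console.WriteLine("Average duration: {0} ms", averageDuration);
    foreach (var item in countsByKind)
        Console.WriteLine("{0}: {1}", item.Kind, item.Count);
}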
It depends on what you mean by "faster". If this is about the performance of data access, it all depends on how much data you have; for big data that is used only for statistical purposes, the DB solution is definitely a good choice.
As the DB for this kind of purpose I would suggest SQLite: it is a single-file, fully ACID-compliant database that, like SQL Server Compact, needs no service. But again, this depends on your data size, as SQLite's data limits are lower than SQL Server's.
"We need to optimise memory utilisation"
Use SQL Server CE.
"the time it takes to process the events"
Use Linq-to-Objects.
These two objectives are conflicting, and you need to choose the one that matters more to you.