I am at the very beginning stage of designing an in-memory cache in C# (which will run as a Windows service). Once in production, this is expected to hold close to a million objects (various types) on average. Some cache items can be up to 10MB (or more) in size.
I considered a variety of data storage solutions and I have now decided to go for either a DataTable or an SQLite in-memory instance as the cache store. At this point my questions are:
How do you think a DataTable will perform with this many records?
Do you think going with an SQLite solution is overkill? (Since SQLite is designed as a 'database', I may not really want all that database-related plumbing.)
Performance is the highest priority for me.
EDIT
Adding some more specifics.
These cache items are not just key-value pairs, they have two more
(as of now) properties (pinned and locked items), which can affect
their availability. Every lookup is going to include all three
properties.
Memcached has been considered, but at this point that is not an
option mainly due to our SLA constraints (That’s all I can say about
it).
Not all items are 10MB in size.
I am pretty sure that many of these items are going to be mere
numerical and small string values.
I believe, availability of RAM is not an issue.
Thanks in advance,
James
1: TERRIBLE. DataTables are slow and memory hogs, and that won't magically change for large items.
2: You tell us.
Have you considered using a simple dictionary? Key/Value pairs, you know.
The answers really depend on what you plan to do with the cache.
With a million items, if every item is 1 MB that is 1 TB of memory.
You have 1 TB of memory to dedicate to this?
A database on a solid state disk may be a better design.
DataTable is large and slow.
How are you going to look the items up?
Are you going to have a complete key?
Are you going to have enough memory?
If so, use a dictionary.
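To make the dictionary suggestion concrete, here is a minimal sketch of a keyed cache whose entries carry the two extra flags from the question. The entry shape, the key names, and the rules (a locked item is reported as unavailable, a pinned item is never evicted) are my assumptions; the question does not say exactly how pinned and locked affect availability.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical entry shape: the value plus the two flags the question mentions.
// Assumed rules: a locked item is unavailable, a pinned item is never evicted.
var cache = new ConcurrentDictionary<string, (object Value, bool Pinned, bool Locked)>();

bool TryGet(string key, out object value)
{
    // One hash lookup; the flags travel with the entry, so no second probe is needed.
    if (cache.TryGetValue(key, out var entry) && !entry.Locked)
    {
        value = entry.Value;
        return true;
    }
    value = null;
    return false;
}

bool TryEvict(string key)
{
    // Pinned entries stay in the cache. The check-then-remove pair is not atomic;
    // a real cache would need to guard it.
    return cache.TryGetValue(key, out var entry) && !entry.Pinned && cache.TryRemove(key, out _);
}

cache["price:42"] = (19.99m, false, false);
cache["report:7"] = (new byte[1024], true, false);
cache["draft:3"] = ("in progress", false, true);

Console.WriteLine(TryGet("price:42", out var price) ? $"hit: {price}" : "miss");
Console.WriteLine(TryGet("draft:3", out _) ? "hit" : "miss (locked)");
Console.WriteLine(TryEvict("report:7") ? "evicted" : "kept (pinned)");
```

Keeping the flags inside the entry means a lookup that has to consider all three properties is still a single hash probe.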
I want to create a .Net application that uses a database that will contain around 700 million records in one of its tables. I wonder if the performance of SQLite would satisfy this scenario or should I use SQL Server. I like the portability that SQLite gives me.
Go for SQL Server for sure. 700 million records in SQLite is too much.
With SQLite you have the following limitations:
Single process write.
No mirroring
No replication
Check out this thread: What are the performance characteristics of sqlite with very large database files?
700m is a lot.
To give you an idea. Let's say your record size was 4 bytes (essentially storing a single value), then your DB is going to be over 2GB. If your record size is something closer to 100 bytes then it's closer to 65GB... (that's not including space used by indexes, and transaction log files, etc).
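The arithmetic behind those figures can be checked quickly. This sketch counts raw payload only and reports binary gigabytes (GiB), which is what the numbers above appear to use; the helper name is mine.

```csharp
using System;

const long Rows = 700_000_000;
const double GiB = 1024.0 * 1024.0 * 1024.0;

// Raw payload only: indexes, per-row overhead and log files are excluded.
double SizeInGiB(long bytesPerRow) => Rows * bytesPerRow / GiB;

Console.WriteLine($"4-byte rows:   {SizeInGiB(4):F1} GiB");    // about 2.6
Console.WriteLine($"100-byte rows: {SizeInGiB(100):F1} GiB");  // about 65.2
```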
We do a lot of work with large databases and I'd never consider SQLite for anything of that size. Quite frankly, "portability" is the least of your concerns here. In order to query a DB of that size with any sort of responsiveness you will need an appropriately sized database server. I'd start with 32GB of RAM and fast drives.
If it's write heavy 90%+, you might get away with smaller RAM. If it's read heavy then you will want to try and build it out so that the machine can load as much of the DB (or at least indexes) in RAM as possible. Otherwise you'll be dependent on disk spindle speeds.
SQLite SHOULD be able to handle this much data. However, you may have to configure it to allow it to grow to this size, and you shouldn't have this much data in an "in-memory" instance of SQLite, just on general principles.
For more detail, see this page which explains the practical limits of the SQLite engine. The relevant config settings are the page size (at most 64KB; the default is much smaller) and page count (up to a 32-bit signed int's max value of approx 2.1 billion). Do the math, and the entire database can take up more than 140TB. A database consisting of a single table with 700m rows would be on the order of tens of gigs; easily manageable.
However, just because SQLite CAN store that much data doesn't mean you SHOULD. The biggest drawback of SQLite for large datastores is that the SQLite code runs as part of your process, using the thread on which it's called and taking up memory in your sandbox. You don't get the tools that are available in server-oriented DBMSes to "divide and conquer" large queries or datastores, like replication/clustering. In dealing with a large table like this, insertion/deletion will take a very long time to put it in the right place and update all the indexes. Selection MAY be livable, but only in indexed queries; a page or table scan will absolutely kill you.
I've had tables with similar record counts and no problems retrieval wise.
For starters, the hardware and allocation to the server is where you can start. See this for examples: http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/
Regardless of size or number of records as long as you:
create indexes on foreign key(s),
store common queries in Views (http://en.wikipedia.org/wiki/View_%28database%29),
and maintain the database and tables regularly
you should be fine. Also, setting the proper column type/size for each column will help.
I need to get the set of GUIDs in a remote database which do not exist in an IEnumerable (for context, this is coming from a Lucene index). There are potentially many millions of these Guids.
I currently think that inserting the IEnumerable to the database and doing the difference there will be too expensive (the inserts will hammer the database), but I am prepared to be proven wrong!
Reading both sets into memory is also infeasible due to the amount of data - our existing solution does this and fails with very large sets.
I would like a solution which can operate on a small subset of the data at a time so that we have a constant memory footprint. We have an idea as to how to roll our own implementation of this, but it is non-trivial, so would obviously rather use an existing one if it exists.
If anybody has any recommendations for an existing solution, I'd be grateful to hear them!
You could use SqlBulkCopy to load the GUIDs into the database very quickly (if it is SQL Server).
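If bulk-loading is not an option, the constant-memory requirement in the question can also be met with a streaming merge, provided both sides can be read in the same sort order. A minimal sketch follows; `ExceptSorted` is a hypothetical helper (not an existing library call), and it assumes both sequences arrive sorted ascending under the same comparer. One caveat: as far as I know, SQL Server orders uniqueidentifier values differently from `Guid.CompareTo`, so the two orderings would have to be made to agree first.

```csharp
using System;
using System.Collections.Generic;

// Yields the items of `left` that are absent from `right`.
// Assumes both inputs are sorted ascending under the SAME comparer.
// Only one item from each side is held in memory at a time.
IEnumerable<T> ExceptSorted<T>(IEnumerable<T> left, IEnumerable<T> right, IComparer<T> comparer)
{
    using var r = right.GetEnumerator();
    bool hasRight = r.MoveNext();

    foreach (var item in left)
    {
        // Skip right-hand items that sort before the current left-hand item.
        while (hasRight && comparer.Compare(r.Current, item) < 0)
            hasRight = r.MoveNext();

        // No match at the current position: the item exists only on the left.
        if (!hasRight || comparer.Compare(r.Current, item) != 0)
            yield return item;
    }
}

var g1 = Guid.Parse("00000000-0000-0000-0000-000000000001");
var g2 = Guid.Parse("00000000-0000-0000-0000-000000000002");
var g3 = Guid.Parse("00000000-0000-0000-0000-000000000003");
var g4 = Guid.Parse("00000000-0000-0000-0000-000000000004");
var g5 = Guid.Parse("00000000-0000-0000-0000-000000000005");

var remoteIds = new[] { g1, g2, g3, g4 };   // stand-in for a sorted database reader
var luceneIds = new[] { g2, g4, g5 };       // stand-in for the sorted IEnumerable

foreach (var missing in ExceptSorted(remoteIds, luceneIds, Comparer<Guid>.Default))
    Console.WriteLine(missing);             // prints the ids ending in 1 and 3
```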
Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking of storage in the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third party API, GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item ids are in the form: 2007123456, 2010001234 (10 digits where the first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be subsecond. Also, if there is an update on an existing itemId I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a bit to true/false depending on whether the item was present. That would limit the array length to a little over 1MB per user and give fast lookups as well as an easy way to update the list per user. By persisting this as Memory Mapped Files with the .NET 4 framework I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
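The per-user bit array with the year stripped out could look roughly like this. It is only a sketch of the idea for a single user: the names are hypothetical, and it assumes the six digits after the year form the sequence number (so one million ids, or 125,000 bytes, per year).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical layout for ONE user: one byte[] per year,
// indexed by the six-digit sequence part of the id.
const int IdsPerYear = 1_000_000;            // 000000..999999
var bitsByYear = new Dictionary<int, byte[]>();

void SetItem(long itemId, bool allowed)
{
    int year = (int)(itemId / IdsPerYear);   // 2007123456 -> 2007
    int seq = (int)(itemId % IdsPerYear);    // 2007123456 -> 123456

    if (!bitsByYear.TryGetValue(year, out var bits))
        bitsByYear[year] = bits = new byte[IdsPerYear / 8];

    byte mask = (byte)(1 << (seq % 8));
    if (allowed)
        bits[seq / 8] |= mask;                   // set the bit
    else
        bits[seq / 8] &= (byte)(~mask & 0xFF);   // clear the bit
}

bool HasItem(long itemId)
{
    int year = (int)(itemId / IdsPerYear);
    int seq = (int)(itemId % IdsPerYear);
    return bitsByYear.TryGetValue(year, out var bits)
        && (bits[seq / 8] & (1 << (seq % 8))) != 0;
}

SetItem(2007123456, true);
SetItem(2010001234, true);
SetItem(2007123456, false);                  // an update that revokes access

Console.WriteLine(HasItem(2007123456));      // False
Console.WriteLine(HasItem(2010001234));      // True
```

Removing an item for a user is then a single bit clear, which covers the "remove that itemId for users no longer in the list" requirement.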
The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
Each time a new user is added all the lists have to be updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL Server 2008 under the following conditions.
Table with two columns (userid,itemid) both are Int
Clustered index on the two columns
Added ~800,000 items for 180 users - total of 144 million rows
Allocated 4GB of RAM for SQL Server
Dual Core 2.66ghz laptop
SSD disk
Use a SqlDataReader to read all itemid's into a List
Loop over all users
If I run one thread it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade. Adding a third thread brings a lot of the queries up to 2 seconds. A fourth thread, up to 4 seconds; a fifth spikes some of the queries up to 50 seconds.
The CPU is maxed out while this is going on, even with one thread. My test app takes some of it because of the tight loop, and SQL Server takes the rest.
Which leads me to the conclusion that it won't scale very well, at least not on my tested hardware. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? But this makes it harder to remove items.
[Update 2010-03-31 #2]
I did a quick test with the same data, storing it as bits in memory mapped files. It performs much better. Six threads yield access times between 0.02s and 0.06s. Purely memory bound. The mapped files were mapped by one process, and accessed by six others simultaneously. And whereas the SQL database took 4GB, the files on disk took 23MB.
After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, so the file is effectively treated as a bit array. And if the id sequence suddenly changes, then it will allocate in another part of the file.
In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then I check each bit in those sequences. Checking whether a particular id is set is also very fast. If it falls outside the allocated blocks, then it's not there; if it falls within, it's merely one byte read and a bit mask check to see if the correct bit is set.
So for the particular scenario where you have many ids which you want to check with as much speed as possible, this is the best approach I've found so far.
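The offsets quoted above follow directly from mapping each id to one bit, counted from the start of the file. The sketch below only reproduces that arithmetic and the bit-mask check; a plain byte array stands in for one allocated range of the file, and the helper names are mine. It does not touch memory mapped files or the NTFS sparse-file calls.

```csharp
using System;

// One bit per id, counted from the start of the file.
long ByteOffset(long id) => id / 8;
byte BitMask(long id) => (byte)(1 << (int)(id % 8));

long firstId = 2006000000, lastId = 2010999999;

long startOffset = ByteOffset(firstId);                     // 250,750,000
long allocated = ByteOffset(lastId) - startOffset + 1;      // 625,000 bytes

// Stand-in for the allocated range of the sparse file.
var allocatedBlock = new byte[allocated];

// One byte read plus a mask check, relative to where the allocated range starts.
bool IsSet(byte[] block, long blockStartOffset, long id) =>
    (block[ByteOffset(id) - blockStartOffset] & BitMask(id)) != 0;

allocatedBlock[ByteOffset(2007123456) - startOffset] |= BitMask(2007123456);

Console.WriteLine($"start offset: {startOffset}, allocated: {allocated} bytes");
Console.WriteLine(IsSet(allocatedBlock, startOffset, 2007123456));   // True
Console.WriteLine(IsSet(allocatedBlock, startOffset, 2007123457));   // False
```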
And the good part is that the memory mapped files can be shared with Java as well (which turned out to be something needed). Java also has support for memory mapped files on Windows, and implementing the read/write logic is fairly trivial.
I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.
2000 users isn't too bad but with 10 mil related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need and they perform very well.
They also allow for better scalability into the future. If you suddenly need to deal with two million users and billions of settings having a good db in place will make scaling a non-issue.
Does anyone have any experience with receiving and updating a large volume of data, storing it, sorting it, and visualizing it very quickly?
Preferably, I'm looking for a .NET solution, but that may not be practical.
Now for the details...
I will receive roughly 1000 updates per second; some are updates to existing rows, some are new rows. But it can also be very bursty, with sometimes 5000 updates and new rows.
By the end of the day, I could have 4 to 5 million rows of data.
I have to both store them and also show the user updates in the UI. The UI allows the user to apply a number of filters to the data to just show what they want. I need to update all the records plus show the user these updates.
I have a visual update rate of 1 fps.
Anyone have any guidance or direction on this problem? I can't imagine I'm the first one to have to deal with something like this...
At first thought, I would go for some sort of in-memory database, but will it be fast enough for querying for updates near the end of the day once I get a large enough data set? Or is that all dependent on smart indexing and queries?
Thanks in advance.
It's a very interesting and also challenging problem.
I would approach this with a pipeline design, with processors implementing sorting, filtering, aggregation etc. The pipeline needs an async (thread-safe) input buffer that is processed in a timely manner (given your 1 fps requirement, in under a second). If you can't do that, you need to queue the data somewhere, on disk or in memory depending on the nature of your problem.
Consequently, the UI needs to be implemented in a pull style rather than push: you only want to update it once every second.
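The buffer-plus-pull idea could be sketched as follows. The update shape (a row id and a value) and the function name are assumptions for illustration; in a real application the drain would be driven by a one-second UI timer and followed by the filter and sort stages.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Thread-safe input buffer: producers enqueue, the UI side drains once per frame.
var inbox = new ConcurrentQueue<(int RowId, double Value)>();
// Latest state per row; touched only by the draining (UI) side.
var rows = new Dictionary<int, double>();

// Called once per tick; returns the ids that changed
// so the UI can repaint only those rows.
HashSet<int> DrainAndApply()
{
    var changed = new HashSet<int>();
    while (inbox.TryDequeue(out var update))
    {
        rows[update.RowId] = update.Value;   // insert a new row or overwrite an existing one
        changed.Add(update.RowId);
    }
    return changed;
}

// Simulated burst: 5000 updates touching 1000 distinct rows.
for (int i = 0; i < 5000; i++)
    inbox.Enqueue((i % 1000, i));

var dirty = DrainAndApply();
Console.WriteLine($"{dirty.Count} rows changed, {rows.Count} rows stored");
// prints "1000 rows changed, 1000 rows stored"
```

Because several updates to the same row collapse into one repaint, a burst of 5000 updates costs the UI no more than the number of distinct rows it touched.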
For datastore you have several options. Using a database is not a bad idea, since you need the data persisted (and I guess also queryable) anyway. If you are using an ORM, you may find NHibernate in combination with its superior second level cache a decent choice.
Many of the considerations might also be similar to those Ayende made when designing NHProf, a realtime profiler for NHibernate. He has written a series of posts about them on his blog.
Maybe Oracle is a more appropriate RDBMS solution for you. The problem with your question is that at these "critical" levels there are too many variables and conditions you need to deal with. Not only software, but the hardware that you can have (it costs :)), connection speed, your expected common user system setup and more and more and more...
Good Luck.
Curious if anyone has opinions on which method would be better suited for asp.net caching. Option one, have fewer items in the cache which are more complex, or many items which are less complex.
For the sake of discussion let's imagine my site has SalesPerson and Customer objects. These are pretty simple classes but I don’t want to be chatty with the database so I want to lazy load them into cache and invalidate them out of the cache when I make a change – simple enough.
Option 1
Create Dictionary and cache the entire dictionary. When I need to load an instance of a SalesPerson from the cache I get out the Dictionary and perform a normal key lookup against the Dictionary.
Option 2
Prefix the key of each item and store it directly in the asp.net cache. For example every SalesPerson instance in the cache would use a composite of the prefix plus the key for that object so it may look like sp_[guid] and is stored in the asp.net cache and also in the cache are the Customer objects with a key like cust_[guid].
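To make the two options concrete, here is a small side-by-side sketch. A plain string-keyed dictionary stands in for the ASP.NET cache so the snippet is self-contained, and a SalesPerson is reduced to a name; both are simplifications of mine, not how the real cache or classes look.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the ASP.NET cache: a plain string-keyed dictionary (assumption).
var cache = new Dictionary<string, object>();
var id = Guid.NewGuid();

// Option 1: one cache entry holding a whole dictionary; two lookups per read.
cache["SalesPeople"] = new Dictionary<Guid, string> { [id] = "Alice" };
var people = (Dictionary<Guid, string>)cache["SalesPeople"];
string viaOption1 = people[id];

// Option 2: one cache entry per object under a prefixed key; one lookup per read.
string SalesPersonKey(Guid g) => "sp_" + g;
cache[SalesPersonKey(id)] = "Alice";
string viaOption2 = (string)cache[SalesPersonKey(id)];

Console.WriteLine(viaOption1 == viaOption2);   // True
```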
One of my fears with option two is that the numbers of entries will grow very large, between SalesPerson, Customer and a dozen or so other categories I might have 25K items in cache and highly repetitive lookups for something like a string resource that I am using in several places might pay a penalty while the code looks through the cache’s key collection to find it amongst the other 25K.
I am sure at some point there is a diminishing return here on storing too many items in the cache but I am curious as to opinions on these matters.
You are better off creating many smaller items in the cache than fewer, larger ones. Here is the reasoning:
1) If your data is small, then the number of items in the cache will be relatively small and it won't make any difference. Fetching single entities from the cache is easier than fetching a dictionary and then fetching an item from that dictionary, too.
2) Once your data grows large, the cache may be used to manage the data in an intelligent fashion. The HttpRuntime.Cache object makes use of a Least Recently Used (LRU) algorithm to determine which items in the cache to expire. If you have only a small number of highly used items in the cache, this algorithm will be useless. However, if you have many smaller items in the cache, but 90% of them are not in use at any given moment (very common usage heuristic), then the LRU algorithm can ensure that those items that are seeing active use remain in the cache while evicting less-used items to ensure sufficient room remains for the used ones.
As your application grows, being able to manage what is in the cache becomes increasingly important. Also, I've yet to see any performance degradation from having millions of keys in the cache -- hashtables are extremely fast and if you find issues there it's likely easily solved by altering your naming conventions for your cache keys to optimize them for use as hashtable keys.
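For readers who have not seen least-recently-used eviction spelled out, here is a toy version. It is not how HttpRuntime.Cache is implemented internally; it is only a sketch of the general technique (a hash table for lookup plus a linked list for recency), with a made-up capacity of two entries.

```csharp
using System;
using System.Collections.Generic;

const int Capacity = 2;
// Front of the list = most recently used, tail = least recently used.
var order = new LinkedList<(string Key, string Item)>();
var index = new Dictionary<string, LinkedListNode<(string Key, string Item)>>();

void Put(string key, string item)
{
    if (index.TryGetValue(key, out var existing))
    {
        order.Remove(existing);              // replacing an entry: drop the old node
    }
    else if (index.Count == Capacity)
    {
        // Cache is full: evict the least recently used entry (the tail).
        var lru = order.Last;
        order.RemoveLast();
        index.Remove(lru.Value.Key);
    }
    index[key] = order.AddFirst((key, item));
}

bool TryGet(string key, out string item)
{
    if (index.TryGetValue(key, out var node))
    {
        // A hit moves the entry to the front, so it is evicted last.
        order.Remove(node);
        order.AddFirst(node);
        item = node.Value.Item;
        return true;
    }
    item = null;
    return false;
}

Put("a", "1");
Put("b", "2");
TryGet("a", out _);        // "a" is now the most recently used
Put("c", "3");             // evicts "b", the least recently used

Console.WriteLine(TryGet("b", out _));   // False
Console.WriteLine(TryGet("a", out _));   // True
```

With one big dictionary stored as a single cache item, there is nothing for this kind of policy to work with: the whole dictionary is either in or out.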
The ASP.NET Cache uses its own dictionary so using its dictionary to locate your dictionary to do lookups to retrieve your objects seems less than optimal. Dictionaries use hash tables which is about the most efficient lookup you can do. Using your own dictionaries would just add more overhead, I think. I don't know about diminishing returns in regards to hash tables, but I think it would be in terms of storage size, not lookup time.
I would concern myself with whatever makes your job easier. If having the Cache more organized will make your app easier to understand, debug, extend and maintain then I would do it. If it makes those things more complex then I would not do it.
And as nullvoid mentioned, this is all assuming you've already explored the larger implications of caching, which involve gauging the performance gains vs. the performance hit. You're talking about storing lots and lots of objects, and this implies lots of cache traffic. I would only store something in the cache that you can measure a performance gain from doing so.
We have built an application that uses Caching for storing all resources. The application is multi-language, so for each label in the application we have at least three translations. We load a (Label,Culture) combination when first needed and then expire it from cache only if it was changed by an admin in the database. This scenario worked perfectly well even when the cache contained 100000 items. We only took care to configure the cache and the expiry policies such that we really benefit from the Cache. We use no-expiration, so the items are cached until the worker process is reset or until the item is intentionally expired. We also took care to define the key format so that it uniquely identifies a label in a specific culture with the fewest characters.
I'm going to assume that you've considered all the implications of data changing from multiple users and how that will affect the cached data in terms of handling conflicting data. Caching is really only meant to be done on relatively static data.
From an efficiency perspective I would assume that if you're using the .net serialization properly you're going to benefit from storing the data in the cache in the form of larger typed serialized collections rather than individual base types.
From a maintenance perspective this would also be a better approach, as you can create a strongly typed object to represent the data and use serialization to cast it between the cache and your salesperson/customer object.