This seems like perhaps a naive question, but I got into a discussion with a co-worker where I argued that there is no real need for a cache to be thread-safe/synchronized. I would assume that it does not matter who puts in a value, as the value for a given key should be "constant" (in that it ultimately comes from the same source). If the values can change readily, then the cache itself does not seem to be all that useful (in that if you care that the value is "currently correct", you should go to the original source).
The main reason I see to make at least the GET synchronized is that a cache miss may be very expensive and you don't want multiple threads each going out to fetch a value to put back in the cache. Even then, you'd need something that actually blocks all consumers during the read-fetch-put cycle.
Anyhow, my working assumption is that a hash is by its very nature thread-safe, because for any {key, value} combination the value is either null or something for which it doesn't matter who got there "first" to write it.
Question is: Is this a reasonable assumption?
Update: The real scope of my question is around very simple id->value style caches (or {parameters}->{calculated value} caches), where no matter who writes to the cache the value will be the same, and we are just trying to avoid "re-calculating"/going back to the database. The actual object graph isn't relevant and the cache is generally long-lived.
For most implementations of a hash, you'd need to synchronize. What if the hash table needs to be expanded/rehashed? What if two threads are trying to add something to the hash table where the keys are different, but the hashes collide? They could both be modifying the same slot in the hash table in different ways at the same time. Assuming you're using a hash table to implement your cache (which you imply in your question) I suggest reading a little about the details of how hash tables are implemented if you're not already familiar with this.
Writes aren't always atomic. You must either use atomic data types or provide some synchronization (RCU, locks, etc.). No shared data is thread-safe per se. Alternatively, you can make the problem go away by sticking to lock-free algorithms, where that is possible and feasible.
As long as the cost of acquiring and releasing a lock is less than the cost of recreating the object (from a file or database or whatever), all accesses to a cache should indeed be synchronized. If it's not, you don't really need a cache at all. :)
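To make that concrete, here is a minimal sketch of a lock-protected read-fetch-put cache; the generic types and the load delegate are illustrative, not from the question:

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch: one lock guards the whole read-fetch-put cycle, so at most
// one thread ever recreates a missing value. The load delegate stands in for
// whatever expensive fetch/recalculation the cache is meant to avoid.
public class SimpleCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private readonly object _sync = new object();
    private readonly Func<TKey, TValue> _load;

    public SimpleCache(Func<TKey, TValue> load) { _load = load; }

    public TValue Get(TKey key)
    {
        lock (_sync)
        {
            TValue value;
            if (!_items.TryGetValue(key, out value))
            {
                value = _load(key);   // the expensive miss happens at most once per key
                _items[key] = value;
            }
            return value;
        }
    }
}
```

Holding the lock across the load blocks every other consumer for the duration of the read-fetch-put cycle, which is exactly the trade-off the question describes; if that is too coarse, you can lock per key instead.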
If you want to avoid data corruption, you must synchronize. This is especially true when the cache contains multiple tables that must be updated atomically. Imagine you have a database for a DMV (department of motor vehicles). You add a new person to the database; that person will have records for auto registrations, records for tickets received, records for home address, and perhaps other contact information. If you don't update these tables atomically -- in the database and in the cache -- then any client pulling data out of the cache may get inconsistent data.
Yes, any one piece of data may be constant, but databases very commonly hold data that -- if not updated together and atomically -- can cause database clients to get incorrect or incomplete or inconsistent results.
If you are using Java 5 or above you can use a ConcurrentHashMap. This supports multiple readers and writers in a threadsafe manner.
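The .NET counterpart is ConcurrentDictionary; a rough sketch of the same get-or-add pattern (the loadFromDatabase delegate is a placeholder for the expensive lookup):

```csharp
using System;
using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<int, string>();
Func<int, string> loadFromDatabase = id => "value for " + id;  // placeholder

// Thread-safe for multiple readers and writers. Note that the value factory
// may run more than once under contention, but only one result is stored.
string value = cache.GetOrAdd(42, id => loadFromDatabase(id));
```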
I'm trying to improve upon this program that I wrote for work. Initially I was rushed, and they don't care about performance or anything. So, I made a horrible decision to query an entire database (a SQLite database) and then store the results in lists for use in my functions. However, I'm now considering having each of my functions run in its own thread, and having each function query only the parts of the database that it needs. There are ~25 functions. My question is: is this safe to do? Also, is it possible to have that many concurrent connections? I will only be PULLING information from the database, never inserting or updating.
The way I've had it described to me[*] is to have each concurrent thread open its own connection to the database, as each connection can only process one query or modification at a time. The group of threads with their connections can then perform concurrent reads easily. If you've got a significant problem with many concurrent writes causing excessive blocking or failure to acquire locks, you're getting to the point where you're exceeding what SQLite does for you (and should consider a server-based DB like PostgreSQL).
Note that you can also have a master thread open the connections for the worker threads if that's more convenient, but it's advised (for your sanity's sake if nothing else!) to only actually use each connection from one thread.
[* For a normal build of SQLite. It's possible to switch things off at build time, of course.]
SQLite has no write concurrency, but it supports arbitrarily many connections that read at the same time.
Just ensure that every thread has its own connection.
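One way to sketch the one-connection-per-thread pattern in C# (using the Microsoft.Data.Sqlite package; the file name, table and query are made up):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.Sqlite;

// Each parallel body opens its own connection, so the reads can run
// concurrently. No connection is ever shared between threads.
Parallel.For(0, 25, i =>
{
    using (var connection = new SqliteConnection("Data Source=app.db"))
    {
        connection.Open();
        using (var command = connection.CreateCommand())
        {
            command.CommandText = "SELECT Name FROM Items WHERE Category = $cat";
            command.Parameters.AddWithValue("$cat", i);
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
});
```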
25 simultaneous connections is not a smart idea. That's a huge number.
I usually create a multi-layered design for this problem. I send all requests to the database through a kind of ObjectFactory class that has an internal cache. The ObjectFactory will forward the request to a ConnectionPoolHandler and will store the results in its cache. This connection pool handler uses X simultaneous connections but dispatches them to several threads.
However, some remarks must be made before applying this design. You first have to ask yourself the following 2 questions:
Is your application the only application that has access to this database?
Is your application the only application that modifies data in this database?
If the first question is answered negatively, then you could encounter locking issues. If the second question is answered negatively, then it will be extremely difficult to apply caching. You may even prefer not to implement any caching at all.
Caching is especially interesting in case you are often requesting objects based on a unique reference, such as the primary key. In that case you can store the most often used objects in a Map. A popular collection for caching is an "LRUMap" ("Least-Recently-Used" map). The benefit of this collection is that it automatically keeps the most often used objects at the top. At the same time it has a maximum size and automatically removes items from the map that are rarely used.
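.NET has no built-in LRUMap, but the idea is easy to sketch with a Dictionary for lookup plus a linked list for recency (simplified and not thread-safe; wrap calls in a lock for concurrent use):

```csharp
using System.Collections.Generic;

// Simplified LRU cache: most recently used entries sit at the front of the
// list; when capacity is reached, the entry at the back is evicted.
public class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity) { _capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _order.Remove(node);      // touched: move to the front
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (_map.TryGetValue(key, out existing))
        {
            _order.Remove(existing);  // replaced with a fresh node below
        }
        else if (_map.Count >= _capacity)
        {
            var oldest = _order.Last; // evict the least recently used entry
            _order.RemoveLast();
            _map.Remove(oldest.Value.Key);
        }
        var node = new LinkedListNode<KeyValuePair<TKey, TValue>>(
            new KeyValuePair<TKey, TValue>(key, value));
        _order.AddFirst(node);
        _map[key] = node;
    }
}
```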
A second advantage of caching is that each object exists only once. For example:
An Employee is fetched from the database.
The ObjectFactory converts the resultset to an actual object instance.
The ObjectFactory immediately stores it in the cache.
A bit later, a bunch of employees are fetched using an SQL "... where name like 'John%'" statement.
Before converting the resultset to objects, the ObjectFactory first checks if the IDs of these records are perhaps already stored in the cache.
Found a match! Aha, this object does not need to be recreated.
There are several advantages to having a certain object only once in memory.
Last but not least, in Java there is something like "Weak References". These are references that can in fact be cleaned up by the garbage collector. I am not sure if it exists in C# or what it's called. By implementing this, you don't even have to care about the maximum number of cached objects; your garbage collector will take care of it.
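For what it's worth, .NET has the same concept: the WeakReference class (and the generic WeakReference&lt;T&gt;). A minimal sketch of a cache built on it, with the key and value types as placeholders:

```csharp
using System;
using System.Collections.Generic;

// A cache that holds only weak references: if nothing else references a
// cached value, the GC may collect it, and the next lookup simply misses.
public class WeakCache<TKey, TValue> where TValue : class
{
    private readonly Dictionary<TKey, WeakReference<TValue>> _map =
        new Dictionary<TKey, WeakReference<TValue>>();

    public void Put(TKey key, TValue value)
    {
        _map[key] = new WeakReference<TValue>(value);
    }

    public bool TryGet(TKey key, out TValue value)
    {
        WeakReference<TValue> weak;
        if (_map.TryGetValue(key, out weak) && weak.TryGetTarget(out value))
        {
            return true;          // still alive: the same instance is reused
        }
        _map.Remove(key);         // the target was collected (or never cached)
        value = null;
        return false;
    }
}
```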
So here I am implementing some caching layer. In particular, I am stuck with
ConcurrentDictionary<SomeKey,HashSet<SomeKey2>>
I need to ensure that operations on the HashSet are thread-safe too (ergo that Update is thread-safe). Is it possible in any simple way, or do I have to synchronize in the UpdateFactory delegate? If the answer is yes (which I presume), has any one of you encountered this problem before and solved it?
I want to avoid a ConcurrentDictionary of ConcurrentDictionaries because they allocate a lot of synchronization objects, and I potentially have around a million entries in this thing, so I want to put less pressure on the GC.
HashSet was chosen because it guarantees amortized constant cost of insertion, deletion and access.
The aforementioned structure will be used as an index on a larger data set, with two columns as a key (SomeKey and SomeKey2), much like a database index.
OK, so finally I decided to go with an immutable set and lock striping because it is reasonably simple to implement and understand. If I need more performance on writes (no copying of the whole hash set on insert), I will implement reader/writer locks with striping, which should be fine anyway.
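For reference, a minimal sketch of the copy-on-write idea, here using ConcurrentDictionary.AddOrUpdate with ImmutableHashSet instead of explicit lock striping (the key types are simplified placeholders for SomeKey/SomeKey2):

```csharp
using System.Collections.Concurrent;
using System.Collections.Immutable;

public class SetIndex
{
    private readonly ConcurrentDictionary<int, ImmutableHashSet<int>> _index =
        new ConcurrentDictionary<int, ImmutableHashSet<int>>();

    public void Add(int key, int value)
    {
        // AddOrUpdate may invoke the delegates more than once under contention,
        // but because the sets are immutable only one fully built set ever wins
        // and readers never observe a half-updated set.
        _index.AddOrUpdate(
            key,
            k => ImmutableHashSet.Create(value),
            (k, set) => set.Add(value));
    }

    public bool Contains(int key, int value)
    {
        ImmutableHashSet<int> set;
        return _index.TryGetValue(key, out set) && set.Contains(value);
    }
}
```

The trade-off is the one mentioned above: every insert allocates a new set instance (internally much of the structure is shared), which is the write cost the reader/writer-locks-with-striping variant would avoid.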
Thanks for suggestions.
MemoryCache comes with a Default cache by default and additional named caches can be created.
It seems like there might be advantages to isolating the caching of the results of different processes in different instances. For example, the results of queries against an index could be cached in an "IndexQueryResult" cache and the results of database queries in a "DatabaseQueryResult" cache. That's rather contrived, but it explains the principle.
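For context, creating the named instances from that example next to the default one is straightforward (System.Runtime.Caching; the keys and values below are placeholders):

```csharp
using System;
using System.Runtime.Caching;

class NamedCachesDemo
{
    static void Main()
    {
        // The default instance plus two named instances, each with its own
        // storage and its own (configurable) memory limits.
        var defaultCache = MemoryCache.Default;
        var indexCache = new MemoryCache("IndexQueryResult");
        var dbCache = new MemoryCache("DatabaseQueryResult");

        indexCache.Set("query:123", "index result", DateTimeOffset.Now.AddMinutes(10));
        dbCache.Set("customers:42", "db result", DateTimeOffset.Now.AddMinutes(10));

        Console.WriteLine(indexCache.Get("query:123"));   // index result
        Console.WriteLine(dbCache.Get("customers:42"));   // db result
    }
}
```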
Does the memory pressure on one cache that results in evictions affect the other caches at all? Are there any differences in the way .Net manages multiple caches compared to how it manages one?
Am I wasting my time considering the idea of multiple caches, or is there real value in doing so?
I can't speak to the first few questions, and I'm interested to hear answers to those. However, I can say that we've had a good experience so far using multiple caches in our product. Here are the benefits I see:
Reduced chance of key collision: Rather than coming up with some kind of scheme to ensure that no two separate values end up with the same key, we can simply create a cache that's specific to a given repository type, and know that as long as that repository class uses keys unique to its objects, we won't have collisions.
Better precision with cache eviction: The repository type that "owns" a particular cache instance can subscribe to certain event types on a system-wide event bus, so that it knows when some parts of the cache need to be purged. If we're lucky, it can determine the keys of the entries to purge purely based on the arguments of the published event. However, this is often not the case, and we must either purge the entire cache or iterate through all the cached values to figure out which ones are affected by the published event. If we were using a single cache instance for all data types in our system, we would end up crawling through a lot of unrelated entries. By using separate caches, we can restrict our search to the values that this particular repository was responsible for populating.
Regarding the second point: we also built a UI to expose all the cache instances in the system, and allow us to purge any of them with the click of a button. This comes in handy when we need to make changes directly to the database, and need the system to pick up those changes without having to restart the server. Again, if we only used a single cache, we couldn't be nearly as precise: we'd have to purge all the cached values systemwide, instead of just the values associated with the data types that we tinkered with.
I created two (or more) threads to insert data into a table in a database. When inserting, there is a field CreatedDateTime that, of course, stores the datetime of the record's creation.
For one case, I want the threads to stay synchronized, so that their CreatedDateTime fields will have exactly the same value. When testing with multithreading, I usually get different milliseconds...
I want to test different scenarios in my system, such as:
1) conflicts inserting records exactly at the same time.
2) problems with ordering/selection of records.
3) problems with database connection pooling.
4) problems with multiple users (hundreds) accessing at the same time.
There may be other test cases I haven't listed here.
Yes, that's what happens. Even if by some freak of nature, your threads were to start at exactly the same time, they would soon get out of step simply because of resource contention between them (at a bare minimum, access to the DB table or DBMS server process).
If they stay mostly in step (i.e., never more than a few milliseconds out), just choose a different "resolution" for your CreatedDateTime field. Round it to the nearest 10th of a second (or second) rather than millisecond. Or use fixed values in some other way.
Otherwise, just realize that this is perfectly normal behavior.
And, as pointed out by BC in a comment, you may misunderstand the use of the word "synchronized". It's used (in Java; I hope C# is similar) to ensure two threads don't access the same resource at the same time. In actuality, it almost guarantees that threads won't stay synchronized as you understand the term. Personally I think your definition is right in terms of English usage (things happening at the same time), but certain computer languages have suborned the definition for their own purposes.
If you're testing what happens when specific timestamps go into the database, you cannot rely on threads "behaving themselves" by being scheduled in a specific order and at specific times. You really need to dummy up the data somehow; otherwise it's like trying to nail jelly to a tree (or train a cat).
One solution is to not use things such as getCurrentTime() or now() but use a specific set of inserts which have known timestamps. Depending on your actual architecture, this may be difficult (for example, if you just call an API which itself gets the current timestamp to millisecond resolution).
If you control the actual SQL that's populating the timestamp column, you need to change it to use pre-calculated values rather than now() or its equivalents.
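For example (the table, column and connection string are illustrative), compute one timestamp up front and pass it to every insert as a parameter, so the column value is entirely under the test's control:

```csharp
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;

// One timestamp, computed once and shared by every thread, so all inserted
// rows carry exactly the same CreatedDateTime regardless of scheduling.
var createdDateTime = new DateTime(2024, 1, 1, 12, 0, 0);

Parallel.For(0, 4, i =>
{
    using (var connection = new SqlConnection("<connection string>"))
    {
        connection.Open();
        using (var command = new SqlCommand(
            "INSERT INTO Records (Payload, CreatedDateTime) VALUES (@payload, @created)",
            connection))
        {
            command.Parameters.AddWithValue("@payload", "row " + i);
            command.Parameters.AddWithValue("@created", createdDateTime);
            command.ExecuteNonQuery();
        }
    }
});
```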
If you want to have the same timestamps on multiple rows being inserted, you should create a SQL thread which does a multi-row insert in one query, which will allow you to get the same timestamps. Other than this, I agree with everyone else: you cannot get an exact timestamp at high resolution with multiple threads unless you insert the timestamp as it is seen in the application and share that timestamp among the inserts. That, of course, throws the caveat issues of threads out the window. It's like saying, "I'm going to share this data, but I don't want to use mutexes because they stop the other thread from processing once it hits a lock()."
Curious if anyone has opinions on which method would be better suited for ASP.NET caching: option one, have fewer items in the cache which are more complex, or option two, many items which are less complex.
For sake of discussion lets imagine my site has SalesPerson and Customer objects. These are pretty simple classes but I don’t want to be chatty with the database so I want to lazy load them into cache and invalidate them out of the cache when I make a change – simple enough.
Option 1
Create a Dictionary and cache the entire dictionary. When I need to load an instance of a SalesPerson from the cache, I get the Dictionary out and perform a normal key lookup against it.
Option 2
Prefix the key of each item and store it directly in the ASP.NET cache. For example, every SalesPerson instance in the cache would use a composite of the prefix plus the key for that object, so it may look like sp_[guid] and is stored in the ASP.NET cache; also in the cache are the Customer objects with keys like cust_[guid].
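A rough sketch of option 2 with HttpRuntime.Cache (the SalesPerson type and the loader method are placeholders):

```csharp
using System;
using System.Web;

public class SalesPerson
{
    public Guid Id { get; set; }
}

public static class SalesPersonCache
{
    // Option 2: each object lives in the shared ASP.NET cache under a
    // prefixed key such as "sp_<guid>", next to "cust_<guid>" entries.
    public static SalesPerson Get(Guid id)
    {
        string key = "sp_" + id;
        var cached = HttpRuntime.Cache[key] as SalesPerson;
        if (cached != null)
            return cached;

        var loaded = LoadSalesPersonFromDatabase(id);   // lazy load on a miss
        HttpRuntime.Cache.Insert(key, loaded);
        return loaded;
    }

    public static void Invalidate(Guid id)
    {
        HttpRuntime.Cache.Remove("sp_" + id);           // invalidate on change
    }

    // Placeholder for the real data access code.
    private static SalesPerson LoadSalesPersonFromDatabase(Guid id)
    {
        return new SalesPerson { Id = id };
    }
}
```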
One of my fears with option two is that the number of entries will grow very large. Between SalesPerson, Customer and a dozen or so other categories I might have 25K items in cache, and highly repetitive lookups for something like a string resource that I am using in several places might pay a penalty while the code looks through the cache's key collection to find it amongst the other 25K.
I am sure at some point there is a diminishing return here on storing too many items in the cache but I am curious as to opinions on these matters.
You are best off creating many smaller items in the cache rather than fewer, larger items. Here is the reasoning:
1) If your data is small, then the number of items in the cache will be relatively small and it won't make any difference. Fetching single entities from the cache is easier than fetching a dictionary and then fetching an item from that dictionary, too.
2) Once your data grows large, the cache may be used to manage the data in an intelligent fashion. The HttpRuntime.Cache object makes use of a Least Recently Used (LRU) algorithm to determine which items in the cache to expire. If you have only a small number of highly used items in the cache, this algorithm will be useless. However, if you have many smaller items in the cache, but 90% of them are not in use at any given moment (very common usage heuristic), then the LRU algorithm can ensure that those items that are seeing active use remain in the cache while evicting less-used items to ensure sufficient room remains for the used ones.
As your application grows, the importance of being able to manage what is in the cache will be most important. Also, I've yet to see any performance degradation from having millions of keys in the cache -- hashtables are extremely fast and if you find issues there it's likely easily solved by altering your naming conventions for your cache keys to optimize them for use as hashtable keys.
The ASP.NET Cache uses its own dictionary, so using its dictionary to locate your dictionary to do lookups to retrieve your objects seems less than optimal. Dictionaries use hash tables, which are about the most efficient lookup you can do. Using your own dictionaries would just add more overhead, I think. I don't know about diminishing returns in regards to hash tables, but I think it would be in terms of storage size, not lookup time.
Concern yourself with whatever makes your job easier. If having the Cache more organized will make your app easier to understand, debug, extend and maintain, then I would do it. If it makes those things more complex, then I would not do it.
And as nullvoid mentioned, this is all assuming you've already explored the larger implications of caching, which involve gauging the performance gains vs. the performance hit. You're talking about storing lots and lots of objects, and this implies lots of cache traffic. I would only store something in the cache that you can measure a performance gain from doing so.
We have built an application that uses caching for storing all resources. The application is multi-language, so for each label in the application we have at least three translations. We load a (Label, Culture) combination when first needed and then expire it from the cache only if it was changed by an admin in the database. This scenario worked perfectly well even when the cache contained 100,000 items. We only took care to configure the cache and the expiry policies such that we really benefit from the cache. We use no expiration, so the items are cached until the worker process is reset or until the item is intentionally expired. We also took care to define a domain for the values of the keys in such a way as to uniquely identify a label in a specific culture with the least amount of characters.
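To illustrate the kind of key scheme and no-expiration policy described there (the key format and values are made up):

```csharp
using System.Web;
using System.Web.Caching;

public static class ResourceCache
{
    // Compact composite key: culture + label uniquely identify a translation.
    private static string Key(string culture, string label)
    {
        return "r:" + culture + ":" + label;
    }

    public static void Store(string culture, string label, string text)
    {
        // Cached until the worker process recycles or the entry is removed.
        HttpRuntime.Cache.Insert(
            Key(culture, label),
            text,
            null,                          // no cache dependency
            Cache.NoAbsoluteExpiration,
            Cache.NoSlidingExpiration);
    }

    public static string Get(string culture, string label)
    {
        return HttpRuntime.Cache[Key(culture, label)] as string;
    }

    // Called when an admin changes the translation in the database.
    public static void Expire(string culture, string label)
    {
        HttpRuntime.Cache.Remove(Key(culture, label));
    }
}
```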
I'm going to assume that you've considered all the implications of data changing from multiple users and how that will affect the cached data in terms of handling conflicting data. Caching is really only meant to be done on relatively static data.
From an efficiency perspective I would assume that if you're using the .net serialization properly you're going to benefit from storing the data in the cache in the form of larger typed serialized collections rather than individual base types.
From a maintenance perspective this would also be a better approach, as you can create a strongly typed object to represent the data and use serialization to cast it between the cache and your salesperson/customer object.