How to ensure ConcurrentDictionary of collections thread-safety?

So here I am implementing a caching layer. In particular, I am stuck with
ConcurrentDictionary<SomeKey,HashSet<SomeKey2>>
I need to ensure that operations on the HashSet are thread-safe too (ergo, that Update is thread-safe). Is this possible in any simple way, or do I have to synchronize in the updateValueFactory delegate? If the answer is yes (which I presume), has anyone encountered this problem before and solved it?
I want to avoid a ConcurrentDictionary of ConcurrentDictionaries because they allocate a lot of synchronization objects and I potentially have around a million entries in this thing, so I want to put less pressure on the GC.
HashSet was chosen because it guarantees amortized constant cost of insertion, deletion and access.
The aforementioned structure will be used as an index on a larger data set with two columns as a key (SomeKey and SomeKey2), much like a database index.

OK, so finally I decided to go with an immutable set and lock striping because it is reasonably simple to implement and understand. If I need more performance on the writes (no copying of the whole hash set on insert), I will implement reader/writer locks with striping, which should be fine anyway.
Thanks for suggestions.
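
The immutable-set-plus-lock-striping approach mentioned above might look roughly like the following. This is only a sketch under stated assumptions, not the poster's actual code: StripedSetIndex, StripeFor and the default of 64 stripes are hypothetical, and it assumes a recent .NET with the System.Collections.Immutable package available.

```csharp
using System.Collections.Concurrent;
using System.Collections.Immutable;

// Hypothetical two-column index: first key -> immutable set of second keys.
public sealed class StripedSetIndex<TKey, TItem> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, ImmutableHashSet<TItem>> _map = new();
    private readonly object[] _stripes;

    public StripedSetIndex(int stripeCount = 64) // stripe count is an assumption
    {
        _stripes = new object[stripeCount];
        for (int i = 0; i < stripeCount; i++) _stripes[i] = new object();
    }

    // Keys that hash to the same stripe share one lock object,
    // so a million entries need only stripeCount locks.
    private object StripeFor(TKey key) =>
        _stripes[(key.GetHashCode() & int.MaxValue) % _stripes.Length];

    public bool Add(TKey key, TItem item)
    {
        lock (StripeFor(key))
        {
            var current = _map.TryGetValue(key, out var set)
                ? set
                : ImmutableHashSet<TItem>.Empty;
            if (current.Contains(item)) return false;
            _map[key] = current.Add(item); // publish a new snapshot
            return true;
        }
    }

    public bool Remove(TKey key, TItem item)
    {
        lock (StripeFor(key))
        {
            if (!_map.TryGetValue(key, out var current) || !current.Contains(item))
                return false;
            var updated = current.Remove(item);
            if (updated.IsEmpty) _map.TryRemove(key, out _);
            else _map[key] = updated;
            return true;
        }
    }

    // Readers take no lock: they see whichever snapshot was published last.
    public bool Contains(TKey key, TItem item) =>
        _map.TryGetValue(key, out var set) && set.Contains(item);
}
```

Writers for the same key are serialized by the stripe lock, while readers only ever see a complete, immutable snapshot of the set.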

Related

Concurrent skiplist Read locking

I will try to make this question as generic as possible, but I will give a brief introduction to my actual problem -
I am trying to implement a concurrent skiplist for a priority queue. Each 'node' has a value and an array of 'forward' nodes, where node.forward[i] represents the next node on the i-th level of the skiplist. For write access (i.e. insertions and deletions), I use a SpinLock (I have yet to determine whether that is the best lock to use).
My question is essentially, when I need a read access for a traversal,
node = node.forward[i]
What kind of thread safety do I need around something like this? If another thread is modifying node.forward[i] at exactly the same time that I read (with no current locking mechanism for read), what can happen here?
My initial thought is to have a ReaderWriterLockSlim on the getter and setter of the indexer for Forward. Will there be too much unnecessary locking in this scenario?
Edit: Or would it be best to instead use an Interlocked.Exchange for all of my reads?

If another thread is modifying node.forward[i] at exactly the same time that I read (with no current locking mechanism for read), what can happen here?
It really depends on the implementation. It's possible to use Interlocked.Exchange when setting "forward" in a way that can prevent the references from being invalid (as it can make the "set" atomic), but there is no guarantee of which reference you'd get on read. However, with a naive implementation, anything can happen, including getting bad data.
My initial thought is to have a ReaderWriterLockSlim on the getter and setter of the indexer for Forward.
This is likely to be a good place to start. It will be fairly easy to make a properly synchronized collection using a ReaderWriterLockSlim, and getting it functional is always the first priority.
Will there be too much unnecessary locking in this scenario?
There's no way to know without seeing how you implement it and, more importantly, how it's going to be used. Depending on your usage, you can profile and look for optimization opportunities if necessary at that point.
On a side note - you might want to reconsider using node.Forward[i] as opposed to more of a "linked list" approach here. Any access to Forward[i] is likely to require a fair bit of synchronization to iterate through the skip list i steps, all of which will need some synchronization to prevent errors if there are concurrent writes anywhere between node and i elements beyond node. If you only look ahead one step, you can (potentially) reduce the amount of synchronization required.
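
The ReaderWriterLockSlim idea discussed above could be sketched as follows. This is a minimal illustration rather than a full skip list: SkipNode, GetForward and SetForward are hypothetical names, and it assumes one lock per node, which is simple but costs memory on large lists.

```csharp
using System.Threading;

// Hypothetical skip-list node whose forward pointers are guarded by a
// reader/writer lock, as suggested in the question.
public sealed class SkipNode<T>
{
    private readonly SkipNode<T>?[] _forward;
    private readonly ReaderWriterLockSlim _lock = new();

    public T Value { get; }

    public SkipNode(T value, int levels)
    {
        Value = value;
        _forward = new SkipNode<T>?[levels];
    }

    // Many readers may hold the read lock at the same time.
    public SkipNode<T>? GetForward(int level)
    {
        _lock.EnterReadLock();
        try { return _forward[level]; }
        finally { _lock.ExitReadLock(); }
    }

    // A writer waits until all readers of this node have left.
    public void SetForward(int level, SkipNode<T>? next)
    {
        _lock.EnterWriteLock();
        try { _forward[level] = next; }
        finally { _lock.ExitWriteLock(); }
    }
}
```

This only protects a single pointer read or write. An insertion or deletion that must update several levels consistently still needs coordination at a higher level, which is where profiling the real workload becomes important.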

What's the Best Concurrent Thread Shared Memory Architecture Without Locking?

I have a 2d Array of memory. I have multiple threads reading and writing to single elements in the array spontaneously, arbitrarily, and concurrently.
What is the fastest way or best practice to construct my memory access code? I don't like the idea of locking because it blocks other threads.
Data integrity is actually not that important, but it should be (mostly) consistent. My code can handle a few memory errors.
It needs to be really, really fast!
Thanks for feedback.

If data integrity is not important, you can just access the data without caring about multithreading at all.
No one can predict the result, though.
I wouldn't call this approach "best practice", however. IMHO best practice is caring about multithreading and protecting the data with appropriately-grained mutexes. My opinion is that every application should first be correct, and only then fast. Inconsistent results are just wrong; it doesn't matter whether they come fast or not.

Use the Interlocked class to CAS (Interlocked.CompareExchange) the objects/values in your array. It makes the operation atomic, which ensures that the data is not corrupted. That's about the fastest thing you can do (aside from accessing/modifying the data directly without interlocking). However, if you're modifying the size of the 2D array (growing/shrinking), then you will have some serious problems unless you use some kind of locking mechanism on your array.
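
A compare-and-swap retry loop over a 2D array element could be sketched like this. AtomicGrid and Update are hypothetical names used only for illustration, and the sketch assumes an int[,] array whose size never changes.

```csharp
using System;
using System.Threading;

public static class AtomicGrid
{
    // Atomically replaces grid[row, col] with transform(old value).
    // If another thread changed the cell in the meantime, the CAS fails
    // and the loop retries with the fresh value, so no update is lost.
    public static int Update(int[,] grid, int row, int col, Func<int, int> transform)
    {
        while (true)
        {
            int observed = Volatile.Read(ref grid[row, col]);
            int desired = transform(observed);
            int previous = Interlocked.CompareExchange(ref grid[row, col], desired, observed);
            if (previous == observed)
                return desired;
        }
    }
}
```

For example, eight threads that each call AtomicGrid.Update(grid, 0, 0, v => v + 1) a thousand times leave the cell at 8000, whereas a plain grid[0, 0]++ may lose increments. The transform delegate can run more than once per call, so it should be free of side effects.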

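Declare the array as volatile and ensure it's scoped such that it's visible to all your threads. (Note that in C#, volatile on an array field applies only to the array reference, not to its elements; volatile access to individual elements needs Volatile.Read and Volatile.Write.) I generally like to avoid statics, so either pass the array by reference, or set up all your threads to run methods of an instance class that has the array defined as an instance field.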
However, I strongly urge you to rethink what "volatile access" means in terms of data integrity. Best practice is NOT to do what you are attempting without good locking mechanics. You may think it's a small problem, but you can find yourself with a very non-deterministic system, so much so that its data isn't reliable in the slightest.
Let's say you have 8 threads running, and all of them will get a value from an index of the array, do some calculation, then add the result back to the index of the array. Thread 1 starts first and gets the value of the index, 0. Then threads 2-8 all start and get the same value. Thread 1 performs its calculation, gets the index again to ensure it has the "latest" value, then tries to update the value. However, other threads are waiting for that memory, and due to some scheduling implementation you know nothing about, in between Thread 1 getting the index (still zero) and writing its result, threads 2-8 have ALL written their values. Then Thread 1 writes its value, overwriting everything the other 7 threads have done. The other 7 threads, in turn, probably had similar "races" with each other, so the value Thread 1 overwrote had probably already lost the results of half the threads anyway.
I guarantee you that this behavior is NOT what you want, no matter how much you think you can get away with it; it WILL cause data corruption, which WILL affect other areas of the system, and you WILL be forced to implement proper locking.

If you are interested solely in performance, then the way in which you order your memory accesses can play a big role. Spend an hour or so reading through the slides from Lecture 1 of MIT's Performance Engineering class. The other lectures may also be interesting to you (such as Lecture 6).
Basically, you can optimize your use of the cache to greatly improve performance, depending on your read/write patterns, given the workload you are using.
This should not stop you from doing something that is correct, however.

What C# container is most resource-efficient for existence for only one operation?

I find myself often with a situation where I need to perform an operation on a set of properties. The operation can be anything from checking if a particular property matches anything in the set to a single iteration of actions. Sometimes the set is dynamically generated when the function is called, some built with a simple LINQ statement, other times it is a hard-coded set that will always remain the same. But one constant always exists: the set only exists for one single operation and has no use before or after it.
My problem is, I have so many points in my application where this is necessary, but I appear to be very, very inconsistent in how I store these sets. Some of them are arrays, some are lists, and just now I've found a couple linked lists. Now, none of the operations I'm specifically concerned about have to care about indices, container size, order, or any other functionality that is bestowed by any of the individual container types. I picked resource efficiency because it's a better idea than flipping coins. I figured, since array size is configured and it's a very elementary container, that might be my best choice, but I figure it is a better idea to ask around. Alternatively, if there's a better choice not out of resource-efficiency but strictly as being a better choice for this kind of situation, that would be nice as well.

With your acknowledgement that this is more about coding consistency than performance or efficiency, I think the general practice is to use a List<T>. Its actual backing store is an array, so you aren't really losing much (if anything noticeable) to container overhead. Without more qualifications, I'm not sure that I can offer anything more than that.
Of course, if you truly don't care about the things that you list in your question, just type your variables as IEnumerable<T> and you're only dealing with the actual container when you're populating it; where you consume it will be entirely consistent.
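
As a small illustration of typing the consumer against IEnumerable<T>, consider the sketch below. PropertyChecks and MatchesAny are hypothetical names, and the case-insensitive comparison is an assumption made only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PropertyChecks
{
    // The consumer only enumerates, so it does not care whether the
    // caller built an array, a List<string>, or a lazy LINQ query.
    public static bool MatchesAny(string candidate, IEnumerable<string> allowed) =>
        allowed.Contains(candidate, StringComparer.OrdinalIgnoreCase);
}
```

An array, a List<string>, or a LINQ query can all be passed as the allowed argument without changing the consuming code.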

There are two basic principles to be aware of regarding resource efficiency: runtime complexity and memory overhead.
You said that indices and order do not matter and that a frequent operation is matching. A Dictionary<TKey, TValue> (which is a hash table) is an ideal candidate for this type of work. Lookups on the keys are very fast, which would be beneficial in your matching operation. The disadvantage is that it will consume a little more memory than what would be strictly required. The usual load factor is around 0.8, so we are not talking about a huge increase or anything.
For your other operations you may find that an array or List<T> is a better option especially if you do not need to have the fast lookups. As long as you are not needing high performance on specialty operations (lookups, sorting, etc.) then it is hard to beat the general resource characteristics of array based containers.

List is probably fine in general. It's easy to understand (in the literate programming sense) and reasonably efficient. The keyed collections (e.g. Dictionary, SortedList) will throw an exception if you add an entry with a duplicate key, though this may not be a problem for what you're working on now.
Only if you find that you're running into a CPU-time or memory-size problem should you look at improving the "efficiency", and then only after determining that this is the bottleneck.
No matter which approach you use, there will still be creation and deletion of the underlying objects (collection or iterator) that will eventually be garbage collected, if the application runs long enough.

.NET: Scalability of generic Dictionary

I'm using a Dictionary<> to store a bazillion items. Is it safe to assume that as long as the server's memory has enough space to accommodate these bazillion items that I'll get near O(1) retrieval of items from it? What should I know about using a generic Dictionary as huge cache when performance is important?
EDIT: I shouldn't rely on the default implementations? What makes for a good hashing function?

It depends, just about entirely, on how good a hashing functionality your "bazillion items" support -- if their hashing function is not excellent (so that many conflicts result) your performance will degrade with the growth of the dictionary.

You should measure it and find out. You're the one who has knowledge of the exact usage of your dictionary, so you're the one who can measure it to see if it meets your needs.
A word of advice: I have in the past done performance analysis on large dictionary structures, and discovered that performance did degrade as the dictionary became extremely large. But it seemed to degrade here and there, not consistently on each operation. I did a lot of work trying to analyze the hash algorithms, etc, before smacking myself in the forehead. The garbage collector was getting slower because I had so much live working set; the dictionary was just as fast as it always was, but if a collection happened to be triggered, then that was eating up my cycles.
That's why it is important to not do performance testing in unrealistic benchmark scenarios; to find out what the real-world performance cost of your bazillion-item dictionary is, well, that's going to be gated on lots of stuff that has nothing to do with your dictionary, like how much collection triggering is happening throughout the rest of your program, and when.

Yes, you will have O(1) access times. In fact, to be pedantic, it will be exactly O(1).
You need to ensure that all your objects that are used as keys have a good GetHashCode implementation, and you should likely override Equals.
Edit to clarify: In reality, access times will get slower the more items you have, unless you can provide a "perfect" hash function.

Yes, you will have near O(1) no matter how many objects you put into the Dictionary. But for the Dictionary to be fast, your key-objects should provide a sufficient GetHashCode-implementation, because Dictionary uses a hashtable inside.
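
As a sketch of what a reasonable key type could look like, the example below overrides Equals and GetHashCode together. CacheKey and its fields are hypothetical, and HashCode.Combine assumes .NET Core 2.1 or later (or the Microsoft.Bcl.HashCode package).

```csharp
using System;

// Hypothetical composite key; the field names are illustrative only.
public readonly struct CacheKey : IEquatable<CacheKey>
{
    public int CustomerId { get; }
    public string Region { get; }

    public CacheKey(int customerId, string region)
    {
        CustomerId = customerId;
        Region = region;
    }

    // Equals and GetHashCode must agree: equal keys yield equal hashes.
    public bool Equals(CacheKey other) =>
        CustomerId == other.CustomerId &&
        string.Equals(Region, other.Region, StringComparison.Ordinal);

    public override bool Equals(object? obj) => obj is CacheKey other && Equals(other);

    // Mixes both fields so that keys differing in either one
    // are unlikely to land in the same bucket.
    public override int GetHashCode() => HashCode.Combine(CustomerId, Region);
}
```

Implementing IEquatable<T> on a struct key also lets Dictionary compare keys without boxing.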

Does a cache need to be synchronized?

This seems like perhaps a naive question, but I got into a discussion with a co-worker where I argued that there is no real need for a cache to be thread-safe/synchronized, as I would assume that it does not matter who is putting in a value, as the value for a given key should be "constant" (in that it is coming from the same source ultimately). If the values can change readily, then the cache itself does not seem to be all that useful (in that if you care that the value is "currently correct" you should go to the original source).
The main reason I see to make at least the GET synchronized is if it is very expensive to miss in the cache and you don't want multiple threads each going out to get a value to put back in the cache. Even then, you'd need something that actually blocks all consumers during a read-fetch-put cycle.
Anyhow, my working assumption is that a hash is by its very nature thread-safe because for any {key,value} combination, the value is either null or something for which it doesn't matter who got there "first" to write.
Question is: Is this a reasonable assumption?
Update: The real scope of my question is around very simple id->value style caches (or {parameters}->{calculated value}), where no matter who writes to the cache, the value will be the same and we are just trying to avoid "re-calculating"/going back to the database. The actual graph of the object isn't relevant and the cache is generally long-lived.

For most implementations of a hash, you'd need to synchronize. What if the hash table needs to be expanded/rehashed? What if two threads are trying to add something to the hash table where the keys are different, but the hashes collide? They could both be modifying the same slot in the hash table in different ways at the same time. Assuming you're using a hash table to implement your cache (which you imply in your question) I suggest reading a little about the details of how hash tables are implemented if you're not already familiar with this.

Writes aren't always atomic. You must either use atomic data types or provide some synchronization (RCU, locks etc.). No shared data is thread-safe per se. Or make this go away by sticking to lock-free algorithms (that is, where possible and feasible).

As long as the cost for acquiring and releasing a lock is less than the cost for recreating the object (from a file or database or whatever) all accesses to a cache should indeed be synchronized. If it’s not you don’t really need a cache at all. :)

If you want to avoid data corruption, you must synchronize. This is especially true when the cache contains multiple tables that must be updated atomically. Imagine you have a database for a DMV (department of motor vehicles). You add a new person to the database; that person will have records for auto registrations, records for tickets received, records for home address, and perhaps other contact information. If you don't update these tables atomically -- in the database and in the cache -- then any client pulling data out of the cache may get inconsistent data.
Yes, any one piece of data may be constant, but databases very commonly hold data that -- if not updated together and atomically -- can cause database clients to get incorrect or incomplete or inconsistent results.

If you are using Java 5 or above you can use a ConcurrentHashMap. This supports multiple readers and writers in a threadsafe manner.
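
In .NET, the closest counterpart is ConcurrentDictionary<TKey, TValue>. For the simple id->value cache described in the question, a sketch that also avoids several threads recomputing the same missing value might look like this. ComputeOnceCache and its load delegate are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical cache that runs the expensive load at most once per key.
public sealed class ComputeOnceCache<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, Lazy<TValue>> _entries = new();
    private readonly Func<TKey, TValue> _load;

    public ComputeOnceCache(Func<TKey, TValue> load) => _load = load;

    // GetOrAdd may create several Lazy wrappers under contention, but only
    // one is stored, and only that one ever runs the load delegate.
    public TValue Get(TKey key) =>
        _entries.GetOrAdd(
            key,
            k => new Lazy<TValue>(() => _load(k), LazyThreadSafetyMode.ExecutionAndPublication)
        ).Value;
}
```

Threads that miss on the same key block on the shared Lazy until the single load completes, while threads using other keys are unaffected. One caveat of this design: if the load delegate throws, the Lazy caches the exception for that key.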
