I have a Dictionary<string, SomeClass> that will be accessed heavily by multiple concurrent threads: some will write, most will read. There are no locks and no synchronization; a worker thread makes no checks and simply reads or writes. The only guarantee is that no two threads will write a value with the same key. The question is: could this data structure become corrupted this way? By corrupted I mean no longer working, even with one thread.
could this data structure become corrupted this way?
Yes, most likely you would get an IndexOutOfRangeException or a similar exception.
And even when you catch and ignore the exceptions, you would not get reliable data any more. Both duplicates and missing values are possible.
So just don't do this.
The worst-case scenarios include:
NullReferenceException or IndexOutOfRangeException thrown out of a Dictionary<,> method.
An arbitrary amount of data is lost. If two threads attempt to resize the Dictionary<,> table at the same time, they can stomp on each other, screw up, and lose data.
Wrong answer is returned by a read from the Dictionary<,>.
Basically, the Dictionary<,> can do just about anything bad you can think of, within the limits imposed by the CLR. Presumably, you still won't break type safety or corrupt the heap as you could in a native programming language. Probably, anyways :-)
If you're accessing a collection from multiple threads, it would be safest to use one of the thread-safe varieties in .NET 4, such as ConcurrentDictionary.
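For the scenario in the question (several writers with disjoint keys, plus readers), a minimal sketch with ConcurrentDictionary might look like this. The key format, the thread counts, and the int value type are made up for illustration; SomeClass would take the place of int.

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the question's shared dictionary.
var shared = new ConcurrentDictionary<string, int>();

// Four writers, each owning a distinct set of keys.
var writers = Enumerable.Range(0, 4).Select(w => Task.Run(() =>
{
    for (int i = 0; i < 1000; i++)
        shared[$"w{w}-{i}"] = i;   // the indexer setter is thread-safe on this type
}));

// Readers may find a key present or absent, but the dictionary itself stays intact.
var readers = Enumerable.Range(0, 4).Select(r => Task.Run(() =>
{
    for (int i = 0; i < 1000; i++)
        shared.TryGetValue($"w0-{i}", out int value);
}));

Task.WaitAll(writers.Concat(readers).ToArray());
int total = shared.Count;   // number of keys that survived the concurrent writes
```

With a plain Dictionary the same loop could throw or silently lose entries; here every write should be retained.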
I'm maintaining a legacy application that uses strings to lock values in a cache. It does so something like this:
object Cache(string key, Func<object> createObjToCache)
{
    object result = Get(key);
    if (result == null)
    {
        string internKey = string.Intern(key);
        lock (internKey) {
            result = Get(key);
            if (result == null)
            {
                result = createObjToCache();
                Add(key, result);
            }
        }
    }
    return result;
}
I have two questions about this code. First, is string.Intern() thread-safe? Is it possible that two threads on two separate CPUs with two identical strings would get different references back? If not, is it a possible bottleneck: does string.Intern block?
Second, I'm concerned that this application might be using a huge number of strings as keys. I'd like to be able to monitor the amount of memory that the intern pool uses to store all these strings, but I can't find a performance counter for this under .NET CLR Memory. Is there one somewhere else?
NOTE:
I'm aware that this implementation sucks. However I need to make the case to management before re-writing what they see as a critical bit of code. Hence I could use facts and stats on exactly how bad it sucks rather than alternative solutions.
Also Get() and Add() are not in the original code. I've replaced the original code to keep this question simple. We can assume that Add() will not fail if it is called twice with the same or different keys.
MSDN does not make any mention of thread-safety on string.Intern, so you're right in that it is very undefined what would happen if two threads called Intern for a new key at exactly the same time. I want to say "it'll probably work OK", but that isn't a guarantee. There is no guarantee AFAIK. The implementation is extern, so peeking at the implementation means looking at the runtime itself.
Frankly, there are so many reasons not to do this that it is hard to get excited about answering these specific questions. I'd be tempted to look at some kind of Dictionary<string,object> or ThreadSafeDictionary<string,object> (where the object here is simply a new object() that I can use for the lock) - without all the issues related to string.Intern. Then I can a: query the size, b: discard it at whim, c: have parallel isolated containers, etc.
First is string.Intern() thread safe?
Unless something has changed (my info on this is quite old, and I'm not curious enough to take a look at the current implementation), yes. This however is about the only good thing with this idea.
Indeed, it's not fully a good thing. string.Intern() locks globally which is one of the things that can make it slow.
Secondly I'm concerned that this application might be using a huge number of strings as keys.
If that cache lives forever, then that's an issue (or not, if the memory use is sufficiently low) whether you intern or not. In that case you have the wrong approach to the right potential issue to investigate:
I'd like to be able to monitor the amount of memory that the intern pool uses to store all these strings,
If the strings weren't interned but still lived forever in that cache, you'd still be using the same amount of memory for the strings themselves, and the extra memory overhead of the interning wouldn't really be the issue.
There are a few reasons why one might want to intern a key, and not all of them are even bad (if the strings being interned are going to all appear regularly throughout the lifetime of the application then interning could even reduce memory use), but it seems here that the reason is to make sure that the key locked on is the same instance that another attempt to use the same string would use.
This might be thread safety at the wrong place, if Add() isn't thread-safe enough to guarantee that two simultaneous insertions of different keys can't put it into an invalid state (if Add() isn't explicitly thread-safe, then it does not make this guarantee).
If the cache is threadsafe, then this is likely extra thread safety for no good reason. Since objToCache has already been created and races will result in one being thrown away, it might be fine to let them race and have a brief period of two objToCache existing before one is collected. If not then MemoryCache.AddOrGetExisting or ConcurrentDictionary.GetOrAdd deal with this issue much better than this.
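As a rough sketch of the ConcurrentDictionary.GetOrAdd route, the Cache method from the question could be rebuilt along these lines. Wrapping the value in Lazy<object> is my own addition (not something the original code does) so that the factory runs at most once per key; the key name and the counting factory are purely illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

var cache = new ConcurrentDictionary<string, Lazy<object>>();
int factoryCalls = 0;

// Hypothetical replacement for Cache(key, createObjToCache): no string.Intern, no lock.
// GetOrAdd may build more than one Lazy under contention, but only one is stored,
// and only the stored one ever runs the factory (default Lazy mode is thread-safe).
object Cache(string key, Func<object> createObjToCache) =>
    cache.GetOrAdd(key, _ => new Lazy<object>(createObjToCache)).Value;

// Illustrative factory that counts how often it is invoked.
object Create()
{
    Interlocked.Increment(ref factoryCalls);
    return new object();
}

var results = new object[8];
Parallel.For(0, results.Length, i => results[i] = Cache("some-key", Create));
```

If a brief period with two candidate objects is acceptable, the Lazy wrapper can be dropped and the factory passed to GetOrAdd directly.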
I'm always confused on which one of these to pick. As I see it I use Dictionary over List if I want two data types as a Key and Value so I can easily find a value by its key but I am always confused if I should use a ConcurrentDictionary or Dictionary?
Before you go off at me for not putting much research into this: I have tried, but it seems Google hasn't really got anything on Dictionary vs ConcurrentDictionary, only something on each one individually.
I have asked a friend this before but all they said is: "use ConcurrentDictionary if you use your dictionary a lot in code" and I didn't really want to pester them into explaining it in more detail. Could anyone expand on this?
"Use ConcurrentDictionary if you use your dictionary a lot in code" is kind of vague advice. I don't blame you for the confusion.
ConcurrentDictionary is primarily for use in an environment where you're updating the dictionary from multiple threads (or async tasks). You can use a standard Dictionary from as much code as you like if it's from a single thread ;)
If you look at the methods on a ConcurrentDictionary, you'll spot some interesting methods like TryAdd, TryGetValue, TryUpdate, and TryRemove.
For example, consider a typical pattern you might see for working with a normal Dictionary class.
// There are better ways to do this... but we need an example ;)
if (!dictionary.ContainsKey(id))
    dictionary.Add(id, value);
This has an issue in that between the check for whether it contains a key and calling Add a different thread could call Add with that same id. When this thread calls Add, it'll throw an exception. The method TryAdd handles that for you and will return a true/false telling you whether it added it (or whether that key was already in the dictionary).
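For comparison, here is a small sketch of the same check-and-insert done with TryAdd. The key 42 and the string values are placeholders.

```csharp
using System.Collections.Concurrent;

var dictionary = new ConcurrentDictionary<int, string>();

// The check and the insert happen as one atomic step, so no other thread
// can slip in between them.
bool first = dictionary.TryAdd(42, "original");   // key was absent, so it is added
bool second = dictionary.TryAdd(42, "other");     // key already present, nothing changes

string current = dictionary[42];                  // the first value is kept
```

No exception is thrown for the duplicate; the return value tells you which case you hit.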
So unless you're working in a multi-threaded section of code, you probably can just use the standard Dictionary class. That being said, you could theoretically have locks to prevent concurrent access to a dictionary; that question is already addressed in "Dictionary locking vs. ConcurrentDictionary".
The biggest reason to use ConcurrentDictionary over the normal Dictionary is thread safety. If your application will have multiple threads using the same dictionary at the same time, you need the thread-safe ConcurrentDictionary; this is particularly true when these threads are writing to or building the dictionary.
The downside to using ConcurrentDictionary without the multi-threading is overhead. All those functions that allow it to be thread-safe will still be there, all the locks and checks will still happen, taking processing time and using extra memory.
ConcurrentDictionary is useful when you need to access a dictionary across multiple threads (i.e. multithreading). Vanilla Dictionary objects do not possess this capability and therefore should only be used in a single-threaded manner.
A ConcurrentDictionary is useful when you want a high-performance dictionary that can be safely accessed by multiple threads concurrently. Compared to a standard Dictionary protected with a lock, it is more efficient under heavy usage because of its granular locking implementation. Instead of all threads competing for a single lock, the ConcurrentDictionary maintains multiple locks internally, which minimizes contention and makes it less likely to become a bottleneck.
Despite these nice characteristics, the number of scenarios where using a ConcurrentDictionary is the best option is actually quite small. There are two reasons for that:
The thread-safety guarantees offered by the ConcurrentDictionary are limited to the protection of its internal state. That's it. If you want to do anything slightly non-trivial, for example updating the dictionary and another variable as an atomic operation, you are out of luck. This is not a supported scenario for a ConcurrentDictionary. Even protecting the elements it contains (in case they are mutable objects) is not supported. If you try to update one of its values using the AddOrUpdate method, the dictionary will be protected but the value will not. "Update" in this context means replacing the existing value with another one, not modifying the existing value.
Whenever you find it tempting to use a ConcurrentDictionary, there are usually better alternatives available: alternatives that do not involve shared state, which is what a ConcurrentDictionary essentially is. No matter how efficient its locking scheme is, it will have a hard time beating an architecture where there is no shared state at all, and each thread does its own thing without interfering with the other threads. Commonly used libraries that follow this principle are PLINQ and the TPL Dataflow library. Below is a PLINQ example:
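To illustrate the "replace, not modify" point, here is a minimal sketch of AddOrUpdate with an immutable value type. The "page" key and the hit-counter idea are made up for the example.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

var hits = new ConcurrentDictionary<string, int>();

// The update delegate returns a NEW value that replaces the old one.
// It may run more than once under contention, so it should be free of side effects.
Parallel.For(0, 10_000, _ =>
    hits.AddOrUpdate("page", 1, (key, old) => old + 1));

int count = hits["page"];   // no increment is lost
```

Had the value been a mutable object (say, a List<int>) modified in place inside the delegate, the dictionary would not protect it; that part would need its own synchronization.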
Dictionary<string, Product> dictionary = productIDs
    .AsParallel()
    .Select(id => GetProduct(id))
    .ToDictionary(product => product.Barcode);
Instead of creating a dictionary beforehand, and then having multiple threads filling it concurrently with values, you can trust PLINQ to produce a dictionary utilizing more efficient strategies, involving partitioning of the initial workload, and assigning each partition to a different worker thread. A single thread will eventually aggregate the partial results, and fill the dictionary.
The accepted answer above is correct. However, it is worth mentioning explicitly that if a dictionary is not being modified, i.e. it is only ever read from, then regardless of the number of threads Dictionary<TKey,TValue> is preferred because no synchronization is required.
e.g. caching config in a Dictionary<TKey,TValue>, that is populated only once at startup and used throughout the application for the life of the application.
When to use a thread-safe collection : ConcurrentDictionary vs. Dictionary
If you are only reading keys or values, Dictionary<TKey,TValue> is faster, because no synchronization is required as long as the dictionary is not being modified by any thread.
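A small sketch of that read-only case follows, with invented config keys and values. The important assumption is that the dictionary is fully populated before any other thread can see it and is never written to afterwards.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Populated once at startup (hypothetical settings).
var config = new Dictionary<string, string>
{
    ["region"] = "eu-west",
    ["tier"] = "gold",
};

// From here on the dictionary is only read, so a plain Dictionary needs no locks.
int matches = 0;
Parallel.For(0, 1000, _ =>
{
    if (config["region"] == "eu-west"
        && config.TryGetValue("tier", out var tier) && tier == "gold")
    {
        Interlocked.Increment(ref matches);   // the counter is shared, so it IS synchronized
    }
});
```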
So here I am implementing a caching layer. In particular, I am stuck with
ConcurrentDictionary<SomeKey,HashSet<SomeKey2>>
I need to ensure that operations on the HashSet are thread-safe too (ergo Update is thread-safe). Is that possible in any simple way, or do I have to synchronize in the UpdateFactory delegate? If the answer is yes (which I presume), has any of you encountered this problem before and solved it?
I want to avoid a ConcurrentDictionary of ConcurrentDictionaries because they allocate a lot of synchronization objects and I potentially have around a million entries in this thing, so I want to put less pressure on the GC.
HashSet was chosen because it guarantees amortized constant cost of insertion, deletion and access.
The aforementioned structure will be used as an index on a larger data set with two columns as a key (SomeKey and SomeKey2), much like a database index.
OK, so in the end I decided to go with an immutable set and lock striping, because it is reasonably simple to implement and understand. If I need more performance on writes (no copying of the whole hash set on insert), I will implement reader/writer locks with striping, which should be fine anyway.
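The post doesn't show the final code, so the following is only my guess at what "immutable set plus lock striping" could look like. The stripe count, the int key types (standing in for SomeKey and SomeKey2), and the Add helper are all assumptions.

```csharp
using System.Collections.Concurrent;
using System.Collections.Immutable;
using System.Linq;
using System.Threading.Tasks;

const int StripeCount = 16;   // assumed; tune to the expected number of writers
var stripes = Enumerable.Range(0, StripeCount).Select(_ => new object()).ToArray();
var index = new ConcurrentDictionary<int, ImmutableHashSet<int>>();

// Hypothetical write path: writers whose keys map to the same stripe serialize,
// readers never take a lock and always see a complete snapshot of a set.
void Add(int key, int value)
{
    lock (stripes[(key.GetHashCode() & int.MaxValue) % StripeCount])
    {
        var current = index.TryGetValue(key, out var set) ? set : ImmutableHashSet<int>.Empty;
        index[key] = current.Add(value);   // publish a new set; old snapshots stay valid
    }
}

Parallel.For(0, 4000, i => Add(i % 4, i));
int stored = index.Values.Sum(s => s.Count);
```

A fixed array of lock objects keeps the number of synchronization objects constant no matter how many entries the index holds, which seems to be the GC concern raised in the question.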
Thanks for suggestions.
I have a 2d Array of memory. I have multiple threads reading and writing to single elements in the array spontaneously, arbitrarily, and concurrently.
What is the fastest way or best practice to construct my memory access code? I don't like the idea of locking because it blocks other threads.
Data integrity is actually not that important, but it should be (mostly) consistent. My code can handle a few memory errors.
It needs to be really, really fast!
Thanks for feedback.
If data integrity is not important, you can just access the data without caring about multithreading at all.
No one can predict the result, though.
I wouldn't call this approach "best practice", however. IMHO best practice is caring about multithreading and protecting the data with appropriately-grained mutexes. My opinion is that every application should first be correct, and only then fast. Inconsistent results are just wrong; it doesn't matter whether they come fast or not.
Use the Interlocked class to CAS (compare-and-swap, via Interlocked.CompareExchange) the objects/values in your array. It makes the operation atomic, which ensures that the data is not corrupted. That's about the fastest thing you can do (aside from accessing/modifying the data directly without interlocking). However, if you're modifying the size of the 2D array (growing/shrinking), then you will have some serious problems unless you use some kind of locking mechanism on your array.
Declare the array as volatile and ensure it's scoped such that it's visible to all your threads. Note that in C# the volatile modifier applies to the array reference only, not to its elements; element reads and writes need Volatile.Read and Volatile.Write (or Interlocked) to get the same semantics. I generally like to avoid statics, so either pass the array by reference, or set up all your threads to run methods of an instance class that has the array defined as an instance field.
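A minimal sketch of such a CAS loop on a 2D array element is below. The grid size and the AddAtomically helper are illustrative; for a plain addition Interlocked.Add would be enough, and the loop form is shown because it generalizes to any read-compute-write update.

```csharp
using System.Threading;
using System.Threading.Tasks;

var grid = new int[4, 4];

// Hypothetical helper: retry until no other thread changed the cell
// between our read and our write.
void AddAtomically(int row, int col, int delta)
{
    int seen, wanted;
    do
    {
        seen = Volatile.Read(ref grid[row, col]);
        wanted = seen + delta;   // any pure computation on the value seen could go here
    } while (Interlocked.CompareExchange(ref grid[row, col], wanted, seen) != seen);
}

// 8000 concurrent increments spread over four cells in column 1.
Parallel.For(0, 8000, i => AddAtomically(i % 4, 1, 1));
```

No thread ever blocks here; a thread that loses a race just recomputes and tries again.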
However, I strongly urge you to rethink what "volatile access" means in terms of data integrity. Best practice is NOT to do what you are attempting without good locking mechanics. You may think it's a small problem, but you can find yourself with a very non-deterministic system, so much so that its data isn't reliable in the slightest.
Let's say you have 8 threads running, and all of them will get a value from an index of the array, do some calculation, then add the result back to the index of the array. Thread 1 starts first and gets the value of the index, 0. Then threads 2-8 all start and get the same value. Thread 1 performs its calculation, gets the index again to ensure it has the "latest" value, then tries to update the value. However, other threads are waiting for that memory, and due to some scheduling implementation you know nothing about, in between Thread 1 getting the index (still zero) and writing its result, threads 2-8 have ALL written their values. Then Thread 1 writes its value, overwriting everything the other 7 threads have done. The other 7 threads, in turn, probably had similar "races" with each other, such that the value overwritten by Thread 1 had probably already lost the results of half the threads anyway.
I guarantee you that this behavior is NOT what you want, no matter how much you think you can get away with it; it WILL cause data corruption, which WILL affect other areas of the system, and you WILL be forced to implement proper locking.
If you are interested solely in performance, then the way in which you order your memory accesses can play a big role. Spend an hour or so reading through the slides from Lecture 1 of MIT's Performance Engineering class. The other lectures may also be interesting to you (such as Lecture 6).
Basically, you can optimize your use of the cache to greatly improve performance, depending on your read/write patterns, given the workload you are using.
This should not stop you from doing something that is correct, however.
As MSDN says
ConcurrentDictionary<TKey, TValue> Class Represents a thread-safe collection of key-value pairs that can be accessed by multiple threads concurrently.
But as far as I know, the System.Collections.Concurrent classes are designed for PLINQ.
I have a Dictionary<Key,Value> which keeps track of on-line clients in the server, and I make it thread-safe by locking an object whenever I access it.
Can I safely replace Dictionary<TKey,TValue> with ConcurrentDictionary<TKey,TValue> in my case? Will performance increase after the replacement?
Here, in Part 5, Joseph Albahari mentions that it is designed for parallel programming:
The concurrent collections are tuned for parallel programming. The conventional collections outperform them in all but highly concurrent scenarios.
A thread-safe collection doesn’t guarantee that the code using it will be thread-safe.
If you enumerate over a concurrent collection while another thread is modifying it, no exception is thrown. Instead, you get a mixture of old and new content.
There’s no concurrent version of List.
The concurrent stack, queue, and bag classes are implemented internally with linked lists. This makes them less memory-efficient than the nonconcurrent Stack and Queue classes, but better for concurrent access because linked lists are conducive to lock-free or low-lock implementations. (This is because inserting a node into a linked list requires updating just a couple of references, while inserting an element into a List-like structure may require moving thousands of existing elements.)
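The point about enumeration can be seen even on a single thread. This is a small sketch with made-up keys: modifying a ConcurrentDictionary inside a foreach over it does not throw, whereas the same loop over a plain Dictionary would raise an InvalidOperationException.

```csharp
using System.Collections.Concurrent;

var clients = new ConcurrentDictionary<int, string>();
for (int id = 0; id < 10; id++)
    clients[id] = "online";

int visited = 0;
foreach (var pair in clients)
{
    // Modify the collection mid-enumeration; only the original ten keys trigger an add.
    if (pair.Key < 100)
        clients.TryAdd(pair.Key + 100, "late joiner");
    visited++;
}

// 'visited' is at least 10, but whether the late joiners were seen is unspecified.
int finalCount = clients.Count;
```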
Without knowing more about what you're doing within the lock, it's impossible to say.
For instance, if all of your dictionary access looks like this:
lock(lockObject)
{
    foo = dict[key];
}
... // elsewhere
lock(lockObject)
{
    dict[key] = foo;
}
Then you'll be fine switching it out (though you likely won't see any difference in performance, so if it ain't broke, don't fix it). However, if you're doing anything fancy within the lock block where you interact with the dictionary, then you'll have to make sure that the dictionary provides a single function that can accomplish what you're doing within the lock block, otherwise you'll end up with code that is functionally different from what you had before. The biggest thing to remember is that the dictionary only guarantees that concurrent calls to the dictionary are executed in a serial fashion; it can't handle cases where you have a single action in your code that interacts with the dictionary multiple times. Cases like that, when not accounted for by the ConcurrentDictionary, require your own concurrency control.
Thankfully, the ConcurrentDictionary provides some helper functions for more common multi-step operations like AddOrUpdate or GetOrAdd, but they can't cover every circumstance. If you find yourself having to work to shoehorn your logic into these functions, it may be better to handle your own concurrency.
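As a hedged illustration, here is how a lock-guarded "look up, else create" step for the on-line clients might collapse into single calls. The int client id and the string session value are placeholders for whatever the real server stores.

```csharp
using System.Collections.Concurrent;

var clients = new ConcurrentDictionary<int, string>();

// Before (sketch): lock { if (!dict.ContainsKey(id)) dict[id] = NewSession(id); return dict[id]; }
// After: one call performs the lookup and, if needed, the insert.
string session = clients.GetOrAdd(7, id => $"session-{id}");
string again = clients.GetOrAdd(7, id => "never-stored");   // key exists, factory result is not used

// Disconnect: remove only if present, and learn what was removed.
bool removed = clients.TryRemove(7, out var last);
```

Note that the value factory passed to GetOrAdd is not run under a lock and may be invoked more than once under contention, so logic that must run exactly once still needs its own concurrency control.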
It's not as simple as replacing Dictionary with ConcurrentDictionary, you'll need to adapt your code, as these classes have new methods that behave differently, in order to guarantee thread-safety.
E.g., instead of calling Add or Remove, you have TryAdd and TryRemove. It's important that you use these methods, which behave atomically: if you make two calls where the second relies on the outcome of the first, you'll still have race conditions and need a lock.
You can replace Dictionary<TKey, TValue> with ConcurrentDictionary<TKey, TValue>.
The effect on performance may not be what you want though (if there is a lot of locking/synchronization, performance may suffer...but at least your collection is thread-safe).
I'm unsure about replacement difficulties, but if there is anywhere you need to access multiple elements in the dictionary in the same "lock session", then you'll need to modify your code.
It could give improved performance if Microsoft has given separate locks for read and write, since read operations shouldn't block other read operations.
Yes, you can safely replace it. A dictionary designed for PLINQ may carry some extra code for functionality that you don't use, but the performance overhead will be very small.