I have a 2d Array of memory. I have multiple threads reading and writing to single elements in the array spontaneously, arbitrarily, and concurrently.
What is the fastest way or best practice to construct my memory access code? I don't like the idea of locking because it blocks other threads.
Data integrity is actually not that important, but it should be (mostly) consistent. My code can handle a few memory errors.
It needs to be really, really fast!
Thanks for feedback.
If data integrity is not important, you can just access the data without caring about multithreading at all.
No one can predict the result, though.
I wouldn't call this approach "best practice", however. IMHO best practice is caring about multithreading and protecting the data with appropriately-grained mutexes. My opinion is that every application should first be correct, and only then fast. Inconsistent results are just wrong, no matter how fast they arrive.
Use the Interlocked class to CAS (Interlocked.CompareExchange) the values in your array. That makes each operation atomic, which ensures the data is not corrupted, and it's about the fastest thing you can do (aside from accessing/modifying the data directly without interlocking). However, if you're modifying the size of the 2D array (growing/shrinking), then you will have serious problems unless you use some kind of locking mechanism on the array.
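For illustration, a minimal sketch of such a CAS loop (the class and member names are mine, not from the question), assuming the 2D array is flattened to one dimension so a cell can be passed by ref:

using System.Threading;

class Grid
{
    private readonly int[] _cells; // flattened 2D array: cell (x, y) lives at y * width + x
    private readonly int _width;

    public Grid(int width, int height)
    {
        _width = width;
        _cells = new int[width * height];
    }

    // Lock-free read-modify-write of a single cell.
    public void Add(int x, int y, int delta)
    {
        int i = y * _width + x;
        int seen, updated;
        do
        {
            seen = _cells[i];       // snapshot the current value
            updated = seen + delta; // compute on the local copy
        }
        while (Interlocked.CompareExchange(ref _cells[i], updated, seen) != seen);
        // If another thread changed the cell in the meantime, CompareExchange
        // returns the newer value and the loop retries with a fresh snapshot.
    }
}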
Declare the array as volatile and ensure it's scoped such that it's visible to all your threads. I generally like to avoid statics, so either pass the array by reference, or set up all your threads to run methods of an instance class that has the array defined as an instance field.
However, I strongly urge you to rethink what "volatile access" means in terms of data integrity. Best practice is NOT to do what you are attempting without good locking mechanics. You may think it's a small problem, but you can find yourself with a very non-deterministic system, so much so that its data isn't reliable in the slightest.
Let's say you have 8 threads running, and all of them will get a value from an index of the array, do some calculation, then add the result back to the index of the array. Thread 1 starts first and gets the value of the index, 0. Then threads 2-7 all start and get the same value. Thread 1 performs its calculation, gets the index again to ensure it has the "latest" value, then tries to update the value. However, other threads are waiting for that memory, and due to some scheduling implementation you know nothing about, in between Thread 1 getting the index (still zero) and writing its result, threads 2-7 have ALL written their values. Then Thread 1 writes its value, overwriting everything the other 7 threads have done. The other 7 threads, in turn, probably had similar "races" with each other such that the value overwritten by Thread 1 probably overwrote the results of half the threads anyway.
I guarantee you that this behavior is NOT what you want, no matter how much you think you can get away with it; it WILL cause data corruption, which WILL affect other areas of the system, and you WILL be forced to implement proper locking.
If you are interested solely in performance, then the way in which you order your memory accesses can play a big role. Spend an hour or so reading through the slides from Lecture 1 of MIT's Performance Engineering class. The other lectures may also be interesting to you (such as Lecture 6).
Basically, you can optimize your use of the cache to greatly improve performance, depending on your read/write patterns, given the workload you are using.
This should not stop you from doing something that is correct, however.
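As a toy illustration of the access-pattern point (the size and names are mine):

int n = 4096;
int[,] data = new int[n, n];
long sum = 0;

// Cache-friendly: row-major traversal matches the memory layout of a C#
// 2D array, so consecutive reads hit the same cache line.
for (int row = 0; row < n; row++)
    for (int col = 0; col < n; col++)
        sum += data[row, col];

// Cache-hostile: column-major traversal strides n ints between reads and
// misses the cache on almost every access, often running several times slower.
for (int col = 0; col < n; col++)
    for (int row = 0; row < n; row++)
        sum += data[row, col];

Both loops compute the same sum; only the traversal order, and therefore the cache behavior, differs.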
I'm maintaining a legacy application that uses strings to lock values in a cache. It does so something like this:
object Cache(string key, Func<object> createObjToCache)
{
    object result = Get(key);
    if (result == null)
    {
        string internKey = string.Intern(key);
        lock (internKey)
        {
            result = Get(key);
            if (result == null)
            {
                result = createObjToCache();
                Add(key, result);
            }
        }
    }
    return result;
}
I have two questions about this code. First, is string.Intern() thread-safe? Is it possible that two threads on two separate CPUs with two identical strings would get back different references? If not, is that a possible bottleneck - does string.Intern block?
Secondly, I'm concerned that this application might be using a huge number of strings as keys. I'd like to be able to monitor the amount of memory that the intern pool uses to store all these strings, but I can't find a performance counter for this under .NET Memory. Is there one somewhere else?
NOTE:
I'm aware that this implementation sucks. However I need to make the case to management before re-writing what they see as a critical bit of code. Hence I could use facts and stats on exactly how bad it sucks rather than alternative solutions.
Also Get() and Add() are not in the original code. I've replaced the original code to keep this question simple. We can assume that Add() will not fail if it is called twice with the same or different keys.
MSDN does not make any mention of thread-safety on string.Intern, so you're right that what happens if two threads call Intern for a new key at exactly the same time is undefined. I want to say "it'll probably work OK", but AFAIK there is no guarantee. The implementation is extern, so peeking at it means looking at the runtime itself.
Frankly, there are so many reasons not to do this that it is hard to get excited about answering these specific questions. I'd be tempted to look at some kind of Dictionary<string,object> or ThreadSafeDictionary<string,object> (where the object here is simply a new object() that I can use for the lock) - without all the issues related to string.Intern. Then I can a: query the size, b: discard it at whim, c: have parallel isolated containers, etc.
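A sketch of that idea (using System.Collections.Concurrent from .NET 4, and keeping the question's placeholder Get/Add methods):

private static readonly ConcurrentDictionary<string, object> _keyLocks =
    new ConcurrentDictionary<string, object>();

object Cache(string key, Func<object> createObjToCache)
{
    object result = Get(key);
    if (result == null)
    {
        // GetOrAdd hands every caller the same lock object for a given key,
        // without going anywhere near the process-wide intern pool.
        lock (_keyLocks.GetOrAdd(key, _ => new object()))
        {
            result = Get(key);
            if (result == null)
            {
                result = createObjToCache();
                Add(key, result);
            }
        }
    }
    return result;
}

Now the size is just _keyLocks.Count, and the whole container can be discarded or partitioned at whim.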
First, is string.Intern() thread-safe?
Unless something has changed (my info on this is quite old, and I'm not curious enough to take a look at the current implementation), yes. This however is about the only good thing with this idea.
Indeed, it's not entirely a good thing. string.Intern() locks globally, which is one of the things that can make it slow.
Secondly, I'm concerned that this application might be using a huge number of strings as keys.
If that cache lives forever then that's an issue (or not, if the memory use is sufficiently low) whether you intern or not. In that case you have the wrong approach to the right potential issue to investigate:
I'd like to be able to monitor the amount of memory that the intern pool uses to store all these strings,
If the strings weren't interned but still lived forever in that cache, you'd still be using the same amount of memory for the strings themselves, and the extra memory overhead of the interning wouldn't really be the issue.
There are a few reasons why one might want to intern a key, and not all of them are even bad (if the strings being interned are going to all appear regularly throughout the lifetime of the application then interning could even reduce memory use), but it seems here that the reason is to make sure that the key locked on is the same instance that another attempt to use the same string would use.
This might be thread safety in the wrong place: the per-key lock does nothing to stop two simultaneous insertions of different keys, so unless Add() is itself thread-safe, it can still be put into an invalid state (and if Add() isn't explicitly documented as thread-safe, it makes no such guarantee).
If the cache is thread-safe, then this is likely extra thread safety for no good reason: since the created object will simply be thrown away when it loses a race, it might be fine to let callers race and have a brief period where two such objects exist before one is collected. If not, then MemoryCache.AddOrGetExisting or ConcurrentDictionary.GetOrAdd deals with this issue much better than this code does.
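For example (a sketch): the value factory may run on several threads at once for the same key, but only one result is published and the losers are simply collected.

var cache = new ConcurrentDictionary<string, object>();

// No per-key lock object, no intern pool: the dictionary arbitrates the race.
object result = cache.GetOrAdd(key, k => createObjToCache());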
I will try to make this question as generic as possible, but I will give a brief introduction to my actual problem -
I am trying to implement a concurrent skiplist for a priority queue. Each 'node' has a value and an array of 'forward' nodes, where node.forward[i] represents the next node on the i-th level of the skiplist. For write access (i.e. insertions and deletions), I use a SpinLock (still to be determined whether that is the best lock to use).
My question is essentially, when I need a read access for a traversal,
node = node.forward[i]
What kind of thread safety do I need around something like this? If another thread is modifying node.forward[i] at exactly the same time that I read (with no current locking mechanism for read), what can happen here?
My initial thought is to have a ReaderWriterLockSlim on the getter and setter of the indexer for Forward. Will there be too much unnecessary locking in this scenario?
Edit: Or would it be best to instead use an Interlocked.Exchange for all of my reads?
If another thread is modifying node.forward[i] at exactly the same time that I read (with no current locking mechanism for read), what can happen here?
It really depends on the implementation. It's possible to use Interlocked.Exchange when setting "forward" in a way that can prevent the references from being invalid (as it can make the "set" atomic), but there is no guarantee of which reference you'd get on read. However, with a naive implementation, anything can happen, including getting bad data.
My initial thought is to have a ReaderWriterLockSlim on the getter and setter of the indexer for Forward.
This is likely to be a good starting point. It is fairly easy to build a properly synchronized collection using a ReaderWriterLockSlim, and a correct, functional implementation is always the first priority.
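A minimal sketch of such a synchronized indexer (the Node shape here is illustrative, not your actual code; assumes using System.Threading):

class Node
{
    private readonly Node[] _forward;
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public Node(int levels) { _forward = new Node[levels]; }

    public Node this[int i]
    {
        get
        {
            _lock.EnterReadLock(); // many traversals may read concurrently
            try { return _forward[i]; }
            finally { _lock.ExitReadLock(); }
        }
        set
        {
            _lock.EnterWriteLock(); // insertions/deletions take exclusive access
            try { _forward[i] = value; }
            finally { _lock.ExitWriteLock(); }
        }
    }
}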
Will there be too much unnecessary locking in this scenario?
There's no way to know without seeing how you implement it and, more importantly, how it's going to be used. Depending on your usage, you can profile and look for optimization opportunities if necessary at that point.
On a side note - you might want to reconsider using node.Forward[i] as opposed to more of a "linked list" approach here. Any access to Forward[i] is likely to require a fair bit of synchronization to iterate through the skip list i steps, all of which will need some synchronization to prevent errors if there are concurrent writes anywhere between node and i elements beyond node. If you only look ahead one step, you can (potentially) reduce the amount of synchronization required.
So here I am implementing some caching layer. Particurally I am stuck with
ConcurrentDictionary<SomeKey,HashSet<SomeKey2>>
I need to ensure that operations on the HashSet are thread-safe too (ergo Update is thread-safe). Is that possible in any simple way, or do I have to synchronize in the UpdateFactory delegate? If the answer is yes (which I presume), has anyone here encountered this problem before and solved it?
I want to avoid a ConcurrentDictionary of ConcurrentDictionaries because they allocate a lot of synchronization objects and I potentially have around a million entries in this thing, so I want less pressure on the GC.
HashSet was chosen because it guarantees amortized constant cost of insertion, deletion, and access.
The aforementioned structure will be used as an index on a larger data set, with two columns as a key (SomeKey and SomeKey2), much like a database index.
OK, so finally I decided to go with an immutable set and lock striping, because it is reasonably simple to implement and understand. If I need more write performance (no copying of the whole hash set on insert), I will implement reader/writer locks with striping, which should be fine anyway.
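For anyone curious, a sketch of what I mean (all names are mine; assumes System.Collections.Concurrent and System.Collections.Immutable are available):

class StripedSetIndex<TKey, TKey2>
{
    private readonly ConcurrentDictionary<TKey, ImmutableHashSet<TKey2>> _index =
        new ConcurrentDictionary<TKey, ImmutableHashSet<TKey2>>();

    // A small fixed pool of locks shared by all keys: vastly fewer
    // synchronization objects than one per entry at a million entries.
    private readonly object[] _stripes = new object[64];

    public StripedSetIndex()
    {
        for (int i = 0; i < _stripes.Length; i++)
            _stripes[i] = new object();
    }

    private object StripeFor(TKey key)
    {
        return _stripes[(key.GetHashCode() & int.MaxValue) % _stripes.Length];
    }

    public void Add(TKey key, TKey2 value)
    {
        lock (StripeFor(key)) // writers to the same stripe serialize
        {
            ImmutableHashSet<TKey2> current;
            if (!_index.TryGetValue(key, out current))
                current = ImmutableHashSet<TKey2>.Empty;
            _index[key] = current.Add(value); // copy-on-write publish
        }
    }

    // Readers never lock: they see either the old set or the new one,
    // both of which are internally consistent.
    public bool Contains(TKey key, TKey2 value)
    {
        ImmutableHashSet<TKey2> set;
        return _index.TryGetValue(key, out set) && set.Contains(value);
    }
}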
Thanks for suggestions.
I've been reading Joe Duffy's book on Concurrent programming. I have kind of an academic question about lockless threading.
First: I know that lockless threading is fraught with peril (if you don't believe me, read the sections in the book about memory model)
Nevertheless, I have a question:
Suppose I have a class with an int property on it.
The value referenced by this property will be read very frequently by multiple threads
It is extremely rare that the value will change, and when it does it will be a single thread that changes it.
If it does change while another operation that uses it is in flight, no one is going to lose a finger (the first thing anyone using it does is copy it to a local variable)
I could use locks (or a ReaderWriterLockSlim to keep the reads concurrent).
I could mark the variable volatile (lots of examples where this is done)
However, even volatile can impose a performance hit.
What if I use VolatileWrite when it changes, and leave the access normal for reads. Something like this:
public class MyClass
{
    private int _TheProperty;

    internal int TheProperty
    {
        get { return _TheProperty; }
        set { System.Threading.Thread.VolatileWrite(ref _TheProperty, value); }
    }
}
I don't think that I would ever try this in real life, but I'm curious about the answer (more than anything, as a checkpoint of whether I understand the memory model stuff I've been reading).
Marking a variable as "volatile" has two effects.
1) Reads and writes have acquire and release semantics, so that reads and writes of other memory locations will not "move forwards and backwards in time" with respect to reads and writes of this memory location. (This is a simplification, but you take my point.)
2) The code generated by the jitter will not "cache" a value that seems to logically be unchanging.
Whether the former point is relevant in your scenario, I don't know; you've only described one memory location. Whether or not it is important that you have only volatile writes but not volatile reads is something that is up to you to decide.
But it seems to me that the latter point is quite relevant. If you have a spin lock on a non-volatile variable:
while(this.prop == 0) {}
the jitter is within its rights to generate this code as though you'd written
if (this.prop == 0) { while (true) {} }
Whether it actually does so or not, I don't know, but it has the right to. If what you want is for the code to actually re-check the property on each go round the loop, marking it as volatile is the right way to go.
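In other words, the sketch of the fix is a single keyword:

// Declared volatile, the field must actually be re-read on every
// iteration of the spin loop.
private volatile int prop;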
The question is whether the reading thread will ever see the change. It's not just a matter of whether it sees it immediately.
Frankly I've given up on trying to understand volatility - I know it doesn't mean quite what I thought it used to... but I also know that with no kind of memory barrier on the reading thread, you could be reading the same old data forever.
The "performance hit" of volatile is because the compiler now generates code to actually check the value instead of optimizing that away - in other words, you'll have to take that performance hit regardless of what you do.
At the CPU level, yes every processor will eventually see the change to the memory address. Even without locks or memory barriers. Locks and barriers would just ensure that it all happened in a relative ordering (w.r.t other instructions) such that it appeared correct to your program.
The problem isn't cache-coherency (I hope Joe Duffy's book doesn't make that mistake). The caches stay coherent - it is just that this takes time, and the processors don't bother to wait for it to happen - unless you enforce it. So instead, the processor moves on to the next instruction, which may or may not end up happening before the previous one (because each memory read/write may take a different amount of time - ironically, because of the time it takes the processors to agree on coherency, some cache lines become coherent faster than others; ie depending on whether the line was Modified, Exclusive, Shared, or Invalid, it takes more or less work to get it into the necessary state).
So a read may appear old or from an out of date cache, but really it just happened earlier than expected (typically because of look-ahead and branch prediction). When it really was read, the cache was coherent, it has just changed since then. So the value wasn't old when you read it, but it is now when you need it. You just read it too soon. :-(
Or equivalently, it was written later than the logic of your code thought it would be written.
Or both.
Anyhow, if this was C/C++, even without locks/barriers, you would eventually get the updated values. (within a few hundred cycles typically, as memory takes about that long). In C/C++ you could use volatile (the weak non-thread volatile) to ensure that the value wasn't read from a register. (Now there's a non-coherent cache! ie the registers)
In C# I don't know enough about CLR to know how long a value could stay in a register, nor how to ensure you get a real re-read from memory. You've lost the 'weak' volatile.
I would suspect as long as the variable access doesn't completely get compiled away, you will eventually run out of registers (x86 doesn't have many to start with) and get your re-read.
But no guarantees that I see. If you could limit your volatile-read to a particular point in your code that was often, but not too often (ie start of next task in a while(things_to_do) loop) then that might be the best you can do.
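A sketch of that "volatile-read at a chosen point" idea, assuming the loop condition from the pseudo-loop above is an int field:

// Thread.VolatileRead forces a genuine load from memory (with acquire
// semantics) instead of reusing a register-cached value.
while (Thread.VolatileRead(ref things_to_do) != 0)
{
    // ...take and run the next task...
}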
This is the pattern I use when the "last writer wins" pattern is applicable to the situation. I had used the volatile keyword, but after seeing this pattern in a code example from Jeffrey Richter, I started using it instead.
For normal things (like memory-mapped devices), the cache-coherency protocols going on within/between the CPU/CPUs is there to ensure that different threads sharing that memory get a consistent view of things (i.e., if I change the value of a memory location in one CPU, it will be seen by other CPUs that have the memory in their caches). In this regard volatile will help to ensure that the optimizer doesn't optimize away memory accesses (which are always going through cache anyway) by, say, reading the value cached in a register. The C# documentation seems pretty clear on this. Again, the application programmer doesn't generally have to deal with cache-coherency themselves.
I highly recommend reading the freely available paper "What Every Programmer Should Know About Memory". A lot of magic goes on under the hood that mostly prevents shooting oneself in the foot.
In C#, reads and writes of an int are atomic.
Since you said that only one thread writes to it, you should never have contention over what the proper value is, and as long as you are caching a local copy, you should never get dirty data.
You may, however, want to declare it volatile if an OS thread will be doing the update.
Also keep in mind that some operations are not atomic and can cause problems if you have more than one writer. For example, even though the bool type won't corrupt if you have more than one writer, a statement like this:
a = !a;
is not atomic: it is a read, a negate, and a write. If two threads execute it at the same time, both can read the old value and one update will be lost - a classic race condition.
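For instance, a sketch of making such a toggle atomic with a CAS loop - an int stands in for the bool because Interlocked.CompareExchange has no bool overload:

private int _flag; // 0 == false, 1 == true

public void Toggle()
{
    int seen, flipped;
    do
    {
        seen = _flag;                // snapshot the current value
        flipped = seen == 0 ? 1 : 0; // negate the local copy
    }
    while (Interlocked.CompareExchange(ref _flag, flipped, seen) != seen);
}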
How often do you find yourself actually using spinlocks in your code? How common is it to come across a situation where using a busy loop actually outperforms the usage of locks?
Personally, when I write some sort of code that requires thread safety, I tend to benchmark it with different synchronization primitives, and as far as it goes, it seems like using locks gives better performance than using spinlocks. No matter for how little time I actually hold the lock, the amount of contention I receive when using spinlocks is far greater than the amount I get from using locks (of course, I run my tests on a multiprocessor machine).
I realize that it's more likely to come across a spinlock in "low-level" code, but I'm interested to know whether you find it useful in even a more high-level kind of programming?
It depends on what you're doing. In general application code, you'll want to avoid spinlocks.
In low-level code where you'll only hold the lock for a couple of instructions and latency is important, a spinlock may be a better solution than a lock. But those cases are rare, especially in the kind of applications where C# is typically used.
In C#, "Spin locks" have been, in my experience, almost always worse than taking a lock - it's a rare occurrence where spin locks will outperform a lock.
However, that's not always the case. .NET 4 adds a System.Threading.SpinLock structure, which provides benefits in situations where a lock is held for a very short time and grabbed repeatedly. From the MSDN docs on Data Structures for Parallel Programming:
In scenarios where the wait for the lock is expected to be short, SpinLock offers better performance than other forms of locking.
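A minimal usage sketch (note that SpinLock is a mutable struct: keep it in a field and never copy it):

private SpinLock _spin = new SpinLock(enableThreadOwnerTracking: false);
private int _counter;

public void Increment()
{
    bool taken = false;
    try
    {
        _spin.Enter(ref taken); // spins briefly instead of blocking
        _counter++;             // keep the critical section tiny
    }
    finally
    {
        if (taken) _spin.Exit();
    }
}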
Spin locks can outperform other locking mechanisms in cases where you're doing something like locking through a tree - if you're only holding the lock on each node for a very, very short period of time, they can outperform a traditional lock. I ran into this in a rendering engine with a multithreaded scene update at one point - spin locks profiled out to outperform locking with Monitor.Enter.
For my realtime work, particularly with device drivers, I've used them a fair bit. It turns out that (when I last timed this) waiting for a sync object like a semaphore tied to a hardware interrupt chews up at least 20 microseconds, no matter how long it actually takes for the interrupt to occur. A single check of a memory-mapped hardware register, followed by a check of RDTSC (to allow for a time-out so you don't lock up the machine), is in the high-nanosecond range (basically down in the noise). For hardware-level handshaking that shouldn't take much time at all, it is really tough to beat a spinlock.
My 2c: If your updates satisfy some access criteria then they are good spinlock candidates:
fast: ie you will have time to acquire the spinlock, perform the updates, and release the spinlock within a single thread quantum, so that you don't get pre-empted while holding the spinlock
localized: all the data you update is preferably in one single page that is already loaded; you do not want a TLB miss while you're holding the spinlock, and you definitely don't want a page-fault swap read!
atomic: you do not need any other lock to perform the operation, ie. never wait for locks while holding a spinlock.
For anything that has any potential to yield, you should use a notified lock structure (events, mutex, semaphores etc).
One use case for spin locks is if you expect very low contention but are going to have a lot of them. If you don't need support for recursive locking, a spinlock can be implemented in a single byte, and if contention is very low then the CPU cycle waste is negligible.
For a practical use case, I often have arrays of thousands of elements, where updates to different elements of the array can safely happen in parallel. The odds of two threads trying to update the same element at the same time are very small (low contention) but I need one lock for every element (I'm going to have a lot of them). In these cases, I usually allocate an array of ubytes of the same size as the array I'm updating in parallel and implement spinlocks inline as (in the D programming language):
while(!atomicCasUbyte(spinLocks[i], 0, 1)) {} // spin until the lock byte flips from 0 (free) to 1 (held)
myArray[i] = newVal;                          // critical section: update the guarded element
atomicSetUbyte(spinLocks[i], 0);              // release the lock by resetting the byte
On the other hand, if I had to use regular locks, I would have to allocate an array of pointers to Objects, and then allocate a Mutex object for each element of this array. In scenarios such as the one described above, this is just plain wasteful.
If you have performance critical code and you have determined that it needs to be faster than it currently is and you have determined that the critical factor is the lock speed, then it'd be a good idea to try a spinlock. In other cases, why bother? Normal locks are easier to use correctly.
Please note the following points:
Most mutex implementations spin for a little while before the thread is actually descheduled. Because of this, it is hard to compare these mutexes with pure spinlocks.
Several threads spinning "as fast as possible" on the same spinlock will consume all the bus bandwidth and drastically decrease your program's efficiency. You need to add a tiny "sleeping" time by adding a noop into your spinning loop.
You hardly ever need to use spinlocks in application code, if anything you should avoid them.
I can't think of any reason to use a spinlock in C# code running on a normal OS. Busy locks are mostly a waste at the application level - the spinning can cause you to burn your entire CPU timeslice, whereas a lock will immediately cause a context switch if needed.
High-performance code where the number of threads equals the number of processors/cores might benefit in some cases, but if you need performance optimization at that level, you're likely making a next-gen 3D game, working on an embedded OS with poor synchronization primitives, creating an OS or driver, or in any case not using C#.
I used spin locks for the stop-the-world phase of the garbage collector in my HLVM project because they are easy and that is a toy VM. However, spin locks can be counter-productive in that context:
One of the perf bugs in the Glasgow Haskell Compiler's garbage collector is so annoying that it has a name, the "last core slowdown". This is a direct consequence of their inappropriate use of spinlocks in their GC and is exacerbated on Linux due to its scheduler but, in fact, the effect can be observed whenever other programs are competing for CPU time.
The effect is clear on the second graph here and can be seen affecting more than just the last core here, where the Haskell program sees performance degradation beyond only 5 cores.
Always keep these points in your mind while using spinlocks:
Fast user mode execution.
Synchronizes threads within a single process, or multiple processes if in shared memory.
Does not return until the object is owned.
Does not support recursion.
Consumes 100% of CPU while "waiting".
I have personally seen so many deadlocks just because someone thought it would be a good idea to use a spinlock.
Be very, very careful while using spinlocks (I can't emphasize this enough).