Spinlocks, How Useful Are They? - c#

How often do you find yourself actually using spinlocks in your code? How common is it to come across a situation where using a busy loop actually outperforms the usage of locks?
Personally, when I write code that requires thread safety, I tend to benchmark it with different synchronization primitives, and as far as I can tell, using locks gives better performance than using spinlocks. No matter how little time I actually hold the lock, the amount of contention I see when using spinlocks is far greater than what I get from using locks (of course, I run my tests on a multiprocessor machine).
I realize that spinlocks are more likely to turn up in "low-level" code, but I'm interested to know whether you find them useful in more high-level kinds of programming as well.

It depends on what you're doing. In general application code, you'll want to avoid spinlocks.
In low-level stuff where you'll only hold the lock for a couple of instructions, and latency is important, a spinlock may be a better solution than a lock. But those cases are rare, especially in the kind of applications where C# is typically used.

In C#, "Spin locks" have been, in my experience, almost always worse than taking a lock - it's a rare occurrence where spin locks will outperform a lock.
However, that's not always the case. .NET 4 is adding a System.Threading.SpinLock structure. This provides benefits in situations where a lock is held for a very short time, and being grabbed repeatedly. From the MSDN docs on Data Structures for Parallel Programming:
In scenarios where the wait for the lock is expected to be short, SpinLock offers better performance than other forms of locking.
Spin locks can outperform other locking mechanisms in cases where you're doing something like locking your way through a tree - if you're only holding the lock on each node for a very, very short period of time, they can outperform a traditional lock. I ran into this at one point in a rendering engine with a multithreaded scene update - spin locks profiled out to outperform locking with Monitor.Enter.
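For illustration, here is a minimal sketch of how System.Threading.SpinLock might be used around a very short critical section (the class and field names are made up for this example):

using System.Threading;

class NodeCounter
{
    // SpinLock is a struct; don't mark the field readonly, or copies would defeat the lock.
    private SpinLock _spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int _value;

    public void Increment()
    {
        bool lockTaken = false;
        try
        {
            _spinLock.Enter(ref lockTaken);
            _value++;                      // keep the critical section as short as possible
        }
        finally
        {
            if (lockTaken) _spinLock.Exit();
        }
    }
}

Passing enableThreadOwnerTracking: false skips the extra debugging checks, which is the usual choice once the code has been profiled and trusted.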

For my realtime work, particularly with device drivers, I've used them a fair bit. It turns out that (when last I timed this) waiting for a sync object like a semaphore tied to a hardware interrupt chews up at least 20 microseconds, no matter how long it actually takes for the interrupt to occur. A single check of a memory-mapped hardware register, followed by a check of RDTSC (to allow for a time-out so you don't lock up the machine), is in the high nanosecond range (basically down in the noise). For hardware-level handshaking that shouldn't take much time at all, it is really tough to beat a spinlock.

My 2c: if your updates satisfy the following criteria, they are good spinlock candidates:
Fast: you have time to acquire the spinlock, perform the updates and release the spinlock within a single thread quantum, so you don't get pre-empted while holding it.
Localized: all the data you update is preferably in a single page that is already loaded; you do not want a TLB miss while holding the spinlock, and you definitely don't want a page fault that triggers a swap read!
Atomic: you do not need any other lock to perform the operation, i.e. never wait for another lock while holding a spinlock.
For anything that has any potential to yield, you should use a notified lock structure (events, mutex, semaphores etc).

One use case for spin locks is if you expect very low contention but are going to have a lot of them. If you don't need support for recursive locking, a spinlock can be implemented in a single byte, and if contention is very low then the CPU cycle waste is negligible.
For a practical use case, I often have arrays of thousands of elements, where updates to different elements of the array can safely happen in parallel. The odds of two threads trying to update the same element at the same time are very small (low contention) but I need one lock for every element (I'm going to have a lot of them). In these cases, I usually allocate an array of ubytes of the same size as the array I'm updating in parallel and implement spinlocks inline as (in the D programming language):
while(!atomicCasUbyte(spinLocks[i], 0, 1)) {}
myArray[i] = newVal;
atomicSetUbyte(spinLocks[i], 0);
On the other hand, if I had to use regular locks, I would have to allocate an array of object references and then allocate a mutex object for each element of this array. In scenarios such as the one described above, this is just plain wasteful.
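For comparison, here is a rough C# sketch of the same per-element pattern. Interlocked.CompareExchange works on ints rather than bytes, so this trades a little memory for the same idea; all names are illustrative:

using System.Threading;

class ParallelArrayUpdater
{
    private readonly int[] _values;
    private readonly int[] _spinLocks;   // 0 = free, 1 = held; one flag per element

    public ParallelArrayUpdater(int length)
    {
        _values = new int[length];
        _spinLocks = new int[length];
    }

    public void Update(int i, int newVal)
    {
        // Acquire the per-element lock, update, release.
        while (Interlocked.CompareExchange(ref _spinLocks[i], 1, 0) != 0) { /* spin */ }
        _values[i] = newVal;
        Volatile.Write(ref _spinLocks[i], 0);   // release: publish the update and free the flag
    }
}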

If you have performance-critical code, have determined that it needs to be faster than it currently is, and have determined that the critical factor is lock speed, then it'd be a good idea to try a spinlock. In other cases, why bother? Normal locks are easier to use correctly.

Please note the following points:
Most mutex implementations spin for a little while before the thread is actually descheduled, which makes it hard to compare those mutexes with pure spinlocks.
Several threads spinning "as fast as possible" on the same spinlock will consume all the memory bandwidth and drastically decrease your program's efficiency. You need to give the loop some breathing room by adding pause/no-op instructions to your spinning loop.
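In .NET, the SpinWait helper encapsulates exactly this kind of polite spinning (it issues pause instructions and eventually yields). A minimal sketch, with an illustrative flag field:

using System.Threading;

class BackoffSpinLock
{
    private int _flag;   // 0 = free, 1 = held (illustrative)

    public void Enter()
    {
        var spinner = new SpinWait();
        while (Interlocked.CompareExchange(ref _flag, 1, 0) != 0)
        {
            spinner.SpinOnce();   // pauses, and eventually yields, instead of hammering the cache line
        }
    }

    public void Exit() => Volatile.Write(ref _flag, 0);
}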

You hardly ever need to use spinlocks in application code; if anything, you should avoid them.
I can't think of any reason to use a spinlock in C# code running on a normal OS. Busy locks are mostly a waste at the application level - the spinning can burn your entire CPU timeslice, whereas a lock will immediately cause a context switch if needed.
High-performance code where the number of threads equals the number of processors/cores might benefit in some cases, but if you need performance optimization at that level you're likely making a next-gen 3D game, working on an embedded OS with poor synchronization primitives, writing an OS or driver - and in any case not using C#.

I used spin locks for the stop-the-world phase of the garbage collector in my HLVM project because they are easy and that is a toy VM. However, spin locks can be counter-productive in that context:
One of the perf bugs in the Glasgow Haskell Compiler's garbage collector is so annoying that it has a name, the "last core slowdown". This is a direct consequence of their inappropriate use of spinlocks in their GC and is exacerbated on Linux due to its scheduler, but in fact the effect can be observed whenever other programs are competing for CPU time.
The effect is clear on the second graph here, and can be seen affecting more than just the last core here, where the Haskell program sees performance degradation beyond just 5 cores.

Always keep these points in your mind while using spinlocks:
Fast user mode execution.
Synchronizes threads within a single process, or multiple processes if in shared memory.
Does not return until the object is owned.
Does not support recursion.
Consumes 100% of CPU while "waiting".
I have personally seen so many deadlocks just because someone thought it would be a good idea to use a spinlock.
Be very, very careful while using spinlocks
(I can't emphasize this enough).

Related

Interlocked & Thread-Safe operations

1. Out of curiosity, what do operations like the following do behind the scenes when they get called, for example, from 2 or 3 threads at the same time?
Interlocked.Add(ref myInt, 24);
Interlocked.Increment(ref counter);
Does C# create an internal queue that tells, for example, Thread 2 "now it's your turn to do the operation", then tells Thread 1 "now it's your turn", and then Thread 3, so that they don't interfere with each other?
2. Why doesn't C# do this automatically?
Isn't it obvious that when a programmer writes something like the following inside a multi-threaded method:
myByte++;
Sum = int1 + int2 + int3;
and these variables are shared with other threads, he wants each of these operations to be executed as an atomic operation without interruptions?
Why does the programmer have to tell it explicitly to do so?
Isn't it clear that that's what every programmer wants? Don't these "Interlocked" methods just add unnecessary complication to the language?
Thanks.
what do operations like the following do behind the scenes
As far as how it's implemented internally, CPU hardware arbitrates which core gets ownership of the cache line when there's contention. See Can num++ be atomic for 'int num'? for a C++ and x86 asm / cpu-architecture explanation of the details.
Re: why CPUs and compilers want to load early and store late, see Java instruction reordering and CPU memory reordering.
Atomic RMW prevents that, as do seq_cst store semantics on most ISAs, where you do a plain store and then a full barrier. (AArch64 has a special interaction between stlr and ldar to prevent StoreLoad reordering of seq_cst operations, but still allows reordering with other operations.)
Isn't it obvious that when a programmer writes something like the following inside a multi-threaded method [...]
What does that even mean? It's not running the same method in multiple threads that's a problem, it's accessing shared data. How is the compiler supposed to know which data will be accessed non-readonly from multiple threads at the same time, not inside a critical section?
There's no reasonable way to prove this in general, only in some simplistic cases. If a compiler were to try, it would have to be conservative, erring on the side of making more things atomic, at a huge cost in performance. The other kind of mistake would be a correctness problem, and if that could just happen when the compiler guesses wrong based on some undocumented heuristics, it would make the language unusable for multi-threaded programs.
Besides that, not all multi-threaded code needs sequential consistency all the time; often acquire/release or relaxed atomics are fine, but sometimes they aren't. It makes a lot of sense for programmers to be explicit about what ordering and atomicity their algorithm is built on.
Also you carefully design lock-free multi-threaded code to do things in a sensible order. In C++, you don't have to use Interlock..., but instead you make a variable std::atomic<int> shared_int; for example. (Or use std::atomic_ref<int> to do atomic operations on variables that other code can access non-atomically, like using Interlocked functions).
Having no explicit indication in the source of which operations are atomic with what ordering semantics would make it harder to read and maintain such code. Correct lock-free algorithms don't just happen by having the compiler turn individual operators into atomic ops.
Promoting every operation to atomic would destroy performance. Most data isn't shared, even in functions that access some shared data structures.
Atomic RMW (like x86 lock add [rdi], eax) is much slower than a non-atomic operation, especially since non-atomic lets the compiler optimize variables into registers.
An atomic RMW on x86 is a full memory barrier, so making every operation atomic would destroy memory-level parallelism every time you use a += or ++.
e.g. one per 18 cycles throughput on Skylake for lock xadd [mem], reg if hot in L1d cache, vs. one per 0.25 cycles for add reg, reg (https://uops.info) - not to mention removing opportunities to optimize away and combine operations, and reducing the ability of out-of-order execution to overlap work.
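To make the point concrete, here is a small sketch (not from the question) that will typically lose updates with a plain ++ but always reach the expected count with Interlocked.Increment:

using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static int _plain, _atomic;

    static void Main()
    {
        Parallel.For(0, 1_000_000, _ =>
        {
            _plain++;                           // read-modify-write, not atomic: updates can be lost
            Interlocked.Increment(ref _atomic); // atomic RMW: always ends at 1,000,000
        });
        Console.WriteLine($"plain: {_plain}, atomic: {_atomic}");
    }
}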
This is a partial answer to the question you asked in the comments:
Why not? If I, as a programmer know exactly where I should put this protections, why can't the compiler?
In order for the compiler to do that, it would need to understand all possible execution paths through your program. This is effectively the Path Testing problem discussed here: https://softwareengineering.stackexchange.com/questions/277693/does-path-coverage-guarantee-finding-all-bugs
That article states that this is equivalent to the halting problem, which is computer-science-ese for saying it's an unsolvable problem.
The cool thing is that you want to do this in a world where you have multiple threads of execution running on possibly multiple processors. That makes an unsolvable problem that much harder to solve.
On the other hand, the programmer should know what his/her program does...

Why do C# iterators track the creating thread instead of using an interlocked operation?

This is just something that's been puzzling me ever since I read about iterators on Jon Skeet's site.
There's a simple performance optimisation that Microsoft has implemented with their automatic iterators - the returned IEnumerable can be reused as an IEnumerator, saving an object creation. Now because an IEnumerator necessarily needs to track state, this is only valid the first time it's iterated.
What I cannot understand is why the design team took the approach they did to ensure thread safety.
Normally when I'm in a similar position I'd use what I consider to be a simple Interlocked.CompareExchange - to ensure that only one thread manages to change the state from "available" to "in process".
Conceptually it's very simple, a single atomic operation, no extra fields are required etc.
But the design team's approach? Every IEnumerable keeps a field holding the managed thread ID of the creating thread; when GetEnumerator is called, the current thread ID is checked against this field, and only if it's the same thread, and it's the first time it's called, can the IEnumerable return itself as the IEnumerator. It seems harder to reason about, imo.
I'm just wondering why this approach was taken. Are Interlocked operations far slower than two calls to System.Threading.Thread.CurrentThread.ManagedThreadId, so much so that it justifies the extra field?
Or is there some other reason behind this, perhaps involving memory models or ARM devices or something I'm not seeing? Maybe the spec imparts specific requirements on the implementation of IEnumerable? Just genuinely puzzled.
I can't answer definitively, but as to your question:
Are Interlocked operations far slower than two calls to
System.Threading.Thread.CurrentThread.ManagedThreadId, so much so that
it justifies the extra field?
Yes, interlocked operations are much slower than two calls to get the ManagedThreadId - interlocked operations aren't cheap because they require multi-CPU systems to synchronize their caches.
From Understanding the Impact of Low-Lock Techniques in Multithreaded Apps:
Interlocked instructions need to ensure that caches are synchronized
so that reads and writes don't seem to move past the instruction.
Depending on the details of the memory system and how much memory was
recently modified on various processors, this can be pretty expensive
(hundreds of instruction cycles).
In Threading in C#, the overhead is listed as 10ns, whereas getting the ManagedThreadId should be a normal non-locked read of static data.
Now this is just my speculation, but if you think about the normal use case, it would be to call the function to retrieve the IEnumerable and immediately iterate over it once. So in the standard use case the object is:
Used once
Used on the same thread it was created on
Short lived
So this design brings in no synchronization overhead and sacrifices 4 bytes, which will probably only be in use for a very short period of time.
Of course to prove this you would have to do performance analysis to determine the relative costs and code analysis to prove what the common case was.
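For illustration only, here is a rough sketch of the shape of that fast path. The names and state values are invented; the real compiler-generated class has more fields and states:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;

class GeneratedIterator : IEnumerable<int>, IEnumerator<int>
{
    private readonly int _initialThreadId = Thread.CurrentThread.ManagedThreadId;
    private int _state = -2;   // -2 = fresh IEnumerable, 0 = enumerator handed out (illustrative values)

    public IEnumerator<int> GetEnumerator()
    {
        // Fast path: first call, made on the thread that created the object - reuse "this".
        if (_state == -2 && _initialThreadId == Thread.CurrentThread.ManagedThreadId)
        {
            _state = 0;
            return this;
        }
        return new GeneratedIterator { _state = 0 };   // otherwise allocate a fresh enumerator
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    public int Current { get; private set; }
    object IEnumerator.Current => Current;
    public bool MoveNext() => false;   // the actual iteration body is elided in this sketch
    public void Reset() => throw new NotSupportedException();
    public void Dispose() { }
}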

Does multi-threading equal less CPU?

I have a small list of rather large files that I want to process, which got me thinking...
In C#, I was thinking of using Parallel.ForEach from the TPL to take advantage of modern multi-core CPUs, but my question is more of a hypothetical character:
Does the use of multi-threading in practice mean that it would take longer to load the files in parallel (using as many CPU cores as possible), as opposed to loading each file sequentially (but with probably less CPU utilization)?
Or to put it another way (:
What is the point of multi-threading? More tasks in parallel but at a slower rate, as opposed to focusing all computing resources on one task at a time?
In order to not increase latency, parallel computational programs typically only create one thread per core. Applications which aren't purely computational tend to add more threads so that the number of runnable threads is the number of cores (the others are in I/O wait, and not competing for CPU time).
Now, parallelism on disk-I/O bound programs may well cause performance to decrease: if the disk has non-negligible seek time, much more time will be wasted performing seeks and less time actually reading. This is called "churning" or "thrashing". Elevator sorting helps somewhat; true random access (such as solid state memories) helps more.
Parallelism does almost always increase the total raw work done, but this is only important if battery life is of foremost importance (and by the time you account for power used by other components, such as the screen backlight, completing quicker is often still more efficient overall).
You asked multiple questions, so I've broken up my response into multiple answers:
Multithreading may have no effect on loading speed, depending on what your bottleneck during loading is. If you're loading a lot of data off disk or a database, I/O may be your limiting factor. On the other hand if 'loading' involves doing a lot of CPU work with some data, you may get a speed up from using multithreading.
Generally speaking you can't focus "all computing resources on one task." Some multicore processors have the ability to overclock a single core in exchange for disabling other cores, but this speed boost is not equal to the potential performance benefit you would get from fully utilizing all of the cores using multithreading/multiprocessing. In other words it's asymmetrical -- if you have a 4-core 1 GHz CPU, it won't be able to overclock a single core all the way to 4 GHz in exchange for disabling the others. In fact, that's the reason the industry is going multicore in the first place -- at least for now we've hit limits on how fast we can make a single CPU run, so instead we've gone the route of adding more CPUs.
There are 2 reasons for multithreading. The first is that you want two tasks to run at the same time simply because it's desirable for both to be able to happen simultaneously -- e.g. you want your GUI to continue to respond to clicks or keyboard presses while it's doing other work (event loops are another way to accomplish this though). The second is to utilize multiple cores to get a performance boost.
For loading files from disk, this is likely to make things much slower. What happens is the operating system tries to lay out files on disk such that you should only need to do an expensive disk seek once for each file. If you have a lot of threads reading a lot of files, you're gonna have contention over which thread has access to the disk, and you'll have to seek back to the right place in the file every time the next thread gets a turn.
What you can do is use exactly two threads. Set one to load all of the files in the background, and let the other remain available for other tasks, like handling user input. In C# winforms, you can do this easily with a BackgroundWorker control.
Multi-threading is useful for highly parallelizable tasks. CPU-intensive tasks are perfect. Your CPU has many cores, and many threads can use many cores. They'll use more CPU time, but in the end they'll take less wall-clock time. If your app is I/O bound, then multithreading isn't always the solution (but it COULD help).
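As a concrete illustration of the Parallel.ForEach route mentioned in the question, here is a hedged sketch that caps parallelism at the core count; the path and the ProcessFile step are placeholders:

using System;
using System.IO;
using System.Threading.Tasks;

var files = Directory.EnumerateFiles(@"C:\data", "*.log");   // placeholder path and pattern

Parallel.ForEach(
    files,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    file =>
    {
        var text = File.ReadAllText(file);   // the I/O-bound part; extra threads may just cause seeking
        ProcessFile(text);                   // the CPU-bound part is where extra cores actually help
    });

static void ProcessFile(string contents)
{
    // placeholder for per-file processing
}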
It might be helpful to first understand the difference between Multithreading and Parallelism, as more often than not I see them being used rather interchangeably. Joseph Albahari has written a quite interesting guide about the subject: Threading in C# - Part 5 - Parallelism
As with all great programming endeavors, it depends. By and large, you'll be requesting files from one physical store, or one physical controller which will serialize the requests anyhow (or worse, cause a LOT of head back-and-forth on a classical hard drive) and slow down the already slow I/O.
OTOH, if the controllers and the media are separate, multiple cores loading data from them should be an improvement over a sequential method.

Should Threads be avoided if at all possible inside software components?

I have recently been looking at code, specifically component-oriented code that uses threads internally. Is this bad practice? The code I looked at was from an F# example that showed the use of event-based programming techniques. I cannot post the code in case of copyright infringement, but it does spin up a thread of its own. Is this regarded as bad practice, or is it acceptable for code not written by yourself to have full control over thread creation? I should point out that this code is not a visual component and is very much "built from scratch".
What are the best practices for component creation where threading would be helpful?
I am completely language agnostic on this; the F# example could have been in C# or Python.
I am concerned about the lack of control over the component's runtime behaviour and its hogging of resources; the example just spawned one extra thread, but as far as I can see there is nothing stopping this type of design from spawning as many threads as it wishes, up to the limit of what your program allows.
I did think of approaches such as object injection and so forth, but threads are odd in that, from a component perspective, they are pure "action" as opposed to "model, state, declarations".
Any help would be great.
This is too general a question to bear any answer more specific than "it depends" :-)
There are cases when using internal threads within a component is completely valid, and there are cases when not. This has to be decided on a case by case basis. Overall, though, since threads do make the code much more difficult to test and maintain, and increase the chances of subtle, hard to find bugs, they should be used with caution, only when there is a really decisive reason to use them.
An example to the legitimate use of threads is a worker thread, where a component handling an event starts an action which takes a long time to execute (such as a lengthy computation, a web request, or extensive file I/O), and spawns a separate thread to do the job, so that the control can be immediately returned to the interface to handle further user input. Without the worker thread, the UI would be totally unresponsive for a long time, which usually makes users angry.
Another example is a lengthy calculation/process which lends itself well to parallel execution, i.e. it consists of many smaller independent tasks of more or less similar size. If there are strong performance requirements, it does indeed make sense to execute the individual tasks in a concurrent fashion using a pool of worker threads. Many languages provide high level support for such designs.
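As a small sketch of the first pattern (illustrative names), offloading the long-running job to the thread pool so the caller stays responsive:

using System.Threading.Tasks;

class ReportGenerator
{
    // Kick the lengthy work onto the thread pool; the caller can await or continue immediately.
    public Task<string> GenerateAsync()
    {
        return Task.Run(() => BuildReport());
    }

    private string BuildReport()
    {
        // placeholder for a lengthy computation, web request, or file I/O
        return "done";
    }
}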
Note that components are generally free to allocate and use any other kinds of resources too and thus wreak havoc in countless other ways - are you ever worried about a component eating up all memory, exhausting the available file handles, reserving ports etc.? Many of these can cause much more trouble globally within a system than spawning extra threads.
There's nothing wrong about creating new threads in a component/library. The only thing wrong would be if it didn't give the consumer of the API/component a way to synchronize whenever necessary.
First of all, what is the nature of the component you are talking about? Is it a DLL to be consumed by some other code? What does it do? What are the business requirements? All of these are essential to determine whether you need to worry about parallelism or not.
Second of all, threading is just a tool to achieve better performance and responsiveness, so avoiding it at all costs everywhere does not sound like a smart approach - threading is certainly vital for some business needs.
Third of all, when comparing threading semantics in C# vs F#, you have to remember that they are very different beasts - F# implicitly makes threading safer to code, as there is no notion of global variables, so critical sections are easier to eschew in F# than in C#. That puts you as a developer in a better place, because you don't have to deal with memory blocks, locks, semaphores etc.
I would say that if your 'component' relies heavily on threading, you might want to consider using either Parallel FX in C# or even going with F#, since it approaches processor time slicing and parallelism in a more elegant way (IMHO).
And last but not least, regarding hogging computer resources by using threading in your component - please remember that threads do not necessarily impose a higher resource impact per se; you can just as easily do the same damage on one thread if you don't dispose of your (unmanaged) objects properly. Granted, you might hit an OutOfMemoryException faster when you make the same mistake on several threads…

Debugging and diagnosing lock convoying problems in .NET

I am looking into performance issues of a large C#/.NET 3.5 system that exhibits performance degradation as the number of users making requests scales up to 40-50 distinct user requests per second.
The request durations increase significantly, while CPU and I/O loads appear to stay about the same. This leads me to believe we may have a problem with how shared objects in our system, which are protected using C# lock() {...} statements, are affecting concurrent access performance. Specifically, I suspect that some degree of lock convoying is occurring on frequently used shared data that is protected by critical sections (because it is read/write).
Does anyone have suggestions on how to actually diagnose if lock convoying is the problem .. or if lock contention of any kind is contributing to long request times?
Lock convoys are hard to debug in general. Does your code path have sequential lock statements either directly or in branches?
The Total # of Contentions performance counter gives a base estimate of contention in the app.
Also break open a profiler and look. You can also write some perf counters to track down the slow parts of a code path. Also make sure that locks are only being held for as long as absolutely necessary.
Also check out the Windows Performance Tools. I have found these to be extremely useful as you can track down lots of low level problems like abnormal amounts of context switching.
A good place to start is by having a look at the Lock and Thread performance counters. Out of interest, what exactly are you locking on in your web app? Locking in most ASP.NET applications isn't common.
I can't provide much insight into the diagnostics, but if you find proof to back up your assumption then you might be interested in System.Threading.ReaderWriterLockSlim which allows for concurrent reads, but prevents concurrent writes.
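For reference, a minimal sketch of ReaderWriterLockSlim guarding a shared dictionary (the names are illustrative):

using System.Collections.Generic;
using System.Threading;

class SharedCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, string> _data = new Dictionary<string, string>();

    public string Get(string key)
    {
        _lock.EnterReadLock();            // many readers can hold this at once
        try { return _data.TryGetValue(key, out var value) ? value : null; }
        finally { _lock.ExitReadLock(); }
    }

    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();           // writers are exclusive
        try { _data[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}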
