We have a concurrent, multithreaded program.
How would I make a sample number increase in intervals of +5 every time? Does Interlocked.Increment have an overload for an interval? I don't see one listed.
Microsoft Interlocked.Increment Method
// Attempt to make it increase by 5
private int NumberTest;

for (int i = 1; i <= 5; i++)
{
    NumberTest = Interlocked.Increment(ref NumberTest);
}
This is another question it's based off:
C# Creating global number which increase by 1
I think you want Interlocked.Add:
Adds two integers and replaces the first integer with the sum, as an atomic operation.
int num = 0;
Interlocked.Add(ref num, 5);
Console.WriteLine(num);
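Since Interlocked.Add returns the updated value, the +5-per-iteration loop from the question could look like this sketch (NumberTest is the field from the question's snippet):

for (int i = 1; i <= 5; i++)
{
    // Atomically adds 5 and returns the new value; no separate read or
    // assignment back to NumberTest is needed.
    int updated = Interlocked.Add(ref NumberTest, 5);
    Console.WriteLine(updated); // 5, 10, 15, 20, 25
}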
Adding (i.e. +=) is not and cannot be an atomic operation (as you know). Unfortunately, there is no way to achieve this without enforcing a full fence; on the bright side, these are fairly optimised at a low level. There are, however, several other ways you can ensure integrity (especially since this is just an add):
1. The use of Interlocked.Add (the sanest solution).
2. Apply an exclusive lock (or Monitor.Enter) outside the for loop.
3. Use an AutoResetEvent to ensure the threads do the task one by one (meh, sigh).
4. Create a temp int in each thread and, once finished, add the temp onto the sum under an exclusive lock or similar (see the sketch after this list).
5. The use of ReaderWriterLockSlim.
6. Parallel.For with per-thread accumulation and an Interlocked.Add into the shared sum, same as 4.
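For illustration, here is a minimal sketch of options 4/6 using Parallel.For with per-thread totals (the iteration count and the +5 step are made up for the example):

using System;
using System.Threading;
using System.Threading.Tasks;

class Example
{
    static int sum;

    static void Main()
    {
        // Each worker accumulates into its own local total (no contention);
        // only the final per-thread total is added atomically, once per thread.
        Parallel.For(0, 1000,
            () => 0,                                   // localInit: per-thread total
            (i, state, local) => local + 5,            // body: accumulate locally
            local => Interlocked.Add(ref sum, local)); // localFinally: one atomic add
        Console.WriteLine(sum); // 5000
    }
}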
Related
For parallel work on an acceleration data structure I currently use a SpinLock but would like to design the algorithm lock free.
The data structure is a jagged array where each inner array has a different size.
The working threads should fetch the next element in the inner array, increment the index and, if the index runs past the end of the inner array, move on to the next index in the outer array:
for (int i = 0; i < arr.Length; ++i)
{
    for (int j = 0; j < arr[i].Length; ++j)
    {
        DoWork(arr[i][j]);
    }
}
I can't think of a way to do this lock free except to increment a shared index and then sum up the lengths of the arrays:
int sharedIndex = -1;

// -- In the worker thread ---------------------
bool loop = false;
do
{
    int index = Interlocked.Increment(ref sharedIndex);
    int count = 0;
    loop = false;
    for (int i = 0; i < arr.Length; ++i)
    {
        count += arr[i].Length;
        if (count > index)
        {
            var remaining = index - (count - arr[i].Length);
            DoWork(arr[i][remaining]);
            loop = true;
            break;
        }
    }
} while (loop);
Is there a way to not have to loop over the entire outer array and still remain lock free?
Because I can't increment two indexes at the same time (for the outer and inner index).
Can you divide up work by having each thread do one to four outer iterations between synchronization steps? If outer_size / chunk_size / threads is at least 4 or so (or maybe greater than the expected ratio between your shortest and longest inner arrays), scheduling of work is dynamic enough that you should usually avoid having one thread running for a long time on a very long array while the other threads have all finished.
(If a chunk size of 1 row aka inner array is coarse enough for efficiency, you can simply do that. You say that DoWork is so slow that even a shared counter for single elements might not be a problem)
That might still be a risk if the very last inner array is longer than the others. Depending on how common that is, and/or how important it is to avoid that worst-case scenario, you might look at the inner sizes ahead of time and sort or partition them to start working on the longest inner arrays first, so at the end the differences between threads finishing are the differences in lengths of the shorter arrays. (e.g. real-time where limiting the worst case is more important than speeding up the average, vs. a throughput-oriented use-case. Also if there's anything useful for other threads to be doing with free CPU cores if you don't schedule this perfectly.)
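As a rough sketch (not the poster's code), claiming chunks of whole rows with a shared counter could look like this; arr and DoWork are from the question, and the chunk size of 4 is an assumption:

int nextRow = 0; // shared between workers
const int ChunkSize = 4;

// --- In each worker thread ---------------------
while (true)
{
    // One atomic op claims a whole chunk of rows, not a single element.
    int start = Interlocked.Add(ref nextRow, ChunkSize) - ChunkSize;
    if (start >= arr.Length) break; // no fresh rows left
    int end = Math.Min(start + ChunkSize, arr.Length);
    for (int i = start; i < end; ++i)
        for (int j = 0; j < arr[i].Length; ++j)
            DoWork(arr[i][j]); // no synchronization inside the chunk
}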
Atomically incrementing a shared counter for every inner element would serialize all threads on that, so unless processing each inner element was very expensive, it would be much slower than single-threaded without synchronization.
I'm assuming you don't need to start work on each element in sequential order, since even a shared counter wouldn't guarantee that (a thread could sleep after incrementing, with another thread starting the element after).
If you are going to search, start from your previous position.
If you do want to use a single shared counter, instead of linear searching from the start of the outer array every time, only search from your previous position. The shared counter is monotonically increasing, so the next position will usually be later this row, or into the very next. Should be more efficient to do that than to search from the start every time.
e.g. keep 3 variables: prev_index, prev_i, and prev_j. If j = prev_j + (index - prev_index) is still within the current array, you're done. This is likely the common case. Otherwise, move to the next row and recompute by subtracting arr[i].Length until you have a j that's in-bounds for that i, as sketched below.
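A sketch of that bookkeeping (per worker thread; sharedIndex, arr and DoWork are from the question, and the exit condition is an assumption):

// Per-thread state: where the previously claimed element was found.
int prevIndex = -1, prevI = 0, prevJ = -1;

while (true)
{
    int index = Interlocked.Increment(ref sharedIndex);
    // sharedIndex is monotonic, so resume from the previous position.
    int i = prevI;
    int j = prevJ + (index - prevIndex);
    while (i < arr.Length && j >= arr[i].Length)
    {
        j -= arr[i].Length; // skip the rest of this row, move to the next
        ++i;
    }
    if (i >= arr.Length) break; // past the last element: nothing left to claim
    prevIndex = index; prevI = i; prevJ = j;
    DoWork(arr[i][j]);
}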
Theodor Zoulias suggested pre-computing an array with a running total (aka prefix-sum) of the lengths. Good idea, but searching from the previous position probably makes that unnecessary, unless your rows are typically very short and you have lots of threads. In that case each step might involve multiple rows, so you could linear search from your previous position in a running-total array a bit more efficiently.
Per-row position counter: other threads can help finish a long row
If dividing work between threads only by rows isn't fine-grained enough, you could still mostly do that (with low contention), but create a way for threads to go back and help with unfinished long rows once there are no more fresh rows.
So you start as I proposed, with each thread claiming a whole row via a shared counter. When it gets to the end of a row, atomic fetch_add to get a new row to work on. Once you run out of fresh rows, threads can go back and look for arrays with arr[i].work_pos < arr[i].length.
Inside each row, you have a struct with the array itself (which records the length), and an atomic current-position counter, and another atomic counter for the number of threads currently working on this sub-array.
While working on an inner array, a thread atomically increments the position-within-array counter for that inner array, using that as the position of the next DoWork. So it's still a full memory barrier between every DoWork call (or unroll to claim 2 at a time and then do them), but contention is greatly reduced for most of the total run time because this will be the only thread incrementing that counter. (Until later threads jump in and start helping)
An atomic RMW on a cache line that stays hot in this core's L1d cache is much cheaper than an atomic RMW on a line when we have to request it from another core. So we want the per-row struct to be allocated separately, ideally contiguous with the row data like in C struct { _Atomic size_t work_pos; size_t len; atomic_int thread_active; struct work arr[]; }; with a "flexible array member" (so arbitrary-length array is contiguous with the end of the struct), or another level of indirection to just have a pointer/reference to an array.
Or if you can use the first 2 elements of an array of integers for this atomic bookkeeping, that also works. The outer array should be an array of references to these structs, not structs stored by value, where multiple control blocks would share a cache line. False sharing would be about as bad as true sharing. And having pairs of threads contend with each other would be nearly as bad as all threads contending for the same single counter, if DoWork is slow enough that either way there's usually just one request for it in flight by one core.
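In C# terms the control block might look something like this sketch (the type and member names are made up; unlike the C version, contiguity with the row data can't be guaranteed, but a separate heap object per row at least keeps different rows' counters apart):

// One control block per inner array, referenced from the outer array.
class Row
{
    public int WorkPos;       // next unclaimed index; advanced with Interlocked.Increment
    public int ThreadsActive; // threads currently working on this row
    public int[] Items;       // the row's elements; Items.Length plays the role of len
}

// Claim the next element of a row; false means the row is finished.
static bool TryClaim(Row row, out int pos)
{
    pos = Interlocked.Increment(ref row.WorkPos) - 1; // old value = our slot
    return pos < row.Items.Length;
}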
Then the fun part comes at the end, when an Interlocked.Increment on the rowIndex returns an index past the end. That thread then has to find an in-progress row to help with. Ideally the helpers could be evenly distributed over the still-in-progress rows.
Perhaps we should have an array that records which row each thread is working on, with an entry for each thread? So threads looking for a place to help can scan through that and find the array with the highest work_left / threads_working. (That's why I suggested a thread-count member in the control block). Races in atomic stores of a pointer/reference to this array vs. readers reading one entry at a time aren't a problem; if an array was almost done, we wouldn't have wanted to pick it anyway, and we'll find somewhere useful to join.
If you naively just search backward from the end of the outer array, new threads will pile on to the last incomplete row, even if it's almost done, and create lots of contention for its atomic counters. But you also don't want to have to search over the whole outer array every time, if it could be large-ish. (If not, if rows are long but there aren't many of them, then that's fine.)
Reading the atomic work_pos counter that another thread is using will disturb that thread, as it loses exclusive ownership so its next Interlocked.Increment will be slower. So we'd like to avoid threads needing to find new rows to jump in on too frequently.
If we had a good heuristic for them to say that a row looks "good enough" and jump in immediately, instead of looking at all active / incomplete rows every time, that could reduce contention. But only if it's a good enough heuristic to make good choices.
Another way to reduce contention is to minimize how often a thread gets to the end of a row. Choosing the larger work_left / threads_working should achieve that, as that should be a decent approximation of which row will be completed last.
Multiple threads choosing at the same time might all pick the same row, but I don't think we can be perfect (or it would be too expensive). We can detect this when they use Interlocked.Increment to add themselves to the number of threads working on this row. Fallback to the second-longest estimated time row could be appropriate, or check if this is still the estimated-slowest row with the extra workers.
This doesn't have to be perfect; this is all just cleanup at the end of things, after running with minimal contention most of the time. As long as DoWork isn't too cheap relative to inter-thread latency, it's not a disaster if we sometimes have a bit more contention than was necessary.
Perhaps some heuristic for a thread stopping itself before all the work is done could also be useful, if there are other things a CPU core could be doing. (Or for this thread to be doing, in a pool of worker threads.)
You could optimize your current algorithm by doing binary search of a precomputed array, that contains the accumulated length of all the arrays up to this index. So for example if you have a jagged array of 10 inner arrays with lengths 8, 9, 5, 4, 0, 0, 6, 4, 4, 7, then the precomputed array will contain the values 0, 8, 17, 22, 26, 26, 26, 32, 36, 40. Doing a binary search will get you directly to the inner array that corresponds to the index that you are searching for, doing only O(Log n) comparisons.
Here is an implementation of this idea:
// --- Preparation ------------------------------
int[] indices = new int[arr.Length];
indices[0] = 0;
for (int i = 1; i < arr.Length; i++)
    indices[i] = indices[i - 1] + arr[i - 1].Length;
int sumLength = arr.Sum(inner => inner.Length);
int sharedIndex = -1;

// --- In the worker thread ---------------------
while (true)
{
    int index = Interlocked.Increment(ref sharedIndex);
    if (index >= sumLength) break;
    int outerIndex = Array.BinarySearch(indices, index);
    if (outerIndex < 0) outerIndex = (~outerIndex) - 1;
    while (arr[outerIndex].Length == 0) outerIndex++; // Skip empty arrays
    int innerIndex = index - indices[outerIndex];
    DoWork(arr[outerIndex][innerIndex]);
}
I've got a Task that counts the number of packets it receives from some source.
Every 250 ms a timer fires, reads the count, and outputs it to the user. Right after that, I need to set the count back to 0.
My concern is that between reading and displaying the count, but BEFORE I set count = 0, count gets incremented in the other thread, so I end up losing counts by zeroing it out.
I am new to threading, so I have been looking at multiple options.
I looked into using Interlocked, but as far as I know it only gives me arithmetic operations; I don't have the option to actually set the variable to a value.
I was also looking into ReaderWriterLockSlim. What I need is the most efficient / lowest-overhead way to accomplish this, since there is a lot of data coming across.
You want Exchange:
int currentCount = System.Threading.Interlocked.Exchange(ref count, 0);
As per the docs:
Sets a 32-bit signed integer to a specified value and returns the original value, as an atomic operation.
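For example, the timer-driven read-and-reset could look like this sketch (the 250 ms interval is from the question; the class and member names are made up):

using System;
using System.Threading;

class PacketCounter
{
    private int count;
    private Timer timer;

    // Called by the receiving task for every packet.
    public void OnPacket() => Interlocked.Increment(ref count);

    public void Start()
    {
        // Every 250 ms: atomically grab the current count and reset it to 0.
        // Increments racing with the timer land in the next window instead of
        // being lost.
        timer = new Timer(_ =>
        {
            int current = Interlocked.Exchange(ref count, 0);
            Console.WriteLine(current);
        }, null, 250, 250);
    }
}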
Possible Duplicate:
Interlocked.CompareExchange<Int> using GreaterThan or LessThan instead of equality
I know that Interlocked.CompareExchange exchanges values only if the value and the comparand are equal.
How do I exchange them if they are not equal, to achieve something like this?
if (Interlocked.CompareExchange(ref count, count + 1, max) != max)
// I want it to increment as long as it is not equal to max
{
    // count should equal count + 1
}
A more efficient (less bus locking and fewer reads) and simplified implementation of what Marc posted:
static int InterlockedIncrementAndClamp(ref int count, int max)
{
    // val starts as ~oldval so the loop always runs at least once.
    int oldval = Volatile.Read(ref count), val = ~oldval;
    while (oldval != max && oldval != val)
    {
        val = oldval;
        oldval = Interlocked.CompareExchange(ref count, oldval + 1, oldval);
    }
    // If count was already at max we didn't increment; don't report past max.
    return Math.Min(oldval + 1, max);
}
If you have really high contention, we might be able to improve scalability further by reducing the common case to a single atomic increment instruction: same overhead as CompareExchange, but no chance of a loop.
static int InterlockedIncrementAndClamp(ref int count, int max, int drift)
{
    int v = Interlocked.Increment(ref count);
    while (v > (max + drift))
    {
        // Try to pull the value back down to max.
        int observed = Interlocked.CompareExchange(ref count, max, v);
        if (observed == v) break; // we clamped it; done
        v = observed;             // count changed under us; re-check
    }
    return Math.Min(v, max);
}
Here we allow count to go up to drift values beyond max. But we still return only up to max. This allows us to fold the entire op into a single atomic increment in most cases, which will allow maximum scalability. We only need more than one op if we go above our drift value, which you can probably make large enough to make very rare.
In response to Marc's worries about Interlocked and non-Interlocked memory access working together:
Regarding specifically volatile vs Interlocked: volatile is just a normal memory op, but one that isn't optimized away and one that isn't reordered with regards to other memory ops. This specific issue doesn't revolve around either of those specific properties, so really we're talking non-Interlocked vs Interlocked interoperability.
The .NET memory model guarantees reads and writes of basic integer types (up to the machine's native word size) and references are atomic. The Interlocked methods are also atomic. Because .NET only has one definition of "atomic", they don't need to explicitly special-case saying they're compatible with each-other.
One thing Volatile.Read does not guarantee is visibility: you'll always get a load instruction, but the CPU might read an old value from its local cache instead of a new value just put in memory by a different CPU. x86 doesn't need to worry about this in most cases (special instructions like MOVNTPS being the exception), but it's a very possible thing with other architectures.
To summarize, there are two problems that could affect the Volatile.Read: first, we could be running on a 16-bit CPU, in which case reading an int is not going to be atomic and what we read might not be the value someone else is writing. Second, even if the read is atomic, we might be reading an old value due to visibility.
But affecting the Volatile.Read does not mean they affect the algorithm as a whole, which is secure against both.
The first case would only bite us if you're writing to count concurrently in a non-atomic way. This is because what could end up happening is (write A[0]; CAS A[0:1]; write A[1]). Because all of our writes to count happen in the guaranteed-atomic CAS, this isn't a problem. When we're just reading, if we read a wrong value, it'll be caught in the upcoming CAS.
If you think about it, the second case is actually just a specialization of the normal case where the value changes between read and write -- the read just happens before we ask for it. In this case the first Interlocked.CompareExchange call would report a different value than what Volatile.Read gave, and you'd start looping until it succeeded.
If you'd like, you can think of the Volatile.Read as a pure optimization for cases of low contention. We could init oldval with 0 and it would still work just fine. Using Volatile.Read gives it a high chance to only perform one CAS (which, as instructions go, is quite expensive especially in multi-CPU configurations) instead of two.
But yes, as Marc says -- sometimes locks are just simpler!
There isn't a "compare if not equal", however: you can test the value yourself first, and then only do the update if you don't get a thread race; this often means you may need to loop if the second test fails. In pseudo-code:
bool retry;
do {
    retry = false;
    // get current value
    var val = Interlocked.CompareExchange(ref field, 0, 0);
    if (val != max) { // if not maxed
        // increment; if the value isn't what it was above: redo from start
        retry = Interlocked.CompareExchange(ref field, val + 1, val) != val;
    }
} while (retry);
But frankly, a lock would be simpler.
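For comparison, a lock-based version of the same clamped increment might look like this sketch (field and max correspond to the pseudo-code above):

private static readonly object sync = new object();
private static int field;

static bool TryIncrementBelowMax(int max)
{
    lock (sync)
    {
        if (field == max) return false; // already at the cap; leave it alone
        field++;
        return true;
    }
}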
One day I was trying to get a better understanding of threading concepts, so I wrote a couple of test programs. One of them was:
using System;
using System.Threading.Tasks;

class Program
{
    static volatile int a = 0;

    static void Main(string[] args)
    {
        Task[] tasks = new Task[4];

        for (int h = 0; h < 20; h++)
        {
            a = 0;
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = new Task(() => DoStuff());
                tasks[i].Start();
            }
            Task.WaitAll(tasks);
            Console.WriteLine(a);
        }

        Console.ReadKey();
    }

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            a++;
        }
    }
}
I hoped I would be able to see outputs less than 2000000. The model in my imagination was the following: multiple threads read variable a at the same time, all local copies of a are the same, each thread increments it, the writes happen, and one or more increments are "lost" this way.
The output, however, goes against this reasoning. One sample output (from a Core i5 machine):
2000000
1497903
1026329
2000000
1281604
1395634
1417712
1397300
1396031
1285850
1092027
1068205
1091915
1300493
1357077
1133384
1485279
1290272
1048169
704754
If my reasoning were true, I would see 2000000 occasionally and sometimes numbers a bit less. But what I see is 2000000 occasionally and numbers way less than 2000000. This indicates that what happens behind the scenes is not just a couple of "increment losses" but that something more is going on. Could somebody explain the situation to me?
Edit:
When I was writing this test program I was fully aware of how I could make this thread-safe, and I was expecting to see numbers less than 2000000. Let me explain why I was surprised by the output. First, let's assume that the reasoning above is correct. Second assumption (this very well can be the source of my confusion): if the conflicts happen (and they do), then these conflicts are random, and I expect a somewhat normal distribution for these random event occurrences. In this case the first line of the output says: out of 500000 experiments, the random event never occurred. The second line says: the random event occurred at least 167365 times. The difference between 0 and 167365 is just too big (almost impossible with a normal distribution). So the case boils down to the following:
One of the two assumptions (the "increment loss" model or the "somewhat normally distributed parallel conflicts" model) is incorrect. Which one is it, and why?
The behavior stems from the fact that you are using the volatile keyword while not locking access to the variable a when using the increment operator (++) (although you still get a random distribution when not using volatile, using volatile does change the nature of the distribution, which is explored below).
When using the increment operator, it's the equivalent of:
a = a + 1;
In this case, you're actually doing three operations, not one:
1. Read the value of a
2. Add 1 to the value of a
3. Assign the result of step 2 back to a
While the volatile keyword serializes access, in the above case, it's serializing access to three separate operations, not serializing access to them collectively, as an atomic unit of work.
Because you're performing three operations when incrementing instead of one, you have additions that are being dropped.
Consider this:
Time  Thread 1                Thread 2
----  --------                --------
0     read a (1)              read a (1)
1     evaluate a + 1 (2)      evaluate a + 1 (2)
2     write result to a (3)   write result to a (3)
Or even this:
Time  a  Thread 1            Thread 2            Thread 3
----  -  --------            --------            --------
0     1  read a                                  read a
1     1  evaluate a + 1 (2)
2     2  write back to a
3     2                      read a
4     2                      evaluate a + 1 (3)
5     3                      write back to a
6     3                                          evaluate a + 1 (2)
7     2                                          write back to a
Note in particular steps 5-7: thread 2 has written a value back to a, but because thread 3 has an old, stale value, it actually overwrites the results that previous threads have written, essentially wiping out any trace of those increments.
As you can see, as you add more threads, you have a greater potential to mix up the order in which the operations are being performed.
volatile will prevent you from corrupting the value of a due to two writes happening at the same time, or a corrupt read of a due to a write happening during a read, but it doesn't do anything to handle making the operations atomic in this case (since you're performing three operations).
In this case, volatile ensures that the distribution of the value of a is between 0 and 2,000,000 (four threads * 500,000 iterations per thread) because of this serialization of access to a. Without volatile, you run the risk of a being anything as you can run into corruption of the value a when reads and/or writes happen at the same time.
Because you haven't synchronized access to a for the entire increment operation, the results are unpredictable, as you have writes that are being overwritten (as seen in the previous example).
What's going on in your case?
For your specific case you have many writes that are being overwritten, not just a few; since you have four threads, each incrementing in a loop 500,000 times, theoretically all the writes could be overwritten (expand the second example to four threads and then just add rows for the extra loop iterations).
While that's not really probable, you shouldn't expect not to drop a tremendous number of writes.
Additionally, Task is an abstraction. In reality (assuming you are using the default scheduler), it uses the ThreadPool class to get threads to process your requests. The ThreadPool is ultimately shared with other operations (some internal to the CLR, even in this case), and even then it does things like work-stealing and using the current thread for operations, and at some point it drops down to the operating system to get a thread to perform the work.
Because of this, you can't assume that the overwrites will follow some neat random distribution; there's always a lot more going on that will throw whatever order you expect out the window: the order of processing is undefined, and the allocation of work will never be evenly distributed.
If you want to ensure that additions won't be overwritten, then you should use the Interlocked.Increment method in the DoStuff method, like so:
for (int i = 0; i < 500000; i++)
{
    Interlocked.Increment(ref a);
}
This will ensure that all writes will take place, and your output will be 2000000 twenty times (as per your loop).
It also invalidates the need for the volatile keyword, as you're making the operations you need atomic.
The volatile keyword is good when the operation that you need to make atomic is limited to a single read or write.
If you have to do anything more than a read or a write, then the volatile keyword is too granular, you need a more coarse locking mechanism.
In this case, it's Interlocked.Increment, but if you have more that you have to do, then the lock statement will more than likely be what you rely on.
I don't think there's anything else happening - it's just happening a lot. If you add locking or some other synchronization technique (Best thread-safe way to increment an integer up to 65535) you'll reliably get the full 2,000,000 increments.
Each task is calling DoStuff() as you'd expect.
private static object locker = new object();

static void DoStuff()
{
    for (int i = 0; i < 500000; i++)
    {
        lock (locker)
        {
            a++;
        }
    }
}
Try increasing the amounts; the timespan is simply too short to draw any conclusions from. Remember that normal IO is in the range of milliseconds, and just one blocking IO op in this case would render the results useless.
Something along the lines of this is better (or why not int.MaxValue?):
static void DoStuff()
{
    for (int i = 0; i < 50000000; i++) // 50 000 000
        a++;
}
My results ("correct" being 200 000 000, i.e. 4 tasks x 50 000 000):
63838940
60811151
70716761
62101690
61798372
64849158
68786233
67849788
69044365
68621685
86184950
77382352
74374061
58356697
70683366
71841576
62955710
70824563
63564392
71135381
Not really a normal distribution but we are getting there. Bear in mind that this is roughly 35% of the correct amount.
I can explain my results by the fact that I am running on 2 physical cores, although viewed as 4 due to hyper-threading, which means that if it is optimal to do an "HT switch" during the actual addition, at least 50% of the additions will be "removed" (if I remember the implementation of hyper-threading correctly, it would be e.g. modifying one thread's data in the ALU while loading/saving another thread's data). The remaining 15% is due to the program actually running on 2 cores in parallel.
My recommendations
post your hardware
increase the loop count
vary the TaskCount
hardware matters!
I am trying to run the following program from the book.
The author claims that the resultant output should be:
1000
2000
....
10000
if you run the program on a normal processor, but on a multiprocessor computer it could be:
999
1998
...
9998
when using the normal increment method (number += 1), but using the interlocked increment as shown in the program solves the problem (i.e. you get the first output).
Now I have got 3 questions.
First, why can't I use a normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]? Why has the author chosen the other method?
Secondly, what purpose does Thread.Sleep(1000) have in this context? When I comment out this line, I get the second output even if I am using the Interlocked method to increment number.
Thirdly, I get the correct output even when using the normal increment method [number += 1] if I don't comment out the Thread.Sleep(1000) line, and the second output if I do.
I am running the program on an Intel(R) Core(TM) i7 Q820 CPU, if it makes any difference.
static void Main(string[] args)
{
    MyNum n = new MyNum();

    for (int a = 0; a < 10; a++)
    {
        for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
        {
            Thread t = new Thread(new ThreadStart(n.AddOne));
            t.Start();
        }
        Thread.Sleep(1000);
        Console.WriteLine(n.number);
    }
}

class MyNum
{
    public int number = 0;

    public void AddOne()
    {
        Interlocked.Increment(ref number);
    }
}
The sleep is easy--let the threads finish before you look at the result. It's not really a good answer, though--while they should finish in a second there is no guarantee they actually do.
The need for the interlocked increment in the MyNum class is clear: there are 1000 threads going for the number, and without protection it would be quite possible for one to read the number, then a second to read it, then the first to put it back and then the second to put it back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores; otherwise it can only happen if a thread switch hits at the wrong time.
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. The threads run faster than they're created, so they aren't all running at once.
Try:
public void AddOne()
{
    // Padding work to widen the race window; the non-atomic update of
    // number is what we want to expose.
    number = number + fibnocci(20) + 1 - fibnocci(20);
}

private int fibnocci(int n)
{
    if (n < 3) return 1;
    else return fibnocci(n - 1) + fibnocci(n - 2);
}
(I hope the optimizer isn't good enough to kill this extra code)
The code is actually pretty strange. Thread t is declared locally on each iteration and no reference to it is kept, so the program can't track or join the threads later (a started thread itself won't be garbage collected while it is running). Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, getting the correct result is not really guaranteed with number += 1. Two cores might read the same value and increment it to the same result (i.e., 1001 and 1001).
Lastly, I'm not sure whether or not you are running the program in debug mode. Building the program in release mode may give you different behavior and expose the kinds of side effects a multi-threaded program can have.
If you comment out the Thread.Sleep line, there is a good chance that the threads will not finish prior to the print line... in this case you will see a number smaller than the "correct" output, but not because the increment wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see the collision.