One day I was trying to get a better understanding of threading concepts, so I wrote a couple of test programs. One of them was:
using System;
using System.Threading.Tasks;

class Program
{
    static volatile int a = 0;

    static void Main(string[] args)
    {
        Task[] tasks = new Task[4];
        for (int h = 0; h < 20; h++)
        {
            a = 0;
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = new Task(() => DoStuff());
                tasks[i].Start();
            }
            Task.WaitAll(tasks);
            Console.WriteLine(a);
        }
        Console.ReadKey();
    }

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            a++;
        }
    }
}
I hoped I would be able to see outputs less than 2000000. The model in my imagination was the following: multiple threads read the variable a at the same time, so all the local copies of a are the same; the threads increment it, the writes happen, and one or more increments are "lost" this way.
The output, however, contradicts this reasoning. One sample output (from a Core i5 machine):
2000000
1497903
1026329
2000000
1281604
1395634
1417712
1397300
1396031
1285850
1092027
1068205
1091915
1300493
1357077
1133384
1485279
1290272
1048169
704754
If my reasoning were true I would see 2000000 occasionally and sometimes numbers a bit less. But what I see is 2000000 occasionally and numbers way less than 2000000. This indicates that what happens behind the scenes is not just a couple of "increment losses" but that something more is going on. Could somebody explain the situation to me?
Edit:
When I was writing this test program I was fully aware of how I could make it thread safe, and I was expecting to see numbers less than 2000000. Let me explain why I was surprised by the output. First, let's assume that the reasoning above is correct. Second assumption (this very well can be the source of my confusion): if the conflicts happen (and they do), then these conflicts are random, and I expect a somewhat normal distribution for these random event occurrences. In this case the first line of the output says: out of 500000 experiments, the random event never occurred. The second line says: the random event occurred at least 167365 times. The difference between 0 and 167365 is just too big (almost impossible with a normal distribution). So the case boils down to the following:
One of the two assumptions (the "increment loss" model or the "somewhat normally distributed parallel conflicts" model) is incorrect. Which one is it, and why?
The behavior stems from the fact that you are using the volatile keyword while not locking access to the variable a when using the increment operator (++). (Although you still get a random distribution when not using volatile, using volatile does change the nature of the distribution, which is explored below.)
When using the increment operator, it's the equivalent of:
a = a + 1;
In this case, you're actually doing three operations, not one:
Read the value of a
Add 1 to the value of a
Assign the result of step 2 back to a
While the volatile keyword serializes access, in the above case, it's serializing access to three separate operations, not serializing access to them collectively, as an atomic unit of work.
Because you're performing three operations when incrementing instead of one, you have additions that are being dropped.
Consider this:
Time    Thread 1                Thread 2
----    --------                --------
0       read a (1)              read a (1)
1       evaluate a + 1 (2)      evaluate a + 1 (2)
2       write result to a (3)   write result to a (3)
Or even this:
Time  a  Thread 1            Thread 2            Thread 3
----  -  --------            --------            --------
0     1  read a                                  read a
1     1  evaluate a + 1 (2)
2     2  write back to a
3     2                      read a
4     2                      evaluate a + 1 (3)
5     3                      write back to a
6     3                                          evaluate a + 1 (2)
7     2                                          write back to a
Note in particular steps 5-7, thread 2 has written a value back to a, but because thread 3 has an old, stale value, it actually overwrites the results that previous threads have written, essentially wiping out any trace of those increments.
As you can see, as you add more threads, you have a greater potential to mix up the order in which the operations are being performed.
volatile will prevent you from corrupting the value of a due to two writes happening at the same time, or a corrupt read of a due to a write happening during a read, but it doesn't do anything to handle making the operations atomic in this case (since you're performing three operations).
In this case, volatile ensures that the distribution of the value of a is between 0 and 2,000,000 (four threads * 500,000 iterations per thread) because of this serialization of access to a. Without volatile, you run the risk of a being anything as you can run into corruption of the value a when reads and/or writes happen at the same time.
Because you haven't synchronized access to a for the entire increment operation, the results are unpredictable, as you have writes that are being overwritten (as seen in the previous example).
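The three-operation expansion above can be sketched directly in C#. This is a hypothetical hand-expansion of a++ (not what the compiler literally emits), but it reproduces the same lost updates when two threads run it concurrently:

```csharp
using System;
using System.Threading;

class Program
{
    static int a = 0;

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            int temp = a;    // 1. read the value of a
            temp = temp + 1; // 2. add 1
            a = temp;        // 3. write back (may overwrite another thread's write)
        }
    }

    static void Main()
    {
        var t1 = new Thread(DoStuff);
        var t2 = new Thread(DoStuff);
        t1.Start();
        t2.Start();
        t1.Join();
        t2.Join();
        Console.WriteLine(a); // typically well below 1000000
    }
}
```

Run it a few times: the final value varies from run to run, which is exactly the "lost increment" behavior described here.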
What's going on in your case?
For your specific case you have many writes being overwritten, not just a few; since you have four threads each incrementing in a loop 500,000 times (two million increments in total), theoretically all the writes could be overwritten (expand the second example to four threads and then add a few million rows for the loop iterations).
While it's not really probable, you shouldn't expect not to drop a tremendous number of writes.
Additionally, Task is an abstraction. In reality (assuming you are using the default scheduler), it uses the ThreadPool class to get threads to process your requests. The ThreadPool is ultimately shared with other operations (some internal to the CLR, even in this case), and even then it does things like work-stealing and using the current thread for operations, and at some point it drops down to the operating system to get a thread to perform work.
Because of this, you can't assume that there's a random distribution of overwrites that will be skipped, as there's always a lot more going on that will throw whatever order you expect out the window; the order of processing is undefined, and the allocation of work will never be evenly distributed.
If you want to ensure that additions won't be overwritten, then you should use the Interlocked.Increment method in the DoStuff method, like so:
for (int i = 0; i < 500000; i++)
{
    Interlocked.Increment(ref a);
}
This will ensure that all writes will take place, and your output will be 2000000 twenty times (as per your loop).
It also invalidates the need for the volatile keyword, as you're making the operations you need atomic.
The volatile keyword is good when the operation that you need to make atomic is limited to a single read or write.
If you have to do anything more than a read or a write, then the volatile keyword is too granular, you need a more coarse locking mechanism.
In this case, it's Interlocked.Increment, but if you have more that you have to do, then the lock statement will more than likely be what you rely on.
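Putting that together, a minimal corrected version of the original program might look like this (a sketch; Task.Run is used here in place of the explicit Task constructor, and volatile is dropped since Interlocked makes it unnecessary):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static int a = 0; // volatile no longer needed; Interlocked handles visibility

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            Interlocked.Increment(ref a); // atomic read-modify-write
        }
    }

    static void Main()
    {
        var tasks = new Task[4];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(DoStuff);
        }
        Task.WaitAll(tasks);
        Console.WriteLine(a); // always 2000000
    }
}
```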
I don't think anything else is happening - it's just happening a lot. If you add locking or some other synchronization technique (Best thread-safe way to increment an integer up to 65535) you'll reliably get the full 2,000,000 increments.
Each task is calling DoStuff() as you'd expect.
private static object locker = new object();

static void DoStuff()
{
    for (int i = 0; i < 500000; i++)
    {
        lock (locker)
        {
            a++;
        }
    }
}
Try increasing the amounts; the timespan is simply too short to draw any conclusions from. Remember that normal IO is in the range of milliseconds, and just one blocking IO op in this case would render the results useless.
Something along the lines of this is better (or why not int.MaxValue?):
static void DoStuff()
{
    for (int i = 0; i < 50000000; i++) // 50 000 000
        a++;
}
My results ("correct" being 200 000 000, i.e. four tasks at 50 000 000 each):
63838940
60811151
70716761
62101690
61798372
64849158
68786233
67849788
69044365
68621685
86184950
77382352
74374061
58356697
70683366
71841576
62955710
70824563
63564392
71135381
Not really a normal distribution, but we are getting there. Bear in mind that this is roughly 35% of the correct amount.
I can explain my results by the fact that I am running on 2 physical cores (viewed as 4 due to hyper-threading), which means that if it is optimal to do an "HT switch" during the actual addition, at least 50% of the additions will be "removed" (if I remember the implementation of hyper-threading correctly, it would be modifying one thread's data in the ALU while loading/saving the other thread's data). The remaining 15% is due to the program actually running on 2 cores in parallel.
My recommendations
post your hardware
increase the loop count
vary the TaskCount
hardware matters!
We have a concurrent, multithreaded program.
How would I make a number increase by an interval of +5 every time? Does Interlocked.Increment have an overload for an interval? I don't see one listed.
Microsoft Interlocked.Increment Method
// Attempt to make it increase by 5
private int NumberTest;

for (int i = 1; i <= 5; i++)
{
    NumberTest = Interlocked.Increment(ref NumberTest);
}
This is another question it's based off: C# Creating global number which increase by 1
I think you want Interlocked.Add:
Adds two integers and replaces the first integer with the sum, as an atomic operation.
int num = 0;
Interlocked.Add(ref num, 5);
Console.WriteLine(num);
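Applied to the question's scenario, the whole loop of five increments collapses into one atomic call per step. A small sketch (the field name follows the question's NumberTest; the three steps are just for illustration):

```csharp
using System;
using System.Threading;

class Program
{
    static int numberTest = 0; // shared counter

    static void Main()
    {
        // One atomic Add replaces a loop of five Interlocked.Increment calls.
        for (int step = 0; step < 3; step++)
        {
            int after = Interlocked.Add(ref numberTest, 5); // returns the updated value
            Console.WriteLine(after);
        }
    }
}
```

This prints 5, 10, 15: each call advances the counter by 5 atomically, even if several threads call it concurrently.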
Adding (i.e. +=) is not, and cannot be, an atomic operation (as you know). Unfortunately, there is no way to achieve this without enforcing a full fence; on the bright side, these are fairly optimized at a low level. However, there are several other ways you can ensure integrity (especially since this is just an add):
The use of Interlocked.Add (the sanest solution)
Apply an exclusive lock (or Monitor.Enter) outside the for loop.
AutoResetEvent to ensure the threads do the task one by one (meh, sigh).
Create a temp int in each thread and, once finished, add the temp onto the sum under an exclusive lock or similar.
The use of ReaderWriterLockSlim.
Parallel.For thread-based accumulation with an Interlocked.Increment sum, same as 4.
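Option 4 (per-thread accumulation, merged under a single atomic add) can be sketched like this; the task and iteration counts here are illustrative, not from the question:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static int sum = 0;

    static void Main()
    {
        var tasks = new Task[4];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                int local = 0;                   // private to this task: no contention
                for (int j = 0; j < 500000; j++)
                {
                    local += 5;                  // a plain add is safe on a local
                }
                Interlocked.Add(ref sum, local); // one atomic merge per task
            });
        }
        Task.WaitAll(tasks);
        Console.WriteLine(sum); // 4 * 500000 * 5 = 10000000
    }
}
```

The design point: contention happens once per task instead of once per iteration, so this is usually much faster than an Interlocked call inside the loop.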
There is a C# function A(arg1, arg2) which needs to be called lots of times. To do this fastest, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
    for (int i = 0; i < totalCalls / threads; i++)
    {
        // init arg1 and arg2
        var value = A(arg1, arg2);
        // do something with value
    }
});
Now the issue is that this does not scale up with an increase in the number of cores; e.g. on 8 cores it uses 80% of the CPU and on 16 cores it uses 40-50%. I want to use the CPU to the maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it doesn't have any IO or network-bound operations, and there is no thread locking. What are other possibilities for finding out which part of the code is making it not perform in a 100% parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1 - if I run the same code replacing A() with a simple function that calculates prime numbers, it utilizes 100% CPU and scales up well. So this proves that the other pieces of code are correct. The issue must be within the original function A(). I need a way to detect the issue that is causing some sort of sequencing.
You have determined that the code in A is the problem.
There is one very common problem: garbage collection. Configure your application in app.config to use the concurrent server GC. The workstation GC tends to serialize execution, and the effect is severe.
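For a .NET Framework application, that configuration is a couple of lines in app.config (a minimal sketch; gcConcurrent is on by default in recent runtimes but is shown for completeness):

```xml
<configuration>
  <runtime>
    <!-- Server GC uses per-core heaps; the workstation GC can serialize threads during collections. -->
    <gcServer enabled="true"/>
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>
```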
If this is not the problem, pause the debugger a few times and look at the Debug -> Parallel Stacks window. There you can see what your threads are doing. Look for common resources and contention. For example, if you find many threads waiting for a lock, that's your problem.
Another nice debugging technique is commenting out code. Once the scalability limit disappears you know what code caused it.
Possible Duplicate:
Threads synchronization. How exactly lock makes access to memory 'correct'?
This question is inspired by this one.
We have the following test class:
class Test
{
    private static object ms_Lock = new object();
    private static int ms_Sum = 0;

    public static void Main()
    {
        Parallel.Invoke(HalfJob, HalfJob);
        Console.WriteLine(ms_Sum);
        Console.ReadLine();
    }

    private static void HalfJob()
    {
        for (int i = 0; i < 50000000; i++)
        {
            lock (ms_Lock) { } // empty lock
            ms_Sum += 1;
        }
    }
}
The actual result is very close to the expected value of 100 000 000 (50 000 000 x 2, since the 2 loops run at the same time), with a difference of around 200 - 600 (the error is approx. 0.0004% on my machine, which is very low). No other way of synchronization provides such an approximation (it's either a much bigger error percentage, or it's 100% correct).
We currently understand that such level of preciseness is because of program runs in the following way:
Time is running left to right, and 2 threads are represented by two rows.
where
a black box represents the process of acquiring, holding and releasing the lock, and a plus represents the addition operation (the schema is to scale for my PC; the lock takes approximately 20 times longer than the add)
a white box represents a period that consists of trying to acquire the lock and then waiting for it to become available
Also, lock provides a full memory fence.
So the question now is: if the above schema represents what is going on, what is the cause of such a big error? (It is big because the schema looks like a very strong synchronization schema.) We could understand a difference of 1-10 at the boundaries, but that is clearly not the only source of the error. We cannot see when writes to ms_Sum could happen at the same time to cause the error.
EDIT: many people like to jump to quick conclusions. I know what synchronization is, and that the above construct is not a real or even close to good way to synchronize threads if we need a correct result. Have some faith in the poster, or maybe read the linked answer first. I don't need a way to synchronize 2 threads to perform additions in parallel; I am exploring this extravagant and yet efficient (compared to any possible approximate alternative) synchronization construct. It does synchronize to some extent, so it's not meaningless as suggested.
lock(ms_Lock) { } is a meaningless construct. lock guarantees exclusive execution of the code inside it.
Let me explain why this empty lock decreases (but doesn't eliminate!) the chance of data corruption. Let's simplify the threading model a bit:
Thread executes one line of code at time slice.
Thread scheduling is done in strict round robin manner (A-B-A-B).
Monitor.Enter/Exit takes significantly longer to execute than arithmetic. (Let's say 3 times longer. I stuffed the code with Nops, meaning the previous line is still executing.)
A real += takes 3 steps. I broke them down into atomic ones.
The left columns show which line is executed at each time slice by threads A and B.
The right column shows the program (according to my model).
A   B
1        1  SomeOperation();
   1     2  SomeOperation();
2        3  Monitor.Enter(ms_Lock);
   2     4  Nop();
3        5  Nop();
4        6  Monitor.Exit(ms_Lock);
5        7  Nop();
7        8  Nop();
8        9  int temp = ms_Sum;
   3    10  temp++;
9       11  ms_Sum = temp;
   4
10
   5
11
A   B
1       1  SomeOperation();
   1    2  SomeOperation();
2       3  int temp = ms_Sum;
   2    4  temp++;
3       5  ms_Sum = temp;
   3
4
   4
5
   5
As you can see, in the first scenario thread B just can't catch thread A, and A has enough time to finish executing ms_Sum += 1;. In the second scenario, ms_Sum += 1; is interleaved and causes constant data corruption. In reality thread scheduling is stochastic, but it means that thread A has a better chance to finish the increment before another thread gets there.
This is a very tight loop with not much going on inside it, so ms_Sum += 1 has a reasonable chance of being executed in "just the wrong moment" by the parallel threads.
Why would you ever write a code like this in practice?
Why not:
lock(ms_Lock) {
ms_Sum += 1;
}
or just:
Interlocked.Increment(ref ms_Sum);
?
-- EDIT ---
Some comments on why would you see the error despite memory barrier aspect of the lock... Imagine the following scenario:
Thread A enters the lock, leaves the lock and then is pre-empted by the OS scheduler.
Thread B enters and leaves the lock (possibly once, possibly more than once, possibly millions of times).
At that point the thread A is scheduled again.
Both A and B hit the ms_Sum += 1 at the same time, resulting in some increments being lost (because increment = load + add + store).
As noted the statement
lock(ms_Lock) {}
will cause a full memory barrier. In short this means the value of ms_Sum will be flushed between all caches and updated ("visible") among all threads.
However, ms_Sum += 1 is still not atomic, as it is just shorthand for ms_Sum = ms_Sum + 1: a read, an operation, and an assignment. It is in this construct that there is still a race condition -- the count in ms_Sum may end up slightly lower than expected. I would also expect the difference to be greater without the memory barrier.
Here is a hypothetical situation of why it might be lower (A and B represent threads and a and b represent thread-local registers):
A: read ms_Sum -> a
B: read ms_Sum -> b
A: write a + 1 -> ms_Sum
B: write b + 1 -> ms_Sum // change from A "discarded"
This depends on a very particular order of interleaving and is dependent upon factors such as thread execution granularity and relative time spent in said non-atomic region. I suspect the lock itself will reduce (but not eliminate) the chance of the interleave above because each thread must wait-in-turn to get through it. The relative time spent in the lock itself to the increment may also play a factor.
Happy coding.
As others have noted, use the critical region established by the lock or one of the atomic increments provided to make it truly thread-safe.
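A fixed version of the test class, with the increment itself made atomic, can be sketched as follows (the iteration count is reduced here so the sketch runs quickly; the original used 50 000 000 per thread):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Test
{
    private static int ms_Sum = 0;

    public static void Main()
    {
        Parallel.Invoke(HalfJob, HalfJob);
        Console.WriteLine(ms_Sum); // always exactly 2 * 5000000
    }

    private static void HalfJob()
    {
        for (int i = 0; i < 5000000; i++)
        {
            Interlocked.Increment(ref ms_Sum); // atomic read-modify-write; no lock object needed
        }
    }
}
```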
As noted: lock(ms_Lock) { } locks an empty block and so does nothing. You still have a race condition with ms_Sum += 1;. You need:
lock (ms_Lock)
{
    ms_Sum += 1;
}
[Edited to note:]
Unless you properly serialize access to ms_Sum, you have a race condition. Your code as written does the following (assuming the optimizer doesn't just throw away the useless lock statement):
Acquire lock
Release lock
Get value of ms_Sum
Increment value of ms_Sum
Store value of ms_Sum
Each thread may be suspended at any point, even in mid-instruction. Unless it is specifically documented as being atomic, it's a fair bet that any machine instruction that takes more than 1 clock cycle to execute may be interrupted mid-execution.
So let's assume that your lock is actually serializing the two threads. There is still nothing in place to prevent one thread from getting suspended (and thus giving priority to the other) whilst it is somewhere in the middle of executing the last three steps.
So the first thread in locks, releases, gets the value of ms_Sum, and is then suspended. The second thread comes in, locks, releases, gets the [same] value of ms_Sum, increments it, stores the new value back in ms_Sum, and then gets suspended. The first thread increments its now-outdated value and stores it.
There's your race condition.
The += operator is not atomic: first it reads, then it writes the new value. In the meantime, between the read and the write, thread A could switch to the other thread B, in effect without having written the value... then thread B doesn't see the new value, because it was not yet assigned by thread A... and when execution returns to thread A, it discards all the work of thread B.
Does looping in C# occur at the same speed on all systems? If not, how can I control the looping speed to make the experience consistent on all platforms?
You can set a minimum time for the time taken to go around a loop, like this:
for (int i = 0; i < 10; i++)
{
    System.Threading.Thread.Sleep(100);
    // ... rest of your code ...
}
The sleep call will take a minimum of 100 ms (you cannot say what the maximum will be), so your loop will take at least 1 second to run 10 iterations.
Bear in mind that it's counter to the normal way of Windows programming to sleep on your user-interface thread, but this might be useful to you for a quick hack.
You can never depend on the speed of a loop. Although all existing compilers strive to make loops as efficient as possible, and so probably produce very similar results (given enough development time), the compilers are not the only thing influencing this.
And even leaving everything else aside, different machines have different performance. No two machines will yield the exact same speed for a loop. In fact, even starting the program twice on the same machine will yield slightly different performances. It depends on what other programs are running, how the CPU is feeling today and whether or not the moon is shining.
No, loops do not run at the same speed on all systems. There are so many factors to this question that it cannot be appreciably answered without code.
This is a simple loop:
int j = 0;
for (int i = 0; i < 100; i++)
{
    j = j + i;
}
This loop is very simple; it's merely load, add, and store operations, with a jump and a compare. This will be only a few microops and will be really fast. However, the speed of those microops depends on the processor. If the processor can do one microop in 1 billionth of a second (roughly one gigahertz), then the loop will take approximately 6 * 100 microops (this is all rough estimation; there are so many factors involved that I'm only going for approximation), or 6 * 100 billionths of a second, slightly less than one millionth of a second for the entire loop. You can barely measure this with most operating system functions.
I wanted to demonstrate the speed of the looping. I referenced above a processor of 1 billion microops per second. Now consider a processor that can do 4 billion microops per second. That processor would be four times faster (roughly) than the first processor. And we didn't change the code.
Does this answer the question?
For those who want to mention that the compiler might loop unroll this, ignore that for the sake of the learning.
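To see the variance concretely, one can time a (much larger) loop with Stopwatch; the iteration count here is arbitrary, and the elapsed time will differ from machine to machine and from run to run:

```csharp
using System;
using System.Diagnostics;

class LoopTiming
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        long j = 0;
        for (int i = 0; i < 100000000; i++) // 100 million iterations
        {
            j = j + i;
        }
        sw.Stop();
        Console.WriteLine(j);                      // deterministic: 4999999950000000
        Console.WriteLine(sw.ElapsedMilliseconds); // varies with hardware and load
    }
}
```

The sum is always the same; only the second line (the wall-clock time) changes between systems, which is the point of the answer above.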
One way of controlling this is by using the Stopwatch to control when you do your logic. See this example code:
int noofrunspersecond = 30;
long ticks1 = 0;
long ticks2 = 0;
double interval = (double)Stopwatch.Frequency / noofrunspersecond;
while (true)
{
    ticks2 = Stopwatch.GetTimestamp();
    if (ticks2 >= ticks1 + interval)
    {
        ticks1 = Stopwatch.GetTimestamp();
        // perform your logic here
    }
    Thread.Sleep(1);
}
This will make sure that the logic is performed at the given interval, as long as the system can keep up; if you try to execute 100 times per second, depending on the logic performed, the system might not manage to perform it 100 times a second. In other cases this should work just fine.
This kind of logic is good for getting smooth animations that will not speed up or slow down on different systems for example.
I am trying to run the following program from the book. The author claims that the resultant output should be:
1000
2000
....
10000
if you run the program on a normal processor, but on a multiprocessor computer it could be
999
1998
...
9998
when using the normal increment method (number += 1); using the Interlocked increment as shown in the program solves the problem (i.e. you get the first output).
Now I have got 3 questions.
First, why can't I use a normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]? Why has the author chosen the other method?
Secondly, what purpose does Thread.Sleep(1000) serve in this context? When I comment out this line, I get the second output even if I am using the Interlocked method to increment number.
Thirdly, I get the correct output even using the normal increment method [number += 1] if I don't comment out the Thread.Sleep(1000) line, and the second output if I do.
I am running the program on an Intel Core i7 Q820 CPU, if that makes any difference.
static void Main(string[] args)
{
    MyNum n = new MyNum();
    for (int a = 0; a < 10; a++)
    {
        for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
        {
            Thread t = new Thread(new ThreadStart(n.AddOne));
            t.Start();
        }
        Thread.Sleep(1000);
        Console.WriteLine(n.number);
    }
}

class MyNum
{
    public int number = 0;

    public void AddOne()
    {
        Interlocked.Increment(ref number);
    }
}
The sleep is easy: let the threads finish before you look at the result. It's not really a good answer, though; while they should finish within a second, there is no guarantee that they actually do.
The need for the interlocked increment in the MyNum class is clear: there are 1000 threads going for the number. Without protection it would be quite possible for one to read the number, then a second to read it, then the first to put its value back and then the second to put its value back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores; otherwise they can only happen if a thread switch hits at the wrong time.
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. The threads run faster than they are created, so they aren't all running at once.
Try:
public void AddOne()
{
    number = number + fibnocci(20) + 1 - fibnocci(20);
}

private int fibnocci(int n)
{
    if (n < 3) return 1;
    return fibnocci(n - 1) + fibnocci(n - 2);
}
(I hope the optimizer isn't good enough to kill this extra code)
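As an aside on the Sleep question: a deterministic alternative is to keep references to the threads and Join them, so the main thread waits exactly as long as needed and no longer. A sketch (the thread count follows the book's 1000 per batch; one batch shown):

```csharp
using System;
using System.Threading;

class Program
{
    static int number = 0;

    static void Main()
    {
        var threads = new Thread[1000];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() => Interlocked.Increment(ref number));
            threads[i].Start();
        }

        // Join blocks until each thread has finished - no guessing with Sleep(1000).
        foreach (var t in threads)
        {
            t.Join();
        }

        Console.WriteLine(number); // always 1000
    }
}
```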
The code is actually pretty strange. Since Thread t is declared locally on each iteration, no reference to it survives past the loop (although a running thread itself is kept alive by the runtime and will not be garbage collected mid-run). Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, the same result is not really guaranteed with number += 1. Two cores might read the same value and increment it to the same result (i.e., 1001, 1001).
Lastly, I'm not sure whether you are running the program in debug mode. Building the program in release mode may give you the different behaviors and side effects that a multi-threaded program can have.
If you comment out the Thread.Sleep line, there is a good chance that the threads will not have finished before the WriteLine; in this case you will see a number smaller than the "correct" output, but not because the increment wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see a collision.