Hi
I am reading a “threading in C#” tutorial. One of the things it mentions is:
“The CLR assigns each thread its own memory stack so that local variables are kept separate”
And there’s this example:
using System;
using System.Threading;

namespace ConsoleApplication1 {
    class Program {
        static void Main(string[] args) {
            for (int i = 0; i < 20; i++) {
                Thread t = new Thread(() => {
                    Console.WriteLine(i);
                });
                t.Start();
            }
            Console.ReadLine();
        }
    }
}
Output:
1
2
2
4
6
8
10
10
10
10
12
12
14
15
17
18
18
20
20
So the way I understand what’s happening here is:
The main thread starts executing the for loop.
A new thread is instantiated and defined such that it will receive the value of “i”
and print it to the console.
The thread instance is started meanwhile the main thread continues working.
Since “i” is an integer, my guess is that the new thread will have its own copy in its memory stack and then print that value to the console. But as the results show, it’s skipping values, jumping from 10 to 12 or 12 to 14.
So is the new thread receiving a reference to i? But if “i” is an integer, shouldn’t the new thread store a new value in its memory stack instead of what seems to be a reference to i?
Also, why are there duplicate values? It’s printing 2, 10, 12, 18 and 20 several times.
Thanks.
That sample is fatally flawed... because every thread is actually sharing the one i variable. It's being captured by the lambda expression.
This is a very common problem, but it's a real shame to see it in a threading tutorial. (I hope it's not one of my articles! Please tell us where you're reading this.) Eric Lippert has written about it very carefully in his blog posts, "closing over the loop variable considered harmful" - part 1; part 2.
It's worth distinguishing between threading behaviour and that of lambda expressions. Threads really do have their own stacks and their own local variables - but here, i is shared between all threads due to the lambda expression. It's not a local variable in the "normal" sense.
Here's an example which shows each thread having its own local variables:
using System;
using System.Threading;

public class Test
{
    static void Main()
    {
        for (int x = 0; x < 10; x++)
        {
            new Thread(Count).Start();
        }
    }

    static void Count()
    {
        int threadId = Thread.CurrentThread.ManagedThreadId;
        Console.WriteLine("Thread {0} starting", threadId);
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine("{0}: {1}", threadId, i);
        }
        Console.WriteLine("Thread {0} ending", threadId);
    }
}
Each thread will definitely print 0..4 along with its own thread ID. The i variable is genuinely local to each thread this time - there's no sharing.
When a variable is used in a lambda expression, like your variable i, it gets hoisted into a closure (in your case that of the Main method) (the first hit on Google for "closure c#" happens to be Jon Skeet's article on the subject). Because of this, it is not a local variable in the usual sense and does not live on the thread's stack.
The problem is simple: a thread takes time to initialize and run, so the value of i will have changed in the meantime. It is also possible that more than one loop iteration will have completed by the time the other threads get a processor cycle. That is why a single number gets printed multiple times.
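The usual fix for the captured loop variable described above is to copy it into a variable declared inside the loop body, so that each lambda captures its own variable. The following is a minimal sketch under that assumption; the class name CapturedLoopVariableDemo, the Run helper and the shared seen list are made up for illustration and are not part of the original tutorial.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class CapturedLoopVariableDemo
{
    // Starts one thread per loop iteration and returns the values the threads observed, sorted.
    public static List<int> Run(int count)
    {
        var seen = new List<int>();
        var threads = new List<Thread>();
        for (int i = 0; i < count; i++)
        {
            int copy = i; // a fresh variable per iteration, so each lambda captures its own
            var t = new Thread(() =>
            {
                lock (seen) { seen.Add(copy); } // List<int> is not thread-safe, so guard it
            });
            threads.Add(t);
            t.Start();
        }
        foreach (var thread in threads) thread.Join(); // wait for every thread to finish
        seen.Sort();
        return seen;
    }

    public static void Main()
    {
        // Every value from 0 to 19 shows up exactly once; only the arrival order varied.
        Console.WriteLine(string.Join(" ", Run(20)));
    }
}
```

The threads may still run in any order, which is why the sketch sorts the values before printing them. What changes is that no value is skipped or duplicated.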
I have a problem with my library for neural networks. It uses multithreading to speed up computations, but after about 30-60 seconds of runtime my program no longer utilizes 100% of my i7 3610QM (4 cores, 8 threads).
Basically my processing looks like (c# with pseudocode)
for each training example t in training set
    for each layer l in neural network
        Parallel.For(0, N, (int i)=>{l.processForward(l.regions[i])})
    for each layer l in neural network (but with reversed order)
        Parallel.For(0, N, (int i)=>{l.backPropageteError(l.regions[i])})
Where regions is layer's list of precalculated regions of neuron to process. Every region is the same size of 1/N of current layer so Tasks are same size to minimize chance that other threads need to wait for longest task to finish.
As I said, this processing scheme uses 100% of my processor only for a short time and then drops to about 80-85%. In my case I set N to Environment.ProcessorCount (= 8).
I can share whole code/repository if anyone is willing to help.
I tried to investigate, so I created a new console project containing what is almost a Hello World of Parallel.For(), and I simply can't tell what is going on. This might be a different issue with Parallel.For(), but I would also like you to address it. Here is the code:
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        const int n = 1;
        while (true)
        {
            //int counter = 0; for (int ii = 0; ii < 1000; ++ii) counter++;
            Parallel.For(0, n, (int i) => { int counter = 0; for (int ii = 0; ii < 1000; ++ii) counter++; });
        }
    }
}
In this code, I constantly (in the while loop) create one task (n=1) that has some work to do (incrementing a counter one thousand times). As far as I know, Parallel.For blocks execution and waits for all parallel calls to finish. If that is true, it should be doing the same work as the commented-out section (given n=1). But on my computer this program uses 100% of the CPU, as if there were work for more than one thread! How is that possible? When I switch to the commented-out version, the program uses less than 20% of the CPU, which is what I expected. Please help me understand this behaviour.
As @TaW said, there is a cost to going parallel. That's why f() and Parallel.For(0, n, _ => f()) are not equivalent: the parallel version incurs thread scheduling and context switching. In your case the execution time of f() is comparable to the thread scheduling overhead, which is why performance degrades with the parallel version. Parallel.For does wait until the operation completes, but it completes so fast that several threads run on different CPU cores within a very short period of time (remember that each time you invoke Parallel.For it may choose a different thread to run f() on).
As for the first part of the question, i guess the problem lies in the index range passed to Parallel.For. Instead of [0, number of CPU cores), it should be equal to the index range of data.
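The suggestion above, passing the index range of the data to Parallel.For rather than the number of cores, could look roughly like the sketch below. The inputs array and the Process function are hypothetical stand-ins for the library's neurons and per-neuron work, which are not shown in the question.

```csharp
using System;
using System.Threading.Tasks;

public static class DataRangeParallelDemo
{
    // Hypothetical per-item work standing in for processing one neuron.
    static double Process(double x) => Math.Sqrt(x) * 2.0;

    public static double[] Run(double[] inputs)
    {
        var outputs = new double[inputs.Length];
        // Iterate over the data itself and let the runtime decide how to
        // partition the range across the available cores.
        Parallel.For(0, inputs.Length, i =>
        {
            outputs[i] = Process(inputs[i]); // each index is written by exactly one iteration
        });
        return outputs;
    }

    public static void Main()
    {
        var inputs = new double[100000];
        for (int i = 0; i < inputs.Length; i++) inputs[i] = i;
        double[] outputs = Run(inputs);
        Console.WriteLine(outputs[4]); // sqrt(4) * 2
    }
}
```

With this shape the scheduler can rebalance work when one chunk finishes early, instead of all threads waiting for the slowest of N fixed regions. Whether that recovers the lost CPU utilization in the original library is not something this sketch can confirm.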
I want to access a web server using HttpWebRequest and fetch thousands of records from a given range of pages. Each hit to a webpage fetches 15 records, and there are roughly 8,000 to 10,000 pages on the web server. That means a total of 120000 hits to the server! Done trivially with a single process, the task can be very time consuming, so multithreading is the immediate solution that comes to mind.
Currently, I have created a worker class for searching. That worker class spawns 5 subworkers (threads) to search in a given range. But as a novice in threading I am unable to make it work, because I am having trouble synchronizing the threads and making them all work together. I know about delegates, actions and events in .NET, but making them work with threads is getting confusing. This is the code that I am using:
public void Start()
{
    this.totalRangePerThread = ((this.endRange - this.startRange) / this.subWorkerThreads.Length);
    for (int i = 0; i < this.subWorkerThreads.Length; ++i)
    {
        //theThreads[counter] = new Thread(new ThreadStart(MethodName));
        this.subWorkerThreads[i] = new Thread(() => searchItem(this.startRange, this.totalRangePerThread));
        //this.subWorkerThreads[i].Start();
        this.startRange = this.startRange + this.totalRangePerThread;
    }
    for (int threadIndex = 0; threadIndex < this.subWorkerThreads.Length; ++threadIndex)
        this.subWorkerThreads[threadIndex].Start();
}
The searchItem method:
public void searchItem(int start, int pagesToSearchPerThread)
{
    for (int count = 0; count < pagesToSearchPerThread; ++count)
    {
        //searching routine here
    }
}
The problem exists between the shared variables of the threads, can anyone guide me how to make it a threadsafe procedure?
The real problem you're facing is that the lambda expression in the Thread constructor is capturing the outer variable (startRange), so every thread reads whatever value startRange has by the time it actually runs. One way to fix it is to make a copy of the variable, like this:
for (int i = 0; i < this.subWorkerThreads.Length; ++i)
{
    var copy = startRange;
    this.subWorkerThreads[i] = new Thread(() => searchItem(copy, this.totalRangePerThread));
    this.startRange = this.startRange + this.totalRangePerThread;
}
For more information on creating and starting threads, see Joe Albahari's excellent ebook (there's also a section on captured variables a bit further down). If you want to learn about closures, see this question.
The first answer is that these threads don't really need that much work to share variables (assuming I'm understanding you correctly). They have some shared input variables (the target web server, for example), but those are thread-safe because they aren't being changed. The plan is that they'll build a database or some such containing the records they retrieve. You should be fine to just have each of the five fill their own input archive, and then merge them in a single thread once all the subworker threads are done. If somehow the architecture that you're using to store the data makes that expensive... well, how much you're planning to store and what you're planning to store it in becomes pertinent, then, and perhaps you could share what those are?
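The approach described above, where each subworker fills its own collection and a single thread merges them afterwards, can be sketched as follows. FetchPage is a hypothetical placeholder for the real HTTP request, and the class and parameter names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class RangeWorkerDemo
{
    // Hypothetical stand-in for fetching one page; returns the page number as its "record".
    static int FetchPage(int page) => page;

    public static List<int> Run(int startRange, int endRange, int workerCount)
    {
        int perThread = (endRange - startRange) / workerCount;
        var threads = new Thread[workerCount];
        var results = new List<int>[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            int index = i;                           // per-iteration copies, safe to capture
            int start = startRange + i * perThread;
            // The last worker also takes any remainder left by the integer division.
            int end = (i == workerCount - 1) ? endRange : start + perThread;
            results[index] = new List<int>();
            threads[index] = new Thread(() =>
            {
                for (int page = start; page < end; page++)
                    results[index].Add(FetchPage(page)); // only this thread touches this list
            });
        }

        foreach (var thread in threads) thread.Start();
        foreach (var thread in threads) thread.Join();

        var merged = new List<int>();
        foreach (var list in results) merged.AddRange(list); // single-threaded merge
        return merged;
    }

    public static void Main()
    {
        Console.WriteLine(Run(0, 103, 5).Count); // one record per page in [0, 103)
    }
}
```

Because no list is shared between workers, no locking is needed while fetching. The only synchronization point is the Join before the merge.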
One day I was trying to get a better understanding of threading concepts, so I wrote a couple of test programs. One of them was:
using System;
using System.Threading.Tasks;

class Program
{
    static volatile int a = 0;

    static void Main(string[] args)
    {
        Task[] tasks = new Task[4];
        for (int h = 0; h < 20; h++)
        {
            a = 0;
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = new Task(() => DoStuff());
                tasks[i].Start();
            }
            Task.WaitAll(tasks);
            Console.WriteLine(a);
        }
        Console.ReadKey();
    }

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            a++;
        }
    }
}
I expected to see outputs of less than 2000000. The model in my imagination was the following: several threads read variable a at the same time, so all local copies of a are the same; the threads increment their copies, the writes happen, and one or more increments are "lost" this way.
The output, however, goes against this reasoning. One sample output (from a Core i5 machine):
2000000
1497903
1026329
2000000
1281604
1395634
1417712
1397300
1396031
1285850
1092027
1068205
1091915
1300493
1357077
1133384
1485279
1290272
1048169
704754
If my reasoning were true, I would see 2000000 occasionally and sometimes numbers a bit less. But what I see is 2000000 occasionally and numbers far below 2000000. This indicates that what happens behind the scenes is not just a couple of "increment losses"; something more is going on. Could somebody explain the situation to me?
Edit:
When I was writing this test program I was fully aware of how I could make it thread safe, and I was expecting to see numbers less than 2000000. Let me explain why I was surprised by the output. First, let's assume that the reasoning above is correct. Second assumption (this may very well be the source of my confusion): if the conflicts happen (and they do), then these conflicts are random, and I expect a somewhat normal distribution for these random event occurrences. In this case the first line of the output says: in 500000 experiments the random event never occurred. The second line says: the random event occurred at least 167365 times. The difference between 0 and 167365 is just too big (almost impossible with a normal distribution). So the case boils down to the following:
One of the two assumptions (the "increment loss" model or the "somewhat normally distributed parallel conflicts" model) is incorrect. Which one is it, and why?
The behavior stems from the fact that you are incrementing the variable a with the increment operator (++) without locking access to it, while also marking it volatile. (You would still get a random distribution without volatile, but volatile does change the nature of the distribution, which is explored below.)
When using the increment operator, it's the equivalent of:
a = a + 1;
In this case, you're actually doing three operations, not one:
1. Read the value of a
2. Add 1 to the value of a
3. Assign the result of step 2 back to a
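The volatile keyword only constrains how each individual read and write of a is performed (it stops the compiler and runtime from reordering or caching them). In the above case that covers three separate operations; it does nothing to make them execute collectively as one atomic unit of work.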
Because you're performing three operations when incrementing instead of one, you have additions that are being dropped.
Consider this:
Time    Thread 1                 Thread 2
----    --------                 --------
0       read a (1)               read a (1)
1       evaluate a + 1 (2)       evaluate a + 1 (2)
2       write result to a (2)    write result to a (2)
Or even this:
Time    a    Thread 1              Thread 2              Thread 3
----    -    --------              --------              --------
0       1    read a                                      read a
1       1    evaluate a + 1 (2)
2       2    write back to a
3       2                          read a
4       2                          evaluate a + 1 (3)
5       3                          write back to a
6       3                                                evaluate a + 1 (2)
7       2                                                write back to a
Note in particular steps 5-7, thread 2 has written a value back to a, but because thread 3 has an old, stale value, it actually overwrites the results that previous threads have written, essentially wiping out any trace of those increments.
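The three-thread interleaving above can be replayed deterministically in a single thread, which makes the lost updates easy to check by hand. This is only a sketch: the locals t1, t2 and t3 are hypothetical stand-ins for each thread's private copy of a, and no real threads are involved.

```csharp
using System;

public static class LostUpdateReplay
{
    // Replays the eight time steps of the three-thread example in program order.
    public static int Run()
    {
        int a = 1;

        int t1 = a;      // time 0: thread 1 reads a (1)
        int t3 = a;      // time 0: thread 3 reads a (1)
        t1 = t1 + 1;     // time 1: thread 1 evaluates a + 1 (2)
        a = t1;          // time 2: thread 1 writes back, a == 2
        int t2 = a;      // time 3: thread 2 reads a (2)
        t2 = t2 + 1;     // time 4: thread 2 evaluates a + 1 (3)
        a = t2;          // time 5: thread 2 writes back, a == 3
        t3 = t3 + 1;     // time 6: thread 3 evaluates from its stale copy (2)
        a = t3;          // time 7: thread 3 writes back, a == 2

        return a;
    }

    public static void Main()
    {
        // Three increments ran, yet a ends at 2 rather than 4.
        Console.WriteLine(Run());
    }
}
```

A single stale write at time 7 discards two earlier increments at once, which is why the shortfall can be far larger than the number of collisions.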
As you can see, as you add more threads, you have a greater potential to mix up the order in which the operations are being performed.
volatile ensures that each read and write of a is performed against the shared variable and is not reordered or cached by the compiler and runtime, but it doesn't do anything to make the operations atomic in this case (since you're performing three operations).
In this case, the value of a always ends up between 0 and 2,000,000 (four threads * 500,000 iterations per thread). Reads and writes of an int are themselves atomic in C#, with or without volatile, so a is never corrupted by a read and a write happening at the same time. What gets lost is increments, not the integrity of the value.
Because you haven't synchronized access to a for the entire increment operation, the results are unpredictable, as you have writes that are being overwritten (as seen in the previous example).
What's going on in your case?
For your specific case you have many writes that are being overwritten, not just a few. Since you have four threads each looping 500,000 times (two million increments in total), theoretically all the writes could be overwritten (expand the second example to four threads and then just add a few million rows to increment the loops).
While it's not really probable, there shouldn't be an expectation that you wouldn't drop a tremendous amount of writes.
Additionally, Task is an abstraction. In reality (assuming you are using the default scheduler), it uses the ThreadPool class to get threads to process your requests. The ThreadPool is ultimately shared with other operations (some internal to the CLR, even in this case), and even then it does things like work-stealing and using the current thread for operations, and at some point it drops down to the operating system to get a thread to perform work on.
Because of this, you can't assume that the skipped overwrites follow a random distribution; there is always a lot more going on that will throw whatever order you expect out the window. The order of processing is undefined, and the allocation of work will never be evenly distributed.
If you want to ensure that additions won't be overwritten, then you should use the Interlocked.Increment method in the DoStuff method, like so:
for (int i = 0; i < 500000; i++)
{
    Interlocked.Increment(ref a);
}
This will ensure that all writes will take place, and your output will be 2000000 twenty times (as per your loop).
It also invalidates the need for the volatile keyword, as you're making the operations you need atomic.
The volatile keyword is good when the operation that you need to make atomic is limited to a single read or write.
If you have to do anything more than a read or a write, then the volatile keyword is too granular, you need a more coarse locking mechanism.
In this case, it's Interlocked.Increment, but if you have more that you have to do, then the lock statement will more than likely be what you rely on.
I don't think it's anything else happening - it's just happening a lot. If you add 'locking' or some other synch technique (Best thread-safe way to increment an integer up to 65535) you'll reliably get the full 2,000,000 increments.
Each task is calling DoStuff() as you'd expect.
private static object locker = new object();

static void DoStuff()
{
    for (int i = 0; i < 500000; i++)
    {
        lock (locker)
        {
            a++;
        }
    }
}
Try increasing the amounts; the timespan is simply too short to draw any conclusions from. Remember that normal IO is in the range of milliseconds, and just one blocking IO operation in this case would render the results useless.
Something along the lines of this is better: (or why not intmax?)
static void DoStuff()
{
    for (int i = 0; i < 50000000; i++) // 50 000 000
        a++;
}
My results ("correct" being 200 000 000, i.e. 4 tasks * 50 000 000):
63838940
60811151
70716761
62101690
61798372
64849158
68786233
67849788
69044365
68621685
86184950
77382352
74374061
58356697
70683366
71841576
62955710
70824563
63564392
71135381
Not really a normal distribution, but we are getting there. Bear in mind that this is roughly 35% of the correct amount.
I can explain my results by the fact that I am running on 2 physical cores, seen as 4 because of hyperthreading. That means that if it is optimal to do a "ht-switch" during the actual addition, at least 50% of the additions will be "removed" (if I remember the implementation of ht correctly, i.e. modifying one thread's data in the ALU while loading/saving another thread's data). The remaining 15% is due to the program actually running on 2 cores in parallel.
My recommendations
post your hardware
increase the loop count
vary the TaskCount
hardware matters!
I have a Parallel.ForEach() loop that takes a list of URLs and downloads each of them for some additional processing. Outside my loop I have declared a loop counter variable, and inside the loop body I use Interlocked.Increment(), thinking this would be the best "thread safe" way of increasing the count as each loop iteration is performed.
int counter = 0;
Parallel.ForEach(urlList, (url, state) =>
{
    // various code statements
    Interlocked.Increment( ref counter );
    Debug.WriteLine(" ......... counter: " + counter);
});
I would have thought that I would see something similar to:
......... 1
......... 2
......... 3
......... 4
......... 5
.........
.........
......... n
But what I get instead is 16 " ......... 0" (this is because I have a dual quad core computer with 8 native cores, but hyper threading is enabled giving me a total of 16 cores). Then I will start to see the counter get incremented normally for the most part but sometimes I will see duplicate or even triplicate counter values in the Debug output.
Using a Parallel.ForEach() what is the best way to count loop iterations? Thanks for any advice.
Interlocked.Increment will return your incremented value.
So,
int counter = 0;
Parallel.ForEach(urlList, (url, state) =>
{
    // various code statements
    var counterNow = Interlocked.Increment( ref counter );
    Debug.WriteLine(" ......... counter: " + counterNow);
});
should return the counter value as it was incremented.
When you are running on a multi-processor/multi-core machine and have multiple application threads, you need to be aware that the values you see for variables may not reflect the actual current state of that variable.
This is because there may be many individual caches, one for each CPU, die or socket, and your thread can read the cached value (saving a hit to read main memory).
If you just want to read a value that is updated by multiple threads, you should use Interlocked.Read() to guarantee you have the current value for counter. Note that Interlocked.Read() takes a 64-bit value, so counter would have to be declared as long; for an int, Volatile.Read() serves the same purpose.
This is because the Increment + WriteLine together are not atomic.
It might be that thread1 increments counter, then thread2 increments it again, and then the two threads get to the WriteLine part with the same value of counter.
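To see that the value returned by Interlocked.Increment is unique per iteration, whereas a separate read of counter is not, one can collect the returned values and check them. The sketch below assumes a plain integer range in place of the URL list and uses a ConcurrentBag to gather results; the class name and Run helper are illustrative only.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class IterationCounterDemo
{
    // Runs itemCount parallel iterations and returns the counter values they observed, sorted.
    public static int[] Run(int itemCount)
    {
        int counter = 0;
        var observed = new ConcurrentBag<int>();
        Parallel.ForEach(Enumerable.Range(0, itemCount), item =>
        {
            // Use the value returned by the atomic increment, not a later read of counter.
            int counterNow = Interlocked.Increment(ref counter);
            observed.Add(counterNow);
        });
        return observed.OrderBy(v => v).ToArray();
    }

    public static void Main()
    {
        int[] values = Run(1000);
        Console.WriteLine(values.Distinct().Count());            // number of distinct counter values
        Console.WriteLine(values.First() + ".." + values.Last()); // smallest and largest value
    }
}
```

Each iteration gets its own number from 1 to itemCount, although the order in which iterations print or record them is still not sequential.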
I'm not sure, but I think this is probably due to the way variable captures work in lambdas. Have you tried putting it in a separate function, or moving the variable outside and declaring it as static?
I am trying to run the following program from a book.
The author claims that the resulting output "should be"
1000
2000
....
10000
if you run the program on a normal (single) processor, but that on a multiprocessor computer it could be
999
1998
...
9998
when using the normal increment (number += 1), but that using the interlocked increment as shown in the program solves the problem (i.e. you get the first output).
Now I have 3 questions.
First, why can't I use a normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]? Why has the author chosen the other method?
Second, what purpose does Thread.Sleep(1000) serve in this context? When I comment out this line, I get the second output even if I am using the Interlocked method to increment number.
Third, I get the correct output even when using the normal increment [number += 1] as long as I don't comment out the Thread.Sleep(1000) line, and the second output if I do.
I am running the program on an Intel(R) Core(TM) i7 Q820 CPU, if it makes any difference.
static void Main(string[] args)
{
    MyNum n = new MyNum();
    for (int a = 0; a < 10; a++)
    {
        for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
        {
            Thread t = new Thread(new ThreadStart(n.AddOne));
            t.Start();
        }
        Thread.Sleep(1000);
        Console.WriteLine(n.number);
    }
}

class MyNum
{
    public int number = 0;

    public void AddOne()
    {
        Interlocked.Increment(ref number);
    }
}
The sleep is easy--let the threads finish before you look at the result. It's not really a good answer, though--while they should finish in a second there is no guarantee they actually do.
The need for the interlocked increment in the MyNum class is clear--there are 1000 threads trying for the number, without protection it would be quite possible for one to read the number, then a second read it, then the first one put it back and then the second put it back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores, otherwise it can only happen if a thread switch hits at the wrong time.
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. The thread runs faster than it's created so they aren't running all at once.
Try:
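public void AddOne()
{
    // The two fibnocci calls cancel out; they only widen the gap between reading and writing number.
    number = number + fibnocci(20) + 1 - fibnocci(20);
}

private int fibnocci(int n)
{
    if (n < 3) return 1; else return fibnocci(n - 1) + fibnocci(n - 2);
}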
(I hope the optimizer isn't good enough to kill this extra code)
The code is actually pretty strange. Thread t is declared locally on each iteration, so the program keeps no reference to the threads it starts and has no way to wait for them afterwards (a started thread is not garbage collected while it is still running, though). Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, getting the same result is not guaranteed with number += 1. Two cores might read the same value and both increment it to the same result (i.e., 1001, 1001).
Lastly, I'm not sure whether you are running the program in debug mode. Building the program in release mode may give you different behavior and expose side effects that a multi-threaded program can exhibit.
If you comment out the Thread.Sleep line, there is a good chance that the threads will not finish before the print line. In that case you will see a number smaller than the "correct" output, but not because the increment wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see the collision.
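Several answers point out that Thread.Sleep(1000) only gives the threads time to finish and guarantees nothing. A more robust variation, which is not what the book's program does, is to keep references to the threads and Join them before reading the result. The following sketch assumes that change; MyNum is copied from the question, while JoinInsteadOfSleepDemo and Run are made-up names.

```csharp
using System;
using System.Threading;

public class MyNum
{
    public int number = 0;

    public void AddOne()
    {
        Interlocked.Increment(ref number); // atomic read-modify-write
    }
}

public static class JoinInsteadOfSleepDemo
{
    // Starts threadCount threads that each call AddOne once, waits for all of them,
    // and returns the final value of number.
    public static int Run(int threadCount)
    {
        var n = new MyNum();
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) // plain i++ is fine: only this thread touches i
        {
            threads[i] = new Thread(n.AddOne);
            threads[i].Start();
        }
        foreach (var thread in threads) thread.Join(); // block until every thread has finished
        return n.number;
    }

    public static void Main()
    {
        Console.WriteLine(Run(1000));
    }
}
```

With Join the result no longer depends on how long the threads happen to take, so any shortfall that remains after swapping in number += 1 can only come from lost increments.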