Parallel.For "Thread local state" - c#

MSDN
My question is: what does the third parameter of Parallel.For do?
When I change it to () => 1d, it doubles my result; set to 2d, it triples it, but it ignores the decimals.
Why does it ignore the decimals, if it is some sort of multiplier? What is really happening there?
I've now tried adding locks, and it does not simply initialize interimResult to the value specified.
Here is the code I'm using:
static void RunParallelForCorrectedAdam()
{
    object _lock = new object();
    double result = 0d;
    // Here we call same method several times.
    // for (int i = 0; i < 32; i++)
    Parallel.For(0, 32,
        // Func<TLocal> localInit,
        () => 3d,
        // Func<int, ParallelLoopState, TLocal, TLocal> body,
        (i, state, interimResult) =>
        {
            lock (_lock)
            {
                return interimResult + 1;
            }
        },
        // Final step after the calculations:
        // we add the result to the final result
        // Action<TLocal> localFinally
        (lastInterimResult) =>
        {
            lock (_lock)
            {
                result += lastInterimResult;
            }
        }
    );
    // Print the result
    Console.WriteLine("The result is {0}", result);
}

With () => 3d, result will be 32 + 3 * t, where t is the number of threads that were used. 3d is passed as interimResult to the first call to body within each thread.
The whole purpose of Parallel.For is to distribute the work on several threads. So interimResult + 1 is executed exactly 32 times (possibly on different threads). But each thread has to have some initial value for interimResult. That's the value that is returned by localInit.
So if the work is distributed across, e.g., two threads, each one does + 1 16 times and thus calculates 3 + 16. At the end, the partial results are summed, yielding 6 + 32.
In short, in this example, it doesn't make much sense for localInit to return something other than 0d.
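The arithmetic above can be checked directly. The following is a minimal sketch, not the asker's code: the class LocalInitDemo, the method Run and the counter initCalls are names made up for illustration. It counts how often localInit runs and compares the result with 32 + 3 * (number of localInit calls).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class LocalInitDemo
{
    // Runs the 32-iteration loop with the given seed and reports
    // the final result plus how many times localInit was invoked.
    public static (double Result, int InitCalls) Run(double seed)
    {
        object gate = new object();
        double result = 0d;
        int initCalls = 0; // hypothetical counter, added only for this demo

        Parallel.For(0, 32,
            // localInit: invoked once for each worker that takes part in the loop
            () =>
            {
                Interlocked.Increment(ref initCalls);
                return seed;
            },
            // body: invoked once per iteration, 32 times in total
            (i, state, interimResult) => interimResult + 1,
            // localFinally: invoked once per worker with its last interim result
            lastInterimResult =>
            {
                lock (gate) { result += lastInterimResult; }
            });

        return (result, initCalls);
    }

    static void Main()
    {
        var (result, initCalls) = Run(3d);
        Console.WriteLine($"localInit ran {initCalls} time(s); result = {result}");
        Console.WriteLine($"32 + 3 * {initCalls} = {32 + 3 * initCalls}");
    }
}
```

The two printed values should match on every run, while the number of localInit calls may vary from run to run and from machine to machine.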

My question is: The third parameter in the parallel.for, what does it do?
It's a Func that gets executed once per thread. If your loop requires a thread-local variable, this is where you initialize it.
EDIT:
Step by step:
(i, state, interimResult) => interimResult + 1,
Do you understand that interimResult is your local variable, the same one you initialized as 0d?

Related

C# parallel foreach does not give expected speedup

I am trying to find out why parallel foreach does not give the expected speedup on a machine with 32 physical cores and 64 logical cores with a simple test computation.
...
var parameters = new List<string>();
for (int i = 1; i <= 9; i++) {
    parameters.Add(i.ToString());
    if (Scenario.UsesParallelForEach)
    {
        Parallel.ForEach(parameters, parameter => {
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
            var lc = new LongComputation();
            lc.Compute();
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
        });
    }
    else
    {
        foreach (var parameter in parameters)
        {
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "started");
            var lc = new LongComputation();
            lc.Compute();
            FireOnParameterComputed(this, parameter, Thread.CurrentThread.ManagedThreadId, "stopped");
        }
    }
}
...
class LongComputation
{
    public void Compute()
    {
        var s = "";
        for (int i = 0; i <= 40000; i++)
        {
            s = s + i.ToString() + "\n";
        }
    }
}
The Compute function takes about 5 seconds to complete. My assumption was, that with the parallel foreach loop each additional iteration creates a parallel thread running on one of the cores and taking as much as it would take to compute the Compute function only once. So, if I run the loop twice, then with the sequential foreach, it would take 10 seconds, with the parallel foreach only 5 seconds (assuming 2 cores are available). The speedup would be 2. If I run the loop three times, then with the sequential foreach, it would take 15 seconds, but again with the parallel foreach only 5 seconds. The speedup would be 3, then 4, 5, 6, 7, 8, and 9. However, what I observe is a constant speedup of 1.3.
[Figure: Sequential vs parallel foreach. X-axis: number of sequential/parallel executions of the computation. Y-axis: time in seconds.]
[Figure: Speedup, i.e. time of the sequential foreach divided by time of the parallel foreach.]
The event fired in FireOnParameterComputed is intended to be used in a GUI progress bar to show the progress. In the progress bar it can clearly be seen that a new thread is created for each iteration.
My question is, why don't I see the expected speedup or at least close to the expected speedup?
Tasks aren't threads.
Sometimes starting a task will cause a thread to be created, but not always. Creating and managing threads consumes time and system resources. When a task only takes a short amount of time, even though it's counter-intuitive, the single-threaded model is often faster.
The CLR knows this and tries to make its best judgment on how to execute the task based on a number of factors including any hints that you've passed to it.
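For Parallel.ForEach, if you're certain that you want more threads to be used, try passing in ParallelOptions. Note that MaxDegreeOfParallelism sets an upper limit on the number of concurrent operations; it does not guarantee that that many threads are actually created.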
Parallel.ForEach(parameters, new ParallelOptions { MaxDegreeOfParallelism = 100 }, parameter => {});
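To see how MaxDegreeOfParallelism behaves as a cap, here is a minimal sketch. The class DegreeOfParallelismDemo, the method MeasurePeakConcurrency and the 10 ms sleep are all assumptions made for illustration; the sleep merely stands in for real work. It records the highest number of loop bodies that were running at the same time.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class DegreeOfParallelismDemo
{
    // Returns the highest number of loop bodies observed running concurrently.
    public static int MeasurePeakConcurrency(int limit, int items)
    {
        int running = 0;
        int peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = limit };

        Parallel.ForEach(Enumerable.Range(0, items), options, item =>
        {
            int now = Interlocked.Increment(ref running);

            // Raise 'peak' to 'now' if it is larger; retry when another thread
            // changed 'peak' between our read and our compare-exchange.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }

            Thread.Sleep(10); // placeholder for real work
            Interlocked.Decrement(ref running);
        });

        return peak;
    }

    static void Main()
    {
        int peak = MeasurePeakConcurrency(limit: 4, items: 40);
        Console.WriteLine($"Peak concurrency with a limit of 4: {peak}");
    }
}
```

The observed peak should never exceed the limit, but it may well be lower, since the scheduler decides how many workers to use.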

C# Tasks sums variable

I have the following tasks. They share the sum variable, and at the end the sum should be 9, but I get 3. Can you please help me fix it? Many thanks.
int sum = 0;
Task t1 = Task.Factory.StartNew(() =>
{
    sum = sum + Computation();
});
Task t2 = Task.Factory.StartNew(() =>
{
    sum = sum + Computation();
});
Task t3 = Task.Factory.StartNew(() =>
{
    sum = sum + Computation();
});
Task.WaitAll(t1, t2, t3);
Console.WriteLine($"The sum is {sum}");

private static int Computation()
{
    return 3;
}
It's because you're reading and writing the same variable from multiple threads at the same time.
Use Interlocked.Add from System.Threading, which performs the read, the addition and the write as one atomic operation, so no update is lost.
int sum = 0;
Task t1 = Task.Factory.StartNew(() =>
{
    Interlocked.Add(ref sum, Computation());
});
Task t2 = Task.Factory.StartNew(() =>
{
    Interlocked.Add(ref sum, Computation());
});
Task t3 = Task.Factory.StartNew(() =>
{
    Interlocked.Add(ref sum, Computation());
});
Task.WaitAll(t1, t2, t3);
Console.WriteLine($"The sum is {sum}");
You never tell your code to wait until task 't1' is finished before you start 't2', etc., so everything executes in parallel. Each thread reads the value in "sum" (initially 0) and adds 3, so 0 + 3 = 3. It then writes back the 3. So the code does exactly what you programmed it to do.
Galister explained how you could add locking (one side note on that answer: operations in a computer almost never happen at exactly the same moment ;) )
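A different way to avoid the race, not taken from the answers here, is to not share sum at all: each task returns its value and the caller adds the results up. The following is a minimal sketch under that assumption; the class TaskResultSum and the method SumOfThree are illustrative names, and Task.Run is used in place of Task.Factory.StartNew.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class TaskResultSum
{
    static int Computation() => 3;

    public static int SumOfThree()
    {
        // Each task returns its own value; nothing is shared between tasks.
        Task<int>[] tasks =
        {
            Task.Run(() => Computation()),
            Task.Run(() => Computation()),
            Task.Run(() => Computation()),
        };

        // Reading Result blocks until the task has completed,
        // so the sum is only computed from finished tasks.
        return tasks.Sum(t => t.Result);
    }

    static void Main()
    {
        Console.WriteLine($"The sum is {SumOfThree()}"); // prints "The sum is 9"
    }
}
```

Because the addition happens on a single thread after the tasks have finished, no synchronization primitive is needed.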
The Interlocked class is great when atomic operations are required, but if you care about performance, consider combining it with thread-local storage (TLS).
Parallel.For has overloads for exactly this; see the documentation.
Example:
int sum = 0;
Parallel.For(1, 3,
    () => 0, // The delegate that returns the initial thread-local state.
    (x, state, tls) => // The delegate that is invoked once per iteration.
    {
        tls += x;
        return tls;
    },
    partial => // The delegate that performs a final action on the local state of each task.
    {
        Interlocked.Add(ref sum, partial);
    });
Do consider that for small loops it does not matter; there will be no noticeable difference between using thread-local storage and using Interlocked on every iteration. For big loops it will make a difference, and using lock in big loops can cause serious overhead (blog):
This will potentially add a huge amount of overhead to our
calculation. Since we can potentially block while waiting on the lock
for every single iteration, we will most likely slow this down to
where it is actually quite a bit slower than our serial
implementation. The problem is the lock statement – any time you use
lock(object), you’re almost assuring reduced performance in a parallel
situation. When parallelizing a routine, try to avoid locks.
The idea is to reduce how often a lock on the sum variable is acquired. Without thread-local state, every task tries to acquire that lock on every single iteration. With thread-local storage, the sum variable is locked only about as many times as there are threads.
instead of synchronizing once per element (potentially millions of
times), you’ll only have to synchronize once per thread
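The difference in synchronization count can be made visible with a small experiment. This is a minimal sketch; the class SyncCountDemo, the method SumWithLocalState and the counter syncCalls are hypothetical names added for illustration. It sums the integers below count and reports how many times the shared variable was actually touched.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class SyncCountDemo
{
    // Sums 0 .. count-1 using thread-local partial sums and reports
    // how many synchronized updates of the shared total were needed.
    public static (long Sum, int SyncCalls) SumWithLocalState(int count)
    {
        long sum = 0;
        int syncCalls = 0; // demo-only counter of localFinally invocations

        Parallel.For(0, count,
            () => 0L,                        // each worker starts from 0
            (i, state, local) => local + i,  // no shared state touched here
            local =>
            {
                Interlocked.Add(ref sum, local);      // one synchronized add per worker
                Interlocked.Increment(ref syncCalls);
            });

        return (sum, syncCalls);
    }

    static void Main()
    {
        const int count = 1_000_000;
        var (sum, syncCalls) = SumWithLocalState(count);
        Console.WriteLine($"Sum = {sum}, expected {(long)count * (count - 1) / 2}");
        Console.WriteLine($"Synchronized updates: {syncCalls} (vs {count} with one Interlocked.Add per element)");
    }
}
```

The exact number of synchronized updates depends on how the loop was partitioned on a given run, but it is typically orders of magnitude smaller than the iteration count.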

parallel code causing strange results [duplicate]

This is almost my first attempt at parallel code (my first attempt worked fine and sped up some code), but the code below is causing strange issues and I can't see why. Both for loops below give the same result most of the time but not always, i.e. res != res1. The function IdealGasEnthalpy just calculates a number and does not change anything else. I can't figure out what the problem is or even where to begin to look. Does anyone have any suggestions?
double res = 0;
object lockObject = new object();
for (int I = 0; I < cc.Count; I++)
{
    res += IdealGasEnthalpy(T, cc[I], enumMassOrMolar.Molar) * x[I];
}
double res1 = 0;
Parallel.For(0, cc.Count, I =>
{
    res1 += IdealGasEnthalpy(T, cc[I], enumMassOrMolar.Molar) * x[I];
});
I tried the following code, but it's very slow and doubled the execution time for the whole program compared to the serial code.
double res = 0.0d;
Parallel.For(0, cc.Count,
    () => 0.0d,
    (x, loopState, partialResult) =>
    {
        return partialResult += IdealGasEnthalpy(T, cc[x], enumMassOrMolar.Molar) * X[x];
    },
    (localPartialSum) =>
    {
        lock (lockObject)
        {
            res += localPartialSum;
        }
    });
I also tried the version below. I'm going to stick to non-parallel code for this routine, as the parallel versions are all a lot slower...
double res = 0.0d;
double[] partialresult = new double[cc.Count];
Parallel.For(0, cc.Count, i =>
{
    partialresult[i] = IdealGasEnthalpy(T, cc[i], enumMassOrMolar.Molar) * X[i];
});
for (int i = 0; i < cc.Count; i++)
{
    res += partialresult[i];
}
Your second loop needs to perform the add atomically, because += is not atomic. Remember that it is shorthand for: read the variable, add to it, and store the result. There is a race condition where two reads of the same old value can occur before either thread has stored its new result. You need to synchronize access.
Note that, depending on how computationally expensive your function is, interlocking with the Parallel.For approach might be slower than just doing a serial approach. It comes down to how much time is spent calculating the value versus how much time is spent synchronizing and doing the summation.
Alternately you could store the results in an array which you allocate in advance, then do the summation after all parallel operations are done. That way no two operations modify the same variable. The array trades memory for speed, since you eliminate overhead from synchronization.
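Since res1 is a double, and Interlocked.Add only has integer overloads, one way to get an atomic add is a compare-exchange retry loop. The following is a minimal sketch, assuming no NaN values are involved (a NaN never compares equal to itself, so the loop would not terminate); the class AtomicDouble and its methods are illustrative names, not part of the framework.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class AtomicDouble
{
    // Atomically adds 'value' to 'target' and returns the new total.
    // Assumption: neither operand is NaN.
    public static double Add(ref double target, double value)
    {
        double seen, updated;
        do
        {
            seen = Volatile.Read(ref target);
            updated = seen + value;
            // CompareExchange stores 'updated' only if 'target' still equals 'seen';
            // it returns the value it found, so a mismatch means: retry.
        }
        while (Interlocked.CompareExchange(ref target, updated, seen) != seen);
        return updated;
    }

    // Adds 0.5 to a shared total 'count' times from a parallel loop.
    public static double SumHalves(int count)
    {
        double total = 0d;
        Parallel.For(0, count, i => Add(ref total, 0.5));
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumHalves(1000)); // prints 500
    }
}
```

This is correct but still synchronizes once per iteration, so for cheap loop bodies it can easily be slower than the serial loop, which matches the timing observations above.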

Parallel.Foreach with localFinally gets stalled despite completing all iterations

In my Parallel.ForEach loop, the localFinally delegate does not get called on all the threads.
I found this out because my parallel loop stalls.
In my parallel loop I have about three condition-check stages that return before the loop body completes. It seems that when a thread returns from one of these stages, rather than after executing the entire body, the localFinally delegate is not executed.
The Loop structure is as follows:
var startingThread = Thread.CurrentThread;
Parallel.ForEach(fullList, opt,
    () => new MultipleValues(),
    (item, loopState, index, loop) =>
    {
        if (cond 1)
            return loop;
        if (cond 2)
        {
            process(item);
            return loop;
        }
        if (cond 3)
            return loop;
        DoWork(item);
        return loop;
    },
    partial =>
    {
        // Log state of startingThread and threads
    });
I ran the loop on a small data set with detailed logging and found that Parallel.ForEach completes all the iterations, and the last log entry from localFinally is:
Calling Thread State is WaitSleepJoin for Thread 6 Loop Indx 16
The loop still does not complete gracefully and remains stalled. Any clues why it stalls?
Cheers!
I just did a quick test run after seeing the definition of localFinally (executed after each thread has finished), which made me suspect that far fewer threads might be created by the parallel loop than iterations executed, e.g.:
var test = new List<List<string>>();
for (int i = 0; i < 1000; i++)
{
    test.Add(null);
}
int finalcount = 0;
int itemcount = 0;
int loopcount = 0;
Parallel.ForEach(test, () => new List<string>(),
    (item, loopState, index, loop) =>
    {
        Interlocked.Increment(ref loopcount);
        loop.Add("a");
        //Thread.Sleep(100);
        return loop;
    },
    l =>
    {
        Interlocked.Add(ref itemcount, l.Count);
        Interlocked.Increment(ref finalcount);
    });
At the end of this loop, itemcount and loopcount were 1000 as expected, and (on my machine) finalcount was 1 or 2, depending on the speed of execution. In the situation with the conditions: when the body returns directly, execution is probably much faster and no extra threads are needed; only when DoWork is executed are more threads needed. However, the parameter (l in my case) contains the combined list of all executions on that thread.
Could this be the cause of the logging difference?
I think you just misunderstood what localFinally means. It's not called for each item, it's called for each thread that is used by Parallel.ForEach(). And many items can share the same thread.
The reason why it exists is that you can perform some aggregation independently on each thread, and join them together only in the end. This way, you have to deal with synchronization (and have it impact your performance) only in a very small piece of code.
For example, if you want to compute the sum of score for a collection of items, you could do it like this:
int totalSum = 0;
Parallel.ForEach(
    collection, item => Interlocked.Add(ref totalSum, ComputeScore(item)));
But here, you call Interlocked.Add() for every item, which can be slow. Using localInit and localFinally, you can rewrite the code like this:
int totalSum = 0;
Parallel.ForEach(
    collection,
    () => 0,
    (item, state, localSum) => localSum + ComputeScore(item),
    localSum => Interlocked.Add(ref totalSum, localSum));
Notice that the code uses Interlocked.Add() only in localFinally and does not access the global state in body. This way, the cost of synchronization is paid only a few times, once for each thread used.
Note: I used Interlocked in this example, because it is very simple and quite obviously correct. If the code was more complicated, I would use lock first, and try to use Interlocked only when it was necessary for good performance.
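For comparison, the same aggregation written with lock instead of Interlocked might look like the sketch below. ComputeScore is not defined in the answer, so the version here (the length of a string) is a made-up stand-in, and the class LockedAggregation and method TotalScore are likewise illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class LockedAggregation
{
    // Hypothetical scoring function, used only to make the sketch runnable.
    static int ComputeScore(string item) => item.Length;

    public static int TotalScore(IEnumerable<string> collection)
    {
        object gate = new object();
        int totalSum = 0;

        Parallel.ForEach(
            collection,
            () => 0,                                                  // localInit
            (item, state, localSum) => localSum + ComputeScore(item), // body: no shared state
            localSum =>
            {
                // localFinally: the only place where shared state is touched
                lock (gate)
                {
                    totalSum += localSum;
                }
            });

        return totalSum;
    }

    static void Main()
    {
        var items = new[] { "a", "bb", "ccc" };
        Console.WriteLine(TotalScore(items)); // prints 6
    }
}
```

Because the lock is taken only in localFinally, its cost is paid once per worker rather than once per item, so the simpler lock is unlikely to be a bottleneck here.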

C# How to use Interlocked.CompareExchange

My goal is the following:
There is a certain range of integers, and I have to test every integer in that range for something random. I want to use multiple threads for this, and divide the work equally among the threads using a shared counter. I set the counter at the beginning value, and let every thread take a number, increase it, do some calculations, and return a result. This shared counter has to be incremented with locks, because otherwise there will be gaps / overlaps in the range of integers to test.
I have no idea where to start. Let's say I want 12 threads to do the work, I do:
for (int t = 0; t < threads; t++)
{
    Thread thr = new Thread(new ThreadStart(startThread));
}
startThread() is the method I use for the calculations.
Can you help me on my way? I know I have to use the Interlocked class, but that's all...
Say you have an int field somewhere (initialized to -1); then:
int newVal = Interlocked.Increment(ref theField);
is a thread-safe increment; assuming you don't mind the (very small) risk of overflowing the upper int limit, then:
int next;
while ((next = Interlocked.Increment(ref theField)) <= upperInclusive) {
    // do item with index "next"
}
However, Parallel.For will do all of this a lot more conveniently:
Parallel.For(lowerInclusive, upperExclusive, i => DoWork(i));
or (to constrain to 12 threads):
var options = new ParallelOptions { MaxDegreeOfParallelism = 12 };
Parallel.For(lowerInclusive, upperExclusive, options, i => DoWork(i));
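Putting the manual variant together with the thread-creation loop from the question, a minimal sketch could look like this. The class SharedCounterDemo, the method ProcessRange and the hits array are assumptions made for illustration; the array only records how often each integer was handed out, in place of the real calculation.

```csharp
using System;
using System.Threading;

static class SharedCounterDemo
{
    // Distributes the integers lowerInclusive..upperInclusive over threadCount
    // threads via a shared counter and returns how often each one was processed.
    public static int[] ProcessRange(int lowerInclusive, int upperInclusive, int threadCount)
    {
        int[] hits = new int[upperInclusive - lowerInclusive + 1];
        int counter = lowerInclusive - 1; // the first Increment yields lowerInclusive

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                int next;
                while ((next = Interlocked.Increment(ref counter)) <= upperInclusive)
                {
                    // Placeholder for the real per-integer calculation.
                    Interlocked.Increment(ref hits[next - lowerInclusive]);
                }
            });
            threads[t].Start();
        }

        foreach (var thread in threads)
        {
            thread.Join(); // wait until every worker has run out of numbers
        }
        return hits;
    }

    static void Main()
    {
        int[] hits = ProcessRange(100, 10_099, threadCount: 12);
        bool exactlyOnce = Array.TrueForAll(hits, h => h == 1);
        Console.WriteLine($"Every integer processed exactly once: {exactlyOnce}");
    }
}
```

Each call to Interlocked.Increment hands out a distinct value, which is why there are no gaps or overlaps without any explicit lock.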
