I need one iteration of a parallel for loop to use 7 cores (i.e. stay off one core) while another iteration uses all 8 cores, and I tried the code below:
Parallel.For(0, 2, i =>
{
    if (i == 0)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)255;
    if (i == 1)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)254;
    Thread.Sleep(25); // to make sure both iterations have set their affinities
    Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
This outputs 255 for both iterations. So either the Parallel.For loop is using a single thread for both, or one iteration's setting overwrites the other's affinity too. Another problem: this is a latency-sensitive application, and all these affinity settings add 1 to 15 milliseconds of latency.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried a threaded version; the same thing happens. Even with two explicit threads, both write 255 to the console. Now it seems this setting applies to the whole process, not to a thread.
The OpenCL context uses the maximum number of cores for kernel execution on the CPU in one iteration. The other iterations use 1-2 cores to copy buffers and send commands to the devices. When the CPU is used by OpenCL, it takes all cores, and the devices cannot get enough time to copy buffers. Device fission seems harder than solving this issue, I think.
Different thread affinities in Parallel.For Iterations
The question is misleading, as it is based on the assumption that the Parallel API implies multiple threads. The Parallel API does perform data-parallel processing, but it doesn't guarantee that multiple threads will be invoked, especially for the code provided above, where there's hardly any work per iteration.
For the Parallel API, you can set the maximum degree of parallelism, as follows:
ParallelOptions parallelOption = new ParallelOptions();
parallelOption.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, 20, parallelOption, i =>
{
    // loop body
});
But that never guarantees the number of threads that will be invoked for the parallel processing, since the threads come from the ThreadPool and the CLR decides at run time, based on the amount of work to be processed, whether more than one thread is needed.
In the same Parallel loop, try printing Thread.CurrentThread.ManagedThreadId; this will give you a clear idea of the number of threads being invoked in the Parallel loop.
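For example, a minimal sketch (assuming using System.Threading) that logs the worker thread per iteration:
Parallel.For(0, 2, i =>
{
    // Each iteration reports the ThreadPool thread that ran it.
    Console.WriteLine("i = {0}, thread = {1}", i, Thread.CurrentThread.ManagedThreadId);
});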
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried a threaded version; the same thing happens. Even with two explicit threads, both write 255 to the console. Now it seems this setting applies to the whole process, not to a thread.
Can you post the code for the multiple threads? Can you try something like this:
Thread[] threadArray = new Thread[2];
threadArray[0] = new Thread(<ThreadDelegate>);
threadArray[1] = new Thread(<ThreadDelegate>);
threadArray[0].ProcessorAffinity = <Set Processor Affinity>;
threadArray[1].ProcessorAffinity = <Set Processor Affinity>;
Assuming you assign the affinities correctly, you should be able to print them and see different values; check ProcessThread.ProcessorAffinity for the details.
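Note, though, that the managed Thread class has no ProcessorAffinity member; affinity lives on the OS thread, which is exposed through ProcessThread. Below is a hedged sketch of applying a mask to the OS thread behind the current managed thread. PinCurrentOsThread is a name invented for this example; it assumes the default CLR host (where a managed thread stays on one OS thread) and requires using System.Diagnostics and System.Runtime.InteropServices:
[DllImport("kernel32.dll")]
static extern uint GetCurrentThreadId();

static void PinCurrentOsThread(long mask)
{
    uint osThreadId = GetCurrentThreadId(); // OS id, not the managed thread id
    foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
    {
        if (pt.Id == osThreadId)
            pt.ProcessorAffinity = (IntPtr)mask; // e.g. 0xFE to keep this thread off CPU 0
    }
}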
On another note, as you can see in the link above, the affinity value is a bitmask, usually written in hexadecimal: 255 (0xFF) enables all 8 logical processors, while 254 (0xFE) excludes processor 0. So the values 254 and 255 denote masks, not processor counts.
EDIT:
Try the following edit to your program (based on the fact that two thread ids are getting printed). By the time both threads come into the picture, they may both see the same value of the variable i, so they need a local copy to avoid the closure issue:
Parallel.For(0, 2, i =>
{
    int local = i;
    if (local == 0)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)255;
    if (local == 1)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)254;
    Thread.Sleep(25); // to make sure both set their affinities
    Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
EDIT 2: (This would mostly not work, as both threads might increment before the actual logic executes.)
int local = -1;
Parallel.For(0, 2, i =>
{
    Interlocked.Increment(ref local);
    if (local == 0)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)255;
    if (local == 1)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)254;
    Thread.Sleep(25); // to make sure both set their affinities
    Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
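For what it's worth, the race this EDIT mentions can be avoided by using the value returned by Interlocked.Increment, which is unique to each caller, rather than re-reading the shared variable. A sketch (note that ProcessorAffinity is still process-wide, as the question's edit discovered):
int counter = -1;
Parallel.For(0, 2, i =>
{
    int slot = Interlocked.Increment(ref counter); // 0 for one worker, 1 for the other
    if (slot == 0)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)255;
    if (slot == 1)
        Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)254;
});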
Related
There is a C# function A(arg1, arg2) which needs to be called lots of times. To do this as fast as possible, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
    for (int i = 0; i < totalCalls / threads; i++)
    {
        // init arg1 and arg2
        var value = A(arg1, arg2);
        // do something with value
    }
});
Now the issue is that this does not scale with the number of cores; e.g. on 8 cores it uses 80% of the CPU, and on 16 cores it uses only 40-50%. I want to use the CPU to the maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it doesn't have any IO or network-bound operations, and also there is no thread locking. What are other possibilities to find out which part of the code is making it not perform in a 100% parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1 - if I run the same code with A() replaced by a simple function that calculates prime numbers, it utilizes 100% CPU and scales up well. So this proves that the surrounding code is correct. Now the issue must be within the original function A(), and I need a way to detect whatever is causing this implicit sequencing.
You have determined that the code in A is the problem.
There is one very common problem: garbage collection. Configure your application in app.config to use the concurrent server GC. The workstation GC tends to serialize execution; the effect is severe.
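For reference, the relevant app.config runtime settings look like this (these are the standard .NET Framework configuration elements):
<configuration>
  <runtime>
    <gcServer enabled="true"/>
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>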
If this is not the problem, pause the debugger a few times and look at the Debug -> Parallel Stacks window. There you can see what your threads are doing. Look for common resources and contention; for example, if you find many threads waiting on a lock, that's your problem.
Another nice debugging technique is commenting out code: once the scalability limit disappears, you know which code caused it.
I just did a sample for multithreading using This Link, like below:
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
It gives me 15 threads before Parallel.For, and after it gives me only 17 threads. So only 2 threads are occupied by Parallel.For.
Then I created another sample using This Link, like below:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Console.WriteLine("MaxDegreeOfParallelism : {0}", Environment.ProcessorCount * 10);
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
In the above code I set MaxDegreeOfParallelism to 40, but Parallel.For is still using the same number of threads.
So how can I increase the number of running threads for Parallel.For?
I am facing a problem where some numbers are skipped inside the Parallel.For when I perform some heavy and complex functionality inside it. So I want to increase the maximum number of threads and get around the skipping issue.
What you're saying is something like: "My car is shaking when driving too fast. I'm trying to avoid this by driving even faster." That doesn't make any sense. What you need is to fix the car, not change the speed.
How exactly to do that depends on what are you actually doing in the loop. The code you showed is obviously placeholder, but even that's wrong. So I think what you should do first is to learn about thread safety.
Using a lock is one option, and it's the easiest one to get correct. But it's also hard to make it efficient. What you need is to lock only for a short amount of time each iteration.
There are other options how to achieve thread safety, including using Interlocked, overloads of Parallel.For that use thread-local data and approaches other than Parallel.For(), like PLINQ or TPL Dataflow.
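For instance, here is a minimal sketch of the thread-local overload of Parallel.For, counting safely while synchronizing only once per worker thread instead of once per iteration:
long total = 0;
Parallel.For(0, 50000,
    () => 0L,                                    // localInit: a private running total per worker
    (i, state, local) => local + 1,              // body: works on the private total, no locking
    local => Interlocked.Add(ref total, local)); // localFinally: merge each worker's total once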
Only after you have made sure your code is thread-safe is it time to worry about things like the number of threads. Regarding that, I think there are two things to note:
For CPU-bound computations, it doesn't make sense to use more threads than the number of cores your CPU has. Using more threads than that will usually lead to slower code, since switching between threads has some overhead.
I don't think you can measure the number of threads used by Parallel.For() like that. Parallel.For() uses the thread pool and it's quite possible that there already are some threads in the pool before the loop begins.
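A more direct measurement, for what it's worth, is to record the distinct managed thread ids the loop body actually runs on (assuming using System.Collections.Concurrent and System.Threading):
var workerIds = new ConcurrentDictionary<int, bool>();
Parallel.For(0, 50000, i =>
{
    // Remember every thread that executed at least one iteration.
    workerIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
});
Console.WriteLine("Distinct worker threads: {0}", workerIds.Count);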
Parallel loops use hardware CPU cores. If your CPU has 2 cores, that is the maximum degree of parallelism you can get on your machine.
Taken from MSDN:
What to Expect
By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the number of available cores. In typical scenarios, the more cores you have, the faster your loop executes, until you reach the point of diminishing returns that Amdahl's Law predicts. How much faster depends on the kind of work your loop does.
Further reading:
Threading vs Parallelism, how do they differ?
Threading vs. Parallel Processing
Parallel loops will give you wrong results for summation operations without locks, because the result of each iteration depends on the single shared variable count, and its value inside a parallel loop is not predictable. However, using locks in a parallel loop defeats the actual parallelism, so you should test the parallel loop with something other than a shared summation.
I have a question regarding thread ordering for the TPL.
Indeed, it is very important to me that my Parallel.For loop executes in loop order. What I mean is that, given 4 threads, I would like the first thread to execute every (4k)th iteration, the second thread every (4k+1)th, etc., with k between 0 and NbSim/4:
1st thread -> 1st loop, 2nd thread -> 2nd loop, 3rd thread -> 3rd loop,
4th thread -> 4th loop, 1st thread -> 5th loop, etc.
I have seen the OrderedPartition directive, but I am not quite sure how to apply it to a For loop rather than a Parallel.ForEach loop.
Many thanks for your help.
Following the previous remarks, I am completing the description:
Actually, after some consideration, I believe that my problem is not about ordering.
Indeed, I am working on a Monte-Carlo engine, in which for each iteration I am generating a set of random numbers (always the same (seed =0)) and then apply some business logic to them. Thus everything should be deterministic and when running the algorithm twice I should get the exact same results. But unfortunately this is not the case, and I am strugeling to understand why. Any idea, on how to solve that kind of problems (without printing out every variable I have)?
Edit Number 2:
Thank you all for your suggestions
First, here is how my code is structured:
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 4; // or 1
ParallelLoopResult res = Parallel.For<LocalDataStruct>(1, NbSim, options,
    () => new LocalDataStruct(/* params of the constructor of LocalData */),
    (iSim, loopState, localDataStruct) => {
        // logic
        return localDataStruct;
    },
    localDataStruct => {
        lock (syncObject) {
            // critical section for outputting the parameters
        }
    });
When setting MaxDegreeOfParallelism to 1, everything works fine; however, when setting it to 4, I get results that are wrong and non-deterministic (running the code twice gives different results). It is probably due to mutable objects, which is what I am checking now, but the source code is quite extensive, so it takes time. Do you think there is a good strategy to check the code other than reviewing it (printing out all the variables is impossible in this case, there are > 1000)? Also, when setting the number of simulations to 4 for 4 threads, everything works fine as well, mostly due to luck I believe (that's why I mentioned my first idea regarding ordering).
You can enforce ordering in PLINQ but it comes at a cost. It gives ordered results but does not enforce ordering of execution.
You really cannot do this with TPL without essentially serializing your algorithm. The TPL works on a Task model. It allows you to schedule tasks which are executed by the scheduler with no guarantee as to the order in which the Tasks are executed. Typically parallel implementations take the PLINQ approach and guarantee ordering of results not ordering of execution.
Why is ordered execution important?
So, for a Monte-Carlo engine you would need to make sure that each index in your array receives the same random numbers. This does not mean that you need to order your threads, just that the random numbers are ordered across the work done by each thread. So if each iteration of your Parallel.ForEach were passed not only the array elements to work on but also its own instance of a random number generator (with a different fixed seed per thread), then you would still get deterministic results.
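Here is a minimal sketch of that idea, seeding per index rather than per thread so the thread-to-work assignment stops mattering. RunSimulation is a hypothetical stand-in for the business logic, and System.Random is used only for illustration (as noted below, it is a poor generator for M-C work):
double[] results = new double[NbSim];
Parallel.For(0, NbSim, i =>
{
    var rng = new Random(i);         // fixed, deterministic seed per index
    results[i] = RunSimulation(rng); // each index always consumes the same sequence
});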
I'm assuming that you are familiar with the challenges related to parallelizing Monte-Carlo and generating good random number sequences. If not, here's something to get you started: Pseudo-random Number Generation for Parallel Monte Carlo—A Splitting Approach and Fast, High-Quality, Parallel Random-Number Generators: Comparing Implementations.
Some suggestions
I would start by ensuring that you get deterministic results in the sequential case, by replacing the Parallel.ForEach with a plain foreach and seeing if it runs correctly. You could also compare the output of a sequential and a parallel run: add some diagnostic output, pipe it to a text file, and use a diff tool to compare the results.
If this is OK, then it is something to do with your parallel implementation, which, as pointed out below, is usually related to mutable state. Some things to consider:
Is your random number generator thread-safe? Random is a poor random number generator at best and, as far as I know, is not designed for parallel execution. It is certainly not suitable for M-C calculations, parallel or otherwise.
Does your code have other state shared between threads? If so, what is it? This state will be mutated in a non-deterministic manner and affect your results.
Are you merging results from different threads in parallel? The non-associativity of parallel floating-point operations will also cause you issues here; see How can floating point calculations be made deterministic?. Even if the per-thread results are deterministic, combining them in a non-deterministic order will still cause issues, as the snippet below demonstrates.
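As a quick illustration of the non-associativity, the sum of the three values below is 1 or 0 depending purely on the order of the additions:
double a = 1e16, b = -1e16, c = 1.0;
Console.WriteLine((a + b) + c); // prints 1
Console.WriteLine(a + (b + c)); // prints 0: c is absorbed by the rounding of b + c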
Assuming all threads share the same random number generator, then although you are generating the same sequence every time, which thread gets which elements of this sequence is non-deterministic. Hence you could arrive at different results.
That's if the random number generator is thread-safe; if it isn't, then it's not even guaranteed to generate the same sequence when called from multiple threads.
Apart from that it is difficult to theorize what could be causing non-determinism to arise; basically any global mutable state is suspicious. Each Task should be working with its own data.
If, rather than using a random number generator, you set up an array [0...N-1] of pre-determined values, say [0, 1/N, 2/N, ...], and do a Parallel.ForEach on that, does it still give nondeterministic results? If so, the RNG isn't the issue.
Background
Historically, I've not had to write too much thread related code. I am familiar with the process and I can use System.Threading to do what I need to do when the need arises. However, the bulk of my experience has been stuck in .Net 2.0 world (ugh!) and I need a bit of help executing a very simple task in the most recent version of .Net.
I need to write a simple, parallel job but I would prefer it to execute as a low-priority task that doesn't bring my PC to a sluggish halt.
Below is a simple parallel example that demonstrates the type of work that I'm trying to accomplish. In this example, I take a random number and I store it if the new value is larger than the last, largest random number that has been discovered.
Please understand, the point of this example is to strictly show that I have a calculation that I wish to repeatedly execute, compare and store. I understand the need for variable locking and I also know that this code isn't perfect. However, it is an extremely simple example of the nature of what I need to accomplish. If you run this code, your CPU will come to a grinding halt.
int i = 0;
ThreadSafeRNG r = new ThreadSafeRNG();
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // Assume I have 8 cores.
Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
{
    int k = r.Next();
    if (k > i) i = k; // keep the largest value seen so far
});
How can I do this work in parallel and have it execute as a low-priority CPU job? The above mechanism has provided tremendous performance improvements over a standard for-loop. However, the cost has been that I can't use my computer, even after setting the MaxDegreeOfParallelism option.
What can I do?
Because Parallel.For uses ThreadPool threads, you cannot set the priority of the threads to be low. However, you can change the priority of the entire application:
Process currentProcess = Process.GetCurrentProcess();
ProcessPriorityClass oldPriority = currentProcess.PriorityClass;
try
{
    currentProcess.PriorityClass = ProcessPriorityClass.BelowNormal;

    int i = 0;
    ThreadSafeRNG r = new ThreadSafeRNG();
    var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // Assume I have 8 cores.
    Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
    {
        int k = r.Next();
        if (k > i) i = k;
    });
}
finally
{
    // Bring the priority back up to the original level.
    currentProcess.PriorityClass = oldPriority;
}
If you really don't want your whole application's priority to be lowered, combine this trick with some form of IPC, like WCF, and have your slow, long-running operation run in a second process that you start up and kill as needed.
Using the Parallel.For method, you cannot set thread priority since they are ThreadPool threads.
Your best bet is to set ParallelOptions.MaxDegreeOfParallelism to ProcessorCount - 1; this way you have a free core to do other things (such as handle the GUI).
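A minimal sketch of that suggestion (the Math.Max guards the single-core case):
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
};
Parallel.For(0, Int32.MaxValue, options, (j, loopState) =>
{
    // work here, leaving one core free for the rest of the system
});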
http://msdn.microsoft.com/en-us/library/system.threading.semaphoreslim.aspx
To create a semaphore, I need to provide an initial count and a maximum count. MSDN states that the initial count is
The initial number of requests for the semaphore that can be granted concurrently.
while it states that the maximum count is
The maximum number of requests for the semaphore that can be granted concurrently.
I can understand that the maximum count is the maximum number of threads that can access a resource concurrently, but what is the use of the initial count?
If I create a semaphore with an initial count of 0 and a maximum count of 2, none of my threadpool threads are able to access the resource. If I set the initial count to 1 and the maximum count to 2, then only one thread pool thread can access the resource. It is only when I set both the initial count and the maximum count to 2 that 2 threads are able to access the resource concurrently. So, I am really confused about the significance of initial count?
SemaphoreSlim semaphoreSlim = new SemaphoreSlim(0, 2); //all threadpool threads wait
SemaphoreSlim semaphoreSlim = new SemaphoreSlim(1, 2);//only one thread has access to the resource at a time
SemaphoreSlim semaphoreSlim = new SemaphoreSlim(2, 2);//two threadpool threads can access the resource concurrently
So, I am really confused about the significance of initial count?
One important point that may help here is that Wait decrements the semaphore count and Release increments it.
initialCount is the number of resource accesses that will be allowed immediately. Or, in other words, it is the number of times Wait can be called without blocking immediately after the semaphore was instantiated.
maximumCount is the highest count the semaphore can reach. It is the number of times Release can be called without throwing an exception, assuming initialCount was zero. If initialCount is set to the same value as maximumCount, then calling Release immediately after the semaphore is instantiated will throw an exception.
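A short sketch of that failure mode:
var sem = new SemaphoreSlim(2, 2); // initialCount == maximumCount
sem.Wait();     // count: 2 -> 1
sem.Release();  // count: 1 -> 2
sem.Release();  // throws SemaphoreFullException: the count would exceed maximumCount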
Yes, when the initial count is set to 0, all threads will wait until you increment the CurrentCount property. You can do that with Release() or Release(Int32).
Release(...) - will increment the semaphore counter
Wait(...) - will decrement it
You can't increment the counter (the CurrentCount property) beyond the maximum count you set at initialization.
For example:
SemaphoreSlim s = new SemaphoreSlim(0, 2); // s.CurrentCount = 0
s.Release(2); // s.CurrentCount = 2
...
s.Wait(); // OK. s.CurrentCount = 1
...
s.Wait(); // OK. s.CurrentCount = 0
...
s.Wait(); // Will block until one of the threads calls Release()
How many threads do you want to be able to access the resource at once? Set your initial count to that number. If that number will never increase throughout the life of the program, set your max count to that number too. That way, if you have a programming error in how you release the resource, your program will crash and let you know.
(There are two constructors: one that takes only an initial value, and one that additionally takes the max count. Use whichever is appropriate.)
Normally, when the SemaphoreSlim is used as a throttler, both initialCount and maxCount have the same value:
var semaphore = new SemaphoreSlim(maximumConcurrency, maximumConcurrency);
...and the semaphore is used with this pattern:
await semaphore.WaitAsync(); // or semaphore.Wait();
try
{
// Invoke the operation that must be throttled
}
finally
{
semaphore.Release();
}
The initialCount configures the maximum concurrency policy, and the maxCount ensures that this policy will not be violated. If you omit the second argument (the maxCount), your code will work just as well, provided that there are no bugs in it. If there is a bug, and a WaitAsync could be followed by more than one Release, then the maxCount will help detect this bug before it ends up in the released version of your program. The bug will surface as a SemaphoreFullException, hopefully during the testing of a pre-release version, and so you'll be able to track and eliminate it before it does any real harm (before it has caused a violation of the maximum concurrency policy in a production environment).
The default value of the maxCount argument, in case you omit it, is Int32.MaxValue (source code).
If you want no thread to access your resource for some time, you pass an initial count of 0; when you wish to grant access to all of them just after creating the semaphore, you pass an initial count equal to the maximum count. For example:
hSemaphore = CreateSemaphoreA(NULL, 0, MAX_COUNT, NULL);
// Do something here
// No threads can access your resource
ReleaseSemaphore(hSemaphore, MAX_COUNT, 0);
// All threads can access the resource now
As quoted in MSDN Documentation- "Another use of ReleaseSemaphore is during an application's initialization. The application can create a semaphore with an initial count of zero. This sets the semaphore's state to nonsignaled and blocks all threads from accessing the protected resource. When the application finishes its initialization, it uses ReleaseSemaphore to increase the count to its maximum value, to permit normal access to the protected resource."
This way, the thread that creates the semaphore can claim some resources from the start.
Think of it like this:
initialCount is the "degree of parallelism" (number of threads that can enter)
maxCount ensures that you don't Release more than you should
For example, say you want a concurrency degree of 1 (only one operation at a time). But then, due to some bug in your code, you release the semaphore twice. Now you have a concurrency degree of two!
But if you set maxCount, it will not allow this and will throw an exception instead.
maxCount is the number of concurrent threads that you're going to be allowing.
However, when you start the throttling, you may already know there are a few active threads, so you'd want to tell it "hey, I want to have 6 concurrent threads, but I already have 4, so I want you to only allow 2 more for now", so you'd set initialCount to 2 and maxCount to 6.
The limitation with initialCount in SemaphoreSlim is that it cannot be a negative number, so you can't say "hey, I want to have up to 6 concurrent threads, but I currently have 10, so let 5 get released before you allow another one in.". That would mean an initialCount of -4. For that you'd need to use a 3rd party package like SemaphoreSlimThrottling (note that I am the author of SemaphoreSlimThrottling).
As MSDN explains it under the Remarks section:
If initialCount is less than maximumCount, the effect is the same as if the current thread had called WaitOne (maximumCount minus initialCount) times. If you do not want to reserve any entries for the thread that creates the semaphore, use the same number for maximumCount and initialCount.
So if the initial count is 0 and the max is 2, it is as if WaitOne had been called twice by the main thread, so we have reached capacity (the semaphore count is now 0) and no thread can enter the semaphore. Similarly, if the initial count is 1 and the max is 2, WaitOne has been called once, and only one thread can enter before we reach capacity again, and so on.
If 0 is used for the initial count, we can always call Release(2) to raise the semaphore count to the max, allowing the maximum number of threads to acquire the resource.
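In code, that startup pattern looks like this:
var sem = new SemaphoreSlim(0, 2);
Console.WriteLine(sem.CurrentCount); // 0: any Wait() would block here
sem.Release(2);                      // raise the count to its maximum
Console.WriteLine(sem.CurrentCount); // 2: up to two threads may now enter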
Semaphores can be used to protect a pool of resources. We use resource pools to reuse things that are expensive to create - such as database connections.
So the initial count refers to the number of available resources in the pool at the start of some process. When you read the initialCount in code, you should be thinking in terms of how much up-front effort you are putting into creating this pool of resources.
I am really confused about the significance of initial count?
Initial count = Upfront cost
As such, depending on the usage profile of your application, this value can have a dramatic effect on performance. It's not just some arbitrary number.
You should think carefully about what you are creating, how expensive the items are to create, and how many you need right away. You should literally be able to graph the optimal value for this parameter, and you should likely consider making it configurable, so you can adapt the performance of the process to the time at which it is being executed.