I have just did a sample for multithreading using This Link like below:
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
It gives me 15 thread before Parellel.For and after it gives me 17 thread only. So only 2 thread is occupy with Parellel.For.
Then I have created a another sample code using This Link like below:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Console.WriteLine("MaxDegreeOfParallelism : {0}", Environment.ProcessorCount * 10);
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
In above code, I have set MaxDegreeOfParallelism where it sets 40 but is still taking same threads for Parallel.For.
So how can I increase running thread for Parallel.For?
I am facing a problem that some numbers is skipped inside the Parallel.For when I perform some heavy and complex functionality inside it. So here I want to increase the maximum thread and override the skipping issue.
What you're saying is something like: "My car is shaking when driving too fast. I'm trying to avoid this by driving even faster." That doesn't make any sense. What you need is to fix the car, not change the speed.
How exactly to do that depends on what are you actually doing in the loop. The code you showed is obviously placeholder, but even that's wrong. So I think what you should do first is to learn about thread safety.
Using a lock is one option, and it's the easiest one to get correct. But it's also hard to make it efficient. What you need is to lock only for a short amount of time each iteration.
There are other options how to achieve thread safety, including using Interlocked, overloads of Parallel.For that use thread-local data and approaches other than Parallel.For(), like PLINQ or TPL Dataflow.
After you made sure your code is thread safe, only then it's time to worry about things like the number of threads. And regarding that, I think there are two things to note:
For CPU-bound computations, it doesn't make sense to use more threads than the number of cores your CPU has. Using more threads than that will actually usually lead to slower code, since switching between threads has some overhead.
I don't think you can measure the number of threads used by Parallel.For() like that. Parallel.For() uses the thread pool and it's quite possible that there already are some threads in the pool before the loop begins.
Parallel loops use hardware CPU cores. If your CPU has 2 cores, this is the maximum degree of paralellism that you can get in your machine.
Taken from MSDN:
What to Expect
By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the
number of available cores. In typical scenarios, the more cores you
have, the faster your loop executes, until you reach the point of
diminishing returns that Amdahl's Law predicts. How much faster
depends on the kind of work your loop does.
Further reading:
Threading vs Parallelism, how do they differ?
Threading vs. Parallel Processing
Parallel loops will give you wrong result for summation operations without locks as result of each iteration depends on a single variable 'Count' and value of 'Count' in parallel loop is not predictable. However, using locks in parallel loops do not achieve actual parallelism. so, u should try something else for testing parallel loop instead of summation.
Related
Here is the code:
using (var context = new AventureWorksDataContext())
{
IEnumerable<Customer> _customerQuery = from c in context.Customers
where c.FirstName.StartsWith("A")
select c;
var watch = new Stopwatch();
watch.Start();
var result = Parallel.ForEach(_customerQuery, c => Console.WriteLine(c.FirstName));
watch.Stop();
Debug.WriteLine(watch.ElapsedMilliseconds);
watch = new Stopwatch();
watch.Start();
foreach (var customer in _customerQuery)
{
Console.WriteLine(customer.FirstName);
}
watch.Stop();
Debug.WriteLine(watch.ElapsedMilliseconds);
}
The problem is, Parallel.ForEach takes about 400ms vs a regular foreach, which takes about 40ms. What exactly am I doing wrong and why doesn't this work as I expect it to?
Suppose you have a task to perform. Let's say you're a math teacher and you have twenty papers to grade. It takes you two minutes to grade a paper, so it's going to take you about forty minutes.
Now let's suppose that you decide to hire some assistants to help you grade papers. It takes you an hour to locate four assistants. You each take four papers and you are all done in eight minutes. You've traded 40 minutes of work for 68 total minutes of work including the extra hour to find the assistants, so this isn't a savings. The overhead of finding the assistants is larger than the cost of doing the work yourself.
Now suppose you have twenty thousand papers to grade, so it is going to take you about 40000 minutes. Now if you spend an hour finding assistants, that's a win. You each take 4000 papers and are done in a total of 8060 minutes instead of 40000 minutes, a savings of almost a factor of 5. The overhead of finding the assistants is basically irrelevant.
Parallelization is not free. The cost of splitting up work amongst different threads needs to be tiny compared to the amount of work done per thread.
Further reading:
Amdahl's law
Gives the theoretical speedup in latency of the execution of a task at fixed workload, that can be expected of a system whose resources are improved.
Gustafson's law
Gives the theoretical speedup in latency of the execution of a task at fixed execution time, that can be expected of a system whose resources are improved.
The first thing you should realize is that not all parallelism is beneficial. There is an amount of overhead to parallelism, and this overhead may or may not be significant depending on the complexity what is being parallelized. Since the work in your parallel function is very small, the overhead of the management the parallelism has to do becomes significant, thus slowing down the overall work.
The additional overhead of creating all the threads for your enumerable VS just executing the numerable is more than likely the cause for the slowdown. Parallel.ForEach is not a blanket performance increasing move; it needs to be weighed whether or not the operation that is to be completed for each element is likely to block.
For example, if you were to make a web request or something instead of simply writing to the console, the parallel version might be faster. As it is, simply writing to the console is a very fast operation, so the overhead of creating the threads and starting them is going to be slower.
As previous writer has said there are some overhead associated with Parallel.ForEach, but that is not why you can't see your performance improvement. Console.WriteLine is a synchronous operation, so only one thread is working at a time. Try changing the body to something non-blocking and you will see the performance increase (as long as the amount of work in the body is big enough to outweight the overhead).
I like salomons answer and would like to add that you also have additional overhead of
Allocating delegates.
Calling through them.
I need an iteration of a parallel for loop to use 7 cores(or stay away from 1 core) but another iteration to use 8(all) cores and tried below code:
Parallel.For(0,2,i=>{
if(i=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(i==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
this outputs 255 for both iterations. So either parallel.for loop is using single thread for them or one setting sets other iterations affinity too. Another problem is, this is from a latency sensitive application and all this affinity settings add 1 to 15 milliseconds latency.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried threaded version, same thing happens. Even with explicit two threads, both writing 255 to console. Now it seems this command is for a process not a thread.
OpenCL context is using max cores for kernel execution on cpu in one iteration. Other iterations using 1-2 cores to copy buffers and send command to devices. When cpu is used by opencl, it uses all cores and devices cannot get enough time to copy buffers. Device fission seems to be harder than solving this issue I hink.
Different thread affinities in Parallel.For Iterations
Question is misleading, as it is based on assumption that Parallel API means multiple threads. Parallel API does refer to Data Parallel processing, but doesn't provide any guarantee for invoking multiple threads, especially for the code provided above, where there's hardly any work per thread.
For the Parallel API, you can set the Max degree of Parallelism, as follows:
ParallelOptions parallelOption = new ParallelOptions();
parallelOption.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, 20, parallelOption, i =>
But that never guarantee the number of threads that would be invoked to parallel processing, since Threads are used from the ThreadPool and CLR decides at run-time, based on amount of work to be processed, whether more than one thread is required for the processing.
In the same Parallel loop can you try the following, print Thread.Current.ManageThreadId, this would provide a clear idea, regarding the number of threads being invoked in the Parallel loop.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried threaded version, same thing happens. Even with explicit two threads, both writing 255 to console. Now it seems this command is for a process not a thread.
Can you post the code, for multiple threads, can you try something like this.
Thread[] threadArray = new Thread[2];
threadArray[0] = new Thread(<ThreadDelegate>);
threadArray[1] = new Thread(<ThreadDelegate>);
threadArray[0]. ProcessorAffinity = <Set Processor Affinity>
threadArray[1]. ProcessorAffinity = <Set Processor Affinity>
Assuming you assign the affinity correctly, you can print them and find different values, check the following ProcessThread.ProcessorAffinity.
On another note as you could see in the link above, you can set the value in hexadecimal based on processor affinity, not sure what does values 254, 255 denotes , do you really have server with that many processors.
EDIT:
Try the following edit to your program, (based on the fact that two Thread ids are getting printed), now by the time both threads some in the picture, they both get same value of variable i, they need a local variable to avoid closure issue
Parallel.For(0,2,i=>{
int local = i;
if(local=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(local==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
EDIT 2: (Would mostly not work as both threads might increment, before actual logic execution)
int local = -1;
Parallel.For(0,2,i=>{
Interlocked.Increment(ref local);
if(local=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(local==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
I have the following issue :
I am using a parallel.foreach iteration for a pretty CPU intensive workload (applying a method on a number of items) & it works fine for about the first 80% of the items - using all cpu cores very nice.
As the iteration seems to come near to the end (around 80% i would say) i see that the number of threads begins to go down core by core, & at the end the last around 5% of the items are proceesed only by two cores. So insted to use all cores untill the end, it slows down pretty hard toward the end of the iteration.
Please note the the workload can be per item very different. One can last 1-2 seconds, the other item can take 2-3 minutes to finish.
Any ideea, suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Lenght);
using (SqlConnection connection =new SqlConnection(cnStr))
{
connection.Open();
try
(
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
for(int i = range.Item1; i<range.Item2; i++)
{
CPUIntensiveMethod(source[i]);
}
});
}
catch(AggretateException ae)
{ //Exception cachting}
}
This is an unavoidable consequence of the fact the parallelism is per computation. It is clear that the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000s to run) and the rest are quick (say 1s to run). You kick them off in a random order across 8 threads. Its clear that eventually each thread will be calculating one of your long running items, at this point you are seeing full utilisation. Eventually the one(s) that hit their long-op(s) first will finish up their long op(s) and quickly finish up any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off.. i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long running items might be amenable to 'internal parallelism' allowing them to have a faster minimum limit runtime.
Your long running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for a long as possible)
(see update below) DONT use partitioning in cases where the body can be long running as this simply increases the 'hit' of this effect. (ie get rid of your rangePartitioner entirely). This will massively reduce the impact of this effect to your particular loop
either way your batch run-time is bound by the run-time of the slowest item in the batch.
Update I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect, i.e. you are saying 'break this work-set down into N work-sets' and then parallelize the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set and so those are going to process on that same thread. As such you should NOT be using partitioning if the inner body can be long running. For example the docs on partitioning here https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx are saying this is aimed at short bodies
If you have multiple threads that process the same number of items each and each item takes varying amount of time, then of course you will have some threads that finish earlier.
If you use collection whose size is not known, then the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach can be a Producer-Consumer pattern
https://msdn.microsoft.com/en-us/library/dd997371
There is a C# function A(arg1, arg2) which needs to be called lots of times. To do this fastest, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
for (int i = 0; i < total / threads; i++)
{
// init arg1 and arg2
var value = A(arg1, agr2);
// do something with value
}
});
Now the issue is that this is not scaling up with an increase in number of cores; e.g. on 8 cores it is using 80% of CPU and on 16 cores it is using 40-50% of CPU. I want to use the CPU to maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it doesn't have any IO or network-bound operations, and also there is no thread locking. What are other possibilities to find out which part of the code is making it not perform in a 100% parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1 - if I run the same code by replacing A() with a simple function which is calculating prime number then it is utilizing 100 CPU and scaling up well. So this proves that other piece of code is correct. Now issue could be within the original function A(). I need a way to detect that issue which is causing some sort of sequencing.
You have determined that the code in A is the problem.
There is one very common problem: Garbage collection. Configure your application in app.config to use the concurrent server GC. The Workstation GC tends to serialize execution. The effect is severe.
If this is not the problem pause the debugger a few times and look at the Debug -> Parallel Stacks window. There, you can see what your threads are doing. Look for common resources and contention. For example if you find many thread waiting for a lock that's your problem.
Another nice debugging technique is commenting out code. Once the scalability limit disappears you know what code caused it.
I have a for loop with more than 20k iterations,for each iteration it is taking around two or three seconds and total around 20minutes. how i can optimize this for loop. I am using .net3.5 so parallel foreach is not possible. so i splited the 200000 nos into small chunks and implemented some threading now i am able reduce the time by 50%. is there any other way to optimize these kind of for loops.
My sample code is given below
static double sum=0.0;
public double AsyncTest()
{
List<Item> ItemsList = GetItem();//around 20k items
int count = 0;
bool flag = true;
var newItemsList = ItemsList.Take(62).ToList();
while (flag)
{
int j=0;
WaitHandle[] waitHandles = new WaitHandle[62];
foreach (Item item in newItemsList)
{
var delegateInstance = new MyDelegate(MyMethod);
IAsyncResult asyncResult = delegateInstance.BeginInvoke(item.id, new AsyncCallback(MyAsyncResults), null);
waitHandles[j] = asyncResult.AsyncWaitHandle;
j++;
}
WaitHandle.WaitAll(waitHandles);
count = count + 62;
newItemsList = ItemsList.Skip(count).Take(62).ToList();
}
return sum;
}
public double MyMethod(int id)
{
//Calculations
return sum;
}
static public void MyAsyncResults(IAsyncResult iResult)
{
AsyncResult asyncResult = (AsyncResult) iResult;
MyDelegate del = (MyDelegate) asyncResult.AsyncDelegate;
double mySum = del.EndInvoke(iResult);
sum = sum + mySum;
}
It's possible to reduce number of loops by various techniques. However, this won't give you any noticeable improvement since the heavy computation is performed inside your loops. If you've already parallelized it to use all your CPU cores there is not much to be done. There is a certain amount of computation to be done and there is a certain computer power available. You can't squeeze from your machine more than it can provide.
You can try to:
Do a more efficient implementation of your algorithm if it's possible
Switch to faster environment/language, such as unmanaged C/C++.
Is there a rationale behind your batches size (62)?
Is "MyMethod" method IO bound or CPU bound?
What you do in each cycle is wait till all the batch completes and this wastes some cycles (you are actually waiting for all 62 calls to complete before taking the next batch).
Why won't you change the approach a bit so that you still keep N operations running simultaneosly, but you fire a new operation as soon as one of the executind operations completes?
According to this blog, for loops are more faster than foreach in case of collections. Try looping with for. It will help.
It sounds like you have a CPU intensive MyMethod. For CPU intensive tasks, you can gain a significant improvement by parallelization, but only to the point of better utilizing all CPU cores. Beyond that point, too much parallelization can start to hurt performance -- which I think is what you're doing. (This is unlike I/O intensive tasks where you pretty much parallelize as much as possible.)
What you need to do, in my opinion, is write another method that takes a "chunk" of items (not a single item) and returns their "sum":
double SumChunk(IEnumerable<Item> items)
{
return items.Sum(x => MyMethod(x));
}
Then divide the number of items by n (n being the degree of parallelism -- try n = number of CPU cores, and compare that to x2) and pass each chunk to an async task of SumChunk. And finally, sum up the sub-results.
Also, watch if any of the chunks is completed much before the other ones. If that's the case, then your task distributions is not homogen. You'd need to create smaller chunks (say chunks of 300 items) and pass those to SumChunk.
Correct me if I'm wrong, but it looks to me like your threading is at the individual item level - I wonder if this may be a little too granular.
You are already doing your work in blocks of 62 items. What if you were to take those items and process all of them within a single thread? I.e., you would have something like this:
void RunMyMethods(IEnumerable<Item> items)
{
foreach(Item item in items)
{
var result = MyMethod(item);
...
}
}
Keep in mind that WaitHandle objects can be slower than using Monitor objects: http://www.yoda.arachsys.com/csharp/threads/waithandles.shtml
Otherwise, the usual advice holds: profile the performance to find the true bottlenecks. In your question you state that it takes 2-3 seconds per iteration - with 20000 iterations, it would take a fair bit more than 20 minutes.
Edit:
If you are wanting to maximise your usage of CPU time, then it may be best to split your 20000 items into, say, four groups of 5000 and process each group in its own thread. I would imagine that this sort of "thick 'n chunky" concurrency would be more efficient than a very fine-grained approach.
To start with, the numbers just don't add:
20k iterations,for each iteration it is taking around two or three seconds and total around 20minutes
That's a x40 'parallelism factor' - you can never achieve that running on a normal machine.
Second, when 'optimizing' a CPU intensive computation, there's no sense in parallelizing beyond the number of cores. Try dropping that magical 62 to 16 and bench test - it will actually run faster.
I ran a deformed malversion of your code on my laptop, and got some 10-20% improvement using Parallel.ForEach
So maybe you can make it run 17 minutes instead of 20 - does it really matter ?