For a project with a very large database I am using the following two procedures thousands of times in a loop:
select_points_object_model_3d()
render_object_model_3d()
This takes hours and hours for every test as it is using only 1/16 cores. Now I was wondering: Is there a way to run multiple HDev engines in different threads all executing said procedures?
You can try to work with halcon multithreading operators.
Infinitely running:
par_start<Thread1>: procedure(...)
Wait for threads to finish:
par_start <Thread1> : process (...)
par_start <Thread2> : process (...)
par_join ([Thread1, Thread2])
See: https://www.mvtec.com/doc/halcon/12/en/par_join.html
Related
I have REST web API service in IIS which takes a collection of request objects. The user can enter more than 100 request objects.
I want to run this 100 request concurrently and then aggregate the result and send it back. This involves both I/O operation (calling to backend services for each request) and CPU bound operations (to compute few response elements)
Code snippet -
using System.Threading.Tasks;
....
var taskArray = new Task<FlightInformation>[multiFlightStatusRequest.FlightRequests.Count];
for (int i = 0; i < multiFlightStatusRequest.FlightRequests.Count; i++)
{
var z = i;
taskArray[z] = Tasks.Task.Run(() =>
PerformLogic(multiFlightStatusRequest.FlightRequests[z],lite, fetchRouteByAnyLeg)
);
}
Task.WaitAll(taskArray);
for (int i = 0; i < taskArray.Length; i++)
{
flightInformations.Add(taskArray[i].Result);
}
public Object PerformLogic(Request,...)
{
//multiple IO operations each depends on the outcome of the previous result
//Computations after getting the result from all I/O operations
}
If i individually run the PerformLogic operation (for 1 object) it is taking 300 ms, now my requirement is when I run this PerformLogic() for 100 objects in a single request it should take around 2 secs.
PerformLogic() has the following steps - 1. Call a 3rd Party web service to get some details 2. Based on the details call another 3rd Party webservice 3. Collect the result from the webservice, apply few transformation
But with Task.run() it takes around 7 secs, I would like to know the best approach to handle concurrency and achieve the desired NFR of 2 secs.
I can see that at any point of time 7-8 threads are working concurrently
not sure if I can spawn 100 threads or tasks may be we can see some better performance. Please suggest an approach to handle this efficiently.
Judging by this
public Object PerformLogic(Request,...)
{
//multiple IO operations each depends on the outcome of the previous result
//Computations after getting the result from all I/O operations
}
I'd wager that PerformLogic spends most its time waiting on the IO operations. If so, there's hope with async. You'll have to rewrite PerformLogicand maybe even the IO operations - async needs to be present in all levels, from the top to the bottom. But if you can do it, the result should be a lot faster.
Other than that - get faster hardware. If 8 cores take 7 seconds, then get 32 cores. It's pricey, but could still be cheaper than rewriting the code.
First, don't reinvent the wheel. PLINQ is perfectly capable of doing stuff in parallel, there is no need for manual task handling or result merging.
If you want 100 tasks each taking 300ms done in 2 seconds, you need at least 15 parallel workers, ignoring the cost of parallelization itself.
var results = multiFlightStatusRequest.FlightRequests
.AsParallel()
.WithDegreeOfParallelism(15)
.Select(flightRequest => PerformLogic(flightRequest, lite, fetchRouteByAnyLeg)
.ToList();
Now you have told PLinq to use 15 concurrent workers to work on your queue of tasks. Are you sure your machine is up to the task? You could put any number you want in there, that doesn't mean that your computer magically gets the power to do that.
Another option is to look at your PerformLogic method and optimize that. You call it 100 times, maybe it's worth optimizing.
My application processes millions of pieces of data which vary in size. Small objects are processed quickly while others can take upwards of fifteen minutes.
My current code:
List<QueueRecords> queueRecords= Get500QueueRecords();
bool morefiles=true;
while(morefiles)
{
Parallel.ForEach(
queueRecords,parallelOptions,(record,loopstate)=>
{
//dowork
}
queueRecords = Get500QueueRecords();
if(queueRecords.Count() == 0)
{
morefiles = false;
}
}
The issue with this is that many times I will end up with one thread performing a long running task while there are still massive amounts of data to be processed.
Which pattern should I look into to resolve this issue?
Issues:
1) Get500QueueRecords could also taking some time to execute during which time you aren't doing any processing;
2) If the last record in a set takes 15 minutes you are only processing one at a time when it's processing because ParallelForEach will be waiting for it to complete.
You really should look at TPL DataFlow (https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) or at least create a reader Task that's pumping data into a BlockingCollection<T> and then launch multiple reader Tasks that pull from the blocking collection until it's consumed.
Using a producer and a consumer with a finite size BlockingCollection<T> between them allows you to control (i) how many items are buffered from the reader Task and (ii) how many Tasks you have consuming it.
I need an iteration of a parallel for loop to use 7 cores(or stay away from 1 core) but another iteration to use 8(all) cores and tried below code:
Parallel.For(0,2,i=>{
if(i=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(i==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
this outputs 255 for both iterations. So either parallel.for loop is using single thread for them or one setting sets other iterations affinity too. Another problem is, this is from a latency sensitive application and all this affinity settings add 1 to 15 milliseconds latency.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried threaded version, same thing happens. Even with explicit two threads, both writing 255 to console. Now it seems this command is for a process not a thread.
OpenCL context is using max cores for kernel execution on cpu in one iteration. Other iterations using 1-2 cores to copy buffers and send command to devices. When cpu is used by opencl, it uses all cores and devices cannot get enough time to copy buffers. Device fission seems to be harder than solving this issue I hink.
Different thread affinities in Parallel.For Iterations
Question is misleading, as it is based on assumption that Parallel API means multiple threads. Parallel API does refer to Data Parallel processing, but doesn't provide any guarantee for invoking multiple threads, especially for the code provided above, where there's hardly any work per thread.
For the Parallel API, you can set the Max degree of Parallelism, as follows:
ParallelOptions parallelOption = new ParallelOptions();
parallelOption.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, 20, parallelOption, i =>
But that never guarantee the number of threads that would be invoked to parallel processing, since Threads are used from the ThreadPool and CLR decides at run-time, based on amount of work to be processed, whether more than one thread is required for the processing.
In the same Parallel loop can you try the following, print Thread.Current.ManageThreadId, this would provide a clear idea, regarding the number of threads being invoked in the Parallel loop.
Do I have to use threads explicitly and should I set affinities only once?
Edit: I tried threaded version, same thing happens. Even with explicit two threads, both writing 255 to console. Now it seems this command is for a process not a thread.
Can you post the code, for multiple threads, can you try something like this.
Thread[] threadArray = new Thread[2];
threadArray[0] = new Thread(<ThreadDelegate>);
threadArray[1] = new Thread(<ThreadDelegate>);
threadArray[0]. ProcessorAffinity = <Set Processor Affinity>
threadArray[1]. ProcessorAffinity = <Set Processor Affinity>
Assuming you assign the affinity correctly, you can print them and find different values, check the following ProcessThread.ProcessorAffinity.
On another note as you could see in the link above, you can set the value in hexadecimal based on processor affinity, not sure what does values 254, 255 denotes , do you really have server with that many processors.
EDIT:
Try the following edit to your program, (based on the fact that two Thread ids are getting printed), now by the time both threads some in the picture, they both get same value of variable i, they need a local variable to avoid closure issue
Parallel.For(0,2,i=>{
int local = i;
if(local=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(local==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
EDIT 2: (Would mostly not work as both threads might increment, before actual logic execution)
int local = -1;
Parallel.For(0,2,i=>{
Interlocked.Increment(ref local);
if(local=0)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(255);
if(local==1)
Process.GetCurrentProcess().ProcessorAffinity = (System.IntPtr)(254);
Thread.Sleep(25);// to make sure both set their affinities
Console.WriteLine(Process.GetCurrentProcess().ProcessorAffinity);
});
I have the following issue :
I am using a parallel.foreach iteration for a pretty CPU intensive workload (applying a method on a number of items) & it works fine for about the first 80% of the items - using all cpu cores very nice.
As the iteration seems to come near to the end (around 80% i would say) i see that the number of threads begins to go down core by core, & at the end the last around 5% of the items are proceesed only by two cores. So insted to use all cores untill the end, it slows down pretty hard toward the end of the iteration.
Please note the the workload can be per item very different. One can last 1-2 seconds, the other item can take 2-3 minutes to finish.
Any ideea, suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Lenght);
using (SqlConnection connection =new SqlConnection(cnStr))
{
connection.Open();
try
(
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
for(int i = range.Item1; i<range.Item2; i++)
{
CPUIntensiveMethod(source[i]);
}
});
}
catch(AggretateException ae)
{ //Exception cachting}
}
This is an unavoidable consequence of the fact the parallelism is per computation. It is clear that the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000s to run) and the rest are quick (say 1s to run). You kick them off in a random order across 8 threads. Its clear that eventually each thread will be calculating one of your long running items, at this point you are seeing full utilisation. Eventually the one(s) that hit their long-op(s) first will finish up their long op(s) and quickly finish up any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off.. i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long running items might be amenable to 'internal parallelism' allowing them to have a faster minimum limit runtime.
Your long running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for a long as possible)
(see update below) DONT use partitioning in cases where the body can be long running as this simply increases the 'hit' of this effect. (ie get rid of your rangePartitioner entirely). This will massively reduce the impact of this effect to your particular loop
either way your batch run-time is bound by the run-time of the slowest item in the batch.
Update I have also noticed you are using partitioning on your loop, which massively increases the scope of this effect, i.e. you are saying 'break this work-set down into N work-sets' and then parallelize the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set and so those are going to process on that same thread. As such you should NOT be using partitioning if the inner body can be long running. For example the docs on partitioning here https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx are saying this is aimed at short bodies
If you have multiple threads that process the same number of items each and each item takes varying amount of time, then of course you will have some threads that finish earlier.
If you use collection whose size is not known, then the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach can be a Producer-Consumer pattern
https://msdn.microsoft.com/en-us/library/dd997371
I have just did a sample for multithreading using This Link like below:
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
It gives me 15 thread before Parellel.For and after it gives me 17 thread only. So only 2 thread is occupy with Parellel.For.
Then I have created a another sample code using This Link like below:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Console.WriteLine("MaxDegreeOfParallelism : {0}", Environment.ProcessorCount * 10);
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options,(i, state) =>
{
count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
In above code, I have set MaxDegreeOfParallelism where it sets 40 but is still taking same threads for Parallel.For.
So how can I increase running thread for Parallel.For?
I am facing a problem that some numbers is skipped inside the Parallel.For when I perform some heavy and complex functionality inside it. So here I want to increase the maximum thread and override the skipping issue.
What you're saying is something like: "My car is shaking when driving too fast. I'm trying to avoid this by driving even faster." That doesn't make any sense. What you need is to fix the car, not change the speed.
How exactly to do that depends on what are you actually doing in the loop. The code you showed is obviously placeholder, but even that's wrong. So I think what you should do first is to learn about thread safety.
Using a lock is one option, and it's the easiest one to get correct. But it's also hard to make it efficient. What you need is to lock only for a short amount of time each iteration.
There are other options how to achieve thread safety, including using Interlocked, overloads of Parallel.For that use thread-local data and approaches other than Parallel.For(), like PLINQ or TPL Dataflow.
After you made sure your code is thread safe, only then it's time to worry about things like the number of threads. And regarding that, I think there are two things to note:
For CPU-bound computations, it doesn't make sense to use more threads than the number of cores your CPU has. Using more threads than that will actually usually lead to slower code, since switching between threads has some overhead.
I don't think you can measure the number of threads used by Parallel.For() like that. Parallel.For() uses the thread pool and it's quite possible that there already are some threads in the pool before the loop begins.
Parallel loops use hardware CPU cores. If your CPU has 2 cores, this is the maximum degree of paralellism that you can get in your machine.
Taken from MSDN:
What to Expect
By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the
number of available cores. In typical scenarios, the more cores you
have, the faster your loop executes, until you reach the point of
diminishing returns that Amdahl's Law predicts. How much faster
depends on the kind of work your loop does.
Further reading:
Threading vs Parallelism, how do they differ?
Threading vs. Parallel Processing
Parallel loops will give you wrong result for summation operations without locks as result of each iteration depends on a single variable 'Count' and value of 'Count' in parallel loop is not predictable. However, using locks in parallel loops do not achieve actual parallelism. so, u should try something else for testing parallel loop instead of summation.