I have problem with my library for neural networks. It uses multithreading to fasten computations. But after about 30-60 sec of runtime my program does not utilize 100% of my i7 3610QM 4cores 8threads anymore.
Basically my processing looks like (c# with pseudocode)
for each training example t in training set
for each layer l in neural network
Parallel.For(0, N, (int i)=>{l.processForward(l.regions[i])})
for each layer l in neural network (but with reversed order)
Parallel.For(0, N, (int i)=>{l.backPropageteError(l.regions[i])})
Where regions is layer's list of precalculated regions of neuron to process. Every region is the same size of 1/N of current layer so Tasks are same size to minimize chance that other threads need to wait for longest task to finish.
Like i said, this processing scheme is consuming 100% of my processor only for a short time and then drops to about 80-85%. In my case i set N to Environment.ProcessorsCount (= 8);
I can share whole code/repository if anyone is willing to help.
I tried to investigate and I created new console project and put there almost Hello World of Parallel.For() and i simply can't tell what is going on. This might be other issue of Parallel.For() but i also want you to address this problem. Here is the code:
class Program
{
static void Main(string[] args)
{
const int n = 1;
while (true)
{
//int counter = 0; for (int ii = 0; ii < 1000; ++ii) counter++;
Parallel.For(0, n, (int i) => { int counter = 0; for (int ii = 0; ii < 1000; ++ii) counter++; });
}
}
}
In this code, I constantly (while loop) create one task (n=1) that has some work to do (increase counter one thousand times). As i know, Parallel.For blocks execution / waits for all parallel calls to finish. If that is true it should be doing the same work as commented section (provided n=1). But on my computer, this program uses 100% of CPU, like there is work for more than one thread! How is that possible? When i switch to commented version, program uses less than 20% of CPU and this is what I expected. Please help me understand this behaviour.
As #TaW said, their is a cost of going parallel. That's why f() and Parallel.For(0, n, _ => f()) are not equivalent. Parallel version incurs thread scheduling and context switching. In your case the execution time of f() is comparable to thread scheduling overhead. That why you do get performance degrade with parallel version. Parallel.For do wait until operation completes, but is completes so fast that several threads run on the CPU in a very short period of time (remember that each time you invoke Parallel.For it may choose different thread to run f() on it) on different CPU cores.
As for the first part of the question, i guess the problem lies in the index range passed to Parallel.For. Instead of [0, number of CPU cores), it should be equal to the index range of data.
Related
There is a C# function A(arg1, arg2) which needs to be called lots of times. To do this fastest, I am using parallel programming.
Take the example of the following code:
long totalCalls = 2000000;
int threads = Environment.ProcessorCount;
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threads;
Parallel.ForEach(Enumerable.Range(1, threads), options, range =>
{
for (int i = 0; i < total / threads; i++)
{
// init arg1 and arg2
var value = A(arg1, agr2);
// do something with value
}
});
Now the issue is that this is not scaling up with an increase in number of cores; e.g. on 8 cores it is using 80% of CPU and on 16 cores it is using 40-50% of CPU. I want to use the CPU to maximum extent.
You may assume A(arg1, arg2) internally contains a complex calculation, but it doesn't have any IO or network-bound operations, and also there is no thread locking. What are other possibilities to find out which part of the code is making it not perform in a 100% parallel manner?
I also tried increasing the degree of parallelism, e.g.
int threads = Environment.ProcessorCount * 2;
// AND
int threads = Environment.ProcessorCount * 4;
// etc.
But it was of no help.
Update 1 - if I run the same code by replacing A() with a simple function which is calculating prime number then it is utilizing 100 CPU and scaling up well. So this proves that other piece of code is correct. Now issue could be within the original function A(). I need a way to detect that issue which is causing some sort of sequencing.
You have determined that the code in A is the problem.
There is one very common problem: Garbage collection. Configure your application in app.config to use the concurrent server GC. The Workstation GC tends to serialize execution. The effect is severe.
If this is not the problem pause the debugger a few times and look at the Debug -> Parallel Stacks window. There, you can see what your threads are doing. Look for common resources and contention. For example if you find many thread waiting for a lock that's your problem.
Another nice debugging technique is commenting out code. Once the scalability limit disappears you know what code caused it.
Background
Historically, I've not had to write too much thread related code. I am familiar with the process and I can use System.Threading to do what I need to do when the need arises. However, the bulk of my experience has been stuck in .Net 2.0 world (ugh!) and I need a bit of help executing a very simple task in the most recent version of .Net.
I need to write a simple, parallel job but I would prefer it to execute as a low-priority task that doesn't bring my PC to a sluggish halt.
Below is a simple parallel example that demonstrates the type of work that I'm trying to accomplish. In this example, I take a random number and I store it if the new value is larger than the last, largest random number that has been discovered.
Please understand, the point of this example is to strictly show that I have a calculation that I wish to repeatedly execute, compare and store. I understand the need for variable locking and I also know that this code isn't perfect. However, it is an extremely simple example of the nature of what I need to accomplish. If you run this code, your CPU will come to a grinding halt.
int i = 0;
ThreadSafeRNG r = new ThreadSafeRNG();
ParallelOptions.MaxDegreeOfParallelism = 4; //Assume I have 8 cores.
Parallel.For(0, Int32.MaxValue, (j, loopState) =>
{
int k = r.Next();
if (k > i) k = i;
}
How can I do this work, in parallel, and have it execute as a low-priority CPU job. The above mechanism has provided tremendous performance improvements, over a standard for-loop. However, the cost has been that I can't use my computer, even after setting the MaxDegreeOfParallelism option.
What can I do?
Because Parallel.For uses ThreadPool threads you can not set the priority of the thread to be low. However you can change the priority of the entire application
Process currentProcess = Process.GetCurrentProcess();
ProcessPriorityClass oldPriority = currentProcess.PriorityClass;
try
{
currentProcess.PriorityClass = ProcessPriorityClass.BelowNormal;
int i = 0;
ThreadSafeRNG r = new ThreadSafeRNG();
ParallelOptions.MaxDegreeOfParallelism = 4; //Assume I have 8 cores.
Parallel.For(0, Int32.MaxValue, (j, loopState) =>
{
int k = r.Next();
if (k > i) k = i;
}
}
finally
{
//Bring the priority back up to the original level.
currentProcess.PriorityClass = oldPriority;
}
If you really don't want your whole application's priority to be lowered combine this trick with some form of IPC like WCF and have your slow long running operation running in a 2nd process that you start up and kill as needed.
Using the Parallel.For method, you cannot set thread priority since they are ThreadPool threads.
Your best bet is to set ParallelOptions.MaxDegreeOfParallelism to ProcessorCount - 1, this way you have a free core to do other things (such as handle the GUI).
Does looping in C# occur at the same speed for all systems. If not, how can I control a looping speed to make the experience consistent on all platforms?
You can set a minimum time for the time taken to go around a loop, like this:
for(int i= 0; i < 10; i++)
{
System.Threading.Thread.Sleep(100);
... rest of your code...
}
The sleep call will take a minimum of 100ms (you cannot say what the maximum will be), so your loop wil take at least 1 second to run 10 iterations.
Bear in mind that it's counter to the normal way of Windows programming to sleep on your user-interface thread, but this might be useful to you for a quick hack.
You can never depend on the speed of a loop. Although all existing compilers strive to make loops as efficient as possible and so they probably produce very similar results (given enough development time), the compilers are not the only think influencing this.
And even leaving everything else aside, different machines have different performance. No two machines will yield the exact same speed for a loop. In fact, even starting the program twice on the same machine will yield slightly different performances. It depends on what other programs are running, how the CPU is feeling today and whether or not the moon is shining.
No, loops do not occur the same in all systems. There are so many factors to this question that it can not be appreciable answered without code.
This is a simple loop:
int j;
for(int i = 0; i < 100; i++) {
j = j + i;
}
this loop is too simple, it's merely a pair of load, add, store operations, with a jump and a compare. This will be only a few microops and will be really fast. However, the speed of those microops will be dependent on the processor. If the processor can do one microop in 1 billionth of a second (roughly one gigahertz) then the loop will take approximately 6 * 100 microops (this is all rough estimation, there are so many factors involved that I'm only going for approximation) or 6 * 100 billionths of a second, or slightly less than one millionth of a second. For the entire loop. You can barely measure this with most operating system functions.
I wanted to demonstrate the speed of the looping. I referenced above a processor of 1 billion microops per second. Now consider a processor that can do 4 billion microops per second. That processor would be four times faster (roughly) than the first processor. And we didn't change the code.
Does this answer the question?
For those who want to mention that the compiler might loop unroll this, ignore that for the sake of the learning.
One way of controlling this is by using the Stopwatch to control when you do your logic. See this example code:
int noofrunspersecond = 30;
long ticks1 = 0;
long ticks2 = 0;
double interval = (double)Stopwatch.Frequency / noofrunspersecond;
while (true) {
ticks2 = Stopwatch.GetTimestamp();
if (ticks2 >= ticks1 + interval) {
ticks1 = Stopwatch.GetTimestamp();
//perform your logic here
}
Thread.Sleep(1);
}
This will make sure that that the logic is performed at given intervals as long as the system can keep up, so if you try to execute 100 times per second, depending on the logic performed the system might not manage to perform that logic 100 times a second. In other cases this should work just fine.
This kind of logic is good for getting smooth animations that will not speed up or slow down on different systems for example.
I am trying to run the following program from the book.
The author claims that the resultant output
" should be "
1000
2000
....
10000
if you run the program on normal processor but on multiprocessor computer it could be
999
1998
...
9998
when using normal increment method (number+=1) but using the intelocked increment as shown in the program solves the problem(i.e. you get first output)
Now I have got 3 questions.
First why cant i use normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]. Why has author choosed the other method?
Secondly what purpose does Thread.Sleep(1000) has in the context. When I comment out this line, I get second output even if I am using Interlocked method to increment number.
Thirdly I get correct output even by using normal increment method [number += 1] if I dont comment the Thread.Sleep(1000) line and second output if I do so.
Now I am running the program on Intel(R) Core(TM) i7 Q820 cpu if it makes any difference
static void Main(string[] args)
{
MyNum n = new MyNum();
for (int a = 0; a < 10; a++)
{
for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
{
Thread t = new Thread(new ThreadStart(n.AddOne));
t.Start();
}
Thread.Sleep(1000);
Console.WriteLine(n.number);
}
}
class MyNum
{
public int number = 0;
public void AddOne()
{
Interlocked.Increment(ref number);
}
}
The sleep is easy--let the threads finish before you look at the result. It's not really a good answer, though--while they should finish in a second there is no guarantee they actually do.
The need for the interlocked increment in the MyNum class is clear--there are 1000 threads trying for the number, without protection it would be quite possible for one to read the number, then a second read it, then the first one put it back and then the second put it back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores, otherwise it can only happen if a thread switch hits at the wrong time.
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. The thread runs faster than it's created so they aren't running all at once.
Try:
public void AddOne()
{
int x = number + fibnocci(20) + 1 - fibnocci(20);
}
private int fibnocci(int n)
{
if (n < 3) return 1 else return fibnocci(n - 1) + fibnocci(n - 2);
}
(I hope the optimizer isn't good enough to kill this extra code)
The code is actually pretty strange. Since Thread t is declared locally on each iteration, it can possibly be garbage collected by .NET because no reference exists to the thread. Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, having the same result is not really a guaranteed with number += 1. The two cores might read the same numeral and increment the numerals to the same value (i.e., 1001, 1001).
Lastly, I'm not sure whether or not you are running the program in debug mode. Building the program in release mode may give you different behaviors and side effects that a multi-threaded program should do.
if you comment out the thread.sleep line, there is a good chance that the threads will not finish prior to the print line... in this case you will see a number smaller than the "correct" output, but not because the incrementer wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see the collision.
I'm writing a UDP multicast client/server pair in C# and I need a delay on the order of 50-100 µsec (microseconds) to throttle the server transmission rate. This helps to avoid significant packet loss and also helps to keep from overloading the clients that are disk I/O bound. Please do not suggest Thread.Sleep or Thread.SpinWait. I would not ask if I needed either of those.
My first thought was to use some kind of a high-performance counter and do a simple while() loop checking the elapsed time but I'd like to avoid that as it feels kludgey. Wouldn't that also peg the CPU utilization for the server process?
Bonus points for a cross-platform solution, i.e. not Windows specific. Thanks in advance, guys!
Very short sleep times are generally best achieved by a CPU spin loop (like the kind you describe). You generally want to avoid using the high-precision timer calls as they can themselves take up time and skew the results. I wouldn't worry too much about CPU pegging on the server for such short wait times.
I would encapsulate the behavior in a class, as follows:
Create a class whose static constructor runs a spin loop for several million iterations and captures how long it takes. This gives you an idea of how long a single loop cycle would take on the underlying hardware.
Compute a uS/iteration value that you can use to compute arbitrary sleep times.
When asked to sleep for a particular period of time, divide uS to sleep by the uS/iteration value previously computed to identify how many loop iterations to perform.
Spin using a while loop until the estimated time elapses.
I would use stopwatch but would need a loop
read this to add more extension to the stopwatch, like ElapsedMicroseconds
or something like this might work too
System.Diagnostics.Stopwatch.IsHighResolution MUST be true
static void Main(string[] args)
{
Stopwatch sw;
sw = Stopwatch.StartNew();
int i = 0;
while (sw.ElapsedMilliseconds <= 5000)
{
if (sw.Elapsed.Ticks % 100 == 0)
{ i++; /* do something*/ }
}
sw.Stop();
}
I've experienced with such requirement when I needed more precision with my multicast application.
I've found that the best solution resides with the MultiMedia timers as seen in this example.
I've used this implementation and added TPL async invoke to it. You should see my SimpleMulticastAnalyzer project for more information.
static void udelay(long us)
{
var sw = System.Diagnostics.Stopwatch.StartNew();
long v = (us * System.Diagnostics.Stopwatch.Frequency )/ 1000000;
while (sw.ElapsedTicks < v)
{
}
}
static void Main(string[] args)
{
for (int i = 0; i < 100; i++)
{
Console.WriteLine("" + i + " " + DateTime.Now.Second + "." + DateTime.Now.Millisecond);
udelay(1000000);
}
}
I would discourage using spin loop as it consumes and creates blocking thread. thread.sleep is better, it doesn't use processor resource during sleep, it just slice the time. Try it, and you'll see from task manager how the CPU usage spike with the spin loop.
Have you looked at multimedia timers? You could probably find a .NET library somewhere that wraps the API calls somewhere.