While testing application performance, I came across some pretty strange GC behavior. In short, the GC runs even on an empty program without runtime allocations!
The following application demonstrates the issue:
using System;
using System.Collections.Generic;
public class Program
{
// Preallocate strings to avoid runtime allocations.
static readonly List<string> Integers = new List<string>();
static int StartingCollections0, StartingCollections1, StartingCollections2;
static Program()
{
for (int i = 0; i < 1000000; i++)
Integers.Add(i.ToString());
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
static void Main(string[] args)
{
DateTime start = DateTime.Now;
int i = 0;
Console.WriteLine("Test 1");
StartingCollections0 = GC.CollectionCount(0);
StartingCollections1 = GC.CollectionCount(1);
StartingCollections2 = GC.CollectionCount(2);
while (true)
{
if (++i >= Integers.Count)
{
Console.WriteLine();
break;
}
// 1st test - no collections!
{
if (i % 50000 == 0)
{
PrintCollections();
Console.Write(" - ");
Console.WriteLine(Integers[i]);
//System.Threading.Thread.Sleep(100);
// or a busy wait (run in debug mode)
for (int j = 0; j < 50000000; j++)
{ }
}
}
}
i = 0;
Console.WriteLine("Test 2");
StartingCollections0 = GC.CollectionCount(0);
StartingCollections1 = GC.CollectionCount(1);
StartingCollections2 = GC.CollectionCount(2);
while (true)
{
if (++i >= Integers.Count)
{
Console.WriteLine("Press any key to continue...");
Console.ReadKey(true);
return;
}
DateTime now = DateTime.Now;
TimeSpan span = now.Subtract(start);
double seconds = span.TotalSeconds;
// 2nd test - several collections
if (seconds >= 0.1)
{
PrintCollections();
Console.Write(" - ");
Console.WriteLine(Integers[i]);
start = now;
}
}
}
static void PrintCollections()
{
Console.Write(Integers[GC.CollectionCount(0) - StartingCollections0]);
Console.Write("|");
Console.Write(Integers[GC.CollectionCount(1) - StartingCollections1]);
Console.Write("|");
Console.Write(Integers[GC.CollectionCount(2) - StartingCollections2]);
}
}
Can someone explain what is going on here? I was under the impression that the GC won't run unless memory pressure hits specific limits. However, it seems to run (and collect) all the time - is this normal?
Edit: I have modified the program to avoid all runtime allocations.
Edit 2: Ok, new iteration and it seems that DateTime is the culprit. One of the DateTime methods allocates memory (probably Subtract), which causes the GC to run. The first test now causes absolutely no collections - as expected - while the second causes several.
In short, the GC only runs when it needs to run - I was just generating memory pressure unwittingly (DateTime is a struct and I thought it wouldn't generate garbage).
GC.CollectionCount(0) returns the following:
The number of times garbage collection has occurred for the specified generation since the process was started.
Therefore you should see an increase in the numbers and that increase doesn't mean that memory is leaking but that the GC has run.
Also in the first case you can see this increase. It simply will happen much slower because the very slow Console.WriteLine method is called much more often, slowing things down a lot.
Another thing that should be noted here is that GC.Collect() is not a synchronous function call. It triggers a garbage collection, but that garbage collection occurs on a background thread, and theoretically may not have finished running by the time you get around to checking your GC statistics.
There is a GC.WaitForPendingFinalizers call which you can make after GC.Collect to block until the garbage collection occurs.
If you really want to attempt to accurately track GC statistics in different situations, I would instead utilize the Windows Performance Monitor on your process, where you can create monitors on all sorts of things, including .NET Heap statistics.
If you just wait a few seconds, you see that the collection count also increases in the first test, but not as fast.
The differences between the codes is that the first test writes out the collection count all the time, as fast as it can, while the second test loops without writing anything out until the time limit is reached.
The first test spends most of the time waiting for text being written to the console, while the second test spends most of the time looping, waiting for the time limit. The second test will do a lot more iterations during the same time.
I counted the iterations, and printed out the number of iterations per garbage collection. On my computer the first test stabilises around 45000 iterations per GC, while the second test stabilises around 130000 iterations per GC.
So, the first test actually does more garbage collections than the second test, about three times as many.
Thanks everyone! Your suggestions helped reveal the culprit: DateTime is allocating heap memory.
The GC does not run all the time but only when memory is allocated. If memory usage is flat, the GC will never run and GC.CollectionCount(0) will always return 0, as expected.
The latest iteration of the test showcases this behavior. The first test run does not allocate any heap memory (GC.CollectionCount(0) remains 0), while the second allocates memory in a non-obvious fashion: through DateTime.Subtract() -> Timespan.
Now, both DateTime and Timespan are value types, which is why I found this behavior surprising. Still, there you have it: there was a memory leak after all.
Related
I ran this on a laptop, 64-bit Windows 8.1, 2.2 Ghz Intel Core i3. The code was compiled in release mode and ran without a debugger attached.
static void Main(string[] args)
{
calcMax(new[] { 1, 2 });
calcMax2(new[] { 1, 2 });
var A = GetArray(200000000);
var stopwatch = new Stopwatch();
stopwatch.Start(); stopwatch.Stop();
GC.Collect();
stopwatch.Reset();
stopwatch.Start();
calcMax(A);
stopwatch.Stop();
Console.WriteLine("caclMax - \t{0}", stopwatch.Elapsed);
GC.Collect();
stopwatch.Reset();
stopwatch.Start();
calcMax2(A);
stopwatch.Stop();
Console.WriteLine("caclMax2 - \t{0}", stopwatch.Elapsed);
Console.ReadKey();
}
static int[] GetArray(int size)
{
var r = new Random(size);
var ret = new int[size];
for (int i = 0; i < size; i++)
{
ret[i] = r.Next();
}
return ret;
}
static int calcMax(int[] A)
{
int max = int.MinValue;
for (int i = 0; i < A.Length; i++)
{
max = Math.Max(max, A[i]);
}
return max;
}
static int calcMax2(int[] A)
{
int max1 = int.MinValue;
int max2 = int.MinValue;
for (int i = 0; i < A.Length; i += 2)
{
max1 = Math.Max(max1, A[i]);
max2 = Math.Max(max2, A[i + 1]);
}
return Math.Max(max1, max2);
}
Here are some statistics of program performance (time in miliseconds):
Framework 2.0
X86 platform:
2269 (calcMax)
2971 (calcMax2)
[winner calcMax]
X64 platform:
6163 (calcMax)
5916 (calcMax2)
[winner calcMax2]
Framework 4.5 (time in miliseconds)
X86 platform:
2109 (calcMax)
2579 (calcMax2)
[winner calcMax]
X64 platform:
2040 (calcMax)
2488 (calcMax2)
[winner calcMax]
As you can see the performance is different depend on framework and choosen compilied platform. I see generated IL code and it is the same for each cases.
The calcMax2 is under test because it should use "pipelining" of processor. But it is faster only with framework 2.0 on 64-bit platform. So, what is real reason of shown case in different performance?
Just some notes worth mentioning. My processor (Haswell i7) doesn't compare well with yours, I certainly can't get close to reproducing the outlier x64 result.
Benchmarking is a hazardous exercise and it is very easy to make simple mistakes that can have big consequences on execution time. You can only truly see them when you look at the generated machine code. Use Tools + Options, Debugging, General and untick the "Suppress JIT optimization" option. That way you can look at the code with Debug > Windows > Disassembly and not affect the optimizer.
Some things you'll see when you do this:
You made a mistake, you are not actually using the method return value. The jitter optimizer opportunities like this where possible, it completely omits the max variable assignment in calcMax(). But not in calcMax2(). This is a classic benchmarking oops, in a real program you'd of course use the return value. This makes calcMax() look too good.
The .NET 4 jitter is smarter about optimizing Math.Max(), in can generate the code inline. The .NET 2 jitter couldn't do that yet, it has to make a call to a CLR helper function. The 4.5 test should thus run a lot faster, that it didn't is a strong hint at what really throttles the code execution. It is not the processor's execution engine, it is the cost of accessing memory. Your array is too large to fit in the processor caches so your program is bogged down waiting for the slow RAM to supply the data. If the processor cannot overlap that with executing instructions then it just stalls.
Noteworthy about calcMax() is what happens to the array-bounds check that C# performs. The jitter knows how to completely eliminate it from the loop. It however isn't smart enough to do the same in calcMax2(), the A[i + 1] screws that up. That check doesn't come for free, it should make calcMax2() quite a bit slower. That it doesn't is again a strong hint that memory is the true bottleneck. That's pretty normal btw, array bound checking in C# can have low to no overhead because it is so much cheaper than the array element access.
As for your basic quest, trying to improve super-scalar execution opportunities, no, that's not how processors work. A loop is not a boundary for the processor, it just sees a different stream of compare and branch instructions, all of which can execute concurrently if they don't have inter-dependencies. What you did by hand is something the optimizer already does itself, an optimization called "loop unrolling". It selected not to do so in this particular case btw. An overview of jitter optimizer strategies is available in this post. Trying to outsmart the processor and the optimizer is a pretty tall order and getting a worse result by trying to help is certainly not unusual.
Many of the differences that you see are well within the range of tolerance, so they should be considered as no differences.
Essentially, what these numbers show is that Framework 2.0 was highly unoptimized for X64, (no surprise at all here,) and that overall, calcMax performs slightly better than calcMax2. (No surprise there either, because calcMax2 contains more instructions.)
So, what we learn is that someone came up with a theory that they could achieve better performance by writing high-level code that somehow takes advantage of some pipelining of the CPU, and that this theory was proved wrong.
The running time of your code is dominated by the failed branch predictions that are occurring within Math.max() due to the randomness of your data. Try less randomness (more consecutive values where the 2nd one will always be greater) and see if it gives you any better insights.
Every time you run the program, you'll get slightly different results.
Sometimes calcMax will win, and sometimes calcMax2 will win. This is because there is a problem comparing performance that way. What StopWhatch measures is the time elapsed since stopwatch.Start() is called, until stopwatch.Stop() is called. In between, things independent of your code can occur. For example, the operating system can take the processor from your process and give it for a while to another process running on your machine, due to the end of your process's time slice. after a while, your process gets the processor back for another time slice.
Such occurrences cannot be controlled or foreseen by your comparison code, and thus the entire experiment shouldn't be treated as reliable.
To minimize this kind of measurement errors, you should measure every function many times (for example, 1000 times), and calculate the average time of all measurements. This method of measurement tends to significantly improve the reliability of the result, as it is more resilient to statistical errors.
I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
private static List<string> input = new List<string>();
private static void exec(int threadcount)
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threadcount;
Parallel.ForEach(Partitioner.Create(0, input.Count),options, (range) =>
{
var dic = new Dictionary<string, List<int>>();
for (int i = range.Item1; i < range.Item2; i++)
{
//make some delay!
//for (int x = 0; x < 400000; x++) ;
var tokens = input[i].Split();
foreach (var token in tokens)
{
if (!dic.ContainsKey(token))
dic[token] = new List<int>();
dic[token].Add(1);
}
}
});
}
public static void Main(String[] args)
{
StreamReader reader=new StreamReader((#"c:\txt-set\agg.txt"));
while(true)
{
var line=reader.ReadLine();
if(line==null)
break;
input.Add(line);
}
DateTime t0 = DateTime.Now;
exec(Environment.ProcessorCount);
Console.WriteLine("Parallel: " + (DateTime.Now - t0));
t0 = DateTime.Now;
exec(1);
Console.WriteLine("Serial: " + (DateTime.Now - t0));
}
}
It is simple and straight forward. I use a dictionary to count each word's occurrence. The style is roughly based on the MapReduce programming model. As you can see, each task is using its own private dictionary. So, there is NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25 which means a tragedy! But when I add some delay when processing each line, I can reach speedup values about 4.
In the original parallel execution with no delay, CPU's utilization hardly reaches to 30% and therefore the speedup is not promising. But, when we add some delay, CPU's utilization reaches to 97%.
Firstly, I thought the cause is the IO-bound nature of the program (but I think inserting into a dictionary is to some extent CPU intensive) and it seems logical because all of the threads are reading data from a shared memory bus. However, The surprising point is when I run 4 instances of serial programs (with no delays) simultaneously, CPU's utilization reaches to about raises and all of the four instances finish in about 2.3 seconds!
This means that when the code is being run in a multiprocessing configuration, it reaches to a speedup value about 3.5 but when it is being run in multithreading config, the speedup is about 1.25.
What is your idea?
Is there anything wrong about my code? Because I think there is no shared data at all and I think the code shall not experience any contentions.
Is there a flaw in .NET's run-time?
Thanks in advance.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partioning of Work for more details.)
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector.
Most well tuned applications are in the 0-10% range. This is typically
caused by an allocation pattern that allows objects to live just long
enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec.
This is quite high. It is not uncommon that this is simply a
performance bug.
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
An attempt to explain the results:
a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
String.Split is the main hotspot.
so a shared something must be blocking the the CPU.
that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
... else dic[token] += 1;
and the result is closer to a 2x speedup.
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is to complex to make broad statements about Parallel.ForEach().
It is too simple to solve/analyze a real problem.
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());
I want to get the maximum count I have to execute a loop for it to take x milliseconds to finish.
For eg.
int GetIterationsForExecutionTime(int ms)
{
int count = 0;
/* pseudocode
do
some code here
count++;
until executionTime > ms
*/
return count;
}
How do I accomplish something like this?
I want to get the maximum count I have to execute a loop for it to take x milliseconds to finish.
First off, simply do not do that. If you need to wait a certain number of milliseconds do not busy-wait in a loop. Rather, start a timer and return. When the timer ticks, have it call a method that resumes where you left off. The Task.Delay method might be a good one to use; it takes care of the timer details for you.
If your question is actually about how to time the amount of time that some code takes then you need much more than simply a good timer. There is a lot of art and science to getting accurate timings.
First you should always use Stopwatch and never use DateTime.Now for these timings. Stopwatch is designed to be a high-precision timer for telling you how much time elapsed. DateTime.Now is a low-precision timer for telling you if it is time to watch Doctor Who yet. You wouldn't use a wall clock to time an Olympic race; you'd use the highest precision stopwatch you could get your hands on. So use the one provided for you.
Second, you need to remember that C# code is compiled Just In Time. The first time you go through a loop can therefore be hundreds or thousands of times more expensive than every subsequent time due to the cost of the jitter analyzing the code that the loop calls. If you are intending on measuring the "warm" cost of a loop then you need to run the loop once before you start timing it. If you are intending on measuring the average cost including the jit time then you need to decide how many times makes up a reasonable number of trials, so that the average works out correctly.
Third, you need to make sure that you are not wearing any lead weights when you are running. Never make performance measurements while debugging. It is astonishing the number of people who do this. If you are in the debugger then the runtime may be talking back and forth with the debugger to make sure that you are getting the debugging experience you want, and that chatter takes time. The jitter is generating worse code than it normally would, so that your debugging experience is more consistent. The garbage collector is collecting less aggressively. And so on. Always run your performance measurements outside the debugger, and with optimizations turned on.
Fourth, remember that virtual memory systems impose costs similar to those of jitters. If you are already running a managed program, or have recently run one, then the pages of the CLR that you need are likely "hot" -- already in RAM -- where they are fast. If not, then the pages might be cold, on disk, and need to be page faulted in. That can change timings enormously.
Fifth, remember that the jitter can make optimizations that you do not expect. If you try to time:
// Let's time addition!
for (int i = 0; i < 1000000; ++i) { int j = i + 1; }
the jitter is entirely within its rights to remove the entire loop. It can realize that the loop computes no value that is used anywhere else in the program and remove it entirely, giving it a time of zero. Does it do so? Maybe. Maybe not. That's up to the jitter. You should measure the performance of realistic code, where the values computed are actually used somehow; the jitter will then know that it cannot optimize them away.
Sixth, timings of tests which create lots of garbage can be thrown off by the garbage collector. Suppose you have two tests, one that makes a lot of garbage and one that makes a little bit. The cost of the collection of the garbage produced by the first test can be "charged" to the time taken to run the second test if by luck the first test manages to run without a collection but the second test triggers one. If your tests produce a lot of garbage then consider (1) is my test realistic to begin with? It doesn't make any sense to do a performance measurement of an unrealistic program because you cannot make good inferences to how your real program will behave. And (2) should I be charging the cost of garbage collection to the test that produced the garbage? If so, then make sure that you force a full collection before the timing of the test is done.
Seventh, you are running your code in a multithreaded, multiprocessor environment where threads can be switched at will, and where the thread quantum (the amount of time the operating system will give another thread until yours might get a chance to run again) is about 16 milliseconds. 16 milliseconds is about fifty million processor cycles. Coming up with accurate timings of sub-millisecond operations can be quite difficult if the thread switch happens within one of the several million processor cycles that you are trying to measure. Take that into consideration.
var sw = Stopwatch.StartNew();
...
long elapsedMilliseconds = sw.ElapsedMilliseconds;
You could also use the Stopwatch class:
int GetIterationsForExecutionTime(int ms)
{
int count = 0;
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
do
{
// some code here
count++;
} while (stopwatch.ElapsedMilliseconds < ms);
stopwatch.Stop();
return count;
}
Good points from Eric Lippert.
I'd been benchmarking and unit testing for a while and I'd advise you should discard every first-pass on you code cause JIT compilation.
So in a benchmarking code which use loop and Stopwatch remember to put this at the end of the loop:
// JIT optimization.
if (i == 0)
{
// Discard every result you've collected.
// And restart the timer.
stopwatch.Restart();
}
Ok, So, I just started screwing around with threading, now it's taking a bit of time to wrap my head around the concepts so i wrote a pretty simple test to see how much faster if faster at all printing out 20000 lines would be (and i figured it would be faster since i have a quad core processor?)
so first i wrote this, (this is how i would normally do the following):
System.DateTime startdate = DateTime.Now;
for (int i = 0; i < 10000; ++i)
{
Console.WriteLine("Producing " + i);
Console.WriteLine("\t\t\t\tConsuming " + i);
}
System.DateTime endtime = DateTime.Now;
Console.WriteLine(a.startdate.Second + ":" + a.startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
And then with threading:
public class Test
{
static ProducerConsumer queue;
public System.DateTime startdate = DateTime.Now;
static void Main()
{
queue = new ProducerConsumer();
new Thread(new ThreadStart(ConsumerJob)).Start();
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Producing {0}", i);
queue.Produce(i);
}
Test a = new Test();
}
static void ConsumerJob()
{
Test a = new Test();
for (int i = 0; i < 10000; i++)
{
object o = queue.Consume();
Console.WriteLine("\t\t\t\tConsuming {0}", o);
}
System.DateTime endtime = DateTime.Now;
Console.WriteLine(a.startdate.Second + ":" + a.startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
}
}
public class ProducerConsumer
{
readonly object listLock = new object();
Queue queue = new Queue();
public void Produce(object o)
{
lock (listLock)
{
queue.Enqueue(o);
Monitor.Pulse(listLock);
}
}
public object Consume()
{
lock (listLock)
{
while (queue.Count == 0)
{
Monitor.Wait(listLock);
}
return queue.Dequeue();
}
}
}
Now, For some reason i assumed this would be faster, but after testing it 15 times, the median of the results is ... a few milliseconds different in favor of non threading
Then i figured hey ... maybe i should try it on a million Console.WriteLine's, but the results were similar
am i doing something wrong ?
Writing to the console is internally synchronized. It is not parallel. It also causes cross-process communication.
In short: It is the worst possible benchmark I can think of ;-)
Try benchmarking something real, something that you actually would want to speed up. It needs to be CPU bound and not internally synchronized.
As far as I can see you have only got one thread servicing the queue, so why would this be any quicker?
I have an example for why your expectation of a big speedup through multi-threading is wrong:
Assume you want to upload 100 pictures. The single threaded variant loads the first, uploads it, loads the second, uploads it, etc.
The limiting part here is the bandwidth of your internet connection (assuming that every upload uses up all the upload bandwidth you have).
What happens if you create 100 threads to upload 1 picture only? Well, each thread reads its picture (this is the part that speeds things up a little, because reading the pictures is done in parallel instead of one after the other).
As the currently active thread uses 100% of the internet upload bandwidth to upload its picture, no other thread can upload a single byte when it is not active. As the amount of bytes that needs to be transmitted, the time that 100 threads need to upload one picture each is the same time that one thread needs to upload 100 pictures one after the other.
You only get a speedup if uploading pictures was limited to lets say 50% of the available bandwidth. Then, 100 threads would be done in 50% of the time it would take one thread to upload 100 pictures.
"For some reason i assumed this would be faster"
If you don't know why you assumed it would be faster, why are you surprised that it's not? Simply starting up new threads is never guaranteed to make any operation run faster. There has to be some inefficiency in the original algorithm that a new thread can reduce (and that is sufficient to overcome the extra overhead of creating the thread).
All the advice given by others is good advice, especially the mention of the fact that the console is serialized, as well as the fact that adding threads does not guarantee speedup.
What I want to point out and what it seems the others missed is that in your original scenario you are printing everything in the main thread, while in the second scenario you are merely delegating the entire printing task to the secondary worker. This cannot be any faster than your original scenario because you simply traded one worker for another.
A scenario where you might see speedup is this one:
for(int i = 0; i < largeNumber; i++)
{
// embarrassingly parallel task that takes some time to process
}
and then replacing that with:
int i = 0;
Parallel.For(i, largeNumber,
o =>
{
// embarrassingly parallel task that takes some time to process
});
This will split the loop among the workers such that each worker processes a smaller chunk of the original data. If the task does not need synchronization you should see the expected speedup.
Cool test.
One thing to have in mind when dealing with threads is bottlenecks. Consider this:
You have a Restaurant. Your kitchen can make a new order every 10
minutes (your chef has a bladder problem so he's always in the
bathroom, but is your girlfriend's cousin), so he produces 6 orders an
hour.
You currently employ only one waiter, which can attend tables
immediately (he's probably on E, but you don't care as long as the
service is good).
During the first week of business everything is fine: you get
customers every ten minutes. Customers still wait for exactly ten
minutes for their meal, but that's fine.
However, after that week, you are getting as much as 2 costumers every
ten minutes, and they have to wait as much as 20 minutes to get their
meal. They start complaining and making noises. And god, you have
noise. So what do you do?
Waiters are cheap, so you hire two more. Will the wait time change?
Not at all... waiters will get the order faster, sure (attend two
customers in parallel), but still some customers wait 20 minutes for
the chef to complete their orders.You need another chef, but as you
search, you discover they are lacking! Every one of them is on TV
doing some crazy reality show (except for your girlfriend's cousin who
actually, you discover, is a former drug dealer).
In your case, waiters are the threads making calls to Console.WriteLine; But your chef is the Console itself. It can only service so much calls a second. Adding some threads might make things a bit faster, but the gains should be minimal.
You have multiple sources, but only 1 output. It that case multi-threading will not speed it up. It's like having a road where 4 lanes that merge into 1 lane. Having 4 lanes will move traffic faster, but at the end it will slow back down when it merges into 1 lane.
I am trying to run the following program from the book.
The author claims that the resultant output
" should be "
1000
2000
....
10000
if you run the program on normal processor but on multiprocessor computer it could be
999
1998
...
9998
when using normal increment method (number+=1) but using the intelocked increment as shown in the program solves the problem(i.e. you get first output)
Now I have got 3 questions.
First why cant i use normal increment in the inner loop [i++ instead of Interlocked.Increment(ref i)]. Why has author choosed the other method?
Secondly what purpose does Thread.Sleep(1000) has in the context. When I comment out this line, I get second output even if I am using Interlocked method to increment number.
Thirdly I get correct output even by using normal increment method [number += 1] if I dont comment the Thread.Sleep(1000) line and second output if I do so.
Now I am running the program on Intel(R) Core(TM) i7 Q820 cpu if it makes any difference
static void Main(string[] args)
{
MyNum n = new MyNum();
for (int a = 0; a < 10; a++)
{
for (int i = 1; i <= 1000; Interlocked.Increment(ref i))
{
Thread t = new Thread(new ThreadStart(n.AddOne));
t.Start();
}
Thread.Sleep(1000);
Console.WriteLine(n.number);
}
}
class MyNum
{
public int number = 0;
public void AddOne()
{
Interlocked.Increment(ref number);
}
}
The sleep is easy--let the threads finish before you look at the result. It's not really a good answer, though--while they should finish in a second there is no guarantee they actually do.
The need for the interlocked increment in the MyNum class is clear--there are 1000 threads trying for the number, without protection it would be quite possible for one to read the number, then a second read it, then the first one put it back and then the second put it back, wiping out the change the first one made. Note that such errors are FAR more likely when there are multiple cores, otherwise it can only happen if a thread switch hits at the wrong time.
I can't see why i needs to be protected, though.
Edit: You are getting about the same result because the code executes too fast. The thread runs faster than it's created so they aren't running all at once.
Try:
public void AddOne()
{
int x = number + fibnocci(20) + 1 - fibnocci(20);
}
private int fibnocci(int n)
{
if (n < 3) return 1 else return fibnocci(n - 1) + fibnocci(n - 2);
}
(I hope the optimizer isn't good enough to kill this extra code)
The code is actually pretty strange. Since Thread t is declared locally on each iteration, it can possibly be garbage collected by .NET because no reference exists to the thread. Anyway...
To answer the first question, I don't see a need for Interlocked.Increment(ref i) to take place. The main thread is the only thread that will touch i. Using i++ is not a problem here.
For the second question, Thread.Sleep(1000) exists to give the program enough time to complete all the threads. Your i7 (quad core with hyper-threading) is probably finishing each item pretty fast.
For the third question, having the same result is not really a guaranteed with number += 1. The two cores might read the same numeral and increment the numerals to the same value (i.e., 1001, 1001).
Lastly, I'm not sure whether or not you are running the program in debug mode. Building the program in release mode may give you different behaviors and side effects that a multi-threaded program should do.
if you comment out the thread.sleep line, there is a good chance that the threads will not finish prior to the print line... in this case you will see a number smaller than the "correct" output, but not because the incrementer wasn't atomic.
On a true multicore system, it is possible for the non-atomic actions to collide. Perhaps you are doing too few iterations to see the collision.