What causes the different performance of Math.Max in C#?

I ran this on a laptop: 64-bit Windows 8.1, 2.2 GHz Intel Core i3. The code was compiled in release mode and run without a debugger attached.
static void Main(string[] args)
{
    calcMax(new[] { 1, 2 });
    calcMax2(new[] { 1, 2 });
    var A = GetArray(200000000);
    var stopwatch = new Stopwatch();
    stopwatch.Start(); stopwatch.Stop();
    GC.Collect();
    stopwatch.Reset();
    stopwatch.Start();
    calcMax(A);
    stopwatch.Stop();
    Console.WriteLine("calcMax - \t{0}", stopwatch.Elapsed);
    GC.Collect();
    stopwatch.Reset();
    stopwatch.Start();
    calcMax2(A);
    stopwatch.Stop();
    Console.WriteLine("calcMax2 - \t{0}", stopwatch.Elapsed);
    Console.ReadKey();
}
static int[] GetArray(int size)
{
    var r = new Random(size);
    var ret = new int[size];
    for (int i = 0; i < size; i++)
    {
        ret[i] = r.Next();
    }
    return ret;
}
static int calcMax(int[] A)
{
    int max = int.MinValue;
    for (int i = 0; i < A.Length; i++)
    {
        max = Math.Max(max, A[i]);
    }
    return max;
}
static int calcMax2(int[] A)
{
    int max1 = int.MinValue;
    int max2 = int.MinValue;
    for (int i = 0; i < A.Length; i += 2)
    {
        max1 = Math.Max(max1, A[i]);
        max2 = Math.Max(max2, A[i + 1]);
    }
    return Math.Max(max1, max2);
}
Here are some statistics of program performance (time in milliseconds):
Framework 2.0
X86 platform:
2269 (calcMax)
2971 (calcMax2)
[winner calcMax]
X64 platform:
6163 (calcMax)
5916 (calcMax2)
[winner calcMax2]
Framework 4.5 (time in milliseconds)
X86 platform:
2109 (calcMax)
2579 (calcMax2)
[winner calcMax]
X64 platform:
2040 (calcMax)
2488 (calcMax2)
[winner calcMax]
As you can see, the performance differs depending on the framework and the chosen target platform. I looked at the generated IL code and it is the same in each case.
calcMax2 is under test because it should exploit the processor's "pipelining". But it is faster only with Framework 2.0 on the 64-bit platform. So, what is the real reason for the performance differences shown?

Just some notes worth mentioning. My processor (Haswell i7) doesn't compare well with yours, I certainly can't get close to reproducing the outlier x64 result.
Benchmarking is a hazardous exercise and it is very easy to make simple mistakes that can have big consequences on execution time. You can only truly see them when you look at the generated machine code. Use Tools + Options, Debugging, General and untick the "Suppress JIT optimization" option. That way you can look at the code with Debug > Windows > Disassembly and not affect the optimizer.
Some things you'll see when you do this:
You made a mistake: you are not actually using the method return value. The jitter optimizer looks for opportunities like this; where possible, it completely omits the max variable assignment in calcMax(). But not in calcMax2(). This is a classic benchmarking oops; in a real program you'd of course use the return value. This makes calcMax() look too good.
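A minimal way to keep the optimizer honest, sticking to the question's own methods, is to capture the return value and print it along with the timing (the variable name below is made up for illustration):
int result1;
stopwatch.Reset();
stopwatch.Start();
result1 = calcMax(A);
stopwatch.Stop();
// Printing the result forces the jitter to keep the max computation alive.
Console.WriteLine("calcMax - \t{0} (result {1})", stopwatch.Elapsed, result1);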
The .NET 4 jitter is smarter about optimizing Math.Max(); it can generate the code inline. The .NET 2 jitter couldn't do that yet; it has to make a call to a CLR helper function. The 4.5 test should thus run a lot faster; that it didn't is a strong hint at what really throttles the code execution. It is not the processor's execution engine, it is the cost of accessing memory. Your array is too large to fit in the processor caches, so your program is bogged down waiting for the slow RAM to supply the data. If the processor cannot overlap that with executing instructions then it just stalls.
Noteworthy about calcMax() is what happens to the array-bounds check that C# performs. The jitter knows how to completely eliminate it from the loop. It however isn't smart enough to do the same in calcMax2(), the A[i + 1] screws that up. That check doesn't come for free, it should make calcMax2() quite a bit slower. That it doesn't is again a strong hint that memory is the true bottleneck. That's pretty normal btw, array bound checking in C# can have low to no overhead because it is so much cheaper than the array element access.
As for your basic quest of trying to improve super-scalar execution opportunities: no, that's not how processors work. A loop is not a boundary for the processor; it just sees a different stream of compare and branch instructions, all of which can execute concurrently if they don't have inter-dependencies. What you did by hand is something the optimizer already does itself, an optimization called "loop unrolling". It selected not to do so in this particular case, by the way. An overview of jitter optimizer strategies is available in this post. Trying to outsmart the processor and the optimizer is a pretty tall order, and getting a worse result by trying to help is certainly not unusual.

Many of the differences that you see are well within the range of tolerance, so they should be considered as no differences.
Essentially, what these numbers show is that Framework 2.0 was highly unoptimized for x64 (no surprise at all there), and that overall, calcMax performs slightly better than calcMax2. (No surprise there either, because calcMax2 contains more instructions.)
So, what we learn is that someone came up with a theory that they could achieve better performance by writing high-level code that somehow takes advantage of some pipelining of the CPU, and that this theory was proved wrong.
The running time of your code is dominated by the failed branch predictions that occur within Math.Max() due to the randomness of your data. Try less randomness (more consecutive values, where the next one is always greater) and see if it gives you any better insight.
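To test that theory, one hedged variation on the question's own GetArray helper is to fill the array with consecutive values instead of random ones, so the comparison inside Math.Max resolves the same way on every iteration (the helper name below is made up for illustration):
static int[] GetArrayAscending(int size)
{
    var ret = new int[size];
    for (int i = 0; i < size; i++)
    {
        // Consecutive values: each element is greater than the running max so far,
        // so the branch inside Math.Max is perfectly predictable.
        ret[i] = i;
    }
    return ret;
}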

Every time you run the program, you'll get slightly different results.
Sometimes calcMax will win, and sometimes calcMax2 will win. This is because there is a problem with comparing performance that way. What Stopwatch measures is the time elapsed from the call to stopwatch.Start() until the call to stopwatch.Stop(). In between, things independent of your code can occur. For example, the operating system can take the processor away from your process and give it for a while to another process running on your machine, because your process's time slice has ended. After a while, your process gets the processor back for another time slice.
Such occurrences cannot be controlled or foreseen by your comparison code, and thus the entire experiment shouldn't be treated as reliable.
To minimize this kind of measurement error, you should measure every function many times (for example, 1000 times) and calculate the average time over all measurements. This method of measurement tends to significantly improve the reliability of the result, as it is more resilient to statistical errors.
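A rough sketch of that idea, reusing the question's calcMax (the run count is arbitrary; for an array of 200,000,000 elements you would pick something your machine can afford):
const int runs = 100;                 // arbitrary; adjust to what your machine can afford
var stopwatch = new Stopwatch();
long totalTicks = 0;
calcMax(A);                           // warm-up run so JIT cost isn't measured
for (int run = 0; run < runs; run++)
{
    stopwatch.Restart();
    calcMax(A);
    stopwatch.Stop();
    totalTicks += stopwatch.ElapsedTicks;
}
Console.WriteLine("calcMax average: {0:F3} ms",
    TimeSpan.FromTicks(totalTicks / runs).TotalMilliseconds);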


Full CPU usage for Parallel.For loops

I am writing a WPF application that processes an image data stream from an IR camera. The application uses a class library for processing steps such as rescaling or colorizing, which I am also writing myself. An image processing step looks something like this:
void ProcessFrame(double[,] frame)
{
    int width = frame.GetLength(1);
    int height = frame.GetLength(0);
    byte[,] result = new byte[height, width];
    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Frames are processed by a task that runs in the background. The issue is that, depending on how costly the specific processing algorithm (ManipulatePixel()) is, the application can't keep up with the camera's frame rate any more. However, I have noticed that despite me using parallel for loops, the application simply won't use all of the CPU that is available - the task manager performance tab shows about 60-80% CPU usage.
I have used the same processing algorithms in C++ before, using the concurrency::parallel_for loops from the parallel patterns library. The C++ code uses all of the CPU it can get, as I would expect, and I also tried PInvoking a C++ DLL from my C# code, doing the same algorithm that runs slowly in the C# library - it also uses all the CPU power available, CPU usage is right at 100% virtually the whole time and there is no trouble at all keeping up with the camera.
Outsourcing the code into a C++ DLL and then marshalling it back into C# is an extra hassle I'd of course rather avoid. How do I make my C# code actually make use of all the CPU potential? I have tried increasing process priority like this:
using (Process process = Process.GetCurrentProcess())
process.PriorityClass = ProcessPriorityClass.RealTime;
Which has an effect, but only a very small one. I also tried setting the degree of parallelism for the Parallel.For() loops like this:
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
and then passing that to the Parallel.For() loop. This had no effect at all, but I suppose that's not surprising since the default settings should already be optimal. I also tried setting this in the application configuration:
<runtime>
<Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
<GCCpuGroup enabled="true"></GCCpuGroup>
<gcServer enabled="true"></gcServer>
</runtime>
but this actually makes it run even slower.
EDIT:
The ProcessFrame code block I quoted originally was actually not quite correct. What I was doing at the time was:
void ProcessFrame(double[,] frame)
{
    byte[,] result = new byte[frame.GetLength(0), frame.GetLength(1)];
    Parallel.For(0, frame.GetLength(0), row =>
    {
        for (var col = 0; col < frame.GetLength(1); ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Sorry for this; I was paraphrasing code at the time and didn't realize that this is an actual pitfall that produces different results. I have since changed the code to what I originally wrote (i.e. the width and height variables set at the beginning of the function, so the array's length properties are queried only once each instead of in the for loop's conditional statements). Thank you @Seabizkit, your second comment inspired me to try this. The change in fact already makes the code run noticeably faster - I hadn't realized this because C++ doesn't have 2D arrays, so I had to pass the pixel dimensions as separate arguments there anyway. Whether it is fast enough as it is, I cannot say yet.
Also thank you for the other answers; they contain a lot of things I don't know yet, but it's great to know what I have to look for. I'll update once I've reached a satisfactory result.
I would need to have all of your code and be able to run it locally in order to diagnose the problem, because your post is short on details (I would need to see inside your ManipulatePixel function, as well as the code that calls ProcessFrame). But here are some general tips that apply in your case.
2D arrays in .NET are significantly slower than 1D arrays and jagged arrays, even in .NET Core today - this is a longstanding bug.
See here:
https://github.com/dotnet/coreclr/issues/4059
Why are multi-dimensional arrays in .NET slower than normal arrays?
Multi-dimensional array vs. One-dimensional
So consider changing your code to use either a jagged array (which also helps with memory locality/proximity caching, as each thread would have its own private buffer) or a 1D array with your own code being responsible for bounds-checking.
Or better-yet: use stackalloc to manage the buffer's lifetime and pass that by-pointer (unsafe ahoy!) to your thread delegate.
Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.
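As a rough illustration of the 1D-array suggestion above (the flattened ProcessFrame signature and the explicit width/height parameters are assumptions; ManipulatePixel is the question's own placeholder):
static void ProcessFrame(double[] frame, int width, int height)
{
    Parallel.For(0, height, row =>
    {
        int rowOffset = row * width;          // compute the row base index once per row
        for (int col = 0; col < width; ++col)
        {
            // Single-dimensional indexing avoids the slower multi-dimensional array accessor.
            ManipulatePixel(frame[rowOffset + col]);
        }
    });
}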
Avoid allocating a new buffer for each frame encountered - if a frame has a limited lifespan then consider reusing buffers from a buffer pool.
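One hedged way to do that is System.Buffers.ArrayPool<T>, renting a buffer per frame and returning it once the frame has been consumed (the buffer size here is illustrative):
// using System.Buffers;
byte[] buffer = ArrayPool<byte>.Shared.Rent(width * height);   // may return a larger array than requested
try
{
    // ... fill and consume the buffer for this frame ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}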
Consider using the SIMD and AVX features in .NET. While modern C/C++ compilers are smart enough to compile code to use those instructions, the .NET JIT isn't so hot - but you can make explicit calls into SIMD/AVX instructions using the SIMD-enabled types (you'll need .NET Core 2.0 or later for the best accelerated functionality).
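As a small, generic illustration of those SIMD-enabled types (not the image pipeline itself), System.Numerics.Vector<T> can compute a max reduction over an int[] in hardware-sized chunks, assuming Vector.IsHardwareAccelerated is true on the target machine:
// using System.Numerics;
static int VectorMax(int[] data)
{
    var maxVector = new Vector<int>(int.MinValue);
    int i = 0;
    for (; i <= data.Length - Vector<int>.Count; i += Vector<int>.Count)
    {
        // Compares Vector<int>.Count elements per operation.
        maxVector = Vector.Max(maxVector, new Vector<int>(data, i));
    }
    int max = int.MinValue;
    for (int lane = 0; lane < Vector<int>.Count; lane++)
        max = Math.Max(max, maxVector[lane]);
    for (; i < data.Length; i++)              // scalar tail for the leftover elements
        max = Math.Max(max, data[i]);
    return max;
}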
Also, avoid copying individual bytes or scalar values inside a for loop in C#, instead consider using Buffer.BlockCopy for bulk copy operations (as these can use hardware memory copy features).
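For instance (a hedged sketch, not the author's code), copying one frame buffer into another in a single call rather than pixel by pixel:
ushort[] source = new ushort[width * height];
ushort[] destination = new ushort[width * height];
// Copies the raw bytes of the whole frame in one bulk operation.
Buffer.BlockCopy(source, 0, destination, 0, source.Length * sizeof(ushort));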
Regarding your observation of "80% CPU usage": if you have a busy loop in a program then it will cause 100% CPU usage within the time slices provided by the operating system. If you don't see 100% usage, then either:
Your code is actually running faster than real-time (this is a good thing!) - (unless you're certain your program can't keep-up with the input?)
Your code's thread (or threads) is blocked by something, such as a blocking IO call or a misplaced Thread.Sleep. Use tools like ETW to see what your process is doing when you think it should be CPU-bound.
Ensure you aren't using any lock (Monitor) calls or using other thread or memory synchronization primitives.
Efficiency matters (this is not a true-[PARALLEL] problem, but it may, yet need not, benefit from "just"-[CONCURRENT] work).
The BEST, yet rather hard, way if ultimate performance is a MUST:
Inline assembly, optimised for the cache-line sizes in the CPU hierarchy, and keep indexing that follows the actual memory layout of the 2D data { column-wise | row-wise }. Given there is no 2D kernel transformation mentioned, your process does not need to "touch" any topological neighbours, so the indexing can step in whatever order across both ranges of the 2D domain, and ManipulatePixel() may become more efficient transforming blocks of pixels rather than bearing all the overheads of a call for each isolated, atomicised 1px (ILP + cache efficiency are on your side).
Given your target production-platform CPU family, it is best to use (block-SIMD) vectorised instructions available from AVX2, or better AVX-512, code. As you most probably know, you may use C/C++ with AVX intrinsics for the performance optimisation, inspect the resulting assembly, and finally "copy" the best resulting assembly into your C# assembly inlining. Nothing will run faster. Tricks with CPU-core affinity mapping and eviction/reservation are indeed a last resort, yet they may help in almost hard-real-time production settings (though hard R/T systems are seldom developed in an ecosystem with non-deterministic behaviour).
A CHEAP, few-seconds step :
Test and benchmark the run time per batch of frames with the composition reversed: move the more "expensive" part, the Parallel.For(...{...}), inside the for(var col = 0; col < width; ++col){...} loop, to see how the cost of instantiating the Parallel.For() instrumentation changes.
Next, if going this cheap way, think about refactoring ManipulatePixel() to at least work on a block of data, aligned with the data-storage layout and sized as a multiple of the cache-line length (cache hits cost roughly 0.5-5 ns, versus roughly 100-380 ns for memory accesses otherwise). Trying to distribute the work (made worse per 1px) across all NUMA CPU cores will cost far more time, due to the extended access latencies of cross-NUMA (non-local) memory addresses. On top of never re-using an expensively cached block of fetched data, you knowingly pay excessive costs for cross-NUMA (non-local) memory fetches, from which you "use" just 1px and "throw away" all the rest of the cached block, as those pixels will be re-fetched and manipulated on some other CPU core at some other time - a triple waste of time. Sorry to mention that so explicitly, but when shaving off every possible [ns] this cannot happen in a production pipeline.
Anyway, let me wish you perseverance and good luck on your steps forwards to gain the needed efficiency back onto your side.
Here's what I ended up doing, mostly based on Dai's answer:
made sure to query image pixel dimensions once at the beginning of the processing functions, not within the for loop's conditional statement. With parallel loops, it would seem this creates contended access to those properties from multiple threads, which noticeably slows things down.
removed allocation of output buffers within the processing functions. They now return void and accept the output buffer as an argument. The caller creates one buffer for each image processing step (filtering, scaling, colorizing) only, which doesn't change in size but gets overwritten with each frame.
removed an extra data processing step where raw image data in the format ushort (what the camera originally spits out) was converted to double (actual temperature values). Instead, processing is applied to the raw data directly. Conversion to actual temperatures will be dealt with later, as necessary.
I also tried, without success, to use 1D arrays instead of 2D but there is actually no difference in performance. I don't know if it's because the bug Dai mentioned was fixed in the meantime, but I couldn't confirm 2D arrays to be any slower than 1D arrays.
Probably also worth mentioning, the ManipulatePixel() function in my original post was actually more of a placeholder rather than a real call to another function. Here's a more proper example of what I am doing to a frame, including the changes I made:
private static void Rescale(ushort[,] originalImg, byte[,] scaledImg, in (ushort, ushort) limits)
{
    Debug.Assert(originalImg != null);
    Debug.Assert(originalImg.Length != 0);
    Debug.Assert(scaledImg != null);
    Debug.Assert(scaledImg.Length == originalImg.Length);
    ushort min = limits.Item1;
    ushort max = limits.Item2;
    int width = originalImg.GetLength(1);
    int height = originalImg.GetLength(0);
    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
        {
            ushort value = originalImg[row, col];
            if (value < min)
                scaledImg[row, col] = 0;
            else if (value > max)
                scaledImg[row, col] = 255;
            else
                scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
        }
    });
}
This is just one step and some others are much more complex but the approach would be similar.
Some of the things mentioned, like SIMD/AVX or the answer of user3666197, are unfortunately well beyond my abilities right now, so I couldn't test them out.
It's still relatively easy to put enough processing load into the stream to tank the frame rate but for my application the performance should be enough now. Thanks to everyone who provided input, I'll mark Dai's answer as accepted because I found it the most helpful.

Slow execution under 64 bits. Possible RyuJIT bug?

I have the following C# code trying to benchmark under release mode:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication54
{
    class Program
    {
        static void Main(string[] args)
        {
            int counter = 0;
            var sw = new Stopwatch();
            unchecked
            {
                int sum = 0;
                while (true)
                {
                    try
                    {
                        if (counter > 20)
                            throw new Exception("exception");
                    }
                    catch
                    {
                    }
                    sw.Restart();
                    for (int i = 0; i < int.MaxValue; i++)
                    {
                        sum += i;
                    }
                    counter++;
                    Console.WriteLine(sw.Elapsed);
                }
            }
        }
    }
}
I am on a 64-bit machine with VS 2015 installed. When I run the code as 32-bit, each iteration takes around 0.6 seconds, printed to the console. When I run it as 64-bit, the duration of each iteration jumps to 4 seconds! I tried the sample code on my colleague's computer, which only has VS 2013 installed. There, both the 32-bit and 64-bit versions run in around 0.6 seconds.
In addition to that, if we just remove the try catch block, it also runs in 0.6 seconds with VS 2015 in 64-bit.
This looks like a serious RyuJIT regression when there is a try catch block. Am I correct ?
Bench-marking is a fine art. Make a small modification to your code:
Console.WriteLine("{0}", sw.Elapsed, sum);
And you'll now see the difference disappear. Or to put it another way, the x86 version is now just as slow as the x64 code. You can probably figure out from this minor change what RyuJIT doesn't do that the legacy jitter did: it doesn't eliminate the unnecessary
sum += i;
Something you can see when you look at the generated machine code with Debug > Windows > Disassembly. This is indeed a quirk in RyuJIT; its dead code elimination isn't as thorough as the legacy jitter's. Otherwise not entirely without reason: Microsoft rewrote the x64 jitter because of bugs that could not easily be fixed. One of them was a fairly nasty issue with the optimizer; it had no upper bound on the amount of time it spent optimizing a method. That caused rather poor behavior on methods with very large bodies - it could be out in the woods for dozens of milliseconds and cause noticeable execution pauses.
Calling it a bug, meh, not really. Write sane code and the jitter won't disappoint you. Optimization does forever start at the usual place, between the programmer's ears.
After a bit of testing I've got some interesting results. My testing revolved around the try/catch block. As the OP pointed out, if you remove this block, the execution time is the same. I've narrowed this down a bit further and have concluded that it's because of the counter variable in the if statement inside the try block.
Let's remove the redundant throw:
try
{
    if (counter == 0) { }
}
catch
{
}
You will get the same results with this code as you did with the original code.
Let's replace counter with a literal int value:
try
{
    if (1 == 0) { }
}
catch
{
}
With this code, the 64-bit version's execution time drops from 4 seconds to about 1.7 seconds - still double that of the 32-bit version, but I thought that was interesting. Unfortunately, after a quick Google search I haven't come up with a reason, but I'll dig a bit more and update this answer if I find out why this is happening.
As for the remaining second that we would like to shave off the 64 bit version, I can see that this is down to incrementing the sum by i in your for loop.
Let's change this so that sum does not exceed its bounds:
for (int i = 0; i < int.MaxValue; i++)
{
    sum++;
}
This change (along with the change in the try block) reduces the execution time of the 64-bit app to 0.7 seconds. My reasoning for the remaining 1-second difference is the artificial way the 64-bit version has to handle an int, which is naturally 32 bits.
In the 32-bit version, 32 bits are allocated for the Int32 (sum). When sum goes above its bounds, it is easy to determine this.
In the 64-bit version, 64 bits are allocated for the Int32 (sum). When sum goes above its bounds, there needs to be a mechanism to detect this, which could lead to the slowdown. Perhaps even the operation of adding sum and i takes longer due to the extra redundant bits allocated.
I am theorising here, so don't take this as gospel; I just thought I would post my findings. I'm sure someone else will be able to shed some light on what I've found.
--
Update
@HansPassant's answer pointed out that the sum += i; line may be eliminated because it is deemed unnecessary (sum is not used outside of the for loop), which makes perfect sense. After he introduced the value of sum outside of the for loop, we noticed that the x86 version was just as slow as the x64 version. So I decided to do a bit of testing. Let's change the for loop and the printing to the following:
int x = 0;
for (int i = 0; i < int.MaxValue; i++)
{
    sum += i;
    x = sum;
}
counter++;
Console.WriteLine(sw.Elapsed + " " + x);
You can see that I've introduced a new int x which is assigned the value of sum inside the for loop; that value of x is then written out to the console, while sum itself still doesn't leave the for loop. This, believe it or not, actually reduces the execution time for x64 to 0.7 seconds. However, the x86 version jumps up to 1.4 seconds.

.NET's Multi-threading vs Multi-processing: Awful Parallel.ForEach Performance

I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
    private static List<string> input = new List<string>();

    private static void exec(int threadcount)
    {
        ParallelOptions options = new ParallelOptions();
        options.MaxDegreeOfParallelism = threadcount;
        Parallel.ForEach(Partitioner.Create(0, input.Count), options, (range) =>
        {
            var dic = new Dictionary<string, List<int>>();
            for (int i = range.Item1; i < range.Item2; i++)
            {
                //make some delay!
                //for (int x = 0; x < 400000; x++) ;
                var tokens = input[i].Split();
                foreach (var token in tokens)
                {
                    if (!dic.ContainsKey(token))
                        dic[token] = new List<int>();
                    dic[token].Add(1);
                }
            }
        });
    }

    public static void Main(String[] args)
    {
        StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
        while (true)
        {
            var line = reader.ReadLine();
            if (line == null)
                break;
            input.Add(line);
        }
        DateTime t0 = DateTime.Now;
        exec(Environment.ProcessorCount);
        Console.WriteLine("Parallel: " + (DateTime.Now - t0));
        t0 = DateTime.Now;
        exec(1);
        Console.WriteLine("Serial: " + (DateTime.Now - t0));
    }
}
It is simple and straightforward. I use a dictionary to count each word's occurrences. The style is roughly based on the MapReduce programming model. As you can see, each task is using its own private dictionary. So there are NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25, which is a tragedy! But when I add some delay to the processing of each line, I can reach speedup values of about 4.
In the original parallel execution with no delay, CPU utilization hardly reaches 30%, and therefore the speedup is not promising. But when we add some delay, CPU utilization reaches 97%.
Firstly, I thought the cause was the IO-bound nature of the program (but I think inserting into a dictionary is to some extent CPU intensive), and that seemed logical because all of the threads are reading data from a shared memory bus. However, the surprising point is that when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises and all four instances finish in about 2.3 seconds!
This means that when the code is run in a multiprocessing configuration, it reaches a speedup value of about 3.5, but when it is run in a multithreading configuration, the speedup is about 1.25.
What is your idea?
Is there anything wrong with my code? I think there is no shared data at all, so the code should not experience any contention.
Is there a flaw in .NET's run-time?
Thanks in advance.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector.
Most well tuned applications are in the 0-10% range. This is typically
caused by an allocation pattern that allows objects to live just long
enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec.
This is quite high. It is not uncommon that this is simply a
performance bug.
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
An attempt to explain the results:
a quick run in the VS profiler shows it's barely reaching 40% CPU utilization.
String.Split is the main hotspot.
so a shared something must be blocking the CPU.
that something is most likely memory allocation. Your bottlenecks are
var dic = new Dictionary<string, List<int>>();
...
dic[token].Add(1);
I replaced this with
var dic = new Dictionary<string, int>();
...
... else dic[token] += 1;
and the result is closer to a 2x speedup.
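Spelled out, the inner loop from the question would then look something like this (a sketch assuming the same input list and range partition as the original code):
var dic = new Dictionary<string, int>();
for (int i = range.Item1; i < range.Item2; i++)
{
    foreach (var token in input[i].Split())
    {
        // Count occurrences directly; no List<int> allocation per token.
        if (!dic.ContainsKey(token))
            dic[token] = 1;
        else
            dic[token] += 1;
    }
}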
But my counter question would be: does it matter? Your code is very artificial and incomplete. The parallel version ends up creating multiple dictionaries without merging them. This is not even close to a real situation. And as you can see, little details do matter.
Your sample code is too complex to make broad statements about Parallel.ForEach(), yet too simple to solve or analyze a real problem.
Just for fun, here is a shorter PLINQ version:
File.ReadAllText("big.txt").Split().AsParallel().GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count());

Measure code speed in .net in milliseconds

I want to get the maximum count I have to execute a loop for it to take x milliseconds to finish.
For eg.
int GetIterationsForExecutionTime(int ms)
{
int count = 0;
/* pseudocode
do
some code here
count++;
until executionTime > ms
*/
return count;
}
How do I accomplish something like this?
I want to get the maximum count I have to execute a loop for it to take x milliseconds to finish.
First off, simply do not do that. If you need to wait a certain number of milliseconds do not busy-wait in a loop. Rather, start a timer and return. When the timer ticks, have it call a method that resumes where you left off. The Task.Delay method might be a good one to use; it takes care of the timer details for you.
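For example, a hedged sketch of the Task.Delay approach, assuming the caller can be made async:
// using System.Threading.Tasks;
static async Task RunAfterDelayAsync(int ms)
{
    await Task.Delay(ms);      // yields the thread instead of burning CPU in a loop
    // ... resume the work that should happen after the delay here ...
}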
If your question is actually about how to time the amount of time that some code takes then you need much more than simply a good timer. There is a lot of art and science to getting accurate timings.
First you should always use Stopwatch and never use DateTime.Now for these timings. Stopwatch is designed to be a high-precision timer for telling you how much time elapsed. DateTime.Now is a low-precision timer for telling you if it is time to watch Doctor Who yet. You wouldn't use a wall clock to time an Olympic race; you'd use the highest precision stopwatch you could get your hands on. So use the one provided for you.
Second, you need to remember that C# code is compiled Just In Time. The first time you go through a loop can therefore be hundreds or thousands of times more expensive than every subsequent time due to the cost of the jitter analyzing the code that the loop calls. If you are intending on measuring the "warm" cost of a loop then you need to run the loop once before you start timing it. If you are intending on measuring the average cost including the jit time then you need to decide how many times makes up a reasonable number of trials, so that the average works out correctly.
Third, you need to make sure that you are not wearing any lead weights when you are running. Never make performance measurements while debugging. It is astonishing the number of people who do this. If you are in the debugger then the runtime may be talking back and forth with the debugger to make sure that you are getting the debugging experience you want, and that chatter takes time. The jitter is generating worse code than it normally would, so that your debugging experience is more consistent. The garbage collector is collecting less aggressively. And so on. Always run your performance measurements outside the debugger, and with optimizations turned on.
Fourth, remember that virtual memory systems impose costs similar to those of jitters. If you are already running a managed program, or have recently run one, then the pages of the CLR that you need are likely "hot" -- already in RAM -- where they are fast. If not, then the pages might be cold, on disk, and need to be page faulted in. That can change timings enormously.
Fifth, remember that the jitter can make optimizations that you do not expect. If you try to time:
// Let's time addition!
for (int i = 0; i < 1000000; ++i) { int j = i + 1; }
the jitter is entirely within its rights to remove the entire loop. It can realize that the loop computes no value that is used anywhere else in the program and remove it entirely, giving it a time of zero. Does it do so? Maybe. Maybe not. That's up to the jitter. You should measure the performance of realistic code, where the values computed are actually used somehow; the jitter will then know that it cannot optimize them away.
Sixth, timings of tests which create lots of garbage can be thrown off by the garbage collector. Suppose you have two tests, one that makes a lot of garbage and one that makes a little bit. The cost of the collection of the garbage produced by the first test can be "charged" to the time taken to run the second test if by luck the first test manages to run without a collection but the second test triggers one. If your tests produce a lot of garbage then consider (1) is my test realistic to begin with? It doesn't make any sense to do a performance measurement of an unrealistic program because you cannot make good inferences to how your real program will behave. And (2) should I be charging the cost of garbage collection to the test that produced the garbage? If so, then make sure that you force a full collection before the timing of the test is done.
Seventh, you are running your code in a multithreaded, multiprocessor environment where threads can be switched at will, and where the thread quantum (the amount of time the operating system will give another thread until yours might get a chance to run again) is about 16 milliseconds. 16 milliseconds is about fifty million processor cycles. Coming up with accurate timings of sub-millisecond operations can be quite difficult if the thread switch happens within one of the several million processor cycles that you are trying to measure. Take that into consideration.
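A small helper that combines several of these points - a warm-up call so JIT cost isn't charged to the measurement, a forced full collection before timing, and Stopwatch for the timing itself - might look like this (a sketch, to be run in release mode without a debugger attached):
// using System; using System.Diagnostics;
static TimeSpan Measure(Action action)
{
    action();                           // warm-up run: pays the jitting cost up front
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();                       // start the measured run with a clean heap
    var sw = Stopwatch.StartNew();
    action();
    sw.Stop();
    return sw.Elapsed;
}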
var sw = Stopwatch.StartNew();
...
long elapsedMilliseconds = sw.ElapsedMilliseconds;
You could also use the Stopwatch class:
int GetIterationsForExecutionTime(int ms)
{
    int count = 0;
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    do
    {
        // some code here
        count++;
    } while (stopwatch.ElapsedMilliseconds < ms);
    stopwatch.Stop();
    return count;
}
Good points from Eric Lippert.
I've been benchmarking and unit testing for a while, and I'd advise you to discard every first pass over your code because of JIT compilation.
So in benchmarking code that uses a loop and a Stopwatch, remember to put this at the end of the loop:
// JIT optimization.
if (i == 0)
{
    // Discard every result you've collected.
    // And restart the timer.
    stopwatch.Restart();
}

Does looping occur at the same speed for all systems?

Does looping in C# occur at the same speed for all systems? If not, how can I control the looping speed to make the experience consistent on all platforms?
You can set a minimum time for the time taken to go around a loop, like this:
for (int i = 0; i < 10; i++)
{
    System.Threading.Thread.Sleep(100);
    ... rest of your code...
}
The Sleep call will take a minimum of 100 ms (you cannot say what the maximum will be), so your loop will take at least 1 second to run 10 iterations.
Bear in mind that it's counter to the normal way of Windows programming to sleep on your user-interface thread, but this might be useful to you for a quick hack.
You can never depend on the speed of a loop. Although all existing compilers strive to make loops as efficient as possible, and so probably produce very similar results (given enough development time), the compilers are not the only thing influencing this.
And even leaving everything else aside, different machines have different performance. No two machines will yield the exact same speed for a loop. In fact, even starting the program twice on the same machine will yield slightly different performances. It depends on what other programs are running, how the CPU is feeling today and whether or not the moon is shining.
No, loops do not run at the same speed on all systems. There are so many factors to this question that it cannot be adequately answered without code.
This is a simple loop:
int j = 0;
for (int i = 0; i < 100; i++)
{
    j = j + i;
}
This loop is too simple; it's merely a pair of load, add and store operations, with a jump and a compare. It will amount to only a few micro-ops and will be really fast. However, the speed of those micro-ops depends on the processor. If the processor can do one micro-op in a billionth of a second (roughly one gigahertz), then the loop will take approximately 6 * 100 micro-ops (this is all rough estimation; there are so many factors involved that I'm only going for an approximation), or 6 * 100 billionths of a second - slightly less than one millionth of a second for the entire loop. You can barely measure this with most operating system functions.
I wanted to demonstrate the speed of looping. I referenced above a processor doing 1 billion micro-ops per second. Now consider a processor that can do 4 billion micro-ops per second. That processor would be roughly four times faster than the first, and we didn't change the code.
Does this answer the question?
For those who want to mention that the compiler might unroll this loop: ignore that for the sake of the learning exercise.
One way of controlling this is by using the Stopwatch to control when you do your logic. See this example code:
int noofrunspersecond = 30;
long ticks1 = 0;
long ticks2 = 0;
double interval = (double)Stopwatch.Frequency / noofrunspersecond;
while (true)
{
    ticks2 = Stopwatch.GetTimestamp();
    if (ticks2 >= ticks1 + interval)
    {
        ticks1 = Stopwatch.GetTimestamp();
        //perform your logic here
    }
    Thread.Sleep(1);
}
This will make sure that the logic is performed at the given interval, as long as the system can keep up. So if you try to execute it 100 times per second, depending on the logic performed, the system might not manage to run that logic 100 times a second; in other cases this should work just fine.
This kind of logic is good for getting smooth animations that will not speed up or slow down on different systems for example.
