I'm fairly new to C# and programming in general and I was trying out parallel programming.
I have written this example code that computes the sum of an array, first using multiple threads and then using a single thread (the main thread). I've timed both cases.
static long Sum(int[] numbers, int start, int end)
{
    long sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += numbers[i];
    }
    return sum;
}

static async Task Main()
{
    // Arrange data.
    const int COUNT = 100_000_000;
    int[] numbers = new int[COUNT];
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }

    // Split the work into multiple parts.
    int threadCount = Environment.ProcessorCount;
    int taskCount = threadCount - 1;
    int taskSize = numbers.Length / taskCount;

    var start = DateTime.Now;

    // Run the individual parts on separate threads.
    List<Task<long>> tasks = new();
    for (int i = 0; i < taskCount; i++)
    {
        int begin = i * taskSize;
        int end = (i == taskCount - 1) ? numbers.Length : (i + 1) * taskSize;
        tasks.Add(Task.Run(() => Sum(numbers, begin, end)));
    }

    // Wait for all tasks to finish, as we need their results.
    var partialSums = await Task.WhenAll(tasks);
    long sumAsync = partialSums.Sum();
    var durationAsync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Async sum: {sumAsync}");
    Console.WriteLine($"Async duration: {durationAsync} milliseconds");

    // Sequential.
    start = DateTime.Now;
    long sumSync = Sum(numbers, 0, numbers.Length);
    var durationSync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Sync sum: {sumSync}");
    Console.WriteLine($"Sync duration: {durationSync} milliseconds");

    var factor = durationSync / durationAsync;
    Console.WriteLine($"Factor: {factor:0.00}x");
}
When the array size is 100 million, the parallel sum is computed about 2x faster on average. But when the array size is 1 billion, it's significantly slower than the sequential sum. Why is it running slower?
Hardware Information
Environment.ProcessorCount = 4
GC.GetGCMemoryInfo().TotalAvailableMemoryBytes = 8468377600
Timing (screenshots omitted): measurements for array sizes of 100,000,000 and 1,000,000,000.
New Test:
This time, instead of separate threads (3 in my case) working on different parts of a single array of 1,000,000,000 integers, I physically divided the dataset into 3 separate arrays of 333,333,333 integers (one third the size each). Although I'm still adding up a billion integers on the same machine, this time my parallel code runs faster (as expected).
private static void InitArray(int[] numbers)
{
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }
}

public static async Task Main()
{
    Stopwatch stopwatch = new();
    const int SIZE = 333_333_333; // one third of a billion
    List<int[]> listOfArrays = new();
    for (int i = 0; i < Environment.ProcessorCount - 1; i++)
    {
        int[] numbers = new int[SIZE];
        InitArray(numbers);
        listOfArrays.Add(numbers);
    }

    // Sequential.
    stopwatch.Start();
    long syncSum = 0;
    foreach (var array in listOfArrays)
    {
        syncSum += Sum(array, 0, array.Length);
    }
    stopwatch.Stop();
    var sequentialDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Sequential sum: {syncSum}");
    Console.WriteLine($"Sequential duration: {sequentialDuration} ms");

    // Parallel.
    stopwatch.Restart();
    List<Task<long>> tasks = new();
    foreach (var array in listOfArrays)
    {
        tasks.Add(Task.Run(() => Sum(array, 0, array.Length)));
    }
    var partialSums = await Task.WhenAll(tasks);
    long parallelSum = partialSums.Sum();
    stopwatch.Stop();
    var parallelDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Parallel sum: {parallelSum}");
    Console.WriteLine($"Parallel duration: {parallelDuration} ms");
    Console.WriteLine($"Factor: {sequentialDuration / parallelDuration:0.00}x");
}
Timing (screenshot omitted). I don't know if it helps figure out what went wrong in the first approach.
The asynchronous pattern is not the same as running code in parallel. The main reason for asynchronous code is better resource utilization while the computer is waiting on some kind of I/O device. Your code would be better described as parallel computing or concurrent computing.
While your example should work fine, it may not be the easiest, nor the optimal, way to do it. The easiest option would probably be to use Parallel LINQ (PLINQ): numbers.AsParallel().Sum();. There is also a Parallel.For method that should be better suited, including an overload that maintains a thread-local state. Note that while Parallel.For will attempt to optimize its partitioning, you probably want to process chunks of data in each iteration to reduce overhead. I would try around 1-10k values or so.
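For illustration, here is a minimal sketch of both suggestions, reusing the numbers array from the question. The cast to long (to avoid overflowing an int sum) and the chunk size of 10,000 are my own additions:

// PLINQ one-liner: partitions the array and sums it on multiple cores.
long plinqSum = numbers.AsParallel().Sum(n => (long)n);

// Parallel.For with a thread-local accumulator, processing chunks
// to reduce per-iteration delegate overhead.
const int chunkSize = 10_000;
long total = 0;
Parallel.For(0, (numbers.Length + chunkSize - 1) / chunkSize,
    () => 0L,                                    // thread-local initial sum
    (chunk, state, local) =>
    {
        int begin = chunk * chunkSize;
        int end = Math.Min(begin + chunkSize, numbers.Length);
        for (int i = begin; i < end; i++)
            local += numbers[i];
        return local;                            // carried to the next iteration
    },
    local => Interlocked.Add(ref total, local)); // merge thread-local sums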
We can only guess at the reason your parallel method is slower. Summing numbers is a really fast operation, so the computation may be limited by memory bandwidth or cache usage. And while you want your work partitions to be fairly large, using too-large partitions may result in less overall parallelism if a thread gets suspended for any reason. You may also want partitions of certain sizes to work well with the caching system; see cache associativity. It is also possible you are including things you did not intend to measure, like compilation time or GCs. See BenchmarkDotNet, which takes care of many of the edge cases in measuring performance.
Also, never use DateTime for measuring performance; Stopwatch is both much easier to use and much more accurate.
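The replacement is mechanical; a sketch using the Sum method from the question:

var sw = Stopwatch.StartNew(); // high-resolution timer, unlike DateTime.Now
long sum = Sum(numbers, 0, numbers.Length);
sw.Stop();
Console.WriteLine($"Sync duration: {sw.Elapsed.TotalMilliseconds} ms");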
My machine has 4 GB of RAM, so initializing an int[1_000_000_000] (roughly 3.7 GiB at 4 bytes per int) results in memory paging. Going from int[100_000_000] to int[1_000_000_000] results in non-linear performance degradation (100x instead of 10x). Essentially, a CPU-bound operation becomes I/O-bound: instead of adding numbers, the program spends most of its time reading segments of the array back from disk. In these conditions, using multiple threads can be detrimental to the overall performance, because the pattern of access to the storage device becomes more erratic and less streamlined.
Maybe something similar happens on your 8 GB machine too, but I can't say for sure.
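A quick back-of-the-envelope check along these lines, using the same API the question's hardware numbers came from:

// Rough estimate: will the array fit in available memory, or page to disk?
long needed = 1_000_000_000L * sizeof(int); // raw array data, ~3.7 GiB
long available = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes;
Console.WriteLine($"need ~{needed / (1024.0 * 1024 * 1024):0.0} GiB, " +
                  $"available {available / (1024.0 * 1024 * 1024):0.0} GiB");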
I am trying to process numbers as fast as possible in a C# app. I use Thread.Sleep() to simulate the processing and random numbers as the data. I use 3 different techniques.
This is test code that I used:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

namespace Test
{
    internal class Program
    {
        private static void Main()
        {
            var data = new int[500000];
            var random = new Random();
            for (int i = 0; i < 500000; i++)
            {
                data[i] = random.Next();
            }

            var partialTimes = new Dictionary<int, double>();
            var iterations = 5;
            for (int i = 1; i < iterations + 1; i++)
            {
                Console.Write($"ProcessData3 {i}\t");
                StartProcessing(data, partialTimes, ProcessData3);
                GC.Collect();
            }

            Console.WriteLine();
            Console.WriteLine("Press Enter to Exit");
            Console.ReadLine();
        }

        private static void StartProcessing(int[] data, Dictionary<int, double> partialTimes, Action<int[], Dictionary<int, double>> processData)
        {
            var stopwatch = Stopwatch.StartNew();
            try
            {
                processData?.Invoke(data, partialTimes);
                stopwatch.Stop();
                Console.WriteLine($"{stopwatch.Elapsed.ToString(@"mm\:ss\:fffffff")} total = {partialTimes.Sum(s => s.Value)} max = {partialTimes.Values.Max()}");
            }
            finally
            {
                partialTimes.Clear();
            }
        }

        private static void ProcessData1(int[] data, Dictionary<int, double> partialTimes)
        {
            Parallel.ForEach(data, number =>
            {
                var partialStopwatch = Stopwatch.StartNew();
                Thread.Sleep(1);
                partialStopwatch.Stop();

                lock (partialTimes)
                {
                    partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
                }
            });
        }

        private static void ProcessData3(int[] data, Dictionary<int, double> partialTimes)
        {
            // Partition the entire source array.
            var rangePartitioner = Partitioner.Create(0, data.Length);

            // Loop over the partitions in parallel.
            Parallel.ForEach(rangePartitioner, (range, loopState) =>
            {
                // Loop over each range element without a delegate invocation.
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    var number = data[i];
                    var partialStopwatch = Stopwatch.StartNew();
                    Thread.Sleep(1);
                    partialStopwatch.Stop();

                    lock (partialTimes)
                    {
                        partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
                    }
                }
            });
        }

        private static void ProcessData2(int[] data, Dictionary<int, double> partialTimes)
        {
            var tasks = new Task[data.Count()];
            for (int i = 0; i < data.Count(); i++)
            {
                var number = data[i];
                tasks[i] = Task.Factory.StartNew(() =>
                {
                    var partialStopwatch = Stopwatch.StartNew();
                    Thread.Sleep(1);
                    partialStopwatch.Stop();

                    lock (partialTimes)
                    {
                        partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
                    }
                });
            }
            Task.WaitAll(tasks);
        }
    }
}
For each technique I restart the program. And I get these results, with Thread.Sleep(1):
ProcessData1 1 00:56:1796688 total = 801335,282599955 max = 16,8783
ProcessData1 2 00:23:5390014 total = 816167,642100022 max = 14,5913
ProcessData1 3 00:14:7090566 total = 827589,675899998 max = 13,2617
ProcessData1 4 00:10:8929177 total = 829296,528300007 max = 15,0175
ProcessData1 5 00:10:6333310 total = 839282,123200008 max = 29,2738
ProcessData2 1 00:37:8084153 total = 824507,174200022 max = 112,071
ProcessData2 2 00:16:3762096 total = 849272,47810001 max = 77,1514
ProcessData2 3 00:12:9177717 total = 854012,353100029 max = 67,5684
ProcessData2 4 00:10:4798701 total = 857396,642899983 max = 92,9408
ProcessData2 5 00:09:2206146 total = 870966,655499989 max = 51,8945
ProcessData3 1 01:13:6814541 total = 803581,718699918 max = 25,6815
ProcessData3 2 01:07:9809277 total = 814069,532899922 max = 26,0671
ProcessData3 3 01:07:9857984 total = 814148,329399928 max = 21,3116
ProcessData3 4 01:07:4812183 total = 808042,695499966 max = 16,8601
ProcessData3 5 01:07:2954614 total = 805895,325499903 max = 23,8517
where
total is the total time spent inside the Parallel.ForEach() body, summed over all iterations, and
max is the maximum time of a single iteration.
Why is the first run so slow? How is it possible that the other attempts are processed so quickly? How can I achieve faster parallel processing on the first attempt?
EDIT:
So I also tried it with Thread.Sleep(10).
Results are:
ProcessData1 1 02:50:2845698 total = 5109831,95429994 max = 12,0612
ProcessData1 2 00:56:3361645 total = 5125884,05919954 max = 12,7666
ProcessData1 3 00:53:4911541 total = 5131105,15209993 max = 12,7486
ProcessData1 4 00:49:5665628 total = 5144654,75829992 max = 13,2678
ProcessData1 5 00:46:0218194 total = 5152955,19509996 max = 13,702
ProcessData2 1 01:21:7207557 total = 5121889,31579983 max = 73,8152
ProcessData2 2 00:39:6660074 total = 5175557,68889969 max = 59,369
ProcessData2 3 00:31:9036416 total = 5193819,89889973 max = 56,2895
ProcessData2 4 00:27:4616803 total = 5207168,56969977 max = 65,5495
ProcessData2 5 00:24:4270755 total = 5222567,9044998 max = 65,368
ProcessData3 1 02:44:9985645 total = 5110117,19019997 max = 11,7172
ProcessData3 2 02:25:6533128 total = 5237779,27010012 max = 26,3171
ProcessData3 3 02:22:2771259 total = 5116123,45259975 max = 12,0581
ProcessData3 4 02:22:1678911 total = 5112574,93779995 max = 11,5334
ProcessData3 5 02:21:9418178 total = 5104980,07120004 max = 11,5583
So the first run still takes many more seconds than the others.
The behavior you're seeing is entirely explained by the fact that the ThreadPool class delays creating new threads until some small amount of time has passed (on the order of 1 second…it's changed over the years).
It can be informative to add instrumentation to one's program. In your example, a very useful tool is to count the number of concurrent threads as managed by the thread pool, determine the "high water mark" (i.e. the maximum number of threads it eventually settles on), and then use that number to override the thread pool's behavior.
When I did that, I discovered that on the first run of the first method, you get up to about 25 threads. But since the default for the thread pool is to only create a number of threads equal to the number of cores on your computer (eight, in my case), creating the additional threads can take a fair amount of time. And of course, during that time you get significantly less throughput than you would otherwise, so the cost is greater than just the roughly 20 seconds it takes to ramp up to that thread count.
On subsequent runs of that test, the maximum number of threads gradually rises (since each new run starts with more threads already in the thread pool, left over from the previous run), eventually getting as high as around 53.
If you know in advance how many threads the thread pool is going to require in order to perform your work efficiently, you can use the SetMinThreads() method to increase the number of threads it will create immediately on demand before switching to the throttled thread-creation algorithm. For example, having that 53 thread high water mark in hand, you can set the number of minimum threads to that number (or a nice round one, like 50).
When I do that, all five runs of your first test, which previously took between 25 seconds and 1 minute (with the longer runs being the earlier ones, of course), take around 19 seconds to complete.
I'd like to emphasize that you should use SetMinThreads() very carefully. The thread pool is, in general, very good about managing workloads. The scenario you present above is obviously just for the sake of example and not realistic, but it does have the problem that you're not really doing much work in each Parallel.ForEach() iteration in the first place. It doesn't seem like a good fit for concurrency, since so much of the time spent is overhead. Using SetMinThreads() in any similar scenario just papers over a more insidious underlying issue.
You'll find that if you tailor your workloads to better match available resources, and to minimize transitions between tasks and threads, you can get good throughput without overriding the default thread pool numbers.
Some other notes on this particular test…
Note that if you change the program to run all three tests in the same session (five runs each), the "first run is longer" behavior happens only for the first test. For future reference, you should always approach this sort of "first time is slower" question by testing different combinations and orderings, to verify whether it's a particular implementation that suffers from the effect, or whether the effect shows up for whichever test runs first, regardless of implementation. There are a number of implementation and platform details, including the JIT, the thread pool, and disk caches, that can affect the initial run of any algorithm, and you'll want to quickly narrow your search down to knowing whether you're dealing with one of those or some genuine issue in your own algorithm.
By the way, not that it really matters for your question, but I find it odd that you use the random numbers in the data array as the keys for your timings dictionary. This IMHO renders those timing values useless, due to collisions among the random numbers. You won't record every iteration (when there's a collision, only the last instance of that number gets stored), which means the "total" time displayed is less than the true total time spent, and even the max values won't necessarily be correct (if the true max gets overwritten by a later value using the same key, you'll miss it).
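One way around the collision problem (my suggestion, not something from the original code) is to collect every sample in a thread-safe collection instead of keying samples by value:

// ConcurrentBag needs no lock and has no key collisions, so no samples are lost.
var timings = new ConcurrentBag<double>();
Parallel.ForEach(data, number =>
{
    var sw = Stopwatch.StartNew();
    Thread.Sleep(1);
    sw.Stop();
    timings.Add(sw.Elapsed.TotalMilliseconds);
});
Console.WriteLine($"total = {timings.Sum()} max = {timings.Max()}");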
Here's my modified version of your first test, which shows both the diagnostic code I added, and (commented out) the statements to set the thread pool counts to produce faster, more consistent behavior:
private static int _threadCount1;
private static int _maxThreadCount1;

// Assumes 'using static System.Console;' for the bare WriteLine calls.
private static void ProcessData1(int[] data, Dictionary<int, double> partialTimes)
{
    const int minOverride = 50;
    int minMain, minIOCP, maxMain, maxIOCP;

    ThreadPool.GetMinThreads(out minMain, out minIOCP);
    ThreadPool.GetMaxThreads(out maxMain, out maxIOCP);
    WriteLine($"cores: {Environment.ProcessorCount}");
    WriteLine($"threads: {minMain} min, {maxMain} max");

    // Uncomment the two lines below to see uniform behavior across test runs:
    //ThreadPool.SetMinThreads(minOverride, minIOCP);
    //ThreadPool.SetMaxThreads(minOverride, maxIOCP);

    _threadCount1 = _maxThreadCount1 = 0;
    Parallel.ForEach(data, number =>
    {
        int threadCount = Interlocked.Increment(ref _threadCount1);
        var partialStopwatch = Stopwatch.StartNew();
        Thread.Sleep(1);
        partialStopwatch.Stop();

        lock (partialTimes)
        {
            partialTimes[number] = partialStopwatch.Elapsed.TotalMilliseconds;
            if (_maxThreadCount1 < threadCount)
            {
                _maxThreadCount1 = threadCount;
            }
        }
        Interlocked.Decrement(ref _threadCount1);
    });

    ThreadPool.SetMinThreads(minMain, minIOCP);
    ThreadPool.SetMaxThreads(maxMain, maxIOCP);
    WriteLine($"max thread count: {_maxThreadCount1}");
}
I've made some speed tests concerning Lists in C#. Here is a result that I cannot explain. I hope someone can figure out what is happening.
Milliseconds for 1000 iterations if cloneList.RemoveAt(cloneList.Count - 1) is called before cloneList.Add(next): x milliseconds.
Milliseconds for 1000 iterations if cloneList.RemoveAt(cloneList.Count - 1) is NOT called before cloneList.Add(next): at least 20x milliseconds.
It seems that adding one more statement makes my code 20 times faster (see the code below):
Stopwatch stopWatch = new Stopwatch();
Random random = new Random(100);
TimeSpan caseOneTimeSpan = new TimeSpan();
TimeSpan caseTwoTimeSpan = new TimeSpan();
int len = 1000;

List<int> myList = new List<int>();
myList.Capacity = len + 1;

// filling the list
for (int i = 0; i < len; i++)
    myList.Add(random.Next(1000));

// number of tests (1000)
for (int i = 0; i < 1000; i++)
{
    List<int> cloneList = myList.ToList();
    int next = random.Next();

    // case 1 - remove last item before adding the new item
    stopWatch.Start();
    cloneList.RemoveAt(cloneList.Count - 1);
    cloneList.Add(next);
    caseOneTimeSpan += stopWatch.Elapsed;

    // reset stopwatch and clone list
    stopWatch.Reset();
    cloneList = myList.ToList();

    // case 2 - add without removing
    stopWatch.Start();
    cloneList.Add(next);
    caseTwoTimeSpan += stopWatch.Elapsed;
    stopWatch.Reset();
}

Console.WriteLine("Case 1: " + caseOneTimeSpan.TotalMilliseconds);
Console.WriteLine("Case 2: " + caseTwoTimeSpan.TotalMilliseconds);
Console.WriteLine("Case 2 / Case 1: " + caseTwoTimeSpan.TotalMilliseconds / caseOneTimeSpan.TotalMilliseconds);
When you add an item to a List there are two possibilities:
1. The internal buffer is large enough to hold another item. The item is placed in the next free location. Speed: O(1). (This is the most common case.)
2. The internal buffer is not large enough. Create a new, larger buffer. Copy all items from the old buffer to the new one, then add the new item to the new buffer. Speed: O(n). (This shouldn't occur often.)
So while most Add calls are O(1), some are O(n).
Removing the last item is always O(1).
Since Add sometimes depends on the size of the list, larger lists take longer when a call has to allocate a new buffer. If you always remove an item before adding a new one, you ensure that the internal buffer always has enough space. In your test this matters on every iteration: ToList() produces a clone whose Capacity equals its Count, so an Add without a preceding RemoveAt has to grow the buffer every time.
You can look at the Capacity property of List to see the current size of the internal buffer and compare it to Count, the number of items the list actually holds. (Capacity - Count is therefore the number of free slots in the buffer.) While not often needed in production code, checking these properties while debugging or developing can help you see what's going on underneath, as the sketch below demonstrates.
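A minimal sketch illustrating this (the growth to Capacity 2000 is typical doubling behavior; the exact growth factor is a runtime implementation detail):

var source = Enumerable.Range(0, 1000).ToList();

var clone = source.ToList();
Console.WriteLine($"Count={clone.Count}, Capacity={clone.Capacity}"); // 1000, 1000

clone.Add(42); // no free slot: the buffer must be reallocated and copied, O(n)
Console.WriteLine($"Count={clone.Count}, Capacity={clone.Capacity}"); // 1001, 2000

clone.RemoveAt(clone.Count - 1);
clone.Add(42); // a free slot exists now: O(1), no reallocation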
I'm looking for a fast and efficient radix-sort implementation for a Dictionary/KeyValuePair collection, if possible in C# (but not mandatory). The key is an integer between 1,000,000 and 9,999,999,999 (so it doesn't fit in an int and needs a long). The number of values varies between 5 and several thousand.
At the moment I'm using LINQ OrderBy, which I think is a quicksort. Performance is really important to me and I would like to test whether a radix sort would be faster.
I have found only array implementations. Of course I could try it myself, but since I'm new to this topic I believe mine wouldn't be the fastest and most efficient algorithm. ;-)
Thank you.
Rene
Have you tested your code to determine that the LINQ-based sort is the bottleneck in your program? LINQ's sort is pretty darned quick. For example, the code below times the sorting of a dictionary that contains from 1,000 to 10,000 items. The average, over 1,000 runs, is on the order of 3.5 milliseconds.
static void DoIt()
{
    int NumberOfTests = 1000;
    Random rnd = new Random();
    TimeSpan totalTime = TimeSpan.Zero;

    for (int i = 0; i < NumberOfTests; ++i)
    {
        // fill the dictionary
        int DictionarySize = rnd.Next(1000, 10000);
        var dict = new Dictionary<int, string>();
        while (dict.Count < DictionarySize)
        {
            int key = rnd.Next(1000000, 9999999);
            if (!dict.ContainsKey(key))
            {
                dict.Add(key, "x");
            }
        }

        // Okay, sort
        var sw = Stopwatch.StartNew();
        var sorted = (from kvp in dict
                      orderby kvp.Key
                      select kvp).ToList();
        sw.Stop();
        totalTime += sw.Elapsed;
        Console.WriteLine("{0:N0} items in {1:N6} ms", dict.Count, sw.Elapsed.TotalMilliseconds);
    }

    Console.WriteLine("Total time = {0:N6} ms", totalTime.TotalMilliseconds);
    Console.WriteLine("Average time = {0:N6} ms", totalTime.TotalMilliseconds / NumberOfTests);
}
Note that the reported average includes the JIT time (the first pass through the loop takes approximately 35 ms).
While it's possible that a good radix sort implementation would improve your sorting performance, I suspect your optimization efforts would be better spent somewhere else.
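If you do want to measure a radix sort against the LINQ sort, here is a minimal LSD (least-significant-digit) sketch for key/value pairs. The RadixSort name, the base-256 digits, and the five passes are my own choices, derived from the stated key range (non-negative keys below 2^34); it's an illustration, not a tuned implementation:

static KeyValuePair<long, TValue>[] RadixSort<TValue>(KeyValuePair<long, TValue>[] items)
{
    const int passes = 5; // 9,999,999,999 < 2^34, so five 8-bit digits cover every key
    var buffer = new KeyValuePair<long, TValue>[items.Length];
    for (int pass = 0; pass < passes; pass++)
    {
        int shift = pass * 8;
        var counts = new int[257];

        // Histogram of the current byte, offset by one for the prefix sum below.
        foreach (var item in items)
            counts[(int)((item.Key >> shift) & 0xFF) + 1]++;

        // Prefix sums turn the histogram into starting offsets per bucket.
        for (int b = 1; b < counts.Length; b++)
            counts[b] += counts[b - 1];

        // Stable scatter into the buffer, preserving order within each bucket.
        foreach (var item in items)
            buffer[counts[(int)((item.Key >> shift) & 0xFF)]++] = item;

        (items, buffer) = (buffer, items); // swap roles for the next pass
    }
    return items; // after the final swap, this local references the sorted array
}

Usage would be along the lines of var sorted = RadixSort(myDictionary.ToArray()); for a hypothetical Dictionary<long, TValue>, timed with Stopwatch against the OrderBy version.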
I am trying to find the time taken to run a function. I am doing it this way:
void SomeFunc(Input input)
{
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    // some operation on input

    stopWatch.Stop();
    long timeTaken = stopWatch.ElapsedMilliseconds;
}
Now the "some operation on input" as mentioned in the comments takes significant time based on the input to SomeFunc.
The problem is when I call SomeFunc multiple times from the main, I get timeTaken correctly only for the first time, and the rest of the time it is being assigned to 0. Is there a problem with the above code?
EDIT:
There is a UI with multiple text fields; when a button is clicked, the work is delegated to SomeFunc. SomeFunc performs some calculations based on the input (from the text fields) and displays the result on the UI. I am not allowed to share the code in "some operation on input" since I have signed an NDA. I can, however, answer questions about what I am trying to achieve there. Please help.
EDIT 2:
Since I seem to get a weird value the first time the function is called, and since, as @Mike Bantegui mentioned, there must be JIT optimization going on, the only solution I can think of (so as not to get zero as the execution time) is to display the time in nanoseconds. How is it possible to display the time in nanoseconds in C#?
Well, you aren't outputting that data anywhere. Ideally you would do it something more like this.
void SomeFunc(Input input)
{
    // do stuff
}

static void Main()
{
    List<long> results = new List<long>();
    Stopwatch sw = new Stopwatch();
    for (int i = 0; i < MAX_TRIES; i++)
    {
        sw.Start();
        SomeFunc(arg);
        sw.Stop();
        results.Add(sw.ElapsedMilliseconds);
        sw.Reset();
    }
    // Perform analysis on the collected results.
}
In fact, you are getting the wrong time on the first call and the correct time on the remaining ones. You can't rely on just the first call to measure the time. However, it seems that the operation is too fast, and that is why you get 0. To measure it correctly, call the function 1000 times, for example, and look at the average cost:
Stopwatch watch = Stopwatch.StartNew();
for (int index = 0; index < 1000; index++)
{
    SomeFunc(input);
}
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
Edit:
How is it possible to display the time in nanoseconds
You can get watch.ElapsedTicks and convert it to nanoseconds: watch.ElapsedTicks * (1_000_000_000.0 / Stopwatch.Frequency). (Do the division in floating point; the integer expression (watch.ElapsedTicks / Stopwatch.Frequency) * 1000000000 truncates to whole seconds first.)
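A minimal sketch of that conversion (SomeFunc and input stand in for the question's own code):

var watch = Stopwatch.StartNew();
SomeFunc(input);
watch.Stop();

// Ticks accrue at Stopwatch.Frequency per second; scale to nanoseconds in double.
double nanoseconds = watch.ElapsedTicks * (1_000_000_000.0 / Stopwatch.Frequency);
Console.WriteLine($"{nanoseconds:N0} ns");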
As a simple illustration, consider the following (contrived) example:
double Mean(List<double> items)
{
    double mu = 0;
    foreach (double val in items)
        mu += val;
    return mu / items.Count;
}
We can time it like so:
void DoTimings(int n)
{
    Stopwatch sw = new Stopwatch();
    double dummy = 0;
    for (int i = 0; i < n; i++)
    {
        List<double> items = new List<double>();
        // populate items with random numbers, excluded for brevity

        sw.Start(); // Start/Stop accumulates elapsed time across iterations
        dummy += Mean(items);
        sw.Stop();
    }
    long time = sw.ElapsedMilliseconds; // total across all n runs
    Console.WriteLine(dummy);
    Console.WriteLine(time / n);
}
This works if the list of items is actually very large. But if it's too small, we'll have to do multiple runs under one timing:
void DoTimings(int n)
{
    Stopwatch sw = new Stopwatch();
    double dummy = 0;
    List<double> items = new List<double>(); // Reuse the same list
    // populate items with random numbers, excluded for brevity

    sw.Start(); // one timing wrapped around all n runs
    for (int i = 0; i < n; i++)
    {
        dummy += Mean(items);
    }
    sw.Stop();
    long time = sw.ElapsedMilliseconds;
    Console.WriteLine(dummy);
    Console.WriteLine(time / n); // average per run
}
In the second example, if the size of the list is too small, we can still get an accurate idea of how long a call takes simply by running it for a large enough n. Each approach has its advantages and flaws, though.
However, before doing either of these I would do a "warm up" calculation beforehand:
// Warm up: enough iterations (or something smaller) to let the JIT compile Mean
double dummy = 0;
for (int i = 0; i < 10000; i++)
    dummy += Mean(items);
Console.WriteLine(dummy);

// Now do the actual timing
An alternative to both would be to do what @Rig did in his answer and build up a list of results to do statistics on. In the first case, you'd build up a list of each individual time; in the second, a list of the average timing over multiple runs, since the time for a single calculation could be smaller than the finest-grained time your Stopwatch can resolve.
With all that said, there is one very large caveat in all of this: measuring how long something takes to run is very hard to do properly. It's admirable to want to do profiling, but you should do some research on SO and see what other people have done to get it right. It's very easy to write a routine that times something badly, but very hard to do it right.