I have been working on async calls and I found that the async version of a method runs much slower than the sync version. Can anyone comment on what I may be missing? Thanks.
Statistics
Sync method time is 00:00:23.5673480
Async method time is 00:01:07.1628415
Total Records/Entries returned per call = 19972
Below is the code that I am running.
-------------------- Test class ----------------------
[TestMethod]
public void TestPeoplePerformanceSyncVsAsync()
{
DateTime start;
DateTime end;
start = DateTime.Now;
for (int i = 0; i < 10; i++)
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
IList<IPerson> people1 = repository.GetPeople();
IList<IPerson> people2 = repository.GetPeople();
}
}
end = DateTime.Now;
var diff = end - start;
Console.WriteLine(diff);
start = DateTime.Now;
for (int i = 0; i < 10; i++)
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
Task<IList<IPerson>> people1 = GetPeopleAsync();
Task<IList<IPerson>> people2 = GetPeopleAsync();
Task.WaitAll(new Task[] {people1, people2});
}
}
end = DateTime.Now;
diff = end - start;
Console.WriteLine(diff);
}
private async Task<IList<IPerson>> GetPeopleAsync()
{
using (IPersonRepository repository = kernel.Get<IPersonRepository>())
{
return await repository.GetPeopleAsync();
}
}
-------------------------- Repository ----------------------------
public IList<IPerson> GetPeople()
{
List<IPerson> people = new List<IPerson>();
using (PersonContext context = new PersonContext())
{
people.AddRange(context.People);
}
return people;
}
public async Task<IList<IPerson>> GetPeopleAsync()
{
List<IPerson> people = new List<IPerson>();
using (PersonContext context = new PersonContext())
{
people.AddRange(await context.People.ToListAsync());
}
return people;
}
So we've got a whole bunch of issues here, so I'll just say right off the bat that this isn't going to be an exhaustive list.
First off, the point of asynchrony is not strictly to improve performance. It can be, in certain contexts, used to improve performance, but that's not necessarily its goal. It can also be used to keep a UI responsive, for example. Parallelization is usually used to increase performance, but parallelization and asynchrony aren't equivalent. On top of that, parallelization has an overhead. You're spending time creating threads, scheduling them, synchronizing data between them, etc. The benefit of performing some operations in parallel may or may not surpass this overhead. If it doesn't, a synchronous solution may well be more performant.
Next, your "asynchronous" example isn't asynchronous "all the way up". You're calling WaitAll on the tasks inside the loop. For the example to be properly asynchronous one would like to see it be asynchronous all the way up to a single operation, namely some form of message loop.
Next, the two aren't doing the exact same thing in an asynchronous and a synchronous manner. They are doing different things, which will obviously affect performance:
Your "asynchronous" solution creates 3 repositories. Your synchronous solution creates one. There is going to be some overhead here.
GetPeopleAsync takes a list, then pulls all of the items out of the list and puts them into another list. That's unnecessary overhead.
Then there are problems with your benchmarking:
You're using DateTime.Now, which is not designed for timing how long an operation takes. Its precision isn't particularly high, for example. You should use a Stopwatch to time how long code takes.
You aren't performing all that many iterations. There's plenty of opportunity for the variation to affect the results here.
You aren't accounting for the fact that the first few runs through a section of code will take longer. The JITter needs to "warm up".
Garbage collections can be affecting your timings, namely that the objects created in the first test can end up being cleaned up during the second test.
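The benchmarking points above can be sketched roughly like this. It is only a minimal illustration, not the asker's test: AverageMilliseconds and its parameters are made-up names, and the Thread.Sleep workload in Main is a placeholder for the repository calls being compared.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class StopwatchBenchmarkDemo
{
    // Runs `action` a few times untimed (JIT warm-up), forces a collection so
    // garbage from earlier work is not charged to this measurement, then
    // returns the average time per timed run in milliseconds.
    public static double AverageMilliseconds(Action action, int warmupRuns, int timedRuns)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            action();
        }

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < timedRuns; i++)
        {
            action();
        }
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / timedRuns;
    }

    public static void Main()
    {
        // Placeholder workload (an assumption); substitute the code under test.
        Action workload = () => Thread.Sleep(5);
        double average = AverageMilliseconds(workload, warmupRuns: 3, timedRuns: 20);
        Console.WriteLine("Average per run: {0:F2} ms", average);
    }
}
```

Calling the same helper once for the synchronous path and once for the asynchronous path keeps the warm-up and collection treatment identical for both.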
It may depend on your data, or rather the amount of it. You didn't post what test metrics you're using to run your tests but this is my experience:
Usually when you see a slowdown in the performance of parallel algorithms when you're expecting improvement it's that the overhead of loading the extra libraries and spawning threads etc. slows down the parallel algorithm and makes it look like the linear/single-threaded version is performing better.
A greater amount of data should show better performance. Also try running the same test twice when all the libraries are loaded to avoid the load overhead.
If you don't see improvement, something is seriously wrong.
Note: You're getting voted down, I'm guessing, because you posted much more code than context, metrics etc. in the OP. IMO, very few SOers will actually bother to read and grok even that much code without being able to execute it while also being presented with metrics that are not at all useful!
Why I didn't read the code: When I see a code block with scroll bars along with the kind of text that was present in the original OP, my brain says: Don't bother. I think many if not most, probably do this.
Things to try:
Two different sync times are not statistically significant data. You should run each algorithm a number of times (5 at least) to see if you're experiencing anomalies. If your results for the same algorithm vary wildly then you may have other issues such as bandwidth restriction, server load etc., and the issue is external.
Try a .NET performance and/or memory profiler to help you track down the issue.
See @servy's great answer for more clues. It seems that he actually took the time to look at your code more closely.
I'm implementing image processing algorithms in C# using .NET Framework 4.7.2 and need to decrease the computation time. Overall the code is sequential, but there are quite a few methods with parameters that do not depend on each other. For example, it might be something like this:
public void Algorithm(Object x, Object y) {
x = Filter(x);
x = Morphology(x);
y = Filter(y);
y = Morphology(y);
var z = Add(x,y);
//Similar pattern of separate operation that are then combined.
}
These functions generally take around 100ms to 500ms. They can be parallelised, and my approach has been something like this:
public void Algorithm(Object x, Object y) {
var xTask = Task.Run(() => {
x = Filter(x);
x = Morphology(x);
});
var yTask = Task.Run(() => {
y = Filter(y);
y = Morphology(y);
});
Task.WaitAll(xTask, yTask);
var z = Add(x,y);
}
It seems to work; a similar bit of code runs approximately twice as fast. (Note that the whole thing is wrapped in another Task.Run in the topmost-level function, which is why I'm not awaiting here.)
Question: Is this a valid approach, or is there another method for parallelising lots of little method calls that is more safe or efficient?
Update: This is not for parallelising processing a batch of images. It is about processing a single image as quick as possible.
This is valid enough - if you can process your workload in parallel then you should. You just need to be very aware of WHEN your workload can and should be parallel - and when it needs to be performed in order.
You also need to consider the cost of creating a new task, versus the benefits of doing so (i.e. sometimes avoid very small, very fast tasks).
I would strongly recommend you create additional methods and collections for managing your tasks - when they complete, and handle running lots of separate sets in parallel. Avoiding locking, managing shared memory/variables etc. For example, are you only ever processing one image at a time, or can you start processing the next one if you have cores available?
You need to be very careful with Task.WaitAll() - obviously you need to draw all your work together at some point, but be careful not to lock or block other work.
There's lots of articles out there on the various patterns you can use (pipelines sounds like a good match here).
Here's a few starters:
https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/tpl-and-traditional-async-programming
https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism
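As one possible alternative to the Task.Run plus Task.WaitAll pair in the question, Parallel.Invoke expresses "run these independent branches and block until all are done" in a single call. The sketch below is an assumption-laden illustration: int arrays stand in for images, and Filter, Morphology and Add are trivial placeholders rather than real image operations.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelBranchesDemo
{
    // Hypothetical stand-ins for the Filter, Morphology and Add steps.
    static int[] Filter(int[] image) => image.Select(p => p + 1).ToArray();
    static int[] Morphology(int[] image) => image.Select(p => p * 2).ToArray();
    static int[] Add(int[] a, int[] b) => a.Zip(b, (p, q) => p + q).ToArray();

    public static int[] Algorithm(int[] x, int[] y)
    {
        int[] xResult = null;
        int[] yResult = null;

        // Each branch writes only its own local, so no locking is needed.
        // Parallel.Invoke returns once both delegates have completed.
        Parallel.Invoke(
            () => xResult = Morphology(Filter(x)),
            () => yResult = Morphology(Filter(y)));

        // The combining step runs only after both branches have finished.
        return Add(xResult, yResult);
    }

    public static void Main()
    {
        int[] z = Algorithm(new[] { 1, 2, 3 }, new[] { 10, 20, 30 });
        Console.WriteLine(string.Join(", ", z)); // prints "26, 48, 70"
    }
}
```

Whether this beats the Task.Run version depends on the workload; for steps in the 100ms to 500ms range the scheduling overhead of either approach should be small by comparison.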
I've got the following:
[HttpPost]
public async Task<IEnumerable<PlotAutocompleteModel>> Get()
{
IEnumerable<PlotDomain> plots = await plotService.RetrieveAllPlots();
var concurrent = new ConcurrentQueue<PlotAutocompleteModel>();
Parallel.ForEach(plots, (plot) =>
{
concurrent.Enqueue(new PlotAutocompleteModel(plot));
});
return concurrent;
}
With this usage, it takes about two seconds. Compared to: return plots.Select(plot => new PlotAutocompleteModel(plot)).ToList(); which takes about four and a half seconds.
But I've always been told that for a simple transformation of a domain model into a view model, a Parallel.ForEach isn't ideal, mostly because it should be for more computationally intensive code, which mine clearly isn't.
Clarification: by that I mean work that uses significantly more resources, for instance a large quantity of bitmaps that you have to rasterize and create new images from.
Is this the proper option for this code? I clearly see a performance gain due to the raw amount of records I'm iterating then transforming. Does a better approach exist?
Update:
public class PlotAutocompleteModel
{
private readonly PlotDomain plot;
public PlotAutocompleteModel(PlotDomain plot)
{
this.plot = plot;
}
public string ProductName => plot.Project.Name;
// Another fourteen exist.
}
With this usage, it takes about two seconds. Compared to... about four and a half seconds.
But I've always been told that for a simple transformation of a domain model into a view model, a Parallel.ForEach isn't ideal, mostly because it should be for more computationally intensive code.
Yeah, um... there's no way - absolutely no way - that a "simple transformation of a domain model into a view model" should take four and a half seconds. There is something seriously wrong there. It should take maybe half a millisecond or so. So, your PlotAutocompleteModel constructor is doing something like 10,000 times the amount of work that is normal.
Is this the proper option for this code? I clearly see a performance gain due to the raw amount of records I'm iterating then transforming.
Probably not, because you're hosting on ASP.NET. If you use parallelism on ASP.NET, you will see individual requests complete faster, but it will negatively impact the scalability of your web server as a whole. For this reason, I never recommend parallelism in ASP.NET handlers. (There are specific situations where it would be acceptable - such as a non-public server where you know you have a hard upper limit on the number of simultaneous users - but as a general rule, it's not a good idea).
Since your PlotAutocompleteModel constructor is taking several orders of magnitude longer than expected, I suspect that it's doing blocking I/O as part of its work. The best solution here is to change the blocking I/O to asynchronous I/O, and then use concurrent asynchrony, something like this:
class PlotAutocompleteModel
{
public static async Task<PlotAutocompleteModel> CreateAsync(PlotDomain plot)
{
... // do asynchronous I/O to create a PlotAutocompleteModel.
}
}
[HttpPost]
public async Task<IEnumerable<PlotAutocompleteModel>> Get()
{
IEnumerable<PlotDomain> plots = await plotService.RetrieveAllPlots();
var tasks = plots.Select(plot => PlotAutocompleteModel.CreateAsync(plot));
return await Task.WhenAll(tasks);
}
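To make the concurrent-asynchrony pattern concrete, here is a small self-contained sketch. PlotModelSketch is a hypothetical stand-in for PlotAutocompleteModel, plot names replace PlotDomain, and Task.Delay is only a placeholder for whatever asynchronous I/O the real factory method would perform.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public sealed class PlotModelSketch
{
    public string Name { get; }

    private PlotModelSketch(string name) { Name = name; }

    // Task.Delay stands in for the real asynchronous I/O (an assumption).
    public static async Task<PlotModelSketch> CreateAsync(string plotName)
    {
        await Task.Delay(100);
        return new PlotModelSketch(plotName.ToUpperInvariant());
    }
}

public static class ConcurrentAsyncDemo
{
    public static async Task<PlotModelSketch[]> CreateAllAsync(IEnumerable<string> plotNames)
    {
        // Select starts one operation per plot without awaiting it; ToList
        // forces the lazy query to run exactly once. Task.WhenAll then awaits
        // them together and returns results in the same order as the input.
        List<Task<PlotModelSketch>> tasks =
            plotNames.Select(name => PlotModelSketch.CreateAsync(name)).ToList();
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        IEnumerable<string> names = Enumerable.Range(1, 50).Select(i => "plot" + i);
        var watch = Stopwatch.StartNew();
        PlotModelSketch[] models = await CreateAllAsync(names);
        watch.Stop();
        Console.WriteLine("{0} models in {1} ms", models.Length, watch.ElapsedMilliseconds);
    }
}
```

Because the delays overlap instead of running back to back, the total time should be close to one delay rather than fifty, and no thread is blocked while the operations are pending.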
I wonder whether the following code can be optimized to execute faster. I currently seem to max out at around 1.4 million simple messages per second on a pretty simple data flow structure. I am aware that this sample passes/transforms messages synchronously; however, I am currently testing TPL Dataflow as a possible replacement for my own custom solution based on Tasks and concurrent collections. I know the term "concurrent" already suggests I run things in parallel, but for current testing purposes I pushed messages through my own solution synchronously and I get about 5.1 million messages per second. What am I missing here? I read that TPL Dataflow was pushed as a high-throughput, low-latency solution, but so far I must be overlooking performance tweaks. Could anyone point me in the right direction, please?
class TPLDataFlowExperiments
{
public TPLDataFlowExperiments()
{
var buf1 = new BufferBlock<int>();
var transform = new TransformBlock<int, string>(t =>
{
return "";
});
var action = new ActionBlock<string>(s =>
{
//Thread.Sleep(100);
//Console.WriteLine(s);
});
buf1.LinkTo(transform);
transform.LinkTo(action);
//Propagate all Completions down the flow
buf1.Completion.ContinueWith(t =>
{
transform.Complete();
transform.Completion.ContinueWith(u =>
{
action.Complete();
});
});
Stopwatch watch = new Stopwatch();
watch.Start();
int cap = 10000000;
for (int i = 0; i < cap; i++)
{
buf1.Post(i);
}
//Mark Buffer as Complete
buf1.Complete();
action.Completion.ContinueWith(t =>
{
watch.Stop();
Console.WriteLine("All Blocks finished processing");
Console.WriteLine("Units processed per second: " + cap * 1000L / watch.ElapsedMilliseconds);
});
Console.ReadLine();
}
}
I think this mostly comes down to one thing: your test is pretty much meaningless. All those blocks are supposed to do something, and use multiple cores and asynchronous operations to do that.
Also, in your test, it's likely that a lot of time is spent on synchronization. With a more realistic code, the code will take some time to execute, so there will be less contention, so the actual overhead will be smaller than what you measured.
But to actually answer your question, yes, you're overlooking some performance tweaks. Specifically, SingleProducerConstrained, which means data structures with less locking can be used. If I use this on both blocks (the BufferBlock is completely useless here, you can safely remove it), the rate rises from about 3–4 million items per second to more than 5 million on my computer.
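A rough sketch of that tweak follows. It assumes the System.Threading.Tasks.Dataflow library is referenced, drops the BufferBlock, and uses PropagateCompletion on the link instead of the hand-written continuations; the class and method names are made up for the example, and the numbers it prints will differ from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks.Dataflow;

public static class SingleProducerDemo
{
    public static int Run(int messageCount)
    {
        int processed = 0;

        // SingleProducerConstrained promises that only one producer at a time
        // offers messages to the block, which allows cheaper internal
        // synchronization. It is only safe when that promise actually holds.
        var options = new ExecutionDataflowBlockOptions { SingleProducerConstrained = true };

        var transform = new TransformBlock<int, string>(i => "", options);
        // The ActionBlock handles one message at a time by default, so the
        // unsynchronized increment is safe here.
        var action = new ActionBlock<string>(s => { processed++; }, options);

        transform.LinkTo(action, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < messageCount; i++)
        {
            transform.Post(i);
        }
        transform.Complete();
        action.Completion.Wait();
        return processed;
    }

    public static void Main()
    {
        var watch = Stopwatch.StartNew();
        int count = Run(1000000);
        watch.Stop();
        Console.WriteLine("{0} messages in {1} ms", count, watch.ElapsedMilliseconds);
    }
}
```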
To add to svick's answer, the test uses only a single processing thread for a single action block. This way it tests nothing more than the overhead of using the blocks.
DataFlow works in a manner similar to F# Agents, Scala actors and MPI implementations. Each action block executes a single task at a time, listening to input and producing output. Speedup is provided by breaking an algorithm in steps that can be executed independently on multiple cores, passing only messages to each other.
While you can increase the number of concurrent tasks, the most important issue is designing a flow that performs the maximum number of steps independently of the others.
You can also increase the degrees of parallelism for dataflow blocks. This may offer an additional speedup and can also help with load balancing between linear tasks if you find one of your blocks acts as a bottleneck to the rest.
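Raising the degree of parallelism of a single block might look like the sketch below. It assumes the System.Threading.Tasks.Dataflow library is referenced; the Thread.Sleep is an artificial stand-in for a slow step, and SquareAll is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class DegreeOfParallelismDemo
{
    public static int[] SquareAll(int[] inputs, int degree)
    {
        // Let the bottleneck block work on up to `degree` messages at once.
        // A TransformBlock still emits its outputs in input order by default.
        var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = degree };
        var square = new TransformBlock<int, int>(n =>
        {
            Thread.Sleep(10); // placeholder for real per-message work
            return n * n;
        }, options);

        // The collector keeps the default degree of 1, so List.Add is safe.
        var results = new List<int>();
        var collect = new ActionBlock<int>(n => results.Add(n));

        square.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (int n in inputs)
        {
            square.Post(n);
        }
        square.Complete();
        collect.Completion.Wait();
        return results.ToArray();
    }

    public static void Main()
    {
        int[] squares = SquareAll(new[] { 1, 2, 3, 4 }, degree: 4);
        Console.WriteLine(string.Join(", ", squares)); // prints "1, 4, 9, 16"
    }
}
```

Only the slow block gets the higher degree; blocks that touch shared state, like the collector here, are usually better left at the default.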
If your workload is so granular that you expect to process millions of messages per second, then passing individual messages through the pipeline is no longer viable because of the associated overhead. You'll need to chunkify the workload by batching the messages into arrays or lists. For example:
var transform = new TransformBlock<int[], string[]>(batch =>
{
var results = new string[batch.Length];
for (int i = 0; i < batch.Length; i++)
{
results[i] = ProcessItem(batch[i]);
}
return results;
});
For batching your input you could use a BatchBlock, or the "linqy" Buffer extension method from the System.Interactive package, or the similar in functionality Batch method from the MoreLinq package, or do it manually.
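For the BatchBlock option, a minimal sketch could look like this. It assumes the System.Threading.Tasks.Dataflow library is referenced; ProcessItem is a hypothetical per-item function and the batch size is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

public static class BatchingDemo
{
    // Hypothetical per-item work.
    static string ProcessItem(int item) => item.ToString();

    public static List<string[]> Run(int itemCount, int batchSize)
    {
        // Groups individual ints into int[] batches of `batchSize`.
        var batcher = new BatchBlock<int>(batchSize);

        var transform = new TransformBlock<int[], string[]>(batch =>
        {
            var results = new string[batch.Length];
            for (int i = 0; i < batch.Length; i++)
            {
                results[i] = ProcessItem(batch[i]);
            }
            return results;
        });

        var batches = new List<string[]>();
        var collect = new ActionBlock<string[]>(b => batches.Add(b));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        batcher.LinkTo(transform, link);
        transform.LinkTo(collect, link);

        for (int i = 0; i < itemCount; i++)
        {
            batcher.Post(i);
        }
        // Completing the BatchBlock flushes a final, possibly smaller, batch.
        batcher.Complete();
        collect.Completion.Wait();
        return batches;
    }

    public static void Main()
    {
        List<string[]> batches = Run(itemCount: 10, batchSize: 4);
        foreach (string[] batch in batches)
        {
            Console.WriteLine(string.Join(",", batch));
        }
    }
}
```

With 10 items and a batch size of 4 the pipeline carries three messages (sizes 4, 4 and 2) instead of ten, which is where the per-message overhead is saved.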
I'm just diving into learning about the Parallel class in the 4.0 Framework and am trying to understand when it would be useful. At first after reviewing some of the documentation I tried to execute two loops, one using Parallel.Invoke and one sequentially like so:
static void Main()
{
DateTime start = DateTime.Now;
Parallel.Invoke(BasicAction, BasicAction2);
DateTime end = DateTime.Now;
var parallel = end.Subtract(start).TotalSeconds;
start = DateTime.Now;
BasicAction();
BasicAction2();
end = DateTime.Now;
var sequential = end.Subtract(start).TotalSeconds;
Console.WriteLine("Parallel:{0}", parallel.ToString());
Console.WriteLine("Sequential:{0}", sequential.ToString());
Console.Read();
}
static void BasicAction()
{
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Method=BasicAction, Thread={0}, i={1}", Thread.CurrentThread.ManagedThreadId, i.ToString());
}
}
static void BasicAction2()
{
for (int i = 0; i < 10000; i++)
{
Console.WriteLine("Method=BasicAction2, Thread={0}, i={1}", Thread.CurrentThread.ManagedThreadId, i.ToString());
}
}
There is no noticeable difference in time of execution here, or am I missing the point? Is it more useful for asynchronous invocations of web services or...?
EDIT: I replaced DateTime with Stopwatch and replaced the write to the console with a simple addition operation.
UPDATE - Big Time Difference Now: Thanks for clearing up the problems I had when I involved Console
static void Main()
{
Stopwatch s = new Stopwatch();
s.Start();
Parallel.Invoke(BasicAction, BasicAction2);
s.Stop();
var parallel = s.ElapsedMilliseconds;
s.Reset();
s.Start();
BasicAction();
BasicAction2();
s.Stop();
var sequential = s.ElapsedMilliseconds;
Console.WriteLine("Parallel:{0}", parallel.ToString());
Console.WriteLine("Sequential:{0}", sequential.ToString());
Console.Read();
}
static void BasicAction()
{
Thread.Sleep(100);
}
static void BasicAction2()
{
Thread.Sleep(100);
}
The test you are doing is nonsensical; you are testing to see if something that you cannot perform in parallel is faster if you perform it in parallel.
Console.WriteLine handles synchronization for you, so it will always act as though it is running on a single thread.
From here:
...call the SetIn, SetOut, or SetError method, respectively. I/O
operations using these streams are synchronized, which means multiple
threads can read from, or write to, the streams.
Any advantage that the parallel version gains from running on multiple threads is lost through the marshaling done by the console. In fact I wouldn't be surprised to see that all the thread switching actually means that the parallel run would be slower.
Try doing something else in the actions (a simple Thread.Sleep would do) that can be processed by multiple threads concurrently and you should see a large difference in the run times. Large enough that the inaccuracy of using DateTime as your timing mechanism will not matter too much.
It's not a matter of execution time. The output to the console is determined by how the actions are scheduled to run. To get an accurate execution time, you should be using Stopwatch. At any rate, you are using Console.WriteLine, so it will appear as though there is one thread of execution. Anything you tried to gain by using Parallel.Invoke is lost by the nature of Console.WriteLine.
On something simple like that the run times will be the same. What Parallel.Invoke is doing is running the two methods at the same time.
In the first case you'll have lines spat out to the console in a mixed up order.
Method=BasicAction2, Thread=6, i=9776
Method=BasicAction, Thread=10, i=9985
// <snip>
Method=BasicAction, Thread=10, i=9999
Method=BasicAction2, Thread=6, i=9777
In the second case you'll have all the BasicAction's before the BasicAction2's.
What this shows you is that the two methods are running at the same time.
In the ideal case (if the number of delegates equals the number of parallel threads and there are enough CPU cores), the duration of the operation becomes MAX(AllDurations) instead of SUM(AllDurations), where AllDurations is the list of each delegate's execution time, like {1sec, 10sec, 20sec, 5sec}. In a less ideal case it moves in that direction.
It's useful when you don't care about the order in which delegates are invoked, but you do care that the thread blocks until every delegate has completed. So yes, it fits a situation where you need to gather data from various sources before you can proceed (they can be web services or other types of sources).
Parallel.For can be used much more often, I think. For Invoke it's pretty much required that you have different tasks, each taking a substantial time to execute, and I guess Invoke shines most when you have no idea of the possible range of execution times (which is true for web services).
Maybe your static constructor needs to build two independent dictionaries for your type to use; you can call the methods that fill them in parallel using Invoke() and cut the time roughly in half if both take about the same time.
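The MAX versus SUM point can be observed with a small sketch like the one below. The two Thread.Sleep delegates are placeholders for real work such as web service calls, and TimeMs is a made-up helper; exact timings will vary.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class InvokeTimingDemo
{
    // Returns how long `action` took, in milliseconds.
    public static long TimeMs(Action action)
    {
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Placeholder delegates with different durations.
        Action slow = () => Thread.Sleep(300);
        Action slower = () => Thread.Sleep(600);

        // Sequential: roughly the SUM of the durations.
        long sequential = TimeMs(() => { slow(); slower(); });

        // Parallel.Invoke blocks until both are done: roughly the MAX.
        long parallel = TimeMs(() => Parallel.Invoke(slow, slower));

        Console.WriteLine("Sequential: {0} ms", sequential);
        Console.WriteLine("Parallel:   {0} ms", parallel);
    }
}
```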
Quite often on SO I find myself benchmarking small chunks of code to see which implementation is fastest.
Quite often I see comments that benchmarking code does not take into account jitting or the garbage collector.
I have the following simple benchmarking function which I have slowly evolved:
static void Profile(string description, int iterations, Action func) {
// warm up
func();
// clean up
GC.Collect();
var watch = new Stopwatch();
watch.Start();
for (int i = 0; i < iterations; i++) {
func();
}
watch.Stop();
Console.Write(description);
Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}
Usage:
Profile("a descriptions", how_many_iterations_to_run, () =>
{
// ... code being profiled
});
Does this implementation have any flaws? Is it good enough to show that implementation X is faster than implementation Y over Z iterations? Can you think of any ways you would improve this?
EDIT
It's pretty clear that a time-based approach (as opposed to iterations) is preferred. Does anyone have any implementations where the time checks do not impact performance?
Here is the modified function, as recommended by the community. Feel free to amend this; it's a community wiki.
static double Profile(string description, int iterations, Action func) {
//Run at highest priority to minimize fluctuations caused by other processes/threads
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
Thread.CurrentThread.Priority = ThreadPriority.Highest;
// warm up
func();
var watch = new Stopwatch();
// clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
watch.Start();
for (int i = 0; i < iterations; i++) {
func();
}
watch.Stop();
Console.Write(description);
Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
return watch.Elapsed.TotalMilliseconds;
}
Make sure you compile in Release with optimizations enabled, and run the tests outside of Visual Studio. This last part is important because the JIT stints its optimizations with a debugger attached, even in Release mode.
Finalisation won't necessarily be completed before GC.Collect returns. The finalisation is queued and then run on a separate thread. This thread could still be active during your tests, affecting the results.
If you want to ensure that finalisation has completed before starting your tests then you might want to call GC.WaitForPendingFinalizers, which will block until the finalisation queue is cleared:
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
If you want to take GC interactions out of the equation, you may want to run your 'warm up' call after the GC.Collect call, not before. That way you know .NET will already have enough memory allocated from the OS for the working set of your function.
Keep in mind that you're making a non-inlined method call for each iteration, so make sure you compare the things you're testing to an empty body. You'll also have to accept that you can only reliably time things that are several times longer than a method call.
Also, depending on what kind of stuff you're profiling, you may want to base your timing on running for a certain amount of time rather than for a certain number of iterations. That tends to give more easily comparable numbers without having a very short run for the best implementation and/or a very long one for the worst.
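A time-based variant might be sketched as follows. To keep the clock checks from dominating short functions, it reads the Stopwatch only once per batch of calls; the names and the batch size are assumptions for the example, not a tuned implementation.

```csharp
using System;
using System.Diagnostics;

public static class TimeBasedProfiler
{
    // Calls `func` in batches until at least `duration` has elapsed and
    // returns the average time per call in milliseconds. The clock is read
    // once per batch, so its cost is spread over `batchSize` calls.
    public static double AverageMilliseconds(TimeSpan duration, int batchSize, Action func, out long iterations)
    {
        func(); // warm up

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        iterations = 0;
        var watch = Stopwatch.StartNew();
        do
        {
            for (int i = 0; i < batchSize; i++)
            {
                func();
            }
            iterations += batchSize;
        } while (watch.Elapsed < duration);
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        long iterations;
        double average = AverageMilliseconds(
            TimeSpan.FromMilliseconds(500), 1000, () => Math.Sqrt(12345.678), out iterations);
        Console.WriteLine("{0} iterations, {1:F6} ms per call", iterations, average);
    }
}
```

The run can overshoot the requested duration by up to one batch, so the batch size should be small relative to how many calls fit in the time budget.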
I think the most difficult problem to overcome with benchmarking methods like this is accounting for edge cases and the unexpected. For example - "How do the two code snippets work under high CPU load/network usage/disk thrashing/etc." They're great for basic logic checks to see if a particular algorithm works significantly faster than another. But to properly test most code performance you'd have to create a test that measures the specific bottlenecks of that particular code.
I'd still say that testing small blocks of code often has little return on investment and can encourage using overly complex code instead of simple maintainable code. Writing clear code that other developers, or myself 6 months down the line, can understand quickly will have more performance benefits than highly optimized code.
I'd avoid passing the delegate at all:
A delegate call is roughly a virtual method call. Not cheap: about 25% of the smallest memory allocation in .NET. If you're interested in details, see e.g. this link.
Anonymous delegates may lead to the use of closures that you won't even notice. Again, accessing closure fields is noticeably slower than e.g. accessing a variable on the stack.
An example code leading to closure usage:
public void Test()
{
int someNumber = 1;
Profiler.Profile("Closure access", 1000000,
() => { int sum = someNumber + someNumber; });
}
If you're not aware of closures, take a look at this method in .NET Reflector.
I'd call func() several times for the warm-up, not just once.
Suggestions for improvement
1. Detecting if the execution environment is good for benchmarking (such as detecting if a debugger is attached or if jit optimization is disabled which would result in incorrect measurements).
2. Measuring parts of the code independently (to see exactly where the bottleneck is).
3. Comparing different versions/components/chunks of code (In your first sentence you say '... benchmarking small chunks of code to see which implementation is fastest.').
Regarding #1:
To detect if a debugger is attached, read the property System.Diagnostics.Debugger.IsAttached (Remember to also handle the case where the debugger is initially not attached, but is attached after some time).
To detect if jit optimization is disabled, read property DebuggableAttribute.IsJITOptimizerDisabled of the relevant assemblies:
private bool IsJitOptimizerDisabled(Assembly assembly)
{
return assembly.GetCustomAttributes(typeof (DebuggableAttribute), false)
.Select(customAttribute => (DebuggableAttribute) customAttribute)
.Any(attribute => attribute.IsJITOptimizerDisabled);
}
Regarding #2:
This can be done in many ways. One way is to allow several delegates to be supplied and then measure those delegates individually.
Regarding #3:
This could also be done in many ways, and different use-cases would demand very different solutions. If the benchmark is invoked manually, then writing to the console might be fine. However if the benchmark is performed automatically by the build system, then writing to the console is probably not so fine.
One way to do this is to return the benchmark result as a strongly typed object that can easily be consumed in different contexts.
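As a sketch of the last two suggestions combined, the helper below times several named delegates individually and returns typed results instead of writing to the console. BenchmarkResult and ComparisonBenchmark are hypothetical types invented for this example; they are not the API of any existing benchmark package.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical result type; the caller decides how to present or store it.
public sealed class BenchmarkResult
{
    public BenchmarkResult(string name, int iterations, TimeSpan elapsed)
    {
        Name = name;
        Iterations = iterations;
        Elapsed = elapsed;
    }

    public string Name { get; }
    public int Iterations { get; }
    public TimeSpan Elapsed { get; }
    public double MillisecondsPerIteration => Elapsed.TotalMilliseconds / Iterations;
}

public static class ComparisonBenchmark
{
    // Times each named delegate separately and returns one result per
    // delegate, in the order supplied.
    public static List<BenchmarkResult> Run(int iterations, IEnumerable<KeyValuePair<string, Action>> candidates)
    {
        var results = new List<BenchmarkResult>();
        foreach (KeyValuePair<string, Action> candidate in candidates)
        {
            candidate.Value(); // warm up

            var watch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                candidate.Value();
            }
            watch.Stop();

            results.Add(new BenchmarkResult(candidate.Key, iterations, watch.Elapsed));
        }
        return results;
    }

    public static void Main()
    {
        var candidates = new List<KeyValuePair<string, Action>>
        {
            new KeyValuePair<string, Action>("concat", () => string.Concat("a", "b")),
            new KeyValuePair<string, Action>("format", () => string.Format("{0}{1}", "a", "b")),
        };

        foreach (BenchmarkResult r in Run(100000, candidates))
        {
            Console.WriteLine("{0}: {1:F6} ms per iteration", r.Name, r.MillisecondsPerIteration);
        }
    }
}
```

A build system could serialize the returned list or compare it against a stored baseline, while a manual run can simply print it as Main does.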
Etimo.Benchmarks
Another approach is to use an existing component to perform the benchmarks. Actually, at my company we decided to release our benchmark tool into the public domain. At its core, it manages the garbage collector, jitter, warmups etc., just like some of the other answers here suggest. It also has the three features I suggested above. It handles several of the issues discussed in Eric Lippert's blog.
This is an example output where two components are compared and the results are written to the console. In this case the two components compared are called 'KeyedCollection' and 'MultiplyIndexedKeyedCollection':
There is a NuGet package, a sample NuGet package and the source code is available at GitHub. There is also a blog post.
If you're in a hurry, I suggest you get the sample package and simply modify the sample delegates as needed. If you're not in a hurry, it might be a good idea to read the blog post to understand the details.
You must also run a "warm up" pass prior to the actual measurement to exclude the time the JIT compiler spends on jitting your code.
Depending on the code you are benchmarking and the platform it runs on, you may need to account for how code alignment affects performance. To do so would probably require an outer wrapper that ran the test multiple times (in separate app domains or processes?), some of the times first calling "padding code" to force it to be JIT compiled, so as to cause the code being benchmarked to be aligned differently. A complete test result would give the best-case and worst-case timings for the various code alignments.
If you're trying to eliminate garbage collection impact from the benchmark completely, is it worth setting GCSettings.LatencyMode?
If not, and you want the impact of garbage created in func to be part of the benchmark, then shouldn't you also force collection at the end of the test (inside the timer)?
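For reference, setting the latency mode around the timed loop could look like the sketch below. Whether it actually improves the measurements is exactly the open question above; the mode is a request to the runtime rather than a guarantee that no collection happens, and SustainedLowLatency is just one possible choice. The Measure helper is a made-up name.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

public static class LowLatencyTiming
{
    public static TimeSpan Measure(int iterations, Action func)
    {
        func(); // warm up

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            // Ask the runtime to avoid blocking collections where it can.
            GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

            var watch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                func();
            }
            watch.Stop();
            return watch.Elapsed;
        }
        finally
        {
            // Always restore the original mode, even if func throws.
            GCSettings.LatencyMode = previous;
        }
    }

    public static void Main()
    {
        TimeSpan elapsed = Measure(100000, () => new object());
        Console.WriteLine("Elapsed: {0} ms", elapsed.TotalMilliseconds);
    }
}
```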
The basic problem with your question is the assumption that a single
measurement can answer all your questions. You need to measure
multiple times to get an effective picture of the situation and
especially in a garbage collected language like C#.
Another answer gives an okay way of measuring the basic performance.
static void Profile(string description, int iterations, Action func) {
// warm up
func();
var watch = new Stopwatch();
// clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
watch.Start();
for (int i = 0; i < iterations; i++) {
func();
}
watch.Stop();
Console.Write(description);
Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
}
However, this single measurement does not account for garbage
collection. A proper profile additionally accounts for the worst case performance
of garbage collection spread out over many calls (this number is sort
of useless as the VM can terminate without ever collecting left over
garbage but is still useful for comparing two different
implementations of func.)
static void ProfileGarbageMany(string description, int iterations, Action func) {
// warm up
func();
var watch = new Stopwatch();
// clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
watch.Start();
for (int i = 0; i < iterations; i++) {
func();
}
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
watch.Stop();
Console.Write(description);
Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
}
And one might also want to measure the worst case performance of
garbage collection for a method that is only called once.
static void ProfileGarbage(string description, int iterations, Action func) {
// warm up
func();
var watch = new Stopwatch();
// clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
watch.Start();
for (int i = 0; i < iterations; i++) {
func();
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
watch.Stop();
Console.Write(description);
Console.WriteLine(" Time Elapsed {0} ms", watch.Elapsed.TotalMilliseconds);
}
But more important than recommending any specific possible additional
measurements to profile is the idea that one should measure multiple
different statistics and not just one kind of statistic.
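One way to act on that is to take several samples and report more than one number. The sketch below collects repeated timings and prints the minimum, median, mean and maximum; the type and method names are made up for the example, and the workload in Main is a placeholder.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class MultiStatProfiler
{
    // Takes `samples` separate timings of `iterationsPerSample` calls each
    // and returns them in milliseconds, sorted ascending.
    public static double[] SampleMilliseconds(int samples, int iterationsPerSample, Action func)
    {
        func(); // warm up

        var times = new double[samples];
        for (int s = 0; s < samples; s++)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var watch = Stopwatch.StartNew();
            for (int i = 0; i < iterationsPerSample; i++)
            {
                func();
            }
            watch.Stop();
            times[s] = watch.Elapsed.TotalMilliseconds;
        }

        Array.Sort(times);
        return times;
    }

    // Median of an already sorted, non-empty array.
    public static double Median(double[] sorted)
    {
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void Main()
    {
        double[] times = SampleMilliseconds(9, 100000, () => Math.Sqrt(12345.678));
        Console.WriteLine("min    {0:F3} ms", times.First());
        Console.WriteLine("median {0:F3} ms", Median(times));
        Console.WriteLine("mean   {0:F3} ms", times.Average());
        Console.WriteLine("max    {0:F3} ms", times.Last());
    }
}
```

A large gap between the median and the maximum is a hint that something other than the code under test, such as a collection or another process, landed inside one of the samples.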