I am trying to assess the performance of a program I'm writing.
I have a method:
public double FooBar(ClassA firstArg, EnumType secondArg)
{
[...]
If I check the Function Details in the VS Performance Analyzer for FooBar, I can see that the method accounts for 14% of the total time (inclusive) and that 10% is spent in the body of the method itself. The thing that I cannot understand is that it looks like 6.5% of the total time (both inclusive and exclusive) is spent on the opening brace of this method; it is actually the most time-consuming line in the code (as far as exclusive time is concerned).
The method is not overriding any other method. The profiling is done in the Debug configuration using sampling; the run lasts about 150 s, and that 6.5% corresponds to more than 3000 samples out of a total of 48000.
Can someone explain to me what is happening on this line and whether there is a way to improve that behaviour?
The first open curly brace of a method is where the profiler reports the time spent on method initialization.
During method initialization, the local variables are allocated and initialized.
Be aware that all the local variables of the method are initialized before execution begins, even if they are declared in the middle of the body.
To reduce the initialization time, try moving local variables to the heap or, if they are only used sometimes (such as variables inside an if branch or after an early return), extract the piece of code that uses them into another method, as sketched below.
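For illustration, a minimal sketch of the extraction idea (only the FooBar signature comes from the question; EnumType.Simple, BaseValue and the helper are made-up names):

public double FooBar(ClassA firstArg, EnumType secondArg)
{
    // Fast path: no locals are declared here, so the method prologue stays cheap.
    if (secondArg == EnumType.Simple)
        return firstArg.BaseValue;

    return ComputeComplexCase(firstArg);
}

private static double ComputeComplexCase(ClassA firstArg)
{
    // The locals needed only for the complex case now live (and get zero-initialized)
    // in this helper, so the common path never pays for them.
    double accumulator = 0.0;
    double[] workBuffer = new double[256];
    for (int i = 0; i < workBuffer.Length; i++)
    {
        workBuffer[i] = firstArg.BaseValue * i;
        accumulator += workBuffer[i];
    }
    return accumulator;
}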
For a really simple code snippet, I'm trying to see how much of the time is spent actually allocating objects on the small object heap (SOH).
using System;
using System.Collections;

static void Main(string[] args)
{
const int noNumbers = 10000000; // 10 mil
ArrayList numbers = new ArrayList();
Random random = new Random(1); // use the same seed as to make
// benchmarking consistent
for (int i = 0; i < noNumbers; i++)
{
int currentNumber = random.Next(10); // generate a non-negative
// random number less than 10
object o = currentNumber; // BOXING occurs here
numbers.Add(o);
}
}
In particular, I want to know how much time is spent allocating space for all the boxed int instances on the heap (I know, this is an ArrayList and there's horrible boxing going on as well - but it's just for educational purposes).
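As a complementary check (bytes rather than time), on runtimes that expose it, GC.GetAllocatedBytesForCurrentThread() can confirm how much the loop allocates; a rough sketch, with the method name and the pre-sizing being my additions:

using System;
using System.Collections;

static void MeasureBoxingAllocations()
{
    const int noNumbers = 10000000; // 10 mil
    ArrayList numbers = new ArrayList(noNumbers); // pre-sized so list growth doesn't pollute the numbers
    Random random = new Random(1);

    long before = GC.GetAllocatedBytesForCurrentThread();
    for (int i = 0; i < noNumbers; i++)
    {
        object o = random.Next(10); // boxing allocates on the SOH
        numbers.Add(o);
    }
    long after = GC.GetAllocatedBytesForCurrentThread();

    // Roughly 24 bytes per boxed int on x64 (16 bytes of object overhead + 4-byte payload, padded).
    Console.WriteLine($"Allocated ~{after - before:N0} bytes");
}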
The CLR has 2 ways of performing memory allocations on the SOH: either calling the JIT_TrialAllocSFastMP (for multi-processor systems, ...SFastSP for single-processor ones) allocation helper - which is really fast, since it consists of a few assembly instructions - or falling back to the slower JIT_New allocation helper.
PerfView has no problem showing JIT_New being invoked:
However, I can't figure out which - if any - is the native function involved in the "quick way" of allocating. I certainly don't see any JIT_TrialAllocSFastMP. I've already tried raising the loop count (from 10 to 500 million), in the hope of increasing my chances of getting a glimpse of a few stacks containing the elusive function, but to no avail.
Another approach was to use the JetBrains dotTrace (line-by-line) performance viewer, but it falls short of what I want: I do get to see the approximate time the boxing operation takes for each int, but 1) it's just a bar, and 2) it covers both the allocation itself and the copying of the value (and the latter is not what I'm after).
Using the JetBrains dotTrace Timeline viewer won't work either, since they currently don't (quite) support native callstacks.
At this point it's unclear to me whether there's a method being dynamically generated and called when JIT_TrialAllocSFastMP is invoked - and by some miracle none of the PerfView-collected stack samples (one every 1 ms) ever captures it - or whether Main's method body somehow gets patched and those few assembly instructions mentioned above are injected directly into the code. It's also hard to believe that the fast way of allocating memory is never called.
You could ask "But you already have the .NET Core CLR code, why can't you figure out yourself ?". Since the .NET Framework CLR code is not publicly available, I've looked into its sibling, the .NET Core version of the CLR (as Matt Warren recommends in his step 6 here). The \src\vm\amd64\JitHelpers_InlineGetThread.asm file contains a JIT_TrialAllocSFastMP_InlineGetThread function. The issue is that parsing/understanding the C++ code there is above my grade, and also I can't really think of a way to "Step Into" and see how the JIT-ed code is generated, since this is way lower-level that your usual Press-F11-in-Visual-Studio.
Update 1: Let's simplify the code, and only consider individual boxed int values:
const int noNumbers = 10000000; // 10 mil
object o = null;
for (int i = 0; i < noNumbers; i++)
{
o = i;
}
Since this is a Release build, and dead code elimination could kick in, WinDbg is used to check the final machine code.
The resulting JITed code - whose main loop, which simply does repeated boxing, is highlighted in blue below - shows that the method that handles the memory allocation is not inlined (note the call to hex address 00af30f4):
This method in turn tries to allocate via the "fast" way and, if that fails, falls back to the "slow" way of calling JIT_New itself:
It's interesting how the call stack in PerfView obtained from the code above doesn't show any intermediary method between the level of Main and the JIT_New entry itself (given that Main doesn't directly call JIT_New):
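If the goal is just to quantify what the boxing allocations cost overall (rather than to pin down which native helper runs), a BenchmarkDotNet comparison with MemoryDiagnoser is a simpler angle; a sketch, assuming the BenchmarkDotNet NuGet package, with all names invented:

using System.Collections;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // reports bytes allocated and GC counts per operation
public class BoxingBenchmark
{
    private const int NoNumbers = 10_000_000;

    [Benchmark(Baseline = true)]
    public long NoBoxing()
    {
        long sum = 0;
        for (int i = 0; i < NoNumbers; i++)
            sum += i; // int stays unboxed
        return sum;
    }

    [Benchmark]
    public ArrayList Boxing()
    {
        var numbers = new ArrayList(NoNumbers);
        for (int i = 0; i < NoNumbers; i++)
            numbers.Add(i); // implicit boxing: one heap allocation per iteration
        return numbers;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<BoxingBenchmark>();
}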
I have a simple class intended to store scaled integral values, using member variables "scaled_value" (a long) together with a "scale_factor". I have a constructor that fills a new class instance from a decimal value (although I think the value type is irrelevant).
Assignment to the "scaled_value" slot appears... to not happen. I've inserted an explicit assignment of the constant 1 to it, and the Debug.Assert below it fails... scaled_value is zero.
At the assertion break, in the Immediate window I can inspect "scale_factor", set it with an assignment, and inspect it again; it changes as I set it. I can also inspect "scaled_value". It is always zero. I can type an assignment to it, which the Immediate window executes, but its value doesn't change.
I'm using Visual Studio 2017 with C# 2017.
What is magic about this slot?
public class ScaledLong : Base // handles scaled-by-power-of-ten long numbers
// intended to support equivalent of fast decimal arithmetic while hiding scale factors from user
{
public long scaled_value; // up to log10_MaxLong digits of decimal precision
public sbyte scale_factor; // power of ten representing location of decimal point range -21..+21. Set by constructor AND NEVER CHANGED.
public byte byte_size; // holds size of value in underlying memory array
string format_string;
<other constructors with same arguments except last value type>
public ScaledLong(sbyte sf, byte size, string format, decimal initial_value)
{
scale_factor = sf;
byte_size = size;
format_string = format;
decimal temp;
sbyte exponent;
{ // rip exponent out of decimal value leaving behind an integer;
_decimal_structure.value = initial_value;
exponent = (sbyte)_decimal_structure.Exponent;
_decimal_structure.Exponent = 0; // now decimal value is integral
temp = _decimal_structure.value;
}
sbyte sfDelta = (sbyte)(sf - exponent);
if (sfDelta >= 0)
{ // sfDelta >= 0
this.scaled_value = 1;
Debug.Assert(scaled_value == 1);
scaled_value = (long)Math.Truncate(temp * DecimalTenToPower[sfDelta]);
}
else
{
temp = Math.Truncate(temp / DecimalHalfTenToPower[-sfDelta]);
temp += (temp % 2); // this can overflow for values at the very top of the range, not worth fixing; note: this works for both + and - numbers (?)
scaled_value = (long)(temp / 2); // final result
}
}
The biggest puzzles often have the stupidest foundations. This one is a lesson in unintended side effects.
I found this by thinking it over and wondering how on earth a member could get modified in unexpected ways. I found the solution before I read #mjwills' comment, but he was definitely sniffing at the right thing.
What I left out (of course!) was that I had just coded a ToString() method for the class... that wasn't debugged. Why did I leave it out? Because it obviously can't affect anything so it can't be part of the problem.
Bzzzzt! It used the member variable as a scratchpad and zeroed it (there's the side effect); that was obviously unintended.
What this means is that when the code just runs, ToString() isn't called and the member variable DOES get modified correctly. (I even had unit tests for the "Set" routine that checked all that, and they were passing.)
But when you are debugging... the debugger can (and did in this case) show local variables. To do that, it will apparently call ToString() to get a nice displayable value. So the act of single stepping caused ToString() to get called, and its buggy scratch-variable assignment zeroed out the slot after each step.
So it wasn't a setter that bit me. It was arguably a getter. (Where is FORTRAN's PURE keyword when you need it?)
Einstein hated spooky actions at a distance. Programmers hate spooky side effects at a distance.
One wonders a bit at the idea of the debugger calling ToString() on a class whose constructor hasn't finished. What assertions about the state of the class can ToString trust, given the constructor isn't done? I think the MS debugger should be fixed. With that fixed, I would have spent my time debugging ToString instead of chasing this.
Thanks for putting up with my question. It got me to the answer.
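A minimal sketch of the pitfall described above (the class and field names here are made up, not the original code):

using System.Diagnostics;

public class ScaledValueRepro
{
    public long scaled_value;

    public ScaledValueRepro()
    {
        scaled_value = 1;
        // Single-step here with this object visible in Locals/Watch and the debugger
        // may call ToString() to render it, silently resetting the field.
        Debug.Assert(scaled_value == 1); // can fail while stepping under the debugger
    }

    public override string ToString()
    {
        scaled_value = 0; // BUG: uses the field as a scratchpad (the side effect)
        return scaled_value.ToString();
    }
}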
If you still have a copy of that old/buggy code, it would be interesting to try to build it under VS 2019 and Rider (hopefully the latest, 2022.1.1 at this point) with ReSharper (built in) allowed to do its picky scan, and with a .ruleset allowed to bitch about just about anything (just for the 1st build - you'll turn off a lot, but you need it to scream in order to see what to turn off). And with .NET 5.0 or 6.0.
The reason I mention it is that I remember MS bragging about doing dataflow analysis to some degree in 2019, and I did see Rider complaining about some "unsafe assignments". If the old code is long lost - never mind.
CTOR-wise, if the CTOR hasn't finished yet, we all know that the object "doesn't exist" yet and has invalid state, but to circumvent that, C# uses default values for everything. When you see code with constant assignments at the point of definition of data members that look trivial and pointless, the reason is that a lot of people do remember C++ and don't trust implicit defaults - just in case :-)
There is a 2-phase/round initialization sequence with 2 CTOR-s and implicit initializations in between. It's not widely documented (so that people with weak hearts don't use it :-) but it is completely deterministic and thread-safe (hidden fuses everywhere). Just for the sake of its stability you never, ever want to have a call to any method before the 2nd round is done (a plain CTOR being done still doesn't mean a fully constructed object, and any method invocation from the outside may trigger the 2nd round prematurely).
The 1st (plain) CTOR can be used in implicit initializations before the 2nd runs => you can control the (implicit) ordering; you just have to be careful and step through it in the debugger.
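For reference, here's a small sketch of the ordering C# itself guarantees for a single object: field initializers first, then the base constructor, then the constructor body, with the static constructor running once before any of that:

using System;

public class BaseThing
{
    public BaseThing() => Console.WriteLine("2. base constructor body");
}

public class Thing : BaseThing
{
    // Instance field initializers run before the base constructor is called.
    private readonly int _a = Log("1. field initializer", 1);

    // The static constructor runs once, before the first instance is created.
    static Thing() => Console.WriteLine("0. static constructor");

    public Thing() => Console.WriteLine($"3. derived constructor body, _a = {_a}");

    private static int Log(string message, int value)
    {
        Console.WriteLine(message);
        return value;
    }
}

public static class Program
{
    public static void Main() => _ = new Thing(); // prints 0, 1, 2, 3 in that order
}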
Oh, and .ToString normally shouldn't be defined at all - on purpose :-) It's de facto intrinsic => the compiler can take its liberties with it. Plus, if you define it, pretty soon you'll be obliged to support (and process) format specifiers.
I used to define ToJson (before big libs came to the fore) to provide, let's say, a controllable printable form (which can also go over the wire and is 10-100 times faster than deserialization). These days the VS debugger has a collection of "visualizers" and an option to tell the debugger whether to use them or not (when that's off, it will jerk ToString's chain if it sees it).
Also, it's good to have dotPeek (or actual Reflector, owned by Redgate these days) with "find source code" turned off. Then you see the real generated code, which is sometimes glorious (String is intrinsic and the compiler goes a few extra miles to optimize its operations) and sometimes ugly (async/await - total faker, inefficient and flat-out dangerous - how do you say "deadlock" in C# :-) - not kidding), but you need to be able to see the final code or you are driving blind.
Environment: Visual Studio 2015 RTM. (I haven't tried older versions.)
Recently, I've been debugging some of my Noda Time code, and I've noticed that when I've got a local variable of type NodaTime.Instant (one of the central struct types in Noda Time), the "Locals" and "Watch" windows don't appear to call its ToString() override. If I call ToString() explicitly in the watch window, I see the appropriate representation, but otherwise I just see:
variableName {NodaTime.Instant}
which isn't very useful.
If I change the override to return a constant string, the string is displayed in the debugger, so it's clearly able to pick up that it's there - it just doesn't want to use it in its "normal" state.
I decided to reproduce this locally in a little demo app, and here's what I've come up with. (Note that in an early version of this post, DemoStruct was a class and DemoClass didn't exist at all - my fault, but it explains some comments which look odd now...)
using System;
using System.Diagnostics;
using System.Threading;
public struct DemoStruct
{
public string Name { get; }
public DemoStruct(string name)
{
Name = name;
}
public override string ToString()
{
Thread.Sleep(1000); // Vary this to see different results
return $"Struct: {Name}";
}
}
public class DemoClass
{
public string Name { get; }
public DemoClass(string name)
{
Name = name;
}
public override string ToString()
{
Thread.Sleep(1000); // Vary this to see different results
return $"Class: {Name}";
}
}
public class Program
{
static void Main()
{
var demoClass = new DemoClass("Foo");
var demoStruct = new DemoStruct("Bar");
Debugger.Break();
}
}
In the debugger, I now see:
demoClass {DemoClass}
demoStruct {Struct: Bar}
However, if I reduce the Thread.Sleep call down from 1 second to 900ms, there's still a short pause, but then I see Class: Foo as the value. It doesn't seem to matter how long the Thread.Sleep call is in DemoStruct.ToString(), it's always displayed properly - and the debugger displays the value before the sleep would have completed. (It's as if Thread.Sleep is disabled.)
Now Instant.ToString() in Noda Time does a fair amount of work, but it certainly doesn't take a whole second - so presumably there are more conditions that cause the debugger to give up evaluating a ToString() call. And of course it's a struct anyway.
I've tried recursing to see whether it's a stack limit, but that appears not to be the case.
So, how can I work out what's stopping VS from fully evaluating Instant.ToString()? As noted below, DebuggerDisplayAttribute appears to help, but without knowing why, I'm never going to be entirely confident in when I need it and when I don't.
Update
If I use DebuggerDisplayAttribute, things change:
// For the sample code in the question...
[DebuggerDisplay("{ToString()}")]
public class DemoClass
gives me:
demoClass Evaluation timed out
Whereas when I apply it in Noda Time:
[DebuggerDisplay("{ToString()}")]
public struct Instant
a simple test app shows me the right result:
instant "1970-01-01T00:00:00Z"
So presumably the problem in Noda Time is some condition that DebuggerDisplayAttribute does force through - even though it doesn't force through timeouts. (This would be in line with my expectation that Instant.ToString is easily fast enough to avoid a timeout.)
This may be a good enough solution - but I'd still like to know what's going on, and whether I can change the code simply to avoid having to put the attribute on all the various value types in Noda Time.
Curiouser and curiouser
Whatever is confusing the debugger only confuses it sometimes. Let's create a class which holds an Instant and uses it for its own ToString() method:
using NodaTime;
using System.Diagnostics;
public class InstantWrapper
{
private readonly Instant instant;
public InstantWrapper(Instant instant)
{
this.instant = instant;
}
public override string ToString() => instant.ToString();
}
public class Program
{
static void Main()
{
var instant = NodaConstants.UnixEpoch;
var wrapper = new InstantWrapper(instant);
Debugger.Break();
}
}
Now I end up seeing:
instant {NodaTime.Instant}
wrapper {1970-01-01T00:00:00Z}
However, at the suggestion of Eren in comments, if I change InstantWrapper to be a struct, I get:
instant {NodaTime.Instant}
wrapper {InstantWrapper}
So it can evaluate Instant.ToString() - so long as that's invoked by another ToString method... which is within a class. The class/struct part seems to be important based on the type of the variable being displayed, not what code needs to be executed in order to get the result.
As another example of this, if we use:
object boxed = NodaConstants.UnixEpoch;
... then it works fine, displaying the right value. Colour me confused.
Update:
This bug has been fixed in Visual Studio 2015 Update 2. Let me know if you are still running into problems evaluating ToString on struct values using Update 2 or later.
Original Answer:
You are running into a known bug/design limitation with Visual Studio 2015 and calling ToString on struct types. This can also be observed when dealing with System.DateTimeSpan. System.DateTimeSpan.ToString() works in the evaluation windows with Visual Studio 2013, but does not always work in 2015.
If you are interested in the low level details, here's what's going on:
To evaluate ToString, the debugger does what's known as "function evaluation". In greatly simplified terms, the debugger suspends all threads in the process except the current thread, changes the context of the current thread to the ToString function, sets a hidden guard breakpoint, then allows the process to continue. When the guard breakpoint is hit, the debugger restores the process to its previous state and the return value of the function is used to populate the window.
To support lambda expressions, we had to completely rewrite the CLR Expression Evaluator in Visual Studio 2015. At a high level, the implementation is:
1. Roslyn generates MSIL code for expressions/local variables to get the values to be displayed in the various inspection windows.
2. The debugger interprets the IL to get the result.
3. If there are any "call" instructions, the debugger executes a function evaluation as described above.
4. The debugger/Roslyn takes this result and formats it into the tree-like view that's shown to the user.
Because of the execution of IL, the debugger is always dealing with a complicated mix of "real" and "fake" values. Real values actually exist in the process being debugged. Fake values only exist in the debugger process. To implement proper struct semantics, the debugger always needs to make a copy of the value when pushing a struct value to the IL stack. The copied value is no longer a "real" value and now only exists in the debugger process. That means if we later need to perform function evaluation of ToString, we can't because the value doesn't exist in the process. To try and get the value we need to emulate execution of the ToString method. While we can emulate some things, there are many limitations. For example, we can't emulate native code and we can't execute calls to "real" delegate values or calls on reflection values.
With all of that in mind, here is what's causing the various behaviors you are seeing:
- The debugger isn't evaluating NodaTime.Instant.ToString -> This is because it is a struct type and the implementation of ToString can't be emulated by the debugger, as described above.
- Thread.Sleep seems to take zero time when called by ToString on a struct -> This is because the emulator is executing ToString. Thread.Sleep is a native method, but the emulator is aware of it and just ignores the call. We do this to try and get a value to show to the user. A delay wouldn't be helpful in this case.
- DebuggerDisplay("{ToString()}") works -> That is confusing. The only difference between the implicit calling of ToString and DebuggerDisplay is that any time-outs of the implicit ToString evaluation will disable all implicit ToString evaluations for that type until the next debug session. You may be observing that behavior.
In terms of the design problem/bug, this is something we are planning to address in a future release of Visual Studio.
Hopefully that clears things up. Let me know if you have more questions. :-)
The project I am working on involves requesting an XML document from a web service; the server side constructs the XML. The XML may have many nodes, so the performance is not that good.
I used the Visual Studio 2010 profiler to analyze the performance issue.
I found that the most time-consuming function is System.Collections.Generic.ICollection`1.get_Count(), which is actually the Count property of the generic List. This function is called about 9000 times.
The performance data is shown below:
The Elapsed exclusive time is 4154.14 ms, while the Application exclusive time is just 0.52 ms.
I know the difference between Elapsed exclusive time and Application exclusive time:
Application exclusive time excludes the time spent on context switches.
But how could context switching happen when the code just reads the Count property of the generic List?
I am very confused by the profiler data. Can anyone provide some information? Thanks a lot!
Actually the decompiled sources show the following for List<T>:
[__DynamicallyInvokable]
public int Count
{
[__DynamicallyInvokable, TargetedPatchingOptOut("Performance critical to inline this type of method across NGen image boundaries")] get
{
return this._size;
}
}
It's literally returning the value of a field and doing nothing else. I'd suggest your performance hit is elsewhere, or you're misinterpreting your profiler's output.
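That said, if the getter really is hot because it is called through the interface inside a loop (an interface call the older JIT won't inline, so it shows up as real samples), one cheap experiment is to hoist the count into a local; a hedged sketch, since the question's code isn't shown:

using System.Collections.Generic;

// Hedged sketch (invented names): Count re-read through the interface on every
// iteration is a virtual call that shows up in a sampling profiler.
static long SumBefore(IList<int> values)
{
    long sum = 0;
    for (int i = 0; i < values.Count; i++)   // Count re-evaluated each iteration
        sum += values[i];
    return sum;
}

static long SumAfter(IList<int> values)
{
    long sum = 0;
    int count = values.Count;                // read once, outside the loop
    for (int i = 0; i < count; i++)
        sum += values[i];
    return sum;
}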
I have a problem with time measuring that's really bothering me. I am executing something like the following code (in C#):
Stopwatch sw = Stopwatch.StartNew();
Foo(args);
sw.Stop();
//log time
public void Foo(var args)
{
Stopwatch sw = Stopwatch.StartNew();
//do stuff
sw.Stop();
//log time
}
And the result is a big difference between the two times: my code gives me 15535 ms from inside the function and 15668 ms from outside... 133 ms seems like a lot to me for a function call (even with the 10 params I am giving to mine), or to blame on Stopwatch precision (which is supposed to be very precise).
How would you explain this difference in times ?
Note 1: the same thing happens on several successive calls: I am getting 133, 81, 72, 75, 75 milliseconds of difference for 5 calls.
Note 2: the actual parameters of my function are:
6 class objects
one array of structs (the array is passed as a reference, right? see the sketch after this list)
2 ref int
1 out byte[]
1 out class
1 out struct of small size (< 25 bytes)
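On the parenthetical question in that list: an array is a reference type, so only the reference (a pointer-sized value) is copied at the call site, no matter how many struct elements it holds; a quick sketch with invented names:

using System;

struct Point { public int X, Y; }

static class ArrayPassingDemo
{
    // Only the array reference is copied into 'points'; the struct elements
    // themselves are not copied at the call site.
    static void Mutate(Point[] points) => points[0].X = 42;

    static void Main()
    {
        var points = new Point[1_000_000];
        Mutate(points);
        Console.WriteLine(points[0].X); // 42: the caller sees the change, so it's the same array
    }
}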
Update:
In Release, the difference for the first call is even bigger (is JIT compilation more expensive in Release, which could explain that?), and the subsequent calls have the same overhead (~75 ms).
I tried initializing the stopwatches outside, passing one as a parameter, and logging outside of the function; the difference is still there.
I also forgot that I am passing some properties as parameters that have to be constructed the first time, so the 50 ms difference for the first call might be explained by property initialization and JIT compilation.
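If the goal is to separate JIT and first-use cost from the steady-state cost, one common approach is an untimed warm-up call before the measured one - a sketch reusing the snippet above, assuming an extra call to Foo is harmless:

using System;
using System.Diagnostics;

// Warm-up: the first call pays for JIT compilation of Foo and for any one-time
// initialization of the things it touches, so run it once without timing it.
Foo(args);

Stopwatch sw = Stopwatch.StartNew();
Foo(args);      // steady-state call
sw.Stop();
Console.WriteLine($"Steady-state call: {sw.ElapsedMilliseconds} ms");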
My bad; it was a property calling a property that did some disk reads. I thought it was a simple member and didn't dig deep enough. I took those calls out of the function call, and the times are now almost the same (0-1 ms; I guess that's the logging).
The moral is: properties should not have side effects. If you make a property that does something non-obvious, write a method instead, or at least warn the next developer about what you are doing in the property's documentation!
And the moral of the moral is: if something looks suspicious, always look at the call tree down to the deepest level!
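As a hedged sketch of that moral (invented names, not the original code): an innocent-looking property hiding I/O versus a method whose call-site syntax makes the cost obvious.

using System.IO;

public class ReportSource
{
    private readonly string _path;

    public ReportSource(string path) => _path = path;

    // Looks like a cheap member access, but every read hits the disk:
    // evaluating it in an argument list (or in the debugger) is expensive.
    public string Content => File.ReadAllText(_path);

    // The call-site syntax now advertises that real work happens here.
    public string LoadContent() => File.ReadAllText(_path);
}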