For a really simple code snippet, I'm trying to see how much of the time is spent actually allocating objects on the small object heap (SOH).
static void Main(string[] args)
{
const int noNumbers = 10000000; // 10 mil
ArrayList numbers = new ArrayList();
Random random = new Random(1); // use the same seed to keep
// benchmarking consistent
for (int i = 0; i < noNumbers; i++)
{
int currentNumber = random.Next(10); // generate a non-negative
// random number less than 10
object o = currentNumber; // BOXING occurs here
numbers.Add(o);
}
}
In particular, I want to know how much time is spent allocating space for all the boxed int instances on the heap (I know, this is an ArrayList and there's horrible boxing going on as well - but it's just for educational purposes).
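As a coarse, managed-side baseline (a sketch only - it bounds the total boxing cost rather than isolating the heap allocation I'm really after), one can compare the boxing ArrayList loop against a List<int> that doesn't box:
// requires: using System; using System.Collections;
//           using System.Collections.Generic; using System.Diagnostics;
static void CompareBoxing()
{
    const int noNumbers = 10000000; // 10 million

    var sw = Stopwatch.StartNew();
    var boxed = new ArrayList(noNumbers);
    var r1 = new Random(1);
    for (int i = 0; i < noNumbers; i++)
        boxed.Add(r1.Next(10));                 // boxes every int
    sw.Stop();
    Console.WriteLine("ArrayList (boxing):    " + sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var unboxed = new List<int>(noNumbers);
    var r2 = new Random(1);
    for (int i = 0; i < noNumbers; i++)
        unboxed.Add(r2.Next(10));               // no boxing
    sw.Stop();
    Console.WriteLine("List<int> (no boxing): " + sw.ElapsedMilliseconds + " ms");
}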
The CLR has 2 ways of performing memory allocations on the SOH: either calling the JIT_TrialAllocSFastMP allocation helper (for multi-processor systems; ...SFastSP for single-processor ones) - which is really fast since it consists of a few assembly instructions - or falling back to the slower JIT_New allocation helper.
PerfView captures JIT_New being invoked just fine:
However, I can't figure out which - if any - is the native function involved in the "quick way" of allocating. I certainly don't see any JIT_TrialAllocSFastMP. I've already tried raising the loop count (from 10 million to 500 million), in the hope of increasing my chances of getting a glimpse of a few stacks containing the elusive function, but to no avail.
Another approach was to use the JetBrains dotTrace (line-by-line) performance viewer, but it falls short of what I want: I do get to see the approximate time the boxing operation takes for each int, but 1) it's just a bar and 2) it lumps together the allocation itself and the copying of the value (and the latter is not what I'm after).
Using the JetBrains dotTrace Timeline viewer won't work either, since they currently don't (quite) support native callstacks.
At this point it's unclear to me whether there's a method being dynamically generated and called when JIT_TrialAllocSFastMP is invoked - and by some miracle none of the PerfView-collected stack samples (one every 1 ms) ever captures it - or whether Main's method body somehow gets patched, and those few assembly instructions mentioned above are injected directly into the code. It's also hard to believe that the fast way of allocating memory is never called.
You could ask "But you already have the .NET Core CLR code, why can't you figure it out yourself?". Since the .NET Framework CLR code is not publicly available, I've looked into its sibling, the .NET Core version of the CLR (as Matt Warren recommends in his step 6 here). The \src\vm\amd64\JitHelpers_InlineGetThread.asm file contains a JIT_TrialAllocSFastMP_InlineGetThread function. The issue is that parsing/understanding the code there is above my pay grade, and I also can't really think of a way to "Step Into" and see how the JIT-ed code is generated, since this is way lower-level than your usual Press-F11-in-Visual-Studio.
Update 1: Let's simplify the code, and only consider individual boxed int values:
const int noNumbers = 10000000; // 10 mil
object o = null;
for (int i = 0; i < noNumbers; i++)
{
o = i;
}
Since this is a Release build, and dead code elimination could kick in, WinDbg is used to check the final machine code.
The resulting JITed code, whose main loop (which simply does the repeated boxing) is highlighted in blue below, shows that the method handling the memory allocation is not inlined (note the call to hex address 00af30f4):
This method in turn tries to allocate via the "fast" way, and if that fails, falls back to the "slow" way of calling JIT_New itself:
It's interesting that the call stack PerfView obtains from the code above doesn't show any intermediate method between Main and the JIT_New entry itself (even though Main doesn't call JIT_New directly):
Related
I've got an idea for optimising a large jagged array. Let's say I have this C# array:
struct BlockData
{
internal short type;
internal short health;
internal short x;
internal short y;
internal short z;
internal byte connection;
}
BlockData[][][] blocks = null;
byte[] GetBlockTypes()
{
if (blocks == null)
blocks = InitializeJaggedArray<BlockData[][][]>(256, 64, 256);
//BlockData is struct
MemoryStream stream = new MemoryStream();
for (int x = 0; x < blocks.Length; x++)
{
for (int y = 0; y < blocks[x].Length; y++)
{
for (int z = 0; z < blocks[x][y].Length; z++)
{
stream.WriteByte((byte)blocks[x][y][z].type); // type is a short; narrow it to a byte for the stream
}
}
}
return stream.ToArray();
}
Would storing the Blocks as a BlockData*** in a C++ DLL and then using PInvoke to read/write them be more efficient than storing them in C# arrays?
Note: I'm unable to perform tests right now because my computer is currently in for service.
This sounds like a question where you should first read the speed rant, starting at part 2: https://ericlippert.com/2012/12/17/performance-rant/
This is such a minuscule difference - if it matters, you are probably in a realtime scenario. And .NET is the wrong choice for realtime scenarios to begin with. If you are in a realtime scenario, this is not going to be the only thing you have to work around - GC memory management and security checks come with the platform.
It is true that accessing an array in native C++ is faster than accessing it in .NET. .NET implements indexers as proper function calls, similar to properties, and .NET does verify that the index is valid. However, it is not as bad as you might think. The optimisations are pretty good: function calls can be inlined, repeated array accesses can be cached in a temporary variable where possible, and even the bounds check is not safe from sensible removal. So it is not as big an advantage as you might think.
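To illustrate that last point, a sketch (whether the check is actually removed depends on the JIT version and settings): when the loop bound is the array's own Length, the JIT can typically prove the index is in range and drop the per-access check.
static long Sum(int[] data)
{
    long sum = 0;
    // canonical pattern: the JIT can prove 0 <= i < data.Length,
    // so the per-element bounds check can usually be elided
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}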
As others pointed out, P/Invoke will consume any gains there might be, with its overhead. But actually going into a different environment is unnecessary:
The thing is, you can also use naked pointers in .NET. You have to enable it with unsafe code, but it is there. You can then acquire a piece of unmanaged memory and treat it like an array in native C++. Of course that subjects you to mistakes like messing up the pointer arithmetic or overflowing a buffer - the exact reasons those checks exist in the first place!
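A minimal sketch of that idea (the flat layout and the 256x64x256 sizes are assumptions taken from the question; the project must allow unsafe code):
// requires: using System; using System.Runtime.InteropServices; compile with /unsafe
unsafe class UnmanagedBlocks
{
    const int SizeX = 256, SizeY = 64, SizeZ = 256;
    BlockData* blocks; // raw pointer into unmanaged memory - no bounds checks at all

    public UnmanagedBlocks()
    {
        blocks = (BlockData*)Marshal.AllocHGlobal(sizeof(BlockData) * SizeX * SizeY * SizeZ);
    }

    public BlockData* At(int x, int y, int z)
    {
        // flat indexing replaces the jagged array; a wrong index here
        // corrupts memory instead of throwing IndexOutOfRangeException
        return blocks + ((x * SizeY + y) * SizeZ + z);
    }

    public void Release()
    {
        Marshal.FreeHGlobal((IntPtr)blocks);
    }
}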
Would storing the Blocks as a BlockData*** in a C++ DLL and then using PInvoke to read/write them be more efficient than storing them in C# arrays?
No, because P/Invoke has a significant overhead, whereas array access in C# .NET is compiled at runtime by the JIT to fairly efficient code with bounds-checks. Jagged-arrays in .NET also have adequate performance (the only weak-area in .NET is true multidimensional arrays, which is disappointing - but I don't believe your proposal would help that either).
Update: Multidimensional array performance in .NET Core actually seems worse than .NET Framework (if I'm reading this thread correctly).
Another way to look at it: GC and overall maintenance. Your proposal is essentially the same as allocating one big array and using (layer * layerSize + row * rowSize + column) to index it (a sketch of that .Net-only variant follows at the end of this answer). PInvoke gives you the following drawbacks:
you likely end up with an unmanaged allocation for the array. This makes the GC unaware of a large amount of allocated memory, so you need to make sure to notify the GC about it.
PInvoked calls can't be inlined by the JIT, unlike regular .Net calls
you need to maintain code in two languages
PInvoke is not as portable - it requires platform/bitness-specific native libraries, which adds a lot of fun when sharing your program.
and one possible gain:
removing the bounds checks .Net performs on arrays
A back-of-a-napkin calculation shows that at best the two will balance out in raw performance. I'd go with the .Net-only version as it is easier to maintain and causes less fun with the GC.
Additionally, when you hide chunk auto-generation/partially generated chunks behind an index method of the chunk, it is easier if everything is written in a single language... In reality, given that fully populated chunks consume a lot of memory, your main issue will likely be memory usage/memory access cost rather than the raw performance of iterating through elements. Try it and measure...
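The .Net-only flat-array variant mentioned at the top of this answer could look like this (a sketch; the 256x64x256 sizes and the BlockData struct are taken from the question):
class FlatBlocks
{
    const int SizeX = 256, SizeY = 64, SizeZ = 256;
    readonly BlockData[] blocks = new BlockData[SizeX * SizeY * SizeZ]; // one GC-tracked allocation

    static int Index(int x, int y, int z)
    {
        // layer * layerSize + row * rowSize + column
        return (x * SizeY + y) * SizeZ + z;
    }

    public BlockData Get(int x, int y, int z)
    {
        return blocks[Index(x, y, z)];
    }

    public byte[] GetBlockTypes()
    {
        var result = new byte[blocks.Length];
        for (int i = 0; i < blocks.Length; i++)
            result[i] = (byte)blocks[i].type; // sequential, cache-friendly pass
        return result;
    }
}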
On pg 96 of Pro .NET Performance - Optimize Your C# Applications it talks about GC eager root collection:
For each local variable, the JIT embeds into a table the addresses of
the earliest and latest instruction pointers where the variable is
still relevant as a root. The GC then uses these tables when it
performs its stack walk.
It then provides this example:
static void Main(string[] args)
{
Widget a = new Widget();
a.Use();
//...additional code
Widget b = new Widget();
b.Use();
//...additional code
Foo(); //static method
}
It then says:
The above discussion implies that breaking your code into smaller
methods and using fewer local variables is not just a good design
measure or a software engineering technique. With the .NET GC, it can
provide a performance benefit as well because you have fewer local
roots! It means less work for the JIT when compiling the method, less
space to be occupied by the root IP tables, and less work for the GC
when performing its stack walk.
I don't understand how breaking code into smaller methods would help.
I've broken the code up into this:
static void Main(string[] args)
{
UseWidgetA();
//...additional code
UseWidgetB();
//...additional code
Foo(); //static method
}
static void UseWidgetA()
{
Widget a = new Widget();
a.Use();
}
static void UseWidgetB()
{
Widget b = new Widget();
b.Use();
}
Fewer local roots:
Why are there fewer local roots?
There are still the same number of local roots, one local root in each method.
Less work for the JIT when compiling the method:
Surely this would make things worse because it would need 2 extra tables for the 2 extra methods. The JIT would also still need to record the earliest and latest instruction pointers where the variable is still relevant within each method, but it would just have more methods to do that for.
Less work for the GC when performing its stack walk:
How does having more smaller methods mean less work for the GC during the stack walk?
I'm not in Sasha's mind, but let me put in my two cents.
First of all, I perceive it as a generalized rule - when you split a method into smaller ones, there is a chance that some parts will not need to be JITted, because some subroutines are executed conditionally.
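A trivial sketch of that first point (the method names are made up for illustration): the body of DoRareWork below is only JIT-compiled if the branch is ever taken, whereas the same logic written inline would have been compiled together with the enclosing method.
static void Process(bool rare)
{
    DoCommonWork();     // JIT-compiled the first time it is called

    if (rare)
        DoRareWork();   // as a separate method, this is only JIT-compiled
                        // if the branch is ever taken
}

static void DoCommonWork() { /* ... */ }
static void DoRareWork()  { /* ... */ }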
Secondly, JITting indeed produces so-called GC info about live stack roots. The bigger the method, the bigger the GC info is. Theoretically, there should also be a bigger cost of interpreting it during the GC; however, this is mitigated by splitting the GC info into chunks. Note that information about stack-root liveness is stored only for so-called safe points. There are two types of methods:
partially interruptible - the only safe points are during calls to other methods. This makes a method less "suspendable" because the runtime needs to wait for such a safe point to suspend a method, but consumes less memory for the GC info.
fully interruptible - every instruction of a method is treated as a safe point, which obviously makes a method very "suspendable" but requires significant storage (of quantity similar to the code itself)
As the Book Of The Runtime says: "The JIT chooses whether to emit fully- or partially interruptible code based on heuristics to find the best trade-off between code quality, size of the GC info, and GC suspension latency."
In my opinion, smaller methods help the JIT to make better decisions (based on its heuristics) to make methods partially or fully interruptible.
In Visual Studio, is there a way to profile the time spent in the inner methods of the .NET Framework?
An example - consider a good-old fashioned ArrayList, and adding some random numbers to it:
static void Main(string[] args)
{
const int noNumbers = 10000; // 10k
ArrayList numbers = new ArrayList();
Random random = new Random(1); // use the same seed to keep
// benchmarking consistent
for (int i = 0; i < noNumbers; i++)
{
int currentNumber = random.Next(10); // generate a non-negative
// random number less than 10
numbers.Add(currentNumber); // BOXING occurs here
}
}
I can step into the .NET Framework source code just fine while debugging. One can use the default Microsoft symbols and the source code for .NET (as described in this answer) or go the dotPeek route (detailed here). As for the cleanest option of just using the Reference Source symbols - as Hans Passant was saying in his answer almost 5 years ago - the framework version (down to the security updates installed) for which the symbols were created would have to match your version exactly; you'd have to be really lucky to get that working (I wasn't). Bottom line: there are 2 ways I can use successfully to step into the .NET source code.
For the sample at hand, there aren't big differences between the Reference Source code and the dotPeek reverse-engineered code - that is for the methods invoked by ArrayList's Add - namely the Capacity setter and ArrayList's EnsureCapacity, of which the latter can be seen below (ReferenceSource code on the left, dotPeek source code on the right):
Running an "Instrumentation" profiling session will return a breakdown of the time spent in each method, but as far as the .NET types go, it appears that one only gets to see the methods the respective code called "directly" - in this case the method that adds elements to the ArrayList, the one that generates a random int, and the respective types' constructors. But there's no trace of EnsureCapacity or Capacity's setter, which are both invoked heavily by ArrayList's Add.
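For reference, the growth logic those two members implement looks roughly like this (a paraphrase from memory of the Reference Source, not the verbatim code):
// paraphrased: ArrayList.Add calls this when the backing array is full
private void EnsureCapacity(int min)
{
    if (_items.Length < min)
    {
        int newCapacity = _items.Length == 0 ? 4 : _items.Length * 2; // double each time
        if (newCapacity < min)
            newCapacity = min;
        Capacity = newCapacity; // the Capacity setter allocates the new array and copies the elements
    }
}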
Drilling down on a specific .NET method doesn't show any of the methods it called in turn, nor any source code (despite being able to see that very code earlier, while stepping into it with the debugger):
How can one get to see those additional, "inner" .NET methods? If Visual Studio can't do it, perhaps another tool can?
PS: There is a very similar question here; however, it's almost 10 years old, and there's not much there that sheds light on the problem.
Later Update: As KolA very well points out, JetBrains dotTrace can show this. A line-by-line profiling session below:
perhaps another tool can?
DotTrace can profile performance down to properties if that's what you're looking for. This example is for generic List<T> not ArrayList but I think it shouldn't matter.
We have a problem which seems to be caused by the constant allocation and deallocation of memory:
We have a rather complex system here, where a USB device is measuring arbitrary points and sending the measurement data to the PC at a rate of 50k samples per second. These samples are then collected as MeasurementTasks in the software for each point and afterwards processed, which requires even more memory because of the calculations involved.
Simplified each MeasurementTask looks like the following:
public class MeasurementTask
{
public LinkedList<Sample> Samples { get; set; }
public ComplexSample[] ComplexSamples { get; set; }
public Complex Result { get; set; }
}
Where Sample looks like:
public class Sample
{
public ushort CommandIndex;
public double ValueChannel1;
public double ValueChannel2;
}
and ComplexSample like:
public class ComplexSample
{
public double Channel1Real;
public double Channel1Imag;
public double Channel2Real;
public double Channel2Imag;
}
In the calculation process the Samples are first calculated into a ComplexSample each and then further processed until we get our Complex Result. After these calculations are done we release all the Sample and ComplexSample instances and the GC cleans them up soon after, but this results in a constant "up and down" of the memory usage.
This is how it looks at the moment with each MeasurementTask containing ~300k samples:
Now we sometimes have the problem that the sample buffer in our HW device overflows, as it can only store ~5000 samples (~100 ms) and it seems the application does not always read the device fast enough (we use BULK transfer with LibUSB/LibUSBDotNet). We tracked this problem down to this "memory up and down" based on the following facts:
the reading from the USB device happens in its own thread which runs at ThreadPriority.Highest, so the calculations should not interfere
CPU usage is between 1-5% on my 8-core CPU => <50% of one core
if we have (much) faster MeasurementTasks with only a few hundred samples each, the memory goes up and down only very little and the buffer never overflows (but the number of instances/second is the same, as the device still sends 50k samples/second)
we had a bug before, which did not release the Sample and ComplexSample instances after the calculations, so the memory only went up at ~2-3 MB/s and the buffer overflowed all the time
At the moment (after fixing the bug mentioned above) we have a direct correlation between the samples count per point and the overflows. More samples/point = higher memory delta = more overflows.
Now to the actual question:
Can this behaviour be improved (easily)?
Maybe there is a way to tell the GC/runtime to not release the memory so there is no need to re-allocate?
We also thought of an alternative approach by "re-using" the LinkedList<Sample> and ComplexSample[]: Keep a pool of such lists/arrays and instead of releasing them put them back in the pool and "change" these instances instead of creating new ones, but we are not sure this is a good idea as it adds complexity to the whole system...
But we are open to other suggestions!
UPDATE:
I now optimized the code base with the following improvements and did various test runs:
converted Sample to a struct
got rid of the LinkedList<Sample> and replaced them by straight arrays (I actually had another one somewhere else I also removed)
several minor optimizations I found during analysis and optimization
(optional - see below) converted ComplexSample to a struct
In any case it seems that the problem is gone now on my machine (long-term tests and tests on low-spec hardware will follow), but I first ran a test with both types as struct and got the following memory usage graph:
There the memory still went up to ~300 MB on a regular basis (but there were no overflow errors anymore); as this still seemed odd to me, I did some additional tests:
Side note: Each value of each ComplexSample is altered at least once during the calculations.
1) Add a GC.Collect after a task is processed and the samples are not referenced any more:
Now it was alternating between 140 MB and 150 MB (no noticeable performance hit).
2) ComplexSample as a class (no GC.Collect):
Using a class it is much more "stable" at ~140-200 MB.
3) ComplexSample as a class and GC.Collect:
Now it is going "up and down" a little in the range of 135-150 MB.
Current solution:
As we are not sure this is a valid case for manually calling GC.Collect we are using "solution 2)" now and I will start running the long-term (= several hours) and low-spec hardware tests...
Can this behaviour be improved (easily)?
Yes (depends on how much you need to improve it).
The first thing I would do is to change Sample and ComplexSample to be value-types. This will reduce the complexity of the graph dealt with by the GC: while the arrays and linked lists are still collected, they contain those values directly rather than references to them, which simplifies the rest of the GC's work.
Then I'd measure performance at this point. The impact of working with relatively large structs is mixed. The guideline that value types should be less than 16 bytes comes from it being around that point where the performance benefits of using a reference type tend to overwhelm the performance benefits of using a value type, but that guideline is only a guideline because "tend to overwhelm" is not the same as "will overwhelm in your application".
After that if it had either not improved things, or not improved things enough, I would consider using a pool of objects; whether for those smaller objects, only the larger objects, or both. This will most certainly increase the complexity of your application, but if it's time-critical, then it might well help. (See How do these people avoid creating any garbage? for example which discusses avoiding normal GC in a time-critical case).
If you know you'll need a fixed maximum of a given type this isn't too hard; create and fill an array of them, dole them out from that array, and return them to it as they are no longer used. It's still hard enough in that you no longer have GC being automatic and have to manually "delete" the objects by putting them back in the pool.
If you don't have such knowledge, it gets harder but is still possible.
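A minimal sketch of such a fixed-size pool (the type and member names are illustrative, not from your code base, and it is not thread-safe as written - add locking if the USB-reading thread and the calculation thread both touch it):
// requires: using System.Collections.Generic;
public sealed class ComplexSamplePool
{
    private readonly Stack<ComplexSample[]> _free = new Stack<ComplexSample[]>();
    private readonly int _bufferLength;

    public ComplexSamplePool(int bufferLength, int preallocatedBuffers)
    {
        _bufferLength = bufferLength;
        for (int i = 0; i < preallocatedBuffers; i++)
            _free.Push(new ComplexSample[bufferLength]); // allocate everything up front
    }

    public ComplexSample[] Rent()
    {
        // fall back to a fresh allocation if the pool runs dry;
        // alternatively this could block or throw, depending on requirements
        return _free.Count > 0 ? _free.Pop() : new ComplexSample[_bufferLength];
    }

    public void Return(ComplexSample[] buffer)
    {
        _free.Push(buffer); // the manual "delete": hand the buffer back instead of dropping it
    }
}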
If it is really vital that you avoid GC, be careful of hidden objects. Adding to most collection types can for example result in them moving up to a larger internal store, and leaving the earlier store to be collected. Maybe this is fine in that you've still reduced GC use enough that it is no longer causing the problem you have, but maybe not.
I've rarely seen a LinkedList<> used in .NET... Have you tried using a List<>? Consider that the basic "element" of a LinkedList<> is a LinkedListNode<>, which is a class... So for each Sample there is the additional overhead of a whole extra object.
Note that if you want to use "big" value types (as suggested by others), the List<> could again become slower (because the List<> grows by allocating a new internal array of double the current size and copying from old to new), so the bigger the elements, the more memory the List<> has to copy around when it doubles itself.
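As a side note on that growth cost: if the number of samples per task is roughly known up front (~300k in your case), pre-sizing the list avoids the repeated doubling-and-copying entirely (a sketch):
// capacity hint: the backing array is allocated once instead of being
// repeatedly doubled and copied on the way to ~300k elements
var samples = new List<Sample>(300000);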
If you go to List<> you could try splitting the Sample into
List<ushort> CommandIndex;
List<Sample> ValueChannels;
This is because the doubles of Sample require 8-byte alignment, so as written the Sample is 24 bytes, with only 18 bytes actually used.
This wouldn't be a good idea for LinkedList<>, because the LL has a big overhead per item.
Change Sample and ComplexSample to struct.
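That is, keeping the fields exactly as they are in the question (a sketch of the change):
// as structs, a Sample[] / ComplexSample[] stores the values inline:
// one object for the GC to track instead of one object per element
public struct Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}

public struct ComplexSample
{
    public double Channel1Real;
    public double Channel1Imag;
    public double Channel2Real;
    public double Channel2Imag;
}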
I quite often write code that copies member variables to a local stack variable in the belief that it will improve performance by removing the pointer dereference that has to take place whenever accessing member variables.
Is this valid?
For example
public class Manager {
private readonly Constraint[] mConstraints;
public void DoSomethingPossiblyFaster()
{
var constraints = mConstraints;
for (var i = 0; i < constraints.Length; i++)
{
var constraint = constraints[i];
// Do something with it
}
}
public void DoSomethingPossiblySlower()
{
for (var i = 0; i < mConstraints.Length; i++)
{
var constraint = mConstraints[i];
// Do something with it
}
}
}
My thinking is that DoSomethingPossiblyFaster is actually faster than DoSomethingPossiblySlower.
I know this is pretty much a micro optimisation, but it would be useful to have a definitive answer.
Edit
Just to add a little bit of background around this. Our application has to process a lot of data coming from telecom networks, and this method is likely to be called about 1 billion times a day for some of our servers. My view is that every little helps, and sometimes all I am trying to do is give the compiler a few hints.
Which is more readable? That should usually be your primary motivating factor. Do you even need to use a for loop instead of foreach?
As mConstraints is readonly I'd potentially expect the JIT compiler to do this for you - but really, what are you doing in the loop? The chances of this being significant are pretty small. I'd almost always pick the second approach simply for readability - and I'd prefer foreach where possible. Whether the JIT compiler optimizes this case will very much depend on the JIT itself - which may vary between versions, architectures, and even how large the method is or other factors. There can be no "definitive" answer here, as it's always possible that an alternative JIT will optimize differently.
If you think you're in a corner case where this really matters, you should benchmark it - thoroughly, with as realistic data as possible. Only then should you change your code away from the most readable form. If you're "quite often" writing code like this, it seems unlikely that you're doing yourself any favours.
Even if the readability difference is relatively small, I'd say it's still present and significant - whereas I'd certainly expect the performance difference to be negligible.
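For what it's worth, if you do benchmark it, a minimal Stopwatch-based harness might look like the sketch below (it assumes a Manager constructor taking the constraints, which the question doesn't show; a serious comparison would use realistic data and something like BenchmarkDotNet):
// rough sketch only - "constraints" is assumed to be realistic test data
var manager = new Manager(constraints);
const int iterations = 1000000;

// warm up so both methods are JIT-compiled before timing
manager.DoSomethingPossiblyFaster();
manager.DoSomethingPossiblySlower();

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
    manager.DoSomethingPossiblyFaster();
sw.Stop();
Console.WriteLine("local copy:   " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
for (int i = 0; i < iterations; i++)
    manager.DoSomethingPossiblySlower();
sw.Stop();
Console.WriteLine("field access: " + sw.ElapsedMilliseconds + " ms");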
If the compiler/JIT isn't already doing this or a similar optimization for you (this is a big if), then DoSomethingPossiblyFaster should be faster than DoSomethingPossiblySlower. The best way to explain why is to look at a rough translation of the C# code to straight C.
When a non-static member function is called, a hidden pointer to this is passed into the function. You'd have roughly the following, ignoring virtual function dispatch since it's irrelevant to the question (or equivalently making Manager sealed for simplicity):
struct Manager {
Constraint* mConstraints;
int mLength;
};
void DoSomethingPossiblyFaster(Manager* this) {
Constraint* constraints = this->mConstraints;
int length = this->mLength;
for (int i = 0; i < length; i++)
{
Constraint constraint = constraints[i];
// Do something with it
}
}
void DoSomethingPossiblySlower(Manager* this)
{
for (int i = 0; i < this->mLength; i++)
{
Constraint constraint = (this->mConstraints)[i];
// Do something with it
}
}
The difference is that in DoSomethingPossiblyFaster, mConstraints lives on the stack and access only requires one layer of pointer indirection, since it's at a fixed offset from the stack pointer. In DoSomethingPossiblySlower, if the compiler misses the optimization opportunity, there's an extra pointer indirection. The compiler has to read a fixed offset from the stack pointer to access this and then read a fixed offset from this to get mConstraints.
There are two possible optimizations that could negate this hit:
The compiler could do exactly what you did manually and cache mConstraints on the stack.
The compiler could store this in a register so that it doesn't need to fetch it from the stack on every loop iteration before dereferencing it. This means that fetching mConstraints from this or from the stack is basically the same operation: A single dereference of a fixed offset from a pointer that's already in a register.
You know the response you will get, right? "Time it."
There is probably not a definitive answer. First, the compiler might do the optimization for you. Second, even if it doesn't, indirect addressing at the assembly level may not be significantly slower. Third, it depends on the cost of making the local copy, compared to the number of loop iterations. Then there are caching effects to consider.
I love to optimize, but this is one place I would definitely say wait until you have a problem, then experiment. This is a possible optimization that can be added when needed, not one of those optimizations that needs to be planned up front to avoid a massive ripple effect later.
Edit: (towards a definitive answer)
Compiling both functions in release mode and examining the IL with IL Dasm shows that in both places where the "PossiblyFaster" function uses the local variable, it has one less instruction:
ldloc.0 vs
ldarg.0; ldfld class Constraint[] Manager::mConstraints
Of course, this is still one level removed from the machine code - you don't know what the JIT compiler will do for you. But it is likely that "PossiblyFaster" is marginally faster.
However, I still don't recommend adding the extra variable until you are sure this function is the most expensive thing in your system.
I've profiled this and came up with a bunch of interesting results that are probably only valid for my specific example, but I thought would be worth while noting here.
The fastest is X86 release mode. That runs one iteration of my test in 7.1 seconds, whereas the equivalent X64 code takes 8.6 seconds. This was running 5 iterations, each iteration processing the loop 19.2 million times.
The fastest approach for the loop was:
foreach (var constraint in mConstraints)
{
... do stuff ...
}
The second fastest approach, which massively surprised me, was the following:
for (var i = 0; i < mConstraints.Length; i++)
{
var constraint = mConstraints[i];
... do stuff ...
}
I guess this was because mConstraints was stored in a register for the loop.
This slowed down when I removed the readonly modifier from mConstraints.
So, my summary from this is that, in this situation, the readable form gives the best performance as well.