I've got an idea for optimising a large jagged array. Let's say I have this array in C#:
struct BlockData
{
    internal short type;
    internal short health;
    internal short x;
    internal short y;
    internal short z;
    internal byte connection;
}
BlockData[][][] blocks = null;
byte[] GetBlockTypes()
{
    if (blocks == null)
        blocks = InitializeJaggedArray<BlockData[][][]>(256, 64, 256);
    // BlockData is a struct
    MemoryStream stream = new MemoryStream();
    for (int x = 0; x < blocks.Length; x++)
    {
        for (int y = 0; y < blocks[x].Length; y++)
        {
            for (int z = 0; z < blocks[x][y].Length; z++)
            {
                stream.WriteByte((byte)blocks[x][y][z].type);
            }
        }
    }
    return stream.ToArray();
}
Would storing the blocks as a BlockData*** in a C++ DLL, and then using P/Invoke to read/write them, be more efficient than storing them in C# arrays?
Note: I'm unable to run tests right now because my computer is currently in for service.
This sounds like a question where you should first read the speed rant, starting at part 2: https://ericlippert.com/2012/12/17/performance-rant/
This is such a minuscule difference that if it matters, you are probably in a realtime scenario - and .NET is the wrong choice for realtime scenarios to begin with. If you are in a realtime scenario, this will not be the only overhead you have to shave off; GC memory management and security checks would be next.
It is true that accessing an array in native C++ is faster than accessing it in .NET. .NET implements indexers as proper function calls, similar to properties, and it does verify that the index is valid. However, it is not as bad as you might think. The optimisations are pretty good: function calls can be inlined, repeated array accesses can be cached in a temporary variable where possible, and even the bounds check is not safe from sensible removal. So it is not as big an advantage as you might think.
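For example (an illustrative sketch of my own, not from the question): when a loop's bound is the array's own Length, the JIT can typically prove every access is in range and elide the per-element bounds check.

static int SumAll(int[] values)
{
    int sum = 0;
    // Because the condition compares i against values.Length directly,
    // the JIT can prove 0 <= i < values.Length and skip the bounds check
    // on values[i]. Obscuring the bound may reintroduce the check.
    for (int i = 0; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}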
As others pointed out, P/Invoke will consume any gains there might be with its overhead. But actually going into a different environment is unnecessary:
The thing is, you can also use naked pointers in .NET. You have to enable it with unsafe code, but it is there. You can then acquire a piece of unmanaged memory and treat it like an array in native C++. Of course, that subjects you to mistakes like messing up the pointer arithmetic or overrunning the buffer - the exact reasons those checks exist in the first place!
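A minimal sketch of that approach (the class name and block dimensions are mine, and the project needs unsafe code enabled):

using System;
using System.Runtime.InteropServices;

// Sketch: a 256x64x256 volume of block types kept in unmanaged memory and
// indexed with raw pointer arithmetic - no GC tracking, no bounds checks,
// and therefore all the classic native-code failure modes.
unsafe class UnmanagedBlockTypes : IDisposable
{
    const int SizeX = 256, SizeY = 64, SizeZ = 256;
    byte* _types;

    public UnmanagedBlockTypes()
    {
        _types = (byte*)Marshal.AllocHGlobal(SizeX * SizeY * SizeZ);
    }

    public byte this[int x, int y, int z]
    {
        // The caller is responsible for keeping x, y, z in range.
        get => _types[(x * SizeY + y) * SizeZ + z];
        set => _types[(x * SizeY + y) * SizeZ + z] = value;
    }

    public void Dispose()
    {
        Marshal.FreeHGlobal((IntPtr)_types);
        _types = null;
    }
}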
Would storing the blocks as a BlockData*** in a C++ DLL, and then using P/Invoke to read/write them, be more efficient than storing them in C# arrays?
No, because P/Invoke has a significant overhead, whereas array access in C# .NET is compiled at runtime by the JIT to fairly efficient code, bounds checks included. Jagged arrays in .NET also have adequate performance (the only weak area in .NET is true multidimensional arrays, which is disappointing - but I don't believe your proposal would help with that either).
Update: Multidimensional array performance in .NET Core actually seems worse than .NET Framework (if I'm reading this thread correctly).
Another way to look at it: GC and overall maintenance. Your proposal is essentially the same as allocating one big array and using (layer * layerSize + row * rowSize + column) to index it (a managed version of that layout is sketched below). P/Invoke will give you the following drawbacks:
you will likely end up with an unmanaged allocation for the array. This makes the GC unaware of a large amount of allocated memory, and you need to make sure to notify the GC about it (see the sketch right after this list);
P/Invoked calls can't be inlined during JIT, unlike regular .NET code;
you need to maintain code in two languages
P/Invoke is not as portable - it requires platform/bitness-specific native libraries and adds a lot of fun when sharing your program.
and one possible gain:
removing the bounds checks .NET performs on array accesses.
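For the first drawback above, the standard remedy is GC.AddMemoryPressure/GC.RemoveMemoryPressure around the unmanaged allocation; a rough sketch (the size is purely illustrative):

using System;
using System.Runtime.InteropServices;

// Make the GC aware of a large unmanaged allocation so it can schedule
// collections sensibly.
long sizeInBytes = 256L * 64 * 256 * 12;   // ~48 MB, assuming ~12 bytes per block
IntPtr buffer = Marshal.AllocHGlobal((IntPtr)sizeInBytes);
GC.AddMemoryPressure(sizeInBytes);
try
{
    // ... hand 'buffer' to the native side, or use it from unsafe code ...
}
finally
{
    Marshal.FreeHGlobal(buffer);
    GC.RemoveMemoryPressure(sizeInBytes);
}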
A back-of-a-napkin calculation shows that at best the two balance out in raw performance. I'd go with the .NET-only version, as it is easier to maintain and means less fun with the GC.
Additionally, when you hide chunk auto-generation (or partially generated chunks) behind an indexer method on the chunk, it is easier to write that code in a single language... In reality, since fully populated chunks are very memory-consuming, your main issue will likely be memory usage and memory-access cost rather than the raw performance of iterating through elements. Try and measure...
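For reference, a managed-only sketch of the "one big array" layout mentioned above (the names and the ref-returning indexer are my own choices; ref returns need C# 7 or later):

// Single contiguous allocation replacing blocks[x][y][z]: one index
// computation instead of three dependent dereferences.
class FlatBlocks
{
    readonly BlockData[] _data;
    readonly int _sizeY, _sizeZ;

    public FlatBlocks(int sizeX, int sizeY, int sizeZ)
    {
        _data = new BlockData[sizeX * sizeY * sizeZ];
        _sizeY = sizeY;
        _sizeZ = sizeZ;
    }

    // layer * layerSize + row * rowSize + column
    public ref BlockData At(int x, int y, int z) =>
        ref _data[(x * _sizeY + y) * _sizeZ + z];
}

Iterating _data linearly (as in the GetBlockTypes scenario) then becomes a single cache-friendly pass.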
For a really simple code snippet, I'm trying to see how much of the time is spent actually allocating objects on the small object heap (SOH).
static void Main(string[] args)
{
    const int noNumbers = 10000000; // 10 million
    ArrayList numbers = new ArrayList();
    Random random = new Random(1);  // use the same seed to keep benchmarking consistent

    for (int i = 0; i < noNumbers; i++)
    {
        int currentNumber = random.Next(10); // generate a non-negative random number less than 10
        object o = currentNumber;            // BOXING occurs here
        numbers.Add(o);
    }
}
In particular, I want to know how much time is spent allocating space for all the boxed int instances on the heap (I know this is an ArrayList and there's horrible boxing going on as well - but it's just for educational purposes).
The CLR has 2 ways of performing memory allocations on the SOH: either calling the JIT_TrialAllocSFastMP allocation helper (for multi-processor systems; ...SFastSP for single-processor ones) - which is really fast, since it consists of a few assembly instructions - or falling back to the slower JIT_New allocation helper.
PerfView sees the JIT_New invocations just fine in the collected call stacks.
However, I can't figure out which native function - if any - is involved in the "quick way" of allocating. I certainly don't see any JIT_TrialAllocSFastMP. I've already tried raising the loop count (from 10 to 500 million), in the hope of increasing my chances of getting a glimpse of a few stacks containing the elusive function, but to no avail.
Another approach was to use the JetBrains dotTrace line-by-line performance viewer, but it falls short of what I want: I do get to see the approximate time the boxing operation takes for each int, but 1) it's just a bar and 2) it covers both the allocation itself and the copying of the value (and the latter is not what I'm after).
Using the JetBrains dotTrace Timeline viewer won't work either, since they currently don't (quite) support native callstacks.
At this point it's unclear to me whether there's a method being dynamically generated and called when JIT_TrialAllocSFastMP is invoked - and by some miracle none of the PerfView-collected stack frames (one every 1 ms) ever captures it - or whether Main's method body gets patched and those few assembly instructions mentioned above are injected directly into the code. It's also hard to believe that the fast way of allocating memory is never called.
You could ask, "But you already have the .NET Core CLR code, why can't you figure it out yourself?" Since the .NET Framework CLR code is not publicly available, I've looked into its sibling, the .NET Core version of the CLR (as Matt Warren recommends in step 6 here). The \src\vm\amd64\JitHelpers_InlineGetThread.asm file contains a JIT_TrialAllocSFastMP_InlineGetThread function. The issue is that parsing/understanding the code there is above my pay grade, and I also can't think of a way to "Step Into" and see how the JIT-ed code is generated, since this is way lower-level than your usual press-F11-in-Visual-Studio.
Update 1: Let's simplify the code, and only consider individual boxed int values:
const int noNumbers = 10000000; // 10 million
object o = null;
for (int i = 0; i < noNumbers; i++)
{
    o = i;
}
Since this is a Release build, and dead code elimination could kick in, WinDbg is used to check the final machine code.
The resulting JIT-ed code for the main loop, which simply does repeated boxing, shows that the method handling the memory allocation is not inlined (note the call to a fixed hex address, 00af30f4 in my run).
That method in turn tries to allocate via the "fast" way and, if that fails, falls back to the "slow" way of calling JIT_New itself.
It's interesting how the call stack obtained in PerfView from the code above doesn't show any intermediary method between Main and JIT_New itself, given that Main doesn't directly call JIT_New.
I was researching the best way to return 'views' into a very large array and found ArraySegment which perfectly suited my needs. However, I then found Memory<T> which seems to behave the same, with the exception of requiring a span to view the memory.
For the use-case of creating and writing to views into a massive (2GB+) array, does it matter which one is used?
The reason for the large arrays is that they hold the bytes of an image.
Resurrecting this in case someone bumps into this question.
When to use ArraySegment<T> over Memory<T>?
Never, unless you need to call something old that expects an ArraySegment<T>, which I doubt will be the case as it was never that popular.
ArraySegment<T> is just an array, an offset, and a length, which are all exposed directly where you can choose to ignore the offset and length and access the entirety of the array if you want to. There’s also no read-only version of ArraySegment<T>.
Span<T> and Memory<T> can be backed by arrays, similar to ArraySegment<T>, but also by strings and unmanaged memory (in the form of a pointer in Span<T>’s case, and by using a custom MemoryManager<T> in Memory<T>’s case). They provide better encapsulation by not exposing their underlying data source and have read-only versions for immutable access.
Back then, we had to pass the array/offset/count trio to a lot of APIs (APIs that needed a direct reference to an array), but now that Span<T> and Memory<T> exist and are widely supported by most, if not all, .NET APIs that need to interact with contiguous blocks of memory, you should have no reason to use an ArraySegment<T>.
See also: Memory- and span-related types - MS Docs
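To make that concrete, a small sketch of the same writable view expressed both ways (the sizes here are illustrative):

byte[] image = new byte[1024];   // stands in for the 2 GB+ image array

// Old style: array + offset + count, all exposed to the consumer.
var segment = new ArraySegment<byte>(image, 256, 128);

// New style: a sliceable view that hides the backing array.
Memory<byte> view = image.AsMemory(256, 128);
view.Span[0] = 0xFF;             // writes through to image[256]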
Memory is sort of a wrapper around Span - one that doesn't have to live on the stack. And as the link provided by CoolBots points out, it's an addition to arrays and array segments, not really a replacement for them.
The main reason you would want to consider using Span/Memory is for performance and flexibility. Span gives you access to the memory directly instead of copying it back and forth to the array, and it allows you to treat the memory in a flexible way. Below I'll go from using the array as bytes to using it as an array of uint.
I'll skip right to Span, but you could use AsMemory instead so you could pass the view around more easily; it would still boil down to getting the Span from the Memory.
Here's an example:
const int dataSize = 512;
const int segSize = 256;

byte[] rawdata = new byte[dataSize];

// A writable view over the second half of the array.
var segment = new ArraySegment<byte>(rawdata, segSize, segSize);

// Reinterpret the segment's bytes as uints. MemoryMarshal.Cast (from
// System.Runtime.InteropServices) does no copying - it views the same memory.
var seg1 = segment.AsSpan();
var seg1Uint = MemoryMarshal.Cast<byte, uint>(seg1);

for (int i = 0; i < segSize / sizeof(uint); ++i)
{
    ref var data = ref seg1Uint[i];
    data = 0x000066;   // writes through to the underlying rawdata
}

foreach (var b in rawdata)
    Console.WriteLine(b);
Currently, to deliver a struct from C++ to C#, I declare it on both sides (C++ and C#) and use a delegate. This approach is described here. In my opinion it may not be suitable for low-latency applications, because marshalling/unmarshalling costs CPU and memory, and it may hurt performance when the structure is big enough, the request frequency high enough, and the latency requirement tight enough.
For such low-latency scenarios it would be better not to allocate extra memory, but to work with the C++ memory directly from C#. I have found that one project uses System.IO.UnmanagedMemoryStream and System.IO.BinaryReader for that; individual fields can then be read, for example, this way:
reader = new System.IO.BinaryReader(stream);
stream.Position = 8;
return reader.ReadInt32();
However, I cannot find a complete example (how do I get an UnmanagedMemoryStream in C# that points to a structure, or an array of structures, in C++?). I'm not sure this is the best approach, but it could be. What would you suggest for "low latency" transfer of structures from C++ to C#? Could you give an example?
I don't care about portability, maintainability, etc. Only latency is important. It's a temporary solution until I get rid of C#.
Import your C++ function and have it return the struct through an out parameter. In C#, use the IntPtr type, and then use:
[DllImport("somedll.dll")]
public static extern void SomeFunction(out IntPtr someStructParameterOutput);

And then:

IntPtr yourStruct;
SomeFunction(out yourStruct);

// Requires an unsafe context; 'length' is the size of the native struct in bytes.
Stream s = new UnmanagedMemoryStream((byte*)yourStruct.ToPointer(), length);
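Putting it together, a rough end-to-end sketch - the DLL name, struct size, and field offset here are illustrative, not from a real API:

using System;
using System.IO;
using System.Runtime.InteropServices;

class NativeStructReader
{
    // Hypothetical native export that hands back a pointer to a struct
    // owned by the C++ side (so nothing is copied or marshalled).
    [DllImport("somedll.dll")]
    static extern void SomeFunction(out IntPtr nativeStruct);

    const int NativeStructSize = 16;   // assumed size of the C++ struct in bytes

    static unsafe int ReadFieldAtOffset8()
    {
        IntPtr yourStruct;
        SomeFunction(out yourStruct);

        // Wrap the native memory directly; no managed copy is made.
        using (var stream = new UnmanagedMemoryStream(
                   (byte*)yourStruct.ToPointer(), NativeStructSize))
        using (var reader = new BinaryReader(stream))
        {
            stream.Position = 8;         // offset of the field inside the struct
            return reader.ReadInt32();   // reads straight out of C++ memory
        }
    }
}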
I quite often write code that copies member variables to a local stack variable, in the belief that it will improve performance by removing the pointer dereference that has to take place whenever a member variable is accessed.
Is this valid?
For example
public class Manager
{
    private readonly Constraint[] mConstraints;

    public void DoSomethingPossiblyFaster()
    {
        var constraints = mConstraints;
        for (var i = 0; i < constraints.Length; i++)
        {
            var constraint = constraints[i];
            // Do something with it
        }
    }

    public void DoSomethingPossiblySlower()
    {
        for (var i = 0; i < mConstraints.Length; i++)
        {
            var constraint = mConstraints[i];
            // Do something with it
        }
    }
}
My thinking is that DoSomethingPossiblyFaster is actually faster than DoSomethingPossiblySlower.
I know this is pretty much a micro optimisation, but it would be useful to have a definitive answer.
Edit
Just to add a little background: our application has to process a lot of data coming from telecom networks, and this method is likely to be called about 1 billion times a day on some of our servers. My view is that every little helps, and sometimes all I am trying to do is give the compiler a few hints.
Which is more readable? That should usually be your primary motivating factor. Do you even need to use a for loop instead of foreach?
As mConstraints is readonly I'd potentially expect the JIT compiler to do this for you - but really, what are you doing in the loop? The chances of this being significant are pretty small. I'd almost always pick the second approach simply for readability - and I'd prefer foreach where possible. Whether the JIT compiler optimizes this case will very much depend on the JIT itself - which may vary between versions, architectures, and even how large the method is or other factors. There can be no "definitive" answer here, as it's always possible that an alternative JIT will optimize differently.
If you think you're in a corner case where this really matters, you should benchmark it - thoroughly, with as realistic data as possible. Only then should you change your code away from the most readable form. If you're "quite often" writing code like this, it seems unlikely that you're doing yourself any favours.
Even if the readability difference is relatively small, I'd say it's still present and significant - whereas I'd certainly expect the performance difference to be negligible.
If the compiler/JIT isn't already doing this or a similar optimization for you (this is a big if), then DoSomethingPossiblyFaster should be faster than DoSomethingPossiblySlower. The best way to explain why is to look at a rough translation of the C# code to straight C.
When a non-static member function is called, a hidden pointer to this is passed into the function. You'd have roughly the following, ignoring virtual function dispatch since it's irrelevant to the question (or equivalently making Manager sealed for simplicity):
struct Manager {
    Constraint* mConstraints;
    int mLength;
};

void DoSomethingPossiblyFaster(Manager* this) {
    Constraint* constraints = this->mConstraints;
    int length = this->mLength;
    for (int i = 0; i < length; i++)
    {
        Constraint constraint = constraints[i];
        // Do something with it
    }
}

void DoSomethingPossiblySlower(Manager* this) {
    for (int i = 0; i < this->mLength; i++)
    {
        Constraint constraint = (this->mConstraints)[i];
        // Do something with it
    }
}
The difference is that in DoSomethingPossiblyFaster, the local copy of mConstraints lives on the stack, and accessing it requires only one layer of pointer indirection, since it's at a fixed offset from the stack pointer. In DoSomethingPossiblySlower, if the compiler misses the optimization opportunity, there's an extra pointer indirection: it has to read a fixed offset from the stack pointer to access this, and then read a fixed offset from this to get mConstraints.
There are two possible optimizations that could negate this hit:
The compiler could do exactly what you did manually and cache mConstraints on the stack.
The compiler could store this in a register so that it doesn't need to fetch it from the stack on every loop iteration before dereferencing it. This means that fetching mConstraints from this or from the stack is basically the same operation: A single dereference of a fixed offset from a pointer that's already in a register.
You know the response you will get, right? "Time it."
There is probably not a definitive answer. First, the compiler might do the optimization for you. Second, even if it doesn't, indirect addressing at the assembly level may not be significantly slower. Third, it depends on the cost of making the local copy, compared to the number of loop iterations. Then there are caching effects to consider.
I love to optimize, but this is one place I would definitely say wait until you have a problem, then experiment. This is a possible optimization that can be added when needed, not one of those optimizations that needs to be planned up front to avoid a massive ripple effect later.
Edit (towards a definitive answer):
Compiling both functions in release mode and examining the IL with ILDASM shows that in both places where the "PossiblyFaster" function uses the local variable, it needs one less instruction:

ldloc.0

versus:

ldarg.0; ldfld class Constraint[] Manager::mConstraints

Of course, this is still one level removed from the machine code - you don't know what the JIT compiler will do for you. But it is likely that "PossiblyFaster" is marginally faster.
However, I still don't recommend adding the extra variable until you are sure this function is the most expensive thing in your system.
I've profiled this and came up with a bunch of interesting results that are probably only valid for my specific example, but I thought would be worth while noting here.
The fastest is x86 release mode, which runs one iteration of my test in 7.1 seconds, whereas the equivalent x64 code takes 8.6 seconds. This was over 5 iterations, each iteration processing the loop 19.2 million times.
The fastest approach for the loop was:
foreach (var constraint in mConstraints)
{
    ... do stuff ...
}
The second fastest approach, which massively surprised me, was the following:
for (var i = 0; i < mConstraints.Length; i++)
{
    var constraint = mConstraints[i];
    ... do stuff ...
}
I guess this was because mConstraints was stored in a register for the loop.
This slowed down when I removed the readonly option for mConstraints.
So, my summary from this is that, in this situation, the readable version gives you the performance as well.
I found a blog entry which suggests that the C# compiler may sometimes decide to put an array on the stack instead of the heap:
Improving Performance Through Stack Allocation (.NET Memory Management: Part 2)
This guy claims that:
The compiler will also sometimes decide to put things on the stack on its own. I did an experiment with TestStruct2 in which I allocated it both an unsafe and normal context. In the unsafe context the array was put on the heap, but in the normal context when I looked into memory the array had actually been allocated on the stack.
Can someone confirm that?
I was trying to reproduce his example, but every time I tried, the array was allocated on the heap.
If the C# compiler can do such a trick without the 'unsafe' keyword, I'm especially interested in it. I have code that works on many small byte arrays (8-10 bytes long), so using the heap for each new byte[...] is a waste of time and memory (especially since each object on the heap has an 8-byte overhead needed by the garbage collector).
EDIT: I just want to describe why it's important to me:
I'm writing a library that communicates with a Gemalto .NET smart card, which can run .NET code inside it. When I call a method that returns something, the smart card returns 8 bytes that describe the exact Type of the return value. These 8 bytes are calculated with an MD5 hash and some byte-array concatenations.
The problem is that when I receive a descriptor I don't recognise, I must scan all types in all assemblies loaded in the application, and for each one calculate those 8 bytes until I find a matching array.
I don't know another way to find the type, so I'm trying to speed this up as much as possible.
Author of the linked-to article here.
It seems impossible to force stack allocation outside of an unsafe context. This is likely to prevent certain classes of stack overflow conditions.
Instead, I recommend using a memory recycler class which allocates byte arrays as needed, but also allows you to "turn them in" afterwards for reuse. It's as simple as keeping a stack of unused byte arrays and, when the stack is empty, allocating new ones:
Stack<Byte[]> _byteStack = new Stack<Byte[]>();

Byte[] AllocateArray()
{
    Byte[] outArray;
    if (_byteStack.Count > 0)
        outArray = _byteStack.Pop();
    else
        outArray = new Byte[8];
    return outArray;
}

void RecycleArray(Byte[] inArray)
{
    _byteStack.Push(inArray);
}
If you are trying to match a hash with a type, the best idea would be to use a Dictionary for fast lookups (a sketch follows). In this case you could load all relevant types at startup; if this makes program startup too slow, you might want to cache each type the first time it is used.
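A sketch of that lookup, treating the 8 descriptor bytes as a ulong key (the class and method names are mine):

using System;
using System.Collections.Generic;

class TypeDescriptorCache
{
    readonly Dictionary<ulong, Type> _byDescriptor = new Dictionary<ulong, Type>();

    // Call once per candidate type (at startup, or lazily on first use).
    public void Register(Type type, byte[] descriptor8Bytes)
    {
        _byDescriptor[BitConverter.ToUInt64(descriptor8Bytes, 0)] = type;
    }

    // O(1) lookup instead of rescanning every loaded assembly.
    public Type Resolve(byte[] descriptor8Bytes)
    {
        Type type;
        _byDescriptor.TryGetValue(BitConverter.ToUInt64(descriptor8Bytes, 0), out type);
        return type;   // null when the descriptor is unknown
    }
}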
From your line:
I have a code that is working on many small byte arrays (8-10 bytes long)
Personally, I'd be more interested in allocating a spare buffer somewhere that different parts of your code can reuse (while processing the same block). Then you don't have any creation/GC to worry about. In most cases (where the buffer is used for very discrete operations) with a scratch buffer, you can even always assume that it is "all yours" - i.e. every method that needs it can assume it can start writing at index zero.
I use this single-buffer approach in some binary serialization code (while encoding data); it is a big boost to performance. In my case, I pass a "context" object between the layers of serialization that encapsulates the scratch buffer, the output stream (with some additional local buffering), and a few other oddities.
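As an illustration of the pattern (my own sketch, not the actual serializer code):

using System.IO;

// One scratch buffer owned by the context and reused by every layer;
// each method assumes the buffer is "all theirs" and writes from index 0.
class SerializationContext
{
    public readonly byte[] Scratch = new byte[16];
    public Stream Output;
}

static class Writers
{
    public static void WriteInt32(SerializationContext ctx, int value)
    {
        // Encode into the shared buffer instead of allocating a new byte[4].
        ctx.Scratch[0] = (byte)value;
        ctx.Scratch[1] = (byte)(value >> 8);
        ctx.Scratch[2] = (byte)(value >> 16);
        ctx.Scratch[3] = (byte)(value >> 24);
        ctx.Output.Write(ctx.Scratch, 0, 4);
    }
}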
System.Array (the class representing an array) is a reference type and lives on the heap. You can only have an array on the stack if you use unsafe code.
I can't see where it says otherwise in the article that you refer to. If you want to have a stack allocated array, you can do something like this:
decimal* stackAllocatedDecimals = stackalloc decimal[4];
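(For what it's worth, on C# 7.2 and later, stackalloc can also target Span<T>, which works without an unsafe context - a sketch for the 8-byte case from the question:)

// C# 7.2+: stack allocation without 'unsafe'.
Span<byte> descriptor = stackalloc byte[8];
// ... fill 'descriptor' and compare it against candidate type hashes ...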
Personally, I wouldn't bother - how much performance do you think you will gain by this approach?
This CodeProject article might be useful to you though.