GC Eager Root Collection


On pg 96 of Pro .NET Performance - Optimize Your C# Applications it talks about GC eager root collection:
For each local variable, the JIT embeds into a table the addresses of
the earliest and latest instruction pointers where the variable is
still relevant as a root. The GC then uses these tables when it
performs its stack walk.
It then provides this example:
static void Main(string[] args)
{
Widget a = new Widget();
a.Use();
//...additional code
Widget b = new Widget();
b.Use();
//...additional code
Foo(); //static method
}
It then says:
The above discussion implies that breaking your code into smaller
methods and using fewer local variables is not just a good design
measure or a software engineering technique. With the .NET GC, it can
provide a performance benefit as well because you have fewer local
roots! It means less work for the JIT when compiling the method, less
space to be occupied by the root IP tables, and less work for the GC
when performing its stack walk.
I don't understand how breaking code into smaller methods would help.
I've broken the code up into this:
static void Main(string[] args)
{
UseWidgetA();
//...additional code
UseWidgetB();
//...additional code
Foo(); //static method
}
static void UseWidgetA()
{
Widget a = new Widget();
a.Use();
}
static void UseWidgetB()
{
Widget b = new Widget();
b.Use();
}
Fewer local roots:
Why are there fewer local roots?
There are still the same number of local roots, one local root in each method.
Less work for the JIT when compiling the method:
Surely this would make things worse because it would need 2 extra tables for the 2 extra methods. The JIT would also still need to record the earliest and latest instruction pointers where the variable is still relevant within each method, but it would just have more methods to do that for.
Less work for the GC when performing its stack walk:
How does having more smaller methods mean less work for the GC during the stack walk?

I'm not in Sasha's mind but let me put my two cents to that.
First of all, I perceive it as a generalized rule - when you split a method into smaller ones, there is a chance that some parts will not need to be JITted, because some subroutines are executed conditionally.
Secondly, JITting indeed produces so-called GC info about live stack roots. The bigger the method, the bigger the GC info. Theoretically, there should also be a bigger cost of interpreting it during the GC, but this is mitigated by splitting the GC info into chunks. Moreover, information about stack root liveness is stored only for so-called safe points. There are two types of methods:
partially interruptible - the only safe points are during calls to other methods. This makes a method less "suspendable" because the runtime needs to wait for such a safe point to suspend a method, but consumes less memory for the GC info.
fully interruptible - every instruction of a method is treated as a safe point, which obviously makes a method very "suspendable" but requires significant storage (of quantity similar to the code itself)
As Book Of The Runtime says: “The JIT chooses whether to emit fully- or partially
interruptible code based on heuristics to find the best trade-off between code quality,
size of the GC info, and GC suspension latency.”
In my opinion, smaller methods help the JIT to make better decisions (based on its heuristics) to make methods partially or fully interruptible.
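To illustrate the first point, here is a minimal sketch with invented names (`RarePathDemo`, `ParseOrDefault`, and `ReportBadInput` are hypothetical, not from the book). It assumes the cold method is neither inlined nor precompiled (NGen/ReadyToRun); under that assumption the CLR JIT compiles `ReportBadInput` only on its first call, and its `message` local never appears in the GC info of the hot method.

```csharp
using System;
using System.Runtime.CompilerServices;

static class RarePathDemo
{
    // Hot path: small, with a single value-type local.
    public static int ParseOrDefault(string text, int fallback)
    {
        int value;
        if (int.TryParse(text, out value))
            return value;

        // The cold path lives in its own method, so it is compiled on demand.
        return ReportBadInput(text, fallback);
    }

    // NoInlining keeps this method separate from the hot path (an assumption of the sketch).
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int ReportBadInput(string text, int fallback)
    {
        string message = "Could not parse '" + text + "', using " + fallback;
        Console.Error.WriteLine(message);
        return fallback;
    }

    static void Main()
    {
        Console.WriteLine(ParseOrDefault("42", 0));   // prints 42
        Console.WriteLine(ParseOrDefault("oops", 7)); // prints 7
    }
}
```

Whether this changes the JIT's choice between partially and fully interruptible code depends on its heuristics, which the sketch does not try to demonstrate.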


How to allow the Concept of Circularity in a program without infinite computational resources?

This might sound like a very strange question, but I work on a project which needs to have circular references within it. They are actually unavoidable, because users can create their own circular references within the GUI, and this is absolutely intended. Please don't ask why; this would take ages to explain.
All the questions, answers, and resources I found that discuss circular references provide solutions and approaches for avoiding one. None that I have read contained a solution for making one without exhausting the underlying computational resources.
Issues I see
Such a circular reference seems to me to always have the potential to completely overwhelm the underlying system, be it a simple home computer or a research supercomputer where this program is meant to be run.
This is due to my understanding that the resources provided are always finite, but circular references are infinite by nature.
The resources I see which might be an issue here are:
computational power (CPU)
working memory (RAM)
Data storage
Network bandwidth
How could it be possible to mitigate those issues
Mitigation could take place by making sure that the program is only ever able to increase its need for computational resources in a very minor and incremental fashion. If measures are then implemented which, based on data gathered about the whole system as a unit, allow us to decide whether further evolutions are even necessary to improve the perceived quality of the system, this would help us cap the need for computational resources.
One way I could imagine this capping taking place is by introducing time as a limiting factor. The program could be designed in such a way that it only considers re-evaluating "itself" after a given amount of time. If this time and the quality limit are carefully chosen to match the underlying computational resources, I feel like the resource issues with circular references could be mitigated.
Code Snippet
Find below a very simplified code snippet. Point 1 and Point 2 are completely independent in nature; they could even be on different threads (that is one idea for how it could be done, but I don't understand multithreading well enough to decide whether it would be a good approach). The action first begins when they are attached to one another. I do not care whether the behavior of "first this, then that" happens in a specific order. The only thing I care about is that all interactions between those two points have taken place at some point in the future (after their attachment).
namespace Circularity
{
class Program
{
static void Main(string[] args)
{
Point Point1 = new Point();
Point Point2 = new Point();
Point1.attach(Point2);
}
}
class Point
{
private ulong Value;
public Point()
{
Value = ulong.MaxValue / 2;
}
public void attach(Point otherPoint)
{
if (Value < ulong.MaxValue) Value++;
otherPoint.attach(this);
}
}
}
This code leads instantly to a stack overflow. But I do not understand the underlying concepts of the stack well enough to implement a countermeasure. I tried to apply the time concept here already, but it just takes longer for the stack overflow to occur.

The reason you're getting a stack overflow is that attach calls itself recursively (via otherPoint.attach(this)) with no terminating condition, so each call adds another stack frame until the CLR's stack is exhausted. One strategy here would be to use continuation-passing style so you avoid building a stack of method calls.
When and how to use continuation passing style
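As a rough illustration of that idea: the C# compiler does not guarantee tail-call elimination, so continuation-passing on its own can still grow the stack, and a common complement is to return the next step and drive it from a loop (a trampoline). The sketch below is one possible adaptation of the question's code, not the asker's design; `Circularity.Run` and the `maxSteps` budget are hypothetical stand-ins for whatever time or quality cap the real system would use.

```csharp
using System;
using System.Collections.Generic;

class Point
{
    public ulong Value { get; private set; }

    public Point()
    {
        Value = ulong.MaxValue / 2;
    }

    // Performs one interaction and returns the follow-up interaction
    // instead of calling it, so no stack frame is left waiting.
    public KeyValuePair<Point, Point> Attach(Point otherPoint)
    {
        if (Value < ulong.MaxValue) Value++;
        return new KeyValuePair<Point, Point>(otherPoint, this);
    }
}

static class Circularity
{
    // Drives the circular interaction from a loop with a hypothetical step budget.
    public static void Run(Point first, Point second, int maxSteps)
    {
        var pending = new Queue<KeyValuePair<Point, Point>>();
        pending.Enqueue(new KeyValuePair<Point, Point>(first, second));

        for (int step = 0; step < maxSteps && pending.Count > 0; step++)
        {
            KeyValuePair<Point, Point> next = pending.Dequeue();
            pending.Enqueue(next.Key.Attach(next.Value));
        }
    }

    static void Main()
    {
        var point1 = new Point();
        var point2 = new Point();
        Run(point1, point2, 1000000);
        Console.WriteLine(point1.Value - ulong.MaxValue / 2); // prints 500000
        Console.WriteLine(point2.Value - ulong.MaxValue / 2); // prints 500000
    }
}
```

The stack depth stays constant however many interactions run; the circular reference still never finishes by itself, so some budget (steps, time, or a quality threshold) has to end the loop.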

Memory Allocation Time (The Fast Way)

For a really simple code snippet, I'm trying to see how much of the time is spent actually allocating objects on the small object heap (SOH).
static void Main(string[] args)
{
const int noNumbers = 10000000; // 10 mil
ArrayList numbers = new ArrayList();
Random random = new Random(1); // use the same seed as to make
// benchmarking consistent
for (int i = 0; i < noNumbers; i++)
{
int currentNumber = random.Next(10); // generate a non-negative
// random number less than 10
object o = currentNumber; // BOXING occurs here
numbers.Add(o);
}
}
In particular, I want to know how much time is spent allocating space for the all the boxed int instances on the heap (I know, this is an ArrayList and there's horrible boxing going on as well - but it's just for educational purposes).
The CLR has 2 ways of performing memory allocations on the SOH: either calling the JIT_TrialAllocSFastMP (for multi-processor systems, ...SFastSP for single processor ones) allocation helper - which is really fast since it consists of a few assembly instructions - or falling back to the slower JIT_New allocation helper.
PerfView sees just fine the JIT_New being invoked:
However, I can't figure out which - if any - is the native function involved for the "quick way" of allocating. I certainly don't see any JIT_TrialAllocSFastMP. I've already tried raising the count of the loop (from 10 to 500 mil), in the hope of increasing my chances of getting a glimpse of a few stacks containing the elusive function, but to no avail.
Another approach was to use JetBrains dotTrace (line-by-line) performance viewer, but it falls short of what I want: I do get to see the approximate time it takes the boxing operation for each int, but 1) it's just a bar and 2) there's both the allocation itself and the copying of the value (of which the latter is not what I'm after).
Using the JetBrains dotTrace Timeline viewer won't work either, since they currently don't (quite) support native callstacks.
At this point it's unclear to me whether a method is dynamically generated and called when JIT_TrialAllocSFastMP is invoked (and by some miracle none of the PerfView-collected stack samples, one every 1 ms, ever captures it), or whether Main's method body somehow gets patched and those few assembly instructions mentioned above are injected directly into the code. It's also hard to believe that the fast way of allocating memory is never called.
You could ask "But you already have the .NET Core CLR code, why can't you figure it out yourself?". Since the .NET Framework CLR code is not publicly available, I've looked into its sibling, the .NET Core version of the CLR (as Matt Warren recommends in his step 6 here). The \src\vm\amd64\JitHelpers_InlineGetThread.asm file contains a JIT_TrialAllocSFastMP_InlineGetThread function. The issue is that parsing/understanding the C++ code there is above my grade, and I also can't really think of a way to "Step Into" and see how the JIT-ed code is generated, since this is way lower-level than your usual Press-F11-in-Visual-Studio.
Update 1: Let's simplify the code, and only consider individual boxed int values:
const int noNumbers = 10000000; // 10 mil
object o = null;
for (int i=0;i<noNumbers;i++)
{
o = i;
}
Since this is a Release build, and dead code elimination could kick in, WinDbg is used to check the final machine code.
The resulting JITed code, whose main loop is highlighted in blue below, which simply does repeated boxing, shows that the method that handles the memory allocation is not inlined (note the call to hex address 00af30f4):
This method in turn tries to allocate via the "fast" way and, if that fails, falls back to the "slow" way of a call to JIT_New itself:
It's interesting how the call stack in PerfView obtained from the code above doesn't show any intermediary method between the level of Main and the JIT_New entry itself (given that Main doesn't directly call JIT_New):
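A coarser way to bound the cost, without native stacks, is to time the boxing loop against a loop of the same shape that does not allocate. The following is only a sketch under stated assumptions: `BoxingCost`, `BoxLoop`, and `PlainLoop` are hypothetical names, the difference between the two timings still includes the value copy and any gen-0 collections, and the JIT may reduce the plain loop to almost nothing, so the numbers give an upper bound rather than a measurement of the allocation helper alone.

```csharp
using System;
using System.Diagnostics;

static class BoxingCost
{
    // Runs the loop with boxing; returns the last boxed value so the work stays observable.
    public static object BoxLoop(int count)
    {
        object o = null;
        for (int i = 0; i < count; i++)
        {
            o = i; // allocates a boxed int on the SOH
        }
        return o;
    }

    // Same loop shape without the allocation, as a baseline.
    public static int PlainLoop(int count)
    {
        int last = 0;
        for (int i = 0; i < count; i++)
        {
            last = i;
        }
        return last;
    }

    static void Main()
    {
        const int noNumbers = 10000000; // 10 mil

        // Warm up so JIT compilation is not part of the measurement.
        BoxLoop(1000);
        PlainLoop(1000);

        int gen0Before = GC.CollectionCount(0);
        Stopwatch boxed = Stopwatch.StartNew();
        object lastBoxed = BoxLoop(noNumbers);
        boxed.Stop();
        int gen0Collections = GC.CollectionCount(0) - gen0Before;

        Stopwatch plain = Stopwatch.StartNew();
        int lastPlain = PlainLoop(noNumbers);
        plain.Stop();

        Console.WriteLine("boxing loop: {0} ms, gen0 GCs: {1}", boxed.ElapsedMilliseconds, gen0Collections);
        Console.WriteLine("plain loop:  {0} ms", plain.ElapsedMilliseconds);
        Console.WriteLine("{0} {1}", lastBoxed, lastPlain); // keep both results alive
    }
}
```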

Which code-flow pattern is more efficient in C#/.NET?

Consider the situation in which the main logic of a method should only actually run given a certain condition. As far as I know, there are two basic ways to achieve this:
If inverse condition is true, simply return:
public void aMethod(){
if(!aBoolean) return;
// rest of method code goes here
}
or
If original condition is true, continue execution:
public void aMethod(){
if(aBoolean){
// rest of method code goes here
}
}
Now, I would guess that which of these implementations is more efficient depends on the language it's written in and/or how if statements and return statements, and possibly method calls, are implemented by the compiler/interpreter/VM (depending on language); so the first part of my question is, is this true?
The second part of my question is, if the answer to the first part is "yes", which of the above code-flow patterns is more efficient specifically in C#/.NET 4.6.x?
Edit:
In reference to Dark Falcon's comment: the purpose of this question is not actually to fix performance issues or optimize any real code I've written, I am just curious about how each piece of each pattern is implemented by the compiler, e.g. for arguments sake, if it was compiled verbatim with no compiler optimizations, which would be more efficient?

TL;DR It doesn't make a difference. Current generations of processors (circa Ivy Bridge and later) don't use a static branch-prediction algorithm that you can reason about anymore, so there is no possible performance gain in using one form or the other.
On most older processors, the static branch-prediction strategy is generally that forward conditional jumps are assumed to be not-taken, while backwards conditional jumps are assumed taken. Therefore, there might be a small performance advantage to be gained the first time the code is executed by arranging for the fall-through case to be the most likely, i.e.,
if { expected } else { unexpected }.
But the fact is, this kind of low-level performance analysis makes very little sense when writing in a managed, JIT-compiled language like C#.
You're getting a lot of answers that say readability and maintainability should be your primary concern when writing code. This is regrettably common with "performance" questions, and while it is completely true and unarguable, it mostly skirts the question instead of answering it.
Moreover, it isn't clear why form "A" would be intrinsically more readable than form "B", or vice versa. There are just as many arguments one way or the other—do all parameter validation at the top of the function, or ensure there is only a single return point—and it ultimately gets down to doing what your style guide says, except in really egregious cases where you'd have to contort the code in all sorts of terrible ways, and then you should obviously do what is most readable.
Beyond being a completely reasonable question to ask on conceptual/theoretical grounds, understanding the performance implications also seems like an excellent way to make an informed decision about which general form to adopt when writing your style guide.
The remainder of the existing answers consist of misguided speculation, or downright incorrect information. Of course, that makes sense. Branch prediction is complicated, and as processors get smarter, it only gets harder to understand what is actually happening (or going to happen) under the hood.
First, let's get a couple of things straight. You make reference in the question to analyzing the performance of unoptimized code. No, you don't ever want to do that. It is a waste of time; you'll get meaningless data that does not reflect real-world usage, and then you'll try and draw conclusions from that data, which will end up being wrong (or maybe right, but for the wrong reasons, which is just as bad). Unless you're shipping unoptimized code to your clients (which you shouldn't be doing), then you don't care how unoptimized code performs. When writing in C#, there are effectively two levels of optimization. The first is performed by the C# compiler when it is generating the intermediate language (IL). This is controlled by the optimization switch in the project settings. The second level of optimization is performed by the JIT compiler when it translates the IL into machine code. This is a separate setting, and you can actually analyze the JITed machine code with optimization enabled or disabled. When you're profiling or benchmarking, or even analyzing the generated machine code, you need to have both levels of optimizations enabled.
But benchmarking optimized code is difficult, because the optimization often interferes with the thing you're trying to test. If you tried to benchmark code like that shown in the question, an optimizing compiler would likely notice that neither one of them is actually doing anything useful and transform them into no-ops. One no-op is equally fast as another no-op—or maybe it's not, and that's actually worse, because then all you're benchmarking is noise that has nothing to do with performance.
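A minimal sketch of a benchmark that tries to sidestep this problem is shown below. It gives both forms real work, blocks inlining, and folds every result into a checksum so the calls cannot be discarded as dead code. All names (`FormBenchmark`, `EarlyReturn`, `NestedIf`, `Measure`) are hypothetical, and the delegate call adds overhead of its own, so any difference it reports is more likely noise than signal.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class FormBenchmark
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int EarlyReturn(bool condition, int x)
    {
        if (!condition) return 0;
        return x * 31 + 7;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int NestedIf(bool condition, int x)
    {
        int result = 0;
        if (condition)
        {
            result = x * 31 + 7;
        }
        return result;
    }

    // Times one form; the checksum keeps every call's result observable.
    public static long Measure(Func<bool, int, int> form, int iterations, out long checksum)
    {
        checksum = 0;
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            checksum += form((i & 1) == 0, i);
        }
        watch.Stop();
        return watch.ElapsedTicks;
    }

    static void Main()
    {
        const int iterations = 10000000;
        long sumA, sumB;

        // Warm-up pass so JIT compilation is excluded from the timed runs.
        Measure(EarlyReturn, 1000, out sumA);
        Measure(NestedIf, 1000, out sumB);

        long ticksA = Measure(EarlyReturn, iterations, out sumA);
        long ticksB = Measure(NestedIf, iterations, out sumB);

        Console.WriteLine("early return: {0} ticks, checksum {1}", ticksA, sumA);
        Console.WriteLine("nested if:    {0} ticks, checksum {1}", ticksB, sumB);
    }
}
```

Matching checksums confirm that both forms compute the same thing; the tick counts should be read with the caveats above in mind.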
The best way to go here is to actually understand, on a conceptual level, how the code is going to be transformed by a compiler into machine code. Not only does that allow you to escape the difficulties of creating a good benchmark, but it also has value above and beyond the numbers. A decent programmer knows how to write code that produces correct results; a good programmer knows what is happening under the hood (and then makes an informed decision about whether or not they need to care).
There has been some speculation about whether the compiler will transform form "A" and form "B" into equivalent code. It turns out that the answer is complicated. The IL will almost certainly be different because it will be a more or less literal translation of the C# code that you actually write, regardless of whether or not optimizations are enabled. But it turns out that you really don't care about that, because IL isn't executed directly. It's only executed after the JIT compiler gets done with it, and the JIT compiler will apply its own set of optimizations. The exact optimizations depend on exactly what type of code you've written. If you have:
int A1(bool condition)
{
if (condition) return 42;
return 0;
}
int A2(bool condition)
{
if (!condition) return 0;
return 42;
}
it is very likely that the optimized machine code will be the same. In fact, even something like:
void B1(bool condition)
{
if (condition)
{
DoComplicatedThingA();
DoComplicatedThingB();
}
else
{
throw new ArgumentException();
}
}
void B2(bool condition)
{
if (!condition)
{
throw new ArgumentException();
}
DoComplicatedThingA();
DoComplicatedThingB();
}
will be treated as equivalent in the hands of a sufficiently capable optimizer. It is easy to see why: they are equivalent. It is trivial to prove that one form can be rewritten in the other without changing the semantics or behavior, and that is precisely what an optimizer's job is.
But let's assume that they did give you different machine code, either because you wrote complicated enough code that the optimizer couldn't prove that they were equivalent, or because your optimizer was just falling down on the job (which can sometimes happen with a JIT optimizer, since it prioritizes speed of code generation over maximally efficient generated code). For expository purposes, we'll imagine that the machine code is something like the following (vastly simplified):
C1:
cmp condition, 0 // test the value of the bool parameter against 0 (false)
jne ConditionWasTrue // if true (condition != 0), jump elsewhere;
// otherwise, fall through
call DoComplicatedStuff // condition was false, so do some stuff
ret // return
ConditionWasTrue:
call ThrowException // condition was true, throw an exception and never return
C2:
cmp condition, 0 // test the value of the bool parameter against 0 (false)
je ConditionWasFalse // if false (condition == 0), jump elsewhere;
// otherwise, fall through
call DoComplicatedStuff // condition was true, so do some stuff
ret // return
ConditionWasFalse:
call ThrowException // condition was false, throw an exception and never return
That cmp instruction is equivalent to your if test: it checks the value of condition and determines whether it's true or false, implicitly setting some flags inside the CPU. The next instruction is a conditional branch: it branches to the specified location/label based on the values of one or more flags. In this case, je is going to jump if the "equals" flag is set, while jne is going to jump if the "equals" flag is not set. Simple enough, right? This is exactly how it works on the x86 family of processors, which is probably the CPU for which your JIT compiler is emitting code.
And now we get to the heart of the question that you're really trying to ask; namely, does it matter whether we execute a je instruction to jump if the comparison set the equal flag, or whether we execute a jne instruction to jump if the comparison did not set the equal flag? Again, unfortunately, the answer is complicated, but enlightening.
Before continuing, we need to develop some understanding of branch prediction. These conditional jumps are branches to some arbitrary section in the code. A branch can either be taken (which means the branch actually happens, and the processor begins executing code found at a completely different location), or it can be not taken (which means that execution falls through to the next instruction as if the branch instruction wasn't even there). Branch prediction is very important because mispredicted branches are very expensive on modern processors with deep pipelines that use speculative execution. If it predicts right, it continues uninterrupted; however, if it predicts wrong, it has to throw away all of the code that it speculatively executed and start over. Therefore, a common low-level optimization technique is replacing branches with clever branchless code in cases where the branch is likely to be mispredicted. A sufficiently smart optimizer would turn if (condition) { return 42; } else { return 0; } into a conditional move that didn't use a branch at all, regardless of which way you wrote the if statement, making branch prediction irrelevant. But we're imagining that this didn't happen, and you actually have code with a conditional branch—how does it get predicted?
How branch prediction works is complicated, and getting more complicated all the time as CPU vendors continue to improve the circuitry and logic inside of their processors. Improving branch prediction logic is a significant way that hardware vendors add value and speed to the things they're trying to sell, and every vendor uses different and proprietary branch-prediction mechanisms. Worse, every generation of processor uses slightly different branch-prediction mechanisms, so reasoning about it in the "general case" is exceedingly difficult. Static compilers offer options that allow you to optimize the code they generate for a particular generation of microprocessor, but this doesn't generalize well when shipping code to a large number of clients. You have little choice but to resort to a "general purpose" optimization strategy, although this usually works pretty well. The big promise of a JIT compiler is that, because it compiles the code on your machine right before you use it, it can optimize for your specific machine, just like a static compiler invoked with the perfect options. This promise hasn't exactly been reached, but I won't digress down that rabbit hole.
All modern processors have dynamic branch prediction, but how exactly they implement it is variable. Basically, they "remember" whether a particular (recent) branch was taken or not taken, and then predict that it will go this way the next time. There are all kinds of pathological cases that you can imagine here, and there are, correspondingly, all kinds of cases in or approaches to the branch-prediction logic that help to mitigate the possible damage. Unfortunately, there isn't really anything you can do yourself when writing code to mitigate this problem—except getting rid of branches entirely, which isn't even an option available to you when writing in C# or other managed languages. The optimizer will do whatever it will; you just have to cross your fingers and hope that it is the most optimal thing. In the code we're considering, then, dynamic branch prediction is basically irrelevant and we won't talk about it any more.
What is important is static branch prediction—what prediction is the processor going to make the first time it executes this code, the first time it encounters this branch, when it doesn't have any real basis on which to make a decision? There are a bunch of plausible static prediction algorithms:
Predict all branches are not taken (some early processors did, in fact, use this).
Assume "backwards" conditional branches are taken, while "forwards" conditional branches are not taken. The improvement here is that loops (which jump backwards in the execution stream) will be correctly predicted most of the time. This is the static branch-prediction strategy used by most Intel x86 processors, up to about Sandy Bridge.
Because this strategy was used for so long, the standard advice was to arrange your if statements accordingly:
if (condition)
{
// most likely case
}
else
{
// least likely case
}
This possibly looks counter-intuitive, but you have to go back to what the machine code looks like that this C# code will be transformed into. Compilers will generally transform the if statement into a comparison and a conditional branch into the else block. This static branch prediction algorithm will predict that branch as "not taken", since it's a forward branch. The if block will just fall through without taking the branch, which is why you want to put the "most likely" case there.
If you get into the habit of writing code this way, it might have a performance advantage on certain processors, but it's never enough of an advantage to sacrifice readability. Especially since it only matters the first time the code is executed (after that, dynamic branch prediction kicks in), and executing code for the first time is always slow in a JIT-compiled language!
Always use the dynamic predictor's result, even for never-seen branches.
This strategy is pretty strange, but it's actually what most modern Intel processors use (circa Ivy Bridge and later). Basically, even though the dynamic branch-predictor may have never seen this branch and therefore may not have any information about it, the processor still queries it and uses the prediction that it returns. You can imagine this as being equivalent to an arbitrary static-prediction algorithm.
In this case, it absolutely does not matter how you arrange the conditions of an if statement, because the initial prediction is essentially going to be random. Some 50% of the time, you'll pay the penalty of a mispredicted branch, while the other 50% of the time, you'll benefit from a correctly predicted branch. And that's only the first time—after that, the odds get even better because the dynamic predictor now has more information about the nature of the branch.
This answer has already gotten way too long, so I'll refrain from discussing static prediction hints (implemented only in the Pentium 4) and other such interesting topics, bringing our exploration of branch prediction to a close. If you're interested in more, examine the CPU vendor's technical manuals (although most of what we know has to be empirically determined), read Agner Fog's optimization guides (for x86 processors), search online for various white-papers and blog posts, and/or ask additional questions about it.
The takeaway is probably that it doesn't matter, except on processors that use a certain static branch-prediction strategy, and even there, it hardly matters when you're writing code in a JIT-compiled language like C# because the first-time compilation delay exceeds the cost of a single mispredicted branch (which may not even be mispredicted).

Same issue when validating parameters to functions.
It's much cleaner to act like a night-club bouncer, kicking the no-hopers out as soon as possible.
public void aMethod(SomeParam p)
{
if (!aBoolean || p == null)
return;
// Write code in the knowledge that everything is fine
}
Letting them in only causes trouble later on.
public void aMethod(SomeParam p)
{
if (aBoolean)
{
if (p != null)
{
// Write code, but now you're indented
// and other if statements will be added later
}
// Later on, someone else could add code here by mistake.
}
// or here...
}
The C# language prioritizes safety (bug prevention) over speed. In other words, almost everything has been slowed down to prevent bugs, one way or another.
If you need speed so badly that you start worrying about if statements, then perhaps a faster language would suit your purposes better, possibly C++.
Compiler writers can and do make use of statistics to optimize code, for example "else clauses are only executed 30% of the time".
However, the hardware guys probably do a better job of predicting execution paths. I would guess that these days, the most effective optimizations happen within the CPU, with their L1 and L2 caches, and compiler writers don't need to do a thing.

I am just curious about how each piece of each pattern is implemented
by the compiler, e.g. for arguments sake, if it was compiled verbatim
with no compiler optimizations, which would be more efficient?
The best way to test efficiency in this way is to run benchmarks on the code samples you're concerned with. With C# in particular it is not going to be obvious what the JIT is doing with these scenarios.
As a side note, I throw in a +1 for the other answers that point out that efficiency isn't only determined at the compiler level: code maintainability affects overall efficiency by orders of magnitude more than this specific sort of pattern choice.

As [~Dark Falcon] mentioned, you should not be concerned with micro-optimization of little bits of code; the compiler will most probably optimize both approaches to the same thing.
Instead, you should be very concerned about your program's maintainability and readability.
From this perspective you should choose B for two reasons:
It only has one exit point (just one return)
The if block is surrounded by curly braces
edit
But hey! as told in the comments that is just my opinion and what I consider good practices

Improve RAM usage behaviour to avoid lags

We have a problem which seems to be caused by the constant allocation and deallocation of memory:
We have a rather complex system here, where a USB device is measuring arbitrary points and sending the measurement data to the PC at a rate of 50k samples per second. These samples are then collected as MeasurementTasks in the software for each point and afterwards processed which causes even more needed memory because of the requirements of the calculations.
Simplified each MeasurementTask looks like the following:
public class MeasurementTask
{
public LinkedList<Sample> Samples { get; set; }
public ComplexSample[] ComplexSamples { get; set; }
public Complex Result { get; set; }
}
Where Sample looks like:
public class Sample
{
public ushort CommandIndex;
public double ValueChannel1;
public double ValueChannel2;
}
and ComplexSample like:
public class ComplexSample
{
public double Channel1Real;
public double Channel1Imag;
public double Channel2Real;
public double Channel2Imag;
}
In the calculation process the Samples are first calculated into a ComplexSample each and then further processed until we get our Complex Result. After these calculations are done we release all the Sample and ComplexSample instances and the GC cleans them up soon after, but this results in a constant "up and down" of the memory usage.
This is how it looks at the moment with each MeasurementTask containing ~300k samples:
Now we sometimes have the problem that the sample buffer in our HW device overflows, as it can only store ~5000 samples (~100ms) and it seems the application is not always reading the device fast enough (we use BULK transfer with LibUSB/LibUSBDotNet). We tracked this problem down to this "memory up and down" by the following facts:
the reading from the USB device happens in its own thread which runs at ThreadPriority.Highest, so the calculations should not interfere
CPU usage is between 1-5% on my 8-core CPU => <50% of one core
if we have (much) faster MeasurementTasks with only a few hundred samples each, the memory goes up and down only very little and the buffer never overflows (but the amount of instances/second is the same, as the device still sends 50k samples/second)
we had a bug before, which did not release the Sample and ComplexSample instances after the calculations, so the memory only went up at ~2-3 MB/s and the buffer overflowed all the time
At the moment (after fixing the bug mentioned above) we have a direct correlation between the samples count per point and the overflows. More samples/point = higher memory delta = more overflows.
Now to the actual question:
Can this behaviour be improved (easily)?
Maybe there is a way to tell the GC/runtime to not release the memory so there is no need to re-allocate?
We also thought of an alternative approach by "re-using" the LinkedList<Sample> and ComplexSample[]: Keep a pool of such lists/arrays and instead of releasing them put them back in the pool and "change" these instances instead of creating new ones, but we are not sure this is a good idea as it adds complexity to the whole system...
But we are open to other suggestions!
UPDATE:
I now optimized the code base with the following improvements and did various test runs:
converted Sample to a struct
got rid of the LinkedList<Sample> and replaced it with straight arrays (I actually had another one somewhere else that I also removed)
several minor optimizations I found during analysis and optimization
(optional - see below) converted ComplexSample to a struct
In any case it seems that the problem is gone now on my machine (long-term tests and tests on low-spec hardware will follow), but I first ran a test with both types as structs and got the following memory usage graph:
There it still was going up to ~300 MB on a regular basis (but no overflow errors anymore), but as this still seemed odd to me I did some additional tests:
Side note: Each value of each ComplexSample is altered at least once during the calculations.
1) Add a GC.Collect after a task is processed and the samples are not referenced any more:
Now it was alternating between 140 MB and 150 MB (no noticeable performance hit).
2) ComplexSample as a class (no GC.Collect):
Using a class it is much more "stable" at ~140-200 MB.
3) ComplexSample as a class and GC.Collect:
Now it is going "up and down" a little in the range of 135-150 MB.
Current solution:
As we are not sure this is a valid case for manually calling GC.Collect we are using "solution 2)" now and I will start running the long-term (= several hours) and low-spec hardware tests...

Can this behaviour be improved (easily)?
Yes (depends on how much you need to improve it).
The first thing I would do is change Sample and ComplexSample to value types. This reduces the complexity of the object graph the GC has to deal with: the arrays and linked lists are still collected, but they contain the values directly rather than references to them, which simplifies the rest of the GC's work.
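A minimal sketch of that change, assuming the field layout from the question. `SampleBufferDemo` and `CreateBuffer` are hypothetical names used only for illustration, and the fill values are made up:

```csharp
using System;

// Same fields as the question's Sample, but declared as a value type.
public struct Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}

public static class SampleBufferDemo
{
    // Hypothetical helper: allocates one array and fills it in place.
    public static Sample[] CreateBuffer(int count)
    {
        // A single allocation: the samples are stored inside the array itself,
        // so the GC tracks one object instead of one array plus 'count' objects.
        var buffer = new Sample[count];
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i].CommandIndex = (ushort)(i % 1000); // illustrative data only
            buffer[i].ValueChannel1 = i * 0.5;
            buffer[i].ValueChannel2 = i * 0.25;
        }
        return buffer;
    }
}
```

Writing through the array indexer (`buffer[i].Field = ...`) modifies the element in place. The same pattern does not work with a List<Sample>, whose indexer returns a copy of the struct.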
Then I'd measure performance at this point. The impact of working with relatively large structs is mixed. The guideline that value types should be less than 16 bytes comes from it being around that point where the performance benefits of using a reference type tend to overwhelm the performance benefits of using a value type, but that guideline is only a guideline because "tend to overwhelm" is not the same as "will overwhelm in your application".
After that if it had either not improved things, or not improved things enough, I would consider using a pool of objects; whether for those smaller objects, only the larger objects, or both. This will most certainly increase the complexity of your application, but if it's time-critical, then it might well help. (See How do these people avoid creating any garbage? for example which discusses avoiding normal GC in a time-critical case).
If you know you'll need a fixed maximum of a given type this isn't too hard: create and fill an array of them, dole them out from that array, and return them to it when they are no longer used. It's still hard enough, in that GC is no longer automatic and you have to manually "delete" the objects by putting them back in the pool.
If you don't have such knowledge, it gets harder but is still possible.
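For the fixed-maximum case, a pool could be sketched roughly as follows. `FixedPool<T>` and its members are hypothetical names, not an existing library type, and the sketch is deliberately not thread-safe:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fixed-capacity pool; every instance is created up front.
public sealed class FixedPool<T> where T : class
{
    private readonly Stack<T> _free;

    public FixedPool(int capacity, Func<T> factory)
    {
        // Pre-sizing the stack means it never has to grow its internal array
        // as long as only rented items are returned.
        _free = new Stack<T>(capacity);
        for (int i = 0; i < capacity; i++)
        {
            _free.Push(factory());
        }
    }

    public int Available => _free.Count;

    public T Rent()
    {
        if (_free.Count == 0)
        {
            throw new InvalidOperationException("Pool exhausted.");
        }
        return _free.Pop();
    }

    // The caller must not touch the item after returning it;
    // this is the manual "delete" mentioned above.
    public void Return(T item)
    {
        _free.Push(item);
    }
}
```

It might be used as `var pool = new FixedPool<double[]>(8, () => new double[300000]);`, followed by `Rent()` before a calculation and `Return(...)` after it. Rented arrays still hold the previous contents, so they have to be overwritten or cleared by the caller.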
If it is really vital that you avoid GC, be careful of hidden objects. Adding to most collection types can for example result in them moving up to a larger internal store, and leaving the earlier store to be collected. Maybe this is fine in that you've still reduced GC use enough that it is no longer causing the problem you have, but maybe not.
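The hidden allocation from a growing collection can be observed through List<T>.Capacity. This is a small sketch; `ListGrowthDemo` and `CountReallocations` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

public static class ListGrowthDemo
{
    // Counts how often the list replaces its internal array while 'count' items are added.
    // Each replacement leaves the previous array behind as garbage.
    public static int CountReallocations(List<int> list, int count)
    {
        int reallocations = 0;
        int lastCapacity = list.Capacity;
        for (int i = 0; i < count; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                reallocations++;
                lastCapacity = list.Capacity;
            }
        }
        return reallocations;
    }
}
```

Calling it with `new List<int>()` reports several reallocations for 1000 items, whereas `new List<int>(1000)` reports none, because the capacity passed to the constructor already covers every element.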

I've rarely seen a LinkedList<> used in .NET... Have you tried using a List<>? Consider that the basic "element" of a LinkedList<> is a LinkedListNode<>, which is a class... So for each Sample there is the additional overhead of one whole object.
Note that if you want to use "big" value types (as suggested by others), the List<> could become slower again, because a List<> grows by allocating a new internal array of double the current size and copying from old to new; the bigger the elements, the more memory the List<> has to copy around when it doubles.
If you go to List<> you could try splitting the Sample to
List<ushort> CommandIndex;
List<Sample> ValueChannels;
This because the doubles of Sample require 8 byte alignment, so as written the Sample is 24 bytes, with only 18 bytes used.
This wouldn't be a good idea for LinkedList<>, because the LL has a big overhead per item.
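The padding can be checked with Marshal.SizeOf, which reports the unmanaged size of a struct; for simple sequential structs like these it is a reasonable proxy for the per-element size in an array. `ChannelValues` and `LayoutDemo` are hypothetical names introduced for this sketch:

```csharp
using System;
using System.Runtime.InteropServices;

// The original layout: 2 bytes + 8 bytes + 8 bytes of data, plus alignment padding.
public struct Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}

// Hypothetical split-off struct that holds only the two doubles (no padding needed).
public struct ChannelValues
{
    public double ValueChannel1;
    public double ValueChannel2;
}

public static class LayoutDemo
{
    // Bytes per element when everything is stored in one struct.
    public static int CombinedSize => Marshal.SizeOf<Sample>();

    // Bytes per sample when the index and the values live in two parallel lists or arrays.
    public static int SplitSize => sizeof(ushort) + Marshal.SizeOf<ChannelValues>();
}
```

On a typical 64-bit runtime the combined struct is expected to come out at 24 bytes and the split form at 18 bytes per sample, matching the figures above. The price of splitting is that two collections have to be kept in step.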

Change Sample and ComplexSample to struct.

How do these people avoid creating any garbage?

Here's an interesting article that I found on the web.
It talks about how this firm is able to parse a huge amount of financial data in a managed environment, essentially by object reuse and avoiding immutables such as string. They then go on and show that their program doesn't do any GC during the continuous operation phase.
This is pretty impressive, and I'd like to know if anyone else here has some more detailed guidelines as to how to do this. For one, I'm wondering how the heck you can avoid using string, when blatantly some of the data inside the messages are strings, and whatever client application is looking at the messages will want to be passed those strings? Also, what do you allocate in the startup phase? How will you know it's enough? Is it simply a matter of claiming a big chunk of memory and keeping a reference to it so that GC doesn't kick in? What about whatever client application is using the messages? Does it also need to be written according to these stringent standards?
Also, would I need a special tool to look at the memory? I've been using SciTech memory profiler thus far.

I found the paper you linked to rather deficient:
It assumes, and wants you to assume, that garbage collection is the ultimate latency killer. They have not explained why they think so, nor have they explained in what way their system is not basically a custom-made garbage collector in disguise.
It talks about the amount of memory cleaned up in garbage collection, which is irrelevant: the time taken to garbage collect depends more on the number of objects, irrespective of their size.
The table of “results” at the bottom provides no comparison to a system that uses .NET’s garbage collector.
Of course, this doesn’t mean they’re lying and it’s nothing to do with garbage collection, but it basically means that the paper is just trying to sound impressive without actually divulging anything useful that you could use to build your own.

One thing to note from the beginning is where they say "Conventional wisdom has been developing low latency messaging technology required the use of unmanaged C++ or assembly language". In particular, they are talking about a sort of case where people would often dismiss a .NET (or Java) solution out of hand. For that matter, a relatively naïve C++ solution probably wouldn't make the grade either.
Another thing to consider here is that they haven't so much gotten rid of the GC as replaced it: there's code there managing object lifetime, but it's their own code.
There are several different ways one could do this instead. Here's one. Say I need to create and destroy several Foo objects as my application runs. Foo creation is parameterised by an int, so the normal code would be:
public class Foo
{
private readonly int _bar;
public Foo(int bar)
{
_bar = bar;
}
/* other code that makes this class actually interesting. */
}
public class UsesFoo
{
public void FooUsedHere(int param)
{
Foo baz = new Foo(param);
//Do something here
//baz falls out of scope and is liable to GC collection
}
}
A much different approach is:
public class Foo
{
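//Placeholder value (an assumption): size this to the most instances needed at once.
private const int MOST_POSSIBLY_NEEDED = 1024;
private static readonly Foo[] FOO_STORE = new Foo[MOST_POSSIBLY_NEEDED];
private static Foo FREE;
static Foo()
{
//Build a null-terminated free list: FOO_STORE[0] -> FOO_STORE[1] -> ... -> null.
//(Linking the last element back to the first would make the list circular,
//and GetFoo would eventually hand out objects that are still in use.)
for(int idx = MOST_POSSIBLY_NEEDED - 1; idx >= 0; --idx)
{
Foo newFoo = FOO_STORE[idx] = new Foo();
newFoo._next = FREE;
FREE = newFoo;
}
}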
private Foo _next;
//Note _bar is no longer readonly. We lose the advantages
//as a cost of reusing objects. Even if Foo acts immutable
//it isn't really.
private int _bar;
public static Foo GetFoo(int bar)
{
Foo ret = FREE;
FREE = ret._next;
ret._next = null;
ret._bar = bar; //initialise the reused instance with the requested value
return ret;
}
public void Release()
{
_next = FREE;
FREE = this;
}
/* other code that makes this class actually interesting. */
}
public class UsesFoo
{
public void FooUsedHere(int param)
{
Foo baz = Foo.GetFoo(param);
//Do something here
baz.Release();
}
}
Further complication is added if you are multithreaded (though for really high performance in a non-interactive environment, you may want either a single thread or a separate store of Foo instances per thread), and if you cannot predict MOST_POSSIBLY_NEEDED in advance. The simplest way to handle the latter is to create a new Foo() as needed but never release it for GC, which can easily be done in the above code by creating a new Foo when the free list is empty (FREE is null).
If we allow for unsafe code we can gain even more by making Foo a struct (so that the array holds one contiguous block of memory rather than references), with _next being a pointer to Foo and GetFoo() returning a pointer.
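The grow-on-demand variant could be sketched like this. It is a standalone simplification of the Foo example rather than the exact code above: the pre-filled array is dropped, and the `Bar` property is added only so the result can be inspected:

```csharp
using System;

public sealed class Foo
{
    // Head of the free list; null means no released instance is available.
    private static Foo _free;

    private Foo _next;
    private int _bar;

    private Foo() { }

    // Added for illustration so callers can see the stored value.
    public int Bar => _bar;

    public static Foo GetFoo(int bar)
    {
        Foo ret = _free;
        if (ret == null)
        {
            // Free list is empty: allocate a new instance instead of failing.
            // It joins the pool for good once Release() is called on it.
            ret = new Foo();
        }
        else
        {
            _free = ret._next;
            ret._next = null;
        }
        ret._bar = bar;
        return ret;
    }

    public void Release()
    {
        // Push this instance back onto the free list for reuse.
        _next = _free;
        _free = this;
    }
}
```

This version is not thread-safe. One way to get a separate store per thread, under the same assumptions, would be to mark the static `_free` field with `[ThreadStatic]`, at the cost that an instance released on a different thread ends up in that thread's list.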
Whether this is what these people are actually doing, I of course cannot say, but the above does prevent GC from activating. This will only be faster in very high throughput conditions, if not then letting GC do its stuff is probably better (GC really does help you, despite 90% of questions about it treating it as a Big Bad).
There are other approaches that similarly avoid GC. In C++ the new and delete operators can be overridden, which allows for the default creation and destruction behaviour to change, and discussions of how and why one might do so might interest you.
A practical take-away from this is when objects either hold resources other than memory that are expensive (e.g. connections to databases) or "learn" as they continue to be used (e.g. XmlNameTables). In this case pooling objects is useful (ADO.NET connections do so behind the scenes by default). In this case though a simple Queue is the way to go, as the extra overhead in terms of memory doesn't matter. You can also abandon objects on lock contention (you're looking to gain performance, and lock contention will hurt it more than abandoning the object), which I doubt would work in their case.

From what I understood, the article doesn't say they don't use strings. They don't use immutable strings. The problem with immutable strings is that when you're doing parsing, most of the strings generated are just throw-away strings.
I'm guessing they're using some sort of pre-allocation combined with free lists of mutable strings.

I worked for a while with a CEP product called StreamBase. One of their engineers told me that they were migrating their C++ code to Java because they were getting better performance, fewer bugs and better portability on the JVM by pretty much avoiding GC altogether. I imagine the arguments apply to the CLR as well.
It seemed counter-intuitive, but their product was blazingly fast.
Here's some information from their site:
StreamBase avoids garbage collection in two ways: Not using objects, and only using the minimum set of objects we need.
First, we avoid using objects by using Java primitive types (Boolean, byte, int, double, and long) to represent our data for processing. Each StreamBase data type is represented by one or more primitive type. By only manipulating the primitive types, we can store data efficiently in stack or array allocated regions of memory. We can then use techniques like parallel arrays or method calling to pass data around efficiently.
Second, when we do use objects, we are careful about their creation and destruction. We tend to pool objects rather than releasing them for garbage collection. We try to manage object lifecycle such that objects are either caught by the garbage collector in the young generation, or kept around forever.
Finally, we test this internally using a benchmarking harness that measures per-tuple garbage collection. In order to achieve our high speeds, we try to eliminate all per-tuple garbage collection, generally with good success.
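On the CLR, a rough equivalent of such a harness can be built on GC.CollectionCount. This is only a sketch of the idea, not StreamBase's tooling, and `GcProbe` is a hypothetical name:

```csharp
using System;

public static class GcProbe
{
    // Returns how many generation-0 collections happened while 'action' ran.
    // A steady-state processing loop that produces no garbage should report 0,
    // provided no other thread allocates at the same time.
    public static int Gen0CollectionsDuring(Action action)
    {
        int before = GC.CollectionCount(0);
        action();
        return GC.CollectionCount(0) - before;
    }
}
```

Wrapping the per-message processing loop in `GcProbe.Gen0CollectionsDuring(...)` during a test run gives a cheap regression check: a non-zero result means something on the hot path started allocating again.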

99% of the time you will be wasting your boss's money when you try to achieve this. The article describes an absolutely extreme scenario where they need the last drop of performance. As you can read in the article, there are large parts of the .NET framework that can't be used when trying to be GC-free. Some of the most basic parts of the BCL use memory allocations (or 'produce garbage', as the paper calls it). You will need to find a way around those methods. And even when you need absolutely blazingly fast applications, you'd better first try to build an application/architecture that can scale out (use multiple machines) before trying to walk the no-GC route. The sole reason for them to use the no-GC route is that they need absolutely low latency. IMO, when you need absolute speed but don't care about the absolute minimum response time, it will be hard to justify a no-GC architecture. Besides this, if you try to build a GC-free client application (such as a Windows Forms or WPF app), forget it; those presentation frameworks create new objects constantly.
But if you really want this, it is actually quite simple. Here is a simple how to:
Find out which parts of the .NET API can't be used (you can write a tool that analyzes the .NET assemblies using an introspection engine).
Write a program that verifies the code you or your developers write to ensure they don't allocate directly or use 'forbidden' .NET methods, using the safe list created in the previous point (FxCop is a great tool for this).
Create object pools that you initialize at startup time. The rest of the program can reuse existing objects so that it won't have to do any new ops.
If you need to manipulate strings, use byte arrays for this and store byte arrays in a pool (WCF uses this technique also). You will have to create an API that allows manipulating those byte arrays.
And last but not least, profile, profile, profile.
Good luck
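The byte-array step could look roughly like the following. It uses ArrayPool<byte>.Shared from System.Buffers as a stand-in for a hand-written pool (an assumption; the answer does not name a specific pool), and `PooledBytes`, `ParseInt32` and `Demo` are hypothetical names:

```csharp
using System;
using System.Buffers;
using System.Text;

public static class PooledBytes
{
    // Parses a non-negative decimal number directly from ASCII bytes,
    // without creating an intermediate string.
    public static int ParseInt32(byte[] buffer, int offset, int count)
    {
        int value = 0;
        for (int i = offset; i < offset + count; i++)
        {
            value = value * 10 + (buffer[i] - (byte)'0');
        }
        return value;
    }

    public static int Demo()
    {
        // Rent may hand back a larger array than requested, so track the used length.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(16);
        try
        {
            // Stand-in for bytes that would normally arrive from a socket or device.
            int length = Encoding.ASCII.GetBytes("12345", 0, 5, buffer, 0);
            return ParseInt32(buffer, 0, length);
        }
        finally
        {
            // Hand the buffer back so the next caller can reuse it.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

The string literal in `Demo` exists only to produce test input. In a real no-GC path the bytes would be read straight into the rented buffer, and the parsing API would need validation for signs, non-digits and overflow.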
