I've just seen a very interesting talk by Herb Sutter at the Build Conference 2014, called "Modern C++: What You Need to Know". Here's a link to the video of the talk: http://channel9.msdn.com/Events/Build/2014/2-661
One of the topics of the talk was on how std::vector is very cache-friendly, mainly because it ensures that its elements are adjacent in the heap, which has a great impact on spatial locality, or at least that's what I think I've understood; this is the case even for inserting and removing items. Clever use of std::vector can bring dramatic performance improvements by exploiting caching.
I'd like to try something like that with C#/.NET, but how do I ensure that the objects in my collections are all adjacent in memory?
Any other pointer to resources on cache-friendliness on C# and .Net is also appreciated. :)
For GC-managed languages, you tend to lose that explicit control over where every object is stored in memory. You can control memory access patterns, but that becomes a bit futile if you can't control which memory addresses you are accessing. On top of that, every object tends to come with the unavoidable overhead of an indirection and something resembling a virtual pointer (under the hood) to allow dynamic dispatch, reflection, etc. It's the analogical equivalent of having to store pointers (references) to objects and work with them indirectly, with each object instance also storing the analogical pointer needed for runtime type information like reflection and for virtual dispatch.
As a result, even if you store an array of object references contiguously, that's only making the analogical pointers to the objects cache-friendly to access sequentially. The contents of each object could still be scattered in memory, incurring cache misses galore as you load memory regions into cache lines only to use one object's worth of data before that data gets evicted.
It's actually the number one appeal of a language like C++ to me: it lets you use OOP while controlling where everything is allocated in memory and exactly how much memory everything uses, and it lets you opt out (by default, actually) of the overhead associated with virtual dispatch, RTTI, etc., while still using objects and generics and so forth. Meanwhile, the biggest appeal to me personally of languages like C# and Java is what you get with each object, like reflection, which comes with per-object overhead but is a justified cost if your code has plenty of use for it.
Using plain old data types (including structs in C#):
That said, I've seen very efficient code written in C# and Java, on par with C and C++, but the key difference is that it avoided objects in the tiny fraction of code that is really performance-critical. For example, I saw an interactive Java raytracer using path tracing with very impressive speed given the brute-force nature of what it was doing. While most of the raytracer was written as nice object-oriented code, the performance-critical parts (the BVH, the mesh representation, and the triangles stored in the leaves) avoided objects and just used big arrays of int[] and float[]. That performance-critical code was pretty "raw", maybe even more "raw" than equally-optimized C++ code (it looked closer to C or Fortran), but it was only necessary for a few key areas of the raytracer.
When you use arrays of plain old data types for the performance-critical areas, you gain sufficient control over memory to make all the difference. It doesn't matter so much if the array is GC-managed and occasionally moved from one memory location to another after a GC cycle (e.g., out of Eden space after a first GC cycle), because the array is moved as a whole: the element at index 1 is still right next to elements 0 and 2. That's all that matters for cache-friendly sequential processing of an array: that each element sits right next to the next in memory. And even in Java and C#, as long as you're using arrays of PODs (which include structs in C#, the last time I checked), you have that level of control.
So when you're writing performance-critical code, make sure your designs leave enough breathing room to change the representation of how things are stored, and possibly away from objects if the design becomes bottlenecky in the future. As a basic example, for an Image object (which is effectively a collection of pixels), you might avoid storing pixels as individual objects and you definitely don't want to expose an abstract Pixel object for clients to use directly. Instead you might represent a collection of pixels as an array of plain old integers or floats behind an Image interface, and a single image might represent a million pixels. That will allow cache-friendly sequential processing of the image.
Avoid using new for a boatload of teeny things.
Don't use new excessively for teeny things, to put it simply. Allocate in bulk for performance-critical areas: one new for an entire array of a million integers representing a million pixels in an image, for example, not a million calls to new allocating one pixel at a time to locations in memory outside of your control. Besides cache-friendliness, if you allocate each pixel as a separate object in C#, the memory overhead of the analogical pointer for dynamic dispatch, reflection, and so forth will often be bigger than the entire pixel itself, doubling or tripling the memory use of each pixel.
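A minimal sketch of that idea (Image and its members are illustrative names, not from any particular library):

public class Image
{
    private readonly int[] pixels; // one allocation for all pixels, packed 32-bit RGBA
    private readonly int width;

    public Image(int width, int height)
    {
        this.width = width;
        this.pixels = new int[width * height]; // a single call to new, not one per pixel
    }

    public int GetPixel(int x, int y)
    {
        return pixels[y * width + x];
    }

    public void SetPixel(int x, int y, int rgba)
    {
        pixels[y * width + x] = rgba;
    }
}

Clients depend on Image, never on an individual pixel object, so the pixel representation stays free to change.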
Design for bulk, homogeneous processing in performance-critical areas.
If you're writing a video game revolving around OOP and inheritance rather than an ECS and duck typing, the classic inheritance example is often too granular: Dog inherits Mammal, Cat inherits Mammal. If you are going to be dealing with a boatload of mammals in your game loops every frame, I recommend instead making Cats inherit Mammals and Dogs inherit Mammals. Mammals becomes an abstract container rather than something that represents just one mammal at a time. This gives your designs enough breathing room to process, say, many dogs' worth of contiguous primitive data at one time very efficiently when you work with dogs indirectly, through polymorphism, via the Mammals interface.
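As a rough sketch of that coarser design (all names here are illustrative):

public abstract class Mammals
{
    // One call processes the whole batch, not one mammal at a time.
    public abstract void Update(float deltaTime);
}

public class Dogs : Mammals
{
    // Contiguous primitive data for every dog in the game.
    private readonly float[] positions;

    public Dogs(int count)
    {
        positions = new float[count];
    }

    public override void Update(float deltaTime)
    {
        // Cache-friendly sequential pass over many dogs' worth of data.
        for (int i = 0; i < positions.Length; i++)
        {
            positions[i] += deltaTime;
        }
    }
}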
This advice actually applies regardless of whether you're using C, C++, Java, C#, or anything else. To write cache-friendly code, you have to start with designs that leave enough breathing room to optimize their data representations and access patterns as needed in the future, ideally with a profiler in hand. The worst-case scenario is to end up with a bottlenecky design that has accumulated many dependencies, like many dependencies on a Pixel object or IPixel interface, which is too granular to optimize further without rewriting and redesigning the entire software. So avoid mass dependencies on overly granular designs that leave no breathing room to optimize. Redirect the dependencies away from the analogical Pixel to the analogical Image, from the analogical Mammal to the analogical Mammals, and you'll be able to optimize to your heart's content without costly redesigns.
It seems like the only way to achieve this in C# is to use value types, which means using structs instead of classes. A List<T> of structs will then store your objects in contiguous memory. You can read more on structs vs. classes here: Choosing Between Class and Struct
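For example (a minimal sketch; Point is a hypothetical value type):

using System.Collections.Generic;

public struct Point
{
    public int X;
    public int Y;
}

public static class Demo
{
    public static void Main()
    {
        // The list's internal array holds the Point values themselves,
        // contiguously, rather than references to scattered heap objects.
        List<Point> points = new List<Point>(1000000);
        points.Add(new Point { X = 1, Y = 2 });
    }
}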
You can use List<T>, which stores items contiguously (like a std::vector).
See this answer for more info.
One problem with instances is that the default Equals method doesn't return true even when two instances contain the same values, and overriding Equals to compare values is much slower than reference equality.
While I was thinking about performance, it struck me that it is wasteful to create two instances that are equal but have different memory addresses. If I can avoid creating equal instances with different references, that should improve performance, and comparing references is easier than writing custom Equals methods.
For example:
I have a Coordinate class which holds chess board coordinates, so I only need a Coordinate[8,8] array to represent all coordinates of the board. Instead of creating instances on demand, I can create all the instances up front and have my factory method return them.
Coordinate.Get(2,3) instead of new Coordinate(2,3)
The first is a static method on a static class which returns the pre-defined coordinate for the given values.
Another advantage is that we don't spend time creating and garbage-collecting objects; they are all pre-defined. We can also provide a unique GetHashCode for each instance in an easy and fast way, like 0 for [0,0], 1 for [0,1], and so on.
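A sketch of what that could look like (the member names are just my guesses at the design):

public sealed class Coordinate
{
    private static readonly Coordinate[,] cache = new Coordinate[8, 8];

    static Coordinate()
    {
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                cache[x, y] = new Coordinate(x, y);
    }

    public readonly int X;
    public readonly int Y;

    private Coordinate(int x, int y)
    {
        X = x;
        Y = y;
    }

    // Always returns the same pre-built instance for a given square,
    // so reference equality is value equality and GetHashCode is trivial.
    public static Coordinate Get(int x, int y)
    {
        return cache[x, y];
    }

    public override int GetHashCode()
    {
        return X * 8 + Y;
    }
}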
Isn't it worth trying? Would this idea make the code harder to write or understand? Is there a pattern for this?
In short, what are the disadvantages of this approach?
This is a good solution and in certain situations can save you a lot of time and memory. The main drawback is that it gets really complicated if your objects are mutable. If that is not the case, then this is really not that bad. You just have to make sure that all instances are obtained from the same factory. You do not even have to create all the instances in advance, but can make the class create a new instance when a particular set of parameters is requested for the first time (basically lazy-loading).
In this instance where you're dealing with a chess board (so 64 total coordinates) you don't need to be overly concerned with performance or garbage collection. Having said that, I think holding the baseline 64 coordinates in a dictionary is just fine (and makes sense). In terms of the Equals() comparison, you're basically comparing 2 integer values which is going to be lightning fast so overriding the Equals() method to specify that comparison is the right approach.
The short answer is that the disadvantage of this approach is that it increases the complexity of your code, thereby making it more difficult to code correctly, more difficult to debug, and more difficult to maintain.
Therefore, as with all optimisations, you should only contemplate it if you have a genuine need to optimise (e.g. it's running way too slow, or using way too much memory), and if the reward (faster performance, smaller memory usage) outweighs the risks (spending time optimising when you could be doing something more useful, introducing a bug, not being able to find a bug, not being able to modify the code in the future quickly and easily, or without introducing a bug).
If you really were running into performance problems because of having to do deep comparisons in order to determine equality, then you might want to look at the Flyweight pattern, but as you're only talking about 64 pairs of small ints, I think that would be complete overkill in this case. What you actually need is to override the Equals operator to compare the two coordinates - that will be plenty fast enough. Anything more complex than that is probably a false optimisation, at least on most normal platforms.
I was going to write a long-winded post, but I'll boil it down here:
I'm trying to emulate the old-school graphical style of the NES via XNA. However, my FPS is SLOW when modifying 65K pixels per frame. If I just loop through all 65K pixels and set them to some arbitrary color, I get 64 FPS. With the code I wrote to look up which colors should be placed where, I get 1 FPS.
I think it is because of my object-oriented code.
Right now, I have things divided into about six classes with getters/setters. I'm guessing that I'm making at least 360K getter calls per frame, which I think is a lot of overhead. Each class contains 1D or 2D arrays of custom enumerations, ints, Colors, Vector2Ds, or bytes.
What if I combined all of the classes into just one, and accessed the contents of each array directly? The code would look a mess, and ditch the concepts of object-oriented coding, but the speed might be much faster.
I'm also not concerned about access violations, as any attempts to get/set the data in the arrays will be done in blocks. E.g., all writing to arrays will take place before any data is read from them.
As for casting: as I said, I'm using custom enumerations, ints, Colors, Vector2Ds, and bytes. Which data types are fastest to use and access in the .NET Framework / XNA / Xbox / C#? I think that constant casting might be a cause of slowdown here.
Also, instead of using math to figure out which indexes data should be placed in, I've used precomputed lookup tables so I don't have to use constant multiplication, addition, subtraction, division per frame. :)
There's a terrific presentation from GDC 2008 that is worth reading if you are an XNA developer. It's called Understanding XNA Framework Performance.
For your current architecture - you haven't really described it well enough to give a definite answer - you probably are doing too much unnecessary "stuff" in a tight loop. If I had to guess, I'd suggest that your current method is thrashing the cache - you need to fix your data layout.
In the ideal case you should have a nice big array of small-as-possible value types (structs not classes), and a heavily inlined loop that shoves data into it linearly.
(Aside: regarding what is fast: Integer and floating point maths is very fast - in general, you shouldn't use lookup tables. Function calls are pretty fast - to the point that copying large structs when you pass them will be more significant. The JIT will inline simple getters and setters - although you shouldn't depend on it to inline anything else in very tight loops - like your blitter.)
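For the CPU-side path, a rough sketch of that layout (assuming standard XNA 4 types; palette and indices stand in for your lookup data, and in practice the texture must not be bound to the device when you call SetData):

using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

public class NesGame : Game
{
    // Allocated once, reused every frame.
    private readonly Color[] buffer = new Color[256 * 240];
    private Texture2D screen;
    private Color[] palette;  // your NES palette (assumption)
    private byte[] indices;   // per-pixel palette indices (assumption)

    protected override void LoadContent()
    {
        screen = new Texture2D(GraphicsDevice, 256, 240);
    }

    protected override void Draw(GameTime gameTime)
    {
        // Tight linear loop over one contiguous array of value types.
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = palette[indices[i]];
        }
        screen.SetData(buffer); // one upload to the GPU per frame
        // ... then draw `screen` with a SpriteBatch.
        base.Draw(gameTime);
    }
}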
HOWEVER - even if optimised - your current architecture sucks. What you are doing flies in the face of how a modern GPU works. You should be loading your sprites onto your GPU and letting it composite your scene.
If you want to manipulate your sprites at a pixel level (for example: palette swapping as you have mentioned) then you should be using pixel shaders. The CPU on the 360 (and on PCs) is fast, but the GPU is so much faster when you're doing something like this!
The Sprite Effects XNA sample is a good place to get started.
Have you profiled your code to determine where the slowdown is? Before you go rewriting your application, you ought to at least know which parts need to be rewritten.
I strongly suspect that the overhead of the accessors and data conversions is trivial. It's much more likely that your algorithms are doing unnecessary work, recomputing values that they could cache, and other things that can be addressed without blowing up your object design.
Are you specifying a color and such for each pixel, or something like that? If so, I think you should rethink the architecture. Start using sprites; that will speed things up.
EDIT
Okay, I think the solution could be to load several small sprites with different colours (a sprite of a few pixels each) and reuse those. It is faster to point to the same sprite than to assign a different colour to each pixel, as the sprite has already been loaded into memory.
As with any performance problem, you should profile the application to identify the bottlenecks rather than trying to guess. I seriously doubt that getters and setters are at the root of your problem. The compiler almost always inlines these sorts of functions. I'm also curious what you have against math. Multiplying two integers, for instance, is one of the fastest things the computer can do.
I have another active question HERE regarding some hopeless memory issues that possibly involve LOH fragmentation, among possibly other unknowns.
What my question now is, what is the accepted way of doing things?
If my app needs to be done in Visual C#, and needs to deal with large arrays to the tune of int[4000000], how can I not be doomed by the garbage collector's refusal to deal with the LOH?
It would seem that I am forced to make any large arrays global, and never use the word "new" around any of them. So, I'm left with ungraceful global arrays with "maxindex" variables instead of neatly sized arrays that get passed around by functions.
I've always been told that this was bad practice. What alternative is there?
Is there some kind of function to the tune of System.GC.CollectLOH("Seriously") ?
Is there possibly some way to outsource garbage collection to something other than System.GC?
Anyway, what are the generally accepted rules for dealing with large (>85 KB) variables?
Firstly, the garbage collector does collect the LOH, so do not be immediately scared by its presence. The LOH gets collected when generation 2 gets collected.
The difference is that the LOH does not get compacted, which means that if you have an object in there that has a long lifetime then you will effectively be splitting the LOH into two sections — the area before and the area after this object. If this behaviour continues to happen then you could end up with the situation where the space between long-lived objects is not sufficiently large for subsequent assignments and .NET has to allocate more and more memory in order to place your large objects, i.e. the LOH gets fragmented.
Now, having said that, the LOH can shrink in size if the area at its end is completely free of live objects, so the only problem is if you leave objects in there for a long time (e.g. the duration of the application).
Starting from .NET 4.5.1, LOH could be compacted, see GCSettings.LargeObjectHeapCompactionMode property.
Strategies to avoid LOH fragmentation are:
Avoid creating large objects that hang around. Basically this just means large arrays, or objects which wrap large arrays (such as the MemoryStream which wraps a byte array), as nothing else is that big (components of complex objects are stored separately on the heap so are rarely very big). Also watch out for large dictionaries and lists as these use an array internally.
Watch out for double arrays — the threshold for these going into the LOH is much, much smaller — I can't remember the exact figure, but it's only a few thousand.
If you need a MemoryStream, consider making a chunked version that backs onto a number of smaller arrays rather than one huge array. You could also make custom versions of IList and IDictionary which use chunking to avoid stuff ending up in the LOH in the first place.
Avoid very long Remoting calls, as Remoting makes heavy use of MemoryStreams which can fragment the LOH during the length of the call.
Watch out for string interning — for some reason these are stored as pages on the LOH and can cause serious fragmentation if your application continues to encounter new strings to intern, i.e. avoid using string.Intern unless the set of strings is known to be finite and the full set is encountered early on in the application's life. (See my earlier question.)
Use Son of Strike to see what exactly is using the LOH memory. Again see this question for details on how to do this.
Consider pooling large arrays.
Edit: the LOH threshold for double arrays appears to be 8k.
It's an old question, but I figure it doesn't hurt to update answers with changes introduced in .NET. It is now possible to defragment the Large Object Heap. Clearly the first choice should be to make sure the best design choices were made, but it is nice to have this option now.
https://msdn.microsoft.com/en-us/library/xe0c2357(v=vs.110).aspx
"Starting with the .NET Framework 4.5.1, you can compact the large object heap (LOH) by setting the GCSettings.LargeObjectHeapCompactionMode property to GCLargeObjectHeapCompactionMode.CompactOnce before calling the Collect method, as the following example illustrates."
GCSettings can be found in the System.Runtime namespace
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
The first thing that comes to mind is to split the array up into smaller ones, so they don't reach the size at which the GC puts them in the LOH. You could split the arrays into smaller ones of, say, 10,000 elements, and build an object which would know which array to look in based on the indexer you pass.
Now I haven't seen the code, but I would also question why you need an array that large. I would potentially look at refactoring the code so all of that information doesn't need to be stored in memory at once.
You've got it wrong. You do NOT need to have an array of size 4000000, and you definitely do not need to call the garbage collector.
Write your own IList implementation, like a "PagedList":
Store items in arrays of 65536 elements.
Create an array of arrays to hold the pages.
This allows you to access basically all your elements with only ONE extra redirection. And, as the individual arrays are smaller, fragmentation is not an issue...
...and if it is, then REUSE pages. Don't throw them away on dispose; put them on a static "PageList" and pull from there first. All this can be done transparently within your class.
The really good thing is that this list is pretty dynamic in its memory usage. You may want to resize the holder array (the redirector). Even when you don't, it is only about 512 KB of data per page.
Second-level arrays hold 65536 elements each: that is 8 bytes per element for a class type (512 KB per page on 64-bit, 256 KB on 32-bit), or 64 KB per page for a one-byte struct.
Technically:
Turn
int[]
into
int[][]
Decide whether 32 or 64 bit is better, as you wish ;) Both have advantages and disadvantages.
Dealing with ONE large array like that is unwieldy in any language - if you have to, then... basically... allocate it at program start and never recreate it. That's the only solution.
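A bare-bones sketch of the idea (I've picked a page size that keeps each int[] page under the 85 KB LOH threshold, smaller than the 65536 elements suggested above; all names are illustrative):

public class PagedList
{
    private const int PageSize = 16384; // 64 KB per int[] page, below the LOH threshold
    private readonly int[][] pages;

    public PagedList(int capacity)
    {
        int pageCount = (capacity + PageSize - 1) / PageSize;
        pages = new int[pageCount][];
        for (int i = 0; i < pageCount; i++)
        {
            pages[i] = new int[PageSize];
        }
    }

    // One extra redirection per access: first the page, then the offset.
    public int this[int index]
    {
        get { return pages[index / PageSize][index % PageSize]; }
        set { pages[index / PageSize][index % PageSize] = value; }
    }
}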
This is an old question, but with .NET Standard 1.1 (.NET Core, .NET Framework 4.5.1+) there is another possible solution:
Using ArrayPool<T> in the System.Buffers package, we can pool arrays to avoid this problem.
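For example (using the real System.Buffers API; the fill loop is just illustrative):

using System.Buffers;

public static class Demo
{
    public static void Main()
    {
        // Rent may return a larger array than requested; only use the first 4,000,000 slots.
        int[] buffer = ArrayPool<int>.Shared.Rent(4000000);
        try
        {
            for (int i = 0; i < 4000000; i++)
            {
                buffer[i] = i;
            }
        }
        finally
        {
            // Returning the array lets the pool reuse it instead of churning the LOH.
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}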
I am adding an elaboration to the answer above about how the issue can arise. Fragmentation of the LOH is not only dependent on the objects being long-lived. If you've got multiple threads, each creating big lists that go onto the LOH, you can have the situation where the first thread needs to grow its list, but the next contiguous bit of memory is already taken up by a list from a second thread; the runtime then allocates new memory for the first thread's list - leaving behind a rather big hole. This is what's happening on one project I've inherited: even though the LOH is approx 4.5 MB, the runtime has a total of 117 MB of free memory, but the largest free memory segment is 28 MB.
Another way this could happen without multiple threads is if you've got more than one list being added to in some kind of loop: as each expands beyond the memory initially allocated to it, each leapfrogs the other as they grow beyond their allocated spaces.
A useful link is: https://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/
I'm still looking for a solution for this. One option may be to use some kind of pooled objects and request from the pool when doing the work. If you're dealing with large arrays, another option is to develop a custom collection, e.g. a collection of collections, so that you don't have just one huge list but break it up into smaller lists, each of which avoids the LOH.
Given a case where I have an object that may be in one or more true/false states, I've always been a little fuzzy on why programmers frequently use flags+bitmasks instead of just using several boolean values.
It's all over the .NET framework. Not sure if this is the best example, but the .NET framework has the following:
[Flags]
public enum AnchorStyles
{
    None = 0,
    Top = 1,
    Bottom = 2,
    Left = 4,
    Right = 8
}
So given an anchor style, we can use bitmasks to figure out which of the states are selected. However, it seems like you could accomplish the same thing with an AnchorStyle class/struct with bool properties defined for each possible value, or an array of individual enum values.
Of course the main reason for my question is that I'm wondering if I should follow a similar practice with my own code.
So, why use this approach?
Less memory consumption? (it doesn't seem like it would consume less than an array/struct of bools)
Better stack/heap performance than a struct or array?
Faster compare operations? Faster value addition/removal?
More convenient for the developer who wrote it?
It was traditionally a way of reducing memory usage. So, yes, it's quite obsolete in C# :-)
As a programming technique, it may be obsolete in today's systems, and you'd be quite alright to use an array of bools, but...
It is fast to compare values stored as a bitmask: use the AND and OR logical operators and compare the resulting two ints.
It uses considerably less memory. Putting all 4 of your example values in a bitmask would use half a byte. Using an array of bools would most likely use a few bytes of overhead for the array object plus a byte for each bool. If you have to store a million values, you'll see exactly why a bitmask version is superior.
It is easier to manage: you only have to deal with a single integer value, whereas an array of bools would be stored quite differently in, say, a database.
And, because of the memory layout, it is much faster in every respect than an array. It's nearly as fast as using a single 32-bit integer, which is about as fast as you can get for operations on data.
Easy to set multiple flags in any order.
Easy to save and retrieve a series of bits like 0101011 from a database.
Among other things, it's easier to add new bit meanings to a bitfield than to add new boolean values to a class. It's also easier to copy a bitfield from one instance to another than a series of booleans.
It can also make methods clearer. Imagine a method with 10 bools vs. one bitmask.
Actually, it can have better performance, mainly if your enum derives from byte.
In that case, each enum value is represented by a single byte, which can hold 8 flags covering all 256 of their combinations. Representing those 8 flags with separate booleans would take at least 8 bytes.
But even then, I don't think that is the real reason. The reason I prefer flags is the power C# gives me to handle those enums: I can add several values with a single expression, I can remove them as well, and I can even compare several values at once with a single expression. With booleans, code can become, let's say, more verbose.
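For example, with the AnchorStyles enum from the earlier question:

AnchorStyles style = AnchorStyles.Top | AnchorStyles.Left;  // set several flags at once

style |= AnchorStyles.Right;   // add a flag
style &= ~AnchorStyles.Left;   // remove a flag

// compare several flags at once with a single expression
AnchorStyles topRight = AnchorStyles.Top | AnchorStyles.Right;
bool isTopRight = (style & topRight) == topRight;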
From a domain model perspective, it just models reality better in some situations. If you have three booleans like AccountIsInDefault, IsPreferredCustomer, and RequiresSalesTaxState, then it doesn't make sense to fold them into a single Flags-decorated enumeration, because they are not three distinct values for the same domain model element.
But if you have a set of booleans like:
[Flags] enum AccountStatus { AccountIsInDefault = 1,
    AccountOverdue = 2, AccountFrozen = 4 }
or
[Flags] enum CargoState {ExceedsWeightLimit=1,
ContainsDangerousCargo=2, IsFlammableCargo=4,
ContainsRadioactive=8}
Then it is useful to be able to store the total state of the account (or the cargo) in ONE variable that represents ONE domain element whose value can represent any possible combination of states.
Raymond Chen has a blog post on this subject.
Sure, bitfields save data memory, but you have to balance it against the cost in code size, debuggability, and reduced multithreading.
As others have said, its time is largely past. It's tempting to still do it, because bit fiddling is fun and cool-looking, but it's no longer more efficient, it has serious drawbacks in terms of maintenance, it doesn't play nicely with databases, and unless you're working in an embedded world, you have enough memory.
I would suggest never using enum flags unless you are dealing with some pretty serious memory limitations (not likely). You should always write code optimized for maintenance.
Having several boolean properties makes it easier to read and understand the code, change the values, and provide IntelliSense comments, not to mention reducing the likelihood of bugs. If necessary, you can always use an enum flags field internally; just make sure you expose the setting/getting of the values with boolean properties.
Space efficiency - 1 bit per flag
Time efficiency - bit comparisons are handled quickly by hardware.
Language independence - where the data may be handled by a number of different programs you don't need to worry about the implementation of booleans across different languages/platforms.
Most of the time, these are not worth the tradeoff in terms of maintenance. However, there are times when it is useful:
Network protocols - there will be a big saving in reduced size of messages
Legacy software - once I had to add some information for tracing into some legacy software.
Cost to modify the header: millions of dollars and years of effort.
Cost to shoehorn the information into 2 bytes in the header that weren't being used: 0.
Of course, there was the additional cost in the code that accessed and manipulated this information, but these were done by functions anyways so once you had the accessors defined it was no less maintainable than using Booleans.
I have seen answers citing time efficiency and compatibility. Those are indeed the reasons, but I do not think the answers explain why these techniques are sometimes still necessary in times like ours. From the other answers, and from chatting with other engineers, I have seen bitfields pictured as some sort of quirky old-time way of doing things that should just die because newer ways are better.
Yes, in very rare cases you may want to do it the "old way" for performance's sake, like when you have the classic million-iteration loop. But I'd say that is the wrong perspective.
While it is true that you should use whatever the C# language gives you as the new right way™ to do things (perhaps enforced by some fancy AI code analysis slapping you whenever you do not meet its code style), you should understand that low-level strategies aren't there by accident. In many cases they are the only way to solve things when you have no help from a fancy framework. Your OS, your drivers, and even .NET itself (especially the garbage collector) are built using bitfields and transactional instructions. Your CPU's instruction set is itself a very complex bitfield, so JIT compilers encode their output using complex bit processing and a few hardcoded bitfields so that the CPU executes it correctly.
When we talk about performance, these things have a much larger impact than people imagine, today more than ever, especially when you start considering multicore.
When multicore systems started to become common, CPU manufacturers began to mitigate the issues of SMP by adding dedicated transactional memory access instructions. While these were made specifically to mitigate the near-impossible task of making multiple CPUs cooperate at kernel level without a huge drop in performance, they provide additional benefits, like an OS-independent way to speed up the low-level parts of most programs. Basically, your program can use CPU-assisted instructions to perform changes to integer-sized memory locations: a read-modify-write where the "modify" part can be anything you want, though the most common patterns are combinations of set/clear/increment.
Usually the CPU simply monitors whether any other CPU is accessing the same address location; if contention happens, it stops the operation from being committed to memory and signals the event to the application within the same instruction. This seems a trivial task, but superscalar CPUs (each core has multiple ALUs allowing instruction parallelism), multi-level caches (some private to each core, some shared by a cluster of CPUs), and Non-Uniform Memory Access systems (see Threadripper CPUs) make it difficult to keep everything coherent. Luckily, the smartest people in the world work to boost performance and keep all of this happening correctly. Today's CPUs dedicate a large number of transistors to this task, so that caches and our read-modify-write transactions work correctly.
C# allows you to use the most common transactional memory access patterns via the Interlocked class (it is only a limited set; for example, a very useful "clear mask and increment" is missing, but you can always use CompareExchange instead, which gets very close to the same performance).
To achieve the same result using an array of booleans, you must use some sort of lock, and in case of contention a lock performs several orders of magnitude worse than the atomic instructions.
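A sketch of that pattern in C# (the DMA-channel-style bitfield below is illustrative; Interlocked.CompareExchange is the real API):

using System.Threading;

public static class DmaChannels
{
    private static int channels; // each bit: 1 = channel in use

    // Atomically claim the lowest free channel, or return -1 if all 32 are taken.
    public static int TryAllocate()
    {
        while (true)
        {
            int snapshot = channels;
            int free = ~snapshot;
            if (free == 0)
            {
                return -1;
            }

            int bit = free & -free;        // isolate the lowest free bit
            int updated = snapshot | bit;  // claim it

            // Commit only if no other thread touched the bitfield meanwhile;
            // otherwise loop and retry - no lock needed.
            if (Interlocked.CompareExchange(ref channels, updated, snapshot) == snapshot)
            {
                int index = 0;
                while ((bit >>= 1) != 0) index++; // convert the isolated bit to an index
                return index;
            }
        }
    }

    // Release a previously claimed channel.
    public static void Free(int index)
    {
        int mask = ~(1 << index);
        while (true)
        {
            int snapshot = channels;
            if (Interlocked.CompareExchange(ref channels, snapshot & mask, snapshot) == snapshot)
                return;
        }
    }
}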
Here are some examples of HW-assisted transactional access using bitfields, which would require a completely different strategy without them. Of course, these are outside C#'s scope:
Assume a DMA peripheral that has a set of DMA channels, say 20 (but any number up to the bit width of the interlocked integer will do). Any peripheral's interrupt handler, which might execute at any time, including from your beloved OS and from any core of your latest-gen 32-core machine, may want to allocate a DMA channel (assign it to the peripheral) and use it. A bitfield covers all those requirements and uses just a dozen instructions to perform the allocation, inlineable within the requesting code. Basically you cannot go faster than this, and your code is just a few functions: we delegate the hard part to the HW. Constraint: bitfield only.
Assume a peripheral that requires some working space in normal RAM to perform its duty: for example, a high-speed I/O peripheral that uses scatter-gather DMA. In short, it uses fixed-size blocks of RAM populated with descriptions of the next transfer (by the way, the descriptor itself is made of bitfields), chained to one another to create a FIFO queue of transfers in RAM. The application prepares the descriptors first, then chains them to the tail of the current transfers without ever pausing the controller (not even disabling interrupts). The allocation/deallocation of such descriptors can be done using a bitfield and transactional instructions, so when it is shared between different CPUs, and between the driver's interrupt handler and the kernel, everything still works without conflicts. One usage case: the kernel allocates descriptors atomically, without stopping or disabling interrupts and without additional locks (the bitfield itself is the lock), and the interrupt handler deallocates them when the transfer completes.
Most older strategies were to preallocate the resources and force the application to free them after use.
If you ever need multitasking on steroids, C# lets you use Threads + Interlocked, but lately C# introduced lightweight Tasks. Guess how they are made? Transactional memory access using the Interlocked class. So you likely do not need to reinvent the wheel; all of the low-level part is already covered and well engineered.
So the idea is: let smart people (not me, I am a common developer like you) solve the hard part for you, and just enjoy a general-purpose computing platform like C#. If you still see some remnants of these parts, it is because someone may still need to interface with the world outside .NET and access some driver or system call, for example, requiring you to know how to build a descriptor and put each bit in the right place. Don't be mad at those people; they made our jobs possible.
In short: Interlocked + bitfields. Incredibly powerful; don't use it unless you really need to.
It is for speed and efficiency. Essentially all you are working with is a single int.
if ((flags & AnchorStyles.Top) == AnchorStyles.Top)
{
//Do stuff
}
I need to get three objects out of a function, my instinct is to create a new type to return the three refs. Or if the refs were the same type I could use an array. However pass-by-ref is easier:
private void Mutate_AddNode_GetGenes(ref NeuronGene newNeuronGene, ref ConnectionGene newConnectionGene1, ref ConnectionGene newConnectionGene2)
{
}
There's obviously nothing wrong with this, but I hesitate to use this approach, mostly I think for reasons of aesthetics and psychological bias. Are there actually any good reasons to use one of these approaches over the others? Perhaps a performance issue with creating extra wrapper objects, or with pushing parameters onto the stack? Note that in my particular case this is CPU-intensive code. CPU cycles matter.
Is there a more elegant C#2 or C#3 approach?
Thanks.
For almost all computing problems, you will not notice the CPU difference. Since your sample code has the word "Gene" in it, you may actually fall into the rare category of code that would notice.
Creating and destroying objects just to wrap other objects would cost a bit of performance (they need to be created and garbage collected after all).
Aesthetically I would not create an object just to group unrelated objects, but if they logically belong together it is perfectly fine to define a containing object.
If you're worried about the performance of a wrapping type (which is a lot cleaner, IMHO), you should use a struct. Current 32-bit implementations of .NET (and the upcoming 64-bit 4.0) support inlining / optimizing away of structs in many cases, so you'd probably see no performance difference whatsoever between a struct and ref arguments.
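For example, a small struct as the return type (MutationResult is a hypothetical name; NeuronGene and ConnectionGene are the types from the question):

public struct MutationResult
{
    public NeuronGene NewNeuronGene;
    public ConnectionGene NewConnectionGene1;
    public ConnectionGene NewConnectionGene2;
}

private MutationResult Mutate_AddNode_GetGenes()
{
    MutationResult result = new MutationResult();
    // populate result.NewNeuronGene, result.NewConnectionGene1,
    // and result.NewConnectionGene2 here, then hand them back as one value
    return result;
}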
Worrying about the relative execution speed of those two options is probably a premature optimization. Focus on getting the algorithm correct first, and having clean, maintainable code. When that's done, you can run a profiler on it and optimize the 20% of the code that takes 80% of the CPU time. Even if this method ends up being in that 20%, the difference between the two calling styles is probably to small to register.
So, performance issues aside, I'd probably use a container class. Since this method takes only those three parameters, and (presumably) modifies each one, it sounds like it would make sense to have it as a method of the container class, with three member variables instead of ref parameters.