C# fundamentally not portable?

I've been using C# for a while, and have recently started adding parallelism to a side project of mine. According to Microsoft, reads and writes to ints and even floats are atomic.
I'm sure these atomicity requirements work out just fine on x86 architectures. However, on architectures such as ARM (which may not have hardware floating-point support), it seems these guarantees will be hard to honor.
The problem is only made more significant by the fact that an 'int' is always 32 bits. There are many embedded devices that can't atomically perform a 32-bit write.
It seems this is a fundamental mistake in C#. Guaranteeing the atomicity of these data types can't be done portably.
How are these atomicity guarantees intended to be implemented on architectures where there are no FPUs or 32-bit writes?

It's not too difficult to guarantee the atomicity with runtime checks. Sure, you won't be as performant as you might be if your platform supported atomic reads and writes, but that's a platform tradeoff.
Bottom line: C# (the core language, not counting some platform-specific APIs) is just as portable as Java.

The future happened yesterday: C# has in fact been ported to a large number of embedded cores. The .NET Micro Framework is the typical deployment scenario. The model numbers I see listed as native targets are AT91, BF537, CortexM3, LPC22XX, LPC24XX, MC9328, PXA271 and SH2.
I don't know the exact implementation details of their core instruction sets, but I'm fairly sure these are all 32-bit cores and several of them are ARM cores. Writing threaded code for them requires a minimum set of guarantees, and atomic updates of properly aligned words is one of them. Given the supported list, and given that atomic 4-byte updates of aligned words are trivial to implement in 32-bit hardware, I trust they all do in fact support it.

There are two issues with regard to "portability":
Can a practical implementation of the language be produced for various platforms?
Will a program written in the language be expected to run correctly on various platforms without modification?
The stronger the guarantees made by a language, the harder it will be to port it to various platforms (some guarantees may make it impossible or impractical to implement the language on some platforms) but the more likely it is that programs written in the language will work without modification on any platform for which support exists.
For example, a lot of networking code relies upon the fact that (on most platforms) an unsigned char is eight bits, and a 32-bit integer is represented by four unsigned chars in ascending or descending order. I've used a platform where char was 16 bits, sizeof(int)==1, and sizeof(long)==2. The compiler author could have made the compiler simply use the bottom 8 bits of each storage location, or could have added a lot of extra code so that a write through a 'char' pointer would shift the address right one bit (saving the LSB), read that word, update its high or low half based upon the saved LSB, and write it back. Either of those approaches would have allowed the networking code to run without modification, but would have greatly impeded the compiler's usefulness for other purposes.
Some of the guarantees in the CLR mean that it is impractical to implement it on any platform with an atomic operation size smaller than 32 bits. So what? If a microcontroller needs more than a few dozen kilobytes of code space and RAM, the cost differential between 8-bit and 32-bit is pretty small. Since nobody's going to be running any variation of the CLR on a part with 32K of code space and 4K of RAM, who cares whether such a chip could satisfy its guarantees?
BTW, I do think it would be useful to have different levels of features defined in a C spec; a lot of processors, for example, do have 8-bit chars which can be assembled into longer words using unions, and there is a lot of practical code which exploits this. It would be good to define standards for compilers which work with such things. I would also like to see more standards at the low end of the system, making some language enhancements available for 8-bit processors. For example, it would be useful to define overloads for a function which can take a run-time-computed 16-bit integer, an 8-bit variable, or an inline-expanded version with a constant. For often-used functions, there can be a big difference in efficiency among those.

That's what the CLI is for. I doubt they will certify an implementation if it isn't compliant. So basically, C# is portable to any platform that has one.

Excessively weakening guarantees for the sake of portability defeats the purpose of portability. The stronger the guarantees, the more valuable the portability. The goal is to find the right balance between what the likely target platforms can efficiently support with the guarantees that will be the most useful for development.

Related

The benefits of x64 CLR

The little premature-optimization bug in my head tells me that I should port my existing x86 C# application to x64 because of the x64 release of an unmanaged DLL on which it relies. I know the answer is probably to do it, test, and see what happens, but I wanted to see what benefits to expect in general. I found a lot of posts from two to four years ago complaining that the x64 CLR is slower than the x86 CLR.
In what areas could one expect a speed-up with x64 code? Is it worthwhile to port if you don't need more than 2 GB of memory? My code is mainly network oriented, dealing with medium-sized byte arrays and encryption algorithms.
The simplest answer to this question is to profile 32-bit and 64-bit and compare them. However, since you said you are using encryption algorithms, I would strongly recommend considering going 64-bit, if you are using them often enough.
Encryption algorithms routinely do integer arithmetic and/or logical operations on values larger than 32-bits, which are made much faster in 64-bit code. In addition, they can often take advantage of the expanded register set of the processor (though certain superscalar/cache optimizations may already do this to a certain degree) more than normal code.
In the end, there is no way to tell which one will perform better other than testing it in your specific situation, but if whatever cryptography library you are using has a 64-bit version, I believe it would be worthwhile to give it a shot.
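As a rough illustration (my own sketch, not part of the original answer) of the kind of arithmetic that benefits: a 64-bit mixing step, common in hashes and ciphers, is just shifts, XORs and multiplies on a ulong. Each of those is a single instruction for the x64 JIT, while a 32-bit JIT has to emulate each one with pairs of 32-bit operations.

    using System;

    class Mix64Demo
    {
        // A 64-bit mixing step of the kind used in many hashes and PRNGs.
        // The constants are illustrative; the point is the ulong arithmetic.
        static ulong Mix(ulong z)
        {
            z ^= z >> 33;
            z *= 0xFF51AFD7ED558CCDUL;
            z ^= z >> 33;
            z *= 0xC4CEB9FE1A85EC53UL;
            return z ^ (z >> 33);
        }

        static void Main()
        {
            ulong v = 0x0123456789ABCDEFUL;
            for (int i = 0; i < 3; i++)
            {
                v = Mix(v);
                Console.WriteLine(v.ToString("X16"));
            }
        }
    }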

Should I use 'long' instead of 'int' on 64-bit platforms in languages with fixed type sizes (like Java, C#)?

In 10, or even 5 years there will be no [Edit2: server or desktop] 32-bit CPUs.
So, are there any advantages in using int (32-bit) over long (64-bit)?
And are there any disadvantages in using int?
Edit:
By 10 or 5 years I meant the vast majority of places where those languages are used.
I meant which type to use by default. These days I don't even bother to think about whether I should use short as a loop counter; I just write for(int i.... In the same way, long counters already win.
Registers are already 64-bit, so there is no longer any gain from 32-bit types, and I think there is some loss with 8-bit types (you have to operate on more bits than you're using).
32-bit is still a completely valid data type size, just as 16-bit values and bytes are still around. We didn't throw out 16-bit or 8-bit numbers when we moved to 32-bit processors. A 32-bit number is half the size of a 64-bit integer in terms of storage. If I were modeling a database and I knew the value couldn't go higher than what a 32-bit integer can store, I would use a 32-bit integer for storage purposes. I'd do the same with a 16-bit number. A 64-bit number takes more space in memory as well, albeit nothing significant given that today's personal laptops can ship with 8 GB of memory.
There is no disadvantage of int other than it's a smaller data type. It's like asking, "Where should I store my sugar? In a sugar bowl, or a silo?" Well, that depends on entirely how much sugar you have.
Processor architecture shouldn't have much to do with what size data type you use. Use what fits. When we have 512-bit processors, we'll still have bytes.
EDIT:
To address some comments / edits..
I'm not sure about "There will be no 32-bit desktop CPUs". ARM is currently 32-bit and has declared little interest in 64-bit, for now. That doesn't fit too well with "desktop" in your description, but I also think that in 5-10 years the landscape of the devices we write software for will drastically change as well. Tablets can't be ignored: people will want C# and Java apps to run on them, considering Microsoft has officially ported Windows 8 to ARM.
If you want to start using long, go ahead; there is no reason not to. If we are only looking at the CPU (ignoring storage size) and assuming we are on an x86-64 architecture, then it doesn't make much difference.
Assuming that we are sticking with the x86 architecture, that's true as well. You may end up with a slightly larger stack, depending on whatever framework you are using.
If you're on a 64-bit processor, and you've compiled your code for 64-bit, then at least some of the time, long is likely to be more efficient because it matches the register size. But whether that will really impact your program much is debatable. Also, if you're using long all over the place, you're generally going to use more memory - both on the stack and on the heap - which could negatively impact performance. There are too many variables to know for sure how well your program will perform using long by default instead of int. There are reasons why it could be faster and reasons why it could be slower. It could be a total wash.
The typical thing to do is to just use int if you don't care about the size of the integer. If you need a 64-bit integer, then you use long. If you're trying to use less memory and int is far more than you need, then you use byte or short.
x86_64 CPUs are going to be designed to be efficient at processing 32-bit programs and so it's not like using int is going to seriously degrade performance. Some things will be faster due to better alignment when you use 64-bit integers on a 64-bit CPU, but other things will be slower due to the increased memory requirements. And there are probably a variety of other factors involved which could definitely affect performance in either direction.
If you really want to know which is going to do better for your particular application in your particular environment, you're going to need to profile it. This is not a case where there is a clear advantage of one over the other.
Personally, I would advise that you follow the typical route of using int when you don't care about the size of the integer and to use the other types when you do.
Sorry for the C++ answer.
If the size of the type matters use a sized type:
uint8_t
int32_t
int64_t
etc
If the size doesn't matter use an expressive type:
size_t
ptrdiff_t
ssize_t
etc
I know that D has sized types and size_t. I'm not sure about Java or C#.
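For reference (my addition, not part of the original answer), C# already behaves like the sized-type route: its built-in integer types have fixed widths on every platform, and IntPtr is the type whose size tracks the pointer size. A minimal sketch:

    using System;

    class FixedSizeTypes
    {
        static void Main()
        {
            // In C#, the built-in integer types have fixed sizes everywhere:
            // sbyte/byte = 8 bits, short/ushort = 16, int/uint = 32, long/ulong = 64.
            Console.WriteLine(sizeof(int));   // 4, even on a 64-bit runtime
            Console.WriteLine(sizeof(long));  // 8
            Console.WriteLine(IntPtr.Size);   // 4 or 8: the pointer-sized type varies
        }
    }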

x64 vs x86 Performance Considerations .Net

I am trying to understand what performance differences exist when running a C# / .NET 4.0 app as x64 vs x86. I understand the memory considerations (x64 can address all memory, x86 is limited to 2/4 GB), as well as the fact that an x64 app will use more memory (all pointers are 8 bytes instead of 4 bytes). As far as I can tell, none of these should affect clock-for-clock instruction throughput, as the x64 pipeline is wide enough to handle the wider instructions.
Is there a performance hit in context switching, due to the larger stack size for each thread? What performance considerations am I missing in evaluating the two?
Joe White has given you some good reasons why your app might be slower. Larger pointers (and therefore by extension larger references in .NET) will take up more space in memory, meaning less of your code and data will fit into the cache.
There are, however, plenty of beneficial reasons you might want to use x64:
The AMD64 calling convention is used by default in x64 and can be quite a bit faster than the standard cdecl or stdcall, with many arguments being passed in registers and using the XMM registers for floating point.
The CLR will emit scalar SSE instructions for dealing with floating point operations in 64-bit. In x86 it falls back on using the standard x87 FP stack, which is quite a bit slower, especially for things like converting between ints and floats.
Having more registers means that there is much less chance that the JIT will have to spill them due to register pressure. Spilling registers can be quite costly for fast inner loops, especially if a function gets inlined and introduces additional register pressure there.
Any operations on 64-bit integers can benefit tremendously by being able to fit into a single register instead of being broken up into two separate halves.
This may be obvious, but the additional memory your process can access can be quite useful if your application is memory-intensive, even if it isn't hitting the theoretical limit. Fragmentation can cause you to hit "out of memory" conditions long before you reach that mark.
RIP-relative addressing in x64 can, in some cases, reduce the size of an executable image. Although that doesn't really apply directly to .NET apps, it can have an effect on the sharing of DLLs which may otherwise have to be relocated. I'd be interested in knowing if anyone has any specific information on this with regards to .NET and managed applications.
Aside from these, the x64 version of the .NET runtime seems to, at least in the current versions, perform more optimizations than the x86 equivalent. Things like inlining and memory alignment seem to happen much more often. In fact, there was a bug a while back that prevented inlining of any method that took or returned a value type; I remember seeing it fixed in x64 and not the x86 version.
Really, the only way you'll be able to tell which is better for your app will be to do profiling and testing on both architectures and comparing real results. However, I personally just use Any CPU wherever possible and avoid anything inherently architecture-dependent. This makes it easy to build and deploy, and is hopefully more future proof when the majority of users start switching to x64 exclusively.
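If you do build as Any CPU, you can check at run time which way the process actually loaded. A small sketch (my own, not from the original answer):

    using System;

    class BitnessCheck
    {
        static void Main()
        {
            // With an AnyCPU build, the same assembly can run as a 32-bit or a
            // 64-bit process depending on the OS and project settings.
            Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
            Console.WriteLine("64-bit OS:      " + Environment.Is64BitOperatingSystem);
            Console.WriteLine("Pointer size:   " + IntPtr.Size + " bytes");
        }
    }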
Closely related to "x64 app will use more memory" is the fact that, with a 64-bit app, your locality of reference is smaller (because all your pointer sizes are doubled), so you get less mileage out of the CPU's on-board (ultra-fast) cache. You have to retrieve data from system RAM more often, which is much slower than the L2 and even the L1 on-chip cache.

Is C# really slower than say C++?

I've been wondering about this issue for a while now.
Of course there are things in C# that aren't optimized for speed, so using those objects or language features (like LINQ) may cause the code to be slower.
But if you don't use any of those features and just compare the same pieces of code in C# and C++ (it's easy to translate one to the other), will it really be that much slower?
I've seen comparisons that show that C# might be even faster in some cases, because in theory the JIT compiler should optimize the code in real time and get better results:
Managed Or Unmanaged?
We should remember that the JIT compiler compiles the code at run time, but that's a one-time overhead; the same code (once reached and compiled) doesn't need to be compiled again at run time.
The GC doesn't add a lot of overhead either, unless you create and destroy thousands of objects (like using String instead of StringBuilder). And doing that in C++ would also be costly.
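(As an aside, not from the original question: the String vs StringBuilder point is about allocation counts. Repeated string concatenation creates a new string object on every step, while StringBuilder appends into one growable buffer, roughly like this:)

    using System.Text;

    class ConcatDemo
    {
        // Each += allocates a brand-new string, so this creates about n temporary objects.
        static string Concat(int n)
        {
            string s = "";
            for (int i = 0; i < n; i++)
                s += i;
            return s;
        }

        // StringBuilder appends into a growable buffer, allocating far less.
        static string Build(int n)
        {
            var sb = new StringBuilder();
            for (int i = 0; i < n; i++)
                sb.Append(i);
            return sb.ToString();
        }
    }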
Another point that I want to bring up is the better communication between DLLs introduced in .NET: .NET assemblies communicate with each other much better than COM-based DLLs do.
I don't see any inherent reason why the language should be slower, and I don't really think that C# is slower than C++ (both from experience and lack of a good explanation)..
So, will a given piece of code written in C# be slower than the same code in C++?
And if so, then WHY?
Another reference (which talks about that a bit, but with no explanation of WHY):
Why would you want to use C# if its slower than C++?
Warning: The question you've asked is really pretty complex -- probably much more so than you realize. As a result, this is a really long answer.
From a purely theoretical viewpoint, there's probably a simple answer to this: there's (probably) nothing about C# that truly prevents it from being as fast as C++. Despite the theory, however, there are some practical reasons that it is slower at some things under some circumstances.
I'll consider three basic areas of differences: language features, virtual machine execution, and garbage collection. The latter two often go together, but can be independent, so I'll look at them separately.
Language Features
C++ places a great deal of emphasis on templates, and features in the template system that are largely intended to allow as much as possible to be done at compile time, so from the viewpoint of the program, they're "static." Template meta-programming allows completely arbitrary computations to be carried out at compile time (I.e., the template system is Turing complete). As such, essentially anything that doesn't depend on input from the user can be computed at compile time, so at runtime it's simply a constant. Input to this can, however, include things like type information, so a great deal of what you'd do via reflection at runtime in C# is normally done at compile time via template metaprogramming in C++. There is definitely a trade-off between runtime speed and versatility though -- what templates can do, they do statically, but they simply can't do everything reflection can.
The differences in language features mean that almost any attempt at comparing the two languages simply by transliterating some C# into C++ (or vice versa) is likely to produce results somewhere between meaningless and misleading (and the same would be true for most other pairs of languages as well). The simple fact is that for anything larger than a couple lines of code or so, almost nobody is at all likely to use the languages the same way (or close enough to the same way) that such a comparison tells you anything about how those languages work in real life.
Virtual Machine
Like almost any reasonably modern VM, Microsoft's for .NET can and will do JIT (aka "dynamic") compilation. This represents a number of trade-offs though.
Primarily, optimizing code (like most other optimization problems) is largely an NP-complete problem. For anything but a truly trivial/toy program, you're pretty nearly guaranteed you won't truly "optimize" the result (i.e., you won't find the true optimum) -- the optimizer will simply make the code better than it was previously. Quite a few optimizations that are well known, however, take a substantial amount of time (and, often, memory) to execute. With a JIT compiler, the user is waiting while the compiler runs. Most of the more expensive optimization techniques are ruled out. Static compilation has two advantages: first of all, if it's slow (e.g., building a large system) it's typically carried out on a server, and nobody spends time waiting for it. Second, an executable can be generated once, and used many times by many people. The first minimizes the cost of optimization; the second amortizes the much smaller cost over a much larger number of executions.
As mentioned in the original question (and many other web sites) JIT compilation does have the possibility of greater awareness of the target environment, which should (at least theoretically) offset this advantage. There's no question that this factor can offset at least part of the disadvantage of static compilation. For a few rather specific types of code and target environments, it can even outweigh the advantages of static compilation, sometimes fairly dramatically. At least in my testing and experience, however, this is fairly unusual. Target dependent optimizations mostly seem to either make fairly small differences, or can only be applied (automatically, anyway) to fairly specific types of problems. Obvious times this would happen would be if you were running a relatively old program on a modern machine. An old program written in C++ would probably have been compiled to 32-bit code, and would continue to use 32-bit code even on a modern 64-bit processor. A program written in C# would have been compiled to byte code, which the VM would then compile to 64-bit machine code. If this program derived a substantial benefit from running as 64-bit code, that could give a substantial advantage. For a short time when 64-bit processors were fairly new, this happened a fair amount. Recent code that's likely to benefit from a 64-bit processor will usually be available compiled statically into 64-bit code though.
Using a VM also has a possibility of improving cache usage. Instructions for a VM are often more compact than native machine instructions. More of them can fit into a given amount of cache memory, so you stand a better chance of any given code being in cache when needed. This can help keep interpreted execution of VM code more competitive (in terms of speed) than most people would initially expect -- you can execute a lot of instructions on a modern CPU in the time taken by one cache miss.
It's also worth mentioning that this factor isn't necessarily different between the two at all. There's nothing preventing (for example) a C++ compiler from producing output intended to run on a virtual machine (with or without JIT). In fact, Microsoft's C++/CLI is nearly that -- an (almost) conforming C++ compiler (albeit, with a lot of extensions) that produces output intended to run on a virtual machine.
The reverse is also true: Microsoft now has .NET Native, which compiles C# (or VB.NET) code to a native executable. This gives performance that's generally much more like C++, but retains the features of C#/VB (e.g., C# compiled to native code still supports reflection). If you have performance intensive C# code, this may be helpful.
Garbage Collection
From what I've seen, I'd say garbage collection is the poorest-understood of these three factors. Just for an obvious example, the question here mentions: "GC doesn't add a lot of overhead either, unless you create and destroy thousands of objects [...]". In reality, if you create and destroy thousands of objects, the overhead from garbage collection will generally be fairly low. .NET uses a generational scavenger, which is a variety of copying collector. The garbage collector works by starting from "places" (e.g., registers and execution stack) that pointers/references are known to be accessible. It then "chases" those pointers to objects that have been allocated on the heap. It examines those objects for further pointers/references, until it has followed all of them to the ends of any chains, and found all the objects that are (at least potentially) accessible. In the next step, it takes all of the objects that are (or at least might be) in use, and compacts the heap by copying all of them into a contiguous chunk at one end of the memory being managed in the heap. The rest of the memory is then free (modulo finalizers having to be run, but at least in well-written code, they're rare enough that I'll ignore them for the moment).
What this means is that if you create and destroy lots of objects, garbage collection adds very little overhead. The time taken by a garbage collection cycle depends almost entirely on the number of objects that have been created but not destroyed. The primary consequence of creating and destroying objects in a hurry is simply that the GC has to run more often, but each cycle will still be fast. If you create objects and don't destroy them, the GC will run more often and each cycle will be substantially slower as it spends more time chasing pointers to potentially-live objects, and it spends more time copying objects that are still in use.
To combat this, generational scavenging works on the assumption that objects that have remained "alive" for quite a while are likely to continue remaining alive for quite a while longer. Based on this, it has a system where objects that survive some number of garbage collection cycles get "tenured", and the garbage collector starts to simply assume they're still in use, so instead of copying them at every cycle, it simply leaves them alone. This is a valid assumption often enough that generational scavenging typically has considerably lower overhead than most other forms of GC.
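A small sketch (my own illustration, not from the answer) of the generational behavior described above, using the GC.GetGeneration API; on the desktop CLR this typically prints 0, 1, 2:

    using System;

    class GenerationsDemo
    {
        static void Main()
        {
            var obj = new object();
            Console.WriteLine(GC.GetGeneration(obj)); // 0: freshly allocated

            GC.Collect();                             // obj survives one collection...
            Console.WriteLine(GC.GetGeneration(obj)); // typically 1

            GC.Collect();                             // ...and another
            Console.WriteLine(GC.GetGeneration(obj)); // typically 2 ("tenured")

            GC.KeepAlive(obj);                        // keep obj reachable until here
        }
    }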
"Manual" memory management is often just as poorly understood. Just for one example, many attempts at comparison assume that all manual memory management follows one specific model as well (e.g., best-fit allocation). This is often little (if any) closer to reality than many peoples' beliefs about garbage collection (e.g., the widespread assumption that it's normally done using reference counting).
Given the variety of strategies for both garbage collection and manual memory management, it's quite difficult to compare the two in terms of overall speed. Attempting to compare the speed of allocating and/or freeing memory (by itself) is pretty nearly guaranteed to produce results that are meaningless at best, and outright misleading at worst.
Bonus Topic: Benchmarks
Since quite a few blogs, web sites, magazine articles, etc., claim to provide "objective" evidence in one direction or another, I'll put in my two-cents worth on that subject as well.
Most of these benchmarks are a bit like teenagers deciding to race their cars, where whoever wins gets to keep both cars. The web sites differ in one crucial way though: the guy who's publishing the benchmark gets to drive both cars. By some strange chance, his car always wins, and everybody else has to settle for "trust me, I was really driving your car as fast as it would go."
It's easy to write a poor benchmark that produces results that mean next to nothing. Almost anybody with anywhere close to the skill necessary to design a benchmark that produces anything meaningful, also has the skill to produce one that will give the results he's decided he wants. In fact it's probably easier to write code to produce a specific result than code that will really produce meaningful results.
As my friend James Kanze put it, "never trust a benchmark you didn't falsify yourself."
Conclusion
There is no simple answer. I'm reasonably certain that I could flip a coin to choose the winner, then pick a number between (say) 1 and 20 for the percentage it would win by, and write some code that would look like a reasonable and fair benchmark, and produced that foregone conclusion (at least on some target processor--a different processor might change the percentage a bit).
As others have pointed out, for most code, speed is almost irrelevant. The corollary to that (which is much more often ignored) is that in the little code where speed does matter, it usually matters a lot. At least in my experience, for the code where it really does matter, C++ is almost always the winner. There are definitely factors that favor C#, but in practice they seem to be outweighed by factors that favor C++. You can certainly find benchmarks that will indicate the outcome of your choice, but when you write real code, you can almost always make it faster in C++ than in C#. It might (or might not) take more skill and/or effort to write, but it's virtually always possible.
Because you don't always need to use the (and I use this loosely) "fastest" language? I don't drive to work in a Ferrari just because it's faster...
Circa 2005 two MS performance experts from both sides of the native/managed fence tried to answer the same question. Their method and process are still fascinating and the conclusions still hold today - and I'm not aware of any better attempt to give an informed answer. They noted that a discussion of potential reasons for differences in performance is hypothetical and futile, and a true discussion must have some empirical basis for the real world impact of such differences.
So Raymond Chen (of The Old New Thing) and Rico Mariani set rules for a friendly competition. A Chinese/English dictionary was chosen as a toy application context: simple enough to be coded as a hobby side project, yet complex enough to demonstrate non-trivial data usage patterns. The rules started simple: Raymond coded a straightforward C++ implementation, Rico migrated it to C# line by line, with no sophistication whatsoever, and both implementations ran a benchmark. Afterwards, several iterations of optimizations ensued.
The full details are here: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
This dialogue of titans is exceptionally educational and I wholeheartedly recommend diving in, but if you lack the time or patience, Jeff Atwood compiled the bottom line beautifully:
Eventually, C++ was 2x faster - but initially, it was 13x slower.
As Rico sums up:
So am I ashamed by my crushing defeat? Hardly. The managed code achieved a very good result for hardly any effort. To defeat the managed version, Raymond had to:
Write his own file/IO stuff
Write his own string class
Write his own allocator
Write his own international mapping
Of course he used available lower-level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so.
That is my experience still, 11 years and who knows how many C#/C++ versions later.
That is no coincidence, of course, as these two languages spectacularly achieve their vastly different design goals. C# wants to be used where development cost is the main consideration (still the majority of software), and C++ shines where you'd spare no expense to squeeze every last ounce of performance out of your machine: games, algo-trading, data centers, etc.
C++ always has an edge in performance. With C#, I don't have to handle memory, and I have literally tons of resources available to do my job.
What you need to question yourself is more about which one saves you time. Machines are incredibly powerful now and most of your code should be done in a language that allows you to get the most value in the least amount of time.
If there is a core processing step that takes way too long in C#, you can build it in C++ and interop your way to it from C#, for example via P/Invoke as sketched below.
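A minimal P/Invoke sketch of that idea (the library name and export here are hypothetical placeholders):

    using System;
    using System.Runtime.InteropServices;

    class NativeInterop
    {
        // Hypothetical native export: the hot loop lives in a separately
        // compiled C++ DLL, e.g. extern "C" double HeavyCompute(const double*, int);
        [DllImport("HeavyLifting.dll", CallingConvention = CallingConvention.Cdecl)]
        static extern double HeavyCompute(double[] data, int length);

        static void Main()
        {
            var data = new double[] { 1.0, 2.0, 3.0 };
            Console.WriteLine(HeavyCompute(data, data.Length));
        }
    }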
Stop thinking about your code performance. Start building value.
C# is faster than C++. Faster to write. For execution times, nothing beats a profiler.
But C# does not have access to as many libraries as C++, which can interface with them easily.
And C# depends heavily on Windows...
BTW, time-critical applications are not coded in C# or Java, primarily due to the uncertainty of when garbage collection will be performed.
In modern times, application or execution speed is not as important as it was previously. Development schedules, correctness and robustness are higher priorities. A high-speed version of an application is no good if it has lots of bugs, crashes a lot or, worse, misses an opportunity to get to market or be deployed.
Since development schedules are a priority, new languages are coming out that speed up development. C# is one of these. C# also assists in correctness and robustness by removing features from C++ that cause common problems: one example is pointers.
The differences in execution speed of an application developed with C# and one developed using C++ is negligible on most platforms. This is due to the fact that the execution bottlenecks are not language dependent but usually depend on the operating system or I/O. For example if C++ performs a function in 5 ms but C# uses 2ms, and waiting for data takes 2 seconds, the time spent in the function is insignificant compared to the time waiting for data.
Choose a language that is best suited for the developers, platform and projects. Work towards the goals of correctness, robustness and deployment. The speed of an application should be treated as a bug: prioritize it, compare to other bugs, and fix as necessary.
A better way to look at it: everything is slower than C/C++ because those other languages abstract the machine away rather than following the stick-and-mud paradigm. It's called systems programming for a reason: you program against the grain, on bare metal. Doing so also grants you speed you cannot achieve with other languages like C# or Java. But alas, C's roots are all about doing things the hard way, so you're mostly going to be writing more code and spending more time debugging it.
C is also case sensitive, and objects in C++ follow strict rule sets. For example, a purple ice cream cone may not be the same as a blue ice cream cone; though they might both be cones, they may not necessarily belong to the same cone family, and if you forget to define what a cone is, you get a bug. Thus the properties of ice cream may or may not be clones. Now, the speed argument: C/C++ uses the stack-and-heap approach, and this is where bare metal gets its metal.
With the Boost library you can achieve incredible speeds; unfortunately, most game studios stick to the standard library. The other reason for this might be that software written in C/C++ tends to be massive in file size, as it's a giant collection of files instead of a single file. Also take note that most operating systems are written in C, so generally, why must we even ask what could be faster?!
Also, caching is not faster than pure memory management; sorry, but this just doesn't pan out. Memory is something physical; caching is something software does in order to gain a kick in performance. One could also reason that without physical memory, caching would simply not exist. That doesn't void the fact that memory must be managed at some level, whether it's automated or manual.
Why would you write a small application that doesn't require much in the way of optimization in C++, if there is a faster route (C#)?
Getting an exact answer to your question is not really possible unless you perform benchmarks on specific systems. However, it is still interesting to think about some fundamental differences between programming languages like C# and C++.
Compilation
Executing C# code requires an additional step where the code is JIT'ed. With regard to performance that will be in favor of C++. Also, the JIT compiler is only able to optimize the generated code within the unit of code that is JIT'ed (e.g. a method) while a C++ compiler can optimize across method calls using more aggressive techniques.
However, the JIT compiler is able to optimize the generated machine code to closely match the underlying hardware, enabling it to take advantage of additional hardware features if they exist. To my knowledge the .NET JIT compiler doesn't do that, but it would conceivably be able to generate different code for Atom as opposed to Pentium CPUs.
Memory access
The garbage-collected architecture can in many cases create better memory access patterns than standard C++ code. If the memory area used for the first generation is small enough, it can stay within the CPU cache, increasing performance. If you create and destroy a lot of small objects, the overhead of maintaining the managed heap may be smaller than what is required by the C++ runtime. Again, this is highly dependent on the application. A study of Python performance demonstrates that a specific managed Python application is able to scale much better than the compiled version as a result of better memory access patterns.
Don't get confused!
If a C# application is written in the best case and a C++ application is written in the best case, the C++ one is faster.
There are many reasons why C++ is inherently faster than C#, such as the fact that C# uses a virtual machine, similar to the JVM in Java. Basically, a higher-level language has lower performance (when both are used in the best case).
If you're an experienced professional C# programmer, just as you might be an experienced professional C++ programmer, developing an application in C# is much easier and faster than in C++.
Many situations between these extremes are possible. For example, you can write a C# application and a C++ application such that the C# app runs faster than the C++ one.
When choosing a language, you should consider the circumstances of the project and its subject. For a general business project you should use C#. For a project with high performance requirements, like a video converter or image processing, you should choose C++.
Update:
OK, let's compare some practical reasons why the maximum possible speed of C++ is higher than that of C#. Consider a well-written C# application and the same C++ version:
C# uses a VM as a middle layer for executing the application. It has overhead.
AFAIK the CLR cannot optimize all C# code for the target machine, whereas a C++ application can be compiled on the target machine with full optimization.
In C#, the best possible runtime optimization means the fastest possible VM, and a VM has overhead anyway.
C# is a higher-level language, thus it generates more code for the final process (consider the difference between an assembly application and a Ruby one; the same relationship holds between C++ and a higher-level language such as C#/Java).
If you'd like some more practical information, see this; it is about Java, but it also applies to C#.
The primary concern would not be speed, but stability across Windows versions and upgrades. Win32 is mostly immune to Windows version changes, making it highly stable.
When servers are decommissioned and software is migrated, a lot of anxiety surrounds anything using .NET, and usually a lot of panic about .NET versions, but a Win32 application built 10 years ago just keeps running like nothing happened.
I have been specializing in optimization for about 15 years, and I regularly rewrite C++ code, making heavy use of compiler intrinsics as much as possible, because C++ performance is often nowhere near what the CPU is capable of. Cache performance often needs to be considered. Many vector maths instructions are required to replace the standard C++ floating-point code.
A great deal of STL code gets rewritten and often runs many times faster. Maths and code which makes heavy use of data can be rewritten with spectacular results, as the CPU approaches its optimum performance.
None of this is possible in C#. To compare their relative real-time performance is really a staggeringly ignorant question. The fastest piece of code in C++ is when each single assembler instruction is optimised for the task in hand, with no unnecessary instructions at all; where each piece of memory is used when it is required, and not copied n times because that's what the language design requires; where each required memory movement works in harmony with the cache.
Where the final algorithm cannot be improved, based on the exact real time requirements, considering accuracy and functionality.
Then you will be approaching an optimal solution.
To compare C# with this ideal situation is staggering. C# can't compete. In fact, I am currently rewriting a whole bunch of C# code (and when I say rewriting I mean removing and replacing it completely) because it is not even in the same city, let alone ballpark, when it comes to heavy-lifting real-time performance.
So please, stop fooling yourselves. C# is slow. Dead slow. All software is slowing down, and C# is making this decline worse. All software runs on the fetch-execute cycle of the CPU. Use 10 times as many instructions and it's going to go 10 times as slow. Cripple the cache and it's going to go even slower. Add garbage collection to a real-time piece of software and you are often fooled into thinking that the code runs 'OK'; there are just those few moments every now and then when the code goes 'a bit slow for a while'.
Try adding a garbage collection system to code where every cycle counts. I wonder if stock market trading software has garbage collection (you know, the kind running on the new undersea cable which cost $300 million). Can we spare 300 milliseconds every 2 seconds? What about flight control software on the Space Shuttle: is GC OK there? How about engine management software in performance vehicles? (Where victory in a season can be worth millions.)
Garbage collection in real time is a complete failure.
So no, emphatically, C++ is much faster. C# is a leap backwards.

How does C# guarantee the atomicity of read/write operations?

The C# spec states in section 5.5 that reads and writes on certain types (namely bool, char, byte, sbyte, short, ushort, uint, int, float, and reference types) are guaranteed to be atomic.
This has piqued my interest. How can you do that? I mean, my lowly personal experience only showed me to lock variables or to use barriers if I wanted reads and writes to look atomic; that would be a performance killer if it had to be done for every single read/write. And yet C# does something with a similar effect.
Perhaps other languages (like Java) do it. I seriously don't know. My question isn't really intended to be language-specific, it's just that I know C# does it.
I understand that it might have to deal with certain specific processor instructions, and may not be usable in C/C++. However, I'd still like to know how it works.
[EDIT] To tell the truth, I believed that reads and writes could be non-atomic under certain conditions, e.g. one CPU could access a memory location while another CPU is writing to it. Does this only happen when the CPU can't handle the whole object at once, for instance because it's too big or because the memory is not aligned on the proper boundary?
The reason those types have guaranteed atomicity is that they are all 32 bits or smaller. Since .NET only runs on 32- and 64-bit operating systems, the processor architecture can read and write the entire value in a single operation. This is in contrast to, say, an Int64 on a 32-bit platform, which must be read and written using two 32-bit operations.
I'm not really a hardware guy so I apologize if my terminology makes me sound like a buffoon but it's the basic idea.
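A small sketch of the practical consequence (my own example, not from the answer): a long is not covered by the guarantee, so on a 32-bit runtime its reads and writes can tear, and the BCL offers Interlocked for atomic 64-bit access:

    using System.Threading;

    class LongAtomicity
    {
        static long _counter;

        // On a 32-bit runtime a plain read of a long could observe a torn value
        // (one half old, one half new); Interlocked.Read is an atomic 64-bit read.
        static long ReadCounter()
        {
            return Interlocked.Read(ref _counter);
        }

        // Read-modify-write is never atomic by itself, even for int, so
        // increments shared across threads go through Interlocked as well.
        static void Bump()
        {
            Interlocked.Increment(ref _counter);
        }
    }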
It is fairly cheap to implement the atomicity guarantee on x86 and x64 cores since the CLR only promises atomicity for variables that are 32-bit or smaller. All that's required is that the variable is properly aligned and doesn't straddle a cache line. The JIT compiler ensures this by allocating local variables on a 4-byte aligned stack offset. The GC heap manager does the same for heap allocations.
Notable is that the CLR guarantee is not a very good one. The alignment promise is not good enough to write code that's consistently performant for arrays of doubles. Very nicely demonstrated in this thread. Interop with machine code that uses SIMD instructions is also very difficult for this reason.
On x86 reads and writes are atomic anyway. It's supported at the hardware level. This however does not mean that operations like addition and multiplication are atomic; they require a load, compute, then store, which means they can interfere. That's where the lock prefix comes in.
You mentioned locking and memory barriers; they don't have anything to do with reads and writes being atomic. There is no way on x86 with or without using memory barriers that you're going to see a half-written 32-bit value.
Yes, C# and Java guarantee that loads and stores of some primitive types are atomic, like you say. This is cheap because the processors capable of running .NET or the JVM do guarantee that loads and stores of suitably aligned primitive types are atomic.
Now, what neither C# nor Java nor the processors they run on guarantee by default, and which is expensive, is issuing memory barriers so that those variables can be used for synchronization in a multi-threaded program. However, in Java and C# you can mark your variable with the "volatile" keyword, in which case the compiler takes care of issuing the appropriate memory barriers.
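A minimal sketch of that pattern (my own example, not from the answer): a volatile bool used as a cross-thread stop flag, where the volatile modifier supplies the visibility and ordering guarantees:

    using System;
    using System.Threading;

    class VolatileFlagDemo
    {
        // volatile ensures the worker always sees the latest value written by Main
        // and prevents the JIT from hoisting the read out of the loop.
        static volatile bool _stop;

        static void Main()
        {
            var worker = new Thread(() =>
            {
                while (!_stop) { /* spin: simulate work */ }
                Console.WriteLine("Worker observed the stop flag.");
            });
            worker.Start();

            Thread.Sleep(100);
            _stop = true;      // an ordinary write; volatile makes it visible promptly
            worker.Join();
        }
    }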
You can't. Even going all the way down to assembly language you have to use special LOCK opcodes in order to guarantee that another core or even process isn't going to come around and wipe out all your hard work.
