Benefits of 'Optimize code' option in Visual Studio build - c#

Much of our C# release code is built with the 'Optimize code' option turned off. I believe this is to allow code built in Release mode to be debugged more easily.
Given that we are creating fairly simple desktop software which connects to backend Web Services (i.e. not a particularly processor-intensive application), what sort of performance hit, if any, might be expected?
And is any particular platform likely to be worse affected? E.g. multi-processor / 64 bit.

You are the only person who can answer the "performance hit" question. Try it both ways, measure the performance, and see what happens. The hit could be enormous or it could be nonexistent; no one reading this knows whether "enormous" to you means one microsecond or twenty minutes.
If you're interested in what optimizations are done by the C# compiler -- rather than the jitter -- when the optimize switch is on, see:
http://blogs.msdn.com/ericlippert/archive/2009/06/11/what-does-the-optimize-switch-do.aspx

The full details are available at http://blogs.msdn.com/jaybaz_ms/archive/2004/06/28/168314.aspx.
In brief...
In managed code, the JITter in the runtime does nearly all the optimization. The difference in generated IL from this flag is pretty small.

In fact, there is a difference, sometimes quite significant. What can really affect the performance (as it is something the JIT does not fully take care of):
Unnecessary local variables (i.e., bigger stack frames for each call; see the sketch below)
Overly generic conditional instructions, which the JIT translates quite literally
Unnecessary branching (also not handled well by the JIT - after all, it does not have much time to do all the smart optimisations)
So, if you're doing something numerical - turn on the optimisation. Otherwise you won't see any difference at all.
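As a rough illustration (my own minimal sketch, not from the linked post) of the first point: with /optimize- the compiler keeps the temporary as a real IL local and emits literal branches, so the stack frame is bigger and the IL longer; with /optimize+ the temporary is folded away.

static int SumOfSquares(int[] values)
{
    int sum = 0;
    for (int i = 0; i < values.Length; i++)
    {
        // With /optimize- this temporary survives as its own IL local;
        // with /optimize+ it is folded into the expression.
        int square = values[i] * values[i];
        sum += square;
    }
    return sum;
}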

The optimizations done by the compiler are fairly low level and shouldn't affect your users' experience.
If you'd like to quantify the optimization on your application, simply profile a non-optimized and an optimized build and compare the results.

I find that with complex, CPU-intensive code (the code I'm using is a Monte Carlo simulation that spawns enough threads to utilize a computer at 100%; this was tested in a 36-core environment) the performance hit can be a factor of four or more: a simulation that takes 2 hours will take about 9 hours without the optimization flag. (There are about 500,000 paths, and for each path there are 500 steps for around 2,000 different objects, with highly complex calculations on each object.)

Related

Why does code size matter for JIT compilation?

Say you're developing in a JIT-compiled language. Is there any performance downside to making your functions very large, in terms of the code size of the generated assembly?
I ask because I was looking through the source code of Buffer.MemoryCopy in C# the other day, which is obviously a very performance-sensitive method. It appears they use a large switch statement to specialize the function for all byte counts <= 16, resulting in some pretty gigantic generated assembly.
Are there any cons, performance-wise, to this approach? For example, I noticed the glibc and FreeBSD implementations of memmove do not do this, in spite of the fact that C is AOT-compiled, so it doesn't pay the cost of JIT compilation (which is one downside for C#: the JIT waits until the first call to compile the method, so for really long methods the first invocation will take longer).
What are the up/downsides to having a gigantic switch statement and increasing code size (other than the precompilation cost I just mentioned) for JIT-ed languages? Thanks. (I'm a bit new to assembly so please go easy on me :) )
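For concreteness, here is a hedged sketch of the pattern being described (not the actual Buffer.MemoryCopy source, just its shape): small lengths are special-cased in a switch so the JIT emits straight-line stores, at the cost of a much larger method body. Compile with /unsafe.

static unsafe void CopySmall(byte* dest, byte* src, int len)
{
    switch (len)
    {
        case 0: return;
        case 1: *dest = *src; return;
        case 2: *(short*)dest = *(short*)src; return;
        case 4: *(int*)dest = *(int*)src; return;
        case 8: *(long*)dest = *(long*)src; return;
        default:
            // fallback loop for lengths not special-cased above
            for (int i = 0; i < len; i++) dest[i] = src[i];
            return;
    }
}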
Assuming x86.
Fetching[1] and decoding[2] instructions is not for free.
Similarly to data cache, the CPU has a code cache; but it is usually smaller, ranging from 8 KiB to 32 KiB.
A shorter code fits better in the I-cache, requiring less fetches from memory.
Fetching, however, is only half of the story.
The x86 is historically problematic when it comes to decoding, due to its (very) variable length instructions.
There have been, and there are, various patterns to follow and limitations to work around in order to reach fast decoding.
Since the Core2 architecture, the CPU has had other instruction caches that sit after the decoders[3].
These caches hold the already decoded instructions, bypassing the limitations and latency of the previous stages.
Just to have a mental idea, I sketched the Haswell decoding unit[4]:
Each arrow is a step in the data path that usually takes a clock.
The dark shaded areas are where an instruction can be found.
The closer a cache is to the Out of Order core[5] (meaning further down in the sketch), the faster an instruction in that cache can reach the core.
However, the closer the cache, the smaller it becomes, so reducing the code size improves performance, especially for the critical loops[6].
I drew these conclusions based on the analysis of Agner Fog.
[1] The act of reading from memory.
[2] The operation of converting an instruction into micro-operations.
[3] Pre-decoders for Core2, but still.
[4] Peter, you are welcome to point out the mistakes :).
[5] The part of the CPU that actually executes the instructions.
[6] Loops are meant to be executed often.

Is C# really slower than say C++?

I've been wondering about this issue for a while now.
Of course there are things in C# that aren't optimized for speed, so using those objects or language features (like LINQ) may cause the code to be slower.
But if you don't use any of those features and just compare the same pieces of code in C# and C++ (it's easy to translate one to the other), will it really be that much slower?
I've seen comparisons that show that C# might be even faster in some cases, because in theory the JIT compiler should optimize the code in real time and get better results:
Managed Or Unmanaged?
We should remember that the JIT compiler compiles the code at run time, but that's a one-time overhead; the same code (once reached and compiled) doesn't need to be compiled again at run time.
The GC doesn't add a lot of overhead either, unless you create and destroy thousands of objects (like using String instead of StringBuilder). And doing that in C++ would also be costly.
Another point that I want to bring up is the better communication between DLLs introduced in .NET. The .NET platform communicates much better than the older managed COM-based DLLs did.
I don't see any inherent reason why the language should be slower, and I don't really think that C# is slower than C++ (both from experience and from the lack of a good explanation).
So, will a piece of code written in C# be slower than the same code in C++?
And if so, then WHY?
Some other reference (which talks about it a bit, but with no explanation as to WHY):
Why would you want to use C# if its slower than C++?
Warning: The question you've asked is really pretty complex -- probably much more so than you realize. As a result, this is a really long answer.
From a purely theoretical viewpoint, there's probably a simple answer to this: there's (probably) nothing about C# that truly prevents it from being as fast as C++. Despite the theory, however, there are some practical reasons that it is slower at some things under some circumstances.
I'll consider three basic areas of differences: language features, virtual machine execution, and garbage collection. The latter two often go together, but can be independent, so I'll look at them separately.
Language Features
C++ places a great deal of emphasis on templates, and features in the template system that are largely intended to allow as much as possible to be done at compile time, so from the viewpoint of the program, they're "static." Template meta-programming allows completely arbitrary computations to be carried out at compile time (I.e., the template system is Turing complete). As such, essentially anything that doesn't depend on input from the user can be computed at compile time, so at runtime it's simply a constant. Input to this can, however, include things like type information, so a great deal of what you'd do via reflection at runtime in C# is normally done at compile time via template metaprogramming in C++. There is definitely a trade-off between runtime speed and versatility though -- what templates can do, they do statically, but they simply can't do everything reflection can.
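As a rough C# illustration of the runtime side of that trade-off (my own sketch, not from the answer): what a C++ template would have resolved statically, C# typically discovers at run time via reflection, paying the cost on every call.

using System;
using System.Reflection;

static void DumpProperties(object obj)
{
    // Type information is inspected at run time here, where template
    // metaprogramming would have fixed it at compile time.
    foreach (PropertyInfo p in obj.GetType().GetProperties())
    {
        Console.WriteLine("{0} = {1}", p.Name, p.GetValue(obj, null));
    }
}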
The differences in language features mean that almost any attempt at comparing the two languages simply by transliterating some C# into C++ (or vice versa) is likely to produce results somewhere between meaningless and misleading (and the same would be true for most other pairs of languages as well). The simple fact is that for anything larger than a couple lines of code or so, almost nobody is at all likely to use the languages the same way (or close enough to the same way) that such a comparison tells you anything about how those languages work in real life.
Virtual Machine
Like almost any reasonably modern VM, Microsoft's for .NET can and will do JIT (aka "dynamic") compilation. This represents a number of trade-offs though.
Primarily, optimizing code (like most other optimization problems) is largely an NP-complete problem. For anything but a truly trivial/toy program, you're pretty nearly guaranteed you won't truly "optimize" the result (i.e., you won't find the true optimum) -- the optimizer will simply make the code better than it was previously. Quite a few optimizations that are well known, however, take a substantial amount of time (and, often, memory) to execute. With a JIT compiler, the user is waiting while the compiler runs. Most of the more expensive optimization techniques are ruled out. Static compilation has two advantages: first of all, if it's slow (e.g., building a large system) it's typically carried out on a server, and nobody spends time waiting for it. Second, an executable can be generated once, and used many times by many people. The first minimizes the cost of optimization; the second amortizes the much smaller cost over a much larger number of executions.
As mentioned in the original question (and many other web sites) JIT compilation does have the possibility of greater awareness of the target environment, which should (at least theoretically) offset this advantage. There's no question that this factor can offset at least part of the disadvantage of static compilation. For a few rather specific types of code and target environments, it can even outweigh the advantages of static compilation, sometimes fairly dramatically. At least in my testing and experience, however, this is fairly unusual. Target dependent optimizations mostly seem to either make fairly small differences, or can only be applied (automatically, anyway) to fairly specific types of problems. Obvious times this would happen would be if you were running a relatively old program on a modern machine. An old program written in C++ would probably have been compiled to 32-bit code, and would continue to use 32-bit code even on a modern 64-bit processor. A program written in C# would have been compiled to byte code, which the VM would then compile to 64-bit machine code. If this program derived a substantial benefit from running as 64-bit code, that could give a substantial advantage. For a short time when 64-bit processors were fairly new, this happened a fair amount. Recent code that's likely to benefit from a 64-bit processor will usually be available compiled statically into 64-bit code though.
Using a VM also has a possibility of improving cache usage. Instructions for a VM are often more compact than native machine instructions. More of them can fit into a given amount of cache memory, so you stand a better chance of any given code being in cache when needed. This can help keep interpreted execution of VM code more competitive (in terms of speed) than most people would initially expect -- you can execute a lot of instructions on a modern CPU in the time taken by one cache miss.
It's also worth mentioning that this factor isn't necessarily different between the two at all. There's nothing preventing (for example) a C++ compiler from producing output intended to run on a virtual machine (with or without JIT). In fact, Microsoft's C++/CLI is nearly that -- an (almost) conforming C++ compiler (albeit, with a lot of extensions) that produces output intended to run on a virtual machine.
The reverse is also true: Microsoft now has .NET Native, which compiles C# (or VB.NET) code to a native executable. This gives performance that's generally much more like C++, but retains the features of C#/VB (e.g., C# compiled to native code still supports reflection). If you have performance intensive C# code, this may be helpful.
Garbage Collection
From what I've seen, I'd say garbage collection is the poorest-understood of these three factors. Just for an obvious example, the question here mentions: "GC doesn't add a lot of overhead either, unless you create and destroy thousands of objects [...]". In reality, if you create and destroy thousands of objects, the overhead from garbage collection will generally be fairly low. .NET uses a generational scavenger, which is a variety of copying collector. The garbage collector works by starting from "places" (e.g., registers and execution stack) that pointers/references are known to be accessible. It then "chases" those pointers to objects that have been allocated on the heap. It examines those objects for further pointers/references, until it has followed all of them to the ends of any chains, and found all the objects that are (at least potentially) accessible. In the next step, it takes all of the objects that are (or at least might be) in use, and compacts the heap by copying all of them into a contiguous chunk at one end of the memory being managed in the heap. The rest of the memory is then free (modulo finalizers having to be run, but at least in well-written code, they're rare enough that I'll ignore them for the moment).
What this means is that if you create and destroy lots of objects, garbage collection adds very little overhead. The time taken by a garbage collection cycle depends almost entirely on the number of objects that have been created but not destroyed. The primary consequence of creating and destroying objects in a hurry is simply that the GC has to run more often, but each cycle will still be fast. If you create objects and don't destroy them, the GC will run more often and each cycle will be substantially slower as it spends more time chasing pointers to potentially-live objects, and it spends more time copying objects that are still in use.
To combat this, generational scavenging works on the assumption that objects that have remained "alive" for quite a while are likely to continue remaining alive for quite a while longer. Based on this, it has a system where objects that survive some number of garbage collection cycles get "tenured", and the garbage collector starts to simply assume they're still in use, so instead of copying them at every cycle, it simply leaves them alone. This is a valid assumption often enough that generational scavenging typically has considerably lower overhead than most other forms of GC.
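A small illustration of that tenuring behaviour, using the standard GC API (generation numbers are runtime-dependent, so treat the output as indicative only):

object survivor = new object();
Console.WriteLine(GC.GetGeneration(survivor)); // typically 0: freshly allocated

GC.Collect(); // survivor is still referenced, so it gets promoted
GC.Collect();

Console.WriteLine(GC.GetGeneration(survivor)); // typically 2: tenured, and mostly left alone from now on
GC.KeepAlive(survivor); // keep the reference live through the measurements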
"Manual" memory management is often just as poorly understood. Just for one example, many attempts at comparison assume that all manual memory management follows one specific model as well (e.g., best-fit allocation). This is often little (if any) closer to reality than many peoples' beliefs about garbage collection (e.g., the widespread assumption that it's normally done using reference counting).
Given the variety of strategies for both garbage collection and manual memory management, it's quite difficult to compare the two in terms of overall speed. Attempting to compare the speed of allocating and/or freeing memory (by itself) is pretty nearly guaranteed to produce results that are meaningless at best, and outright misleading at worst.
Bonus Topic: Benchmarks
Since quite a few blogs, web sites, magazine articles, etc., claim to provide "objective" evidence in one direction or another, I'll put in my two-cents worth on that subject as well.
Most of these benchmarks are a bit like teenagers deciding to race their cars, and whoever wins gets to keep both cars. The web sites differ in one crucial way though: the guy who's publishing the benchmark gets to drive both cars. By some strange chance, his car always wins, and everybody else has to settle for "trust me, I was really driving your car as fast as it would go."
It's easy to write a poor benchmark that produces results that mean next to nothing. Almost anybody with anywhere close to the skill necessary to design a benchmark that produces anything meaningful, also has the skill to produce one that will give the results he's decided he wants. In fact it's probably easier to write code to produce a specific result than code that will really produce meaningful results.
As my friend James Kanze put it, "never trust a benchmark you didn't falsify yourself."
Conclusion
There is no simple answer. I'm reasonably certain that I could flip a coin to choose the winner, then pick a number between (say) 1 and 20 for the percentage it would win by, and write some code that would look like a reasonable and fair benchmark and produce that foregone conclusion (at least on some target processor--a different processor might change the percentage a bit).
As others have pointed out, for most code, speed is almost irrelevant. The corollary to that (which is much more often ignored) is that in the little code where speed does matter, it usually matters a lot. At least in my experience, for the code where it really does matter, C++ is almost always the winner. There are definitely factors that favor C#, but in practice they seem to be outweighed by factors that favor C++. You can certainly find benchmarks that will indicate the outcome of your choice, but when you write real code, you can almost always make it faster in C++ than in C#. It might (or might not) take more skill and/or effort to write, but it's virtually always possible.
Because you don't always need to use the (and I use this loosely) "fastest" language? I don't drive to work in a Ferrari just because it's faster...
Circa 2005 two MS performance experts from both sides of the native/managed fence tried to answer the same question. Their method and process are still fascinating and the conclusions still hold today - and I'm not aware of any better attempt to give an informed answer. They noted that a discussion of potential reasons for differences in performance is hypothetical and futile, and a true discussion must have some empirical basis for the real world impact of such differences.
So, The Old New Thing's Raymond Chen and Rico Mariani set rules for a friendly competition. A Chinese/English dictionary was chosen as a toy application context: simple enough to be coded as a hobby side-project, yet complex enough to demonstrate non-trivial data usage patterns. The rules started simple - Raymond coded a straightforward C++ implementation, Rico migrated it to C# line by line, with no sophistication whatsoever, and both implementations ran a benchmark. Afterwards, several iterations of optimizations ensued.
The full details are here: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
This dialogue of titans is exceptionally educational and I wholeheartedly recommend diving in - but if you lack the time or patience, Jeff Atwood compiled the bottom lines beautifully:
Eventually, C++ was 2x faster - but initially, it was 13x slower.
As Rico sums up:
So am I ashamed by my crushing defeat? Hardly. The managed code achieved a very good result for hardly any effort. To defeat the managed version, Raymond had to:
Write his own file/io stuff
Write his own string class
Write his own allocator
Write his own international mapping
Of course he used available lower level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so.
That is my experience still, 11 years and who knows how many C#/C++ versions later.
That is no coincidence, of course, as these two languages spectacularly achieve their vastly different design goals. C# wants to be used where development cost is the main consideration (still the majority of software), and C++ shines where you'd spare no expense to squeeze every last ounce of performance out of your machine: games, algo-trading, data-centers, etc.
C++ always has an edge on performance. With C#, I don't have to handle memory, and I have literally tons of resources available to do my job.
What you need to ask yourself is which one saves you time. Machines are incredibly powerful now, and most of your code should be written in a language that allows you to get the most value in the least amount of time.
If there is a core piece of processing that takes far too long in C#, you could build it in C++ and interop your way to it from C# (see the sketch below).
Stop thinking about your code performance. Start building value.
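A hedged sketch of that interop route - the library name FastMath.dll and the ComputeHeavy entry point are made up purely for illustration; the point is only the P/Invoke shape:

using System.Runtime.InteropServices;

static class NativeCore
{
    // Hypothetical native export; the real signature would match your C++ code.
    [DllImport("FastMath.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern double ComputeHeavy(double[] data, int length);
}

// Usage: keep the orchestration in C#, push only the hot loop into C++.
// double result = NativeCore.ComputeHeavy(samples, samples.Length);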
C# is faster than C++. Faster to write. For execution times, nothing beats a profiler.
But C# cannot interface with as many libraries as easily as C++ can.
And C# depends heavily on Windows...
BTW, time-critical applications are not coded in C# or Java, primarily due to the uncertainty of when garbage collection will be performed.
In modern times, application or execution speed is not as important as it once was. Development schedules, correctness and robustness are higher priorities. A high-speed version of an application is no good if it has lots of bugs, crashes a lot or, worse, misses an opportunity to get to market or be deployed.
Since development schedules are a priority, new languages are coming out that speed up development. C# is one of these. C# also assists in correctness and robustness by removing features from C++ that cause common problems: one example is pointers.
The differences in execution speed of an application developed with C# and one developed using C++ is negligible on most platforms. This is due to the fact that the execution bottlenecks are not language dependent but usually depend on the operating system or I/O. For example if C++ performs a function in 5 ms but C# uses 2ms, and waiting for data takes 2 seconds, the time spent in the function is insignificant compared to the time waiting for data.
Choose a language that is best suited for the developers, platform and projects. Work towards the goals of correctness, robustness and deployment. The speed of an application should be treated as a bug: prioritize it, compare to other bugs, and fix as necessary.
A better way to look at it: everything is slower than C/C++ because those languages abstract less, rather than following the stick-and-mud paradigm. It's called systems programming for a reason: you program against the grain, on bare metal. Doing so also grants you speed you cannot achieve with other languages like C# or Java. But alas, C's roots are all about doing things the hard way, so you're mostly going to be writing more code and spending more time debugging it.
C is also case sensitive, and objects in C++ follow strict rule sets. For example, a purple ice cream cone may not be the same as a blue ice cream cone; though they might both be cones, they may not necessarily belong to the same cone family, and if you forget to define what a cone is, you get a bug. Thus the properties of ice cream may or may not be clones. Now for the speed argument: C/C++ uses the stack-and-heap approach, and this is where bare metal gets its metal.
With the Boost library you can achieve incredible speeds; unfortunately most game studios stick to the standard library. The other reason for this might be that software written in C/C++ tends to be massive in file size, as it's a giant collection of files instead of a single file. Also note that all operating systems are written in C, so generally, why must we ask what could be faster?!
Also, caching is not faster than pure memory management; sorry, but this just doesn't pan out. Memory is something physical; caching is something software does in order to gain a kick in performance. One could also reason that without physical memory, caching would simply not exist. It doesn't void the fact that memory must be managed at some level, whether it's automated or manual.
Why would you write a small application that doesn't require much in the way of optimization in C++, if there is a faster route (C#)?
Getting an exact answer to your question is not really possible unless you perform benchmarks on specific systems. However, it is still interesting to think about some fundamental differences between programming languages like C# and C++.
Compilation
Executing C# code requires an additional step where the code is JIT'ed. With regard to performance, that is in favor of C++. Also, the JIT compiler is only able to optimize the generated code within the unit of code that is JIT'ed (e.g. a method), while a C++ compiler can optimize across method calls using more aggressive techniques.
However, the JIT compiler is able to optimize the generated machine code to closely match the underlying hardware, enabling it to take advantage of additional hardware features if they exist. To my knowledge the .NET JIT compiler doesn't do that, but it would conceivably be able to generate different code for Atom as opposed to Pentium CPUs.
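As a small, later-era illustration of that kind of hardware awareness (System.Numerics postdates this answer, so this is an assumption about how you would observe it on a current runtime, not a claim about the JIT of that time):

using System;
using System.Numerics;

// The same IL runs everywhere; the JIT chooses SIMD-backed code paths
// only when the current CPU supports them.
Console.WriteLine(Vector.IsHardwareAccelerated); // true on SSE/AVX-capable hardware
Console.WriteLine(Vector<float>.Count);          // e.g. 4 with SSE, 8 with AVX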
Memory access
The garbage-collected architecture can in many cases create more optimal memory access patterns than standard C++ code. If the memory area used for the first generation is small enough, it can stay within the CPU cache, increasing performance. If you create and destroy a lot of small objects, the overhead of maintaining the managed heap may be smaller than what is required by the C++ runtime. Again, this is highly dependent on the application. A study of Python performance demonstrates that a specific managed Python application is able to scale much better than the compiled version as a result of more optimal memory access patterns.
Don't be confused!
If a C# application is written in the best possible way and a C++ application is written in the best possible way, the C++ one is faster.
There are many reasons why C++ is inherently faster than C#, such as C# using a virtual machine similar to the JVM in Java. Basically, a higher-level language has lower performance (when both are used in the best case).
If you're an experienced professional C# programmer, just as you might be an experienced professional C++ programmer, developing an application in C# is much easier and faster than in C++.
Many other situations between these extremes are possible. For example, you can write a C# application and a C++ application such that the C# app runs faster than the C++ one.
For choosing a language you should note the circumstances of the project and its subject. For a general business project you should use C#. For a project that requires high performance, like a video converter or image processing, you should choose C++.
Update:
OK, let's compare some practical reasons why the best possible speed of C++ is higher than that of C#. Consider a well-written C# application and the same C++ version:
C# uses a VM as a middle layer for executing the application. That has overhead.
AFAIK the CLR cannot optimize all C# code for the target machine. A C++ application can be compiled on the target machine with maximum optimization.
In C#, the best possible optimization at runtime means the fastest possible VM. A VM has overhead anyway.
C# is a higher-level language, thus it generates more program code for the final process. (Consider the difference between an assembly application and a Ruby one; the same relation holds between C++ and a higher-level language such as C#/Java.)
If you prefer to get some more info in practice as an expert, see this. It is about Java but it also applies to C#.
The primary concern would not be speed, but stability across Windows versions and upgrades. Win32 is mostly immune across Windows versions, making it highly stable.
When servers are decommissioned and software migrated, a lot of anxiety happens around anything using .NET, and usually a lot of panic about .NET versions, but a Win32 application built 10 years ago just keeps running like nothing happened.
I have been specializing in optimization for about 15 years, and I regularly rewrite C++ code, making heavy use of compiler intrinsics as much as possible, because C++ performance is often nowhere near what the CPU is capable of. Cache performance often needs to be considered. Many vector maths instructions are required to replace the standard C++ floating-point code.
A great deal of STL code is rewritten and often runs many times faster. Maths and code which makes heavy use of data can be rewritten with spectacular results, as the CPU approaches its optimum performance.
None of this is possible in C#. To compare their relative "real time" performance is really a staggeringly ignorant question. The fastest piece of code in C++ is when each single assembler instruction is optimised for the task in hand, with no unnecessary instructions at all; where each piece of memory is used when it is required, and not copied n times because that's what the language design requires; where each required memory movement works in harmony with the cache.
Where the final algorithm cannot be improved, based on the exact real time requirements, considering accuracy and functionality.
Then you will be approaching an optimal solution.
To compare C# with this ideal situation is staggering. C# can't compete. In fact, I am currently rewriting a whole bunch of C# code (when I say rewriting I mean removing and replacing it completely) because it is not even in the same city, let alone ball park, when it comes to heavy-lifting real-time performance.
So please, stop fooling yourselves. C# is slow. Dead slow. All software is slowing down, and C# is making this decline worse. All software runs using the fetch-execute cycle in assembler (you know - on the CPU). Use 10 times as many instructions and it's going to go 10 times as slow. Cripple the cache and it's going to go even slower. Add garbage collection to a real-time piece of software and you are often fooled into thinking that the code runs 'ok'; there are just those few moments every now and then when the code goes 'a bit slow for a while'.
Try adding a garbage collection system to code where every cycle counts. I wonder if the stock market trading software has garbage collection (you know – on the system running on the new undersea cable which cost $300 million?). Can we spare 300 milliseconds every 2 seconds? What about flight control software on the space shuttle – is GC ok there? How about engine management software in performance vehicles? (Where victory in a season can be worth millions).
Garbage collection in real time is a complete failure.
So no, emphatically, C++ is much faster. C# is a leap backwards.

PostSharp has no effect on speed

I have stumbled on impossibly good performance behaviour with PostSharp. To evaluate the speed I wrote a little program that executes one function a specified number of times; if PostSharp is enabled, it generates and deletes a few hundred strings, just in memory (non-fixed composition, so they are not auto-interned). The loop executes in a non-trivial (a few milliseconds) amount of time.
Now, I am unable to measure the difference over a few million runs, and a crazy run of ~40 billion iterations amounted to a difference of just a few nanoseconds versus the non-PostSharp version doing the same number of calls. To me, this is impossible. There must be something wrong with my test. I had the code peer-reviewed by my co-workers, so I am fairly confident the code does what I intend it to.
So, is there something wrong with using string generation (which is the expected use in the intended applications) as the slow-running simulation for the benchmarks?
Alternatively, has someone else performed (or does anyone know of) an analysis of PostSharp's runtime performance?
Thank you.
On a 3 GHz processor, 40 billion clock cycles alone will take 13 seconds - and I sincerely doubt that a single iteration is taking just one clock cycle. Something's definitely wrong with your test.
Something's likely getting optimized away - maybe it sees that you're doing the same thing over and over again and is deciding not to do it at all (except the first time). You need to make sure you're randomizing your data when you do perf analysis.
I have done performance tests. They were published on the PostSharp blog.
Some aspects can have the same performance as hand-written code if they don't use features such as reflection, access to method parameters, or access to the method instance. Since PostSharp emits MSIL instructions, the generated code can be inlined by the JIT compiler.
As reminded in other answers, be sure that (1) PostSharp is indeed invoked (use Reflector on the resulting assembly) and (2) you're using the Stopwatch properly. If you're comparing the average time of a single test, it's normal that the difference between PostSharp and hand-written code is just a few nanoseconds (in the hypothesis that you don't use expensive features).
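For reference, a minimal sketch of that kind of measurement - time a large batch of calls with Stopwatch rather than averaging single invocations (DoAspectedWork is a placeholder for whatever method carries the PostSharp aspect):

using System;
using System.Diagnostics;

var sw = Stopwatch.StartNew();
for (int i = 0; i < 10000000; i++)
{
    DoAspectedWork(i); // placeholder for the aspect-enhanced method under test
}
sw.Stop();
Console.WriteLine("Total: {0} ms for 10M calls", sw.ElapsedMilliseconds);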
Can you change your test such that the generated strings are used in the next iteration (e.g. the string length written to the console), or something like that?
Maybe the compiler optimizes your program in such a way that either the PostSharp-enhanced function is not executed at all, or it is called asynchronously or executed on another CPU, because there is no reason to sync with the other iterations. If you link the iterations more tightly, that may force the compiler to synchronize the actions.

Performance difference between C++ and C# for mathematics

I would like to preface this with I'm not trying to start a fight. I was wondering if anyone had any good resources that compared C++ and C# for mathematically intensive code? My gut impression is that C# should be significantly slower, but I really have no evidence for this feeling. I was wondering if anyone here has ever run across a study or tested this themselves? I plan on running some tests myself, but would like to know if anyone has done this in a rigorous manner (google shows very little). Thanks.
EDIT: For intensive, I mean a lot of sin/cos/exp happening in tight loops
I have to periodically compare the performance of core math under runtimes and languages as part of my job.
In my most recent test, the performance of C# vs my optimized C++ control-case under the key benchmark — transform of a long array of 4d vectors by a 4d matrix with a final normalize step — C++ was about 30x faster than C#. I can get a peak throughput of one vector every 1.8ns in my C++ code, whereas C# got the job done in about 65ns per vector.
This is of course a specialized case, and the C++ isn't naive: it uses software pipelining, SIMD, cache prefetch, the whole nine yards of micro-optimization.
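For comparison, the straightforward C# version of that inner operation would look something like this (using System.Numerics; the answer's actual C# code is not shown, so this is only the obvious formulation, not their benchmark):

using System.Numerics;

static void TransformAll(Vector4[] src, Vector4[] dst, Matrix4x4 m)
{
    for (int i = 0; i < src.Length; i++)
    {
        // Transform by the 4x4 matrix, then normalize, per the benchmark description.
        dst[i] = Vector4.Normalize(Vector4.Transform(src[i], m));
    }
}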
C# will be slower in general, but not significantly so. In some cases, depending on the structure of the code, C# can actually be faster, as JIT analysis can frequently improve the performance of a long-running algorithm.
Edit: Here's a nice discussion of C# vs C++ performance
Edit 2:
"In general" is not really accurate. As you say, the JIT compiler can actually turn your MSIL into faster native code that the C++ compiler because it can optimize for the hardware it is running on.
You must admit, however, that the act of JIT compiling itself is resource intensive, and there are runtime checks that occur in managed code. Pre-compiled and pre-optimized code will always be faster than just JITted code. Every benchmark comparison shows it. But long-running processes that can have a fair amount of runtime analysis can be improved over pre-compiled, pre-optimized native code.
So what I said was 100% accurate. For the general case, managed code is slightly slower than pre-compiled, pre-optimized code. It's not always a significant performance hit, however, and for some cases JIT analysis can improve performance over pre-optimized native code.
For straight mathematical functions, asking if C# is faster than C++ is not the best question. What you should be asking is:
Is the assembly produced by the CLR JITter more or less efficient than the assembly generated by the C++ compiler?
The C# compiler has much less influence on the speed of purely mathematical operations than the CLR JIT does. It would have almost identical performance to other .NET languages (such as VB.NET, if you turn off overflow checking).
There are extensive benchmarks here:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=csharp&lang2=gpp&box=1
Note this compares the Mono JIT to C++. AFAIK there are no extensive benchmarks of Microsoft's implementation out there, so almost everything you will hear is hearsay. :(
I think you're asking the wrong question. You should be asking if C++ can beat out the .NET family of languages in mathematical computation. Have a gander at F# timing comparisons for Runge-Kutta.
You do not define "mathematically intensive" very well (understatement for: not at all).
An attempt to a breakdown:
For the basic Sin/Cos/Log functions I would not expect much difference.
For linear algebra (matrices) I would expect .NET to lose out; the (always enforced) bounds checking on arrays is only optimized away under some circumstances (see the sketch after this list).
You will probably have to benchmark something close to your intended domain.
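A hedged sketch of the circumstance being referred to: when the loop bound is the array's own Length, the JIT can usually drop the per-element bounds check; when the bound comes from elsewhere, it generally cannot.

static double SumElided(double[] a)
{
    double s = 0;
    for (int i = 0; i < a.Length; i++)
        s += a[i]; // bounds check typically eliminated: i is provably within a.Length
    return s;
}

static double SumChecked(double[] a, int n)
{
    double s = 0;
    for (int i = 0; i < n; i++)
        s += a[i]; // bounds check usually kept: n is not provably <= a.Length
    return s;
}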
I would consider using Mono.Simd to accelerate some operations. The downside is that on the MS runtime it's not accelerated.
I haven't checked recently, but the last time I did check, Microsoft's license agreement for the .NET runtime required you to agree NOT to publish any benchmarks of its performance. That tends to limit the amount of solid information that gets published.
A few others have implied it, but I'll state it directly: I think you're engaging in (extremely) premature optimization -- or trying to anyway.
Edit:
Doing a bit of looking, the license has changed (a long time ago, in fact). The current terms
say you're allowed to publish benchmarks -- but only if you meet their conditions. Some of those conditions look (to me) nearly impossible to meet. For example, you can only publish provided: "your benchmark testing was performed using all performance tuning and best practice guidance set forth in the product documentation and/or on Microsoft's support Web sites". Given the size and number of Microsoft's web sites, I don't see how anybody stands a chance of being certain they're following all the guidance they might provide.
Although that web page talks about .NET 1.1, the newer licenses seem to refer back to it as well.
So, what I remembered was technically wrong, but effectively correct anyway.
For basic math library functions there won't be much difference, because C# will call out to the same compiled code that C++ would use. For more interesting math that you won't find in the math library, there are several factors that make C# worse. The current JIT doesn't support the SSE instructions that you would have access to in C++.

My 32 bit headache is now a 64bit migraine?!? (or 64bit .NET CLR Runtime issues)

What unusual, unexpected consequences have occurred in terms of performance, memory, etc when switching from running your .NET applications under the 64 bit JIT vs. the 32 bit JIT? I'm interested in the good, but more interested in the surprisingly bad issues people have run into.
I am in the process of writing a new .NET application which will be deployed in both 32bit and 64bit. There have been many questions relating to the issues with porting the application - I am unconcerned with the "gotchas" from a programming/porting standpoint. (ie: Handling native/COM interop correctly, reference types embedded in structs changing the size of the struct, etc.)
However, this question and its answer got me thinking - What other issues am I overlooking?
There have been many questions and blog posts that skirt around this issue, or hit one aspect of it, but I haven't seen anything that's compiled a decent list of problems.
In particular - My application is very CPU bound and has huge memory usage patterns (hence the need for 64bit in the first place), as well as being graphical in nature. I'm concerned with what other hidden issues may exist in the CLR or JIT running on 64 bit Windows (using .NET 3.5sp1).
Here are a few issues I'm currently aware of:
(Now I know that) Properties, even automatic properties, don't get inlined in x64.
The memory profile of the application changes, both because of the size of references, but also because the memory allocator has different performance characteristics
Startup times can suffer on x64
I'd like to know what other, specific, issues people have discovered in the JIT on 64bit Windows, and also if there are any workarounds for performance.
Thank you all!
----EDIT-----
Just to clarify -
I am aware that trying to optimize early is often bad. I am aware that second guessing the system is often bad. I also know that portability to 64bit has its own issues - we run and test on 64bit systems daily to help with this. etc.
My application, however, is not your typical business application. It's a scientific software application. We have many processes that sit using 100% CPU on all of the cores (it's highly threaded) for hours at a time.
I spend a LOT of time profiling the application, and that makes a huge difference. However, most profilers disable many features of the JIT, so the small details in things like memory allocation, inlining in the JIT, etc, can be very difficult to pin down when you're running under a profiler. Hence my need for the question.
A particularly troublesome performance problem in .NET relates to the poor JIT:
https://connect.microsoft.com/VisualStudio/feedback/details/93858/struct-methods-should-be-inlined?wa=wsignin1.0
Basically, inlining and structs don't work well together on x64 (although that page suggests inlining now works but subsequent redundant copies aren't eliminated, which sounds suspect given the tiny perf difference).
In any case, after wrestling with .NET long enough for this, my solution is to use C++ for anything numerically intensive. Even in "good" cases for .NET, where you're not dealing with structs and using arrays where the bounds-checking is optimized out, C++ beats .NET hands down.
If you're doing anything more complicated than dot products, the picture gets worse very quickly; the .NET code is both longer + less readable (because you need to manually inline stuff and/or can't use generics), and much slower.
I've switched to using Eigen in C++: it's absolutely great, resulting in readable code and high performance; a thin C++/CLI wrapper then provides the glue between the compute engine and the .NET world.
Eigen works by template meta-programming; it compiles vector expressions into SSE intrinsic instructions and does a lot of the nastiest cache-related loop unrolling and rearranging for you; and though focused on linear algebra, it'll work with integers and non-matrix array expressions too.
So, for instance, if P is a matrix, this kind of stuff Just Works:
1.0 / (P.transpose() * P).diagonal().sum();
...which doesn't allocate a temporary transposed variant of P, and doesn't compute the whole matrix product but only the fields it needs.
So, if you can run in Full Trust - just use C++ via C++/CLI, it works much much better.
I remember hearing an issue from an IRC channel I frequent.
The x64 JIT optimises away the temporary copy in this instance:
EventHandler temp = SomeEvent;
if (temp != null)
{
    temp(this, EventArgs.Empty);
}
Putting the race condition back in and causing potential null reference exceptions.
Most of the time Visual Studio and the compiler do a pretty good job of hiding the issues from you. However, I am aware of one major problem that can arise if you set your app to auto-detect the platform (x86 vs x64) and also have any dependencies on 32bit 3rd party dlls. In this case, on 64bit platforms it will try to call the dlls using 64bit conventions and structures, and it just won't work.
You mentioned the porting issues; those are the ones to be concerned with. I (obviously) don't know your application, but trying to second-guess the JIT is often a complete waste of time. The people who write the JIT have an intimate understanding of the x86/x64 chip architecture, and in all likelihood know what performs better and what performs worse than probably anyone else on the planet.
Yes, it's possible that you have a corner case that is different and unique, but if you're "in the process of writing a new application" then I wouldn't worry about the JIT compiler. There's likely a silly loop that can be avoided somewhere that will buy you 100x the performance improvement you'll get from trying to second-guess the JIT. This reminds me of issues we ran into writing our ORM: we'd look at code and think we could tweak a couple of machine instructions out of it... of course, the code then went off and connected to a database server over a network, so we were trimming microseconds off a process that was bounded by milliseconds somewhere else.
Universal rule of performance tweaking... If you haven't measured your performance you don't know where your bottlenecks are, you just think you know... and you're likely wrong.
About Quibblesome's answer:
I tried to run the following code in my Windows 7 x64 in Release mode without debugger, and NullReferenceException has never been thrown.
using System;
using System.Threading;

namespace EventsMultithreadingTest
{
    public class Program
    {
        private static Action<object> _delegate = new Action<object>(Program_Event);
        public static event Action<object> Event;

        public static void Main(string[] args)
        {
            Thread thread = new Thread(delegate()
            {
                while (true)
                {
                    Action<object> ev = Event;
                    if (ev != null)
                    {
                        ev.Invoke(null);
                    }
                }
            });
            thread.Start();

            while (true)
            {
                Event += _delegate;
                Event -= _delegate;
            }
        }

        static void Program_Event(object obj)
        {
            object.Equals(null, null);
        }
    }
}
I believe the x64 JIT is not fully developed/ported to take advantage of 64-bit architecture CPUs, so it has issues; you may be getting 'emulated' behaviour from your assemblies, which may cause problems and unexpected behaviour. I would look into cases where this can be avoided, and/or see if there is a good, fast 64-bit C++ compiler for writing time-critical computations and algorithms. Even if you have difficulty finding info, or have no time to read through disassembled code, I'm quite sure that moving heavy computation outside the managed code would reduce any issues you have and boost performance (somewhat sure you are already doing this, but just to mention it).
A profiler shouldn't significantly influence your timing results. If the profiler overheads really are "significant" then you probably can't squeeze much more speed out of your code, and should be thinking about looking at your hardware bottlenecks (disk, RAM, or CPU?) and upgrading. (Sounds like you are CPU bound, so that's where to start)
In general, .net and JIT frees you from most of the porting problems of 64 bit. As you know, there are effects relating to the register size (memory usage changes, marshalling to native code, needing all parts of the program to be native 64-bit builds) and some performance differences (larger memory map, more registers, wider buses etc), so I can't tell you anything more than you already know on that front. The other issues I've seen are OS rather than C# ones - there are now different registry hives for 64-bit and WOW64 applications, for example, so some registry accesses have to be written carefully.
It's generally a bad idea to worry about what the JIT will do with your code and try to adjust it to work better, because the JIT is likely to change with .net 4 or 5 or 6 and your "optimisations" may turn into inefficiencies, or worse, bugs. Also bear in mind that the JIT compiles the code specifically for the CPU it is running on, so potentially an improvement on your development PC may not be an improvement on a different PC. What you get away with using today's JIT on today's CPU might bite you in a years time when you upgrade something.
Specifically, you cite "properties are not inlined on x64". By the time you have run through your entire codebase turning all your properties into fields, there may well be a new JIT for 64 bit that does inline properties. Indeed, it may well perform better than your "workaround" code. Let Microsoft optimise that for you.
You rightly point out that your memory profile can change. So you might need more RAM, faster disks for virtual memory, and bigger CPU caches. All hardware issues. You may be able to reduce the effect by using 32-bit types (e.g. int rather than long), but that may not make much difference and could potentially harm performance (as your CPU may handle native 64-bit values more efficiently than half-size 32-bit values).
You say "startup times can be longer", but that seems rather irrelevant in an application that you say runs for hours at 100% CPU.
So what are you really worried about? Maybe time your code on a 32-bit PC and then time it doing the same task on a 64-bit PC. Is there half an hour of difference over a 4 hour run? Or is the difference only 3 seconds? Or is the 64 bit PC actually quicker? Maybe you're looking for solutions to problems that don't exist.
So back to the usual, more generic, advice. Profile and time to identify bottlenecks. Look at the algorithms and mathematical processes you are applying, and try to improve/replace them with more efficient ones. Check that your multithreading approach is helping rather than harming your performance (i.e. that waits and locks are avoided). Try to reduce memory allocation/deallocation - e.g. re-use objects rather than replacing them with new ones. Try to reduce the use of frequent function calls and virtual functions. Switch to C++ and get rid of the inherent overheads of garbage collection, bounds checking, etc. that .net imposes. Hmmm. None of that has anything to do with 64 bit, does it?
I'm not that familiar with 64-bit issues, but I do have one comment:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth
