Why does tail call optimization need an op code?

So I've read many times before that technically .NET does support tail call optimization (TCO) because it has the opcode for it, and just C# doesn't generate it.
I'm not exactly sure why TCO needs an opcode or what it would do. As far as I know, the requirement for being able to do TCO is that the results of a recursive call are not combined with any variables in the current function scope. If you don't have that, then I don't see how an opcode prevents you from having to keep a stack frame open. If you do have that, then can't the compiler always easily compile it to something iterative?
So what is the point of an opcode? Obviously there's something I'm missing. In cases where TCO is possible at all, can't it always be handled at the compiler level rather than at the opcode level? What's an example of where it can't?

Following the links you already provided, this is the part that, it seems to me, answers your question most closely:
Source
The CLR and tail calls
When you're dealing with languages managed by the CLR, there are two kinds of compilers in play. There's the compiler that goes from your language's source code down to IL (C# developers know this as csc.exe), and then there's the compiler that goes from IL to native code (the JIT 32/64 bit compilers that are invoked at run time or NGEN time). Both the source->IL and IL->native compilers understand the tail call optimization. But the IL->native compiler--which I'll just refer to as JIT--has the final say on whether the tail call optimization will ultimately be used. The source->IL compiler can help to generate IL that is conducive to making tail calls, including the use of the "tail." IL prefix (more on that later). In this way, the source->IL compiler can structure the IL it generates to persuade the JIT into making a tail call. But the JIT always has the option to do whatever it wants.
When does the JIT make tail calls?
I asked Fei Chen and Grant Richins, neighbors down the hall from me who happen to work on the JIT, under what conditions the various JITs will employ the tail call optimization. The full answer is rather detailed. The quick summary is that the JITs try to use the tail call optimization whenever they can, but there are lots of reasons why the tail call optimization can't be used. Some reasons why tail calling is a non-option:
Caller doesn't return immediately after the call (duh :-))
Stack arguments between caller and callee are incompatible in a way that would require shifting things around in the caller's frame before the callee could execute
Caller and callee return different types
We inline the call instead (inlining is way better than tail calling, and opens the door to many more optimizations)
Security gets in the way
The debugger / profiler turned off JIT optimizations
The most interesting part in the context of your question, and the one that makes it clearest in my opinion, is the security example mentioned above.
Security in .NET in many cases depends on the stack being accurate at runtime. That is why, as highlighted above, the burden is shared by both the source-to-CIL compiler and the (runtime) CIL-to-native JIT compiler, with the final say resting with the latter.

Guess: In a simple language like x86 assembler where you manage the stack "manually", you don't need an opcode - you can just set up the call stack appropriately.
But in something higher-level like .NET CIL, the stack is partially managed for you, and the whole act of invoking a function is a single opcode (e.g. call). So you need a different opcode to implement TCO - one that does "pass control flow to this function, but without creating a new stack frame".
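To make the "where can't the compiler just turn it into a loop?" part concrete, here is a minimal sketch of mutual recursion. The class and method names (Parity, IsEven, IsOdd) are made up for illustration. Each call is in tail position, but it targets a different method, so rewriting one method into a loop on its own does not remove the call; something at the call level ("transfer control without keeping my frame") is needed, and whether frames are actually reused here is up to the JIT.

```csharp
public static class Parity
{
    // Assumes n >= 0. The call to IsOdd is the last thing IsEven does,
    // so it is a tail call, but it is not a self-call.
    public static bool IsEven(int n)
    {
        if (n == 0) return true;
        return IsOdd(n - 1);
    }

    // Tail call back into IsEven. A per-method "recursion to loop" rewrite
    // has nothing to loop over here, because neither method calls itself.
    public static bool IsOdd(int n)
    {
        if (n == 0) return false;
        return IsEven(n - 1);
    }
}
```

A source compiler could merge the two methods into one loop in this tiny case, but that stops working once the callee is virtual, lives in another assembly, or is reached through a delegate.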

Related

How to let the variable be stored in a machine register using C#?

I looked this up on MSDN and found the register keyword, but it exists only in C++.
Syntax:
register int x = 0;
Can you tell me how to do that with C#?

There is no way to do that in C#. C# is compiled to MSIL, which is then compiled to native code by the JIT.
It's the JIT that will decide whether a variable will go into a register or not. You shouldn't worry about this.
As MSIL is meant to be run on different architectures, it wouldn't make much sense to include such a feature in the language. Different architectures have a different number of registers, which may be of different sizes. That's why it's the JIT's job to optimize this.

By using a keyword? No.
With unmanaged code, you certainly can though... I mean, you really don't want to... but you can : )
It is useful in extreme optimizations, where you know for sure that you can do better than the JIT Compiler. However, in those circumstances, you should probably be looking at straight unmanaged C anyway. So, I strongly urge you to do that if you can.
Let's assume you can't, and this absolutely positively must be done from C#
C# is compiled to MSIL, which takes those choices out of your hands. It actually does quite well too, so well in fact that there's rarely a need to optimize by hand. But, with C# being a managed language you have to step into an unmanaged section to do it.
There are several methods, both with and without reflection - and both using inline and external.
Firstly, you might compile that small fast section in C, ASM or some other unmanaged language as a DLL and call it unmanaged from C# in much the same way you'd call WinAPI functions... pay attention to calling conventions, there are several and each places a slightly different burden on caller/callee... for example, in terms of how parameters are passed and who clears up the stack afterwards.
Alternatively, you could use fasmNET or similar to include inline assembly for any routines which must be ultra-fast. fasmNET can compile strings of assembler in C# (at runtime) into a blob of memory which can then be called unmanaged from C#; many examples exist online.
Alternatively, you could externally compile just the instructions you need, provide them as a byte array yourself, and call the byte array as code in the same manner as above, but without a runtime compilation step.
There are also many tricks you can do with inline IL that can help you fine-tune your code without the JIT compiler's involvement; these may or may not be useful to you depending on your project. Custom IL sections can be accomplished both with inline IL and dynamic IL and can give you considerably more control over how your C# application runs.
Depending on how often you need to switch back and forth between managed and unmanaged, you can also create a separate application domain from your code, and load your unmanaged code into that... this can help you separate the managed/unmanaged concerns and thus avoid any costly switching back and forth.
But...
I will not give code, as to how you do it depends greatly upon what you're trying to accomplish. This is not the type of thing where you should just paste a code snippet into your project - you need to research the various methods, learn about their overheads and drawbacks, and then implement them with care, wisdom and due diligence.
Personally, I'd suggest learning C and offloading such computationally important tasks as an external service. This has the added advantage of allowing you to use processor affinity to best effect. It also allows you to write clean, normal, sensible C# for your head end.
But trust me, if your code is too slow and you think using registers for a few variables will speed things up... well... 95% of the time, it absolutely won't. C# does a tonne of work behind the scenes to wrangle those CPU resources as effectively as possible ... if you step in and snatch control of a few registers from it, it will usually end up producing less optimal code overall.
So, if pressed to guess at your best strategy, I'd suggest offloading that small task to a separate C program or service, and then using C# to throw it problems and gather output. Coupled with affinity, this can result in substantial speed gains. If you need to, it is also possible to set up shared memory between managed and unmanaged code, although this requires a lot of forward planning, may require experience using a good commercial debugger, and certainly isn't for the beginner.
Note that whichever way you go, portability WILL be adversely affected.
Re-evaluate whether you really need to do this at all. There are likely many more sensible and productive optimisations that can be done from within C#, in terms of the algorithm itself, which you should explore fully before going anywhere near the hardware.

You can't.
There aren't any real useful registers in IL and there is no guarantee that the target machine will have registers. The JIT or Ahead-of-time compiler will make those decisions for you.

Do I need to worry about inlining in Unity/C#? [duplicate]

For code clarity I sometimes create a function that should very obviously be inlined, be it either a wrapper, or a function that is only called in a single point, or a short function that is supposed to be called frequently and be fast.
In C I would inline it without a second thought, but in Unity/C# there's no way to do that AFAIK (this appears to be only available at .NET 4.5).
Can I trust the compiler to be smart enough to inline sensibly, or should I sometimes sacrifice code clarity for performance because I don't trust it?
Sure it depends case by case, premature optimization is evil, and you should profile instead of guessing. However a general overview of this subject might still be useful as a guideline, to improve upon.

Manually forcing in-lining in C# at compile time doesn't make much sense. When the code is run the just-in-time compiler can decide to in-line the code based on these heuristics:
http://blogs.msdn.com/b/ericgu/archive/2004/01/29/64717.aspx
Methods that are greater than 32 bytes of IL will not be inlined.
Virtual functions are not inlined.
Methods that have complex flow control will not be in-lined. Complex flow control is any flow control other than if/then/else; in this case, switch or while.
Methods that contain exception-handling blocks are not inlined, though methods that throw exceptions are still candidates for inlining.
If any of the method's formal arguments are structs, the method will not be inlined.
If you're absolutely sure that the method has to be in-lined, you can use the heuristics above to make the method more appealing to in-line.
MethodImplOptions.AggressiveInlining is mostly useful for inlining across assembly boundaries, something I do not believe the just-in-time compiler can do (but I'd have to check that).
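For reference, this is roughly what the attribute looks like in use. It is a hint, not a guarantee: the JIT can still refuse to inline. The class and method names (MathHelpers, Clamp, ClampNoInline) are hypothetical; the attribute and enum values come from System.Runtime.CompilerServices, and AggressiveInlining requires .NET 4.5 or later.

```csharp
using System.Runtime.CompilerServices;

public static class MathHelpers
{
    // Asks the JIT to inline this small helper even if its own
    // heuristics would have said no. The JIT may still decline.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Clamp(int value, int min, int max)
    {
        if (value < min) return min;
        if (value > max) return max;
        return value;
    }

    // The opposite request: never inline. Handy when profiling,
    // to compare the two versions of the same body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int ClampNoInline(int value, int min, int max)
    {
        if (value < min) return min;
        if (value > max) return max;
        return value;
    }
}
```

Both methods return the same values; only the generated native code can differ, so the only way to see the effect is to profile or inspect the JIT output.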

Why aren't Automatic Properties inlined by default?

Since properties are just methods under the hood, inlining whatever logic they perform may or may not improve performance, so it's understandable that the JIT needs to check whether they are worth inlining.
Automatic properties however (as far as I understand) cannot have any logic, and simply return or set the value of the underlying field. As far as I know, automatic properties are treated by the Compiler and the JIT just like any other methods.
(Everything below will rely on the assumption that the above paragraph is correct.)
Value Type properties show different behavior than the variable itself, but Reference Type properties supposedly should have the exact same behavior as direct access to the underlying variable.
// Automatic Properties Example
public Object MyObj { get; private set; }
Is there any case where automatic properties to Reference Types could show a performance hit by being inlined?
If not, what prevents either the Compiler or the JIT from automatically inlining them?
Note: I understand that the performance gain would probably be insignificant, especially when the JIT is likely to inline them anyway if used enough times - but small as the gain may be, it seems logical that such a seemingly simple optimization would be introduced regardless.

EDIT: The JIT compiler doesn't work in the way you think it does, which I guess is why you're probably not completely understanding what I was trying to convey above. I've quoted your comment below:
That is a different matter, but as far as I understand methods are only checked for being inline-worthy if they are called enough times. Not to mention that the checking itself is a performance hit. (Let the size of the performance hit be irrelevant for now.)
First, most, if not all, methods are checked to see if they can be inlined. Second, keep in mind that methods are only ever JITed once and it is during that one time that the JITer will determine if any methods called inside of it will be inlined. This can happen before any code is executed at all by your program. What makes a called method a good candidate for inlining?
The x86 JIT compiler (x64 and ia64 don't necessarily use the same optimization techniques) checks a few things to determine if a method is a good candidate for inlining, definitely not just the number of times it is called. The article lists things like if inlining will make the code smaller, if the call site will be executed a lot of times (ie in a loop), and others. Each method is optimized on its own, so the method may be inlined in one calling method but not in another, as in the example of a loop. These optimization heuristics are only available to JIT, the C# compiler just doesn't know: it's producing IL, not native code. There's a huge difference between them; native vs IL code size can be quite different.
To summarize, the C# compiler doesn't inline properties for performance reasons.
The jit compiler inlines most simple properties, including automatic properties. You can read more about how the JIT decides to inline method calls at this interesting blog post.
Well, the C# compiler doesn't inline any methods at all. I assume this is the case because of the way the CLR is designed. Each assembly is designed to be portable from machine to machine. A lot of times, you can change the internal behavior of a .NET assembly without having to recompile all the code, it can just be a drop in replacement (at least when types haven't changed). If the code were inlined, it breaks that (great, imo) design and you lose that luster.
Let's talk about inlining in C++ first. (Full disclosure, I haven't used C++ full time in a while, so I may be vague, my explanations rusty, or completely incorrect! I'm counting on my fellow SOers to correct and scold me)
The C++ inline keyword is like telling the compiler, "Hey man, I'd like you to inline this function, because I think it will improve performance". Unfortunately, it is only telling the compiler you'd prefer it inlined; it is not telling it that it must.
Perhaps at an earlier date, when compilers were less optimized than they are now, the compiler would more often than not compile that function inlined. However, as time went on and compilers grew smarter, the compiler writers discovered that in most cases, they were better at determining when a function should be inlined than the developer was. For those few cases where it wasn't, developers could use the seriouslybro_inlineme keyword (officially called __forceinline in VC++).
Now, why would the compiler writers do this? Well, inlining a function doesn't always mean increased performance. While it certainly can, it can also devastate your program's performance if used incorrectly. For example, we all know one side effect of inlining code is increased code size, or "fat code syndrome" (disclaimer: not a real term). Why is "fat code syndrome" a problem? If you take a look at the article I linked above, it explains, among other things, that memory is slow, and the bigger your code, the less likely it will fit in the fastest CPU cache (L1). Eventually it can only fit in memory, and then inlining has done nothing. However, compilers know when these situations can happen, and do their best to prevent it.
Putting that together with your question, let's look at it this way: the C# compiler is like a developer writing code for the JIT compiler: the JIT is just smarter (but not a genius). It often knows when inlining will benefit or harm execution speed. "Senior developer" C# compiler doesn't have any idea how inlining a method call could benefit the runtime execution of your code, so it doesn't. I guess that actually means the C# compiler is smart, because it leaves the job of optimization to those who are better than it, in this case, the JIT compiler.

"Automatic properties however (as far as I understand) cannot have any logic, and simply return or set the value of the underlying field. As far as I know, automatic properties are treated by the Compiler and the JIT just like any other methods."
That automatic properties cannot have any logic is an implementation detail, there is not any special knowledge of that fact that is required for compilation. In fact, as you say auto properties are compiled down to method calls.
Suppose auto properties were inlined by the C# compiler and the class and property are defined in a different assembly. This would mean that if the property implementation changes, you would have to recompile the application to see that change. That defeats the purpose of using properties in the first place, which is to let you change the internal implementation without having to recompile the consuming application.

Automatic properties are just that: property get/set methods generated automatically. As a result, there is nothing special in IL for them. The C# compiler by itself performs very few optimizations.
As for reasons not to inline: imagine your type is in a separate assembly, so you are free to change the source of that assembly to give the property an insanely complicated get/set. As a result, the compiler can't reason about the complexity of the get/set code when it first sees your automatic property while building a new assembly that depends on your type.
As you've already noted in your question ("especially when the JIT is likely to inline them anyway"), these property methods will likely be inlined at JIT time.
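A small sketch of the point made in these answers: an automatic property and a hand-written property over a field both end up as ordinary get/set methods. The class names and the field name _myObj are made up for illustration; the real compiler-generated backing field has a name you cannot write in C#.

```csharp
public class WithAutoProperty
{
    // The compiler generates a hidden backing field plus
    // get_MyObj / set_MyObj methods for this declaration.
    public object MyObj { get; private set; }

    public WithAutoProperty(object value) { MyObj = value; }
}

public class WithExplicitProperty
{
    // Hypothetical field name, standing in for the hidden one above.
    private object _myObj;

    // Same shape as the automatic property, written out by hand.
    // The body could later grow arbitrary logic without changing callers.
    public object MyObj
    {
        get { return _myObj; }
        private set { _myObj = value; }
    }

    public WithExplicitProperty(object value) { MyObj = value; }
}
```

A caller in another assembly compiles to a call to get_MyObj in both cases, which is why replacing the automatic property with a more complicated one does not force the caller to be recompiled.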

Efficiency comparison of recursion and non recursive function in Java

As I understand, recursive functions are generally less efficient than equivalent non-recursive functions because of the overhead of function calls. However, I have recently encountered a textbook saying this is not necessarily true with Java (and C#).
It does not say why, but I assume this might be because the Java compiler optimizes recursive functions in some way.
Does anyone know the details of why this is so?

The textbook is probably referring to tail-call optimization; see @Travis's answer for details.
However, the textbook is incorrect in the context of Java. Current Java compilers do not implement tail-call optimization, apparently because it would interfere with the Java security implementation, and would alter the behaviour of applications that introspect on the call stack for various purposes.
References:
Does the JVM prevent tail call optimizations?
This Sun bug requesting tail-call support ... still open.
This page (and the referenced paper) suggest that perhaps it wouldn't be that hard after all ...
There are hints that tail-call optimization might make it into Java 8.

This is usually only true for tail-recursion (http://en.wikipedia.org/wiki/Tail_call).
Tail-recursion is semantically equivalent to an incremented loop, and can therefore be optimized to a loop. Below is a quote from the article that I linked to (emphasis mine):
Tail calls are significant because they can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate. The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail call elimination, or tail call optimization.
In functional programming languages, tail call elimination is often guaranteed by the language standard, and this guarantee allows using recursion, in particular tail recursion, in place of loops.
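As a rough illustration of that equivalence, here is the same computation written both ways. It is in C# to match the rest of the page, but the Java version would look almost identical. The names (Sum, SumToRecursive, SumToLoop) are made up, and the loop is what a tail-call-eliminating compiler could produce, not a claim about what any particular compiler does.

```csharp
public static class Sum
{
    // Tail-recursive form: the recursive call is the very last action,
    // and its result is returned unchanged. Assumes n >= 0.
    public static long SumToRecursive(int n, long acc = 0)
    {
        if (n == 0) return acc;
        return SumToRecursive(n - 1, acc + n);
    }

    // Loop form: "call with new arguments" becomes "overwrite the
    // variables and jump back to the top", so no new frame is needed.
    public static long SumToLoop(int n)
    {
        long acc = 0;
        while (n != 0)
        {
            acc += n;
            n -= 1;
        }
        return acc;
    }
}
```

Without tail call elimination the recursive form uses one stack frame per step, so only the loop form is safe for very large n.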

Some reasons why recursive implementations can be as efficient as iterative ones under certain circumstances:
Compilers can be clever enough to optimise out the function call for certain functions, e.g. by converting a tail-recursive function into a loop. I strongly suspect some of the modern JIT compilers for Java do this.
Modern processors do branch prediction and speculative execution, which can mean that the cost of a function call is minimal, or at least not much more than the cost of an iterative loop
In situations where you need a small amount of local storage on each level of recursion, it is often more efficient to put this on the stack via a recursive function call than to allocate it in some other way (e.g. via a queue in heap memory).
My general advice however is don't bother worrying about this - the difference is so small that it is very unlikely to make a difference in your overall performance.

Guy Steele, one of the fathers of Java, wrote a paper in 1977
Debunking the "Expensive Procedure Call" Myth
or, Procedure Call Implementations Considered Harmful
or, LAMBDA: The Ultimate GOTO
Abstract:
Folklore states that GOTO statements are "cheap", while procedure calls are "expensive". This myth is largely a result of poorly designed language implementations.
That's funny, because even today, Java has no tail call optimization:)

To the best of my knowledge, Java does not do any sort of recursion optimization. Knowing this is important - not because of efficiency, but because recursion at an excessive depth (a few thousand should do it) will cause a stack overflow and crash your program. (Really, considering the name of this site, I'm surprised nobody brought this up before me).

I don't think so. In my experience solving programming problems on sites like UVA or SPOJ, I had to remove the recursion in order to solve the problem within the time limit.
One way to think about it: every time a recursive call occurs, the JVM must allocate resources for the function being called, whereas in a non-recursive function most of the memory is already allocated.

Why doesn't .NET/C# optimize for tail-call recursion?

I found this question about which languages optimize tail recursion. Why doesn't C# optimize tail recursion whenever possible?
For a concrete case, why isn't this method optimized into a loop (Visual Studio 2008 32-bit, if that matters)?:
private static void Foo(int i)
{
    if (i == 1000000)
        return;
    if (i % 100 == 0)
        Console.WriteLine(i);
    Foo(i+1);
}

JIT compilation is a tricky balancing act between not spending too much time doing the compilation phase (thus slowing down short lived applications considerably) vs. not doing enough analysis to keep the application competitive in the long term with a standard ahead-of-time compilation.
Interestingly the NGen compilation steps are not targeted to being more aggressive in their optimizations. I suspect this is because they simply don't want to have bugs where the behaviour is dependent on whether the JIT or NGen was responsible for the machine code.
The CLR itself does support tail call optimization, but the language-specific compiler must know how to generate the relevant opcode and the JIT must be willing to respect it.
F#'s fsc will generate the relevant opcodes (though for a simple recursion it may just convert the whole thing into a while loop directly). C#'s csc does not.
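For comparison, this is roughly what the "convert the whole thing into a while loop" rewrite looks like when done by hand on the Foo method from the question. The class name FooLoop is made up, and this is a manual sketch of the transformation, not output from fsc or csc.

```csharp
using System;

public static class FooLoop
{
    // Same observable behaviour as the recursive Foo, but the tail call
    // Foo(i + 1) is replaced by updating i and looping.
    public static void Foo(int i)
    {
        while (true)
        {
            if (i == 1000000)
                return;
            if (i % 100 == 0)
                Console.WriteLine(i);
            i = i + 1;   // stands in for the recursive call Foo(i + 1)
        }
    }
}
```

Calling FooLoop.Foo(0) never grows the stack, whereas the recursive original depends on the JIT eliminating the call to avoid a million nested frames.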
See this blog post for some details (quite possibly now out of date given recent JIT changes). Note that with the CLR changes for 4.0, the x86, x64 and ia64 JITs will respect it.

This Microsoft Connect feedback submission should answer your question. It contains an official response from Microsoft, so I'd recommend going by that.
Thanks for the suggestion. We've considered emitting tail call instructions at a number of points in the development of the C# compiler. However, there are some subtle issues which have pushed us to avoid this so far:
1) There is actually a non-trivial overhead cost to using the .tail instruction in the CLR (it is not just a jump instruction as tail calls ultimately become in many less strict environments such as functional language runtime environments where tail calls are heavily optimized).
2) There are few real C# methods where it would be legal to emit tail calls (other languages encourage coding patterns which have more tail recursion, and many that rely heavily on tail call optimization actually do global re-writing (such as Continuation Passing transformations) to increase the amount of tail recursion).
3) Partly because of 2), cases where C# methods stack overflow due to deep recursion that should have succeeded are fairly rare.
All that said, we continue to look at this, and we may in a future release of the compiler find some patterns where it makes sense to emit .tail instructions.
By the way, as it has been pointed out, it is worth noting that tail recursion is optimised on x64.
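Since the C# compiler will not emit the prefix for you, one way to see what "emitting a .tail instruction" means is to generate the IL yourself. The following is a minimal sketch using System.Reflection.Emit; the names TailPrefixDemo, BuildCountDown and CountDown are made up for illustration. It builds a method equivalent to "CountDown(n) = n == 0 ? 0 : CountDown(n - 1)" with the recursive call marked by the tail. prefix. How the JIT honours the prefix (plain jump, helper, or not at all in some configurations) is outside the control of this code.

```csharp
using System;
using System.Reflection.Emit;

public static class TailPrefixDemo
{
    public static Func<int, int> BuildCountDown()
    {
        var method = new DynamicMethod("CountDown", typeof(int), new[] { typeof(int) });
        ILGenerator il = method.GetILGenerator();
        Label recurse = il.DefineLabel();

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Brtrue_S, recurse);  // n != 0: go to the recursive case
        il.Emit(OpCodes.Ldc_I4_0);           // n == 0: return 0
        il.Emit(OpCodes.Ret);

        il.MarkLabel(recurse);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Sub);                // compute n - 1
        il.Emit(OpCodes.Tailcall);           // the "tail." prefix
        il.Emit(OpCodes.Call, method);       // call CountDown(n - 1)
        il.Emit(OpCodes.Ret);                // a tail call must be followed directly by ret

        return (Func<int, int>)method.CreateDelegate(typeof(Func<int, int>));
    }
}
```

Usage is simply "var countDown = TailPrefixDemo.BuildCountDown(); countDown(1000);". The same three-instruction pattern (tail., call, ret) is what a compiler such as fsc writes into an assembly when it decides a call qualifies.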

C# does not optimize for tail-call recursion because that's what F# is for!
For some depth on the conditions that prevent the C# compiler from performing tail-call optimizations, see this article: JIT CLR tail-call conditions.
Interoperability between C# and F#
C# and F# interoperate very well, and because the .NET Common Language Runtime (CLR) is designed with this interoperability in mind, each language is designed with optimizations that are specific to its intent and purposes. For an example that shows how easy it is to call F# code from C# code, see Calling F# code from C# code; for an example of calling C# functions from F# code, see Calling C# functions from F#.
For delegate interoperability, see this article: Delegate interoperability between F#, C# and Visual Basic.
Theoretical and practical differences between C# and F#
Here is an article that covers some of the differences and explains the design differences of tail-call recursion between C# and F#: Generating Tail-Call Opcode in C# and F#.
Here is an article with some examples in C#, F#, and C++\CLI: Adventures in Tail Recursion in C#, F#, and C++\CLI
The main theoretical difference is that C# is designed with loops whereas F# is designed upon principles of Lambda calculus. For a very good book on the principles of Lambda calculus, see this free book: Structure and Interpretation of Computer Programs, by Abelson, Sussman, and Sussman.
For a very good introductory article on tail calls in F#, see this article: Detailed Introduction to Tail Calls in F#. Finally, here is an article that covers the difference between non-tail recursion and tail-call recursion (in F#): Tail-recursion vs. non-tail recursion in F sharp.

I was recently told that the C# compiler for 64 bit does optimize tail recursion.
C# also implements this. The reason why it is not always applied, is that the rules used to apply tail recursion are very strict.

You can use the trampoline technique for tail-recursive functions in C# (or Java). However, the better solution (if you just care about stack utilization) is to use this small helper method to wrap parts of the same recursive function and make it iterative while keeping the function readable.
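A minimal sketch of the trampoline idea, for readers who have not seen it. This is not the helper method referred to above; the types Step<T> and Trampoline and the example SumTo are made up for illustration. Instead of calling itself, the function returns a description of the next step, and a plain loop runs the steps one at a time, so the stack never gets deeper than one call.

```csharp
using System;

// One step of a trampolined computation: either a finished result
// or a thunk that produces the next step when invoked.
public sealed class Step<T>
{
    public readonly bool IsDone;
    public readonly T Result;
    public readonly Func<Step<T>> Next;

    private Step(bool isDone, T result, Func<Step<T>> next)
    {
        IsDone = isDone;
        Result = result;
        Next = next;
    }

    public static Step<T> Done(T result) { return new Step<T>(true, result, null); }
    public static Step<T> More(Func<Step<T>> next) { return new Step<T>(false, default(T), next); }
}

public static class Trampoline
{
    // The "bounce" loop: keeps invoking thunks until a result appears.
    public static T Run<T>(Step<T> step)
    {
        while (!step.IsDone)
            step = step.Next();
        return step.Result;
    }

    // Tail-recursive in shape, but the recursive call is wrapped in a
    // lambda and returned instead of being executed immediately.
    public static Step<long> SumTo(int n, long acc)
    {
        if (n == 0) return Step<long>.Done(acc);
        return Step<long>.More(() => SumTo(n - 1, acc + n));
    }
}
```

Usage: "long total = Trampoline.Run(Trampoline.SumTo(1000000, 0));". The trade-off is one small heap allocation per step in exchange for constant stack depth.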

I had a happy surprise today :-)
I am reviewing my teaching material for my upcoming course on recursion with C#.
And it seems that finally tail call optimization has made its way into C#.
I am using .NET 5 with LINQPad 6 (optimization activated).
Here is the Tail call optimizable Factorial function I used:
long Factorial(int n, long acc = 1)
{
    if (n <= 1)
        return acc;
    return Factorial(n - 1, n * acc);
}
And here is the X64 assembly code generated for this function:
See, there is no call, only a jmp. The function is aggressively optimized as well (no stack frame setup/teardown). Oh Yes!

As other answers mentioned, the CLR does support tail call optimization, and it seems to have been improved progressively over time. Supporting it in C#, however, is still an open proposal in the git repository for the design of the C# programming language: Support tail recursion #2544.
You can find some useful details and info there. For example, @jaykrell mentioned:
Let me give what I know.
Sometimes tailcall is a performance win-win. It can save CPU. jmp is cheaper than call/ret. It can save stack. Touching less stack makes for better locality.
Sometimes tailcall is a performance loss, stack win. The CLR has a complex mechanism in which to pass more parameters to the callee than the caller received. I mean specifically more stack space for parameters. This is slow. But it conserves stack. It will only do this with the tail. prefix.
If the caller parameters are stack-larger than callee parameters, it is usually a pretty easy win-win transform. There might be factors like parameter-position changing from managed to integer/float, and generating precise StackMaps and such.
Now, there is another angle, that of algorithms that demand tailcall elimination, for purposes of being able to process arbitrarily large data with fixed/small stack. This is not about performance, but about ability to run at all.
Also, as extra information: when generating a compiled lambda using the expression classes in the System.Linq.Expressions namespace, there is an argument named 'tailCall' which, as its documentation comment explains, is
A bool that indicates if tail call optimization will be applied when compiling the created expression.
I have not tried it yet, and I am not sure how much it helps with your question, but someone may want to try it; it could be useful in some scenarios:
var myFuncExpression = System.Linq.Expressions.Expression.Lambda<Func< … >>(body: … , tailCall: true, parameters: … );
var myFunc = myFuncExpression.Compile();
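To fill in the elided parts with something that compiles, here is a small self-contained sketch. The class and method names (TailCallLambdaDemo, BuildIncrement) and the trivial body x + 1 are made up for illustration; only the Expression.Lambda overload with the tailCall argument is the point. The body here contains no call at all, so this shows the API shape, not a measurable effect of the flag.

```csharp
using System;
using System.Linq.Expressions;

public static class TailCallLambdaDemo
{
    // Builds the expression tree for: x => x + 1, requesting tail call
    // optimization for the compiled delegate via tailCall: true.
    public static Expression<Func<int, int>> BuildIncrement()
    {
        ParameterExpression x = Expression.Parameter(typeof(int), "x");
        Expression body = Expression.Add(x, Expression.Constant(1));
        return Expression.Lambda<Func<int, int>>(body, tailCall: true, parameters: new[] { x });
    }
}
```

Usage: "Func<int, int> increment = TailCallLambdaDemo.BuildIncrement().Compile();". The flag is stored on the lambda and can be read back through its TailCall property.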
