I am creating an interpreted programming language in C# (kind of for the lulz, no real purpose other than to have fun and learn about compilers), and ran into a problem. In my lexer, I remember where the token was in the original file to give more useful debug errors. I keep this "TokenPosition" object around, copying it along as the program goes through compile steps, until it winds up in the same object that runs interpreted code (for example, my "Identifier" class for named variables has a TokenPosition member).
My question: If an exception gets thrown, I want to look at the stack, find the topmost object with a TokenPosition member, and print its location. Or, more generally, "How do I get objects that are/were on the stack after an exception? Is this even possible?" (I can do the checking if it has a TokenPosition object / getting it easily, I'm not asking how to do that)
Last resorts that I do not want to have to do: having every single call to a behavior (which happens A LOT) assign a static tokenPosition variable somewhere from this.tokenPosition. I could also surround EVERYTHING with try/catches, but again, I don't really want to do this.
Parameters to methods are ephemeral. They may be overwritten by local variables when they are no longer live, or optimized out by the JIT compiler as unused, or even garbage collected while the method is running. You will have to track this information yourself, for example, by having a separate stack data structure for "currently active object" that is automatically unwound by a using clause.
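A minimal sketch of that suggestion, assuming your TokenPosition is a reference type; the PositionScope name, the thread-static stack, and EvaluateChildren are all hypothetical:

using System;
using System.Collections.Generic;

public sealed class PositionScope : IDisposable
{
    // One stack per thread so concurrent interpreter threads don't interleave.
    [ThreadStatic] private static Stack<TokenPosition> positions;

    public static TokenPosition Current =>
        (positions != null && positions.Count > 0) ? positions.Peek() : null;

    public PositionScope(TokenPosition position)
    {
        (positions ?? (positions = new Stack<TokenPosition>())).Push(position);
    }

    // Dispose runs even while an exception unwinds through the using block,
    // so Current always names the innermost token still being evaluated.
    public void Dispose() => positions.Pop();
}

Each behavior then wraps its body in a single using statement:

using (new PositionScope(this.tokenPosition))
{
    EvaluateChildren(); // one top-level catch can now report PositionScope.Current
}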
Just curious about this. Following are two code snippets for the same function:
void MyFunc1()
{
    int i = 10;
    object obj = null;
    if (something) return;
}
And the other one is...
void MyFunc1()
{
    if (something) return;
    int i = 10;
    object obj = null;
}
Now, does the second one have the benefit of NOT allocating the variables when something is true? Or are the local stack variables (in the current scope) always allocated as soon as the function is called, so that moving the return statement to the top has no effect?
An article on dotnetperls.com says: "When you call a method in your C# program, the runtime allocates a separate memory region to store all the local variable slots. This memory is allocated on the stack even if you do not access the variables in the function call."
UPDATED
Here is a comparison of the IL code for these two functions. Func2 refers to the second snippet. It seems that the variables in both cases are allocated at the beginning, though in the case of Func2() they are initialized later on. So no benefit as such, I guess.
Peter Duniho's answer is correct. I want to draw attention to the more fundamental problem in your question:
does the second one have the benefit of NOT allocating the variables when something is true?
Why ought that to be a benefit? Your presumption is that allocating the space for a local variable has a cost, that not doing so has a benefit and that this benefit is somehow worth obtaining. Analyzing the actual cost of local variables is very, very difficult; the presumption that there is a clear benefit in avoiding an allocation conditionally is not warranted.
To address your specific question:
Are the local stack variables (in the current scope) always allocated as soon as the function is called, so that moving the return statement to the top has no effect?
I can't answer such a complicated question easily. Let's break it down into much simpler questions:
Variables are storage locations. What are the lifetimes of the storage locations associated with local variables?
Storage locations for "ordinary" local variables -- and formal parameters of lambdas, methods, and so on -- have short, predictable lifetimes. None of them live before the method is entered, and none of them live after the method terminates, either normally or exceptionally. The C# language specification clearly calls out that local variable lifetimes are permitted to be shorter at runtime than you might think if doing so does not cause an observable change to a single-threaded program.
Storage locations for "unusual" local variables -- outer variables of lambdas, local variables in iterator blocks, local variables in async methods, and so on -- have lifetimes which are difficult to analyze at compile time or at run time, and are therefore moved to the garbage-collected heap, which uses GC policy to determine the lifetimes of the variables. There is no requirement that such variables ever be cleaned up; their storage lifetime can be extended arbitrarily at the whim of the C# compiler or the runtime.
Can a local which is unused be optimized away entirely?
Yes. If the C# compiler or the runtime can determine that removing the local from the program entirely has no observable effect in a single-threaded program, then it may do so at its whim. Essentially this is reducing its lifetime to zero.
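For instance (my illustration): in an optimized Release build, neither a slot nor a store needs to be emitted for a dead local like this one:

static void M()
{
    int unused = 42; // no observable effect in a single-threaded program,
                     // so the compiler and jitter are free to remove it entirely
}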
How are storage locations for "ordinary" locals allocated?
This is an implementation detail, but typically there are two techniques. Either space is reserved on the stack, or the local is enregistered.
How does the runtime determine whether a local is enregistered or put on the stack?
This is an implementation detail of the jitter's optimizer. There are many factors, such as:
whether the address of the local could possibly be taken; registers have no address
whether the local is passed as a parameter to another method
whether the local is a parameter of the current method
what the calling conventions are of all the methods involved
the size of the local
and many, many more factors
Suppose we consider only the ordinary locals which are put on the stack. Is it the case that storage locations for all such locals are allocated when a method is entered?
Again, this is an implementation detail, but typically the answer is yes.
So a "stack local" that is used conditionally would not be allocated off the stack conditionally? Rather, its stack location would always be allocated.
Typically, yes.
What are the performance tradeoffs inherent in that decision?
Suppose we have two locals, A and B, and one is used conditionally and the other is used unconditionally. Which is faster:
Add two units to the current stack pointer
Initialize the two new stack slots to zero
or
Add one unit to the current stack pointer
Initialize the new stack slot to zero
If the condition is met, add one unit to the current stack pointer and initialize the new stack slot to zero
Keep in mind that "add one" and "add two" have the same cost.
This scheme is no cheaper when the variable B goes unused, and it costs twice as many stack pointer moves when B is used. That's not a win.
But what about space? The conditional scheme uses either one or two units of stack space but the unconditional scheme uses two regardless.
Correct. Stack space is cheap. Or, more accurately, the million bytes of stack space you get per thread is insanely expensive, and that expense is paid up front, when you allocate the thread. Most programs never use anywhere close to a million bytes of stack space; trying to optimize use of that space is like spending an hour deciding whether to pay $5.01 for a latte vs $5.02 when you have a million dollars in the bank; it's not worth it.
Suppose 100% of the stack-based locals are allocated conditionally. Could the jitter put the addition to the stack pointer after the conditional code?
In theory, yes. Whether the jitter actually makes this optimization -- an optimization which saves literally less than a billionth of a second -- I don't know. Keep in mind that any code the jitter runs to make the decision to save that billionth of a second is code that takes far more than a billionth of a second. Again, it makes no sense to spend hours worrying about pennies; time is money.
And of course, how realistic is it that the path saving that billionth of a second will be the common path? Most method calls do something rather than returning immediately.
Also, keep in mind that the stack pointer is going to have to move for all the temporary value slots that aren't enregistered, regardless of whether those slots have names or not. How many scenarios are there where the condition that determines whether or not the method returns itself has no subexpression which touches the stack? Because that's the condition you're actually proposing that gets optimized. This seems like a vanishingly small set of scenarios, in which you get a vanishingly small benefit. If I were writing an optimizer I would spend exactly zero percent of my valuable time on solving this problem, when there are far juicier low-hanging fruit scenarios that I could be optimizing for.
Suppose there are two locals that are each allocated conditionally under different conditions. Are there additional costs imposed by a conditional allocation scheme other than possibly doing two stack pointer moves instead of one or zero?
Yes. In the straightforward scheme where you move the stack pointer two slots and say "stack pointer is A, stack pointer + 1 is B", you now have a consistent-throughout-the-method way to characterize the variables A and B. If you conditionally move the stack pointer then sometimes the stack pointer is A, sometimes it is B, and sometimes it is neither. That greatly complicates all the code that uses A and B.
What if the locals are enregistered?
Then this becomes a problem in register scheduling; I refer you to the extensive literature on this subject. I am far from an expert in it.
The only way to know for sure when this happens for your program, when you run it, is to look at the code the JIT compiler emits when you run your program. None of us can even answer the specific question with authority (well, I guess someone who wrote the CLR could, provided they knew which version of the CLR you're using, and possibly some other details about configuration and your actual program code).
Any allocation on the stack of a local variable is strictly an "implementation detail". And the CLI specification doesn't promise us any specific implementation.
Some locals never wind up on the stack per se, normally due to being stored in a register, but it would be legal for the runtime to use heap space instead, as long as it preserves the normal lifetime semantics of a local variable.
See also Eric Lippert's excellent series The Stack Is An Implementation Detail
I need to automatically find all code that does not dispose properly.
Is it possible to check via reflection that my type N is used inside a using statement (i.e., that Dispose is called)?
No. The closest you could come is to add a finalizer - possibly conditionally so that it's only included for debug builds - which checks whether or not you've been disposed and logs the problem otherwise. (You'd probably want to keep the stack trace on construction in this case, in order to blame the right code.)
Bear in mind that adding finalizers will cause garbage to stick around for longer - although in your Dispose call you could suppress finalization, so correct code wouldn't have a significant penalty, other than generating the stack trace on construction...
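A rough sketch of that debug-only finalizer pattern (the type and member names are mine; the stack trace capture is the construction-time blame described above):

using System;
using System.Diagnostics;

public class TrackedResource : IDisposable
{
#if DEBUG
    // Captured at construction so the leak report can blame the right code.
    private readonly string allocationStackTrace = Environment.StackTrace;
#endif
    private bool disposed;

    public void Dispose()
    {
        disposed = true;
        GC.SuppressFinalize(this); // correctly disposed instances skip finalization
    }

#if DEBUG
    ~TrackedResource()
    {
        if (!disposed)
            Debug.WriteLine("Undisposed instance, allocated at: " + allocationStackTrace);
    }
#endif
}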
Now that's all assuming you want to do things at execution time. There are various static analysis tools (such as the code analysis built into Visual Studio) which will tell you at build time if it looks like you haven't disposed of everything appropriately.
I was thinking about this just today whilst I was writing some IDisposable code.
It's good practice for the developer to either call Dispose() directly or, if the lifetime of the object allows, to use the using construct.
The only instances we need to worry about, are those where we can't use using due to the mechanics of our code. But we should, at some point, be calling Dispose() on these objects.
Given that the C# compiler knows an object implements IDisposable, it could theoretically also know that Dispose() was never called on it (it's a pretty clever compiler as it is!). It may not know the semantics of when the programmer should do it, but it could serve as a good reminder whenever Dispose() is never called on an object that implements IDisposable: neither through a using construct nor directly.
Any reason for this, or are there thoughts to go down that route?
it could theoretically also know that Dispose() was never called on it
It could determine, in certain simple cases, that Dispose will never be called. But it is not possible to determine, solely from a static analysis of the code, that all created instances will be disposed of. Code does not need to be very complex at all before even estimating whether objects are left undisposed becomes intractable.
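To illustrate (my example, not from the original answer): once an instance escapes the method that created it, no local analysis can decide who, if anyone, must dispose it:

using System;
using System.Collections.Generic;
using System.IO;

class OwnershipDemo
{
    static readonly List<IDisposable> cache = new List<IDisposable>();

    static void Consume(bool keep)
    {
        Stream stream = new MemoryStream();
        if (keep)
            cache.Add(stream);   // ownership transferred: disposing here would be wrong
        else
            stream.Dispose();    // only this branch still owns the stream
        // Whether 'stream' is left undisposed now depends on what the rest
        // of the program eventually does with 'cache'.
    }
}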
To make matters worse, not all IDisposable instances should be disposed. There can be a number of reasons for this. Sometimes a type implements IDisposable even though only a portion of its instances actually do anything in the implementation. (IEnumerator<T> is a good example of this: a large number of implementations do nothing when disposed, but some do. If you know that the specific implementation you have won't ever do anything on disposal, you can skip it; if you don't know that, you need to ensure you call Dispose.)
Then there are types such as Task that almost never actually need to be disposed. (See Do I need to dispose of Tasks?.) In the vast majority of cases you don't need to dispose of them, and needlessly cluttering your code with using blocks or dispose calls that do nothing hampers readability.
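For example (my illustration): the idiomatic use of a Task involves no using block and no Dispose call at all:

using System;
using System.Threading.Tasks;

class TaskDemo
{
    static async Task Main()
    {
        // The task is awaited and then simply dropped; wrapping it in a
        // using block here would add clutter without any benefit.
        int answer = await Task.Run(() => 42);
        Console.WriteLine(answer);
    }
}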
The major rule regarding IDisposable is "would the last one to leave the room, please turn off the lights". One major failing in the design of most .NET languages is that there is no general syntactic (or even attribute-tagging) convention to indicate whether the code that holds a particular variable, or the class that holds a particular field, will:
Always be the last one to leave the room
Never be the last one to leave the room
Sometimes be the last one to leave the room, and easily know at runtime whether it will be (e.g. because whoever gave it a reference told it).
Possibly be the last one to leave the room, but not know before it leaves the room whether it will be the last one out.
If languages had a syntax to distinguish among those cases, then it would be simple for a compiler to ensure that things which know they're going to be the last one to leave the room turn out the lights, and things which are never going to be the last one to leave the room don't turn out the lights. A compiler or framework could facilitate the third and fourth scenarios if the framework included wrapper types that the compiler knew about. Conventional reference-counting is generally not good as a primary mechanism to determine when objects are no longer needed, since it requires processor interlocks every time a reference is copied or destroyed, even if the holder of the copy knows it won't be "the last one to leave the room"; but a variation on reference-counting is often the cheapest and most practical way to handle scenario #4 [copying a reference should only increment the counter if the holders of both the original and the copy are going to think that they might be the last owner, and destroying a copy of a reference should only decrement the counter if the counter had been incremented when that copy was created].
In the absence of a convention to indicate whether a particular reference should be considered "the last one in the room", there's no good way for a compiler to know whether the holder of that reference should "turn out the lights" (i.e. call Dispose). Both VB.NET and C# have a special using syntax for one particular situation where the holder of a variable knows it will be the last one to leave the room, but beyond that the compilers can't really demand that things be cleaned up if they don't understand them. C++/CLI does have a more general-purpose syntax, but unfortunately it has many restrictions on its use.
The code analysis rules will detect this. Depending on your version of VS, you can use either FxCop or the built-in analysis rules.
It requires static analysis of the code after it has been compiled.
If an object is initialized to null, it is not possible to get the type information because the reference doesn't point to anything.
However, when I debug and I hover over a variable, it shows the type information. Only the static methods are shown, but still, it seems to know the type. Even in release builds.
Does the debugger use other information than just reflection of some sort to find out the datatype? How come it knows more than I? And if it knows this, why isn't it capable of showing the datatype in a NullReferenceException?
It seems like you're confusing the type of the reference with the type of the value that it points to. The type of the reference is embedded in the DLL metadata and is readily accessible by the debugger. There is also additional information stored in the associated PDB that the debugger leverages to provide a better experience. Hence, even for null references, a debugger can determine information like type and name.
As for NullReferenceException: could it also tell you the type on which it was querying a field / method ... possibly. I'm not familiar with the internals of this part of the CLR, but there doesn't seem to be an inherent reason why it couldn't do so.
But I'm not sure the added cost to the CLR would be worth the benefit. I share the frustration about the lack of information in a null reference exception. But more than the type involved, I want names! I don't care that it was an IComparable; I wanted to know it was leftCustomer.
Names are something the CLR doesn't always have access to, as a good portion of them live in the PDB and not in metadata. Hence it can't provide them with great reliability (or speed).
Jared's answer is of course correct. Just to add a little to it:
when I debug and I hover over a variable, it shows the type information
Right. You have a bowl. The bowl is labelled "FRUIT". The bowl is empty. What is the type of the fruit in the bowl? You cannot say, because there isn't any fruit in the bowl. But that does not mean that you know nothing about the bowl. You know that the bowl can contain any fruit.
When you hover over a variable then the debugger can tell you about the variable itself or about its contents.
Does the debugger use other information than just reflection of some sort to find out the datatype?
Absolutely. The debugger needs to know not just what is the type of the thing referred to by this reference but also what restrictions are placed on what can be stored in this variable. All the information about what restrictions are placed on particular storage locations are known to the runtime, and the runtime can tell that information to the debugger.
How come it knows more than I?
I reject the premise of the question. The debugger is running on your behalf; it cannot do anything that you cannot do yourself. If you don't know what the type restriction on a particular variable is, it's not because you lack the ability to find out. You just haven't looked yet.
if it knows this, why isn't it capable of showing the datatype in a NullReferenceException?
Think about what is actually happening when you dereference null. Suppose for example you do this:
Fruit f = null;
string s = f.ToString();
ToString might be overridden in Fruit. What code must the jitter generate? Let's suppose that local variable f is stored in a stack location. The jitter says:
copy the contents of the memory address at the stack pointer offset associated with f to register 1
The virtual function table is going to be, let's say, eight bytes from the top of that pointer, and ToString is going to be, let's say, four bytes from the top of that table. (I am just making these numbers up; I don't know what the real offsets are off the top of my head.) So, start by adding eight to the current contents of register 1.
Now dereference the current contents of register 1 to get the address of the vtable into register 2
Now add four bytes to register 2
Now we have a pointer to the ToString method...
But hold on a minute, let's follow that logic again. The first step puts zero into register 1, because f contains null. The second step adds eight to that. The third step dereferences pointer 0x00000008, and the virtual memory system issues an exception stating that an illegal memory page has just been touched. The CLR handles the exception, determines that the exception happened on the first 64 K of memory, and guesses that someone has just dereferenced a null pointer. It therefore creates a null reference exception and throws it.
The virtual memory system surely does not know that the reason it dereferenced pointer 0x00000008 was because someone was trying to call f.ToString(). That information is lost in the past; the memory manager's job is to tell you when you touched something you don't have any right to touch; why you tried to touch memory you don't own is not its job to figure out.
The CLR could maintain a separate side data structure such that every time you accessed memory, it made a note of why you were attempting to do so. That way, the exception could have more information in it, describing what you were doing when the exception happened. Imagine the cost of maintaining such a data structure for every access to memory! Managed code could easily be ten times slower than it is today, and that cost is borne just as heavily by correct code as by broken code. And for what? To tell you what you can easily figure out yourself: which variable that contains null that you dereferenced.
The feature isn't worth the cost, so the CLR does not do it. There's no technical reason why it could not; it's just not practical.
Ok, maybe this isn't so amazing considering I don't really understand how the debugger works in the first place, let alone Edit and Continue, which is totally amazing.
But I was wondering if anyone knew what the debugger is doing with variable declarations in this scenario. I can be debugging through my code and move the line of execution ahead, past a variable's initial declaration and assignment, and the code still runs OK. If it's a value type, it will have its default value; for a ref type, null.
So if I create a function that uses a variable before it's declared, it won't compile; but if I use the debugger to run it that way, it will still run without error. Why is this? And is this related to the fact that you can't put a breakpoint on a declaration?
Yes, those declarations are more structural. They're part of the locals on the stack that are allocated as the method is called. You can't break on them because they don't really happen where you write them - they're not instructions.
The reason the compiler won't let you use them before they are declared is mostly for your sanity - you always know to look up for a declaration. Complex scoping of variables within a method would further illustrate this point.
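A small demonstration of that (my example): if you drag the execution point past the declarations below with Set Next Statement, the locals still exist, holding their defaults.

using System;

class SetNextStatementDemo
{
    static void Main()
    {
        // Move the yellow execution arrow from the opening brace straight to
        // the WriteLine calls: both slots were allocated on method entry,
        // so i reads as 0 and obj as null.
        int i = 10;
        object obj = new object();

        Console.WriteLine(i);
        Console.WriteLine(obj == null ? "obj is null" : obj.ToString());
    }
}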
According to the article Gain performance by not initializing variables:
In .NET, the Common Language Runtime (CLR) expressly initializes all variables as soon as they are created. Value types are initialized to 0 and reference types are initialized to null.
Presumably the debugger already knows about these variables either because the code is already compiled, or (seems less likely now that I am typing it, but) the debugger is smart enough to detect that a variable was declared.
I think that because you are using the debugger, you are confusing the two different activities, compile and execute, and the two different statement types, declarative and functional.
When you compile, a declarative statement tells the compiler to reserve some memory for your variable. It says "oh, you want to declare an integer named 'wombatCount'; OK, I'll take address 0x1234 and reserve four bytes just for you and stick a label on them called wombatCount." That happens during compilation, long before you run your code.*
When you execute the code in the debugger, you are running your code, so it already knows about every byte of memory it reserved for you. The variable wombatCount is already associated with four bytes at address 0x1234, so it can immediately access and change that data at any time, not just after your declaration statement. Your program can't, of course, but the debugger can.
The C# language syntax demands that you declare the memory before using it in your code, but that's just part of the language definition, and not a hard-and-fast requirement of all compilers. There are languages that do not require you to pre-declare your variables at all, and there are even some ancient languages where you can declare variables at any point in the code, not just "above" where you'll use them. But language developers now understand that the language syntax is most important for human understanding, and is no longer for ease of machine encoding or helping the compiler writers, so modern language syntaxes are generally created to help the programmers as much as possible. This means making things less confusing, so "declarations must come first" is a common rule to help you avoid mistakes.
(*To be more technically correct, I believe that in .Net the labels are only associated at compile time with a list of pointers that will reserve memory at run time, but the data bytes are not actually allocated until you use them. The difference is internal, and not very important to your understanding. The important takeaway is that a declarative statement declares the label in advance, during compile time.)