Duck typing in the C# compiler

Note This is not a question about how to implement or emulate duck typing in C#...
For several years I was under the impression that certain C# language features were dependent on data structures defined in the language itself (which always seemed like an odd chicken & egg scenario to me). For example, I was under the impression that the foreach loop was only available to use with types that implemented IEnumerable.
Since then I've come to understand that the C# compiler uses duck typing to determine whether an object can be used in a foreach loop, looking for a GetEnumerator method rather than IEnumerable. This makes a lot of sense as it removes the chicken & egg conundrum.
I'm a little confused as to why this doesn't seem to be the case with the using block and IDisposable. Is there any particular reason the compiler can't use duck typing and look for a Dispose method? What's the reason for this inconsistency?
Perhaps there's something else going on under the hood with IDisposable?
Discussing why you would ever have an object with a Dispose method that didn't implement IDisposable is outside the scope of this question :)
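The behaviour described in the question can be reproduced with a minimal sketch (the type names here are invented purely for illustration): a type that exposes a public GetEnumerator method works with foreach even though it implements no interface at all.

```csharp
using System;

// Implements no interface; foreach still works because the compiler
// only looks for a suitable GetEnumerator/MoveNext/Current pattern.
class CountUpTo
{
    private readonly int _max;
    public CountUpTo(int max) => _max = max;

    public Enumerator GetEnumerator() => new Enumerator(_max);

    public struct Enumerator
    {
        private readonly int _max;
        private int _current;
        public Enumerator(int max) { _max = max; _current = 0; }
        public int Current => _current;
        public bool MoveNext() => ++_current <= _max;
    }
}

class Program
{
    static void Main()
    {
        int sum = 0;
        foreach (int i in new CountUpTo(3)) // yields 1, 2, 3
            sum += i;
        Console.WriteLine(sum); // 6
    }
}
```

Because Current is typed as int rather than object, no boxing occurs, which is exactly the advantage the duck-typed pattern offered before generics.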

There's nothing special about IDisposable here - but there is something special about iterators.
Before C# 2, using this duck type on foreach was the only way you could implement a strongly-typed iterator, and also the only way of iterating over value types without boxing. I suspect that if C# and .NET had had generics to start with, foreach would have required IEnumerable<T> instead, and not had the duck typing.
Now the compiler uses this sort of duck typing in a few other places I can think of:
Collection initializers look for a suitable Add overload (as well as the type having to implement IEnumerable, just to show that it really is a collection of some kind); this allows for flexible adding of single items, key/value pairs etc
LINQ (Select etc) - this is how LINQ achieves its flexibility, allowing the same query expression format against multiple types, without having to change IEnumerable<T> itself
The C# 5 await expressions require GetAwaiter to return an awaiter type which has IsCompleted / OnCompleted / GetResult
In all of these cases this makes it easier to add the feature to existing types and interfaces, where the concept didn't exist earlier on.
Given that IDisposable has been in the framework since the very first version, I don't think there would be any benefit in duck typing the using statement. I know you explicitly tried to discount the reasons for having Dispose without implementing IDisposable from the discussion, but I think it's a crucial point. There need to be good reasons to implement a feature in the language, and I would argue that duck typing is a feature above-and-beyond supporting a known interface. If there's no clear benefit in doing so, it won't end up in the language.
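The collection-initializer case above can be sketched like this (the Tags type is invented for illustration): the compiler translates each element of the initializer into a call to a matching Add overload, requiring only that the type implements IEnumerable to mark it as a collection.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// IEnumerable is required just to mark the type as a collection;
// the initializer itself binds to whatever Add overloads exist.
class Tags : IEnumerable<string>
{
    private readonly List<string> _items = new List<string>();

    public void Add(string tag) => _items.Add(tag);
    public void Add(string key, string value) => _items.Add(key + "=" + value);

    public IEnumerator<string> GetEnumerator() => _items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

class Program
{
    static void Main()
    {
        // "a" → Add("a");  { "k", "v" } → Add("k", "v")
        var tags = new Tags { "a", { "k", "v" } };
        foreach (var t in tags)
            Console.WriteLine(t); // prints: a  then  k=v
    }
}
```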

There's no chicken and egg: foreach could depend on IEnumerable since IEnumerable doesn't depend on foreach. The reason foreach is permitted on collections not implementing IEnumerable is probably largely historic:
In C#, it is not strictly necessary for a collection class to inherit from IEnumerable and IEnumerator in order to be compatible with foreach; as long as the class has the required GetEnumerator, MoveNext, Reset, and Current members, it will work with foreach. Omitting the interfaces has the advantage of allowing you to define the return type of Current to be more specific than object, thereby providing type-safety.
Furthermore, not all chicken and egg problems are actually problems: for example a function can call itself (recursion!) or a reference type can contain itself (like a linked list).
So when using came around why would they use something as tricky to specify as duck typing when they can simply say: implement IDisposable? Fundamentally, by using duck typing you're doing an end-run around the type system, which is only useful when the type system is insufficient (or impractical) to address a problem.
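For reference, the compiler lowers a using statement into roughly the following try/finally, which is why the type must be convertible to IDisposable rather than merely having a Dispose method (this is a sketch of the expansion, not the exact generated code):

```csharp
using System;

class Resource : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

class Program
{
    static void Main()
    {
        var r = new Resource();

        // using (Resource res = r) { ... } expands approximately to:
        Resource res = r;
        try
        {
            // body of the using block
        }
        finally
        {
            if (res != null) ((IDisposable)res).Dispose();
        }

        Console.WriteLine(r.Disposed); // True
    }
}
```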

The question you are asking is not really a chicken-and-egg situation; it is more about how the language compiler is implemented. The C# and VB.NET compilers, for instance, are implemented differently: if you write a simple hello-world program, compile it with both compilers, and inspect the IL, the generated code will differ. Coming back to your question, here is roughly the code the C# compiler generates for a foreach over an enumerable:
IEnumerator e = arr.GetEnumerator();
while (e.MoveNext())
{
    object current = e.Current;
    // loop body
}
So the C# compiler is tweaked for the case of foreach.
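A runnable version of that expansion, over an array for illustration (for arrays the compiler actually emits indexed access rather than an enumerator, but the general pattern is the one shown):

```csharp
using System;
using System.Collections;

class Program
{
    static void Main()
    {
        int[] arr = { 1, 2, 3 };
        int sum = 0;

        // Roughly what the compiler emits for: foreach (int x in arr) sum += x;
        IEnumerator e = arr.GetEnumerator();
        while (e.MoveNext())
        {
            sum += (int)e.Current; // non-generic IEnumerator boxes value types
        }

        Console.WriteLine(sum); // 6
    }
}
```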

Related

For what collection types does the C# compiler generate collection classes?

The documentation for List<T> states:
If a value type is used for type T, the compiler generates an implementation of the List<T> class specifically for that value type. That means a list element of a List<T> object does not have to be boxed before the element can be used...
A question was raised in the comments about exactly what "the compiler" refers to here. That's tangential to the question, which is about what else "the compiler" (whatever that may mean) does this to.
Is this true of any other collection type? If it's only List<T>, is it good practice to always use List<T> for value types even when some other collection like Queue<T> better expresses your intent?
Short answer: use the most appropriate generic type that expresses your intent. Queue<T> is just fine if you want to represent a queue, for example.
Longer answer: quite honestly, that documentation is vague. When it mentions "the compiler", it isn't talking about the C# (build-time) compiler (which translates C# to IL), but rather the JIT (runtime) compiler, which translates IL to CPU instructions (appropriately for your specific CPU and environment).
The feature it is using here is simply a feature of generics, which applies equally to any generic usage; it isn't specific to List<T> - the same ideas apply to any <T> type (or multi-generic-parameter types, too), including arrays.
The docs are also ... a little wooly and imprecise. The details probably don't really matter to most people, but it doesn't really do this per-T; or at least, not all T. Every value-type T (or permutation involving value-types, for multi-generic-parameter scenarios) gets a bespoke JIT, but all the reference-type T share a single implementation. This is tied to the fact that only value-type usages involve boxing, and that boxing is per-T; for reference-type usages, a type check is sufficient. And even for value-type scenarios, often the box step can be avoided via "constrained" calls.
If I was trying to be generous: the document is perhaps trying to contrast against non-generic collections (which you shouldn't really be using in %current year%), in a way that might make sense to someone more familiar with .NET 1.1; in doing so, they're... very far from precise.
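The reification described above is observable from code: each value-type instantiation is a distinct runtime type, with no information lost (a small demonstration):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Unlike Java's erasure, .NET constructs a distinct type per instantiation.
        Console.WriteLine(typeof(List<int>) == typeof(List<long>));    // False
        Console.WriteLine(typeof(List<int>).GetGenericArguments()[0]); // System.Int32

        // The type argument survives at runtime even through object references.
        object boxed = new List<int>();
        Console.WriteLine(boxed.GetType() == typeof(List<int>));       // True
    }
}
```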

Why pattern based programming in C#

Wondering why C# is moving towards more pattern based programming rather than conventional ways.
Ex. The foreach statement expects the loop source to have a magic method called GetEnumerator which returns an object with a few more magic methods like MoveNext and Current, but it doesn't mandate any specific interface. C# could have mandated that a class used in foreach implement IEnumerable or IEnumerable<T>, as it does for the using statement, where it expects the object to implement the IDisposable interface.
Also, I see a similar trend with async/await keywords as well....
Of course there must be a good reason for it, but it seems a little odd to me; I'd like to understand why the compiler/CLR relies on "magic methods" rather than on interfaces.
foreach
I would say it's about both performance and compatibility:
If foreach had required IEnumerable, iterating any generic collection would have been slow for value-type T (because of boxing/unboxing).
If foreach had required IEnumerable<T>, iterating over ArrayList and all the non-generic collections from early .NET versions would not have been possible.
I think the design decision was good. When foreach was introduced (C# 1.0), there was nothing about generics in .NET (they arrived in .NET 2.0). Requiring IEnumerable as the source of foreach enumeration would have made using it with generic collections poor, or would have required a radical change later. I guess the designers already knew they were going to introduce generics not long after.
Additionally, declaring the rule as "use IEnumerable<T> when it's available, or IEnumerable when it's not" is not much different than "use an available GetEnumerator method, or do not compile when there is none", is it?
update
As @mikez mentioned in the comments, there is one more advantage: when you don't require GetEnumerator to return IEnumerator/IEnumerator<T>, it can return a struct, and you don't have to worry about boxing when the enumerator is used by the loop.
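That struct-enumerator advantage is visible on List<T> itself: its GetEnumerator returns List<T>.Enumerator, a struct, so the pattern-based foreach never boxes the enumerator (a small check):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var list = new List<int> { 1, 2, 3 };

        // List<T>.GetEnumerator() returns the struct List<T>.Enumerator,
        // so the pattern-based foreach avoids allocating/boxing it.
        Console.WriteLine(typeof(List<int>.Enumerator).IsValueType); // True

        List<int>.Enumerator e = list.GetEnumerator();
        int sum = 0;
        while (e.MoveNext()) sum += e.Current;
        Console.WriteLine(sum); // 6
    }
}
```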
LINQ
The same magic methods situation occurs when you use LINQ and syntax based queries. When you write
var results = from item in source
              where item != "test"
              select item.ToLower();
it's transformed by compiler into
var results = source.Where(x => x != "test")
.Select(x => x.ToLower());
And because that code would work no matter what interface source implements, the same applies to the syntax-based query. As long as every method call can be properly bound by the compiler after the transformation to a method-based query, everything is OK.
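That can be sketched with an invented type that is not IEnumerable at all, yet supports query syntax simply because it exposes Where and Select methods of the expected shapes:

```csharp
using System;

// Not IEnumerable<T>; query syntax binds purely by method name and shape.
class Box<T>
{
    public T Value;
    public Box(T value) => Value = value;

    public Box<T> Where(Func<T, bool> predicate) =>
        predicate(Value) ? this : throw new InvalidOperationException("empty");

    public Box<R> Select<R>(Func<T, R> projection) => new Box<R>(projection(Value));
}

class Program
{
    static void Main()
    {
        // Translated to: new Box<string>("Test").Where(...).Select(...)
        var result = from s in new Box<string>("Test")
                     where s != "test"
                     select s.ToLower();

        Console.WriteLine(result.Value); // test
    }
}
```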
async/await
I'm not that sure, but I think the same applies to async/await. When you use these keywords the compiler generates a bunch of code for you, which is then compiled as if you had written it yourself. As long as the code produced by that transformation compiles, everything is OK.
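The await pattern works the same way: the compiler only needs a GetAwaiter (instance or extension method) whose result exposes IsCompleted/OnCompleted/GetResult. As a sketch, an extension method can make plain int awaitable (purely for illustration):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

static class IntAwaiterExtensions
{
    // The compiler accepts an extension GetAwaiter too; TaskAwaiter<int>
    // already provides IsCompleted/OnCompleted/GetResult.
    public static TaskAwaiter<int> GetAwaiter(this int value) =>
        Task.FromResult(value).GetAwaiter();
}

class Program
{
    static async Task Main()
    {
        int sum = await 21 + await 21; // awaiting plain ints
        Console.WriteLine(sum); // 42
    }
}
```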

dynamic and generics in C#

As discovered in C# 3.5, the following would not be possible due to type erasure: -
int foo<T>(T bar)
{
    return bar.Length; // will not compile unless I do something like where T : string
}
foo("baz");
I believe the reason this doesn't work is in C# and java, is due to a concept called type erasure, see http://en.wikipedia.org/wiki/Type_erasure.
Having read about the dynamic keyword, I wrote the following: -
int foo<T>(T bar)
{
    dynamic test = bar;
    return test.Length;
}
foo("baz"); // will compile and return 3
So, as far as I understand, dynamic will bypass compile time checking but if the type has been erased, surely it would still be unable to resolve the symbol unless it goes deeper and uses some kind of reflection?
Is using the dynamic keyword in this way bad practice and does this make generics a little more powerful?
dynamic and generics are two completely different notions. If you want compile-time safety and speed, use strong typing (generics, or just standard OOP techniques such as inheritance and composition). If you do not know the type at compile time you can use dynamic, but it will be slower, because it uses runtime invocation, and less safe: if the type doesn't implement the method you are attempting to invoke, you will get a runtime error.
The two notions are not interchangeable; depending on your specific requirements you can use one or the other.
Of course having the following generic constraint is completely useless because string is a sealed type and cannot be used as a generic constraint:
int foo<T>(T bar) where T : string
{
    return bar.Length;
}
you'd rather have this:
int foo(string bar)
{
    return bar.Length;
}
I believe the reason this doesn't work is in C# and java, is due to a concept called type erasure, see http://en.wikipedia.org/wiki/Type_erasure.
No, this isn't because of type erasure. Anyway there is no type erasure in C# (unlike Java): a distinct type is constructed by the runtime for each different set of type arguments, there is no loss of information.
The reason why it doesn't work is that the compiler knows nothing about T, so it can only assume that T inherits from object, so only the members of object are available. You can, however, provide more information to the compiler by adding a constraint on T. For instance, if you have an interface IBar with a Length property, you can add a constraint like this:
int foo<T>(T bar) where T : IBar
{
    return bar.Length;
}
But if you want to be able to pass either an array or a string, it won't work, because the Length property isn't declared in any interface implemented by both String and Array...
No, C# does not have type erasure; only Java does.
But if you specify only T, without any constraint, you cannot use bar.Length, because T can virtually be anything.
foo(new Bar());
The above would resolve to a Bar class, and thus the Length property might not be available.
You can only use methods on T that you have ensured T really has. (This is done with the where constraints.)
With dynamic, you lose compile-time checking, and I suggest that you do not use it to hack around generics.
In this case you would not benefit from dynamic in any way; you just delay the error, as an exception is thrown if the dynamic object does not contain a Length property. When accessing a Length property in a generic method, I can't see any reason not to constrain T to types that definitely have this property.
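The delayed error is easy to observe (a sketch): the same member access that is a compile-time error with an unconstrained generic T becomes a RuntimeBinderException with dynamic.

```csharp
using System;
using Microsoft.CSharp.RuntimeBinder;

class Program
{
    static int DynamicLength(dynamic value) => value.Length; // resolved at runtime

    static void Main()
    {
        Console.WriteLine(DynamicLength("baz"));          // 3
        Console.WriteLine(DynamicLength(new[] { 1, 2 })); // 2

        try
        {
            DynamicLength(42); // int has no Length member
        }
        catch (RuntimeBinderException)
        {
            Console.WriteLine("no Length at runtime");
        }
    }
}
```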
"Dynamics are a powerful new tool that make interop with dynamic languages as well as COM easier, and can be used to replace much turgid reflective code. They can be used to tell the compiler to execute operations on an object, the checking of which is deferred to runtime.
The great danger lies in the use of dynamic objects in inappropriate contexts, such as in statically typed systems, or worse, in place of an interface/base class in a properly typed system."
Quoted from the article.
Thought I'd weigh in on this one, because no one has clarified how generics work "under the hood". The notion of T being treated as object is mentioned above and is quite clear. What hasn't been discussed is what happens below C# (or VB, or any other supported language) at the Intermediate Language (IL) level, which is more akin to an assembly language or the equivalent of Java bytecode. Unlike Java, .NET does not erase generics at this level: IL keeps the generic definitions intact, complete with their type parameters. The specialization happens later, at JIT time. For each value type used as T, the runtime generates a bespoke version of the code with the actual type substituted in (so List<int> and List<double> each get their own machine code, with no boxing), while all reference-type arguments share a single compiled version, since a reference is handled the same way regardless of what it points to.
That's my (limited) understanding of what's going on. The point is that the heavy lifting is done at JIT time, and there is no degradation in performance at runtime - which is again why generics are so loved and used anywhere and everywhere by .NET developers.
On the down-side? If a generic method or type is instantiated with many different value types, the compiled code grows with the number of specializations. Given typical code sizes, I doubt you'll ever consider generics to be a problem.

When to specify constraint `T : IEquatable<T>` even though it is not strictly required?

In short, I am looking for guidance on which of the following two methods should be preferred (and why):
static IEnumerable<T> DistinctA<T>(this IEnumerable<T> xs)
{
    return new HashSet<T>(xs);
}

static IEnumerable<T> DistinctB<T>(this IEnumerable<T> xs) where T : IEquatable<T>
{
    return new HashSet<T>(xs);
}
Argument in favour of DistinctA: Obviously, the constraint on T is not required, because HashSet<T> does not require it, and because instances of any T are guaranteed to be convertible to System.Object, which provides the same functionality as IEquatable<T> (namely the two methods Equals and GetHashCode). (While the non-generic methods will cause boxing with value types, that's not what I'm concerned about here.)
Argument in favour of DistinctB: The generic parameter constraint, while not strictly necessary, makes visible to callers that the method will compare instances of T, and is therefore a signal that Equals and GetHashCode should work correctly for T. (After all, defining a new type without explicitly implementing Equals and GetHashCode happens very easily, so the constraint might help catch some errors early.)
Question: Is there an established and documented best practice that recommends when to specify this particular constraint (T : IEquatable<T>), and when not to? And if not, is one of the above arguments flawed in any way? (In that case, I'd prefer well-thought-out arguments, not just personal opinions.)
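The concern in the second argument can be made concrete (a sketch with invented types): a type that neither implements IEquatable<T> nor overrides Equals/GetHashCode silently gets reference equality inside a HashSet<T>, so "duplicates" are not removed.

```csharp
using System;
using System.Collections.Generic;

class NoEquality
{
    public int Id;
    public NoEquality(int id) => Id = id;
}

class WithEquality : IEquatable<WithEquality>
{
    public int Id;
    public WithEquality(int id) => Id = id;
    public bool Equals(WithEquality other) => other != null && other.Id == Id;
    public override bool Equals(object obj) => Equals(obj as WithEquality);
    public override int GetHashCode() => Id;
}

class Program
{
    static void Main()
    {
        // Reference equality: the two Id=1 instances are NOT deduplicated.
        var a = new HashSet<NoEquality> { new NoEquality(1), new NoEquality(1) };
        Console.WriteLine(a.Count); // 2

        // Value equality via IEquatable<T>: duplicates collapse.
        var b = new HashSet<WithEquality> { new WithEquality(1), new WithEquality(1) };
        Console.WriteLine(b.Count); // 1
    }
}
```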
While my comment at Pieter's answer is fully true, I've rethought the exact case of Distinct that you refer to.
This is a LINQ method contract, not just a method.
LINQ is meant to be a common facade implemented by various providers. Linq2Objects may require IEquatable, but Linq2Sql may not require it, and may not use it at all, completely ignoring the IEquatable-ness, because the comparison is made by the database's SQL engine.
Therefore, at the layer of LINQ method definitions, it does not make sense to specify the requirement for IEquatable. It would limit and constrain future LINQ providers to things they really do not need to care about in their specific domains. Note that LINQ is actually very often all about domain-specificness: the LINQ expressions and parameters are frequently never actually run as code, but are analyzed and retranslated to other constructs like SQL or XPath. Omission of constraints in this case is reasonable, because you cannot really know what your future-and-unknown-domain provider will really need to request from the caller. Linq-to-Mushrooms may need to use an IToxinEquality instead of IEquatable!
However, when you are designing interfaces/contracts that will clearly be used as runnable code (not just expression trees that only work as configuration or declarations), then I do not see any valid point for not providing the constraints.
Start by considering when it might matter which of the two mechanisms is used; I can think of only two:
When the code is being translated to another language (either a subsequent version of C#, or a related language like Java, or a completely dissimilar language such as Haskell). In this case the second definition is clearly better, providing the translator, whether automated or manual, with more information.
When a user unfamiliar with the code is reading it to learn how to invoke the method. Again, I believe the second is clearly better by providing more information readily to such a user.
I cannot think of any circumstance in which the first definition would be preferred, and where it actually matters beyond personal preference.
Others thoughts?

When to use run-time type information?

If I have various subclasses of something, and an algorithm which operates on instances of those subclasses, and if the behaviour of the algorithm varies slightly depending on what particular subclass an instance is, then the most usual object-oriented way to do this is using virtual methods.
For example if the subclasses are DOM nodes, and if the algorithm is to insert a child node, that algorithm differs depending on whether the parent node is a DOM element (which can have children) or DOM text (which can't): and so the insertChildren method may be virtual (or abstract) in the DomNode base class, and implemented differently in each of the DomElement and DomText subclasses.
Another possibility is give the instances a common property, whose value can be read: for example the algorithm might read the nodeType property of the DomNode base class; or for another example, you might have different types (subclasses) of network packet, which share a common packet header, and you can read the packet header to see what type of packet it is.
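The DOM example above can be sketched as follows (simplified, with invented class names): the behaviour difference lives in virtual/abstract members, so call sites never test the runtime type.

```csharp
using System;
using System.Collections.Generic;

abstract class DomNode
{
    // Each subclass supplies its own behaviour; callers never test the type.
    public abstract void InsertChild(DomNode child);
}

class DomElement : DomNode
{
    public List<DomNode> Children { get; } = new List<DomNode>();
    public override void InsertChild(DomNode child) => Children.Add(child);
}

class DomText : DomNode
{
    public override void InsertChild(DomNode child) =>
        throw new InvalidOperationException("Text nodes cannot have children.");
}

class Program
{
    static void Main()
    {
        DomNode element = new DomElement();
        element.InsertChild(new DomText()); // dispatched virtually, no type check

        try { new DomText().InsertChild(new DomText()); }
        catch (InvalidOperationException e) { Console.WriteLine(e.Message); }
    }
}
```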
I haven't used run-time-type information much, including:
The is and as keywords in C#
Downcasting
The Object.GetType method in dot net
The typeid operator in C++
When I'm adding a new algorithm which depends on the type of subclass, I tend instead to add a new virtual method to the class hierarchy.
My question is, when is it appropriate to use run-time-type information, instead of virtual functions?
When there's no other way around it. Virtual methods are always preferred, but sometimes they just can't be used. There are a couple of reasons why this can happen, but the most common one is that you don't have the source code of the classes you want to work with, or you can't change them. This often happens when you work with a legacy system or with a closed-source commercial library.
In .NET it may also happen that you have to load new assemblies on the fly, like plugins, where you generally have no common base classes and have to use something like duck typing.
In C++, among some other obscure cases (which mostly deal with inferior design choices), RTTI is a way to implement so-called multi methods.
These constructs ("is" and "as") are very familiar to Delphi developers, since event handlers usually downcast objects from a common ancestor. For example, the OnClick event passes the only argument Sender: TObject regardless of the type of the object, whether it is TButton, TListBox, or anything else. If you want to know something more about this object you have to access it through "as", but in order to avoid an exception you can check it with "is" first. This downcasting allows design-time binding of objects and methods that would not be possible with strict class-type checking. Imagine you want to do the same thing whether the user clicks a Button or a ListBox: if each provided a different function prototype, it would not be possible to bind them to the same procedure.
In the more general case, an object can call a function to notify, for example, that the object has changed; it leaves the destination the possibility of knowing it "personally" (through as and is), but not necessarily. It does this by passing itself as the most common ancestor of all objects (TObject in Delphi's case).
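The Delphi pattern translates directly to C# (sketched here with invented control classes rather than a real UI framework): one handler receives the sender typed as the common ancestor and uses is/as to recover the specific type.

```csharp
using System;

class Control { public string Name = ""; }
class Button : Control { public string Caption = ""; }
class ListBox : Control { public int SelectedIndex = -1; }

class Program
{
    // One handler bound to many controls; sender is typed as the common ancestor.
    static string OnClick(object sender)
    {
        if (sender is Button b)       // the C# analogue of Delphi's "is"
            return "button: " + b.Caption;
        if (sender is ListBox lb)     // "as" with a null check works similarly
            return "listbox item " + lb.SelectedIndex;
        return "unknown sender";
    }

    static void Main()
    {
        Console.WriteLine(OnClick(new Button { Caption = "OK" }));     // button: OK
        Console.WriteLine(OnClick(new ListBox { SelectedIndex = 2 })); // listbox item 2
    }
}
```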
dynamic_cast<>, if I remember correctly, depends on RTTI. Some obscure outer interfaces might also rely on RTTI when an object is passed through a void pointer (for whatever reason that might happen).
That being said, I haven't seen typeid in the wild in 10 years of pro C++ maintenance work. (Luckily.)
You can refer to More Effective C# for a case where run-time type checking is OK.
Item 3: Specialize Generic Algorithms Using Runtime Type Checking
You can easily reuse generics by simply specifying new type parameters. A new instantiation with new type parameters means a new type having similar functionality.
All this is great, because you write less code. However, sometimes being more generic means not taking advantage of a more specific, but clearly superior, algorithm. The C# language rules take this into account. All it takes is for you to recognize that your algorithm can be more efficient when the type parameters have greater capabilities, and then to write that specific code. Furthermore, creating a second generic type that specifies different constraints doesn't always work. Generic instantiations are based on the compile-time type of an object, and not the runtime type. If you fail to take that into account, you can miss possible efficiencies.
For example, suppose you write a class that provides reverse-order enumeration over a sequence of items represented through IEnumerable<T>. In order to enumerate it backwards you may iterate it once, copying the items into an intermediate collection with indexer access, like List<T>, and then enumerate that collection backwards using the indexer. But if your original IEnumerable is actually an IList, why not take advantage of that and provide a more performant way (without copying to an intermediate collection) to iterate the items backwards? So basically it is a special case we can take advantage of, while still providing the same behavior (iterating the sequence backwards).
But in general you should carefully consider run-time type checking and ensure that it doesn't violate the Liskov Substitution Principle.
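The reverse-enumeration example can be sketched like this: a runtime type check specializes for IList<T>, while the general IEnumerable<T> path keeps identical observable behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ReverseHelper
{
    public static IEnumerable<T> ReverseSequence<T>(IEnumerable<T> source)
    {
        // Specialization: an IList<T> supports random access, so we can walk
        // backwards without buffering the whole sequence first.
        if (source is IList<T> list)
        {
            for (int i = list.Count - 1; i >= 0; i--)
                yield return list[i];
        }
        else
        {
            // General path: copy once, then index backwards.
            var buffer = new List<T>(source);
            for (int i = buffer.Count - 1; i >= 0; i--)
                yield return buffer[i];
        }
    }
}

class Program
{
    static void Main()
    {
        var fromList = ReverseHelper.ReverseSequence(new List<int> { 1, 2, 3 });
        var fromSeq  = ReverseHelper.ReverseSequence(Enumerable.Range(1, 3));

        Console.WriteLine(string.Join(",", fromList)); // 3,2,1
        Console.WriteLine(string.Join(",", fromSeq));  // 3,2,1
    }
}
```

Both paths yield the same elements in the same order, which is exactly the LSP-style invariant the answer warns you to preserve when specializing on runtime type.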
