Enumerator Implementation: Use struct or class? - c#

I noticed that List<T> defines its enumerator as a struct, while ArrayList defines its enumerator as a class. What's the difference? If I am to write an enumerator for my class, which one would be preferable?
EDIT: My requirements cannot be fulfilled using yield, so I'm implementing an enumerator of my own. That said, I wonder whether it would be better to follow the lines of List<T> and implement it as a struct.

Like the others, I would choose a class. Mutable structs are nasty. (And as Jared suggests, I'd use an iterator block. Hand-coding an enumerator is fiddly to get right.)
See this thread for an example of the list enumerator being a mutable struct causing problems...
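A minimal sketch of the kind of surprise being referred to, not the example from the linked thread. The class and method names here are hypothetical. Assigning a List<T>.Enumerator to another variable copies its whole state, so advancing the copy leaves the original untouched, whereas two references to a boxed (or class) enumerator share one state:

```csharp
using System;
using System.Collections.Generic;

public static class StructEnumeratorCopyDemo
{
    // Struct enumerator: assignment copies the state.
    public static Tuple<int, int> CopyThenAdvance()
    {
        var list = new List<int> { 1, 2, 3 };

        List<int>.Enumerator first = list.GetEnumerator(); // a struct
        first.MoveNext();                                  // first.Current == 1

        List<int>.Enumerator copy = first; // independent copy of the state
        copy.MoveNext();                   // advances only the copy

        return Tuple.Create(first.Current, copy.Current);
    }

    // Boxed enumerator: both variables refer to the same object.
    public static Tuple<int, int> AliasThenAdvance()
    {
        var list = new List<int> { 1, 2, 3 };

        IEnumerator<int> first = list.GetEnumerator(); // boxed into the interface
        first.MoveNext();

        IEnumerator<int> alias = first; // same object, no copy
        alias.MoveNext();               // advances the shared state

        return Tuple.Create(first.Current, alias.Current);
    }

    public static void Main()
    {
        Console.WriteLine(CopyThenAdvance());  // (1, 2)
        Console.WriteLine(AliasThenAdvance()); // (2, 2)
    }
}
```

The same copying happens when a struct enumerator is passed to a method by value or stored in a readonly field, which is where most of the reported confusion comes from.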

The easiest way to write an enumerator in C# is with the "yield return" pattern. For example:
public IEnumerator<int> Example() {
    yield return 1;
    yield return 2;
}
This pattern will generate all of the enumerator code under the hood. This takes the decision out of your hands.

The reason List<T> uses a struct enumerator is to avoid generating garbage in foreach statements. This is a pretty good reason, especially if you are programming for the Compact Framework, because CF doesn't have a generational GC and is usually used on low-performance hardware, where garbage can quickly lead to performance issues.
Also, I don't think mutable structs are the source of the problems in the examples some have posted; the source is programmers who don't have a good understanding of how value types work.

An enumerator is inherently a changing structure, since it needs to update internal state to move on to the next value in the original collection.
In my opinion, structs should be immutable, so I would use a class.

Write it using yield return.
As to why you might otherwise choose between class or struct, if you make it a struct then it gets boxed as soon as it is returned as an interface, so making it a struct just causes additional copying to take place. Can't see the point of that!

Any implementation of IEnumerable<T> should return a class. It may be useful for performance reasons to have a GetEnumerator method which returns a struct which provides the methods necessary for enumeration but does not implement IEnumerator<T>; this method should be different from IEnumerable<T>.GetEnumerator, which should then be implemented explicitly.
Using this approach will allow for enhanced performance when the class is enumerated using a foreach or "For Each" loop in C# or vb.net or any context where the code which is doing the enumeration will know that the enumerator is a struct, but avoid the pitfalls that would otherwise occur when the enumerator gets boxed and passed by value.
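A rough sketch of the approach described above, under some assumptions: IntBag and FastEnumerator are hypothetical names, and the interface-facing enumerator is produced with an iterator block simply because that is the shortest way to get a class. The public GetEnumerator returns a struct that does not implement IEnumerator<T>, while IEnumerable<T>.GetEnumerator is implemented explicitly:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection used only for illustration.
public sealed class IntBag : IEnumerable<int>
{
    private readonly int[] _items;

    public IntBag(params int[] items) { _items = items; }

    // foreach over an IntBag binds to this public method by pattern,
    // so the enumerator stays a struct and is never boxed.
    public FastEnumerator GetEnumerator() { return new FastEnumerator(_items); }

    // Code that only sees IEnumerable<int> gets a compiler-generated class.
    IEnumerator<int> IEnumerable<int>.GetEnumerator()
    {
        foreach (int item in _items) yield return item;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return ((IEnumerable<int>)this).GetEnumerator();
    }

    // Deliberately does not implement IEnumerator<int>, so it cannot be
    // boxed into that interface by accident.
    public struct FastEnumerator
    {
        private readonly int[] _items;
        private int _index;

        internal FastEnumerator(int[] items) { _items = items; _index = -1; }

        public int Current { get { return _items[_index]; } }

        public bool MoveNext() { return ++_index < _items.Length; }
    }
}

public static class IntBagDemo
{
    public static int SumDirect(IntBag bag)
    {
        int sum = 0;
        foreach (int x in bag) sum += x;   // uses FastEnumerator
        return sum;
    }

    public static int SumViaInterface(IEnumerable<int> items)
    {
        int sum = 0;
        foreach (int x in items) sum += x; // uses the class enumerator
        return sum;
    }

    public static void Main()
    {
        var bag = new IntBag(1, 2, 3);
        Console.WriteLine(SumDirect(bag));       // 6
        Console.WriteLine(SumViaInterface(bag)); // 6
    }
}
```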

There are a couple of blog posts that cover exactly this issue. Basically, enumerator structs are a really bad idea...

To expand on @Earwicker: you're usually better off not writing an enumerator type, and instead using yield return to have the compiler write it for you. This is because there are a number of important subtleties that you might miss if you do it yourself.
See SO question "What is the yield keyword used for in C#?" for some more details on how to use it.
Also Raymond Chen has a series of blog posts ("The implementation of iterators in C# and its consequences": parts 1, 2, 3, and 4) that show you how to implement an iterator properly without yield return, which shows just how complex it is, and why you should just use yield return.

Related

C# enumerator terminology confusion

In C#, I was told that if a container class such as List(T) is first upcast to a container interface such as IEnumerable, and then subsequently iterated over using foreach then runtime garbage will be created. Also, even when fully downcast, I was told that iterating over Collection(T) also creates references on the heap. I understand that this is a result of a virtual call to GetEnumerator() which may return a reference or value type result.
Inspection of the MSDN docs for value types clearly lists all enumerations as value types. If an enumeration consists of an enumerator list, then aren't these enumerators value types as per the docs? Are they boxed? Or completely unrelated to each other but with similar names? Or something else entirely?
I'm not sure how to unify these two statements and I was hoping someone could explain it to me more plainly.
Thanks.
EDIT: Question rephrased taking into account commenter suggestions on the use of words such as 'never' and 'unnecessary'.
Enumerations (enums) are unrelated to enumerators.
And
I was told that to avoid...
Seems like a very premature and unnecessary optimization.

Why do C# Arrays use a reference type for Enumeration, but List<T> uses a mutable struct?

From what I've read, a design decision was made for certain collections' enumerator types to be mutable structs instead of reference types for performance reasons. List<T>.Enumerator is the best known.
I was investigating some old code that used arrays, and was surprised to discover that C# Arrays return the type SZGenericArrayEnumerator as their generic enumerator type, which is a reference type.
I am wondering if anyone knows why Array's generic iterator was implemented as a reference type when so many other performance critical collections used mutable structs instead.
From what I've read, a design decision was made for certain collections' enumerator types to be mutable structs instead of reference types for performance reasons.
Good question.
First off, you are correct. Though in general, mutable value types are a bad code smell, in this case they are justified:
The mutation is almost entirely concealed from the user.
It is highly unlikely that anyone is going to use the enumerator in a confusing manner.
The use of a mutable value type actually does solve a realistic performance problem in an extremely common scenario.
I am wondering if anyone knows why Array's generic iterator was implemented as a reference type when so many other performance critical collections used mutable structs instead.
Because if you're the sort of person who is concerned about the performance of enumerating an array then why are you using an enumerator in the first place? It's an array for heaven's sake; just write a for loop that iterates over its indices like a normal person and never allocate the enumerator. (Or a foreach loop; the C# compiler will rewrite the foreach loop into the equivalent for loop if it knows that the loop collection is an array.)
The only reason why you'd obtain an enumerator from an array in the first place is if you are passing it to a method that takes an IEnumerator<T>, in which case if the enumerator is a struct then you're going to be boxing it anyway. Why take on the expense of making the value type and then boxing it? Just make it a reference type to begin with.
Arrays get some special treatment in the C# compiler. When you use foreach on them, the compiler translates it into a for loop. So there is no performance benefit in using struct enumerators.
List<T> on the other hand is a plain class without any special treatment, so using a struct results in better performance.
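As a small illustration of the two answers above (the helper names are hypothetical): a for loop and a foreach loop over a variable statically typed as an array give the same result without going through an enumerator object, while asking for the enumerator through IEnumerable<T> yields a reference type for an array and a boxed struct for List<T>.

```csharp
using System;
using System.Collections.Generic;

public static class ArrayLoopDemo
{
    public static int SumWithFor(int[] values)
    {
        int sum = 0;
        for (int i = 0; i < values.Length; i++) sum += values[i];
        return sum;
    }

    public static int SumWithForeach(int[] values)
    {
        int sum = 0;
        // 'values' is statically an int[], so the compiler can emit an
        // indexed loop here instead of calling GetEnumerator().
        foreach (int v in values) sum += v;
        return sum;
    }

    // Reports whether the enumerator obtained through the interface
    // is an instance of a value type (that is, a boxed struct).
    public static bool EnumeratorIsValueType<T>(IEnumerable<T> source)
    {
        using (IEnumerator<T> e = source.GetEnumerator())
        {
            return e.GetType().IsValueType;
        }
    }

    public static void Main()
    {
        int[] array = { 1, 2, 3 };
        var list = new List<int> { 1, 2, 3 };

        Console.WriteLine(SumWithFor(array));           // 6
        Console.WriteLine(SumWithForeach(array));       // 6
        Console.WriteLine(EnumeratorIsValueType(array)); // False
        Console.WriteLine(EnumeratorIsValueType(list));  // True
    }
}
```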

Why the Reset() method on Enumerator class must throw a NotSupportedException()?

From what I saw on http://csharpindepth.com/Articles/Chapter6/IteratorBlockImplementation.aspx, an article by Jon Skeet, the C# specification itself says that. What would be the reason?
That's not how I read the C# spec [Word doc]. Section 10.14.4 "Enumerator objects", states:
...[E]numerator objects do not support the IEnumerator.Reset method. Invoking this method causes a System.NotSupportedException to be thrown.
However, this section (and statement) is specific to "enumerator objects", which is defined as:
When a function member returning an enumerator interface type is implemented using an iterator block, invoking the function member does not immediately execute the code in the iterator block. Instead, an enumerator object is created and returned.
In other words, an "enumerator object" is a compiler-generated IEnumerator [1]. There are no restrictions on every IEnumerator, just the ones generated from iterator blocks (aka yield).
As for why? I'd suspect because it's somewhat impossible to do in the general case - without saving every value and the consequent memory limitations of that. Combine that with the fact that IEnumerator.Reset() is rarely used (when's the last time that you Reset an enumerator?) and that MSDN specifically calls out that it need not be implemented:
The Reset method is provided for COM interoperability. It does not necessarily need to be implemented; instead, the implementer can simply throw a NotSupportedException.
and you get to cut out a lot of complexity without anyone really noticing.
As for requiring that it throw [2], I suppose it's just simpler than letting the implementor decide. IMO, it's a bit much to require the throw - there may be reasonable cases that a compiler (or other implementation [1]) could generate a Reset method for, but I don't see it as being a real problem either.
[1] Technically, the spec leaves open the possibility of other implementations:
An enumerator object is typically an instance of a compiler-generated enumerator class that encapsulates the code in the iterator block and implements the enumerator interfaces, but other methods of implementation are possible.
but I'm not aware of any other concrete implementations. Regardless, to be compliant, other implementations of an "enumerator object" would have to throw NotSupportedException as well.
[2] Nitpicker's corner: I think there may be some quibble even in the "requirement" to throw. The spec, in not using the preferred "MUST, SHOULD, MAY" verbiage, leaves it a bit open. I read "causes" more as a note of implementation - not a requirement. Then again, I haven't read the entire spec, so perhaps they define these terms a bit more or are more explicit somewhere else.
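To make the distinction in this answer concrete, here is a short sketch (the names ResetDemo and Numbers are hypothetical). An enumerator produced by an iterator block throws NotSupportedException from Reset, while an enumerator that was not generated from an iterator block, such as the one List<T> provides, is free to support it:

```csharp
using System;
using System.Collections.Generic;

public static class ResetDemo
{
    // Iterator block: the compiler generates the enumerator object.
    public static IEnumerator<int> Numbers()
    {
        yield return 1;
        yield return 2;
    }

    public static bool IteratorResetThrows()
    {
        IEnumerator<int> e = Numbers();
        e.MoveNext();
        try
        {
            e.Reset();
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }

    public static bool ListEnumeratorResetWorks()
    {
        // Boxed on purpose, so Reset acts on the same instance we read from.
        IEnumerator<int> e = new List<int> { 1, 2 }.GetEnumerator();
        e.MoveNext();
        e.MoveNext();          // now positioned on 2
        e.Reset();             // back to before the first element
        e.MoveNext();
        return e.Current == 1;
    }

    public static void Main()
    {
        Console.WriteLine(IteratorResetThrows());      // True
        Console.WriteLine(ListEnumeratorResetWorks()); // True
    }
}
```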
It is impossible to support properly in all sequences; many are once only (network streams, etc). And if you can't rely on it all the time, it is useless, as the abstraction is broken. Sure you could have an IResettableEnumerator, but Reset() on IEnumerator doesn't work. Frankly, it was a mistake (IMO).
I suspect it would also have made iterator blocks even more complicated (they are currently one of the two most complex parts of the compiler; I can't remember which is "top"; them, or anonymous methods / captured variables).
Here's what MSDN says:
The Reset method is provided for COM interoperability. It does not necessarily need to be implemented; instead, the implementer can simply throw a NotSupportedException.
MSDN: IEnumerator.Reset Method (http://msdn.microsoft.com/en-us/library/system.collections.ienumerator.reset.aspx)
It doesn't say it must, it just says it can.
EDIT:
However, as Marc has pointed out, there is a difference in the C# 2.0 spec:
22.2 Enumerator objects
Note that enumerator objects do not support the IEnumerator.Reset method. Invoking this method causes a System.NotSupportedException to be thrown.

Should I return an array or a collection from a function?

What's the preferred container type when returning multiple objects of the same type from a function?
Is it against good practice to return a simple array (like MyType[]), or should you wrap it in some generic container (like ICollection<MyType>)?
Thanks!
Eric Lippert has a good article on this. In case you can't be bothered to read the entire article, the answer is: return the interface.
Return an IEnumerable<T> using a yield return.
I would return an IList<T> as that gives the consumer of your function the greatest flexibility. That way if the consumer of your function only needed to enumerate the sequence they can do so, but if they want to use the sequence as a list they can do that as well.
My general rule of thumb is to accept the least restrictive type as a parameter and return the richest type I can. This is, of course, a balancing act as you don't want to lock yourself into any particular interface or implementation (but always, always try to use an interface).
This is the least presumptuous approach that you, the API developer, can take. It is not up to you to decide how a consumer of your function will use what they send you - that is why you would return an IList<T> in this case as to give them the greatest flexibility. Also for this same reason you would never presume to know what type of parameter a consumer will send you. If you only need to iterate a sequence sent to you as a parameter then make the parameter an IEnumerable<T> rather than a List<T>.
EDIT (monoxide): Since it doesn't look like the question is going to be closed, I just want to add a link from the other question about this: Why arrays are harmful
Why not List<T>?
From the Eric Lippert post mentioned by others, I thought I would highlight this:
If I need a sequence I’ll use IEnumerable<T>, if I need a mapping from contiguous numbers to data I’ll use a List<T>, if I need a mapping across arbitrary data I’ll use a Dictionary<K,V>, if I need a set I’ll use a HashSet<T>. I simply don’t need arrays for anything, so I almost never use them. They don’t solve a problem I have better than the other tools at my disposal.
A good piece of advice that I've oft heard quoted is this:
Be liberal in what you accept, precise in what you provide.
In terms of designing your API, I'd suggest you should be returning an Interface, not a concrete type.
Taking your example method, I'd rewrite it as follows:
public IList<object> Foo()
{
    List<object> retList = new List<object>();
    // Blah, blah, [snip]
    return retList;
}
The key is that your internal implementation choice - to use a List - isn't revealed to the caller, but you're returning an appropriate interface.
Microsoft's own guidelines on framework development recommend against returning specific types, favoring interfaces. (Sorry, I couldn't find a link for this)
Similarly, your parameters should be as general as possible - instead of accepting an array, accept an IEnumerable of the appropriate type. This is compatible with arrays as well as lists and other useful types.
Taking your example method again:
public IList<object> Foo(IEnumerable<object> bar)
{
    List<object> retList = new List<object>();
    // Blah, blah, [snip]
    return retList;
}
If the collection that is being returned is read-only, meaning you never want the elements in the collection to be changed, then use IEnumerable<T>. This is the most basic representation of a read-only sequence of immutable (at least from the perspective of the enumeration itself) elements.
If you want it to be a self-contained collection that can be changed, then use ICollection<T> or IList<T>.
For example, if you wanted to return the results of searching for a particular set of files, then return IEnumerable<FileInfo>.
However, if you wanted to expose the files in a directory, you would expose IList<FileInfo> or ICollection<FileInfo>, as it makes sense that you might want to change the contents of the collection.
return ICollection<type>
The advantage to generic return types, is that you can change the underlying implementation without changing the code that uses it. The advantage to returning the specific type, is you can use more type specific methods.
Always return an interface type that presents the greatest amount of functionality to the caller. So in your case ICollection<YourType> ought to be used.
Something interesting to note is that the BCL developers actually got this wrong in some places in the .NET Framework - see this Eric Lippert blog post for that story.
Why not IList<MyType>?
It supports direct indexing which is hallmark for an array without removing the possibility to return a List<MyType> some day. If you want to suppress this feature, you probably want to return IEnumerable<MyType>.
It depends on what you plan to do with the collection you're returning. If you're just iterating, or if you only want the user to iterate, then I agree with @Daniel: return IEnumerable<T>. If you actually want to allow list-based operations, however, I'd return IList<T>.
Use generics. It's easier to interoperate with other collections classes and the type system is more able to help you with potential errors.
The old style of returning an array was a crutch before generics.
Whatever makes your code more readable, more maintainable and easier for YOU.
I would have used the simple array, simpler==better most of the time.
Although I really have to see the context to give the right answer.
There are big advantages to favouring IEnumerable over anything else, as this gives you the greatest implementation flexibility and allows you to use yield return or Linq operators for lazy implementation.
If the caller wants a List<T> instead they can simply call ToList() on whatever you returned, and the overall performance will be roughly the same as if you had created and returned a new List<T> from your method.
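A minimal sketch of the lazy behaviour mentioned above, with hypothetical names (LazySequenceDemo, Squares). The method body does not run until the caller enumerates the result, and a caller who wants a list can materialize whatever part they need with ToList():

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazySequenceDemo
{
    // Lazily yields 1*1, 2*2, ... and reports each produced item
    // through the supplied callback.
    public static IEnumerable<int> Squares(int count, Action onProduced)
    {
        for (int i = 1; i <= count; i++)
        {
            onProduced();
            yield return i * i;
        }
    }

    // Returns how many items had been produced before and after
    // materializing the first three elements.
    public static int[] ProducedBeforeAndAfter()
    {
        int produced = 0;
        IEnumerable<int> lazy = Squares(1000, () => produced++);

        int before = produced;                        // nothing has run yet
        List<int> firstThree = lazy.Take(3).ToList(); // runs just far enough
        int after = produced;

        return new[] { before, after, firstThree.Count };
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ProducedBeforeAndAfter())); // 0, 3, 3
        Console.WriteLine(string.Join(", ", Squares(3, () => { })));    // 1, 4, 9
    }
}
```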
Array is harmful, but ICollection<T> is also harmful.
ICollection<T> cannot guarantee the object will be immutable.
My recommendation is to wrap the returning object with ReadOnlyCollection<T>
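One way this recommendation might look in practice (Inventory is a hypothetical class used only for illustration). List<T>.AsReadOnly() returns a ReadOnlyCollection<T> wrapper, so callers still get Count and an indexer, but attempts to modify the collection through the wrapper throw NotSupportedException:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical owner of an internal list.
public sealed class Inventory
{
    private readonly List<string> _items = new List<string> { "a", "b" };

    // Read-only view over the internal list; no copy is made.
    public ReadOnlyCollection<string> Items
    {
        get { return _items.AsReadOnly(); }
    }

    public void Add(string item) { _items.Add(item); }
}

public static class ReadOnlyDemo
{
    public static bool CallerCannotAdd()
    {
        var inventory = new Inventory();
        IList<string> asList = inventory.Items; // ReadOnlyCollection<T> implements IList<T>
        try
        {
            asList.Add("c");
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }

    public static int ViewTracksOwner()
    {
        var inventory = new Inventory();
        ReadOnlyCollection<string> view = inventory.Items;
        inventory.Add("c");  // the owner can still change the list
        return view.Count;   // the wrapper reflects the change
    }

    public static void Main()
    {
        Console.WriteLine(CallerCannotAdd()); // True
        Console.WriteLine(ViewTracksOwner()); // 3
    }
}
```

Note that the wrapper only prevents changes to the collection itself; it does not make the elements immutable.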

Which do you prefer for interfaces: T[], IEnumerable<T>, IList<T>, or other?

Ok, I'm hoping the community at large will aid us in solving a workplace debate that has been ongoing for a while. This has to do with defining interfaces that either accept or return lists of some type. There are several ways of doing this:
public interface Foo
{
    // Alternatives under discussion; a real interface would declare only one of these.
    Bar[] Bars { get; }
    IEnumerable<Bar> Bars { get; }
    ICollection<Bar> Bars { get; }
    IList<Bar> Bars { get; }
}
My own preference is to use IEnumerable for arguments and arrays for return values:
public interface Foo
{
    void Do(IEnumerable<Bar> bars);
    Bar[] Bars { get; }
}
My argument for this approach is that the implementation class can create a List directly from the IEnumerable and simply return it with List.ToArray(). However, some believe that IList should be returned instead of an array. The problem I have here is that now you're required again to copy it with a ReadOnlyCollection before returning. The option of returning IEnumerable seems troublesome for client code?
What do you use/prefer? (especially with regards to libraries that will be used by other developers outside your organization)
My preference is IEnumerable<T>. Any other of the suggested interfaces gives the appearance of allowing the consumer to modify the underlying collection. This is almost certainly not what you want to do as it's allowing consumers to silently modify an internal collection.
Another good one IMHO, is ReadOnlyCollection<T>. It allows for all of the fun .Count and Indexer properties and unambiguously says to the consumer "you cannot modify my data".
I don't return arrays - they really are a terrible return type to use when creating an API - if you truly need a mutable sequence use the IList<T> or ICollection<T> interface or return a concrete Collection<T> instead.
Also I would suggest that you read "Arrays considered somewhat harmful" by Eric Lippert:
I got a moral question from an author of programming language textbooks the other day requesting my opinions on whether or not beginner programmers should be taught how to use arrays. Rather than actually answer that question, I gave him a long list of my opinions about arrays, how I use arrays, how we expect arrays to be used in the future, and so on. This gets a bit long, but like Pascal, I didn't have time to make it shorter.
Let me start by saying when you definitely should not use arrays, and then wax more philosophical about the future of modern programming and the role of the array in the coming world.
For property collections that are indexed (and the indices have necessary semantic meaning), you should use ReadOnlyCollection<T> (read only) or IList<T> (read/write). It's the most flexible and expressive. For non-indexed collections, use IEnumerable<T> (read only) or ICollection<T> (read/write).
Method parameters should use IEnumerable<T> unless they 1) need to add/remove items to the collection (ICollection<T>) or 2) require indexes for necessary semantic purposes (IList<T>). If the method can benefit from indexing availability (such as a sorting routine), it can always try a cast with "as IList<T>" and fall back to .ToList() in the implementation when the cast fails.
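The fallback described above might look like the following sketch. MiddleElement is a hypothetical helper; it accepts the weakest interface but takes advantage of indexing when the argument already supports it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceHelpers
{
    // Accepts any sequence; uses the existing indexer when the argument
    // is already an IList<T>, and copies into a list only when it is not.
    public static T MiddleElement<T>(IEnumerable<T> items)
    {
        if (items == null) throw new ArgumentNullException("items");

        IList<T> list = items as IList<T> ?? items.ToList();
        if (list.Count == 0) throw new InvalidOperationException("Sequence is empty.");

        return list[list.Count / 2];
    }

    private static IEnumerable<int> LazyNumbers()
    {
        yield return 10;
        yield return 20;
        yield return 30;
        yield return 40;
    }

    public static void Main()
    {
        Console.WriteLine(MiddleElement(new[] { 1, 2, 3 }));          // 2 (array is an IList<int>)
        Console.WriteLine(MiddleElement(new List<string> { "x" }));   // x
        Console.WriteLine(MiddleElement(LazyNumbers()));              // 30 (copied via ToList)
    }
}
```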
I think about this in terms of writing the most useful code possible: code that can do more.
Put in those terms, it means I like to accept the weakest interface possible as method arguments, because that makes my code useful from more places. In this case, that's an IEnumerable<T>. Have an array? You can call my method. Have a List? You can call my method. Have an iterator block? You get the idea.
It also means I like my methods to return the strongest interface that is convenient, so that code that relies on the method can easily do more. In this case, that would be IList<T>. Note that this doesn't mean I will construct a list just so I can return it. It just means that if I already have something that implements IList<T>, I may as well use it.
Note that I'm a little unconventional with regards to return types. A more typical approach is to also return weaker types from methods to avoid locking yourself into a specific implementation.
I would prefer IEnumerable as it is the most high-level of the interfaces, giving the end user the opportunity to re-cast as he wishes. Even though this may provide the user with minimum functionality to begin with (basically only enumeration), it would still be enough to cover virtually any need, especially with the extension methods ToArray(), ToList() etc.
IEnumerable<T> is very useful for lazy-evaluated iteration, especially in scenarios that use method chaining.
But as a return type for a typical data access tier, a Count property is often useful, and I would prefer to return an ICollection<T> with a Count property or possibly IList<T> if I think typical consumers will want to use an indexer.
This is also an indication to the caller that the collection has actually been materialized. And thus the caller can iterate through the returned collection without getting exceptions from the data access tier. This can be important. For example, a service may generate a stream (e.g. SOAP) from the returned collection. It can be awkward if an exception is thrown from the data access layer while generating the stream due to lazy-evaluated iteration, as the output stream is already partially written when the exception is thrown.
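The point about exceptions surfacing mid-stream can be sketched as follows. The data-access methods here are hypothetical stand-ins, with the failure simulated by a throw. With a lazy IEnumerable<T> the failure appears only after some output has already been produced, whereas a materialized ICollection<T> fails before the caller writes anything:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredFailureDemo
{
    // Hypothetical lazy data access: the second read fails.
    public static IEnumerable<int> LoadLazily()
    {
        yield return 1;
        throw new InvalidOperationException("simulated data-access failure");
    }

    // Materializes everything up front, so any failure happens here.
    public static ICollection<int> LoadMaterialized()
    {
        return LoadLazily().ToList();
    }

    public static string Describe()
    {
        var log = new List<string>();

        IEnumerable<int> lazy = LoadLazily(); // no code in LoadLazily has run yet
        log.Add("lazy call returned");
        try
        {
            foreach (int x in lazy) log.Add("wrote " + x);
        }
        catch (InvalidOperationException)
        {
            log.Add("failed mid-stream");
        }

        try
        {
            LoadMaterialized();
            log.Add("materialized call returned");
        }
        catch (InvalidOperationException)
        {
            log.Add("failed before any output");
        }

        return string.Join("; ", log);
    }

    public static void Main()
    {
        // lazy call returned; wrote 1; failed mid-stream; failed before any output
        Console.WriteLine(Describe());
    }
}
```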
Since the Linq extension methods were added to IEnumerable<T>, I've found that my use of the other interfaces has declined considerably; probably around 80%. I used to use List<T> religiously as it had methods that accepted delegates for lazy evaluation like Find, FindAll, ForEach and the like. Since that's available through System.Linq's extensions, I've replaced all those references with IEnumerable<T> references.
I wouldn't go with an array; it's a type that allows modification yet doesn't have add/remove... kind of the worst of the pack. If I want to allow modifications, then I would use a type that supports add/remove.
When you want to prevent modifications, you are already wrapping or copying it, so I don't see what's wrong with an IEnumerable or a ReadOnlyCollection. I would go with the latter ... something I don't like about IEnumerable is that it's lazy by nature, yet when you are using it only to wrap pre-loaded data, calling code that works with it tends to either assume pre-loaded data or carry extra "unnecessary" lines :( ... which can give ugly results when things change.
