In C#, I was told that if a container class such as List(T) is first upcast to a container interface such as IEnumerable, and then subsequently iterated over using foreach then runtime garbage will be created. Also, even when fully downcast, I was told that iterating over Collection(T) also creates references on the heap. I understand that this is a result of a virtual call to GetEnumerator() which may return a reference or value type result.
Inspection of the MSDN docs for value types clearly lists all enumerations as value types. If an enumeration consists of an enumerator list, then aren't these enumerators value types as per the docs? Are they boxed? Or completely unrelated to each other but with similar names? Or something else entirely?
I'm not sure how to unify these two statements and I was hoping someone could explain it to me more plainly.
Thanks.
EDIT: Question rephrased, taking into account commenter suggestions on the use of words such as 'never' and 'unnecessary'.
Enumerations (enums) are unrelated to enumerators.
And
I was told that to avoid...
Seems like a very premature and unnecessary optimization.
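To make the enum/enumerator distinction and the boxing point concrete, here is a minimal sketch. The Color enum and the EnumVsEnumerator class are made-up names for illustration only. It assumes the usual behaviour of foreach: on a List<T> variable it binds to the public GetEnumerator() that returns the List<T>.Enumerator struct, while on an IEnumerable<T> variable it has to go through the interface method, whose return type is IEnumerator<T>, so the struct ends up boxed. Collection<T> is in the second camp regardless, because its GetEnumerator() is declared to return IEnumerator<T>.

```csharp
using System;
using System.Collections.Generic;

// An enumeration (enum): a value type with named constants. Nothing to do with iteration.
public enum Color { Red, Green, Blue }

// Hypothetical helper class, for illustration only.
public static class EnumVsEnumerator
{
    // Every enum is a value type.
    public static bool EnumIsValueType() => typeof(Color).IsValueType;

    // List<T>.Enumerator (an enumerator, not an enum) is declared as a struct.
    public static bool ListEnumeratorIsStruct() => typeof(List<int>.Enumerator).IsValueType;

    // Concrete type: foreach binds to List<int>.GetEnumerator(), which returns
    // the struct directly, so no enumerator object is allocated on the heap.
    public static int SumDirect(List<int> list)
    {
        int sum = 0;
        foreach (int x in list) sum += x;
        return sum;
    }

    // Interface type: foreach calls IEnumerable<int>.GetEnumerator(), whose
    // return type is IEnumerator<int>, so the struct enumerator is boxed.
    public static int SumViaInterface(IEnumerable<int> seq)
    {
        int sum = 0;
        foreach (int x in seq) sum += x;
        return sum;
    }
}
```

Both Sum methods give the same answer for the same list; the only difference is whether an enumerator object lands on the heap.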
Related
The compiler-generated implementation of IEnumerator / IEnumerable for yield methods and getters seems to be a class, and is therefore allocated on the heap. However, other .NET types such as List<T> specifically return struct enumerators to avoid useless memory allocation. From a quick overview of the C# In Depth post, I see no reason why that couldn't also be the case here.
Am I missing something?
Servy correctly answered your question -- a question you answered yourself in a comment:
I just realized that since the return type is an interface, it would get boxed anyway, is that right?
Right. Your follow up question is:
couldn't the method be changed to return an explicitly typed enumerator (like List<T> does)?
So your idea here is that the user writes:
IEnumerable<int> Blah() ...
and the compiler actually generates a method that returns BlahEnumerable which is a struct that implements IEnumerable<int>, but with the appropriate GetEnumerator etc methods and properties that allow the "pattern matching" feature of foreach to elide the boxing.
Though that is a plausible idea, there are serious difficulties involved when you start lying about the return type of a method. Particularly when the lie involves changing whether the method returns a struct or a reference type. Think of all the things that go wrong:
Suppose the method is virtual. How can it be overridden? The return type of a virtual override method must match exactly the overridden method. (And similarly for: the method overrides another method, the method implements a method of an interface, and so on.)
Suppose the method is made into a delegate Func<IEnumerable<int>>. Func<T> is covariant in T, but covariance only applies to type arguments of reference type. The code looks like it returns an IEnumerable<T> but in fact it returns a value type that is not covariance-compatible with IEnumerable<T>, only assignment compatible.
Suppose we have void M<T>(T t) where T : class and we call M(Blah()). We expect to deduce that T is IEnumerable<int>, which passes the constraint check, but the struct type does not pass the constraint check.
And so on. You rapidly end up in an episode of Three's Company (boy am I dating myself here) where a small lie ends up compounding into a huge disaster. All of this to save a small amount of collection pressure. Not worth it.
I note though that the implementation created by the compiler does save on collection pressure in one interesting way. The first time that GetEnumerator is called on the returned enumerable, the enumerable turns itself into an enumerator. The second time of course the state is different so it allocates a new object. Since the 99.99% likely scenario is that a given sequence is enumerated exactly once, this is a big savings on collection pressure.
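The reuse described above can be observed with a small probe. This is only a sketch: IteratorReuse and Probe are made-up names, Blah just mirrors the placeholder used earlier, and the behaviour is an implementation detail of the compiler-generated class rather than anything the language guarantees. As far as I know the generated code also checks that the call comes from the thread that created the enumerable, so the probe stays on one thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical probe class, for illustration only.
public static class IteratorReuse
{
    // Iterator method; the compiler generates a class implementing both
    // IEnumerable<int> and IEnumerator<int> for it.
    public static IEnumerable<int> Blah()
    {
        yield return 1;
        yield return 2;
    }

    // Reports whether the first and the second GetEnumerator() call on the
    // same enumerable handed back the enumerable object itself.
    public static (bool firstIsSelf, bool secondIsSelf) Probe()
    {
        IEnumerable<int> seq = Blah();
        using (IEnumerator<int> first = seq.GetEnumerator())
        using (IEnumerator<int> second = seq.GetEnumerator())
        {
            return (ReferenceEquals(seq, first), ReferenceEquals(seq, second));
        }
    }
}
```

With the Microsoft compiler I would expect firstIsSelf to be true and secondIsSelf to be false, which is the single-allocation saving for the enumerate-once case.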
That class will only ever be used through the interface. If it were a struct, it would be boxed 100% of the time, making it less efficient than using a class.
You can't not box it, as it is, by definition, impossible to use the type at compile time as it doesn't exist when you start compiling the code.
When writing a custom implementation of IEnumerator you can expose the actual underlying type before compiling the code, allowing it to be potentially used without being boxed.
I saw this reply from Jon on Initialize generic object with unknown type:
If you want a single collection to contain multiple unrelated types of values, however, you will have to use List<object>
I'm not comparing ArrayList vs List<>, but ArrayList vs List<object>, as both will be exposing elements of type object. What would be the benefit of using either one in this case?
EDIT: Type safety is not a concern here, since both classes expose object as the item type. One still needs to cast from object to the desired type. I'm more interested in anything other than type safety.
EDIT: Thanks Marc Gravell and Sean for the answer. Sorry, I can only pick 1 as answer, so I'll up vote both.
You'll be able to use the LINQ extension methods directly with List<object>, but not with ArrayList, unless you inject a Cast<object>() / OfType<object> (thanks to IEnumerable<object> vs IEnumerable). That's worth quite a bit, even if you don't need type safety etc.
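A small sketch of the LINQ difference, in case it helps. LinqOnLists and CountStrings are made-up names; Count, Cast<T>() and OfType<T>() are the standard System.Linq extension methods.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class LinqOnLists
{
    // List<object> implements IEnumerable<object>, so LINQ operators apply directly.
    public static int CountStrings(List<object> items) =>
        items.Count(x => x is string);

    // ArrayList only implements the non-generic IEnumerable, so it needs
    // Cast<object>() (or OfType<T>()) before any other LINQ operator.
    public static int CountStrings(ArrayList items) =>
        items.Cast<object>().Count(x => x is string);
}
```

Leaving out the Cast<object>() call in the second method simply does not compile, because Count is defined on IEnumerable<T> and not on IEnumerable.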
The speed will be about the same; structs will still be boxed, etc - so there isn't much else to tell them apart. Except that I tend to see ArrayList as "oops, somebody is writing legacy code again..." ;-p
One big benefit to using List<object> is that these days most code is written to use the generic classes/interfaces. I suspect that these days most people would write a method that takes an IList<object> instead of an IList. Since ArrayList doesn't implement IList<object>, you wouldn't be able to use an ArrayList in these scenarios.
I tend to think of the non-generic classes/interfaces as legacy code and avoid them whenever possible.
In this case, ArrayList vs. List<Object>, you won't notice any difference in speed. There might be some differences in the actual methods available on each of these, particularly in .NET 3.5 and counting extension methods, but that has more to do with ArrayList being somewhat deprecated than anything else.
Yes, besides being typesafe, generic collections might actually be faster.
From the MSDN (http://msdn.microsoft.com/en-us/library/system.collections.generic.aspx)
The System.Collections.Generic namespace contains interfaces and classes that define generic collections, which allow users to create strongly typed collections that provide better type safety and performance than non-generic strongly typed collections.
Do some benchmarking and you will know what performs best. I guesstimate that the difference is very small.
List<> is a typesafe version of ArrayList. It will guarantee that you will get the same object type in the collection.
In a coding standards document, I found this statement:
Avoid using foreach to iterate over immutable value-type collections. E.g. String arrays.
Why should this be avoided ?
You shouldn't avoid it. The coding standard document you're reading is talking nonsense. Try to find the author and ask him to explain.
Aside from anything else, string is a reference type and arrays are always mutable... this makes me concerned about the quality of the rest of the document, to be honest. Are there any other suspicious recommendations?
(It's possible that "immutable" was meant to refer to the value type rather than the collection - the fact that it's ambiguous is another worrying sign, IMO.)
I think that the reason for that statement is that it was written prior to .NET 2.0.
When using foreach in .NET 1.x it was using the IEnumerable interface (as the IEnumerable<T> interface didn't exist yet.) When iterating over a value type collection, the enumerator would box each item to be able to return it as an object reference, then the foreach code had to unbox it.
A string array is of course not an example of an array of value types. An integer array is.
I agree with Jon that the advice makes little sense. My guess is that the author discovered that you cannot change the value of the current item when iterating. However, if you're iterating a collection of reference types you can still modify the object the current item points to. Perhaps (s)he concluded that iteration was somehow broken for value type collections.
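The difference guessed at above looks roughly like this. It is only a sketch with made-up types (PointStruct, PointClass, ForeachMutation): mutating through the iteration variable works for reference-type elements, while for value-type elements the iteration variable is a read-only copy, so writing p.X++ inside the foreach is rejected by the compiler and an indexed for loop over the array is the usual workaround.

```csharp
using System.Collections.Generic;

// Hypothetical element types, for illustration only.
public struct PointStruct { public int X; }
public class PointClass { public int X; }

public static class ForeachMutation
{
    // Reference-type elements: p refers to the object stored in the list,
    // so the change is visible through the list afterwards.
    public static void BumpAll(List<PointClass> points)
    {
        foreach (PointClass p in points) p.X++;
    }

    // Value-type elements: "foreach (PointStruct p in points) p.X++;" does not
    // compile, because p is a read-only copy. Indexing the array in a for
    // loop modifies the stored elements in place instead.
    public static void BumpAll(PointStruct[] points)
    {
        for (int i = 0; i < points.Length; i++) points[i].X++;
    }
}
```

Nothing here is broken for value types; foreach is just a read-only view of the elements, which is a separate matter from whether it should be avoided.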
I am running through some tests about using ArrayLists and List.
Speed is very important in my app.
I have tested creating 10000 records in each, finding an item by index and then updating that object for example:
List[i] = newX;
Using the arraylist seems much faster. Is that correct?
UPDATE:
Using the List[i] approach, for my List<T> approach I am using LINQ to find the index, e.g.:
....
int index = base.FindIndex(x => x.AlpaNumericString == "PopItem");
base[index] = UpdatedItem;
It is definitely slower than
int index = base.IndexOf("PopItem");
base[index] = UpdatedItem;
A generic List (List<T>) should always be quicker than an ArrayList.
Firstly, an ArrayList is not strongly-typed and accepts types of object, so if you're storing value types in the ArrayList, they are going to be boxed and unboxed every time they are added or accessed.
A Generic List can be defined to accept only (say) ints, so no boxing or unboxing needs to occur when adding/accessing elements of the list.
If you're dealing with reference types, you're probably still better off with a Generic List over an ArrayList. Although there's no boxing/unboxing going on in either case, your Generic List is type-safe, and you avoid the implicit (or explicit) casts that are required when retrieving your strongly-typed object from the ArrayList's "collection" of object types.
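Here is a minimal sketch of where the boxing and the casts show up. BoxingDemo and its method names are made up; it is not a benchmark, just the two code shapes side by side.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical demo class, for illustration only.
public static class BoxingDemo
{
    // ArrayList stores object: each int is boxed on Add and has to be
    // cast back (unboxed) when it is read.
    public static int SumArrayList(int count)
    {
        var list = new ArrayList();
        for (int i = 0; i < count; i++) list.Add(i);               // boxes i
        int sum = 0;
        for (int i = 0; i < list.Count; i++) sum += (int)list[i];  // unboxes
        return sum;
    }

    // List<int> stores the ints directly: no boxing and no casts.
    public static int SumGenericList(int count)
    {
        var list = new List<int>();
        for (int i = 0; i < count; i++) list.Add(i);
        int sum = 0;
        for (int i = 0; i < list.Count; i++) sum += list[i];
        return sum;
    }
}
```

To compare the two for speed, time both methods with a realistic count on your own machine rather than trusting a general claim.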
There may be some edge-cases where an ArrayList is faster performing than a Generic List, however, I (personally) have not yet come across one. Even the MSDN documentation states:
Performance Considerations
In deciding whether to use the List<T> or ArrayList class, both of which have similar functionality, remember that the List<T> class performs better in most cases and is type safe. If a reference type is used for type T of the List<T> class, the behavior of the two classes is identical. However, if a value type is used for type T, you need to consider implementation and boxing issues.
If a value type is used for type T, the compiler generates an implementation of the List<T> class specifically for that value type. That means a list element of a List<T> object does not have to be boxed before the element can be used, and after about 500 list elements are created the memory saved not boxing list elements is greater than the memory used to generate the class implementation.
Make certain the value type used for type T implements the IEquatable<T> generic interface. If not, methods such as Contains must call the Object.Equals(Object) method, which boxes the affected list element. If the value type implements the IComparable interface and you own the source code, also implement the IComparable<T> generic interface to prevent the BinarySearch and Sort methods from boxing list elements. If you do not own the source code, pass an IComparer<T> object to the BinarySearch and Sort methods.
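The IEquatable<T> advice in that quote translates to something like the following sketch. Money and EquatableDemo are made-up names; the point is only that a struct with a strongly typed Equals lets List<T>.Contains compare elements without falling back to Object.Equals(Object).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value type, for illustration only.
public struct Money : IEquatable<Money>
{
    public readonly decimal Amount;
    public Money(decimal amount) { Amount = amount; }

    // Strongly typed comparison: List<Money>.Contains and IndexOf pick this up
    // through EqualityComparer<Money>.Default, so elements are not boxed.
    public bool Equals(Money other) => Amount == other.Amount;

    // Keep the object-based members consistent with the typed one.
    public override bool Equals(object obj) => obj is Money m && Equals(m);
    public override int GetHashCode() => Amount.GetHashCode();
}

public static class EquatableDemo
{
    public static bool HasAmount(List<Money> wallet, decimal amount) =>
        wallet.Contains(new Money(amount));
}
```

The same idea applies to IComparable<T> for Sort and BinarySearch.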
Moreover, I particularly like the very last section of that paragraph, which states:
It is to your advantage to use the type-specific implementation of the List<T> class instead of using the ArrayList class or writing a strongly typed wrapper collection yourself. The reason is your implementation must do what the .NET Framework does for you already, and the common language runtime can share Microsoft intermediate language code and metadata, which your implementation cannot.
Touché! :)
Based on your recent edit it seems as though you're not performing a 1:1 comparison here. In the List you have a class object and you're looking for the index based on a property, whereas in the ArrayList you just store the values of that property. If so, this is a severely flawed comparison.
To make it a 1:1 comparison you would add the values to the list only, not the class. Or, you would add the class items to the ArrayList. The former would allow you to use IndexOf on both collections. The latter would entail looping through your entire ArrayList and comparing each item till a match was found (and you could do the same for the List), or overriding object.Equals since ArrayList uses that for comparison.
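A like-for-like version of the second option (class items in both collections) might look like this sketch. Item and LikeForLike are made-up names, and the property name simply follows the code in the question.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical record type; the property name follows the question's code.
public class Item
{
    public string AlpaNumericString { get; set; }
}

public static class LikeForLike
{
    // List<T>: search by property with a predicate.
    public static int Find(List<Item> items, string key) =>
        items.FindIndex(x => x.AlpaNumericString == key);

    // ArrayList holding the same objects: there is no predicate overload,
    // so loop and cast each element to compare the same property.
    public static int Find(ArrayList items, string key)
    {
        for (int i = 0; i < items.Count; i++)
        {
            if (((Item)items[i]).AlpaNumericString == key) return i;
        }
        return -1;
    }
}
```

Timing these two against each other says something about the collections; timing FindIndex over objects against IndexOf over bare strings mostly measures the difference in the work being done.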
For an interesting read, I suggest taking a look at Rico Mariani's post: Performance Quiz #7 -- Generics Improvements and Costs -- Solution. Even in that post Rico also emphasizes the need to benchmark different scenarios. No blanket statement is issued about ArrayLists, although the general consensus is to use generic lists for performance, type safety, and having a strongly typed collection.
Another related article is: Why should I use List and not ArrayList.
ArrayList seems faster? According to the documentation ( http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx ) List should be faster when using a value type, and the same speed when using a reference type. ArrayList is slower with value types because it needs to box/unbox the values when you're accessing them.
I would expect them to be about the same if they are reference types. There is an extra cast/type-check for ArrayList, but nothing huge. Of course, List<T> should be preferred. If speed is the primary concern (which it almost always isn't, at least not in this way), then you might also want to profile an array (T[]) - harder (=more expensive) to add/remove, of course - but if you are just querying/assigning by index, it should be the fastest. I have had to resort to arrays for some very localised performance critical work, but 99.95% of the time this is overkill and should be avoided.
For example, for any of the 3 approaches (List<T>/ArrayList/T[]) I would expect the assignment cost to be insignificant compared to the cost of newing up the new instance to put into the storage.
Marc Gravell touched on this in his answer - I think it needs to be stressed.
It is usually a waste of time to prematurely optimize your code!
A better approach is to do a simple, well designed first implementation, and test it with anticipated real world data loads.
Often, you will find that it's "fast enough". (It helps to start out with a clear definition of "fast enough" - e.g. "Must be able to find a single CD in a 10,000 CD collection in 3 seconds or less")
If it's not, put a profiler on it. Almost invariably, the bottleneck will NOT be where you expect.
(I learned this the hard way when I brought a whole app to its knees with a single badly chosen string concatenation.)
I noticed that List<T> defines its enumerator as a struct, while ArrayList defines its enumerator as a class. What's the difference? If I am to write an enumerator for my class, which one would be preferable?
EDIT: My requirements cannot be fulfilled using yield, so I'm implementing an enumerator of my own. That said, I wonder whether it would be better to follow the lines of List<T> and implement it as a struct.
Like the others, I would choose a class. Mutable structs are nasty. (And as Jared suggests, I'd use an iterator block. Hand-coding an enumerator is fiddly to get right.)
See this thread for an example of the list enumerator being a mutable struct causing problems...
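For a feel of what goes wrong, here is a minimal sketch using the real List<int>.Enumerator (MutableStructPitfall and Probe are made-up names). Assigning a struct copies it, so the copy keeps its own position and does not follow the original.

```csharp
using System.Collections.Generic;

// Hypothetical probe class, for illustration only.
public static class MutableStructPitfall
{
    // Returns (original.Current, copy.Current) after advancing only the original.
    public static (int original, int copy) Probe()
    {
        var list = new List<int> { 10, 20, 30 };
        List<int>.Enumerator original = list.GetEnumerator();
        original.MoveNext();                    // original is now at 10
        List<int>.Enumerator copy = original;   // struct assignment copies the state
        original.MoveNext();                    // only the original advances to 20
        return (original.Current, copy.Current);
    }
}
```

Probe() returns (20, 10). With a class enumerator both variables would refer to the same object and report the same element, which is what most people expect.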
The easiest way to write an enumerator in C# is with the "yield return" pattern. For example:
public IEnumerator<int> Example() {
    yield return 1;
    yield return 2;
}
This pattern will generate all of the enumerator code under the hood. This takes the decision out of your hands.
The reason List<T> uses a struct enumerator is to prevent garbage generation in foreach statements. This is a pretty good reason, especially if you are programming for the Compact Framework, because CF doesn't have a generational GC and is usually used on low-performance hardware, where garbage can quickly lead to performance issues.
Also, I don't think mutable structs are the source of the problems in the examples some have posted; the source is programmers who don't have a good understanding of how value types work.
An enumerator is inherently a changing structure, since it needs to update internal state to move on to the next value in the original collection.
In my opinion, structs should be immutable, so I would use a class.
Write it using yield return.
As to why you might otherwise choose between class or struct, if you make it a struct then it gets boxed as soon as it is returned as an interface, so making it a struct just causes additional copying to take place. Can't see the point of that!
Any implementation of IEnumerable<T> should return a class. It may be useful for performance reasons to have a GetEnumerator method which returns a struct which provides the methods necessary for enumeration but does not implement IEnumerator<T>; this method should be different from IEnumerable<T>.GetEnumerator, which should then be implemented explicitly.
Using this approach will allow for enhanced performance when the class is enumerated using a foreach or "For Each" loop in C# or vb.net or any context where the code which is doing the enumeration will know that the enumerator is a struct, but avoid the pitfalls that would otherwise occur when the enumerator gets boxed and passed by value.
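A sketch of that layout, with made-up names (IntBag, IntBagDemo): the public GetEnumerator returns a struct that has MoveNext and Current but does not implement IEnumerator<int>, and the interface methods are implemented explicitly and hand back an ordinary class-based enumerator (here an iterator block).

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection wrapping an array, for illustration only.
public sealed class IntBag : IEnumerable<int>
{
    private readonly int[] _items;
    public IntBag(params int[] items) { _items = items; }

    // Has MoveNext/Current but implements no interface, so it can only be
    // used where its concrete type is known and is therefore never boxed.
    public struct Enumerator
    {
        private readonly int[] _items;
        private int _index;
        internal Enumerator(int[] items) { _items = items; _index = -1; }
        public int Current => _items[_index];
        public bool MoveNext() => ++_index < _items.Length;
    }

    // foreach over an IntBag variable binds to this method by pattern.
    public Enumerator GetEnumerator() => new Enumerator(_items);

    // Interface callers get a class-based enumerator generated by the compiler.
    IEnumerator<int> IEnumerable<int>.GetEnumerator()
    {
        foreach (int x in _items) yield return x;
    }

    IEnumerator IEnumerable.GetEnumerator() =>
        ((IEnumerable<int>)this).GetEnumerator();
}

public static class IntBagDemo
{
    public static int SumDirect(IntBag bag)
    {
        int sum = 0;
        foreach (int x in bag) sum += x;   // uses the struct enumerator
        return sum;
    }

    public static int SumViaInterface(IEnumerable<int> seq)
    {
        int sum = 0;
        foreach (int x in seq) sum += x;   // uses the explicit interface implementation
        return sum;
    }
}
```

Both paths produce the same sequence; the struct path just avoids the allocation when the caller knows the concrete type.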
There's a couple of blog posts that cover exactly this issue. Basically, enumerator structs are a really really bad idea...
To expand on @Earwicker: you're usually better off not writing an enumerator type, and instead using yield return to have the compiler write it for you. This is because there are a number of important subtleties that you might miss if you do it yourself.
See SO question "What is the yield keyword used for in C#?" for some more details on how to use it.
Also Raymond Chen has a series of blog posts ("The implementation of iterators in C# and its consequences": parts 1, 2, 3, and 4) that show you how to implement an iterator properly without yield return, which shows just how complex it is, and why you should just use yield return.