Regarding collections implementing this[int], and assuming the collection won't change during enumeration, does the foreach (var item in list) loop always produce the same sequence as for (var i = 0; i < list.Count; ++i)?
In other words, when I need ascending order by index, can I use foreach, or is it simply safer to use for? Or does it depend on the current collection implementation, and might it vary or change over time?
foreach (var item in list)
{
// do things
}
translates (roughly, ignoring the try/finally that disposes the enumerator) to
var enumerator = list.GetEnumerator();
while(enumerator.MoveNext())
{
var item = enumerator.Current;
// do things
}
So as you can see, it's not using the indexer list[i] in the general case.
For most collection types, however, the semantics are the same.
edit
There are IList<T> implementations where the enumerator doesn't use the indexer at all. If you implement IList<T> as a linked list, it's very unlikely you will use the indexer in your enumerator implementation, as it would be very inefficient.
As a rule of thumb, using foreach ensures you use the most efficient algorithm for the class at hand, as it is the one chosen by the class's creator. In the worst case, you will just suffer a small indirection overhead that is very unlikely to be noticeable.
edit 2 after nos's comment
There is a case where the semantics of the two constructs vary wildly: modifying the collection during iteration.
With a simple for loop, nothing in particular happens if you change the collection while iterating through it. The program behaves as if it assumed you know what you're doing. This could result in some values being iterated over more than once and others skipped, but no exception, as long as you don't access outside the range of the indexer (which would require a multithreaded program to happen).
With a foreach loop, if you modify the collection while iterating through it, you enter undefined behavior. The documentation tells us:
An enumerator remains valid as long as the collection remains
unchanged. If changes are made to the collection, such as adding,
modifying, or deleting elements, the enumerator is irrecoverably
invalidated and its behavior is undefined.
In that case, expect most of the C# built-in types to throw an InvalidOperationException, but anything can happen in a custom implementation, from missed values to repeated values, including infinite loops...
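A minimal sketch of the difference (using List<int>; List<T>'s enumerator is one of the built-in types that throws):
using System;
using System.Collections.Generic;

var numbers = new List<int> { 1, 2, 3, 4 };

// for: no exception; removal shifts later elements, so the item that
// moves into slot i is silently skipped.
for (int i = 0; i < numbers.Count; i++)
{
    if (numbers[i] % 2 == 0)
        numbers.RemoveAt(i);
}

// foreach: List<T>'s enumerator notices the modification and throws
// InvalidOperationException on the next MoveNext().
try
{
    foreach (var n in numbers)
    {
        if (n == 1)
            numbers.Remove(n);
    }
}
catch (InvalidOperationException)
{
    // "Collection was modified; enumeration operation may not execute."
}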
Generally speaking, yes, but strictly speaking: no. It really depends on the implementation.
Usually with for you would use the this[int] indexer. foreach uses GetEnumerator() to get the enumerator that iterates over the collection. Depending on the implementation, the enumerator might yield a different sequence than the indexer.
The implied logic of a list is that it has a specific order, and when implementing IList<T> it is reasonably safe to assume that the indexer and the enumerator traverse that order the same way.
There is no guarantee that this would be the case. The code paths can be completely separate. Of course, collections like List<T> will produce the same result, but you can write data structures (even useful ones) that do not.
The indexer is just a property with an additional index argument. You can return a random value if you feel like it.
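To make that concrete, here's a contrived, purely hypothetical collection whose indexer and enumerator legally disagree:
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection: the indexer reads front to back,
// but the enumerator walks back to front. Both are legal.
class BackwardsList<T> : IEnumerable<T>
{
    private readonly List<T> items = new List<T>();

    public void Add(T item) => items.Add(item);

    public T this[int index] => items[index];

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = items.Count - 1; i >= 0; i--)
            yield return items[i];
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

// For contents { 1, 2, 3 }: list[0] is 1, but foreach yields 3, 2, 1.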
One important thing you should keep in mind as a difference between the two is that inside foreach you can't make any changes to the enumerated collection.
If you wish to alter (basically, delete) objects from the collection while iterating, you must use a for loop.
Related
In C#, the IEnumerator interface defines a way to traverse a collection and look at the elements. I think this is tremendously useful because if you pass IEnumerable<T> to a method, it's not going to modify the original source.
However, in Java, Iterator defines the remove operation to (optionally!) allow deleting elements. There's no advantage in passing Iterable<T> to a method because that method can still modify the original collection.
remove's optionality is an example of the refused bequest smell, but ignoring that (already discussed here), I'd be interested in the design decisions that prompted a remove method to be implemented on the interface.
What are the design decisions that led to remove being added to Iterator?
To put another way, what is the C# design decision that explicitly doesn't have remove defined on IEnumerator?
Iterator is able to remove elements during iteration. You cannot iterate over a collection using an iterator and remove elements via the target collection's own remove() method: you will get a ConcurrentModificationException on the next call of Iterator.next(), because the iterator cannot know how exactly the collection was changed and therefore cannot know how to continue iterating.
When you use the iterator's remove(), it knows how the collection was changed. Moreover, you cannot remove an arbitrary element of the collection, only the current one, which simplifies continuing the iteration.
Concerning the advantages of passing an Iterator or Iterable: you can always use Collections.unmodifiableSet() or Collections.unmodifiableList() to prevent modification of your collection.
It is probably due to the fact that removing items from a collection while iterating over it has always been a cause of bugs and strange behaviour. From reading the documentation, it seems Java enforces at runtime that remove() is called only once per call to next(), which makes me think it was added to prevent people from messing up removal from a list while iterating over it.
There are situations where you want to be able to remove elements using the iterator because it is the most efficient way to do it. For example, when traversing a linked data structure (e.g. a linked list), removing using the iterator is an O(1) operation ... compared to O(N) via the List.remove() operations.
And of course, many collections are designed so that modifying the collection during iteration by any means other than Iterator.remove() will result in a ConcurrentModificationException.
If you have a situation where you don't want to allow modification via a collection iterator, wrapping it using Collections.unmodifiableXxxx and using its iterator will have the desired effect. Alternatively, I think Apache Commons provides a simple unmodifiable iterator wrapper.
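For comparison, a sketch of the C# analogue using List<T>.AsReadOnly():
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

var list = new List<int> { 1, 2, 3 };
ReadOnlyCollection<int> view = list.AsReadOnly();

foreach (var n in view)
    Console.WriteLine(n);          // enumeration works as usual

// ((IList<int>)view).Add(4);     // would throw NotSupportedException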
By the way, IEnumerator suffers from the same "smell" as Iterator. Take a look at the Reset() method. I was also curious as to how the C# LinkedList class deals with the O(N) remove problem. It appears that it does so by exposing the internals of the list, in the form of the First and Last properties, whose values are LinkedListNode references. That violates another design principle, and is (IMO) far more dangerous than Iterator.remove().
This is actually an awesome feature of Java. As you may well know, when iterating through a list in .NET to remove elements (for which there are a number of use cases), you only have two options.
var listToRemove = new List<T>(); // start empty; copying originalList here would end up removing every item
foreach (var item in originalList)
{
...
if (...)
{
listToRemove.Add(item);
}
...
}
foreach (var item in listToRemove)
{
originalList.Remove(item);
}
or
var iterationList = new List<T>(originalList);
for (int i = 0; i < iterationList.Count; i++)
{
...
if (...)
{
originalList.Remove(iterationList[i]); // remove by value: indices in originalList shift after each removal
}
...
}
Now, I prefer the second, but with Java I don't need all of that because while I'm on an item I can remove it and yet the iteration will continue! Honestly, though it may seem out of place, it's really an optimization in a lot of ways.
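(Side note: for List<T> specifically, RemoveAll does this in one call and avoids the copy entirely; ShouldRemove below is just a stand-in for whatever the if (...) test is.)
originalList.RemoveAll(item => ShouldRemove(item));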
Suppose I have a given collection. Without ever changing the collection in any way, I loop through its contents twice with a foreach. Barring cosmic rays and what not, is it absolutely guaranteed that the order will be consistent in both loops?
Alternatively, given a HashSet<string> with a number of elements, what can cause the output from the commented lines in the following to be unequal:
public void Demo() {
var mySet = new HashSet<string>();
// Some code which populates the HashSet<string>
// Output1
printContents(mySet);
// Output2
printContents(mySet);
}
public void printContents(HashSet<string> set) {
foreach(var element in set) {
Console.WriteLine(element);
}
}
It would be helpful if I could get a general answer explaining what causes an implementation to not meet the criteria described above. Specifically, though, I am interested in Dictionary, List and arrays.
Array enumeration guarantees order.
List and List<T> are expected to provide stable order (since they are expected to implement sequentially-indexed elements).
Dictionary and HashSet explicitly do not guarantee order. It is very unlikely that two back-to-back iterations will return items in different orders, but there are no guarantees or expectations. One should not expect any particular order.
Sorted versions of Dictionary/HashSet return items in sort order.
Other IEnumerable objects are free to do whatever they want. Normally one implements iterators in such a way that they match the user's expectations: enumeration of something that has an implicit order should be stable, and if an explicit order is provided, it is expected to be stable. A query to a database that does not specify an order should be expected to return items in semi-random order.
Check this question for links: Does the foreach loop in C# guarantee an order of evaluation?
Everything that implements IEnumerable<T> does so in its own way. There is no general guarantee that any given collection must ensure stability.
If you are referring specifically to Collection<T> (http://msdn.microsoft.com/en-us/library/ms132397.aspx) I don't see any specific guarantee in its MSDN reference that ordering is consistent.
Will it probably be consistent? Yes. Is there a written guarantee? Not that I can find.
For many of the C# collections there are sorted versions: for instance, a HashSet is to a SortedSet as a Dictionary is to a SortedDictionary. If you're working with something where the order isn't important, like the Dictionary, then you can't assume the loop order will behave the same way every time.
As per your example with HashSet<T>, we now have source code to check: HashSet:Enumerator
As it is, the Slot[] set.m_slots array is iterated.
The array object is only changed in the methods TrimExcess, Initialize (both of which are only called in the constructor), OnDeserialization, and SetCapacity (only called by AddIfNotPresent and AddOrGetLocation).
The values of m_slots are only changed in methods that change the elements of the HashSet (Clear, Remove, AddIfNotPresent, IntersectWith, SymmetricExceptWith).
So yes, if nothing touches the set, it enumerates in the same order.
Dictionary:Enumerator works in quite the same way, iterating an Entry[] entries that only changes when such non-readonly methods are called.
I came across a method to change a list inside a foreach loop, by iterating over a copy produced by ToList(), like this:
foreach (var item in myList.ToList())
{
//add or remove items from myList
}
(If you attempt to modify myList directly while enumerating it, an InvalidOperationException is thrown, since the enumerator detects the modification.)
This works because it's not the original myList that's being enumerated. My question is: does this method create garbage when the loop is over (namely, the List that's returned from the ToList method)? For small loops, would it be preferable to use a for loop to avoid the creation of garbage?
The second list is going to be garbage, as is the enumerator used to build it; add in the enumerator that the foreach spawns, which you would have had with or without the second list.
Should you switch to a for? Maybe, if you can point to this region of code being a true performance bottleneck. Otherwise, code for simplicity and maintainability.
Yes. ToList() would create another list that would need to be garbage collected.
That's an interesting technique which I will keep in mind for the future! (I can't believe I've never thought of that!)
Anyway, yes, the list that you are building doesn't magically unallocate itself. The possible performance problems with this technique are:
Increased memory usage (building a List, separate from the IEnumerable). Probably not that big of a deal, unless you do this very frequently, or the IEnumerable is very large.
Decreased speed, since it has to go through the whole IEnumerable up front to build the List.
Also, if enumerating the IEnumerable has side effects, they will all be triggered by this process (see the sketch below).
Unless this is actually inside an inner loop, or you're working with very large data sets, you can probably do this without any problems.
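To see the side-effect point concretely, a small sketch (names are illustrative):
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<int> Numbers()
{
    for (int i = 0; i < 3; i++)
    {
        Console.WriteLine($"yielding {i}");  // the side effect
        yield return i;
    }
}

// ToList() drains the iterator immediately: all three "yielding" lines
// print here, before the loop body sees a single item.
foreach (var n in Numbers().ToList())
{
    // ...
}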
Yes, the ToList() method creates "garbage". I would just use indexing.
for (int i = MyList.Count - 1; 0 <= i; --i)
{
var item = MyList[i];
// add or remove items from MyList; iterating backwards keeps the remaining indices valid
}
Garbage collection is non-deterministic, but the list created by the ToList() call will be collected eventually.
I wouldn't worry about it too much, since all it would be holding at most would be references or small value types.
I am exploring the HashSet<T> type, but I don't understand where it stands in collections.
Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see how to access individual elements.
Is it only for enumeration?
The important thing about HashSet<T> is right there in the name: it's a set. The only things you can do with a single set are to establish what its members are and to check whether an item is a member.
Asking if you can retrieve a single element (e.g. set[45]) is misunderstanding the concept of the set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.
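You can check this directly with SetEquals, which compares membership only:
var a = new HashSet<int> { 1, 2, 3 };
var b = new HashSet<int> { 2, 3, 1 };
Console.WriteLine(a.SetEquals(b));  // True: same membership, hence the same set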
It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.
Sets are really limited, with unique members only. On the other hand, they're really fast.
Here's a real example of where I use a HashSet<string>:
Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell whether a # or \ command is valid, to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a #xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check (see the sketch after this list). I really don't care about anything except the existence of the command in the set of valid commands. Let's look at the alternatives I faced:
Dictionary<string, ?>: What type do I use for the value? The value is meaningless, since I'm just going to use ContainsKey. Note: before .NET 3.5 this was the only choice for O(1) lookups; HashSet<T> was added in 3.5 and extended to implement ISet<T> in 4.0.
List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
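The lookup itself then reduces to something like this sketch (the names are illustrative, not the actual highlighter code):
// Illustrative subset; the real command list is much longer.
var validCommands = new HashSet<string> { "param", "return", "see", "brief" };

// O(1) membership test decides the highlight color.
string ColorFor(string tokenText) =>
    validCommands.Contains(tokenText) ? "gray" : "red";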
A HashSet<T> implements the ICollection<T> interface:
public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
// Methods
void Add(T item);
void Clear();
bool Contains(T item);
void CopyTo(T[] array, int arrayIndex);
bool Remove(T item);
// Properties
int Count { get; }
bool IsReadOnly { get; }
}
A List<T> implements IList<T>, which extends ICollection<T>:
public interface IList<T> : ICollection<T>
{
// Methods
int IndexOf(T item);
void Insert(int index, T item);
void RemoveAt(int index);
// Properties
T this[int index] { get; set; }
}
A HashSet has set semantics, implemented via a hashtable internally:
A set is a collection that contains no
duplicate elements, and whose elements
are in no particular order.
What does the HashSet gain, if it loses index/position/list behavior?
Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).
A HashSet's behavior could be compared to using a Dictionary<TKey, TValue> where you only add/remove keys and ignore the dictionary values themselves. You would expect the keys in a dictionary to have no duplicates, and that's the point of the "Set" part.
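Side by side, the comparison looks like this (byte is just an arbitrary throwaway value type):
// The old Dictionary-as-set workaround: the byte value is meaningless filler.
var seen = new Dictionary<string, byte>();
seen["apple"] = 0;
bool known = seen.ContainsKey("apple");

// HashSet<T> states the same intent directly.
var set = new HashSet<string>();
set.Add("apple");
bool alsoKnown = set.Contains("apple");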
Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates - and that's when you want a Set.
HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set cannot be used to replace a list (unless you should've used a set in the first place).
If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10,000 revisions of a software project, and you want to find out how many people contributed to it. You could use a HashSet<string> and iterate over the list of revisions, adding each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
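A sketch of that, assuming a hypothetical Revision type with an Author property:
// revisions stands in for the list of 10,000 revisions; Revision.Author is assumed.
var authors = new HashSet<string>();
foreach (var revision in revisions)
    authors.Add(revision.Author);   // duplicates are silently ignored (Add returns false)

Console.WriteLine(authors.Count);   // the number of distinct contributors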
HashSet would be used to remove duplicate elements in an IEnumerable collection. For example,
List<string> duplicatedEnumerableStrings = new List<string> { "abc", "ghjr", "abc", "abc", "yre", "obm", "ghjr", "qwrt", "abc", "vyeu" };
HashSet<string> uniqueStrings = new HashSet<string>(duplicatedEnumerableStrings);
after this code runs, uniqueStrings holds { "abc", "ghjr", "yre", "obm", "qwrt", "vyeu" }.
Probably the most common use for hash sets is to see whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hash function), as opposed to lists, for which the inclusion check is O(n) (and sorted sets, for which it is O(log n)). So if you do a lot of checks on whether an item is contained in some collection, hash sets might be a performance improvement. If you only ever iterate over them, there won't be much difference (iterating over the whole set is O(n), same as with lists, and hash sets have somewhat more overhead when adding items).
And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, and which second etc.
HashSet<T> is a data structure in the .NET Framework that is capable of representing a mathematical set as an object. In this case, it uses hash codes (the GetHashCode result of each item) to compare equality of set elements.
A set differs from a list in that it only allows one occurrence of the same element contained within it. HashSet<T> will just return false if you try to add a second identical element. Indeed, lookup of elements is very quick (O(1) time), since the internal data structure is simply a hashtable.
If you're wondering which to use, note that using a List<T> where a HashSet<T> is appropriate is not the biggest mistake, though it may allow problems where you have undesirable duplicate items in your collection. What is more, lookup (item retrieval) is vastly more efficient, ideally O(1) (with perfect bucketing) instead of O(n) time, which is quite important in many scenarios.
List<T> is used to store ordered collections of information. If you know the index of an element, you can access it in constant time. However, to determine where an element lies in the list, or to check whether it exists at all, the lookup time is linear. On the other hand, HashSet<T> makes no guarantees about the order of the stored data and consequently provides constant lookup time for its elements.
As the name implies, HashSet<T> is a data structure that implements set semantics. The data structure is optimized to implement set operations (i.e. union, intersection, difference), which cannot be done as efficiently with the traditional List implementation.
So, which data type to use really depends on what you are attempting to do in your application. If you don't care about how your elements are ordered in a collection, and only want to enumerate or check for existence, use HashSet<T>. Otherwise, consider using List<T> or another suitable data structure.
In the basic intended scenario HashSet<T> should be used when you want more specific set operations on two collections than LINQ provides. LINQ methods like Distinct, Union, Intersect and Except are enough in most situations, but sometimes you may need more fine-grained operations, and HashSet<T> provides:
UnionWith
IntersectWith
ExceptWith
SymmetricExceptWith
Overlaps
IsSubsetOf
IsProperSubsetOf
IsSupersetOf
IsProperSupersetOf
SetEquals
Another difference between LINQ and HashSet<T> "overlapping" methods is that LINQ always returns a new IEnumerable<T>, and HashSet<T> methods modify the source collection.
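A quick sketch of that difference:
using System.Linq;

var set = new HashSet<int> { 1, 2 };
var other = new[] { 2, 3 };

var union = set.Union(other);  // LINQ: a new lazy sequence { 1, 2, 3 }; 'set' is untouched
set.UnionWith(other);          // HashSet<T>: 'set' itself becomes { 1, 2, 3 }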
In short: any time you are tempted to use a Dictionary<T, T> (or a Dictionary<S, T> where S is a property of T), then you should consider a HashSet<T> (or a HashSet<T> plus implementing IEquatable<T> on T, equating on S).
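A sketch of that last pattern, with an illustrative Person type whose equality is defined by its Name property:
using System;
using System.Collections.Generic;

var people = new HashSet<Person>();
Console.WriteLine(people.Add(new Person { Name = "Ada", Age = 36 }));  // True: added
Console.WriteLine(people.Add(new Person { Name = "Ada", Age = 99 }));  // False: same Name already present

// Illustrative type: equality (and the hash code) is defined by Name alone.
class Person : IEquatable<Person>
{
    public string Name { get; set; }
    public int Age { get; set; }

    public bool Equals(Person other) => other != null && Name == other.Name;
    public override bool Equals(object obj) => Equals(obj as Person);
    public override int GetHashCode() => Name?.GetHashCode() ?? 0;
}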
I have been told that there is a performance difference between the following code blocks.
foreach (Entity e in entityList)
{
....
}
and
for (int i=0; i<entityList.Count; i++)
{
Entity e = (Entity)entityList[i];
...
}
where
List<Entity> entityList;
I am no CLR expert, but from what I can tell, they should boil down to basically the same code. Does anybody have concrete (heck, I'd take packed dirt) evidence one way or the other?
foreach creates an instance of an enumerator (returned from GetEnumerator), and that enumerator keeps state throughout the course of the foreach loop. It then repeatedly calls MoveNext() and reads Current on the enumerator, and runs your code for each object it returns.
They don't boil down to the same code in any way, really, which you'd see if you wrote your own enumerator.
Here is a good article that shows the IL differences between the two loops.
foreach is technically slower, but much easier to use and read. Unless performance is critical, I prefer the foreach loop over the for loop.
The foreach sample roughly corresponds to this code:
using(IEnumerator<Entity> e = entityList.GetEnumerator()) {
while(e.MoveNext()) {
Entity entity = e.Current;
...
}
}
There are two costs here that a regular for loop does not have to pay:
The cost of allocating the enumerator object by entityList.GetEnumerator().
The cost of two virtual method calls (MoveNext and the Current getter) for each element of the list.
One point missed here:
A List has a Count property; it internally keeps track of how many elements are in it.
An IEnumerable DOES NOT.
If you program to the interface IEnumerable and use the Count() extension method, it will enumerate the whole sequence just to count the elements.
A moot point, though, since with an IEnumerable you cannot refer to items by index.
So if you want to lock in to Lists and Arrays, you can get small performance increases.
If you want flexibility, use foreach and program to IEnumerable (allowing the use of LINQ and/or yield return).
In terms of allocations, it'd be better to look at this blog post. It shows exactly in what circumstances an enumerator is allocated on the heap.
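The short version, as a sketch: with List<T>, the static type you enumerate through decides whether the enumerator lands on the heap.
// entityList is the List<Entity> from the question.
foreach (Entity e in entityList)
{
    // binds to List<T>.Enumerator, a struct: no heap allocation here
}

IEnumerable<Entity> sequence = entityList;
foreach (Entity e in sequence)
{
    // same list seen through the interface: the struct enumerator is boxed,
    // costing one heap allocation plus interface calls for MoveNext/Current
}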
I think one possible situation where you might get a performance gain is if the enumerable type's size and the loop condition are constants; for example:
const int ArraySize = 10;
int[] values = new int[ArraySize];
//...
for (int i = 0; i < ArraySize; i++)
{
    // ...
}
In this case, depending on the complexity of the loop body, the compiler might be able to unroll the loop. I have no idea whether the .NET compiler does this, and it's of limited utility if the size of the enumerable type is dynamic.
One situation where foreach might perform better is with data structures like a linked list, where random access means traversing the list; the enumerator used by foreach will iterate one item at a time, making each access O(1) and the full loop O(n), but calling the indexer means starting at the head and walking to the right index: O(n) for each access, and O(n^2) for the full loop.
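For example, with LinkedList<T> (which deliberately has no indexer, so LINQ's ElementAt stands in for one here):
using System.Collections.Generic;
using System.Linq;

var linked = new LinkedList<int>(Enumerable.Range(0, 1000));

// O(n) overall: the enumerator just follows the Next references.
foreach (var value in linked) { /* ... */ }

// O(n^2) overall: each ElementAt(i) walks from the head again.
for (int i = 0; i < linked.Count; i++)
{
    var item = linked.ElementAt(i);
}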
Personally I don't usually worry about it and use foreach any time I need all items and don't care about the index of the item. If I'm not working with all of the items or I really need to know the index, I use for. The only time I could see it being a big concern is with structures like linked lists.
for loop
A for loop is used to perform an operation n times:
int n = 10, l = 0;
for (int i = 0; i < n; i++)
{
    l = i;
}
foreach loop
A foreach loop is used to perform an operation for each value/object in an IEnumerable:
int[] values = { 1, 2, 3, 4, 5, 6 };
foreach (var k in values)
{
    l = k;
}