Enumerable.Skip and ordering - c#

So IEnumerables don't guarantee order.
Does that mean if you do myEnumerable.Skip(5) you cannot (unless you do .ToList() or otherwise before) guarantee what will be returned?

Once the objects are yielded by an IEnumerator they do have an order. There is some item that comes out first, and some item that comes out second, etc. For some particular implementations that order might have meaning, for others it might be arbitrary, but there still is some order. The Skip implementation is straightforward; it gets however many items without yielding them, and then gets the rest and yields them. Whether the items skipped mean anything in particular is the responsibility of whoever is calling the method.
Calling ToList will never change the order of the items in the sequence, so adding such a call before calling Skip wouldn't change anything. A call to OrderBy on the other hand would result in a changed ordering, possibly from a meaningless order to a meaningful order. That's not to say it's required, merely that it can, in some situations, be a useful tool.
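To make the description above concrete, here is a minimal sketch. SkipSketch is a hypothetical helper written for illustration, not the actual LINQ source; it only mirrors the "consume some items, then yield the rest" behaviour described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical re-implementation of the behaviour described above (not the real LINQ code).
IEnumerable<T> SkipSketch<T>(IEnumerable<T> items, int count)
{
    using IEnumerator<T> e = items.GetEnumerator();
    // Advance past `count` items without yielding them.
    while (count > 0 && e.MoveNext()) count--;
    if (count > 0) yield break;              // the source had fewer items than requested
    while (e.MoveNext()) yield return e.Current;
}

// An array enumerates from index 0 upward, so skipping 2 drops 50 and 10.
int[] source = { 50, 10, 40, 20, 30 };

int[] skipped = source.Skip(2).ToArray();                  // 40, 20, 30
int[] viaList = source.ToList().Skip(2).ToArray();         // ToList did not change the order
int[] sorted  = source.OrderBy(n => n).Skip(2).ToArray();  // OrderBy did: 30, 40, 50
int[] sketch  = SkipSketch(source, 2).ToArray();           // same as Skip(2)

Console.WriteLine(string.Join(", ", skipped));
Console.WriteLine(string.Join(", ", sorted));
```

Which items get skipped depends only on the order in which the source yields them; Skip itself does no reordering.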

Whether a particular ordering is guaranteed or not by any specific IEnumerable<T> is dependent on
how that particular implementation is/was done, and on
the semantics of the underlying collection/class.
An array will enumerate its contents in the obvious sequence (from x[0] to x[n-1]). Ditto for a List<T>, it being essentially an array of adjustable length. Actual [linked] lists, of course, can only be enumerated in order.
The order of enumeration of Dictionary<K,V>, HashSet<T>, binary trees, etc. can depend upon the order in which objects were added. Add the same collection of values with differing orderings to a binary tree and the structure of the tree thus constructed will vary (the degenerate case, of course, being when objects are added in order, in which case the tree structure collapses into an [ordered] linked list).
That being said, any particular instance of IEnumerable<T>, should, barring any modifications to the underlying collection, yield the same sequence of values each time it is enumerated. That assumes, of course, a rational implementation of the interface. If the interface enumerates the collection by doing a random shuffle, of course, all bets are off.
If the actual order of items produced is important, you need to either
Use a collection having the desired semantics, or
Enforce the desired ordering by sorting the collection or enumeration.
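The two options above might look like this in practice. The tag values are made up for illustration; the point is only that Skip becomes predictable once the order is defined by the collection or by an explicit sort.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A HashSet makes no promise about enumeration order.
var tags = new HashSet<string> { "pear", "apple", "fig", "cherry" };

// Option 1: use a collection whose semantics include ordering.
var sortedTags = new SortedSet<string>(tags, StringComparer.Ordinal);
List<string> fromSortedSet = sortedTags.Skip(2).ToList();

// Option 2: enforce the ordering on the enumeration at the point of use.
List<string> fromOrderBy = tags.OrderBy(t => t, StringComparer.Ordinal).Skip(2).ToList();

Console.WriteLine(string.Join(", ", fromSortedSet)); // fig, pear
Console.WriteLine(string.Join(", ", fromOrderBy));   // fig, pear
```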

IEnumerable is an interface. As such, the interface has no guarantee of order. However, if you have an actual object that implements that interface, that object may (and often does) guarantee the order.

If you use Skip(x), the first x elements will be ignored and everything after will be returned in a new IEnumerable<T>. The interface makes no guarantees that it will retain order but in practice it does. Anytime you operate on an IEnumerable<T> it will actually go through the same list in a linear fashion. For example, if you read a file line by line into an IEnumerable<T>, the lines would always be in the same order as they were in the file (assuming you don't use a sorting method). Even if you use a Where or some other method to filter the results, order will still be retained. The only thing you need to worry about is custom collections that implement IEnumerable<T>. The collections in .NET will behave as you'd expect them to.
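A small sketch of the filtering case, using an in-memory list of made-up lines in place of a file. With LINQ to Objects over a List<T>, Where and Skip keep the surviving items in their original relative order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for lines read from a file, in file order.
var lines = new List<string> { "alpha", "", "beta", "gamma", "", "delta" };

// Drop blank lines, then skip the first remaining line.
List<string> result = lines.Where(l => l.Length > 0).Skip(1).ToList();

Console.WriteLine(string.Join(", ", result)); // beta, gamma, delta
```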

Related

Is it possible to provide an ordering guarantee for a collection?

I'm trying to create a method which (other than in name) shows that the ordering of some collection will be preserved.
I have considered SortedList, but dismissed it due to the requirement of holding a key. I have also dismissed other Sorted types for similar reasons, and SortedSet due to Linq returning IEnumerable instead of another SortedSet when you operate on it.
I don't mind if a new type is required, or I need to write methods in a specific way. The goal here is to highlight methods which preserve the input order of a collection when operating upon it.
I had thought about adding a custom attribute and just trusting that it will be used correctly, but I would ideally like to find something in the language which is more explicit.
-- Edit
It's not so much the order of the elements in the collection (I could use an IEnumerable), but some operation on the input collection. Let's say I were returning the root of all the numbers in an array, instead of returning (root, number)[], or (root, index)[] I want to return root[] and have it clear to the user that the order of the elements in the returned array matches the order of the elements in the input parameter.
No, there is nothing in C# or .NET that lets you express and enforce "this method does not change the order of elements in a collection / while iterating through a collection".
The conventional expectation is that the order of elements stored in a collection is preserved while iterating, unless a method/class is explicitly named to indicate reordering.
Examples of "no reordering":
for / foreach
.Select, .First, .Take, .SelectMany, .Where
indexing of collection that is not called "SortedXxxxx" - List, array.
Examples of "does reordering"
List.Sort, List.Reverse
.OrderBy, .ThenBy
classes that don't preserve insertion order, like HashSet and Dictionary (no guaranteed order) or SortedList (ordered by key)
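For the square-root case from the question, a sketch using Select (one of the "no reordering" operations listed above) could look like this. The input values are illustrative.

```csharp
using System;
using System.Linq;

double[] numbers = { 16, 4, 25, 9 };

// Select projects each element in turn, so roots[i] corresponds to numbers[i].
double[] roots = numbers.Select(n => Math.Sqrt(n)).ToArray();

for (int i = 0; i < numbers.Length; i++)
{
    Console.WriteLine($"sqrt({numbers[i]}) = {roots[i]}");
}
```

Nothing in the signature proves the correspondence; it follows from the convention that Select does not reorder.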
Sounds like you want a queue. The first object added will be the first removed, so order of insertion is preserved. Typically objects are processed out of the queue with the Dequeue() method, but there is a Peek() method if you don't want to remove from the collection.
Beyond that, you'll probably need to roll your own implementation. It would likely just be a wrapper around a List<T>, where you prevent anything from being Inserted.
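A short sketch of the Queue<T> behaviour mentioned above (the item names are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var queue = new Queue<string>();
queue.Enqueue("first");
queue.Enqueue("second");
queue.Enqueue("third");

string peeked = queue.Peek();          // looks at the head without removing it
string dequeued = queue.Dequeue();     // removes and returns the head
string[] remaining = queue.ToArray();  // the rest, still in insertion order

Console.WriteLine($"{peeked}, {dequeued}, [{string.Join(", ", remaining)}]");
```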

C# foreach loop - is order *stability* guaranteed?

Suppose I have a given collection. Without ever changing the collection in any way, I loop through its contents twice with a foreach. Barring cosmic rays and what not, is it absolutely guaranteed that the order will be consistent in both loops?
Alternatively, given a HashSet<string> with a number of elements, what can cause the output from the commented lines in the following to be unequal:
{
    var mySet = new HashSet<string>();
    // Some code which populates the HashSet<string>
    // Output1
    printContents(mySet);
    // Output2
    printContents(mySet);
}
public void printContents(HashSet<string> set) {
    foreach(var element in set) {
        Console.WriteLine(element);
    }
}
It would be helpful if I could get a general answer explaining what causes an implementation to not meet the criteria described above. Specifically, though, I am interested in Dictionary, List and arrays.
Array enumeration guarantees order.
List and List<T> are expected to provide stable order (since they are expected to implement sequentially-indexed elements).
Dictionary and HashSet explicitly do not guarantee order. It is very unlikely that two calls to iterate the items, one right after the other, will return them in a different order, but there are no guarantees or expectations. One should not expect any particular order.
Sorted versions of Dictionary/HashSet return items in sort order.
Other IEnumerable objects are free to do whatever they want. Normally one implements iterators in a way that matches the user's expectations, i.e. enumeration of something that has an implicit order should be stable, and if an explicit order is provided it is expected to be stable. A query to a database that does not specify an order should be expected to return items in semi-random order.
Check this question for links: Does the foreach loop in C# guarantee an order of evaluation?
Everything that implements IEnumerable<T> does so in its own way. There is no general guarantee that any given collection must ensure stability.
If you are referring specifically to Collection<T> (http://msdn.microsoft.com/en-us/library/ms132397.aspx) I don't see any specific guarantee in its MSDN reference that ordering is consistent.
Will it probably be consistent? Yes. Is there a written guarantee? Not that I can find.
For many of the C# collections there are sorted versions of the collection. For instance, a HashSet is to a SortedSet as a Dictionary is to a SortedDictionary. If you're working with something where the order isn't important like the Dictionary then you can't assume the loop order will behave the same way every time.
As per your example with HashSet<T>, we now have source code to check: HashSet:Enumerator
As it is, the Slot[] set.m_slots array is iterated.
The array object is only changed in the methods TrimExcess, Initialize (both of which are only called in the constructor), OnDeserialization, and SetCapacity (only called by AddIfNotPresent and AddOrGetLocation).
The values of m_slots are only changed in methods that change elements of the HashSet(Clear, Remove, AddIfNotPresent, IntersectWith, SymmetricExceptWith).
So yes, if nothing touches the set, it enumerates in the same order.
Dictionary:Enumerator works in quite the same way, iterating an Entry[] entries that only changes when such non-readonly methods are called.
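The conclusion above can be checked empirically with a sketch like the following. It demonstrates observed behaviour of the current implementation for an untouched set, not a documented guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var mySet = new HashSet<string> { "delta", "alpha", "charlie", "bravo" };

// Enumerate twice without modifying the set in between.
List<string> firstPass = mySet.ToList();
List<string> secondPass = mySet.ToList();

// Compares the two passes element by element, in order.
bool sameOrder = firstPass.SequenceEqual(secondPass);
Console.WriteLine(sameOrder);
```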

I have read that it is bad practice to iterate over a HashSet. Should I be calling .ToList() on it first?

I have a collection of items called RegisteredItems. I do not care about the order of the items in RegisteredItems, only that they exist.
I perform two types of operations on RegisteredItems:
Find and return item by property.
Iterate over collection and have side-effect.
According to: When should I use the HashSet<T> type? Robert R. says,
"It's somewhat dangerous to iterate over a HashSet because doing so
imposes an order on the items in the set. That order is not really a
property of the set. You should not rely on it. If ordering of the
items in a collection is important to you, that collection isn't a
set."
There are some scenarios where my collection would contain 50-100 items. I realize this is not a large amount of items, but I was still hoping to reap the rewards of using a HashSet instead of List.
I have found myself looking at the following code and wondering what to do:
LayoutManager.Instance.RegisteredItems.ToList().ForEach( item => item.DoStuff() );
vs
foreach( var item in LayoutManager.Instance.RegisteredItems)
{
    item.DoStuff();
}
RegisteredItems used to return an IList<T>, but now it returns a HashSet. I felt that, if I was using HashSet for efficiency, it would be improper to cast it as a List. Yet, the above quote from Robert leaves me feeling uneasy about iterating over it, as well.
What's the right call in this scenario? Thanks
If you don't care about order, use a HashSet<>. The quote is about using HashSet<> being dangerous when you're worried about order. If you run this code multiple times, and the items are operated on in different order, will you care? If not, then you're fine. If yes, then don't use a HashSet<>. Arbitrarily converting to a List first doesn't really solve the problem.
And I'm not certain, but I suspect that .ToList() will iterate over the HashSet<> to do that, so, now you're walking the collection twice.
Don't prematurely optimize. If you only have 100 items, just use a HashSet<> and move on. If you start caring about order, change it to a List<> then and use it as a list everywhere.
If you really don't care about order and you know that you can't have duplicates in your HashSet (and that's what you want), go ahead and use a HashSet.
In the quoted question, I think he's saying that if you iterate over a Set, you can easily trick yourself into thinking that the items are in a certain order. For example, it'd be easy to treat the first iterated item differently, but you aren't guaranteed that will remain the first iterated item.
As long as you keep this in mind, and consider the Set unordered, iterating over it is fine.
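As a sketch of the two options from the question, with strings standing in for the registered items and a list recording the "side effect": both visit every item exactly once, but the ToList() variant copies the set first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder items; the real RegisteredItems type is not shown in the question.
var registeredItems = new HashSet<string> { "header", "sidebar", "footer" };

// Option A: iterate the set directly.
var visitedDirectly = new List<string>();
foreach (string item in registeredItems)
{
    visitedDirectly.Add(item);          // stands in for item.DoStuff()
}

// Option B: copy to a list first (one extra pass and allocation), then iterate.
var visitedViaList = new List<string>();
registeredItems.ToList().ForEach(item => visitedViaList.Add(item));

Console.WriteLine($"{visitedDirectly.Count} vs {visitedViaList.Count}");
```

Neither variant gives the items a meaningful order, so the copy buys nothing when order does not matter.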

When should I use the HashSet<T> type?

I am exploring the HashSet<T> type, but I don't understand where it stands in collections.
Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see individual access to its elements.
Is it only for enumeration?
The important thing about HashSet<T> is right there in the name: it's a set. The only things you can do with a single set are to establish what its members are, and to check whether an item is a member.
Asking if you can retrieve a single element (e.g. set[45]) is misunderstanding the concept of the set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.
It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.
Sets are really limited, and their members are unique. On the other hand, they're really fast.
Here's a real example of where I use a HashSet<string>:
Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a # or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a #xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except existence of the command in the set of valid commands. Let's look at the alternatives I faced:
Dictionary<string, ?>: What type do I use for the value? The value is meaningless since I'm just going to use ContainsKey. Note: Before .NET 3.5 this was the only choice for O(1) lookups - HashSet<T> was added for 3.5 and extended to implement ISet<T> for 4.0.
List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
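A sketch of the two lookup strategies compared above. The command names are made up, and IsValidCommand / IsValidViaBinarySearch are hypothetical helper names, not part of the original highlighter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hash-based membership test: close to O(1) per lookup.
var validCommands = new HashSet<string>(StringComparer.Ordinal)
{
    "brief", "param", "return", "see"
};
bool IsValidCommand(string tokenText) => validCommands.Contains(tokenText);

// Sorted-array alternative: O(log n) per lookup, less memory overhead.
// The array must be sorted with the same comparer that BinarySearch uses.
string[] sortedCommands = validCommands.OrderBy(c => c, StringComparer.Ordinal).ToArray();
bool IsValidViaBinarySearch(string tokenText) =>
    Array.BinarySearch(sortedCommands, tokenText, StringComparer.Ordinal) >= 0;

Console.WriteLine(IsValidCommand("param"));          // True
Console.WriteLine(IsValidViaBinarySearch("bogus"));  // False
```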
A HashSet<T> implements the ICollection<T> interface:
public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
    // Methods
    void Add(T item);
    void Clear();
    bool Contains(T item);
    void CopyTo(T[] array, int arrayIndex);
    bool Remove(T item);
    // Properties
    int Count { get; }
    bool IsReadOnly { get; }
}
A List<T> implements IList<T>, which extends the ICollection<T>
public interface IList<T> : ICollection<T>
{
    // Methods
    int IndexOf(T item);
    void Insert(int index, T item);
    void RemoveAt(int index);
    // Properties
    T this[int index] { get; set; }
}
A HashSet has set semantics, implemented via a hashtable internally:
A set is a collection that contains no
duplicate elements, and whose elements
are in no particular order.
What does the HashSet gain, if it loses index/position/list behavior?
Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).
A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> by only adding/removing keys as values, and ignoring dictionary values themselves. You would expect keys in a dictionary not to have duplicate values, and that's the point of the "Set" part.
Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates - and that's when you want a Set.
HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set cannot be used to replace a list (unless you should have used a set in the first place).
If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10,000 revisions of a software project, and you want to find out how many people contributed to that project. You could use a set of strings (in .NET, a HashSet<string>), iterate over the list of revisions and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
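The contributor count described above can be sketched like this, with a handful of made-up author names standing in for the revision history:

```csharp
using System;
using System.Collections.Generic;

// One entry per revision; authors repeat.
string[] revisionAuthors = { "ann", "bob", "ann", "cara", "bob", "ann" };

var contributors = new HashSet<string>();
foreach (string author in revisionAuthors)
{
    contributors.Add(author);   // Add returns false and changes nothing for a duplicate
}

int contributorCount = contributors.Count;
Console.WriteLine(contributorCount); // 3
```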
HashSet would be used to remove duplicate elements in an IEnumerable collection. For example,
List<string> duplicatedEnumerableStrings = new List<string> {"abc", "ghjr", "abc", "abc", "yre", "obm", "ghir", "qwrt", "abc", "vyeu"};
HashSet<string> uniqueStrings = new HashSet<string>(duplicatedEnumerableStrings);
After this code runs, uniqueStrings holds {"abc", "ghjr", "yre", "obm", "ghir", "qwrt", "vyeu"} ("ghjr" and "ghir" are different strings, so both are kept).
Probably the most common use for hashsets is to see whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hashing function), as opposed to lists, for which a check for inclusion is O(n) (and sorted sets, for which it is O(log n)). So if you do a lot of checks for whether an item is contained in some list, hashsets might be a performance improvement. If you only ever iterate over them, there won't be much difference (iterating over the whole set is O(n), same as with lists, and hashsets have somewhat more overhead when adding items).
And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, and which second etc.
HashSet<T> is a data structure in the .NET Framework that is capable of representing a mathematical set as an object. In this case, it uses hash codes (the GetHashCode result of each item) to locate set elements, and Equals to confirm that two elements are equal.
A set differs from a list in that it only allows one occurrence of the same element contained within it. HashSet<T> will just return false if you try to add a second identical element. Indeed, lookup of elements is very quick (O(1) time), since the internal data structure is simply a hashtable.
If you're wondering which to use, note that using a List<T> where HashSet<T> is appropriate is not the biggest mistake, though it may potentially allow problems where you have undesirable duplicate items in your collection. What is more, lookup (item retrieval) is vastly more efficient - ideally O(1) (for perfect bucketing) instead of O(n) time - which is quite important in many scenarios.
List<T> is used to store ordered sets of information. If you know the relative order of the elements of the list, you can access them in constant time. However, to determine where an element lies in the list or to check if it exists in the list, the lookup time is linear. On the other hand, HashSet<T> makes no guarantees of the order of the stored data and consequently provides constant access time for its elements.
As the name implies, HashSet<T> is a data structure that implements set semantics. The data structure is optimized to implement set operations (i.e. Union, Difference, Intersect), which cannot be done as efficiently with the traditional List implementation.
So, choosing which data type to use really depends on what you are attempting to do with your application. If you don't care about how your elements are ordered in a collection, and only want to enumerate or check for existence, use HashSet<T>. Otherwise, consider using List<T> or another suitable data structure.
In the basic intended scenario HashSet<T> should be used when you want more specific set operations on two collections than LINQ provides. LINQ methods like Distinct, Union, Intersect and Except are enough in most situations, but sometimes you may need more fine-grained operations, and HashSet<T> provides:
UnionWith
IntersectWith
ExceptWith
SymmetricExceptWith
Overlaps
IsSubsetOf
IsProperSubsetOf
IsSupersetOf
IsProperSupersetOf
SetEquals
Another difference between LINQ and HashSet<T> "overlapping" methods is that LINQ always returns a new IEnumerable<T>, and HashSet<T> methods modify the source collection.
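The difference between the LINQ operators and the HashSet<T> methods can be seen in a small sketch (values are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var a = new HashSet<int> { 1, 2, 3 };
var b = new HashSet<int> { 3, 4 };

// LINQ: produces a new sequence and leaves `a` alone.
// ToList() materializes it now, before `a` is modified below.
List<int> linqUnion = a.Union(b).ToList();
int countAfterLinq = a.Count;            // still 3

// HashSet<T>: modifies `a` in place.
a.UnionWith(b);
int countAfterUnionWith = a.Count;       // now 4

// Relationship checks that LINQ has no direct equivalent for.
bool overlaps = a.Overlaps(b);
bool properSuperset = a.IsProperSupersetOf(b);

Console.WriteLine($"{countAfterLinq}, {countAfterUnionWith}, {overlaps}, {properSuperset}");
```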
In short - anytime you are tempted to use a Dictionary (or a Dictionary where S is a property of T) then you should consider a HashSet (or HashSet + implementing IEquatable on T which equates on S)

IEnumerable<T> as return type

Is there a problem with using IEnumerable<T> as a return type?
FxCop complains about returning List<T> (it advises returning Collection<T> instead).
Well, I've always been guided by a rule "accept the least you can, but return the maximum."
From this point of view, returning IEnumerable<T> is a bad thing, but what should I do when I want to use "lazy retrieval"? Also, the yield keyword is such a goodie.
This is really a two part question.
1) Is there inherently anything wrong with returning an IEnumerable<T>
No, nothing at all. In fact if you are using C# iterators this is the expected behavior. Converting it to a List<T> or another collection class pre-emptively is not a good idea. Doing so is making an assumption on the usage pattern by your caller. I find it's not a good idea to assume anything about the caller. They may have good reasons why they want an IEnumerable<T>. Perhaps they want to convert it to a completely different collection hierarchy (in which case a conversion to List is wasted).
2) Are there any circumstances where it may be preferable to return something other than IEnumerable<T>?
Yes. While it's not a great idea to assume much about your callers, it's perfectly okay to make decisions based on your own behavior. Imagine a scenario where you had a multi-threaded object which was queueing up requests into an object that was constantly being updated. In this case returning a raw IEnumerable<T> is irresponsible. As soon as the collection is modified the enumerable is invalidated and will cause an exception to occur. Instead you could take a snapshot of the structure and return that value, say in List<T> form. In this case I would just return the object as the direct structure (or interface).
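The invalidation problem can be reproduced even on a single thread, which keeps the sketch deterministic. The request names are placeholders, and the multi-threaded scenario described above would additionally need synchronization while taking the snapshot.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var pending = new List<string> { "req-1", "req-2" };

IEnumerable<string> live = pending;        // exposes the live, changing list
List<string> snapshot = pending.ToList();  // independent copy taken now

// Enumerating the live list while it changes invalidates the enumerator.
bool liveThrew = false;
try
{
    foreach (string request in live)
    {
        pending.Add("req-3");              // modification during enumeration
    }
}
catch (InvalidOperationException)
{
    liveThrew = true;
}

// The snapshot is unaffected by later changes to the source.
int seen = 0;
foreach (string request in snapshot)
{
    pending.Add("late-" + request);
    seen++;
}

Console.WriteLine($"{liveThrew}, {seen}");
```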
This is certainly the rarer case though.
No, IEnumerable<T> is a good thing to return here, since all you are promising is "a sequence of (typed) values". Ideal for LINQ etc, and perfectly usable.
The caller can easily put this data into a list (or whatever) - especially with LINQ (ToList, ToArray, etc).
This approach allows you to lazily spool back values, rather than having to buffer all the data. Definitely a goodie. I wrote-up another useful IEnumerable<T> trick the other day, too.
IEnumerable is fine by me but it has some drawbacks. The client has to enumerate to get the results. It has no way to check for Count etc.
List is bad because you expose too much control; the client can add/remove etc. from it and that can be a bad thing.
Collection seems the best compromise, at least in FxCop's view.
I always use what seems appropriate in my context (e.g. if I want to return a read-only collection I expose a collection as the return type and return List.AsReadOnly(), or IEnumerable for lazy evaluation through yield, etc.). Take it on a case-by-case basis.
About your principle: "accept the least you can, but return the maximum".
The key to managing the complexity of a large program is a technique called information hiding. If your method works by building a List<T>, it's not often necessary to reveal this fact by returning that type. If you do, then your callers may modify the list they get back. This removes your ability to do caching, or lazy iteration with yield return.
So a better principle for a function to follow is: "reveal as little as possible about how you work".
Returning IEnumerable<T> is OK if you're genuinely only returning an enumeration, and it will be consumed by your caller as such.
But as others point out, it has the drawback that the caller may need to enumerate if he needs any other info (for example Count). The .NET 3.5 extension method IEnumerable<T>.Count will enumerate behind the scenes if the return value does not implement ICollection<T>, which may be undesirable.
I often return IList<T> or ICollection<T> when the result is a collection - internally your method can use a List<T> and either return it as-is, or return List<T>.AsReadOnly if you want to protect against modification (e.g. if you're caching the list internally). AFAIK FxCop is quite happy with either of these.
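A sketch of the AsReadOnly approach mentioned above; the variable names are illustrative. The wrapper rejects modification by the caller while still reflecting changes the owner makes to the underlying list.

```csharp
using System;
using System.Collections.Generic;

var cache = new List<int> { 1, 2, 3 };       // internal list, owned by the method's class

// What the method would return: a read-only wrapper typed as IList<int>.
IList<int> exposed = cache.AsReadOnly();

bool rejected = false;
try
{
    exposed.Add(4);                          // callers cannot modify through the wrapper
}
catch (NotSupportedException)
{
    rejected = true;
}

cache.Add(4);                                // the owner still can
int exposedCount = exposed.Count;            // the wrapper is a view, so it sees the change

Console.WriteLine($"{rejected}, {exposedCount}");
```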
"accept the least you can, but return the maximum" is what I advocate. When a method returns an object, what justifications we have to not return the actual type and limit the capabilities of the object by returning a base type. This however raises a question how do we know what the "maximum" (actual type) will be when we design an interface. The answer is very simple. Only in extreme cases where the interface designer is designing an open interface, which will be implemented outside the application/component, they would not know what the actual return type may be. A smart designer should always consider what the method should be doing and what an optimal/generic return type should be.
E.g. If I am designing an interface to retrieve a vector of objects, and I know the count of returned objects are going to be variable, I'll always assume a smart developer will always use a List. If someone plans to return an Array, I'd question his capabilities, unless he/she is just returning the data from another layer that he/she doesn't own. And this is probably why FxCop advocates for ICollection (common base for List and Array).
The above being said, there are couple of other things to consider
if the returned data should be mutable or immutable
if the returned data be shared across multiple callers
Regarding LINQ lazy evaluation, I am sure 95%+ of C# users don't understand the intricacies. It's so non-OO-ish. OO promotes concrete state changes on method invocations. LINQ lazy evaluation promotes runtime state changes on an expression evaluation pattern (not something non-advanced users always follow).
One important aspect is that when you return a List<T> you are actually returning a reference. That makes it possible for a caller to manipulate your list. This is a common problem, for instance when a business layer returns a List<T> to a GUI layer.
Just because you say you're returning IEnumerable doesn't mean you can't return a List. The idea is to reduce unneeded coupling. All that the caller should care about is getting a list of things, rather than the exact type of collection used to contain that list. If you have something that's backed by an array, then getting something like Count is going to be fast anyway.
I think your own guidance is great -- if you are able to be more specific about what you're returning without a performance hit (you don't have to e.g. build a List out of your result), do so. But if your function legitimately doesn't know what type it's going to find, like if in some situations you'll be working with a List and in some with an Array, etc., then returning IEnumerable is the "best" you can do. Think of it as the "greatest common multiple" of everything you might want to return.
I can't accept the chosen answer. There are ways of dealing with the scenario described, but using a List or whatever else you're using isn't one of them. The moment the IEnumerable is returned you have to assume that the caller might do a foreach. In that case it doesn't matter if the concrete type is List or spaghetti. In fact just indexing is a problem, especially if items are removed.
Any returned value is a snapshot. It may be the current contents of the IEnumerable, in which case if it's cached it should be a clone of the cached copy; if it's supposed to be more dynamic (like the results of a SQL query) then use yield return; however, allowing the container to mutate at will and supplying methods like Count and an indexer is a recipe for disaster in a multithreaded world. I haven't even gotten into the ability of the caller to call Add or Delete on a container your code is supposed to be in control of.
Also, returning a concrete type locks you into an implementation. Today internally you may be using a list. Tomorrow maybe you do become multithreaded and want to use a thread-safe container or an array or a queue or the Values collection of a dictionary or the output of a LINQ query. If you lock yourself into a concrete return type then you have to either change a bunch of code or do a conversion before returning.
IEnumerable is cool because you can use the yield iterator that gives to the consumer just the data they need but there is a cost hidden in the construct.
Let me explain it with an example. Let's say I am consuming this method:
IEnumerable GetFilesFromFolder(string path)
So, what do I get? To get all the files of my folder I have to iterate the enumeration, and that's fine, after all that's how enumerations work, but what if, for any reason, I have to enumerate it twice?
The second time should I expect a refreshed result or the result is idempotent? I do not know. I have to check the docs of the library / method.
The call to the GetEnumerator method of the enumeration done by the consumer, could, in fact, execute an I/O operation behind the scene, or an http call, or it could simply iterate an inner array, I can not know it for sure. I have to check the docs in the hope that this behavior is documented.
Does this detail matter? I think it does. At least from a performance perspective.
Even if the cost of iteration is low and CPU-bound, it is not zero, and it can get even worse with chains of enumerations, which often turn debugging sessions into a nightmare.
I prefer to not give the consumer of my library doubts so whenever I know my API returns few elements I always use arrays as return type, and only when the data to return is huge I use IEnumerable or IAsyncEnumerable.
Anyway, if you want to return enumerations please document your API to tell consumers if the result is a snapshot or not.
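The hidden cost described above can be made visible with a sketch. GetFiles is a hypothetical iterator standing in for GetFilesFromFolder, and the counter stands in for whatever I/O the real method might perform.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int executions = 0;

// Hypothetical lazy producer; the body runs each time the result is enumerated.
IEnumerable<string> GetFiles()
{
    executions++;               // stands in for an I/O operation or HTTP call
    yield return "a.txt";
    yield return "b.txt";
}

IEnumerable<string> lazy = GetFiles();   // nothing has run yet
int runsBeforeEnumeration = executions;  // 0

_ = lazy.Count();                        // first enumeration runs the body
_ = lazy.Count();                        // second enumeration runs it again
int runsAfterTwoPasses = executions;     // 2

// Materializing once gives a snapshot that can be read repeatedly at no extra cost.
string[] files = GetFiles().ToArray();
_ = files.Length;
_ = files.Length;
int runsAfterSnapshot = executions;      // 3

Console.WriteLine($"{runsBeforeEnumeration}, {runsAfterTwoPasses}, {runsAfterSnapshot}");
```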
