Suppose I have a given collection. Without ever changing the collection in any way, I loop through its contents twice with a foreach. Barring cosmic rays and what not, is it absolutely guaranteed that the order will be consistent in both loops?
Alternatively, given a HashSet<string> with a number of elements, what can cause the output from the the commented lines in the following to be unequal:
{
var mySet = new HashSet<string>();
// Some code which populates the HashSet<string>
// Output1
printContents(mySet);
// Output2
printContents(mySet);
}
public void printContents(HashSet<string> set) {
foreach(var element in set) {
Console.WriteLine(element);
}
}
It would be helpful if I could get a general answer explaining what causes an implementation to not meet the criteria described above. Specifically, though, I am interested in Dictionary, List and arrays.
Array enumeration guarantees order.
List and List<T> are expected to provide stable order (since they are expected to implement sequentially-indexed elements).
Dictionary, HashSet are explicitly do not guarantee order. Its is very unlikely that 2 calls to iterate items one after each other will return items in different order, but there is no guarantees or expectations. One should not expect any particular order.
Sorted versions of Dictionary/HashSet return items in sort order.
Other IEnumerable objects are free to do whatever they want. Normally one implements iterators in such a way that it matches user's expectations. I.e. enumeration of something that have implicit order should be stable, if explicit order provided - expected to be stable. Query to database that does not specify order should be expected to return items in semi-random order.
Check this question for links: Does the foreach loop in C# guarantee an order of evaluation?
Everything that implements IEnumerable<T> does so in its own way. There is no general guarantee that any given collection must ensure stability.
If you are referring specifically to Collection<T> (http://msdn.microsoft.com/en-us/library/ms132397.aspx) I don't see any specific guarantee in its MSDN reference that ordering is consistent.
Will it probably be consistent? Yes. Is there a written guarantee? Not that I can find.
For many of the C# collections there are sorted versions of the collection. For instance, a HashSet is to a SortedSet as a Dictionary is to a SortedDictionary. If you're working with something where the order isn't important like the Dictionary then you can't assume the loop order will behave the same way every time.
As per your example with HashSet<T>, we now have source code to check: HashSet:Enumerator
As it is, the Slot[] set.m_slots array is iterated.
The array object is only changed in the methods TrimExcess, Initialize (both of which are only called in the constructor), OnDeserialization, and SetCapacity (only called by AddIfNotPresent and AddOrGetLocation).
The values of m_slots are only changed in methods that change elements of the HashSet(Clear, Remove, AddIfNotPresent, IntersectWith, SymmetricExceptWith).
So yes, if nothing touches the set, it enumerates in the same order.
Dictionary:Enumerator works in quite the same way, iterating an Entry[] entries that only changes when such non-readonly methods are called.
Related
I'm trying to create a method which (other than in name) shows that the ordering of some collection will be preserved.
I have considered SortedList, but dismissed it due to the requirement of holding a key. I have also dismissed other Sorted types for similar reasons, and SortedSet due to Linq returning IEnumerable instead of another SortedSet when you operate on it.
I don't mind if a new type is required, or I need to write methods in a specific way. The goal here is to highlight methods which preserve the input order of a collection when operating upon it.
I had thought about adding a custom attribute and just trusting that it will be used correctly, but I would ideally like to find something in the language which is more explicit.
-- Edit
It's not so much the order of the elements in the collection (I could use an IEnumerable), but some operation on the input collection. Let's say I were returning the root of all the numbers in an array, instead of returning (root, number)[], or (root, index)[] I want to return root[] and have it clear to the user that the order of the elements in the returned array matches the order of the elements in the input parameter.
No, there is nothing in C# or .Net that let you express and enforce "this method does not change order of elements in a collection / while iterating through collection".
Conventional expectation is order of elements stored in a collection preserved while iterating unless method/class explicitly named to indicate reordering.
Examples of "no reordering":
for / 'foreach`
.Select, .First, .Take, .SelectMany, .Where
indexing of collection that is not called "SortedXxxxx" - List, array.
Examples of "does reordering"
List.Sort, List.Reverse
.OrderBy, .ThenBy
classes that don't preserve/guarantee ordering like HashSet, Dictionary, OrderedDictionary, SortedList
Sounds like you want a queue. The first object added will be the first removed, so order of insertion is preserved. Typically objects are processed out of the queue with the Dequeue() method, but there is a Peek() method if you don't want to remove from the collection.
Beyond that, you'll probably need to roll your own implementation. It would likely just be a wrapper around a List<T>, where you prevent anything from being Inserted.
Regarding the collections implementing this[int] and assuming the collection won't change during the enumeration, does the foreach (var item in list) loop produce the same sequence as for (var i = 0; i < list.Count; ++i) anytime?
This means, when I need the ascending order by index, could I use foreach or is just simply safer to use for? Or it just depends on the curren collection implementation and migh vary or change in time?
foreach (var item in list)
{
// do things
}
translates to
var enumerator = list.GetEnumerator();
while(enumerator.MoveNext())
{
var item = enumerator.Current;
// do things
}
So as you can see, it's not using the indexor list[i] in the general case.
For most collections types, however, the semantics is the same.
edit
There are IList<T> implementations where the enumerator IList<T> as a linked list, it's very unlikely you will use the indexor in your enumerator implementation, as it would be very inefficient.
As a rule of thumb, using foreach ensure you use the most efficient algorithm for the class at hand, as it is the one chosen by the class' Creator. In the worst case, you will just suffer a small indirection overhead that is very unlikely to be noticeable.
edit 2 after nos's comment
There is a case where the semantics of the two constructs varies widly: the collection modification.
While using a simple for loop, nothing particular will happen if you change the collection while iterating through it. The program will behave as if it assumed you know what you're doing. This could result in some values iterated over more than once or other skipped, but no exception as long as you're not accessing outside of the range of the indexor (which would require a multithreaded program ot happen).
While using a foreachloop; if you modify the collection while iterating through it, you enter undefined behavior. The documentation tells us
An enumerator remains valid as long as the collection remains
unchanged. If changes are made to the collection, such as adding,
modifying, or deleting elements, the enumerator is irrecoverably
invalidated and its behavior is undefined.
In that case, expect most of C# built-in types to throw an InvalidOperationException, but everything can happen in a custom implementation, from missed values to repeated values , including infinite loops...
Generally speaking, yes, but strictly spoken: no. It really depends on the implementation.
Usually with for you would use the this indexer properties. foreach uses GetEnumerator() to get the enumerator that iterates over the collection. Depending on the implementation the enumerator might yield another result than the for.
The implied logic of a list is that is has a specific order, and when implementing IList you may state is it save to assume that the order of both the indexer properties as the enumerator are the same.
There is no guarantee that this would be the case. The code paths can be completely separate. Of course collections like List will produce the same result but you can write data structures (even useful ones) that do not.
The indexer is just a property with additional index argument. You can return a random value if you feel like it.
One important think you should have in mind, as a difference between the 2 is that inside foreach you can/t make any changes to the enumarared objects.
If you wish to alter (basicaly delete) objects from the enumeration you must use a for loop
So IEnumerables don't guarantee order.
Does that mean if you do myEnumerable.Skip(5) you cannot (unless you do .ToList() or otherwise before) guarantee what will be returned?
Once the objects are yielded by an IEnumerator they do have an order. There is some item that comes out first, and some item that comes out second, etc. For some particular implementations that order might have meaning, for others it might be arbitrary, but there still is some order. The Skip implementation is straightforward; it gets however many items without yielding them, and then gets the rest and yields them. Whether the items skipped mean anything in particular is the responsibility of whoever is calling the method.
Calling ToList will never change the order of the items in the sequence, so adding such a call before calling Skip wouldn't change anything. A call to OrderBy on the other hand would result in a changed ordering, possibly from a meaningless order to a meaningful order. That's not to say it's required, merely that it can, in some situations, be a useful tool.
Whether a particular ordering is guaranteed or not by any specdific IEnumerable<T> is dependent on
How that particular implementation is/was done, and on
the semantics of the underlying collection/class.
An array will enumerate its contents in the obvious sequence (from x[0] to x[n].) Ditto for a List<T>, it being essentially an array of adjustable length. Actual [linked] lists, of course, can only be enumerated in order.
The order of enumeration of Dictionary<K,V>, HashSet<T>, binary trees, etc. is dependent upon the order in which objects were added. Add the same collection of values with differing orderings to a binary tree and the structure of the tree thus constructed will vary (the degenerate case, of course, being when objects are added in order, in which case the tree structure collapsed into an [ordered] linked list.
That being said, any particular instance of IEnumerable<T>, should, barring any modifications to the underlying collection, yield the same sequence of values each time it is enumerated. That assumes, of course, a rational implementation of the interface. If the interface enumerates the collection by doing a random shuffle, of course, all bets are off.
If the actual order of items produced is important, you need to either
Use a collection having the desired semantics, or
Enforce the desired ordering by sorting the collection or enumeration.
IEnumerable is an interface. As such, the interface has no guarantee of order. However, if you have an actual object that implements that interface, that object may (and often does) guarantee the order.
If you use Skip(x), the first x elements will be ignored and everything after will be returned in a new IEnumerable<T>. The interface makes no guarantees that it will retain order but in practice it does. Anytime you operate on an IEnumerable<T> it will actually go through the same list in a linear fashion. For example if you read a file line by line into and IEnumerable<T> the lines would always be in the same order as they were in the file (assuming you don't use a sorting method). Even if you use a Where or some other method to filter the results, order will still be retained. The only thing you need to worry about is custom collections that implement IEnumerable<T>. The collections in .NET will behave as you'd expect them to.
I have a collection of items called RegisteredItems. I do not care about the order of the items in RegisteredItems, only that they exist.
I perform two types of operations on RegisteredItems:
Find and return item by property.
Iterate over collection and have side-effect.
According to: When should I use the HashSet<T> type? Robert R. says,
"It's somewhat dangerous to iterate over a HashSet because doing so
imposes an order on the items in the set. That order is not really a
property of the set. You should not rely on it. If ordering of the
items in a collection is important to you, that collection isn't a
set."
There are some scenarios where my collection would contain 50-100 items. I realize this is not a large amount of items, but I was still hoping to reap the rewards of using a HashSet instead of List.
I have found myself looking at the following code and wondering what to do:
LayoutManager.Instance.RegisteredItems.ToList().ForEach( item => item.DoStuff() );
vs
foreach( var item in LayoutManager.Instance.RegisteredItems)
{
item.DoStuff();
}
RegisteredItems used to return an IList<T>, but now it returns a HashSet. I felt that, if I was using HashSet for efficiency, it would be improper to cast it as a List. Yet, the above quote from Robert leaves me feeling uneasy about iterating over it, as well.
What's the right call in this scenario? Thanks
If you don't care about order, use a HashSet<>. The quote is about using HashSet<> being dangerous when you're worried about order. If you run this code multiple times, and the items are operated on in different order, will you care? If not, then you're fine. If yes, then don't use a HashSet<>. Arbitrarily converting to a List first doesn't really solve the problem.
And I'm not certain, but I suspect that .ToList() will iterate over the HashSet<> to do that, so, now you're walking the collection twice.
Don't prematurely optimize. If you only have 100 items, just use a HashSet<> and move on. If you start caring about order, change it to a List<> then and use it as a list everwhere.
If you really don't care about order and you know that you can't have duplicate in your hashset (and it's what you want), go ahead use hashset.
In the quoted question, I think he's saying that if you iterate over a Set, you can easily trick yourself into thinking that the items are in a certain order. For example, it'd be easy to treat the first iterated item differently, but you aren't guaranteed that will remain the first iterated item.
As long as you keep this in mind, and consider the Set unordered, iterating over it is fine.
I am exploring the HashSet<T> type, but I don't understand where it stands in collections.
Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see individual access to its elements.
Is it only for enumeration?
The important thing about HashSet<T> is right there in the name: it's a set. The only things you can do with a single set is to establish what its members are, and to check whether an item is a member.
Asking if you can retrieve a single element (e.g. set[45]) is misunderstanding the concept of the set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.
It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.
Sets are really limited and with unique members. On the other hand, they're really fast.
Here's a real example of where I use a HashSet<string>:
Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a # or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a #xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except existence of the command in the set of valid commands. Lets look at the alternatives I faced:
Dictionary<string, ?>: What type do I use for the value? The value is meaningless since I'm just going to use ContainsKey. Note: Before .NET 3.0 this was the only choice for O(1) lookups - HashSet<T> was added for 3.0 and extended to implement ISet<T> for 4.0.
List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
A HashSet<T> implements the ICollection<T> interface:
public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
// Methods
void Add(T item);
void Clear();
bool Contains(T item);
void CopyTo(T[] array, int arrayIndex);
bool Remove(T item);
// Properties
int Count { get; }
bool IsReadOnly { get; }
}
A List<T> implements IList<T>, which extends the ICollection<T>
public interface IList<T> : ICollection<T>
{
// Methods
int IndexOf(T item);
void Insert(int index, T item);
void RemoveAt(int index);
// Properties
T this[int index] { get; set; }
}
A HashSet has set semantics, implemented via a hashtable internally:
A set is a collection that contains no
duplicate elements, and whose elements
are in no particular order.
What does the HashSet gain, if it loses index/position/list behavior?
Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).
A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> by only adding/removing keys as values, and ignoring dictionary values themselves. You would expect keys in a dictionary not to have duplicate values, and that's the point of the "Set" part.
Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates - and that's when you want a Set.
HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set can not be used to replace a list (unless you should've use a set in the first place).
If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10.000 revisions of a software projects, and you want to find out how many people contributed to that project. You could use a Set<string> and iterate over the list of revisions and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
HashSet would be used to remove duplicate elements in an IEnumerable collection. For example,
List<string> duplicatedEnumrableStrings = new List<string> {"abc", "ghjr", "abc", "abc", "yre", "obm", "ghir", "qwrt", "abc", "vyeu"};
HashSet<string> uniqueStrings = new HashSet(duplicatedEnumrableStrings);
after those codes are run, uniqueStrings holds {"abc", "ghjr", "yre", "obm", "qwrt", "vyeu"};
Probably the most common use for hashsets is to see whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hashing function), as opposed to lists for which check for inclusion is O(n) (and sorted sets for which it is O(log n)). So if you do a lot of checks, whether an item is contained in some list, hahssets might be a performance improvement. If you only ever iterate over them, there won't be much difference (iterating over the whole set is O(n), same as with lists and hashsets have somewhat more overhead when adding items).
And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, and which second etc.
HashSet<T> is a data strucutre in the .NET framework that is a capable of representing a mathematical set as an object. In this case, it uses hash codes (the GetHashCode result of each item) to compare equality of set elements.
A set differs from a list in that it only allows one occurrence of the same element contained within it. HashSet<T> will just return false if you try to add a second identical element. Indeed, lookup of elements is very quick (O(1) time), since the internal data structure is simply a hashtable.
If you're wondering which to use, note that using a List<T> where HashSet<T> is appropiate is not the biggest mistake, though it may potentially allow problems where you have undesirable duplicate items in your collection. What is more, lookup (item retrieval) is vastly more efficient - ideally O(1) (for perfect bucketing) instead of O(n) time - which is quite important in many scenarios.
List<T> is used to store ordered sets of information. If you know the relative order of the elements of the list, you can access them in constant time. However, to determine where an element lies in the list or to check if it exists in the list, the lookup time is linear. On the other hand, HashedSet<T> makes no guarantees of the order of the stored data and consequently provides constant access time for its elements.
As the name implies, HashedSet<T> is a data structure that implements set semantics. The data structure is optimized to implement set operations (i.e. Union, Difference, Intersect), which can not be done as efficiently with the traditional List implementation.
So, to choose which data type to use really depends on what your are attempting to do with your application. If you don't care about how your elements are ordered in a collection, and only want to enumarate or check for existence, use HashSet<T>. Otherwise, consider using List<T> or another suitable data structure.
In the basic intended scenario HashSet<T> should be used when you want more specific set operations on two collections than LINQ provides. LINQ methods like Distinct, Union, Intersect and Except are enough in most situations, but sometimes you may need more fine-grained operations, and HashSet<T> provides:
UnionWith
IntersectWith
ExceptWith
SymmetricExceptWith
Overlaps
IsSubsetOf
IsProperSubsetOf
IsSupersetOf
IsProperSubsetOf
SetEquals
Another difference between LINQ and HashSet<T> "overlapping" methods is that LINQ always returns a new IEnumerable<T>, and HashSet<T> methods modify the source collection.
In short - anytime you are tempted to use a Dictionary (or a Dictionary where S is a property of T) then you should consider a HashSet (or HashSet + implementing IEquatable on T which equates on S)