Threadsafe foreach enumeration of lists

Threadsafe foreach enumeration of lists - c#

I need to enumerate though generic IList<> of objects. The contents of the list may change, as in being added or removed by other threads, and this will kill my enumeration with a "Collection was modified; enumeration operation may not execute."
What is a good way of doing threadsafe foreach on a IList<>? prefferably without cloning the entire list. It is not possible to clone the actual objects referenced by the list.

Cloning the list is the easiest and best way, because it ensures your list won't change out from under you. If the list is simply too large to clone, consider putting a lock around it that must be taken before reading/writing to it.

There is no such operation. The best you can do is
lock(collection){
foreach (object o in collection){
...
}
}

Your problem is that an enumeration does not allow the IList to change. This means you have to avoid this while going through the list.
A few possibilities come to mind:
Clone the list. Now each enumerator has its own copy to work on.
Serialize the access to the list. Use a lock to make sure no other thread can modify it while it is being enumerated.
Alternatively, you could write your own implementation of IList and IEnumerator that allows the kind of parallel access you need. However, I'm afraid this won't be simple.

ICollection MyCollection;
// Instantiate and populate the collection
lock(MyCollection.SyncRoot) {
// Some operation on the collection, which is now thread safe.
}
From MSDN

You'll find that's a very interesting topic.
The best approach relies on the ReadWriteResourceLock which use to have big performance issues due to the so called Convoy Problem.
The best article I've found treating the subject is this one by Jeffrey Richter which exposes its own method for a high performance solution.

So the requirements are: you need to enumerate through an IList<> without making a copy while simultaniously adding and removing elements.
Could you clarify a few things? Are insertions and deletions happening only at the beginning or end of the list?
If modifications can occur at any point in the list, how should the enumeration behave when elements are removed or added near or on the location of the enumeration's current element?
This is certainly doable by creating a custom IEnumerable object with perhaps an integer index, but only if you can control all access to your IList<> object (for locking and maintaining the state of your enumeration). But multithreaded programming is a tricky business under the best of circumstances, and this is a complex probablem.

Forech depends on the fact that the collection will not change. If you want to iterate over a collection that can change, use the normal for construct and be prepared to nondeterministic behavior. Locking might be a better idea, depending on what you're doing.

Default behavior for a simple indexed data structure like a linked list, b-tree, or hash table is to enumerate in order from the first to the last. It would not cause a problem to insert an element in the data structure after the iterator had already past that point or to insert one that the iterator would enumerate once it had arrived, and such an event could be detected by the application and handled if the application required it. To detect a change in the collection and throw an error during enumeration I could only imagine was someone's (bad) idea of doing what they thought the programmer would want. Indeed, Microsoft has fixed their collections to work correctly. They have called their shiny new unbroken collections ConcurrentCollections (System.Collections.Concurrent) in .NET 4.0.

I recently spend some time multip-threading a large application and had a lot of issues with the foreach operating on list of objects shared across threads.
In many cases you can use the good old for-loop and immediately assign the object to a copy to use inside the loop. Just keep in mind that all threads writing to the objects of your list should write to different data of the objects. Otherwise, use a lock or a copy as the other contributors suggest.
Example:
foreach(var p in Points)
{
// work with p...
}
Can be replaced by:
for(int i = 0; i < Points.Count; i ++)
{
Point p = Points[i];
// work with p...
}

Wrap the list in a locking object for reading and writing. You can even iterate with multiple readers at once if you have a suitable lock, that allows multiple concurrent readers but also a single writer (when there are no readers).

This is something that I've recently had to deal with and to me it really depends on what you're doing with the list.
If you need to use the list at a point in time (given the number of elements currently in it) AND another thread can only ADD to the end of the list, then maybe you just switch out to a FOR loop with a counter. At the point you grab the counter, you're only seeing X numbers of elements in the list. You can walk through the list (while others are adding to the end of it) . . . should not cause a problem.
Now, if the list needs to have items taken OUT of it by other threads, or CLEARED by other threads, then you'll need to implement one of the locking mechanisms mentioned above. Also, you may want to look at some of the newer "concurrent" collection classes (though I don't believe they implement IList - so you may need refactor for a dictionary).

Related

Why does Iterator define the remove() operation?

In C#, the IEnumerator interface defines a way to traverse a collection and look at the elements. I think this is tremendously useful because if you pass IEnumerable<T> to a method, it's not going to modify the original source.
However, in Java, Iterator defines the remove operation to (optionally!) allow deleting elements. There's no advantage in passing Iterable<T> to a method because that method can still modify the original collection.
remove's optionalness is an example of the refused bequest smell, but ignoring that (already discussed here) I'd be interested in the design decisions that prompted a remove event to be implemented on the interface.
What are the design decisions that led to remove being added to Iterator?
To put another way, what is the C# design decision that explicitly doesn't have remove defined on IEnumerator?

Iterator is able to remove elements during iteration. You cannot iterate collection using iterator and remove elements from target collection using remove() method of that collection. You will get ConcurrentModificationException on next call of Iterator.next() because iterator cannot know how exactly the collection was changed and cannot know how to continue to iterate.
When you are using remove() of iterator it knows how the collection was changed. Moreover actually you cannot remove any element of collection but only the current one. This simplifies continuation of iterating.
Concerning to advantages of passing iterator or Iterable: you can always use Collection.unmodifireableSet() or Collection.unmodifireableList() to prevent modification of your collection.

It is probably due to the fact that removing items from a collection while iterating over it has always been a cause for bugs and strange behaviour. From reading the documentation it would suggest that Java enforces at runtime remove() is only called once per call to next() which makes me think it has just been added to prevent people messing up removing data from a list when iterating over it.

There are situations where you want to be able to remove elements using the iterator because it is the most efficient way to do it. For example, when traversing a linked data structure (e.g. a linked list), removing using the iterator is an O(1) operation ... compared to O(N) via the List.remove() operations.
And of course, many collections are designed so that modifying the collection during a collection by any other means than Iterator.remove() will result in a ConcurrentModificationException.
If you have a situation where you don't want to allow modification via a collection iterator, wrapping it using Collection.unmodifiableXxxx and using it's iterator will have the desired effect. Alternatively, I think that Apache Commons provides a simple unmodifiable iterator wrapper.
By the way IEnumerable suffers from the same "smell" as Iterator. Take a look at the reset() method. I was also curious as to how the C# LinkedList class deals with the O(N) remove problem. It appears that it does this by exposing the internals of the list ... in the form of the First and Last properties whose values are LinkedListNode references. That violates another design principle ... and is (IMO) far more dangerous than Iterator.remove().

This is actually an awesome feature of Java. As you may well know, when iterating through a list in .NET to remove elements (of which there are a number of use cases for) you only have two options.
var listToRemove = new List<T>(originalList);
foreach (var item in originalList)
{
...
if (...)
{
listToRemove.Add(item)
}
...
}
foreach (var item in listToRemove)
{
originalList.Remove(item);
}
or
var iterationList = new List<T>(originalList);
for (int i = 0; i < iterationList.Count; i++)
{
...
if (...)
{
originalList.RemoveAt(i);
}
...
}
Now, I prefer the second, but with Java I don't need all of that because while I'm on an item I can remove it and yet the iteration will continue! Honestly, though it may seem out of place, it's really an optimization in a lot of ways.

Is there List type that can be Enumerated while it is Changing?

Sometimes it is useful to enumerate a list while it is changing.
e.g.
foreach (var item in listOfEntities)
item.Update();
// somewhere else (with someEntity contained in listOfEntities)
// an add or remove is made:
someEntity.OnUpdate += (s,e) => listOfEntities.Remove(someEntity);
This will fail if listOfEntities is a List<T>.
There are workarounds like making a copy or a simple for-loop, each with different drawbacks, but I would like to know if there is a list type in the framework (or open source) that supports this.

Look at the collections in System.Collections.Concurrent. There's no list there, but the collections' enumerators do "represents a moment-in-time snapshot of the contents of the [collection]".
These collections are designed for access from multiple threads, so they will be better suited to applications like the code sample you posted.

This has nothing to do with List<T>; it is a limitation of the enumerator. If you change the state of the collection underneath the enumerator it will throw, period.
You could use a for loop, but you will then run into logical errors as you index into a collection after the number of items have changed.
It's probably a bad idea to swap items in and out of a collection while you are enumerating it in another thread. I would stick with the tried and true method of recording the items to be removed in another collection or locking the collection while it is being enumerated.
I'm not claiming this is an impossible problem to solve, I just don't know of an easy way to do it.

What is the fastest/safest method to iterate over a HashSet?

I'm still quite new to C#, but noticed the advantages through forum postings of using a HashSet instead of a List in specific cases.
My current case isn't that I'm storing a tremendous amount of data in a single List exectly, but rather than I'm having to check for members of it often.
The catch is that I do indeed need to iterate over it as well, but the order they are stored or retrieved doesn't actually matter.
I've read that for each loops are actually slower than for next, so how else could I go about this in the fastest method possible?
The number of .Contains() checks I'm doing is definitely hurting my performance with lists, so at least comparing to the performance of a HashSet would be handy.
Edit: I'm currently using lists, iterating through them in numerous locations, and different code is being executed in each location. Most often, the current lists contain point coordinates that I then use to refer to a 2 dimensional array for that I then do some operation or another based on the criteria of the list.
If there's not a direct answer to my question, that's fine, but I assumed there might be other methods of iterating over a HashSet than just foreach cycle. I'm currently in the dark as to what other methods there might even be, what advantages they provide, etc. Assuming there are other methods, I also made the assumption that there would be a typical preferred method of choice that is only ignored when it doesn't suite the needs (my needs are pretty basic).
As far as prematurely optimizing, I already know using the lists as I am is a bottleneck. How to go about helping this issue is where I'm getting stuck. Not even stuck exactly, but I didn't want to re-invent the wheel by testing repeatedly only to find out I'm already doing it the best way I could (this is a large project with over 3 months invested, lists are everywhere, but there are definitely ones that I do not want duplicates, have a lot of data, need not be stored in any specific order, etc).

A foreach loop has a small amount of addition overhead on an indexed collections (like an array).
This is mostly because the foreach does a little more bounds checking than a for loop.
HashSet does not have an indexer so you have to use the enumerator.
In this case foreach is efficient as it only calls MoveNext() as it moves through the collection.
Also Parallel.ForEach can dramatically improve your performance, depending on the work you are doing in the loop and the size of your HashSet.
As mentioned before profiling is your best bet.

You shouldn't be iterating over a hashset in the first place to determine if an item is in it. You should use the HashSet (not the LINQ) contains method. The HashSet is designed such that it won't need to look through every item to see if any given value is inside of the set. That is what makes it so powerful for searching over a List.

Not strictly answering the question in the header, but more concerning your specific problem:
I would make your own Collection object that uses both a HashSet and a List internally. Iterating is fast as you can use the List, checking for Contains is fast as you can use the HashSet. Just make it an IEnumerable and you can use this Collection in foreach as well.
The downside is more memory, but there are only twice as many references to object, not twice as many objects. Worst case scenario it's only twice as much memory, but you seem much more concerned with performance.
Adding, checking, and iterating are fast this way, only removal is still O(N) because of the List.
EDIT: If removal needs to be O(1) as well, use a doubly linked list instead of a regular list, and make the hashSet a Dictionary<KeyType, Cell> instead. You can check the dictionary for Contains, but also to find the cell with the data in it fast, so removal from the data structure is fast.

I had the same issue, where the HashSet suits very well the addition of unique elements, but is very slow when getting elements in a for loop. I solved it by converting the HashSet to array and then running the for over it.

I have read that it is bad practice to iterate over a HashSet. Should I be calling .ToList() on it first?

I have a collection of items called RegisteredItems. I do not care about the order of the items in RegisteredItems, only that they exist.
I perform two types of operations on RegisteredItems:
Find and return item by property.
Iterate over collection and have side-effect.
According to: When should I use the HashSet<T> type? Robert R. says,
"It's somewhat dangerous to iterate over a HashSet because doing so
imposes an order on the items in the set. That order is not really a
property of the set. You should not rely on it. If ordering of the
items in a collection is important to you, that collection isn't a
set."
There are some scenarios where my collection would contain 50-100 items. I realize this is not a large amount of items, but I was still hoping to reap the rewards of using a HashSet instead of List.
I have found myself looking at the following code and wondering what to do:
LayoutManager.Instance.RegisteredItems.ToList().ForEach( item => item.DoStuff() );
vs
foreach( var item in LayoutManager.Instance.RegisteredItems)
{
item.DoStuff();
}
RegisteredItems used to return an IList<T>, but now it returns a HashSet. I felt that, if I was using HashSet for efficiency, it would be improper to cast it as a List. Yet, the above quote from Robert leaves me feeling uneasy about iterating over it, as well.
What's the right call in this scenario? Thanks

If you don't care about order, use a HashSet<>. The quote is about using HashSet<> being dangerous when you're worried about order. If you run this code multiple times, and the items are operated on in different order, will you care? If not, then you're fine. If yes, then don't use a HashSet<>. Arbitrarily converting to a List first doesn't really solve the problem.
And I'm not certain, but I suspect that .ToList() will iterate over the HashSet<> to do that, so, now you're walking the collection twice.
Don't prematurely optimize. If you only have 100 items, just use a HashSet<> and move on. If you start caring about order, change it to a List<> then and use it as a list everwhere.

If you really don't care about order and you know that you can't have duplicate in your hashset (and it's what you want), go ahead use hashset.

In the quoted question, I think he's saying that if you iterate over a Set, you can easily trick yourself into thinking that the items are in a certain order. For example, it'd be easy to treat the first iterated item differently, but you aren't guaranteed that will remain the first iterated item.
As long as you keep this in mind, and consider the Set unordered, iterating over it is fine.

Return collection as read-only

I have an object in a multi-threaded environment that maintains a collection of information, e.g.:
public IList<string> Data
{
get
{
return data;
}
}
I currently have return data; wrapped by a ReaderWriterLockSlim to protect the collection from sharing violations. However, to be doubly sure, I'd like to return the collection as read-only, so that the calling code is unable to make changes to the collection, only view what's already there. Is this at all possible?

If your underlying data is stored as list you can use List(T).AsReadOnly method.
If your data can be enumerated, you can use Enumerable.ToList method to cast your collection to List and call AsReadOnly on it.

I voted for your accepted answer and agree with it--however might I give you something to consider?
Don't return a collection directly. Make an accurately named business logic class that reflects the purpose of the collection.
The main advantage of this comes in the fact that you can't add code to collections so whenever you have a native "collection" in your object model, you ALWAYS have non-OO support code spread throughout your project to access it.
For instance, if your collection was invoices, you'd probably have 3 or 4 places in your code where you iterated over unpaid invoices. You could have a getUnpaidInvoices method. However, the real power comes in when you start to think of methods like "payUnpaidInvoices(payer, account);".
When you pass around collections instead of writing an object model, entire classes of refactorings will never occur to you.
Note also that this makes your problem particularly nice. If you don't want people changing the collections, your container need contain no mutators. If you decide later that in just one case you actually HAVE to modify it, you can create a safe mechanism to do so.
How do you solve that problem when you are passing around a native collection?
Also, native collections can't be enhanced with extra data. You'll recognize this next time you find that you pass in (Collection, Extra) to more than one or two methods. It indicates that "Extra" belongs with the object containing your collection.

If your only intent is to get calling code to not make a mistake, and modify the collection when it should only be reading all that is necessary is to return an interface which doesn't support Add, Remove, etc.. Why not return IEnumerable<string>? Calling code would have to cast, which they are unlikely to do without knowing the internals of the property they are accessing.
If however your intent is to prevent the calling code from observing updates from other threads you'll have to fall back to solutions already mentioned, to perform a deep or shallow copy depending on your need.

I think you're confusing concepts here.
The ReadOnlyCollection provides a read-only wrapper for an existing collection, allowing you (Class A) to pass out a reference to the collection safe in the knowledge that the caller (Class B) cannot modify the collection (i.e. cannot add or remove any elements from the collection.)
There are absolutely no thread-safety guarantees.
If you (Class A) continue to modify the underlying collection after you hand it out as a ReadOnlyCollection then class B will see these changes, have any iterators invalidated, etc. and generally be open to any of the usual concurrency issues with collections.
Additionally, if the elements within the collection are mutable, both you (Class A) and the caller (Class B) will be able to change any mutable state of the objects within the collection.
Your implementation depends on your needs:
- If you don't care about the caller (Class B) from seeing any further changes to the collection then you can just clone the collection, hand it out, and stop caring.
- If you definitely need the caller (Class B) to see changes that are made to the collection, and you want this to be thread-safe, then you have more of a problem on your hands. One possibility is to implement your own thread-safe variant of the ReadOnlyCollection to allow locked access, though this will be non-trivial and non-performant if you want to support IEnumerable, and it still won't protect you against mutable elements in the collection.

One should note that aku's answer will only protect the list as being read only. Elements in the list are still very writable. I don't know if there is any way of protecting non-atomic elements without cloning them before placing them in the read only list.

You can use a copy of the collection instead.
public IList<string> Data {
get {
return new List<T>(data);
}}
That way it doesn't matter if it gets updated.

You want to use the yield keyword. You loop through the IEnumerable list and return the results with yeild. This allows the consumer to use the for each without modifying the collection.
It would look something like this:
List<string> _Data;
public IEnumerable<string> Data
{
get
{
foreach(string item in _Data)
{
return yield item;
}
}
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.