Iterating Collection In Two Threads

This question relates to both C# and Java.
If you have a collection that is not modified, and a reference to that collection is shared between two threads, what happens when you iterate over it on each thread?
ThreadA: Collection.iterator
ThreadA: Collection.moveNext
ThreadB: Collection.iterator
ThreadB: Collection.moveNext
Will ThreadB see the first element?
Is the iterator always reset when it is requested? What happens if the calls are interleaved, so that the moveNext and item calls of the two threads alternate? Is there a danger that you don't process all elements?

It works as expected because each time you request the iterator you get a new one.
If it didn't you wouldn't be able to do foreach followed by foreach on the same collection!
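A minimal C# sketch of that behaviour, assuming a List<string> called shared (a hypothetical name). Two enumerators are requested from the same list and advanced in the interleaved order from the question. Everything runs on one thread here, purely to keep the interleaving deterministic.

```csharp
using System.Collections.Generic;

var shared = new List<string> { "a", "b", "c" };

// Each GetEnumerator() call returns a fresh enumerator with its own position.
IEnumerator<string> iterA = shared.GetEnumerator();
iterA.MoveNext();                       // "ThreadA" moves to the first element

IEnumerator<string> iterB = shared.GetEnumerator();
iterB.MoveNext();                       // "ThreadB" also starts at the first element
string firstSeenByB = iterB.Current;

iterA.MoveNext();                       // advancing A does not disturb B
string secondSeenByA = iterA.Current;
string stillSeenByB = iterB.Current;
```

The enumerators are held as IEnumerator<string> rather than as the struct that List<T> returns, so that each variable refers to exactly one enumerator instance.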

By convention, an Iterator is implemented such that the traversal never alters the collection's state. It only points to the current position in the collection and manages the iteration logic. Therefore, if you scan the same collection from N different threads, everything should work fine.
However, note that Java's Iterator allows removal of items, and ListIterator even supports the set operation. If at least one of the threads uses these operations, you will probably face concurrency problems (ConcurrentModificationException), unless the Iterator is specifically designed for such scenarios (as ConcurrentHashMap's iterators are).

In Java (and I am pretty sure also in C#), the standard API collections typically do not have a single iterator. Each call to iterator() produces a new one, which has its own internal index or pointer, so as long as both threads acquire their own iterator object, there should be no problem.
However, this is not guaranteed by the interface, nor is the ability of two iterators to work concurrently without problems. For custom implementations of collections, all bets are off.

In C#, yes. In Java it seems to be the same, but I'm not familiar with it.
About c# see http://csharpindepth.com/Articles/Chapter6/IteratorBlockImplementation.aspx and http://csharpindepth.com/Articles/Chapter11/StreamingAndIterators.aspx

At least in C#, all of the standard collections can be enumerated simultaneously on different threads. However, enumeration on any thread will blow up if you modify the underlying collection during enumeration (as it should.) I don't believe any sane developer writing a collection class would have their enumerators mutate the collection state in a way that interferes with enumeration, but it's possible. If you're using a standard collection, however, you can safely assume this and therefore use locking strategies like Single Writer / Multiple Reader when synchronizing collection access.
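As a rough illustration of simultaneous enumeration, the sketch below lets two tasks enumerate the same unmodified List<int> at the same time. The names numbers, SumAll, readerA and readerB are made up for this example, and no thread modifies the list while it is being read.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

List<int> numbers = Enumerable.Range(1, 1000).ToList();

// Enumerates the whole list; every foreach obtains its own enumerator.
long SumAll()
{
    long sum = 0;
    foreach (int n in numbers) sum += n;
    return sum;
}

// Two readers run concurrently over the same list instance.
Task<long> readerA = Task.Run(() => SumAll());
Task<long> readerB = Task.Run(() => SumAll());
Task.WaitAll(readerA, readerB);

long sumA = readerA.Result;
long sumB = readerB.Result;
```

Both readers see every element, because neither enumerator shares position state with the other.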

Related

How to map ImmutableArray without getting it cast to IEnumerable which is not thread safe?

So I'm working in a multithreaded environment and I want to use ImmutableArray all the time because it's thread safe.
Unfortunately, ImmutableArray implements thread-unsafe interfaces, and so the Select method from LINQ returns IEnumerable.
This way, my thread safe variable becomes thread unsafe.
How do I map from ImmutableArray to ImmutableArray?

It seems that there are a lot of misunderstandings behind this question. First, you need to look at the source code for the Select method and learn about the yield keyword.
Second, LINQ methods are made to be short-lived. You have various threads doing various processing tasks. Are you using a pipeline, where you transform data in one thread and pass the result to another thread? You have to be careful with the yield keyword in that situation; essentially, you need to realize (materialize) your collections before passing them to the next thread, so that the actual work is done in the present thread. In that scenario, object ownership kicks in and you don't need thread-safe collections.
In short, the enumerable returned from calling Select on ImmutableArray is perfectly thread-safe. You can realize it at any point and it won't give you any errors. Of course it will only iterate through the data that was contained in your collection at the time you called Select. It won't know anything about newly assigned instances.
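One way to go from ImmutableArray to ImmutableArray is to realize the projection straight away with ToImmutableArray(). This is only a sketch; the variable names and the sample data are invented for illustration.

```csharp
using System.Collections.Immutable;
using System.Linq;

ImmutableArray<int> source = ImmutableArray.Create(1, 2, 3);

// Select is lazy; ToImmutableArray() runs the projection now, on this thread,
// and stores the results in a new immutable array.
ImmutableArray<string> mapped = source.Select(n => $"item {n}").ToImmutableArray();
```

The result is again an ImmutableArray, so it can be handed to other threads without passing a deferred query around.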

When to use BlockingCollection and when ConcurrentBag instead of List<T>?

The accepted answer to the question "Why does this Parallel.ForEach code freeze the program up?" advises to substitute the List usage by ConcurrentBag in a WPF application.
I'd like to understand whether a BlockingCollection can be used in this case instead?

You can indeed use a BlockingCollection, but there is absolutely no point in doing so.
First off, note that BlockingCollection is a wrapper around a collection that implements IProducerConsumerCollection<T>. Any type that implements that interface can be used as the underlying storage:
"When you create a BlockingCollection<T> object, you can specify not only the bounded capacity but also the type of collection to use. For example, you could specify a ConcurrentQueue<T> object for first in, first out (FIFO) behavior, or a ConcurrentStack<T> object for last in, first out (LIFO) behavior. You can use any collection class that implements the IProducerConsumerCollection<T> interface. The default collection type for BlockingCollection<T> is ConcurrentQueue<T>."
This includes ConcurrentBag<T>, which means you can have a blocking concurrent bag. So what's the difference between a plain IProducerConsumerCollection<T> and a blocking collection? The documentation of BlockingCollection says (emphasis mine):
"BlockingCollection<T> is used as a wrapper for an IProducerConsumerCollection<T> instance, allowing removal attempts from the collection to block until data is available to be removed. Similarly, a BlockingCollection<T> can be created to enforce an upper-bound on the number of data elements allowed in the IProducerConsumerCollection<T> [...]"
Since in the linked question there is no need to do either of these things, using BlockingCollection simply adds a layer of functionality that goes unused.

List<T> is a collection designed for use in single-threaded applications.
ConcurrentBag<T> is a class in the System.Collections.Concurrent namespace designed to simplify using collections in multi-threaded environments. If you use a concurrent collection, you do not have to lock it to prevent corruption by other threads. You can insert or take data from the collection without writing any special locking code.
BlockingCollection<T> is designed to remove the need to check whether new data is available in a collection shared between threads. If new data is inserted into the shared collection, your consumer thread wakes up immediately. So you do not have to poll for new data at certain time intervals, typically in a while loop.
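The blocking behaviour can be sketched roughly as follows. The names queue, consumer and total, and the capacity of 4, are arbitrary choices for this example: one consumer task waits for items instead of polling, and the producer signals the end with CompleteAdding().

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded to 4 items: Add blocks when the collection is full.
var queue = new BlockingCollection<int>(boundedCapacity: 4);

// The consumer blocks inside GetConsumingEnumerable() until an item arrives,
// and the loop ends once adding is completed and the collection is drained.
Task<int> consumer = Task.Run(() =>
{
    int sum = 0;
    foreach (int item in queue.GetConsumingEnumerable()) sum += item;
    return sum;
});

for (int i = 1; i <= 10; i++) queue.Add(i);
queue.CompleteAdding();

int total = consumer.Result;   // waits for the consumer to finish
```

No while loop with a sleep is needed on the consumer side; the waiting is handled by the collection.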

Whenever you find the need for a thread-safe List<T>, in most cases neither the ConcurrentBag<T> nor the BlockingCollection<T> is going to be your best option. Both collections are specialized for facilitating producer-consumer scenarios, so unless you have more than one thread concurrently adding and removing items from the collection, you should look for other options (with the best candidate being the ConcurrentQueue<T> in most cases).
Regarding especially the ConcurrentBag<T>, it's an extremely specialized class targeting mixed producer-consumer scenarios. This means that each worker-thread is expected to be both a producer and a consumer (that adds and removes items from the same collection). It could be a good candidate for the internal storage of an ObjectPool class, but beyond that it is hard to imagine any advantageous usage scenario for this class.
People usually think that the ConcurrentBag<T> is the thread-safe equivalent of a List<T>, but it's not. The similarity of the two APIs is misleading. Calling Add on a List<T> results in the item being added at the end of the list. Calling Add on a ConcurrentBag<T> results instead in the item being added at a random slot inside the bag. The ConcurrentBag<T> is essentially unordered. It is not optimized for being enumerated, and does a lousy job when it is commanded to do so. It internally maintains a bunch of thread-local queues, so the order of its contents is dominated by which thread did what, not by when something happened. Before each enumeration of the ConcurrentBag<T>, all these thread-local queues are copied to an array, adding pressure to the garbage collector. So for example the line var item = bag.First(); results in a copy of the whole collection, for returning just one element.
These characteristics make the ConcurrentBag<T> a less than ideal choice for storing the results of a Parallel.For/Parallel.ForEach loop.
A better thread-safe substitute of the List<T>.Add is the ConcurrentQueue<T>.Enqueue method. "Enqueue" is a less familiar word than "Add", but it actually does what you expect it to do.
There is nothing that a ConcurrentBag<T> can do that a ConcurrentQueue<T> can't. For example neither collection offers a way to remove a specific item from the collection. If you want a concurrent collection with a TryRemove method that has a key parameter, you could look at the ConcurrentDictionary<K,V> class.
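A short sketch of collecting Parallel.ForEach results with ConcurrentQueue<T>.Enqueue. The input range and the squaring step are placeholders for whatever work the loop really does.

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var results = new ConcurrentQueue<int>();

Parallel.ForEach(Enumerable.Range(1, 100), n =>
{
    results.Enqueue(n * n);   // thread-safe, no explicit lock needed
});

int count = results.Count;
long sum = results.Sum(x => (long)x);
```

The queue keeps the items in the order in which they were enqueued, which under parallel execution is generally not the order of the source elements.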
The ConcurrentBag<T> appears frequently in the Task Parallel Library-related examples in Microsoft's documentation.
Whoever wrote the documentation apparently valued the tiny usability advantage of writing Add instead of Enqueue more than the behavioral/performance disadvantage of using the wrong collection. This makes some sense considering that the examples were authored at a time when the TPL was new, and the goal was the fast adoption of the library by developers who were mostly unfamiliar with parallel programming. I get it, Enqueue is a scary word when you see it for the first time. Unfortunately now there is a whole generation of developers who have incorporated the ConcurrentBag<T> into their mental toolbox, although it has no business being there, considering how specialized this collection is.
In case you want to collect the results of a Parallel.ForEach loop in exactly the same order as the source elements, you can use a List<T> protected with a lock. In most cases the overhead will be negligible, especially if the work inside the loop is chunky. An example is shown below, featuring the Select LINQ operator for getting the index of each element.
var indexedSource = source.Select((item, index) => (item, index));
List<TResult> results = new();
Parallel.ForEach(indexedSource, parallelOptions, entry =>
{
    var (item, index) = entry;
    TResult result = GetResult(item);
    lock (results)
    {
        // Grow the list with placeholders until the slot for this index exists.
        while (results.Count <= index) results.Add(default);
        results[index] = result;
    }
});
This is for the case that the source is a deferred sequence with unknown size. If you know its size beforehand, it is even simpler. Just preallocate a TResult[] array, and update it in parallel without locking:
TResult[] results = new TResult[source.Count];
Parallel.For(0, source.Count, parallelOptions, i =>
{
    // Each iteration writes to its own slot, so no lock is needed.
    results[i] = GetResult(source[i]);
});
The TPL includes memory barriers at the end of task executions, so all the values of the results array will be visible from the current thread.

Yes, you could use BlockingCollection for that. finishedProxies would be defined as:
BlockingCollection<string> finishedProxies = new BlockingCollection<string>();
and to add an item, you would write:
finishedProxies.Add(checkResult);
And when it's done, you could create a list from the contents.

IProducerConsumerCollection<T>.TryAdd/.TryTake - when do they return true/false?

When I call IProducerConsumerCollection<T>.TryAdd(<T>) or IProducerConsumerCollection<T>.TryTake(out <T>) will these ever fail because another thread is using the collection?
Or is it the case that if there is space to Add or something to Take even after the other thread has finished with the collection, it will always return true?
Nothing that I can see here: http://msdn.microsoft.com/en-us/library/dd287147.aspx

While in theory the collections could reject take/add requests for any reason, the only reason I know about is Add failing because the collection has reached its capacity, and Take failing because the collection is empty.
The collections are designed from the get-go to be used from multiple threads - so if there are items left, even if two threads try to Take at the same time, they should both get an item and a return value of true.

For example, BlockingCollection<T>, which is a high-level abstraction over the interface (it doesn't implement the interface itself) with bounding and blocking capabilities, may throw one of the following:
ObjectDisposedException on TryAdd(T) or TryTake(out T) once the collection is disposed.
InvalidOperationException on TryAdd(T) if the collection is marked as complete for adding. Think about the situation where two producers add values to a collection, one marks the collection as complete, and then the other one tries to add to it.
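The cases mentioned in these answers can be tried out with a small sketch. It uses ConcurrentQueue<T> through the interface and a BlockingCollection<T> with a capacity of 1; all variable names are made up for the example.

```csharp
using System;
using System.Collections.Concurrent;

// Plain IProducerConsumerCollection<T>: TryTake fails only because it is empty.
IProducerConsumerCollection<int> queue = new ConcurrentQueue<int>();
bool tookFromEmpty = queue.TryTake(out _);
bool added = queue.TryAdd(1);
bool took = queue.TryTake(out int first);

// Bounded BlockingCollection<T>: TryAdd fails when the capacity is reached.
var bounded = new BlockingCollection<int>(boundedCapacity: 1);
bool firstAdd = bounded.TryAdd(10);
bool secondAdd = bounded.TryAdd(20);     // full, and no wait time was given
bounded.TryTake(out _);                  // make room again

// After CompleteAdding(), TryAdd throws instead of returning false.
bounded.CompleteAdding();
bool threwAfterComplete = false;
try { bounded.TryAdd(30); }
catch (InvalidOperationException) { threwAfterComplete = true; }
```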

Thread safety of List<T> with One Writer, No Enumerators

While going through some database code looking for a bug unrelated to this question, I noticed that in some places List<T> was being used inappropriately. Specifically:
There were many threads concurrently accessing the List as readers, but using indexes into the list instead of enumerators.
There was a single writer to the list.
There was zero synchronization: readers and the writer were accessing the list at the same time, but because of the code structure the last element would never be accessed until the method that executed the Add() had returned.
No elements were ever removed from the list.
According to the C# documentation, this should not be thread safe.
Yet it has never failed. I am wondering: because of the specific implementation of the List (I am assuming internally it's an array that re-allocs when it runs out of space), is the 1-writer 0-enumerator n-reader add-only scenario accidentally thread safe, or is there some unlikely scenario where this could blow up in the current .NET4 implementation?
Edit: An important detail I left out, judging by some of the replies. The readers treat the List and its contents as read-only.

This can and will blow up. It just hasn't yet. Stale indices are usually the first thing to go. It will blow up just when you don't want it to. You are probably lucky at the moment.
As you are using .Net 4.0, I'd suggest changing the list to a suitable collection from System.Collections.Concurrent which is guaranteed to be thread safe. I'd also avoid using array indices and switch to ConcurrentDictionary if you need to look up something:
http://msdn.microsoft.com/en-us/library/dd287108.aspx

The fact that it has never failed, or that your application doesn't crash, does not mean that this scenario is thread safe. For instance, suppose the writer thread updates a field within the list, let's say a long field, at the same time as a reader thread is reading that field. The value returned may be a bitwise combination of the old value and the new one! That can happen because the reader thread starts reading the value from memory, but before it finishes reading, the writer thread updates it.
Edit: That of course assumes that the reader threads just read all the data without updating anything. I am sure that they don't change the values in the list itself, but they could change a property or field within the value they read. For instance:
for (int index = 0; index < list.Count; index++)
{
    MyClass myClass = list[index]; // OK, we are just reading the value from the list
    myClass.SomeInteger++; // boom: the same variable may be updated from other threads...
}
This example is not about the thread safety of the list itself, but about the shared variables that the list exposes.
The conclusion is that you have to use a synchronization mechanism such as lock before interacting with the list, even if it has only one writer and no items are removed. That will help you prevent tiny bugs and failure scenarios that you could have avoided in the first place.

Thread safety only matters when data is modified more than once at a time. The number of readers does not matter. Even when someone is writing while someone else reads, the reader gets either the old data or the new; it still works. The fact that elements can only be accessed after the Add() returns prevents parts of the element being read separately. If you started using the Insert() method, readers could get the wrong data.

It follows then, that if the architecture is 32 bits, writing a field bigger than 32 bits, such as long and double, is not a thread safe operation; see the documentation for System.Double:
"Assigning an instance of this type is not thread safe on all hardware platforms because the binary representation of that instance might be too large to assign in a single atomic operation."
If the list is fixed in size, however, this situation matters only if the List is storing value types greater than 32 bits. If the list is only holding reference types, then any thread safety issues stem from the reference types themselves, not from their storage and retrieval from the List. For instance, immutable reference types are less likely to cause thread safety issues than mutable reference types.
Moreover, you can't control the implementation details of List: that class was mainly designed for performance, and it's likely to change in the future with that aspect, rather than thread safety, in mind.
In particular, adding elements to a list or otherwise changing its size is not thread safe even if the list's elements are 32 bits long, since there is more involved in inserting, adding, or removing than just placing the element in the list. If such operations are needed after other threads have access to the list, then locking access to the list or using a concurrent list implementation is a better choice.
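A minimal sketch of the locking option for the one-writer, many-readers, add-only case. The names items, gate, Append and TryRead are hypothetical, and a plain lock is used for simplicity rather than any particular optimized scheme.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

var items = new List<int>();
var gate = new object();

// The writer appends under the lock.
void Append(int value)
{
    lock (gate) items.Add(value);
}

// Readers index into the list under the same lock, so they never
// observe the list in the middle of a resize.
bool TryRead(int index, out int value)
{
    lock (gate)
    {
        if (index < items.Count) { value = items[index]; return true; }
        value = default;
        return false;
    }
}

Task writer = Task.Run(() => { for (int i = 0; i < 1000; i++) Append(i); });
Task<int> reader = Task.Run(() =>
{
    int mismatches = 0;
    for (int i = 0; i < 1000; i++)
        if (TryRead(i, out int v) && v != i) mismatches++;
    return mismatches;
});

writer.Wait();
int mismatchCount = reader.Result;
```

The reader may find that an index has not been written yet, in which case TryRead returns false, but it never sees a wrong value.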

First off, to some of the posts and comments, since when was documentation reliable?
Second, this answer is more to the general question than the specifics of the OP.
I agree with MrFox in theory because this all boils down to two questions:
Is the List class implemented as a flat array?
If yes, then:
Can a write instruction be preempted in the middle of a write?
I believe this is not the case -- the full write will happen before anything can read that DWORD or whatever. In other words, it will never happen that I write two of the four bytes of a DWORD and then you read 1/2 of the new value and 1/2 of the old one.
So, if you're indexing an array by providing an offset to some pointer, you can read safely without thread-locking. If the List is doing more than just simple pointer math, then it is not thread safe.
If the List was not using a flat array, I think you would have seen it crash by now.
My own experience is that it is safe to read a single item from a List via index without thread-locking. This is all just IMHO though, so take it for what it's worth.
Worst case, such as if you need to iterate through the list, the best thing to do is:
lock the List
create an array the same size
use CopyTo() to copy the List to the array
unlock the List
then iterate through the array instead of the list.
In C++/CLI:
List<Object^>^ objects = gcnew List<Object^>();
// in some reader thread:
Monitor::Enter(objects);
array<Object^>^ objs = gcnew array<Object^>(objects->Count);
objects->CopyTo(objs);
Monitor::Exit(objects);
// use objs array
Even with the memory allocation, this will be faster than locking the List and iterating through the entire thing before unlocking it.
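The same copy-then-iterate steps might look like this in C#. This is a sketch: gate is a hypothetical lock object that the writer must also use, and ToArray() stands in for the allocate-plus-CopyTo pair.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

var objects = new List<string> { "x", "y", "z" };
var gate = new object();

// A writer keeps adding items, always under the same lock.
Task writer = Task.Run(() =>
{
    for (int i = 0; i < 100; i++)
        lock (gate) objects.Add("n" + i);
});

// Reader: hold the lock only long enough to copy the list.
string[] snapshot;
lock (gate)
{
    snapshot = objects.ToArray();
}

// Iterate over the private copy without holding the lock.
int visited = 0;
foreach (string s in snapshot) visited++;

writer.Wait();
```

The snapshot contains whatever the list held at the moment of the copy, so items added afterwards are not seen by that iteration.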
Just a heads up though: if you want a fast system, thread-locking is your worst enemy. Use ZeroMQ instead. I can speak from experience, message-based synch is the right way to go.

Looking for a faster implementation for IEnumerable/IEnumerator

I'm trying to optimize a concurrent collection that tries to minimize lock contention for reads. The first pass used a linked list, which allowed me to lock only on writes while many simultaneous reads could continue unblocked. This used a custom IEnumerator to yield the next link value. Once I started comparing iteration over the collection to a plain List<T>, I noticed my implementation was about half as fast (for from x in c select x on a collection of 1m items, I got 24ms for List<T> and 49ms for my collection).
So I thought I'd use a ReaderWriterLockSlim and sacrifice a little contention on reads so I could use a List<T> as my internal storage. Since I have to acquire the read lock when iteration starts and release it upon completion, I first used a yield pattern for my IEnumerable, foreaching over the internal List<T>. Now I was getting only 66ms.
I peeked at what List actually does, and it uses an internal store of T[] and a custom IEnumerator that moves the index forward and returns the value at the current index. Now, manually using T[] as storage means a lot more maintenance work, but wth, I'm chasing microseconds.
However, even mimicking the IEnumerator moving the index on an array, the best I could do was about 38ms. So what gives List<T> its secret sauce, or alternatively, what's a faster implementation for an iterator?
UPDATE: Turns out my main speed culprit was running a Debug compile, while List<T> is obviously a Release compile. In release my implementation is still a hair slower than List<T>, although on Mono it's now faster.
One other suggestion I got from a friend is that the BCL is faster because it's in the GAC and can therefore get pre-compiled by the system. I will have to put my test in the GAC to test that theory.

Acquiring and releasing the lock on each iteration sounds like a bad idea - because if you perform an Add or Remove while you're iterating over the list, that will invalidate the iterator. List<T> certainly wouldn't like that, for example.
Would your use case allow callers to take out a ReaderWriterLockSlim around their whole process of iteration, instead of on a per-item basis? That would be more efficient and more robust. If not, how are you planning to deal with the concurrency issue? If a writer adds an element earlier than the place where I've got to, a simple implementation would return the same element twice. The opposite would happen with a removal - the iterator would skip an element.
Finally, is .NET 4.0 an option? I know there are some highly optimised concurrent collections there...
EDIT: I'm not quite sure what your current situation is in terms of building an iterator by hand, but one thing that you might want to investigate is using a struct for the IEnumerator<T>, and making your collection explicitly declare that it returns that - that's what List<T> does. It does mean using a mutable struct, which makes kittens cry all around the world, but if this is absolutely crucial to performance and you think you can live with the horror, it's at least worth a try.
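The idea of holding a ReaderWriterLockSlim around the whole iteration, rather than per item, could be sketched like this. The names rwLock, store, Add and SumAll are invented here, and summing stands in for whatever the caller does while iterating.

```csharp
using System.Collections.Generic;
using System.Threading;

var rwLock = new ReaderWriterLockSlim();
var store = new List<int> { 1, 2, 3 };

// Writers take the exclusive lock for each modification.
void Add(int value)
{
    rwLock.EnterWriteLock();
    try { store.Add(value); }
    finally { rwLock.ExitWriteLock(); }
}

// Readers hold the shared lock for the entire pass over the list,
// so no writer can invalidate the enumerator halfway through.
int SumAll()
{
    rwLock.EnterReadLock();
    try
    {
        int sum = 0;
        foreach (int item in store) sum += item;
        return sum;
    }
    finally { rwLock.ExitReadLock(); }
}

Add(4);
int total = SumAll();
```

Several readers can be inside SumAll at once, while a writer waits until all of them have left.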
