C# Concurrent Data Structure which can be Iterated through - c#

I am looking for a thread safe concurrent data structure which I am able to iterate though. order does not matter to me.
I want to have a Collection of Objects.
One thread will be iterating through this collection continuously calling each objects Object.Add(Item)
eg:
Collection<Object> collection = new Collection<Object>();
//Run On New thread
while(tasksAvaliable)
{
foreach(var item in collection)
{
item.Add(task); //I get tasks from a Concurrent Queue
}
}
In another thread / timer , it will attempt to try Add, Edit, Remove these objects from the collection. (however less frequently than the other thread)
What Collection Can I use? I am currently Using a concurrent Bag. But I am not sure that when I use a foreach on it that every Object in the collection will receive my Task which I provide it in the foreach loop.
I was thinking alternatively I could use a normal List<Object> and simply lock it when I Add,edit, or remove objects from it. However objects which I am not working with will get blocked. Which i would prefer to miss, ie I would prefer a fine grain lock.

Create your own class (wrapper) on IList, which provide threadsafe access or try use ConcurrentBag<T>

Related

is enumerator thread safe after getting with lock?

I am wondering if the returned enumerator is thread safe:
public IEnumerator<T> GetEnumerator()
{
lock (_sync) {
return _list.GetEnumerator();
}
}
If I have multiple threads whom are adding data (also within lock() blocks) to this list and one thread enumerating the contents of this list. When the enumerating thread is done it clears the list. Will it then be safe to use the enumerator gotten from this method.
I.e. does the enumerator point to a copy of the list at the instance it was asked for or does it always point back to the list itself, which may or may not be manipulated by another thread during its enumeration?
If the enumerator is not thread safe, then the only other course of action I can see is to create a copy of the list and return that. This is however not ideal as it will generate lots of garbage (this method is called about 60 times per second).
No, not at all. This lock synchronizes only access to _list.GetEnumerator method; where as enumerating a list is lot more than that. It includes reading the IEnumerator.Current property, calling IEnumerator.MoveNext etc.
You either need a lock over the foreach(I assume you enumerate via foreach), or you need to make a copy of list.
Better option is to take a look at Threadsafe collections provided out of the box.
According to the documentation, to guarantee thread-safity you have to lock collecton during entire iteration over it.
The enumerator does not have exclusive access to the collection;
therefore, enumerating through a collection is intrinsically not a
thread-safe procedure. To guarantee thread safety during enumeration,
you can lock the collection during the entire enumeration. To allow
the collection to be accessed by multiple threads for reading and
writing, you must implement your own synchronization.
Another option, may be define you own custom iterator, and for every thread create a new instance of it. So every thread will have it's own copy of Current read-only pointer to the same collection.
If I have multiple threads whom are adding data (also within lock() blocks) to this list and one thread enumerating the contents of this
list. When the enumerating thread is done it clears the list. Will it
then be safe to use the enumerator gotten from this method.
No. See reference here: http://msdn.microsoft.com/en-us/library/system.collections.ienumerator.aspx
An enumerator remains valid as long as the collection remains
unchanged. If changes are made to the collection, such as adding,
modifying, or deleting elements, the enumerator is irrecoverably
invalidated and the next call to MoveNext or Reset throws an
InvalidOperationException. If the collection is modified between
MoveNext and Current, Current returns the element that it is set to,
even if the enumerator is already invalidated. The enumerator does not
have exclusive access to the collection; therefore, enumerating
through a collection is intrinsically not a thread-safe procedure.
Even when a collection is synchronized, other threads can still modify
the collection, which causes the enumerator to throw an exception. To
guarantee thread safety during enumeration, you can either lock the
collection during the entire enumeration or catch the exceptions
resulting from changes made by other threads.
..
does the enumerator point to a copy of the list at the instance it was asked for or does it always point back to the list itself, which
may or may not be manipulated by another thread during its
enumeration?
Depends on the collection. See the Concurrent Collections. Concurrent Stack, ConcurrentQueue, ConcurrentBag all take snapshots of the collection when GetEnumerator() is called and returns elements from the snapshot. The underlying collection may change without changing the snapshot. On the other hand, ConcurrentDictionary doesn't take a snapshot, so changing the collection while iterating will immediately effect things per the rules above.
A trick I sometimes use in this case is to create a temporary collection to iterate so the original is free while I use the snapshot:
foreach(var item in items.ToList()) {
//
}
If your list is too large and this causes GC churn, then locking is probably your best bet. If locking is too heavy, maybe consider a partial iteration each timeslice, if that is feasible.
You said:
When the enumerating thread is done it clears the list.
Nothing says you have to process the whole list at a time. You could instead remove a range of items, move them to a separate enumerating thread, let that process, and repeat. Perhaps iteratation and lists aren't the best model here. Consider the ConcurrentQueue with which you could build producers and consumer model, and consumers just steadily remove items to process without iteration

Thread safe OfType

I am facing the problem that this code:
enumerable.OfType<Foo>().Any(x => x.Footastic == true);
Isnt thread safe and throws an enumeration has changed exception.
Is there a good way to overcome this issue?
Already tried the following but it didnt always work (seems to not fire this often)
public class Foo
{
public void DoSomeMagicWithCollection(IEnumerable enumerable)
{
lock (enumerable)
{
enumerable.OfType<Foo>().Any(x => x.Footastic == true);
}
}
}
If you're getting an exception that the underlying collection has changed while enumerating it, given that this code clearly doesn't mutate the collection itself, it means that another thread is mutating the collection while you're trying to iterate it.
There is no possible solution to this problem other than simply not doing that. What's happening is that the enumerator of the List (or whatever collection type that is) is throwing the exception and preventing further enumeration because it can see that the list was modified during the enumeration. There is no way for the enumerators of OfType of Any that wrap it to possibly recover from that. The underlying enumerator is refusing to give them the data from the list. They can't do anything about that.
You need to use some sort of synchronization mechanism to prevent another thread from mutating the collection wnile this thread is enumerating this collection. Your lock doesn't prevent another thread from using the collection, it simply prevents any code that locks on the same instance from running. You need to have any code that could possibly mutate the list also lock on the same object to properly synchronize them.
Another possibility would be to use a collection that is inherently designed to be accessed from multiple threads at the same time. There are several such collections in the System.Collections.Concurrent namespace. They may or may not fit your needs. They will take care of synchronizing access to their data (to a point) on their own, without you needing to explicitly lock when accessing them.

What will happen if two references try to add data exactly same time in List<T> colletion

Check this function.
private static IEnumerable<string> FindAccessibleDatabases()
{
var connectionStrings = new List<string>();
Parallel.For(0, _connectionStringCollection.Count, (index, loopState) =>
{
try
{
using (var connection = new OleDbConnection(_connectionStringCollection[index]))
{
connection.Open();
connectionStrings.Add(_connectionStringCollection[index]);
}
}
catch (OleDbException)
{
}
finally
{
connection.Close();
}
});
return connectionStrings.ToList();
}
I am using Parallel.Foreach and adding values in a List from multiple databases at a time. I can use ConcurrentBag(It is tread safe while retrieving the data but adding is not mentioned) as I am just adding the data to the list, can use List.
Now if two threads try to add data to the list at exactly same time what will happen?
If it will create race condition, what if I use ConcurrentBag?
Thanks,
Omkar
You run the risk of unspecified bad things happening, like duplicates, one item not being added, corrupting the data structure, etc.
The documentation says that List<T>'s Add method is not thread safe (well, specifically it says:
A List<T> can support multiple readers concurrently, as long as the
collection is not modified. Enumerating through a collection is
intrinsically not a thread-safe procedure. In the rare case where an
enumeration contends with one or more write accesses, the only way to
ensure thread safety is to lock the collection during the entire
enumeration. To allow the collection to be accessed by multiple
threads for reading and writing, you must implement your own
synchronization.
). Therefore you need to use a lock statement or other form of thread synchronization around it, or you need to switch to a thread-safe data structure, like something in the System.Collections.Concurrent namespace.
If you use ConcurrentBag's Add method, you don't need to worry about locking. The data structure is explicitly thread-safe.

List<T>: Moving from immutable to changeable structure

I am currently using quite heavily some List and I am looping very frequently via foreach over these lists.
Originally List was immuteable afer the startup. Now I have a requirement to amend the List during runtime from one thread only (a kind of listener). I need to remove from the List in object A and add to the list of object B. A and B are instances of the same class.
Unfortunaly there is no Synchronized List. What would you suggest me to do in this case? in my case speed is more important than synchronisation, thus I am currently working with copies of the lists for add/remove to avoid that the enumerators fail.
Do you have any other recommended way to deal with this?
class X {
List<T> Related {get; set;}
}
In several places and in different threads I am then using
foreach var x in X.Related
Now I need to basically perform in yet another thread
a.Related.Remove(t);
b.Related.Add(t);
To avoid potential exceptions, I am currently doing this:
List<T> aNew=new List<T> (a.Related);
aNew.Remove(t);
a.Related=aNew;
List<T>bNew=new List<T>(b.Related){t};
b.Related=bNew;
Is this correct to avoid exceptions?
From this MSDN post: http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
"...the only way to ensure thread safety is to lock the collection during the entire enumeration. "
Consider using for loops and iterate over your collection in reverse. This way you do not have the "enumerators fail", and as you are going backwards over your collection it is consistent from the POV of the loop.
It's hard to discuss the threading aspects as there is limited detail.
Update
If your collections are small, and you only have 3-4 "potential" concurrent users, I would suggest using a plain locking strategy as suggested by #Jalal although you would need to iterate backwards, e.g.
private readonly object _syncObj = new object();
lock (_syncObj)
{
for (int i = list.Count - 1; i >= 0; i--)
{
//remove from the list and add to the second one.
}
}
You need to protect all accesses to your lists with these lock blocks.
Your current implementation uses the COW (Copy-On-Write) strategy, which can be effective in some scenarios, but your particular implementation suffers from the fact that two or more threads take a copy, make their changes, but then could potentially overwrite the results of other threads.
Update
Further to your question comment, if you are guaranteed to only have one thread updating the collections, then your use of COW is valid, as there is no chance of multiple threads making updates and updates being lost by overwriting by multiple threads. It's a good use of the COW strategy to achieve lock free synchronization.
If you bring other threads in to update the collections, my previous locking comments stand.
My only concern would be that the other "reader" threads may have cached values for the addresses of the original lists, and may not see the new addresses when they are updated. In this case make the list variables volatile.
Update
If you do go for the lock-free strategy there is still one more pitfall, there will still be a gap between setting a.Related and b.Related, in which case your reader threads could be iterating over out-of-date collections e.g. item a could have been removed from list1 but not yet added to list2 - item a will be in neither lists. You could also swap the issue around and add to list2 before removing from list1, in which case item a would be in both lists - duplicates.
If consistency is important you should use locking.
You should lock before you handle the lists since you are in multithreading mode, the lock operation itself does not affect the speed here, the lock operation is executed in nanoseconds about 10 ns depending on the machine. So:
private readonly object _listLocker = new object();
lock (_listLocker)
{
for (int itemIndex = 0; itemIndex < list.Count; itemIndex++)
{
//remove from the first list and add to the second one.
}
}
If you are using framework 4.0 I encorage you to use ConcurrentBag instead of list.
Edit: code snippet:
List<T> aNew=new List<T> (a.Related);
This will work if only all interaction with the collection "including add remove replace items" managed this way. Also you have to use System.Threading.Interlocked.CompareExchange and System.Threading.Interlocked.Exchange methods to replace the existing collection with the new modified. if that is not the case then you are doing nothing by coping
This will not work. for instance consider a thread trying to get an item from the collection, at the same time another thread replace the collection. this could leave the item retrieved in a not constant data. also consider while you are coping the collection, another thread want to insert item to the collection at the same time while you are coping?
This will throw exception indicates that the collection modified.
Another thing is that you are coping the whole collection to a new list to handle it. certainly this will harm the performance, and I think using synchronization mechanism such as lock reduce the performance pallet, and it is the much appropriated thing to do while to handle multithreading scenarios.

How to speed up routines making use of collections in multithreading scenario

I've an application that makes use of parallelization for processing data.
The main program is in C#, while one of the routine for analyzing data is on an external C++ dll. This library scans data and calls a callback everytime a certain signal is found within the data. Data should be collected, sorted and then stored into HD.
Here is my first simple implementation of the method invoked by the callback and of the method for sorting and storing data:
// collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// method invoked by the callback
private void Collect(int type, long time)
{
lock(locker) { mySignalList.Add(new MySignal(type, time)); }
}
// store signals to disk
private void Store()
{
// sort the signals
mySignalList.Sort();
// file is a object that manages the writing of data to a FileStream
file.Write(mySignalList.ToArray());
}
Data is made up of a bidimensional array (short[][] data) of size 10000 x n, with n variable. I use parallelization in this way:
Parallel.For(0, 10000, (int i) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
}
Now for each of the 10000 arrays I estimate that 0 to 4 callbacks could be fired. I'm facing a bottleneck and given that my CPU resources are not over-utilized, I suppose that the lock (together with thousand of callbacks) is the problem (am I right or there could be something else?). I've tried the ConcurrentBag collection but performances are still worse (in line with other user findings).
I thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
[EDIT]
I've followed the Reed Copsey's suggestion and I used the following solution (according to the profiler of VS2010, before the burden for locking and adding to the list was taking 15% of the resources, while now only 1%):
// master collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// thread-local storage of data (each thread is working on its List<MySignal>)
ThreadLocal<List<MySignal>> threadLocal;
// analyze data
private void AnalizeData()
{
using(threadLocal = new ThreadLocal<List<MySignal>>(() =>
{ return new List<MySignal>(); }))
{
Parallel.For<int>(0, 10000,
() =>
{ return 0;},
(i, loopState, localState) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
return 0;
},
(localState) =>
{
lock(this)
{
// add thread-local lists to the master collection
mySignalList.AddRange(local.Value);
local.Value.Clear();
}
});
}
}
// method invoked by the callback
private void Collect(int type, long time)
{
local.Value.Add(new MySignal(type, time));
}
thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
You might want to look at using ThreadLocal<T> to hold your collections. This automatically allocates a separate collection per thread.
That being said, there are overloads of Parallel.For which work with local state, and have a collection pass at the end. This, potentially, would allow you to spawn your ProcessData wrapper, where each loop body was working on its own collection, and then recombine at the end. This would, potentially, eliminate the need for locking (since each thread is working on it's own data set) until the recombination phase, which happens once per thread (instead of once per task,ie: 10000 times). This could reduce the number of locks you're taking from ~25000 (0-4*10000) down to a few (system and algorithm dependent, but on a quad core system, probably around 10 in my experience).
For details, see my blog post on aggregating data with Parallel.For/ForEach. It demonstrates the overloads and explains how they work in more detail.
You don't say how much of a "bottleneck" you're encountering. But let's look at the locks.
On my machine (quad core, 2.4 GHz), a lock costs about 70 nanoseconds if it's not contended. I don't know how long it takes to add an item to a list, but I can't imagine that it takes more than a few microseconds. But let's it takes 100 microseconds (I would be very surprised to find that it's even 10 microseconds) to add an item to the list, taking into account lock contention. So if you're adding 40,000 items to the list, that's 4,000,000 microseconds, or 4 seconds. And I would expect one core to be pegged if this were the case.
I haven't used ConcurrentBag, but I've found the performance of BlockingCollection to be very good.
I suspect, though, that your bottleneck is somewhere else. Have you done any profiling?
The basic collections in C# aren't thread safe.
The problem you're having is due to the fact that you're locking the entire collection just to call an add() method.
You could create a thread-safe collection that only locks single elements inside the collection, instead of the whole collection.
Lets look at a linked list for example.
Implement an add(item (or list)) method that does the following:
Lock collection.
A = get last item.
set last item reference to the new item (or last item in new list).
lock last item (A).
unclock collection.
add new items/list to the end of A.
unlock locked item.
This will lock the whole collection for just 3 simple tasks when adding.
Then when iterating over the list, just do a trylock() on each object. if it's locked, wait for the lock to be free (that way you're sure that the add() finished).
In C# you can do an empty lock() block on the object as a trylock().
So now you can add safely and still iterate over the list at the same time.
Similar solutions can be implemented for the other commands if needed.
Any built-in solution for a collection is going to involve some locking. There may be ways to avoid it, perhaps by segregating the actual data constructs being read/written, but you're going to have to lock SOMEWHERE.
Also, understand that Parallel.For() will use the thread pool. While simple to implement, you lose fine-grained control over creation/destruction of threads, and the thread pool involves some serious overhead when starting up a big parallel task.
From a conceptual standpoint, I would try two things in tandem to speed up this algorithm:
Create threads yourself, using the Thread class. This frees you from the scheduling slowdowns of the thread pool; a thread starts processing (or waiting for CPU time) when you tell it to start, instead of the thread pool feeding requests for threads into its internal workings at its own pace. You should be aware of the number of threads you have going at once; the rule of thumb is that the benefits of multithreading are overcome by the overhead when you have more than twice the number of active threads as "execution units" available to execute threads. However, you should be able to architect a system that takes this into account relatively simply.
Segregate the collection of results, by creating a dictionary of collections of results. Each results collection is keyed to some token carried by the thread doing the processing and passed to the callback. The dictionary can have multiple elements READ at one time without locking, and as each thread is WRITING to a different collection within the Dictionary there shouldn't be a need to lock those lists (and even if you did lock them you wouldn't be blocking other threads). The result is that the only collection that has to be locked such that it would block threads is the main dictionary, when a new collection for a new thread is added to it. That shouldn't have to happen often if you're smart about recycling tokens.

Categories

Resources