Fast way to build a hash set, single or multiple thread?

Fast way to build a hash set, single or multiple thread? - c#

i want to know which why is faster to build a hashset.
my process like this:
1, DB access (single thread) , get a large list of IDs.
2,
Plan A
foreach( var oneID in IDs)
{
myHashSet.add(oneID);
}
Plan B
Parallel.ForEach(IDs,myPallOpt,(oneID)=>
{
myHashSet.add(oneID);
});
So which is faster Plan A or B?
Thanks

HashSet<T> is not thread safe, so the second option (using Parallel.ForEach) will likely cause bugs. It should definitely be avoided.
The best option is likely to just build the hashset directly from the results:
var myHashSet = new HashSet<int>(IDs);
Note that this only works if the HashSet is only intended to contain the items from this collection. If you're adding to an existing HashSet<T>, a foreach (your first option) is likely the best option.

Plan B likely won't work because it is likely not thread-safe (most .NET collection classes are not thread safe). You could fix it by making access to it thread-safe, but that essentially means serializing access to it, which is no better than single threaded. The only case it would make sense is if between beginning of your for loop and actual addition you have some cpu intensive processing that would be good to parallelize.

Related

Should I use a C# Dictionary if I only need fast lookup of keys, and values are irrelevant?

I am in need of a data type that is able to insert entries and then be able to quickly determine if an entry has already been inserted. A Dictionary seems to suit this need (see example). However, I have no use for the dictionary's values. Should I still use a dictionary or is there another better suited data type?
public class Foo
{
private Dictionary<string, bool> Entities;
...
public void AddEntity(string bar)
{
if (!Entities.ContainsKey(bar))
{
// bool value true here has no use and is just a placeholder
Entities.Add(bar, true);
}
}
public string[] GetEntities()
{
return Entities.Keys.ToArray();
}
}

You can use HashSet<T>.
The HashSet<T> class provides high-performance set operations. A set
is a collection that contains no duplicate elements, and whose
elements are in no particular order.

Habib's answer is excellent, but for multi-threaded environments if you use a HashSet<T> then by consequence you have to use locks to protect access to it. I find myself more prone to creating deadlocks with lock statements. Also, locks yield a worse speedup per Amdahl's law because adding a lock statement reduces the percentage of your code that is actually parallel.
For those reasons, a ConcurrentDictionary<T,object> fits the bill in multi-threaded environments. If you end up using one, then wrap it like you did in your question. Just new up objects to toss in as values as needed, since the values won't be important. You can verify that there are no lock statements in its source code.
If you didn't need mutability of the collection then this would be moot. But your question implies that you do need it, since you have an AddEntity method.
Additional info 2017-05-19 - actually, ConcurrentDictionary does use locks internally, although not lock statements per se--it uses Monitor.Enter (check out the TryAddInternal method). However, it seems to lock on individual buckets within the dictionary, which means there will be less contention than putting the entire thing in a lock statement.
So all in all, ConcurrentDictionary is often better for multithreaded environments.
It's actually quite difficult (impossible?) to make a concurrent hash set using only the Interlocked methods. I tried on my own and kept running into the problem of needing to alter two things at the same time--something that only locking can do in general. One workaround I found was to use singly-linked lists for the hash buckets and intentionally create cycles in a list when one thread needed to operate on a node without interference from other threads; this would cause other threads to get caught spinning around in the same spot until that thread was done with its node and undid the cycle. Sure, it technically didn't use locks, but it did not scale well.

Iterate and do operations on a generic collection concurrently

I've created a game emulation program using c# async socks. I need to remove/add & do iterations on a collection (a list that holds clients) concurrently. I am currently using "lock", however, it's a a huge performance drop. I also do not want to use "local lists/copies" to keep the list up-to-date. I've heard about "ConcurrentBags", however, I am not sure how thread safe they are for iterations (for instance if a thread removes an element from the list while another thread is doing an iteration on it!?).
What do you suggest?
Edit: here is a situation
this is when a packet is sent to all the users in a room
lock (parent.gameClientList)
{
for (int i = 0; i <= parent.gameClientList.Count() - 1; i++) if (parent.gameClientList[i].zoneId == zoneId) parent.gameClientList[i].SendXt(packetElements); //if room matches - SendXt sends a packet
}
When a new client connects
Client connectedClient = new Client(socket, this);
lock (gameClientList)
{
gameClientList.Add(connectedClient);
}
Same case when a client disconnects.
I am asking for a better alternative (performance-wise) because the locks slow down everything.

It sounds like the problem is that you're doing all the work within your foreach loop, and it's locking out the add/remove methods for too long. The way around this is to quickly make a copy of the collection while it's locked, and then you can close the lock and iterate on the copy.
Thing[] copy;
lock(myLock) {
copy = _collection.ToArray();
}
foreach(var thing in copy) {...}
The drawback is that by the time you get around to operating on some object of that copy, it may have been removed from the original collection and so maybe you don't want to operate on it anymore. That's another thing you'll just have to figure out the requirements. If that's a problem, a simple option would be to lock each iteration of the loop, which of course will slow things down but at least it won't lock for the entire duration the loop is running:
foreac(var thing in copy) {
lock(myLock) {
if (_collection.Contains(thing)) //check that it's still in the original colleciton
DoWork(thing); //If you can move this outside the lock it'd make your app snappier, but it could also cause problems if you're doing something "dangerous" in DoWork.
}
}
If this is what you meant by "local copies", then you can disregard this option, but I figured I'd offer it in case you meant something else.

Every time you do something concurrently you are going to have loss due to task management (i.e. locks). I suggest you look at what is the bottleneck in your process. You seem to have a shared memory model, as opposed to a message passing model. If you know you need to modify the entire collection at once, there may not be a good solution. But if you are making changes in a particular order you can leverage that order to prevent delays. Locks is an implementation of pessimistic concurrency. You could switch to an optimistic concurrency model. In one the cost is waiting in the other the cost is retrying. Again the actual solution depends on your use case.

On problem with ConcurrentBag is that it is unordered so you cannot pull items out by index the same way you are doing it currently. However, you can iterate it via foreach to get the same effect. This iteration will be thread-safe. It will not go bizerk if an item is added or removed while the iteration is happening.
There is another problem with ConcurrentBag though. It actually copies the contents to a new List internally to make the enumerator work correctly. So even if you wanted to just pick off a single item via the enumerator it would still be a O(n) operation because of the way enumerator works. You can verify this by disassembling it.
However, based on context clues from your update I assume that this collection is going to be small. It appears that there is only one entry per "game client" which means it is probably going to store a small number of items right? If that is correct then the performance of the GetEnumerator method will be mostly insignificant.
You should also consider ConcurrentDictionary as well. I noticed that you are trying to match items from the collection based on zoneId. If you store the items in the ConcurrentDictionary keyed by zoneId then you would not need to iterate the collection at all. Of course, this assumes that there is only one entry per zoneId which may not be the case.
You mentioned that you did not want to use "local lists/copies", but you never said why. I think you should reconsider this for the following reasons.
Iterations could be lock-free.
Adding and removing items appears to be infrequent based context clues from your code.
There are a couple of patterns you can use to make the list copying strategy work really well. I talk about them in my answers here and here.

List<T>: Moving from immutable to changeable structure

I am currently using quite heavily some List and I am looping very frequently via foreach over these lists.
Originally List was immuteable afer the startup. Now I have a requirement to amend the List during runtime from one thread only (a kind of listener). I need to remove from the List in object A and add to the list of object B. A and B are instances of the same class.
Unfortunaly there is no Synchronized List. What would you suggest me to do in this case? in my case speed is more important than synchronisation, thus I am currently working with copies of the lists for add/remove to avoid that the enumerators fail.
Do you have any other recommended way to deal with this?
class X {
List<T> Related {get; set;}
}
In several places and in different threads I am then using
foreach var x in X.Related
Now I need to basically perform in yet another thread
a.Related.Remove(t);
b.Related.Add(t);
To avoid potential exceptions, I am currently doing this:
List<T> aNew=new List<T> (a.Related);
aNew.Remove(t);
a.Related=aNew;
List<T>bNew=new List<T>(b.Related){t};
b.Related=bNew;
Is this correct to avoid exceptions?

From this MSDN post: http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
"...the only way to ensure thread safety is to lock the collection during the entire enumeration. "

Consider using for loops and iterate over your collection in reverse. This way you do not have the "enumerators fail", and as you are going backwards over your collection it is consistent from the POV of the loop.
It's hard to discuss the threading aspects as there is limited detail.
Update
If your collections are small, and you only have 3-4 "potential" concurrent users, I would suggest using a plain locking strategy as suggested by #Jalal although you would need to iterate backwards, e.g.
private readonly object _syncObj = new object();
lock (_syncObj)
{
for (int i = list.Count - 1; i >= 0; i--)
{
//remove from the list and add to the second one.
}
}
You need to protect all accesses to your lists with these lock blocks.
Your current implementation uses the COW (Copy-On-Write) strategy, which can be effective in some scenarios, but your particular implementation suffers from the fact that two or more threads take a copy, make their changes, but then could potentially overwrite the results of other threads.
Update
Further to your question comment, if you are guaranteed to only have one thread updating the collections, then your use of COW is valid, as there is no chance of multiple threads making updates and updates being lost by overwriting by multiple threads. It's a good use of the COW strategy to achieve lock free synchronization.
If you bring other threads in to update the collections, my previous locking comments stand.
My only concern would be that the other "reader" threads may have cached values for the addresses of the original lists, and may not see the new addresses when they are updated. In this case make the list variables volatile.
Update
If you do go for the lock-free strategy there is still one more pitfall, there will still be a gap between setting a.Related and b.Related, in which case your reader threads could be iterating over out-of-date collections e.g. item a could have been removed from list1 but not yet added to list2 - item a will be in neither lists. You could also swap the issue around and add to list2 before removing from list1, in which case item a would be in both lists - duplicates.
If consistency is important you should use locking.

You should lock before you handle the lists since you are in multithreading mode, the lock operation itself does not affect the speed here, the lock operation is executed in nanoseconds about 10 ns depending on the machine. So:
private readonly object _listLocker = new object();
lock (_listLocker)
{
for (int itemIndex = 0; itemIndex < list.Count; itemIndex++)
{
//remove from the first list and add to the second one.
}
}
If you are using framework 4.0 I encorage you to use ConcurrentBag instead of list.
Edit: code snippet:
List<T> aNew=new List<T> (a.Related);
This will work if only all interaction with the collection "including add remove replace items" managed this way. Also you have to use System.Threading.Interlocked.CompareExchange and System.Threading.Interlocked.Exchange methods to replace the existing collection with the new modified. if that is not the case then you are doing nothing by coping
This will not work. for instance consider a thread trying to get an item from the collection, at the same time another thread replace the collection. this could leave the item retrieved in a not constant data. also consider while you are coping the collection, another thread want to insert item to the collection at the same time while you are coping?
This will throw exception indicates that the collection modified.
Another thing is that you are coping the whole collection to a new list to handle it. certainly this will harm the performance, and I think using synchronization mechanism such as lock reduce the performance pallet, and it is the much appropriated thing to do while to handle multithreading scenarios.

How to speed up routines making use of collections in multithreading scenario

I've an application that makes use of parallelization for processing data.
The main program is in C#, while one of the routine for analyzing data is on an external C++ dll. This library scans data and calls a callback everytime a certain signal is found within the data. Data should be collected, sorted and then stored into HD.
Here is my first simple implementation of the method invoked by the callback and of the method for sorting and storing data:
// collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// method invoked by the callback
private void Collect(int type, long time)
{
lock(locker) { mySignalList.Add(new MySignal(type, time)); }
}
// store signals to disk
private void Store()
{
// sort the signals
mySignalList.Sort();
// file is a object that manages the writing of data to a FileStream
file.Write(mySignalList.ToArray());
}
Data is made up of a bidimensional array (short[][] data) of size 10000 x n, with n variable. I use parallelization in this way:
Parallel.For(0, 10000, (int i) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
}
Now for each of the 10000 arrays I estimate that 0 to 4 callbacks could be fired. I'm facing a bottleneck and given that my CPU resources are not over-utilized, I suppose that the lock (together with thousand of callbacks) is the problem (am I right or there could be something else?). I've tried the ConcurrentBag collection but performances are still worse (in line with other user findings).
I thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
[EDIT]
I've followed the Reed Copsey's suggestion and I used the following solution (according to the profiler of VS2010, before the burden for locking and adding to the list was taking 15% of the resources, while now only 1%):
// master collection where saving found signals
List<MySignal> mySignalList = new List<MySignal>();
// thread-local storage of data (each thread is working on its List<MySignal>)
ThreadLocal<List<MySignal>> threadLocal;
// analyze data
private void AnalizeData()
{
using(threadLocal = new ThreadLocal<List<MySignal>>(() =>
{ return new List<MySignal>(); }))
{
Parallel.For<int>(0, 10000,
() =>
{ return 0;},
(i, loopState, localState) =>
{
// wrapper for the external c++ dll
ProcessData(data[i]);
return 0;
},
(localState) =>
{
lock(this)
{
// add thread-local lists to the master collection
mySignalList.AddRange(local.Value);
local.Value.Clear();
}
});
}
}
// method invoked by the callback
private void Collect(int type, long time)
{
local.Value.Add(new MySignal(type, time));
}

thought that a possible solution for use lock-free code would be to have multiple collections. Then it would be necessary a strategy to make each thread of the parallel process working on a single collection. Collections could be for instance inside a dictionary with thread ID as key, but I do not know any .NET facility for this (I should know the threads ID for initialize the dictionary before launching the parallelization). Could be this idea feasible and, in case yes, does exist some .NET tool for this? Or alternatively, any other idea to speed up the process?
You might want to look at using ThreadLocal<T> to hold your collections. This automatically allocates a separate collection per thread.
That being said, there are overloads of Parallel.For which work with local state, and have a collection pass at the end. This, potentially, would allow you to spawn your ProcessData wrapper, where each loop body was working on its own collection, and then recombine at the end. This would, potentially, eliminate the need for locking (since each thread is working on it's own data set) until the recombination phase, which happens once per thread (instead of once per task,ie: 10000 times). This could reduce the number of locks you're taking from ~25000 (0-4*10000) down to a few (system and algorithm dependent, but on a quad core system, probably around 10 in my experience).
For details, see my blog post on aggregating data with Parallel.For/ForEach. It demonstrates the overloads and explains how they work in more detail.

You don't say how much of a "bottleneck" you're encountering. But let's look at the locks.
On my machine (quad core, 2.4 GHz), a lock costs about 70 nanoseconds if it's not contended. I don't know how long it takes to add an item to a list, but I can't imagine that it takes more than a few microseconds. But let's it takes 100 microseconds (I would be very surprised to find that it's even 10 microseconds) to add an item to the list, taking into account lock contention. So if you're adding 40,000 items to the list, that's 4,000,000 microseconds, or 4 seconds. And I would expect one core to be pegged if this were the case.
I haven't used ConcurrentBag, but I've found the performance of BlockingCollection to be very good.
I suspect, though, that your bottleneck is somewhere else. Have you done any profiling?

The basic collections in C# aren't thread safe.
The problem you're having is due to the fact that you're locking the entire collection just to call an add() method.
You could create a thread-safe collection that only locks single elements inside the collection, instead of the whole collection.
Lets look at a linked list for example.
Implement an add(item (or list)) method that does the following:
Lock collection.
A = get last item.
set last item reference to the new item (or last item in new list).
lock last item (A).
unclock collection.
add new items/list to the end of A.
unlock locked item.
This will lock the whole collection for just 3 simple tasks when adding.
Then when iterating over the list, just do a trylock() on each object. if it's locked, wait for the lock to be free (that way you're sure that the add() finished).
In C# you can do an empty lock() block on the object as a trylock().
So now you can add safely and still iterate over the list at the same time.
Similar solutions can be implemented for the other commands if needed.

Any built-in solution for a collection is going to involve some locking. There may be ways to avoid it, perhaps by segregating the actual data constructs being read/written, but you're going to have to lock SOMEWHERE.
Also, understand that Parallel.For() will use the thread pool. While simple to implement, you lose fine-grained control over creation/destruction of threads, and the thread pool involves some serious overhead when starting up a big parallel task.
From a conceptual standpoint, I would try two things in tandem to speed up this algorithm:
Create threads yourself, using the Thread class. This frees you from the scheduling slowdowns of the thread pool; a thread starts processing (or waiting for CPU time) when you tell it to start, instead of the thread pool feeding requests for threads into its internal workings at its own pace. You should be aware of the number of threads you have going at once; the rule of thumb is that the benefits of multithreading are overcome by the overhead when you have more than twice the number of active threads as "execution units" available to execute threads. However, you should be able to architect a system that takes this into account relatively simply.
Segregate the collection of results, by creating a dictionary of collections of results. Each results collection is keyed to some token carried by the thread doing the processing and passed to the callback. The dictionary can have multiple elements READ at one time without locking, and as each thread is WRITING to a different collection within the Dictionary there shouldn't be a need to lock those lists (and even if you did lock them you wouldn't be blocking other threads). The result is that the only collection that has to be locked such that it would block threads is the main dictionary, when a new collection for a new thread is added to it. That shouldn't have to happen often if you're smart about recycling tokens.

Thread Index in PLINQ Query?

Is there way to pass the thread "index" of a PLINQ query into one of it's operators, like a Select?
The background is that I have a PLINQ Query that does some deserialization, the deserialization method uses a structure (problably an array) to pass data to another method:
ParallelEnumerable.Range(0, nrObjects)
.WithDegreeOfParallelism(8)
.Select(i => this.Deserialize(serializer, i, arrays[threadIndex]))
.ToList();
(threadIndex is the new variable I'd like)
If I knew the thread index I could create 8 of these arrays upfront (even if all might not be used) and reuse them. The Deserialization method might be called millions of times so every small optimization counts..

Re the thread index; note that the degree of parallelism is (IIRC) only the maximum number of threads to use; it doesn't have to use that number. Relying on threadIndex in the way you describe would seem to mean that you might actually only access arrays[0]. So in short, no: I don't think so.
You can, however, get the item index; so if that is what you mean, simply:
.Select((value, itemIndex) => this.Deserialize(
serializer, i, arrays[itemIndex])).ToList();
It sounds (comments) like the intent is for obtaining a working buffer; in which case, I would (and indeed, do) keep a few convenient buffers in a container (with synchronized access etc), i.e.
attempt to obtain an existing buffer from a container (synchronized)
if not found, create a new buffer
(do work)
return/add buffer to the container (synchronized)
This will generally be very fast (much faster than buffer-allocation per iteration)

You can create ConcurrentBag of integers and then take from it before deserealization use as index and then return after. Every thread will occupy one array element at a time.
Consider using Parallel class. It is more appropriate when it comes to arrays. There are limitations on indexed versions of Select if you use ParallelEnumerable (Not in your case, but be cautious)

I would recommend using Parallel.ForEach which offers an overload that allows you to allocate (and cleanup) locals that will be passed into one parallel instance of the loop at a time. I've used this approach for processing of large sets of image data where I require buffers for various steps of the process.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.