Is HashSet<T>.Contains() efficient with large lists, multiple threads?

Is HashSet<T>.Contains() efficient with large lists, multiple threads? - c#

I am writing a multi-threaded program to scrape a certain site and collect ID's. It is storing these ID's in a shared static List<string> object.
When any item is added to the List<string>, it is first checked against a HashSet<string> which contains a blacklist of already collected ID's.
I do this as follows:
private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();
public static void AddIDToIDList(string ID)
{
lock (IDList)
{
if (IsIDBlacklisted(ID))
return;
IDList.Add(ID);
}
}
public static bool IsIDBlacklisted(string ID)
{
lock (Blacklist)
{
if (Blacklist.Contains(ID))
return true;
}
return false;
}
The Blacklist is saved to a file after finishing and is loaded every time the program starts, therefore, it will get pretty large over time (up to 50k records). Is there a more efficient way to not only store this blacklist, but also to check each ID against it?
Thanks!

To improve performance try to use ConcurrentBag<T> collection. Also there is no need to lock BlackList because it's not being modified e.g.:
private static HashSet<string> Blacklist = new HashSet<string>();
private static ConcurrentBag<string> IDList = new ConcurrentBag<string>();
public static void AddIDToIDList(string ID)
{
if (Blacklist.Contains(ID))
{
return;
}
IDList.Add(ID);
}

Read operations are thread safe on HashSet, as long as Blacklist is not being modified you don't need to lock on it. Also you should lock inside the blacklist check so the lock is taken less often, this also will increase your performance.
private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();
public static void AddIDToIDList(string ID)
{
if (IsIDBlacklisted(ID))
return;
lock (IDList)
{
IDList.Add(ID);
}
}
public static bool IsIDBlacklisted(string ID)
{
return Blacklist.Contains(ID);
}
If Blacklist is being modified the best way to lock around it is using a ReaderWriterLock (use the slim version if you are using a newer .NET)
private static HashSet<string> Blacklist = new HashSet<string>();
private static List<string> IDList = new List<string>();
private static ReaderWriterLockSlim BlacklistLock = new ReaderWriterLockSlim();
public static void AddIDToIDList(string ID)
{
if (IsIDBlacklisted(ID))
return;
lock (IDList)
{
IDList.Add(ID);
}
}
public static bool IsIDBlacklisted(string ID)
{
BlacklistLock.EnterReadLock();
try
{
return Blacklist.Contains(ID);
}
finally
{
BlacklistLock.ExitReadLock();
}
}
public static bool AddToIDBlacklist(string ID)
{
BlacklistLock.EnterWriteLock();
try
{
return Blacklist.Add(ID);
}
finally
{
BlacklistLock.ExitWriteLock();
}
}

Two considerations - First, if you use the indexer of a .NET dictionary (i.e., System.Collections.Generic.Dictionary) like this (rather than calling the Add() method):
idList[id] = id;
then it will add the item if it doesn't already exist - otherwise, it will replace the existing item at that key. Second, you can use the ConcurrentDictionary (in the System.Collections.Concurrent namespace) for thread-safety so you don't have to worry about the locking yourself. Same comment applies about using the indexer.

In your scenario, yes, HashSet is the best option for this since it contains one value to look up unlike a Dictionary which requires a key and a value to do a lookup.
And ofcourse as others have said no need of locking HashSet if it is not being modified. and consider marking it as readonly.

Related

How to safely write to the same List

I've got a public static List<MyDoggie> DoggieList;
DoggieList is appended to and written to by multiple processes throughout my application.
We run into this exception pretty frequently:
Collection was modified; enumeration operation may not execute
Assuming there are multiple classes writing to DoggieList how do we get around this exception?
Please note that this design is not great, but at this point we need to quickly fix it in production.
How can we perform mutations to this list safely from multiple threads?
I understand we can do something like:
lock(lockObject)
{
DoggieList.AddRange(...)
}
But can we do this from multiple classes against the same DoggieList?

you can also create you own class and encapsulate locking thing in that only, you can try like as below ,
you can add method you want like addRange, Remove etc.
class MyList {
private object objLock = new object();
private List<int> list = new List<int>();
public void Add(int value) {
lock (objLock) {
list.Add(value);
}
}
public int Get(int index) {
int val = -1;
lock (objLock) {
val = list[0];
}
return val;
}
public void GetAll() {
List<int> retList = new List<int>();
lock (objLock) {
retList = new List<T>(list);
}
return retList;
}
}
Good stuff : Concurrent Collections very much in detail :http://www.albahari.com/threading/part5.aspx#_Concurrent_Collections
making use of concurrent collection ConcurrentBag Class can also resolve issue related to multiple thread update
Example
using System.Collections.Concurrent;
using System.Threading.Tasks;
public static class Program
{
public static void Main()
{
var items = new[] { "item1", "item2", "item3" };
var bag = new ConcurrentBag<string>();
Parallel.ForEach(items, bag.Add);
}
}

Using lock a the disadvantage of preventing concurrent readings.
An efficient solution which does not require changing the collection type is to use a ReaderWriterLockSlim
private static readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
With the following extension methods:
public static class ReaderWriterLockSlimExtensions
{
public static void ExecuteWrite(this ReaderWriterLockSlim aLock, Action action)
{
aLock.EnterWriteLock();
try
{
action();
}
finally
{
aLock.ExitWriteLock();
}
}
public static void ExecuteRead(this ReaderWriterLockSlim aLock, Action action)
{
aLock.EnterReadLock();
try
{
action();
}
finally
{
aLock.ExitReadLock();
}
}
}
which can be used the following way:
_lock.ExecuteWrite(() => DoggieList.Add(new Doggie()));
_lock.ExecuteRead(() =>
{
// safe iteration
foreach (MyDoggie item in DoggieList)
{
....
}
})
And finally if you want to build your own collection based on this:
public class SafeList<T>
{
private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
private readonly List<T> _list = new List<T>();
public T this[int index]
{
get
{
T result = default(T);
_lock.ExecuteRead(() => result = _list[index]);
return result;
}
}
public List<T> GetAll()
{
List<T> result = null;
_lock.ExecuteRead(() => result = _list.ToList());
return result;
}
public void ForEach(Action<T> action) =>
_lock.ExecuteRead(() => _list.ForEach(action));
public void Add(T item) => _lock.ExecuteWrite(() => _list.Add(item));
public void AddRange(IEnumerable<T> items) =>
_lock.ExecuteWrite(() => _list.AddRange(items));
}
This list is totally safe, multiple threads can add or get items in parallel without any concurrency issue. Additionally, multiple threads can get items in parallel without locking each other, it's only when writing than 1 single thread can work on the collection.
Note that this collection does not implement IEnumerable<T> because you could get an enumerator and forget to dispose it which would leave the list locked in read mode.

make DoggieList of type ConcurrentStack and then use pushRange method. It is thread safe.
using System.Collections.Concurrent;
var doggieList = new ConcurrentStack<MyDoggie>();
doggieList.PushRange(YourCode)

Threading With List Property

public static class People
{
List<string> names {get; set;}
}
public class Threading
{
public static async Task DoSomething()
{
var t1 = new Task1("bob");
var t2 = new Task1("erin");
await Task.WhenAll(t1,t2);
}
private static async Task Task1(string name)
{
await Task.Run(() =>
{
if(People.names == null) People.names = new List<string>();
Peoples.names.Add(name);
}
}
}
Is that dangerous to initialize a list within a thread? Is it possible that both threads could initialize the list and remove one of the names?
So I was thinking of three options:
Leave it like this since it is simple - only if it is safe though
Do same code but use a concurrentBag - I know thread safe but is initialize safe
Using [DataMember(EmitDefaultValue = new List())] and then just do .Add in Task1 and not worry about initializing. But the only con to this is sometimes the list wont need to be used at all and it seems like a waste to initialize it everytime.

Okay so what I figured worked best for my case was I used a lock statement.
public class Class1
{
private static Object thisLock = new Object();
private static async Task Task1(string name)
{
await Task.Run(() =>
{
AddToList(name);
}
}
private static AddToList(string name)
{
lock(thisLock)
{
if(People.names == null) People.names = new List<string>();
People.names.Add(name);
}
}
}
public static class People
{
public static List<string> names {get; set;}
}

for a simple case like this the easiest way to get thread-safety is using the lock statement:
public static class People
{
static List<string> _names = new List<string>();
public static void AddName(string name)
{
lock (_names)
{
_names.Add(name);
}
}
public static IEnumerable<string> GetNames()
{
lock(_names)
{
return _names.ToArray();
}
}
}
public class Threading
{
public static async Task DoSomething()
{
var t1 = new Task1("bob");
var t2 = new Task1("erin");
await Task.WhenAll(t1,t2);
}
private static async Task Task1(string name)
{
People.AddName(name);
}
}
of course it's not very usefull (why not just add without the threads) - but I hope you get the idea.
If you don't use some kind of lock and concurrently read and write to a List you will most likely get an InvalidOperationException saying the collection has changed during read.
Because you don't really know when a user will use the collection you might return the easiest way to get thread-saftey is copying the collection into an array and returning this.
If this is not practical (collection to large, ..) you have to use the classes in System.Collections.Concurrrent for example the BlockingCollection but those are a bit more involved.

Lock only on an Id

I have a method which needs to run exclusivley run a block of code, but I want to add this restriction only if it is really required. Depending on an Id value (an Int32) I would be loading/modifying distinct objects, so it doesn't make sense to lock access for all threads. Here's a first attempt of doing this -
private static readonly ConcurrentDictionary<int, Object> LockObjects = new ConcurrentDictionary<int, Object>();
void Method(int Id)
{
lock(LockObjects.GetOrAdd(Id,new Object())
{
//Do the long running task here - db fetches, changes etc
Object Ref;
LockObjects.TryRemove(Id,out Ref);
}
}
I have my doubts if this would work - the TryRemove can fail (which will cause the ConcurrentDictionary to keep getting bigger).
A more obvious bug is that the TryRemove successfully removes the Object but if there are other threads (for the same Id) which are waiting (locked out) on this object, and then a new thread with the same Id comes in and adds a new Object and starts processing, since there is no one else waiting for the Object it just added.
Should I be using TPL or some sort of ConcurrentQueue to queue up my tasks instead ? What's the simplest solution ?

I use a similar approach to lock resources for related items rather than a blanket resource lock... It works perfectly.
Your almost there but you really don't need to remove the object from the dictionary; just let the next object with that id get the lock on the object.
Surely there is a limit to the number of unique ids in your application? What is that limit?

The main semantic issue I see is that an object can be locked without being listed in the collection because the the last line in the lock removes it and a waiting thread can pick it up and lock it.
Change the collection to be a collection of objects that should guard a lock. Do not name it LockedObjects and do not remove the objects from the collection unless you no longer expect the object to be needed.
I always think of this type of objects as a key instead of a lock or blocked object; the object is not locked, it is a key to locked sequences of code.

I used the following approach. Do not check the original ID, but get small hash-code of int type to get the existing object for lock. The count of lockers depends on your situation - the more locker counter, the less the probability of collision.
class ThreadLocker
{
const int DEFAULT_LOCKERS_COUNTER = 997;
int lockersCount;
object[] lockers;
public ThreadLocker(int MaxLockersCount)
{
if (MaxLockersCount < 1) throw new ArgumentOutOfRangeException("MaxLockersCount", MaxLockersCount, "Counter cannot be less, that 1");
lockersCount = MaxLockersCount;
lockers = Enumerable.Range(0, lockersCount).Select(_ => new object()).ToArray();
}
public ThreadLocker() : this(DEFAULT_LOCKERS_COUNTER) { }
public object GetLocker(int ObjectID)
{
var idx = (ObjectID % lockersCount + lockersCount) % lockersCount;
return lockers[idx];
}
public object GetLocker(string ObjectID)
{
var hash = ObjectID.GetHashCode();
return GetLocker(hash);
}
public object GetLocker(Guid ObjectID)
{
var hash = ObjectID.GetHashCode();
return GetLocker(hash);
}
}
Usage:
partial class Program
{
static ThreadLocker locker = new ThreadLocker();
static void Main(string[] args)
{
var id = 10;
lock(locker.GetLocker(id))
{
}
}
}
Of cource, you can use any hash-code functions to get the corresponded array index.

If you want to use the ID itself and do not allow collisions, caused by hash-code, you can you the next approach. Maintain the Dictionary of objects and store info about the number of the threads, that want to use ID:
class ThreadLockerByID<T>
{
Dictionary<T, lockerObject<T>> lockers = new Dictionary<T, lockerObject<T>>();
public IDisposable AcquireLock(T ID)
{
lockerObject<T> locker;
lock (lockers)
{
if (lockers.ContainsKey(ID))
{
locker = lockers[ID];
}
else
{
locker = new lockerObject<T>(this, ID);
lockers.Add(ID, locker);
}
locker.counter++;
}
Monitor.Enter(locker);
return locker;
}
protected void ReleaseLock(T ID)
{
lock (lockers)
{
if (!lockers.ContainsKey(ID))
return;
var locker = lockers[ID];
locker.counter--;
if (Monitor.IsEntered(locker))
Monitor.Exit(locker);
if (locker.counter == 0)
lockers.Remove(locker.id);
}
}
class lockerObject<T> : IDisposable
{
readonly ThreadLockerByID<T> parent;
internal readonly T id;
internal int counter = 0;
public lockerObject(ThreadLockerByID<T> Parent, T ID)
{
parent = Parent;
id = ID;
}
public void Dispose()
{
parent.ReleaseLock(id);
}
}
}
Usage:
partial class Program
{
static ThreadLockerByID<int> locker = new ThreadLockerByID<int>();
static void Main(string[] args)
{
var id = 10;
using(locker.AcquireLock(id))
{
}
}
}

There are mini-libraries that do this for you, such as AsyncKeyedLock. I've used it and it saved me a lot of headaches.

Concurrent readers in .NET 4.0 C#

I came across this puzzle in a programming contest where I am not supposed to use any inbuilt concurrent .NET 4.0 data structures.
I have to override the ToString() method and this method should support concurrent readers.
This is the solution which I came up with but I strongly believe it is does not support concurrent readers. How can I support concurrent readers without locking the list?
class Puzzle
{
private List<string> listOfStrings = new List<string>();
public void Add(string item)
{
lock (listOfStrings)
{
if (item != null)
{
listOfStrings.Add(item);
}
}
}
public override string ToString()
{
lock (listOfStrings)
{
return string.Join(",", listOfStrings);
}
}
}

Because List<T> is just a T[] under the hood, there's nothing unsafe about reading the list from multiple threads. Your issue is that reading while writing and writing concurrently are unsafe. Because of this, you should use ReaderWriterLockSlim.
class Puzzle
{
private List<string> listOfStrings = new List<string>();
private ReaderWriterLockSlim listLock = new ReaderWriterLockSlim();
public void Add(string item)
{
listLock.EnterWriteLock();
try
{
listOfStrings.Add(item);
}
finally
{
listlock.ExitWriteLock();
}
}
public override string ToString()
{
listLock.EnterReadLock();
try
{
return string.Join(",", listOfStrings);
}
finally
{
listLock.ExitReadLock();
}
}
}
ReaderWriterLockSlim will allow multiple threads to enter the lock in read mode, but they will all block if/while something is in the lock in write mode. Likewise, entering the lock in write mode will block until all threads have exited the lock in read mode. The practical outworking is that multiple reads can happen at the same time as long as nothing is writing, and one write operation can happen at a time as long as nothing is reading.

Since it doesn't look like Remove isn't a requirement, why can't you just return a string?
class Puzzle
{
private string Value { get; set; }
public Puzzle()
{
Value = String.Empty;
}
public void Add(String item)
{
Value += "," + item;
}
public override string ToString()
{
return Value;
}
}

How to use lock on a Dictionary containing a list of objects in C#?

I have the following class:
public static class HotspotsCache
{
private static Dictionary<short, List<HotSpot>> _companyHotspots = new Dictionary<int, List<HotSpot>>();
private static object Lock = new object();
public static List<HotSpot> GetCompanyHotspots(short companyId)
{
lock (Lock)
{
if (!_companyHotspots.ContainsKey(companyId))
{
RefreshCompanyHotspotCache(companyId);
}
return _companyHotspots[companyId];
}
}
private static void RefreshCompanyHotspotCache(short companyId)
{
....
hotspots = ServiceProvider.Instance.GetService<HotspotsService>().GetHotSpots(..);
_companyHotspots.Add(companyId, hotspots);
....
}
The issue that I'm having is that the operation of getting the hotspots, in RefreshCompanyHotspotCache method, takes a lot of time . So while one thread is performing the cache refresh for a certain CompanyId, all the other threads are waiting until this operation is finished, although there could be threads that are requesting the list of hotspots for another companyId for which the list is already loaded in the dictionary. I would like these last threads not be locked. I also want that all threads that are requesting the list of hotspots for a company that is not yet loaded in the cache to wait until the list is fully retrieved and loaded in the dictionary.
Is there a way to lock only the threads that are reading/writing the cache for certain companyId (for which the refresh is taking place) and let the other threads that are requesting data for another company to do their job?
My thought was to use and array of locks
lock (companyLocks[companyId])
{
...
}
But that didn't solve anything. The threads dealing with one company are still waiting for threads that are refreshing the cache for other companies.

Use the Double-checked lock mechanism also mentioned by Snowbear - this will prevent your code locking when it doesn't actually need to.
With your idea of an individual lock per client, I've used this mechanism in the past, though I used a dictionary of locks. I made a utility class for getting a lock object from a key:
/// <summary>
/// Provides a mechanism to lock based on a data item being retrieved
/// </summary>
/// <typeparam name="T">Type of the data being used as a key</typeparam>
public class LockProvider<T>
{
private object _syncRoot = new object();
private Dictionary<T, object> _lstLocks = new Dictionary<T, object>();
/// <summary>
/// Gets an object suitable for locking the specified data item
/// </summary>
/// <param name="key">The data key</param>
/// <returns></returns>
public object GetLock(T key)
{
if (!_lstLocks.ContainsKey(key))
{
lock (_syncRoot)
{
if (!_lstLocks.ContainsKey(key))
_lstLocks.Add(key, new object());
}
}
return _lstLocks[key];
}
}
So simply use this in the following manner...
private static LockProvider<short> _clientLocks = new LockProvider<short>();
private static Dictionary<short, List<HotSpot>> _companyHotspots = new Dictionary<short, List<HotSpot>>();
public static List<HotSpot> GetCompanyHotspots(short companyId)
{
if (!_companyHotspots.ContainsKey(companyId))
{
lock (_clientLocks.GetLock(companyId))
{
if (!_companyHotspots.ContainsKey(companyId))
{
// Add item to _companyHotspots here...
}
}
return _companyHotspots[companyId];
}

How about you only lock 1 thread, and let that update, while everyone else uses the old list?
private static Dictionary<short, List<HotSpot>> _companyHotspots = new Dictionary<short, List<HotSpot>>();
private static Dictionary<short, List<HotSpot>> _companyHotspotsOld = new Dictionary<short, List<HotSpot>>();
private static bool _hotspotsUpdating = false;
private static object Lock = new object();
public static List<HotSpot> GetCompanyHotspots(short companyId)
{
if (!_hotspotsUpdating)
{
if (!_companyHotspots.ContainsKey(companyId))
{
lock (Lock)
{
_hotspotsUpdating = true;
_companyHotspotsOld = _companyHotspots;
RefreshCompanyHotspotCache(companyId);
_hotspotsUpdating = false;
return _companyHotspots[companyId];
}
}
else
{
return _companyHotspots[companyId];
}
}
else
{
return _companyHotspotsOld[companyId];
}
}

Have you looked into ReaderWriterLockSlim? That should be able to let get finer grained locking where you only take a writelock when needed.
Another thing you may need to look out for is false sharing. I don't know how a lock is implemented exactly but if you lock on objects in an array they're bound to be close to each other in memory, possibly putting them on the same cacheline, so the lock may not behave as you expect.
Another idea, what happens if you change the last code snippet to
object l = companyLocks[companyId];
lock(l){
}
could be the lock statement wraps more here than intended.
GJ

New idea, with locking just the lists as they are created.
If you can guarantee that each company will have at least one hotspot, do this:
public static class HotspotsCache
{
private static Dictionary<short, List<HotSpot>> _companyHotspots = new Dictionary<int, List<HotSpot>>();
static HotspotsCache()
{
foreach(short companyId in allCompanies)
{
companyHotspots.Add(companyId, new List<HotSpot>());
}
}
public static List<HotSpot> GetCompanyHotspots(short companyId)
{
List<HotSpots> result = _companyHotspots[companyId];
if(result.Count == 0)
{
lock(result)
{
if(result.Count == 0)
{
RefreshCompanyHotspotCache(companyId, result);
}
}
}
return result;
}
private static void RefreshCompanyHotspotCache(short companyId, List<HotSpot> resultList)
{
....
hotspots = ServiceProvider.Instance.GetService<HotspotsService>().GetHotSpots(..);
resultList.AddRange(hotspots);
....
}
}
Since the dictionary is being modified after its initial creation, no need to do any locking on it. We only need to lock the individual lists as we populate them, the read operation needs no locking (including the initial Count == 0).

If you're able to use .NET 4 then the answer is straightforward -- use a ConcurrentDictionary<K,V> instead and let that look after the concurrency details for you:
public static class HotSpotsCache
{
private static readonly ConcurrentDictionary<short, List<HotSpot>>
_hotSpotsMap = new ConcurrentDictionary<short, List<HotSpot>>();
public static List<HotSpot> GetCompanyHotSpots(short companyId)
{
return _hotSpotsMap.GetOrAdd(companyId, id => LoadHotSpots(id));
}
private static List<HotSpot> LoadHotSpots(short companyId)
{
return ServiceProvider.Instance
.GetService<HotSpotsService>()
.GetHotSpots(/* ... */);
}
}
If you're not able to use .NET 4 then your idea of using several more granular locks is a good one:
public static class HotSpotsCache
{
private static readonly Dictionary<short, List<HotSpot>>
_hotSpotsMap = new Dictionary<short, List<HotSpot>();
private static readonly object _bigLock = new object();
private static readonly Dictionary<short, object>
_miniLocks = new Dictionary<short, object>();
public static List<HotSpot> GetCompanyHotSpots(short companyId)
{
List<HotSpot> hotSpots;
object miniLock;
lock (_bigLock)
{
if (_hotSpotsMap.TryGetValue(companyId, out hotSpots))
return hotSpots;
if (!_miniLocks.TryGetValue(companyId, out miniLock))
{
miniLock = new object();
_miniLocks.Add(companyId, miniLock);
}
}
lock (miniLock)
{
if (!_hotSpotsMap.TryGetValue(companyId, out hotSpots))
{
hotSpots = LoadHotSpots(companyId);
lock (_bigLock)
{
_hotSpotsMap.Add(companyId, hotSpots);
_miniLocks.Remove(companyId);
}
}
return hotSpots;
}
}
private static List<HotSpot> LoadHotSpots(short companyId)
{
return ServiceProvider.Instance
.GetService<HotSpotsService>()
.GetHotSpots(/* ... */);
}
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Is HashSet<T>.Contains() efficient with large lists, multiple threads? - c#

In your scenario, yes, HashSet is the best option for this since it contains one value to look up unlike a Dictionary which requires a key and a value to do a lookup. And ofcourse as others have said no need of locking HashSet if it is not being modified. and consider marking it as readonly.

Related

How to safely write to the same List

Threading With List Property

Lock only on an Id

Concurrent readers in .NET 4.0 C#

How to use lock on a Dictionary containing a list of objects in C#?

Categories

Resources