Looking for a faster implementation for IEnumerable/IEnumerator

I'm trying to optimize a concurrent collection that tries to minimize lock contention for reads. My first pass used a linked list, which let me lock only on writes while many simultaneous reads continued unblocked. It used a custom IEnumerator to yield the next link's value. Once I started comparing iteration over the collection to a plain List<T>, I noticed my implementation was about half as fast (for from x in c select x on a collection of 1m items, I got 24ms for List<T> and 49ms for my collection).
So I thought I'd use a ReaderWriterLockSlim and sacrifice a little contention on reads so I could use a List<T> as my internal storage. Since I have to take the read lock when iteration starts and release it on completion, I first did a yield pattern for my IEnumerable, foreaching over the internal List<T>. That came in at 66ms.
I peeked at what List<T> actually does: it uses an internal store of T[] and a custom IEnumerator that moves the index forward and returns the value at the current index. Manually using T[] as storage means a lot more maintenance work, but wth, I'm chasing microseconds.
However, even mimicking the IEnumerator moving the index over an array, the best I could do was about 38ms. So what gives List<T> its secret sauce, or alternatively, what's a faster implementation for an iterator?
UPDATE: Turns out my main speed culprit was running a Debug compile, while List<T> is obviously a Release compile. In Release my implementation is still a hair slower than List<T>, although on Mono it's now faster.
One other suggestion I got from a friend is that the BCL is faster because it's in the GAC and can therefore get pre-compiled by the system. I will have to put my test in the GAC to test that theory.

Acquiring and releasing the lock on each iteration sounds like a bad idea - because if you perform an Add or Remove while you're iterating over the list, that will invalidate the iterator. List<T> certainly wouldn't like that, for example.
Would your use case allow callers to take out a ReaderWriterLockSlim around their whole process of iteration, instead of on a per-item basis? That would be more efficient and more robust. If not, how are you planning to deal with the concurrency issue? If a writer adds an element earlier than the place where I've got to, a simple implementation would return the same element twice. The opposite would happen with a removal - the iterator would skip an element.
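As a rough sketch of what taking the lock around the whole iteration could look like: the wrapper below holds one read lock for an entire traversal instead of locking per item. The names LockedList and Read are made up for illustration and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical wrapper; the type and member names are assumptions for illustration.
public sealed class LockedList<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public void Add(T item)
    {
        _lock.EnterWriteLock();
        try { _items.Add(item); }
        finally { _lock.ExitWriteLock(); }
    }

    // Runs the caller's whole traversal under a single read lock,
    // so writers cannot invalidate the iteration halfway through.
    public void Read(Action<IReadOnlyList<T>> body)
    {
        _lock.EnterReadLock();
        try { body(_items); }
        finally { _lock.ExitReadLock(); }
    }
}
```

Many readers can be inside Read at the same time; a writer waits until they have all left. The body must not call Add, because the default ReaderWriterLockSlim does not allow taking the write lock while the same thread holds the read lock.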
Finally, is .NET 4.0 an option? I know there are some highly optimised concurrent collections there...
EDIT: I'm not quite sure what your current situation is in terms of building an iterator by hand, but one thing that you might want to investigate is using a struct for the IEnumerator<T>, and making your collection explicitly declare that it returns that - that's what List<T> does. It does mean using a mutable struct, which makes kittens cry all around the world, but if this is absolutely crucial to performance and you think you can live with the horror, it's at least worth a try.
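A minimal sketch of the struct-enumerator idea, assuming a fixed array as backing store. ArrayBackedCollection is a hypothetical name, and this is not List<T>'s actual source, only the same general shape.

```csharp
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection used only to illustrate the pattern.
public sealed class ArrayBackedCollection<T> : IEnumerable<T>
{
    private readonly T[] _items;

    public ArrayBackedCollection(T[] items) { _items = items; }

    // foreach binds to this public method by pattern, so the struct is
    // used directly: no heap allocation and no interface dispatch.
    public Enumerator GetEnumerator() { return new Enumerator(_items); }

    // Interface callers still work, but the struct gets boxed here.
    IEnumerator<T> IEnumerable<T>.GetEnumerator() { return GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // Mutable struct: copying it copies the current position.
    public struct Enumerator : IEnumerator<T>
    {
        private readonly T[] _items;
        private int _index;

        internal Enumerator(T[] items) { _items = items; _index = -1; }

        public T Current { get { return _items[_index]; } }
        object IEnumerator.Current { get { return Current; } }
        public bool MoveNext() { return ++_index < _items.Length; }
        public void Reset() { _index = -1; }
        public void Dispose() { }
    }
}
```

A plain foreach over an ArrayBackedCollection<T> variable uses the struct path. Code that goes through IEnumerable<T> (LINQ, for example) gets the boxed enumerator, so the benefit only shows up when the static type is the collection itself.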

Related

Should I use a C# Dictionary if I only need fast lookup of keys, and values are irrelevant?

I am in need of a data type that is able to insert entries and then be able to quickly determine if an entry has already been inserted. A Dictionary seems to suit this need (see example). However, I have no use for the dictionary's values. Should I still use a dictionary or is there another better suited data type?
public class Foo
{
    private Dictionary<string, bool> Entities;
    ...
    public void AddEntity(string bar)
    {
        if (!Entities.ContainsKey(bar))
        {
            // bool value true here has no use and is just a placeholder
            Entities.Add(bar, true);
        }
    }
    public string[] GetEntities()
    {
        return Entities.Keys.ToArray();
    }
}

You can use HashSet<T>.
The HashSet<T> class provides high-performance set operations. A set
is a collection that contains no duplicate elements, and whose
elements are in no particular order.
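For illustration, the class from the question might look like this with a HashSet<string>. This is a sketch: returning bool from AddEntity and the extra HasEntity method are my additions, not part of the original code.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Foo
{
    private readonly HashSet<string> _entities = new HashSet<string>();

    // HashSet<T>.Add returns false when the item is already present,
    // so no separate Contains check is needed.
    public bool AddEntity(string bar)
    {
        return _entities.Add(bar);
    }

    // Hypothetical helper for the "has it been inserted?" check.
    public bool HasEntity(string bar)
    {
        return _entities.Contains(bar);
    }

    public string[] GetEntities()
    {
        return _entities.ToArray();
    }
}
```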

Habib's answer is excellent, but for multi-threaded environments if you use a HashSet<T> then by consequence you have to use locks to protect access to it. I find myself more prone to creating deadlocks with lock statements. Also, locks yield a worse speedup per Amdahl's law because adding a lock statement reduces the percentage of your code that is actually parallel.
For those reasons, a ConcurrentDictionary<T,object> fits the bill in multi-threaded environments. If you end up using one, then wrap it like you did in your question. Just new up objects to toss in as values as needed, since the values won't be important. You can verify that there are no lock statements in its source code.
If you didn't need mutability of the collection then this would be moot. But your question implies that you do need it, since you have an AddEntity method.
Additional info 2017-05-19 - actually, ConcurrentDictionary does use locks internally, although not lock statements per se--it uses Monitor.Enter (check out the TryAddInternal method). However, it seems to lock on individual buckets within the dictionary, which means there will be less contention than putting the entire thing in a lock statement.
So all in all, ConcurrentDictionary is often better for multithreaded environments.
It's actually quite difficult (impossible?) to make a concurrent hash set using only the Interlocked methods. I tried on my own and kept running into the problem of needing to alter two things at the same time--something that only locking can do in general. One workaround I found was to use singly-linked lists for the hash buckets and intentionally create cycles in a list when one thread needed to operate on a node without interference from other threads; this would cause other threads to get caught spinning around in the same spot until that thread was done with its node and undid the cycle. Sure, it technically didn't use locks, but it did not scale well.
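The ConcurrentDictionary wrapper suggested in this answer could be sketched roughly as follows. ConcurrentFoo is a hypothetical name, and using a byte as the throwaway value (rather than newing up objects) is my assumption to keep the placeholder cheap.

```csharp
using System.Collections.Concurrent;
using System.Linq;

// Hypothetical thread-safe variant of the question's class.
public class ConcurrentFoo
{
    // The byte value is a placeholder; only the keys matter.
    private readonly ConcurrentDictionary<string, byte> _entities =
        new ConcurrentDictionary<string, byte>();

    // TryAdd is atomic: exactly one caller wins when several add the same key.
    public bool AddEntity(string bar)
    {
        return _entities.TryAdd(bar, 0);
    }

    public bool RemoveEntity(string bar)
    {
        byte ignored;
        return _entities.TryRemove(bar, out ignored);
    }

    // Keys returns a snapshot, so this is safe while other threads write.
    public string[] GetEntities()
    {
        return _entities.Keys.ToArray();
    }
}
```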

Iterate and do operations on a generic collection concurrently

I've created a game emulation program using C# async sockets. I need to add to, remove from, and iterate over a collection (a list that holds clients) concurrently. I am currently using "lock", but it causes a huge performance drop. I also do not want to use "local lists/copies" to keep the list up-to-date. I've heard about ConcurrentBag, but I am not sure how thread-safe it is for iteration (for instance, if a thread removes an element from the list while another thread is iterating over it).
What do you suggest?
Edit: here is a situation
this is when a packet is sent to all the users in a room
lock (parent.gameClientList)
{
    for (int i = 0; i <= parent.gameClientList.Count() - 1; i++)
        if (parent.gameClientList[i].zoneId == zoneId)
            parent.gameClientList[i].SendXt(packetElements); // if room matches - SendXt sends a packet
}
When a new client connects
Client connectedClient = new Client(socket, this);
lock (gameClientList)
{
    gameClientList.Add(connectedClient);
}
Same case when a client disconnects.
I am asking for a better alternative (performance-wise) because the locks slow down everything.

It sounds like the problem is that you're doing all the work inside your loop while holding the lock, which locks out the add/remove methods for too long. The way around this is to quickly make a copy of the collection while it's locked, then release the lock and iterate over the copy.
Thing[] copy;
lock (myLock) {
    copy = _collection.ToArray();
}
foreach (var thing in copy) {...}
The drawback is that by the time you get around to operating on some object in that copy, it may have been removed from the original collection, so you may not want to operate on it anymore. Whether that matters depends on your requirements. If it's a problem, a simple option is to take the lock on each iteration of the loop, which will slow things down but at least won't hold the lock for the entire duration of the loop:
foreach (var thing in copy) {
    lock (myLock) {
        // Check that it's still in the original collection.
        if (_collection.Contains(thing))
            // If you can move this outside the lock it'd make your app snappier,
            // but it could also cause problems if you're doing something "dangerous" in DoWork.
            DoWork(thing);
    }
}
If this is what you meant by "local copies", then you can disregard this option, but I figured I'd offer it in case you meant something else.

Every time you do something concurrently you are going to lose some performance to task management (i.e. locks). I suggest you find out where the bottleneck in your process is. You seem to have a shared-memory model, as opposed to a message-passing model. If you know you need to modify the entire collection at once, there may not be a good solution. But if you are making changes in a particular order, you can leverage that order to prevent delays. Locks are an implementation of pessimistic concurrency. You could switch to an optimistic concurrency model. In the pessimistic model the cost is waiting; in the optimistic model the cost is retrying. Again, the actual solution depends on your use case.

One problem with ConcurrentBag is that it is unordered, so you cannot pull items out by index the way you are doing currently. However, you can iterate it via foreach to get the same effect. This iteration will be thread-safe. It will not go berserk if an item is added or removed while the iteration is happening.
There is another problem with ConcurrentBag though. It actually copies the contents to a new List internally to make the enumerator work correctly. So even if you just wanted to pick off a single item via the enumerator, it would still be an O(n) operation because of the way the enumerator works. You can verify this by disassembling it.
However, based on context clues from your update I assume that this collection is going to be small. It appears that there is only one entry per "game client" which means it is probably going to store a small number of items right? If that is correct then the performance of the GetEnumerator method will be mostly insignificant.
You should also consider ConcurrentDictionary as well. I noticed that you are trying to match items from the collection based on zoneId. If you store the items in the ConcurrentDictionary keyed by zoneId then you would not need to iterate the collection at all. Of course, this assumes that there is only one entry per zoneId which may not be the case.
You mentioned that you did not want to use "local lists/copies", but you never said why. I think you should reconsider this for the following reasons.
Iterations could be lock-free.
Adding and removing items appears to be infrequent, based on context clues from your code.
There are a couple of patterns you can use to make the list copying strategy work really well. I talk about them in my answers here and here.
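Since the linked answers are not reproduced here, the following is only my guess at one such pattern, a copy-on-write list: writers copy the array under a lock and publish the new one, and readers iterate whatever snapshot was current without locking. CopyOnWriteList is a hypothetical name.

```csharp
using System;

// Hypothetical copy-on-write list; a sketch, not the linked answers' code.
public sealed class CopyOnWriteList<T>
{
    private readonly object _writeLock = new object();
    // volatile so readers always see the most recently published array.
    private volatile T[] _snapshot = new T[0];

    public void Add(T item)
    {
        lock (_writeLock)
        {
            T[] old = _snapshot;
            T[] next = new T[old.Length + 1];
            Array.Copy(old, next, old.Length);
            next[old.Length] = item;
            _snapshot = next; // publish; the old array is never modified
        }
    }

    public bool Remove(T item)
    {
        lock (_writeLock)
        {
            T[] old = _snapshot;
            int index = Array.IndexOf(old, item);
            if (index < 0) return false;
            T[] next = new T[old.Length - 1];
            Array.Copy(old, 0, next, 0, index);
            Array.Copy(old, index + 1, next, index, old.Length - index - 1);
            _snapshot = next;
            return true;
        }
    }

    // Lock-free read. Callers must treat the returned array as read-only.
    public T[] Snapshot { get { return _snapshot; } }
}
```

Each write costs O(n), which fits the assumption above that adds and removes are rare compared to iterations. A reader may still see a client that was removed a moment ago, the same caveat as with any copy-based approach.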

Do IEnumerators require more resources than arrays?

I have an implementation of the Sutherland–Hodgman algorithm, and so I need to return arrays frequently. I'm using Unity, so the answer needs to apply at least on the Mono runtime.
I'm wondering if it's best to return plain arrays, or if I could return an IEnumerator instead to increase the time between garbage collections. So far I've been returning arrays, but I would really like to lose those calls to GC.Collect().
I guess the garbage collector would need to collect IEnumerators as well, and there is probably some overhead related as well?

When using generator co-routines (functions with many yield returns), no array is created or allocated. Everything is done in a streaming fashion. It is entirely possible to have infinite generators that work without getting an out of memory error:
public static IEnumerable<int> Odds() {
    for (int i = 1; ; i += 2)
        yield return i;
}
Therefore, if you frequently return big arrays only to be iterated and disposed immediately, the benefits will be huge, as the memory allocations will be much smaller. The garbage collector will be called less often and will have less work to do.

First, just create your big array once and keep it somewhere in memory. The rest of the application can retrieve a reference to that array and iterate through it.
The best and fastest way to iterate through an array is a for loop.
If you use foreach, an enumerator object is created under the hood every time, and it needs to be cleaned up after it has done its duty.

Making GetEnumerator ThreadSafe

Exactly how do enumerators work - I know that they build a state machine behind the scenes but if I call GetEnumerator twice will I get two different objects?
If I do something like this
public IEnumerator<T> GetEnumerator()
{
    yield return 1;
    yield return 2;
}
Can I acquire a lock at the start of the method, and is this lock held until the enumerator has returned null or until the enumerator is GC'd?
What happens if the caller resets the enumerator, etc.?
I guess my question is: what is the best way to manage locking when dealing with an enumerator?
Note: The client cannot be responsible for thread synchronisation. The class needs to handle it internally.
And finally, the example above is a simplification of the problem. The yield statements do a bit more than what I have shown :)

Yes, each call to your GetEnumerator() method will create a different object (a new state machine).
You can acquire a lock within an iterator block, but be aware that none of the code in your method will be called until the caller calls MoveNext() for the first time.
In general, I would advise against holding a lock within an iterator block if at all possible. You don't know what the caller is going to do between calls to MoveNext(). So long as they dispose of the iterator at some point, the lock will be released eventually, but it still means you're at the mercy of the caller.
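To make the lock lifetime concrete, here is a small sketch. LockedSequence and its IsLockHeldByCurrentThread property are hypothetical and exist only so the state of the lock can be observed; Monitor.IsEntered requires .NET 4.5 or later.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical type for observing when an iterator block holds its lock.
public sealed class LockedSequence
{
    private readonly object _gate = new object();
    private readonly List<int> _items = new List<int> { 1, 2 };

    public bool IsLockHeldByCurrentThread
    {
        get { return Monitor.IsEntered(_gate); }
    }

    public IEnumerator<int> GetEnumerator()
    {
        // Nothing in this method runs until the first MoveNext() call.
        lock (_gate)
        {
            foreach (int item in _items)
                yield return item; // the lock stays held while suspended here
        }
        // The lock is released when iteration finishes, or earlier if
        // Dispose() is called, because Dispose() runs the pending finally block.
    }
}
```

Calling GetEnumerator() alone does not take the lock. After the first MoveNext() the calling thread holds it, and it stays held until the sequence is exhausted or the enumerator is disposed. A caller that abandons the enumerator without disposing it leaves the lock held, which is the "mercy of the caller" problem described above.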
If you can give us more information about what you're trying to do, that would help. An alternative design which might be easier to get right would be:
public void DoSomething(Action<T> action)
{
    lock (...)
    {
        // Call action on each element in here
    }
}

As John already said it is difficult to give you a good answer.
The obvious google search leads to: http://www.codeproject.com/KB/cs/safe_enumerable.aspx
The idea behind this is to lock the IEnumerable instance on construction, which has major drawbacks.
The next obvious thing is isolation, where you create a copy of your structure and iterate over the copy. Implemented naively this consumes a lot of memory, but it can be worth it if your data set is relatively small.
The best thing would be for your data to be immutable, because then you get thread safety automatically. But if you are bound to a collection whose count changes, you have a mutable data structure. If you could redesign your data structure into an immutable one, you would be done.
Since it is not a good idea to lock the data for a potentially long time, you can implement strategies that achieve thread safety by taking advantage of your exact data structure and use case. For example, if you change the data rarely and enumerate it frequently, you could implement an optimistic enumerable that reads a write counter of your data structure before the enumeration starts and then yields the results as usual. If a write happens in between, you throw an exception to signal the user of your enumerator to try again. This works, but it delegates responsibility to the caller of your enumerable, who needs to retry the enumeration until it succeeds.
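A minimal sketch of the optimistic idea, under my own assumptions: VersionedList is a hypothetical name, and instead of reading the list entirely without locks, it takes a very short lock per element so that reads of List<T> never overlap a write.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical list with a write counter; a sketch of the optimistic approach.
public sealed class VersionedList<T>
{
    private readonly object _gate = new object();
    private readonly List<T> _items = new List<T>();
    private volatile int _version;

    public void Add(T item)
    {
        lock (_gate)
        {
            _items.Add(item);
            _version++; // every write bumps the counter
        }
    }

    public IEnumerable<T> Enumerate()
    {
        // Captured on the first MoveNext(), since iterator blocks run lazily.
        int startVersion = _version;
        for (int i = 0; ; i++)
        {
            T item;
            lock (_gate) // held only long enough to read one element
            {
                if (_version != startVersion)
                    throw new InvalidOperationException("Collection was modified; retry the enumeration.");
                if (i >= _items.Count)
                    yield break;
                item = _items[i];
            }
            yield return item; // the lock is not held while the caller works
        }
    }
}
```

The caller is expected to catch InvalidOperationException and start over, which is the retry burden mentioned above. Work done on elements before the exception is not undone, so this suits read-only processing best.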

Iterating Collection In Two Threads

This question relates both to C# and Java
If you have a collection which is not modified, and a reference to that collection is shared between two threads, what happens when you iterate on each thread?
ThreadA: Collection.iterator
ThreadA: Collection.moveNext
ThreadB: Collection.iterator
ThreadB: Collection.moveNext
Will threadB see the first element?
Is the iterator always reset when it is requested? What happens if the calls to moveNext and item are interleaved between the threads? Is there a danger that you don't process all elements?

It works as expected because each time you request the iterator you get a new one.
If it didn't you wouldn't be able to do foreach followed by foreach on the same collection!
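A small C# sketch of the same point: two enumerators over one list keep independent positions. It runs on a single thread and simply interleaves the calls the way the question does; IndependentIterators and FirstOfEach are names made up for the example.

```csharp
using System.Collections.Generic;

public static class IndependentIterators
{
    // Returns the current element of each enumerator after interleaved calls.
    public static int[] FirstOfEach()
    {
        var list = new List<int> { 10, 20, 30 };

        // Each GetEnumerator() call produces a fresh enumerator.
        using (IEnumerator<int> a = list.GetEnumerator())
        using (IEnumerator<int> b = list.GetEnumerator())
        {
            a.MoveNext(); // a is at 10
            a.MoveNext(); // a is at 20
            b.MoveNext(); // b starts from the beginning and is at 10
            return new[] { a.Current, b.Current };
        }
    }
}
```

So in the question's scenario, thread B does see the first element, no matter how far thread A has advanced.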

By convention, an Iterator is implemented such that the traversal action never alters the collection's state. It only points to the current position in the collection, and manages the iteration logic. Therefore, if you scan the same collection by N different threads, everything should work fine.
However, note that Java's Iterator allows removal of items, and ListIterator even supports the set operation. If you want to use these actions by at least one of the threads, you will probably face concurrency problems (ConcurrentModificationException), unless the Iterator is specifically designed for such scenarios (such as with ConcurrentHashMap's iterators).

In Java (and I am pretty sure also in C#), the standard API collections typically do not have a single shared iterator. Each call to iterator() produces a new one, which has its own internal index or pointer, so as long as both threads acquire their own iterator object, there should be no problem.
However, this is not guaranteed by the interface, nor is the ability of two iterators to work concurrently without problems. For custom implementations of collections, all bets are off.

In C#, yes. In Java it seems to be the case too, but I'm not familiar with it.
About c# see http://csharpindepth.com/Articles/Chapter6/IteratorBlockImplementation.aspx and http://csharpindepth.com/Articles/Chapter11/StreamingAndIterators.aspx

At least in C#, all of the standard collections can be enumerated simultaneously on different threads. However, enumeration on any thread will blow up if you modify the underlying collection during enumeration (as it should.) I don't believe any sane developer writing a collection class would have their enumerators mutate the collection state in a way that interferes with enumeration, but it's possible. If you're using a standard collection, however, you can safely assume this and therefore use locking strategies like Single Writer / Multiple Reader when synchronizing collection access.
