Best Practice - Removing item from generic collection in C#

I'm using C# in Visual Studio 2008 with .NET 3.5.
I have a generic dictionary that maps types of events to a generic list of subscribers. A subscriber can be subscribed to more than one event.
private static Dictionary<EventType, List<ISubscriber>> _subscriptions;
To remove a subscriber from the subscription list, I can use either of these two options.
Option 1:
ISubscriber subscriber; // defined elsewhere
foreach (EventType eventType in _subscriptions.Keys) {
    if (_subscriptions[eventType].Contains(subscriber)) {
        _subscriptions[eventType].Remove(subscriber);
    }
}
Option 2:
ISubscriber subscriber; // defined elsewhere
foreach (EventType eventType in _subscriptions.Keys) {
    _subscriptions[eventType].Remove(subscriber);
}
I have two questions.
First, notice that Option 1 checks for existence before removing the item, while Option 2 uses a brute force removal since Remove() does not throw an exception. Of these two, which is the preferred, "best-practice" way to do this?
Second, is there another, "cleaner," more elegant way to do this, perhaps with a lambda expression or using a LINQ extension? I'm still getting acclimated to these two features.
Thanks.
EDIT
Just to clarify, I realize that the choice between Options 1 and 2 is a choice of speed (Option 2) versus maintainability (Option 1). In this particular case, I'm not necessarily trying to optimize the code, although that is certainly a worthy consideration. What I'm trying to understand is if there is a generally well-established practice for doing this. If not, which option would you use in your own code?

Option 1 will be slower than Option 2, because it searches each list twice. Lambda expressions and LINQ will probably be slower as well. I would use HashSet<> instead of List<>.
If you need confirmation that an item was removed, check the bool that Remove returns; a separate Contains call is not needed for that.
EDITED:
Since there is a high probability that your code will run inside a lock statement, and best practice is to minimize the time spent inside a lock, Option 2 may be the better choice. There does not seem to be an established best practice for or against pairing Contains with Remove.
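A minimal sketch of the HashSet<> suggestion follows. The EventType values, the Subscriber class, and the SubscriptionRegistry name are stand-ins invented for illustration, and the sketch assumes subscribers are compared by reference (or implement Equals and GetHashCode consistently).

```csharp
using System.Collections.Generic;

public enum EventType { Started, Stopped }        // hypothetical event types
public interface ISubscriber { }                  // stand-in for the asker's interface
public sealed class Subscriber : ISubscriber { }  // hypothetical implementation

public static class SubscriptionRegistry
{
    private static readonly Dictionary<EventType, HashSet<ISubscriber>> _subscriptions =
        new Dictionary<EventType, HashSet<ISubscriber>>();

    public static void Subscribe(EventType eventType, ISubscriber subscriber)
    {
        HashSet<ISubscriber> set;
        if (!_subscriptions.TryGetValue(eventType, out set))
        {
            set = new HashSet<ISubscriber>();
            _subscriptions[eventType] = set;
        }
        set.Add(subscriber); // has no effect if already subscribed
    }

    // Returns the number of events the subscriber was removed from.
    public static int Unsubscribe(ISubscriber subscriber)
    {
        int removed = 0;
        foreach (HashSet<ISubscriber> set in _subscriptions.Values)
        {
            // HashSet<T>.Remove is a hash lookup and returns false when absent.
            if (set.Remove(subscriber)) removed++;
        }
        return removed;
    }
}
```

With a set per event, a duplicate subscription is impossible and removal no longer scans a list, at the cost of losing subscription order.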

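Dictionary<TKey, TValue>.Remove() 'approaches O(1)' and is OK when a key does not exist. List<T>.Remove(), which is what the question calls, is O(n), but it likewise just returns false when the item is absent.
But otherwise: when in doubt, measure. Getting some timings isn't that difficult...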

Why enumerate the keys when all you're concerned with is the values?
foreach (List<ISubscriber> list in _subscriptions.Values)
{
list.Remove(subscriber);
}
That said, the LINQ solution suggested by Eric P is certainly more concise. Performance might be an issue, though.

I'd opt for the second option. Contains() and Remove() are both O(n) methods, and there's no reason to call both since Remove doesn't throw. At least with method 2, you're only calling one expensive operation instead of two.
I don't know of a faster way to handle it.
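As a rough illustration of why the extra Contains call is redundant, here is a small sketch. RemoveDemo and RemoveFromAll are hypothetical names; the only library behaviour relied on is that List<T>.Remove returns a bool.

```csharp
using System.Collections.Generic;

public static class RemoveDemo
{
    // Removes item from every list and reports whether any list contained it.
    public static bool RemoveFromAll<T>(IEnumerable<List<T>> lists, T item)
    {
        bool removedAny = false;
        foreach (List<T> list in lists)
        {
            // Remove searches the list itself and returns false when the item
            // is absent, so calling Contains first would scan the list twice.
            removedAny |= list.Remove(item);
        }
        return removedAny;
    }
}
```

The return value also covers the case where the caller wants confirmation that something was actually removed.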

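If you wanted to use LINQ to do this, I think this would work (not tested):
_subscriptions.Values.ToList().ForEach(x => x.Remove(subscriber));
Avoid the shorter _subscriptions.Values.All(x => x.Remove(subscriber)): All stops at the first false, so it would skip every list after the first one that doesn't contain the subscriber. Might want to check the performance on that though.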
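One thing worth checking with an All-based one-liner is that Enumerable.All stops enumerating as soon as the predicate returns false. The sketch below uses lists of int and a hypothetical ShortCircuitDemo class purely to make that visible.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ShortCircuitDemo
{
    // Tries to remove item via All; returns how many lists still contain it.
    public static int RemoveWithAll(List<List<int>> lists, int item)
    {
        // All stops at the first false result, so every list after the
        // first one that lacks the item is never touched.
        lists.All(list => list.Remove(item));
        return lists.Count(list => list.Contains(item));
    }

    // Removes item with a plain loop; returns how many lists still contain it.
    public static int RemoveWithForeach(List<List<int>> lists, int item)
    {
        foreach (List<int> list in lists)
        {
            list.Remove(item);
        }
        return lists.Count(list => list.Contains(item));
    }
}
```

For the lists {1}, {2}, {1, 3} and item 1, the All version leaves the third list untouched, while the plain loop clears all of them.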

Related

Get original value from HashSet

UPDATE:
Starting with .NET Framework 4.7.2, HashSet<T>.TryGetValue is available; see its documentation and the related SO post.
I have a problem with HashSet because it does not provide any method similar to TryGetValue known from Dictionary. And I need such method -- passing element to find in the set, and set returning element from its collection (when found).
Sidenote -- "why do you need element from the set, you already have that element?". No, I don't, equality and identity are two different things.
HashSet is not sealed but all its fields are private, so deriving from it is pointless. I cannot use Dictionary instead because I need SetEquals method. I was thinking about grabbing a source for HashSet and adding desired method, but the license is not truly open source (I can look, but I cannot distribute/modify). I could use reflection but the arrays in HashSet are not readonly meaning I cannot bind to those fields once per instance lifetime.
And I don't want to use full blown library for just single class.
So far I am stuck with LINQ SingleOrDefault. So the question is how to fix this: how can I have a HashSet with TryGetValue?

You should probably switch from a HashSet to a SortedSet.
There is a simple way to write TryGetValue() for a SortedSet:
public bool TryGetValue(ref T element)
{
var foundSet = sortedSet.GetViewBetween(element, element);
if(foundSet.Count == 1)
{
element = foundSet.First();
return true;
}
return false;
}
When called, the element only needs the properties that the comparer uses to be set. On success, element is replaced with the instance found in the set.

I agree this is something which is basically missing. While it's only useful in rare cases, I think they're significant rare cases, most notably key canonicalization.
I can only think of one suggestion at the moment, and it's truly foul.
You can specify your own IEqualityComparer<T> when creating a HashSet<T> - so create one which remembers the arguments to the last positive (i.e. true-returning) Equals comparison it has performed. You can then call Contains, and see what the equality comparer was asked to compare.
Caveats:
This holds on to references unnecessarily, so could end up preventing objects being garbage collected
You'd potentially want to do this on a per-thread basis (if you've got a set that isn't modified after initialization, but is then read by multiple threads, for example)
It assumes that HashSet<T> doesn't use any optimization such as "if the references are equal, don't bother consulting the equality comparer"
It's fundamentally a horrible abuse
I've been trying to think of other alternatives in terms of finding intersections, but I haven't got anywhere yet...
As noted in comments, it would be worth encapsulating this as far as possible - I suspect you only need a very limited set of operations, so I'd wrap a HashSet<T> in your own class and only expose the operations you really need - that way you get to clear the "cache" after each operation, removing my first objection above.
It still feels like a horrible abuse to me, but...
As others have suggested, an alternative would be to use a Dictionary<TKey, TValue> and implement SetEquals yourself. That would be simple enough to do - and again, you'd want to encapsulate this in your own type. Either way, you should probably design the type itself first, and then implement it using either a HashSet<> or a Dictionary<,> as an implementation detail.
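A minimal sketch of the Dictionary-based alternative follows, assuming only a handful of operations are needed. CanonicalSet<T> and its members are hypothetical names chosen for illustration; SetEquals here assumes both sets were built with equivalent comparers.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: each element is stored as both key and value, so the
// stored instance can be retrieved by an equal (but not identical) probe.
public sealed class CanonicalSet<T>
{
    private readonly Dictionary<T, T> _items;

    public CanonicalSet(IEqualityComparer<T> comparer)
    {
        _items = new Dictionary<T, T>(comparer);
    }

    public int Count { get { return _items.Count; } }

    // Returns false (and keeps the existing instance) if an equal item is present.
    public bool Add(T item)
    {
        if (_items.ContainsKey(item)) return false;
        _items.Add(item, item);
        return true;
    }

    // On success, actualValue is the instance held by the set, not the probe.
    public bool TryGetValue(T equalValue, out T actualValue)
    {
        return _items.TryGetValue(equalValue, out actualValue);
    }

    // Order-independent comparison: same size and every key present in the other set.
    public bool SetEquals(CanonicalSet<T> other)
    {
        return _items.Count == other._items.Count
            && _items.Keys.All(other._items.ContainsKey);
    }
}
```

Because the dictionary is private, the type could later be reimplemented on top of a HashSet<T> without changing callers.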

Sounds like you are trying to use the wrong tool. True, you can save some memory using a HashSet, but it seems to me that you are trying to achieve a different goal: getting the actual element that is merely equal to a representation.
So in reality they are two different elements. Only the memento (a unique representation) is equal.
Therefore you'd be better off using a Dictionary where you add your elements as both key and value. That way you can get the identical element back, but you lose SetEquals.
SetEquals is order-independent: two sets are equal when they contain the same elements. A plain SequenceEqual() (LINQ) over the two Keys collections is not a safe substitute, because it compares position by position and Dictionary does not guarantee its enumeration order.
Comparing the counts and then checking that every key of one dictionary is present in the other gives the same answer as SetEquals.
So this extension method could do:
public static bool SetEqual<T, G>(this IDictionary<T, G> d, IDictionary<T, G> e)
{
    return d.Count == e.Count && d.Keys.All(e.ContainsKey);
}
This should work because a Dictionary is basically a HashSet with an associated value, which is more appropriate to your problem. (It assumes both dictionaries use equivalent key comparers.)
If you need an IEnumerable<> as the second parameter, copy it into a HashSet<> first so the lookups stay cheap (not so efficient, since it costs an extra pass).

Finally added in .NET Framework 4.7.2: the HashSet<T>.TryGetValue(T, out T) method.
There is an SO post with more details.
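For the key canonicalization case, usage might look like the sketch below. Canonicalizer and Intern are hypothetical names; the only library call relied on is HashSet<T>.TryGetValue(T, out T), which is not available before .NET Framework 4.7.2.

```csharp
using System.Collections.Generic;

public static class Canonicalizer
{
    // Returns the instance already stored in the pool if an equal one exists;
    // otherwise adds the candidate to the pool and returns it.
    public static string Intern(HashSet<string> pool, string candidate)
    {
        string existing;
        if (pool.TryGetValue(candidate, out existing))
        {
            return existing; // the stored element, not the probe
        }
        pool.Add(candidate);
        return candidate;
    }
}
```

With a pool created using a case-insensitive comparer, interning "Alpha" and then "ALPHA" hands back the stored "Alpha" both times, which is exactly the equality-versus-identity distinction the question is about.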

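Hopefully I'm not blind, but I haven't seen this answer anywhere. If you want Dictionary's TryGetValue, you can just steal it.
theHashset.ToDictionary(item => item.ID).TryGetValue(key, out value)
All you need is a quick lambda for determining unique keys. Note that ToDictionary builds a new dictionary on every call, which is O(n), so keep the dictionary around if you look things up repeatedly.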

Is .Select<T>(...) to be preferred before .Where<T>(...)?

I got into a discussion with two colleagues regarding a setup for an iteration over an IEnumerable (the contents of which will not be altered in any way during the operation). There are three conflicting theories on which is the optimal approach. Both of the others (and I as well) are very certain, which made me unsure, so for the sake of clarity, I want to check with an external source.
The scenario is as follows. We had the code below as a starting point and discovered that some of the hazaas need not be acted upon. So we started to add a blocker for the action.
foreach(Hazaa hazaa in hazaas) ;
My suggestion is as follows.
foreach(Hazaa hazaa in hazaas.Where(element => condition)) ;
One of the guys wants to resolve it with a more explicit form, claiming that LINQ is not appropriate in this case (not sure why, but he seems very convinced). His solution is this.
foreach(Hazaa hazaa in hazaas)
    if(condition) ;
The other counter-suggestion is supported by the claim that Where risks repeating the filtering process needlessly, and that picking the appropriate elements once and for all with Select is more certain to minimize the computational workload.
foreach(Hazaa hazaa in hazaas.Select(element => condition)) ;
I argue that the first is obsolete, since LINQ can handle data objects quite well.
I also believe that Select-ing is in this case equivalently fast to Where-ing and no needless steps will be taken (e.g. the evaluation of the condition on the elements will only be performed once). If anything, it should be faster using Where because we won't be creating an extra instance of anything.
Who's right?

Select is inappropriate. It doesn't filter anything.
if is a possible solution, but Where is just as explicit.
Where executes the condition exactly once per item, just as the if. Additionally, it is important to note that the call to Where doesn't iterate the list. So, using Where you iterate the list exactly once, just like when using if.
I think you are arguing with one person who doesn't understand LINQ (the guy who wants to use Select) and one who doesn't like the functional aspect of LINQ.
I would go with Where.
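The "exactly once per item" claim is easy to check by counting predicate calls. The following is only a sketch with a hypothetical WhereDemo class and an even-number filter standing in for the real condition.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WhereDemo
{
    // Returns { predicate calls before the loop, predicate calls after the loop }.
    public static int[] CountPredicateCalls(IEnumerable<int> source)
    {
        int calls = 0;
        IEnumerable<int> evens = source.Where(x => { calls++; return x % 2 == 0; });

        int before = calls;              // Where is deferred, so nothing has run yet
        foreach (int item in evens) { }  // single pass; predicate runs once per item
        return new[] { before, calls };
    }
}
```

For a four-element source this yields 0 calls before the loop and 4 after it, the same amount of work as an if inside a plain foreach.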

The .Where() and the if(condition) approaches will behave the same.
But since LINQ is nicely readable, I'd prefer that.
The approach with .Select() is nonsense, since it will not return the Hazaa objects but an IEnumerable<Boolean>.
To be clear about the functions:
myEnumerable.Where(a => isTrueFor(a)) //This is filtering
myEnumerable.Select(a => a.b) //This is projection
Where() runs a function that returns a Boolean for each item of the enumerable and returns the item only when the function returns true.
Select() runs a function for every item in the enumerable and returns the results of the function without doing any filtering.
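To make the difference concrete, here is a small sketch that applies the same condition through both operators. FilterVersusProject and the x > 2 condition are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FilterVersusProject
{
    // Filtering: the result contains only the items that satisfy the condition.
    public static List<int> Filter(IEnumerable<int> source)
    {
        return source.Where(x => x > 2).ToList();
    }

    // Projection: the result contains one bool per item; nothing is removed.
    public static List<bool> Project(IEnumerable<int> source)
    {
        return source.Select(x => x > 2).ToList();
    }
}
```

Given 1, 2, 3, 4, Filter returns the two items 3 and 4, while Project returns four booleans, so a foreach over the Select result would not even compile with a Hazaa loop variable.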

Calling a method in a linq foreach - how much overhead is there?

I'm thinking of replacing a lot of inline foreaches with Linq and in so doing will make new methods, e.g.
Current:
foreach (var item in List)
{
    Statement1
    Statement2
    Statement3
}
Idea:
List.ForEach(item => Method(item));
Obviously Method() contains Statement1..3
Is this good practice, or is calling a method thousands of times going to degrade performance? My Lists have 10,000-100,000 elements.

Well, for one thing you can probably make the ForEach statement more efficient using a method group conversion
List.ForEach(Method);
That's removed one level of indirection.
Personally though, I don't think it's a good idea. The first approach is more readable, and likely to perform about as well. What's the advantage of using List<T>.ForEach here?
Eric Lippert talks about this more in an excellent blog post. I would use List<T>.ForEach if you already had a delegate you wanted to execute against each element, but I wouldn't introduce a delegate and an extra method just for the sake of it.
In terms of efficiency, I wouldn't expect to see much difference. The first form may perform a little better as it doesn't have the indirection of the delegate call - but the second form may be more efficient if the iteration loop within ForEach makes use of the fact that it has access to the internal data structures of the List<T>. I very much doubt you'll notice it either way. You could try to measure it if you're really bothered, of course.
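If you do want to measure, a crude timing harness along these lines is enough to get a first impression. LoopTiming and its methods are hypothetical names, summing integers stands in for the real loop body, and a serious benchmark would add warm-up runs and repetitions.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class LoopTiming
{
    public static long SumWithForeach(List<int> items)
    {
        long sum = 0;
        foreach (int item in items) sum += item;
        return sum;
    }

    public static long SumWithForEachMethod(List<int> items)
    {
        long sum = 0;
        items.ForEach(item => sum += item); // delegate invoked once per element
        return sum;
    }

    // Times a single call of the given action.
    public static TimeSpan Time(Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling LoopTiming.Time(() => LoopTiming.SumWithForeach(items)) and the ForEach counterpart on a list of 100,000 elements gives two durations to compare; the first call to each method also includes JIT compilation time.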

If your motivation for considering the change is that the three statements in the body are too complicated, then I'd probably use ordinary foreach, but refactor the body to a method:
foreach(var item in List)
Method(item);
If the code in the body isn't complicated, then I'd agree with Jon that there is no good reason for using ForEach (it doesn't make the code more readable/declarative in any way).
I generally don't like using "LINQ-like" constructs to do imperative processing at the end of a LINQ query. I think that using foreach more clearly states that you're finished with querying data and you're doing some processing now.

I totally agree with Jon Skeet's answer. But since we are talking about ForEach performance, I have something to add to your question. Be aware that if your Statements 1 to 3 are independent of each other, that is:
foreach (var item in List)
{
    DoSomething();
    DoAnotherThing();
    DoTheLastThing();
}
The code above may perform worse than the following:
foreach (var item in List)
{
    DoSomething();
}
foreach (var item in List)
{
    DoAnotherThing();
}
foreach (var item in List)
{
    DoTheLastThing();
}
The reasoning for why the latter code, which makes two extra passes over the list, can perform better is this: when DoSomething() is called thousands of times in a row, the data it needs tends to stay warm in CPU registers and caches, so accessing it is cheap. If DoAnotherThing() is called immediately after DoSomething(), that data can be pushed out and has to be reloaded on the next iteration. Whether this outweighs the cost of the extra passes depends on the workload, so measure before splitting a loop.

I've always thought that you should write your code for readability first, because the compiler and CLR do an exceptional job at optimisation. If benchmarking shows that this code could run more quickly, then by all means have a look at other options.
E.g. for loops can be quicker than foreach(), as they use array offsets which are internally optimised in the CLR.
But List.ForEach() surely just loops over the items internally anyway, so you are only handing the work to another method rather than doing it yourself.
Strictly speaking, introducing more method calls will slow your code down on the first pass, because the CLR JIT-compiles each method the first time it is called; subsequent calls to the method do not pay that cost.
So my advice would be to stick to writing readable code, then go from there if you can prove that this is a bottleneck of the system.

Which LINQ query is more effective?

I have a huge IEnumerable (suppose it is named myItems). Which way is more efficient?
Solution 1: Filter it first, then ForEach.
Array.ForEach(myItems.Where(FILTER-IT-HERE).ToArray(),MY-ACTION);
Solution 2: Return early from MY-ACTION if the item does not pass the filter.
Array.ForEach(myItems.ToArray(),MY-ACTION-WITH-FILTER);
Is one of them always better than the other? Or are there any other good suggestions? Thanks in advance.

Did you do any measurements? Since WE can't measure the run time of My-Action then only you can. Measure and decide.

Sometimes one has to create benchmarks, because similar-looking activities can produce radically different and unexpected results.
You do not say what your data source is so I'm going to assume it may be data on an SQL server in which case filtering at the server side will likely always be the best approach because you have minimized the amount of data transfer. Memory access is always faster than data transfer from disk to memory so whenever you can transfer fewer records, you are likely to have better performance.

Well, both times, you're converting to an array, which might not be so efficient if the IEnumerable is very large (like you said). You could create a generic extension method for IEnumerable, like:
public static void ForEach<T>(this IEnumerable<T> current, Action<T> action) {
foreach (var i in current) {
action(i);
}
}
and then you could do this:
IEnumerable<int> ints = new List<int>();
ints.Where(i => i == 5).ForEach(i => Console.WriteLine(i));

If performance is a concern, it's unclear to me why you'd be bothering to construct an entire array in the first place. Why not just this?
foreach (var item in myItems.Where(FILTER-IT-HERE))
MY-ACTION;
Or:
foreach (var item in myItems)
MY-ACTION-WITH-FILTER;
I ask because, while the others are right that you can't really know without testing, I wouldn't expect there to be much difference between the above two options. I would expect there to be a difference, on the other hand, between creating/populating an array (seemingly for no reason) and not creating an array.

Everything else being equal, calling ToArray() before filtering (Solution 2) costs more than calling it after filtering (Solution 1), because the array then holds every item rather than only the matches. Although, as has been stated by others before me:
Why use ToArray() and Array.ForEach() at all?
We don't know that everything else actually is equal, since you do not reveal the implementation details of your filter and action.

The idea of LINQ is to work on enumerable collections, so the best LINQ query is the one where you don't use Array.ForEach() and .ToArray() at all.

I would say that this falls into the category of premature optimization. If, after establishing benchmarks, you find that the code is too slow, you can always try each approach and pick the result that works better for you.
Since we don't know how the IEnumerable<> is produced it's hard to say which approach will perform better. We also don't know how many items will remain after you apply your predicate - nor do we know whether the action or iteration steps are going to be the dominant factor in the execution of your code. The only way to know for sure is to try it both ways, profile the results, and pick the best.
Performance aside, I would choose the version that is most clear, which (for me) is to first filter and then apply the action to the result.

Generic list FindAll() vs. foreach

I'm looking through a generic list to find items based on a certain parameter.
In general, what would be the best and fastest implementation?
1. Looping through each item in the list and saving each match to a new list and returning that
foreach(string s in list)
{
if(s == "match")
{
newList.Add(s);
}
}
return newList;
Or
2. Using the FindAll method and passing it a delegate.
newList = list.FindAll(delegate(string s){return s == "match";});
Don't they both run in ~ O(N)? What would be the best practice here?
Regards,
Jonathan

You should definitely use the FindAll method, or the equivalent LINQ method. Also, consider using the more concise lambda instead of your delegate if you can (requires C# 3.0):
var list = new List<string>();
var newList = list.FindAll(s => s.Equals("match"));

I would use the FindAll method in this case, as it is more concise, and IMO, has easier readability.
You are right that they are pretty much going to both perform in O(N) time, although the foreach statement should be slightly faster given it doesn't have to perform a delegate invocation (delegates incur a slight overhead as opposed to directly calling methods).
I have to stress how insignificant this difference is, it's more than likely never going to make a difference unless you are doing a massive number of operations on a massive list.
As always, test to see where the bottlenecks are and act appropriately.

Jonathan,
A good answer to this can be found in chapter 5 (performance considerations) of LINQ in Action.
They measure a search that executes about 50 times and come up with foreach = 68 ms per cycle and List.FindAll = 62 ms per cycle. Really, it would probably be in your interest to just create a test and see for yourself.

List.FindAll is O(n) and will search the entire list.
If you want to run your own iterator with foreach, I'd recommend using the yield statement, and returning an IEnumerable if possible. This way, if you end up only needing one element of your collection, it will be quicker (since you can stop your caller without exhausting the entire collection).
Otherwise, stick to the BCL interface.
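A sketch of the yield-based alternative, assuming the caller may stop after the first match. LazySearch and FindLazily are hypothetical names, not BCL members.

```csharp
using System;
using System.Collections.Generic;

public static class LazySearch
{
    // Yields matches one at a time; the scan only advances while the caller
    // keeps enumerating, so stopping early skips the rest of the source.
    public static IEnumerable<T> FindLazily<T>(IEnumerable<T> source, Predicate<T> match)
    {
        foreach (T item in source)
        {
            if (match(item))
            {
                yield return item;
            }
        }
    }
}
```

A caller that breaks out of its foreach after the first result inspects only the elements up to that match, whereas List<T>.FindAll always walks the whole list and allocates a new one.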

Any perf difference is going to be extremely minor. I would suggest FindAll for clarity, or, if possible, Enumerable.Where. I prefer using the Enumerable methods because it allows for greater flexibility in refactoring the code (you don't take a dependency on List<T>).

Yes, both implementations are O(n). They need to look at every element in the list to find all matches. In terms of readability I would also prefer FindAll. For performance considerations have a look at LINQ in Action (Ch 5.3). If you are using C# 3.0 you could also apply a lambda expression. But that's just the icing on the cake:
var newList = aList.FindAll(s => s == "match");

I'm with the lambdas:
List<String> newList = list.FindAll(s => s.Equals("match"));

Unless the C# team has improved the performance for LINQ and FindAll, the following article seems to suggest that for and foreach would outperform LINQ and FindAll on object enumeration: LINQ on Objects Performance.
This article dates back to March 2009, just before this question was originally asked.
