Empty HashSet - Count vs Any


I am only interested to know whether a HashSet hs is empty or not.
I am NOT interested to know exactly how many elements it contains.
So I could use this:
bool isEmpty = (hs.Count == 0);
...or this:
bool isEmpty = !hs.Any(x => true);
Which one provides better results, performance-wise (especially when the HashSet contains a large number of elements)?

On a HashSet you can use both, since HashSet internally manages the count.
However, if your data is in an IEnumerable<T> or IQueryable<T> object, using result.Any() is preferable to result.Count() (both are LINQ methods).
Unless the source implements ICollection<T>, LINQ's .Count() iterates through the whole enumerable, whereas .Any() only checks whether at least one element exists.
Update:
Just small addition:
In your case with the HashSet, .Count may be preferable, as .Any() would require an IEnumerator to be created and returned, which is a small overhead if you are not going to use the enumerator anywhere else in your code (foreach, LINQ, etc.). But I think that would be considered a micro-optimization.
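To make the difference on a lazy sequence observable, here is a minimal sketch. The names EmptyCheckDemo, LazyNumbers and Pulled are hypothetical helpers, not part of any library; the sequence is deliberately not an ICollection<T>, so LINQ has no stored count to fall back on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EmptyCheckDemo
{
    // Number of elements the lazy sequence has handed out so far.
    public static int Pulled;

    // Hypothetical lazy sequence; it is an iterator, not an ICollection<T>.
    public static IEnumerable<int> LazyNumbers(int n)
    {
        for (int i = 0; i < n; i++)
        {
            Pulled++;
            yield return i;
        }
    }

    // Returns how many elements each emptiness check pulled from the sequence.
    public static (int ByAny, int ByCount) Measure(int n)
    {
        Pulled = 0;
        _ = !LazyNumbers(n).Any();          // stops after the first element
        int byAny = Pulled;

        Pulled = 0;
        _ = LazyNumbers(n).Count() == 0;    // walks the whole sequence
        int byCount = Pulled;

        return (byAny, byCount);
    }
}
```

For a HashSet<T> neither cost applies to hs.Count == 0, since the property just reads a stored value.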

HashSet<T> implements ICollection<T>, which has a Count property, so a call to Count() will just call HashSet<T>.Count, which I'm assuming is an O(1) operation (meaning it doesn't actually have to count; it just returns the current size of the HashSet).
Any will iterate until it finds an item that matches the condition, then stop.
So in your case, it will just iterate one item, then stop, so the difference will probably be negligible.
If you had a filter that you wanted to apply (e.g. x => x.IsValid) then Any would definitely be faster, since Count(x => x.IsValid) would iterate over the entire collection, while Any would stop as soon as it finds a match.
For those reasons I generally prefer to use Any() rather than Count()==0 since it's more direct and avoids any potential performance problems. I would only switch to Count()==0 if it provided a significant performance boost over Any().
Note that Any(x=>true) is logically the same as calling Any(). That doesn't change your question, but it looks cleaner without the lambda.
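The filtered case can be checked by counting predicate invocations. This is only a sketch; PredicateDemo and Measure are made-up names, and the isValid delegate stands in for something like x => x.IsValid.

```csharp
using System;
using System.Linq;

public static class PredicateDemo
{
    // Returns how many times the predicate ran for Any(pred) and for Count(pred) > 0.
    public static (int AnyCalls, int CountCalls) Measure(int[] items, Func<int, bool> isValid)
    {
        int calls = 0;
        Func<int, bool> counted = x => { calls++; return isValid(x); };

        bool viaAny = items.Any(counted);          // stops at the first match
        int anyCalls = calls;

        calls = 0;
        bool viaCount = items.Count(counted) > 0;  // tests every element
        int countCalls = calls;

        if (viaAny != viaCount)
            throw new InvalidOperationException("Both checks should agree.");
        return (anyCalls, countCalls);
    }
}
```

With the items 1 to 6 and an "is even" predicate, Any runs the predicate twice and Count runs it six times.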

Depending on the type of collection, it may or may not matter performance-wise. So why not just use hs.Any() since that is designed for exactly what you need to know?
And the lambda expression x => true has no meaning here. You can leave that out.

Related

What's the quickest way to check the size of an IEnumerable is greater than some given value?

I know that you can use enumerable.Any() instead of enumerable.Count() to check if the collection has anything in it efficiently.
Is there an equivalent to check the size is at least any larger size?
For example, how would I efficiently do enumerable.Count() > 3.

The most efficient approach will unfortunately depend on the implementation. It's a leaky abstraction at that point.
If you're using a List<T> or similar, using Count() will be fastest. But for any lazily-evaluated sequence, that will evaluate the whole sequence.
For a lazily-evaluated sequence, using enumerable.Skip(3).Any() will be more efficient, because it can stop once it's found the fourth element. That's all you need to know about; you don't care about the actual size.
Using the Skip()/Any() approach will be slightly less efficient than using Count() for some collections - but could be much more efficient for large lazy sequences. (It will also work even for infinite sequences, which Count() wouldn't.)
The difference in efficiency for lists will depend on how many items you're skipping, of course - if you need to see whether there are "at least a million" items then using Count() would be much more efficient for a list.
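A small sketch of the Skip()/Any() idea on an unbounded sequence. AtLeastDemo, Naturals and HasMoreThan are hypothetical names introduced only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class AtLeastDemo
{
    // Hypothetical infinite lazy sequence: 0, 1, 2, ...
    public static IEnumerable<int> Naturals()
    {
        int i = 0;
        while (true)
        {
            yield return i++;
        }
    }

    // True when the sequence has more than n elements; reads at most n + 1 of them.
    public static bool HasMoreThan<T>(IEnumerable<T> source, int n) =>
        source.Skip(n).Any();
}
```

AtLeastDemo.HasMoreThan(AtLeastDemo.Naturals(), 3) returns, whereas AtLeastDemo.Naturals().Count() > 3 would never finish.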
Sorry not to have an easy answer for you. If you really need this to be optimal in every case, you could perform the same kinds of optimization that the Count() method does. Something like this:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// The wrapper class name is arbitrary; extension methods just need a static class.
public static class EnumerableExtensions
{
    // FIXME: This name is horrible! Note that you'd call it with 4 in your case,
    // as it's inclusive of minCount.
    // Note this assumes C# 8 and its lovely switch expression support.
    // It could be written with if/else etc of course.
    public static bool HasAtMinElements<T>(this IEnumerable<T> source, int minCount) =>
        source switch
        {
            null => throw new ArgumentNullException(nameof(source)),
            ICollection<T> coll => coll.Count >= minCount,
            ICollection coll => coll.Count >= minCount,
            _ => source.Skip(minCount - 1).Any()
        };
}
That's annoying though :( Note that it doesn't optimize IIListProvider<T> like the real Count() method does, either - because that's internal.

The Enumerable.Count method is Microsoft's recommended way to return the number of elements in a sequence, which is what you are already doing, and it is the best option as far as I can see.

Is there a "correct" way between these two statements that filter and return a boolean using LINQ to Objects? [duplicate]

Possible Duplicate:
LINQ extension methods - Any() vs. Where() vs. Exists()
Given a list of objects in memory I ran the following two expressions:
myList.where(x => x.Name == "bla").Any()
vs
myList.Any(x => x.Name == "bla")
The latter was always faster. I believe this is due to the Where enumerating all items. But this also happens when there are no matches.
I'm not sure of the exact WHY though. Are there any cases where this observed performance difference wouldn't apply, like if it was querying NHibernate?
Cheers.

Any() with a predicate can perform its task without an iterator (yield return). Using Where() creates an iterator, which has a performance impact (albeit a very small one).
Thus, performance-wise (by a bit), you're better off using the form of Any() that takes the predicate (x => x.Name == "bla"). Which, personally, I find more readable as well...
On a side note, Where() does not necessarily enumerate over all elements, it just creates an iterator that will travel over the elements as they are requested, thus the call to Any() after the Where() will drive the iteration, which will stop at the first item it finds that matches the condition.
So the performance difference is not that Where() iterates over all the items (in linq-to-objects) because it really doesn't need to (unless, of course, it doesn't find one that satisfies it), it's that the Where() clause has to set up an iterator to walk over the elements, whereas Any() with a predicate does not.
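The claim that Where() does not walk the whole list can be checked by counting predicate calls. A rough sketch follows; WhereAnyDemo and Measure are hypothetical names, and strings stand in for the objects with a Name property from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhereAnyDemo
{
    // Returns how many times the predicate ran for Where(pred).Any() and for Any(pred).
    public static (int WhereAnyCalls, int AnyCalls) Measure(List<string> names, string target)
    {
        int calls = 0;
        Func<string, bool> matches = n => { calls++; return n == target; };

        _ = names.Where(matches).Any();  // Any() drives the Where iterator until the first match
        int whereAnyCalls = calls;

        calls = 0;
        _ = names.Any(matches);          // same stopping point, no intermediate iterator
        int anyCalls = calls;

        return (whereAnyCalls, anyCalls);
    }
}
```

Both forms evaluate the predicate the same number of times: up to the first match, or once per element when nothing matches. Any remaining difference is the cost of the extra iterator object.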

Assuming you correct where to Where and = to ==, I'd expect the "Any with a predicate" version to execute very slightly faster. However, I would expect the situations in which the difference was significant to be few and far between, so you should aim for readability first.
As it happens, I would normally prefer the "Any with a predicate" version in terms of readability too, so you win on both fronts - but you should really go with what you find more readable first. Measure the performance in scenarios you actually care about, and if a section of code isn't performing as you need it to, then consider micro-optimizing it - measuring at every step, of course.

I believe this is due to the Where enumerating all items.
If myList is a collection in memory, it doesn't. The Where method uses deferred execution, so it will only enumerate as many items as needed to determine the result. In that case you would not see any significant difference between .Any(...) and .Where(...).Any().
Are there any cases where this viewed performance difference wouldn't
be the case, like if it was querying Nhib?
Yes, if myList is a data source that will take the expression generated by the methods and translate to a query to run elsewhere (e.g. LINQ To SQL), you may see a difference. The code that translates the expression simply does a better job at translating one of the expressions.

I have read that it is bad practice to iterate over a HashSet. Should I be calling .ToList() on it first?

I have a collection of items called RegisteredItems. I do not care about the order of the items in RegisteredItems, only that they exist.
I perform two types of operations on RegisteredItems:
Find and return item by property.
Iterate over collection and have side-effect.
According to: When should I use the HashSet<T> type? Robert R. says,
"It's somewhat dangerous to iterate over a HashSet because doing so
imposes an order on the items in the set. That order is not really a
property of the set. You should not rely on it. If ordering of the
items in a collection is important to you, that collection isn't a
set."
There are some scenarios where my collection would contain 50-100 items. I realize this is not a large amount of items, but I was still hoping to reap the rewards of using a HashSet instead of List.
I have found myself looking at the following code and wondering what to do:
LayoutManager.Instance.RegisteredItems.ToList().ForEach( item => item.DoStuff() );
vs
foreach( var item in LayoutManager.Instance.RegisteredItems)
{
item.DoStuff();
}
RegisteredItems used to return an IList<T>, but now it returns a HashSet. I felt that, if I was using HashSet for efficiency, it would be improper to cast it as a List. Yet, the above quote from Robert leaves me feeling uneasy about iterating over it, as well.
What's the right call in this scenario? Thanks

If you don't care about order, use a HashSet<>. The quote is about using HashSet<> being dangerous when you're worried about order. If you run this code multiple times, and the items are operated on in different order, will you care? If not, then you're fine. If yes, then don't use a HashSet<>. Arbitrarily converting to a List first doesn't really solve the problem.
And I'm not certain, but I suspect that .ToList() will iterate over the HashSet<> to do that, so, now you're walking the collection twice.
Don't prematurely optimize. If you only have 100 items, just use a HashSet<> and move on. If you start caring about order, change it to a List<> then and use it as a list everywhere.

If you really don't care about order, and you know that you can't have duplicates in your HashSet (and that is what you want), go ahead and use a HashSet.

In the quoted question, I think he's saying that if you iterate over a Set, you can easily trick yourself into thinking that the items are in a certain order. For example, it'd be easy to treat the first iterated item differently, but you aren't guaranteed that will remain the first iterated item.
As long as you keep this in mind, and consider the Set unordered, iterating over it is fine.
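As an illustration of that advice, here is a sketch with made-up names (SetIterationDemo, SumAll, InAscendingOrder): the first method has a side effect that does not depend on iteration order, and the second imposes an order explicitly on the occasions where one is needed.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SetIterationDemo
{
    // Order-independent work: the total is the same whatever order the set yields.
    public static int SumAll(HashSet<int> items)
    {
        int total = 0;
        foreach (int item in items)
        {
            total += item;
        }
        return total;
    }

    // When an order matters, state it explicitly instead of relying on the set.
    public static List<int> InAscendingOrder(HashSet<int> items) =>
        items.OrderBy(x => x).ToList();
}
```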

space complexity of a simple linq(to objects) query

I have;
var maxVal = l.TakeWhile(x=>x < val).Where(x=>Matches(x)).Max();
How much space does this need ? Does linq build up a list of the above Where() condition, or is Max() just iterating through the IEnumerable keeping track of what is the current Max() ?
And where can I find more info about this, besides asking on SO f

I have verified with Reflector that each of Enumerable.TakeWhile, Enumerable.Where and Enumerable.Max run in constant space. Consequently, this entire query should run in constant space.
Not surprising, considering TakeWhile and Where are specified to use deferred execution and streaming.
Max does not use deferred execution, but only needs to store 'max so far' and the enumerator on the source enumerable.
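To see what state the query actually holds, it can be compared with a hand-written loop. This is a sketch only: StreamingMaxDemo and MaxBelowManual are hypothetical names, and Matches is a stand-in for the question's method, not the real LINQ source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreamingMaxDemo
{
    // Hypothetical stand-in for the Matches method from the question.
    public static bool Matches(int x) => x % 7 == 0;

    // The query from the question, over any IEnumerable<int>.
    public static int MaxBelow(IEnumerable<int> source, int val) =>
        source.TakeWhile(x => x < val).Where(x => Matches(x)).Max();

    // Hand-written equivalent: the only state is the enumerator and one running maximum.
    public static int MaxBelowManual(IEnumerable<int> source, int val)
    {
        bool found = false;
        int max = 0;
        foreach (int x in source)
        {
            if (!(x < val)) break;        // TakeWhile
            if (!Matches(x)) continue;    // Where
            if (!found || x > max)        // Max
            {
                max = x;
                found = true;
            }
        }
        if (!found) throw new InvalidOperationException("No matching elements.");
        return max;
    }
}
```

Neither version builds an intermediate list, so the source can be arbitrarily long, for example Enumerable.Range(0, 1000000).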

According to Reflector, the Max() method iterates through the enumerable.
And where can I find more info about this, besides asking on SO f
You can use Reflector to look at the implementation of any .NET assembly.

The only thing offered by Enumerable that I've found doesn't run in constant space is ToList(), for obvious reasons.
With some collections this is inefficient: you already have space complexity above constant (typically O(n), as you are storing the items), and the collection in question may offer a mechanism with lower time complexity. If you are creating such a collection yourself, it makes sense to offer your own versions of the extensions offered by Enumerable. For example, if you have a collection that is inherently sorted, you should be able to offer Min() and Max() in better than O(n) time (whether it is O(1), O(log n) or something else depends on how the sorting is maintained). Since instance methods take precedence over extension methods (when called on an expression whose static type is your collection type rather than an interface such as IEnumerable<T>), you will offer better efficiency with no difference to the coder using your object.
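A minimal sketch of that idea, using a hypothetical SortedInts collection that keeps its items in ascending order and exposes its own Min() and Max(). The class is an assumption made for illustration, not an existing type.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public sealed class SortedInts : IEnumerable<int>
{
    private readonly int[] _items; // always kept in ascending order

    public SortedInts(IEnumerable<int> items)
    {
        _items = items.OrderBy(x => x).ToArray();
    }

    // O(1) instance methods. The compiler binds to these instead of
    // Enumerable.Min/Max when the static type of the expression is SortedInts.
    public int Min() =>
        _items.Length > 0 ? _items[0] : throw new InvalidOperationException("Empty.");

    public int Max() =>
        _items.Length > 0 ? _items[_items.Length - 1] : throw new InvalidOperationException("Empty.");

    public IEnumerator<int> GetEnumerator() => ((IEnumerable<int>)_items).GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

A caller writing sorted.Max() gets the O(1) version; once the value is held in an IEnumerable<int> variable, the same call binds to the O(n) LINQ extension and still returns the same result.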

Reflector is your friend.
In particular, you can take a look at Linq to Objects extension methods in the Enumerable class in System.Linq.
The above use iteration, so they use whatever space the enumerators take up, usually O(1). Max() is O(1) space.
However, keep in mind that nothing stops a developer from writing an enumerator that takes up more than constant space. For example, traversing a tree may require O(log n) space. This is the case for SortedDictionary<TKey,TValue> and SortedSet<T>, for example.
So it partially depends on what l is in your code.

In-memory LINQ performance

More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or searching/filtering extension methods) works in objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query complexity at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n) unless list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it should take O(n), even if list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?

Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
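One way such an explicitly named method might look, as a sketch under the assumption that the list is sorted ascending. SortedListExtensions and IndexOfFirstGreaterThanSorted are invented names, and plain int values replace the o.something property from the question.

```csharp
using System.Collections.Generic;

public static class SortedListExtensions
{
    // Returns the index of the first element greater than value, or -1 if there is none.
    // Assumes sorted is in ascending order; the method name states that requirement.
    // Runs in O(log n) by binary search instead of enumerating.
    public static int IndexOfFirstGreaterThanSorted(this IReadOnlyList<int> sorted, int value)
    {
        int lo = 0;
        int hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] > value)
            {
                hi = mid;       // candidate found; look for an earlier one
            }
            else
            {
                lo = mid + 1;   // everything up to mid is too small
            }
        }
        return lo < sorted.Count ? lo : -1;
    }
}
```

It is called with dot notation like any LINQ operator, for example list.IndexOfFirstGreaterThanSorted(n), but the name makes the sortedness precondition visible at the call site.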

Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods have a special case for when the object implementing IEnumerable<T> also implements, for example, ICollection<T>. (I've seen this for Enumerable.Contains at least.)
In practice this means that LINQ's Enumerable.Contains calls the fast HashSet<T>.Contains if the IEnumerable<T> is actually a HashSet<T>.
IEnumerable<int> mySet = new HashSet<int>();
// Calls the fast HashSet<int>.Contains because HashSet<int> implements ICollection<int>.
if (mySet.Contains(10)) { /* code */ }
You can use Reflector to check exactly how the LINQ methods are defined; that is how I figured this out.
Oh, and LINQ also contains the methods Enumerable.ToDictionary (maps a key to a single value) and Enumerable.ToLookup (maps a key to multiple values). The dictionary or lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
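A short sketch of building such tables once and reusing them. LookupDemo and the (Id, Customer) tuple shape are assumptions chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LookupDemo
{
    // One key, many values: all order ids for each customer.
    public static ILookup<string, int> IdsByCustomer(IEnumerable<(int Id, string Customer)> orders) =>
        orders.ToLookup(o => o.Customer, o => o.Id);

    // One key, one value: the customer for each order id (ids must be unique).
    public static Dictionary<int, string> CustomerById(IEnumerable<(int Id, string Customer)> orders) =>
        orders.ToDictionary(o => o.Id, o => o.Customer);
}
```

After the one-off O(n) build, each lookup by key is a hash lookup rather than a scan with Where or First. Indexing a lookup with a missing key yields an empty sequence, while indexing a dictionary with a missing key throws.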

Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.
