space complexity of a simple linq(to objects) query

space complexity of a simple linq(to objects) query - c#

I have;
var maxVal = l.TakeWhile(x=>x < val).Where(x=>Matches(x)).Max();
How much space does this need ? Does linq build up a list of the above Where() condition, or is Max() just iterating through the IEnumerable keeping track of what is the current Max() ?
And where can I find more info about this, besides asking on SO f

I have verified with Reflector that each of Enumerable.TakeWhile, Enumerable.Where and Enumerable.Max run in constant space. Consequently, this entire query should run in constant space.
Not surprising, considering TakeWhile and Where are speced to use deferred execution + streaming.
Max does not use deferred execution, but only needs to store 'max so far' and the enumerator on the source enumerable.

According to the Reflector Max() method iterates through the enumerable.
And where can I find more info about this, besides asking on SO f
You can use Reflector to look at the implementation of any .NET assembly.

The only thing offered by Enumerable that I've found doesn't run in constant space is ToList(), for obvious reasons.
With some enumerations, this is inefficient, in that you already have a space complexity above constant (typically O(n) as you are storing the items) and that the collection in question offers a mechanism with lower time complexity. If you are creating such a collection yourself it makes sense to offer your own versions of the extensions offered by Enumerable. For example, if you have a collection that is inherently sorted you should be able to offer Min() and Max() in better than O(n) complexity (whether it is O(1), O(ln) or something else would depend on what way that sorting was kept). Since instance methods override extension methods (when called on an expression of the object type rather than the instance type) then with no difference to the coder using your object, you will offer better efficiency.

Reflector is your friend.
In particular, you can take a look at Linq to Objects extension methods in the Enumerable class in System.Linq.
The above are using iterations, so they use the whatever space the enumerators takes up - usually O(1). Max() is O(1) space.
However, keep in mind that nothing stops a developer from writing an enumerator that takes up more than constant space. E.g. traversing a tree may require O(log n) space. This is the case e.g. for SortedDictionary<K,V> and SortedSet<K,V>.
So it partially depends on what l is in your code.

Related

What's the quickest way to check the size of an IEnumerable is greater than some given value?

I know that you can use enumerable.Any() instead of enumerable.Count() to check if the collection has anything in it efficiently.
Is there an equivalent to check the size is at least any larger size?
For example, how would I efficiently do enumerable.Count() > 3.

The most efficient approach will unfortunately depend on the implementation. It's a leaky abstraction at that point.
If you're using a List<T> or similar, using Count() will be fastest. But for any lazily-evaluated sequence, that will evaluate the whole sequence.
For a lazily-evaluated sequence, using enumerable.Skip(3).Any() will be more efficient, because it can stop once it's found the fourth element. That's all you need to know about; you don't care about the actual size.
Using the Skip()/Any() approach will be slightly less efficient than using Count() for some collections - but could be much more efficient for large lazy sequences. (It will also work even for infinite sequences, which Count() wouldn't.)
The difference in efficiency for lists will depend on how many items you're skipping, of course - if you need to see whether there are "at least a million" items then using Count() would be much more efficient for a list.
Sorry not to have an easy answer for you. If you really need this to be optimal in every case, you could perform the same kinds of optimization that the Count() method does. Something like this:
// FIXME: This name is horrible! Note that you'd call it with 4 in your case,
// as it's inclusive of minCount.
// Note this assumes C# 8 and its lovely switch expression support.
// It could be written with if/else etc of course.
public static bool HasAtMinElements<T>(this IEnumerable<T> source, int minCount) =>
source switch
{
null => throw new ArgumentNullException(nameof(source)),
ICollection<TSource> coll => coll.Count >= minCount,
ICollection coll => coll.Count >= minCount,
_ => source.Skip(minCount - 1).Any();
}
That's annoying though :( Note that it doesn't optimize IIListProvider<T> like the real Count() method does, either - because that's internal.

Enumerable.Count Method is the Microsoft's recommended way to return the number of elements in a sequence, which is what you are already doing and it is the best option as far as I see.

Empty HashSet - Count vs Any

I am only interested to know whether a HashSet hs is empty or not.
I am NOT interested to know exactly how many elements it contains.
So I could use this:
bool isEmpty = (hs.Count == 0);
...or this:
bool isEmpty = hs.Any(x=>true);
Which one provides better results, performance-wise(specially when the HashSet contains a large number of elements) ?

On a HashSet you can use both, since HashSet internally manages the count.
However, if your data is in an IEnumerable<T> or IQueryable<T> object, using result.Any() is preferable over result.Count() (Both Linq Methods).
Linq's .Count() will iterate through the whole Enumerable, .Any() will only peek if any objects exists within the Enumerable or not.
Update:
Just small addition:
In your case with the HashSet .Count may be preferable as .Any() would require an IEmumerator to be created and returned which is a small overhead if you are not going to use the Enumerator anywhere in your code (foreach, Linq, etc.). But I think that would be considered "Micro optimization".

HastSet<T> implements ICollection<T>, which has a Count property, so a call to Count() will just call HastSet<T>.Count, which I'm assuming is an O(1) operation (meaning it doesn't actually have to count - it just returns the current size of the HashSet).
Any will iterate until it finds an item that matches the condition, then stop.
So in your case, it will just iterate one item, then stop, so the difference will probably be negligible.
If you had a filter that you wanted to apply (e.g. x => x.IsValid) then Any would definitely be faster since Count(x => x.IsValid) would iterate over the entire collection, while Any would stop as soon as if finds a match.
For those reasons I generally prefer to use Any() rather than Count()==0 since it's more direct and avoids any potential performance problems. I would only switch to Count()==0 if it provided a significant performance boost over Any().
Note that Any(x=>true) is logically the same as calling Any(). That doesn't change your question, but it looks cleaner without the lambda.

Depending on the type of collection, it may or may not matter performance-wise. So why not just use hs.Any() since that is designed for exactly what you need to know?
And the lambda expression x => true has no meaning here. You can leave that out.

Generic list FindAll() vs. foreach

I'm looking through a generic list to find items based on a certain parameter.
In General, what would be the best and fastest implementation?
1. Looping through each item in the list and saving each match to a new list and returning that
foreach(string s in list)
{
if(s == "match")
{
newList.Add(s);
}
}
return newList;
Or
2. Using the FindAll method and passing it a delegate.
newList = list.FindAll(delegate(string s){return s == "match";});
Don't they both run in ~ O(N)? What would be the best practice here?
Regards,
Jonathan

You should definitely use the FindAll method, or the equivalent LINQ method. Also, consider using the more concise lambda instead of your delegate if you can (requires C# 3.0):
var list = new List<string>();
var newList = list.FindAll(s => s.Equals("match"));

I would use the FindAll method in this case, as it is more concise, and IMO, has easier readability.
You are right that they are pretty much going to both perform in O(N) time, although the foreach statement should be slightly faster given it doesn't have to perform a delegate invocation (delegates incur a slight overhead as opposed to directly calling methods).
I have to stress how insignificant this difference is, it's more than likely never going to make a difference unless you are doing a massive number of operations on a massive list.
As always, test to see where the bottlenecks are and act appropriately.

Jonathan,
A good answer you can find to this is in chapter 5 (performance considerations) of Linq To Action.
They measure a for each search that executes about 50 times and that comes up with foreach = 68ms per cycle / List.FindAll = 62ms per cycle. Really, it would probably be in your interest to just create a test and see for yourself.

List.FindAll is O(n) and will search the entire list.
If you want to run your own iterator with foreach, I'd recommend using the yield statement, and returning an IEnumerable if possible. This way, if you end up only needing one element of your collection, it will be quicker (since you can stop your caller without exhausting the entire collection).
Otherwise, stick to the BCL interface.

Any perf difference is going to be extremely minor. I would suggest FindAll for clarity, or, if possible, Enumerable.Where. I prefer using the Enumerable methods because it allows for greater flexibility in refactoring the code (you don't take a dependency on List<T>).

Yes, they both implementations are O(n). They need to look at every element in the list to find all matches. In terms of readability I would also prefer FindAll. For performance considerations have a look at LINQ in Action (Ch 5.3). If you are using C# 3.0 you could also apply a lambda expression. But that's just the icing on the cake:
var newList = aList.FindAll(s => s == "match");

Im with the Lambdas
List<String> newList = list.FindAll(s => s.Equals("match"));

Unless the C# team has improved the performance for LINQ and FindAll, the following article seems to suggest that for and foreach would outperform LINQ and FindAll on object enumeration: LINQ on Objects Performance.
This artilce was dated back to March 2009, just before this post originally asked.

Implementation of Red-Black Tree in C#

I'm looking for an implementation of a Red-Black Tree in C#, with the following features:
Search, Insert and Delete in O(log n).
Members type should be generic.
Support in Comparer(T), for sorting T by different fields in it.
Searching in the tree should be with the specific field, so it won't accept T, but it'll accept the field type sorting it.
Searching shouldn't be only exact value. Should support searching the lower/higher one.
Thank you.

You mostly just described SortedDictionary<T, U>, except for the next-lowest/next-highest value binary search, which you could implement on your own without much difficulty.
Are there specific reasons that SortedDictionary is insufficient for you?

Rip the TreeSet from C5 collection libs.

This is exactly the OrderedDictionary in PowerCollections. It's pretty much identical to SortedDictionary (red black tree with generics) with the addition of the ability to set a start key/end key and scan all values in that range.
SortedDicionary only allows exposes a GetEnumerator() function that starts at the beginning of the collection and only allows a MoveNext() call, so even if you use LINQ there is nothing magic happening: it starts at the beginning and runs your expression on every single node, in order, until it finds those matching your LINQ expression.
OrderedDictionary has a function that gets an enumerator at or before a particular key and that does the lookup in O(log n).
A word of caution though: the enumerator in the PowerCollections OrderedDictionary is implemented using "yield" and the memory usage and enumeration performance is at least O(n^2)... you can change the implementation yourself to make it implement a traditional enumerator and both of these problems go away. I'll submit that patch to Codeplex if I can ever find the time.

In-memory LINQ performance

More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or searching/filtering extension methods) works in objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query complexity at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n) unless list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, If I understand correctly, this query will be resolved through enumeration, so it should take O(n), even in list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?

Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.

Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods special case for when the object implementing IEnumerable actually implements e.g. ICollection. (I've seen this for IEnumerable.Contains at least.)
In practice this means that LINQ IEnumerable.Contains calls the fast HashSet.Contains for example if the IEnumerable actually is a HashSet.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet.Contains because HashSet implements ICollection.
if (mySet.Contains(10)) { /* code */ }
You can use reflector to check exactly how the LINQ methods are defined, that is how I figured this out.
Oh, and also LINQ contains methods IEnumerable.ToDictionary (maps key to single value) and IEnumerable.ToLookup (maps key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.

Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

space complexity of a simple linq(to objects) query - c#

According to the Reflector Max() method iterates through the enumerable. And where can I find more info about this, besides asking on SO f You can use Reflector to look at the implementation of any .NET assembly.

Related

What's the quickest way to check the size of an IEnumerable is greater than some given value?

Empty HashSet - Count vs Any

Generic list FindAll() vs. foreach

Implementation of Red-Black Tree in C#

In-memory LINQ performance

Categories

Resources