Why are ToLookup and GroupBy different? - c#

.ToLookup<TSource, TKey> returns an ILookup<TKey, TSource>. ILookup<TKey, TSource> also implements IEnumerable<IGrouping<TKey, TSource>>.
.GroupBy<TSource, TKey> returns an IEnumerable<IGrouping<TKey, TSource>>.
ILookup has the handy indexer property, so it can be used in a dictionary-like (or lookup-like) manner, whereas the result of GroupBy can't. Without the indexer, GroupBy is a pain to work with; pretty much the only way you can then reference the returned object is by looping through it (or using another LINQ extension method). In other words, in any case where GroupBy works, ToLookup will work as well.
All this leaves me with the question why would I ever bother with GroupBy? Why should it exist?

why would I ever bother with GroupBy? Why should it exist?
What happens when you call ToLookup on an object representing a remote database table with a billion rows in it?
The billion rows are sent over the wire, and you build the lookup table locally.
What happens when you call GroupBy on such an object?
A query object is built; end of story.
When that query object is enumerated then the analysis of the table is done on the database server and the grouped results are sent back on demand a few at a time.
Logically they are the same thing but the performance implications of each are completely different. Calling ToLookup means I want a cache of the entire thing right now organized by group. Calling GroupBy means "I am building an object to represent the question 'what would these things look like if I organized them by group?'"

In simple LINQ-world words:
ToLookup() - immediate execution
GroupBy() - deferred execution
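
A minimal LINQ-to-Objects sketch of that difference (the word list is made up for illustration):

// ToLookup executes immediately: the lookup is fully built at this line.
var words = new List<string> { "apple", "banana", "avocado" };
ILookup<char, string> lookup = words.ToLookup(w => w[0]);
words.Add("blueberry");                  // too late; not in the lookup
Console.WriteLine(lookup['b'].Count());  // 1 ("banana" only)

// GroupBy is deferred: grouping happens each time the query is enumerated.
IEnumerable<IGrouping<char, string>> groups = words.GroupBy(w => w[0]);
words.Add("cherry");                     // still visible to the query
Console.WriteLine(groups.Count());       // 3 groups: 'a', 'b', 'c'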

The two are similar, but are used in different scenarios. .ToLookup() returns a ready-to-use object with all the groups eagerly loaded. On the other hand, .GroupBy() returns a lazily evaluated sequence of groups.
Different LINQ providers may have different behaviors for the eager and lazy loading of the groups. With LINQ-to-Objects it probably makes little difference, but with LINQ-to-SQL (or LINQ-to-EF, etc.) the grouping operation is performed on the database server rather than the client, so you may want to apply additional filtering on the group key (which translates to a HAVING clause) and then fetch only some of the groups instead of all of them. .ToLookup() doesn't allow for such semantics, since all items are eagerly grouped.
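
For example, a hedged LINQ-to-SQL-style sketch (db, Orders and CustomerId are hypothetical names) where the filter on the group key stays on the server:

// The Where over the groups translates to HAVING COUNT(*) > 10,
// so only the qualifying groups ever cross the wire.
var bigCustomers = db.Orders
                     .GroupBy(o => o.CustomerId)
                     .Where(g => g.Count() > 10)
                     .Select(g => new { g.Key, Total = g.Count() });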

Related

Linq on DateTime Collections Best Performance

My app has a Collection of Job objects. There is a unique key property called Jobnumber and a DateTime property called Bookingtime, which is not necessarily unique. There are various other properties as well.
I want to do a lot of linq queries based on the Bookingtime property and occasional inserts and removal of objects from the collection.
If I have 1000 to 2000 objects in a collection should I use a SortedList<TKey, TValue> or just a List<T> and order it manually with linq?
Do the options change for 10,000 objects in the collection?
The objects are from database and are already sorted by Bookingtime but I need to work on certain datetime subsets.
DateTime t1, t2; // equal some values
var subSet = jobs.Where(a => a.Bookingtime >= t1 &&
                             a.Bookingtime <= t2).ToList();
As one can see in the documentation, the SortedList.Add operation (too bad there is no AddAll method, as in Java, that could optimize bulk insertion) runs in O(n) per insertion, whereas OrderBy runs in O(n log n) for the whole sequence. The implication is that only for small (or already sorted) lists can a SortedList outperform LINQ.
Furthermore, notice that LINQ uses lazy evaluation: it will only sort the items if you actually need the resulting list (or force it with ToList() or a similar method). If you never do anything with the result, the environment won't even sort the data.
This article even implements a truly lazy OrderBy, such that if you only need the first i items, it will not sort the entire list.
EDIT: based on your updated question, you would be better off incorporating the .Where filter as a WHERE clause in the SQL query. This can reduce network, memory and CPU usage, since a database in many cases can optimize queries enormously.
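
Since the jobs arrive already sorted by Bookingtime, a binary search over the list also bounds the range in O(log n) instead of scanning with Where. A sketch, assuming jobs is a List<Job> and Job exposes the Bookingtime property from the question:

// First index whose Bookingtime is >= t (lower bound).
static int FirstAtOrAfter(List<Job> jobs, DateTime t)
{
    int lo = 0, hi = jobs.Count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (jobs[mid].Bookingtime < t) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// First index whose Bookingtime is > t (upper bound).
static int FirstAfter(List<Job> jobs, DateTime t)
{
    int lo = 0, hi = jobs.Count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (jobs[mid].Bookingtime <= t) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// O(log n) range query instead of an O(n) scan:
int from = FirstAtOrAfter(jobs, t1);
int to = FirstAfter(jobs, t2);                 // inclusive upper bound
var subSet = jobs.GetRange(from, to - from);   // Bookingtime in [t1, t2]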

Fastest way to check whether a single element in common exists between two enumerables

I have a method I'm writing where I want to be able to filter orders based on whether they have one or more ordered products in them that exist in the selection of products made by the user. Currently I'm doing this with:
SelectedProductIDs.Intersect(orderProductIDs).Any()
executed on each order (~20,000 orders total in the database and expected to grow quickly), where both SelectedProducts and orderProductIDs are string[]. I've also attempted to use pre-generated HashSets for both SelectedProductIDs and orderProductIDs, but this made no appreciable difference in the speed of comparison.
However, both of these are unpleasantly slow - ~300ms per selection change - particularly given that the dates made available to the sliders within the UI are predicated entirely on the results of this query, so user interaction has to halt in some fashion. Is there a (very) significantly faster way to do this?
Edit: May not have been clear enough - order objects are materialized from SQL data at launch-time and these queries are performed later, in a secondary window of the overall application. SQL is irrelevant to the specifics of this question; this is a LINQ-to-Objects question.
The LINQ Intersect is going to construct a new HashSet based on the input no matter what you do, even if the input is already a HashSet. Its implementation mutates that hash set internally (which is how it avoids yielding duplicate values), so it is important for it to make a copy of the input sequence, even if it's already a HashSet.
You can create your own Intersect method that accepts a hash set instead of populating a new one. To avoid mutating it, though, you'll have to settle for a bag-based Intersect rather than a set-based Intersect (i.e., duplicates in the sequence will all be yielded). Clearly that's not a problem in your case:
public static IEnumerable<T> IntersectAll<T>(
    this HashSet<T> set, IEnumerable<T> sequence)
{
    // Yields every item of the sequence that is in the set. The set is
    // never copied or mutated, so it can be built once and reused.
    foreach (var item in sequence)
        if (set.Contains(item))
            yield return item;
}
Now you can write:
SelectedProductIDs.IntersectAll(orderProductIDs).Any();
And the hashset won't need to be re-constructed each time.
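
To tie it together, a hedged sketch of the caller (orders and a hypothetical o.ProductIDs property stand in for the question's actual order objects): build the set once per selection change, then reuse it for every order:

var selected = new HashSet<string>(SelectedProductIDs);  // built once per selection change
var matching = orders.Where(o => selected.IntersectAll(o.ProductIDs).Any());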
It sounds like you are reading all the values from the database into memory and then querying. If you instead use LINQ to EF, it will translate the LINQ query into a SQL query that gets run on the database, which could be significantly faster.

When to force LINQ query evaluation?

What's the accepted practice on forcing evaluation of LINQ queries with methods like ToArray(), and are there general heuristics for composing optimal chains of queries? I often try to do everything in a single pass, because I've noticed in those instances that AsParallel() does a really good job of speeding up the computation. In cases where the queries perform computations with no side effects, but several passes are required to get the right data out, is forcing the computation with ToArray() the right way to go, or is it better to leave the query in lazy form?
If you are not averse to using an 'experimental' library, you could use the EnumerableEx.Memoize extension method from the Interactive Extensions library.
This method provides a best-of-both-worlds option where the underlying sequence is computed on demand, but is not re-computed on subsequent passes. Another small benefit, in my opinion, is that the return type is not a mutable collection, as it would be with ToArray or ToList.
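
A small sketch of how that reads in practice, assuming the System.Interactive (Ix.NET) package is referenced and Compute is some hypothetical expensive projection:

var memoized = source.Select(x => Compute(x)).Memoize();
int count = memoized.Count();  // computes each element once, on demand
var first = memoized.First();  // replays the buffered result; no recompute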
Keep the queries in lazy form until you start to evaluate the query multiple times, or even earlier if you need them in another form or you are in danger of variables captured in closures changing their values.
You may want to evaluate when the query contains complex projections which you want to avoid performing multiple times (e.g. constructing complex objects for sequences with lots of elements). In this case evaluating once and iterating many times is much saner.
You may need the results in another form if you want to return them or pass them to another API that expects a specific type of collection.
You may want or need to prevent accessing modified closures if the query captures variables which are not local in scope. Until the query is actually evaluated, you are in danger of other code changing their values "behind your back"; when the evaluation happens, it will use these values instead of those present when the query was constructed. (However, this can be worked around by making a copy of those values in another variable that does have local scope).
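
A minimal illustration of that pitfall (the names are made up):

int threshold = 10;
var query = items.Where(x => x > threshold);  // captures the variable, not its value
threshold = 100;                              // silently changes the query's meaning
var results = query.ToList();                 // filters with threshold == 100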
You would normally only use ToArray() when you need to use an array, like with an API that expects an array. As long as you don't need to access the results of a query, and you're not confined to some kind of connection context (like the case may be in LINQ to SQL or LINQ to Entities), then you might as well just keep the query in lazy form.

Am I misunderstanding LINQ to SQL .AsEnumerable()?

Consider this code:
var query = db.Table
              .Where(t => SomeCondition(t))
              .AsEnumerable();

int recordCount = query.Count();
int totalSomeNumber = query.Sum();
decimal average = query.Average();
Assume query takes a very long time to run. I need to get the record count, the total of SomeNumber, and the average at the end. I thought, based on my reading, that .AsEnumerable() would execute the query using LINQ-to-SQL, then use LINQ-to-Objects for the Count, Sum, and Average. Instead, when I do this in LINQPad, I see the same query is run three times. If I replace .AsEnumerable() with .ToList(), it only gets queried once.
Am I missing something about what AsEnumerable is/does?
Calling AsEnumerable() does not execute the query, enumerating it does.
IQueryable is the interface that allows LINQ to SQL to perform its magic. IQueryable implements IEnumerable, so when you call AsEnumerable(), you are changing the extension methods being called from there on, i.e. from the IQueryable methods to the IEnumerable methods (i.e. changing from LINQ to SQL to LINQ to Objects in this particular case). But you are not executing the actual query, just changing how it is going to be executed in its entirety.
To force query execution, you must call ToList().
Yes. All that AsEnumerable will do is cause the Count, Sum, and Average functions to be executed client-side (in other words, it will bring the entire result set back to the client, and then the client will perform those aggregates instead of creating COUNT(), SUM() and AVG() statements in SQL).
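
Roughly, as a hedged sketch (the Price column is invented for illustration):

// Client-side: the whole result set streams back; Enumerable.Count runs in memory.
int n1 = db.Table.Where(t => t.Price > 0).AsEnumerable().Count();

// Server-side: Queryable.Count translates to SELECT COUNT(*) ... WHERE Price > 0.
int n2 = db.Table.Where(t => t.Price > 0).Count();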
Justin Niessner's answer is perfect.
I just want to quote an MSDN explanation here: .NET Language-Integrated Query for Relational Data
The AsEnumerable() operator, unlike ToList() and ToArray(), does not cause execution of the query. It is still deferred. The AsEnumerable() operator merely changes the static typing of the query, turning an IQueryable into an IEnumerable, tricking the compiler into treating the rest of the query as locally executed.
I hope this is what is meant by:
IQueryable methods to the IEnumerable methods (i.e. changing from LINQ to SQL to LINQ to Objects
Once it is LINQ to Objects, we can apply methods defined on the objects themselves (e.g. ToString()). This explains one of the frequently asked questions about LINQ: Why LINQ to Entities does not recognize the method 'System.String ToString()'?
According to ASENUMERABLE - codeblog.jonskeet, AsEnumerable can be handy when you want to:
perform some aspects of the query in the database, and then a bit more manipulation in .NET - particularly if there are aspects you basically can't implement in LINQ to SQL (or whatever provider you're using).
It also says:
All we're doing is changing the compile-time type of the sequence which is propagating through our query from IQueryable to IEnumerable - but that means that the compiler will use the methods in Enumerable (taking delegates, and executing in LINQ to Objects) instead of the ones in Queryable (taking expression trees, and usually executing out-of-process).
Finally, also see this related question: Returning IEnumerable vs. IQueryable
Well, you are on the right track. The problem is that an IQueryable (which is what the statement is before the AsEnumerable call) is also an IEnumerable, so that call is, in effect, a no-op. It will require forcing the results into a specific in-memory data structure (e.g., ToList()) to force the query.
I would presume that ToList forces LINQ to fetch the records from the database. When you then perform the subsequent calculations, they are done against the in-memory objects rather than involving the database.
Leaving the return type as an IEnumerable means that the data is not fetched until it is called upon by the code performing the calculations. The knock-on effect of this is that the database is hit three times - once for each calculation - and the data is never persisted to memory.
Just adding a little more clarification:
I thought based on my reading that .AsEnumerable() would execute the query using LINQ-to-SQL
It will not execute the query right away, as Justin's answer explains. It will only be materialized (hit the database) later on.
Instead, when I do this in LINQPad, I see the same query is run three times.
Yes, and note that all three queries are exactly the same: each fetches all rows matching the given condition into memory and then computes the count/sum/average locally.
If I replace .AsEnumerable() with .ToList(), it only gets queried once.
But it still brings all the data into memory, with the advantage that now it runs only once.
If performance is a concern, just remove .AsEnumerable() and the count/sum/average will be translated correctly to their SQL counterparts. Three queries will then run (probably faster, if there are indexes satisfying the conditions), but with a much smaller memory footprint.
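
Concretely, a hedged rewrite of the question's code under that advice (SomeNumber is the question's placeholder column, and SomeCondition is kept as its shorthand for a translatable predicate):

var filtered = db.Table.Where(t => SomeCondition(t));    // stays IQueryable
int recordCount = filtered.Count();                      // SELECT COUNT(*)
int totalSomeNumber = filtered.Sum(t => t.SomeNumber);   // SELECT SUM(SomeNumber)
double average = filtered.Average(t => t.SomeNumber);    // SELECT AVG(SomeNumber)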

In-memory LINQ performance

This question is less about LINQ to [insert your favorite provider here] and more about searching or filtering in-memory collections.
I know LINQ (or the searching/filtering extension methods) works on objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query at least O(n) in complexity?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n), unless list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it should take O(n), even if the list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?
Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods special-case for when the object implementing IEnumerable actually implements e.g. ICollection. (I've seen this for IEnumerable.Contains at least.)
In practice this means that LINQ's IEnumerable.Contains calls the fast HashSet.Contains, for example, if the IEnumerable actually is a HashSet.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet.Contains because HashSet implements ICollection.
if (mySet.Contains(10)) { /* code */ }
You can use Reflector to check exactly how the LINQ methods are defined; that is how I figured this out.
Oh, and LINQ also contains the methods IEnumerable.ToDictionary (maps key to single value) and IEnumerable.ToLookup (maps key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
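
For example (hypothetical data):

var people = new[] { new { Id = 1, Name = "Ann" }, new { Id = 2, Name = "Bob" } };
var byId = people.ToDictionary(p => p.Id);  // built once: O(n)
var ann = byId[1].Name;                     // every lookup afterwards: O(1)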
Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.
