Faking IGrouping for LINQ - C#

Imagine you have a large dataset that may or may not be filtered on a particular condition of its elements, a condition that can be expensive to calculate. In the case where it is not filtered, the elements are grouped by the value of that condition, so the condition is calculated only once per element.
However, in the case where the filtering has taken place, the subsequent code still expects to see an IEnumerable<IGrouping<TKey, TElement>> collection, yet it doesn't make sense to perform a GroupBy operation that would re-evaluate the condition for each element. Instead, I would like to be able to create an IEnumerable<IGrouping<TKey, TElement>> by wrapping the filtered results appropriately, thus avoiding yet another evaluation of the condition.
Other than implementing my own class that provides the IGrouping interface, is there any other way I can implement this optimization? Are there existing LINQ methods to support this that would give me the IEnumerable<IGrouping<TKey, TElement>> result? Is there another way that I haven't considered?

the condition is calculated once
I hope those keys are still around somewhere...
If your data was in some structure like this:
public class CustomGroup<T, U>
{
    public T Key { get; set; }
    public IEnumerable<U> GroupMembers { get; set; }
}
You could project such items with a query like this:
var result = customGroups
    .SelectMany(cg => cg.GroupMembers, (cg, z) => new { Key = cg.Key, Value = z })
    .GroupBy(x => x.Key, x => x.Value);
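Note that the re-grouping here only compares the keys already stored on each CustomGroup; the expensive condition itself is never re-evaluated.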

Inspired by David B's answer, I have come up with a simple solution. So simple that I have no idea how I missed it.
In order to perform the filtering, I obviously need to know what value of the condition I am filtering by. Therefore, given a condition, c, I can just project the filtered list as:
filteredList.GroupBy(x => c)
This avoids any recalculation of properties on the elements (represented by x).
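Spelled out with hypothetical names (Item for the element type, IsMatch for the expensive predicate), a minimal sketch:
bool c = true; // the condition value we filtered by, already known
List<Item> filteredList = source.Where(x => IsMatch(x) == c).ToList(); // predicate evaluated once per element
IEnumerable<IGrouping<bool, Item>> groups = filteredList.GroupBy(x => c); // constant key: nothing recalculated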
Another solution I realized would work is to reverse the ordering of my query and perform the grouping before the filtering. This too would mean the condition only gets evaluated once, although it would unnecessarily allocate groupings that I wouldn't subsequently use.
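A sketch of that variant, using the same hypothetical names; every group is built even though only one is kept:
var groups = source
    .GroupBy(x => IsMatch(x)) // predicate still evaluated only once per element
    .Where(g => g.Key == c);  // discard the groups that won't be used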

What about putting the result into a Lookup and using that from then on?
var lookup = data.ToLookup(i => Foo(i));
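This works because ILookup<TKey, TElement> implements IEnumerable<IGrouping<TKey, TElement>>, so the lookup can be handed straight to code that expects groupings. A minimal sketch, assuming Foo is the expensive condition returning a bool over elements of a hypothetical type Item:
ILookup<bool, Item> lookup = data.ToLookup(i => Foo(i)); // Foo runs once per element
IEnumerable<IGrouping<bool, Item>> groups = lookup;      // no conversion needed
IEnumerable<Item> matches = lookup[true];                // keyed access comes for free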

Related

Perform includes on key object, using LINQ, in a GroupBy situation

I have a relatively simple, yet somehow weirdly complicated case whereby I need to perform includes on a lengthy object graph, when I'm doing a group-by.
Here is roughly what my LINQ looks like:
var result = DbContext.ParentTable
    .Where(p => [...some criteria...])
    .GroupBy(p => p.Child)
    .Select(g => new
    {
        ChildObject = g.Key,
        AllTheThings = g.Sum(x => x.SomeNumericColumn),
        LatestAndGreatest = g.Max(x => x.SomeDateColumn)
    })
    .OrderByDescending(o => o.AllTheThings)
    .Take(100)
    .ToHashSet();
That gives me a listing of anonymous objects, just the way I want it, with child objects neatly associated with some aggregate stats about said object. Fine. But I also need a fair share of the object graph associated with the child object.
This ask gets a bit messier than it might otherwise be, because I want to use existing code I already have to perform the includes. That is, I have a static method which takes an IQueryable of my child object and, based upon parameters, gives me back another IQueryable with all the proper includes that I need (there are rather a lot of them).
I can't seem to figure out the correct way to take my child object as a queryable, pass it to my include method, and get it back for expansion at the point where I project it into the new anonymous object (where I'm saying ChildObject = g.Key).
Sorry if this is something of a duplicate -- I did search around and found solutions that were close to what I'm wanting, but not quite.

Groupby Transform into another group

I have a
IGrouping<string, MyObj>
I want to transform it into another IGrouping. For argument's sake the key is the same, but MyObj will transform into MyOtherObj, i.e.
IGrouping<string, MyOtherObj>
I am using Linq2Sql, but I can cope with this last bit not being translatable into SQL.
I want it to still be an IGrouping<T,TT> because it is a recognised type and I want the signature and result to be apparent. I also want to be able to do this so I can break my LINQ down a bit and put it into better-labelled methods, i.e.
GetGroupingWhereTheSearchTextAppearsMoreThanOnce()
RetrieveRelatedResultsAndMap()
Bundle up and return encased in an IEnumerable - no doubt as to what is going on.
I have come close by daisy chaining
IQueryable<IGrouping<string, MyObj>> grouping = ...
IQueryable<IGrouping<string, IEnumerable<MyOtherObj>>> testgrouping =
    grouping.GroupBy(gb => gb.Key,
                     contacts => contacts.Select(s => mapper.Map<MyObj, MyOtherObj>(s)));
but I end up with
IGrouping<string, IEnumerable<MyOtherObj>>
I know it is because of how I am accessing the enumerable that the IGrouping represents but I can't figure out how to do it.
You could just flatten the groupings with SelectMany(x => x) then do the GroupBy again, but then you're obviously doing the work twice.
You should be able to do the projection as part of the first GroupBy call instead.
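For instance, if you control the original GroupBy, the element selector can do the mapping in one pass. A sketch, assuming source is the IQueryable<MyObj> being grouped, SearchText is a hypothetical key property, and the grouping runs in memory after AsEnumerable() since mapper.Map won't translate to SQL:
IEnumerable<IGrouping<string, MyOtherObj>> result = source
    .AsEnumerable() // mapper.Map cannot be translated by LINQ to SQL
    .GroupBy(x => x.SearchText,
             x => mapper.Map<MyObj, MyOtherObj>(x)); // map each element as it is grouped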
Alternatively, you can add your own implementation of IGrouping, as described in What is the implementing class for IGrouping?, then simply do:
groups.Select(g => new MyGrouping<string, MyOtherObj>(g.Key, g.Select(myObj => mapper.Map<MyObj, MyOtherObj>(myObj))))
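For reference, a minimal MyGrouping along those lines might look like this (a sketch, not the exact class from the linked question):
public class MyGrouping<TKey, TElement> : IGrouping<TKey, TElement>
{
    private readonly IEnumerable<TElement> elements;

    public MyGrouping(TKey key, IEnumerable<TElement> elements)
    {
        Key = key;
        this.elements = elements;
    }

    public TKey Key { get; }

    public IEnumerator<TElement> GetEnumerator() => elements.GetEnumerator();

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() => GetEnumerator();
}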

join two collections on a custom condition

I have two collections, IEnumerable<A> as and IEnumerable<B> bs
I also have a predicate Func<A, B, bool> predicate
I would like to join as and bs together to get something equivalent to an IEnumerable<IGrouping<A, B>> joined, such that for each group in joined and for each element b in that group, predicate(group.Key, b) holds.
To get such a grouping there is usually the GroupBy extension method, but that can't operate on a custom predicate.
I considered two approaches, one just building a collection with nested loops, the other doing the same with Aggregate. Both look really ugly. Is there a better way to do this?
In this particular case, for each element b in bs there is exactly one A in as for which the predicate holds, and I don't mind relying on that property if that makes for a nicer solution.
As far as I can see, in the general case it can't make for a better asymptotic runtime complexity than O(n * m) where n is the length of as and m is the length of bs. I'm OK with that.
Considering that you have
IEnumerable<A> aEnumerable;
IEnumerable<B> bEnumerable;
and the following restriction:
In this particular case, for each element b in bs there is exactly one A in as for which the predicate holds
You may do the following:
IEnumerable<IGrouping<A, B>> grouping = bEnumerable
    .GroupBy(b => aEnumerable.Single(a => predicate(a, b)));
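Note that Single throws if no element of aEnumerable (or more than one) satisfies the predicate for some b, so this leans directly on the stated restriction; it also enumerates aEnumerable once per element of bs, i.e. the anticipated O(n * m).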
Another option which comes to mind and looks more convenient is a simple dictionary:
IEnumerable<A> aEnumerable;
IEnumerable<B> bEnumerable;
Dictionary<A, B[]> dict = aEnumerable
    .ToDictionary(a => a,
                  a => bEnumerable.Where(b => predicate(a, b)).ToArray());
For every key A in this dictionary, the value array holds exactly those items of bs for which the predicate holds.

Fast queryable collection of objects

I am looking for a library that would accept a collection of objects and return an indexed data structure that would be optimised for fast querying.
This is probably better illustrated by an example:
public class MyClass
{
    public string Name { get; set; }
    public double Number { get; set; }
    // ... many more fields
}
var dataStore = Indexer.Parse(myClassCollection)
    .Index(x => x.Name)
    .Index(x => x.Number)
    .Index(x => x.SomeOtherProperty);
var queryResult = dataStore
    .Where(x => x.Name == "ABC")
    .Where(x => x.Number == 23)
    .Where(x => x.SomeOtherProperty == dateTimeValue);
The idea is that the query on the dataStore will be very fast, of the order of O(log n).
Using dictionaries of dictionaries starts getting complicated when you have more than 2 or 3 fields you want to index.
Is there a library that already exists that does something like this?
What about an object-oriented database?
Sterling is a recommended option. It supports LINQ to Objects, so don't worry about queries; we have used it for a couple of medium-sized projects with good results (it's pretty fast).
You should take a look at RaptorDB as well. Several versions, including a fully embedded version, can be found on CodeProject here.
You could use Lucene.NET which can also run fully in memory (though I'm not sure that's what you'd want). It supports lightning fast retrieval of documents based on field criteria.
So that actually gives you a document database. If you take that one step further, you end up with something like RavenDB (commercial).
I am wondering whether we could achieve this by creating a SortedDictionary for each of the indexed properties.
SortedDictionary<property, List<MyClass>>
Then, by parsing the LINQ expression tree, we can find out which properties are being queried, retrieve the valid keys from the sorted dictionaries, loop through those keys to get a List from each dictionary, and use set operations such as Union() and Intersect() depending on whether the expression tree has OR or AND directives.
Then return a List matching the search criteria.
If the query includes a property that is not indexed, execute the query with indexed properties first and then use normal Linq to finish it off.
The interesting bit then becomes parsing the expression tree.
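A rough sketch of the underlying structure, with illustrative names and the expression-tree parsing elided; it shows one index per property and the set intersection for an AND query:
var nameIndex = new SortedDictionary<string, List<MyClass>>();
var numberIndex = new SortedDictionary<double, List<MyClass>>();

// build one sorted index per property; each key maps to the items having that value
foreach (var item in myClassCollection)
{
    if (!nameIndex.TryGetValue(item.Name, out var byName))
        nameIndex[item.Name] = byName = new List<MyClass>();
    byName.Add(item);

    if (!numberIndex.TryGetValue(item.Number, out var byNumber))
        numberIndex[item.Number] = byNumber = new List<MyClass>();
    byNumber.Add(item);
}

// AND of two indexed predicates: each lookup is O(log n), then intersect the hits
var nameHits = nameIndex.TryGetValue("ABC", out var n) ? n : new List<MyClass>();
var numberHits = numberIndex.TryGetValue(23, out var m) ? m : new List<MyClass>();
IEnumerable<MyClass> result = nameHits.Intersect(numberHits);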
Any thoughts on this approach?

Rules for LINQ to SQL across method boundaries

To keep my code cleaner I often try to break down parts of my data access code in LINQ to SQL into private sub-methods, just like with plain-old business logic code. Let me give a very simplistic example:
public IEnumerable<Item> GetItemsFromRepository()
{
    var setA = from a in this.dataContext.TableA
               where /* criteria */
               select a.Prop;
    return DoSubQuery(setA);
}
private IEnumerable<Item> DoSubQuery(IEnumerable<Item> set)
{
    return from item in set
           where /* criteria */
           select item;
}
I'm sure no one's imagination would be stretched by imagining more complex examples with deeper nesting or using results of sets to filter other queries.
My basic question is this: I've seen some significant performance differences and even exceptions being thrown by just simply reorganizing LINQ to SQL code in private methods. Can anyone explain the rules for these behaviors so that I can make informed decisions about how to write efficient, clean data access code?
Some questions I've had:
1) When does passing a System.Data.Linq.Table instance to a method cause query execution?
2) When does using a System.Data.Linq.Table in another query cause execution?
3) Are there limits to what types of operations (Take, First, Last, OrderBy, etc.) can be applied to a System.Data.Linq.Table passed as a parameter into a method?
The most important rule in terms of LINQ-to-SQL would be: don't return IEnumerable<T> unless you must - as the semantic is unclear. There are two schools of thought beyond that:
if you return IQueryable<T>, it is composable, meaning the where from later queries is combined to make a single TSQL, but as a down-side, it is hard to fully test
otherwise, return List<T> or similar, so it is clear that everything beyond that point is LINQ-to-Objects
Currently, you are doing something in the middle: collapsing it to LINQ-to-Objects (via IEnumerable<T>), but without it being obvious, and keeping the connection open in the middle (again, only a problem because it isn't obvious).
Remove the implicit cast:
public IQueryable<Item> GetItemsFromRepository()
{
    var setA = from a in this.dataContext.TableA
               where /* criteria */
               select a.Prop;
    return DoSubQuery(setA);
}
private IQueryable<Item> DoSubQuery(IQueryable<Item> set)
{
    return from item in set
           where /* criteria */
           select item;
}
The implicit cast from IQueryable<Item> to IEnumerable<Item> is essentially the same as calling AsEnumerable() on your IQueryable<Item>. There are of course times when you want that, but you should leave things as IQueryable by default, so that the entire query can be performed on the database, rather than merely the GetItemsFromRepository() bit with the rest being done in memory.
The secondary questions:
1) When does passing a System.Data.Linq.Table instance to a method cause query execution?
When something needs a final result, such as Max(), ToList(), etc. that is neither a queryable object, nor a loaded-as-it-goes enumerable.
Note though that while AsEnumerable() does not itself cause query execution, it does mean that when execution does happen, only the part before the AsEnumerable() is performed against the source data source; that part then produces an on-demand in-memory sequence against which the rest is performed.
2) When does using a System.Data.Linq.Table in another query cause execution?
The same as above. Table<T> implements IQueryable<T>. If you e.g. join two of them together, that won't yet cause anything to be executed.
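For instance, composing two tables stays deferred (table and column names here are made up):
// nothing is sent to the database here; the join is only composed into the query
var query = dataContext.Orders.Join(dataContext.Customers,
    o => o.CustomerId,
    c => c.Id,
    (o, c) => new { c.Name, o.Total });
var list = query.ToList(); // execution happens here, as a single SQL statement with the join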
3) Are there limits to what types of operations (Take, First, Last, OrderBy, etc.) can be applied to a System.Data.Linq.Table passed as a parameter into a method?
Those that are defined by IQueryable<T>.
Edit: Some clarification on the differences and similarities between IEnumerable and IQueryable.
Just about anything you can do on an IQueryable you can do on an IEnumerable and vice-versa, but how it's performed will be different.
Any given IQueryable implementation can be used in linq queries and will have all the linqy extension methods like Take(), Select(), GroupBy and so on.
Just how this is done depends on the implementation. For example, System.Data.Linq.Table implements those methods by turning the query into an SQL query, the results of which are turned into objects on an as-loaded basis. So if mySource is a table, then:
var filtered = from item in mySource
where item.ID < 23
select new{item.ID, item.Name};
foreach(var i in filtered)
Console.WriteLine(i.Name);
Gets turned into SQL like:
select id, name from mySourceTable where id < 23
And then an enumerator is created from that such that on each call to MoveNext() another row is read from the results, and a new anonymous object created from it.
On the other hand, if mySource were a List or a HashSet, or anything else that implements IEnumerable<T> but doesn't have its own query engine, then the LINQ-to-Objects code will turn it into something like:
foreach(var item in mySource)
if(item.ID < 23)
yield return new {item.ID, item.Name};
Which is about as efficient as that code could be done in memory. The results will be the same, but the way of getting them would be different.
Now, since all IQueryable<T> can be converted into the equivalent IEnumerable<T> we could, if we wanted to, take the first mySource (where execution happens in a database) and do the following instead:
var filtered = from item in mySource.AsEnumerable()
where item.ID < 23
select new{item.ID, item.Name};
Here, while there is still nothing executed against the database until we iterate through the results or call something that examines all of those results, once we do so, it's as if we split the execution into two separate steps:
var asEnum = mySource.AsEnumerable();
var filtered = from item in asEnum
where item.ID < 23
select new{item.ID, item.Name};
The implementation of the first line would be to execute the SQL SELECT * FROM mySourceTable, and the execution of the rest would be like the LINQ-to-Objects example above.
It's not hard to see how, if the database contained 10 items with an id < 23, and 50,000 items with an id higher, this is now much, much less performant.
As well as offering the explicit AsEnumerable() method, all IQueryable<T> can be implicitly cast to IEnumerable<T>. This lets us do foreach on them and use them with any other existing code that handles IEnumerable<T>, but if we accidentally do it at an inappropriate time, we can make queries much slower, and this is what was happening when your DoSubQuery was defined to take and return IEnumerable<Item>: it implicitly called AsEnumerable() on your IQueryable<Item> and caused what could have been performed on the database to be performed in memory.
For this reason, 99% of the time, we want to keep dealing in IQueryable until the very last moment.
As an example of the opposite though, just to point out that AsEnumerable() and the casts to IEnumerable<T> aren't there out of madness, we should consider two things. The first is that IEnumerable<T> lets us do things that can't be done otherwise, such as joining two completely different sources that don't know about each other (e.g. two different databases, a database and an XML file, etc.)
Another is that sometimes IEnumerable<T> is actually more efficient too. Consider:
IQueryable<IGrouping<string, int>> groupingQuery =
    from item in mySource
    group item.ID by item.Name;
var list1 = groupingQuery.Select(grp => new {Name=grp.Key, Count=grp.Count()}).ToList();//fine
foreach(var grp in groupingQuery)//disaster!
Console.WriteLine(grp.Count());
Here groupingQuery is set up as a queryable that does some grouping, but which hasn't been executed in any way. When we create list1, we first create a new IQueryable based on it, and the query engine does its best to work out what the best SQL for it is, coming up with something like:
select name, count(id) from mySourceTable group by name
Which is pretty efficiently performed. Then the rows are turned into objects, which are then put into a list.
On the other hand, with the second query, there isn't as natural a SQL conversion for a group by that doesn't perform aggregate methods on all of the non-grouped items, so the best the query engine can come up with is to first do:
select distinct name from mySourceTable
And then for every name it receives, to do:
select id from mySourceTable where name = '{name found in last query goes here}'
And so on, whether that means 2 SQL queries or 200,000.
In this case, we're much better off working on mySource.AsEnumerable(), because here it is more efficient to grab the whole table into memory first. (Even better still would be to work on mySource.Select(item => new {item.ID, item.Name}).AsEnumerable(), because then we still only retrieve the columns we care about from the database, and switch to in-memory processing at that point.)
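Spelled out, that last suggestion looks like this:
var groups = mySource
    .Select(item => new { item.ID, item.Name }) // still IQueryable: only id and name are fetched
    .AsEnumerable()                             // switch to LINQ-to-Objects from this point on
    .GroupBy(x => x.Name, x => x.ID);           // grouped in memory, in a single pass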
The last bit is worth remembering because it breaks our rule that we should stay with IQueryable<T> as long as possible. It isn't something to worry about much, but it is worth keeping an eye on if you do grouping and find yourself with a very slow query.
