I am looking for a library that would accept a collection of objects and return an indexed data structure that would be optimised for fast querying.
This is probably better illustrated by an example:
public class MyClass
{
public string Name {get;set;}
public double Number {get;set;}
public ... (Many more fields)
}
var dataStore = Indexer.Parse(myClassCollection).Index(x => x.Name).Index(x => x.Number).Index( x => x.SomeOtherProperty);
var queryResult = dataStore.Where( x => x.Name == "ABC").Where(x => x.Number == 23).Where( x => x.SomeOtherProperty == dateTimeValue);
The idea is that the query on the dataStore will be very fast, of the order of O(log n).
Using dictionaries of dictionaries starts getting complicated when you have more than 2 or 3 fields you want to index.
Is there a library that already exists that does something like this?
What about an object-oriented database?
Sterling is a recommended option. It supports LINQ to Object so don't worry about queries and we have used it for a couple of medium projects with good results (it's pretty fast).
You should take a look at RaptorDB as well. Several versions, including a fully embedded version, can be found on CodeProject here.
You could use Lucene.NET which can also run fully in memory (though I'm not sure that's what you'd want). It supports lightning fast retrieval of documents based on field criteria.
So that actually gives you a document database. If you take that one step further, you end up with something like RavenDB (commercial).
I am wondering whether we could achieve this by creating a SortedDictionary for each of the indexed properties.
SortedDictionary<property, List<MyClass>>
Then parsing the Linq expression tree to find out which properties are being queried. We can retrieve the valid keys of the sortedDictionaries, and then loop through these keys to get a List for each sorted dictionary and then use Set operations such as Union() and Intersect() depending on whether the expression tree has OR or AND directives.
Then return a List matching the search criteria.
If the query includes a property that is not indexed, execute the query with indexed properties first and then use normal Linq to finish it off.
The interesting bit then becomes parsing the expression tree.
Any thoughts on this approach?
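A minimal sketch of the per-property index idea, leaving the expression-tree parsing out entirely. The names `Item`, `PropertyIndex` and `IndexDemo` are hypothetical and only illustrate the shape: one SortedDictionary per indexed property, with Intersect() standing in for an AND between two equality predicates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type standing in for MyClass from the question.
public sealed class Item
{
    public string Name { get; set; }
    public double Number { get; set; }
}

// Hypothetical helper: one SortedDictionary per indexed property.
// Keys must be non-null, because SortedDictionary rejects null keys.
public sealed class PropertyIndex<T, TKey>
{
    private readonly SortedDictionary<TKey, List<T>> _buckets =
        new SortedDictionary<TKey, List<T>>();

    public PropertyIndex(IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        foreach (T item in source)
        {
            TKey key = keySelector(item);
            if (!_buckets.TryGetValue(key, out List<T> bucket))
            {
                bucket = new List<T>();
                _buckets.Add(key, bucket);
            }
            bucket.Add(item);
        }
    }

    // Tree lookup of every item whose indexed property equals the key.
    public IReadOnlyList<T> Find(TKey key)
    {
        return _buckets.TryGetValue(key, out List<T> bucket)
            ? (IReadOnlyList<T>)bucket
            : Array.Empty<T>();
    }
}

public static class IndexDemo
{
    // AND of two equality predicates: intersect the two index lookups.
    public static List<Item> FindByNameAndNumber(
        PropertyIndex<Item, string> byName,
        PropertyIndex<Item, double> byNumber,
        string name, double number)
    {
        return byName.Find(name).Intersect(byNumber.Find(number)).ToList();
    }
}
```

An OR would use Union() in place of Intersect(). Note that the intersection itself is linear in the size of the two buckets, so the overall cost is only close to O(log n) when the matching buckets are small.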
Related
I'm looking to get a better understanding of when we should use IEnumerable over IQueryable with LINQ to Entities.
With really basic calls to the database, IQueryable is way quicker, but when do I need to think about using an IEnumerable in its place?
Where is an IEnumerable optimal over an IQueryable?
Basically, IQueryables are executed by a query provider (for example a database), and some operations cannot or should not be done by the database. For example, if you want to call a C# function (here, as an example, one that capitalizes a name correctly) using a value you got from the database, you may try something like:
db.Users.Select(x => Capitalize(x.Name)) // Tries to make the db call Capitalize.
.ToList();
Since the Select is executed on an IQueryable, and the underlying database has no idea about your Capitalize function, the query will fail. What you can do instead is to get the correct data from the database and convert the IQueryable to an IEnumerable (which is basically just a way to iterate through collections in memory) to do the rest of the operation in local memory, as in:
db.Users.Select(x => x.Name) // Gets only the name from the database
.AsEnumerable() // Do the rest of the operations in memory
.Select(x => Capitalize(x)) // Capitalize in memory
.ToList();
The most important thing when it comes to performance of IQueryable vs. IEnumerable from the side of EF, is that you should always try to filter the data using an IQueryable to get as little data as possible to convert to an IEnumerable. What the AsEnumerable call basically does is to tell the database "give me the data as it is filtered now", and if you didn't filter it, you'll get everything fetched to memory, even data you may not need.
IEnumerable represents a sequence of elements which you enumerate one by one until you find the answer you need, so for example if I wanted all entities that had some property greater than 10, I'd need to go through each one in turn and return only those that matched. Pulling every row of a database table into memory in order to do this would probably not be a great idea.
IQueryable on the other hand represents a set of elements on which operations like filtering can be deferred to the underlying data source, so in the filtering case, if I were to implement IQueryable on top of a custom data source (or use LINQ to Entities!) then I could give the hard work of filtering / grouping etc to the data source (e.g. a database).
The major downside of IQueryable is that implementing it is pretty hard - queries are constructed as Expression trees which as the implementer you then have to parse in order to resolve the query. If you're not planning to write a provider though then this isn't going to hurt you.
Another aspect of IQueryable that it's worth being aware of (although this is really just a generic caveat about passing processing off to another system that may make different assumptions about the world) is that you may find things like string comparison work in the manner they are supported in the source system, not in the manner they are implemented by the consumer, e.g. if your source database is case-insensitive but your default comparison in .NET is case-sensitive.
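To make the expression-tree point concrete, here is a small sketch. It uses the in-memory provider that AsQueryable() supplies, purely so the example runs without a database; the class and method names are hypothetical. The point is only that an IQueryable records the operators applied to it as data that a provider can inspect and translate.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class ExpressionTreeDemo
{
    // Returns the name of the outermost LINQ operator recorded
    // in the query's expression tree.
    public static string OutermostOperator()
    {
        var numbers = new List<int> { 5, 12, 20 };

        // Nothing is filtered here; the Where call is only recorded.
        IQueryable<int> query = numbers.AsQueryable().Where(n => n > 10);

        // The recorded call is a MethodCallExpression for Queryable.Where.
        var call = (MethodCallExpression)query.Expression;
        return call.Method.Name;
    }
}
```

A database-backed provider walks a tree like this one to produce SQL, which is the part that makes writing a provider hard.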
To keep my code cleaner I often try to break down parts of my data access code in LINQ to SQL into private sub-methods, just like with plain-old business logic code. Let me give a very simplistic example:
public IEnumerable<Item> GetItemsFromRepository()
{
var setA = from a in this.dataContext.TableA
where /* criteria */
select a.Prop;
return DoSubQuery(setA);
}
private IEnumerable<Item> DoSubQuery(IEnumerable<DateTimeOffset> set)
{
return from item in set
where /* criteria */
select item;
}
I'm sure no one's imagination would be stretched by imagining more complex examples with deeper nesting or using results of sets to filter other queries.
My basic question is this: I've seen some significant performance differences and even exceptions being thrown by just simply reorganizing LINQ to SQL code in private methods. Can anyone explain the rules for these behaviors so that I can make informed decisions about how to write efficient, clean data access code?
Some questions I've had:
1) When does passage of a System.Linq.Table instance to a method cause query execution?
2) When does using a System.Linq.Table in another query cause execution?
3) Are there limits to what types of operations (Take, First, Last, order by, etc.) can be applied to a System.Linq.Table passed as a parameter into a method?
The most important rule in terms of LINQ-to-SQL would be: don't return IEnumerable<T> unless you must - as the semantic is unclear. There are two schools of thought beyond that:
if you return IQueryable<T>, it is composable, meaning the where from later queries is combined to make a single TSQL, but as a down-side, it is hard to fully test
otherwise, return List<T> or similar, so it is clear that everything beyond that point is LINQ-to-Objects
Currently, you are doing something in the middle: collapsing it to LINQ-to-Objects (via IEnumerable<T>), but without it being obvious - and keeping the connection open in the middle (again, only a problem because it isn't obvious)
Remove the implicit cast:
public IQueryable<Item> GetItemsFromRepository()
{
var setA = from a in this.dataContext.TableA
where /* criteria */
select a.Prop;
return DoSubQuery(setA);
}
private IQueryable<Item> DoSubQuery(IQueryable<DateTimeOffset> set)
{
return from item in set
where /* criteria */
select item;
}
The implicit cast from IQueryable<Item> to IEnumerable<Item> is essentially the same as calling AsEnumerable() on your IQueryable<Item>. There are of course times when you want that, but you should leave things as IQueryable by default, so that the entire query can be performed on the database, rather than merely the GetItemsFromRepository() bit with the rest being done in memory.
The secondary questions:
1) When does passage of a System.Linq.Table instance to a method cause query execution?
When something needs a final result, such as Max(), ToList(), etc. that is neither a queryable object, nor a loaded-as-it-goes enumerable.
Note, though, that while AsEnumerable() does not cause query execution, it does mean that when execution happens, only the part before the AsEnumerable() will be performed against the source data source. That produces an on-demand in-memory sequence against which the rest will be performed.
2) When does using a System.Linq.Table in another query cause execution?
The same as above. Table<T> implements IQueryable<T>. If you e.g. join two of them together, that won't yet cause anything to be executed.
3) Are there limits to what types of operations (Take, First, Last, order by, etc.) can be applied to a System.Linq.Table passed as a parameter into a method?
Those that are defined by IQueryable<T>.
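The deferred-execution rule in the answers above can be observed with plain LINQ to Objects. This is a sketch with a hypothetical counting source standing in for a table, not LINQ to SQL itself, but the same principle applies: building or passing a query around reads nothing, and asking for a final result does.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Number of rows the fake source has handed out so far.
    public static int RowsRead;

    // Hypothetical stand-in for a table: counts every row that is read.
    public static IEnumerable<int> Rows()
    {
        for (int id = 1; id <= 5; id++)
        {
            RowsRead++;
            yield return id;
        }
    }

    // Returns the read count after building the query and after ToList().
    public static (int afterBuilding, int afterToList) Run()
    {
        RowsRead = 0;

        // Composing the query does not enumerate the source.
        IEnumerable<int> query = Rows().Where(id => id < 3);
        int afterBuilding = RowsRead;

        // ToList() needs a final result, so the source is read now.
        List<int> result = query.ToList();
        return (afterBuilding, RowsRead);
    }
}
```

Run() reports zero rows read after the query is built and five after ToList(), since the filter has to look at every row.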
Edit: Some clarification on the differences and similarities between IEnumerable and IQueryable.
Just about anything you can do on an IQueryable you can do on an IEnumerable and vice-versa, but how it's performed will be different.
Any given IQueryable implementation can be used in linq queries and will have all the linqy extension methods like Take(), Select(), GroupBy and so on.
Just how this is done depends on the implementation. For example, System.Data.Linq.Table<T> implements those methods by turning the query into an SQL query, the results of which are turned into objects on an as-loaded basis. So if mySource is a table then:
var filtered = from item in mySource
where item.ID < 23
select new{item.ID, item.Name};
foreach(var i in filtered)
Console.WriteLine(i.Name);
Gets turned into SQL like:
select id, name from mySourceTable where id < 23
And then an enumerator is created from that such that on each call to MoveNext() another row is read from the results, and a new anonymous object created from it.
On the other hand, if mySource were a List or a HashSet, or anything else that implements IEnumerable<T> but doesn't have its own query engine, then the linq-to-objects code will turn it into something like:
foreach(var item in mySource)
if(item.ID < 23)
yield return new {item.ID, item.Name};
Which is about as efficient as that code could be in memory. The results will be the same, but the way of getting them is different.
Now, since all IQueryable<T> can be converted into the equivalent IEnumerable<T> we could, if we wanted to, take the first mySource (where execution happens in a database) and do the following instead:
var filtered = from item in mySource.AsEnumerable()
where item.ID < 23
select new{item.ID, item.Name};
Here, while there is still nothing executed against the database until we iterate through the results or call something that examines all of those results, once we do so, it's as if we split the execution into two separate steps:
var asEnum = mySource.AsEnumerable();
var filtered = from item in asEnum
where item.ID < 23
select new{item.ID, item.Name};
The implementation of the first line would be to execute the SQL SELECT * FROM mySourceTable, and the execution of the rest would be like the linq-to-objects example above.
It's not hard to see how, if the database contained 10 items with an id < 23, and 50,000 items with an id higher, this is now much, much less performant.
As well as offering the explicit AsEnumerable() method, every IQueryable<T> can be implicitly cast to IEnumerable<T>. This lets us do foreach on them and use them with any other existing code that handles IEnumerable<T>, but if we accidentally do it at an inappropriate time, we can make queries much slower. This is what was happening when your DoSubQuery was defined to take an IEnumerable<DateTimeOffset> and return an IEnumerable<Item>: it implicitly called AsEnumerable() on your IQueryable<DateTimeOffset> and your IQueryable<Item>, and caused what could have been performed on the database to be performed in memory.
For this reason, 99% of the time, we want to keep dealing in IQueryable until the very last moment.
As an example of the opposite though, just to point out that AsEnumerable() and the casts to IEnumerable<T> aren't there out of madness, we should consider two things. The first is that IEnumerable<T> lets us do things that can't be done otherwise, such as joining two completely different sources that don't know about each other (e.g. two different databases, a database and an XML file, etc.)
Another is that sometimes IEnumerable<T> is actually more efficient too. Consider:
IQueryable<IGrouping<string, int>> groupingQuery = from item in mySource group item.ID by item.Name;
var list1 = groupingQuery.Select(grp => new {Name=grp.Key, Count=grp.Count()}).ToList();//fine
foreach(var grp in groupingQuery)//disaster!
Console.WriteLine(grp.Count());
Here groupingQuery is set up as a queryable that does some grouping, but which hasn't been executed in any way. When we create list1, we first create a new IQueryable based on that, and the query engine does its best to work out what the best SQL for it is, and comes up with something like:
select name, count(id) from mySourceTable group by name
Which is pretty efficiently performed. Then the rows are turned into objects, which are then put into a list.
On the other hand, with the second query, there isn't as natural a SQL conversion for a group by that doesn't perform aggregate methods on all of the non-grouped items, so the best the query engine can come up with is to first do:
select distinct name from mySourceTable
And then for every name it receives, to do:
select id from mySourceTable where name = '{name found in last query goes here}'
And so on, whether this means 2 SQL queries or 200,000.
In this case, we're much better off working on mySource.AsEnumerable() because here it is more efficient to grab the whole table into memory first. (Even better still would be to work on mySource.Select(item => new {item.ID, item.Name}).AsEnumerable() because then we still only retrieve the columns we care about from the database, and switch to in-memory at that point).
The last bit is worth remembering because it breaks our rule that we should stay with IQueryable<T> as long as possible. It isn't something to worry about much, but it is worth keeping an eye on if you do grouping and find yourself with a very slow query.
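The "project first, then group in memory" variant described above might look like the following sketch. The `Row` type and method name are hypothetical, and the example takes any IQueryable<Row> so that it can be tried against an in-memory list via AsQueryable().

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type with the two columns used in the answer.
public sealed class Row
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public static class GroupingDemo
{
    public static Dictionary<string, int> CountIdsPerName(IQueryable<Row> mySource)
    {
        return mySource
            // Still part of the provider's query: only two columns are requested.
            .Select(item => new { item.ID, item.Name })
            // From here on everything runs as LINQ to Objects.
            .AsEnumerable()
            .GroupBy(x => x.Name, x => x.ID)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Against a real database this issues one query for the projected columns and does all the grouping locally, instead of one query per group.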
I have a table with two columns, GroupId and ParentId (both are GUIDs). The table forms a hierarchy so I can look for a value in the “GroupId” field, and when I have found it I can look at its ParentId. This ParentId will also appear in the GroupId of a different record. I can use this to walk up the hierarchy tree from any point to the root (root is an empty GUID). What I’d like to do is get a list of records when I know a GroupId. This would be the record with the GroupId and all the parents back to the root record. Is this possible with Linq and if so, can anyone provide a code snippet?
LINQ is not designed to handle recursive selection.
It is certainly possible to write your own extension method to compensate for that in LINQ to Objects, but I've found that LINQ to Entities does not like functionality not easily translated into SQL.
Edit:
Funnily enough, LINQ to Entities does not complain about Matt Warren's take on recursion using LINQ here. You could do:
var result = db.Table.Where(item => item.GroupId == 5)
.Traverse(item => db.Table.Where(parent
=> item.ParentId == parent.GroupId));
using the extension method defined here:
static class LinqExtensions
{
    // Depth-first traversal: yields each item, then everything reachable from it via selector.
    public static IEnumerable<T> Traverse<T>(this IEnumerable<T> source,
        Func<T, IEnumerable<T>> selector)
    {
        foreach (T item in source)
        {
            yield return item;
            IEnumerable<T> children = selector(item);
            foreach (T child in children.Traverse(selector))
            {
                yield return child;
            }
        }
    }
}
Performance might be poor, though.
It's definitely possible with Linq, but you'd have to make a DB call for each level in the hierarchy. Not exactly optimal.
The other respondents are right - performance is going to be pretty bad on this, since you'll have to make multiple round-trips. This will depend somewhat on your particular case, however: is your tree deep, and will people be performing this operation often, for instance?
You may be well served by creating a stored procedure that does this (using a CTE), and wiring it up in the Entities Designer to return your particularly defined Entity.
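If the table is small enough to load in one query, another way to avoid the per-level round-trips is to walk the hierarchy in memory. This is a sketch of that alternative, not of the stored-procedure approach. It assumes a hypothetical `GroupRow` type and that the root record is the one whose ParentId is Guid.Empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity with the two columns from the question.
public sealed class GroupRow
{
    public Guid GroupId { get; set; }
    public Guid ParentId { get; set; }
}

public static class HierarchyWalker
{
    // Returns the starting record followed by its ancestors up to the root.
    public static List<GroupRow> PathToRoot(IEnumerable<GroupRow> rows, Guid startId)
    {
        // One pass over the data builds an id -> record lookup.
        Dictionary<Guid, GroupRow> byId = rows.ToDictionary(r => r.GroupId);

        var path = new List<GroupRow>();
        var seen = new HashSet<Guid>(); // guards against cycles in bad data
        Guid current = startId;

        // Stop at the empty GUID, on a cycle, or on a dangling ParentId.
        while (current != Guid.Empty
               && seen.Add(current)
               && byId.TryGetValue(current, out GroupRow row))
        {
            path.Add(row);
            current = row.ParentId;
        }
        return path;
    }
}
```

With LINQ to SQL or LINQ to Entities, `rows` would be the table materialized once (for example with ToList()), so the whole walk costs a single query.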
I have a couple of areas in an application I am building where it looks like I may have to violate the living daylights out of the DRY (Don't Repeat Yourself) principle. I'd really like to stay dry and not get hosed and wondered if someone might be able to offer me a poncho. For background, I am using C#/.NET 3.51 SP1, Sql Server 2008, and Linq-to-Sql.
Basically, my situations revolve around the following scenario. I need to be able to retrieve either a filtered list of items from virtually any table in my database or I need to be able to retrieve a single item from any table in my database given the id of the primary key. I am pretty sure that the best solutions to these problems will involve a good dose of generics and/or reflection.
Here are the two challenges in a little more depth. (Please forgive the verbosity.)
Given a table name (or perhaps a pluralized table name), I would like to be able to retrieve a filtered list of elements in the table. Specifically, this functionality will be used with lookup tables. (There are approximately 50 lookup tables in this database. Additional tables will frequently be added and/or removed.) The current lookup tables all implement an interface (mine) called IReferenceData and have fields of ID (PK), Title, Description, and IsActive.
For each of these lookup tables, I need to sometimes return a list of all records. Other times I need to only return the active records. Any Linq-to-Sql data context automatically contains a List property for each and every TableName. Unfortunately, I don't believe I can use this in its raw form because it is unfiltered, and I need to apply a filter on the IsActive property.
One option is to write code similar to the following for all 50 tables. Yuk!!!
public List<AAA> GetListAAA(bool activeOnly)
{
return AAAs.Where(b => b.IsActive == true || b.IsActive == activeOnly).OrderBy(c => c.Title).ToList();
}
This would not be terribly hard, but it does add a burden to maintenance.
Note: It is important that when the list is returned that I maintain the underlying data type. The records in these lookup tables may be modified, and I have to apply the updates appropriately.
For each of my 150 tables, I need to be able to retrieve an individual record (FirstOrDefault or SingleOrDefault) by its primary key id. Again, I would prefer not to write this same code many times. I would prefer to have one method that could be used for all of my tables.
I am not really sure what the best approach would be here. Some possibilities that crossed my mind included the following. (I don't have specific ideas for their implementation. I am simply listing them as food for thought.)
A. Have a method like GetTableNameItemByID (Guid id) on the data context. (Good)
B. Have an extension method like GetItem(this, string tableName, Guid id) on the data context. (Better)
C. Have a Generic method or extension method like GetItem (this, Table, Guid id). (I don't even know if this possible but it would be the cleanest to use.) (Best)
Additional Notes
For a variety of reasons, I have already created a partial class for my data context. It would certainly be acceptable if the methods were included in that partial class either as normal methods or in a separate static class for extension methods.
Since you already have a partial implementation of your data context, you could add:
public IQueryable<T> GetList<T>( bool activeOnly ) where T : class, IReferenceData
{
return this.GetTable<T>()
.Where( b => !activeOnly || b.IsActive )
.OrderBy( c => c.Title );
}
Retaining the IQueryable character of the data will defer the execution of the query until you are ready to materialize it. Note that you may want to omit the default ordering or have separate methods with and without ordering to allow you to apply different orderings if you desire. If you leave it as an IQueryable, this is probably more valuable since you can use it with paging to reduce the amount of data actually returned (per query) if you desire.
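As a sketch of the paging remark, a hypothetical `Page` extension (not part of LINQ itself) can be composed onto whatever IQueryable the method returns. Skip and Take are standard Queryable operators, so they become part of the query rather than running in memory.

```csharp
using System.Linq;

public static class PagingExtensions
{
    // Hypothetical helper with a zero-based pageIndex. Skip/Take are
    // composed onto the query here; nothing is executed yet.
    public static IQueryable<T> Page<T>(this IQueryable<T> query, int pageIndex, int pageSize)
    {
        return query.Skip(pageIndex * pageSize).Take(pageSize);
    }
}
```

Usage would be along the lines of `context.GetList<Country>(true).Page(0, 20).ToList()`, where `Country` is an assumed lookup entity. Paging is only stable when the query has an ordering, which the OrderBy on Title above provides.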
There's a design pattern for your needs called "Generic Repository". Using this pattern you'll get an IQueryable instead of a real list of your entities, which lets you do some other stuff with your query as you go. The point is to let the business layer get whatever it needs whenever it needs it in a generic way.
You can find an example here.
Have you considered using a code generation tool? Have a look at CodeSmith. Using a tool like that or T4 will allow you to generate your filter functions automatically and should make them fairly easy to maintain.
I'm not sure the best link to provide for T4, but you could start with this video.
Would this meet your needs?
public static IEnumerable<T> GetList<T>(this IEnumerable<IReferenceData> items, bool activeOnly)
{
return items.Where(b => b.IsActive == true || b.IsActive == activeOnly).OrderBy(c => c.Title).Cast<T>().ToList();
}
You could use it like this:
IEnumerable<IReferenceData> yourList;
List<DerivedClass> filtered = yourList.GetList<DerivedClass>(true);
To do something like this without demanding interfaces etc, you can use dynamic Expressions; something like:
public static IList<T> GetList<T>(
this DataContext context, bool activeOnly )
where T : class
{
IQueryable<T> query = context.GetTable<T>();
var param = Expression.Parameter(typeof(T), "row");
if(activeOnly)
{
var predicate = Expression.Lambda<Func<T, bool>>(
Expression.Equal(
Expression.PropertyOrField(param, "IsActive"),
Expression.Constant(true,typeof(bool))
), param);
query = query.Where(predicate);
}
var selector = Expression.Lambda<Func<T, string>>(
Expression.PropertyOrField(param, "Title"), param);
return query.OrderBy(selector).ToList();
}
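The same dynamic-expression technique could cover the second challenge from the question, fetching a single record by primary key. The sketch below is hypothetical: it assumes every table has a Guid key property or field named "ID", as the question suggests, and it is written against IQueryable<T> so that it can be tried in memory. PropertyOrField throws if the type has no such member.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

public static class LookupExtensions
{
    // Builds the predicate row => row.ID == id at run time, so no
    // interface or base class is required on T.
    public static T GetItem<T>(this IQueryable<T> source, Guid id)
    {
        ParameterExpression row = Expression.Parameter(typeof(T), "row");
        Expression<Func<T, bool>> predicate = Expression.Lambda<Func<T, bool>>(
            Expression.Equal(
                Expression.PropertyOrField(row, "ID"),
                Expression.Constant(id, typeof(Guid))),
            row);

        // Returns default(T) when no row matches; throws if several do.
        return source.SingleOrDefault(predicate);
    }
}
```

With a LINQ to SQL data context this would be called as `context.GetTable<T>().GetItem(id)`, which matches the shape of option C in the question.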
Imagine you have a large dataset that may or may not be filtered by a particular condition of the dataset elements that can be intensive to calculate. In the case where it is not filtered, the elements are grouped by the value of that condition - the condition is calculated once.
However, in the case where the filtering has taken place, although the subsequent code still expects to see an IEnumerable<IGrouping<TKey, TElement>> collection, it doesn't make sense to perform a GroupBy operation that would result in the condition being re-evaluated a second time for each element. Instead, I would like to be able to create an IEnumerable<IGrouping<TKey, TElement>> by wrapping the filtered results appropriately, and thus avoiding yet another evaluation of the condition.
Other than implementing my own class that provides the IGrouping interface, is there any other way I can implement this optimization? Are there existing LINQ methods to support this that would give me the IEnumerable<IGrouping<TKey, TElement>> result? Is there another way that I haven't considered?
the condition is calculated once
I hope those keys are still around somewhere...
If your data was in some structure like this:
public class CustomGroup<T, U>
{
public T Key {get;set;}
public IEnumerable<U> GroupMembers {get;set;}
}
You could project such items with a query like this:
var result = customGroups
.SelectMany(cg => cg.GroupMembers, (cg, z) => new {Key = cg.Key, Value = z})
.GroupBy(x => x.Key, x => x.Value);
Inspired by David B's answer, I have come up with a simple solution. So simple that I have no idea how I missed it.
In order to perform the filtering, I obviously need to know what value of the condition I am filtering by. Therefore, given a condition, c, I can just project the filtered list as:
filteredList.GroupBy(x => c)
This avoids any recalculation of properties on the elements (represented by x).
Another solution I realized would work is to reverse the ordering of my query and perform the grouping before I perform the filtering. This too would mean the conditions only get evaluated once, although it would unnecessarily allocate groupings that I wouldn't subsequently use.
What about putting the result into a Lookup and using this for the rest of the time?
var lookup = data.ToLookup(i => Foo(i));
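A sketch of why the lookup suggestion fits the question: ILookup<TKey, TElement> already implements IEnumerable<IGrouping<TKey, TElement>>, so filtering its groups by Key keeps the expected shape without calling the key function again. `ExpensiveKey` and the call counter are hypothetical stand-ins for the costly condition.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LookupDemo
{
    // Counts how often the costly condition is evaluated.
    public static int KeyCalls;

    // Hypothetical stand-in for the expensive condition.
    private static int ExpensiveKey(int value)
    {
        KeyCalls++;
        return value % 3;
    }

    // Groups once, then filters whole groups by their already-computed key.
    public static IEnumerable<IGrouping<int, int>> Filtered(IEnumerable<int> data, int wantedKey)
    {
        // ToLookup runs immediately and evaluates the key once per element.
        ILookup<int, int> lookup = data.ToLookup(v => ExpensiveKey(v));

        // An ILookup is itself a sequence of IGrouping, so no regrouping is needed.
        return lookup.Where(g => g.Key == wantedKey);
    }
}
```

The unfiltered case can return the lookup itself, so both paths give the subsequent code the same IEnumerable<IGrouping<TKey, TElement>> type.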