Precompile Lambda Expression Tree conversions as constants? - c#

It is fairly common to take an Expression tree, and convert it to some other form, such as a string representation (for example this question and this question, and I suspect Linq2Sql does something similar).
In many cases, perhaps even most cases, the Expression tree conversion will always be the same, i.e. if I have a function
public string GenerateSomeSql<TResult, TProperty>(Expression<Func<TResult, TProperty>> expression)
then any call with the same argument will always return the same result for example:
GenerateSomeSql(x => x.Age) //suppose this will always return "select Age from Person"
GenerateSomeSql(x => x.Ssn) //suppose this will always return "select Ssn from Person"
So, in essence, the function call with a particular argument is really just a constant, except time is wasted at runtime re-computing it continuously.
Assuming, for the sake of argument, that the conversion was sufficiently complex to cause a noticeable performance hit, is there any way to pre-compile the function call into an actual constant?
Edit
It appears that there is no way to do this exactly within C# itself. The closest you can probably come within C# is the accepted answer (though of course you would want to make sure that the caching itself wasn't slower than regenerating). To actually convert to true constants, I suspect that with some work you could use something like Mono.Cecil to modify the bytecode after compilation.

The excellent LINQ IQueryable Toolkit project has a query cache that does something similar to what you've described. It contains an ExpressionComparer class that walks the hierarchy of two expressions and determines if they are equivalent. This technique is also used to collect references to common properties for parameterization and in the removal of redundant joins.
All you would need to do is come up with an expression hashing strategy so you can store the results of your processed expressions in a dictionary, ready for future reuse.
Your method would then look something like this:
private readonly IDictionary<Expression, string> _cache
    = new Dictionary<Expression, string>(new ExpressionEqualityComparer());

public string GenerateSomeSql<TResult, TProperty>(Expression<Func<TResult, TProperty>> expression)
{
    string sql;
    if (!_cache.TryGetValue(expression, out sql))
    {
        //process expression here, assigning the generated SQL to sql
        _cache.Add(expression, sql);
    }
    return sql;
}
class ExpressionEqualityComparer : IEqualityComparer<Expression>
{
    public bool Equals(Expression x, Expression y)
    {
        return ExpressionComparer.AreEqual(x, y);
    }

    public int GetHashCode(Expression obj)
    {
        return ExpressionHasher.GetHash(obj);
    }
}

First of all, I suspect that your assumption about compiling an expression causing a performance hit will not actually pan out in reality. My experience shows that there are many more factors (database access, network latency, very poor algorithms) that cause performance bottlenecks before regular "good" code causes issues. Premature optimization is the root of all evil, so build your application and run stress tests to find the actual performance bottlenecks, as they are often not where you would expect.
With that said, I think that pre-compilation depends on what the Expression is being translated into. I know that with LINQ to SQL you can call DataContext.GetCommand(IQueryable) and retrieve a DbCommand, which you could then cache and reuse.
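For illustration, a rough sketch of caching the generated command text with LINQ to SQL (the PersonDataContext and Person types, the Persons table, and the cache key are assumptions for the example, not anything from the question):

using System.Collections.Generic;
using System.Data.Common;
using System.Linq;

public class GeneratedSqlCache
{
    private readonly Dictionary<string, string> _sqlByKey = new Dictionary<string, string>();

    public string GetAdultsSql(PersonDataContext db)
    {
        string sql;
        if (!_sqlByKey.TryGetValue("Adults", out sql))
        {
            IQueryable<Person> query = db.Persons.Where(p => p.Age >= 18);
            DbCommand command = db.GetCommand(query); // expression-to-SQL translation happens once here
            sql = command.CommandText;
            _sqlByKey.Add("Adults", sql);
        }
        return sql;
    }
}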

Related

Is there a benefit in using Expressions to build dynamic LINQ queries compared to chaining Funcs if I am not using SQL?

I need to build a dynamic query that can query a large list of objects and get the objects which satisfy a complex predicate known at runtime. I know I want to do it upfront and pass it into the collection to filter on, rather than create some complex switch case on the collection itself.
Everything points me to Expressions and Predicate Builder, which I'm happy to use to chain together expressions in a loop like:
Expression<Func<MyObject, bool>> query = PredicateBuilder.True<MyObject>();
query = query.And(x => x.Field == passedInSearchCriterion);
but I could also do that with:
Func<MyObject, bool> query = x => true;
var prev = query; // capture the current delegate so the new lambda doesn't call itself
query = x => prev(x) && x.Field == passedInSearchCriterion;
I know the first is better in the case of LINQ to SQL converting it to SQL to execute in the database, e.g. when given to Entity Framework or something similar.
But say they were both run locally, not in a database, on a large list, is there any performance difference then in terms of how the resulting function is executed?
I know the first is better in the case of LINQ to SQL converting it to SQL to execute in the database, e.g. when given to Entity Framework or something similar.
No, you don't "know" it's better because you don't understand the difference between expressions and delegates.
The main difference is that expressions are effectively descriptions of a piece of code, and can be inspected to find out information like parameter names - this is why ORMs use them, to map POCOs to SQL columns - while delegates are nothing more than pointers to a method to be executed. As such, there are optimizations the C# compiler can perform on delegates which it cannot do for expressions. Further details here.
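A quick sketch of that difference (Person and somePerson are illustrative assumptions):

Expression<Func<Person, int>> expr = p => p.Age; // a data structure describing the code
Func<Person, int> del = p => p.Age;              // a compiled, directly invokable method

var member = (MemberExpression)expr.Body;
Console.WriteLine(member.Member.Name); // "Age" - the kind of thing ORMs inspect
Console.WriteLine(del(somePerson));    // just executes; there is nothing to inspect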
So yes, there will be a performance difference, almost certainly in favour of delegates. Whether that difference is quantifiable and/or relevant to your use-case is something only you can determine via benchmarks.
But any performance difference is irrelevant anyway, because you don't need expressions for your use-case. Just use delegates, which will always be faster.
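To make that concrete, here is a minimal sketch of the question's loop using only Funcs (criteria is an assumed collection of search values):

Func<MyObject, bool> query = x => true;
foreach (var criterion in criteria)
{
    var prev = query; // capture the current delegate so the new lambda calls the old chain, not itself
    query = x => prev(x) && x.Field == criterion;
}
var results = myList.Where(query).ToList();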

Any benefit of using yield in this case?

I am maintaining some code at work and the original author is gone so thought I would ask here to see if I can satisfy my curiosity.
Below is a bit of code (anonymized) where yield is being used. As far as I can tell it does not add any benefit and just returning a list would be sufficient, maybe more readable as well (for me at least). Just wondering if I am missing something because this pattern is repeated in a couple of places in the code base.
public virtual IEnumerable<string> ValidProductTypes
{
    get
    {
        yield return ProductTypes.Type1;
        yield return ProductTypes.Type2;
        yield return ProductTypes.Type3;
    }
}
This property is used as a parameter for some class which just uses it to populate a collection:
var productManager = new ProductManager(ValidProductTypes);

public ProductManager(IEnumerable<string> validProductTypes)
{
    var myFilteredList = GetFilteredTypes(validProductTypes);
}
public ObservableCollection<ValidType> GetFilteredTypes(IEnumerable<string> validProductTypes)
{
    var filteredList = validProductTypes
        .Select(type => TypeIsValid(type)); // TypeIsValid maps the string to a ValidType (anonymised)
    return new ObservableCollection<ValidType>(filteredList);
}
I'd say that returning an IEnumerable<T> and implementing that using yield return is the simplest option.
If you see that a method returns an IEnumerable<T>, there really is only one thing you can do with it: iterate it. Any more complicated operations on it (like using LINQ) are just encapsulated specific ways of iterating it.
If a method returns an array or list, you also gain the ability to mutate it and you might start wondering if that's an acceptable use of the API. For example, what happens if you do ValidProductTypes.Add("a new product")?
If you're talking just about the implementation, then the difference becomes much smaller. But the caller would still be able to cast the returned array or list from IEnumerable<T> to its concrete type and mutate that. The chance that anyone would actually think this was the intended use of the API is small, but with yield return, the chance is zero, because it's not possible.
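For example, a sketch of the cast described above (GetTypesAsList is a hypothetical method whose declared return type is IEnumerable<string> but which actually returns a List<string>):

IEnumerable<string> types = GetTypesAsList();
((List<string>)types).Add("a new product"); // compiles and succeeds: the API is subverted
// With yield return there is no underlying list to cast to, so this cannot happen.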
Considering that, and since the syntax has roughly the same complexity and ease of understanding, I think yield return is a reasonable choice. Though with C# 6.0 expression-bodied properties, the syntax for arrays might get the upper hand:
public virtual IEnumerable<string> ValidProductTypes =>
    new[] { ProductTypes.Type1, ProductTypes.Type2, ProductTypes.Type3 };
The above answer is assuming that this is not performance-critical code, so fairly small differences in performance won't matter. If this is performance-critical code, then you should measure which option is better. And you might also want to consider getting rid of allocations (probably by caching the array in a field, or something like that), which might be the dominant factor.
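For what it's worth, a minimal sketch of the caching idea just mentioned (the field name is assumed; the type names come from the question):

private static readonly string[] _validProductTypes =
    { ProductTypes.Type1, ProductTypes.Type2, ProductTypes.Type3 };

public virtual IEnumerable<string> ValidProductTypes => _validProductTypes;

This allocates the array once instead of on every call, at the cost of the cast-and-mutate caveat discussed above.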

Query for string properties vs method calls

Is there any noticeable performance decrease when using LINQ queries on string properties versus method calls that return the same string value, in an IEnumerable list? If not, is there any other queryable interface for LINQ that makes a performance difference?
What I mean is;
public class MyForm
{
    public string FormName { get; set; }

    public string GetFormName()
    {
        return FormName;
    }
}
List<MyForm> MyFormList;
//1)
var result = MyFormList.Where(f=>f.FormName=="SalesForm").SingleOrDefault();
//2)
var result = MyFormList.Where(f=>f.GetFormName()=="SalesForm").SingleOrDefault();
Is there any noticeable performance decrease between option 1 and option 2?
Is there any technique that .NET uses to index string properties for better performance when a LINQ query is executed, other than IEnumerable, that LINQ can still query?
My assumption: since IEnumerable just iterates all items, there is not much difference between accessing a string property and getting the string value by calling the relevant method.
Am I right?
Properties are methods. A property is translated into a (pair of) method(s) at compile time. (Often the jitter will then inline these methods, so there's not really any call-stack performance penalty for this).
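Roughly speaking, the compiler turns the question's auto-property into something like this (names are illustrative; the real backing field gets a compiler-generated name):

private string _formName; // compiler-generated backing field

public string get_FormName() { return _formName; }        // what reads of FormName compile to
public void set_FormName(string value) { _formName = value; }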
Iterating an IEnumerable will look at each item. There are cases where it may build a HashSet behind the scenes, but it still needs to do that for each item in the sequence at least once, whether property or method, and none of the included IEnumerable operators (to my knowledge) treat the two any differently.
Where you might see a difference is in any extra work the method does, or doesn't do, to get the result you need. If either the property or the method is inherently faster, those small differences can add up when called over and over during evaluation of a LINQ expression.
Yes, IEnumerable will do a linear search with O(n) complexity. There is unlikely to be a measurable difference between a field, a property, or a method call returning a string (make sure to measure if it's actually important).
If you need lookup that is faster - dictionary is the better choice with O(1) lookup.
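For example, a sketch of that dictionary approach using the question's types (assuming FormName values are unique):

var formsByName = MyFormList.ToDictionary(f => f.FormName);

MyForm result;
formsByName.TryGetValue("SalesForm", out result); // O(1) average-case lookup instead of a linear scan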
Notes
if you are querying a DB with Linq-to-SQL, such property access will be translated into a SQL query and will likely be optimized by the SQL engine to be close to O(1) on indexed fields.
a property is a method - so an automatic property and a method directly returning the backing field should have the same performance. In your sample you have a method that returns the value of a property that in turn returns the value of a backing field, which may cause some difference, but there is a good chance that both calls will be inlined by the JIT anyway.
you can implement your own IQueryable source to provide optimized search/where methods and get queries compiled into Queryable extension calls instead of Enumerable ones.

What's the most efficient way to get only the final row of a SQL table using EF4?

I'm looking to retrieve the last row of a table by the table's ID column. What I am currently using works:
var x = db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault();
Is there any way to get the same result with more efficient speed?
I can't see why this would query through the entire table.
Do you not have an index on the ID column?
Can you add the results of analysing the query to your question, because this is not how it should be.
As well as the analysis results, the SQL produced. I can't see how it would be anything other than select top 1 * from MyTable order by id desc, only with explicit column names and some aliasing. Nor, if there's an index on id, how it would be anything other than a scan on that index.
Edit: That promised explanation.
Linq gives us a set of common interfaces, and in the case of C# and VB.NET some keyword support, for a variety of operations upon sources which return 0 or more items (e.g. in-memory collections, database calls, parsing of XML documents, etc.).
This allows us to express similar tasks regardless of the underlying source. Your query for example includes the source, but we could do a more general form of:
public static YourType FinalItem(IQueryable<YourType> source)
{
    return source.OrderByDescending(d => d.ID).FirstOrDefault();
}
Now, we could do:
IEnumerable<YourType> l = SomeCallThatGivesUsAList();
var x = FinalItem(db.MyTable); //same as your code.
var y = FinalItem(l.AsQueryable()); //item in list with highest id.
var z = FinalItem(db.MyTable.Where(d => d.ID % 10 == 0)); //item with highest id that ends in zero.
But the really important part, is that while we've a means of defining the sort of operation we want done, we can have the actual implementation hidden from us.
The call to OrderByDescending produces an object that has information on its source, and the lambda function it will use in ordering.
The call to FirstOrDefault in turn has information on that, and uses it to obtain a result.
In the case with the list, the implementation is to produce the equivalent Enumerable-based code (Queryable and Enumerable mirror each other's public members, as do the interfaces they use such as IOrderedQueryable and IOrderedEnumerable and so on).
This is because, with a list that we don't know is already sorted in the order we care about (or in the opposite order), there isn't any faster way than to examine each element. The best we can hope for is an O(n) operation, and we might get an O(n log n) operation - depending on whether the implementation of the ordering is optimised for the possibility of only one item being taken from it*.
Or to put it another way, the best we could hand-code in code that only worked on enumerables is only slightly more efficient than:
public static YourType FinalItem(IEnumerable<YourType> source)
{
    YourType highest = default(YourType);
    int highestID = int.MinValue;
    foreach(YourType item in source)
    {
        int curID = item.ID;
        if(highest == null || curID > highestID)
        {
            highest = item;
            highestID = curID;
        }
    }
    return highest;
}
We can do slightly better with some micro-opts on handling the enumerator directly, but only slightly and the extra complication would just make for less-good example code.
Since we can't do any better than that by hand, and since the linq code doesn't know anything more about the source than we do, that's the best we could possibly hope for it matching. It might do less well (again, depending on whether the special case of our only wanting one item was thought of or not), but it won't beat it.
However, this is not the only approach linq will ever take. It'll take a comparable approach with an in-memory enumerable source, but your source isn't such a thing.
db.MyTable represents a table. To enumerate through it gives us the results of an SQL query more or less equivalent to:
SELECT * FROM MyTable
However, db.MyTable.OrderByDescending(d => d.ID) is not the equivalent of calling that, and then ordering the results in memory. Because queries get processed as a whole when they are executed, we actually get the result of an SQL query more or less like:
SELECT * FROM MyTable ORDER BY id DESC
Finally, the entire query db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault() results in a query like:
SELECT TOP 1 * FROM MyTable ORDER BY id DESC
Or
SELECT * FROM MyTable ORDER BY id DESC LIMIT 1
Depending upon what sort of database server you are using. Then the results get passed to code equivalent to the following ADO.NET-based code:
return dataReader.Read()
    ? new MyType
      {
          ID = dataReader.GetInt32(0),
          Age = dataReader.GetInt32(1),   // property names assumed - "or similar"
          Name = dataReader.GetString(2)
      }
    : null;
You can't get much better.
And as for that SQL query. If there's an index on the id column (and since it looks like a primary key, there certainly should be), then that index will be used to very quickly find the row in question, rather than examining each row.
In all, because different linq providers use different means to fulfil the query, they can all try their best to do so in the best way possible. Of course, being in an imperfect world we'll no doubt find that some are better than others. What's more, they can even work to pick the best approach for different conditions. One example of this is that database-related providers can choose different SQL to take advantage of features of different versions of databases. Another is that the implementation of the version of Count() that works with in-memory enumerations works a bit like this:
public static int Count<T>(this IEnumerable<T> source)
{
    var asCollT = source as ICollection<T>;
    if(asCollT != null)
        return asCollT.Count;

    var asColl = source as ICollection;
    if(asColl != null)
        return asColl.Count;

    int tally = 0;
    foreach(T item in source)
        ++tally;
    return tally;
}
This is one of the simpler cases (and a bit simplified again in my example here; I'm showing the idea, not the actual code), but it shows the basic principle of code taking advantage of more efficient approaches when they're available - the O(1) length of arrays, and the Count property on collections that is sometimes O(1) (and we've made nothing worse in the cases where it's O(n)) - and falling back to a less efficient but still functional approach when they aren't available.
The result of all of this is that Linq tends to give very good bang for buck, in terms of performance.
Now, a decent coder should be able to match or beat its approach to any given case most of the time†, and even when Linq comes up with the perfect approach there are some overheads to it itself.
However, over the scope of an entire project, using Linq means that we can concisely create reasonably efficient code that relates to a relatively constrained number of well defined entities (generally one per table as far as databases go). In particular, the use of anonymous types and joins means that we get queries that are very good. Consider:
var result = from a in db.Table1
             join b in db.Table2
               on a.relatedBs equals b.id
             select new { a.id, b.name };
Here we're ignoring columns we don't care about, and the SQL produced will do the same. Consider what we would do if we were creating the objects that a and b relate to with hand-coded DAO classes:
1. Create a new class to represent this combination of a's id and b's name, and the relevant code to run the query we need to produce instances.
2. Run a query to obtain all information about each a and the related b, and live with the waste.
3. Run a query to obtain the information on each a and b that we care about, and just set default values for the other fields.
Of these, option 2 will be wasteful, perhaps very wasteful. Option 3 will be a bit wasteful and very error prone (what if we accidentally try to use a field elsewhere in the code that wasn't set correctly?). Only option 1 will be more efficient than what the linq approach will produce, but this is just one case. Over a large project this could mean producing dozens or even hundreds or thousands of slightly different classes (and unlike the compiler, we won't necessarily spot the cases where they're actually the same). In practice, therefore, linq can do us some great favours when it comes to efficiency.
Good policies for efficient linq are:
Stay with the type of query you start with as long as you can. Whenever you grab items into memory with ToList() or ToArray() etc., consider if you really need to. Unless you need to, or you can clearly state the advantage doing so gives you, don't.
If you do need to move to processing in memory, favour AsEnumerable() over ToList() and the other means, so you only grab one item at a time (see the sketch after this list).
Examine long-running queries with SQLProfiler or similar. There are a handful of cases where policy 1 here is wrong and moving to memory with AsEnumerable() is actually better (most relate to uses of GroupBy that don't use aggregates on the non-grouped fields, and hence don't actually have a single SQL query they correspond with).
If a complicated query is hit many times, then CompiledQuery can help (less so with 4.5 since it has automatic optimisations that cover some of the cases it helps in), but it's normally better to leave that out of the first approach and then use it only in hot-spots that are efficiency problems.
You can get EF to run arbitrary SQL, but avoid it unless it's a strong gain, because too much of such code reduces the consistent readability that using a linq approach throughout gives (I have to say though, I think Linq2SQL beats EF on calling stored procedures, and even more so on calling UDFs, but even there this still applies - it's less clear from just looking at the code how things relate to each other).
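As promised in policy 2, a minimal sketch of that pattern under assumed names (db.MyTable borrowed from the earlier example; FormatRow is hypothetical):

var report = db.MyTable
    .Where(d => d.ID > 1000)   // still IQueryable: runs as SQL on the server
    .AsEnumerable()            // switch to in-memory processing; rows stream one at a time
    .Select(d => FormatRow(d)) // arbitrary C# with no SQL translation
    .ToList();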
*AFAIK, this particular optimisation isn't applied, but we're talking about the best possible implementation at this point, so it doesn't matter if it is, isn't, or is in some versions only.
†I'll admit though that Linq2SQL would often produce queries that use APPLY that I would not think of, as I was used to thinking of how to write queries in versions of SQLServer before 2005 introduced it, while code doesn't have those sort of human tendencies to go with old habits. It pretty much taught me how to use APPLY.

Write a lambda expression to perform a calculation on a list

I have a List/IEnumerable of objects and I'd like to perform a calculation on some of them.
e.g.
myList.Where(f=>f.Calculate==true).Calculate();
to update myList, based on the Where clause, so that the required calculation is performed and the entire list updated as appropriate.
The list contains "lines" where an amount is either in Month1, Month2, Month3...Month12, Year1, Year2, Year3-5 or "Long Term"
Most lines are fixed and always fall into one of these months, but some "lines" are calculated based upon their "Maturity Date".
Oh, and just to complicate things! the list (at the moment) is of an anonymous type from a couple of linq queries. I could make it a concrete class if required though, but I'd prefer not to if I can avoid it.
So, I'd like to call a method that works on only the calculated lines, and puts the correct amount into the correct "month".
I'm not worried about the calculation logic, but rather how to get this into an easily readable method that updates the list without, ideally, returning a new list.
[Is it possible to write a lambda extension method to do both the calculation AND the where - or is this overkill anyway as Where() already exists?]
Personally, if you want to update the list in place, I would just use a simple loop. It will be much simpler to follow and maintain:
for (int i = 0; i < list.Count; ++i)
{
    if (list[i].ShouldCalculate)
        list[i] = list[i].Calculate();
}
This, at least, makes it much more obvious that it's going to mutate the list. LINQ carries the expectation of performing a query, not mutating the data.
If you really want to use LINQ for this, you can - but it will still require a copy if you want to have a List<T> as your results:
myList = myList.Select(f => f.ShouldCalculate ? f.Calculate() : f).ToList();
This would call your Calculate() method as needed, and copy the original when not needed. It does require a copy to create a new List<T>, though, as you mentioned that was a requirement (in comments).
However, my personal preference would still be to use a loop in this case. I find the intent much more clear - plus, you avoid the unnecessary copy operation.
Edit #2:
Given this comment:
Oh, and just to complicate things! the list (at the moment) is of an anonymous type from a couple of linq queries
If you really want to use LINQ style syntax, I would recommend just not calling ToList() on your original queries. If you leave them in their original, IEnumerable<T> form, you can easily do my second option above, but on the original query:
var myList = query.Select(f => f.ShouldCalculate ? f.Calculate() : f).ToList();
This has the advantage of only constructing the list one time, and preventing the copy, as the original sequence will not get evaluated until this operation.
LINQ is mostly geared around side-effect-free queries, and anonymous types themselves are immutable (although of course they can maintain references to mutable types).
Given that you want to mutate the list in place, LINQ isn't a great fit.
As per Reed's suggestion, I would use a straight for loop. However, if you want to perform different calculations at different points, you could encapsulate this:
public static void Recalculate<T>(IList<T> list,
                                  Func<T, bool> shouldCalculate,
                                  Func<T, T> calculation)
{
    for (int i = 0; i < list.Count; i++)
    {
        if (shouldCalculate(list[i]))
        {
            list[i] = calculation(list[i]);
        }
    }
}
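The call site would then look something like this (assuming the element type exposes ShouldCalculate and Calculate(), as in Reed's answer):

Recalculate(myList,
    item => item.ShouldCalculate,
    item => item.Calculate());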
If you really want to use this in a fluid way, you could make it return the list - but I would personally be against that, as it would then look like it was side-effect-free like LINQ.
And like Reed, I'd also prefer to do this by creating a new sequence...
Select doesn't copy or clone the objects it passes to the supplied delegate, so any state changes to such an object will be reflected through the reference in the container (unless it is a value type).
So updating reference types is not a problem.
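A short sketch of that point (assuming a hypothetical mutable reference type with a Processed flag - not the question's anonymous types, which are immutable):

var processed = myList.Select(l => { l.Processed = true; return l; }).ToList();
// processed and myList now hold the same object instances, so the mutation
// is visible through myList as well - nothing was copied.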
To replace the objects (or when working with value types¹) things are more complex, and there is no inbuilt solution with LINQ. A for loop is clearest (as with the other answers).
¹ Remembering, of course, that mutable value types are evil.
