Simple LINQ count, or so I thought - c#

I am trying to get a count of this:
Model.Version.Where(model => model.revision != Model.revision).Count();
However it tells me in VS that I cannot use Lambda expressions.
The model is of type Documents, which is keyed to Version.
I need the count of any documents in the version table for the model documents where the revision is greater than the model revision.
This will either be a 0 or a 1, could sometimes be higher than 1 I suppose.
What am I doing wrong?
if (Model.Version.Where(model => model.revision > Model.revision).Count() > 0)
{
// do something
}

As others have said, your real code should be fine: it sounds like the problem was only that you were trying to execute this in the debugger instead of in normal code. Personally I'm always somewhat leery of trying to take things too far in the debugger - it can be useful of course, but if things behave unexpectedly, I'd always see whether the same code works as part of a real program, rather than assuming there's something fundamentally wrong with the approach. The debugger has to work under rather different constraints than the normal compilation and execution process.
Likewise, as others have said, it's better to use Any() than Count() > 0. However, cleaner yet is to use the overload of Any accepting a predicate:
if (Model.Version.Any(model => model.revision > Model.revision))
{
...
}
Note however that that's not quite the same as your initial predicate, which was asking for any versions which had a different revision rather than a higher revision. You may want:
if (Model.Version.Any(model => model.revision != Model.revision))
{
...
}
It's worth noting that in LINQ to Objects, using Any can have very real performance benefits over using Count() > 0. In providers which convert the query to a different form (e.g. SQL) there may be no performance benefit, but there's a clarity benefit in saying exactly what you're interested in - you don't really care about the count, you only care if there are any matching items.
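To make the LINQ to Objects difference concrete, here is a minimal sketch that counts how often the predicate runs in each case. The class name AnyVersusCount, the list of 1000 revision numbers and the currentRevision value are made up for illustration; they are not the question's actual model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnyVersusCount
{
    // Returns how many times the predicate was invoked by Any(...)
    // and by Count(...) > 0 over the same in-memory list.
    public static (int anyCalls, int countCalls) CountPredicateCalls()
    {
        // Hypothetical data: 1000 revisions, all higher than currentRevision.
        List<int> revisions = Enumerable.Range(1, 1000).ToList();
        int currentRevision = 0;

        int anyCalls = 0;
        // Any stops at the first element that satisfies the predicate.
        _ = revisions.Any(r => { anyCalls++; return r > currentRevision; });

        int countCalls = 0;
        // Count has to test every element before it can report a total.
        _ = revisions.Count(r => { countCalls++; return r > currentRevision; }) > 0;

        return (anyCalls, countCalls);
    }
}
```

Calling AnyVersusCount.CountPredicateCalls() should show the predicate running once for Any and 1000 times for Count, which is where the saving comes from on large in-memory sequences.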

Related

What are we guaranteed regarding side-effects in LINQ predicates?

I just saw this bit of code that has a count++ side-effect in the .GroupBy predicate. (originally here).
object[,] data; // This contains all the data.
int count = 0;
List<string[]> dataList = data.Cast<string>()
.GroupBy(x => count++ / data.GetLength(1))
.Select(g => g.ToArray())
.ToList();
This terrifies me because I have no idea how many times the implementation will invoke the key selector function. And I also don't know if the function is guaranteed to be applied to each item in order. I realize that, in practice, the implementation may very well just call the function once per item in order, but I never assumed that as being guaranteed, so I'm paranoid about depending on that behaviour -- especially given what may happen on other platforms, other future implementations, or after translation or deferred execution by other LINQ providers.
As it pertains to a side-effect in the predicate, are we offered some kind of written guarantee, in terms of a LINQ specification or something, as to how many times the key selector function will be invoked, and in what order?
Please, before you mark this question as a duplicate, I am looking for a citation of documentation or specification that says one way or the other whether this is undefined behaviour or not.
For what it's worth, I would have written this kind of query the long way, by first performing a select query with a predicate that takes an index, then creating an anonymous object that includes the index and the original data, then grouping by that index, and finally selecting the original data out of the anonymous object. That seems more like a correct way of doing functional programming. And it also seems more like something that could be translated to a server-side query. The side-effect in the predicate just seems wrong to me - and against the principles of both LINQ and functional programming, so I would assume there would be no guarantee specified and that this may very well be undefined behaviour. Is it?
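As a rough sketch of that "long way" (the names RowGrouping and ToRows are invented here, and the code assumes every cell of the array holds a string), the indexed overload of Select can carry the position instead of a captured counter:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowGrouping
{
    // Splits a 2D array into one string[] per row without mutating
    // any captured state inside a selector.
    public static List<string[]> ToRows(object[,] data)
    {
        int width = data.GetLength(1);

        return data.Cast<string>()
            // Pair each value with its position in row-major order.
            .Select((value, index) => new { value, index })
            // The key depends only on the element's own index.
            .GroupBy(x => x.index / width, x => x.value)
            .Select(g => g.ToArray())
            .ToList();
    }
}
```

Because the key is derived from the index that travels with each element, the result no longer depends on how many times, or in what order, the key selector happens to be called.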
I realize this question may be difficult to answer if the documentation and LINQ specification actually says nothing regarding side effects in predicates. I want to know specifically whether:
Specs say it's permissible and how. (I doubt it)
Specs say it's undefined behaviour (I suspect this is true and am looking for a citation)
Specs say nothing. (Sloppy spec, if you ask me, but it would be nice to know if others have searched for language regarding side-effects and also come up empty. Just because I can't find it doesn't mean it doesn't exist.)
According to the official C# Language Specification, on page 203, we can read (emphasis mine):
12.17.3.1 The C# language does not specify the execution semantics of query expressions. Rather, query expressions are
translated into invocations of methods that adhere to the
query-expression pattern (§12.17.4). Specifically, query expressions
are translated into invocations of methods named Where, Select,
SelectMany, Join, GroupJoin, OrderBy, OrderByDescending, ThenBy,
ThenByDescending, GroupBy, and Cast. These methods are expected to
have particular signatures and return types, as described in §12.17.4.
These methods may be instance methods of the object being queried or
extension methods that are external to the object. These methods
implement the actual execution of the query.
From looking at the source code of GroupBy in corefx on GitHub, it does seem like the key selector function is indeed called once per element, in the order in which the previous IEnumerable provides them. I would in no way consider this a guarantee, though.
In my view, any IEnumerables which cannot be enumerated multiple times safely are a big red flag that you may want to reconsider your design choices. An interesting issue that could arise from this is that for example if you view the contents of this IEnumerable in the Visual Studio debugger, it will probably break your code, since it would cause the count variable to go up.
The reason this code hasn't exploded up until now is probably because the IEnumerable is never stored anywhere, since .ToList is called right away. Therefore there is no risk of multiple enumerations (again, with the caveat about viewing it in the debugger and so on).
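Here is a small sketch of what goes wrong when such a query is kept and enumerated twice. The class name and the four-item array are made up, and the sizes it produces depend on the key selector being called once per element in order, which is what current LINQ to Objects implementations appear to do rather than something guaranteed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SideEffectReenumeration
{
    // Enumerates the same deferred GroupBy query twice and returns
    // the group sizes seen on each pass.
    public static (int[] firstPass, int[] secondPass) GroupSizes()
    {
        var items = new[] { "a", "b", "c", "d" };
        int count = 0;

        // Deferred: the key selector does not run until enumeration.
        IEnumerable<IGrouping<int, string>> query = items.GroupBy(x => count++ / 3);

        // First pass starts with count == 0.
        int[] firstPass = query.Select(g => g.Count()).ToArray();

        // Second pass re-runs the selector, but count was never reset,
        // so the same items land in differently shaped groups.
        int[] secondPass = query.Select(g => g.Count()).ToArray();

        return (firstPass, secondPass);
    }
}
```

Materialising immediately with ToList(), as the original snippet does, avoids the second pass, but the query itself remains unsafe to hand around.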

Is .Select<T>(...) to be preferred over .Where<T>(...)?

I got into a discussion with two colleagues regarding a setup for an iteration over an IEnumerable (the contents of which will not be altered in any way during the operation). There are three conflicting theories on which is the optimal approach. Both of the others (and I as well) are very certain, and that has left me unsure, so for the sake of clarity, I want to check with an external source.
The scenario is as follows. We had the code below as a starting point and discovered that some of the hazaas need not be acted upon. So, starting with the code below, we began adding a blocker for the action.
foreach(Hazaa hazaa in hazaas) ;
My suggestion is as follows.
foreach(Hazaa hazaa in hazaas.Where(element => condition)) ;
One of the guys wants to resolve it with a more explicit form, claiming that LINQ is not appropriate in this case (I'm not sure why, but he seems very convinced). His solution is this.
foreach(Hazaa hazaa in hazaas)
if(condition) ;
The other counter-suggestion is supported by the claim that Where risks repeating the filtering process needlessly and that it's more certain to minimize the computational workload by picking the appropriate elements once and for all with Select.
foreach(Hazaa hazaa in hazaas.Select(element => condition)) ;
I argue that the first is obsolete, since LINQ can handle data objects quite well.
I also believe that Select-ing is in this case equivalently fast to Where-ing and no needless steps will be taken (e.g. the evaluation of the condition on the elements will only be performed once). If anything, it should be faster using Where because we won't be creating an extra instance of anything.
Who's right?
Select is inappropriate. It doesn't filter anything.
if is a possible solution, but Where is just as explicit.
Where executes the condition exactly once per item, just as the if. Additionally, it is important to note that the call to Where doesn't iterate the list. So, using Where you iterate the list exactly once, just like when using if.
I think you are discussing with one person that didn't understand LINQ - the guy that wants to use Select - and one that doesn't like the functional aspect of LINQ.
I would go with Where.
The .Where() and the if(condition) approach will be the same.
But since LINQ is nicely readable, I'd prefer that.
The approach with .Select() is nonsense, since it will not return the Hazaa objects but an IEnumerable<Boolean>.
To be clear about the functions:
myEnumerable.Where(a => isTrueFor(a)) //This is filtering
myEnumerable.Select(a => a.b) //This is projection
Where() will run a function that returns a Boolean for each item of the enumerable, and will return the item or skip it depending on that result.
Select() will run a function for every item in the list and return the result of the function without doing any filtering.
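A quick sketch of both points, using plain ints instead of Hazaa objects (the class name, the sample numbers and the isEven predicate are invented for the example): Where defers the work and evaluates the condition once per item, while Select with a condition just produces booleans.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhereVersusSelect
{
    public static (List<int> filtered, List<bool> projected,
                   int callsBeforeLoop, int callsAfterLoop) Run()
    {
        var numbers = new[] { 1, 2, 3, 4 };
        int calls = 0;
        Func<int, bool> isEven = n => { calls++; return n % 2 == 0; };

        // Building the query does not iterate the source.
        IEnumerable<int> query = numbers.Where(isEven);
        int callsBeforeLoop = calls;

        // The foreach drives the filter: one predicate call per source item.
        var filtered = new List<int>();
        foreach (int n in query)
            filtered.Add(n);
        int callsAfterLoop = calls;

        // Select with a condition projects each item to a bool; nothing is removed.
        List<bool> projected = numbers.Select(n => n % 2 == 0).ToList();

        return (filtered, projected, callsBeforeLoop, callsAfterLoop);
    }
}
```

So the foreach over Where(...) does the same amount of condition checking as the explicit if, and the Select variant would hand the loop body a sequence of true/false values rather than the elements.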

What's the most efficient way to get only the final row of a SQL table using EF4?

I'm looking to retrieve the last row of a table by the table's ID column. What I am currently using works:
var x = db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault();
Is there any way to get the same result with more efficient speed?
I cannot see that this queries through the entire table.
Do you not have an index on the ID column?
Can you add the results of analysing the query to your question, because this is not how it should be.
As well as the analysis results, the SQL produced. I can't see how it would be anything other than select top 1 * from MyTable order by id desc only with explicit column names and some aliasing. Nor if there's an index on id how it's anything other than a scan on that index.
Edit: That promised explanation.
Linq gives us a set of common interfaces, and in the case of C# and VB.NET some keyword support, for a variety of operations upon sources which return 0 or more items (e.g. in-memory collections, database calls, parsing of XML documents, etc.).
This allows us to express similar tasks regardless of the underlying source. Your query for example includes the source, but we could do a more general form of:
public static YourType FinalItem(IQueryable<YourType> source)
{
return source.OrderByDescending(d => d.ID).FirstOrDefault();
}
Now, we could do:
IEnumerable<YourType> l = SomeCallThatGivesUsAList();
var x = FinalItem(db.MyTable);//same as your code.
var y = FinalItem(l.AsQueryable());//item in list with highest id.
var z = FinalItem(db.MyTable.Where(d => d.ID % 10 == 0));//item with highest id that ends in zero.
But the really important part, is that while we've a means of defining the sort of operation we want done, we can have the actual implementation hidden from us.
The call to OrderByDescending produces an object that has information on its source, and the lambda function it will use in ordering.
The call to FirstOrDefault in turn has information on that, and uses it to obtain a result.
In the case with the list, the implementation is to produce the equivalent Enumerable-based code (Queryable and Enumerable mirror each other's public members, as do the interfaces they use such as IOrderedQueryable and IOrderedEnumerable and so on).
This is because, with a list that we don't know is already sorted in the order we care about (or in the opposite order), there isn't any faster way than to examine each element. The best we can hope for is an O(n) operation, and we might get an O(n log n) operation - depending on whether the implementation of the ordering is optimised for the possibility of only one item being taken from it*.
Or to put it another way, the best we could hand-code in code that only worked on enumerables is only slightly more efficient than:
public static YourType FinalItem(IEnumerable<YourType> source)
{
YourType highest = default(YourType);
int highestID = int.MinValue;
foreach(YourType item in source)
{
int curID = item.ID;
if(highest == null || curID > highestID)
{
highest = item;
highestID = curID;
}
}
return highest;
}
We can do slightly better with some micro-opts on handling the enumerator directly, but only slightly and the extra complication would just make for less-good example code.
Since we can't do any better than that by hand, and since the linq code doesn't know anything more about the source than we do, that's the best we could possibly hope for it matching. It might do less well (again, depending on whether the special case of our only wanting one item was thought of or not), but it won't beat it.
However, this is not the only approach linq will ever take. It'll take a comparable approach with an in-memory enumerable source, but your source isn't such a thing.
db.MyTable represents a table. To enumerate through it gives us the results of an SQL query more or less equivalent to:
SELECT * FROM MyTable
However, db.MyTable.OrderByDescending(d => d.ID) is not the equivalent of calling that, and then ordering the results in memory. Because queries get processed as a whole when they are executed, we actually get the result of an SQL query more or less like:
SELECT * FROM MyTable ORDER BY id DESC
Finally, the entire query db.MyTable.OrderByDescending(d => d.ID).FirstOrDefault() results in a query like:
SELECT TOP 1 * FROM MyTable ORDER BY id DESC
Or
SELECT * FROM MyTable ORDER BY id DESC LIMIT 1
Depending upon what sort of database server you are using. Then the results get passed to code equivalent to the following ADO.NET-based code:
return dataReader.Read() ?
new MyType{ID = dataReader.GetInt32(0), dataReader.GetInt32(1), dataReader.GetString(2)}//or similar
: null;
You can't get much better.
And as for that SQL query. If there's an index on the id column (and since it looks like a primary key, there certainly should be), then that index will be used to very quickly find the row in question, rather than examining each row.
In all, because different linq providers use different means to fulfil the query, they can each try to do so in the best way possible. Of course, being in an imperfect world we'll no doubt find that some are better than others. What's more, they can even work to pick the best approach for different conditions. One example of this is that database-related providers can choose different SQL to take advantage of features of different versions of databases. Another is that the implementation of the version of Count() that works with in-memory enumerations works a bit like this:
public static int Count<T>(this IEnumerable<T> source)
{
var asCollT = source as ICollection<T>;
if(asCollT != null)
return asCollT.Count;
var asColl = source as ICollection;
if(asColl != null)
return asColl.Count;
int tally = 0;
foreach(T item in source)
++tally;
return tally;
}
This is one of the simpler cases (and a bit simplified again in my example here, I'm showing the idea not the actual code), but it shows the basic principle of code taking advantage of more efficient approaches when they're available - the O(1) length of arrays and the Count property on collections that is sometimes O(1) and it's not like we've made things worse in the cases where it's O(n) - and then when they aren't available falling back to a less efficient but still functional approach.
The result of all of this is that Linq tends to give very good bang for buck, in terms of performance.
Now, a decent coder should be able to match or beat its approach to any given case most of the time†, and even when Linq comes up with the perfect approach there are some overheads to it itself.
However, over the scope of an entire project, using Linq means that we can concisely create reasonably efficient code that relates to a relatively constrained number of well defined entities (generally one per table as far as databases go). In particular, the use of anonymous functions and joins means that we get queries that are very good. Consider:
var result = from a in db.Table1
join b in db.Table2
on a.relatedBs equals b.id
select new {a.id, b.name};
Here we're ignoring columns we don't care about, and the SQL produced will do the same. Consider what we would do if we were creating the objects that a and b relate to with hand-coded DAO classes:
1) Create a new class to represent this combination of a's id and b's name, and relevant code to run the query we need to produce instances.
2) Run a query to obtain all information about each a and the related b, and live with the waste.
3) Run a query to obtain the information on each a and b that we care about, and just set default values for the other fields.
Of these, option 2 will be wasteful, perhaps very wasteful. Option 3 will be a bit wasteful and very error prone (what if we accidentally try to use a field elsewhere in the code that wasn't set correctly?). Only option 1 will be more efficient than what the linq approach will produce, but this is just one case. Over a large project this could mean producing dozens or even hundreds or thousands of slightly different classes (and unlike the compiler, we won't necessarily spot the cases where they're actually the same). In practice, therefore, linq can do us some great favours when it comes to efficiency.
Good policies for efficient linq are:
1) Stay with the type of query you start with as long as you can. Whenever you grab items into memory with ToList() or ToArray etc, consider if you really need to. Unless you need to or you can clearly state the advantage doing so gives you, don't.
2) If you do need to move to processing in memory, favour AsEnumerable() over ToList() and the other means, so you only grab one at a time.
3) Examine long-running queries with SQLProfiler or similar. There are a handful of cases where policy 1 here is wrong and moving to memory with AsEnumerable() is actually better (most relate to uses of GroupBy that don't use aggregates on the non-grouped fields, and hence don't actually have a single SQL query they correspond with).
4) If a complicated query is hit many times, then CompiledQuery can help (less so with 4.5 since it has automatic optimisations that cover some of the cases it helps in), but it's normally better to leave that out of the first approach and then use it only in hot-spots that are efficiency problems.
5) You can get EF to run arbitrary SQL, but avoid it unless it's a strong gain because too much such code reduces the consistent readability using a linq approach throughout gives (I have to say though, I think Linq2SQL beats EF on calling stored procedures and even more so on calling UDFs, but even there this still applies - it's less clear from just looking at the code how things relate to each other).
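The AsEnumerable() versus ToList() policy can be illustrated without a database. In this sketch the Rows iterator is only a stand-in for a result stream (it, the class name and the 100-row figure are assumptions made for the example, not EF code), and it counts how many rows are actually handed out:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreamingVersusBuffering
{
    // Stand-in for a streamed query result; onRead fires for each row produced.
    private static IEnumerable<int> Rows(int total, Action onRead)
    {
        for (int i = 0; i < total; i++)
        {
            onRead();
            yield return i;
        }
    }

    public static (int streamed, int buffered) RowsRead()
    {
        int streamed = 0;
        // AsEnumerable keeps the sequence lazy, so First stops pulling
        // as soon as it finds a match.
        _ = Rows(100, () => streamed++).AsEnumerable().First(r => r >= 4);

        int buffered = 0;
        // ToList pulls every row into memory before First looks at any of them.
        _ = Rows(100, () => buffered++).ToList().First(r => r >= 4);

        return (streamed, buffered);
    }
}
```

With a real provider the database may still do the work of producing all the rows, so this only shows the client-side half of the story: how many rows your process has to materialise.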
*AFAIK, this particular optimisation isn't applied, but we're talking about the best possible implementation at this point, so it doesn't matter if it is, isn't, or is in some versions only.
†I'll admit though that Linq2SQL would often produce queries that use APPLY that I would not think of, as I was used to thinking of how to write queries in versions of SQLServer before 2005 introduced it, while code doesn't have those sort of human tendencies to go with old habits. It pretty much taught me how to use APPLY.

Help Needed with LINQ Syntax

Can someone help me change the following to select the unique Model values from the Product table?
var query = from Product in ObjectContext.Products.Where(p => p.BrandId == BrandId & p.ProdDelOn == null)
orderby Product.Model
select Product;
I'm guessing that you still want to filter based on your existing Where() clause. I think this should take care of it for you (and will include the ordering as well):
var query = ObjectContext.Products
.Where(p => p.BrandId == BrandId && p.ProdDelOn == null)
.Select(p => p.Model)
.Distinct()
.OrderBy(m => m);
But, depending on how you read the post...it also could be taken as you're trying to get a single unique Model out of the results (which is a different query):
var model = ObjectContext.Products
.Where(p => p.BrandId == BrandId && p.ProdDelOn == null)
.Select(p => p.Model)
.First();
Change the & to && and add the following line:
query = query.Distinct();
I'm afraid I can't answer the question - but I want to comment on it nonetheless.
IMHO, this is an excellent example of what's wrong with the direction the .NET Framework has been going in the last few years. I cannot stand LINQ, and nor do I feel too warmly about extension methods, anonymous methods, lambda expressions, and so on.
Here's why: I have yet to see a situation where any of these things actually contributes anything to solving real-world programming problems. LINQ is certainly no replacement for SQL, so you (or at least the project) still need to master that. Writing the LINQ statements is not any simpler than writing the SQL, but it does add run-time processing to build an expression tree and "compile" it into an SQL statement. Now, if you could solve complex problems more easily with LINQ than with SQL directly, or if it meant you didn't need to also know SQL, and if you could trust LINQ would produce good-enough SQL all the time, it might still have been worth using. But NONE of these preconditions are met, so I'm left wondering what the benefit is supposed to be.
Of course, in good old-fashioned SQL the statement would be
SELECT DISTINCT [Model]
FROM [Product]
WHERE [BrandID] = @brandID AND [ProdDelOn] IS NULL
ORDER BY [Model]
In many cases the statements can be easily generated with dev tools and encapsulated by stored procedures. This would perform better, but I'll grant that for many things the performance difference between LINQ and the more straightforward stored procs would be totally irrelevant. (On the other hand, performance problems do have a tendency to sneak in, as we devs often work with totally unrealistic amounts of data and on environments that have little in common with those hosting our software in real production systems.) But the advantages of just not using LINQ are HUGE:
1) Fewer skills required (since you must use SQL anyway)
2) All data access can be performed in one way (since you need SQL anyway)
3) Some control over HOW to get data and not just what
4) Less chance of being rightfully accused of writing bloatware (more efficient)
Similar things can be said with respect to many of the new language features introduced since C# 2.0, though I do appreciate and use some of them. The "var" keyword with type inference is great for initializing locals - it's not much use getting the same type information two times on the same line. But let's not pretend this somehow helps one bit if you have a problem to solve. Same for anonymous types - nested private types served the same purpose with hardly any more code, and I've found NO use for this feature since trying it out when it was new and shiny. Extension methods ARE in fact just plain old utility methods, and I have yet to hear any good explanation of why one should use the SAME syntax for instance methods and static methods invoked on another class! This actually means that adding a method to a class and getting no build warnings or errors can break an application. (In case you doubt: If you had an extension method Bar() for your Foo type, Foo.Bar() invokes a completely different implementation which may or may not do something similar to what your extension method Bar() did the day you introduce an instance method with the same signature. It'll build and crash at runtime.)
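For anyone who wants to check the Foo.Bar() scenario, here is a minimal sketch. Foo and Bar come from the paragraph above; FooExtensions, BindingDemo and the returned strings are invented for the example. It shows only which method the call binds to once both exist, not what either implementation would do in a real application.

```csharp
using System;

public class Foo
{
    // Imagine this was added later, with the same name and signature
    // as the extension method below.
    public string Bar() => "instance";
}

public static class FooExtensions
{
    public static string Bar(this Foo foo) => "extension";
}

public static class BindingDemo
{
    // Member-style call: an applicable instance method takes precedence,
    // and the compiler issues no warning about the shadowed extension.
    public static string Call() => new Foo().Bar();

    // Calling the static method explicitly still reaches the extension.
    public static string CallExplicit() => FooExtensions.Bar(new Foo());
}
```

The switch happens silently at compile time, so whether anything breaks afterwards depends entirely on how the two implementations differ.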
Sorry to rant like this, and maybe there is a better place to post this than in response to a question. But I really think anyone starting out with LINQ is wasting their time - unless it's in preparation for an MS certification exam, which AFAIU is also something a bit removed from reality.

Understanding how the C# compiler deals with chaining linq methods

I'm trying to wrap my head around what the C# compiler does when I'm chaining linq methods, particularly when chaining the same method multiple times.
Simple example: Let's say I'm trying to filter a sequence of ints based on two conditions.
The most obvious thing to do is something like this:
IEnumerable<int> Method1(IEnumerable<int> input)
{
return input.Where(i => i % 3 == 0 && i % 5 == 0);
}
But we could also chain the where methods, with a single condition in each:
IEnumerable<int> Method2(IEnumerable<int> input)
{
return input.Where(i => i % 3 == 0).Where(i => i % 5 == 0);
}
I had a look at the IL in Reflector; it is obviously different for the two methods, but analysing it further is beyond my knowledge at the moment :)
I would like to find out:
a) what the compiler does differently in each instance, and why.
b) are there any performance implications (not trying to micro-optimize; just curious!)
The answer to (a) is short, but I'll go into more detail below:
The compiler doesn't actually do the chaining - it happens at runtime, through the normal organization of the objects! There's far less magic here than what might appear at first glance - Jon Skeet recently completed the "Where clause" step in his blog series, Re-implementing LINQ to Objects. I'd recommend reading through that.
In very short terms, what happens is this: each time you call the Where extension method, it returns a new WhereEnumerable object that has two things - a reference to the previous IEnumerable (the one you called Where on), and the lambda you provided.
When you start iterating over this WhereEnumerable (for example, in a foreach later down in your code), internally it simply begins iterating on the IEnumerable that it has referenced.
"This foreach just asked me for the next element in my sequence, so I'm turning around and asking you for the next element in your sequence".
That goes all the way down the chain until we hit the origin, which is actually some kind of array or storage of real elements. As each Enumerable then says "OK, here's my element" passing it back up the chain, it also applies its own custom logic. For a Where, it applies the lambda to see if the element passes the criteria. If so, it allows it to continue on to the next caller. If it fails, it stops at that point, turns back to its referenced Enumerable, and asks for the next element.
This keeps happening until everyone's MoveNext returns false, which means the enumeration is complete and there are no more elements.
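The pull-through-the-chain behaviour can be seen with a hand-rolled filter. This is only a sketch: MyWhere is a hypothetical stand-in written as an iterator block, not the real Enumerable.Where, and the name and log parameters exist purely to record who saw what.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MiniLinq
{
    // Hypothetical stand-in for Where that logs every element it inspects.
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source,
        Func<T, bool> predicate, string name, List<string> log)
    {
        foreach (T item in source)          // ask the previous stage for its next element
        {
            log.Add($"{name} sees {item}");
            if (predicate(item))
                yield return item;          // pass it up to whoever asked
            // otherwise loop round and ask the previous stage again
        }
    }

    public static (List<int> result, List<string> log) Trace()
    {
        var log = new List<string>();
        List<int> result = new[] { 3, 5, 15 }
            .MyWhere(i => i % 3 == 0, "stage1", log)
            .MyWhere(i => i % 5 == 0, "stage2", log)
            .ToList();
        return (result, log);
    }
}
```

The log from MiniLinq.Trace() interleaves the two stages element by element (stage1 sees 3, stage2 sees 3, stage1 sees 5, and so on), rather than stage1 finishing the whole sequence before stage2 starts.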
To answer (b), there's always a difference, but here it's far too trivial to bother with. Don't worry about it :)
The first will use one iterator, the second will use two. That is, the first sets up a pipeline with one stage, the second will involve two stages.
Two iterators have a slight performance disadvantage to one.
