C# Linq query execution order

Consider the following method:
public IEnumerable<Owner> GetOwners(OwnerParameters ownerParameters)
{
    return FindAll()
        .OrderBy(on => on.Name)
        .Skip((ownerParameters.PageNumber - 1) * ownerParameters.PageSize)
        .Take(ownerParameters.PageSize)
        .ToList();
}
Here FindAll() is a repository-pattern method that returns IQueryable<Owner>. Does placing .OrderBy() before .Skip() and .Take() mean that every row of the Owner table is retrieved and then ordered? Or does LINQ take into account that .Skip() and .Take() narrow down the required Owner rows, retrieve only those, and order them afterwards?
EDIT: Profiler log:
SELECT XXX
FROM [Owners] AS [a]
ORDER BY [a].[Name]
OFFSET @__p_0 ROWS FETCH NEXT @__p_1 ROWS ONLY',N'@__p_0 int,@__p_1 int',@__p_0=0,@__p_1=10

Ultimately, this depends on what FindAll() does and what it returns:
If it returns IEnumerable<T>, then it is using LINQ-to-Objects, which mostly does literally what it is told: if you sort and then page, it sorts and then pages; if you page and then sort, it pages and then sorts.
However, if it returns IQueryable<T>, then the query is only being composed, and it is not actually executed until ToList(). At that point the provider gets a chance to inspect your query tree and build the most suitable implementation possible, which often means writing a SQL query that includes an ORDER BY and some paging hints suitable for the specific RDBMS. If your code paged and then sorted (which is... unusual), I would expect most providers either to write some kind of horrible sub-query to try to describe that, or just to throw an exception (perhaps NotSupportedException) in disgust.
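To make the difference concrete, here is a minimal sketch using in-memory data only. The names array is a hypothetical stand-in for the Owners table, and AsQueryable() merely imitates a real provider: it shows that the calls are composed into an expression tree, not how any particular database provider would translate them.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory data standing in for the Owners table.
var names = new[] { "Dave", "Alice", "Carol", "Bob", "Eve" };

// LINQ to Objects does exactly what it is told, in the order it is told.
var sortThenPage = names.OrderBy(n => n).Skip(1).Take(2).ToList();
var pageThenSort = names.Skip(1).Take(2).OrderBy(n => n).ToList();

Console.WriteLine(string.Join(", ", sortThenPage)); // Bob, Carol
Console.WriteLine(string.Join(", ", pageThenSort)); // Alice, Carol

// With IQueryable the calls only build an expression tree; nothing is
// evaluated until the query is enumerated (here by ToList()).
var query = names.AsQueryable().OrderBy(n => n).Skip(1).Take(2);
Console.WriteLine(query.Expression); // prints the composed call chain
var page = query.ToList();
```

A database provider receives that same tree and is free to turn it into a single SQL statement, which is what the profiler log in the question shows.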

Not all records would be retrieved. Depending on your backend, a query is generated that orders by Name, skips N rows and takes the next M rows. Only the query result is retrieved, and the query is triggered by .ToList().
For example, in MS SQL Server the query might look like:
Select top(M) row_number() over (order by Name) as RowNo, *
from myTable
where RowNo > N
That type of query is not very efficient, but that is another matter, and you could create your own workaround.
EDIT: I misremembered the SQL generated for MS SQL Server. It was:
SELECT ...
FROM ...
ORDER BY ...
OFFSET x ROWS FETCH NEXT y ROWS ONLY
That one can also be slow, especially for large offsets. If speed is a concern, write your own SQL and send it through Linq. Basically, you keep the last retrieved value and pass it as a parameter:
select top(NTake) ... from ...
where orderedByValue > @lastRetrievedValue
order by ...
(You can send raw SQL queries in Linq)
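The "keep the last retrieved value" idea (often called keyset pagination) can also be expressed in LINQ itself. The following is only a sketch over in-memory data: the rows list and the NextPage helper are hypothetical, and Id is assumed to be a unique sort key. Whether a given provider translates the same shape into an efficient WHERE ... ORDER BY ... TOP query is something to check in the profiler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows; Id plays the role of the unique, indexed sort key.
var rows = Enumerable.Range(1, 10)
    .Select(i => (Id: i, Name: $"Owner{i}"))
    .ToList();

// Fetch the page that follows the last key already seen (null = first page).
List<(int Id, string Name)> NextPage(int? lastId, int pageSize) =>
    rows.Where(r => lastId == null || r.Id > lastId)
        .OrderBy(r => r.Id)
        .Take(pageSize)
        .ToList();

var page1 = NextPage(null, 3);
var page2 = NextPage(page1.Last().Id, 3);

Console.WriteLine(string.Join(", ", page1.Select(r => r.Id))); // 1, 2, 3
Console.WriteLine(string.Join(", ", page2.Select(r => r.Id))); // 4, 5, 6
```

The caller has to remember the last key between requests, and jumping straight to an arbitrary page number is no longer possible; that is the trade-off against Skip/Take.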

A quick example to demonstrate that the order of the query operators matters. The explanations given above are good.
var test1 = await _dbContext.UserActivityLogs
    .Where(x => x.ExternalSyncLogId == request.Id)
    .Skip(0).Take(25)
    .AsNoTracking().ToListAsync();
This query translates to
exec sp_executesql N'SELECT [u].[Id], [u].[ActionType], [u].[ActivityEndTime], [u].[ActivityStartTime], [u].[ActivityType], [u].[Created], [u].[CreatedBy], [u].[EntityId], [u].[ExternalSyncLogId], [u].[LastModified], [u].[LastModifiedBy], [u].[RequestBody], [u].[ResponseBody], [u].[Source], [u].[TenantId]
FROM [UserActivityLogs] AS [u]
WHERE [u].[ExternalSyncLogId] IS NULL
ORDER BY (SELECT 1)
OFFSET @__p_1 ROWS FETCH NEXT @__p_2 ROWS ONLY',N'@__p_1 int,@__p_2 int',@__p_1=0,@__p_2=25
Execution time: 29
But if we just change the order of the operators:
var test2 = await _dbContext.UserActivityLogs
    .Skip(0).Take(25)
    .Where(x => x.ExternalSyncLogId == request.Id)
    .AsNoTracking().ToListAsync();
it is translated as
exec sp_executesql N'SELECT [t].[Id], [t].[ActionType], [t].[ActivityEndTime], [t].[ActivityStartTime], [t].[ActivityType], [t].[Created], [t].[CreatedBy], [t].[EntityId], [t].[ExternalSyncLogId], [t].[LastModified], [t].[LastModifiedBy], [t].[RequestBody], [t].[ResponseBody], [t].[Source], [t].[TenantId]
FROM (
SELECT [u].[Id], [u].[ActionType], [u].[ActivityEndTime], [u].[ActivityStartTime], [u].[ActivityType], [u].[Created], [u].[CreatedBy], [u].[EntityId], [u].[ExternalSyncLogId], [u].[LastModified], [u].[LastModifiedBy], [u].[RequestBody], [u].[ResponseBody], [u].[Source], [u].[TenantId]
FROM [UserActivityLogs] AS [u]
ORDER BY (SELECT 1)
OFFSET @__p_0 ROWS FETCH NEXT @__p_1 ROWS ONLY
) AS [t]
WHERE [t].[ExternalSyncLogId] IS NULL',N'@__p_0 int,@__p_1 int',@__p_0=0,@__p_1=25
Execution time: 474

Related

.netcore EF linq - this is a BUG? Very strange behavior

I have two tables in SQL: Document and User. Document has a relation to User, and I want to get the users to whom I recently sent documents.
I need to sort by the date the document was sent and get the unique (distinct) users related to those documents.
This is my LINQ query:
var recentClients = documentCaseRepository.Entities
    .Where(docCase => docCase.AssignedByAgentId == WC.UserContext.UserId)
    .OrderByDescending(userWithDate => userWithDate.LastUpdateDate)
    .Take(1000) // I need this because if I comment out this line, EF generates a completely different SQL query.
    .Select(doc => new { doc.AssignedToClient.Id, doc.AssignedToClient.FirstName, doc.AssignedToClient.LastName })
    .Distinct()
    .Take(configuration.MaxRecentClientsResults)
    .ToList();
and the generated SQL query is:
SELECT DISTINCT TOP(5) [t].*
FROM (
SELECT TOP(1000) [docCase.AssignedToClient].[Id]
FROM [DocumentCase] AS [docCase]
INNER JOIN [User] AS [docCase.AssignedToClient]
ON ([docCase].[AssignedToClientId] = [docCase.AssignedToClient].[Id])
WHERE [docCase].[AssignedByAgentId] = 3
ORDER BY [docCase].[LastUpdateDate] DESC
)
AS [t]
Everything is correct so far. But if I delete this line
.Take(1000) // I need this because...
EF generates a completely different query:
SELECT DISTINCT TOP(5)
[docCase.AssignedToClient].[Id]
FROM [DocumentCase] AS [docCase]
INNER JOIN [User] AS [docCase.AssignedToClient]
ON ([docCase].[AssignedToClientId] = [docCase.AssignedToClient].[Id])
WHERE [docCase].[AssignedByAgentId] = 3
My question is: why did EF not generate the ORDER BY clause and the subquery with DISTINCT?
Is this a bug in EF, or am I doing something wrong? And what must I write in LINQ to generate this SQL query?
SELECT DISTINCT TOP 5 [t].*
FROM ( SELECT [docCase.AssignedToClient].[Id]
FROM [DocumentCase] AS [docCase]
INNER JOIN [User] AS [docCase.AssignedToClient]
ON [docCase].[AssignedToClientId] = [docCase.AssignedToClient].[Id]
WHERE [docCase].[AssignedByAgentId] = 1
ORDER BY [docCase].[LastUpdateDate] DESC
) AS [t]

OrderBy information is not always retained across other operators such as Distinct. Entity Framework does not document (to my knowledge) exactly how OrderBy is propagated.
This kind of makes sense, because some operators have an undefined output order. The fact that ordering is retained in many situations is a convenience for the developer.
Move the OrderBy to the end of the query (or at least past the Distinct).

The reason for the difference in queries is that Distinct does not preserve result order. So when you first execute OrderBy and then Distinct, you might just as well not execute OrderBy, because the order is lost anyway. So EF can simply optimize it away.
Calling Take in between makes the result set semantically different: you first order the items, take the first 1000 items in that order, and then call Distinct on them.
What you can change in your query depends mainly on the result you want to achieve. Maybe you want to first make the result set distinct, then order by date, and finally take the required number of items. Other options are also conceivable, depending on your requirements.
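One way to read "make it distinct first, then order, then take" is to group by client and order the groups by their most recent document. The sketch below runs against hypothetical in-memory tuples, not against EF, so it only shows the shape of the query; whether a particular EF version translates this GroupBy into a single SQL statement is an assumption that would need checking.

```csharp
using System;
using System.Linq;

// Hypothetical document rows: (ClientId, LastUpdateDate).
var docs = new[]
{
    (ClientId: 1, LastUpdateDate: new DateTime(2024, 1, 1)),
    (ClientId: 2, LastUpdateDate: new DateTime(2024, 1, 5)),
    (ClientId: 1, LastUpdateDate: new DateTime(2024, 1, 9)),
    (ClientId: 3, LastUpdateDate: new DateTime(2024, 1, 3)),
};

// De-duplicate first (one group per client), then order by each client's
// most recent document, and only then limit the result.
var recentClientIds = docs
    .GroupBy(d => d.ClientId)
    .Select(g => new { ClientId = g.Key, LastSent = g.Max(d => d.LastUpdateDate) })
    .OrderByDescending(x => x.LastSent)
    .Take(2)
    .Select(x => x.ClientId)
    .ToList();

Console.WriteLine(string.Join(", ", recentClientIds)); // 1, 2
```

Because the ordering happens after the de-duplication, there is no earlier OrderBy for a later operator to discard.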

sql Top 1 vs System.Linq firstordefault

I am rewriting a stored procedure in C#. The problem is that the stored procedure contains a query like this:
select top 1 *
from ClientDebt
where ClinetID = 11234
order by Balance desc
For example: I have a client with 3 debts, all of which have the same balance. The debt IDs are 1, 2, 3.
The C# equivalent of that query is:
debts.OrderByDescending(d => d.Balance)
.FirstOrDefault()
debts represents the client's 3 debts.
The interesting part is that SQL returns the debt with Id 2, but the C# code returns Id 1.
Id 1 makes sense to me, but in order to keep the code's functionality the same I need to change the C# code to return the middle one.
I am not sure what logic SQL's TOP 1 follows when several rows match the query.
The query will select one debt and update the database. I would like the LINQ query to return the same result as the SQL.
Thanks

debts.OrderByDescending(d => d.Balance).ThenByDescending(d => d.Id)
.FirstOrDefault()
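To see why an explicit tie-breaker helps, here is a small sketch with hypothetical in-memory debts, deliberately listed out of Id order. LINQ to Objects uses a stable sort, so without a secondary key the winner is simply whichever tied row comes first in the source. SQL makes no such promise for TOP 1 with ties, so the only reliable way to get the same row on both sides is to add the same tie-breaker to both queries.

```csharp
using System;
using System.Linq;

// Hypothetical debts for one client: same balance, arbitrary source order.
var debts = new[]
{
    (Id: 2, Balance: 100m),
    (Id: 3, Balance: 100m),
    (Id: 1, Balance: 100m),
};

// Stable sort: tied rows keep their source order, so Id 2 wins here.
var withoutTieBreaker = debts.OrderByDescending(d => d.Balance).First();

// An explicit secondary key makes the result independent of source order.
var withTieBreaker = debts
    .OrderByDescending(d => d.Balance)
    .ThenBy(d => d.Id)
    .First();

Console.WriteLine(withoutTieBreaker.Id); // 2
Console.WriteLine(withTieBreaker.Id);    // 1
```

Use ThenBy or ThenByDescending depending on which of the tied rows should win, and mirror that choice in the ORDER BY of the SQL.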

You can start SQL Profiler, execute the stored procedure and review the result, then capture the query that the application sends through LINQ and review its result as well.
Also, you can easily view the execution plan of your procedure and try to optimize it, but you cannot do this as easily with a LINQ query.

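AFAIK, if you select rows in SQL without ORDER BY, the result set often comes back in primary-key order, although this is not guaranteed.
Likewise, with ORDER BY [field], rows that tie on [field] often come back in [primarykey] order, but that is not guaranteed either unless you add the key to the ORDER BY.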

Entity framework execution time

I noticed a massive difference in execution time with Entity Framework today, and I would like to know why the first statement has so much overhead. For this query I'm retrieving 5500 trend data values from the database (which shouldn't be a big deal).
This is the statement I used before:
TrendDataValues = new ObservableCollection<TrendDataValue>(_trendDataContext.TrendDatas.First(td => td.Id == argument.TrendDataId)
.TrendDataValues
.Where(tdv => tdv.ValueStartTimestamp >= argument.MinValue
&& tdv.ValueStartTimestamp <= argument.MaxValue));
However, this statement takes over 10 seconds to run.
I've rewritten it as the following statement, which retrieves exactly the same data but returns the values within 0.2 seconds.
TrendDataValues = new ObservableCollection<TrendDataValue>(from td in _trendDataContext.TrendDatas.Where(d => d.Id == trendDataId)
from tdv in td.TrendDataValues
where tdv.ValueStartTimestamp >= argument.MinValue
&& tdv.ValueEndTimestamp <= argument.MaxValue
select tdv);
Can somebody explain the difference between the two statements?

Suggestion: download http://www.linqpad.net/
Connect LINQPad to your database.
Run the two queries and take a look at the SQL tab to see if there is a difference in the SQL that is generated by the queries.
Hope this helps!

If a chained-method query and a query-syntax query are logically the same, the resulting SQL will be identical. At first glance, it seems that in the second example you are implicitly creating a join (the two from/where clauses act much like an inner join), whereas in the first you are not, and you are probably creating some form of cartesian product that the chained methods then have to search.
As the other answer suggests, use LINQPad and check the SQL generated; I bet it's not the same.
P.S. Strictly speaking, the second example would take longer to compile! But if both examples were logically identical, method syntax and query syntax would execute at the same speed.

As advised in the answers above, I've tested both queries in LINQPad.
The first one runs the following query:
SELECT TOP (1) [t0].[Id], [t0].[Tag], [t0].[Description], [t0].[PollingInterval], [t0].[Compression], [t0].[PlcLogDataTypeValue]
FROM [TrendDatas] AS [t0]
WHERE [t0].[Id] = @p0
The second one runs the following query:
SELECT [t1].[Id], [t1].[ValueStartTimestamp], [t1].[ValueEndTimestamp], [t1].[Value], [t1].[SerieNumber], [t1].[TrendData_Id]
FROM [TrendDatas] AS [t0], [TrendDataValues] AS [t1]
WHERE ([t1].[ValueStartTimestamp] >= @p0) AND ([t1].[ValueStartTimestamp] <= @p1) AND ([t0].[Id] = @p2) AND ([t1].[TrendData_Id] = [t0].[Id])
Apparently the first statement only returns the TrendData parent object. I don't know how it iterates over its values (the child elements), since I don't see a query or join referencing the TrendDataValues table, but I'm guessing it isn't going to be pretty.
The second query returns a better result, which matches exactly what I'm asking for.
Thanks for your support and +1 for the answers!

Why does the Entity Framework generate nested SQL queries?

Why does the Entity Framework generate nested SQL queries?
I have this code
var db = new Context();
var result = db.Network.Where(x => x.ServerID == serverId)
.OrderBy(x=> x.StartTime)
.Take(limit);
Which generates this! (Note the double select statement)
SELECT
`Project1`.`Id`,
`Project1`.`ServerID`,
`Project1`.`EventId`,
`Project1`.`StartTime`
FROM (SELECT
`Extent1`.`Id`,
`Extent1`.`ServerID`,
`Extent1`.`EventId`,
`Extent1`.`StartTime`
FROM `Networkes` AS `Extent1`
WHERE `Extent1`.`ServerID` = @p__linq__0) AS `Project1`
ORDER BY
`Project1`.`StartTime` DESC LIMIT 5
What should I change so that it results in one select statement? I'm using MySQL and Entity Framework with Code First.
Update
I have the same result regardless of the type of the parameter passed to the OrderBy() method.
Update 2: Timed
Total Time (hh:mm:ss.ms) 05:34:13.000
Average Time (hh:mm:ss.ms) 25:42.000
Max Time (hh:mm:ss.ms) 51:54.000
Count 13
First Seen Nov 6, 12 19:48:19
Last Seen Nov 6, 12 20:40:22
Raw query:
SELECT `Project?`.`Id`, `Project?`.`ServerID`, `Project?`.`EventId`, `Project?`.`StartTime` FROM (SELECT `Extent?`.`Id`, `Extent?`.`ServerID`, `Extent?`.`EventId`, `Extent?`.`StartTime`, FROM `Network` AS `Extent?` WHERE `Extent?`.`ServerID` = ?) AS `Project?` ORDER BY `Project?`.`Starttime` DESC LIMIT ?
I used a program to take snapshots from the current process in MySQL.
Other queries were executed at the same time, but when I change it to just one SELECT statement, it NEVER goes over one second. Maybe I have something else that's going on; I'm asking 'cause I'm not so into DBs...
Update 3: The explain statement
The Entity Framework generated
'1', 'PRIMARY', '<derived2>', 'ALL', NULL, NULL, NULL, NULL, '46', 'Using filesort'
'2', 'DERIVED', 'Extent?', 'ref', 'serveridneventid,serverid', 'serveridneventid', '109', '', '45', 'Using where'
One liner
'1', 'SIMPLE', 'network', 'ref', 'serveridneventid,serverid', 'serveridneventid', '109', 'const', '45', 'Using where; Using filesort'
This is from my QA environment, so the timing I pasted above is not related to the rowcount explain statements. I think that there are about 500,000 records that match one server ID.
Solution
I switched from MySQL to SQL Server. I don't want to end up completely rewriting the application layer.

It's the easiest way to build the query logically from the expression tree. Usually the performance will not be an issue. If you are having performance issues, you can try something like this to get the entities back:
var results = db.ExecuteStoreQuery<Network>(
    "SELECT Id, ServerID, EventId, StartTime FROM Network WHERE ServerID = @ID",
    serverId);
// ObjectResult<T> is an IEnumerable<T>, so this OrderBy/Take runs in memory, not in the database
var page = results.OrderBy(x => x.StartTime).Take(limit);

My initial impression was that doing it this way would actually be more efficient, although in testing against a MSSQL server, I got <1 second responses regardless.
With a single select statement, it sorts all the records (Order By), and then filters them to the set you want to see (Where), and then takes the top 5 (Limit 5 or, for me, Top 5). On a large table, the sort takes a significant portion of the time. With a nested statement, it first filters the records down to a subset, and only then does the expensive sort operation on it.
Edit: I did test this, but I realized I had an error in my test which invalidated it. Test results removed.

Why does Entity Framework produce a nested query? The simple answer is because Entity Framework breaks your query expression down into an expression tree and then uses that expression tree to build your query. A tree naturally generates nested query expressions (i.e. a child node generates a query and a parent node generates a query on that query).
Why doesn't Entity Framework simplify the query down and write it as you would? The simple answer is because there is a limited amount of work that can go into the query generation engine, and while it's better now than it was in earlier versions it's not perfect and probably never will be.
All that said there should be no significant speed difference between the query you would write by hand and the query EF generated in this case. The database is clever enough to generate an execution plan that applies the WHERE clause first in either case.

If you want to get the EF to generate the query without the subselect, use a constant within the query, not a variable.
I have previously created my own .Where and all other LINQ methods that first traverse the expression tree and convert all variables, method calls etc. into Expression.Constant. It was done just because of this issue in Entity Framework...

I just stumbled upon this post because I suffer from the same problem. I have already spent days tracking this down, and it is just poor query generation for MySQL.
I have already filed a bug at mysql.com: http://bugs.mysql.com/bug.php?id=75272
To summarize the problem:
This simple query
context.products
.Include(x => x.category)
.Take(10)
.ToList();
gets translated into
SELECT
`Limit1`.`C1`,
`Limit1`.`id`,
`Limit1`.`name`,
`Limit1`.`category_id`,
`Limit1`.`id1`,
`Limit1`.`name1`
FROM (SELECT
`Extent1`.`id`,
`Extent1`.`name`,
`Extent1`.`category_id`,
`Extent2`.`id` AS `id1`,
`Extent2`.`name` AS `name1`,
1 AS `C1`
FROM `products` AS `Extent1` INNER JOIN `categories` AS `Extent2` ON `Extent1`.`category_id` = `Extent2`.`id` LIMIT 10) AS `Limit1`
and performs pretty well, although the outer query is pretty much useless. Now if I add an OrderBy
context.products
.Include(x => x.category)
.OrderBy(x => x.id)
.Take(10)
.ToList();
the query changes to
SELECT
`Project1`.`C1`,
`Project1`.`id`,
`Project1`.`name`,
`Project1`.`category_id`,
`Project1`.`id1`,
`Project1`.`name1`
FROM (SELECT
`Extent1`.`id`,
`Extent1`.`name`,
`Extent1`.`category_id`,
`Extent2`.`id` AS `id1`,
`Extent2`.`name` AS `name1`,
1 AS `C1`
FROM `products` AS `Extent1` INNER JOIN `categories` AS `Extent2` ON `Extent1`.`category_id` = `Extent2`.`id`) AS `Project1`
ORDER BY
`Project1`.`id` ASC LIMIT 10
That is bad because the ORDER BY is in the outer query. It means MySQL has to pull every record in order to perform the ORDER BY, which results in a filesort.
I verified that SQL Server (Compact, at least) does not generate nested queries for the same code:
SELECT TOP (10)
[Extent1].[id] AS [id],
[Extent1].[name] AS [name],
[Extent1].[category_id] AS [category_id],
[Extent2].[id] AS [id1],
[Extent2].[name] AS [name1]
FROM [products] AS [Extent1]
LEFT OUTER JOIN [categories] AS [Extent2] ON [Extent1].[category_id] = [Extent2].[id]
ORDER BY [Extent1].[id] ASC

Actually, the queries generated by Entity Framework are a bit ugly; less so than those of LINQ to SQL, but still ugly.
However, your database engine will very probably produce the desired execution plan, and the query will run smoothly.

Does LINQ with a scalar result trigger the lazy loading

I read the Loading Related Entities post by the Entity Framework team and got a bit confused by the last paragraph:
Sometimes it is useful to know how many entities are related to another entity in the database without actually incurring the cost of loading all those entities. The Query method with the LINQ Count method can be used to do this. For example:
using (var context = new BloggingContext())
{
    var blog = context.Blogs.Find(1);
    // Count how many posts the blog has
    var postCount = context.Entry(blog)
        .Collection(b => b.Posts)
        .Query()
        .Count();
}
Why are the Query and Count methods needed here?
Can't we simply use LINQ's Count method instead?
var blog = context.Blogs.Find(1);
var postCount = blog.Posts.Count();
Will that trigger lazy loading, so that the whole collection is loaded into memory and only then do I get my desired scalar value?

You will get your desired scalar value in both cases. But consider the difference in what's happening.
With .Query().Count() you run a query on the database of the form SELECT COUNT(*) FROM Posts and assign that value to your integer variable.
With .Posts.Count, you run (something like) SELECT * FROM Posts on the database (much more expensive already). Each row of the result is then mapped field-by-field into your C# object type as the collection is enumerated to find your count. By asking for the count in this way, you are forcing all of the data to be loaded so that C# can count how much there is.
Hopefully it's obvious that asking the database for the count of rows (without actually returning all of those rows) is much more efficient!

The first method does not load all rows, since the Count method is invoked on an IQueryable, but the second method loads all rows, since it is invoked on an ICollection.
I did some testing to verify it. I tested with Table1 and Table2, where Table1 has the PK "Id" and Table2 has the FK "Id1" (1:N). I used EF Profiler from http://efprof.com/.
First method:
var t1 = context.Table1.Find(1);
var count1 = context.Entry(t1)
.Collection(t => t.Table2)
.Query()
.Count();
No Select * From Table2:
SELECT TOP (2) [Extent1].[Id] AS [Id]
FROM [dbo].[Table1] AS [Extent1]
WHERE [Extent1].[Id] = 1 /* @p0 */
SELECT [GroupBy1].[A1] AS [C1]
FROM (SELECT COUNT(1) AS [A1]
FROM [dbo].[Table2] AS [Extent1]
WHERE [Extent1].[Id1] = 1 /* @EntityKeyValue1 */) AS [GroupBy1]
Second method:
var t1 = context.Table1.Find(1);
var count2 = t1.Table2.Count();
Table2 is loaded into memory:
SELECT TOP (2) [Extent1].[Id] AS [Id]
FROM [dbo].[Table1] AS [Extent1]
WHERE [Extent1].[Id] = 1 /* @p0 */
SELECT [Extent1].[Id] AS [Id],
[Extent1].[Id1] AS [Id1]
FROM [dbo].[Table2] AS [Extent1]
WHERE [Extent1].[Id1] = 1 /* @EntityKeyValue1 */
Why is this happening?
The result of Collection(t => t.Table2) is a class that implements ICollection, but it does not load all rows, and it has a property named IsLoaded. The result of the Query method is an IQueryable, and this allows calling Count without preloading rows.
The result of t1.Table2 is an ICollection, and it loads all rows to get the count.
By the way, even if you only access t1.Table2 without asking for the count, the rows are loaded into memory.

The first solution doesn't trigger lazy loading because it most probably never accesses the collection property directly. The Collection method accepts an Expression, not just a delegate. It is used only to get the name of the property, which is then used to access mapping information and build the correct query.
Even if it did access the collection property, it could use the same strategy as other internal parts of EF (for example validation), which turn off lazy loading temporarily before accessing navigation properties to avoid unexpected lazy loading.
Btw., this is a huge improvement over the ObjectContext API, where building the query required accessing the navigation property and could thus trigger lazy loading.
There is one more difference between the two approaches:
The first always executes a query against the database and returns the count of items in the database.
The second executes a query against the database only once, to load all items, and then returns the count of items in the application without checking the state in the database.
As a third, quite interesting option, you can use extra loading. The implementation by Arthur Vickers shows how to use a navigation property to get the count from the database without lazy loading the items.
