Query generated by EF takes too much time to execute - c#

I have a very simple query generated by Entity Framework. Sometimes when I run it, it takes more than 30 seconds to execute and I get a timeout exception.
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
[Extent1].[Url] AS [Url],
[Extent1].[Description] AS [Description],
[Extent1].[SentDate] AS [SentDate],
[Extent1].[VisitCount] AS [VisitCount],
[Extent1].[RssSourceId] AS [RssSourceId],
[Extent1].[ReviewStatus] AS [ReviewStatus],
[Extent1].[UserAccountId] AS [UserAccountId],
[Extent1].[CreationDate] AS [CreationDate]
FROM ( SELECT [Extent1].[LinkID] AS [LinkID], [Extent1].[Title] AS [Title], [Extent1].[Url] AS [Url], [Extent1].[Description] AS [Description], [Extent1].[SentDate] AS [SentDate], [Extent1].[VisitCount] AS [VisitCount], [Extent1].[RssSourceId] AS [RssSourceId], [Extent1].[ReviewStatus] AS [ReviewStatus], [Extent1].[UserAccountId] AS [UserAccountId], [Extent1].[CreationDate] AS [CreationDate], row_number() OVER (ORDER BY [Extent1].[SentDate] DESC) AS [row_number]
FROM [dbo].[Links] AS [Extent1]
) AS [Extent1]
WHERE [Extent1].[row_number] > 0
ORDER BY [Extent1].[SentDate] DESC
And the code that generates the query is:
public async Task<IQueryable<TEntity>> GetAsync(Expression<Func<TEntity, bool>> filter = null,
Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null)
{
return await Task.Run(() =>
{
IQueryable<TEntity> query = _dbSet;
if (filter != null)
{
query = query.Where(filter);
}
if (orderBy != null)
{
query = orderBy(query);
}
return query;
});
}
Note that when I remove the inner SELECT statement and the WHERE clause and change it to the following, the query executes in less than a second.
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
.
.
.
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
Any advice will be helpful.
UPDATE:
Here is the usage of the above code:
var dbLinks = await _uow.LinkRespository.GetAsync(filter, orderBy);
var pagedLinks = new PagedList<Link>(dbLinks, pageNumber, PAGE_SIZE);
var vmLinks = Mapper.Map<IPagedList<LinkViewItemViewModel>>(pagedLinks);
And the call that supplies the filter and ordering:
var result = await GetLinks(null, pageNo, a => a.OrderByDescending(x => x.SentDate));

It never occurred to me that you simply didn't have an index. Lesson learnt - always check the basics before digging further.
If you don't need pagination, then the query can be simplified to
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
...
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
and it runs fast, as you've verified.
Apparently, you do need the pagination, so let's see what we can do.
The reason your current version is slow is that it scans the whole table first, calculates a row number for each and every row, and only then returns 10 rows. I was wrong here. The SQL Server optimizer is pretty smart. The root of your problem is somewhere else. See my update below.
BTW, as other people mentioned, this pagination works correctly only if the SentDate column is unique. If it is not unique, you need to ORDER BY SentDate plus another, unique column such as some ID to resolve the ambiguity.
If you don't need the ability to jump straight to a particular page, but rather always start with page 1, then go to the next page, the next page and so on, then the proper, efficient way to do such pagination is described in this excellent article: http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way
The author uses PostgreSQL for illustration, but the technique works for MS SQL Server as well. It boils down to remembering the ID of the last row on the shown page and then using this ID in the WHERE clause, together with an appropriate supporting index, to retrieve the next page without scanning all previous rows.
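In EF terms, a minimal keyset ("seek") sketch might look like this, assuming SentDate is unique and that you remember the SentDate of the last row on the current page (the method and parameter names are illustrative, not from the question):
// Keyset pagination sketch: seek past the last row of the previous page
// instead of numbering and skipping rows. Assumes SentDate is unique;
// otherwise add a tie-breaker column (e.g. LinkID) to the filter and sort.
public List<Link> GetNextPage(IQueryable<Link> links, DateTime lastSentDate, int pageSize)
{
    return links
        .Where(x => x.SentDate < lastSentDate)   // index-friendly seek predicate
        .OrderByDescending(x => x.SentDate)
        .Take(pageSize)
        .ToList();
}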
SQL Server 2008 doesn't have built-in support for pagination, so we'll have to use a workaround. I will show one variant that lets you jump straight to a given page; it works fast for the first pages, but becomes slower and slower for further pages.
You will have these variables (PageSize, PageNumber) in your C# code; I put them here to illustrate the point.
DECLARE @VarPageSize int = 10; -- number of rows in each page
DECLARE @VarPageNumber int = 3; -- page numeration is zero-based
SELECT TOP (@VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM
(
SELECT TOP((@VarPageNumber + 1) * @VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
) AS [Extent1]
ORDER BY [Extent1].[SentDate] ASC
;
The first page is rows 1 to 10, second page is 11 to 20 and so on.
Let's see how this query works when we try to get the fourth page, i.e. rows 31 to 40, with PageSize = 10 and PageNumber = 3. The inner query selects the first 40 rows. Note that we don't scan the whole table here, we scan only the first 40 rows; we don't even need an explicit ROW_NUMBER(). Then we need the last 10 of those 40 rows, so the outer query selects TOP (10) with ORDER BY in the opposite direction. As is, this returns rows 40 to 31, in reverse order. You can sort them back into the correct order on the client, or add one more outer query which simply sorts them again by SentDate DESC. Like this:
SELECT
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM
(
SELECT TOP (@VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM
(
SELECT TOP((@VarPageNumber + 1) * @VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
) AS [Extent1]
ORDER BY [Extent1].[SentDate] ASC
) AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
This query (like the original one) works correctly only if SentDate is unique. If it is not unique, add a unique column to the ORDER BY. For example, if LinkID is unique, then in the inner-most query use ORDER BY SentDate DESC, LinkID DESC. In the outer query reverse the order: ORDER BY SentDate ASC, LinkID ASC.
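On the EF side, the same tie-breaker can go into the orderBy delegate; a small sketch, assuming LinkID is unique:
// Deterministic ordering sketch: SentDate first, then the unique LinkID,
// so pagination stays stable even when SentDate values repeat.
Func<IQueryable<Link>, IOrderedQueryable<Link>> orderBy =
    q => q.OrderByDescending(x => x.SentDate)
          .ThenByDescending(x => x.LinkID);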
Obviously, if you want to jump to page 1000, then the inner query would have to read 10,000 rows, so the further you go, the slower it gets.
In any case, you need to have an index on SentDate (or SentDate, LinkID) to make it work. Without an index the query would scan the whole table again.
I'm not telling you here how to translate this query into EF, because I don't know; I've never used EF. There may be a way. Also, apparently, you can just force it to run actual SQL rather than trying to play with the C# code.
Update
Execution plans comparison
In my database I have a table EventLogErrors with 29,477,859 rows, and on SQL Server 2008 I compared the query with ROW_NUMBER that EF generates against what I suggested here with TOP. I tried to retrieve the fourth page, 10 rows long. In both cases the optimizer was smart enough to read only 40 rows, as you can see from the execution plans. I used a primary key column for ordering and pagination for this test. When I used another indexed column for pagination, the results were the same, i.e. both variants read only 40 rows. Needless to say, both variants returned results in a fraction of a second.
Variant with TOP: (execution plan screenshot)
Variant with ROW_NUMBER: (execution plan screenshot)
What it all means is that the root of your problem is somewhere else. You mentioned that your query runs slowly only sometimes, and I didn't really pay attention to that originally. With such a symptom I would do the following:
Check execution plan.
Check that you do have an index.
Check that the index is not heavily fragmented and the statistics are not outdated.
SQL Server has a feature called auto-parameterization. It also has a feature called parameter sniffing, and a feature called execution plan caching. When all three features work together, the result may be a non-optimal execution plan. There is an excellent article by Erland Sommarskog explaining this in detail: http://www.sommarskog.se/query-plan-mysteries.html It explains how to confirm that the problem really is parameter sniffing by checking the cached execution plan, and what can be done to fix it.
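If parameter sniffing does turn out to be the culprit, one blunt workaround from the EF side is to append OPTION (RECOMPILE) to the generated commands. This sketch assumes EF6, which introduced command interception; the interceptor class name is mine:
using System.Data.Common;
using System.Data.Entity.Infrastructure.Interception;

// Sketch: forces a fresh plan per execution, trading CPU for plan quality.
// Apply selectively; recompiling every query is rarely a good idea.
public class RecompileHintInterceptor : IDbCommandInterceptor
{
    public void ReaderExecuting(DbCommand command,
        DbCommandInterceptionContext<DbDataReader> interceptionContext)
    {
        if (!command.CommandText.EndsWith("OPTION (RECOMPILE)"))
            command.CommandText += " OPTION (RECOMPILE)";
    }

    // Remaining interface members left as no-ops.
    public void ReaderExecuted(DbCommand command, DbCommandInterceptionContext<DbDataReader> interceptionContext) { }
    public void NonQueryExecuting(DbCommand command, DbCommandInterceptionContext<int> interceptionContext) { }
    public void NonQueryExecuted(DbCommand command, DbCommandInterceptionContext<int> interceptionContext) { }
    public void ScalarExecuting(DbCommand command, DbCommandInterceptionContext<object> interceptionContext) { }
    public void ScalarExecuted(DbCommand command, DbCommandInterceptionContext<object> interceptionContext) { }
}

// Registration, e.g. at application startup:
// DbInterception.Add(new RecompileHintInterceptor());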

I'm guessing the WHERE row_number > 0 will change over time as you ask for page 2, page 3, etc...
As such, I'm curious if it would help to create this index:
CREATE INDEX idx_links_SentDate_desc ON [dbo].[Links] ([SentDate] DESC)
In all honesty, IF it works, it's pretty much a band-aid and you'll probably need to rebuild this index on a frequent basis, as I'm guessing it will get fragmented over time...
UPDATE: check the comments! Turns out the DESC has no effect whatsoever and should be avoided if your data comes in low to high!

Sometimes the inner select can cause problems with the execution plan, but it's the easiest way for the expression tree to be built from the code. Usually, it won't affect performance too much.
Clearly in this case it does. One workaround is to use your own query with ExecuteStoreQuery. Something like this:
int takeNo = 20;
int skipNo = 100;
var results = db.ExecuteStoreQuery<Link>(
"SELECT LinkID, Title, Url, Description, SentDate, VisitCount, RssSourceId, ReviewStatus, UserAccountId, CreationDate FROM Links");
// Note: the ordering and paging below run in memory, after every row
// has already been pulled from the server.
var page = results.OrderBy(x => x.SentDate).Skip(skipNo).Take(takeNo).ToList();
Of course you lose a lot of the benefits of using an ORM in the first place by doing this, but it might be acceptable for an exceptional circumstance.
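If you do drop down to raw SQL like this, you can at least keep the paging on the server instead of sorting the whole table in memory. A sketch, using the column names from the question ({0} and {1} are EF's positional parameter placeholders):
int pageSize = 10, pageNumber = 3; // zero-based page index

var page = db.ExecuteStoreQuery<Link>(@"
    SELECT LinkID, Title, Url, Description, SentDate, VisitCount,
           RssSourceId, ReviewStatus, UserAccountId, CreationDate
    FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY SentDate DESC) AS rn
           FROM dbo.Links ) AS t
    WHERE t.rn > {0} AND t.rn <= {1}
    ORDER BY t.SentDate DESC",
    pageNumber * pageSize, (pageNumber + 1) * pageSize).ToList();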

This looks like a standard paging query. I would guess that you do not have an index on SentDate. If so, the first thing to try is adding an index on SentDate and seeing what kind of impact that has on performance. Assuming that you do not always want to sort/page on SentDate, and that indexing every column you might want to sort/page by is not going to happen, take a look at this other Stack Overflow question. In some cases, SQL Server's "Gather Streams" parallelism operation can overflow into TempDb. When this happens, performance goes into the toilet. As the other answer says, indexing the column can help, as can disabling parallelism. Check out your query plan and see if it looks like this might be the issue.

I am not very good with EF, but I can give you some hints. First, check whether you have a non-clustered index on [Extent1].[SentDate]. If not, create one; if one exists, consider rebuilding or reorganizing it.
Then change your query like this. Your original SQL is just unnecessarily complex and returns the same result as the one I show here. Try to write things simply; they will work faster and maintenance will also be easier.
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
[Extent1].[Url] AS [Url],
[Extent1].[Description] AS [Description],
[Extent1].[SentDate] AS [SentDate],
[Extent1].[VisitCount] AS [VisitCount],
[Extent1].[RssSourceId] AS [RssSourceId],
[Extent1].[ReviewStatus] AS [ReviewStatus],
[Extent1].[UserAccountId] AS [UserAccountId],
[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
or modify it a little bit like this, in case the first one returns something different:
select top 10 A.* from (
SELECT
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
[Extent1].[Url] AS [Url],
[Extent1].[Description] AS [Description],
[Extent1].[SentDate] AS [SentDate],
[Extent1].[VisitCount] AS [VisitCount],
[Extent1].[RssSourceId] AS [RssSourceId],
[Extent1].[ReviewStatus] AS [ReviewStatus],
[Extent1].[UserAccountId] AS [UserAccountId],
[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1] ) A
ORDER BY A.[SentDate] DESC
I am 99% sure it will work.

Have you tried chaining in the method?
IQueryable<TEntity> query = _dbSet;
query = filter != null ? query.Where(filter) : query;
query = orderBy != null ? orderBy(query) : query;
return query;
I am wondering if this will change the query that is created by EF.

Your code looks somewhat obscure to me, and this is the first time I have encountered this style of querying. As you said, it sometimes takes too long to execute, which suggests the query can be interpreted in different ways somewhere, perhaps by ignoring EF performance considerations in some cases. So try rearranging the query conditions/selections, and consider lazy loading in your program logic.

Aren't you bitten by the statistics update problem in SQL Server?
ALTER DATABASE YourDBName
SET AUTO_UPDATE_STATISTICS_ASYNC ON
The default is OFF, so SQL Server will stall when 20% of your data has changed, waiting for the statistics update to complete before running the query.

I have run into similar issues before where EF decides to decorate the SQL it runs in a very non-performant fashion.
Anyways, to provide a possible solution to your question:
In instances where I don't like what EF does with my code to generate SQL statements, I end up writing a stored procedure, importing it into my EDMX as a function, and using that to retrieve my data. It affords me control over how to formulate the SQL, and I know exactly which index to leverage to get the best performance. I imagine you know how to write a stored proc and import it as a function into EF, so I will leave those details out. Hope this helps you.
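For illustration only, once the procedure is imported the call site might look something like this (GetLinksPaged is a hypothetical name, not an existing function):
// Hypothetical function import generated from a paging stored procedure.
var page = db.GetLinksPaged(pageNumber, pageSize).ToList();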
I will still keep checking this page to see if someone comes up with a nicer, less painful solution to your issue.

Call me crazy, but it looks like you've got the thing ordering itself with itself when this code is called:
if (orderBy != null)
{
query = orderBy(query);
}
I think that would explain the whole "sometimes it's slow" bit. Probably runs fine until you have something in the orderBy parameter, then it's calling itself and creating that row numbered sub-select that slows it down.
Try commenting out the query = orderBy(query) portion of your code and see if you still get the slow down. I'm betting that you won't.
Also, you can simplify your code using Dynamic LINQ. It basically lets you specify sorting with the string name of a field (.OrderBy("SomeField")) instead of trying to pass in a method, which I've found to be a lot easier. I use that in MVC apps to handle sorting by whatever field the user clicks on in a grid.
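A quick sketch with the System.Linq.Dynamic library, assuming the field name arrives as a string (for example from a grid header click; the Links set name is assumed from the question):
using System.Linq.Dynamic; // Dynamic LINQ package

// The sort direction is part of the ordering string in Dynamic LINQ.
var sorted = dbContext.Links.OrderBy("SentDate descending").Take(10).ToList();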

Try adding a non-clustered index on SentDate

Related

Horrifically inefficient query generated by Entity Framework 6

Here's the query I want:
select top 10 *
from vw_BoosterTargetLog
where OrganizationId = 4125
order by Id desc
It executes subsecond.
Here's my Entity Framework (6.1.2) equivalent in C#:
return await db.vw_BoosterTargetLog
.Where(x => x.OrganizationId == organizationId)
.OrderByDescending(x => x.Id)
.Take(numberToRun)
.ToListNolockAsync();
And here's the SQL that it generates:
SELECT TOP (10)
[Project1].[OrganizationId] AS [OrganizationId],
[Project1].[BoosterTriggerId] AS [BoosterTriggerId],
[Project1].[IsAutomatic] AS [IsAutomatic],
[Project1].[C1] AS [C1],
[Project1].[CustomerUserId] AS [CustomerUserId],
[Project1].[SourceUrl] AS [SourceUrl],
[Project1].[TargetUrl] AS [TargetUrl],
[Project1].[ShowedOn] AS [ShowedOn],
[Project1].[ClickedOn] AS [ClickedOn],
[Project1].[BoosterTargetId] AS [BoosterTargetId],
[Project1].[TriggerEventGroup] AS [TriggerEventGroup],
[Project1].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[Project1].[TargetTitle] AS [TargetTitle],
[Project1].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[Project1].[Version] AS [Version],
[Project1].[CookieId] AS [CookieId],
[Project1].[CoalescedId] AS [CoalescedId],
[Project1].[OrganizationName] AS [OrganizationName],
[Project1].[ShowedOnDate] AS [ShowedOnDate],
[Project1].[SampleGroupSectionName] AS [SampleGroupSectionName],
[Project1].[Selector] AS [Selector],
[Project1].[SelectorStep] AS [SelectorStep]
FROM ( SELECT
[Extent1].[OrganizationId] AS [OrganizationId],
[Extent1].[OrganizationName] AS [OrganizationName],
[Extent1].[BoosterTriggerId] AS [BoosterTriggerId],
[Extent1].[IsAutomatic] AS [IsAutomatic],
[Extent1].[SampleGroupSectionName] AS [SampleGroupSectionName],
[Extent1].[Selector] AS [Selector],
[Extent1].[SelectorStep] AS [SelectorStep],
[Extent1].[BoosterTargetId] AS [BoosterTargetId],
[Extent1].[CookieId] AS [CookieId],
[Extent1].[CustomerUserId] AS [CustomerUserId],
[Extent1].[CoalescedId] AS [CoalescedId],
[Extent1].[SourceUrl] AS [SourceUrl],
[Extent1].[TriggerEventGroup] AS [TriggerEventGroup],
[Extent1].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[Extent1].[TargetTitle] AS [TargetTitle],
[Extent1].[TargetUrl] AS [TargetUrl],
[Extent1].[ShowedOn] AS [ShowedOn],
[Extent1].[ShowedOnDate] AS [ShowedOnDate],
[Extent1].[ClickedOn] AS [ClickedOn],
[Extent1].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[Extent1].[Version] AS [Version],
CAST( [Extent1].[Id] AS int) AS [C1]
FROM (SELECT
[vw_BoosterTargetLog].[OrganizationId] AS [OrganizationId],
[vw_BoosterTargetLog].[OrganizationName] AS [OrganizationName],
[vw_BoosterTargetLog].[BoosterTriggerId] AS [BoosterTriggerId],
[vw_BoosterTargetLog].[IsAutomatic] AS [IsAutomatic],
[vw_BoosterTargetLog].[SampleGroupSectionName] AS [SampleGroupSectionName],
[vw_BoosterTargetLog].[Selector] AS [Selector],
[vw_BoosterTargetLog].[SelectorStep] AS [SelectorStep],
[vw_BoosterTargetLog].[BoosterTargetId] AS [BoosterTargetId],
[vw_BoosterTargetLog].[CookieId] AS [CookieId],
[vw_BoosterTargetLog].[CustomerUserId] AS [CustomerUserId],
[vw_BoosterTargetLog].[CoalescedId] AS [CoalescedId],
[vw_BoosterTargetLog].[Id] AS [Id],
[vw_BoosterTargetLog].[SourceUrl] AS [SourceUrl],
[vw_BoosterTargetLog].[TriggerEventGroup] AS [TriggerEventGroup],
[vw_BoosterTargetLog].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[vw_BoosterTargetLog].[TargetTitle] AS [TargetTitle],
[vw_BoosterTargetLog].[TargetUrl] AS [TargetUrl],
[vw_BoosterTargetLog].[ShowedOn] AS [ShowedOn],
[vw_BoosterTargetLog].[ShowedOnDate] AS [ShowedOnDate],
[vw_BoosterTargetLog].[ClickedOn] AS [ClickedOn],
[vw_BoosterTargetLog].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[vw_BoosterTargetLog].[Version] AS [Version]
FROM [dbo].[vw_BoosterTargetLog] AS [vw_BoosterTargetLog]) AS [Extent1]
WHERE [Extent1].[OrganizationId] = 4125
) AS [Project1]
ORDER BY [Project1].[C1] DESC
It's ugly as hell, of course, as all EF queries are: I'm not complaining about that. My gripe is that in my testing, best-case, it executes about 10x slower than the first, and worst-case, about 100x slower.
For a query this simple, that seems way beyond all reasonable expectation.
Obviously I can execute SQL directly, or execute a sproc, or something of that sort. And while I'm waiting for feedback, that's what I'll do. But does anyone have any other suggestions about how to speed this up? Is there any way to encourage EF to generate reasonable SQL in a situation like this?
The queries EF produces, while terrible from a readability perspective, are usually still quite reasonable -- and I say that as someone who does almost all data access through stored procedures with hand-written queries. But in order for that to work, the model EF has of the database needs to match the actual database; otherwise conversions will be introduced, and when that happens it's very easy to get horrible performance drops while all the data is converted and no indexes can be used.
If we eliminate some nesting, the EF query can be simplified to
SELECT TOP (10) *
FROM (
SELECT *, CAST(Id AS INT) AS C1
FROM vw_BoosterTargetLog
WHERE OrganizationId = 4125
) _
ORDER BY C1 DESC
(This is not the actual result set because Id isn't part of the final result set in the real query, but pretend I wrote out all the columns just like EF did.)
If vw_BoosterTargetLog.Id is not actually an INT, this forces a conversion of all rows before the ordering takes place, which is much slower. The solution is to figure out the actual type of the column (in this case, BIGINT) and update your model accordingly.
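A sketch of that fix on the model side (the class layout is illustrative; the exact change depends on whether the model is EDMX- or code-based):
// If the view's Id column is BIGINT, the mapped property must be long.
// With an int property, EF emits CAST([Id] AS int), which forces the
// conversion of every row before the ORDER BY can run.
public class vw_BoosterTargetLog
{
    public long Id { get; set; }   // was: public int Id { get; set; }
    public int OrganizationId { get; set; }
    // ...remaining columns unchanged...
}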

Count or Skip(1).Any() where I want to find out if there is more than 1 record - Entity Framework

I'm not sure when, but I read an article indicating that Skip(1).Any() is better than a Count() comparison when using Entity Framework (I may be remembering it wrong). I'm not sure about this after seeing the generated T-SQL code.
Here is the first option:
int userConnectionCount = _dbContext.HubConnections.Count(conn => conn.UserId == user.Id);
bool isAtSingleConnection = (userConnectionCount == 1);
This generates the following T-SQL code which is reasonable:
SELECT
[GroupBy1].[A1] AS [C1]
FROM ( SELECT
COUNT(1) AS [A1]
FROM [dbo].[HubConnections] AS [Extent1]
WHERE [Extent1].[UserId] = @p__linq__0
) AS [GroupBy1]
Here is the other option which is the suggested query as far as I remember:
bool isAtSingleConnection = !_dbContext
.HubConnections.OrderBy(conn => conn.Id)
.Skip(1).Any(conn => conn.UserId == user.Id);
Here is the generated T-SQL for the above LINQ query:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM ( SELECT [Extent1].[Id] AS [Id], [Extent1].[UserId] AS [UserId]
FROM ( SELECT [Extent1].[Id] AS [Id], [Extent1].[UserId] AS [UserId], row_number() OVER (ORDER BY [Extent1].[Id] ASC) AS [row_number]
FROM [dbo].[HubConnections] AS [Extent1]
) AS [Extent1]
WHERE [Extent1].[row_number] > 1
) AS [Skip1]
WHERE [Skip1].[UserId] = @p__linq__0
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT
1 AS [C1]
FROM ( SELECT [Extent2].[Id] AS [Id], [Extent2].[UserId] AS [UserId]
FROM ( SELECT [Extent2].[Id] AS [Id], [Extent2].[UserId] AS [UserId], row_number() OVER (ORDER BY [Extent2].[Id] ASC) AS [row_number]
FROM [dbo].[HubConnections] AS [Extent2]
) AS [Extent2]
WHERE [Extent2].[row_number] > 1
) AS [Skip2]
WHERE [Skip2].[UserId] = @p__linq__0
)) THEN cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1];
Which one is the proper way here? Is there a big performance difference between these two?
Query performance depends on a lot of things, like the indexes that are present, the actual data, and how stale the statistics about the data are. The SQL query plan optimizer looks at these different metrics to come up with an efficient query plan. So any straightforward answer that says query 1 is always better than query 2, or the opposite, would be incorrect.
That said, my answer below tries to explain the article's stance and how Skip(1).Any() could be (marginally) better than doing a Count() > 1. The second query, though bigger in size and mostly unreadable, looks like it could be interpreted in an efficient fashion. Again, this depends on the things mentioned above. The idea is that the number of rows the database has to look at to figure out the result is higher in the Count() case. In the Count case, assuming the required indexes are there (a clustered index on Id to make the OrderBy in the second case efficient), the database has to go through every matching row to count them. In the second case, it has to go through a maximum of two rows to arrive at the answer.
Let's get more scientific in our analysis and see if my theory above holds any ground. For this, I am creating a dummy database of customers. The Customer type looks like this:
public class Customer
{
public int ID { get; set; }
public string Name { get; set; }
public int Age { get; set; }
}
I am seeding the database with some 100K random rows (I really have to prove this) using this code:
for (int j = 0; j < 100; j++)
{
using (CustomersContext db = new CustomersContext())
{
Random r = new Random();
for (int i = 0; i < 1000; i++)
{
Customer c = new Customer
{
Name = Guid.NewGuid().ToString(),
Age = r.Next(0, 100)
};
db.Customers.Add(c);
}
db.SaveChanges();
}
}
Now, the queries that I am going to use are as follows,
db.Customers.Where(c => c.Age == 26).Count() > 1; // scenario 1
db.Customers.Where(c => c.Age == 26).OrderBy(c => c.ID).Skip(1).Any() // scenario 2
I started SQL Profiler to capture the query plans. The captured plans look as follows:
Scenario 1: (execution plan screenshot; note the estimated cost and actual row count)
Scenario 2: (execution plan screenshot; note the estimated cost and actual row count)
As per the initial guess, the estimated cost and the number of rows are lower in the Skip/Any case than in the Count case.
Conclusion:
All this analysis aside, as many others have commented, these are not the kind of performance optimizations you should try to make in your code. Things like these hurt readability for very minimal (I would say non-existent) perf benefit. I did this analysis for fun and would never use it as a basis for choosing scenario 2. I would measure and confirm that Count() is actually hurting before changing the code to use Skip().Any().
I read an article on this which indicates that the usage of Skip(1).Any() is better than Count().
That statement is quite true for a LINQ to Objects query. On a LINQ to Objects query, Skip(1).Any() only needs to try to get the first two items of the sequence, and it can ignore all of the items that come after. If the sequence involves rather expensive operations (and properly defers execution), or even more importantly, if the sequence is infinite, this could be a big deal. For most queries it will matter a bit, but often not a lot.
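To make that concrete, a small LINQ to Objects sketch (the names are mine):
using System.Collections.Generic;
using System.Linq;

// An infinite sequence: Count() would never terminate on it,
// but Skip(1).Any() stops after examining just two elements.
static IEnumerable<int> Numbers()
{
    for (int i = 0; ; i++) yield return i;
}

bool moreThanOne = Numbers().Skip(1).Any(); // true, returns immediately
// bool hangs = Numbers().Count() > 1;      // would loop forever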
For a LINQ query based on a query provider instead, it's unlikely to be significantly different. Particularly with EF, as you have seen, the generated query is not noticeably different. Is it possible for there to be a difference? Sure. One case could be handled better than the other by the query provider, particular queries might be optimized better with the refactoring used by one or the other, etc.
If someone is suggesting that there's a major difference in the EF query between these two, odds are they're mistakenly applying a guideline that was designed to apply to LINQ to Objects queries.
It will definitely depend on the record count in your table/data set. If you have a lot of records, then doing a count on an identity will be very fast, because it's indexed, but skipping one record and then getting the next record would be faster still.
Granted, this could be done in under a millisecond in either case. Unless your record count exceeds 10,000+ records, it really won't matter unless you need it to return under a specific threshold. Don't forget that SQL Server caches query execution plans. If you re-run the same query, you may not see a difference after the first run, unless the data changes significantly beneath it.

How to write this SQL query in Entity Framework?

I have this SQL query that I want Entity Framework to produce, pretty much 1:1:
SELECT GroupId, ItemId, count(*) as total
FROM [TESTDB].[dbo].[TestTable]
WHERE GroupId = '64025'
GROUP BY GroupId, ItemId
ORDER BY GroupId, total DESC
This SQL query sorts based on the number of occurrences of the same ItemId (for that group).
I have this now:
from x in dataContext.TestTable.AsNoTracking()
where x.GroupId == 64025
group x by new {x.GroupId, x.ItemId}
into g
orderby g.Key.GroupId, g.Count() descending
select new {g.Key.GroupId, g.Key.ItemId, Count = g.Count()};
But this generates the following SQL code:
SELECT
[GroupBy1].[K1] AS [GroupId],
[GroupBy1].[K2] AS [ItemId],
[GroupBy1].[A2] AS [C1]
FROM ( SELECT
[Extent1].[GroupId] AS [K1],
[Extent1].[ItemId] AS [K2],
COUNT(1) AS [A1],
COUNT(1) AS [A2]
FROM [dbo].[TestTable] AS [Extent1]
WHERE 64025 = [Extent1].[GroupId]
GROUP BY [Extent1].[GroupId], [Extent1].[ItemId]
) AS [GroupBy1]
ORDER BY [GroupBy1].[K1] ASC, [GroupBy1].[A1] DESC
This also works, but is a factor of 2 slower than the SQL I created.
I've been fiddling around with the linq code for a while but I haven't managed to create something similar to my query.
Execution plan (only the last two items, the first two are identical):
FIRST: |--Stream Aggregate(GROUP BY:([Extent1].[ItemId]) DEFINE:([Expr1006]=Count(*), [Extent1].[GroupId]=ANY([TESTDB].[dbo].[TestTable].[GroupId] as [Extent1].[GroupId])))
|--Index Seek(OBJECT:([TESTDB].[dbo].[TestTable].[IX_Group]), SEEK:([TESTDB].[dbo].[TestTable].[GroupId]=(64034)) ORDERED FORWARD)
SECOND: |--Stream Aggregate(GROUP BY:([TESTDB].[dbo].[TestTable].[ItemId]) DEFINE:([Expr1007]=Count(*), [TESTDB].[dbo].[TestTable].[GroupId]=ANY([TESTDB].[dbo].[TestTable].[GroupId])))
|--Index Seek(OBJECT:([TESTDB].[dbo].[TestTable].[IX_Group] AS [Extent1]), SEEK:([Extent1].[GroupId]=(64034)) ORDERED FORWARD)
The query that Entity Framework generates and your hand-crafted query are semantically the same and will give the same plan.
The derived table definition is inlined during query optimization, so the only difference might be some extremely minor additional overhead during parsing and compilation.
The snippets of SHOWPLAN_TEXT you have posted are the same plan; the only difference is the aliases. It looks as though your table definition is something like:
CREATE TABLE [dbo].[TestTable]
(
[GroupId] INT,
[ItemId] INT
)
CREATE NONCLUSTERED INDEX IX_Group ON [dbo].[TestTable] ([GroupId], [ItemId])
And you are getting a plan like this: (execution plan screenshot)
To all intents and purposes the plans are the same. Your performance-testing methodology is probably flawed: maybe your first query brought pages into the cache that then benefited the second query, for example.

LINQ generating SQL with duplicate nested selects

I'm very new to the .NET Entity Framework, and I think it's awesome, but somehow I'm getting this strange issue (sorry for the Spanish, but my program is in that language; anyway it's not a big deal, just the column and property names): I'm doing a normal LINQ to Entities query to get a list of UltimaConsulta, like this:
var query = from uc in bd.UltimasConsultas
select uc;
UltimasConsultas is a view, btw. The thing is that LINQ is generating this SQL for the query:
SELECT
[Extent1].[IdPaciente] AS [IdPaciente],
[Extent1].[Nombre] AS [Nombre],
[Extent1].[PrimerApellido] AS [PrimerApellido],
[Extent1].[SegundoApellido] AS [SegundoApellido],
[Extent1].[Fecha] AS [Fecha]
FROM (SELECT
[UltimasConsultas].[IdPaciente] AS [IdPaciente],
[UltimasConsultas].[Nombre] AS [Nombre],
[UltimasConsultas].[PrimerApellido] AS [PrimerApellido],
[UltimasConsultas].[SegundoApellido] AS [SegundoApellido],
[UltimasConsultas].[Fecha] AS [Fecha]
FROM [dbo].[UltimasConsultas] AS [UltimasConsultas]) AS [Extent1]
Why is LINQ generating a nested SELECT? I thought, from videos and examples, that it generates normal SQL SELECTs for this kind of query. Do I have to configure something (the entity model was generated with a wizard, so it's the default configuration)? Thanks in advance for your answers.
To be clear, LINQ to Entities does not generate the SQL. Instead, it generates an ADO.NET canonical command tree, and the ADO.NET provider for your database, presumably SQL Server in this case, generates the SQL.
So why does it generate this derived table (I think "derived table" is the more correct term for the SQL feature in use here)? Because the code which generates the SQL has to generate SQL for a wide variety of LINQ queries, most of which are not nearly as trivial as the one you show. These queries will often be selecting data for multiple types (many of which might be anonymous, rather than named types), and in order to keep the SQL generation relatively sane, they are grouped into extents for each type.
Another question: Why should you care? It's easy to demonstrate that the use of the derived table in this statement is "free" from a performance point of view.
I selected a table at random from a populated database, and ran the following query:
SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]
Let's look at the cost:
<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS ******/
SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]" StatementType="SELECT">
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
<RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">
Now let's compare that to the query with the derived table:
SELECT
[Extent1].[AddressId]
,[Extent1].[Address1]
,[Extent1].[Address2]
,[Extent1].[City]
,[Extent1].[State]
,[Extent1].[ZIP]
,[Extent1].[ZIPExtension]
FROM (SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM[VertexRM].[dbo].[Address]) AS [Extent1]
And the cost:
<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS ******/
SELECT
[Extent1].[AddressId]
,[Extent1].[Address1]
,[Extent1].[Address2]
,[Extent1].[City]
,[Extent1].[State]
,[Extent1].[ZIP]
,[Extent1].[ZIPExtension]
FROM (SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM[VertexRM].[dbo].[Address]) AS [Extent1]" StatementType="SELECT">
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
<RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">
In both cases, SQL Server simply scans the clustered index. Not surprisingly, the cost is almost precisely the same.
Let's take a look at a slightly more complicated query. I fired up LINQPad, and entered the following query against the same table, plus one related table:
from a in Addresses
select new
{
Id = a.Id,
Address1 = a.Address1,
Address2 = a.Address2,
City = a.City,
State = a.State,
ZIP = a.ZIP,
ZIPExtension = a.ZIPExtension,
PersonCount = a.EntityAddresses.Count()
}
This generates the following SQL:
SELECT
1 AS [C1],
[Project1].[AddressId] AS [AddressId],
[Project1].[Address1] AS [Address1],
[Project1].[Address2] AS [Address2],
[Project1].[City] AS [City],
[Project1].[State] AS [State],
[Project1].[ZIP] AS [ZIP],
[Project1].[ZIPExtension] AS [ZIPExtension],
[Project1].[C1] AS [C2]
FROM ( SELECT
[Extent1].[AddressId] AS [AddressId],
[Extent1].[Address1] AS [Address1],
[Extent1].[Address2] AS [Address2],
[Extent1].[City] AS [City],
[Extent1].[State] AS [State],
[Extent1].[ZIP] AS [ZIP],
[Extent1].[ZIPExtension] AS [ZIPExtension],
(SELECT
COUNT(cast(1 as bit)) AS [A1]
FROM [dbo].[EntityAddress] AS [Extent2]
WHERE [Extent1].[AddressId] = [Extent2].[AddressId]) AS [C1]
FROM [dbo].[Address] AS [Extent1]
) AS [Project1]
Analyzing this, we can see that Project1 is the projection onto the anonymous type. Extent1 is the Address table/entity. And Extent2 is the table for the association. Now there is no derived table for Address, but there is one for the projection.
I don't know if you have ever written a SQL generation system, but it isn't easy. I believe that the general problem of proving that a LINQ to Entities query and a SQL query are equivalent is NP-hard, although certain specific cases are obviously much easier. SQL is intentionally Turing-incomplete, because its designers wanted all SQL queries to execute in bounded time. LINQ, not so.
In short, this is a very difficult problem to solve, and the combination of the Entity Framework and its providers do occasionally sacrifice some readability in favor of consistency over a wide range of queries. But it shouldn't be a performance issue.
Basically it's defining what Extent1 consists of and which variables will relate to each entry. Then it's mapping the actual database table to Extent1 so that it can return all entries for that table.
This is what your query is asking for. It's just that LINQ can't add in a wildcard character as you would if you'd written it by hand.

Why did the following linq to sql query generate a subquery?

I did the following query:
var list = from book in books
where book.price > 50
select book;
list = list.Take(50);
I would expect the above to generate something like:
SELECT top 50 id, title, price, author
FROM Books
WHERE price > 50
but it generates:
SELECT
[Limit1].[C1] as [C1],
[Limit1].[id] as [Id],
[Limit1].[title] as [title],
[Limit1].[price] as [price],
[Limit1].[author] as [author]
FROM (SELECT TOP (50)
[Extent1].[id] as [Id],
[Extent1].[title] as [title],
[Extent1].[price] as [price],
[Extent1].[author] as [author]
FROM Books as [Extent1]
WHERE [Extent1].[price] > 50
) AS [Limit1]
Why does the above linq query generate a subquery and where does the C1 come from?
Disclaimer: I've never used LINQ before...
My guess would be paging support? I guess you have some sort of Take(50, 50) method that gets 50 records, starting at record 50. Take a look at the SQL that such a query generates and you will probably find that it uses a similar sub-query structure to allow it to return any 50 rows of a query in approximately the time it takes to return the first 50 rows.
In any case, the nested sub-query doesn't add any performance overhead, as it's automagically optimised away during compilation of the execution plan.
You could still make it cleaner like this:
var c = (from co in db.countries
where co.regionID == 5
select co).Take(50);
This will result in:
Table(country).Where(co => (co.regionID = Convert(5))).Take(50)
Equivalent to:
SELECT TOP (50) [t0].[countryID], [t0].[regionID], [t0].[countryName], [t0].[code]
FROM [dbo].[countries] AS [t0]
WHERE [t0].[regionID] = 5
EDIT: Per the comments, it's not necessarily the separate Take() that causes this; you can still use it like this:
var c = (from co in db.countries
where co.regionID == 5
select co);
var l = c.Take(50).ToList();
And the result would be the same as before.
SELECT TOP (50) [t0].[countryID], [t0].[regionID], [t0].[countryName], [t0].[code]
FROM [dbo].[countries] AS [t0]
WHERE [t0].[regionID] = @p0
The fact that you wrote IQueryable = IQueryable.Take(50) is the tricky part here.
The subquery is generated for projection purposes; it makes more sense when you select from multiple tables into a single anonymous object, in which case the outer query is used to gather the results.
Try what happens with something like this:
from book in books
where book.price > 50
select new
{
Title = book.title,
Chapters = from chapter in book.Chapters
select chapter.Title
}
Isn't it a case of the first query returning the total number of rows while the second extracts the subset of rows based on the call to the .Take() method?
I agree with @Justin Swartsel. There was no error involved, so this is largely an academic matter.
Linq-to-SQL endeavors to generate SQL that runs efficiently (which it did in your case).
But it does not make any effort to generate conventional SQL that a human would likely create.
The Linq-to-SQL implementers likely used the builder pattern to generate the SQL.
If so, it would be easier to append a substring (or a subquery in this case) than it would be to backtrack and insert a 'TOP x' fragment into the SELECT clause.
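A tiny sketch of that idea (purely illustrative, not the actual LINQ to SQL source):
// Builder-style generation: the inner SELECT is already complete, so the
// cheapest way to limit rows is to wrap it in a derived table.
string inner =
    "SELECT [Extent1].[id], [Extent1].[title], [Extent1].[price], [Extent1].[author] " +
    "FROM Books AS [Extent1] WHERE [Extent1].[price] > 50";

int take = 50;
string paged = $"SELECT TOP ({take}) * FROM ({inner}) AS [Limit1]";
// Splicing "TOP (50)" into the middle of `inner` would mean re-parsing it.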
