Here's the query I want:
select top 10 *
from vw_BoosterTargetLog
where OrganizationId = 4125
order by Id desc
It executes subsecond.
Here's my Entity Framework (6.1.2) equivalent in C#:
return await db.vw_BoosterTargetLog
.Where(x => x.OrganizationId == organizationId)
.OrderByDescending(x => x.Id)
.Take(numberToRun)
.ToListNolockAsync();
And here's the SQL that it generates:
SELECT TOP (10)
[Project1].[OrganizationId] AS [OrganizationId],
[Project1].[BoosterTriggerId] AS [BoosterTriggerId],
[Project1].[IsAutomatic] AS [IsAutomatic],
[Project1].[C1] AS [C1],
[Project1].[CustomerUserId] AS [CustomerUserId],
[Project1].[SourceUrl] AS [SourceUrl],
[Project1].[TargetUrl] AS [TargetUrl],
[Project1].[ShowedOn] AS [ShowedOn],
[Project1].[ClickedOn] AS [ClickedOn],
[Project1].[BoosterTargetId] AS [BoosterTargetId],
[Project1].[TriggerEventGroup] AS [TriggerEventGroup],
[Project1].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[Project1].[TargetTitle] AS [TargetTitle],
[Project1].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[Project1].[Version] AS [Version],
[Project1].[CookieId] AS [CookieId],
[Project1].[CoalescedId] AS [CoalescedId],
[Project1].[OrganizationName] AS [OrganizationName],
[Project1].[ShowedOnDate] AS [ShowedOnDate],
[Project1].[SampleGroupSectionName] AS [SampleGroupSectionName],
[Project1].[Selector] AS [Selector],
[Project1].[SelectorStep] AS [SelectorStep]
FROM ( SELECT
[Extent1].[OrganizationId] AS [OrganizationId],
[Extent1].[OrganizationName] AS [OrganizationName],
[Extent1].[BoosterTriggerId] AS [BoosterTriggerId],
[Extent1].[IsAutomatic] AS [IsAutomatic],
[Extent1].[SampleGroupSectionName] AS [SampleGroupSectionName],
[Extent1].[Selector] AS [Selector],
[Extent1].[SelectorStep] AS [SelectorStep],
[Extent1].[BoosterTargetId] AS [BoosterTargetId],
[Extent1].[CookieId] AS [CookieId],
[Extent1].[CustomerUserId] AS [CustomerUserId],
[Extent1].[CoalescedId] AS [CoalescedId],
[Extent1].[SourceUrl] AS [SourceUrl],
[Extent1].[TriggerEventGroup] AS [TriggerEventGroup],
[Extent1].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[Extent1].[TargetTitle] AS [TargetTitle],
[Extent1].[TargetUrl] AS [TargetUrl],
[Extent1].[ShowedOn] AS [ShowedOn],
[Extent1].[ShowedOnDate] AS [ShowedOnDate],
[Extent1].[ClickedOn] AS [ClickedOn],
[Extent1].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[Extent1].[Version] AS [Version],
CAST( [Extent1].[Id] AS int) AS [C1]
FROM (SELECT
[vw_BoosterTargetLog].[OrganizationId] AS [OrganizationId],
[vw_BoosterTargetLog].[OrganizationName] AS [OrganizationName],
[vw_BoosterTargetLog].[BoosterTriggerId] AS [BoosterTriggerId],
[vw_BoosterTargetLog].[IsAutomatic] AS [IsAutomatic],
[vw_BoosterTargetLog].[SampleGroupSectionName] AS [SampleGroupSectionName],
[vw_BoosterTargetLog].[Selector] AS [Selector],
[vw_BoosterTargetLog].[SelectorStep] AS [SelectorStep],
[vw_BoosterTargetLog].[BoosterTargetId] AS [BoosterTargetId],
[vw_BoosterTargetLog].[CookieId] AS [CookieId],
[vw_BoosterTargetLog].[CustomerUserId] AS [CustomerUserId],
[vw_BoosterTargetLog].[CoalescedId] AS [CoalescedId],
[vw_BoosterTargetLog].[Id] AS [Id],
[vw_BoosterTargetLog].[SourceUrl] AS [SourceUrl],
[vw_BoosterTargetLog].[TriggerEventGroup] AS [TriggerEventGroup],
[vw_BoosterTargetLog].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[vw_BoosterTargetLog].[TargetTitle] AS [TargetTitle],
[vw_BoosterTargetLog].[TargetUrl] AS [TargetUrl],
[vw_BoosterTargetLog].[ShowedOn] AS [ShowedOn],
[vw_BoosterTargetLog].[ShowedOnDate] AS [ShowedOnDate],
[vw_BoosterTargetLog].[ClickedOn] AS [ClickedOn],
[vw_BoosterTargetLog].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[vw_BoosterTargetLog].[Version] AS [Version]
FROM [dbo].[vw_BoosterTargetLog] AS [vw_BoosterTargetLog]) AS [Extent1]
WHERE [Extent1].[OrganizationId] = 4125
) AS [Project1]
ORDER BY [Project1].[C1] DESC
It's ugly as hell, of course, as all EF queries are: I'm not complaining about that. My gripe is that in my testing, best-case, it executes about 10x slower than the first, and worst-case, about 100x slower.
For a query this simple, that seems way beyond all reasonable expectation.
Obviously I can execute SQL directly, or execute a sproc, or something of that sort. And while I'm waiting for feedback, that's what I'll do. But does anyone have any other suggestions about how to speed this up? Is there any way to encourage EF to generate reasonable SQL in a situation like this?
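For reference, here is a minimal sketch of that direct-SQL fallback using EF6's SqlQuery; it assumes the view is mapped to a vw_BoosterTargetLog entity type, and the parameter names are illustrative:
// Requires: using System.Data.SqlClient; using System.Data.Entity;
var rows = await db.Database.SqlQuery<vw_BoosterTargetLog>(
    "SELECT TOP (@take) * FROM dbo.vw_BoosterTargetLog " +
    "WHERE OrganizationId = @orgId ORDER BY Id DESC",
    new SqlParameter("@take", numberToRun),
    new SqlParameter("@orgId", organizationId))
    .ToListAsync();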
The queries EF produces, while terrible from a readability perspective, are usually still quite reasonable -- and I say that as someone who does almost all data access through stored procedures with hand-written queries. But in order for that to work, the model EF has of the database needs to match the actual database, or else conversions will be introduced, and when that happens it's very easy to get horrible performance drops while all the data is converted and no indexes can be used.
If we eliminate some nesting, the EF query can be simplified to
SELECT TOP (10) *
FROM (
SELECT *, CAST(Id AS INT) AS C1
FROM vw_BoosterTargetLog
WHERE OrganizationId = 4125
) _
ORDER BY C1 DESC
(This is not the actual result set because Id isn't part of the final result set in the real query, but pretend I wrote out all the columns just like EF did.)
If vw_BoosterTargetLog.Id is not actually an INT, this forces a conversion of all rows before the ordering takes place, which is much slower. The solution is to figure out the actual type of the column (in this case, BIGINT) and update your model accordingly.
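For example, here is a minimal sketch of that fix on a code-first model (with an EDMX model you would change the property's type in the designer instead):
public class vw_BoosterTargetLog
{
    // Was: public int Id { get; set; }
    // The view's Id column is actually BIGINT, so the CLR type must be long;
    // otherwise EF wraps the column in CAST([Id] AS int), the ORDER BY runs
    // on the converted value, and the index on Id can't be used.
    public long Id { get; set; }

    public int OrganizationId { get; set; }
    // ...remaining columns unchanged...
}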
Related
I have a very simple query generated by Entity Framework. Sometimes it takes more than 30 seconds to execute, and I get a timeout exception.
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
[Extent1].[Url] AS [Url],
[Extent1].[Description] AS [Description],
[Extent1].[SentDate] AS [SentDate],
[Extent1].[VisitCount] AS [VisitCount],
[Extent1].[RssSourceId] AS [RssSourceId],
[Extent1].[ReviewStatus] AS [ReviewStatus],
[Extent1].[UserAccountId] AS [UserAccountId],
[Extent1].[CreationDate] AS [CreationDate]
FROM ( SELECT [Extent1].[LinkID] AS [LinkID], [Extent1].[Title] AS [Title], [Extent1].[Url] AS [Url], [Extent1].[Description] AS [Description], [Extent1].[SentDate] AS [SentDate], [Extent1].[VisitCount] AS [VisitCount], [Extent1].[RssSourceId] AS [RssSourceId], [Extent1].[ReviewStatus] AS [ReviewStatus], [Extent1].[UserAccountId] AS [UserAccountId], [Extent1].[CreationDate] AS [CreationDate], row_number() OVER (ORDER BY [Extent1].[SentDate] DESC) AS [row_number]
FROM [dbo].[Links] AS [Extent1]
) AS [Extent1]
WHERE [Extent1].[row_number] > 0
ORDER BY [Extent1].[SentDate] DESC
And the code which is generating the Query is:
public async Task<IQueryable<TEntity>> GetAsync(Expression<Func<TEntity, bool>> filter = null,
Func<IQueryable<TEntity>, IOrderedQueryable<TEntity>> orderBy = null)
{
return await Task.Run(() =>
{
IQueryable<TEntity> query = _dbSet;
if (filter != null)
{
query = query.Where(filter);
}
if (orderBy != null)
{
query = orderBy(query);
}
return query;
});
}
Note that when I remove the inner SELECT statement and the WHERE clause and change it to the following, the query executes in less than a second.
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
.
.
.
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
Any advice will be helpful.
UPDATE:
Here is the usage of the above code:
var dbLinks = await _uow.LinkRespository.GetAsync(filter, orderBy);
var pagedLinks = new PagedList<Link>(dbLinks, pageNumber, PAGE_SIZE);
var vmLinks = Mapper.Map<IPagedList<LinkViewItemViewModel>>(pagedLinks);
And filter:
var result = await GetLinks(null, pageNo, a => a.OrderByDescending(x => x.SentDate));
It never occurred to me that you simply didn't have an index. Lesson learnt - always check the basics before digging further.
If you don't need pagination, then the query can be simplified to
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
...
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
and it runs fast, as you've verified.
Apparently, you do need the pagination, so let's see what we can do.
The reason your current version is slow is that it scans the whole table first, calculates a row number for each and every row, and only then returns 10 rows. (I was wrong here. The SQL Server optimizer is pretty smart. The root of your problem is somewhere else. See my update below.)
BTW, as other people mentioned, this pagination will work correctly only if the SentDate column is unique. If it is not unique, you need to ORDER BY SentDate and another unique column, like some ID, to resolve the ambiguity.
If you don't need the ability to jump straight to a particular page, but rather always start with page 1, then go to the next page, the next page and so on, then the proper, efficient way to do such pagination is described in this excellent article: http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way
The author uses PostgreSQL for illustration, but the technique works for MS SQL Server as well. It boils down to remembering the ID of the last row on the shown page and then using this ID in the WHERE clause with appropriate supporting index to retrieve the next page without scanning all previous rows.
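Translated to EF6 LINQ, that keyset ("seek") approach might look like the sketch below; it assumes (SentDate, LinkID) gives a unique ordering, that lastSentDate and lastLinkId were taken from the last row of the previous page, and that a supporting index on (SentDate, LinkID) exists:
// Requires: using System.Data.Entity; for ToListAsync.
var nextPage = await db.Links
    .Where(l => l.SentDate < lastSentDate
        || (l.SentDate == lastSentDate && l.LinkID < lastLinkId))
    .OrderByDescending(l => l.SentDate)
    .ThenByDescending(l => l.LinkID)
    .Take(10)
    .ToListAsync();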
SQL Server 2008 doesn't have built-in support for pagination (OFFSET/FETCH arrived in SQL Server 2012), so we'll have to use a workaround. I will show one variant that allows you to jump straight to a given page and works fast for the first pages, but becomes slower and slower for further pages.
You will have these variables (PageSize, PageNumber) in your C# code. I put them here to illustrate the point.
DECLARE @VarPageSize int = 10; -- number of rows in each page
DECLARE @VarPageNumber int = 3; -- page numeration is zero-based
SELECT TOP (@VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM
(
SELECT TOP((@VarPageNumber + 1) * @VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
) AS [Extent1]
ORDER BY [Extent1].[SentDate] ASC
;
The first page is rows 1 to 10, second page is 11 to 20 and so on.
Let's see how this query works when we try to get the fourth page, i.e. rows 31 to 40. PageSize=10, PageNumber=3. In the inner query we select the first 40 rows. Note that we don't scan the whole table here; we scan only the first 40 rows. We don't even need an explicit ROW_NUMBER(). Then we need to select the last 10 rows out of those 40, so the outer query selects TOP(10) with ORDER BY in the opposite direction. As is, this will return rows 40 to 31 in reverse order. You can sort them back into the correct order on the client, or add one more outer query, which simply sorts them again by SentDate DESC. Like this:
SELECT
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM
(
SELECT TOP (@VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM
(
SELECT TOP((@VarPageNumber + 1) * @VarPageSize)
[Extent1].[LinkID] AS [LinkID]
,[Extent1].[Title] AS [Title]
,[Extent1].[Url] AS [Url]
,[Extent1].[Description] AS [Description]
,[Extent1].[SentDate] AS [SentDate]
,[Extent1].[VisitCount] AS [VisitCount]
,[Extent1].[RssSourceId] AS [RssSourceId]
,[Extent1].[ReviewStatus] AS [ReviewStatus]
,[Extent1].[UserAccountId] AS [UserAccountId]
,[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
) AS [Extent1]
ORDER BY [Extent1].[SentDate] ASC
) AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
This query (like the original one) works correctly only if SentDate is unique. If it is not unique, add a unique column to the ORDER BY. For example, if LinkID is unique, then in the inner-most query use ORDER BY SentDate DESC, LinkID DESC. In the outer query, reverse the order: ORDER BY SentDate ASC, LinkID ASC.
Obviously, if you want to jump to page 1000, then the inner query would have to read 10,000 rows, so the further you go, the slower it gets.
In any case, you need to have an index on SentDate (or SentDate, LinkID) to make it work. Without an index the query would scan the whole table again.
I'm not telling you here how to translate this query to EF, because I don't know. I never used EF. There may be a way. Also, apparently, you can just force it to use actual SQL, rather than trying to play with C# code.
Update
Execution plans comparison
In my database I have a table EventLogErrors with 29,477,859 rows, and on SQL Server 2008 I compared the ROW_NUMBER query that EF generates with what I suggested here using TOP. I tried to retrieve the fourth page, 10 rows long. In both cases the optimizer was smart enough to read only 40 rows, as you can see from the execution plans. I used a primary key column for ordering and pagination for this test. When I used another indexed column for pagination the results were the same, i.e. both variants read only 40 rows. Needless to say, both variants returned results in a fraction of a second.
Variant with TOP: (execution plan screenshot)
Variant with ROW_NUMBER: (execution plan screenshot)
What it all means is that the root of your problem is somewhere else. You mentioned that your query runs slowly only sometimes and I didn't really pay attention to it originally. With such symptom I would do the following:
Check execution plan.
Check that you do have an index.
Check that the index is not heavily fragmented and the statistics are not outdated.
SQL Server has a feature called auto-parameterization. It also has features called parameter sniffing and execution plan caching. When all three work together, the result can be a non-optimal execution plan. There is an excellent article by Erland Sommarskog explaining it in detail: http://www.sommarskog.se/query-plan-mysteries.html The article explains how to confirm that the problem really is parameter sniffing by checking the cached execution plan, and what can be done to fix it.
I'm guessing the WHERE row_number > 0 will change over time as you ask for page 2, page 3, etc...
As such, I'm curious if it would help to create this index:
CREATE INDEX idx_links_SentDate_desc ON [dbo].[Links] ([SentDate] DESC)
In all honesty, IF it works, it's pretty much a band-aid, and you'll probably need to rebuild this index on a frequent basis, as I'm guessing it will get fragmented over time...
UPDATE: check the comments! Turns out the DESC has no effect whatsoever and should be avoided if your data comes in low to high!
Sometimes the inner select can cause problems with the execution plan, but it's the easiest way for the expression tree to be built from the code. Usually, it won't affect performance too much.
Clearly in this case it does. One workaround is to use your own query with ExecuteStoreQuery. Something like this:
int takeNo = 20;
int skipNo = 100;
// Note: ExecuteStoreQuery materializes the rows, so the OrderBy/Skip/Take
// below run in memory (LINQ to Objects), not in the database.
var results = db.ExecuteStoreQuery<Link>(
    "SELECT LinkID, Title, Url, Description, SentDate, VisitCount, RssSourceId, ReviewStatus, UserAccountId, CreationDate FROM Links");
var page = results.OrderBy(x => x.SentDate).Skip(skipNo).Take(takeNo);
Of course you lose a lot of the benefits of using an ORM in the first place by doing this, but it might be acceptable for an exceptional circumstance.
This looks like a standard paging query. I would guess that you do not have an index on SentDate. If so, the first thing to try is adding an index on SentDate and seeing what kind of impact that has on performance. Assuming that you do not always want to sort/page on SentDate, and that indexing every column you might want to sort/page by is not going to happen, take a look at this other Stack Overflow question. In some cases, SQL Server's "Gather Streams" parallelism operation can overflow into TempDb. When this happens, performance goes into the toilet. As the other answer says, indexing the column can help, as can disabling parallelism. Check out your query plan and see whether this might be the issue.
I am not very good with EF, but I can give you some hints. First, check whether you have a non-clustered index on [Extent1].[SentDate]. If not, create one; if it already exists, rebuild or reorganize it.
Third, change your query like this. Your original SQL is just written in an unnecessarily complex way and returns the same result as the one I am showing here. Try to write things simply; they will work faster and maintenance will also be easier.
SELECT TOP (10)
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
[Extent1].[Url] AS [Url],
[Extent1].[Description] AS [Description],
[Extent1].[SentDate] AS [SentDate],
[Extent1].[VisitCount] AS [VisitCount],
[Extent1].[RssSourceId] AS [RssSourceId],
[Extent1].[ReviewStatus] AS [ReviewStatus],
[Extent1].[UserAccountId] AS [UserAccountId],
[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1]
ORDER BY [Extent1].[SentDate] DESC
or modify it a little like this, in case the first version returns different results:
select top 10 A.* from (
SELECT
[Extent1].[LinkID] AS [LinkID],
[Extent1].[Title] AS [Title],
[Extent1].[Url] AS [Url],
[Extent1].[Description] AS [Description],
[Extent1].[SentDate] AS [SentDate],
[Extent1].[VisitCount] AS [VisitCount],
[Extent1].[RssSourceId] AS [RssSourceId],
[Extent1].[ReviewStatus] AS [ReviewStatus],
[Extent1].[UserAccountId] AS [UserAccountId],
[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[Links] AS [Extent1] ) A
ORDER BY A.[SentDate] DESC
I am 99% sure it will work.
Have you tried chaining in the method?
IQueryable<TEntity> query = _dbSet;
query = filter != null ? query.Where(filter) : query;
return orderBy != null ? orderBy(query) : query;
I am wondering if this will change the query that is created by EF.
Your code looks somewhat obscure to me, and this is the first time I have encountered this style of querying. As you said, it sometimes takes too long to execute, which suggests the query can be interpreted in different ways, perhaps ignoring EF performance considerations in some cases. Try rearranging the query's conditions and selections, and consider lazy loading in your program logic.
Aren't you bitten by the statistics update problem in SQL Server?
ALTER DATABASE YourDBName
SET AUTO_UPDATE_STATISTICS_ASYNC ON
The default is OFF, so your SQL Server will stall when 20% of your data has changed, waiting for the statistics update to finish before running the query.
I have run into similar issues before where EF will decide to decorate the SQL it decides to run in a very non-performant fashion.
Anyways, to provide a possible solution to your question:
On instances where I don't like what EF does with my code to generate SQL statements, I end up writing a stored procedure, import that into my EDMX as a function and use that to retrieve my data. It affords me control on how to formulate the SQL and I know exactly what index I need to leverage to get the best performance out of this. I imagine you know how to write a stored proc and import that as a function into EF so I will leave those details out. Hope this helps you.
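For illustration only, the call site after importing such a procedure might look like this; GetLatestLinks and its parameters are hypothetical names, not an actual API:
// Hypothetical EDMX function import wrapping a hand-written stored
// procedure, e.g. dbo.GetLatestLinks(@PageSize, @PageNumber).
// EF generates a method on the context that returns ObjectResult<Link>.
List<Link> page = db.GetLatestLinks(pageSize: 10, pageNumber: 3).ToList();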
I will still keep checking this page to see if someone comes up with a nicer, less painful solution to your issue.
Call me crazy, but it looks like you've got the thing ordering itself with itself when this code is called:
if (orderBy != null)
{
query = orderBy(query);
}
I think that would explain the whole "sometimes it's slow" bit. Probably runs fine until you have something in the orderBy parameter, then it's calling itself and creating that row numbered sub-select that slows it down.
Try commenting out the query = orderBy(query) portion of your code and see if you still get the slow down. I'm betting that you won't.
Also, you can simplify your code using Dynamic LINQ. It basically lets you specify sorting with the string name of a field (.OrderBy("SomeField")) instead of trying to pass in a method, which I've found to be a lot easier. I use that in MVC apps to handle sorting by whatever field the user clicks on in a grid.
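As a sketch, with the System.Linq.Dynamic NuGet package installed, the sort arrives as a plain string (the field name and paging variables here are illustrative):
using System.Linq.Dynamic; // Install-Package System.Linq.Dynamic

// Sort by whatever column name the grid posts back, then page as usual.
var page = db.Links
    .OrderBy("SentDate descending")
    .Skip(pageNumber * pageSize)
    .Take(pageSize)
    .ToList();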
Try adding a non-clustered index on SentDate
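If the model is code-first, that index can also be declared on the entity itself; here is a sketch using the [Index] attribute from EF 6.1+ (the index name is illustrative):
using System;
using System.ComponentModel.DataAnnotations.Schema; // IndexAttribute, EF 6.1+

public class Link
{
    public int LinkID { get; set; }

    // Produces IX_Links_SentDate on the next migration. For a composite
    // (SentDate, LinkID) index, apply [Index("IX_...", Order = 1)] here
    // and [Index("IX_...", Order = 2)] on LinkID instead.
    [Index("IX_Links_SentDate")]
    public DateTime SentDate { get; set; }
}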
I'm not sure when, but I read an article indicating that using Skip(1).Any() is better than a Count() comparison when using Entity Framework (I may be remembering wrong). I'm not sure about this after seeing the generated T-SQL code.
Here is the first option:
int userConnectionCount = _dbContext.HubConnections.Count(conn => conn.UserId == user.Id);
bool isAtSingleConnection = (userConnectionCount == 1);
This generates the following T-SQL code which is reasonable:
SELECT
[GroupBy1].[A1] AS [C1]
FROM ( SELECT
COUNT(1) AS [A1]
FROM [dbo].[HubConnections] AS [Extent1]
WHERE [Extent1].[UserId] = @p__linq__0
) AS [GroupBy1]
Here is the other option which is the suggested query as far as I remember:
bool isAtSingleConnection = !_dbContext
.HubConnections.OrderBy(conn => conn.Id)
.Skip(1).Any(conn => conn.UserId == user.Id);
Here is the generated T-SQL for the above LINQ query:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM ( SELECT [Extent1].[Id] AS [Id], [Extent1].[UserId] AS [UserId]
FROM ( SELECT [Extent1].[Id] AS [Id], [Extent1].[UserId] AS [UserId], row_number() OVER (ORDER BY [Extent1].[Id] ASC) AS [row_number]
FROM [dbo].[HubConnections] AS [Extent1]
) AS [Extent1]
WHERE [Extent1].[row_number] > 1
) AS [Skip1]
WHERE [Skip1].[UserId] = @p__linq__0
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT
1 AS [C1]
FROM ( SELECT [Extent2].[Id] AS [Id], [Extent2].[UserId] AS [UserId]
FROM ( SELECT [Extent2].[Id] AS [Id], [Extent2].[UserId] AS [UserId], row_number() OVER (ORDER BY [Extent2].[Id] ASC) AS [row_number]
FROM [dbo].[HubConnections] AS [Extent2]
) AS [Extent2]
WHERE [Extent2].[row_number] > 1
) AS [Skip2]
WHERE [Skip2].[UserId] = @p__linq__0
)) THEN cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1];
Which one is the proper way here? Is there a big performance difference between these two?
Query performance depends on a lot of things, like the indexes that are present, the actual data, how stale the statistics about the data present are etc. SQL query plan optimizer looks at these different metrics to come up with an efficient query plan. So, any straightforward answer that says query 1 is always better than query 2 or the opposite would be incorrect.
That said, my answer below tries to explain the article's stance and how Skip(1).Any() could be (marginally) better than doing a Count() > 1. The second query, though bigger and mostly unreadable, looks like it can be interpreted in an efficient fashion. Again, this depends on the things mentioned above. The idea is that the database has to look at more rows to figure out the result in the Count() case. Assuming the required indexes are there (a clustered index on Id to make the OrderBy in the second case efficient), the Count() case has to go through all the matching rows, while the second case has to go through a maximum of two rows to arrive at the answer.
Let's get more scientific in our analysis and see if my above theory holds any ground. For this, I am creating a dummy database of customers. The Customer type looks like this:
public class Customer
{
public int ID { get; set; }
public string Name { get; set; }
public int Age { get; set; }
}
I am seeding the database with some 100K random rows (I really have to prove this) using this code:
for (int j = 0; j < 100; j++)
{
using (CustomersContext db = new CustomersContext())
{
Random r = new Random();
for (int i = 0; i < 1000; i++)
{
Customer c = new Customer
{
Name = Guid.NewGuid().ToString(),
Age = r.Next(0, 100)
};
db.Customers.Add(c);
}
db.SaveChanges();
}
}
Now, the queries that I am going to use are as follows,
bool scenario1 = db.Customers.Where(c => c.Age == 26).Count() > 1;                      // scenario 1
bool scenario2 = db.Customers.Where(c => c.Age == 26).OrderBy(c => c.ID).Skip(1).Any(); // scenario 2
I have started SQL profiler to catch the query plans. The captured plans look as follows,
Scenario 1: (execution plan screenshot; note the estimated cost and actual row count)
Scenario 2: (execution plan screenshot; note the estimated cost and actual row count)
As per the initial guess, the estimated cost and the number of rows are lower in the Skip/Any case than in the Count case.
Conclusion:
All this analysis aside, as many others have commented earlier, these are not the kind of performance optimizations you should try to do in your code. Things like these hurt readability for very minimal (I would say non-existent) perf benefit. I did this analysis for fun and would never use it as a basis for choosing scenario 2. I would measure and see whether doing a Count() is actually hurting before changing the code to use Skip().Any().
I read an article on this which indicates that the usage of Skip(1).Any() is better than Count().
That statement is quite true on a LINQ to objects query. On a LINQ to objects query Skip(1).Any() only needs to try to get the first two items of the sequence, and it can ignore all of the items that come after it. If the sequence involves rather expensive operations (and properly defers execution) or even more importantly, if the sequence is infinite, this could be a big deal. For most queries it will matter a bit, but often not a lot.
For a LINQ query that is based on a query provider instead, it's unlikely to make a significant difference. Particularly with EF, as you have seen, the generated query is not noticeably different. Is it possible for there to be a difference? Sure. One case could be handled better than the other by the query provider, particular queries might be optimized better with the particular refactoring used by one or the other, etc.
If someone is suggesting that there's a major difference in the EF query between these two, odds are they're mistakenly applying the guideline that was designed to just apply to a LINQ to objects query.
It will definitely depend on the record count in your table/data set. If you have a lot of records, then doing a count on an identity column will be very fast, because it's indexed, but skipping one record and then getting the next would be faster still.
Granted, this process could be done in sub-millisecond in either case. Unless you have a record count that exceeds 10,000+ records, it really won't matter unless you need it to return under a specific threshold. Don't forget that SQL Server will cache query execution plans. If you re-run the same query, you may not see a difference after running it the first time, unless the data changes pretty significantly beneath it.
Why does the Entity Framework generate nested SQL queries?
I have this code
var db = new Context();
var result = db.Network.Where(x => x.ServerID == serverId)
.OrderBy(x=> x.StartTime)
.Take(limit);
Which generates this! (Note the double select statement)
SELECT
`Project1`.`Id`,
`Project1`.`ServerID`,
`Project1`.`EventId`,
`Project1`.`StartTime`
FROM (SELECT
`Extent1`.`Id`,
`Extent1`.`ServerID`,
`Extent1`.`EventId`,
`Extent1`.`StartTime`
FROM `Networkes` AS `Extent1`
WHERE `Extent1`.`ServerID` = @p__linq__0) AS `Project1`
ORDER BY
`Project1`.`StartTime` DESC LIMIT 5
What should I change so that it results in one select statement? I'm using MySQL and Entity Framework with Code First.
Update
I have the same result regardless of the type of the parameter passed to the OrderBy() method.
Update 2: Timed
Total Time (hh:mm:ss.ms) 05:34:13.000
Average Time (hh:mm:ss.ms) 25:42.000
Max Time (hh:mm:ss.ms) 51:54.000
Count 13
First Seen Nov 6, 12 19:48:19
Last Seen Nov 6, 12 20:40:22
Raw query:
SELECT `Project?`.`Id`, `Project?`.`ServerID`, `Project?`.`EventId`, `Project?`.`StartTime` FROM (SELECT `Extent?`.`Id`, `Extent?`.`ServerID`, `Extent?`.`EventId`, `Extent?`.`StartTime` FROM `Network` AS `Extent?` WHERE `Extent?`.`ServerID` = ?) AS `Project?` ORDER BY `Project?`.`Starttime` DESC LIMIT ?
I used a program to take snapshots from the current process in MySQL.
Other queries were executed at the same time, but when I change it to just one SELECT statement, it NEVER goes over one second. Maybe something else is going on; I'm asking because I'm not so into DBs...
Update 3: The explain statement
The Entity Framework generated
'1', 'PRIMARY', '<derived2>', 'ALL', NULL, NULL, NULL, NULL, '46', 'Using filesort'
'2', 'DERIVED', 'Extent?', 'ref', 'serveridneventid,serverid', 'serveridneventid', '109', '', '45', 'Using where'
One liner
'1', 'SIMPLE', 'network', 'ref', 'serveridneventid,serverid', 'serveridneventid', '109', 'const', '45', 'Using where; Using filesort'
This is from my QA environment, so the timing I pasted above is not related to the rowcount explain statements. I think that there are about 500,000 records that match one server ID.
Solution
I switched from MySQL to SQL Server, since I didn't want to end up completely rewriting the application layer.
It's the easiest way to build the query logically from the expression tree. Usually the performance will not be an issue. If you are having performance issues, you can try something like this to get the entities back:
// Note: ExecuteStoreQuery materializes the rows, so the OrderBy/Take below
// run in memory rather than in the database.
var results = db.ExecuteStoreQuery<Network>(
    "SELECT Id, ServerID, EventId, StartTime FROM Network WHERE ServerID = {0}",
    serverId);
var page = results.OrderBy(x => x.StartTime).Take(limit);
My initial impression was that doing it this way would actually be more efficient, although in testing against a MSSQL server, I got <1 second responses regardless.
With a single select statement, it sorts all the records (Order By), and then filters them to the set you want to see (Where), and then takes the top 5 (Limit 5 or, for me, Top 5). On a large table, the sort takes a significant portion of the time. With a nested statement, it first filters the records down to a subset, and only then does the expensive sort operation on it.
Edit: I did test this, but I realized I had an error in my test which invalidated it. Test results removed.
Why does Entity Framework produce a nested query? The simple answer is because Entity Framework breaks your query expression down into an expression tree and then uses that expression tree to build your query. A tree naturally generates nested query expressions (i.e. a child node generates a query and a parent node generates a query on that query).
Why doesn't Entity Framework simplify the query down and write it as you would? The simple answer is because there is a limited amount of work that can go into the query generation engine, and while it's better now than it was in earlier versions it's not perfect and probably never will be.
All that said there should be no significant speed difference between the query you would write by hand and the query EF generated in this case. The database is clever enough to generate an execution plan that applies the WHERE clause first in either case.
If you want to get the EF to generate the query without the subselect, use a constant within the query, not a variable.
I have previously created my own .Where and all other LINQ methods that first traverse the expression tree and convert all variables, method calls etc. into Expression.Constant. It was done just because of this issue in Entity Framework...
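A minimal sketch of that idea (not the poster's actual code): an ExpressionVisitor that evaluates accesses to captured closure variables and replaces them with Expression.Constant nodes before the predicate reaches EF.
using System;
using System.Linq.Expressions;

static class ConstantFolder
{
    // Rewrites a predicate so captured locals become inline constants.
    public static Expression<Func<T, bool>> Fold<T>(Expression<Func<T, bool>> predicate)
    {
        return (Expression<Func<T, bool>>)new Visitor().Visit(predicate);
    }

    private sealed class Visitor : ExpressionVisitor
    {
        protected override Expression VisitMember(MemberExpression node)
        {
            // A captured local shows up as a field access on the
            // compiler-generated closure object (a ConstantExpression).
            if (node.Expression is ConstantExpression)
            {
                object value = Expression.Lambda(node).Compile().DynamicInvoke();
                return Expression.Constant(value, node.Type);
            }
            return base.VisitMember(node);
        }
    }
}
Used as db.Network.Where(ConstantFolder.Fold<Network>(x => x.ServerID == serverId)), the provider then sees a literal instead of a parameter; the trade-off is that every distinct value produces a distinct SQL string, which defeats execution plan caching.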
I just stumbled upon this post because I suffer from the same problem. I already spent days tracking this down, and it is simply poor query generation in MySQL.
I already filed a bug at mysql.com http://bugs.mysql.com/bug.php?id=75272
To summarize the problem:
This simple query
context.products
.Include(x => x.category)
.Take(10)
.ToList();
gets translated into
SELECT
`Limit1`.`C1`,
`Limit1`.`id`,
`Limit1`.`name`,
`Limit1`.`category_id`,
`Limit1`.`id1`,
`Limit1`.`name1`
FROM (SELECT
`Extent1`.`id`,
`Extent1`.`name`,
`Extent1`.`category_id`,
`Extent2`.`id` AS `id1`,
`Extent2`.`name` AS `name1`,
1 AS `C1`
FROM `products` AS `Extent1` INNER JOIN `categories` AS `Extent2` ON `Extent1`.`category_id` = `Extent2`.`id` LIMIT 10) AS `Limit1`
and performs pretty well. Anyway, the outer query is pretty much useless. Now, if I add an OrderBy
context.products
.Include(x => x.category)
.OrderBy(x => x.id)
.Take(10)
.ToList();
the query changes to
SELECT
`Project1`.`C1`,
`Project1`.`id`,
`Project1`.`name`,
`Project1`.`category_id`,
`Project1`.`id1`,
`Project1`.`name1`
FROM (SELECT
`Extent1`.`id`,
`Extent1`.`name`,
`Extent1`.`category_id`,
`Extent2`.`id` AS `id1`,
`Extent2`.`name` AS `name1`,
1 AS `C1`
FROM `products` AS `Extent1` INNER JOIN `categories` AS `Extent2` ON `Extent1`.`category_id` = `Extent2`.`id`) AS `Project1`
ORDER BY
`Project1`.`id` ASC LIMIT 10
Which is bad, because the ORDER BY is in the outer query. That means MySQL has to pull every record to perform the ORDER BY, which results in using filesort.
I verified that SQL Server (Compact, at least) does not generate nested queries for the same code:
SELECT TOP (10)
[Extent1].[id] AS [id],
[Extent1].[name] AS [name],
[Extent1].[category_id] AS [category_id],
[Extent2].[id] AS [id1],
[Extent2].[name] AS [name1]
FROM [products] AS [Extent1]
LEFT OUTER JOIN [categories] AS [Extent2] ON [Extent1].[category_id] = [Extent2].[id]
ORDER BY [Extent1].[id] ASC
Actually, the queries generated by Entity Framework are a bit ugly; less so than LINQ 2 SQL, but still ugly.
However, your database engine will very probably produce the desired execution plan, and the query will run smoothly.
I have two different SQL queries, one written by me and one automatically generated by C# when using LINQ; both give the same results.
I am not sure which one to choose. I am looking for:
The best way to choose one query out of many when all return the same result (i.e. the most optimized query).
Out of my queries (written below), which one should I choose?
Hand Written
select * from People P
inner join SubscriptionItemXes S
on
P.Id=S.Person_Id
inner join FoodTagXFoods T1
on T1.FoodTagX_Id = S.Tag2
inner join FoodTagXFoods T2
on T2.FoodTagX_Id = S.Tag1
inner join Foods F
on
F.Id= T1.Food_Id and F.Id= T2.Food_Id
where p.id='1'
Automatically Generated by LINQ
SELECT
[Distinct1].[Id] AS [Id],
[Distinct1].[Item] AS [Item]
FROM ( SELECT DISTINCT
[Extent2].[Id] AS [Id],
[Extent2].[Item] AS [Item]
FROM [dbo].[People] AS [Extent1]
CROSS JOIN [dbo].[Foods] AS [Extent2]
INNER JOIN [dbo].[FoodTagXFoods] AS [Extent3]
ON [Extent2].[Id] = [Extent3].[Food_Id]
INNER JOIN [dbo].[SubscriptionItemXes] AS [Extent4]
ON [Extent1].[Id] = [Extent4].[Person_Id]
WHERE (N'rusi' = [Extent1].[Name]) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[FoodTagXFoods] AS [Extent5]
WHERE ([Extent2].[Id] = [Extent5].[Food_Id])
AND ([Extent5].[FoodTagX_Id] = [Extent4].[Tag1])
)) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[FoodTagXFoods] AS [Extent6]
WHERE ([Extent2].[Id] = [Extent6].[Food_Id])
AND ([Extent6].[FoodTagX_Id] = [Extent4].[Tag2])
))
) AS [Distinct1]
Execution Plan Results
Hand Written: Query Cost (relative to batch):33%
Linq Generated: Query Cost (relative to batch):67%
I have found that two different queries, one hand-written and one generated by LINQ, might look wildly different, but when you analyse the query plans in SSMS, you often find that they are almost identical.
You need to actually run these queries in SSMS with Display Actual Execution Plan switched on, and analyse the different plans. It's the only way to correctly analyse the two and find out which is better.
In general, LINQ is actually very good at generating efficient queries, even if the actual SQL itself is pretty ugly (in some cases, it's the kind of SQL that a human would write if they had the time!). Of course, that said, it can also generate some pigs!
Additionally, asking SO to help with performance of a query over so many tables is fraught with problems for us, since it will be governed so much by your indexes :)
But they aren't quite returning the same thing... The first query grabs everything (SELECT *) while the LINQ version extracts only what you really want (Id and Item). Trivial, you may say, but streaming back lots of data that's never used is a waste of bandwidth and will make your application appear sluggish. Additionally, the LINQ query seems to be doing a lot more, which may or may not be the correct solution, especially as data is populated into FoodTagXFoods.
As for which performs better, I couldn't tell you without something like the actual query plans and/or results of statistics io from both queries. My money is on hand-written but maybe because I like my hands.
Examining the SQL Server query execution plan is the best way to choose the better query.
I hope the tutorials below help.
SQL Tuning Tutorial - Understanding a Database Execution Plan (1)
Examining Query Execution Plans
SQL Server Query Execution Plan Analysis
SQL SERVER – Actual Execution Plan vs. Estimated Execution Plan
I'm very new to the .NET Entity Framework, and I think it's awesome, but somehow I'm getting this strange issue (sorry for the Spanish, but my program is in that language; anyway, it's not a big deal, just the column and property names): I'm doing a normal LINQ to Entities query to get a list of UltimaConsulta, like this:
var query = from uc in bd.UltimasConsultas
select uc;
UltimasConsultas is a view, btw. The thing is that LINQ is generating this SQL for the query:
SELECT
[Extent1].[IdPaciente] AS [IdPaciente],
[Extent1].[Nombre] AS [Nombre],
[Extent1].[PrimerApellido] AS [PrimerApellido],
[Extent1].[SegundoApellido] AS [SegundoApellido],
[Extent1].[Fecha] AS [Fecha]
FROM (SELECT
[UltimasConsultas].[IdPaciente] AS [IdPaciente],
[UltimasConsultas].[Nombre] AS [Nombre],
[UltimasConsultas].[PrimerApellido] AS [PrimerApellido],
[UltimasConsultas].[SegundoApellido] AS [SegundoApellido],
[UltimasConsultas].[Fecha] AS [Fecha]
FROM [dbo].[UltimasConsultas] AS [UltimasConsultas]) AS [Extent1]
Why is LINQ generating a nested SELECT? I thought from videos and examples that it generates normal SQL SELECTs for this kind of query. Do I have to configure something (the entity model was generated by a wizard, so it's the default configuration)? Thanks in advance for your answers.
To be clear, LINQ to Entities does not generate the SQL. Instead, it generates an ADO.NET canonical command tree, and the ADO.NET provider for your database, presumably SQL Server in this case, generates the SQL.
So why does it generate this derived table (I think "derived table" is the more correct term for the SQL feature in use here)? Because the code which generates the SQL has to generate SQL for a wide variety of LINQ queries, most of which are not nearly as trivial as the one you show. These queries will often be selecting data for multiple types (many of which might be anonymous, rather than named types), and in order to keep the SQL generation relatively sane, they are grouped into extents for each type.
Another question: Why should you care? It's easy to demonstrate that the use of the derived table in this statement is "free" from a performance point of view.
I selected a table at random from a populated database and ran the following query:
SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]
Let's look at the cost:
<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS ******/
SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]" StatementType="SELECT">
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
<RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">
Now let's compare that to the query with the derived table:
SELECT
[Extent1].[AddressId]
,[Extent1].[Address1]
,[Extent1].[Address2]
,[Extent1].[City]
,[Extent1].[State]
,[Extent1].[ZIP]
,[Extent1].[ZIPExtension]
FROM (SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]) AS [Extent1]
And the cost:
<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS ******/
SELECT
[Extent1].[AddressId]
,[Extent1].[Address1]
,[Extent1].[Address2]
,[Extent1].[City]
,[Extent1].[State]
,[Extent1].[ZIP]
,[Extent1].[ZIPExtension]
FROM (SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM[VertexRM].[dbo].[Address]) AS [Extent1]" StatementType="SELECT">
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
<RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">
In both cases, SQL Server simply scans the clustered index. Not surprisingly, the cost is almost precisely the same.
Let's take a look at a slightly more complicated query. I fired up LINQPad, and entered the following query against the same table, plus one related table:
from a in Addresses
select new
{
Id = a.Id,
Address1 = a.Address1,
Address2 = a.Address2,
City = a.City,
State = a.State,
ZIP = a.ZIP,
ZIPExtension = a.ZIPExtension,
PersonCount = a.EntityAddresses.Count()
}
This generates the following SQL:
SELECT
1 AS [C1],
[Project1].[AddressId] AS [AddressId],
[Project1].[Address1] AS [Address1],
[Project1].[Address2] AS [Address2],
[Project1].[City] AS [City],
[Project1].[State] AS [State],
[Project1].[ZIP] AS [ZIP],
[Project1].[ZIPExtension] AS [ZIPExtension],
[Project1].[C1] AS [C2]
FROM ( SELECT
[Extent1].[AddressId] AS [AddressId],
[Extent1].[Address1] AS [Address1],
[Extent1].[Address2] AS [Address2],
[Extent1].[City] AS [City],
[Extent1].[State] AS [State],
[Extent1].[ZIP] AS [ZIP],
[Extent1].[ZIPExtension] AS [ZIPExtension],
(SELECT
COUNT(cast(1 as bit)) AS [A1]
FROM [dbo].[EntityAddress] AS [Extent2]
WHERE [Extent1].[AddressId] = [Extent2].[AddressId]) AS [C1]
FROM [dbo].[Address] AS [Extent1]
) AS [Project1]
Analyzing this, we can see that Project1 is the projection onto the anonymous type. Extent1 is the Address table/entity. And Extent2 is the table for the association. Now there is no derived table for Address, but there is one for the projection.
I don't know if you have ever written a SQL generation system, but it isn't easy. I believe that the general problem of proving that a LINQ to Entities query and a SQL query are equivalent is NP-hard, although certain specific cases are obviously much easier. SQL is intentionally Turing-incomplete, because its designers wanted all SQL queries to execute in bounded time. LINQ, not so.
In short, this is a very difficult problem to solve, and the combination of the Entity Framework and its providers do occasionally sacrifice some readability in favor of consistency over a wide range of queries. But it shouldn't be a performance issue.
Basically, it's defining what Extent1 consists of and which variables relate to each entry. Then it's mapping the actual database table to Extent1 so that it can return all entries for that table.
This is what your query is asking for. It's just that LINQ can't add a wildcard character the way you would if you'd written it by hand.