LINQ generating SQL with duplicate nested selects - c#

I'm very new to the .NET Entity Framework, and I think it's awesome, but somehow I'm getting this strange issue (sorry for the spanish but my program is in that language, anyway it's not a big deal, just the column or property names): I'm doing a normal LINQ To Entities query to get a list of UltimaConsulta, like this:
var query = from uc in bd.UltimasConsultas
select uc;
UltimasConsultas is a view, btw. The thing is that LINQ is generating this SQL for the query:
SELECT
[Extent1].[IdPaciente] AS [IdPaciente],
[Extent1].[Nombre] AS [Nombre],
[Extent1].[PrimerApellido] AS [PrimerApellido],
[Extent1].[SegundoApellido] AS [SegundoApellido],
[Extent1].[Fecha] AS [Fecha]
FROM (SELECT
[UltimasConsultas].[IdPaciente] AS [IdPaciente],
[UltimasConsultas].[Nombre] AS [Nombre],
[UltimasConsultas].[PrimerApellido] AS [PrimerApellido],
[UltimasConsultas].[SegundoApellido] AS [SegundoApellido],
[UltimasConsultas].[Fecha] AS [Fecha]
FROM [dbo].[UltimasConsultas] AS [UltimasConsultas]) AS [Extent1]
Why is LINQ generating a nested Select? I thought from videos and examples that it generates normal SQL selects for this kind of queries. Do I have to configure something (the entity model was generating from a wizard, so it's default configuration)? Thanks in advance for your answers.

To be clear, LINQ to Entities does not generate the SQL. Instead, it generates an ADO.NET canonical command tree, and the ADO.NET provider for your database, presumably SQL Server in this case, generates the SQL.
So why does it generate this derived table (I think "derived table" is the more correct term for the SQL feature in use here)? Because the code which generates the SQL has to generate SQL for a wide variety of LINQ queries, most of which are not nearly as trivial as the one you show. These queries will often be selecting data for multiple types (many of which might be anonymous, rather than named types), and in order to keep the SQL generation relatively sane, they are grouped into extents for each type.
Another question: Why should you care? It's easy to demonstrate that the use of the derived table in this statement is "free" from a performance point of view.
I selected a table at random from a populated database, and run the following query:
SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]
Let's look at the cost:
<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS ******/
SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM [VertexRM].[dbo].[Address]" StatementType="SELECT">
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
<RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">
Now let's compare that to the query with the derived table:
SELECT
[Extent1].[AddressId]
,[Extent1].[Address1]
,[Extent1].[Address2]
,[Extent1].[City]
,[Extent1].[State]
,[Extent1].[ZIP]
,[Extent1].[ZIPExtension]
FROM (SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM[VertexRM].[dbo].[Address]) AS [Extent1]
And the cost:
<StmtSimple StatementCompId="1" StatementEstRows="7900" StatementId="1" StatementOptmLevel="TRIVIAL" StatementSubTreeCost="0.123824" StatementText="/****** Script for SelectTopNRows command from SSMS ******/
SELECT
[Extent1].[AddressId]
,[Extent1].[Address1]
,[Extent1].[Address2]
,[Extent1].[City]
,[Extent1].[State]
,[Extent1].[ZIP]
,[Extent1].[ZIPExtension]
FROM (SELECT [AddressId]
,[Address1]
,[Address2]
,[City]
,[State]
,[ZIP]
,[ZIPExtension]
FROM[VertexRM].[dbo].[Address]) AS [Extent1]" StatementType="SELECT">
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan CachedPlanSize="9" CompileTime="0" CompileCPU="0" CompileMemory="64">
<RelOp AvgRowSize="246" EstimateCPU="0.008847" EstimateIO="0.114977" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="7900" LogicalOp="Clustered Index Scan" NodeId="0" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.123824">
In both cases, SQL Server simply scans the clustered index. Not surprisingly, the cost is almost precisely the same.
Let's take a look at a slightly more complicated query. I fired up LINQPad, and entered the following query against the same table, plus one related table:
from a in Addresses
select new
{
Id = a.Id,
Address1 = a.Address1,
Address2 = a.Address2,
City = a.City,
State = a.State,
ZIP = a.ZIP,
ZIPExtension = a.ZIPExtension,
PersonCount = a.EntityAddresses.Count()
}
This generates the following SQL:
SELECT
1 AS [C1],
[Project1].[AddressId] AS [AddressId],
[Project1].[Address1] AS [Address1],
[Project1].[Address2] AS [Address2],
[Project1].[City] AS [City],
[Project1].[State] AS [State],
[Project1].[ZIP] AS [ZIP],
[Project1].[ZIPExtension] AS [ZIPExtension],
[Project1].[C1] AS [C2]
FROM ( SELECT
[Extent1].[AddressId] AS [AddressId],
[Extent1].[Address1] AS [Address1],
[Extent1].[Address2] AS [Address2],
[Extent1].[City] AS [City],
[Extent1].[State] AS [State],
[Extent1].[ZIP] AS [ZIP],
[Extent1].[ZIPExtension] AS [ZIPExtension],
(SELECT
COUNT(cast(1 as bit)) AS [A1]
FROM [dbo].[EntityAddress] AS [Extent2]
WHERE [Extent1].[AddressId] = [Extent2].[AddressId]) AS [C1]
FROM [dbo].[Address] AS [Extent1]
) AS [Project1]
Analyzing this, we can see that Project1 is the projection onto the anonymous type. Extent1 is the Address table/entity. And Extent2 is the table for the association. Now there is no derived table for Address, but there is one for the projection.
I don't know if you have ever written a SQL generation system, but it isn't easy. I believe that the general problem of proving that a LINQ to Entities query and a SQL query are equivalent is NP-hard, although certain specific cases are obviously much easier. SQL is intentionally Turing-incomplete, because its designers wanted all SQL queries to execute in bounded time. LINQ, not so.
In short, this is a very difficult problem to solve, and the combination of the Entity Framework and its providers do occasionally sacrifice some readability in favor of consistency over a wide range of queries. But it shouldn't be a performance issue.

Basically it's defining what Extent1 consists of and what variables will relate to each entry. Then its mapping the actual database table to Extent1 so that it can return all entries for that table.
This is what your query is asking for. Its just that LINQ can't add in a wildcard character as you would if you'd done it by hand.

Related

Horrifically inefficient query generated by Entity Framework 6

Here's the query I want:
select top 10 *
from vw_BoosterTargetLog
where OrganizationId = 4125
order by Id desc
It executes subsecond.
Here's my Entity Framework (6.1.2) equivalent in C#:
return await db.vw_BoosterTargetLog
.Where(x => x.OrganizationId == organizationId)
.OrderByDescending(x => x.Id)
.Take(numberToRun)
.ToListNolockAsync();
And here's the SQL that it generates:
SELECT TOP (10)
[Project1].[OrganizationId] AS [OrganizationId],
[Project1].[BoosterTriggerId] AS [BoosterTriggerId],
[Project1].[IsAutomatic] AS [IsAutomatic],
[Project1].[C1] AS [C1],
[Project1].[CustomerUserId] AS [CustomerUserId],
[Project1].[SourceUrl] AS [SourceUrl],
[Project1].[TargetUrl] AS [TargetUrl],
[Project1].[ShowedOn] AS [ShowedOn],
[Project1].[ClickedOn] AS [ClickedOn],
[Project1].[BoosterTargetId] AS [BoosterTargetId],
[Project1].[TriggerEventGroup] AS [TriggerEventGroup],
[Project1].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[Project1].[TargetTitle] AS [TargetTitle],
[Project1].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[Project1].[Version] AS [Version],
[Project1].[CookieId] AS [CookieId],
[Project1].[CoalescedId] AS [CoalescedId],
[Project1].[OrganizationName] AS [OrganizationName],
[Project1].[ShowedOnDate] AS [ShowedOnDate],
[Project1].[SampleGroupSectionName] AS [SampleGroupSectionName],
[Project1].[Selector] AS [Selector],
[Project1].[SelectorStep] AS [SelectorStep]
FROM ( SELECT
[Extent1].[OrganizationId] AS [OrganizationId],
[Extent1].[OrganizationName] AS [OrganizationName],
[Extent1].[BoosterTriggerId] AS [BoosterTriggerId],
[Extent1].[IsAutomatic] AS [IsAutomatic],
[Extent1].[SampleGroupSectionName] AS [SampleGroupSectionName],
[Extent1].[Selector] AS [Selector],
[Extent1].[SelectorStep] AS [SelectorStep],
[Extent1].[BoosterTargetId] AS [BoosterTargetId],
[Extent1].[CookieId] AS [CookieId],
[Extent1].[CustomerUserId] AS [CustomerUserId],
[Extent1].[CoalescedId] AS [CoalescedId],
[Extent1].[SourceUrl] AS [SourceUrl],
[Extent1].[TriggerEventGroup] AS [TriggerEventGroup],
[Extent1].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[Extent1].[TargetTitle] AS [TargetTitle],
[Extent1].[TargetUrl] AS [TargetUrl],
[Extent1].[ShowedOn] AS [ShowedOn],
[Extent1].[ShowedOnDate] AS [ShowedOnDate],
[Extent1].[ClickedOn] AS [ClickedOn],
[Extent1].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[Extent1].[Version] AS [Version],
CAST( [Extent1].[Id] AS int) AS [C1]
FROM (SELECT
[vw_BoosterTargetLog].[OrganizationId] AS [OrganizationId],
[vw_BoosterTargetLog].[OrganizationName] AS [OrganizationName],
[vw_BoosterTargetLog].[BoosterTriggerId] AS [BoosterTriggerId],
[vw_BoosterTargetLog].[IsAutomatic] AS [IsAutomatic],
[vw_BoosterTargetLog].[SampleGroupSectionName] AS [SampleGroupSectionName],
[vw_BoosterTargetLog].[Selector] AS [Selector],
[vw_BoosterTargetLog].[SelectorStep] AS [SelectorStep],
[vw_BoosterTargetLog].[BoosterTargetId] AS [BoosterTargetId],
[vw_BoosterTargetLog].[CookieId] AS [CookieId],
[vw_BoosterTargetLog].[CustomerUserId] AS [CustomerUserId],
[vw_BoosterTargetLog].[CoalescedId] AS [CoalescedId],
[vw_BoosterTargetLog].[Id] AS [Id],
[vw_BoosterTargetLog].[SourceUrl] AS [SourceUrl],
[vw_BoosterTargetLog].[TriggerEventGroup] AS [TriggerEventGroup],
[vw_BoosterTargetLog].[TriggerIgnoreIdentifiedUsers] AS [TriggerIgnoreIdentifiedUsers],
[vw_BoosterTargetLog].[TargetTitle] AS [TargetTitle],
[vw_BoosterTargetLog].[TargetUrl] AS [TargetUrl],
[vw_BoosterTargetLog].[ShowedOn] AS [ShowedOn],
[vw_BoosterTargetLog].[ShowedOnDate] AS [ShowedOnDate],
[vw_BoosterTargetLog].[ClickedOn] AS [ClickedOn],
[vw_BoosterTargetLog].[BoosterTargetVersionId] AS [BoosterTargetVersionId],
[vw_BoosterTargetLog].[Version] AS [Version]
FROM [dbo].[vw_BoosterTargetLog] AS [vw_BoosterTargetLog]) AS [Extent1]
WHERE [Extent1].[OrganizationId] = 4125
) AS [Project1]
ORDER BY [Project1].[C1] DESC
It's ugly as hell, of course, as all EF queries are: I'm not complaining about that. My gripe is that in my testing, best-case, it executes about 10x slower than the first, and worst-case, about 100x slower.
For a query this simple, that seems way beyond all reasonable expectation.
Obviously I can execute SQL directly, or execute a sproc, or something of that sort. And while I'm waiting for feedback, that's what I'll do. But does anyone have any other suggestions about how to speed this up? Is there any way to encourage EF to generate reasonable SQL in a situation like this?
The queries EF produces, while terrible from a readability perspective, are usually still quite good reasonable -- and I say that as someone who does almost all data access through stored procedures with hand-written queries. But in order for it to work, the model EF has of the database needs to match the actual database, or else conversions will be introduced, and when that happens it's very easy to get horrible performance drops while all the data is converted and no indexes can be used.
If we eliminate some nesting, the EF query can be simplified to
SELECT TOP (10) *
FROM (
SELECT *, CAST(Id AS INT) AS C1
FROM vw_BoosterTargetLog
WHERE OrganizationId = 4125
) _
ORDER BY C1 DESC
(This is not the actual result set because Id isn't part of the final result set in the real query, but pretend I wrote out all the columns just like EF did.)
If vw_BoosterTargetLog.Id is not actually an INT, this forces a conversion of all rows before the ordering takes place, which is much slower. The solution is to figure out the actual type of the column (in this case, BIGINT) and update your model accordingly.

Refer to temporary table in Entity Framework query

There is a list list in memory of 50,000 Product IDs. I would like to get all these Products from the DB. Using dbContext.Products.Where(p => list.contains(p.ID)) generates a giant IN in the SQL - WHERE ID IN (2134,1324543,5675,32451,45735...), and it takes forever. This is partly because it takes time for SQL Server to parse such a large string, and also the execution plan is bad. (I know this from trying to use a temporary table instead).
So I used SQLBulkCopy to insert the IDs to a temporary table, and then ran
dbContext.Set<Product>().SqlQuery("SELECT * FROM Products WHERE ID IN (SELECT ID FROM #tmp))"
This gave good performance. However, now I need the products, with their suppliers (multiple for every product). Using a custom SQL command there is no way to get back a complex object that I know of. So how can I get the products with their suppliers, using the temporary table?
(If I can somehow refer to the temporary table in LINQ, then it would be OK - I could just do dbContext.Products.Where(p => dbContext.TempTable.Any(t => t.ID==p.ID)). If I could refer to it in a UDF that would also be good - but you can't. I cannot use a real table, since concurrent users would leave it in an inconsistent state.)
Thanks
I was curious to explore the sql generated using Join syntax rather than Contains. Here is the code for my test:
IQueryable<Product> queryable = Uow.ProductRepository.All;
List<int> inMemKeys = new int[] { 2134, 1324543, 5675, 32451, 45735 }.ToList();
string sql1 = queryable.Where(p => inMemKeys.Contains(p.ID)).ToString();
string sql2 = queryable.Join(inMemKeys, t => t.ID, pk => pk, (t, pk) => t).ToString();
This is the sql generated using Contains (sql1)
SELECT
[extent1].[id] AS [id],...etc
FROM [dbo].[products] AS [extent1]
WHERE ([extent1].[id] IN (2134, 1324543, 5675, 32451, 45735))
This is the sql generated using Join:
SELECT
[extent1].[id] AS [id],...etc
FROM [dbo].[products] AS [extent1]
INNER JOIN (SELECT
[unionall3].[c1] AS [c1]
FROM (SELECT
[unionall2].[c1] AS [c1]
FROM (SELECT
[unionall1].[c1] AS [c1]
FROM (SELECT
2134 AS [c1]
FROM (SELECT
1 AS x) AS [singlerowtable1] UNION ALL SELECT
1324543 AS [c1]
FROM (SELECT
1 AS x) AS [singlerowtable2]) AS [unionall1] UNION ALL SELECT
5675 AS [c1]
FROM (SELECT
1 AS x) AS [singlerowtable3]) AS [unionall2] UNION ALL SELECT
32451 AS [c1]
FROM (SELECT
1 AS x) AS [singlerowtable4]) AS [unionall3] UNION ALL SELECT
45735 AS [c1]
FROM (SELECT
1 AS x) AS [singlerowtable5]) AS [unionall4]
ON [extent1].[id] = [unionall4].[c1]
So the sql creates a big select statement using union all to create the equivalent of your temporary table, then it joins to that table. The sql is more verbose, but it may well be efficient - I'm afraid I'm not qualified to say.
While it doesn't answer the question as set out in the heading, it does show a way to avoid the giant IN . OK.... now it's a giant UNION ALL.... anyways...I hope that this contribution is useful to some
I suggest you extend the filter table (TempTable in the code above) to store something like a UserId or SessionId as well as ProductID's:
this will give you all the performance you're after
it will work for concurrent users
If this filter table is changing a lot then consider updating it in a separate transaction (i.e. a different instance of dbContext) to avoid holding a write lock on this table for longer than necessary.

How to write this SQL query in Entity Framework?

I have this query that I want translated pretty much 1:1 from Entity Framework to SQL:
SELECT GroupId, ItemId, count(*) as total
FROM [TESTDB].[dbo].[TestTable]
WHERE GroupId = '64025'
GROUP BY GroupId, ItemId
ORDER BY GroupId, total DESC
This SQL query should sort based on the number occurrence of the same ItemId (for that group).
I have this now:
from x in dataContext.TestTable.AsNoTracking()
where x.GroupId = 64025
group x by new {x.GroupId, x.ItemId}
into g
orderby g.Key.GroupId, g.Count() descending
select new {g.Key.GroupId, g.Key.ItemId, Count = g.Count()};
But this generates the following SQL code:
SELECT
[GroupBy1].[K1] AS [GroupId],
[GroupBy1].[K2] AS [ItemId],
[GroupBy1].[A2] AS [C1]
FROM ( SELECT
[Extent1].[GroupId] AS [K1],
[Extent1].[ItemId] AS [K2],
COUNT(1) AS [A1],
COUNT(1) AS [A2]
FROM [dbo].[TestTable] AS [Extent1]
WHERE 64025 = [Extent1].[GroupId]
GROUP BY [Extent1].[GroupId], [Extent1].[ItemId]
) AS [GroupBy1]
ORDER BY [GroupBy1].[K1] ASC, [GroupBy1].[A1] DESC
This also works but is a factor 2 slower than the SQL I created.
I've been fiddling around with the linq code for a while but I haven't managed to create something similar to my query.
Execution plan (only the last two items, the first two are identical):
FIRST: |--Stream Aggregate(GROUP BY:([Extent1].[ItemId]) DEFINE:([Expr1006]=Count(*), [Extent1].[GroupId]=ANY([TESTDB].[dbo].[TestTable].[GroupId] as [Extent1].[GroupId])))
|--Index Seek(OBJECT:([TESTDB].[dbo].[TestTable].[IX_Group]), SEEK:([TESTDB].[dbo].[TestTable].[GroupId]=(64034)) ORDERED FORWARD)
SECOND: |--Stream Aggregate(GROUP BY:([TESTDB].[dbo].[TestTable].[ItemId]) DEFINE:([Expr1007]=Count(*), [TESTDB].[dbo].[TestTable].[GroupId]=ANY([TESTDB].[dbo].[TestTable].[GroupId])))
|--Index Seek(OBJECT:([TESTDB].[dbo].[TestTable].[IX_Group] AS [Extent1]), SEEK:([Extent1].[GroupId]=(64034)) ORDERED FORWARD)
The query that Entity Framework generates and your hand crafted query are semantically the same and will give the same plan.
The derived table definition is inlined during query optimisation so the only difference might be some extremely minor additional overhead during parsing and compilation.
The snippets of SHOWPLAN_TEXT you have posted are the same plan. The only difference is aliases. It looks as though your table definition is something like.
CREATE TABLE [dbo].[TestTable]
(
[GroupId] INT,
[ItemId] INT
)
CREATE NONCLUSTERED INDEX IX_Group ON [dbo].[TestTable] ([GroupId], [ItemId])
And you are getting a plan like this
To all intents and purposes the plans are the same. Your performance testing methodology is probably flawed. Maybe your first query brought pages into cache that then benefited the second query for example.

Why does the Entity Framework generate nested SQL queries?

Why does the Entity Framework generate nested SQL queries?
I have this code
var db = new Context();
var result = db.Network.Where(x => x.ServerID == serverId)
.OrderBy(x=> x.StartTime)
.Take(limit);
Which generates this! (Note the double select statement)
SELECT
`Project1`.`Id`,
`Project1`.`ServerID`,
`Project1`.`EventId`,
`Project1`.`StartTime`
FROM (SELECT
`Extent1`.`Id`,
`Extent1`.`ServerID`,
`Extent1`.`EventId`,
`Extent1`.`StartTime`
FROM `Networkes` AS `Extent1`
WHERE `Extent1`.`ServerID` = #p__linq__0) AS `Project1`
ORDER BY
`Project1`.`StartTime` DESC LIMIT 5
What should I change so that it results in one select statement? I'm using MySQL and Entity Framework with Code First.
Update
I have the same result regardless of the type of the parameter passed to the OrderBy() method.
Update 2: Timed
Total Time (hh:mm:ss.ms) 05:34:13.000
Average Time (hh:mm:ss.ms) 25:42.000
Max Time (hh:mm:ss.ms) 51:54.000
Count 13
First Seen Nov 6, 12 19:48:19
Last Seen Nov 6, 12 20:40:22
Raw query:
SELECT `Project?`.`Id`, `Project?`.`ServerID`, `Project?`.`EventId`, `Project?`.`StartTime` FROM (SELECT `Extent?`.`Id`, `Extent?`.`ServerID`, `Extent?`.`EventId`, `Extent?`.`StartTime`, FROM `Network` AS `Extent?` WHERE `Extent?`.`ServerID` = ?) AS `Project?` ORDER BY `Project?`.`Starttime` DESC LIMIT ?
I used a program to take snapshots from the current process in MySQL.
Other queries were executed at the same time, but when I change it to just one SELECT statement, it NEVER goes over one second. Maybe I have something else that's going on; I'm asking 'cause I'm not so into DBs...
Update 3: The explain statement
The Entity Framework generated
'1', 'PRIMARY', '<derived2>', 'ALL', NULL, NULL, NULL, NULL, '46', 'Using filesort'
'2', 'DERIVED', 'Extent?', 'ref', 'serveridneventid,serverid', 'serveridneventid', '109', '', '45', 'Using where'
One liner
'1', 'SIMPLE', 'network', 'ref', 'serveridneventid,serverid', 'serveridneventid', '109', 'const', '45', 'Using where; Using filesort'
This is from my QA environment, so the timing I pasted above is not related to the rowcount explain statements. I think that there are about 500,000 records that match one server ID.
Solution
I switched from MySQL to SQL Server. I don't want to end up completely rewriting the application layer.
It's the easiest way to build the query logically from the expression tree. Usually the performance will not be an issue. If you are having performance issues you can try something like this to get the entities back:
var results = db.ExecuteStoreQuery<Network>(
"SELECT Id, ServerID, EventId, StartTime FROM Network WHERE ServerID = #ID",
serverId);
results = results.OrderBy(x=> x.StartTime).Take(limit);
My initial impression was that doing it this way would actually be more efficient, although in testing against a MSSQL server, I got <1 second responses regardless.
With a single select statement, it sorts all the records (Order By), and then filters them to the set you want to see (Where), and then takes the top 5 (Limit 5 or, for me, Top 5). On a large table, the sort takes a significant portion of the time. With a nested statement, it first filters the records down to a subset, and only then does the expensive sort operation on it.
Edit: I did test this, but I realized I had an error in my test which invalidated it. Test results removed.
Why does Entity Framework produce a nested query? The simple answer is because Entity Framework breaks your query expression down into an expression tree and then uses that expression tree to build your query. A tree naturally generates nested query expressions (i.e. a child node generates a query and a parent node generates a query on that query).
Why doesn't Entity Framework simplify the query down and write it as you would? The simple answer is because there is a limited amount of work that can go into the query generation engine, and while it's better now than it was in earlier versions it's not perfect and probably never will be.
All that said there should be no significant speed difference between the query you would write by hand and the query EF generated in this case. The database is clever enough to generate an execution plan that applies the WHERE clause first in either case.
If you want to get the EF to generate the query without the subselect, use a constant within the query, not a variable.
I have previously created my own .Where and all other LINQ methods that first traverse the expression tree and convert all variables, method calls etc. into Expression.Constant. It was done just because of this issue in Entity Framework...
I just stumbled upon this post because I suffer from the same problem. I already spend days tracking this down and it it is just a poor query generation in mysql.
I already filed a bug at mysql.com http://bugs.mysql.com/bug.php?id=75272
To summarize the problem:
This simple query
context.products
.Include(x => x.category)
.Take(10)
.ToList();
gets translated into
SELECT
`Limit1`.`C1`,
`Limit1`.`id`,
`Limit1`.`name`,
`Limit1`.`category_id`,
`Limit1`.`id1`,
`Limit1`.`name1`
FROM (SELECT
`Extent1`.`id`,
`Extent1`.`name`,
`Extent1`.`category_id`,
`Extent2`.`id` AS `id1`,
`Extent2`.`name` AS `name1`,
1 AS `C1`
FROM `products` AS `Extent1` INNER JOIN `categories` AS `Extent2` ON `Extent1`.`category_id` = `Extent2`.`id` LIMIT 10) AS `Limit1`
and performs pretty well. Anyway, the outer query is pretty much useless. Now If I add an OrderBy
context.products
.Include(x => x.category)
.OrderBy(x => x.id)
.Take(10)
.ToList();
the query changes to
SELECT
`Project1`.`C1`,
`Project1`.`id`,
`Project1`.`name`,
`Project1`.`category_id`,
`Project1`.`id1`,
`Project1`.`name1`
FROM (SELECT
`Extent1`.`id`,
`Extent1`.`name`,
`Extent1`.`category_id`,
`Extent2`.`id` AS `id1`,
`Extent2`.`name` AS `name1`,
1 AS `C1`
FROM `products` AS `Extent1` INNER JOIN `categories` AS `Extent2` ON `Extent1`.`category_id` = `Extent2`.`id`) AS `Project1`
ORDER BY
`Project1`.`id` ASC LIMIT 10
Which is bad because the order by is in the outer query. Theat means MySQL has to pull every record in order to perform an orderby which results in using filesort
I verified that SQL Server (Comapact at least) does not generate nested queries for the same code
SELECT TOP (10)
[Extent1].[id] AS [id],
[Extent1].[name] AS [name],
[Extent1].[category_id] AS [category_id],
[Extent2].[id] AS [id1],
[Extent2].[name] AS [name1],
FROM [products] AS [Extent1]
LEFT OUTER JOIN [categories] AS [Extent2] ON [Extent1].[category_id] = [Extent2].[id]
ORDER BY [Extent1].[id] ASC
Actually the queries generated by Entity Framework are few ugly, less than LINQ 2 SQL but still ugly.
However, very probably you database engine will make the desired execution plan, and the query will run smoothly.

Choosing better query out of 2 queries, both returning same results

I am having two different sql queries one written by me and one automatically generated by C# when used with linq, both are giving same results.
I am not sure which one to choose, Iam looking for
Whats the best way to choose one query out of many, when all returns same result (most optimized query).
Out of my queries (below written), which one should i choose.
Hand Written
select * from People P
inner join SubscriptionItemXes S
on
P.Id=S.Person_Id
inner join FoodTagXFoods T1
on T1.FoodTagX_Id = S.Tag2
inner join FoodTagXFoods T2
on T2.FoodTagX_Id = S.Tag1
inner join Foods F
on
F.Id= T1.Food_Id and F.Id= T2.Food_Id
where p.id='1'
Automatically Generated by LINQ
SELECT
[Distinct1].[Id] AS [Id],
[Distinct1].[Item] AS [Item]
FROM ( SELECT DISTINCT
[Extent2].[Id] AS [Id],
[Extent2].[Item] AS [Item]
FROM [dbo].[People] AS [Extent1]
CROSS JOIN [dbo].[Foods] AS [Extent2]
INNER JOIN [dbo].[FoodTagXFoods] AS [Extent3]
ON [Extent2].[Id] = [Extent3].[Food_Id]
INNER JOIN [dbo].[SubscriptionItemXes] AS [Extent4]
ON [Extent1].[Id] = [Extent4].[Person_Id]
WHERE (N'rusi' = [Extent1].[Name]) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[FoodTagXFoods] AS [Extent5]
WHERE ([Extent2].[Id] = [Extent5].[Food_Id])
AND ([Extent5].[FoodTagX_Id] = [Extent4].[Tag1])
)) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[FoodTagXFoods] AS [Extent6]
WHERE ([Extent2].[Id] = [Extent6].[Food_Id])
AND ([Extent6].[FoodTagX_Id] = [Extent4].[Tag2])
))
) AS [Distinct1]
Execution Plan Results
Hand Written: Query Cost (relative to batch):33%
Linq Generated: Query Cost (relative to batch):67%
I have found that two different queries, one hand-written and one generated by Linq might look wildly different but, actually, when you analyse the query plan in SSMS, you find that actually they are almost identical.
You need to actually run these queries in SSMS with Display Actual Execution Plan switched on, and analyse the different plans. It's the only way to correctly analyse the two and find out which is better.
In general, Linq is actually very good at generating efficient queries; even if the actual SQL itself is pretty ugly (in some cases, it's the kind of SQL that a human would write if they had the time!). If course, that said, it can also generate some pigs!
Additionally, asking SO to help with performance of a query over so many tables is fraught with problems for us, since it will be governed so much by your indexes :)
But they aren't quite returning the same thing... The first query grabs everyting (SELECT *) while LINQ is going to extract what you really want (id and item). Trivial you may say but streaming back lots of data that's never used is a good waste of bandwidth and will make your application appear sluggish. Additionally, the LINQ query seems to be doing a lot more which may or may not be the correct solution especially as data is populated into FoodTagXFoods
As for which performs better, I couldn't tell you without something like the actual query plans and/or results of statistics io from both queries. My money is on hand-written but maybe because I like my hands.
Examining SQL Server Query Execution Plan is the best way to choose the better query.
Hope below tutorials help.
SQL Tuning Tutorial - Understanding a Database Execution Plan (1)
Examining Query Execution Plans
SQL Server Query Execution Plan Analysis
SQL SERVER – Actual Execution Plan vs. Estimated Execution Plan

Categories

Resources