I'm trying to convert this query
select top(10) *
from SOMETABLE
where Name = 'test'
into LINQ, so I think it should look like this:
var c =
(from l
in db.SOMETABLE
where l.Name= 'test'
select l).take(10);
But when I look at the server profiler I can see that LINQ takes all the data from the table and probably applies the WHERE and TAKE after pulling the data from the database.
The problem is that SOMETABLE has ~10,000,000 records, so this is not fast.
Am I doing it wrong?
The code you posted has at least 3 mistakes, so I assume it isn't your actual code. To get the symptom you describe, the most likely cause is that you have used IEnumerable<T> somewhere, and are composing from that. To get end-to-end query composition (i.e. to do the TOP at the database), you need to be using IQueryable<T>. For example, the following is broken:
IEnumerable<SomeType> data = db.SomeTable;
var c = (from l in data
where l.Name == "test"
select l).Take(10);
but the following is absolutely fine, noting that only the first line has changed:
IQueryable<SomeType> data = db.SomeTable;
var c = (from l in data
where l.Name == "test"
select l).Take(10);
noting that this is also identical to:
IQueryable<SomeType> data = db.SomeTable;
var c = data.Where(l => l.Name == "test").Take(10);
So: make sure you haven't forced it to IEnumerable<T> (or similar, such as lists) prematurely.
As a final note, IIRC Entity Framework demands an ordering if you are applying Skip/Take (erroring if you don't) - this further supports my guess that you have dropped to IEnumerable<T> too early, but don't be amazed if you need to specify an order by too.
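For completeness, a minimal sketch of adding that ordering before the Take (assuming SomeType has an Id property to order by; pick whatever column gives you a meaningful TOP 10):
IQueryable<SomeType> data = db.SomeTable;
var c = data.Where(l => l.Name == "test")
            .OrderBy(l => l.Id)   // a deterministic order before Take/Skip
            .Take(10);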
Related
We have an 18-table join, which is typical for ERP systems. The join is done via LINQ over Entity Framework.
The join gets progressively slower as more joins are added. The returned result set is small (15 records). The LINQ-generated query is captured via SQL Profiler, and when we run it via Microsoft Management Console it is very fast: 10 ms. When we run it via our C# LINQ-over-Entity-Framework code it takes 4 seconds.
What I guess is happening:
Compiling the expression tree into SQL takes 2 of the total 4 seconds, and I guess the other 2 seconds are spent internally converting the SQL result set into actual C# classes. It is not connected to the initialization of Entity Framework, because we run some queries beforehand, and repeated calls to this join take the same 4 seconds.
Is there a way to speed this up? Otherwise we are considering abandoning Entity Framework for being absolutely inefficient...
In case it helps, I had a nasty performance issue, whereby a simple query that took 1-2 seconds in raw SQL took about 11 seconds via EF.
I went from using...
List<GeographicalLocation> geographicalLocations = new SalesTrackerCRMEntities()
.CreateObjectSet<GeographicalLocation>()
.Where(g => g.Active)
.ToList();
which took about 11 seconds via EF, to using...
var geographicalLocations = getContext().CreateObjectSet<GeographicalLocation>()
.AsNoTracking()
.Where(g => g.Active).ToList();
which took less than 200 milliseconds.
The disadvantage to this is that it won't load related entities, so you have to load them manually afterwards, but it gives such an immense performance boost that it was well worth it (in this case at least).
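For example, one hedged way to load related data afterwards is a second no-tracking query keyed by the IDs you already have (GeographicalLocation.Id, Region and GeographicalLocationId are assumed names, not part of the original model):
var locationIds = geographicalLocations.Select(g => g.Id).ToList();

// Second round-trip, still without change tracking; stitch the results together in memory.
var regions = getContext().CreateObjectSet<Region>()
    .AsNoTracking()
    .Where(r => locationIds.Contains(r.GeographicalLocationId))
    .ToList();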
You would have to assess each case individually to see if the extra speed is worth the extra code.
You correctly identified the bottlenecks.
If you have quite complex queries, I would suggest using compiled queries to overcome the expression-tree-to-SQL conversion cost.
You can read about Compiled Queries in EF here.
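A minimal sketch of a compiled query, assuming an ObjectContext-based model with a context called MyEntities, an Orders set and a CustomerId column (all of these names are assumptions):
using System.Data.Objects; // System.Data.Entity.Core.Objects in EF6

static class OrderQueries
{
    // Compiled once; subsequent calls skip the expression-tree-to-SQL translation.
    public static readonly Func<MyEntities, int, IQueryable<Order>> ByCustomer =
        CompiledQuery.Compile((MyEntities ctx, int customerId) =>
            ctx.Orders.Where(o => o.CustomerId == customerId));
}

// Usage: var orders = OrderQueries.ByCustomer(context, 42).ToList();
Note that CompiledQuery works against ObjectContext; with DbContext, EF 5+ caches query translations automatically instead.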
For the second part, if EF is taking too much time to materialize your object graph, then I would suggest using some other means than EF to retrieve the data.
One option is Dapper.NET: you can keep your concise SQL query and retrieve its results directly into concrete model objects using Dapper (or any other micro-ORM).
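A rough Dapper sketch, purely to show the shape of the API (the connection string, the SQL and the SchedulerRow class are assumptions, not the actual schema):
using Dapper;
using System.Data.SqlClient;

using (var connection = new SqlConnection(connectionString))
{
    // Hand-written SQL goes in, materialized POCOs come out.
    var rows = connection.Query<SchedulerRow>(
        @"SELECT a.Id, a.Name, o.DisplayId
          FROM Appointments a
          JOIN Operations o ON o.AppointmentId = a.Id
          WHERE a.Id = @id",
        new { id = 12 }).ToList();
}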
I suspect your query takes so long to generate because you are treating Entity Framework as if it were a SQL query, which is not correct. You have many joins and awkward calls in your LINQ syntax. Generally, your syntax should be similar to the following fictitious modelling query:
var result = (from appointment in appointments
              from operation in appointment.Operations
              where appointment.Id == 12
              select new Model {
                  Id = appointment.Id,
                  Name = appointment.Name,
                  // etc, etc
              }).ToList();
There is no use of joins above; the navigation property between Appointment and Operations takes care of the necessary plumbing. Remember, this is an ORM: there is no concept of a join, only a concept of relationships.
The call to Distinct at the end also indicates that the structure of the db schema may be problematic if it returns too many duplicate results.
If refactoring the entity model and correctly constructing the query still leaves you with underperformance, it is advisable to use a stored procedure and map the result with EF's built-in methods for doing so.
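For the stored-procedure fallback, a hedged sketch using the EF 5/6 DbContext API (the procedure name, parameter and AppointmentModel type are made up for illustration):
using System.Data.SqlClient;

var models = db.Database
    .SqlQuery<AppointmentModel>(
        "EXEC dbo.GetAppointmentModels @appointmentId",
        new SqlParameter("@appointmentId", 12))
    .ToList();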
It is hard to tell what is going wrong here without seeing how you are using LINQ, but I suspect this will fix your problem:
var myResult = dataContext.Table.Where(x => x.SomeColumn == "your filter").ToList();
// after converting it to a list, use it however you need, but not before
If this does not help please post your code.
The problem is that you are probably passing it to a data source that is running all sorts of additional queries based on your open result set.
Try this instead:
IEnumerable<SigmaTEK.Web.Models.SchedulerGridModel> tasks = (from appointment in _appointmentRep.Value.Get().Where(a => (a.Start < DbContext.MaxTime && DbContext.MinTime < a.Expiration))
join timeApplink in _timelineAppointmentLink.Value.Get().Where(a => a.AppointmentId != Guid.Empty)
on appointment.Id equals timeApplink.AppointmentId
join timeline in timelineRep.Value.Get().Where(i => timelines.Contains(i.Id))
on timeApplink.TimelineId equals timeline.Id
join repeater in _appointmentRepeaterRep.Value.Get().Where(repeater => (repeater.Start < DbContext.MaxTime && DbContext.MinTime < repeater.Expiration))
on appointment.Id equals repeater.Appointment
into repeaters
from repeater in repeaters.DefaultIfEmpty()
join aInstance in _appointmentInstanceRep.Value.Get()
on appointment.Id equals aInstance.Appointment
into instances
from instance in instances.DefaultIfEmpty()
join opRes in opResRep.Get()
on instance.ResourceOwner equals opRes.Id
into opResources
from op in opResources.DefaultIfEmpty()
join itemResource in _opDocItemResourcelinkRep.Value.Get()
on op.Id equals itemResource.Resource
into itemsResources
from itemresource in itemsResources.DefaultIfEmpty()
join opDocItem in opDocItemRep.Get()
on itemresource.OpDocItem equals opDocItem.Id
into opDocItems
from opdocitem in opDocItems.DefaultIfEmpty()
join opDocSection in opDocOpSecRep.Get()
on opdocitem.SectionId equals opDocSection.Id
into sections
from section in sections.DefaultIfEmpty()
join opDoc in opDocRep.Get()
on section.InternalOperationalDocument equals opDoc.Id
into opdocs
from opdocitem2 in opDocItems.DefaultIfEmpty()
join opDocItemLink in opDocItemStrRep.Get()
on opdocitem2.Id equals opDocItemLink.Parent
into opDocItemLinks
from link in opDocItemLinks.DefaultIfEmpty()
join finItem in finItemsRep.Get()
on link.Child equals finItem.Id
into temp1
from rd1 in temp1.DefaultIfEmpty()
join sec in finSectionRep.Get()
on rd1.SectionId equals sec.Id
into opdocsections
from finopdocsec in opdocsections.DefaultIfEmpty()
join finopdoc in opDocRep.Get().Where(i => i.DocumentType == "Quote")
on finopdocsec.InternalOperationalDocument equals finopdoc.Id
into finOpdocs
from finOpDoc in finOpdocs.DefaultIfEmpty()
join entry in entryRep.Get()
on rd1.Transaction equals entry.Transaction
into entries
from entry2 in entries.DefaultIfEmpty()
join resproduct in resprosductRep.Get()
on entry2.Id equals resproduct.Entry
into resproductlinks
from resprlink in resproductlinks.DefaultIfEmpty()
join res in resRep.Get()
on resprlink.Resource equals res.Id
into rootResource
from finopdoc in finOpdocs.DefaultIfEmpty()
join rel in orgDocIndRep.Get().Where(i => (i.Relationship == "OrderedBy"))
on finopdoc.Id equals rel.OperationalDocument
into orgDocIndLinks
from orgopdoclink in orgDocIndLinks.DefaultIfEmpty()
join org in orgRep.Get()
on orgopdoclink.Organization equals org.Id
into toorgs
from opdoc in opdocs.DefaultIfEmpty()
from rootresource in rootResource.DefaultIfEmpty()
from toorg in toorgs.DefaultIfEmpty()
select new SigmaTEK.Web.Models.SchedulerGridModel()
{
Id = appointment.Id,
Description = appointment.Description,
End = appointment.Expiration,
Start = appointment.Start,
OperationDisplayId = op.DisplayId,
OperationName = op.Name,
AppContextId = _appContext.Id,
TimelineId = timeline.Id,
AssemblyDisplayId = rootresource.DisplayId,
//Duration = SigmaTEK.Models.App.Utils.StringHelpers.TimeSpanToString((appointment.Expiration - appointment.Start)),
WorkOrder = opdoc.DisplayId,
Organization = toorg.Name
}).Distinct().ToList();
//In your UI
MyGrid.DataSource = tasks;
MyGrid.DataBind();
//Do not use an ObjectDataSource! It makes too many extra calls
I have a situation where my application constructs a dynamic LINQ query using PredicateBuilder based on user-specified filter criteria (aside: check out this link for the best EF PredicateBuilder implementation). The problem is that this query usually takes a long time to run and I need the results of this query to perform other queries (i.e., joining the results with other tables). If I were writing T-SQL, I'd put the results of the first query into a temporary table or a table variable and then write my other queries around that. I thought of getting a list of IDs (e.g., List<Int32> query1IDs) from the first query and then doing something like this:
var query2 = DbContext.TableName.Where(x => query1IDs.Contains(x.ID))
This will work in theory; however, the number of IDs in query1IDs can be in the hundreds or thousands (and the LINQ expression x => query1IDs.Contains(x.ID) gets translated into a T-SQL "IN" statement, which is bad for obvious reasons) and the number of rows in TableName is in the millions. Does anyone have any suggestions as to the best way to deal with this kind of situation?
Edit 1: Additional clarification as to what I'm doing.
Okay, I'm constructing my first query (query1) which just contains the IDs that I'm interested in. Basically, I'm going to use query1 to "filter" other tables. Note: I am not using a ToList() at the end of the LINQ statement---the query is not executed at this time and no results are sent to the client:
var query1 = DbContext.TableName1.Where(ComplexFilterLogic).Select(x => x.ID)
Then I take query1 and use it to filter another table (TableName2). I now put ToList() at the end of this statement because I want to execute it and bring the results to the client:
var query2 = (from a in DbContext.TableName2 join id in query1 on a.ID equals id select new { a.Column1, a.Column2, a.Column3,...,a.ColumnM }).ToList();
Then I take query1 and re-use it to filter yet another table (TableName3), execute it and bring the results to the client:
var query3 = (from a in DbContext.TableName3 join id in query1 on a.ID equals id select new { a.Column1, a.Column2, a.Column3,...,a.ColumnM }).ToList();
I can keep doing this for as many queries as I like:
var queryN = (from a in DbContext.TableNameN join id in query1 on a.ID equals id select new { a.Column1, a.Column2, a.Column3,...,a.ColumnM }).ToList();
The Problem: query1 takes a long time to execute. When I execute query2, query3...queryN, query1 is executed (N-1) times...this is not a very efficient way of doing things (especially since query1 isn't changing). As I said before, if I were writing T-SQL, I would put the result of query1 into a temporary table and then use that table in the subsequent queries.
Edit 2:
I'm going to give the credit for answering this question to Albin Sunnanbo for his comment:
When I had similar problems with a heavy query that I wanted to reuse in several other queries, I always went back to the solution of creating a join in each query and putting more effort into optimizing the query execution (mostly by tweaking my indexes).
I think that's really the best that one can do with Entity Framework. In the end, if the performance gets really bad, I'll probably go with John Wooley's suggestion:
This may be a situation where dropping to native ADO against a stored proc returning multiple results and using an internal temp table might be your best option for this operation. Use EF for the other 90% of your app.
Thanks to everyone who commented on this post...I appreciate everyone's input!
If the size of TableName is not too big to load the whole table, you can use
var tableNameById = DbContext.TableName.ToDictionary(x => x.ID);
to fetch the whole table and automatically put it in a local Dictionary with ID as key.
Another way is to just "force" the LINQ evaluation with .ToList(), in that case fetching the whole table and doing the Where part locally with Linq2Objects.
var query1Lookup = new HashSet<int>(query1IDs);
var query2 = DbContext.TableName.ToList().Where(x => query1Lookup.Contains(x.ID));
Edit:
Storing a list of IDs from one query in a list and using that list as a filter in another query can usually be rewritten as a join.
When I had similar problems with a heavy query that I wanted to reuse in several other queries, I always went back to the solution of creating a join in each query and putting more effort into optimizing the query execution (mostly by tweaking my indexes).
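To illustrate the difference with the OP's own tables (TableName1/TableName2), roughly:
// Contains form: the IDs are pulled to the client and sent back as a big IN list.
var ids = DbContext.TableName1.Where(ComplexFilterLogic).Select(x => x.ID).ToList();
var viaContains = DbContext.TableName2.Where(a => ids.Contains(a.ID)).ToList();

// Join form: query1 stays an IQueryable, so the filter runs entirely in SQL.
var query1 = DbContext.TableName1.Where(ComplexFilterLogic).Select(x => x.ID);
var viaJoin = (from a in DbContext.TableName2
               join id in query1 on a.ID equals id
               select a).ToList();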
Since you are running a subsequent query off the results, take your first query and use it as a View on your SQL Server, add the view to your context, and build your LINQ queries against the view.
Have you considered composing your query as per this article (using the decorator design pattern):
Composed LINQ Queries using the Decorator Pattern
The premise is that, instead of enumerating your first (very costly) query, you basically use the decorator pattern to produce a chain of IQueryables that is the result of query 1 through query N. This way you always execute the filtered form of the query.
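A rough sketch of that composition idea, reusing the OP's names (the entity types and helper methods are illustrative only):
using System.Linq.Expressions;

// Each "decorator" takes an IQueryable and returns a narrower IQueryable;
// nothing executes until the final ToList().
static IQueryable<int> FilteredIds(IQueryable<TableName1Entity> source,
                                   Expression<Func<TableName1Entity, bool>> filter)
{
    return source.Where(filter).Select(x => x.ID);
}

static IQueryable<TableName2Entity> OnlyMatching(IQueryable<TableName2Entity> source,
                                                 IQueryable<int> ids)
{
    return from a in source
           join id in ids on a.ID equals id
           select a;
}

// Composed: the chain stays IQueryable, so each final query is still one SQL statement.
var ids = FilteredIds(DbContext.TableName1, ComplexFilterLogic);
var query2 = OnlyMatching(DbContext.TableName2, ids).ToList();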
Hope this might help
I'm trying to create a LINQ provider. I'm using the guide LINQ: Building an IQueryable provider series, and I have added the code up to LINQ: Building an IQueryable Provider - Part IV.
I am getting a feel for how it works and the idea behind it. Now I'm stuck on a problem, which isn't a code problem but more a matter of understanding.
I'm firing off this statement:
QueryProvider provider = new DbQueryProvider();
Query<Customer> customers = new Query<Customer>(provider);
int i = 3;
var newLinqCustomer = customers.Select(c => new { c.Id, c.Name}).Where(p => p.Id == 2 | p.Id == i).ToList();
Somehow the code, or expression, knows that the Where comes before the Select. But how and where?
There is nothing in the code that sorts the expression; in fact the ToString() in debug mode shows that the Select comes before the Where.
I was trying to make the code fail. Normally I would put the Where first and then the Select.
So how does the expression sort this? I have not done any change to the code in the guide.
The expressions are "interpreted", "translated" or "executed" in the order you write them - so the Where does not come before the Select.
If you execute:
var newLinqCustomer = customers.Select(c => new { c.Id, c.Name})
.Where(p => p.Id == 2 | p.Id == i).ToList();
Then the Where is executed on the IEnumerable or IQueryable of the anonymous type.
If you execute:
var newLinqCustomer = customers.Where(p => p.Id == 2 | p.Id == i)
.Select(c => new { c.Id, c.Name}).ToList();
Then the Where is executed on the IEnumerable or IQueryable of the customer type.
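You can see this ordering directly in the expression tree, even with plain LINQ to Objects and no custom provider; the snippet below assumes a Customer class with Id and Name like in the question:
var customers = new[] { new Customer { Id = 2, Name = "a" } }.AsQueryable();
int i = 3;

var query = customers
    .Select(c => new { c.Id, c.Name })
    .Where(p => p.Id == 2 || p.Id == i);

// Prints the call chain with Select as the inner call and Where wrapped around it,
// i.e. the operators are recorded in exactly the order they were written.
Console.WriteLine(query.Expression);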
The only thing I can think of is that maybe you're seeing some generated SQL where the SELECT and WHERE have been reordered? In which case I'd guess that there's an optimisation step somewhere in the (e.g.) LINQ to SQL provider that takes SELECT Id, Name FROM (SELECT Id, Name FROM Customer WHERE Id=2 || Id=#i) and converts it to SELECT Id, Name FROM Customer WHERE Id=2 || Id=#i - but this must be a provider specific optimisation.
No, in the general case (such as LINQ to Objects) the Select will be executed before the Where. Think of it as a pipeline: your first step is a transformation, the second a filter. It is not the other way round, as it would be if you wrote Where...Select.
Now, a LINQ provider has the freedom to walk the expression tree and optimize it as it sees fit. Be aware that it may not change the semantics of the expression, though. This means that a smart LINQ to SQL provider would try to pull as many where clauses as it can into the SQL query, to reduce the amount of data travelling over the network. However, keep the example from Stuart in mind: not all query providers are clever, partly because ruling out side effects from query reordering is not as easy as it seems.
Given that I have three tables (Customer, Orders, and OrderLines) in a Linq To Sql model where
Customer -- One to Many -> Orders -- One to Many -> OrderLines
When I use
var customer = Customers.First();
var manyWay = from o in customer.CustomerOrders
from l in o.OrderLines
select l;
I see one query getting the customer, which makes sense. Then I see a query for the customer's orders, and then a single query for each order getting that order's lines, rather than a join of the two. Total of n + 1 queries (not counting getting the customer).
But if I use
var tableWay = from o in Orders
from l in OrderLines
where o.Customer == customer
&& l.Order == o
select l;
Then instead of seeing a single query per order getting its order lines, I see a single query joining the two tables. Total of 1 query (not counting getting the customer).
I would prefer to use the first LINQ query as it seems more readable to me, but why isn't L2S joining the tables as I would expect in the first query? Using LINQPad I see that the second query is compiled into a SelectMany, though I see no alteration to the first query; not sure if that's an indicator of some problem in my query.
I think the key here is
customer.CustomerOrders
That's an EntitySet, not an IQueryable, so your first query doesn't translate directly into a SQL query. Instead, it is interpreted as many queries, one for each Order.
That's my guess, anyway.
How about this:
Customers.First().CustomerOrders.SelectMany(item => item.OrderLines)
I am not 100% sure, but my guess is that because you are traversing down the relationship, that is how the query is built up, compared to the second solution where you are actually joining two sets by a value.
So after Francisco's answer and experimenting with LINQPad, I have come up with a decent workaround.
var lines = from c in Customers
where c == customer
from o in c.CustomerOrders
from l in o.OrderLines
select l;
This forces the EntitySet into an Expression, which the provider then turns into the appropriate query. The first two lines are the key: by querying the IQueryable and then putting the EntitySet in the SelectMany, it becomes an expression. This works for the other operators as well: Where, Select, etc.
Try this query:
IQueryable<OrderLine> query =
from c in myDataContext.customers.Take(1)
from o in c.CustomerOrders
from l in o.OrderLines
select l;
You can go to the CustomerOrders property definition and see how the property acts when it is used with an actual instance. When the property is used in a query expression, the behavior is up to the query provider - the property code is usually not run in that case.
See also this answer, which demonstrates a method that behaves differently in a query expression, than if it is actually called.
I want to find the fastest way to get the min and max of a column in a table with a single Linq to SQL roundtrip. So I know this would work in two roundtrips:
int min = MyTable.Min(row => row.FavoriteNumber);
int max = MyTable.Max(row => row.FavoriteNumber);
I know I can use group but I don't have a group by clause, I want to aggregate over the whole table! And I can't use the .Min without grouping first. I did try this:
from row in MyTable
group row by true into r
select new {
min = r.Min(z => z.FavoriteNumber),
max = r.Max(z => z.FavoriteNumber)
}
But that crazy group clause seems silly, and the SQL it makes is more complex than it needs to be.
So, is there any way to just get the correct SQL out?
EDIT: These guys failed too: Linq to SQL: how to aggregate without a group by? ... lame oversight by LINQ designers if there's really no answer.
EDIT 2: I looked at my own solution (with the nonsensical constant group by clause) in the SQL Server Management Studio execution plan analysis, and it looks to me like it is identical to the plan generated by:
SELECT MIN(FavoriteNumber), MAX(FavoriteNumber)
FROM MyTable
so unless someone can come up with a simpler-or-equally-as-good answer, I think I have to mark it as answered-by-myself. Thoughts?
As stated in the question, this method seems to actually generate optimal SQL code, so while it looks a bit squirrely in LINQ, it should be optimal performance-wise.
from row in MyTable
group row by true into r
select new {
min = r.Min(z => z.FavoriteNumber),
max = r.Max(z => z.FavoriteNumber)
}
I could find only this one, which produces somewhat clean SQL, though it is still not really efficient compared to select min(val), max(val) from table:
var r =
(from min in items.OrderBy(i => i.Value)
from max in items.OrderByDescending(i => i.Value)
select new {min, max}).First();
the sql is
SELECT TOP (1)
[t0].[Value],
[t1].[Value] AS [Value2]
FROM
[TestTable] AS [t0],
[TestTable] AS [t1]
ORDER BY
[t0].[Value],
[t1].[Value] DESC
Still, there is another option: use a single connection for both the min and max queries (see Multiple Active Result Sets (MARS)), or a stored procedure...
I'm not sure how to translate it into C# yet (I'm working on it)
This is the Haskell version
minAndMax :: Ord a => [a] -> (a, a)
minAndMax [x]    = (x, x)
minAndMax (x:xs) = (min a x, max b x)
  where (a, b) = minAndMax xs
The C# version should involve Aggregate somehow (I think).
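For what it's worth, a hedged LINQ-to-Objects translation using Aggregate might look like this; note it runs in memory, so it does not help with the single SQL round-trip, it only mirrors the Haskell algorithm:
var numbers = new[] { 7, 2, 9, 4 };

// Single pass: the accumulator carries the running (Min, Max) pair.
var minAndMax = numbers
    .Skip(1)
    .Aggregate(
        new { Min = numbers[0], Max = numbers[0] },
        (acc, x) => new { Min = Math.Min(acc.Min, x), Max = Math.Max(acc.Max, x) });

Console.WriteLine("min = {0}, max = {1}", minAndMax.Min, minAndMax.Max);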
You could select the whole table, and do your min and max operations in memory:
var cache = MyTable.ToList(); // select * - pulls the whole table into memory
var min = cache.Min(row => row.FavoriteNumber);
var max = cache.Max(row => row.FavoriteNumber);
Depending on how large your dataset is, this might be the way to go to avoid hitting your database more than once.
A LINQ to SQL query is a single expression. Thus, if you can't express your query in a single expression (or don't like it once you do) then you have to look at other options.
Stored procedures, since they can have statements, enable you to accomplish this in a single round-trip. You will either have two output parameters or select a result set with two rows. Either way, you will need custom code to read the stored procedure's result.
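A hedged ADO.NET sketch of the output-parameter variant (the procedure name dbo.GetFavoriteNumberRange and its parameters are assumptions):
using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetFavoriteNumberRange", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var minParam = cmd.Parameters.Add("@Min", SqlDbType.Int);
    minParam.Direction = ParameterDirection.Output;
    var maxParam = cmd.Parameters.Add("@Max", SqlDbType.Int);
    maxParam.Direction = ParameterDirection.Output;

    conn.Open();
    cmd.ExecuteNonQuery();    // single round-trip

    int min = (int)minParam.Value;
    int max = (int)maxParam.Value;
}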
(I don't personally see the need to avoid two round-trips here. It seems like a premature optimization, especially since you will probably have to jump through hoops to get it working. Not to mention the time you will spend justifying this decision and explaining the solution to other developers.)
Put another way: you've already answered your own question. "I can't use the .Min without grouping first", followed by "that crazy group clause seems silly, and the SQL it makes is more complex than it needs to be", are clues that the simple and easily-understood two-round-trip solution is the best expression of your intent (unless you write custom SQL).