Let's say I have the following bit of code (which I know could be easily modified to perform better, but it illustrates what I want to do):
List<Query> l = new List<Query>();
// Query is a class that doesn't exist, it represents an EF operation
foreach (var x in Xs)
{
Query o = { context.someEntity.Where(s=>s.Id==x.Id).First();}
// It wouldn't execute it, this is pseudo code for delegate/anonymous function
l.Add(o);
}
I would then send this list of Query objects to EF and have it optimize them so that it makes as few round trips as possible. Let's call it BatchOptimizeAndRun; you would say:
var results = BatchOptimizeAndRun(l);
Using what it knows from the schema, it would reduce the overall query to an optimal version, execute that, and place the results in an array.
I hope I've described what I'm looking for accurately and more importantly that it exists.
And if I sound like a rambling mad man, let's pretend this question never existed.
I'd have to echo Mr. Moore's advice, as I too have spent far too long constructing a linq-to-entities query of monolithic proportions only to find that I could have made a stored procedure in less time that was easier to read and faster to execute. That being said in your example...
List<int> ids = Xs.Select(x => x.Id).ToList();
var results = context.someEntity.Where(s => ids.Contains(s.Id)).ToList();
I believe this will compile to something like
SELECT
*
FROM
someEntity
WHERE
Id IN (ids) --Where ids is a comma separated list of INT
Which will provide you with what you need.
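To make the difference between the two shapes concrete, here is a minimal sketch that runs both against an in-memory list. SomeEntity, BatchLookup, and both method names are hypothetical stand-ins, not anything from EF. Because this runs as LINQ-to-objects, it only shows that the two forms return the same rows; the single-query benefit applies when the source is an EF set and the provider translates Contains into an IN clause.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an EF entity; not part of the original question.
public class SomeEntity
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public static class BatchLookup
{
    // One lookup per id: against EF this shape costs one round trip per element.
    public static List<SomeEntity> OneByOne(IEnumerable<SomeEntity> source, IEnumerable<int> ids)
    {
        var found = new List<SomeEntity>();
        foreach (int id in ids)
        {
            // First() throws if no entity has this id.
            found.Add(source.First(s => s.Id == id));
        }
        return found;
    }

    // One filter over the whole id list: the shape suggested in the answer above.
    public static List<SomeEntity> Batched(IEnumerable<SomeEntity> source, IEnumerable<int> ids)
    {
        List<int> idList = ids.ToList();
        return source.Where(s => idList.Contains(s.Id)).ToList();
    }
}
```

Two behavioral differences are worth keeping in mind: the batched form returns rows in source order rather than in the order of the ids, and it silently skips ids that have no match, where First() would throw.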
I'm dumping a table out of MySQL into a DataTable object using MySqlDataAdapter. Database input and output is doing fine, but my application code seems to have a performance issue I was able to track down to a specific LINQ statement.
The goal is simple, search the contents of the DataTable for a column value matching a specific string, just like a traditional WHERE column = 'text' SQL clause.
Simplified code:
foreach (String someValue in someList) {
String searchCode = OutOfScopeFunction(someValue);
var results = emoteTable.AsEnumerable()
.Where(myRow => myRow.Field<String>("code") == searchCode)
.Take(1);
if (results.Any()) {
results.First()["columnname"] = 10;
}
}
This simplified code is executed thousands of times, once for each entry in someList. When I run Visual Studio Performance Profiler I see that the "results.Any()" line is highlighted as consuming 93.5% of the execution time.
I've tried several different methods for optimizing this code, but none have improved performance while keeping the emoteTable DataTable as the primary source of the data. I can convert emoteTable to Dictionary<String, DataRow> outside of the foreach, but then I have to keep the DataTable and the Dictionary in sync, which while still a performance improvement, feels wrong.
Three questions:
Is this the proper way to search for a value in a DataTable (equivalent of a traditional SQL WHERE clause)? If not, how SHOULD it be done?
Addendum to 1, regardless of the proper way, what is the fastest (execution time)?
Why does the results.Any() line consume 90%+ resources? In this situation it makes more sense that the var results line should consume the resources, after all, it's the line doing the actual search, right?
Thank you for your time. If I find an answer I shall post it here as well.
Any() is taking 90% of the time because the query is only executed when you call Any(). Before that call, nothing has actually been evaluated.
It would seem the problem is that you first fetch the entire table into memory and then search it. You should instruct your database to do the search.
Moreover, when you call results.First(), the whole results query is executed again.
With deferred execution in mind, you should write something like
var result = emoteTable.AsEnumerable()
.Where(myRow => myRow.Field<String>("code") == searchCode)
.FirstOrDefault();
if (result != null) {
result["columnname"] = 10;
}
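The double execution is easy to observe by counting how often the predicate runs. The sketch below uses a plain list of strings instead of a DataTable, and DeferredExecutionDemo is a made-up name; it is only meant to illustrate the deferred-execution point, not to measure anything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Returns how many times the predicate ran at each stage.
    public static (int AfterBuild, int AnyThenFirst, int FirstOrDefaultOnly) CountPredicateCalls()
    {
        var codes = new List<string> { "a", "b", "c", "d" };
        int calls = 0;

        // Building the query runs nothing; the lambda executes only on enumeration.
        IEnumerable<string> results = codes
            .Where(code => { calls++; return code == "c"; })
            .Take(1);
        int afterBuild = calls;

        // Pattern from the question: Any() enumerates, then First() enumerates again.
        if (results.Any())
        {
            _ = results.First();
        }
        int anyThenFirst = calls;

        // Pattern from the answer: a single enumeration that stops at the first match.
        calls = 0;
        _ = codes.FirstOrDefault(code => { calls++; return code == "c"; });
        int firstOrDefaultOnly = calls;

        return (afterBuild, anyThenFirst, firstOrDefaultOnly);
    }
}
```

With this four-element list the predicate runs zero times while the query is built, six times for Any() followed by First(), and three times for FirstOrDefault(), which is why the profiler attributes the cost to the Any() line rather than to the line that declares results.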
What you have implemented is pretty much a join:
var searchCodes = someList.Select(OutOfScopeFunction);
var emotes = emoteTable.AsEnumerable();
// outer key: the row's "code" column; inner key: the search code itself; keep the matching row
var results = Enumerable.Join(emotes, searchCodes, e => e.Field<String>("code"), sc => sc, (e, sc) => e);
foreach(var result in results)
{
result["columnname"] = 10;
}
Join builds a hash-based lookup over the inner sequence (searchCodes here) and probes it once per row, so the table is not rescanned for every search code.
But the first thing I would do is completely abandon the idea of combining DataTable and LINQ. They are two different technologies, and it is hard to reason about what they do internally when combined.
Did you try doing raw UPDATE calls? How many items are you expecting to update?
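If the DataTable has to stay, the lookup idea can also be applied by hand: index the rows once, then probe the index for each search code. A minimal sketch follows; EmoteIndex and its methods are hypothetical names. It uses Rows.Cast<DataRow>() rather than AsEnumerable() purely to keep the example free of extra references. The index stores references to the table's own rows, so updates need no syncing, but rows added to the table afterwards will not appear in it.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class EmoteIndex
{
    // Builds a key -> rows index in a single pass over the table.
    // A DBNull key becomes a null key, which ILookup accepts.
    public static ILookup<string, DataRow> BuildIndex(DataTable table, string keyColumn)
    {
        return table.Rows.Cast<DataRow>().ToLookup(row => row[keyColumn] as string);
    }

    // Sets valueColumn on the first row matching each search code.
    // Returns the number of search codes that had a match.
    public static int UpdateMatches(DataTable table, IEnumerable<string> searchCodes,
                                    string keyColumn, string valueColumn, int newValue)
    {
        ILookup<string, DataRow> index = BuildIndex(table, keyColumn);
        int matched = 0;
        foreach (string code in searchCodes)
        {
            // The indexer yields an empty sequence when the code is absent.
            var row = index[code].FirstOrDefault();
            if (row != null)
            {
                row[valueColumn] = newValue; // writes through to the DataTable itself
                matched++;
            }
        }
        return matched;
    }
}
```

A call such as EmoteIndex.UpdateMatches(emoteTable, someList.Select(OutOfScopeFunction), "code", "columnname", 10) would replace the loop from the question, scanning the table once instead of once per entry.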
I have a full outer join query pulling data from a SQL Compact database (I use EF6 for mapping):
var query =
from entry in left.Union(right).AsEnumerable()
select new
{
...
} into e
group e by e.Date.Year into year
select new
{
Year = year.Key,
Quartals = from x in year
group x by (x.Date.Month - 1) / 3 + 1 into quartal
select new
{
Quartal = quartal.Key,
Months = from x in quartal
group x by x.Date.Month into month
select new
{
Month = month.Key,
Contracts = from x in month
group x by x.Contract.extNo into contract
select new
{
ExtNo = contract.Key,
Entries = contract,
}
}
}
};
As you can see, I use nested groups to structure the results.
The interesting thing is that if I remove the AsEnumerable() call, the query takes 3.5x longer to execute: ~210 ms vs ~60 ms. On the first run the difference is much greater: 39000(!) ms vs 1300 ms.
My questions are:
What am I doing wrong? Should those groupings be done in a different way?
Why does the first execution take so much time? I know expression trees have to be built and so on, but 39 seconds?
Why is LINQ to DB slower than LINQ to Entities in my case? Is it generally slower, and is it better to load the data from the DB before processing where possible?
Thanks!
To answer your three questions:
Maybe those groupings should be done in a different way?
No. If you want nested groupings you can only do that by groupings within groupings.
You can group by multiple fields at once:
from entry in left.Union(right)
select new
{
...
} into e
group e by new
{
e.Date.Year,
Quartal = (e.Date.Month - 1) / 3 + 1,
e.Date.Month,
contract = e.Contract.extNo
} into grp
select new
{
Year = grp.Key.Year,
Quartal = grp.Key.Quartal,
Month = grp.Key.Month,
ExtNo = grp.Key.contract,
Entries = grp
}
This will remove a lot of complexity from the generated query so it's likely to be (much) faster without AsEnumerable(). But the result is quite different: a flat group (Year, Quartal, etc, in one row), not a nested grouping.
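For comparison, here is what the flat, composite-key grouping looks like as a runnable LINQ-to-objects sketch. The Entry and FlatGroup classes are hypothetical, since the original query elides the projected fields, and the ordering clause is added only to make the output deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened entry; the original projection is elided.
public class Entry
{
    public DateTime Date { get; set; }
    public string ExtNo { get; set; } = "";
}

// One row per (year, quarter, month, contract) combination.
public class FlatGroup
{
    public int Year { get; set; }
    public int Quartal { get; set; }
    public int Month { get; set; }
    public string ExtNo { get; set; } = "";
    public List<Entry> Entries { get; set; } = new List<Entry>();
}

public static class FlatGrouping
{
    public static List<FlatGroup> Group(IEnumerable<Entry> entries)
    {
        var query =
            from e in entries
            // Anonymous types compare by value, so this acts as a composite key.
            group e by new
            {
                e.Date.Year,
                Quartal = (e.Date.Month - 1) / 3 + 1,
                e.Date.Month,
                e.ExtNo
            } into grp
            orderby grp.Key.Year, grp.Key.Month, grp.Key.ExtNo
            select new FlatGroup
            {
                Year = grp.Key.Year,
                Quartal = grp.Key.Quartal,
                Month = grp.Key.Month,
                ExtNo = grp.Key.ExtNo,
                Entries = grp.ToList()
            };
        return query.ToList();
    }
}
```

If the nested year/quarter/month shape is still needed for display, it can be rebuilt from this flat list in memory, where nesting costs nothing in generated SQL.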
Why the first execution takes so much time?
Because the generated SQL query is probably pretty complex and the database engine's query optimizer can't find a fast execution path.
3a. Why is linq to db slower than linq to entities in my case?
Because, apparently, in this case it's much more efficient to fetch the data into memory first and do the groupings by LINQ-to-objects. This effect will be more significant if left and right represent more or less complex queries themselves. In that case, the generated SQL can get hugely bloated, because it has to process two sources of complexity in one statement, which may lead to many repeated identical subqueries. By outsourcing the grouping, the database is probably left with a relatively simple query and of course the grouping in memory is never affected by the complexity of the SQL query.
3b. Is it generally slower and its better to load data from db if possible before processing?
No, not generally. I'd even say, hardly ever. In this case it is because (as I can see) you don't filter data. If however the part before AsEnumerable() would return millions of records and you would apply filtering afterwards, the query without AsEnumerable() would probably be much faster, because the filtering is done in the database.
Therefore, you should always keep an eye on generated SQL. It's unrealistic to expect that EF will always generate a super optimized SQL statement. It hardly ever will. Its primary focus is on correctness (and it does an exceptional job there), performance is secondary. It's the developer's job to make LINQ-to-Entities and LINQ-to-object work together as a slick team.
Using AsEnumerable() will convert a type that implements IEnumerable<T> to IEnumerable<T> itself.
Read this topic https://msdn.microsoft.com/en-us/library/bb335435.aspx
AsEnumerable<TSource>(IEnumerable<TSource>) can be used to choose between query implementations when a sequence implements IEnumerable<T> but also has a different set of public query methods available. For example, given a generic class Table that implements IEnumerable<T> and has its own methods such as Where, Select, and SelectMany, a call to Where would invoke the public Where method of Table. A Table type that represents a database table could have a Where method that takes the predicate argument as an expression tree and converts the tree to SQL for remote execution. If remote execution is not desired, for example because the predicate invokes a local method, the AsEnumerable<TSource> method can be used to hide the custom methods and instead make the standard query operators available.
When you call AsEnumerable() first, the operators that follow are not translated to SQL; instead the table is loaded into memory as they enumerate it. Once the data is in memory, the rest of the query runs there, which in this case is faster.
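The switch between the two sets of operators can be seen without a database. In the sketch below, an array wrapped with AsQueryable() stands in for a database-backed IQueryable<T>; that substitution, along with the names AsEnumerableDemo and IsInteresting, is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AsEnumerableDemo
{
    // A local method: a remote query provider generally cannot translate a call to it.
    public static bool IsInteresting(int value) => value % 2 == 0;

    public static (bool BeforeIsQueryable, bool AfterIsQueryable, List<int> Result) Run()
    {
        // Stand-in for a database-backed query source.
        IQueryable<int> source = new[] { 1, 2, 3, 4 }.AsQueryable();

        // Binds to Queryable.Where: the predicate is captured as an expression tree
        // that a provider could translate.
        IEnumerable<int> viaProvider = source.Where(x => x > 1);

        // Binds to Enumerable.Where: the predicate is a compiled delegate run in memory,
        // so calling a local method is fine.
        IEnumerable<int> inMemory = source.AsEnumerable().Where(x => IsInteresting(x));

        return (viaProvider is IQueryable<int>, inMemory is IQueryable<int>, inMemory.ToList());
    }
}
```

Run() reports that the first query is still an IQueryable<int> while the second is not, which is the whole effect of AsEnumerable(): it changes which Where the compiler picks, not the data.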
I profiled my code because it takes a long time to execute. It turns out it generates a SELECT instead of a COUNT, and as there are 20,000 records it is very slow.
This is the code:
var catViewModel= new CatViewModel();
var catContext = new CatEntities();
var catAccount = catContext.Account.Single(c => c.AccountId == accountId);
catViewModel.NumberOfCats = catAccount.Cats.Count();
It is straightforward stuff, but the code that the profiler is showing is:
exec sp_executesql N'SELECT
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy,
[Extent1].xxxxx AS yyyyy // You get the idea
FROM [dbo].[Cats] AS [Extent1]
WHERE Cats.[AccountId] = @EntityKeyValue1',N'@EntityKeyValue1 int',@EntityKeyValue1=7
I've never seen this behaviour before, any ideas?
Edit: It is fixed if I simply do this instead:
catViewModel.NumberOfRecords = catContext.Cats.Where(c => c.AccountId == accountId).Count();
I'd still like to know why the former didn't work though.
So you have 2 completely separate queries going on here and I think I can explain why you get different results. Let's look at the first one
// pull a single account record
var catAccount = catContext.Account.Single(c => c.AccountId == accountId);
// count all the associated Cat records against said account
catViewModel.NumberOfCats = catAccount.Cats.Count();
Going on the assumption that Cats has a 0..* relationship with Account, and assuming you are leveraging the framework's ability to lazily load related tables, your first access to catAccount.Cats results in a SELECT for all the Cat records associated with that particular account. Those rows are brought into memory, so the call to Count() just checks the Count property of the in-memory collection (hence no COUNT SQL is generated).
The second query
catViewModel.NumberOfRecords =
catContext.Cats.Where(c => c.AccountId == accountId).Count();
Is directly against the Cats table (which is an IQueryable<T>), so the only operations performed against the table are Where and Count. Both are translated to SQL and evaluated on the database side, so it's a lot more efficient than the first.
However, if you need both Account and Cats then I would recommend you eager load the data on the fetch, that way you take the hit upfront once
var catAccount = catContext.Account.Include(a => a.Cats).Single(...);
Most times, when somebody accesses a sub-collection of an entity, it is because there are a limited number of records, and it is acceptable to populate the collection. Thus, when you access:
catAccount.Cats
(regardless of what you do next), it is filling that collection. Your .Count() is then operating on the local in-memory collection. The problem is that you don't want that. Now you have three options:
check whether your provider offers some mechanism to make that a query rather than a collection
build the query dynamically
access the core data-model instead
I'm pretty confident that if you did:
catViewModel.NumberOfRecords =
catContext.Cats.Count(c => c.AccountId == accountId);
it will work just fine. Less convenient? Sure. But "works" is better than "convenient".
I have the following query:
if (idUO > 0)
{
query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
var dependencyIds = dependencies.Select(d => d.Id).ToList();
query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, it's OK: the query takes less than 1 second. But when I try to access page 30000, the query takes more than 20 seconds. Is there a way to improve the performance in the LINQ query itself, or only at the database level? And at the database level, what is the best way to improve performance for this kind of query?
There is not much room here, in my opinion, to make things better (at least judging by the code provided).
When you're trying to achieve good performance at such volumes, I would recommend not using LINQ at all, or at least using it only for operations that touch less data.
What you can do here is introduce paging of that data at the database level, with a stored procedure, and invoke it from your C# code.
1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.
I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about indexes that you should add. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)
I think you'll find that the bottleneck is occurring at the database. Here's why:
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.
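One more option that stays within LINQ is keyset (seek) paging. To be clear, this is a different technique from the stored-procedure approach recommended above and is not something the question's code already does. Instead of skipping N rows, each request filters on the last row of the previous page. The sketch below assumes a hypothetical minimal Ticket shape and adds Id as a tie-breaker so the ordering is total. Whether it helps in practice depends on the database having a suitable index on the sort columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal shape of the paged entity.
public class Ticket
{
    public int Id { get; set; }
    public DateTime Date { get; set; }
}

public static class Paging
{
    // Offset paging as in the question, with Id added so rows sharing a Date
    // always come back in the same order.
    public static List<Ticket> OffsetPage(IQueryable<Ticket> query, int page, int pageSize)
    {
        return query.OrderBy(t => t.Date).ThenBy(t => t.Id)
                    .Skip(pageSize * page).Take(pageSize).ToList();
    }

    // Keyset paging: continue right after the last row of the previous page.
    public static List<Ticket> NextPage(IQueryable<Ticket> query, DateTime lastDate, int lastId, int pageSize)
    {
        return query
            .Where(t => t.Date > lastDate || (t.Date == lastDate && t.Id > lastId))
            .OrderBy(t => t.Date).ThenBy(t => t.Id)
            .Take(pageSize).ToList();
    }
}
```

The trade-off is that keyset paging only supports moving to the next page from a known position; jumping straight to page 30000 still needs an offset or a server-side ROW_NUMBER approach like the one in the linked article.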
Another question regarding EF:
I was wondering what's going behind the scenes when iterating over a query result.
For example, check out the following code:
var activeSources = from e in entitiesContext.Sources
where e.IsActive
select e;
and then:
foreach (Source currSource in activeSources)
{
// code based on the current source...
}
Important note: Each iteration takes a while to complete (from 1 to 25 seconds).
Now, I assume EF is based on DataReaders for maximum efficiency, so based on that assumption, I figure that in the above case, the Database connection will be kept open until I finish iterating over the results, which will be a very long time (when talking in terms of code), which is something I obviously don't want.
Is there a way to fetch the entire data like I would've done with plain old ADO.NET DataAdapters, DataSets and the fill() method instead of using DataReaders?
Or maybe I'm way off with my assumptions?
In any case I would've loved to be pointed to a good source explaining this if available.
Thanks,
Mikey
If you want to get all of the data up front, similar to Fill(), you need to force the query to execute.
var activeSources = from e in entitiesContext.Sources
where e.IsActive
select e;
var results = activeSources.ToList();
After ToList() is called you will have the data and be disconnected from the database.
If you want to return all results at once, use .ToList(). The query then executes immediately instead of being deferred until the foreach.
var activeSources = (from e in entitiesContext.Sources
where e.IsActive
select e).ToList();
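The difference between a deferred query and a materialized list can be checked in memory. In this sketch a list of IsActive flags stands in for the Sources table, and MaterializeDemo is a made-up name. It shows only the snapshot behavior, not connection handling, which depends on the EF provider.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MaterializeDemo
{
    public static (int DeferredCount, int SnapshotCount) Run()
    {
        // Each element plays the role of a Source row's IsActive flag.
        var isActiveFlags = new List<bool> { true, false, true };

        // Deferred: nothing is evaluated until the query is enumerated.
        IEnumerable<bool> deferred = isActiveFlags.Where(isActive => isActive);

        // Materialized: ToList() runs the query now and copies the results.
        List<bool> snapshot = deferred.ToList();

        // A later change to the underlying data...
        isActiveFlags.Add(true);

        // ...is seen by the deferred query (it re-runs) but not by the snapshot.
        return (deferred.Count(), snapshot.Count);
    }
}
```

Here the deferred query counts three active entries after the change while the snapshot still holds two, so a long-running foreach over the ToList() result works on data that was fully fetched up front.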