LINQ Search nvarchar(MAX) column extremely slow using .Contains()

LINQ Search nvarchar(MAX) column extremely slow using .Contains() - c#

I have a .net core API and I am trying to search 4.4 million records using .Contains(). This is obviously extremely slow - 26 seconds. I am just querying one column which is the name of the record. How is this problem generally solved when dealing with millions of records?
I have never worked with millions of records before so apart from the obvious altering of the .Select and .Take, I haven't tried anything too drastic. I have spent many hours on this though.
The other filters included in the .Where are only used when a user chooses to use them on the front end - The real problem is just searching by CompanyName.
Note; I am using .ToArray() when returning the results.
I have indexes in the database but cannot add one for CompanyName as it is Nvarchar(MAX).
I have also looked at the execution plan and it doesn't really show anything out of the ordinary.
query = _context.Companies.Where(
c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper())
&& c.CompanyNumber.StartsWith(
string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter)
? paging.SearchCriteria.companyNumberFilter.ToUpper()
: ""
)
&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter
&& c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter
)
.Select(x => new Company() {
CompanyName = x.CompanyName,
IncorporationDate = x.IncorporationDate,
CompanyNumber = x.CompanyNumber
}
)
.Take(10);
I expect the query to take around 1 / 2 seconds as when I execute a like query in ssms it take about 1 / 2 seconds.
Here is the code being submitted to DB:
Microsoft.EntityFrameworkCore.Database.Command: Information: Executing DbCommand [Parameters=[#__p_4='?' (DbType = Int32), #__ToUpper_0='?' (Size = 4000), #__p_1='?' (Size = 4000), #__paging_SearchCriteria_companyIncorperatedGreaterFilter_2='?' (DbType = DateTime2), #__paging_SearchCriteria_companyIncorperatedLessThanFilter_3='?' (DbType = DateTime2), #__p_5='?' (DbType = Int32)], CommandType='Text', CommandTimeout='30']
SELECT [t].[CompanyName], [t].[IncorporationDate], [t].[CompanyNumber]
FROM (
SELECT TOP(#__p_4) [c].[CompanyName], [c].[IncorporationDate], [c].[CompanyNumber], [c].[ID]
FROM [Companies] AS [c]
WHERE (((((#__ToUpper_0 = N'') AND #__ToUpper_0 IS NOT NULL) OR (CHARINDEX(#__ToUpper_0, [c].[CompanyName]) > 0)) AND (((#__p_1 = N'') AND #__p_1 IS NOT NULL) OR ([c].[CompanyNumber] IS NOT NULL AND (#__p_1 IS NOT NULL AND (([c].[CompanyNumber] LIKE [c].[CompanyNumber] + N'%') AND (((LEFT([c].[CompanyNumber], LEN(#__p_1)) = #__p_1) AND (LEFT([c].[CompanyNumber], LEN(#__p_1)) IS NOT NULL AND #__p_1 IS NOT NULL)) OR (LEFT([c].[CompanyNumber], LEN(#__p_1)) IS NULL AND #__p_1 IS NULL))))))) AND ([c].[IncorporationDate] > #__paging_SearchCriteria_companyIncorperatedGreaterFilter_2)) AND ([c].[IncorporationDate] < #__paging_SearchCriteria_companyIncorperatedLessThanFilter_3)
) AS [t]
ORDER BY [t].[IncorporationDate] DESC
OFFSET #__p_5 ROWS FETCH NEXT #__p_4 ROWS ONLY
SOLVED! With the help of both answers!
In the end as suggested, I tried full-text searching which was lightening fast but compromised accuracy of search results. In order to filter those results more accurately, I used .Contains on the query after applying the full-text search.
Here is the code that works. Hopefully this helps others.
//query = _context.Companies
//.Where(c => c.CompanyName.StartsWith(paging.SearchCriteria.companyNameFilter.ToUpper())
//&& c.CompanyNumber.StartsWith(string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter) ? paging.SearchCriteria.companyNumberFilter.ToUpper() : "")
//&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter && c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter)
//.Select(x => new Company() { CompanyName = x.CompanyName, IncorporationDate = x.IncorporationDate, CompanyNumber = x.CompanyNumber }).Take(10);
query = _context.Companies.Where(c => EF.Functions.FreeText(c.CompanyName, paging.SearchCriteria.companyNameFilter.ToUpper()));
query = query.Where(x => x.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper()));
(I temporarily excluded the other filters for simplicity)

When you run the query in SSMS, it's probably cached for subsequent calls. The original query probably took similar time as the EF query. That said, there are disadvantages to parametrised queries - while you can better reuse execution plans in a parametrised query, this also means that the execution plan isn't necessarily the best for the actual query you're trying to run right now.
For example, if you specify a CompanyNumber (which is easy to find in an index due to the StartsWith), you can filter the data first by CompanyNumber, thus making the name search trivial (I assume CompanyNumber is unique, so either you get 0 records, or you get the one you get by CompanyNumber). This might not be possible for the parametrised query, if its execution plan was optimized for looking up by name.
But in the end, Contains is a performance killer. It needs to read every single byte of data in your table's CompanyName field; which usually means it has to read every single row, and process much of its data. Searching by a substring looks deceptively simple, but always carries heavy penalties - its complexity is linear with respect to data size.
One option is to find a way to avoid the Contains. Users often ask for features they don't actually need. StartsWith might work just as well for most of the cases. But that's a business decision, of course.
Another option would be finding a way to reduce the query as much as possible before you apply the Contains filter - if you only allow searching for company name with other filters that narrow the search down, you can save the DB server a lot of work. This may be tricky, and can sometimes collide with the execution plan collission issue - you might want to add some way to avoid having the same execution plan for two queries that are wildly different; an easy way in EF would be to build the query up dynamically, rather than trying for one expression:
var query = _context.Companies;
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNameFilter))
query = query.Where(c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter));
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter))
query = query.Where(c => c.CompanyNumber.StartsWith(paging.SearchCriteria.companyNumberFilter));
// etc. for the rest of the query
This means that you actually have multiple parametrised queries that can each have their own execution plan, more in line with what the query actually does. For some extreme cases, it might also be worthwhile to completely prevent execution plan caching (this is often useful in reports).
The final option is using full-text search. You can find plenty of tutorials on how to make this work. This works essentially by splitting the unformatted string data to individual words or phrases, and indexing those. This means that a search for "hello world" doesn't necessarily return all the records that have "hello world" in the name, and it might also return records that have something else than "hello world" in the name. Think Google Search rather than Contains. This can often be a great method for human-written text, but it can be very confusing for the user who doesn't understand why you'd return search results that are completely different from what he was searching for. It also often doesn't work well if you need to do partial searches (e.g. searching for "Computer" might return "Computer, Inc.", but searching for "Comp" might return nothing).
The first option is likely the fastest, and closest to what the users would expect. It has the weakness that it can't search in the middle, though. The second option is the most correct, and might make your query substantially faster, especially in the most common cases with good statistics. The third option is probably about as fast as the first one, but can be tricky to setup properly, and can be confusing for your users. It does also provide you with more powerful ways to query the text data (e.g. using wildcards).

Welcome to stack overflow. It looks like you are suffering from at least one of these three problems in your code and your architecture.
First: indexing
You've mentioned that this cannot be indexed but there is support in SQL Server for full text indexing at the very least.
.Contains
This method isn't exactly suitable for the size of operation you're performing. If possible, perhaps as a last resort, consider moving to a parameterized query. For now, however, it looks like you want to keep your business logic in the .net code rather than spreading it into SQL and that's a worthy plan.
c.IncorporationDate
Date comparison can be a little costly in SQL Server. Once you're dealing with so many millions of rows you might get a lot of performance benefit from correctly partitioned tables and indexes.
Consider whether or not these rows can change at all. Something named IncoporationDate sounds like it definitely should not be changed. I suspect you may want to leverage that after reading the rest of these.

Related

Entity Framework Core count does not have optimal performance

I need to get the amount of records with a certain filter.
Theoretically this instruction:
_dbContext.People.Count (w => w.Type == 1);
It should generate SQL like:
Select count (*)
from People
Where Type = 1
However, the generated SQL is:
Select Id, Name, Type, DateCreated, DateLastUpdate, Address
from People
Where Type = 1
The query being generated takes much longer to run in a database with many records.
I need to generate the first query.
If I just do this:
_dbContext.People.Count ();
Entity Framework generates the following query:
Select count (*)
from People
.. which runs very fast.
How to generate this second query passing search criteria to the count?

There is not much to answer here. If your ORM tool does not produce the expected SQL query from a simple LINQ query, there is no way you can let it do that by rewriting the query (and you shouldn't be doing that at the first place).
EF Core has a concept of mixed client/database evaluation in LINQ queries which allows them to release EF Core versions with incomplete/very inefficient query processing like in your case.
Excerpt from Features not in EF Core (note the word not) and Roadmap:
Improved translation to enable more queries to successfully execute, with more logic being evaluated in the database (rather than in-memory).
Shortly, they are planning to improve the query processing, but we don't know when will that happen and what level of degree (remember the mixed mode allows them to consider query "working").
So what are the options?
First, stay away from EF Core until it becomes really useful. Go back to EF6, it's has no such issues.
If you can't use EF6, then stay updated with the latest EF Core version.
For instance, in both v1.0.1 and v1.1.0 you query generates the intended SQL (tested), so you can simply upgrade and the concrete issue will be gone.
But note that along with improvements the new releases introduce bugs/regressions (as you can see here EFCore returning too many columns for a simple LEFT OUTER join for instance), so do that on your own risk (and consider the first option again, i.e. Which One Is Right for You :)

Try to use this lambda expression for execute query faster.
_dbContext.People.select(x=> x.id).Count();

Try this
(from x in _dbContext.People where x.Type == 1 select x).Count();
or you could do the async version of it like:
await (from x in _dbContext.People where x.Type == 1 select x).CountAsync();
and if those don't work out for you, then you could at least make the query more efficient by doing:
(from x in _dbContext.People where x.Type == 1 select x.Id).Count();
or
await (from x in _dbContext.People where x.Type == 1 select x.Id).CountAsync();

If you want to optimize performance and the current EF provider is not not (yet) capable of producing the desired query, you can always rely on raw SQL.
Obviously, this is a trade-off as you are using EF to avoid writing SQL directly, but using raw SQL can be useful if the query you want to perform can't be expressed using LINQ, or if using a LINQ query is resulting in inefficient SQL being sent to the database.
A sample raw SQL query would look like this:
var results = _context.People.FromSql("SELECT Id, Name, Type, " +
"FROM People " +
"WHERE Type = #p0",
1);
As far as I know, raw SQL queries passed to the FromSql extension method currently require that you return a model type, i.e. returning a scalar result may not yet be supported.
You can however always go back to plain ADO.NET queries:
using (var connection = _context.Database.GetDbConnection())
{
connection.Open();
using (var command = connection.CreateCommand())
{
command.CommandText = "SELECT COUNT(*) FROM People WHERE Type = 1";
var result = command.ExecuteScalar().ToString();
}
}

It seems that there has been some problem with one of the early releases of Entity Framework Core. Unfortunately you have not specified exact version so I am not able to dig into EF source code to tell what exactly has gone wrong.
To test this scenario, I have installed the latest EF Core package and managed to get correct result.
Here is my test program:
And here is SQL what gets generated captured by SQL Server Profiler:
As you can see it matches all the expectations.
Here is the excerpt from packages.config file:
...
<package id="Microsoft.EntityFrameworkCore" version="1.1.0" targetFramework="net452" />
...
So, in your situation the only solution is to update to the latest package which is 1.1.0 at the time of writing this.

Does this get what you want:
_dbContext.People.Where(w => w.Type == 1).Count();

I am using EFCore 1.1 here.
This can occur if EFCore cannot translate the entire Where clause to SQL. This can be something as simple as DateTime.Now that might not even think about.
The following statement results in a SQL query that will surprisingly run a SELECT * and then C# .Count() once it has loaded the entire table!
int sentCount = ctx.ScheduledEmail.Where(x => x.template == template &&
x.SendConfirmedDate > DateTime.Now.AddDays(-7)).Count();
But this query will run an SQL SELECT COUNT(*) as you would expect / hope for:
DateTime earliestDate = DateTime.Now.AddDays(-7);
int sentCount = ctx.ScheduledEmail.Where(x => x.template == template
&& x.SendConfirmedDate > earliestDate).Count();
Crazy but true. Fortunately this also works:
DateTime now = DateTime.Now;
int sentCount = ctx.ScheduledEmail.Where(x => x.template == template &&
x.SendConfirmedDate > now.AddDays(-7)).Count();

sorry for the bump, but...
probably the reason the query with the where clause is slow is because you didnt provide your database a fast way to execute it.
in case of the select count(*) from People query we dont need to know the actual data for each field and we can just use a small index that doesnt have all these fields in them so we havent got to spend our slow I/O on. The database software would be clever enough to see that the primary key index requires the least I/O to do the count on. The pk id's require less space than the full row so you get more back to count per I/O block so you can complete faster.
Now in the case of the query with the Type it needs to read the Type to determine it's value. You should create an index on Type if you want your query to be fast or else it will have to do a very slow full table scan, reading all rows. It helps when your values are more discriminating. A column Gender (usually) only has two values and isnt very discriminating, a primary key column where every value is unique is highly dscriminating. Higher discriminating values will result in a shorter index range scan and a faster result to the count.

What I used to count rows using a search query was
_dbContext.People.Where(w => w.Type == 1).Count();
This can also be achieved by
List<People> people = new List<People>();
people = _dbContext.People.Where(w => w.Type == 1);
int count = people.Count();
This way you will get the people list too if you need it further.

Performance issues to iterate results with C# SQLite DataReader and attached database

I am using System.Data.SQLite and SQLiteDataReader in my C# project. I am facing performance issues when getting the results of a query with attached databases.
Here is an example of a query to search text into two databases :
ATTACH "db2.db" as db2;
SELECT MainRecord.RecordID,
((LENGTH(MainRecord.Value) - LENGTH(REPLACE(UPPER(MainRecord.Value), UPPER("FirstValueToSearch"), ""))) / 18) AS "FirstResultNumber",
((LENGTH(DB2Record.Value) - LENGTH(REPLACE(UPPER(DB2Record.Value), UPPER("SecondValueToSearch"), ""))) / 19) AS "SecondResultNumber"
FROM main.Record MainRecord
JOIN db2.Record DB2Record ON DB2Record.RecordID BETWEEN (MainRecord.PositionMin) AND (MainRecord.PositionMax)
WHERE FirstResultNumber > 0 AND SecondResultNumber > 0;
DETACH db2;
When executing this query with SQLiteStudio or SQLiteAdmin, this works fine, I am getting the results in a few seconds (the Record table can contain hundreds of thousands of records, the query returns 36000 records).
When executing this query in my C# project, the execution takes a few seconds too, but it takes hours to run through all the results.
Here is my code :
// Attach databases
SQLiteDataReader data = null;
using (SQLiteCommand command = this.m_connection.CreateCommand())
{
command.CommandText = "SELECT...";
data = command.ExecuteReader();
}
if (data.HasRows)
{
while (data.Read())
{
// Do nothing, just iterate all results
}
}
data.Close();
// Detach databases
Calling the Read method of the SQLiteDataReader once can take more than 10 seconds ! I guess this is because the SQLiteDataReader is lazy loaded (and so it doesn't return the whole rowset before reading the results), am I right ?
EDIT 1 :
I don't know if this has something to do with lazy loading, like I said initially, but all I want is being able to get ALL the results as soon as the query is ended. Isn't it possible ? In my opinion, this is really strange that it takes hours to get results of a query executed in few seconds...
EDIT 2 :
I just added a COUNT(*) in my select query in order to see if I could get the total number of results at the first data.Read(), just to be sure that it was only the iteration of the results that was taking so long. And I was wrong : this new request executes in few seconds in SQLiteAdmin / SQLiteStudio, but takes hours to execute in my C# project. Any idea why the same query is so much longer to execute in my C# project?
EDIT 3 :
Thanks to EXPLAIN QUERY PLAN, I noticed that there was a slight difference in the execution plan for the same query between SQLiteAdmin / SQLiteStudio and my C# project. In the second case, it is using an AUTOMATIC PARTIAL COVERING INDEX on DB2Record instead of using the primary key index. Is there a way to ignore / disable the use of automatic partial covering indexes? I know it is used to speed up the queries, but in my case, it's rather the opposite that happens...
Thank you.

Besides finding matching records, it seems that you're also counting the number of times the strings matched. The result of this count is also used in the WHERE clause.
You want the number of matches, but the number of matches does not matter in the WHERE clause - you could try change the WHERE clause to:
WHERE MainRecord.Value LIKE '%FirstValueToSearch%' AND DB2Record.Value LIKE '%SecondValueToSearch%'
It might not result in any difference though - especially if there's no index on the Value columns - but worth a shot. Indexes on text columns require alot of space, so I wouldn't blindly recommend that.
If you haven't done so yet, place an index on the DB2's RecordID column.
You can use EXPLAIN QUERY PLAN SELECT ... to make SQLite spit out what it does to try to make your query perform, the output of that might help diagnose the problem.

Are you sure you use the same version of sqlite in System.Data.SQLite, SQLiteStudio and SQLiteAdmin ?
You can have huge differences.

One more typical reason why SQL query can take different amount of time when executed with ADO.NET and from native utility (like SQLiteAdmin) are command parameters used in CommandText (it is not clear from your code whether parameters are used or not). Depending on ADO.NET provider implementation the following identical CommandText values:
SELECT * FROM sometable WHERE somefield = ? // assume parameter is '2'
and
SELECT * FROM sometable WHERE somefield='2'
may lead to absolutely different execution plan and query performance.
Another suggestion: you may disable journal (specifying "Journal Mode=off;" in the connection string) and synchronous mode ("Synchronous=off;") as these options also may affect query performance in some cases.

Improving Linq query

I have the following query:
if (idUO > 0)
{
query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
var dependencyIds = dependencies.Select(d => d.Id).ToList();
query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, its OK, the query takes less than one 1 second, but when I try to access the page 30000, the query takes more than 20 seconds. There is a way in the linq query, that I can improve the performance of the query? Or only in the database level? And in the database level, for this kind of query, which is the best way to improve the performance?

There is no much space here, imo, to make things better (at least looking on the code provided).
When you're trying to achieve a good performance on such numbers, I would recommend do not use LINQ at all, or at list use it on the stuff with smaler data access.
What you can do here, is introduce paging of that data on DataBase level, with some stored procedure, and invoke it from your C# code.

1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.

I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about Indexes that you should add.. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)

I think you'll find that the bottleneck is occurring at the database. Here's why;
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.

LINQ Query incredibly slow - why?

I got a very simple LINQ query:
List<table> list = ( from t in ctx.table
where
t.test == someString
&& t.date >= dateStartInt
&& t.date <= dateEndInt
select t ).ToList<table>();
The table which gets queried has got about 30 million rows, but the columns test and date are indexed.
When it should return around 5000 rows it takes several minutes to complete.
I also checked the SQL command which LINQ generates.
If I run that command on the SQL Server it takes 2 seconds to complete.
What's the problem with LINQ here?
It's just a very simple query without any joins.
That's the query SQL Profiler shows:
exec sp_executesql N'SELECT [t0].[test]
FROM [dbo].[table] AS [t0]
WHERE ([t0].[test] IN (#p0)) AND ([t0].[date] >= #p1)
AND ([t0].[date] <= #p2)',
N'#p0 nvarchar(12),#p1 int,#p2 int',#p0=N'123test',#p1=110801,#p2=110804
EDIT:
It's really weird. While testing I noticed that it's much faster now. The LINQ query now takes 3 seconds for around 20000 rows, which is quite ok.
What's even more confusing:
It's the same behaviour on our production server. An hour ago it was really slow, now it's fast again. As I was testing on the development server, I didn't change anything on the production server. The only thing I can think of being a problem is that both servers are virtualized and share the SAN with lots of other servers.
How can I find out if that's the problem?

Before blaming LINQ first find out where the actual delay is taking place.
Sending the query
Execution of the query
Receiving the results
Transforming the results into local types
Binding/showing the result in the UI
And any other events tied to this process
Then start blaming LINQ ;)

If I had to guess, I would say "parameter sniffing" is likely, i.e. it has built and cached a query plan based on one set of parameters, which is very suboptimal for your current parameter values. You can tackle this with OPTION (OPTIMIZE FOR UNKNOWN) in regular TSQL, but there is no LINQ-to-SQL / EF way of expolsing this.
My plan would be:
use profiling to prove that the time is being lost in the query (as opposed to materialization etc)
once confirmed, consider using direct TSQL methods to invoke
For example, with LINQ-to-SQL, ctx.ExecuteQuery<YourType>(tsql, arg0, ...) can be used to throw raw TSQL at the server (with parameters as {0} etc, like string.Format). Personally, I'd lean towards "dapper" instead - very similar usage, but a faster materializer (but it doesn't support EntityRef<> etc for lazy-loading values - which is usually a bad thing anyway as it leads to N+1).
i.e. (with dapper)
List<table> list = ctx.Query<table>(#"
select * from table
where test == #someString
and date >= #dateStartInt
and date <= #dateEndInt
OPTION (OPTIMIZE FOR UNKNOWN)",
new {someString, dateStartInt, dateEndInt}).ToList();
or (LINQ-to-SQL):
List<table> list = ctx.ExecuteQuery<table>(#"
select * from table
where test == {0}
and date >= {1}
and date <= {2}
OPTION (OPTIMIZE FOR UNKNOWN)",
someString, dateStartInt, dateEndInt).ToList();

Linq2Entities Include with Skip / Take - load issue

Note: I know there are a number of questions around for issues with Linq's .Include(table) not loading data, I believe I have exhausted the options people have listed, and still had problems.
I have a large Linq2Entities query on an application I'm maintaining. The query is built up as such:
IQueryable<Results> query = context.MyTable
.Where(r =>
r.RelatedTable.ID == 2 &&
r.AnotherRelatedTable.ID == someId);
Then predicates are built up depending on various business logic, such as:
if (sortColumn.Contains("dob "))
{
if (orderByAscending)
query = query.OrderBy(p => p.RelatedTable.OrderByDescending(q => q.ID).FirstOrDefault().FieldName);
else
query = query.OrderByDescending(p => p.RelatedTable.OrderByDescending(q => q.ID).FirstOrDefault().FieldName);
}
Note - there is always a sort order provided.
Originally the included tables were set at the beginning, after reading articles such as the famous Tip 22, so now they are done at the end (which didn't fix the problem):
var resultsList = (query.Select(r => r) as ObjectQuery<Results>)
.Include("RelatedTable")
.Include("AnotherRelatedTable")
.Skip((page - 1) * rowsPerPage)
.Take(rowsPerPage);
Seemingly at random (approximately for every 5000 users of the site, this issue happens once) the RelatedTable data won't load. It can be brute forced by calling load on the related table. But even the failure to load isn't consistent, I've run the query in testing and it's worked, but most of the time hasn't, without changing any code or data.
It is fine, when the skip and take aren't included, and the whole dataset is returned, but I would expect the skip and take to be done on the complete dataset - it certainly appears to be from profiling the SQL...
UPDATE 16/11/10: I have profiled the SQL against a problem data set, and I've been able to reproduce the query failing about 9/10 times, but succeeding the rest. The SQL being executed is identical when the query fails or succeeds except, as expected, for the parameters passed to the SQL.
The issue has been solved with the following change, but the question remains as to why this should be.
Failing - get LINQ to handle the rows:
var resultsList = (query.Select(r => r) as ObjectQuery<Results>)
.Include("RelatedTable")
.Include("AnotherRelatedTable")
.Skip((page - 1) * rowsPerPage)
.Take(rowsPerPage)
.ToList();
Working - enumerate the data then get the rows:
var resultsList = (query.Select(r => r) as ObjectQuery<Results>)
.Include("RelatedTable")
.Include("AnotherRelatedTable")
.ToList()
.Skip((page - 1) * rowsPerPage)
.Take(rowsPerPage);
Unfortunately the SQL this query creates contains some sensitive schema data so I can't post it, it is also 1400 lines long, so I wouldn't subject the public to it anyway!

The sole effect of Take() is to change the generated SQL. Other than that, the Entity Framework does not care about it at all. Same for .Skip(). It's hard to believe that this would have an effect on query materialization (although stranger things have happened).
So what could be causing this behavior? Off the top of my head:
A bug in your application or mapping which is causing an incorrect query to be generated.
A bug in the Entity Framework which would cause returned data to be materialized into objects incorrectly in certain circumstances.
Bad data in your database.
A bug in your database's SQL parser.
I don't think you're going to get a lot further with this until you can capture the generated SQL and run it yourself. This is actually not terribly hard, as you can set up a SQL profiler with an appropriate filter. If you find that the generated SQL is different in the buggy case, you can work backwards from there. If you find that the generated SQL is identical in the buggy case, the next step would be to look at the rows returned, preferably in the same context as the application ran it.
In short, I think you just have to keep tweaking your SQL profiling until you have the information you need.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.