How can I speed up this query? Right now it's taking me around 2 minutes to pull back 210K records.
I turned off LazyLoading as well as set AsNoTracking on my tables.
I know it's a lot of data but surely it shouldn't take 2 minutes to retrieve the data?
context.Configuration.LazyLoadingEnabled = false;
List<MY_DATA> data = context
.MY_DATA.AsNoTracking()
.Include(x => x.MY_DATA_DETAILS)
.Where(x => startDate <= DbFunctions.TruncateTime(x.DB_DATE)
&& endDate >= DbFunctions.TruncateTime(x.DB_DATE)
&& x.MY_DATA_DETAILS.CODE.Trim().ToUpper() == myCode.Trim().ToUpper())
.ToList();
You can do without DbFunctions.TruncateTime() and probably also without these Trim().ToUpper() calls.
If you execute a function on a database column before it is filtered, the query optimizer cannot use any index on that column. This is known as being non-sargable. To execute the query in its present form, the database engine has to transform the data first and then scan all the transformed data to do the filtering. No index is involved.
DbFunctions.TruncateTime() is unnecessary here. Choose startDate and endDate so that they cover whole days, and compare against x.DB_DATE as it is.
Further, if x.MY_DATA_DETAILS.CODE is a varchar column (most text columns are), it will be auto-trimmed in searches: even if the database value contains trailing spaces, they will be ignored. So Trim isn't necessary. Next, most text columns by default have a case-insensitive database collation. You should check it. If this is SQL Server, look for collations like SQL_Latin1_General_CP1_CI_AS. The CI part means Case-Insensitive. If so, you can also do away with the ToUpper part. If not, you can either change the collation to a case-insensitive one, or perhaps you should conclude that the column is case sensitive for a reason, so it does matter whether you look for Abc or abc.
Either way, with these transforming functions removed from the database columns, the query should run considerably faster, provided that proper indexes are in place.
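As a rough sketch of what the filter could look like without the column-side functions: the entity classes below are hypothetical stand-ins for MY_DATA and MY_DATA_DETAILS, and the code assumes startDate and endDate are meant as whole days and that the column collation is case-insensitive (check yours before dropping ToUpper).

```csharp
using System;
using System.Linq;

// Hypothetical stand-ins for the entities in the question.
public class MyDataDetails
{
    public string Code { get; set; }
}

public class MyData
{
    public DateTime DbDate { get; set; }
    public MyDataDetails Details { get; set; }
}

public static class SargableFilter
{
    // Compares the raw column values, so an index on the date column stays usable.
    public static IQueryable<MyData> InRange(
        IQueryable<MyData> source, DateTime startDate, DateTime endDate, string code)
    {
        // Compute the bounds outside the lambda so they are sent as plain parameters.
        DateTime from = startDate.Date;
        DateTime toExclusive = endDate.Date.AddDays(1); // first instant after the last wanted day
        string wanted = code.Trim();                    // trimming the parameter is harmless

        return source.Where(x => x.DbDate >= from
                              && x.DbDate < toExclusive
                              && x.Details.Code == wanted);
    }
}
```

The half-open range (>= from, < toExclusive) keeps every time of day on the last date without truncating the column.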
Usually, defining indexes on the columns that you frequently use in the 'where' clause can improve performance when selecting rows from a large table.
I recommend that you create a stored procedure, move your query into it, do the performance tuning in the database, and call the stored procedure from your C# code.
In addition to what the other guy said about moving your query to a stored proc and creating the proper indexes, I'd say you'd be better off using SQL reporting rather than trying to import the data into your application and reporting from there. Especially 210K rows. SQL has internal optimizations that you'll never be able to come close to with stored procs and queries.
You can see this very easily:
1) Try writing a simple console app that pulls down that entire table and writes it to a CSV file -- it'll be extremely slow.
2) Try the data export in SQL Server Management Studio and export to a CSV -- it'll be done in a few seconds.
Do you need all the attributes of the object?
If not, you can project only the columns you need:
var data = context.MY_DATA
    .Where(x => startDate <= DbFunctions.TruncateTime(x.DB_DATE)
        && endDate >= DbFunctions.TruncateTime(x.DB_DATE)
        && x.MY_DATA_DETAILS.CODE.Trim().ToUpper() == myCode.Trim().ToUpper())
    // Include is dropped: it has no effect once the query projects.
    // Value stands in for whichever properties you actually need.
    .Select(x => new { x.Value })
    .ToList();
I have a .net core API and I am trying to search 4.4 million records using .Contains(). This is obviously extremely slow - 26 seconds. I am just querying one column which is the name of the record. How is this problem generally solved when dealing with millions of records?
I have never worked with millions of records before so apart from the obvious altering of the .Select and .Take, I haven't tried anything too drastic. I have spent many hours on this though.
The other filters included in the .Where are only used when a user chooses to use them on the front end - The real problem is just searching by CompanyName.
Note: I am using .ToArray() when returning the results.
I have indexes in the database but cannot add one for CompanyName as it is Nvarchar(MAX).
I have also looked at the execution plan and it doesn't really show anything out of the ordinary.
query = _context.Companies.Where(
c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper())
&& c.CompanyNumber.StartsWith(
string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter)
? paging.SearchCriteria.companyNumberFilter.ToUpper()
: ""
)
&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter
&& c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter
)
.Select(x => new Company() {
CompanyName = x.CompanyName,
IncorporationDate = x.IncorporationDate,
CompanyNumber = x.CompanyNumber
}
)
.Take(10);
I expect the query to take around 1 / 2 seconds, since a LIKE query executed in SSMS takes about 1 / 2 seconds.
Here is the code being submitted to DB:
Microsoft.EntityFrameworkCore.Database.Command: Information: Executing DbCommand [Parameters=[@__p_4='?' (DbType = Int32), @__ToUpper_0='?' (Size = 4000), @__p_1='?' (Size = 4000), @__paging_SearchCriteria_companyIncorperatedGreaterFilter_2='?' (DbType = DateTime2), @__paging_SearchCriteria_companyIncorperatedLessThanFilter_3='?' (DbType = DateTime2), @__p_5='?' (DbType = Int32)], CommandType='Text', CommandTimeout='30']
SELECT [t].[CompanyName], [t].[IncorporationDate], [t].[CompanyNumber]
FROM (
SELECT TOP(@__p_4) [c].[CompanyName], [c].[IncorporationDate], [c].[CompanyNumber], [c].[ID]
FROM [Companies] AS [c]
WHERE (((((@__ToUpper_0 = N'') AND @__ToUpper_0 IS NOT NULL) OR (CHARINDEX(@__ToUpper_0, [c].[CompanyName]) > 0)) AND (((@__p_1 = N'') AND @__p_1 IS NOT NULL) OR ([c].[CompanyNumber] IS NOT NULL AND (@__p_1 IS NOT NULL AND (([c].[CompanyNumber] LIKE [c].[CompanyNumber] + N'%') AND (((LEFT([c].[CompanyNumber], LEN(@__p_1)) = @__p_1) AND (LEFT([c].[CompanyNumber], LEN(@__p_1)) IS NOT NULL AND @__p_1 IS NOT NULL)) OR (LEFT([c].[CompanyNumber], LEN(@__p_1)) IS NULL AND @__p_1 IS NULL))))))) AND ([c].[IncorporationDate] > @__paging_SearchCriteria_companyIncorperatedGreaterFilter_2)) AND ([c].[IncorporationDate] < @__paging_SearchCriteria_companyIncorperatedLessThanFilter_3)
) AS [t]
ORDER BY [t].[IncorporationDate] DESC
OFFSET @__p_5 ROWS FETCH NEXT @__p_4 ROWS ONLY
SOLVED! With the help of both answers!
In the end, as suggested, I tried full-text searching, which was lightning fast but compromised the accuracy of the search results. To filter those results more accurately, I applied .Contains to the query after the full-text search.
Here is the code that works. Hopefully this helps others.
//query = _context.Companies
//.Where(c => c.CompanyName.StartsWith(paging.SearchCriteria.companyNameFilter.ToUpper())
//&& c.CompanyNumber.StartsWith(string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter) ? paging.SearchCriteria.companyNumberFilter.ToUpper() : "")
//&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter && c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter)
//.Select(x => new Company() { CompanyName = x.CompanyName, IncorporationDate = x.IncorporationDate, CompanyNumber = x.CompanyNumber }).Take(10);
query = _context.Companies.Where(c => EF.Functions.FreeText(c.CompanyName, paging.SearchCriteria.companyNameFilter.ToUpper()));
query = query.Where(x => x.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper()));
(I temporarily excluded the other filters for simplicity)
When you run the query in SSMS, it's probably cached for subsequent calls. The original query probably took similar time as the EF query. That said, there are disadvantages to parametrised queries - while you can better reuse execution plans in a parametrised query, this also means that the execution plan isn't necessarily the best for the actual query you're trying to run right now.
For example, if you specify a CompanyNumber (which is easy to find in an index due to the StartsWith), you can filter the data first by CompanyNumber, thus making the name search trivial (I assume CompanyNumber is unique, so either you get 0 records, or you get the one you get by CompanyNumber). This might not be possible for the parametrised query, if its execution plan was optimized for looking up by name.
But in the end, Contains is a performance killer. It needs to read every single byte of data in your table's CompanyName field; which usually means it has to read every single row, and process much of its data. Searching by a substring looks deceptively simple, but always carries heavy penalties - its complexity is linear with respect to data size.
One option is to find a way to avoid the Contains. Users often ask for features they don't actually need. StartsWith might work just as well for most of the cases. But that's a business decision, of course.
Another option would be finding a way to reduce the query as much as possible before you apply the Contains filter - if you only allow searching for company name with other filters that narrow the search down, you can save the DB server a lot of work. This may be tricky, and can sometimes collide with the execution plan issue described above - you might want to add some way to avoid having the same execution plan for two queries that are wildly different; an easy way in EF would be to build the query up dynamically, rather than trying for one expression:
// Declared as IQueryable so the filtered query can be assigned back to it.
IQueryable<Company> query = _context.Companies;
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNameFilter))
query = query.Where(c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter));
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter))
query = query.Where(c => c.CompanyNumber.StartsWith(paging.SearchCriteria.companyNumberFilter));
// etc. for the rest of the query
This means that you actually have multiple parametrised queries that can each have their own execution plan, more in line with what the query actually does. For some extreme cases, it might also be worthwhile to completely prevent execution plan caching (this is often useful in reports).
The final option is using full-text search. You can find plenty of tutorials on how to make this work. This works essentially by splitting the unformatted string data to individual words or phrases, and indexing those. This means that a search for "hello world" doesn't necessarily return all the records that have "hello world" in the name, and it might also return records that have something else than "hello world" in the name. Think Google Search rather than Contains. This can often be a great method for human-written text, but it can be very confusing for the user who doesn't understand why you'd return search results that are completely different from what he was searching for. It also often doesn't work well if you need to do partial searches (e.g. searching for "Computer" might return "Computer, Inc.", but searching for "Comp" might return nothing).
The first option is likely the fastest, and closest to what the users would expect. It has the weakness that it can't search in the middle, though. The second option is the most correct, and might make your query substantially faster, especially in the most common cases with good statistics. The third option is probably about as fast as the first one, but can be tricky to set up properly, and can be confusing for your users. It does also provide you with more powerful ways to query the text data (e.g. using wildcards).
Welcome to Stack Overflow. It looks like you are suffering from at least one of these three problems in your code and your architecture.
First: indexing
You've mentioned that this cannot be indexed but there is support in SQL Server for full text indexing at the very least.
.Contains
This method isn't exactly suitable for the size of operation you're performing. If possible, perhaps as a last resort, consider moving to a parameterized query. For now, however, it looks like you want to keep your business logic in the .net code rather than spreading it into SQL and that's a worthy plan.
c.IncorporationDate
Date comparison can be a little costly in SQL Server. Once you're dealing with so many millions of rows you might get a lot of performance benefit from correctly partitioned tables and indexes.
Consider whether or not these rows can change at all. Something named IncorporationDate sounds like it definitely should not be changed. I suspect you may want to leverage that after reading the rest of these.
I'm coding an application with Entity Framework in which I rely heavily on user defined functions.
I have a question about the best (most optimized) way to limit and page my result sets. Basically, I am wondering whether these two options are the same or whether one is preferred performance-wise.
Option 1.
//C#
var result1 = _DB.fn_GetData().OrderBy(x => x.Id).Skip(page * 100).Take(100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
Option 2.
//C#
var result2 = _DB.fn_GetData(page = 0, size = 100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
ORDER BY Id
OFFSET (size * page) ROWS FETCH NEXT size ROWS ONLY
To me these seem to be producing about the same result, but maybe I am missing some key aspect.
You'll have to be aware when your LINQ statement is AsEnumerable and when it is AsQueryable. As long as your statement is an IQueryable<...> the software will try to translate it into SQL and let your database do the query. Once it really has lost the IQueryable, and has become an implementation of an IEnumerable, the data has been brought to local memory, and all further LINQ statements will be performed by your process, not by the database.
If you use your debugger, you will see that fn_GetData returns an IEnumerable. This means that the result of fn_GetData is brought to local memory and your OrderBy etc. is performed by your process.
Usually it is much more efficient to move only the records that you will use to local memory. Besides, do not fetch the complete records, but only the properties that you plan to use. So in this case I guess you'll have to create an extended version of fn_GetData that returns only the values you plan to use.
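To illustrate the difference, here is a minimal paging helper. It is only a sketch: it assumes the data source can be exposed as an IQueryable<T> (for example a DbSet or a composable table-valued function), which the question does not confirm for fn_GetData, and the names Paging and GetPage are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class Paging
{
    // While source is an IQueryable backed by a database provider, OrderBy, Skip and
    // Take are part of the expression tree and are translated to SQL; nothing is
    // fetched until ToList() runs.
    public static List<T> GetPage<T, TKey>(
        IQueryable<T> source, Expression<Func<T, TKey>> orderBy, int page, int size)
    {
        return source
            .OrderBy(orderBy)     // paging needs a stable order
            .Skip(page * size)    // page is zero-based
            .Take(size)
            .ToList();            // the query executes here
    }
}
```

If the same calls are made on an IEnumerable<T>, they compile just as well, but every row is pulled into memory first and the paging happens in your process.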
I suggest the second option, because SQL Server can do this faster than C# methods.
With the first option, you fetch every record in the table and loop through them. With the second, SQL Server does the work and you get back only what you asked for.
You should apply limits and where clauses in the database as far as possible (how well that works depends on the table's indexes). Take the first example:
var result1 = _DB.fn_GetData().OrderBy(x => x.Id).Skip(page * 100).Take(100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
The whole table is pulled from the database into memory, which kills performance and reliability. I strongly advise against it. Filter and limit the records in the database instead; the second option is the better approach here.
I am fetching a list of products including their prices. I want to get just the enabled prices.
I wrote two types of query:
context.Products.Include("Prices").Where(p=>p.Prices.Where(pr=>pr.Enable==true).Count()>0).ToList();
And the other one is:
context.Products.Include("Prices").ToList().RemoveAll(p => p.Prices.Where(pr => pr.Enable == true).ToList().Count == 0);
Which one is more optimized?
Assuming you are using an EntityFramework context, the first one is way better.
This is because LINQ to Entities will translate the statement into an SQL statement. The Where clauses will result in a corresponding SQL WHERE, so only the necessary subset of the elements is retrieved.
The second statement retrieves all Products and Prices and then removes the unwanted elements.
This assumes that you have a remote database. If your database is running locally, or you already have all Products and Prices in memory, it's not so easy to tell (you would have to use the profiler for that).
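As a side note on the first query, Where(...).Count() > 0 can be written with Any, which states the intent directly and which EF providers usually translate to an EXISTS subquery. A small sketch, with Product and Price as hypothetical versions of the classes in the question:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for the entities in the question.
public class Price
{
    public bool Enable { get; set; }
}

public class Product
{
    public string Name { get; set; }
    public List<Price> Prices { get; set; } = new List<Price>();
}

public static class ProductQueries
{
    // Keeps only the products that have at least one enabled price.
    // The filter stays on the IQueryable, so it runs in the database.
    public static IQueryable<Product> WithEnabledPrice(IQueryable<Product> products)
    {
        return products.Where(p => p.Prices.Any(pr => pr.Enable));
    }
}
```

Note that this filters products only; an Include("Prices") on the result would still load every price of each returned product, enabled or not.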
This kind of question really depends on a lot of things, so it is not so easy to say which is better.
But from the code, the first one applies the where clause on the SQL side, whereas the second one pulls all the data out of SQL and does the filtering in the application.
So it will depend on the SQL server, the application hardware and the amount of data.
I'm querying my sql database which is in Azure (actually my web app is on Azure as well).
Every time I perform this particular query, there are ever changing errors (e.g. sometimes timeout occurs, sometimes it works perfectly, sometimes it takes extremely long to load).
I have noted that I am using the ToList method here to enumerate the query, and I suspect that's why it is degrading.
Is there any way I can fix this or make it better, or should I just use native SQL to execute my query?
I should also note that in my web.config the database connection timeout is set to 30 seconds. Would this have any performance benefit?
I'm putting the suspect code here:
case null:
lstQueryEvents = db.vwTimelines.Where(s => s.UserID == UserId)
.Where(s => s.blnHide == false)
.Where(s => s.strEmailAddress.Contains(strSearch) || s.strDisplayName.Contains(strSearch) || s.strSubject.Contains(strSearch))
.OrderByDescending(s => s.LatestEventTime)
.Take(intNumRecords)
.ToList();
break;
It's basically querying for the 50 records...I don't understand why it's timing out sometimes.
Here are some tips:
Make sure that your SQL data types match the types in your model
Judging by your code, types should be something like this:
UserID should be int (cannot tell for sure by looking at code);
blnHide should be bit;
strEmailAddress should be nvarchar;
strDisplayName should be nvarchar;
strSubject should be nvarchar;
Make use of indexes
You should create Non-Clustered Indexes on columns that you use to filter and order data.
In order of importance:
LatestEventTime as you order ALL data by this column;
UserID as you filter out most of data by this column;
blnHide as you filter out part of data by this column;
Make use of indexes for text lookup
You could make use of indexes for text lookup if you change your filter behaviour slightly and search text only in the start of column value.
To achieve that:
change .Contains() with .StartsWith() as it would allow index to be used.
create Non-Clustered Indexes on strEmailAddress column:
create Non-Clustered Indexes on strDisplayName column:
create Non-Clustered Indexes on strSubject column:
Try out free text search
Microsoft has only recently introduced full text search in Azure SQL. You can use that to find rows matching by partial string. This is a bit complicated to achieve using EF, but it is certainly doable.
Here are some links to get you started:
Entity Framework, Code First and Full Text Search
https://azure.microsoft.com/en-us/blog/full-text-search-is-now-available-for-preview-in-azure-sql-database/
string.Contains(...) is converted to a WHERE ... LIKE ... SQL statement, which is very expensive. Try to rework your query to avoid it.
Plus, Azure SQL has its own limits on query run time (5 sec as far as I remember, but better check the SLA), so it would generally ignore your web.config settings if they are longer.
I have the following query:
if (idUO > 0)
{
query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
var dependencyIds = dependencies.Select(d => d.Id).ToList();
query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, it's OK: the query takes less than 1 second. But when I try to access page 30000, the query takes more than 20 seconds. Is there a way to improve the performance in the LINQ query itself, or only at the database level? And at the database level, what is the best way to improve performance for this kind of query?
There isn't much room here, IMO, to make things better (at least judging by the code provided).
When you're trying to get good performance at these volumes, I would recommend not using LINQ at all, or at least using it only where less data is accessed.
What you can do here is introduce paging of that data at the database level, with a stored procedure, and invoke it from your C# code.
1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.
I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about indexes that you should add. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)
I think you'll find that the bottleneck is occurring at the database. Here's why:
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.
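One alternative that none of the answers above mention is keyset (seek) paging: instead of skipping N rows, remember the last row of the previous page and ask for the rows after it. The sketch below is an illustration under assumptions, not the asker's code: Ticket is reduced to two hypothetical properties, and (Date, Id) is assumed to be unique and ideally covered by an index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down version of the Ticket entity.
public class Ticket
{
    public int Id { get; set; }
    public DateTime Date { get; set; }
}

public static class KeysetPaging
{
    // Returns the page that follows the row identified by (lastDate, lastId).
    // Pass nulls to get the first page.
    public static List<Ticket> NextPage(
        IQueryable<Ticket> query, DateTime? lastDate, int? lastId, int size)
    {
        if (lastDate.HasValue && lastId.HasValue)
        {
            // Copy to locals so the provider sees plain parameters.
            DateTime d = lastDate.Value;
            int id = lastId.Value;
            // Seek past the last row seen; Id breaks ties between equal dates.
            query = query.Where(t => t.Date > d || (t.Date == d && t.Id > id));
        }

        return query
            .OrderBy(t => t.Date)
            .ThenBy(t => t.Id)
            .Take(size)
            .ToList();
    }
}
```

The trade-off is that this only supports "next page" navigation; jumping straight to page 30000 still needs Skip or a stored procedure.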