How to optimize this Linq query - c#

I am trying to optimize this query. It is not slow for now but I am expecting a spike to hit this query in the near future. Is there anything else i can do to make this query faster?
var posts = from p in context.post
where p.post_isdeleted == false && p.post_parentid == null
select new
{
p.post_date,
p.post_id,
p.post_titleslug,
p.post_title,
p.post_descriptionrender,
p.userinfo.user_username,
p.userinfo.user_userid,
p.userinfo.user_GravatarHash,
p.userinfo.user_points,
p.category.catid,
p.category.name,
p.post_answercount,
p.post_hasbestanswer,
p.post_hits,
p.post_isanonymous,
p.post_votecount,
FavoriteCount = context.favorites.Where(x => x.post.post_id == p.post_id).Count(),
tags = from tg in context.posttag
where tg.posttag_postid == p.post_id
select new
{
tg.tag.tag_id,
tg.tag.tag_title
}
};

In a general sense you could look into caching that information, but nothing is intrinsically "slow" about the query. This will really depend on how the query is being used (how often, what data is being hit, etc). There are a lot of possible optimization solutions for a given problem, and though you may find improvements based on intuition you'll have a much easier time doing so if you have profiling tools in place to nail down problem areas. Plus, you'll have the satisfaction of proving that the areas you improve are worth the time investment.

A possible optimization for this query whould be, only to load the Ids of other related objects (UserInfo, Category, Tag), and initialize this objects only on demand using a lazy loading strategy. Or executing an other query to resolve these objects.
But it depends on, how you use the result of the query. Maybe you need all the information of the related objects, or you need only some information of these objects or the Id of the objects is enough, cause you need the Ids for other queries.

I could make the LINQ cleaner (by using associations instead of psuedo-joins), but it wouldn't make it any faster. To make it faster you probably need to look at DB indexing.

Related

LINQ Search nvarchar(MAX) column extremely slow using .Contains()

I have a .net core API and I am trying to search 4.4 million records using .Contains(). This is obviously extremely slow - 26 seconds. I am just querying one column which is the name of the record. How is this problem generally solved when dealing with millions of records?
I have never worked with millions of records before so apart from the obvious altering of the .Select and .Take, I haven't tried anything too drastic. I have spent many hours on this though.
The other filters included in the .Where are only used when a user chooses to use them on the front end - The real problem is just searching by CompanyName.
Note; I am using .ToArray() when returning the results.
I have indexes in the database but cannot add one for CompanyName as it is Nvarchar(MAX).
I have also looked at the execution plan and it doesn't really show anything out of the ordinary.
query = _context.Companies.Where(
c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper())
&& c.CompanyNumber.StartsWith(
string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter)
? paging.SearchCriteria.companyNumberFilter.ToUpper()
: ""
)
&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter
&& c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter
)
.Select(x => new Company() {
CompanyName = x.CompanyName,
IncorporationDate = x.IncorporationDate,
CompanyNumber = x.CompanyNumber
}
)
.Take(10);
I expect the query to take around 1 / 2 seconds as when I execute a like query in ssms it take about 1 / 2 seconds.
Here is the code being submitted to DB:
Microsoft.EntityFrameworkCore.Database.Command: Information: Executing DbCommand [Parameters=[#__p_4='?' (DbType = Int32), #__ToUpper_0='?' (Size = 4000), #__p_1='?' (Size = 4000), #__paging_SearchCriteria_companyIncorperatedGreaterFilter_2='?' (DbType = DateTime2), #__paging_SearchCriteria_companyIncorperatedLessThanFilter_3='?' (DbType = DateTime2), #__p_5='?' (DbType = Int32)], CommandType='Text', CommandTimeout='30']
SELECT [t].[CompanyName], [t].[IncorporationDate], [t].[CompanyNumber]
FROM (
SELECT TOP(#__p_4) [c].[CompanyName], [c].[IncorporationDate], [c].[CompanyNumber], [c].[ID]
FROM [Companies] AS [c]
WHERE (((((#__ToUpper_0 = N'') AND #__ToUpper_0 IS NOT NULL) OR (CHARINDEX(#__ToUpper_0, [c].[CompanyName]) > 0)) AND (((#__p_1 = N'') AND #__p_1 IS NOT NULL) OR ([c].[CompanyNumber] IS NOT NULL AND (#__p_1 IS NOT NULL AND (([c].[CompanyNumber] LIKE [c].[CompanyNumber] + N'%') AND (((LEFT([c].[CompanyNumber], LEN(#__p_1)) = #__p_1) AND (LEFT([c].[CompanyNumber], LEN(#__p_1)) IS NOT NULL AND #__p_1 IS NOT NULL)) OR (LEFT([c].[CompanyNumber], LEN(#__p_1)) IS NULL AND #__p_1 IS NULL))))))) AND ([c].[IncorporationDate] > #__paging_SearchCriteria_companyIncorperatedGreaterFilter_2)) AND ([c].[IncorporationDate] < #__paging_SearchCriteria_companyIncorperatedLessThanFilter_3)
) AS [t]
ORDER BY [t].[IncorporationDate] DESC
OFFSET #__p_5 ROWS FETCH NEXT #__p_4 ROWS ONLY
SOLVED! With the help of both answers!
In the end as suggested, I tried full-text searching which was lightening fast but compromised accuracy of search results. In order to filter those results more accurately, I used .Contains on the query after applying the full-text search.
Here is the code that works. Hopefully this helps others.
//query = _context.Companies
//.Where(c => c.CompanyName.StartsWith(paging.SearchCriteria.companyNameFilter.ToUpper())
//&& c.CompanyNumber.StartsWith(string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter) ? paging.SearchCriteria.companyNumberFilter.ToUpper() : "")
//&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter && c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter)
//.Select(x => new Company() { CompanyName = x.CompanyName, IncorporationDate = x.IncorporationDate, CompanyNumber = x.CompanyNumber }).Take(10);
query = _context.Companies.Where(c => EF.Functions.FreeText(c.CompanyName, paging.SearchCriteria.companyNameFilter.ToUpper()));
query = query.Where(x => x.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper()));
(I temporarily excluded the other filters for simplicity)
When you run the query in SSMS, it's probably cached for subsequent calls. The original query probably took similar time as the EF query. That said, there are disadvantages to parametrised queries - while you can better reuse execution plans in a parametrised query, this also means that the execution plan isn't necessarily the best for the actual query you're trying to run right now.
For example, if you specify a CompanyNumber (which is easy to find in an index due to the StartsWith), you can filter the data first by CompanyNumber, thus making the name search trivial (I assume CompanyNumber is unique, so either you get 0 records, or you get the one you get by CompanyNumber). This might not be possible for the parametrised query, if its execution plan was optimized for looking up by name.
But in the end, Contains is a performance killer. It needs to read every single byte of data in your table's CompanyName field; which usually means it has to read every single row, and process much of its data. Searching by a substring looks deceptively simple, but always carries heavy penalties - its complexity is linear with respect to data size.
One option is to find a way to avoid the Contains. Users often ask for features they don't actually need. StartsWith might work just as well for most of the cases. But that's a business decision, of course.
Another option would be finding a way to reduce the query as much as possible before you apply the Contains filter - if you only allow searching for company name with other filters that narrow the search down, you can save the DB server a lot of work. This may be tricky, and can sometimes collide with the execution plan collission issue - you might want to add some way to avoid having the same execution plan for two queries that are wildly different; an easy way in EF would be to build the query up dynamically, rather than trying for one expression:
var query = _context.Companies;
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNameFilter))
query = query.Where(c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter));
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter))
query = query.Where(c => c.CompanyNumber.StartsWith(paging.SearchCriteria.companyNumberFilter));
// etc. for the rest of the query
This means that you actually have multiple parametrised queries that can each have their own execution plan, more in line with what the query actually does. For some extreme cases, it might also be worthwhile to completely prevent execution plan caching (this is often useful in reports).
The final option is using full-text search. You can find plenty of tutorials on how to make this work. This works essentially by splitting the unformatted string data to individual words or phrases, and indexing those. This means that a search for "hello world" doesn't necessarily return all the records that have "hello world" in the name, and it might also return records that have something else than "hello world" in the name. Think Google Search rather than Contains. This can often be a great method for human-written text, but it can be very confusing for the user who doesn't understand why you'd return search results that are completely different from what he was searching for. It also often doesn't work well if you need to do partial searches (e.g. searching for "Computer" might return "Computer, Inc.", but searching for "Comp" might return nothing).
The first option is likely the fastest, and closest to what the users would expect. It has the weakness that it can't search in the middle, though. The second option is the most correct, and might make your query substantially faster, especially in the most common cases with good statistics. The third option is probably about as fast as the first one, but can be tricky to setup properly, and can be confusing for your users. It does also provide you with more powerful ways to query the text data (e.g. using wildcards).
Welcome to stack overflow. It looks like you are suffering from at least one of these three problems in your code and your architecture.
First: indexing
You've mentioned that this cannot be indexed but there is support in SQL Server for full text indexing at the very least.
.Contains
This method isn't exactly suitable for the size of operation you're performing. If possible, perhaps as a last resort, consider moving to a parameterized query. For now, however, it looks like you want to keep your business logic in the .net code rather than spreading it into SQL and that's a worthy plan.
c.IncorporationDate
Date comparison can be a little costly in SQL Server. Once you're dealing with so many millions of rows you might get a lot of performance benefit from correctly partitioned tables and indexes.
Consider whether or not these rows can change at all. Something named IncoporationDate sounds like it definitely should not be changed. I suspect you may want to leverage that after reading the rest of these.

Entity Framework Complex query with inner joins

I want to execute a complex query using Entity Framework (Code First).
I know it can be done by LINQ, but what is the best way to write LINQ query for complex data and inner joining?
I want to execute the following query:
var v = from m in WebAppDbContext.UserSessionTokens
from c in WebAppDbContext.Companies.Include(a => a.SecurityGroups)
from n in WebAppDbContext.SecurityGroups.Include(x => x.Members)
where m.TokenString == userTokenString &&
n.Members.Contains(m.User) &&
c.SecurityGroups.Contains(n)
select c;
Is this the best way to do this?
Does this query get any entire list of records from the db and then executes the filtration on this list? (might be huge list)
And most important: Does it query the database several times?
In my opinion and based on my own experiences, talking about performance, especially joining data sets, it's faster when you write it in SQL. But since you used code first approach then it's not an option. To answer your questions, your query will not query DB several times (you can try debugging and see Events log in VS). EF will transform your query into SQL statement and execute it.
TL:DR; don't micromanage the robots. Let them do their thing and 99% of the time you'll be fine. Linq doesn't really expose methods to micromanage the underlying data query, anyway, its whole purpose is to be an abstraction layer.
In my experience, the Linq provider is pretty smart about converting valid Linq syntax into "pretty good" SQL. Your looks like your query is all inner joins, and those should all be on the primary indexes/foreign keys of each table, so it's going to come up with something pretty good.
If that's not enough to soothe your worries, I'd suggest:
put on a SQL Trace to see the actual query that's being presented to the Database. I bet it's not as simple as Companies join SecurityGroups join Members join UserTokens, but it's probably equivalent to it.
Only worry about optimizing if this becomes an actual performance problem.
If it does become a performance problem, consider refactoring your problem space. It looks like you're trying to get a Company record from a UserToken, by going through three other tables. Is it possible to just relate Users and Companies directly, rather than through the security model? (please excuse my total ignorance of your problem space, I'm just saying "maybe look at the problem from another angle?")
In short, it sounds like you're burning brain cycles attempting to optimize something prematurely. Now, it's reasonable to think if there is a performance problem, this section of the code could be responsible, but that would only be noticeable if you're doing this query a lot. Based on coming from a security token, I'd guess you're doing this once per session to get the current user's contextual information. In that case, the problem isn't usually with Linq, but with your approach to solving the problem for finding Company information. Can you cache the result?

Does the order of EF linq query clauses influence performance?

Should I worry about the order of Entity Framework linq query clauses in terms of performance?
In the example below could changing the order of the two where clauses have an performance impact on the DB lookup?
using (var context = new ModelContext())
{
var fetchedImages = (from images in context.Images.Include("ImageSource")
where images.Type.Equals("jpg")
where images.ImageSource.Id == 5
select images).ToArray();
}
No, changing of these two where clauses will not affect performance.
Generated SQL will look like this anyway:
WHERE [condition1] AND [condition2]
Besides, you can write conditions, combined with logical operators:
where images.Type.Equals("jpg") && images.ImageSource.Id == 5
The general case is that the where operators will "short circuit", so if the first is false, the second won't even be examined, so if one will fail more frequently, generally you want to check it first.
However, this breaks down if the more frequent failure is slow to process.
The only way to tell for sure is by profiling your code using appropriate tools.

Improving Linq query

I have the following query:
if (idUO > 0)
{
query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
var dependencyIds = dependencies.Select(d => d.Id).ToList();
query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, its OK, the query takes less than one 1 second, but when I try to access the page 30000, the query takes more than 20 seconds. There is a way in the linq query, that I can improve the performance of the query? Or only in the database level? And in the database level, for this kind of query, which is the best way to improve the performance?
There is no much space here, imo, to make things better (at least looking on the code provided).
When you're trying to achieve a good performance on such numbers, I would recommend do not use LINQ at all, or at list use it on the stuff with smaler data access.
What you can do here, is introduce paging of that data on DataBase level, with some stored procedure, and invoke it from your C# code.
1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.
I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about Indexes that you should add.. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)
I think you'll find that the bottleneck is occurring at the database. Here's why;
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.

LINQ to Entities query takes long to compile, SQL runs fast

I'm working on a piece of code, written by a coworker, that interfaces with a CRM application our company uses. There are two LINQ to Entities queries in this piece of code that get executed many times in our application, and I've been asked to optimize them because one of them is really slow.
These are the queries:
First query, this one compiles pretty much instantly. It gets relation information from the CRM database, filtering by a list of relation IDs given by the application:
from relation in context.ADRELATION
where ((relationIds.Contains(relation.FIDADRELATION)) && (relation.FLDELETED != -1))
join addressTable in context.ADDRESS on relation.FIDADDRESS equals addressTable.FIDADDRESS
into temporaryAddressTable
from address in temporaryAddressTable.DefaultIfEmpty()
join mailAddressTable in context.ADDRESS on relation.FIDMAILADDRESS equals
mailAddressTable.FIDADDRESS into temporaryMailAddressTable
from mailAddress in temporaryMailAddressTable.DefaultIfEmpty()
select new { Relation = relation, Address = address, MailAddress = mailAddress };
The second query, which takes about 4-5 seconds to compile, and takes information about people from the database (again filtered by a list of IDs):
from role in context.ROLE
join relationTable in context.ADRELATION on role.FIDADRELATION equals relationTable.FIDADRELATION into temporaryRelationTable
from relation in temporaryRelationTable.DefaultIfEmpty()
join personTable in context.PERSON on role.FIDPERS equals personTable.FIDPERS into temporaryPersonTable
from person in temporaryPersonTable.DefaultIfEmpty()
join nationalityTable in context.TBNATION on person.FIDTBNATION equals nationalityTable.FIDTBNATION into temporaryNationalities
from nationality in temporaryNationalities.DefaultIfEmpty()
join titelTable in context.TBTITLE on person.FIDTBTITLE equals titelTable.FIDTBTITLE into temporaryTitles
from title in temporaryTitles.DefaultIfEmpty()
join suffixTable in context.TBSUFFIX on person.FIDTBSUFFIX equals suffixTable.FIDTBSUFFIX into temporarySuffixes
from suffix in temporarySuffixes.DefaultIfEmpty()
where ((rolIds.Contains(role.FIDROLE)) && (relation.FLDELETED != -1))
select new { Role = role, Person = person, relation = relation, Nationality = nationality, Title = title.FTXTBTITLE, Suffix = suffix.FTXTBSUFFIX };
I've set up the SQL Profiler and took the SQL from both queries, then ran it in SQL Server Management Studio. Both queries ran very fast, even with a large (~1000) number of IDs. So the problem seems to lie in the compilation of the LINQ query.
I have tried to use a compiled query, but since those can only contain primitive parameters, I had to strip out the part with the filter and apply that after the Invoke() call, so I'm not sure if that helps much. Also, since this code runs in a WCF service operation, I'm not sure if the compiled query will even still exist on subsequent calls.
Finally what I tried was to only select a single column in the second query. While this obviously won't give me the information I need, I figured it would be faster than the ~200 columns we're selecting now. No such case, it still took 4-5 seconds.
I'm not a LINQ guru at all, so I can barely follow this code (I have a feeling it's not written optimally, but can't put my finger on it). Could anyone give me a hint as to why this problem might be occurring?
The only solution I have left is to manually select all the information instead of joining all these tables. I'd then end up with about 5-6 queries. Not too bad I guess, but since I'm not dealing with horribly inefficient SQL here (or at least an acceptable level of inefficiency), I was hoping to prevent that.
Thanks in advance, hope I made things clear. If not, feel free to ask and I'll provide additional details.
Edit:
I ended up adding associations on my entity framework (the target database didn't have foreign keys specified) and rewriting the query thusly:
context.ROLE.Where(role => rolIds.Contains(role.FIDROLE) && role.Relation.FLDELETED != -1)
.Select(role => new
{
ContactId = role.FIDROLE,
Person = role.Person,
Nationality = role.Person.Nationality.FTXTBNATION,
Title = role.Person.Title.FTXTBTITLE,
Suffix = role.Person.Suffix.FTXTBSUFFIX
});
Seems a lot more readable and it's faster too.
Thanks for the suggestions, I will definitely keep the one about making multiple compiled queries for different numbers of arguments in mind!
Gabriels answer is correct: Use a compiled query.
It looks like you are compiling it again for every WCF request which of course defeats the purpose of one-time initialization. Instead, put the compiled query into a static field.
Edit:
Do this: Send maximum load to your service and pause the debugger 10 times. Look at the call stack. Did it stop more often in L2S code or in ADO.NET code? This will tell you if the problem is still with L2S or with SQL Server.
Next, let's fix the filter. We need to push it back into the compiled query. This is only possible by transforming this:
rolIds.Contains(role.FIDROLE)
to this:
role.FIDROLE == rolIds_0 || role.FIDROLE == rolIds_1 || ...
You need a new compiled query for every cardinality of rolIds. This is nasty, but it is necessary to get it to compile. In my project, I have automated this task but you can do a one-off solution here.
I guess most queries will have very few role-id's so you can materialize 10 compiled queries for cardinalities 1-10 and if the cardinality exceeds 10 you fall back to client-side filtering.
If you decide to keep the query inside the code, you could compile it. You still have to compile the query once when you run your app, but all subsequent call are gonna use that already compiled query. You can take a look at MSDN help here: http://msdn.microsoft.com/en-us/library/bb399335.aspx.
Another option would be to use a stored procedure and call the procedure from your code. Hence no compile time.

Categories

Resources