C# - LINQ Lambda expression using GroupBy - Why are nested validations so inefficient?

I had a bad day trying to improve the performance of the following query in C#, using Entity Framework (the data is stored in SQL Server and the model uses a Code First approach, but that does not matter here):
Bad-performing query:
var projectDetail = await _context
    .ProjectDetails
    .Where(pd => projectHeaderIds.Contains(pd.IdProjectHeader))
    .Include(pd => pd.Stage)
    .Include(pd => pd.ProjectTaskStatus)
    .GroupBy(g => new { g.IdProjectHeader, g.IdStage, g.Stage.StageName })
    .Select(pd => new
    {
        pd.Key.IdProjectHeader,
        pd.Key.IdStage,
        pd.Key.StageName,
        TotalTasks = pd.Count(),
        MissingCriticalActivity = pd.Count(t => t.CheckTask.CriticalActivity && t.ProjectTaskStatus.Score != 100) > 0,
        Score = Math.Round(pd.Average(a => a.ProjectTaskStatus.Score), 2),
        LastTaskCompleted = pd.Max(p => p.CompletionDate)
    }).ToListAsync();
After some hours, I figured out the problem and was able to fix the performance (instead of taking more than 4 minutes, the new query now takes only 1-2 seconds):
New query:
var groupTotalTasks = await _context
    .ProjectDetails
    .Where(pd => projectHeaderIds.Contains(pd.IdProjectHeader))
    .Select(r => new
    {
        r.IdProjectHeader,
        r.CompletionDate,
        r.IdStage,
        r.ProjectTaskStatus.Score,
        r.CheckTask.CriticalActivity,
        r.Stage.StageName
    })
    .GroupBy(g => new { g.IdProjectHeader, g.IdStage, g.StageName })
    .Select(pd => new
    {
        pd.Key.IdProjectHeader,
        pd.Key.IdStage,
        pd.Key.StageName,
        TotalTasks = pd.Count(),
        MissingCriticalActivity = pd.Count(r => r.CriticalActivity && r.Score != 100) > 0,
        Score = Math.Round(pd.Average(a => a.Score), 2),
        LastTaskCompleted = pd.Max(p => p.CompletionDate)
    }).ToListAsync();
The steps I took to improve the query were the following:
Avoid nested validations (like Score, which used MainQuery.ProjectTaskStatus.Score to calculate the average).
Avoid Include in the queries.
Use a Select to fetch only the information that the GroupBy will need afterwards.
Those changes fixed my issue, but why?
...and is there still another way to improve this query?
Specifically, why does the use of nested validations make the query extremely slow?
The other changes make more sense to me.

I recently read that whenever EF Core 2 ran into anything that it couldn't produce a SQL Query for, it would switch to in-memory evaluation. So the first query would basically be pulling all of your ProjectDetails out of the database, then doing all the grouping and such in your application's memory. That's probably the biggest issue you had.
Using .Include had a big impact in that case, because you were including a bunch of other data when you pulled out all those ProjectDetails. It probably has little to no impact now that you've avoided doing all that work in-memory.
They realized the error of their ways and changed the behavior to throw an exception in cases like that, starting with EF Core 3.
To avoid problems like this in the future, you can upgrade to EF Core 3, or just be really careful to ensure Entity Framework can translate everything in your query to SQL.
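If you do stay on EF Core 2.x for a while, you can also make accidental client evaluation fail fast instead of silently loading whole tables. A minimal sketch of the idea, assuming the context is configured in OnConfiguring (the context class name and connection string here are placeholders):
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Diagnostics;

public class ProjectContext : DbContext   // placeholder name for your DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder
            .UseSqlServer("Server=.;Database=Projects;Trusted_Connection=True")  // placeholder connection string
            // Turn the EF Core 2.x "query is being evaluated on the client" warning into an
            // exception, so a non-translatable query throws instead of quietly loading the table.
            .ConfigureWarnings(w => w.Throw(RelationalEventId.QueryClientEvaluationWarning));
    }
}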

Related

What is the fastest way to sort an EF-to-Linq query?

Using Entity Framework, in theory which is faster:
// (1) sort then select/project
// in db, for entire table
var results = someQuery
.OrderBy(q => q.FieldA)
.Select(q => new { q.FieldA, q.FieldB })
.ToDictionary(q => q.FieldA, q => q.FieldB);
or
// (2) select/project then sort
// in db, on a smaller data set
var results = someQuery
.Select(q => new { q.FieldA, q.FieldB })
.OrderBy(q => q.FieldA)
.ToDictionary(q => q.FieldA, q => q.FieldB);
or
// (3) select/project then materialize then sort
// in object space
var results = someQuery
.Select(q => new { q.FieldA, q.FieldB })
.ToDictionary(q => q.FieldA, q => q.FieldB)
.OrderBy(q => q.FieldA); // -> this won't compile, but you get the question
I'm no SQL expert, but it intuitively seems that 2 is faster than 1... is that correct? And how does that compare to 3, because in my experience with EF almost everything is faster when done on the db.
PS I have no perf tools in my environment, and not sure how to test this, hence the question.
Your query is only composed up front and actually executed at the moment you call ToDictionary, so 1 and 2 should be the same and produce the same query: you get a SELECT FieldA, FieldB FROM table ORDER BY FieldA in both cases.
The third is different: you first execute the SQL query (without the ORDER BY clause), then you sort the returned set in memory (the data is not sorted by the DB provider, but by the client). This might be faster or slower depending on the amount of data, the server's and client's hardware, how your database is designed (indexes, etc.), the network infrastructure, and so on.
There's no way to tell which one will be faster with the information you provided.
PS: this makes little sense anyway, as a Dictionary doesn't really care about order (and 3 won't compile as written, since after ToDictionary you are left with KeyValuePairs that don't have a FieldA property), but change ToDictionary to ToList and there's your performance answer.
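For completeness, here is a minimal sketch of option 3 rewritten so that it compiles, following the ToList suggestion above; the projection still runs in the database and only the sort happens on the client:
// (3) project in the database, materialize, then sort in object space
var results = someQuery
    .Select(q => new { q.FieldA, q.FieldB })
    .ToList()                  // the SQL query executes here, without an ORDER BY
    .OrderBy(q => q.FieldA)    // sorting is done by the client, in memory
    .ToList();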

Out of Memory Lambda Compile versus inline delegates

I'm using .NET 4.5.1 with an application that, on the server side, shuffles chart data for many simultaneous REST requests.
I use IQueryable to build queries. For example, I originally had the following:
var query = ctx.Respondents
.Join(
ctx.Respondents,
other => other.RespondentId,
res => res.RespondentId,
(other, res) => new ChartJoin { Respondent = res, Occasion = null, BrandVisited = null, BrandInfo = null, Party = null, Item = null }
)
. // bunch of other joins filling out the ChartJoin
.Where(x => x.Respondent.status == 1)
. // more Where clauses dynamically applied
.GroupBy(x => new CommonGroupBy { Year = (int)x.Respondent.currentVisitYear, Month = (int)x.Respondent.currentVisitMonth })
.OrderBy(x => x.Key.Year)
.ThenBy(x => x.Key.Month)
.Select(x => new AverageEaterCheque
{
Year = x.Key.Year,
Month = x.Key.Month,
AverageCheque = (double)(x.Sum(m => m.BrandVisited.DOLLAR_TOTAL) / x.Sum(m => m.BrandVisited.NUM_PAID)),
Base = x.Count(),
Days = x.Select(m => m.Respondent.visitDate).Distinct().Count()
});
To allow for dynamic grouping (via the client), the GroupBy was generated with C# expressions returning a Dictionary. The Select also had to be generated with expressions. The above Select became something like:
public static Expression<Func<IGrouping<IDictionary<string, object>, ChartJoin>, AverageEaterCheque>> GetAverageEaterChequeSelector()
{
// x =>
var ParameterType = typeof(IGrouping<IDictionary<string, object>, ChartJoin>);
var parameter = Expression.Parameter(ParameterType);
// x => x.Sum(m => m.BrandVisited.DOLLAR_TOTAL) / x.Sum(m => m.BrandVisited.NUM_PAID)
var m = Expression.Parameter(typeof(ChartJoin), "m");
var mBrandVisited = Expression.PropertyOrField(m, "BrandVisited");
PropertyInfo DollarTotalPropertyInfo = typeof(BrandVisited).GetProperty("DOLLAR_TOTAL");
PropertyInfo NumPaidPropertyInfo = typeof(BrandVisited).GetProperty("NUM_PAID");
....
return a lambda...
}
When I did a test run locally I got an Out of Memory error. Then I started reading blog posts from Totin and others saying that lambda compiles, and expression trees in general, are expensive. I had no idea it would blow up my application. And I need the ability to dynamically add grouping, which led me to using expression trees for the GroupBy and Select clauses.
I would love some pointers on how to chase down the memory offenders in my application. I have seen some people use dotMemory, but some practical tips would be great as well. I have very little experience monitoring C#/.NET.
Since you're compiling the expression into a delegate, the operation is performed using LINQ to Objects, rather than using the IQueryable overload. This means that the entirety of the data set is being pulled into memory, and all of the processing done by the application, instead of that processing being done in the database and only the final results being sent to the application.
Apparently pulling down the entire table into memory is enough to run your application out of memory.
You need to not compile the lambda, and leave it as an expression, thus allowing the query provider to translate it into SQL, as is done with your original code.
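A minimal sketch of the difference (groupedQuery is a hypothetical name for the IQueryable you have after the dynamically built GroupBy/OrderBy steps; requires System.Linq and System.Linq.Expressions):
// Keep the dynamically built selector as an expression tree so the query provider
// can translate it to SQL and only the aggregated rows come back from the database.
Expression<Func<IGrouping<IDictionary<string, object>, ChartJoin>, AverageEaterCheque>>
    selector = GetAverageEaterChequeSelector();

var results = groupedQuery.Select(selector);   // still IQueryable, evaluated on the server

// What causes the blow-up: compiling the expression forces LINQ to Objects, which
// pulls every joined row into application memory before grouping and selecting.
// var inMemory = groupedQuery.AsEnumerable().Select(selector.Compile());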

How to do mongodb queries faster?

I have lots of queries like Sample 1, Sample 2 and Sample 3 below. There are more than 13 million records in the MongoDB collection, so these queries take a long time. Is there any way to make them faster?
I think using an IMongoQuery object might resolve this problem. Is there a better way?
Sample 1:
var collection = new MongoDbRepo().DbCollection<Model>("tblmodel");
decimal total1 = collection.FindAll()
.SelectMany(x => x.MMB.MVD)
.Where(x => x.M01.ToLower() == "try")
.Sum(x => x.M06);
Sample 2:
var collection = new MongoDbRepo().DbCollection<Model>("tblmodel");
decimal total2 = collection.FindAll().Sum(x => x.MMB.MVO.O01);
Sample 3:
var list1= collection.FindAll()
.SelectMany(x => x.MHB.VLH)
.Where(x => x.V15 > 1).ToList();
var list2= list1.GroupBy(x => new { x.H03, x.H09 })
.Select(lg =>
new
{
Prop1= lg.Key.H03,
Prop2= lg.Count(),
Prop3= lg.Sum(w => w.H09),
});
The FindAll function returns a MongoCursor. When you chain LINQ extension methods onto FindAll, all of the processing happens on the client, not on the database server: every document is returned to the client. Ideally, you'll want to pass in a query to limit the results by using Find.
Or, you could use the AsQueryable function to better utilize LINQ expressions and the extension methods:
var results = collection.AsQueryable().Where(....);
I don't understand your data model, so I can't offer any specific suggestions as to how to add a query that would filter more of the data on the server.
You can use the SetFields chainable method after FindAll to limit the fields that are returned if you really do need to return every document to the client for processing.
You also might find that writing some of the queries using the MongoDB aggregation framework might produce similar results, without sending any data to the client (except the results). Or, possibly a Map-Reduce depending on the nature of the data.
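As a rough sketch for Sample 1 (assuming the legacy 1.x C# driver with the Query and Fields builders from MongoDB.Driver.Builders, and that the dotted element path and the lowercase match are correct for your documents), push the filter and the field selection to the server:
// Let the server filter and return only the fields that are needed, instead of
// streaming all 13 million full documents to the client.
var query = Query.EQ("MMB.MVD.M01", "try");                 // hypothetical element path
var cursor = collection.Find(query)
                       .SetFields(Fields.Include("MMB.MVD"));
decimal total1 = cursor
    .SelectMany(x => x.MMB.MVD)
    .Where(x => x.M01.ToLower() == "try")                   // keep the original check client-side
    .Sum(x => x.M06);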

How to Improve Linq query performance regarding Trim()

Our company's tables were created with space-padded fields.
I don't have access/permissions to make changes to the DB.
However, I noticed that when I create LINQ queries using the Trim() function, the performance decreases quite a bit.
A query as simple as this one shows the performance decrease:
Companies
.Where(c => c.CompanyName.Equals("Apple"))
.Select(c => new {
Tick = c.Ticker.Trim(),
Address = c.Address.Trim()
});
Is there a way to change the query so that there is no loss in performance?
Or does this rest solely with my DBA?
A quick solution is to pad your company name before passing it to the query. For example, if the column is char(50):
var paddedName = "Apple".PadRight(50);
var result = Companies
.Where(c => c.CompanyName.Equals(paddedName))
.Select(c => new {
Tick = c.Ticker.Trim(),
Address = c.Address.Trim()
});
However, you should consider correcting the database to avoid further issues.
I haven't measured the performance, but another option is to use a "LIKE"-style filter (StartsWith) as a first round on the database, materialize with .ToList(), and then do the exact equality check in memory without another database call. Note that the company name has to be included in the projection for the second check to compile:
var result = Companies
    .Where(c => c.CompanyName.StartsWith("Apple"))
    .Select(c => new
    {
        c.CompanyName,                 // needed for the exact match below
        Tick = c.Ticker.Trim(),
        Address = c.Address.Trim()
    })
    .ToList();
var result1 = result
    .Where(c => c.CompanyName.Trim().Equals("Apple"))
    .Select(c => c);
Unlike Entity Framework, LINQ to SQL can sometimes switch to LINQ to Objects under the hood when it encounters method calls that can't be translated to SQL. So if you write
....
.Select(c => new {
    Tick = c.Ticker.TrimEnd().TrimStart(),
    Address = c.Address.TrimEnd().TrimStart()
});
you will notice that the generated SQL no longer contains LTRIM(RTRIM()) but only the field name, and that the trims are executed in client memory. Apparently the LTRIM(RTRIM()) somehow causes a less efficient query plan (surprisingly).
Maybe only TrimEnd() suffices if there are no leading spaces.
Further, I fully agree with p.s.w.g. that you should go out of your way to try and clean up the database instead of fixing bad data in queries. If you can't do this job yourself, find the right people and twist their arms.

Linq to sql expression tree execution zone issue

I have got a bit of an issue and was wondering if there is a way to have my cake and eat it.
Currently I have a Repository and Query style pattern for how I am using Linq2Sql; however, I have one issue and I cannot see a nice way to solve it. Here is an example of the problem:
var someDataMapper = new SomeDataMapper();
var someDataQuery = new GetSomeDataQuery();
var results = SomeRepository.HybridQuery(someDataQuery)
.Where(x => x.SomeColumn == 1 || x.SomeColumn == 2)
.OrderByDescending(x => x.SomeOtherColumn)
.Select(x => someDataMapper.Map(x));
return results.Where(x => x.SomeMappedColumn == "SomeType");
The main bits to pay attention to here are the Mapper, Query and Repository, and then the final where clause. I am doing this as part of a larger refactor, and we found that there were a lot of similar queries which were getting slightly different result sets back but then mapping them the same way to a domain-specific model. Take, for example, getting back a tbl_car and mapping it to a Car object. A mapper basically takes one type and spits out another, exactly as would normally happen in the select:
// Non-mapped version
.Select(x => new Car
{
    Id = x.Id,
    Name = x.Name,
    Owner = x.FirstName + x.Surname
});

// Mapped version
.Select(x => carMapper.Map(x));
So the car mapper is more reusable across all areas which do similar queries returning the same end result but doing different bits along the way. However, I keep getting an error saying that Map cannot be converted to SQL, which is fine as I don't want it to be; but I understand that, because it sits inside the expression tree, the provider tries to convert it.
{"Method 'SomeData Map(SomeTable)' has no supported translation to SQL."}
Finally, the object that is returned and mapped is passed further up the stack for other objects to use. They make use of LINQ to SQL's composition abilities to add additional criteria to the query and then finally ToList() or iterate over the data returned; however, they filter on the mapped model, not the original table model, which I believe is perfectly fine, as answered in a previous question:
Linq2Sql point of retrieving data
So to sum it up, can I use my mapping pattern as shown without it trying to convert that single part to SQL?
Yes, you can. Put AsEnumerable() before the last Select:
var results = SomeRepository.HybridQuery(someDataQuery)
.Where(x => x.SomeColumn == 1 || x.SomeColumn == 2)
.OrderByDescending(x => x.SomeOtherColumn)
.AsEnumerable()
.Select(x => someDataMapper.Map(x));
Please note, however, that the second Where - the one that operates on SomeMappedColumn - will now be executed in memory and not by the database. If this last where clause significantly reduces the result set this could be a problem.
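If that turns out to be a problem, and SomeMappedColumn is derived directly from a column on the underlying table, you can push an equivalent filter down before the AsEnumerable() call so the database does the reduction. A minimal sketch (the table column name SomeTypeColumn is hypothetical):
var results = SomeRepository.HybridQuery(someDataQuery)
    .Where(x => x.SomeColumn == 1 || x.SomeColumn == 2)
    .Where(x => x.SomeTypeColumn == "SomeType")   // hypothetical column; translated to SQL, shrinks the result set
    .OrderByDescending(x => x.SomeOtherColumn)
    .AsEnumerable()                               // everything after this runs in memory
    .Select(x => someDataMapper.Map(x));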
An alternate approach would be to create a method that returns the expression tree of that mapping. Something like the following should work, as long as everything happening in the mapping is convertible to SQL.
Expression<Func<EntityType, Car>> GetCarMappingExpression()
{
    // Return the lambda directly; the compiler builds the expression tree
    // (Expression<T> has no public constructor to call).
    return x => new Car
    {
        Id = x.Id,
        Name = x.Name,
        Owner = x.FirstName + x.Surname
    };
}
Usage would be like this:
var results = SomeRepository.HybridQuery(someDataQuery)
.Where(x => x.SomeColumn == 1 || x.SomeColumn == 2)
.OrderByDescending(x => x.SomeOtherColumn)
.Select(GetCarMappingExpression());
