LINQ or SQL: Group by with a sum of distinct values


I have a database table on SQL Server 2019 containing a time series of prices collected at multiple frequencies (daily, weekly or monthly), which I query using EF Core 3.1.
I'm trying to extract these prices aggregated by month, but without losing the information about the collection frequency.
From the following set of data:
I'm trying to get this one, which contains the average price grouped by month, together with the frequencies of the raw records.
This could easily be solved by using
string.Join(",", s.Select(innerSel => innerSel.OriginalFrequency).Distinct())
but unfortunately I can't use it. I need to work on IQueryable objects and execute the LINQ query only at the end, when I take a subset of the data based on the page size; converting the query to a List before grouping would mean fetching several thousand records from the DB.
I was trying to use a combination of SUM and COUNT of the frequencies, so that the original combination can be worked out by multiplying these two values (see the schema below), but the COUNT and SUM should only consider distinct values, otherwise it doesn't work.
Is there a way to keep this information without overloading the database server by requesting unnecessary data or making multiple requests?
This is the code where I'm stuck:
var aggregatedMonthlyPrices = prices.GroupBy(g => new
{
    g.DateMonth,
    g.DateYear
}).Select(s => new
{
    DateMonth = s.Key.DateMonth,
    DateYear = s.Key.DateYear,
    Price = s.Average(avg => avg.Price),
    // These two aggregates still include duplicate frequency ids.
    FrequencySum = s.Sum(sum => sum.DataCollectionFrequencyId),
    FrequencyCount = s.Count(),
});
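The distinct SUM and COUNT described above can be expressed by projecting the frequency id and calling Distinct() before aggregating. The following is only a minimal sketch running on in-memory data; PriceRow and MonthlyPrice are hypothetical types standing in for the real entities. Whether a given EF Core version translates the Distinct() calls inside the grouping to SQL is an assumption that should be checked against the generated SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; property names mirror the query in the question.
public record PriceRow(int DateYear, int DateMonth, decimal Price, int DataCollectionFrequencyId);

// Hypothetical result shape for one aggregated month.
public record MonthlyPrice(int DateYear, int DateMonth, decimal Price, int FrequencySum, int FrequencyCount);

public static class MonthlyAggregation
{
    public static List<MonthlyPrice> Aggregate(IEnumerable<PriceRow> prices) =>
        prices
            .GroupBy(p => new { p.DateYear, p.DateMonth })
            .Select(g => new MonthlyPrice(
                g.Key.DateYear,
                g.Key.DateMonth,
                g.Average(p => p.Price),
                // Sum only the distinct frequency ids present in the month.
                g.Select(p => p.DataCollectionFrequencyId).Distinct().Sum(),
                // Count only the distinct frequency ids present in the month.
                g.Select(p => p.DataCollectionFrequencyId).Distinct().Count()))
            .ToList();
}
```

With frequency ids 1, 2 and 3, a month containing {1, 2} and a month containing only {3} both have a distinct sum of 3, so the distinct count is what tells them apart.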

Related

Entity Framework DbContext filtered query for count is extremely slow using a variable

Using an ADO.NET entity data model, I've constructed the two queries below against a table that contains 1800 records and just over 30 fields, and they yield staggeringly different results.
// Executes slowly, over 6000 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == _custID).Count();
// Executes instantly, under 20 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == 625).Count();
I can see from the database log provided by Entity Framework that the queries are almost identical, except that the filter portion uses a parameter. Copying this query into SSMS and declaring and setting the parameter there results in a near-instant query, so the problem doesn't appear to be on the database end.
Has anyone encountered this and can explain what's happening? I'm at the mercy of a third-party control that adds this command to the query in an attempt to limit the number of rows returned, so getting the count is a must. This is used for several queries, so a generic solution is needed. Unfortunately it doesn't work as advertised: it seems to make the query take 5-10 times as long as it would if I just loaded the entire view into memory. When no filter is used, however, it works like a dream.
Use of these components includes the source code, so I can change this behavior, but I need to consider which approaches can provide a reusable solution.

You did not mention the design details of your model, but if you only want a count of the records matching a condition, this can be optimized by counting the result set based on a single column. For example,
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Count();
If your design has 10 columns and, say, 100 records match the statement above, then the result set carries 10 columns of data for every record, which is of no use.
You can optimize this by counting the result set based on a single column:
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Select(x => new { x.column }).Count();
Other optimizations, such as the async variant CountAsync, can also be used.

Force query results to paginate/divide large query into smaller ones to avoid timeout with crummy API

I am using the provided Epicor Prophet 21 database API to form queries to our database. All is well until I form a query to return data on about 200 inventory parts at once. The 1 minute request TimeOut seems to be built into the assembly files, and there is no functionality for pagination or specific output selection.
I have thought about dividing the list alphanumerically by the itemIDs (e.g. returning all itemIDs > GGGGGGG, then all itemIDs < GGGGGGG). But the IDs are all of different lengths, and the midpoint shifts about. This seems clunky, and there must be a better way to divide the query results without knowing what the specific results will be. The API provides standard query filters: comparison operators =><, and/or, startsWith, endsWith, subStringOf.
Any ideas?
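One possible way to split the request without knowing a midpoint is to issue one query per leading character using the startsWith filter mentioned above. The following is only a sketch: fetchByPrefix is a hypothetical delegate standing in for a single API call, and the alphabet of leading characters is an assumption about the itemIDs. Any ID that starts with a character outside that alphabet would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixPaging
{
    // fetchByPrefix is a hypothetical stand-in for one API request
    // filtered with startsWith(prefix); it is not part of any real API.
    public static List<string> FetchAll(
        IEnumerable<char> leadingCharacters,
        Func<string, IEnumerable<string>> fetchByPrefix)
    {
        var results = new List<string>();
        foreach (char c in leadingCharacters)
        {
            // Each call returns only the items whose ID starts with this character,
            // so every individual request stays small.
            results.AddRange(fetchByPrefix(c.ToString()));
        }
        return results;
    }
}
```

If a single prefix still returns too many rows, the same idea can be applied again with two-character prefixes for that prefix only.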

Performant Linq Query that gets maximum revision of rows

I am developing an application that tracks the changes to an object's properties. Each time an object's properties change, I create a new row in the table with the updated property values and an incremented revision.
I have a table that has a structure like the following:
Id (primary key, system generated)
UserFriendlyId (generated programmatically, it is the Id the user sees in the UI, it stays the same regardless of how many revisions an object goes through)
.... (misc properties)
Revision (int, incremented when an object's properties are changed)
To get the maximum revision for each UserFriendlyId, I do the following:
var latestIdAndRev = context.Rows.GroupBy(r => r.UserFriendlyId).Select(latest => new { UserFriendlyId = latest.Key, Revision = latest.Max(r=>r.Revision)}).ToList();
Then in order to get a collection of the Row objects, I do the following:
var latestRevs = context.Rows.Where(r => latestIdAndRev.Contains( new {UserFriendlyId=r.UserFriendlyId, Revision=r.Revision})).ToList();
Even though my table only has ~3K rows, the performance of the latestRevs statement is horrible (it takes several minutes to finish, if it doesn't time out first).
Any idea what I might do differently to get better performance when retrieving the latest revision for a collection of UserFriendlyIds?

To increase the performance of your query you should try to make the entire query run on the database. You have divided the query into two parts, and the first query pulls all the revisions to the client side into latestIdAndRev. The second query, .Where(r => latestIdAndRev.Contains( ... )), will then translate into a SQL statement that is something like WHERE ... IN followed by a list of all the IDs that you are looking for.
You can combine the queries into a single query where you group by UserFriendlyId and then for each group select the row with the highest revision simply ordering the rows by Revision (descending) and picking the first row:
var latestRevs = context.Rows.GroupBy(
    r => r.UserFriendlyId,
    (key, rows) => rows.OrderByDescending(r => r.Revision).First()
).ToList();
This should generate pretty efficient SQL even though I have not been able to verify this myself. To further increase performance you should have a look at indexing the UserFriendlyId and the Revision columns but your results may vary. In general adding an index increases the time it takes to insert a row but may decrease the time it takes to find a row.
(General advice: Watch out for .Where(row => clientSideCollectionOfIds.Contains(row.Id)), because all the IDs will have to be included in the query. This is not a fault of the O/R mapper.)
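The single-query shape described above can be tried out on in-memory data. This is a minimal sketch: the Row record is a hypothetical stand-in for the real entity, and the code runs as LINQ-to-objects, so it says nothing about the SQL a provider would generate. The second method is an alternative formulation that is not part of the answer above; it keeps a row only when no other row with the same UserFriendlyId has a higher revision.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shape based on the table described in the question.
public record Row(int Id, string UserFriendlyId, int Revision);

public static class LatestRevisions
{
    // Group by UserFriendlyId, then pick the row with the highest revision in each group.
    public static List<Row> ViaGroupBy(IEnumerable<Row> rows) =>
        rows.GroupBy(
                r => r.UserFriendlyId,
                (key, group) => group.OrderByDescending(r => r.Revision).First())
            .ToList();

    // Alternative (an assumption, not from the answer): keep a row only if
    // no row with the same UserFriendlyId has a higher revision.
    public static List<Row> ViaNotExists(IReadOnlyCollection<Row> rows) =>
        rows.Where(r => !rows.Any(o => o.UserFriendlyId == r.UserFriendlyId
                                    && o.Revision > r.Revision))
            .ToList();
}
```

Both methods return the same rows for data where each (UserFriendlyId, Revision) pair is unique.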

There are a couple of things to look at, as you are likely ending up with serious recursion. If this is SQL Server, open profiler and start a profile on the database in question and then fire off the command. Look at what is being run, examine the execution plan, and see what is actually being run.
From this you MIGHT be able to use the index wizard to create a set of indexes that speeds things up. I say might, as the recursive nature of the query may not be easily solved.
If you want something that recurses to be wicked fast, invest in learning window functions. A few years back, we had a query that took up to 30 seconds reduced to milliseconds by heading in that direction. NOTE: I am not stating this is your solution, just that it is worth looking into if indexes alone do not meet your Service Level Agreements (SLAs).

LINQ nested groups performance

I have a full outer join query pulling data from a SQL Compact database (I use EF6 for mapping):
var query =
    from entry in left.Union(right).AsEnumerable()
    select new
    {
        ...
    } into e
    group e by e.Date.Year into year
    select new
    {
        Year = year.Key,
        Quartals = from x in year
                   group x by (x.Date.Month - 1) / 3 + 1 into quartal
                   select new
                   {
                       Quartal = quartal.Key,
                       Months = from x in quartal
                                group x by x.Date.Month into month
                                select new
                                {
                                    Month = month.Key,
                                    Contracts = from x in month
                                                group x by x.Contract.extNo into contract
                                                select new
                                                {
                                                    // Key of the contract group (the original used month.Key here).
                                                    ExtNo = contract.Key,
                                                    Entries = contract,
                                                }
                                }
                   }
    };
As you can see, I use nested groups to structure the results.
The interesting thing is that if I remove the AsEnumerable() call, the query takes 3.5x more time to execute: ~210ms vs ~60ms. And when it runs for the first time the difference is much greater: 39000(!)ms vs 1300ms.
My questions are:
1. What am I doing wrong? Maybe those groupings should be done in a different way?
2. Why does the first execution take so much time? I know expression trees have to be built etc., but 39 seconds?
3. Why is LINQ to DB slower than LINQ to Entities in my case? Is it generally slower, and is it better to load data from the DB, if possible, before processing?
Thanks!

To answer your three questions:
1. Maybe those groupings should be done in a different way?
No. If you want nested groupings you can only do that by groupings within groupings.
However, you can group by multiple fields at once:
from entry in left.Union(right)
select new
{
    ...
} into e
group e by new
{
    e.Date.Year,
    Quartal = (e.Date.Month - 1) / 3 + 1,
    e.Date.Month,
    contract = e.Contract.extNo
} into grp
select new
{
    // Each part of the composite key is read from grp.Key.
    Year = grp.Key.Year,
    Quartal = grp.Key.Quartal,
    Month = grp.Key.Month,
    ExtNo = grp.Key.contract,
    // The group itself holds the entries for this year/quartal/month/contract.
    Entries = grp
}
This will remove a lot of complexity from the generated query so it's likely to be (much) faster without AsEnumerable(). But the result is quite different: a flat group (Year, Quartal, etc, in one row), not a nested grouping.
2. Why does the first execution take so much time?
Because the generated SQL query is probably pretty complex and the database engine's query optimizer can't find a fast execution path.
3a. Why is linq to db slower than linq to entities in my case?
Because, apparently, in this case it's much more efficient to fetch the data into memory first and do the groupings by LINQ-to-objects. This effect will be more significant if left and right represent more or less complex queries themselves. In that case, the generated SQL can get hugely bloated, because it has to process two sources of complexity in one statement, which may lead to many repeated identical subqueries. By outsourcing the grouping, the database is probably left with a relatively simple query, and of course the grouping in memory is never affected by the complexity of the SQL query.
3b. Is it generally slower, and is it better to load data from the DB, if possible, before processing?
No, not generally. I'd even say, hardly ever. In this case it is because (as I can see) you don't filter data. If however the part before AsEnumerable() would return millions of records and you would apply filtering afterwards, the query without AsEnumerable() would probably be much faster, because the filtering is done in the database.
Therefore, you should always keep an eye on the generated SQL. It's unrealistic to expect that EF will always generate a super-optimized SQL statement. It hardly ever will. Its primary focus is on correctness (and it does an exceptional job there); performance is secondary. It's the developer's job to make LINQ-to-Entities and LINQ-to-objects work together as a slick team.
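The division of labour described above (filter in the database, group in memory) can be sketched as follows. This is an illustration under assumptions: the Entry record is hypothetical, and in the usage an in-memory list wrapped with AsQueryable() merely stands in for a real provider-backed query. The point is where AsEnumerable() sits: everything before it stays on IQueryable, everything after it is LINQ-to-objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entry shape, loosely based on the question's query.
public record Entry(DateTime Date, string ExtNo);

public static class HybridGrouping
{
    // Returns year -> quarter -> entries.
    public static Dictionary<int, Dictionary<int, List<Entry>>> ByYearAndQuarter(
        IQueryable<Entry> source, DateTime since) =>
        source
            // Still IQueryable: a real provider would run this filter in the database.
            .Where(e => e.Date >= since)
            // From here on, operators bind to Enumerable and run in memory.
            .AsEnumerable()
            .GroupBy(e => e.Date.Year)
            .ToDictionary(
                year => year.Key,
                year => year
                    .GroupBy(e => (e.Date.Month - 1) / 3 + 1)
                    .ToDictionary(quarter => quarter.Key, quarter => quarter.ToList()));
}
```

Placing the filter before AsEnumerable() is what keeps the amount of data pulled into memory small; the nested grouping itself never reaches the provider.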

Using AsEnumerable() will convert a type that implements IEnumerable<T> to IEnumerable<T> itself.
Read this topic https://msdn.microsoft.com/en-us/library/bb335435.aspx
AsEnumerable<TSource>(IEnumerable<TSource>) can be used to choose between query implementations when a sequence implements IEnumerable<T> but also has a different set of public query methods available. For example, given a generic class Table that implements IEnumerable<T> and has its own methods such as Where, Select, and SelectMany, a call to Where would invoke the public Where method of Table. A Table type that represents a database table could have a Where method that takes the predicate argument as an expression tree and converts the tree to SQL for remote execution. If remote execution is not desired, for example because the predicate invokes a local method, the AsEnumerable<TSource> method can be used to hide the custom methods and instead make the standard query operators available.
When you invoke AsEnumerable() first, the operators that follow are not translated to SQL; instead the table is loaded into memory as the Where enumerates it. Since the data is now in memory, its execution is faster in this case.
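The switch between query implementations described above can be observed from the static types alone. In this small sketch an array wrapped with AsQueryable() is only a stand-in for a provider-backed table; no database is involved.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class AsEnumerableDemo
{
    public static (bool BeforeIsQueryable, bool AfterIsQueryable) Run()
    {
        // Stand-in for a real table exposed as IQueryable<T>.
        IQueryable<int> table = new[] { 1, 2, 3, 4 }.AsQueryable();

        // Binds to Queryable.Where: the predicate is passed as an expression tree.
        var remote = table.Where(x => x > 2);

        // Binds to Enumerable.Where: the predicate is a compiled delegate run in memory.
        var local = table.AsEnumerable().Where(x => x > 2);

        return (remote is IQueryable<int>, local is IQueryable<int>);
    }
}
```

AsEnumerable() does not copy or execute anything by itself; it only changes the compile-time type so that the in-memory operators are chosen for the calls that follow.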

SQL Linq .Take() latest 20 rows from HUGE database, performance-wise

I'm using Entity Framework 6 and I make LINQ queries from an ASP.NET server to an Azure SQL database.
I need to retrieve the latest 20 rows that satisfy a certain condition.
Here's a rough example of my query:
using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
    DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));
    IQueryable<Post> postQueryable =
        from postDbEntry in postHubDbContext.PostDbEntries
        orderby postDbEntry.Id descending
        where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
        select new Post(postDbEntry);
    postQueryable = postQueryable.Take(20);
    IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(Post => Post.DatePosted);
    return postOrderedQueryable.ToList();
}
The question is: what if I literally have a billion rows in my database? Will that query brutally select the millions of rows that meet the condition and then take 20 of them? Or will it be smart and realise that I only want 20 rows, and hence select only 20 rows?
Basically, how do I make this query work efficiently with a database that has a billion rows?

According to http://msdn.microsoft.com/en-us/library/bb882641.aspx the Take() function has deferred streaming execution, as does Select. This means that it should be equivalent to TOP 20 in SQL, and only 20 rows will be fetched from the database.
This link: http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx shows that Take has a direct translation in Linq-to-SQL.
So the only place left to improve performance is the database. As @usr suggested, you can use indexes to increase performance. Storing the table in sorted order also helps a lot (which is likely your case, as you sort by Id).
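The deferred, streaming behaviour of Take() can be observed in LINQ-to-objects with a source that counts how many elements were pulled from it. This is only an in-memory sketch with a hypothetical CountingSource class; against a database the provider translates Take into the SQL statement instead, as described above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: an endless sequence that records how many items were requested.
public sealed class CountingSource
{
    public int Pulled { get; private set; }

    public IEnumerable<int> Numbers()
    {
        for (int i = 1; ; i++)
        {
            Pulled++;
            yield return i;
        }
    }
}

public static class TakeDemo
{
    public static (List<int> Taken, int Pulled) Run()
    {
        var source = new CountingSource();

        // Nothing is enumerated until ToList(); Take(3) stops pulling
        // from the endless source once three matches have been produced.
        List<int> taken = source.Numbers()
            .Where(x => x % 2 == 0)
            .Take(3)
            .ToList();

        return (taken, source.Pulled);
    }
}
```

Because the source is endless, the call only terminates at all because Take stops the enumeration early.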

Why not try it? :) You can inspect the SQL and see what it generates, then look at the execution plan for that SQL and see whether it scans the entire table.
For more details, check out the question "How do I view the SQL generated by the Entity Framework?"

This will be hard to get really fast. You want an index to give you the sort order on Id but you want a different (spatial) index to provide you with efficient filtering. It is not possible to create an index that fulfills both goals efficiently.
Assume both indexes exist:
If the filter is very selective, expect SQL Server to "select" all rows where this filter is true, then sort them, then give you the top 20. Imagine there are only 21 rows that pass the filter: then this strategy is clearly very efficient.
If the filter is not at all selective, SQL Server will rather traverse the table ordered by Id, test each row it comes by and output the first 20 that pass. Imagine that the filter applies to all rows: then SQL Server can just output the first 20 rows it sees. Very fast.
So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If that is your situation, this question requires further thought. You probably need more than a clever indexing strategy. You need app changes.
Btw, we don't need an index on DatePosted. The sorting by DatePosted is only done after limiting the set to 20 rows. We don't need an index to sort 20 rows.
