Performant Linq Query that gets maximum revision of rows - c#

all.
I am developing an application that is tracking the changes to an objects properties. Each time an objects properties change, I create a new row in the table with the updated property values and an incremented revision.
I have a table that has a structure like the following:
Id (primary key, system generated)
UserFriendlyId (generated programmatically, it is the Id the user sees in the UI, it stays the same regardless of how many revisions an object goes through)
.... (misc properties)
Revision (int, incremented when an object properties are changed)
To get the maximum revision for each UserFriendlyId, I do the following:
var latestIdAndRev = context.Rows.GroupBy(r => r.UserFriendlyId).Select(latest => new { UserFriendlyId = latest.Key, Revision = latest.Max(r=>r.Revision)}).ToList();
Then in order to get a collection of the Row objects, I do the following:
var latestRevs = context.Rows.Where(r => latestIdAndRev.Contains( new {UserFriendlyId=r.UserFriendlyId, Revision=r.Revision})).ToList();
Even though, my table only has ~3K rows, the performance on the latestRevs statement is horrible (takes several minutes to finish, if it doesn't time out first).
Any idea on what I might do differently to get better performance retrieving the latest revision for a collection of userfriendlyids?

To increase the performance of you query you should try to make the entire query run on the database. You have divided the query into two parts and in the first query you pull all the revisions to the client side into latestIdAndRev. The second query .Where(r => latestIdAndRev.Contains( ... )) will then translate into a SQL statement that is something like WHERE ... IN and then a list of all the ID's that you are looking for.
You can combine the queries into a single query where you group by UserFriendlyId and then for each group select the row with the highest revision simply ordering the rows by Revision (descending) and picking the first row:
latestRevs = context.Rows.GroupBy(
r => r.UserFriendlyId,
(key, rows) => rows.OrderByDescending(r => r.Revision).First()
).ToList();
This should generate pretty efficient SQL even though I have not been able to verify this myself. To further increase performance you should have a look at indexing the UserFriendlyId and the Revision columns but your results may vary. In general adding an index increases the time it takes to insert a row but may decrease the time it takes to find a row.
(General advice: Watch out for .Where(row => clientSideCollectionOfIds.Contains(row.Id)) because all the ID's will have to be included in the query. This is not a fault of the ER mapper.)

There are a couple of things to look at, as you are likely ending up with serious recursion. If this is SQL Server, open profiler and start a profile on the database in question and then fire off the command. Look at what is being run, examine the execution plan, and see what is actually being run.
From this you MIGHT be able to use the index wizard to create a set of indexes that speeds things up. I say might, as the recursive nature of the query may not be easily solved.
If you want something that recurses to be wicked fast, invest in learning Window Functions. A few years back, we had a query that took up to 30 seconds reduced to milliseconds by heading that direction. NOTE: I am not stating this is your solution, just stating it is worth looking into if indexes alone do not meet your Service Level Agreements (SLAs).

Related

Entity Framework DbContext filtered query for count is extremely slow using a variable

Using an ADO.NET entity data model I've constructed two queries below against a table containing 1800 records that has just over 30 fields that yield staggering results.
// Executes slowly, over 6000 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == _custID).Count();
// Executes instantly, under 20 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == 625).Count();
I see from the database log that Entity Framework provides that the queries are almost identical except that the filter portion uses a parameter. Copying this query into SSMS and declaring & setting this parameter there results in a near instant query so it doesn't appear to be on the database end of things.
Has anyone encountered this that can explain what's happening? I'm at the mercy of a third party control that adds this command to the query in an attempt to limit the number of rows returned, getting the count is a must. This is used for several queries so a generic solution is needed. It is unfortunate it doesn't work as advertised, it seems to only make the query take 5-10 times as long as it would if I just loaded the entire view into memory. When no filter is used however, it works like a dream.
Use of these components includes the source code so I can change this behavior but need to consider which approaches can be used to provide a reusable solution.
You did not mention about design details of your model but if you only want to have count of records based on condition, then this can be optimized by only counting the result set based on one column. For example,
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Count();
If you design have 10 columns, and based on above statement let say 100 records have been returned, then against every record result set contains 10 columns' data which is of not use.
You can optimize this by only counting result set based on single column.
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Select(x=>new {x.column}).Count();
Other optimization methods, like using async variants of count CountAsync can be used.

SQL Linq .Take() latest 20 rows from HUGE database, performance-wise

I'm using EntityFramework 6 and I make Linq queries from Asp.NET server to a azure sql database.
I need to retrieve the latest 20 rows that satisfy a certain condition
Here's a rough example of my query
using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));
IQueryable<Post> postQueryable =
from postDbEntry in postHubDbContext.PostDbEntries
orderby postDbEntry.Id descending
where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
select new Post(postDbEntry);
postQueryable = postQueryable.Take(20);
IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(Post => Post.DatePosted);
return postOrderedQueryable.ToList();
}
The question is, what if I literally have a billion rows in my database. Will that query brutally select millions of rows which meet the condition then get 20 of them ? Or will it be smart and realise that I only want 20 rows hence it will only select 20 rows ?
Basically how do I make this query work efficiently with a database that has a billion rows ?
According to http://msdn.microsoft.com/en-us/library/bb882641.aspx Take() function has deferred streaming execution as well as select statement. This means that it should be equivalent to TOP 20 in SQL and SQL will get only 20 rows from the database.
This link: http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx shows that Take has a direct translation in Linq-to-SQL.
So the only performance you can make is in database. Like #usr suggested you can use indexes to increase performance. Also storing the table in sorted order helps a lot (which is likely your case as you sort by id).
Why not try it? :) You can inspect the sql and see what it generates, and then look at the execution plan for that sql and see if it scans the entire table
Check out this question for more details
How do I view the SQL generated by the Entity Framework?
This will be hard to get really fast. You want an index to give you the sort order on Id but you want a different (spatial) index to provide you with efficient filtering. It is not possible to create an index that fulfills both goals efficiently.
Assume both indexes exist:
If the filter is very selective expect SQL Server to "select" all rows where this filter is true, then sorting them, then giving you the top 20. Imagine there are only 21 rows that pass the filter - then this strategy is clearly very efficient.
If the filter is not at all selective SQL Server will rather traverse the table ordered by Id, test each row it comes by and outputs the first 20. Imagine that the filter applies to all rows - then SQL Server can just output the first 20 rows it sees. Very fast.
So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If you have that this question requires further thought. You probably need more than a clever indexing strategy. You need app changes.
Btw, we don't need an index on DatePosted. The sorting by DatePosted is only done after limiting the set to 20 rows. We don't need an index to sort 20 rows.

Improving Linq query

I have the following query:
if (idUO > 0)
{
query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
var dependencyIds = dependencies.Select(d => d.Id).ToList();
query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, its OK, the query takes less than one 1 second, but when I try to access the page 30000, the query takes more than 20 seconds. There is a way in the linq query, that I can improve the performance of the query? Or only in the database level? And in the database level, for this kind of query, which is the best way to improve the performance?
There is no much space here, imo, to make things better (at least looking on the code provided).
When you're trying to achieve a good performance on such numbers, I would recommend do not use LINQ at all, or at list use it on the stuff with smaler data access.
What you can do here, is introduce paging of that data on DataBase level, with some stored procedure, and invoke it from your C# code.
1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.
I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about Indexes that you should add.. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)
I think you'll find that the bottleneck is occurring at the database. Here's why;
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.

LINQ stack overflow selecting many

I have a scenario where I have a table of "batch" and a table of "test" where "test" contains an FK to "batch" and a many tests can belong to a batch.
I want to be able to select multiple batches and find all tests that belong to them. I do this by producing a list of PKs to the batches I'm interested in and then the following LINQ query:
var ret =
from t in tests
from b in indices //indices is a list of long PK's belonging to selected batches
where t.batch_id == b
select t;
It works but when my selection size exceeds 14 batches, I get a "SQLite error
parser stack overflow" on the LINQ expression regardless of how many tests are found.
I want to be able to handle large selections if possible. How can I do this?
If JeffN825's query doesn't resolve your issue, which I would give high odds on it doing so, you may need to compile your own SQLite and set the -DYYSTACKDEPTH value to something bigger than the default. So you will need to find out what it was set to and then maybe double it and go from there. The full line you would pass is CFLAGS="-DYYSTACKDEPTH=1000" changing the 1000 to be what how deep you want the stack to be.
The LINQ provider may be blowing up because it's trying to issue 1 query per index. You can verify this (if SQL is in fact getting generated at all) by profiling the DB and seeing if it's actually issuing 1 query per index.
Try this instead:
var ret =
from t in tests
where indices.Contains(t.batch_id)
select t;

NHibernate - Log items that appear in a search result

I am using NHibernate in an MVC 2.0 application. Essentially I want to keep track of the number of times each product shows up in a search result. For example, when somebody searches for a widget the product named WidgetA will show up in the first page of the search results. At this point i will increment a field in the database to reflect that it appeared as part of a search result.
While this is straightforward I am concerned that the inserts themselves will greatly slow down the search result. I would like to batch my statements together but it seems that coupling my inserts with my select may be counter productive. Has anyone tried to accomplish this in NHibernate and, if so, are there any standard patterns for completing this kind of operation?
Interesting question!
Here's a possible solution:
var searchResults = session.CreateCriteria<Product>()
//your query parameters here
.List<Product>();
session.CreateQuery(#"update Product set SearchCount = SearchCount + 1
where Id in (:productIds)")
.SetParameterList("productIds", searchResults.Select(p => p.Id).ToList())
.ExecuteUpdate();
Of course you can do the search with Criteria, HQL, SQL, Linq, etc.
The update query is a single round trip for all the objects, so the performance impact should be minimal.

Categories

Resources