nhibernate select n random unique records - c#

This has been asked before many times, but I couldn't find a conclusive answer. Can anyone assist?
I want to be able to query my database (both SQL Server AND MySQL) using NHibernate in order to get X number of rows, and these rows need to be unique. I am selecting 250 out of 350, so my odds of getting duplicates are high (because random doesn't imply unique).
This needs to work for both MySQL and SQL Server.
Since I am selecting over 70% of the table already, I don't mind selecting all the records and using LINQ to pull out the 250 unique values (if this is possible) - whatever has the least impact on my system overhead.
Thanks!

You can order your results using RAND() (MySQL) or NEWID() (SQL Server) and select the top 250 - the results will be distinct and random. In terms of efficiency, this will be more efficient than retrieving all 350 rows and then ordering them randomly in code - IF YOU IGNORE CACHING.
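In NHibernate the database-side version can be expressed as a native SQL query. A minimal sketch, assuming an entity MyClass mapped to a table MyTable (both names are placeholders); the SQL Server form is shown, with the MySQL form in the comment:
var randomRows = Session
    .CreateSQLQuery("SELECT TOP 250 * FROM MyTable ORDER BY NEWID()")
    // MySQL equivalent: "SELECT * FROM MyTable ORDER BY RAND() LIMIT 250"
    .AddEntity(typeof(MyClass))
    .List<MyClass>();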
However, if you do cache the results, it will probably be significantly more efficient over time to retrieve the 350 results into a cache (preferably using NHibernate's second-level cache) and then sort them in code using something like:
var rand = new Random();
var results = Session.CreateCriteria<MyClass>()
    .SetCacheable(true)           // served from the second-level cache after the first load
    .List<MyClass>()              // generic List<T>() so the LINQ operators below are available
    .OrderBy(x => rand.Next())    // shuffle in memory
    .Take(250);                   // keep 250 distinct rows

You are looking for Distinct; add .Distinct() before the .ToList().

I think you've answered yourself.
A full-table query for 350 records is going to put less strain on the server than a complex randomizing and paging query for 250 (that is usually done by appending a uniqueidentifier to rows in an inner query and then sorting by it; you can't do that with LINQ).
You can even cache the results and then randomize the list in memory each time to get 250 different records, as sketched below.
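A minimal sketch of that in-memory approach, assuming the full 350-row list has already been loaded and cached somewhere; the helper and its names are made up for illustration:
// Returns `count` distinct random records from an already-cached list, using a
// partial Fisher-Yates shuffle (only the first `count` positions are shuffled).
static List<T> PickRandom<T>(IReadOnlyList<T> cachedRecords, int count, Random rand)
{
    var copy = cachedRecords.ToList();
    for (int i = 0; i < count; i++)
    {
        int j = rand.Next(i, copy.Count);   // choose from the not-yet-fixed tail
        var tmp = copy[i];
        copy[i] = copy[j];
        copy[j] = tmp;
    }
    return copy.GetRange(0, count);
}
// Usage: var random250 = PickRandom(cachedRecords, 250, rand);
Reusing a single Random instance matters here; creating a new one per call can yield identical sequences when calls happen in quick succession.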

Related

Performant Linq Query that gets maximum revision of rows

Hi all. I am developing an application that tracks the changes to an object's properties. Each time an object's properties change, I create a new row in the table with the updated property values and an incremented revision.
I have a table that has a structure like the following:
Id (primary key, system generated)
UserFriendlyId (generated programmatically, it is the Id the user sees in the UI, it stays the same regardless of how many revisions an object goes through)
.... (misc properties)
Revision (int, incremented when an object's properties are changed)
To get the maximum revision for each UserFriendlyId, I do the following:
var latestIdAndRev = context.Rows
    .GroupBy(r => r.UserFriendlyId)
    .Select(latest => new { UserFriendlyId = latest.Key, Revision = latest.Max(r => r.Revision) })
    .ToList();
Then in order to get a collection of the Row objects, I do the following:
var latestRevs = context.Rows
    .Where(r => latestIdAndRev.Contains(new { UserFriendlyId = r.UserFriendlyId, Revision = r.Revision }))
    .ToList();
Even though my table only has ~3K rows, the performance of the latestRevs statement is horrible (it takes several minutes to finish, if it doesn't time out first).
Any idea what I might do differently to get better performance retrieving the latest revision for a collection of UserFriendlyIds?
To increase the performance of your query you should try to make the entire query run on the database. You have divided the query into two parts: the first query pulls all the revisions to the client side into latestIdAndRev, and the second query, .Where(r => latestIdAndRev.Contains( ... )), will then translate into a SQL statement of the form WHERE ... IN followed by a list of all the IDs you are looking for.
You can combine the queries into a single query in which you group by UserFriendlyId and then, for each group, select the row with the highest revision by simply ordering the rows by Revision (descending) and picking the first one:
latestRevs = context.Rows.GroupBy(
    r => r.UserFriendlyId,
    (key, rows) => rows.OrderByDescending(r => r.Revision).First()
).ToList();
This should generate pretty efficient SQL even though I have not been able to verify this myself. To further increase performance you should have a look at indexing the UserFriendlyId and the Revision columns but your results may vary. In general adding an index increases the time it takes to insert a row but may decrease the time it takes to find a row.
(General advice: Watch out for .Where(row => clientSideCollectionOfIds.Contains(row.Id)) because all the ID's will have to be included in the query. This is not a fault of the ER mapper.)
There are a couple of things to look at, as you are likely ending up with serious recursion. If this is SQL Server, open Profiler, start a trace on the database in question, and then fire off the command. Look at what is being run, examine the execution plan, and see what is actually being executed.
From this you MIGHT be able to use the index wizard to create a set of indexes that speeds things up. I say might, as the recursive nature of the query may not be easily solved.
If you want something that recurses to be wicked fast, invest in learning Window Functions. A few years back, we had a query that took up to 30 seconds reduced to milliseconds by heading that direction. NOTE: I am not stating this is your solution, just stating it is worth looking into if indexes alone do not meet your Service Level Agreements (SLAs).
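For reference, the window-function version of "latest revision per UserFriendlyId" looks roughly like the sketch below. It assumes an Entity Framework DbContext (Database.SqlQuery) and that the entity type and table are named Row and Rows; with a different data-access layer the same SQL can be run over plain ADO.NET.
// ROW_NUMBER() numbers the revisions within each UserFriendlyId group, newest
// first, so rn = 1 is the latest revision of every object. The extra rn column
// is simply not mapped when the rows are materialized.
var latestRevs = context.Database.SqlQuery<Row>(@"
    SELECT ranked.*
    FROM (
        SELECT r.*,
               ROW_NUMBER() OVER (PARTITION BY r.UserFriendlyId
                                  ORDER BY r.Revision DESC) AS rn
        FROM Rows r
    ) ranked
    WHERE ranked.rn = 1").ToList();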

SQL Linq .Take() latest 20 rows from HUGE database, performance-wise

I'm using Entity Framework 6 and I make LINQ queries from an ASP.NET server to an Azure SQL database.
I need to retrieve the latest 20 rows that satisfy a certain condition.
Here's a rough example of my query:
using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
    DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));
    IQueryable<Post> postQueryable =
        from postDbEntry in postHubDbContext.PostDbEntries
        orderby postDbEntry.Id descending
        where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
        select new Post(postDbEntry);
    postQueryable = postQueryable.Take(20);
    IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(Post => Post.DatePosted);
    return postOrderedQueryable.ToList();
}
The question is, what if I literally have a billion rows in my database? Will that query brutally select the millions of rows which meet the condition and then take 20 of them? Or will it be smart enough to realise that I only want 20 rows and therefore only select 20 rows?
Basically, how do I make this query work efficiently against a database that has a billion rows?
According to http://msdn.microsoft.com/en-us/library/bb882641.aspx the Take() function, like the select statement, has deferred streaming execution. This means that it should be equivalent to TOP 20 in SQL, and SQL Server will fetch only 20 rows from the database.
This link: http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx shows that Take has a direct translation in LINQ-to-SQL.
So the only performance gains you can make are in the database. As @usr suggested, you can use indexes to increase performance. Storing the table in sorted order also helps a lot (which is likely your case, as you sort by Id).
Why not try it? :) You can inspect the SQL it generates, then look at the execution plan for that SQL and see whether it scans the entire table.
Check out this question for more details
How do I view the SQL generated by the Entity Framework?
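For Entity Framework 6 specifically, two quick ways to see what SQL is being sent; this is only a sketch against the variables from the question:
// Option 1: have EF6 log every command it sends to the database.
postHubDbContext.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

// Option 2: for a query EF can translate, ToString() on the IQueryable returns the
// SELECT it would execute. (Constructor projections such as "new Post(postDbEntry)"
// are not translatable, so project to the entity or an anonymous type when trying this.)
var generatedSql = postHubDbContext.PostDbEntries
    .OrderByDescending(p => p.Id)
    .Take(20)
    .ToString();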
This will be hard to get really fast. You want an index to give you the sort order on Id but you want a different (spatial) index to provide you with efficient filtering. It is not possible to create an index that fulfills both goals efficiently.
Assume both indexes exist:
If the filter is very selective, expect SQL Server to "select" all rows where the filter is true, then sort them, then give you the top 20. Imagine there are only 21 rows that pass the filter - then this strategy is clearly very efficient.
If the filter is not at all selective, SQL Server will instead traverse the table ordered by Id, test each row it comes across, and output the first 20. Imagine that the filter applies to all rows - then SQL Server can just output the first 20 rows it sees. Very fast.
So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If that is your situation, this question requires further thought; you probably need more than a clever indexing strategy - you need app changes.
Btw, we don't need an index on DatePosted. The sorting by DatePosted is only done after limiting the set to 20 rows. We don't need an index to sort 20 rows.

Enumerating large data sets multiple times using LINQ to Entities

Let's say I have two tables in my SQL database.
1. A medium-sized table with thousands of records called MyTable1
2. A large table with millions of records (and growing by the day) called MyTable2
MyTable1 and MyTable2 both have a property called Hash that can be equal.
I'm looking to find the most efficient way to use Linq to Entities to iterate over MyTable1 and find all records in MyTable2 that have the same Hash and save into another table. Here's a simplified view of what the code looks like.
using (var db = new context()) {
    var myTable1Records = db.MyTable1.Select(x => x);
    foreach (var record in myTable1Records) {
        var matches = db.MyTable2.Where(y => y.Hash.Equals(record.Hash)).Select(y => y);
        foreach (var match in matches) {
            // Add match to another table
        }
    }
}
I'm seeing the performance of this code slow down significantly as the size of MyTable2 grows larger every day. A few ideas I'm experimenting with for efficiently handling this type of scenario are:
Setting MergeOption.NoTracking on db.MyTable2 since it's purely a read operation. Haven't seen much of an improvement from this unfortunately.
Pulling MyTable2 into memory using .ToList() to eliminate multiple calls to the db
Creating "chunks" of MyTable2 that the code can iterate over so it's not querying against the full million+ records each time.
I'd love to see if there are other techniques or magic bullets you've found to be effective in this type of scenario. Thanks!
You have a property called Hash. Use it as a hash! Store the first table in a Dictionary keyed by Hash and then iterate through the second table checking for matches in the Dictionary, again keying by Hash.
Or, better yet, use LINQ:
var matches = db.MyTable1.Intersect(db.MyTable2);
If you need to do a custom comparison, create an IEqualityComparer. (I assume you're doing some type of projection and that the Select(x => x) is a placeholder for the purposes of this question.)
Or, better still, this operation might be better off taking place entirely in the database in a stored procedure or view. You're essentially doing a JOIN but using C# to do it. You're incurring the cost of the round trip time from database to your client application for what could possibly all be done on the database server.
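A minimal sketch of the dictionary idea from the first paragraph of this answer, assuming Hash is unique within MyTable1 (the smaller table) and reusing the names from the question:
// Key the smaller table by Hash, then stream the large table once, looking each
// row up locally instead of issuing one query per MyTable1 record.
var byHash = db.MyTable1.ToDictionary(t1 => t1.Hash);

foreach (var row in db.MyTable2)          // single pass over the large table
{
    if (byHash.ContainsKey(row.Hash))
    {
        var match = byHash[row.Hash];
        // `match` (from MyTable1) and `row` (from MyTable2) share the same Hash;
        // add the pair to the destination table here.
    }
}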
What you're doing here is performing an inner join. By using a query provider you can even ensure that this work is done on the DB side, rather than in memory within your application; you'll only be pulling down the matching results, no more:
var query = from first in db.MyTable1
            join second in db.MyTable2
                on first.Hash equals second.Hash
            select second;
I would recommend staying in SQL Server. A view or a clustered index might be the best approach.
Here are a few sources to use to read up on the subject of indexes:
http://www.c-sharpcorner.com/uploadfile/nipuntomar/clustered-index-and-non-clustered-index-in-sql-server/
http://technet.microsoft.com/en-us/library/jj835095.aspx
Should every User Table have a Clustered Index?
And here is a source on SQL Views:
http://technet.microsoft.com/en-us/library/aa214068(v=sql.80).aspx
Maybe indexing your Hash column can help. Assuming Hash is a char or varchar type, the maximum key length an index can support is 900 bytes.
CREATE NONCLUSTERED INDEX IX_MyTable2_Hash ON dbo.MyTable2(Hash);
For performance of indexing a varchar, you might want to check here
SQL indexing on varchar

Optimizing Linq: IN operator with large data

How to optimize this query?
// This will return data ranging from 1 to 500,000 records
List<string> products = GetProductsNames();
List<Product> actualProducts = (from p in db.Products
                                where products.Contains(p.Name)
                                select p).ToList();
This code takes around 30 seconds to fill actualProducts if I send a list of 44,000 strings; I don't know what it would take for 500,000 records. :(
Is there any way to tweak this query?
NOTE: it takes almost this much time for each call (ignoring the first slow EDMX call).
An IN query on 500,000 records is always going to be a pathological case.
Firstly, make sure there is an index (probably non-clustered) on Name in the database.
Ideas (both involve dropping to ADO.NET):
use a "table valued parameter" to pass in the values, and INNER JOIN to the table-valued-parameter in TSQL
alternatively, create a table of the form ProductQuery with columns QueryId (which could be uniqueidentifier) and Name; invent a guid to represent your query (Guid.NewGuid()), and then use SqlBulkCopy to push the 500,000 pairs (the same guid on each row; different guids are different queries) into the table really quickly; then use TSQL to do an INNER JOIN between the two tables
Actually, these are very similar, but the first one is probably the first thing to try. Less to set up.
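A minimal sketch of the table-valued-parameter idea, dropping to ADO.NET. The user-defined table type dbo.NameList, the table name dbo.Products, and connectionString are assumptions made for illustration; the type must be created once up front, e.g. CREATE TYPE dbo.NameList AS TABLE (Name nvarchar(450) NOT NULL PRIMARY KEY).
// Build an in-memory table holding the names to look up.
var namesTable = new DataTable();
namesTable.Columns.Add("Name", typeof(string));
foreach (string name in products)
    namesTable.Rows.Add(name);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    @"SELECT p.* FROM dbo.Products p
      INNER JOIN @names n ON n.Name = p.Name", conn))
{
    var tvp = cmd.Parameters.AddWithValue("@names", namesTable);
    tvp.SqlDbType = SqlDbType.Structured;   // marks the parameter as a table-valued parameter
    tvp.TypeName = "dbo.NameList";          // must match the user-defined table type
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // materialize Product rows from the reader here
        }
    }
}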
If you don't want to use the database for this, you could try something with Dictionary<string, string>.
If I am not wrong, I suspect products.Contains(p.Name) is expensive since it is an O(n) operation. Try changing your GetProductsNames return type to Dictionary<string, string>, or convert the List to a Dictionary:
Dictionary<string, string> productsDict = products.ToDictionary(x => x);
So you have a dictionary in hand; now rewrite the query as below:
List<Product> actualProducts = (from p in db.Products
                                where productsDict.ContainsKey(p.Name)
                                select p).ToList();
This will help you improve performance a lot (the disadvantage is that you allocate double the memory; the advantage is performance). I tested with very large samples with good results. Try it out.
Hope this helps.
You could also take a hashing approach, using the name column as the value that gets passed to the hashing function; then you could iterate the 500K set, subjecting each name to the hashing function and testing for existence in your local hash file. This would require more code than a LINQ approach, but it might be considerably faster than repeated calls to the back end doing inner joins.
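A minimal sketch of that idea using an in-memory HashSet<string>; it assumes the Name column of the Products table fits comfortably in memory, which may not hold for every schema:
// Pull just the Name column once, hash it locally, then test the incoming
// names against the set without any further round trips to the database.
HashSet<string> knownNames = new HashSet<string>(db.Products.Select(p => p.Name));

List<string> namesThatExist = products
    .Where(name => knownNames.Contains(name))   // O(1) membership test per name
    .ToList();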

LINQ c# efficiency

I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.
What is the order of this LINQ query? Is there a more efficient way of doing this?
var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName);
IEnumerable result = MyValues.Distinct();
I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quick sort and should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still just O(n log n).
Update: the LINQ Distinct() call should also be O(n).
So altogether, the Big-Oh for this thing is still O(Kn log n), where K is some constant.
Is there a more efficient way of doing this?
You could get better efficiency if you do the sort as part of the query that initializes MyDataTable, instead of sorting in memory afterwards.
from comments
I actually use MyDistinct.Distinct()
If you want distinct _varName values and you cannot do this all in the select query in the DBMS (which would be the most efficient way), you should use Distinct before OrderBy. The order matters here.
You would need to order all million rows before you start to filter out the duplicates. If you use Distinct first, you only need to order what is left.
var values = from r in MyDataTable.AsEnumerable()
             select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
                                                  .OrderBy(d => d);
I recently asked a related question which Eric Lippert answered with a good explanation of when the order matters and when it doesn't:
Order of LINQ extension methods does not affect performance?
Here's a little demo where you can see that the order matters, but you can also see that it does not matter all that much, since comparing doubles is trivial for a CPU:
Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
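The demo code itself isn't reproduced above; a minimal sketch of such a comparison might look like the following (exact timings will of course differ per machine):
// Compare orderby-then-distinct against distinct-then-orderby on a million
// random doubles drawn from a small range so that duplicates actually occur.
var rand = new Random(42);
var data = Enumerable.Range(0, 1000000)
                     .Select(i => Math.Round(rand.NextDouble() * 1000, 3))
                     .ToList();

var sw = Stopwatch.StartNew();
var a = data.OrderBy(d => d).Distinct().ToList();
sw.Stop();
Console.WriteLine("First orderby then distinct: " + sw.Elapsed);

sw.Restart();
var b = data.Distinct().OrderBy(d => d).ToList();
sw.Stop();
Console.WriteLine("First distinct then orderby: " + sw.Elapsed);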
Your LINQ query above is fine if you want all the million records and you have enough memory on a 64-bit OS.
The order of the query, if you look at the underlying command, would be translated to
Select <_varName> from MyDataTable order by <_varName>
and this is about as good as it gets when run in the database IDE or on the command line.
To give you a short answer regarding performance:
put in a where clause if you can (with columns that are indexed)
ensure that the user can only choose columns (_varName) that are indexed. Imagine the DB trying to sort a million records on an unindexed column: that is evidently slow, but it is LINQ that ends up getting the bad press
ensure that (if possible) initialisation of MyDataTable is done correctly, with only the records that are of value (again based on a where clause)
profile your underlying query
if possible, create stored procedures (debatable); you can create an entity model which includes stored procedures as well
it may be fast enough today, but as the tablespace grows, and if your data is not ordered (indexed), that is where things get slower (even with a good LINQ expression)
Hope this helps.
