Optimizing LINQ: IN operator with large data - C#

How to optimize this query?
// This will return data ranging from 1 to 500,000 records
List<string> products = GetProductsNames();
List<Product> actualProducts = (from p in db.Products
                                where products.Contains(p.Name)
                                select p).ToList();
This code takes around 30 seconds to fill actualProducts if I send a list of 44,000 strings; I don't know how long it takes for 500,000 records. :(
Any way to tweak this query?
NOTE: it takes almost this long on every call (ignoring the first slow EDMX call).

An IN query on 500,000 records is always going to be a pathological case.
Firstly, make sure there is an index (probably non-clustered) on Name in the database.
Ideas (both involve dropping to ADO.NET):
use a "table valued parameter" to pass in the values, and INNER JOIN to the table-valued-parameter in TSQL
alternatively, create a table of the form ProductQuery with columns QueryId (which could be uniqueidentifier) and Name; invent a guid to represent your query (Guid.NewGuid()), and then use SqlBulkCopy to push the 500,000 pairs (the same guid on each row; different guids are different queries) into the table really quickly; then use TSQL to do an INNER JOIN between the two tables
Actually, these are very similar, but the first one is probably the first thing to try. Less to set up. A sketch of that approach follows.
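For illustration, here is a minimal sketch of the table-valued-parameter route in plain ADO.NET. It is not a drop-in implementation: it assumes a user-defined table type (called dbo.NameList here) already exists in the database, and that Product has Id and Name columns; adjust the names to your schema.

// Assumed to already exist in the database:
//   CREATE TYPE dbo.NameList AS TABLE (Name NVARCHAR(200) PRIMARY KEY);
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static List<Product> GetProductsByNames(string connectionString, IEnumerable<string> names)
{
    // Shape the names as a DataTable matching the table type.
    var table = new DataTable();
    table.Columns.Add("Name", typeof(string));
    foreach (var name in names)
        table.Rows.Add(name);

    var results = new List<Product>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT p.Id, p.Name FROM dbo.Products p INNER JOIN @names n ON n.Name = p.Name", conn))
    {
        var param = cmd.Parameters.AddWithValue("@names", table);
        param.SqlDbType = SqlDbType.Structured;
        param.TypeName = "dbo.NameList"; // must match the user-defined table type
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                results.Add(new Product { Id = reader.GetInt32(0), Name = reader.GetString(1) });
        }
    }
    return results;
}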

If you don't want to do the work in the database, you could try something with Dictionary<string, string>.
If I am not wrong, I suspect products.Contains(p.Name) is expensive since List<T>.Contains is an O(n) operation. Try changing your GetProductsNames return type to Dictionary<string, string>, or convert the List to a Dictionary:
Dictionary<string, string> productsDict = products.ToDictionary(x => x);
Now that you have a dictionary in hand, rewrite the query as below:
List<Product> actualProducts = (from p in db.Products
                                where productsDict.ContainsKey(p.Name)
                                select p).ToList();
This should improve performance a lot (the disadvantage is that you allocate the memory twice; the advantage is the lookup speed). I tested with very large samples with good results. Try it out.
Hope this helps.

You could also take a hashing approach, using the name column as the value that gets passed to the hashing function; then you could iterate the 500K set, subjecting each name to the hashing function and testing for existence in your local hash file. This would require more code than a LINQ approach, but it might be considerably faster than repeated calls to the back end doing inner joins.
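A minimal in-memory sketch of that idea, using HashSet<string> as the local hash structure (an assumption on my part; the answer leaves the storage open). Note the filtering happens on the client here, so the whole table is streamed across once:

// Build the local hash set once from the 500K names.
var nameSet = new HashSet<string>(products);

// Stream the products table and probe the hash set per row (O(1) per lookup).
List<Product> actualProducts = db.Products
    .AsEnumerable()                       // switch to LINQ to Objects
    .Where(p => nameSet.Contains(p.Name))
    .ToList();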

LINQ query where in list, performance: what is the best?

I have a simple LINQ query that gets a slug from a database for a product.
var query = from url in urlTable
            where url.ProductId == productId &&
                  url.EntityName == entityName &&
                  url.IsActive
            orderby url.Id descending
            select url.Slug;
I am trying to optimize this, since it is run for every product, and on a category page it is run once per product.
I could do this (if I'm not mistaken): send in a list of products and do a new query.
var query = from url in urlTable
            where productList.Contains(url.ProductId) &&
                  url.EntityName == entityName &&
                  url.IsActive
            orderby url.Id descending
            select url.Slug;
But I have read somewhere that the performance of Contains is bad. Is there any other way to do this? What is the best method performance wise?
But I have read somewhere that the performance of Contains is bad.
I believe you're mixing this up with string.Contains, which indeed is a bad idea on large data sets, because it simply can't use any index at all.
In any case, why are you guessing about performance? You should profile and see what's better for yourself. Also, look at the SQL produced by each of the queries and look at their respective query plans.
Now, with that out of the way, the second query is better, simply because it grabs as much as it can during one query, thus removing a lot of the overhead. It isn't too noticeable if you're only querying two or three times, but once you get into, say, a hundred, you're in trouble. Apart from reducing client-server round trips, it's also better on the server, because it can use the index very effectively, rather than looking up X items one after another. Note that this is probably negligible for primary keys, which usually don't have a logarithmic access time.
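Regarding looking at the generated SQL: a quick sketch, assuming urlTable comes from a LINQ to SQL DataContext named db (with Entity Framework you would cast the query to ObjectQuery and call ToTraceString() instead):

// LINQ to SQL: echo every generated SQL statement to the console.
db.Log = Console.Out;
var slugs = query.ToList(); // the SQL for this query shows up in the console output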
The second option is better. I would add the product-id to the result so you can differentiate between products.
var query = from url in urlTable
            where productList.Contains(url.ProductId) &&
                  url.IsActive
            orderby url.Id descending
            select new { url.ProductId, url.Slug };
Please note that your list of product ids is converted to SQL parameters, IN (@p1, @p2, @p3), and there is a maximum number of SQL parameters per SQL query. I think the limit is somewhere around 2000 parameters. So if you are querying for more than 2000 products, this solution will not work (see the batching sketch below).
var query = from productId in productList
            join url in urlTable on productId equals url.ProductId
            where url.IsActive
            orderby url.Id descending
            select url.Slug;
I believe this query would have better performance.
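As for the parameter cap mentioned above, one workaround is to batch the lookups. A sketch, assuming productList is a List<int> of product ids and that merging the batches in memory is acceptable:

const int batchSize = 1000; // stay well under the SQL parameter limit

var slugs = new List<string>();
for (int i = 0; i < productList.Count; i += batchSize)
{
    var batch = productList.Skip(i).Take(batchSize).ToList();
    slugs.AddRange((from url in urlTable
                    where batch.Contains(url.ProductId) &&
                          url.EntityName == entityName &&
                          url.IsActive
                    orderby url.Id descending
                    select url.Slug).ToList());
}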

Enumerating large data sets multiple times using LINQ to Entities

Let's say I have two tables in my SQL database.
1. A medium sized table with thousands of records called MyTable1
2. A large table with millions of records (and growing by the day) called MyTable2
MyTable1 and MyTable2 both have a property called Hash, and rows from the two tables can have equal Hash values.
I'm looking to find the most efficient way to use LINQ to Entities to iterate over MyTable1 and find all records in MyTable2 that have the same Hash, then save them into another table. Here's a simplified view of what the code looks like.
using (var db = new context())
{
    var myTable1Records = db.MyTable1.Select(x => x);

    foreach (var record in myTable1Records)
    {
        var matches = db.MyTable2.Where(y => y.Hash.Equals(record.Hash)).Select(y => y);
        foreach (var match in matches)
        {
            // Add match to another table
        }
    }
}
I'm seeing the performance of this code slow down significantly as the size of MyTable2 grows larger every day. A few ideas I'm experimenting with for efficiently handling this type of scenario are:
Setting MergeOption.NoTracking on db.MyTable2 since it's purely a read operation. Haven't seen much of an improvement from this unfortunately.
Pulling MyTable2 into memory using .ToList() to eliminate multiple calls to the db
Creating "chunks" of MyTable2 that the code can iterate over so it's not querying against the full million+ records each time.
I'd love to see if there are other techniques or magic bullets you've found to be effective in this type of scenario. Thanks!
You have a property called Hash. Use it as a hash! Store the first table in a Dictionary keyed by Hash and then iterate through the second table checking for matches in the Dictionary, again keying by Hash.
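A minimal sketch of that, assuming Hash is a string and is unique within MyTable1 (if it is not, use a HashSet<string> of hashes or a Lookup instead):

// Build an in-memory index of the smaller table, keyed by Hash.
var table1ByHash = db.MyTable1.AsEnumerable().ToDictionary(x => x.Hash);

// Stream the big table once and probe the dictionary per row.
foreach (var row in db.MyTable2)
{
    if (table1ByHash.ContainsKey(row.Hash))
    {
        // row matches something in MyTable1 - add it to the other table here
    }
}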
Or, better yet, use LINQ:
var matches = db.MyTable1.Intersect(db.MyTable2);
If you need to do a custom comparison, create an IEqualityComparer. (I assume you're doing some type of projection and that the Select(x => x) is a placeholder for the purposes of this question.)
Or, better still, this operation might be better off taking place entirely in the database in a stored procedure or view. You're essentially doing a JOIN but using C# to do it. You're incurring the cost of the round trip time from database to your client application for what could possibly all be done on the database server.
What you're doing here is performing an inner join. By using a query provider you can even ensure that this work is done on the DB side, rather than in memory within your application; you'll only be pulling down the matching results, no more:
var query = from first in db.MyTable1
            join second in db.MyTable2
                on first.Hash equals second.Hash
            select second;
I would recommend staying in SQL Server. A view or a clustered index might be the best approach.
Here are a few sources to use to read up on the subject of indexes:
http://www.c-sharpcorner.com/uploadfile/nipuntomar/clustered-index-and-non-clustered-index-in-sql-server/
http://technet.microsoft.com/en-us/library/jj835095.aspx
Should every User Table have a Clustered Index?
And here is a source on SQL Views:
http://technet.microsoft.com/en-us/library/aa214068(v=sql.80).aspx
Maybe indexing your Hash column can help. Assuming Hash is a char or varchar type, the maximum key length an index can support is 900 bytes.
CREATE NONCLUSTERED INDEX IX_MyTable2_Hash ON dbo.MyTable2(Hash);
For performance of indexing a varchar, you might want to check here
SQL indexing on varchar

Joining values in a WHERE comparison in Oracle

I have a HUGE query which I need to optimize. Before my changes it was like
SELECT [...] WHERE foo = 'var' [...]
executed 2000 times for 2000 different values of foo. We all know how slow that is. I managed to combine all those different queries into
SELECT [...] WHERE foo = 'var' OR foo = 'var2' OR [...]
Of course, there are 2000 chained comparisons. The result is a huge query that executes a few seconds faster than before, but not fast enough. I suppose the StringBuilder I am using takes a while to build the query, so the time saved by avoiding 1999 round trips is wasted in this:
StringBuilder query = new StringBuilder();
foreach (string value in vars)
    query.Append("foo = '").Append(value).Append("' OR ");
query.Remove(query.Length - 4, 4); // remove the trailing " OR "
So I would like to know if there is some workaround to optimize the building of that string, maybe joining the different values in the comparison with some SQL trick like
SELECT [...] WHERE foo = ('var' OR 'var2' OR [...])
so I can save some Append operations. Of course, any different idea that avoids that huge query altogether will be more than welcome.
@Armaggedon,
For any decent DBMS, the IN () operator should correspond to the same chain of x OR y comparisons. As for your concern about StringBuilder.Append, its implementation is very efficient and you shouldn't notice any delay with this amount of data, as long as you have a few MB to spare for its temporary internal buffer. That said, I don't think your performance problem is related to these issues.
For database tuning it's always a long shot to propose solutions without the "full picture", but I think your problem might be related to compiling such a huge dynamic SQL statement - parsing and optimizing SQL statements can consume lots of processor time and should be avoided.
Maybe you could improve the response time by moving your domain into an auxiliary indexed table. Or by moving the various checks over the same char column to a text search using INSTR functions:
-- 1. using domain table
SELECT myColumn FROM myTable WHERE foo IN (SELECT myValue FROM myDomain);
-- 2. using INSTR function
SELECT myColumn FROM myTable WHERE INSTR('allValues', foo, 1, 1) > 0;
Why not use the IN operator (see the IN operator reference on W3Schools)? It lets you combine your values in a much shorter way. You can also store the values in a temporary table, as mentioned in this post, to bypass Oracle's limit of 1000 items in an IN list.
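A sketch of building the predicate with IN lists instead of one OR per value, chunked in groups of 1000 to stay under that limit. This assumes vars is a List<string> whose values are already safe to inline; bind variables would still be preferable:

// using System.Linq; using System.Text;
var predicate = new StringBuilder();
for (int i = 0; i < vars.Count; i += 1000)
{
    if (i > 0)
        predicate.Append(" OR ");
    predicate.Append("foo IN ('")
             .Append(string.Join("','", vars.Skip(i).Take(1000)))
             .Append("')");
}
// ... WHERE (predicate goes here)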
It's been a while since I danced the Oracle dance, but I seem to remember a concept of "Bind Variables" - typically used for bulk insertions... I'm wondering if you could express the list of values as an array, and use that with IN...
Have to say - this is just an idea - I don't have time to research it further for you...

LINQ C# efficiency

I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.
What is the order of this LINQ query? Is there a more efficient way of doing this?
var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName);

IEnumerable<double> result = MyValues.Distinct();
I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quick sort and should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still just O(n log n).
Update: the LINQ Distinct() call should also be O(n).
So altogether, the Big-Oh for this thing is still O(Kn log n), where K is some constant.
Is there a more efficient way of doing this?
You could get better efficiency if you do the sort as part of the query that initializes MyDataTable, instead of sorting in memory afterwards.
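For example, a minimal sketch of pushing the work into the database when the table is filled (connectionString and dbo.MyTable are placeholders, and _varName should be validated against a whitelist of column names before being concatenated into SQL):

// using System.Data; using System.Data.SqlClient;
// Let the database do the DISTINCT and the ORDER BY, and fill the DataTable
// with the already-reduced result set.
var table = new DataTable();
using (var conn = new SqlConnection(connectionString))
using (var adapter = new SqlDataAdapter(
    "SELECT DISTINCT [" + _varName + "] FROM dbo.MyTable ORDER BY [" + _varName + "]", conn))
{
    adapter.Fill(table); // Fill opens and closes the connection itself
}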
From the comments:
I actually use MyDistinct.Distinct()
If you want distinct _varName values and you cannot do it all in the select query in the DBMS (which would be the most efficient way), you should apply Distinct before OrderBy. The order matters here.
Otherwise you would have to order all million rows before you start to filter out the duplicates. If you use Distinct first, you only need to order what remains.
var values = from r in MyDataTable.AsEnumerable()
select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
.OrderBy(d => d);
I have asked a related question recently which E. Lippert answered with a good explanation of when the order matters and when it does not:
Order of LINQ extension methods does not affect performance?
Here's a little demo where you can see that the order matters, but also that it does not matter much here, since comparing doubles is trivial for a CPU:
Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
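A sketch of that kind of comparison (not the original demo; synthetic data, and timings will vary by machine):

// using System.Diagnostics; using System.Linq;
var rnd = new Random(42);
var data = Enumerable.Range(0, 1000000)
                     .Select(_ => rnd.Next(0, 1000) / 10.0)
                     .ToList();

var sw = Stopwatch.StartNew();
var a = data.OrderBy(d => d).Distinct().ToList();
sw.Stop();
Console.WriteLine("First orderby then distinct: " + sw.Elapsed);

sw.Restart();
var b = data.Distinct().OrderBy(d => d).ToList();
sw.Stop();
Console.WriteLine("First distinct then orderby: " + sw.Elapsed);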
Your query above (LINQ) is fine if you want all of the million records and you have enough memory on a 64-bit OS.
If you look at the underlying command, the query would be translated to roughly
Select <_varName> from MyDataTable order by <_varName>
and that is as good as it gets when run in the database IDE or on the command line.
To give you a short answer regarding performance:
Put in a where clause if you can (on columns that are indexed).
Ensure that the user can only choose columns (_varName) that are indexed. Imagine the DB trying to sort a million records on an unindexed column; that is evidently slow, but it is LINQ that ends up getting the bad press.
Ensure that (if possible) MyDataTable is initialised with only the records that are of value (again, based on a where clause).
Profile your underlying query.
If possible, create stored procedures (debatable). You can create an entity model which includes stored procedures as well.
It may be fast today, but as the tablespace grows, and if your data is not ordered (indexed), that is where things get slower (even if you had a good LINQ expression).
Hope this helps

NHibernate: select N random unique records

This has been asked many times before, but I couldn't find a conclusive answer. Can anyone assist?
I want to be able to query my database (this must work on both SQL Server AND MySQL) using NHibernate in order to get X rows. These rows need to be unique. I am selecting 250 out of 350, so my odds are high that I will get duplicates (because random doesn't imply unique).
This needs to work for both MySQL and SQL Server.
Since I am selecting over 70% of the table already, I don't mind selecting all the records and using LINQ to pull out the 250 unique values (if this is possible).
Whatever has the least impact on my system overhead.
Thanks!
You can order your results using RAND() (MySQL) or NEWID() (SQL Server) and select the top 250 - the results will be distinct and random. In terms of efficiency, this will be better than retrieving all 350 results and then ordering them randomly in code - IF YOU IGNORE CACHING.
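A minimal sketch of that on SQL Server using a native SQL query through NHibernate (MyClass and MyTable are placeholders; on MySQL you would use ORDER BY RAND() with LIMIT 250 instead of TOP/NEWID()):

var randomRows = Session
    .CreateSQLQuery("SELECT TOP 250 * FROM MyTable ORDER BY NEWID()")
    .AddEntity(typeof(MyClass))
    .List<MyClass>();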
However, if you do cache the results, it will probably be significantly more efficient over time to retrieve the 350 results into a cache (preferably using NHibernate's second-level cache) and then sort them in code using something like:
var rand = new Random();
var results = Session.CreateCriteria<MyClass>()
                     .SetCacheable(true)
                     .List<MyClass>()           // typed list so the LINQ extensions apply
                     .OrderBy(x => rand.Next()) // shuffle in memory
                     .Take(250);                // then keep the 250 you need
You are looking for the distinct, add .Distinct() before the .ToList();
I think you've answered yourself.
A full-table query for 350 records is going to put less strain on the server than a complex randomizing and paging query for 250 (that is usually done by appending a uniqueidentifier to rows in an inner query and then sorting by it; you can't do that with LINQ).
You can even cache the results and then randomize the list in memory each time to get 250 different records.
