Enumerating large data sets multiple times using LINQ to Entities - c#

Let's say I have two tables in my SQL database.
1. A medium sized table with thousands of records called MyTable1
2. A large table with millions of records (and growing by the day) called MyTable2
MyTable1 and MyTable2 both have a property called Hash, and rows from the two tables can share the same Hash value.
I'm looking for the most efficient way to use LINQ to Entities to iterate over MyTable1, find all records in MyTable2 with a matching Hash, and save them into another table. Here's a simplified view of what the code looks like.
using (var db = new context()) {
    var myTable1Records = db.MyTable1.Select(x => x);
    foreach (var record in myTable1Records) {
        var matches = db.MyTable2.Where(y => y.Hash.Equals(record.Hash)).Select(y => y);
        foreach (var match in matches) {
            // Add match to another table
        }
    }
}
I'm seeing the performance of this code slow down significantly as MyTable2 grows larger every day. A few ideas I'm experimenting with for efficiently handling this type of scenario are:
1. Setting MergeOption.NoTracking on db.MyTable2, since it's purely a read operation. Haven't seen much of an improvement from this, unfortunately.
2. Pulling MyTable2 into memory using .ToList() to eliminate multiple calls to the db.
3. Creating "chunks" of MyTable2 that the code can iterate over, so it's not querying against the full million+ records each time.
I'd love to see if there are other techniques or magic bullets you've found to be effective in this type of scenario. Thanks!

You have a property called Hash. Use it as a hash! Store the first table in a Dictionary keyed by Hash and then iterate through the second table checking for matches in the Dictionary, again keying by Hash.
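A minimal sketch of that idea (assuming Hash is a string; note the large table is still streamed to the client once, but without issuing a query per MyTable1 row):
using (var db = new context()) {
    // Key the smaller table by Hash in memory; ToLookup also tolerates duplicate hashes.
    var byHash = db.MyTable1.AsEnumerable().ToLookup(x => x.Hash);

    // Stream the large table once and probe the lookup locally (O(1) per row).
    foreach (var candidate in db.MyTable2) {
        if (byHash.Contains(candidate.Hash)) {
            // Add candidate (and/or the matching MyTable1 rows in byHash[candidate.Hash])
            // to the destination table here.
        }
    }
}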
Or, better yet, use LINQ:
var matchingHashes = db.MyTable1.Select(x => x.Hash).Intersect(db.MyTable2.Select(y => y.Hash));
If you need to do a custom comparison, create an IEqualityComparer. (I assume you're doing some type of projection and that the Select(x => x) is a placeholder for the purposes of this question.)
Or, better still, this operation might be better off taking place entirely in the database in a stored procedure or view. You're essentially doing a JOIN but using C# to do it. You're incurring the cost of the round trip time from database to your client application for what could possibly all be done on the database server.
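For example, if you are on DbContext, the whole copy can be pushed to the server as one set-based statement (a sketch only; dbo.Matches and its column list are placeholder names, and ObjectContext.ExecuteStoreCommand does the same job if you are on ObjectContext):
using (var db = new context()) {
    // One round trip; both the join and the insert happen on the server.
    db.Database.ExecuteSqlCommand(@"
        INSERT INTO dbo.Matches (Table1Id, Table2Id, Hash)
        SELECT t1.Id, t2.Id, t2.Hash
        FROM dbo.MyTable1 AS t1
        INNER JOIN dbo.MyTable2 AS t2 ON t2.Hash = t1.Hash");
}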

What you're doing here is performing an inner join. By using a query provider you can even ensure that this work is done on the DB side, rather than in memory within your application; you'll only be pulling down the matching results, no more:
var query = from first in db.MyTable1
            join second in db.MyTable2
                on first.Hash equals second.Hash
            select second;
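If the end goal is still to copy the matches into a third table, the joined query can feed that insert directly. A rough sketch, assuming an EF6 DbContext with a destination set db.MyTable3 and a placeholder entity MyTable3Row (both names invented here):
var matches = query.ToList();   // only the joined rows cross the wire
db.MyTable3.AddRange(matches.Select(m => new MyTable3Row { Hash = m.Hash /* map the other columns */ }));
db.SaveChanges();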

I would recommend staying in SQL Server. A view or a clustered index might be the best approach.
Here are a few sources to use to read up on the subject of indexes:
http://www.c-sharpcorner.com/uploadfile/nipuntomar/clustered-index-and-non-clustered-index-in-sql-server/
http://technet.microsoft.com/en-us/library/jj835095.aspx
Should every User Table have a Clustered Index?
And here is a source on SQL Views:
http://technet.microsoft.com/en-us/library/aa214068(v=sql.80).aspx
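As a rough illustration of the view idea (names invented; this would normally run once as a deployment step rather than from application code):
// A view that pre-joins the two tables on Hash; pair it with an index on MyTable2.Hash.
// dbo.vwHashMatches and the column list are placeholders.
db.Database.ExecuteSqlCommand(@"
    CREATE VIEW dbo.vwHashMatches AS
    SELECT t1.Id AS Table1Id, t2.Id AS Table2Id, t2.Hash
    FROM dbo.MyTable1 AS t1
    INNER JOIN dbo.MyTable2 AS t2 ON t2.Hash = t1.Hash");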

Maybe indexing your Hash column can help. Assuming Hash is a char or varchar type, the maximum key length an index can support is 900 bytes.
CREATE NONCLUSTERED INDEX IX_MyTable2_Hash ON dbo.MyTable2(Hash);
For the performance implications of indexing a varchar, you might want to check here:
SQL indexing on varchar

Related

Calling SKIP() in code or using TOP in function

I'm coding an application with Entity Framework in which I rely heavily on user-defined functions.
I have a question about the best (most optimized) way to limit and page my result sets. Basically, I am wondering whether these two options are equivalent or whether one is preferred performance-wise.
Option 1.
//C#
var result1 = _DB.fn_GetData().OrderBy(x => x.Id).Skip(page * 100).Take(100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
Option 2.
//C#
var result2 = _DB.fn_GetData(page: 0, size: 100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
ORDER BY Id
OFFSET (@size * @page) ROWS FETCH NEXT @size ROWS ONLY
To me these seem to be producing about the same result, but maybe I am missing some key aspect.
You'll have to be aware of when your LINQ statement is an IEnumerable and when it is an IQueryable. As long as your statement is an IQueryable<...>, the software will try to translate it into SQL and let your database execute the query. Once it has lost the IQueryable and become an implementation of IEnumerable, the data has been brought into local memory, and all further LINQ operations are performed by your process, not by the database.
If you use your debugger, you will see that fn_GetData returns an IEnumerable. This means that the result of fn_GetData is brought into local memory, and your OrderBy etc. is performed by your process.
Usually it is much more efficient to move only the records that you will actually use into local memory. Besides: do not fetch complete records, but only the properties that you plan to use. So in this case I guess you'll have to create an extended version of fn_GetData that returns only the values you plan to use.
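To illustrate the difference (a sketch with made-up names; _DB.SomeTable stands in for any DbSet, since fn_GetData itself already returns an IEnumerable):
// Composed on IQueryable: ordering and paging are translated to SQL and run in the database.
var serverSide = _DB.SomeTable
    .OrderBy(x => x.Id)
    .Skip(page * 100)
    .Take(100)
    .Select(x => new { x.Id, x.Name })   // fetch only the columns you need
    .ToList();

// After AsEnumerable() (or when a method returns IEnumerable, as fn_GetData does),
// everything below runs in your process against rows already pulled into memory.
var clientSide = _DB.SomeTable
    .AsEnumerable()
    .OrderBy(x => x.Id)
    .Skip(page * 100)
    .Take(100)
    .ToList();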
I suggest the second option, because SQL Server can do this faster than C# methods.
With your first option you pull all of the records in the table and loop through them; with the second option SQL Server does that work for you, and you only get back what you want.
You should apply the limiting and WHERE clauses in the database as far as possible (how much it helps depends on the table's indexes). For the first example:
var result1 = _DB.fn_GetData().OrderBy(x => x.Id).Skip(page * 100).Take(100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
The whole table is retrieved from the database into memory, which kills performance and reliability. I strongly advise against it. You should consider putting some limits in place so that records are filtered in the database. So the second option is the better approach in this case.

LINQ query to Azure SQL database timing out

I'm querying my SQL database, which is in Azure (my web app is on Azure as well).
Every time I perform this particular query the behaviour changes: sometimes it times out, sometimes it works perfectly, and sometimes it takes extremely long to load.
I have noted that I am using the ToList method here to enumerate the query, and I suspect that's where it is degrading.
Is there any way I can fix this or make it better, or should I just use native SQL to execute my query?
I should also note that in my web.config the database connection timeout is set to 30 seconds. Would changing this have any performance benefit?
Here's the suspect code:
case null:
    lstQueryEvents = db.vwTimelines.Where(s => s.UserID == UserId)
        .Where(s => s.blnHide == false)
        .Where(s => s.strEmailAddress.Contains(strSearch) || s.strDisplayName.Contains(strSearch) || s.strSubject.Contains(strSearch))
        .OrderByDescending(s => s.LatestEventTime)
        .Take(intNumRecords)
        .ToList();
    break;
It's basically querying for 50 records; I don't understand why it sometimes times out.
Here are some tips:
Make sure that your SQL data types match the types in your model
Judging by your code, types should be something like this:
UserID should be int (cannot tell for sure by looking at code);
blnHide should be bit;
strEmailAddress should be nvarchar;
strDisplayName should be nvarchar;
strSubject should be nvarchar;
Make use of indexes
You should create Non-Clustered Indexes on columns that you use to filter and order data.
In order of importance:
LatestEventTime, as you order ALL data by this column;
UserID, as you filter out most of the data by this column;
blnHide, as you filter out part of the data by this column;
Make use of indexes for text lookup
You could make use of indexes for text lookup if you change your filter behaviour slightly and search only at the start of the column value.
To achieve that:
replace .Contains() with .StartsWith(), as that allows an index to be used (sketched after this list);
create a Non-Clustered Index on the strEmailAddress column;
create a Non-Clustered Index on the strDisplayName column;
create a Non-Clustered Index on the strSubject column.
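The LINQ side of that change would look like this (same shape as the query in the question, just with StartsWith):
lstQueryEvents = db.vwTimelines
    .Where(s => s.UserID == UserId && s.blnHide == false)
    .Where(s => s.strEmailAddress.StartsWith(strSearch)
             || s.strDisplayName.StartsWith(strSearch)
             || s.strSubject.StartsWith(strSearch))   // translates to LIKE 'term%', which is index-friendly
    .OrderByDescending(s => s.LatestEventTime)
    .Take(intNumRecords)
    .ToList();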
Try out full-text search
Microsoft has only recently introduced full-text search in Azure SQL Database. You can use it to find rows matching a partial string. This is a bit complicated to achieve using EF, but it is certainly doable.
Here are some links to get you started:
Entity Framework, Code First and Full Text Search
https://azure.microsoft.com/en-us/blog/full-text-search-is-now-available-for-preview-in-azure-sql-database/
string.Contains(...) is converted to a WHERE ... LIKE '%...%' SQL statement, which is very expensive. Try to reshape your query to avoid it.
Also, Azure SQL has its own limits on query duration (5 seconds, as far as I remember, but better check the SLA), so it will generally ignore your web.config setting if it is longer.

SQL Linq .Take() latest 20 rows from HUGE database, performance-wise

I'm using Entity Framework 6 and I make LINQ queries from an ASP.NET server to an Azure SQL database.
I need to retrieve the latest 20 rows that satisfy a certain condition.
Here's a rough example of my query
using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
    DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));

    IQueryable<Post> postQueryable =
        from postDbEntry in postHubDbContext.PostDbEntries
        orderby postDbEntry.Id descending
        where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
        select new Post(postDbEntry);

    postQueryable = postQueryable.Take(20);
    IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(post => post.DatePosted);

    return postOrderedQueryable.ToList();
}
The question is: what if I literally have a billion rows in my database? Will that query brutally select all the rows that meet the condition and then take 20 of them? Or will it be smart enough to realise that I only want 20 rows and select only those?
Basically, how do I make this query work efficiently against a database that has a billion rows?
According to http://msdn.microsoft.com/en-us/library/bb882641.aspx the Take() method has deferred streaming execution, as does the select statement. This means it should be equivalent to TOP 20 in SQL, and SQL Server will fetch only 20 rows from the database.
This link: http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx shows that Take has a direct translation in LINQ to SQL.
So the only performance gains to be made are in the database. As @usr suggested, you can use indexes to increase performance. Storing the table in sorted order also helps a lot (which is likely your case, as you sort by Id).
Why not try it? :) You can inspect the SQL and see what it generates, and then look at the execution plan for that SQL to see whether it scans the entire table.
Check out this question for more details:
How do I view the SQL generated by the Entity Framework?
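In EF6, for instance, you can see the generated SQL like this (a quick sketch; the exact text depends on your EF version and model):
// Log everything EF sends to the server.
postHubDbContext.Database.Log = Console.Write;

// Or print the SQL for a single query: ToString() on a DbContext-backed IQueryable
// returns the generated SELECT - look for the TOP (20) in it.
var sql = postHubDbContext.PostDbEntries.OrderByDescending(p => p.Id).Take(20).ToString();
Console.WriteLine(sql);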
This will be hard to get really fast. You want an index to give you the sort order on Id, but you want a different (spatial) index for efficient filtering. It is not possible to create one index that fulfills both goals efficiently.
Assume both indexes exist:
If the filter is very selective, expect SQL Server to "select" all rows where the filter is true, then sort them, then give you the top 20. Imagine there are only 21 rows that pass the filter - then this strategy is clearly very efficient.
If the filter is not at all selective, SQL Server will instead traverse the table ordered by Id, test each row it comes across and output the first 20. Imagine the filter applies to all rows - then SQL Server can just output the first 20 rows it sees. Very fast.
So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If that is your situation, this question requires further thought. You probably need more than a clever indexing strategy; you need app changes.
Btw, you don't need an index on DatePosted. The sort by DatePosted is only applied after the set has been limited to 20 rows, and you don't need an index to sort 20 rows.

Joining values in a WHERE comparison in Oracle

I have a HUGE query which I need to optimize. Before my change it looked like
SELECT [...] WHERE foo = 'var' [...]
executed 2000 times for 2000 different values of foo. We all know how slow that is. I managed to combine all those separate queries into
SELECT [...] WHERE foo = 'var' OR foo = 'var2' OR [...]
Of course, there are 2000 chained comparisons. The result is a huge query that executes a few seconds faster than before, but not fast enough. I suppose the StringBuilder I am using takes a while to build the query, so the time saved by avoiding 1999 round trips is wasted here:
StringBuilder query = new StringBuilder();
foreach (string value in vars)
    query.Append("foo = '").Append(value).Append("' OR ");
query.Remove(query.Length - 4, 4); // remove the trailing " OR "
So I would like to know if there is some workaround to optimize the building of that string, maybe combining the different values in the comparison with some SQL trick like
SELECT [...] WHERE foo = ('var' OR 'var2' OR [...])
so I can save some Append operations. Of course, any different idea that avoids this huge query altogether would be more than welcome.
@Armaggedon,
For any decent DBMS, the IN () operator should be equivalent to the corresponding chain of x OR y comparisons. As for your concern about StringBuilder.Append: its implementation is very efficient, and you shouldn't notice any delay for this amount of data as long as you have a few MB to spare for its temporary internal buffer. That said, I don't think your performance problem is related to these issues.
For database tuning it's always a long shot to propose solutions without the "full picture", but I think your problem might be related to compiling such a huge dynamic SQL statement -- parsing and optimizing SQL statements can consume lots of processor time and should be avoided.
Maybe you could improve the response time by moving your value list into an auxiliary indexed table. Or by converting the many checks over the same char column into a text search using the INSTR function:
-- 1. using domain table
SELECT myColumn FROM myTable WHERE foo IN (SELECT myValue FROM myDomain);
-- 2. using INSTR function
SELECT myColumn FROM myTable WHERE INSTR('allValues', foo, 1, 1) > 0;
Why not use the IN operator (see the IN operator on W3Schools)? It lets you combine your values in a much shorter way. You can also store the values in a temporary table, as mentioned in this post, to bypass Oracle's limit of 1000 items in an IN list.
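A hedged sketch of the parameterised, chunked IN variant (assuming ODP.NET's Oracle.ManagedDataAccess.Client, an open OracleConnection named conn, and the 2000 values in vars; each chunk stays under Oracle's 1000-expression limit and uses bind variables so the statement can be reused):
var results = new List<string>();
foreach (var chunk in vars.Select((v, i) => new { v, i })
                          .GroupBy(x => x.i / 1000, x => x.v))
{
    // Build ":p0, :p1, ..." placeholders for this chunk.
    var names = chunk.Select((_, i) => ":p" + i).ToList();
    var sql = "SELECT myColumn FROM myTable WHERE foo IN (" + string.Join(", ", names) + ")";

    using (var cmd = new OracleCommand(sql, conn))
    {
        int i = 0;
        foreach (var value in chunk)
            cmd.Parameters.Add(new OracleParameter(names[i++], value));

        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                results.Add(reader.GetString(0)); // assumes myColumn is a string column
    }
}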
It's been a while since I danced the Oracle dance, but I seem to remember a concept of "Bind Variables" - typically used for bulk insertions... I'm wondering if you could express the list of values as an array, and use that with IN...
Have to say - this is just an idea - I don't have time to research it further for you...

Optimizing Linq: IN operator with large data

How to optimize this query?
// This will return data ranging from 1 to 500,000 records
List<string> products = GetProductsNames();

List<Product> actualProducts = (from p in db.Products
                                where products.Contains(p.Name)
                                select p).ToList();
This code takes around 30 seconds to fill actualProducts when I send a list of 44,000 strings; I don't know what it would take for 500,000 records. :(
Is there any way to tweak this query?
NOTE: it takes almost this much time on every call (ignoring the first slow EDMX call).
An IN query on 500,000 records is always going to be a pathological case.
Firstly, make sure there is an index (probably non-clustered) on Name in the database.
Ideas (both involve dropping to ADO.NET):
use a "table valued parameter" to pass in the values, and INNER JOIN to the table-valued-parameter in TSQL
alternatively, create a table of the form ProductQuery with columns QueryId (which could be uniqueidentifier) and Name; invent a guid to represent your query (Guid.NewGuid()), and then use SqlBulkCopy to push the 500,000 pairs (the same guid on each row; different guids are different queries) into the table really quickly; then use TSQL to do an INNER JOIN between the two tables
Actually, these are very similar, but the first one is probably the first thing to try. Less to set up.
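A sketch of the table-valued-parameter route (this assumes a user-defined table type dbo.NameList with a single Name column already exists in the database, that the EF set db.Products maps to dbo.Products, and that you drop to plain SqlClient for this one query):
// Build the incoming names as an in-memory table.
var names = new DataTable();
names.Columns.Add("Name", typeof(string));
foreach (var name in products)
    names.Rows.Add(name);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    @"SELECT p.* FROM dbo.Products AS p
      INNER JOIN @names AS n ON n.Name = p.Name", conn))
{
    var tvp = cmd.Parameters.AddWithValue("@names", names);
    tvp.SqlDbType = SqlDbType.Structured;
    tvp.TypeName = "dbo.NameList";   // the user-defined table type

    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // materialise each matching product row here
        }
    }
}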
If you don't want to do this in the database, you could try something with a Dictionary<string, string>.
If I am not wrong, I suspect products.Contains(p.Name) is expensive, since it is an O(n) operation. Try changing your GetProductsNames return type to Dictionary<string, string>, or convert the List to a Dictionary:
Dictionary<string, string> productsDict = products.ToDictionary(x => x);
So you have a dictionary in hand; now rewrite the query as below:
List<Product> actualProducts = (from p in db.Products
                                where productsDict.ContainsKey(p.Name)
                                select p).ToList();
This will help you improve performance a lot (the disadvantage is that you allocate double the memory; the advantage is performance). I tested this with very large samples with good results. Try it out.
Hope this helps.
You could also take a hashing approach, using the name column as the value that gets passed to the hashing function; then you could iterate the 500K set, subjecting each name to the hashing function and testing for existence in your local hash store. This would require more code than the LINQ approach, but it might be considerably faster than repeated calls to the back end doing inner joins.
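Roughly like this (a sketch of that idea using a HashSet, which does the hashing for you; it assumes the product names fit comfortably in memory and still needs one query to bring the Name column down):
// One query that returns only the Name column.
var knownNames = new HashSet<string>(db.Products.Select(p => p.Name));

// Probe the 500K incoming names locally; each lookup is O(1).
var matchingNames = products.Where(name => knownNames.Contains(name)).ToList();

// If the full Product rows are needed, fetch them afterwards in batches keyed by matchingNames.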
