Slow LINQ Performance on DataTable Where Clause?

Slow LINQ Performance on DataTable Where Clause? - c#

I'm dumping a table out of MySQL into a DataTable object using MySqlDataAdapter. Database input and output is doing fine, but my application code seems to have a performance issue I was able to track down to a specific LINQ statement.
The goal is simple, search the contents of the DataTable for a column value matching a specific string, just like a traditional WHERE column = 'text' SQL clause.
Simplified code:
foreach (String someValue in someList) {
String searchCode = OutOfScopeFunction(someValue);
var results = emoteTable.AsEnumerable()
.Where(myRow => myRow.Field<String>("code") == searchCode)
.Take(1);
if (results.Any()) {
results.First()["columnname"] = 10;
}
}
This simplified code is executed thousands of times, once for each entry in someList. When I run Visual Studio Performance Profiler I see that the "results.Any()" line is highlighted as consuming 93.5% of the execution time.
I've tried several different methods for optimizing this code, but none have improved performance while keeping the emoteTable DataTable as the primary source of the data. I can convert emoteTable to Dictionary<String, DataRow> outside of the foreach, but then I have to keep the DataTable and the Dictionary in sync, which while still a performance improvement, feels wrong.
Three questions:
Is this the proper way to search for a value in a DataTable (equivalent of a traditional SQL WHERE clause)? If not, how SHOULD it be done?
Addendum to 1, regardless of the proper way, what is the fastest (execution time)?
Why does the results.Any() line consume 90%+ resources? In this situation it makes more sense that the var results line should consume the resources, after all, it's the line doing the actual search, right?
Thank you for your time. If I find an answer I shall post it here as well.

Any() is taking 90% of the time because the result is only executed when you call Any(). Before you call Any(), the query is not actually made.
It would seem the problem is that you first fetch entire table into the memory and then search. You should instruct your database to search.
Moreover, when you call results.First(), the whole results query is executed again.
With deferred execution in mind, you should write something like
var result = emoteTable.AsEnumerable()
.Where(myRow => myRow.Field<String>("code") == searchCode)
.FirstOrDefault();
if (result != null) {
result["columnname"] = 10;
}

What you have implemented is pretty much join :
var searchCodes = someList.Select(OutOfScopeFunction);
var emotes = emoteTable.AsEnumerable();
var results = Enumerable.Join(emotes, searchCodes, e=>e, sc=>sc.Field<String>("code"), (e, sc)=>sc);
foreach(var result in results)
{
result["columnname"] = 10;
}
Join will probably optimize the access to both lists using some kind of lookup.
But first thing I would do is to completely abandon idea of combining DataTable and LINQ. They are two different technologies and trying to assert what they might do inside when combined is hard.
Did you try doing raw UPDATE calls? How many items are you expecting to update?

Related

Optimize the number of accesses to a database when working with IQueryable<T> and custom functions

In my C# Class Library project I have a method that needs to compute some statistics GetFaultRate, that, given a date, computes the number of products with faults over the number of products produced.
float GetFaultRate(DateTime date)
{
var products = GetProducts(date);
var faultyProducts = GetFaultyProducts(date);
var rate = (float) (faultyProducts.Count() / products.Count());
return rate;
}
Both methods, GetProducts and GetFaultyProducts take the data from a Repository class _productRepository.
IEnumerable<Product> GetProducts(DateTime date)
{
var products = _productRepository.GetAll().ToList();
var periodProducts = products.Where(p => CustomFunction(p.productionDate) == date);
return periodProducts;
}
IEnumerable<Product> GetFaultyProducts(DateTime date)
{
var products = _productRepository.GetAll().ToList();
var periodFaultyProducts = products.Where(p => CustomFunction(p.ProductionDate) == date && p.Faulty == true);
return periodFaultyProducts;
}
Where GetAll has signature:
IQueryable<Product> GetAll();
The products in the database are many and it takes a lot of time to retrieve them and convert ToList(). I need to enumerate the collection since any custom function such as CustomFunction, cannot be executed in a IQueryable<T>.
My application gets stuck for a long time before obtaining the fault rate. I guess it is because of the large number of objects to be retrieved. I can indeed remove the two functions GetProducts and GetFaultyProducts and implement the logic inside GetFaultRate. However since I have other functions that use GetProducts and GetFaultyProducts, with the latter solution I have only one access to the database but a lot of duplicate code.
What can be a good compromise?

First off, don't convert the IQueryable to a list. It forces the entire data set to be brought into memory all at once, rather than just calling Where directly on the query which will allow you to filter the data as it comes in. This will substantially decrease your memory footprint, and (very) marginally increase the runtime speed. If you need to convert an IQueryable to an IEnumerable so that the Where isn't executed by the database simply use AsEnumerable.
Next, getting all of the data is something you should avoid if at all possible, especially multiple times. You'd need to show us what your date function does, but it's possible that it is something that could be done on the database. Any filtering you can do at all at the database will substantially increase performance.
Next, you really don't need two queries here. The second query is just a subset of the first, so if you know that you'll always be using both queries then you should just just perform the first query, bring the results into memory (i.e. with a ToList that you store) and then use a Where on that to filter the results further. This will avoid another database trip as well as all of the data processing/filtering.
If you won't always be using both queries, but will sometimes use just one or the other, then you can improve the second query by filtering out on Faulty before getting all items. Add Where(p => p.Faulty) before you call AsEnumerable and filter on the date information after calling AsEnumerable (and that's if you can't convert any of the date filtering to filtering that can be done at the database).
It appears that in the end you only need to compute the ratio of items that are faulty as compared to the total. That can easily be done with a single query, rather than two.
You've said that Count is running really slowly in your code, but that's not really true. Count is simply the method that is actually enumerating your query, whereas all of the other methods were simply building the query, not executing it. However, you can cut your performance costs drastically by combining the queries entirely.
var lookup = _productRepository.GetAll()
.AsEnumerable()//if at all possible, try to re-write the `Where`
//to be a valid SQL query so that you don't need this call here
.Where(p => CustomFunction(p.productionDate) == date)
.ToLookup(product => product.Faulty);
int totalCount = lookup[true].Count() + lookup[false].Count();
double rate = lookup[true].Count() / (double) totalCount;

var products = GetProducts(date);
var periodFaultyProducts = (from p in products.AsParallel()
where p.Faulty == true
select p).AsEnumerable();

You need to reduce the number of database requests. ToList, First, FirstOrDefault, Any, Take and Count forces your query to run at a database. As Servy pointed out, AsEnumerable converts your query from IQueryable to IEnumerable. If you have to find subsets you can use Where.

Improving Linq query

I have the following query:
if (idUO > 0)
{
query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
var dependencyIds = dependencies.Select(d => d.Id).ToList();
query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, its OK, the query takes less than one 1 second, but when I try to access the page 30000, the query takes more than 20 seconds. There is a way in the linq query, that I can improve the performance of the query? Or only in the database level? And in the database level, for this kind of query, which is the best way to improve the performance?

There is no much space here, imo, to make things better (at least looking on the code provided).
When you're trying to achieve a good performance on such numbers, I would recommend do not use LINQ at all, or at list use it on the stuff with smaler data access.
What you can do here, is introduce paging of that data on DataBase level, with some stored procedure, and invoke it from your C# code.

1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.

I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about Indexes that you should add.. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)

I think you'll find that the bottleneck is occurring at the database. Here's why;
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.

Manipulating entity framework to eliminate round trips to the database

Let's say I have the following bit of code (which I know could be easily modified to perform better, but it illustrates what I want to do)
List<Query> l = new List<Query>;
// Query is a class that doesn't exist, it represents an EF operation
foreach (var x in Xs)
{
Query o = { context.someEntity.Where(s=>s.Id==x.Id).First();}
// It wouldn't execute it, this is pseudo code for delegate/anonymous function
l.Add(o)
}
Then send this list of Query to EF, and have it optimize so that it does the least amount of round trips possible. Let's call it BatchOptimizeAndRun; you would say
var results = BatchOptimizeAndRun(l);
And knowing what it knows from the schema it would reduce the overall query to an optimal version and execute that and place the read results in an array.
I hope I've described what I'm looking for accurately and more importantly that it exists.
And if I sound like a rambling mad man, let's pretend this question never existed.

I'd have to echo Mr. Moore's advice, as I too have spent far too long constructing a linq-to-entities query of monolithic proportions only to find that I could have made a stored procedure in less time that was easier to read and faster to execute. That being said in your example...
List<int> ids = Xs.Select(x => x.Id).ToList();
var results = context.someEntity.Where(s => ids.Contains(s.Id)).ToList();
I believe this will compile to something like
SELECT
*
FROM
someEntity
WHERE
Id IN (ids) --Where ids is a comma separated list of INT
Which will provide you with what you need.

Entity Framework - behind the scenes: DataReaders and connection life period

Another question regarding EF:
I was wondering what's going behind the scenes when iterating over a query result.
For example, check out the following code:
var activeSources = from e in entitiesContext.Sources
where e.IsActive
select e;
and then:
foreach (Source currSource in allSources)
{
code based on the current source...
}
Important note: Each iteration takes a while to complete (from 1 to 25 seconds).
Now, I assume EF is based on DataReaders for maximum efficiency, so based on that assumption, I figure that in the above case, the Database connection will be kept open until I finish iterating over the results, which will be a very long time (when talking in terms of code), which is something I obviously don't want.
Is there a way to fetch the entire data like I would've done with plain old ADO.NET DataAdapters, DataSets and the fill() method instead of using DataReaders?
Or maybe i'm way off with my assumptions?
In any case I would've loved to be pointed to a good source explaining this if available.
Thanks,
Mikey

If you want to get all of the data up front, similar to Fill(), you need to force the query to execute.
var activeSources = from e in entitiesContext.Sources
where e.IsActive
select e;
var results = activeSources.ToList();
After ToList() is called you will have the data and be disconnected from the database.

If you want to return all results at once use .ToList(); Then deferred execution won't happen.
var activeSources = (from e in entitiesContext.Sources
where e.IsActive
select e).ToList();

Can you get DataReader-like streaming using Linq-to-SQL?

I've been using Linq-to-SQL for quite awhile and it works great. However, lately I've been experimenting with using it to pull really large amounts of data and am running across some issues. (Of course, I understand that L2S may not be the right tool for this particular kind of processing, but that's why I'm experimenting - to find its limits.)
Here's a code sample:
var buf = new StringBuilder();
var dc = new DataContext(AppSettings.ConnectionString);
var records = from a in dc.GetTable<MyReallyBigTable>() where a.State == "OH" select a;
var i = 0;
foreach (var record in records) {
buf.AppendLine(record.ID.ToString());
i += 1;
if (i > 3) {
break; // Takes forever...
}
}
Once I start iterating over the data, the query executes as expected. When stepping through the code, I enter the loop right away which is exactly what I hoped for - that means that L2S appears to be using a DataReader behind the scenes instead of pulling all the data first. However, once I get to the break, the query continues to run and pull all the rest of the records. Here are my questions for the SO community:
1.) Is there a way to stop Linq-to-SQL from finishing execution of a really big query in the middle the way you can with a DataReader?
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the DataContext from filling up with change tracking information for every object returned. Basically, instead of filling up memory, can I do a large query with short object lifecycles the way you can with DataReader techniques?
I'm okay if this isn't functionality built-in to the DataContext itself and requires extending the functionality with some customization. I'm just looking to leverage the simplicity and power of Linq for large queries for nightly processing tasks instead of relying on T-SQL for everything.

1.) Is there a way to stop Linq-to-SQL from finishing execution of a really
big query in the middle the way you
can with a DataReader?
Not quite. Once the query is finally executed the underlying SQL statement is returning a result set of matching records. The query is deferred up till that point, but not during traversal.
For your example you could simply use records.Take(3) but I understand your actual logic to halt the process might be external to SQL or not easily translatable.
You could use a combination approach by building a strongly typed LINQ query then executing it with old fashioned ADO.NET. The downside is you lose the mapping to the class and have to manually deal with the SqlDataReader results. An example of this is shown below:
var query = from c in Customers
where c.ID < 15
select c;
using (var command = dc.GetCommand(query))
{
command.Connection.Open();
using (var reader = command.ExecuteReader())
{
int i = 0;
while (reader.Read())
{
Customer c = new Customer();
c.ID = reader.GetInt32(reader.GetOrdinal("ID"));
c.Name = reader.GetString(reader.GetOrdinal("Name"));
Console.WriteLine("{0}: {1}", c.ID, c.Name);
i++;
if (i > 3)
break;
}
}
}
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the
DataContext from filling up with
change tracking information for every
object returned.
If your intention for a particular query is to use it for read-only purposes then you could disable object tracking to increase performance by setting the DataContext.ObjectTrackingEnabled property to false:
using (var dc = new MyDataContext())
{
dc.ObjectTrackingEnabled = false;
// do stuff
}
You can also read this MSDN topic: How to: Retrieve Information As Read-Only (LINQ to SQL).

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.