How to deal with large result sets with Linq to Entities? - c#

I have a fairly complex linq to entities query that I display on a website. It uses paging so I never pull down more than 50 records at a time for display.
But I also want to give the user the option to export the full results to Excel or some other file format.
My concern is that there could potentially be a large # of records all being loaded into memory at one time to do this.
Is there a way to process a LINQ result set one record at a time, the way you could with a DataReader, so that only one record is really kept in memory at a time?
I've seen suggestions that if you enumerate over the LINQ query with a foreach loop, the records will not all be read into memory at once and will not overwhelm the server.
Does anyone have a link to something I could read to verify this?
I'd appreciate any help.
Thanks

Set MergeOption.NoTracking on the query (since this is a read-only operation). If you are using the same ObjectContext for saving other data, detach each object from the context once you are done with it.
How to detach:
foreach (var entity in query)
{
    // do something with entity
    objectContext.Detach(entity);
}
Edit: If you are using NoTracking option, there is no need to detach
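For the read-only export case, a minimal sketch of the NoTracking approach might look like this (the MyEntities context, Products set, IsActive column, and WriteToExportFile helper are assumptions, not from the question; on the newer DbContext API the rough equivalent is calling AsNoTracking() on the query):

using System.Data.Objects;   // MergeOption lives here in the ObjectContext-based EF API
using System.Linq;

using (var context = new MyEntities())
{
    // read-only: don't let the context track every entity it materializes
    context.Products.MergeOption = MergeOption.NoTracking;

    var query = context.Products.Where(p => p.IsActive);

    foreach (var product in query)    // entities stream through one at a time
    {
        WriteToExportFile(product);   // hypothetical export helper
    }
}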
Edit2: I wrote to Matt Warren about this scenario. And am posting relevant private correspondences here, with his approval
The results from SQL Server may not even be all produced by the server yet. The query has started on the server and the first batch of results are transferred to the client, but no more are produced (or they are cached on the server) until the client requests to continue reading them. This is what is called 'firehose cursor' mode, or sometimes referred to as streaming. The server is sending them as fast as it can, and the client is reading them as fast as it can (your code), but there is a data transfer protocol underneath that requires acknowledgement from the client to continue sending more data.
Since IQueryable inherits from IEnumerable, I believe the underlying query sent to the server would be the same. However, when we call ToList(), the data reader used by the underlying connection materializes every row into an object straight away; all of those objects are loaded into the app domain at once and cannot be disposed of until the list is released, so you might run out of memory.
When you use foreach over the IEnumerable, the data reader reads the SQL result set one row at a time; each object is created, used, and then released. The underlying connection might receive data in chunks and might not send an acknowledgement back to SQL Server until all the chunks are read. Hence you will not run into an 'out of memory' exception.
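To tie this back to the export scenario in the question, here is a rough sketch (the context, entity set, and column names are assumptions) of streaming the full result set to a CSV file without ever calling ToList(), combined with the NoTracking option from above so the context does not cache every entity:

using System.IO;
using System.Linq;

using (var context = new MyEntities())                       // hypothetical ObjectContext
using (var writer = new StreamWriter(@"C:\exports\report.csv"))
{
    context.Orders.MergeOption = MergeOption.NoTracking;     // read-only export
    var query = context.Orders.Where(o => o.Year == 2012);   // hypothetical filter

    foreach (var order in query)   // rows come off the data reader one at a time
    {
        writer.WriteLine("{0},{1},{2}", order.Id, order.Customer, order.Total);
        // nothing accumulates; each entity becomes collectable once its row is written
    }
}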
Edit3:
While your query is running, you can actually open SQL Server's "Activity Monitor" and see the query, with a Task State of SUSPENDED and a Wait Type of ASYNC_NETWORK_IO, which indicates that the result is sitting in the SQL Server network buffer. You can read more about it here and here

Look at the return value of the LINQ query. It should be IEnumerable<>, which only loads one object at a time. If you then use something like .ToList(), they will all be loaded into memory. Just make sure your code doesn't maintain a list or use more than one instance at a time and you will be fine.
Edit: To add on to what people have said about foreach... If you do something like:
var query = from o in Objects
            where o.Name == "abc"
            select o;
foreach (Object o in query)
{
    // Do something with o
}
The query portion uses deferred execution (see examples), so the objects are not in memory yet. The foreach iterates through the results, getting only one object at a time. query exposes an IEnumerator, which has Reset() and MoveNext(); the foreach calls MoveNext() on each pass until there are no more results.
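To make the MoveNext() point concrete, a foreach over the query is roughly what the compiler expands into (just a sketch of the expansion, using the Object type from the example above):

using (IEnumerator<Object> enumerator = query.GetEnumerator())
{
    while (enumerator.MoveNext())          // pull the next result through the pipeline
    {
        Object o = enumerator.Current;     // only this one object is held at a time
        // Do something with o
    }
}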

Related

How many records are loaded into the BindingSource?

I've always worked with LINQ, and that's why I've always pulled only the records needed for an operation; obviously everything was hand-coded.
Now I'm studying data binding because I understand it can speed up the whole process a lot.
However, I have a question about the initial load of the BindingSource. I noticed that the sample code always contains the .Load() call without specifying an initial filter.
Example:
dbContext ctx = new dbContext();
ctx.Table.Load();   // <-- Here is my doubt
myBindingSource.DataSource = ctx.Table.Local.ToBindingList();
Let's assume that this table has 10,000 records. Will it load 10,000 records at once? Doesn't this type of operation make the load very slow and consume a lot of network bandwidth?
According to documentation
One common way to do this is to write a LINQ query and then call ToList on it, only to immediately discard the created list. The Load extension method works just like ToList except that it avoids the creation of the list altogether.
So, if you just call
ctx.Table.Load()
it will load all the data on that table.
You can also query it before calling Load()
context.Posts.Where(p => p.Tags.Contains("stackoverflow")).Load();
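As a rough sketch of the binding scenario in the question (the IsActive filter and the 500-row cap are assumptions), you can load only a slice of the table into the local cache and bind to that:

var ctx = new dbContext();          // keep the context alive for as long as the binding is used

ctx.Table
   .Where(t => t.IsActive)          // hypothetical filter
   .Take(500)                       // cap how many rows come over the wire
   .Load();                         // only these rows are materialized into ctx.Table.Local

myBindingSource.DataSource = ctx.Table.Local.ToBindingList();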

Using .Where() on a List

Assuming the two following possible blocks of code inside of a view, with a model passed to it using something like return View(db.Products.Find(id));
List<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date).ToList();
if (myUserReview != null)
reviews = reviews.Where(ur => ur.Id != myUserReview.Id).ToList();
IEnumerable<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date);
if (myUserReview != null)
reviews = reviews.Where(ur => ur.Id != myUserReview.Id);
What are the performance implications between the two? By this point, is all of the product related data in memory, including its navigation properties? Does using ToList() in this instance matter at all? If not, is there a better approach to using Linq queries on a List without having to call ToList() every time, or is this the typical way one would do it?
Read http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Deferred execution is one of the many marvels intrinsic to LINQ. The short version is that your data is never touched (it remains idle in the source, be that in memory, in a database, or wherever). When you construct a LINQ query, all you are doing is creating an IEnumerable that is 'capable' of enumerating your data. The work doesn't begin until you actually start enumerating; then each piece of data comes all the way from the source, through the pipeline, and is serviced by your code one item at a time. If you break out of your loop early, you have saved some work - later items are never processed. That's the simple version.
Some LINQ operations cannot work this way. OrderBy is the best example. OrderBy has to know every piece of data because it is possible that the last piece retrieved from the source could very well be the first piece you are supposed to get. So when an operation such as OrderBy is in the pipe, it will actually cache your dataset internally. All data has been pulled from the source and has gone through the pipeline up to the OrderBy, and then the OrderBy becomes your new temporary data source for any operations that come afterwards in the expression. Even so, OrderBy tries as much as possible to follow the deferred execution paradigm by waiting until the last possible moment to build its cache. Including OrderBy in your query still doesn't do any work immediately, but the work begins once you start enumerating.
To answer your question directly, your call to ToList is doing exactly that. OrderByDescending caches the data from your data source => ToList additionally persists it into a variable that you can actually touch (reviews) => Where starts pulling records one at a time from reviews, and for each match your final ToList call stores the result into yet another list in memory.
Beyond the memory implications, ToList is additionally thwarting deferred execution because it also STOPS the processing of your view at the time of the call, to entirely process that entire linq expression, in order to build its in-memory representation of the results.
Now none of this is a real big deal if the number of records we're talking about are in the dozens. You'll never notice the difference at runtime because it happens so quick. But when dealing with large scale datasets, deferring as much as possible for as long as possible in hopes that something will happen allowing you to cancel a full enumeration... in addition to the memory savings... gold.
In your version without ToList: OrderByDescending will still cache a copy of your dataset as processed through the pipeline up to that point, internally, sorted of course. That's ok, you gotta do what you gotta do. But that doesn't happen until you actually try to retrieve your first record later in your view. Once that cache is complete, you get your first record, and for every next record you are then pulling from that cache, checking against the where clause, you get it or you don't based upon that where and have saved a couple of in memory copies and a lot of work.
Magically, I bet even your lead-in of db.Products.Find(id) doesn't even start spinning until your view starts enumerating (if not using ToList). If db.Products is a Linq2SQL datasource, then every other element you've specified will reduce into SQL verbiage, and your entire expression will be deferred.
Hope this helps! Read further on Deferred Execution. And if you want to know 'how' that works, look into c# iterators (yield return). There's a page somewhere on MSDN that I'm having trouble finding that contains the common linq operations, and whether they defer execution or not. I'll update if I can track that down.
/*edit*/ to clarify - all of the above is in the context of raw linq, or Linq2Objects. Until we find that page, common sense will tell you how it works. If you close your eyes and imagine implementing orderby, or any other linq operation, if you can't think of a way to implement it with 'yield return', or without caching, then execution is not likely deferred and a cache copy is likely and/or a full enumeration... orderby, distinct, count, sum, etc... Linq2SQL is a whole other can of worms. Even in that context, ToList will still stop and process the whole expression and store the results because a list, is a list, and is in memory. But Linq2SQL is uniquely capable of deferring many of those aforementioned clauses, and then some, by incorporating them into the generated SQL that is sent to the SQL server. So even orderby can be deferred in this way because the clause will be pushed down into your original datasource and then ignored in the pipe.
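To illustrate the yield-return point, here is a stripped-down, hypothetical re-implementation of Where (not the real Enumerable.Where, just the same deferral mechanics):

using System;
using System.Collections.Generic;

static class MyLinq
{
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T item in source)       // nothing runs until the caller starts enumerating
        {
            if (predicate(item))
                yield return item;       // hand back one match, then pause until the next MoveNext()
        }
    }
}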
Good luck ;)
Not enough context to know for sure.
But ToList() guarantees that the data has been copied into memory, and your first example does that twice.
The second example could involve queryable data or some other on-demand scenario. Even if the original data was all already in memory and even if you only added a call to ToList() at the end, that would be one less copy in-memory than the first example.
And it's entirely possible that in the second example, by the time you get to the end of that little snippet, no actual processing of the original data has been done at all. In that case, the database might not even be queried until some code actually enumerates the final reviews value.
As for whether there's a "better" way to do it, it's not possible to say: you haven't defined "better". Personally, I tend to prefer the second example...why materialize data before you need it? But in some cases, you do want to force materialization. It just depends on the scenario.
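One common case where you do want to force materialization, as a small sketch (the MyDbContext type is an assumption; the entity names come from the question):

List<UserReview> reviews;
using (var db = new MyDbContext())                 // hypothetical context type
{
    reviews = db.UserReviews
                .OrderByDescending(ur => ur.Date)
                .ToList();                         // materialize while the context is still alive
}
// safe to use here: the data is already in memory even though the context is disposed
var latest = reviews.FirstOrDefault();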

LINQ execution time

Today I noticed that when I run several LINQ statements on big data, the time taken may vary extremely.
Suppose we have a query like this:
var conflicts = features.Where(/* some condition */);
foreach (var c in conflicts) // log the conflicts
where features is a list of objects representing rows in a table. These objects are quite complex, and even querying one simple property of them is a substantial operation (including the actual database query, validation, state changes...), so I assumed that performing such a query would take a long time. Far from it: the first statement executes in a very small amount of time, whereas simply looping over the results takes forever.
However, if I convert the collection retrieved by the LINQ expression to a List using ToList(), the first statement runs a bit slower and looping over the results is very fast. Having said this, the total duration of the second approach is much less than when not converting to a list.
var conflicts = features.Where(/* some condition */).ToList();
foreach (var c in conflicts) // log the conflicts
So I suppose that var conflicts = features.Where does not actually query but only prepares the data. But I do not understand why converting to a list and then looping over it is so much faster. That's the actual question.
Does anybody have an explanation for this?
This statement just declares your intention:
var conflicts = features.Where(...);
to get the data that fulfills the criteria in the Where clause. Then, when you write this:
foreach (var c in conflicts)
the actual query will be executed and the results will start coming back. This is called lazy loading. Another term we use for this is deferred execution: we defer the execution of the query until we need its data.
On the other hand, if you had done something like this:
var conflicts = features.Where(...).ToList();
an in-memory collection would have been created, in which the results of the query would have been stored. In this case the query would have been executed immediately.
Generally speaking, as you could read in wikipedia:
Lazy loading is a design pattern commonly used in computer programming to defer initialization of an object until the point at which it is needed. It can contribute to efficiency in the program's operation if properly and appropriately used. The opposite of lazy loading is eager loading.
Update
And I suppose this in-memory collection is much faster than when doing lazy load?
Here is a great article that answers your question.
Welcome to the wonderful world of lazy evaluation. With LINQ the query is not executed until the result is needed. Before you try to get the result (ToList() gets the result and puts it in a list), you are just creating the query. Think of it as writing code vs. running the program. While this may be confusing and may cause the code to execute at unexpected times, and even multiple times if, for example, you foreach over the query twice, this is actually a good thing. It allows you to have a piece of code that returns a query (not the result, but the actual query) and have another piece of code create a new query based on the original one. For example, you may add additional filters on top of the original or page it.
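A quick sketch of that composition idea (the repository method, context, and entity names are assumptions): one method hands back the query, and the caller layers extra filtering and paging on top before anything executes:

// returns a query, not results; nothing has hit the database yet
IQueryable<Conflict> GetConflicts(MyDbContext db)
{
    return db.Conflicts.Where(c => c.IsOpen);
}

// elsewhere: compose more filtering and paging onto the original query
int pageIndex = 2, pageSize = 50;
var page = GetConflicts(db)
    .Where(c => c.Severity > 3)        // still deferred
    .OrderBy(c => c.Id)
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();                         // one combined query executes here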
The performance difference you are seeing is basically the database call happening at different places in your code.

Is looping through IQueryable the same as looping through a Datareader with Reader.Read() in C#

Sorry for the long title :)
I couldn't really find any answers to this, and the question has been circling through my head for some time now. There is a short summary of the question at the end.
Basically I'm wondering whether LINQ to Entities/SQL loops through a data reader before it maps the result set to the entity classes, or whether it is mapped as you go. Does the IQueryable have to loop through a data reader before I can loop through the IQueryable? To show what I mean in code, I'm wondering whether
foreach (object o in SomeIQueryableObject)
{
    // ...do stuff with the object
}
will result in either A:
while (Reader.Read())
{
    SomeIQueryableObject.Add(Reader.TranslateRow());
}
// now the IQueryable has been loaded with objects and we can loop through it
// with the above code
foreach (object o in SomeIQueryableObject)
{
    // ...do stuff with the object
}
or will it result in B:
while (Reader.Read())
{
    // ...do stuff with the object
}
// the IQueryable doesn't have to loop through the Reader
// before you can loop through the IQueryable
Short summary: when I use a foreach on an IQueryable, is this the equivalent of looping through a DataReader with Reader.Read()?
I'm afraid the correct answer here is: it depends.
First of all - it depends on the kind of IQueryable implementation we're talking about; IQueryable<T> is simply an interface and the actual implementation is left to the implementer. To the best of my knowledge Microsoft didn't provide any strict guidelines on how this should be done (heck, no one said IQueryable<T> needs to have anything in common with a database which would make the question pointless).
When we limit ourselves to linq-to-sql/entity-framework, the answer is still: it depends (at least for linq-to-sql, though I'd be surprised if EF worked much differently). It depends on several factors (e.g. the type of query you're executing, data volume, etc.). Basically it works similarly to your second pseudo-code (i.e. not loading all the data at once), but optimizations can be used to retrieve data in chunks or buffer it, so in some cases the query execution on the db side might finish sooner than your enumeration (though attempting to utilize this in any way might be... non-trivial).
If you want to simulate the bulk loading version, explicitly call ToList()/ToArray() (or use any other method that would cause all the data to be read).
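For instance, a minimal sketch of the two modes (the entity set and the Process helper are assumptions):

// streamed: entities are materialized one at a time while the reader stays open
foreach (var order in context.Orders.Where(o => o.Total > 100))
{
    Process(order);                    // hypothetical per-row work
}

// bulk: ToList() drains the reader up front, then you iterate purely in memory
var orders = context.Orders.Where(o => o.Total > 100).ToList();
foreach (var order in orders)
{
    Process(order);
}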
AFAIK EF will use your second option. The reason I say this is that I have seen EF open a second connection when you execute another query while looping through the results of the first query. Since I was using a transaction, had disabled distributed transactions, and a second open connection (while the first connection is still open and in use) prevents re-use of the current transaction, my second query threw an exception saying that DTC wasn't enabled.
So my conclusion was: if you are executing a query and looping through the results (so no .ToList()), then the query is still executing, and the connection is still open, and if EF is using a data-reader, then yes, it is within the loop which reads from the data-reader.

C# OutOfMemory Issue when dealing with large data

In our application we are generating reports using a Windows Service. The data for the reports is fetched from SQL Server using a stored procedure. In some scenarios the result set returned contains 250,000 records (we cannot help this part, and we need the data in one go because we need to do some calculations on it).
Problem
Our application reads this data through a data reader and converts the result set into a custom collection of custom objects. As the data is huge, it cannot hold the complete data in the custom collection and hence throws an OutOfMemoryException. When we watch Task Manager while the job is running, the process's memory usage goes very high, and so does its CPU utilization.
I am not sure what should be done in this case.
Can we increase the size of the memory allocated to a single process running under CLR?
Any other workarounds?
Any help would be really appreciated
Why do I need all the data at once: we need to do calculations on the complete result set.
We are using ADO.NET and transforming the result set into our custom object (collection).
Our system is 32-bit.
We cannot page the data.
We cannot move the computation to SQL Server.
This stack trace might help:
Exception of type 'System.OutOfMemoryException' was thrown.
Server stack trace:
   at System.Collections.Generic.Dictionary`2.ValueCollection.System.Collections.Generic.IEnumerable<TValue>.GetEnumerator()
   at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
   at System.Collections.Generic.List`1.InsertRange(Int32 index, IEnumerable`1 collection)
   at System.Collections.Generic.List`1.AddRange(IEnumerable`1 collection)
   at MyProject.Common.Data.DataProperty.GetPropertiesForType(Type t) in C:\Ashish-Stuff\Projects\HCPA\Dev Branch\Common\Benefits.Common\Data\DataProperty.shared.cs:line 60
   at MyProject.Common.Data.Extensions.GetProperties[T](T target) in C:\Ashish-Stuff\Projects\HCPA\Dev Branch\Common\Benefits.Common\Data\Extensions.shared.cs:line 30
   at MyProject.Common.Data.Factories.SqlServerDataFactoryContract`1.GetData(String procedureName, IDictionary`2 parameters, Nullable`1 languageId, Nullable`1 pageNumber, Nullable`1 pageSize)
Thanks,
Ashish
Could you serialize your custom collection of objects to disk every 1,000 rows or so? Then, when you return data, paginate it from those files?
More info on your use case as to why you need to pull back 2.5 million rows of data would be helpful.
My first thought was that the computation could be done on the SQL Server side by some stored procedure. I suspect that this approach would require some SQL Server jedi... but anyway, have you considered such an approach?
I would love to see a code sample highlighting where exactly you are getting this error. Is it on the data pull itself (populating the reader), or is it when creating the objects and adding them to the custom collection (populating the collection)?
I have had similar issues before, dealing with VERY LARGE datasets, but had great success by leaving the data in a stream for as long as possible. Streams keep the data in raw form, so you never really have anything with direct access to the entire mess until you finish building the objects. Now, given that the stack trace shows the error on a MoveNext operation, this may not work for you. I would then say try to chunk the data: grab 10k rows at a time or something; I know this can be done with SQL. It will make the data read take a lot longer, though.
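A rough sketch of that chunked read using keyset paging (the table, columns, and RowDto type are assumptions; the question uses a stored procedure, which is swapped here for a plain query just to keep the sketch short):

using System.Collections.Generic;
using System.Data.SqlClient;

class ChunkedReportReader
{
    // hypothetical two-column projection of the report rows
    class RowDto { public long Id; public decimal Amount; }

    // pull rows in batches of 10,000, keyed on an ever-increasing Id column (assumed)
    static IEnumerable<RowDto> ReadInChunks(string connectionString)
    {
        long lastId = 0;
        while (true)
        {
            var batch = new List<RowDto>();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT TOP (10000) Id, Amount FROM dbo.ReportRows " +
                "WHERE Id > @lastId ORDER BY Id", conn))
            {
                cmd.Parameters.AddWithValue("@lastId", lastId);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        batch.Add(new RowDto { Id = reader.GetInt64(0), Amount = reader.GetDecimal(1) });
                }
            }

            if (batch.Count == 0)
                yield break;                        // no more rows
            foreach (var row in batch)
                yield return row;                   // callers see one flat sequence of rows
            lastId = batch[batch.Count - 1].Id;     // resume after the last key we saw
        }
    }
}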
EDIT
If you read this from the database into a local stream that you then pass around (just be careful not to close it), then you shouldn't run into these issues. Make a data wrapper class that you can pass around with an open stream and an open reader. Store the data in the stream and then use wrapper functions to read the specific data you need from it - things like GetSumOfXField() or AverageOfYValues(), etc. The data will never be in a live object, but you won't have to keep going back to the database for it.
Pseudo Example
public void ReadingTheDataFunction()
{
    DbDataReader reader = dbCommand.ExecuteReader();
    MyDataStore.FillDataSource(reader);
}

private void FillDataSource(DbDataReader reader)
{
    StreamWriter writer = new StreamWriter(GlobalDataStream);
    while (reader.Read())
        writer.WriteLine(BuildStringFromDataRow(reader));
    writer.Flush();      // keep the stream open, but make sure everything is pushed into it
    reader.Close();
}

private CustomObject GetNextRow()
{
    string line = GlobalDataReader.ReadLine();
    CustomObject ret = ParseStringToCustomObject(line);   // parse the string into a custom object (hypothetical helper)
    return ret;
}
From there you pass around MyDataStore, and as long as the stream and reader aren't closed, you can move your position around, go looking for individual entries, compile sums and averages, etc etc. you never even really have to know you aren't dealing with a live object, as long as you only interact with it via the interface functions.
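For instance, one of those wrapper functions might be sketched like this (the X field and the parsing helper are mine, not from the answer above):

// sketch: an aggregate computed straight off the buffered stream, never holding all rows at once
public decimal GetSumOfXField()
{
    GlobalDataStream.Position = 0;                    // rewind the buffered stream
    var lineReader = new StreamReader(GlobalDataStream);

    decimal sum = 0;
    string line;
    while ((line = lineReader.ReadLine()) != null)
    {
        CustomObject row = ParseStringToCustomObject(line);   // same hypothetical parser as above
        sum += row.X;                                          // X is an assumed numeric field
        // 'row' is released every iteration, so memory stays flat
    }
    return sum;
}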
