Good morning!
Actually I'm playing around with EF at the moment and I need your help:
Here's the scenario: I have a table with a lot of data in it. If I query this table through EF, all the records get loaded into memory.
e.g.:
var counter = default(int);
using (var myEntities = new MyEntities())
{
    foreach (var record in myEntities.MySpecialTable)
    {
        counter++;
    }
}
So I run through all of the records of MySpecialTable and increment counter.
When I have a look at Task Manager, or anything else that shows me the memory consumption of my app, it tells me: 400 MB (because of the data).
If I run through the table another time, the memory consumption doubles.
I already tried forcing a garbage collection, but it didn't help.
Why do all of the records of each run get stored somewhere in the memory (and are not released)?
Where do they get stored?
Is there any way to make EF-queries behave like a DataReader?
Or is there any other ORM which is as elegant as EF?
edit:
no, i'm not doing a count like that ... this is just for showing the iteration :)
First of all, I hope you're not actually doing a count like that; the Count method is far more efficient. But presuming this is just demo code to show the memory issue:
Change the MergeOption property of the ObjectQuery to NoTracking. This tells the Entity Framework that you have no intention of modifying the entities, and hence it needn't bother keeping track of their original state.
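For example (a sketch against the ObjectContext-era API, where MySpecialTable is assumed to be an ObjectSet<T>, which exposes MergeOption via ObjectQuery<T>):

```csharp
using System.Data.Objects; // for MergeOption

var counter = 0;
using (var myEntities = new MyEntities())
{
    // Tell EF not to track the materialized entities in the ObjectStateManager.
    myEntities.MySpecialTable.MergeOption = MergeOption.NoTracking;

    foreach (var record in myEntities.MySpecialTable)
    {
        counter++;
    }
}
```

Each entity is still materialized as you iterate, but with NoTracking it becomes eligible for collection as soon as you move past it, so memory stays roughly flat instead of accumulating per run.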
Related
I am doing a simple foreach loop on a table in my database, however, if I try to do any other queries against the database while inside that loop, it seems that the database (or possibly only my DataContext) is locked.
My tables in question are a Process table (about 44,000 rows) and a child ProcessValidation table (about 133,000 rows). ProcessValidation has a valProcessKey that points back to Process and has an index applied against it. My code is as follows:
using (var dbContext = new BTR.Data.Legacy.DataContexts.xDS.DataContext(connectionString) { DeferredLoadingEnabled = false, ObjectTrackingEnabled = false })
{
    var dataLoadProcesses =
        dbContext.Processes
            .Where(p => new[] { "UpdateProcess", "RBLProcess", "RBLCalcProcess" }.Contains(p.procType));

    // All these count queries work outside of the foreach
    Console.WriteLine(dataLoadProcesses.Count());
    Console.WriteLine(dbContext.ProcessValidations.Count());
    Console.WriteLine(dbContext.ProcessValidations.Count(v => v.valProcessKey == 1591));

    foreach (var process in dataLoadProcesses)
    {
        // can grab fields from 'process' object
        // can *NOT* execute any other queries against ProcessValidation
    }
}
Inside my foreach, I can grab fields from each process row and everything works fine, but if I try querying the database again it locks up. For example, a simple count query like the one below never returns and just hangs (I think it eventually gave me an OutOfMemoryException).
var existingValidations =
    dbContext.ProcessValidations.Count(v => v.valProcessKey == process.procKey);
I tried using a TransactionScope around the query to effectively issue a NOLOCK hint, but that didn't help either.
Above, I said possibly only the DataContext was locked because, while it was locked, I executed a Count() query against the database in a LINQPad script using a different DataContext and it returned immediately (even though the original DataContext was still spinning).
It was suggested that this might be a duplicate of Processing large datasets using LINQ but I don't think it applies.
I created my context outside the loop.
It locks up on the first Count() query.
Once I get past that problem, creating a compiled query might help performance and memory consumption, but I'm not at that point yet.
The problem is that LINQ is holding a lock while it streams the results. You can easily avoid it by using foreach (var process in dataLoadProcesses.ToList()). In this answer you have a short explanation. I don't think it's enough, because I don't understand how a foreach can block a count, but it is a start.
If you want/need your query not to block the database at all, I found this other link.
Both of these solutions have to pull the entire result set into memory before processing it, and there is probably a better solution.
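For reference, the read-uncommitted pattern alluded to above usually looks like this (a sketch using System.Transactions; dbContext and process are the question's objects, and whether this actually avoids blocking depends on the provider):

```csharp
using System.Transactions;

var options = new TransactionOptions
{
    // Roughly equivalent to a NOLOCK hint for reads inside the scope.
    IsolationLevel = IsolationLevel.ReadUncommitted
};

using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
{
    // Reads inside this scope will not wait on shared locks.
    var existingValidations =
        dbContext.ProcessValidations.Count(v => v.valProcessKey == process.procKey);

    scope.Complete();
}
```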
I've just been noodling about with a profiler looking at performance bottlenecks in a WCF application after some users complained of slowness.
To my surprise, almost all the problems came down to Entity Framework operations. We use a repository pattern and most of the "Add/Modify" code looks very much like this:
public void Thing_Add(Thing thing)
{
    Log.Trace("Thing_Add called with ThingID " + thing.ThingID);
    if (db.Things.Any(m => m.ThingID == thing.ThingID))
    {
        db.Entry(thing).State = System.Data.EntityState.Modified;
    }
    else
    {
        db.Things.Add(thing);
    }
}
This is obviously a convenient way to wrap an add/update check into a single function.
Now, I'm aware that EF isn't the most efficient thing when it comes to doing inserts and updates. However, my understanding was (which a little research bears out) that it should be capable of processing a few hundred records faster than a user would likely notice.
But this is causing big bottlenecks on small upserts. For example, in one case it takes six seconds to process about fifty records. That's a particularly bad example but there seem to be instances all over this application where small EF upserts are taking upwards of a second or two. Certainly enough to annoy a user.
We're using Entity Framework 5 with a Database First model. The profiler says it's not the Log.Trace that's causing the issue. What could be causing this, and how can I investigate and fix the issue?
I found the root of the problem on another SO post: DbContext is very slow when adding and deleting
Turns out that when you're working with a large number of objects, especially in a loop, the gradual accumulation of change tracking makes EF get slower and slower.
Refreshing the DbContext isn't enough in this instance as we're still working with too many linked entities. So I put this inside the repository:
public void AutoDetectChangesEnabled(bool detectChanges)
{
    db.Configuration.AutoDetectChangesEnabled = detectChanges;
}
And can now use it to turn AutoDetectChangesEnabled on and off before doing looped inserts:
try
{
    rep.AutoDetectChangesEnabled(false);
    foreach (var thing in thingsInFile)
    {
        rep.Thing_Add(new Thing(thing));
    }
}
finally
{
    rep.AutoDetectChangesEnabled(true);
}
This makes a hell of a difference. It needs to be used with care, though, since it stops EF from recognizing potential updates to changed objects.
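If you still need EF to pick up modifications made while automatic detection was off, one option (a sketch against the DbContext API directly, bypassing the repository wrapper) is to run change detection once, explicitly, before saving:

```csharp
db.Configuration.AutoDetectChangesEnabled = false;
try
{
    foreach (var thing in thingsInFile)
    {
        db.Things.Add(new Thing(thing));
    }

    // With automatic detection off, SaveChanges will not scan for changes itself,
    // so run the scan once here instead of once per Add.
    db.ChangeTracker.DetectChanges();
    db.SaveChanges();
}
finally
{
    db.Configuration.AutoDetectChangesEnabled = true;
}
```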
Assuming the two following possible blocks of code inside of a view, with a model passed to it using something like return View(db.Products.Find(id));
List<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date).ToList();
if (myUserReview != null)
    reviews = reviews.Where(ur => ur.Id != myUserReview.Id).ToList();
IEnumerable<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date);
if (myUserReview != null)
    reviews = reviews.Where(ur => ur.Id != myUserReview.Id);
What are the performance implications between the two? By this point, is all of the product related data in memory, including its navigation properties? Does using ToList() in this instance matter at all? If not, is there a better approach to using Linq queries on a List without having to call ToList() every time, or is this the typical way one would do it?
Read http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Deferred execution is one of the many marvels intrinsic to linq. The short version is that your data is never touched (it remains idle in the source be that in-memory, or in-database, or wherever). When you construct a linq query all you are doing is creating an IEnumerable class that is 'capable' of enumerating your data. The work doesn't begin until you actually start enumerating and then each piece of data comes all the way from the source, through the pipeline, and is serviced by your code one item at a time. If you break your loop early, you have saved some work - later items are never processed. That's the simple version.
Some linq operations cannot work this way. Orderby is the best example. Orderby has to know every piece of data, because it's possible that the last piece retrieved from the source could very well be the first piece you are supposed to get. So when an operation such as orderby is in the pipe, it will actually cache your dataset internally. All data is pulled from the source and goes through the pipeline up to the orderby, and then the orderby becomes your new temporary data source for any operations that come afterwards in the expression. Even so, orderby tries as much as possible to follow the deferred execution paradigm by waiting until the last possible moment to build its cache. Including orderby in your query still doesn't do any work immediately, but the work begins once you start enumerating.
To answer your question directly, your call to ToList is doing exactly that. OrderByDescending is caching the data from your datasource => ToList additionally persists it into a variable that you can actually touch (reviews) => where starts pulling records one at a time from reviews, and if it matches then your final ToList call is storing the results into yet another list in memory.
Beyond the memory implications, ToList is additionally thwarting deferred execution because it also STOPS the processing of your view at the time of the call, to entirely process that entire linq expression, in order to build its in-memory representation of the results.
Now none of this is a real big deal if the number of records we're talking about are in the dozens. You'll never notice the difference at runtime because it happens so quick. But when dealing with large scale datasets, deferring as much as possible for as long as possible in hopes that something will happen allowing you to cancel a full enumeration... in addition to the memory savings... gold.
In your version without ToList: OrderByDescending will still internally cache a copy of your dataset as processed through the pipeline up to that point, sorted of course. That's ok, you gotta do what you gotta do. But that doesn't happen until you actually try to retrieve your first record later in your view. Once that cache is complete, you get your first record, and every next record is pulled from that cache and checked against the where clause; you get it or you don't, and you've saved a couple of in-memory copies and a lot of work.
Magically, I bet even your lead-in of db.Products.Find(id) doesn't even start spinning until your view starts enumerating (if not using ToList). If db.Products is a Linq2SQL datasource, then every other element you've specified will reduce into SQL verbiage, and your entire expression will be deferred.
Hope this helps! Read further on Deferred Execution. And if you want to know 'how' that works, look into c# iterators (yield return). There's a page somewhere on MSDN that I'm having trouble finding that contains the common linq operations, and whether they defer execution or not. I'll update if I can track that down.
/*edit*/ to clarify - all of the above is in the context of raw linq, or Linq2Objects. Until we find that page, common sense will tell you how it works. If you close your eyes and imagine implementing orderby, or any other linq operation, if you can't think of a way to implement it with 'yield return', or without caching, then execution is not likely deferred and a cache copy is likely and/or a full enumeration... orderby, distinct, count, sum, etc... Linq2SQL is a whole other can of worms. Even in that context, ToList will still stop and process the whole expression and store the results because a list, is a list, and is in memory. But Linq2SQL is uniquely capable of deferring many of those aforementioned clauses, and then some, by incorporating them into the generated SQL that is sent to the SQL server. So even orderby can be deferred in this way because the clause will be pushed down into your original datasource and then ignored in the pipe.
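A standalone sketch of those iterator mechanics (plain LINQ-to-Objects style; nothing here is from the question's code):

```csharp
using System;
using System.Collections.Generic;

public static class DeferredDemo
{
    // A minimal hand-rolled Where: the body does not run until someone enumerates.
    public static IEnumerable<int> WhereEven(IEnumerable<int> source)
    {
        foreach (var n in source)
        {
            if (n % 2 == 0)
                yield return n; // hand back one item, then pause until the next MoveNext
        }
    }

    public static void Main()
    {
        var query = WhereEven(new[] { 1, 2, 3, 4, 5 }); // nothing has executed yet

        foreach (var n in query)  // enumeration drives the pipeline, one item at a time
            Console.WriteLine(n); // prints 2, then 4
    }
}
```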
Good luck ;)
Not enough context to know for sure.
But ToList() guarantees that the data has been copied into memory, and your first example does that twice.
The second example could involve queryable data or some other on-demand scenario. Even if the original data was all already in memory and even if you only added a call to ToList() at the end, that would be one less copy in-memory than the first example.
And it's entirely possible that in the second example, by the time you get to the end of that little snippet, no actual processing of the original data has been done at all. In that case, the database might not even be queried until some code actually enumerates the final reviews value.
As for whether there's a "better" way to do it, not possible to say. You haven't defined "better". Personally, I tend to prefer the second example...why materialize data before you need it? But in some cases, you do want to force materialization. It just depends on the scenario.
Today I noticed that when I run several LINQ statements on big data, the time taken may vary extremely.
Suppose we have a query like this:
var conflicts = features.Where(/* some condition */);
foreach (var c in conflicts) // log the conflicts
Where features is a list of objects representing rows in a table. These objects are quite complex, and even querying one simple property of them is a huge operation (including the actual database query, validation, state changes...), so I assumed performing such a query would take a lot of time. Far from it: the first statement executes in a very small amount of time, whereas simply looping over the results takes forever.
However, if I convert the collection retrieved by the LINQ expression to a List using IEnumerable.ToList(), the first statement runs a bit slower, while looping over the results is very fast. That said, the total duration of this second approach is much less than when not converting to a list.
var conflicts = features.Where(/* some condition */).ToList();
foreach (var c in conflicts) // log the conflicts
So I suppose that var conflicts = features.Where(...) does not actually query but only prepares the data. But I don't understand why converting to a list and looping afterwards is so much faster. That's the actual question.
Has anybody an explanation for this?
This statement just declares your intention:
var conflicts = features.Where(...);
to get the data that fulfils the criteria in the Where clause. Then, when you write this:
foreach (var c in conflicts)
the actual query will be executed and the results will start coming back. This is called lazy loading. Another term we use for this is deferred execution: we defer the execution of the query until we need its data.
On the other hand, if you had done something like this:
var conflicts = features.Where(...).ToList();
an in-memory collection would have been created, in which the results of the query would have been stored. In this case the query would have been executed immediately.
Generally speaking, as you could read in wikipedia:
Lazy loading is a design pattern commonly used in computer programming
to defer initialization of an object until the point at which it is
needed. It can contribute to efficiency in the program's operation if
properly and appropriately used. The opposite of lazy loading is eager
loading.
Update
And I suppose this in-memory collection is much faster than when doing lazy load?
Here is a great article that answers your question.
Welcome to the wonderful world of lazy evaluation. With LINQ, the query is not executed until the result is needed. Before you try to get the result (ToList() gets the result and puts it in a list), you are just creating the query. Think of it as writing code vs. running the program. While this may be confusing, and may cause the code to execute at unexpected times (and even multiple times if, for example, you foreach the query twice), it is actually a good thing. It allows you to have a piece of code that returns a query (not the result, but the actual query) and another piece of code that creates a new query based on the original one. For example, you may add additional filters on top of the original, or page it.
The performance difference you are seeing is basically the database call happening at different places in your code.
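A standalone sketch of that composition idea (LINQ-to-Objects; the counter shows when the predicate actually runs):

```csharp
using System;
using System.Linq;

public static class ComposeDemo
{
    public static void Main()
    {
        var evaluations = 0;

        // Build a base query; the predicate has not run yet.
        var baseQuery = Enumerable.Range(1, 10)
            .Where(n => { evaluations++; return n % 2 == 0; });

        // Compose filtering/paging on top of it - still nothing has executed.
        var page = baseQuery.Where(n => n > 4).Skip(1).Take(2);

        Console.WriteLine(evaluations);            // 0: the query is only declared so far
        Console.WriteLine(string.Join(",", page)); // 8,10 - enumeration triggers the work
        Console.WriteLine(evaluations);            // 10: the predicate ran during enumeration
    }
}
```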
Yet another How-to-free-memory question:
I'm copying data between two databases which are currently identical but will soon be getting out of sync. I have put together a skeleton app in C# using Reflection and ADO.Net Entities that does this:
For each table in the source database:
    Clear the corresponding table in the destination database
    For each object in the source table:
        For each property in the source object:
            If an identically-named property exists in the destination object, use Reflection to copy the source property to the destination property
This works great until I get to the big 900MB table that has user-uploaded files in it.
The process of copying the blobs (up to 7 MB each) to my machine and back to the destination database uses up local memory. However, that memory isn't getting freed, and the process dies once it's copied about 750 MB worth of data - with my program having 1500 MB of allocated space when the OutOfMemoryException is thrown, presumably two copies of everything that it's copied so far.
I tried a naive approach first, doing a simple copy. It worked on every table until I got to the big one. I have tried forcing a GC.Collect() with no obvious change to the results. I've also tried putting the actual copy into a separate function in hopes that the reference going out of scope would help it get GCed. I even put a Thread.Sleep in to try to give background processes more time to run. All of these have had no effect.
Here's the relevant code as it exists right now:
public static void CopyFrom<TSource, TDest>(this ObjectSet<TDest> Dest, ObjectSet<TSource> Source, bool SaveChanges, ObjectContext context)
    where TSource : class
    where TDest : class {
    int total = Source.Count();
    int count = 0;
    foreach (var src in Source) {
        count++;
        CopyObject(src, Dest);
        if (SaveChanges && context != null) {
            context.SaveChanges();
            GC.Collect();
            if (count % 100 == 0) {
                Thread.Sleep(2000);
            }
        }
    }
}
I didn't include the CopyObject() function; it just uses reflection to evaluate the properties of src and put them into identically-named properties in a new object to be appended to Dest.
SaveChanges is a Boolean parameter indicating that this extra processing should be done; it's only true on the big table and false otherwise.
So, my question: How can I modify this code to not run me out of memory?
The problem is that your database context uses a lot of caching internally, and it's holding onto a lot of your information and preventing the garbage collector from freeing it (whether you call Collect or not).
This means that your context is defined at too high a scope. (It appears, based on your edit, that you're using it across tables. That's... not good.) You haven't shown where it is defined, but wherever it is, it should probably be at a lower level. Keep in mind that, because of connection pooling, creating new contexts is not expensive, and based on your use cases you shouldn't need to rely on much of the cached info (because you're not touching items more than once), so frequently creating new contexts shouldn't add performance costs even though it substantially decreases your memory footprint.
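A sketch of scoping the context per batch (MyEntities, BigTable, and Id are hypothetical names standing in for your generated context, the large table, and its key; the batch size is arbitrary):

```csharp
const int batchSize = 100;

// Pull just the keys with a short-lived context; they're cheap to hold in memory.
List<int> ids;
using (var ctx = new MyEntities())
{
    ids = ctx.BigTable.Select(r => r.Id).ToList();
}

for (int i = 0; i < ids.Count; i += batchSize)
{
    using (var ctx = new MyEntities()) // fresh context per batch keeps its cache small
    {
        foreach (var id in ids.Skip(i).Take(batchSize))
        {
            var src = ctx.BigTable.First(r => r.Id == id);
            // CopyObject(src, destSet); // as in the question
        }
        ctx.SaveChanges();
    } // disposing the context lets the GC reclaim everything it tracked
}
```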