OutOfMemory when removing rows > 500000 EntityFramework 6 - c#

What I've got:
I have a large list of addresses(ip addr) > millions
What I'm trying to do:
Remove 500k addresses efficiently through EntityFramework
My Problem:
Right now, I'm splitting into lists of 10000 addresses and using RemoveRange(ListOfaddresses)
if (addresses.Count() > 10000)
{
var addressChunkList = extension.BreakIntoChunks<Address>(addresses.ToList(), 10000);
foreach (var chunk in addressChunkList)
{
db.Address.RemoveRange(chunk);
}
}
but I'm getting an OutOfMemoryException which must mean that it's not freeing resources even though I'm splitting my addresses into separate lists.
What can I do to not get the OutOfMemoryException and still remove large quantities of addresses within reasonable time?

When I have needed to do something similar I have turned to the following plugin (I am not associated).
https://github.com/loresoft/EntityFramework.Extended
This allows you to do bulk deletes using Entity Framework without having to select and load the entity into the memory first which of course is more efficient.
Example from the website:
context.Users.Delete(u => u.FirstName == "firstname");

So? WHere did you get the idea EF is an ETL / bulk data manipulation tool?
It is not. Doing half a million deletes in one transaction will be dead slow (delete one by one) and EF is just not done for this. As you found out.
Nothing you can do here. Start using EF within design parameters or choose an alternative approach for this bulk operations. There are cases an ORM makes little sense.

A couple of suggestions.
Use a stored procedure or plain SQL
Move your DbContext to a narrower scope:
for (int i = 0; i < 500000; i += 1000)
{
using (var db = new DbContext())
{
var chunk = largeListOfAddress.Take(1000).Select(a => new Address { Id = a.Id });
db.Address.RemoveRange(chunk);
db.SaveChanges();
}
}
See Rick Strahl's post on bulk inserts for more details

Related

EF (C#) with tunneling to MySQL - optimising database calls

So I have program to deal with some sort of work in MySQL database. I'm connecting through SSH tunneling with putty (yes, I know launching programs on server itself would be much better but I don't have a choice here).
I have some problems with programs speed. I solved one by adding ".Include(table_name)" but I can't think about a way to do it here.
So purpose of this method is to clean database of unwanted, broken records. Simplified code looks like this:
using (var dbContext = new MyDatabase_dataEntities())
{
List<achievements> achiList = new List<achievements>();
var achievementsQuery = from data in dbContext.achievements
orderby data.playerID
select data;
achiList = achievementsQuery.Skip(counter * 5000).Take(5000).ToList();
foreach (achievements record in achiList)
{
var playerExists = from data in dbContext.players_data
where data.playerID == record.playerID
select data;
if(!playerExists.Any())
{
dbContext.achievements.Remove(record);
}
}
dbContext.SaveChanges();
counter++;
}
So this is built this way because I want to load achievements table then check if achievements have their player in player_data. If such player doesn't exist, remove his achievements.
It is all in do while, so I don't overload my memory by loading all data at once.
I know the problem is with checking in database in foreach steps but I can't figure out how to do it without it. Other things I tried generated errors either because EF couldn't translate it into SQL or exceptions were thrown when trying to access non-existing entity. Doing it in foreach bottlenecks whole program probably because of ping to the server.
I will need similar thing more often so I would be really gratefull if anyone could help me with making it so there is no need to call to database in "foreach". I know I could try to load whole players_data table and then check for Any(), but some tables I need it on are too big for that.
Oh, and turning off tracking changes doesn't help at this point because it is not what slows the program.
I would be gratefull for any help, thanks in advance!
EDIT: Mmmm, is there a way to get achievements which doesn't have player_data corresponding to them through one query using associations? Like adding to achievements query something like:
where !data.player_data.Exists()
Intellisense shows me that there is nothing like Exists or Any to use at this point. Is there any trick similar to this? It would definitely deal with the problem I have with speed there since database call in foreach wouldn't be needed.
If you want to delete achievements that don't have corresponding user records, then you can user a SQL query below:
DELETE a
FROM `achievements` a
LEFT JOIN `user` AS u
ON u.`playerID` = a.`playerID`
WHERE u.`playerID` IS NULL;
SQL query will be an order of magnitude faster than Entity Framework.
If you want to do that in the application, you can use the following code that uses LINQ to Entities and LINQ extensions methods. I assume you have a foreign key for player_data in achievements table so Entity Framework generates player_data lazy property for your achievements entity:
using (var dbContext = new MyDatabase_dataEntities())
{
var proceed = true;
while (proceed)
{
// Get net 1000 entities to delete
var entitiesToDelete = dbContext.achievements
.Where(x => x.players_data == null)
.Take(1000)
.ToList();
dbContext.achievements.RemoveRange(entitiesToDelete);
dbContext.SaveChanges();
// Proceed if deleted entities during this iteration
proceed = entitiesToDelete.Count() > 0;
}
}
If you prefer to use LINQ query syntax instead of extension methods, then your code will look like:
using (var dbContext = new MyDatabase_dataEntities())
{
var proceed = true;
while (proceed)
{
// Get net 1000 entities to delete
var query = from achievement in dbContext.achievements
where achievement.players_data == null
select achievement;
var entitiesToDelete = query.ToList();
dbContext.achievements.RemoveRange(entitiesToDelete);
dbContext.SaveChanges();
// Proceed if deleted entities during this iteration
proceed = entitiesToDelete.Count() > 0;
}
}

C#, Best way to loop through NHibernate objects without loading entire set into memory

I'm trying to loop through a large table and write the entries to a csv file. If I load all Objects into memory I get an OutOfMemoryException. My Employer class is mapped with fluent nhibernate.
Here's what I've tried:
This Loads all object on first iteration and crashes.
var myQuerable = DataProvider.GetEmployer(); // returns IQuerably
foreach (var emp in myQuerable)
{
// stuff...
}
No luck here:
var myEnumerator = myQuerable.GetEnumerator();
I thought this would work:
for (int i = 0; i <= myQuerable.Count(); i++)
{
Employer e = myQuerable.ElementAt(i);
}
but am getting this exception:
Could not parse expression
'value(NHibernate.Linq.NhQueryable`1[MyProject.Model.Employer]).ElementAt(0)':
This overload of the method 'System.Linq.Queryable.ElementAt' is currently not supported
Am I missing something here? Is this even possible with nHibernate?
Thanks!
I don't think loading your entries one by one could resolve your problem fully, as this is gonna go to another bad direction - huge loading on database side and longer response time for your C# method. I can't imagine how long it will take, as you've already god OutOfMemoryException exception that indicate you have huge number of records. I think the mechanism you really should take is pagination. There're various materials on the Internet about this topic, such as NHibernate 3 paging and determining the total number of rows.
Cheers!
Looks like I'm going to have to follow this artical:
http://ayende.com/blog/4548/nhibernate-streaming-large-result-sets
or use straight ado for performance.
Thanks for the help!

Read huge table with LINQ to SQL: Running out of memory vs slow paging

I have a huge table which I need to read through on a certain order and compute some aggregate statistics. The table already has a clustered index for the correct order so getting the records themselves is pretty fast. I'm trying to use LINQ to SQL to simplify the code that I need to write. The problem is that I don't want to load all the objects into memory, since the DataContext seems to keep them around -- yet trying to page them results in horrible performance problems.
Here's the breakdown. Original attempt was this:
var logs =
(from record in dataContext.someTable
where [index is appropriate]
select record);
foreach( linqEntity l in logs )
{
// Do stuff with data from l
}
This is pretty fast, and streams at a good rate, but the problem is that the memory use of the application keeps going up never stops. My guess is that the LINQ to SQL entities are being kept around in memory and not being disposed properly. So after reading Out of memory when creating a lot of objects C# , I tried the following approach. This seems to be the common Skip/Take paradigm that many people use, with the added feature of saving memory.
Note that _conn is created beforehand, and a temporary data context is created for each query, resulting in the associated entities being garbage collected.
int skipAmount = 0;
bool finished = false;
while (!finished)
{
// Trick to allow for automatic garbage collection while iterating through the DB
using (var tempDataContext = new MyDataContext(_conn) {CommandTimeout = 600})
{
var query =
(from record in tempDataContext.someTable
where [index is appropriate]
select record);
List<workerLog> logs = query.Skip(skipAmount).Take(BatchSize).ToList();
if (logs.Count == 0)
{
finished = true;
continue;
}
foreach( linqEntity l in logs )
{
// Do stuff with data from l
}
skipAmount += logs.Count;
}
}
Now I have the desired behavior that memory usage doesn't increase at all as I am streaming through the data. Yet, I have a far worse problem: each Skip is causing the data to load more and more slowly as the underlying query seems to actually cause the server to go through all the data for all previous pages. While running the query each page takes longer and longer to load, and I can tell that this is turning into a quadratic operation. This problem has appeared in the following posts:
LINQ Skip() Problem
LINQ2SQL select orders and skip/take
I can't seem to find a way to do this with LINQ that allows me to have limited memory use by paging data, and yet still have each page load in constant time. Is there a way to do this properly? My hunch is that there might be some way to tell the DataContext to explicitly forget about the object in the first approach above, but I can't find out how to do that.
After madly grasping at some straws, I found that the DataContext's ObjectTrackingEnabled = false could be just what the doctor ordered. It is, not surprisingly, specifically designed for a read-only case like this.
using (var readOnlyDataContext =
new MyDataContext(_conn) {CommandTimeout = really_long, ObjectTrackingEnabled = false})
{
var logs =
(from record in readOnlyDataContext.someTable
where [index is appropriate]
select record);
foreach( linqEntity l in logs )
{
// Do stuff with data from l
}
}
The above approach does not use any memory when streaming through objects. When writing data, I can use a different DataContext that has object tracking enabled, and that seems to work okay. However, this approach does have the problem of a SQL query that can take an hour or more to stream and complete, so if there's a way to do the paging as above without the performance hit, I'm open to other alternatives.
A warning about turning object tracking off: I found out that when you try to do multiple concurrent reads with the same DataContext, you don't get the error There is already an open DataReader associated with this Command which must be closed first. The application just goes into an infinite loop with 100% CPU usage. I'm not sure if this is a C# bug or a feature.

LINQ (to objects) , running several queries over the same IEnumerable?

Is it somehow possible to chain together several LINQ queries on the same IEnumerable ?
Some background,
I've some files, 20-50Gb in size, they will not fit in memory. Some code parses messages from such a file, and basically does :
public IEnumerable<Record> ReadRecordsFromStream(Stream inStream) {
Record msg;
while ((msg = ReadRecord(inStream)) != null) {
yield return msg;
}
}
This allow me to perform interesting queries on the records.
e.g. find the average duration of a Record
var records = ReadRecordsFromStream(stream);
var avg = records.Average(x => x.Duration);
Or perhaps the number of records per hour/minute
var x = from t in records
group t by t.Time.Hour + ":" + t.Time.Minute into g
select new { Period = g.Key, Frequency = g.Count() };
And there's a a dozen or so more queries I'd like to run to pull relevant info out of these records. Some of the simple queries can certainly be combined in a single query, but this seem to get unmanegable quite fast.
Now, each time I run these queries, I have to read the file from the beginning again, all records reparsed - parsing a 20Gb file 20 times takes time, and is a waste.
What can I do to be able to do just one pass over the file, but run several linq queries against it ?
You might want to consider using Reactive Extensions for this. It's been a while since I've used it, but you'd probably create a Subject<Record>, attach all your queries to it (as appropriate IObservable<T> variables) and then hook up the data source. That will push all the data through the various aggregations for you, only reading from disk once.
While the exact details elude me without downloading the latest build myself, I blogged on this a couple of times: part 1; part 2. (Various features that I complained about being missing in part 1 were added :)
I have done this before for logs with 3-10MB/file. Haven't reached that file size but I tried to execute this in a 1GB+ total log files without consuming that much of RAM. You may try what I did.
There's a technology that allows you to do this kind of thing. It's called a database :)

Can you get DataReader-like streaming using Linq-to-SQL?

I've been using Linq-to-SQL for quite awhile and it works great. However, lately I've been experimenting with using it to pull really large amounts of data and am running across some issues. (Of course, I understand that L2S may not be the right tool for this particular kind of processing, but that's why I'm experimenting - to find its limits.)
Here's a code sample:
var buf = new StringBuilder();
var dc = new DataContext(AppSettings.ConnectionString);
var records = from a in dc.GetTable<MyReallyBigTable>() where a.State == "OH" select a;
var i = 0;
foreach (var record in records) {
buf.AppendLine(record.ID.ToString());
i += 1;
if (i > 3) {
break; // Takes forever...
}
}
Once I start iterating over the data, the query executes as expected. When stepping through the code, I enter the loop right away which is exactly what I hoped for - that means that L2S appears to be using a DataReader behind the scenes instead of pulling all the data first. However, once I get to the break, the query continues to run and pull all the rest of the records. Here are my questions for the SO community:
1.) Is there a way to stop Linq-to-SQL from finishing execution of a really big query in the middle the way you can with a DataReader?
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the DataContext from filling up with change tracking information for every object returned. Basically, instead of filling up memory, can I do a large query with short object lifecycles the way you can with DataReader techniques?
I'm okay if this isn't functionality built-in to the DataContext itself and requires extending the functionality with some customization. I'm just looking to leverage the simplicity and power of Linq for large queries for nightly processing tasks instead of relying on T-SQL for everything.
1.) Is there a way to stop Linq-to-SQL from finishing execution of a really
big query in the middle the way you
can with a DataReader?
Not quite. Once the query is finally executed the underlying SQL statement is returning a result set of matching records. The query is deferred up till that point, but not during traversal.
For your example you could simply use records.Take(3) but I understand your actual logic to halt the process might be external to SQL or not easily translatable.
You could use a combination approach by building a strongly typed LINQ query then executing it with old fashioned ADO.NET. The downside is you lose the mapping to the class and have to manually deal with the SqlDataReader results. An example of this is shown below:
var query = from c in Customers
where c.ID < 15
select c;
using (var command = dc.GetCommand(query))
{
command.Connection.Open();
using (var reader = command.ExecuteReader())
{
int i = 0;
while (reader.Read())
{
Customer c = new Customer();
c.ID = reader.GetInt32(reader.GetOrdinal("ID"));
c.Name = reader.GetString(reader.GetOrdinal("Name"));
Console.WriteLine("{0}: {1}", c.ID, c.Name);
i++;
if (i > 3)
break;
}
}
}
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the
DataContext from filling up with
change tracking information for every
object returned.
If your intention for a particular query is to use it for read-only purposes then you could disable object tracking to increase performance by setting the DataContext.ObjectTrackingEnabled property to false:
using (var dc = new MyDataContext())
{
dc.ObjectTrackingEnabled = false;
// do stuff
}
You can also read this MSDN topic: How to: Retrieve Information As Read-Only (LINQ to SQL).

Categories

Resources