So I have a program that does some maintenance work on a MySQL database. I'm connecting through SSH tunneling with PuTTY (yes, I know running the program on the server itself would be much better, but I don't have a choice here).
I have some problems with the program's speed. I solved one of them by adding ".Include(table_name)", but I can't think of a way to do that here.
The purpose of this method is to clean the database of unwanted, broken records. The simplified code looks like this:
using (var dbContext = new MyDatabase_dataEntities())
{
    List<achievements> achiList = new List<achievements>();
    var achievementsQuery = from data in dbContext.achievements
                            orderby data.playerID
                            select data;
    achiList = achievementsQuery.Skip(counter * 5000).Take(5000).ToList();
    foreach (achievements record in achiList)
    {
        var playerExists = from data in dbContext.players_data
                           where data.playerID == record.playerID
                           select data;
        if (!playerExists.Any())
        {
            dbContext.achievements.Remove(record);
        }
    }
    dbContext.SaveChanges();
    counter++;
}
It is built this way because I want to load a chunk of the achievements table and then check whether each achievement has its player in players_data. If the player doesn't exist, remove his achievements.
It is all inside a do-while loop, so I don't overload memory by loading all the data at once.
I know the problem is the database check inside the foreach, but I can't figure out how to do it without it. Other things I tried produced errors, either because EF couldn't translate them into SQL or because exceptions were thrown when trying to access a non-existing entity. Doing it in the foreach bottlenecks the whole program, probably because of the ping to the server.
I will need something similar more often, so I would be really grateful if anyone could help me remove the need to call the database inside the foreach. I know I could load the whole players_data table and then check with Any(), but some of the tables I need this on are too big for that.
Oh, and turning off change tracking doesn't help at this point, because that is not what slows the program down.
I would be grateful for any help, thanks in advance!
EDIT: Hmm, is there a way to get the achievements which don't have corresponding player_data in one query, using the associations? Something like adding this to the achievements query:
where !data.player_data.Exists()
IntelliSense shows me that there is nothing like Exists or Any available at that point. Is there any trick similar to this? It would definitely deal with my speed problem, since the database call in the foreach wouldn't be needed.
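The closest thing I can get to compile is a subquery against the whole players_data set instead of the association; a sketch (I'm not sure whether EF translates it well):
// Hoping EF turns the !Any() into a NOT EXISTS subquery on the server.
var orphanedQuery = from data in dbContext.achievements
                    where !dbContext.players_data.Any(p => p.playerID == data.playerID)
                    orderby data.playerID
                    select data;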
If you want to delete achievements that don't have corresponding player records, then you can use the SQL query below:
DELETE a
FROM `achievements` a
LEFT JOIN `players_data` AS p
    ON p.`playerID` = a.`playerID`
WHERE p.`playerID` IS NULL;
The SQL query will be an order of magnitude faster than doing it through Entity Framework.
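You can still execute that statement from your C# code through the context; a rough sketch, assuming a DbContext-based model (with an ObjectContext, ExecuteStoreCommand plays the same role):
using (var dbContext = new MyDatabase_dataEntities())
{
    // Runs the DELETE entirely on the server; no entities are loaded or tracked.
    int deleted = dbContext.Database.ExecuteSqlCommand(
        @"DELETE a
          FROM achievements a
          LEFT JOIN players_data AS p ON p.playerID = a.playerID
          WHERE p.playerID IS NULL;");
}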
If you would rather work with entities in the application, you can use the following code, which uses LINQ to Entities and LINQ extension methods. I assume you have a foreign key from achievements to players_data, so that Entity Framework generates a lazy players_data navigation property on your achievements entity:
using (var dbContext = new MyDatabase_dataEntities())
{
    var proceed = true;
    while (proceed)
    {
        // Get the next 1000 entities to delete
        var entitiesToDelete = dbContext.achievements
            .Where(x => x.players_data == null)
            .Take(1000)
            .ToList();
        dbContext.achievements.RemoveRange(entitiesToDelete);
        dbContext.SaveChanges();
        // Proceed only if this iteration deleted any entities
        proceed = entitiesToDelete.Count > 0;
    }
}
If you prefer to use LINQ query syntax instead of extension methods, then your code will look like:
using (var dbContext = new MyDatabase_dataEntities())
{
    var proceed = true;
    while (proceed)
    {
        // Get the next 1000 entities to delete
        var query = from achievement in dbContext.achievements
                    where achievement.players_data == null
                    select achievement;
        var entitiesToDelete = query.Take(1000).ToList();
        dbContext.achievements.RemoveRange(entitiesToDelete);
        dbContext.SaveChanges();
        // Proceed only if this iteration deleted any entities
        proceed = entitiesToDelete.Count > 0;
    }
}
I have 2 databases with tables.
I want to insert records from the first database's table into the second database's table using LINQ. I have created 2 DBML files with 2 data contexts, but I am unable to work out the code for inserting the records.
I have a list of records:
using (_TimeClockDataContext)
{
    var Query = (from EditTime in _TimeClockDataContext.tblEditTimes
                 orderby EditTime.ScanDate ascending
                 select new EditTimeBO
                 {
                     EditTimeID = EditTime.EditTimeID,
                     UserID = Convert.ToInt64(EditTime.UserID),
                     ScanCardId = Convert.ToInt64(EditTime.ScanCardId),
                 }).ToList();
    return Query;
}
Now I want to insert these records into a new table which is in _Premire2DataContext.
If you want to "copy" records from one database to another using Linq then you need two database contexts, one for the database you are reading from, and one for the database you are reading to.
EditTime[] sourceRows;
using (var sourceContext = CreateSourceContext())
{
    sourceRows = ReadRows(sourceContext);
}
using (var destinationContext = CreateDestinationContext())
{
    WriteRows(destinationContext, sourceRows);
}
You now just need to fill in the implementations for ReadRows and WriteRows using standard LINQ to SQL code. The code for writing rows should look a bit like this.
void WriteRows(TimeClockDataContext context, EditTime[] rows)
{
    foreach (var row in rows)
    {
        context.tblEditTimes.InsertOnSubmit(row);
    }
    context.SubmitChanges();
}
Note that as long as the schema is the same, you can use the same context type and therefore the same entity classes. When reading records we therefore ideally want to return the correct array type, so reading is going to look a bit like this:
EditTime[] ReadRows(TimeClockDataContext context)
{
    return (
        from EditTime in context.tblEditTimes
        orderby EditTime.ScanDate ascending
        select EditTime
    ).ToArray();
}
You can use an array or a list - it doesn't really matter. I've used an array mostly because the syntax is shorter. Note that we return the original EditTime objects rather than create new ones as this means we can add those objects directly to the second data context.
I've not compiled any of this code, so it might contain some typos. Also, apologies if I've made some obvious errors; it's been a while since I last used LINQ to SQL.
If you have foreign keys or the second database has a different schema, then things get more complicated, but the fundamental process remains the same: read from one context (using standard LINQ to SQL), store the results somewhere, then add the rows to the second context (using standard LINQ to SQL).
Also note that this isn't necessarily going to be particularly quick. If performance is an issue, you should look into using bulk inserts in the WriteRows method, or potentially even use linked servers to do the entire thing in SQL.
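For example, a rough SqlBulkCopy sketch (System.Data.SqlClient; this assumes SQL Server, and the tblEditTimes column names and types below are guesses based on the question, not confirmed):
void BulkWriteRows(string destinationConnectionString, EditTime[] rows)
{
    // Stage the rows in a DataTable whose columns mirror tblEditTimes.
    var table = new DataTable();
    table.Columns.Add("EditTimeID", typeof(long));
    table.Columns.Add("UserID", typeof(long));
    table.Columns.Add("ScanCardId", typeof(long));
    foreach (var row in rows)
    {
        table.Rows.Add(row.EditTimeID, row.UserID, row.ScanCardId);
    }
    using (var bulkCopy = new SqlBulkCopy(destinationConnectionString))
    {
        bulkCopy.DestinationTableName = "tblEditTimes";
        bulkCopy.WriteToServer(table); // one bulk operation instead of N single-row inserts
    }
}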
What I've got:
I have a large list of addresses (IP addresses), numbering in the millions.
What I'm trying to do:
Remove 500k addresses efficiently through Entity Framework.
My Problem:
Right now, I'm splitting the addresses into lists of 10,000 and using RemoveRange(ListOfaddresses):
if (addresses.Count() > 10000)
{
    var addressChunkList = extension.BreakIntoChunks<Address>(addresses.ToList(), 10000);
    foreach (var chunk in addressChunkList)
    {
        db.Address.RemoveRange(chunk);
    }
}
but I'm getting an OutOfMemoryException, which must mean that resources aren't being freed even though I'm splitting the addresses into separate lists.
What can I do to avoid the OutOfMemoryException and still remove large quantities of addresses within a reasonable time?
When I have needed to do something similar, I have turned to the following plugin (with which I am not associated):
https://github.com/loresoft/EntityFramework.Extended
This allows you to do bulk deletes with Entity Framework without having to select and load the entities into memory first, which of course is much more efficient.
Example from the website:
context.Users.Delete(u => u.FirstName == "firstname");
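Applied to the Address case above, it might look like this (a sketch; addressIdsToDelete is a hypothetical list of ids, and a Contains over a very large list can itself be slow to translate):
// Deletes matching rows in a single server-side statement;
// no Address entities are ever materialized in memory.
db.Address.Where(a => addressIdsToDelete.Contains(a.Id)).Delete();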
So? Where did you get the idea that EF is an ETL / bulk data manipulation tool?
It is not. Doing half a million deletes in one transaction will be dead slow (they are deleted one by one), and EF is simply not made for this, as you found out.
There is nothing you can do here. Either use EF within its design parameters or choose an alternative approach for these bulk operations. There are cases where an ORM makes little sense.
A couple of suggestions.
Use a stored procedure or plain SQL (sketched below)
Move your DbContext to a narrower scope:
for (int i = 0; i < 500000; i += 1000)
{
using (var db = new DbContext())
{
var chunk = largeListOfAddress.Take(1000).Select(a => new Address { Id = a.Id });
db.Address.RemoveRange(chunk);
db.SaveChanges();
}
}
See Rick Strahl's post on bulk inserts for more details
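For the plain SQL option, a minimal sketch (assuming EF 6's Database.ExecuteSqlCommand, the BreakIntoChunks helper from the question, and a numeric Id column, which is the only reason the string concatenation is safe here):
using (var db = new DbContext())
{
    foreach (var chunk in extension.BreakIntoChunks<Address>(largeListOfAddress, 1000))
    {
        // Build "DELETE ... WHERE Id IN (1,2,3,...)" and run it server-side;
        // no Address entities are tracked or loaded.
        var ids = string.Join(",", chunk.Select(a => a.Id));
        db.Database.ExecuteSqlCommand("DELETE FROM Address WHERE Id IN (" + ids + ")");
    }
}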
I want to replace existing records in the DB with new records in one transaction. Using TransactionScope, I have
using (var scope = new TransactionScope())
{
    db.Tasks.DeleteAllOnSubmit(oldTasks);
    db.SubmitChanges();
    db.Tasks.InsertAllOnSubmit(newTasks);
    db.SubmitChanges();
    scope.Complete();
}
My program threw
System.InvalidOperationException: Cannot add an entity that already exists.
After some trial and error, I found that the culprit lies in the fact that there aren't any other execution instructions between the delete and the insert. If I insert other code between the first SubmitChanges() and InsertAllOnSubmit(), everything works fine. Can anyone explain why this is happening? It is very concerning.
I tried another one to update the objects:
IEnumerable<Task> tasks = ( ... some long query that involves multi tables )
.AsEnumerable()
.Select( i =>
{
i.Task.Duration += i.LastLegDuration;
return i.Task;
}
db.SubmitChanges();
This didn't work either; db didn't pick up any changes to the Tasks.
EDIT:
This behavior doesn't seem to have anything to do with transactions. In the end, I adopted the grossly inefficient update:
newTasks.ForEach( t =>
{
    Task attached = db.Tasks.Single( i => ... use primary id to look up ... );
    attached.Duration = ...;
    ... more updates, property by property ...
});
db.SubmitChanges();
Instead of deleting and inserting, or making multiple queries, you can try updating multiple rows in one pass by selecting a list of ids to update and checking whether the list contains each item.
Also, make sure you mark your transaction as complete, to indicate to the transaction manager that state across all resources is consistent and the transaction can be committed.
Dictionary<int,int> taskIdsWithDuration = getIdsOfTasksToUpdate(); //fetch a dictionary keyed on id's from your long query and values storing the corresponding *LastLegDuration*
using (var scope = new TransactionScope(TransactionScopeOption.Required))
{
var tasksToUpdate = db.Tasks.Where(x => taskIdsWithDuration.Keys.Contains(x.id));
foreach (var task in tasksToUpdate)
{
task.duration1 += taskIdsWithDuration[task.id];
}
db.SaveChanges();
scope.Complete();
}
Depending on your scenario, you can invert the search in the case that your table is extremely large and the number of items to update is reasonably small, to leverage indexing. Your existing update query should work fine if this is the case, so I doubt you'll need to invert it.
I had the same problem in LINQ to SQL, and I don't think it's to do with the transaction, but with how the session/context coalesces changes. I say this because I fixed the problem by bypassing LINQ to SQL for the delete and using some raw SQL to do it. Ugly, I know, but it worked, and all inside a transaction scope.
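For reference, a minimal sketch of that workaround (the Task table name and the projectId filter here are hypothetical; DataContext.ExecuteCommand takes the SQL plus parameters):
using (var scope = new TransactionScope())
{
    // The raw SQL delete bypasses the change tracker, so the entities queued
    // for insertion below never collide with ones marked for deletion.
    db.ExecuteCommand("DELETE FROM Task WHERE ProjectId = {0}", projectId);
    db.Tasks.InsertAllOnSubmit(newTasks);
    db.SubmitChanges();
    scope.Complete();
}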
I've been using Linq-to-SQL for quite a while and it works great. However, lately I've been experimenting with using it to pull really large amounts of data and am running across some issues. (Of course, I understand that L2S may not be the right tool for this particular kind of processing, but that's why I'm experimenting: to find its limits.)
Here's a code sample:
var buf = new StringBuilder();
var dc = new DataContext(AppSettings.ConnectionString);
var records = from a in dc.GetTable<MyReallyBigTable>() where a.State == "OH" select a;
var i = 0;
foreach (var record in records) {
buf.AppendLine(record.ID.ToString());
i += 1;
if (i > 3) {
break; // Takes forever...
}
}
Once I start iterating over the data, the query executes as expected. When stepping through the code, I enter the loop right away which is exactly what I hoped for - that means that L2S appears to be using a DataReader behind the scenes instead of pulling all the data first. However, once I get to the break, the query continues to run and pull all the rest of the records. Here are my questions for the SO community:
1.) Is there a way to stop Linq-to-SQL from finishing execution of a really big query in the middle the way you can with a DataReader?
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the DataContext from filling up with change tracking information for every object returned. Basically, instead of filling up memory, can I do a large query with short object lifecycles the way you can with DataReader techniques?
I'm okay if this isn't functionality built-in to the DataContext itself and requires extending the functionality with some customization. I'm just looking to leverage the simplicity and power of Linq for large queries for nightly processing tasks instead of relying on T-SQL for everything.
1.) Is there a way to stop Linq-to-SQL from finishing execution of a really big query in the middle the way you can with a DataReader?
Not quite. Once the query is finally executed, the underlying SQL statement returns a result set of matching records. The query is deferred up to that point, but not during traversal.
For your example you could simply use records.Take(3), but I understand that your actual logic to halt the process might be external to SQL or not easily translatable.
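For instance, when the limit can be pushed into the query itself, a one-line sketch:
// Take(4) translates to a server-side TOP/LIMIT clause, so only four
// rows ever cross the connection instead of the whole result set.
var firstFour = records.Take(4).ToList();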
You could use a combination approach by building a strongly typed LINQ query then executing it with old fashioned ADO.NET. The downside is you lose the mapping to the class and have to manually deal with the SqlDataReader results. An example of this is shown below:
var query = from c in dc.Customers
            where c.ID < 15
            select c;

using (var command = dc.GetCommand(query))
{
    command.Connection.Open();
    using (var reader = command.ExecuteReader())
    {
        int i = 0;
        while (reader.Read())
        {
            Customer c = new Customer();
            c.ID = reader.GetInt32(reader.GetOrdinal("ID"));
            c.Name = reader.GetString(reader.GetOrdinal("Name"));
            Console.WriteLine("{0}: {1}", c.ID, c.Name);
            i++;
            if (i > 3)
                break;
        }
    }
}
2.) If you execute a large Linq-to-SQL query, is there a way to prevent the DataContext from filling up with change tracking information for every object returned?
If your intention for a particular query is to use it for read-only purposes then you could disable object tracking to increase performance by setting the DataContext.ObjectTrackingEnabled property to false:
using (var dc = new MyDataContext())
{
    // Must be set before the first query is executed on this context.
    dc.ObjectTrackingEnabled = false;
    // do stuff
}
You can also read this MSDN topic: How to: Retrieve Information As Read-Only (LINQ to SQL).
I need to update values, but I am looping through all the table's values to do it:
public static void Update(IEnumerable<Sample> samples, DataClassesDataContext db)
{
    foreach (var sample in db.Samples)
    {
        var matches = samples.Where(a => a.Id == sample.Id);
        if (matches.Any())
        {
            var match = matches.First();
            // Copy the incoming value onto the tracked entity so SubmitChanges persists it.
            sample.SomeColumn = match.SomeColumn;
        }
    }
    db.SubmitChanges();
}
I am certain the code above isn't the right way to do it, but I couldn't think of any other way yet. Can you show a better way?
Yes, there is a simpler way. Much simpler. If you attach your entities to the context and then Refresh (with KeepCurrentValues selected), LINQ to SQL will get those entities from the server, compare them, and mark as updated those that differ. Your code would look something like this:
public static void Update(IEnumerable<Sample> samples, DataClassesDataContext db)
{
    db.Samples.AttachAll(samples);
    db.Refresh(RefreshMode.KeepCurrentValues, samples);
    db.SubmitChanges();
}
In this case, LINQ to SQL is using the keys to match and update records, so as long as your keys are in sync, you're fine.
With Linq2Sql (or Linq to Entities), there is no way* to update records on the server without retrieving them in full first, so what you're doing is actually correct.
If you want to avoid this, write a stored procedure that does what you want and add it to your model.
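If a full stored procedure feels like overkill, DataContext.ExecuteCommand can issue the UPDATE directly; a rough sketch, assuming a Sample table with Id and SomeColumn columns (names taken from the question, not verified):
foreach (var sample in samples)
{
    // One UPDATE statement per row, but no SELECT round-trip
    // to materialize the entity first.
    db.ExecuteCommand(
        "UPDATE Sample SET SomeColumn = {0} WHERE Id = {1}",
        sample.SomeColumn, sample.Id);
}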
I'm not entirely sure if that was your intended question however :)
*: There are some hacks around that use LINQ to build a SELECT statement and butcher the resulting SELECT statement into an UPDATE somehow, but I wouldn't recommend it.