Linq caching data values - major concurrency problem? - c#

Here's a little experiment I did:
MyClass obj = dataContext.GetTable<MyClass>().Where(x => x.ID == 1).Single();
Console.WriteLine(obj.MyProperty); // output = "initial"
Console.WriteLine("Waiting..."); // put a breakpoint after this line
obj = null;
obj = dataContext.GetTable<MyClass>().Where(x => x.ID == 1).Single(); // same as before, but reloaded
Console.WriteLine(obj.MyProperty); // output still = "initial"
obj.MyOtherProperty = "foo";
dataContext.SubmitChanges(); // throws concurrency exception
When I hit the breakpoint after line 3, I go to a SQL query window and manually change the value to "updated". Then I carry on running. Linq does not reload my object, but re-uses the one it previously had in memory! This is a huge problem for data concurrency!
How do you disable this hidden cache of objects that Linq obviously is keeping in memory?
EDIT - On reflection, it is simply unthinkable that Microsoft could have left such a gaping chasm in the Linq framework. The code above is a dumbed-down version of what I'm actually doing, and there may be little subtleties that I've missed. In short, I'd appreciate if you'd do your own experimentation to verify that my findings above are correct. Alternatively, there must be some kind of "secret switch" that makes Linq robust against concurrent data updates. But what?

This isn't an issue I've come across before (since I don't tend to keep DataContexts open for long periods of time), but it looks like someone else has:
http://www.rocksthoughts.com/blog/archive/2008/01/14/linq-to-sql-caching-gotcha.aspx

LinqToSql has a wide variety of tools to deal with concurrency problems.
The first step, however, is to admit there is a concurrency problem to be solved!
First, DataContext's intended object lifecycle is supposed to match a UnitOfWork. If you're holding on to one for extended periods, you're going to have to work that much harder because the class isn't designed to be used that way.
Second, DataContext tracks two copies of each object. One is the original state and one is the changed/changeable state. If you ask for the MyClass with Id = 1, it will give you back the same instance it gave you last time, which is the changed/changeable version... not the original. It must do this to prevent concurrency problems with in-memory instances... LinqToSql does not allow one DataContext to be aware of two changeable versions of MyClass(Id = 1).
Third, DataContext has no idea whether your in-memory change comes before or after the database change, and so cannot referee the concurrency conflict without some guidance. All it sees is:
I read MyClass(Id = 1) from the database.
Programmer modified MyClass(Id = 1).
I sent MyClass(Id = 1) back to the database (the generated UPDATE includes the original values in its WHERE clause, which is the optimistic concurrency check)
The update will succeed if the database's version matches the original (optimistic concurrency).
The update will fail with concurrency exception if the database's version does not match the original.
OK, now that the problem is stated, here are a couple of ways to deal with it.
You can throw away the DataContext and start over. This is a little heavy-handed for some, but at least it's easy to implement.
You can ask for the original instance or the changed/changeable instance to be refreshed with the database value by calling DataContext.Refresh(RefreshMode, target) (the reference docs have many good concurrency links in the "Remarks" section). This will bring the changes client side and allow your code to work out what the final result should be; a minimal sketch follows this list.
You can turn off concurrency checking in the dbml (ColumnAttribute.UpdateCheck). This disables optimistic concurrency and your code will stomp over anyone else's changes. Also heavy-handed, also easy to implement.
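Here is a minimal, hedged sketch of the second option, assuming the dataContext and the MyClass entity (with Id as primary key) from the question; the conflict-handling types live in System.Data.Linq:

try
{
    dataContext.SubmitChanges(ConflictMode.ContinueOnConflict);
}
catch (ChangeConflictException)
{
    foreach (ObjectChangeConflict conflict in dataContext.ChangeConflicts)
    {
        // Pick the policy you want: KeepCurrentValues, OverwriteCurrentValues or KeepChanges.
        conflict.Resolve(RefreshMode.KeepChanges);
    }
    dataContext.SubmitChanges(); // retry now that the conflicts are resolved
}

// Or proactively refresh a single instance with the database values before deciding what to do:
dataContext.Refresh(RefreshMode.OverwriteCurrentValues, obj);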

Set the ObjectTrackingEnabled property of the DataContext to false.
When ObjectTrackingEnabled is set to true, the DataContext behaves like a Unit of Work. It keeps every object it loads in memory so that it can track changes to it; the DataContext has to remember the object as you originally loaded it to know whether any changes have been made.
If you are working in a read only scenario you should turn off object tracking. It can be a decent performance improvement.
If you aren't working in a read only scenario then I'm not sure why you want it to work this way. If you have made edits then why would you want it to pull in modified state from the database?
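A rough sketch of a read-only context with tracking turned off (MyDataContext and connectionString are placeholder names, not from the question):

using (var dc = new MyDataContext(connectionString))
{
    // Must be set before the first query runs, otherwise Linq to SQL throws.
    dc.ObjectTrackingEnabled = false; // no identity map, no change tracking
    var fresh = dc.GetTable<MyClass>().Where(x => x.ID == 1).Single();
    // Every query now goes to the database and returns a new instance,
    // but SubmitChanges() can no longer be used on this context.
}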

LINQ to SQL uses the identity map design pattern, which means that it will always return the same instance of an object for a given primary key (unless you turn off object tracking).
The solution is simple: either use a second data context if you don't want it to interfere with the first instance, or refresh the first instance if you do.
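A quick sketch of the "second data context" option (MyDataContext and connectionString are illustrative names):

MyClass reloaded;
using (var freshContext = new MyDataContext(connectionString))
{
    // A new context starts with an empty identity map, so this query hits the database.
    reloaded = freshContext.GetTable<MyClass>().Where(x => x.ID == 1).Single();
}
Console.WriteLine(reloaded.MyProperty); // reflects the current database value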

Related

C# BulkWriteAsync, Transactions and Results

I am relatively new to working with MongoDB. Currently I am getting a little more familiar with the API and especially with the C# driver, and I have a few questions about bulk updates. As the C# driver offers a BulkWriteAsync method, I could read a lot about it in the Mongo documentation. As I understand it, it is possible to configure the BulkWrite not to stop in case of an error at any step; this can be done by using the unordered setting. What I did not find is what happens to the data. Does the database do a rollback in case of an error, or do I have to handle that myself with a surrounding transaction? In case of an error: can I get details of which step was not successful? Think of a bulk with updates on 100 documents. Can I find out which updates were not successful? As the BulkWriteResult offers very little information, I am not sure if this operation is really a good one for me.
thanks in advance
You're right in that BulkWriteResult doesn't provide the full set of information to make a call on what to do.
In the case of a MongoBulkWriteException<T>, however, you can access the WriteErrors property to get the indexes of models that errored. Here's a pared down example of how to use the property.
var models = sourceOfModels.ToArray();
for (var i = 0; i < MaxTries; i++)
{
    try
    {
        return await someCollection.BulkWriteAsync(models, new BulkWriteOptions { IsOrdered = false });
    }
    catch (MongoBulkWriteException e)
    {
        // reconstitute the list of models to try from the set of failed models
        models = e.WriteErrors.Select(x => models[x.Index]).ToArray();
    }
}
Note: The above is very naive code. My actual code is more sophisticated. What the above does is try over and over to do the write, in each case, with only the outstanding writes. Say you started with 1000 ReplaceOne<T> models to write, and 900 went through; the second try will try against the remaining 100, and so on until retries are exhausted, or there are no errors.
If the code is not within a transaction, and an error occurs, of course nothing is rolled back; you have some writes that succeed and some that do not. In the case of a transaction, the exception is still raised (MongoDB 4.2+). Prior to that, you would not get an exception.
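If you do want all-or-nothing behaviour, here is a hedged sketch of wrapping the bulk write in a transaction; it assumes "client" is the IMongoClient behind someCollection and a deployment that supports transactions:

using (var session = await client.StartSessionAsync())
{
    session.StartTransaction();
    try
    {
        await someCollection.BulkWriteAsync(session, models,
            new BulkWriteOptions { IsOrdered = false });
        await session.CommitTransactionAsync();
    }
    catch (MongoBulkWriteException)
    {
        await session.AbortTransactionAsync(); // nothing from this batch is persisted
        throw;
    }
}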
Finally, while the default is ordered writes, unordered writes can be very useful when the writes are unrelated to one another (e.g. documents representing DDD aggregates where there are no dependencies). It's this same "unrelatedness" that also obviates the need for a transaction.

entity framework stored procedure result to domain model

I'm going through old projects at work trying to make them faster, and I'm currently looking at some web APIs. One API is running particularly slowly, and the problem is in the data service it is calling. Specifically, it is in a lambda method trying to map a stored procedure result to a domain model. A simplified version of the code:
public IEnumerable<DomainModelResult> GetData()
{
    return this.EntityFrameworkDB.GetDataSproc().ToList()
        .Select(sprocResults => sprocResults.ToDomainModelResult())
        .AsEnumerable();
}
This is a simplified version, but after profiling it I found the major hangup is in the lambda function. I am assuming this is because the EFContext is still open and some goofy entity framework stuff is happening.
Problem is, I'm relatively new to Entity Framework (I'm an intern) and pretty ignorant of its inner workings. Could someone explain why this is so slow? I feel it should be very fast: DomainModelResult is a POCO and only setters are used in ToDomainModelResult.
Edit:
I thought ToList() would do that, but started to doubt myself because I couldn't think of another explanation. All the ToDomainModelResult() stuff is extremely simple. Something like:
public static DomainModelResult ToDomainModelResult(this SprocResult source)
{
    return new DomainModelResult
    {
        FirstName = source.description,
        MiddleName = source._middlename,
        LastName = source.lastname,
        UserName = source.expr2,
        Address = source.uglyName
    };
}
It's just a bunch of simple setters; I think the model causing problems has 17 properties. The reason this is being done is that the project is an old database-first one and the stored procedures have ugly names that aren't descriptive at all. It also means switching the stored procedures in the data services is easy and doesn't break the rest of the project.
Edit 2: For some reason, using ToArray and breaking apart the LINQ statements makes the assignment from procedure result to domain model result extremely fast. Now the whole data service method is faster, which is odd; I don't know where the rest of the time went.
This might be a more esoteric question than I originally thought. My question hasn't been answered, but the problem is no longer there. Thanks to all who replied. I'm keeping this as unanswered for now.
Edit 3: Please flag this question for removal; I can't remove it. I found the problem, but it is totally unrelated to my original question, which I misunderstood when I asked it. The increase in speed I'm chalking up to compiler optimization and running the code in the profiler. The real issue wasn't in my lambda but in a dynamic lambda called by Entity Framework when the context is closed or an object is accessed; it was doing data validation. GetString, GetInt32, and IsDBNull were eating up the most time. So I'm assuming Microsoft has optimized these methods, and the only way to speed this up is possibly making some variables not nullable in the procedure. This question is misleading and so esoteric I don't think it belongs here and will just confuse people. Sorry.
You should split the code and check which part is taking the time.
public IEnumerable<DomainModelResult> GetData()
{
    var lst = this.EntityFrameworkDB.GetDataSproc().ToList();
    return lst
        .Select(sprocResults => sprocResults.ToDomainModelResult())
        .AsEnumerable();
}
I am pretty sure the GetDataSproc procedure is taking most of your time. You need to optimize the stored procedure code.
Update
If possible, it is better to do more work on the SQL side instead of retrieving 60,000 rows into your memory. A few possible solutions:
If you need to display this information, do paging (top and skip)
If you are doing any filtering, calculating or grouping after you retrieve rows into memory, do it in your stored proc instead
On the .NET side, as you are returning IEnumerable, you may be able to use yield in your second part, depending on your architecture (see the sketch below)
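A rough sketch of the last two points; GetDataSprocPaged is a hypothetical paged variant of the stored procedure call, not something from the question:

public IEnumerable<DomainModelResult> GetData(int pageIndex, int pageSize)
{
    // Let SQL do the paging so only one page of rows is pulled into memory.
    var rows = this.EntityFrameworkDB.GetDataSprocPaged(pageIndex * pageSize, pageSize);
    foreach (var row in rows)
    {
        // yield streams the mapping as the caller enumerates instead of building a full list up front.
        yield return row.ToDomainModelResult();
    }
}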

Prevent the DbContext from repeatedly trying to save bad data

I have a process that is importing an Excel spreadsheet and parsing the data into my data objects. The source of this data is very questionable, as we're moving our customer from spreadsheet-based data management to a managed database system with checks for valid data.
During my import process, I do some basic sanity checks of the data to accommodate just how bad the data could be that we're importing, but I have my overall validation being done in the DbContext.
Part of what I'm trying to do is provide the row # in the spreadsheet where the data is bad, so they can easily determine what they need to fix to get the file to import.
Once I have the data from the spreadsheet (model), and the Opportunity they're working with from the database (opp), here's the pseudocode of my process:
foreach (var model in Spreadsheet.Rows) { // Again, pseudocode
    if (opp != null && ValidateModel(model, opp, row)) {
        // Copy properties to the database object
        // This is in a Repository-layer method, not directly in my import process.
        // Just written here for clarity instead of several nested method calls.
        context.SaveChanges();
    }
}
I can provide more of the code here if needed, but the problem comes in my DbContext's ValidateEntity() method (override of DbContext).
Again, there's nothing wrong with the code that I've written, as far as I'm aware, but if an Opportunity fails this level of validation, it stays as part of the unsaved objects in the context, meaning it gets validated again every time ValidateEntity() is called. This leads to a repeat of the same validation error message for every row after the initial problem occurs.
Is there a way to get the context to stop trying to validate an object after it fails validation once? I know I could wait until the end and call context.SaveChanges() just once to get around this, but I would like to be able to match each validation failure with the row it came from.
For reference, I am using Entity Framework 6.1 with a Code First approach.
EDIT Attempting to clarify further for Marc L. (including an update to the code block above)
Right now, my process will iterate through as many rows as there are in the spreadsheet. The reason I'm calling my repository layer with each object to save, instead of using an approach that only calls context.SaveChanges() once, is so that I can determine which row is causing a validation error.
I'm glad that my DbContext's custom ValidateEntity() methods are catching the validation errors, but the problem is that the DbEntityValidationException is now thrown for the same entity multiple times.
I would like it so that if the object fails validation once, the context no longer tries to save the object, regardless of how many times context.SaveChanges() is called.
Your question is not a dupe (this is about saving, not loaded entities) but you could follow Jimmy's advice above. That is, once an entity is added to the context it is tracked in the "added" state and the only way to stop it from re-validating is by detaching it. It's an SO-internal link, but I'll reproduce the code snippet:
dbContext.Entry(entity).State = EntityState.Detached;
However, I don't think that's the way you want to go, because you're using exceptions to manage state unnecessarily (exceptions are notoriously expensive).
Working from the information given, I'd use a more set-based solution:
modify your model class so that it contains a RowID that records the original spreadsheet row (there's probably other good reasons to have this, too)
turn off entity-tracking for the context (this turns off change detection, allowing each Add() to be O(1))
add all the entities
call context.GetValidationErrors() and get all your errors at once, using the aforementioned RowID to identify the invalid rows (a rough sketch of this follows below)
You didn't indicate whether your process should save the good rows or reject the file as a whole, but this will accommodate either--that is, if you need to save the good rows, detach all the invalid rows using the code above and then SaveChanges().
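A rough sketch of that set-based flow, assuming the entity gained a RowId property, a hypothetical MapToEntity helper, and a DbSet named Opportunities (these names are illustrative, not from the question):

context.Configuration.AutoDetectChangesEnabled = false; // keeps each Add() cheap

foreach (var model in Spreadsheet.Rows)                 // pseudocode source, as above
{
    var entity = MapToEntity(model);                    // hypothetical mapping helper
    entity.RowId = model.RowNumber;                     // remember the originating spreadsheet row
    context.Opportunities.Add(entity);
}

foreach (var error in context.GetValidationErrors().ToList())
{
    var bad = (Opportunity)error.Entry.Entity;
    Console.WriteLine("Row " + bad.RowId + " failed validation");
    context.Entry(bad).State = EntityState.Detached;    // stop it from being re-validated or saved
}

context.SaveChanges(); // persists only the rows that passed validation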
Finally, if you do want to save the good rows and you're uncomfortable with the set-based method, it would be better to use a new DbContext for every single row, or at least create a new DbContext after each error. The ADO.NET team insists that context-creation is "relatively cheap" (sorry I don't have a cite or stats at hand for this) so this shouldn't damage your throughput too much. Even so, it will at least remain O(n). I wouldn't blame you, managing a large context can open you up to other issues as well.

Ways for refreshing linked entity set collections within global static data context data objects

Anyone who's been using LINQ to SQL for any length of time will know that using a global static data context can present problems with synchronisation with the database (especially if it is being used by many concurrent users). For the sake of simplicity, I like to work with objects directly in memory, manipulate them there, then push a context.SubmitChanges() when updates and inserts to that object and its linked counterparts are complete. I am aware this is not recommended, but it also has advantages. The problem is that any attached linked objects are not refreshed by this, and it is also not possible to refresh the collection with context.Refresh(linqobject.linkedcollection), as this does not take into account newly added and removed objects.
My question is: have I missed something obvious? It seems to be madness that there is no simple way to refresh these collections without writing specific logic.
I would also like to offer a workaround which I have discovered, but I am interested to know if there are drawbacks with this approach (I have not profiled the output and am concerned that it may be generating unintended insert and delete statements).
foreach (OObject O in Program.BaseObject.OObjects.OrderBy(o => o.ID))
{
    Program.DB.Refresh(System.Data.Linq.RefreshMode.OverwriteCurrentValues, O);
    Program.DB.Refresh(System.Data.Linq.RefreshMode.OverwriteCurrentValues, O.LinksTable);
    O.LinksTable.Assign(Program.DB.LinksTable.Where(q => q.OObject == O));
}
It also seems to be possible to do things like Program.DB.Refresh(System.Data.Linq.RefreshMode.OverwriteCurrentValues, Program.DB.OObjects);
however, this appears to pull back the entire table, which is often highly inefficient. Any ideas?

Read a value multiple times or store as a variable first time round?

Basically, is it better practice to store a value in a variable the first time round, or to read it from its source every time it's needed? The code explains it better:
TextWriter tw = null;
if (!File.Exists(ConfigurationManager.AppSettings["LoggingFile"]))
{
    // ...
    tw = File.CreateText(ConfigurationManager.AppSettings["LoggingFile"]);
}
or
TextWriter tw = null;
string logFile = ConfigurationManager.AppSettings["LoggingFile"].ToString();
if (!File.Exists(logFile))
{
    // ...
    tw = File.CreateText(logFile);
}
Clarity is important, and DRY (don't repeat yourself) is important. This is a micro-abstraction - hiding a small, but still significant, piece of functionality behind a variable. The performance difference is negligible, but the positive impact on clarity can't be overstated. Use a well-named variable to hold the value once it's been acquired.
The second solution is better for me because:
the dictionary lookup has a cost
it's more readable
Or you can have a singleton object with a private constructor that populates all the configuration data you need once.
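A rough sketch of that idea (the class and property names are illustrative; ConfigurationManager is in System.Configuration):

public sealed class AppConfig
{
    private static readonly AppConfig instance = new AppConfig();
    public static AppConfig Instance { get { return instance; } }

    public string LoggingFile { get; private set; }

    private AppConfig()
    {
        // Read the setting once; every consumer then uses the cached value.
        LoggingFile = ConfigurationManager.AppSettings["LoggingFile"];
    }
}

// usage: if (!File.Exists(AppConfig.Instance.LoggingFile)) { ... }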
Second one would be the best choice.
Imagine this situation: the settings are updated by another thread, and since the setting value isn't locked, it changes to a different value between your two reads.
In the first version, your execution can fail, or it will run fine but the code checks for a file with one name and later saves to a file that isn't the one it checked. That would be bad, wouldn't it?
Another benefit is that you're not retrieving the value twice: you get it once and use it wherever your code needs to read the setting.
I'm pretty sure the second one is more readable. But when it comes to performance, don't optimize at an early stage and without a profiler.
I must agree with the others. Readability and DRY are important, and the cost of the variable is very low, considering that you will often just be holding a reference rather than storing the thing multiple times.
There might be exceptions with special or large objects. You must keep in mind whether the value you cache might change in between, and whether you would (most of the time you wouldn't!) want to see the new value in your code. In your example, think about what might happen when ConfigurationManager.AppSettings["LoggingFile"] changes between the two calls (due to accessor logic, another thread, or the value always being re-read from a file on disk).
In summary: about 99% of the time you will want the second method / the cache!
IMO that would depend on what you are trying to cache. Caching a setting from App.config might not be as beneficial (apart from code readability) as caching the results of a web service call over a GPRS connection.
