I have a process that imports an Excel spreadsheet and parses the data into my data objects. The source of this data is very questionable, as we're moving our customer from spreadsheet-based data management to a managed database system with checks for valid data.
During my import process, I do some basic sanity checks to accommodate just how bad the imported data could be, but my overall validation is done in the DbContext.
Part of what I'm trying to do is report the row number in the spreadsheet where the data is bad, so they can easily determine what they need to fix to get the file to import.
Once I have the data from the spreadsheet (model), and the Opportunity they're working with from the database (opp), here's the pseudocode of my process:
    foreach (var model in Spreadsheet.Rows) { // Again, pseudocode
        if (opp != null && ValidateModel(model, opp, row)) {
            // Copy properties to the database object
            // This is in a Repository-layer method, not directly in my import process.
            // Just written here for clarity instead of several nested method calls.
            context.SaveChanges();
        }
    }
I can provide more of the code here if needed, but the problem comes in my DbContext's ValidateEntity() method (override of DbContext).
Again, there's nothing wrong with the code that I've written, as far as I'm aware, but if an Opportunity fails this level of validation, it stays among the unsaved objects in the context, meaning that it gets validated again every time ValidateEntity() is called. This leads to a repeat of the same validation error message for every row after the initial problem occurs.
Is there a way to get the context to stop trying to validate the object after it fails validation once? I know I could wait and call context.SaveChanges() once at the end to get around this, but I would like to be able to match this with what row it is in the Database.
For reference, I am using Entity Framework 6.1 with a Code First approach.
EDIT Attempting to clarify further for Marc L. (including an update to the code block above)
Right now, my process will iterate through as many rows as there are in the Spreadsheet. The reason why I'm calling my Repository layer with each object to save, instead of working with an approach that only calls context.SaveChanges() once is to allow myself the ability to determine which row is the one that is causing a validation error.
I'm glad that my DbContext's custom ValidateEntity() method is catching the validation errors, but the problem is that it throws the DbEntityValidationException for the same entity multiple times.
I would like it so that if the object fails validation once, the context no longer tries to save the object, regardless of how many times context.SaveChanges() is called.
Your question is not a dupe (this is about saving, not loaded entities) but you could follow Jimmy's advice above. That is, once an entity is added to the context it is tracked in the "added" state and the only way to stop it from re-validating is by detaching it. It's an SO-internal link, but I'll reproduce the code snippet:
    dbContext.Entry(entity).State = EntityState.Detached;
However, I don't think that's the way you want to go, because you're using exceptions to manage state unnecessarily (exceptions are notoriously expensive).
Working from the information given, I'd use a more set-based solution:
modify your model class so that it contains a RowID that records the original spreadsheet row (there are probably other good reasons to have this, too)
turn off automatic change detection for the context (in EF6, context.Configuration.AutoDetectChangesEnabled = false), which lets each Add() be O(1)
add all the entities
call context.GetValidationErrors() and get all your errors at once, using the aforementioned RowID to identify the invalid rows.
You didn't indicate whether your process should save the good rows or reject the file as a whole, but this will accommodate either--that is, if you need to save the good rows, detach all the invalid rows using the code above and then SaveChanges().
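The steps above could be sketched roughly as follows against EF 6. This is only an illustration under assumptions: the Opportunity shape, the RowId property, the ImportContext class and the saveGoodRows switch are all hypothetical names, while AutoDetectChangesEnabled, GetValidationErrors() and EntityState.Detached are the EF6 members already mentioned.

```csharp
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

// Hypothetical entity; RowId records the spreadsheet row it came from.
public class Opportunity
{
    public int Id { get; set; }
    public int RowId { get; set; }
    public string Name { get; set; }
}

// Hypothetical context; a real one would also override ValidateEntity().
public class ImportContext : DbContext
{
    public DbSet<Opportunity> Opportunities { get; set; }
}

public static class Importer
{
    // Returns one message per validation error, prefixed with the spreadsheet row.
    public static IList<string> Import(IEnumerable<Opportunity> rows, bool saveGoodRows)
    {
        var messages = new List<string>();
        using (var context = new ImportContext())
        {
            // Skip the change-detection scan on every Add().
            context.Configuration.AutoDetectChangesEnabled = false;

            foreach (var row in rows)
            {
                context.Opportunities.Add(row);
            }

            // Validate everything once; materialize the results because
            // entities are detached while we walk them.
            var failures = context.GetValidationErrors().ToList();
            foreach (var failure in failures)
            {
                var entity = (Opportunity)failure.Entry.Entity;
                foreach (var error in failure.ValidationErrors)
                {
                    messages.Add(string.Format("Row {0}: {1}: {2}",
                        entity.RowId, error.PropertyName, error.ErrorMessage));
                }
                // Detached entities are neither saved nor validated again.
                failure.Entry.State = EntityState.Detached;
            }

            // Either save only when the whole file is clean, or save the good rows.
            if (failures.Count == 0 || saveGoodRows)
            {
                context.SaveChanges();
            }
        }
        return messages;
    }
}
```

Because every error carries its RowId, no exception has to be caught per row to find out which spreadsheet line was at fault.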
Finally, if you do want to save the good rows and you're uncomfortable with the set-based method, it would be better to use a new DbContext for every single row, or at least create a new DbContext after each error. I wouldn't blame you: managing a large context can open you up to other issues as well. The ADO.NET team insists that context creation is "relatively cheap" (sorry, I don't have a cite or stats at hand for this), so this shouldn't damage your throughput too much, and the import will still remain O(n).
Hello, I found a problem when I use ASP.NET MVC with EF and call a Web API on another website (which also uses Entity Framework).
The problem is that I want to make sure that the MVC SaveChanges() and the Web API SaveChanges() both succeed together.
Here's my dream pseudo code:
    public ActionResult Operation()
    {
        // Insert / update / delete code....

        bool testMvcSaveSuccess = db.TempSaveChanges(); // this method does not exist
        if (testMvcSaveSuccess)
        {
            bool isApiSuccess = CallApi(); // insert data into the other web app
            if (isApiSuccess)
            {
                db.SaveChanges(); // real save
            }
        }
    }
With the above code, if there is no db.TempSaveChanges(), the Web API call may succeed while the MVC SaveChanges() then fails.
There is nothing like TempSaveChanges because there is something even better: transactions.
A transaction is an IDisposable (so it can be used in a using block) and has methods like Commit and Rollback.
Small example:
    private void TestTransaction()
    {
        // Dispose the context as well as the transaction.
        using (var context = new MyContext(connectionString))
        using (var transaction = context.Database.BeginTransaction())
        {
            // do CRUD stuff here

            // here is your 'TempSaveChanges' execution:
            // the changes are sent to the database but not yet committed
            int changesCount = context.SaveChanges();
            if (changesCount > 0)
            {
                // changes were made; this makes them permanent
                transaction.Commit();
            }
            else
            {
                // no changes detected -> so do nothing
                // 'transaction.Rollback();' could be called here, but with no changes it should not be necessary
                // disposing the transaction at the end of the using block discards anything uncommitted
            }
        }
    }
I have extracted this example from my GitHub Exercise.EntityFramework repository. Feel free to Star/Clone/Fork...
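To tie the transaction back to the question, the remote call can be placed between SaveChanges() and Commit(). The following is a minimal sketch, not a definitive implementation: the callApi delegate is a hypothetical stand-in for the Web API call, and only BeginTransaction(), Commit() and Rollback() are EF6 members.

```csharp
using System;
using System.Data.Entity;

public static class CoordinatedSave
{
    // 'callApi' is a hypothetical delegate that returns true when the
    // remote Web API reports success.
    public static bool SaveThenCallApi(DbContext context, Func<bool> callApi)
    {
        using (var transaction = context.Database.BeginTransaction())
        {
            try
            {
                // Rows are written inside the transaction but not yet committed.
                context.SaveChanges();

                if (!callApi())
                {
                    // Remote side failed: undo the local writes.
                    transaction.Rollback();
                    return false;
                }

                transaction.Commit();
                return true;
            }
            catch
            {
                // A local failure or an exception from the remote call.
                transaction.Rollback();
                throw;
            }
        }
    }
}
```

Two caveats apply. The transaction stays open (and keeps its locks) for the duration of the network call, and Commit() itself can still fail after the remote call has succeeded, so this narrows the window rather than closing it. After a rollback the context should be discarded, since it may still regard its entities as saved.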
Yes you can.
You need to override SaveChanges in the context class so that your check runs first and the regular save is called afterwards.
Or create your own TempSaveChanges() in the context class, call it, and if it succeeds call SaveChanges from it.
What you are referring to is known as atomicity: you want several operations to either all succeed, or none of them. In the context of a database you obtain this via transactions (if the database supports it). In your case however, you need a transaction which spans across two disjoint systems. A general-purpose (some special cases have simpler solutions) robust implementation of such a transaction would have certain requirements on the two systems, and also require additional persistence.
Basically, you need to be able to gracefully recover from a sudden stop at any point during the sequence. Each of the databases you are using is most likely ACID compliant, so you can count on each DB transaction to fulfill the atomicity requirement (it either succeeds or fails). Therefore, all you need to worry about is the sequence of the two DB transactions. Your requirement on the two systems is a way to determine a posteriori whether or not some operation was performed.
Example process flow:
Operation begins
Generate unique transaction ID and persist (with request data)
Make changes to local DB and commit
Call external Web API
Flag transaction as completed (or delete it)
Operation ends
Recovery:
Get all pending (not completed) transactions from store
Check if expected change to local DB was made
Ask Web API if expected change was made
If none of the changes were made or both of the changes were made then the transaction is done: delete/flag it.
If one of the changes was made but not the other, then either revert the change that was made (revert transaction), or perform the change that was not (resume transaction) => then delete/flag it.
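The process flow and recovery steps above can be sketched in memory as follows. Everything here is hypothetical naming for illustration: IParticipant stands for either system (local DB or Web API), the dictionary stands in for the durable store of pending transactions, and recovery uses the "resume" strategy rather than "revert".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical abstraction over one of the two systems.
public interface IParticipant
{
    void Apply(Guid operationId, string requestData);
    bool WasApplied(Guid operationId);   // the "a posteriori" check
}

// In-memory participant that can simulate a single outage.
public sealed class InMemoryParticipant : IParticipant
{
    private readonly HashSet<Guid> _applied = new HashSet<Guid>();
    public bool FailNext { get; set; }

    public void Apply(Guid operationId, string requestData)
    {
        if (FailNext)
        {
            FailNext = false;
            throw new InvalidOperationException("simulated outage");
        }
        _applied.Add(operationId);
    }

    public bool WasApplied(Guid operationId) { return _applied.Contains(operationId); }
}

public sealed class OperationCoordinator
{
    // Stand-in for a durable store; a real one must survive a process crash.
    private readonly Dictionary<Guid, string> _pending = new Dictionary<Guid, string>();
    private readonly IParticipant _local;
    private readonly IParticipant _remote;

    public OperationCoordinator(IParticipant local, IParticipant remote)
    {
        _local = local;
        _remote = remote;
    }

    public List<Guid> PendingIds { get { return _pending.Keys.ToList(); } }

    public void Run(string requestData)
    {
        var id = Guid.NewGuid();
        _pending[id] = requestData;        // persist the intent first
        _local.Apply(id, requestData);     // change local DB and commit
        _remote.Apply(id, requestData);    // call external Web API
        _pending.Remove(id);               // flag as completed
    }

    public void Recover()
    {
        foreach (var entry in _pending.ToList())
        {
            bool local = _local.WasApplied(entry.Key);
            bool remote = _remote.WasApplied(entry.Key);

            // Exactly one side applied: resume by performing the missing change.
            if (local && !remote) _remote.Apply(entry.Key, entry.Value);
            else if (!local && remote) _local.Apply(entry.Key, entry.Value);

            // Neither or both applied (or just resumed): the transaction is done.
            _pending.Remove(entry.Key);
        }
    }
}
```

If Run() throws halfway, the entry stays pending and a later Recover() call brings the two sides back in line; if Apply throws again during recovery, the entry simply remains pending for the next attempt.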
Now, as you can see it quickly gets complicated, especially if "determining if changes were made" is a non-trivial operation. A common solution is to use that unique transaction ID as a means of determining which data needs attention. But at this point it gets very application-specific and depends entirely on what the specific operations are. For certain applications, you can just re-run the entire operation (since you have the entire request data stored in the transaction) in the recovery step. Some special cases do not need to persist the transaction since there are other ways of achieving the same things etc.
OK, so let's clarify things a bit.
You have an MVC app A1, with its own database D1.
You then have an API, let's call it A2, with its own database D2.
You want some code in A1 which does a temp save in D1, then fires a call to A2, and if the response is successful it saves the temp data from D1 in the right place this time.
Based on your pseudo code, I would suggest you create a second table in D1 where you save your "temporary" data. So your database has an extra table and the flow is like this:
First you save your A1 data in that table. You then call A2, data gets saved in D2, A1 receives the confirmation and calls a method which moves the data from the second table to where it should be.
Scenarios to consider:
Saving the temp data in D1 works, but the call to A2 fails. You now clear the orphan data with a batch job, or simply call something that deletes it when the call to A2 fails.
The call to A2 succeeds but the move in D1 fails, so now you have temp data in D1 which has failed to move to the right table. You could add a flag to each row of the second table which indicates that the call to A2 succeeded, so this data needs to move to the right place when possible. You can have a service here which runs periodically, and if it finds any data with the flag set to true it moves the data to the right place.
There are other ways to deal with scenarios like this. You could use a queue system to manage this. Each row of data becomes a message, you assign it a unique id, a GUID, that is basically a CorrelationID and it's the same in both systems. Even if one system goes down, when it comes back up the data will be saved and all is good in the world and because of the common id you can always link it up properly.
I have a Web Api 2 (using C#) controller method running an asynchronous save (RESTful POST method). Our QA tester mashed the save button (client-side issue that was fixed, irrelevant to Q). Without a duplicate check, naturally this was saving duplicate entries to the db.
I implemented a duplicate check (duplicateCount just selects the number of items that are exactly like the item passed to post, should be 0, and it works, details irrelevant):
    var duplicateCount = await fooCollection.CountAsync(aggregateFilter);
    if (duplicateCount > 0) { return BadRequest(); }
This check works...except on the first button mash - two duplicate entries get saved, each from an individual controller hit.
So, it seems to me that the second controller hit happens before the first controller hit manages to save the item to the db, so the duplicate check passes. Is this possible?
I am more interested in the theory than the particular answer. Also, I know I can check for duplicates in the db as well, this is more of a conceptual question. The MongoDb part is really there just for completeness, I imagine it would be similar if I was doing an async save to SQL.
edit: Someone asked how I am making the call in comments. It's through RestAngular, but imo it's irrelevant because I know the controller is getting hit as many times as I hit the button. I also know for a fact that it does not create duplicates on a single hit.
Quick answer is "yes": with the ability of a controller to be instantiated on any number of threads simultaneously, it acts as though it's multi-threaded. Your code is not "thread safe" in that its business operation requires an exclusive lock to be placed on some element of shared information (in this case the state of the database).
You could (I would not recommend) open a Mutex or a database transaction to force single thread behaviour, but your throughput would tank.
I personally don't face this very often because of my (possibly bad) insistence on all my entities having Guid primary keys and use of the SQL Merge command to either insert or update. This may be a helpful pattern (it doesn't matter how many times you send the same "message" to the controller - it will never save a duplicate).
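A minimal in-memory sketch of that pattern, assuming the client generates the Guid and sends it with every request. FooStore is a hypothetical stand-in for the table, and the dictionary write plays the role of the MERGE statement: it inserts or updates in one atomic step, so there is no separate "check, then insert" for a second request to slip through.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class Foo
{
    public Guid Id { get; set; }       // generated by the client, sent with the request
    public string Name { get; set; }
}

// Hypothetical stand-in for a table whose primary key is Foo.Id.
public sealed class FooStore
{
    private readonly ConcurrentDictionary<Guid, Foo> _rows =
        new ConcurrentDictionary<Guid, Foo>();

    // Insert-or-update in a single atomic operation, keyed by Id.
    public void Upsert(Foo item)
    {
        _rows[item.Id] = item;
    }

    public int Count { get { return _rows.Count; } }
}
```

Calling Upsert from many threads at once with the same Id leaves exactly one row, whereas a count-then-insert check can pass on several threads before any of them has written.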
I'm implementing the CQRS pattern with event sourcing, using NServiceBus, NEventStore and NES (which connects NSB and NEventStore).
My application checks a web service regularly for any file to be downloaded and processed. When a file is found, a command (DownloadFile) is sent to the bus and received by FileCommandHandler, which creates a new aggregate root (File) and handles the message.
Now inside the File aggregate root I have to check that the content of the file doesn't match any other file's content (the web service guarantees only that the file name is unique, so the same content may appear under a different name), by hashing it and comparing with the list of hashed contents.
The question is: where should I save the list of hash codes? Is it allowed to query the read model?
    public class File : AggregateBase
    {
        public File(DownloadFile cmd, IFileService fileDownloadService, IClaimSerializerService serializerService, IBus bus)
            : this()
        {
            // code to download the file content, deserialize it, and publish an event.
        }
    }

    public class FileCommandHandler : IHandleMessages<DownloadFile>, IHandleMessages<ExtractFile>
    {
        public void Handle(DownloadFile command)
        {
            // for example, is it possible to do this? (honestly, I feel it is not, since the read model should always be considered stale!)
            var file = readModelContext.GetFileByHashCode(Hash(command.FileContent));
            if (file != null)
                throw new Exception("File content matched with another already downloaded file");

            // Since there is no way to query the event source for file content like:
            // eventSourceRepository.Find<File>(c => c.HashCode == Hash(command.FileContent));
        }
    }
Seems like you're looking for deduplication.
Your command side is where you want things to be consistent. Queries will always leave you open to race conditions. So, instead of running a query, I'd reverse the logic and actually write the hash into a database table (any db with ACID guarantees). If this write is successful, process the file. If the write of the hash fails, skip processing.
There's no point putting this logic into a handler, because retrying the message in case of failure (i.e. storing the hash multiple times) will not make it succeed. You'd also end up with messages for duplicate files in the error queue.
A good place for the deduplication logic is likely inside your web service client. Some pseudo logic:
1. Get file
2. Open transaction
3. Insert hash into database & catch failure (not any failure, only failure to insert)
4. Bus.Send message to process file if # of records inserted in step 3 is not zero
5. Commit transaction
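The pseudo logic above could look roughly like this in memory. It is a sketch under assumptions: ContentDeduplicator and TryDispatch are hypothetical names, the concurrent dictionary stands in for a database table with a unique constraint on the hash column, and the send delegate stands in for Bus.Send.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

public sealed class ContentDeduplicator
{
    // Stand-in for a table with a unique constraint on the hash column.
    private readonly ConcurrentDictionary<string, bool> _hashes =
        new ConcurrentDictionary<string, bool>();

    public static string Hash(byte[] content)
    {
        using (var sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(content));
        }
    }

    // Returns true when the content was new and 'send' was invoked.
    public bool TryDispatch(byte[] content, Action send)
    {
        string key = Hash(content);

        // The write is the check: TryAdd fails if the hash is already stored.
        if (!_hashes.TryAdd(key, true))
        {
            return false;   // duplicate content, skip processing
        }

        try
        {
            send();         // e.g. send the "process file" message
            return true;
        }
        catch
        {
            // Mimic the rollback: a failed send must not leave the hash behind.
            bool ignored;
            _hashes.TryRemove(key, out ignored);
            throw;
        }
    }
}
```

With a real database, the insert and the send would share one transaction, so the explicit TryRemove would be replaced by the rollback.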
Some example deduplication code in NServiceBus gateway here
Edit:
Looking at their code, I actually think the session.Get<DeduplicationMessage> is unnecessary. session.Save(gatewayMessage); should be enough and is the consistency boundary.
Doing a query would make sense only if the rate of failure is high, meaning you have a lot of duplicate content files. If 99%+ of inserts succeed, the duplicates can indeed be treated as exceptions.
This depends on a lot of things ... throughput being one of them. But since you're approaching this problem in a "pull based" fashion anyway (you're querying a webservice to poll for work (downloading and analysing a file)), you could make this whole process serial without having to worry about collisions. Now that might not give the desired rate at which you want to be handling "the work", but more importantly ... have you measured? Let's sidestep that for a minute and assume that serial isn't going to work. How many files are we talking about? A few 100, 1000, ... millions? Depending on that hashes might fit into memory and could be rebuilt if/when the process should come down. There might also be an opportunity to partition your problem along the axis of time or context. Every file since the beginning of dawn or just today, or maybe this month's worth of files? Really, I think you should dig deeper in your problem space. Apart from that, this feels like an awkward problem to solve using event sourcing, but YMMV.
When you have a true uniqueness-constraint in your domain, you can make the uniqueness-tester a domain service, whose implementation is part of the infrastructure -- similar to a repository, whose interface is part of the domain and whose implementation is part of the infrastructure. For the implementation, you can then use an in-memory hash or a database that is updated/queried as needed.
This is concurrency related. So the SubmitChanges() fails, and a ChangeConflictException is thrown. For each ObjectChangeConflict in db.ChangeConflicts, its Resolve is set to RefreshMode.OverwriteCurrentValues? What does this mean?
http://msdn.microsoft.com/en-us/library/bb399354.aspx
    Northwnd db = new Northwnd("...");
    try
    {
        db.SubmitChanges(ConflictMode.ContinueOnConflict);
    }
    catch (ChangeConflictException e)
    {
        Console.WriteLine(e.Message);
        foreach (ObjectChangeConflict occ in db.ChangeConflicts)
        {
            // All database values overwrite current values.
            occ.Resolve(RefreshMode.OverwriteCurrentValues);
        }
    }
I added some comments to the code, see if it helps:
    Northwnd db = new Northwnd("...");
    try
    {
        // here we attempt to submit changes for the database
        // The ContinueOnConflict specifies that all updates to the
        // database should be tried, and that concurrency conflicts
        // should be accumulated and returned at the end of the process.
        db.SubmitChanges(ConflictMode.ContinueOnConflict);
    }
    catch (ChangeConflictException e)
    {
        // we got a change conflict, so we need to process it
        Console.WriteLine(e.Message);

        // There may be many change conflicts (if multiple DB tables were
        // affected, for example), so we need to loop over each
        // conflict and resolve it.
        foreach (ObjectChangeConflict occ in db.ChangeConflicts)
        {
            // To resolve each conflict, we call
            // ObjectChangeConflict.Resolve, and we pass in OverwriteCurrentValues
            // so that the current values will be overwritten with the values
            // from the database
            occ.Resolve(RefreshMode.OverwriteCurrentValues);
        }
    }
First, you must understand that LinqToSql tracks two states for each database row. The original state and the current state. Original state is what the datacontext thinks is in the database. Current state has your in-memory modifications.
Second, LinqToSql uses optimistic concurrency to perform updates. When SubmitChanges is called, the datacontext sends the original state (as a filter) along with the current state into the database. If no records are modified (because the database's record no longer matches the original state), then a ChangeConflictException is raised.
Third, to resolve a change conflict, you must overwrite the original state so the optimistic concurrency filter can locate the record. Then you have to decide what to do with the current state... You can abandon your modifications (that's what the posted code does), which will result in no change to the record, but you are ready to move on with the current database values in your app.
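The "original state as a filter" idea can be illustrated with a tiny in-memory model. This is a sketch of the mechanism only, not LinqToSql's implementation; FakeTable and its methods are hypothetical names, and Update mirrors an `UPDATE ... WHERE Id = @id AND Name = @original` statement that reports the number of rows affected.

```csharp
using System.Collections.Generic;

// Hypothetical single-column table keyed by Id.
public sealed class FakeTable
{
    private readonly Dictionary<int, string> _names = new Dictionary<int, string>();

    public void Set(int id, string name) { _names[id] = name; }

    public string Get(int id) { return _names[id]; }

    // Writes 'current' only if the stored value still equals 'original'.
    // Returns the number of rows affected: 1 on success, 0 on a conflict.
    public int Update(int id, string original, string current)
    {
        string stored;
        if (!_names.TryGetValue(id, out stored) || stored != original)
        {
            return 0;   // someone else changed the row: the filter matched nothing
        }
        _names[id] = current;
        return 1;
    }
}
```

A result of 0 is the situation in which a ChangeConflictException is raised. Re-reading the row to obtain a fresh original value and then retrying makes the filter match again, which is why resolving a conflict always starts by overwriting the original state.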
I think it means that if it detects a concurrency conflict, it goes into the catch. Within there it goes through each of the conflicts (the foreach loop) and resets the in-memory values to the ones currently in the database, discarding the change that was attempted.
Apparently, any changes you've done to the objects are to be thrown out, since somebody else has stolen a march on you and updated the database while you were busy. In optimistic concurrency, dropping the changes is the only possible automated solution. However, the user probably isn't going to be too happy if they spent any time on inputting the discarded data.
The conflict is probably caused by the fact that the object in your datacontext (the object that stores and tracks changes etc. in .NET code) has other values than the ones in your db.
Let's say you load a person object from the db. One of the fields is firstname, firstname is Soo. Now you have a copy of your record in the datacontext. You change some things and want to write the changes to the db, but when the ORM (LINQ to SQL or another) wants to write the changes to the DB, it notices that the firstname in the DB has already changed.
So someone/something has changed your record. You have kind of a "deadlock" (correct term?), and you have to define what is more important: your changes, or the changes that something/someone else made.
To the point: RefreshMode.OverwriteCurrentValues just refreshes the object in your datacontext. It reloads the object from the db so that you are working with the updated object.
I hope this was a little clear :)
grtz
Here's a little experiment I did:
    MyClass obj = dataContext.GetTable<MyClass>().Where(x => x.ID == 1).Single();
    Console.WriteLine(obj.MyProperty); // output = "initial"
    Console.WriteLine("Waiting..."); // put a breakpoint after this line
    obj = null;
    obj = dataContext.GetTable<MyClass>().Where(x => x.ID == 1).Single(); // same as before, but reloaded
    Console.WriteLine(obj.MyProperty); // output still = "initial"
    obj.MyOtherProperty = "foo";
    dataContext.SubmitChanges(); // throws concurrency exception
When I hit the breakpoint after line 3, I go to a SQL query window and manually change the value to "updated". Then I carry on running. Linq does not reload my object, but re-uses the one it previously had in memory! This is a huge problem for data concurrency!
How do you disable this hidden cache of objects that Linq obviously is keeping in memory?
EDIT - On reflection, it is simply unthinkable that Microsoft could have left such a gaping chasm in the Linq framework. The code above is a dumbed-down version of what I'm actually doing, and there may be little subtleties that I've missed. In short, I'd appreciate if you'd do your own experimentation to verify that my findings above are correct. Alternatively, there must be some kind of "secret switch" that makes Linq robust against concurrent data updates. But what?
This isn't an issue I've come across before (since I don't tend to keep DataContexts open for long periods of time), but it looks like someone else has:
http://www.rocksthoughts.com/blog/archive/2008/01/14/linq-to-sql-caching-gotcha.aspx
LinqToSql has a wide variety of tools to deal with concurrency problems.
The first step, however, is to admit there is a concurrency problem to be solved!
First, DataContext's intended object lifecycle is supposed to match a UnitOfWork. If you're holding on to one for extended periods, you're going to have to work that much harder because the class isn't designed to be used that way.
Second, DataContext tracks two copies of each object. One is the original state and one is the changed/changeable state. If you ask for the MyClass with Id = 1, it will give you back the same instance it gave you last time, which is the changed/changeable version... not the original. It must do this to prevent concurrency problems with in-memory instances... LinqToSql does not allow one DataContext to be aware of two changeable versions of MyClass(Id = 1).
Third, DataContext has no idea whether your in-memory change comes before or after the database change, and so cannot referee the concurrency conflict without some guidance. All it sees is:
I read MyClass(Id = 1) from the database.
Programmer modified MyClass(Id = 1).
I sent MyClass(Id = 1) back to the database (look at this sql to see optimistic concurrency in the where clause)
The update will succeed if the database's version matches the original (optimistic concurrency).
The update will fail with concurrency exception if the database's version does not match the original.
Ok, now that the problem is stated, here's a couple of ways to deal with it.
You can throw away the DataContext and start over. This is a little heavy handed for some, but at least it's easy to implement.
You can ask for the original instance or the changed/changeable instance to be refreshed with the database value by calling DataContext.Refresh(RefreshMode, target) (reference docs with many good concurrency links in the "Remarks" section). This will bring the changes client side and allow your code to work out what the final result should be.
You can turn off concurrency checking in the dbml (ColumnAttribute.UpdateCheck) . This disables optimistic concurrency and your code will stomp over anyone else's changes. Also heavy handed, also easy to implement.
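The refresh option could be used along these lines. This is a hedged sketch: the helper class and method names are hypothetical, while Refresh, ChangeConflicts.ResolveAll, RefreshMode and ConflictMode are members of System.Data.Linq.

```csharp
using System.Data.Linq;

public static class ConcurrencyHelpers
{
    // Discards in-memory edits to 'entity' and reloads it from the database.
    public static void ReloadFromDatabase(DataContext context, object entity)
    {
        context.Refresh(RefreshMode.OverwriteCurrentValues, entity);
    }

    // Submits changes; on conflict, keeps the values this context changed,
    // refreshes everything else from the database, and tries once more.
    public static void SubmitKeepingMyChanges(DataContext context)
    {
        try
        {
            context.SubmitChanges(ConflictMode.ContinueOnConflict);
        }
        catch (ChangeConflictException)
        {
            context.ChangeConflicts.ResolveAll(RefreshMode.KeepChanges);

            // May throw again if yet another update arrived in the meantime.
            context.SubmitChanges(ConflictMode.FailOnFirstConflict);
        }
    }
}
```

Which RefreshMode is right depends on whose edits should win: OverwriteCurrentValues drops yours, KeepChanges merges by keeping only the members you changed, and KeepCurrentValues keeps all of your current values.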
Set the ObjectTrackingEnabled property of the DataContext to false.
When ObjectTrackingEnabled is set to true the DataContext is behaving like a Unit of Work. It's going to keep any object loaded in memory so that it can track changes to it. The DataContext has to remember the object as you originally loaded it to know if any changes have been made.
If you are working in a read only scenario you should turn off object tracking. It can be a decent performance improvement.
If you aren't working in a read only scenario then I'm not sure why you want it to work this way. If you have made edits then why would you want it to pull in modified state from the database?
LINQ to SQL uses the identity map design pattern, which means that it will always return the same instance of an object for its given primary key (unless you turn off object tracking).
The solution is simply either use a second data context if you don't want it to interfere with the first instance or refresh the first instance if you do.
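For readers unfamiliar with the pattern, here is a minimal identity map sketch. It is illustrative only and not how DataContext is implemented: IdentityMap and its load delegate are hypothetical, and a real DataContext may still send the query to the database and simply hand back the instance it already tracks.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdentityMap<TKey, TEntity> where TEntity : class
{
    private readonly Dictionary<TKey, TEntity> _cache = new Dictionary<TKey, TEntity>();
    private readonly Func<TKey, TEntity> _load;   // hypothetical database read

    public IdentityMap(Func<TKey, TEntity> load)
    {
        _load = load;
    }

    // Number of times the loader was actually invoked.
    public int LoadCount { get; private set; }

    // Returns the one tracked instance for 'key', loading it on first use.
    public TEntity Get(TKey key)
    {
        TEntity entity;
        if (!_cache.TryGetValue(key, out entity))
        {
            entity = _load(key);
            LoadCount++;
            _cache[key] = entity;
        }
        return entity;
    }
}
```

Two Get calls with the same key return the same reference, so a value changed in the database between the calls is not seen. A second map (the analogue of a second data context) has its own cache and would load the row afresh.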