C# BulkWriteAsync, Transactions and Results

I am relatively new to working with MongoDB. I am currently getting a little more familiar with the API, especially with the C# driver, and I have a few questions about bulk updates.

The C# driver offers a BulkWriteAsync method, and I have read a lot about it in the Mongo documentation. As I understand it, a bulk write can be configured not to stop when an error occurs at any step, by using the unordered setting. What I could not find is what happens to the data. Does the database do a rollback in case of an error, or do I have to wrap the operation in a transaction myself?

In case of an error, can I get details of which step was not successful? Think of a bulk write with updates on 100 documents: can I find out which updates failed? As BulkWriteResult offers very little information, I am not sure whether this operation is really a good fit for me.

Thanks in advance

You're right in that BulkWriteResult doesn't provide the full set of information to make a call on what to do.
In the case of a MongoBulkWriteException<T>, however, you can access the WriteErrors property to get the indexes of the models that errored. Here's a pared-down example of how to use the property.
var models = sourceOfModels.ToArray();
for (var i = 0; i < MaxTries; i++)
{
    try
    {
        return await someCollection.BulkWriteAsync(models, new BulkWriteOptions { IsOrdered = false });
    }
    catch (MongoBulkWriteException e)
    {
        // Reconstitute the list of models to retry from the set of failed models.
        models = e.WriteErrors.Select(x => models[x.Index]).ToArray();
    }
}
throw new Exception("Bulk write still failing after the maximum number of retries.");
Note: the above is very naive code; my actual code is more sophisticated. What it does is try the write over and over, each time with only the outstanding writes. Say you started with 1000 ReplaceOne<T> models to write and 900 went through; the second try will run against the remaining 100, and so on until retries are exhausted or there are no errors.
If the code is not within a transaction and an error occurs, nothing is rolled back, of course; you have some writes that succeed and some that do not. In the case of a transaction, the exception is still raised (MongoDB 4.2+); prior to that, you would not get an exception.
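For completeness, here is a minimal sketch of running the bulk write inside a transaction, so that the writes commit or abort as a unit (hedged: it assumes a deployment that supports multi-document transactions, an IMongoClient named client, and the same someCollection and models as above):
using (var session = await client.StartSessionAsync())
{
    session.StartTransaction();
    try
    {
        // Inside the transaction, the bulk's writes are committed or aborted together.
        await someCollection.BulkWriteAsync(session, models, new BulkWriteOptions { IsOrdered = false });
        await session.CommitTransactionAsync();
    }
    catch (Exception)
    {
        await session.AbortTransactionAsync();
        throw;
    }
}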
Finally, while the default is ordered writes, unordered writes can be very useful when the writes are unrelated to one another (e.g. documents representing DDD aggregates where there are no dependencies). It's this same "unrelatedness" that also obviates the need for a transaction.

Related

StackExchange.Redis Transaction chaining parameters

I'm trying to execute a basic transactional operation that contains two steps:
Get the length of a set: scard MySet
Pop the entire set with the given length: spop MySet len
I know it is possible to use smembers and del consecutively, but what I want to achieve is to get the output of the first operation and use it in the second operation, inside a transaction. Here is what I tried so far:
var transaction = this.cacheClient.Db1.Database.CreateTransaction();
var itemLength = transaction.SetLengthAsync(key).ContinueWith(async lengthTask =>
{
    var length = await lengthTask;
    try
    {
        // here I want to pass the length argument
        return await transaction.SetPopAsync(key, length); // probably here the transaction is already committed,
                                                           // so it never passes this line and no exceptions are thrown
    }
    catch (Exception ex)
    {
        throw;
    }
});
await transaction.ExecuteAsync();
Also, I tried the same thing with CreateBatch and got the same result. I'm currently using the workaround I mentioned above. I know it is also possible to evaluate a Lua script, but I want to know whether it is possible with transactions, or whether I am doing something terribly wrong.
The nature of redis is that you cannot read data during multi/exec - you only get results when the exec runs, which means it isn't possible to use those results inside the multi. What you are attempting is kinda doomed. There are two ways of doing what you want here:
1. Speculatively read what you need, then perform a multi/exec (transaction) block using that knowledge as a constraint, which SE.Redis will enforce inside a WATCH block; this is really complex and hard to get right, quite honestly.
2. Use Lua, meaning ScriptEvaluate[Async], where you can do everything you want in a series of operations that execute contiguously on the server without competing with other connections.
Option 2 is almost always the right way to do this, ever since it became possible.
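As an illustration of option 2, here is a minimal sketch using StackExchange.Redis script evaluation (hedged: it assumes a connected IDatabase named db, a RedisKey named key, and a Redis server new enough to support SPOP with a count, i.e. 3.2+):
// SCARD + SPOP run as a single server-side script, so no other client
// can interleave between the two calls.
var popped = (RedisValue[])await db.ScriptEvaluateAsync(
    @"local len = redis.call('SCARD', KEYS[1])
      return redis.call('SPOP', KEYS[1], len)",
    new RedisKey[] { key });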

Prevent the DbContext from repeatedly trying to save bad data

I have a process that imports an Excel spreadsheet and parses the data into my data objects. The source of this data is very questionable, as we're moving our customer from spreadsheet-based data management to a managed database system with checks for valid data.
During my import process, I do some basic sanity checks to accommodate just how bad the imported data could be, but my overall validation is done in the DbContext.
Part of what I'm trying to do is provide the row number in the spreadsheet where the data is bad, so they can easily determine what they need to fix to get the file to import.
Once I have the data from the spreadsheet (model), and the Opportunity they're working with from the database (opp), here's the pseudocode of my process:
foreach (var model in Spreadsheet.Rows) { // Again, pseudocode
    if (opp != null && ValidateModel(model, opp, row)) {
        // Copy properties to the database object.
        // This is in a Repository-layer method, not directly in my import process;
        // it's just written here for clarity instead of several nested method calls.
        context.SaveChanges();
    }
}
I can provide more of the code here if needed, but the problem comes in my DbContext's ValidateEntity() method (override of DbContext).
Again, there's nothing wrong with the code that I've written, as far as I'm aware, but if an Opportunity fails this level of validation, it stays as part of the unsaved objects in the context, which means it gets validated again every time ValidateEntity() is called. This leads to the same validation error message being repeated for every row after the initial problem occurs.
Is there a way to get the context to stop trying to validate an object after it fails validation once? I know I could get around this by calling context.SaveChanges() once at the end, but I would like to be able to match each failure to the row it belongs to.
For reference, I am using Entity Framework 6.1 with a Code First approach.
EDIT Attempting to clarify further for Marc L. (including an update to the code block above)
Right now, my process will iterate through as many rows as there are in the spreadsheet. The reason I'm calling my Repository layer with each object to save, instead of using an approach that only calls context.SaveChanges() once, is so that I can determine which row is causing a validation error.
I'm glad that my DbContext's custom ValidateEntity() method is catching the validation errors, but the problem is that it now throws the DbEntityValidationException for the same entity multiple times.
I would like it so that if the object fails validation once, the context no longer tries to save the object, regardless of how many times context.SaveChanges() is called.
Your question is not a dupe (this is about saving, not loaded entities) but you could follow Jimmy's advice above. That is, once an entity is added to the context it is tracked in the "added" state and the only way to stop it from re-validating is by detaching it. It's an SO-internal link, but I'll reproduce the code snippet:
dbContext.Entry(entity).State = EntityState.Detached;
However, I don't think that's the way you want to go, because you're using exceptions to manage state unnecessarily (exceptions are notoriously expensive).
Working from the information given, I'd use a more set-based solution:
modify your model class so that it contains a RowID that records the original spreadsheet row (there's probably other good reasons to have this, too)
turn off entity tracking for the context (turning off change detection allows each Add() to be O(1))
add all the entities
call context.GetValidationErrors() and get all your errors at once, using the aforementioned RowID to identify the invalid rows.
You didn't indicate whether your process should save the good rows or reject the file as a whole, but this will accommodate either--that is, if you need to save the good rows, detach all the invalid rows using the code above and then SaveChanges().
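Here's a minimal sketch of that set-based approach (hedged: ImportContext, Opportunities, Opportunity, RowId, RowNumber and MapToEntity are hypothetical names standing in for your own types and helpers):
using (var context = new ImportContext())
{
    // Turning off change detection keeps each Add() at O(1).
    context.Configuration.AutoDetectChangesEnabled = false;

    foreach (var model in Spreadsheet.Rows)
    {
        var entity = MapToEntity(model);  // hypothetical mapping helper
        entity.RowId = model.RowNumber;   // remember the originating spreadsheet row
        context.Opportunities.Add(entity);
    }

    // Runs ValidateEntity() over the tracked entities and reports every bad row at once.
    var errors = context.GetValidationErrors().ToList();
    foreach (var result in errors)
    {
        var rowId = ((Opportunity)result.Entry.Entity).RowId;
        var messages = string.Join("; ", result.ValidationErrors.Select(v => v.ErrorMessage));
        Console.WriteLine("Row {0}: {1}", rowId, messages);

        // To save only the good rows, detach the bad ones before SaveChanges().
        result.Entry.State = EntityState.Detached;
    }

    context.SaveChanges();
}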
Finally, if you do want to save the good rows and you're uncomfortable with the set-based method, it would be better to use a new DbContext for every single row, or at least to create a new DbContext after each error. The ADO.NET team insists that context creation is "relatively cheap" (sorry, I don't have a cite or stats at hand for this), so this shouldn't damage your throughput too much; even so, it will at least remain O(n). I wouldn't blame you; managing a large context can open you up to other issues as well.

best way to catch database constraint errors

I am calling a stored procedure from C# that inserts data into a SQL Server database. I have a number of constraints on the table, such as a unique column. At present I have the following code:
try
{
    // insert data
}
catch (SqlException ex)
{
    if (ex.Message.ToLower().Contains("duplicate key"))
    {
        if (ex.Message.ToLower().Contains("url"))
        {
            return 1;
        }
        if (ex.Message.ToLower().Contains("email"))
        {
            return 2;
        }
    }
    return 3;
}
Is it better practice to check whether the column value is unique (etc.) before inserting the data, either in C# or in the stored procedure, or to let an exception occur and handle it as above? I am not a fan of the above, but I am looking for best practice in this area.
I view database constraints as a last resort kind of thing. (I.e. by all means they should be present in your schema as a backup way of maintaining data integrity.) But I'd say the data should really be valid before you try to save it in the database. If for no other reason, then because providing feedback about invalid input is a UI concern, and a data validity error really shouldn't bubble up and down the entire tier stack every single time.
Furthermore, there are many sorts of assertions you want to make about the shape of your data that can't be expressed using constraints easily. (E.g. state transitions of an order: "an order can only go to SHIPPED from PAID", or more complex scenarios.) That is, you'd need to resort to procedural-language-based checks, ones that duplicate even more of your business logic, then have those report some sort of error code as well, and include yet more complexity in your app just for the sake of doing all your data validation in the schema definition.
Validation is inherently hard to place in an app, since it concerns the UI but is also coupled to the model schema; I lean toward doing it near the UI.
I see two questions here, and here's my take...
Are database constraints good? For large systems they're indispensable. Most large systems have more than one front end, and not always in compatible languages where middle-tier or UI data-checking logic can be shared. They may also have batch processes in Transact-SQL or PL/SQL only. It's fine to duplicate the checking on the front end, but in a multi-user app the only way to truly check uniqueness is to insert the record and see what the database says. The same goes for foreign-key constraints: you don't truly know until you try to insert/update/delete.
Should exceptions be allowed to throw, or should return values be substituted? Here's the code from the question:
try
{
    // insert data
}
catch (SqlException ex)
{
    if (ex.Message.ToLower().Contains("duplicate key"))
    {
        if (ex.Message.ToLower().Contains("url"))
        {
            return 1; // Sure, that's one good way to do it
        }
        if (ex.Message.ToLower().Contains("email"))
        {
            return 2; // Sure, that's one good way to do it
        }
    }
    return 3; // EVIL! Or at least quasi-evil :)
}
If you can guarantee that the calling program will actually act based on the return value, I think the return 1 and return 2 are best left to your judgement. I prefer to rethrow a custom exception for cases like this (for example DuplicateEmailException) but that's just me - the return values will do the trick too. After all, consumer classes can ignore exceptions just as easily as they can ignore return values.
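For example, here is a minimal sketch of the custom-exception route (hedged: DuplicateEmailException and DuplicateUrlException are types you would define yourself, the index names are assumptions, and SQL Server reports unique key/index violations as error numbers 2627 and 2601, which is more robust than parsing the message text):
try
{
    // insert data
}
catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
{
    if (ex.Message.Contains("IX_Customer_Email"))   // assumed unique index name
        throw new DuplicateEmailException("That email address is already in use.", ex);
    if (ex.Message.Contains("IX_Customer_Url"))     // assumed unique index name
        throw new DuplicateUrlException("That URL is already in use.", ex);
    throw; // unexpected duplicate: let it bubble up
}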
I'm against the return 3. This means there was an unexpected exception (database down, bad connection, whatever). Here you have an unspecified error, and the only diagnostic information you have is this: "3". Imagine posting a question on SO that says I tried to insert a row but the system said '3'. Please advise. It would be closed within seconds :)
If you don't know how to handle an exception in the data class, there's no way a consumer of the data class can handle it. At this point you're pretty much hosed so I say log the error, then exit as gracefully as possible with an "Unexpected error" message.
I know I ranted a bit about the unexpected exception, but I've handled too many support incidents where the programmer just squelched database exceptions, and when something unexpected came up the app either failed silently or failed downstream, leaving zero diagnostic information. Very naughty.
I would prefer a stored procedure that checks for potential violations before just throwing the data at SQL Server and letting the constraint bubble up an error. The reasons for this are performance-related:
Performance impact of different error handling techniques
Checking for potential constraint violations before entering SQL Server TRY and CATCH logic
Some people will advocate that constraints at the database layer are unnecessary since your program can do everything. The reason I wouldn't rely solely on your C# program to detect duplicates is that people will find ways to affect the data without going through your C# program. You may introduce other programs later. You may have people writing their own scripts or interacting with the database directly. Do you really want to leave the table unprotected because they don't honor your business rules? And I don't think the C# program should just throw data at the table and hope for the best, either.
If your business rules change, do you really want to have to re-compile your app (or all of multiple apps)? I guess that depends on how well-protected your database is and how likely/often your business rules are to change.
I did something like this:
public class SqlExceptionHelper
{
    public SqlExceptionHelper(SqlException sqlException)
    {
        // Do nothing.
    }

    public static string GetSqlDescription(SqlException sqlException)
    {
        switch (sqlException.Number)
        {
            case 21:
                return "Fatal Error Occurred: Error Code 21.";
            case 53:
                return "Error in Establishing a Database Connection: 53.";
            default:
                return "Unexpected Error: " + sqlException.Message;
        }
    }
}
Which allows it to be reusable, and it will allow you to get the Error Codes from SQL.
Then just implement:
public class SiteHandler : ISiteHandler
{
    public string InsertDataToDatabase(Handler siteInfo)
    {
        try
        {
            // Open database connection, run commands, some additional checks.
            return "Success";
        }
        catch (SqlException exception)
        {
            // Map the SqlException to a friendlier description.
            return SqlExceptionHelper.GetSqlDescription(exception);
        }
    }
}
Then it provides some specific errors for common occurrences. But as mentioned above, you really should ensure that you've validated your data before you insert it into your database, so that no constraint violations surface in the first place.
Hope it points you in a good direction.
Depends on what you're trying to do. Some things to think about:
Where do you want to handle your error? I would recommend as close to the data as possible.
Who do you want to know about the error? Does your user need to know that 'you've already used that ID'...?
etc.
Also -- constraints can be good. I don't 100% agree with millimoose's answer on that point -- I do agree in the "should be this way / better performance" ideal -- but practically speaking, if you don't have control over your developers/QC, and especially when it comes to enforcing rules whose violation could blow up your database (or break dependent objects like reports if a duplicate key were to turn up somewhere), you need some barrier against, for example, the duplicate key entry.

How do I make an in-memory process transactional?

I'm very familiar with using transactions in an RDBMS, but how would I make sure that changes made to my in-memory data are rolled back if the transaction fails? What if I'm not even using a database?
Here's a contrived example:
public void TransactionalMethod()
{
    var items = GetListOfItems();
    foreach (var item in items)
    {
        MethodThatMayThrowException(item);
        item.Processed = true;
    }
}
In my example, I might want the changes made to the items in the list to somehow be rolled back, but how can I accomplish this?
I am aware of "software transactional memory" but don't know much about it and it seems fairly experimental. I'm aware of the concept of "compensatable transactions", too, but that incurs the overhead of writing do/undo code.
Subversion seems to deal with errors updating a working copy by making you run the "cleanup" command.
Any ideas?
UPDATE:
Reed Copsey offers an excellent answer, including:
Work on a copy of data, update original on commit.
This takes my question one level further - what if an error occurs during the commit? We so often think of the commit as an immediate operation, but in reality it may be making many changes to a lot of data. What happens if there are unavoidable things like OutOfMemoryExceptions while the commit is being applied?
On the flip side, if one goes for a rollback option, what happens if there's an exception during the rollback? I understand that RDBMSs like Oracle have the concept of rollback segments, UNDO logs and so on, but assuming there's no serialisation to disk (where, if it isn't serialised to disk, it didn't happen, and a crash means you can investigate those logs and recover from it), is this really possible?
UPDATE 2:
An answer from Alex made a good suggestion: namely that one updates a different object, and the commit phase is simply changing the reference from the current object over to the new object. He went further to suggest that the object you change is effectively a list of the modified objects.
I understand what he's saying (I think), and I want to make the question more complex as a result:
How, given this scenario, do you deal with locking? Imagine you have a list of customers:
var customers = new Dictionary<CustomerKey, Customer>();
Now, you want to make a change to some of those customers; how do you apply those changes without locking and replacing the entire list? For example:
var customerTx = new Dictionary<CustomerKey, Customer>();
foreach (var customer in customers.Values)
{
    var updatedCust = customer.Clone();
    customerTx.Add(GetKey(updatedCust), updatedCust);
    if (CalculateRevenueMightThrowException(customer) >= 10000)
    {
        updatedCust.Preferred = true;
    }
}
How do I commit? This (Alex's suggestion) will mean locking all customers while replacing the list reference:
lock (customers)
{
    customers = customerTx;
}
Whereas if I loop through, modifying the references in the original list, it's not atomic, and it falls foul of the "what if it crashes partway through" problem:
foreach (var kvp in customerTx)
{
    customers[kvp.Key] = kvp.Value;
}
Pretty much every option for doing this requires one of three basic methods:
Make a copy of your data before modifications, to revert to a rollback state if aborted.
Work on a copy of data, update original on commit.
Keep a log of changes to your data, to undo them in the case of an abort.
For example, Software Transactional Memory, which you mentioned, follows the third approach. The nice thing about that is that it can work on the data optimistically, and just throw away the log on a successful commit.
Take a look at the Microsoft Research project, SXM.
From Maurice Herlihy's page, you can download documentation as well as code samples.
You asked: "What if an error occurs during the commit?"
It doesn't matter. You can commit to somewhere/something in memory and check whether the operation succeeded. If it did, you change the reference of the intended object (object A) to where you committed (object B). Then you have fail-safe commits: the reference is only updated on a successful commit, and a reference change is atomic.
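Here's a minimal sketch of that copy-then-swap idea applied to the customer example from the question (hedged: the field and method names are illustrative; the point is simply that the live dictionary is only ever replaced by a single atomic reference assignment):
// Requires System.Collections.Generic and System.Threading.
private Dictionary<CustomerKey, Customer> customers = new Dictionary<CustomerKey, Customer>();

public void ApplyPreferredFlags()
{
    // Work on a copy; the live dictionary is untouched until the swap.
    var customerTx = new Dictionary<CustomerKey, Customer>();
    foreach (var pair in customers)
    {
        var updated = pair.Value.Clone();
        if (CalculateRevenueMightThrowException(pair.Value) >= 10000)
        {
            updated.Preferred = true;
        }
        customerTx.Add(pair.Key, updated);
    }

    // "Commit": one atomic reference swap. If anything above threw, we never
    // get here and readers keep seeing the complete old dictionary.
    Interlocked.Exchange(ref customers, customerTx);
}
Readers of customers always observe either the complete old dictionary or the complete new one, never a half-applied commit.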
public void TransactionalMethod()
{
    var items = GetListOfItems();
    try
    {
        foreach (var item in items)
        {
            MethodThatMayThrowException(item);
            item.Processed = true;
        }
    }
    catch (Exception ex)
    {
        foreach (var item in items)
        {
            if (item.Processed)
            {
                UndoProcessingForThisItem(item);
            }
        }
    }
}
Obviously, the implementation of the "Undo..." is left as an exercise for the reader.

Linq caching data values - major concurrency problem?

Here's a little experiment I did:
MyClass obj = dataContext.GetTable<MyClass>().Where(x => x.ID == 1).Single();
Console.WriteLine(obj.MyProperty); // output = "initial"
Console.WriteLine("Waiting..."); // put a breakpoint after this line
obj = null;
obj = dataContext.GetTable<MyClass>().Where(x => x.ID == 1).Single(); // same query as before, reloaded
Console.WriteLine(obj.MyProperty); // output still = "initial"
obj.MyOtherProperty = "foo";
dataContext.SubmitChanges(); // throws a concurrency exception
When I hit the breakpoint after line 3, I go to a SQL query window and manually change the value to "updated". Then I carry on running. Linq does not reload my object, but re-uses the one it previously had in memory! This is a huge problem for data concurrency!
How do you disable this hidden cache of objects that Linq obviously is keeping in memory?
EDIT - On reflection, it is simply unthinkable that Microsoft could have left such a gaping chasm in the Linq framework. The code above is a dumbed-down version of what I'm actually doing, and there may be subtleties that I've missed. In short, I'd appreciate it if you'd do your own experimentation to verify that my findings above are correct. Alternatively, there must be some kind of "secret switch" that makes Linq robust against concurrent data updates. But what?
This isn't an issue I've come across before (since I don't tend to keep DataContexts open for long periods of time), but it looks like someone else has:
http://www.rocksthoughts.com/blog/archive/2008/01/14/linq-to-sql-caching-gotcha.aspx
LinqToSql has a wide variety of tools to deal with concurrency problems.
The first step, however, is to admit there is a concurrency problem to be solved!
First, DataContext's intended object lifecycle is supposed to match a UnitOfWork. If you're holding on to one for extended periods, you're going to have to work that much harder because the class isn't designed to be used that way.
Second, DataContext tracks two copies of each object. One is the original state and one is the changed/changeable state. If you ask for the MyClass with Id = 1, it will give you back the same instance it gave you last time, which is the changed/changeable version... not the original. It must do this to prevent concurrency problems with in-memory instances... LinqToSql does not allow one DataContext to be aware of two changeable versions of MyClass(Id = 1).
Third, DataContext has no idea whether your in-memory change comes before or after the database change, and so cannot referee the concurrency conflict without some guidance. All it sees is:
I read MyClass(Id = 1) from the database.
Programmer modified MyClass(Id = 1).
I sent MyClass(Id = 1) back to the database (look at this sql to see optimistic concurrency in the where clause)
The update will succeed if the database's version matches the original (optimistic concurrency).
The update will fail with concurrency exception if the database's version does not match the original.
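A quick way to see the identity map described in the second point for yourself (a hedged sketch reusing the question's own MyClass and dataContext):
var first = dataContext.GetTable<MyClass>().Single(x => x.ID == 1);
var second = dataContext.GetTable<MyClass>().Single(x => x.ID == 1);

// Both statements query the database, but the values returned by the second
// query are discarded in favour of the instance the DataContext already tracks.
Console.WriteLine(ReferenceEquals(first, second)); // True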
Ok, now that the problem is stated, here's a couple of ways to deal with it.
You can throw away the DataContext and start over. This is a little heavy handed for some, but at least it's easy to implement.
You can ask for the original instance or the changed/changeable instance to be refreshed with the database value by calling DataContext.Refresh(RefreshMode, target) (the reference docs have many good concurrency links in the "Remarks" section). This brings the changes client side and lets your code work out what the final result should be (see the sketch after this list).
You can turn off concurrency checking in the dbml (ColumnAttribute.UpdateCheck). This disables optimistic concurrency and your code will stomp over anyone else's changes. Also heavy-handed, also easy to implement.
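One common shape of the second option, as a minimal sketch (hedged: this uses the LINQ to SQL conflict-resolution API, which refreshes the original values for you, rather than calling Refresh directly; dataContext is the question's context):
try
{
    dataContext.SubmitChanges(ConflictMode.ContinueOnConflict);
}
catch (ChangeConflictException)
{
    foreach (ObjectChangeConflict conflict in dataContext.ChangeConflicts)
    {
        // KeepChanges keeps your edits and refreshes everything else from the database.
        // KeepCurrentValues and OverwriteCurrentValues are the other options.
        conflict.Resolve(RefreshMode.KeepChanges);
    }
    dataContext.SubmitChanges();
}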
Set the ObjectTrackingEnabled property of the DataContext to false.
When ObjectTrackingEnabled is set to true the DataContext is behaving like a Unit of Work. It's going to keep any object loaded in memory so that it can track changes to it. The DataContext has to remember the object as you originally loaded it to know if any changes have been made.
If you are working in a read only scenario you should turn off object tracking. It can be a decent performance improvement.
If you aren't working in a read only scenario then I'm not sure why you want it to work this way. If you have made edits then why would you want it to pull in modified state from the database?
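In code this is a single flag on the context, set before the first query runs (a minimal sketch; MyDataContext and MyClass stand in for the question's own types):
using (var dc = new MyDataContext(connectionString))
{
    dc.ObjectTrackingEnabled = false; // must be set before any query executes

    // Every query now materialises fresh values straight from the database,
    // but the context can no longer track edits or submit changes.
    var obj = dc.GetTable<MyClass>().Single(x => x.ID == 1);
}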
LINQ to SQL uses the identity map design pattern, which means that it will always return the same instance of an object for a given primary key (unless you turn off object tracking).
The solution is simply to either use a second data context, if you don't want it to interfere with the first instance, or to refresh the first instance if you do.
