Pessimistic locking in EF code first - c#

I'd like to lock specific rows in my table exclusively, so that no reads and no updates are allowed until the current transaction completes. To do this, I've created a helper method in my database repository:
public void PessimisticMyEntityHandler(Action<IEnumerable<MyEntity>> fieldUpdater, string sql, params object[] parameters)
{
using (var scope = new System.Transactions.TransactionScope())
{
fieldUpdater(DbContext.Set<MyEntity>().SqlQuery(sql, parameters));
scope.Complete();
}
}
Here is my test code. Basically, I'm just starting two tasks, and both of them try to lock the row with the Id '1'. My guess was that the second task wouldn't be able to read (and update) the row until the first one finished its job, but the output window shows that it actually can.
Task.Factory.StartNew(() =>
{
var dbRepo = new DatabaseRepository();
dbRepo.PessimisticMyEntityHandler(myEntities =>
{
Debug.WriteLine("entered into lock1");
/* Modify some properties considering the current ones... */
var myEntity = myEntities.First();
Thread.Sleep(1500);
myEntity.MyEntityCode = "abcdefgh";
dbRepo.Update<MyEntity>(myEntity);
Debug.WriteLine("leaving lock1");
}, "SELECT * FROM MyEntities WITH (UPDLOCK, HOLDLOCK) WHERE Id = @param1", new SqlParameter("param1", 1));
});
Task.Factory.StartNew(() =>
{
Thread.Sleep(500);
var dbRepo = new DatabaseRepository();
dbRepo.PessimisticMyEntityHandler(myEntities =>
{
Debug.WriteLine("entered into lock2");
/* Modify some properties considering the current ones... */
var myEntity = myEntities.First();
myEntity.MyEntityCode = "xyz";
dbRepo.Update<MyEntity>(myEntity);
Debug.WriteLine("leaving lock2");
}, "SELECT * FROM MyEntities WITH (UPDLOCK, HOLDLOCK) WHERE Id = @param1", new SqlParameter("param1", 1));
});
Output window:
entered into lock1
entered into lock2
leaving lock2
leaving lock1

What you are asking for involves two main concepts in a DBMS, and in SQL Server in particular: locks and isolation levels. I'll do my best to explain them in summary.
You asked about pessimistic concurrency. The answer is that it is not supported yet in Entity Framework. In other words, with the conventional EF API you cannot lock a table or some rows for SELECT the way, for example, Oracle does via SELECT FOR UPDATE. You can, however, achieve this by writing a native SQL command that selects some rows or the entire table with an exclusive lock and holds that lock until the end of the transaction. This way other threads can neither update nor select the locked rows; they are blocked until you release the lock. This is what I do in my projects, and though somewhat risky, it works fine.
So, in summary:
Lock for select: NO by EF / Yes by native SQL
Lock for Update:
When you modify rows in the database, the modified rows acquire locks. How long those locks are held and how they interact with readers is governed by the isolation level of the running transaction. The default isolation level in SQL Server is Read Committed. Under it, every row modified in the current transaction acquires an exclusive lock that is held until the transaction ends. That lock is incompatible with UPDATE and DELETE from other transactions, and it also blocks their ordinary reads; other sessions can read the row only if they use row versioning (READ_COMMITTED_SNAPSHOT) or accept uncommitted data (NOLOCK). It means that when you modify a row in your transaction, by default it is guaranteed that no other parallel thread can interfere and change it until you end the transaction with either COMMIT or ROLLBACK.
To understand the locks in SQL Server see: http://technet.microsoft.com/en-us/library/aa213039(v=sql.80).aspx
To understand transaction isolation levels see: http://technet.microsoft.com/en-us/library/ms189122(v=sql.105).aspx
Update:
The table hints UPDLOCK and HOLDLOCK may be ignored by the query optimizer or other DBMS modules, since they are just hints :-). The only combination of table hints that is surely enforced is (XLOCK, PAGLOCK).
Example: SELECT * FROM MyTable WITH (XLOCK, PAGLOCK)
As I said, manual locking is risky. Use it with the utmost care.

Related

Parallel.ForEach Source List with Where Condition

I have a code block which processes StoreProducts and then adds or updates them in the database in a foreach loop. But this is slow. When I convert the code to a Parallel.ForEach block, the same products get both added and updated at the same time. I could not figure out how to safely use Parallel.ForEach for the following functionality; any help would be appreciated.
var validProducts = storeProducts.Where(p => p.Price2 > 0
&& !string.IsNullOrEmpty(p.ProductAtt08Desc.Trim())
&& !string.IsNullOrEmpty(p.Barcode.Trim())
).ToList();
var processedProductCodes = new List<string>();
var po = new ParallelOptions()
{
MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)), po,
(product) =>
{
lock (_lockThis)
{
processedProductCodes.Add(product.ProductCode);
}
// Check if Product Exists in Db
// if product is not in Db Add to Db
// if product is in Db Update product in Db
});
The issue here is that the list validProducts may contain the same ProductCode more than once, because the entries are variants. Once one of them has been processed (or is being processed), the others with the same code should not be processed again.
So the Where condition in the Parallel.ForEach, 'validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode))', is not working as expected, unlike in a normal foreach.
The bulk of my answer is less-so an answer to your question and more some guidance - if you were to provide some more technical details, I may be able to assist more precisely.
A Parallel.ForEach is probably not the best solution here -- especially when you have a shared list or a busy server.
You are locking to write but not to read from that shared list. So I'm surprised it's not throwing during the Where. Turn the List<string> into a ConcurrentDictionary<string, bool> (just to create a simple concurrent hash table) then you'll get better write throughput and it won't throw during reads.
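A minimal sketch of the ConcurrentDictionary idea, assuming product codes are plain strings. The names `DedupDemo` and `ProcessDistinct` and the sample codes are hypothetical, and the database work is replaced by a placeholder. `TryAdd` checks and claims a code in one atomic step, so no separate lock or `Where` filter is needed:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DedupDemo
{
    // Returns the codes that were actually processed (each at most once).
    public static List<string> ProcessDistinct(IEnumerable<string> productCodes)
    {
        // Used as a concurrent hash set; the bool value is irrelevant.
        var claimed = new ConcurrentDictionary<string, bool>();
        var processed = new ConcurrentBag<string>();
        var po = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.ForEach(productCodes, po, code =>
        {
            // TryAdd returns false if another iteration already claimed this code.
            if (!claimed.TryAdd(code, true))
                return;

            // Placeholder for the real work (check, insert or update in the database).
            processed.Add(code);
        });

        return processed.ToList();
    }

    public static void Main()
    {
        var codes = new[] { "A1", "B2", "A1", "C3", "B2", "A1" };
        var result = ProcessDistinct(codes);
        Console.WriteLine(string.Join(",", result.OrderBy(c => c))); // prints "A1,B2,C3"
    }
}
```

This only removes the duplicate-processing race; the database contention described in the next paragraph still applies.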
But you're going to have database contention issues (if using multiple connections) because your insert will likely still require locks. Even if you simply split the workload you would run into this. This DB locking could cause blocks/deadlocks so it may end up slower than the original. If using one connection, you generally cannot parallelize commands.
I would try wrapping the majority of inserts in a transaction containing batches of say 1000 inserts or place the entire workload into one bulk insert. Then the database will keep the data in-memory and commit the entire thing to disk when finished (instead of one record at a time).
Depending on your typical workload, you may want to try different storage solutions. Databases are generally bad for inserting large volumes of records... you will likely see much better performance with alternative solutions (such as Key-Value stores). Or place the data into something like Redis and slowly persist to the database in the background.
Parallel.ForEach buffers items internally for each thread. One option is to switch to a partitioner that does not use buffering:
var pat = Partitioner.Create(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode))
,EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(pat, po, (product) => ...
That will get you closer, but you will still have a race condition where two of the same object can be processed, because you don't break out of the loop if you find a duplicate.
The better option is to switch processedProductCodes to a HashSet<string> and change your code to
var processedProductCodes = new HashSet<string>();
var po = new ParallelOptions()
{
MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts, po,
(product) =>
{
//You can safely lock on processedProductCodes
lock (processedProductCodes)
{
if(!processedProductCodes.Add(product.ProductCode))
{
//Add returns false if the code is already in the collection.
return;
}
}
// Check if Product Exists in Db
// if product is not in Db Add to Db
// if product is in Db Update product in Db
});
HashSet has a much faster lookup, and the lookup is built into the Add method.

Parallel processing multiple sql statements results in deadlock

I am working on a project which involves processing a lot of text files and results in either inserting records into an MSSQL db or updating existing information.
Each SQL statement is written and stored in a list until the files have finished being processed.
This list is then processed. Each statement was being executed one at a time, but as there could be thousands of statements, this could become a very long-running process.
To attempt to speed up this process I introduced some parallel processing, but this occasionally results in the following error:
Transaction (Process ID 94) was deadlocked on lock | communication
buffer resources with another process and has been chosen as the
deadlock victim. Rerun the transaction.
Code as follows:
public static void ParallelNonScalarExecution(List<string> Statements, string conn)
{
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = 8;
CancellationTokenSource cancelToken = new CancellationTokenSource();
po.CancellationToken = cancelToken.Token;
Parallel.ForEach(Statements, po, Statement =>
{
using (SqlConnection mySqlConnection = new SqlConnection(conn))
{
mySqlConnection.Open();
using (SqlCommand mySqlCommand = new SqlCommand(Statement, mySqlConnection))
{
mySqlCommand.CommandTimeout = Timeout;
mySqlCommand.ExecuteScalar();
}
}
});
}
The update statements, I believe, are simple in what they are trying to achieve:
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code = 1
UPDATE TableA SET Notes = 'blahblahblah', Date = '2016-01-01' WHERE Code = 2
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code = 3
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code = 4
UPDATE TableB SET Type = 1 WHERE Code = 100
UPDATE TableA SET Notes = 'blahblahblah', Date = '2016-01-01' WHERE Code = 5
UPDATE TableB SET Type = 1 WHERE Code = 101
What is the best way to overcome this issue?
From what I see, you don't want to do what you are doing. I would NOT recommend running multiple update statements that affect the same data/table on different threads. This is a breeding ground for race conditions and deadlocks. In your case it should be safe, but if at any point you changed the WHERE condition and there was overlap, you would have a race condition.
If you really want to speed this up with multi-threading, then run all of the update statements for TableA on one thread and all of the update statements for TableB on another. Another idea is to batch your update statements:
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code IN (1,2,3,4,5)
UPDATE TableA SET Date = '2016-01-01' WHERE Code IN (2,5)
UPDATE TableB SET Type = 1 WHERE Code IN (100,101)
The statements above should be able to execute independently in a concurrent environment, as no two statements affect the same column?
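The one-thread-per-table idea can be sketched roughly as follows. This is an illustration under assumptions, not a drop-in replacement: `PerTableRunner` is a hypothetical name, the regular expression only recognizes simple statements that start with `UPDATE <table>`, and the `execute` delegate stands in for whatever opens a connection and runs the command. Statements for the same table run sequentially in their original order, while different tables run in parallel:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class PerTableRunner
{
    // Captures the table name of a simple "UPDATE <table> ..." statement.
    private static readonly Regex UpdateTarget =
        new Regex(@"^\s*UPDATE\s+([\w\.\[\]]+)", RegexOptions.IgnoreCase);

    // Groups statements by target table; unrecognized statements share the "" group.
    public static ILookup<string, string> GroupByTable(IEnumerable<string> statements)
    {
        return statements.ToLookup(
            s =>
            {
                var m = UpdateTarget.Match(s);
                return m.Success ? m.Groups[1].Value : "";
            },
            StringComparer.OrdinalIgnoreCase);
    }

    // Runs each table's statements in order on one worker; tables run in parallel.
    public static void Run(IEnumerable<string> statements, Action<string> execute)
    {
        Parallel.ForEach(GroupByTable(statements), group =>
        {
            foreach (var sql in group)
                execute(sql);
        });
    }
}
```

A call such as `PerTableRunner.Run(Statements, sql => { /* execute sql on its own connection */ })` would replace the flat `Parallel.ForEach` over all statements.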
Thread A updates resource X, does not commit, and can continue doing more updates.
Thread B updates resource Y, does not commit, and can continue doing more updates.
At this point, both have uncommitted updates.
Now thread A updates resource Y and waits on the lock held by thread B.
Thread B is not held up by anything, so it goes on, eventually tries to update resource X, and is blocked by the lock A has on X.
Now they are in a deadlock. It's a stalemate: neither can proceed to commit, so the system kills one.
You have to commit more often to reduce the chances of a deadlock (but that does not eliminate the possibility entirely), or you have to carefully order your updates so that all updates to X are done and completed before going on to do any updates on Y.

This simple code produces deadlock. Simple Example Program included

Code, notice the order of the values is different. So it alternates between locking rows:
static void Main( string[] args )
{
List<int> list = new List<int>();
for(int i = 0; i < 1000; i++ )
list.Add( i );
Parallel.ForEach( list, i =>
{
using( NamePressDataContext db = new NamePressDataContext() )
{
db.ExecuteCommand( @"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
join (values (7276, 20870),(240, 20870)) as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id" );
db.ExecuteCommand( @"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
join (values (240, 20870),(7276, 20870)) as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id" );
}
} );
}
Table def:
CREATE TABLE [dbo].[EDescriptionsCategories](
[CategoryId] [int] NOT NULL,
[Id] [int] NOT NULL,
CONSTRAINT [PK_EDescriptionsCategories] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
Exception:
Transaction (Process ID 80) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
The code works only with the WITH (TABLOCK) hint. Is it possible to update just those 2 rows in parallel without locking the whole table?
Your two statements acquire row locks in different order. That's a classic case for deadlocks. You can fix this by ensuring that the order of locks taken is always in some global order (for example, ordered by ID). You should probably coalesce the two UPDATE statements into one and sort the list of IDs on the client before sending it to SQL Server. For many typical UPDATE plans this actually works fine (not guaranteed, though).
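The coalesce-and-sort suggestion can be sketched like this. `CategoryUpdateBuilder` is a hypothetical helper; it merges all Id/CategoryId pairs into one dictionary (so each Id appears once), sorts by Id on the client, and emits a single UPDATE in the shape used in the question. As noted above, the sorted input makes a consistent lock order likely for typical plans but does not guarantee it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CategoryUpdateBuilder
{
    // Builds one UPDATE whose VALUES rows are ordered by Id.
    // Keys and values are ints, so concatenating them cannot inject SQL.
    public static string Build(IDictionary<int, int> categoryById)
    {
        if (categoryById.Count == 0)
            throw new ArgumentException("At least one row is required.", nameof(categoryById));

        var rows = categoryById
            .OrderBy(p => p.Key)
            .Select(p => "(" + p.Key + ", " + p.Value + ")");

        return "update EBayDescriptionsCategories set CategoryId = Ids.CategoryId " +
               "from EBayDescriptionsCategories " +
               "join (values " + string.Join(",", rows) + ") as Ids(Id,CategoryId) " +
               "on Ids.Id = EBayDescriptionsCategories.Id";
    }

    public static void Main()
    {
        var sql = Build(new Dictionary<int, int> { { 7276, 20870 }, { 240, 20870 } });
        Console.WriteLine(sql); // the VALUES list reads (240, 20870),(7276, 20870)
    }
}
```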
Or, you add retry logic in case you detect a deadlock (SqlException.Number == 1205). This is more elegant because it requires no deeper code changes. But deadlocks have performance implications so only do this for low deadlock rates.
If your parallel processing generates lots of updates, you could INSERT all those updates into a temp table (which can be done concurrently) and when you are done you execute one big UPDATE that copies all the individual update records to the main table. You just change the join source in your sample queries.
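The retry option mentioned above can be sketched as a small helper. `DeadlockRetry` and its parameters are hypothetical names; the `isDeadlock` predicate is passed in so the sketch does not depend on a particular data provider. With SqlClient it would typically be something like `ex => ex is SqlException se && se.Number == 1205`:

```csharp
using System;
using System.Threading;

public static class DeadlockRetry
{
    // Runs 'work', retrying while 'isDeadlock' matches and attempts remain.
    // The final failure (or any non-deadlock exception) propagates to the caller.
    public static void Execute(Action work, Func<Exception, bool> isDeadlock,
                               int maxAttempts = 3, int baseDelayMs = 50)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                work();
                return;
            }
            catch (Exception ex) when (attempt < maxAttempts && isDeadlock(ex))
            {
                // Back off a little longer after each failed attempt.
                Thread.Sleep(baseDelayMs * attempt);
            }
        }
    }
}
```

The whole transaction has to be rerun inside `work`, since SQL Server rolls back the deadlock victim's transaction.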
Code, notice the order of the values is different. So it alternates between locking rows
No, it doesn't alternate. It acquires the locks in two different orders. Deadlock is guaranteed.
Is it possible not to ... update just those 2 rows in parallel?
Not like that, it isn't. What you're asking for is the definition of a deadlock. Something has to give. The solution must come from your business logic: there should be no attempts to process the same set of IDs from distinct transactions. What that means is entirely business specific. If you cannot achieve that, then you are basically begging for deadlocks. There are some things you can do, but none is bulletproof and all come at great cost. The problem is higher up the chain.
Agree with other answers as regards to the locking.
The more pressing question is what are you hoping to gain from this? There's only one cable those commands are travelling down.
You are probably making the overall performance worse by doing this. Far better to do your computation in parallel but serialize (and possibly batch) your updates.

Linq: Delete and Insert same Primary Key values within TransactionScope

I want to replace existing records in the DB with new records in one transaction. Using TransactionScope, I have
using ( var scope = new TransactionScope())
{
db.Tasks.DeleteAllOnSubmit(oldTasks);
db.Tasks.SubmitChanges();
db.Tasks.InsertAllOnSubmit(newTasks);
db.Tasks.SubmitChanges();
scope.Complete();
}
My program threw
System.InvalidOperationException: Cannot add an entity that already exists.
After some trial and error, I found that the culprit lies in the fact that there aren't any other instructions executed between the delete and the insert. If I insert other code between the first SubmitChanges() and InsertAllOnSubmit(), everything works fine. Can anyone explain why this is happening? It is very concerning.
I tried another one to update the objects:
IEnumerable<Task> tasks = ( ... some long query that involves multi tables )
.AsEnumerable()
.Select( i =>
{
i.Task.Duration += i.LastLegDuration;
return i.Task;
}
db.SubmitChanges();
This didn't work either. db didn't pick up any changes to Tasks.
EDIT:
This behavior doesn't seem to have anything to do with Transactions. At the end, I adopted the grossly inefficient Update:
newTasks.ForEach( t =>
{
Task attached = db.Tasks.Single( i => ... use primary id to look up ... );
attached.Duration = ...;
... more updates, Property by Property ...
});
db.SubmitChanges();
Instead of inserting and deleting or making multiple queries, you can try to update multiple rows in one pass by selecting a list of Id's to update and checking if the list contains each item.
Also, make sure you mark your transaction as complete to indicate to transaction manager that the state across all resources is consistent, and the transaction can be committed.
Dictionary<int,int> taskIdsWithDuration = getIdsOfTasksToUpdate(); //fetch a dictionary keyed on id's from your long query and values storing the corresponding *LastLegDuration*
using (var scope = new TransactionScope(TransactionScopeOption.Required))
{
var tasksToUpdate = db.Tasks.Where(x => taskIdsWithDuration.Keys.Contains(x.id));
foreach (var task in tasksToUpdate)
{
task.duration1 += taskIdsWithDuration[task.id];
}
db.SubmitChanges(); // LINQ to SQL, as in the question; SaveChanges() is the Entity Framework equivalent
scope.Complete();
}
Depending on your scenario, you can invert the search in the case that your table is extremely large and the number of items to update is reasonably small, to leverage indexing. Your existing update query should work fine if this is the case, so I doubt you'll need to invert it.
I had the same problem in LinqToSql, and I don't think it has to do with the transaction, but with how the session/context is coalescing changes. I say this because I fixed the problem by bypassing LinqToSql for the delete and using some raw SQL to do it. Ugly, I know, but it worked, and all inside a transaction scope.

Improve .NET/MSSQL select & update performance

I'd like to increase performance of very simple select and update queries of .NET & MSSQL 2k8.
My queries always select or update a single row. The DB tables have indexes on the columns I query on.
My test .NET code looks like this:
public static MyData GetMyData(int accountID, string symbol)
{
using (var cnn = new SqlConnection(connectionString))
{
cnn.Open();
var cmd = new SqlCommand("MyData_Get", cnn);
cmd.CommandType = CommandType.StoredProcedure;
cmd.Parameters.Add(CreateInputParam("@AccountID", SqlDbType.Int, accountID));
cmd.Parameters.Add(CreateInputParam("@Symbol", SqlDbType.VarChar, symbol));
SqlDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
var MyData = new MyData();
MyData.ID = (int)reader["ID"];
MyData.A = (int)reader["A"];
MyData.B = reader["B"].ToString();
MyData.C = (int)reader["C"];
MyData.D = Convert.ToDouble(reader["D"]);
MyData.E = Convert.ToDouble(reader["E"]);
MyData.F = Convert.ToDouble(reader["F"]);
return MyData;
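}
return null; // no matching row; without this the method does not compile (not all code paths return a value)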
}
}
and the according stored procedure looks like this:
PROCEDURE [dbo].[MyData_Get]
@AccountID int,
@Symbol varchar(25)
AS
BEGIN
SET NOCOUNT ON;
SELECT p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] AS p WHERE p.AccountID = @AccountID AND p.Symbol = @Symbol
END
What I'm seeing, if I run GetMyData in a loop querying MyData objects, is that I'm not exceeding about 310 transactions/sec. I was hoping to achieve well over 1000 transactions/sec.
On the SQL Server side, I'm not really sure what I can improve for such a simple query.
ANTS profiler shows me that on the .NET side, as expected, the bottleneck is cnn.Open and cmd.ExecuteReader; however, I have no idea how I could significantly improve my .NET code.
I've seen benchmarks though where people seem to easily achieve 10s of thousands transactions/sec.
Any advice on how I can significantly improve the performance for this scenario would be greatly appreciated!
Thanks,
Tom
EDIT:
Per MrLink's recommendation, adding "TOP 1" to the SELECT query improved performance to about 585 transactions/sec from 310
EDIT 2:
Arash N suggested to have the select query "WITH(NOLOCK)" and that dramatically improved the performance! I'm now seeing around 2500 transactions/sec
EDIT 3:
Another slight optimization that I just did on the .NET side helped me to gain another 150 transactions/sec. Changing while(reader.Read()) to if(reader.Read()) surprisingly made quite a difference. On avg. I'm now seeing 2719 transactions/sec
Try using WITH (NOLOCK) in your SELECT statement to increase the performance. This selects the row without taking shared locks, at the cost of possibly reading uncommitted data.
SELECT p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] AS p WITH (NOLOCK) WHERE p.AccountID = @AccountID AND p.Symbol = @Symbol
Some things to consider.
First, make sure the server connection is always closed or disposed (cnn.Close(); the using block in the posted code does this). A connection that is never closed is only released when the garbage collector gets to it, and until that happens you're creating a brand new connection to the database every time rather than taking one from the connection pool.
Second, do you have an index in SQL Server covering the AccountID and Symbol columns?
Third, while AccountID being an int is nice and fast, the Symbol column being varchar(25) is always going to be slower to compare. Can you change this to an int flag?
Make sure your database connections are actually pooling. If you are seeing a bottleneck in cnn.Open, there would seem to be a good chance they are not getting pooled.
I was hoping to achieve well over a 1000 transactions/sec [when running GetMyData in a loop]
What you are asking for is for GetMyData to run in less than 1ms - this is just pointless optimisation! At the bare minimum this method involves a round trip to the database server (possibly involving network access) - you wouldn't be able to make this method much faster if your query was SELECT 1.
If you have a genuine requirement to make more requests per second then the answer is either to use multiple threads or to buy a faster PC.
There is absolutely nothing wrong with your code - I'm not sure where you have seen people managing 10,000+ transactions per second, but I'm sure this must have involved multiple concurrent clients accessing the same database server rather than a single thread managing to execute queries in less than a 10th of a ms!
Is your method called frequently? Could you batch your requests so you can open your connection, create your parameters, get the result and reuse them for several queries before closing the whole thing up again?
If the data is not frequently invalidated (updated) I would implement a cache layer. This is one of the most effective ways (if used correctly) to gain performance.
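A cache layer of this kind might look roughly like the sketch below. `MyDataCache` is a hypothetical class, the loader delegate stands in for a method such as GetMyData, and the sketch assumes that whatever code updates a row also calls `Invalidate`; it has no expiry or size limit:

```csharp
using System;
using System.Collections.Concurrent;

public sealed class MyDataCache<TValue>
{
    private readonly ConcurrentDictionary<Tuple<int, string>, TValue> _items =
        new ConcurrentDictionary<Tuple<int, string>, TValue>();
    private readonly Func<int, string, TValue> _load;

    // 'load' performs the real database lookup on a cache miss.
    public MyDataCache(Func<int, string, TValue> load)
    {
        if (load == null) throw new ArgumentNullException(nameof(load));
        _load = load;
    }

    // Returns the cached value, loading it on first use.
    // Under contention the loader may run more than once for the same key,
    // but only one result is stored.
    public TValue Get(int accountID, string symbol)
    {
        return _items.GetOrAdd(Tuple.Create(accountID, symbol),
                               key => _load(key.Item1, key.Item2));
    }

    // Call after the underlying row has been updated.
    public void Invalidate(int accountID, string symbol)
    {
        TValue ignored;
        _items.TryRemove(Tuple.Create(accountID, symbol), out ignored);
    }
}
```

Usage would be along the lines of `var cache = new MyDataCache<MyData>(GetMyData);` followed by `cache.Get(accountID, symbol)` in the hot loop.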
You could use output parameters instead of a select since there's always a single row.
You could also create the SqlCommand ahead of time and re-use it. If you are executing a lot of queries in a single thread you can keep executing it on the same connection. If not, you could create a pool of them or do cmdTemplate.Clone() and set the Connection.
Try re-using the Command and Prepare'ing it first.
I can't say that it will definitely help, but it seems worth a try.
In no particular order...
Have you (or your DBAs) examined the execution plan your stored procedure is getting? Has SQL Server cached a bogus execution plan (either due to oddball parameters or old stats)?
How often are statistics updated on the database?
Do you use temp tables in your stored procedure? If so, are they created upfront? If not, you'll be doing a lot of recompiles, as creating a temp table invalidates the execution plan.
Are you using connection pooling? Opening/Closing a SQL server connection is an expensive operation.
Is your table clustered on accountID and Symbol?
Finally...
Is there a reason you're hitting this table by account and symbol rather than, say, just retrieving all the data for an entire account in one fell swoop?
