I am working with a project which involves processing a lot of text files and results in either inserting records into an mssql db or updating existing information.
The sql statement is written and stored in a list until the files have finished being processed.
This list is then processed. Each statement was being processed one at a time but as this could be thousands of statements and could create a very long running process.
To attempt to speed up this process i introduced some parallel processing but this occasionally results in the following error:
Transaction (Process ID 94) was deadlocked on lock | communication
buffer resources with another process and has been chosen as the
deadlock victim. Rerun the transaction.
Code as follows:
public static void ParallelNonScalarExecution(List<string> Statements, string conn)
{
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = 8;
CancellationTokenSource cancelToken = new CancellationTokenSource();
po.CancellationToken = cancelToken.Token;
Parallel.ForEach(Statements, po, Statement =>
{
using (SqlConnection mySqlConnection = new SqlConnection(conn))
{
mySqlConnection.Open();
using (SqlCommand mySqlCommand = new SqlCommand(Statement, mySqlConnection))
{
mySqlCommand.CommandTimeout = Timeout;
mySqlCommand.ExecuteScalar();
}
}
});
}
The update statements i believe are simple in what they are trying to achieve:
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code = 1
UPDATE TableA SET Notes = 'blahblahblah', Date = '2016-01-01' WHERE Code = 2
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code = 3
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code = 4
UPDATE TableB SET Type = 1 WHERE Code = 100
UPDATE TableA SET Notes = 'blahblahblah', Date = '2016-01-01' WHERE Code = 5
UPDATE TableB SET Type = 1 WHERE Code = 101
What is the best way to overcome this issue?
From what I see you don't want to do what you are doing. I would NOT recommend having multiple update statements that effect the same data/table on different threads. This is the breeding of a race condition/dead lock. In your case it should be safe, but if at any point you changed the where condition and there was overlap you would have a race condition issue.
If you really wanted to speed this up with multi-threading than having all of the update statements for tableA on one thread and all of the update statements on tableB on one thread. Another idea is to block your update statements.
UPDATE TableA SET Notes = 'blahblahblah' WHERE Code IN (1,2,3,4,5)
UPDATE TableA SET Date = '2016-01-01' WHERE Code IN (2,5)
UPDATE TableB SET Type = 1 WHERE Code IN (100,101)
These above statements should be able to be independently execute in a concurent enviroment as no two statements effect the same column?
Thread A updates resource X and does not commit and can continue doing more updates.
Thread B updates resource y and does not commit and can continue doing more updates.
At this point, both have uncommitted updates.
Now thread A updates resource y and waits on the lock from Thread B.
Thread B is not held up by anything, so it goes on, eventually tries to update resource x and is blocked by the lock A has on x.
Now they are in a deadlock. It's a stalemate, not one can proceed to commit, so the system kills one.
You have to commit more often to reduce the chances of a deadlock (but that does not eliminate the possibility entirely), or you have to carefully order your updates so all updates to x get done and completed before going on to do any updates on y.
Related
I have a code block which processes StoreProducts an then adds or updates them in the database in a for each loop. But this is slow. When I convert the code Parallel.ForEach block, then same products gets both added and updated at the same time. I could not figure out how to safely utilize for the following functionality, any help would be appreciated.
var validProducts = storeProducts.Where(p => p.Price2 > 0
&& !string.IsNullOrEmpty(p.ProductAtt08Desc.Trim())
&& !string.IsNullOrEmpty(p.Barcode.Trim())
).ToList();
var processedProductCodes = new List<string>();
var po = new ParallelOptions()
{
MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)), po,
(product) =>
{
lock (_lockThis)
{
processedProductCodes.Add(product.ProductCode);
}
// Check if Product Exists in Db
// if product is not in Db Add to Db
// if product is in Db Update product in Db
}
The thing in here is, the list validProducts may have more than one same ProductCode, so they are variants and I have to manage that even one of them is being processed it should not be processed again.
So where condition that is found in the parallel foreach 'validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)' is not working as expected like in normal for each.
The bulk of my answer is less-so an answer to your question and more some guidance - if you were to provide some more technical details, I may be able to assist more precisely.
A Parallel.ForEach is probably not the best solution here -- especially when you have a shared list or a busy server.
You are locking to write but not to read from that shared list. So I'm surprised it's not throwing during the Where. Turn the List<string> into a ConcurrentDictionary<string, bool> (just to create a simple concurrent hash table) then you'll get better write throughput and it won't throw during reads.
But you're going to have database contention issues (if using multiple connections) because your insert will likely still require locks. Even if you simply split the workload you would run into this. This DB locking could cause blocks/deadlocks so it may end up slower than the original. If using one connection, you generally cannot parallelize commands.
I would try wrapping the majority of inserts in a transaction containing batches of say 1000 inserts or place the entire workload into one bulk insert. Then the database will keep the data in-memory and commit the entire thing to disk when finished (instead of one record at a time).
Depending on your typical workload, you may want to try different storage solutions. Databases are generally bad for inserting large volumes of records... you will likely see much better performance with alternative solutions (such as Key-Value stores). Or place the data into something like Redis and slowly persist to the database in the background.
Parallel.ForEach buffers items internally for each thread, one option you could do is switch to a partitioner that does not use buffering
var pat = Partitioner.Create(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode))
,EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(pat, po, (product) => ...
That will get you closer but you will still have a race conditions where two of the same object can be processed because you don't break out of the loop if you find a duplicate.
The better option is switch processedProductCodes to a HashSet<string> and change your code to
var processedProductCodes = new HashSet<string>();
var po = new ParallelOptions()
{
MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts, po,
(product) =>
{
//You can safely lock on processedProductCodes
lock (processedProductCodes)
{
if(!processedProductCodes.Add(product.ProductCode))
{
//Add returns false if the code is already in the collection.
return;
}
}
// Check if Product Exists in Db
// if product is not in Db Add to Db
// if product is in Db Update product in Db
}
HashSet has a much faster lookup and is built in to the Add function.
Code, notice the order of the values is different. So it alternates between locking rows:
static void Main( string[] args )
{
List<int> list = new List<int>();
for(int i = 0; i < 1000; i++ )
list.Add( i );
Parallel.ForEach( list, i =>
{
using( NamePressDataContext db = new NamePressDataContext() )
{
db.ExecuteCommand( #"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
join (values (7276, 20870),(240, 20870)) as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id" );
db.ExecuteCommand( #"update EBayDescriptionsCategories set CategoryId = Ids.CategoryId from EBayDescriptionsCategories
join (values (240, 20870),(7276, 20870)) as Ids(Id,CategoryId) on Ids.Id = EBayDescriptionsCategories.Id" );
}
} );
}
Table def:
CREATE TABLE [dbo].[EDescriptionsCategories](
[CategoryId] [int] NOT NULL,
[Id] [int] NOT NULL,
CONSTRAINT [PK_EDescriptionsCategories] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
Exception:
Transaction (Process ID 80) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
The code works only with WITH (TABLOCK) hint. Is it possible not to lock the whole table just to update just those 2 rows in parallel?
Your two statements acquire row locks in different order. That's a classic case for deadlocks. You can fix this by ensuring that the order of locks taken is always in some global order (for example, ordered by ID). You should probably coalesce the two UPDATE statements into one and sort the list of IDs on the client before sending it to SQL Server. For many typical UPDATE plans this actually works fine (not guaranteed, though).
Or, you add retry logic in case you detect a deadlock (SqlException.Number == 1205). This is more elegant because it requires no deeper code changes. But deadlocks have performance implications so only do this for low deadlock rates.
If your parallel processing generates lots of updates, you could INSERT all those updates into a temp table (which can be done concurrently) and when you are done you execute one big UPDATE that copies all the individual update records to the main table. You just change the join source in your sample queries.
Code, notice the order of the values is different. So it alternates between locking rows
No, it doesn't alternate. It acquires the locks in two different order. Deadlock is guaranteed.
Is it possible not to ... update just those 2 rows in parallel?
Not like that it isn't. What you're asking for is the definition of a deadlock. Something gotta give. The solution must come from your business logic, there should be no attempts to process the same set of IDs from distinct transactions. What that means, is entire business specific. IF you cannot achieve that, then basically you are just begging for deadlocks. There are some things you can do, but none is bulletproof and all come at great cost. The problem is higher up the chain.
Agree with other answers as regards to the locking.
The more pressing question is what are you hoping to gain from this? There's only one cable those commands are travelling down.
You are probably making the overall performance worse by doing this. Far better to do your computation in parallel but serialize (and possibly batch) your updates.
I'd like to lock specified row(s) in my table exclusively, so no reads no updates allowed until the actual transaction completes. To do this, I've created a helper class in my database repository:
public void PessimisticMyEntityHandler(Action<IEnumerable<MyEntity>> fieldUpdater, string sql, params object[] parameters)
{
using (var scope = new System.Transactions.TransactionScope())
{
fieldUpdater(DbContext.Set<MyEntity>().SqlQuery(sql, parameters));
scope.Complete();
}
}
Here is my test code. Basicly I'm just starting two tasks and both of them tries to lock the row with the Id '1'. My guess was that the second task won't be able to read(and update) the row until the first one finishes its job, but the output window shows that it actually can.
Task.Factory.StartNew(() =>
{
var dbRepo = new DatabaseRepository();
dbRepo.PessimisticMyEntityHandler(myEntities =>
{
Debug.WriteLine("entered into lock1");
/* Modify some properties considering the current ones... */
var myEntity = myEntities.First();
Thread.Sleep(1500);
myEntity.MyEntityCode = "abcdefgh";
dbRepo.Update<MyEntity>(myEntity);
Debug.WriteLine("leaving lock1");
}, "SELECT * FROM MyEntities WITH (UPDLOCK, HOLDLOCK) WHERE Id = #param1", new SqlParameter("param1", 1));
});
Task.Factory.StartNew(() =>
{
Thread.Sleep(500);
var dbRepo = new DatabaseRepository();
dbRepo.PessimisticMyEntityHandler(myEntities =>
{
Debug.WriteLine("entered into lock2");
/* Modify some properties considering the current ones... */
var myEntity = myEntities.First();
myEntity.MyEntityCode = "xyz";
dbRepo.Update<MyEntity>(myEntity);
Debug.WriteLine("leaving lock2");
}, "SELECT * FROM MyEntities WITH (UPDLOCK, HOLDLOCK) WHERE Id = #param1", new SqlParameter("param1", 1));
});
Output window:
entered into lock1
entered into lock2
leaving lock2
leaving lock1
What you are asking for, involves two main phenomena in DBMS and particularly in SQL Server: Lock and Isolation Level. I do my best to explain them in summery.
You asked about Pessimistic Concurrency. The answer is: it is not supported yet in Entity Framework. In other words, by conventional API of EF you cannot lock a table or some rows for SELECT like what for example Oracle does via SELECT FOR UPDATE. Though you can achieve this by writing a native SQL command to select some rows or the entire table with an Exclusive lock and maintain this lock until the end of the transaction. This way other threads not only cannot update the selected rows, but also cannot select them. They get blocked until you release the lock. This is what I do in my projects and though somehow risky, it works fine.
So In summery:
Lock for select: NO by EF / Yes by native SQL
Lock for Update:
When you modify rows in DB, the modified rows gain some sort of lock. The kind of lock is determined by the Isolation Level of the running Transaction. The default of Isolation Level in SQL Server is Read Committed which means that all the rows that are modified in current transaction gain Shared lock. This lock is compatible with SELECT but not with UPDATE or DELETE. It means that when you modify a row in your transaction, by default it is guaranteed that no other parallel threads can infer and change them until you end the transaction either by COMMIT or ROLLBACK.
To understand the locks in SQL Server see:
http://technet.microsoft.com/en-us/library/aa213039(v=sql.80).aspx
To understand transaction isolation level see: http://technet.microsoft.com/en-us/library/ms189122(v=sql.105).aspx
.
Update:
Table hints UPDLOCK and HOLDLOCK may be ignored by query optimizer or other DBMS modules since they are just hint :-). The only combination of table hints that is surely enforced is (XLOCK, PAGLOCK).
Example: SELECT * FROM MyTable WITH (XLOCK, PAGLOCK)
As I said, manual locking is risky. Use it with maximum level of consideration.
I am working with a Multi threaded Console Application, in which each thread basically tries to obtain the TOP 1 "File" row with certain criteria met and locks it.( There is a LockID column which is populated when this happens so that the next thread picks up the next available 'unlocked' 'File' row)
We put a monitor on the SQL Server DB and each time the deadlock happens on 2 queries.
SELECT TOP 1 F.Id, F.ContentTypeId, F.ManufacturerId, F.DocumentTypeId, F.Name, F.Description, F.VersionId, F.LastChangedVersionOn, F.ReferenceCount, F.LastChangedReferencesOn, F.LastChangedImageOn, F.ImageSize, F.IsStale, F.InvalidFile, CT.Id, CT.Name, CT.MimeType, CT.IsMimeAttachment, CT.Extensions, CT.CanTrackVersions, CT.UseRemoteSource, CT.FullTextFilter, CT.ContentHandler, V.Id, V.Size, V.Hash, V.Title, DT.Id, DT.Code, DT.Ordinal, DT.Name, DT.PluralName, DT.UrlPart
FROM Docs.Files F
INNER JOIN Docs.ContentTypes CT ON CT.Id = F.ContentTypeId
LEFT JOIN Docs.Versions V ON V.Id = F.VersionId
LEFT JOIN Docs.DocumentTypes DT ON DT.Id = F.DocumentTypeId
WHERE (F.LockId IS NULL OR F.LockedOn < DATEADD(hh,-1,GETUTCDATE()))
AND F.IsStale = 1 AND F.InvalidFile = 0
AND
(#Id int)UPDATE Docs.Files
SET LastChangedImageOn = GETUTCDATE(), ImageSize = (
SELECT DATALENGTH(FileImage)
FROM Docs.FileImages
WHERE FileId = #Id)
WHERE Id = #Id;
SELECT TOP 1 LastChangedImageOn FROM Docs.Files WHERE Id = #Id
The first query is run when a new thread is created and we try to obtain a new 'File' row.
The second query is run when a thread(may be a previously created one) is almost done with processing the 'File' record.Used Transactions on this query. Isolation level was "ReadCommitted".
I'm pretty sure that both the queries aren't trying to access the same "FileID" because two threads would never process the same "FileID" subsequently.
I'm terribly confused as to how I can diagnose this issue. What might be causing the deadlock between these 2 queries?
I'd really really appreciate it if someone could guide me in the right direction. Thanks a lot in advance :)
Hmm... it's been a while since I did anything with SQL Server. But let's try.
You mention that "two threads would never process the same "FileID" subsequently", how can you be sure of this? Is the ID being fed from a source outside the thread?
I want to replace existing records in the DB with new records in one transaction. Using TransactionScope, I have
using ( var scope = new TransactionScope())
{
db.Tasks.DeleteAllOnSubmit(oldTasks);
db.Tasks.SubmitChanges();
db.Tasks.InsertAllOnSubmit(newTasks);
db.Tasks.SubmitChanges();
scope.Complete();
}
My program threw
System.InvalidOperationException: Cannot add an entity that already exists.
After some trial and error, I found the culprit lies in the the fact that there isn't any other execution instructions between the delete and the insert. If I insert other code between the first SubmitChanges() and InsertAllOnSubmit(), everything works fine. Can anyone explain why is this happening? It is very concerning.
I tried another one to update the objects:
IEnumerable<Task> tasks = ( ... some long query that involves multi tables )
.AsEnumerable()
.Select( i =>
{
i.Task.Duration += i.LastLegDuration;
return i.Task;
}
db.SubmitChanges();
This didn't work neither. db didn't pick up any changes to Tasks.
EDIT:
This behavior doesn't seem to have anything to do with Transactions. At the end, I adopted the grossly inefficient Update:
newTasks.ForEach( t =>
{
Task attached = db.Tasks.Single( i => ... use primary id to look up ... );
attached.Duration = ...;
... more updates, Property by Property ...
}
db.SubmitChanges();
Instead of inserting and deleting or making multiple queries, you can try to update multiple rows in one pass by selecting a list of Id's to update and checking if the list contains each item.
Also, make sure you mark your transaction as complete to indicate to transaction manager that the state across all resources is consistent, and the transaction can be committed.
Dictionary<int,int> taskIdsWithDuration = getIdsOfTasksToUpdate(); //fetch a dictionary keyed on id's from your long query and values storing the corresponding *LastLegDuration*
using (var scope = new TransactionScope(TransactionScopeOption.Required))
{
var tasksToUpdate = db.Tasks.Where(x => taskIdsWithDuration.Keys.Contains(x.id));
foreach (var task in tasksToUpdate)
{
task.duration1 += taskIdsWithDuration[task.id];
}
db.SaveChanges();
scope.Complete();
}
Depending on your scenario, you can invert the search in the case that your table is extremely large and the number of items to update is reasonably small, to leverage indexing. Your existing update query should work fine if this is the case, so I doubt you'll need to invert it.
I had same problem in LinqToSql and I don't think its to do with the transaction, but with how the session/context is coalescing changes. I say this because I fixed the problem by bypassing linqtosql for the delete and using some raw sql to do it. Ugly I know, but it worked, and all inside a transaction scope.