I have to insert a record into table A, and one of the fields in that record references a record in another table B. How can I make sure that, until I commit the insert statement, the record in table B referenced by the new row in table A is not tampered with?
I am thinking of including both tables in a transaction and locking all the records involved, but that may hurt concurrency, so I need your recommendation.
Thank you,
Note that even with a transaction, you'll need to get the isolation level right. The most paranoid (and hence the most accurate) is "serializable", which takes out locks (even range locks) when you read data, so that other spids can't play with it.
If you want your changes to the two tables to become a single atomic action, they should be performed in a single transaction. It's relatively simple in .NET: call the BeginTransaction method on a SqlConnection to create a new transaction, and then make your SqlCommands etc. work against the transaction rather than just the connection. You can also do it using a TransactionScope, but you may have issues with MSDTC.
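For illustration, here's a minimal sketch of that pattern, assuming a new row in TableA referencing a new row in TableB; the column names and connection string are placeholders, not from the question:

using System;
using System.Data.SqlClient;

// Minimal sketch: two commands share one connection and one transaction,
// so either both inserts commit or neither does.
static void InsertAtomically(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (SqlTransaction transaction = connection.BeginTransaction())
        {
            try
            {
                var insertB = new SqlCommand(
                    "INSERT INTO TableB (Name) VALUES (@name); SELECT SCOPE_IDENTITY();",
                    connection, transaction);
                insertB.Parameters.AddWithValue("@name", "example");
                int newBId = Convert.ToInt32(insertB.ExecuteScalar());

                var insertA = new SqlCommand(
                    "INSERT INTO TableA (B_FK) VALUES (@bId);", connection, transaction);
                insertA.Parameters.AddWithValue("@bId", newBId);
                insertA.ExecuteNonQuery();

                transaction.Commit();   // both rows become visible together
            }
            catch
            {
                transaction.Rollback(); // neither row is persisted
                throw;
            }
        }
    }
}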
I wouldn't be too concerned about concurrency issues when using transactions. I would shy away from trying to deal with locking issues yourself; I would start by just making the updates atomic and maintaining data integrity.
You don't need a transaction for this, just a foreign-key relationship. A relationship from table A's field B_FK referencing table B's primary key will prevent the creation of the table A row if the corresponding table B row does not exist.
If by "tampered with" you mean deleted, then John's right, a foreign-key relationship is probably what you want. But, if you mean modified, then a transaction is the only way to go. Yes, it'll mean that you have a potential bottleneck, but there's no way to avoid that if you want your operation to be "atomic". To avoid any noticeable performance degradation, you'll want to keep the lifetime of the transaction to the bare minimum.
Since you're using C# (and presumably ADO.NET), you could use the transaction features built into the framework. However, it's better to have the transaction handled by the database server, since that means the transaction can be started, completed and committed in a single request (see above re transaction lifetime).
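For example (a hedged sketch; the SQL text, table names and the SET XACT_ABORT choice are mine, not from the answer), the whole transaction can be sent as one batch so it starts and commits in a single round trip:

using System.Data.SqlClient;

// Sketch: begin, execute and commit on the server in one request, keeping the
// transaction's lifetime to a single round trip. XACT_ABORT makes any runtime
// error roll back the whole batch.
static void InsertInOneRequest(string connectionString)
{
    const string batch = @"
        SET XACT_ABORT ON;
        BEGIN TRANSACTION;
            INSERT INTO TableB (Name) VALUES (@name);
            INSERT INTO TableA (B_FK) VALUES (SCOPE_IDENTITY());
        COMMIT TRANSACTION;";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(batch, connection))
    {
        command.Parameters.AddWithValue("@name", "example");
        connection.Open();
        command.ExecuteNonQuery();
    }
}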
I have a lot of rows (300k+) to upsert into a SQL Server database in the shortest possible time, so the idea was to use parallelization: partition the data and pump it into SQL asynchronously, X threads at a time, 100 rows per context, with the context being recycled to minimize tracking overhead. However, that means more than one connection is used in parallel, so CommittableTransaction/TransactionScope would use a distributed transaction, which causes the parallelized transaction-enlistment operation to throw the infamous "This platform does not support distributed transactions." exception.
I do need the ability to commit/rollback the entire set of upserts. It's part of a batch upload process, and any error should roll the changes back to the previously working/stable condition, application-wise.
What are my options? Short of using one connection and no parallelization?
Note: the problem is not as simple as a batch of insert commands; if that were the case, I would just generate the inserts and run them on the server as a query, or indeed use SqlBulkCopy. About half of the rows are updates and half are inserts where new keys are generated by SQL Server; those keys need to be obtained and re-keyed onto child objects which are inserted next, and the rows are spread over about a dozen tables in a 3-level hierarchy.
Nope. Totally wrong approach. Do NOT use EF for that - bulk-insert ETL is not what object-relational mappers are made for, and a lot of their design decisions are counterproductive for it. You also wouldn't use a small car instead of a truck to transport 20 tons of goods.
300k rows are trivial if you use the SqlBulkCopy API in some form.
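As a hedged sketch of "SqlBulkCopy in some form" for an upsert: bulk-load the rows into a staging table, then let the server apply inserts and updates in one set-based statement. The staging/target names and the MERGE are my assumptions, not part of the answer; the re-keying of child rows mentioned in the question could be handled by adding an OUTPUT clause to the MERGE.

using System.Data;
using System.Data.SqlClient;

// Sketch: stream the rows into a staging table with SqlBulkCopy,
// then upsert into the real table with a single MERGE statement.
static void BulkUpsert(string connectionString, DataTable rows)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        {
            using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction))
            {
                bulk.DestinationTableName = "dbo.MyTable_Staging"; // hypothetical staging table
                bulk.BatchSize = 10000;
                bulk.WriteToServer(rows);
            }

            // Set-based upsert on the server; no row-by-row round trips.
            var merge = new SqlCommand(@"
                MERGE dbo.MyTable AS target
                USING dbo.MyTable_Staging AS source ON target.Id = source.Id
                WHEN MATCHED THEN UPDATE SET target.Payload = source.Payload
                WHEN NOT MATCHED THEN INSERT (Id, Payload) VALUES (source.Id, source.Payload);",
                connection, transaction);
            merge.ExecuteNonQuery();

            transaction.Commit();
        }
    }
}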
Recently we faced quite an interesting issue that has to do with SQL transaction timeouts. The statement that timed out does not really matter for the sake of the question, but it was a single INSERT statement without an explicit transaction, with a client-generated GUID as the key:
INSERT MyTable
(id, ...)
VALUES (<client-app-generated-guid>, ...)
We also have retry policies in place, so that if a command fails with a SqlException it is retried. One day SQL Server (Azure SQL) did not behave normally and we faced a lot of strange PK violation errors during retries. They were caused by retrying a transaction that had actually committed successfully on the SQL Server side (so the retry attempts an insert with an already-taken ID). I understand that a SQL timeout is purely a client-side concept, so if the client thinks that a SqlCommand failed, it might or might not have failed.
I suspect that explicit client-side transaction control, for instance wrapping the statements with a TransactionScope as shown below, will fix 99% of such troubles, because Commit is actually quite a fast and cheap operation. However, I still see a caveat: the timeout can also happen during the commit stage. The application can again be in a state where it's impossible to guess whether the transaction really committed or not (to figure out whether a retry is needed).
The question is how to write code in a bulletproof (against this kind of trouble) and generic fashion, and retry only when it is positively clear that the transaction was not committed.
using (var trx = new TransactionScope())
using (var con = GetOpenConnection(connectionString))
{
    con.Execute("<some-non-idempotent-query>");
    // what if Complete() times out?!
    // to retry or not to retry?!
    trx.Complete();
}
The problem is that the exception does not mean that the transaction failed. For any compensating action (like retrying) you need a definite way of telling whether it failed. There are scalability issues with what I will suggest, but it's the technique that is the important thing; the scalability issues can be solved in other ways.
My solution (sketched in code after this list):
The last INSERT before the COMMIT writes a GUID to a tracking table.
If an exception occurs that indicates a network failure, SELECT @@TRANCOUNT. If it indicates you are still in a transaction (i.e. it is greater than 0, which probably should never happen, but it's worth checking), then you can happily resubmit your COMMIT.
If @@TRANCOUNT returns 0, then you are no longer in a transaction. Selecting your GUID from the tracking table will tell you whether your COMMIT was successful.
If your commit was not successful (@@TRANCOUNT == 0 and your GUID is not present in the tracking table), then resubmit your entire batch from the BEGIN TRANSACTION onwards.
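A rough sketch of how that could look from C#, assuming a hypothetical dbo.CommitTracking table and a client-generated GUID key as in the question:

using System;
using System.Data.SqlClient;

// Sketch: the last INSERT before COMMIT writes a tracking GUID. On a timeout or
// network error, check @@TRANCOUNT on the same connection: if still in a
// transaction, resubmit the COMMIT; otherwise the tracking table tells you
// whether the COMMIT made it.
static void RunBatch(SqlConnection openConnection, Guid id)
{
    var batch = new SqlCommand(@"
        BEGIN TRANSACTION;
            INSERT INTO MyTable (id) VALUES (@id);                 -- the real work
            INSERT INTO dbo.CommitTracking (BatchId) VALUES (@id); -- last insert before COMMIT
        COMMIT TRANSACTION;", openConnection);
    batch.Parameters.AddWithValue("@id", id);

    try
    {
        batch.ExecuteNonQuery();
    }
    catch (SqlException)
    {
        var trancount = new SqlCommand("SELECT @@TRANCOUNT;", openConnection);
        if ((int)trancount.ExecuteScalar() > 0)
        {
            // Still inside the open transaction: happily resubmit the COMMIT.
            new SqlCommand("COMMIT TRANSACTION;", openConnection).ExecuteNonQuery();
        }
        // Otherwise: select @id from dbo.CommitTracking. Found => committed, do not
        // retry. Not found => resubmit the whole batch from BEGIN TRANSACTION onwards.
    }
}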
The general approach is: try to read back what you just tried to insert.
If you can read back the ID that you tried to insert, then previous transaction committed successfully, no need to retry.
If you can't find the ID that you tried to insert, then you know that your attempt to insert has failed, so you should retry.
I'm afraid there is no way to have a completely generic pattern that would work for any SQL statement. Your "checking" code needs to know what to look for.
If it is INSERT with ID - then you are looking for that ID.
If it is some UPDATE, then the check would be custom and depend on the nature of that UPDATE.
If it is DELETE, then the check consists of trying to read what was meant to be deleted.
Actually, here is a generic pattern: any data-modification batch that has one or more INSERT, UPDATE or DELETE statements should contain one more INSERT within that transaction that writes some GUID (an ID of the data-modifying transaction itself) into a dedicated audit table. Then your checking code tries to read that same GUID from the audit table. If the GUID is found, you know that the previous transaction committed successfully. If the GUID is not found, you know that the previous transaction was rolled back and you can retry.
Having this dedicated audit table unifies/standardizes the checks. The checks no longer depend on the internals and details of your data-changing code. Your data-modification code and verification code depend on the same agreed interface: the audit table.
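A minimal sketch of the verification side, with a hypothetical dbo.TransactionAudit table; the data-modification batch itself would insert @batchId into this table as part of its transaction:

using System;
using System.Data.SqlClient;

// Sketch: after an ambiguous failure (e.g. a timeout), look for the batch GUID
// in the audit table to decide whether the previous transaction committed.
static bool WasCommitted(string connectionString, Guid batchId)
{
    using (var connection = new SqlConnection(connectionString))
    using (var check = new SqlCommand(
        "SELECT COUNT(*) FROM dbo.TransactionAudit WHERE BatchId = @batchId;", connection))
    {
        check.Parameters.AddWithValue("@batchId", batchId);
        connection.Open();

        // GUID found  -> previous transaction committed, do not retry.
        // GUID absent -> previous transaction rolled back, safe to retry.
        return (int)check.ExecuteScalar() > 0;
    }
}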
I have two servers that run the same query, checking for specific values in a single shared DB. If the query finds the values, it will alter them. At the same time the other server might run the same query, and there will be some kind of conflict while trying to alter the information.
Question: how can I best ensure that the servers won't run their query at the same time and guarantee that they won't get conflicts?
Databases take care of this for you automatically. They use locks to make sure only one query accesses specific data at a time. These locks don't have to cover whole tables; depending on the query and transaction type, per-row locks are possible too. When you have two queries that should be grouped together, such as your select and update, transactions make sure the locks from the first query are not released until both queries have finished.
Generally, databases are meant to serve queries (and release their locks) quickly, so that two queries that arrive at about the same time will be processed in sequence with little to no observable delay to the end user. Locks can cause problems for queries that need to lock a lot of data, that need to run for a long time, or when two transactions each lock some data and then later need data the other has locked; that last case is called a deadlock.
Problems with locks can be controlled by adjusting transaction isolation levels. However, it's usually a mistake to go messing with isolation levels. Most of the time the defaults will do what you need, and changing isolation levels without fully understanding what you're doing can make the situation worse, as well as allow queries to return stale or wrong data.
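For example, one common way (my addition, not something the answer prescribes) is to take update locks when reading the rows inside a transaction, so the other server's identical query blocks until the first one commits and then no longer matches those rows:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Sketch: read the candidate rows with an update lock and change them in the
// same transaction, so the other server cannot process the same rows concurrently.
static void ClaimAndProcess(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
        {
            var select = new SqlCommand(
                "SELECT Id FROM dbo.WorkItems WITH (UPDLOCK, ROWLOCK) WHERE Status = 'New';",
                connection, transaction);

            var ids = new List<int>();
            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
            }

            // The rows stay locked until Commit, so the other server's query waits
            // and then no longer finds them in the 'New' status.
            foreach (var id in ids)
            {
                var update = new SqlCommand(
                    "UPDATE dbo.WorkItems SET Status = 'Processing' WHERE Id = @id;",
                    connection, transaction);
                update.Parameters.AddWithValue("@id", id);
                update.ExecuteNonQuery();
            }

            transaction.Commit();
        }
    }
}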
Transactions and isolation levels are your friends here. You need to set the isolation level so that the two queries won't interfere with each other.
Refer to https://msdn.microsoft.com/en-gb/library/ms173763.aspx for guidance on the level you need to set.
You need to have an extra column in the database, e.g. server_id, and write the queries as:
select * from database where server_id = 1 --for the first server
select * from database where server_id = 2 --for the second server
I have three tables: Table A, Table B and Table C.
I want to read data from Table A, join it with Table B, and insert the result into Table C.
I don't want any other transactions to be able to insert records into Table A while I'm joining Table A and Table B.
Which isolation level should I use? Is using the read committed isolation level right or not?
You are asking the wrong question.
You shouldn't block inserts. Why would you want to do that, especially in a concurrent environment like the one you describe? You'll only see blocking and deadlocks.
Instead you should ask: how can I ensure that the join between A and B is consistent? Meaning that the join will not see any record inserted while it runs, without blocking said inserts. And the answer is: use SNAPSHOT ISOLATION.
With SNAPSHOT ISOLATION, each time you run the join you will see only rows that were already committed when the join started. Rows that were inserted (into A or B) after your join started are not visible, so the join is always consistent. The important thing is that you do not block inserts, and you won't deadlock either. Sounds too good to be true? Of course there is no free lunch; snapshot isolation has a cost, see Row Versioning Resource Usage.
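A minimal sketch from the client side, assuming snapshot isolation has already been enabled once on the database with ALTER DATABASE ... SET ALLOW_SNAPSHOT_ISOLATION ON, and using the generic A/B/C tables from the question with made-up column names:

using System.Data;
using System.Data.SqlClient;

// Sketch: run the A-join-B insert into C under SNAPSHOT isolation.
// The join only sees rows committed before the transaction started;
// concurrent inserts into A or B are neither seen nor blocked.
static void CopyJoinedRows(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction(IsolationLevel.Snapshot))
        {
            var command = new SqlCommand(@"
                INSERT INTO TableC (AId, BValue)
                SELECT a.Id, b.Value
                FROM TableA AS a
                JOIN TableB AS b ON b.AId = a.Id;",
                connection, transaction);
            command.ExecuteNonQuery();
            transaction.Commit();
        }
    }
}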
This is a good read on the topic: Implementing Snapshot or Read Committed Snapshot Isolation in SQL Server: A Guide.
No, go for the Serializable option, which is the best in your present scenario, as it is used to prevent new records from being added to the table.
You should use Serializable. From MSDN:
"A range lock is placed on the DataSet, preventing other users from updating or inserting rows into the dataset until the transaction is complete."
See http://msdn.microsoft.com/en-us/library/system.data.isolationlevel.aspx for details
Instead of blocking inserts, consider using snapshot isolation for reads. That way you get a point-in-time fully consistent and stable snapshot to read from. Concurrent DML does not disturb your reads.
If you need to block inserts, SERIALIZABLE is the minimum required level. You might well suffer from blocking and deadlocking. Therefore, I recommend snapshot isolation if you can at all use it.
Hi, I will also go with Serializable.
Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)
Read committed (Database Engine default level)
Repeatable read
Serializable (the highest level, where transactions are completely isolated from one another)
Refer to the link, which clearly shows which one should be used:
http://technet.microsoft.com/en-us/library/ms189122%28v=sql.105%29.aspx
I have a process that runs multi-threaded.
The process has a thread-safe collection of items to process.
Each thread processes items from the collection in a loop.
Each item in the list is sent to a stored procedure by the thread to insert data into 3 tables in a transaction (in SQL). If one insert fails, all three fail. Note that the scope of the transaction is per item.
The inserts are pretty simple, just inserting one row (foreign key related) into each table, with identity seeds. There is no read, just insert and then move on to the next item.
If I have multiple threads trying to process their own items each trying to insert into the same set of tables, will this create deadlocks, timeouts, or any other problems due to transaction locks?
I know I have to use one DB connection per thread; I'm mainly concerned with the lock levels of the tables in each transaction. When one thread is inserting rows into the 3 tables, will the other threads have to wait? There is no dependency between rows per table, except that the auto identity needs to be incremented. If it is a table-level lock to increment the identity, then I suppose the other threads will have to wait. The inserts may or may not be fast sometimes. If the other threads are going to have to wait, does it make sense to do multithreading?
The objective for multithreading is to speed up the processing of items.
Please share your experience.
PS: Identity seed is not a GUID.
In SQL Server, multiple inserts into a single table normally do not block each other on their own. The IDENTITY generation mechanism is highly concurrent, so it does not serialize access. Inserts may block each other if they insert the same key into a unique index (one of them will also hit a duplicate-key violation if both attempt to commit). You also have a probability game because keys are hashed, but it only comes into play in large transactions; see %%LOCKRES%% COLLISION PROBABILITY MAGIC MARKER: 16,777,215. If the transaction inserts into multiple tables, there also shouldn't be conflicts as long as, again, the keys inserted are disjoint (this happens naturally if the inserts are master-child-child).
That being said, the presence of secondary indexes and especially the foreign-key constraints may introduce blocking and possible deadlocks. Without an exact schema definition it is impossible to tell whether you are or are not susceptible to deadlocks. Any other workload (reports, reads, maintenance) also adds to the contention problems and can potentially cause blocking and deadlocks.
Really, really, really high-end deployments (the kind that don't need to ask for advice on forums...) can suffer from insert hot-spot symptoms; see Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.
BTW, doing INSERTs from multiple threads is very seldom the correct answer to increasing load throughput. See The Data Loading Performance Guide for good advice on how to solve that problem. And one last piece of advice: multiple threads are also seldom the answer to making any program faster. Asynchronous programming is almost always the correct answer. See AsynchronousProcessing and BeginExecuteNonQuery.
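For reference, a hedged sketch of that BeginExecuteNonQuery pattern (the table and column names are made up; on older .NET Framework versions the connection string may also need Asynchronous Processing=true):

using System;
using System.Data.SqlClient;

// Sketch: start the insert asynchronously, keep preparing the next item on the
// same thread, then harvest the result - instead of adding more threads.
static void InsertAsynchronously(SqlConnection openConnection)
{
    var command = new SqlCommand(
        "INSERT INTO dbo.Items (Payload) VALUES (@payload);", openConnection);
    command.Parameters.AddWithValue("@payload", "example");

    IAsyncResult pending = command.BeginExecuteNonQuery();

    // ... prepare the next item while SQL Server processes this insert ...

    int rowsAffected = command.EndExecuteNonQuery(pending);
}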
As a side note:
just inserting one row (foreign key related) into each table, ...
There is no read,
This statement is actually contradicting itself. Foreign keys imply reads, since they must be validated during writes.
What makes you think it has to be a table-level lock if there is an identity column? I don't see that in any of the documentation, and I just tested an insert with (ROWLOCK) on a table with an identity column and it works.
To minimize locking, take a row lock (ROWLOCK hint). For all the stored procedures, update the tables in the same order.
You have inserts into three tables taking up to 10 seconds each? I have some inserts in transactions that hit multiple tables (some of them big) and get 100 per second.
Review your table design and keys. If you can pick a clustered PK that represents the order of your inserts, and if you can sort before inserting, it will make a huge difference. Review the need for any other indexes. If you must have other indexes, then monitor the fragmentation and defragment.
Related but not the same: I have a data loader that must parse some data and then load millions of rows a night, but not in a transaction. It was optimal at 4 parallel processes starting with empty tables, but the problem was that after two hours of loading, throughput was down by a factor of 10 due to fragmentation. I redesigned the tables so the clustered PK index was on insert order. I dropped any other index that did not yield at least a 50% select improvement. On the nightly insert I first drop (disable) the indexes and use just two threads: one thread to parse and one to insert. Then I recreate the indexes at the end of the load. I got a 100:1 improvement over 4 threads hammering the indexes. Yes, you have a different problem, but review your tables. Too often I think indexes are added for small select benefits without considering the hit to inserts and updates. Also, the select benefit is often overvalued, because you build the index and compare while that fresh index has no fragmentation.
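A hedged sketch of that disable-load-rebuild sequence; the index and table names are placeholders:

using System;
using System.Data.SqlClient;

// Sketch: disable the secondary indexes before the nightly load and rebuild them
// once at the end, instead of maintaining (and fragmenting) them row by row.
static void RunNightlyLoad(string connectionString, Action loadRows)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // The clustered PK (ordered by insert order) stays; only secondary indexes go.
        new SqlCommand("ALTER INDEX IX_MyTable_Lookup ON dbo.MyTable DISABLE;", connection)
            .ExecuteNonQuery();

        loadRows(); // one parser thread feeding one inserter thread, as described above

        // Rebuilding re-enables the disabled index and leaves everything defragmented.
        new SqlCommand("ALTER INDEX ALL ON dbo.MyTable REBUILD;", connection) { CommandTimeout = 0 }
            .ExecuteNonQuery();
    }
}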
Heavy-duty DBMSs like MSSQL are generally very, very good at handling concurrency. What exactly will happen with your concurrently executing transactions largely depends on your transaction isolation level (http://msdn.microsoft.com/en-us/library/ms175909%28v=sql.105%29.aspx), which you can set as you see fit, but in this scenario I don't think you need to worry about deadlocks.
Whether it makes sense or not is always hard to guess without knowing anything about your system. It's not hard to try it out, though, so you can find that out yourself. If I were to guess, I would say it won't help you much if all your threads are going to do is insert rows in a round-robin fashion.
The other threads will wait anyway; your PC can't really execute more threads than the CPU cores it has at any given moment.
You wrote that you want to use multithreading to speed up the processing. I'm not sure this is something you can take as given/correct automatically. The level of parallelism and its effect on processing speed depends on lots of factors that are very processing-dependent, such as whether there's I/O involved, for example, or whether each thread is supposed to do in-memory processing only. This is, I think, one of the reasons Microsoft offers the task schedulers in their TPL framework and generally treats the concurrency in this library as something that is supposed to be tuned at runtime.
I think your safest bet is to run test queries / processes to see exactly what happens (though, of course, it still won't be 100% accurate). You can also check out the optimistic concurrency features of SQL Server, which allow lock-free work (I'm not sure how they handle identity columns, though).