I have a method in .NET (v4.6, using Dapper), named BulkUpdate, that modifies several tables and can involve around 10,000 rows or more. This operation can take anywhere from a few seconds to a couple of minutes depending on the amount of data being inserted. Since I will be updating multiple related tables, I have to enclose all operations in a TransactionScope.
My question is: what is the best way to keep other read requests (outside the transaction) from being blocked or made to wait while my BulkUpdate method is in progress? Please, I do not want to add SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED to the beginning of every read, nor add NOLOCK hints everywhere... are there any other solutions?
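For context, here is roughly what such a method looks like; the table names, the OrderLine type and the connection handling are illustrative assumptions, not the actual code:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Transactions;
using Dapper;

public class OrderLine   // hypothetical row type
{
    public int Id { get; set; }
    public int HeaderId { get; set; }
    public int Qty { get; set; }
}

public void BulkUpdate(string connectionString, IReadOnlyCollection<OrderLine> lines)
{
    var options = new TransactionOptions
    {
        IsolationLevel = IsolationLevel.ReadCommitted,
        Timeout = TimeSpan.FromMinutes(5)
    };
    using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();   // the connection enlists in the ambient transaction
        // Dapper runs each statement once per element of the collection.
        conn.Execute("UPDATE dbo.OrderLine SET Qty = @Qty WHERE Id = @Id;", lines);
        conn.Execute("UPDATE dbo.OrderHeader SET ModifiedUtc = GETUTCDATE() WHERE Id = @HeaderId;", lines);
        scope.Complete();
    }
}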
Try using the Task concept on the C# front-end side: split the rows into batches of 100 or 1,000 per task and run them simultaneously. It may be useful for you; I have already improved my application's performance this way.
https://www.codeproject.com/Questions/1226752/Csharp-how-to-run-tasks-in-parallel
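A rough sketch of that suggestion; doBatch stands in for whatever actually writes one batch (each batch needs its own connection/transaction), and note that splitting the work like this means the whole update is no longer one atomic transaction:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static async Task RunInParallelBatchesAsync<T>(
    IReadOnlyList<T> rows,
    Func<IReadOnlyList<T>, Task> doBatch,
    int batchSize = 1000)
{
    // Split the rows into fixed-size batches.
    var batches = new List<IReadOnlyList<T>>();
    for (int i = 0; i < rows.Count; i += batchSize)
        batches.Add(rows.Skip(i).Take(batchSize).ToList());

    // Start one task per batch and wait for all of them to finish.
    await Task.WhenAll(batches.Select(batch => doBatch(batch)));
}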
I have a database in SQL Server 2012 and want to update a table in it.
My table has three columns; the first column is of type nchar(24). It is filled with billions of rows. The other two columns are of the same type, but they are null (empty) at the moment.
I need to read the data from the first column and do some calculations with that information. The result of my calculations is two strings, and these two strings are the data I want to insert into the two empty columns.
My question is: what is the fastest way to read the information from the first column of the table and update the second and third columns?
Read and update step by step? Read a few rows, do the calculation, update the rows while reading the next few rows?
As this involves billions of rows, performance is the only important thing here.
Let me know if you need any more information!
EDIT 1:
My calculation can't be expressed in SQL.
As the SQL Server is on the local machine, throughput is nothing we have to worry about. One calculation takes about 0.02154 seconds; I have a total of 2,809,475,760 rows, which is about 280 GB of data.
Normally, DML is best performed in bigger batches. Depending on your indexing structure, a small batch size (maybe 1000?!) can already deliver the best results, or you might need bigger batch sizes (up to the point where you write all rows of the table in one statement).
Bulk updates can be performed by bulk-inserting information about the updates you want to make, and then updating all rows in the batch in one statement. Alternative strategies exist.
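A sketch of that approach, assuming the computed values for one batch have already been put into a DataTable named batchTable whose columns are in the same order as the staging table; the temp table and column names are made up for illustration:

using System.Data;
using System.Data.SqlClient;

public static void ApplyBatch(SqlConnection conn, DataTable batchTable)
{
    // conn is assumed to be open already. Stage the new values in a session-scoped temp table...
    using (var create = new SqlCommand(
        "CREATE TABLE #UpdateBatch (Id nchar(24) PRIMARY KEY, Col2 nvarchar(100), Col3 nvarchar(100));", conn))
        create.ExecuteNonQuery();

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#UpdateBatch" })
        bulk.WriteToServer(batchTable);   // maps columns by ordinal by default

    // ...then update every row of the batch in one set-based statement.
    using (var update = new SqlCommand(
        @"UPDATE t SET t.Col2 = b.Col2, t.Col3 = b.Col3
          FROM dbo.BigTable AS t
          JOIN #UpdateBatch AS b ON b.Id = t.Id;
          DROP TABLE #UpdateBatch;", conn))
        update.ExecuteNonQuery();
}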
As you can't hold all the rows to be updated in memory at the same time, you probably need to look into MARS to be able to perform streaming reads while occasionally writing on the same connection. Or you can do it with two connections. Be careful not to deadlock across connections: SQL Server cannot detect that by principle, and only a timeout will resolve such a (distributed) deadlock. Making the reader run under snapshot isolation is a good strategy here; snapshot isolation causes the reader to neither block nor be blocked.
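A minimal sketch of the two-connection variant, assuming snapshot isolation has already been enabled on the database (ALTER DATABASE ... SET ALLOW_SNAPSHOT_ISOLATION ON), that Compute() is a placeholder for the real calculation returning an object with Col2 and Col3, and that the table/column names are made up; Dapper is used just to keep the code short:

using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

public static void StreamAndUpdate(string connectionString)
{
    using (var readConn = new SqlConnection(connectionString))
    using (var writeConn = new SqlConnection(connectionString))
    {
        readConn.Open();
        writeConn.Open();

        // The reader session runs under snapshot isolation, so it neither blocks nor is blocked by the writer.
        readConn.Execute("SET TRANSACTION ISOLATION LEVEL SNAPSHOT;");

        const string updateSql = "UPDATE dbo.BigTable SET Col2 = @Col2, Col3 = @Col3 WHERE Id = @Id;";
        var batch = new List<object>();

        using (var reader = readConn.ExecuteReader("SELECT Id, Col1 FROM dbo.BigTable;"))
        {
            while (reader.Read())
            {
                var result = Compute(reader.GetString(1));   // hypothetical calculation
                batch.Add(new { Id = reader.GetString(0), result.Col2, result.Col3 });

                if (batch.Count == 1000)   // batch size: measure and tune
                {
                    // Dapper executes the UPDATE once per element of the batch, on the second connection.
                    writeConn.Execute(updateSql, batch);
                    batch.Clear();
                }
            }
        }
        if (batch.Count > 0)
            writeConn.Execute(updateSql, batch);
    }
}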
LINQ is pretty efficient in my experience. I wouldn't worry too much about optimizing your code yet; premature optimization is exactly the thing to avoid, so just get it to work first, then refactor as needed. As a side note, I once tested a stored procedure against a LINQ query, and LINQ won (to my amazement).
There is no simple "how" and no one-solution-fits-all here.
With billions of rows, does performance really matter that much? It doesn't seem to me that it has to be done within a second.
What is the expected throughput of the database and network? If you're behind a POTS dial-in link, the situation is massively different from being on 10 Gb fiber.
The computations: how expensive are they? Just c = a + b, or heavy processing of other text files?
Those are just a couple of the questions this raises. There is a lot more involved that we are not aware of, so we can't answer definitively.
Try a couple of things and measure it.
As a general rule: Writing to a database can be improved by batching instead of single updates.
Using an async pattern can free up some of the time for calculations instead of waiting.
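For example, something along these lines (ReadNextBatches, ComputeBatch and WriteBatchAsync are placeholders for the real work) lets the calculation for one batch overlap with the write of the previous one:

// Inside an async method: compute batch N+1 while batch N is still being written.
Task pendingWrite = Task.CompletedTask;
foreach (var batch in ReadNextBatches())
{
    var computed = ComputeBatch(batch);   // CPU work runs while the previous write is in flight
    await pendingWrite;                   // make sure the previous write has finished
    pendingWrite = WriteBatchAsync(computed);
}
await pendingWrite;                       // don't forget the last one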
EDIT in reply to comment
If calculations take 20 ms, the biggest problem is I/O; multithreading won't bring you much.
Read the records in sequence using snapshot isolation so the read isn't hampered by write locks, and update in batches. My guess is that the reader will stay ahead of the writer without much trouble, so reading in batches adds complexity without gaining much.
Find the sweet spot for the right batch size by experimenting.
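One crude way to find it, with Partition and WriteBatch standing in for whatever the real write path ends up being:

foreach (var size in new[] { 100, 500, 1000, 5000, 20000 })
{
    var sample = rows.Take(size * 20).ToList();          // comparable amount of work for each run
    var sw = System.Diagnostics.Stopwatch.StartNew();
    foreach (var batch in Partition(sample, size))       // hypothetical helper that splits into batches
        WriteBatch(batch);                                // hypothetical batched write
    sw.Stop();
    Console.WriteLine("batch size {0}: {1:F0} rows/s", size, sample.Count / sw.Elapsed.TotalSeconds);
}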
I have process A that runs every 5 minutes and needs to write something to the table "EventLog". This works all day long, but at night another process B starts that needs to delete a lot of old data from this table. The table has millions of rows (BLOBs included) and many related tables (deletion by cascade), so process B runs for up to ~45 minutes. While process B is running I get a lot of deadlock warnings for process A, and I want to get rid of these.
The easy option would be "don't run process A while process B is running", but there must be a better approach. I am using Entity Framework 6 and TransactionScope in both processes. I didn't find a way to set a priority or something like that on my processes. Is this possible?
EDIT:
I forgot to say that I am already using one delete transaction per record, not one transaction for all records. Inside the loop I create a new DbContext and TransactionScope, so each record has its own transaction. My problem is that deleting a record still takes some time because of the related BLOBs and data in other related tables (let's say about 5 seconds per row). I still get deadlock situations when the deleting process (B) crosses the inserting process (A).
Transactions don't have a priority. Deadlock victims are chosen by the database, most commonly using measures like the work required to roll them back. One way of avoiding a deadlock is to ensure that you block rather than deadlock, by accessing tables in the same order, and by taking locks at the eventual level (for example, taking an UPDLOCK when reading the data, to avoid two queries getting read locks and then one trying to escalate to a write lock). Ultimately, though, this is a tricky area, and something that takes 45 minutes to complete (please tell me that isn't a single transaction!) is always going to cause problems.
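For example, the read that precedes the write can take the update lock up front (the table and column names here are made up):

select Id, Payload
from dbo.EventLog with (updlock, rowlock)
where CreatedUtc < @cutoff;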
Rework process B so that it doesn't delete everything at once but in smaller batches that never take more than a minute each. Run those in a loop until everything that needs to be deleted is gone.
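A sketch of that loop for process B; the 1,000-row batch size and the CreatedUtc cut-off are assumptions, and each iteration runs as its own short transaction so process A only ever has to wait a moment:

using System;
using System.Data.SqlClient;

public static void PurgeOldEventLogRows(string connectionString, DateTime cutoffUtc)
{
    int deleted;
    do
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "DELETE TOP (1000) FROM dbo.EventLog WHERE CreatedUtc < @cutoff;", conn))
        {
            cmd.Parameters.AddWithValue("@cutoff", cutoffUtc);
            cmd.CommandTimeout = 120;
            conn.Open();
            deleted = cmd.ExecuteNonQuery();   // each batch commits on its own; cascades run per batch
        }
    } while (deleted > 0);   // stop once nothing old is left
}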
I'm working on an online sales web site. I'm using C# 4.0 and SQL Server 2008, and I want to control and prevent users from simultaneously inserting into tables like dbo.orders... How can I do that?
Inserts will not be a problem, but updates can be. The term you need to research is database concurrency. There are four basic models you can implement, each with its own pros and cons. Some are better suited to certain situations, and there are hundreds of articles on the web about the subject.
Are you trying to solve this in C# code or in SQL? Because in SQL it's simple: if you add BEGIN TRAN at the beginning of the stored procedure and COMMIT at the end, it will act as a lock, preventing concurrent executions from C# and effectively serializing the requests. So if there are two inserts, they will be executed one after the other. One thing to remember is that it will be a blocking operation, i.e. the second insert won't start until the first one has finished (regardless of whether it succeeded or not).
In your Add method you can use locking with the lock keyword; this will allow only one thread in at a time.
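For example (note that this only serializes calls within a single process; it won't help once several processes or web servers insert into the same table, and Order and SaveOrder are placeholders):

private static readonly object OrderInsertLock = new object();

public void Add(Order order)
{
    lock (OrderInsertLock)      // only one thread at a time gets past this point
    {
        SaveOrder(order);       // the actual INSERT into dbo.orders
    }
}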
I want to get the community's perspective on this. If I have a process which is heavily DB/IO bound, how smart would it be to parallelize individual process paths using the Task Parallel library?
I'll use an example ... if I have a bunch of items, and I need to do the following operations:
1. Query a DB for a list of items.
2. Do some aggregation operations to group certain items based on a dynamic list of parameters.
3. For each grouped result, query the database for something based on the aggregated result.
4. For each grouped result, do some numeric calculations (3 and 4 would happen sequentially).
5. Do some inserts and updates for the result calculated in #3.
6. Do some inserts and updates for each item returned in #1.
Logically speaking, I can parallelize this into a graph of tasks at steps #3, #5 and #6, as one item has no bearing on the result of the previous one. However, each of these will be waiting on the database (SQL Server), which is fine, and I understand that we can only process as fast as the SQL Server will let us.
But I want to distribute the work on the local machine so that it processes as fast as the database lets us, without having to wait for anything on our end. I've built a mock prototype where I substitute the DB calls with Thread.Sleep (I also tried some variations with SpinWait, which was a million times faster), and the parallel version is far faster than the current implementation, which is completely serial and not parallel at all.
What I'm afraid of is putting too much strain on the SQL Server... are there any considerations I should take into account before I go too far down this path?
If the parallel version is much faster than the serial version, I would not worry about the strain on your SQL server...unless of course the tasks you are performing are low priority compared to some other significant or time-critical operations that are also performed on the DB server.
I don't fully follow your description of the tasks, but it almost sounds as if more of them should be performed directly in the database (I presume there are details that make that impossible?).
Another option would be to create a pipeline so that step 3 for the second group happens at the same time as step 4 for the first group. And if you can overlap the updates at step 5, do that too. That way you're doing concurrent SQL access and processing, but not over-taxing the database, because you only have two concurrent operations going on at once.
So you do steps 1 and 2 sequentially (I presume) to get a collection of groups that require further processing. Then your main thread starts:
for each group
query the database
place the results of the query into the calc queue
A second thread services the calc queue:
while not end of data
Dequeue result from calc queue
Do numeric calculations
place the results of the calculation into the update queue
A third thread services the update queue:
while not end of data
Dequeue result from update queue
Update database
The System.Collections.Concurrent.BlockingCollection<T> is a very effective queue for this kind of thing.
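A compressed version of that pipeline; QueryResult, CalcResult, the three worker methods and the groups collection (produced by steps 1 and 2) are placeholders:

using System.Collections.Concurrent;
using System.Threading.Tasks;

var calcQueue = new BlockingCollection<QueryResult>(boundedCapacity: 100);
var updateQueue = new BlockingCollection<CalcResult>(boundedCapacity: 100);

// Main thread: query the database once per group and feed the calc queue.
var querier = Task.Run(() =>
{
    foreach (var group in groups)
        calcQueue.Add(QueryDatabase(group));
    calcQueue.CompleteAdding();
});

// Second thread: numeric calculations.
var calculator = Task.Run(() =>
{
    foreach (var queryResult in calcQueue.GetConsumingEnumerable())
        updateQueue.Add(Calculate(queryResult));
    updateQueue.CompleteAdding();
});

// Third thread: write the results back to the database.
var updater = Task.Run(() =>
{
    foreach (var calcResult in updateQueue.GetConsumingEnumerable())
        UpdateDatabase(calcResult);
});

Task.WaitAll(querier, calculator, updater);

If you later run more than one calculation task, just make sure CompleteAdding on the update queue is only called after all of them have finished.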
The nice thing here is that you can scale it if you want, by adding multiple calculation threads or query/update threads, if SQL Server can handle more concurrent transactions.
I use something very similar to this in a daily merge/update program, with very good results. That particular process doesn't use SQL server, but rather standard file I/O, but the concepts translate very well.
There is a small system in which a database table is used as a queue on MSSQL 2005. Several applications write to this table, and one application reads from it and processes it in a FIFO manner.
I have to make it a little more advanced, turning it into a distributed system in which several processing applications can run. The result should be that 2-10 processing applications are able to run without interfering with each other while they work.
My idea is to extend the queue table with a column showing that a process is already working on a row. The processing application would first update the table with its identifier and then ask for the updated records.
So something like this:
start transaction
update top(10) queue set processing = 'myid' where processing is null
select * from queue where processing = 'myid'
end transaction
After processing, it sets the processing column of the table to something else, like 'done', or whatever.
I have three questions about this approach.
First: can this work in this form?
Second: if it is working, is it effective? Do you have any other ideas to create such a distribution?
Third: in MSSQL the locking is row based, but after a certain number of rows are locked, the lock is escalated to the whole table, so the second application cannot access the table until the first application releases its transaction. How big can the selection (top x) be so that it only creates row locks and does not lock the whole table?
This will work, but you'll probably find you run into blocking or deadlocks where multiple processes try to read and update the same data. I wrote a procedure to do exactly this for one of our systems, using some interesting locking semantics to ensure that this type of thing runs with no blocking or deadlocks, described here.
This approach looks reasonable to me, and is similar to one I have used in the past - successfully.
Also, the row/table will only be locked while the update and select operations take place, so I doubt the row-versus-table question is really a major consideration.
Unless the processing overhead of your app is so low as to be negligible, I'd keep the "top" value low - perhaps just 1. Of course that entirely depends on the details of your app.
Having said all that, I'm not a DBA, so I will also be interested in any more expert answers.
In regard to your question about locking: you can use a locking hint to force it to lock only rows:
update mytable with (rowlock) set x=y where a=b
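Along the same lines, a common way to hand rows out to multiple consumers is to combine ROWLOCK with READPAST (so each reader skips rows another reader has already locked) and to claim and read the rows in a single statement, which also removes the need for the explicit transaction around the update + select; the column names below follow the question and the exact hints are a suggestion, not the only workable combination:

update top (10) q
set processing = 'myid'
output inserted.*
from queue as q with (rowlock, updlock, readpast)
where q.processing is null;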
The biggest problem with this approach is that you increase the number of updates to the table. Try this with just one process consuming (update + delete) and others inserting data into the table, and you will find that at around a million records it starts to crumble.
I would rather have one consumer for the DB and use message queues to deliver the data to be processed to the other consumers.