My C# application retrieves over a million records from SQL Server, processes them, and then writes the updates back to the database. This results in close to 100,000 update statements, all of the following form:
update Table1 set Col1 = <some number> where Id in (n1, n2, n3....upto n200)
"Id" is an int, primary key with clustered index. No two update statements update the same Ids, so in theory, they can all run in parallel without any locks. Therefore, ideally, I suppose I should run as many as possible in parallel. The expectation is that all finish in no more than 5 minutes.
Now, my question is: what is the most efficient way of doing this? Here is what I have tried:
Running them sequentially one by one - This is the least efficient solution. Takes over an hour.
Running them in parallel by launching each update in its own thread - Again very inefficient because we're creating thousands of threads. I tried it anyway; it took over an hour and quite a few updates failed because of this or that connection issue.
Bulk inserting the IDs into a staging table and then doing a joined update. But then we run into concurrency issues because more than one user is expected to be doing this at the same time.
Using MERGE batches instead of updates - Google says that MERGE is actually slower than individual update statements, so I haven't tried it.
I suppose this must be a very common problem for applications that handle sizeable amounts of data. Are there any standard solutions? Any ideas or suggestions would be appreciated.
I created an integer table type so that I can pass all my IDs to a stored procedure as a list; a single query then updates the whole table.
This is still slow, but it is much quicker than the conventional "where id in (1,2,3)" approach.
Definition for the TYPE:
CREATE TYPE [dbo].[integer_list_tbltype] AS TABLE(
[n] [int] NOT NULL,
PRIMARY KEY CLUSTERED
(
[n] ASC
)WITH (IGNORE_DUP_KEY = OFF)
)
GO
Here is the usage.
declare @intval integer_list_tbltype
declare @colval int=10
update c
set c.Col1=@colval
from @intval i
join Table1 c on c.ID = i.n
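The same type also works as a READONLY parameter on a stored procedure, so each batch becomes a single round trip. A minimal sketch (the procedure name here is just an example):

CREATE PROCEDURE dbo.UpdateCol1ForIdList   -- hypothetical name
    @ids    dbo.integer_list_tbltype READONLY,   -- table-valued parameters must be READONLY
    @colval int
AS
BEGIN
    SET NOCOUNT ON;

    -- one set-based update per batch of ids
    UPDATE c
    SET c.Col1 = @colval
    FROM Table1 c
    JOIN @ids i ON i.n = c.ID;
END

From C#, the IDs go into a DataTable with a single int column and are attached to the SqlCommand as a SqlParameter with SqlDbType.Structured and TypeName = "dbo.integer_list_tbltype".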
Let me know if you have any questions.
Related
I am building a C# application that inserts 2000 records every second using bulk insert.
The database is SQL Server 2008 R2.
The application calls a stored procedure that deletes records older than 2 hours, in chunks, using TOP (10000). This is performed after each insert.
The end user selects records to view in a diagram using date ranges and a selection of 2 to 10 parameter IDs.
Since the application will run 24/7 with no downtime, I am concerned about performance issues.
Partitioning is not an option since the customer doesn't have Enterprise Edition.
Is the clustered index definition good?
Is it necessary to implement any index rebuilds or reorganizations to maintain performance, given that rows are inserted at one end of the table and removed at the other end?
What about updating statistics - is it still an issue in 2008 R2?
I use OPTION (RECOMPILE) to avoid using outdated query plans in the SELECT - is that a good approach?
Are there any table hints that can speed up the SELECT?
Any suggestions around locking strategies?
In addition to the scenario above, I have 3 more tables that work in the same way with different timeframes. One inserts every 20 seconds and deletes rows older than 1 week, another inserts every minute and deletes rows older than six weeks, and the last inserts every 5 minutes and deletes rows older than 3 years.
CREATE TABLE [dbo].[BufferShort](
[DateTime] [datetime2](2) NOT NULL,
[ParameterId] [int] NOT NULL,
[BufferStateId] [smallint] NOT NULL,
[Value] [real] NOT NULL,
CONSTRAINT [PK_BufferShort] PRIMARY KEY CLUSTERED
(
[DateTime] ASC,
[ParameterId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
ALTER PROCEDURE [dbo].[DeleteFromBufferShort]
@DateTime DateTime,
@BufferSizeInHours int
AS
BEGIN
DELETE TOP (10000)
FROM BufferShort
FROM BufferStates
WHERE BufferShort.BufferStateId = BufferStates.BufferStateId
AND BufferShort.[DateTime] < @DateTime
AND (BufferStates.BufferStateType = 'A' OR BufferStates.Deleted = 'True')
RETURN 0
END
ALTER PROCEDURE [dbo].[SelectFromBufferShortWithParameterList]
@DateTimeFrom Datetime2(2),
@DateTimeTo Datetime2(2),
@ParameterList varchar(max)
AS
BEGIN
SET NOCOUNT ON;
-- Split ParameterList into a temporary table
SELECT * INTO #TempTable FROM dbo.splitString(@ParameterList, ',');
SELECT *
FROM BufferShort Datapoints
JOIN Parameters P ON P.ParameterId = Datapoints.ParameterId
JOIN #TempTable TT ON TT.Token = P.ElementReference
WHERE Datapoints.[DateTime] BETWEEN @DateTimeFrom AND @DateTimeTo
ORDER BY [DateTime]
OPTION (RECOMPILE)
RETURN 0
END
This is a classic case of penny wise/pound foolish. You are inserting over 170 million records per day (2,000 per second), yet you are not using Enterprise Edition.
The main reason not to use a clustered index is that the machine cannot keep up with the quantity of rows being inserted; otherwise you should always use one. The decision of whether to use a clustered index is usually argued between those who believe that every table should have a clustered index and those who believe that perhaps one or two percent of tables should not. (I don't have time to engage in a 'religious' debate about this - just research the web.) I always go with a clustered index unless the inserts on a table are failing.
I would not use the STATISTICS_NORECOMPUTE clause; I would only turn automatic statistics recomputation off if inserts are failing. Please see Kimberly Tripp's (an MVP and a real SQL Server expert) article at http://sqlmag.com/blog/statisticsnorecompute-when-would-anyone-want-use-it.
I would also not use OPTION (RECOMPILE) unless you see queries are not using the right indexes (or join types) in the actual query plan. If your query is executed many times per minute/second this can have an unnecessary impact on the performance of your machine.
The clustered index definition seems good as long as all queries specify at least the leading DateTime column. The index will also maximize insert speed (assuming the times are incremental) and keep fragmentation low, so you shouldn't need to rebuild or reorganize often.
If you have only the clustered index on this table, I wouldn't expect you need to update stats frequently because there isn't another data access path. If you have other indexes and complex queries, verify the index is branded ascending with the query below. You may need to update stats frequently if it is not branded ascending and you have complex queries:
DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.BufferShort', 'PK_BufferShort');
DBCC TRACEOFF(2388);
For @ParameterList, consider a table-valued parameter instead. Specify a primary key on Token in the table type.
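A rough sketch of what that could look like; the type name and the Token length are assumptions, so match them to ElementReference:

CREATE TYPE dbo.ParameterTokenList AS TABLE (   -- example name
    Token varchar(50) NOT NULL PRIMARY KEY      -- length assumed; match ElementReference
);
GO
ALTER PROCEDURE [dbo].[SelectFromBufferShortWithParameterList]
    @DateTimeFrom  datetime2(2),
    @DateTimeTo    datetime2(2),
    @ParameterList dbo.ParameterTokenList READONLY
AS
BEGIN
    SET NOCOUNT ON;
    -- no splitString call and no #TempTable needed
    SELECT Datapoints.*
    FROM BufferShort Datapoints
    JOIN Parameters P ON P.ParameterId = Datapoints.ParameterId
    JOIN @ParameterList TT ON TT.Token = P.ElementReference
    WHERE Datapoints.[DateTime] BETWEEN @DateTimeFrom AND @DateTimeTo
    ORDER BY Datapoints.[DateTime];
END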
I would suggest you introduce the RECOMPILE hint only if needed; I suspect you will get a stable plan with a clustered index seek without it.
If you have blocking problems, consider altering the database to specify the READ_COMMITTED_SNAPSHOT option so that row versioning instead of blocking is used for read consistency. Note that this will add 14 bytes of row overhead and use tempdb more heavily, but the concurrency benefits might outweigh the costs.
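Enabling it is a one-time database-level change, along these lines (substitute the real database name; WITH ROLLBACK IMMEDIATE kicks out other sessions, so schedule it accordingly):

ALTER DATABASE YourDatabaseName
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;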
I have a table with around 1 million records; the structure is shown below. The UID column is the primary key and of type uniqueidentifier.
Table_A (contains a million records)
UID Name
-----------------------------------------------------------
E8CDD244-B8E4-4807-B04D-FE6FDB71F995 DummyRecord
I also have a function called fn_Split('Guid_1,Guid_2,Guid_3,....,Guid_n') which accepts a list of comma-separated GUIDs and gives back a table variable containing the GUIDs.
From my application code I am sending a SQL query to get the new GUIDs (keys that exist in the application code but not in the database table):
var sb = new StringBuilder();
sb
.Append(" SELECT NewKey ")
.AppendFormat(" FROM fn_Split ('{0}') ", keyList)
.Append(" EXCEPT ")
.Append("SELECT UID from Table_A");
The first time this command is executed, it times out on quite a few occasions. I am trying to figure out a better approach to avoid such timeouts and/or improve the performance of this.
Thanks.
Firstly, add an index on table_a.uid if there isn't one, but I assume there is.
Some alternate queries to try:
select newkey
from fn_split(blah)
left outer join table_a
on newkey = uid
where uid IS NULL
select newkey
from fn_split(blah)
where newkey not in (select uid
from table_a)
select newkey
from fn_split(blah) f
where not exists(select uid
from table_a a
where f.newkey = a.uid)
There is plenty of info around here as to why you should not use a GUID for your primary key, especially if it is unordered. That would be the first thing to fix. As far as your query goes, you might try what Paul or Tim suggested, but as far as I know EXCEPT and NOT IN will use the same execution plan, though the OUTER JOIN may be more efficient in some cases.
If you're using MS SQL 2008 then you can/should use table-valued parameters. Essentially you'd send in your GUIDs in the form of a DataTable to your stored procedure.
Then inside your stored procedure you can use the parameters as a "table" and do a join or EXCEPT or what have you to get your results.
This method is faster than using a function to split because functions in MS SQL server are really slow.
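A hedged sketch of that approach; the type and procedure names are made up, and the EXCEPT mirrors the original ad hoc query:

CREATE TYPE dbo.guid_list_tbltype AS TABLE (
    NewKey uniqueidentifier NOT NULL PRIMARY KEY
);
GO
CREATE PROCEDURE dbo.GetUnknownKeys   -- example name
    @Keys dbo.guid_list_tbltype READONLY
AS
BEGIN
    SET NOCOUNT ON;
    SELECT NewKey FROM @Keys
    EXCEPT
    SELECT UID FROM Table_A;
END

On the client, the GUIDs go into a DataTable with one Guid column, passed as a SqlParameter with SqlDbType.Structured and TypeName = "dbo.guid_list_tbltype".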
But I guess the time is being taken by the massive disk I/O this query requires. Since you're searching on your UID column and the values are "random", no index is going to help here; the engine will have to resort to a table scan. That means you'll need some serious disk I/O performance to get the results in "good time".
Using the uniqueidentifier data type in an index is not recommended. However, it may not make a difference in your case. But let me ask you this:
The GUIDs that you send in from your app - are they just a random list of GUIDs, or is there some business or entity relationship between them? It's possible that your data model is not right for what you are trying to do. How do you determine which GUIDs you have to search on?
However, for argument's sake, let's assume your GUIDs are just a random selection. Then no index is really being used, since the database engine has to do a table scan to pick out each of the required GUIDs/records from the million records you have. In a situation like this the only way to speed things up is at the physical database level, that is, how your data is physically stored on the hard drives etc.
For example:
Having faster drives will improve performance
If this kind of query is being fired over and over then more memory on the box will help because the engine can cache the data in memory and it won't need to do physical reads
If you partition your table then the engine can parallelize the seek operation and get you results faster.
If your table contains a lot of other fields that you don't always need, then splitting it into two tables (table1 with the GUID and the bare minimum set of fields, table2 with the rest) will speed up the query quite a bit, since the disk I/O demands are lower.
Lots of other things to look at here.
Also note that when you send in ad hoc SQL statements that don't have parameters, the engine has to create a plan each time you execute one. In this case it's not a big deal, but keep in mind that each plan will be cached in memory, pushing out data that might otherwise have been cached.
Lastly, you can always increase the CommandTimeout property in this case to get past the timeout issues.
How much time does it take now, and what kind of improvement are you hoping to get?
If I understand your question correctly, in your client code you have a comma-delimited string of (string) GUIDs. These GUIDs are usable by the client only if they don't already exist in Table_A. Could you invoke a stored procedure which creates a temporary table on the server containing the potentially usable GUIDs, and then do this:
select guid from #myTempTable as temp
where not exists
(
select uid from Table_A where uid = temp.guid
)
You could pass your string of GUIDs to the SP; it would populate the temp table using your function, and then return an ADO.NET DataTable to the client. This should be very easy to test before you even bother to write the SP.
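A sketch of such a procedure, assuming fn_Split exposes the values in a NewKey column as in the question (the procedure name is made up):

CREATE PROCEDURE dbo.GetUsableGuids   -- example name
    @KeyList varchar(max)
AS
BEGIN
    SET NOCOUNT ON;

    -- materialise the split GUIDs once
    SELECT NewKey AS guid
    INTO #myTempTable
    FROM dbo.fn_Split(@KeyList);

    -- return only the GUIDs not already present in Table_A
    SELECT temp.guid
    FROM #myTempTable AS temp
    WHERE NOT EXISTS (SELECT 1 FROM Table_A WHERE UID = temp.guid);
END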
I am questioning what you do with this information.
If you insert the keys into this table afterwards, you could simply try to insert them in the first place - that's much faster and more robust in a multi-user environment than query first, insert later:
create procedure TryToInsert @GUID uniqueidentifier, @Name varchar(n) as
begin try
insert into Table_A (UID,Name)
values (@GUID, @Name);
return 0;
end try
begin catch
return 1;
end catch;
In all cases you can split the KeyList at the client to get faster results - and you could query the keys that are not valid:
select UID
from Table_A
where UID in ('new guid','new guid',...);
If the GUIDs are random, you should use newsequentialid() with your clustered primary key:
create table Table_A (
UID uniqueidentifier default newsequentialid() primary key,
Name varchar(n) not null
);
With this you can insert and query your newly inserted data in one step:
insert into Table_A (Name)
output inserted.*
values (@Name);
... just my two cents
In any case, are not GUIDs intrinsically engineered to be, for all intents and purposes, unique? (i.e. universally unique -- doesn't matter where generated). I wouldn't even bother to do the test beforehand; just insert your row with the GUID PK and if the insert should fail, discard the GUID. But it should not fail, unless these are not truly GUIDs.
http://en.wikipedia.org/wiki/GUID
http://msdn.microsoft.com/en-us/library/ms190215.aspx
It seems you are doing a lot of unnecessary work, but perhaps I don't grasp your application requirement.
I am making an invoicing system with support for multiple subsidiaries, each of which has its own set of invoice numbers. Therefore I have a table with a primary key of (Subsidiary, InvoiceNo).
I cannot use a MySQL auto-increment field, as it would keep incrementing a single counter shared by all subsidiaries.
I don't want to make separate tables for each subsidiary, as new subsidiaries will be added as needed...
I am currently using "Select Max(ID) Where Subsidiary = X" on my table and adding the invoice based on the result.
I am using NHibernate, and the Invoice insert comes before the InvoiceItem insert, so if the Invoice insert fails, the InvoiceItem insert will not be carried out. Instead I will catch the exception, re-retrieve Max(ID) and try again.
What is the problem with this approach? And if any, what is an alternative?
The reason for asking is that I read one of the answers to this question: Nhibernate Criteria: 'select max(id)'
Using this approach to generate primary keys is a very bad idea. My advice is as follows:
Do not give primary keys a business meaning (use synthetic keys);
Use a secondary mechanism for generating the invoice numbers.
This will make your life a lot easier. The mechanism for generating invoice numbers can then e.g. be a table that looks something like:
Subsidiary;
NextInvoiceNumber.
This will separate the internal numbering from how the database works.
With such a mechanism, you will be able to use auto-increment fields again, or even better, GUIDs.
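One possible shape for that numbering table in MySQL, using the LAST_INSERT_ID(expr) trick to reserve a number atomically (table and column names are only suggestions):

CREATE TABLE SubsidiaryInvoiceNumber (
    Subsidiary        INT NOT NULL PRIMARY KEY,
    NextInvoiceNumber INT NOT NULL
) ENGINE=InnoDB;

-- Reserve the next number for subsidiary 42: the UPDATE row-locks the counter
-- and LAST_INSERT_ID() then returns the value this session just reserved.
UPDATE SubsidiaryInvoiceNumber
SET NextInvoiceNumber = LAST_INSERT_ID(NextInvoiceNumber + 1)
WHERE Subsidiary = 42;

SELECT LAST_INSERT_ID();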
Some links with reading material:
http://fabiomaulo.blogspot.com/2008/12/identity-never-ending-story.html
http://nhforge.org/blogs/nhibernate/archive/2009/02/09/nh2-1-0-new-generators.aspx
As you say, the problem with this approach is that multiple sessions might try to insert the same invoice ID. You get a unique constraint violation, have to try again, that might fail as well, and so on.
I solve such problems by locking the subsidiary during the creation of new invoices. However, don't lock the table: (a) if you are using InnoDB, a LOCK TABLES command will by default commit the transaction; (b) there is no reason why invoices for two different subsidiaries shouldn't be added at the same time, as they have independent invoice numbers.
What I would do in your situation is:
Open a transaction and make sure your tables are InnoDB.
Lock the subsidiary with a SELECT ... FOR UPDATE command. This can be done using LockMode.UPGRADE in NHibernate.
Find the max ID using the MAX() function and do the insert.
Commit the transaction
This serializes all invoice inserts for one subsidiary (i.e. only one session can do such an insert at a time; any second attempt will wait until the first has committed or rolled back), but that's what you want. You don't want holes in your invoice numbers (e.g. if you insert invoice ID 3485 and it then fails, there are invoices 3484 and 3486 but no 3485).
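In plain SQL the sequence looks roughly like this (InnoDB, with made-up table and column names); the FOR UPDATE is what LockMode.UPGRADE issues for you:

START TRANSACTION;

-- Step 2: lock the subsidiary row; a second session creating an invoice for
-- subsidiary 7 blocks here until this transaction commits or rolls back.
SELECT *
FROM Subsidiary
WHERE SubsidiaryId = 7
FOR UPDATE;

-- Steps 3 and 4: compute the next number and insert in one statement.
INSERT INTO Invoice (Subsidiary, InvoiceNo, CreatedAt)
SELECT 7, COALESCE(MAX(InvoiceNo), 0) + 1, NOW()
FROM Invoice
WHERE Subsidiary = 7;

COMMIT;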
Using the ADO.NET MySQL Connector, what is a good way to fetch lots of records (1000+) by primary key?
I have a table with just a few small columns, and a VARCHAR(128) primary key. Currently it has about 100k entries, but this will become more in the future.
In the beginning, I thought I would use the SQL IN statement:
SELECT * FROM `table` WHERE `id` IN ('key1', 'key2', [...], 'key1000')
But with this the query could become very long, and I would also have to manually escape quote characters in the keys, etc.
Now I use a MySQL MEMORY table (tempid INT, id VARCHAR(128)) to first upload all the keys with prepared INSERT statements. Then I make a join to select all the existing keys, after which I clean up the mess in the memory table.
Is there a better way to do this?
Note: OK, maybe it's not the best idea to have a string as the primary key, but the question would be the same if the VARCHAR column were just a normal index.
Temporary table: So far it seems the solution is to put the data into a temporary table, and then JOIN, which is basically what I currently do (see above).
I've dealt with a similar situation in a payroll system where the user needed to generate reports based on a selection of employees (e.g. employees X, Y, Z... or employees that work in certain offices). I built a filter window with all the employees and all the attributes that could be used as filter criteria, and had that window save the selected employee IDs in a filter table in the database. I did this because:
Generating SELECT queries with a dynamically generated IN filter is just ugly and highly impractical.
I could join to that filter table in all my queries that needed to use the filter window.
It might not be the best solution out there, but it has served me, and still serves me, very well.
If your primary keys follow some pattern, you can select where key like 'abc%'.
If you want to get out 1000 at a time, in some kind of sequence, you may want to have another int column in your data table with a clustered index. This would do the same job as your current memory table - allow you to select by int range.
What is the nature of the primary key? Is it anything meaningful?
If you're concerned about performance, I definitely wouldn't recommend an IN clause. It's much better to do an INNER JOIN if you can.
You can either first insert all the values into a temporary table and join to that, or do a sub-select. It's best to actually profile the changes and figure out what works best for you.
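A minimal sketch of the temporary-table variant in MySQL; the temporary table is connection-scoped, so concurrent users don't collide (names are made up):

CREATE TEMPORARY TABLE lookup_keys (
    id VARCHAR(128) NOT NULL,
    PRIMARY KEY (id)
);

-- populate with batched, parameterised inserts from the application, e.g.
-- INSERT INTO lookup_keys (id) VALUES (?), (?), ..., (?);

SELECT t.*
FROM lookup_keys k
JOIN `table` t ON t.id = k.id;

DROP TEMPORARY TABLE lookup_keys;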
Why not consider using a table-valued parameter to push the keys in the form of a DataTable and fetch the matching records back?
Or
Simply write a private method that concatenates all the key codes from a provided collection into a single string, and pass that string to the query.
I think it may solve your problem.
I am using SQL Server 2008 Enterprise, with ADO.NET + C# + .NET 3.5 + ASP.NET as the client to access the database. When I access SQL Server 2008 tables, I always invoke stored procedures from my C# + ADO.NET code.
I have 3 operations on table FooTable, and multiple connections will execute them at the same time, in sequence: execute delete, then execute insert, then execute select. Each statement (delete/insert/select) is a separate, individual transaction within the single stored procedure.
My question is whether a deadlock can occur on the delete statement. In particular, is a deadlock possible if multiple connections are operating on the same Param1 value?
BTW: For the statements below, Param1 is a column of table FooTable and a foreign key referencing the clustered primary key column of another table. There is no index on Param1 itself in FooTable. FooTable's clustered primary key is a different column, not Param1.
create PROCEDURE [dbo].[FooProc]
(
@Param1 int
,@Param2 int
,@Param3 int
)
AS
DELETE FooTable WHERE Param1 = @Param1
INSERT INTO FooTable
(
Param1
,Param2
,Param3
)
VALUES
(
@Param1
,@Param2
,@Param3
)
DECLARE @ID bigint
SET @ID = ISNULL(@@Identity,-1)
IF @ID > 0
BEGIN
SELECT IdentityStr FROM FooTable WHERE ID = @ID
END
Here is what the activity monitor table looks like,
ProcessID System Process Login Database Status Opened transaction Command Application Wait Time Wait Type CPU
52 No Foo suspended 0 DELETE .Net SqlClient Data Provider 4882 LCK_M_U 0
53 No George Foo suspended 2 DELETE .Net SqlClient Data Provider 12332 LCK_M_U 0
54 No George Foo suspended 2 DELETE .Net SqlClient Data Provider 6505 LCK_M_U 0
(a lot of rows like the row for process ID 54)
I would add an index on Param1 to FooTable; without it, the DELETE is doing a full table scan, and that will create problems with deadlocks.
EDIT
Based on your activity details, it doesn't look like you have deadlocks; you have blocking: many deletes are queueing up while one delete takes place. Again, indexing Param1 would alleviate this. Without the index, each delete has to do a full table scan to find the records to delete, and while that is happening the other deletes have to wait. With an index on Param1, each delete will process much more quickly and you won't see the blocking you do now.
If you had deadlocks, the system would kill one of the involved processes (otherwise nothing would ever complete); with blocking, things still process, just very slowly if the table is large.
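The fix itself is a one-liner (the index name is just a suggestion):

CREATE NONCLUSTERED INDEX IX_FooTable_Param1
ON dbo.FooTable (Param1);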
I do not think you would get a deadlock (this is not my field of expertise), but an explicit transaction would probably be a better choice here. A scenario that comes to mind with this code is the following:
Two concurrent calls to the procedure execute with a Param1 value of 5; both delete and then both insert, so now you have two records with Param1 = 5. Depending on your data consistency requirements, this might or might not be a concern for you.
An alternative might be to perform an UPDATE and, if no rows are affected (check @@ROWCOUNT), do an INSERT, all in a transaction of course. Or better yet, take a look at MERGE to perform the insert/update operation in a single statement.
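A hedged sketch of that update-then-insert shape for FooProc (it assumes at most one row per Param1, which the existing delete-then-insert pattern already implies):

BEGIN TRANSACTION;

UPDATE FooTable
SET Param2 = @Param2,
    Param3 = @Param3
WHERE Param1 = @Param1;

IF @@ROWCOUNT = 0
    INSERT INTO FooTable (Param1, Param2, Param3)
    VALUES (@Param1, @Param2, @Param3);

COMMIT TRANSACTION;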