We have built a C# .NET system that creates data warehouses. It takes a set of selected databases and runs a script against them to produce a single combined database/warehouse.
Now I have three databases to be compiled into a single database, and I am copying two tables from each (table [XI] and table [XII], which have a one-to-many relationship but no constraints set up at the time of the copy/INSERT INTO). The script timings and the relevant table sizes are below:
The executed script consists of 30 SQL queries.
DatabaseA:
Table [XI]: 29,026 rows (size 20,128 KB).
Table [XII]: 531,958 rows (size 50,168 KB).
Time taken for entire script: 1.51 s.
DatabaseB:
Table [XI]: 117,877 rows (size 17,000 KB).
Table [XII]: 4,000,443 rows (size 512,824 KB).
Time taken for entire script: 2.04 s.
These both run fine and fast. The next database is almost exactly the same size as the first, yet takes roughly 30x as long!
DatabaseC:
Table [XI]: 29,543 rows (size 20,880 KB).
Table [XII]: 538,302 rows (size 68,000 KB).
Time taken for entire script: 44.38 s.
I cannot work out why this is taking so long. I have used SQL Server Profiler and Performance Monitor, but I cannot pin down the reason for this massive change in performance.
The query being used to do the update is dynamic and is shown at the bottom of this question; it is large due to the explicit reference to the required columns. My question is: what could be causing this inordinate increase in execution time?
Any clues would be greatly appreciated.
SQL:
DECLARE @DbName NVARCHAR(128);
SET @DbName = (SELECT TOP 1 [DbName]
               FROM [IPACostAdmin]..[TmpSpecialOptions]);

DECLARE @FilterSql NVARCHAR(MAX);
SET @FilterSql = (SELECT TOP 1 [AdditionalSQL]
                  FROM [IPACostAdmin]..[TmpSpecialOptions]);

DECLARE @SQL NVARCHAR(MAX);
DECLARE @SQL1 NVARCHAR(MAX);
DECLARE @SQL2 NVARCHAR(MAX);

SET @SQL1 =
'INSERT INTO [' + @DbName + ']..[Episode]
([Fields1], ..., [FieldN])';

SET @SQL2 =
'SELECT
[Fields1], ..., [FieldN]
FROM [B1A] ' + @FilterSql + ';';

SET @SQL = @SQL1 + @SQL2;
EXEC(@SQL);
GO
Note: I am splitting the dynamic SQL into @SQL1 and @SQL2 for clarity. Also note that I have not shown all columns, as the full list would take space and be largely redundant.
Edit1.
1. The databases are on the same server.
2. The database files, including logs, are in the same directory on the same drive.
3. There are no primary/foreign keys or constraints set up on the source databases (DatabaseA/B/C) or the data warehouse database at the time of this INSERT INTO.
Edit2. I have run the above query in Management Studio and it took 5 s!?
Edit3. I have added a temporary CLUSTERED INDEX in the hope that this would assist the query; it has not helped either.
Some information would be good to know:
1: Are the databases on the same server?
2: Are the data file and the log file on the same drive for both A and C?
(I once had a problem with two databases where one was on an SSD and the other on an HDD; reading the data was the bottleneck.)
3: What do the DB statistics say about fragmentation? (The tables have no constraints, but are any indexes defined?)
This was caused by a DELETE query being run before a preceding CREATE CLUSTERED INDEX query had time to update the entire table. The solution was to wrap the statements in BEGIN TRANSACTION and COMMIT. This forces SQL Server to finish the indexing before attempting any other operations.
Note that this problem is only likely to arise when a CREATE CLUSTERED INDEX query is followed by a dynamic SQL statement that modifies the existing table.
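As a sketch of the shape of that fix, as issued from the application (the index name, key column, and DELETE filter here are placeholders, not from the original script; only [B1A] is from the question):

// Sketch only: the CREATE CLUSTERED INDEX and the dependent DELETE are
// sent as one explicit transaction, so the DELETE cannot run against a
// half-built index. Index name, key column, and filter are placeholders.
const string sql = @"
BEGIN TRANSACTION;
CREATE CLUSTERED INDEX [IX_B1A_Key] ON [B1A] ([KeyColumn]);
DELETE FROM [B1A] WHERE [KeyColumn] IS NULL;
COMMIT TRANSACTION;";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(sql, connection))
{
    connection.Open();
    command.ExecuteNonQuery();
}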
I hope this helps someone else.
Related
All, I have a dynamic SQL query that I am executing from a C# application. The problem query is an INSERT statement, run from within a C# loop and executed sequentially against many databases to create a single data warehouse [database]. I have run this code on 100+ databases in a single batch without problem; however, I have just come across one specific database where the query
DECLARE @DbName NVARCHAR(128);
SET @DbName = (SELECT TOP 1 [DbName]
               FROM [IPACostAdmin]..[TmpSpecialOptions]);

DECLARE @FilterSql NVARCHAR(MAX);
SET @FilterSql = (SELECT TOP 1 [AdditionalSQL]
                  FROM [IPACostAdmin]..[TmpSpecialOptions]);

DECLARE @SQL NVARCHAR(MAX);
DECLARE @SQL1 NVARCHAR(MAX);
DECLARE @SQL2 NVARCHAR(MAX);

SET @SQL1 =
'INSERT INTO [' + @DbName + ']..[Episode] WITH(TABLOCK)
([EstabID],..., [InclFlag]) ';

SET @SQL2 =
'SELECT
[EstabID],..., [InclFlag]
FROM [B1A] ' + @FilterSql + ';';

SET @SQL = @SQL1 + @SQL2;
EXEC sp_executesql @SQL;
goes from taking roughly three seconds for an insert of 20,000-30,000 records to 40+ minutes! After long deliberation and experimentation, I have worked out the fix; it is to use
EXEC sp_executesql @SQL WITH RECOMPILE;
This brings it back down to < 2s for the insert.
This SQL is executed from the application once for each database in the batch. The current execution of this statement should be totally separate from the preceding ones as far as the server is concerned (as I understand it), but it is not; it seems SQL Server is caching a plan for the dynamic SQL in this case.
I would like to know what is happening here for this single site? Where will I need to ensure I use the RECOMPILE option in future to prevent such issues?
Thanks for your time.
Note: I appreciate that this recompiles the query, but I am baffled as to why the server is reusing the same execution plan in the first place. Each time this query is run it is against a different database, with a different Initial Catalog, over a different SqlConnection.
With RECOMPILE, SQL Server generates a new execution plan each time and executes it. Otherwise it tries to reuse an existing execution plan stored in the procedure cache, which may be wrong for the current query, since with dynamic SQL the conditions and parameters change on each execution.
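For completeness, a hedged sketch of the statement-level alternative: bake an OPTION (RECOMPILE) hint into the dynamic statement itself (before its closing semicolon), so only that statement gets a fresh plan. Here dbName, filterSql, and connection stand in for values built exactly as in the question, and [EstabID]/[InclFlag] stand in for the full elided column list:

// Sketch: statement-level recompile hint inside the dynamic SQL.
// [EstabID] and [InclFlag] stand in for the full elided column list.
string sql =
    "INSERT INTO [" + dbName + "]..[Episode] WITH (TABLOCK) ([EstabID], [InclFlag]) " +
    "SELECT [EstabID], [InclFlag] FROM [B1A] " + filterSql +
    " OPTION (RECOMPILE);";

using (var command = new SqlCommand(sql, connection))
{
    command.ExecuteNonQuery();
}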
Since yesterday I have been facing a problem: when I call a stored proc from C#, it takes more than 5 minutes, but when I execute it directly from SSMS (on the server machine) it takes less than 30 seconds.
I have searched the forums and went through this great article http://www.sommarskog.se/query-plan-mysteries.html, but found no solution.
The script in my proc retrieves 10 columns, among them a column called "article" of type nvarchar(max).
When I remove the article column from my SELECT, the proc executes quickly.
To narrow it down, I created a new stored proc retrieving just the primary key column and the nvarchar(max) column.
It reproduces the same behaviour. Here is the new proc, MyNewProc (takes >5 min when called from C# and under a second from SSMS on the server):
CREATE PROCEDURE Student.GetStudents
AS
BEGIN
SET NOCOUNT ON
-----------------
SELECT StudentId,Article
FROM Students
WHERE Degree=1
END
MyNewProc returns just 2500 rows.
Is that normal? How can I improve it?
SELECT SUM(DATALENGTH(Article)) FROM Students WHERE Degree=1
The result is 13,885,838 (about 13 MB of article data).
You're probably transferring a lot of data over the network. That takes time.
Instead of returning article, try returning LEFT(article, 50) to see whether it's an issue with the volume of data or not.
One thing to note is that SSMS will begin populating the results immediately while a C# application probably will not.
In SSMS, go to the following: Tools -> Options
Then go to Query Execution -> SQL Server -> Advanced
From here, look at which check boxes are checked. If something is checked there, SSMS sets it automatically when you execute a sproc from inside it, but when you execute it from C# (or whatever client you're using) it won't be set.
I had this same issue and found out that I needed to include the following line at the top of my sproc and it worked perfectly:
SET ARITHABORT ON;
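If you cannot edit the sproc, a client-side sketch of the same idea is to set the option on the connection before the call, so the C# session compiles its plan under the same settings SSMS uses (SSMS turns ARITHABORT on by default, which is why the two plans can differ). Student.GetStudents is the proc from the question; connectionString is assumed:

// Sketch: align the session's SET options with SSMS before calling the proc.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    using (var setOptions = new SqlCommand("SET ARITHABORT ON;", connection))
    {
        setOptions.ExecuteNonQuery();
    }

    using (var command = new SqlCommand("Student.GetStudents", connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read()) { /* consume rows */ }
        }
    }
}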
I have been using stored procedures for more than 1.5 years, but I've never considered how their results are retrieved by the UI or consumed within another stored procedure.
Take a simple stored procedure, e.g.:
CREATE PROCEDURE sp_test
AS
BEGIN
SELECT * FROM tblTest --Considering table has 3 columns.
END
How does C# get this result into a DataTable?
Whenever I have to use the result of this procedure in another procedure, I think I have to create a table-valued parameter using a table type and assign the result to a table variable. I've never tried it.
CREATE PROCEDURE sp_testcall
AS
BEGIN
#temp = exec sp_test -- I think this would be the way, never tried
END
If the above sample code is true, then what is the difference between using the above method and a query to insert records into a temporary table?
CREATE PROCEDURE sp_test
AS
BEGIN
SELECT * INTO #tmp FROM tblTest --Considering table has 3 columns.
END
It would seem that copying the result into a temporary table requires extra work by SQL Server.
But what goes on behind the scenes? Does it directly hand references to the result to a table-valued parameter, or does it use the same mechanism as a temporary table?
My question might not be clear, but I will try to improve it.
At a beginner-to-intermediate level you should consider #temp tables and @table variables two faces of the same coin. While there are some differences between them, for all practical purposes they cost the same and behave nearly identically. The sole major difference is that @table variables are not transacted and hence not affected by rollbacks.
If you drill down into the details, #temp tables are slightly more expensive to process (since they are transacted), but on the other hand @table variables live only as long as the variable's scope.
As to other issues raised by your question:
table-valued parameters are always read-only; you cannot modify them (INSERT/UPDATE/DELETE into them)
capturing the result set of a procedure into a table (a real table, a #temp table, or a @table variable, it doesn't matter) can only be done with INSERT INTO <table> EXEC sp_test
as a rule of thumb, a procedure that produces a result needed by another procedure is likely better off as a user-defined function
The topic of sharing data between procedures was analyzed at length by Erland Sommarskog, see How to Share Data Between Stored Procedures.
A select means "return data to client". C# is a client, therefore it gets the data.
Then again, it's not exactly C# that does it, it's ADO.NET. There's a data provider that knows how to use a network/memory/some other protocol to talk to the SQL server and read data streams it generates. This particular client (ADO.NET) uses the received data to construct certain classes, such as DataTable, other providers can do something completely different.
All that is irrelevant at SQL Server level, because as far as the server is concerned, the data has been sent out using the protocol with which the connection was established, that's it.
From inside SQL Server, a stored procedure that simply SELECTs data doesn't make much sense as a data source for anything else.
When you need that, you have the means to tell SQL Server explicitly what you want: insert the data into a temporary table visible to both procedures, insert it into a table-valued parameter passed to the procedure, or rewrite the stored procedure as a function that returns a table.
Then again, it's not exactly clear to me what you were asking about.
I have a list of about 4 million objects. There is a stored proc that takes an object's attributes as params, does some lookups, and inserts it into a table.
What's the most efficient way to insert these 4 million objects into the DB?
Here is how I do it now:
// connect to SQL Server
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    foreach (var item in listOfObjects)
    {
        var sc = new SqlCommand("...", connection); // the lookup/insert proc
        sc.CommandType = CommandType.StoredProcedure;
        // assign params from the item's attributes
        sc.ExecuteNonQuery();
    }
}
This has been really slow.
Is there a better way to do this?
This process will be a scheduled task. I will run it every hour, so I do expect high data volumes like this.
Take a look at the SqlBulkCopy Class
Based on your comment: dump the data into a staging table, then do the lookup and the insert into the real table set-based from a proc; it will be much faster than row-by-row.
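A minimal sketch of that pattern, assuming a staging table dbo.Staging whose columns match the DataTable, and a placeholder proc dbo.ProcessStaging that does the lookups and the final set-based insert:

// Sketch: one bulk load into staging, then one set-based proc call.
// dbo.Staging and dbo.ProcessStaging are placeholder names.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.Staging";
        bulkCopy.BatchSize = 10000;
        bulkCopy.WriteToServer(dataTable); // DataTable built from the 4M objects
    }

    using (var command = new SqlCommand("dbo.ProcessStaging", connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        command.CommandTimeout = 0; // the set-based step may run long
        command.ExecuteNonQuery();
    }
}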
It's never going to be ideal to insert four million records from C#, but a better way to do it is to build the command text up in code so you can do it in chunks.
This is hardly bulletproof, and it doesn't illustrate how to incorporate lookups (as you've mentioned you need), but the basic idea is:
// You'd modify this to chunk it out - only testing can tell you the right
// number - perhaps 100 at a time.
for (int i = 0; i < items.Length; i++)
{
    // e.g., 'insert dbo.Customer values(@firstName1, @lastName1);'
    string newStatement = string.Format(
        "insert dbo.Customer values(@firstName{0}, @lastName{0});", i);
    command.CommandText += newStatement;
    command.Parameters.AddWithValue("@firstName" + i, items[i].FirstName);
    command.Parameters.AddWithValue("@lastName" + i, items[i].LastName);
}
// ...
command.ExecuteNonQuery();
I have had excellent results using XML to get large amounts of data into SQL Server. Like you, I initially inserted rows one at a time, which took forever due to the round-trip time between the application and the server; then I switched the logic to pass in an XML string containing all the rows to insert. Time to insert went from 30 minutes to less than 5 seconds. This was for a couple of thousand rows. I have tested with XML strings up to 20 megabytes in size without issues. Depending on your row size this might be an option.
The data was passed in as an XML string using the ntext type.
Something like this formed the basic details of the stored procedure that did the work:
CREATE PROCEDURE XMLInsertPr( @XmlString ntext )
AS
BEGIN
    DECLARE @ReturnStatus int, @hdoc int

    EXEC @ReturnStatus = sp_xml_preparedocument @hdoc OUTPUT, @XmlString
    IF (@ReturnStatus <> 0)
    BEGIN
        RAISERROR ('Unable to open XML document', 16, 1, 50003)
        RETURN @ReturnStatus
    END

    INSERT INTO TableName
    SELECT * FROM OPENXML(@hdoc, '/XMLData/Data') WITH TableName

    EXEC sp_xml_removedocument @hdoc -- release the parsed document's memory
END
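The application side of the call might look like this sketch; the element names follow the OPENXML path above, while the attribute names (Col1, Col2) and the rows collection are placeholders for your real columns and objects:

// Sketch: all rows travel in one XML string, one round trip.
// NB: real attribute values must be XML-escaped.
var xml = new StringBuilder("<XMLData>");
foreach (var row in rows) // placeholder collection of row objects
{
    xml.AppendFormat("<Data Col1=\"{0}\" Col2=\"{1}\" />", row.Value1, row.Value2);
}
xml.Append("</XMLData>");

using (var command = new SqlCommand("XMLInsertPr", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.Add("@XmlString", SqlDbType.NText).Value = xml.ToString();
    command.ExecuteNonQuery();
}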
You might consider dropping any indexes you have on the table(s) you are inserting into and then recreating them after you have inserted everything. I'm not sure how the bulk copy class works but if you are updating your indexes on every insert it can slow things down quite a bit.
Like Abe mentioned: drop indexes (and recreate them later).
If you trust your data: generate a SQL statement for each call to the stored proc, combine several into one batch, and then execute them together.
This saves you communication overhead.
The combined calls (to the stored proc) can be wrapped in a BEGIN TRANSACTION so you have only one commit per x inserts.
If this is a one-time operation: don't optimize, and run it during the night/weekend.
I have a table with a very simple schema: an ID column as the unique primary key (uniqueidentifier type) and some other nvarchar columns. My current goal: for 5000 inputs, I need to work out which ones are already contained in the table and which are not. The inputs are strings, and I have a C# function which converts a string into a uniqueidentifier (GUID). My logic is: if an ID already exists, then I treat the string as already contained in the table.
My question is, if I need to find out what ones from the 5000 input strings are already contained in DB, and what are not, what is the most efficient way?
BTW: my current implementation converts each string to a GUID in C# code, then invokes a stored procedure which queries whether that ID exists in the database and returns the answer to the C# code.
My working environment: VSTS 2008 + SQL Server 2008 + C# 3.5.
My first instinct would be to pump your 5000 inputs into a single-column temporary table X, possibly index it, and then use:
SELECT X.thecol
FROM X
JOIN ExistingTable E ON E.thecol = X.thecol
to get the ones that are present, and (if both sets are needed)
SELECT X.thecol
FROM X
LEFT JOIN ExistingTable E ON E.thecol = X.thecol
WHERE E.thecol IS NULL
to get the ones that are absent. Worth benchmarking, at least.
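Getting the 5000 values into X in one round trip is the only fiddly part. One sketch uses SqlBulkCopy against a session-scoped temp table (the table must be created and queried on the same open connection); inputGuids and connectionString are assumptions:

// Sketch: create a keyed temp table, bulk-load the 5000 GUIDs, then run
// the presence/absence joins above on this same connection.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    using (var create = new SqlCommand(
        "CREATE TABLE #X (thecol UNIQUEIDENTIFIER PRIMARY KEY);", connection))
    {
        create.ExecuteNonQuery();
    }

    var table = new DataTable();
    table.Columns.Add("thecol", typeof(Guid));
    foreach (Guid id in inputGuids) // the 5000 converted inputs
    {
        table.Rows.Add(id);
    }

    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "#X";
        bulkCopy.WriteToServer(table);
    }

    // ...run the two SELECTs above here, on the same connection...
}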
Edit: as requested, here are some good docs & tutorials on temp tables in SQL Server. Bill Graziano has a simple intro covering temp tables, table variables, and global temp tables. Randy Dyess and SQL Master discuss performance issues for and against them (but remember that if you're getting performance problems you do want to benchmark alternatives, not just go on theoretical considerations!-).
MSDN has articles on tempdb (where temp tables are kept) and optimizing its performance.
Step 1. Make sure you have a problem to solve. In many contexts, five thousand rows isn't a lot to insert one at a time.
Are you certain the simplest possible approach isn't sufficient? What performance issues have you measured so far?
What do you need to do with the entries that do or don't exist in your table?
Depending on what you need, maybe the new MERGE statement in SQL Server 2008 fits the bill: update what's already there, insert the new stuff, all wrapped neatly into a single SQL statement. Check it out!
http://blogs.conchango.com/davidportas/archive/2007/11/14/SQL-Server-2008-MERGE.aspx
http://www.sql-server-performance.com/articles/dba/SQL_Server_2008_MERGE_Statement_p1.aspx
http://blogs.msdn.com/brunoterkaly/archive/2008/11/12/sql-server-2008-merge-capability.aspx
Your statement would look something like this:
MERGE INTO
(your target table) AS t
USING
(your source table, e.g. a temporary table) AS s
ON t.ID = s.ID
WHEN NOT MATCHED THEN -- row does not exist in the target table
    ... (do whatever you need to do)
WHEN MATCHED THEN -- row exists in the target table
    ... (do whatever else you need to do)
;
To make this really fast, I would load the "new" records from e.g. a TXT or CSV file into a temporary table in SQL Server using BULK INSERT:
BULK INSERT YourTemporaryTable
FROM 'c:\temp\yourimportfile.csv'
WITH
(
FIELDTERMINATOR =',',
ROWTERMINATOR =' |\n'
)
BULK INSERT combined with MERGE should give you the best performance you can get on this planet :-)
Marc
PS: here's a note from TechNet on MERGE performance and why it's faster than individual statements:
In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement.
By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.
Try to ensure you end up running only one query - i.e. if your solution consists of running 5000 queries against the database, that'll probably be the biggest consumer of resources for the operation.
If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database.
If you want simplicity, since 5000 records is not very many, then from C# just use a loop to generate an insert statement for each of the strings you want to add to the table. Wrap each insert in a TRY CATCH block. Send them all up to the server in one shot, like this:
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
If you have a unique index or primary key defined on your string GUID, the duplicate inserts will simply fail. Checking ahead of time whether the record exists just duplicates work that SQL Server is going to do anyway.
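A sketch of the C# side that builds such a one-shot batch (parameterized to avoid quoting issues; only theCol is shown, the other columns are elided as in the answer, and guids/connection are assumptions):

// Sketch: one batch of TRY/CATCH-wrapped inserts, one round trip;
// duplicate keys fail silently inside their own TRY block.
var sql = new StringBuilder();
var command = new SqlCommand { Connection = connection };

for (int i = 0; i < guids.Count; i++)
{
    sql.AppendFormat(
        "BEGIN TRY INSERT INTO [table] (theCol) VALUES (@g{0}); END TRY " +
        "BEGIN CATCH END CATCH;\n", i);
    command.Parameters.AddWithValue("@g" + i, guids[i]);
}

command.CommandText = sql.ToString();
command.ExecuteNonQuery();

Note that SQL Server caps a single command at 2,100 parameters, so 5,000 rows would need two or three such batches.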
If performance is really important, then consider downloading the GUIDs to your local station and doing all the analysis locally. Reading 5000 GUIDs should take much less than 1 second. This is simpler than bulk-importing into a temp table (which is the only way you will get performance from a temp table) and doing an update using a join to the temp table.
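A sketch of that local comparison, reusing the thecol/ExistingTable names from the earlier answer (inputGuids and connection are assumptions):

// Sketch: pull every existing GUID once, classify the 5000 inputs locally.
var existing = new HashSet<Guid>();
using (var command = new SqlCommand("SELECT thecol FROM ExistingTable;", connection))
using (var reader = command.ExecuteReader())
{
    while (reader.Read())
    {
        existing.Add(reader.GetGuid(0));
    }
}

List<Guid> present = inputGuids.Where(g => existing.Contains(g)).ToList();
List<Guid> absent  = inputGuids.Where(g => !existing.Contains(g)).ToList();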
Since you are using SQL Server 2008, you could use table-valued parameters. They are a way to provide a table as a parameter to a stored procedure.
Using ADO.NET you could easily pre-populate a DataTable and pass it as a SqlParameter.
Steps you need to perform:
Create a custom Sql Type
CREATE TYPE MyType AS TABLE
(
    UniqueId INT NOT NULL,
    [Column] NVARCHAR(255) NOT NULL
)
Create a stored procedure which accepts the Type
CREATE PROCEDURE spInsertMyType
    @Data MyType READONLY
AS
xxxx
Call using C#
SqlCommand insertCommand = new SqlCommand(
    "spInsertMyType", connection);
insertCommand.CommandType = CommandType.StoredProcedure;

SqlParameter tvpParam =
    insertCommand.Parameters.AddWithValue(
        "@Data", dataTable);
tvpParam.SqlDbType = SqlDbType.Structured;
Links: Table-valued Parameters in Sql 2008
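Populating the dataTable passed above is then just a matter of matching MyType's columns; a sketch (items, item.Id, and item.Name are placeholders for your real collection and values):

// Sketch: a DataTable whose columns mirror MyType; this is the value
// passed as the @Data parameter in the snippet above.
var dataTable = new DataTable();
dataTable.Columns.Add("UniqueId", typeof(int));
dataTable.Columns.Add("Column", typeof(string));

foreach (var item in items) // the objects to insert
{
    dataTable.Rows.Add(item.Id, item.Name);
}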
Definitely do not do it one-by-one.
My preferred solution is to create a stored procedure with one parameter that takes XML in the following format:
<ROOT>
    <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000" />
    <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000001" />
    ....
</ROOT>
Then in the procedure, with an argument of type NVARCHAR(MAX), you convert it to XML, after which you use it as a table with a single column (let's call it @FilterTable). The stored procedure looks like:
CREATE PROCEDURE dbo.sp_MultipleParams(@FilterXML NVARCHAR(MAX))
AS BEGIN
    SET NOCOUNT ON

    DECLARE @x XML
    SELECT @x = CONVERT(XML, @FilterXML)

    -- temporary table (must have it, because cannot join on XML statement)
    DECLARE @FilterTable TABLE (
        "ID" UNIQUEIDENTIFIER
    )

    -- insert into temporary table
    -- important: XML iS CaSe-SenSiTiVe
    INSERT @FilterTable
    SELECT x.value('@ID', 'UNIQUEIDENTIFIER')
    FROM @x.nodes('/ROOT/MyObject') AS R(x)

    SELECT o.ID,
           SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END)) AS FoundInDB
    FROM @FilterTable o
    LEFT JOIN dbo.MyTable t
        ON o.ID = t.ID
    GROUP BY o.ID
END
GO
You run it as:
EXEC sp_MultipleParams '<ROOT><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000002"/></ROOT>'
And your results look like:
ID FoundInDB
------------------------------------ -----------
60EAD98F-8A6C-4C22-AF75-000000000000 1
60EAD98F-8A6C-4C22-AF75-000000000002 0