I've been tasked with creating a data synchronization process between a CSV file generated by another vendor and upwards of 300 separate-but-structurally-identical CRM databases. All CRM databases are defined in the same SQL Server instance. Here are the specifics:
The source data will be a CSV which contains a list of all email addresses where clients have opted-in to marketing communications. This CSV file will be sent in its entirety every night, but will contain record-level date/time stamps which will allow me to select only those records which have been modified since the last processing cycle. The CSV file will potentially have many hundreds of thousands of rows, though the expected changes on a daily basis will be substantially lower than that.
I'll be selecting data from the CSV and will be converting each row into a custom List<T> object.
Once the CSV is queried and the data has been transformed, I will need to compare the contents of this List<T> against the CRM databases. This is due to the fact that any given email address contained in the CSV file may:
Not exist in any of the 300 databases
Exist in one of the 300 databases
Exist in multiple databases
In any case where there is a match between an email address in the master CSV list and any CRM database, the matching CRM record will be updated with the values contained in the CSV file.
At a high, very generic level, I was thinking that I would have to do something like this:
foreach (string dbName in masterDatabaseList)
{
    //open db connection
    foreach (string emailAddress in masterEmailList)
    {
        //some helper method that would execute a SQL statement like
        //"IF EXISTS ... WHERE EMAIL_ADDRESS = <emailAddress>" return true;
        bool matchFound = EmailExistsInDb(emailAddress);
        if (matchFound)
        {
            //the current email from the master list does exist in this database
            //do necessary updates and stuff
        }
    }
}
Is this the most efficient approach? I'm not too keen on having to hit 300 databases potentially thousands of times to see if each and every email in the master CSV list exists. Ideally, I'd like to generate a SQL statement along the lines of:
"SELECT * FROM EMAIL_TABLE WHERE EMAIL_ADDRESS IN(email1,email2, email3,...)"
This would allow for a single query to be executed against the database, but I don't know whether this approach would be any better / more efficient, especially because I would have to dynamically generate the SQL and could potentially open it up to injection.
What is the best practice in this scenario? Because I have 300 databases that need to be compared each time, I'm looking for an approach that will yield the best results with the least amount of processing time. In my production code, I will be implementing a multi-threaded approach so that multiple databases can be processed simultaneously, so any approach would need to be thread-safe.
You seem to have the basic idea right. Hitting the database once for every line in the CSV is going to be way too slow. You can create a "where in" statement via LINQ like so:
var addresses = GetEmailAddresses();
var entries = ctx.Entries.Where(e => addresses.Contains(e.EmailAddress));
However, if you have too many addresses in your list, it'll take a long, long time to generate and evaluate your query. I'd recommend splitting your input list up into batches of a reasonable size (200 entries?), and then using the trick above to handle each batch with a single database check.
Once you've got that working, you can try a few other things to see if they make a measurable difference performance-wise:
Tweak the batch size.
Run the batches independently with varying degrees of parallelism.
Play with indexes on the database tables, particularly on the email address field.
Order the email addresses before breaking them into batches. It's possible that the db queries will take better advantage of hard disk caching strategies.
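The batching idea above can be sketched without any database access. This is only an illustration: the class and method names (InClauseBatcher, Chunk, BuildInQuery) are made up here, and the table and column names are taken from the question. Note that SQL Server accepts at most 2100 parameters per command, so the batch size has to stay below that.

```csharp
using System;
using System.Collections.Generic;

public static class InClauseBatcher
{
    // Splits the input into consecutive batches of at most batchSize items.
    public static List<List<string>> Chunk(IList<string> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        var batches = new List<List<string>>();
        for (int start = 0; start < items.Count; start += batchSize)
        {
            int count = Math.Min(batchSize, items.Count - start);
            var batch = new List<string>(count);
            for (int i = 0; i < count; i++) batch.Add(items[start + i]);
            batches.Add(batch);
        }
        return batches;
    }

    // Builds the SQL text with one placeholder per value: IN (@p0, @p1, ...).
    // The caller binds batch[i] to the parameter named "@p" + i, so no
    // email address is ever concatenated into the SQL string itself.
    public static string BuildInQuery(int valueCount)
    {
        if (valueCount <= 0) throw new ArgumentOutOfRangeException("valueCount");
        var names = new string[valueCount];
        for (int i = 0; i < valueCount; i++) names[i] = "@p" + i;
        return "SELECT * FROM EMAIL_TABLE WHERE EMAIL_ADDRESS IN ("
            + string.Join(", ", names) + ")";
    }
}
```

Because the values travel as parameters, this keeps the single-query benefit of the IN list without the injection risk the question worries about.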
You could put the contents of your CSV list objects into a table-valued parameter (TVP), then call a stored procedure, passing in that TVP. The stored procedure could then run a cursor through the 300 databases and join each one to your table-valued parameter (using ad-hoc SQL). It will basically be a loop that iterates 300 times, which isn't too bad.
Something like this:
CREATE PROCEDURE yourNewProcedure
(
    @TableValueParameter dbo.udtTVP READONLY
)
AS
DECLARE @dbName varchar(255)
DECLARE @SQL nvarchar(3000)
DECLARE DB_Cursor CURSOR LOCAL FOR
    SELECT DISTINCT name
    FROM sys.databases
    WHERE name LIKE '%yourdbs%'
OPEN DB_Cursor
FETCH NEXT FROM DB_Cursor INTO @dbName
WHILE @@FETCH_STATUS = 0
BEGIN
    -- The TVP is read-only, so the UPDATE has to target the CRM table (t2)
    SET @SQL = 'UPDATE t2
        SET t2.Field = t.Field
        FROM @TableValueParameter t
        JOIN ' + QUOTENAME(@dbName) + '..TableYouCareAbout t2 ON t.Field = t2.Field '
    EXEC sp_executesql @SQL, N'@TableValueParameter dbo.udtTVP READONLY', @TableValueParameter
    FETCH NEXT FROM DB_Cursor INTO @dbName
END
CLOSE DB_Cursor
DEALLOCATE DB_Cursor
Related
I get a list of IDs and amounts from an Excel file (thousands of IDs and corresponding amounts). I then need to check the database to see if each ID exists and, if it does, check that the amount in the DB is greater than or equal to the amount from the Excel file.
The problem is that running this select statement upwards of 6000 times and returning the values I need takes a long time. Even at half a second apiece it will take about an hour to do all the selects. (I normally don't get more than 5 results back.)
Is there a faster way to do this?
Is it possible to somehow pass all the ID's at once and just make 1 call and get the massive collection?
I have tried using SqlDataReaders and SqlDataAdapters but they seem to be about the same (too long either way)
General idea of how this works below
for (int i = 0; i < ID.Count; i++)
{
    SqlCommand cmd = new SqlCommand("select Amount, Client, Pallet from table where ID = @ID and Amount > 0;", sqlCon);
    cmd.Parameters.Add("@ID", SqlDbType.VarChar).Value = ID[i];
    SqlDataAdapter da = new SqlDataAdapter(cmd);
    da.Fill(dataTable);
    da.Dispose();
}
Instead of a long in list (difficult to parameterise and has a number of other inefficiencies regarding execution plans: compilation time, plan reuse, and the plans themselves) you can pass all the values in at once via a table valued parameter.
See arrays and lists in SQL Server for more details.
Generally I make sure to give the table type a primary key and use option (recompile) to get the most appropriate execution plans.
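A rough sketch of the client side of this, under stated assumptions: a table type named dbo.IdList with a single ID varchar column as its primary key is assumed to exist on the server (that name is invented for this example), and System.Data.SqlClient is assumed to be referenced. The table and column names in the query come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class IdListTvp
{
    // Copies the IDs into a one-column DataTable, dropping duplicates so the
    // primary key on the (assumed) dbo.IdList type is not violated.
    // OrdinalIgnoreCase only approximates a case-insensitive SQL collation.
    public static DataTable BuildIdTable(IEnumerable<string> ids)
    {
        var table = new DataTable();
        table.Columns.Add("ID", typeof(string));
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string id in ids)
        {
            if (seen.Add(id)) table.Rows.Add(id);
        }
        return table;
    }

    // Builds one command that joins the table-valued parameter to the lookup table.
    public static SqlCommand BuildCommand(SqlConnection connection, DataTable idTable)
    {
        var command = new SqlCommand(
            "SELECT t.Amount, t.Client, t.Pallet " +
            "FROM [table] t JOIN @Ids i ON i.ID = t.ID " +
            "WHERE t.Amount > 0 OPTION (RECOMPILE);", connection);
        SqlParameter p = command.Parameters.Add("@Ids", SqlDbType.Structured);
        p.TypeName = "dbo.IdList"; // hypothetical type name; needed for text commands
        p.Value = idTable;
        return command;
    }
}
```

The whole ID list then goes to the server in a single round trip, and the result can be read with one SqlDataReader or one Fill call.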
Combine all the IDs together into a single large IN clause, so it reads like:
select Amount, Client, Pallet from table where ID in (1,3,5,7,9,11) and Amount > 0;
"I have tried using SqlDataReaders and SqlDataAdapters"
It sounds like you might be open to other APIs. Using Linq2SQL or Linq2Entities:
var someListIds = new List<int> { 1,5,6,7 }; //imagine you load this from where ever
db.MyTable.Where( mt => someListIds.Contains(mt.ID) );
This is safe in terms of avoiding potential SQL injection vulnerabilities and will generate an IN clause. Note, however, that someListIds can be so large that the generated SQL query exceeds the limits on query length, but the same is true of any other technique involving the IN clause. You can easily work around that by partitioning the list into large chunks, and still be tremendously better off than with a query per ID.
Use Table-Valued Parameters
With them you can pass a c# datatable with your values into a stored procedure as a resultset/table which you can join to and do a simple:
SELECT *
FROM YourTable
WHERE NOT EXISTS (SELECT * FROM InputResultSet WHERE YourConditions)
Use the in operator. Your problem is very common and it has a name: N+1 performance problem
Where are you getting the IDs from? If it is from another query, then consider grouping them into one.
Rather than performing a separate query for every single ID that you have, execute one query to get the amount of every single ID that you want to check (or if you have too many IDs to put in one query, then batch them into batches of a few thousand).
Import the data directly to SQL Server. Use stored procedure to output the data you need.
If you must consume it in the app tier... use xml datatype to pass into a stored procedure.
You can import the data from the excel file into SQL server as a table (using the import data wizard). Then you can perform a single query in SQL server where you join this table to your lookup table, joining on the ID field. There's a few more steps to this process, but it's a lot neater than trying to concatenate all the IDs into a much longer query.
I'm assuming a certain amount of access privileges to the server here, but this is what I'd do given the access I normally have. I'm also assuming this is a one off task. If not, the import of the data to SQL server can be done programmatically as well
IN clause has limits, so if you go with that approach, make sure a batch size is used to process X amount of Ids at a time, otherwise you will hit another issue.
As @Robertharvey has noted, if there are not a lot of IDs and there are no transactions occurring, then just pull all the IDs at once into memory into a dictionary-like object and process them there. Six thousand values is not a lot, and a single select could return all of them within a few seconds.
Just remember that if another process is updating the data, your local cached version may be stale.
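As a sketch of the in-memory approach: assume one SELECT has already loaded every ID and its amount from the database into a dictionary, and the Excel rows are in a second dictionary. The class and method names below are invented for illustration.

```csharp
using System.Collections.Generic;

public static class AmountChecker
{
    // dbAmounts: ID -> amount loaded from the database with a single SELECT.
    // requested: ID -> amount read from the Excel file.
    // Returns the IDs that are missing from the database or whose database
    // amount is smaller than the requested amount.
    public static List<string> FindFailures(
        IDictionary<string, decimal> dbAmounts,
        IDictionary<string, decimal> requested)
    {
        var failures = new List<string>();
        foreach (KeyValuePair<string, decimal> item in requested)
        {
            decimal available;
            bool exists = dbAmounts.TryGetValue(item.Key, out available);
            if (!exists || available < item.Value)
            {
                failures.Add(item.Key);
            }
        }
        return failures;
    }
}
```

Each lookup is a hash probe, so 6000 checks cost far less than a single database round trip.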
There is another way to handle this: build an XML document of the IDs and pass it to a stored procedure. Here is the code for the procedure.
IF OBJECT_ID('GetDataFromDatabase') IS NOT NULL
BEGIN
    DROP PROCEDURE GetDataFromDatabase
END
GO
--Definition
CREATE PROCEDURE GetDataFromDatabase
    @xmlData XML
AS
BEGIN
    DECLARE @DocHandle INT
    DECLARE @idList TABLE (id INT)
    EXEC sp_xml_preparedocument @DocHandle OUTPUT, @xmlData;
    INSERT INTO @idList (id) SELECT x.id FROM OPENXML(@DocHandle, '//data', 2) WITH ([id] INT) x
    EXEC sp_xml_removedocument @DocHandle;
    --SELECT * FROM @idList
    SELECT t.Amount, t.Client, t.Pallet FROM yourTable t INNER JOIN @idList x ON t.id = x.id AND t.Amount > 0;
END
GO
--Uses
EXEC GetDataFromDatabase @xmlData = '<root><data><id>1</id></data><data><id>2</id></data></root>'
You can put any logic in procedure. You can pass id, amount also via XML. You can pass huge list of ids via XML.
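On the C# side, the XML string in that shape can be produced with LINQ to XML rather than by string concatenation. This is a small sketch; the class and method names are made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class IdXmlBuilder
{
    // Produces <root><data><id>1</id></data>...</root>, the shape the
    // OPENXML call reads with the '//data' row pattern.
    public static string ToXml(IEnumerable<int> ids)
    {
        var root = new XElement("root",
            ids.Select(id => new XElement("data", new XElement("id", id))));
        // DisableFormatting keeps the document on one line without indentation.
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

The resulting string can be passed as the value of an XML-typed parameter when calling the procedure.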
SqlDataAdapter objects are too heavy for that.
Firstly, use stored procedures; it will be faster.
Secondly, use a set-based operation: pass the list of identifiers to the database as a parameter, run one query against those parameters, and return the processed result.
That will be quick and efficient, because all the data processing logic runs on the database server.
You can select the whole result set (or join multiple 'limited' result sets) and save it all to a DataTable. Then you can do selects and updates (if needed) directly on the DataTable, and then plug the new data back. It's not super efficient memory-wise, but it is often a very good (and sometimes the only) solution when you work in bulk and need it to be very fast.
So if you have thousands of records, it might take a couple of minutes to populate all records into the DataTable.
Then you can search your table like this:
string findMatch = "id = value";
DataRow[] rowsFound = dataTable.Select(findMatch);
Then just loop foreach (DataRow dr in rowsFound)
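A self-contained sketch of that in-memory search, using made-up sample rows. It also shows Rows.Find, which uses the DataTable's primary key and avoids parsing a filter expression on every lookup. Column names here are assumptions for the example.

```csharp
using System.Data;
using System.Globalization;

public static class DataTableLookup
{
    // Sample data standing in for the result set loaded from the database.
    public static DataTable BuildSample()
    {
        var table = new DataTable("Lookup");
        table.Columns.Add("id", typeof(int));
        table.Columns.Add("Amount", typeof(decimal));
        table.PrimaryKey = new[] { table.Columns["id"] };
        table.Rows.Add(1, 10.0m);
        table.Rows.Add(2, 0m);
        table.Rows.Add(3, 25.5m);
        return table;
    }

    // Filter-expression search, as in the answer above.
    public static DataRow[] FindWithSelect(DataTable table, int id)
    {
        return table.Select("id = " + id.ToString(CultureInfo.InvariantCulture));
    }

    // Primary-key search; returns null when the id is not present.
    public static DataRow FindByKey(DataTable table, int id)
    {
        return table.Rows.Find(id);
    }
}
```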
I have a table having around 1 million records. Table structure is shown below. The UID column is a primary key and uniqueidentifier type.
Table_A (contains a million records)
UID Name
-----------------------------------------------------------
E8CDD244-B8E4-4807-B04D-FE6FDB71F995 DummyRecord
I also have a function called fn_Split('Guid_1,Guid_2,Guid_3,....,Guid_n') which accepts a list of comma-separated guids and gives back a table variable containing the guids.
From my application code I am passing a sql query to get new guids [Keys that are with application code but not in the database table]
var sb = new StringBuilder();
sb
.Append(" SELECT NewKey ")
.AppendFormat(" FROM fn_Split ('{0}') ", keyList)
.Append(" EXCEPT ")
.Append("SELECT UID from Table_A");
The first time this command is executed it times out on quite a few occasions. I am trying to figure out what would be a better approach here to avoid such timeouts and/or improve the performance of this.
Thanks.
Firstly, add an index on table_a.uid if there isn't one, though I assume there is.
Some alternate queries to try,
select newkey
from fn_split(blah)
left outer join table_a
on newkey = uid
where uid IS NULL

select newkey
from fn_split(blah)
where newkey not in (select uid
                     from table_a)

select newkey
from fn_split(blah) f
where not exists(select uid
                 from table_a a
                 where f.newkey = a.uid)
There is plenty of info around here as to why you should not use a Guid for your primary key, especially if it is unordered. That would be the first thing to fix. As far as your query goes you might try what Paul or Tim suggested, but as far as I know EXCEPT and NOT IN will use the same execution plan, though the OUTER JOIN may be more efficient in some cases.
If you're using MS SQL 2008 then you can/should use table-valued parameters. Essentially you'd send in your guids in the form of a DataTable to your stored procedure.
Then inside your stored procedure you can use the parameter as a "table" and do a join or EXCEPT or what have you to get your results.
This method is usually faster than using a function to split the string, because user-defined functions in MS SQL Server tend to be slow.
But my guess is that the time is being taken by the massive disk I/O this query requires. Since you're searching on your UID column and since the values are "random", no index is going to help here. The engine will have to resort to a table scan, which means you'll need some serious disk I/O performance to get the results in "good time".
Using the uniqueidentifier data type as an index is not recommended. However, it may not make a difference in your case. But let me ask you this:
Are the guids that you send in from your app just a random list of guids, or is there some business relationship or entity relationship here? It's possible that your data model is not correct for what you are trying to do. So how do you determine what guids you have to search on?
However, for argument's sake, let's assume your guids are just a random selection. Then there is no index that is really being used, since the database engine will have to do a table scan to pick out each of the required guids/records from the million records you have. In a situation like this the only way to speed things up is at the physical database level, that is, how your data is physically stored on the hard drives etc.
For example:
Having faster drives will improve performance
If this kind of query is being fired over and over then more memory on the box will help because the engine can cache the data in memory and it won't need to do physical reads
If you partition your table then the engine can parallelize the seek operation and get you results faster.
If your table contains a lot of other fields that you don't always need, then splitting the table in two, where table1 contains the guid and the bare minimum set of fields and table2 contains the rest, will speed up the query quite a bit because the disk I/O demands are lower.
Lots of other things to look at here.
Also note that when you send in adhoc SQL statements that don't have parameters the engine has to create a plan each time you execute it. In this case it's not a big deal but keep in mind that each plan will be cached in memory thus pushing out any data that might have been cached.
Lastly you can always increase the commandTimeOut property in this case to get past the timeout issues.
How much time does it take now, and what kind of improvement are you looking or hoping to get?
If I understand your question correctly, in your client code you have a comma-delimited string of (string) GUIDs. These GUIDS are usable by the client only if they don't already exist in TableA. Could you invoke a SP which creates a temporary table on the server containing the potentially usable GUIDS, and then do this:
select guid from #myTempTable as temp
where not exists
(
select uid from TABLEA where uid = temp.guid
)
You could pass your string of GUIDS to the SP; it would populate the temp table using your function; and then return an ADO.NET DataTable to the client. This should be very easy to test before you even bother to write the SP.
I am questioning what you do with this information.
If you insert the keys into this table afterwards, you could simply try to insert them in the first place. That's much faster and more robust in a multi-user environment than query-first, insert-later:
create procedure TryToInsert @GUID uniqueidentifier, @Name varchar(n) as
begin try
    insert into Table_A (UID,Name)
    values (@GUID, @Name);
    return 0;
end try
begin catch
    return 1;
end catch;
In all cases you can split the KeyList at the client to get faster results - and you could query the keys that are not valid:
select UID
from Table_A
where UID in ('new guid','new guid',...);
If the GUIDs are random you should use newsequentialid() with your clustered primary key:
create table Table_A (
UID uniqueidentifier default newsequentialid() primary key,
Name varchar(n) not null
);
With this you can insert and query your newly inserted data in one step:
insert into Table_A (Name)
output inserted.*
values (@Name);
... just my two cents
In any case, are not GUIDs intrinsically engineered to be, for all intents and purposes, unique? (i.e. universally unique -- doesn't matter where generated). I wouldn't even bother to do the test beforehand; just insert your row with the GUID PK and if the insert should fail, discard the GUID. But it should not fail, unless these are not truly GUIDs.
http://en.wikipedia.org/wiki/GUID
http://msdn.microsoft.com/en-us/library/ms190215.aspx
It seems you are doing a lot of unnecessary work, but perhaps I don't grasp your application requirement.
I have a list of about 4 million objects. There is a stored proc that takes an object's attributes as params, does some lookups, and inserts them into tables.
What's the most efficient way to insert these 4 million objects into the db?
How I do it:
-- connect to sql - SqlConnection ...
foreach (var item in listofobjects)
{
    SqlCommand sc = ...
    // assign params
    sc.ExecuteNonQuery();
}
This has been really slow.
Is there a better way to do this?
This process will be a scheduled task. I will run it every hour, so I do expect high-volume data like this.
Take a look at the SqlBulkCopy Class
Based on your comment: dump the data into a staging table, then do the lookup and insert into the real table, set-based, from a proc. It will be much faster than row by row.
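A minimal sketch of the staging-table load, with loud assumptions: the Item class, its two properties, and the dbo.ItemStaging table are all hypothetical, and System.Data.SqlClient is assumed to be referenced. For millions of rows you would probably load in chunks (or stream through an IDataReader) rather than hold everything in one DataTable.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Hypothetical shape of the objects being inserted.
public sealed class Item
{
    public string Code { get; set; }
    public int Quantity { get; set; }
}

public static class StagingLoader
{
    // Copies the objects into a DataTable whose columns match the staging table.
    public static DataTable ToDataTable(IEnumerable<Item> items)
    {
        var table = new DataTable();
        table.Columns.Add("Code", typeof(string));
        table.Columns.Add("Quantity", typeof(int));
        foreach (Item item in items) table.Rows.Add(item.Code, item.Quantity);
        return table;
    }

    // Streams the rows into the (assumed) staging table in one bulk operation.
    // The connection must already be open.
    public static void BulkLoad(SqlConnection openConnection, DataTable rows)
    {
        using (var bulk = new SqlBulkCopy(openConnection))
        {
            bulk.DestinationTableName = "dbo.ItemStaging"; // hypothetical table
            bulk.BatchSize = 10000;                         // tune by measurement
            bulk.ColumnMappings.Add("Code", "Code");
            bulk.ColumnMappings.Add("Quantity", "Quantity");
            bulk.WriteToServer(rows);
        }
    }
}
```

After the load, a single stored procedure call can do the lookups and the insert into the real tables as one set-based statement.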
It's never going to be ideal to insert four million records from C#, but a better way to do it is to build the command text up in code so you can do it in chunks.
This is hardly bulletproof, and it doesn't illustrate how to incorporate lookups (as you've mentioned you need), but the basic idea is:
// You'd modify this to chunk it out - only testing can tell you the right
// number - perhaps 100 at a time.
for (int i = 0; i < items.Length; i++) {
    // e.g., 'insert dbo.Customer values(@firstName1, @lastName1);'
    string newStatement = string.Format(
        "insert dbo.Customer values(@firstName{0}, @lastName{0});", i);
    command.CommandText += newStatement;
    command.Parameters.AddWithValue("@firstName" + i, items[i].FirstName);
    command.Parameters.AddWithValue("@lastName" + i, items[i].LastName);
}
// ...
command.ExecuteNonQuery();
I have had excellent results using XML to get large amounts of data into SQL Server. Like you, I was initially inserting rows one at a time, which took forever due to the round-trip time between the application and the server. Then I switched the logic to pass in an XML string containing all the rows to insert. Time to insert went from 30 minutes to less than 5 seconds. This was for a couple of thousand rows. I have tested with XML strings up to 20 megabytes in size and there were no issues. Depending on your row size this might be an option.
The data was passed in as an XML String using the nText type.
Something like this formed the basic details of the stored procedure that did the work:
CREATE PROCEDURE XMLInsertPr( @XmlString ntext )
AS
BEGIN
    DECLARE @ReturnStatus int, @hdoc int
    EXEC @ReturnStatus = sp_xml_preparedocument @hdoc OUTPUT, @XmlString
    IF (@ReturnStatus <> 0)
    BEGIN
        RAISERROR ('Unable to open XML document', 16,1,50003)
        RETURN @ReturnStatus
    END
    INSERT INTO TableName
    SELECT * FROM OPENXML(@hdoc, '/XMLData/Data') WITH TableName
    -- Release the memory held by the parsed document
    EXEC sp_xml_removedocument @hdoc
END
You might consider dropping any indexes you have on the table(s) you are inserting into and then recreating them after you have inserted everything. I'm not sure how the bulk copy class works but if you are updating your indexes on every insert it can slow things down quite a bit.
Like Abe mentioned: drop indexes (and recreate them later).
If you trust your data: generate a SQL statement for each call to the stored proc, combine some, and then execute.
This saves you communication overhead.
The combined calls (to the stored proc) could be wrapped in a BEGIN TRANSACTION so you have only one commit per x inserts.
If this is a one-time operation: do not optimize, and run it during the night / weekend.
I have a table whose schema is very simple: an ID column as unique primary key (uniqueidentifier type) and some other nvarchar columns. My current goal is, for 5000 inputs, to work out which ones are already contained in the table and which are not. The inputs are strings, and I have a C# function which converts a string into a uniqueidentifier (GUID). My logic is: if there is an existing ID, then I treat the string as already contained in the table.
My question is, if I need to find out which of the 5000 input strings are already contained in the DB and which are not, what is the most efficient way?
BTW: My current implementation is to convert the string to a GUID in C# code, then invoke a stored procedure which queries whether the ID exists in the database and returns the answer to the C# code.
My working environment: VSTS 2008 + SQL Server 2008 + C# 3.5.
My first instinct would be to pump your 5000 inputs into a single-column temporary table X, possibly index it, and then use:
SELECT X.thecol
FROM X
JOIN ExistingTable ON ExistingTable.thecol = X.thecol
to get the ones that are present, and (if both sets are needed)
SELECT X.thecol
FROM X
LEFT JOIN ExistingTable ON ExistingTable.thecol = X.thecol
WHERE ExistingTable.thecol IS NULL
to get the ones that are absent. Worth benchmarking, at least.
Edit: as requested, here are some good docs & tutorials on temp tables in SQL Server. Bill Graziano has a simple intro covering temp tables, table variables, and global temp tables. Randy Dyess and SQL Master discuss performance issue for and against them (but remember that if you're getting performance problems you do want to benchmark alternatives, not just go on theoretical considerations!-).
MSDN has articles on tempdb (where temp tables are kept) and optimizing its performance.
Step 1. Make sure you have a problem to solve. Five thousand inserts isn't a lot to insert one at a time in a lot of contexts.
Are you certain that the simplest way possible isn't sufficient? What performance issues have you measured so far?
What do you need to do with those entries that do or don't exist in your table??
Depending on what you need, maybe the new MERGE statement in SQL Server 2008 could fit your bill - update what's already there, insert new stuff, all wrapped neatly into a single SQL statement. Check it out!
http://blogs.conchango.com/davidportas/archive/2007/11/14/SQL-Server-2008-MERGE.aspx
http://www.sql-server-performance.com/articles/dba/SQL_Server_2008_MERGE_Statement_p1.aspx
http://blogs.msdn.com/brunoterkaly/archive/2008/11/12/sql-server-2008-merge-capability.aspx
Your statement would look something like this:
MERGE INTO
(your target table) AS t
USING
(your source table, e.g. a temporary table) AS s
ON t.ID = s.ID
WHEN NOT MATCHED THEN -- new rows does not exist in base table
....(do whatever you need to do)
WHEN MATCHED THEN -- row exists in base table
... (do whatever else you need to do)
;
To make this really fast, I would load the "new" records from e.g. a TXT or CSV file into a temporary table in SQL server using BULK INSERT:
BULK INSERT YourTemporaryTable
FROM 'c:\temp\yourimportfile.csv'
WITH
(
FIELDTERMINATOR =',',
ROWTERMINATOR =' |\n'
)
BULK INSERT combined with MERGE should give you the best performance you can get on this planet :-)
Marc
PS: here's a note from TechNet on MERGE performance and why it's faster than individual statements:
In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement.
By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.
Try to ensure you end up running only one query - i.e. if your solution consists of running 5000 queries against the database, that'll probably be the biggest consumer of resources for the operation.
If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database.
If you want simplicity, since 5000 records is not very many, then from C# just use a loop to generate an insert statement for each of the strings you want to add to the table. Wrap the insert in a TRY CATCH block. Send em all up to the server in one shot like this:
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
If you have a unique index or primary key defined on your string GUID, then the duplicate inserts will fail. Checking ahead of time to see whether the record exists just duplicates work that SQL is going to do anyway.
If performance is really important, then consider downloading the 5000 GUIDs to your local station and doing all the analysis locally. Reading 5000 GUIDs should take much less than 1 second. This is simpler than bulk importing to a temp table (which is the only way you will get performance from a temp table) and doing an update using a join to the temp table.
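The local analysis can be as small as a hash-set membership test. The sketch below assumes the table's existing IDs have already been read into a set with one query; the class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class GuidPartitioner
{
    // Splits the inputs into those already in the table (present) and those
    // that are not (absent), using only in-memory lookups.
    public static void Partition(
        IEnumerable<Guid> inputs,
        ISet<Guid> existingIds,
        out List<Guid> present,
        out List<Guid> absent)
    {
        present = new List<Guid>();
        absent = new List<Guid>();
        foreach (Guid id in inputs)
        {
            if (existingIds.Contains(id)) present.Add(id);
            else absent.Add(id);
        }
    }
}
```

This only pays off while the table is small enough to download cheaply; for a large table, sending the 5000 inputs to the server is likely the better direction.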
Since you are using Sql server 2008, you could use Table-valued parameters. It's a way to provide a table as a parameter to a stored procedure.
Using ADO.NET you could easily pre-populate a DataTable and pass it as a SqlParameter.
Steps you need to perform:
Create a custom Sql Type
CREATE TYPE MyType AS TABLE
(
    UniqueId INT NOT NULL,
    [Column] NVARCHAR(255) NOT NULL  -- COLUMN is a reserved word, so it needs brackets
)
Create a stored procedure which accepts the Type
CREATE PROCEDURE spInsertMyType
    @Data MyType READONLY
AS
xxxx
Call using C#
SqlCommand insertCommand = new SqlCommand(
    "spInsertMyType", connection);
insertCommand.CommandType = CommandType.StoredProcedure;
SqlParameter tvpParam =
    insertCommand.Parameters.AddWithValue(
        "@Data", dataReader);
tvpParam.SqlDbType = SqlDbType.Structured;
Links: Table-valued Parameters in Sql 2008
Definitely do not do it one-by-one.
My preferred solution is to create a stored procedure with one parameter that can take an XML document in the following format:
<ROOT>
  <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/>
  <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000001"/>
  ....
</ROOT>
Then in the procedure, which takes an argument of type NVARCHAR(MAX), you convert it to XML, after which you use it as a table with a single column (let's call it @FilterTable). The stored procedure looks like:
CREATE PROCEDURE dbo.sp_MultipleParams(@FilterXML NVARCHAR(MAX))
AS BEGIN
    SET NOCOUNT ON
    DECLARE @x XML
    SELECT @x = CONVERT(XML, @FilterXML)
    -- temporary table (must have it, because cannot join on XML statement)
    DECLARE @FilterTable TABLE (
        "ID" UNIQUEIDENTIFIER
    )
    -- insert into temporary table
    -- important: XML iS CaSe-SenSiTiVe
    INSERT @FilterTable
    SELECT x.value('@ID', 'UNIQUEIDENTIFIER')
    FROM @x.nodes('/ROOT/MyObject') AS R(x)
    SELECT o.ID,
           SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END)) AS FoundInDB
    FROM @FilterTable o
    LEFT JOIN dbo.MyTable t
        ON o.ID = t.ID
    GROUP BY o.ID
END
GO
You run it as:
EXEC sp_MultipleParams '<ROOT><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000002"/></ROOT>'
And your results look like:
ID FoundInDB
------------------------------------ -----------
60EAD98F-8A6C-4C22-AF75-000000000000 1
60EAD98F-8A6C-4C22-AF75-000000000002 0
I have some complex stored procedures that may return many thousands of rows, and take a long time to complete.
Is there any way to find out how many rows are going to be returned before the query executes and fetches the data?
This is with Visual Studio 2005, a Winforms application and SQL Server 2005.
You mentioned your stored procedures take a long time to complete. Is the majority of the time taken up during the process of selecting the rows from the database or returning the rows to the caller?
If it is the latter, maybe you can create a mirror version of your SP that just gets the count instead of the actual rows. If it is the former, well, there isn't really that much you can do since it is the act of finding the eligible rows which is slow.
A solution to your problem might be to re-write the stored procedure so that it limits the result set to some number, like:
SELECT TOP 1000 * FROM tblWHATEVER
in SQL Server, or
SELECT * FROM tblWHATEVER WHERE ROWNUM <= 1000
in Oracle. Or implement a paging solution so that the result set of each call is acceptably small.
Make a stored proc to count the rows first.
SELECT COUNT(*) FROM table
Unless there's some aspect of the business logic of your app that allows calculating this, no. The database is going to have to do all the where and join logic to figure out how many rows there are, and that's the vast majority of the time spent in the SP.
You can't get the rowcount of a procedure without executing the procedure.
You could make a different procedure that accepts the same parameters, the purpose of which is to tell you how many rows the other procedure should return. However, the steps required by this procedure would normally be so similar to those of the main procedure that it should take just about as long as just executing the main procedure.
You would have to write a different version of the stored procedure to get a row count. This one would probably be much faster because you could eliminate joins to tables that aren't filtered against, remove ordering, etc. For example, if your stored proc executed SQL such as:
select firstname, lastname, email, orderdate from
customer inner join productorder on customer.customerid=productorder.productorderid
where orderdate>@orderdate order by lastname, firstname;
your counting version would be something like:
select count(*) from productorder where orderdate>@orderdate;
Not in general.
Through knowledge about the operation of the stored procedure, you may be able to get either an estimate or an accurate count (for instance, if the "core" or "base" table of the query is able to be quickly calculated, but it is complex joins and/or summaries which drive the time upwards).
But you would have to call the counting SP first and then the data SP or you could look at using a multiple result set SP.
It could take as long to get a row count as to get the actual data, so I wouldn't advocate performing a count in most cases.
Some possibilities:
1) Does SQL Server expose its query optimiser findings in some way? i.e. can you parse the query and then obtain an estimate of the rowcount? (I don't know SQL Server).
2) Perhaps based on the criteria the user gives you can perform some estimations of your own. For example, if the user enters 'S%' in the customer surname field to query orders you could determine that that matches 7% (say) of the customer records, and extrapolate that the query may return about 7% of the order records.
Going on what Tony Andrews said in his answer, you can get an estimated query plan of the call to your query with:
SET SHOWPLAN_TEXT OFF
GO
SET SHOWPLAN_ALL ON
GO
--Replace with a call to your stored procedure
select * from MyTable
GO
SET SHOWPLAN_ALL OFF
GO
This should return a table, or many tables which will let you get the estimated row count of your query.
You need to analyze the returned data set, to determine what is a logical, (meaningful) primary key for the result set that is being returned. In general this WILL be much faster than the complete procedure, because the server is not constructing a result set from data in all the columns of each row of each table, it is simply counting the rows... In general, it may not even need to read the actual table rows off disk to do this, it may simply need to count index nodes...
Then write another SQL statement that only includes the tables necessary to generate those key columns (Hopefully this is a subset of the tables in the main sql query), and the same where clause with the same filtering predicate values...
Then add another optional parameter to the stored proc called, say, @CountsOnly, with a default of false (0), like so...
Alter Procedure <storedProcName>
    @param1 Type,
    -- Other current params
    @CountsOnly TinyInt = 0
As
Set NoCount On
If @CountsOnly = 1
    Select Count(*)
    From TableA A
    Join TableB B On etc. etc...
    Where < here put all Filtering predicates >
Else
    <Here put old SQL That returns complete resultset with all data>
Return 0
You can then just call the same stored proc with @CountsOnly set equal to 1 to get just the count of records. Old code that calls the proc would still function as it used to, since the parameter defaults to false (0) if it is not included.
It's at least technically possible to run a procedure that puts the result set in a temporary table. Then you can find the number of rows before you move the data from server to application and would save having to create the result set twice.
But I doubt it's worth the trouble unless creating the result set takes a very long time, and in that case it may be big enough that the temp table would be a problem. Almost certainly the time to move the big table over the network will be many times what is needed to create it.