SQL/C# - Apply a function to columns within a SQL query

Is there a way to parse a given SQL SELECT query and wrap each column with a function call e.g. dbo.Foo(column_name) prior to running the SQL query?
We have looked into using a regular-expression-style replace on the column names; however, we cannot seem to account for all the ways in which a SQL query can be written.
An example of the SQL query would be:
SELECT
[ColumnA]
, [ColumnB]
, [ColumnC] AS [Column C]
, CAST([ColumnD] AS VARCHAR(11)) AS [Bar]
, DATEPART([yyyy], GETDATE()) - DATEPART([yyyy], [ColumnD]) AS [Diff]
, [ColumnE]
FROM [MyTable]
WHERE LEN([ColumnE]) > 0
ORDER BY
[ColumnA]
, DATEPART([yyyy], [ColumnD]) - DATEPART([yyyy], GETDATE());
The result we require would be:
SELECT
[dbo].[Foo]([ColumnA])
, [dbo].[Foo]([ColumnB])
, [dbo].[Foo]([ColumnC]) AS [Column C]
, CAST([dbo].[Foo]([ColumnD]) AS VARCHAR(11)) AS [Bar]
, DATEPART([yyyy], GETDATE()) - DATEPART([yyyy], [dbo].[Foo]([ColumnD])) AS [Diff]
, [dbo].[Foo]([ColumnE])
FROM [MyTable]
WHERE LEN([dbo].[Foo]([ColumnE])) > 0
ORDER BY
[dbo].[Foo]([ColumnA])
, DATEPART([yyyy], [dbo].[Foo]([ColumnD])) - DATEPART([yyyy], GETDATE());
Any or all of the above columns might need the function called on them (including columns used in the WHERE and ORDER BY) which is why we require a query wide solution.
We have many pre-written queries like the above which need to be updated, which is why a manual update will be difficult.
The above example shows that some result columns might be calculated and some have simply been renamed. Most are also made up of joins, and some contain CASE statements, which I have left out for the purposes of this example.
Another scenario which would need to be accounted for is table name aliasing e.g. SELECT t1.ColumnA, t2.ColumnF etc.
Either a SQL or C# solution for solving this problem would be ideal.

Instead of replacing each occurrence of every column, you can replace the statement...
FROM MyTable
...with a subselect that includes all existing columns with the function call:
FROM (
SELECT dbo.Foo(ColumnA) AS ColumnA, dbo.Foo(ColumnB) AS ColumnB,
dbo.Foo(ColumnC) AS ColumnC --etc.
FROM MyTable
) AS MyTable
The rest of the query can remain unchanged. In case of table aliasing, you simply replace AS Table1 with AS t1.
Another option you should consider is to create views in your database that would be essentially the subselect. Combined with a naming convention, you can easily replace the occurrences in your FROM (and JOIN) statements with the view name:
FROM MyTable_Foo AS t1
If you want to replace all queries that you'll ever use, consider renaming the tables and creating views that are named like the old tables.
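If you go the view route, generating the view definitions can itself be automated. Below is a rough, untested C# sketch that reads a table's column list from INFORMATION_SCHEMA and emits a CREATE VIEW statement wrapping every column in [dbo].[Foo]; the "_Foo" suffix is just the naming convention suggested above, and the table name is whatever you pass in.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

static class FooViewGenerator
{
    public static string BuildCreateViewSql(string connectionString, string tableName)
    {
        var wrappedColumns = new List<string>();

        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS " +
            "WHERE TABLE_NAME = @table ORDER BY ORDINAL_POSITION", cn))
        {
            cmd.Parameters.AddWithValue("@table", tableName);
            cn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    // wrap each column: [dbo].[Foo]([ColumnA]) AS [ColumnA]
                    wrappedColumns.Add(string.Format("[dbo].[Foo]([{0}]) AS [{0}]", reader.GetString(0)));
        }

        return string.Format(
            "CREATE VIEW [dbo].[{0}_Foo] AS SELECT {1} FROM [dbo].[{0}];",
            tableName,
            string.Join(", ", wrappedColumns.ToArray()));
    }
}
You would then run the generated statement once per table and point the existing queries' FROM (and JOIN) clauses at the *_Foo views.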
On a more general note: You should reconsider your approach to the underlying problem, since what you are doing here takes away much of the power of SQL. The worst thing here is that once you call the function on all columns, you will not be able to use the indices on those columns, which could mean a serious hit on DB performance.


How to use sub Query in insert statement

I have tried, but I get the error:
"Subqueries are not allowed in this context."
I have two tables, Product and Category, and want to use the CategoryID based on the Category_Name.
The query is:
Insert into Product(Product_Name,Product_Model,Price,Category_id)
values(' P1','M1' , 100, (select CategoryID from Category where Category_Name=Laptop))
Please tell me a solution with code.
(you didn't clearly specify what database you're using - this is for SQL Server but should apply to others as well, with some minor differences)
The INSERT command comes in two flavors:
(1) either you have all your values available, as literals or SQL Server variables - in that case, you can use the INSERT .. VALUES() approach:
INSERT INTO dbo.YourTable(Col1, Col2, ...., ColN)
VALUES(Value1, Value2, @Variable3, @Variable4, ...., ValueN)
Note: I would recommend always explicitly specifying the list of columns to insert data into - that way, you won't have any nasty surprises if suddenly your table has an extra column, or if your table has an IDENTITY or computed column. Yes - it's a tiny bit more work - once - but then you have your INSERT statement as solid as it can be and you won't have to constantly fiddle around with it if your table changes.
(2) if you don't have all your values as literals and/or variables, but instead you want to rely on another table, multiple tables, or views, to provide the values, then you can use the INSERT ... SELECT ... approach:
INSERT INTO dbo.YourTable(Col1, Col2, ...., ColN)
SELECT
SourceColumn1, SourceColumn2, @Variable3, @Variable4, ...., SourceColumnN
FROM
dbo.YourProvidingTableOrView
Here, you must define exactly as many items in the SELECT as your INSERT expects - and those can be columns from the table(s) (or view(s)), or those can be literals or variables. Again: explicitly provide the list of columns to insert into - see above.
You can use one or the other - but you cannot mix the two - you cannot use VALUES(...) and then have a SELECT query in the middle of your list of values - pick one of the two - stick with it.
So in your concrete case, you'll need to use:
INSERT INTO dbo.Product(Product_Name, Product_Model, Price, Category_id)
SELECT
' P1', 'M1', 100, CategoryID
FROM
dbo.Category
WHERE
Category_Name = 'Laptop'
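For completeness, here is one way the same INSERT ... SELECT could be issued from C# with ADO.NET, passing the literal values as parameters rather than concatenating them into the statement. This is just a hedged sketch; connection handling and the exact column names are taken from the question.
using System.Data.SqlClient;

class ProductInsert
{
    public static int InsertProduct(string connectionString)
    {
        const string sql =
            "INSERT INTO dbo.Product (Product_Name, Product_Model, Price, Category_id) " +
            "SELECT @name, @model, @price, CategoryID " +
            "FROM dbo.Category WHERE Category_Name = @categoryName";

        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, cn))
        {
            cmd.Parameters.AddWithValue("@name", "P1");
            cmd.Parameters.AddWithValue("@model", "M1");
            cmd.Parameters.AddWithValue("@price", 100);
            cmd.Parameters.AddWithValue("@categoryName", "Laptop");
            cn.Open();
            return cmd.ExecuteNonQuery(); // 0 rows inserted if the category is missing
        }
    }
}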
Try like this
Insert into Product
(
Product_Name,
Product_Model,
Price,Category_id
)
Select
'P1',
'M1' ,
100,
CategoryID
From
Category
where Category_Name='Laptop'
Try this:
DECLARE @CategoryID BIGINT = (select top 1 CategoryID from Category where Category_Name='Laptop')
Insert into Product(Product_Name,Product_Model,Price,Category_id)
values(' P1', 'M1', 100, @CategoryID)

Prefix every column name with a specific string?

I'm trying to manually map some rows to instances of their appropriate classes. I know that I need to use every column of every table, and map all of those columns from one table into a given class.
However, I was wondering if there would be an easier way to do it. Right now, I have a class called School and a class called User. Each of these classes has a Name property, and other properties (but the Name one is the important one, since it is a mutual name for both classes).
Right now, I am doing the following to map them down.
SELECT u.SomeOtherColumn, u.Name AS userName, s.SomeOtherColumn, s.Name AS schoolName FROM User AS u INNER JOIN School AS s ON something
I would love to do the following, but I can't, since Name is a mutual name between the classes.
SELECT u.*, s.* FROM User AS u INNER JOIN School AS s ON something
This however generates an error since they both have the column Name. Can I prefix them somehow? Like this for instance?
u.user_*, s.school_*
So that every column of each of those tables have a prefix? For instance user_Name and school_Name?
Years ago I wrote a bunch of functions and procedures to help me with developing automatic code-generation routines for SQL Servers and applications using dynamic SQL. Here is the one that I think would be most helpful to your situation:
CREATE FUNCTION [dbo].[ColumnString2]
(
    @TableName As SYSNAME,      --table or view whose column names you want
    @Template As NVarchar(MAX), --replaces '{c}' with the name for every column
    @Between As NVarchar(MAX)   --puts this string between every column string
)
RETURNS NVarchar(MAX) AS
BEGIN
    DECLARE @str As NVarchar(MAX);

    SELECT TOP 999
        @str = COALESCE(
            @str + @Between + REPLACE(@Template, N'{c}', COLUMN_NAME),
            REPLACE(@Template, N'{c}', COLUMN_NAME)
        )
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = COALESCE(PARSENAME(@TableName, 2), N'dbo')
      AND TABLE_NAME = PARSENAME(@TableName, 1)
    ORDER BY ORDINAL_POSITION;

    RETURN @str;
END
This allows you to format all of the column names of a table or view any way that you want. Simply pass it a table name, and a template string with '{c}' everywhere that you want the column name inserted for each column. It will do this for every column in @TableName, and add the @Between string in between them.
Here is an example of how to vertically format all of the column names for a table, renaming them with a prefix in a way that is suitable for inclusion into a SELECT query:
SELECT dbo.[ColumnString2](N'yourTable', N'
{c} As prefix_{c}', N',')
This function was intended for use with dynamic SQL, but you can use it too by executing it in Management Studio with your output set to Text (instead of Grid). Then cut and paste the output into your desired query, view or code text. (Be sure to change your SSMS Query options for Text Results to raise the "maximum number of characters displayed" from 256 to the max (8000). If that still gets cut off for you, then you can change this function into a procedure that outputs each column as a separate row, instead of as one single large string.)

How to perform a count on an arbitrary query (possibly containing an ORDER BY)

I have been tasked with updating the internal framework we use in-house. One of the things the framework does is that you pass it a query and it returns the number of rows the query produces (the framework makes heavy use of DataReaders, so we need the total beforehand for UI purposes).
The query that the count needs to be done on can be different from project to project (SQL injection is not an issue; the query is not from user input, just hard-coded by another programmer when they use the framework for their project), and I was told that just having the programmers write a second query for the count is unacceptable.
Currently the solution is to do the following (I did not write this, I was just told to fix it).
//executes query and returns record count
public static int RecordCount(string SqlQuery, string ConnectionString, bool SuppressError = false)
{
//SplitLeft is just myString.Substring(0, myString.IndexOf(pattern)) with some error checking, and InStr is just a wrapper for IndexOf.
//remove order by clause (breaks count(*))
if (Str.InStr(0, SqlQuery.ToLower(), " order by ") > -1)
SqlQuery = Str.SplitLeft(SqlQuery.ToLower(), " order by ");
try
{
//execute query
using (SqlConnection cnSqlConnect = OpenConnection(ConnectionString, SuppressError))
using (SqlCommand SqlCmd = new SqlCommand("select count(*) from (" + SqlQuery + ") as a", cnSqlConnect))
{
SqlCmd.CommandTimeout = 120;
return (Int32)SqlCmd.ExecuteScalar();
}
}
catch (Exception ex)
{
if (SuppressError == false)
MessageBox.Show(ex.Message, "Sql.RecordCount()");
return -1;
}
}
However it breaks on queries like (again, not my query, I just need to make it work)
select [ClientID], [Date], [Balance]
from [Ledger]
where Seq = (select top 1 Seq
from [Ledger] as l
where l.[ClientID] = [Ledger].[ClientID]
order by [Date] desc, Seq desc)
and Balance <> 0
as it removes everything after the first "order by" and breaks the query. I thought I might go from simple string matching to a more complicated parser, but before I do that I wanted to ask if there is a better way.
UPDATE: The order by clause is dropped because if you include it using my method or a CTE you will get the error "The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified."
Some more details: This framework is used for writing conversion applications. We write apps to pull data from a client's old database and move it into our database format when a customer buys our CRM software. Often we are working with source tables that are poorly written and can be several gigs in size. We do not have the resources to hold the whole table in memory, so we use a DataReader to pull the data out so everything is not in memory at once. However, a requirement is a progress bar with the total number of records to be processed. This RecordCount function is used to figure the max of the progress bar. It works fairly well; the only snag is that if the programmer writing the conversion needs to order the data output, having an ORDER BY clause in the outermost query breaks COUNT(*).
Partial Solution: I came up with this while trying to figure it out, it will not work 100% of the time but I think it will be better than the current solution
If I find an ORDER BY clause, I then check whether the first thing in the query is a SELECT (with no TOP following it); if so, I replace that beginning text with SELECT TOP 100 PERCENT. It works better, but I am not posting this as a solution as I am hoping for a universal solution.
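For illustration, a minimal sketch of that partial solution in C# might look like the following. It uses simple regular expressions, so it inherits the same limitations as the original string matching; the helper name is hypothetical, not part of the existing framework.
using System.Text.RegularExpressions;

static class CountHelper
{
    public static string MakeCountable(string sqlQuery)
    {
        bool hasOrderBy = Regex.IsMatch(sqlQuery, @"\border\s+by\b", RegexOptions.IgnoreCase);
        bool startsWithPlainSelect = Regex.IsMatch(sqlQuery, @"^\s*select\s+(?!top\b)", RegexOptions.IgnoreCase);

        // TOP 100 PERCENT makes the ORDER BY legal inside the derived table.
        if (hasOrderBy && startsWithPlainSelect)
            sqlQuery = Regex.Replace(sqlQuery, @"^\s*select\s+", "SELECT TOP 100 PERCENT ", RegexOptions.IgnoreCase);

        return "SELECT COUNT(*) FROM (" + sqlQuery + ") AS a";
    }
}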
Assuming you aren't going to see anything but fairly ordinary select statements, I don't think you need a full-on SQL parser to do what you want. You can reasonably make the assumption that you've got syntactically valid SQL. You need to build a tokenizer (lexical analyzer), though.
The lexical analysis needed for Transact-SQL is pretty simple. The token list consists of (off the top of my head, since it's been a while since I had to do this):
whitespace
two types of comments:
-- style line comments
/* ... */ style block comments
three types of quoted literals:
string literals (e.g., 'my string literal'), and
two flavors of quoting reserved words for use as column or object names:
ANSI/ISO style, using double quotes (e.g., "table")
Transact-SQL style, using square-brackets (e.g., [table])
hex literals (e.g., 0x01A2F)
numeric literals (e.g., 757, -3218, 5.4, -7.6E-32, 5.0m, $5.3201, etc.)
words, reserved or not: a Unicode letter, underscore ('_'), at-sign ('@') or hash ('#'), followed by zero or more Unicode letters, decimal digits, underscores ('_'), or the at-, dollar- or hash-signs ('@', '$' or '#').
operators, including parentheses.
It can pretty much all be done with regular expressions. If you were using Perl, you'd be done in a day, easy. It'll probably take a bit longer in C#, though.
I would probably treat comments as whitespace and collapse multiple sequences of whitespace and comment into a single whitespace token as it facilitates the recognition of constructs such as order by.
The reason you don't need a parser is that you don't really care very much about the parse tree. What you do care about is nested parentheses. So...
Once you've gotten a lexical analyzer that emits a stream of tokens, all you need to do is eat and discard tokens counting open/closing parentheses until you see a 'from' keyword at parenthetical depth 0.
Write select count(*) into your StringBuilder.
Start appending tokens (including the from) into the StringBuilder until you see an 'order by' at parenthetical depth 0. You'll need to build a certain amount of look-ahead into your lexer to do this (see my earlier note regarding the collapsing of sequences of whitespace and/or comments into a single whitespace token).
At this point, you should be pretty much done. Execute the query.
NOTES
Parameterized queries likely won't work.
Recursive queries with a CTE and a WITH clause will probably get broken.
This will discard anything past the ORDER BY clause: if the query uses a query hint, a FOR clause, or COMPUTE/COMPUTE BY, your results will likely differ from the original query (especially with any COMPUTE clauses, since those break up the query's result sets).
Bare UNION queries will get broken, since something like
select c1,c2 from t1
UNION select c1,c2 from t2
will get turned into
select count(*) from t1
UNION select c1,c2 from t2
All this is completely untested, just my thoughts based on oddball stuff I've had to do over the years.
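To make the idea concrete, here is an equally untested C# sketch along those lines: a tiny regex-based tokenizer that tracks parenthesis depth, discards everything before the FROM at depth 0, and stops copying at the outermost ORDER BY. It makes no attempt to handle the cases listed in the notes (UNION, hints, FOR/COMPUTE clauses, parameters), and an unbracketed word "order" at depth 0 would fool it.
using System;
using System.Text;
using System.Text.RegularExpressions;

static class SqlCountRewriter
{
    // Very small tokenizer: comments, string literals, bracketed/quoted names,
    // words, then any single remaining character (operators, whitespace, parens).
    private static readonly Regex Token = new Regex(
        @"--[^\r\n]*" +
        @"|/\*.*?\*/" +
        @"|'(?:[^']|'')*'" +
        @"|\[[^\]]*\]" +
        @"|""[^""]*""" +
        @"|[A-Za-z_@#][A-Za-z0-9_@#$]*" +
        @"|.",
        RegexOptions.Singleline);

    public static string BuildCountQuery(string sql)
    {
        var rewritten = new StringBuilder("SELECT COUNT(*) ");
        int depth = 0;
        bool pastFrom = false;

        foreach (Match m in Token.Matches(sql))
        {
            string t = m.Value;
            if (t == "(") depth++;
            else if (t == ")") depth--;

            if (!pastFrom)
            {
                // Discard everything before the FROM at parenthetical depth 0.
                if (depth == 0 && string.Equals(t, "from", StringComparison.OrdinalIgnoreCase))
                    pastFrom = true;
                else
                    continue;
            }
            else if (depth == 0 && string.Equals(t, "order", StringComparison.OrdinalIgnoreCase))
            {
                // Outermost ORDER BY: stop copying; COUNT(*) does not need it.
                break;
            }

            rewritten.Append(t);
        }

        return rewritten.ToString().TrimEnd();
    }
}
The existing RecordCount method could then call something like BuildCountQuery instead of the SplitLeft-based rewriting.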
Instead of modifying the existing clauses of the query, how about inserting a new clause: the INTO clause?
SELECT *
INTO #MyCountTable -- new clause to create a temp table with these records.
FROM TheTable
SELECT @@ROWCOUNT
-- or maybe this:
--SELECT COUNT(*) FROM #MyCountTable
DROP TABLE #MyCountTable
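If you go this route from the framework's C# side, one hedged way to consume it is to send the rewritten batch as a single command and read @@ROWCOUNT back with ExecuteScalar; the rewriting that injects INTO #MyCountTable before the outermost FROM is assumed to have happened already.
using System;
using System.Data.SqlClient;

static class SelectIntoCount
{
    public static int CountViaSelectInto(string rewrittenSelectInto, string connectionString)
    {
        // rewrittenSelectInto is the original query with "INTO #MyCountTable" added,
        // e.g. "SELECT * INTO #MyCountTable FROM TheTable WHERE ..."
        string batch =
            rewrittenSelectInto + ";\n" +
            "SELECT @@ROWCOUNT;\n" +
            "DROP TABLE #MyCountTable;";

        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(batch, cn))
        {
            cn.Open();
            // SELECT ... INTO produces no result set, so the first result set is @@ROWCOUNT.
            return Convert.ToInt32(cmd.ExecuteScalar());
        }
    }
}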
T-SQL query modification seems to be an eternal struggle to be the last thing that happens.
Would you post an answer showing how to do this "the right way" using IQueryable?
Suppose you had some arbitrary query:
IQueryable<Ledger> query = myDataContext.Ledgers
.Where(ledger => ledger.Seq ==
myDataContext.Ledgers
.Where(ledger2 => ledger2.ClientId == ledger.ClientId)
.OrderByDescending(ledger2 => ledger2.Date)
.ThenByDescending(ledger2 => ledger2.Seq)
.Take(1).SingleOrDefault().Seq
)
.Where(ledger => ledger.Balance != 0);
Then you just get the Count of the rows, no need for any custom method or query manipulation.
int theCount = query.Count();
//demystifying the extension method:
//int theCount = System.Linq.Queryable.Count(query);
LINQ to SQL will fold your request for a count into the generated query text.
I guess you want to drop the ORDER BY clause to improve performance. The general case is quite complex, and you would need a full SQL parser to drop the ordering clause.
Also, did you check the comparative performance of
select count(id) from ....
v/s
select count(*) from (select id, a+b from ....)
The problem is that the a+b will need to be evaluated in the latter, essentially executing the query twice.
If you want a progress bar because the retrieval itself is slow then this is completely counter-productive, because you will spend almost the same amount of time estimating the count.
And if the application is complex enough that the data can change between the two query executions, then you don't even know how reliable the count is.
So: the real answer is that you cannot get a count on an arbitrary query in an efficient way. For a non-efficient way, if your resultset is rewindable, go to the end of the resultset, figure out the row count, and then go back to the first row.
What if rather than try to re-build your query, you do something like:
WITH MyQuery AS (
select [ClientID], [Date], [Balance]
from [Ledger]
where Seq = (select top 1 Seq
from [Ledger] as l
where l.[ClientID] = [Ledger].[ClientID]
order by [Date] desc, Seq desc)
and Balance <> 0
)
SELECT COUNT(*) From MyQuery;
Note I haven't tested this on SQL Server 2005 but it should work.
Update:
We've confirmed SQL Server 2005 does not support an ORDER BY clause within a CTE. This does, however, work with Oracle and perhaps other databases.
I wouldn't edit or try to parse the SQL at all, but you may have to use an EVIL CURSOR (don't worry, we won't explicitly iterate through anything). Here, I would simply pass your ad-hoc SQL to a proc which runs it as a cursor, and returns the number of rows in the cursor. There may be some optimizations available, but I've kept it simple, and this should work for any valid select statement (even CTEs) that you pass to it. No need to code and debug your own T-SQL lexer or anything.
create proc GetCountFromSelect (
    @SQL nvarchar(max)
)
as
begin
    set nocount on
    exec ('declare CountCursor insensitive cursor for ' + @SQL + ' for read only')
    open CountCursor
    select @@cursor_rows as RecordCount
    close CountCursor
    deallocate CountCursor
end
go
exec GetCountFromSelect '// Your SQL here'
go
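Calling the proc from the framework's existing C# helper could look roughly like this (an untested sketch; the proc returns a single-row result set, so ExecuteScalar is enough):
using System;
using System.Data;
using System.Data.SqlClient;

static class CursorCount
{
    public static int RecordCountViaCursor(string sqlQuery, string connectionString)
    {
        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("GetCountFromSelect", cn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@SQL", sqlQuery);
            cmd.CommandTimeout = 120;
            cn.Open();
            return Convert.ToInt32(cmd.ExecuteScalar());
        }
    }
}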

how to improve SQL query performance in my case

I have a table whose schema is very simple: an ID column as the unique primary key (uniqueidentifier type) and some other nvarchar columns. My current goal is, for 5000 inputs, to calculate which ones are already contained in the table and which are not. The inputs are strings, and I have a C# function which converts a string into a uniqueidentifier (GUID). My logic is, if there is an existing ID, then I treat the string as already contained in the table.
My question is, if I need to find out which of the 5000 input strings are already contained in the DB and which are not, what is the most efficient way?
BTW: My current implementation is to convert the string to a GUID using C# code, then invoke/implement a stored procedure which queries whether the ID exists in the database and returns the result back to the C# code.
My working environment: VSTS 2008 + SQL Server 2008 + C# 3.5.
My first instinct would be to pump your 5000 inputs into a single-column temporary table X, possibly index it, and then use:
SELECT X.thecol
FROM X
JOIN ExistingTable ON ExistingTable.thecol = X.thecol
to get the ones that are present, and (if both sets are needed)
SELECT X.thecol
FROM X
LEFT JOIN ExistingTable ON ExistingTable.thecol = X.thecol
WHERE ExistingTable.thecol IS NULL
to get the ones that are absent. Worth benchmarking, at least.
Edit: as requested, here are some good docs & tutorials on temp tables in SQL Server. Bill Graziano has a simple intro covering temp tables, table variables, and global temp tables. Randy Dyess and SQL Master discuss performance issues for and against them (but remember that if you're getting performance problems you do want to benchmark alternatives, not just go on theoretical considerations!).
MSDN has articles on tempdb (where temp tables are kept) and optimizing its performance.
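As an illustration of the temp-table idea with ADO.NET (a sketch only; "ExistingTable" and "thecol" are the placeholder names used above): bulk-copy the 5000 GUIDs into a session-scoped temp table on the same connection, then join against the real table.
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class GuidLookup
{
    public static List<Guid> FindExisting(string connectionString, IEnumerable<Guid> candidates)
    {
        var found = new List<Guid>();
        using (var cn = new SqlConnection(connectionString))
        {
            cn.Open();

            // #X lives for the lifetime of this connection/session.
            using (var create = new SqlCommand("CREATE TABLE #X (thecol UNIQUEIDENTIFIER PRIMARY KEY)", cn))
                create.ExecuteNonQuery();

            var rows = new DataTable();
            rows.Columns.Add("thecol", typeof(Guid));
            foreach (Guid g in candidates)
                rows.Rows.Add(g);

            using (var bulk = new SqlBulkCopy(cn) { DestinationTableName = "#X" })
                bulk.WriteToServer(rows);

            using (var query = new SqlCommand(
                "SELECT X.thecol FROM #X AS X JOIN ExistingTable AS E ON E.thecol = X.thecol", cn))
            using (var reader = query.ExecuteReader())
                while (reader.Read())
                    found.Add(reader.GetGuid(0));
        }
        return found;
    }
}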
Step 1. Make sure you have a problem to solve. In a lot of contexts, five thousand inserts isn't much to do one at a time.
Are you certain that the simplest way possible isn't sufficient? What performance issues have you measured so far?
What do you need to do with those entries that do or don't exist in your table?
Depending on what you need, maybe the new MERGE statement in SQL Server 2008 could fit your bill - update what's already there, insert new stuff, all wrapped neatly into a single SQL statement. Check it out!
http://blogs.conchango.com/davidportas/archive/2007/11/14/SQL-Server-2008-MERGE.aspx
http://www.sql-server-performance.com/articles/dba/SQL_Server_2008_MERGE_Statement_p1.aspx
http://blogs.msdn.com/brunoterkaly/archive/2008/11/12/sql-server-2008-merge-capability.aspx
Your statement would look something like this:
MERGE INTO
(your target table) AS t
USING
(your source table, e.g. a temporary table) AS s
ON t.ID = s.ID
WHEN NOT MATCHED THEN -- new row does not exist in base table
....(do whatever you need to do)
WHEN MATCHED THEN -- row exists in base table
... (do whatever else you need to do)
;
To make this really fast, I would load the "new" records from e.g. a TXT or CSV file into a temporary table in SQL server using BULK INSERT:
BULK INSERT YourTemporaryTable
FROM 'c:\temp\yourimportfile.csv'
WITH
(
FIELDTERMINATOR =',',
ROWTERMINATOR =' |\n'
)
BULK INSERT combined with MERGE should give you the best performance you can get on this planet :-)
Marc
PS: here's a note from TechNet on MERGE performance and why it's faster than individual statements:
In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement.
By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.
Try to ensure you end up running only one query - i.e. if your solution consists of running 5000 queries against the database, that'll probably be the biggest consumer of resources for the operation.
If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database.
If you want simplicity, since 5000 records is not very many, then from C# just use a loop to generate an INSERT statement for each of the strings you want to add to the table. Wrap each insert in a TRY CATCH block. Send them all up to the server in one shot, like this:
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH
If you have a unique index or primary key defined on your string GUID, then the duplicate inserts will fail. Checking ahead of time to see if the record does not exist just duplicates work that SQL is going to do anyway.
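A rough C# sketch of that batching idea follows; "dbo.MyTable" and "theCol" are placeholder names, and the values are formatted inline because a 5000-row batch would exceed the parameter limit (acceptable here since Guid.ToString() cannot inject anything).
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

class BatchInsert
{
    public static void InsertIgnoringDuplicates(string connectionString, IEnumerable<Guid> ids)
    {
        var batch = new StringBuilder();
        foreach (Guid id in ids)
        {
            batch.AppendLine("BEGIN TRY");
            batch.AppendFormat("  INSERT INTO dbo.MyTable (theCol) VALUES ('{0}');", id);
            batch.AppendLine();
            batch.AppendLine("END TRY BEGIN CATCH END CATCH");
        }

        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(batch.ToString(), cn))
        {
            cmd.CommandTimeout = 300;
            cn.Open();
            cmd.ExecuteNonQuery(); // duplicates fail silently inside their own TRY block
        }
    }
}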
If performance is really important, then consider downloading the 5000 GUIDs to your local station and doing all the analysis locally. Reading 5000 GUIDs should take much less than 1 second. This is simpler than bulk importing to a temp table (which is the only way you will get performance from a temp table) and doing an update using a join to the temp table.
Since you are using Sql server 2008, you could use Table-valued parameters. It's a way to provide a table as a parameter to a stored procedure.
Using ADO.NET you could easily pre-populate a DataTable and pass it as a SqlParameter.
Steps you need to perform:
Create a custom Sql Type
CREATE TYPE MyType AS TABLE
(
UniqueId INT NOT NULL,
[Column] NVARCHAR(255) NOT NULL
)
Create a stored procedure which accepts the Type
CREATE PROCEDURE spInsertMyType
@Data MyType READONLY
AS
xxxx
Call using C#
SqlCommand insertCommand = new SqlCommand(
"spInsertMyType", connection);
insertCommand.CommandType = CommandType.StoredProcedure;
SqlParameter tvpParam =
insertCommand.Parameters.AddWithValue(
"@Data", dataReader);
tvpParam.SqlDbType = SqlDbType.Structured;
Links: Table-valued Parameters in Sql 2008
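The C# fragment above passes a variable called dataReader without showing where it comes from; a fuller (still hedged) sketch using a DataTable shaped like MyType might look like this:
using System.Data;
using System.Data.SqlClient;

class TvpExample
{
    public static void InsertViaTvp(string connectionString)
    {
        // Shape must match CREATE TYPE MyType (UniqueId INT, [Column] NVARCHAR(255)).
        var data = new DataTable();
        data.Columns.Add("UniqueId", typeof(int));
        data.Columns.Add("Column", typeof(string));
        data.Rows.Add(1, "first value");
        data.Rows.Add(2, "second value");

        using (var connection = new SqlConnection(connectionString))
        using (var insertCommand = new SqlCommand("spInsertMyType", connection))
        {
            insertCommand.CommandType = CommandType.StoredProcedure;
            SqlParameter tvpParam = insertCommand.Parameters.AddWithValue("@Data", data);
            tvpParam.SqlDbType = SqlDbType.Structured;
            tvpParam.TypeName = "dbo.MyType"; // optional for a proc parameter, but explicit
            connection.Open();
            insertCommand.ExecuteNonQuery();
        }
    }
}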
Definitely do not do it one-by-one.
My preferred solution is to create a stored procedure with one parameter that can take an XML document in the following format:
<ROOT>
<MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/>
<MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000001"/>
....
</ROOT>
Then in the procedure, with the argument of type NVARCHAR(MAX), you convert it to XML, after which you use it as a table with a single column (let's call it @FilterTable). The stored procedure looks like:
CREATE PROCEDURE dbo.sp_MultipleParams(@FilterXML NVARCHAR(MAX))
AS BEGIN
    SET NOCOUNT ON

    DECLARE @x XML
    SELECT @x = CONVERT(XML, @FilterXML)

    -- temporary table (must have it, because we cannot join on the XML directly)
    DECLARE @FilterTable TABLE (
        "ID" UNIQUEIDENTIFIER
    )

    -- insert into temporary table
    -- important: XML is case-sensitive
    INSERT @FilterTable
    SELECT x.value('@ID', 'UNIQUEIDENTIFIER')
    FROM @x.nodes('/ROOT/MyObject') AS R(x)

    SELECT o.ID,
           SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END)) AS FoundInDB
    FROM @FilterTable o
    LEFT JOIN dbo.MyTable t
        ON o.ID = t.ID
    GROUP BY o.ID
END
GO
You run it as:
EXEC sp_MultipleParams '<ROOT><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000002"/></ROOT>'
And your results look like:
ID FoundInDB
------------------------------------ -----------
60EAD98F-8A6C-4C22-AF75-000000000000 1
60EAD98F-8A6C-4C22-AF75-000000000002 0
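On the C# side, building the XML from the list of GUIDs and reading the result back could look roughly like this (assuming .NET 3.5's LINQ to XML; the proc and column names are the ones defined above):
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Xml.Linq;

class XmlFilterExample
{
    public static Dictionary<Guid, bool> CheckExistence(string connectionString, IEnumerable<Guid> ids)
    {
        // <ROOT><MyObject ID="..."/><MyObject ID="..."/>...</ROOT>
        var xml = new XElement("ROOT",
            ids.Select(id => new XElement("MyObject", new XAttribute("ID", id))));

        var result = new Dictionary<Guid, bool>();
        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.sp_MultipleParams", cn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@FilterXML", xml.ToString(SaveOptions.DisableFormatting));
            cn.Open();
            using (var rdr = cmd.ExecuteReader())
                while (rdr.Read())
                    result[rdr.GetGuid(0)] = rdr.GetInt32(1) == 1; // FoundInDB flag
        }
        return result;
    }
}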

What is the best way, algorithm, method to difference large lists of data?

I am receiving a large list of current account numbers daily, and storing them in a database. My task is to find added and released accounts from each file. Right now, I have 4 SQL tables, (AccountsCurrent, AccountsNew, AccountsAdded, AccountsRemoved). When I receive a file, I am adding it entirely to AccountsNew. Then running the below queries to find which we added and removed.
INSERT AccountsAdded(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew WHERE AccountNum not in (SELECT AccountNum FROM AccountsCurrent)
INSERT AccountsRemoved(AccountNum, Name) SELECT AccountNum, Name FROM AccountsCurrent WHERE AccountNum not in (SELECT AccountNum FROM AccountsNew)
TRUNCATE TABLE AccountsCurrent
INSERT AccountsCurrent(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew
TRUNCATE TABLE AccountsNew
Right now, I am differencing about 250,000 accounts, but this is going to keep growing. Is this the best method? Do you have any other ideas?
EDIT:
This is an MSSQL 2000 database. I'm using c# to process the file.
The only data I am focused on is the accounts that were added and removed between the last and current files. The AccountsCurrent table is only used to determine which accounts were added or removed.
To be honest, I think that I'd follow something like your approach. One thing is that you could remove the truncate, do a rename of the "new" to "current" and re-create "new".
Sounds like a history/audit process that might be better done using triggers. Have a separate history table that captures changes (e.g., timestamp, operation, who performed the change, etc.)
New and deleted accounts are easy to understand. "Current" accounts implies that there's an intermediate state between being new and deleted. I don't see any difference between "new" and "added".
I wouldn't have four tables. I'd have a STATUS table that would have the different possible states, and ACCOUNTS or the HISTORY table would have a foreign key to it.
Using IN clauses on long lists can be slow.
If the tables are indexed, using a LEFT JOIN can prove to be faster...
INSERT INTO [table] (
[fields]
)
SELECT
[fields]
FROM
[table1]
LEFT JOIN
[table2]
ON [join condition]
WHERE
[table2].[id] IS NULL
This assumes 1:1 relationships and not 1:many. If you have 1:many you can do any of...
1. SELECT DISTINCT
2. Use a GROUP BY clause
3. Use a different query, see below...
INSERT INTO [table] (
[fields]
)
SELECT
[fields]
FROM
[table1]
WHERE
EXISTS (SELECT * FROM [table2] WHERE [condition to match tables 1 and 2])
-- This is quick provided that all fields to match the two tables are
-- indexed in both tables. Should then be much faster than the IN clause.
You could also subtract the intersection to get the differences in one table.
If the initial file is ordered in a sensible and consistent way (big IF!), it would run considerably faster as a C# program which logically compared the files.
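For what it's worth, a minimal sketch of comparing the two lists in C# might look like the following. Rather than relying on the files being ordered, it simply loads both into sets, which is fine at this size; the file names and the one-account-number-per-line format are assumptions.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class AccountDiff
{
    static void Main(string[] args)
    {
        var current = new HashSet<string>(File.ReadAllLines("accounts_current.txt"));
        var incoming = new HashSet<string>(File.ReadAllLines("accounts_new.txt"));

        // Added: in the new file but not the old; removed: in the old file but not the new.
        List<string> added = incoming.Except(current).ToList();
        List<string> removed = current.Except(incoming).ToList();

        Console.WriteLine("Added: {0}, Removed: {1}", added.Count, removed.Count);
    }
}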
