Good practice to avoid duplicate records

I have an application where users can add, update, and delete records, and I want to know the best ways to avoid duplicate records. To avoid duplicates in this application I created an index on the table. Is that good practice, or are there other approaches?

There are a few ways to do this. If you have a unique index on a field and you try to insert a duplicate value, SQL Server will throw an error. My preferred way is to test for existence before the insert by using
IF NOT EXISTS (SELECT ID FROM MyTable WHERE MyField = @ValueToBeInserted)
BEGIN
INSERT INTO MyTable (Field1, Field2) VALUES (@Value1, @Value2)
END
You can also return a value to let you know whether the INSERT took place by adding an ELSE to the above code.
If you choose to index a field, you can set IGNORE_DUP_KEY to simply ignore any duplicate inserts. If you were inserting multiple rows, any duplicates would be ignored and the non-duplicates would continue to be inserted.
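As a rough illustration of the "let the unique index reject the duplicate" approach, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the error type and SQL dialect differ from what you would see in production; the `users` table and the `insert_if_absent` helper are made up for the example.

```python
import sqlite3


def insert_if_absent(conn, email, name):
    """Insert a row; return True if it was stored, False if the unique index rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (email, name) VALUES (?, ?)", (email, name)
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index reported a duplicate key.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE UNIQUE INDEX ux_users_email ON users (email)")

first = insert_if_absent(conn, "a@example.com", "Ann")
second = insert_if_absent(conn, "a@example.com", "Ann again")
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(first, second, row_count)
```

Relying on the index rather than a separate existence check means the database makes the decision in a single step, so two concurrent callers cannot both pass the check and then both insert.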

You can use UNIQUE constraints on columns or on a set of columns that you don't want to be duplicated; see also http://www.w3schools.com/sql/sql_unique.asp.
Here is an example for both a single-column and a multi-column unique constraint:
CREATE TABLE [Person]
(
…
[SSN] VARCHAR(…) UNIQUE, -- only works for single-column UNIQUE constraint
…
[Name] NVARCHAR(…),
[DateOfBirth] DATE,
…
UNIQUE ([Name], [DateOfBirth]) -- works for any number of columns
)
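To see how the two kinds of constraint behave, here is a small sketch under the same assumptions as the DDL above, run against SQLite through Python's sqlite3 module rather than SQL Server. The lowercase table and column names and the `try_insert` helper are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        ssn TEXT UNIQUE,               -- single-column constraint
        name TEXT,
        date_of_birth TEXT,
        UNIQUE (name, date_of_birth)   -- multi-column constraint
    )
    """
)


def try_insert(ssn, name, dob):
    """Return True if the row was accepted, False if a UNIQUE constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO person VALUES (?, ?, ?)", (ssn, name, dob))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("111", "Ann", "1990-01-01"),  # accepted
    try_insert("222", "Ann", "1991-02-02"),  # accepted: same name, different date
    try_insert("333", "Ann", "1990-01-01"),  # rejected: (name, date) pair repeats
    try_insert("111", "Bob", "1985-05-05"),  # rejected: ssn repeats
]
print(results)
```

The multi-column constraint only fires when the whole combination repeats, which is why the second insert goes through even though the name is already present.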

In my opinion, an id column is almost compulsory for a table. To avoid duplicates when inserting a row, you can simply use:
INSERT IGNORE INTO Table(id, name) VALUES (null, "blah")
This works in MySQL; I'm not sure about SQL Server.

Related

How to store data into two tables with same ID? [duplicate]

How am I supposed to get the IDENTITY of an inserted row?
I know about @@IDENTITY and IDENT_CURRENT and SCOPE_IDENTITY, but don't understand the implications or impacts attached to each.
Can someone please explain the differences and when I would be using each?

@@IDENTITY returns the last identity value generated for any table in the current session, across all scopes. You need to be careful here, since it's across scopes. You could get a value from a trigger, instead of your current statement.
SCOPE_IDENTITY() returns the last identity value generated for any table in the current session and the current scope. Generally what you want to use.
IDENT_CURRENT('tableName') returns the last identity value generated for a specific table in any session and any scope. This lets you specify which table you want the value from, in case the two above aren't quite what you need (very rare). Also, as @Guy Starbuck mentioned, "You could use this if you want to get the current IDENTITY value for a table that you have not inserted a record into."
The OUTPUT clause of the INSERT statement will let you access every row that was inserted via that statement. Since it's scoped to the specific statement, it's more straightforward than the other functions above. However, it's a little more verbose (you'll need to insert into a table variable/temp table and then query that) and it gives results even in an error scenario where the statement is rolled back. That said, if your query uses a parallel execution plan, this is the only guaranteed method for getting the identity (short of turning off parallelism). However, it is executed before triggers and cannot be used to return trigger-generated values.

I believe the safest and most accurate method of retrieving the inserted id would be using the output clause.
for example (taken from the following MSDN article)
USE AdventureWorks2008R2;
GO
DECLARE @MyTableVar table( NewScrapReasonID smallint,
Name varchar(50),
ModifiedDate datetime);
INSERT Production.ScrapReason
OUTPUT INSERTED.ScrapReasonID, INSERTED.Name, INSERTED.ModifiedDate
INTO @MyTableVar
VALUES (N'Operator error', GETDATE());
--Display the result set of the table variable.
SELECT NewScrapReasonID, Name, ModifiedDate FROM @MyTableVar;
--Display the result set of the table.
SELECT ScrapReasonID, Name, ModifiedDate
FROM Production.ScrapReason;
GO

I'm saying the same thing as the other answers, so everyone's correct; I'm just trying to make it clearer.
@@IDENTITY returns the id of the last thing that was inserted by your client's connection to the database.
Most of the time this works fine, but sometimes a trigger will go and insert a new row that you don't know about, and you'll get the ID from this new row instead of the one you want.
SCOPE_IDENTITY() solves this problem. It returns the id of the last thing that you inserted in the SQL code you sent to the database. If triggers go and create extra rows, they won't cause the wrong value to get returned. Hooray.
IDENT_CURRENT returns the last ID that was inserted by anyone. If some other app happens to insert another row at an unfortunate time, you'll get the ID of that row instead of yours.
If you want to play it safe, always use SCOPE_IDENTITY(). If you stick with @@IDENTITY and someone decides to add a trigger later on, all your code will break.
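The trigger problem is easy to reproduce in miniature. The sketch below is an analogy, not SQL Server behaviour: it uses SQLite through Python's sqlite3 module, where `cursor.lastrowid` reports the row inserted by the caller's own statement and is not overwritten by a trigger's insert, which is roughly the guarantee SCOPE_IDENTITY() gives. SQLite has no counterpart to @@IDENTITY, and the `orders` and `audit_log` tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE audit_log (id INTEGER PRIMARY KEY, note TEXT);
    -- Pre-fill the audit table so its ids run ahead of the orders ids.
    INSERT INTO audit_log (note) VALUES ('a'), ('b'), ('c');
    -- The trigger inserts a second row behind the caller's back.
    CREATE TRIGGER orders_audit AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit_log (note) VALUES ('order ' || NEW.id);
    END;
    """
)

cur = conn.execute("INSERT INTO orders (item) VALUES ('widget')")
new_order_id = cur.lastrowid  # id of the row this statement inserted
latest_audit_id = conn.execute("SELECT MAX(id) FROM audit_log").fetchone()[0]
print(new_order_id, latest_audit_id)
```

The two numbers differ, and a function that reported the trigger's row instead of the caller's would hand back the audit id, which is exactly the failure described for @@IDENTITY.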

The best (read: safest) way to get the identity of a newly-inserted row is by using the output clause:
create table TableWithIdentity
( IdentityColumnName int identity(1, 1) not null primary key,
... )
-- type of this table's column must match the type of the
-- identity column of the table you'll be inserting into
declare @IdentityOutput table ( ID int )
insert TableWithIdentity
( ... )
output inserted.IdentityColumnName into @IdentityOutput
values
( ... )
select @IdentityValue = (select ID from @IdentityOutput)

Add
SELECT CAST(scope_identity() AS int);
to the end of your insert sql statement, then
NewId = command.ExecuteScalar()
will retrieve it.

From MSDN
@@IDENTITY, SCOPE_IDENTITY, and IDENT_CURRENT are similar functions in that they return the last value inserted into the IDENTITY column of a table.
@@IDENTITY and SCOPE_IDENTITY will return the last identity value generated in any table in the current session. However, SCOPE_IDENTITY returns the value only within the current scope; @@IDENTITY is not limited to a specific scope.
IDENT_CURRENT is not limited by scope and session; it is limited to a specified table. IDENT_CURRENT returns the identity value generated for a specific table in any session and any scope. For more information, see IDENT_CURRENT.
IDENT_CURRENT is a function which takes a table as an argument.
@@IDENTITY may return a confusing result when you have a trigger on the table.
SCOPE_IDENTITY is your hero most of the time.

When you use Entity Framework, it internally uses the OUTPUT technique to return the newly inserted ID value
DECLARE @generated_keys table([Id] uniqueidentifier)
INSERT INTO TurboEncabulators(StatorSlots)
OUTPUT inserted.TurboEncabulatorID INTO @generated_keys
VALUES('Malleable logarithmic casing');
SELECT t.[TurboEncabulatorID]
FROM @generated_keys AS g
JOIN dbo.TurboEncabulators AS t
ON g.Id = t.TurboEncabulatorID
WHERE @@ROWCOUNT > 0
The output results are stored in a temporary table variable, joined back to the table, and the row value is returned out of the table.
Note: I have no idea why EF would inner join the ephemeral table back to the real table (under what circumstances would the two not match).
But that's what EF does.
This technique (OUTPUT) is only available on SQL Server 2005 or newer.
Edit - The reason for the join
The reason that Entity Framework joins back to the original table, rather than simply use the OUTPUT values is because EF also uses this technique to get the rowversion of a newly inserted row.
You can use optimistic concurrency in your Entity Framework models by using the Timestamp attribute:
public class TurboEncabulator
{
public String StatorSlots { get; set; }
[Timestamp]
public byte[] RowVersion { get; set; }
}
When you do this, Entity Framework will need the rowversion of the newly inserted row:
DECLARE @generated_keys table([Id] uniqueidentifier)
INSERT INTO TurboEncabulators(StatorSlots)
OUTPUT inserted.TurboEncabulatorID INTO @generated_keys
VALUES('Malleable logarithmic casing');
SELECT t.[TurboEncabulatorID], t.[RowVersion]
FROM @generated_keys AS g
JOIN dbo.TurboEncabulators AS t
ON g.Id = t.TurboEncabulatorID
WHERE @@ROWCOUNT > 0
And in order to retrieve this Timestamp you cannot use an OUTPUT clause.
That's because if there's a trigger on the table, any Timestamp you OUTPUT will be wrong:
Initial insert. Timestamp: 1
OUTPUT clause outputs timestamp: 1
trigger modifies row. Timestamp: 2
The returned timestamp will never be correct if you have a trigger on the table. So you must use a separate SELECT.
And even if you were willing to suffer the incorrect rowversion, the other reason to perform a separate SELECT is that you cannot OUTPUT a rowversion into a table variable:
DECLARE @generated_keys table([Id] uniqueidentifier, [Rowversion] timestamp)
INSERT INTO TurboEncabulators(StatorSlots)
OUTPUT inserted.TurboEncabulatorID, inserted.Rowversion INTO @generated_keys
VALUES('Malleable logarithmic casing');
The third reason to do it is for symmetry. When performing an UPDATE on a table with a trigger, you cannot use an OUTPUT clause. Trying to do an UPDATE with an OUTPUT is not supported, and will give an error:
Cannot use UPDATE with OUTPUT clause when a trigger is on the table
The only way to do it is with a follow-up SELECT statement:
UPDATE TurboEncabulators
SET StatorSlots = 'Lotus-O deltoid type'
WHERE ((TurboEncabulatorID = 1) AND (RowVersion = 792))
SELECT RowVersion
FROM TurboEncabulators
WHERE @@ROWCOUNT > 0 AND TurboEncabulatorID = 1

I can't speak to other versions of SQL Server, but in 2012, outputting directly works just fine. You don't need to bother with a temporary table.
INSERT INTO MyTable
OUTPUT INSERTED.ID
VALUES (...)
By the way, this technique also works when inserting multiple rows.
INSERT INTO MyTable
OUTPUT INSERTED.ID
VALUES
(...),
(...),
(...)
Output
ID
2
3
4

@@IDENTITY is the last identity inserted using the current SQL Connection. This is a good value to return from an insert stored procedure, where you just need the identity inserted for your new record, and don't care if more rows were added afterward.
SCOPE_IDENTITY is the last identity inserted using the current SQL Connection, and in the current scope -- that is, if there was a second IDENTITY inserted based on a trigger after your insert, it would not be reflected in SCOPE_IDENTITY, only the insert you performed. Frankly, I have never had a reason to use this.
IDENT_CURRENT(tablename) is the last identity inserted regardless of connection or scope. You could use this if you want to get the current IDENTITY value for a table that you have not inserted a record into.

ALWAYS use scope_identity(), there's NEVER a need for anything else.

One other way to guarantee the identity of the rows you insert is to specify the identity values and use the SET IDENTITY_INSERT ON and then OFF. This guarantees you know exactly what the identity values are! As long as the values are not in use then you can insert these values into the identity column.
CREATE TABLE #foo
(
fooid INT IDENTITY NOT NULL,
fooname VARCHAR(20)
)
SELECT @@Identity AS [@@Identity],
Scope_identity() AS [SCOPE_IDENTITY()],
Ident_current('#Foo') AS [IDENT_CURRENT]
SET IDENTITY_INSERT #foo ON
INSERT INTO #foo
(fooid,
fooname)
VALUES (1,
'one'),
(2,
'Two')
SET IDENTITY_INSERT #foo OFF
SELECT @@Identity AS [@@Identity],
Scope_identity() AS [SCOPE_IDENTITY()],
Ident_current('#Foo') AS [IDENT_CURRENT]
INSERT INTO #foo
(fooname)
VALUES ('Three')
SELECT @@Identity AS [@@Identity],
Scope_identity() AS [SCOPE_IDENTITY()],
Ident_current('#Foo') AS [IDENT_CURRENT]
-- YOU CAN INSERT
SET IDENTITY_INSERT #foo ON
INSERT INTO #foo
(fooid,
fooname)
VALUES (10,
'Ten'),
(11,
'Eleven')
SET IDENTITY_INSERT #foo OFF
SELECT @@Identity AS [@@Identity],
Scope_identity() AS [SCOPE_IDENTITY()],
Ident_current('#Foo') AS [IDENT_CURRENT]
SELECT *
FROM #foo
This can be a very useful technique if you are loading data from another source or merging data from two databases etc.

Create a UUID and insert it into a column as well. Then you can easily identify your row by the UUID. That's the only 100% working solution you can implement. All the other solutions are too complicated or do not work in some edge cases.
E.g.:
1) Create row
INSERT INTO table (uuid, name, street, zip)
VALUES ('2f802845-447b-4caa-8783-2086a0a8d437', 'Peter', 'Mainstreet 7', '88888');
2) Get created row
SELECT * FROM table WHERE uuid='2f802845-447b-4caa-8783-2086a0a8d437';
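In application code the UUID would normally be generated on the client just before the insert. A minimal sketch of that flow, using Python's uuid and sqlite3 modules as stand-ins for your own data layer (the `customers` table and `insert_customer` helper are hypothetical):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, uuid TEXT UNIQUE, name TEXT)"
)


def insert_customer(name):
    """Insert a row tagged with a client-generated UUID and return that UUID."""
    row_uuid = str(uuid.uuid4())
    with conn:
        conn.execute(
            "INSERT INTO customers (uuid, name) VALUES (?, ?)", (row_uuid, name)
        )
    return row_uuid


peter_uuid = insert_customer("Peter")
fetched = conn.execute(
    "SELECT name FROM customers WHERE uuid = ?", (peter_uuid,)
).fetchone()
print(fetched[0])
```

Because the caller already knows the UUID before the insert runs, nothing has to be read back from the database to identify the new row.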

Even though this is an older thread, there is a newer way to do this that avoids some of the pitfalls of the IDENTITY column in older versions of SQL Server, like gaps in the identity values after server reboots. SQL Server 2012 and later support SEQUENCE objects, which you create using T-SQL. A sequence lets you define your own numeric sequence in SQL Server and control how it increments.
Here is an example:
CREATE SEQUENCE CountBy1
START WITH 1
INCREMENT BY 1 ;
GO
Then in TSQL you would do the following to get the next sequence ID:
SELECT NEXT VALUE FOR CountBy1 AS SequenceID
GO
Here are the links to CREATE SEQUENCE and NEXT VALUE FOR

Complete solution in SQL and ADO.NET
const string sql = "INSERT INTO [Table1] (...) OUTPUT INSERTED.Id VALUES (...)";
using var command = connection.CreateCommand();
command.CommandText = sql;
var outputIdParameter = new SqlParameter("@Id", SqlDbType.Int) { Direction = ParameterDirection.Output };
command.Parameters.Add(outputIdParameter);
await connection.OpenAsync();
var outputId = await command.ExecuteScalarAsync();
await connection.CloseAsync();
int id = Convert.ToInt32(outputId);

Add the following after your INSERT statement, making sure the table name is the table you inserted into. It returns the most recent identity value generated for that table, which is the row your INSERT just added.
IDENT_CURRENT('tableName')

SQL Server allow duplicates in any column, but not all columns

I've searched through numerous threads to try to find an answer to this but any answer I've found suggests using a unique constraint on a single column, or multiple columns.
My problem is, I'm writing an application in C# with a SQL Server back end. One of the features is to allow a user to import a .CSV file into the database after a little bit of pre-processing. I need to find the quickest method to prevent the user from importing the same data more than once. The data will look something like
ID -- will be auto-generated in SQL Server (PK)
Date Time(datetime)
Machine(nchar)
...
...
...
Name(nchar)
Age(int)
I want to allow any number of the columns to hold duplicate values, as long as the entire record is not a duplicate.
I was thinking of creating another column in the database, obtained by hashing all of the columns together, and making it unique, but I wasn't sure whether that was the most efficient method, or whether the resulting hash would be guaranteed unique. The CSV files will only be around 60 MB, but there will be tens of thousands of them.
Any help would be appreciated.
Thanks

You should be able to resolve this by creating a unique constraint which includes all the columns.
create table #a (col1 varchar(10), col2 varchar(10))
ALTER TABLE #a
ADD CONSTRAINT UQ UNIQUE NONCLUSTERED
(col1, col2)
-- Works, duplicate entries in columns
insert into #a (col1, col2)
values ('a', 'b')
,('a', 'c')
,('b', 'c')
-- Fails, full duplicate record:
insert into #a (col1, col2)
values ('a1', 'b1')
,('a1', 'b1')

The code below can work to ensure that you don't duplicate the [Date Time], Machine, [Name] and Age columns when you insert the data.
It's important to ensure that, at the time the code runs, each row of the incoming dataset has a unique ID. The subquery selects the IDs of import rows whose four other values already exist in the destination table, and the outer query simply skips those rows.
INSERT INTO MAIN_TABLE ([Date Time],Machine,[Name],Age)
SELECT [Date Time],Machine,[Name],Age
FROM IMPORT_TABLE WHERE ID NOT IN
(
SELECT I.ID FROM IMPORT_TABLE I INNER JOIN MAIN_TABLE M
ON I.[Date Time]=M.[Date Time]
AND I.Machine=M.Machine
AND I.[Name]=M.[Name]
AND I.Age=M.Age
)
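The same "copy only the rows that are not already there" idea can be tried out quickly. This sketch runs against SQLite through Python's sqlite3 module instead of SQL Server, uses a reduced set of made-up columns, and swaps the NOT IN join for an equivalent NOT EXISTS check, so treat it as an illustration of the pattern rather than a drop-in replacement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, machine TEXT, name TEXT, age INTEGER);
    CREATE TABLE import_table (id INTEGER PRIMARY KEY, machine TEXT, name TEXT, age INTEGER);
    INSERT INTO main_table (machine, name, age) VALUES ('M1', 'Ann', 30);
    INSERT INTO import_table (machine, name, age) VALUES
        ('M1', 'Ann', 30),   -- full duplicate of an existing row
        ('M1', 'Ann', 31),   -- differs in one column, so it is new
        ('M2', 'Bob', 30);   -- new
    """
)

before = conn.execute("SELECT COUNT(*) FROM main_table").fetchone()[0]
with conn:
    # Copy only import rows that have no exact match in the destination.
    conn.execute(
        """
        INSERT INTO main_table (machine, name, age)
        SELECT i.machine, i.name, i.age
        FROM import_table AS i
        WHERE NOT EXISTS (
            SELECT 1 FROM main_table AS m
            WHERE m.machine = i.machine AND m.name = i.name AND m.age = i.age
        )
        """
    )
total = conn.execute("SELECT COUNT(*) FROM main_table").fetchone()[0]
inserted = total - before
print(inserted, total)
```

Note that this only compares the import against rows already in the destination; two identical rows inside the same import batch would both be copied unless the batch is de-duplicated first.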

What is the best way prevent duplicate records in a SQL Server database

what is the best way to prevent duplicate records in a SQL Server database? Using triggers? Using a unique constraint?

Use unique constraints on one or more columns in the table.
Example:
CREATE TABLE Persons
(
P_Id int NOT NULL UNIQUE,
LastName varchar(255) NOT NULL UNIQUE,
FirstName varchar(255) NOT NULL UNIQUE,
Address varchar(255),
City varchar(255)
)
Alter existing table as below
ALTER TABLE Persons
ADD CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName)
If you are using a front-end application to populate the table, run a validation SELECT query from the application to check for duplicates before inserting into the database. Constraints will prevent duplicates by throwing an exception.
Note: The above example works in SQL Server, Oracle, and MS Access.
For a much more in-depth solution, see How to prevent duplicate records being inserted with SqlBulkCopy when there is no primary key

If you don't want an error thrown by the unique constraint, and you want the database to accept duplicated input but insert nothing for it, you may want to look at the MERGE statement.
https://technet.microsoft.com/en-us/library/bb522522%28v=sql.105%29.aspx?f=255&MSPPError=-2147217396

Entity Framework Ignore Duplicate Primary Key Values while inserting Multiple Rows at once

I am using LINQ along with Entity Framework to insert some data into a SQL Server 2012 database.
The table the data is inserted into has a primary key, and I am inserting about 1000 records at once.
That is, I retrieve data in sets of 1000 rows and save those 1000 rows at one time for performance reasons.
The problem is that I may occasionally get a duplicate value in one of those 1000 rows, and when that happens none of the rows are saved in the database.
Is there any way I can just silently ignore that one row and not insert it, while all the other non-duplicate rows get inserted?
I did try querying the database before every insert, but the performance cost of that is too high.

Is there any way I can just silently ignore that one row and not insert it, while all the other non-duplicate rows get inserted?
If you can recreate the index in SQL Server, you can ignore duplicates. After the insert, recreate the index without IGNORE_DUP_KEY because it's faster without it.
CREATE TABLE [dbo].[YourTable](
[id] [int] NOT NULL,
PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
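For a feel of what "skip the duplicates, keep the rest" looks like from the application side, here is a small sketch. SQLite has no IGNORE_DUP_KEY index option, so this uses its `INSERT OR IGNORE` statement as the closest analogue, driven from Python's sqlite3 module; the `your_table` layout and the sample batch are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO your_table VALUES (2, 'already there')")

batch = [(1, "one"), (2, "two (duplicate key)"), (3, "three")]

with conn:
    # OR IGNORE skips rows that would violate the key and keeps going.
    conn.executemany("INSERT OR IGNORE INTO your_table VALUES (?, ?)", batch)

rows = conn.execute("SELECT id, payload FROM your_table ORDER BY id").fetchall()
print(rows)
```

The existing row with id 2 is left untouched and the other two rows are stored, which matches the behaviour the question asks for.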

I would suggest doing a bit of extra processing before the insert statement. Without any code to go off of, I will try to show you with pseudocode.
ICollection<record> myRecords = Service.GetMyRecords();
//your processing logic before the insert
ICollection<record> recordsToInsert = Service.BusinessLogic(myRecords);
//iterate over a copy, because removing from a collection while enumerating it throws
foreach(var record in recordsToInsert.ToList())
{
if(myRecords.Contains(record))
{
recordsToInsert.Remove(record);
}
}
This should ensure you have no records in the recordsToInsert collection that will trip your DB. It also saves you an attempted insert statement, since it doesn't try and fail.

Do one query first that checks whether any of the new records exist like so:
var checks = records.Select(r => r.Id).ToArray();
if (!context.Records.Any(r => checks.Contains(r.Id)))
{
// do the insert
}
After the first check you could refine the check to find out which of the 1000 records is the culprit. So the happy scenario will always be pretty quick. Only when a duplicate is found the process will be slower.
You can't tell EF to silently ignore one database exception while running one transaction.
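The "one query up front" idea can be taken a step further by fetching the ids that already exist and filtering the batch in memory, so only one round trip is needed before the insert. A minimal sketch, written in Python with sqlite3 rather than EF, with a hypothetical `records` table and `insert_new_records` helper:

```python
import sqlite3


def insert_new_records(conn, records):
    """Insert only records whose id is not in the table yet; return how many were inserted."""
    if not records:
        return 0
    ids = [r[0] for r in records]
    placeholders = ",".join("?" for _ in ids)
    # One query finds every id from the batch that is already stored.
    seen = {
        row[0]
        for row in conn.execute(
            f"SELECT id FROM records WHERE id IN ({placeholders})", ids
        )
    }
    fresh = []
    for record in records:
        if record[0] not in seen:  # also drops repeats inside the batch itself
            seen.add(record[0])
            fresh.append(record)
    with conn:
        conn.executemany("INSERT INTO records (id, value) VALUES (?, ?)", fresh)
    return len(fresh)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO records VALUES (2, 'old')")

inserted = insert_new_records(conn, [(1, "a"), (2, "b"), (3, "c"), (3, "c again")])
stored_ids = [row[0] for row in conn.execute("SELECT id FROM records ORDER BY id")]
print(inserted, stored_ids)
```

Databases cap the number of parameters a single statement may carry, so a large batch may need to be split into chunks before building the IN list. Another session can also insert a clashing row between the check and the insert, so a key constraint is still worth keeping as a backstop.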

Build up list first, or query for every record? (Have to check for duplicates)

I'm doing some web scraping to build up a personal SQL database. As I'm looping through the web requests, I'm adding records. The only thing is, duplicates sometimes appear in the web requests and I want to make sure to only add a record if it doesn't already exist in my database. I gather this can be done by performing an SQL query before every insert to make sure that record hasn't already been added, but is this the best way to do it? Would it make more sense to build up a Generic.List first, and then do all my database inserts at the end?

You can create a stored procedure that will attempt to update a record and then insert if the update query did not update any rows. This will minimize the number of queries that need to be run and prevent checking for the row's existence. A little bit of Googling found this. The second option looks like it might be what you are looking for.
/*
Same SP is used to INSERT as well as UPDATE a table.
Here we avoid unnecessary checking of whether the record exists or not.
Instead, try to UPDATE directly. If there is no record then @@ROWCOUNT will be 0.
Based on that, INSERT it as a new record.
*/
CREATE PROCEDURE uspUPSert_Ver2
(
@empID INT,
@fname VARCHAR(25),
@lname VARCHAR(25),
@emailid VARCHAR(50)
)
AS
BEGIN
SET NOCOUNT ON
BEGIN TRAN
UPDATE tblUpsert WITH (SERIALIZABLE)
SET emailid = @emailid ,
firstname = @fname ,
lastname = @lname
WHERE EmpID = @empID
IF @@ROWCOUNT = 0
BEGIN
INSERT INTO tblUpsert VALUES (@empID, @fname, @lname, @emailid)
END
END
COMMIT TRAN
END
GO
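The update-then-insert logic of the procedure can also be driven from application code. The following is only a sketch of that control flow, using Python's sqlite3 module and an invented `employees` table; it does not reproduce the SERIALIZABLE locking hint, which is what protects the T-SQL version when several sessions run at once.

```python
import sqlite3


def upsert_employee(conn, emp_id, first_name, last_name, email):
    """Try UPDATE first; INSERT only when no row was updated. Returns the action taken."""
    with conn:  # one transaction around both statements
        cur = conn.execute(
            "UPDATE employees SET first_name = ?, last_name = ?, email = ? "
            "WHERE emp_id = ?",
            (first_name, last_name, email, emp_id),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO employees (emp_id, first_name, last_name, email) "
                "VALUES (?, ?, ?, ?)",
                (emp_id, first_name, last_name, email),
            )
            return "inserted"
        return "updated"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT)"
)

actions = [
    upsert_employee(conn, 1, "Ann", "Lee", "ann@example.com"),
    upsert_employee(conn, 1, "Ann", "Lee", "ann.lee@example.com"),
]
email = conn.execute("SELECT email FROM employees WHERE emp_id = 1").fetchone()[0]
print(actions, email)
```

For the scraping case this means each scraped record costs at most two statements, and a repeated record simply overwrites itself instead of creating a duplicate.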

It seems like you would need either a primary key or a unique constraint on the columns that identify the rows as duplicates. Then, if an insert violates the unique constraint, the row won't be inserted. Catch the exception, log it to a different table for future validation, and move on to the next row.
http://www.w3schools.com/sql/sql_unique.asp
