SQL batch insert, avoid duplicates, no PK

I was given a task to insert over 1000 rows with 4 columns. The table in question does not have a PK or FK. Let's say it contains columns ID, CustomerNo, Description. The records to be inserted can have the same CustomerNo and Description values.
I read about importing data to a temporary table, comparing it with the real table, removing duplicates, and moving new records to the real table.
I also could have 1000 queries that check if such a record already exists and insert data if it does not. But I'm too ashamed to try that out for obvious reasons.
I'm not expecting any specific code, because I did not give any specific details. What I'm hoping for is some pseudocode or general advice for completing such tasks. I can't wait to give some upvotes!

So the idea is, you don't want to insert an entry if there's already an entry with the same ID?
If so, after you import your data into a temporary table, you can accomplish what you're looking for in the where clause of a select statement:
insert into target_table (ID, CustomerNo, Description)
select ID, CustomerNo, Description from #data_source
where #data_source.ID not in (select target_table.ID from target_table)
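A minimal sketch of this staging-table approach, using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption; the thread itself uses a T-SQL temp table). The table names target and staging and the helper insert_new_rows are hypothetical. It matches on ID only, as in the answer, and uses NOT EXISTS rather than NOT IN, because NOT IN matches nothing when the subquery returns a NULL.

```python
import sqlite3

def insert_new_rows(conn, rows):
    """Stage the batch, then copy only rows whose ID is not yet in target."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMP TABLE staging (ID INTEGER, CustomerNo TEXT, Description TEXT)"
    )
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
    # DISTINCT removes exact duplicates inside the batch itself;
    # NOT EXISTS skips IDs that are already present in the target.
    cur.execute("""
        INSERT INTO target (ID, CustomerNo, Description)
        SELECT DISTINCT s.ID, s.CustomerNo, s.Description
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM target AS t WHERE t.ID = s.ID)
    """)
    inserted = cur.rowcount
    cur.execute("DROP TABLE staging")
    conn.commit()
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (ID INTEGER, CustomerNo TEXT, Description TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'C1', 'existing')")

batch = [(1, "C1", "existing"), (2, "C2", "new"), (2, "C2", "new"), (3, "C2", "new")]
inserted = insert_new_rows(conn, batch)
ids = [r[0] for r in conn.execute("SELECT ID FROM target ORDER BY ID")]
```

If duplicates should be judged on more than the ID, the NOT EXISTS predicate can be extended with the other columns.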

I would suggest loading the data into a temp table or table variable. Then you can do a "Select Into" using the DISTINCT keyword, which will remove the duplicate records.

You will always need to read the target table, unless you bulk load the target table into a temp table (at this point you will have two temp tables), compare both, eliminate duplicates, and then insert into the target table. But even this is not accurate, because a new row can be inserted into the target table while you do this.

Related

Editing duplicate values in a database

I have a DataGridView pulling some items from my database. What I want to achieve is to be able to edit the pack size or the bar_code fields. I know how to update values in a database, but how would I go about it if the data is the same? In many instances a bar code has multiple pack sizes related to that one bar code number. Let's say I have the below screenshot. A data entry error was made and the bar_code and PackSize columns are the exact same. I want to change the first bar code to "1234." How would I achieve this? I can't say update barcode to 'textBox1.Text' where bar_code = '771313166386' because it would then change both rows. How do I focus on only one row of data at a time?

You can try using this query to update only the first row:
UPDATE TOP (1) my_table
SET bar_code = '1234'
WHERE bar_code = '771313166386'
You should have an auto-increment id column or a Primary key in your table.

I'd suggest you handle the duplicate-data logic at the backend rather than pulling the rows into the grid and handling it there.
The following query will help you retrieve the duplicate records based on the mentioned columns. You can change it to UPDATE or DELETE as per your requirement.
-- Using a CTE and a ranking function
;With CTE
As
(
    Select
        Product,
        Description,
        BarCode,
        PackSize,
        Row_Number() Over(Partition By Product, BarCode, PackSize Order By Product) As RowNum
    From YourTable
)
Select * From CTE
-- Where RowNum > 1;
Hope this is helpful :)
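As a rough illustration of turning that ranking query into an update of just the extra copy, here is a sketch in Python with the built-in sqlite3 module standing in for SQL Server (an assumption, and it needs SQLite 3.25 or later for window functions). The products table and its sample rows are hypothetical, and SQLite's implicit rowid is used to address a single physical row, which a SQL Server table does not offer in this form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (Product TEXT, BarCode TEXT, PackSize INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Soda", "771313166386", 6),
        ("Soda", "771313166386", 6),   # exact duplicate entered by mistake
        ("Soda", "771313166386", 12),
    ],
)

# Number the rows inside each (Product, BarCode, PackSize) group.
numbered = """
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (PARTITION BY Product, BarCode, PackSize
                              ORDER BY rowid) AS RowNum
    FROM products
"""
# Every row numbered above 1 is an extra copy of an earlier row.
extra_copies = [rid for rid, row_num in conn.execute(numbered) if row_num > 1]

# Change the bar code on the extra copies only, one row at a time.
conn.executemany(
    "UPDATE products SET BarCode = '1234' WHERE rowid = ?",
    [(rid,) for rid in extra_copies],
)
rows = sorted(conn.execute("SELECT BarCode, PackSize FROM products"))
```

Only the second of the two identical rows is touched; the row with pack size 12 keeps its bar code because it sits in a different partition.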

This might not answer your question directly, but it is important to mention that your table design is incorrect. You should ensure data integrity by creating a primary key in your table.
So when you need to update a product you have only one row to update.
Then you can add more tables and use foreign key references between them.

You need to uniquely represent the products. Judging by your sample data, I guess that there isn't any primary key on your table.
One option is to specify a unique constraint on the columns, so that this type of data entry error cannot happen.
If you cannot come up with a list of columns that uniquely identifies the rows, you can use a surrogate key by adding an identity column, and then when updating always include a condition such as where thisIdentityColumn=value

"A data entry error was made and the bar_code and PackSize columns are the exact same"
I think this is the key. Essentially, the exact duplicates are unintentional, and the rows should be unique. Further, it looks like bar_code + pack_size is your primary key (subject to data being entered correctly).
So, when you do an update, simply update the first row found that matches a bar_code and a pack_size. If it isn't unique, then the update should ensure that you are one step closer to unique rows in the database.
If you need a non-verbal answer, let me know.

Query and export from unsortable table SQL Server

First, sorry for my bad English; it is not my native language.
My problem is: I have a table with around 10 million bank transaction records. It doesn't have a PK and isn't sorted by any column.
My job is to create a page to filter it and export it to CSV. But the limit for a CSV export is around 200k records.
I have some ideas, like:
create 800 tables for 800 ATMs (just an idea, I know it's stupid) and send data from the main table to them once per day => export to 800 CSV files
use Linq to get 100k records at a time, then skip those the next time. But I am stuck because the Skip command needs OrderBy, and I got an OutOfMemoryException with it
db.tblEJTransactions.OrderBy(u => u.Id).Take(100000).ToList()
Can anyone help me? Every idea is welcome (my boss said I can use anything, including creating hundreds of tables, using NoSQL ... )

If you don't have a primary key in your table, then add one.
The simplest and easiest is to add an int IDENTITY column.
ALTER TABLE dbo.T
ADD ID int NOT NULL IDENTITY (1, 1)
ALTER TABLE dbo.T
ADD CONSTRAINT PK_T PRIMARY KEY CLUSTERED (ID)
If you can't alter the original table, create a copy.
Once the table has a primary key you can sort by it and select chunks/pages of 200K rows with predictable results.
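Once such a key exists, paging can be sketched as below. This uses Python's built-in sqlite3 as a stand-in for SQL Server (an assumption), a hypothetical Amount column, and keyset paging (WHERE ID > last seen ID) rather than Skip/OFFSET, so each chunk is a bounded query and no chunk re-reads the rows before it. In T-SQL the same shape would be SELECT TOP (n) ... WHERE ID > @last ORDER BY ID.

```python
import csv
import io
import sqlite3

def export_chunks(conn, chunk_size):
    """Yield CSV text for successive chunks of at most chunk_size rows, ordered by ID."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT ID, Amount FROM T WHERE ID > ? ORDER BY ID LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        yield buf.getvalue()
        last_id = rows[-1][0]   # resume after the last ID already exported

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID INTEGER PRIMARY KEY, Amount INTEGER)")
conn.executemany("INSERT INTO T (Amount) VALUES (?)", [(i * 10,) for i in range(5)])

chunks = list(export_chunks(conn, 2))
```

With a chunk size of 200,000 each yielded string would become one CSV file; only one chunk is held in memory at a time.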

I'm not sure about my solution, but you can try it:
select top 1000000 *, row_number() over (order by (select null)) as RowNum from tblEJTransactions
The above query returns a numbered list, although ordering by (select null) does not guarantee the same numbering on every run.
You can then use Linq to get the result.

3 records with same ID but change different columns using SqlBulkCopy

I am doing a conversion with SqlBulkCopy. I currently have an IList collection of classes which I can convert to a DataTable for use with SqlBulkCopy.
Problem is that I can have 3 records with the same ID.
Let me explain .. here are 3 records
ID  Name     Address
1   Scott    London
1   Mark     London
1   (blank)  Manchester
Basically I need to insert them sequentially: I insert record 1 if it doesn't exist; for the next record, if it exists, I need to update the record rather than insert a new one (notice the ID is still 1), so in the case of the second record I replace both columns Name and Address on ID 1.
Finally, on the 3rd record you'll notice that Name is missing but the ID is 1 and the address is Manchester, so I need to update the record, NOT changing Name but updating Address to Manchester. Hence the 3rd record would make ID 1 =
ID  Name  Address
1   Mark  Manchester
Any ideas how I can do this? I am at a loss.
Thanks.
EDIT
OK, a little update. I will manage and merge my records before using SqlBulkCopy. Is it possible to get a list of what succeeded and what failed, or is it a case of all or nothing? I presume there is no alternative to SqlBulkCopy other than to do updates?
It would be ideal to be able to insert everything and have the ones that failed inserted into a temp table, so I only need to worry about correcting the ones in my failed table, as I know the others are all OK.

Since you need to process that data into a DataTable anyway (unless you are writing a custom IDataReader), you should merge the records before giving them to SqlBulkCopy; for example (in pseudo code):
/* create empty data-table */
foreach(var item in list) {
    var row = /* try to get existing row from data-table based on item's id */
    if(row == null) { row = /* create and append row to data-table */ }
    else { /* merge non-trivial properties into existing row */ }
}
then pass the DataTable to SqlBulkCopy once you have the desired data.
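The merge step can be sketched in Python as follows. This is only an illustration of the idea, not the answerer's code: merge_records is a hypothetical helper, plain dictionaries stand in for DataTable rows, and "non-trivial" is assumed to mean neither None nor an empty string.

```python
def merge_records(records):
    """Collapse records sharing an ID; later non-empty values overwrite earlier ones."""
    merged = {}
    for rec in records:
        # Get the existing row for this ID, or create and register a new one.
        row = merged.setdefault(rec["ID"], {"ID": rec["ID"]})
        for field, value in rec.items():
            if value not in (None, ""):   # empty means "leave the old value alone"
                row[field] = value
    return list(merged.values())

records = [
    {"ID": 1, "Name": "Scott", "Address": "London"},
    {"ID": 1, "Name": "Mark", "Address": "London"},
    {"ID": 1, "Name": None, "Address": "Manchester"},
]
result = merge_records(records)
```

For the three sample records from the question this leaves a single row for ID 1 with Name Mark and Address Manchester, which is what would then be handed to the bulk copy.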
Re the edit; in that scenario, I would upload to a staging table (just a regular table that has a schema like the uploaded data, but typically no foreign keys etc), then use regular TSQL to move the data into the transactional tables. In addition to full TSQL support this also allows better logging of operations. In particular, perhaps look at the OUTPUT clause of INSERT which can help complex bulk operations.

You can't do updates with bulk copy (bulk insert), only insert. Hence the name.
You need to fix the data before you insert them. If this means you have updates to pre-existing rows, you can't insert those as that will generate the key conflict.
You have three options: bulk insert into a temporary table and then run the appropriate insert or update statements from it; bulk insert only the new rows and issue update statements for the rest; or fetch the pre-existing rows, delete them, fix the data, and reinsert everything.
But there's no way to persuade bulk copy to update an existing row.
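A small sketch of the first option (load into a staging table, then update existing rows and insert new ones), again with Python's built-in sqlite3 standing in for SQL Server. The customers and staging tables are hypothetical, and the sketch assumes the staging data has already been merged down to one row per ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
    CREATE TABLE staging   (ID INTEGER, Name TEXT, Address TEXT);
    INSERT INTO customers VALUES (1, 'Scott', 'London');
    INSERT INTO staging VALUES (1, 'Mark', 'Manchester'), (2, 'Ann', 'Leeds');
""")

with conn:  # one transaction: both statements commit or roll back together
    # Step 1: update rows that already exist in the target.
    conn.execute("""
        UPDATE customers
        SET Name    = (SELECT s.Name    FROM staging AS s WHERE s.ID = customers.ID),
            Address = (SELECT s.Address FROM staging AS s WHERE s.ID = customers.ID)
        WHERE ID IN (SELECT ID FROM staging)
    """)
    # Step 2: insert rows whose ID is not in the target yet.
    conn.execute("""
        INSERT INTO customers (ID, Name, Address)
        SELECT s.ID, s.Name, s.Address
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM customers AS c WHERE c.ID = s.ID)
    """)

rows = conn.execute("SELECT ID, Name, Address FROM customers ORDER BY ID").fetchall()
```

Running the update before the insert matters: in the opposite order the freshly inserted rows would be updated again with their own values.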

How to fetch lots of database table records by primary key?

Using the ADO.NET MySQL Connector, what is a good way to fetch lots of records (1000+) by primary key?
I have a table with just a few small columns, and a VARCHAR(128) primary key. Currently it has about 100k entries, but this will become more in the future.
In the beginning, I thought I would use the SQL IN statement:
SELECT * FROM `table` WHERE `id` IN ('key1', 'key2', [...], 'key1000')
But with this the query could become very long, and I would also have to manually escape quote characters in the keys, etc.
Now I use a MySQL MEMORY table (tempid INT, id VARCHAR(128)) to first upload all the keys with prepared INSERT statements. Then I make a join to select all the existing keys, after which I clean up the mess in the memory table.
Is there a better way to do this?
Note: OK, maybe it's not the best idea to have a string as primary key, but the question would be the same if the VARCHAR column were a normal index.
Temporary table: So far it seems the solution is to put the data into a temporary table, and then JOIN, which is basically what I currently do (see above).

I've dealt with a similar situation in a payroll system where the user needed to generate reports based on a selection of employees (e.g. employees X, Y, Z... or employees that work in certain offices). I built a filter window with all the employees and all the attributes that could be considered as filter criteria, and had that window save the selected employee IDs in a filter table in the database. I did this because:
Generating SELECT queries with a dynamically generated IN filter is just ugly and highly impractical.
I could join that table in all my queries that needed to use the filter window.
Might not be the best solution out there but served, and still serves me very well.

If your primary keys follow some pattern, you can select where key like 'abc%'.
If you want to get out 1000 at a time, in some kind of sequence, you may want to have another int column in your data table with a clustered index. This would do the same job as your current memory table - allow you to select by int range.
What is the nature of the primary key? Is it anything meaningful?

If you're concerned about performance I definitely wouldn't recommend an 'IN' clause. It's much better to do an INNER JOIN if you can.
You can either first insert all the values into a temporary table and join to that or do a sub-select. Best is to actually profile the changes and figure out what works best for you.
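The temporary-table variant might look like the sketch below. It uses Python's built-in sqlite3 instead of the MySQL connector from the question (an assumption), with hypothetical table names items and wanted and a hypothetical helper fetch_by_keys. Binding the keys as parameters also sidesteps the manual quote escaping mentioned in the question.

```python
import sqlite3

def fetch_by_keys(conn, keys):
    """Return rows of items whose id is in keys, using a temp table and a join."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE wanted (id TEXT PRIMARY KEY)")
    # Parameter binding copes with quote characters inside the keys.
    cur.executemany("INSERT OR IGNORE INTO wanted VALUES (?)", [(k,) for k in keys])
    rows = cur.execute(
        "SELECT i.id, i.value FROM items AS i "
        "INNER JOIN wanted AS w ON w.id = i.id ORDER BY i.id"
    ).fetchall()
    cur.execute("DROP TABLE wanted")   # clean up for the next call
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [("a", 1), ("it's", 2), ("c", 3)]
)

found = fetch_by_keys(conn, ["it's", "c", "missing"])
```

Keys that do not exist in the table simply drop out of the join, so the caller can compare the result against the requested list to see which ones were missing.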

Why not consider using a table-valued parameter to push the keys in the form of a DataTable and fetch the matching records back?
Or
Simply write a private method that concatenates all the key codes from a provided collection into a single string, and pass that string to the query.
I think it may solve your problem.

Problem generating primary key with SSIS

I am with my boss and we are having a problem with an SSIS project.
Our data model sucks and doesn't have an automatic primary key, so we have to do the classic and nasty
Select Max(id) + 1 from customer
The problem is that between the moment my script task generates the PK and the moment I insert, 10 rows have already passed through the script task, so I get the same ID 10 times and the app crashes big time!!
How can I handle that in SSIS?

I found a simple answer: I just put the records I wanted to insert into a DataSet without any PK ID, and outside my data flow I run a foreach loop that gets a new PK ID for each record and inserts them one by one.
DONE!

Create a TempWork table with exactly the same structure as the final destination table, except make the PK an IDENTITY(n,1), where "n" is the next value based on the final destination table's PK. Use SSIS to insert into this TempWork table, and the IDs will be generated for you. When it is all done, do this:
INSERT INTO FinalTable (PK,col1, col2,...) SELECT PK, col1, col2... from TempWork
Then DROP TABLE TempWork
