SQL Server: randomize rows (shuffle IDs) - C#

Is there a way to randomize the rows in SQL Server?
I don't want to retrieve the rows in a random order; I know how to do that.
I want to shuffle the row IDs in the database (ex. ID1 will change to ID27 and ID27 will change to ID1).
I can copy all records to a temporary table, truncate the original table and insert the records back from the temporary table using a parallel loop for randomization.
Is there an easier way to do this?
The ID column is an identity, auto-incremented.

This sounds like a really strange requirement. Since the id is an identity column you can't change it, so you'll have to swap all the other data on the rows instead, which you could probably do with something like this:
select
    a.id as old_id,
    b.*
into #newdata
from
(
    select
        id,
        row_number() over (order by id) as rn
    from data
) a
join
(
    select
        *,
        row_number() over (order by newid()) as rn
    from data
) b on a.rn = b.rn
This creates a temp table with the old and new id numbers plus all the columns from the table. You could then use this temp table to update all the columns of the matching rows in the original table.
I can't really recommend doing this, especially if there are a lot of rows. Before doing it you should probably take an exclusive table-level lock on the table, just in case.
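A minimal sketch of that update step, assuming the table is called data with an identity column id and two made-up payload columns col1 and col2; the tablockx hint takes the exclusive table lock mentioned above:

begin tran;

update d
set col1 = n.col1,   -- repeat for every non-identity column in the table
    col2 = n.col2
from data d with (tablockx)
join #newdata n on n.old_id = d.id;

commit;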

Related

SQL Server: allow duplicates in any column, but not in all columns

I've searched through numerous threads to try to find an answer to this but any answer I've found suggests using a unique constraint on a single column, or multiple columns.
My problem is, I'm writing an application in C# with a SQL Server back end. One of the features is to allow a user to import a .CSV file into the database after a little bit of pre-processing. I need to find the quickest method to prevent the user from importing the same data more than once. The data will look something like
ID -- will be auto-generated in SQL Server (PK)
Date Time(datetime)
Machine(nchar)
...
...
...
Name(nchar)
Age(int)
I want to allow any number of the columns to contain duplicate values, as long as the entire record is not a duplicate.
I was thinking of creating another column in the database, obtained by hashing all of the columns together and making it unique, but I wasn't sure if that was the most efficient method, or if the resulting hash would be guaranteed to be unique. The CSV files will only be around 60 MB each, but there will be tens of thousands of them.
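For reference, the hash-column idea I had in mind is sketched below; the table name is made up, the column names follow the example above, and since hashes can collide in theory the result is not strictly guaranteed to be unique.

-- rough sketch of the hash-column idea; NULL handling and the exact
-- conversions would need more thought
alter table MyImportTable
add RowHash as cast(hashbytes('SHA2_256',
        convert(varchar(30), [Date Time], 126) + '|' + Machine + '|' + [Name]
        + '|' + convert(varchar(12), Age)) as varbinary(32)) persisted;

create unique index UQ_MyImportTable_RowHash on MyImportTable (RowHash);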
Any help would be appreciated.
Thanks
You should be able to resolve this by creating a unique constraint which includes all the columns.
create table #a (col1 varchar(10), col2 varchar(10))
ALTER TABLE #a
ADD CONSTRAINT UQ UNIQUE NONCLUSTERED
(col1, col2)
-- Works, duplicate entries in columns
insert into #a (col1, col2)
values ('a', 'b')
,('a', 'c')
,('b', 'c')
-- Fails, full duplicate record:
insert into #a (col1, col2)
values ('a1', 'b1')
,('a1', 'b1')
The code below can work to ensure that you don't duplicate the [Date Time], Machine, [Name] and Age columns when you insert the data.
It's important to ensure that, at the time of running the code, each row of the incoming dataset has a unique ID on it. The code simply skips any rows whose other four values are already present in the destination table.
INSERT INTO MAIN_TABLE ([Date Time],Machine,[Name],Age)
SELECT [Date Time],Machine,[Name],Age
FROM IMPORT_TABLE WHERE ID NOT IN
(
SELECT I.ID FROM IMPORT_TABLE I INNER JOIN MAIN_TABLE M
ON I.[Date Time]=M.[Date Time]
AND I.Machine=M.Machine
AND I.[Name]=M.[Name]
AND I.Age=M.Age
)
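The same logic could also be written with NOT EXISTS instead of NOT IN; some people find that form easier to read, and it avoids the usual NOT IN pitfalls if the subquery could ever return NULLs (a sketch, equivalent to the query above):

INSERT INTO MAIN_TABLE ([Date Time],Machine,[Name],Age)
SELECT I.[Date Time],I.Machine,I.[Name],I.Age
FROM IMPORT_TABLE I
WHERE NOT EXISTS
(
SELECT 1 FROM MAIN_TABLE M
WHERE M.[Date Time]=I.[Date Time]
AND M.Machine=I.Machine
AND M.[Name]=I.[Name]
AND M.Age=I.Age
)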

Query and export from unsortable table SQL Server

First, I'm sorry for my bad English; it's not my first language.
My problem is: I have a table with around 10 million bank transaction records. It doesn't have a PK and isn't sorted by any column.
My job is to create a page to filter the data and export it to CSV, but the limit for a CSV export is around 200k records.
I have some ideas:
create 800 tables, one per ATM (just an idea, I know it's stupid), push data from the main table into them once per day, and export 800 CSV files
use LINQ to get 100k records at a time and skip the ones already exported, but Skip requires OrderBy and I get an OutOfMemoryException with it:
db.tblEJTransactions.OrderBy(u => u.Id).Take(100000).ToList()
Can anyone help me? Every idea is welcome (my boss said I can use anything, including creating hundreds of tables, using NoSQL, ...)
If you don't have a primary key in your table, then add one.
The simplest and easiest is to add an int IDENTITY column.
ALTER TABLE dbo.T
ADD ID int NOT NULL IDENTITY (1, 1)
ALTER TABLE dbo.T
ADD CONSTRAINT PK_T PRIMARY KEY CLUSTERED (ID)
If you can't alter the original table, create a copy.
Once the table has a primary key you can sort by it and select chunks/pages of 200K rows with predictable results.
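For example, once the ID column exists, something like this should page through the table in fixed chunks (OFFSET/FETCH needs SQL Server 2012 or later; the variable names are just illustrative):

declare @PageNumber int = 0, @PageSize int = 200000;

select *
from dbo.T
order by ID
offset @PageNumber * @PageSize rows
fetch next @PageSize rows only;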
I'm not sure about my solution, but you can refer to it and try it:
select top 1000000 *, row_number() over (order by (select null)) as rn from tblEJTransactions
The above query returns the rows with a row number attached.
You can then use LINQ to page through the result.
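If you go this route, the paging could also be done in SQL over the numbered rows, as in the sketch below; note that order by (select null) gives no stable ordering between executions, so adding a real key as in the answer above is safer:

;with numbered as
(
    select *, row_number() over (order by (select null)) as rn
    from tblEJTransactions
)
select *
from numbered
where rn between 1 and 200000;  -- next chunk: between 200001 and 400000, and so on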

How to insert a serial number into an unpivoted column

I have a table in my DB which stores Date and Time in separate columns for a timetable. To display them as a single value in the front end, I joined the Date and Time columns, inserted the result into a temporary table and unpivoted it. But the pk_id is the same for both of the unpivoted rows, so in the front-end drop-down box, when I select the item at, say, index 6, it returns to index 1 after a postback. So, is there a way to add a serial number to the unpivoted columns? My unpivot query is:
Select * from
(
Select pk_bid,No_of_batches,Batch1,Batch2,Batch3,Batch4 from #tempbatch
) as p
Unpivot(Batchname for [Batches] in([Batch1],[Batch2],[Batch3],[Batch4])) as UnPvt
In the above query pk_bid and No_of_batches are the same for every unpivoted row, so if I use Row_number() OVER (PARTITION BY pk_bid ORDER BY pk_bid) or Row_number() OVER (PARTITION BY No_of_batches ORDER BY No_of_batches) it only ever gives 1, 1 because the values are the same.
I solved my problem above like this:
I created another temporary table and generated the serial number by ordering on a column in that table that has different values. The query I used is:
Create Table #Tempbatch2
(
pk_bid int,
No_of_batches int,
Batchname Varchar(max),
[Batches] Varchar(max)
)
Insert Into #Tempbatch2
Select * from
(
Select pk_bid,No_of_batches,Batch1,Batch2,Batch3,Batch4 from #tempbatch
) as p
Unpivot(Batchname for [Batches] in([Batch1],[Batch2],[Batch3],[Batch4])) as UnPvt
Select Row_number() OVER(ORDER BY Batchname) as S_No,pk_bid,No_of_batches,Batchname,[Batches] from #Tempbatch2
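For what it's worth, the same numbering can probably be applied directly to the unpivoted result, without the second temporary table; a sketch based on the query above:

Select Row_number() OVER (ORDER BY pk_bid, [Batches]) as S_No,
       pk_bid, No_of_batches, Batchname, [Batches]
from
(
    Select pk_bid,No_of_batches,Batch1,Batch2,Batch3,Batch4 from #tempbatch
) as p
Unpivot(Batchname for [Batches] in([Batch1],[Batch2],[Batch3],[Batch4])) as UnPvt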

Most efficient method to load DataSet from subset of multiple joined tables

I have a large inventory system, and I'm having to re-write part of the I/O portion of it. At its heart, there's a product table and a set of related tables. I need to be able to read pieces of it as efficiently as possible. From C# I construct this query:
select * -- includes productid
into #tt
from products where productClass = 547 -- possibly more conditions
select * from #tt;
select * from productHistory where productid in (select productid from #tt);
select * from productSuppliers where productid in (select productid from #tt);
select * from productSafetyInfo where productid in (select productid from #tt);
select * from productMiscInfo where productid in (select productid from #tt);
drop table #tt;
This query gives me exactly the results I need: 5 result sets each having zero, one or more records (if the first returns zero rows, the others do as well, of course). The program then takes those result sets and crams them into an appropriate DataSet. (Which then gets handed off into a constructor expecting just these records.) This query (with differing conditions) gets run a lot.
My question is, is there a more efficient way to retrieve this data?
Re-working this as a single join won't work because each child might return a variable number of rows.
If you have an index on products.productClass this might yield better performance.
select * from products where productClass = 547 -- includes productid
select productHistory.*
from productHistory
join products
on products.productid = productHistory.productid
and products.productClass = 547;
...
If productID is a clustered index then you will probably get better performance with
CREATE TABLE #Temp (productid INT PRIMARY KEY CLUSTERED);
insert into #temp
select productid from products where productClass = 547
order by productid;
go
select productHistory.*
from productHistory
join #Temp
on #Temp.productid = productHistory.productid;
A join on a clustered index seems to give the best performance.
Think about it - SQL can match the first row and know it can forget about the rest, then move to the second knowing it can only move forward (not go back to the top).
With a where in (select ..), SQL cannot take advantage of that order.
The more tables you need to join, the more reason to use a #temp, as you only take about a 1/2 second hit creating and populating it.
If you are going to use a #temp you might as well make it a structured temp table.
Make sure when you JOIN tables you are joining on indexes. Otherwise you will end up with table scans instead of index scans and your code will be very slow, especially when joining large tables.
Best practice is to optimize your SQL Queries to avoid table scans.
If you don't have it already, I would strongly suggest making this a stored procedure.
Also, I suspect, but can't prove without testing it, that you will get better performance if you perform joins on the products table for each of your subtables rather than copying into a local table.
Finally, unless you can combine the data, I don't think there is a more efficient way to do this.
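If it helps, here is a rough sketch of what such a stored procedure could look like, wrapping the existing batch and parameterizing only the productClass condition (the procedure name is made up):

create procedure dbo.GetProductBundle
    @productClass int
as
begin
    set nocount on;

    select productid
    into #tt
    from products
    where productClass = @productClass;  -- add the other conditions here

    select p.* from products p          join #tt t on t.productid = p.productid;
    select h.* from productHistory h    join #tt t on t.productid = h.productid;
    select s.* from productSuppliers s  join #tt t on t.productid = s.productid;
    select i.* from productSafetyInfo i join #tt t on t.productid = i.productid;
    select m.* from productMiscInfo m   join #tt t on t.productid = m.productid;

    drop table #tt;
end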
Without seeing your schema and knowing a little more about your data and table sizes, it's hard to suggest definitive improvements on the query side.
However, instead of "cramming the results into an appropriate DataSet," since you are using a batched command to return multiple result sets, you could use SqlDataAdapter to do that part for you:
SqlDataAdapter adapter = new SqlDataAdapter(cmd);
DataSet results = new DataSet();
adapter.Fill(results);
After that, the first result set will be in results.Tables[0], the second in results.Tables[1], etc.

Improve SQL performance for populating List<T>

I have 200,000 records in a database with the PK as a varchar(50)
Every 5 minutes I do a SELECT COUNT(*) FROM TABLE
If that result is greater than the List.Count I then execute
"SELECT * FROM TABLE WHERE PRIMARYKEY NOT IN ( " + myList.ToCSVString() + ")"
The reason I do this is because records are being added to the table via another process.
This query takes a long time to run and I also believe it's throwing an OutOfMemoryException.
Is there a better way to implement this?
Thanks
SQL Server has a solution for this: add a timestamp (rowversion) column; every time you touch a row in the table, its timestamp value will increase.
Add an index for the timestamp column.
Instead of just storing ids in memory, store ids and last timestamp.
To update:
select max timestamp
select all the rows between old max timestamp and current max timestamp
merge that into the list
Handling deletions is a bit more tricky, but can be achieved if you tombstone as opposed to delete.
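A rough sketch of that approach, using the rowversion type (the current name for timestamp); the table and index names are made up and the high-water mark would be kept by the application:

-- one-time setup
alter table dbo.MyTable add RowVer rowversion;
create index IX_MyTable_RowVer on dbo.MyTable (RowVer);

-- on each poll
declare @lastRowVer binary(8) = 0x0000000000000000;  -- the stored high-water mark

select *                                 -- rows inserted or updated since last poll
from dbo.MyTable
where RowVer > @lastRowVer
order by RowVer;

select max(RowVer) as NewHighWaterMark   -- store this for the next poll
from dbo.MyTable;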
Can you change the table?
If so, you might want to add a new auto-incremented column, TableId, that will serve as the PK.
On each SELECT save the max id and on the next select add where TableId > maxId.
Create an INT PK, and use something like this:
"SELECT * FROM TABLE WHERE MY_ID > " + myList.Last().Id;
If you can't change your PK, create another column with a date/datetime type and GETDATE() as the default value, and use it to query for new items.
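A quick sketch of that idea, with made-up table, constraint and variable names:

alter table dbo.MyTable
add InsertedAt datetime2 not null
    constraint DF_MyTable_InsertedAt default (getdate());

declare @lastSeen datetime2 = '2024-01-01';  -- kept by the application

select * from dbo.MyTable where InsertedAt > @lastSeen;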
Create another table in the database with a single column for the primary key. When your application starts, insert the PKs into this table. Then you can detect added keys directly with a select rather than checking the count:
select PrimaryKey from Table where PrimaryKey not in (select PrimaryKey from OtherTable)
If this CSV list is large, I would recommend loading it into a temp table, putting an index on it, and doing a left join / where null:
select tbl.*
from table tbl
left join #tmpTable tmp on tbl.primarykey = tmp.primarykey
where tmp.primarykey is null
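For completeness, the staging table for that join could be created as below and then loaded with the known keys from the application (for example with SqlBulkCopy) before the query runs:

-- indexed temp table holding the keys the application already has
create table #tmpTable (primarykey varchar(50) not null primary key);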
edit: a primary key should not be a varchar. It should almost always be an auto-incremented int/bigint. That would have made this a lot easier: select * from table where primarykey > #lastknownkey
Smack the DB programmer who designed this.. :p
This design would also cause index fragmentation because rows won't be inserted in a linear fashion.
