Query and export from unsortable table SQL Server - c#

First, I'm sorry for my bad English; it's not my native language.
My problem: I have a table with around 10 million bank transaction records. It has no primary key and is not sorted by any column.
My job is to create a page to filter the data and export it to CSV, but the limit for a CSV export is around 200k rows.
I have some ideas:
Create 800 tables, one per ATM (just an idea, I know it's stupid), push data from the main table into them once per day, then export 800 CSV files.
Use LINQ to get 100k records at a time and skip the already-exported ones on the next pass. But I am stuck because Skip requires an OrderBy, and I get an OutOfMemoryException with it:
db.tblEJTransactions.OrderBy(u => u.Id).Take(100000).ToList()
Can anyone help me? Every idea is welcome (my boss said I can use anything, including creating hundreds of tables, using NoSQL, ...).

If you don't have a primary key in your table, then add one.
The simplest and easiest is to add an int IDENTITY column.
ALTER TABLE dbo.T
ADD ID int NOT NULL IDENTITY (1, 1)
ALTER TABLE dbo.T
ADD CONSTRAINT PK_T PRIMARY KEY CLUSTERED (ID)
If you can't alter the original table, create a copy.
Once the table has a primary key you can sort by it and select chunks/pages of 200K rows with predictable results.
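For example, once the ID column exists, a keyset-paging loop along these lines avoids Skip entirely (a minimal sketch assuming an Entity Framework context named db, as in the question, and a hypothetical WriteCsvFile helper):
int lastId = 0;
while (true)
{
    // Seek past the last exported row instead of Skip(): the clustered PK makes this cheap
    var chunk = db.tblEJTransactions
                  .Where(t => t.ID > lastId)
                  .OrderBy(t => t.ID)
                  .Take(200000)
                  .ToList();
    if (chunk.Count == 0)
        break;
    WriteCsvFile(chunk);          // hypothetical helper: writes one CSV file per chunk
    lastId = chunk.Last().ID;     // resume after the last exported row
}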

I'm not sure about my solution. But you can refer and try it:
select top 1000000 *, row_number() over (order by (select null)) from tblEJTransactions
The above query numbers every row without sorting on any real column (note the numbering is arbitrary and is not guaranteed to be stable between executions).
And then you can use LINQ to get the result.
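A sketch of paging over that query from C# with plain ADO.NET (connectionString is assumed; because the ROW_NUMBER values can change between executions, it is safer to snapshot the numbered rows into a temp table or an ID column first, as in the previous answer):
// uses System.Data.SqlClient
const int pageSize = 200000;
for (int page = 0; ; page++)
{
    int rowsRead = 0;
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        @"SELECT * FROM (
              SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
              FROM tblEJTransactions) t
          WHERE t.rn > @from AND t.rn <= @to", conn))
    {
        cmd.Parameters.AddWithValue("@from", page * pageSize);
        cmd.Parameters.AddWithValue("@to", (page + 1) * pageSize);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                rowsRead++;
                // write the current row to this page's CSV file here
            }
        }
    }
    if (rowsRead == 0)
        break;
}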

Related

SQL Server allow duplicates in any column, but not all columns

I've searched through numerous threads to try to find an answer to this, but every answer I've found suggests using a unique constraint on a single column or on multiple columns.
My problem is, I'm writing an application in C# with a SQL Server back end. One of the features is to allow a user to import a .CSV file into the database after a little bit of pre-processing. I need to find the quickest method to prevent the user from importing the same data more than once. The data will look something like
ID -- will be auto-generated in SQL Server (PK)
Date Time(datetime)
Machine(nchar)
...
...
...
Name(nchar)
Age(int)
I want to allow any number of the columns to have duplicate values, as long as the entire record does not.
I was thinking of creating another column in the database, obtained by hashing all of the columns together and making it unique, but I wasn't sure if that was the most efficient method, or if the resulting hash would be guaranteed unique. The CSV files will only be around 60 MB, but there will be tens of thousands of them.
Any help would be appreciated.
Thanks
You should be able to resolve this by creating a unique constraint which includes all the columns.
create table #a (col1 varchar(10), col2 varchar(10))
ALTER TABLE #a
ADD CONSTRAINT UQ UNIQUE NONCLUSTERED
(col1, col2)
-- Works, duplicate entries in columns
insert into #a (col1, col2)
values ('a', 'b')
,('a', 'c')
,('b', 'c')
-- Fails, full duplicate record:
insert into #a (col1, col2)
values ('a1', 'b1')
,('a1', 'b1')
The code below can work to ensure that you don't duplicate the [Date Time], Machine, [Name] and Age columns when you insert the data.
It's important to ensure that, at the time of running the code, each row of the incoming dataset has a unique ID on it. The code simply skips any rows whose ID gets selected because all four other values are already present in the destination table.
INSERT INTO MAIN_TABLE ([Date Time],Machine,[Name],Age)
SELECT [Date Time],Machine,[Name],Age
FROM IMPORT_TABLE WHERE ID NOT IN
(
SELECT I.ID FROM IMPORT_TABLE I INNER JOIN MAIN_TABLE M
ON I.[Date Time]=M.[Date Time]
AND I.Machine=M.Machine
AND I.[Name]=M.[Name]
AND I.Age=M.Age
)
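On the C# side, one way to fill IMPORT_TABLE before running that statement is SqlBulkCopy (a sketch; csvRows is assumed to be a DataTable built from your parsed CSV with a unique ID already assigned to each row, and dedupInsertSql is assumed to hold the INSERT statement shown above):
// uses System.Data and System.Data.SqlClient
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "IMPORT_TABLE";
        bulk.WriteToServer(csvRows);   // csvRows: a DataTable whose columns match IMPORT_TABLE
    }
    using (var cmd = new SqlCommand(dedupInsertSql, conn))   // the INSERT ... NOT IN statement above
    {
        cmd.ExecuteNonQuery();
    }
    // finally empty IMPORT_TABLE so the next file starts from a clean staging area
}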

SQL batch insert, avoid duplicates, no PK

I was given a task to insert over 1000 rows with 4 columns. The table in question does not have a PK or FK. Let's say it contains columns ID, CustomerNo, Description. The records that need to be inserted can have the same CustomerNo and Description values.
I read about importing data to a temporary table, comparing it with the real table, removing duplicates, and moving new records to the real table.
I also could have 1000 queries that check if such a record already exists and insert data if it does not. But I'm too ashamed to try that out for obvious reasons.
I'm not expecting any specific code, because I did not give any specific details. What I'm hoping for is some pseudocode or general advice for completing such tasks. I can't wait to give some upvotes!
So the idea is, you don't want to insert an entry if there's already an entry with the same ID?
If so, after you import your data into a temporary table, you can accomplish what you're looking for in the where clause of a select statement:
insert into table
select ID, CustomerNo, Description from #data_source
where (#data_source.ID not in (select table.ID from table))
I would suggest loading the data into a temp table or table variable. Then you can do a SELECT INTO using the DISTINCT keyword, which will remove the duplicate records.
You will always need to read the target table, unless you bulk load it into a temp table (at which point you will have two temp tables), compare both, eliminate duplicates, and then insert into the target table. Even that is not entirely accurate, because a new row can be inserted into the target table while you are doing this.

Improve SQL performance for populating List<T>

I have 200,000 records in a database with the PK as a varchar(50)
Every 5 minutes I do a SELECT COUNT(*) FROM TABLE
If that result is greater than the List.Count I then execute
"SELECT * FROM TABLE WHERE PRIMARYKEY NOT IN ( " + myList.ToCSVString() + ")"
The reason I do this is because records are being added to the table via another process.
This query takes a long time to run, and I also believe it's throwing an OutOfMemoryException.
Is there a better way to implement this?
Thanks
SQL Server has a solution for this: add a timestamp (rowversion) column. Every time you touch any row in the table, its timestamp value grows.
Add an index for the timestamp column.
Instead of just storing ids in memory, store ids and last timestamp.
To update:
select max timestamp
select all the rows between old max timestamp and current max timestamp
merge that into the list
Handling deletions is a bit more tricky, but can be achieved if you tombstone as opposed to delete.
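A minimal sketch of that update step, assuming the column is named RowVersion, the key column is PrimaryKey (varchar, as in the question), and knownRows is the in-memory Dictionary<string, byte[]> that replaces the plain list:
// uses System.Data and System.Data.SqlClient; lastVersion starts as new byte[8] (all zeros)
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT PrimaryKey, RowVersion FROM [TABLE] WHERE RowVersion > @last ORDER BY RowVersion", conn))
{
    cmd.Parameters.Add("@last", SqlDbType.Timestamp).Value = lastVersion;
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            string key = reader.GetString(0);
            byte[] version = (byte[])reader[1];
            knownRows[key] = version;   // merge: adds brand-new rows, refreshes rows touched since the last poll
            lastVersion = version;      // thanks to ORDER BY, this ends up as the new max timestamp
        }
    }
}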
Can you change the table?
If so, you might want to add a new auto-incremented column, TableId, to serve as the PK.
On each SELECT save the max id and on the next select add where TableId > maxId.
Create an INT PK, and use something like this:
"SELECT * FROM TABLE WHERE MY_ID > " + myList.Last().Id;
If you can't change your PK, create another column of a date type, with the current time (GETDATE()) as the default value, and use it to query for new items.
Create another table in the database with a single column for the primary key. When your application starts, insert the PKs into this table. Then you can detect added keys directly with a select rather than checking the count:
select PrimaryKey from Table where PrimaryKey not in (select PrimaryKey from OtherTable)
If this CSV list is large, I would recommend loading your file into a temp table, putting an index on it, and doing a left join where null:
select tbl.*
from table tbl
left join #tmpTable tmp on tbl.primarykey = tmp.primarykey
where tmp.primarykey is null
Edit: a primary key should not be a varchar. It should almost always be an incrementing int/bigint. That would have made this a lot easier: select * from table where primarykey > @lastknownkey
Smack the DB programmer who designed this.. :p
This design would also cause index fragmentation because rows won't be inserted in a linear fashion.

How to fetch lots of database table records by primary key?

Using the ADO.NET MySQL Connector, what is a good way to fetch lots of records (1000+) by primary key?
I have a table with just a few small columns, and a VARCHAR(128) primary key. Currently it has about 100k entries, but this will become more in the future.
In the beginning, I thought I would use the SQL IN statement:
SELECT * FROM `table` WHERE `id` IN ('key1', 'key2', [...], 'key1000')
But with this the query could become very long, and I would also have to manually escape quote characters in the keys, etc.
Now I use a MySQL MEMORY table (tempid INT, id VARCHAR(128)) to first upload all the keys with prepared INSERT statements. Then I make a join to select all the existing keys, after which I clean up the mess in the memory table.
Is there a better way to do this?
Note: OK, maybe it's not the best idea to have a string as a primary key, but the question would be the same if the VARCHAR column were a normal index.
Temporary table: So far it seems the solution is to put the data into a temporary table, and then JOIN, which is basically what I currently do (see above).
I've dealt with a similar situation in a Payroll system where the user needed to generate reports based on a selection of employees (eg. employees X,Y,Z... or employees that work in certain offices). I've built a filter window with all the employees and all the attributes that could be considered as a filter criteria, and had that window save selected employee id's in a filter table from the database. I did this because:
Generating SELECT queries with a dynamically generated IN filter is just ugly and highly impractical.
I could join that table in all my queries that needed to use the filter window.
It might not be the best solution out there, but it served, and still serves, me very well.
If your primary keys follow some pattern, you can select where key like 'abc%'.
If you want to get out 1000 at a time, in some kind of sequence, you may want to have another int column in your data table with a clustered index. This would do the same job as your current memory table - allow you to select by int range.
What is the nature of the primary key? Is it anything meaningful?
If you're concerned about performance, I definitely wouldn't recommend an IN clause. It's much better to do an INNER JOIN if you can.
You can either first insert all the values into a temporary table and join to that or do a sub-select. Best is to actually profile the changes and figure out what works best for you.
Why not consider using a table-valued parameter to push the keys in the form of a DataTable and fetch the matching records back?
Or
Or simply write a private method that concatenates all the key codes from a provided collection into a single string, and pass that string to the query.
I think it may solve your problem.
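For reference, a table-valued parameter call from ADO.NET looks like this against SQL Server (a sketch; note TVPs are a SQL Server feature, not a MySQL one, so with the MySQL Connector the temp-table join you already use plays the same role; the type name KeyList and the collection wantedKeys are illustrative):
// one-time setup on the server (SQL Server):
//   CREATE TYPE dbo.KeyList AS TABLE (id VARCHAR(128) PRIMARY KEY);
var keys = new DataTable();
keys.Columns.Add("id", typeof(string));
foreach (var key in wantedKeys)
    keys.Rows.Add(key);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT t.* FROM [table] t JOIN @keys k ON t.id = k.id", conn))
{
    var p = cmd.Parameters.AddWithValue("@keys", keys);
    p.SqlDbType = SqlDbType.Structured;   // marks the DataTable as a table-valued parameter
    p.TypeName = "dbo.KeyList";
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // read the matching rows here
    }
}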

What is the best way, algorithm, method to difference large lists of data?

I am receiving a large list of current account numbers daily, and storing them in a database. My task is to find added and released accounts from each file. Right now, I have 4 SQL tables, (AccountsCurrent, AccountsNew, AccountsAdded, AccountsRemoved). When I receive a file, I am adding it entirely to AccountsNew. Then running the below queries to find which we added and removed.
INSERT AccountsAdded(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsCurrent)
INSERT AccountsRemoved(AccountNum, Name) SELECT AccountNum, Name FROM AccountsCurrent WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsNew)
TRUNCATE TABLE AccountsCurrent
INSERT AccountsCurrent(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew
TRUNCATE TABLE AccountsNew
Right now, I am differencing about 250,000 accounts, but this is going to keep growing. Is this the best method, do you have any other ideas?
EDIT:
This is an MSSQL 2000 database. I'm using c# to process the file.
The only data I am focused on is the accounts that were added and removed between the last and current files. The AccountsCurrent, is only used to determine what accounts were added or removed.
To be honest, I think that I'd follow something like your approach. One thing is that you could remove the truncate, do a rename of the "new" to "current" and re-create "new".
Sounds like a history/audit process that might be better done using triggers. Have a separate history table that captures changes (e.g., timestamp, operation, who performed the change, etc.)
New and deleted accounts are easy to understand. "Current" accounts implies that there's an intermediate state between being new and deleted. I don't see any difference between "new" and "added".
I wouldn't have four tables. I'd have a STATUS table that would have the different possible states, and ACCOUNTS or the HISTORY table would have a foreign key to it.
Using IN clauses on long lists can be slow.
If the tables are indexed, using a LEFT JOIN can prove to be faster...
INSERT INTO [table] (
[fields]
)
SELECT
[fields]
FROM
[table1]
LEFT JOIN
[table2]
ON [join condition]
WHERE
[table2].[id] IS NULL
This assumes 1:1 relationships and not 1:many. If you have 1:many you can do any of...
1. SELECT DISTINCT
2. Use a GROUP BY clause
3. Use a different query, see below...
INSERT INTO [table] (
[fields]
)
SELECT
[fields]
FROM
[table1]
WHERE
EXISTS (SELECT * FROM [table2] WHERE [condition to match tables 1 and 2])
-- # This is quick provided that all fields to match the two tables are
-- # indexed in both tables. Should then be much faster than the IN clause.
You could also subtract the intersection to get the differences in one table.
If the initial file is ordered in a sensible and consistent way (big IF!), it would run considerably faster as a C# program which logically compared the files.
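For example, if yesterday's and today's files are both sorted by account number, a single merge pass finds the added and removed accounts without loading either file fully into memory (a sketch; the file paths and the one-account-per-line format are assumptions):
// uses System and System.IO
using (var oldFile = new StreamReader("accounts_previous.txt"))
using (var newFile = new StreamReader("accounts_current.txt"))
{
    string oldLine = oldFile.ReadLine();
    string newLine = newFile.ReadLine();
    while (oldLine != null || newLine != null)
    {
        int cmp = oldLine == null ? 1
                : newLine == null ? -1
                : string.CompareOrdinal(oldLine, newLine);
        if (cmp == 0)                     // present in both files: unchanged
        {
            oldLine = oldFile.ReadLine();
            newLine = newFile.ReadLine();
        }
        else if (cmp < 0)                 // only in the old file: removed account
        {
            Console.WriteLine("Removed: " + oldLine);
            oldLine = oldFile.ReadLine();
        }
        else                              // only in the new file: added account
        {
            Console.WriteLine("Added: " + newLine);
            newLine = newFile.ReadLine();
        }
    }
}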
