Fastest Way To Upsert Using Postgres and C#

I am writing an application in C# that will copy data from one Postgres table to another on a regular basis, using the Npgsql library.
I have run into the following issue: When there are thousands of rows to be copied (> 10k), the program runs very slowly.
I have tried:
For my first attempt, I pulled the entire destination table, then compared the data I was inserting to the data that already existed. I would then write an insert or update statement depending on whether the row already existed but had changed, or did not exist at all. This was the worst solution, as every individual statement had to be sent as a separate command.
Next, I tried handling conflicts on the table itself with ON CONFLICT. This let me send all of the inserts as bulk INSERT INTO ... statements and have the database take care of the updates. This was significantly faster, but not fast enough.
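A minimal sketch of that kind of batched ON CONFLICT upsert with Npgsql, assuming a hypothetical items (id, name) table and connection string (not the real schema):

using Npgsql;

class UpsertSketch
{
    static void Run(string connectionString)
    {
        using var conn = new NpgsqlConnection(connectionString);
        conn.Open();

        // One multi-row INSERT with an ON CONFLICT clause handles both new and changed rows.
        using var cmd = new NpgsqlCommand(
            "INSERT INTO items (id, name) VALUES (@id0, @name0), (@id1, @name1) " +
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name", conn);
        cmd.Parameters.AddWithValue("id0", 1);
        cmd.Parameters.AddWithValue("name0", "first");
        cmd.Parameters.AddWithValue("id1", 2);
        cmd.Parameters.AddWithValue("name1", "second");
        cmd.ExecuteNonQuery();
    }
}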
I read about Postgres's COPY command, but it does not seem to suit my needs: COPY does ONLY an insert, NOT an upsert. Because I am modifying this table repeatedly, some of the data will be new, but some will be existing rows that need updating.
Has anyone come up with a fast way to UPSERT, given that I need the option to EDIT a row, not just do a blanket mass INSERT of all of my data?
Please let me know if I can provide any other information
Thank you so much for your time

First of all, I assume the tables are on different databases, otherwise I would just do this all in DML.
I think copy is definitely your friend. There is no faster way to extract or load data, and then you can let the database do the heavy lifting.
On the source database:
copy source_table
to '/var/tmp/foo.csv' csv;
On the destination database:
truncate temp_table;
copy temp_table
from '/var/tmp/foo.csv' csv;
insert into destination_table
select *
from temp_table t
where not exists (
select null
from destination_table d
where t.id = d.id
);
update destination_table d
set
field1 = t.field1,
field2 = t.field2
from temp_table t
where
d.id = t.id and
(d.field1 is distinct from t.field1 or
d.field2 is distinct from t.field2)
It would be great if you can do something like this if the data is readily available.
A couple of other comments:
The INSERT uses an anti-join, which is my favorite construct for inserting missing records.
On the UPDATE, it's important to specify the criteria for what you update -- don't update everything, only the records that have actually changed. This makes a big difference in performance. Hopefully there is a fixed set of fields you can use to determine whether a record has changed.
If there is a field that indicates the record has been updated (last_update_date or something similar), a slightly lazier and wonderful approach is to delete those records and let the anti-join insert re-insert them. This removes the need for the update statement and is much less code for tables with lots of columns.
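Since the data to load already lives in the C# application, the same staging-table flow can also be driven through Npgsql's COPY ... FROM STDIN support instead of intermediate CSV files. A rough sketch, assuming the temp_table (id, field1, field2) and destination_table from the statements above already exist:

using Npgsql;
using NpgsqlTypes;

class CopyUpsertSketch
{
    static void Load(string connectionString, (int Id, string Field1, string Field2)[] rows)
    {
        using var conn = new NpgsqlConnection(connectionString);
        conn.Open();

        // Empty the staging table, as in the truncate step above.
        using (var truncate = new NpgsqlCommand("truncate temp_table", conn))
            truncate.ExecuteNonQuery();

        // Stream the application's rows into the staging table via COPY ... FROM STDIN.
        using (var importer = conn.BeginBinaryImport(
            "COPY temp_table (id, field1, field2) FROM STDIN (FORMAT BINARY)"))
        {
            foreach (var row in rows)
            {
                importer.StartRow();
                importer.Write(row.Id, NpgsqlDbType.Integer);
                importer.Write(row.Field1, NpgsqlDbType.Text);
                importer.Write(row.Field2, NpgsqlDbType.Text);
            }
            importer.Complete();
        }

        // Let the database do the heavy lifting: the same anti-join insert and guarded update as above.
        using var merge = new NpgsqlCommand(
            "insert into destination_table select * from temp_table t " +
            "where not exists (select null from destination_table d where t.id = d.id); " +
            "update destination_table d set field1 = t.field1, field2 = t.field2 " +
            "from temp_table t where d.id = t.id and " +
            "(d.field1 is distinct from t.field1 or d.field2 is distinct from t.field2);", conn);
        merge.ExecuteNonQuery();
    }
}

The merge step is unchanged; only the loading of the staging table moves from a CSV file to a stream from the application.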

Related

Read from one table and insert into another - one row at a time

I am dealing with a huge database with millions of rows. I would like to run an SQL statement through C#, which selects 1.2 million rows from one database, and inserts them into another after parsing and modifying some data.
I originally wanted to do so by first running the select statement and parsing the data by iterating through the MySqlDataReader object that contains it. That seemed like a memory overhead, so I decided to select one row, parse it, insert it into the other database, and then move on to the next row.
How can this be done? I have tried the SELECT ... INTO syntax for a MySQL query, however this still seems to select all the data first and then insert it afterwards.
Use the SqlBulkCopy class to move data from one source to the other:
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy%28v=vs.110%29.aspx
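A rough sketch of what that can look like, assuming the destination is SQL Server (SqlBulkCopy only writes to SQL Server) and using placeholder connection strings, column names, and table names; any per-row parsing would still need to happen before or after the bulk load:

using System.Data.SqlClient;
using MySql.Data.MySqlClient;

class BulkCopySketch
{
    static void Copy(string mySqlConnString, string sqlServerConnString)
    {
        using var source = new MySqlConnection(mySqlConnString);
        source.Open();
        using var cmd = new MySqlCommand("SELECT Id, Col1, Col2 FROM source_table", source);
        using var reader = cmd.ExecuteReader();      // streams rows; nothing is buffered in memory

        using var bulk = new SqlBulkCopy(sqlServerConnString)
        {
            DestinationTableName = "dbo.destination_table",
            BatchSize = 10000
        };
        bulk.WriteToServer(reader);                  // pushes the whole stream in bulk
    }
}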
I am not sure whether you are able to add a new column to the existing table. If you can, you can use the new column as a flag, e.g. TRANSFERED (boolean).
You would select one row at a time with the condition TRANSFERED = FALSE, process it, and once that row is processed, update it to TRANSFERED = TRUE.
Alternatively, if you have a unique id column in your existing table, create a temp table that stores the ids of processed rows; that way you will know which rows have and have not been processed.
I am not quite sure what your error is. For your case, I suggest you select the data in batches (say, 1,000 rows at a time), because inserting rows one by one is really slow. After that you can use an insert query. Note that SqlBulkCopy only targets SQL Server. I also suggest you use a StringBuilder to build the SQL query, because concatenating plain strings has a big overhead.
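As a loose illustration of that batching advice, here is a sketch that builds one parameterized multi-row INSERT per batch with a StringBuilder; the dest_table (id, name) schema is made up:

using System.Collections.Generic;
using System.Text;
using MySql.Data.MySqlClient;

class BatchInsertSketch
{
    static void InsertBatch(MySqlConnection conn, IReadOnlyList<(int Id, string Name)> rows)
    {
        var sql = new StringBuilder("INSERT INTO dest_table (id, name) VALUES ");
        using var cmd = new MySqlCommand { Connection = conn };

        for (int i = 0; i < rows.Count; i++)
        {
            if (i > 0) sql.Append(", ");
            sql.Append($"(@id{i}, @name{i})");
            cmd.Parameters.AddWithValue($"@id{i}", rows[i].Id);
            cmd.Parameters.AddWithValue($"@name{i}", rows[i].Name);
        }

        // One round-trip per batch instead of one per row.
        cmd.CommandText = sql.ToString();
        cmd.ExecuteNonQuery();
    }
}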

How to store string array in sqlite database and keep order

I want to store a large number of strings in my SQLite database, and I want them to always come back in the same order in which I added them. I know I could give them an autoincrementing primary key and sort by that, but since there can be up to 100,000 strings I am worried this is a performance issue. Besides, the order should NEVER change or be sorted in any different way.
short example:
sql insert "hghtzdz12g"
sql insert "jut65bdt"
sql insert "lkk7676nbgt"
sql select * should ALWAYS give this order: { "hghtzdz12g", "jut65bdt", "lkk7676nbgt" }
Any ideas how to achieve this?
Thanks
In a query like
SELECT * FROM MyTable ORDER BY MyColumn
the database does not need to sort the results if the column is indexed, because it can just scan through the index entries in order.
The rowid (or whatever you call the autoincrementing column) is an index, and is even more efficient than a separate index.
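For illustration, a small sketch of reading the rows back in rowid order with Microsoft.Data.Sqlite; the Value column name is an assumption, and MyTable is taken from the query above:

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

class ReadInOrderSketch
{
    static List<string> ReadAll(string dbPath)
    {
        var result = new List<string>();
        using var conn = new SqliteConnection($"Data Source={dbPath}");
        conn.Open();

        using var cmd = conn.CreateCommand();
        // Ordering by the implicit rowid walks the table in insertion order
        // (assuming rows are never deleted) without a separate sort step.
        cmd.CommandText = "SELECT Value FROM MyTable ORDER BY rowid";

        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            result.Add(reader.GetString(0));
        return result;
    }
}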
If you are sure you will never need anything but exactly this array in exactly this order, you can cheat the database and put in a single blob field.
But then you should ask yourself why you chose a database in the first place.
The correct database solution is indeed a table using a key that you can sort by.
If this performance is not enough, you can have a look here for performance hints.
If you need ultra-fast performance, maybe a database is not the best tool for the job. Databases are chosen for their ACID guarantees; raw speed is not one of them, but rather a secondary objective.

Add column to existing SQL Server table - Implications

I have an existing table in SQL Server with existing entries (over 1 million in fact).
This table gets updated, inserted and selected from on a regular basis by a front-end application. I want/need to add a datetime column e.g. M_DateModified that can be updated like so:
UPDATE Table SET M_DateModified = GETDATE()
whenever a button gets pressed on the front-end and a stored procedure gets called. This column will be added to an existing report as requested.
My problem, and question, is this: being one of the core tables of our app, will ALTERING the table and adding an additional column break other existing queries? Obviously, you can't insert into a table without specifying all values for all columns, so any existing INSERT queries will break (WHICH is a massive problem).
Any help would be much appreciated on the best solution regarding this problem.
First, as marc_s says, it should only affect SELECT * queries, and not even all of those would necessarily be affected.
Secondly, you only need to specify non-NULL fields on an INSERT, so if you make the column NULL-able, you don't have to worry about that. Further, for a created/modified-date column, it is typical to add a DEFAULT of GETDATE(), which will fill it in for you if it is not specified.
Thirdly, if you are still worried about impacting your existing code-base, then do the following:
Rename your table to something like "physicalTable".
Create a View with the same name that your table formerly had, that does a SELECT .. FROM physicalTable, listing the columns explicitly and in the same order, but do not include the M_DateModified field in it.
Leave your code unmodified, now referencing the View, instead of directly accessing the table.
Now your code can safely interact with the table without any changes (SQL DML code cannot tell the difference between a Table and a writeable View like this).
Finally, this kind of "ModifiedDate" column is a common need and is most often handled, first by making it NULL-able, then by adding an Insert & Update trigger that sets it automatically:
UPDATE t
SET M_DateModified = GETDATE()
FROM physicalTable t
JOIN inserted i ON t.PkId = i.PkId
This way the application does not have to maintain the field itself. As an added bonus, neither can the application set it incorrectly or falsely (this is a common and acceptable use of triggers in SQL).
If the new column is not mandatory, you have nothing to worry about. Unless you have some knuckleheads who wrote SELECT statements with a "*" instead of a column list.
Well, as long as your SELECTs are not *, those should be fine. For the INSERTs, if you give the field a default of GETDATE() and allow NULLs, you can exclude it and it will still be filled.
Depends on how your other queries are set up. If they are SELECT [Item1], [Item2], etc., then you won't face any issues. If it's a SELECT * FROM, then you may experience some unexpected results.
Keep in mind how you want to set it up: you'll either have to make it nullable, which could give you fits down the road, or set a default date, which could give you incorrect data for reporting, retrieval, queries, etc.

Recommend usage of temp table or table variable in Entity Framework 4 - Update Performance

I need to update a bit field in a table and set this field to true for a specific list of Ids in that table.
The Ids are passed in from an external process.
I guess in pure SQL the most efficient way would be to create a temp table and populate it with the Ids, then join the main table with this and set the bit field accordingly.
I could create a SPROC to take the Ids, but there could be 200,000 - 300,000 rows involved that need this flag set, so it's probably not the most efficient way. Using the IN statement has limitations with respect to the amount of data that can be passed, and with performance.
How can I achieve the above using the Entity Framework?
I guess it's possible to create a SPROC to create a temp table, but this would not exist from the model's perspective.
Is there a way to dynamically add entities at run time. [Or is this approach just going to cause headaches].
I'm making the assumption above though that populating a temp table with 300,000 rows and doing a join would be quicker than calling a SPROC 300,000 times :)
[The Ids are Guids]
Is there another approach that I should consider.
For data volumes like 300k rows, I would forget EF. I would do this by having a table such as:
BatchId RowId
Where RowId is the PK of the row we want to update, and BatchId just refers to this "run" of 300k rows (to allow multiple at once etc).
I would generate a new BatchId (this could be anything unique - Guid leaps to mind), and use SqlBulkCopy to insert the records into this table, i.e.
100034 17
100034 22
...
100034 134556
I would then use a single sproc to do the join and update (and delete the batch from the table).
SqlBulkCopy is the fastest way of getting this volume of data to the server; you won't drown in round-trips. EF is object-oriented: nice for lots of scenarios - but not this one.
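A rough sketch of that flow, with the batch table, column names, and the final UPDATE made up for illustration (and the join/update done as inline SQL here rather than the sproc suggested above):

using System;
using System.Data;
using System.Data.SqlClient;

class BatchFlagUpdateSketch
{
    static void FlagRows(string connString, Guid[] ids)
    {
        var batchId = Guid.NewGuid();

        // Shape an in-memory table like the batch table (BatchId, RowId).
        var table = new DataTable();
        table.Columns.Add("BatchId", typeof(Guid));
        table.Columns.Add("RowId", typeof(Guid));
        foreach (var id in ids)
            table.Rows.Add(batchId, id);

        using var conn = new SqlConnection(connString);
        conn.Open();

        // 1. Bulk-load the 300k ids in one shot.
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.BatchUpdateIds" })
            bulk.WriteToServer(table);

        // 2. Set the flag with a single set-based join, then clean up the batch.
        using var cmd = new SqlCommand(
            "UPDATE m SET m.IsFlagged = 1 " +
            "FROM dbo.MainTable m JOIN dbo.BatchUpdateIds b ON b.RowId = m.Id " +
            "WHERE b.BatchId = @batchId; " +
            "DELETE FROM dbo.BatchUpdateIds WHERE BatchId = @batchId;", conn);
        cmd.Parameters.AddWithValue("@batchId", batchId);
        cmd.ExecuteNonQuery();
    }
}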
I'm marking Marc's response as the answer, but I'd just like to give a little detail on how we implemented the requirement.
Marc's response helped greatly in the formulation of our solution.
We had to deal with an aim/guideline to keep within the Entity Framework while not using SPROCs, and although our solution may not suit others, it has worked for us.
We created an Item table in the database with BatchId [uniqueidentifier] and ItemId [varchar] columns.
This table was added to the EF model so we did not use temporary tables.
On upload of these Ids this table is populated with the Ids [Inserts are quick enough we find using EF]
We then use context.ExecuteStoreCommand to run SQL that joins the Item table to the main table and updates the bit field in the main table for the records that exist for the batch Id created specifically for that session (see the sketch below).
We finally clear this table for that batchId.
We got the performance while keeping within our no-SPROC goal. [Which not all of us agree with :) but it's a democracy]
Our exact requirements are a little more complex but insofar as needing good update performance using the Entity framework given our specific restrictions it works fine.
Liam
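A minimal sketch of the ExecuteStoreCommand step described above; the table, column, and flag names are placeholders rather than the actual schema:

using System;
using System.Data.Objects;

class EfBatchUpdateSketch
{
    static void ApplyBatch(ObjectContext context, Guid batchId)
    {
        // Join the uploaded Item rows to the main table and set the bit field in one statement.
        context.ExecuteStoreCommand(
            "UPDATE m SET m.IsFlagged = 1 " +
            "FROM MainTable m JOIN Item i ON i.ItemId = m.Id " +
            "WHERE i.BatchId = {0}",
            batchId);

        // Clear the staging rows for this batch.
        context.ExecuteStoreCommand("DELETE FROM Item WHERE BatchId = {0}", batchId);
    }
}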

how can I speed up insertion of many rows to a table via ADO.NET?

I have a table that has 5 columns: AcctId (int), Address1 (varchar), Address2 (varchar), Person1 (varchar), Person2 (varchar) . I'm generating random data to insert into this table via a C# console application. I've tried doing this random data insert via SQL-Server and decided it was not a good solution -- SQL is not good at random on an each-row basis. Generating the random data -- 975k rows of it -- takes a minimal amount of time. It's in a List of custom objects.
I need to take this random data and update many rows in the database with the new random data. I tried updating the rows one at a time, very slow because of the repeated searching of the List object in code. So I think the best approach is to put all the randomized data into a table in the database, then update all the other tables that use this data. I.e. UPDATE t SET t.Address1=d.Address1 FROM Table1 t INNER JOIN RandomizedData d ON d.AcctId = t.Acct_ID. The database is very un-normalized so this Acct data is sprinkled all over the place. I've got no control of the normalization.
So, having decided to insert all of the randomized data into a single table, I set out to create insert scripts:
USE TheDatabase
Insert tmp_RandomizedData
SELECT 1,'4392 EIGHTH AVE','','JENNIFER CARTER','BARBARA CARTER' UNION ALL
SELECT 2,'2168 MAIN ST','HNGR F','DANIEL HERNANDEZ','SUSAN MARTIN'
-- etc., another 98 times...
-- FYI, this is not real data!
I'm building this INSERT script in batches of 100. It's taking on average 175 ms to run each insert. Does this seem like a long time? It's going to take about 35 mins to run the whole insert.
The table doesn't have a primary key or any indexes. I was planning on adding those after all the data is inserted (thinking that that would be faster).
Is there a better way to do this?
The SqlBulkCopy class in .NET can blast records in pretty quickly. I used it to transfer data from an i-Series database to SQL tables very rapidly.
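For example, loading the generated list into tmp_RandomizedData in one bulk operation might look roughly like this; the Row record stands in for the custom object mentioned in the question:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class RandomDataLoadSketch
{
    // Row stands in for the custom object holding the generated data.
    record Row(int AcctId, string Address1, string Address2, string Person1, string Person2);

    static void Load(string connString, IEnumerable<Row> rows)
    {
        var table = new DataTable();
        table.Columns.Add("AcctId", typeof(int));
        table.Columns.Add("Address1", typeof(string));
        table.Columns.Add("Address2", typeof(string));
        table.Columns.Add("Person1", typeof(string));
        table.Columns.Add("Person2", typeof(string));
        foreach (var r in rows)
            table.Rows.Add(r.AcctId, r.Address1, r.Address2, r.Person1, r.Person2);

        using var bulk = new SqlBulkCopy(connString)
        {
            DestinationTableName = "dbo.tmp_RandomizedData",
            BatchSize = 50000    // large batches work well for a load like this, per the advice below
        };
        bulk.WriteToServer(table);
    }
}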
Use BCP. You can use this article as a guide. It's for VB6 but the gist is exactly the same. The trick is to use the BULK INSERT command.
... Read more of your question, you might also want to look at Sql RedGates sample data generator, it generates tons of data really, really, fast.
Use larger batches, 50,000 to 75,000 rows. On SQL 2000 on hardware from 2000, the sweet spot for inserts was 50,000 rows. This was on a live production database, with indexes, during the day on a very large table.
Small batch sizes are better for inserts into a highly active table and where there is a high deadlock risk. Is anyone else using this table while you're doing inserts?
Is this a one time import? Let it run over night.
Finally, INSERT statements executed via ADO.NET aren't really an optimal ETL solution. SSIS, DTS, (or any other ETL tool, such as Talend) would be more appropriate for heavy-duty data moving. On the other hand, if all you have is a hammer...
