I have a table in an Oracle database, let's call it Task, where I'm inserting a bunch of rows from a batch process.
I have a unique constraint set up on 4 columns (locationId, shelfId, itemId, and batchId), one of which is nullable (shelfId).
In the process that parses the CSV file's values (read from another database table), the rows are batched in groups of 100 and posted to an API for further parsing (into the format of the table mentioned above), then inserted for later submission to another table (in a different schema, but with the same unique constraint). The issue I'm running into occurs when the file contains duplicates under the above constraint (they are typically sequential, and I have only ever seen one additional entry in the file). After the rows have been parsed, they are inserted, and I'm seeing the unique constraint exception thrown on rows that a) do not already have a row in the table and b) are not themselves duplicates under the constraint. When I remove the duplicates from the initial import file I do not get any unique constraint exceptions (which... makes sense, weirdly enough).
I'm using Entity Framework in .NET for the Oracle database, which I wouldn't think has anything to do with this, but it may, judging by the weirdness of this issue. I'm completely stumped. I've tried writing additional validation, looking up the records in the table before inserting them, and removing the duplicates from the initial file (which works as a workaround), but I'm unsure what to do for a long-term solution.
Example Data:
LocationId  ShelfId  ItemID   BatchId
1           NULL     00AXXFD  1
1           NULL     00AXXFD  1
1           NULL     00FFD12  1
etc...
You are getting the unique key (UK) error because your input data contains duplicates. When you insert all of the rows at once they are part of the same transaction, so Oracle sees the duplicates and throws the exception even before you commit. After it fails, the transaction rolls back, so none of the records are inserted and you find no duplicates in the table afterwards.
The correct approach is to remove duplicates from the input data (as you are doing) before inserting.
Alternatively, you can let Oracle enforce the UK by committing after the insertion of each row.
Note: as I was saying, you may not be committing after inserting each row. It doesn't matter whether insertion happens one by one or all at once; what matters is the transaction scope. JDBC, for example, has autocommit=true/false to enable single-operation commits. When it is 'true', a transaction is committed after every operation. In general it needs to be 'false' so that you can control the transaction scope.
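To make the "remove duplicates before inserting" step concrete, here is a minimal C# sketch. The TaskRow record and the TaskDeduplicator class are hypothetical names invented for illustration; only the four column names come from the question. It assumes two rows count as duplicates when all four values match, with a null ShelfId matching another null ShelfId, which appears to be the behaviour the question describes.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape mirroring the four constrained columns.
public record TaskRow(int LocationId, int? ShelfId, string ItemId, int BatchId);

public static class TaskDeduplicator
{
    // Keeps the first row seen for each (LocationId, ShelfId, ItemId, BatchId) key.
    // Assumption: a null ShelfId is treated as equal to another null ShelfId.
    public static List<TaskRow> RemoveDuplicates(IEnumerable<TaskRow> rows) =>
        rows.GroupBy(r => (r.LocationId, r.ShelfId, r.ItemId, r.BatchId))
            .Select(g => g.First())
            .ToList();
}
```

Calling TaskDeduplicator.RemoveDuplicates on each batch of 100 before handing it to the context would drop the second 00AXXFD row from the example data. It does not catch duplicates that span two batches or that already exist in the table.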
I'm working on some bulk inserts with Entity Framework Core. To minimize round trips to the database, the new inserts are batched in groups of 100 before being added to the database context and saved using SaveChanges().
The current problem is that if any row in the batch fails to insert because of, e.g., a unique key violation on the table, the entire transaction is rolled back. In this scenario it would be ideal to simply discard the records that could not be inserted and insert the rest.
I'm more than likely going to need to write a stored procedure for this, but is there any way to have Entity Framework Core skip over rows that fail to insert?
In your stored procedure, use a MERGE statement instead of an INSERT, and only use the WHEN NOT MATCHED clause:
MERGE targetTable WITH (HOLDLOCK) AS existing
USING @tvp AS incoming
ON (incoming.PK = existing.PK)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PK) VALUES (incoming.PK); -- list the remaining columns as needed
The records that match will be discarded. @tvp is the table-valued parameter passed to the stored proc from your app code.
There are locking considerations when using the MERGE statement that may or may not apply to your scenario. It's worth reading up on concurrency and atomicity for it to make sure you cover the rest of your bases.
If you are going with a stored procedure, you can declare TVPs. In C#, when filling the TVP fails you will find out in the catch block, and you can then recurse over those 100 rows.
Recursive function
It breaks the N rows into two sets of N/2 and calls the TVP fill again on each. If the first set is OK it goes through; if the second set fails, simply call the recursive function again on that set. This keeps your safe records in the TVP and the failed records separately. You can call this recursive function up to a depth X, where X is a number such as 5, 6, or 7. After that, you will be left with only the bad records.
If you need to know about TVP you can see this
Note : you can not use Parallel Query execution for this approach.
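The halving idea above can be sketched roughly as follows. BatchBisector and the tryInsert callback are hypothetical: tryInsert stands in for whatever fills the TVP and calls the stored procedure, and is assumed to insert the slice atomically and return false (with nothing inserted) if any row is rejected. Unlike the description above, this version recurses until a failing slice is a single row rather than stopping at a fixed depth.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchBisector
{
    // Tries to insert the whole slice; on failure, splits it in half and recurses.
    // tryInsert is a hypothetical all-or-nothing insert returning false on rejection.
    public static void InsertSkippingBadRows<T>(
        IReadOnlyList<T> batch,
        Func<IReadOnlyList<T>, bool> tryInsert,
        List<T> rejected)
    {
        if (batch.Count == 0) return;
        if (tryInsert(batch)) return;      // the whole slice went in

        if (batch.Count == 1)
        {
            rejected.Add(batch[0]);        // isolated a single bad row
            return;
        }

        int half = batch.Count / 2;
        InsertSkippingBadRows(batch.Take(half).ToList(), tryInsert, rejected);
        InsertSkippingBadRows(batch.Skip(half).ToList(), tryInsert, rejected);
    }
}
```

For a batch of 100 the recursion is about 7 levels deep, which lines up with the "5, 6, 7" figure mentioned above. The cost is extra round trips only for slices that contain a bad row.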
What is the most efficient way to update a row if it exists or add it if it doesn't?
I am using Linq to SQL and have read a few posts on it, but none that are current or that solve it without multiple database calls or an old framework. Currently I just insert, and if there is a duplicate the statement gives an error: Violation of PRIMARY KEY constraint.
The reason I need it to be quick is that it will eventually be hitting many thousands of records.
Knowing that round-trip calls were needed in order to do this, I loaded the table's primary key values. Then before calling db.SaveChanges I added the following code:
// add the row only if its key is not already in the loaded list
if (!checklist.Contains(tempstats))
{
    db.Stats.Add(tempstats);
}
I have created two threads in C# and I am calling two separate functions in parallel. Both functions read the last ID from the XYZ table and insert a new record with the value ID+1. The ID column is the primary key. When I execute both functions I get a primary key violation error. Both functions run the query below:
insert into XYZ values((SELECT max(ID)+1 from XYZ),'Name')
It seems that both functions are reading the value at the same time and trying to insert with the same value.
How can I solve this problem?
Let the database handle selecting the ID for you. It's obvious from your code above that what you really want is an auto-incrementing integer ID column, which the database can definitely handle doing for you. So set up your table properly and instead of your current insert statement, do this:
insert into XYZ values('Name')
If your database table is already set up I believe you can issue a statement similar to:
alter table your_table modify column your_table_id int(size) auto_increment
Finally, if none of these solutions are adequate for whatever reason (including, as you indicated in the comments section, inability to edit the table schema), then you can do as one of the other users suggested in the comments and create a synchronized method to find the next ID. You would basically just create a static method that returns an int, issue your select id statement in that static method, and use the returned result to insert your next record into the table. Since this method would not guarantee a successful insert (due to external applications' ability to also insert into the same table), you would also have to catch exceptions and retry on failure.
Set ID column to be "Identity" column. Then, you can execute your queries as:
insert into XYZ values('Name')
I think that you can't use ALTER TABLE to change a column to be Identity after the column is created. Use Management Studio to set this column to be Identity. If your table has many rows, this can be a long-running process, because it will actually copy your data to a new table (it performs table re-creation).
Most likely that option is disabled in your Management Studio. To enable it, open Tools->Options->Designers and uncheck the option "Prevent saving changes that require table re-creation". Depending on your table size, you will probably have to set the timeout, too. Your table will be locked during that time.
A solution for such problems is to generate the ID using some kind of sequence.
For example, in SQL Server you can create a sequence using the command below:
CREATE SEQUENCE Test.CountBy1
START WITH 1
INCREMENT BY 1 ;
GO
Then in C#, you can retrieve the next value from Test.CountBy1 and assign it to the ID before inserting.
It sounds like you want a higher transaction isolation level or more restrictive locking.
I don't use these features too often, so hopefully somebody will suggest an edit if I'm wrong, but you want one of these:
-- specify the strictest isolation level
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
insert into XYZ values((SELECT max(ID)+1 from XYZ),'Name')
or
-- make locks exclusive so other transactions cannot access the same rows
insert into XYZ values((SELECT max(ID)+1 from XYZ WITH (XLOCK)),'Name')
We have a table with a key field, and another table which contains the current value of that key sequence, ie, to insert a new record you need to:
UPDATE seq SET key = key + 1
SELECT key FROM seq
INSERT INTO table (id...) VALUES (@key...)
Today I have been investigating collisions, and have found that, without using transactions, the above code run in parallel induces collisions. However, swapping the UPDATE and SELECT lines does not induce collisions, i.e.:
SELECT key + 1 FROM seq
UPDATE seq SET key = key + 1
INSERT INTO table (id...) VALUES (@key...)
Can anyone explain why? (I am not interested in better ways to do this, I am going to use transactions, and I cannot change the database design, I am just interested in why we observed what we did.)
I am running the two lines of SQL as a single string using C#'s SqlConnection, SqlCommand and SqlDataAdapter.
First off, your queries do not entirely make sense. Here's what I presume you are actually doing:
UPDATE seq SET key = key + 1
SELECT @key = key FROM seq
INSERT INTO table (id...) VALUES (@key...)
and
SELECT @key = key + 1 FROM seq
UPDATE seq SET key = @key
INSERT INTO table (id...) VALUES (@key...)
You're experiencing concurrency issues tied to the Transaction Isolation Level.
Transaction Isolation Levels represent a compromise between the need for concurrency (i.e. performance) and the need for data quality (i.e. accuracy).
By default, SQL Server uses a Read Committed isolation level, which means you can't get "dirty" reads (reads of data that has been modified by another transaction but not yet committed to the table). It does not, however, mean that you are immune from other types of concurrency issues.
In your case, the issue you are having is called a non-repeatable read.
In your first example, the first line is reading the key value, then updating it. (In order for the UPDATE to set the column to key+1 it must first read the value of key). Then the second line's SELECT is reading the key value again. In a Read Committed or Read Uncommitted isolation level, it is possible that another transaction meanwhile completes an update to the key field, meaning that line 2 will read it as key+2 instead of the expected key+1.
Now, with your second example, once the key value has been read and modified and placed in the @key variable, it is not being read again. This prevents the non-repeatable read issue, but you're still not totally immune from concurrency problems. What can happen in this scenario is a lost update, in which two or more transactions end up trying to update key to the same value, and subsequently inserting duplicate keys into the table.
To be absolutely certain of having no concurrency problems with this structure as designed, you will need to use locking hints to ensure that all reads and updates to key are serializable (i.e. not concurrent). This will have horrendous performance, but "WITH (UPDLOCK, HOLDLOCK)" will get you there.
Your best solution, if you cannot change the database design, is to find someone who can. As Brian Hoover indicated, an auto-incrementing IDENTITY column is the way to do this with superb performance. The way you're doing it now reduces SQL's V-8 engine to one that is only allowed to fire on one cylinder.
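The lost update described above can be illustrated with a small single-threaded C# simulation. This is not real database behaviour, just the interleaving written out by hand; SequenceRace and its method names are made up for the illustration, and Interlocked.Increment merely stands in for an atomic "increment and read" such as a locked update or an IDENTITY column.

```csharp
using System.Threading;

public static class SequenceRace
{
    // Two sessions run "SELECT @key = key + 1; UPDATE seq SET key = @key",
    // with both reads happening before either write.
    public static (int First, int Second, int Stored) InterleavedReadThenWrite(int start)
    {
        int seq = start;
        int a = seq + 1;   // session A reads
        int b = seq + 1;   // session B reads before A has written
        seq = a;           // A writes
        seq = b;           // B writes the same value: A's update is lost
        return (a, b, seq);
    }

    // The same two sessions when increment-and-read is a single atomic step.
    public static (int First, int Second, int Stored) AtomicIncrement(int start)
    {
        int seq = start;
        int a = Interlocked.Increment(ref seq);
        int b = Interlocked.Increment(ref seq);
        return (a, b, seq);
    }
}
```

In the first method both sessions end up holding the same key and would insert duplicate ids; in the second they receive consecutive, distinct keys.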
I have a legacy data table in SQL Server 2005 that has a PK with no identity/autoincrement and no power to implement one.
As a result, I am forced to create new records in ASP.NET manually via the ole "SELECT MAX(id) + 1 FROM table"-before-insert technique.
Obviously this creates a race condition on the ID in the event of simultaneous inserts.
What's the best way to gracefully resolve the event of a race collision? I'm looking for VB.NET or C# code ideas along the lines of detecting a collision and then re-attempting the failed insert by getting yet another max(id) + 1. Can this be done?
Thoughts? Comments? Wisdom?
Thank you!
NOTE: What if I cannot change the database in any way?
Create an auxiliary table with an identity column. In a transaction insert into the aux table, retrieve the value and use it to insert in your legacy table. At this point you can even delete the row inserted in the aux table, the point is just to use it as a source of incremented values.
Not being able to change database schema is harsh.
If you insert an existing PK into the table you will get a SqlException with a message indicating a PK constraint violation. Catch this exception and retry the insert a few times until you succeed. If you find that the collision rate is too high, you may try max(id) + <small-random-int> instead of max(id) + 1. Note that with this approach your ids will have gaps and the id space will be exhausted sooner.
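A rough sketch of that catch-and-retry loop is below. Everything named here is hypothetical: getNextId stands in for the SELECT MAX(id) + 1 query, insert for the actual INSERT, and DuplicateKeyException for the provider's error. With SQL Server you would typically catch SqlException and check its Number for a key violation (2627 or 2601) instead.

```csharp
using System;

// Hypothetical stand-in for the provider's duplicate-key error.
public class DuplicateKeyException : Exception { }

public static class RetryingInserter
{
    // Re-reads the next id and retries the insert when another writer wins the race.
    // Gives up (rethrows) once maxAttempts inserts have failed.
    public static int InsertWithRetry(Func<int> getNextId, Action<int> insert, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            int id = getNextId();
            try
            {
                insert(id);
                return id;
            }
            catch (DuplicateKeyException) when (attempt < maxAttempts)
            {
                // Someone else took this id first; loop and read MAX(id) again.
            }
        }
    }
}
```

The exception filter lets the last failure propagate to the caller, so a persistent collision is not silently swallowed.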
Another possible approach is to emulate autoincrementing id outside of database. For instance, create a static integer, Interlocked.Increment it every time you need next id and use returned value. The tricky part is to initialize this static counter to good value. I would do it with Interlocked.CompareExchange:
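using System.Threading;

class Autoincrement {
    static int id = -1;

    // Placeholder: run "select max(id)" against the db here.
    static int QueryMaxIdFromDb() { throw new System.NotImplementedException(); }

    public static int NextId() {
        if (id == -1) {
            // not initialized - initialize
            int lastId = QueryMaxIdFromDb();
            // store lastId only if id is still -1 (another thread may have won)
            Interlocked.CompareExchange(ref id, lastId, -1);
        }
        // get next id atomically
        return Interlocked.Increment(ref id);
    }
}
Obviously the latter works only if all inserted ids are obtained via Autoincrement.NextId of a single process.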
The key is to do it in one statement or one transaction.
Can you do this?
INSERT INTO table (PKcol, col2, col3, ...)
SELECT (SELECT MAX(id) + 1 FROM table WITH (HOLDLOCK, UPDLOCK)), @val2, @val3, ...
Without testing, this will probably work too:
INSERT INTO table (PKcol, col2, col3, ...)
VALUES ((SELECT MAX(id) + 1 FROM table WITH (HOLDLOCK, UPDLOCK)), @val2, @val3, ...)
If you can't, another way is to do it in a trigger.
The trigger is part of the INSERT transaction
Use HOLDLOCK, UPDLOCK for the MAX. This holds the row lock until commit
The row being updated is locked for the duration
A second insert will wait until the first completes.
The downside is that you are changing the primary key.
An auxiliary table needs to be part of a transaction.
Or change the schema as suggested...
Note: All you need is a source of ever-increasing integers. It doesn't have to come from the same database, or even from a database at all.
Personally, I would use SQL Express because it is free and easy.
If you have a single web server:
Create a SQL Express database on the web server with a single table [ids] with a single autoincrementing field [new_id]. Insert a record into this [ids] table, get the [new_id], and pass that onto your database layer as the PK of the table in question.
If you have multiple web servers:
It's a pain to set up, but you can use the same trick by setting an appropriate seed/increment (i.e. increment = 3, and seed = 1/2/3 for three web servers).
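To see why the seed/increment trick keeps the servers from colliding, here is the arithmetic as a tiny C# sketch. PartitionedIds is a hypothetical helper that only mimics what an identity column with a given seed and increment would hand out; in practice the database generates these values.

```csharp
using System.Collections.Generic;

public static class PartitionedIds
{
    // Yields the first `count` ids for one server: seed, seed + increment, ...
    public static IEnumerable<int> ForServer(int seed, int increment, int count)
    {
        for (int n = 0; n < count; n++)
            yield return seed + n * increment;
    }
}
```

With increment = 3, the three servers draw from 1, 4, 7, ... and 2, 5, 8, ... and 3, 6, 9, ... respectively, so no two servers ever produce the same id. Adding a fourth server later means changing the increment everywhere, which is part of the setup pain mentioned above.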
What about running the whole batch (select for id and insert) in a serializable transaction?
That should get you around needing to make changes in the database.
Is the main concern concurrent access? I mean, will multiple instances of your app (or, God forbid, other apps outside your control) be performing inserts concurrently?
If not, you can probably manage the inserts through a central, synchronized module in your app, and avoid race conditions entirely.
If so, well... like Joel said, change the database. I know you can't, but the problem is as old as the hills, and it's been solved well -- at the database level. If you want to fix it yourself, you're just going to have to loop (insert, check for collisions, delete) over and over and over again. The fundamental problem is that you can't perform a transaction (I don't mean that in the SQL "TRANSACTION" sense, but in the larger data-theory sense) if you don't have support from the database.
The only further thought I have is that if you at least have control over who has access to the database (e.g., only "authorized" apps, either written or approved by you), you could implement a side-band mutex of sorts, where a "talking stick" is shared by all the apps and ownership of the mutex is required to do an insert. That would be its own hairy ball of wax, though, as you'd have to figure out policy for dead clients, where it's hosted, configuration issues, etc. And of course a "rogue" client could do inserts without the talking stick and hose the whole setup.
The best solution is to change the database. You may not be able to change the column to be an identity column, but you should be able to make sure there's a unique constraint on the column and add a new identity column seeded with your existing PKs. Then either use the new column instead, or use a trigger to make the old column mirror the new one, or both.