I have a legal requirement for gap-less numbering on several tables. The IDs can have holes in them, but not the sequences.
This is something I have to solve either in the C# code or in the database (Postgres, MS SQL and Oracle).
This is my problem:
Start transaction 1
Start transaction 2
Insert row on table "Portfolio" in transaction 1
Get next number in sequence for column Portfolio_Sequence (1)
Insert row on table "Document" in transaction 1
Get next number in sequence for column Document_Sequence (1)
Insert row on table "Portfolio" in transaction 2
Get next number in sequence for column Portfolio_Sequence (2)
Insert row on table "Document" in transaction 2
Get next number in sequence for column Document_Sequence (2)
Problem occurred in transaction 1
Rollback transaction 1
Commit transaction 2
Problem: Gap in sequence for both Portfolio_Sequence and Document_Sequence.
Note that this is very simplified and there are many more tables involved in each of the transactions.
How can I deal with this?
I have seen suggestions where you "lock" the sequence until the transaction is either committed or rolled back, but that would be a huge bottleneck for the system with this many tables involved and such long, complex transactions.
As you already seem to have concluded, gap-less sequences simply do not scale. Either you run the risk of dropping values when a rollback occurs, or you have a serialization point that will prevent a multi-user, concurrent transaction system from scaling. You cannot have both.
My thought would be: what about a post-processing step, where you have a process that runs every day at close of business, checks for gaps, and renumbers anything that needs to be renumbered?
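As a rough illustration of that end-of-day job, here is a minimal Postgres-flavoured sketch. It assumes a Portfolio table with an id primary key next to the Portfolio_Sequence column; those names are placeholders, not your real schema.

-- Hypothetical nightly renumbering job (assumed table/column names):
-- reassign Portfolio_Sequence so the committed rows form a gap-less series,
-- keeping the original ordering.
UPDATE Portfolio p
SET    Portfolio_Sequence = r.new_seq
FROM  (SELECT id,
              row_number() OVER (ORDER BY Portfolio_Sequence) AS new_seq
       FROM   Portfolio) r
WHERE  p.id = r.id
  AND  p.Portfolio_Sequence <> r.new_seq;

You would still need a business rule for what renumbering means for documents that have already been printed or sent, which is really the same question as the coffee-stained form below.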
One final thought: I don't know your exact requirement, but I know you said this is "required by law". Well, ask yourself: what did people do before there were computers? How would this "requirement" have been met? Presumably you had a stack of blank forms that came preprinted with a "sequence" number in the upper right corner. And what happened if someone spilled coffee on a form? How was that handled? It seems you need a similar method to handle that in your system.
Hope that helps.
This problem is impossible to solve in principle, because any transaction can roll back (bugs, timeouts, deadlocks, network errors, ...).
You will have a serial contention point. Try to reduce contention as much as possible: keep the transaction that allocates numbers as small as possible. Also, allocate numbers as late as possible in the transaction, because contention only arises once you allocate a number. If you're doing 1000 ms of uncontended work and then allocate a number (taking 10 ms), you still have a degree of parallelism of 100, which is enough.
So maybe you can insert all rows (of which you say there are many) with dummy sequence numbers, and only at the end of the transaction quickly allocate all real sequence numbers and update the rows that are already written. This works well if there are more inserts than updates, or the updates are quicker than the inserts (which they will be), or there is other processing or waiting interleaved between the inserts.
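A minimal Postgres-flavoured sketch of that "allocate late" idea. Document and Document_Sequence come from the question; the id and payload columns and the one-row-per-counter table gapless_counters are made-up names.

BEGIN;
-- Do the bulk of the work first, with a placeholder sequence number.
-- (gapless_counters and the column names are assumptions.)
INSERT INTO Document (id, Document_Sequence, payload)
VALUES (1234, -1, '...');            -- many more inserts, validations, etc.

-- Only at the very end, take the contended gap-less number and patch the row.
-- The row lock on gapless_counters is held just until COMMIT.
WITH next AS (
    UPDATE gapless_counters
    SET    last_value = last_value + 1
    WHERE  name = 'Document_Sequence'
    RETURNING last_value
)
UPDATE Document d
SET    Document_Sequence = next.last_value
FROM   next
WHERE  d.id = 1234;
COMMIT;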
Gap-less sequences are hard to come by. I suggest using a plain serial column instead, and creating a view with the window function row_number() to produce a gap-less sequence:
CREATE VIEW foo AS
SELECT *, row_number() OVER (ORDER BY serial_col) AS gapless_id
FROM tbl;
Here is an idea that should support both high performance and high concurrency:
Use a highly concurrent, cached Oracle sequence to generate a dumb unique identifier for the gap-less table row. Call this entity MASTER_TABLE
Use the dumb unique identifier for all internal referential integrity from the MASTER_TABLE to other dependent detail tables.
Now your gap-less MASTER_TABLE sequence number can be implemented as an additional attribute on the MASTER_TABLE, populated by a process that is separate from the MASTER_TABLE row creation. In fact, the gap-less additional attribute should be maintained in a 4th-normal-form attribute table of the MASTER_TABLE, so that a single background thread can populate it at leisure, without concern for any row locks on the MASTER_TABLE.
All queries that need to display the gap-less sequence number on a screen or report would join the MASTER_TABLE with the gap-less 4th-normal-form attribute table. Note that these joins can only be satisfied after the background thread has populated the gap-less attribute table.
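For illustration only, an Oracle-style sketch of what that separate attribute table and the background numbering job might look like (all object names here are assumptions):

-- Separate 4NF table holding the gap-less number, keyed by the dumb id.
-- (master_table, master_id and all other names are assumptions.)
CREATE TABLE master_gapless_seq (
    master_id   NUMBER PRIMARY KEY REFERENCES master_table (master_id),
    gapless_no  NUMBER NOT NULL UNIQUE
);

-- Background job: number any committed master rows that have no number yet.
INSERT INTO master_gapless_seq (master_id, gapless_no)
SELECT m.master_id,
       (SELECT NVL(MAX(g2.gapless_no), 0) FROM master_gapless_seq g2)
       + ROW_NUMBER() OVER (ORDER BY m.master_id)
FROM   master_table m
WHERE  NOT EXISTS (SELECT 1
                   FROM   master_gapless_seq g
                   WHERE  g.master_id = m.master_id);

Because only the single background job ever writes gapless_no, it never fights the transactions that create MASTER_TABLE rows.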
Related
At the risk of over-explaining my question, I'm going to err on the side of too much information.
I am creating a bulk upload process that inserts data into two tables. The two tables look roughly as follows; the first, Parts, is a self-referencing table that allows N levels of reference.
Parts (self-referencing table)
--------
PartId (PK Int Non-Auto-Incrementing)
DescriptionId (Fk)
ParentPartId
HierarchyNode (HierarchyId)
SourcePartId (VARCHAR(500) a unique Part Id from the source)
(other columns)
Description
--------
DescriptionId (PK Int Non-Auto-Incrementing)
Language (PK either 'EN' or 'JA')
DescriptionText (varchar(max))
(I should note too that there are other tables that will reference our PartID that I'm leaving out of this for now.)
In Description, the combination of DescriptionText and Language will be unique, but the actual DescriptionId will always have at least two instances.
Now, for the bulk upload process, I created two staging tables that look a lot like Parts and Description but don't have any PK's, Indexes, etc. They are Parts_Staging and Description_Staging.
In Parts_Staging there is an extra column that contains a Hierarchy Node String, which is the HierarchyNode in this kind of format: /1/2/3/ etc. Then when data is copied from the _Staging table to the actual table, I use a CAST(Source.Column AS hierarchyid).
Because of the complexity of the IDs shared across the two tables, the self-referencing IDs and the hierarchyid in Parts, and the number of rows to be inserted (possibly in the 100,000s), I decided to compile 100% of the data in a C# model first, including the PK IDs. So the process looks like this in C#:
1. Query the two tables for MAX ID.
2. Using those max IDs, compile a complete model of all the data for both tables (including the hierarchyid /1/2/3/).
3. Do a bulk insert into both _Staging tables.
4. Trigger an SP that copies non-duplicate data from the two _Staging tables into the actual tables. (This is where the CAST(Source.Column AS hierarchyid) happens.)
We are importing lots of parts books, and a single part may be replicated across multiple books, so we need to remove the duplicates. In step 4, duplicates are weeded out by checking the SourcePartId in the Parts table and the DescriptionText in the Description table.
That entire process works beautifully! And best of all, it's really fast. But if you are reading this carefully (and I thank you if you are), then you have already noticed one glaring, obvious problem.
If multiple processes are happening at the same time (and that absolutely WILL happen!) then there is a very real risk of getting the IDs mixed up and the data becoming badly corrupted: Process1 could do the GET MAX ID query, and before it manages to finish, Process2 could also do a GET MAX ID query; because Process1 hasn't actually written to the tables yet, Process2 would get the same IDs.
My original thought was to use a SEQUENCE object, and at first that plan seemed brilliant. But it fell apart in testing, because it's entirely possible that the same data will be processed more than once and eventually ignored when the copy happens from the _Staging tables to the final tables. In that case the SEQUENCE numbers will already have been claimed and used, resulting in giant gaps in the IDs. Not a fatal flaw, but an issue we would rather avoid.
So... that was a LOT of background info to ask this actual question. What I'm thinking of doing is this:
Lock both of the tables in question
Steps 1-4 as outlined above
Unlock both of the tables.
The lock would need to be a READ lock (which I think is an Exclusive lock?) so that if another process attempts to do the GET MAX ID query, it will have to wait.
My question is: 1) Is this the best approach? And 2) How does one place an Exclusive lock on a table?
Thanks!
I'm not sure about the best approach, but in terms of placing an 'exclusive' lock on a table, simply using WITH (TABLOCKX) in your query will put one on the table.
If you wish to learn more about it:
https://msdn.microsoft.com/en-GB/library/ms187373.aspx
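For example, a minimal sketch of holding an exclusive table lock across the MAX-ID reads; the HOLDLOCK hint keeps the lock until the end of the transaction. The table names match the schema in the question, everything else is illustrative.

BEGIN TRANSACTION;

DECLARE @MaxPartId INT, @MaxDescriptionId INT;

-- TABLOCKX takes an exclusive table lock; HOLDLOCK keeps it until COMMIT.
-- (Schema/table names are taken from the question; the rest is assumed.)
SELECT @MaxPartId = MAX(PartId)
FROM   dbo.Parts WITH (TABLOCKX, HOLDLOCK);

SELECT @MaxDescriptionId = MAX(DescriptionId)
FROM   dbo.Description WITH (TABLOCKX, HOLDLOCK);

-- ... bulk insert into the _Staging tables and run the copy SP here ...

COMMIT TRANSACTION;   -- both table locks are released here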
I'm wrestling with the classic problem of inventory allocation and concurrency and I wondered if anyone could give me a steer on best practice in this situation.
My situation is that we have an order prepared with several "slots", which will be filled by unique inventory items at a certain stage in the process. At that point I want to make sure that nobody allocates the same unique unit to a slot on a different order. For example, a user wants a van next Thursday, so I reserve a "van" slot; at a later point in time I allocate a specific vehicle from the yard to this slot. I want to make sure that two different operators can't allocate the same van to two different customers next Thursday.
We already have a stock availability check process where we compare the aggregate of two tables within a date range. The result of summing these two tables (one is items in and the other is items out) tells me whether we have the specific item that I want to allocate to this slot on this date, but I want to prevent another user from allocating the same item to their own slot at the same point in time.
I've already done some googling and research on this site and it looks like I need a "pessimistic locking" solution but I'm not sure how to put one in place effectively.
The allocation process will be called from a web API (a REST API using .NET) with Entity Framework, and I've considered the following two solutions:
Option 1 - Let the database handle it
At the point of allocation I begin a transaction and acquire an exclusive lock on the two tables used for evaluating stock availability.
The process confirms the stock availability, allocates the units to the slots and then releases the locks.
I think this would prevent the race condition of two users trying to allocate the same unique unit to two different orders, but I'm uncomfortable with locking two tables for every other process that needs to query them until the allocation process completes, as I think this could cause a bottleneck for other processes attempting to read those tables. In this scenario, the second process attempting the duplicate allocation would be queued until the first has released the locks, since it won't be able to query the availability tables; when it finally does, it will fail the availability check and report an out-of-stock warning, effectively blocking the second order from allocating the same stock.
On paper this sounds like it would work, but I have two concerns: the first is that it will hit performance, and the second is that I'm overlooking something. Also, I'm using Postgres for the first time on this project (I'm normally a SQL Server guy), but I think Postgres still has the features to do this.
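Roughly, what I have in mind for Option 1 looks like this in Postgres. The table names are made up, and this particular lock mode only queues other allocators rather than blocking plain readers:

BEGIN;

-- SHARE ROW EXCLUSIVE conflicts with itself, so a second allocation
-- transaction taking the same locks waits here until this one finishes,
-- while ordinary SELECTs elsewhere are not blocked.
-- (items_in / items_out are invented table names.)
LOCK TABLE items_in, items_out IN SHARE ROW EXCLUSIVE MODE;

-- 1. Re-run the availability check for the item and date range.
-- 2. If still available, allocate the unit to the slot.

COMMIT;  -- locks are released here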
Option 2 - Use some kind of manual locking
I think my scenario is similar to what ticketing websites encounter during the sales process for concerts or cinemas. I've seen them put up timers saying things like "your tickets will expire in 5 minutes", but I don't know how they implement this kind of system in the back end. Do they create a table of "reserved" stock before the allocation process begins, with some kind of expiry time on the rows, and then "blacklist" other users attempting to allocate the same units until that timer expires?
Sorry for the long intro but I wanted to explain the problem completely as I've seen plenty of questions about similar scenarios but nothing that really helped me to make a decision on how to proceed.
My question is which of the two options (if any) are "the right way" to do this?
Edit: The closest parallel to this question I've seen is How to deal with inventory and concurrency but it doesn't discuss option 1 (possibly because it's a terrible idea)
I think option 2 is better, with some tweaks.
This is what I'd do if I had to deal with such a situation.
Whenever a user tries to book a vehicle for a slot, make an entry in a temporary holding area (a normal table will do, but if you have a higher transaction volume you may want to look into a caching database solution). The entry should have a unique key made up of the unique car id + slot time, and no duplicate entries should be allowed for that combination. That way you'll get an error in your application if two users try to book the same car for the same slot at the same time, so you can notify the other user that the van is already gone.
So before the second user tries to book a vehicle, they must check for a lock on that car for that slot (or you can show the unavailability of cars for that slot using this data).
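A rough sketch of that holding table in Postgres (all names invented); the unique constraint is what turns the race into a clean error for the second booker:

-- Holding table: one row per vehicle/slot currently being reserved.
-- (All table and column names are assumptions.)
CREATE TABLE vehicle_hold (
    vehicle_id  bigint      NOT NULL,
    slot_start  timestamptz NOT NULL,
    held_by     text        NOT NULL,
    held_at     timestamptz NOT NULL DEFAULT now(),
    CONSTRAINT uq_vehicle_slot UNIQUE (vehicle_id, slot_start)
);

-- The second INSERT for the same (vehicle_id, slot_start) raises a
-- unique-violation error, which the application reports as "already taken".
INSERT INTO vehicle_hold (vehicle_id, slot_start, held_by)
VALUES (42, '2024-05-02 09:00+00', 'operator_1');

A scheduled job can delete rows whose held_at is older than your booking timeout, which gives you the "your reservation expires in 5 minutes" behaviour mentioned in the question.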
I'm not sure how your database is laid out, but if each inventory item is its own record in the database, just have an IsUsed flag on the table. When you go to update the record, make sure you put IsUsed = 0 in the WHERE clause. If the number of rows modified comes back as 0, then you know something else updated it before you.
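For instance, a sketch of that optimistic check (table and column names are assumptions; the parameter placeholders depend on your data access layer):

-- If zero rows are affected, another process claimed the item first.
-- (inventory_item and its columns are assumed names.)
UPDATE inventory_item
SET    is_used      = 1,
       allocated_to = :order_slot_id
WHERE  id      = :item_id
  AND  is_used = 0;
-- Check the affected-row count in the application; 0 means "already taken".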
If you have a table storing vehicles in your db, then you can take a pessimistic NOWAIT lock on the vehicle to be allotted to the slot selected by the user.
Once acquired, this lock is held by one transaction until it commits or rolls back. Any other transaction that tries to acquire the lock on that vehicle will fail immediately, so there is no waiting in the db.
This scales well, because there are no queues of transactions in the db waiting for the lock on the vehicle to be allotted.
Failing transactions can be rolled back immediately, and the user asked to select a different vehicle or slot.
This also covers the case where you have multiple vehicles of the same type and might otherwise allot the same physical vehicle (i.e. the same registration number) to two users in the same slot: only one transaction will win and the others will fail.
Below is the PostgreSQL query for this:
SELECT *
FROM vehicle
WHERE id = ?
FOR UPDATE NOWAIT;
There are different approaches to this problem; I'm just describing what I considered and eventually settled on when I had to tackle it for a customer.
1. If INSERT and UPDATE traffic on these resources is not heavy, you can completely lock the table by doing something like the following, for example in a stored procedure (this can also be done in simple client-side code):
CREATE PROCEDURE ...
AS
BEGIN
BEGIN TRANSACTION
-- lock table "a" till end of transaction
SELECT ...
FROM a
WITH (TABLOCK, HOLDLOCK)
WHERE ...
-- do some other stuff (including inserting/updating table "a")
-- release lock
COMMIT TRANSACTION
END
2. Use pessimistic locking by having your code obtain locks that you manage yourself. Add an extra table per resource type you want to lock, and put a unique constraint on the Id of the resource being locked. You then obtain a lock by trying to insert a row, and release the lock by deleting it. Add timestamps so that a job can clean up locks that got lost. The table could look like this:
Id bigint
BookingId bigint -- the resource you want to lock on; put a unique constraint here
Creation datetime -- you can use these two timestamps to decide when to automatically remove a lock
Updated datetime
Username nvarchar(100) -- maybe who obtained the lock?
With this approach it's easy to decide which of your code needs to obtain a lock and what pieces of code can tolerate reading your resource and reservation table without a lock.
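As a sketch of how acquiring and releasing such a lock might look, assuming a dbo.BookingLock table shaped like the columns above (the table name, values and surrounding code are illustrative):

-- (dbo.BookingLock and the sample values are assumptions.)
DECLARE @BookingId bigint = 12345, @Username nvarchar(100) = N'some_user';

BEGIN TRY
    -- Acquiring the lock: the unique constraint on BookingId means only one
    -- caller can hold this row at a time.
    INSERT INTO dbo.BookingLock (BookingId, Creation, Updated, Username)
    VALUES (@BookingId, GETUTCDATE(), GETUTCDATE(), @Username);
END TRY
BEGIN CATCH
    -- Unique-constraint violation: someone else already holds the lock.
    THROW;
END CATCH;

-- ... do the allocation work ...

-- Releasing the lock.
DELETE FROM dbo.BookingLock
WHERE BookingId = @BookingId AND Username = @Username;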
3. If it's a resource that is allotted by a begin and end time, you could set the granularity of this timespan to e.g. 15 minutes. Each 15-minute timeslot of the day then gets a number, starting from 0 at a reasonable epoch timestamp of your choosing. You then create a table beside your reservation table where the start and end timestamps are expressed as timeslot numbers, and insert as many rows (one per timeslot number) as each reservation covers. You of course need a unique constraint on "Timeslot" + "ResourceId" so that any insert is rejected if that timeslot is already reserved.
Updating this table can nicely be done in triggers on your reservation table, so that you can still keep real timestamps on the reservation table; when an insert or update is performed, the trigger updates the timeslot table and can raise an error if the unique constraint is violated, rolling back the transaction and preventing the change in both tables.
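A minimal sketch of that timeslot table (names are assumptions):

-- One row per reserved 15-minute slot; the unique constraint rejects
-- any attempt to double-book a resource for the same slot.
CREATE TABLE ReservationTimeslot (
    ResourceId    bigint NOT NULL,
    TimeslotNo    int    NOT NULL,  -- slot number counted from the chosen epoch
    ReservationId bigint NOT NULL,
    CONSTRAINT UQ_Resource_Timeslot UNIQUE (ResourceId, TimeslotNo)
);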
I have an issue with the performance of my C# program.
In the first loop, the table inserts and updates 175,000 records in 54 secs.
In the second loop, with 175,000 records, it takes 1 min 11 secs.
Next, the third loop, with 18,195 records, takes 1 min 28 secs.
As the loops go on, the time taken keeps growing; even 125 records can take up to 2 mins.
I am wondering why smaller batches are taking longer to update. Does the number of records being updated not affect the time taken to complete the loop?
Can anyone enlighten me on this?
Flow of Program:
INSERT INTO TableA (date, time) SELECT date, time FROM rawdatatbl WHERE id >= startID AND id <= maxID; -- startID is the next ID after the last processed records
UPDATE TableA SET columnName = value, columnName1 = value, columnName2 = value, ...
I'm using InnoDB.
The reported behavior seems consistent with a growing table and an inefficient query execution plan for the UPDATE statements. The most likely explanation is that the UPDATE is performing a full table scan to locate the rows to be updated, because an appropriate index is not available. As the table has more and more rows added, it takes longer and longer to perform the full table scan.
Quick recommendations:
review the query execution plan (obtained by running EXPLAIN)
verify that suitable indexes are available and are being used
Apart from that, there's tuning of the MySQL instance itself. But that's going to depend on which storage engine the tables are using, MyISAM, InnoDB, et al.
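Concretely, the two checks above might look like this in MySQL; the WHERE columns are guesses, since the actual UPDATE statement isn't shown:

-- Look at the plan for the UPDATE; type = ALL means a full table scan.
-- (The WHERE columns are assumptions about your statement.)
EXPLAIN UPDATE TableA
SET    columnName = 'value'
WHERE  date = '2024-01-01' AND time = '12:00:00';

-- If it is scanning, an index on the columns in the WHERE clause usually helps.
ALTER TABLE TableA ADD INDEX idx_date_time (date, time);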
Please provide SHOW CREATE TABLE for both tables, and the actual statements. Here are some guesses...
The target table has indexes. Since the indexes are built as the inserts occur, any "random" indexes will become slower and slower.
innodb_buffer_pool_size was so small that caching became a problem.
The UPDATE seems to be a full table update. Well, the table is larger each time.
How did you get startID from one query before doing the next one (which has id>=startID)? Perhaps that code is slower as you get farther into the table.
You say "in the second loop", where is the "loop"? Or were you referring to the INSERT...SELECT as a "loop"?
In our current project, customers will send collections of complex/nested messages to our system. The frequency of these messages is approximately 1000-2000 msg per second.
These complex objects contain the transaction data (to be added) as well as master data (which will be added if not found). But instead of passing the ids of the master data, the customer passes the 'name' column.
The system checks whether master data exists for these names. If found, it uses the ids from the database; otherwise it creates the master data first and then uses those ids.
Once the master data ids are resolved, the system inserts the transactional data into a SQL Server database (using the master data ids). The number of master entities per message is around 15-20.
Following are some strategies we could adopt:
We can resolve the master ids first from our C# code (inserting master data if not found) and store these ids in a C# cache. Once all the ids are resolved, we can bulk insert the transactional data using the SqlBulkCopy class. We would hit the database 15 times to fetch the ids for the different entities and then hit the database one more time to insert the final data. We can use the same connection and close it after doing all this processing.
We can send all of these messages containing master data and transactional data in a single hit to the database (in the form of multiple TVPs), and then inside the stored procedure create the master data first for the missing entries and then insert the transactional data.
Could anyone suggest the best approach in this use case?
Due to privacy issues, I cannot share the actual object structure. But here is a hypothetical object structure which is very close to our business objects.
One such message will contain information about one product (its master data) and its price details (transaction data) from different vendors:
Master data (which needs to be added if not found):
Product name: ABC, ProductCategory: XYZ, Manufacturer: XXX and some other details (the number of properties is in the range of 15-20).
Transaction data (which will always be added):
Vendor Name: A, ListPrice: XXX, Discount: XXX
Vendor Name: B, ListPrice: XXX, Discount: XXX
Vendor Name: C, ListPrice: XXX, Discount: XXX
Vendor Name: D, ListPrice: XXX, Discount: XXX
Most of the master data will remain the same across messages belonging to one product (and will change less frequently), but the transaction data will always fluctuate. So the system will check whether product 'XXX' exists in the system or not. If not, it checks whether the 'Category' mentioned with this product exists or not; if not, it inserts a new record for the category and then for the product. The same is done for Manufacturer and the other master data.
Multiple vendors will be sending data about multiple products (2000-5000) at the same time.
So assume that we have 1000 suppliers, and each vendor is sending data about 10-15 different products. Every 2-3 seconds, each vendor sends us price updates for these products. A vendor may start sending data about new products, but that will not be very frequent.
You would likely be best off with your #2 idea (i.e. sending all of the 15 - 20 entities to the DB in one shot using multiple TVPs and processing as a whole set of up to 2000 messages).
Caching master data lookups at the app layer and translating prior to sending to the DB sounds great, but misses something:
You are going to have to hit the DB to get the initial list anyway
You are going to have to hit the DB to insert new entries anyway
Looking up values in a dictionary to replace with IDs is exactly what a database does (assume a Non-Clustered Index on each of these name-to-ID lookups)
Frequently queried values will have their datapages cached in the buffer pool (which is a memory cache)
Why duplicate at the app layer what is already provided and happening right now at the DB layer, especially given:
The 15 - 20 entities can have up to 20k records (which is a relatively small number, especially when considering that the Non-Clustered Index only needs to be two fields: Name and ID which can pack many rows into a single data page when using a 100% Fill Factor).
Not all 20k entries are "active" or "current", so you don't need to worry about caching all of them. So whatever values are current will be easily identified as the ones being queried, and those data pages (which may include some inactive entries, but no big deal there) will be the ones to get cached in the Buffer Pool.
Hence, you don't need to worry about aging out old entries OR forcing any key expirations or reloads due to possibly changing values (i.e. updated Name for a particular ID) as that is handled naturally.
Yes, in-memory caching is wonderful technology and greatly speeds up websites, but those scenarios / use-cases are for when non-database processes are requesting the same data over and over in pure read-only purposes. But this particular scenario is one in which data is being merged and the list of lookup values can be changing frequently (moreso due to new entries than due to updated entries).
That all being said, Option #2 is the way to go. I have done this technique several times with much success, though not with 15 TVPs. It might be that some optimizations / adjustments need to be made to the method to tune this particular situation, but what I have found to work well is:
Accept the data via TVP (a minimal sketch of the table type and procedure shell follows this list). I prefer this over SqlBulkCopy because:
it makes for an easily self-contained Stored Procedure
it fits very nicely into the app code to fully stream the collection(s) to the DB without needing to copy the collection(s) to a DataTable first, which is duplicating the collection, which is wasting CPU and memory. This requires that you create a method per each collection that returns IEnumerable<SqlDataRecord>, accepts the collection as input, and uses yield return; to send each record in the for or foreach loop.
TVPs are not great for statistics and hence not great for JOINing to (though this can be mitigated by using a TOP (#RecordCount) in the queries), but you don't need to worry about that anyway since they are only used to populate the real tables with any missing values
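For reference, the SQL side of that hand-off might look roughly like this. The type and procedure names are placeholders, and the real procedure would take one TVP per entity:

-- User-defined table type that the streamed IEnumerable<SqlDataRecord> maps to.
-- (Type and procedure names below are placeholders.)
CREATE TYPE dbo.EntityNumeroUnoList AS TABLE
(
    [Name] NVARCHAR(200) NOT NULL
);
GO

CREATE PROCEDURE dbo.ImportMessages
(
    @EntityNumeroUno dbo.EntityNumeroUnoList READONLY
    -- one TVP parameter per entity, plus the transactional-data TVP
)
AS
BEGIN
    SET NOCOUNT ON;
    -- Step 1 (insert missing names) and Step 2 (insert the messages) go here.
END;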
Step 1: Insert missing Names for each entity. Remember that there should be a NonClustered Index on the [Name] field for each entity, and assuming that the ID is the Clustered Index, that value will naturally be a part of the index, hence [Name] only will provide a covering index in addition to helping the following operation. And also remember that any prior executions for this client (i.e. roughly the same entity values) will cause the data pages for these indexes to remain cached in the Buffer Pool (i.e. memory).
;WITH cte AS
(
SELECT DISTINCT tmp.[Name]
FROM @EntityNumeroUno tmp
)
INSERT INTO EntityNumeroUno ([Name])
SELECT cte.[Name]
FROM cte
WHERE NOT EXISTS(
SELECT *
FROM EntityNumeroUno tab
WHERE tab.[Name] = cte.[Name]
)
Step 2: INSERT all of the "messages" in simple INSERT...SELECT where the data pages for the lookup tables (i.e. the "entities") are already cached in the Buffer Pool due to Step 1
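A rough sketch of what that Step 2 insert could look like inside the procedure; the message table, its columns, and @MessageRows (standing in for the transactional-data TVP parameter) are all assumptions:

-- The lookup pages are already warm in the buffer pool from Step 1,
-- so resolving names to ids here is a cheap index seek per row.
-- (dbo.Messages, its columns, and @MessageRows are assumed names.)
INSERT INTO dbo.Messages (EntityNumeroUnoID, ListPrice, Discount)
SELECT ent.[ID],
       tmp.ListPrice,
       tmp.Discount
FROM   @MessageRows tmp
INNER JOIN dbo.EntityNumeroUno ent
        ON ent.[Name] = tmp.EntityNumeroUnoName;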
Finally, keep in mind that conjecture / assumptions / educated guesses are no substitute for testing. You need to try a few methods to see what works best for your particular situation since there might be additional details that have not been shared that could influence what is considered "ideal" here.
I will say that if the Messages are insert-only, then Vlad's idea might be faster. The method I am describing here I have used in situations that were more complex and required full syncing (updates and deletes) and did additional validations and creation of related operational data (not lookup values). Using SqlBulkCopy might be faster on straight inserts (though for only 2000 records I doubt there is much difference if any at all), but this assumes you are loading directly to the destination tables (messages and lookups) and not into intermediary / staging tables (and I believe Vlad's idea is to SqlBulkCopy directly to the destination tables). However, as stated above, using an external cache (i.e. not the Buffer Pool) is also more error prone due to the issue of updating lookup values. It could take more code than it's worth to account for invalidating an external cache, especially if using an external cache is only marginally faster. That additional risk / maintenance needs to be factored into which method is overall better for your needs.
UPDATE
Based on info provided in comments, we now know:
There are multiple Vendors
There are multiple Products offered by each Vendor
Products are not unique to a Vendor; Products are sold by 1 or more Vendors
Product properties are singular
Pricing info has properties that can have multiple records
Pricing info is INSERT-only (i.e. point-in-time history)
Unique Product is determined by SKU (or similar field)
Once created, a Product coming through with an existing SKU but different properties otherwise (e.g. category, manufacturer, etc) will be considered the same Product; the differences will be ignored
With all of this in mind, I will still recommend TVPs, but re-think the approach and make it Vendor-centric, not Product-centric. The assumption here is that Vendors send files whenever. So when you get a file, import it. The only lookup you would be doing ahead of time is the Vendor. Here is the basic layout:
Seems reasonable to assume that you already have a VendorID at this point because why would the system be importing a file from an unknown source?
You can import in batches
Create a SendRows method that:
accepts a FileStream or something that allows for advancing through a file
accepts something like int BatchSize
returns IEnumerable<SqlDataRecord>
creates a SqlDataRecord to match the TVP structure
loops through the FileStream until either BatchSize has been met or there are no more records in the File
perform any necessary validations on the data
map the data to the SqlDataRecord
call yield return;
Open the file
While there is data in the file
call the stored proc
pass in VendorID
pass in SendRows(FileStream, BatchSize) for the TVP
Close the file
Experiment with:
opening the SqlConnection before the loop around the FileStream and closing it after the loops are done
Opening the SqlConnection, executing the stored procedure, and closing the SqlConnection inside of the FileStream loop
Experiment with various BatchSize values. Start at 100, then 200, 500, etc.
The stored proc will handle inserting new Products
Using this type of structure you will be sending in Product properties that are not used (i.e. only the SKU is used for the look up of existing Products). BUT, it scales very well as there is no upper-bound regarding file size. If the Vendor sends 50 Products, fine. If they send 50k Products, fine. If they send 4 million Products (which is the system I worked on and it did handle updating Product info that was different for any of its properties!), then fine. No increase in memory at the app layer or DB layer to handle even 10 million Products. The time the import takes should increase in step with the amount of Products sent.
UPDATE 2
New details related to Source data:
comes from Azure EventHub
comes in the form of C# objects (no files)
Product details come in through O.P.'s system's APIs
is collected in a single queue (just pull data out and insert into the database)
If the data source is C# objects, then I would most definitely use TVPs, as you can send them over as-is via the method I described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the per-Vendor Price/Offer details, but regular input params for the singular Property attributes. For example:
CREATE PROCEDURE dbo.ImportProduct
(
    @SKU VARCHAR(50),
    @ProductName NVARCHAR(100),
    @Manufacturer NVARCHAR(100),
    @Category NVARCHAR(300),
    @VendorPrices dbo.VendorPrices READONLY,
    @DiscountCoupons dbo.DiscountCoupons READONLY
)
AS
SET NOCOUNT ON;

-- Insert Product if it doesn't already exist
IF (NOT EXISTS(
        SELECT *
        FROM dbo.Products pr
        WHERE pr.SKU = @SKU
       )
   )
BEGIN
    INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
    VALUES (@SKU, @ProductName, @Manufacturer, @Category, ...);
END;

-- ... INSERT data from the TVPs
-- might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows
From a DB point of view, there's nothing as fast as BULK INSERT (from CSV files, for example). The best approach is to bulk load all the data as soon as possible, then process it with stored procedures.
A C# layer will just slow down the process, since all the round trips between C# and SQL will be thousands of times slower than what SQL Server can handle directly.
We have an application that executes a job to process a range of rows from an MSSQL view.
This view contains a lot of rows, and the data is inserted with an additional column (dataid) set to identity, which we use to know how far through the dataset we have gotten.
A while ago we had some issues when just getting the top n rows with a dataid larger than y (y being the biggest dataid that we had processed so far). It seemed that the rows were not returned in the correct order, meaning that when we grabbed a range of rows, the dataid of some of the rows was misplaced, so we processed a row with a dataid of 100 when we had actually only gotten to 95.
Example:
The window/range is 100 rows on each crunch. But if the rows' dataids are not in sequential order, the query getting the next 100 rows may contain a dataid that really should have been located in the next crunch, and then rows will be skipped when the next crunch is executed.
An ORDER BY on the dataid would solve the problem, but that is way too slow.
Do you guys have any suggestions how this could be done in a better/working way?
When I say a lot of rows, I mean a few billion rows, and yes, if you think that is absolutely crazy you are completely right!
We use Dapper to map the rows into objects.
This is completely read only.
I hope this question is not too vague.
Thanks in advance!
An ORDER BY on the dataid would solve the problem, but that is way too slow.
Apply the proper indexes.
The only answer to "why is my query slow" is: How To: Optimize SQL Queries.
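For this particular pattern, the usual fix is an index that matches the seek-and-order-by. The names below are assumptions about the view's base table, and whether the view can use the index depends on how the view is defined:

-- With an index on dataid, the "next batch" query is an index seek plus
-- at most 100 lookups, not a scan-and-sort over billions of rows.
-- (dbo.BaseTable and dbo.TheView are assumed names.)
CREATE NONCLUSTERED INDEX IX_BaseTable_DataId
    ON dbo.BaseTable (dataid);

DECLARE @lastProcessedDataId BIGINT = 95;

SELECT TOP (100) *
FROM   dbo.TheView
WHERE  dataid > @lastProcessedDataId
ORDER  BY dataid;   -- now cheap, and the ordering guarantees no skipped rows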
It's not clear what you mean by mixing 'view' and 'insert' in the same sentence. If you really mean a view that projects an IDENTITY function, then you can stop right now: it will not work. You need a persisted bookmark to resume your work. An IDENTITY projected in a SELECT by a view does not meet the persistence criteria.
You need to process data in a well-defined order that is stable across consecutive reads. You must be able to read a key that clearly defines a boundary in that order, and you need to persist the last key processed in the same transaction as the batch that processes the rows. How you achieve these requirements is entirely up to you. A typical solution is to process in clustered index order and remember the last processed clustered key position. A unique clustered key is a must. An IDENTITY column with a clustered index on it does satisfy these criteria.
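A small sketch of that bookmark pattern (every object name here is invented):

BEGIN TRANSACTION;

-- Read the bookmark under UPDLOCK so two workers can't grab the same batch.
-- (dbo.ProcessingBookmark and dbo.BaseTable are assumed names.)
DECLARE @from BIGINT =
    (SELECT last_dataid FROM dbo.ProcessingBookmark WITH (UPDLOCK));

-- Determine the upper boundary of this batch in clustered-key order.
DECLARE @to BIGINT =
    (SELECT MAX(dataid)
     FROM (SELECT TOP (100) dataid
           FROM dbo.BaseTable
           WHERE dataid > @from
           ORDER BY dataid) AS batch);

-- ... process the rows with dataid > @from AND dataid <= @to ...

-- Persist the new bookmark in the same transaction as the processing.
UPDATE dbo.ProcessingBookmark
SET    last_dataid = @to;

COMMIT TRANSACTION;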
If you only want to work on the last 100 rows, give or take a 1,000,000, you could look at partitioning the data.
What's the point of including the other 999,999,000,000 in the index?