We have an application that executes a job to process a range of rows from an MSSQL view.
This view contains a lot of rows, and the data is inserted with an additional column (dataid) set to IDENTITY, meant for us to use to know how far through the dataset we have gotten.
A while ago we had some issues when just getting the top n rows with a dataid larger than y (y being the largest dataid we had already processed). It seemed that the rows were not returned in the correct order, meaning that when we grabbed a range of rows, the dataid of some of the rows was misplaced, which meant that we processed a row with a dataid of 100 when we had actually only gotten to 95.
Example:
The window/range is 100 rows on each crunch, but if the rows' dataids are not in sequential order, the query getting the next 100 rows may contain a dataid that really should have been located in the next crunch. Rows will then be skipped when that next crunch is executed.
An ORDER BY on the dataid would solve the problem, but that is way too slow.
Do you have any suggestions for how this could be done in a better/working way?
When I say a lot of rows, I mean a few billion rows, and yes, if you think that is absolutely crazy you are completely right!
We use Dapper to map the rows into objects.
This is completely read only.
I hope this question is not too vague.
Thanks in advance!
"An ORDER BY on the dataid would solve the problem, but that is way too slow."
Apply the proper indexes.
The only answer to "why is my query slow" is: How To: Optimize SQL Queries.
It's not clear what you mean by mixing 'view' and 'insert' in the same sentence. If you really mean a view that projects an IDENTITY function then you can stop right now, it will not work. You need to have a persisted bookmark to resume your work. An IDENTITY projected in a SELECT by a view does not meet the persistence criteria.
You need to process data in a well-defined order that is persistent on consecutive reads. You must be able to read a key that clearly defines a boundary in the given order. You need to persist the last key processed in the same transaction as the batch processing the rows. How you achieve these requirements is entirely up to you. A typical solution is to process in the clustered index order and remember the last processed cluster key position. A unique clustered key is a must. An IDENTITY property and a clustered index on it do satisfy the criteria you need.
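A minimal sketch of that pattern in T-SQL, assuming the table behind the view is clustered on dataid (the table name and the persisted @lastDataId are placeholders):
-- Read the next batch in clustered-index order; cheap because no sort is needed.
SELECT TOP (100) *
FROM dbo.SourceTable
WHERE dataid > @lastDataId
ORDER BY dataid;
-- Afterwards, store MAX(dataid) of the batch as the new @lastDataId,
-- in the same transaction as the processing of the batch itself.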
If you only want to work on the last 100, give or take a 1,000,000, you could look at partitioning the data.
What's the point of including the other 999,999,000,000 in the index?
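For illustration, a rough sketch of range-partitioning by dataid, so that batch scans only touch the newest partition (the names and boundary values are made up):
CREATE PARTITION FUNCTION pf_dataid (BIGINT)
    AS RANGE LEFT FOR VALUES (1000000000, 2000000000, 3000000000);
CREATE PARTITION SCHEME ps_dataid
    AS PARTITION pf_dataid ALL TO ([PRIMARY]);
-- Creating the clustered index on the scheme partitions the table;
-- queries filtered on recent dataid values then only read the last partition.
CREATE CLUSTERED INDEX cix_SourceTable_dataid
    ON dbo.SourceTable (dataid) ON ps_dataid (dataid);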
Related
I have an issue with the performance of my program using C#.
In the first loop, the table will insert and update 175,000 records in 54 secs.
In the second loop, with 175,000 records, 1 min 11 secs.
Next, the third loop, with 18,195 records, 1 min 28 secs.
The loops go on, and the time taken keeps growing; for 125 records it can go up to 2 mins.
I am wondering why smaller batches of records take longer to update. Does the number of records being updated not affect the time taken to complete the loop?
Can anyone enlighten me on this?
Flow of Program:
INSERT INTO TableA (date, time) SELECT date, time FROM rawdatatbl WHERE id >= startID AND id <= maxID; -- startID is the next ID after the last processed records
UPDATE TableA SET columnName = values, columnName1 = values, columnName2 = values, columnName.....
I'm using InnoDB.
The reported behavior seems consistent with the growing size of the table and an inefficient query execution plan for the UPDATE statements. The most likely explanation is that the UPDATE is performing a full table scan to locate the rows to be updated, because an appropriate index is not available. And as the table has more and more rows added, it takes longer and longer to perform that full table scan.
Quick recommendations:
review the query execution plan (obtained by running EXPLAIN)
verify suitable indexes are available and are being used
Apart from that, there's tuning of the MySQL instance itself. But that's going to depend on which storage engine the tables are using, MyISAM, InnoDB, et al.
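As a rough illustration (table, column and index names are placeholders; on newer MySQL versions EXPLAIN works directly on UPDATE, on older ones EXPLAIN the equivalent SELECT):
-- Show how MySQL plans to find the rows; type = ALL means a full table scan.
EXPLAIN UPDATE TableA SET columnName = 'value' WHERE id >= 12345 AND id <= 12470;
-- If the column in the WHERE clause is not indexed, adding an index usually removes the scan.
ALTER TABLE TableA ADD INDEX idx_tablea_id (id);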
Please provide SHOW CREATE TABLE for both tables, and the actual statements. Here are some guesses...
The target table has indexes. Since the indexes are built as the inserts occur, any "random" indexes will become slower and slower.
innodb_buffer_pool_size was so small that caching became a problem.
The UPDATE seems to be a full table update. Well, the table is larger each time.
How did you get startID from one query before doing the next one (which has id>=startID)? Perhaps that code is slower as you get farther into the table.
You say "in the second loop", where is the "loop"? Or were you referring to the INSERT...SELECT as a "loop"?
This might seem subjective, but I'm looking for answers from those who like to set, or at least be a part of setting, coding standards.
In C#, what type of result should you expect when searching for a single record by a non-primary-key index?
If you:
SELECT * FROM tablename WHERE fieldname = @fieldname
As a matter of practice, should you code logic to expect an IEnumerable list or a single record?
If you really expect only one record, should the SQL use TOP 1, like below?
SELECT TOP 1 * FROM tablename WHERE fieldname = @fieldname
I think rather than thinking about what you expect, a better way to look at this is to construct your query such that you get what you want. If you are only interested in zero or one potential matches, then TOP(1) certainly works, although I'd likely add some type of ordering clause.
However, if you want zero or more, then the first approach is better.
Any time you are querying based off of a non-unique value, you always have the possibility of returning more than one record. Sure, today that query only gets one. However, at some point in the future an unforeseen change will occur and all of a sudden you now get multiple rows back.
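A small sketch of the zero-or-one version; the ordering column is just an assumption to make "which duplicate wins" deterministic:
SELECT TOP (1) *
FROM tablename
WHERE fieldname = @fieldname
ORDER BY created_date DESC; -- hypothetical column; pick whatever rule defines the "right" row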
I have a few database tables set up like:
Executions
-ExecutionID
Periods
-PeriodID
-ExecutionID
Transactions
-TransactionID
-PeriodID
-Name
-ResponseTime
I have a page where I want to display the execution in a GridView, with columns:
TransactionName, Period1ResponseTime, Period2ResponseTime, etc.
I've been trying to come up with the best way to combine the tables into a single data source to feed to the GridView, but am only coming up with dirty brute-force ideas. What do you think the best approach for this would be? Is it possible using SQL alone?
P.S. The transaction names are distinct per period (there won't be 2 transactions with the same name in a period), and not every period will have the same transactions, although they're mostly the same (with different response times).
I ended up doing it with the brute-force C# approach, as I couldn't figure out the complex SQL involved - not really sure what "tools" I could use to accomplish it. That said, I'll have to look more into cursors. Basically what I do now is one query to get all distinct transaction names from the periods in the execution (building the SQL statement on the fly with some OR appends). Then I get the transactions for the different periods and store them in an array. Finally I combine them all into a custom data table by going row by row (distinct transaction names), and for each period I search its transaction list for that name: if found, I put the response time value in that period's column, otherwise I leave it blank. Not sure what the duration is for Stack Overflow questions, but if anyone has a suggestion to improve on this approach I'm all ears, since it feels very inefficient.
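For what it's worth, the single-query version can be sketched with conditional aggregation over the schema above. The fixed Period1/Period2 columns and the @Period1/@Period2/@ExecutionID parameters are assumptions; a dynamic PIVOT would be needed when the number of periods varies per execution:
SELECT t.Name AS TransactionName,
       MAX(CASE WHEN p.PeriodID = @Period1 THEN t.ResponseTime END) AS Period1ResponseTime,
       MAX(CASE WHEN p.PeriodID = @Period2 THEN t.ResponseTime END) AS Period2ResponseTime
FROM Transactions t
JOIN Periods p ON p.PeriodID = t.PeriodID
WHERE p.ExecutionID = @ExecutionID
GROUP BY t.Name; -- one row per transaction name, one column per period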
I have a requirement (by law) for gap-less numbers on different tables. The IDs can have holes in them but not the sequences.
This is something I have to either solve in the C# code or in the database (Postgres, MS SQL and Oracle).
This is my problem:
Start transaction 1
Start transaction 2
Insert row on table "Portfolio" in transaction 1
Get next number in sequence for column Portfolio_Sequence (1)
Insert row on table "Document" in transaction 1
Get next number in sequence for column Document_Sequence (1)
Insert row on table "Portfolio" in transaction 2
Get next number in sequence for column Portfolio_Sequence (2)
Insert row on table "Document" in transaction 2
Get next number in sequence for column Document_Sequence (2)
Problem occurred in transaction 1
Rollback transaction 1
Commit transaction 2
Problem: Gap in sequence for both Portfolio_Sequence and Document_Sequence.
Note that this is very simplified and there are many more tables included in each of the transactions.
How can I deal with this?
I have seen suggestions where you "lock" the sequence until the transaction is either committed or rolled back, but this will be a huge bottleneck for the system when there are this many tables involved and such complex, long transactions.
As you seem to have already concluded, gap-less sequences simply do not scale. Either you run the risk of dropping values when a rollback occurs, or you have a serialization point that will prevent a multi-user, concurrent transaction system from scaling. You cannot have both.
My thought would be: what about a post-processing action, where every day you have a process that runs at close of business, checks for gaps, and renumbers anything that needs to be renumbered?
One final thought: I don't know your requirement, but I know you said this is "required by law". Well, ask yourself, what did people do before there were computers? How would this "requirement" be met? Presumably you had a stack of blank forms that came preprinted with a "sequence" number in the upper right corner. And what happened if someone spilled coffee on that form? How was that handled? It seems you need a similar method to handle that in your system.
Hope that helps.
This problem is impossible to solve in principle, because any transaction can roll back (bugs, timeouts, deadlocks, network errors, ...).
You will have a serial contention point. Try to reduce contention as much as possible: keep the transaction that is allocating numbers as small as possible. Also, allocate numbers as late as possible in the transaction, because contention only arises once you allocate a number. If you're doing 1000ms of uncontended work and then allocate a number (taking 10ms), you still have a degree of parallelism of 100, which is enough.
So maybe you can insert all rows (of which you say there are many) with dummy sequence numbers, and only at the end of the transaction quickly allocate all real sequence numbers and update the rows that are already written. This would work well if there are more inserts than updates, or the updates are quicker than the inserts (which they will be), or there is other processing or waiting interleaved between the inserts.
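A minimal T-SQL sketch of that "allocate late" idea, assuming hypothetical Documents and GaplessCounters tables; the single-row counter UPDATE is the only serialized step, and its lock is held only from that point until COMMIT:
DECLARE @DocumentId UNIQUEIDENTIFIER = NEWID();
DECLARE @Next INT;
BEGIN TRANSACTION;
-- 1. Do all the slow work first, with a placeholder sequence number.
INSERT INTO Documents (DocumentId, SequenceNo, Payload)
VALUES (@DocumentId, NULL, 'example payload');
-- ... more inserts, validation, etc. ...
-- 2. Only at the very end, grab the next number and remember it in one atomic statement.
UPDATE GaplessCounters
SET @Next = LastValue = LastValue + 1
WHERE SequenceName = 'Document';
UPDATE Documents
SET SequenceNo = @Next
WHERE DocumentId = @DocumentId;
COMMIT;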
Gap-less sequences are hard to come by. I suggest using a plain serial column instead. Create a view with the window function row_number() to produce a gap-less sequence:
CREATE VIEW foo AS
SELECT *, row_number() OVER (ORDER BY serial_col) AS gapless_id
FROM tbl;
Here is an idea that should support both high performance and high concurrency:
Use a highly concurrent, cached Oracle sequence to generate a dumb unique identifier for the gap-less table row. Call this entity MASTER_TABLE
Use the dumb unique identifier for all internal referential integrity from the MASTER_TABLE to other dependent detail tables.
Now your gap-less MASTER_TABLE sequence number can be implemented as an additional attribute on the MASTER_TABLE, and will be populated by a process that is separate from the MASTER_TABLE row creation. In fact, the gap-less additional attribute should be maintained in a 4th normal form attribute table of the MASTER_TABLE, so a single background thread can populate it at its leisure, without concern for any row locks on the MASTER_TABLE.
All queries that need to display the gap-less sequence number on a screen or report or whatever would join the MASTER_TABLE with the gap-less additional attribute 4th normal form table. Note that these joins will be satisfied only after the background thread has populated the gap-less additional attribute 4th normal form table.
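Very roughly, and with every name below invented for illustration (Oracle-flavored), the separate attribute table and the background numbering job could look like this:
CREATE TABLE MASTER_GAPLESS_NO (
    master_id   NUMBER PRIMARY KEY REFERENCES MASTER_TABLE (master_id),
    gapless_no  NUMBER NOT NULL UNIQUE
);
-- Run by a single background job: number every master row that has no gap-less number yet.
INSERT INTO MASTER_GAPLESS_NO (master_id, gapless_no)
SELECT m.master_id,
       (SELECT NVL(MAX(g.gapless_no), 0) FROM MASTER_GAPLESS_NO g)
         + ROW_NUMBER() OVER (ORDER BY m.master_id)
FROM MASTER_TABLE m
WHERE NOT EXISTS (SELECT 1 FROM MASTER_GAPLESS_NO x WHERE x.master_id = m.master_id);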
We have a large database with enquiries; each enquiry is referenced using a GUID. The GUID isn't very customer friendly, so we want to add an additional 5-digit "human id" (OK because we very likely won't have more than 99999 enquiries active at any time, and it's OK if a human id references multiple enquiries, as they aren't used for anything important).
1) Is there any way to have an IDENTITY column reset to 1 after 99999?
My current workaround is to use an INT IDENTITY(1,1) NOT NULL column and, when presenting a HumanId, take HumanId % 100000.
2) Is there any way to automatically "randomly distribute" the ids over [0..99999] so that two enquiries created right after each other don't get adjacent ids? I guess I'm looking for a two-way, one-to-one hash function?
... Ideally I'd like to do this using T-SQL, automatically creating these ids when an enquiry is created.
If performance and concurrency aren't too much of an issue, you can use triggers and the MAX() function to calculate a 'next human ID' value. You probably would want to keep your IDENTITY column as is, and have the 'human ID' in a separate column.
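A rough sketch of what such a trigger could look like; the table and column names are hypothetical, it only handles single-row inserts, and the coarse table lock is the price paid for avoiding duplicate numbers:
CREATE TRIGGER trg_Enquiry_HumanId
ON dbo.Enquiry
AFTER INSERT
AS
BEGIN
    -- TABLOCKX serializes concurrent inserts so two rows cannot compute the same MAX().
    UPDATE e
    SET HumanId = (SELECT ISNULL(MAX(HumanId), 0) % 99999 + 1
                   FROM dbo.Enquiry WITH (TABLOCKX))
    FROM dbo.Enquiry e
    JOIN inserted i ON i.EnquiryId = e.EnquiryId;
END;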
EDIT: On a side note, this sounds like a 'presentation layer' issue, which shouldn't be in your database. Your presentation layer of your application should have the code to worry about presenting a record in a human readable manner. Just a thought...
If you absolutely need to do this in the database, then why not derive your human-friendly value directly from the GUID column?
-- human_id doesn't have to be calculated when you retrieve the data
-- you could create a computed column on the table itself if you prefer
SELECT (CAST(your_guid_column AS BINARY(3)) % 100000) AS human_id
FROM your_table
This will give you a random-ish value between 0 and 99999, derived from the first 3 bytes of the GUID. If you want a larger, or smaller, range then adjust the divisor accordingly.
I would strongly recommend taking another look at your logic. Your approach has a few dangers, including:
It is always a bad idea to re-use IDs, even if the original record has become "obsolete" - do you lose anything by continuing to grow IDs beyond 99999? The problem here is more likely to be with long-term maintenance, especially if there is any danger of the system developing over time. Another thing to consider - is there any chance a user will take this reference number and use it to reference your system at some stage in the future?
If you manually assign a generated/random ID, you will need to ensure that multiple records are not assigned the same ID. There are a few options for this (for example, using transactions); however, you should ensure that the scope of the transactions is not going to leave you open to problems with concurrent transactions being blocked - this may cause problems, e.g. with performance. You may be best served by generating your ID externally (as SQL does not do randomness especially well) and then enforcing a unique constraint in your DB, perhaps in the way suggested by Firoz Ansari.
If you still want to reset the identity column, this can be done with the DBCC CHECKIDENT command.
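For example (the table name is a placeholder; reseeding to 0 makes the next identity value 1):
DBCC CHECKIDENT ('dbo.Enquiry', RESEED, 0);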
An example of generating random seeds in SQL server can be found here:
http://weblogs.sqlteam.com/jeffs/archive/2004/11/22/2927.aspx
You can create a composite primary key with two columns, say BatchId and HumanId.
Records in these columns will look like this:
BatchId, HumanId
1, 1
1, 2
1, 3
.
.
1, 99998
1, 99999
2, 1
2, 2
2, 3
Use MAX or ORDER BY DESC to get the next available HumanId, with a condition on BatchId:
DECLARE @NextHumanId INT;
SELECT TOP (1) @NextHumanId = HumanId
FROM [THAT_TABLE]
ORDER BY BatchId DESC, HumanId DESC;
IF @NextHumanId >= 99999 SET @NextHumanId = 1;
Hope this helps.
You could have a table of available HumanIds: each time you add an enquiry you could randomly pull a HumanId from the table (and DELETE it), and each time you delete the enquiry you could add it back (by INSERTing).
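A T-SQL sketch of that pool idea (table and column names are assumptions); UPDLOCK and READPAST let concurrent callers skip a row that someone else is in the middle of claiming:
-- Claim a random HumanId from the pool and return it in one statement.
WITH pick AS (
    SELECT TOP (1) HumanId
    FROM dbo.AvailableHumanIds WITH (UPDLOCK, READPAST, ROWLOCK)
    ORDER BY NEWID()   -- random pick from whatever is left
)
DELETE FROM pick
OUTPUT deleted.HumanId;
-- When an enquiry is deleted, return its HumanId to the pool.
INSERT INTO dbo.AvailableHumanIds (HumanId) VALUES (@HumanId); -- @HumanId: the freed value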