I need to find the fastest way of reading a large OpenEdge table (100 million rows plus), preferably programmatically (in c#) and outside of ETL tools such as SSIS or staging formats such as text file extracts.
I'm currently using ODBC (driver: Progress OpenEdge 11.5) to query the OpenEdge 11.5 tables in batches using the OFFSET and FETCH modifiers
SELECT COL_1, COL_2
FROM PUB.TABLE_1
ORDER BY ROWID ASC
OFFSET {currentBatchStart} ROWS
FETCH NEXT {batchSize} ROWS ONLY
I'm querying via a system DSN with FetchArraySize: 25 and QueryTimeout: -1. And I'm connecting to an OpenEdge server group set up for SQL only access with message buffer size: 1024.
I'm finding the performance is poor (about 1 million records every 15 minutes) and I suspect it will only slow down as I advance through the table when using the OFFSET FETCH modifiers.
My question is are there any methods I can adopt or settings I can play with to tune the query performance?
For example are there better ways of constructing my SQL query? e.g. should I order by columns in an index rather than ROWID?
Should I increase the message buffer size on the sql server group
Or should I be looking at alternative methods to read the data out of the table?
Note: Each batch is subsequently sqlbulkcopy'ed into a SQL Server table
I'm not much on ODBC - from what I can make of your code this will have increasing performance issues as you get further down the table as you surmise.
My suggestion would to be to identify a unique index on that table and use that index's keys to determine what values to get next. Then your query becomes something like this:
WHERE table.KeyField > LastFieldValueRead
ORDER BY table.KeyField
FETCH NEXT {batchSize} ROWS ONLY
Then the db engine can use your field values to find the offset and get the next values - this'll be much more performant than what you have now.
If this will be an ongoing concern 11.7 has Change Data Capture for logging data changes for replication elsewhere, and Progress sells the Pro2 tool to provide ongoing replication of data.
You should write OE code and connect to the SQL Server via .net functionality (if I remember correctly its in System.Data.SQL).
I've written a conversion tool this way which reads from SQL Server, Oracle DB, xBase and others and store them into a Progress RDBMS using almost everything from the original database (table, field and index name, format and the only thing that has to be converted where the datatypes). And I'm pretty sure it works the other way around also.
Related
I am doing web application using c# .net and sql server 2008 as back end. Where application read data from excel and insert into sql table. For this mechanism I have used SQLBulkCopy function which work very well. Sql table has 50 fields from which system_error and mannual_error are two fields. After inserting records in 48 columns I need to re-ckeck all this records and update above mentioned two columns by specific errors e.g. Name filed have number, qty Not specified etc. For this I have to check each column by fetching in datatable and using for loop.
Its work very well when record numbers are 1000 to 5000. But it took huge time say 50 minutes when records are around 100,000 or more than this.
Initially I have used simple SQL Update Query then I had used stored procedure but both requires same time.
How to increase the performance of application? What are other ways when dealing with huge data to update? Do suggestions.
I hope this is why people use mongodb and no SQL systems. You can update huge data setsby optimizing your query. Read more here:
http://www.sqlservergeeks.com/blogs/AhmadOsama/personal/450/sql-server-optimizing-update-queries-for-large-data-volumes
Also check:Best practices for inserting/updating large amount of data in SQL Server 2008
One thing to consider is that iterating over a database table row by row, rather than performing set based update operations would incur a significant performance hit.
If you are in fact performing set based updates on your data and still have significant performance problems you should look at the execution plan of your queries so that you can workout where and why they are performing so badly.
In my application I have a SQL Server 2008 table Employee Swipedaily_Tbl with 11 columns
where the employee daily swipes are inserted.
And I have about 8000 employees in my company. This means there will be at least 16000 rows created daily..
I am planing to delete all the rows at the end of a month and save them to another table in order to increase performance...... or back up the previous month data as dmb file from by application itself
As I am a new to SQL Server and DBA, can anyone suggest whether there is a better idea?
Can I create a dump file from the application?
Either by using Partitioning Table so inserting new data in huge volume database table won't effect its performance or using Script to backup data monthly wise using SQL Job and delete from existing one but if you are using Identity column you might need some changes in script to avoid conflict in old and new data.
Create an identical table
Create a SQL script to copy all the data older than a given date
(say today's date) to that table and delete from your table
Configure a SQL agent job to execute that script on the 1st of every
month
However, with proper indexing, you should be OK to reatian the data in your original table itself for a much longer period - 365 day x 8000 employees x 2 swipes = 5.84 million records, not too much for SQL server to handle.
Raj
You can create another table identical to Swipedaily_Tbl(11 columns) with additional one column that would tell when specific record was inserted in the backup table. You can then create a script that would backup the data older than one month and delete that data from the orignal table. You can then create a batch or a console application that could be scheduled to run at the end of month.
Hope this help.
Thanks.
It would depend on your requirements for the "old" data.
Personally, I would strongly consider using table partitioning.
See: http://technet.microsoft.com/en-us/library/dd578580(v=sql.100).aspx
Keep all records in table; this will make queries that look at current and historic data simultaneously simpler and potentially cheaper.
As all too often, it depends. Native partitioning requires the Enterprise Edition of SQL Server, however there are ways around it (although not very clean), like this.
If you do have the Enterprise Edition of SQL Server, I would take a serious look at partitioning (well linked in some of the other answers here), however I wouldn't split on a monthly basis, maybe a quarterly or semi-annual basis, as at two swipes per day is less than half a million rows per month, and a 1.5-3 mil. row table isn't that much for SQL server to handle.
If you are experiencing performance issues at this point in time with maybe a few months of data, have you reviewed the most frequent queries hitting the table and ensured that they're using indexes?
I have one desktop application receiving data from a webservice and storing it inside a local postgresql database (while the webservice retrieves data from a SQL Server database). At the end of the process there will be a minimum of 2.5 million entries inside a table in my local database but this will be received from de webservice in batches of about 300 rows at time and within a time frame of about 15 days.
What I need is a way to make sure that my local database has the exact same information the server's database has.
I'm thinking of creating some sort of checksum for each batch received and then, after all batches were received, another checksum of the entire table but I don't know if this is the best solution and, if is, I don't know where to start to create it.
PS: TCP already handles integrity check so I don't even know if this is needed, but it is critical that the data are the same.
I can see how a checksum could possibly be useful, but the amount of transformation you're doing would probably make it impractical. You'd have to derive the checksum on either the original form of the data or on the transformed form; it wouldn't be valid on both.
You have some strange constraints (been there myself), so it's kind of hard to come up with a clear strategy without knowing all the details. Maybe one of the following suggestions would work.
A simple count(*) on the SQL Server side and on the PostgreSQL side after the migration is complete.
Dump out a list of keys from the SQL Server side and from the PostgreSQL side after the migration is complete, and then sort and compare those files.
If 1 and 2 aren't possible because of limited access to SQL Server, maybe dump out the results of the web service calls to a single file location as you go along, and then extract the same data from PostgreSQL at the end, and compare those files.
There are numerous tools available for comparing files if you choose options 2 or 3.
Do you have control over the web service and SQL Server DB? If you do, SQL Server Change Tracking should do the trick. MSDN Change Tracking will track every change (or just the changes you care about) on a per table basis. Each time you synchronize you just pass it your version number and it will return the changeset required to bring you up to date.
I am interested in what the best practices are for paging large datasets (100 000+ records) using ASP.NET and SQL Server.
I have used SQL server to perform the paging before, and although this seems to be an ideal solution, issues arise around dynamic sorting with this solution (case statements for the order by clause to determine column and case statements for ASC/DESC order). I am not a fan of this as not only does it bind the application with the SQL details, it is a maintainability nightmare.
Opened to other solutions...
Thanks all.
In my experience, 100 000+ records are too many records for the user looking at them. Last time I did this, I provided filters. So users can use them and see the filtered (less number of) records and order them, so paging and ordering became much faster (than paging/ordering on whole 100 000+ records). If user didn't use filters, I showed a "warning" that large number of records would be returned and there would be delays. Adding an index on the column being ordered as suggested by Erick would also definitely help.
I wanted to add a quick suggestion to Raj's answer. If you create a temp table with the format ##table, it will survive. However, it will also be shared across all connections.
If you create an Index on the column that will be sorted, the cost of this method is far lower.
Erick
If you use the Order by technique, every time you page through, you will cause the same load on the server because you running the query, then filtering the data.
When I have had the luxury of non-connection-pooled environments, I have created and left the connection open until paging is done. Created a #Temp table on the connection with just the IDs of the rows that need to get back, and added and IDENTITY field to this rowset. Then do paging using this table to get the fastest returns.
If you are restricted to a connection-pooled environment, then the #Temp table is lost as soon as the connection is closed. In that case, you will have to cache the list of Ids on the server - never send them to the client to be cached.
I have a program in c# in VS that runs a mainform.
That mainform exports data to an SQL Database with stored procedures into tables. The data exported is a lot of data (600,000 + rows).
I have a problem tho. On my mainform I need to have a "database write out interval". This is a number of how many "rows" will be imported into the database.
My problem is however the steps on how to implement that interval. The mainform runs, and when the main program is done, the sql still takes IN data for another 5-10 minutes.
Therefore, if I close the mainform, the rest of the data will not me imported.
Do you professional programmers out there know a way where I can somehow communicate with SQL to only export data for a user-specified interval. T
his has to be done with my c# class.
I dont know where to begin.
I dont think a timer would be a good idea because differenct computers and cpu's perform differently. Any advice would be appreciated.
If the data is of a fixed format (ie, there are going to be the same columns for every row and its not going to change much), you should look at Bulk Insert. Its incredibly fast at inserting large numbers of rows.
The basics are you write your data out to a text file (ie, csv, but you can specify whatever delimiter you want), then execute a BULK INSERT command against the server. One of the arguments is the path to the file you wrote out. It's a bit of a pain to use because you have to write the file in a folder on the server (or a UNC path that the server has access to) which leads to configuring windows shares or setting up FTP on the server. It sounds like exactly what you want to use, though.
Here's the MSDN documentation on BULK INSERT:
http://msdn.microsoft.com/en-us/library/ms188365.aspx
Instead of exporting all of your data to SQL and then trying to abort or manage the load a a better process might be to split your load into smaller chunks (10,000 records or so) and check whether the user wants to continue after each load. This gives you a lot more flexibility and control over the load then dumping all 600,000 records to SQL and trying to manage the process.
Also what Tim Coker mentioned is spot on. Even if your stored proc is doing some data manipulation it is a lot faster to load the data via bulk insert and run a query after the load to do any work you have to do then to run all 600,000 records through the stored proc.
Like all the other comments before, i will suggest you to use BulkInsert. You will be amazed by how fast the performance is when it comes to large dataset and perhaps your concept about interval is no longer required. Inserting 100k of records may only take seconds.
Depends on how your code is written, ADO.NET has native support for BulkInsert through SqlBulkCopy, see the code below
http://www.knowdotnet.com/articles/bulkcopy_intro1.html
If you have been using Linq to db for your code, there are already some clever code written as extension method to the datacontext which transform the linq changeset into a dataset and internally use ADO.NET to achieve the bulk insert
http://blogs.microsoft.co.il/blogs/aviwortzel/archive/2008/05/06/implementing-sqlbulkcopy-in-linq-to-sql.aspx