I have a performance issue with my C# program.
In the first loop, the table inserts and updates 175,000 records in 54 secs.
In the second loop, with 175,000 records, it takes 1 min 11 secs.
Next, the third loop, with 18,195 records, takes 1 min 28 secs.
The loops go on, and the time taken keeps growing; even 125 records can take up to 2 mins.
I am wondering why smaller batches take longer to update. Doesn't the number of records being updated affect the time taken to complete the loop?
Can anyone enlighten me on this?
Flow of Program:
Insert into TableA (date,time) select date,time from rawdatatbl where id >= startID AND id <= maxID; -- startID is the ID after the last processed record
update TableA set columnName = values, columnName1 =values, columnName2 = values, columnName.....
I'm using InnoDB.
Reported behavior seems consistent with growing size of table, and inefficient query execution plan for UPDATE statements. Most likely explanation would be that the UPDATE is performing a full table scan to locate rows to be updated, because an appropriate index is not available. And as the table has more and more rows added, it takes longer and longer to perform the full table scan.
Quick recommendations:
review the query execution plan (obtained by running EXPLAIN)
verify that suitable indexes are available and are being used
Apart from that, there's tuning of the MySQL instance itself. But that's going to depend on which storage engine the tables are using, MyISAM, InnoDB, et al.
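For example, assuming the UPDATE locates its rows through the same id range as the INSERT (the index name here is a guess), the check-and-fix might look like:

```sql
-- See the plan MySQL chooses for the row-location part of the statement
EXPLAIN SELECT * FROM TableA WHERE id >= 12345 AND id <= 23456;

-- If the plan shows type: ALL (a full table scan), add an index on the lookup column
ALTER TABLE TableA ADD INDEX idx_tablea_id (id);
```

On MySQL 5.6+ you can also run EXPLAIN directly on the UPDATE statement itself.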
Please provide SHOW CREATE TABLE for both tables, and the actual statements. Here are some guesses...
The target table has indexes. Since the indexes are built as the inserts occur, any "random" indexes will become slower and slower.
innodb_buffer_pool_size was so small that caching became a problem.
The UPDATE seems to be a full table update. Well, the table is larger each time.
How did you get startID from one query before doing the next one (which has id>=startID)? Perhaps that code is slower as you get farther into the table.
You say "in the second loop", where is the "loop"? Or were you referring to the INSERT...SELECT as a "loop"?
Related
I've got an SQL query which returns 20 columns and about 500,000 rows at the moment. The values change constantly because people are working on the data in the database.
Most columns in the query aren't simple selects; there are a lot of 'case when' expressions. Data is joined from 5 tables.
Is there a way to show the data in a GridView efficiently? Right now I show all the data (500,000 rows) and it takes a long time. I've tried pagination, but when I want to, for example, take 100 rows with an offset of 10 rows, the whole query is executed and it takes too long.
How can I cope with this?
I think you have two separate issues here:
Slow query: be sure to optimize your query. There are literally thousands of articles on the net. My first option is always to check the indexes on the columns I'm joining the tables by. Start with analyzing the execution plan; you'll quickly discover the main problem(s).
The sheer number of records. 500,000 is at least 100 times too many for any human, completely unusable. There are two solutions: limit the number of returned records (add another criterion) or use server mode for the grid.
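If this is SQL Server 2012 or later, the paging itself can be pushed into the query so only one page is materialized per request; a sketch with placeholder names:

```sql
SELECT Id, Col1, Col2 /* ...the 20 columns... */
FROM dbo.BigView
ORDER BY Id                   -- OFFSET/FETCH requires a deterministic ORDER BY
OFFSET 10 ROWS
FETCH NEXT 100 ROWS ONLY;
```

If this still executes the whole underlying join, the fix is usually an index that supports the ORDER BY, so the engine can stop after the fetched page.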
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
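A minimal C# sketch of the build phase above; the Stem helper is a placeholder for a real stemmer, and db.globalsets / text are the names from the question, so treat this as an outline rather than a tested implementation:

```csharp
// Build a global document-frequency table in one pass, then persist it once.
var documentFrequency = new Dictionary<string, int>();

foreach (var doc in db.globalsets)                       // one pass over all documents
{
    var stems = doc.text
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(Stem)                                    // Stem() = your stemmer of choice
        .Distinct();                                     // count each stem once per document

    foreach (var stem in stems)
    {
        documentFrequency.TryGetValue(stem, out int n);
        documentFrequency[stem] = n + 1;
    }
}

// Finally, write documentFrequency back to an idf table in a single batch.
```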
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
As others have recommended, I think you should implement that query on the DB side. Take a look at this article about SQL Server Full Text Search; that should be the way to solve your problem.
Applying a CONTAINS query in a loop is an extremely bad idea. It kills performance and hammers the database. You should change your approach, and I strongly suggest you create Full Text Search indexes and run your query against them. You can retrieve the matched records' texts with your query strings.
select t.Id, t.SampleColumn from containstable(Student, SampleColumn, 'word OR sampleword') C
inner join Student t ON C.[KEY] = t.Id
Perform just one query, include all the desired search words using operators (OR, AND, etc.), and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Also, streaming the texts from SQL Server into memory may still take a while, but it is a far better option than applying N CONTAINS queries in a loop.
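Setting up the full-text index that the CONTAINSTABLE query relies on might look like this; the catalog name and the PK_Student key index are assumptions:

```sql
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Student (SampleColumn)
    KEY INDEX PK_Student          -- must be a unique, single-column, non-nullable index
    WITH CHANGE_TRACKING AUTO;
```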
I have a query that gets executed on an SQLite database table from a WPF application; this query can return a large amount of records (from about 80000 up to about 1 million) which I down-sample before displaying it.
It takes about 10 seconds to return 700000 records, can it be optimized in some way?
The WHERE clause filters the records by a date time column:
(WHERE CollectedOn > #startTime AND CollectedOn < #endTime)
and I'm selecting all of the 18 columns of the table.
Does the number of columns influence the executing time of the query?
Thanks for the comments. I should point out a few more things:
The data I collect needs to be displayed in a chart; since I want to display only 600 points my algorithm picks one point every 600 from those 700000 records. Can this be achieved in a single query?
These are some of the things I would consider:
Further narrow down the number of returned records (you said that you down-sample before displaying; can you down-sample within the database, or even in the WHERE clause)?
Do you really need all the records at once? Maybe paging would help (see LIMIT and OFFSET)
You could try to use an index to speed up your query. Use EXPLAIN to find out what your query does exactly... afterwards you can optimize joins and selections (also use indices for joins).
Narrowing down the attributes is always a good thing to do (instead of just returning all columns), but at least for simple queries (no subselects), it will have less influence than selecting the right rows using the WHERE clause. Also search for "selection" and "projection" about this issue.
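To answer the follow-up about down-sampling in a single query: on SQLite 3.25+ a window function can pick every Nth row inside the date range. The table name and step value here are placeholders:

```sql
-- An index on the filter column speeds up the WHERE clause
CREATE INDEX IF NOT EXISTS idx_samples_collectedon ON Samples(CollectedOn);

SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY CollectedOn) AS rn
    FROM Samples
    WHERE CollectedOn > @startTime AND CollectedOn < @endTime
)
WHERE rn % 1166 = 1;   -- roughly 700,000 rows / 600 points; adjust the step to taste
```

Selecting only the two or three columns the chart actually needs, instead of all 18, also cuts the amount of data SQLite has to read and marshal.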
We have an application that executes a job to process a range of rows from an MSSQL view.
This view contains a lot of rows, and the data is inserted with an additional column (dataid) set to identity, meant for us to use to know how far through the dataset we have gotten.
A while ago we had some issues when just getting the top n rows with a dataid larger than y (y being the biggest dataid that we had processed). It seemed that the rows were not returned in the correct order, meaning that when we grabbed a range of rows, the dataid of some of the rows was misplaced, which meant that we processed a row with a dataid of 100 when we had actually only gotten to 95.
example
The window/range is 100 rows on each crunch. But if the rows' dataids are not in sequential order, the query getting the next 100 rows may contain a dataid that really should have been in the next crunch, and then rows will be skipped when the next crunch is executed.
An order by on the dataid would solve the problem, but that is way, way too slow.
Do you guys have any suggestions for how this could be done in a better/working way?
When I say a lot of rows, I mean a few billion rows, and yes, if you think that is absolutely crazy, you are completely right!
We use Dapper to map the rows into objects.
This is completely read only.
I hope this question is not too vague.
Thanks in advance!
An order by on the dataid would solve the problem, but that is way, way too slow.
Apply the proper indexes.
The only answer to "why is my query slow" is: How To: Optimize SQL Queries.
It's not clear what you mean by mixing 'view' and 'insert' in the same sentence. If you really mean a view that projects an IDENTITY function, then you can stop right now: it will not work. You need a persisted bookmark to resume your work, and an IDENTITY projected in a SELECT by a view does not meet the persistence criteria.
You need to process data in a well-defined order that is persistent on consecutive reads. You must be able to read a key that clearly defines a boundary in the given order. You need to persist the last key processed in the same transaction as the batch processing the rows. How you achieve these requirements is entirely up to you. A typical solution is to process in clustered index order and remember the last processed cluster key position. A unique clustered key is a must. An IDENTITY property and a clustered index on it do satisfy the criteria you need.
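A sketch of that pattern with hypothetical table and column names: seek past the last processed key along the clustered index, so no sort is ever performed:

```sql
-- With a unique clustered index on dataid this is a range seek, not a scan + sort
SELECT TOP (100) dataid, Payload
FROM dbo.SourceTable
WHERE dataid > @lastProcessedDataId
ORDER BY dataid;          -- cheap: rows already come back in clustered-index order

-- Persist the new @lastProcessedDataId in the same transaction as the batch's work.
```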
If you only want to work on the last 100 rows, give or take a 1,000,000, you could look at partitioning the data.
What's the point of including the other 999,999,000,000 in the index?
My goal is to maximise performance. The basics of the scenario are:
I read some data from SQL Server 2005 into a DataTable (1000 records x 10 columns)
I do some processing in .NET of the data, all records have at least 1 field changed in the DataTable, but potentially all 10 fields could be changed
I also add some new records in to the DataTable
I do a SqlDataAdapter.Update(myDataTable.GetChanges()) to persist the updates (and inserts) back to the db using an InsertCommand and UpdateCommand I defined at the start
Assume table being updated contains 10s of millions of records
This is fine. However, if a row has changed in the DataTable, then ALL columns for that record are updated in the database, even if only 1 of the 10 columns has actually changed value. This means unnecessary work, particularly if indexes are involved. I don't believe SQL Server optimises this scenario?
I think, if I was able to only update the columns that had actually changed for any given record, that I should see a noticeable performance improvement (esp. as cumulatively I will be dealing with millions of rows).
I found this article: http://netcode.ru/dotnet/?lang=&katID=30&skatID=253&artID=6635
But don't like the idea of doing multiple UPDATEs within the sproc.
Short of creating individual UPDATE statements for each changed DataRow and then firing them in somehow in a batch, I'm looking for other people's experiences/suggestions.
(Please assume I can't use triggers)
Thanks in advance
Edit: Any way to get SqlDataAdapter to send UPDATE statements specific to each changed DataRow (only to update the actual changed columns in that row) rather than giving a general .UpdateCommand that updates all columns?
Isn't it possible to implement your own IDataAdapter in which you implement this functionality?
Of course, the DataAdapter only fires the correct SqlCommand, which is determined by the RowState of each DataRow.
So this means that you would have to generate the SQL command that has to be executed for each situation...
But, I wonder if it is worth the effort. How much performance will you gain ?
I think that - if it is really necessary - I would disable all my indexes and constraints, do the update using the regular SqlDataAdapter, and afterwards enable the indexes and constraints.
You might try creating an XML representation of your changed dataset, pass it as a parameter to a sproc, and then do a single update by using the SQL nodes() function to translate the XML into tabular form.
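A hypothetical sketch of that XML approach; the table, column, and element names are invented for illustration:

```sql
CREATE PROCEDURE dbo.UpdateFromChanges
    @changes XML
AS
BEGIN
    -- Shred the /rows/row elements into a rowset and update in one statement
    UPDATE t
    SET t.Name  = x.r.value('@Name',  'nvarchar(100)'),
        t.Price = x.r.value('@Price', 'decimal(18,2)')
    FROM dbo.Target AS t
    JOIN @changes.nodes('/rows/row') AS x(r)
        ON t.Id = x.r.value('@Id', 'int');
END
```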
You should never try to update a clustered index. If you do, it's time to rethink your db schema.
I would VERY much suggest that you do this with a stored procedure.
Let's say that you have 10 million records to update, and let's say that each record has 100 bytes (for 10 columns this could be too small, but let's be conservative). This amounts to about 1 GB of data that must be transferred from the database (network traffic), stored in memory, and then returned to the database in the form of UPDATE or INSERT statements, which are much more verbose and therefore even bigger to transfer back.
I expect that an SP would perform much better.
Then again, you could divide your work into smaller SPs (called from the main SP) that would update just the necessary fields, and that way gain additional performance.
Disabling indexes/constraints is also an option.
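As a sketch of that option, with placeholder index and table names; note that only nonclustered indexes can be disabled this way, since disabling the clustered index makes the table inaccessible:

```sql
ALTER INDEX IX_Target_Name ON dbo.Target DISABLE;

-- ...run the bulk UPDATE / INSERT here...

ALTER INDEX IX_Target_Name ON dbo.Target REBUILD;   -- re-enables and rebuilds the index
```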
EDIT:
Another thing you must consider is the potential number of different update statements. With 10 fields per row, any field could either stay the same or change, so if you construct your UPDATE statement to reflect this you could potentially get 2^10 = 1024 different UPDATE statements, and each of those must be parsed by SQL Server, have an execution plan calculated, and have the parsed statement cached. There is a price for this.