I've got a SQL query that returns 20 columns and about 500,000 rows at the moment. The values change constantly because people are working on the data in the database.
Most columns in the query aren't simple selects; there are a lot of CASE WHEN expressions, and the data is joined from 5 tables.
Is there a way to show the data in a GridView efficiently? Right now I show all the data (500,000 rows) and it takes a long time. I've tried pagination, but when I want to take, for example, 100 rows with an offset of 10 rows, the whole query is executed and it takes too long.
How could I cope with this?
I think you have two separate issues here:
Slow query: be sure to optimize your query. There are literally thousands of articles on the net. My first step is always to check indexes on the columns I'm joining the tables by. Start by analyzing the execution plan; you'll quickly discover the main problem(s).
The sheer number of records. 500,000 is at least 100 times too many for any human, completely unusable. There are two solutions: limit the number of returned records (add more criteria) or use server mode for the grid.
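As a sketch of what server-side paging looks like at the SQL level (assuming SQL Server 2012+; the table and column names are hypothetical, and in practice you'd wrap your real 5-table CASE WHEN query in place of the simple SELECT):

```sql
-- Page 3 with a page size of 100: skip 200 rows, fetch the next 100.
-- OFFSET/FETCH requires an ORDER BY; it only pays off if that ORDER BY
-- can use an index, otherwise the server still sorts the full result.
SELECT Id, CustomerName, Total      -- hypothetical columns
FROM Orders                         -- hypothetical table
ORDER BY Id
OFFSET 200 ROWS
FETCH NEXT 100 ROWS ONLY;
```

If the underlying query itself is expensive, paging alone won't help; in that case consider materializing the joined result into an indexed table first and paging over that.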
Related
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
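The counting pass described above can be sketched roughly like this in C# (the stemming step is stubbed out; a real implementation would plug in an actual stemmer such as a Porter stemmer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DocumentFrequencyBuilder
{
    readonly Dictionary<string, int> documentFrequency = new Dictionary<string, int>();

    // Placeholder for a real stemmer; lower-casing only, as an assumption.
    static string Stem(string word) => word.ToLowerInvariant();

    public void AddDocument(string text)
    {
        var stems = text
            .Split(new[] { ' ', '\t', '\r', '\n', '.', ',', ';', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries)
            .Select(Stem)
            .Distinct();                        // count each stem once per document

        foreach (var stem in stems)
        {
            documentFrequency.TryGetValue(stem, out int n);
            documentFrequency[stem] = n + 1;
        }
    }

    // idf = log(N / df); these are the values to write back to the database.
    public double Idf(string word, int totalDocuments) =>
        Math.Log((double)totalDocuments / documentFrequency[Stem(word)]);
}
```

With ~2,000 documents this whole pass is a single sequential read of the table instead of 50-1,000 CONTAINS round-trips.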
As others have recommended, I think you should implement that query on the DB side. Take a look at this article about SQL Server Full Text Search; that should be the way to solve your problem.
Applying a CONTAINS query in a loop is an extremely bad idea. It kills performance and hammers the database. You should change your approach, and I strongly suggest creating Full Text Search indexes and querying against them. You can retrieve the matched records' text with your query strings.
SELECT t.Id, t.SampleColumn
FROM CONTAINSTABLE(Student, SampleColumn, 'word OR sampleword') C
INNER JOIN Student t ON C.[KEY] = t.Id
Perform just one query: combine the search words using operators (OR, AND, etc.) and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Streaming the texts from SQL Server into memory may still take a while, but it is a far better option than running N CONTAINS queries in a loop.
I have a query that gets executed against an SQLite database table from a WPF application; this query can return a large number of records (from about 80,000 up to about 1 million), which I down-sample before displaying.
It takes about 10 seconds to return 700000 records, can it be optimized in some way?
The WHERE clause filters the records by a date time column:
(WHERE CollectedOn > #startTime AND CollectedOn < #endTime)
and I'm selecting all of the 18 columns of the table.
Does the number of columns influence the executing time of the query?
Thanks for the comments. I should point out a few more things:
The data I collect needs to be displayed in a chart; since I want to display only 600 points, my algorithm picks 600 evenly spaced points from those 700,000 records. Can this be achieved in a single query?
These are some of the things I would consider:
Further narrowing down the number of returned records (you said that you down-sample before displaying; can you down-sample within the database, or even in the WHERE clause)?
Do you really need all the records at once? Maybe paging would help (see LIMIT and OFFSET)
You could try to use an index to speed up your query. Use EXPLAIN to find out what your query does exactly... afterwards you can optimize joins and selections (also use indices for joins).
Narrowing down the attributes (instead of just returning all columns) is always a good thing to do, but at least for simple queries (no sub-selects) it will have less influence than selecting the right rows with the WHERE clause. Also read up on "selection" and "projection" in this context.
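Down-sampling to ~600 points can indeed be pushed into the query itself. A sketch, assuming the table uses SQLite's default integer rowid and a hypothetical name `samples`:

```sql
-- Keep roughly one row out of every 1166 (700000 / 600 ≈ 1166).
SELECT *
FROM samples                       -- hypothetical table name
WHERE CollectedOn > :startTime
  AND CollectedOn < :endTime
  AND (rowid % 1166) = 0;
```

Gaps in rowid (from deleted rows) make the spacing approximate; for exact spacing you could number the filtered rows with ROW_NUMBER() first (SQLite 3.25+) and filter on that instead. Either way, only ~600 rows cross from SQLite into the application.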
I have an issue with the performance of my C# program.
In the first loop, the table inserts and updates 175,000 records in 54 secs.
In the second loop, with 175,000 records, it takes 1 min 11 secs.
The third loop, with 18,195 records, takes 1 min 28 secs.
As the loops go on, the time taken keeps growing; even 125 records can take up to 2 mins.
I am wondering why smaller batches are taking longer to update. Shouldn't the number of records being updated determine the time taken to complete the loop?
Can anyone enlighten me on this?
Flow of Program:
INSERT INTO TableA (date, time) SELECT date, time FROM rawdatatbl WHERE id >= startID AND id <= maxID; -- startID is the next ID after the last processed record
UPDATE TableA SET columnName = values, columnName1 = values, columnName2 = values, columnName.....
I'm using InnoDB.
Reported behavior seems consistent with growing size of table, and inefficient query execution plan for UPDATE statements. Most likely explanation would be that the UPDATE is performing a full table scan to locate rows to be updated, because an appropriate index is not available. And as the table has more and more rows added, it takes longer and longer to perform the full table scan.
Quick recommendations:
review the query execution plan (obtained by running EXPLAIN)
verify that suitable indexes are available and are being used
Apart from that, there's tuning of the MySQL instance itself. But that's going to depend on which storage engine the tables are using, MyISAM, InnoDB, et al.
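As a sketch of both recommendations (the filtered columns here are hypothetical; substitute whatever your UPDATE's WHERE clause actually uses):

```sql
-- See how MySQL plans to locate the rows; 'type: ALL' in the output
-- means a full table scan, which is the suspected problem.
EXPLAIN SELECT * FROM TableA
WHERE date = '2024-01-01' AND time = '12:00:00';

-- An index on the filtered columns lets the UPDATE find its rows
-- directly instead of scanning the ever-growing table.
ALTER TABLE TableA ADD INDEX idx_date_time (date, time);
```

After adding the index, re-run EXPLAIN and confirm the `key` column shows `idx_date_time`; the per-loop time should then stop growing with table size.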
Please provide SHOW CREATE TABLE for both tables, and the actual statements. Here are some guesses...
The target table has indexes. Since the indexes are built as the inserts occur, any "random" indexes will become slower and slower.
innodb_buffer_pool_size was so small that caching became a problem.
The UPDATE seems to be a full table update. Well, the table is larger each time.
How did you get startID from one query before doing the next one (which has id>=startID)? Perhaps that code is slower as you get farther into the table.
You say "in the second loop", where is the "loop"? Or were you referring to the INSERT...SELECT as a "loop"?
I have a dropdown list in my aspx page. Dropdown list's datasource is a datatable. Backend is MySQL and records get to the datatable by using a stored procedure.
I want to display records in the dropdown menu in ascending order.
I can achieve this by two ways.
1) dt is datatable and I am using dataview to filter records.
dt = objTest_BLL.Get_Names();
dataView = dt.DefaultView;
dataView.Sort = "name ASC";
dt = dataView.ToTable();
ddown.DataSource = dt;
ddown.DataTextField = dt.Columns[1].ToString();
ddown.DataValueField = dt.Columns[0].ToString();
ddown.DataBind();
2) Or in the select query I can simply say that
SELECT
`id`,
`name`
FROM `test`.`type_names`
ORDER BY `name` ASC ;
If I use the 2nd method I can simply eliminate the DataView part. Assume this type_names table has 50 records, and my page is viewed by 100,000 users a minute. Which is the better method considering efficiency and memory handling: get unsorted records into the DataTable and sort in the code-behind, or sort them inside the database?
Note: only real performance tests can tell you real numbers. Theoretical options are below (which is why I use the word "guess" a lot in this answer).
You have at least 3 (instead of 2) options:
Sort in the database: if the column being sorted on is indexed, this may make the most sense, because the sorting overhead on your database server may be negligible, and SQL Server's own data caches may make this a super fast operation. But at 100k queries per minute, measure whether SQL gives noticeably faster results without the sort.
Sort in the code-behind / middle layer: you likely won't have your own equivalent of an index; you'd be sorting a list of 50 records, 100k times per minute. That would be slower than SQL, I would guess.
A big benefit would apply only if the data is relatively static, or very slowly changing, and the sorted values can be cached in memory for a few seconds to minutes or hours.
The option not in your list: send the data unsorted all the way to the client, and sort it on the client side using JavaScript. This solution may scale the best; sorting 50 records in the browser should have no noticeable impact on your UX.
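If the list does qualify as slow-changing, a sketch of the caching idea using System.Runtime.Caching (the 5-minute expiry is an arbitrary assumption, and `objTest_BLL.Get_Names()` is the BLL call from the question):

```csharp
using System;
using System.Data;
using System.Runtime.Caching;

static class TypeNameCache
{
    // One sorted copy shared by all requests, rebuilt at most every 5 minutes,
    // so 100k page views per minute cost ~1 database query instead of 100k.
    public static DataTable GetSorted()
    {
        var cache = MemoryCache.Default;
        if (cache.Get("type_names_sorted") is DataTable cached)
            return cached;

        DataTable dt = objTest_BLL.Get_Names();   // BLL call from the question
        dt.DefaultView.Sort = "name ASC";
        DataTable sorted = dt.DefaultView.ToTable();

        cache.Set("type_names_sorted", sorted,
                  DateTimeOffset.Now.AddMinutes(5));
        return sorted;
    }
}
```

With this in place the sort-in-database vs. sort-in-code question largely disappears, since either sort runs once per cache refresh rather than once per request.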
The SQL purists will no doubt tell you that it's better to let SQL do the sorting rather than C#. That said, unless you are dealing with massive record sets or doing many queries per second, it's unlikely you'd notice any real difference.
For my own projects these days I tend to do the sorting in C#, unless I'm running some sort of aggregate in the statement. The reason is that it's quick, and if you are running any sort of stored proc or function on the SQL server, it means you don't need to find ways of passing ORDER BY clauses into the stored proc.
I have a simple tool for searching a given DB. The user can provide numerous conditions, and my tool puts the SQL query together based on them. However, I want to prevent the query from being executed if it would return too many records. For example, if the user leaves all the filters blank, the query would pull every record from the DB, which would take tens of minutes and isn't useful to any of my users. So I want some limitation.
I was thinking about running a COUNT() query with the same conditions before each 'real' query, but that takes too much time.
Is there any option to measure the records 'during' the query and stop it if a certain amount is reached, throwing some exception asking the user to refine the search?
I use this approach:
State that you want to fetch AT MOST 100 rows, then construct your query so it returns at most 101 rows (with TOP N, or in the more generic ANSI way by filtering on ROW_NUMBER). Then you can easily detect whether there are more. You can act accordingly; in my case, I show a 'read more'.
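A sketch of the N+1 trick in both forms (SQL Server syntax; the table, columns, and filter are hypothetical):

```sql
-- Ask for one row more than the page size; if 101 rows come back,
-- show the first 100 plus a 'read more' indicator.
SELECT TOP 101 Id, Name
FROM Customers                 -- hypothetical table
WHERE Name LIKE @filter
ORDER BY Id;

-- ANSI-style equivalent using ROW_NUMBER():
SELECT Id, Name
FROM (SELECT Id, Name,
             ROW_NUMBER() OVER (ORDER BY Id) AS rn
      FROM Customers
      WHERE Name LIKE @filter) x
WHERE rn <= 101;
```

Either way, the over-broad query is capped at 101 rows server-side instead of streaming the whole table, and the application only needs to check `rowCount > 100`.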
You could run a test query to search the database with the user defined options and only return the id field of the returned results, this would be very quick and also allow you to test the count().
Then if all is ok then you can run the full query to return all of their results.
Following on from the answer above, if you are working with large amounts of data, select top N, with the fast query option.
E.g.
SELECT TOP 101 [ColumnName]
FROM [Table]
OPTION (FAST 101)
This depends on your application and how you want it to work.
If you only want to display data in a table, setting a maximum size on your query is enough. You can use TOP in your SELECT statement.
SELECT TOP N [ColumnName]
But considering you said a COUNT takes too much time, I think you're concerned about handling a very large data set, and maybe manipulating it, not necessarily just getting a limited set of data from the query.
Then one method is to break the job apart into chunks: grab the first N rows, then the next N rows, and repeat until no more values are returned. You can also keep records for rollbacks and checkpoints to ensure data integrity.
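The chunking loop can be sketched with keyset paging (assuming a hypothetical numeric `Id` key; `@lastId` starts at 0 and is advanced to the largest Id seen in the previous chunk):

```sql
-- Fetch the next chunk of 1000 rows after the last one processed.
-- Seeking on the indexed key is cheap even deep into the table,
-- unlike OFFSET, which re-scans the skipped rows every time.
SELECT TOP 1000 Id, Payload      -- hypothetical columns
FROM [Table]
WHERE Id > @lastId
ORDER BY Id;
-- Repeat with @lastId = MAX(Id) of this chunk until no rows come back.
```

Each chunk boundary is also a natural checkpoint: persist `@lastId` after processing a chunk and you can resume or roll back from there.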
Similar questions maybe:
query to limit records returned by sql query based on size of data
How to iterate through a large SQL result set with multiple related tables