I have a simple tool for searching a given database. The user can provide numerous conditions, and my tool puts the SQL query together based on them. However, I want to prevent the query from being executed if it would return too many records. For example, if the user leaves all the filters blank, the query would pull every record from the database, which would take tens of minutes. None of my users need that, so I want some kind of limit.
I was thinking about running a count() sql query with the same conditions before each 'real' query, but that takes too much time.
Is there any way to count the records 'during' the query and stop it once a certain number is reached, throwing an exception that asks the user to refine the search?
I use this approach:
State that you want to fetch AT MOST 100 rows. Construct your query so it returns at most 101 rows (with TOP N or, more generically, the ANSI way of filtering on row_number). If 101 rows come back, you know there are more than 100 matches. You can act accordingly; in my case, I show a 'read more'.
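The "fetch one row more than you need" idea can be sketched as follows. This is a minimal illustration using Python's sqlite3 module as a stand-in, so LIMIT plays the role of TOP; the helper name fetch_capped and the MAX_ROWS constant are assumptions made for the example, not part of any library.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical cap on rows shown to the user


def fetch_capped(conn, sql, params=(), max_rows=MAX_ROWS):
    """Run `sql` but ask for at most max_rows + 1 rows.

    Returns (rows, has_more); rows never holds more than max_rows entries.
    """
    # Wrapping the generated query keeps the cap independent of its filters.
    capped_sql = f"SELECT * FROM ({sql}) LIMIT ?"
    rows = conn.execute(capped_sql, (*params, max_rows + 1)).fetchall()
    has_more = len(rows) > max_rows  # the extra row only signals overflow
    return rows[:max_rows], has_more
```

When has_more is true, the tool can show the first 100 rows with a 'read more', or ask the user to refine the search, without ever running a separate count.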
You could first run a test query with the user-defined options that returns only the id field of the matching rows. This would be very quick and would also let you test the count().
Then, if all is OK, you can run the full query to return all of the results.
Following on from the answer above: if you are working with large amounts of data, use SELECT TOP N together with the FAST query hint.
E.g.
SELECT TOP 101 [ColumnName]
FROM [Table]
OPTION (FAST 101)
This depends on your application and how you want it to work.
If you only want to display the data in a table, and setting a maximum size on your query is enough, you can use TOP in your SELECT statement.
SELECT TOP N [ColumnName]
But since you said a count takes too much time, I think you're concerned about handling a very large data set, and maybe manipulating it, not just getting a limited set of data from the query.
In that case, one method is to break the job into chunks: grab the first N rows, then the next N rows, and repeat until no more rows are returned. You can also keep records for rollbacks and checkpoints to ensure data integrity.
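A rough sketch of the chunking idea, again using Python's sqlite3 as a stand-in database. The table name items and its columns id and payload are hypothetical, and the sketch assumes positive integer ids; it pages by "id greater than the last one seen", so the last id of each chunk doubles as a checkpoint.

```python
import sqlite3


def iter_chunks(conn, chunk_size=1000):
    """Yield lists of (id, payload) rows, chunk_size rows at a time."""
    last_id = 0  # assumes positive integer ids
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return  # no more rows to process
        yield rows
        last_id = rows[-1][0]  # checkpoint: the next chunk resumes after this id
```

Persisting last_id somewhere durable after each chunk would let an interrupted job resume from the last completed chunk.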
Similar questions maybe:
query to limit records returned by sql query based on size of data
How to iterate through a large SQL result set with multiple related tables
Related
I am trying to modify some fields in all records of a table, using the Npgsql data provider for PostgreSQL.
Each record needs:
to be read,
to have some fields modified by a C# procedure,
and to be written back to the table.
Is there an object or mechanism that lets me point to each record and do this, without issuing multiple queries to make the C# procedure call between reading and writing each record?
If you're looking for a way to update a value via an open cursor, to avoid an additional UPDATE, then that doesn't exist in PostgreSQL. On the other hand, I'm pretty sure (but not 100%) that on other databases it doesn't actually improve perf either, i.e. that an additional roundtrip for each update is required anyway. In other words, "updating a cursor" for results from a SELECT is probably API sugar rather than an actual optimization.
The most efficient way to accomplish this with Npgsql is probably to do a SELECT, buffer the results in memory, iterate over them to calculate the new values, and then issue a prepared, batched update that updates the rows (i.e. a single command with several UPDATE ...; UPDATE ... statements). If the number of rows is too large, this can be split into several batches, i.e. "load x rows, calculate, update those x rows; load the next x rows...". You can use PostgreSQL's cursor functionality to load the next x rows each time, or simply issue new SELECTs and use LIMIT/OFFSET for paging (likely to have similar performance).
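The "load x rows, calculate, update those x rows" loop might look roughly like this. It is only a sketch of the control flow: Python's sqlite3 stands in for Npgsql and PostgreSQL, the table people and the transform function are hypothetical, and paging uses "id greater than the last one seen" rather than LIMIT/OFFSET or a cursor.

```python
import sqlite3


def transform(name):
    """Hypothetical stand-in for the C# procedure applied to each record."""
    return name.strip().upper()


def update_in_batches(conn, batch_size=500):
    """Load a batch, compute new values in memory, write them back in one call."""
    last_id = 0  # assumes positive integer ids
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM people WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        updates = [(transform(name), row_id) for row_id, name in rows]
        # One prepared statement, executed once per row of the batch.
        conn.executemany("UPDATE people SET name = ? WHERE id = ?", updates)
        conn.commit()
        last_id = rows[-1][0]
        total += len(rows)
    return total
```

With Npgsql the write step would be a batched command instead of executemany, but the batching structure is the same.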
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
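The in-memory approach can be sketched as below, in Python rather than C#. Stemming is left out, the tokenizer is a deliberately naive assumption, and tf-idf is computed as raw term count times log(N / df), which is only one of several common variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive lowercase word split; real stemming is omitted in this sketch."""
    return re.findall(r"[a-z0-9']+", text.lower())


def document_frequency(documents):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for text in documents:
        df.update(set(tokenize(text)))  # count each term once per document
    return df


def tf_idf(term, text, df, n_docs):
    """Raw term count times log(N / df) for one term in one document."""
    tf = tokenize(text).count(term)
    if tf == 0 or df[term] == 0:
        return 0.0
    return tf * math.log(n_docs / df[term])
```

The whole collection is read once to build df, instead of issuing one Contains query per word.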
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Running a contains query in a loop is an extremely bad idea. It kills performance and puts a heavy load on the database. You should change your approach: I strongly suggest creating full-text search indexes and querying those. You can retrieve the matching record texts with your query strings.
SELECT t.Id, t.SampleColumn
FROM CONTAINSTABLE(Student, SampleColumn, 'word OR sampleword') AS C
INNER JOIN Student AS t ON C.[KEY] = t.Id
Perform just one query: combine the words you are searching for using operators (OR, AND, etc.) and retrieve the matching texts. Then you can calculate the TF-IDF scores in memory.
Also, streaming the texts from SQL Server into memory might still take a long time, but it is a better option than running N contains queries in a loop.
I have a query that gets executed on an SQLite database table from a WPF application; this query can return a large number of records (from about 80000 up to about 1 million), which I down-sample before displaying.
It takes about 10 seconds to return 700000 records. Can it be optimized in some way?
The WHERE clause filters the records by a date time column:
(WHERE CollectedOn > #startTime AND CollectedOn < #endTime)
and I'm selecting all of the 18 columns of the table.
Does the number of columns influence the executing time of the query?
Thanks for the comments. I should point out a few more things:
The data I collect needs to be displayed in a chart; since I want to display only 600 points my algorithm picks one point every 600 from those 700000 records. Can this be achieved in a single query?
These are some of the things I would consider:
Further narrowing down the number of returned records (you said that you down-sample before displaying... can you down-sample within the database or even the WHERE clause)?
Do you really need all the records at once? Maybe paging would help (see LIMIT and OFFSET)
You could try to use an index to speed up your query. Use EXPLAIN to find out what your query does exactly... afterwards you can optimize joins and selections (also use indices for joins).
Narrowing down the attributes is always a good thing to do (instead of just returning all columns), but at least for simple queries (no subselects), it will have less influence than selecting the right rows using the WHERE clause. Also search for "selection" and "projection" about this issue.
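As an illustration of down-sampling inside the database, the sketch below groups the time range into a fixed number of buckets and returns one aggregated row per bucket. It assumes CollectedOn is stored as a number (for example Unix seconds); the table name Samples and the column Value are hypothetical. Note that it averages each bucket rather than picking every N-th row, which is a different down-sampling rule from the one described in the question.

```python
import sqlite3


def downsample(conn, start, end, points=600):
    """Return at most `points` rows of (bucket, first timestamp, average value)."""
    sql = """
        SELECT CAST((CollectedOn - :start) * :points / (:end - :start) AS INTEGER) AS bucket,
               MIN(CollectedOn) AS bucket_start,
               AVG(Value)       AS avg_value
        FROM Samples
        WHERE CollectedOn > :start AND CollectedOn < :end
        GROUP BY bucket
        ORDER BY bucket
    """
    params = {"start": start, "end": end, "points": points}
    return conn.execute(sql, params).fetchall()
```

Only about 600 rows cross the boundary between SQLite and the application, and only two columns are read, which touches both the "fewer rows" and the "fewer columns" suggestions above.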
I have a problem concerning application performance: I have many tables, each with millions of records. I am performing select statements over them using joins, where clauses and order by on different criteria (specified by the user at runtime). I want to get my records paged, but no matter what I do with my SQL statements I cannot match the performance of getting my pages directly from memory. Basically, the problem comes when I have to filter my records using criteria that are specified dynamically at runtime. I tried everything, such as using the ROW_NUMBER() function combined with a "where RowNo between" clause; I've tried CTEs, temp tables, etc. Those SQL solutions perform well only if I don't include filtering. Keep in mind also that I want my solution to be as generic as possible (imagine that I have several lists in my app that virtually present millions of paged records, and those records are constructed with very complex SQL statements).
All my tables have a primary key of type INT.
So I came up with an idea: why not create a "server" only for select statements? The server first loads all records from all tables and stores them in HashSet<T> collections, where each T has an Id property, GetHashCode() returns that Id, and Equals is implemented so that two records are "equal" only if their Ids are equal (don't scream, you will see later why I am not using all the record data for hashing and comparisons).
So far so good, but there's a problem: how can I sync my in-memory collections with the database records? The idea is that I must find a solution that loads only differential changes. So I invented a changelog table for each table that I want to cache. In this changelog I perform only inserts, which mark dirty rows (updates or deletes) and also record newly inserted ids; the whole mechanism is implemented using triggers. So whenever an in-memory select comes in, I first check whether I must sync something (by querying the changelog). If something must be applied, I load the changelog, apply those changes in memory and finally clear the changelog (or maybe remember the highest changelog id that I've applied ...).
In order to be able to apply the changelog in O(N), where N is the changelog size, I am using this algorithm:
for each log entry:
identify my in-memory Dictionary<int, T>, where the key is the primary key.
if it's a delete log, call dictionary.Remove(id) (O(1)).
if it's an update log, also call dictionary.Remove(id) (O(1)) and move this id into a "to be inserted" collection.
if it's an insert log, move this id into the "to be inserted" collection.
finally, refresh the cache by selecting all data from the corresponding table where Id in ("to be inserted").
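The steps above can be sketched as follows, in Python rather than C#. The names apply_changelog and load_rows are hypothetical; load_rows stands for the single "where Id in (...)" query. One detail goes beyond the description: a delete also drops the id from the pending reload set, so an insert followed by a delete in the same changelog does not trigger a pointless reload.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Each log entry is (operation, id); operation is "insert", "update" or "delete".
LogEntry = Tuple[str, int]


def apply_changelog(
    cache: Dict[int, dict],
    log: Iterable[LogEntry],
    load_rows: Callable[[List[int]], Dict[int, dict]],
) -> None:
    """Apply changelog entries to the cache with one dictionary operation each."""
    to_insert = set()
    for op, row_id in log:
        if op == "delete":
            cache.pop(row_id, None)
            to_insert.discard(row_id)  # a later delete cancels a pending reload
        elif op in ("update", "insert"):
            cache.pop(row_id, None)    # drop the stale copy, if any
            to_insert.add(row_id)
        else:
            raise ValueError(f"unknown changelog operation: {op!r}")
    if to_insert:
        # One round trip: SELECT ... WHERE Id IN (to_insert)
        cache.update(load_rows(sorted(to_insert)))
```

Each log entry costs a constant number of hash operations, so the in-memory part is linear in the changelog size; the final reload is a single query.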
For filtering, I am compiling some expression trees into Func<T, List<FilterCriterias>, bool> functors. Using this mechanism, my filtering performs much faster than SQL.
I know that SQL Server 2012 has caching support and the upcoming SQL Server version will support even more, but my client has SQL Server 2005, so I can't benefit from this.
My question: what do you think? Is this a bad idea? Is there a better approach?
The developers of SQL Server did a very good job. I think it is nearly impossible to outsmart it.
Unless your data has some kind of implicit structure which might help to speed things up and which the optimizer cannot be aware of, such "I do my own speedy trick" approaches won't help - normally...
Performance problems should always be solved first where they occur:
the tables structures and relations
indexes and statistics
quality of SQL statements
Even many million rows are no problem if the design and the queries are good...
If your queries do a lot of computations, or you need to retrieve data out of tricky structures (nested lists with recursive reads, XML...), I'd go the data-warehouse path and write some denormalized tables for quick selects. Of course, you will have to deal with the fact that you are reading "old" data. If your data does not change much, you could use triggers to push all changes to a denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your poorly performing queries together with the relevant structure details and ask for a review. There are dedicated groups on Stack Exchange, such as "Code Review". If it's not too big, you might try it here as well...
I'm using jquery datatables to display a grid which uses webapi to retrieve its data. The webapi uses linq to query a mssql database and it neatly uses filtering, sorting and skip/take to assemble its query on a well-indexed table containing about a million records (and growing). A common scenario.
And it performs really well. The browser has to wait about 50 ms for the response (while paginating for example) to return.
However, after I took a look with a profiling tool I noticed about 25 ms to be used just selecting the total rowcount of the table. Which I want to know because I want the datatable to display something like: "displaying row 1 to 10 of 45.000 filtered out of 1.000.000" needing the total count.
I don't actually need to know the precise total count (it's just informative) every trip from the server so I perhaps could keep the value server side and refresh it every second in a different task without it interfering with the data retrieval of datatables. I would just return the 'close enough' value of the total row count.
Is there a solid mechanism for that? I've tried putting the total rowcount in a static shared by multiple users across multiple callbacks; every time it was requested, an async task was fired to refresh it.
That feels icky however, sharing the static and having a different thread update it doesn't feel all that stable to me. I've looked at SqlDependency to push the recordcount every time it changes from my data to my domain model but that doesn't seem to support SELECT COUNT(Id) FROM TABLE scenarios.
Any thoughts?
You could use one of the system tables if possible. You could ping this every minute and stick it in the cache. This article has two that it claims are sufficient options:
--The way the SQL management studio counts rows (look at table properties, storage, row count).
--Very fast, but still an approximate number of rows.
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
or
--Quick (although not as fast as method 2) operation and equally important, reliable.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Transactions')
AND (index_id=0 or index_id=1);
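The "ping this every minute and stick it in the cache" suggestion could be sketched like this, in Python rather than C#. CachedCount is a hypothetical helper; fetch stands for whichever of the two queries above is used. In this sketch the refresh happens on the requesting thread under a lock rather than in a background task, which keeps the shared state simple.

```python
import threading
import time


class CachedCount:
    """Serve a 'close enough' value, re-reading it at most once per max_age seconds."""

    def __init__(self, fetch, max_age=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable that runs the real count query
        self._max_age = max_age
        self._clock = clock        # injectable so the cache can be tested
        self._lock = threading.Lock()
        self._value = None
        self._fetched_at = None

    def get(self):
        with self._lock:  # one thread refreshes; the others wait for the result
            now = self._clock()
            stale = self._fetched_at is None or now - self._fetched_at >= self._max_age
            if stale:
                self._value = self._fetch()
                self._fetched_at = now
            return self._value
```

A single shared instance replaces the hand-rolled static: every request calls get(), and at most one of them per minute pays for the count query.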
Have you considered taking the count when a query is performed and then echoing the value out to your clients via SignalR?
Basically, when the LINQ call returns, get a .Count() and hand the value off to a background thread that lets SignalR notify the clients of the update, while at the same time you return the data to the requesting client.
SignalR will activate a javascript function in all of the client pages, where you can then take the passed in value and display it somewhere on the page.
http://www.asp.net/signalr