Improving search performance in large data sets

Improving search performance in large data sets - c#

On a WPF application already in production, users have a window where they choose a client. It shows a list with all the clients and a TextBox where they can search for a client.
As the client base increased, this turns out to be exceptionally slow. Around 1 minute for a operation that happens around 100 times each day.
Currently MSSQL management studio says the query select id, name, birth_date from client takes 41 seconds to execute (around 130000 rows).
Is there any suggestions on how to improve this time? Indexes, ORMs or direct sql queries on code?
Currently I'm using framework 3.5 and LinqToSql

If your query is actually SELECT id, name, birth_date from client (ie, no where clause) there is very little that you'll be able to do to speed that up short of new hardware. SQL Server will have to do a table scan to get all of the data. Even a covering index means that it will have to scan an index just as big as the table.
What you need to ask yourself is: is a list of 130000 clients really useful for your users? I anybody really going to scroll through to the 75613th entry in a list to find the user that they want? The answer is probably not. I would go with the search option only. At least then you can add indices that make sense for those queries.
If you absolutely do need the entire list, try loading it lazily in chunks. Start with the first 500 records and then add more records as the user moves the scroll bar. That way the initial load time is reduced and the user will only load the data that is necessary.

Why do you need the list of all the clients? Couldn't you just have the search TextBox that you describe and handle the search query on the server side. There you set a cap on the maximum number of returned rows for an individual client search (e.g. max 500 matches).
Alternatively, some efficiency gains may be achived by caching the client data list on the web server

Indexing should not help, based on your query. You could use a view which caches the sorted query (assuming you're not ordering by the id?), but given SQL Server's baked-in query cache for adhoc queries you're probably not going to see much gain there either. The ORM does add some overhead, but there are several tutorials out there for cutting the cost of that (e.g. http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html). Main points there that apply to you are to use compiled queries wherever possible, and turn off optimistic concurrency for read-only data.
An even bigger performance gain could be realized by having your clients not hit the db directly. If you add a service layer in there (not necessarily a web service, but it could be) then the service class or application could put some smart caching in place, which would help by an order of magnitude for read-only queries like this.

Go in to SQL Server, do a new query. In the Query menu click the "Include Client Statistics".
Run the query just as you would from code.
It will display the results and also a tab next to the result called "Client Statistics"
Click that and look at the time in the "Wait time on server replies" This is in ms, and it's the time the server was actually executing.
I just ran this query:
select firstname, lastname from leads
It took 3ms on the server to fetch 301,000 records.
The "Total Execution Time" was something like 483ms, which includes the time for SSMS to actually get the data and process it. My query took something like 2.5-3s to run in SSMS and the remaining time (2500ms or so) was actually for SSMS to paint the results etc.)
My guess is, the 41 seconds is probably not being spent on the SQL server, as 130,000 records really isn't that much. Your 41 seconds is probably largely being spent by everything after the SQL server returns the results.
If you find out SQL Server is taking a long time to execute, in the query menu turn on "Include Actual Execution Plan" Rerun your query. A new tab appears called "Execution Plan" this tab will show you what SQL server is doing when you do a select on this table as well as a percentage of where it spends all of it's time. In my case it spent 100% of the time in a "Clustered Index Scan" of PK_Leads
Edited to include more stats

In general:
Find out what takes so much time, executing the query or retrieving the results
If its the query execution, the query plan will tell you which indexes are missing, just press the display query plan button in the SSMS and you will get hints on which indexes you should create to increase performance
If its the retrieval of the values, there is not much you can do about it besides upgrading hardware (ram, disk, network etc.)
But:
In your case it looks like the query is a full table scan, which is never good for performance, check if you really need to retrieve all this data at once.
Since there are no clauses what so ever its unlikely that its the query execution that is the problem. Meaning additional indexes will not help.
You will need to change the way the application access the data. Instead of loading all clients into memory and then search from them in memory you will need to pass on the search term to the database query.
LinqToSql enable you to use different features for searching values, here is a blog describing most of them:
http://davidhayden.com/blog/dave/archive/2007/11/23/LINQToSQLLIKEOperatorGeneratingLIKESQLServer.aspx

Related

How to cache a big table from SQL Server

I have a table with a lot of rows (3 million) from which I need to query some rows at several points in my app. The way I found to do this is querying all the data the first time that any was needed and storing it in a static DataTable with SqlAdapter.Fill() for the rest of the app life.
That's fast, because then when I need something I use DataTable.Select("some query") and the app processes the info just nice.
The problem is that this table takes about 800MB of RAM, and I have to run this app in PCs where it might be too much.
The other way I thought was to query the data I need each time. This takes little memory but has poor performance (a lot of queries to the database, which is at a network address and with 1000 queries you start to notice the ping and all that..).
Is there any intermediate point between performance and memory usage?
EDIT: What I'm retrieving are sales, which have a date, a product and a quantity. I query by product, and it isn't indexed that way. But anyways, making 1000 queries, even if the query took 0.05s, a 0.2s ping makes a total of 200 seconds...

First talk to the dba about performance
If you are downloading the entire table you might actually be putting more load on the network and SQL than if you performed individual queries.
As a dba if I knew you were downloading an entire large table I would put an index on product immediately.
Why are you performing 1000s of queries?
If you are looking for sales when a product is created then a cache is problematic. You would not yet have sales data. The problem with a cache is stale data. If you know the data will not change - you either have it or not then you can eliminate the concern of stale data.
There is something between sequentially and simultaneously. You can pack multiple selects in a single request. What this does is make a single round trip and is more efficient.
select * from tableA where ....;
select * from tableB where ....;
With DataReader just call SqlDataReader.NextResult Method ()
using (SqlDataReader rdr = cmd.ExecuteReader())
{
while (rdr.Read())
{
}
rdr.NextResultSet();
while (rdr.Read())
{
}
}
Pretty sure you can do the same type of thing with multiple DataTables in a DataSet.
Another option is a LocalDB. It is targeted at developers but for what you are doing it would work just fine. DataTable speed without memory concerns. You can even put an index on ProductID. It will take a little longer to write to disc compared to memory but you are not using up memory.
Then there is the ever evil with (nolock). Know what you are doing and I am not going to go into all the possible evils but I can tell you that I use it a lot.

The question can be precipitated to Memory vs Performance. The answer to that is Caching.
If you know what your usage pattern would be like, then one thing you can do is to create a local cache in the app.
The extreme cases are - your cache size is 800MB with all your data in it (thereby sacrificing memory) - OR - your cache size is 0MB and all your queries go to network (thereby sacrificing performance).
Three important questions about the design of the cache are answered below.
How to populate the Cache?
If you are likely to make some query multiple times, store it in cache and before going to network, test if your cache already has the result. If it doesn't, query the database and then store the result in the cache.
If after querying for some data, you are likely to query the next and/or previous piece of data, then query all of it once and cache it so that when you query the next piece, you already have it in cache.
Effectively the idea is that if you know some information may be needed in future, cache it beforehand.
How to free the Cache?
You can decide the freeing mechanism for cache either Actively or Passively.
Passively: Whenever cache is full you can evict the data from it.
Actively: Run a background thread at regular interval and it takes care of removal for you.
One hybrid method is to run a freeing thread as soon as you reach, let's say, 80% of your memory limit and then free whatever memory you can.
What data to remove from the Cache?
This has been answered already in context of the issue of Page Replacement Policies for Operating Systems.
For completion, I'll summarize the important ones here:
Evict the Least Recently Used data (if it is not likely to be used);
Evict the data that was brought in earliest (if the earliest data is not likely to be used);
Evict the data that was brought in latest (if you think that the newly brought in data is least likely to be used).
Automatically remove the data that is older than t time units.

RE: "I can't index by anything because I'm not the database admin nor can ask for that."
Can you prepopulate a temp table and index on that?, e.g.
Select * into #MyTempTable from BigHugeTable
Create Index Prodidx on #MyTempTable (product)
You will have to ensure you always reuse the same connection (and it isn't closed) in order to use the temp table.

How to determine if query of SQL is too complex?

In my c# application, the users can build dynamic reports from the SQL database.
I need to warn the users if their DB-query is too complex and takes too long to run.
I'm working with microsoft-sql-server 2008.
How can I do that?
Are there any statistic-algorithms to calculate the runtime of a query execution?

This is practically impossible. The database calculates execution-plans based on table and indices statistics and even the database itself cannot predict the runtime.
There might be some indications such as ordering (and grouping, which implies ordering) or several joins, but any algorithmic prediction is nearly impossible in my opinion.

You could create a formular, based on Factors like the number and types of joins, order by and where criteria but also the line-count of all tables affected.
Depending how that formular is build, that would give u a rough indication when a query gets to "heavy".

As others are pointing out, the absolute answer is that you can't tell up front, how long a query will take to run.
The closest thing I could suggest, to try and give a crude idea of what is involved in the query before running it, is to first retrieve the estimated execution plan for it, and doing some crude analysis of that.
SET SHOWPLAN_XML ON
GO
SELECT TOP 100 *
FROM MyTable
GO
In the execution plan XML, there is info like the estimated number of rows, whether the optimizer timed out before choosing an execution plan, what operations are performed (index seek/scan etc) and a whole ton of other stuff (you'd really need to dive deeper into execution plans).
So in theory, you could try and look at the info in the plan to make a crude/best guess judgement. Note it's only the estimated execution plan, and I don't know what level of accuracy/judgement you could make.
It doesn't give you time, just (maybe) some way to weigh up the relative anticipated cost of a query

As the others already shown, you can't predict the complexity beforehand. All you can do is letting the query run and abort if it takes too long.
To abort the query after some time, you could take a look into the SqlCommand.CommandTimeout Property and set it to some suitable value. The drawback of this approach that the user might wait the full time just to get an error message that the query simply took to long.
Another approach would be to let the user decide when it takes to much time. This can be done by using no limit (Timeout = 0), but call the execute method asynchronously and simply provide a cancel button to the user he can hit.
This last approach would be the same as within the SQL Server Management Studio. If you start a query you'll see in the footer a running timer on how long the query is currently running and within the toolbar you have a cancel button to stop the currently running query.

Providing "Totals" for custom SQL queries on a regular basis

I would like some advice on how to best go about what I'm trying to achieve.
I'd like to provide a user with a screen that will display one or more "icon" (per say) and display a total next to it (bit like the iPhone does). Don't worry about the UI, the question is not about that, it is more about how to handle the back-end.
Let's say for argument sake, I want to provide the following:
Total number of unread records
Total number of waiting for approval
Total number of pre-approved
Total number of approved
etc...
I suppose, the easiest way to descrive the above would be "MS Outlook". Whenever emails arrive to your inbox, you can see the number of unread email being updated immediately. I know it's local, so it's a bit different, but now imagine having the same principle but for the queries above.
This could vary from user to user and while dynamic stored procedures are not ideal, I don't think I could write one sp for each scenario, but again, that's not the issue heree.
Now the recommendation part:
Should I be creating a timer that polls the database every minute (for example?) and run-all my relevant sql queries which will then provide me with the relevant information.
Is there a way to do this in real time without having a "polling" mechanism i.e. Whenever a query changes, it updates the total/count and then pushes out the count of the query to the relevant client(s)?
Should I have some sort of table storing these "totals" for each query and handle the updating of these immediately based on triggers in SQL and then when queried by a user, it would only read the "total" rather than trying to calculate them?
The problem with triggers is that these would have to be defined individually and I'm really tring to keep this as generic as possible... Again, I'm not 100% clear on how to handle this to be honest, so let me know what you think is best or how you would go about it.
Ideally when a specific query is created, I'd like to provide to choices. 1) General (where anyone can use this) and b) Specific where the "username" would be used as part of the query and the count returned would only be applied for that user but that's another issue.
The important part is really the notification part. While the polling is easy, I'm not sure I like it.
Imagine if I had 50 queries to be execute and I've got 500 users (unlikely, but still!) looking at the screen with these icons. 500 users would poll the database every minute and 50 queries would also be executed, this could potentially be 25000 queries per miuntes... Just doesn't sound right.
As mentioned, ideally, a) I'd love to have the data changes in real-time rather than having to wait a minute to be notified of a new "count" and b) I want to reduce the amount of queries to a minimum. Maybe I won't have a choice.
The idea behind this, is that they will have a small icon for each of these queries, and a little number will be displayed indicating how many records apply to the relevant query. When they click on this, it will bring them the relevant result data rather than the actual count and then can deal with it accordingly.
I don't know if I've explained this correctly, but if unclear, please ask, but hopefully I have and I'll be able to get some feedback on this.
Looking forward to your feeback.
Thanks.

I am not sure if this is the ideal solution but maybe a decent 1.
The following are the assumptions I have taken
Considering that your front end is a web application i.e. asp.net
The data which needs to be fetched on a regular basis is not hugh
The data which needs to be fetched does not change very frequently
If I were in this situation then I would have gone with the following approach
Implemented SQL Caching using SQLCacheDependency class. This class will fetch the data from the database and store in the cache of the application. The cache will get invalidated whenever the data in the table on which the dependency is created changes thus fetching the new data and again creating the cache. And you just need to get the data from the cache rest everything (polling the database, etc) is done by asp.net itself. Here is a link which describes the steps to implement SQL Caching and believe me it is not that difficult to implement.
Use AJAX to update the counts on the UI so that the User does not feel the pinch of PostBack.

What about "Improving Performance with SQL Server 2008 Indexed Views"?
"This is often particularly effective for aggregate views in decision
support or data warehouse environments"

C# + SQL Server - Fastest / Most Efficient way to read new rows into memory

I have an SQL Server 2008 Database and am using C# 4.0 with Linq to Entities classes setup for Database interaction.
There exists a table which is indexed on a DateTime column where the value is the insertion time for the row. Several new rows are added a second (~20) and I need to effectively pull them into memory so that I can display them in a GUI. For simplicity lets just say I need to show the newest 50 rows in a list displayed via WPF.
I am concerned with the load polling may place on the database and the time it will take to process new results forcing me to become a slow consumer (Getting stuck behind a backlog). I was hoping for some advice on an approach. The ones I'm considering are;
Poll the database in a tight loop (~1 result per query)
Poll the database every second (~20 results per query)
Create a database trigger for Inserts and tie it to an event in C# (SqlDependency)
I also have some options for access;
Linq-to-Entities Table Select
Raw SQL Query
Linq-to-Entities Stored Procedure
If you could shed some light on the pros and cons or suggest another way entirely I'd love to hear it.
The process which adds the rows to the table is not under my control, I wish only to read the rows never to modify or add. The most important things are to not overload the SQL Server, keep the GUI up to date and responsive and use as little memory as possible... you know, the basics ;)
Thanks!

I'm a little late to the party here, but if you have the feature on your edition of SQL Server 2008, there is a feature known as Change Data Capture that may help. Basically, you have to enable this feature both for the database and for the specific tables you need to capture. The built-in Change Data Capture process looks at the transaction log to determine what changes have been made to the table and records them in a pre-defined table structure. You can then query this table or pull results from the table into something friendlier (perhaps on another server altogether?). We are in the early stages of using this feature for a particular business requirement, and it seems to be working quite well thus far.
You would have to test whether this feature would meet your needs as far as speed, but it may help maintenance since no triggers are required and the data capture does not tie up your database tables themselves.

Rather than polling the database, maybe you can use the SQL Server Service broker and perform the read from there, even pushing which rows are new. Then you can select from the table.
The most important thing I would see here is having an index on the way you identify new rows (a timestamp?). That way your query would select the top entries from the index instead of querying the table every time.
Test, test, test! Benchmark your performance for any tactic you want to try. The biggest issues to resolve are how the data is stored and any locking and consistency issues you need to deal with.

If you table is updated constantly with 20 rows a second, then there is nothing better to do that pull every second or every few seconds. As long as you have an efficient way (meaning an index or clustered index) that can retrieve the last rows that were inserted, this method will consume the fewest resources.
IF the updates occur in burst of 20 updates per second but with significant periods of inactivity (minutes) in between, then you can use SqlDependency (which has absolutely nothing to do with triggers, by the way, read The Mysterious Notification for to udneratand how it actually works). You can mix LINQ with SqlDependency, see linq2cache.

Do you have to query to be notified of new data?
You may be better off using push notifications from a Service Bus (eg: NServiceBus).
Using notifications (i.e events) is almost always a better solution than using polling.

Load SQL query result data into cache in advance

I have the following situation:
.net 3.5 WinForm client app accessing SQL Server 2008
Some queries returning relatively big amount of data are used quite often by a form
Users are using local SQL Express and restarting their machines at least daily
Other users are working remotely over slow network connections
The problem is that after a restart, the first time users open this form the queries are extremely slow and take more or less 15s on a fast machine to execute. Afterwards the same queries take only 3s. Of course this comes from the fact that no data is cached and must be loaded from disk first.
My question:
Would it be possible to force the loading of the required data in advance into SQL Server cache?
Note
My first idea was to execute the queries in a background worker when the application starts, so that when the user starts the form the queries will already be cached and execute fast directly. I however don't want to load the result of the queries over to the client as some users are working remotely or have otherwise slow networks.
So I thought just executing the queries from a stored procedure and putting the results into temporary tables so that nothing would be returned.
Turned out that some of the result sets are using dynamic columns so I couldn't create the corresponding temp tables and thus this isn't a solution.
Do you happen to have any other idea?

Are you sure this is the execution plan being created, or is it server memory caching that's going on? Maybe the first query loads quite a bit of data, but subsequent queries can use the already-cached data, and so run much quicker. I've never seen an execution plan take more than a second to generate, so I'd suspect the plan itself isn't the cause.
Have you tried running the index tuning wizard on your query? If it is the plan that's causing problems, maybe some statistics or an additional index will help you out, and the optimizer is pretty good at recommending things.

I'm not sure how you are executing your queries, but you could do:
SqlCommand Command = /* your command */
Command.ExecuteReader(CommandBehavior.SchemaOnly).Dispose();
Executing your command with the schema-only command behavior will add SET FMTONLY ON to the query and cause SQL Server to get metadata about the result set (requiring generation of the plan), but will not actually execute the command.

To narrow down the source of the problem you can always use the SQL Server Objects in perfmon to get a general idea of how the local instance of SQL Server Express is performing.
In this case you would most likely see a lower Buffer Cache Hit Ratio on the first request and a higher number on subsequent requests.
Also you may want to check out http://msdn.microsoft.com/en-us/library/ms191129.aspx It describes how you can set a sproc to run automatically when the SQL Server service starts up.
If you retrieve the Data you need with that sproc then maybe the data will remain cached and improve the performance the first time the data is retrieved by the end user via your form.

In the end I still used the approach I tried first: Executing the queries from a stored procedure and putting the results into temporary tables so that nothing would be returned. This 'caching' stored procedure is executed in the background whenever the application starts.
It just took some time to write the temporary tables as the result sets are dynamic.
Thanks to all of you for your really fast help on the issue!

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.