We run a product website with several products that carry a guarantee. The guarantee applies only to a few products, those sold by particular dealerIds. The two tables are:
Product table with columns id, name, cityId, dealerId, price. This table holds all the products.
GuaranteeDealers table with a single column, dealerId. This holds all dealers whose products are guaranteed.
We want to get all products along with a flag saying whether each one is guaranteed or not. The two approaches look like:
APPROACH 1: Get isGuaranteed from a SQL function on the server (C#) side:
select id, name, cityId, dealerId, price, isGuaranteed = isGuaranteed( dealerId) from customers
isGuaranteed is a SQL function that checks whether dealerId is in the guaranteeDealers table. If yes it returns 1, else 0.
I have 50000 products and 500 such dealers and this query takes too long to execute.
OR
APPROACH 2: Get the list of dealers and set the isGuaranteed flag on the C# (server) side.
select id, name, cityId, dealerId, price. Map these rows to a C# list of products.
select dealerId from the guaranteeDealers table into a C# list of dealers.
Iterate over the product records in C# and set the isGuaranteed flag with a C# function that checks whether the product's dealerId is in the C# list of guarantee dealers (roughly as in the sketch below).
This takes far less time than approach 1.
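Roughly, the C# side of approach 2 looks like the sketch below. The class and method names here are just placeholders for illustration, not the actual code; the main point is using a HashSet so each dealer lookup is O(1).
using System.Collections.Generic;

// Hypothetical DTO, for illustration only.
public class Product
{
    public int Id;
    public string Name;
    public int CityId;
    public int DealerId;
    public decimal Price;
    public bool IsGuaranteed;
}

public static class GuaranteeFlagger
{
    // products  = rows from "select id, name, cityId, dealerId, price"
    // dealerIds = rows from "select dealerId from guaranteeDealers"
    public static void SetFlags(List<Product> products, IEnumerable<int> dealerIds)
    {
        var guaranteed = new HashSet<int>(dealerIds);   // O(1) membership test per product
        foreach (var product in products)
            product.IsGuaranteed = guaranteed.Contains(product.DealerId);
    }
}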
While both approaches look similar to me, can someone explain why it takes so long to execute a function in a SELECT statement in MySQL? Also, which is the correct approach, 1 or 2?
Q: "why it takes so long time to execute function in select statement in mysql?"
In terms of performance, executing a correlated subquery 50,000 times will eat our lunch, and if we're not careful, it will eat our lunchbox too.
That subquery will be executed for each and every row returned by the outer query. That's like executing 50,000 separate, individual SELECT statements. And that's going to take time.
Hiding a correlated subquery inside a MySQL stored program (function) doesn't help. That just adds overhead on each execution of the subquery, and makes things slower. If we strip out the function and bring that subquery inline, we are probably looking at something like this:
SELECT p.id
     , p.name
     , p.cityId
     , p.dealerId
     , p.price
     , IFNULL( ( SELECT 1
                   FROM guaranteeDealers d
                  WHERE d.dealerId = p.dealerId
                  LIMIT 1
                )
             , 0) AS isGuarantee
  FROM products p
 ORDER BY ...
For each and every row returned from products (that isn't filtered out by a predicate, e.g. a condition in the WHERE clause), this is essentially telling MySQL to execute a separate SELECT statement: run a query to see whether the dealerId is found in the guaranteeDealers table. And that happens for each row.
If the outer query is only returning a couple of rows, then that's only a couple of extra SELECT statements to execute, and we aren't really going to notice the extra time. But when we return tens (or hundreds) of thousands of rows, that starts to add up. And it gets expensive, in terms of the total amount of time all those query executions take.
And if we "hide" that subquery in a MySQL stored program (function), that adds more overhead, introducing a bunch of context switches. From query executing in the database context, calling a function that switches over to the stored program engine which executes the function, which then needs to run a database query, which switches back to the database context to execute the query and return a resultset, switching back to the stored program environment to process the resultset and return a value, and then switching back to the original database context, to get the returned value. If we have to do that a couple of times, no big whoop. Repeat that tens of thousands of times, and that overhead is going to add up.
(Note that native MySQL built-in functions don't have this same context switching overhead. The native functions are compiled code that execute within the database context. Which is a big reason we favor native functions over MySQL stored programs.)
If we want improved performance, we need to ditch the RBAR (row by agonizing row) processing, which gets excruciatingly slow for large sets. We need to approach the problem set-wise rather than row-wise.
We can tell MySQL what set to return and let it figure out the most efficient way to return it, rather than round-tripping back and forth to the database, executing individual SQL statements to fetch little bits of the set piecemeal with instructions that dictate how MySQL should prepare the set.
In answer to the question
Q: "which approach is correct"
both approaches are "correct" in as much as they return the set we're after.
The second approach is "better" in that it significantly reduces the number of SELECT statements that need to be executed (2 statements rather than 50,001 statements).
In terms of the best approach, we are usually better off letting MySQL do the "matching" of rows, rather than doing the matching in client code. (Why unnecessarily clutter up our code with an operation that can usually be accomplished much more efficiently in the database?) Yes, sometimes we need to do the matching in our code. And occasionally it turns out to be faster.
But sometimes, we can write just one SELECT statement that specifies the set we want returned, and let MySQL have a go at it. And if it's slow, we can do some tuning, looking at the execution plan, making sure suitable indexes are available, and tweaking the query.
Given the information in the question about the set to be returned, and assuming that dealerId is unique in the guaranteeDealers table: if our "test" is whether a matching row exists in the guaranteeDealers table, we can use an OUTER JOIN operation and an expression in the SELECT list that returns 0 or 1, depending on whether a matching row was found.
SELECT p.id
     , p.name
     , p.cityId
     , p.dealerId
     , p.price
     , IF(d.dealerId IS NULL, 0, 1) AS isGuarantee
  FROM products p
  LEFT
  JOIN guaranteeDealers d
    ON d.dealerId = p.dealerId
 ORDER BY ...
For optimal performance, we are going to want to have suitable indexes defined. At a minimum (if there isn't already such an index defined):
ON guaranteeDealers (dealerId)
If there are also other tables involved in producing the result we are after, then we want to involve those tables in the query we execute too. That gives the MySQL optimizer a chance to come up with the most efficient plan to return the entire set, and doesn't constrain MySQL to performing individual operations that return bits of it piecemeal.
select id, name, cityId, customers.dealerId, price,
       guaranteeDealers.dealerId is not null as isGuaranteed
from customers
left join guaranteeDealers
       on guaranteeDealers.dealerId = customers.dealerId
No need to call a function.
Note I have used customers because that is the table you used in your question - although I suspect you might have meant products.
Approach 1 is the better one because it reduces the size of the resultset being transferred from the database server to the application server. Its performance problem is caused by the isGuaranteed function, which is being executed once per row (of the customers table, which looks like it might be a typo). An approach like this would be much more performant:
select p.id, p.name, p.cityId, p.dealerId, p.price, gd.dealerId is not null as isGuaranteed
from Product p
left join GuaranteeDealers gd on p.dealerId = gd.dealerId
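With a query like that, the C# side no longer needs any matching logic at all. A minimal sketch using MySQL Connector/NET (the connection string and the mapping details are assumptions here, not taken from the question):
using (var connection = new MySqlConnection(connectionString))
{
    connection.Open();
    var sql = @"select p.id, p.name, p.cityId, p.dealerId, p.price,
                       gd.dealerId is not null as isGuaranteed
                  from Product p
                  left join GuaranteeDealers gd on p.dealerId = gd.dealerId";
    using (var cmd = new MySqlCommand(sql, connection))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // The flag arrives already computed by the database; no C# lookup needed.
            var id = Convert.ToInt32(reader["id"]);
            var isGuaranteed = Convert.ToBoolean(reader["isGuaranteed"]);
            // ... map the remaining columns onto your product object ...
        }
    }
}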
My query is fairly complex, but I have simplified it to figure out this problem and now it is a simple JOIN that I'm running on a SQL Server 2014 database. The query is:
SELECT * FROM SportsCars as sc INNER JOIN Cars AS c ON c.CarID = sc.CarID WHERE c.Type = 1
When I run this query from SSMS and watch it in SQL Profiler, it takes around 350ms to execute. When I run the same query inside my application using Entity Framework or ADO.NET (I've tried both), it takes 4500ms to execute.
ADO.NET Code:
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    var cmdA = new SqlCommand("SET ARITHABORT ON", connection);
    cmdA.ExecuteNonQuery();
    var query = "SELECT * FROM SportsCars as sc INNER JOIN Cars AS c ON c.CarID = sc.CarID WHERE c.Type = 1";
    var cmd = new SqlCommand(query, connection);
    cmd.ExecuteNonQuery();
}
I've done an extensive Google search and found this awesome article and several StackOverflow questions (here and here). In order to make the session parameters identical for both queries I call SET ARITHABORT ON in ADO.NET and it makes no difference. This is a straight SQL query, so there is not a parameter sniffing problem. I've simplified the query and the indexes down to their most basic form for this test. There is nothing else running on the server and there is nothing else accessing the database during the test. There are no computed columns in the Cars or SportsCars table, just INTs and VARCHARs.
The SportsCars table has about 170k records and 4 columns, and the Cars table has about 1.2M records and 7 columns. The resulting data set (SportsCars of Type=1) has about 2600 records and 11 columns. I have a single non-clustered index on the Cars table, on the [Type] column that includes all the columns of the cars table. And both tables have a clustered index on the CarID column. No other indexes exist on either table. I'm running as the same database user in both cases.
When I view the data in SQL Profiler, I see that both queries are using the exact same, very simple query plan. In SQL Profiler, I'm using the Performance Event Class and the ShowPlan XML Statistics Profile, which I believe to be the proper event to monitor and capture the actual execution plan. The # of reads is the same for both queries (2596).
How can two identical queries with the exact same query plan take 10x longer in ADO.NET vs. SSMS?
Figured it out:
Because I'm using Entity Framework, the connection string in my application has MultipleActiveResultSets=True. When I remove this from the connection string, the queries have the same performance in ADO.NET and SSMS.
Apparently there is an issue with this setting causing queries to respond slowly when connected to SQL Server via WAN. I found this link and this comment:
MARS uses "firehose mode" to retrieve data. Firehose mode means that the server will produce data as fast as possible. This also means that your client application must receive inbound data at the same speed as it comes in. If it doesn't, the data storage buffers on the server will fill up and the processing will stop until those buffers empty.
So what? You may ask... But as long as the processing is stopped, the resources on the SQL server are in use and are tied up. This includes the worker thread, schema and data locks, memory, etc. So it is crucial that your client application consumes the inbound results as quickly as they arrive.
I have to use this setting with Entity Framework otherwise lazy loading will generate exceptions. So I'm going to have to figure out some other workaround. But at least I understand the issue now.
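One possible workaround, sketched under the assumption that only the hand-written ADO.NET queries are affected: keep MARS on for the Entity Framework context (so lazy loading keeps working) and run the heavy ad-hoc queries on a second connection string without MARS. The connection strings below are made up for illustration.
// Entity Framework context keeps MARS so lazy loading still works:
var efConnectionString =
    "Server=myServer;Database=myDb;Integrated Security=true;MultipleActiveResultSets=True";

// Heavy ad-hoc queries go over a plain connection without MARS:
var reportingConnectionString =
    "Server=myServer;Database=myDb;Integrated Security=true;MultipleActiveResultSets=False";

using (var connection = new SqlConnection(reportingConnectionString))
{
    connection.Open();
    var query = "SELECT * FROM SportsCars as sc INNER JOIN Cars AS c ON c.CarID = sc.CarID WHERE c.Type = 1";
    using (var cmd = new SqlCommand(query, connection))
    using (var reader = cmd.ExecuteReader())
    {
        // Consume the rows as quickly as they arrive so server buffers don't back up.
        while (reader.Read()) { }
    }
}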
How can two identical queries with the exact same query plan take 10x longer in ADO.NET vs. SSMS?
First we need to be clear about what is considered "same" with regards to queries and query plans. Assuming that the query at the very top of the question is a copy-and-paste, then it is not the same query as the one being submitted via ADO.NET. For two queries to be the same, they need to be byte-by-byte the same, which includes all white-space, capitalization, punctuation, comments, etc.
The two queries shown are definitely very similar. And they might even share the same execution plan. But how was "same"ness determined for those? Was the XML the same in both cases? Or just what was shown graphically in SSMS when viewing the plans? If they were determined to be the same based on their graphical representation then that is sometimes misleading. The XML itself needs to be checked. Even if two query plans have the same query hash, there are still (sometimes) parts of a query plan that are variable and changes do not change the plan hash. One example is the evaluation of expressions. Sometimes they are calculated and their result is embedded into the plan as a constant. Sometimes they are calculated at the start of each execution and stored and reused within that particular execution, but not for any subsequent executions.
One difference between SSMS and ADO.NET is the default session properties for each. I thought I had seen a chart years ago showing the defaults for ADO / OLEDB / SQLNCLI but can't find it now. Either way, it doesn't need to be guesswork, as the values can be discovered using the SESSIONPROPERTY function. Just run this query in the C# code instead of your current SELECT, and inspect the results in the debugger, print them out, or whatever. Run something like this:
SELECT SESSIONPROPERTY('ANSI_NULLS') AS [AnsiNulls],
SESSIONPROPERTY('ANSI_PADDING') AS [AnsiPadding],
SESSIONPROPERTY('CONCAT_NULL_YIELDS_NULL') AS [ConcatNullYieldsNull],
...;
Make sure to get all of the settings noted in the linked MSDN page. Now, in SSMS, go to the "Query" menu, select "Query Options...", and go to "Execution" | "ANSI". The settings coming back from the C# code need to match the ones showing in SSMS. Anything set differently requires adding something like this to the beginning of your ADO.NET query string:
SET ANSI_NULLS ON;
{rest of query}
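A minimal sketch of dumping those settings from the C# side (illustrative only; extend the column list to cover every setting on the linked MSDN page):
var settingsQuery =
    "SELECT SESSIONPROPERTY('ANSI_NULLS')              AS AnsiNulls, " +
    "       SESSIONPROPERTY('ANSI_PADDING')            AS AnsiPadding, " +
    "       SESSIONPROPERTY('CONCAT_NULL_YIELDS_NULL') AS ConcatNullYieldsNull, " +
    "       SESSIONPROPERTY('ARITHABORT')              AS ArithAbort;";

using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(settingsQuery, connection))
{
    connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        if (reader.Read())
        {
            for (int i = 0; i < reader.FieldCount; i++)
            {
                // 1 = ON, 0 = OFF for these session properties.
                Console.WriteLine("{0} = {1}", reader.GetName(i), reader.GetValue(i));
            }
        }
    }
}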
Now, if you want to eliminate the DataTable loading as a possible suspect, just replace:
var cars = new DataTable();
cars.Load(reader);
with:
while(reader.Read());
And lastly, why not just put the query into a Stored Procedure? The session settings (i.e. ANSI_NULLS, etc) that typically matter the most are stored with the proc definition so they should work the same whether you EXEC from SSMS or from ADO.NET (again, we aren't dealing with any parameters here).
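If the query does get moved into a stored procedure, calling it from ADO.NET is straightforward. A sketch (the procedure and parameter names here are made up):
using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetSportsCarsByType", connection))   // hypothetical proc name
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@Type", SqlDbType.Int).Value = 1;
    connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // consume the rows (or load them into a DataTable, etc.)
        }
    }
}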
I'm trying to execute a stored procedure on SQL Server 2008 from a C# application. It takes a lot of time and my application shows "not responding". My stored procedure:
ALTER PROCEDURE [dbo].[PROC066]
    @date_1 date,
    @date_2 date
AS
SELECT
    RTRIM(tt1.[Org]) as 'Org',
    RTRIM(tt1.[CustomerHQ]) as 'Customer',
    RTRIM(tt1.[Customer]) as 'ShipTo',
    RTRIM(tt1.[Date]) as 'Date',
    RTRIM(tt2.Clean) as 'Clean',
    RTRIM(tt1.[PO number]) as 'PONumber',
    RTRIM(tt1.[GCAS]) as 'GCAS',
    'Description' = (SELECT DISTINCT [Description] FROM tDeploymentDB WHERE tDeploymentDB.[Item Code] = tt1.[GCAS]),
    RTRIM(tt1.NQTY) as 'NQTY',
    RTRIM(tt1.[Dummy]) as 'Dummy',
    RTRIM(tt1.[MOQ]) as 'MOQ',
    RTRIM(tt1.[Inactive]) as 'Inactive',
    RTRIM(tt1.[Allocation]) as 'Allocation',
    RTRIM(tt1.[MultiSector]) as 'MultiSector',
    RTRIM(tt2.Pallet_Checked) as 'FullPallets',
    RTRIM(tt2.FullTruck_Checked) as 'Full_Truck'
FROM tCleanOrderTracking_prod as tt1, tCleanOrderTracking_HDR_prod as tt2
WHERE
    SUBSTRING(RTRIM(tt1.[PO number]), 6, LEN(RTRIM(tt1.[PO number])) - 5) =
        RTRIM(tt2.CustomerOrderNumber) AND
    tt1.[Date] = @date_1 AND tt1.[Date] <= @date_2
ORDER BY Org, [PO Number]
If I execute this procedure from SQL Server Management Studio, it takes 5-7 seconds. But through C# I can't get this query to execute. I tried deleting this row from the query:
tt1.[Date] = @date_1 AND tt1.[Date] <= @date_2
After that it works fine; my application takes 5 seconds to execute. I also tried rewriting this row as
tt1.[Date] BETWEEN @date_1 AND @date_2
No result! My main table in the query, tCleanOrderTracking_prod, has about 40,500 records. What else can I try? What am I doing wrong?
Have a look at How to read SQL Server execution plans; it will show you how to analyse your own SQL query and make improvements to its speed and execution performance.
You want to run this against your query and look at HOW it is being executed so that you can add indexes, change joins, etc.
The TRIM function is also computationally expensive; it's not just a select from a row but a processing command. Running it on 40,000 rows can take quite a hit. Consider only applying it to the columns you need it on, and do it at INSERT time rather than SELECT time (then you do it once, not a million times ;) ).
You should consider removing the TRIM functions from the select query. Applying them requires a lot of effort on the part of SQL Server. You should be trimming the values before you insert them instead (see the sketch below).
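For example, the trimming can happen once on the application side at insert time rather than on every SELECT. A sketch, with only two of the columns shown and the column lengths assumed:
var insertSql =
    "INSERT INTO tCleanOrderTracking_prod ([Org], [CustomerHQ]) VALUES (@org, @customerHq)";

using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(insertSql, connection))
{
    // Trim here, once per row written, instead of RTRIM on every row of every SELECT.
    cmd.Parameters.Add("@org", SqlDbType.VarChar, 50).Value = org.Trim();
    cmd.Parameters.Add("@customerHq", SqlDbType.VarChar, 50).Value = customerHq.Trim();
    connection.Open();
    cmd.ExecuteNonQuery();
}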
Remove the TRIM functions from the WHERE condition as well: while executing, the function is first applied to all rows and only then is the WHERE condition evaluated, so it hurts performance.
I'm trying to load data from Oracle to SQL Server (sorry for not writing this before).
I have a table (actually a view which pulls data from different tables) with at least 1 million records. I designed my package in such a way that I have functions for the business logic and call them directly in the SELECT query.
Ex:
X1(id varchar2)
x2(id varchar2, d1 date)
x3(id varchar2, d2 date)
Select id, x, y, z, decode (.....), x1(id), x2(id), x3(id)
FROM Table1
Note: My table has 20 columns and I call 5 different functions on at least 6-7 columns.
Some functions also compare the parameters passed against an audit table and perform logic.
How can I improve the performance of my query, or is there a better way to do this?
I tried doing it in C# code, but the initial select returns too many records for a DataSet and I get an OutOfMemoryException.
My function does selects and then performs logic, for example:
Function(c_x2, eid)
    Select col1
      into p_x1
      from tableP
     where eid = eid;   -- careful: if the column and the parameter are both named eid,
                        -- the column wins inside SQL, so qualify or rename the parameter

    IF (p_x1 IS NULL) THEN   -- "= NULL" never evaluates to true; use IS NULL
        ret_var := 'INITIAL';
    ELSIF (p_x1 = 'L') AND (c_x2 = 'A') THEN
        ret_var := 'RL';
        INSERT INTO Audit
            (old_val, new_val, audit_event, id, pname)
        VALUES
            (p_x1, c_x2, 'RL', eid, 'PackageProcName');
    ELSIF (p_x1 = 'A') AND (c_x2 = 'L') THEN
        ret_var := 'GL';
        INSERT INTO Audit
            (old_val, new_val, audit_event, id, pname)
        VALUES
            (p_x1, c_x2, 'GL', eid, 'PackgProcName');
    END IF;

    RETURN ret_var;
I'm getting each row and performing the logic in C# and then inserting.
If possible INSERT from the SELECT:
INSERT INTO YourNewTable
(col1, col2, col3)
SELECT
col1, col2, col3
FROM YourOldTable
WHERE ....
This will run significantly faster than a single query where you then loop over the result set and issue an INSERT for each row.
EDIT regarding the OP's question edit:
You should be able to replace the function call with plain SQL in your query. Mimic the 'INITIAL' case using a LEFT JOIN to tableP, and the 'RL' or 'GL' values can be calculated using CASE.
EDIT based on the OP's recent comments:
Since you are loading data from Oracle into SQL Server, this is what I would do: most people who could help have moved on and will not read this question again, so open a new question where you say 1) you need to load data from Oracle (version) to SQL Server (version), 2) currently you are loading it with one query, processing each row in C#, and inserting it into SQL Server, and it is slow, plus all the other details. There are much better ways of bulk loading data into SQL Server. As for this question, you could accept an answer, answer it yourself explaining that you need to ask a new question, or just leave it unaccepted.
My recommendation is that you do not use functions and then call them within other SELECT statements. This:
SELECT t.id, ...
x1(t.id) ...
FROM TABLE t
...is equivalent to:
SELECT t.id, ...
(SELECT x.column FROM x1 x WHERE x.id = t.id)
FROM TABLE t
Encapsulation doesn't work in SQL the way it does in C# and the like. While the approach makes maintenance easier, performance suffers because the sub-selects will execute for every row returned.
A better approach would be to update the supporting function to include the join criteria (i.e. "where x.id = t.id", for lack of a real one) in the SELECT:
SELECT x.id,
       x.column
  FROM x1 x
...so you can use it as a JOIN:
SELECT t.id, ...
       x1.column
  FROM TABLE t
  JOIN (SELECT x.id,
               x.column
          FROM MY_PACKAGE.x) x1 ON x1.id = t.id
I prefer that to having to incorporate the function logic into the queries for sake of maintenance, but sometimes it can't be helped.
Personally I'd create an SSIS import to do this task. Using a bulk insert you can improve speed dramatically, and SSIS can handle the functions part after the bulk insert.
Firstly you need to find where the performance problem actually is. Then you can look at trying to solve it.
What is the performance of the view like? How long does it take the view to execute without any of the function calls? Try running the command:
create table the_view_table
as
select *
from the_view;
How well does it perform? Does it take 1 minute or 1 hour?
How well do the functions perform? According to the description you are making approximately 5 million function calls. They had better be pretty efficient! Also, are the functions defined as deterministic? If the functions are defined using the DETERMINISTIC keyword, Oracle has a chance of optimizing away some of the calls.
Is there a way of reducing the number of function calls? The functions are being called once the view has been evaluated and the million rows of data are available. But are all the input values from the highest level of the query? Can the function calls be embedded into the view at a lower level? Consider the following two queries. Which would be quicker?
select
    f.dim_id,
    d.dim_col_1,
    long_slow_function(d.dim_col_2) as dim_col_2
from large_fact_table f
join small_dim_table d on (f.dim_id = d.dim_id)

select
    f.dim_id,
    d.dim_col_1,
    d.dim_col_2
from large_fact_table f
join (
    select
        dim_id,
        dim_col_1,
        long_slow_function(dim_col_2) as dim_col_2
    from small_dim_table) d on (f.dim_id = d.dim_id)
Ideally the second query should run quicker, as it calls the function fewer times.
The performance issue could be in any of these places and until you investigate the issue, it would be difficult to know where to direct your tuning efforts.
A couple of tips:
Don't load all records into RAM; process them one by one (see the sketch after this list).
Try to run as many functions on the client as possible. Databases are really slow to execute user defined functions.
If you need to join two tables, it's sometimes possible to create two connections on the client. Fetch the main data with connection 1 and the audit data with connection 2. Order the data for both connections in the same way so you can read single records from both connections and perform whatever you need on them.
If your functions always return the same result for the same input, use a computed column or a materialized view. The database will run the function once and save it in a table somewhere. That will make INSERT slow but SELECT quick.
Create a sorted index on your table.
Introduction to SQL Server Indexes; other RDBMSs are similar.
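As a sketch of the first tip (combined with the bulk-loading idea mentioned in other answers): stream the Oracle rows through a data reader and hand the reader straight to SqlBulkCopy, so nothing is ever materialized in a DataSet. The connection strings, column list, and staging table name below are placeholders.
using (var oracleConn = new OracleConnection(oracleConnectionString))   // Oracle.ManagedDataAccess.Client
using (var sqlConn = new SqlConnection(sqlServerConnectionString))      // System.Data.SqlClient
{
    oracleConn.Open();
    sqlConn.Open();

    var cmd = new OracleCommand("SELECT id, x, y, z FROM Table1", oracleConn);

    using (var reader = cmd.ExecuteReader())
    using (var bulkCopy = new SqlBulkCopy(sqlConn))
    {
        bulkCopy.DestinationTableName = "dbo.Table1_Staging";   // hypothetical staging table
        bulkCopy.BatchSize = 10000;
        bulkCopy.WriteToServer(reader);   // pulls rows from the reader as it goes; no DataSet in memory
    }
}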
Edit since you edited your question:
Using a view is even more sub-optimal, especially when querying single rows from it. I think your "business functions" are actually something like stored procedures?
As others suggested, in SQL always go set-based. I assumed you already did that, hence my tip to start using indexing.
I have some complex stored procedures that may return many thousands of rows, and take a long time to complete.
Is there any way to find out how many rows are going to be returned before the query executes and fetches the data?
This is with Visual Studio 2005, a Winforms application and SQL Server 2005.
You mentioned your stored procedures take a long time to complete. Is the majority of the time taken up during the process of selecting the rows from the database or returning the rows to the caller?
If it is the latter, maybe you can create a mirror version of your SP that just gets the count instead of the actual rows. If it is the former, well, there isn't really that much you can do since it is the act of finding the eligible rows which is slow.
A solution to your problem might be to re-write the stored procedure so that it limits the result set to some number, like:
SELECT TOP 1000 * FROM tblWHATEVER
in SQL Server, or
SELECT * FROM tblWHATEVER WHERE ROWNUM <= 1000
in Oracle. Or implement a paging solution so that the result set of each call is acceptably small (see the sketch below).
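A paging sketch for SQL Server 2005, which has ROW_NUMBER but not OFFSET/FETCH; the table, the Id ordering column, and the page size are placeholders:
var pageSql = @"
    SELECT *
    FROM ( SELECT ROW_NUMBER() OVER (ORDER BY Id) AS RowNum, *
             FROM tblWHATEVER ) AS numbered
    WHERE RowNum BETWEEN @first AND @last
    ORDER BY RowNum;";

using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(pageSql, connection))
{
    int pageSize = 1000, pageIndex = 0;   // first page
    cmd.Parameters.Add("@first", SqlDbType.Int).Value = pageIndex * pageSize + 1;
    cmd.Parameters.Add("@last", SqlDbType.Int).Value = (pageIndex + 1) * pageSize;
    connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read()) { /* bind this page to the UI */ }
    }
}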
make a stored proc to count the rows first.
SELECT COUNT(*) FROM table
Unless there's some aspect of the business logic of your app that allows calculating this, no. The database is going to have to do all the WHERE and JOIN logic to figure out how many rows, and that's the vast majority of the time spent in the SP.
You can't get the rowcount of a procedure without executing the procedure.
You could make a different procedure that accepts the same parameters, the purpose of which is to tell you how many rows the other procedure should return. However, the steps required by this procedure would normally be so similar to those of the main procedure that it should take just about as long as just executing the main procedure.
You would have to write a different version of the stored procedure to get a row count. This one would probably be much faster because you could eliminate joins to tables you aren't filtering against, remove ordering, etc. For example, if your stored proc executed SQL such as:
select firstname, lastname, email, orderdate
from customer
inner join productorder on customer.customerid = productorder.customerid
where orderdate > @orderdate
order by lastname, firstname;
your counting version would be something like:
select count(*) from productorder where orderdate > @orderdate;
Not in general.
Through knowledge about the operation of the stored procedure, you may be able to get either an estimate or an accurate count (for instance, if the "core" or "base" table of the query is able to be quickly calculated, but it is complex joins and/or summaries which drive the time upwards).
But you would have to call the counting SP first and then the data SP, or you could look at using a multiple-result-set SP (see the sketch below).
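A sketch of consuming such a multiple-result-set procedure from C#, where the first result set carries the count and the second carries the data (the procedure and parameter names here are made up):
using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetOrdersWithCount", connection))   // hypothetical proc
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@orderdate", SqlDbType.DateTime).Value = orderDate;
    connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        reader.Read();
        int totalRows = reader.GetInt32(0);   // first result set: SELECT COUNT(*)

        reader.NextResult();                  // second result set: the actual rows
        while (reader.Read())
        {
            // process each data row
        }
    }
}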
It could take as long to get a row count as to get the actual data, so I wouldn't advocate performing a count in most cases.
Some possibilities:
1) Does SQL Server expose its query optimiser findings in some way? i.e. can you parse the query and then obtain an estimate of the rowcount? (I don't know SQL Server).
2) Perhaps based on the criteria the user gives you can perform some estimations of your own. For example, if the user enters 'S%' in the customer surname field to query orders you could determine that that matches 7% (say) of the customer records, and extrapolate that the query may return about 7% of the order records.
Going on what Tony Andrews said in his answer, you can get an estimated query plan of the call to your query with:
SET SHOWPLAN_TEXT OFF
GO
SET SHOWPLAN_ALL ON
GO
-- Replace with the call to your stored procedure
select * from MyTable
GO
SET SHOWPLAN_ALL OFF
GO
This should return a table, or many tables which will let you get the estimated row count of your query.
You need to analyze the returned data set to determine what is a logical (meaningful) primary key for the result set that is being returned. In general this WILL be much faster than the complete procedure, because the server is not constructing a result set from data in all the columns of each row of each table; it is simply counting the rows... It may not even need to read the actual table rows off disk to do this; it may simply need to count index nodes...
Then write another SQL statement that only includes the tables necessary to generate those key columns (Hopefully this is a subset of the tables in the main sql query), and the same where clause with the same filtering predicate values...
Then add another optional parameter to the stored proc called, say, @CountsOnly, with a default of false (0), like so...
Alter Procedure <storedProcName>
    @param1 Type,
    -- Other current params
    @CountsOnly TinyInt = 0
As
Set NoCount On
If @CountsOnly = 1
    Select Count(*)
    From TableA A
    Join TableB B On etc. etc...
    Where < here put all Filtering predicates >
Else
    <Here put old SQL That returns complete resultset with all data>
Return 0
You can then just call the same stored proc with @CountsOnly set equal to 1 to get just the count of records (see the sketch below). Old code that calls the proc would still function as it used to, since the parameter defaults to false (0) if it is not included.
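From C#, the same command can serve both calls; only the flag changes. A sketch, with the parameter values and types assumed:
using (var connection = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("<storedProcName>", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@param1", SqlDbType.Int).Value = someValue;
    cmd.Parameters.Add("@CountsOnly", SqlDbType.TinyInt).Value = 1;   // 1 = just return the count
    connection.Open();
    int rowCount = (int)cmd.ExecuteScalar();   // the single Count(*) value

    // Re-run later with @CountsOnly = 0 (or omitted) to get the full result set.
}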
It's at least technically possible to run a procedure that puts the result set in a temporary table. Then you can find the number of rows before you move the data from server to application and would save having to create the result set twice.
But I doubt it's worth the trouble unless creating the result set takes a very long time, and in that case it may be big enough that the temp table would be a problem. Almost certainly the time to move the big table over the network will be many times what is needed to create it.