I have a part of a program which has to display data from multiple tables, based on different queries. So far it looks like this (keep in mind that every subsequent SELECT is based on something we got from table A):
SELECT * FROM TABLE A WHERE ID = ...
SELECT [8 fields] FROM TABLE B WHERE ...
SELECT [5 fields] FROM TABLE C WHERE ...
SELECT [1 field] FROM TABLE D WHERE ...
SELECT [1 field] FROM TABLE E WHERE ...
SELECT [1 field] FROM TABLE F WHERE ...
SELECT [1 field] FROM TABLE G WHERE ...
SELECT [1 field] FROM TABLE H WHERE ...
SELECT [2 fields] FROM TABLE I WHERE ...
After that, I take the results and use them to create different objects or populate different fields.
Thing is, between clicking the button and getting the window to show, I have a delay of about 2 seconds.
Keep in mind, this is a very big database, with millions of records. Changing the DB is out of the question, unfortunately.
I am searching only by Primary Keys, I have no way to restrict the search even more than that.
The connection is opened from the start, I don't close/reopen it after each statement.
Joining just Table A and Table B takes a lot longer than two separate SELECTs, up to 1.5 seconds, while running the SELECTs sequentially goes down to about 300 ms.
I still find it to be quite a long time, given that the first query executes in around 53 ms in the DBMS.
I am using the ODBC driver in C#, .NET Framework 4. The database itself is DB2; however, using the DB2 native driver has given us a plethora of problems, and IBM has been less than helpful about it.
Also, whenever I select only a few fields, I create the needed object using only those and leave the rest at their defaults.
Is there any way I could improve this?
Thank you in advance,
Andrei
Edit: The diagnostic tool says something along the lines of:
--Two queries in another part of the program; we can ignore these, as they are not usually there-- 0.31 s
First query - 0.75 s
Second query - 0.87 s
Third query - 0.95 s
Fourth query - 0.99 s
Fifth query - 1.00 s
Sixth query - 1.04 s
Seventh query - 1.08 s
Eighth query - 1.10 s
Ninth query - 1.12 s
Program output - 1.81 s
There is overhead to constructing query strings and executing them. When running multiple similar queries, you want to be sure that they are compiled once and then the execution plan is re-used.
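In the C#/ODBC case from the question, that usually means creating each parameterized command once, calling Prepare(), and then reusing it for every button click. A minimal sketch, assuming an already-open OdbcConnection and placeholder table/column names (not the poster's real schema):

using System.Data.Odbc;

// Sketch only: TABLE_A and ID are placeholders, "connection" is the already-open OdbcConnection.
class LookupA
{
    private readonly OdbcCommand _cmd;

    public LookupA(OdbcConnection connection)
    {
        _cmd = new OdbcCommand("SELECT * FROM TABLE_A WHERE ID = ?", connection);
        _cmd.Parameters.Add("id", OdbcType.Int);
        _cmd.Prepare();                       // compiled once, reused on later executions
    }

    public void Load(int id)
    {
        _cmd.Parameters[0].Value = id;
        using (OdbcDataReader reader = _cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // map the columns onto your objects here
            }
        }
    }
}

The same pattern applies to the other eight statements; whether the DB2 ODBC driver actually keeps the prepared plan around is driver-dependent, so it is worth measuring before and after.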
However, it is likely that the best approach is to create a single query that returns multiple columns. Naively, this would look like:
select . . .
from a
join b on . . .
join c on . . .
. . .
However, the joins might be left joins. The query might be more complex if you are joining along different dimensions that might produce Cartesian products.
The key point is that the database will optimize the combined query internally. This is (generally) more efficient than constructing multiple different queries.
We are a product website with several products that have a guarantee. The guarantee is only applicable for a few products with particular dealer IDs. The two tables are:
Product table with columns id, name, cityId, dealerId, price. This table has all the products.
GuaranteeDealers table with column dealerId. This has all dealers with guaranteed products.
We want to get all products with info if it is guaranteed or not. The query looks like:
APPROACH 1: Get isGuaranteed from a SQL function and return it to the server (C#) side:
select id, name, cityId, dealerId, price, isGuaranteed = isGuaranteed( dealerId) from customers
isGuaranteed is a SQL function that checks whether dealerId is in the guaranteeDealers table. If yes, it returns 1; otherwise 0.
I have 50000 products and 500 such dealers and this query takes too long to execute.
OR
APPROACH 2: Get the list of dealers and set the isGuaranteed flag on the C# (server) side.
select id, name, cityId, dealerId, price. Map these to a C# list of products.
select dealerId from the guaranteeDealers table into a C# list of dealers.
Iterate the product records in C# and set the isGuaranteed flag with a C# function that checks whether the product's dealerId is in the C# list of guaranteeDealers.
This takes much less time compared to approach 1.
While both approaches look similar to me, can someone explain why it takes so long to execute a function in a SELECT statement in MySQL? Also, which is the correct approach, 1 or 2?
Q: "why it takes so long time to execute function in select statement in mysql?"
In terms of performance, executing a correlated subquery 50,000 times will eat our lunch, and if we're not careful, it will eat our lunchbox too.
That subquery will be executed for each and every row returned by the outer query. That's like executing 50,000 separate, individual SELECT statements. And that's going to take time.
Hiding a correlated subquery inside a MySQL stored program (function) doesn't help. That just adds overhead on each execution of the subquery, and makes things slower. If we strip out the function and bring that subquery inline, we are probably looking at something like this:
SELECT p.id
, p.name
, p.cityId
, p.dealerId
, p.price
, IFNULL( ( SELECT 1
FROM guaranteeDealers d
WHERE d.dealerId = p.dealerID
LIMIT 1
)
,0) AS isGuarantee
FROM products p
ORDER BY ...
For each and every row returned from products (that isn't filtered out by a predicate, i.e. a condition in the WHERE clause), this is essentially telling MySQL to execute a separate SELECT statement: run a query to see whether the dealerID is found in the guaranteeDealers table. And that happens for each row.
If the outer query is only returning a couple of rows, then that's only a couple of extra SELECT statements to execute, and we aren't really going to notice the extra time. But when we return tens (or hundreds) of thousands of rows, that starts to add up. And it gets expensive, in terms of the total amount of time all those query executions take.
And if we "hide" that subquery in a MySQL stored program (function), that adds more overhead, introducing a bunch of context switches. From query executing in the database context, calling a function that switches over to the stored program engine which executes the function, which then needs to run a database query, which switches back to the database context to execute the query and return a resultset, switching back to the stored program environment to process the resultset and return a value, and then switching back to the original database context, to get the returned value. If we have to do that a couple of times, no big whoop. Repeat that tens of thousands of times, and that overhead is going to add up.
(Note that native MySQL built-in functions don't have this same context switching overhead. The native functions are compiled code that execute within the database context. Which is a big reason we favor native functions over MySQL stored programs.)
If we want improved performance, we need to ditch the RBAR (row by agonizing row) processing, which gets excruciatingly slow for large sets. We need to approach the problem set-wise rather than row-wise.
We can tell MySQL what set to return and let it figure out the most efficient way to return it, rather than round-tripping back and forth to the database, executing individual SQL statements to get little bits of the set piecemeal with instructions that dictate how MySQL should prepare it.
In answer to the question
Q: "which approach is correct"
Both approaches are "correct" in as much as they return the set we're after.
The second approach is "better" in that it significantly reduces the number of SELECT statements that need to be executed (2 statements rather than 50,001 statements).
In terms of the best approach, we are usually better off letting MySQL do the "matching" of rows, rather than doing the matching in client code. (Why unnecessarily clutter up our code with an operation that can usually be accomplished much more efficiently in the database?) Yes, sometimes we need to do the matching in our code. And occasionally it turns out to be faster.
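For completeness, the client-side matching from approach 2 is typically a hash lookup rather than a nested scan. A rough C# sketch, with made-up property names, assuming the products and dealer ids have already been fetched by the two queries:

using System.Collections.Generic;

// "dealerIds" comes from SELECT dealerId FROM guaranteeDealers,
// "products" from the products query; both names are invented here.
var guaranteedDealers = new HashSet<int>(dealerIds);

foreach (var product in products)
{
    product.IsGuaranteed = guaranteedDealers.Contains(product.DealerId);
}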
But sometimes, we can write just one SELECT statement that specifies the set we want returned, and let MySQL have a go at it. And if it's slow, we can do some tuning, looking at the execution plan, making sure suitable indexes are available, and tweaking the query.
Given the information in the question about the set to be returned, and assuming that dealerId is unique in the guaranteeDealers table: if our "test" is whether a matching row exists in the guaranteeDealers table, we can use an OUTER JOIN operation and an expression in the SELECT list that returns 0 or 1, depending on whether a matching row was found.
SELECT p.id
, p.name
, p.cityId
, p.dealerId
, p.price
, IF(d.dealerId IS NULL,0,1) AS isGuarantee
FROM products p
LEFT
JOIN guaranteeDealers d
ON d.dealerId = p.dealerId
ORDER BY ...
For optimal performance, we are going to want to have suitable indexes defined. At a minimum (if there isn't already such an index defined):
ON guaranteeDealers (dealerId)
If there are other tables involved in producing the result we are after, then we want to include those tables in the query we execute as well. That gives the MySQL optimizer a chance to come up with the most efficient plan for returning the entire set, rather than constraining MySQL to performing individual operations that return bits of it piecemeal.
select id, name, cityId, customers.dealerId, price,
       guaranteeDealers.dealerId is not null as isGuaranteed
from customers left join guaranteeDealers
     on guaranteeDealers.dealerId = customers.dealerId
No need to call a function.
Note I have used customers because that is the table you used in your question - although I suspect you might have meant products.
Approach 1 is the better one because it reduces the size of the resultset being transferred from the database server to the application server. Its performance problem is caused by the isGuaranteed function, which is being executed once per row (of the customers table, which looks like it might be a typo). An approach like this would be much more performant:
select p.id, p.name, p.cityId, p.dealerId, p.price, gd.dealerId is not null as isGuaranteed
from Product p
left join GuaranteeDealers gd on p.dealerId = gd.dealerId
I'm trying to join two tables from different servers, but it keeps throwing this exception:
This method supports LINQ to Entities infrastructure and is not intended to be used directly from your code.
Here is my query:
var myList = (from myTableA in _dbProvider.ContextOne.TableA
              join myTableB in _dbProvider.ContextOne.TableB on myTableA.ProductId equals myTableB.Oid
              join myTableC in _dbProvider.ContextTwo.TableC on myTableB.Id equals myTableC.ProductId
              where ...
              select myTableC.Name).Distinct().ToList();
What does that mean?
I found another solution by getting the data separately from each table into lists and then joining them, but it's very time-consuming.
Is there any other solution?
You can't join two tables from two different servers. Definitely not from EF.
Your best bet is to fetch the data into two separate lists and then join them together using LINQ to Objects.
Let me make up an example: you have 1,000,000 invoices in one table, each with about 10 items, for a total of 10,000,000 invoice details on another server. You need the invoices and their details for the first 10 invoices created on 2015-05-04.
You send a query to the first DB to get only those 10 invoices, extract their IDs and use them to query the roughly 100 matching rows from the other server. This is only about two times slower than making a single join query.
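A rough sketch of that two-step fetch in LINQ; the context, entity and property names below are invented, not taken from the question:

using System;
using System.Linq;

// Invented context/entity names; query #1 runs on server 1, query #2 on server 2.
var day = new DateTime(2015, 5, 4);

var invoices = ctx1.Invoices
    .Where(i => i.CreatedOn >= day && i.CreatedOn < day.AddDays(1))
    .OrderBy(i => i.CreatedOn)
    .Take(10)
    .ToList();                                      // ~10 rows from server 1

var invoiceIds = invoices.Select(i => i.Id).ToList();

var details = ctx2.InvoiceDetails
    .Where(d => invoiceIds.Contains(d.InvoiceId))   // translated to an IN (...) list
    .ToList();                                      // ~100 rows from server 2

// LINQ to Objects join, entirely in memory
var result = invoices
    .GroupJoin(details,
               i => i.Id,
               d => d.InvoiceId,
               (i, ds) => new { Invoice = i, Details = ds.ToList() })
    .ToList();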
In some cases this becomes impossible (you have conditions on both tables) and you need to bring more rows, but in simple scenarios this is possible.
TABLE1 (ID, PID, PNO) contains the start and end points, e.g. (A, B), with Primary Key (ID, PID) and Foreign Key (ID).
TABLE2 (ID, PNO) contains the middle point information in order (a1, a2 ... bn-1, bn), with Primary Key (ID).
I am trying to join them in such a way that I can get [A, a1, a2 ... bn-1, bn, B].
I fetched the data using
SELECT PNO FROM TABLE2 WHERE ID= 123 UNION SELECT PNO FROM TABLE1 WHERE ID= 123
and tried it in C# code by fetching all the data and then adding conditions and reordering them. This attempt is too lengthy.
Apart from this, is there a way to join these two tables to get the desired result set?
Note: These tables are related to each other by the common field ID, and PID in TABLE1 has two distinct values: 1 for start and 2 for end. Based on this, the PNO with PID 1 should come first and the PNO with PID 2 should come at the end.
"tried it in C# code by fetching all data and then adding condition's and reordering them".
This is usually a very bad idea, especially if you have a lot of network activity. SQL is very good at manipulating data: in fact it is optimized for that task. So, try passing the conditions to the database as a WHERE clause, using an ORDER BY to sort the final result set and returning just the rows you need. This could have a big impact on the total elapsed time if there is a large difference between the number of rows in the raw database set and the final C# set.
Other things: if you're still finding this too slow, you have a standard tuning problem. You haven't provided any of the hard information necessary to give a definitive solution, so here are some guesses.
You want all the records for an ID so there isn't a more efficient way of joining the two intermediary result sets to get a final set. But if the two sets are exclusive - that is, if the endpoints in Table1 are not included in the points from Table2 (your question isn't completely clear on this) - a UNION ALL would be more efficient:
SELECT PNO FROM TABLE2 WHERE ID= 123
UNION ALL
SELECT PNO FROM TABLE1 WHERE ID= 123
That's because UNION performs an additional operation to produce a distinct set of values. Skipping that step will save you some time.
An index on table2 ( ID, PNO) could speed up retrieval times by avoiding the need to touch the table at all. Whether it's worth the overhead of maintaining an index depends on how often you want to run this query and how you load Table2. It also depends on what further filters you apply if you act on my opening paragraph.
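Putting the WHERE/ORDER BY advice together with the UNION ALL, here is a rough sketch of a query that hands the points back already in the required order. The sort-key values are mine, and how the middle points are ordered among themselves is left to whatever "in order" means for TABLE2:

// Hedged sketch only; uses the literal ID 123 from the question.
const string OrderedPointsSql = @"
    SELECT PNO
    FROM (
        SELECT PNO, 0 AS SORT_KEY FROM TABLE1 WHERE ID = 123 AND PID = 1
        UNION ALL
        SELECT PNO, 1 AS SORT_KEY FROM TABLE2 WHERE ID = 123
        UNION ALL
        SELECT PNO, 2 AS SORT_KEY FROM TABLE1 WHERE ID = 123 AND PID = 2
    ) T
    ORDER BY SORT_KEY";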
I'm trying to load data from Oracle to SQL Server (sorry for not writing this before).
I have a table (actually a view that has data from different tables) with at least 1 million records. I designed my package in such a way that I have functions for the business logic and call them directly in the select query.
Ex:
X1(id varchar2)
x2(id varchar2, d1 date)
x3(id varchar2, d2 date)
Select id, x, y, z, decode (.....), x1(id), x2(id), x3(id)
FROM Table1
Note: My table has 20 columns and I call 5 different functions on at least 6-7 columns.
And some functions compare the parameters passed against an audit table and perform logic.
How can I improve the performance of my query, or is there a better way to do this?
I tried doing it in C# code, but the initial select of records is too large for a DataSet and I get an OutOfMemoryException.
My function does selects and then performs logic, for example:
Function(c_x2, eid)
Select col1
into p_x1
from tableP
where eid = eid;
IF (p_x1 IS NULL) THEN
ret_var := 'INITIAL';
ELSIF (p_x1 = 'L') AND (c_x2 = 'A') THEN
ret_var:= 'RL';
INSERT INTO Audit
(old_val, new_val, audit_event, id, pname)
VALUES
(p_x1, c_x2, 'RL', eid, 'PackageProcName');
ELSIF (p_x1 = 'A') AND (c_x2 = 'L') THEN
ret_var := 'GL';
INSERT INTO Audit
(old_val, new_val, audit_event, id, pname)
VALUES
(p_x1, c_x2, 'GL', eid, 'PackgProcName');
END IF;
RETURN ret_var;
I'm getting each row, performing the logic in C#, and then inserting.
If possible INSERT from the SELECT:
INSERT INTO YourNewTable
(col1, col2, col3)
SELECT
col1, col2, col3
FROM YourOldTable
WHERE ....
This will run significantly faster than a single query where you then loop over the result set and issue an INSERT for each row.
EDIT, regarding the OP's question edit:
You should be able to replace the function call with plain SQL in your query. Mimic the "INITIAL" case using a LEFT JOIN on tableP, and the "RL" or "GL" values can be calculated using CASE.
EDIT based on the OP's recent comments:
Since you are loading data from Oracle into SQL Server, this is what I would do: most people who could help have moved on and will not read this question again, so open a new question where you say 1) you need to load data from Oracle (version) to SQL Server (version), 2) currently you are loading it with one query, processing each row in C# and inserting it into SQL Server, and it is slow; and give all the other details. There are much better ways of bulk loading data into SQL Server. As for this question, you could accept an answer, answer it yourself explaining that you need to ask a new question, or just leave it unaccepted.
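One of those bulk-loading routes, sketched very roughly below, is to stream an Oracle data reader straight into SqlBulkCopy. The connection strings, the source query and the destination table name are placeholders, and this ignores the audit-table inserts performed by the PL/SQL functions:

using System.Data.Odbc;
using System.Data.SqlClient;

class OracleToSqlServerLoader
{
    public static void Copy(string oracleConnStr, string sqlServerConnStr)
    {
        using (var source = new OdbcConnection(oracleConnStr))
        using (var target = new SqlConnection(sqlServerConnStr))
        {
            source.Open();
            target.Open();

            // Placeholder source query; the real one would be the view with its decode/function columns.
            using (var cmd = new OdbcCommand("SELECT id, x, y, z FROM Table1", source))
            using (var reader = cmd.ExecuteReader())
            using (var bulk = new SqlBulkCopy(target))
            {
                bulk.DestinationTableName = "dbo.TargetTable";
                bulk.BatchSize = 10000;        // stream in batches, no DataSet held in memory
                bulk.WriteToServer(reader);    // one bulk load instead of row-by-row INSERTs
            }
        }
    }
}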
My recommendation is that you do not use functions and then call them within other SELECT statements. This:
SELECT t.id, ...
x1(t.id) ...
FROM TABLE t
...is equivalent to:
SELECT t.id, ...
(SELECT x.column FROM x1 x WHERE x.id = t.id)
FROM TABLE t
Encapsulation doesn't work in SQL the way it does in C# etc. While the approach makes maintenance easier, performance suffers because the sub-selects will execute for every row returned.
A better approach would be to update the supporting function to include the join criteria (i.e. "where x.id = t.id", for lack of the real one) in the SELECT:
SELECT x.id,
       x.column
FROM x1 x
...so you can use it as a JOIN:
SELECT t.id, ...
x1.column
FROM TABLE t
JOIN (SELECT x.id,
x.column
FROM MY_PACKAGE.x) x1 ON x1.id = t.id
I prefer that to having to incorporate the function logic into the queries for sake of maintenance, but sometimes it can't be helped.
Personally, I'd create an SSIS import to do this task. Using a bulk insert you can improve speed dramatically, and SSIS can handle the functions part after the bulk insert.
Firstly you need to find where the performance problem actually is. Then you can look at trying to solve it.
What is the performance of the view like? How long does it take the view to execute
without any of the function calls? Try running the command
create table the_view_table
as
select *
from the_view;
How well does it perform? Does it take 1 minute or 1 hour?
How well do the functions perform? According to the description you are making approximately 5 million function calls. They had better be pretty efficient! Also, are the functions defined as deterministic? If the functions are defined using the DETERMINISTIC keyword, then Oracle has a chance of optimizing away some of the calls.
Is there a way of reducing the number of function calls? The functions are being called once the view has been evaluated and the million rows of data are available. BUT do all the input values come from the highest level of the query? Can the function calls be embedded into the view at a lower level? Consider the following two queries. Which would be quicker?
select
f.dim_id,
d.dim_col_1,
long_slow_function(d.dim_col_2) as dim_col_2
from large_fact_table f
join small_dim_table d on (f.dim_id = d.dim_id)
select
f.dim_id,
d.dim_col_1,
d.dim_col_2
from large_fact_table f
join (
select
dim_id,
dim_col_1,
long_slow_function(dim_col_2) as dim_col_2
from small_dim_table) d on (f.dim_id = d.dim_id)
Ideally, the second query should run quicker as it calls the function fewer times.
The performance issue could be in any of these places and until you investigate the issue, it would be difficult to know where to direct your tuning efforts.
A couple of tips:
Don't load all records into RAM but process them one by one.
Try to run as many functions on the client as possible. Databases are really slow to execute user defined functions.
If you need to join two tables, it's sometimes possible to create two connections on the client. Fetch the main data with connection 1 and the audit data with connection 2. Order the data for both connections in the same way so you can read single records from both connections and perform whatever you need on them (see the sketch after this list).
If your functions always return the same result for the same input, use a computed column or a materialized view. The database will run the function once and save it in a table somewhere. That will make INSERT slow but SELECT quick.
Create a sorted index on your table.
Introduction to SQL Server indexes; other RDBMSs are similar.
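Here is a rough sketch of the two-connection idea from the list above. The queries, key column and per-row processing are placeholders, and it assumes at most one audit row per id:

using System.Data;

static void ProcessWithTwoConnections(IDbConnection mainConn, IDbConnection auditConn)
{
    using (IDbCommand mainCmd = mainConn.CreateCommand())
    using (IDbCommand auditCmd = auditConn.CreateCommand())
    {
        // Placeholder queries; the important part is that both are ordered by the same key.
        mainCmd.CommandText = "SELECT id, payload FROM main_table ORDER BY id";
        auditCmd.CommandText = "SELECT id, audit_info FROM audit_table ORDER BY id";

        using (IDataReader main = mainCmd.ExecuteReader())
        using (IDataReader audit = auditCmd.ExecuteReader())
        {
            bool hasAudit = audit.Read();
            while (main.Read())
            {
                int id = main.GetInt32(0);

                // advance the audit reader until it catches up with the main reader
                while (hasAudit && audit.GetInt32(0) < id)
                    hasAudit = audit.Read();

                bool matched = hasAudit && audit.GetInt32(0) == id;
                // perform whatever per-row logic is needed here, using the audit row when matched
            }
        }
    }
}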
Edit since you edited your question:
Using a view is even more sub-optimal, especially when querying single rows from it. I think your "business functions" are actually something like stored procedures?
As others suggested, in SQL always go set-based. I assumed you had already done that, hence my tip to start using indexing.
I am receiving a large list of current account numbers daily, and storing them in a database. My task is to find added and released accounts from each file. Right now, I have 4 SQL tables, (AccountsCurrent, AccountsNew, AccountsAdded, AccountsRemoved). When I receive a file, I am adding it entirely to AccountsNew. Then running the below queries to find which we added and removed.
INSERT AccountsAdded(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsCurrent)
INSERT AccountsRemoved(AccountNum, Name) SELECT AccountNum, Name FROM AccountsCurrent WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsNew)
TRUNCATE TABLE AccountsCurrent
INSERT AccountsCurrent(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew
TRUNCATE TABLE AccountsNew
Right now, I am differencing about 250,000 accounts, but this is going to keep growing. Is this the best method, or do you have any other ideas?
EDIT:
This is an MSSQL 2000 database. I'm using C# to process the file.
The only data I am focused on is the accounts that were added and removed between the last and current files. AccountsCurrent is only used to determine which accounts were added or removed.
To be honest, I think that I'd follow something like your approach. One thing is that you could remove the truncate, do a rename of the "new" to "current" and re-create "new".
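A very rough sketch of that swap, executed from C# against SQL Server; the column definitions for the re-created AccountsNew are guesses, and real code would also recreate any indexes:

// Assumes an open SqlConnection named "connection" (System.Data.SqlClient).
using (SqlCommand cmd = connection.CreateCommand())
{
    cmd.CommandText = @"
        DROP TABLE AccountsCurrent;
        EXEC sp_rename 'AccountsNew', 'AccountsCurrent';
        CREATE TABLE AccountsNew (AccountNum varchar(32), Name varchar(100));";
    cmd.ExecuteNonQuery();
}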
Sounds like a history/audit process that might be better done using triggers. Have a separate history table that captures changes (e.g., timestamp, operation, who performed the change, etc.)
New and deleted accounts are easy to understand. "Current" accounts implies that there's an intermediate state between being new and deleted. I don't see any difference between "new" and "added".
I wouldn't have four tables. I'd have a STATUS table that would have the different possible states, and ACCOUNTS or the HISTORY table would have a foreign key to it.
Using IN clauses on long lists can be slow.
If the tables are indexed, using a LEFT JOIN can prove to be faster...
INSERT INTO [table] (
[fields]
)
SELECT
[fields]
FROM
[table1]
LEFT JOIN
[table2]
ON [join condition]
WHERE
[table2].[id] IS NULL
This assumes 1:1 relationships and not 1:many. If you have 1:many you can do any of...
1. SELECT DISTINCT
2. Use a GROUP BY clause
3. Use a different query, see below...
INSERT INTO [table] (
[fields]
)
SELECT
[fields]
FROM
[table1]
WHERE
EXISTS (SELECT * FROM [table2] WHERE [condition to match tables 1 and 2])
-- This is quick provided that all fields to match the two tables are
-- indexed in both tables. Should then be much faster than the IN clause.
You could also subtract the intersection to get the differences in one table.
If the initial file is ordered in a sensible and consistent way (big IF!), it would run considerably faster as a C# program which logically compared the files.
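For what it's worth, if the account numbers fit in memory, a C# comparison doesn't even need the files to be ordered. A rough sketch, assuming hypothetical file names and one account number per line:

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical file names; assumes one account number per line.
var previous = new HashSet<string>(File.ReadAllLines("previous_accounts.txt"));
var current = new HashSet<string>(File.ReadAllLines("current_accounts.txt"));

var added = current.Where(a => !previous.Contains(a)).ToList();    // in the new file only
var removed = previous.Where(a => !current.Contains(a)).ToList();  // in the old file only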