sql Top 1 vs System.Linq firstordefault - c#

I am rewriting an SProc in c#. the problem is that in SProc there is a query like this:
select top 1 *
from ClientDebt
where ClinetID = 11234
order by Balance desc
For example :I have a client with 3 debts, all of them have same balance. the debt ids are : 1,2,3
c# equivalent of that query is :
debts.OrderByDescending(d => d.Balance)
.FirstOrDefault()
debts represent clients 3 debts
the interesting part is that sql return debt with Id 2 but c# code returns Id 1.
The Id 1 make sense for me But in order to keep code functionality the same I need to change the c# code to return middle one.
I do not sure what is the logic behind sql top 1 where several rows match the query.
The query will select one debt and update the database. I would like the linq to return the same result with sql
Thanks

debts.OrderByDescending(d => d.Balance).ThenByDescending(d => d.Id)
.FirstOrDefault()

You can start SQL Profiler, execute stored procedure, review result, and then catch query which application send through linq, and again review result.
Also, you can easily view execution plan of you procedure, and try it to optimize, but with linq query, you cannot easily do this.

AFAIK, IN SQL if you select rows without ORDER BY, it orders the resultset based on the primary key.
With Order BY CLAUSE [field], implicitly next order is [primarykey].

Related

C# Linq query execution order

Consider the following method:
public IEnumerable<Owner> GetOwners(OwnerParameters ownerParameters)
{
return FindAll()
.OrderBy(on => on.Name)
.Skip((ownerParameters.PageNumber - 1) * ownerParameters.PageSize)
.Take(ownerParameters.PageSize)
.ToList();
}
Where FindAll() is a repository pattern method that returns IQueryable<Owner>. Does having .OrderBy() before .Skip() and .Take() methods mean that all the elements from the Owner data table will be retrieved and ordered, or, does Linq take into account that .Skip() and .Take() methods might narrow down the required Owner elements and only after having retrieved those will the ordering happen?
EDIT: Profiler log:
SELECT XXX
FROM [Owners] AS [a]
ORDER BY [a].[Name]
OFFSET #__p_0 ROWS FETCH NEXT #__p_1 ROWS ONLY',N'#__p_0 int,#__p_1 int',#__p_0=0,#__p_1=10
Ultimately, this depends on what FindAll() does and what it returns:
if it returns IEnumerable<T>, then it is using LINQ-to-Objects, which mostly just does literally what it is told; if you sort then pages, then it sorts then pages; if you page then sort, then it pages then sorts
however, if it returns IQueryable<T>, then the query is being composed - and only actually executed in ToList(), at which point the provider model gets a chance to inspect your query tree and build the most suitable implementation possible, which often means: writing a SQL query that includes an ORDER BY and some paging hints suitable for the specific RDBMS; if your code paged then sort (which is... unusual) then I would expect most providers to either write some kind of horrible sub-query to try to describe that, or just throw an exception (perhaps NotSupportedException) in disgust
All records wouldn't be retrieved. Depending on your backend a query would be generated, that orders by Name, skips N rows and takes next M rows. Only the query result is retrieved, triggered by .ToList().
ie: In MS SQL server the query might look like:
Select top(M) row_number() over (order by Name) as RowNo, *
from myTable
where RowNo > N
That type of query is not very effective but that is another matter and you could create your custom workaround.
EDIT: I remembered MS SQL code generated wrong. It was:
SELECT ...
FROM ...
ORDER BY ...
OFFSET x ROWS FETCH NEXT y ROWS ONLY
That one is also slow. If you have speed consideration then write your own SQL and send with Linq. Basically what you do is to keep the lastRetrieved value and set it as a parameter:
select top(NTake) ... from ...
order by ...
where orderedByValue > #lastRetrievedValue
(You can send raw SQL queries in Linq)
A quick example to demonstrate that order of query matters. The explanations given above are good.
var test1 = await _dbContext.UserActivityLogs
.Where(x => x.ExternalSyncLogId == request.Id)
.Skip(0).Take(25)
.AsNoTracking().ToListAsync();
This query transpile to
exec sp_executesql N'SELECT [u].[Id], [u].[ActionType], [u].[ActivityEndTime], [u].[ActivityStartTime], [u].[ActivityType], [u].[Created], [u].[CreatedBy], [u].[EntityId], [u].[ExternalSyncLogId], [u].[LastModified], [u].[LastModifiedBy], [u].[RequestBody], [u].[ResponseBody], [u].[Source], [u].[TenantId]
FROM [UserActivityLogs] AS [u]
WHERE ([u].[ExternalSyncLogId] IS NULL
ORDER BY (SELECT 1)
OFFSET #__p_1 ROWS FETCH NEXT #__p_2 ROWS ONLY',N'#__p_1 int,#__p_2 int',#__p_1=0,#__p_2=25
Execution time: 29
But if we just change the sequence then
var test2 = await _dbContext.UserActivityLogs
.Skip(0).Take(25)
.Where(x => x.ExternalSyncLogId == request.Id)
.AsNoTracking().ToListAsync();
Translated as
exec sp_executesql N'SELECT [t].[Id], [t].[ActionType], [t].[ActivityEndTime], [t].[ActivityStartTime], [t].[ActivityType], [t].[Created], [t].[CreatedBy], [t].[EntityId], [t].[ExternalSyncLogId], [t].[LastModified], [t].[LastModifiedBy], [t].[RequestBody], [t].[ResponseBody], [t].[Source], [t].[TenantId]
FROM (
SELECT [u].[Id], [u].[ActionType], [u].[ActivityEndTime], [u].[ActivityStartTime], [u].[ActivityType], [u].[Created], [u].[CreatedBy], [u].[EntityId], [u].[ExternalSyncLogId], [u].[LastModified], [u].[LastModifiedBy], [u].[RequestBody], [u].[ResponseBody], [u].[Source], [u].[TenantId]
FROM [UserActivityLogs] AS [u]
ORDER BY (SELECT 1)
OFFSET #__p_0 ROWS FETCH NEXT #__p_1 ROWS ONLY
) AS [t]
WHERE [t].[ExternalSyncLogId] IS NULL',N'#__p_0 int,#__p_1 int',#__p_0=0,#__p_1=25
Execution time: 474

c# vs mysql: calling function in sql select statement vs fetching data and calling same function in c#

We are a product website with several products having guarantee. Guarantee is only applicable for few products with particular dealerids. The 2 tables are:
Product table with columns as id, name, cityId, dealerId, price. This table has all the products.
GuaranteeDealers table with column as dealerId. This has all dealer with guaranteed products.
We want to get all products with info if it is guaranteed or not. The query looks like:
APPROACH1: Get isGuaranteed from sql function to server(c#) side:
select id, name, cityId, dealerId, price, isGuaranteed = isGuaranteed( dealerId) from customers
isGuaranteed is a sql function that checks if dealerId is in the table guranteeDealers. If yes it returns 1 else 0.
I have 50000 products and 500 such dealers and this query takes too long to execute.
OR
APPROACH2: Get list of dealers and set isGuaranteed flag in c#(server) side.
select id, name, cityId, dealerId, price. Map these to c# list of products
select dealerId from guaranteeDealers table to c# list of dealers.
Iterate product records in c# and set the isGuaranteed flag by c# function that checks if product's dealerId is in c# list of guaranteeDealers.
This takes very less time compared to 1.
While both approaches look similar to me, can someone explain why it takes so long time to execute function in select statement in mysql? Also which is correct to do, approach 1 or 2?
Q: "why it takes so long time to execute function in select statement in mysql?"
In terms of performance, executing a correlated subquery 50,000 times will eat our lunch, and if we're not careful, it will eat our lunchbox too.
That subquery will be executed for each and every row returned by the outer query. That's like executing 50,000 separate, individual SELECT statements. And that's going to take time.
Hiding a correlated subquery inside a MySQL stored program (function) doesn't help. That just adds overhead on each execution of the subquery, and makes things slower. If we strip out the function and bring that subquery inline, we are probably looking at something like this:
SELECT p.id
, p.name
, p.cityId
, p.dealerId
, p.price
, IFNULL( ( SELECT 1
FROM guaranteeDealers d
WHERE d.dealerId = p.dealerID
LIMIT 1
)
,0) AS isGuarantee
FROM products p
ORDER BY ...
For each and every row returned from products (that isn't filtered out by a predicate e.g. condition in the WHERE clause), this is essentially telling MySQL to execute a separate SELECT statement. Run a query to look to see if the dealerID is found in the guaranteeDealers table. And that happens for each row.
If the outer query is only returning a couple of rows, then that's only a couple of extra SELECT statements to execute, and we aren't really going to notice the extra time. But when we return tens (or hundreds) of thousands of rows, that starts to add up. And it gets expensive, in terms of the total amount of time all those query executions take.
And if we "hide" that subquery in a MySQL stored program (function), that adds more overhead, introducing a bunch of context switches. From query executing in the database context, calling a function that switches over to the stored program engine which executes the function, which then needs to run a database query, which switches back to the database context to execute the query and return a resultset, switching back to the stored program environment to process the resultset and return a value, and then switching back to the original database context, to get the returned value. If we have to do that a couple of times, no big whoop. Repeat that tens of thousands of times, and that overhead is going to add up.
(Note that native MySQL built-in functions don't have this same context switching overhead. The native functions are compiled code that execute within the database context. Which is a big reason we favor native functions over MySQL stored programs.)
If we want improved performance, we need to ditch the processing RBAR (row by agonizing row), which gets excruciatingly slow for large sets. We need to approach the problem set-wise rather than row-wise.
We can tell MySQL what set to return, and let it figure out the most efficient way to return that. Rather than us round tripping back and forth to the database, executing individual SQL statements to get little bits of the set piecemeal, using instructions that dictate how MySQL should prepare the set.
In answer to the question
Q: "which approach is correct"
both approaches are "correct" is as much as they return the set we're after.
The second approach is "better" in that it significantly reduces the number of SELECT statements that need to be executed (2 statements rather than 50,001 statements).
In terms of the best approach, we are usually better off letting MySQL do the "matching" of rows, rather than doing the matching in client code. (Why unnecessarily clutter up our code doing an operation that can usually be much more efficiently accomplished in the database.) Yes, sometimes we need to do the matching in our code. And occasionally it turns out to be faster.
But sometimes, we can write just one SELECT statement that specifies the set we want returned, and let MySQL have a go at it. And if it's slow, we can do some tuning, looking at the execution plan, making sure suitable indexes are available, and tweaking the query.
Given the information in the question about the set to be returned, and assuming that dealerId is unique in the guaranteeDealers table. If our "test" is whether a matching row exists in the guaranteeDealers table, we can use an OUTER JOIN operation, and an expression in the SELECT list that returns a 0 or 1, depending on whether a matching row was found.
SELECT p.id
, p.name
, p.cityId
, p.dealerId
, p.price
, IF(d.dealerId IS NULL,0,1) AS isGuarantee
FROM products p
LEFT
JOIN guaranteeDealers d
ON d.dealerId = p.dealerId
ORDER BY ...
For optimal performance, we are going to want to have suitable indexes defined. At a mimimum (if there isn't already such an index defined)
ON guaranteeDealers (dealerId)
If there are also other tables that are involved in producing the result we are after, then we want to also involve that table in the query we execute. That will give the MySQL optimizer a chance to come up with the most efficient plan to return the entire set. And not constrain MySQL to performing individual operations to be return bits piecemeal.
select id, name, cityId, customers.dealerId, price,
isGuaranteed = guaranteeDealers.dealerId is not null
from customers left join guaranteeDealers
on guaranteeDealers.dealerId = customets.dealerId
No need to call a function.
Note I have used customers because that is the table you used in your question - although I suspect you might have meant products.
Approach 1 is the better one because it reduces the size of the resultset being transferred from the database server to the application server. Its performance problem is caused by the isGuaranteed function, which is being executed once per row (of the customers table, which looks like it might be a typo). An approach like this would be much more performant:
select p.id, p.name, p.cityId, p.dealerId, p.price, gd.IsGuaranteed is not null
from Product p
left join GuaranteeDealers gd on p.dealerId = gd.dealerId

EF and LINQ query to database speed

I just wondering if I am wasting my time or is there anything I could to improve this query which in turn will improve performance.
Inside a Repository, I am trying to get the 10 most recent items
public List<entity> GetTop10
{
get
{
return m_Context.entity.Where(x => x.bool == false).OrderByDescending(x => x.Created).Take(10).ToList();
}
}
But this is taking a long time as the table its querying has over 11000 rows in it. So my question is, is there anyway I could speed up this kind of query?
I am trying to get my SQL hat on regarding performance, I know the order would slow it down, but how I could I achieve the same result?
Thanks
The particular query you posted is a potential candidate for using a filtered index. Say you have a SQL table:
CREATE TABLE Employees
(
ID INT IDENTITY(1,1) PRIMARY KEY,
Name NVARCHAR(100),
IsAlive BIT
)
You can imagine that generally you only want to query on employees that have not (yet) died so will end up with SQL like this:
SELECT Name FROM Employees WHERE IsAlive = 1
So, why not create a filtered index:
CREATE INDEX IX_Employees_IsAliveTrue
ON Employees(IsAlive)
WHERE IsAlive = 1
So now if you query the table it will use this index which may only be a small portion of your table, especially if you've had a recent zombie invasion and 90% of your staff are now the walking dead.
However, an Entity Framework like this:
var nonZombies = from e in db.Employees
where e.IsAlive == true
select e;
May not be able to use the index (SQL has a problem with filtered indexes and parameterised queries). To get round this, you can create a view in your database:
CREATE VIEW NonZombies
AS
SELECT ID, Name, IsAlive FROM Employees WHERE IsAlive = 1
Now you can add that to your framework (how you do this will vary depending on if you are using code/model/database first) and you will now be able to decide which employees deserve urgent attention (like priority access to food and weapons):
var nonZombies = from e in db.NonZombies
select e;
From your LINQ query will be created SQL SELECT similar to this:
SELECT TOP(10) * FROM entity
WHERE bool = 0
ORDER BY Created DESC
Similar because instead of the '*' will server select concrete columns to map these to entity object.
If this is too slow for you. The error is in the database, not in the EntityFramework.
So try adding some indexes to your table.

LINQ - Speeding up query that has a join to a huge table

I have these two tables
ExpiredAccount Account
-------------- ---------------
ExpiredAccountID AccountID
AccountID (fk) AccountName
... ...
Basically, I want to return a list of ExpiredAccounts displaying the AccountName in the result.
I currently do this using
var expiredAccounts = (from x in ExpiredAccount
join m in Account on x.AccountID equals m.AccountID
select m.AccountName).ToList()
This works fine. However, this takes too long.
There's not a lot of records in expiredAccounts (<200).
The Account table on the otherhand has over 300,000 records.
Is there anyway I could speed up my query, or alternatively, another way to do this more efficiently with or without using LINQ?
Firstly, assuming you are using Entity Framework, you don't need to be using the join at all. You could simply do:
var expiredAccounts = (from x in ExpiredAccount
select x.Account.AccountName).ToList()
However, I don't think they will generate a different query plan on the database. But my guess is that you don't have an index on AccountID in the Account table (although that seems unlikely).
One thing you can do is use ToTraceString (for example: http://social.msdn.microsoft.com/Forums/en-US/adodotnetentityframework/thread/4a17b992-05ca-4e3b-9910-0018e7cc9c8c/) to get the SQL which is being run. Then you can open SQL Management Studio and run that with the execution plan option turned on and it will show you what the execution plan was and what indexes need to be added to make it better.
You can try using Contains method:
var expiredAccounts = (from m in Account where ExpiredAccount.Select(x => x.AccountId)
.Contains(m.AccountId)
select m.AccountName).ToList()
It should generate IN clause in SQL query that will be performed agains database.

Enhance performance of large slow dataloading query

I'm trying to load data from oracle to sql server (Sorry for not writing this before)
I have a table(actually a view which has data from different tables) with 1 million records atleast. I designed my package in such a way that i have functions for business logics and call them in select query directly.
Ex:
X1(id varchar2)
x2(id varchar2, d1 date)
x3(id varchar2, d2 date)
Select id, x, y, z, decode (.....), x1(id), x2(id), x3(id)
FROM Table1
Note: My table has 20 columns and i call 5 different functions on atleast 6-7 columns.
And some functions compare the parameters passed with audit table and perform logic
How can i improve performance of my query or is there a better way to do this
I tried doing it in C# code but initial select of records is large enough for dataset and i get outofmemory exception.
my function does selects and then performs logic for example:
Function(c_x2, eid)
Select col1
into p_x1
from tableP
where eid = eid;
IF (p_x1 = NULL) THEN
ret_var := 'INITIAL';
ELSIF (p_x1 = 'L') AND (c_x2 = 'A') THEN
ret_var:= 'RL';
INSERT INTO Audit
(old_val, new_val, audit_event, id, pname)
VALUES
(p_x1, c_x2, 'RL', eid, 'PackageProcName');
ELSIF (p_x1 = 'A') AND (c_x2 = 'L') THEN
ret_var := 'GL';
INSERT INTO Audit
(old_val, new_val, audit_event, id, pname)
VALUES
(p_x1, c_x2, 'GL', eid, 'PackgProcName');
END IF;
RETURN ret_var;
i'm getting each row and performing
logic in C# and then inserting
If possible INSERT from the SELECT:
INSERT INTO YourNewTable
(col1, col2, col3)
SELECT
col1, col2, col3
FROM YourOldTable
WHERE ....
this will run significantly faster than a single query where you then loop over the result set and have an INSERT for each row.
EDIT as for the OP question edit:
you should be able to replace the function call to plain SQL in your query. Mimic the "initial" using a LEFT JOIN tableP, and the "RL" or "GL" can be calculated using CASE.
EDIT based on OP recent comments:
since you are loading data from Oracle into SQL Server, this is what I would do: most people that could help have moved on and will not read this question again, so open a new question where you say: 1) you need to load data from Oracle (version) to SQL Server Version 2) currently you are loading it from one query processing each row in C# and inserting it into SQL Server, and it is slow. and all the other details. There are much better ways of bulk loading data into SQL Server. As for this question, you could accept an answer, answer yourself where you explain you need to ask a new question, or just leave it unaccepted.
My recommendation is that you do not use functions and then call them within other SELECT statements. This:
SELECT t.id, ...
x1(t.id) ...
FROM TABLE t
...is equivalent to:
SELECT t.id, ...
(SELECT x.column FROM x1 x WHERE x.id = t.id)
FROM TABLE t
Encapsulation doesn't work in SQL like when using C#/etc. While the approach makes maintenance easier, performance suffers because sub selects will execute for every row returned.
A better approach would be to update the supporting function to include the join criteria (IE: "where x.id = t.id" for lack of real one) in the SELECT:
SELECT x.id
x.column
FROM x1 x
...so you can use it as a JOIN:
SELECT t.id, ...
x1.column
FROM TABLE t
JOIN (SELECT x.id,
x.column
FROM MY_PACKAGE.x) x1 ON x1.id = t.id
I prefer that to having to incorporate the function logic into the queries for sake of maintenance, but sometimes it can't be helped.
Personally I'd create an SSIS import to do this task. USing abulk insert you can imporve speed dramitcally and SSIS can handle the functions part after the bulk insert.
Firstly you need to find where the performance problem actually is. Then you can look at trying to solve it.
What is the performance of the view like? How long does it take the view to execute
without any of the function calls? Try running the command
How well does it perform? Does it take 1 minute or 1 hour?
create table the_view_table
as
select *
from the_view;
How well do the functions perform? According to the description you are making approximately 5 million function calls. They had better be pretty efficient! Also are the functions defined as deterministic. If the functions are defined using the deterministic keyword, the Oracle has a chance of optimizing away some of the calls.
Is there a way of reducing the number of function calls? The function are being called once the view has been evaluated and the million rows of data are available. BUT are all the input values from the highest level of the query? Can the function calls be imbeded into the view at a lower level. Consider the following two queries. Which would be quicker?
select
f.dim_id,
d.dim_col_1,
long_slow_function(d.dim_col_2) as dim_col_2
from large_fact_table f
join small_dim_table d on (f.dim_id = d.dim_id)
select
f.dim_id,
d.dim_col_1,
d.dim_col_2
from large_fact_table f
join (
select
dim_id,
dim_col_1,
long_slow_function(d.dim_col_2) as dim_col_2
from small_dim_table) d on (f.dim_id = d.dim_id)
Ideally the second query should run quicker as it calling the function fewer times.
The performance issue could be in any of these places and until you investigate the issue, it would be difficult to know where to direct your tuning efforts.
A couple of tips:
Don't load all records into RAM but process them one by one.
Try to run as many functions on the client as possible. Databases are really slow to execute user defined functions.
If you need to join two tables, it's sometimes possible to create two connections on the client. Fetch the data main data with connection 1 and the audit data with connection 2. Order the data for both connections in the same way so you can read single records from both connections and perform whatever you need on them.
If your functions always return the same result for the same input, use a computed column or a materialized view. The database will run the function once and save it in a table somewhere. That will make INSERT slow but SELECT quick.
Create a sorted intex on your table.
Introduction to SQL Server Indizes, other RDBMS are similar.
Edit since you edited your question:
Using a view is even more sub-optimal, especially when querying single rows from it. I think your "busines functions" are actually something like stored procedures?
As others suggested, in SQL always go set based. I assumed you already did that, hence my tip to start using indexing.

Categories

Resources