One query that gets all data vs 300 specific queries - C#

First of all, I'm aware that there are similar questions out there, but I can't seem to find one that fits my case.
I have a program that updates the stock of about 300 products in a CSV file by checking against our database.
The database has roughly 100k records. At the moment I select all of the records in the database and just use the ones I need, which means I'm pulling about 99k records more than necessary.
I could do a specific select per product instead, but I'm worried that sending 300 queries (it might become more in the future) in a matter of seconds would be too much.
Or another (probably stupid) option:
select * from database where id = product1 OR id = product2
OR id = product3 OR id = product4, ...
What's the better option in this case? I'm not worried about execution time; I'd rather have clean and "efficient" code than fast code in this scenario.

You can try something like this:
select *
from database where id IN (1, 2, 3)
If the number of values you want to match is larger than the number you want to exclude, reverse the condition:
select *
from database where id NOT IN (some values)

You could do something like this:
select *
from database
where id IN (1, 2, 3)
All of the ids you want to retrieve can go into that list, so you don't need a long chain of OR clauses.
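If you build that IN list from C#, it is worth parameterising it instead of concatenating the ids into the SQL text. Below is a minimal sketch of the single-query approach, assuming ADO.NET and SQL Server; the table and column names (Products, ProductId, Stock) are made up for illustration:

// Fetch the stock for ~300 product ids in one round trip using a parameterised IN list.
// Products / ProductId / Stock are placeholder names - adjust to your schema.
static Dictionary<int, int> GetStockLevels(string connectionString, IReadOnlyList<int> productIds)
{
    var stock = new Dictionary<int, int>();
    if (productIds.Count == 0)
        return stock;

    // Build "@p0, @p1, ..." so every id is sent as a parameter, not pasted into the SQL.
    string placeholders = string.Join(", ", productIds.Select((_, i) => "@p" + i));
    string sql = "SELECT ProductId, Stock FROM Products WHERE ProductId IN (" + placeholders + ")";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        for (int i = 0; i < productIds.Count; i++)
            command.Parameters.AddWithValue("@p" + i, productIds[i]);

        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
                stock[reader.GetInt32(0)] = reader.GetInt32(1);
        }
    }
    return stock;
}

With ~300 ids this stays well below SQL Server's limit of around 2100 parameters per command; if the list ever grows much larger, send it in chunks or use a table-valued parameter.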

Related

Performant Linq Query that gets maximum revision of rows

I am developing an application that tracks changes to an object's properties. Each time an object's properties change, I create a new row in the table with the updated property values and an incremented revision.
I have a table that has a structure like the following:
Id (primary key, system generated)
UserFriendlyId (generated programmatically, it is the Id the user sees in the UI, it stays the same regardless of how many revisions an object goes through)
.... (misc properties)
Revision (int, incremented when an object's properties are changed)
To get the maximum revision for each UserFriendlyId, I do the following:
var latestIdAndRev = context.Rows
    .GroupBy(r => r.UserFriendlyId)
    .Select(latest => new { UserFriendlyId = latest.Key, Revision = latest.Max(r => r.Revision) })
    .ToList();
Then in order to get a collection of the Row objects, I do the following:
var latestRevs = context.Rows
    .Where(r => latestIdAndRev.Contains(new { UserFriendlyId = r.UserFriendlyId, Revision = r.Revision }))
    .ToList();
Even though my table only has ~3K rows, the performance of the latestRevs statement is horrible (it takes several minutes to finish, if it doesn't time out first).
Any idea what I might do differently to get better performance when retrieving the latest revision for a collection of UserFriendlyIds?
To increase the performance of your query you should try to make the entire query run on the database. You have divided the query into two parts: the first query pulls all the revisions to the client side into latestIdAndRev. The second query's .Where(r => latestIdAndRev.Contains( ... )) will then translate into a SQL statement along the lines of WHERE ... IN followed by a list of all the IDs you are looking for.
You can combine the queries into a single query where you group by UserFriendlyId and then, for each group, select the row with the highest revision by simply ordering the rows by Revision (descending) and picking the first row:
latestRevs = context.Rows.GroupBy(
    r => r.UserFriendlyId,
    (key, rows) => rows.OrderByDescending(r => r.Revision).First()
).ToList();
This should generate fairly efficient SQL, even though I have not been able to verify this myself. To further increase performance you should look at indexing the UserFriendlyId and Revision columns, though your results may vary. In general, adding an index increases the time it takes to insert a row but may decrease the time it takes to find a row.
(General advice: Watch out for .Where(row => clientSideCollectionOfIds.Contains(row.Id)) because all the ID's will have to be included in the query. This is not a fault of the ER mapper.)
There are a couple of things to look at, as you are likely ending up with serious recursion. If this is SQL Server, open Profiler, start a trace on the database in question, and then fire off the command. Look at what is being run, examine the execution plan, and see what is actually being executed.
From this you MIGHT be able to use the index wizard to create a set of indexes that speeds things up. I say might, as the recursive nature of the query may not be easily solved.
If you want something that recurses to be wicked fast, invest in learning window functions. A few years back, we reduced a query that took up to 30 seconds down to milliseconds by heading in that direction. NOTE: I am not saying this is your solution, just that it is worth looking into if indexes alone do not meet your Service Level Agreements (SLAs).
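For reference, the "latest revision per UserFriendlyId" problem is a textbook case for ROW_NUMBER. A rough sketch of what that could look like as a raw-SQL fallback from EF6, assuming SQL Server; the table name dbo.Rows is guessed from the entity set, so replace it with your actual mapping:

// Hypothetical: pick the row with the highest Revision per UserFriendlyId using a window function.
// Replace dbo.Rows with the actual mapped table name.
var latestRevs = context.Database.SqlQuery<Row>(@"
    SELECT *
    FROM (
        SELECT r.*,
               ROW_NUMBER() OVER (PARTITION BY UserFriendlyId ORDER BY Revision DESC) AS rn
        FROM dbo.Rows AS r
    ) AS ranked
    WHERE rn = 1").ToList();

If the materializer complains about the extra rn column, list the entity's columns explicitly in the outer SELECT instead of using *.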

Performance issues to iterate results with C# SQLite DataReader and attached database

I am using System.Data.SQLite and SQLiteDataReader in my C# project. I am facing performance issues when iterating over the results of a query that uses attached databases.
Here is an example of a query that searches for text in two databases:
ATTACH "db2.db" as db2;
SELECT MainRecord.RecordID,
((LENGTH(MainRecord.Value) - LENGTH(REPLACE(UPPER(MainRecord.Value), UPPER("FirstValueToSearch"), ""))) / 18) AS "FirstResultNumber",
((LENGTH(DB2Record.Value) - LENGTH(REPLACE(UPPER(DB2Record.Value), UPPER("SecondValueToSearch"), ""))) / 19) AS "SecondResultNumber"
FROM main.Record MainRecord
JOIN db2.Record DB2Record ON DB2Record.RecordID BETWEEN (MainRecord.PositionMin) AND (MainRecord.PositionMax)
WHERE FirstResultNumber > 0 AND SecondResultNumber > 0;
DETACH db2;
When executing this query with SQLiteStudio or SQLiteAdmin, it works fine; I get the results in a few seconds (the Record table can contain hundreds of thousands of records, and the query returns 36,000 rows).
When executing this query in my C# project, the execution also takes a few seconds, but it then takes hours to iterate through all the results.
Here is my code :
// Attach databases
SQLiteDataReader data = null;
using (SQLiteCommand command = this.m_connection.CreateCommand())
{
    command.CommandText = "SELECT...";
    data = command.ExecuteReader();
}
if (data.HasRows)
{
    while (data.Read())
    {
        // Do nothing, just iterate all results
    }
}
data.Close();
// Detach databases
Calling the Read method of the SQLiteDataReader just once can take more than 10 seconds! I guess this is because the SQLiteDataReader is lazily evaluated (so it doesn't fetch the whole rowset before you start reading the results), am I right?
EDIT 1 :
I don't know if this has something to do with lazy loading, like I said initially, but all I want is to be able to get ALL the results as soon as the query has finished. Isn't that possible? In my opinion, it is really strange that it takes hours to fetch the results of a query that executes in a few seconds...
EDIT 2 :
I just added a COUNT(*) to my select query in order to see if I could get the total number of results on the first data.Read(), just to be sure that it was only the iteration of the results that was taking so long. And I was wrong: this new query executes in a few seconds in SQLiteAdmin / SQLiteStudio, but takes hours to execute in my C# project. Any idea why the same query is so much slower in my C# project?
EDIT 3 :
Thanks to EXPLAIN QUERY PLAN, I noticed that there is a slight difference in the execution plan for the same query between SQLiteAdmin / SQLiteStudio and my C# project. In the second case, it uses an AUTOMATIC PARTIAL COVERING INDEX on DB2Record instead of using the primary key index. Is there a way to ignore / disable the use of automatic partial covering indexes? I know they are meant to speed up queries, but in my case, the opposite happens...
Besides finding matching records, it seems that you're also counting the number of times the strings matched. The result of this count is also used in the WHERE clause.
You want the number of matches, but the number of matches does not matter in the WHERE clause - you could try changing the WHERE clause to:
WHERE MainRecord.Value LIKE '%FirstValueToSearch%' AND DB2Record.Value LIKE '%SecondValueToSearch%'
It might not make any difference though - especially if there's no index on the Value columns - but it's worth a shot. Indexes on text columns require a lot of space, so I wouldn't blindly recommend that.
If you haven't done so yet, place an index on the DB2's RecordID column.
You can use EXPLAIN QUERY PLAN SELECT ... to make SQLite spit out what it does to try to make your query perform, the output of that might help diagnose the problem.
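Since the plan apparently differs between the tools and your application, it is worth running EXPLAIN QUERY PLAN over the same connection your code uses. A small sketch with System.Data.SQLite, reusing the already-attached connection from your snippet (the query text is abbreviated, as in your code):

// Dump SQLite's query plan as seen by the application itself,
// rather than the plan SQLiteStudio / SQLiteAdmin happen to get.
using (SQLiteCommand explain = this.m_connection.CreateCommand())
{
    explain.CommandText = "EXPLAIN QUERY PLAN SELECT..."; // paste the real query after the prefix
    using (SQLiteDataReader plan = explain.ExecuteReader())
    {
        while (plan.Read())
        {
            // The column layout varies between SQLite versions, so print every column generically.
            for (int i = 0; i < plan.FieldCount; i++)
                Console.Write(plan.GetName(i) + "=" + plan.GetValue(i) + "  ");
            Console.WriteLine();
        }
    }
}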
Are you sure you are using the same version of SQLite in System.Data.SQLite, SQLiteStudio and SQLiteAdmin? You can see huge differences between versions.
One more typical reason why a SQL query can take a different amount of time when executed with ADO.NET than from a native utility (like SQLiteAdmin) is command parameters used in CommandText (it is not clear from your code whether parameters are used or not). Depending on the ADO.NET provider implementation, the following identical CommandText values:
SELECT * FROM sometable WHERE somefield = ? -- assume the parameter value is '2'
and
SELECT * FROM sometable WHERE somefield='2'
may lead to completely different execution plans and query performance.
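In System.Data.SQLite terms the difference looks like the sketch below (sometable/somefield are the placeholder names from above). If you currently use parameters and the plan is bad, temporarily testing the literal form (or the other way round) is a quick way to confirm whether this is the cause:

// Parameterised form: the value is not known when the statement is prepared,
// which can lead the planner to a different (sometimes worse) plan.
using (SQLiteCommand cmd = connection.CreateCommand())
{
    cmd.CommandText = "SELECT * FROM sometable WHERE somefield = @value";
    cmd.Parameters.AddWithValue("@value", "2");
    using (SQLiteDataReader reader = cmd.ExecuteReader()) { /* ... */ }
}

// Literal form: the value is visible to the planner.
using (SQLiteCommand cmd = connection.CreateCommand())
{
    cmd.CommandText = "SELECT * FROM sometable WHERE somefield = '2'";
    using (SQLiteDataReader reader = cmd.ExecuteReader()) { /* ... */ }
}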
Another suggestion: you may disable the journal (by specifying "Journal Mode=Off;" in the connection string) and synchronous mode ("Synchronous=Off;"), as these options can also affect query performance in some cases.
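For completeness, those two options go directly into the connection string; a minimal example (the file name is made up):

// Journal and synchronous mode disabled, as suggested above.
// Both trade durability for speed, so enable them deliberately.
var connection = new SQLiteConnection("Data Source=main.db;Journal Mode=Off;Synchronous=Off;");
connection.Open();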

SQL Linq .Take() latest 20 rows from HUGE database, performance-wise

I'm using Entity Framework 6 and I issue LINQ queries from an ASP.NET server to an Azure SQL database.
I need to retrieve the latest 20 rows that satisfy a certain condition.
Here's a rough example of my query:
using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
    DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));

    IQueryable<Post> postQueryable =
        from postDbEntry in postHubDbContext.PostDbEntries
        orderby postDbEntry.Id descending
        where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
        select new Post(postDbEntry);

    postQueryable = postQueryable.Take(20);

    IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(Post => Post.DatePosted);
    return postOrderedQueryable.ToList();
}
The question is: what if I literally have a billion rows in my database? Will that query brutally select the millions of rows that meet the condition and then take 20 of them? Or will it be smart enough to realise that I only want 20 rows and select only those?
Basically, how do I make this query work efficiently against a database that has a billion rows?
According to http://msdn.microsoft.com/en-us/library/bb882641.aspx the Take() function has deferred streaming execution, as does the select statement. This means it should be equivalent to TOP 20 in SQL, and SQL will fetch only 20 rows from the database.
This link: http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx shows that Take has a direct translation in Linq-to-SQL.
So the only performance gain to be made is in the database. As @usr suggested, you can use indexes to increase performance. Also, storing the table in sorted order helps a lot (which is likely your case, since you sort by Id).
Why not try it? :) You can inspect the SQL and see what it generates, then look at the execution plan for that SQL and see whether it scans the entire table.
Check out this question for more details
How do I view the SQL generated by the Entity Framework?
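With EF6 specifically, two quick ways to see the generated SQL are hooking up Database.Log and calling ToString() on the query; a sketch against the query from the question (assuming it translates as written):

// Write every SQL statement EF6 sends to the database to the console,
// so you can check that the generated SELECT contains TOP (20).
postHubDbContext.Database.Log = Console.Write;
var posts = postOrderedQueryable.ToList();

// Alternatively, ToString() on an EF6 query returns the SQL it would execute, without running it.
Console.WriteLine(postQueryable.ToString());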
This will be hard to make really fast. You want an index to give you the sort order on Id, but you want a different (spatial) index to provide efficient filtering. It is not possible to create a single index that fulfills both goals efficiently.
Assume both indexes exist:
If the filter is very selective, expect SQL Server to "select" all rows where the filter is true, then sort them, then give you the top 20. Imagine there are only 21 rows that pass the filter - then this strategy is clearly very efficient.
If the filter is not at all selective, SQL Server will instead traverse the table ordered by Id, test each row it comes across and output the first 20 that qualify. Imagine that the filter applies to all rows - then SQL Server can just output the first 20 rows it sees. Very fast.
So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If that's your situation, this question requires further thought. You probably need more than a clever indexing strategy; you need app changes.
Btw, you don't need an index on DatePosted. The sorting by DatePosted is only done after limiting the set to 20 rows; you don't need an index to sort 20 rows.

Linq: Return results count for large dataset

I have an IEnumerable that is the result of a database query over a large dataset, as many as 10,000 records. I need the count to display on the webpage for pagination. Using .Count() results in exceptions like `The underlying provider failed on Open` or takes way too long.
Is there a way I can query the database to get the count of the results via LINQ to SQL?
It may be how you are using LINQ. If you compose the query like this:
var users = (from u in context.Users select u);
int userCount = users.Count();
That would effectively issue only a query that returns the count from the database. If you did something like this:
List<User> users = (from u in context.Users select u).ToList();
int userCount = users.Count();
That would retrieve all the records from the database and then try to count them in memory.
Let the database tell you the count - databases are built to do exactly that - and select only the rows you need instead of returning the whole set when you only want a small subset.
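Putting that together, a typical paging pattern is one COUNT query for the pager plus one page-sized query for the rows, both of which execute in the database as long as you stay on IQueryable; a sketch (pageIndex/pageSize are illustrative):

IQueryable<User> query = context.Users;   // stay on IQueryable - no ToList() yet

int totalCount = query.Count();           // translates to SELECT COUNT(*) on the server

List<User> page = query
    .OrderBy(u => u.Id)                   // paging needs a stable order
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();                            // only this page's rows are returned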
You could do a select count query first, but then some data might have been added or deleted in the meantime. Wrap it in a transaction and you'll have a huge locking issue.
Another way would be to get page 1, then fetch the rest in batches on another thread; when that finishes, update the count and the display. You could even show a running count to indicate progress in fetching all the data.
Why are you getting all the records at once?
If you looked at it as getting the first, last, previous and next page, and then dealt with the issues that arise from that approach, you'd have a scalable solution.
The reason Count is taking so long is that it has to pull all 10,000 records into the client in order to count them.
That's a solution that will scale very badly - imagine 100,000 or a million!
The other point is latency. Those 100,000 rows are a snapshot from the time the query was executed; refreshing isn't going to be popular, is it? Then when you do refresh, you have all the "display the page you were on" issues...
If you reworked it to get the next n records after the last one currently displayed, or before the first...
Then you could cache, refresh pages in the background, and even attempt to predict requests - say, fetch the next n pages in advance.
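For the "next n after the last one displayed" idea, a keyset-style sketch (assuming the rows have an increasing Id, and reusing the User/context.Users names from the earlier answer):

// Fetch the next page relative to the last row the user has seen,
// instead of re-counting and skipping from the start on every request.
List<User> nextPage = context.Users
    .Where(u => u.Id > lastSeenId)        // lastSeenId = Id of the last row currently displayed
    .OrderBy(u => u.Id)
    .Take(pageSize)
    .ToList();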

Using Linq to SQL, how do I find min and max of a column in a table?

I want to find the fastest way to get the min and max of a column in a table with a single LINQ to SQL round trip. I know this would work in two round trips:
int min = MyTable.Min(row => row.FavoriteNumber);
int max = MyTable.Max(row => row.FavoriteNumber);
I know I can use group, but I don't have a group by clause; I want to aggregate over the whole table! And I can't use .Min without grouping first. I did try this:
from row in MyTable
group row by true into r
select new {
min = r.Min(z => z.FavoriteNumber),
max = r.Max(z => z.FavoriteNumber)
}
But that crazy group clause seems silly, and the SQL it makes is more complex than it needs to be.
So, is there any way to just get the correct SQL out?
EDIT: These failed too: Linq to SQL: how to aggregate without a group by? ... a lame oversight by the LINQ designers if there's really no answer.
EDIT 2: I looked at my own solution (with the nonsensical constant group by clause) in SQL Server Management Studio's execution plan analysis, and it looks to me like it is identical to the plan generated by:
SELECT MIN(FavoriteNumber), MAX(FavoriteNumber)
FROM MyTable
so unless someone can come up with a simpler or equally good answer, I think I have to mark it as answered by myself. Thoughts?
As stated in the question, this method seems to actually generate optimal SQL code, so while it looks a bit squirrely in LINQ, it should be optimal performance-wise.
from row in MyTable
group row by true into r
select new {
min = r.Min(z => z.FavoriteNumber),
max = r.Max(z => z.FavoriteNumber)
}
I could find only this one, which produces somewhat clean SQL, though it is still not as effective as select min(val), max(val) from table:
var r =
    (from min in items.OrderBy(i => i.Value)
     from max in items.OrderByDescending(i => i.Value)
     select new { min, max }).First();
The SQL is:
SELECT TOP (1)
[t0].[Value],
[t1].[Value] AS [Value2]
FROM
[TestTable] AS [t0],
[TestTable] AS [t1]
ORDER BY
[t0].[Value],
[t1].[Value] DESC
Still, there is another option: use a single connection for both the min and max queries (see Multiple Active Result Sets (MARS)), or use a stored procedure...
I'm not sure how to translate it into C# yet (I'm working on it).
This is the Haskell version:
minAndMax :: Ord a => [a] -> (a,a)
minAndMax [x] = (x,x)
minAndMax (x:xs) = (min a x, max b x)
where (a,b) = minAndMax xs
The C# version should involve Aggregate somehow (I think).
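For what it's worth, an in-memory C# equivalent of that fold can be written with Aggregate (C# 7 tuples); note that this is LINQ to Objects, so it pulls the rows to the client and does not solve the single-SQL-round-trip problem:

// Single pass over the sequence, carrying (min, max) as the accumulator.
// On an empty sequence this returns (int.MaxValue, int.MinValue), so guard for that if needed.
var (min, max) = MyTable.AsEnumerable()
    .Select(row => row.FavoriteNumber)
    .Aggregate(
        (Min: int.MaxValue, Max: int.MinValue),
        (acc, n) => (Math.Min(acc.Min, n), Math.Max(acc.Max, n)));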
You could select the whole table and do your min and max operations in memory:
var cache = // select *
var min = cache.Min(...);
var max = cache.Max(...);
Depending on how large your dataset is, this might be the way to go to avoid hitting your database more than once.
A LINQ to SQL query is a single expression. Thus, if you can't express your query in a single expression (or don't like it once you do) then you have to look at other options.
Stored procedures, since they can contain multiple statements, enable you to accomplish this in a single round trip. You will either have two output parameters or select a result set with two rows. Either way, you will need custom code to read the stored procedure's result.
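If you do go the stored-procedure route, the client side is plain ADO.NET output parameters; a minimal sketch (the procedure and parameter names are made up):

// Hypothetical procedure dbo.GetFavoriteNumberRange with @Min and @Max OUTPUT parameters.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.GetFavoriteNumberRange", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    var minParam = command.Parameters.Add("@Min", SqlDbType.Int);
    minParam.Direction = ParameterDirection.Output;
    var maxParam = command.Parameters.Add("@Max", SqlDbType.Int);
    maxParam.Direction = ParameterDirection.Output;

    connection.Open();
    command.ExecuteNonQuery();

    int min = (int)minParam.Value;
    int max = (int)maxParam.Value;
}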
(I don't personally see the need to avoid two round-trips here. It seems like a premature optimization, especially since you will probably have to jump through hoops to get it working. Not to mention the time you will spend justifying this decision and explaining the solution to other developers.)
Put another way: you've already answered your own question. "I can't use the .Min without grouping first", followed by "that crazy group clause seems silly, and the SQL it makes is more complex than it needs to be", are clues that the simple and easily-understood two-round-trip solution is the best expression of your intent (unless you write custom SQL).
