I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.
What is the order of this LINQ query? Is there a more efficient way of doing this?
var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName);

IEnumerable result = MyValues.Distinct();
I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quicksort and should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still looking at O(n log n).
Update: the LINQ Distinct() call should also be O(n).
So altogether, the big-O for this thing is O(Kn log n) for some constant K, which is still just O(n log n).
Is there a more efficient way of doing this?
You could get better efficiency if you do the sort as part of the query that initializes MyDataTable, instead of sorting in memory afterwards.
from comments
I actually use MyDistinct.Distinct()
If you want distinct _varName values and you cannot do it all in the select query in the DBMS (which would be the most efficient way), you should apply Distinct before OrderBy. The order matters here.
Otherwise you would order all million rows before starting to filter out the duplicates; if you apply Distinct first, you only need to order what is left.
var values = from r in MyDataTable.AsEnumerable()
             select r.Field<double>(_varName);

IEnumerable<double> orderedDistinctValues = values.Distinct()
                                                  .OrderBy(d => d);
I recently asked a related question which Eric Lippert answered with a good explanation of when the order matters and when it does not:
Order of LINQ extension methods does not affect performance?
Here's a little demo where you can see that the order matters, but also that it hardly matters in this case, since comparing doubles is trivial for a CPU:
Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
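For reference, a minimal sketch of the kind of demo behind those timings (the sample size and value distribution are my assumptions, not the original benchmark):

using System;
using System.Diagnostics;
using System.Linq;

class OrderVsDistinctDemo
{
    static void Main()
    {
        var rnd = new Random(42);
        // 100,000 doubles with plenty of duplicates (rounded to two decimals)
        var data = Enumerable.Range(0, 100000)
                             .Select(_ => Math.Round(rnd.NextDouble() * 100, 2))
                             .ToArray();

        var sw = Stopwatch.StartNew();
        var orderedThenDistinct = data.OrderBy(d => d).Distinct().ToList();
        sw.Stop();
        Console.WriteLine("Time for first orderby then distinct: " + sw.Elapsed);

        sw.Restart();
        var distinctThenOrdered = data.Distinct().OrderBy(d => d).ToList();
        sw.Stop();
        Console.WriteLine("Time for first distinct then orderby: " + sw.Elapsed);
    }
}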
Your query above (LINQ) is fine if you want all of the million records and you have enough memory on a 64-bit OS.
If you look at the underlying command, the query would be translated to

Select <_varname> from MyDataTable order by <_varname>

and that is as good as running the same statement directly in a database IDE or on the command line.
To give you a short answer regarding performance:
put in a where clause if you can (on columns that are indexed)
ensure that the user can only choose columns (_varName) that are indexed; imagine the DB trying to sort a million records on an unindexed column, which is evidently slow, yet it is LINQ that ends up getting the bad press
ensure that (if possible) MyDataTable is initialised with only the records that are of value (again based on a where clause); see the sketch after this list
profile your underlying query
if possible, create stored procedures (debatable); you can create an entity model which includes stored procedures as well
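A minimal sketch of the third point, assuming SQL Server and plain ADO.NET; the connection string, the filter column and the parameter names are made up for illustration:

using System.Data;
using System.Data.SqlClient;

static class FilteredLoad
{
    public static DataTable LoadFilteredTable(string connectionString, string varName, object filterValue)
    {
        // varName comes from the user, so validate it against a whitelist of real
        // column names before interpolating it into the SQL text.
        var sql = $"SELECT [{varName}] FROM MyDataTable " +
                  "WHERE SomeIndexedColumn = @filter " +
                  $"ORDER BY [{varName}]";

        var table = new DataTable();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        using (var adapter = new SqlDataAdapter(cmd))
        {
            cmd.Parameters.AddWithValue("@filter", filterValue);
            adapter.Fill(table);   // Fill opens and closes the connection itself
        }
        return table;
    }
}

This way the filtering and ordering happen in the database, and only the rows of value reach the DataTable that the LINQ query runs over.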
That said, if your DB is not properly indexed, it may be fast enough today, but as the tablespace grows and your data is not ordered (indexed), that is where things get slower (even with a good LINQ expression).
Hope this helps
Related
I often use LINQ statements to query with EF, to filter data, or to search my data collections, but I have always had a doubt about which statement should come first.
Let's say we have a query similar to this:
var result = Data.Where(x => x.Text.StartsWith("ABC")).OrderBy(x => x.Id).Select(x => x.Text).Take(5).ToList();
The same query works even if the statements are in a different order, for example:
var result = Data.OrderBy(x => x.Id).Select(x => x.Text).Where(x => x.Text.StartsWith("ABC")).Take(5).ToList();
I understand that there are certain statements that do modify the expected result, but my doubt is about those that do not, as in the example above. Is there a prescribed order or any good-practice guide for this?
It will give you different results. Let's assume that you have the following ids:
6,5,4,3,2,1
The first statement will give you
1,2,3,4,5
and the second one
2,3,4,5,6
I assumed that the Text of all objects with these ids starts with ABC.
Edit: I think I haven't answered the question properly. Yes, there is a difference: in the first example you only sort the elements that pass the Where filter, whereas in the second one you order all elements, which is definitely slower.
Does a specified order or any good practice guide exist for this?
No, because the order determines what the result is. In SQL (a declarative language), SELECT always comes before WHERE, which comes before GROUP BY, etc., and the parsing engine turns that into an execution plan which will execute in whatever order the optimizer thinks is best.
Selecting, ordering, and grouping all happen on the data specified by the FROM clause(s), so the written order does not matter.
C# (within methods) is a procedural language, meaning that statements will be executed in the exact order that you provide them.
When you select, then order, the ordering applies to the selection, meaning that if you select a subset of fields (or project to different fields), the ordering applies to the projection. If you order, then select, the ordering applies to the original data, and the projection is then applied to the ordered data.
In your second edited example, the query seems to be broken because you are referencing properties that would be lost by the projection:
var result = Data.OrderBy(x => x.Id).Select(x => x.Text).Where(x => x.Text.StartsWith("ABC")).Take(5).ToList();
At the Select(x => x.Text) call you are projecting just the Text property, which I assume is a string, and thus the subsequent Where is working on a collection of strings, which has no Text property to filter on.
Certainly you could change the Where to filter the strings directly, but this illustrates that shifting the order of the calls can have a catastrophic impact on the query. In other cases it might not make a difference, as you are trying to illustrate: ordering then filtering should be logically equivalent to filtering then ordering (assuming one does not affect the other), and there is no "best practice" saying which should go first, so the right answer (if there is one) has to be determined on a case-by-case basis.
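For what it's worth, a sketch of what the reordered query would have to look like once the projection has reduced the sequence to strings (filtering the strings directly, as suggested above):

var result = Data.OrderBy(x => x.Id)
                 .Select(x => x.Text)
                 .Where(t => t.StartsWith("ABC"))
                 .Take(5)
                 .ToList();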
Which is faster?
Getting a list of some values (say, strings) from a LINQ query and then filtering out duplicates in C#, or selecting only the distinct values in the LINQ query itself?
Say we have
N rows if we keep duplicates,
and R rows if we filter them out,
with N >> R (there are many duplicates).
Basically I am asking which, in general, is faster and better programming:
selecting all N rows with LINQ, converting them to a list, and then filtering them down to R rows,
or directly selecting the R rows with LINQ and converting those to a list.
Note:
In SQL, the time taken to get the R rows is roughly twice what it takes to get the N rows in my case! But a generic answer is welcome.
I assume when you say Linq, you mean LINQ to SQL.
The rule of thumb when connecting to a database is to fetch only what you need; so if you have a good querying strategy, filtering on the LINQ to SQL side can save a lot of wasted work.
If the column you're filtering on happens to have a full-text index, you've hit the jackpot.
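To make the "only get what you need" point concrete, here is a hedged sketch (db, Products and Name are assumed names, not from the question):

// Distinct() composes into the generated SQL, so only the R distinct
// values cross the wire:
var distinctInDb = db.Products.Select(p => p.Name).Distinct().ToList();

// ToList() first pulls all N rows to the client, and the duplicates are
// then filtered out in memory:
var distinctInMemory = db.Products.Select(p => p.Name).ToList().Distinct().ToList();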
Look, your question is complex. What I mean:
1) Better programming is, most of the time, to use a ready built-in function.
2) Based on my experience, Distinct works faster, in MS SQL and in C#.
3) LINQ is lazy about filtering, especially when you have many items in your list; Distinct is optimized by Microsoft's developers.
Note: a similar question that may be useful.
Result: try to use the built-in functions your platform provides; there is plenty of information on the net, and you can skip paragraphs of coding just by calling a ready-made function.
Hope it helped.
How to optimize this query?
// This will return data ranging from 1 to 500,000 records
List<string> products = GetProductsNames();

List<Product> actualProducts = (from p in db.Products
                                where products.Contains(p.Name)
                                select p).ToList();
This code takes around 30 seconds to fill actualProducts when I send a list of 44,000 strings; I don't know what it would take for 500,000 records. :(
Any way to tweak this query?
NOTE: it takes almost this much time on every call (ignoring the first slow EDMX call).
An IN query on 500,000 records is always going to be a pathological case.
Firstly, make sure there is an index (probably non-clustered) on Name in the database.
Ideas (both involve dropping to ADO.NET):
use a "table valued parameter" to pass in the values, and INNER JOIN to the table-valued-parameter in TSQL
alternatively, create a table of the form ProductQuery with columns QueryId (which could be uniqueidentifier) and Name; invent a guid to represent your query (Guid.NewGuid()), and then use SqlBulkCopy to push the 500,000 pairs (the same guid on each row; different guids are different queries) into the table really quickly; then use TSQL to do an INNER JOIN between the two tables
Actually, these are very similar, but the first one is probably the first thing to try. Less to set up.
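A rough sketch of the second idea, assuming SQL Server and a pre-created staging table ProductQuery(QueryId uniqueidentifier, Name nvarchar); every name here is illustrative rather than taken from the question:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class ProductLookup
{
    public static DataTable FindProducts(string connectionString, IEnumerable<string> names)
    {
        // One query id per lookup; different guids are different queries.
        var queryId = Guid.NewGuid();

        var staging = new DataTable();
        staging.Columns.Add("QueryId", typeof(Guid));
        staging.Columns.Add("Name", typeof(string));
        foreach (var name in names)
            staging.Rows.Add(queryId, name);

        var results = new DataTable();
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Push the name/guid pairs into the staging table really quickly.
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "ProductQuery" })
            {
                bulk.WriteToServer(staging);
            }

            // Then let TSQL do the INNER JOIN between the two tables.
            // (Remember to delete the staging rows for this queryId afterwards.)
            using (var cmd = new SqlCommand(
                @"SELECT p.*
                  FROM Products p
                  INNER JOIN ProductQuery q ON q.Name = p.Name
                  WHERE q.QueryId = @queryId", conn))
            {
                cmd.Parameters.AddWithValue("@queryId", queryId);
                using (var adapter = new SqlDataAdapter(cmd))
                {
                    adapter.Fill(results);
                }
            }
        }
        return results;
    }
}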
If you don't want to use the database, you could try something with Dictionary<string,string>.
If I am not wrong, I suspect products.Contains(p.Name) is expensive since it is an O(n) operation. Try changing your GetProductsNames return type to Dictionary<string,string>, or convert the List to a Dictionary:
Dictionary<string, string> productsDict = products.ToDictionary(x => x);
Now that you have a dictionary in hand, rewrite the query as below:
List<Product> actualProducts = (from p in db.Products
                                where productsDict.ContainsKey(p.Name)
                                select p).ToList();
This should improve performance a lot (the disadvantage is that you allocate double the memory; the advantage is performance). I tested with very large samples with good results. Try it out.
Hope this helps.
You could also take a hashing approach, using the name column as the value passed to the hash function; then you could iterate the 500K set, hash each name, and test for existence in your local hash structure. This would require more code than a LINQ approach, but it might be considerably faster than repeated calls to the back end doing inner joins.
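One way to read that suggestion, sketched with an in-memory HashSet rather than a hash file (the names are assumptions): fetch the product names from the back end once, hash them locally, and then test each of the 500K input names without any further round trips.

// One round trip to get the existing names, hashed for O(1) lookups.
var dbNames = new HashSet<string>(db.Products.Select(p => p.Name));

// Iterate the 500K input names and keep only those that exist in the database.
var existingNames = products.Where(name => dbNames.Contains(name)).ToList();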
Quick LINQ performance question.
I have a database with many many records and it's used for a webshop.
All query logic and paging is done with LINQ, and it performs quite well.
This is because the usual search for products contains one or more where clauses, which shortens my result set to a couple of hundred results at most.
But.. there is an option to list all products (when no search criteria are provided), and that query is slow.. real slow. Even though I'm only asking for a single page with .Skip(20).Take(10), it's still slow because the total result is something like 140,000 products. Is there a way to limit this (or every) query so that the speed of the whole thing stays acceptable?
I don't want to force my customers to provide one or more criteria.. but on the other hand I have no problem with telling them that they can never see more than 2000 products.
Thanks for helping!
Tys
Why don't you limit the number of records on the SQL side, as described in this post:
http://www.sqlservercurry.com/2009/06/skip-and-take-n-number-of-records-in.html
Watch out for any "premature" enumerations when you pass down queries/results in your code!
There are also several LINQ visualizers available, which can help you see what the LINQ expressions actually translate to. Or you can play around with the expressions in LINQPad before integrating them into your code...
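A small sketch of the "premature enumeration" trap with Skip/Take, assuming an EF or LINQ to SQL context (the names are illustrative):

// Bad: ToList() executes the query first, so all ~140,000 products are
// materialized before the paging happens in memory.
var pageBad = db.Products.OrderBy(p => p.Id).ToList().Skip(20).Take(10).ToList();

// Good: Skip/Take stay part of the expression tree and are translated to SQL,
// so only the requested page is fetched.
var pageGood = db.Products.OrderBy(p => p.Id).Skip(20).Take(10).ToList();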
What you can do is have LINQ use a stored procedure from the database.
In that case it will be faster because the database engine does the work and returns the result to LINQ; the database engine is made for that, and it is closer to the data than LINQ.
I suggest you give it a try and give us feedback.
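If you go that route, one possible shape, assuming Entity Framework 6 and a stored procedure named dbo.GetProductsPage (both the procedure and its parameters are assumptions, not something from the question):

using System.Data.SqlClient;

// Maps the rows returned by the stored procedure onto the Product entity.
var page = db.Database.SqlQuery<Product>(
    "EXEC dbo.GetProductsPage @Skip, @Take",
    new SqlParameter("@Skip", 20),
    new SqlParameter("@Take", 10)).ToList();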
You can check what indexes the table has and what the PK is. It could be that the table has no index at all, so records are compared by field values. You can also catch the query in SQL Profiler, run it separately, and analyse its query plan.
I'm in the midst of trying to replace the Criteria queries I'm using for a multi-field search page with LINQ queries using the new LINQ provider. However, I'm running into a problem getting record counts so that I can implement paging. I'm trying to achieve a result equivalent to that produced by a CountDistinct projection from the Criteria API, but using LINQ. Is there a way to do this?
The Distinct() method provided by LINQ doesn't seem to behave the way I would expect, and appending ".Distinct().Count()" to the end of a LINQ query grouped by the field I want a distinct count of (an integer ID column) seems to return a non-distinct count of those values.
I can provide the code I'm using if needed, but since there are so many fields, it's
pretty long, so I didn't want to crowd the post if it wasn't needed.
Thanks!
I figured out a way to do this, though it may not be optimal in all situations. Just doing a .Distinct() on the LINQ query does, in fact, produce a "distinct" in the resulting SQL query when used without .Count(). If I cause the query to be enumerated by using .Distinct().ToList() and then use the .Count() method on the resulting in-memory collection, I get the result I want.
This is not exactly equivalent to what I was originally doing with the Criteria query, since the counting is actually being done in the application code, and the entire list of IDs must be sent from the DB to the application. In my case, though, given the small number of distinct IDs, I think it will work, and won't be too much of a performance bottleneck.
I do hope, however, that a true CountDistinct() LINQ operation will be implemented in the future.
You could try selecting the column you want a distinct count of first. It would look something like: Select(p => p.id).Distinct().Count(). As it stands, you're distincting the entire object, which compares the object references rather than the actual values.
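A minimal sketch of that suggestion, with assumed session and entity names (whether this actually becomes a COUNT(DISTINCT ...) on the server depends on the LINQ provider version):

// Project to the ID column first, then count the distinct values.
var distinctIdCount = session.Query<SearchResult>()
                             .Select(r => r.Id)
                             .Distinct()
                             .Count();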