I have created a single LINQ query that creates a main group and then two nested groups; within the innermost group there is also a simple OrderBy. The issue I am running into is that while writing or editing the query, Visual Studio's memory consumption skyrockets to ~500MB and it eats 50% of my CPU, which makes Visual Studio unresponsive for a few minutes. If I comment out the query, Visual Studio behaves just fine. So my question is: why is Visual Studio consuming so much memory at design time for a LINQ query, granted it is somewhat complicated?
The DataTable I am using is 10,732 rows by 21 columns.
var results = from p in m_Scores.AsEnumerable()
              group p by p.Field<string>("name") into x
              select new
              {
                  Name = x.Key,
                  Members = from z in x
                            group z by z.Field<string>("id") into zz
                            select new
                            {
                                Id = zz.Key,
                                Plots = from a in zz
                                        group a by a.Field<string>("foo") into bb
                                        select new
                                        {
                                            Foo = bb.Key,
                                            Bars = bb
                                        }.Bars.OrderBy(m => m.Field<string>("foo"))
                            }
              };
Hardware Specs:
Dell Latitude with a 2.20GHz dual-core processor and 4GB of RAM
The problem with groupings and orderings is that they require knowledge of the entire collection in order to perform the operation. The same goes for aggregates like Min, Max, Sum, Avg, etc. Some of these operations cannot interrogate the true type of the IEnumerable being passed (or it wouldn't matter if they could, since they are "destructive" in nature), so they have to create a working copy. When you chain these things together, you end up with at least two copies of the full enumerable: the one produced by the previous method that is being iterated by the current method, and the one being generated by the current method. Any copy of the enumerable that has a reference outside the query (like the source enumerable) also stays in memory, and enumerables that have become orphaned stick around until the GC thread has time to dispose of and finalize them. For a large source enumerable, all of this can create a massive demand on the heap.
In addition, nesting aggregates in clauses can very quickly make the query expensive. Unlike a DBMS, which can engineer a "query plan", Linq isn't quite that smart. Min(), for example, requires iteration of the whole enumerable to find the smallest value of the specified projection. When that's a criterion of a Where clause, a good DBMS will find that value once per context and then inline it in subsequent evaluations as necessary. Linq just runs the extension method each time it is called, and when you have a condition like enumerable.Where(x=>x.Value == enumerable.Min(x2=>x2.Value)), that's an O(N^2) operation just to evaluate the filter. Add multiple levels of grouping and the Big-O can easily reach high-polynomial complexity.
Generally, you can reduce query time by performing the optimizations that a DBMS would given the same query. If the value of an aggregate can be known for the full scope of the query (e.g. result = source.Where(s=>s.Value == source.Min(x=>x.value))), evaluate that into a variable using the let clause (or an external query), and replace Min() calls with the alias. Iterating the enumerable twice is generally cheaper than iterating it N^2 times, especially if the enumerable stays in memory between iterations.
Also, make sure your query order reduces the sample space as much and as cheaply as possible before you start grouping. You may be able to make educated guesses about conditions that must be evaluated expensively, such as Where(s=>s.Value < threshold).Where(s=>s.Value == source.Min(x=>x.Value)) or more concisely, Where(s=>s.Value < threshold && s.Value == source.Min(x=>x.Value)) (the second works in C# because of lazy condition evaluation, but not all languages evaluate lazily). That cuts the number of evaluations of Min() down to the number of elements meeting the first criteria. You can use existing criteria to do the same thing, wherever the criteria A and B are independent enough that A && B == B && A.
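For example, here is a minimal LINQ-to-Objects sketch of both ideas (the names GetValues and threshold are made up for illustration): compute the aggregate once into a local variable, and apply the cheap filter before the comparison against it.

// Hypothetical data; the point is that Min() runs once, not once per element.
List<int> values = GetValues();   // assumed helper returning the source collection
int threshold = 100;              // assumed cheap filter bound

int min = values.Min();           // O(N), evaluated a single time, outside the query

var result = values
    .Where(v => v < threshold)    // cheap criterion first, shrinks the sample space
    .Where(v => v == min)         // compares against the hoisted aggregate
    .ToList();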
I am just wondering which order gives the best performance when I use both OrderBy() and Distinct() in a LINQ query. It seems to me they're both equal in speed, as the Distinct() method will use a hash table when running in memory, and I assume that any SQL query would be optimized first by .NET before it gets executed.
Am I correct in assuming this, or does the order of these two commands still affect the performance of LINQ in general?
As for how it would work: when you build a LINQ query over an IQueryable, you're building an expression tree, but nothing gets executed yet. So calling MyList.Distinct().OrderBy() just builds the query without executing it (it's deferred). Only when you call something like ToList() does the query actually run, and at that point a query provider can translate and optimize the expression tree. (For plain LINQ to Objects there is no expression tree, just a chain of deferred iterators, so no such optimization happens.)
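A small LINQ-to-Objects illustration of that deferral (the list and its values are made up here):

var numbers = new List<int> { 3, 1, 3, 2 };

var query = numbers.Distinct().OrderBy(n => n); // nothing is enumerated yet

numbers.Add(0);                                 // still visible to the query below

var materialized = query.ToList();              // execution happens here: 0, 1, 2, 3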
For LINQ to Objects, even if we assume that OrderBy(...).Distinct() and Distinct().OrderBy(...) will return the same result (which is not guaranteed), the performance will depend on the data.
If you have a lot of duplication in the data, running Distinct first should be faster. The following benchmark shows that (at least on my machine):
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class LinqBench
{
    private static List<int> test = Enumerable.Range(1, 100)
        .SelectMany(i => Enumerable.Repeat(i, 10))
        .Select((i, index) => (i, index))
        .OrderBy(t => t.index % 10)
        .Select(t => t.i)
        .ToList();

    [Benchmark]
    public List<int> OrderByThenDistinct() => test.OrderBy(i => i).Distinct().ToList();

    [Benchmark]
    public List<int> DistinctThenOrderBy() => test.Distinct().OrderBy(i => i).ToList();
}
On my machine for .Net Core 3.1 it gives:
Method              |      Mean |    Error |   StdDev
--------------------|-----------|----------|---------
OrderByThenDistinct | 129.74 us | 2.120 us | 1.879 us
DistinctThenOrderBy |  19.58 us | 0.384 us | 0.794 us
First, seq.OrderBy(...).Distinct() and seq.Distinct().OrderBy(...) are not guaranteed to return the same result, because Distinct() may return an unordered enumeration. The MS implementation conveniently preserves the order, but if you pass a LINQ query to the database, the results may come back in any order the DB engine sees fit.
Second, in the extreme case when you have lots of duplication (say, five values repeated randomly 1,000,000 times) you would be better off doing a Distinct before OrderBy().
Long story short, if you want your results to be ordered, use Distinct().OrderBy(...) regardless of the performance.
"I assume that any SQL query would be optimized first by .NET before it gets executed."
And how do you think that would work, given that:
Only the SQL-executing side (the server) has the knowledge needed for this (e.g. which indices to use) AND has a query optimizer that is supposed to optimize the executed query based on the statistics of the tables.
You have to be VERY sure that you do not change the result in any way.
Sorry, this makes no sense - there are pretty much no optimizations that you CAN safely do in C# without having all the internal details of the database, so the query is sent to the database for analysis.
As such, an OrderBy or a Distinct (ESPECIALLY a Distinct) WILL impact performance - how much depends on, e.g., whether the OrderBy can rely on an index.
"or does the order of these two commands still affect the performance of LINQ in general?"
Here it gets funny (and you give no example).
DISTINCT and ORDER BY appear in SQL in a specific order, regardless of how you formulated it in LINQ; there is only ONE allowed syntax as per the SQL definition. LINQ puts the query together and emits that form. If you look at the syntax, there is a specific place for the DISTINCT (which is the SQL term, at least for SQL Server) and for the ORDER BY.
On the other side...
.Distinct().OrderBy() and .OrderBy().Distinct()
have DIFFERENT RESULTS. They CAN both be expressed in SQL (you can use the output of the Distinct as a virtual table that you then order), but they have different semantics. Unless you think that LINQ will magically read your mind, the compiler has no choice but to assume you are competent in writing what you wrote (as long as it is legal) and to execute these steps in the order you gave.
Except: the DOCUMENTATION for Distinct on Queryable is clear that this is not done:
https://learn.microsoft.com/en-us/dotnet/api/system.linq.queryable.distinct?redirectedfrom=MSDN&view=net-5.0#System_Linq_Queryable_Distinct__1_System_Linq_IQueryable___0__
It says that Distinct returns an unordered sequence.
So there is a fundamental difference, and they are not the same.
I have a full outer join query pulling data from a SQL Compact database (I use EF6 for mapping):
var query =
    from entry in left.Union(right).AsEnumerable()
    select new
    {
        ...
    } into e
    group e by e.Date.Year into year
    select new
    {
        Year = year.Key,
        Quartals = from x in year
                   group x by (x.Date.Month - 1) / 3 + 1 into quartal
                   select new
                   {
                       Quartal = quartal.Key,
                       Months = from x in quartal
                                group x by x.Date.Month into month
                                select new
                                {
                                    Month = month.Key,
                                    Contracts = from x in month
                                                group x by x.Contract.extNo into contract
                                                select new
                                                {
                                                    ExtNo = contract.Key,
                                                    Entries = contract,
                                                }
                                }
                   }
    };
As you can see, I use nested groups to structure the results.
The interesting thing is that if I remove the AsEnumerable() call, the query takes 3.5x more time to execute: ~210ms vs ~60ms. And when it runs for the first time the difference is much greater: 39,000(!)ms vs 1,300ms.
My questions are:
1. What am I doing wrong? Maybe those groupings should be done in a different way?
2. Why does the first execution take so much time? I know expression trees have to be built etc., but 39 seconds?
3. Why is LINQ to db slower than LINQ to entities in my case? Is it generally slower, and is it better to load data from the db if possible before processing?
Thanks!
To answer your three questions:
1. Maybe those groupings should be done in a different way?
No. If you want nested groupings you can only do that by groupings within groupings.
You can group by multiple fields at once:
from entry in left.Union(right)
select new
{
    ...
} into e
group e by new
{
    e.Date.Year,
    Quartal = (e.Date.Month - 1) / 3 + 1,
    e.Date.Month,
    Contract = e.Contract.extNo
} into grp
select new
{
    Year = grp.Key.Year,
    Quartal = grp.Key.Quartal,
    Month = grp.Key.Month,
    ExtNo = grp.Key.Contract,
    Entries = grp
}
This will remove a lot of complexity from the generated query so it's likely to be (much) faster without AsEnumerable(). But the result is quite different: a flat group (Year, Quartal, etc, in one row), not a nested grouping.
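If you still need the nested shape for presentation, one option is to let the database do the flat grouping and then rebuild the hierarchy in memory. A sketch, assuming the flat query above is materialized into a variable called flatRows (a name introduced here for illustration):

var nested =
    from r in flatRows.ToList()            // one round trip, then LINQ to Objects
    group r by r.Year into year
    select new
    {
        Year = year.Key,
        Quartals = from q in year
                   group q by q.Quartal into quartal
                   select new
                   {
                       Quartal = quartal.Key,
                       Months = from m in quartal
                                group m by m.Month into month
                                select new { Month = month.Key, Contracts = month }
                   }
    };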
2. Why does the first execution take so much time?
Because the generated SQL query is probably pretty complex and the database engine's query optimizer can't find a fast execution path.
3a. Why is LINQ to db slower than LINQ to entities in my case?
Because, apparently, in this case it's much more efficient to fetch the data into memory first and do the groupings with LINQ to Objects. This effect will be more significant if left and right represent more or less complex queries themselves. In that case, the generated SQL can get hugely bloated, because it has to process two sources of complexity in one statement, which may lead to many repeated identical subqueries. By outsourcing the grouping, the database is probably left with a relatively simple query, and of course the grouping in memory is never affected by the complexity of the SQL query.
3b. Is it generally slower, and is it better to load data from the db if possible before processing?
No, not generally. I'd even say hardly ever. In this case it is because (as far as I can see) you don't filter the data. If, however, the part before AsEnumerable() returned millions of records and you applied the filtering afterwards, the query without AsEnumerable() would probably be much faster, because the filtering would be done in the database.
Therefore, you should always keep an eye on generated SQL. It's unrealistic to expect that EF will always generate a super optimized SQL statement. It hardly ever will. Its primary focus is on correctness (and it does an exceptional job there), performance is secondary. It's the developer's job to make LINQ-to-Entities and LINQ-to-object work together as a slick team.
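In EF6 one convenient way to inspect the generated SQL is the Database.Log hook. A sketch (MyContext stands in for your own DbContext type, and query for whatever LINQ query you are investigating; Debug comes from System.Diagnostics):

using (var ctx = new MyContext())
{
    ctx.Database.Log = sql => Debug.WriteLine(sql); // writes the generated SQL and timings
    var result = query.ToList();                    // run the query, then read the output window
}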
Using AsEnumerable() will convert a type that implements IEnumerable<T> to IEnumerable<T> itself.
Read this topic https://msdn.microsoft.com/en-us/library/bb335435.aspx
AsEnumerable<TSource>(IEnumerable<TSource>) can be used to choose between query implementations when a sequence implements IEnumerable<T> but also has a different set of public query methods available. For example, given a generic class Table that implements IEnumerable<T> and has its own methods such as Where, Select, and SelectMany, a call to Where would invoke the public Where method of Table. A Table type that represents a database table could have a Where method that takes the predicate argument as an expression tree and converts the tree to SQL for remote execution. If remote execution is not desired, for example because the predicate invokes a local method, the AsEnumerable<TSource> method can be used to hide the custom methods and instead make the standard query operators available.
When you invoke AsEnumerable() first, the rest of the query is not converted to SQL; instead the table is loaded into memory as the Where enumerates it. Since it is now in memory, its execution is faster.
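In other words, the general pattern (a sketch based on the query in the question; fromDate/toDate and byYear are placeholder names) is to keep whatever the provider can translate, such as the Union and any filters, before AsEnumerable(), and do the nested grouping in memory afterwards:

var entries = left.Union(right)
    // .Where(e => e.Date >= fromDate && e.Date < toDate)  // filter in the database if you can
    .AsEnumerable();                                        // everything below runs in memory

var byYear = entries.GroupBy(e => e.Date.Year);             // LINQ to Objects from here on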
In my C# class library project I have a method GetFaultRate that needs to compute a statistic: given a date, it computes the number of products with faults over the number of products produced.
float GetFaultRate(DateTime date)
{
    var products = GetProducts(date);
    var faultyProducts = GetFaultyProducts(date);
    var rate = (float) (faultyProducts.Count() / products.Count());
    return rate;
}
Both methods, GetProducts and GetFaultyProducts, take the data from a repository class _productRepository.
IEnumerable<Product> GetProducts(DateTime date)
{
    var products = _productRepository.GetAll().ToList();
    var periodProducts = products.Where(p => CustomFunction(p.ProductionDate) == date);
    return periodProducts;
}

IEnumerable<Product> GetFaultyProducts(DateTime date)
{
    var products = _productRepository.GetAll().ToList();
    var periodFaultyProducts = products.Where(p => CustomFunction(p.ProductionDate) == date && p.Faulty == true);
    return periodFaultyProducts;
}
Where GetAll has signature:
IQueryable<Product> GetAll();
There are many products in the database, and it takes a lot of time to retrieve them and convert them with ToList(). I need to enumerate the collection in memory, since a custom function such as CustomFunction cannot be executed on an IQueryable<T>.
My application gets stuck for a long time before obtaining the fault rate. I guess it is because of the large number of objects being retrieved. I could remove the two functions GetProducts and GetFaultyProducts and implement the logic inside GetFaultRate, but since I have other functions that use GetProducts and GetFaultyProducts, the latter solution gives me only one database access at the price of a lot of duplicate code.
What can be a good compromise?
First off, don't convert the IQueryable to a list. It forces the entire data set to be brought into memory all at once, rather than just calling Where directly on the query which will allow you to filter the data as it comes in. This will substantially decrease your memory footprint, and (very) marginally increase the runtime speed. If you need to convert an IQueryable to an IEnumerable so that the Where isn't executed by the database simply use AsEnumerable.
Next, getting all of the data is something you should avoid if at all possible, especially multiple times. You'd need to show us what your date function does, but it's possible that it is something that could be done on the database. Any filtering you can do at all at the database will substantially increase performance.
Next, you really don't need two queries here. The second query is just a subset of the first, so if you know that you'll always be using both queries then you should just perform the first query, bring the results into memory (i.e. with a ToList that you store) and then use a Where on that to filter the results further. This will avoid another database trip as well as all of the data processing/filtering.
If you won't always be using both queries, but will sometimes use just one or the other, then you can improve the second query by filtering out on Faulty before getting all items. Add Where(p => p.Faulty) before you call AsEnumerable and filter on the date information after calling AsEnumerable (and that's if you can't convert any of the date filtering to filtering that can be done at the database).
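A sketch of that suggestion, keeping the question's names, under the assumption that the Faulty comparison can be translated to SQL while CustomFunction cannot:

IEnumerable<Product> GetFaultyProducts(DateTime date)
{
    return _productRepository.GetAll()
        .Where(p => p.Faulty)                                   // runs in the database
        .AsEnumerable()                                         // switch to LINQ to Objects
        .Where(p => CustomFunction(p.ProductionDate) == date);  // runs in memory
}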
It appears that in the end you only need to compute the ratio of items that are faulty as compared to the total. That can easily be done with a single query, rather than two.
You've said that Count is running really slowly in your code, but that's not really true. Count is simply the method that is actually enumerating your query, whereas all of the other methods were simply building the query, not executing it. However, you can cut your performance costs drastically by combining the queries entirely.
var lookup = _productRepository.GetAll()
    .AsEnumerable() // if at all possible, try to re-write the `Where`
                    // to be a valid SQL query so that you don't need this call here
    .Where(p => CustomFunction(p.ProductionDate) == date)
    .ToLookup(product => product.Faulty);

int totalCount = lookup[true].Count() + lookup[false].Count();
double rate = lookup[true].Count() / (double) totalCount;
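And if the date comparison can be rewritten into something the provider can translate (for example a plain range check instead of CustomFunction; that is an assumption, since we haven't seen what it does), the whole ratio can be computed in the database. A sketch, where dayStart and dayEnd are hypothetical bounds:

var all = _productRepository.GetAll()
    .Where(p => p.ProductionDate >= dayStart && p.ProductionDate < dayEnd);

int total  = all.Count();               // translated to a COUNT query
int faulty = all.Count(p => p.Faulty);  // a second COUNT query with the extra predicate
float rate = total == 0 ? 0f : (float)faulty / total;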
var products = GetProducts(date);

var periodFaultyProducts = (from p in products.AsParallel()
                            where p.Faulty == true
                            select p).AsEnumerable();
You need to reduce the number of database requests. ToList, First, FirstOrDefault, Any, Take and Count force your query to run against the database. As Servy pointed out, AsEnumerable converts your query from IQueryable to IEnumerable. If you have to find subsets you can use Where.
I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
                            r.year == record.year &&
                            r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O, which can build indexes; however, it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important, since a lot of the original queries take advantage of the fact that a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to leave it out when generating the keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
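Put together, the pattern looks something like this (a sketch; records stands in for whatever you are iterating over in your loop):

var lookup = cost.ToLookup(r => new { r.year, r.period });  // build once

foreach (var record in records)                             // the 100K-200K item loop
{
    var key = new { record.year, record.period };
    string lookForThis = record.form.TrimEnd();
    var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
    // ... use match ...
}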
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc.
struct Key
{
    public int year;
    public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
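If you'd rather not take a dependency, a minimal prefix trie is not much code. Here's a rough sketch (simplified, not production-hardened): Insert stores an item under its key, and StartsWith walks down the prefix and returns every item whose key begins with it.

using System.Collections.Generic;

class PrefixTrie<T>
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<T> Items = new List<T>();
    }

    private readonly Node _root = new Node();

    public void Insert(string key, T item)
    {
        var node = _root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
        }
        node.Items.Add(item);
    }

    // All items whose key starts with the given prefix.
    public IEnumerable<T> StartsWith(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                yield break;                 // no key begins with this prefix

        foreach (var item in Collect(node))
            yield return item;
    }

    private static IEnumerable<T> Collect(Node node)
    {
        foreach (var item in node.Items)
            yield return item;
        foreach (var child in node.Children.Values)
            foreach (var item in Collect(child))
                yield return item;
    }
}

Keyed per {year, period} (e.g. a dictionary from the key type above to a PrefixTrie of records), this replaces the linear StartsWith scan with a walk proportional to the prefix length plus the number of matches.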
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.
I have a set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations) and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I insert (using SqlBulkCopy) a million rows.
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
            where z.CODE == lookupCode &&
                  z.VALIDFROM <= lookupDate &&
                  z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to lookup the id based on the code.
But now with searching on the combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
            where z.VALIDFROM <= lookupDate &&
                  z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, that it gets added to the List<Code> collection in the dict corresponding to the lookupCode (and if one doesn't exist, then create it).
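A sketch of that, assuming a Code type with a string CODE property (newCode stands for a newly created Code object):

// Build once from the cached list:
Dictionary<string, List<Code>> dict = l_z
    .GroupBy(z => z.CODE)
    .ToDictionary(g => g.Key, g => g.ToList());

// Adding a new Code object later:
if (!dict.TryGetValue(newCode.CODE, out var list))
    dict[newCode.CODE] = list = new List<Code>();
list.Add(newCode);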
A simple improvement would be to use...
// in initialization somewhere
ILookup<string, T> l_z_lookup = l_z.ToLookup(z => z.CODE);

// your repeated code:
var z_fk = (from z in l_z_lookup[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
You could further use a more complex, smarter data structure storing dates in sorted fashion and use a binary search to find the id, but this may be sufficient. Further, you speak of SqlBulkCopy - if you're dealing with a database, perhaps you can execute the query on the database, and then simply create the appropriate index including columns CODE, VALIDUNTIL and VALIDFROM.
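A sketch of that "smarter data structure" idea, under the assumptions that the validity periods of a given code do not overlap and that id is an int: keep each code's entries sorted by VALIDFROM, binary-search for the last entry starting on or before the lookup date, and then check VALIDUNTIL.

// Built once:
var byCode = l_z
    .GroupBy(z => z.CODE)
    .ToDictionary(g => g.Key, g => g.OrderBy(z => z.VALIDFROM).ToList());

int? FindId(string lookupCode, DateTime lookupDate)
{
    if (!byCode.TryGetValue(lookupCode, out var entries))
        return null;

    // Last entry with VALIDFROM <= lookupDate.
    int lo = 0, hi = entries.Count - 1, best = -1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (entries[mid].VALIDFROM <= lookupDate) { best = mid; lo = mid + 1; }
        else hi = mid - 1;
    }

    return best >= 0 && entries[best].VALIDUNTIL >= lookupDate
        ? entries[best].id
        : (int?)null;
}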
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing DateTimes or some primitive value (like a time_t)? Think about how your data types affect performance. Choose the best ones.
Should you really be doing this in memory, or should you be putting all these rows into SQL and querying them there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code, i.e. a Dictionary<code, List<event>> - if you have many evenly distributed codes, your lists will each be about N/M in size (where N = total event count and M = number of codes). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further: the List could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly (this of course has a trade-off in the time spent building the collection). This should provide very quick lookups.
You could partition based on date and just store all the data in a single list sorted by start date and use a binary search to find the start date then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events that started between 12:00 and 12:01 might be in one bucket, and so on. If you have a very small number of codes and a lot of highly frequent (but not pathologically so) events, this might give you very good lookup performance (a sketch follows below).
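A sketch of that bucketing idea, assuming a hypothetical Event type with a Start property, a source sequence called events, one-minute buckets, and probeTime as the time being looked up:

// Build: one bucket per minute of start time.
var buckets = events
    .GroupBy(e => new DateTime(e.Start.Year, e.Start.Month, e.Start.Day,
                               e.Start.Hour, e.Start.Minute, 0))
    .ToDictionary(g => g.Key, g => g.ToList());

// Lookup: truncate the probe time the same way and search only that bucket.
DateTime t = probeTime;
var key = new DateTime(t.Year, t.Month, t.Day, t.Hour, t.Minute, 0);
var candidates = buckets.TryGetValue(key, out var bucket) ? bucket : new List<Event>();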
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
To me this sounds like a situation where it could all happen in the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.