I have an app that retrieves data requested by the user. All parameters except Type are optional. If a parameter is not specified, all items are retrieved. If it is specified, only items corresponding that parameter are retrieved. For example, here I retrieve products by year of release (-1 is the default value, if the user hasn't specified one):
var products = context.Products.Where(p => p.type == Type).ToList();
if (!(Year == -1))
products = products.Where(p => p.year == Year).ToList();
This works perfectly fine for some of the years. E.g., if I search 2001, I get all entries needed. But since products has a limited size and only retrieves 1500 entries, later years are simply not retrieved, not in the products list, and it comes up as no data for that year, even though there is data in the DB.
How can I get around this problem?
One of the nice things about deferred execution on LINQ is it can help make code that has variable filtering rules a lot more neat and readable. If you're not sure what deferred execution is, in a nutshell it's a mechanism that only runs the LINQ query when you ask for the results rather than when you make the statements that comprise the query.
In essence this means we can have code like:
//always adults
var p = person.Where(x => x.Age > 18);
//we maybe filter on these
if(email != null)
p = p.Where(x => x.Email == email);
if(socialSN != null)
p = p.Where(x => x.SSN == socialSN);
var r = p.ToList(); //the query is only actually run now
The multiple calls to where here are cumulative; they will conceptually build a where clause but not execute the query until ToList is called. At this point, if a database is in use then the db sees the query with all its Where clauses and can leverage indexes and statistics
If we were to use ToList after every Where, then the first Where would hit the db and it's whole dataset would download to the client app, and the runtime would set about converting an enumerable to a list (a lot of copying and memory allocating). The subsequent Where would filter the list in the client app, enumerating it but then converting it to a list again - the big problem being its done in the memory of the client app as some naive unindexed loop, and all those millions of dollars of r&d Microsoft poured into making their SQL Server query optimizer pull huge amounts of data very quickly, are wasted :)
Consider also that that first clause in my example set- Age>18 could be huge; a million people of a spread of ages over age 12, for example - A large amount of data is true for that predicate. Email or SSN would be a far smaller dataset, probably indexed etc. It's a contrived example sure but hopefully well illustrates the point about performance; by ToList()ing too early we end up downloading too much data
I am very new to Mongo (this is actually day 1) and using the C# driver that is available for it. One thing that I want to know (as I am not sure how to word it in Google) is how does mongo handle executing queries when I want to grab a part of the collection.
What I mean by this is that I know that with NHibernate and EF Core, the query is first built and it will only fire when you cast it. So say like an IQueryable to IEnnumerable, .ToList(), etc.
Ex:
//Query is fired when I call .ToList, until that point it is just building it
context.GetLinqQuery<MyObject>().Where(x => x.a == 'blah').ToList();
However, with Mongo's examples it appears to me that if I want to grab a filtered result I will first need to get the collection, and then filter it down.
Ex:
var collection = _database.GetCollection<MyObject>("MyObject");
//Empty filter for ease of typing for example purposes
var filter = Builders<MyObject>.Filter.Empty;
var collection.Find(filter).ToList();
Am I missing something here, I do not think I saw any overload in the GetCollection method that will accept a filter. Does this mean that it will first load the whole collection into memory, then filter it? Or will it still be building the query and only execute it once I call either .Find or .ToList on it?
I ask this because at work we have had situations where improper positioning of .ToList() would result is seriously weak performance. Apologies if this is not the right place to ask.
References:
https://docs.mongodb.com/guides/server/read_queries/
The equivalent to your context.GetLinqQuery<MyObject>() would be to use AsQueryable:
collection.AsQueryable().Where(x => x.a == "blah").ToList();
The above query will be executed server side* and is equivalent to:
collection.Find(Builders<MyObject>.Filter.Eq(x => x.a, "blah")).ToEnumerable().ToList();
* The docs state that:
Only LINQ queries that can be translated to an equivalent MongoDB query are supported. If you write a LINQ query that can’t be translated you will get a runtime exception and the error message will indicate which part of the query wasn’t supported.
Background: We are using ElasticSearch separate from our mongo to search for things. Once the query returns a list of id's we need to pull the items from mongo and have them sorted in the order they came from ES.
Question: How do I find a list of id's then sort the results to be in the same order that the list is. It would maybe be something like this:
// NOTE: The following code does not work.
JobsCollection
.Find(Builders<JobModel>.Filter.In(j => j.Id, idList))
.OrderBy(j => idList.IndexOf(j.Id)) // <-- Something like this
I may just sort it after I get the objects if I can't figure it out, but if the mongo driver can do it for me, that would be cool.
There is no a good way to make a sorting you described on MongoDB server side. And even if it was possible I doubt you gain some performance improvement comparing to client side sorting. The most time will be spent for reading the data from DB and transfering to the client, which are equal for server and client sorting.
Your LINQ solution looks good for me. However it uses List.IndexOf() for each list element that gives you complexity of O(n^2). This LINQ could show poor performance on big lists.
I propose to improve it and create a dictionary that maps id value to its index in the list:
var idMap = idList.Select((x, i) => new { Value = x, Index = i })
.ToDictionary(x => x.Value, x => x.Index);
var sorted = JobsCollection
.Find(Builders<JobModel>.Filter.In(j => j.Id, idList))
.ToList()
.OrderBy(j => idMap[j.Id]);
This will give you complexity of O(n * log(n)) which is far better.
UPDATE (regarding not working code from the question):
I believe by not working code you mean that it could not be compiled. The problem is that Find method returns IFindFluent<TDocument, TDocument>. IFindFluent is an cursor and you can't use it in LINQ expressions directly. To make your code compile and work add .ToEnumerable() call on Find result:
JobsCollection
.Find(Builders<JobModel>.Filter.In(j => j.Id, idList))
.ToEnumerable()
.OrderBy(j => idList.IndexOf(j.Id));
I have a requirement to filter a list of Clients based on if they haven't had any jobs booked in the last x months. In my code I have two lists, one is my Clients and the other is a filtered List of Jobs between today and x months ago and the idea is to filter Clients based on their id not appearing in the jobs list. I tried the following:
filteredClients.Where(n => jobsToSearch.Count(j => j.Client == n.ClientID) == 0).ToList();
But I seem to get ALL clients regardless. I can easily do a foreach but this severly slows down the process. How can I filter the client list based on the job list effectively?
The main thing you're doing wrong is that you don't assign your results back to something. That's why your original seemed to keep all clients. But we can still improve on the original:
filteredClients = filteredClients.Where(n => !jobsToSearch.Any(j => j.Client == n.ClientId)).ToList();
The difference between this and your .Count() solution is that .Any() can stop looking at the jobs list with each client as soon as it encounters the first match, so it should run a bit faster. But we're not done yet. We can do even better by narrowing the jobs list down to only distinct clients:
var badClients = jobsToSearch.Select(j => j.Client).Distinct().ToList();
filteredClients = filteredClients.Where(n => !badClients.Any(j => j == n.ClientId)).ToList();
And likely even better still by using a HashSet, which can make O(1) lookups like a Dcitionary. Assuming the client ID is an int:
var badClients = new HashSet<int>(jobsToSearch.Select(j => j.Client));
filteredClients = filteredClients.Where(n => !badClients.Contains(n.ClientId)).ToList();
Whether this last option performs better depends on the number of clients that have jobs... if the list is short, the .Distinct() might still do better.
Finally, I don't normally recommend calling .ToList() like this. As much as possible, save actually realizing a List, Array, or collection type until the last possible moment, and just keep it to an Enumerable for as long as possible.
did you thought about using "groupby"?
without checking the syntax and writing code out of my mind (havnt vs available atm):
var groupedJobs = jobsearch.GroupBy(job => job.Client);
var itemsWithJobs = filteredList.Where(item => groupedJobs.ContainsKey(item.ClientID));
I can check the syntax tomorrow morning.
The biggest pro of this is, that you have build an Dictonary which is much much faster to search in it. Than to iterate though lists.
To filter the clients who are in IdList;
List1.Where(x=> IdList.Contains(x.ClientId));
To filter the clients who are not in IdList;
List1.Where(x=> !IdList.Contains(x.ClientId));
I have this one gigantic complex LINQ to SQL query that I need to optimize somehow, because the background C# compiler completely hogs the CPU and I can't type or edit my .cs file normally in Visual Studio 2010 (every letter, especially if IntelliSense wants to pop up, lags horribly).
The culprit is this:
var custFVC =
(from cfvc in customer.CustomerFrameVariationCategories
let lastValue = cfvc.CustomerFrameVariationCategoryValueChanges.Where(cfvcvc => cfvcvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(cfvcvc2 => cfvcvc2.ChangeDateTime).FirstOrDefault() ?? new CustomerFrameVariationCategoryValueChange()
let lastValue2 = cfvc.FrameVariationCategory.FrameVariation.Frame.FrameValueChanges.Where(fvc => fvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(fvc2 => fvc2.ChangeDateTime).FirstOrDefault() ?? new FrameValueChange()
let lastValue3 = cfvc.FrameVariationCategory.FrameVariationCategoryValueChanges.Where(fvcvc => fvcvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(fvcvc2 => fvcvc2.ChangeDateTime).FirstOrDefault() ?? new FrameVariationCategoryValueChange()
let lastValue4 = cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Any(fvm => (fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleValueChange()).IsActive == false)
where lastValue.IsActive == true
orderby cfvc.FrameVariationCategory.FrameVariation.Frame.Name, cfvc.FrameVariationCategory.Category.Name, cfvc.FrameVariationCategory.FrameVariation.Name
select new
{
cfvc.Id,
cfvc.FrameVariationCategory,
lastValue.CoverCoefficient,
lastValue.NeiserNet,
PlywoodName = lastValue2.Plywood.Name,
FrameIsActive = lastValue2.IsActive,
OwnCost = cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Sum(fvm => // sum all frame variation modules
(lastValue4 ? 0 : fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= this.SelectedDateTime) // if module not active then 0
.OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault().Porolone) + // otherwise get Porolone
fvm.FrameModule.FrameModuleComponents.Sum(fmc => // add to Porolone sum of all module components
(fmc.Article.ArticleDetails.Any() ? fmc.Article.ArticleDetails.Sum(ad => // if any article details then use A*L*W*T instead of Amount
WindowExcel.MultiplyArticleDetailValues(ad.ArticleDetailValueChanges.Where(advc => advc.ChangeDateTime <= this.SelectedDateTime)
.OrderByDescending(advc2 => advc2.ChangeDateTime).FirstOrDefault() ?? new ArticleDetailValueChange())) :
WindowExcel.GetModuleComponentAmount(fmc.FrameModuleComponentValueChanges.Where(fmcvc => fmcvc.ChangeDateTime <= this.SelectedDateTime) // no details = get amount
.OrderByDescending(fmcvc2 => fmcvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleComponentValueChange())) * // times article values
WindowExcel.MultiplyArticleValues(fmc.Article.ArticleValueChanges.Where(avc => avc.ChangeDateTime <= this.SelectedDateTime)
.OrderByDescending(avc2 => avc2.ChangeDateTime).FirstOrDefault() ?? new ArticleValueChange()))),
Cubes = cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Sum(fvm => (fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= this.SelectedDateTime && fmvc.IsActive == true).OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleValueChange()).Cubes),
lastValue3.CoverNet,
lastValue3.CoverGarbage,
lastValue3.CoverGross,
lastValue3.CoverPrice,
lastValue3.BackgroundNet,
lastValue3.BackgroundGarbage,
lastValue3.BackgroundGross,
lastValue3.BackgroundPrice,
FVCIsActive = lastValue3.IsActive,
FrameModuleAnyNonActive = lastValue4
}).ToList();
The biggest problem here is OwnCost, everything up to and after that Visual Studio can handle. I don't want to turn off background compiling (a feature that checks for compile time errors before actually compiling), I don't want to create a stored procedure. I can't off this code into a separate class/method, because the LINQ DataContext can't be passed around (as far as I know - also take into consideration that the context variable is inside a using statement).
The only vague idea that I have, is some sort of an extension method, or a method that returns a LINQ query or something like that. Because I don't know what exactly it is that I can do here to rectify the problem, I don't know how to formulate the wording, thus I can't google it...
How can I move (or optimize) OwnCost or the entire query out of the current .cs file, or perhaps split it into a method within the same file (might help the background compiler), or "something"...?
My first instinct as that you're trying to make LINQ to SQL do the work of a stored procedure. But that may be incorrect; it's pretty difficult to tell if it would even be possible for a stored procedure to do this.
My second instinct is that it should be possible to split the OwnCost calculation into a function, so that this query just contains
OwnCost = cfvc.Select(CalculateOwnCost)
My third instinct, on seeing that the calculation includes a WindowExcel object, is to flee screaming, but I'm going to take a couple of deep breaths and ask, are you in fact interoperating with Excel in the context of this query, and might that possibly be a source of problems?
Edit
To break out the OwnCost calculation into its own function, do something like this:
public decimal CalculateOwnCost(CustomerFrameVariationCategory cvfc)
{
return cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Sum(fvm => // sum all frame variation modules
(lastValue4 ? 0 : fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= this.SelectedDateTime) // if module not active then 0
.OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault().Porolone) + // otherwise get Porolone
fvm.FrameModule.FrameModuleComponents.Sum(fmc => // add to Porolone sum of all module components
(fmc.Article.ArticleDetails.Any() ? fmc.Article.ArticleDetails.Sum(ad => // if any article details then use A*L*W*T instead of Amount
WindowExcel.MultiplyArticleDetailValues(ad.ArticleDetailValueChanges.Where(advc => advc.ChangeDateTime <= this.SelectedDateTime)
.OrderByDescending(advc2 => advc2.ChangeDateTime).FirstOrDefault() ?? new ArticleDetailValueChange())) :
WindowExcel.GetModuleComponentAmount(fmc.FrameModuleComponentValueChanges.Where(fmcvc => fmcvc.ChangeDateTime <= this.SelectedDateTime) // no details = get amount
.OrderByDescending(fmcvc2 => fmcvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleComponentValueChange())) * // times article values
WindowExcel.MultiplyArticleValues(fmc.Article.ArticleValueChanges.Where(avc => avc.ChangeDateTime <= this.SelectedDateTime)
.OrderByDescending(avc2 => avc2.ChangeDateTime).FirstOrDefault() ?? new ArticleValueChange()))),
Cubes = cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Sum(fvm => (fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= this.SelectedDateTime && fmvc.IsActive == true).OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleValueChange()).Cubes)
}
That assumes that CustomerFrameVariationCategories is a colleciton of CustomerFrameVariationCategory objects, and that OwnCost is a decimal.
Once you do this, your original query can just include the Select that I showed above - you can also write it as
OwnCost = cfvc.Select(x => CalculateOwnCost(x))
if it makes you more comfortable (me, I've gotten scolded by Resharper enough on this point that I've come to accept it, but it's a matter of taste).
There's no reason you can't further decompose some of the intermediate expressions in that query into their own functions. A lambda function is just a function, after all.
I don't have any inherent insight into this problem in terms of what is expensive in the C# compiler. However the two things that jump out when I look at your query are the following
The number and complexity of the let bindings
The complexity of the initializer of the OwnCost member inside the select clause
The best advice I can give is to try and break up the query in order to get these into separate statements and hopefully that will ease the pressure on the compiler.
Split it up. That is one massive expression tree VS is trying to deal with. You could break it up so that some of the SELECT clause transformation happens in LINQ-to-object. This would be a lot easier for the background compiler to deal with. Just get:
var custFVC = (from cfvc in customer.CustomerFrameVariationCategories
let lastValue = cfvc.CustomerFrameVariationCategoryValueChanges.Where(cfvcvc => cfvcvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(cfvcvc2 => cfvcvc2.ChangeDateTime).FirstOrDefault() ?? new CustomerFrameVariationCategoryValueChange()
let lastValue2 = cfvc.FrameVariationCategory.FrameVariation.Frame.FrameValueChanges.Where(fvc => fvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(fvc2 => fvc2.ChangeDateTime).FirstOrDefault() ?? new FrameValueChange()
let lastValue3 = cfvc.FrameVariationCategory.FrameVariationCategoryValueChanges.Where(fvcvc => fvcvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(fvcvc2 => fvcvc2.ChangeDateTime).FirstOrDefault() ?? new FrameVariationCategoryValueChange()
let lastValue4 = cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Any(fvm => (fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= this.SelectedDateTime).OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleValueChange()).IsActive == false)
where lastValue.IsActive == true
orderby cfvc.FrameVariationCategory.FrameVariation.Frame.Name, cfvc.FrameVariationCategory.Category.Name, cfvc.FrameVariationCategory.FrameVariation.Name
select new
{ cfvc, lastValue, lastValue1, lastValue2, lastValue3}).ToList();
And then do the rest of your manipulation from there. If the result set is small, this might be more efficient anyway, and certainly easier on your db. If the result set is small, doing this will have very little performance cost.
If you have a bored, underworked db and there's a large result set and the machine where this code is running is strained, then you might need to keep the workload on the db.
Keep in mind that just breaking up into several steps the building of one massive expression tree to be run against IQuerable will not do you any good. That last variable will be as complicated (under the hood) as yours is now, and the compiler will still choke. The bottom line is you need to run .ToList() earlier in the life of this manipulation. A series of LINQ-to-object queries against IEnumerable won't be difficult for the background compiler to handle.
VS is probably choking on this because it's such a large, complex single statement.
Since the only linkages between the OwnCost and the LINQ context are the reference to cfvc and lastValue4, it seems to me that you could calculate OwnCost in a separate step after the initial LINQ query statement. Store the lastValue4 in the anonymous type constructed by the LINQ statement, remove OwnCost, and remove .ToList() from the end. You don't need to store the cfvc value since the only thing you're using it for is to access .FrameVariationCategory, which you've already captured in the 2nd field of the anonymous type.
In a separate statement, select from the CustFVC resultset to construct the OwnCost for each item to produce a new result set that contains all the data bits you're looking for. Call .ToList() on that second result set. This should produce equivalent results to the monster statement in similar time.
If this is a large result set, be careful about iterating over the data multiple times. If you use a foreach to calculate OwnCost for each item in the original result set, you will be running through the data twice - twice as much work as the single monster LINQ query.
If you use a LINQ query for the second operation, it shouldn't cause any additional passes over the data beyond what you already have - LINQ is lazy evaluated, so the next row is not actually retrieved until asked for. ToList() forces all rows to be retrieved. A foreach loop forces all rows to be retrieved. LINQ query using a LINQ query as an input does not iterate any rows of the input result set, it just piles on more conditions to be evaluated when somebody eventually asks for the next row of the 2nd result set.
I may be wrong here and it's just a guess, but have you tried splitting your class (and essentially your file) with the partial keyword?
Hah, I found a solution myself :)
Roberts instinct about a LINQ function got me googling. The results were not relevant to the matter at hand, but the little code I did stumble upon, got me thinking about a brute force attack method. Using redoced's idea of a partial class I finally wrote this piece of code in a seperate .cs file:
public partial class WindowExcel
{
private static decimal GetOwnCost(CustomerFrameVariationCategory cfvc, bool frameModuleAnyNonActive, DateTime selectedDateTime)
{
return cfvc.FrameVariationCategory.FrameVariation.FrameVariationModules.Sum(fvm => // sum all frame variation modules
(frameModuleAnyNonActive ? 0 : fvm.FrameModule.FrameModuleValueChanges.Where(fmvc => fmvc.ChangeDateTime <= selectedDateTime) // if module not active then 0
.OrderByDescending(fmvc2 => fmvc2.ChangeDateTime).FirstOrDefault().Porolone) + // otherwise get Porolone
fvm.FrameModule.FrameModuleComponents.Sum(fmc => // add to Porolone sum of all module components
(fmc.Article.ArticleDetails.Any() ? fmc.Article.ArticleDetails.Sum(ad => // if any article details then use A*L*W*T instead of Amount
WindowExcel.MultiplyArticleDetailValues(ad.ArticleDetailValueChanges.Where(advc => advc.ChangeDateTime <= selectedDateTime)
.OrderByDescending(advc2 => advc2.ChangeDateTime).FirstOrDefault() ?? new ArticleDetailValueChange())) :
WindowExcel.GetModuleComponentAmount(fmc.FrameModuleComponentValueChanges.Where(fmcvc => fmcvc.ChangeDateTime <= selectedDateTime) // no details = get amount
.OrderByDescending(fmcvc2 => fmcvc2.ChangeDateTime).FirstOrDefault() ?? new FrameModuleComponentValueChange())) * // times article values
WindowExcel.MultiplyArticleValues(fmc.Article.ArticleValueChanges.Where(avc => avc.ChangeDateTime <= selectedDateTime)
.OrderByDescending(avc2 => avc2.ChangeDateTime).FirstOrDefault() ?? new ArticleValueChange())));
}
}
And in my gigantic LINQ query I rewrote OwnCost as such:
OwnCost = WindowExcel.GetOwnCost(cfvc, lastValue4, this.SelectedDateTime)
Editing the GetOwnCost method is still painfully slow, as was excepted, but at least the rest of my project is now usable. I'm not sure what this brute force seperation does to performance. The fact that I can't ref the CustomerFrameVariationCategory and that the OwnCost expression tree is inside of a method not in a LINQ query itself, raises questions. Guess I'll have to profile it at some point, but that's a hole other issue.
Now to the delicate issue of what to mark as the answer. Though I do appreciate all the input, none of the answers so far were correct (no concrete solution), thus I'll have to mark my own post as the answer. But I will vote for redoced's and Robert's answers for pointing me in the right direction.
I would appreciate, if anyone can comment about possible code execution performance impacts for my solution vs the original code.
PS! Writing this in Internet Explorer 8 is again painfully slow because of the constant CPU hogging (has something to do with coloring the code). So it's not only a VS issue....
Edit:
It seems Robert has managed to post the exact same solution I came up with. Would have probably got my answer posted earlier if not for the constant CPU hogging...
In all fairness I marked Robert's post as the answer :)
Instead of writing LINQ to SQL you can write a stored procedure for this to do all these things.