Trying to optimize an if/else condition slows down the program - C#

I am currently trying to optimize a .NET application with the help of the Visual Studio profiling tools.
One function, which gets called quite often, contains the following code:
if (someObjectContext.someObjectSet
    .Where(i => i.PNT_ATT_ID == tmp_ATT_ID)
    .OrderByDescending(i => i.Position)
    .Select(i => i.Position)
    .Count() == 0)
{
    lastPosition = 0;
}
else
{
    lastPosition = someObjectContext.someObjectSet
        .Where(i => i.PNT_ATT_ID == tmp_ATT_ID)
        .OrderByDescending(i => i.Position)
        .Select(i => i.Position)
        .Cast<int>()
        .First();
}
Which I changed to something like this:
var relevantEntities = someObjectContext.someObjectSet
    .Where(i => i.PNT_ATT_ID == tmp_ATT_ID)
    .OrderByDescending(i => i.Position)
    .Select(i => i.Position);
if (relevantEntities.Count() == 0)
{
    lastPosition = 0;
}
else
{
    lastPosition = relevantEntities.Cast<int>().First();
}
I was hoping that the change would speed up the method a bit, as I was unsure whether the compiler would notice that the query is done twice and cache the results.
To my surprise, the execution time (the number of inclusive samples) of the method has not decreased, but actually increased by 9% (according to the profiler).
Can someone explain why this is happening?

I was hoping that the change would speed up the method a bit, as I was unsure whether the compiler would notice that the query is done twice and cache the results.
It will not. In fact it cannot. The database might not return the same results for the two queries. It's entirely possible for a result to be added or removed after the first query and before the second. (Making this code not only inefficient, but potentially broken if that were to happen.) Since it's entirely possible that you want two queries to be executed, knowing that the results could differ, it's important that the results of the query not be re-used.
The important point here is the idea of deferred execution. relevantEntities is not the results of a query, it's the query itself. It's not until the IQueryable is iterated (by a method such as Count, First, a foreach loop, etc.) that the database will be queried, and each time you iterate the query it will perform another query against the database.
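Deferred execution is easy to observe even in plain LINQ-to-Objects (with Entity Framework each enumeration is additionally a database round-trip). A minimal sketch, using a counter in the projection to show how often the query actually runs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        int evaluations = 0;

        // A query with a counter in the projection, so we can observe
        // how many times elements are actually produced.
        IEnumerable<int> query = Enumerable.Range(1, 5)
            .Select(i => { evaluations++; return i; });

        // Each operator that iterates the query re-executes it.
        int count = query.Count();  // enumerates all 5 elements
        int first = query.First();  // enumerates 1 more element
        Console.WriteLine(evaluations); // prints 6

        // Materializing with ToList() runs the query exactly once;
        // later operators work on the in-memory list.
        evaluations = 0;
        List<int> cached = query.ToList(); // 5 evaluations
        int count2 = cached.Count;         // no re-evaluation
        int first2 = cached.First();       // no re-evaluation
        Console.WriteLine(evaluations); // prints 5
    }
}
```

Note that with a database-backed IQueryable, calling ToList() changes semantics as well as performance: everything after it runs in memory rather than in SQL.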
In your case you can just do this:
var lastPosition = someObjectContext.someObjectSet
.Where(i => i.PNT_ATT_ID == tmp_ATT_ID)
.OrderByDescending(i => i.Position)
.Select(i => i.Position)
.Cast<int>()
.FirstOrDefault();
This leverages the fact that the default value of an int is 0, which is what you were setting the value to in the event that there was no match before.
Note that this query is functionally the same as yours; it just avoids executing it twice. An even better query would be the one suggested by lazyberezovsky, in which you leverage Max rather than ordering and taking the first. If there is an index on that column there wouldn't be much of a difference, but if there's no index, ordering would be a lot more expensive.
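The FirstOrDefault behaviour can be seen in a small LINQ-to-Objects sketch (in-memory ints standing in for the Position column):

```csharp
using System;
using System.Linq;

class FirstOrDefaultDemo
{
    static void Main()
    {
        int[] positions = { 3, 7, 5 };

        // Non-empty sequence: FirstOrDefault returns the first element
        // of the descending order, i.e. the maximum.
        int last = positions.OrderByDescending(p => p).FirstOrDefault();
        Console.WriteLine(last); // 7

        // No element matches: FirstOrDefault yields default(int) == 0,
        // the same value the original code assigned in its empty branch.
        int none = positions.Where(p => p > 100)
                            .OrderByDescending(p => p)
                            .FirstOrDefault();
        Console.WriteLine(none); // 0
    }
}
```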

You can use Max() to get maximum position instead of ordering and taking first item, and DefaultIfEmpty() to provide default value (zero for int) if there are no entities matching your condition. Btw you can provide custom default value to return if sequence is empty.
lastPosition = someObjectContext.someObjectSet
.Where(i => i.PNT_ATT_ID == tmp_ATT_ID)
.Select(i => i.Position)
.Cast<int>()
.DefaultIfEmpty()
.Max();
Thus you will avoid executing two queries - one for determining whether there are any positions, and another for getting the latest position.
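The DefaultIfEmpty/Max combination behaves the same way in plain LINQ-to-Objects, which makes it easy to sanity-check (ints standing in for the Position values):

```csharp
using System;
using System.Linq;

class MaxWithDefaultDemo
{
    static void Main()
    {
        int[] positions = { 3, 7, 5 };
        int[] empty = Array.Empty<int>();

        // Max over a non-empty sequence; DefaultIfEmpty has no effect here.
        Console.WriteLine(positions.DefaultIfEmpty().Max()); // 7

        // Max over an empty sequence would throw; DefaultIfEmpty()
        // injects default(int) == 0, so Max() returns 0 instead.
        Console.WriteLine(empty.DefaultIfEmpty().Max()); // 0

        // A custom fallback value can be supplied as well.
        Console.WriteLine(empty.DefaultIfEmpty(-1).Max()); // -1
    }
}
```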

Related

Need to format complex foreach and IF statements to better formatted LINQ expression

I have a Foreach Statement as shown below
foreach (var fieldMappingOption in collectionHelper.FieldMappingOptions
.Where(fmo => fmo.IsRequired && !fmo.IsCalculated
&& !fmo.FieldDefinition.Equals( MMPConstants.FieldDefinitions.FieldValue)
&& (implicitParents || anyParentMappings
|| fmo.ContainerType == collectionHelper.SelectedOption.ContainerType)))
{
if (!collectionHelper.FieldMappingHelpers
.Any(fmh => fmh.SelectedOption.Equals(fieldMappingOption)))
{
requiredMissing = true;
var message = String.Format(
"The MMP column {0} is required and therefore must be mapped to a {1} column.",
fieldMappingOption.Label, session.ImportSource.CollectionLabel);
session.ErrorMessages.Add(message);
}
}
Can I break the above complex foreach and if statements into a better-formatted LINQ expression? Also, performance-wise, which will be better? Please suggest.
Re : Change the Foreach to a Linq statement
Well, you could convert the loop into a LINQ Select, and since inside the loop you have only one branch with an additional predicate, you can combine that predicate into the outer Where, something like so:
var missingFieldMappingOptions = collectionHelper.FieldMappingOptions
    .Where(fmo => fmo.IsRequired && !fmo.IsCalculated
        && !fmo.FieldDefinition.Equals(MMPConstants.FieldDefinitions.FieldValue)
        && (implicitParents || anyParentMappings
            || fmo.ContainerType == collectionHelper.SelectedOption.ContainerType)
        && !collectionHelper.FieldMappingHelpers
            .Any(fmh => fmh.SelectedOption.Equals(fmo)))
    .Select(fmo =>
        $"The MMP column {fmo.Label} is required and therefore" +
        $" must be mapped to a {session.ImportSource.CollectionLabel} column.")
    .ToList(); // materialize once so Any() and AddRange() don't re-run the query
var requiredMissing = missingFieldMappingOptions.Any();
session.ErrorMessages.AddRange(missingFieldMappingOptions);
However, even LINQ can't make the filter clauses in the .Where disappear, so the LINQ version is hardly more readable than the foreach loop, and isn't really any more performant either (there may be some marginal benefit to setting the requiredMissing flag and adding to session.ErrorMessages in one bulk chunk).
Performance
From a performance perspective, the snippet below is problematic, as it makes the combined loops O(N × M), where M is the number of FieldMappingHelpers (fortunately .Any() returns early if a match is found; otherwise every element would cost a full scan):
if (!collectionHelper
.FieldMappingHelpers.Any(fmh => fmh.SelectedOption.Equals(fieldMappingOption)))
Does FieldMappingOption have a unique key? If so, then I suggest adding a Dictionary<Key, FieldMappingOption> to collectionHelper and using .ContainsKey(key), which approaches O(1), e.g.
!collectionHelper
.SelectedFieldMappingOptions.ContainsKey(fieldMappingOption.SomeKey)
Even if there isn't a unique key, you could use a decent HashCode on FieldMappingOption and key by that to get a similar effect, although you'll need to consider what happens in the event of a hash collision.
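A minimal sketch of the keyed-lookup idea, using a HashSet and strings as stand-ins for the options' keys (the real key type and property names from the question's codebase are assumptions here):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class HashLookupSketch
{
    static void Main()
    {
        // Built once, before the loop: the keys of the options that are
        // already mapped (in the real code, derived from
        // collectionHelper.FieldMappingHelpers' selected options).
        var selectedKeys = new HashSet<string> { "Title", "Author" };

        var requiredOptionKeys = new[] { "Title", "Author", "ISBN" };

        // O(1) membership test per option instead of a linear .Any() scan,
        // turning the overall cost from O(N * M) into O(N + M).
        var missing = requiredOptionKeys
            .Where(key => !selectedKeys.Contains(key))
            .ToList();

        Console.WriteLine(string.Join(", ", missing)); // ISBN
    }
}
```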
Readability
The Where predicate in the outer for loop is arguably messy and could use some refactoring (for readability, if not for performance).
IMO most of the where clauses could be moved into FieldMappingOption as a meta property, e.g. wrap up
fmo.IsRequired
&& !fmo.IsCalculated
&& !fmo.FieldDefinition.Equals(MMPConstants.FieldDefinitions.FieldValue)
into a property, e.g. fmo.MustBeValidated etc.
You could squeeze out a minor performance gain by ensuring the predicate returns false as soon as possible, placing the && clauses most likely to fail first, but I wouldn't do so if it impacts the readability of the code.
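A hypothetical sketch of the "meta property" refactoring, with minimal stand-ins for the question's types (FieldDefinition, MMPConstants, and the property names are assumptions, not the real API):

```csharp
using System;

// Minimal stand-ins so the sketch compiles; the real types live in the
// question's codebase.
public enum FieldDefinition { FieldValue, Other }

public static class MMPConstants
{
    public static class FieldDefinitions
    {
        public const FieldDefinition FieldValue = FieldDefinition.FieldValue;
    }
}

public class FieldMappingOption
{
    public bool IsRequired { get; set; }
    public bool IsCalculated { get; set; }
    public FieldDefinition FieldDefinition { get; set; }

    // The compound predicate folded into a named property.
    // Cheapest checks first so the && chain short-circuits early.
    public bool MustBeValidated =>
        IsRequired
        && !IsCalculated
        && FieldDefinition != MMPConstants.FieldDefinitions.FieldValue;
}

class Demo
{
    static void Main()
    {
        var fmo = new FieldMappingOption
        {
            IsRequired = true,
            FieldDefinition = FieldDefinition.Other
        };
        Console.WriteLine(fmo.MustBeValidated); // True
    }
}
```

The outer query then shrinks to something like .Where(fmo => fmo.MustBeValidated && ...), keeping the container-type and mapping checks where they belong.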

How to optimize a LINQ with minimum and additional condition

Assume we have a list of objects (to keep it clear, no properties etc. are used):
public class SomeObject
{
    public bool IsValid;
    public int Height;
}
List<SomeObject> objects = new List<SomeObject>();
Now I want only the object from the list which is both valid and has the lowest height.
Classically I would have used something like:
SomeObject temp = null;
foreach (SomeObject so in objects)
{
    if (so.IsValid)
    {
        if (null == temp)
            temp = so;
        else if (temp.Height > so.Height)
            temp = so;
    }
}
return temp;
I was thinking that it can be done more clearly with LINQ.
The first approach which came to my mind was:
List<SomeObject> sos = objects.Where(obj => obj.IsValid).ToList();
if (sos.Count > 0)
{
    return sos.OrderBy(obj => obj.Height).FirstOrDefault();
}
But then I was thinking: in the foreach approach I am going through the list once. With LINQ I would go through the list once for filtering, and once more for ordering, even though I do not need to completely order the list.
Would something like
return objects.OrderBy(obj => obj.Height).FirstOrDefault(o => o.IsValid);
also go twice through the list?
Can this be somehow optimized, so that the LINQ also only needs to run through the list once?
You can use GroupBy:
IEnumerable<SomeObject> validLowestHeights = objects
    .Where(o => o.IsValid)
    .GroupBy(o => o.Height)
    .OrderBy(g => g.Key)
    .First();
This group contains all valid objects with the lowest height.
The most efficient way to do this with LINQ is as follows:
var result = objects.Aggregate(
default(SomeObject),
(acc, current) =>
!current.IsValid ? acc :
acc == null ? current :
current.Height < acc.Height ? current :
acc);
This will loop over the collection only once.
However, you said "I was thinking that it can be done more clearly with LinQ." Whether this is more clear or not, I leave that up to you to decide.
You can try this one:
return (from _Object in Objects where _Object.isValid orderby _Object.Height select _Object).FirstOrDefault();
or
return _Objects.Where(_Object => _Object.isValid).OrderBy(_Object => _Object.Height).FirstOrDefault();
Would something like
return objects.OrderBy(obj => obj.Height).FirstOrDefault(o => o.IsValid);
also go twice through the list?
Only in the worst case scenario, where the first valid object is the last in order of obj.Height (or there is none to be found). Iterating the collection using FirstOrDefault will stop as soon as a valid element is found.
Can this be somehow optimized, so that the LINQ also only needs to run
once through the list?
I'm afraid you'd have to make your own extension method. Considering what I've written above though, I'd consider it pretty optimized as it is.
UPDATE:
Actually, the following would be a bit faster, as we'd avoid sorting invalid items:
return objects.Where(o => o.IsValid).OrderBy(o => o.Height).FirstOrDefault();
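The "own extension method" suggested above can be sketched as a one-pass, filtered min-by helper (the method name and shape are my own; on .NET 6+ the built-in MinBy covers the keyed part of this):

```csharp
using System;
using System.Collections.Generic;

static class EnumerableExtensions
{
    // One-pass "min by key, with filter": returns the first element that
    // passes the predicate and has the smallest key, or default(T) if
    // nothing matches. Enumerates the source exactly once.
    public static T MinByWhere<T, TKey>(
        this IEnumerable<T> source,
        Func<T, bool> predicate,
        Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        T best = default(T);
        bool found = false;
        foreach (T item in source)
        {
            if (!predicate(item)) continue;
            if (!found || keySelector(item).CompareTo(keySelector(best)) < 0)
            {
                best = item;
                found = true;
            }
        }
        return best;
    }
}

class SomeObject { public bool IsValid; public int Height; }

class Demo
{
    static void Main()
    {
        var objects = new List<SomeObject>
        {
            new SomeObject { IsValid = false, Height = 1 },
            new SomeObject { IsValid = true,  Height = 5 },
            new SomeObject { IsValid = true,  Height = 3 },
        };
        var lowest = objects.MinByWhere(o => o.IsValid, o => o.Height);
        Console.WriteLine(lowest.Height); // 3
    }
}
```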

Linq performance query

I have this query that gives the correct results but it takes about 15 seconds to run
int Count= P.Pets.Where(c => !P.Pets.Where(a => a.IsOwned == true)
.Select(a => a.OwnerName).Contains(c.OwnerName) && c.CreatedDate >=
EntityFunctions.AddDays(DateTime.Now, -8)).GroupBy(b=>b.OwnerName).Count();
If I remove this part of the LINQ
'&& c.CreatedDate >= EntityFunctions.AddDays(DateTime.Now, -8)'
it only takes about 3 seconds to run. How can I keep the same condition but make it a lot faster?
I need that date criterion because I don't want any Pets created more than 8 days ago to be included in the count.
Edit
I have a table named People, referred to in this query as P, and I want to return a count of the Pets that do not have an owner. If a person has at least one record in the Pets table marking them as the owner of a pet, I want to remove all cases where that person exists from the returned query, and once that is done, only return the Pets that were created within the last 8 days.
You should cache the date and put that evaluation first (since the DateTime evaluation should be faster than a Contains evaluation). Also avoid recalculating the same query multiple times.
DateTime eightDaysOld = EntityFunctions.AddDays(DateTime.Now, -8);
//calculate these independently from the rest of the query
var ownedPetOwnerNames = P.Pets.Where(a => a.IsOwned == true)
.Select(a => a.OwnerName);
//Evaluate the dates first, it should be
//faster than Contains()
int Count = P.Pets.Where(c => c.CreatedDate >= eightDaysOld &&
//Using the cached result should speed this up
ownedPetOwnerNames.Contains(c.OwnerName))
.GroupBy(b=>b.OwnerName).Count();
That should return the same results. (I hope)
You are losing any ability to use indices with that snippet, as it recalculates that static date for every row. Declare a DateTime variable before your query, set it to DateTime.Now.AddDays(-8), and use the variable in the where clause instead of your snippet.
Separating out the inner query, calling ToList() on it, and inserting the result into the master query makes it go 4 times faster:
var ownedPetOwnerNames = P.Pets.Where(a => a.IsOwned == true)
.Select(a => a.OwnerName).ToList();
var date = DateTime.Now.AddDays(-8);
int Count = P.Pets.Where(c => c.CreatedDate >= date &&
ownedPetOwnerNames.Contains(c.OwnerName)).GroupBy(b => b.OwnerName).Count();
You could use (and maybe first create) a navigation property Pet.Owner:
var refDate = DateTime.Today.AddDays(-8);
int Count= P.Pets
.Where(p => !p.Owner.Pets.Any(p1 => p1.IsOwned)
&& p.CreatedDate >= refDate)
.GroupBy(b => b.OwnerName).Count();
This may increase performance because the Contains is gone. At least it is better scalable than your two-phase query with a Contains involving an unpredictable number of strings.
Of course you also need to make sure there is an index on CreatedDate.

Can this be done entirely via linq?

I have a process where I identify rows in a list (unmatchedClient) and then call a separate method to delete them (pingtree.RemoveNodes). This seems a little long-winded, and I could achieve the same thing by merely setting the property "DeleteFlag" to true. But how do I set the value using LINQ?
var unmatchedClient = pingtree.Nodes.Where(x =>
_application.LoanAmount < x.Lender.MinLoanAmount ||
_application.LoanAmount > x.Lender.MaxLoanAmount ||
_application.LoanTerm < x.Lender.MinLoanTerm ||
_application.LoanTerm > x.Lender.MaxLoanTerm)
.Select(x => x.TreeNode)
.ToList();
pingtree.RemoveNodes(unmatchedClient);
Thanks in advance.
Like this?
pingtree.Nodes.Where(x =>
_application.LoanAmount < x.Lender.MinLoanAmount ||
_application.LoanAmount > x.Lender.MaxLoanAmount ||
_application.LoanTerm < x.Lender.MinLoanTerm ||
_application.LoanTerm > x.Lender.MaxLoanTerm)
.Select(x => x.TreeNode)
.ToList()
.ForEach(n=> n.DeleteFlag = true);
But how do I set the value using linq
You don't. Period.
Linq is a query language and querying is reading.
There is a back door that some people abuse to set values. In your case it would look like:
pingtree.Nodes.Where(...)
    .Select(n => { n.DeleteFlag = true; return n; })
but this really is not done. Firstly because it is unexpected. Linq methods, including Select, are supposed to leave the source collection unchanged. Secondly because the statement itself does not do anything because of deferred execution. You'd have to force execution (e.g. by ToList()) to make it effective.
Maybe this looks like nitpicking, but when queries get a bit more complex it becomes a major point. It is not uncommon to do a projection (Select) followed by a filter (Where). You could have decided to do a filtering (where n.Deleted == false) after the projection for instance.
So, you query the objects using linq and then loop through them to do whatever needs to be done. ToList().ForEach() is one of the methods you can use to do that.
Side node: the back door that I showed would throw exceptions when using linq-to-sql or entity framework.
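The "query with LINQ, then loop" advice can be sketched with minimal stand-ins for the question's types (Node, TreeNode, and the amount filter are simplified assumptions, not the real pingtree API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins for the question's node types.
class TreeNode { public bool DeleteFlag; }
class Node { public int Amount; public TreeNode TreeNode = new TreeNode(); }

class Demo
{
    static void Main()
    {
        var nodes = new List<Node>
        {
            new Node { Amount = 50 },   // below the limit -> flag it
            new Node { Amount = 150 },  // in range        -> keep it
        };

        // The query only reads; a plain foreach does the mutation.
        // The read and the write stay clearly separated, unlike a
        // side-effecting Select.
        foreach (var n in nodes.Where(n => n.Amount < 100))
            n.TreeNode.DeleteFlag = true;

        Console.WriteLine(nodes[0].TreeNode.DeleteFlag); // True
        Console.WriteLine(nodes[1].TreeNode.DeleteFlag); // False
    }
}
```

Unlike the Select back door, this needs no ToList() to force execution: the foreach itself enumerates the query.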

Should I use two "where" clauses or "&&" in my LINQ query?

When writing a LINQ query with multiple "and" conditions, should I write a single where clause containing && or multiple where clauses, one for each condition?
static void Main(string[] args)
{
var ints = new List<int>(Enumerable.Range(-10, 20));
var positiveEvensA = from i in ints
where (i > 0) && ((i % 2) == 0)
select i;
var positiveEvensB = from i in ints
where i > 0
where (i % 2) == 0
select i;
System.Diagnostics.Debug.Assert(positiveEvensA.Count() ==
positiveEvensB.Count());
}
Is there any difference other than personal preference or coding style (long lines, readability, etc.) between positiveEvensA and positiveEvensB?
One possible difference that comes to mind is that different LINQ providers may be able to better cope with multiple wheres rather than a more complex expression; is this true?
I personally would always go with the && vs. two where clauses whenever it doesn't make the statement unintelligible.
In your case, it probably won't be noticeable at all, but having two where clauses can have a performance impact on a large collection, particularly if you consume all of the results of the query. For example, if you call .Count() on the results, or iterate through the entire list, the first where clause runs, producing an intermediate IEnumerable<T> whose elements are then filtered again by a second delegate.
Chaining the 2 clauses together causes the query to form a single delegate that gets run as the collection is enumerated. This results in one enumeration through the collection and one call to the delegate each time a result is returned.
If you split them, the shape changes. As your first where clause enumerates the original collection, its results stream into the second where clause, so the collection itself is still traversed only once; but every element passes through an extra iterator layer, and each element that survives the first predicate costs an additional delegate invocation, which adds per-element overhead.
If you do decide to use 2 where clauses, placing the more restrictive clause first will help quite a bit, since the second where clause is only run on the elements that pass the first one.
Now, in your case, this won't matter. On a large collection, it could. As a general rule of thumb, I go for:
Readability and maintainability
Performance
In this case, I think both options are equally maintainable, so I'd go for the more performant option.
This is mostly a personal style issue. Personally, as long as the where clause fits on one line, I group the clauses.
Using multiple wheres will tend to be less performant because it requires an extra delegate invocation for every element that makes it that far. However it's likely to be an insignificant issue and should only be considered if a profiler shows it to be a problem.
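For in-memory collections, the per-element delegate cost the answers above describe can be observed directly. A small sketch that counts predicate invocations for chained Where versus a combined &&:

```csharp
using System;
using System.Linq;

class WhereInvocationDemo
{
    static void Main()
    {
        int[] data = Enumerable.Range(1, 1000).ToArray();
        int calls = 0;

        // Chained Where: a single streaming pass over the array, but the
        // second predicate runs once per element that passed the first.
        int chained = data
            .Where(i => { calls++; return i % 2 == 0; })   // 1000 calls
            .Where(i => { calls++; return i > 500; })      //  500 calls
            .Count();
        Console.WriteLine(calls);    // 1500
        Console.WriteLine(chained);  // 250

        calls = 0;
        // Combined predicate: one delegate invocation per element, with
        // && short-circuiting inside the single lambda.
        int combined = data
            .Where(i => { calls++; return i % 2 == 0 && i > 500; })
            .Count();
        Console.WriteLine(calls);    // 1000
        Console.WriteLine(combined); // 250
    }
}
```

Both forms produce the same result; whether the extra invocations matter in practice is exactly what the benchmark further down measures.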
As Jared Par has already said: it depends on your personal preference, readability and the use-case. For example if your method has some optional parameters and you want to filter a collection if they are given, the Where is perfect:
IEnumerable<SomeClass> matchingItems = allItems;
if(!string.IsNullOrWhiteSpace(name))
matchingItems = matchingItems
.Where(c => c.Name == name);
if(date.HasValue)
matchingItems = matchingItems
.Where(c => c.Date == date.Value);
if(typeId.HasValue)
matchingItems = matchingItems
.Where(c => c.TypeId == typeId.Value);
return matchingItems;
If you wanted to do this with &&, have fun ;)
Where I don't agree with Jared and Reed is the performance issue that multiple Where clauses are supposed to have. Actually, Where is optimized in a way that combines multiple predicates into one, as you can see here (in CombinePredicates).
But I wanted to know if it really has no big impact when the collection is large and there are multiple Where clauses which all have to be evaluated. I was surprised that the following benchmark revealed that the multiple-Where approach was even slightly more efficient. The summary:
Method        | Mean    | Error    | StdDev
------------- | ------- | -------- | --------
MultipleWhere | 1.555 s | 0.0310 s | 0.0392 s
MultipleAnd   | 1.571 s | 0.0308 s | 0.0649 s
Here's the benchmark code; I think it's good enough for this test:
#LINQPad optimize+
void Main()
{
var summary = BenchmarkRunner.Run<WhereBenchmark>();
}
public class WhereBenchmark
{
string[] fruits = new string[] { "apple", "mango", "papaya", "banana", "guava", "pineapple" };
private IList<string> longFruitList;
[GlobalSetup]
public void Setup()
{
Random rnd = new Random();
int size = 1_000_000;
longFruitList = new List<string>(size);
for (int i = 1; i < size; i++)
longFruitList.Add(GetRandomFruit());
string GetRandomFruit()
{
return fruits[rnd.Next(0, fruits.Length)];
}
}
[Benchmark]
public void MultipleWhere()
{
int count = longFruitList
.Where(f => f.EndsWith("le"))
.Where(f => f.Contains("app"))
.Where(f => f.StartsWith("pine"))
.Count(); // counting pineapples
}
[Benchmark]
public void MultipleAnd()
{
int count = longFruitList
.Where(f => f.EndsWith("le") && f.Contains("app") && f.StartsWith("pine"))
.Count(); // counting pineapples
}
}
The performance issue only applies to memory-based collections; LINQ to SQL generates expression trees that defer execution. More details here:
Multiple WHERE Clauses with LINQ extension methods
If you run SQL Profiler and check the generated queries, you can see that there is no difference between two types of queries in terms of performance.
So, just your taste in code style.
Like others have suggested, it's more of a personal preference. I like the use of && as it's more readable and mimics the syntax of other mainstream languages.
