Faster version of LINQ .Any() and .Count() - c#

I'm checking whether a list already contains an element with a given source and target. If not, I add the element to the list. I'm doing it this way:
if (!objectToSerialize.elements
        .Any(x => x.data.source == edgetoAdd.data.source &&
                  x.data.target == edgetoAdd.data.target))
    objectToSerialize.elements.Add(edgetoAdd);
This works, but very slowly. Is there a way to make this part faster? Are there faster implementations of Any() or Count()? Thanks in advance.

You can pre-index the data into something like a HashSet<T> for some T. Since you are comparing two values, a tuple might help:
var existingValues = new HashSet<(string, string)>(
    objectToSerialize.elements.Select(x => (x.data.source, x.data.target)));
Now you can test
existingValues.Contains((edgetoAdd.data.source, edgetoAdd.data.target))
efficiently. But!! Building the index is not free. This mainly helps if you are going to be testing lots of values. If you're only adding one, a linear search is probably your best bet.
Note that you can use the index approach with an index that lasts between multiple Add calls, but you would also need to remember to .Add it to the index each time. You can short-cut the test/add pair by using the return value of .Add on the hashset:
if (existingValues.Add((edgetoAdd.data.source, edgetoAdd.data.target)))
{
    // a new value, yay!
    objectToSerialize.elements.Add(edgetoAdd);
}
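Putting the pieces together, here is a sketch of bulk-adding edges against a single persistent index (edgesToAdd is a hypothetical collection of pending edges; the data shapes are taken from the question, with source and target assumed to be strings):

// Build the index once from the existing elements.
var existingValues = new HashSet<(string, string)>(
    objectToSerialize.elements.Select(x => (x.data.source, x.data.target)));

foreach (var edgeToAdd in edgesToAdd) // edgesToAdd is hypothetical
{
    // HashSet<T>.Add returns false for duplicates, so this single call
    // is both the membership test and the index update.
    if (existingValues.Add((edgeToAdd.data.source, edgeToAdd.data.target)))
        objectToSerialize.elements.Add(edgeToAdd);
}

Once the index exists, each added edge costs O(1) amortized instead of a full O(n) scan.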

Related

Is anything faster than .Any() to check that condition?

I have been searching for a way to do the following with a faster EF query:
using (DAL.MandatsDatas db = new DAL.MandatsDatas())
{
    if (db.ARTICLE.Any(t => t.condition == condition))
        oneArticle = db.ARTICLE.First(t => t.condition == condition);
}
It works fine, but the more of these I add, the slower it feels.
It just looks like it goes through all the rows twice (I don't know if that's the case).
I've been searching, and saw people using Count() > 0 and other irrelevant stuff...
Is there a faster way to check if something exists and then take it?
Also, I was wondering if FirstOrDefault() could help my case; how does it work?
Yes, FirstOrDefault is better here:
oneArticle = db.ARTICLE.FirstOrDefault(t => t.condition == condition);
Basically, Any will do one select and then First will do one more, while FirstOrDefault does the same thing First does but just returns null if there was no output, eliminating the need to run another selection operation.
Yes, FirstOrDefault will be faster because it only queries once. The way it works: if no rows are available it returns null; if rows are available it returns the first row based on whatever ordering you applied (if any).
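For completeness, a sketch of the usual null-check pattern around FirstOrDefault, using the types from the question:

using (DAL.MandatsDatas db = new DAL.MandatsDatas())
{
    // One query; null means no matching row.
    var oneArticle = db.ARTICLE.FirstOrDefault(t => t.condition == condition);
    if (oneArticle != null)
    {
        // use oneArticle
    }
}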

Getting distinct and ordered members from a list of strings - LINQ or HashSet, which one is faster / better suited?

I have a big list of strings (about 5k-20k entries) that I need to order and also to remove duplicates from.
I've done this in two ways now, once with a HashSet and once solely with LINQ. Tests with that number of entries did not show a big difference, but I'm wondering which way, and thus which method, would be better suited.
The two ways (myList is of type List<string>):
LINQ: I'm using one LINQ statement to order the list and get the distinct values from it:
myList = myList.OrderBy(q => q).Distinct().ToList();
HashSet: I'm using a HashSet to remove all duplicates and then I'm ordering the list:
myList = new HashSet<String>(myList).ToList<String>();
myList = myList.OrderBy(q => q).ToList();
Like I said, the tests I made showed about the same time consumption for both methods, but I'm still wondering if one method is better than the other, and if so why (the code is for a high-performance part and I need to get every millisecond I can out of it).
If you're really concerned about every nanosecond, then
myList = myList.Distinct().OrderBy(q => q).ToList();
might be slightly faster than:
myList = myList.OrderBy(q => q).Distinct().ToList();
if there are a large number of duplicates.
The LINQ method is more readable and will have similar performance to explicitly creating a HashSet<T> as others have said. In fact it may be slightly faster if the original List is already sorted, since the LINQ method will preserve the initial order before sorting, while explicitly creating a HashSet<T> will enumerate in an undefined order.
They are pretty much the same. Distinct also uses a Set<T> internally to eliminate duplicates. My suggestion is to use Distinct first, then sort your items. Also, in your second snippet the ToList<String> call is redundant: you can call OrderBy on the HashSet and then ToList.
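That is, a sketch of the streamlined second approach:

myList = new HashSet<string>(myList).OrderBy(q => q).ToList();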

Which is faster in .NET, .Contains() or .Count()?

I want to compare an array of modified records against a list of records pulled from the database, and delete those records from the database that do not exist in the incoming array. The modified array comes from a client app that maintains the database, and this code runs in a WCF service app, so if the client deletes a record from the array, that record should be deleted from the database. Here's the sample code snippet:
public void UpdateRecords(Record[] recs)
{
    // look for deleted records
    foreach (Record rec in UnitOfWork.Records.ToList())
    {
        var copy = rec;
        if (!recs.Contains(rec)) // use this one?
        if (0 == recs.Count(p => p.Id == copy.Id)) // or this one?
        {
            // if not in the new collection, remove from database
            Record deleted = UnitOfWork.Records.Single(p => p.Id == copy.Id);
            UnitOfWork.Remove(deleted);
        }
    }
    // rest of method code deleted
}
My question: is there a speed advantage (or other advantage) to using the Count method over the Contains method? The Id property is guaranteed to be unique and to identify that particular record, so you don't need to do a bitwise compare, as I assume Contains might do.
Anyone?
Thanks, Dave
This would be faster:
if (!recs.Any(p => p.Id == copy.Id))
This has the same advantages as using Count(), but it also stops after it finds the first match, unlike Count().
You should not even consider Count since you are only checking for the existence of a record. You should use Any instead.
Using Count forces iterating the entire enumerable to get the correct count; Any stops enumerating as soon as it finds the first element.
As for the use of Contains, you need to consider whether, for the type in question, reference equality is equivalent to the Id comparison you are performing. By default it is not.
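For illustration, a minimal sketch of what an Id-based equality override could look like (assuming Record exposes an int Id; the real class is not shown in the question):

public class Record
{
    public int Id { get; set; }

    // With these overrides, recs.Contains(rec) compares by Id
    // instead of by reference.
    public override bool Equals(object obj) =>
        obj is Record other && other.Id == Id;

    public override int GetHashCode() => Id.GetHashCode();
}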
Assuming Record implements both GetHashCode and Equals properly, I'd use a different approach altogether:
// I'm assuming it's appropriate to pull down all the records from the database
// to start with, as you're already doing it.
foreach (Record recordToDelete in UnitOfWork.Records.ToList().Except(recs))
{
    UnitOfWork.Remove(recordToDelete);
}
Basically there's no need to have an N * M lookup time - the above code will end up building a set of records from recs based on their hash code, and find non-matches rather more efficiently than the original code.
If you've actually got more to do, you could use:
HashSet<Record> recordSet = new HashSet<Record>(recs);
foreach (Record recordFromDb in UnitOfWork.Records.ToList())
{
    if (!recordSet.Contains(recordFromDb))
    {
        UnitOfWork.Remove(recordFromDb);
    }
    else
    {
        // Do other stuff
    }
}
(I'm not quite sure why your original code is refetching the record from the database using Single when you've already got it as rec...)
Contains() is going to use Equals() against your objects. If you have not overridden this method, it's even possible Contains() is returning incorrect results. If you have overridden it to use the object's Id to determine identity, then Count() and Contains() are doing almost exactly the same thing, except Contains() will short-circuit as soon as it hits a match, whereas Count() will keep on counting. Any() might be a better choice than both of them.
Do you know for certain this is a bottleneck in your app? It feels like premature optimization to me. Which is the root of all evil, you know :)
Since you're guaranteed that there will be one and only one, Any might be faster, because as soon as it finds a record that matches it will return true.
Count will traverse the entire list counting each occurrence. So if the item is #1 in a list of 1000 items, it's still going to check all 1000.
EDIT
Also, this might be a time to mention not doing a premature optimization.
Wire up both your methods and put a stopwatch before and after each one.
Create a sufficiently large list (1000 items or more, depending on your domain) and see which one is faster.
My guess is that we're talking on the order of ms here.
I'm all for writing efficient code, just make sure you're not taking hours to save 5 ms on a method that gets called twice a day.
You could do it like this:
UnitOfWork.Records.RemoveAll(r => !recs.Any(rec => rec.Id == r.Id));
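If both collections are large, you can avoid the nested scan by pre-indexing the incoming ids first. A sketch, assuming Id is an int and Records supports RemoveAll as above:

// One pass over recs to build the set, one pass over Records to prune.
var incomingIds = new HashSet<int>(recs.Select(r => r.Id));
UnitOfWork.Records.RemoveAll(r => !incomingIds.Contains(r.Id));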
May I suggest an alternative approach that should be faster, I believe, since Count would continue even after the first match:
public void UpdateRecords(Record[] recs)
{
    // look for deleted records
    foreach (Record rec in UnitOfWork.Records.ToList())
    {
        var copy = rec;
        if (!recs.Any(x => x.Id == copy.Id))
        {
            // if not in the new collection, remove from database
            Record deleted = UnitOfWork.Records.Single(p => p.Id == copy.Id);
            UnitOfWork.Remove(deleted);
        }
    }
    // rest of method code deleted
}
That way you are sure to break on the first match instead of continuing to count.
If you need to know the actual number of elements, use Count(); it's the only way. If you are checking for the existence of a matching record, use Any() or Contains(). Both are MUCH faster than Count(), and both will perform about the same, but Contains will do an equality check on the entire object while Any() will evaluate a lambda predicate based on the object.

Sorting a list in .Net by one field, then another

I have a list of objects I need to sort based on some of their properties. This works fine to sort by one field:
reportDataRows.Sort((x, y) => x["Comment1"].CompareTo(y["Comment1"]));
foreach (var row in reportDataRows) {
    ...
}
I see lots of examples on here that do this with only one field. But how do I sort by one field, then another? Or by a list of many fields? It seems like using LINQ's OrderBy/ThenBy would be best, but I don't know enough about it to know how to use it.
For the parameters, something like this that supports any number of fields to sort by would be nice:
var sortBy = new List<string>(){"Comment1","Time"};
I don't want to be writing code to do this in every one of my apps. I plan on moving this sort code to the class that holds the data so that it can do more advanced things like using a list of parameters and implicitly recognizing that the field is a date and sorting it as a date instead of a string. The reportDataRow object contains fields with this information, so I don't have to do any messy checks to find out if the field is supposed to be a date.
Yes, I think it makes more sense to use OrderBy and ThenBy:
foreach (var row in reportDataRows.OrderBy(x => x["Comment1"]).ThenBy(x => x["Comment2"]))
{
    ...
}
This assumes the other thing you want to order by is "Comment2".
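To drive this from a list of field names as the question asks, you can chain ThenBy in a loop. A sketch, assuming the rows are indexable by string and using ReportDataRow as a hypothetical name for the row type:

var sortBy = new List<string> { "Comment1", "Time" };

// OrderBy establishes the primary key; each ThenBy adds a tie-breaker.
IOrderedEnumerable<ReportDataRow> sorted = reportDataRows.OrderBy(r => r[sortBy[0]]);
foreach (var field in sortBy.Skip(1))
    sorted = sorted.ThenBy(r => r[field]);

foreach (var row in sorted) {
    ...
}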
Try this:
reportDataRows.Sort((x, y) =>
{
    var compare = x["Comment1"].CompareTo(y["Comment1"]);
    if (compare != 0)
        return compare;
    return x["Comment2"].CompareTo(y["Comment2"]);
});
You may want to look at this previous answer where I posted an extension method which handles multiple order by's in LINQ. This allows this sort of syntax:
myList.OrderByMany(x => x.Field1, x => x.Field2);
Look at the example for ThenBy on MSDN.
If you're comparing your own objects, then you can implement the IComparable interface.
Otherwise, you can use the IComparer interface.
Using LINQ method syntax:
var sortedRows = reportDataRows.OrderBy(r => r["Comment1"])
                               .ThenBy(r => r["AnotherField"]);
foreach (var row in sortedRows) {
    ...
}
And even more readable using query comprehension syntax:
var sortedRows = from r in reportDataRows
                 orderby r["Comment1"], r["Comment2"]
                 select r;
foreach (var row in sortedRows) {
    ...
}
You got it. Enumerable.OrderBy().ThenBy() is your ticket. It works exactly like it looks: elements are sorted by the first projection, with ties decided by comparing the next projection. You can chain as many ThenBys as you want, and there are also OrderByDescending and ThenByDescending methods that sort a projection in descending order.
As Albin has pointed out, an OrderBy chain does not touch the original list unless you assign the result of the ordering back to the original variable, like this:
reportDataRows = reportDataRows.OrderBy(x=>x.Comment1).ThenBy(x=>x.Comment2).ToList();
As a rule, OrderBy will perform slightly slower than List.Sort(): the algorithm is designed to work on any IEnumerable series of elements, so in order to sort (which requires knowing every element of the series) it slurps its entire source enumerable into a new array. However, OrderBy has a distinct advantage over Sort in that it is a "stable" sort; elements that compare equal retain their relative order in the sorted enumerable (the first of the two that you'd encounter when iterating through the unsorted list will be the first of the two encountered when iterating through the sorted list).
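A quick sketch of what that stability guarantee means in practice:

var items = new List<(string Name, int Key)> { ("a", 1), ("b", 1), ("c", 0) };

// OrderBy is stable: "a" is guaranteed to stay before "b".
var ordered = items.OrderBy(x => x.Key).ToList(); // c, a, b

// List<T>.Sort makes no such guarantee: "b" may end up before "a".
items.Sort((x, y) => x.Key.CompareTo(y.Key));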

How to find the first item according to a specific ordering using LINQ in O(n)?

Suppose I have a list of items (e.g., Posts) and I want to find the first item according to some non-trivial ordering (e.g., PublishDate and then CommentsCount as a tie-breaker).
The natural way to do this with LINQ is like this:
posts.OrderBy(post => post.PublishDate).ThenBy(post => post.CommentsCount).First()
However, the micro-optimizer in me is worried that calling OrderBy actually costs me O(n log n) for sorting the entire list, when all I really need is an O(n) find-minimum operation.
So, is LINQ smart enough to return something from OrderBy() that knows how to optimize subsequent First() calls? If not, what's a better way to do this out-of-the-box? (I can always write my own FindMinimumItem implementation but that seems like overkill).
The sorting is smart in the way that it will only do the ThenBy on the first group from the OrderBy, but the OrderBy still has to sort all items before it can return the first group.
You can use the Aggregate method to get the first post according to a custom comparison:
Post lowest =
    posts.Aggregate((Post)null,
        (x, y) =>
            x == null
            || y.PublishDate < x.PublishDate
            || (y.PublishDate == x.PublishDate && y.CommentsCount < x.CommentsCount)
                ? y : x
    );
(Assuming that you are using LINQ to Objects of course.)
Is this in SQL or LINQ to Objects? If it's the latter, you probably want MinBy from MoreLINQ; your statement as written will indeed sort and then take the first item.
And yes, it's a shame that it doesn't include this (and similar things like DistinctBy) out of the box.
EDIT: I see your question has now changed; MoreLINQ doesn't support a compound comparison like that. In MiscUtil I have code to create a compound IComparer<T> - you could pass that into MinBy using the identity function as the key selector. Feel free to add a feature request for a MinBy which takes a source and an IComparer<T> without a key selector :)
Usually this is a max or min operation (I don't know what it's called in LINQ) over a specific key; sorting and then taking the first or last item is overkill in any language or framework.
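For reference, a hand-rolled O(n) minimum-by-comparer helper is short; a sketch (MoreLINQ's MinBy, or Enumerable.MinBy on .NET 6+, covers the common key-selector case):

public static class SequenceExtensions
{
    public static T MinimumBy<T>(this IEnumerable<T> source, IComparer<T> comparer)
    {
        using var e = source.GetEnumerator();
        if (!e.MoveNext())
            throw new InvalidOperationException("Sequence contains no elements");
        T min = e.Current;
        while (e.MoveNext())
            if (comparer.Compare(e.Current, min) < 0) // strict: keeps the first of equals
                min = e.Current;
        return min;
    }
}

Used with a compound comparer for the example in the question:

Post first = posts.MinimumBy(Comparer<Post>.Create((a, b) =>
{
    int byDate = a.PublishDate.CompareTo(b.PublishDate);
    return byDate != 0 ? byDate : a.CommentsCount.CompareTo(b.CommentsCount);
}));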
