Which is faster in .NET, .Contains() or .Count()? - c#

I want to compare an array of modified records against a list of records pulled from the database, and delete those records from the database that do not exist in the incoming array. The modified array comes from a client app that maintains the database, and this code runs in a WCF service app, so if the client deletes a record from the array, that record should be deleted from the database. Here's the sample code snippet:
public void UpdateRecords(Record[] recs)
{
    // look for deleted records
    foreach (Record rec in UnitOfWork.Records.ToList())
    {
        var copy = rec;
        if (!recs.Contains(rec)) // use this one?
        if (0 == recs.Count(p => p.Id == copy.Id)) // or this one?
        {
            // if not in the new collection, remove from database
            Record deleted = UnitOfWork.Records.Single(p => p.Id == copy.Id);
            UnitOfWork.Remove(deleted);
        }
    }
    // rest of method code deleted
}
My question: is there a speed advantage (or other advantage) to using the Count method over the Contains method? The Id property is guaranteed to be unique and to identify that particular record, so you don't need to do a bitwise compare, as I assume Contains might do.
Anyone?
Thanks, Dave

This would be faster:
if (!recs.Any(p => p.Id == copy.Id))
This has the same advantage as using Count(), but unlike Count() it stops as soon as it finds the first match.

You should not even consider Count, since you are only checking for the existence of a record. You should use Any instead.
Count forces the entire enumerable to be iterated to get the correct count, whereas Any stops enumerating as soon as the first match is found.
As for Contains, you need to consider whether, for the type in question, reference equality is equivalent to the Id comparison you are performing. By default it is not.
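To illustrate that last point, here is a small, self-contained sketch using a hypothetical stand-in for Record (deliberately without an Equals/GetHashCode override); it shows how Contains falls back to reference equality while an Id-based Any predicate finds the match:
using System;
using System.Linq;

// Hypothetical stand-in for the real Record class, with no Equals override.
class Record { public int Id { get; set; } }

class Demo
{
    static void Main()
    {
        var fromDb     = new Record { Id = 1 };
        var fromClient = new Record { Id = 1 };
        Record[] recs  = { fromClient };

        // Default reference equality: different instances, so Contains reports false.
        Console.WriteLine(recs.Contains(fromDb));             // False
        // An Id-based predicate finds the logical match and stops at the first hit.
        Console.WriteLine(recs.Any(p => p.Id == fromDb.Id));  // True
    }
}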

Assuming Record implements both GetHashCode and Equals properly, I'd use a different approach altogether:
// I'm assuming it's appropriate to pull down all the records from the database
// to start with, as you're already doing it.
foreach (Record recordToDelete in UnitOfWork.Records.ToList().Except(recs))
{
    UnitOfWork.Remove(recordToDelete);
}
Basically there's no need for N * M lookup time - the above code ends up building a set of records from recs based on their hash codes, and finds non-matches rather more efficiently than the original code.
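For reference, a minimal sketch of what "implements both GetHashCode and Equals properly" could look like for Record, assuming Id alone defines identity (which may or may not match the real class):
using System;

public class Record : IEquatable<Record>
{
    public int Id { get; set; }

    // In this sketch, identity is defined solely by Id.
    public bool Equals(Record other) => other != null && other.Id == Id;
    public override bool Equals(object obj) => Equals(obj as Record);
    public override int GetHashCode() => Id.GetHashCode();
}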
If you've actually got more to do, you could use:
HashSet<Record> recordSet = new HashSet<Record>(recs);
foreach (Record recordFromDb in UnitOfWork.Records.ToList())
{
    if (!recordSet.Contains(recordFromDb))
    {
        UnitOfWork.Remove(recordFromDb);
    }
    else
    {
        // Do other stuff
    }
}
(I'm not quite sure why your original code is refetching the record from the database using Single when you've already got it as rec...)

Contains() is going to use Equals() against your objects. If you have not overridden that method, it's possible Contains() is returning incorrect results. If you have overridden it to use the object's Id to determine identity, then Count() and Contains() are doing almost exactly the same thing, except Contains() will short-circuit as soon as it hits a match, whereas Count() will keep on counting. Any() might be a better choice than both of them.
Do you know for certain this is a bottleneck in your app? It feels like premature optimization to me. Which is the root of all evil, you know :)

Since you're guaranteed that there will be one and only one match, Any might be faster, because as soon as it finds a record that matches it will return true.
Count will traverse the entire list counting each occurrence. So if the item is #1 in the list of 1000 items, it's going to check each of the 1000.
EDIT
Also, this might be the time to mention avoiding premature optimization.
Wire up both of your methods and put a stopwatch before and after each one.
Create a sufficiently large list (1,000 items or more, depending on your domain) and see which one is faster.
My guess is that we're talking on the order of ms here.
I'm all for writing efficient code, just make sure you're not taking hours to save 5 ms on a method that gets called twice a day.
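For what it's worth, a rough, self-contained sketch of that kind of Stopwatch measurement (the sizes and integer data are arbitrary stand-ins, not the real records):
using System;
using System.Diagnostics;
using System.Linq;

// Fake "incoming" ids and "database" ids; half of the database ids are missing from recs.
int[] recs  = Enumerable.Range(0, 2000).ToArray();
int[] dbIds = Enumerable.Range(1000, 2000).ToArray();

var sw = Stopwatch.StartNew();
int missingViaAny = dbIds.Count(id => !recs.Any(p => p == id));
sw.Stop();
Console.WriteLine($"Any():   {sw.ElapsedMilliseconds} ms, missing = {missingViaAny}");

sw.Restart();
int missingViaCount = dbIds.Count(id => recs.Count(p => p == id) == 0);
sw.Stop();
Console.WriteLine($"Count(): {sw.ElapsedMilliseconds} ms, missing = {missingViaCount}");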

You could do it like this:
UnitOfWork.Records.RemoveAll(r => !recs.Any(rec => rec.Id == r.Id));

May I suggest an alternative approach that I believe should be faster, since Count would continue even after the first match:
public void UpdateRecords(Record[] recs)
{
    // look for deleted records
    foreach (Record rec in UnitOfWork.Records.ToList())
    {
        var copy = rec;
        if (!recs.Any(x => x.Id == copy.Id))
        {
            // if not in the new collection, remove from database
            Record deleted = UnitOfWork.Records.Single(p => p.Id == copy.Id);
            UnitOfWork.Remove(deleted);
        }
    }
    // rest of method code deleted
}
That way you are sure to break on the first match instead of continuing to count.

If you need to know the actual number of elements, use Count(); it's the only way. If you are just checking for the existence of a matching record, use Any() or Contains(). Both are MUCH faster than Count(), and both perform about the same, but Contains() does an equality check on the whole object while Any() evaluates a lambda predicate against it.

Related

Faster version of LINQ .Any() and .Count()

I'm checking if a list has an element whose source and target are already in the list. If not, I'm adding that element to the list. I'm doing it this way:
if (!objectToSerialize.elements
        .Any(x => x.data.source == edgetoAdd.data.source &&
                  x.data.target == edgetoAdd.data.target))
    objectToSerialize.elements.Add(edgetoAdd);
This works but very slowly. Is there a way to make this part faster? Are there faster implementations of Any() or Count? Thanks in advance.
You can pre-index the data into something like a HashSet<T> for some T. Since you are comparing two values, a tuple might help:
var existingValues = new HashSet<(string, string)>(
    objectToSerialize.elements.Select(x => (x.data.source, x.data.target)));
Now you can test
existingValues.Contains((edgetoAdd.data.source, edgetoAdd.data.target))
efficiently. But!! Building the index is not free. This mainly helps if you are going to be testing lots of values. If you're only adding one, a linear search is probably your best bet.
Note that you can use the index approach with an index that lasts between multiple Add calls, but you would also need to remember to .Add it to the index each time. You can short-cut the test/add pair by using the return value of .Add on the hashset:
if (existingValues.Add((edgetoAdd.data.source, edgetoAdd.data.target)))
{
    // a new value, yay!
    objectToSerialize.elements.Add(edgetoAdd);
}

Is anything faster than .Any() to check that condition?

I have been searching for a way to do the following with a faster EF query:
using (DAL.MandatsDatas db = new DAL.MandatsDatas())
{
    if (db.ARTICLE.Any(t => t.condition == condition))
        oneArticle = db.ARTICLE.First(t => t.condition == condition);
}
It works fine, but the more of these I add, the slower it feels.
It just looks like it goes through all the rows two times (I don't know if that's the case).
I've been searching and saw people using Count() > 0 and other irrelevant stuff...
Is there a faster way to check if something exists and then take it?
Also, I was wondering if FirstOrDefault() could help my case; how does it work?
Yes, FirstOrDefault is better here:
oneArticle = db.ARTICLE.FirstOrDefault(t => t.condition == condition);
Basically, Any will do one select and then First will do one more, while FirstOrDefault will do the same query First does and just return null if there was no output, eliminating the need to run a second selection.
Yes, FirstOrDefault will be faster because it only queries once. The way it works is: if no rows were available it returns null; if rows were available it returns the first row, based on any ordering you applied (if any).
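Putting that together with the question's using block (DAL.MandatsDatas, ARTICLE and condition are taken from the question), the single-query pattern would look roughly like this:
using (DAL.MandatsDatas db = new DAL.MandatsDatas())
{
    // One round trip: the first matching row, or null if nothing matches.
    var oneArticle = db.ARTICLE.FirstOrDefault(t => t.condition == condition);
    if (oneArticle != null)
    {
        // work with oneArticle here
    }
}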

Slow LINQ Performance on DataTable Where Clause?

I'm dumping a table out of MySQL into a DataTable object using MySqlDataAdapter. Database input and output is doing fine, but my application code seems to have a performance issue I was able to track down to a specific LINQ statement.
The goal is simple, search the contents of the DataTable for a column value matching a specific string, just like a traditional WHERE column = 'text' SQL clause.
Simplified code:
foreach (String someValue in someList) {
    String searchCode = OutOfScopeFunction(someValue);
    var results = emoteTable.AsEnumerable()
        .Where(myRow => myRow.Field<String>("code") == searchCode)
        .Take(1);
    if (results.Any()) {
        results.First()["columnname"] = 10;
    }
}
This simplified code is executed thousands of times, once for each entry in someList. When I run Visual Studio Performance Profiler I see that the "results.Any()" line is highlighted as consuming 93.5% of the execution time.
I've tried several different methods for optimizing this code, but none have improved performance while keeping the emoteTable DataTable as the primary source of the data. I can convert emoteTable to a Dictionary<String, DataRow> outside of the foreach, but then I have to keep the DataTable and the Dictionary in sync, which, while still a performance improvement, feels wrong.
Three questions:
1. Is this the proper way to search for a value in a DataTable (the equivalent of a traditional SQL WHERE clause)? If not, how SHOULD it be done?
2. Addendum to 1: regardless of the proper way, what is the fastest (in execution time)?
3. Why does the results.Any() line consume 90%+ of the resources? In this situation it makes more sense that the var results line should consume the resources; after all, it's the line doing the actual search, right?
Thank you for your time. If I find an answer I shall post it here as well.
Any() is taking 90% of the time because the query is only executed when you call Any(); before that, the query has not actually been run (deferred execution).
It would seem the problem is that you first fetch the entire table into memory and then search it. You should instruct your database to do the search.
Moreover, when you call results.First(), the whole results query is executed again.
With deferred execution in mind, you should write something like
var result = emoteTable.AsEnumerable()
    .Where(myRow => myRow.Field<String>("code") == searchCode)
    .FirstOrDefault();
if (result != null) {
    result["columnname"] = 10;
}
What you have implemented is pretty much a join:
var searchCodes = someList.Select(OutOfScopeFunction);
var emotes = emoteTable.AsEnumerable();
var results = Enumerable.Join(emotes, searchCodes,
                              e => e.Field<String>("code"), sc => sc,
                              (e, sc) => e);
foreach (var result in results)
{
    result["columnname"] = 10;
}
Join will probably optimize the access to both lists using some kind of lookup.
But the first thing I would do is completely abandon the idea of combining DataTable and LINQ. They are two different technologies, and trying to reason about what they do internally when combined is hard.
Did you try doing raw UPDATE calls? How many items are you expecting to update?

How can you determine the current items position whilst looping through the collection?

How can you determine the current items position whilst looping through the collection?
I'm working through decision data, grouped by each client, but I have some business logic which depends on the "position" in the set, i.e. 1st, 2nd, 3rd, etc. in conjunction with other properties of the record, e.g. if it's the 3rd decision about a client and their rating in the instance is A then ...
var multiples = from d in context.Decision_Data
                group d by d.Client_No
                into c
                where c.Count() > 1
                select c;

foreach (var grouping in multiples)
{
    foreach (var item in grouping)
    {
        // business logic here for processing each decision for a Client_No
        // BUT depends on item position ... 1st, 2nd, etc.
    }
}
UPDATE: I appreciate I could put a counter in and manually increment it, but it feels wrong, and I'd have thought there was something in .NET to handle this??
Something like this:
foreach (var grouping in multiples)
{
    foreach (var x in grouping.Select((item, index) => new { index, item }))
    {
        // x.index is the position of the item in this group
        // x.item is the item itself
    }
}
Side note: you can make the implementation of your LINQ query a bit more efficient. Count() > 1 will enumerate each group fully, which you are likely to do in the foreach anyway. Instead you can use Skip(1).Any(), which will stop iterating the group as soon as it finds a second item. Obviously this will only make a real difference for (very) large input lists.
var multiples = from d in context.Decision_Data
                group d by d.Client_No
                into c
                where c.Skip(1).Any()
                select c;
There isn't anything offered by the standard foreach. Simply maintain an external count.
There is an overload on the Enumerable.Select extension method that provides the index of the current item:
http://msdn.microsoft.com/en-us/library/bb534869
But without knowing what your code is trying to do in the foreach I cannot really offer an example of using it. In theory you could project an anonymous type that has the index stored and use that later on with the foreach. It appears that jeroenh's answer went down this route.
As Adam stated, you could either go with Adam's solution, or do a ToList() on the query to be able to do
multiples.IndexOf(grouping)
I fail to see how you can have any certainty about the order of your decisions.
I'm guessing your data comes from a long-term data source (e.g. a database) and that you don't have any control over the order in which the decisions are fetched from the data source, especially after applying a group by.
I would add an "order" field (or column) to the Decision entity to track the order in which the decisions were made, set while adding the Decision to the data source.
That way, you could directly use this field in your business logic.
There must be many ways to track decision order, but without one you can't even be sure in what order the decisions were made.
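As a rough illustration of that idea, assuming a hypothetical DecisionOrder column has been added to Decision_Data (and the usual System.Linq using):
foreach (var grouping in multiples)
{
    // Order within each client group by the persisted DecisionOrder value,
    // then project the position alongside each decision.
    foreach (var x in grouping.OrderBy(d => d.DecisionOrder)
                              .Select((item, index) => new { item, index }))
    {
        // x.index: 0 for the 1st decision, 1 for the 2nd, and so on
        // x.item:  the decision itself
    }
}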

What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
                            r.year == record.year &&
                            r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O, which can build indexes; however, it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important, since a lot of the original queries take advantage of the fact that a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup(), which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the trie structure for the other lookups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take a long time. I infer this is what you're doing (since you have an assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach (var group in lookup)
{
    // do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new { r.year, r.period });
var key = new { record.year, record.period };
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
    public int year;
    public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
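A rough sketch of that packaging step, reusing the Key struct above and assuming the list elements expose year, period and ryp as in the question (CostRecord is a placeholder name, and the usual System.Linq / System.Collections.Generic usings are assumed):
// Build the index once and reuse it across lookups. Default struct value equality
// on Key is correct for this sketch, though explicit Equals/GetHashCode overrides
// would make the dictionary faster.
IDictionary<Key, ICollection<CostRecord>> index = cost
    .GroupBy(r => new Key { year = r.year, period = r.period })
    .ToDictionary(g => g.Key, g => (ICollection<CostRecord>)g.ToList());

// At query time only rows sharing this record's year and period are scanned,
// and TrimEnd is hoisted out of the per-row test.
string prefix = record.form.TrimEnd();
CostRecord match = null;
if (index.TryGetValue(new Key { year = record.year, period = record.period }, out var bucket))
{
    match = bucket.FirstOrDefault(r => r.ryp.StartsWith(prefix));
}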
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.
