Find() vs. enumeration on lists

Find() vs. enumeration on lists - c#

I'm working with a code base where lists need to be frequently searched for a single element.
Is it faster to use a Predicate and Find() than to manually do an enumeration on the List?
for example:
string needle = "example";
FooObj result = _list.Find(delegate(FooObj foo) {
return foo.Name == needle;
});
vs.
string needle = "example";
foreach (FooObj foo in _list)
{
if (foo.Name == needle)
return foo;
}
While they are equivalent in functionality, are they equivalent in performance as well?

They are not equivalent in performance. The Find() method requires a method (in this case delegate) invocation for every item in the list. Method invocation is not free and is relatively expensive as compared to an inline comparison. The foreach version requires no extra method invocation per object.
That being said, I wouldn't pick one or the other based on performance until I actually profiled my code and found this to be a problem. I haven't yet found the overhead of this scenario to every be a "hot path" problem for code I've written and I use this pattern a lot with Find and other similar methods.

If searching your list is too slow as-is, you can probably do better than a linear search. If you can keep the list sorted, you can use a binary search to find the element in O(lg n) time.
If you're searching a whole lot, consider replacing that list with a Dictionary to index your objects by name.

Technically, the runtime performance of the delegate version will be slightly worse than the other version - but in most cases you'd be hard pressed to perceive any difference.
Of more importance (IHMO) is the code time performance of being able to write what you want, rather than how you want it. This makes a big difference in maintainability.
This original code:
string needle = "example";
foreach (FooObj foo in _list)
{
if (foo.Name == needle)
return foo;
}
requires any maintainer to read the code and understand that you're looking for a particular item.
This code
string needle = "example";
return _list.Find(
delegate(FooObj foo)
{
return foo.Name == needle;
});
makes it clear that you're looking for a particular item - quicker to understand.
Finally, this code, using features from C# 3.0:
string needle = "example";
return _list.Find( foo => foo.Name == needle);
does exactly the same thing, but in one line that's even faster to read and understand (well, once you understand lambda expressions, anyway).
In summary, given that the performance of the alternatives is nearly equal, choose the one that makes the code easier to read and maintain.

"I'm working with a code base where lists need to be frequently searched for a single element"
It is better to change your data structure to be Dictionary instead of List to get better performance

Similar question was asked for List.ForEach vs. foreach-iteration (foreach vs someList.Foreach(){}).
In that case List.ForEach was a bit faster.

As Jared pointed out, there are differences.
But, as always, don't worry unless you know it's a bottleneck. And if it is a bottleneck, that's probably because the lists are big, in which case you should consider using a faster find - a hash table or binary tree, or even just sorting the list and doing binary search will give you log(n) performance which will have far more impact than tweaking your linear case.

Related

LINQ or foreach - style/readability and speed

I have a piece of code for some validation logic, which in generalized for goes like this:
private bool AllItemsAreSatisfactoryV1(IEnumerable<Source> collection)
{
foreach(var foo in collection)
{
Target target = SomeFancyLookup(foo);
if (!target.Satisfactory)
{
return false;
}
}
return true;
}
This works, is pretty easy to understand, and has early-out optimization. It is, however, pretty verbose. The main purpose of this question is what is considered readable and good style. I'm also interested in the performance; I'm a firm believer that premature {optimization, pessimization} is the root of all evil, and try to avoid micro-optimizing as well as introducing bottlenecks.
I'm pretty new to LINQ, so I'd like some comments on the two alternative versions I've come up with, as well as any other suggestions wrt. readability.
private bool AllItemsAreSatisfactoryV2(IEnumerable<Source> collection)
{
return null ==
(from foo in collection
where !(SomeFancyLookup(foo).Satisfactory)
select foo).First();
}
private bool AllItemsAreSatisfactoryV3(IEnumerable<Source> collection)
{
return !collection.Any(foo => !SomeFancyLookup(foo).Satisfactory);
}
I don't believe that V2 offers much over V1 in terms of readability, even if shorter. I find V3 to be clear & concise, but I'm not too fond of the Method().Property part; of course I could turn the lambda into a full delegate, but then it loses it's one-line elegance.
What I'd like comments on are:
Style - ever so subjective, but what do you feel is readable?
Performance - are any of these a definite no-no? As far as I understand, all three methods should early-out.
Debuggability - anything to consider?
Alternatives - anything goes.
Thanks in advance :)

I think All would be clearer:
private bool AllItemsAreSatisfactoryV1(IEnumerable<Source> collection)
{
return collection.Select(f => SomeFancyLookup(f)).All(t => t.Satisfactory);
}
I think it's unlikely using linq here would cause a performance problem over a regular foreach loop, although it would be straightforward to change if it did.

I personally have no problem with the style of V3, and that one would be my first choice. You're essentially looking through the list for any whose lookup is not satisfactory.
V2 is difficult to grasp the intent of, and in its current form will throw an exception (First() requires the source IEnumerable to not be empty; I think you're looking for FirstOrDefault()). Why not just tack Any() on the end instead of comparing a result from the list to null?
V1 is fine, if a bit loquacious, and probably the easiest to debug, as I've found debugging lambdas to be a bit persnickety at times. You can remove the inner braces to lose some whitespace without sacrificing readability.
Really, all three will boil down into very similar opcodes; iterate through collection, call SomeFancyLookup(), and check a property of its return value; get out on the first failure. Any() "hides" a very similar foreach algorithm. The difference between V1 and all others is the use of a named variable, which MIGHT be a little less performant, but you have a reference to a Target in all three cases so I doubt it's significant, if a difference even exists.

Which LINQ query is more effective?

I have a huge IEnumerable(suppose the name is myItems), which way is more effective?
Solution 1: Filter it first then ForEach.
Array.ForEach(myItems.Where(FILTER-IT-HERE).ToArray(),MY-ACTION);
Solution 2: Do RETURN in MY-ACTION if the item is not up to the mustard.
Array.ForEach(myItems.ToArray(),MY-ACTION-WITH-FILTER);
Is one of them always better than another? Or any other good suggestions? Thanks in advance.

Did you do any measurements? Since WE can't measure the run time of My-Action then only you can. Measure and decide.

Sometimes one has to create benchmark's because similar looking activities could produce radically different and unexpected results.
You do not say what your data source is so I'm going to assume it may be data on an SQL server in which case filtering at the server side will likely always be the best approach because you have minimized the amount of data transfer. Memory access is always faster than data transfer from disk to memory so whenever you can transfer fewer records, you are likely to have better performance.

Well, both times, you're converting to an array, which might not be so efficient if the IEnumerable is very large (like you said). You could create a generic extension method for IEnumerable, like:
public static void ForEach<T>(this IEnumerable<T> current, Action<T> action) {
foreach (var i in current) {
action(i);
}
}
and then you could do this:
IEnumerable<int> ints = new List<int>();
ints.Where(i => i == 5).ForEach(i => Console.WriteLine(i));

If performance is a concern, it's unclear to me why you'd be bothering to construct an entire array in the first place. Why not just this?
foreach (var item in myItems.Where(FILTER-IT-HERE))
MY-ACTION;
Or:
foreach (var item in myItems)
MY-ACTION-WITH-FILTER;
I ask because, while the others are right that you can't really know without testing, I wouldn't expect there to be much difference between the above two options. I would expect there to be a difference, on the other hand, between creating/populating an array (seemingly for no reason) and not creating an array.

Everything else being equal, calling ToArray() first will impart a greater performance hit than when calling it last. Although, as has been stated by others before me,
Why use ToArray() and Array.ForEach() at all?
We don't know that everything else actually is equal since you do not reveal the implementation details of your filter and action.

The idea of LINQ is to work on enumerable collections, so the best LINQ query is the one where you don't use Array.ForEach() and .ToArray() at all.

I would say that this falls into the category of premature optimization. If, after establishing benchmarks, you find that the code is too slow, you can always try each approach and pick the result that works better for you.
Since we don't know how the IEnumerable<> is produced it's hard to say which approach will perform better. We also don't know how many items will remain after you apply your predicate - nor do we know whether the action or iteration steps are going to be the dominant factor in the execution of your code. The only way to know for sure is to try it both ways, profile the results, and pick the best.
Performance aside, I would choose the version that is most clear - which (for me) is to first filter and then apply the projection to the result.

How to replace for-loops with a functional statement in C#?

A colleague once said that God is killing a kitten every time I write a for-loop.
When asked how to avoid for-loops, his answer was to use a functional language. However, if you are stuck with a non-functional language, say C#, what techniques are there to avoid for-loops or to get rid of them by refactoring? With lambda expressions and LINQ perhaps? If so, how?
Questions
So the question boils down to:
Why are for-loops bad? Or, in what context are for-loops to avoid and why?
Can you provide C# code examples of how it looks before, i.e. with a loop, and afterwards without a loop?

Functional constructs often express your intent more clearly than for-loops in cases where you operate on some data set and want to transform, filter or aggregate the elements.
Loops are very appropriate when you want to repeatedly execute some action.
For example
int x = array.Sum();
much more clearly expresses your intent than
int x = 0;
for (int i = 0; i < array.Length; i++)
{
x += array[i];
}

Why are for-loops bad? Or, in what
context are for-loops to avoid and
why?
If your colleague has a functional programming, then he's probably already familiar with the basic reasons for avoiding for loops:
Fold / Map / Filter cover most use cases of list traversal, and lend themselves well to function composition. For-loops aren't a good pattern because they aren't composable.
Most of the time, you traverse through a list to fold (aggregate), map, or filter values in a list. These higher order functions already exist in every mainstream functional language, so you rarely see the for-loop idiom used in functional code.
Higher order functions are the bread and butter of function composition, meaning you can easily combine simple function into something more complex.
To give a non-trivial example, consider the following in an imperative language:
let x = someList;
y = []
for x' in x
y.Add(f x')
z = []
for y' in y
z.Add(g y')
In a functional language, we'd write map g (map f x), or we can eliminate the intermediate list using map (f . g) x. Now we can, in principle, eliminate the intermediate list from the imperative version, and that would help a little -- but not much.
The main problem with the imperative version is simply that the for-loops are implementation details. If you want change the function, you change its implementation -- and you end up modifying a lot of code.
Case in point, how would you write map g (filter f x) in imperatively? Well, since you can't reuse your original code which maps and maps, you need to write a new function which filters and maps instead. And if you have 50 ways to map and 50 ways to filter, how you need 50^50 functions, or you need to simulate the ability to pass functions as first-class parameters using the command pattern (if you've ever tried functional programming in Java, you understand what a nightmare this can be).
Back in the the functional universe, you can generalize map g (map f x) in way that lets you swap out the map with filter or fold as needed:
let apply2 a g b f x = a g (b f x)
And call it using apply2 map g filter f or apply2 map g map f or apply2 filter g filter f or whatever you need. Now you'd probably never write code like that in the real world, you'd probably simplify it using:
let mapmap g f = apply2 map g map f
let mapfilter g f = apply2 map g filter f
Higher-order functions and function composition give you a level of abstraction that you cannot get with the imperative code.
Abstracting out the implementation details of loops let's you seamlessly swap one loop for another.
Remember, for-loops are an implementation detail. If you need to change the implementation, you need to change every for-loop.
Map / fold / filter abstract away the loop. So if you want to change the implementation of your loops, you change it in those functions.
Now you might wonder why you'd want to abstract away a loop. Consider the task of mapping items from one type to another: usually, items are mapped one at a time, sequentially, and independently from all other items. Most of the time, maps like this are prime candidates for parallelization.
Unfortunately, the implementation details for sequential maps and parallel maps aren't interchangeable. If you have a ton of sequential maps all over your code, and you want swap them out for parallel maps, you have two choices: copy/paste the same parallel mapping code all over your code base, or abstract away mapping logic into two functions map and pmap. Once you're go the second route, you're already knee-deep in functional programming territory.
If you understand the purpose of function composition and abstracting away implementation details (even details as trivial as looping), you can start to appreciate just how and why functional programming is so powerful in the first place.

For loops are not bad. There are many very valid reasons to keep a for loop.
You can often "avoid" a for loop by reworking it using LINQ in C#, which provides a more declarative syntax. This can be good or bad depending on the situation:
Compare the following:
var collection = GetMyCollection();
for(int i=0;i<collection.Count;++i)
{
if(collection[i].MyValue == someValue)
return collection[i];
}
vs foreach:
var collection = GetMyCollection();
foreach(var item in collection)
{
if(item.MyValue == someValue)
return item;
}
vs. LINQ:
var collection = GetMyCollection();
return collection.FirstOrDefault(item => item.MyValue == someValue);
Personally, all three options have their place, and I use them all. It's a matter of using the most appropriate option for your scenario.

There's nothing wrong with for loops but here are some of the reasons people might prefer functional/declarative approaches like LINQ where you declare what you want rather than how you get it:-
Functional approaches are potentially easier to parallelize either manually using PLINQ or by the compiler. As CPUs move to even more cores this may become more important.
Functional approaches make it easier to achieve lazy evaluation in multi-step processes because you can pass the intermediate results to the next step as a simple variable which hasn't been evaluated fully yet rather than evaluating the first step entirely and then passing a collection to the next step (or without using a separate method and a yield statement to achieve the same procedurally).
Functional approaches are often shorter and easier to read.
Functional approaches often eliminate complex conditional bodies within for loops (e.g. if statements and 'continue' statements) because you can break the for loop down into logical steps - selecting all the elements that match, doing an operation on them, ...

For loops don't kill people (or kittens, or puppies, or tribbles). People kill people.
For loops, in and of themselves, are not bad. However, like anything else, it's how you use them that can be bad.

Sometime you don't kill just one kitten.
for (int i = 0; i < kittens.Length; i++)
{
kittens[i].Kill();
}
Sometimes you kill them all.

You can refactor your code well enough so that you won't see them often. A good function name is definitely more readable that a for loop.
Taking the example from AndyC :
Loop
// mystrings is a string array
List<string> myList = new List<string>();
foreach(string s in mystrings)
{
if(s.Length > 5)
{
myList.add(s);
}
}
Linq
// mystrings is a string array
List<string> myList = mystrings.Where<string>(t => t.Length > 5)
.ToList<string();
Wheter you use the first or the second version inside your function, It's easier to read
var filteredList = myList.GetStringLongerThan(5);
Now that's an overly simple example, but you get my point.

Your colleague is not right. For loops are not bad per se. They are clean, readable and not particularly error prone.

Your colleague is wrong about for loops being bad in all cases, but correct that they can be rewritten functionally.
Say you have an extension method that looks like this:
void ForEach<T>(this IEnumerable<T> collection, Action <T> action)
{
foreach(T item in collection)
{
action(item)
}
}
Then you can write a loop like this:
mycollection.ForEach(x => x.DoStuff());
This may not be very useful now. But if you then replace your implementation of the ForEach extension method for use a multi threaded approach then you gain the advantages of parallelism.
This obviously isn't always going to work, this implementation only works if the loop iterations are completely independent of each other, but it can be useful.
Also: always be wary of people who say some programming construct is always wrong.

A simple (and pointless really) example:
Loop
// mystrings is a string array
List<string> myList = new List<string>();
foreach(string s in mystrings)
{
if(s.Length > 5)
{
myList.add(s);
}
}
Linq
// mystrings is a string array
List<string> myList = mystrings.Where<string>(t => t.Length > 5).ToList<string>();
In my book, the second one looks a lot tidier and simpler, though there's nothing wrong with the first one.

Sometimes a for-loop is bad if there exists a more efficient alternative. Such as searching, where it might be more efficient to sort a list and then use quicksort or binary sort. Or when you are iterating over items in a database. It is usually much more efficient to use set-based operations in a database instead of iterating over the items.
Otherwise if the for-loop, especially a for-each makes the most sense and is readable, then I would go with that rather than rafactor it into something that isn't as intuitive. I personally don't believe in these religious sounding "always do it this way, because that is the only way". Rather it is better to have guidelines, and understand in what scenarios it is appropriate to apply those guidelines. It is good that you ask the Why's!

For loop is, let's say, "bad" as it implies branch prediction in CPU, and possibly performance decrease when branch prediction miss.
But CPU (having a branch prediction accuracy of 97%) and compiler with tecniques like loop unrolling, make loop performance reduction negligible.

If you abstract the for loop directly you get:
void For<T>(T initial, Func<T,bool> whilePredicate, Func<T,T> step, Action<T> action)
{
for (T t = initial; whilePredicate(t); step(t))
{
action(t);
}
}
The problem I have with this from a functional programming perspective is the void return type. It essentially means that for loops do not compose nicely with anything. So the goal is not to have a 1-1 conversion from for loop to some function, it is to think functionally and avoid doing things that do not compose. Instead of thinking of looping and acting think of the whole problem and what you are mapping from and to.

A for loop can always be replaced by a recursive function that doesn't involve the use of a loop. A recursive function is a more functional stye of programming.
But if you blindly replace for loops with recursive functions, then kittens and puppies will both die by the millions, and you will be done in by a velocirapter.
OK, here's an example. But please keep in mind that I do not advocate making this change!
The for loop
for (int index = 0; index < args.Length; ++index)
Console.WriteLine(args[index]);
Can be changed to this recursive function call
WriteValuesToTheConsole(args, 0);
static void WriteValuesToTheConsole<T>(T[] values, int startingIndex)
{
if (startingIndex < values.Length)
{
Console.WriteLine(values[startingIndex]);
WriteValuesToTheConsole<T>(values, startingIndex + 1);
}
}
This should work just the same for most values, but it is far less clear, less effecient, and could exhaust the stack if the array is too large.

Your colleague may be suggesting under certain circumstances where database data is involved that it is better to use an aggregate SQL function such as Average() or Sum() at query time as opposed to processing the data on the C# side within an ADO .NET application.
Otherwise for loops are highly effective when used properly, but realize that if you find yourself nesting them to three or more orders, you might need a better algorithm, such as one that involves recursion, subroutines or both. For example, a bubble sort has a O(n^2) runtime on its worst-case (reverse order) scenario, but a recursive sort algorithm is only O(n log n), which is much better.
Hopefully this helps.
Jim

Any construct in any language is there for a reason. It's a tool to be used to accomplish a task. Means to an end. In every case, there are manners in which to use it appropriately, that is, in a clear and concise way and within the spirit of the language AND manners to abuse it. This applies to the much-misaligned goto statement as well as to your for loop conundrum, as well as while, do-while, switch/case, if-then-else, etc. If the for loop is the right tool for what you're doing, USE IT and your colleague will need to come to terms with your design decision.

It depends upon what is in the loop but he/she may be referring to a recursive function
//this is the recursive function
public static void getDirsFiles(DirectoryInfo d)
{
//create an array of files using FileInfo object
FileInfo [] files;
//get all files for the current directory
files = d.GetFiles("*.*");
//iterate through the directory and print the files
foreach (FileInfo file in files)
{
//get details of each file using file object
String fileName = file.FullName;
String fileSize = file.Length.ToString();
String fileExtension =file.Extension;
String fileCreated = file.LastWriteTime.ToString();
io.WriteLine(fileName + " " + fileSize +
" " + fileExtension + " " + fileCreated);
}
//get sub-folders for the current directory
DirectoryInfo [] dirs = d.GetDirectories("*.*");
//This is the code that calls
//the getDirsFiles (calls itself recursively)
//This is also the stopping point
//(End Condition) for this recursion function
//as it loops through until
//reaches the child folder and then stops.
foreach (DirectoryInfo dir in dirs)
{
io.WriteLine("--------->> {0} ", dir.Name);
getDirsFiles(dir);
}
}

The question is if the loop will be mutating state or causing side effects. If so, use a foreach loop. If not, consider using LINQ or other functional constructs.
See "foreach" vs "ForEach" on Eric Lippert's Blog.

LINQ Optimization Question

So I've been using LINQ for a while, and I have a question.
I have a collection of objects. They fall into two categories based on their property values. I need to set a different property one way for one group, and one way for the other:
foreach(MyItem currentItem in myItemCollection)
{
if (currentItem.myProp == "CATEGORY_ONE")
{
currentItem.value = 1;
}
else if (currentItem.myProp == "CATEGORY_TWO")
{
currentItem.value = 2;
}
}
Alternately, I could do something like:
myItemCollection.Where(currentItem=>currentItem.myProp == "CATEGORY_ONE").ForEach(item=>item.value = 1);
myItemCollection.Where(currentItem=>currentItem.myProp == "CATEGORY_TWO").ForEach(item=>item.value = 2);
I would think the first one is faster, but figured it couldn't hurt to check.

Iterating through the collection only once (and not calling any delegates, and not using as many iterators) is likely to be slightly faster, but I very much doubt that it'll be significant.
Write the most readable code which does the job, and only worry about performance at the micro level (i.e. where it's easy to change) when it's a problem.
I think the first piece of code is more readable in this case. Less LINQy, but more readable.

How about doing it like that?
myItemCollection.ForEach(item => item.value = item.myProp == "CATEGORY_ONE" ? 1 : 2);

Real Answer
Only a profiler will really tell you which one is faster.
Fuzzy Answer
The first one is most likely faster in terms of raw speed. There are two reasons why
The list is only iterated a single time
The second one erquires 2 delegate invocations for every element in the list.
The real question though is "does the speed difference between the two solutions matter?" That is the only question that is relevant to your application. And only profiling can really give you much data on this.

Generic list FindAll() vs. foreach

I'm looking through a generic list to find items based on a certain parameter.
In General, what would be the best and fastest implementation?
1. Looping through each item in the list and saving each match to a new list and returning that
foreach(string s in list)
{
if(s == "match")
{
newList.Add(s);
}
}
return newList;
Or
2. Using the FindAll method and passing it a delegate.
newList = list.FindAll(delegate(string s){return s == "match";});
Don't they both run in ~ O(N)? What would be the best practice here?
Regards,
Jonathan

You should definitely use the FindAll method, or the equivalent LINQ method. Also, consider using the more concise lambda instead of your delegate if you can (requires C# 3.0):
var list = new List<string>();
var newList = list.FindAll(s => s.Equals("match"));

I would use the FindAll method in this case, as it is more concise, and IMO, has easier readability.
You are right that they are pretty much going to both perform in O(N) time, although the foreach statement should be slightly faster given it doesn't have to perform a delegate invocation (delegates incur a slight overhead as opposed to directly calling methods).
I have to stress how insignificant this difference is, it's more than likely never going to make a difference unless you are doing a massive number of operations on a massive list.
As always, test to see where the bottlenecks are and act appropriately.

Jonathan,
A good answer you can find to this is in chapter 5 (performance considerations) of Linq To Action.
They measure a for each search that executes about 50 times and that comes up with foreach = 68ms per cycle / List.FindAll = 62ms per cycle. Really, it would probably be in your interest to just create a test and see for yourself.

List.FindAll is O(n) and will search the entire list.
If you want to run your own iterator with foreach, I'd recommend using the yield statement, and returning an IEnumerable if possible. This way, if you end up only needing one element of your collection, it will be quicker (since you can stop your caller without exhausting the entire collection).
Otherwise, stick to the BCL interface.

Any perf difference is going to be extremely minor. I would suggest FindAll for clarity, or, if possible, Enumerable.Where. I prefer using the Enumerable methods because it allows for greater flexibility in refactoring the code (you don't take a dependency on List<T>).

Yes, they both implementations are O(n). They need to look at every element in the list to find all matches. In terms of readability I would also prefer FindAll. For performance considerations have a look at LINQ in Action (Ch 5.3). If you are using C# 3.0 you could also apply a lambda expression. But that's just the icing on the cake:
var newList = aList.FindAll(s => s == "match");

Im with the Lambdas
List<String> newList = list.FindAll(s => s.Equals("match"));

Unless the C# team has improved the performance for LINQ and FindAll, the following article seems to suggest that for and foreach would outperform LINQ and FindAll on object enumeration: LINQ on Objects Performance.
This artilce was dated back to March 2009, just before this post originally asked.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Find() vs. enumeration on lists - c#

If searching your list is too slow as-is, you can probably do better than a linear search. If you can keep the list sorted, you can use a binary search to find the element in O(lg n) time. If you're searching a whole lot, consider replacing that list with a Dictionary to index your objects by name.

"I'm working with a code base where lists need to be frequently searched for a single element" It is better to change your data structure to be Dictionary instead of List to get better performance

Similar question was asked for List.ForEach vs. foreach-iteration (foreach vs someList.Foreach(){}). In that case List.ForEach was a bit faster.

Related

LINQ or foreach - style/readability and speed

Which LINQ query is more effective?

How to replace for-loops with a functional statement in C#?

LINQ Optimization Question

Generic list FindAll() vs. foreach

Categories

Resources