Efficiently gather a list of objects

Efficiently gather a list of objects - c#

List<MyClass> options = new List<MyClass>();
foreach (MyClass entity in ExistingList) {
if (entity.IsCoolEnough) {
options.Add(entity);
}
}
I'm simply curious what the fastest, most efficient way of doing this is. The list isn't very large, but it's run frequently, so I'd like to keep it snappy. I'm not looking for a change in verbosity either. I just want runtime as fast as possible.

Well, using LINQ it reads more intuitively:
var options = (from e in ExistingList where e.IsCoolEnough select e).ToList();
I'm not sure whether it is faster or more efficient, though.
I'd say that for small lists, this is actually some kind of over-optimization, as for short lists, foreach, for and the above approach shouldn't make a difference at all. So instead of optimizing this, first check whether it imposes a runtime-problem at all.

I see two possible cases of "general" optimization, either at db-level or model-level depending on where the resource hog is.
If you can select via a where on IsCoolEnough in db-level:
var result = ExistingList.Where(e => e.IsCoolEnough) // still in db?
.ToList(); // enumerate
If you can't and the expensive operation is IsCoolEnough:
var result = new ConcurrentBag<MyClass>(); // Thread safe but unordered.
Parallel.ForEach(ExistingList, current =>
{
if (current.IsCoolEnough) result.Add(current);
});
It's impossible for me to give more solid advice given the quite unknown requirements. But you do not have any general "faults" in your implementation.
I like referring to this article, so I'm doing that again now. It's worth the read (Ie click somewhere on/in this line/sentence, it's a link).

Related

Select Statement doing a "for each" thing, then returning itself

I have a data pipeline. I would like to do some conditional transformations on the data in a fast way. It seems like I could just build up an enumeration without triggering it till the end like so:
var data = read();
if (!adminUser) data = data.Select(d => {d.ClearAdminOnlyFields(); return d;});
if (summarize) data = data.Select(d => {d.ClearVerboseFields(); return d;});
if (translate) data = data.Select(d => {d.Translate(culture); return d;});
return data;
The data above is thousands of items. I have tried googling this style of using select, but can't find any good examples of it being used. It seems like folks always enumerate with a .ToList() then do the transforms in a .ForEach(), but multiple enumerations like that should be slower! It seems like it would also be slower to do one big foreach with the if check inside it.
My question is: Am I wrong about this being faster? If so, can you explain what alternative is faster/better and why.

You should not do this, not because it doesn't work, but because it violates the common expectation of what Select does (transforms data without side effects).
You should use a foreach for this kind of logic instead. You should be able to do this with a single foreach, enumerating only once. Using Select to do this is not faster.

Is it bad practice to purposely rely on Linq Side Effects?

A programming pattern like this comes up every so often:
int staleCount = 0;
fileUpdatesGridView.DataSource = MultiMerger.TargetIds
.Select(id =>
{
FileDatabaseMerger merger = MultiMerger.GetMerger(id);
if (merger.TargetIsStale)
staleCount++;
return new
{
Id = id,
IsStale = merger.TargetIsStale,
// ...
};
})
.ToList();
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = staleCount > 0;
I'm not sure there is a more succinct way to code this?
Even if so, is it bad practice to do this?

No, it is not strictly "bad practice" (like constructing SQL queries with string concatenation of user input or using goto).
Sometimes such code is more readable than several queries/foreach or no-side-effect Aggregate call. Also it is good idea to at least try to write foreach and no-side-effect versions to see which one is more readable/easier to prove correctness.
Please note that:
it is frequently very hard to reason what/when will happen with such code. I.e. you sample hacks around the fact of LINQ queries executed lazily with .ToList() call, otherwise that value will not be computed.
pure functions can be run in parallel, once with side effects need a lot of care to do so
if you ever need to convert LINQ-to-Object to LINQ-to-SQL you have to rewrite such queries
generally LINQ queries favor functional programming style without side-effects (and hence by convention readers would not expect side-effects in the code).

Why not just code it like this:
var result=MultiMerger.TargetIds
.Select(id =>
{
FileDatabaseMerger merger = MultiMerger.GetMerger(id);
return new
{
Id = id,
IsStale = merger.TargetIsStale,
// ...
};
})
.ToList();
fileUpdatesGridView.DataSource = result;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = result.Any(r=>r.IsStale);
I would consider this a bad practice. You are making the assumption that the lambda expression is being forced to execute because you called ToList. That's an implementation detail of the current version of ToList. What if ToList in .NET 7.x is changed to return an object that semi-lazily converts the IQueryable? What if it's changed to run the lambda in parallel? All of a sudden you have concurrency issues on your staleCount. As far as I know, both of those are possibilities which would break your code because of bad assumptions your code is making.
Now as far as repeatedly calling MultiMerger.GetMerger with a single id, that really should be reworked to be a join as the logic for doing a join (w|c)ould be much more efficient than what you have coded there and would scale a lot better, especially if the implementation of MultiMerger is actually pulling data from a database (or might be changed to do so).
As far as calling ToList() before passing it to the Datasource, if the Datasource doesn't use all the fields in your new object, you would be (much) faster and take less memory to skip the ToList and let the datasource only pull the fields it needs. What you've done is highly couple the data to the exact requirements of the view, which should be avoided where possible. An example would be what if you all of a sudden need to display a field that exists in FileDatabaseMerger, but isn't in your current anonymous object? Now you have to make changes to both the controller and view to add it, where if you just passed in an IQueryable, you would only have to change the view. Again, faster, less memory, more flexible, and more maintainable.
Hope this helps.. And this question really should be posted of code review, not stackoverflow.
Update on further review, the following code would be much better:
var result=MultiMerger.GetMergersByIds(MultiMerger.TargetIds);
fileUpdatesGridView.DataSource = result;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = result.Any(r=>r.TargetIsStale);
or
var result=MultiMerger.GetMergers().Where(m=>MultiMerger.TargetIds.Contains(m.Id));
fileUpdatesGridView.DataSource = result;
fileUpdatesGridView.DataBind();
fileUpdatesMergeButton.Enabled = result.Any(r=>r.TargetIsStale);

Instantiation in a loop: Verbosity vs. Performance

So in the following bit of code, I really like the verbosity of scenario one, but I would like to know how big of a performance hit this takes when compared to scenario two. Is instantiation in the loop a big deal?
Is the syntactic benefit (which I like, some might not even agree with it being verbose or optimal) worth said performance hit? You can assume that the collections remain reasonably small (N < a few hundred).
// First scenario
var productCategoryModels = new List<ProductCategoryModel>();
foreach (var productCategory in productCategories)
{
var model = new ProductCategoryModel.ProductCategoryModelConverter(currentContext).Convert(productCategory);
productCategoryModels.Add(model);
}
// Second scenario
var productCategoryModels = new List<ProductCategoryModel>();
var modelConvert = new ProductCategoryModel.ProductCategoryModelConverter(currentContext);
foreach (var productCategory in productCategories)
{
var model = modelConvert.Convert(productCategory);
productCategoryModels.Add(model);
}
Would love to hear your guys' thoughts on this as I see this quite often.

I would approach this questions a bit differently. If whatever happens in new ProductCategoryModel.ProductCategoryModelConverter(currentContext) doesn't change during the loop, I see no reason to include it within the loop. If it is not part of the loop, it shouldn't be in there imo.
If you include it just because it looks better, you force the reader to figure out if it makes a difference or not.

Like Brian, I see no reason to create a new instance if you're not actually changing it - I'm assuming that Convert doesn't change the original object?
I'd recommend using LINQ to avoid the loop entirely though (in terms of your own code):
var modelConverter = new ProductCategoryModelConverter(currentContext);
var models = productCategories.Select(x => modelConverter.Convert(x))
.ToList();
In terms of performance, it would depend on what ProductCategoryModelConverter's constructor had to do. Just creating a new object is pretty cheap, in terms of overhead. It's not free, of course - but I wouldn't expect this to cause a bottleneck in most cases. However, I'd say that instantiating it in a loop would imply to the reader that it's necessary; that there was some reason not to use just a single object. A converter certainly sounds like something which will stay immutable as it does its work... so I'd be puzzled by the instantiate-in-a-loop version.

Keep whichever of the formats you're happiest with - if they both perform well enough for your application. Optimize later if you encounter performance problems.

Which LINQ query is more effective?

I have a huge IEnumerable(suppose the name is myItems), which way is more effective?
Solution 1: Filter it first then ForEach.
Array.ForEach(myItems.Where(FILTER-IT-HERE).ToArray(),MY-ACTION);
Solution 2: Do RETURN in MY-ACTION if the item is not up to the mustard.
Array.ForEach(myItems.ToArray(),MY-ACTION-WITH-FILTER);
Is one of them always better than another? Or any other good suggestions? Thanks in advance.

Did you do any measurements? Since WE can't measure the run time of My-Action then only you can. Measure and decide.

Sometimes one has to create benchmark's because similar looking activities could produce radically different and unexpected results.
You do not say what your data source is so I'm going to assume it may be data on an SQL server in which case filtering at the server side will likely always be the best approach because you have minimized the amount of data transfer. Memory access is always faster than data transfer from disk to memory so whenever you can transfer fewer records, you are likely to have better performance.

Well, both times, you're converting to an array, which might not be so efficient if the IEnumerable is very large (like you said). You could create a generic extension method for IEnumerable, like:
public static void ForEach<T>(this IEnumerable<T> current, Action<T> action) {
foreach (var i in current) {
action(i);
}
}
and then you could do this:
IEnumerable<int> ints = new List<int>();
ints.Where(i => i == 5).ForEach(i => Console.WriteLine(i));

If performance is a concern, it's unclear to me why you'd be bothering to construct an entire array in the first place. Why not just this?
foreach (var item in myItems.Where(FILTER-IT-HERE))
MY-ACTION;
Or:
foreach (var item in myItems)
MY-ACTION-WITH-FILTER;
I ask because, while the others are right that you can't really know without testing, I wouldn't expect there to be much difference between the above two options. I would expect there to be a difference, on the other hand, between creating/populating an array (seemingly for no reason) and not creating an array.

Everything else being equal, calling ToArray() first will impart a greater performance hit than when calling it last. Although, as has been stated by others before me,
Why use ToArray() and Array.ForEach() at all?
We don't know that everything else actually is equal since you do not reveal the implementation details of your filter and action.

The idea of LINQ is to work on enumerable collections, so the best LINQ query is the one where you don't use Array.ForEach() and .ToArray() at all.

I would say that this falls into the category of premature optimization. If, after establishing benchmarks, you find that the code is too slow, you can always try each approach and pick the result that works better for you.
Since we don't know how the IEnumerable<> is produced it's hard to say which approach will perform better. We also don't know how many items will remain after you apply your predicate - nor do we know whether the action or iteration steps are going to be the dominant factor in the execution of your code. The only way to know for sure is to try it both ways, profile the results, and pick the best.
Performance aside, I would choose the version that is most clear - which (for me) is to first filter and then apply the projection to the result.

How to replace for-loops with a functional statement in C#?

A colleague once said that God is killing a kitten every time I write a for-loop.
When asked how to avoid for-loops, his answer was to use a functional language. However, if you are stuck with a non-functional language, say C#, what techniques are there to avoid for-loops or to get rid of them by refactoring? With lambda expressions and LINQ perhaps? If so, how?
Questions
So the question boils down to:
Why are for-loops bad? Or, in what context are for-loops to avoid and why?
Can you provide C# code examples of how it looks before, i.e. with a loop, and afterwards without a loop?

Functional constructs often express your intent more clearly than for-loops in cases where you operate on some data set and want to transform, filter or aggregate the elements.
Loops are very appropriate when you want to repeatedly execute some action.
For example
int x = array.Sum();
much more clearly expresses your intent than
int x = 0;
for (int i = 0; i < array.Length; i++)
{
x += array[i];
}

Why are for-loops bad? Or, in what
context are for-loops to avoid and
why?
If your colleague has a functional programming, then he's probably already familiar with the basic reasons for avoiding for loops:
Fold / Map / Filter cover most use cases of list traversal, and lend themselves well to function composition. For-loops aren't a good pattern because they aren't composable.
Most of the time, you traverse through a list to fold (aggregate), map, or filter values in a list. These higher order functions already exist in every mainstream functional language, so you rarely see the for-loop idiom used in functional code.
Higher order functions are the bread and butter of function composition, meaning you can easily combine simple function into something more complex.
To give a non-trivial example, consider the following in an imperative language:
let x = someList;
y = []
for x' in x
y.Add(f x')
z = []
for y' in y
z.Add(g y')
In a functional language, we'd write map g (map f x), or we can eliminate the intermediate list using map (f . g) x. Now we can, in principle, eliminate the intermediate list from the imperative version, and that would help a little -- but not much.
The main problem with the imperative version is simply that the for-loops are implementation details. If you want change the function, you change its implementation -- and you end up modifying a lot of code.
Case in point, how would you write map g (filter f x) in imperatively? Well, since you can't reuse your original code which maps and maps, you need to write a new function which filters and maps instead. And if you have 50 ways to map and 50 ways to filter, how you need 50^50 functions, or you need to simulate the ability to pass functions as first-class parameters using the command pattern (if you've ever tried functional programming in Java, you understand what a nightmare this can be).
Back in the the functional universe, you can generalize map g (map f x) in way that lets you swap out the map with filter or fold as needed:
let apply2 a g b f x = a g (b f x)
And call it using apply2 map g filter f or apply2 map g map f or apply2 filter g filter f or whatever you need. Now you'd probably never write code like that in the real world, you'd probably simplify it using:
let mapmap g f = apply2 map g map f
let mapfilter g f = apply2 map g filter f
Higher-order functions and function composition give you a level of abstraction that you cannot get with the imperative code.
Abstracting out the implementation details of loops let's you seamlessly swap one loop for another.
Remember, for-loops are an implementation detail. If you need to change the implementation, you need to change every for-loop.
Map / fold / filter abstract away the loop. So if you want to change the implementation of your loops, you change it in those functions.
Now you might wonder why you'd want to abstract away a loop. Consider the task of mapping items from one type to another: usually, items are mapped one at a time, sequentially, and independently from all other items. Most of the time, maps like this are prime candidates for parallelization.
Unfortunately, the implementation details for sequential maps and parallel maps aren't interchangeable. If you have a ton of sequential maps all over your code, and you want swap them out for parallel maps, you have two choices: copy/paste the same parallel mapping code all over your code base, or abstract away mapping logic into two functions map and pmap. Once you're go the second route, you're already knee-deep in functional programming territory.
If you understand the purpose of function composition and abstracting away implementation details (even details as trivial as looping), you can start to appreciate just how and why functional programming is so powerful in the first place.

For loops are not bad. There are many very valid reasons to keep a for loop.
You can often "avoid" a for loop by reworking it using LINQ in C#, which provides a more declarative syntax. This can be good or bad depending on the situation:
Compare the following:
var collection = GetMyCollection();
for(int i=0;i<collection.Count;++i)
{
if(collection[i].MyValue == someValue)
return collection[i];
}
vs foreach:
var collection = GetMyCollection();
foreach(var item in collection)
{
if(item.MyValue == someValue)
return item;
}
vs. LINQ:
var collection = GetMyCollection();
return collection.FirstOrDefault(item => item.MyValue == someValue);
Personally, all three options have their place, and I use them all. It's a matter of using the most appropriate option for your scenario.

There's nothing wrong with for loops but here are some of the reasons people might prefer functional/declarative approaches like LINQ where you declare what you want rather than how you get it:-
Functional approaches are potentially easier to parallelize either manually using PLINQ or by the compiler. As CPUs move to even more cores this may become more important.
Functional approaches make it easier to achieve lazy evaluation in multi-step processes because you can pass the intermediate results to the next step as a simple variable which hasn't been evaluated fully yet rather than evaluating the first step entirely and then passing a collection to the next step (or without using a separate method and a yield statement to achieve the same procedurally).
Functional approaches are often shorter and easier to read.
Functional approaches often eliminate complex conditional bodies within for loops (e.g. if statements and 'continue' statements) because you can break the for loop down into logical steps - selecting all the elements that match, doing an operation on them, ...

For loops don't kill people (or kittens, or puppies, or tribbles). People kill people.
For loops, in and of themselves, are not bad. However, like anything else, it's how you use them that can be bad.

Sometime you don't kill just one kitten.
for (int i = 0; i < kittens.Length; i++)
{
kittens[i].Kill();
}
Sometimes you kill them all.

You can refactor your code well enough so that you won't see them often. A good function name is definitely more readable that a for loop.
Taking the example from AndyC :
Loop
// mystrings is a string array
List<string> myList = new List<string>();
foreach(string s in mystrings)
{
if(s.Length > 5)
{
myList.add(s);
}
}
Linq
// mystrings is a string array
List<string> myList = mystrings.Where<string>(t => t.Length > 5)
.ToList<string();
Wheter you use the first or the second version inside your function, It's easier to read
var filteredList = myList.GetStringLongerThan(5);
Now that's an overly simple example, but you get my point.

Your colleague is not right. For loops are not bad per se. They are clean, readable and not particularly error prone.

Your colleague is wrong about for loops being bad in all cases, but correct that they can be rewritten functionally.
Say you have an extension method that looks like this:
void ForEach<T>(this IEnumerable<T> collection, Action <T> action)
{
foreach(T item in collection)
{
action(item)
}
}
Then you can write a loop like this:
mycollection.ForEach(x => x.DoStuff());
This may not be very useful now. But if you then replace your implementation of the ForEach extension method for use a multi threaded approach then you gain the advantages of parallelism.
This obviously isn't always going to work, this implementation only works if the loop iterations are completely independent of each other, but it can be useful.
Also: always be wary of people who say some programming construct is always wrong.

A simple (and pointless really) example:
Loop
// mystrings is a string array
List<string> myList = new List<string>();
foreach(string s in mystrings)
{
if(s.Length > 5)
{
myList.add(s);
}
}
Linq
// mystrings is a string array
List<string> myList = mystrings.Where<string>(t => t.Length > 5).ToList<string>();
In my book, the second one looks a lot tidier and simpler, though there's nothing wrong with the first one.

Sometimes a for-loop is bad if there exists a more efficient alternative. Such as searching, where it might be more efficient to sort a list and then use quicksort or binary sort. Or when you are iterating over items in a database. It is usually much more efficient to use set-based operations in a database instead of iterating over the items.
Otherwise if the for-loop, especially a for-each makes the most sense and is readable, then I would go with that rather than rafactor it into something that isn't as intuitive. I personally don't believe in these religious sounding "always do it this way, because that is the only way". Rather it is better to have guidelines, and understand in what scenarios it is appropriate to apply those guidelines. It is good that you ask the Why's!

For loop is, let's say, "bad" as it implies branch prediction in CPU, and possibly performance decrease when branch prediction miss.
But CPU (having a branch prediction accuracy of 97%) and compiler with tecniques like loop unrolling, make loop performance reduction negligible.

If you abstract the for loop directly you get:
void For<T>(T initial, Func<T,bool> whilePredicate, Func<T,T> step, Action<T> action)
{
for (T t = initial; whilePredicate(t); step(t))
{
action(t);
}
}
The problem I have with this from a functional programming perspective is the void return type. It essentially means that for loops do not compose nicely with anything. So the goal is not to have a 1-1 conversion from for loop to some function, it is to think functionally and avoid doing things that do not compose. Instead of thinking of looping and acting think of the whole problem and what you are mapping from and to.

A for loop can always be replaced by a recursive function that doesn't involve the use of a loop. A recursive function is a more functional stye of programming.
But if you blindly replace for loops with recursive functions, then kittens and puppies will both die by the millions, and you will be done in by a velocirapter.
OK, here's an example. But please keep in mind that I do not advocate making this change!
The for loop
for (int index = 0; index < args.Length; ++index)
Console.WriteLine(args[index]);
Can be changed to this recursive function call
WriteValuesToTheConsole(args, 0);
static void WriteValuesToTheConsole<T>(T[] values, int startingIndex)
{
if (startingIndex < values.Length)
{
Console.WriteLine(values[startingIndex]);
WriteValuesToTheConsole<T>(values, startingIndex + 1);
}
}
This should work just the same for most values, but it is far less clear, less effecient, and could exhaust the stack if the array is too large.

Your colleague may be suggesting under certain circumstances where database data is involved that it is better to use an aggregate SQL function such as Average() or Sum() at query time as opposed to processing the data on the C# side within an ADO .NET application.
Otherwise for loops are highly effective when used properly, but realize that if you find yourself nesting them to three or more orders, you might need a better algorithm, such as one that involves recursion, subroutines or both. For example, a bubble sort has a O(n^2) runtime on its worst-case (reverse order) scenario, but a recursive sort algorithm is only O(n log n), which is much better.
Hopefully this helps.
Jim

Any construct in any language is there for a reason. It's a tool to be used to accomplish a task. Means to an end. In every case, there are manners in which to use it appropriately, that is, in a clear and concise way and within the spirit of the language AND manners to abuse it. This applies to the much-misaligned goto statement as well as to your for loop conundrum, as well as while, do-while, switch/case, if-then-else, etc. If the for loop is the right tool for what you're doing, USE IT and your colleague will need to come to terms with your design decision.

It depends upon what is in the loop but he/she may be referring to a recursive function
//this is the recursive function
public static void getDirsFiles(DirectoryInfo d)
{
//create an array of files using FileInfo object
FileInfo [] files;
//get all files for the current directory
files = d.GetFiles("*.*");
//iterate through the directory and print the files
foreach (FileInfo file in files)
{
//get details of each file using file object
String fileName = file.FullName;
String fileSize = file.Length.ToString();
String fileExtension =file.Extension;
String fileCreated = file.LastWriteTime.ToString();
io.WriteLine(fileName + " " + fileSize +
" " + fileExtension + " " + fileCreated);
}
//get sub-folders for the current directory
DirectoryInfo [] dirs = d.GetDirectories("*.*");
//This is the code that calls
//the getDirsFiles (calls itself recursively)
//This is also the stopping point
//(End Condition) for this recursion function
//as it loops through until
//reaches the child folder and then stops.
foreach (DirectoryInfo dir in dirs)
{
io.WriteLine("--------->> {0} ", dir.Name);
getDirsFiles(dir);
}
}

The question is if the loop will be mutating state or causing side effects. If so, use a foreach loop. If not, consider using LINQ or other functional constructs.
See "foreach" vs "ForEach" on Eric Lippert's Blog.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Efficiently gather a list of objects - c#

Related

Select Statement doing a "for each" thing, then returning itself

Is it bad practice to purposely rely on Linq Side Effects?

Instantiation in a loop: Verbosity vs. Performance

Which LINQ query is more effective?

How to replace for-loops with a functional statement in C#?

Categories

Resources