What are we guaranteed regarding side-effects in LINQ predicates? - c#

I just saw this bit of code that has a count++ side-effect in the .GroupBy predicate. (originally here).
object[,] data; // This contains all the data.
int count = 0;
List<string[]> dataList = data.Cast<string>()
    .GroupBy(x => count++ / data.GetLength(1))
    .Select(g => g.ToArray())
    .ToList();
This terrifies me because I have no idea how many times the implementation will invoke the key selector function. And I also don't know if the function is guaranteed to be applied to each item in order. I realize that, in practice, the implementation may very well just call the function once per item in order, but I never assumed that as being guaranteed, so I'm paranoid about depending on that behaviour -- especially given what may happen on other platforms, other future implementations, or after translation or deferred execution by other LINQ providers.
As it pertains to a side-effect in the predicate, are we offered some kind of written guarantee, in terms of a LINQ specification or something, as to how many times the key selector function will be invoked, and in what order?
Please, before you mark this question as a duplicate, I am looking for a citation of documentation or specification that says one way or the other whether this is undefined behaviour or not.
For what it's worth, I would have written this kind of query the long way, by first performing a select query with a predicate that takes an index, then creating an anonymous object that includes the index and the original data, then grouping by that index, and finally selecting the original data out of the anonymous object. That seems more like a correct way of doing functional programming. And it also seems more like something that could be translated to a server-side query. The side-effect in the predicate just seems wrong to me - and against the principles of both LINQ and functional programming, so I would assume there would be no guarantee specified and that this may very well be undefined behaviour. Is it?
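The long-hand version described above might look roughly like this. It is only a sketch: ChunkDemo and ToRows are names made up for illustration, and the sample data is invented. The point is that the grouping key is computed from the index paired with each item, so the key selector has no side effects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkDemo
{
    // Hypothetical helper: splits a flattened sequence into rows of `width` items
    // without mutating any captured state.
    public static List<string[]> ToRows(IEnumerable<string> cells, int width)
    {
        return cells
            .Select((value, index) => new { value, index }) // pair each item with its position
            .GroupBy(x => x.index / width)                  // key depends only on the item itself
            .Select(g => g.Select(x => x.value).ToArray())  // drop the index again
            .ToList();
    }

    public static void Main()
    {
        string[,] data = { { "a", "b", "c" }, { "d", "e", "f" } };
        List<string[]> rows = ToRows(data.Cast<string>(), data.GetLength(1));
        foreach (string[] row in rows)
            Console.WriteLine(string.Join(",", row)); // a,b,c then d,e,f
    }
}
```

Because the key no longer depends on how often or in what order the selector runs, the query can be enumerated repeatedly and still produce the same rows.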
I realize this question may be difficult to answer if the documentation and LINQ specification actually say nothing regarding side effects in predicates. I want to know specifically whether:
Specs say it's permissible and how. (I doubt it)
Specs say it's undefined behaviour (I suspect this is true and am looking for a citation)
Specs say nothing. (Sloppy spec, if you ask me, but it would be nice to know if others have searched for language regarding side-effects and also come up empty. Just because I can't find it doesn't mean it doesn't exist.)

According to the official C# Language Specification, on page 203, we can read (emphasis mine):
12.17.3.1 The C# language does not specify the execution semantics of query expressions. Rather, query expressions are
translated into invocations of methods that adhere to the
query-expression pattern (§12.17.4). Specifically, query expressions
are translated into invocations of methods named Where, Select,
SelectMany, Join, GroupJoin, OrderBy, OrderByDescending, ThenBy,
ThenByDescending, GroupBy, and Cast. These methods are expected to
have particular signatures and return types, as described in §12.17.4.
These methods may be instance methods of the object being queried or
extension methods that are external to the object. These methods
implement the actual execution of the query.

From looking at the source code of GroupBy in corefx on GitHub, it does seem like the key selector function is indeed called once per element, and it is called in the order in which the previous IEnumerable provides them. I would in no way consider this a guarantee though.
In my view, any IEnumerables which cannot be enumerated multiple times safely are a big red flag that you may want to reconsider your design choices. An interesting issue that could arise from this is that for example if you view the contents of this IEnumerable in the Visual Studio debugger, it will probably break your code, since it would cause the count variable to go up.
The reason this code hasn't exploded up until now is probably because the IEnumerable is never stored anywhere, since .ToList is called right away. Therefore there is no risk of multiple enumerations (again, with the caveat about viewing it in the debugger and so on).
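To make the multiple-enumeration risk concrete, here is a small sketch. The class and method names are invented for illustration, and the outputs in the comments rely on the current LINQ to Objects behaviour of calling the key selector once per item, in order. The same deferred query is enumerated twice, and the groups change because count keeps growing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReEnumerationDemo
{
    // Renders the groups produced by ONE enumeration of the query, e.g. "[a,b] [c]".
    public static string Render(IEnumerable<IGrouping<int, string>> query)
    {
        return string.Join(" ", query.Select(g => "[" + string.Join(",", g) + "]"));
    }

    // Enumerates the same side-effecting query twice and returns both renderings.
    public static string[] RunTwice()
    {
        string[] items = { "a", "b", "c" };
        int count = 0;

        // Deferred: the key selector has not run yet.
        IEnumerable<IGrouping<int, string>> query = items.GroupBy(x => count++ / 2);

        string first = Render(query);  // keys 0,0,1 -> [a,b] [c]
        string second = Render(query); // count continues at 3: keys 1,2,2 -> [a] [b,c]
        return new[] { first, second };
    }

    public static void Main()
    {
        foreach (string line in RunTwice())
            Console.WriteLine(line);
    }
}
```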

Related

LINQ Design Curiosity: Skip/Take vs. SkipWhile/TakeWhile

Is there any particular reason to have separate methods Skip and SkipWhile, rather than simply having overloads of the same method?
What I mean is, instead of Skip(int), SkipWhile(Func<TSource,bool>), and SkipWhile(Func<TSource,int,bool>), why not have Skip(int), Skip(Func<TSource,bool>), and Skip(Func<TSource,int,bool>)? I'm sure there's some reason for it, as the whole LINQ system was designed by people with much more experience than me, but that reasoning is not apparent.
The only possibility that's come to mind has been issues with the parser for the SQL-like syntax, but that already distinguishes between things like Select(Func<TSource,TResult>) and Select(Func<TSource,int,TResult>), so I doubt that's why.
The same question applies to Take and TakeWhile, which are complementary to the above.
Edit: To clarify, I am aware of the functional differences between the variants, I'm merely asking about the design decision on the naming of the methods.
IMO, the only reason would be better readability. Skip sounds like “Skip N number of records”, while SkipWhile sounds like “Skip until a condition is met”. These names are self-explanatory.
The "While" indicates that LINQ will only skip while the lambda expression evaluates to true, and will stop skipping as soon as it is no longer true. This is a very different thing from just skipping a fixed number of items.
The same reasoning holds true for Take, of course.
All is well in the interest of clarity!
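A quick illustration of the difference the names convey, using made-up sample data (SkipDemo and Numbers are illustrative names only):

```csharp
using System;
using System.Linq;

public static class SkipDemo
{
    public static readonly int[] Numbers = { 1, 5, 2, 6 };

    public static void Main()
    {
        // Skip/Take count items, regardless of their values.
        Console.WriteLine(string.Join(",", Numbers.Skip(2)));                // 2,6
        Console.WriteLine(string.Join(",", Numbers.Take(2)));                // 1,5

        // SkipWhile/TakeWhile stop at the first item that fails the condition,
        // even if later items would satisfy it again.
        Console.WriteLine(string.Join(",", Numbers.SkipWhile(n => n < 5))); // 5,2,6
        Console.WriteLine(string.Join(",", Numbers.TakeWhile(n => n < 5))); // 1
    }
}
```

Note that the 2 after the 5 is not skipped by SkipWhile: once the condition has failed, no further items are tested.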

Is .Select<T>(...) to be preferred before .Where<T>(...)?

I got into a discussion with two colleagues regarding a setup for an iteration over an IEnumerable (the contents of which will not be altered in any way during the operation). There are three conflicting theories on which is the optimal approach. Both of the others (and I as well) are very certain, and that made me unsure, so for the sake of clarity, I want to check with an external source.
The scenario is as follows. We had the code below as a starting point and discovered that some of the hazaas need not be acted upon. So, starting with the code below, we started to add a blocker for the action.
foreach(Hazaa hazaa in hazaas) ;
My suggestion is as follows.
foreach(Hazaa hazaa in hazaas.Where(element => condition)) ;
One of the guys wants to resolve it by a more explicit form, claiming that LINQ is not appropriate in this case (not sure why it'd be so but he seems to be very convinced). His solution is this.
foreach(Hazaa hazaa in hazaas) ;
if(condition) ;
The other counter-suggestion is supported by the claim that Where risks repeating the filtering process needlessly and that it's more certain to minimize the computational workload by picking the appropriate elements once and for all with Select.
foreach(Hazaa hazaa in hazaas.Select(element => condition)) ;
I argue that the first is obsolete, since LINQ can handle data objects quite well.
I also believe that Select-ing is in this case equivalently fast to Where-ing and no needless steps will be taken (e.g. the evaluation of the condition on the elements will only be performed once). If anything, it should be faster using Where because we won't be creating an extra instance of anything.
Who's right?
Select is inappropriate. It doesn't filter anything.
if is a possible solution, but Where is just as explicit.
Where executes the condition exactly once per item, just as the if. Additionally, it is important to note that the call to Where doesn't iterate the list. So, using Where you iterate the list exactly once, just like when using if.
I think you are discussing with one person that didn't understand LINQ - the guy that wants to use Select - and one that doesn't like the functional aspect of LINQ.
I would go with Where.
The .Where() and the if(condition) approach will be the same.
But since LINQ is nicely readable, I'd prefer that.
The approach with .Select() is nonsense, since it will not return the Hazaa objects, but an IEnumerable<Boolean>.
To be clear about the functions:
myEnumerable.Where(a => isTrueFor(a)) //This is filtering
myEnumerable.Select(a => a.b) //This is projection
Where() will run a function that returns a Boolean for each item of the enumerable, and return the item depending on the result of that function.
Select() will run a function for every item in the list and return the result of the function without doing any filtering.
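A small sketch that checks both claims, with invented names (FilterDemo, IsEven, Calls): the predicate passed to Where runs once per item, and only when the result is enumerated, while Select with the same condition yields booleans rather than the items.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterDemo
{
    public static int Calls; // counts how often the predicate is evaluated

    public static bool IsEven(int n)
    {
        Calls++;
        return n % 2 == 0;
    }

    public static void Main()
    {
        int[] numbers = { 1, 2, 3, 4 };

        IEnumerable<int> evens = numbers.Where(n => IsEven(n));    // filtering, deferred
        IEnumerable<bool> flags = numbers.Select(n => n % 2 == 0); // projection to bool

        Console.WriteLine(Calls);                    // 0: Where has not iterated anything yet
        Console.WriteLine(string.Join(",", evens));  // 2,4
        Console.WriteLine(Calls);                    // 4: one predicate call per item
        Console.WriteLine(string.Join(",", flags));  // False,True,False,True
    }
}
```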

What practices can safeguard against unexpected deferred execution with IEnumerable<T> as argument? [closed]

There are a few questions similar to this one which deal with the right input and output types. My question is: what good practices (method naming, choice of parameter type, or similar) can safeguard against deferred execution accidents?
These are most prevalent with IEnumerable which is a very common argument type because:
Follows the robustness principle "Be conservative in what you do, be liberal in what you accept from others"
Used extensively with Linq
IEnumerable is high in the collection hierarchy and predates newer collection types
However, it also introduces deferred execution. Now we might have gone wrong in designing our methods (especially extension methods) when we thought the best idea is to take the most basic type. So our methods looked like:
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> lstObject)
{
    foreach (T t in lstObject)
    {
        // some Fisher-Yates, maybe
    }
}
The danger obviously arises when we mix the above function with lazy Linq, and it's so susceptible.
var query = foos.Select(p => p).Where(p => p).OrderBy(p => p); //doesn't execute
//but
var query = foos.Select(p => p).Where(p => p).Shuffle().OrderBy(p => p);
//the second line executes up to a point.
A bigger edit:
Reopening this: a criticism of a language's functionality isn't constructive - however asking for good practices is where StackOverflow shines. Updated the question to reflect this.
A big edit here :
To clarify the above line - My question is not about the second expression not getting evaluated, seriously not. Programmers know it. My worry is about the Shuffle method actually executing the query up to that point. See the first query, where nothing gets executed. Now similarly, when constructing another Linq expression (which should be executed later), our custom function is playing the spoilsport. In other words, how do we let the caller know Shuffle is not the kinda function they would want at that point of a Linq expression? I hope the point is driven home. Apologies! :) Though it's as simple as going and inspecting the method, I'm asking how you guys typically program defensively..
The above example may not be that dangerous, but you get the point. That is, certain (custom) functions don't go well with the Linq idea of deferred execution. The problem is not just about performance, but also about unexpected side-effects.
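To illustrate the concern, here is a hedged sketch with two invented variants, ShuffleEager and ShuffleDeferred (neither name comes from a real library). The eager one drains its source as soon as it is called. The deferred one wraps the same logic in an iterator block, so nothing runs until the caller enumerates the result. A counter in the source query shows when items are actually pulled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ShuffleDemo
{
    // Eager: the body runs at call time and enumerates the whole source.
    public static IEnumerable<T> ShuffleEager<T>(this IEnumerable<T> source)
    {
        List<T> items = source.ToList();
        Random rng = new Random();
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // Fisher-Yates: pick from the unshuffled prefix
            T tmp = items[i]; items[i] = items[j]; items[j] = tmp;
        }
        return items;
    }

    // Deferred: an iterator block runs nothing until enumeration starts.
    public static IEnumerable<T> ShuffleDeferred<T>(this IEnumerable<T> source)
    {
        foreach (T item in source.ShuffleEager())
            yield return item;
    }

    public static void Main()
    {
        int pulled = 0;
        IEnumerable<int> source = Enumerable.Range(1, 5).Select(n => { pulled++; return n; });

        IEnumerable<int> deferred = source.ShuffleDeferred();
        Console.WriteLine(pulled); // 0: building the query did not touch the source

        IEnumerable<int> eager = source.ShuffleEager();
        Console.WriteLine(pulled); // 5: the eager call enumerated the source immediately

        Console.WriteLine(deferred.Count()); // 5: enumerating now pulls the source again
        Console.WriteLine(pulled);           // 10
    }
}
```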
But a function like this works magic with Linq:
public static IEnumerable<S> DistinctBy<S, T>(this IEnumerable<S> source,
    Func<S, T> keySelector)
{
    HashSet<T> seenKeys = new HashSet<T>(); //credits Jon Skeet
    foreach (var element in source)
        if (seenKeys.Add(keySelector(element)))
            yield return element;
}
As you can see both the functions take IEnumerable<>, but the caller wouldn't know how the functions react. So what are the general cautionary measures that you guys take here?
Name our custom methods appropriately so that it gives the idea for the caller that it does bode well or not with Linq?
Move lazy methods to a different namespace, and keep Linq-ish to another, so that it gives some sort of an idea at least?
Do not accept an IEnumerable as parameter for immediately executing methods but instead take a more derived type or a concrete type itself which thus leaves IEnumerable for lazy methods alone? This puts the burden on the caller to do the execution of possible un-executed expressions? This is quite possible for us, since outside Linq world we hardly deal with IEnumerables, and most basic collection classes implement up to ICollection at least.
Or anything else? I particularly like the 3rd option, and that's what I was going with, but thought to get your ideas prior to. I have seen plenty of code (nice little Linq like extension methods!) from even good programmers that accept IEnumerable and do a ToList() or something similar on them inside the method. I don't know how they cope with the side-effects..
Edit: After a downvote and an answer, I would like to clarify that it's not about programmers not knowing how Linq works (our proficiency could be at some level, but that's a different thing), but that many functions were written without taking Linq into account back then. Now chaining an immediately executing method along with Linq extension methods makes it dangerous. So my question is: is there a general guideline programmers follow to let the caller know what to use from the Linq side and what not to? It's more about programming defensively than if-you-don't-know-to-use-it-then-we-can't-help! (or at least I believe)..
As you can see both the functions take IEnumerable<>, but the caller wouldn't know how the functions react.
That's simply a matter of documentation. Look at the documentation for DistinctBy in MoreLINQ, which includes:
This operator uses deferred execution and streams the results, although
a set of already-seen keys is retained. If a key is seen multiple times,
only the first element with that key is returned.
Yes, it's important to know what a member does before you use it, and for things accepting/returning any kind of collection, there are various important things to know:
Will the collection be read immediately, or deferred?
Will the collection be streamed while results are returned?
If the declared collection type accepted is mutable, will the method try to mutate it?
If the declared collection type returned is mutable, will it actually be a mutable implementation?
Will the collection returned be changed by other actions (e.g. is it a read-only view on a collection which may be modified within the class)
Is null an acceptable input value?
Is null an acceptable element value?
Will the method ever return null?
All of these things are worth considering - and most of them were worth considering long before LINQ.
The moral is really, "Make sure you know how something behaves before you call it." That was true before LINQ, and LINQ hasn't changed it. It's just introduced two possibilities (deferred execution and streaming results) which were rarely present before.
Use IEnumerable wherever it makes sense, and code defensively.
As SLaks pointed out in a comment, deferred execution has been possible with IEnumerable since the beginning, and since C# 2.0 introduced the yield statement, it's been very easy to implement deferred execution yourself. For example, this method returns an IEnumerable that uses deferred execution to return some random numbers:
public static IEnumerable<int> RandomSequence(int length)
{
    Random rng = new Random();
    for (int i = 0; i < length; i++) {
        Console.WriteLine("deferred execution!");
        yield return rng.Next();
    }
}
So whenever you use foreach to loop over an IEnumerable, you have to assume that anything could happen in between iterations. It could even throw an exception, so you may want to put the foreach loop inside a try/finally.
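As a rough sketch of that advice, the consumer below wraps its loop in try/finally so that its own cleanup runs even when the sequence fails halfway through. All names here (CleanupDemo, Flaky, SumAll, Log) are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class CleanupDemo
{
    public static readonly List<string> Log = new List<string>();

    // A sequence that yields one item and then fails on the next MoveNext.
    public static IEnumerable<int> Flaky()
    {
        yield return 1;
        throw new InvalidOperationException("source failed");
    }

    // Sums the source; the finally block runs whether or not the source throws.
    public static int SumAll(IEnumerable<int> source)
    {
        Log.Add("acquire");
        try
        {
            int sum = 0;
            foreach (int n in source)
                sum += n;
            return sum;
        }
        finally
        {
            Log.Add("release");
        }
    }

    public static void Main()
    {
        try
        {
            SumAll(Flaky());
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message); // source failed
        }
        Console.WriteLine(string.Join(",", Log)); // acquire,release
    }
}
```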
If the caller passes in an IEnumerable that does something dangerous or never stops returning numbers (an infinite sequence), it's not your fault. You don't have to detect it and throw an error; just add enough exception handlers so that your method can clean up after itself in the event something goes wrong. In the case of something simple like Shuffle, there's nothing to do; just let the caller deal with the exception.
In the rare case that your method really can't deal with an infinite sequence, consider accepting a different type like IList. But even IList won't protect you from deferred execution - you don't know what class is implementing IList or what sort of voodoo it's doing to come up with each element! In the super-rare case that you really can't allow any unexpected code to run while you iterate, you should be accepting an array, not any kind of interface.
Deferred execution has nothing to do with types. Any linq method that uses iterators has potential for deferred execution if you write your code that way. Select(), Where() and OrderByDescending(), for example, all use iterators and hence defer execution. Yes, those methods expect an IEnumerable<T>, but that doesn't mean that IEnumerable<T> is the problem.
That is certain (custom) functions don't go well with the Linq idea of
deferred execution. The problem is not just about performance, but
also about unexpected side-effects.
So what are the general cautionary measures that you guys take here?
None. Honestly we use IEnumerable everywhere and don't have the problem of people not understanding "side effects". "the Linq idea of deferred execution" is central to its usefulness in things like Linq-to-SQL. It sounds to me like the design of the custom functions is not as clear as it could be. If people are writing code to use LINQ and they don't understand what it's doing, then that is the issue, not the fact that IEnumerable happens to be a base type.
All of your ideas are just wrappers around the fact that it sounds like you have programmers that just don't understand linq queries. If you don't need lazy execution, which it sounds like you don't, then just force everything to evaluate before the functions exit. Call ToList() on your results and return them in a consistent API that the consumer would like to work with - lists, arrays, collections or IEnumerables.

In-memory LINQ performance

More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or searching/filtering extension methods) works in objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query complexity at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n) unless list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it should take O(n), even if the list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?
Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
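One way such an extra method could look, as a sketch only: SortedSearch and IndexOfFirstGreaterThan are hypothetical names, and the method assumes the caller guarantees ascending order. It answers the same question as FirstOrDefault(o => o > n) but with a binary search.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedSearch
{
    // Returns the index of the first element greater than `threshold`, or -1 if none.
    // Assumes `sorted` is in ascending order; uses O(log n) comparisons.
    public static int IndexOfFirstGreaterThan(this IReadOnlyList<int> sorted, int threshold)
    {
        int lo = 0, hi = sorted.Count; // the answer, if any, lies in [lo, hi)
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] > threshold)
                hi = mid;      // mid qualifies; look for an earlier match
            else
                lo = mid + 1;  // mid and everything before it are too small
        }
        return lo < sorted.Count ? lo : -1;
    }

    public static void Main()
    {
        int[] values = { 1, 3, 5, 7, 9 };
        Console.WriteLine(values.IndexOfFirstGreaterThan(4)); // 2 (the value 5)
        Console.WriteLine(values.FirstOrDefault(v => v > 4)); // 5, found by linear scan
        Console.WriteLine(values.IndexOfFirstGreaterThan(9)); // -1
    }
}
```

The method name makes the ordering requirement visible at the call site, which is the point made above: the caller opts in to the optimisation explicitly instead of hoping a generic operator detects it.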
Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods special-case sources that actually implement, e.g., ICollection<T>. (I've seen this for Enumerable.Contains at least.)
In practice this means that Enumerable.Contains calls the fast HashSet<T>.Contains, for example, if the IEnumerable<T> actually is a HashSet<T>.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet<int>.Contains because HashSet<int> implements ICollection<int>.
if (mySet.Contains(10)) { /* code */ }
You can use reflector to check exactly how the LINQ methods are defined, that is how I figured this out.
Oh, and LINQ also contains the methods Enumerable.ToDictionary (maps a key to a single value) and Enumerable.ToLookup (maps a key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
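For instance, a minimal sketch with invented sample data (IndexDemo and Words are illustrative names): the index is built in one O(n) pass, after which each lookup by key no longer scans the whole collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexDemo
{
    public static readonly string[] Words = { "apple", "avocado", "banana" };

    public static void Main()
    {
        // One key, many values: group the words by their first letter.
        ILookup<char, string> byInitial = Words.ToLookup(w => w[0]);
        Console.WriteLine(string.Join(",", byInitial['a'])); // apple,avocado
        Console.WriteLine(byInitial['z'].Count());           // 0: a missing key gives an empty sequence

        // One key, one value: keys must be unique or ToDictionary throws.
        Dictionary<string, int> lengths = Words.ToDictionary(w => w, w => w.Length);
        Console.WriteLine(lengths["banana"]);                // 6
    }
}
```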
Yes, it has to be, because the only way of accessing the members of an IEnumerable is by enumerating them one at a time, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.

How do you design an enumerator that returns (theoretically) an infinite amount of items?

I'm writing code that looks similar to this:
public static IEnumerable<T> Unfold<T>(this T seed)
{
    while (true)
    {
        yield return [next (T)object in custom sequence];
    }
}
Obviously, this method is never going to return. (The C# compiler silently allows this, while R# gives me the warning "Function never returns".)
Generally speaking, is it bad design to provide an enumerator that returns an infinite number of items, without supplying a way to stop enumerating?
Are there any special considerations for this scenario? Mem? Perf? Other gotchas?
If we always supply an exit condition, which are the options? E.g:
an object of type T that represents the inclusive or exclusive boundary
a Predicate<T> continue (as TakeWhile does)
a count (as Take does)
...
Should we rely on users calling Take(...) / TakeWhile(...) after Unfold(...)? (Maybe the preferred option, since it leverages existing Linq knowledge.)
Would you answer this question differently if the code was going to be published in a public API, either as-is (generic) or as a specific implementation of this pattern?
So long as you document very clearly that the method will never finish iterating (the method itself returns very quickly, of course) then I think it's fine. Indeed, it can make some algorithms much neater. I don't believe there are any significant memory/perf implications - although if you refer to an "expensive" object within your iterator, that reference will be captured.
There are always ways of abusing APIs: so long as your docs are clear, I think it's fine.
"Generally speaking, is it bad design
to provide an enumerator that returns
an infinite amount of items, without
supplying a way to stop enumerating?"
The consumer of the code can always stop enumerating (using break, for example, or other means). If your enumerator returns an infinite sequence, that doesn't mean the client of the enumerator is somehow forced to never break out of the enumeration. In fact, you can't make an enumerator which is guaranteed to be fully enumerated by a client.
Should we rely on users calling
Take(...) / TakeWhile(...) after
Unfold(...)? (Maybe the preferred
option, since it leverages existing
Linq knowledge.)
Yes, as long as you clearly specify in your documentation that the enumerator returns an infinite sequence and that breaking out of the enumeration is the caller's responsibility, everything should be fine.
Returning infinite sequences isn't a bad idea; functional programming languages have done it for a long time now.
I agree with Jon. The compiler transforms your method into a class implementing a simple state machine that keeps a reference to the current value (i.e. the value that will be returned via the Current property). I used this approach several times to simplify code. If you clearly document the method's behavior, it should work just fine.
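A minimal sketch of how this can look in practice. The next parameter is an assumption added for illustration, since the question does not show how the following item is produced, and starting the sequence with the seed itself is likewise a guess. The caller bounds the infinite sequence with Take, TakeWhile, or a plain break.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnfoldDemo
{
    // Hypothetical signature: `next` computes the following item from the current one.
    // The sequence never ends on its own; the caller decides when to stop.
    public static IEnumerable<T> Unfold<T>(this T seed, Func<T, T> next)
    {
        T current = seed;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }

    public static void Main()
    {
        int seed = 1;
        IEnumerable<int> powers = seed.Unfold(n => n * 2); // infinite, nothing computed yet

        Console.WriteLine(string.Join(",", powers.Take(5)));                  // 1,2,4,8,16
        Console.WriteLine(string.Join(",", powers.TakeWhile(p => p <= 100))); // 1,2,4,8,16,32,64

        foreach (int p in powers)
        {
            if (p > 10) break; // the consumer can always stop on its own
            Console.WriteLine(p); // 1, 2, 4, 8
        }
    }
}
```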
I would not use an infinite enumerator in a public API. C# programmers, myself included, are too used to the foreach loop. This would also be consistent with the .NET Framework; notice how the Enumerable.Range and Enumerable.Repeat methods take an argument to limit the number of items in the Enumerable. Microsoft chose to use Enumerable.Repeat(" ", 10) instead of Enumerable.Repeat(" ").Take(10) to avoid the infinite enumeration and I would adhere to their design choices.
