Regarding evaluation of Enumerable/List [duplicate] - c#

This question already has answers here:
Is there an IEnumerable implementation that only iterates over its source (e.g. LINQ) once?
(4 answers)
Closed 3 years ago.
I've been playing with Lists and Enumerables and I think I understand the basics:
Enumerable: The elements are evaluated each time they are consumed.
List: The elements are evaluated on definition and are not reevaluated at any point.
I've done some tests:
Enumerable. https://www.tutorialspoint.com/tpcg.php?p=bs75zCKL
List: https://www.tutorialspoint.com/tpcg.php?p=PpyY2iif
SingleEvaluationEnum: https://www.tutorialspoint.com/tpcg.php?p=209Ciiy7
Starting with the Enumerable example:
var myList = new List<int>() { 1, 2, 3, 4, 5, 6 };
var myEnumerable = myList.Where(p =>
{
Console.Write($"{p} ");
return p > 2;
}
);
Console.WriteLine("");
Console.WriteLine("Starting");
myEnumerable.First();
Console.WriteLine("");
myEnumerable.Skip(1).First();
The output is:
Starting
1 2 3
1 2 3 4
If we add .ToList() after the .Where(...) then the output is:
1 2 3 4 5 6
Starting
I was also able to get a bit of both worlds with this class:
class SingleEvaluationEnum<T>
{
private IEnumerable<T> Enumerable;
public SingleEvaluationEnum(IEnumerable<T> enumerable)
=> Enumerable = enumerable;
public IEnumerable<T> Get()
{
if (!(Enumerable is List<T>))
Enumerable = Enumerable.ToList().AsEnumerable();
return Enumerable;
}
}
You can see the output is:
Starting
1 2 3 4 5 6
This way the evaluation is deferred until the first consumption and is not repeated on later consumptions. But the whole list is evaluated on that first consumption.
My question is: Is there a way to get this output?
Starting
1 2 3
4
In other words: I want myEnumerable.First() to evaluate only the necessary elements, but no more. And I want myEnumerable.Skip(1).First() to reuse the already evaluated elements.
EDIT: Clarification: I want any "query" over the Enumerable to apply to all the elements in the list. That's why (AFAIK) an Enumerator doesn't work.
Thanks!

LINQ is fundamentally a functional approach to working with collections. One of the assumptions is that there are no side-effects to evaluating the functions. You're violating that assumption by calling Console.Write in the function.
There's no magic involved, just functions. IEnumerable has just one method - GetEnumerator. That's all that is needed for LINQ, and that's all that LINQ really does. For example, a naïve implementation of Where would look like this:
public static IEnumerable<T> Where<T>(this IEnumerable<T> @this, Func<T, bool> filter)
{
foreach (var item in @this)
{
if (filter(item)) yield return item;
}
}
A Skip might look like this:
public static IEnumerable<T> Skip<T>(this IEnumerable<T> @this, int skip)
{
foreach (var item in @this)
{
if (skip-- > 0) continue;
yield return item;
}
}
That's all there is to it. It doesn't have any information about what IEnumerable is or represents. In fact, that's the whole point - you're abstracting those details away. There are a few optimizations in those methods, but they don't do anything smart. In the end, the difference between the List and IEnumerable in your example isn't anything fundamental - it's that myEnumerable.Skip(1) has side-effects (because myEnumerable itself has side-effects) while myList.Skip(1) doesn't. But both do the exact same thing - evaluate the enumerable, item by item. There's no other method than GetEnumerator on an enumerable, and IEnumerator only has Current and MoveNext (of those that matter for us).
LINQ is immutable. That's one of the reasons why it's so useful. This allows you to do exactly what you're doing - query the same enumerable twice but getting the exact same result. But you're not happy with that. You want things to be mutable. Well, nothing is stopping you from making your own helper functions. LINQ is just a bunch of functions, after all - you can make your own.
One such simple extension could be a memoized enumerable. Wrap around the source enumerable, create a list internally, and when you iterate over the source enumerable, keep adding items to the list. The next time GetEnumerator is called, start iterating over your internal list. When you reach the end, continue with the original approach - iterate over the source enumerable and keep adding to the list.
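A minimal sketch of what such a Memoize extension might look like (the name and shape are illustrative rather than an existing library method, and this version is not thread-safe):

using System.Collections;
using System.Collections.Generic;

public static class MemoizeExtensions
{
    public static IEnumerable<T> Memoize<T>(this IEnumerable<T> source)
        => new MemoizedEnumerable<T>(source);

    private sealed class MemoizedEnumerable<T> : IEnumerable<T>
    {
        private readonly IEnumerator<T> _source;          // the one and only enumerator over the source
        private readonly List<T> _cache = new List<T>();  // items pulled from the source so far

        public MemoizedEnumerable(IEnumerable<T> source) => _source = source.GetEnumerator();

        public IEnumerator<T> GetEnumerator()
        {
            int i = 0;
            while (true)
            {
                if (i < _cache.Count)
                {
                    // Serve items that have already been pulled from the source.
                    yield return _cache[i++];
                }
                else if (_source.MoveNext())
                {
                    // Pull one more item and cache it; the next loop iteration yields it from the cache.
                    _cache.Add(_source.Current);
                }
                else
                {
                    yield break;
                }
            }
        }

        IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    }
}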
This will allow you to use LINQ fully, just inserting Memoize() to your LINQ queries at the places where you want to avoid iterating over the source multiple times. In your example, this would be something like:
myEnumerable = myEnumerable.Memoize();
Console.WriteLine("");
Console.WriteLine("Starting");
myEnumerable.First();
Console.WriteLine("");
myEnumerable.Skip(1).First();
The first call to myEnumerable.First() will iterate through the first three items in myList, and the second will only work with the fourth.

Basically it sounds like you're looking for an Enumerator, which you can get by calling GetEnumerator on an IEnumerable. An Enumerator keeps track of its position.
var myList = new List<int>() { 1, 2, 3, 4, 5, 6 };
var myEnumerator = myList.Where(p =>
{
Console.Write($"{p} ");
return p > 2;
}
).GetEnumerator();
Console.WriteLine("Starting");
myEnumerator.MoveNext();
Console.WriteLine("");
myEnumerator.MoveNext();
This will get you the output:
Starting
1 2 3
4
Edit to respond to your comment:
First of all, this sounds like an extremely bad idea. An enumerable represents something that can be enumerated; this is why you can pipe all those fancy LINQ queries on top of it. However, every call to First "materializes" the enumeration (which results in GetEnumerator being called to get an Enumerator, walking over it until we're done, and then disposing it). You, however, are asking for every such materialization to change the IEnumerable it is enumerating (this is not good practice).
However, since you said this is for learning, I'll give you code that ends up with an IEnumerable that will give you your desired output. I would not recommend you ever use this in real code; this is not a good and solid way of doing things.
First we create a custom Enumerator that doesn't dispose, but just keeps enumerating some internal enumerator:
public class CustomEnumerator<T> : IEnumerator<T>
{
private readonly IEnumerator<T> _source;
public CustomEnumerator(IEnumerator<T> source)
{
_source = source;
}
public T Current => _source.Current;
object IEnumerator.Current => _source.Current;
public void Dispose()
{
}
public bool MoveNext()
{
return _source.MoveNext();
}
public void Reset()
{
throw new NotImplementedException();
}
}
Then we create a custom IEnumerable class that, instead of creating a new Enumerator every time GetEnumerator() is called, secretly keeps handing out the same enumerator:
public class CustomEnumerable<T> : IEnumerable<T>
{
public CustomEnumerable(IEnumerable<T> source)
{
_internalEnumerator = new CustomEnumerator<T>(source.GetEnumerator());
}
private IEnumerator<T> _internalEnumerator;
public IEnumerator<T> GetEnumerator()
{
return _internalEnumerator;
}
IEnumerator IEnumerable.GetEnumerator()
{
return _internalEnumerator;
}
}
And finally we create an IEnumerable extension method to convert an IEnumerable into our CustomEnumerable:
public static class IEnumerableExtensions
{
public static IEnumerable<T> ToTrackingEnumerable<T>(this IEnumerable<T> source) => new CustomEnumerable<T>(source);
}
Finally, we can now do this:
var myList = new List<int>() { 1, 2, 3, 4, 5, 6 };
var myEnumerable = myList.Where(p =>
{
Console.Write($"{p} ");
return p > 2;
}).ToTrackingEnumerable();
Console.WriteLine("Starting");
var first = myEnumerable.First();
Console.WriteLine("");
var second = myEnumerable.Where(p => p % 2 == 1).First();
Console.WriteLine("");
I changed the last part to show that we can still use LINQ on it. The output is now:
Starting
1 2 3
4 5

Related

Consuming an IEnumerable multiple times in one pass

Is it possible to write a higher-order function that causes an IEnumerable to be consumed multiple times but in only one pass and without reading all the data into memory? [See Edit below for a clarification of what I'm looking for.]
For example, in the code below the enumerable is mynums (onto which I've tagged a .Trace() in order to see how many times we enumerate it). The goal is to figure out whether it has any numbers greater than 5, as well as the sum of all of the numbers. A function that computes both is Both_TwoPass, but it enumerates the source twice. In contrast, Both_NonStream only enumerates it once, but at the expense of reading it into memory. In principle it is possible to carry out both of these tasks in a single pass and in a streaming fashion, as shown by Any5Sum, but that is a specific solution. Is it possible to write a function with the same signature as Both_* that is the best of both worlds?
(It seems to me that this should be possible using threads. Is there a better solution using, say, async?)
Edit
Below is a clarification regarding what I'm looking for. What I've done is included a very down-to-earth description of each property in square brackets.
I'm looking for a function Both with the following characteristics:
It has signature (S1, S2) Both<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1>, Func<IEnumerable<T>, S2>) (and produces the "right" output!)
It only iterates the first argument, tt, once. [What I mean by this is that when passed mynums (as defined below) it only outputs mynums: 0 1 2 ... once. This precludes function Both_TwoPass.]
It processes the data from the first argument, tt, in a streaming fashion. [What I mean by this is that, for example, there is insufficient memory to store all the items from tt in memory simultaneously, thus precluding function Both_NonStream.]
using System;
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApp
{
static class Extensions
{
public static IEnumerable<T> Trace<T>(this IEnumerable<T> tt, string msg = "")
{
Console.Write(msg);
try
{
foreach (T t in tt)
{
Console.Write(" {0}", t);
yield return t;
}
}
finally
{
Console.WriteLine('.');
}
}
public static (S1, S2) Both_TwoPass<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> f1, Func<IEnumerable<T>, S2> f2)
{
return (f1(tt), f2(tt));
}
public static (S1, S2) Both_NonStream<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> f1, Func<IEnumerable<T>, S2> f2)
{
var tt2 = tt.ToList();
return (f1(tt2), f2(tt2));
}
public static (bool, int) Any5Sum(this IEnumerable<int> ii)
{
int sum = 0;
bool any5 = false;
foreach (int i in ii)
{
sum += i;
any5 |= i > 5; // or: if (!any5) any5 = i > 5;
}
return (any5, sum);
}
}
class Program
{
static void Main()
{
var mynums = Enumerable.Range(0, 10).Trace("mynums:");
Console.WriteLine("TwoPass: (any > 5, sum) = {0}", mynums.Both_TwoPass(tt => tt.Any(k => k > 5), tt => tt.Sum()));
Console.WriteLine("NonStream: (any > 5, sum) = {0}", mynums.Both_NonStream(tt => tt.Any(k => k > 5), tt => tt.Sum()));
Console.WriteLine("Manual: (any > 5, sum) = {0}", mynums.Any5Sum());
}
}
}
With the computation model you've written (i.e. return (f1(tt), f2(tt))), there is no way to avoid multiple iterations of your enumerable. You're basically saying: compute Item1, then compute Item2.
You have to either change the model from (Func<IEnumerable<T>, S1>, Func<IEnumerable<T>, S2>) to (Func<T, S1>, Func<T, S2>) or to Func<IEnumerable<T>, (S1, S2)> to be able to run the computations in parallel.
Your implementation of Any5Sum is basically the second approach (Func<IEnumerable<T>, (S1, S2)>). But there's already a built-in method for that.
Try this:
Console.WriteLine("Aggregate: (any > 5, sum) = {0}",
mynums
.Aggregate<int, (bool any5, int sum)>(
(false, 0),
(a, x) => (a.any5 | x > 5, a.sum + x)));
I think you and I are describing the same thing in the comments. There is no need to create such a "special-purpose IEnumerable", though, because the BlockingCollection<> class already exists for such producer-consumer scenarios. You'd use it as follows...
Create a BlockingCollection<> for each consuming function (i.e. tt1 and tt2).
By default, a BlockingCollection<> wraps a ConcurrentQueue<>, so the elements will arrive in FIFO order.
To satisfy your requirement that only one element be held in memory at a time, 1 will be specified for the bounded capacity. Note that this capacity is per collection, so with two collections there will be up to two queued elements at any given moment.
Each collection will hold the input elements for that consumer.
Create a thread/task for each consuming function.
The thread/task will simply call GetConsumingEnumerable() on its input collection, pass the resulting IEnumerable<> to its consuming function, and return that result.
GetConsumingEnumerable() does just as its name implies: it creates an IEnumerable<> that consumes (removes) elements from the collection. If the collection is empty, enumeration will block until an element is added. CompleteAdding() is called once the producer is finished, which allows the consuming enumerator to exit when the collection empties.
The producer enumerates the IEnumerable<>, tt, and adds each element to both collections. This is the only time that tt is enumerated.
BlockingCollection<>.Add() will block if the collection has reached its capacity, preventing the entirety of tt from being buffered in-memory.
Once tt has been fully enumerated, CompleteAdding() is called on each collection.
Once each consumer thread/task has completed, their results are returned.
Here's what that looks like in code...
public static (S1, S2) Both<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> tt1, Func<IEnumerable<T>, S2> tt2)
{
const int MaxQueuedElementsPerCollection = 1;
using (BlockingCollection<T> collection1 = new BlockingCollection<T>(MaxQueuedElementsPerCollection))
using (Task<S1> task1 = StartConsumerTask(collection1, tt1))
using (BlockingCollection<T> collection2 = new BlockingCollection<T>(MaxQueuedElementsPerCollection))
using (Task<S2> task2 = StartConsumerTask(collection2, tt2))
{
foreach (T element in tt)
{
collection1.Add(element);
collection2.Add(element);
}
// Inform any enumerators created by .GetConsumingEnumerable()
// that there will be no more elements added.
collection1.CompleteAdding();
collection2.CompleteAdding();
// Accessing the Result property blocks until the Task<> is complete.
return (task1.Result, task2.Result);
}
Task<S> StartConsumerTask<S>(BlockingCollection<T> collection, Func<IEnumerable<T>, S> func)
{
return Task.Run(() => func(collection.GetConsumingEnumerable()));
}
}
Note that, for efficiency's sake, you could increase MaxQueuedElementsPerCollection to, say, 10 or 100 so that the consumers don't have to run in lock-step with each other.
There is one problem with this code, though. When a collection is empty the consumer has to wait for the producer to produce an element, and when a collection is full the producer has to wait for the consumer to consume an element. Consider what happens mid-way through the execution of your tt => tt.Any(k => k > 5) lambda...
The producer waits for the collection to be non-full and adds 5.
The consumer waits for the collection to be non-empty and removes 5.
5 > 5 returns false and enumeration continues.
The producer waits for the collection to be non-full and adds 6.
The consumer waits for the collection to be non-empty and removes 6.
6 > 5 returns true and enumeration stops. Any(), the lambda, and the consumer task all return.
The producer waits for the collection to be non-full and adds 7.
The producer waits for the collection to be non-full and...that never happens!
The consumer has already abandoned the enumeration, so it won't consume any elements to make room for the new one. Add() will never return.
The cleanest way I could come up with to prevent this deadlock is to ensure the entire collection gets enumerated even if func doesn't do so. This just requires a simple change to the StartConsumerTask<>() local method...
Task<S> StartConsumerTask<S>(BlockingCollection<T> collection, Func<IEnumerable<T>, S> func)
{
return Task.Run(
() => {
try
{
return func(collection.GetConsumingEnumerable());
}
finally
{
// Prevent BlockingCollection<>.Add() calls from
// deadlocking by ensuring the entire collection gets
// consumed even if func abandoned its enumeration early.
foreach (T element in collection.GetConsumingEnumerable())
{
// Do nothing...
}
}
}
);
}
The downside of this is that tt will always be enumerated to completion, even if both tt1 and tt2 abandon their enumerators early.
With that addressed, this...
static void Main()
{
IEnumerable<int> mynums = Enumerable.Range(0, 10).Trace("mynums:");
Console.WriteLine("Both: (any > 5, sum) = {0}", mynums.Both(tt => tt.Any(k => k > 5), tt => tt.Sum()));
}
...outputs this...
mynums: 0 1 2 3 4 5 6 7 8 9.
Both: (any > 5, sum) = (True, 45)
The core problem here is who is responsible for calling IEnumerator.MoveNext() (e.g. by using a foreach loop). Synchronizing multiple foreach loops across threads would be slow and fiddly to get right.
Implementing IAsyncEnumerable<T>, so that multiple await foreach loops can take turns processing items would be easier. But still silly.
So the simpler solution would be to change the question. Instead of trying to call multiple methods that both try to enumerate the items, change the interface to simply visit each item.
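For example, here is a rough sketch of what that reshaped interface could look like (the name BothByFolding and its fold-style parameters are mine, not from the question): each consumer becomes a seed plus a per-item function, so a single foreach is enough.

using System;
using System.Collections.Generic;

public static class VisitEachItemExtensions
{
    // Instead of Func<IEnumerable<T>, S>, take per-item folds so that one pass over the source suffices.
    public static (S1, S2) BothByFolding<T, S1, S2>(
        this IEnumerable<T> source,
        S1 seed1, Func<S1, T, S1> fold1,
        S2 seed2, Func<S2, T, S2> fold2)
    {
        foreach (T item in source)
        {
            seed1 = fold1(seed1, item);
            seed2 = fold2(seed2, item);
        }
        return (seed1, seed2);
    }
}

With the question's data this could be called as mynums.BothByFolding(false, (any, x) => any | x > 5, 0, (sum, x) => sum + x). The trade-off is exactly the one described above: the consumers are no longer arbitrary IEnumerable-based functions.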
I believe it is possible to satisfy all the requirements of the question, and one more (very natural) requirement, namely that the original enumerable be only enumerated partially if each of the two Func<IEnumerable<T>, S> consume it partially.
(This was discussed by @BACON.) The approach is discussed in more detail in my GitHub repo 'CoEnumerable'. The idea is that the Barrier class provides a fairly straightforward approach to implement a proxy IEnumerable which can be consumed by each of the Func<IEnumerable<T>, S>s while the proxy consumes the real IEnumerable just once. In particular, the implementation consumes only as much of the original enumerable as is absolutely necessary (i.e., it satisfies the extra requirement mentioned above).
The proxy is:
class BarrierEnumerable<T> : IEnumerable<T>
{
private Barrier barrier;
private bool moveNext;
private readonly Func<T> src;
public BarrierEnumerable(IEnumerator<T> enumerator)
{
src = () => enumerator.Current;
}
public Barrier Barrier
{
set => barrier = value;
}
public bool MoveNext
{
set => moveNext = value;
}
public IEnumerator<T> GetEnumerator()
{
try
{
while (true)
{
barrier.SignalAndWait();
if (moveNext)
{
yield return src();
}
else
{
yield break;
}
}
}
finally
{
barrier.RemoveParticipant();
}
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
in terms of which we can combine the two consumers
public static T Combine<S, T1, T2, T>(this IEnumerable<S> source,
Func<IEnumerable<S>, T1> coenumerable1,
Func<IEnumerable<S>, T2> coenumerable2,
Func<T1, T2, T> resultSelector)
{
using var ss = source.GetEnumerator();
var enumerable1 = new BarrierEnumerable<S>(ss);
var enumerable2 = new BarrierEnumerable<S>(ss);
using var barrier = new Barrier(2, _ => enumerable1.MoveNext = enumerable2.MoveNext = ss.MoveNext());
enumerable2.Barrier = enumerable1.Barrier = barrier;
using var t1 = Task.Run(() => coenumerable1(enumerable1));
using var t2 = Task.Run(() => coenumerable2(enumerable2));
return resultSelector(t1.Result, t2.Result);
}
The GitHub repo has a couple of examples of using the above code, and some brief design discussion (including limitations).

What is the reason for creating IEnumerator?

IEnumerator contains MoveNext(), Reset() and Current as its members. Now assume that I have moved these methods and property to IEnumerable interface and removed GetEnumerator() method and IEnumerator interface.
Now, the object of the class which implements IEnumerable will be able to access the methods and the property and hence can be iterated upon.
Why was the above approach not followed, and what problems will I face if I follow it?
How does the presence of the IEnumerator interface solve those problems?
An iterator contains separate state to the collection: it contains a cursor for where you are within the collection. As such, there has to be a separate object to represent that extra state, a way to get that object, and operations on that object - hence IEnumerator (and IEnumerator<T>), GetEnumerator(), and the iterator members.
Imagine if we didn't have the separate state, and then we wrote:
var list = new List<int> { 1, 2, 3 };
foreach (var x in list)
{
foreach (var y in list)
{
Console.WriteLine("{0} {1}", x, y);
}
}
That should print "1 1", "1 2", "1 3", "2 1" etc... but without any extra state, how could it "know" the two different positions of the two loops?
Now assume that I have moved these methods and property to IEnumerable interface and removed GetEnumerator() method and IEnumerator interface.
Such a design would prevent concurrent enumeration of the collection. If it was the collection itself that tracked the current position, you couldn't have several threads enumerating the same collection, or even nested enumerations such as this:
foreach (var x in collection)
{
foreach (var y in collection)
{
Console.WriteLine("{0} {1}", x, y);
}
}
By delegating the responsibility of tracking the current position to a different object (the enumerator), it makes each enumeration of the collection independent of the others.
A bit of a long answer: the two previous answers cover most of it, but I found some aspects interesting when looking up foreach in the C# language specification. Unless you are interested in that, stop reading.
Now for the interesting part. According to the C# spec, the expansion of the following statement:
foreach (V v in x) embedded-statement
Gives you:
{
E e = ((C)(x)).GetEnumerator();
try {
while (e.MoveNext()) {
V v = (V)(T)e.Current;
embedded-statement
}
}
finally {
… // Dispose e
}
}
Having some kind of identity relation where x == ((C)(x)).GetEnumerator() (it is its own enumerator) and using @JonSkeet's loops produces something like this (try/finally removed for brevity):
var list = new List<int> { 1, 2, 3 };
while (list.MoveNext()) {
int x = list.Current; // x is always 1
while (list.MoveNext()) {
int y = list.Current; // y becomes 2, then 3
Console.WriteLine("{0} {1}", x, y);
}
}
Will print something along the lines of:
1 2
1 3
And then list.MoveNext() will return false forever, which makes a very important point if you look at this:
var list = new List<int> { 1, 2, 3 };
// loop
foreach (var x in list) Console.WriteLine(x); // Will print 1,2,3
// loop again
// Will never enter loop, since reset wasn't called and MoveNext() returns false
foreach (var y in list) Console.WriteLine(y); // Prints nothing
So with the above in mind (and note that this is entirely doable, since the foreach statement looks for a GetEnumerator() method before it checks whether the type implements IEnumerable<T>):
Why was the above approach not followed, and what problems will I face if I follow it?
You cannot nest loops, nor can you use the foreach statement to access the same collection more than once without calling Reset() manually between them. Also what happens when we dispose of the enumerator after each foreach?
How does the presence of the IEnumerator interface solve those problems?
All iterations are independent of each other, whether we are talking nesting, multiple threads, etc.; the enumeration is separate from the collection itself. You can think of it a bit like separation of concerns (SoC): the idea is to separate the traversal from the actual list itself, and the traversal should under no circumstances alter the state of the collection. I.e., with your example, a call to MoveNext() would modify the collection.
Others have already answered your question, but I want to add another little detail.
Let's decompile System.Collections.Generic.List<T>. Its definition looks like this:
public class List<T> : IList<T>, ICollection<T>, IList, ICollection, IReadOnlyList<T>, IReadOnlyCollection<T>, IEnumerable<T>, IEnumerable
and its enumerator definition looks like this:
public struct Enumerator : IEnumerator<T>, IDisposable, IEnumerator
As you can see, List itself is a class, but its Enumerator is a struct, and this design helps boost performance.
Let's assume that you do not have a separation between IEnumerable and IEnumerator. In this situation you would be forced to make List itself a struct, but this is not a very good idea, so you cannot do that. Therefore, you would be losing a good opportunity for a performance boost.
With the separation between IEnumerable and IEnumerator you can implement each interface as you like and use struct for enumerators.
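As a rough sketch of that pattern (the collection type here is made up, but it mirrors what List<T> does): the public GetEnumerator() returns the struct directly, so a plain foreach keeps the enumerator on the stack, while the explicit interface implementations still work for callers that only see IEnumerable<T>, at the cost of boxing.

using System.Collections;
using System.Collections.Generic;

public class MyCollection<T> : IEnumerable<T>
{
    private readonly T[] _items;
    public MyCollection(params T[] items) => _items = items;

    public struct Enumerator : IEnumerator<T>
    {
        private readonly T[] _items;
        private int _index;
        internal Enumerator(T[] items) { _items = items; _index = -1; }

        public T Current => _items[_index];
        object IEnumerator.Current => Current;
        public bool MoveNext() => ++_index < _items.Length;
        public void Reset() => _index = -1;
        public void Dispose() { }
    }

    // foreach binds to this method and never allocates an enumerator object.
    public Enumerator GetEnumerator() => new Enumerator(_items);

    // Callers that only see the interfaces get a boxed copy of the struct.
    IEnumerator<T> IEnumerable<T>.GetEnumerator() => GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}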
The iteration logic (foreach) is not bound to IEnumerable or IEnumerator. All foreach needs to work is a method called GetEnumerator on the class, returning an object that has MoveNext() and Reset() methods and a Current property. For example, the following code works, and it will create an endless loop.
From a design perspective, the separation is there to ensure that the container (IEnumerable) does not keep any state during and after the completion of the iteration (foreach) operations.
public class Iterator
{
public bool MoveNext()
{
return true;
}
public void Reset()
{
}
public object Current { get; private set; }
}
public class Tester
{
public Iterator GetEnumerator()
{
return new Iterator();
}
public static void Loop()
{
Tester tester = new Tester();
foreach (var v in tester)
{
Console.WriteLine(v);
}
}
}

Why use the yield keyword, when I could just use an ordinary IEnumerable?

Given this code:
IEnumerable<object> FilteredList()
{
foreach( object item in FullList )
{
if( IsItemInPartialList( item ) )
yield return item;
}
}
Why should I not just code it this way?:
IEnumerable<object> FilteredList()
{
var list = new List<object>();
foreach( object item in FullList )
{
if( IsItemInPartialList( item ) )
list.Add(item);
}
return list;
}
I sort of understand what the yield keyword does. It tells the compiler to build a certain kind of thing (an iterator). But why use it? Apart from it being slightly less code, what's it do for me?
Using yield makes the collection lazy.
Let's say you just need the first five items. Your way, I have to loop through the entire list to get the first five items. With yield, I only loop through the first five items.
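A tiny self-contained illustration of that (the LazyEvens filter is made up; it plays the same role as the question's FilteredList):

using System;
using System.Collections.Generic;
using System.Linq;

class LazyDemo
{
    // The "yield" style from the question: items are produced on demand.
    static IEnumerable<int> LazyEvens(IEnumerable<int> source)
    {
        foreach (var n in source)
        {
            Console.WriteLine($"checking {n}");
            if (n % 2 == 0) yield return n;
        }
    }

    static void Main()
    {
        var numbers = Enumerable.Range(1, 1000);

        // Only items 1..10 are ever "checked": enumeration stops as soon as five
        // even numbers have been found. Building a List first would have checked
        // all 1000 items before Take(5) even ran.
        var firstFiveEvens = LazyEvens(numbers).Take(5).ToList();
        Console.WriteLine(string.Join(", ", firstFiveEvens)); // 2, 4, 6, 8, 10
    }
}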
The benefit of iterator blocks is that they work lazily. So you can write a filtering method like this:
public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
Func<T, bool> predicate)
{
foreach (var item in source)
{
if (predicate(item))
{
yield return item;
}
}
}
That will allow you to filter a stream as long as you like, never buffering more than a single item at a time. If you only need the first value from the returned sequence, for example, why would you want to copy everything into a new list?
As another example, you can easily create an infinite stream using iterator blocks. For example, here's a sequence of random numbers:
public static IEnumerable<int> RandomSequence(int minInclusive, int maxExclusive)
{
Random rng = new Random();
while (true)
{
yield return rng.Next(minInclusive, maxExclusive);
}
}
How would you store an infinite sequence in a list?
My Edulinq blog series gives a sample implementation of LINQ to Objects which makes heavy use of iterator blocks. LINQ is fundamentally lazy where it can be - and putting things in a list simply doesn't work that way.
With the "list" code, you have to process the full list before you can pass it on to the next step. The "yield" version passes the processed item immediately to the next step. If that "next step" contains a ".Take(10)" then the "yield" version will only process the first 10 items and forget about the rest. The "list" code would have processed everything.
This means that you see the most difference when you need to do a lot of processing and/or have long lists of items to process.
You can use yield to return items that aren't in a list. Here's a little sample that could iterate infinitely through a list until canceled.
public IEnumerable<int> GetNextNumber()
{
while (true)
{
for (int i = 0; i < 10; i++)
{
yield return i;
}
}
}
public bool Canceled { get; set; }
public void StartCounting()
{
foreach (var number in GetNextNumber())
{
if (this.Canceled) break;
Console.WriteLine(number);
}
}
This writes
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
...etc. to the console until canceled.
object jamesItem = null;
foreach(var item in FilteredList())
{
if (item.Name == "James")
{
jamesItem = item;
break;
}
}
return jamesItem;
When the above code is used to loop through FilteredList(), and assuming item.Name == "James" will be satisfied by the 2nd item in the list, the method using yield will yield twice. This is lazy behavior.
Whereas the method using a list will add all n objects to the list and pass the complete list to the calling method.
This is exactly a use case where the difference between IEnumerable and IList can be highlighted.
The best real world example I've seen for the use of yield would be to calculate a Fibonacci sequence.
Consider the following code:
class Program
{
static void Main(string[] args)
{
Console.WriteLine(string.Join(", ", Fibonacci().Take(10)));
Console.WriteLine(string.Join(", ", Fibonacci().Skip(15).Take(1)));
Console.WriteLine(string.Join(", ", Fibonacci().Skip(10).Take(5)));
Console.WriteLine(string.Join(", ", Fibonacci().Skip(100).Take(1)));
Console.ReadKey();
}
private static IEnumerable<long> Fibonacci()
{
long a = 0;
long b = 1;
while (true)
{
long temp = a;
a = b;
yield return a;
b = temp + b;
}
}
}
This will return:
1, 1, 2, 3, 5, 8, 13, 21, 34, 55
987
89, 144, 233, 377, 610
1298777728820984005
This is nice because it allows you to calculate out an infinite series quickly and easily, giving you the ability to use the Linq extensions and query only what you need.
why use [yield]? Apart from it being slightly less code, what's it do for me?
Sometimes it is useful, sometimes not. If the entire set of data must be examined and returned then there is not going to be any benefit in using yield because all it did was introduce overhead.
When yield really shines is when only a partial set is returned. I think the best example is sorting. Assume you have a list of objects containing a date and a dollar amount from this year and you would like to see the first handful (5) records of the year.
In order to accomplish this, the list must be sorted ascending by date, and then have the first 5 taken. If this was done without yield, the entire list would have to be sorted, right up to making sure the last two dates were in order.
However, with yield, once the first 5 items have been established the sorting stops and the results are available. This can save a large amount of time.
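A rough sketch of that scenario (the record type and data are made up; whether the sort truly stops early depends on the LINQ implementation, but OrderBy is deferred and only five results are ever requested by the caller):

using System;
using System.Collections.Generic;
using System.Linq;

class FirstFiveOfYear
{
    static void Main()
    {
        var records = new List<(DateTime Date, decimal Amount)>
        {
            (new DateTime(2024, 3, 1), 10m),
            (new DateTime(2024, 1, 15), 25m),
            (new DateTime(2024, 2, 7), 5m),
            (new DateTime(2024, 1, 2), 99m),
            (new DateTime(2024, 4, 20), 42m),
            (new DateTime(2024, 1, 30), 7m),
            // ...imagine many more entries here
        };

        var firstFive = records
            .OrderBy(r => r.Date) // deferred: nothing is sorted until enumeration starts
            .Take(5)              // only five results are requested
            .ToList();

        foreach (var r in firstFive)
            Console.WriteLine($"{r.Date:yyyy-MM-dd}  {r.Amount}");
    }
}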
The yield return statement allows you to return only one item at a time. You are collecting all the items in a list and again returning that list, which is a memory overhead.

Getting head and tail from IEnumerable that can only be iterated once

I have a sequence of elements. The sequence can only be iterated once and can be "infinite".
What is the best way get the head and the tail of such a sequence?
Update: A few clarifications that would have been nice if I included in the original question :)
Head is the first element of the sequence and tail is "the rest". That means that the tail is also "infinite".
When I say infinite, I mean "very large" and "I wouldn't want to store it all in memory at once". It could also have been actually infinite, like sensor data for example (but it wasn't in my case).
When I say that it can only be iterated once, I mean that generating the sequence is resource heavy, so I wouldn't want to do it again. It could also have been volatile data, again like sensor data, that won't be the same on the next read (but it wasn't in my case).
Decomposing IEnumerable<T> into head & tail isn't particularly good for recursive processing (unlike functional lists) because when you use the tail operation recursively, you'll create a number of indirections. However, you can write something like this:
I'm ignoring things like argument checking and exception handling, but it shows the idea...
Tuple<T, IEnumerable<T>> HeadAndTail<T>(IEnumerable<T> source) {
// Get first element of the 'source' (assuming it is there)
var en = source.GetEnumerator();
en.MoveNext();
// Return first element and Enumerable that iterates over the rest
return Tuple.Create(en.Current, EnumerateTail(en));
}
// Turn remaining (unconsumed) elements of enumerator into enumerable
IEnumerable<T> EnumerateTail<T>(IEnumerator<T> en) {
while(en.MoveNext()) yield return en.Current;
}
The HeadAndTail method gets the first element and returns it as the first element of a tuple. The second element of a tuple is IEnumerable<T> that's generated from the remaining elements (by iterating over the rest of the enumerator that we already created).
Obviously, each call to HeadAndTail should enumerate the sequence again (unless there is some sort of caching used). For example, consider the following:
var a = HeadAndTail(sequence);
Console.WriteLine(HeadAndTail(a.Item2).Item1);
// Element #2; the enumerator is at least at #2 now.
var b = HeadAndTail(sequence);
Console.WriteLine(b.Item1);
// Element #1; there is no way to get #1 unless we enumerate the sequence again.
For the same reason, HeadAndTail could not be implemented as separate Head and Tail methods (unless you want even the first call to Tail to enumerate the sequence again even if it was already enumerated by a call to Head).
Additionally, HeadAndTail should not return an instance of IEnumerable (as it could be enumerated multiple times).
This leaves us with the only option: HeadAndTail should return IEnumerator, and, to make things more obvious, it should accept IEnumerator as well (we're just moving an invocation of GetEnumerator from inside the HeadAndTail to the outside, to emphasize it is of one-time use only).
Now that we have worked out the requirements, the implementation is pretty straightforward:
class HeadAndTail<T> {
public readonly T Head;
public readonly IEnumerator<T> Tail;
public HeadAndTail(T head, IEnumerator<T> tail) {
Head = head;
Tail = tail;
}
}
static class IEnumeratorExtensions {
public static HeadAndTail<T> HeadAndTail<T>(this IEnumerator<T> enumerator) {
if (!enumerator.MoveNext()) return null;
return new HeadAndTail<T>(enumerator.Current, enumerator);
}
}
And now it can be used like this:
Console.WriteLine(sequence.GetEnumerator().HeadAndTail().Tail.HeadAndTail().Head);
//Element #2
Or in recursive functions like this:
TResult FoldR<TSource, TResult>(
IEnumerator<TSource> sequence,
TResult seed,
Func<TSource, TResult, TResult> f
) {
var headAndTail = sequence.HeadAndTail();
if (headAndTail == null) return seed;
return f(headAndTail.Head, FoldR(headAndTail.Tail, seed, f));
}
int Sum(IEnumerator<int> sequence) {
return FoldR(sequence, 0, (x, y) => x+y);
}
var array = Enumerable.Range(1, 5);
Console.WriteLine(Sum(array.GetEnumerator())); // 1+(2+(3+(4+(5+0))))
While other approaches here suggest using yield return for the tail enumerable, such an approach adds unnecessary nesting overhead. A better approach would be to convert the IEnumerator<T> back into something that can be used with foreach:
public struct WrappedEnumerator<T>
{
T myEnumerator;
public T GetEnumerator() { return myEnumerator; }
public WrappedEnumerator(T theEnumerator) { myEnumerator = theEnumerator; }
}
public static class AsForEachHelper
{
static public WrappedEnumerator<IEnumerator<T>> AsForEach<T>(this IEnumerator<T> theEnumerator) {return new WrappedEnumerator<IEnumerator<T>>(theEnumerator);}
static public WrappedEnumerator<System.Collections.IEnumerator> AsForEach(this System.Collections.IEnumerator theEnumerator)
{ return new WrappedEnumerator<System.Collections.IEnumerator>(theEnumerator); }
}
If one used separate WrappedEnumerator structs for the generic IEnumerable<T> and the non-generic IEnumerable, one could have them implement IEnumerable<T> and IEnumerable respectively; they wouldn't really obey the IEnumerable<T> contract, though, which specifies that it should be possible to call GetEnumerator() multiple times, with each call returning an independent enumerator.
Another important caveat is that if one uses AsForEach on an IEnumerator<T>, the resulting WrappedEnumerator should be enumerated exactly once. If it is never enumerated, the underlying IEnumerator<T> will never have its Dispose method called.
Applying the above-supplied methods to the problem at hand, it would be easy to call GetEnumerator() on an IEnumerable<T>, read out the first few items, and then use AsForEach() to convert the remainder so it can be used with a foreach loop (or perhaps, as noted above, to convert it into an implementation of IEnumerable<T>). It's important to note, however, that calling GetEnumerator() creates an obligation to Dispose the resulting IEnumerator<T>, and the class that performs the head/tail split would have no way to do that if nothing ever calls GetEnumerator() on the tail.
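A rough sketch of that usage, relying on the AsForEach extension defined above (error handling omitted, and the sequence is assumed to be non-empty):

using System;
using System.Collections.Generic;
using System.Linq;

class HeadTailDemo
{
    static void Main()
    {
        IEnumerable<int> source = Enumerable.Range(1, 5);

        // The caller owns this enumerator and is responsible for disposing it;
        // the foreach over the wrapper will also dispose it when it completes.
        using (IEnumerator<int> e = source.GetEnumerator())
        {
            // Head: read the first item.
            e.MoveNext();
            Console.WriteLine($"head = {e.Current}");

            // Tail: hand the same, partially consumed enumerator to foreach via the wrapper.
            foreach (int item in e.AsForEach())
                Console.WriteLine($"tail item = {item}");
        }
    }
}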
Probably not the best way to do it, but if you use the .ToList() method you can then get the elements at positions [0] and [Count-1], if Count > 0.
But you should specify what you mean by "can be iterated only once".
What exactly is wrong with .First() and .Last()? Though yeah, I have to agree with the people who asked "what does the tail of an infinite list mean"... the notion doesn't make sense, IMO.

Get next N elements from enumerable

Context: C# 3.0, .Net 3.5
Suppose I have a method that generates random numbers (forever):
private static IEnumerable<int> RandomNumberGenerator() {
while (true) yield return GenerateRandomNumber(0, 100);
}
I need to group those numbers in groups of 10, so I would like something like:
foreach (IEnumerable<int> group in RandomNumberGenerator().Slice(10)) {
Assert.That(group.Count() == 10);
}
I have defined a Slice method, but I feel there should be one already defined. Here is my Slice method, just for reference:
private static IEnumerable<T[]> Slice<T>(IEnumerable<T> enumerable, int size) {
var result = new List<T>(size);
foreach (var item in enumerable) {
result.Add(item);
if (result.Count == size) {
yield return result.ToArray();
result.Clear();
}
}
}
Question: is there an easier way to accomplish what I'm trying to do? Perhaps Linq?
Note: the above example is a simplification; in my program I have an iterator that scans a given matrix in a non-linear fashion.
EDIT: Why Skip+Take is no good.
Effectively what I want is:
var group1 = RandomNumberGenerator().Skip(0).Take(10);
var group2 = RandomNumberGenerator().Skip(10).Take(10);
var group3 = RandomNumberGenerator().Skip(20).Take(10);
var group4 = RandomNumberGenerator().Skip(30).Take(10);
without the overhead of regenerating numbers (10+20+30+40) times. I need a solution that will generate exactly 40 numbers and break them into 4 groups of 10.
Are Skip and Take of any use to you?
Use a combination of the two in a loop to get what you want.
So,
list.Skip(10).Take(10);
Skips the first 10 records and then takes the next 10.
I have done something similar. But I would like it to be simpler:
//Remove "this" if you don't want it to be a extension method
public static IEnumerable<IList<T>> Chunks<T>(this IEnumerable<T> xs, int size)
{
var curr = new List<T>(size);
foreach (var x in xs)
{
curr.Add(x);
if (curr.Count == size)
{
yield return curr;
curr = new List<T>(size);
}
}
}
I think yours are flawed. You return the same array for all your chunks/slices so only the last chunk/slice you take would have the correct data.
Addition: Array version:
public static IEnumerable<T[]> Chunks<T>(this IEnumerable<T> xs, int size)
{
var curr = new T[size];
int i = 0;
foreach (var x in xs)
{
curr[i % size] = x;
if (++i % size == 0)
{
yield return curr;
curr = new T[size];
}
}
}
Addition: Linq version (not C# 2.0). As pointed out, it will not work on infinite sequences and will be a great deal slower than the alternatives:
public static IEnumerable<T[]> Chunks<T>(this IEnumerable<T> xs, int size)
{
return xs.Select((x, i) => new { x, i })
.GroupBy(xi => xi.i / size, xi => xi.x)
.Select(g => g.ToArray());
}
Using Skip and Take would be a very bad idea. Calling Skip on an indexed collection may be fine, but calling it on any arbitrary IEnumerable<T> is liable to result in enumeration over the number of elements skipped, which means that if you're calling it repeatedly you're enumerating over the sequence an order of magnitude more times than you need to be.
Complain of "premature optimization" all you want; but that is just ridiculous.
I think your Slice method is about as good as it gets. I was going to suggest a different approach that would provide deferred execution and obviate the intermediate array allocation, but that is a dangerous game to play (i.e., if you try something like ToList on such a resulting IEnumerable<T> implementation, without enumerating over the inner collections, you'll end up in an endless loop).
(I've removed what was originally here, as the OP's improvements since posting the question have since rendered my suggestions here redundant.)
Let's see if you even need the complexity of Slice. If your random number generator is stateless, I would assume each call to it would generate unique random numbers, so perhaps this would be sufficient:
var group1 = RandomNumberGenerator().Take(10);
var group2 = RandomNumberGenerator().Take(10);
var group3 = RandomNumberGenerator().Take(10);
var group4 = RandomNumberGenerator().Take(10);
Each call to Take returns a new group of 10 numbers.
Now, if your random number generator re-seeds itself with a specific value each time it's iterated, this won't work. You'll simply get the same 10 values for each group. So instead, you would use:
var generator = RandomNumberGenerator();
var group1 = generator.Take(10);
var group2 = generator.Take(10);
var group3 = generator.Take(10);
var group4 = generator.Take(10);
This maintains an instance of the generator so that you can continue retrieving values without re-seeding the generator.
You could use the Skip and Take methods with any Enumerable object.
For your edit :
How about a function that takes a slice number and a slice size as a parameter?
private static IEnumerable<T> Slice<T>(IEnumerable<T> enumerable, int sliceSize, int sliceNumber) {
return enumerable.Skip(sliceSize * sliceNumber).Take(sliceSize);
}
It seems like we'd prefer an IEnumerable<T> that keeps a persistent position counter, so that we can do
var group1 = items.Take(10);
var group2 = items.Take(10);
var group3 = items.Take(10);
var group4 = items.Take(10);
and get successive slices rather than getting the first 10 items each time. We can do that with a new implementation of IEnumerable<T> which keeps one instance of its Enumerator and returns it on every call of GetEnumerator:
public class StickyEnumerable<T> : IEnumerable<T>, IDisposable
{
private IEnumerator<T> innerEnumerator;
public StickyEnumerable( IEnumerable<T> items )
{
innerEnumerator = items.GetEnumerator();
}
public IEnumerator<T> GetEnumerator()
{
return innerEnumerator;
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return innerEnumerator;
}
public void Dispose()
{
if (innerEnumerator != null)
{
innerEnumerator.Dispose();
}
}
}
Given that class, we could implement Slice with
public static IEnumerable<IEnumerable<T>> Slices<T>(this IEnumerable<T> items, int size)
{
using (StickyEnumerable<T> sticky = new StickyEnumerable<T>(items))
{
IEnumerable<T> slice;
do
{
slice = sticky.Take(size).ToList();
yield return slice;
} while (slice.Count() == size);
}
yield break;
}
That works in this case, but StickyEnumerable<T> is generally a dangerous class to have around if the consuming code isn't expecting it. For example,
using (var sticky = new StickyEnumerable<int>(Enumerable.Range(1, 10)))
{
var first = sticky.Take(2);
var second = sticky.Take(2);
foreach (int i in second)
{
Console.WriteLine(i);
}
foreach (int i in first)
{
Console.WriteLine(i);
}
}
prints
1
2
3
4
rather than
3
4
1
2
Take a look at Take(), TakeWhile() and Skip()
I think the use of Slice() would be a bit misleading. I think of that as a means to give me a chunk of an array as a new array without causing side effects. In this scenario you would actually move the enumerable forward by 10.
A possible better approach is to just use the Linq extension Take(). I don't think you would need to use Skip() with a generator.
Edit: Dang, I have been trying to test this behavior with the following code
Note: this wasn't really correct; I leave it here so others don't fall into the same mistake.
var numbers = RandomNumberGenerator();
var slice = numbers.Take(10);
public static IEnumerable<int> RandomNumberGenerator()
{
yield return random.Next();
}
but the Count() for slice is always 1. I also tried running it through a foreach loop, since I know that the Linq extensions are generally lazily evaluated, and it only looped once. I eventually did the code below instead of the Take() and it works:
public static IEnumerable<int> Slice(this IEnumerable<int> enumerable, int size)
{
var list = new List<int>();
foreach (var count in Enumerable.Range(0, size)) list.Add(enumerable.First());
return list;
}
If you notice, I am adding First() to the list each time, but since the enumerable being passed in is the generator from RandomNumberGenerator(), the result is different every time.
So again, with a generator, using Skip() is not needed since the result will be different. Looping over an IEnumerable is not always side-effect free.
Edit: I'll leave the last edit just so no one falls into the same mistake, but it worked fine for me just doing this:
var numbers = RandomNumberGenerator();
var slice1 = numbers.Take(10);
var slice2 = numbers.Take(10);
The two slices were different.
I had made some mistakes in my original answer, but some of the points still stand. Skip() and Take() are not going to work the same with a generator as they would with a list. Looping over an IEnumerable is not always side-effect free. Anyway, here is my take on getting a list of slices.
public static IEnumerable<int> RandomNumberGenerator()
{
while(true) yield return random.Next();
}
public static IEnumerable<IEnumerable<int>> Slice(this IEnumerable<int> enumerable, int size, int count)
{
var slices = new List<List<int>>();
foreach (var iteration in Enumerable.Range(0, count)){
var list = new List<int>();
list.AddRange(enumerable.Take(size));
slices.Add(list);
}
return slices;
}
I got this solution for the same problem:
int[] ints = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
IEnumerable<IEnumerable<int>> chunks = Chunk(ints, 2, t => t.Dump());
//won't enumerate, so won't do anything unless you force it:
chunks.ToList();
IEnumerable<T> Chunk<T, R>(IEnumerable<R> src, int n, Func<IEnumerable<R>, T> action){
IEnumerable<R> head;
IEnumerable<R> tail = src;
while (tail.Any())
{
head = tail.Take(n);
tail = tail.Skip(n);
yield return action(head);
}
}
If you just want the chunks returned, and don't want to do anything with them, use chunks = Chunk(ints, 2, t => t). What I would really like is to have t => t as the default action, but I haven't found out how to do that yet.
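One way to get that default is an overload, since C# doesn't accept a lambda as a default parameter value. Sketched against the Chunk method above:

// Returns the chunks themselves when no action is given.
IEnumerable<IEnumerable<R>> Chunk<R>(IEnumerable<R> src, int n)
{
    return Chunk(src, n, t => t);
}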
