Is it possible to write a higher-order function that causes an IEnumerable to be consumed multiple times but in only one pass and without reading all the data into memory? [See Edit below for a clarification of what I'm looking for.]
For example, in the code below the enumerable is mynums (onto which I've tagged a .Trace() in order to see how many times we enumerate it). The goal is to figure out whether it has any numbers greater than 5, as well as the sum of all of the numbers. Both_TwoPass computes both results, but it enumerates mynums twice. In contrast, Both_NonStream only enumerates it once, but at the expense of reading it into memory. In principle it is possible to carry out both of these tasks in a single pass and in a streaming fashion, as shown by Any5Sum, but that is a specific solution. Is it possible to write a function with the same signature as Both_* that is the best of both worlds?
(It seems to me that this should be possible using threads. Is there a better solution using, say, async?)
Edit
Below is a clarification regarding what I'm looking for. What I've done is included a very down-to-earth description of each property in square brackets.
I'm looking for a function Both with the following characteristics:
It has signature (S1, S2) Both<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1>, Func<IEnumerable<T>, S2>) (and produces the "right" output!)
It only iterates the first argument, tt, once. [What I mean by this is that when passed mynums (as defined below) it only outputs mynums: 0 1 2 ... once. This precludes function Both_TwoPass.]
It processes the data from the first argument, tt, in a streaming fashion. [What I mean by this is that, for example, there is insufficient memory to store all the items from tt in memory simultaneously, thus precluding function Both_NonStream.]
using System;
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApp
{
static class Extensions
{
public static IEnumerable<T> Trace<T>(this IEnumerable<T> tt, string msg = "")
{
Console.Write(msg);
try
{
foreach (T t in tt)
{
Console.Write(" {0}", t);
yield return t;
}
}
finally
{
Console.WriteLine('.');
}
}
public static (S1, S2) Both_TwoPass<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> f1, Func<IEnumerable<T>, S2> f2)
{
return (f1(tt), f2(tt));
}
public static (S1, S2) Both_NonStream<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> f1, Func<IEnumerable<T>, S2> f2)
{
var tt2 = tt.ToList();
return (f1(tt2), f2(tt2));
}
public static (bool, int) Any5Sum(this IEnumerable<int> ii)
{
int sum = 0;
bool any5 = false;
foreach (int i in ii)
{
sum += i;
any5 |= i > 5; // or: if (!any5) any5 = i > 5;
}
return (any5, sum);
}
}
class Program
{
static void Main()
{
var mynums = Enumerable.Range(0, 10).Trace("mynums:");
Console.WriteLine("TwoPass: (any > 5, sum) = {0}", mynums.Both_TwoPass(tt => tt.Any(k => k > 5), tt => tt.Sum()));
Console.WriteLine("NonStream: (any > 5, sum) = {0}", mynums.Both_NonStream(tt => tt.Any(k => k > 5), tt => tt.Sum()));
Console.WriteLine("Manual: (any > 5, sum) = {0}", mynums.Any5Sum());
}
}
}
The way you've written your computation model (i.e. return (f1(tt), f2(tt))) there is no way to avoid multiple iterations of your enumerable. You're basically saying compute Item1 then compute Item2.
You have to either change the model from (Func<IEnumerable<T>, S1>, Func<IEnumerable<T>, S2>) to (Func<T, S1>, Func<T, S2>) or to Func<IEnumerable<T>, (S1, S2)> to be able to run the computations in parallel.
Your implementation of Any5Sum is basically the second approach (Func<IEnumerable<T>, (S1, S2)>). But there's already a built-in method for that.
Try this:
Console.WriteLine("Aggregate: (any > 5, sum) = {0}",
mynums
.Aggregate<int, (bool any5, int sum)>(
(false, 0),
(a, x) => (a.any5 | x > 5, a.sum + x)));
I think you and I are describing the same thing in the comments. There is no need to create such a "special-purpose IEnumerable", though, because the BlockingCollection<> class already exists for such producer-consumer scenarios. You'd use it as follows...
Create a BlockingCollection<> for each consuming function (i.e. tt1 and tt2).
By default, a BlockingCollection<> wraps a ConcurrentQueue<>, so the elements will arrive in FIFO order.
To satisfy your requirement that only one element be held in memory at a time, 1 will be specified for the bounded capacity. Note that this capacity is per collection, so with two collections there will be up to two queued elements at any given moment.
Each collection will hold the input elements for that consumer.
Create a thread/task for each consuming function.
The thread/task will simply call GetConsumingEnumerable() for its input collection, pass the resulting IEnumerable<> to its consuming function, and return that result.
GetConsumingEnumerable() does just as its name implies: it creates an IEnumerable<> that consumes (removes) elements from the collection. If the collection is empty, enumeration will block until an element is added. CompleteAdding() is called once the producer is finished, which allows the consuming enumerator to exit when the collection empties.
The producer enumerates the IEnumerable<>, tt, and adds each element to both collections. This is the only time that tt is enumerated.
BlockingCollection<>.Add() will block if the collection has reached its capacity, preventing the entirety of tt from being buffered in-memory.
Once tt has been fully enumerated, CompleteAdding() is called on each collection.
Once each consumer thread/task has completed, their results are returned.
Here's what that looks like in code...
// Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
public static (S1, S2) Both<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> tt1, Func<IEnumerable<T>, S2> tt2)
{
const int MaxQueuedElementsPerCollection = 1;
using (BlockingCollection<T> collection1 = new BlockingCollection<T>(MaxQueuedElementsPerCollection))
using (Task<S1> task1 = StartConsumerTask(collection1, tt1))
using (BlockingCollection<T> collection2 = new BlockingCollection<T>(MaxQueuedElementsPerCollection))
using (Task<S2> task2 = StartConsumerTask(collection2, tt2))
{
foreach (T element in tt)
{
collection1.Add(element);
collection2.Add(element);
}
// Inform any enumerators created by .GetConsumingEnumerable()
// that there will be no more elements added.
collection1.CompleteAdding();
collection2.CompleteAdding();
// Accessing the Result property blocks until the Task<> is complete.
return (task1.Result, task2.Result);
}
Task<S> StartConsumerTask<S>(BlockingCollection<T> collection, Func<IEnumerable<T>, S> func)
{
return Task.Run(() => func(collection.GetConsumingEnumerable()));
}
}
Note that, for efficiency's sake, you could increase MaxQueuedElementsPerCollection to, say, 10 or 100 so that the consumers don't have to run in lock-step with each other.
There is one problem with this code, though. When a collection is empty the consumer has to wait for the producer to produce an element, and when a collection is full the producer has to wait for the consumer to consume an element. Consider what happens mid-way through the execution of your tt => tt.Any(k => k > 5) lambda...
The producer waits for the collection to be non-full and adds 5.
The consumer waits for the collection to be non-empty and removes 5.
5 > 5 returns false and enumeration continues.
The producer waits for the collection to be non-full and adds 6.
The consumer waits for the collection to be non-empty and removes 6.
6 > 5 returns true and enumeration stops. Any(), the lambda, and the consumer task all return.
The producer waits for the collection to be non-full and adds 7.
The producer waits for the collection to be non-full and...that never happens!
The consumer has already abandoned the enumeration, so it won't consume any elements to make room for the new one. Add() will never return.
The cleanest way I could come up with to prevent this deadlock is to ensure the entire collection gets enumerated even if func doesn't do so. This just requires a simple change to the StartConsumerTask<>() local method...
Task<S> StartConsumerTask<S>(BlockingCollection<T> collection, Func<IEnumerable<T>, S> func)
{
return Task.Run(
() => {
try
{
return func(collection.GetConsumingEnumerable());
}
finally
{
// Prevent BlockingCollection<>.Add() calls from
// deadlocking by ensuring the entire collection gets
// consumed even if func abandoned its enumeration early.
foreach (T element in collection.GetConsumingEnumerable())
{
// Do nothing...
}
}
}
);
}
The downside of this is that tt will always be enumerated to completion, even if both tt1 and tt2 abandon their enumerators early.
With that addressed, this...
static void Main()
{
IEnumerable<int> mynums = Enumerable.Range(0, 10).Trace("mynums:");
Console.WriteLine("Both: (any > 5, sum) = {0}", mynums.Both(tt => tt.Any(k => k > 5), tt => tt.Sum()));
}
...outputs this...
mynums: 0 1 2 3 4 5 6 7 8 9.
Both: (any > 5, sum) = (True, 45)
The core problem here is who is responsible for calling IEnumerator.MoveNext() (e.g. by using a foreach loop). Synchronizing multiple foreach loops across threads would be slow and fiddly to get right.
Implementing IAsyncEnumerable<T>, so that multiple await foreach loops can take turns processing items, would be easier. But still silly.
So the simpler solution would be to change the question. Instead of trying to call multiple methods that both try to enumerate the items, change the interface to simply visit each item.
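A minimal sketch of that idea, assuming each computation can be expressed as a per-item step instead of a function over IEnumerable<T>. The names VisitOnce, VisitExtensions and Demo are hypothetical and not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class VisitExtensions
{
    // Hypothetical helper: walks the source exactly once and feeds every
    // item to two independent accumulator steps.
    public static (S1, S2) VisitOnce<T, S1, S2>(
        this IEnumerable<T> source,
        S1 seed1, Func<S1, T, S1> step1,
        S2 seed2, Func<S2, T, S2> step2)
    {
        S1 acc1 = seed1;
        S2 acc2 = seed2;
        foreach (T item in source)
        {
            acc1 = step1(acc1, item);
            acc2 = step2(acc2, item);
        }
        return (acc1, acc2);
    }

    // Usage: "any > 5" and "sum" over 0..9 in a single pass.
    public static (bool, int) Demo()
    {
        return Enumerable.Range(0, 10).VisitOnce(
            false, (any, x) => any || x > 5,
            0, (sum, x) => sum + x);
    }
}
```

The trade-off is that the callers no longer receive an IEnumerable<T>, so existing LINQ-based functions such as Any or Sum cannot be passed in unchanged.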
I believe it is possible to satisfy all the requirements of the question, and one more (very natural) requirement, namely that the original enumerable be only enumerated partially if each of the two Func<IEnumerable<T>, S> consume it partially.
(This was discussed by @BACON.) The approach is discussed in more detail in my GitHub repo 'CoEnumerable'. The idea is that the Barrier class provides a fairly straightforward way to implement a proxy IEnumerable which can be consumed by each of the Func<IEnumerable<T>, S>s while the proxy consumes the real IEnumerable just once. In particular, the implementation consumes only as much of the original enumerable as is absolutely necessary (i.e., it satisfies the extra requirement mentioned above).
The proxy is:
class BarrierEnumerable<T> : IEnumerable<T>
{
private Barrier barrier;
private bool moveNext;
private readonly Func<T> src;
public BarrierEnumerable(IEnumerator<T> enumerator)
{
src = () => enumerator.Current;
}
public Barrier Barrier
{
set => barrier = value;
}
public bool MoveNext
{
set => moveNext = value;
}
public IEnumerator<T> GetEnumerator()
{
try
{
while (true)
{
barrier.SignalAndWait();
if (moveNext)
{
yield return src();
}
else
{
yield break;
}
}
}
finally
{
barrier.RemoveParticipant();
}
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
In terms of this proxy, we can combine the two consumers:
public static T Combine<S, T1, T2, T>(this IEnumerable<S> source,
Func<IEnumerable<S>, T1> coenumerable1,
Func<IEnumerable<S>, T2> coenumerable2,
Func<T1, T2, T> resultSelector)
{
using var ss = source.GetEnumerator();
var enumerable1 = new BarrierEnumerable<S>(ss);
var enumerable2 = new BarrierEnumerable<S>(ss);
using var barrier = new Barrier(2, _ => enumerable1.MoveNext = enumerable2.MoveNext = ss.MoveNext());
enumerable2.Barrier = enumerable1.Barrier = barrier;
using var t1 = Task.Run(() => coenumerable1(enumerable1));
using var t2 = Task.Run(() => coenumerable2(enumerable2));
return resultSelector(t1.Result, t2.Result);
}
The GitHub repo has a couple of examples of using the above code, and some brief design discussion (including limitations).
Related
Running the following code
foreach(var i in
Observable
.Range(1, 3)
.Do(Console.WriteLine)
.ToEnumerable())
Console.WriteLine("Fin:" + i);
I'm getting this output:
1
2
3
Fin:1
Fin:2
Fin:3
The question is: why does ToEnumerable cache all the values and provide them only after the source sequence completes?
Is it related somehow to "leaving the monad"?
If you dig into the source code of the Rx library you'll see that the ToEnumerable operator is implemented basically like this:
public static IEnumerable<T> ToEnumerable<T>(this IObservable<T> source)
{
using var enumerator = new GetEnumerator<T>();
enumerator.Run(source);
while (enumerator.MoveNext()) yield return enumerator.Current;
}
...where the GetEnumerator<T> is a class defined in this file. This class is an IEnumerator<T> and an IObserver<T>. It has an internal _queue (ConcurrentQueue<T>) where it stores the received items. The most interesting methods are the Run, OnNext and MoveNext:
public IEnumerator<T> Run(IObservable<T> source)
{
_subscription.Disposable = source.Subscribe(this);
return this;
}
public void OnNext(T value)
{
_queue.Enqueue(value);
_gate.Release();
}
public bool MoveNext()
{
_gate.Wait();
if (_queue.TryDequeue(out _current)) return true;
_error?.Throw();
return false;
}
In your code, when you start the foreach loop, the Run method runs, and the Range+Do sequence is subscribed. This sequence emits all its elements during the subscription. The OnNext method is invoked for each emitted element, so all elements are enqueued inside the _queue. After the Run method completes, the while loop dequeues and yields the queued elements. That's why you see all the side effects of the Do operator happening before any iteration of your foreach loop.
The Rx library includes another operator similar to the ToEnumerable, the Next operator, with this signature:
// Returns an enumerable sequence whose enumeration blocks until the next element
// in the source observable sequence becomes available. Enumerators on the resulting
// sequence will block until the next element becomes available.
public static IEnumerable<T> Next<T>(this IObservable<T> source);
According to my experiments this operator doesn't do what you want either.
ToEnumerable doesn't wait for the observable to complete; it just happens that the observable completes synchronously in this case. Observable.Interval shows this:
var enumerable =
Observable
.Interval(TimeSpan.FromMilliseconds(100))
.Take(3)
.Do(i => Console.WriteLine("obs: {0}", i))
.ToEnumerable();
foreach (int value in enumerable)
{
Console.WriteLine(value);
}
This question already has answers here:
Is there an IEnumerable implementation that only iterates over it's source (e.g. LINQ) once?
(4 answers)
Closed 3 years ago.
I've been playing with Lists and Enumerables and I think I understand the basics:
Enumerable: The elements are evaluated each time they are consumed.
List: The elements are evaluated on definition and are not reevaluated at any point.
I've done some tests:
Enumerable: https://www.tutorialspoint.com/tpcg.php?p=bs75zCKL
List: https://www.tutorialspoint.com/tpcg.php?p=PpyY2iif
SingleEvaluationEnum: https://www.tutorialspoint.com/tpcg.php?p=209Ciiy7
Starting with the Enumerable example:
var myList = new List<int>() { 1, 2, 3, 4, 5, 6 };
var myEnumerable = myList.Where(p =>
{
Console.Write($"{p} ");
return p > 2;
}
);
Console.WriteLine("");
Console.WriteLine("Starting");
myEnumerable.First();
Console.WriteLine("");
myEnumerable.Skip(1).First();
The output is:
Starting
1 2 3
1 2 3 4
If we add .ToList() after the .Where(...) then the output is:
1 2 3 4 5 6
Starting
I also was able to have a bit of both worlds with this class:
class SingleEvaluationEnum<T>
{
private IEnumerable<T> Enumerable;
public SingleEvaluationEnum(IEnumerable<T> enumerable)
=> Enumerable = enumerable;
public IEnumerable<T> Get()
{
if (!(Enumerable is List<T>))
Enumerable = Enumerable.ToList().AsEnumerable();
return Enumerable;
}
}
You can see the output is:
Starting
1 2 3 4 5 6
This way the evaluation is deferred until the first consumption and is not re-evaluated in the next ones. But the whole list is evaluated.
My question is: Is there a way to get this output?
Starting
1 2 3
4
In other words: I want myEnumerable.First() to evaluate only the necessary elements, but no more. And I want myEnumerable.Skip(1).First() to reuse the already evaluated elements.
EDIT: Clarification: I want that any "query" over the Enumerable applies to all the elements in the list. That's why (AFAIK) an Enumerator doesn't work.
Thanks!
LINQ is fundamentally a functional approach to working with collections. One of the assumptions is that there are no side-effects to evaluating the functions. You're violating that assumption by calling Console.Write in the function.
There's no magic involved, just functions. IEnumerable has just one method - GetEnumerator. That's all that is needed for LINQ, and that's all that LINQ really does. For example, a naïve implementation of Where would look like this:
public static IEnumerable<T> Where<T>(this IEnumerable<T> @this, Func<T, bool> filter)
{
foreach (var item in @this)
{
if (filter(item)) yield return item;
}
}
A Skip might look like this:
public static IEnumerable<T> Skip<T>(this IEnumerable<T> @this, int skip)
{
foreach (var item in @this)
{
if (skip-- > 0) continue;
yield return item;
}
}
That's all there is to it. It doesn't have any information about what IEnumerable is or represents. In fact, that's the whole point - you're abstracting those details away. There are a few optimizations in those methods, but they don't do anything smart. In the end, the difference between the List and IEnumerable in your example isn't anything fundamental - it's that myEnumerable.Skip(1) has side-effects (because myEnumerable itself has side-effects) while myList.Skip(1) doesn't. But both do the exact same thing - evaluate the enumerable, item by item. There's no other method than GetEnumerator on an enumerable, and IEnumerator only has Current and MoveNext (of those that matter for us).
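To make the "only GetEnumerator, MoveNext and Current" point concrete, here is a rough sketch of what a foreach loop over an IEnumerable<int> expands to. The class and method names (ForeachExpansion, SumByHand) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class ForeachExpansion
{
    // Roughly what the compiler generates for:
    //   foreach (int item in source) total += item;
    public static int SumByHand(IEnumerable<int> source)
    {
        int total = 0;
        using (IEnumerator<int> e = source.GetEnumerator())
        {
            while (e.MoveNext())
            {
                total += e.Current;
            }
        }
        return total;
    }
}
```

Every LINQ operator ultimately drives its source through this same loop, which is why any side effects in the source run again each time a new enumerator is requested.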
LINQ is immutable. That's one of the reasons why it's so useful. This allows you to do exactly what you're doing - query the same enumerable twice but getting the exact same result. But you're not happy with that. You want things to be mutable. Well, nothing is stopping you from making your own helper functions. LINQ is just a bunch of functions, after all - you can make your own.
One such simple extension could be a memoized enumerable. Wrap around the source enumerable, create a list internally, and when you iterate over the source enumerable, keep adding items to the list. The next time GetEnumerator is called, start iterating over your internal list. When you reach the end, continue with the original approach - iterate over the source enumerable and keep adding to the list.
This will allow you to use LINQ fully, just inserting Memoize() to your LINQ queries at the places where you want to avoid iterating over the source multiple times. In your example, this would be something like:
myEnumerable = myEnumerable.Memoize();
Console.WriteLine("");
Console.WriteLine("Starting");
myEnumerable.First();
Console.WriteLine("");
myEnumerable.Skip(1).First();
The first call to myEnumerable.First() will iterate through the first three items in myList, and the second will only work with the fourth.
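Memoize() is not part of LINQ itself, so it has to be written (or taken from a library). What follows is a minimal single-threaded sketch of the caching scheme described above; the names MemoizedEnumerable, MemoizeExtensions and Demo are assumptions made for this example, and the code is not thread-safe.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Sketch only: not thread-safe, and the source enumerator is disposed
// only once the source has been read to the end.
sealed class MemoizedEnumerable<T> : IEnumerable<T>
{
    private readonly List<T> _cache = new List<T>();
    private readonly IEnumerable<T> _source;
    private IEnumerator<T> _enumerator;   // created lazily on first use
    private bool _exhausted;

    public MemoizedEnumerable(IEnumerable<T> source) => _source = source;

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            if (index < _cache.Count)
            {
                // Replay an item that this or an earlier pass already pulled.
                yield return _cache[index++];
                continue;
            }
            if (_exhausted) yield break;

            if (_enumerator == null) _enumerator = _source.GetEnumerator();
            if (_enumerator.MoveNext())
            {
                _cache.Add(_enumerator.Current);
            }
            else
            {
                _exhausted = true;
                _enumerator.Dispose();
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

static class MemoizeExtensions
{
    public static IEnumerable<T> Memoize<T>(this IEnumerable<T> source)
        => new MemoizedEnumerable<T>(source);

    // Usage mirroring the question: returns how many times the predicate ran.
    public static int Demo()
    {
        int calls = 0;
        var seq = new List<int> { 1, 2, 3, 4, 5, 6 }
            .Where(p => { calls++; return p > 2; })
            .Memoize();
        seq.First();          // evaluates 1, 2, 3
        seq.Skip(1).First();  // replays 3 from the cache, evaluates only 4
        return calls;
    }
}
```

The cache grows with every item pulled, so this trades memory for avoiding re-evaluation; it does not help when the whole source cannot fit in memory.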
Basically it sounds like you're looking for an Enumerator, which you can get by calling GetEnumerator on an IEnumerable. An Enumerator keeps track of its position.
var myList = new List<int>() { 1, 2, 3, 4, 5, 6 };
var myEnumerator = myList.Where(p =>
{
Console.Write($"{p} ");
return p > 2;
}
).GetEnumerator();
Console.WriteLine("Starting");
myEnumerator.MoveNext();
Console.WriteLine("");
myEnumerator.MoveNext();
This will get you the output:
Starting
1 2 3
4
Edit to respond to your comment:
First of all this sounds like an extremely bad idea. An enumerator represents something that can be enumerated. This is why you can pipe all those fancy LINQ queries on top of it. However all calls to First "visualize" this enumeration (which results in GetEnumerator being called to get an Enumerator and walking over that until we're done and then disposing it). You however ask for every visualization to change the IEnumerable it's visualizing (this is not good practice).
However since you said this is for learning I'll give you code that ends up with an IEnumerable that will give you your desired output. I would not recommend you ever use this in real code, this is not a good and solid way of doing things.
First we create a custom Enumerator that doesn't dispose, but just keeps enumerating some internal enumerator:
public class CustomEnumerator<T> : IEnumerator<T>
{
private readonly IEnumerator<T> _source;
public CustomEnumerator(IEnumerator<T> source)
{
_source = source;
}
public T Current => _source.Current;
object IEnumerator.Current => _source.Current;
public void Dispose()
{
}
public bool MoveNext()
{
return _source.MoveNext();
}
public void Reset()
{
throw new NotImplementedException();
}
}
Then we create a custom IEnumerable class that, instead of creating a new Enumerator every time GetEnumerator() is called, secretly keeps using the same enumerator:
public class CustomEnumerable<T> : IEnumerable<T>
{
public CustomEnumerable(IEnumerable<T> source)
{
_internalEnumerator = new CustomEnumerator<T>(source.GetEnumerator());
}
private IEnumerator<T> _internalEnumerator;
public IEnumerator<T> GetEnumerator()
{
return _internalEnumerator;
}
IEnumerator IEnumerable.GetEnumerator()
{
return _internalEnumerator;
}
}
And finally we create an IEnumerable extension method to convert an IEnumerable into our CustomEnumerable:
public static class IEnumerableExtensions
{
public static IEnumerable<T> ToTrackingEnumerable<T>(this IEnumerable<T> source) => new CustomEnumerable<T>(source);
}
Finally, we can now do this:
var myList = new List<int>() { 1, 2, 3, 4, 5, 6 };
var myEnumerable = myList.Where(p =>
{
Console.Write($"{p} ");
return p > 2;
}).ToTrackingEnumerable();
Console.WriteLine("Starting");
var first = myEnumerable.First();
Console.WriteLine("");
var second = myEnumerable.Where(p => p % 2 == 1).First();
Console.WriteLine("");
I changed the last part to show that we can still use LINQ on it. The output is now:
Starting
1 2 3
4 5
So the topic is the question.
I get that the AsParallel method returns a ParallelQuery<TSource> wrapper that uses the same LINQ keywords, but from System.Linq.ParallelEnumerable instead of System.Linq.Enumerable.
That's clear enough, but when I look into the decompiled sources, I don't understand how it works.
Let's begin with the easiest extension, the Sum() method. Code:
[__DynamicallyInvokable]
public static int Sum(this ParallelQuery<int> source)
{
if (source == null)
throw new ArgumentNullException("source");
else
return new IntSumAggregationOperator((IEnumerable<int>) source).Aggregate();
}
That's clear; let's go to the Aggregate() method. It's a wrapper around the InternalAggregate method that traps some exceptions. Now let's take a look at it.
protected override int InternalAggregate(ref Exception singularExceptionToThrow)
{
using (IEnumerator<int> enumerator = this.GetEnumerator(new ParallelMergeOptions?(ParallelMergeOptions.FullyBuffered), true))
{
int num = 0;
while (enumerator.MoveNext())
checked { num += enumerator.Current; }
return num;
}
}
And here is the question: how does it work? I see no concurrency safety for a variable modified by many threads; we see only an iterator and a summation. Is it a magic enumerator? Or how does it work? GetEnumerator() returns QueryOpeningEnumerator<TOutput>, but its code is too complicated.
Finally, in my second PLINQ assault, I found an answer. And it's pretty clear.
The point is that the enumerator is not a simple one. It's a special multithreading one. So how does it work? The answer is that the enumerator doesn't return the next value of the source; it returns the whole sum of the next partition. So this code is only executed 2, 4, 6, 8... times (based on Environment.ProcessorCount), while the actual summation work is performed inside enumerator.MoveNext, in the enumerator.OpenQuery method.
So TPL evidently partitions the source enumerable, then sums each partition independently, and then sums the partial results; see IntSumAggregationOperatorEnumerator<TKey>. No magic here, I just had to dig deeper.
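The "sum each partition, then sum the partial sums" idea can be illustrated with a small hand-rolled sketch. This is not PLINQ's actual implementation; PartitionedSum and its fixed-size chunking are assumptions made purely for illustration.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class PartitionedSum
{
    // Hypothetical illustration: each task sums its own slice into a local
    // variable, and the partial sums are combined on a single thread.
    public static int Sum(int[] data, int partitions)
    {
        int chunk = (data.Length + partitions - 1) / partitions;
        Task<int>[] workers = Enumerable.Range(0, partitions)
            .Select(p => Task.Run(() =>
            {
                int start = p * chunk;
                int end = Math.Min(start + chunk, data.Length);
                int local = 0;   // private to this task, so no locking is needed
                for (int i = start; i < end; i++) checked { local += data[i]; }
                return local;
            }))
            .ToArray();

        // The only summation over shared results happens here, sequentially.
        int total = 0;
        foreach (Task<int> w in workers) checked { total += w.Result; }
        return total;
    }
}
```

The final loop plays the role of the InternalAggregate loop shown in the question: it looks single-threaded because the parallel work has already been done by the time each value arrives.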
The Sum operator aggregates all values in a single thread. There is no multi-threading here. The trick is that multi-threading is happening somewhere else.
The PLINQ Sum method can handle PLINQ enumerables. Those enumerables could be built up using other constructs (such as Where) that allow a collection to be processed over multiple threads.
The Sum operator is always the last operator in a chain. Although it is possible to process this sum over multiple threads, the TPL team probably found out that this had a negative impact on performance, which is reasonable, since the only thing this method has to do is a simple integer addition.
So this method takes all results as they become available from other threads, processes them on a single thread, and returns that value. The real trick is in other PLINQ extension methods.
protected override int InternalAggregate(ref Exception singularExceptionToThrow)
{
using (IEnumerator<int> enumerator = this.GetEnumerator(new ParallelMergeOptions? (ParallelMergeOptions.FullyBuffered), true))
{
int num = 0;
while (enumerator.MoveNext())
checked { num += enumerator.Current; }
return num;
}
}
This code won't be executed in parallel; the while loop will execute its inner scope sequentially.
Try this instead
List<int> list = new List<int>();
int num = 0;
Parallel.ForEach(list, (item) =>
{
checked { num += item; }
});
The inner action will be spread on the ThreadPool and the ForEach statement will be complete when all items are handled.
Here you need threadsafety:
List<int> list = new List<int>();
int num = 0;
Parallel.ForEach(list, (item) =>
{
Interlocked.Add(ref num, item);
});
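Calling Interlocked.Add once per item works, but every iteration then contends on the same variable. A possible alternative, sketched here under the assumption that a plain int sum is all that is needed, is the Parallel.ForEach overload that takes per-worker local state; LocalSum is a made-up name.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class LocalSum
{
    public static int Sum(IEnumerable<int> items)
    {
        int total = 0;
        Parallel.ForEach<int, int>(
            items,
            () => 0,                                    // per-worker starting value
            (item, state, local) => local + item,       // touches no shared state
            local => Interlocked.Add(ref total, local)  // one interlocked add per worker
        );
        return total;
    }
}
```

Each worker accumulates into its own local value, so synchronization happens only when a worker finishes its share of the items.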
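Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
public int Chickens;
public int Goats;
public AnimalCount(int chickens, int goats) { Chickens = chickens; Goats = goats; }
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
....
yield return new AnimalCount(x, y);
....
}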
You also have two functions that consume two separate IEnumerables, for example:
ConsumeChicken(IEnumerable<int>);
ConsumeGoat(IEnumerable<int>);
How can you call ConsumeChicken and ConsumeGoat without a) converting FarmsInEachPen() with ToList() beforehand, because it might have two zillion records, and b) without multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoat(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithread, but it gets unnecessarily complicated with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable being fed into ConsumeChicken and each time the iterator is used, it internally calls ConsumeGoat, thus executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once..
I don't think there is a way to do what you want, since ConsumeChickens(IEnumerable<int>) and ConsumeGoats(IEnumerable<int>) are being called sequentially, each of them enumerating a list separately - how do you expect that to work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (which each consume a single item), and call them in alternation. Like this:
foreach(var animal in animals)
{
ConsumeChicken(animal.Chickens);
ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
Also, a note: depending on your LINQ-provider and what exactly it is you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using linq-to-sql or linq-to-entities, the following query..
from a in animals
group a by 0 into g
select new
{
TotalChickens = g.Sum(x => x.Chickens),
TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you can GetEnumerator to the front of the sequence and then repeatedly ask "Give me the next item" (MoveNext/Current). You can't, on one thread, have two different things pulling from the animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above is assuming that ConsumeChickens and ConsumeGoats are both returning void or are at least not returning IEnumerable<T> themselves - which seems like an obvious assumption. I'd appreciate it if the lame downvoter would actually comment.
Actually, the simplest way to achieve what you want is to convert the FarmsInEachPen return value to a push collection (IObservable) and use Reactive Extensions to work with it:
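// Requires the System.Reactive package (using System; using System.Reactive.Subjects;).
var observable = new Subject<AnimalCount>();
// Subscribe (rather than Do) so that the handlers actually run on each OnNext.
observable.Subscribe(x => DoSomethingWithChicken(x.Chickens));
observable.Subscribe(x => DoSomethingWithGoat(x.Goats));
foreach (var item in FarmsInEachPen())
{
observable.OnNext(item);
}
observable.OnCompleted();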
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
foreach (var x in xList)
{
Console.WriteLine("X: " + x);
yield return null;
}
}
private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
foreach (var y in yList)
{
Console.WriteLine("Y: " + y);
yield return null;
}
}
private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
bool c = i != 0 || enumerator.MoveNext();
while (c)
{
if (i == 0)
{
yield return enumerator.Current.Chickens;
c = enumerator.MoveNext();
}
else
{
yield return enumerator.Current.Goats;
}
}
}
private static void Main(string[] args)
{
var enumerator = GetAnimals().GetEnumerator();
var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
temp.ToList();
Console.WriteLine("Total iterations: " + iterations);
}
In LINQ, Where is a streaming operator, whereas OrderByDescending is a non-streaming operator. AFAIK, a streaming operator only gathers the next item that is necessary. A non-streaming operator evaluates the entire data stream at once.
I fail to see the relevance of defining a Streaming Operator. To me, it is redundant with Deferred Execution. Take the example where I have written a custom extension and consumed it using the Where operator and OrderByDescending.
public static class ExtensionStuff
{
public static IEnumerable<int> Where(this IEnumerable<int> sequence, Func<int, bool> predicate)
{
foreach (int i in sequence)
{
if (predicate(i))
{
yield return i;
}
}
}
}
public static void Main()
{
TestLinq3();
}
private static void TestLinq3()
{
int[] items = { 1, 2, 3,4 };
var selected = items.Where(i => i < 3)
.OrderByDescending(i => i);
Write(selected);
}
private static void Write(IEnumerable<int> selected)
{
foreach(var i in selected)
Console.WriteLine(i);
}
In either case, Where needs to evaluate each element in order to determine which elements meet the condition. The fact that it yields seems to only become relevant because the operator gains deferred execution.
So, what is the importance of Streaming Operators?
There are two aspects: speed and memory.
The speed aspect becomes more apparent when you use a method like .Take() to only consume a portion of the original result set.
// Consumes ten elements, yields 5 results.
Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.Take(5)
.ToList();
// Consumes one million elements, yields 5 results.
Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.OrderByDescending(i => i)
.Take(5)
.ToList();
Because the first example uses only streaming operators before the call to Take, you only end up yielding values 1 through 10 before Take stops evaluating. Furthermore, only one value is loaded into memory at a time, so you have a very small memory footprint.
In the second example, OrderByDescending is not streaming, so the moment Take pulls the first item, the entire result that's passed through the Where filter has to be placed in memory for sorting. This could take a long time and produce a big memory footprint.
Even if you weren't using Take, the memory issue can be important. For example:
// Puts half a million elements in memory, sorts, then outputs them.
var numbers = Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.OrderByDescending(i => i);
foreach(var number in numbers) Console.WriteLine(number);
// Puts one element in memory at a time.
var numbers = Enumerable.Range(1, 1000000).Where(i => i % 2 == 0);
foreach(var number in numbers) Console.WriteLine(number);
The fact that it yields seems to only become relevant because the
operator gains deferred execution.
So, what is the importance of Streaming Operators?
For one thing, you could not process infinite sequences with buffering / non-streaming extension methods, while you can "run" such a sequence (until you abort) just fine using only streaming extension methods.
Take for example this method:
public IEnumerable<int> GetNumbers(int start)
{
int num = start;
while(true)
{
yield return num;
num++;
}
}
You can use Where just fine:
foreach (var num in GetNumbers(0).Where(x => x % 2 == 0))
{
Console.WriteLine(num);
}
OrderBy() would not work in this case since it would have to exhaustively enumerate the results before emitting a single number.
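As a small sketch of the same point, a chain made only of streaming operators can even terminate on an infinite source once a downstream operator stops pulling. InfiniteDemo and FirstFiveEven are illustrative names; GetNumbers is repeated here only so the block is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class InfiniteDemo
{
    // Same shape as GetNumbers above: never terminates on its own.
    public static IEnumerable<int> GetNumbers(int start)
    {
        int num = start;
        while (true)
        {
            yield return num;
            num++;
        }
    }

    // Where and Take both stream, so only as many source items are pulled
    // as are needed to produce five results.
    public static List<int> FirstFiveEven()
    {
        return GetNumbers(0).Where(x => x % 2 == 0).Take(5).ToList();
    }
}
```

Inserting OrderBy anywhere before Take in that chain would make the call never return, because sorting has to see the last element before it can emit the first.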
Just to be explicit; in the case you mentioned there's no advantage to the fact that where streams, since the orderby sucks the whole thing in anyway. There are however times where the advantages of streaming are used (other answers/comments have given examples), so all LINQ operators stream to the best of their ability. Orderby streams as much as it can, which happens to be not very much. Where streams very effectively.