So the title is the question.
I understand that AsParallel returns a ParallelQuery<TSource> wrapper that uses the same LINQ keywords, but resolves them against System.Linq.ParallelEnumerable instead of System.Linq.Enumerable.
That's clear enough, but when I look at the decompiled sources, I don't understand how it works.
Let's begin with the simplest extension, the Sum() method. Code:
[__DynamicallyInvokable]
public static int Sum(this ParallelQuery<int> source)
{
if (source == null)
throw new ArgumentNullException("source");
else
return new IntSumAggregationOperator((IEnumerable<int>) source).Aggregate();
}
That's clear, so let's go to the Aggregate() method. It's a wrapper around the InternalAggregate method that traps some exceptions. Now let's take a look at it.
protected override int InternalAggregate(ref Exception singularExceptionToThrow)
{
using (IEnumerator<int> enumerator = this.GetEnumerator(new ParallelMergeOptions?(ParallelMergeOptions.FullyBuffered), true))
{
int num = 0;
while (enumerator.MoveNext())
checked { num += enumerator.Current; }
return num;
}
}
And here is the question: how does it work? I see no thread safety for a variable that is modified by many threads; I see only an iterator and a summing loop. Is it a magic enumerator? Or how does it work? GetEnumerator() returns QueryOpeningEnumerator<TOutput>, but its code is too complicated.
Finally, in my second PLINQ assault, I found an answer. And it's pretty clear.
The point is that the enumerator is not a simple one; it is a special multithreaded enumerator. It doesn't return the next value of the source, it returns the whole sum of the next partition. So the loop above executes only 2, 4, 6, 8... times (based on Environment.ProcessorCount), while the actual summation work is performed inside enumerator.MoveNext, in the enumerator.OpenQuery method.
So TPL partitions the source enumerable, sums each partition independently, and then adds up the partial sums; see IntSumAggregationOperatorEnumerator<TKey>. No magic here, I just had to dig deeper.
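The partition-then-combine idea can be sketched by hand. This is only an illustration under simplifying assumptions, not the actual PLINQ implementation: the class name PartitionedSumSketch, the array input, and the fixed contiguous chunking are all hypothetical.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class PartitionedSumSketch
{
    // Hypothetical helper, NOT the PLINQ code: splits the array into one
    // contiguous chunk per partition and sums each chunk on its own task.
    public static int Sum(int[] source, int partitionCount)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (partitionCount < 1) throw new ArgumentOutOfRangeException(nameof(partitionCount));

        int chunkSize = (source.Length + partitionCount - 1) / partitionCount;

        // Each task writes only to its own local variable, so no locking is needed here.
        Task<int>[] partials = Enumerable.Range(0, partitionCount)
            .Select(p => Task.Run(() =>
            {
                int start = p * chunkSize;
                int end = Math.Min(start + chunkSize, source.Length);
                int partial = 0;
                for (int i = start; i < end; i++)
                    checked { partial += source[i]; }
                return partial;
            }))
            .ToArray();

        // This final loop runs on a single thread and executes only
        // partitionCount times, much like the loop in InternalAggregate.
        int total = 0;
        foreach (Task<int> t in partials)
            checked { total += t.Result; }
        return total;
    }
}
```

Calling PartitionedSumSketch.Sum(data, Environment.ProcessorCount) mirrors the "2, 4, 6, 8... iterations" observation: the sequential loop only ever sees one partial sum per partition.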
The Sum operator aggregates all values in a single thread. There is no multi-threading here. The trick is that multi-threading is happening somewhere else.
The PLINQ Sum method can handle PLINQ enumerables. Those enumerables could be built up using other constructs (such as Where) that allow a collection to be processed over multiple threads.
The Sum operator is always the last operator in a chain. Although it is possible to process this sum over multiple threads, the TPL team probably found out that this had a negative impact on performance, which is reasonable, since the only thing this method has to do is a simple integer addition.
So this method takes all results as they become available from other threads, processes them on a single thread, and returns that value. The real trick is in other PLINQ extension methods.
protected override int InternalAggregate(ref Exception singularExceptionToThrow)
{
using (IEnumerator<int> enumerator = this.GetEnumerator(new ParallelMergeOptions? (ParallelMergeOptions.FullyBuffered), true))
{
int num = 0;
while (enumerator.MoveNext())
checked { num += enumerator.Current; }
return num;
}
}
This code won't be executed in parallel; the while loop executes its body sequentially.
Try this instead
List<int> list = new List<int>();
int num = 0;
Parallel.ForEach(list, (item) =>
{
checked { num += item; }
});
The inner action will be spread across the ThreadPool, and the ForEach call completes when all items have been handled.
Because num is then updated from several threads at once, you need thread safety here:
List<int> list = new List<int>();
int num = 0;
Parallel.ForEach(list, (item) =>
{
Interlocked.Add(ref num, item);
});
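As a possible refinement, Parallel.ForEach has an overload that takes a thread-local accumulator, so Interlocked.Add only runs once per worker instead of once per item. The sketch below assumes that overload (localInit, body, localFinally); the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class LocalSumSketch
{
    // Hypothetical helper: sums with one private subtotal per worker.
    public static int Sum(IEnumerable<int> list)
    {
        int total = 0;
        Parallel.ForEach<int, int>(
            list,
            () => 0,                                            // localInit: each worker starts at 0
            (item, loopState, local) => checked(local + item),  // body: no shared state is touched
            local => Interlocked.Add(ref total, local));        // localFinally: merge once per worker
        return total;
    }
}
```

The shared variable is still updated atomically, but the contention drops from one interlocked operation per element to one per worker.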
Related
Is it possible to write a higher-order function that causes an IEnumerable to be consumed multiple times but in only one pass and without reading all the data into memory? [See Edit below for a clarification of what I'm looking for.]
For example, in the code below the enumerable is mynums (onto which I've tagged a .Trace() in order to see how many times we enumerate it). The goal is to figure out whether it has any numbers greater than 5, as well as the sum of all of the numbers. Both_TwoPass processes the enumerable with two functions, but it enumerates it twice. In contrast, Both_NonStream enumerates it only once, but at the expense of reading it all into memory. In principle it is possible to carry out both of these tasks in a single pass and in a streaming fashion, as shown by Any5Sum, but that is a specific solution. Is it possible to write a function with the same signature as Both_* that is the best of both worlds?
(It seems to me that this should be possible using threads. Is there a better solution using, say, async?)
Edit
Below is a clarification regarding what I'm looking for. What I've done is included a very down-to-earth description of each property in square brackets.
I'm looking for a function Both with the following characteristics:
It has signature (S1, S2) Both<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1>, Func<IEnumerable<T>, S2>) (and produces the "right" output!)
It only iterates the first argument, tt, once. [What I mean by this is that when passed mynums (as defined below) it only outputs mynums: 0 1 2 ... once. This precludes function Both_TwoPass.]
It processes the data from the first argument, tt, in a streaming fashion. [What I mean by this is that, for example, there is insufficient memory to store all the items from tt in memory simultaneously, thus precluding function Both_NonStream.]
using System;
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApp
{
static class Extensions
{
public static IEnumerable<T> Trace<T>(this IEnumerable<T> tt, string msg = "")
{
Console.Write(msg);
try
{
foreach (T t in tt)
{
Console.Write(" {0}", t);
yield return t;
}
}
finally
{
Console.WriteLine('.');
}
}
public static (S1, S2) Both_TwoPass<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> f1, Func<IEnumerable<T>, S2> f2)
{
return (f1(tt), f2(tt));
}
public static (S1, S2) Both_NonStream<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> f1, Func<IEnumerable<T>, S2> f2)
{
var tt2 = tt.ToList();
return (f1(tt2), f2(tt2));
}
public static (bool, int) Any5Sum(this IEnumerable<int> ii)
{
int sum = 0;
bool any5 = false;
foreach (int i in ii)
{
sum += i;
any5 |= i > 5; // or: if (!any5) any5 = i > 5;
}
return (any5, sum);
}
}
class Program
{
static void Main()
{
var mynums = Enumerable.Range(0, 10).Trace("mynums:");
Console.WriteLine("TwoPass: (any > 5, sum) = {0}", mynums.Both_TwoPass(tt => tt.Any(k => k > 5), tt => tt.Sum()));
Console.WriteLine("NonStream: (any > 5, sum) = {0}", mynums.Both_NonStream(tt => tt.Any(k => k > 5), tt => tt.Sum()));
Console.WriteLine("Manual: (any > 5, sum) = {0}", mynums.Any5Sum());
}
}
}
The way you've written your computation model (i.e. return (f1(tt), f2(tt))) there is no way to avoid multiple iterations of your enumerable. You're basically saying compute Item1 then compute Item2.
You have to either change the model from (Func<IEnumerable<T>, S1>, Func<IEnumerable<T>, S2>) to (Func<T, S1>, Func<T, S2>) or to Func<IEnumerable<T>, (S1, S2)> to be able to run the computations in parallel.
Your implementation of Any5Sum is basically the second approach (Func<IEnumerable<T>, (S1, S2)>). But there's already a built-in method for that.
Try this:
Console.WriteLine("Aggregate: (any > 5, sum) = {0}",
mynums
.Aggregate<int, (bool any5, int sum)>(
(false, 0),
(a, x) => (a.any5 | x > 5, a.sum + x)));
I think you and I are describing the same thing in the comments. There is no need to create such a "special-purpose IEnumerable", though, because the BlockingCollection<> class already exists for such producer-consumer scenarios. You'd use it as follows...
Create a BlockingCollection<> for each consuming function (i.e. tt1 and tt2).
By default, a BlockingCollection<> wraps a ConcurrentQueue<>, so the elements will arrive in FIFO order.
To satisfy your requirement that only one element be held in memory at a time, 1 will be specified for the bounded capacity. Note that this capacity is per collection, so with two collections there will be up to two queued elements at any given moment.
Each collection will hold the input elements for that consumer.
Create a thread/task for each consuming function.
The thread/task will simply call GetConsumingEnumerable() for its input collection, pass the resulting IEnumerable<> to its consuming function, and return that result.
GetConsumingEnumerable() does just as its name implies: it creates an IEnumerable<> that consumes (removes) elements from the collection. If the collection is empty, enumeration will block until an element is added. CompleteAdding() is called once the producer is finished, which allows the consuming enumerator to exit when the collection empties.
The producer enumerates the IEnumerable<>, tt, and adds each element to both collections. This is the only time that tt is enumerated.
BlockingCollection<>.Add() will block if the collection has reached its capacity, preventing the entirety of tt from being buffered in-memory.
Once tt has been fully enumerated, CompleteAdding() is called on each collection.
Once each consumer thread/task has completed, their results are returned.
Here's what that looks like in code...
public static (S1, S2) Both<T, S1, S2>(this IEnumerable<T> tt, Func<IEnumerable<T>, S1> tt1, Func<IEnumerable<T>, S2> tt2)
{
const int MaxQueuedElementsPerCollection = 1;
using (BlockingCollection<T> collection1 = new BlockingCollection<T>(MaxQueuedElementsPerCollection))
using (Task<S1> task1 = StartConsumerTask(collection1, tt1))
using (BlockingCollection<T> collection2 = new BlockingCollection<T>(MaxQueuedElementsPerCollection))
using (Task<S2> task2 = StartConsumerTask(collection2, tt2))
{
foreach (T element in tt)
{
collection1.Add(element);
collection2.Add(element);
}
// Inform any enumerators created by .GetConsumingEnumerable()
// that there will be no more elements added.
collection1.CompleteAdding();
collection2.CompleteAdding();
// Accessing the Result property blocks until the Task<> is complete.
return (task1.Result, task2.Result);
}
Task<S> StartConsumerTask<S>(BlockingCollection<T> collection, Func<IEnumerable<T>, S> func)
{
return Task.Run(() => func(collection.GetConsumingEnumerable()));
}
}
Note that, for efficiency's sake, you could increase MaxQueuedElementsPerCollection to, say, 10 or 100 so that the consumers don't have to run in lock-step with each other.
There is one problem with this code, though. When a collection is empty the consumer has to wait for the producer to produce an element, and when a collection is full the producer has to wait for the consumer to consume an element. Consider what happens mid-way through the execution of your tt => tt.Any(k => k > 5) lambda...
The producer waits for the collection to be non-full and adds 5.
The consumer waits for the collection to be non-empty and removes 5.
5 > 5 returns false and enumeration continues.
The producer waits for the collection to be non-full and adds 6.
The consumer waits for the collection to be non-empty and removes 6.
6 > 5 returns true and enumeration stops. Any(), the lambda, and the consumer task all return.
The producer waits for the collection to be non-full and adds 7.
The producer waits for the collection to be non-full and...that never happens!
The consumer has already abandoned the enumeration, so it won't consume any elements to make room for the new one. Add() will never return.
The cleanest way I could come up with to prevent this deadlock is to ensure the entire collection gets enumerated even if func doesn't do so. This just requires a simple change to the StartConsumerTask<>() local method...
Task<S> StartConsumerTask<S>(BlockingCollection<T> collection, Func<IEnumerable<T>, S> func)
{
return Task.Run(
() => {
try
{
return func(collection.GetConsumingEnumerable());
}
finally
{
// Prevent BlockingCollection<>.Add() calls from
// deadlocking by ensuring the entire collection gets
// consumed even if func abandoned its enumeration early.
foreach (T element in collection.GetConsumingEnumerable())
{
// Do nothing...
}
}
}
);
}
The downside of this is that tt will always be enumerated to completion, even if both tt1 and tt2 abandon their enumerators early.
With that addressed, this...
static void Main()
{
IEnumerable<int> mynums = Enumerable.Range(0, 10).Trace("mynums:");
Console.WriteLine("Both: (any > 5, sum) = {0}", mynums.Both(tt => tt.Any(k => k > 5), tt => tt.Sum()));
}
...outputs this...
mynums: 0 1 2 3 4 5 6 7 8 9.
Both: (any > 5, sum) = (True, 45)
The core problem here is who is responsible for calling IEnumerator.MoveNext() (e.g. by using a foreach loop). Synchronizing multiple foreach loops across threads would be slow and fiddly to get right.
Implementing IAsyncEnumerable<T>, so that multiple await foreach loops can take turns processing items, would be easier. But still silly.
So the simpler solution would be to change the question. Instead of trying to call multiple methods that both try to enumerate the items, change the interface to simply visit each item.
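A minimal sketch of that "visit each item" interface, assuming the consumers can be rewritten as per-item callbacks. The names VisitSketch, Visit and Any5Sum are hypothetical and only illustrate the idea:

```csharp
using System;
using System.Collections.Generic;

static class VisitSketch
{
    // Hypothetical helper: enumerates source exactly once and hands every
    // item to each visitor in turn, on the calling thread.
    public static void Visit<T>(IEnumerable<T> source, params Action<T>[] visitors)
    {
        foreach (T item in source)
            foreach (Action<T> visit in visitors)
                visit(item);
    }

    // The "any > 5" and "sum" computations expressed as two visitors.
    public static (bool any5, int sum) Any5Sum(IEnumerable<int> numbers)
    {
        bool any5 = false;
        int sum = 0;
        Visit<int>(numbers,
            k => any5 |= k > 5,
            k => sum += k);
        return (any5, sum);
    }
}
```

The trade-off is that the consumers no longer receive an IEnumerable<T>, so existing LINQ operators such as Any() and Sum() cannot be passed in unchanged.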
I believe it is possible to satisfy all the requirements of the question, and one more (very natural) requirement, namely that the original enumerable be only enumerated partially if each of the two Func<IEnumerable<T>, S> consume it partially.
(This was discussed by @BACON.) The approach is discussed in more detail in my GitHub repo 'CoEnumerable'. The idea is that the Barrier class provides a fairly straightforward way to implement a proxy IEnumerable which can be consumed by each of the Func<IEnumerable<T>, S>s while the proxy consumes the real IEnumerable just once. In particular, the implementation consumes only as much of the original enumerable as is absolutely necessary (i.e., it satisfies the extra requirement mentioned above).
The proxy is:
class BarrierEnumerable<T> : IEnumerable<T>
{
private Barrier barrier;
private bool moveNext;
private readonly Func<T> src;
public BarrierEnumerable(IEnumerator<T> enumerator)
{
src = () => enumerator.Current;
}
public Barrier Barrier
{
set => barrier = value;
}
public bool MoveNext
{
set => moveNext = value;
}
public IEnumerator<T> GetEnumerator()
{
try
{
while (true)
{
barrier.SignalAndWait();
if (moveNext)
{
yield return src();
}
else
{
yield break;
}
}
}
finally
{
barrier.RemoveParticipant();
}
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
in terms of which we can combine the two consumers
public static T Combine<S, T1, T2, T>(this IEnumerable<S> source,
Func<IEnumerable<S>, T1> coenumerable1,
Func<IEnumerable<S>, T2> coenumerable2,
Func<T1, T2, T> resultSelector)
{
using var ss = source.GetEnumerator();
var enumerable1 = new BarrierEnumerable<S>(ss);
var enumerable2 = new BarrierEnumerable<S>(ss);
using var barrier = new Barrier(2, _ => enumerable1.MoveNext = enumerable2.MoveNext = ss.MoveNext());
enumerable2.Barrier = enumerable1.Barrier = barrier;
using var t1 = Task.Run(() => coenumerable1(enumerable1));
using var t2 = Task.Run(() => coenumerable2(enumerable2));
return resultSelector(t1.Result, t2.Result);
}
The GitHub repo has a couple of examples of using the above code, and some brief design discussion (including limitations).
If I have two lists and I want to know whether they have at least one element in common, I have these two options:
lst1.Intersect(lst2).Any();
lst1.Any(x => lst2.Contains(x));
Both options give me the result that I expect, but I don't know which is the better one. Which is more efficient? And why?
Thanks.
EDIT: when I created this post, apart from the solution, I was looking for the reason. I know that I can run tests, but I wouldn't know the reason for the result. Is one faster than the other? Is one solution always better than the other?
For this reason, I have accepted Matthew's answer, not only for the test code, but also because he explains when one is better than the other and why. I also appreciate the contributions of Nicholas and Oren a lot.
Thanks.
Oren's answer has an error in the way the stopwatch is being used. It isn't being reset at the end of the loop after the time taken by Any() has been measured.
Note how it goes back to the start of the loop with the stopwatch never being Reset() so that the time that is added to intersect includes the time taken by Any().
Following is a corrected version.
A release build run outside any debugger gives this result on my PC:
Intersect: 1ms
Any: 6743ms
Note how I'm making two non-overlapping string lists for this test. Also note that this is a worst-case test.
Where there are many intersections (or intersections that happen to occur near the start of the data) then Oren is quite correct to say that Any() should be faster.
If the real data usually contains intersections then it's likely that it is better to use Any(). Otherwise, use Intersect(). It's very data dependent.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace Demo
{
class Program
{
void run()
{
double intersect = 0;
double any = 0;
Stopwatch stopWatch = new Stopwatch();
List<string> L1 = Enumerable.Range(0, 10000).Select(x => x.ToString()).ToList();
List<string> L2 = Enumerable.Range(10000, 10000).Select(x => x.ToString()).ToList();
for (int i = 0; i < 10; i++)
{
stopWatch.Restart();
Intersect(L1, L2);
stopWatch.Stop();
intersect += stopWatch.ElapsedMilliseconds;
stopWatch.Restart();
Any(L1, L2);
stopWatch.Stop();
any += stopWatch.ElapsedMilliseconds;
}
Console.WriteLine("Intersect: " + intersect + "ms");
Console.WriteLine("Any: " + any + "ms");
}
private static bool Any(List<string> lst1, List<string> lst2)
{
return lst1.Any(lst2.Contains);
}
private static bool Intersect(List<string> lst1, List<string> lst2)
{
return lst1.Intersect(lst2).Any();
}
static void Main()
{
new Program().run();
}
}
}
For comparative purposes, I wrote my own test comparing int sequences:
intersect took 00:00:00.0065928
any took 00:00:08.6706195
The code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace Demo
{
class Program
{
void run()
{
var lst1 = Enumerable.Range(0, 10000);
var lst2 = Enumerable.Range(10000, 10000);
int count = 10;
DemoUtil.Time(() => lst1.Intersect(lst2).Any(), "intersect", count);
DemoUtil.Time(() => lst1.Any(lst2.Contains), "any", count);
}
static void Main()
{
new Program().run();
}
}
static class DemoUtil
{
public static void Print(this object self)
{
Console.WriteLine(self);
}
public static void Print(this string self)
{
Console.WriteLine(self);
}
public static void Print<T>(this IEnumerable<T> self)
{
foreach (var item in self)
Console.WriteLine(item);
}
public static void Time(Action action, string title, int count)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < count; ++i)
action();
(title + " took " + sw.Elapsed).Print();
}
}
}
If I also time this for overlapping ranges by changing the lists to this and upping the count to 10000:
var lst1 = Enumerable.Range(10000, 10000);
var lst2 = Enumerable.Range(10000, 10000);
I get these results:
intersect took 00:00:03.2607476
any took 00:00:00.0019170
In this case Any() is clearly much faster.
Conclusion
The worst-case performance is very bad for Any() but acceptable for Intersect().
The best-case performance is extremely good for Any() and bad for Intersect().
(and best-case for Any() is probably worst-case for Intersect()!)
The Any() approach is O(N^2) in the worst case and O(1) in the best case.
The Intersect() approach is always O(N) (since it uses hashing, not sorting; otherwise it would be O(N log N)).
You must also consider the memory usage: the Intersect() method needs to take a copy of one of the inputs, whereas Any() doesn't.
Therefore to make the best decision you really need to know the characteristics of the real data, and actually perform tests.
If you really don't want the Any() to turn into an O(N^2) in the worst case, then you should use Intersect(). However, the chances are that you will be best off using Any().
And of course, most of the time none of this matters!
Unless you've discovered this part of the code to be a bottleneck, this is of merely academic interest. You shouldn't waste your time with this kind of analysis if there's no problem. :)
It depends on the implementation of your IEnumerables.
Your first try (Intersect/Any) finds all the matches and then determines whether the set is empty or not. From the documentation, this looks to be something like an O(n) operation:
When the object returned by this method is enumerated, Intersect enumerates first,
collecting all distinct elements of that sequence. It then enumerates [the]
second, marking those elements that occur in both sequences. Finally, the marked
elements are yielded in the order in which they were collected.
Your second try (Any/Contains) enumerates over the first collection, an O(n) operation, and for each item in the first collection, enumerates over the second, another O(n) operation, to see if a matching element is found. This makes it something like an O(n^2) operation, does it not? Which do you think might be faster?
One thing to consider, though, is that the Contains() lookup for certain collection or set types (e.g., dictionaries, binary trees or ordered collections that allow a binary search or hashtable lookup) might be a cheap operation if the Contains() implementation is smart enough to take advantage of the semantics of the collection upon which it is operating.
But you'll need to experiment with your collection types to find out which works better.
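To illustrate the point about cheap Contains() lookups, here is a small sketch that copies the second sequence into a HashSet<T> first, so that each lookup is O(1) on average while Any() can still stop at the first match. The helper name HasCommonElement is hypothetical, and the sketch assumes the elements have sensible Equals/GetHashCode implementations.

```csharp
using System.Collections.Generic;
using System.Linq;

static class OverlapSketch
{
    // Hypothetical helper: true if the two sequences share at least one element.
    public static bool HasCommonElement<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        // Building the set costs O(M) time and memory for M elements in second.
        var lookup = new HashSet<T>(second);

        // Any() stops at the first hit; each Contains() is a hash lookup,
        // so the worst case is O(N + M) rather than O(N * M).
        return first.Any(x => lookup.Contains(x));
    }
}
```

The cost is the extra memory for the set, so whether this beats a plain Any/Contains on a List<T> still depends on the data and should be measured.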
See Matthew's answer for a complete and accurate breakdown.
Relatively easy to mock up and try yourself:
bool found;
double intersect = 0;
double any = 0;
for (int i = 0; i < 100; i++)
{
List<string> L1 = GenerateNumberStrings(200000);
List<string> L2 = GenerateNumberStrings(60000);
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
found = Intersect(L1, L2);
stopWatch.Stop();
intersect += stopWatch.ElapsedMilliseconds;
stopWatch.Reset();
stopWatch.Start();
found = Any(L1, L2);
stopWatch.Stop();
any += stopWatch.ElapsedMilliseconds;
}
Console.WriteLine("Intersect: " + intersect + "ms");
Console.WriteLine("Any: " + any + "ms");
}
private static bool Any(List<string> lst1, List<string> lst2)
{
return lst1.Any(x => lst2.Contains(x));
}
private static bool Intersect(List<string> lst1, List<string> lst2)
{
return lst1.Intersect(lst2).Any();
}
You'll find that the Any method is significantly faster in the long run, likely because it does not require the memory allocations and setup that intersect requires (Any stops and returns true as soon as it finds a match whereas Intersect actually needs to store the matches in a new List<T>).
Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
public int Chickens;
public int Goats;
public AnimalCount(int chickens, int goats) { Chickens = chickens; Goats = goats; }
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
....
yield return new AnimalCount(x, y);
....
}
You also have two functions that consume two separate IEnumerables, for example:
ConsumeChicken(IEnumerable<int>);
ConsumeGoat(IEnumerable<int>);
How can you call ConsumeChicken and ConsumeGoat without a) converting FarmsInEachPen() with ToList() beforehand, because it might have two zillion records, and b) using multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoat(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithreading, but it gets unnecessarily complicated with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable being fed into ConsumeChicken and each time the iterator is used, it internally calls ConsumeGoat, thus executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once..
I don't think there is a way to do what you want, since ConsumeChickens(IEnumerable<int>) and ConsumeGoats(IEnumerable<int>) are being called sequentially, each of them enumerating a list separately - how do you expect that to work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (which each consume a single item), and call them in alternation. Like this:
foreach(var animal in animals)
{
ConsumeChicken(animal.Chickens);
ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
Also, a note: depending on your LINQ-provider and what exactly it is you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using linq-to-sql or linq-to-entities, the following query..
from a in animals
group a by 0 into g
select new
{
TotalChickens = g.Sum(x => x.Chickens),
TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you can GetEnumerator to the front of the sequence and then repeatedly ask "Give me the next item" (MoveNext/Current). You can't, on one thread, have two different things pulling from the animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above is assuming that ConsumeChickens and ConsumeGoats are both returning void or are at least not returning IEnumerable<T> themselves - which seems like an obvious assumption. I'd appreciate it if the lame downvoter would actually comment.
Actually, the simplest way to achieve what you want is to convert the FarmsInEachPen return value into a push collection or IObservable and use Reactive Extensions to work with it:
var observable = new Subject<AnimalCount>();
// Subscribe() attaches the handlers; Do() alone only builds a new observable that nothing runs.
observable.Subscribe(x => DoSomethingWithChicken(x.Chickens));
observable.Subscribe(x => DoSomethingWithGoat(x.Goats));
foreach (var item in FarmsInEachPen())
{
observable.OnNext(item);
}
observable.OnCompleted();
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
foreach (var x in xList)
{
Console.WriteLine("X: " + x);
yield return null;
}
}
private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
foreach (var y in yList)
{
Console.WriteLine("Y: " + y);
yield return null;
}
}
private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
bool c = i != 0 || enumerator.MoveNext();
while (c)
{
if (i == 0)
{
yield return enumerator.Current.Chickens;
c = enumerator.MoveNext();
}
else
{
yield return enumerator.Current.Goats;
}
}
}
private static void Main(string[] args)
{
var enumerator = GetAnimals().GetEnumerator();
var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
temp.ToList();
Console.WriteLine("Total iterations: " + iterations);
}
I know it's impossible to use return and yield return in the same method.
This is the code that I would like to optimize:
public IEnumerable<TItem> GetItems(int data)
{
if (this.isSingleSet)
{
return this.singleSet; // is IEnumerable per-se
}
else
{
int index = this.GetSet(data);
foreach(TKey key in this.keySets[index])
{
yield return this.items[key];
}
}
}
Important: I know this code doesn't compile. It's the code I have to optimize.
There are two ways that I know of to make this method work:
convert yield return part:
...
else
{
int index = this.GetSet(data);
return this.keySets[index].Select(key => this.items[key]);
}
convert return part:
if (this.isSingleSet)
{
foreach(TItem item in this.singleSet)
{
yield return item;
}
}
else ...
But there's a big speed difference between the two. Using only return statements (in other words, using Select()) is much, much slower (like 6 times slower) than the yield return conversion.
Question
Is there any other way that comes to mind to write this method? Do you have any other suggestions or information that would shed light on the performance discrepancy?
Additional info
I was measuring speed of the two methods by using stopwatch around a for loop.
Stopwatch s = new Stopwatch();
s.Start();
for(int i = 0; i < 1000000; i++)
{
GetItems(GetRandomData()).ToList();
}
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
Each of these loops was run in a separate process, so garbage collection or anything else from one run could not influence the other.
I ran the program with one version of the method, then
closed it,
rewrote the method and ran it again.
I did this a few times to see a reliable performance difference...
Use two functions. The outer function, the one called by clients, does all the non-lazy bits (like parameter validation) that you don't want to delay. The private worker does the lazy bits:
public IEnumerable<TItem> GetItems(int data) {
if (this.isSingleSet) {
return this.singleSet; // is IEnumerable per-se
} else {
return DoGetItems(data);
}
}
private IEnumerable<TItem> DoGetItems(int data) {
int index = this.GetSet(data);
foreach(TKey key in this.keySets[index]) {
yield return this.items[key];
}
}
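The effect of the split is easiest to see with parameter validation. The following sketch uses made-up names (EagerValidationSketch, CountUpTo) and is not taken from the question's class; it only shows that the outer method's check runs at call time, while the iterator body runs on enumeration.

```csharp
using System;
using System.Collections.Generic;

static class EagerValidationSketch
{
    // Outer method: an ordinary method, so the check runs as soon as it is called.
    public static IEnumerable<int> CountUpTo(int limit)
    {
        if (limit < 0) throw new ArgumentOutOfRangeException(nameof(limit));
        return CountUpToIterator(limit);
    }

    // Private worker: an iterator block, so nothing in it runs until
    // the caller starts enumerating the result.
    private static IEnumerable<int> CountUpToIterator(int limit)
    {
        for (int i = 0; i < limit; i++)
            yield return i;
    }
}
```

If the check lived inside the iterator block instead, CountUpTo(-1) would return normally and the exception would only surface at the first MoveNext().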
The implementation of Select is (with the error checking removed):
public static IEnumerable<R> Select<A, R>(
this IEnumerable<A> sequence,
Func<A, R> projection)
{
foreach(A item in sequence)
yield return projection(item);
}
So I have a hard time believing that your using Select is hugely slower than the almost-identical foreach loop you already have. It will be slowed down by doing error checking (once) and by creating the delegate (once), and the slight overhead of indirecting through the delegate. But the loop machinery should be identical.
However, if I've learned one thing in performance analysis, it's that my expectations are frequently dead wrong. What does your profiling run indicate is the bottleneck in your application? Let's reason from facts, not from guesses here. What is the hot spot?
A trivial example of an "infinite" IEnumerable would be
IEnumerable<int> Numbers() {
int i=0;
while(true) {
yield return unchecked(i++);
}
}
I know that
foreach(int i in Numbers().Take(10)) {
Console.WriteLine(i);
}
and
var q = Numbers();
foreach(int i in q.Take(10)) {
Console.WriteLine(i);
}
both work fine (and print out the numbers 0-9).
But are there any pitfalls when copying or handling expressions like q? Can I rely on the fact that they are always evaluated lazily? Is there any danger of producing an infinite loop?
As long as you only call lazy, un-buffered methods you should be fine. So Skip, Take, Select, etc are fine. However, Min, Count, OrderBy etc would go crazy.
It can work, but you need to be cautious. Or inject a Take(somethingFinite) as a safety measure (or some other custom extension method that throws an exception after too much data).
For example:
public static IEnumerable<T> SanityCheck<T>(this IEnumerable<T> data, int max) {
int i = 0;
foreach(T item in data) {
if(++i >= max) throw new InvalidOperationException();
yield return item;
}
}
Yes, you are guaranteed that the code above will be executed lazily. While it looks (in your code) like you'd loop forever, your code actually produces something like this:
IEnumerable<int> Numbers()
{
return new PrivateNumbersEnumerable();
}
private class PrivateNumbersEnumerable : IEnumerable<int>
{
public IEnumerator<int> GetEnumerator()
{
return new PrivateNumbersEnumerator();
}
}
private class PrivateNumbersEnumerator : IEnumerator<int>
{
private int i;
public bool MoveNext() { i++; return true; }
public int Current
{
get { return i; }
}
}
(This obviously isn't exactly what will be generated, since this is pretty specific to your code, but it's nonetheless similar and should show you why it's going to be lazily evaluated).
You would have to avoid any greedy functions that attempt to read to the end. This would include Enumerable extensions like Count, ToArray/ToList, and aggregates such as Average/Min/Max, etc.
There's nothing wrong with infinite lazy lists, but you must make conscious decisions about how to handle them.
Use Take to limit the impact of an endless loop by setting an upper bound even if you don't need them all.
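A short sketch of putting a bound in front of a greedy operator. Numbers() is the generator from the question; the wrapper class and the two helper methods are hypothetical names added for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

static class InfiniteSketch
{
    // The "infinite" generator from the question.
    public static IEnumerable<int> Numbers()
    {
        int i = 0;
        while (true)
            yield return unchecked(i++);
    }

    // Safe: Take(10) stops pulling after ten items, so Sum() sees a finite sequence.
    public static int SumOfFirstTen() => Numbers().Take(10).Sum();

    // Safe: TakeWhile ends the sequence at the first item that fails the predicate.
    public static List<int> SquaresBelow(int limit) =>
        Numbers().Select(n => n * n).TakeWhile(sq => sq < limit).ToList();
}
```

Note that Where(sq => sq < limit) would not be a safe substitute for TakeWhile here: Where keeps pulling items in search of further matches, so the ToList() after it would never finish on an endless source.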
Yes, your code will always work without infinite looping. Someone might come along though later and mess things up. Suppose they want to do:
var q = Numbers().ToList();
Then, you're hosed! Many "aggregate" functions will kill you, like Max().
If evaluation weren't lazy, your first example wouldn't work as expected in the first place.