foreach + break vs LINQ FirstOrDefault performance difference - C#

I have two classes that perform date-range lookups for particular days.
public class IterationLookup<TItem, TKey> where TItem : class
{
    // TItem is assumed to expose IsWithinRange(DateTime) (via a constraint in the real code).
    private IList<TItem> items = null;
    public IterationLookup(IEnumerable<TItem> items, Func<TItem, TKey> keySelector)
    {
        this.items = items.OrderByDescending(keySelector).ToList();
    }
    public TItem GetItem(DateTime day)
    {
        foreach (TItem i in this.items)
        {
            if (i.IsWithinRange(day))
            {
                return i;
            }
        }
        return null;
    }
}
public class LinqLookup<TItem, TKey> where TItem : class
{
    private IList<TItem> items = null;
    public LinqLookup(IEnumerable<TItem> items, Func<TItem, TKey> keySelector)
    {
        this.items = items.OrderByDescending(keySelector).ToList();
    }
    public TItem GetItem(DateTime day)
    {
        return this.items.FirstOrDefault(i => i.IsWithinRange(day));
    }
}
Then I run speed tests, which show that the LINQ version is about 5 times slower. That would make sense if I stored the items without enumerating them with ToList, because then every call to FirstOrDefault would also re-run the OrderByDescending, making it much slower. But that's not the case, so I don't really know what's going on. LINQ should perform very similarly to the iteration.
This is the code excerpt that measures my timings
IList<RangeItem> ranges = GenerateRanges(); // returns List<T>
var iterLookup = new IterationLookup<RangeItem, int>(ranges, r => r.Id);
var linqLookup = new LinqLookup<RangeItem, int>(ranges, r => r.Id);
Stopwatch timer = new Stopwatch();
timer.Start();
for(int i = 0; i < 1000000; i++)
{
iterLookup.GetItem(GetRandomDay());
}
timer.Stop();
// display elapsed time
timer.Restart();
for(int i = 0; i < 1000000; i++)
{
linqLookup.GetItem(GetRandomDay());
}
timer.Stop();
// display elapsed time
Why do I think it should perform better? Because when I write very similar code without using these lookup classes, LINQ performs very similarly to the foreach iteration...
// continue from previous code block
// items used by both, ordered the same way as in the classes
IList<RangeItem> items = ranges.OrderByDescending(r => r.Id).ToList();
timer.Restart();
for(int i = 0; i < 1000000; i++)
{
DateTime day = GetRandomDay();
foreach(RangeItem r in items)
{
if (r.IsWithinRange(day))
{
// RangeItem result = r;
break;
}
}
}
timer.Stop();
// display elapsed time
timer.Restart();
for(int i = 0; i < 1000000; i++)
{
DateTime day = GetRandomDay();
items.FirstOrDefault(i => i.IsWithinRange(day));
}
timer.Stop();
// display elapsed time
In my opinion this is highly similar code. As far as I know, FirstOrDefault also iterates only until it reaches a matching item (or the end of the list), which is essentially the same as foreach with break.
But even the iteration class performs worse than my plain foreach loop, which is also a mystery, because the only overhead it has is a method call on a class instance compared to direct access.
Question
What am I doing wrong in my LINQ class that makes it perform so very slowly?
What am I doing wrong in my iteration class that makes it perform twice as slowly as the direct foreach loop?
Which times are being measured?
I do these steps:
Generate ranges (as seen below in results)
Create object instances for IterationLookup, LinqLookup (and my optimized date range class BitCountLookup which is not part of discussion here)
Start timer and execute 1 million lookups on random days within maximum date range (as seen in results) by using previously instantiated IterationLookup class.
Start timer and execute 1 million lookups on random days within maximum date range (as seen in results) by using previously instantiated LinqLookup class.
Start timer and execute 1 million lookups (6 times) using manual foreach+break loops and Linq calls.
As you can see object instantiation is not measured.
Appendix I: Results over million lookups
Ranges displayed in these results don't overlap, which should make both approaches even more similar, in case the LINQ version doesn't break out of the loop on a successful match (which it almost certainly does).
Generated Ranges:
ID  Range
09  22.01.-30.01.
08  14.01.-16.01.
07  16.02.-19.02.
06  15.01.-17.01.
05  19.02.-23.02.
04  01.01.-07.01.
03  02.01.-10.01.
02  11.01.-13.01.
01  16.01.-20.01.
00  29.01.-06.02.
Lookup classes...
- Iteration: 1028ms
- Linq: 4517ms !!! THIS IS THE PROBLEM !!!
- BitCounter: 401ms
Manual loops...
- Iter: 786ms
- Linq: 981ms
- Iter: 787ms
- Linq: 996ms
- Iter: 787ms
- Linq: 977ms
- Iter: 783ms
- Linq: 979ms
Appendix II: GitHub:Gist code to test for yourself
I've put up a Gist so you can get the full code yourself and see what's going on. Create a Console application, copy Program.cs into it and add the other files that are part of the gist.
Grab it here.
Appendix III: Final thoughts and measurement tests
The most problematic thing was of course the LINQ implementation, which was awfully slow. It turns out this has everything to do with how delegates are created and optimized by the compiler. LukeH provided the best and most usable solution, which prompted me to try different approaches. I tried the following variants of the GetItem method (or GetPointData as it's called in the Gist):
1. the usual way that most developers would use (this is also what's implemented in the Gist; it wasn't updated after the results revealed it's not the best way of doing it):
return this.items.FirstOrDefault(item => item.IsWithinRange(day));
2. defining a local predicate variable:
Func<TItem, bool> predicate = item => item.IsWithinRange(day);
return this.items.FirstOrDefault(predicate);
3. a local predicate builder:
Func<DateTime, Func<TItem, bool>> builder = d => item => item.IsWithinRange(d);
return this.items.FirstOrDefault(builder(day));
4. a local predicate builder and a local predicate variable:
Func<DateTime, Func<TItem, bool>> builder = d => item => item.IsWithinRange(d);
Func<TItem, bool> predicate = builder(day);
return this.items.FirstOrDefault(predicate);
5. a class-level (static or instance) predicate builder:
return this.items.FirstOrDefault(classLevelBuilder(day));
6. an externally defined predicate, provided as a method parameter:
public TItem GetItem(Func<TItem, bool> predicate)
{
return this.items.FirstOrDefault(predicate);
}
When executing method 6, I also took two approaches:
6.1. the predicate provided directly at the method call, inside the for loop:
for (int i = 0; i < 1000000; i++)
{
linqLookup.GetItem(item => item.IsWithinRange(GetRandomDay()));
}
6.2. a predicate builder defined outside the for loop:
Func<DateTime, Func<Ranger, bool>> builder = d => r => r.IsWithinRange(d);
for (int i = 0; i < 1000000; i++)
{
linqLookup.GetItem(builder(GetRandomDay()));
}
Results - what performs best
For comparison: the iteration class takes approx. 770 ms to execute 1 million lookups on randomly generated ranges.
Option 3 (local predicate builder) turns out to be optimized best by the compiler, so it performs almost as fast as plain iteration: 800 ms.
Option 6.2 (predicate builder defined outside the for loop): 885 ms.
Option 6.1 (predicate defined within the for loop): 1525 ms.
All the others took somewhere between 4200 ms and 4360 ms and are thus considered unusable.
So whenever you use a predicate in a frequently called method, define a builder and execute that. This yields the best results.
The biggest surprise to me is that delegate (or predicate) creation can be this time-consuming.

Sometimes LINQ appears slower because the generation of delegates in a loop (especially a non-obvious loop over method calls) can add time. Instead, you may want to consider moving your finder out of the class to make it more generic (like your key selector is on construction):
public class LinqLookup<TItem, TKey>
{
    private IList<TItem> items = null;
    public LinqLookup(IEnumerable<TItem> items, Func<TItem, TKey> keySelector)
    {
        this.items = items.OrderByDescending(keySelector).ToList();
    }
    public TItem GetItem(Func<TItem, bool> predicate)
    {
        return this.items.FirstOrDefault(predicate);
    }
}
Since you don't use a lambda in your iterative code, this can make a bit of a difference, because the LINQ version has to create the delegate on each pass through the loop. Usually this time is insignificant for every-day coding, and invoking a delegate is no more expensive than any other method call; it's just that delegate creation in a tight loop can add that little bit of extra time.
In this case, since the delegate never changes for the class, you can create it outside of the code you are looping through and it would be more efficient.
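A minimal sketch of that, using the restructured GetItem above and names from the question (the fixed date is purely illustrative):
// Created once, outside the hot loop, so no delegate is allocated per iteration.
DateTime fixedDay = new DateTime(2012, 1, 15); // illustrative value
Func<RangeItem, bool> predicate = r => r.IsWithinRange(fixedDay);
for (int i = 0; i < 1000000; i++)
{
    linqLookup.GetItem(predicate);
}
When the lookup value does change on every iteration (as with GetRandomDay() in the question), some per-call delegate construction is unavoidable; the builder variants in Appendix III are ways of keeping as much of that construction as possible out of the loop.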
Update:
Actually, even without any optimization, compiling in release mode on my machine I do not see the 5x difference. I just performed 1,000,000 lookups on an Item that only has a DateTime field, with 5,000 items in the list. Of course, my data, etc, are different, but you can see the times are actually really close when you abstract out the delegate:
iterative : 14279 ms, 0.014279 ms/call
linq w opt : 17400 ms, 0.0174 ms/call
These time differences are very minor and worth the readability and maintainability improvements of using LINQ. I don't see the 5x difference though, which leads me to believe there's something we're not seeing in your test harness.

Further to Gabe's answer, I can confirm that the difference appears to be caused by the cost of re-constructing the delegate for every call to GetPointData.
If I add a single line to the GetPointData method in your IterationRangeLookupSingle class then it slows right down to the same crawling pace as LinqRangeLookupSingle. Try it:
// in IterationRangeLookupSingle<TItem, TKey>
public TItem GetPointData(DateTime point)
{
// just a single line, this delegate is never used
Func<TItem, bool> dummy = i => i.IsWithinRange(point);
// the rest of the method remains exactly the same as before
// ...
}
(I'm not sure why the compiler and/or jitter can't just ignore the superfluous delegate that I added above. Obviously, the delegate is necessary in your LinqRangeLookupSingle class.)
One possible workaround is to compose the predicate in LinqRangeLookupSingle so that point is passed to it as an argument. This means that the delegate only needs to be constructed once, not every time the GetPointData method is called. For example, the following change will speed up the LINQ version so that it's pretty much comparable with the foreach version:
// in LinqRangeLookupSingle<TItem, TKey>
public TItem GetPointData(DateTime point)
{
Func<DateTime, Func<TItem, bool>> builder = x => y => y.IsWithinRange(x);
Func<TItem, bool> predicate = builder(point);
return this.items.FirstOrDefault(predicate);
}

Assume you have a loop like this:
for (int counter = 0; counter < 1000000; counter++)
{
// execute this 1M times and time it
DateTime day = GetRandomDay();
items.FirstOrDefault(i => i.IsWithinRange(day));
}
This loop will create 1,000,000 delegate objects so that the i.IsWithinRange call can access day, and each of those delegates then gets invoked, on average, items.Length / 2 times (roughly 1,000,000 * items.Length / 2 delegate invocations in total). Neither the delegate allocations nor the delegate invocations exist in your foreach loop, which calls IsWithinRange directly; that is why the explicit loop is faster.
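For illustration, the compiler rewrites that lambda into something roughly like the following (class and member names are made up; the real generated names are compiler-internal):
// Closure ("display class") generated for the lambda that captures 'day'.
private sealed class DayClosure
{
    public DateTime day;
    public bool IsMatch(RangeItem i) { return i.IsWithinRange(this.day); }
}
for (int counter = 0; counter < 1000000; counter++)
{
    var closure = new DayClosure();                              // one allocation per iteration
    closure.day = GetRandomDay();
    var predicate = new Func<RangeItem, bool>(closure.IsMatch);  // a second allocation per iteration
    items.FirstOrDefault(predicate);
}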

Related

What is the most efficient loop in c#

There are a number of different ways to accomplish the same simple loop through the items of an object in C#.
This has made me wonder whether there is any reason, be it performance or ease of use, to use one over the others. Or is it just down to personal preference?
Take a simple object:
var myList = new List<MyObject>();
Let's assume the list is filled and we want to iterate over the items.
Method 1.
foreach(var item in myList)
{
//Do stuff
}
Method 2
myList.ForEach(ml =>
{
//Do stuff
});
Method 3
var enumerator = myList.GetEnumerator();
while (enumerator.MoveNext())
{
    var item = enumerator.Current;
    //Do stuff
}
Method 4
for (int i = 0; i < myList.Count; i++)
{
//Do stuff
}
What I was wondering is: do each of these compile down to the same thing? Is there a clear performance advantage to using one over the others?
Or is this just down to personal preference when coding?
Have I missed any?
The answer the majority of the time is it does not matter. The number of items in the loop (even what one might consider a "large" number of items, say in the thousands) isn't going to have an impact on the code.
Of course, if you identify this as a bottleneck in your situation, by all means, address it, but you have to identify the bottleneck first.
That said, there are a number of things to take into consideration with each approach, which I'll outline here.
Let's define a few things first:
All of the tests were run on .NET 4.0 on a 32-bit processor.
TimeSpan.TicksPerSecond on my machine = 10,000,000
All tests were performed in separate unit test sessions, not in the same one (so as not to possibly interfere with garbage collections, etc.)
Here's some helpers that are needed for each test:
The MyObject class:
public class MyObject
{
public int IntValue { get; set; }
public double DoubleValue { get; set; }
}
A method to create a List<T> of any length of MyClass instances:
public static List<MyObject> CreateList(int items)
{
// Validate parameters.
if (items < 0)
throw new ArgumentOutOfRangeException("items", items,
"The items parameter must be a non-negative value.");
// Return the items in a list.
return Enumerable.Range(0, items).
Select(i => new MyObject { IntValue = i, DoubleValue = i }).
ToList();
}
An action to perform for each item in the list (needed because Method 2 uses a delegate, and a call needs to be made to something to measure impact):
public static void MyObjectAction(MyObject obj, TextWriter writer)
{
// Validate parameters.
Debug.Assert(obj != null);
Debug.Assert(writer != null);
// Write.
writer.WriteLine("MyObject.IntValue: {0}, MyObject.DoubleValue: {1}",
obj.IntValue, obj.DoubleValue);
}
A method to create a TextWriter which writes to a null Stream (basically a data sink):
public static TextWriter CreateNullTextWriter()
{
// Create a stream writer off a null stream.
return new StreamWriter(Stream.Null);
}
And let's fix the number of items at one million (1,000,000), which should be high enough to show that, generally, these approaches all have about the same performance impact:
// The number of items to test.
public const int ItemsToTest = 1000000;
Let's get into the methods:
Method 1: foreach
The following code:
foreach(var item in myList)
{
//Do stuff
}
Compiles down into the following:
using (var enumerator = myList.GetEnumerator())
{
    while (enumerator.MoveNext())
    {
        var item = enumerator.Current;
        // Do stuff.
    }
}
There's quite a bit going on there. You have the method calls (and it may or may not be against the IEnumerator<T> or IEnumerator interfaces, as the compiler respects duck-typing in this case) and your // Do stuff is hoisted into that while structure.
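As a side note, that duck-typing is why foreach over a List<T> can avoid interface dispatch entirely: it binds to the public List<T>.GetEnumerator(), which returns the List<T>.Enumerator struct. A rough sketch of what that amounts to:
List<MyObject>.Enumerator e = list.GetEnumerator(); // a struct, not an IEnumerator<MyObject> reference
while (e.MoveNext())
{
    MyObject item = e.Current;
    // Do stuff.
}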
Here's the test to measure the performance:
[TestMethod]
public void TestForEachKeyword()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
foreach (var item in list)
{
// Write the values.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Foreach loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Foreach loop ticks: 3210872841
Method 2: .ForEach method on List<T>
The code for the .ForEach method on List<T> looks something like this:
public void ForEach(Action<T> action)
{
// Error handling omitted
// Cycle through the items, perform action.
for (int index = 0; index < Count; ++index)
{
// Perform action.
action(this[index]);
}
}
Note that this is functionally equivalent to Method 4, with one exception: the code that would be hoisted into the for loop is passed as a delegate. This requires a dereference to get to the code that needs to be executed. While the performance of delegates has improved from .NET 3.0 on, that overhead is there.
However, it's negligible. The test to measure the performance:
[TestMethod]
public void TestForEachMethod()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
list.ForEach(i => MyObjectAction(i, writer));
// Write out the number of ticks.
Debug.WriteLine("ForEach method ticks: {0}", s.ElapsedTicks);
}
}
The output:
ForEach method ticks: 3135132204
That's actually ~7.5 seconds faster than using the foreach loop. Not completely surprising, given that it uses direct array access instead of using IEnumerable<T>.
Remember though, this translates to 0.0000075740637 seconds per item being saved. That's not worth it for small lists of items.
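For reference, here is how that per-item figure follows from the tick counts above, using the 10,000,000 ticks per second noted earlier:
3,210,872,841 - 3,135,132,204 = 75,740,637 ticks saved
75,740,637 ticks / 10,000,000 ticks per second ≈ 7.57 seconds saved in total
7.57 s / 1,000,000 items ≈ 0.0000075740637 s (about 7.6 µs) per item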
Method 3: while (enumerator.MoveNext())
As shown in Method 1, this is essentially what the compiler already generates (with the addition of the using statement, which is good practice). You gain nothing by writing out by hand the code the compiler would otherwise generate for you.
For kicks, let's do it anyways:
[TestMethod]
public void TestEnumerator()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
// Get the enumerator.
using (IEnumerator<MyObject> enumerator = list.GetEnumerator())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
while (enumerator.MoveNext())
{
// Write.
MyObjectAction(enumerator.Current, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Enumerator loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Enumerator loop ticks: 3241289895
Method 4: for
In this particular case, you're going to gain some speed, as the list indexer is going directly to the underlying array to perform the lookup (that's an implementation detail, BTW, there's nothing to say that it can't be a tree structure backing the List<T> up).
[TestMethod]
public void TestListIndexer()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle by index.
for (int i = 0; i < list.Count; ++i)
{
// Get the item.
MyObject item = list[i];
// Perform the action.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("List indexer loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
List indexer loop ticks: 3039649305
However, the place where this can make a difference is arrays. Array loops can be unrolled by the compiler to process multiple items at a time:
instead of doing ten iterations of one item each in a ten-item loop, it can do five iterations of two items each.
However, I'm not positive this is actually happening here (I'd have to look at the IL and the output of the compiled IL).
Here's the test:
[TestMethod]
public void TestArray()
{
// Create the list.
MyObject[] array = CreateList(ItemsToTest).ToArray();
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle by index.
for (int i = 0; i < array.Length; ++i)
{
// Get the item.
MyObject item = array[i];
// Perform the action.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Enumerator loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Array loop ticks: 3102911316
It should be noted that, out of the box, ReSharper offers a refactoring suggestion to change the above for statements to foreach statements. That's not to say this is right, but the aim is to reduce the amount of technical debt in code.
TL;DR
You really shouldn't be concerned with the performance of these things, unless testing in your situation shows that you have a real bottleneck (and you'll have to have massive numbers of items to have an impact).
Generally, you should go for what's most maintainable, in which case, Method 1 (foreach) is the way to go.
In regards to the final part of the question, "Have I missed any?": yes, and I feel I would be remiss not to mention this even though the question is quite old. While those four ways of doing it execute in relatively the same amount of time, there is a way, not shown above, that runs faster than all of them, and quite significantly so as the number of items in the iterated list increases. It is exactly the same as the last method, except that instead of reading .Count in the loop's condition check, you assign that value to a variable before setting up the loop and use the variable instead. That leaves you with something like this:
var countVar = list.Count;
for(int i = 0; i < countVar; i++)
{
//loop logic
}
By doing it this way, you're only looking up a variable value at each iteration, rather than resolving the Count or Length properties, which is considerably less efficient.
I would suggest an even better and less well-known approach for faster loop iteration over a list. I would recommend first reading about Span<T>. Note that you can use this if you are on .NET Core.
List<MyObject> list = new();
foreach (MyObject item in CollectionsMarshal.AsSpan(list))
{
// Do something
}
Be aware of the caveats:
- The CollectionsMarshal.AsSpan method is unsafe and should be used only if you know what you're doing.
- CollectionsMarshal.AsSpan returns a Span<T> over the private backing array of the List<T>.
- Iterating over a Span<T> is fast, as the JIT uses the same tricks as for optimizing arrays.
- Using this method, there is no check that the list is not modified during the enumeration.
This is a more detailed explanation of what it does behind the scenes and more, super interesting!

Simple List<string> vs IEnumerable<string> performance issues

I've tested List<string> vs IEnumerable<string> iteration with for and foreach loops. Is it possible that the List is much faster?
These are 2 of the few links I could find that publicly state that performance is better when iterating an IEnumerable over a List.
Link1
Link2
My test was loading 10K lines from a text file that holds a list of URLs.
I first loaded it into a List, then assigned the List to an IEnumerable:
List<string> StrByLst = ...method to load records from the file .
IEnumerable<string> StrsByIE = StrByLst;
so each has 10k items of type <string>.
Looping over each collection 100 times, meaning 100K iterations, resulted in:
List<string> is faster by an amazing 50x than the IEnumerable<string>.
Is that predictable?
Update
This is the code that runs the tests:
string WorkDirtPath = HostingEnvironment.ApplicationPhysicalPath;
string fileName = "tst.txt";
string fileToLoad = Path.Combine(WorkDirtPath, fileName);
List<string> ListfromStream = new List<string>();
ListfromStream = PopulateListStrwithAnyFile(fileToLoad) ;
IEnumerable<string> IEnumFromStream = ListfromStream ;
string trslt = "";
Stopwatch SwFr = new Stopwatch();
Stopwatch SwFe = new Stopwatch();
string resultFrLst = "",resultFrIEnumrable, resultFe = "", Container = "";
SwFr.Start();
for (int itr = 0; itr < 100; itr++)
{
for (int i = 0; i < ListfromStream.Count(); i++)
{
Container = ListfromStream.ElementAt(i);
}
//the stop() was here , i was doing changes , so my mistake.
}
SwFr.Stop();
resultFrLst = SwFr.Elapsed.ToString();
//forgot to do this reset though still it is faster (x56??)
SwFr.Reset();
SwFr.Start();
for(int itr = 0; itr<100; itr++)
{
for (int i = 0; i < IEnumFromStream.Count(); i++)
{
Container = IEnumFromStream.ElementAt(i);
}
}
SwFr.Stop();
resultFrIEnumrable = SwFr.Elapsed.ToString();
Update ... final
Taking the count out of the for loops:
int counter = ..count for both IEnumerable & List
and then passing counter (an int) as the total item count, as suggested by @ScottChamberlain.
I re-checked that everything is in place; now the results are about 5% faster for the IEnumerable.
So that concludes it: choose by scenario / use case... there is no real performance difference at all.
You are doing something wrong.
The times that you get should be very close to each other, because you are running essentially the same code.
IEnumerable is just an interface, which List implements, so when you call some method on the IEnumerable reference it ends up calling the corresponding method of List.
There is no code implemented in the IEnumerable - this is what interfaces are - they only specify what functionality a class should have, but say nothing about how it's implemented.
You have a few problems with your test. One is the IEnumFromStream.Count() inside the for loop: every time it wants to get that value it must enumerate over the entire sequence to get the count, and the value is not cached between iterations. Move that call outside of the for loop, save the result in an int, and use that value in the for loop's condition; you will see a shorter time for your IEnumerable.
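For illustration, a minimal reshaping of the question's inner loop along those lines (variable names taken from the question; this is a sketch, not a complete fix, because ElementAt has a similar problem, described next):
int total = IEnumFromStream.Count(); // evaluated once, before the loops
for (int itr = 0; itr < 100; itr++)
{
    for (int i = 0; i < total; i++)
    {
        Container = IEnumFromStream.ElementAt(i); // still costly per element, see below
    }
}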
Also, IEnumFromStream.ElementAt(i) behaves similarly to Count(): it must iterate over the sequence from the start up to i every time (e.g. the first time it visits 0, the second time 0, 1, the third 0, 1, 2, and so on...), whereas List can jump directly to the index it needs. You should be working with the IEnumerator returned from GetEnumerator() instead.
IEnumerables and for loops don't mix well. Use the correct tool for the job: either call GetEnumerator() and work with that, or use it in a foreach loop.
Now, I know a lot of you may be saying "But it is an interface, it will just be mapping the calls, and it should make no difference", but there is a key thing: IEnumerable<T> does not have a Count() or ElementAt() method! Those methods are extension methods added by LINQ, and the LINQ code does not know the underlying collection is a List, so it does what it knows the underlying object can do, and that is iterating over the sequence every time the method is called.
IEnumerable using IEnumerator
using(var enu = IEnumFromStream.GetEnumerator())
{
//You have to call "MoveNext()" once before getting "Current" the first time,
// this is done so you can have a nice clean while loop like this.
while(enu.MoveNext())
{
Container = enu.Current;
}
}
The above code is basically the same thing as
foreach(var enu in IEnumFromStream)
{
Container = enu;
}
The important thing to remember is that IEnumerables do not have a length; in fact, they can be infinitely long. There is a whole field of computer science around detecting an infinitely long IEnumerable.
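A tiny, hypothetical illustration of that point: an IEnumerable produced with yield can go on forever, so anything that tries to consume it completely, such as Count(), will never return.
public static IEnumerable<int> EvenNumbers()
{
    int n = 0;
    while (true) // never terminates on its own
    {
        yield return n;
        n += 2;
    }
}
// EvenNumbers().Take(5) is fine; EvenNumbers().Count() would loop forever.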
Based on the code you posted I think the problem is with your use of the Stopwatch class.
You declare two of these, SwFr and SwFe, but only use the former. Because of this, the last call to SwFr.Elapsed will get the total amount of time across both sets of for loops.
If you are wanting to reuse that object in this way, place a call to SwFr.Reset() right after resultFrLst = SwFr.Elapsed.ToString();.
Alternatively, you could use SwFe when running the second test.

LINQ Why is "Enumerable = Enumerable.Skip(N)" slow?

I am having an issue with the performance of a LINQ query, so I created a small simplified example to demonstrate the issue below. The code takes a random list of small integers and returns the list partitioned into several smaller lists, each of which totals 10 or less.
The problem is that (as I've written this) the code takes disproportionately longer as N grows, even though this should only be an O(N) problem. With N=2500, the code takes over 10 seconds to run on my pc.
I would greatly appreciate it if someone could explain what is going on. Thanks, Mark.
int N = 250;
Random r = new Random();
var work = Enumerable.Range(1,N).Select(x => r.Next(0, 6)).ToList();
var chunks = new List<List<int>>();
// work.Dump("All the work."); // LINQPad Print
var workEnumerable = work.AsEnumerable();
Stopwatch sw = Stopwatch.StartNew();
while(workEnumerable.Any()) // or .FirstorDefault() != null
{
int soFar = 0;
var chunk = workEnumerable.TakeWhile( x =>
{
soFar += x;
return (soFar <= 10);
}).ToList();
chunks.Add(chunk); // Commented out makes no difference.
workEnumerable = workEnumerable.Skip(chunk.Count); // <== SUSPECT
}
sw.Stop();
// chunks.Dump("Work Chunks."); // LINQPad Print
sw.Elapsed.Dump("Time elapsed.");
What .Skip() does is create a new IEnumerable that loops over the source and only begins yielding results after the first N elements. You chain who knows how many of these after each other. Every time you call .Any(), you need to loop over all the previously skipped elements again.
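Conceptually, Skip behaves something like this simplified iterator (an illustration, not the actual BCL source):
static IEnumerable<T> MySkip<T>(IEnumerable<T> source, int count)
{
    foreach (T item in source)
    {
        if (count > 0)
        {
            count--; // skipped items are still enumerated, just not yielded
            continue;
        }
        yield return item;
    }
}
Each Skip call in the question wraps another layer like this around the previous enumerable, so every later enumeration walks all the earlier, already-skipped elements again.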
Generally speaking, it's a bad idea to set up very complicated operator chains in LINQ and enumerating them repeatedly. Also, since LINQ is a querying API, methods like Skip() are a bad choice when what you're trying to achieve amounts to modifying a data structure.
You effectively keep chaining Skip() onto the same enumerable. In a list of 250, the last chunk will be created from a lazy enumerable with ~25 'Skip' enumerator classes on the front.
You would find things become a lot faster already if you did:
workEnumerable = workEnumerable.Skip(chunk.Count).ToList();
However, I think the whole approach could be altered.
How about using standard LINQ to achieve the same:
See it live on http://ideone.com/JIzpml
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
private readonly static Random r = new Random();
public static void Main(string[] args)
{
int N = 250;
var work = Enumerable.Range(1,N).Select(x => r.Next(0, 6)).ToList();
var chunks = work.Select((o,i) => new { Index=i, Obj=o })
.GroupBy(e => e.Index / 10)
.Select(group => group.Select(e => e.Obj).ToList())
.ToList();
foreach(var chunk in chunks)
Console.WriteLine("Chunk: {0}", string.Join(", ", chunk.Select(i => i.ToString()).ToArray()));
}
}
The Skip() method and others like it basically create a placeholder object, implementing IEnumerable, that references its parent enumerable and contains the logic to perform the skipping. Skips in loops, therefore, are non-performant, because instead of throwing away elements of the enumerable, like you think they are, they add a new layer of logic that's lazily executed when you actually need the first element after all the ones you've skipped.
You can get around this by calling ToList() or ToArray(). This forces "eager" evaluation of the Skip() method, and really does get rid of the elements you're skipping from the new collection you will be enumerating. That comes at an increased memory cost, and requires all of the elements to be known (so if you're running this on an IEnumerable that represents an infinite series, good luck).
The second option is to not use Linq, and instead use the IEnumerable implementation itself, to get and control an IEnumerator. Then instead of Skip(), simply call MoveNext() the necessary number of times.
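A rough sketch of that second option applied to the chunking problem above (assuming the same work list and the 10-per-chunk limit; illustrative rather than a drop-in replacement):
var chunks = new List<List<int>>();
using (IEnumerator<int> e = work.GetEnumerator())
{
    var chunk = new List<int>();
    int soFar = 0;
    while (e.MoveNext()) // a single forward pass; nothing is ever re-skipped
    {
        if (soFar + e.Current > 10 && chunk.Count > 0)
        {
            chunks.Add(chunk); // close the current chunk
            chunk = new List<int>();
            soFar = 0;
        }
        chunk.Add(e.Current);
        soFar += e.Current;
    }
    if (chunk.Count > 0)
        chunks.Add(chunk); // add the final partial chunk
}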

How does a streaming operator differ from deferred execution?

In LINQ, Where is a streaming operator, whereas OrderByDescending is a non-streaming operator. AFAIK, a streaming operator only gathers the next item when it is necessary, while a non-streaming operator evaluates the entire data stream at once.
I fail to see the relevance of defining a streaming operator; to me, it is redundant with deferred execution. Take the example where I have written a custom extension and consumed it using the Where operator and an OrderByDescending.
public static class ExtensionStuff
{
public static IEnumerable<int> Where(this IEnumerable<int> sequence, Func<int, bool> predicate)
{
foreach (int i in sequence)
{
if (predicate(i))
{
yield return i;
}
}
}
}
public static void Main()
{
TestLinq3();
}
private static void TestLinq3()
{
int[] items = { 1, 2, 3,4 };
var selected = items.Where(i => i < 3)
.OrderByDescending(i => i);
Write(selected);
}
private static void Write(IEnumerable<int> selected)
{
foreach(var i in selected)
Console.WriteLine(i);
}
In either case, Where needs to evaluate each element in order to determine which elements meet the condition. The fact that it yields seems to only become relevant because the operator gains deferred execution.
So, what is the importance of Streaming Operators?
There are two aspects: speed and memory.
The speed aspect becomes more apparent when you use a method like .Take() to only consume a portion of the original result set.
// Consumes ten elements, yields 5 results.
Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.Take(5)
.ToList();
// Consumes one million elements, yields 5 results.
Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.OrderByDescending(i => i)
.Take(5)
.ToList();
Because the first example uses only streaming operators before the call to Take, you only end up yielding values 1 through 10 before Take stops evaluating. Furthermore, only one value is loaded into memory at a time, so you have a very small memory footprint.
In the second example, OrderByDescending is not streaming, so the moment Take pulls the first item, the entire result that's passed through the Where filter has to be placed in memory for sorting. This could take a long time and produce a big memory footprint.
Even if you weren't using Take, the memory issue can be important. For example:
// Puts half a million elements in memory, sorts, then outputs them.
var numbers = Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.OrderByDescending(i => i);
foreach(var number in numbers) Console.WriteLine(number);
// Puts one element in memory at a time.
var numbers = Enumerable.Range(1, 1000000).Where(i => i % 2 == 0);
foreach(var number in numbers) Console.WriteLine(number);
The fact that it yields seems to only become relevant because the
operator gains deferred execution.
So, what is the importance of Streaming Operators?
In other words: you could not process infinite sequences with buffering / non-streaming extension methods, while you can "run" such a sequence (until you abort) just fine using only streaming extension methods.
Take for example this method:
public IEnumerable<int> GetNumbers(int start)
{
int num = start;
while(true)
{
yield return num;
num++;
}
}
You can use Where just fine:
foreach (var num in GetNumbers(0).Where(x => x % 2 == 0))
{
Console.WriteLine(num);
}
OrderBy() would not work in this case since it would have to exhaustively enumerate the results before emitting a single number.
Just to be explicit: in the case you mentioned, there's no advantage to the fact that Where streams, since the OrderBy sucks the whole thing in anyway. There are, however, times when the advantages of streaming are used (other answers/comments have given examples), so all LINQ operators stream to the best of their ability. OrderBy streams as much as it can, which happens to be not very much; Where streams very effectively.

Is yield useful outside of LINQ?

Whenever I think I can use the yield keyword, I take a step back and look at how it will impact my project. I always end up returning a collection instead of yielding, because I feel the overhead of maintaining the state of the yielding method doesn't buy me much. In almost all cases where I am returning a collection, I feel that 90% of the time the calling method will be iterating over all elements in the collection, or will be seeking a series of elements throughout the entire collection.
I do understand its usefulness in LINQ, but I feel that only the LINQ team is writing such complex queryable objects that yield is useful.
Has anyone written anything like (or not like) LINQ where yield was useful?
Note that with yield, you are iterating over the collection once, but when you build a list, you'll be iterating over it twice.
Take, for example, a filter iterator:
// Extension methods must be declared in a static class; only the method is shown here.
public static IEnumerable<T> Filter<T>(this IEnumerable<T> coll, Func<T, bool> func)
{
    foreach (T t in coll)
        if (func(t)) yield return t;
}
Now, you can chain this:
MyColl.Filter(x=> x.id > 100).Filter(x => x.val < 200).Filter (etc)
Your method would be creating (and tossing) three lists. My method iterates over the source just once.
Also, when you return a collection, you are forcing a particular implementation on your users. An iterator is more generic.
I do understand its usefulness in linq, but I feel that only the linq team is writing such complex queriable objects that yield is useful.
Yield was useful as soon as it got implemented in .NET 2.0, which was long before anyone ever thought of LINQ.
Why would I write this function:
IList<string> LoadStuff() {
var ret = new List<string>();
foreach(var x in SomeExternalResource)
ret.Add(x);
return ret;
}
When I can use yield, and save the effort and complexity of creating a temporary list for no good reason:
IEnumerable<string> LoadStuff() {
foreach(var x in SomeExternalResource)
yield return x;
}
It can also have huge performance advantages. If your code only happens to use the first 5 elements of the collection, then using yield will often avoid the effort of loading anything past that point. If you build a collection then return it, you waste a ton of time and space loading things you'll never need.
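For instance, with the yield-based LoadStuff above, a caller that only needs the first five items never causes anything beyond those five to be read from SomeExternalResource (hypothetical usage; Take requires System.Linq):
foreach (var s in LoadStuff().Take(5))
{
    Console.WriteLine(s); // enumeration stops after five items; nothing further is loaded
}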
I could go on and on....
I recently had to make a representation of mathematical expressions in the form of an Expression class. When evaluating the expression I have to traverse the tree structure with a post-order treewalk. To achieve this I implemented IEnumerable<T> like this:
public IEnumerator<Expression<T>> GetEnumerator()
{
if (IsLeaf)
{
yield return this;
}
else
{
foreach (Expression<T> expr in LeftExpression)
{
yield return expr;
}
foreach (Expression<T> expr in RightExpression)
{
yield return expr;
}
yield return this;
}
}
Then I can simply use a foreach to traverse the expression. You can also add a Property to change the traversal algorithm as needed.
At a previous company, I found myself writing loops like this:
for (DateTime date = schedule.StartDate; date <= schedule.EndDate;
date = date.AddDays(1))
With a very simple iterator block, I was able to change this to:
foreach (DateTime date in schedule.DateRange)
It made the code a lot easier to read, IMO.
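The iterator block behind that DateRange property might have looked something like this (names assumed from the snippet above):
public IEnumerable<DateTime> DateRange
{
    get
    {
        for (DateTime date = StartDate; date <= EndDate; date = date.AddDays(1))
        {
            yield return date; // one day at a time, produced lazily
        }
    }
}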
yield was introduced in C# 2 (before LINQ in C# 3).
We used it heavily in a large enterprise C#2 web application when dealing with data access and heavily repeated calculations.
Collections are great any time you have a few elements that you're going to hit multiple times.
However in lots of data access scenarios you have large numbers of elements that you don't necessarily need to pass round in a great big collection.
This is essentially what the SqlDataReader does - it's a forward only custom enumerator.
What yield lets you do is quickly and with minimal code write your own custom enumerators.
Everything yield does could be done in C#1 - it just took reams of code to do it.
Linq really maximises the value of the yield behaviour, but it certainly isn't the only application.
Whenever your function returns IEnumerable, you should use yield. It is not limited to .NET 3.0 and later.
A .NET 2.0 example:
public static class FuncUtils
{
public delegate T Func<T>();
public delegate T Func<A0, T>(A0 arg0);
public delegate T Func<A0, A1, T>(A0 arg0, A1 arg1);
...
public static IEnumerable<T> Filter<T>(IEnumerable<T> e, Func<T, bool> filterFunc)
{
foreach (T el in e)
if (filterFunc(el))
yield return el;
}
public static IEnumerable<R> Map<T, R>(IEnumerable<T> e, Func<T, R> mapFunc)
{
foreach (T el in e)
yield return mapFunc(el);
}
...
I'm not sure about C#'s implementation of yield, but in dynamic languages it's far more efficient than creating the whole collection. In many cases, it makes it easy to work with datasets much bigger than RAM.
I am a huge yield fan in C#. This is especially true in large homegrown frameworks, where methods or properties often return a List that is a sub-set of another IEnumerable. The benefits that I see are:
- the return value of a method that uses yield is immutable
- you only iterate over the list once
- it is lazily executed, meaning the code that returns the values is not run until needed (though this can bite you if you don't know what you're doing)
- if the source list changes, you don't have to make another call to get another IEnumerable; you just iterate over the IEnumerable again
- many more
One other HUGE benefit of yield is when your method could potentially return millions of values, so many that there is the potential of running out of memory just building the List before the method can even return it. With yield, the method can create and return millions of values, as long as the caller also doesn't store every value. So it's good for large-scale data processing / aggregating operations.
Personally, I haven't found I'm using yield in my normal day-to-day programming. However, I've recently started playing with the Robotics Studio samples and found that yield is used extensively there, so I also see it being used in conjunction with the CCR (Concurrency and Coordination Runtime), where you have async and concurrency issues.
Anyway, still trying to get my head around it as well.
Yield is useful because it saves you space. Most optimizations in programming make a trade-off between space (disk, memory, networking) and processing. Yield, as a programming construct, allows you to iterate over a collection many times in sequence without needing a separate copy of the collection for each iteration.
Consider this example:
static IEnumerable<Person> GetAllPeople()
{
return new List<Person>()
{
new Person() { Name = "George", Surname = "Bush", City = "Washington" },
new Person() { Name = "Abraham", Surname = "Lincoln", City = "Washington" },
new Person() { Name = "Joe", Surname = "Average", City = "New York" }
};
}
static IEnumerable<Person> GetPeopleFrom(this IEnumerable<Person> people, string where)
{
foreach (var person in people)
{
if (person.City == where) yield return person;
}
yield break;
}
static IEnumerable<Person> GetPeopleWithInitial(this IEnumerable<Person> people, string initial)
{
foreach (var person in people)
{
if (person.Name.StartsWith(initial)) yield return person;
}
yield break;
}
static void Main(string[] args)
{
var people = GetAllPeople();
foreach (var p in people.GetPeopleFrom("Washington"))
{
// do something with washingtonites
}
foreach (var p in people.GetPeopleWithInitial("G"))
{
// do something with people with initial G
}
foreach (var p in people.GetPeopleWithInitial("P").GetPeopleFrom("New York"))
{
// etc
}
}
(Obviously you are not required to use yield with extension methods, it just creates a powerful paradigm to think about data.)
As you can see, if you have a lot of these "filter" methods (but it can be any kind of method that does some work on a list of people) you can chain many of them together without requiring extra storage space for each step. This is one way of raising the programming language (C#) up to express your solutions better.
The first side-effect of yield is that it delays execution of the filtering logic until you actually require it. If you therefore create a variable of type IEnumerable<> (with yields) but never iterate through it, you never execute the logic or consume the space which is a powerful and free optimization.
The other side-effect is that yield operates on the lowest common collection interface (IEnumerable<>) which enables the creation of library-like code with wide applicability.
Note that yield allows you to do things in a "lazy" way. By lazy, I mean that the evaluation of the next element in the IEnumerable is not done until the element is actually requested. This gives you the power to do a couple of different things. One is that you can yield an infinitely long list without the need to actually make infinite calculations. Second, you can return an enumeration of function applications; the functions will only be applied when you iterate through the list.
I've used yield in non-LINQ code for things like this (assuming the functions do not live in the same class):
public IEnumerable<string> GetData()
{
foreach(String name in _someInternalDataCollection)
{
yield return name;
}
}
...
public void DoSomething()
{
foreach(String value in GetData())
{
//... Do something with value that doesn't modify _someInternalDataCollection
}
}
You have to be careful not to inadvertently modify the collection that your GetData() function is iterating over though, or it will throw an exception.
Yield is very useful in general. It exists in Ruby, among other languages that support functional-style programming, so it's not as though it's tied to LINQ. It's more the other way around: LINQ is functional in style, so it uses yield.
I had a problem where my program was using a lot of cpu in some background tasks. What I really wanted was to still be able to write functions like normal, so that I could easily read them (i.e. the whole threading vs. event based argument). And still be able to break the functions up if they took too much cpu. Yield is perfect for this. I wrote a blog post about this and the source is available for all to grok :)
The System.Linq IEnumerable extensions are great, but sometimes you want more. For example, consider the following extension:
public static class CollectionSampling
{
public static IEnumerable<T> Sample<T>(this IEnumerable<T> coll, int max)
{
var rand = new Random();
using (var enumerator = coll.GetEnumerator())
{
while (enumerator.MoveNext())
{
yield return enumerator.Current;
int currentSample = rand.Next(max);
for (int i = 1; i <= currentSample; i++)
enumerator.MoveNext();
}
}
}
}
Another interesting advantage of yielding is that the caller cannot cast the return value to the original collection type and modify your internal collection
