I'm writing a program as follows:
Find all files with the correct extension in a given directory
Foreach, find all occurrences of a given string in those files
Print each line
I'd like to write this in a functional way, as a series of generator functions (things that call yield return and only return one item at a time lazily-loaded), so my code would read like this:
IEnumerable<string> allFiles = GetAllFiles();
IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
IEnumerable<string> contents = GetFileContents( matchingFiles );
IEnumerable<string> matchingLines = GetMatchingLines( contents );
foreach( var lineText in matchingLines )
Console.WriteLine( "Found: " + lineText );
This is all fine, but what I'd also like to do is print some statistics at the end. Something like this:
Found 233 matches in 150 matching files. Scanned 3,297 total files in 5.72s
The problem is, writing the code in a 'pure functional' style like above, each item is lazily loaded.
You don't know how many files match in total until the final foreach loop completes, and because only one item is ever yielded at a time, the code doesn't have any place to keep track of how many things it has found previously. If you invoke LINQ's matchingLines.Count() method, it will re-enumerate the collection!
I can think of many ways to solve this problem, but all of them seem to be somewhat ugly. It strikes me as something that people are bound to have done before, and I'm sure there'll be a nice design pattern which shows a best practice way of doing this.
Any ideas? Cheers
In a similar vein to other answers, but taking a slightly more generic approach ...
... why not create a Decorator class that can wrap an existing IEnumerable implementation and calculate the statistic as it passes other items through.
Here's a Counter class I just threw together - but you could create variations for other kinds of aggregation too.
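public class Counter<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> mSource;

    // Running total of items passed through. It keeps growing if the
    // sequence is enumerated more than once.
    public int Count { get; private set; }

    public Counter(IEnumerable<T> source)
    {
        mSource = source;
        Count = 0;
    }

    public IEnumerator<T> GetEnumerator()
    {
        foreach (var item in mSource)
        {
            Count++;
            yield return item;
        }
    }

    // Needs "using System.Collections;". Delegates to the generic version
    // so the counting logic lives in one place.
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}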
You could create three instances of Counter:
One to wrap GetAllFiles() counting the total number of files;
One to wrap GetMatches() counting the number of matching files; and
One to wrap GetMatchingLines() counting the number of matching lines.
The key with this approach is that you're not layering multiple responsibilities onto your existing classes/methods - the GetMatchingLines() method only handles the matching, you're not asking it to track stats as well.
Clarification in response to a comment by Mitcham:
The final code would look something like this:
var files = new Counter<string>( GetAllFiles());
var matchingFiles = new Counter<string>(GetMatches( "*.txt", files ));
var contents = GetFileContents( matchingFiles );
var linesFound = new Counter<string>(GetMatchingLines( contents ));
foreach( var lineText in linesFound )
Console.WriteLine( "Found: " + lineText );
string message
= String.Format(
"Found {0} matches in {1} matching files. Scanned {2} files",
linesFound.Count,
matchingFiles.Count,
files.Count);
Console.WriteLine(message);
Note that this is still a functional approach - the variables used are immutable (more like bindings than variables), and the overall function has no side-effects.
I would say that you need to encapsulate the process into a 'Matcher' class in which your methods capture statistics as they progress.
public class Matcher
{
    private int totalFileCount;
    private int matchedCount;
    private int lineCount;   // would be updated by a line-matching step (not shown)
    private DateTime start;
    private DateTime stop;

    public IEnumerable<string> Match(string pattern)
    {
        start = DateTime.Now;
        foreach (string file in GetMatchedFiles(pattern))
        {
            yield return file;
        }
        stop = DateTime.Now;
        // Reached only after the caller has enumerated the whole sequence.
        System.Console.WriteLine(string.Format(
            "Found {0} matches in {1} matching files." +
            " {2} total files scanned in {3}.",
            lineCount, matchedCount,
            totalFileCount, (stop - start).ToString()));
    }

    private IEnumerable<string> GetMatchedFiles(string pattern)
    {
        // SomeFileRetrievalMethod and MatchPattern are placeholders.
        foreach (string file in SomeFileRetrievalMethod())
        {
            totalFileCount++;
            if (MatchPattern(pattern, file))
            {
                matchedCount++;
                yield return file;
            }
        }
    }
}
I'll stop there since I'm supposed to be coding work stuff, but the general idea is there. The entire point of 'pure' functional programming is to not have side effects, and this type of statistics calculation is a side effect.
I can think of two ideas
Pass in a context object and return (string + context) from your enumerators - the purely functional solution
Use thread-local storage for your statistics (CallContext). You can be fancy and support a stack of contexts, so you would have code like this:
using (var stats = DirStats.Create())
{
IEnumerable<string> allFiles = GetAllFiles();
IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
IEnumerable<string> contents = GetFileContents( matchingFiles );
stats.Print();
IEnumerable<string> matchingLines = GetMatchingLines( contents );
stats.Print();
}
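The first idea (passing a context object through the pipeline) might look roughly like the sketch below. ScanStats and the simplified GetAllFiles/GetMatches signatures are assumptions made for illustration, and the "files" are just in-memory names rather than a real directory listing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical context object; the name and fields are assumptions.
public sealed class ScanStats
{
    public int TotalFiles;
    public int MatchingFiles;
}

public static class ContextDemo
{
    // Counts every file name that flows through, then passes it on lazily.
    public static IEnumerable<string> GetAllFiles(IEnumerable<string> names, ScanStats stats)
    {
        foreach (string name in names)
        {
            stats.TotalFiles++;
            yield return name;
        }
    }

    // Passes on only the names with the given extension, counting the hits.
    public static IEnumerable<string> GetMatches(
        string extension, IEnumerable<string> files, ScanStats stats)
    {
        foreach (string file in files)
        {
            if (file.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
            {
                stats.MatchingFiles++;
                yield return file;
            }
        }
    }

    public static void Main()
    {
        var stats = new ScanStats();
        var names = new[] { "a.txt", "b.cs", "c.txt" };

        foreach (string file in GetMatches(".txt", GetAllFiles(names, stats), stats))
            Console.WriteLine("Found: " + file);

        // The counters are complete only after the loop has drained the pipeline.
        Console.WriteLine("{0} of {1} files matched", stats.MatchingFiles, stats.TotalFiles);
    }
}
```

The pipeline stays lazy; the only change is that each stage takes one extra parameter.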
If you're happy to turn your code upside down, you might be interested in Push LINQ. The basic idea is to reverse the "pull" model of IEnumerable<T> and turn it into a "push" model with observers - each part of the pipeline effectively pushes its data past any number of observers (using event handlers) which typically form new parts of the pipeline. This gives a really easy way to hook up multiple aggregates to the same data.
See this blog entry for some more details. I gave a talk on it in London a while ago - my page of talks has a few links for sample code, the slide deck, video etc.
It's a fun little project, but it does take a bit of getting your head around.
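To give a feel for the push model, here is a tiny sketch. PushSource<T> is a hypothetical class written for this example and is not Push LINQ's actual API; it only shows how several observers can aggregate over one pass of the data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical push-style source; NOT the real Push LINQ API.
public sealed class PushSource<T>
{
    public event Action<T> ItemPushed;

    // Walks the input once and pushes each item to every subscriber.
    public void PushAll(IEnumerable<T> items)
    {
        foreach (T item in items)
        {
            Action<T> handler = ItemPushed;
            if (handler != null) handler(item);
        }
    }
}

public static class PushDemo
{
    // Returns { total, matches } computed in a single pass over 'files'.
    public static int[] CountTotalAndMatches(IEnumerable<string> files)
    {
        int total = 0, matches = 0;
        var source = new PushSource<string>();
        source.ItemPushed += file => total++;
        source.ItemPushed += file => { if (file.EndsWith(".txt")) matches++; };
        source.PushAll(files);
        return new[] { total, matches };
    }

    public static void Main()
    {
        int[] counts = CountTotalAndMatches(new[] { "a.txt", "b.cs", "c.txt" });
        Console.WriteLine("Scanned {0} files, {1} matched", counts[0], counts[1]);
    }
}
```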
I took Bevan's code and refactored it around until I was content. Fun stuff.
public class Counter
{
public int Count { get; set; }
}
public static class CounterExtensions
{
public static IEnumerable<T> ObserveCount<T>
(this IEnumerable<T> source, Counter count)
{
foreach (T t in source)
{
count.Count++;
yield return t;
}
}
public static IEnumerable<T> ObserveCount<T>
(this IEnumerable<T> source, IList<Counter> counters)
{
Counter c = new Counter();
counters.Add(c);
return source.ObserveCount(c);
}
}
public static class CounterTest
{
public static void Test1()
{
IList<Counter> counters = new List<Counter>();
//
IEnumerable<int> step1 =
Enumerable.Range(0, 100).ObserveCount(counters);
//
IEnumerable<int> step2 =
step1.Where(i => i % 10 == 0).ObserveCount(counters);
//
IEnumerable<int> step3 =
step2.Take(3).ObserveCount(counters);
//
step3.ToList();
foreach (Counter c in counters)
{
Console.WriteLine(c.Count);
}
}
}
Output as expected: 21, 3, 3
Assuming those functions are your own, the only thing I can think of is the Visitor pattern, passing in an abstract visitor function that calls you back when each thing happens. For example: pass an ILineVisitor into GetFileContents (which I'm assuming breaks up the file into lines). ILineVisitor would have a method like OnVisitLine(String line), you could then implement the ILineVisitor and make it keep the appropriate stats. Rinse and repeat with a ILineMatchVisitor, IFileVisitor etc. Or you could use a single IVisitor with an OnVisit() method which has a different semantic in each case.
Your functions would each need to take a Visitor and call its OnVisit() at the appropriate time, which may seem annoying, but at least the visitor could be used to do lots of interesting things other than just what you're doing here. In fact, you could avoid writing GetMatchingLines altogether by passing a visitor that checks for the match in OnVisitLine(String line) into GetFileContents.
Is this one of the ugly things you'd already considered?
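For what it's worth, a minimal sketch of the visitor idea could look like this. ILineVisitor and OnVisitLine follow the names suggested above; LineCountingVisitor and the simplified GetFileContents (which takes file texts already in memory) are assumptions for the sake of a runnable example.

```csharp
using System;
using System.Collections.Generic;

// Visitor interface as described above.
public interface ILineVisitor
{
    void OnVisitLine(string line);
}

// Hypothetical visitor that just keeps a running line count.
public sealed class LineCountingVisitor : ILineVisitor
{
    public int Lines { get; private set; }

    public void OnVisitLine(string line)
    {
        Lines++;
    }
}

public static class VisitorDemo
{
    // Stand-in for GetFileContents: splits each text into lines and
    // notifies the visitor before yielding each one.
    public static IEnumerable<string> GetFileContents(
        IEnumerable<string> fileTexts, ILineVisitor visitor)
    {
        foreach (string text in fileTexts)
        {
            foreach (string line in text.Split('\n'))
            {
                visitor.OnVisitLine(line);
                yield return line;
            }
        }
    }

    public static void Main()
    {
        var visitor = new LineCountingVisitor();
        foreach (string line in GetFileContents(new[] { "a\nb", "c" }, visitor))
            Console.WriteLine(line);
        Console.WriteLine("Lines visited: " + visitor.Lines);
    }
}
```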
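Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
    public int Chickens;
    public int Goats;

    public AnimalCount(int chickens, int goats)
    {
        Chickens = chickens;
        Goats = goats;
    }
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
    ....
    yield return new AnimalCount(x, y);
    ....
}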
You also have two functions that consume two separate IEnumerables, for example:
ConsumeChicken(IEnumerable<int>);
ConsumeGoat(IEnumerable<int>);
How can you call ConsumeChicken and ConsumeGoat without a) converting FarmsInEachPen() with ToList() beforehand (it might have two zillion records), and b) using multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoat(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithread, but it gets unnecessarily complicated with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable being fed into ConsumeChicken and each time the iterator is used, it internally calls ConsumeGoat, thus executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once..
I don't think there is a way to do what you want, since ConsumeChickens(IEnumerable<int>) and ConsumeGoats(IEnumerable<int>) are being called sequentially, each of them enumerating a list separately - how do you expect that to work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (which each consume a single item), and call them in alternation. Like this:
foreach(var animal in animals)
{
    ConsumeChicken(animal.Chickens);
    ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
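A self-contained sketch of that loop is below. The sample data, the PensProduced counter, and the running totals that stand in for ConsumeChicken(int) and ConsumeGoat(int) are all assumptions for illustration; the counter is there only to show that the lazy source is walked once.

```csharp
using System;
using System.Collections.Generic;

public struct AnimalCount
{
    public int Chickens;
    public int Goats;
}

public static class SinglePassDemo
{
    // Incremented each time the lazy source produces an item.
    public static int PensProduced;

    public static IEnumerable<AnimalCount> FarmsInEachPen()
    {
        for (int pen = 1; pen <= 3; pen++)
        {
            PensProduced++;
            yield return new AnimalCount { Chickens = pen, Goats = pen * 10 };
        }
    }

    // Feeds both per-item consumers in lock-step during one enumeration.
    public static void SumBoth(IEnumerable<AnimalCount> animals, out int chickens, out int goats)
    {
        chickens = 0;
        goats = 0;
        foreach (AnimalCount animal in animals)
        {
            chickens += animal.Chickens; // stands in for ConsumeChicken(int)
            goats += animal.Goats;       // stands in for ConsumeGoat(int)
        }
    }

    public static void Main()
    {
        int chickens, goats;
        SumBoth(FarmsInEachPen(), out chickens, out goats);
        Console.WriteLine("Chickens: {0}, goats: {1}, items produced: {2}",
            chickens, goats, PensProduced);
    }
}
```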
Also, a note: depending on your LINQ-provider and what exactly it is you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using linq-to-sql or linq-to-entities, the following query..
from a in animals
group a by 0 into g
select new
{
TotalChickens = g.Sum(x => x.Chickens),
TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you can GetEnumerator to the front of the sequence and then repeatedly ask "Give me the next item" (MoveNext/Current). You can't, on one thread, have two different things pulling from the animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above is assuming that ConsumeChickens and ConsumeGoats are both returning void or are at least not returning IEnumerable<T> themselves - which seems like an obvious assumption. I'd appreciate it if the lame downvoter would actually comment.
Actually, the simplest way to achieve what you want is to convert the FarmsInEachPen return value to a push collection (an IObservable) and use Reactive Extensions to work with it:
var observable = new Subject<AnimalCount>();
// Subscribe rather than Do: Do only builds a new sequence and runs nothing until something subscribes to it.
observable.Subscribe(x => DoSomethingWithChicken(x.Chickens));
observable.Subscribe(x => DoSomethingWithGoat(x.Goats));
foreach (var item in FarmsInEachPen())
{
    observable.OnNext(item);
}
observable.OnCompleted();
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
foreach (var x in xList)
{
Console.WriteLine("X: " + x);
yield return null;
}
}
private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
foreach (var y in yList)
{
Console.WriteLine("Y: " + y);
yield return null;
}
}
private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
bool c = i != 0 || enumerator.MoveNext();
while (c)
{
if (i == 0)
{
yield return enumerator.Current.Chickens;
c = enumerator.MoveNext();
}
else
{
yield return enumerator.Current.Goats;
}
}
}
private static void Main(string[] args)
{
var enumerator = GetAnimals().GetEnumerator();
var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
temp.ToList();
Console.WriteLine("Total iterations: " + iterations);
}
This little program finds the top ten most used words in a file. How would you, or could you, optimize this to process the file via line-by-line streaming, but keep it in the functional style it is now?
static void Main(string[] args)
{
string path = @"C:\tools\copying.txt";
File.ReadAllText(path)
.Split(' ')
.Where(s => !string.IsNullOrEmpty(s))
.GroupBy(s => s)
.OrderByDescending(g => g.Count())
.Take(10)
.ToList()
.ForEach(g => Console.WriteLine("{0}\t{1}", g.Key, g.Count()));
Console.ReadLine();
}
Here is the line reader I'd like to use:
static IEnumerable<string> ReadLinesFromFile(this string filename)
{
using (StreamReader reader = new StreamReader(filename))
{
while (true)
{
string s = reader.ReadLine();
if (s == null)
break;
yield return s;
}
}
}
Edit:
I realize that the implementation of top-words doesn't take into account punctuation and all the other little nuances, and I'm not too worried about that.
Clarification:
I'm interested in a solution that doesn't load the entire file into memory at once. I suppose you'd need a data structure that could take a stream of words and "group" on the fly, like a trie, and then somehow get it done in a lazy way so the line reader can go about its business line by line. I'm now realizing that this is a lot to ask for and is a lot more complex than the simple example I gave above. Maybe I'll give it a shot and see if I can get the code as clear as above (with a bunch of new lib support).
So what you're saying is you want to go from:
full text -> sequence of words -> rest of query
to
sequence of lines -> sequence of words -> rest of query
yes?
that seems straightforward.
var words = from line in GetLines()
from word in line.Split(' ')
select word;
and then
words.Where( ... blah blah blah
Or, if you prefer using the "fluent" style throughout, the SelectMany() method is the one you want.
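In the fluent style, a sketch of the same words query could read as follows. The Words helper and the sample lines are made up for the example; SelectMany flattens the per-line word arrays into one lazy sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FluentWordsDemo
{
    // Lazily turns a sequence of lines into a sequence of non-empty words.
    public static IEnumerable<string> Words(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(' '))
            .Where(word => !string.IsNullOrEmpty(word));
    }

    public static void Main()
    {
        var lines = new[] { "foo bar", "", "foo  blah" };
        foreach (string word in Words(lines))
            Console.WriteLine(word);
    }
}
```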
I personally would not do this all in one go. I'd make the query, and then write a foreach loop. That way, the query is built free of side effects, and the side effects are in the loop, where they belong. But some people seem to prefer putting their side effects into a ForEach method instead.
UPDATE: There's a question as to how "lazy" this query is.
You are correct in that what you end up with is an in-memory representation of every word in the file; however, with my minor reorganization of it, you at least do not have to create one big string that contains the entire text to begin with; you can do it line by line.
There are ways to cut down on how much duplication there is here, which we'll come to in a minute. However, I want to keep talking for a bit about how to reason about laziness.
A great way to think about these things is due to Jon Skeet, which I shall shamelessly steal from him.
Imagine a stage upon which there is a line of people. They are wearing shirts that say GetLines, Split, Where, GroupBy, OrderByDescending, Take, ToList and ForEach.
ToList pokes Take. Take does something and then hands ToList a card with a list of words on it. ToList keeps on poking Take until Take says "I'm done". At that point, ToList makes a list out of all the cards it has been handed, and then hands the first one off to ForEach. The next time it is poked, it hands out the next card.
What does Take do? Every time it is poked it asks OrderByDescending for another card, and immediately hands that card to ToList. After handing out ten cards, it tells ToList "I'm done".
What does OrderByDescending do? When it is poked for the first time, it pokes GroupBy. GroupBy hands it a card. It keeps on poking GroupBy until GroupBy says "I'm done". Then OrderByDescending sorts the cards, and hands the first one to Take. Every subsequent time it is poked, it hands a new card to Take, until Take stops asking.
And so on. You see how this goes. The query operators GetLines, Split, Where, GroupBy, OrderByDescending, Take are lazy, in that they do not act until poked. Some of them (OrderByDescending, ToList, GroupBy), need to poke their card provider many, many times before they can respond to the guy poking them. Some of them (GetLines, Split, Where, Take) only poke their provider once when they are themselves poked.
Once ToList is done, ForEach pokes ToList. ToList hands ForEach a card off its list. ForEach counts the words, and then writes a word and a count on the whiteboard. ForEach keeps on poking ToList until ToList says "no more".
(Notice that the ToList is completely unnecessary in your query; all it does is accumulate the results of the top ten into a list. ForEach could be talking directly to Take.)
Now, as for your question of whether you can reduce the memory footprint further: yes, you can. Suppose the file is "foo bar foo blah". Your code builds up the set of groups:
{
{ key: foo, contents: { foo, foo } },
{ key: bar, contents: { bar } },
{ key: blah, contents: { blah } }
}
and then orders those by the length of the contents list, and then takes the top ten. You don't have to store nearly that much in the contents list in order to compute the answer you want. What you really want to be storing is:
{
{ key: foo, value: 2 },
{ key: bar, value: 1 },
{ key: blah, value: 1 }
}
and then sort that by value.
Or, alternately, you could build up the backwards mapping
{
{ key: 2, value: { foo } },
{ key: 1, value: { bar, blah }}
}
sort that by key, and then do a select-many on the lists until you have extracted the top ten words.
The concept you want to look at in order to do either of these is the "accumulator". An accumulator is an object which efficiently "accumulates" information about a data structure while the data structure is being iterated over. "Sum" is an accumulator of a sequence of numbers. "StringBuilder" is often used as an accumulator on a sequence of strings. You could write an accumulator which accumulates counts of words as the list of words is walked over.
The function you want to study in order to understand how to do this is Aggregate:
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.aggregate.aspx
Good luck!
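As a rough illustration of the accumulator idea, the sketch below uses Aggregate with a Dictionary<string, int> as the accumulator, so only one count per distinct word is kept rather than a group containing every occurrence. CountWords and the sample input are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountDemo
{
    // Folds a stream of words into a word -> count dictionary.
    public static Dictionary<string, int> CountWords(IEnumerable<string> words)
    {
        return words.Aggregate(
            new Dictionary<string, int>(),
            (counts, word) =>
            {
                int current;
                counts.TryGetValue(word, out current); // current stays 0 if the word is new
                counts[word] = current + 1;
                return counts;
            });
    }

    public static void Main()
    {
        var words = "foo bar foo blah".Split(' ');
        var counts = CountWords(words);

        foreach (var pair in counts.OrderByDescending(kvp => kvp.Value).Take(10))
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}
```

The dictionary still holds every distinct word, so memory grows with the vocabulary, not with the length of the file.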
First, let's abstract away our file into an IEnumerable<string> where the lines are yielded one at a time:
class LineReader : IEnumerable<string> {
Func<TextReader> _source;
public LineReader(Func<Stream> streamSource) {
_source = () => new StreamReader(streamSource());
}
public IEnumerator<string> GetEnumerator() {
using (var reader = _source()) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
IEnumerator IEnumerable.GetEnumerator() {
return GetEnumerator();
}
}
Next, let's make an extension method on IEnumerable<string> that will yield the words in each line:
static class IEnumerableStringExtensions {
public static IEnumerable<string> GetWords(this IEnumerable<string> lines) {
foreach (string line in lines) {
foreach (string word in line.Split(' ')) {
yield return word;
}
}
}
}
Then:
var lr = new LineReader(() => new FileStream("C:/test.txt", FileMode.Open));
var dict = lr.GetWords()
.GroupBy(w => w)
.ToDictionary(w => w.Key, w => w.Count());
foreach (var pair in dict.OrderByDescending(kvp => kvp.Value).Take(10)) {
Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}
Possible Duplicate:
Why is there not a ForEach extension method on the IEnumerable interface?
I've noticed when writing LINQ-y code that .ForEach() is a nice idiom to use. For example, here is a piece of code that takes the following inputs, and produces these outputs:
{ "One" } => "One"
{ "One", "Two" } => "One, Two"
{ "One", "Two", "Three", "Four" } => "One, Two, Three and Four"
And the code:
private string InsertCommasAttempt(IEnumerable<string> words)
{
List<string> wordList = words.ToList();
StringBuilder sb = new StringBuilder();
var wordsAndSeparators = wordList.Select((string word, int pos) =>
{
if (pos == 0) return new { Word = word, Leading = string.Empty };
if (pos == wordList.Count - 1) return new { Word = word, Leading = " and " };
return new { Word = word, Leading = ", " };
});
wordsAndSeparators.ToList().ForEach(v => sb.Append(v.Leading).Append(v.Word));
return sb.ToString();
}
Note the interjected .ToList() before the .ForEach() on the second to last line.
Why is it that .ForEach() isn't available as an extension method on IEnumerable<T>? With an example like this, it just seems weird.
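Because List<T>.ForEach(Action<T>) existed before the LINQ extension methods on IEnumerable<T> did: it shipped with List<T> in .NET 2.0, whereas extension methods arrived with C# 3.0 and .NET 3.5.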
Since it was not added with the other extension methods, one can assume that the C# designers felt it was a bad design and prefer the foreach construct.
Edit:
If you want, you can create your own extension method. It won't override the one on List<T>, but it will work for any other class which implements IEnumerable<T>.
public static class IEnumerableExtensions
{
public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
{
foreach (T item in source)
action(item);
}
}
According to Eric Lippert, this is mostly for philosophical reasons. You should read the whole post, but here's the gist as far as I'm concerned:
I am philosophically opposed to providing such a method, for two reasons.
The first reason is that doing so violates the functional programming principles that all the other sequence operators are based upon. Clearly the sole purpose of a call to this method is to cause side effects.
The purpose of an expression is to compute a value, not to cause a side effect. The purpose of a statement is to cause a side effect. The call site of this thing would look an awful lot like an expression (though, admittedly, since the method is void-returning, the expression could only be used in a “statement expression” context.)
It does not sit well with me to make the one and only sequence operator that is only useful for its side effects.
The second reason is that doing so adds zero new representational power to the language.
Because ForEach() on an IEnumerable would be just a normal foreach loop like this:
foreach (T item in MyEnumerable)
{
    // Action<T> goes here
}
ForEach isn't on IList; it's on List. You were using the concrete List in your example.
I am just guessing here, but putting ForEach on IEnumerable would mean operations on it could have side effects. None of the "available" extension methods cause side effects; putting an imperative method like ForEach on there would muddy the API, I guess. Also, ForEach would force the lazy collection to be enumerated.
Personally, I've been fending off the temptation to just add my own, so as to keep side-effect-free functions separate from ones with side effects.
ForEach is implemented in the concrete class List<T>
Just a guess, but List can iterate over its items without creating an enumerator:
public void ForEach(Action<T> action)
{
if (action == null)
{
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.match);
}
for (int i = 0; i < this._size; i++)
{
action(this._items[i]);
}
}
This can lead to better performance. With IEnumerable, you don't have the option to use an ordinary for-loop.
LINQ follows the pull-model and all its (extension) methods should return IEnumerable<T>, except for ToList(). The ToList() is there to end the pull-chain.
ForEach() is from the push-model world.
You can still write your own extension method to do this, as pointed out by Samuel.
I honestly don't know for sure why the .ForEach(Action) isn't included on IEnumerable but, right, wrong or indifferent, that's the way it is...
I DID however want to highlight the performance issue mentioned in other comments. There is a performance hit based on how you loop over a collection. It is relatively minor but nevertheless, it certainly exists. Here is an incredibly fast and sloppy code snippet to show the relations... only takes a minute or so to run through.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Start Loop timing test: loading collection...");
List<int> l = new List<int>();
for (long i = 0; i < 60000000; i++)
{
l.Add(Convert.ToInt32(i));
}
Console.WriteLine("Collection loaded with {0} elements: start timings",l.Count());
Console.WriteLine("\n<===============================================>\n");
Console.WriteLine("foreach loop test starting...");
DateTime start = DateTime.Now;
//l.ForEach(x => l[x].ToString());
foreach (int x in l)
l[x].ToString();
Console.WriteLine("foreach Loop Time for {0} elements = {1}", l.Count(), DateTime.Now - start);
Console.WriteLine("\n<===============================================>\n");
Console.WriteLine("List.ForEach(x => x.action) loop test starting...");
start = DateTime.Now;
l.ForEach(x => l[x].ToString());
Console.WriteLine("List.ForEach(x => x.action) Loop Time for {0} elements = {1}", l.Count(), DateTime.Now - start);
Console.WriteLine("\n<===============================================>\n");
Console.WriteLine("for loop test starting...");
start = DateTime.Now;
int count = l.Count();
for (int i = 0; i < count; i++)
{
l[i].ToString();
}
Console.WriteLine("for Loop Time for {0} elements = {1}", l.Count(), DateTime.Now - start);
Console.WriteLine("\n<===============================================>\n");
Console.WriteLine("\n\nPress Enter to continue...");
Console.ReadLine();
    }
}
Don't get hung up on this too much though. Performance is the currency of application design but unless your application is experiencing an actual performance hit that is causing usability problems, focus on coding for maintainability and reuse since time is the currency of real life business projects...
It's called "Select" on IEnumerable<T>
I am enlightened, thank you.
I'm still trying to find where I would use the "yield" keyword in a real situation.
I see this thread on the subject
What is the yield keyword used for in C#?
but in the accepted answer, they have this as an example where someone is iterating around Integers()
public IEnumerable<int> Integers()
{
yield return 1;
yield return 2;
yield return 4;
yield return 8;
yield return 16;
yield return 16777216;
}
But why not just use
List<int>
here instead? It seems more straightforward.
If you build and return a List (say it has 1 million elements), that's a big chunk of memory, and also of work to create it.
Sometimes the caller may only want to know what the first element is. Or they might want to write them to a file as they get them, rather than building the whole list in memory and then writing it to a file.
That's why it makes more sense to use yield return. It doesn't look that different to building the whole list and returning it, but it's very different because the whole list doesn't have to be created in memory before the caller can look at the first item on it.
When the caller says:
foreach (int i in Integers())
{
// do something with i
}
Each time the loop requires a new i, it runs a bit more of the code in Integers(). The code in that function is "paused" when it hits a yield return statement.
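One way to watch this pausing happen is to log from both sides, as in the sketch below. The Log list, the Run method, and the shortened two-item Integers() are assumptions added for the demonstration.

```csharp
using System;
using System.Collections.Generic;

public static class YieldTraceDemo
{
    public static readonly List<string> Log = new List<string>();

    // Each Log.Add runs only when the caller asks for the next item.
    public static IEnumerable<int> Integers()
    {
        Log.Add("producing 1");
        yield return 1;
        Log.Add("producing 2");
        yield return 2;
        Log.Add("done");
    }

    // Consumes the sequence and returns the interleaved trace.
    public static List<string> Run()
    {
        Log.Clear();
        foreach (int i in Integers())
        {
            Log.Add("consumed " + i);
        }
        return Log;
    }

    public static void Main()
    {
        foreach (string entry in Run())
            Console.WriteLine(entry);
        // Prints, in order: producing 1, consumed 1, producing 2, consumed 2, done
    }
}
```

The producer and consumer entries alternate, which shows that Integers() does not run to completion before the loop body starts.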
Yield allows you to build methods that produce data without having to gather everything up before returning. Think of it as returning multiple values along the way.
Here's a couple of methods that illustrate the point
public IEnumerable<String> LinesFromFile(String fileName)
{
using (StreamReader reader = new StreamReader(fileName))
{
String line;
while ((line = reader.ReadLine()) != null)
yield return line;
}
}
public IEnumerable<String> LinesWithEmails(IEnumerable<String> lines)
{
foreach (String line in lines)
{
if (line.Contains("@"))
yield return line;
}
}
Neither of these two methods will read the whole contents of the file into memory, yet you can use them like this:
foreach (String lineWithEmail in LinesWithEmails(LinesFromFile("test.txt")))
Console.Out.WriteLine(lineWithEmail);
You can use yield to build any iterator. That could be a lazily evaluated series (reading lines from a file or database, for example, without reading everything at once, which could be too much to hold in memory), or could be iterating over existing data such as a List<T>.
C# in Depth has a free chapter (6) all about iterator blocks.
I also blogged very recently about using yield for smart brute-force algorithms.
For an example of the lazy file reader:
static IEnumerable<string> ReadLines(string path) {
using (StreamReader reader = File.OpenText(path)) {
string line;
while ((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
This is entirely "lazy"; nothing is read until you start enumerating, and only a single line is ever held in memory.
Note that LINQ-to-Objects makes extensive use of iterator blocks (yield). For example, the Where extension is essentially:
static IEnumerable<T> Where<T>(this IEnumerable<T> data, Func<T, bool> predicate) {
foreach (T item in data) {
if (predicate(item)) yield return item;
}
}
And again, fully lazy - allowing you to chain together multiple operations without forcing everything to be loaded into memory.
yield allows you to process collections that are potentially infinite in size, because the entire collection is never loaded into memory in one go, unlike a List-based approach. For instance, an IEnumerable<> of all the prime numbers could be backed by an appropriate algorithm for finding the primes, whereas a List approach would always be finite in size and therefore incomplete. In this example, using yield also allows processing of the next element to be deferred until it is required.
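A small sketch of such a prime sequence, using plain trial division (my choice of algorithm for brevity, not necessarily an efficient one), might look like this. The sequence never ends on its own, so the caller decides how much of it to take.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrimesDemo
{
    // Conceptually endless sequence of primes. Int overflow is ignored
    // here, so this is only meaningful well below int.MaxValue.
    public static IEnumerable<int> Primes()
    {
        for (int candidate = 2; ; candidate++)
        {
            bool isPrime = true;
            for (int divisor = 2; divisor * divisor <= candidate; divisor++)
            {
                if (candidate % divisor == 0)
                {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) yield return candidate;
        }
    }

    public static void Main()
    {
        // Only as many candidates are tested as are needed for five primes.
        foreach (int p in Primes().Take(5))
            Console.WriteLine(p); // 2, 3, 5, 7, 11
    }
}
```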
A real situation for me is when I want to process, more smoothly, a collection that takes a while to populate.
Imagine something along these lines (pseudocode):
public IEnumerable<VerboseUserInfo> GetAllUsers()
{
    foreach (var userId in userLookupList)
    {
        VerboseUserInfo info = new VerboseUserInfo();
        info.Load(ActiveDirectory.GetLotsOfUserData(userId));
        info.Load(WebService.GetSomeMoreInfo(userId));
        yield return info;
    }
}
Instead of having to wait a minute for the collection to populate before I can start processing items in it, I am able to start immediately and report back to the user interface as it happens.
You may not always want to use yield instead of returning a list, and in your example you use yield to actually return a list of integers. Depending on whether you want a mutable list or an immutable sequence, you could use a list or an iterator (or some other mutable/immutable collection).
But there are benefits to use yield.
Yield provides an easy way to build lazy evaluated iterators. (Meaning only the code to get next element in sequence is executed when the MoveNext() method is called then the iterator returns doing no more computations, until the method is called again)
Yield builds a state machine under the covers, and this saves you a lot of work by not having to code the states of your generic generator => more concise/simple code.
Yield automatically builds optimized and thread safe iterators, sparing you the details on how to build them.
Yield is much more powerful than it seems at first sight and can be used for much more than just building simple iterators. Check out this video to see Jeffrey Richter and his AsyncEnumerator, and how yield is used to make coding with the async pattern easy.
You might want to iterate through various collections:
public IEnumerable<ICustomer> Customers()
{
foreach( ICustomer customer in m_maleCustomers )
{
yield return customer;
}
foreach( ICustomer customer in m_femaleCustomers )
{
yield return customer;
}
// or add some constraints...
foreach( ICustomer customer in m_customers )
{
if( customer.Age < 16 )
{
yield return customer;
}
}
// Or....
if( DateTime.Today.Day == 1 ) // first day of the month
{
yield return m_superCustomer;
}
}
I agree with everything everyone has said here about lazy evaluation and memory usage and wanted to add another scenario where I have found the iterators using the yield keyword useful. I have run into some cases where I have to do a sequence of potentially expensive processing on some data where it is extremely useful to use iterators. Rather than processing the entire file immediately, or rolling my own processing pipeline, I can simply use iterators something like this:
IEnumerable<double> GetListFromFile(int idxItem)
{
// read data from file
return dataReadFromFile;
}
IEnumerable<double> ConvertUnits(IEnumerable<double> items)
{
foreach(double item in items)
yield return convertUnits(item);
}
IEnumerable<double> DoExpensiveProcessing(IEnumerable<double> items)
{
foreach(double item in items)
yield return expensiveProcessing(item);
}
IEnumerable<double> GetNextList()
{
return DoExpensiveProcessing(ConvertUnits(GetListFromFile(curIdx++)));
}
The advantage here is that by keeping the input and output of all the functions as IEnumerable<double>, my processing pipeline is completely composable, easy to read, and lazily evaluated, so I only do the processing I really need to do. This lets me put almost all of my processing in the GUI thread without impacting responsiveness, so I don't have to worry about any threading issues.
I came up with this to work around a .NET shortcoming: having to deep copy a List manually.
I use this:
static public IEnumerable<SpotPlacement> CloneList(List<SpotPlacement> spotPlacements)
{
foreach (SpotPlacement sp in spotPlacements)
{
yield return (SpotPlacement)sp.Clone();
}
}
And at another place:
public object Clone()
{
OrderItem newOrderItem = new OrderItem();
...
newOrderItem._exactPlacements.AddRange(SpotPlacement.CloneList(_exactPlacements));
...
return newOrderItem;
}
I tried to come up with a one-liner that does this, but it's not possible, because yield does not work inside anonymous method blocks.
EDIT:
Better still, use a generic List cloner:
class Utility<T> where T : ICloneable
{
static public IEnumerable<T> CloneList(List<T> tl)
{
foreach (T t in tl)
{
yield return (T)t.Clone();
}
}
}
The method used by yield of saving memory by processing items on-the-fly is nice, but really it's just syntactic sugar. It's been around for a long time. In any language that has function or interface pointers (even C and assembly) you can get the same effect using a callback function / interface.
This fancy stuff:
static IEnumerable<string> GetItems()
{
yield return "apple";
yield return "orange";
yield return "pear";
}
foreach(string item in GetItems())
{
Console.WriteLine(item);
}
is basically equivalent to old-fashioned:
interface ItemProcessor
{
void ProcessItem(string s);
};
class MyItemProcessor : ItemProcessor
{
public void ProcessItem(string s)
{
Console.WriteLine(s);
}
};
static void ProcessItems(ItemProcessor processor)
{
processor.ProcessItem("apple");
processor.ProcessItem("orange");
processor.ProcessItem("pear");
}
ProcessItems(new MyItemProcessor());
Whenever I think I can use the yield keyword, I take a step back and look at how it will impact my project. I always end up returning a collection instead of yielding, because I feel the overhead of maintaining the state of the yielding method doesn't buy me much. In almost all cases where I return a collection, I feel that 90% of the time the calling method will either iterate over all elements in the collection or seek a series of elements throughout the entire collection.
I do understand its usefulness in LINQ, but I feel that only the LINQ team is writing such complex queryable objects that yield is useful.
Has anyone written anything like or not like linq where yield was useful?
Note that with yield, you are iterating over the collection once, but when you build a list, you'll be iterating over it twice.
Take, for example, a filter iterator:
static IEnumerable<T> Filter<T>(this IEnumerable<T> coll, Func<T, bool> func)
{
foreach(T t in coll)
if (func(t)) yield return t;
}
Now, you can chain this:
MyColl.Filter(x=> x.id > 100).Filter(x => x.val < 200).Filter (etc)
Your method would be creating (and tossing) three lists. My method iterates over it just once.
Also, when you return a collection, you are forcing a particular implementation on your users. An iterator is more generic.
I do understand its usefulness in LINQ, but I feel that only the LINQ team is writing such complex queryable objects that yield is useful.
Yield was useful as soon as it got implemented in .NET 2.0, which was long before anyone ever thought of LINQ.
Why would I write this function:
IList<string> LoadStuff() {
var ret = new List<string>();
foreach(var x in SomeExternalResource)
ret.Add(x);
return ret;
}
When I can use yield, and save the effort and complexity of creating a temporary list for no good reason:
IEnumerable<string> LoadStuff() {
foreach(var x in SomeExternalResource)
yield return x;
}
It can also have huge performance advantages. If your code only happens to use the first 5 elements of the collection, then using yield will often avoid the effort of loading anything past that point. If you build a collection then return it, you waste a ton of time and space loading things you'll never need.
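A quick way to convince yourself of the "first 5 elements" point is to count how many items actually get produced. This is a sketch only: LoadDemo, LoadCount and FirstFive are invented names, and the loop stands in for whatever SomeExternalResource really is.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LoadDemo
{
    // How many items the iterator has actually produced.
    public static int LoadCount;

    // Stand-in for an expensive external resource (hypothetical).
    public static IEnumerable<string> LoadStuff()
    {
        for (int i = 0; i < 1000000; i++)
        {
            LoadCount++;
            yield return "item " + i;
        }
    }

    // Only pulls as many items from LoadStuff() as Take(5) asks for.
    public static List<string> FirstFive()
    {
        return LoadStuff().Take(5).ToList();
    }
}
```

After calling FirstFive(), LoadCount is 5 rather than 1000000. The version that builds a full List before returning would have produced every item regardless of how many the caller looked at.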
I could go on and on....
I recently had to make a representation of mathematical expressions in the form of an Expression class. When evaluating the expression I have to traverse the tree structure with a post-order treewalk. To achieve this I implemented IEnumerable<T> like this:
public IEnumerator<Expression<T>> GetEnumerator()
{
if (IsLeaf)
{
yield return this;
}
else
{
foreach (Expression<T> expr in LeftExpression)
{
yield return expr;
}
foreach (Expression<T> expr in RightExpression)
{
yield return expr;
}
yield return this;
}
}
Then I can simply use a foreach to traverse the expression. You can also add a Property to change the traversal algorithm as needed.
At a previous company, I found myself writing loops like this:
for (DateTime date = schedule.StartDate; date <= schedule.EndDate;
date = date.AddDays(1))
With a very simple iterator block, I was able to change this to:
foreach (DateTime date in schedule.DateRange)
It made the code a lot easier to read, IMO.
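The iterator block behind a property like that can be very small. Here is a sketch of what it might look like; the Schedule class shape is assumed from the snippet above (StartDate, EndDate, DateRange), not taken from the original code.

```csharp
using System;
using System.Collections.Generic;

public class Schedule
{
    public DateTime StartDate { get; set; }
    public DateTime EndDate { get; set; }

    // Yields each day from StartDate through EndDate, inclusive.
    public IEnumerable<DateTime> DateRange
    {
        get
        {
            for (DateTime date = StartDate; date <= EndDate; date = date.AddDays(1))
                yield return date;
        }
    }
}
```

The for loop is still there, but it is written once inside the property instead of at every call site.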
yield was developed for C#2 (before Linq in C#3).
We used it heavily in a large enterprise C#2 web application when dealing with data access and heavily repeated calculations.
Collections are great any time you have a few elements that you're going to hit multiple times.
However in lots of data access scenarios you have large numbers of elements that you don't necessarily need to pass round in a great big collection.
This is essentially what the SqlDataReader does - it's a forward only custom enumerator.
What yield lets you do is quickly and with minimal code write your own custom enumerators.
Everything yield does could be done in C#1 - it just took reams of code to do it.
Linq really maximises the value of the yield behaviour, but it certainly isn't the only application.
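To give a feel for the "reams of code" point, here is a rough side-by-side sketch of the same countdown sequence written by hand in the C# 1 style and with yield. All the type names (CountdownEnumerable, CountdownEnumerator, Countdown) are invented for the example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hand-written, C# 1 style: the iteration state lives in explicit fields.
public class CountdownEnumerable : IEnumerable
{
    private readonly int _start;

    public CountdownEnumerable(int start) { _start = start; }

    public IEnumerator GetEnumerator() { return new CountdownEnumerator(_start); }

    private class CountdownEnumerator : IEnumerator
    {
        private readonly int _start;
        private int _current;

        public CountdownEnumerator(int start)
        {
            _start = start;
            _current = start + 1;   // positioned "before" the first element
        }

        public object Current { get { return _current; } }

        public bool MoveNext()
        {
            if (_current <= 1) return false;   // 1 was the last element, or there were none
            _current--;
            return true;
        }

        public void Reset() { _current = _start + 1; }
    }
}

public static class Countdown
{
    // The same sequence with yield: the compiler generates the state machine.
    public static IEnumerable<int> From(int start)
    {
        for (int i = start; i >= 1; i--)
            yield return i;
    }
}
```

Both new CountdownEnumerable(3) and Countdown.From(3) produce 3, 2, 1 in a foreach. The second one is three lines of logic, and that is for about the simplest sequence imaginable.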
Whenever your function returns IEnumerable you should use "yielding", and not only in .NET 3.0 and later.
.Net 2.0 example:
public static class FuncUtils
{
public delegate T Func<T>();
public delegate T Func<A0, T>(A0 arg0);
public delegate T Func<A0, A1, T>(A0 arg0, A1 arg1);
...
public static IEnumerable<T> Filter<T>(IEnumerable<T> e, Func<T, bool> filterFunc)
{
foreach (T el in e)
if (filterFunc(el))
yield return el;
}
public static IEnumerable<R> Map<T, R>(IEnumerable<T> e, Func<T, R> mapFunc)
{
foreach (T el in e)
yield return mapFunc(el);
}
...
I'm not sure about C#'s implementation of yield, but in dynamic languages it's far more efficient than creating the whole collection. In many cases, it makes it easy to work with datasets much bigger than RAM.
I am a huge yield fan in C#. This is especially true in large homegrown frameworks, where methods or properties often return a List that is a subset of another IEnumerable. The benefits that I see are:
the return value of a method that uses yield is immutable
you are only iterating over the list once
it is a late or lazy execution variable, meaning the code to return the values is not executed until needed (though this can bite you if you don't know what you're doing)
if the source list changes, you don't have to call the method again to get another IEnumerable; you just iterate over the IEnumerable again
many more
One other HUGE benefit of yield is when your method will potentially return millions of values, so many that there is a risk of running out of memory just building the List before the method can even return it. With yield, the method can create and return millions of values one at a time, and memory use stays low as long as the caller doesn't store every value either. So it's good for large-scale data processing and aggregating operations.
Personally, I haven't found myself using yield in my normal day-to-day programming. However, I've recently started playing with the Robotics Studio samples and found that yield is used extensively there, so I also see it being used in conjunction with the CCR (Concurrency and Coordination Runtime), where you have async and concurrency issues.
Anyway, still trying to get my head around it as well.
Yield is useful because it saves you space. Most optimizations in programming make a trade-off between space (disk, memory, networking) and processing. Yield as a programming construct allows you to iterate over a collection many times in sequence without needing a separate copy of the collection for each iteration.
consider this example:
static IEnumerable<Person> GetAllPeople()
{
return new List<Person>()
{
new Person() { Name = "George", Surname = "Bush", City = "Washington" },
new Person() { Name = "Abraham", Surname = "Lincoln", City = "Washington" },
new Person() { Name = "Joe", Surname = "Average", City = "New York" }
};
}
static IEnumerable<Person> GetPeopleFrom(this IEnumerable<Person> people, string where)
{
foreach (var person in people)
{
if (person.City == where) yield return person;
}
yield break;
}
static IEnumerable<Person> GetPeopleWithInitial(this IEnumerable<Person> people, string initial)
{
foreach (var person in people)
{
if (person.Name.StartsWith(initial)) yield return person;
}
yield break;
}
static void Main(string[] args)
{
var people = GetAllPeople();
foreach (var p in people.GetPeopleFrom("Washington"))
{
// do something with washingtonites
}
foreach (var p in people.GetPeopleWithInitial("G"))
{
// do something with people with initial G
}
foreach (var p in people.GetPeopleWithInitial("P").GetPeopleFrom("New York"))
{
// etc
}
}
(Obviously you are not required to use yield with extension methods, it just creates a powerful paradigm to think about data.)
As you can see, if you have a lot of these "filter" methods (but it can be any kind of method that does some work on a list of people) you can chain many of them together without requiring extra storage space for each step. This is one way of raising the programming language (C#) up to express your solutions better.
The first side-effect of yield is that it delays execution of the filtering logic until you actually require it. If you therefore create a variable of type IEnumerable<> (with yields) but never iterate through it, you never execute the logic or consume the space which is a powerful and free optimization.
The other side-effect is that yield operates on the lowest common collection interface (IEnumerable<>) which enables the creation of library-like code with wide applicability.
Note that yield allows you to do things in a "lazy" way. By lazy, I mean that the evaluation of the next element in the IEnumerable is not done until the element is actually requested. This gives you the power to do a couple of different things. First, you could yield an infinitely long list without needing to actually make infinite calculations. Second, you could return an enumeration of function applications; the functions would only be applied when you iterate through the list.
I've used yield in non-LINQ code for things like this (assuming the functions do not live in the same class):
public IEnumerable<string> GetData()
{
foreach(String name in _someInternalDataCollection)
{
yield return name;
}
}
...
public void DoSomething()
{
foreach(String value in GetData())
{
//... Do something with value that doesn't modify _someInternalDataCollection
}
}
You have to be careful not to inadvertently modify the collection that your GetData() function is iterating over though, or it will throw an exception.
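Here is a small sketch of how that exception shows up. NameStore, Add and ModifyWhileIterating are hypothetical names; the backing collection is assumed to be a List<string>, whose enumerator throws InvalidOperationException once the list has been changed underneath it.

```csharp
using System;
using System.Collections.Generic;

public class NameStore
{
    private readonly List<string> _names = new List<string> { "Ann", "Bob", "Cy" };

    public IEnumerable<string> GetData()
    {
        foreach (string name in _names)
            yield return name;
    }

    public void Add(string name) { _names.Add(name); }

    // Returns true if changing the list mid-iteration raised InvalidOperationException.
    public bool ModifyWhileIterating()
    {
        try
        {
            foreach (string name in GetData())
                Add(name + "!");   // changes _names while GetData() is still enumerating it
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

Because the iterator is lazy, the inner foreach over _names is still "open" while the caller's loop body runs, so the next MoveNext() notices the change. If the loop body really does need to modify the collection, one option is to iterate over a snapshot instead, for example new List<string>(store.GetData()).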
Yield is very useful in general. It's in Ruby, among other languages that support functional-style programming, so it's not as if it's tied to LINQ. It's more the other way around: LINQ is functional in style, so it uses yield.
I had a problem where my program was using a lot of cpu in some background tasks. What I really wanted was to still be able to write functions like normal, so that I could easily read them (i.e. the whole threading vs. event based argument). And still be able to break the functions up if they took too much cpu. Yield is perfect for this. I wrote a blog post about this and the source is available for all to grok :)
The System.Linq IEnumerable extensions are great, but sometimes you want more. For example, consider the following extension:
public static class CollectionSampling
{
public static IEnumerable<T> Sample<T>(this IEnumerable<T> coll, int max)
{
var rand = new Random();
using (var enumerator = coll.GetEnumerator())
{
while (enumerator.MoveNext())
{
yield return enumerator.Current;
int currentSample = rand.Next(max);
for (int i = 1; i <= currentSample; i++)
enumerator.MoveNext();
}
}
}
}
Another interesting advantage of yielding is that the caller cannot cast the return value to the original collection type and modify your internal collection.
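A small sketch of the difference, using made-up names (Inventory, ItemsDirect, ItemsYielded): returning the list itself lets a caller cast it back, while returning an iterator does not.

```csharp
using System;
using System.Collections.Generic;

public class Inventory
{
    private readonly List<string> _items = new List<string> { "apple", "orange" };

    // Hands out the list object itself: a caller can cast it back and mutate it.
    public IEnumerable<string> ItemsDirect()
    {
        return _items;
    }

    // Hands out a compiler-generated iterator: there is no List<string> to cast to.
    public IEnumerable<string> ItemsYielded()
    {
        foreach (string item in _items)
            yield return item;
    }

    public int Count { get { return _items.Count; } }
}
```

With the first method, ((List<string>)inventory.ItemsDirect()).Clear() compiles and empties the internal list. With the second, inventory.ItemsYielded() as List<string> is simply null, so the only thing a caller can do is read the items.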