How can I repeat a whole IEnumerable multiple times?
Similar to Python:
> print ['x', 'y'] * 3
['x', 'y', 'x', 'y', 'x', 'y']
You could use plain LINQ to do this:
var repeated = Enumerable.Repeat(original, 3)
.SelectMany(x => x);
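For example, mirroring the Python snippet above (a quick sketch):
var original = new[] { "x", "y" };
Console.WriteLine(string.Join(", ", Enumerable.Repeat(original, 3).SelectMany(x => x)));
// prints: x, y, x, y, x, y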
Or you could just write an extension method:
public static IEnumerable<T> Repeat<T>(this IEnumerable<T> source,
int count)
{
for (int i = 0; i < count; i++)
{
foreach (var item in source)
{
yield return item;
}
}
}
Note that in both cases, the sequence would be re-evaluated on each "repeat" - so it could potentially give different results, or even a different length... Indeed, some sequences can't be evaluated more than once. Basically, you need to be careful about what you apply this to:
If you know the sequence can only be evaluated once (or you want consistent results from an inconsistent sequence) but you're happy to buffer the results in memory, call ToList() first and call Repeat on that (see the sketch after this list)
If you know the sequence can be consistently evaluated multiple times, just call Repeat as above
If you're in neither of the above situations, there's fundamentally no way of repeating the sequence elements multiple times
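For the first case, a minimal sketch (GetVolatileSequence is a hypothetical once-only source; Repeat is the extension method above):
var buffered = GetVolatileSequence().ToList(); // evaluate once, buffer in memory
var repeated = buffered.Repeat(3); // each pass now replays the buffered copy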
Related
Let's assume you have a function that returns a lazily-enumerated object:
struct AnimalCount
{
public int Chickens;
public int Goats;
}
IEnumerable<AnimalCount> FarmsInEachPen()
{
....
yield return new AnimalCount(x, y);
....
}
You also have two functions that consume two separate IEnumerables, for example:
void ConsumeChicken(IEnumerable<int> chickens);
void ConsumeGoat(IEnumerable<int> goats);
How can you call ConsumeChicken and ConsumeGoat without a) converting FarmsInEachPen() with ToList() beforehand, because it might have two zillion records, and b) using multi-threading?
Basically:
ConsumeChicken(FarmsInEachPen().Select(x => x.Chickens));
ConsumeGoat(FarmsInEachPen().Select(x => x.Goats));
But without forcing the double enumeration.
I can solve it with multithreading, but it gets unnecessarily complicated with a buffer queue for each list.
So I'm looking for a way to split the AnimalCount enumerator into two int enumerators without fully evaluating AnimalCount. There is no problem running ConsumeGoat and ConsumeChicken together in lock-step.
I can feel the solution just out of my grasp but I'm not quite there. I'm thinking along the lines of a helper function that returns an IEnumerable that is fed into ConsumeChicken, and each time the iterator is used it internally advances ConsumeGoat, executing the two functions in lock-step. Except, of course, I don't want to call ConsumeGoat more than once...
I don't think there is a way to do what you want, since ConsumeChicken(IEnumerable<int>) and ConsumeGoat(IEnumerable<int>) are called sequentially, each of them enumerating the list separately - how do you expect that to work without two separate enumerations of the list?
Depending on the situation, a better solution is to have ConsumeChicken(int) and ConsumeGoat(int) methods (each consuming a single item), and call them in alternation, like this:
foreach(var animal in animals)
{
ConsumeChicken(animal.Chickens);
ConsumeGoat(animal.Goats);
}
This will enumerate the animals collection only once.
Also, a note: depending on your LINQ provider and what exactly you're trying to do, there may be better options. For example, if you're trying to get the total sum of both chickens and goats from a database using LINQ to SQL or LINQ to Entities, the following query...
from a in animals
group a by 0 into g
select new
{
TotalChickens = g.Sum(x => x.Chickens),
TotalGoats = g.Sum(x => x.Goats)
}
will result in a single query, and do the summation on the database-end, which is greatly preferable to pulling the entire table over and doing the summation on the client end.
The way you have posed your problem, there is no way to do this. IEnumerable<T> is a pull enumerable - that is, you call GetEnumerator to get to the front of the sequence and then repeatedly ask "give me the next item" (MoveNext/Current). You can't, on one thread, have two different consumers pulling from animals.Select(a => a.Chickens) and animals.Select(a => a.Goats) at the same time. You would have to do one then the other (which would require materializing the second).
The suggestion BlueRaja made is one way to change the problem slightly. I would suggest going that route.
The other alternative is to utilize IObservable<T> from Microsoft's reactive extensions (Rx), a push enumerable. I won't go into the details of how you would do that, but it's something you could look into.
Edit:
The above is assuming that ConsumeChickens and ConsumeGoats are both returning void or are at least not returning IEnumerable<T> themselves - which seems like an obvious assumption. I'd appreciate it if the lame downvoter would actually comment.
Actually the simplest way to achieve what you want is to convert the FarmsInEachPen return value into a push collection (IObservable) and use Reactive Extensions (Rx) to work with it:
var observable = new Subject<AnimalCount>();
// Do(...) alone never runs; each pipeline needs a subscription to activate it.
observable.Subscribe(x => DoSomethingWithChicken(x.Chickens));
observable.Subscribe(x => DoSomethingWithGoat(x.Goats));
foreach (var item in FarmsInEachPen())
{
    observable.OnNext(item);
}
observable.OnCompleted();
I figured it out, thanks in large part to the path that @Lee put me on.
You need to share a single enumerator between the two zips, and use an adapter function to project the correct element into the sequence.
private static IEnumerable<object> ConsumeChickens(IEnumerable<int> xList)
{
foreach (var x in xList)
{
Console.WriteLine("X: " + x);
yield return null;
}
}
private static IEnumerable<object> ConsumeGoats(IEnumerable<int> yList)
{
foreach (var y in yList)
{
Console.WriteLine("Y: " + y);
yield return null;
}
}
private static IEnumerable<int> SelectHelper(IEnumerator<AnimalCount> enumerator, int i)
{
bool c = i != 0 || enumerator.MoveNext();
while (c)
{
if (i == 0)
{
yield return enumerator.Current.Chickens;
c = enumerator.MoveNext();
}
else
{
yield return enumerator.Current.Goats;
}
}
}
private static void Main(string[] args)
{
var enumerator = FarmsInEachPen().GetEnumerator();
var chickensList = ConsumeChickens(SelectHelper(enumerator, 0));
var goatsList = ConsumeGoats(SelectHelper(enumerator, 1));
var temp = chickensList.Zip(goatsList, (i, i1) => (object) null);
temp.ToList(); // forces evaluation, driving both consumers in lock-step
}
Given this code:
IEnumerable<object> FilteredList()
{
foreach( object item in FullList )
{
if( IsItemInPartialList( item ) )
yield return item;
}
}
Why should I not just code it this way?
IEnumerable<object> FilteredList()
{
var list = new List<object>();
foreach( object item in FullList )
{
if( IsItemInPartialList( item ) )
list.Add(item);
}
return list;
}
I sort of understand what the yield keyword does. It tells the compiler to build a certain kind of thing (an iterator). But why use it? Apart from it being slightly less code, what's it do for me?
Using yield makes the collection lazy.
Let's say you just need the first five items. Your way, I have to loop through the entire list to get the first five items. With yield, I only loop through the first five items.
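To illustrate with the FilteredList method from the question (a sketch):
// Only enough of FullList is examined to produce five matching items.
var firstFive = FilteredList().Take(5).ToList();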
The benefit of iterator blocks is that they work lazily. So you can write a filtering method like this:
public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
Func<T, bool> predicate)
{
foreach (var item in source)
{
if (predicate(item))
{
yield return item;
}
}
}
That will allow you to filter a stream as long as you like, never buffering more than a single item at a time. If you only need the first value from the returned sequence, for example, why would you want to copy everything into a new list?
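For instance (a sketch, where source is any large IEnumerable<int>):
// Stops pulling from source as soon as the first match is found.
var firstEven = source.Where(n => n % 2 == 0).First();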
As another example, you can easily create an infinite stream using iterator blocks. For example, here's a sequence of random numbers:
public static IEnumerable<int> RandomSequence(int minInclusive, int maxExclusive)
{
Random rng = new Random();
while (true)
{
yield return rng.Next(minInclusive, maxExclusive);
}
}
How would you store an infinite sequence in a list?
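You can't - but you can consume one lazily. For example (a short sketch):
foreach (var roll in RandomSequence(1, 7).Take(3))
{
    Console.WriteLine(roll); // prints three die rolls, then the generator stops
}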
My Edulinq blog series gives a sample implementation of LINQ to Objects which makes heavy use of iterator blocks. LINQ is fundamentally lazy where it can be - and putting things in a list simply doesn't work that way.
With the "list" code, you have to process the full list before you can pass it on to the next step. The "yield" version passes the processed item immediately to the next step. If that "next step" contains a ".Take(10)" then the "yield" version will only process the first 10 items and forget about the rest. The "list" code would have processed everything.
This means that you see the most difference when you need to do a lot of processing and/or have long lists of items to process.
You can use yield to return items that aren't in a list. Here's a little sample that iterates infinitely, cycling through the digits 0-9 until canceled.
public IEnumerable<int> GetNextNumber()
{
while (true)
{
for (int i = 0; i < 10; i++)
{
yield return i;
}
}
}
public bool Canceled { get; set; }
public void StartCounting()
{
foreach (var number in GetNextNumber())
{
if (this.Canceled) break;
Console.WriteLine(number);
}
}
This writes
0
1
2
3
4
5
6
7
8
9
0
1
2
3
4
...etc. to the console until canceled.
object jamesItem = null;
foreach(var item in FilteredList())
{
if (item.Name == "James")
{
jamesItem = item;
break;
}
}
return jamesItem;
When the above code is used to loop through FilteredList(), and assuming item.Name == "James" will be satisfied by the 2nd item in the list, the method using yield will only yield twice. This is lazy behavior.
Whereas the method using a list will add all n objects to the list and pass the complete list to the calling method.
This is exactly a use case where the difference between IEnumerable and IList can be highlighted.
The best real world example I've seen for the use of yield would be to calculate a Fibonacci sequence.
Consider the following code:
class Program
{
static void Main(string[] args)
{
Console.WriteLine(string.Join(", ", Fibonacci().Take(10)));
Console.WriteLine(string.Join(", ", Fibonacci().Skip(15).Take(1)));
Console.WriteLine(string.Join(", ", Fibonacci().Skip(10).Take(5)));
Console.WriteLine(string.Join(", ", Fibonacci().Skip(100).Take(1)));
Console.ReadKey();
}
private static IEnumerable<long> Fibonacci()
{
long a = 0;
long b = 1;
while (true)
{
long temp = a;
a = b;
yield return a;
b = temp + b;
}
}
}
This will return:
1, 1, 2, 3, 5, 8, 13, 21, 34, 55
987
89, 144, 233, 377, 610
1298777728820984005
This is nice because it allows you to calculate an infinite series quickly and easily, giving you the ability to use the LINQ extensions and query only what you need. (Note that long overflows silently past the 92nd Fibonacci number, so the Skip(100) result above has wrapped around.)
why use [yield]? Apart from it being slightly less code, what's it do for me?
Sometimes it is useful, sometimes not. If the entire set of data must be examined and returned, then there is no benefit in using yield because all it does is introduce overhead.
When yield really shines is when only a partial set is returned. I think the best example is sorting. Assume you have a list of objects containing a date and a dollar amount from this year, and you would like to see the first handful (5) of records for the year.
In order to accomplish this, the list must be sorted ascending by date and then have the first 5 taken. If this was done without yield, the entire list would have to be sorted, right up to making sure the last two dates were in order.
However, with yield the sort is deferred until the results are actually enumerated, and newer LINQ to Objects implementations can even stop the sorting work early when only the first few items are taken. This can save a large amount of time.
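A minimal sketch of that scenario (records is a hypothetical list with a Date property):
// The sort is deferred until enumeration; Take(5) limits what must be produced.
var firstFive = records.OrderBy(r => r.Date).Take(5).ToList();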
The yield return statement allows you to return one item at a time. By collecting all the items in a list and then returning that list, you add a memory overhead.
In LINQ, Where is a streaming operator, whereas OrderByDescending is a non-streaming operator. AFAIK, a streaming operator only gathers the next item when necessary; a non-streaming operator evaluates the entire data stream at once.
I fail to see the relevance of defining a streaming operator. To me, it is redundant with deferred execution. Take the example where I have written a custom extension and consumed it using the Where operator and OrderByDescending.
public static class ExtensionStuff
{
public static IEnumerable<int> Where(this IEnumerable<int> sequence, Func<int, bool> predicate)
{
foreach (int i in sequence)
{
if (predicate(i))
{
yield return i;
}
}
}
}
public static void Main()
{
TestLinq3();
}
private static void TestLinq3()
{
int[] items = { 1, 2, 3, 4 };
var selected = items.Where(i => i < 3)
.OrderByDescending(i => i);
Write(selected);
}
private static void Write(IEnumerable<int> selected)
{
foreach(var i in selected)
Console.WriteLine(i);
}
In either case, Where needs to evaluate each element in order to determine which elements meet the condition. The fact that it yields seems to only become relevant because the operator gains deferred execution.
So, what is the importance of Streaming Operators?
There are two aspects: speed and memory.
The speed aspect becomes more apparent when you use a method like .Take() to only consume a portion of the original result set.
// Consumes ten elements, yields 5 results.
Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.Take(5)
.ToList();
// Consumes one million elements, yields 5 results.
Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.OrderByDescending(i => i)
.Take(5)
.ToList();
Because the first example uses only streaming operators before the call to Take, you only end up evaluating the values 1 through 10 (yielding the five even ones) before Take stops. Furthermore, only one value is loaded into memory at a time, so you have a very small memory footprint.
In the second example, OrderByDescending is not streaming, so the moment Take pulls the first item, the entire result that's passed through the Where filter has to be placed in memory for sorting. This could take a long time and produce a big memory footprint.
Even if you weren't using Take, the memory issue can be important. For example:
// Puts half a million elements in memory, sorts, then outputs them.
var numbers = Enumerable.Range(1, 1000000).Where(i => i % 2 == 0)
.OrderByDescending(i => i);
foreach(var number in numbers) Console.WriteLine(number);
// Puts one element in memory at a time.
var numbers = Enumerable.Range(1, 1000000).Where(i => i % 2 == 0);
foreach(var number in numbers) Console.WriteLine(number);
The fact that it yields seems to only become relevant because the
operator gains deferred execution.
So, what is the importance of Streaming Operators?
This is the importance: you could not process infinite sequences with buffering / non-streaming extension methods, while you can consume such a sequence (until you abort) just fine using only streaming extension methods.
Take for example this method:
public IEnumerable<int> GetNumbers(int start)
{
int num = start;
while(true)
{
yield return num;
num++;
}
}
You can use Where just fine:
foreach (var num in GetNumbers(0).Where(x => x % 2 == 0))
{
Console.WriteLine(num);
}
OrderBy() would not work in this case since it would have to exhaustively enumerate the results before emitting a single number.
Just to be explicit: in the case you mentioned, there's no advantage to the fact that Where streams, since the OrderBy sucks the whole thing in anyway. There are, however, times when the advantages of streaming are used (other answers/comments have given examples), so all LINQ operators stream to the best of their ability. OrderBy streams as much as it can, which happens to be not very much; Where streams very effectively.
Is it possible to implement McCarthy's amb-operator for non-deterministic choice in C#?
Apparently .NET lacks continuation support, but yield return could be useful. Would this be possible in other statically-typed .NET languages like F#?
Yes, yield return does a form of continuation. Although for many useful cases, Linq provides functional operators that allow you to plug together a lazy sequence generator, so in fact in C# 3 it isn't necessary to use yield return so much (except when adding more Linq-style extensions of your own to plug gaps in the library, e.g. Zip, Unfold).
In the example we factorise an integer by brute force. Essentially the same example in C# can be done with the built-in Linq operators:
var factors = Enumerable.Range(2, 100)
.Join(Enumerable.Range(2, 100),
n => 1, n => 1, (i, j) => new { i, j })
.First(v => v.i*v.j == 481);
Console.WriteLine("Factors are " + factors.i + ", " + factors.j);
Here the starting points are my two calls to Enumerable.Range, which is built-in to Linq but you could implement yourself as:
IEnumerable<int> Range(int start, int count)
{
    for (int n = start; n < start + count; n++)
        yield return n;
}
There are two odd parameters, the n => 1, n => 1 parameters to Join. I'm picking 1 as the key value for Join to use when matching up items, hence all combinations will match and so I get to test every combination of numbers from the ranges.
Then I turn the pair of values into a kind of tuple (an anonymous type) with:
(i, j) => new { i, j }
Finally, I pick the first such tuple for which my test is satisfied:
.First(v => v.i*v.j == 481);
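The same cross-join-and-test can also be written as a small amb-style helper with plain nested loops - a sketch, not McCarthy's full backtracking operator (Amb is an illustrative name):
static Tuple<int, int> Amb(IEnumerable<int> xs, IEnumerable<int> ys,
                           Func<int, int, bool> test)
{
    // Lazily try every combination; return the first one the test accepts.
    // Note: ys must be safe to enumerate multiple times.
    foreach (var x in xs)
        foreach (var y in ys)
            if (test(x, y))
                return Tuple.Create(x, y);
    throw new InvalidOperationException("amb: no combination satisfies the test");
}
// Usage:
// var factors = Amb(Enumerable.Range(2, 100), Enumerable.Range(2, 100),
//                   (i, j) => i * j == 481);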
Update
The code inside the call to First need not be merely a short test expression. It can be a whole lot of imperative code which needs to be "restarted" if the test fails:
.First(v =>
{
    Console.WriteLine("Aren't lambdas powerful things?");
    return v.i*v.j == 481;
});
So the part of the program that potentially needs to be restarted with different values goes in that lambda. Whenever that lambda wants to restart itself with different values, it just returns false - the equivalent of calling amb with no arguments.
This is not an answer to your question, but it may get you what you want.
amb is used for nondeterministic computing. As you may know, Prolog is a nondeterministic language using the notion of unification to bind values to variables (basically what amb ends up doing).
There IS an implementation of this functionality in C#, called YieldProlog. As you guessed, the yield keyword is an important prerequisite for this.
http://yieldprolog.sourceforge.net/
I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am caching the collection in a static field; it's up to 50 MB in size, and I always run the LINQ query against the cached collection. My problem is as follows:
Regardless of collection type, performance is much poorer than SQL (up to 200 ms). A similar query against the underlying SQL tables is much quicker (5-10 ms). I have implemented my LINQ queries as follows:
public static string ReturnSomething(string query, int limit)
{
StringBuilder sb = new StringBuilder();
foreach (var stringitem in MyCollection.Where(
    x => x.StartsWith(query) && x.Length > query.Length).Take(limit))
{
sb.Append(stringitem);
}
return sb.ToString();
}
It is my understanding that HashSet and Dictionary implement lookups using hashing, and SortedDictionary uses a binary search tree, instead of the standard enumeration. What are my options for high-performance LINQ queries into these advanced collection types?
In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections; you are using them the same way that you would use a List, which is why you don't see any difference in performance.
If you use a dictionary as an index, where the first few characters of the string are the key and a list of strings is the value, you can use the search string to pick out the small part of the entire collection that has possible matches.
I wrote the class below to test this. If I populate it with a million strings and search with an eight-character string, it rips through all possible matches in about 3 ms. Searching with a one-character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one-character string takes about 25 ms.
The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to choose which indexes to create to optimise for your conditions.
public class IndexedList {
private class Index : Dictionary<string, List<string>> {
private int _indexLength;
public Index(int indexLength) {
_indexLength = indexLength;
}
public void Add(string value) {
if (value.Length >= _indexLength) {
string key = value.Substring(0, _indexLength);
List<string> list;
if (!this.TryGetValue(key, out list)) {
Add(key, list = new List<string>());
}
list.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
    List<string> list;
    // Guard against prefixes that have no entries at all.
    if (!TryGetValue(query.Substring(0, _indexLength), out list))
        return Enumerable.Empty<string>();
    return list
        .Where(s => s.Length > query.Length && s.StartsWith(query))
        .Take(limit);
}
}
private Index _index1;
private Index _index2;
private Index _index4;
private Index _index8;
public IndexedList(IEnumerable<string> values) {
_index1 = new Index(1);
_index2 = new Index(2);
_index4 = new Index(4);
_index8 = new Index(8);
foreach (string value in values) {
_index1.Add(value);
_index2.Add(value);
_index4.Add(value);
_index8.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
if (query.Length >= 8) return _index8.Find(query, limit);
if (query.Length >= 4) return _index4.Find(query,limit);
if (query.Length >= 2) return _index2.Find(query,limit);
return _index1.Find(query, limit);
}
}
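A usage sketch (allStrings stands for your cached collection):
var indexed = new IndexedList(allStrings); // build the indexes once, up front
foreach (var match in indexed.Find(query, limit))
    sb.Append(match);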
I bet you have an index on the column so SQL server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL server behavior, use a sorted collection and find all strings s such that s >= query and then look at values until you find a value that does not start with s and then do an additional filter on the values. This is what is called a range scan (Oracle) or an index seek (SQL server).
This is some example code which is very likely to go into infinite loops or have off-by-one errors because I didn't test it, but you should get the idea.
// Note: the list must be sorted (ordinally) before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
    // Binary search for the position of the first candidate.
    int low = 0, high = list.Count;
    while (low < high) {
        int mid = (low + high) / 2;
        if (string.CompareOrdinal(list[mid], query) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    // Walk forward while entries still share the prefix.
    while (low < list.Count && list[low].StartsWith(query, StringComparison.Ordinal)) {
        if (list[low].Length > query.Length)
            yield return list[low];
        low++;
    }
}
If you're doing a "starts with", you only care about ordinal comparisons, and you can keep the collection sorted (again in ordinal order); in that case I would suggest keeping the values in a list. You can then binary search to find the first value which starts with the right prefix, then walk down the list linearly, yielding results until the first value which doesn't start with the right prefix.
In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".
None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
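Here's a rough sketch of the two-binary-search idea (PrefixRange is an illustrative name, and the "ABC" -> "ABD" trick assumes the prefix doesn't end in char.MaxValue):
// Returns the index range [first, last) of entries starting with prefix.
static Tuple<int, int> PrefixRange(List<string> sorted, string prefix)
{
    int first = sorted.BinarySearch(prefix, StringComparer.Ordinal);
    if (first < 0) first = ~first; // index of the first entry >= prefix
    // Lexicographically-first string greater than every prefix match.
    string upper = prefix.Substring(0, prefix.Length - 1)
                 + (char)(prefix[prefix.Length - 1] + 1);
    int last = sorted.BinarySearch(upper, StringComparer.Ordinal);
    if (last < 0) last = ~last;
    return Tuple.Create(first, last);
}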
If you are trying to optimize looking up a list of strings with a given prefix, you might want to look at implementing a Trie (not to be confused with a regular tree) data structure in C#.
Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
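As a rough illustration of the idea (a sketch, not a production implementation):
class TrieNode
{
    private readonly Dictionary<char, TrieNode> _children = new Dictionary<char, TrieNode>();
    private bool _isWord;
    public void Add(string s, int i = 0)
    {
        if (i == s.Length) { _isWord = true; return; }
        TrieNode child;
        if (!_children.TryGetValue(s[i], out child))
            _children[s[i]] = child = new TrieNode();
        child.Add(s, i + 1);
    }
    // Walk down to the node for the prefix, then yield every word below it.
    public IEnumerable<string> StartingWith(string prefix)
    {
        TrieNode node = this;
        foreach (char c in prefix)
            if (!node._children.TryGetValue(c, out node))
                yield break;
        foreach (var word in node.Words(prefix))
            yield return word;
    }
    private IEnumerable<string> Words(string built)
    {
        if (_isWord) yield return built;
        foreach (var pair in _children)
            foreach (var word in pair.Value.Words(built + pair.Key))
                yield return word;
    }
}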
About LINQ to Objects in general: it's not unusual to see a speed reduction compared to SQL. The net is littered with articles analyzing its performance.
Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:
foreach (var stringitem in MyCollection.Where(
    x => x.Length > query.Length && x.StartsWith(query)).Take(limit))
The length comparison is always an O(1) operation (the length is stored as part of the string rather than counted on each call), whereas the call to StartsWith is an O(N) operation, where N is the length of the query (or of the string, whichever is smaller).
By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.
I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.
Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.
However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.
I think the problem is that LINQ has no way to use the fact that your sequence is already sorted. In particular, it cannot know that applying the StartsWith filter preserves the order.
I would suggest using the List.BinarySearch method together with an IComparer<string> that compares only the first query.Length characters (this might be tricky, since it's not clear whether the query string will always be the first or the second parameter to Compare).
You could even use the standard string comparison, since BinarySearch returns a negative number which you can complement (using ~) in order to get the index of the first element that is larger than your query.
You then have to scan from the returned index (in both directions!) to find all the elements matching your query string.
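A sketch of such a comparer (PrefixComparer is an illustrative name; comparing only the first query.Length characters of both arguments sidesteps the parameter-order question):
class PrefixComparer : IComparer<string>
{
    private readonly int _length;
    public PrefixComparer(int length) { _length = length; }
    // Compare only the first _length characters of each string.
    public int Compare(string x, string y)
    {
        return string.CompareOrdinal(
            x.Substring(0, Math.Min(x.Length, _length)),
            y.Substring(0, Math.Min(y.Length, _length)));
    }
}
// Usage: BinarySearch lands on *some* matching index; scan outwards from there.
int index = sortedList.BinarySearch(query, new PrefixComparer(query.Length));
if (index < 0) index = ~index; // first element larger than the query prefix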