Related
My question is basically what's a good programming practice. In case of IEnumerable each item is evaluated at a time where as in case of ToList the whole collection gets iterated before it starts the for loop.
As per below code which function (GetBool1 vs GetBool2) should be used and why.
public class TestListAndEnumerable1
{
public static void Test()
{
GetBool1();
GetBool2();
Console.ReadLine();
}
private static void GetBool1()
{
var list = new List<int> {0,1,2,3,4,5,6,7,8,9};
foreach (var item in list.Where(PrintAndEvaluate))
{
Thread.Sleep(1000);
}
}
private static bool PrintAndEvaluate(int x)
{
Console.WriteLine("Hi from " + x);
return x%2==0;
}
private static void GetBool2()
{
List<int> list = new List<int> { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
foreach (var item in list.Where(PrintAndEvaluate).ToList())
{
Thread.Sleep(1000);
}
}
}
The bahviour of the two loops is different. In the first case the Console will be written to as each item is iterated and evaluated, and a Sleep will occur between each Console.Write.
In the second case the Console Writes will also be evaluated, but these evaluations will all occur before the Sleeps - these occur only when all the PrintAndEvaluate calls have finished.
The second case enumerates the members of the list twice, allocating and fragmenting memory as it does so.
If your question is "which is most efficient" then the answer is the first example, but if you want to know "is there another more efficient method" then just use a loop like;
for(int counter = 0 ; counter <= list.Count; counter ++)
{
if(PrintAndEvaluate(list[counter]))
{
Thread.Sleep(1000);
}
}
This prevents the construction of an instance of an Iterator class so does not contribute to heap fragmentation.
GetBool1 should be used.
The sole difference between the two methods is the presence of the ToList() call. Right?
Let's look at the significance of the ToList call by first reading its docs:
Creates a List<T> from an IEnumerable<T>.
This means that a new list will be created when you call ToList. As you may know, creating a new list takes time and memory.
On the other hand GetBool1 does not have a ToList call, so it does not take as much time to execute.
GetBool1 is the better option. For the option2, even you convert the IEnumerable to List, when you call foreach it called the GetEnumerator again. But the differences are very little. I make a little change of your code to output the execution time:
public static void Test()
{
var list = new List<int>();
for (int i = 0; i < 10000; i++)
{
list.Add(i);
}
GetBool1(list);
GetBool2(list);
GetBool3(list);
Console.ReadLine();
}
private static void GetBool1(List<int> list)
{
System.Diagnostics.Stopwatch watcher = new System.Diagnostics.Stopwatch();
watcher.Start();
foreach (var item in list.Where(PrintAndEvaluate))
{
Thread.Sleep(1);
}
watcher.Stop();
Console.WriteLine("GetBool1 - {0}", watcher.ElapsedMilliseconds);
}
private static bool PrintAndEvaluate(int x)
{
return x % 2 == 0;
}
private static void GetBool2(List<int> list)
{
System.Diagnostics.Stopwatch watcher = new System.Diagnostics.Stopwatch();
watcher.Start();
foreach (var item in list.Where(PrintAndEvaluate).ToList())
{
Thread.Sleep(1);
}
watcher.Stop();
Console.WriteLine("GetBool2 - {0}", watcher.ElapsedMilliseconds);
}
The output is:
I have a general question, concerning performance and best practice.
When working with a List (or any other datatype) from a different Class, which is better practice? Copying it at the beginning, working with the local and then re-copying it to the original, or always access the original?
An Example:
access the original:
public class A
{
public static List<int> list = new List<int>();
}
public class B
{
public static void insertString(int i)
{
// insert at right place
int count = A.list.Count;
if (count == 0)
{
A.list.Add(i);
}
else
{
for (int j = 0; j < count; j++)
{
if (A.list[j] >= i)
{
A.list.Insert(j, i);
break;
}
if (j == count - 1)
{
A.list.Add(i);
}
}
}
}
}
As you see I access the original List A.list several times. Here the alternative:
Copying:
public class A
{
public static List<int> list = new List<int>();
}
public class B
{
public static void insertString(int i)
{
List<int> localList = A.list;
// insert at right place
int count = localList.Count;
if (count == 0)
{
localList.Add(i);
}
else
{
for (int j = 0; j < count; j++)
{
if (localList[j] >= i)
{
localList.Insert(j, i);
break;
}
if (j == count - 1)
{
localList.Add(i);
}
}
}
A.list = localList;
}
}
Here I access the the list in the other class only twice (getting it at the beginning and setting it at the end). Which would be better.
Please note that this is a general question and that the algorithm is only an example.
I won't bother thinking about performance here and instead focus on best practice:
Giving out the whole List violates encapsulation. B can modify the List and all its elements without A noticing (This is not a problem if A never uses the List itself but then A wouldn't even need to store it).
A simple example: A creates the List and immediately adds one element. Subsequently, A never bothers to check List.Count, because it knows that the List cannot be empty. Now B comes along and empties the List...
So any time B is changed, you need to also check A to see if all the assumptions of A are still correct. This is enough of a headache if you have full control over the code. If another programmer uses your class A, he may do something unexpected with the List and never check if that's ok.
Solution(s):
If B only needs to iterate over the elements, write an IEnumerable accessor. If B mustn't modify the elements, make the accessor deliver copies.
If B needs to modify the List (add/remove elements), either give B a copy of the List (containing copies of the elements if they needn't be modified) and accept a new List from B or use an accessor as before and implement the necessary List operations. In both cases, A will know if B modifies the List and can react accordingly.
Example:
class A
{
private List<ItemType> internalList;
public IEnumerable<ItemType> Items()
{
foreach (var item in internalList)
yield return item;
// or maybe item.Copy();
// new ItemType(item);
// depending on ItemType
}
public RemoveFromList(ItemType toRemove)
{
internalList.Remove(toRemove);
// do other things necessary to keep A in a consistent state
}
}
What is the quickest (and least resource intensive) to compare two massive (>50.000 items) and as a result have two lists like the ones below:
items that show up in the first list but not in the second
items that show up in the second list but not in the first
Currently I'm working with the List or IReadOnlyCollection and solve this issue in a linq query:
var list1 = list.Where(i => !list2.Contains(i)).ToList();
var list2 = list2.Where(i => !list.Contains(i)).ToList();
But this doesn't perform as good as i would like.
Any idea of making this quicker and less resource intensive as i need to process a lot of lists?
Use Except:
var firstNotSecond = list1.Except(list2).ToList();
var secondNotFirst = list2.Except(list1).ToList();
I suspect there are approaches which would actually be marginally faster than this, but even this will be vastly faster than your O(N * M) approach.
If you want to combine these, you could create a method with the above and then a return statement:
return !firstNotSecond.Any() && !secondNotFirst.Any();
One point to note is that there is a difference in results between the original code in the question and the solution here: any duplicate elements which are only in one list will only be reported once with my code, whereas they'd be reported as many times as they occur in the original code.
For example, with lists of [1, 2, 2, 2, 3] and [1], the "elements in list1 but not list2" result in the original code would be [2, 2, 2, 3]. With my code it would just be [2, 3]. In many cases that won't be an issue, but it's worth being aware of.
Enumerable.SequenceEqual Method
Determines whether two sequences are equal according to an equality comparer.
MS.Docs
Enumerable.SequenceEqual(list1, list2);
This works for all primitive data types. If you need to use it on custom objects you need to implement IEqualityComparer
Defines methods to support the comparison of objects for equality.
IEqualityComparer Interface
Defines methods to support the comparison of objects for equality.
MS.Docs for IEqualityComparer
More efficient would be using Enumerable.Except:
var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);
This method is implemented by using deferred execution. That means you could write for example:
var first10 = inListButNotInList2.Take(10);
It is also efficient since it internally uses a Set<T> to compare the objects. It works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.
If you want the results to be case insensitive, the following will work:
List<string> list1 = new List<string> { "a.dll", "b1.dll" };
List<string> list2 = new List<string> { "A.dll", "b2.dll" };
var firstNotSecond = list1.Except(list2, StringComparer.OrdinalIgnoreCase).ToList();
var secondNotFirst = list2.Except(list1, StringComparer.OrdinalIgnoreCase).ToList();
firstNotSecond would contain b1.dll
secondNotFirst would contain b2.dll
using System.Collections.Generic;
using System.Linq;
namespace YourProject.Extensions
{
public static class ListExtensions
{
public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
where T: IEquatable<T>
{
if (list.Except(other).Any())
return false;
if (other.Except(list).Any())
return false;
return true;
}
}
}
Sometimes you only need to know if two lists are different, and not what those differences are. In that case, consider adding this extension method to your project. Note that your listed objects should implement IEquatable!
Usage:
public sealed class Car : IEquatable<Car>
{
public Price Price { get; }
public List<Component> Components { get; }
...
public override bool Equals(object obj)
=> obj is Car other && Equals(other);
public bool Equals(Car other)
=> Price == other.Price
&& Components.SetwiseEquivalentTo(other.Components);
public override int GetHashCode()
=> Components.Aggregate(
Price.GetHashCode(),
(code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}
Whatever the Component class is, the methods shown here for Car should be implemented almost identically.
It's very important to note how we've written GetHashCode. In order to properly implement IEquatable, Equals and GetHashCode must operate on the instance's properties in a logically compatible way.
Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.
Note: the strange name is to imply the fact that the method does not consider the order of the elements in the list. If you do care about the order of the elements in the list, this method is not for you!
try this way:
var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
.Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));
Not for this Problem, but here's some code to compare lists for equal and not! identical objects:
public class EquatableList<T> : List<T>, IEquatable<EquatableList<T>> where T : IEquatable<T>
/// <summary>
/// True, if this contains element with equal property-values
/// </summary>
/// <param name="element">element of Type T</param>
/// <returns>True, if this contains element</returns>
public new Boolean Contains(T element)
{
return this.Any(t => t.Equals(element));
}
/// <summary>
/// True, if list is equal to this
/// </summary>
/// <param name="list">list</param>
/// <returns>True, if instance equals list</returns>
public Boolean Equals(EquatableList<T> list)
{
if (list == null) return false;
return this.All(list.Contains) && list.All(this.Contains);
}
If only combined result needed, this will work too:
var set1 = new HashSet<T>(list1);
var set2 = new HashSet<T>(list2);
var areEqual = set1.SetEquals(set2);
where T is type of lists element.
While Jon Skeet's answer is an excellent advice for everyday's practice with small to moderate number of elements (up to a few millions) it is nevertheless not the fastest approach and not very resource efficient. An obvious drawback is the fact that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this can be avoided by a customized reimplementation of the Except method, but it remains that the creation of a hash set requires a lot of memory and the computation of hashes requires time.
For very large data sets (in the billions of elements) it usually pays off to consider the particular circumstances. Here are a few ideas that might provide some inspiration:
If the elements can be compared (which is almost always the case in practice), then sorting the lists and applying the following zip approach is worth consideration:
/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained only in one of them, together with an indicator,
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
var refs = sortedReferenceObjects.GetEnumerator();
var diffs = sortedDifferenceObjects.GetEnumerator();
bool hasNext = refs.MoveNext() && diffs.MoveNext();
while (hasNext)
{
int comparison = comparer.Compare(refs.Current, diffs.Current);
if (comparison == 0)
{
// insert code that emits the current element if equal elements should be kept
hasNext = refs.MoveNext() && diffs.MoveNext();
}
else if (comparison < 0)
{
yield return Tuple.Create(refs.Current, -1);
hasNext = refs.MoveNext();
}
else
{
yield return Tuple.Create(diffs.Current, 1);
hasNext = diffs.MoveNext();
}
}
}
This can e.g. be used in the following way:
const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);
Benchmarking on my computer gave the following result for N = 1M:
Method
Mean
Error
StdDev
Ratio
Gen 0
Gen 1
Gen 2
Allocated
DiffLinq
115.19 ms
0.656 ms
0.582 ms
1.00
2800.0000
2800.0000
2800.0000
67110744 B
DiffZip
23.48 ms
0.018 ms
0.015 ms
0.20
-
-
-
720 B
And for N = 100M:
Method
Mean
Error
StdDev
Ratio
Gen 0
Gen 1
Gen 2
Allocated
DiffLinq
12.146 s
0.0427 s
0.0379 s
1.00
13000.0000
13000.0000
13000.0000
8589937032 B
DiffZip
2.324 s
0.0019 s
0.0018 s
0.19
-
-
-
720 B
Note that this example of course benefits from the fact that the lists are already sorted and integers can be very efficiently compared. But this is exactly the point: If you do have favourable circumstances, make sure that you exploit them.
A few further comments: The speed of the comparison function is clearly relevant for the overall performance, so it may be beneficial to optimize it. The flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems more feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2, is to split the lists respectively in two halfs (if it can be efficiently done) and compare the parts in parallel, one processing from front to back and the other in reverse order.
I have used this code to compare two list which has million of records.
This method will not take much time
//Method to compare two list of string
private List<string> Contains(List<string> list1, List<string> list2)
{
List<string> result = new List<string>();
result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));
return result;
}
I compared 3 different methods for comparing different data sets. Tests below create a string collection of all the numbers from 0 to length - 1, then another collection with the same range, but with even numbers. I then pick out the odd numbers from the first collection.
Using Linq Except
public void TestExcept()
{
WriteLine($"Except {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newArray = new string[length/2];
int j = 0;
for (int i = 0; i < length; i+=2)
{
newArray[j++] = i.ToString();
}
dateTime = DateTime.Now;
Write("Count of items: ");
WriteLine(array.Except(newArray).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879
Using HashSet.Add
public void TestHashSet()
{
WriteLine($"HashSet {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var hashSet = new HashSet<string>();
for (int i = 0; i < length; i++)
{
hashSet.Add(i.ToString());
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newHashSet = new HashSet<string>();
for (int i = 0; i < length; i+=2)
{
newHashSet.Add(i.ToString());
}
dateTime = DateTime.Now;
Write("Count of items: ");
// HashSet Add returns true if item is added successfully (not previously existing)
WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057
Special HashSet test:
public void TestLoadingHashSet()
{
int length = 20000000;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
var dateTime = DateTime.Now;
var hashSet = new HashSet<string>(array);
Write("Time to load hashset: ");
WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160
Using .Contains
public void TestContains()
{
WriteLine($"Contains {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newArray = new string[length/2];
int j = 0;
for (int i = 0; i < length; i+=2)
{
newArray[j++] = i.ToString();
}
dateTime = DateTime.Now;
WriteLine(dateTime);
Write("Count of items: ");
WriteLine(array.Where(a => !newArray.Contains(a)).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didnt complete. Killed it after 14 minutes)
Conclusion:
Linq Except ran approximately 1 second slower on my device than using HashSets (n=20,000,000).
Using Where and Contains ran for a very long time
Closing remarks on HashSets:
Unique data
Make sure to override GetHashCode (correctly) for class types
May need up to 2x the memory if you make a copy of the data set, depending on implementation
HashSet is optimized for cloning other HashSets using the IEnumerable constructor, but it is slower to convert other collections to HashSets (see special test above)
First approach:
if (list1 != null && list2 != null && list1.Select(x => list2.SingleOrDefault(y => y.propertyToCompare == x.propertyToCompare && y.anotherPropertyToCompare == x.anotherPropertyToCompare) != null).All(x => true))
return true;
Second approach if you are ok with duplicate values:
if (list1 != null && list2 != null && list1.Select(x => list2.Any(y => y.propertyToCompare == x.propertyToCompare && y.anotherPropertyToCompare == x.anotherPropertyToCompare)).All(x => true))
return true;
Both Jon Skeet's and miguelmpn's answers are good. It depends on whether the order of the list elements is important or not:
// take order into account
bool areEqual1 = Enumerable.SequenceEqual(list1, list2);
// ignore order
bool areEqual2 = !list1.Except(list2).Any() && !list2.Except(list1).Any();
One line:
var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
Console.WriteLine("same sets");
I did the generic function for comparing two lists.
public static class ListTools
{
public enum RecordUpdateStatus
{
Added = 1,
Updated = 2,
Deleted = 3
}
public class UpdateStatu<T>
{
public T CurrentValue { get; set; }
public RecordUpdateStatus UpdateStatus { get; set; }
}
public static List<UpdateStatu<T>> CompareList<T>(List<T> currentList, List<T> inList, string uniqPropertyName)
{
var res = new List<UpdateStatu<T>>();
res.AddRange(inList.Where(a => !currentList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
.Select(a => new UpdateStatu<T>
{
CurrentValue = a,
UpdateStatus = RecordUpdateStatus.Added,
}));
res.AddRange(currentList.Where(a => !inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
.Select(a => new UpdateStatu<T>
{
CurrentValue = a,
UpdateStatus = RecordUpdateStatus.Deleted,
}));
res.AddRange(currentList.Where(a => inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
.Select(a => new UpdateStatu<T>
{
CurrentValue = a,
UpdateStatus = RecordUpdateStatus.Updated,
}));
return res;
}
}
I think this is a simple and easy way to compare two lists element by element
x=[1,2,3,5,4,8,7,11,12,45,96,25]
y=[2,4,5,6,8,7,88,9,6,55,44,23]
tmp = []
for i in range(len(x)) and range(len(y)):
if x[i]>y[i]:
tmp.append(1)
else:
tmp.append(0)
print(tmp)
Maybe it's funny, but this works for me:
string.Join("",List1) != string.Join("", List2)
This is the best solution you'll found
var list3 = list1.Where(l => list2.ToList().Contains(l));
Context: C# 3.0, .Net 3.5
Suppose I have a method that generates random numbers (forever):
private static IEnumerable<int> RandomNumberGenerator() {
while (true) yield return GenerateRandomNumber(0, 100);
}
I need to group those numbers in groups of 10, so I would like something like:
foreach (IEnumerable<int> group in RandomNumberGenerator().Slice(10)) {
Assert.That(group.Count() == 10);
}
I have defined Slice method, but I feel there should be one already defined. Here is my Slice method, just for reference:
private static IEnumerable<T[]> Slice<T>(IEnumerable<T> enumerable, int size) {
var result = new List<T>(size);
foreach (var item in enumerable) {
result.Add(item);
if (result.Count == size) {
yield return result.ToArray();
result.Clear();
}
}
}
Question: is there an easier way to accomplish what I'm trying to do? Perhaps Linq?
Note: above example is a simplification, in my program I have an Iterator that scans given matrix in a non-linear fashion.
EDIT: Why Skip+Take is no good.
Effectively what I want is:
var group1 = RandomNumberGenerator().Skip(0).Take(10);
var group2 = RandomNumberGenerator().Skip(10).Take(10);
var group3 = RandomNumberGenerator().Skip(20).Take(10);
var group4 = RandomNumberGenerator().Skip(30).Take(10);
without the overhead of regenerating number (10+20+30+40) times. I need a solution that will generate exactly 40 numbers and break those in 4 groups by 10.
Are Skip and Take of any use to you?
Use a combination of the two in a loop to get what you want.
So,
list.Skip(10).Take(10);
Skips the first 10 records and then takes the next 10.
I have done something similar. But I would like it to be simpler:
//Remove "this" if you don't want it to be a extension method
public static IEnumerable<IList<T>> Chunks<T>(this IEnumerable<T> xs, int size)
{
var curr = new List<T>(size);
foreach (var x in xs)
{
curr.Add(x);
if (curr.Count == size)
{
yield return curr;
curr = new List<T>(size);
}
}
}
I think yours are flawed. You return the same array for all your chunks/slices so only the last chunk/slice you take would have the correct data.
Addition: Array version:
public static IEnumerable<T[]> Chunks<T>(this IEnumerable<T> xs, int size)
{
var curr = new T[size];
int i = 0;
foreach (var x in xs)
{
curr[i % size] = x;
if (++i % size == 0)
{
yield return curr;
curr = new T[size];
}
}
}
Addition: Linq version (not C# 2.0). As pointed out, it will not work on infinite sequences and will be a great deal slower than the alternatives:
public static IEnumerable<T[]> Chunks<T>(this IEnumerable<T> xs, int size)
{
return xs.Select((x, i) => new { x, i })
.GroupBy(xi => xi.i / size, xi => xi.x)
.Select(g => g.ToArray());
}
Using Skip and Take would be a very bad idea. Calling Skip on an indexed collection may be fine, but calling it on any arbitrary IEnumerable<T> is liable to result in enumeration over the number of elements skipped, which means that if you're calling it repeatedly you're enumerating over the sequence an order of magnitude more times than you need to be.
Complain of "premature optimization" all you want; but that is just ridiculous.
I think your Slice method is about as good as it gets. I was going to suggest a different approach that would provide deferred execution and obviate the intermediate array allocation, but that is a dangerous game to play (i.e., if you try something like ToList on such a resulting IEnumerable<T> implementation, without enumerating over the inner collections, you'll end up in an endless loop).
(I've removed what was originally here, as the OP's improvements since posting the question have since rendered my suggestions here redundant.)
Let's see if you even need the complexity of Slice. If your random number generates is stateless, I would assume each call to it would generate unique random numbers, so perhaps this would be sufficient:
var group1 = RandomNumberGenerator().Take(10);
var group2 = RandomNumberGenerator().Take(10);
var group3 = RandomNumberGenerator().Take(10);
var group4 = RandomNumberGenerator().Take(10);
Each call to Take returns a new group of 10 numbers.
Now, if your random number generator re-seeds itself with a specific value each time it's iterated, this won't work. You'll simply get the same 10 values for each group. So instead, you would use:
var generator = RandomNumberGenerator();
var group1 = generator.Take(10);
var group2 = generator.Take(10);
var group3 = generator.Take(10);
var group4 = generator.Take(10);
This maintains an instance of the generator so that you can continue retrieving values without re-seeding the generator.
You could use the Skip and Take methods with any Enumerable object.
For your edit :
How about a function that takes a slice number and a slice size as a parameter?
private static IEnumerable<T> Slice<T>(IEnumerable<T> enumerable, int sliceSize, int sliceNumber) {
return enumerable.Skip(sliceSize * sliceNumber).Take(sliceSize);
}
It seems like we'd prefer for an IEnumerable<T> to have a fixed position counter so that we can do
var group1 = items.Take(10);
var group2 = items.Take(10);
var group3 = items.Take(10);
var group4 = items.Take(10);
and get successive slices rather than getting the first 10 items each time. We can do that with a new implementation of IEnumerable<T> which keeps one instance of its Enumerator and returns it on every call of GetEnumerator:
public class StickyEnumerable<T> : IEnumerable<T>, IDisposable
{
private IEnumerator<T> innerEnumerator;
public StickyEnumerable( IEnumerable<T> items )
{
innerEnumerator = items.GetEnumerator();
}
public IEnumerator<T> GetEnumerator()
{
return innerEnumerator;
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return innerEnumerator;
}
public void Dispose()
{
if (innerEnumerator != null)
{
innerEnumerator.Dispose();
}
}
}
Given that class, we could implement Slice with
public static IEnumerable<IEnumerable<T>> Slices<T>(this IEnumerable<T> items, int size)
{
using (StickyEnumerable<T> sticky = new StickyEnumerable<T>(items))
{
IEnumerable<T> slice;
do
{
slice = sticky.Take(size).ToList();
yield return slice;
} while (slice.Count() == size);
}
yield break;
}
That works in this case, but StickyEnumerable<T> is generally a dangerous class to have around if the consuming code isn't expecting it. For example,
using (var sticky = new StickyEnumerable<int>(Enumerable.Range(1, 10)))
{
var first = sticky.Take(2);
var second = sticky.Take(2);
foreach (int i in second)
{
Console.WriteLine(i);
}
foreach (int i in first)
{
Console.WriteLine(i);
}
}
prints
1
2
3
4
rather than
3
4
1
2
Take a look at Take(), TakeWhile() and Skip()
I think the use of Slice() would be a bit misleading. I think of that as a means to give me a chuck of an array into a new array and not causing side effects. In this scenario you would actually move the enumerable forward 10.
A possible better approach is to just use the Linq extension Take(). I don't think you would need to use Skip() with a generator.
Edit: Dang, I have been trying to test this behavior with the following code
Note: this is wasn't really correct, I leave it here so others don't fall into the same mistake.
var numbers = RandomNumberGenerator();
var slice = numbers.Take(10);
public static IEnumerable<int> RandomNumberGenerator()
{
yield return random.Next();
}
but the Count() for slice is alway 1. I also tried running it through a foreach loop since I know that the Linq extensions are generally lazily evaluated and it only looped once. I eventually did the code below instead of the Take() and it works:
public static IEnumerable<int> Slice(this IEnumerable<int> enumerable, int size)
{
var list = new List<int>();
foreach (var count in Enumerable.Range(0, size)) list.Add(enumerable.First());
return list;
}
If you notice I am adding the First() to the list each time, but since the enumerable that is being passed in is the generator from RandomNumberGenerator() the result is different every time.
So again with a generator using Skip() is not needed since the result will be different. Looping over an IEnumerable is not always side effect free.
Edit: I'll leave the last edit just so no one falls into the same mistake, but it worked fine for me just doing this:
var numbers = RandomNumberGenerator();
var slice1 = numbers.Take(10);
var slice2 = numbers.Take(10);
The two slices were different.
I had made some mistakes in my original answer but some of the points still stand. Skip() and Take() are not going to work the same with a generator as it would a list. Looping over an IEnumerable is not always side effect free. Anyway here is my take on getting a list of slices.
public static IEnumerable<int> RandomNumberGenerator()
{
while(true) yield return random.Next();
}
public static IEnumerable<IEnumerable<int>> Slice(this IEnumerable<int> enumerable, int size, int count)
{
var slices = new List<List<int>>();
foreach (var iteration in Enumerable.Range(0, count)){
var list = new List<int>();
list.AddRange(enumerable.Take(size));
slices.Add(list);
}
return slices;
}
I got this solution for the same problem:
int[] ints = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
IEnumerable<IEnumerable<int>> chunks = Chunk(ints, 2, t => t.Dump());
//won't enumerate, so won't do anything unless you force it:
chunks.ToList();
IEnumerable<T> Chunk<T, R>(IEnumerable<R> src, int n, Func<IEnumerable<R>, T> action){
IEnumerable<R> head;
IEnumerable<R> tail = src;
while (tail.Any())
{
head = tail.Take(n);
tail = tail.Skip(n);
yield return action(head);
}
}
if you just want the chunks returned, not do anything with them, use chunks = Chunk(ints, 2, t => t). What I would really like is to have to have t=>t as default action, but I haven't found out how to do that yet.
I have a List class, and I would like to override GetEnumerator() to return my own Enumerator class. This Enumerator class would have two additional properties that would be updated as the Enumerator is used.
For simplicity (this isn't the exact business case), let's say those properties were CurrentIndex and RunningTotal.
I could manage these properties within the foreach loop manually, but I would rather encapsulate this functionality for reuse, and the Enumerator seems to be the right spot.
The problem: foreach hides all the Enumerator business, so is there a way to, within a foreach statement, access the current Enumerator so I can retrieve my properties? Or would I have to foreach, use a nasty old while loop, and manipulate the Enumerator myself?
Strictly speaking, I would say that if you want to do exactly what you're saying, then yes, you would need to call GetEnumerator and control the enumerator yourself with a while loop.
Without knowing too much about your business requirement, you might be able to take advantage of an iterator function, such as something like this:
public static IEnumerable<decimal> IgnoreSmallValues(List<decimal> list)
{
decimal runningTotal = 0M;
foreach (decimal value in list)
{
// if the value is less than 1% of the running total, then ignore it
if (runningTotal == 0M || value >= 0.01M * runningTotal)
{
runningTotal += value;
yield return value;
}
}
}
Then you can do this:
List<decimal> payments = new List<decimal>() {
123.45M,
234.56M,
.01M,
345.67M,
1.23M,
456.78M
};
foreach (decimal largePayment in IgnoreSmallValues(payments))
{
// handle the large payments so that I can divert all the small payments to my own bank account. Mwahaha!
}
Updated:
Ok, so here's a follow-up with what I've termed my "fishing hook" solution. Now, let me add a disclaimer that I can't really think of a good reason to do something this way, but your situation may differ.
The idea is that you simply create a "fishing hook" object (reference type) that you pass to your iterator function. The iterator function manipulates your fishing hook object, and since you still have a reference to it in your code outside, you have visibility into what's going on:
public class FishingHook
{
public int Index { get; set; }
public decimal RunningTotal { get; set; }
public Func<decimal, bool> Criteria { get; set; }
}
public static IEnumerable<decimal> FishingHookIteration(IEnumerable<decimal> list, FishingHook hook)
{
hook.Index = 0;
hook.RunningTotal = 0;
foreach(decimal value in list)
{
// the hook object may define a Criteria delegate that
// determines whether to skip the current value
if (hook.Criteria == null || hook.Criteria(value))
{
hook.RunningTotal += value;
yield return value;
hook.Index++;
}
}
}
You would utilize it like this:
List<decimal> payments = new List<decimal>() {
123.45M,
.01M,
345.67M,
234.56M,
1.23M,
456.78M
};
FishingHook hook = new FishingHook();
decimal min = 0;
hook.Criteria = x => x > min; // exclude any values that are less than/equal to the defined minimum
foreach (decimal value in FishingHookIteration(payments, hook))
{
// update the minimum
if (value > min) min = value;
Console.WriteLine("Index: {0}, Value: {1}, Running Total: {2}", hook.Index, value, hook.RunningTotal);
}
// Resultint output is:
//Index: 0, Value: 123.45, Running Total: 123.45
//Index: 1, Value: 345.67, Running Total: 469.12
//Index: 2, Value: 456.78, Running Total: 925.90
// we've skipped the values .01, 234.56, and 1.23
Essentially, the FishingHook object gives you some control over how the iterator executes. The impression I got from the question was that you needed some way to access the inner workings of the iterator so that you could manipulate how it iterates while you are in the middle of iterating, but if this is not the case, then this solution might be overkill for what you need.
With foreach you indeed can't get the enumerator - you could, however, have the enumerator return (yield) a tuple that includes that data; in fact, you could probably use LINQ to do it for you...
(I couldn't cleanly get the index using LINQ - can get the total and current value via Aggregate, though; so here's the tuple approach)
using System.Collections;
using System.Collections.Generic;
using System;
class MyTuple
{
public int Value {get;private set;}
public int Index { get; private set; }
public int RunningTotal { get; private set; }
public MyTuple(int value, int index, int runningTotal)
{
Value = value; Index = index; RunningTotal = runningTotal;
}
static IEnumerable<MyTuple> SomeMethod(IEnumerable<int> data)
{
int index = 0, total = 0;
foreach (int value in data)
{
yield return new MyTuple(value, index++,
total = total + value);
}
}
static void Main()
{
int[] data = { 1, 2, 3 };
foreach (var tuple in SomeMethod(data))
{
Console.WriteLine("{0}: {1} ; {2}", tuple.Index,
tuple.Value, tuple.RunningTotal);
}
}
}
You can also do something like this in a more Functional way, depending on your requirements. What you are asking can be though of as "zipping" together multiple sequences, and then iterating through them all at once. The three sequences for the example you gave would be:
The "value" sequence
The "index" sequence
The "Running Total" Sequence
The next step would be to specify each of these sequences seperately:
List<decimal> ValueList
var Indexes = Enumerable.Range(0, ValueList.Count)
The last one is more fun... the two methods I can think of are to either have a temporary variable used to sum up the sequence, or to recalculate the sum for each item. The second is obviously much less performant, I would rather use the temporary:
decimal Sum = 0;
var RunningTotals = ValueList.Select(v => Sum = Sum + v);
The last step would be to zip these all together. .Net 4 will have the Zip operator built in, in which case it will look like this:
var ZippedSequence = ValueList.Zip(Indexes, (value, index) => new {value, index}).Zip(RunningTotals, (temp, total) => new {temp.value, temp.index, total});
This obviously gets noisier the more things you try to zip together.
In the last link, there is source for implementing the Zip function yourself. It really is a simple little bit of code.