Quickest way to compare two generic lists for differences - c#

What is the quickest (and least resource intensive) way to compare two massive lists (>50,000 items) and, as a result, end up with two lists like the ones below:
items that show up in the first list but not in the second
items that show up in the second list but not in the first
Currently I'm working with List<T> or IReadOnlyCollection<T> and solve this with a LINQ query:
var onlyInFirst = list.Where(i => !list2.Contains(i)).ToList();
var onlyInSecond = list2.Where(i => !list.Contains(i)).ToList();
But this doesn't perform as well as I would like.
Any idea how to make this quicker and less resource intensive, as I need to process a lot of lists?

Use Except:
var firstNotSecond = list1.Except(list2).ToList();
var secondNotFirst = list2.Except(list1).ToList();
I suspect there are approaches which would actually be marginally faster than this, but even this will be vastly faster than your O(N * M) approach.
If you want to combine these, you could create a method with the above and then a return statement:
return !firstNotSecond.Any() && !secondNotFirst.Any();
One point to note is that there is a difference in results between the original code in the question and the solution here: any duplicate elements which are only in one list will only be reported once with my code, whereas they'd be reported as many times as they occur in the original code.
For example, with lists of [1, 2, 2, 2, 3] and [1], the "elements in list1 but not list2" result in the original code would be [2, 2, 2, 3]. With my code it would just be [2, 3]. In many cases that won't be an issue, but it's worth being aware of.
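A minimal sketch of the combined method mentioned above (the method name is my own; it assumes a using for System.Linq):
static bool SetwiseEqual<T>(List<T> list1, List<T> list2)
{
    var firstNotSecond = list1.Except(list2).ToList();
    var secondNotFirst = list2.Except(list1).ToList();
    return !firstNotSecond.Any() && !secondNotFirst.Any();
}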

Enumerable.SequenceEqual Method
Determines whether two sequences are equal according to an equality comparer.
MS.Docs
Enumerable.SequenceEqual(list1, list2);
This works for all primitive data types. If you need to use it on custom objects, you need to implement IEqualityComparer:
IEqualityComparer Interface
Defines methods to support the comparison of objects for equality.
MS.Docs for IEqualityComparer
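As an illustration (the Person type and PersonIdComparer below are made-up names, not from the linked docs), a custom comparer used with SequenceEqual might look like this:
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Compares two Person instances by Id only.
public class PersonIdComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y) => x?.Id == y?.Id;
    public int GetHashCode(Person obj) => obj.Id.GetHashCode();
}

// Usage (order matters for SequenceEqual):
// bool same = list1.SequenceEqual(list2, new PersonIdComparer());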

More efficient would be using Enumerable.Except:
var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);
This method is implemented by using deferred execution. That means you could write for example:
var first10 = inListButNotInList2.Take(10);
It is also efficient since it internally uses a Set<T> to compare the objects. It works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.
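A minimal sketch of that idea (my own illustration, not the actual LINQ source):
// Collect the second sequence into a set once, then stream the first,
// yielding each item the set hasn't seen before (duplicates come out once).
static IEnumerable<T> ExceptSketch<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    var seen = new HashSet<T>(second);
    foreach (var item in first)
    {
        if (seen.Add(item))
            yield return item;
    }
}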

If you want the results to be case insensitive, the following will work:
List<string> list1 = new List<string> { "a.dll", "b1.dll" };
List<string> list2 = new List<string> { "A.dll", "b2.dll" };
var firstNotSecond = list1.Except(list2, StringComparer.OrdinalIgnoreCase).ToList();
var secondNotFirst = list2.Except(list1, StringComparer.OrdinalIgnoreCase).ToList();
firstNotSecond would contain b1.dll
secondNotFirst would contain b2.dll

using System;
using System.Collections.Generic;
using System.Linq;

namespace YourProject.Extensions
{
    public static class ListExtensions
    {
        public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
            where T : IEquatable<T>
        {
            if (list.Except(other).Any())
                return false;
            if (other.Except(list).Any())
                return false;
            return true;
        }
    }
}
Sometimes you only need to know if two lists are different, and not what those differences are. In that case, consider adding this extension method to your project. Note that your listed objects should implement IEquatable!
Usage:
public sealed class Car : IEquatable<Car>
{
    public Price Price { get; }
    public List<Component> Components { get; }

    ...

    public override bool Equals(object obj)
        => obj is Car other && Equals(other);

    public bool Equals(Car other)
        => Price == other.Price
        && Components.SetwiseEquivalentTo(other.Components);

    public override int GetHashCode()
        => Components.Aggregate(
            Price.GetHashCode(),
            (code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}
Whatever the Component class is, the methods shown here for Car should be implemented almost identically.
It's very important to note how we've written GetHashCode. In order to properly implement IEquatable, Equals and GetHashCode must operate on the instance's properties in a logically compatible way.
Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.
Note: the strange name is to imply the fact that the method does not consider the order of the elements in the list. If you do care about the order of the elements in the list, this method is not for you!

Try this:
var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
.Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));

Not exactly for this problem, but here's some code to compare lists for equal (and not necessarily identical) objects:
public class EquatableList<T> : List<T>, IEquatable<EquatableList<T>> where T : IEquatable<T>
{
    /// <summary>
    /// True, if this contains element with equal property-values
    /// </summary>
    /// <param name="element">element of Type T</param>
    /// <returns>True, if this contains element</returns>
    public new Boolean Contains(T element)
    {
        return this.Any(t => t.Equals(element));
    }

    /// <summary>
    /// True, if list is equal to this
    /// </summary>
    /// <param name="list">list</param>
    /// <returns>True, if instance equals list</returns>
    public Boolean Equals(EquatableList<T> list)
    {
        if (list == null) return false;
        return this.All(list.Contains) && list.All(this.Contains);
    }
}

If you only need a combined result (whether the two lists are set-equal), this will work too:
var set1 = new HashSet<T>(list1);
var set2 = new HashSet<T>(list2);
var areEqual = set1.SetEquals(set2);
where T is the type of the lists' elements.

While Jon Skeet's answer is excellent advice for everyday practice with small to moderate numbers of elements (up to a few million), it is nevertheless not the fastest approach and not very resource efficient. An obvious drawback is the fact that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this can be avoided by a customized reimplementation of the Except method, but it remains that the creation of a hash set requires a lot of memory and the computation of hashes requires time.
For very large data sets (in the billions of elements) it usually pays off to consider the particular circumstances. Here are a few ideas that might provide some inspiration:
If the elements can be compared (which is almost always the case in practice), then sorting the lists and applying the following zip approach is worth consideration:
/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained only in one of them, together with an indicator,
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
    IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
    var refs = sortedReferenceObjects.GetEnumerator();
    var diffs = sortedDifferenceObjects.GetEnumerator();
    bool hasNext = refs.MoveNext() && diffs.MoveNext();
    while (hasNext)
    {
        int comparison = comparer.Compare(refs.Current, diffs.Current);
        if (comparison == 0)
        {
            // insert code that emits the current element if equal elements should be kept
            hasNext = refs.MoveNext() && diffs.MoveNext();
        }
        else if (comparison < 0)
        {
            yield return Tuple.Create(refs.Current, -1);
            hasNext = refs.MoveNext();
        }
        else
        {
            yield return Tuple.Create(diffs.Current, 1);
            hasNext = diffs.MoveNext();
        }
    }
}
This can e.g. be used in the following way:
const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);
Benchmarking on my computer gave the following result for N = 1M:
Method   | Mean      | Error    | StdDev   | Ratio | Gen 0     | Gen 1     | Gen 2     | Allocated
-------- | --------- | -------- | -------- | ----- | --------- | --------- | --------- | ----------
DiffLinq | 115.19 ms | 0.656 ms | 0.582 ms | 1.00  | 2800.0000 | 2800.0000 | 2800.0000 | 67110744 B
DiffZip  |  23.48 ms | 0.018 ms | 0.015 ms | 0.20  | -         | -         | -         | 720 B
And for N = 100M:
Method   | Mean     | Error    | StdDev   | Ratio | Gen 0      | Gen 1      | Gen 2      | Allocated
-------- | -------- | -------- | -------- | ----- | ---------- | ---------- | ---------- | ------------
DiffLinq | 12.146 s | 0.0427 s | 0.0379 s | 1.00  | 13000.0000 | 13000.0000 | 13000.0000 | 8589937032 B
DiffZip  |  2.324 s | 0.0019 s | 0.0018 s | 0.19  | -          | -          | -          | 720 B
Note that this example of course benefits from the fact that the lists are already sorted and integers can be very efficiently compared. But this is exactly the point: If you do have favourable circumstances, make sure that you exploit them.
A few further comments: The speed of the comparison function is clearly relevant for the overall performance, so it may be beneficial to optimize it. The flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems more feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2 is to split each list into two halves (if that can be done efficiently) and compare the corresponding parts in parallel, one processing from front to back and the other in reverse order.
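A hedged sketch (my own addition, not from the benchmark above) of one way to do such a split, using a pivot cut rather than the forward/backward scheme described: pick a pivot value, cut both sorted lists at that value with a binary search, then diff the two halves on separate tasks using the FindDifferences method above. It assumes List<T> inputs sorted ascending and the usual usings (System, System.Collections.Generic, System.Linq, System.Threading.Tasks).
public static List<Tuple<T, int>> FindDifferencesParallel<T>(
    List<T> sortedReference, List<T> sortedDifference, IComparer<T> comparer)
{
    // Pick the middle element of the reference list as the pivot value.
    T pivot = sortedReference[sortedReference.Count / 2];

    // Lower bound: first index whose element is >= pivot.
    int LowerBound(List<T> list)
    {
        int lo = 0, hi = list.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (comparer.Compare(list[mid], pivot) < 0) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int splitRef = LowerBound(sortedReference);
    int splitDiff = LowerBound(sortedDifference);

    // Elements below the pivot land in the lower pair, the rest in the upper pair,
    // so the two halves can be diffed independently and then concatenated.
    var lower = Task.Run(() => FindDifferences(
        sortedReference.Take(splitRef), sortedDifference.Take(splitDiff), comparer).ToList());
    var upper = Task.Run(() => FindDifferences(
        sortedReference.Skip(splitRef), sortedDifference.Skip(splitDiff), comparer).ToList());

    Task.WaitAll(lower, upper);
    lower.Result.AddRange(upper.Result);
    return lower.Result;
}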

I have used this code to compare two lists which have millions of records.
This method does not take much time.
//Method to compare two lists of strings
private List<string> Contains(List<string> list1, List<string> list2)
{
    List<string> result = new List<string>();
    result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
    result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));
    return result;
}

I compared 3 different methods for comparing different data sets. Tests below create a string collection of all the numbers from 0 to length - 1, then another collection with the same range, but with even numbers. I then pick out the odd numbers from the first collection.
Using Linq Except
public void TestExcept()
{
    WriteLine($"Except {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);

    var newArray = new string[length / 2];
    int j = 0;
    for (int i = 0; i < length; i += 2)
    {
        newArray[j++] = i.ToString();
    }

    dateTime = DateTime.Now;
    Write("Count of items: ");
    WriteLine(array.Except(newArray).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}
Output
Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879
Using HashSet.Add
public void TestHashSet()
{
    WriteLine($"HashSet {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>();
    for (int i = 0; i < length; i++)
    {
        hashSet.Add(i.ToString());
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);

    var newHashSet = new HashSet<string>();
    for (int i = 0; i < length; i += 2)
    {
        newHashSet.Add(i.ToString());
    }

    dateTime = DateTime.Now;
    Write("Count of items: ");
    // HashSet Add returns true if item is added successfully (not previously existing)
    WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}
Output
HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057
Special HashSet test:
public void TestLoadingHashSet()
{
    int length = 20000000;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }

    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>(array);
    Write("Time to load hashset: ");
    WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160
Using .Contains
public void TestContains()
{
    WriteLine($"Contains {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);

    var newArray = new string[length / 2];
    int j = 0;
    for (int i = 0; i < length; i += 2)
    {
        newArray[j++] = i.ToString();
    }

    dateTime = DateTime.Now;
    WriteLine(dateTime);
    Write("Count of items: ");
    WriteLine(array.Where(a => !newArray.Contains(a)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}
Output
Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didn't complete. Killed it after 14 minutes.)
Conclusion:
Linq Except ran approximately 1 second slower on my device than using HashSets (n=20,000,000).
Using Where and Contains ran for a very long time
Closing remarks on HashSets:
Unique data
Make sure to override GetHashCode (correctly) for class types
May need up to 2x the memory if you make a copy of the data set, depending on implementation
HashSet is optimized for cloning other HashSets using the IEnumerable constructor, but it is slower to convert other collections to HashSets (see special test above)
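As a side note to the HashSet remarks above, here is a minimal sketch (my own, not part of the benchmark) of getting both one-way differences with plain HashSet operations; ExceptWith mutates the set, which is why each direction works on its own copy:
var list1 = new List<string> { "a", "b", "c" };
var list2 = new List<string> { "b", "c", "d" };

var onlyInFirst = new HashSet<string>(list1);
onlyInFirst.ExceptWith(list2);   // { "a" } - in list1 but not list2

var onlyInSecond = new HashSet<string>(list2);
onlyInSecond.ExceptWith(list1);  // { "d" } - in list2 but not list1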

First approach:
if (list1 != null && list2 != null
    && list1.Select(x => list2.SingleOrDefault(y => y.propertyToCompare == x.propertyToCompare
                                                 && y.anotherPropertyToCompare == x.anotherPropertyToCompare) != null)
            .All(x => x))
    return true;
Second approach if you are ok with duplicate values:
if (list1 != null && list2 != null
    && list1.Select(x => list2.Any(y => y.propertyToCompare == x.propertyToCompare
                                     && y.anotherPropertyToCompare == x.anotherPropertyToCompare))
            .All(x => x))
    return true;

Both Jon Skeet's and miguelmpn's answers are good. It depends on whether the order of the list elements is important or not:
// take order into account
bool areEqual1 = Enumerable.SequenceEqual(list1, list2);
// ignore order
bool areEqual2 = !list1.Except(list2).Any() && !list2.Except(list1).Any();

One line:
var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
    Console.WriteLine("same sets");

I wrote a generic function for comparing two lists.
public static class ListTools
{
    public enum RecordUpdateStatus
    {
        Added = 1,
        Updated = 2,
        Deleted = 3
    }

    public class UpdateStatu<T>
    {
        public T CurrentValue { get; set; }
        public RecordUpdateStatus UpdateStatus { get; set; }
    }

    public static List<UpdateStatu<T>> CompareList<T>(List<T> currentList, List<T> inList, string uniqPropertyName)
    {
        var res = new List<UpdateStatu<T>>();

        res.AddRange(inList.Where(a => !currentList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
            .Select(a => new UpdateStatu<T>
            {
                CurrentValue = a,
                UpdateStatus = RecordUpdateStatus.Added,
            }));

        res.AddRange(currentList.Where(a => !inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
            .Select(a => new UpdateStatu<T>
            {
                CurrentValue = a,
                UpdateStatus = RecordUpdateStatus.Deleted,
            }));

        res.AddRange(currentList.Where(a => inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
            .Select(a => new UpdateStatu<T>
            {
                CurrentValue = a,
                UpdateStatus = RecordUpdateStatus.Updated,
            }));

        return res;
    }
}

I think this is a simple and easy way to compare two lists element by element (here in Python):
x = [1, 2, 3, 5, 4, 8, 7, 11, 12, 45, 96, 25]
y = [2, 4, 5, 6, 8, 7, 88, 9, 6, 55, 44, 23]
tmp = []
# iterate over the indices both lists have in common
for i in range(min(len(x), len(y))):
    if x[i] > y[i]:
        tmp.append(1)
    else:
        tmp.append(0)
print(tmp)

Maybe it's funny, but this works for me:
string.Join("",List1) != string.Join("", List2)

This is the best solution you'll find:
var list3 = list1.Where(l => list2.Contains(l));

Related

get next available integer using LINQ

Say I have a list of integers:
List<int> myInts = new List<int>() {1,2,3,5,8,13,21};
I would like to get the next available integer, ordered by increasing integer. Not the last or highest one, but in this case the next integer that is not in this list. In this case the number is 4.
Is there a LINQ statement that would give me this? As in:
var nextAvailable = myInts.SomeCoolLinqMethod();
Edit: Crap. I said the answer should be 2 but I meant 4. I apologize for that!
For example: Imagine that you are responsible for handing out process IDs. You want to get the list of current process IDs, and issue a next one, but the next one should not just be the highest value plus one. Rather, it should be the next one available from an ordered list of process IDs. You could get the next available starting with the highest, it does not really matter.
I see a lot of answers that write a custom extension method, but it is possible to solve this problem with the standard LINQ extension methods and the static Enumerable class:
List<int> myInts = new List<int>() {1,2,3,5,8,13,21};
// This will set firstAvailable to 4.
int firstAvailable = Enumerable.Range(1, Int32.MaxValue).Except(myInts).First();
The answer provided by Kevin has an undesirable performance profile. The logic will access the source sequence numerous times: once for the .Count call, once for the .FirstOrDefault call, and once for each .Contains call. If the IEnumerable<int> instance is a deferred sequence, such as the result of a .Select call, this will cause at least 2 calculations of the sequence, along with once for each number. Even if you pass a list to the method, it will potentially go through the entire list for each checked number. Imagine running it on the sequence { 1, 1000000 } and you can see how it would not perform well.
LINQ strives to iterate source sequences no more than once. This is possible in general and can have a big impact on the performance of your code. Below is an extension method which will iterate the sequence exactly once. It does so by looking for the difference between each successive pair, then adds 1 to the first lower number which is more than 1 away from the next number:
public static int? FirstMissing(this IEnumerable<int> numbers)
{
    int? priorNumber = null;
    foreach (var number in numbers.OrderBy(n => n))
    {
        var difference = number - priorNumber;
        if (difference != null && difference > 1)
        {
            return priorNumber + 1;
        }
        priorNumber = number;
    }
    return priorNumber == null ? (int?)null : priorNumber + 1;
}
Since this extension method can be called on any arbitrary sequence of integers, we make sure to order them before we iterate. We then calculate the difference between the current number and the prior number. If this is the first number in the list, priorNumber will be null and thus difference will be null. If this is not the first number in the list, we check to see if the difference from the prior number is exactly 1. If not, we know there is a gap and we can add 1 to the prior number.
You can adjust the return statement to handle sequences with 0 or 1 items as you see fit; I chose to return null for empty sequences and n + 1 for the sequence { n }.
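A quick usage sketch of FirstMissing as written above (results based on the implementation shown; assumes the extension is in scope):
var a = new List<int> { 1, 2, 3, 5, 8 }.FirstMissing(); // 4
var b = new List<int> { 7 }.FirstMissing();             // 8
var c = new List<int>().FirstMissing();                 // null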
This will be fairly efficient:
static int Next(this IEnumerable<int> source)
{
    int? last = null;
    foreach (var next in source.OrderBy(_ => _))
    {
        if (last.HasValue && last.Value + 1 != next)
        {
            return last.Value + 1;
        }
        last = next;
    }
    return last.HasValue ? last.Value + 1 : Int32.MaxValue;
}
public static class IntExtensions
{
    public static int? SomeCoolLinqMethod(this IEnumerable<int> ints)
    {
        int counter = ints.Count() > 0 ? ints.First() : -1;
        while (counter < int.MaxValue)
        {
            if (!ints.Contains(++counter)) return counter;
        }
        return null;
    }
}
Usage:
var nextAvailable = myInts.SomeCoolLinqMethod();
Ok, here is the solution that I came up with that works for me.
var nextAvailableInteger = Enumerable.Range(myInts.Min(),myInts.Max()).FirstOrDefault( r=> !myInts.Contains(r));
If anyone has a more elegant solution I would be happy to accept that one. But for now, this is what I'm putting in my code and moving on.
Edit: this is what I implemented after Kevin's suggestion to add an extension method. And that was the real answer - that no single LINQ extension would do so it makes more sense to add my own. That is really what I was looking for.
public static int NextAvailableInteger(this IEnumerable<int> ints)
{
    return NextAvailableInteger(ints, 1); // by default we use one
}

public static int NextAvailableInteger(this IEnumerable<int> ints, int defaultValue)
{
    if (ints == null || ints.Count() == 0) return defaultValue;

    var ordered = ints.OrderBy(v => v);
    int counter = ints.Min();
    int max = ints.Max();
    while (counter < max)
    {
        if (!ordered.Contains(++counter)) return counter;
    }
    return (++counter);
}
Not sure if this qualifies as a cool Linq method, but using the left outer join idea from This SO Answer
var thelist = new List<int> { 1, 2, 3, 4, 5, 100, 101 };
var nextAvailable = (from curr in thelist
                     join next in thelist
                       on curr + 1 equals next into g
                     from newlist in g.DefaultIfEmpty()
                     where !g.Any()
                     orderby curr
                     select curr + 1).First();
This puts the processing on the sql server side if you're using Linq to Sql, and allows you to not have to pull the ID lists from the server to memory.
var nextAvailable = myInts.Prepend(0).TakeWhile((x,i) => x == i).Last() + 1;
It is 7 years later, but there are better ways of doing this than the selected answer or the answer with the most votes.
The list is already in order, and based on the example 0 doesn't count. We can just prepend 0 and check if each item matches its index. TakeWhile will stop evaluating once it hits a number that doesn't match, or at the end of the list.
The answer is the last item that matches, plus 1.
TakeWhile is more efficient than enumerating all the possible numbers and then excluding the existing numbers using Except, because TakeWhile only goes through the list until it finds the first available number, and the resulting enumerable collection is at most n items.
The answer using Except generates an entire enumerable of answers that are not needed just to grab the first one. LINQ can do some optimization with First(), but it is still much slower and more memory intensive than TakeWhile.
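For reference, a quick check of this approach against the list from the question; note it assumes the list is sorted ascending, as stated above:
var myInts = new List<int> { 1, 2, 3, 5, 8, 13, 21 };
// 0, 1, 2, 3 match their indices; 5 != 4 stops TakeWhile, Last() is 3, so the result is 4.
var nextAvailable = myInts.Prepend(0).TakeWhile((x, i) => x == i).Last() + 1;
Console.WriteLine(nextAvailable); // 4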

Why does the same algorithm run much more slowly in Scala than in C#? And how can I make it faster?

The algorithm creates all possible variants of the sequence from variants for each member of the sequence.
C# code :
static void Main(string[] args)
{
    var arg = new List<List<int>>();
    int i = 0;
    for (int j = 0; j < 5; j++)
    {
        arg.Add(new List<int>());
        for (int j1 = i; j1 < i + 3; j1++)
        {
            //if (j1 != 5)
            arg[j].Add(j1);
        }
        i += 3;
    }

    List<Utils<int>.Variant<int>> b2 = new List<Utils<int>.Variant<int>>();
    //int[][] bN;
    var s = System.Diagnostics.Stopwatch.StartNew();
    //for(int j = 0; j < 10;j++)
    b2 = Utils<int>.Produce2(arg);
    s.Stop();
    Console.WriteLine(s.ElapsedMilliseconds);
}
public class Variant<T>
{
    public T element;
    public Variant<T> previous;
}

public static List<Variant<T>> Produce2(List<List<T>> input)
{
    var ret = new List<Variant<T>>();
    foreach (var form in input)
    {
        var newRet = new List<Variant<T>>(ret.Count * form.Count);
        foreach (var el in form)
        {
            if (ret.Count == 0)
            {
                newRet.Add(new Variant<T> { element = el, previous = null });
            }
            else
            {
                foreach (var variant in ret)
                {
                    var buf = new Variant<T> { previous = variant, element = el };
                    newRet.Add(buf);
                }
            }
        }
        ret = newRet;
    }
    return ret;
}
Scala code :
object test {
  def main() {
    var arg = new Array[Array[Int]](5)
    var i = 0
    var init = 0
    while (i < 5)
    {
      var buf = new Array[Int](3)
      var j = 0
      while (j < 3)
      {
        buf(j) = init
        init = init + 1
        j = j + 1
      }
      arg(i) = buf
      i = i + 1
    }
    println("Hello, world!")
    val start = System.currentTimeMillis
    var res = Produce(arg)
    val stop = System.currentTimeMillis
    println(stop - start)
    /*for(list <- res)
    {
      for(el <- list)
        print(el+" ")
      println
    }*/
    println(res.length)
  }

  def Produce[T](input: Array[Array[T]]): Array[Variant[T]] =
  {
    var ret = new Array[Variant[T]](1)
    for (val forms <- input)
    {
      if (forms != null)
      {
        var newRet = new Array[Variant[T]](forms.length * ret.length)
        if (ret.length > 0)
        {
          for (val prev <- ret)
            if (prev != null)
              for (val el <- forms)
              {
                newRet = newRet :+ new Variant[T](el, prev)
              }
        }
        else
        {
          for (val el <- forms)
          {
            newRet = newRet :+ new Variant[T](el, null)
          }
        }
        ret = newRet
      }
    }
    return ret
  }
}

class Variant[T](var element: T, previous: Variant[T])
{
}
As others have said, the difference is in how you're using the collections. Array in Scala is the same thing as Java's primitive array, [], which is the same as C#'s primitive array []. Scala is clever enough to do what you ask (namely, copy the entire array with a new element on the end), but not so clever as to tell you that you'd be better off using a different collection. For example, if you just change Array to ArrayBuffer it should be much faster (comparable to C#).
Actually, though, you'd be better off not using for loops at all. One of the strengths of Scala's collections library is that you have a wide variety of powerful operations at your disposal. In this case, you want to take every item from forms and convert it into a Variant. That's what map does.
Also, your Scala code doesn't seem to actually work.
If you want all possible variants from each member, you really want to use recursion. This implementation does what you say you want:
object test {
  def produce[T](input: Array[Array[T]], index: Int = 0): Array[List[T]] = {
    if (index >= input.length) Array()
    else if (index == input.length - 1) input(index).map(elem => List(elem))
    else {
      produce(input, index + 1).flatMap(variant => {
        input(index).map(elem => elem :: variant)
      })
    }
  }

  def main() {
    val arg = Array.tabulate(5, 3)((i, j) => i * 3 + j)
    println("Hello, world!")
    val start = System.nanoTime
    var res = produce(arg)
    val stop = System.nanoTime
    println("Time elapsed (ms): " + (stop - start) / 1000000L)
    println("Result length: " + res.length)
    println(res.deep)
  }
}
Let's unpack this a little. First, we've replaced your entire construction of the initial variants with a single tabulate instruction. tabulate takes a target size (5x3, here), and then a function that maps from the indices into that rectangle into the final value.
We've also made produce a recursive function. (Normally we'd make it tail-recursive, but let's keep things as simple as we can for now.) How do you generate all variants? Well, all variants is clearly (every possibility at this position) + (all variants from later positions). So we write that down recursively.
Note that if we build variants recursively like this, all the tails of the variants end up the same, which makes List a perfect data structure: it's a singly-linked immutable list, so instead of having to copy all those tails over and over again, we just point to them.
Now, how do we actually do the recursion? Well, if there's no data at all, we had better return an empty array (i.e. if index is past the end of the array). If we're on the last element of the array of variations, we basically want each element to turn into a list of length 1, so we use map to do exactly that (elem => List(elem)). Finally, if we are not at the end, we get the results from the rest (which is produce(input, index+1)) and make variants with each element.
Let's take the inner loop first: input(index).map(elem => elem :: variant). This takes each element from variants in position index and sticks them onto an existing variant. So this will give us a new batch of variants. Fair enough, but where do we get the new variant from? We produce it from the rest of the list: produce(input, index+1), and then the only trick is that we need to use flatMap--this takes each element, produces a collection out of it, and glues all those collections together.
I encourage you to throw printlns in various places to see what's going on.
Finally, note that with your test size, it's actually an insignificant amount of work; you can't accurately measure that, even if you switch to using the more accurate System.nanoTime as I did. You'd need something like tabulate(12,3) before it gets significant (500,000 variants produced).
The :+ method of the Array (more precisely of ArrayOps) will always create a copy of the array. So instead of a constant time operation you have one that is more or less O(n).
You do it within nested loops, so the whole thing becomes an order of magnitude slower.
This way you more or less emulate an immutable data structure with a mutable one (which was not designed for it).
To fix it you can either use Array as a mutable data structure (but then try to avoid endless copying), or you can switch to an immutable one. I did not check your code very carefully, but the first bet is usually List; check the scaladoc of the various methods to see their performance behaviour.
ret.length is not 0 all the time; right before the return it is 243. The size of the array should not be changed, and List in .NET is an abstraction on top of an array. But thank you for the point - the problem was that I used the :+ operator with Array, which as I understand it caused the whole array to be copied on every append.

Why is IEnumerable.Count() reevaluating the query?

The following code prints "2 2 2 2" when I would expect "1 1 1 1". Why does Count() reevaluate the query?
class Class1
{
    static int GlobalTag = 0;

    public Class1()
    {
        tag = (++GlobalTag);
    }

    public int tag;
    public int calls = 0;

    public int Do()
    {
        calls++;
        return tag;
    }
}

class Program
{
    static void Main(string[] args)
    {
        Class1[] cls = new Class1[] { new Class1(), new Class1(), new Class1(), new Class1() };
        var result = cls.Where(c => (c.Do() % 2) == 0);
        if (result.Count() <= 10)
        {
            if (result.Count() <= 10)
            {
                foreach (var c in cls)
                {
                    Console.WriteLine(c.calls);
                }
            }
        }
    }
}
How else could it work? What would you expect Count() to do in order to cache the values?
LINQ to Objects generally executes lazily, only actually evaluating a query when it needs to - such as to count the elements. So the call to Where isn't evaluating the sequence at all; it just remembers the predicate and the sequence so that it can evaluate it when it needs to.
For a lot more details about how LINQ to Objects works, I suggest you read my Edulinq blog series. It's rather long (and not quite finished) but it'll give you a lot more insight into how LINQ to Objects works.
Not all sequences are repeatable, so it generally has to count them. To help it, you could call ToList() on the sequence - even if typed as IEnumerable<T>, LINQ-to-Objects will still short-cut and use the .Count - so very cheap, and repeatable.
For an example of a non-repeatable sequence:
static int evil;
static IEnumerable<int> GetSequence()
{
    foreach (var item in Enumerable.Range(1, Interlocked.Increment(ref evil)))
        yield return item;
}
with demo:
var sequence = GetSequence();
Console.WriteLine(sequence.Count()); // 1
Console.WriteLine(sequence.Count()); // 2
Console.WriteLine(sequence.Count()); // 3
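To illustrate the ToList() suggestion against the code from the question: materialize the query once and the predicate runs only once per element, so the program prints "1 1 1 1".
var result = cls.Where(c => (c.Do() % 2) == 0).ToList();
if (result.Count <= 10)   // List<T>.Count is a property; nothing is re-evaluated
{
    if (result.Count <= 10)
    {
        foreach (var c in cls)
        {
            Console.WriteLine(c.calls); // 1 1 1 1
        }
    }
}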

Get next N elements from enumerable

Context: C# 3.0, .Net 3.5
Suppose I have a method that generates random numbers (forever):
private static IEnumerable<int> RandomNumberGenerator() {
    while (true) yield return GenerateRandomNumber(0, 100);
}
}
I need to group those numbers in groups of 10, so I would like something like:
foreach (IEnumerable<int> group in RandomNumberGenerator().Slice(10)) {
    Assert.That(group.Count() == 10);
}
I have defined Slice method, but I feel there should be one already defined. Here is my Slice method, just for reference:
private static IEnumerable<T[]> Slice<T>(IEnumerable<T> enumerable, int size) {
    var result = new List<T>(size);
    foreach (var item in enumerable) {
        result.Add(item);
        if (result.Count == size) {
            yield return result.ToArray();
            result.Clear();
        }
    }
}
Question: is there an easier way to accomplish what I'm trying to do? Perhaps Linq?
Note: the above example is a simplification; in my program I have an iterator that scans a given matrix in a non-linear fashion.
EDIT: Why Skip+Take is no good.
Effectively what I want is:
var group1 = RandomNumberGenerator().Skip(0).Take(10);
var group2 = RandomNumberGenerator().Skip(10).Take(10);
var group3 = RandomNumberGenerator().Skip(20).Take(10);
var group4 = RandomNumberGenerator().Skip(30).Take(10);
without the overhead of regenerating numbers (10+20+30+40) times. I need a solution that will generate exactly 40 numbers and break those into 4 groups of 10.
Are Skip and Take of any use to you?
Use a combination of the two in a loop to get what you want.
So,
list.Skip(10).Take(10);
Skips the first 10 records and then takes the next 10.
I have done something similar. But I would like it to be simpler:
//Remove "this" if you don't want it to be a extension method
public static IEnumerable<IList<T>> Chunks<T>(this IEnumerable<T> xs, int size)
{
    var curr = new List<T>(size);
    foreach (var x in xs)
    {
        curr.Add(x);
        if (curr.Count == size)
        {
            yield return curr;
            curr = new List<T>(size);
        }
    }
}
I think yours are flawed. You return the same array for all your chunks/slices so only the last chunk/slice you take would have the correct data.
Addition: Array version:
public static IEnumerable<T[]> Chunks<T>(this IEnumerable<T> xs, int size)
{
    var curr = new T[size];
    int i = 0;
    foreach (var x in xs)
    {
        curr[i % size] = x;
        if (++i % size == 0)
        {
            yield return curr;
            curr = new T[size];
        }
    }
}
Addition: Linq version (not C# 2.0). As pointed out, it will not work on infinite sequences and will be a great deal slower than the alternatives:
public static IEnumerable<T[]> Chunks<T>(this IEnumerable<T> xs, int size)
{
    return xs.Select((x, i) => new { x, i })
             .GroupBy(xi => xi.i / size, xi => xi.x)
             .Select(g => g.ToArray());
}
Using Skip and Take would be a very bad idea. Calling Skip on an indexed collection may be fine, but calling it on any arbitrary IEnumerable<T> is liable to result in enumeration over the number of elements skipped, which means that if you're calling it repeatedly you're enumerating over the sequence an order of magnitude more times than you need to be.
Complain of "premature optimization" all you want; but that is just ridiculous.
I think your Slice method is about as good as it gets. I was going to suggest a different approach that would provide deferred execution and obviate the intermediate array allocation, but that is a dangerous game to play (i.e., if you try something like ToList on such a resulting IEnumerable<T> implementation, without enumerating over the inner collections, you'll end up in an endless loop).
(I've removed what was originally here, as the OP's improvements since posting the question have since rendered my suggestions here redundant.)
Let's see if you even need the complexity of Slice. If your random number generator is stateless, I would assume each call to it would generate unique random numbers, so perhaps this would be sufficient:
var group1 = RandomNumberGenerator().Take(10);
var group2 = RandomNumberGenerator().Take(10);
var group3 = RandomNumberGenerator().Take(10);
var group4 = RandomNumberGenerator().Take(10);
Each call to Take returns a new group of 10 numbers.
Now, if your random number generator re-seeds itself with a specific value each time it's iterated, this won't work. You'll simply get the same 10 values for each group. So instead, you would use:
var generator = RandomNumberGenerator();
var group1 = generator.Take(10);
var group2 = generator.Take(10);
var group3 = generator.Take(10);
var group4 = generator.Take(10);
This maintains an instance of the generator so that you can continue retrieving values without re-seeding the generator.
You could use the Skip and Take methods with any Enumerable object.
For your edit :
How about a function that takes a slice number and a slice size as a parameter?
private static IEnumerable<T> Slice<T>(IEnumerable<T> enumerable, int sliceSize, int sliceNumber) {
    return enumerable.Skip(sliceSize * sliceNumber).Take(sliceSize);
}
It seems like we'd prefer for an IEnumerable<T> to have a fixed position counter so that we can do
var group1 = items.Take(10);
var group2 = items.Take(10);
var group3 = items.Take(10);
var group4 = items.Take(10);
and get successive slices rather than getting the first 10 items each time. We can do that with a new implementation of IEnumerable<T> which keeps one instance of its Enumerator and returns it on every call of GetEnumerator:
public class StickyEnumerable<T> : IEnumerable<T>, IDisposable
{
    private IEnumerator<T> innerEnumerator;

    public StickyEnumerable(IEnumerable<T> items)
    {
        innerEnumerator = items.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        return innerEnumerator;
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return innerEnumerator;
    }

    public void Dispose()
    {
        if (innerEnumerator != null)
        {
            innerEnumerator.Dispose();
        }
    }
}
Given that class, we could implement Slice with
public static IEnumerable<IEnumerable<T>> Slices<T>(this IEnumerable<T> items, int size)
{
    using (StickyEnumerable<T> sticky = new StickyEnumerable<T>(items))
    {
        IEnumerable<T> slice;
        do
        {
            slice = sticky.Take(size).ToList();
            yield return slice;
        } while (slice.Count() == size);
    }
    yield break;
}
That works in this case, but StickyEnumerable<T> is generally a dangerous class to have around if the consuming code isn't expecting it. For example,
using (var sticky = new StickyEnumerable<int>(Enumerable.Range(1, 10)))
{
    var first = sticky.Take(2);
    var second = sticky.Take(2);
    foreach (int i in second)
    {
        Console.WriteLine(i);
    }
    foreach (int i in first)
    {
        Console.WriteLine(i);
    }
}
prints
1
2
3
4
rather than
3
4
1
2
Take a look at Take(), TakeWhile() and Skip()
I think the use of Slice() would be a bit misleading. I think of that as a means of giving me a chunk of an array as a new array without causing side effects. In this scenario you would actually move the enumerable forward 10.
A possibly better approach is to just use the LINQ extension Take(). I don't think you would need to use Skip() with a generator.
Edit: Dang, I have been trying to test this behavior with the following code
Note: this wasn't really correct; I leave it here so others don't fall into the same mistake.
var numbers = RandomNumberGenerator();
var slice = numbers.Take(10);
public static IEnumerable<int> RandomNumberGenerator()
{
    yield return random.Next();
}
but the Count() for slice is always 1. I also tried running it through a foreach loop since I know that the Linq extensions are generally lazily evaluated and it only looped once. I eventually did the code below instead of the Take() and it works:
public static IEnumerable<int> Slice(this IEnumerable<int> enumerable, int size)
{
    var list = new List<int>();
    foreach (var count in Enumerable.Range(0, size)) list.Add(enumerable.First());
    return list;
}
If you notice I am adding the First() to the list each time, but since the enumerable that is being passed in is the generator from RandomNumberGenerator() the result is different every time.
So again with a generator using Skip() is not needed since the result will be different. Looping over an IEnumerable is not always side effect free.
Edit: I'll leave the last edit just so no one falls into the same mistake, but it worked fine for me just doing this:
var numbers = RandomNumberGenerator();
var slice1 = numbers.Take(10);
var slice2 = numbers.Take(10);
The two slices were different.
I had made some mistakes in my original answer but some of the points still stand. Skip() and Take() are not going to work the same with a generator as it would a list. Looping over an IEnumerable is not always side effect free. Anyway here is my take on getting a list of slices.
public static IEnumerable<int> RandomNumberGenerator()
{
    while (true) yield return random.Next();
}

public static IEnumerable<IEnumerable<int>> Slice(this IEnumerable<int> enumerable, int size, int count)
{
    var slices = new List<List<int>>();
    foreach (var iteration in Enumerable.Range(0, count)) {
        var list = new List<int>();
        list.AddRange(enumerable.Take(size));
        slices.Add(list);
    }
    return slices;
}
I got this solution for the same problem:
int[] ints = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
IEnumerable<IEnumerable<int>> chunks = Chunk(ints, 2, t => t.Dump());
//won't enumerate, so won't do anything unless you force it:
chunks.ToList();

IEnumerable<T> Chunk<T, R>(IEnumerable<R> src, int n, Func<IEnumerable<R>, T> action)
{
    IEnumerable<R> head;
    IEnumerable<R> tail = src;
    while (tail.Any())
    {
        head = tail.Take(n);
        tail = tail.Skip(n);
        yield return action(head);
    }
}
If you just want the chunks returned and don't want to do anything with them, use chunks = Chunk(ints, 2, t => t). What I would really like is to have t => t as the default action, but I haven't found out how to do that yet.
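One way to get the t => t default mentioned above (a sketch; C# doesn't allow a lambda as a default parameter value, so an overload that forwards to the method above does the job):
// Overload that returns the chunks themselves, forwarding to the Chunk above.
IEnumerable<IEnumerable<R>> Chunk<R>(IEnumerable<R> src, int n)
{
    return Chunk(src, n, t => t);
}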

Access Enumerator within a foreach loop?

I have a List class, and I would like to override GetEnumerator() to return my own Enumerator class. This Enumerator class would have two additional properties that would be updated as the Enumerator is used.
For simplicity (this isn't the exact business case), let's say those properties were CurrentIndex and RunningTotal.
I could manage these properties within the foreach loop manually, but I would rather encapsulate this functionality for reuse, and the Enumerator seems to be the right spot.
The problem: foreach hides all the Enumerator business, so is there a way to, within a foreach statement, access the current Enumerator so I can retrieve my properties? Or would I have to forgo foreach, use a nasty old while loop, and manipulate the Enumerator myself?
Strictly speaking, I would say that if you want to do exactly what you're saying, then yes, you would need to call GetEnumerator and control the enumerator yourself with a while loop.
Without knowing too much about your business requirement, you might be able to take advantage of an iterator function, such as something like this:
public static IEnumerable<decimal> IgnoreSmallValues(List<decimal> list)
{
    decimal runningTotal = 0M;
    foreach (decimal value in list)
    {
        // if the value is less than 1% of the running total, then ignore it
        if (runningTotal == 0M || value >= 0.01M * runningTotal)
        {
            runningTotal += value;
            yield return value;
        }
    }
}
Then you can do this:
List<decimal> payments = new List<decimal>() {
    123.45M,
    234.56M,
    .01M,
    345.67M,
    1.23M,
    456.78M
};

foreach (decimal largePayment in IgnoreSmallValues(payments))
{
    // handle the large payments so that I can divert all the small payments to my own bank account. Mwahaha!
}
Updated:
Ok, so here's a follow-up with what I've termed my "fishing hook" solution. Now, let me add a disclaimer that I can't really think of a good reason to do something this way, but your situation may differ.
The idea is that you simply create a "fishing hook" object (reference type) that you pass to your iterator function. The iterator function manipulates your fishing hook object, and since you still have a reference to it in your code outside, you have visibility into what's going on:
public class FishingHook
{
    public int Index { get; set; }
    public decimal RunningTotal { get; set; }
    public Func<decimal, bool> Criteria { get; set; }
}

public static IEnumerable<decimal> FishingHookIteration(IEnumerable<decimal> list, FishingHook hook)
{
    hook.Index = 0;
    hook.RunningTotal = 0;
    foreach (decimal value in list)
    {
        // the hook object may define a Criteria delegate that
        // determines whether to skip the current value
        if (hook.Criteria == null || hook.Criteria(value))
        {
            hook.RunningTotal += value;
            yield return value;
            hook.Index++;
        }
    }
}
You would utilize it like this:
List<decimal> payments = new List<decimal>() {
    123.45M,
    .01M,
    345.67M,
    234.56M,
    1.23M,
    456.78M
};

FishingHook hook = new FishingHook();
decimal min = 0;
hook.Criteria = x => x > min; // exclude any values that are less than/equal to the defined minimum

foreach (decimal value in FishingHookIteration(payments, hook))
{
    // update the minimum
    if (value > min) min = value;
    Console.WriteLine("Index: {0}, Value: {1}, Running Total: {2}", hook.Index, value, hook.RunningTotal);
}

// Resulting output is:
//Index: 0, Value: 123.45, Running Total: 123.45
//Index: 1, Value: 345.67, Running Total: 469.12
//Index: 2, Value: 456.78, Running Total: 925.90
// we've skipped the values .01, 234.56, and 1.23
Essentially, the FishingHook object gives you some control over how the iterator executes. The impression I got from the question was that you needed some way to access the inner workings of the iterator so that you could manipulate how it iterates while you are in the middle of iterating, but if this is not the case, then this solution might be overkill for what you need.
With foreach you indeed can't get the enumerator - you could, however, have the enumerator return (yield) a tuple that includes that data; in fact, you could probably use LINQ to do it for you...
(I couldn't cleanly get the index using LINQ - can get the total and current value via Aggregate, though; so here's the tuple approach)
using System.Collections;
using System.Collections.Generic;
using System;

class MyTuple
{
    public int Value { get; private set; }
    public int Index { get; private set; }
    public int RunningTotal { get; private set; }

    public MyTuple(int value, int index, int runningTotal)
    {
        Value = value; Index = index; RunningTotal = runningTotal;
    }

    static IEnumerable<MyTuple> SomeMethod(IEnumerable<int> data)
    {
        int index = 0, total = 0;
        foreach (int value in data)
        {
            yield return new MyTuple(value, index++,
                total = total + value);
        }
    }

    static void Main()
    {
        int[] data = { 1, 2, 3 };
        foreach (var tuple in SomeMethod(data))
        {
            Console.WriteLine("{0}: {1} ; {2}", tuple.Index,
                tuple.Value, tuple.RunningTotal);
        }
    }
}
You can also do something like this in a more functional way, depending on your requirements. What you are asking can be thought of as "zipping" together multiple sequences, and then iterating through them all at once. The three sequences for the example you gave would be:
The "value" sequence
The "index" sequence
The "Running Total" Sequence
The next step would be to specify each of these sequences separately:
List<decimal> ValueList
var Indexes = Enumerable.Range(0, ValueList.Count)
The last one is more fun... the two methods I can think of are to either have a temporary variable used to sum up the sequence, or to recalculate the sum for each item. The second is obviously much less performant, so I would rather use the temporary:
decimal Sum = 0;
var RunningTotals = ValueList.Select(v => Sum = Sum + v);
The last step would be to zip these all together. .Net 4 will have the Zip operator built in, in which case it will look like this:
var ZippedSequence = ValueList.Zip(Indexes, (value, index) => new {value, index}).Zip(RunningTotals, (temp, total) => new {temp.value, temp.index, total});
This obviously gets noisier the more things you try to zip together.
In the last link, there is source for implementing the Zip function yourself. It really is a simple little bit of code.
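For reference, a hedged sketch of such a Zip implementation (roughly the shape of .NET 4's Enumerable.Zip, not the actual BCL source; requires using System and System.Collections.Generic):
public static class ZipSketchExtensions
{
    // Pairs up elements from the two sequences until either one runs out.
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
        this IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> resultSelector)
    {
        using (var e1 = first.GetEnumerator())
        using (var e2 = second.GetEnumerator())
        {
            while (e1.MoveNext() && e2.MoveNext())
                yield return resultSelector(e1.Current, e2.Current);
        }
    }
}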
