How to remove duplicates from a list of nested objects in C#?

I know there are many answers out there suggesting overriding Equals and GetHashCode, but in my case that is not possible, because the objects involved come from imported DLLs.
First, I have a list of objects called DeploymentData.
Among other properties, these objects contain the following two: Location(double x, double y, double z) and Duct(int id).
The goal is to remove those that have the same Location parameters.
First, I grouped them by Duct, as a Location cannot be the same if it is on another duct:
var groupingByDuct = deploymentDataList.GroupBy(x => x.Duct.Id).ToList();
Then the actual algorithm:
List<DeploymentData> uniqueDeploymentData = new List<DeploymentData>();
foreach (var group in groupingByDuct)
{
    uniqueDeploymentData
        .AddRange(group
            .Select(x => x)
            .GroupBy(d => new { d.Location.X, d.Location.Y, d.Location.Z })
            .Select(x => x.First()).ToList());
}
This does the job, but to properly check that two entries are indeed duplicates, the entire location should be compared within a tolerance. For this, I've written the following method:
private bool CompareXYZ(XYZ point1, XYZ point2, double tolerance = 10)
{
    if (System.Math.Abs(point1.X - point2.X) < tolerance &&
        System.Math.Abs(point1.Y - point2.Y) < tolerance &&
        System.Math.Abs(point1.Z - point2.Z) < tolerance)
    {
        return true;
    }
    return false;
}
BUT I have no idea how to apply that to the code written above. To sum up:
How can I write the algorithm above without all those method calls?
How can I adjust the algorithm above to use the CompareXYZ method for a better precision?
Efficiency?

An easy way to filter out duplicates is to use a HashSet with a custom equality comparer, i.e. a class that implements IEqualityComparer<DeploymentData>, e.g.:
public class DeploymentDataEqualityComparer : IEqualityComparer<DeploymentData>
{
    private readonly double _tolerance;

    public DeploymentDataEqualityComparer(double tolerance)
    {
        _tolerance = tolerance;
    }

    public bool Equals(DeploymentData a, DeploymentData b)
    {
        if (a.Duct.Id != b.Duct.Id)
            return false; // Different Duct, therefore not equal
        return System.Math.Abs(a.Location.X - b.Location.X) < _tolerance &&
               System.Math.Abs(a.Location.Y - b.Location.Y) < _tolerance &&
               System.Math.Abs(a.Location.Z - b.Location.Z) < _tolerance;
    }

    public int GetHashCode(DeploymentData dd)
    {
        // Hash on the duct only: including the Location would break the comparer,
        // because two points within tolerance usually produce different location
        // hashes and would then never reach Equals. Items on the same duct all
        // land in the same bucket and are compared pairwise by Equals.
        return dd.Duct.Id.GetHashCode();
    }
}
In order to filter duplicates, you can then add them to a HashSet:
var hashSet = new HashSet<DeploymentData>(new DeploymentDataEqualityComparer(10));
foreach (var deploymentData in deploymentDataList)
    hashSet.Add(deploymentData);
This way you do not need to group by duct first, and you benefit from the HashSet's fast, hash-based lookups.
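The same comparer can also be plugged straight into LINQ's Distinct, which is equivalent to the HashSet loop above:
var uniqueDeploymentData = deploymentDataList
    .Distinct(new DeploymentDataEqualityComparer(10))
    .ToList();
One caveat worth knowing: tolerance-based equality is not transitive (A can be within tolerance of B, and B of C, while A and C are not), so which representative survives depends on the order of the input.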


Is this possible? Specify any generic type as long as the + operation is defined on it

I'm not sure if this is possible, but if it is then it would be useful.
I am attempting to write a class called Matrix<T>. The intent is to be able to have matrices of various data types, such as integers, floats, doubles, etc.
I now want to define addition:
public static Matrix<T> operator +(Matrix<T> first, Matrix<T> second)
{
    if (first.dimension != second.dimension)
    {
        throw new Exception("The matrices' dimensions do not match");
    }
    Matrix<T> add = new Matrix<T>(first.dimension);
    for (int i = 1; i <= first.rows; i++)
    {
        for (int j = 1; j <= first.columns; j++)
        {
            add[i, j] = first[i, j] + second[i, j];
        }
    }
    return add;
}
There is an issue with the line add[i,j] = first[i,j] + second[i,j]; since the operation + is not defined on a general object of type T.
I only want to specify matrices where T is a type such that addition is defined, however. So, I can make a matrix of ints, floats, doubles, etc. but if I were to try and define a matrix of, say, int[]s, I would want this to throw an exception since + is not defined for int[]s.
So, instead of writing T, is there some way of telling the compiler "this can take in any generic type, as long as an operator + is defined on the type"? Or is this not possible, meaning I would have to separately define a matrix of ints, a matrix of floats, and so on?
Edit: I don't see how the linked question from closure is related to this - I see nothing about operators there. If they are related, can somebody explain how?
Currently it is not possible (at least not without losing compile-time safety or changing the API), but with preview features enabled and the System.Runtime.Experimental NuGet package you can use IAdditionOperators to require that T has a + operator defined. I would say that adding this interface to Matrix itself can also be a good idea:
class Matrix<T> : IAdditionOperators<Matrix<T>, Matrix<T>, Matrix<T>> where T : IAdditionOperators<T, T, T>
{
    public static Matrix<T> operator +(Matrix<T> left, Matrix<T> right)
    {
        // swap to real implementation here
        T x = default;
        T y = default;
        Console.WriteLine(x + y);
        return default;
    }
}
See also:
Generic math (especially the section about trying it out; note: VS 2022 recommended)
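For completeness, once the constraint is in place, the stub above can be replaced by a real element-wise addition. The following is only a sketch that assumes the dimension, rows and columns members and the indexer from the question's Matrix<T>:
public static Matrix<T> operator +(Matrix<T> left, Matrix<T> right)
{
    if (left.dimension != right.dimension)
        throw new ArgumentException("The matrices' dimensions do not match");
    var sum = new Matrix<T>(left.dimension);
    for (int i = 1; i <= left.rows; i++)
        for (int j = 1; j <= left.columns; j++)
            sum[i, j] = left[i, j] + right[i, j]; // resolved via IAdditionOperators<T, T, T>
    return sum;
}
(On .NET 7 and later this no longer needs preview features: IAdditionOperators ships in the System.Numerics namespace.)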
It's possible using reflection:
class Int
{
    readonly int v;
    public int Get => v;

    public Int(int v)
    {
        this.v = v;
    }

    public static Int operator +(Int me, Int other) => new Int(me.v + other.v);
}

class Arr<T>
{
    T[] _arr;

    public Arr(T[] arr)
    {
        _arr = arr;
    }

    public T this[int index] => _arr[index];

    public static Arr<T> operator +(Arr<T> me, Arr<T> other)
    {
        var addMethod = typeof(T).GetMethod("op_Addition");
        if (addMethod == null)
            throw new InvalidOperationException($"Type {typeof(T)} doesn't implement '+' operator");
        var result = me._arr.Zip(other._arr)
            .Select(elements => addMethod.Invoke(null, new object[] { elements.First, elements.Second }))
            .Cast<T>()
            .ToArray();
        return new Arr<T>(result);
    }
}
[Test]
public void TestAdd()
{
    var firstArray = new Arr<Int>(new[] { new Int(1), new Int(2) });
    var secondArray = new Arr<Int>(new[] { new Int(2), new Int(3) });
    var sum = firstArray + secondArray;
    Assert.AreEqual(3, sum[0].Get);
    Assert.AreEqual(5, sum[1].Get);
}
I reduced the example to an array. Unfortunately it compiles even if T doesn't implement the add operator, so you will get an exception at runtime. You could also check that the add method has the proper signature (returns T and takes two T's). If you need help understanding the code, let me know!
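That signature check could be done by passing the expected parameter types to GetMethod and validating the return type; a sketch that would replace the lookup inside operator+ above:
// Resolve only an op_Addition overload of the exact shape T +(T, T)
var addMethod = typeof(T).GetMethod("op_Addition", new[] { typeof(T), typeof(T) });
if (addMethod == null || addMethod.ReturnType != typeof(T))
    throw new InvalidOperationException(
        $"Type {typeof(T)} doesn't implement a '+' operator taking two {typeof(T).Name}s and returning {typeof(T).Name}");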

Find Circular Items in a Collection (of non-Transitive Items)

Problem:
I have got a simple List<T> and I'm trying to sort it. But the items in the list are not all transitive in terms of comparability. For example, my List<T> looks like:
A
B
C
D
E
where A > B and B > C but C > A. It is also possible to have circular greatness like A > B, B > C, C > D but D > A, i.e., it need not always be a group of 3. What I want is to find all groups of circular greatness in the given List<T>. For example, assuming A > B > C > A and A > B > C > D > A are the two circular groups in the above case, my output should look either like:
List<List<T>> circulars = [[A, B, C, A], [A, B, C, D, A]]
or
List<List<T>> circulars = [[A, B, C], [A, B, C, D]]
// but in this case I do not want duplicates in the output.
// For example, the output shouldn't have both [B, C, A] and [A, B, C]
// since both the groups refer to the same set of circular items A, B & C
// as B > C > A > B is also true.
// But [B, A, C] is a different group (though nothing circular about it)
Either one is fine with me. I'd prefer a small (LINQ-ish) solution, but this didn't look as easy as it first seemed. Maybe I'm missing something very simple.
Scenario:
This is part of a sports analysis where one player/team will be stronger than another, which in turn will be stronger than a third, but the last one will be stronger than the first. I can't reveal more information, but let me take the case of head-to-heads in sports: especially in tennis and chess, the individual match-ups lead to this kind of situation. For example, in terms of head-to-head, Kramnik leads Kasparov and Kasparov leads Karpov, but Karpov leads Kramnik. For another example, Federer leads Davydenko, Davydenko leads Nadal, but Nadal leads Federer.
My class looks like this:
class Player : IComparable<Player>
{
    // logic
}
This is what I tried:
First generate all possible permutations of the collection's items with a minimum group size of 3, like [A, B, C], [A, C, B], ..., [A, B, C, D], [A, B, D, C], etc. (This is very slow.)
Then go through all the subgroups and check for patterns, like whether there are any situations where A > B > C > D. (This is reasonably slow, but I'm OK with it.)
Lastly, go through all the subgroups to remove duplicate groups like [A, B, C] and [B, C, A], etc.
Code:
var players = [.....]; //all the players in the collection
// first generate all the permutations possible in the list from size 3
// to players.Count
var circulars = Enumerable.Range(3, players.Count - 3 + 1)
    .Select(x => players.Permutations(x))
    .SelectMany(x => x)
    .Select(x => x.ToList())
    // then check in each sublist whether a pattern like A > B > C > A emerges
    // (the comparison below is the player comparison)
    .Where(l => l.Zip(l.Skip(1), (p1, p2) => new { p1, p2 }).All(x => x.p1 > x.p2)
             && l.First() < l.Last())
    // then remove duplicate lists using a special comparer
    .Distinct(new CircularComparer<Player>())
    .ToList();
public static IEnumerable<IEnumerable<T>> Permutations<T>(this IEnumerable<T> list, int length)
{
    if (length == 1)
        return list.Select(t => new[] { t });
    return Permutations(list, length - 1)
        .SelectMany(t => list.Where(e => !t.Contains(e)), (t1, t2) => t1.Concat(new[] { t2 }));
}
class CircularComparer<T> : IEqualityComparer<ICollection<T>>
{
    public bool Equals(ICollection<T> x, ICollection<T> y)
    {
        if (x.Count != y.Count)
            return false;
        return Enumerable.Range(1, x.Count)
            .Any(i => x.SequenceEqual(y.Skip(i).Concat(y.Take(i))));
    }

    public int GetHashCode(ICollection<T> obj)
    {
        // A constant hash is valid but forces Equals to run for every candidate pair
        return 0;
    }
}
The problem with this approach is that it is extremely slow. For a collection of just around 10 items, the number of permutations that has to be generated is itself huge (close to 1 million). Is there a better approach that is reasonably efficient? I am not after the fastest code possible. Is there a better recursive approach here? Smells like it.
The scenario...
[A, B, C, D, E]
where A > B, B > C, C > D, C > A, D > A
...could be represented as a directed graph, using the convention that A -> B means A > B.
So the question is essentially "How can I find cycles in a directed graph?"
To solve that, you can use Tarjan's strongly connected components algorithm. I would recommend looking up a good implementation of this algorithm and applying it to your scenario.
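For illustration, here is a compact sketch of Tarjan's algorithm. It assumes the players have first been turned into an adjacency map with an entry for every player, where graph[a] lists every b with a > b; every returned component with more than one node contains at least one cycle:
static List<List<T>> StronglyConnectedComponents<T>(Dictionary<T, List<T>> graph)
{
    var index = new Dictionary<T, int>();    // discovery order of each node
    var lowLink = new Dictionary<T, int>();  // lowest index reachable from the node
    var onStack = new HashSet<T>();
    var stack = new Stack<T>();
    var components = new List<List<T>>();
    int counter = 0;

    void StrongConnect(T v)
    {
        index[v] = lowLink[v] = counter++;
        stack.Push(v);
        onStack.Add(v);
        foreach (var w in graph[v])
        {
            if (!index.ContainsKey(w))
            {
                StrongConnect(w); // unvisited successor
                lowLink[v] = Math.Min(lowLink[v], lowLink[w]);
            }
            else if (onStack.Contains(w))
            {
                lowLink[v] = Math.Min(lowLink[v], index[w]); // back edge into the current component
            }
        }
        if (lowLink[v] == index[v]) // v is the root of a strongly connected component
        {
            var component = new List<T>();
            T w;
            do
            {
                w = stack.Pop();
                onStack.Remove(w);
                component.Add(w);
            } while (!EqualityComparer<T>.Default.Equals(w, v));
            components.Add(component);
        }
    }

    foreach (var v in graph.Keys)
        if (!index.ContainsKey(v))
            StrongConnect(v);
    return components;
}
Note that a strongly connected component only tells you which items participate in cycles together; enumerating every individual cycle within a component is a separate (and potentially exponential) problem.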
There are numerous means of enumerating the permutations of N objects such that each permutation can be efficiently obtained from its index in the enumeration, such as this excerpt from my tutorial on CUDAfy using the Travelling Salesman problem:
/// <summary>Amended algorithm after SpaceRat (see Remarks):
/// Don't <b>Divide</b> when you can <b>Multiply</b>!</summary>
/// <seealso cref="http://www.daniweb.com/software-development/cpp/code/274075/all-permutations-non-recursive"/>
/// <remarks>Final loop iteration unneeded, as element [0] only swaps with itself.</remarks>
[Cudafy]
public static float PathFromRoutePermutation(GThread thread,
    long permutation, int[,] path)
{
    for (int city = 0; city < _cities; city++) { path[city, thread.threadIdx.x] = city; }

    var divisor = 1L;
    for (int city = _cities; city > 1L; /* decrement in loop body */)
    {
        var dest = (int)((permutation / divisor) % city);
        divisor *= city;
        city--;

        var swap = path[dest, thread.threadIdx.x];
        path[dest, thread.threadIdx.x] = path[city, thread.threadIdx.x];
        path[city, thread.threadIdx.x] = swap;
    }
    return 0;
}
From this point one can readily identify permutations with circular greatness in parallel, first using the multiple cores on the CPU for improved performance, and then those available on the GPU. After repeatedly tuning the Travelling Salesman problem in this way, I improved performance for the 11-cities case from over 14 seconds (CPU only) to about 0.25 seconds on my GPU: an improvement of 50 times.
Of course, your mileage will vary according to other aspects of the problem as well as your hardware.
I could improve performance drastically by relying on recursion. Rather than generating all possible permutations of sequences beforehand, I now recurse through the collection to find cycles. To help with that, each item gets circular references to its greater and lesser items, so that I can traverse through them. The code is a bit longer.
Here is the basic idea:
I created a basic interface ICyclic<T> which has to be implemented by the Player class.
I traverse through the collection and assign the lesser and greater items (in the Prepare method).
I ignore the really bad ones (i.e., those for which there are no lesser items in the collection) and the really good ones (i.e., those for which there are no greater items in the collection) to avoid infinite recursion and generally improve performance. The absolute best and worst ones don't contribute to cycles. This is all done in the Prepare method.
Now each item will have a collection of items lesser than the item. And the items in the collection will have their own collection of worse items. And so on. This is the path I recursively traverse.
At every point the last item is compared to the first item in the visited path to detect cycles.
The cycles are added to a HashSet<T> to avoid duplicates. An equality comparer is defined to detect equivalent circular lists.
Code:
public interface ICyclic<T> : IComparable<T>
{
    ISet<T> Worse { get; set; }
    ISet<T> Better { get; set; }
}
public static ISet<IList<T>> Cycles<T>(this ISet<T> input) where T : ICyclic<T>
{
    input = input.ToHashSet();
    Prepare(input);
    var output = new HashSet<IList<T>>(new CircleEqualityComparer<T>());
    foreach (var item in input)
    {
        bool detected;
        Visit(item, new List<T> { item }, item.Worse, output, out detected);
    }
    return output;
}
static void Prepare<T>(ISet<T> input) where T : ICyclic<T>
{
    foreach (var item in input)
    {
        item.Worse = input.Where(t => t.CompareTo(item) < 0).ToHashSet();
        item.Better = input.Where(t => t.CompareTo(item) > 0).ToHashSet();
    }

    Action<Func<T, ISet<T>>> exceptionsRemover = x =>
    {
        var exceptions = new HashSet<T>();
        foreach (var item in input.OrderBy(t => x(t).Count))
        {
            x(item).ExceptWith(exceptions);
            if (!x(item).Any())
                exceptions.Add(item);
        }
        input.ExceptWith(exceptions);
    };
    exceptionsRemover(t => t.Worse);
    exceptionsRemover(t => t.Better);
}
static void Visit<T>(T item, List<T> visited, ISet<T> worse, ISet<IList<T>> output,
    out bool detected) where T : ICyclic<T>
{
    detected = false;
    foreach (var bad in worse)
    {
        Func<T, T, bool> comparer = (t1, t2) => t1.CompareTo(t2) > 0;
        if (comparer(visited.Last(), visited.First()))
        {
            detected = true;
            var cycle = visited.ToList();
            output.Add(cycle);
        }
        if (visited.Contains(bad))
        {
            var cycle = visited.SkipWhile(x => !x.Equals(bad)).ToList();
            if (cycle.Count >= 3)
            {
                detected = true;
                output.Add(cycle);
            }
            continue;
        }
        if (bad.Equals(item) || comparer(bad, visited.Last()))
            continue;
        visited.Add(bad);
        Visit(item, visited, bad.Worse, output, out detected);
        if (detected)
            visited.Remove(bad);
    }
}
public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
{
    return new HashSet<T>(source);
}
public class CircleEqualityComparer<T> : IEqualityComparer<ICollection<T>>
{
    public bool Equals(ICollection<T> x, ICollection<T> y)
    {
        if (x.Count != y.Count)
            return false;
        return Enumerable.Range(1, x.Count)
            .Any(i => x.SequenceEqual(y.Skip(i).Concat(y.Take(i))));
    }

    public int GetHashCode(ICollection<T> obj)
    {
        // Order-independent hash, so rotated versions of the same cycle collide
        return unchecked(obj.Aggregate(0, (x, y) => x + y.GetHashCode()));
    }
}
Original answer (from OP)
On the plus side, this is shorter and more concise. Also, since it doesn't rely on recursion, it doesn't need the ICyclic<T> constraint; any IComparable<T> should work. On the minus side, it is slow as molasses in January.
public static IEnumerable<ICollection<T>> Cycles<T>(this ISet<T> input) where T : IComparable<T>
{
    if (input.Count < 3)
        return Enumerable.Empty<ICollection<T>>();
    Func<T, T, bool> comparer = (t1, t2) => t1.CompareTo(t2) > 0;
    return Enumerable.Range(3, input.Count - 3 + 1)
        .Select(x => input.Permutations(x))
        .SelectMany(x => x)
        .Select(x => x.ToList())
        .Where(l => l.Zip(l.Skip(1), (t1, t2) => new { t1, t2 }).All(x => comparer(x.t1, x.t2))
                 && comparer(l.Last(), l.First()))
        .Distinct(new CircleEqualityComparer<T>());
}
public static IEnumerable<IEnumerable<T>> Permutations<T>(this IEnumerable<T> list, int length)
{
    if (length == 1)
        return list.Select(t => new[] { t });
    return Permutations(list, length - 1)
        .SelectMany(t => list.Where(e => !t.Contains(e)), (t1, t2) => t1.Concat(new[] { t2 }));
}
public class CircleEqualityComparer<T> : IEqualityComparer<ICollection<T>>
{
    public bool Equals(ICollection<T> x, ICollection<T> y)
    {
        if (x.Count != y.Count)
            return false;
        return Enumerable.Range(1, x.Count)
            .Any(i => x.SequenceEqual(y.Skip(i).Concat(y.Take(i))));
    }

    public int GetHashCode(ICollection<T> obj)
    {
        return unchecked(obj.Aggregate(0, (x, y) => x + y.GetHashCode()));
    }
}
A few things to note:
I have used ISet<T>s and HashSet<T>s instead of the more traditional List<T>, but only to make the intent clearer: no duplicate items are allowed. Lists should work just fine.
.NET doesn't really have an insertion-order-preserving set (i.e., one which allows no duplicates), hence I had to use List<T> in many places. A set might have marginally improved performance, but more importantly, using set and list interchangeably causes confusion.
The first approach gives a performance jump of the order of 100 times over the second one.
The second approach can be sped up by utilizing the Prepare method. The logic holds there too, i.e., fewer members in the collection means fewer permutations to generate. But it is still very, very painfully slow.
I have made the methods generic, but the solution can be made more general purpose. For example, in my case the cycle is detected based on a certain comparison logic. This could be passed as a parameter, i.e., the items in the collection need not be merely comparable; any vertex-determining logic would do. But that's left as an exercise for the reader.
In my code (both examples), only cycles of minimum size 3 are considered, i.e., cycles like A > B > C > A. It doesn't account for situations like A > B, B > A. In case you need that, change all the instances of 3 in the code to whatever you like. Even better, pass it to the function.

Remove everything which is duplicate in a List<List<double[]>>

I hope you can help me out on this one. I have a List<List<double[]>> and I want to remove everything which is duplicate in such a list. That is:
1) Within each List<double[]>, some of the double[] are duplicates. I want to keep only the non-duplicate double[]s within the List<double[]>. See lists 1 and 5 in the picture.
2) Within the List<List<double[]>>, some of the List<double[]> are duplicates. I want to keep only the non-repeated lists. See lists 0 & 2 and lists 1 & 3.
The desired output is designated in the picture.
I have tried the following but it doesn't work.
public static List<List<double[]>> CleanListOfListsOfDoubleArray(List<List<double[]>> input)
{
    var output = new List<List<double[]>>();
    for (int i = 0; i < input.Count; i++)
    {
        var temp = input[i].Distinct().ToList();
        output.Add(temp);
    }
    return output.Distinct().ToList();
}
Can you please help me on this?
Your code (excluding the ToList collectors) seems logically equivalent to:
return input.Select(t => t.Distinct()).Distinct();
You're trying to use Distinct on collections. That's reasonable, since you are expecting to get distinct collections.
The problem is that you have left Distinct without logic to compare these collections. Without specifying that logic, Distinct can't compare collections properly (by equality of each individual member).
There is another overload of Distinct that takes an IEqualityComparer<T> as an argument. To use it, you'll have to implement such a comparer first. A reasonable implementation (adapted from Cédric Bignon's answer) could look like this:
public class ArrayComparer<T> : IEqualityComparer<T[]>
{
    public bool Equals(T[] x, T[] y)
    {
        return ReferenceEquals(x, y) || (x != null && y != null && x.SequenceEqual(y));
    }

    public int GetHashCode(T[] obj)
    {
        // A constant hash is correct (equal arrays get equal hashes) but makes
        // Distinct fall back to pairwise Equals calls
        return 0;
    }
}

public class ListOfArrayComparer<T> : IEqualityComparer<List<T[]>>
{
    public bool Equals(List<T[]> x, List<T[]> y)
    {
        return ReferenceEquals(x, y) || (x != null && y != null && x.SequenceEqual(y, new ArrayComparer<T>()));
    }

    public int GetHashCode(List<T[]> obj)
    {
        return 0;
    }
}
Your code should then look like this:
public static List<List<double[]>> CleanListOfListsOfDoubleArray(List<List<double[]>> input)
{
    var output = new List<List<double[]>>();
    for (int i = 0; i < input.Count; i++)
    {
        var temp = input[i].Distinct(new ArrayComparer<double>()).ToList();
        output.Add(temp);
    }
    return output.Distinct(new ListOfArrayComparer<double>()).ToList();
}
Or even just:
public static List<List<double[]>> CleanListOfListsOfDoubleArray(List<List<double[]>> input)
{
    var output = input.Select(t => t.Distinct(new ArrayComparer<double>()).ToList()).ToList();
    return output.Distinct(new ListOfArrayComparer<double>()).ToList();
}
Keep in mind that this would be a lot less complicated if you used more specific types for describing your problem.
If, for example, instead of double[], you used a more specific pair type (like Tuple<double, double>), you would only need to implement one comparer (the first Distinct call could be left with its default behavior, if I remember correctly).
If, instead of the List<double[]>, you had a specialized PairCollection that implements its own equality method, you wouldn't need the second equality comparer either (your original code would most probably work as it already is).
So, to avoid problems like this in the future, try to declare specialized types for your problem (instead of relying on the generic lists and arrays and nesting them like here).
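To illustrate the pair-type idea: if each double[] really holds exactly two values (which the Tuple<double, double> suggestion assumes), projecting onto tuples gives you value equality on the inner level for free:
var output = input
    .Select(list => list.Select(arr => Tuple.Create(arr[0], arr[1])).Distinct().ToList())
    .ToList();
// Tuples compare by value, so the inner Distinct needs no custom comparer;
// deduplicating the outer lists would still need a comparer analogous to
// ListOfArrayComparer above.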

Quickest way to compare two generic lists for differences

What is the quickest (and least resource-intensive) way to compare two massive lists (>50,000 items) and, as a result, get two lists like the ones below:
items that show up in the first list but not in the second
items that show up in the second list but not in the first
Currently I'm working with List or IReadOnlyCollection and solve this with a LINQ query:
var firstNotSecond = list1.Where(i => !list2.Contains(i)).ToList();
var secondNotFirst = list2.Where(i => !list1.Contains(i)).ToList();
But this doesn't perform as well as I would like. Any idea how to make this quicker and less resource-intensive, as I need to process a lot of lists?
Use Except:
var firstNotSecond = list1.Except(list2).ToList();
var secondNotFirst = list2.Except(list1).ToList();
I suspect there are approaches which would actually be marginally faster than this, but even this will be vastly faster than your O(N * M) approach.
If you want to combine these, you could create a method with the above and then a return statement:
return !firstNotSecond.Any() && !secondNotFirst.Any();
One point to note is that there is a difference in results between the original code in the question and the solution here: any duplicate elements which are only in one list will only be reported once with my code, whereas they'd be reported as many times as they occur in the original code.
For example, with lists of [1, 2, 2, 2, 3] and [1], the "elements in list1 but not list2" result in the original code would be [2, 2, 2, 3]. With my code it would just be [2, 3]. In many cases that won't be an issue, but it's worth being aware of.
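Spelled out, that combined method might look like this (the name SetwiseEqual is just for illustration):
static bool SetwiseEqual<T>(List<T> list1, List<T> list2)
{
    var firstNotSecond = list1.Except(list2).ToList();
    var secondNotFirst = list2.Except(list1).ToList();
    return !firstNotSecond.Any() && !secondNotFirst.Any();
}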
Enumerable.SequenceEqual determines whether two sequences are equal according to an equality comparer (see the MS docs):
Enumerable.SequenceEqual(list1, list2);
This works for all primitive data types. If you need to use it on custom objects, you need to implement IEqualityComparer, the interface that defines methods to support the comparison of objects for equality (see the MS docs for IEqualityComparer).
More efficient would be to use Enumerable.Except:
var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);
This method is implemented by using deferred execution. That means you could write for example:
var first10 = inListButNotInList2.Take(10);
It is also efficient since it internally uses a Set<T> to compare the objects. It works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.
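That description corresponds roughly to the following shape (a simplified sketch, not the actual BCL source):
static IEnumerable<T> ExceptSketch<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    var seen = new HashSet<T>(second); // collect the distinct values of the second sequence
    foreach (var item in first)
        if (seen.Add(item))            // false for anything already seen, which also
            yield return item;         // de-duplicates the first sequence on the fly
}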
If you want the results to be case insensitive, the following will work:
List<string> list1 = new List<string> { "a.dll", "b1.dll" };
List<string> list2 = new List<string> { "A.dll", "b2.dll" };
var firstNotSecond = list1.Except(list2, StringComparer.OrdinalIgnoreCase).ToList();
var secondNotFirst = list2.Except(list1, StringComparer.OrdinalIgnoreCase).ToList();
firstNotSecond would contain b1.dll
secondNotFirst would contain b2.dll
using System;
using System.Collections.Generic;
using System.Linq;

namespace YourProject.Extensions
{
    public static class ListExtensions
    {
        public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
            where T : IEquatable<T>
        {
            if (list.Except(other).Any())
                return false;
            if (other.Except(list).Any())
                return false;
            return true;
        }
    }
}
Sometimes you only need to know if two lists are different, and not what those differences are. In that case, consider adding this extension method to your project. Note that your listed objects should implement IEquatable!
Usage:
public sealed class Car : IEquatable<Car>
{
    public Price Price { get; }
    public List<Component> Components { get; }

    ...

    public override bool Equals(object obj)
        => obj is Car other && Equals(other);

    public bool Equals(Car other)
        => Price == other.Price
        && Components.SetwiseEquivalentTo(other.Components);

    public override int GetHashCode()
        => Components.Aggregate(
            Price.GetHashCode(),
            (code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}
Whatever the Component class is, the methods shown here for Car should be implemented almost identically.
It's very important to note how we've written GetHashCode. In order to properly implement IEquatable, Equals and GetHashCode must operate on the instance's properties in a logically compatible way.
Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.
Note: the strange name is to imply the fact that the method does not consider the order of the elements in the list. If you do care about the order of the elements in the list, this method is not for you!
Try it this way:
var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
.Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));
Not for this exact problem, but here's some code to compare lists for equal (not necessarily identical) objects:
public class EquatableList<T> : List<T>, IEquatable<EquatableList<T>> where T : IEquatable<T>
{
    /// <summary>
    /// True, if this contains an element with equal property values
    /// </summary>
    /// <param name="element">element of type T</param>
    /// <returns>True, if this contains element</returns>
    public new Boolean Contains(T element)
    {
        return this.Any(t => t.Equals(element));
    }

    /// <summary>
    /// True, if list is equal to this
    /// </summary>
    /// <param name="list">list</param>
    /// <returns>True, if instance equals list</returns>
    public Boolean Equals(EquatableList<T> list)
    {
        if (list == null) return false;
        return this.All(list.Contains) && list.All(this.Contains);
    }
}
If only a combined result is needed, this will work too:
var set1 = new HashSet<T>(list1);
var set2 = new HashSet<T>(list2);
var areEqual = set1.SetEquals(set2);
where T is the type of the lists' elements.
While Jon Skeet's answer is excellent advice for everyday practice with small to moderate numbers of elements (up to a few million), it is nevertheless not the fastest approach and not very resource-efficient. An obvious drawback is that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this could be avoided by a customized reimplementation of the Except method, but it remains true that creating a hash set requires a lot of memory and computing hashes takes time.
For very large data sets (in the billions of elements) it usually pays off to consider the particular circumstances. Here are a few ideas that might provide some inspiration:
If the elements can be compared (which is almost always the case in practice), then sorting the lists and applying the following zip approach is worth consideration:
/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained in only one of them, together with an indicator
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
    IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
    var refs = sortedReferenceObjects.GetEnumerator();
    var diffs = sortedDifferenceObjects.GetEnumerator();
    bool hasRef = refs.MoveNext();
    bool hasDiff = diffs.MoveNext();
    while (hasRef && hasDiff)
    {
        int comparison = comparer.Compare(refs.Current, diffs.Current);
        if (comparison == 0)
        {
            // insert code that emits the current element if equal elements should be kept
            hasRef = refs.MoveNext();
            hasDiff = diffs.MoveNext();
        }
        else if (comparison < 0)
        {
            yield return Tuple.Create(refs.Current, -1);
            hasRef = refs.MoveNext();
        }
        else
        {
            yield return Tuple.Create(diffs.Current, 1);
            hasDiff = diffs.MoveNext();
        }
    }
    // Drain whichever sequence still has elements; they are unmatched by definition
    while (hasRef)
    {
        yield return Tuple.Create(refs.Current, -1);
        hasRef = refs.MoveNext();
    }
    while (hasDiff)
    {
        yield return Tuple.Create(diffs.Current, 1);
        hasDiff = diffs.MoveNext();
    }
}
This can e.g. be used in the following way:
const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);
Benchmarking on my computer gave the following result for N = 1M:
| Method   | Mean      | Error    | StdDev   | Ratio | Gen 0     | Gen 1     | Gen 2     | Allocated  |
|----------|-----------|----------|----------|-------|-----------|-----------|-----------|------------|
| DiffLinq | 115.19 ms | 0.656 ms | 0.582 ms | 1.00  | 2800.0000 | 2800.0000 | 2800.0000 | 67110744 B |
| DiffZip  | 23.48 ms  | 0.018 ms | 0.015 ms | 0.20  | -         | -         | -         | 720 B      |
And for N = 100M:
| Method   | Mean     | Error    | StdDev   | Ratio | Gen 0      | Gen 1      | Gen 2      | Allocated    |
|----------|----------|----------|----------|-------|------------|------------|------------|--------------|
| DiffLinq | 12.146 s | 0.0427 s | 0.0379 s | 1.00  | 13000.0000 | 13000.0000 | 13000.0000 | 8589937032 B |
| DiffZip  | 2.324 s  | 0.0019 s | 0.0018 s | 0.19  | -          | -          | -          | 720 B        |
Note that this example of course benefits from the fact that the lists are already sorted and integers can be very efficiently compared. But this is exactly the point: If you do have favourable circumstances, make sure that you exploit them.
A few further comments: The speed of the comparison function is clearly relevant for the overall performance, so it may be beneficial to optimize it. The flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2 is to split each list into two halves (if that can be done efficiently) and compare the parts in parallel, one pass processing from front to back and the other in reverse order.
I have used this code to compare two lists with millions of records, and it does not take much time:
// Method to compare two lists of strings
private List<string> Contains(List<string> list1, List<string> list2)
{
    List<string> result = new List<string>();
    result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
    result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));
    return result;
}
I compared 3 different methods for comparing different data sets. Tests below create a string collection of all the numbers from 0 to length - 1, then another collection with the same range, but with even numbers. I then pick out the odd numbers from the first collection.
Using Linq Except
public void TestExcept()
{
    WriteLine($"Except {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);

    var newArray = new string[length / 2];
    int j = 0;
    for (int i = 0; i < length; i += 2)
    {
        newArray[j++] = i.ToString();
    }
    dateTime = DateTime.Now;
    Write("Count of items: ");
    WriteLine(array.Except(newArray).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}
Output
Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879
Using HashSet.Add
public void TestHashSet()
{
    WriteLine($"HashSet {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>();
    for (int i = 0; i < length; i++)
    {
        hashSet.Add(i.ToString());
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);

    var newHashSet = new HashSet<string>();
    for (int i = 0; i < length; i += 2)
    {
        newHashSet.Add(i.ToString());
    }
    dateTime = DateTime.Now;
    Write("Count of items: ");
    // HashSet.Add returns true if the item was added successfully (not previously present)
    WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}
Output
HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057
Special HashSet test:
public void TestLoadingHashSet()
{
    int length = 20000000;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    var dateTime = DateTime.Now;
    var hashSet = new HashSet<string>(array);
    Write("Time to load hashset: ");
    WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160
Using .Contains
public void TestContains()
{
    WriteLine($"Contains {DateTime.Now}");
    int length = 20000000;
    var dateTime = DateTime.Now;
    var array = new string[length];
    for (int i = 0; i < length; i++)
    {
        array[i] = i.ToString();
    }
    Write("Populate set processing time: ");
    WriteLine(DateTime.Now - dateTime);

    var newArray = new string[length / 2];
    int j = 0;
    for (int i = 0; i < length; i += 2)
    {
        newArray[j++] = i.ToString();
    }
    dateTime = DateTime.Now;
    WriteLine(dateTime);
    Write("Count of items: ");
    WriteLine(array.Where(a => !newArray.Contains(a)).Count());
    Write("Count processing time: ");
    WriteLine(DateTime.Now - dateTime);
}
Output
Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didn't complete; I killed it after 14 minutes.)
Conclusion:
LINQ Except ran approximately 1 second slower on my device than using HashSets (n = 20,000,000).
Using Where and Contains ran for a very long time.
Closing remarks on HashSets:
Unique data
Make sure to override GetHashCode (correctly) for class types
May need up to 2x the memory if you make a copy of the data set, depending on implementation
HashSet is optimized for cloning other HashSets using the IEnumerable constructor, but it is slower to convert other collections to HashSets (see the special test above)
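To illustrate the second remark, here is a hypothetical element type whose Equals/GetHashCode pair makes it safe to use in HashSet-based comparisons (HashCode.Combine is available from .NET Core 2.1 on; on older frameworks, combine the fields manually):
public sealed class Item : IEquatable<Item>
{
    public int Id { get; }
    public string Name { get; }

    public Item(int id, string name)
    {
        Id = id;
        Name = name;
    }

    public bool Equals(Item other) =>
        other != null && Id == other.Id && Name == other.Name;

    public override bool Equals(object obj) => Equals(obj as Item);

    // Must agree with Equals: equal items must produce equal hashes
    public override int GetHashCode() => HashCode.Combine(Id, Name);
}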
First approach:
if (list1 != null && list2 != null &&
    list1.Select(x => list2.SingleOrDefault(y => y.propertyToCompare == x.propertyToCompare
                                              && y.anotherPropertyToCompare == x.anotherPropertyToCompare) != null)
         .All(x => x))
    return true;
Second approach if you are OK with duplicate values:
if (list1 != null && list2 != null &&
    list1.Select(x => list2.Any(y => y.propertyToCompare == x.propertyToCompare
                                  && y.anotherPropertyToCompare == x.anotherPropertyToCompare))
         .All(x => x))
    return true;
Both Jon Skeet's and miguelmpn's answers are good. It depends on whether the order of the list elements is important or not:
// take order into account
bool areEqual1 = Enumerable.SequenceEqual(list1, list2);
// ignore order
bool areEqual2 = !list1.Except(list2).Any() && !list2.Except(list1).Any();
One line:
var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
Console.WriteLine("same sets");
I wrote a generic function for comparing two lists:
public static class ListTools
{
    public enum RecordUpdateStatus
    {
        Added = 1,
        Updated = 2,
        Deleted = 3
    }

    public class UpdateStatu<T>
    {
        public T CurrentValue { get; set; }
        public RecordUpdateStatus UpdateStatus { get; set; }
    }

    public static List<UpdateStatu<T>> CompareList<T>(List<T> currentList, List<T> inList, string uniqPropertyName)
    {
        var res = new List<UpdateStatu<T>>();

        // In inList but not in currentList: added
        res.AddRange(inList.Where(a => !currentList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
            .Select(a => new UpdateStatu<T>
            {
                CurrentValue = a,
                UpdateStatus = RecordUpdateStatus.Added,
            }));

        // In currentList but not in inList: deleted
        res.AddRange(currentList.Where(a => !inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
            .Select(a => new UpdateStatu<T>
            {
                CurrentValue = a,
                UpdateStatus = RecordUpdateStatus.Deleted,
            }));

        // Present in both lists: treated as updated
        res.AddRange(currentList.Where(a => inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
            .Select(a => new UpdateStatu<T>
            {
                CurrentValue = a,
                UpdateStatus = RecordUpdateStatus.Updated,
            }));

        return res;
    }
}
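Hypothetical usage, assuming a Person class with an Id property serving as the unique key:
var changes = ListTools.CompareList(currentPeople, incomingPeople, "Id");
foreach (var change in changes)
    Console.WriteLine($"{change.UpdateStatus}: {change.CurrentValue}");
Since the reflection lookup runs once per comparison, for large lists it would be worth caching the PropertyInfo or accepting a key-selector delegate instead of a property name.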
I think this is a simple and easy way to compare two lists element by element (note: this one is Python, not C#):
x = [1, 2, 3, 5, 4, 8, 7, 11, 12, 45, 96, 25]
y = [2, 4, 5, 6, 8, 7, 88, 9, 6, 55, 44, 23]
tmp = []
for i in range(min(len(x), len(y))):
    if x[i] > y[i]:
        tmp.append(1)
    else:
        tmp.append(0)
print(tmp)
Maybe it's funny, but this works for me:
string.Join("", List1) != string.Join("", List2)
This is the best solution you'll find:
var list3 = list1.Where(l => list2.Contains(l));
(Note that this yields the common elements, not the differences.)

How to get matches between collections of different types?

I think this would take O(A x B) time to execute, where A is the size of collectionA and B is the size of collectionB. Am I correct?
IEnumerable<A> GetMatches(IEnumerable<A> collectionA, IEnumerable<B> collectionB)
{
    foreach (A a in collectionA)
        foreach (B b in collectionB)
            if (a.Value == b.Value)
                yield return a;
}
Is there a faster way to execute this query? (maybe using LINQ?)
Enumerable.Intersect will not, unfortunately, work, as you're comparing across two separate types (A and B). Getting an Intersect-like call to work will require a bit of separate handling. You could do it in stages:
IEnumerable<A> GetMatches(IEnumerable<A> collectionA, IEnumerable<B> collectionB)
{
    // Collect the distinct values occurring in collectionB
    var values = new HashSet<TypeOfValue>(collectionB.Select(b => b.Value));
    return collectionA.Where(a => values.Contains(a.Value));
}
Note that, unlike your looping code, this returns each matching element of collectionA only once even when collectionB contains the same value multiple times (though it will still repeat matches if collectionA itself contains duplicates), so it will have slightly different results.
If you want unique matches (only one returned), you could change the last line to:
return collectionA.Where(a => values.Contains(a.Value)).Distinct();
You may try the following intersection algorithm, which has complexity O(m + n) if your data is already sorted, or O(n log n) otherwise, without consuming additional memory:
private static IEnumerable<A> Intersect(A[] alist, B[] blist)
{
    Array.Sort(alist);
    Array.Sort(blist);
    for (int i = 0, j = 0; i < alist.Length && j < blist.Length;)
    {
        if (alist[i].Value == blist[j].Value)
        {
            yield return alist[i];
            i++;
            j++;
        }
        else if (alist[i].Value < blist[j].Value)
        {
            i++;
        }
        else
        {
            j++;
        }
    }
}
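If you'd rather stay in LINQ, Join achieves the same hash-based matching in a single call: it builds a lookup over one side, giving roughly O(A + B), and, like the original nested loop, yields a once per matching element of collectionB:
IEnumerable<A> GetMatches(IEnumerable<A> collectionA, IEnumerable<B> collectionB)
{
    return collectionA.Join(collectionB,
                            a => a.Value,
                            b => b.Value,
                            (a, b) => a); // keep the A side of each matching pair
}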
