Find Circular Items in a Collection (of non-Transitive Items) - c#

Problem:
I have a simple List&lt;T&gt; that I'm trying to sort, but the items in the list are not all transitive in terms of comparability. For example, my List&lt;T&gt; looks like:
A
B
C
D
E
where A > B and B > C but C > A. It is also possible to have circular greatness like A > B, B > C, C > D but D > A, i.e., it need not be always a group of 3. What I want is to find all groups of circular greatnesses in the given List<T>. For e.g., assuming A > B > C > A and A > B > C > D > A are the two circular groups in the above case my output should look either like:
List<List<T>> circulars = [[A, B, C, A], [A, B, C, D, A]]
or
List<List<T>> circulars = [[A, B, C], [A, B, C, D]]
// but in this case I do not want duplicates in the output.
// For e.g., the output shouldn't have both [B, C, A] and [A, B, C]
// since both the groups refer to the same set of circular items A, B & C
// as B > C > A > B is also true.
// But [B, A, C] is a different group (though nothing circular about it)
Either one is fine with me. I'd prefer a small (LINQ-ish) solution, but this didn't look as easy as it first seemed. Maybe I'm missing something very simple.
Scenario:
This is a part of sports analysis where one player/team will be stronger than the other, which in turn will be stronger than another, but the last one will be stronger than the first. I can't reveal more information, but let me take the case of head-to-heads in sports: especially in tennis and chess, the individual match-ups lead to this kind of situation. For example, in terms of head-to-head, Kramnik leads Kasparov and Kasparov leads Karpov, but Karpov leads Kramnik. Or, for another example, Federer leads Davydenko, Davydenko leads Nadal, but Nadal leads Federer.
My class looks like this:
class Player : IComparable<Player>
{
// logic
}
This is what I tried:
First, generate all possible permutations of the collection's items with a minimum group size of 3, like [A, B, C], [A, C, B], ..., [A, B, C, D], [A, B, D, C], etc. (This is very slow.)
Then go through all the subgroups and check for patterns, i.e., whether there are any situations like A > B > C > D. (This is reasonably slow, but I'm OK with it.)
Lastly, go through all the subgroups to remove duplicate groups like [A, B, C] and [B, C, A].
Code:
var players = [.....]; //all the players in the collection
// first generate all the permutations possible in the list from size 3
// to players.Count
var circulars = Enumerable.Range(3, players.Count - 3 + 1)
.Select(x => players.Permutations(x))
.SelectMany(x => x)
.Select(x => x.ToList())
// then check in the each sublists if a pattern like A > B > C > A is
// generated vv this is the player comparison
.Where(l => l.Zip(l.Skip(1), (p1, p2) => new { p1, p2 }).All(x => x.p1 > x.p2) && l.First() < l.Last())
// then remove the duplicate lists using special comparer
.Distinct(new CircularComparer<Player>())
.ToList();
public static IEnumerable<IEnumerable<T>> Permutations<T>(this IEnumerable<T> list, int length)
{
if (length == 1)
return list.Select(t => new[] { t });
return Permutations(list, length - 1)
.SelectMany(t => list.Where(e => !t.Contains(e)), (t1, t2) => t1.Concat(new[] { t2 }));
}
class CircularComparer<T> : IEqualityComparer<ICollection<T>>
{
public bool Equals(ICollection<T> x, ICollection<T> y)
{
if (x.Count != y.Count)
return false;
return Enumerable.Range(1, x.Count)
.Any(i => x.SequenceEqual(y.Skip(i).Concat(y.Take(i))));
}
public int GetHashCode(ICollection<T> obj)
{
return 0;
}
}
The problem with this approach is that it is extremely slow. For a collection of just around 10 items, the number of permutations that have to be generated is itself huge (close to 1 million). Is there a better approach which is reasonably efficient? I am not after the fastest code possible. Is there a better recursive approach here? Smells like it.

The scenario...
[A, B, C, D, E]
where A > B, B > C, C > D, C > A, D > A
...could be represented as a directed graph, using the convention that A -> B means A > B.
So the question is essentially "How can I find cycles in a directed graph?"
To solve that, you can use Tarjan's strongly connected components algorithm. I would recommend looking up a good implementation of this algorithm and applying it to your scenario.
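The approach can be sketched as follows. This is a minimal, hypothetical adaptation (the string vertex labels and the beats-edge dictionary are illustrative assumptions, not the OP's Player type); it uses the classic recursive formulation of Tarjan's algorithm, so very deep graphs could overflow the call stack:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: the adjacency list records an edge a -> b
// whenever "a beats b". Any strongly connected component with more
// than one vertex contains at least one cycle of circular greatness.
static class TarjanScc
{
    public static List<List<string>> FindSccs(Dictionary<string, List<string>> edges)
    {
        var index = new Dictionary<string, int>();
        var lowLink = new Dictionary<string, int>();
        var onStack = new HashSet<string>();
        var stack = new Stack<string>();
        var sccs = new List<List<string>>();
        int counter = 0;

        void StrongConnect(string v)
        {
            index[v] = lowLink[v] = counter++;
            stack.Push(v);
            onStack.Add(v);
            foreach (var w in edges.TryGetValue(v, out var next) ? next : new List<string>())
            {
                if (!index.ContainsKey(w))
                {
                    StrongConnect(w);
                    lowLink[v] = Math.Min(lowLink[v], lowLink[w]);
                }
                else if (onStack.Contains(w))
                {
                    lowLink[v] = Math.Min(lowLink[v], index[w]);
                }
            }
            if (lowLink[v] == index[v]) // v is the root of a strongly connected component
            {
                var scc = new List<string>();
                string popped;
                do
                {
                    popped = stack.Pop();
                    onStack.Remove(popped);
                    scc.Add(popped);
                } while (popped != v);
                sccs.Add(scc);
            }
        }

        foreach (var v in edges.Keys.ToList())
            if (!index.ContainsKey(v))
                StrongConnect(v);
        return sccs;
    }
}
```

Note that Tarjan's algorithm groups the vertices into components rather than listing the cycles themselves; if you need to enumerate every elementary cycle inside a component, an algorithm such as Johnson's is the usual follow-up.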

There are numerous means of enumerating the permutations of N objects such that each permutation can be efficiently obtained from its index in the enumeration. One such is this excerpt from my tutorial on CUDAfy, using the Travelling Salesman problem:
/// <summary>Amended algorithm after SpaceRat (see Remarks):
/// Don't <b>Divide</b> when you can <b>Multiply</b>!</summary>
/// <seealso cref="http://www.daniweb.com/software-development/cpp/code/274075/all-permutations-non-recursive"/>
/// <remarks>Final loop iteration unneeded, as element [0] only swaps with itself.</remarks>
[Cudafy]
public static float PathFromRoutePermutation(GThread thread,
long permutation, int[,] path) {
for (int city = 0; city < _cities; city++) { path[city, thread.threadIdx.x] = city; }
var divisor = 1L;
for (int city = _cities; city > 1L; /* decrement in loop body */) {
var dest = (int)((permutation / divisor) % city);
divisor *= city;
city--;
var swap = path[dest, thread.threadIdx.x];
path[dest, thread.threadIdx.x] = path[city, thread.threadIdx.x];
path[city, thread.threadIdx.x] = swap;
}
return 0;
}
From this point, one can readily identify permutations with circular greatness in parallel: first using the multiple cores on the CPU for improved performance, and then those available on the GPU. After repeated tunings of the Travelling Salesman problem in this way, I improved performance for the 11-city case from over 14 seconds (using the CPU only) to about 0.25 seconds using my GPU; an improvement of roughly 50 times.
Of course, your mileage will vary according to other aspects of the problem as well as your hardware.

I could improve performance drastically by relying on recursion. Rather than generating all possible permutations beforehand, I now recurse through the collection to find cycles. To help with that, each item holds references to its greater and lesser items so that I can traverse through them. The code is a bit longer.
Here is the basic idea:
I created a basic interface ICyclic<T> which has to be implemented by the Player class.
I traverse through the collection and assign the lesser and greater items (in the Prepare method).
I ignore the really bad ones (i.e., those for which there are no lesser items in the collection) and the really good ones (i.e., those for which there are no greater items in the collection) to avoid infinite recursion and generally improve performance. The absolute best and worst items don't contribute to cycles. All of this is done in the Prepare method.
Now each item will have a collection of items lesser than the item. And the items in the collection will have their own collection of worse items. And so on. This is the path I recursively traverse.
At every point the last item is compared to the first item in the visited path to detect cycles.
The cycles are added to a HashSet<T> to avoid duplicates. An equality comparer is defined to detect equivalent circular lists.
Code:
public interface ICyclic<T> : IComparable<T>
{
ISet<T> Worse { get; set; }
ISet<T> Better { get; set; }
}
public static ISet<IList<T>> Cycles<T>(this ISet<T> input) where T : ICyclic<T>
{
input = input.ToHashSet();
Prepare(input);
var output = new HashSet<IList<T>>(new CircleEqualityComparer<T>());
foreach (var item in input)
{
bool detected;
Visit(item, new List<T> { item }, item.Worse, output, out detected);
}
return output;
}
static void Prepare<T>(ISet<T> input) where T : ICyclic<T>
{
foreach (var item in input)
{
item.Worse = input.Where(t => t.CompareTo(item) < 0).ToHashSet();
item.Better = input.Where(t => t.CompareTo(item) > 0).ToHashSet();
}
Action<Func<T, ISet<T>>> exceptionsRemover = x =>
{
var exceptions = new HashSet<T>();
foreach (var item in input.OrderBy(t => x(t).Count))
{
x(item).ExceptWith(exceptions);
if (!x(item).Any())
exceptions.Add(item);
}
input.ExceptWith(exceptions);
};
exceptionsRemover(t => t.Worse);
exceptionsRemover(t => t.Better);
}
static void Visit<T>(T item, List<T> visited, ISet<T> worse, ISet<IList<T>> output,
out bool detected) where T : ICyclic<T>
{
detected = false;
foreach (var bad in worse)
{
Func<T, T, bool> comparer = (t1, t2) => t1.CompareTo(t2) > 0;
if (comparer(visited.Last(), visited.First()))
{
detected = true;
var cycle = visited.ToList();
output.Add(cycle);
}
if (visited.Contains(bad))
{
var cycle = visited.SkipWhile(x => !x.Equals(bad)).ToList();
if (cycle.Count >= 3)
{
detected = true;
output.Add(cycle);
}
continue;
}
if (bad.Equals(item) || comparer(bad, visited.Last()))
continue;
visited.Add(bad);
Visit(item, visited, bad.Worse, output, out detected);
if (detected)
visited.Remove(bad);
}
}
public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
{
return new HashSet<T>(source);
}
public class CircleEqualityComparer<T> : IEqualityComparer<ICollection<T>>
{
public bool Equals(ICollection<T> x, ICollection<T> y)
{
if (x.Count != y.Count)
return false;
return Enumerable.Range(1, x.Count)
.Any(i => x.SequenceEqual(y.Skip(i).Concat(y.Take(i))));
}
public int GetHashCode(ICollection<T> obj)
{
return unchecked(obj.Aggregate(0, (x, y) => x + y.GetHashCode()));
}
}
Original answer (from OP)
On the plus side, this is shorter and more concise. Also, since it doesn't rely on recursion, it doesn't need an ICyclic&lt;T&gt; constraint; any IComparable&lt;T&gt; should work. On the minus side, it is as slow as molasses in January.
public static IEnumerable<ICollection<T>> Cycles<T>(this ISet<T> input) where T : IComparable<T>
{
if (input.Count < 3)
return Enumerable.Empty<ICollection<T>>();
Func<T, T, bool> comparer = (t1, t2) => t1.CompareTo(t2) > 0;
return Enumerable.Range(3, input.Count - 3 + 1)
.Select(x => input.Permutations(x))
.SelectMany(x => x)
.Select(x => x.ToList())
.Where(l => l.Zip(l.Skip(1), (t1, t2) => new { t1, t2 }).All(x => comparer(x.t1, x.t2))
&& comparer(l.Last(), l.First()))
.Distinct(new CircleEqualityComparer<T>());
}
public static IEnumerable<IEnumerable<T>> Permutations<T>(this IEnumerable<T> list, int length)
{
if (length == 1)
return list.Select(t => new[] { t });
return Permutations(list, length - 1)
.SelectMany(t => list.Where(e => !t.Contains(e)), (t1, t2) => t1.Concat(new[] { t2 }));
}
public class CircleEqualityComparer<T> : IEqualityComparer<ICollection<T>>
{
public bool Equals(ICollection<T> x, ICollection<T> y)
{
if (x.Count != y.Count)
return false;
return Enumerable.Range(1, x.Count)
.Any(i => x.SequenceEqual(y.Skip(i).Concat(y.Take(i))));
}
public int GetHashCode(ICollection<T> obj)
{
return unchecked(obj.Aggregate(0, (x, y) => x + y.GetHashCode()));
}
}
A few things to note:
I have used ISet&lt;T&gt;s and HashSet&lt;T&gt;s instead of the more traditional List&lt;T&gt;s, but only to make the intent clearer: no duplicate items are allowed. Lists should work just fine.
.NET doesn't really have an insertion-order-preserving set (i.e., one which allows no duplicates), hence I had to use List&lt;T&gt; in many places. A set might have marginally improved performance, but more importantly, using set and list interchangeably causes confusion.
The first approach gives a performance jump of the order of 100 times over the second.
The second approach can be sped up by utilizing the Prepare method. The logic holds there too, i.e., fewer members in the collection means fewer permutations to generate. But it is still painfully slow.
I have made the methods generic, but the solution can be made more general-purpose. For example, in my case a cycle is detected based on a certain comparison logic. That logic could be passed as a parameter, i.e., the items in the collection need not be merely comparable; it could be any vertex-determining logic. But that's left as an exercise for the reader.
In my code (both examples), only cycles of minimum size 3 are considered, i.e., cycles like A > B > C > A. It doesn't account for situations like A > B, B > A. In case you need that, change all the instances of 3 in the code to whatever you like, or better, pass it to the function as a parameter.

Related

How to remove duplicates from a list of nested objects?

I know there are many answers out there suggesting overriding Equals and GetHashCode, but in my case that is not possible because the objects used are imported from DLLs.
First, I have a list of objects called DeploymentData.
These objects, along with other properties, contain the following two: Location(double x, double y, double z) and Duct(int id).
The goal is to remove those that have the same Location parameters.
First, I grouped them by Duct, as a Location cannot be the same if it's on another duct.
var groupingByDuct = deploymentDataList.GroupBy(x => x.Duct.Id).ToList();
Then the actual algorithm:
List<DeploymentData> uniqueDeploymentData = new List<DeploymentData>();
foreach (var group in groupingByDuct) {
uniqueDeploymentData
.AddRange(group
.Select(x => x)
.GroupBy(d => new { d.Location.X, d.Location.Y, d.Location.Z })
.Select(x => x.First()).ToList());
}
This does the work, but in order to properly check that they are indeed duplicates, the entire location should be compared. For this, I've made the following method:
private bool CompareXYZ(XYZ point1, XYZ point2, double tolerance = 10)
{
if (System.Math.Abs(point1.X - point2.X) < tolerance &&
System.Math.Abs(point1.Y - point2.Y) < tolerance &&
System.Math.Abs(point1.Z - point2.Z) < tolerance) {
return true;
}
return false;
}
BUT I have no idea how to apply that to the code written above. To sum up:
How can I write the algorithm above without all those method calls?
How can I adjust the algorithm above to use the CompareXYZ method for a better precision?
Efficiency?
An easy way to filter duplicates is to use a HashSet with a custom equality comparer, i.e., a class that implements IEqualityComparer&lt;T&gt;, e.g.:
public class DeploymentDataEqualityComparer : IEqualityComparer<DeploymentData>
{
private readonly double _tolerance;
public DeploymentDataEqualityComparer(double tolerance)
{
_tolerance = tolerance;
}
public bool Equals(DeploymentData a, DeploymentData b)
{
if (a.Duct.id != b.Duct.id)
return false; // Different Duct, therefore not equal
if (System.Math.Abs(a.Location.X - b.Location.X) < _tolerance &&
System.Math.Abs(a.Location.Y - b.Location.Y) < _tolerance &&
System.Math.Abs(a.Location.Z - b.Location.Z) < _tolerance) {
return true;
}
return false;
}
public int GetHashCode(DeploymentData dd)
{
    // Hash only the Duct: with a tolerance-based Equals, two locations that
    // compare equal can still produce different hash codes, and the HashSet
    // would then never call Equals on them.
    return dd.Duct.GetHashCode();
}
}
In order to filter duplicates, you can then add them to a HashSet:
var hashSet = new HashSet<DeploymentData>(new DeploymentDataEqualityComparer(10));
foreach (var deploymentData in deploymentDataList)
hashSet.Add(deploymentData);
This way, you do not need to group by duct, and you benefit from the performance of the HashSet.

C# synchronize 2 lists of different types by time

I have 2 lists of different types. I think for now it doesn't matter what types they are.
Both types have information about their occurrence, which is in ticks (but can also be a DateTime).
What I want to do is synchronize these 2 lists by time, so that, for example, I can iterate through all elements in the order in which they occurred in time.
Example: // in this example, the lists have elements called A_NUM or B_NUM depending on the type of list, and the number after '_' represents the order in which these elements/events occurred
ListA = {A_2, A_3, A_5}
ListB = {B_1, B_4, B_6}
And the result after synchronization will be something like this:
ResultList = {B_1, A_2, A_3, B_4, A_5, B_6}
Is it somehow possible to make such a mixed list? Or do I have to create some auxiliary List or Dictionary which will tell me the synchronized order of these 2 lists?
EDIT:
One list is a list of eye fixations. A fixation has a position, duration, ... and also an occurrence attribute.
The second list is a list of changes to some text, for example at line 12, column 3 a char 'x' was added at some time t.
And I want to iterate through these 2 lists simultaneously. I mean, at time t1 a fixation occurred at position x,y; at time t2 there was a text change at position u,v; so I want to iterate through these events in the order in which they occurred in time.
Note: YES, both lists are sorted by time. It is a sequence of fixations and a sequence of text changes.
Your question strongly suggests a merge sort as the basic implementation detail. You have two inputs, both sorted, and just want them merged together in sequence.
The main difficulty implied by your question is that you are trying to merge sequences of two completely unrelated types. Ordinarily, you'd merge sequences of the same type, and so could easily manipulate them together. Barring that, they'd at least share a base class or interface type, so that you could treat them as a single generalized type. But it seems, from your question, that this is not the case.
Given that, I think the most straightforward approach is still to use a merge sort, but to provide a mechanism for the sort to access the relevant property (ticks, DateTime, whatever). The sort would return the merged sequences, in correct order, as the object type (i.e. the only base type common to both inputs) and the caller would then have to cast back to the individual types for whatever purpose.
Here's an example of what I mean:
private static IEnumerable<TBase> Merge<TBase, T1, T2, TValue>(
IEnumerable<T1> sequence1, IEnumerable<T2> sequence2,
Func<T1, TValue> valueSelector1, Func<T2, TValue> valueSelector2)
where T1 : TBase
where T2 : TBase
where TValue : IComparable<TValue>
{
IEnumerator<T1> enumerator1 = sequence1.GetEnumerator();
IEnumerator<T2> enumerator2 = sequence2.GetEnumerator();
bool notDone1 = enumerator1.MoveNext(),
notDone2 = enumerator2.MoveNext();
while (notDone1 && notDone2)
{
TValue value1 = valueSelector1(enumerator1.Current),
value2 = valueSelector2(enumerator2.Current);
if (value1.CompareTo(value2) <= 0)
{
yield return enumerator1.Current;
notDone1 = enumerator1.MoveNext();
}
else
{
yield return enumerator2.Current;
notDone2 = enumerator2.MoveNext();
}
}
while (notDone1)
{
yield return enumerator1.Current;
notDone1 = enumerator1.MoveNext();
}
while (notDone2)
{
yield return enumerator2.Current;
notDone2 = enumerator2.MoveNext();
}
}
Used like this:
class A
{
public int Value { get; }
public A(int value)
{
Value = value;
}
}
class B
{
public int Value { get; }
public B(int value)
{
Value = value;
}
}
static void Main(string[] args)
{
const int minCount = 5, maxCount = 15, maxValue = 50;
Random random = new Random();
int listACount = random.Next(minCount, maxCount),
listBCount = random.Next(minCount, maxCount);
A[] listA = RandomOrderedSequence(random, maxValue, listACount).Select(i => new A(i)).ToArray();
B[] listB = RandomOrderedSequence(random, maxValue, listBCount).Select(i => new B(i)).ToArray();
Console.WriteLine("listA: ");
Console.WriteLine(string.Join(", ", listA.Select(a => a.Value)));
Console.WriteLine("listB: ");
Console.WriteLine(string.Join(", ", listB.Select(b => b.Value)));
foreach (object o in Merge<object, A, B, int>(listA, listB, a => a.Value, b => b.Value))
{
A a = o as A;
if (a != null)
{
// Do something with object of type A
Console.WriteLine($"a.Value: {a.Value}");
}
else
{
// Must be a B. Do something with object of type B
B b = (B)o;
Console.WriteLine($"b.Value: {b.Value}");
}
}
}
static IEnumerable<int> RandomOrderedSequence(Random random, int max, int count)
{
return RandomSequence(random, max, count).OrderBy(i => i);
}
static IEnumerable<int> RandomSequence(Random random, int max, int count)
{
while (count-- > 0)
{
yield return random.Next(max);
}
}
In your case, you would of course replace types A and B with the types you're actually using, provide appropriate selectors, and then do whatever you like with each instance returned as the merged, in-order sequence.
Note that even if the types do turn out to share some common basis for which they can be compared and merged, I would still recommend a merge sort over simply concatenating and merging the result. The merge sort is a much more efficient way to merge already-ordered data into a single sequence of ordered data.
You need the list elements to implement a common interface so they can be compared. For example:
public interface ISynchronizable
{
long GetTicks();
}
So you need to have the two object implement this, like so:
public class Fixation : ISynchronizable
{
public long GetTicks()
{
// get the ticks
}
// some other code
}
public class TextChange: ISynchronizable
{
public long GetTicks()
{
// get the ticks
}
// some other code
}
Then the result list would be created like this:
var list = new List&lt;ISynchronizable&gt;();
list.AddRange(fixationList);
list.AddRange(textChangeList);
var resultList = list.OrderBy(e => e.GetTicks()).ToList();

Filtering lambda expression that compares two elements of the same collection

I was wondering if it is possible to write an expression for a LINQ extension (or a custom extension) to filter a collection using a lambda expression that compares two elements of the collection.
In other words, if I have a List&lt;DateTime&gt; and some value, var v = DateTime.Today, then I am wondering if it is possible to create a method that will return the first element of the collection that is less than or equal to the value, current &lt;= v, with the next element of the collection being greater than or equal to the value, next &gt;= v.
Please note that the above is just an example, and may or may not be the final implementation.
The following would be a valid solution, were the .First() method to accept Func<DateTime, DateTime, bool> with the two DateTime parameters being consecutive elements of the sequence:
dateCollection.First((current, next) => current <= v && next >= v);
Please also note that with the example given, a valid workaround could be to use .OrderBy and then find the first index that is greater than v and subtract 1. However, this type of comparison is not the only one that I am looking for. I may have a situation in which I am checking a List&lt;string&gt; for the first situation where the current element starts with the first letter of my value, v, and the next element ends with the last letter of my value, v.
I am looking for something that would be just a few lines of code. My goal is to find the simplest solution to this possible, so brevity carries a lot of weight.
What I am looking for is something of the form:
public static T First (...)
{
...
}
I believe that this will also require two or more lambda expressions as parameters. One thing that may also provide a good solution is to be able to select into all possible, consecutive pairs of elements of the sequence, and call the .First() method on that.
For example:
//value
var v = 5;
//if my collection is the following
List&lt;int&gt; stuff = new List&lt;int&gt; { a, b, c, d };
//select into consecutive pairs, giving:
var pairs = ... // { { a, b }, { b, c }, { c, d } };
//then run comparison
pairs.First(p => p[0] &lt;= v && p[1] &gt;= v)[0];
Thanks and happy coding! :)
What we can create is a Pairwise method that can map a sequence of values into a sequence of pairs representing each value and the value that comes before it.
public static IEnumerable<Tuple<T, T>> Pairwise<T>(this IEnumerable<T> source)
{
using (var iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
yield break;
T prev = iterator.Current;
while (iterator.MoveNext())
{
yield return Tuple.Create(prev, iterator.Current);
prev = iterator.Current;
}
}
}
Now we can write out:
var item = data.Pairwise()
.First(pair => pair.Item1 <= v && pair.Item2 >= v)
.Item1;
If this is something you're going to use a fair bit, it may be worth creating a new custom type to replace Tuple, so that you can have Current and Next properties, instead of Item1 and Item2.
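For instance, a hypothetical replacement using C# 7 value tuples gives you named elements without defining a whole new type (the Current/Next names are an assumption, chosen for readability; like any extension method, it must live in a static class):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PairwiseExtensions
{
    // Variant of Pairwise that yields named tuple elements, so call
    // sites read pair.Current / pair.Next instead of Item1 / Item2.
    public static IEnumerable<(T Current, T Next)> Pairwise<T>(this IEnumerable<T> source)
    {
        using (var iterator = source.GetEnumerator())
        {
            if (!iterator.MoveNext())
                yield break;
            T prev = iterator.Current;
            while (iterator.MoveNext())
            {
                yield return (prev, iterator.Current);
                prev = iterator.Current;
            }
        }
    }
}
```

A call site then reads `data.Pairwise().First(p => p.Current <= v && p.Next >= v).Current;`.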
List<int> list = new List<int>();
list.Add(3);
list.Add(2);
list.Add(8);
list.Add(1);
list.Add(4);
var element = list
.Where((elem, idx) => idx < list.Count-1 && elem<=list[idx+1])
.FirstOrDefault();
Console.WriteLine(element);
Result: 2
where 'elem' is the current element and 'idx' is the index of the current element.
I'm not really sure what you want to return (that is my main question), but to take your example and put it into a LINQ statement, it would be like this:
DateTime? Firstmatch = dateCollection.DefaultIfEmpty(null).FirstOrDefault(a => a <= d && ((dateCollection.IndexOf(a) + 1) < (dateCollection.Count) && dateCollection[dateCollection.IndexOf(a) + 1] >= d));
Strictly following the description, you could combine LINQ and list indexes to find the first index that matches your criterion, and return its element:
DateTime d= DateTime.Today;
var res = dateCollection[Enumerable.Range(0, dateCollection.Count - 1).First(i => dateCollection[i] <= d && dateCollection[i + 1] >= d)];
Servy's answer can be handled without an extension method:
var first = items
.Select((current, index) => index > 0 ? new { Prev = items.ElementAt(index-1), Current = current } : null)
.First(pair => pair != null && pair.Prev <= v && pair.Current >= v)
.Prev;

Quickest way to compare two generic lists for differences

What is the quickest (and least resource-intensive) way to compare two massive lists (>50,000 items) and, as a result, get two lists like the ones below:
items that show up in the first list but not in the second
items that show up in the second list but not in the first
Currently I'm working with List or IReadOnlyCollection and solve this issue with a LINQ query:
var list1 = list.Where(i => !list2.Contains(i)).ToList();
var list2 = list2.Where(i => !list.Contains(i)).ToList();
But this doesn't perform as well as I would like.
Any idea how to make this quicker and less resource-intensive, as I need to process a lot of lists?
Use Except:
var firstNotSecond = list1.Except(list2).ToList();
var secondNotFirst = list2.Except(list1).ToList();
I suspect there are approaches which would actually be marginally faster than this, but even this will be vastly faster than your O(N * M) approach.
If you want to combine these, you could create a method with the above and then a return statement:
return !firstNotSecond.Any() && !secondNotFirst.Any();
One point to note is that there is a difference in results between the original code in the question and the solution here: any duplicate elements which are only in one list will only be reported once with my code, whereas they'd be reported as many times as they occur in the original code.
For example, with lists of [1, 2, 2, 2, 3] and [1], the "elements in list1 but not list2" result in the original code would be [2, 2, 2, 3]. With my code it would just be [2, 3]. In many cases that won't be an issue, but it's worth being aware of.
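The difference is easy to demonstrate with a small sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list1 = new List<int> { 1, 2, 2, 2, 3 };
var list2 = new List<int> { 1 };

// Original Where/Contains approach: reports every duplicate occurrence.
var withDuplicates = list1.Where(i => !list2.Contains(i)).ToList(); // [2, 2, 2, 3]

// Except uses set semantics: each distinct element is reported once.
var distinctOnly = list1.Except(list2).ToList();                    // [2, 3]
```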
Enumerable.SequenceEqual Method
Determines whether two sequences are equal according to an equality comparer.
MS.Docs
Enumerable.SequenceEqual(list1, list2);
This works for all primitive data types. If you need to use it on custom objects you need to implement IEqualityComparer
Defines methods to support the comparison of objects for equality.
IEqualityComparer Interface
MS.Docs for IEqualityComparer
More efficient would be using Enumerable.Except:
var inListButNotInList2 = list.Except(list2);
var inList2ButNotInList = list2.Except(list);
This method is implemented by using deferred execution. That means you could write for example:
var first10 = inListButNotInList2.Take(10);
It is also efficient since it internally uses a Set<T> to compare the objects. It works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.
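Conceptually, that streaming behavior can be sketched like this (a behavioral sketch only, not the actual BCL source):

```csharp
using System.Collections.Generic;

// Rough behavioral sketch of Enumerable.Except: hash the second sequence
// once, then stream the first, yielding each unseen element at most once.
static IEnumerable<T> ExceptSketch<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    var seen = new HashSet<T>(second);
    foreach (var item in first)
        if (seen.Add(item)) // false for elements of 'second' and for repeats
            yield return item;
}
```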
If you want the results to be case insensitive, the following will work:
List<string> list1 = new List<string> { "a.dll", "b1.dll" };
List<string> list2 = new List<string> { "A.dll", "b2.dll" };
var firstNotSecond = list1.Except(list2, StringComparer.OrdinalIgnoreCase).ToList();
var secondNotFirst = list2.Except(list1, StringComparer.OrdinalIgnoreCase).ToList();
firstNotSecond would contain b1.dll
secondNotFirst would contain b2.dll
using System;
using System.Collections.Generic;
using System.Linq;
namespace YourProject.Extensions
{
public static class ListExtensions
{
public static bool SetwiseEquivalentTo<T>(this List<T> list, List<T> other)
where T: IEquatable<T>
{
if (list.Except(other).Any())
return false;
if (other.Except(list).Any())
return false;
return true;
}
}
}
Sometimes you only need to know if two lists are different, not what those differences are. In that case, consider adding this extension method to your project. Note that your list's objects should implement IEquatable&lt;T&gt;!
Usage:
public sealed class Car : IEquatable<Car>
{
public Price Price { get; }
public List<Component> Components { get; }
...
public override bool Equals(object obj)
=> obj is Car other && Equals(other);
public bool Equals(Car other)
=> Price == other.Price
&& Components.SetwiseEquivalentTo(other.Components);
public override int GetHashCode()
=> Components.Aggregate(
Price.GetHashCode(),
(code, next) => code ^ next.GetHashCode()); // Bitwise XOR
}
Whatever the Component class is, the methods shown here for Car should be implemented almost identically.
It's very important to note how we've written GetHashCode. In order to properly implement IEquatable, Equals and GetHashCode must operate on the instance's properties in a logically compatible way.
Two lists with the same contents are still different objects, and will produce different hash codes. Since we want these two lists to be treated as equal, we must let GetHashCode produce the same value for each of them. We can accomplish this by delegating the hashcode to every element in the list, and using the standard bitwise XOR to combine them all. XOR is order-agnostic, so it doesn't matter if the lists are sorted differently. It only matters that they contain nothing but equivalent members.
Note: the strange name is to imply the fact that the method does not consider the order of the elements in the list. If you do care about the order of the elements in the list, this method is not for you!
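A quick sketch of the order-agnostic property described above (hypothetical int lists for illustration):

```csharp
using System.Collections.Generic;
using System.Linq;

var a = new List<int> { 1, 2, 3 };
var b = new List<int> { 3, 1, 2 };

// XOR-combining the element hashes is order-agnostic, so both lists
// produce the same combined hash code despite their different orderings.
int Combine(List<int> xs) => xs.Aggregate(0, (code, next) => code ^ next.GetHashCode());
```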
try this way:
var difList = list1.Where(a => !list2.Any(a1 => a1.id == a.id))
.Union(list2.Where(a => !list1.Any(a1 => a1.id == a.id)));
Not exactly for this problem, but here is some code to compare lists for equal (not necessarily identical!) objects:
public class EquatableList&lt;T&gt; : List&lt;T&gt;, IEquatable&lt;EquatableList&lt;T&gt;&gt; where T : IEquatable&lt;T&gt;
{
    /// &lt;summary&gt;
    /// True, if this contains element with equal property-values
    /// &lt;/summary&gt;
    /// &lt;param name="element"&gt;element of Type T&lt;/param&gt;
    /// &lt;returns&gt;True, if this contains element&lt;/returns&gt;
    public new Boolean Contains(T element)
    {
        return this.Any(t =&gt; t.Equals(element));
    }
    /// &lt;summary&gt;
    /// True, if list is equal to this
    /// &lt;/summary&gt;
    /// &lt;param name="list"&gt;list&lt;/param&gt;
    /// &lt;returns&gt;True, if instance equals list&lt;/returns&gt;
    public Boolean Equals(EquatableList&lt;T&gt; list)
    {
        if (list == null) return false;
        return this.All(list.Contains) &amp;&amp; list.All(this.Contains);
    }
}
If only the combined result is needed, this will work too:
var set1 = new HashSet<T>(list1);
var set2 = new HashSet<T>(list2);
var areEqual = set1.SetEquals(set2);
where T is the type of the lists' elements.
While Jon Skeet's answer is excellent advice for everyday practice with small to moderate numbers of elements (up to a few million), it is nevertheless not the fastest approach and not very resource-efficient. An obvious drawback is that getting the full difference requires two passes over the data (even three if the elements that are equal are of interest as well). Clearly, this can be avoided by a customized reimplementation of the Except method, but it remains true that creating a hash set requires a lot of memory and computing hashes requires time.
For very large data sets (in the billions of elements) it usually pays off to consider the particular circumstances. Here are a few ideas that might provide some inspiration:
If the elements can be compared (which is almost always the case in practice), then sorting the lists and applying the following zip approach is worth consideration:
/// <returns>The elements of the specified (ascendingly) sorted enumerations that are
/// contained only in one of them, together with an indicator,
/// whether the element is contained in the reference enumeration (-1)
/// or in the difference enumeration (+1).</returns>
public static IEnumerable<Tuple<T, int>> FindDifferences<T>(IEnumerable<T> sortedReferenceObjects,
    IEnumerable<T> sortedDifferenceObjects, IComparer<T> comparer)
{
    var refs = sortedReferenceObjects.GetEnumerator();
    var diffs = sortedDifferenceObjects.GetEnumerator();
    bool hasRef = refs.MoveNext();
    bool hasDiff = diffs.MoveNext();
    while (hasRef && hasDiff)
    {
        int comparison = comparer.Compare(refs.Current, diffs.Current);
        if (comparison == 0)
        {
            // insert code that emits the current element if equal elements should be kept
            hasRef = refs.MoveNext();
            hasDiff = diffs.MoveNext();
        }
        else if (comparison < 0)
        {
            yield return Tuple.Create(refs.Current, -1);
            hasRef = refs.MoveNext();
        }
        else
        {
            yield return Tuple.Create(diffs.Current, 1);
            hasDiff = diffs.MoveNext();
        }
    }
    // Drain whichever sequence is not yet exhausted: its remaining elements
    // are all differences. (Tracking the enumerators with separate flags
    // avoids silently dropping these trailing elements.)
    while (hasRef)
    {
        yield return Tuple.Create(refs.Current, -1);
        hasRef = refs.MoveNext();
    }
    while (hasDiff)
    {
        yield return Tuple.Create(diffs.Current, 1);
        hasDiff = diffs.MoveNext();
    }
}
This can e.g. be used in the following way:
const int N = <Large number>;
const int omit1 = 231567;
const int omit2 = 589932;
IEnumerable<int> numberSequence1 = Enumerable.Range(0, N).Select(i => i < omit1 ? i : i + 1);
IEnumerable<int> numberSequence2 = Enumerable.Range(0, N).Select(i => i < omit2 ? i : i + 1);
var numberDiffs = FindDifferences(numberSequence1, numberSequence2, Comparer<int>.Default);
Benchmarking on my computer gave the following result for N = 1M:
| Method   | Mean      | Error    | StdDev   | Ratio | Gen 0     | Gen 1     | Gen 2     | Allocated  |
|----------|-----------|----------|----------|-------|-----------|-----------|-----------|------------|
| DiffLinq | 115.19 ms | 0.656 ms | 0.582 ms | 1.00  | 2800.0000 | 2800.0000 | 2800.0000 | 67110744 B |
| DiffZip  | 23.48 ms  | 0.018 ms | 0.015 ms | 0.20  | -         | -         | -         | 720 B      |
And for N = 100M:
| Method   | Mean     | Error    | StdDev   | Ratio | Gen 0      | Gen 1      | Gen 2      | Allocated    |
|----------|----------|----------|----------|-------|------------|------------|------------|--------------|
| DiffLinq | 12.146 s | 0.0427 s | 0.0379 s | 1.00  | 13000.0000 | 13000.0000 | 13000.0000 | 8589937032 B |
| DiffZip  | 2.324 s  | 0.0019 s | 0.0018 s | 0.19  | -          | -          | -          | 720 B        |
Note that this example of course benefits from the fact that the lists are already sorted and integers can be very efficiently compared. But this is exactly the point: If you do have favourable circumstances, make sure that you exploit them.
A few further comments: The speed of the comparison function is clearly relevant for overall performance, so it may be beneficial to optimize it; the flexibility to do so is a benefit of the zipping approach. Furthermore, parallelization seems more feasible to me, although by no means easy and maybe not worth the effort and the overhead. Nevertheless, a simple way to speed up the process by roughly a factor of 2 is to split each list into two halves (if that can be done efficiently) and compare the parts in parallel, one processing from front to back and the other in reverse order.
I have used this code to compare two lists with millions of records, and it does not take much time:
//Method to compare two list of string
private List<string> Contains(List<string> list1, List<string> list2)
{
List<string> result = new List<string>();
result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));
return result;
}
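For illustration, with some hypothetical sample data the method returns the case-insensitive symmetric difference, i.e. the items that appear in exactly one of the two lists:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list1 = new List<string> { "apple", "Banana", "cherry" };
var list2 = new List<string> { "banana", "date" };

// Items of list1 missing from list2, plus items of list2 missing from list1,
// matched case-insensitively ("Banana" and "banana" count as equal).
var result = new List<string>();
result.AddRange(list1.Except(list2, StringComparer.OrdinalIgnoreCase));
result.AddRange(list2.Except(list1, StringComparer.OrdinalIgnoreCase));
// result: apple, cherry, date
```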
I compared 3 different methods for comparing different data sets. The tests below create a string collection of all the numbers from 0 to length - 1, then another collection covering the same range but containing only the even numbers. I then pick out the odd numbers from the first collection.
Using Linq Except
public void TestExcept()
{
WriteLine($"Except {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newArray = new string[length/2];
int j = 0;
for (int i = 0; i < length; i+=2)
{
newArray[j++] = i.ToString();
}
dateTime = DateTime.Now;
Write("Count of items: ");
WriteLine(array.Except(newArray).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
Except 2021-08-14 11:43:03 AM
Populate set processing time: 00:00:03.7230479
2021-08-14 11:43:09 AM
Count of items: 10000000
Count processing time: 00:00:02.9720879
Using HashSet.Add
public void TestHashSet()
{
WriteLine($"HashSet {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var hashSet = new HashSet<string>();
for (int i = 0; i < length; i++)
{
hashSet.Add(i.ToString());
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newHashSet = new HashSet<string>();
for (int i = 0; i < length; i+=2)
{
newHashSet.Add(i.ToString());
}
dateTime = DateTime.Now;
Write("Count of items: ");
// HashSet Add returns true if item is added successfully (not previously existing)
WriteLine(hashSet.Where(s => newHashSet.Add(s)).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
HashSet 2021-08-14 11:42:43 AM
Populate set processing time: 00:00:05.6000625
Count of items: 10000000
Count processing time: 00:00:01.7703057
Special HashSet test:
public void TestLoadingHashSet()
{
int length = 20000000;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
var dateTime = DateTime.Now;
var hashSet = new HashSet<string>(array);
Write("Time to load hashset: ");
WriteLine(DateTime.Now - dateTime);
}
> TestLoadingHashSet()
Time to load hashset: 00:00:01.1918160
Using .Contains
public void TestContains()
{
WriteLine($"Contains {DateTime.Now}");
int length = 20000000;
var dateTime = DateTime.Now;
var array = new string[length];
for (int i = 0; i < length; i++)
{
array[i] = i.ToString();
}
Write("Populate set processing time: ");
WriteLine(DateTime.Now - dateTime);
var newArray = new string[length/2];
int j = 0;
for (int i = 0; i < length; i+=2)
{
newArray[j++] = i.ToString();
}
dateTime = DateTime.Now;
WriteLine(dateTime);
Write("Count of items: ");
WriteLine(array.Where(a => !newArray.Contains(a)).Count());
Write("Count processing time: ");
WriteLine(DateTime.Now - dateTime);
}
Output
Contains 2021-08-14 11:19:44 AM
Populate set processing time: 00:00:03.1046998
2021-08-14 11:19:49 AM
Count of items: Hosting process exited with exit code 1.
(Didn't complete; killed it after 14 minutes.)
Conclusion:
Linq Except ran approximately 1 second slower on my device than using HashSets (n=20,000,000).
Using Where and Contains ran for a very long time
Closing remarks on HashSets:
Unique data
Make sure to override GetHashCode (correctly) for class types
May need up to 2x the memory if you make a copy of the data set, depending on implementation
HashSet is optimized for cloning other HashSets using the IEnumerable constructor, but it is slower to convert other collections to HashSets (see special test above)
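On the GetHashCode point: for reference types, HashSet membership tests only work if Equals and GetHashCode are overridden consistently. A minimal sketch with a hypothetical Item class:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value-like class: without these overrides, HashSet would
// compare by reference, and two equal-valued instances would not match.
class Item : IEquatable<Item>
{
    public string Name { get; }
    public Item(string name) => Name = name;

    public bool Equals(Item other) => other != null && Name == other.Name;
    public override bool Equals(object obj) => Equals(obj as Item);
    // Must agree with Equals: equal objects must yield equal hash codes.
    public override int GetHashCode() => Name?.GetHashCode() ?? 0;
}
```

With this in place, `new HashSet<Item> { new Item("a") }` reports that it contains a freshly constructed `new Item("a")`.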
First approach:
if (list1 != null && list2 != null
    && list1.All(x => list2.SingleOrDefault(y => y.propertyToCompare == x.propertyToCompare
        && y.anotherPropertyToCompare == x.anotherPropertyToCompare) != null))
    return true;
Second approach if you are ok with duplicate values:
if (list1 != null && list2 != null
    && list1.All(x => list2.Any(y => y.propertyToCompare == x.propertyToCompare
        && y.anotherPropertyToCompare == x.anotherPropertyToCompare)))
    return true;
Both Jon Skeet's and miguelmpn's answers are good. It depends on whether the order of the list elements is important or not:
// take order into account
bool areEqual1 = Enumerable.SequenceEqual(list1, list2);
// ignore order
bool areEqual2 = !list1.Except(list2).Any() && !list2.Except(list1).Any();
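A small sketch of the difference between the two checks:

```csharp
using System.Collections.Generic;
using System.Linq;

var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 3, 2, 1 };

bool sameOrder = list1.SequenceEqual(list2);          // false: order differs
bool sameMembers = !list1.Except(list2).Any()
                && !list2.Except(list1).Any();        // true: same elements
```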
One line:
var list1 = new List<int> { 1, 2, 3 };
var list2 = new List<int> { 1, 2, 3, 4 };
if (list1.Except(list2).Count() + list2.Except(list1).Count() == 0)
Console.WriteLine("same sets");
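One caveat worth knowing: Except has set semantics, so this check ignores duplicates; two lists with the same distinct values but different element counts still report as "same sets":

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list1 = new List<int> { 1, 1, 2 };
var list2 = new List<int> { 2, 2, 1 };

// Except compares distinct values only, so both differences are empty
// even though the lists are clearly not element-for-element equal.
int diffCount = list1.Except(list2).Count() + list2.Except(list1).Count();
// diffCount == 0, so the check above would print "same sets"
```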
I wrote a generic function for comparing two lists:
public static class ListTools
{
public enum RecordUpdateStatus
{
Added = 1,
Updated = 2,
Deleted = 3
}
public class UpdateStatu<T>
{
public T CurrentValue { get; set; }
public RecordUpdateStatus UpdateStatus { get; set; }
}
public static List<UpdateStatu<T>> CompareList<T>(List<T> currentList, List<T> inList, string uniqPropertyName)
{
var res = new List<UpdateStatu<T>>();
res.AddRange(inList.Where(a => !currentList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
.Select(a => new UpdateStatu<T>
{
CurrentValue = a,
UpdateStatus = RecordUpdateStatus.Added,
}));
res.AddRange(currentList.Where(a => !inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
.Select(a => new UpdateStatu<T>
{
CurrentValue = a,
UpdateStatus = RecordUpdateStatus.Deleted,
}));
res.AddRange(currentList.Where(a => inList.Any(x => x.GetType().GetProperty(uniqPropertyName).GetValue(x)?.ToString().ToLower() == a.GetType().GetProperty(uniqPropertyName).GetValue(a)?.ToString().ToLower()))
.Select(a => new UpdateStatu<T>
{
CurrentValue = a,
UpdateStatus = RecordUpdateStatus.Updated,
}));
return res;
}
}
I think this is a simple and easy way to compare two lists element by element (in Python):
x = [1, 2, 3, 5, 4, 8, 7, 11, 12, 45, 96, 25]
y = [2, 4, 5, 6, 8, 7, 88, 9, 6, 55, 44, 23]
tmp = []
for i in range(min(len(x), len(y))):
    if x[i] > y[i]:
        tmp.append(1)
    else:
        tmp.append(0)
print(tmp)
Maybe it's funny, but this works for me:
string.Join("",List1) != string.Join("", List2)
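Beware, though: this compares the concatenated strings, so it can report two different lists as equal whenever only the element boundaries differ. A quick sketch:

```csharp
using System.Collections.Generic;
using System.Linq;

var a = new List<string> { "ab", "c" };
var b = new List<string> { "a", "bc" };

// Both sides join to "abc", so the lists wrongly compare as equal...
bool joinSaysEqual = string.Join("", a) == string.Join("", b);   // true
// ...while an element-wise comparison correctly says they differ.
bool reallyEqual = a.SequenceEqual(b);                           // false
```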
This returns the items of list1 that also appear in list2 (an intersection, not a full comparison):
var list3 = list1.Where(l => list2.Contains(l));

How to get matches between collections of different types?

I think this would take O(A x B) time to execute.
(where A is the size of collectionA and B is the size of collectionB)
Am I correct?
IEnumerable<A> GetMatches(IEnumerable<A> collectionA, IEnumerable<B> collectionB)
{
foreach (A a in collectionA)
foreach (B b in collectionB)
if (a.Value == b.Value)
yield return a;
}
Is there a faster way to execute this query? (maybe using LINQ?)
Enumerable.Intersect will not, unfortunately, work as you're comparing against two separate types (A and B).
This will require a bit of handling separately to get an Intersect call that will work.
You could do this in stages:
IEnumerable<A> GetMatches(IEnumerable<A> collectionA, IEnumerable<B> collectionB)
{
    // Collect the distinct values occurring in collectionB into a set for O(1) lookups
    // (assumes A and B each expose a Value property of type TypeOfValue)
    var values = new HashSet<TypeOfValue>(collectionB.Select(b => b.Value));
    return collectionA.Where(a => values.Contains(a.Value));
}
Note that this will return duplicates if collectionB contains duplicates (but not collectionA), so it will have slightly different results than your looping code.
If you want unique matches (only one returned), you could change the last line to:
return collectionA.Where(a => values.Contains(a.Value)).Distinct();
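A self-contained sketch of the same idea, using hypothetical record types in place of A and B (each assumed to expose an int Value property):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record A(int Value);
record B(int Value);

static class Matcher
{
    // O(|A| + |B|) on average: one pass to build the lookup set, one pass to filter.
    public static IEnumerable<A> GetMatches(IEnumerable<A> collectionA, IEnumerable<B> collectionB)
    {
        var values = new HashSet<int>(collectionB.Select(b => b.Value));
        return collectionA.Where(a => values.Contains(a.Value));
    }
}
```

For example, matching `[A(1), A(2), A(3)]` against `[B(2), B(3), B(4)]` yields `A(2)` and `A(3)`.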
You may try the following intersection algorithm: it runs in O(m + n) once the arrays are sorted (so O(m log m + n log n) overall, including the sorts) and consumes no additional memory. Note that Array.Sort reorders the input arrays in place:
private static IEnumerable<A> Intersect(A[] alist, B[] blist)
{
Array.Sort(alist);
Array.Sort(blist);
for (int i = 0, j = 0; i < alist.Length && j < blist.Length;)
{
if (alist[i].Value == blist[j].Value)
{
yield return alist[i];
i++;
j++;
}
else
{
if (alist[i].Value < blist[j].Value)
{
i++;
}
else
{
j++;
}
}
}
}
