Merging duplicate elements within an IEnumerable - c#

I currently have an IEnumerable<MyObject> where MyObject has the properties String Name and long Value.
Suppose the enumerable holds 10 instances of MyObject, each with a different name and value, except that one of them has the same name as another.
Does .NET (or LINQ) have a built-in method that lets me find the duplicate and, if possible, merge the Value property, so that I end up with only 9 elements in the enumerable, each with a distinct Name, where the one that had a duplicate has a Value equal to the sum of its own and the duplicate's?
So far the only way I have found is to iterate over the entire IEnumerable, look for the duplicates, and generate a new IEnumerable of unique items, but this seems untidy and slow.

You can group items by name and project results to 'merged' objects:
objects.GroupBy(o => o.Name)
.Select(g => new MyObject { Name = g.Key, Value = g.Sum(o => o.Value) });
UPDATE: Another option, if instantiating a new MyObject is undesirable (e.g. the class has many properties, or you need to preserve references), is to use aggregation with the first item in each group as the accumulator. Note that this mutates the first item of each group:
objects.GroupBy(o => o.Name)
.Select(g => g.Skip(1).Aggregate(
g.First(), (a, o) => { a.Value += o.Value; return a; }));
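To make the first variant concrete, here is a minimal self-contained sketch. The MergeDemo class, the MergeByName helper and the sample data are made up for illustration; only the GroupBy/Select/Sum chain comes from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class MyObject
{
    public string Name { get; set; }
    public long Value { get; set; }
}

public static class MergeDemo
{
    // Hypothetical helper: collapses items that share a Name into one new item
    // whose Value is the sum of the group's values.
    public static List<MyObject> MergeByName(IEnumerable<MyObject> source)
    {
        return source
            .GroupBy(o => o.Name)
            .Select(g => new MyObject { Name = g.Key, Value = g.Sum(o => o.Value) })
            .ToList();
    }

    public static void Main()
    {
        var objects = new List<MyObject>
        {
            new MyObject { Name = "a", Value = 1 },
            new MyObject { Name = "b", Value = 2 },
            new MyObject { Name = "a", Value = 40 }
        };

        // Groups come out in the order their keys first appear.
        foreach (var o in MergeByName(objects))
            Console.WriteLine("{0}: {1}", o.Name, o.Value);
        // a: 41
        // b: 2
    }
}
```

GroupBy still walks the whole sequence once, so this is tidier than a hand-written loop rather than fundamentally faster.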

list.GroupBy(e => e.Name).Select(group => new MyObject
{
Name = group.Key,
Value = group.Sum(e => e.Value)
}
)
Update:
Another variant:
list.GroupBy(
e => e.Name,
e => e,
(name, group) => group.Aggregate((result, e) =>
{
result.Value += e.Value;
return result;
}
)
)

I don't know of a single-method solution, but what about:
set.GroupBy(g=>g.Name).Select(g=> new MyObject{Name=g.Key, Value=g.Sum(i=>i.Value)});

Implement the IEquatable interface and use the Distinct method, as follows:
internal class Program
{
private static void Main(string[] args)
{
var items = new List<MyClass>
{
new MyClass
{
Name = "Name1",
Value = 50
},
new MyClass
{
Name = "Name2",
Value = 20
},
new MyClass
{
Name = "Name3",
Value = 50
}
};
var distinct = items.Distinct().ToList();
}
}
internal class MyClass : IEquatable<MyClass>
{
public String Name { get; set; }
public int Value { get; set; }
public bool Equals(MyClass other)
{
if (ReferenceEquals(null, other))
return false;
if (ReferenceEquals(this, other))
return true;
return this.Value == other.Value;
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj))
return false;
if (ReferenceEquals(this, obj))
return true;
if (obj.GetType() != this.GetType())
return false;
return this.Equals((MyClass)obj);
}
public override int GetHashCode()
{
return this.Value;
}
public static bool operator ==(MyClass left, MyClass right)
{
return Equals(left, right);
}
public static bool operator !=(MyClass left, MyClass right)
{
return !Equals(left, right);
}
}

Related

Inline Comparer

I have a class Person with a Name property.
I have a collection of persons.
I have a method to add a new person, but I need to check if the collection already contains the person.
I would like to use coll.Contains(newPerson,[here is the comparer]) where the comparer will make the comparison on the name property.
Is it possible to make the comparison inline (anonymously) without creating a new class implementing IEqualityComparer?
In the case you don't want duplicate Person objects, and want to operate on that collection as a set, you can use a HashSet<Person> instead which when calling its Add method will do the check if such a person already exists. For that to work, you can implement IEquatable<Person> in your class. It would look roughly like this:
public class Person : IEquatable<Person>
{
public Person(string name)
{
Name = name;
}
public string Name { get; private set; }
public bool Equals(Person other)
{
if (ReferenceEquals(null, other)) return false;
if (ReferenceEquals(this, other)) return true;
return string.Equals(Name, other.Name, StringComparison.OrdinalIgnoreCase);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
if (ReferenceEquals(this, obj)) return true;
if (obj.GetType() != this.GetType()) return false;
return Equals((Person) obj);
}
public override int GetHashCode()
{
return (Name != null ? Name.GetHashCode() : 0);
}
public static bool operator ==(Person left, Person right)
{
return Equals(left, right);
}
public static bool operator !=(Person left, Person right)
{
return !Equals(left, right);
}
}
And now you can use it in your HashSet<Person> like this:
void Main()
{
var firstPerson = new Person("Yuval");
var secondPerson = new Person("yuval");
var personSet = new HashSet<Person> { firstPerson };
Console.WriteLine(personSet.Add(secondPerson)); // Will print false.
}
Note this won't give you the flexibility of multiple comparers, but this way you won't have to create a new class implementing IEqualityComparer<T>.
You can use linq instead.
bool contains = coll.Any(p => p.Name == newPerson.Name);
You can add any condition you want here. For example, as WaiHaLee noted, you can make the comparison ignore case.
bool contains = coll.Any(p => p.Name.Equals(newPerson.Name, StringComparison.OrdinalIgnoreCase));
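If you would rather keep the Contains shape from the question, one possible sketch is to project to the names first and pass one of the built-in StringComparer instances as the comparer, so no new IEqualityComparer class is needed. The ContainsDemo class and ContainsByName helper are hypothetical names, and Person is reduced to a single settable property for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public string Name { get; set; }
}

public static class ContainsDemo
{
    // Hypothetical helper: true when some person in coll has the same name
    // as candidate, ignoring case.
    public static bool ContainsByName(IEnumerable<Person> coll, Person candidate)
    {
        return coll.Select(p => p.Name)
                   .Contains(candidate.Name, StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        var coll = new List<Person> { new Person { Name = "Yuval" } };
        Console.WriteLine(ContainsByName(coll, new Person { Name = "yuval" })); // True
        Console.WriteLine(ContainsByName(coll, new Person { Name = "Dana" }));  // False
    }
}
```

A small side benefit: StringComparer tolerates null names, whereas p.Name.Equals(...) throws if a Name is null.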

UnitTesting List<T> of custom objects with List<S> of custom objects for equality

I'm writing some UnitTests for a parser and I'm stuck at comparing two List<T> where T is a class of my own, that contains another List<S>.
My UnitTest compares two lists and fails. The code in the UnitTest looks like this:
CollectionAssert.AreEqual(list1, list2, "failed");
I've written a test scenario that should clarify my question:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ComparerTest
{
class Program
{
static void Main(string[] args)
{
List<SimplifiedClass> persons = new List<SimplifiedClass>()
{
new SimplifiedClass()
{
FooBar = "Foo1",
Persons = new List<Person>()
{
new Person(){ ValueA = "Hello", ValueB="Hello"},
new Person(){ ValueA = "Hello2", ValueB="Hello2"},
}
}
};
List<SimplifiedClass> otherPersons = new List<SimplifiedClass>()
{
new SimplifiedClass()
{
FooBar = "Foo1",
Persons = new List<Person>()
{
new Person(){ ValueA = "Hello2", ValueB="Hello2"},
new Person(){ ValueA = "Hello", ValueB="Hello"},
}
}
};
// The goal is to ignore the order of both lists and their sub-lists.. just check if both lists contain the exact items (in the same amount). Basically ignore the order
// This is how I try to compare in my UnitTest:
//CollectionAssert.AreEqual(persons, otherPersons, "failed");
}
}
public class SimplifiedClass
{
public String FooBar { get; set; }
public List<Person> Persons { get; set; }
public override bool Equals(object obj)
{
if (obj == null) { return false;}
PersonComparer personComparer = new PersonComparer();
SimplifiedClass obj2 = (SimplifiedClass)obj;
return this.FooBar == obj2.FooBar && Enumerable.SequenceEqual(this.Persons, obj2.Persons, personComparer); // I think here is my problem
}
public override int GetHashCode()
{
return this.FooBar.GetHashCode() * 117 + this.Persons.GetHashCode();
}
}
public class Person
{
public String ValueA { get; set; }
public String ValueB { get; set; }
public override bool Equals(object obj)
{
if (obj == null)
{
return false;
}
Person obj2 = (Person)obj;
return this.ValueA == obj2.ValueA && this.ValueB == obj2.ValueB;
}
public override int GetHashCode()
{
if (!String.IsNullOrEmpty(this.ValueA))
{
//return this.ValueA.GetHashCode() ^ this.ValueB.GetHashCode();
return this.ValueA.GetHashCode() * 117 + this.ValueB.GetHashCode();
}
else
{
return this.ValueB.GetHashCode();
}
}
}
public class PersonComparer : IEqualityComparer<Person>
{
public bool Equals(Person x, Person y)
{
if (x != null)
{
return x.Equals(y);
}
else
{
return y == null;
}
}
public int GetHashCode(Person obj)
{
return obj.GetHashCode();
}
}
}
The question is strongly related to C# Compare Lists with custom object but ignore order, but I can't find the difference, other than I wrap a list into another object and use the UnitTest one level above.
I've tried to use an IEqualityComparer:
public class PersonComparer : IEqualityComparer<Person>
{
public bool Equals(Person x, Person y)
{
if (x != null)
{
return x.Equals(y);
}
else
{
return y == null;
}
}
public int GetHashCode(Person obj)
{
return obj.GetHashCode();
}
}
Afterwards I tried to implement the IComparable interface, which allows the objects to be ordered. (Basically like this: https://stackoverflow.com/a/4188041/225808)
However, I don't think my object can be brought into a natural order. Therefore I consider this a hack, if I come up with random ways to sort my class.
public class Person : IComparable<Person>
{
// ... other members as shown above ...
public int CompareTo(Person other)
{
if (this.GetHashCode() > other.GetHashCode()) return -1;
if (this.GetHashCode() == other.GetHashCode()) return 0;
return 1;
}
}
I hope I've made no mistakes while simplifying my problem. I think the main problems are:
How can I make my custom objects comparable and define equality in SimplifiedClass so that it relies on the comparison of contained objects (e.g. Person in a list, like List<Person>)? I assume Enumerable.SequenceEqual should be replaced with something else, but I don't know with what.
Is CollectionAssert.AreEqual the correct method in my UnitTest?
Equals on a List<T> only checks reference equality between the lists themselves; it does not look at the items in the list. And as you said, you don't want to use SequenceEqual because you don't care about the ordering. In that case you should use CollectionAssert.AreEquivalent: it compares the elements like Enumerable.SequenceEqual does, but it does not care about the order of the two collections.
For a more general method that can be used in code it will be a little more complicated, here is a re-implemented version of what Microsoft is doing in their assert method.
using System.Collections;
using System.Collections.Generic;
public static class Helpers
{
public static bool IsEquivalent(this ICollection source, ICollection target)
{
//These 4 checks are just "shortcuts" so we may be able to return early with a result
// without having to do all the work of comparing every member.
if (source == null != (target == null))
return false; //If one is null and one is not, return false immediately.
if (object.ReferenceEquals((object)source, (object)target) || source == null)
return true; //If both point to the same reference or both are null (We validated that both are true or both are false last if statement) return true;
if (source.Count != target.Count)
return false; //If the counts are different return false;
if (source.Count == 0)
return true; //If the count is 0 there is nothing to compare, return true. (We validated both counts are the same last if statement).
int nullCount1;
int nullCount2;
//Count up the duplicates we see of each element.
Dictionary<object, int> elementCounts1 = GetElementCounts(source, out nullCount1);
Dictionary<object, int> elementCounts2 = GetElementCounts(target, out nullCount2);
//It checks the total number of null items in the collection.
if (nullCount2 != nullCount1)
{
//The count of nulls was different, return false.
return false;
}
else
{
//Go through each key and check that the duplicate count is the same for
// both dictionaries.
foreach (object key in elementCounts1.Keys)
{
int sourceCount;
int targetCount;
elementCounts1.TryGetValue(key, out sourceCount);
elementCounts2.TryGetValue(key, out targetCount);
if (sourceCount != targetCount)
{
//Counts of duplicates for an element were different, return false.
return false;
}
}
//All elements matched, return true.
return true;
}
}
//Builds the dictionary out of the collection, this may be re-writeable to a ".GroupBy(" but I did not take the time to do it.
private static Dictionary<object, int> GetElementCounts(ICollection collection, out int nullCount)
{
Dictionary<object, int> dictionary = new Dictionary<object, int>();
nullCount = 0;
foreach (object key in (IEnumerable)collection)
{
if (key == null)
{
++nullCount;
}
else
{
int num;
dictionary.TryGetValue(key, out num);
++num;
dictionary[key] = num;
}
}
return dictionary;
}
}
It builds a dictionary from each of the two collections, counting the duplicates of each element and storing that count as the value. It then compares the two dictionaries to make sure the duplicate counts match on both sides. This tells you that {1, 2, 2, 3} and {1, 2, 3, 3} are not equal, whereas Enumerable.Except would tell you that they were.
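The comment in the code above suggests the counting step could be rewritten with GroupBy. Here is a rough generic sketch of that idea, not Microsoft's implementation. The EquivalenceDemo class and its method names are made up, and it assumes the sequences contain no null elements, since dictionary keys cannot be null. It relies on the elements' own Equals and GetHashCode.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EquivalenceDemo
{
    // Counts how often each distinct element occurs.
    // Assumption: no null elements (Dictionary keys cannot be null).
    private static Dictionary<T, int> CountElements<T>(IEnumerable<T> items)
    {
        return items.GroupBy(x => x).ToDictionary(g => g.Key, g => g.Count());
    }

    // True when both sequences hold the same elements the same number of
    // times, in any order.
    public static bool IsEquivalent<T>(IEnumerable<T> source, IEnumerable<T> target)
    {
        var sourceCounts = CountElements(source);
        var targetCounts = CountElements(target);
        if (sourceCounts.Count != targetCounts.Count) return false;
        foreach (var pair in sourceCounts)
        {
            int targetCount;
            if (!targetCounts.TryGetValue(pair.Key, out targetCount) || targetCount != pair.Value)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        var a = new[] { 1, 2, 2, 3 };
        var b = new[] { 1, 2, 3, 3 };
        Console.WriteLine(a.Except(b).Any());                      // False: Except sees no difference
        Console.WriteLine(IsEquivalent(a, b));                     // False: the duplicate counts differ
        Console.WriteLine(IsEquivalent(a, new[] { 3, 2, 1, 2 }));  // True: same items, other order
    }
}
```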

c# differenciate two lists with object

I have these two lists result and resultNew:
data.AddMapping<Employee>(x => x.Name, "Name");
data.AddMapping<Employee>(x => x.Code, "Code");
data.AddMapping<Employee>(x => x.WorkingStatus, "Working Status");
var result = (from x in data.Worksheet<Employee>("Tradesmen")
select x).ToList();
dataNew.AddMapping<Employee>(x => x.Name, "Name");
dataNew.AddMapping<Employee>(x => x.Code, "Code");
dataNew.AddMapping<Employee>(x => x.WorkingStatus, "Working Status");
var resultNew = (from x in dataNew.Worksheet<Employee>("On Leave")
select x).ToList();
where Employee is a simple C# class that contains Code, Name and WorkingStatus properties.
I want the employees whose Code is in resultNew but not in result.
I tried this:
var newEmployees = resultNew.Except(Code = result.Select(s => s.Code)).ToList();
but I got syntax error:
System.Collections.Generic.List' does not contain a definition for 'Except' and the best extension method overload 'System.Linq.Enumerable.Except(System.Collections.Generic.IEnumerable, System.Collections.Generic.IEnumerable)' has some invalid arguments
You can create a HashSet of the Code values already present in result and then use it to filter resultNew:
HashSet<string> resultCodes = new HashSet<string>(result.Select(r => r.Code));
List<Employee> newEmployees = resultNew.Where(r => !resultCodes.Contains(r.Code))
.ToList();
You can also override Equals and GetHashCode in your Employee class based on the Code property, and then you can use Except like:
class Employee
{
protected bool Equals(Employee other)
{
return string.Equals(Code, other.Code);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
if (ReferenceEquals(this, obj)) return true;
if (obj.GetType() != this.GetType()) return false;
return Equals((Employee) obj);
}
public override int GetHashCode()
{
return (Code != null ? Code.GetHashCode() : 0);
}
public string Name { get; set; }
public string Code { get; set; }
public string WorkingStatus { get; set; }
}
and then:
var newEmployees = resultNew.Except(result).ToList();
Remember that the above implementation of Equals and GetHashCode only considers the Code property. See also the question "How do you implement GetHashCode for structure with two string, when both strings are interchangeable".
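If you would rather not change how Employee compares everywhere, a third possibility is to pass a comparer to Except. The sketch below assumes a hypothetical EmployeeCodeComparer class and made-up sample data; the Except overload that takes an IEqualityComparer<T> is part of LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public string Name { get; set; }
    public string Code { get; set; }
    public string WorkingStatus { get; set; }
}

// Hypothetical comparer: two employees are "the same" when their Code matches.
public sealed class EmployeeCodeComparer : IEqualityComparer<Employee>
{
    public bool Equals(Employee x, Employee y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return string.Equals(x.Code, y.Code);
    }

    public int GetHashCode(Employee obj)
    {
        return obj.Code != null ? obj.Code.GetHashCode() : 0;
    }
}

public static class ExceptDemo
{
    // Returns the employees of resultNew whose Code does not occur in result.
    public static List<Employee> NewEmployees(IEnumerable<Employee> result, IEnumerable<Employee> resultNew)
    {
        return resultNew.Except(result, new EmployeeCodeComparer()).ToList();
    }

    public static void Main()
    {
        var result = new List<Employee>
        {
            new Employee { Code = "E1", Name = "Ann" },
            new Employee { Code = "E2", Name = "Bob" }
        };
        var resultNew = new List<Employee>
        {
            new Employee { Code = "E2", Name = "Bob" },
            new Employee { Code = "E3", Name = "Cara" }
        };

        foreach (var e in NewEmployees(result, resultNew))
            Console.WriteLine("{0}: {1}", e.Code, e.Name);
        // E3: Cara
    }
}
```

Keep in mind that Except has set semantics, so it also drops repeated codes within resultNew itself; the HashSet/Where version keeps them.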

How to get a Distinct result using LINQ and C# using method syntax [duplicate]

This question already has answers here:
How can I maintain type when using LINQ .Select in C#?
(4 answers)
Closed 9 years ago.
The following code still does not return a DISTINCT result set. The equivalent SQL I am trying to accomplish is SELECT DISTINCT LEFT(Fac_Name, 6) AS ID, LEFT(Fac_Name, 3) AS Fac_Name
public List<Facility> GetFacilities() {
var facilities = new List<Facility>();
facilities = _facilityRepository.GetAll().ToList();
var facReturnList =
facilities.Where(x => x.Fac_Name == "Something")
.OrderBy(x => x.Fac_Name).ToList();
var facReturnList2 =
facReturnList.Select(x =>
new Facility { ID = x.Fac_Name.Substring(0, 6),
Fac_Name = x.Fac_Name.Substring(0, 3) })
.Distinct().ToList();
return facReturnList2;
}
The problem is that you're creating distinct reference values (which return different hash codes). Even if the properties inside each instance are equal, the references themselves are distinct.
// fac1 and fac2 are the same reference, fac3 is a different reference.
var fac1 = new Facility { ID = "0", Fac_Name = "Hello" };
var fac2 = fac1;
var fac3 = new Facility { ID = "0", Fac_Name = "Hello" };
var facs = new List<Facility>() { fac1, fac2, fac3 };
foreach (var fac in facs.Distinct())
Console.WriteLine("Id: {0} | Name: {1}", fac.ID, fac.Fac_Name);
// OUTPUT
// Id: 0 | Name: Hello (NOTE: This is the value of fac1/fac2)
// Id: 0 | Name: Hello (This is the value of fac3)
To solve your dilemma, you should either:
Override the Object.GetHashCode() and the Object.Equals(Object) methods. Note that Distinct() uses GetHashCode() to bucket candidates and Equals(Object) to confirm a match, so the two should always be overridden together.
Guidelines for Overloading Equals() and Operator ==
public class Facility
{
public string ID { get; set; }
public string Fac_Name { get; set; }
// This is just a rough example.
public override bool Equals(Object obj)
{
var fac = obj as Facility;
if (fac == null) return false;
if (Object.ReferenceEquals(this, fac)) return true;
return (this.ID == fac.ID) && (this.Fac_Name == fac.Fac_Name);
}
public override int GetHashCode()
{
var hash = 13;
if (!String.IsNullOrEmpty(this.ID))
hash ^= ID.GetHashCode();
if (!String.IsNullOrEmpty(this.Fac_Name))
hash ^= Fac_Name.GetHashCode();
return hash;
}
}
Provide a custom IEqualityComparer<T>.
public class FacilityEqualityComparer : IEqualityComparer<Facility>
{
public bool Equals(Facility x, Facility y)
{
return (x.ID == y.ID) && (x.Fac_Name == y.Fac_Name);
}
public int GetHashCode(Facility fac)
{
var hash = 13;
if (!String.IsNullOrEmpty(fac.ID))
hash ^= fac.ID.GetHashCode();
if (!String.IsNullOrEmpty(fac.Fac_Name))
hash ^= fac.Fac_Name.GetHashCode();
return hash;
}
}
var facReturnList2 =
facReturnList.Select(x =>
new Facility { ID = x.Fac_Name.Substring(0, 6),
Fac_Name = x.Fac_Name.Substring(0, 3) })
.Distinct(new FacilityEqualityComparer()).ToList();
Also, some other things to note:
Your naming does not follow the guidelines. Don't use underscores in property names, and ID should be Id.
Whichever way you decide to go with, you should look into using String.Equals(...) and specify a StringComparison value. I just used == equality comparison on strings to keep the post short and readable.
So the problem is that the Enumerable.Distinct method uses the default equality comparer, which for a class that does not override Equals and GetHashCode compares references, so every new Facility counts as distinct regardless of its property values. Build an equality comparer for that type:
public class FacilityEqualityComparer : IEqualityComparer<Facility>
{
public bool Equals(Facility fac1, Facility fac2)
{
return fac1.ID.Equals(fac2.ID) && fac1.Fac_Name.Equals(fac2.Fac_Name);
}
public int GetHashCode(Facility fac)
{
string hCode = fac.ID + fac.Fac_Name;
return hCode.GetHashCode();
}
}
and then when you use it, call it like this:
var facReturnList2 =
facReturnList.Select(x =>
new Facility { ID = x.Fac_Name.Substring(0, 6),
Fac_Name = x.Fac_Name.Substring(0, 3) })
.Distinct(new FacilityEqualityComparer()).ToList();
return facReturnList2;
Distinct uses the default equality comparer to check for equality. For a class that doesn't override Equals, that means reference equality, which obviously won't hold in your case.
So you'll either need to use a custom IEqualityComparer (see the overload of Distinct() that accepts one), or you can replicate the functionality of Distinct() with a GroupBy() and a First():
facReturnList.Select(x =>
new Facility { ID = x.Fac_Name.Substring(0, 6),
Fac_Name = x.Fac_Name.Substring(0, 3)
})
.GroupBy(x => new{x.ID, x.Fac_Name})
.Select(y => y.First())
.ToList();
You could also override the Equals method in your Facility class:
public override bool Equals(System.Object obj)
{
if (ReferenceEquals(null, obj)) return false;
if (ReferenceEquals(this, obj)) return true;
if (obj.GetType() != this.GetType()) return false;
Facility objAsFacility = obj as Facility;
return Equals(objAsFacility);
}
protected bool Equals(Facility other)
{
if (other.Fac_Name == this.Fac_Name)
return true;
else return false;
}
public override int GetHashCode()
{
return this.Fac_Name.GetHashCode();
//Or you might even want to this:
//return (this.ID + this.Fac_Name).GetHashCode();
}
I'd probably go with the overriding equality operator method.
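One more possibility, sketched here under the assumption that you only need LINQ to Objects: anonymous types already implement value-based Equals and GetHashCode, so you can call Distinct() on an anonymous projection and only then build the Facility objects. DistinctDemo, DistinctPrefixes and the sample names are made up for illustration, and like the original code this assumes every Fac_Name has at least 6 characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Facility
{
    public string ID { get; set; }
    public string Fac_Name { get; set; }
}

public static class DistinctDemo
{
    // Hypothetical helper: de-duplicates on the projected values, then
    // materializes Facility objects from the distinct pairs.
    public static List<Facility> DistinctPrefixes(IEnumerable<Facility> facilities)
    {
        return facilities
            .Select(x => new
            {
                ID = x.Fac_Name.Substring(0, 6),
                Fac_Name = x.Fac_Name.Substring(0, 3)
            })
            .Distinct() // anonymous types compare property by property
            .Select(x => new Facility { ID = x.ID, Fac_Name = x.Fac_Name })
            .ToList();
    }

    public static void Main()
    {
        var facilities = new List<Facility>
        {
            new Facility { Fac_Name = "NORTH-01 Clinic" },
            new Facility { Fac_Name = "NORTH-01 Annex" },
            new Facility { Fac_Name = "SOUTH-02 Lab" }
        };

        foreach (var f in DistinctPrefixes(facilities))
            Console.WriteLine("{0} | {1}", f.ID, f.Fac_Name);
        // NORTH- | NOR
        // SOUTH- | SOU
    }
}
```

This leaves Facility itself untouched, which may matter if other code depends on its reference equality.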

Filtering duplicates out of an IEnumerable

I have this code:
class MyObj {
int Id;
string Name;
string Location;
}
IEnumerable<MyObj> list;
I want to convert list to a dictionary like this:
list.ToDictionary(x => x.Name);
but it tells me I have duplicate keys. How can I keep only the first item for each key?
I suppose the easiest way would be to group by key and take the first element of each group:
list.GroupBy(x => x.Name).Select(g => g.First()).ToDictionary(x => x.Name);
Or you could use Distinct if your objects implement IEquatable to compare between themselves by key:
// I'll just randomly call your object Person for this example.
class Person : IEquatable<Person>
{
public string Name { get; set; }
public bool Equals(Person other)
{
if (other == null)
return false;
return Name == other.Name;
}
public override bool Equals(object obj)
{
return Equals(obj as Person);
}
public override int GetHashCode()
{
return Name.GetHashCode();
}
}
...
list.Distinct().ToDictionary(x => x.Name);
Or if you don't want to do that (maybe because you normally want to compare for equality in a different way, so Equals is already in use) you could make a custom implementation of IEqualityComparer just for this case:
class PersonComparer : IEqualityComparer<Person>
{
public bool Equals(Person x, Person y)
{
if (x == null)
return y == null;
if (y == null)
return false;
return x.Name == y.Name;
}
public int GetHashCode(Person obj)
{
return obj.Name.GetHashCode();
}
}
...
list.Distinct(new PersonComparer()).ToDictionary(x => x.Name);
list.Distinct().ToDictionary(x => x.Name);
You could also create your own Distinct extension overload method that accepted a Func<> for choosing the distinct key:
using System;
using System.Collections.Generic;
using System.Linq;
public static class EnumerationExtensions
{
public static IEnumerable<TSource> Distinct<TSource,TKey>(
this IEnumerable<TSource> source, Func<TSource,TKey> keySelector)
{
var comparer = new KeyComparer<TSource,TKey>(keySelector);
return source.Distinct(comparer);
}
private class KeyComparer<TSource,TKey> : IEqualityComparer<TSource>
{
private readonly Func<TSource,TKey> keySelector;
public KeyComparer(Func<TSource,TKey> keySelector)
{
this.keySelector = keySelector;
}
bool IEqualityComparer<TSource>.Equals(TSource a, TSource b)
{
if (a == null && b == null) return true;
if (a == null || b == null) return false;
// == is not defined for an unconstrained TKey, so compare keys with the default comparer.
return EqualityComparer<TKey>.Default.Equals(keySelector(a), keySelector(b));
}
int IEqualityComparer<TSource>.GetHashCode(TSource obj)
{
return EqualityComparer<TKey>.Default.GetHashCode(keySelector(obj));
}
}
}
Apologies for any bad code formatting, I wanted to reduce the size of the code on the page. Anyway, you can then use ToDictionary:
var dictionary = list.Distinct(x => x.Name).ToDictionary(x => x.Name);
Could make your own perhaps? For example:
public static class Extensions
{
public static IDictionary<TKey, TValue> ToDictionary2<TKey, TValue>(
this IEnumerable<TValue> subjects, Func<TValue, TKey> keySelector)
{
var dictionary = new Dictionary<TKey, TValue>();
foreach(var subject in subjects)
{
var key = keySelector(subject);
if(!dictionary.ContainsKey(key))
dictionary.Add(key, subject);
}
return dictionary;
}
}
var dictionary = list.ToDictionary2(x => x.Name);
Haven't tested it, but should work. (and it should probably have a better name than ToDictionary2 :p)
Alternatively, you can implement a DistinctBy method, for example like this:
public static IEnumerable<TSubject> DistinctBy<TSubject, TValue>(this IEnumerable<TSubject> subjects, Func<TSubject, TValue> valueSelector)
{
var set = new HashSet<TValue>();
foreach(var subject in subjects)
if(set.Add(valueSelector(subject)))
yield return subject;
}
var dictionary = list.DistinctBy(x => x.Name).ToDictionary(x => x.Name);
The problem here is that the ToDictionary extension method does not support multiple values with the same key. One solution is to write a version which does and use that instead.
public static Dictionary<TKey,TValue> ToDictionaryAllowDuplicateKeys<TKey,TValue>(
this IEnumerable<TValue> values,
Func<TValue,TKey> keyFunc) {
var map = new Dictionary<TKey,TValue>();
foreach ( var cur in values ) {
var key = keyFunc(cur);
map[key] = cur;
}
return map;
}
Now converting to a dictionary is straightforward:
var map = list.ToDictionaryAllowDuplicateKeys(x => x.Name);
The following will work if you have different instances of MyObj with the same value for the Name property. It will take the first instance found for each duplicate (sorry for the obj - obj2 notation, it is just sample code):
list.SelectMany(obj => new MyObj[] {list.Where(obj2 => obj2.Name == obj.Name).First()}).Distinct();
EDIT: Joren's solution is better as it does not create unnecessary arrays in the process.
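One difference between the answers above is worth spelling out: grouping and taking First() keeps the first item per key, which is what the question asks for, while assigning through the dictionary indexer keeps the last one. A minimal sketch, where the FirstVsLastDemo class, its method names and the sample data are hypothetical and MyObj is given public properties so the example compiles:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class MyObj
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Location { get; set; }
}

public static class FirstVsLastDemo
{
    // Keeps the first item seen for each Name.
    public static Dictionary<string, MyObj> KeepFirst(IEnumerable<MyObj> list)
    {
        return list.GroupBy(x => x.Name).ToDictionary(g => g.Key, g => g.First());
    }

    // Keeps the last item seen for each Name, because the indexer overwrites.
    public static Dictionary<string, MyObj> KeepLast(IEnumerable<MyObj> list)
    {
        var map = new Dictionary<string, MyObj>();
        foreach (var item in list)
            map[item.Name] = item;
        return map;
    }

    public static void Main()
    {
        var list = new List<MyObj>
        {
            new MyObj { Id = 1, Name = "a" },
            new MyObj { Id = 2, Name = "a" },
            new MyObj { Id = 3, Name = "b" }
        };

        Console.WriteLine(KeepFirst(list)["a"].Id); // 1
        Console.WriteLine(KeepLast(list)["a"].Id);  // 2
    }
}
```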
