I have a List<CustomObject> and want to remove duplicates from it.
If two Custom Objects have same value for property: City, then I will call them duplicate.
I have implemented IEquatable as follows, but I am not able to remove duplicates from the list.
What is missing?
public class CustomAddress : IAddress, IEqualityComparer<IAddress>
{
//Other class members go here
//IEqualityComparer members
public bool Equals(IAddress x, IAddress y)
{
// Check whether the compared objects reference the same data.
if (ReferenceEquals(x, y)) return true;
// Check whether any of the compared objects is null.
if (ReferenceEquals(x, null) || ReferenceEquals(y, null))
return false;
// Check whether the Objects' properties are equal.
return x.City.Equals(y.City);
}
public int GetHashCode(IAddress obj)
{
// Check whether the object is null.
if (ReferenceEquals(obj, null)) return 0;
int hashAreaName = obj.City == null ? 0 : obj.City.GetHashCode();
return hashAreaName;
}
}
I am using .NET 3.5
With your overrides of Equals and GetHashCode in place, if you have an existing list that you need to filter, simply invoke Distinct() (available through the namespace System.Linq) on the list.
var noDupes = list.Distinct();
This will give you a duplicate-free sequence. If you need that to be a concrete list, simply add a ToList() to the end of the invocation.
var noDupes = list.Distinct().ToList();
Another answer mentions implementing an IEqualityComparer<CustomObject>. This is useful when overriding Equals and GetHashCode directly is either impossible (you don't control the source) or does not make sense (your idea of equality in this particular case is not universal for the class). In that case, define the comparer as demonstrated and provide an instance of the comparer to an overload of Distinct.
Finally, if you're building a list from the ground up and want to avoid duplicates being inserted, you can use a HashSet<T>. The HashSet also accepts a custom comparer in the constructor, so you can optionally include that.
var mySet = new HashSet<CustomObject>();
bool isAdded = mySet.Add(myElement);
// isAdded will be false if myElement already exists in set, and
// myElement would not be added a second time.
// or you could use
if (!mySet.Contains(myElement))
mySet.Add(myElement);
One more option that does not use .NET library methods but can be useful in a pinch is Jon Skeet's DistinctBy. The idea is that you pass a Func<MyObject, Key> lambda expression directly and omit the overrides of Equals and GetHashCode (or the custom comparer) entirely.
var noDupes = list.DistinctBy(obj => obj.City); // NOT part of the BCL in .NET 3.5 (Enumerable.DistinctBy was added in .NET 6)
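For reference, here is a minimal sketch of what such a key-based distinct could look like. It is not Jon Skeet's actual implementation; the names DistinctByKeyExtensions and DistinctByKey are assumptions, chosen so the method cannot clash with a DistinctBy that may already be in scope.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; DistinctByKey is an assumed name, not a library method.
public static class DistinctByKeyExtensions
{
    public static IEnumerable<TSource> DistinctByKey<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        // Validate eagerly, then hand over to the lazy iterator below.
        if (source == null) throw new ArgumentNullException("source");
        if (keySelector == null) throw new ArgumentNullException("keySelector");
        return DistinctByKeyIterator(source, keySelector);
    }

    private static IEnumerable<TSource> DistinctByKeyIterator<TSource, TKey>(
        IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        // Remember every key seen so far; HashSet<T>.Add returns false for repeats.
        HashSet<TKey> seenKeys = new HashSet<TKey>();
        foreach (TSource item in source)
        {
            if (seenKeys.Add(keySelector(item)))
                yield return item; // first item with this key, in source order
        }
    }
}
```

With this in place, something like list.DistinctByKey(obj => obj.City).ToList() would keep the first object seen for each city.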
Just by implementing .Equals the way you did (which you implemented correctly) you will not prevent duplicates from being added to a List<T>. You will actually have to remove them manually.
Instead of List<CustomObject> use HashSet<CustomObject>. It will never contain duplicates.
That's because List<CustomObject> tests whether your class (CustomObject) implements IEquatable<CustomObject>, not IEquatable<IAddress> as you did.
I assume that for the duplicate check you are using the Contains method before adding a new member.
To match duplicates on only a specific property you need a comparer.
class MyComparer : IEqualityComparer<CustomObject>
{
public bool Equals(CustomObject x, CustomObject y)
{
return x.City.Equals(y.City);
}
public int GetHashCode(CustomObject x)
{
return x.City.GetHashCode();
}
}
Usage:
var yourDistinctObjects = yourObjects.Distinct(new MyComparer());
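Note that the comparer above throws a NullReferenceException if an item or its City is null. A null-tolerant variant could look like the sketch below. The CustomObject class here is only a stand-in with a single City property, and CityComparer is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the question's type; only the City property matters here.
public class CustomObject
{
    public string City { get; set; }
}

// Hypothetical null-tolerant comparer: two items are equal when their City matches.
public class CityComparer : IEqualityComparer<CustomObject>
{
    public bool Equals(CustomObject x, CustomObject y)
    {
        if (ReferenceEquals(x, y)) return true;    // same instance, or both null
        if (x == null || y == null) return false;  // exactly one of them is null
        return string.Equals(x.City, y.City, StringComparison.Ordinal);
    }

    public int GetHashCode(CustomObject obj)
    {
        // Must agree with Equals: items with equal cities get equal hash codes.
        if (obj == null || obj.City == null) return 0;
        return obj.City.GetHashCode();
    }
}
```

It is used the same way as before, for example yourObjects.Distinct(new CityComparer()), and it can also be passed to the HashSet<CustomObject> constructor.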
Edit: Found this thread that does what you need and I think I referred to it in the past:
Remove duplicates in the list using linq
One answer that I thought was kind of interesting (but not how I had done it) was:
var distinctItems = items.GroupBy(x => x.Id).Select(y => y.First());
It's a one liner that does what you need but might not be as efficient as the other methods.
I have cast
var info = property.Info;
object data = info.GetValue(obj);
...
var enumerable = (IEnumerable)data;
if (enumerable.Any()) ///Does not compile
{
}
if (enumerable.GetEnumerator().Current != null) // Run time error
{
}
and I would like to see if this enumerable has any elements, using the LINQ method Any(). Unfortunately, even with LINQ available, I can't.
How would I do this without specifying the generic type?
While you can't do this directly, you could do it via Cast:
if (enumerable.Cast<object>().Any())
That should always work, as any IEnumerable can be wrapped as an IEnumerable<object>. It will end up boxing the first element if it's actually an IEnumerable<int> or similar, but it should work fine. Unlike most LINQ methods, Cast and OfType target IEnumerable rather than IEnumerable<T>.
You could write your own subset of extension methods like the LINQ ones but operating on the non-generic IEnumerable type if you wanted to, of course. Implementing LINQ to Objects isn't terribly hard - you could use my Edulinq project as a starting point, for example.
There are cases where you could implement Any(IEnumerable) slightly more efficiently than using Cast - for example, taking a shortcut if the target implements the non-generic ICollection interface. At that point, you wouldn't need to create an iterator or take the first element. In most cases that won't make much performance difference, but it's the kind of thing you could do if you were optimizing.
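As a rough sketch of that shortcut (not taken from Edulinq or the framework), the helper below checks for the non-generic ICollection first and only falls back to an enumerator when it has to. The names UntypedEnumerableExtensions and HasAny are assumptions.

```csharp
using System;
using System.Collections;

// Hypothetical helper for the non-generic IEnumerable; HasAny is an assumed name.
public static class UntypedEnumerableExtensions
{
    public static bool HasAny(this IEnumerable source)
    {
        if (source == null) throw new ArgumentNullException("source");

        // Shortcut: collections know their size, so no enumerator is needed.
        ICollection collection = source as ICollection;
        if (collection != null) return collection.Count > 0;

        // General case: ask the enumerator whether a first element exists.
        IEnumerator enumerator = source.GetEnumerator();
        try
        {
            return enumerator.MoveNext();
        }
        finally
        {
            // Non-generic enumerators are not required to be disposable.
            IDisposable disposable = enumerator as IDisposable;
            if (disposable != null) disposable.Dispose();
        }
    }
}
```

Arrays and ArrayList take the ICollection branch, while something like a string (which is IEnumerable but not ICollection) goes through the enumerator.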
One method is to use foreach, as noted in IEnumerable "Remarks". It also provides details on the additional methods off of the result of GetEnumerator.
bool hasAny = false;
foreach (object i in (IEnumerable)(new int[1] /* IEnumerable of any type */)) {
hasAny = true;
break;
}
(Which is itself easily transferable to an Extension method.)
Your attempt to use GetEnumerator().Current tried to get the current value of an enumerator that had not yet been moved to the first position. It would also have given the wrong result if the first item existed but was null. What you could have done (and what the Any() in Enumerable does) is check whether it is possible to move to that first item, i.e. whether there is a first item to move to:
internal static class UntypedLinq
{
public static bool Any(this IEnumerable source)
{
if (source == null) throw new ArgumentNullException(nameof(source));
IEnumerator ator = source.GetEnumerator();
// Unfortunately unlike IEnumerator<T>, IEnumerator does not implement
// IDisposable. (A design flaw fixed when IEnumerator<T> was added).
// We need to test whether disposal is required or not.
if (ator is IDisposable disp)
{
using(disp)
{
return ator.MoveNext();
}
}
return ator.MoveNext();
}
// Not completely necessary. Causes any typed enumerables to be handled by the existing Any
// in Linq via a short method that will be inlined.
public static bool Any<T>(this IEnumerable<T> source) => Enumerable.Any(source);
}
When working with the List class, I noticed that the boolean I was looking for was:
if(lstInts.Exists(x)){...}
Here x is a Predicate<T> with the same T as lstInts. I was confused as to why you can't just pass in the int in this case, and why x's type isn't T.
Example i was testing:
List<int> listInt = new List<int>();
int akey = Convert.ToInt32(myMatch.Value);
Predicate<int> pre = new Predicate<int>(akey); //akey is not the correct constructor param.
if(listInt.Exists(pre)){
listInt.Add(akey);
}
Is there a reason for having the additional step of a Predicate, or am I going about the logic incorrectly?
I also noticed that the Predicate constructor does not take an item of type T. I'm sort of confused as to how this is supposed to work.
You could also use the Contains() method:
List<int> listInt = new List<int>();
int akey = Convert.ToInt32(myMatch.Value);
if(!listInt.Contains(akey)){ // add only when the value is not already present
listInt.Add(akey);
}
Or alternately use Any()
if(listInt.Any(I => I == akey)) {
// Do your logic
}
Predicate<T> is a delegate (returning bool) that allows you to find an item matching some condition (that's why the item being checked is passed into it as an argument).
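To make that concrete: a predicate is normally written as a lambda that captures the value you are looking for, rather than constructed from the value itself. A small sketch follows; PredicateDemo and AddIfMissing are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PredicateDemo
{
    // Hypothetical helper: adds key only when no element satisfies the predicate.
    // Returns true if the key was added.
    public static bool AddIfMissing(List<int> list, int key)
    {
        // The lambda is converted to a Predicate<int>. It captures key and is
        // called once per element until it returns true.
        Predicate<int> matchesKey = item => item == key;
        if (list.Exists(matchesKey)) return false;
        list.Add(key);
        return true;
    }
}
```

For a plain equality test like this one, Contains does the same job with less ceremony; Exists earns its keep when the condition is something other than equality.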
This would be a good use for the HashSet<T> collection type, which does not allow duplicates (just silently ignores them).
Well, for your scenario, you should use the Contains method on the List class.
So what's the purpose of exists you might ask? Well, the Contains method uses the Equals method on the object to determine if the item you are checking is contained in the list or not. This only works if the class has overridden the Equals method for equality checking. If it hasn't, well then two separate instances of something that you consider to be equal will not be considered equal.
In addition to that, perhaps you want to use different logic than the Equals method provides. Now, the only way to determine if something is in the list is to either iterate it on your own, or write your own EqualityComparer to check the equality of an instance.
So, what the list class does is expose some methods like Exists so that you can provide your own logic in an easy way, while doing the boilerplate iteration for you.
Example
Consider you have a list of Dog types. Now, the Dog class has not overridden the Equals method, so there is no built-in way to check whether one dog is equal to another, but the objects carry some information about the dog, like its name and its owner. So consider the following:
List<Dog> dogs = new List<Dog> {
new Dog { Name = "Fido", Owner = "Julie" },
new Dog { Name = "Bruno", Owner = "Julie" },
new Dog { Name = "Fido", Owner = "George" }
};
Dog fido = new Dog { Name = "Fido", Owner = "Julie" };
dogs.Contains(fido)
Returns false (since Equals method has not been overridden)
dogs.Exists(x => fido.Name == x.Name && fido.Owner == x.Owner)
Returns true since you are checking equality on the properties which, being strings, have equality overridden.
If you were to go look at the source code for the list class, you would likely see something like this.
public bool Exists(Predicate<Dog> predicate) {
foreach (Dog item in Items) {
if (predicate(item))
return true;
}
return false;
}
Now, if you fill in the predicate I had above, the method would look like this
public bool Exists(Dog other) {
foreach (Dog item in Items) {
if (item.Name == other.Name && item.Owner == other.Owner)
return true;
}
return false;
}
If I want to perform actions such as .Where(...) or .Max(...), I need to make sure the list is not null and has a count greater than zero. Besides doing something such as the following every time I want to use the list:
if(mylist != null && mylist.Count > 0)
{...}
is there something more inline or lambda like technique that I can use? Or another more compressed technique?
public static class LinqExtensions
{
public static bool IsNullOrEmpty<T>(this IEnumerable<T> items)
{
return items == null || !items.Any();
}
}
You can then do something like
if (!myList.IsNullOrEmpty())
....
My general preference is to have empty list instances instead of null list variables. However, not everyone can cajole their co-workers into this arrangement. You can protect yourself from null list variables using this extension method.
public static IEnumerable<T> EmptyIfNull<T>(this IEnumerable<T> source)
{
return source ?? Enumerable.Empty<T>();
}
Called by:
var result = myList.EmptyIfNull().Where(c => c.Name == "Bob");
Most linq methods work on empty collections. Two methods that don't are Min and Max. Generally, I call these methods against an IGrouping. Most IGrouping implementations have at least one element (for example, IGroupings generated by GroupBy or ToLookup). For other cases, you can use Enumerable.DefaultIfEmpty.
int result = myList.EmptyIfNull().Select(c => c.FavoriteNumber).DefaultIfEmpty().Max();
Don't let the list be null
Ensure the object is always in a valid state. By ensuring the list is never null, you never have to check that the list is null.
public class MyClass
{
private readonly IEnumerable<int> ints;
public MyClass(IEnumerable<int> ints)
{
this.ints = ints;
}
public IEnumerable<int> IntsGreaterThan5()
{
return this.ints.Where(x => x > 5);
}
}
Even if this list were empty, you'd still get a valid IEnumerable<int> back.
Max and Min overloads with Nullable types
That still doesn't solve the "Max" and "Min" problems though. There's an overload of Max and Min that take selectors. Those selector overloads can return nullable ints, so your max method becomes this:
this.ints.Max(x => new int?(x));
Therefore, you run Max and check to see if you've gotten a null value or an integer back. voila!
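As a small sketch of that idea (NullableMaxDemo and MaxOrNull are hypothetical names), projecting each int to int? selects the nullable overload of Max, which returns null for an empty sequence instead of throwing:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullableMaxDemo
{
    // Hypothetical helper: returns null instead of throwing for an empty sequence.
    public static int? MaxOrNull(IEnumerable<int> values)
    {
        // The selector returns int?, so the Max overload taking
        // Func<int, int?> is chosen; it yields null when there are no elements.
        return values.Max(x => (int?)x);
    }
}
```

The caller can then test the result with HasValue, or supply a fallback with the ?? operator.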
Other Options
Custom Extension Methods
You could also write your own extension methods.
public static class MinMaxHelper
{
public static int? MaxOrDefault(this IEnumerable<int> ints)
{
if(!ints.Any())
{
return null;
}
return ints.Max();
}
public static int MaxOrDefault(this IEnumerable<int> ints, int defaultValue)
{
if(!ints.Any())
{
return defaultValue;
}
return ints.Max();
}
}
Overriding Linq Extension Methods
And finally, remember that the built-in LINQ extension methods can be shadowed by your own extension methods with matching signatures, provided yours are found first (for example, because they are declared in the namespace of the calling code). Therefore, you could write an extension method to replace .Where(...) and .Max(...) that returns null (or a default value) instead of throwing an ArgumentNullException if the enumerable is null.
Use empty collections instead of null collections. Where will work just fine against an empty collection, so you don't need to ensure that Count > 0 before calling it. You can also call Max on an empty collection if you do a bit of gymnastics first.
For IEnumerable<T> use Enumerable.Empty<T>()
For T[] use new T[0]
For List<T> use new List<T>()
You could try myList.Any() instead of .Count, but you'd still need to check for null.
If there is a risk of your list being null you will always have to check that before calling any of its methods, but you could use the Any() method rather than Count. It returns true as soon as it finds one item, regardless of how many more there are. This saves iterating over the entire sequence, which is what Count() does for sequences that are not collections:
if(mylist != null && mylist.Any())
{...}
You can use ?? operator which converts null to the value you supply on the right side:
public ProcessList(IEnumerable<int> ints)
{
this.ints = ints ?? new List<int>();
}
By the way: It is not a problem to process an empty list using LINQ.
You don't need to check Count to call Where. Max needs a non-empty list for value types but that can be overcome with an inline cast, eg
int? max = new List<int>().Max(i => (int?)i); // max = null
I want to remove duplicates from list, without changing order of unique elements in the list.
Jon Skeet & others have suggested to use the following:
list = list.Distinct().ToList();
Reference:
How to remove duplicates from a List<T>?
Remove duplicates from a List<T> in C#
Is it guaranteed that the order of unique elements would be same as before? If yes, please give a reference that confirms this as I couldn't find anything on it in documentation.
It's not guaranteed, but it's the most obvious implementation. It would be hard to implement in a streaming manner (i.e. such that it returned results as soon as it could, having read as little as it could) without returning them in order.
You might want to read my blog post on the Edulinq implementation of Distinct().
Note that even if this were guaranteed for LINQ to Objects (which personally I think it should be) that wouldn't mean anything for other LINQ providers such as LINQ to SQL.
The level of guarantees provided within LINQ to Objects is a little inconsistent sometimes, IMO. Some optimizations are documented, others not. Heck, some of the documentation is flat out wrong.
In the .NET Framework 3.5, disassembling the CIL of the Linq-to-Objects implementation of Distinct() shows that the order of elements is preserved - however this is not documented behavior.
I did a little investigation with Reflector. After disassembling System.Core.dll, Version=3.5.0.0 you can see that Distinct() is an extension method, which looks like this:
public static class Enumerable
{
public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source)
{
if (source == null)
throw new ArgumentNullException("source");
return DistinctIterator<TSource>(source, null);
}
}
So, the interesting part here is DistinctIterator, which implements IEnumerable and IEnumerator. Here is a simplified (goto and labels removed) implementation of this IEnumerator:
private sealed class DistinctIterator<TSource> : IEnumerable<TSource>, IEnumerable, IEnumerator<TSource>, IEnumerator, IDisposable
{
private bool _enumeratingStarted;
private IEnumerator<TSource> _sourceListEnumerator;
public IEnumerable<TSource> _source;
private HashSet<TSource> _hashSet;
private TSource _current;
private bool MoveNext()
{
if (!_enumeratingStarted)
{
_sourceListEnumerator = _source.GetEnumerator();
_hashSet = new HashSet<TSource>();
_enumeratingStarted = true;
}
while(_sourceListEnumerator.MoveNext())
{
TSource element = _sourceListEnumerator.Current;
if (!_hashSet.Add(element))
continue;
_current = element;
return true;
}
return false;
}
void IEnumerator.Reset()
{
throw new NotSupportedException();
}
TSource IEnumerator<TSource>.Current
{
get { return _current; }
}
object IEnumerator.Current
{
get { return _current; }
}
}
As you can see, enumeration goes in the order provided by the source enumerable (the list on which we are calling Distinct). The HashSet is used only to determine whether we have already returned such an element. If not, we return it; otherwise we continue enumerating the source.
So, with this implementation, Distinct() returns elements in exactly the same order as they are provided by the collection to which Distinct was applied.
According to the documentation the sequence is unordered.
Yes, Enumerable.Distinct preserves order. Assuming the method is lazy ("yields distinct values as soon as they are seen"), it follows automatically. Think about it.
The .NET Reference source confirms. It returns a subsequence, the first element in each equivalence class.
foreach (TSource element in source)
if (set.Add(element)) yield return element;
The .NET Core implementation is similar.
Frustratingly, the documentation for Enumerable.Distinct is confused on this point:
The result sequence is unordered.
I can only imagine they mean "the result sequence is not sorted." You could implement Distinct by presorting then comparing each element to the previous, but this would not be lazy as defined above.
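For contrast, here is a rough sketch of that presorting approach. It is an illustration only, not how any framework version implements Distinct, and SortedDistinctDemo and DistinctBySorting are hypothetical names. It has to read the entire input before returning anything, and the result comes back in sorted order rather than source order.

```csharp
using System;
using System.Collections.Generic;

public static class SortedDistinctDemo
{
    // Hypothetical eager alternative: sorts a copy, then skips adjacent repeats.
    public static List<T> DistinctBySorting<T>(IEnumerable<T> source) where T : IComparable<T>
    {
        List<T> sorted = new List<T>(source); // the whole input is read up front
        sorted.Sort();

        Comparer<T> comparer = Comparer<T>.Default;
        List<T> result = new List<T>();
        foreach (T item in sorted)
        {
            // After sorting, duplicates are adjacent, so compare with the last kept item.
            if (result.Count == 0 || comparer.Compare(item, result[result.Count - 1]) != 0)
                result.Add(item);
        }
        return result; // sorted order, not the original order
    }
}
```

Feeding it 3, 1, 3, 2, 1 gives 1, 2, 3, whereas the lazy hash-set approach would give 3, 1, 2.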
A bit late to the party, but no one really posted the best complete code to accomplish this IMO, so let me offer this (which is essentially identical to what .NET Framework does with Distinct())*:
public static IEnumerable<T> DistinctOrdered<T>(this IEnumerable<T> items)
{
HashSet<T> returnedItems = new HashSet<T>();
foreach (var item in items)
{
if (returnedItems.Add(item))
yield return item;
}
}
This guarantees the original order without reliance on undocumented or assumed behavior. I also believe this is more efficient than using multiple LINQ methods though I'm open to being corrected here.
(*) The .NET Framework source uses an internal Set class, which appears to be substantively identical to HashSet.
By default the Distinct LINQ operator uses the Equals method, but you can supply your own IEqualityComparer<T> object to specify when two objects are equal, with custom logic implemented in its GetHashCode and Equals methods.
Remember that:
GetHashCode should not do CPU-heavy comparisons (e.g. use only some obvious basic checks). It is used first to decide whether two objects are surely different (different hash codes are returned) or potentially the same (same hash code). In the latter case, when two objects have the same hash code, the framework goes on to call the Equals method for the final decision about the equality of the given objects.
Once you have MyType and MyTypeEqualityComparer classes, the following code is not documented to keep the sequence in its original order:
var cmp = new MyTypeEqualityComparer();
var lst = new List<MyType>();
// add some to lst
var q = lst.Distinct(cmp);
In the following sci library I implemented an extension method, DistinctKeepOrder, to ensure a Vector3D set maintains its order.
The relevant code follows:
/// <summary>
/// support class for DistinctKeepOrder extension
/// </summary>
public class Vector3DWithOrder
{
public int Order { get; private set; }
public Vector3D Vector { get; private set; }
public Vector3DWithOrder(Vector3D v, int order)
{
Vector = v;
Order = order;
}
}
public class Vector3DWithOrderEqualityComparer : IEqualityComparer<Vector3DWithOrder>
{
Vector3DEqualityComparer cmp;
public Vector3DWithOrderEqualityComparer(Vector3DEqualityComparer _cmp)
{
cmp = _cmp;
}
public bool Equals(Vector3DWithOrder x, Vector3DWithOrder y)
{
return cmp.Equals(x.Vector, y.Vector);
}
public int GetHashCode(Vector3DWithOrder obj)
{
return cmp.GetHashCode(obj.Vector);
}
}
In short, Vector3DWithOrder encapsulates the type and an order integer, while Vector3DWithOrderEqualityComparer encapsulates the original type's comparer.
And this is the helper method that ensures the order is maintained:
/// <summary>
/// retrieve distinct of given vector set ensuring to maintain given order
/// </summary>
public static IEnumerable<Vector3D> DistinctKeepOrder(this IEnumerable<Vector3D> vectors, Vector3DEqualityComparer cmp)
{
var ocmp = new Vector3DWithOrderEqualityComparer(cmp);
return vectors
.Select((w, i) => new Vector3DWithOrder(w, i))
.Distinct(ocmp)
.OrderBy(w => w.Order)
.Select(w => w.Vector);
}
Note: further research could find a more general (using interfaces) and more optimized (without encapsulating the object) way.
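One possible direction for such a generalization, sketched here under the assumption that any IEqualityComparer<T> will do: keep a HashSet<T> built with the comparer and yield each element the first time it is added. No wrapper type or re-sorting is needed. The names OrderedDistinctExtensions and DistinctInSourceOrder are hypothetical and not part of the library mentioned above.

```csharp
using System;
using System.Collections.Generic;

public static class OrderedDistinctExtensions
{
    // Hypothetical generic variant: keeps the first occurrence of each element,
    // in source order, using the supplied comparer (null means the default one).
    public static IEnumerable<T> DistinctInSourceOrder<T>(
        this IEnumerable<T> items, IEqualityComparer<T> comparer)
    {
        if (items == null) throw new ArgumentNullException("items");
        return Iterate(items, comparer);
    }

    private static IEnumerable<T> Iterate<T>(IEnumerable<T> items, IEqualityComparer<T> comparer)
    {
        // The set only records what has already been returned.
        HashSet<T> seen = new HashSet<T>(comparer);
        foreach (T item in items)
        {
            if (seen.Add(item)) yield return item;
        }
    }
}
```

For the vectors above this would be called as vectors.DistinctInSourceOrder(cmp), assuming the vector comparer implements IEqualityComparer<Vector3D>.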
This highly depends on your LINQ provider. With LINQ to Objects you can rely on the internal source code for Distinct, which suggests the original order is preserved.
However, for other providers that resolve to some kind of SQL, for example, that isn't necessarily the case, as an ORDER BY statement usually comes after any aggregation (such as Distinct). So if your code is this:
myArray.OrderBy(x => x.anothercol).GroupBy(x => x.mycol);
this is translated to something similar to the following in SQL:
SELECT * FROM mytable GROUP BY mycol ORDER BY anothercol;
This obviously first groups your data and sorts it afterwards. Now you're stuck with the DBMS's own logic of how to execute that. On some DBMS this isn't even allowed. Imagine the following data:
mycol anothercol
1 2
1 1
1 3
2 1
2 3
When executing myArray.OrderBy(x => x.anothercol).GroupBy(x => x.mycol) we assume the following result:
mycol anothercol
1 1
2 1
But the DBMS may aggregate the anothercol column so that the value of the first row is always used, resulting in the following data:
mycol anothercol
1 2
2 1
which after ordering will result in this:
mycol anothercol
2 1
1 2
This is similar to the following:
SELECT mycol, First(anothercol) from mytable group by mycol order by anothercol;
which is the complete reverse of the order you expected.
You see, the execution plan may vary depending on what the underlying provider is. This is why there's no guarantee about that in the docs.
I have two lists A and B, at the beginning of my program, they are both filled with information from a database (List A = List B). My program runs, List A is used and modified, List B is left alone. After a while I reload List B with new information from the database, and then do a check with that against List A.
foreach (CPlayer player in ListA)
if (ListB.Contains(player))
-----
Firstly, the object player is created from a class, its main identifier is player.Name.
If the Name is the same, but the other variables are different, would the .Contains still return true?
class CPlayer {
private string _Name;
public string Name { get { return _Name; } }
public CPlayer(string name) { _Name = name; }
}
At the ---- I need to use the item from ListB that causes the .Contains to return true, how do I do that?
The default behaviour of List.Contains is that it uses the default equality comparer. If your items are reference types this means that it will use an identity comparison unless your class provides another implementation via Equals.
If you are using .NET 3.5 then you can change your second line to this which will do what you want:
if (ListB.Any(x => x.Name == player.Name))
For .NET 2.0 you could implement Equals and GetHashCode for your class, but this might give undesirable behaviour in other situations where you don't want two player objects to compare equal if they have the same name but differ in other fields.
An alternative way is to adapt Jon Skeet's answer for .NET 2.0. Create a Dictionary<string, object> and fill it with the names of all players in listB. Then to test if a player with a certain name is in listB you can use dict.ContainsKey(name).
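A minimal sketch of that dictionary approach, written without LINQ so it would also fit .NET 2.0. Storing the player itself as the dictionary value (instead of object) also answers the second part of the question, since TryGetValue hands back the matching item from listB. The Player class is a stand-in for CPlayer, its Score field is invented purely so that two same-named players can differ, and PlayerLookup is a hypothetical name. Names are assumed to be non-null.

```csharp
using System.Collections.Generic;

// Stand-in for the question's CPlayer; Score is an invented field for illustration.
public class Player
{
    public string Name;
    public int Score;
    public Player(string name, int score) { Name = name; Score = score; }
}

public static class PlayerLookup
{
    // Index a list by player name. If a name occurs twice, the last one wins.
    public static Dictionary<string, Player> IndexByName(IEnumerable<Player> players)
    {
        Dictionary<string, Player> byName = new Dictionary<string, Player>();
        foreach (Player p in players)
            byName[p.Name] = p;
        return byName;
    }

    // Pairs each player in listA (Key) with the same-named player in listB (Value).
    public static List<KeyValuePair<Player, Player>> Match(
        IEnumerable<Player> listA, IEnumerable<Player> listB)
    {
        Dictionary<string, Player> inB = IndexByName(listB);
        List<KeyValuePair<Player, Player>> pairs = new List<KeyValuePair<Player, Player>>();
        foreach (Player a in listA)
        {
            Player b;
            if (inB.TryGetValue(a.Name, out b))
                pairs.Add(new KeyValuePair<Player, Player>(a, b)); // b is the item from listB
        }
        return pairs;
    }
}
```

Building the index costs one pass over listB, after which each lookup is a hash lookup rather than a scan of the whole list.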
An alternative to Mark's suggestion is to build a set of names and use that:
HashSet<string> namesB = new HashSet<string>(ListB.Select(x => x.Name));
foreach (CPlayer player in ListA)
{
if (namesB.Contains(player.Name))
{
...
}
}
Assuming you are using the System.Collections.Generic.List class, if the CPlayer class does not implement IEquatable<T>, Contains will use the Equals(object) method of the CPlayer class to check whether the List has a member that equals the argument. Assuming that implementation is OK for you, you could do something like
CPlayer listBItem = ListB.First(p => p.Equals(player));
to get the instance from ListB.
It sounds like this is what you need to accomplish:
For each player in list A, find each player in list B with the same name and bring both players into the same scope.
Here is an approach which joins the two lists in a query:
var playerPairs =
from playerA in ListA
join playerB in ListB on playerA.Name equals playerB.Name
select new { playerA, playerB };
foreach(var playerPair in playerPairs)
{
Console.Write(playerPair.playerA.Name);
Console.Write(" -> ");
Console.WriteLine(playerPair.playerB.Name);
}
If you want the .Contains method to match only on CPlayer.Name, then in the CPlayer class implement these methods:
public override bool Equals(object obj)
{
if (!(obj is CPlayer))
return false;
return Name == (obj as CPlayer).Name;
}
public override int GetHashCode()
{
return Name.GetHashCode();
}
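If you want the Name comparison to be case-insensitive, use this Equals method instead (GetHashCode must then ignore case as well, for example via StringComparer.OrdinalIgnoreCase.GetHashCode(Name)):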
public override bool Equals(object obj)
{
if (!(obj is CPlayer))
return false;
return Name.Equals((obj as CPlayer).Name, StringComparison.OrdinalIgnoreCase);
}
If you do this, your .Contains call will work just as you want it.
Secondly, if you want to select this item in the list, do this:
var playerB = ListB[ListB.IndexOf(player)];
It uses the same .Equals method.
UPD:
This is probably a subjective statement, but you could also squeeze some performance out of it, if your .Equals method compared the Int hashes before doing the string comparison..
Looking at the .NET sources (Reflector FTW) I can see that seemingly only the Hashtable class uses GetHashCode to improve its performance, instead of using .Equals to compare objects every single time. In the case of a small class like this, the equality comparer is simple, a single string comparison. If you were comparing all properties though, then comparing two integers would be much faster (esp. if they were cached :) )
List.Contains and List.IndexOf don't use the hash code; they use the .Equals method, hence I proposed checking the hash code inside it. It probably won't be anything noticeable, but when you're itching to get every single ms of execution (not always a good thing, but hey! :P ) this might help someone. Just saying... :)