C#: A good and efficient implementation of IEnumerable<T>.HasDuplicates

C#: A good and efficient implementation of IEnumerable<T>.HasDuplicates - c#

Does anyone have a good and efficient extension method for finding if a sequence of items has any duplicates?
Guess I could put return subjects.Distinct().Count() == subjects.Count() into an extension method, but kind of feels that there should be a better way. That method would have to count elements twice and sort out all the distict elements. A better implementation should return true on the first duplicate it finds. Any good suggestions?
I imagine the outline could be something like this:
public static bool HasDuplicates<T>(this IEnumerable<T> subjects)
{
return subjects.HasDuplicates(EqualityComparer<T>.Default);
}
public static bool HasDuplicates<T>(this IEnumerable<T> subjects, IEqualityComparer<T> comparer)
{
...
}
But not quite sure how a smart implementation of it would be...

public static bool HasDuplicates<T>(this IEnumerable<T> subjects)
{
return HasDuplicates(subjects, EqualityComparer<T>.Default);
}
public static bool HasDuplicates<T>(this IEnumerable<T> subjects, IEqualityComparer<T> comparer)
{
HashSet<T> set = new HashSet<T>(comparer);
foreach (T item in subjects)
{
if (!set.Add(item))
return true;
}
return false;
}

This is in production code. Works great:
public static bool HasDuplicates<T>(this IEnumerable<T> sequence, IEqualityComparer<T> comparer = null) {
var set = new HashSet<T>(comparer);
return !sequence.All(item => set.Add(item));
}

I think the simplest extension method is the following.
public static bool HasDuplicates<T>(this IEnumerable<T> enumerable) {
var hs = new HashSet<T>();
foreach ( var cur in enumerable ) {
if ( !hs.Add(cur) ) {
return false;
}
}
}

Related

C#: generic function that return element from collection based on condition without using delegate

So i have this function that return element from collection based on condition:
public static IEnumerable<T> Search<T>(IEnumerable<T> source, Func<T, bool> filter)
{
return source.Where(filter);
}
Is it possible to implement it without using delegate\func ?
Supposed i have Person class with several properties:
string name;
int age;
And List of Person objects.
And i want to create function that get condition for example all the Person object that there age is bigger then some number.

I guess interviewer wants something (useless) like
public interface IFilterable
{
bool Match(string match);
}
public static IEnumerable<T> Search<T>(this IEnumerable<T> source, string match) where T: IFilterable
{
return source.Where(x=>x.Match(match));
}
or without any linq-ness
public interface IFilterable
{
bool Match(string match);
}
public static IEnumerable<T> Search<T>(this IEnumerable<T> source, string match) where T: IFilterable
{
foreach(var e in source)
{
if(e.Match(match)) yield return e;
}
}

Your example without the filter can be done in many many ways but here another with the yeild
public static IEnumerable<Person> GetPersonByMinimumAge(IEnumerable<Person> persons, int age)
{
foreach(var person in persons)
{
if(person.age >= age)
{
yield return person;
}
}
}

How to build a sequence using a fluent interface?

I'm trying to using a fluent interface to build a collection, similar to this (simplified) example:
var a = StartWith(1).Add(2).Add(3).Add(4).ToArray();
/* a = int[] {1,2,3,4}; */
The best solution I can come up with add Add() as:
IEnumerable<T> Add<T>(this IEnumerable<T> coll, T item)
{
foreach(var t in coll) yield return t;
yield return item;
}
Which seems to add a lot of overhead that going to be repeated in each call.
IS there a better way?
UPDATE:
in my rush, I over-simplified the example, and left out an important requirement. The last item in the existing coll influences the next item. So, a slightly less simplified example:
var a = StartWith(1).Times10Plus(2).Times10Plus(3).Times10Plus(4).ToArray();
/* a = int[] {1,12,123,1234}; */
public static IEnumerable<T> StartWith<T>(T x)
{
yield return x;
}
static public IEnumerable<int> Times10Plus(this IEnumerable<int> coll, int item)
{
int last = 0;
foreach (var t in coll)
{
last = t;
yield return t;
}
yield return last * 10 + item;
}

A bit late to this party, but here are a couple ideas.
First, consider solving the more general problem:
public static IEnumerable<A> AggregateSequence<S, A>(
this IEnumerable<S> items,
A initial,
Func<A, R, A> f)
{
A accumulator = initial;
yield return accumulator;
foreach(S item in items)
{
accumulator = f(accumulator, item);
yield return accumulator;
}
}
And now your program is just new[]{2, 3, 4}.AggregateSequence(1,
(a, s) => a * 10 + s).ToArray()
However that lacks the "fluency" you want and it assumes that the same operation is applied to every element in the sequence.
You are right to note that deeply nested iterator blocks are problematic; they have quadratic performance in time and linear consumption of stack, both of which are bad.
Here's an entertaining way to implement your solution efficiently.
The problem is that you need both cheap access to the "right" end of the sequence, in order to do an operation on the most recently added element, but you also need cheap access to the left end of the sequence to enumerate it. Normal queues and stacks only have cheap access to one end.
Therefore: start by implementing an efficient immutable double-ended queue. This is a fascinating datatype; I have an implementation here using finger trees:
https://blogs.msdn.microsoft.com/ericlippert/2008/01/22/immutability-in-c-part-10-a-double-ended-queue/
https://blogs.msdn.microsoft.com/ericlippert/2008/02/12/immutability-in-c-part-eleven-a-working-double-ended-queue/
Once you have that, your operations are one-liners:
static IDeque<T> StartWith<T>(T t) => Deque<T>.Empty.EnqueueRight(t);
static IDeque<T> Op<T>(this IDeque<T> d, Func<T, T> f) => d.EnqueueRight(f(d.PeekRight()));
static IDeque<int> Times10Plus(this IDeque<int> d, int j) => d.Op(i => i * 10 + j);
Modify IDeque<T> and Deque<T> to implement IEnumerable<T> in the obvious way and you then get ToArray for free. Or do it as an extension method:
static IEnumerable<T> EnumerateFromLeft(this IDeque<T> d)
{
var c = d;
while (!c.IsEmpty)
{
yield return c.PeekLeft();
c = c.DequeueLeft();
}
}

You could do the following:
public static class MySequenceExtensions
{
public static IReadOnlyList<int> Times10Plus(
this IReadOnlyList<int> sequence,
int value) => Add(sequence,
value,
v => sequence[sequence.Count - 1] * 10 + v);
public static IReadOnlyList<T> Starts<T>(this T first)
=> new MySequence<T>(first);
public static IReadOnlyList<T> Add<T>(
this IReadOnlyList<T> sequence,
T item,
Func<T, T> func)
{
var mySequence = sequence as MySequence<T> ??
new MySequence<T>(sequence);
return mySequence.AddItem(item, func);
}
private class MySequence<T>: IReadOnlyList<T>
{
private readonly List<T> innerList;
public MySequence(T item)
{
innerList = new List<T>();
innerList.Add(item);
}
public MySequence(IEnumerable<T> items)
{
innerList = new List<T>(items);
}
public T this[int index] => innerList[index];
public int Count => innerList.Count;
public MySequence<T> AddItem(T item, Func<T, T> func)
{
Debug.Assert(innerList.Count > 0);
innerList.Add(func(item));
return this;
}
public IEnumerator<T> GetEnumerator() => innerList.GetEnumerator();
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
}
Note that I'm using IReadOnlyList to make it possible to index into the list in a performant way and be able to get the last element if needed. If you need a lazy enumeration then I think you are stuck with your original idea.
And sure enough, the following:
var a = 1.Starts().Times10Plus(2).Times10Plus(3).Times10Plus(4).ToArray();
Produces the expected result ({1, 12, 123, 1234}) with, what I think is, reasonable performance.

You can do like this:
public interface ISequence
{
ISequenceOp StartWith(int i);
}
public interface ISequenceOp
{
ISequenceOp Times10Plus(int i);
int[] ToArray();
}
public class Sequence : ISequence
{
public ISequenceOp StartWith(int i)
{
return new SequenceOp(i);
}
}
public class SequenceOp : ISequenceOp
{
public List<int> Sequence { get; set; }
public SequenceOp(int startValue)
{
Sequence = new List<int> { startValue };
}
public ISequenceOp Times10Plus(int i)
{
Sequence.Add(Sequence.Last() * 10 + i);
return this;
}
public int[] ToArray()
{
return Sequence.ToArray();
}
}
An then just:
var x = new Sequence();
var a = x.StartWith(1).Times10Plus(2).Times10Plus(3).Times10Plus(4).ToArray();

Lazily partition sequence with LINQ

I have the following extension method to find an element within a sequence, and then return two IEnumerable<T>s: one containing all the elements before that element, and one containing the element and everything that follows. I would prefer if the method were lazy, but I haven't figured out a way to do that. Can anyone come up with a solution?
public static PartitionTuple<T> Partition<T>(this IEnumerable<T> sequence, Func<T, bool> partition)
{
var a = sequence.ToArray();
return new PartitionTuple<T>
{
Before = a.TakeWhile(v => !partition(v)),
After = a.SkipWhile(v => !partition(v))
};
}
Doing sequence.ToArray() immediately defeats the laziness requirement. However, without that line, an expensive-to-iterate sequence may be iterated over twice. And, depending on what the calling code does, many more times.

You can use the Lazy object to ensure that the source sequence isn't converted to an array until one of the two partitions is iterated:
public static PartitionTuple<T> Partition<T>(
this IEnumerable<T> sequence, Func<T, bool> partition)
{
var lazy = new Lazy<IEnumerable<T>>(() => sequence.ToArray());
return new PartitionTuple<T>
{
Before = lazy.MapLazySequence(s => s.TakeWhile(v => !partition(v))),
After = lazy.MapLazySequence(s => s.SkipWhile(v => !partition(v)))
};
}
We'll use this method to defer evaluating the lazy until the sequence itself is iterated:
public static IEnumerable<TResult> MapLazySequence<TSource, TResult>(
this Lazy<IEnumerable<TSource>> lazy,
Func<IEnumerable<TSource>, IEnumerable<TResult>> filter)
{
foreach (var item in filter(lazy.Value))
yield return item;
}

This is an interesting problem and to get it right, you have to know what "right" is. For the semantics of the operation, I think that this definition makes sense:
The source sequence is only enumerated once even though the resulting sequences are enumerated several times.
The source sequence isn't enumerated until one of the results is enumerated.
Each of the results should be possible to enumerate independently.
If the source sequence changes, it is undefined what will happen.
I'm not entirely sure I got the handling of the matching object right, but I hope you get the idea. I'm deferring a lot of the work to the PartitionTuple<T> class to be able to be lazy.
public class PartitionTuple<T>
{
IEnumerable<T> source;
IList<T> before, after;
Func<T, bool> partition;
public PartitionTuple(IEnumerable<T> source, Func<T, bool> partition)
{
this.source = source;
this.partition = partition;
}
private void EnsureMaterialized()
{
if(before == null)
{
before = new List<T>();
after = new List<T>();
using(var enumerator = source.GetEnumerator())
{
while(enumerator.MoveNext() && !partition(enumerator.Current))
{
before.Add(enumerator.Current);
}
while(!partition(enumerator.Current) && enumerator.MoveNext());
while(enumerator.MoveNext())
{
after.Add(enumerator.Current);
}
}
}
}
public IEnumerable<T> Before
{
get
{
EnsureMaterialized();
return before;
}
}
public IEnumerable<T> After
{
get
{
EnsureMaterialized();
return after;
}
}
}
public static class Extensions
{
public static PartitionTuple<T> Partition<T>(this IEnumerable<T> sequence, Func<T, bool> partition)
{
return new PartitionTuple<T>(sequence, partition);
}
}

Here's a generic solution that will memoize any IEnumerable<T> to ensure it's only iterated once, without forcing the whole thing to iterate:
public class MemoizedEnumerable<T> : IEnumerable<T>, IDisposable
{
private readonly IEnumerator<T> _childEnumerator;
private readonly List<T> _itemCache = new List<T>();
public MemoizedEnumerable(IEnumerable<T> enumerableToMemoize)
{
_childEnumerator = enumerableToMemoize.GetEnumerator();
}
public IEnumerator<T> GetEnumerator()
{
return _itemCache.Concat(EnumerateOnce()).GetEnumerator();
}
public void Dispose()
{
_childEnumerator.Dispose();
}
private IEnumerable<T> EnumerateOnce()
{
while (_childEnumerator.MoveNext())
{
_itemCache.Add(_childEnumerator.Current);
yield return _childEnumerator.Current;
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
public static class EnumerableExtensions
{
public static IEnumerable<T> Memoize<T>(this IEnumerable<T> enumerable)
{
return new MemoizedEnumerable<T>(enumerable);
}
}
To use it for your partitioning problem, do this:
var memoized = sequence.Memoize();
return new PartitionTuple<T>
{
Before = memoized.TakeWhile(v => !partition(v)),
After = memoized.SkipWhile(v => !partition(v))
};
This will only iterate sequence a maximum of one time.

Generally, you just return some object of your custom class, which implements IEnumerable<T> but also provides the results on enumeration demand only.
You can also implement IQueryable<T> (inherits IEnumerable) instead of IEnumerable<T>, but it's rather needed for building reach functionality with queries like the one, which linq for sql provides: database query being executed only on the final enumeration request.

How to return distinct elements from my C# collection?

I have a MongoDB database where I store all pictures and when I retrieve them I have stored some doubles, which ain't so good, but anyway I want to show only distinct elements.
#foreach (Foto f in fotos.Distinct(new IEqualityComparer<Foto> { )
But the Foto class has one property called smallurl and I want to show only distinct elements by this property. So how to write a custom IEqualityComparer.

var listOfUrls = fotos.Select(f => f.smallurl).Distinct();
EDIT to specifically answer your question
Practically copied from the MSDN documentation that you can find with a search for c# IEqualityComparer http://msdn.microsoft.com/en-us/library/ms132151.aspx
class FotoEqualityComparer : IEqualityComparer<Foto>
{
public bool Equals(Foto f1, Foto f2)
{
return f1.smallurl == f2.smallurl;
}
public int GetHashCode(Foto f)
{
return f.smallurl.GetHashCode();
}
}
#foreach (Foto f in fotos.Distinct(new FotoEqualityComparer() )

It's actually pretty easy. Simply provide a distinct-ness selector for your method like so:
public static IEnumerable<TSource> DistinctBy<TSource, TResult>(this IEnumerable<TSource> enumerable, Func<TSource, TResult> keySelector)
{
Dictionary<TResult, TSource> seenItems = new Dictionary<TResult, TSource>();
foreach (var item in enumerable)
{
var key = keySelector(item);
if (!seenItems.ContainsKey(key))
{
seenItems.Add(key, item);
yield return item;
}
}
}
Alternatively, you can create another one to make a generic implementation fo the IEquality comparer:
public static IEnumerable<TSource> DistinctBy<TSource>(this IEnumerable<TSource> enumerable, Func<TSource, TSource, bool> equalitySelector, Func<TSource, int> hashCodeSelector)
{
return enumerable.Distinct(new GenericEqualitySelector<TSource>(equalitySelector, hashCodeSelector));
}
class GenericEqualitySelector<TSource> : IEqualityComparer<TSource>
{
public Func<TSource, TSource, bool> _equalityComparer = null;
public Func<TSource, int> _hashSelector = null;
public GenericEqualitySelector(Func<TSource, TSource, bool> selector, Func<TSource, int> hashSelector)
{
_equalityComparer = selector;
_hashSelector = hashSelector;
}
public bool Equals(TSource x, TSource y)
{
return _equalityComparer(x, y);
}
public int GetHashCode(TSource obj)
{
return _hashSelector(obj);
}
}

Create your own:
public class FotoEqualityComparer : IEqualityComparer<Foto>
{
public bool Equals(Foto x, Foto y)
{
return x.smallurl.Equals(y.smallurl);
}
public int GetHashCode(Foto foto)
{
return foto.smallurl.GetHashCode();
}
}
And use it like so:
fotos.Distinct(new FotoEqualityComparer());
EDIT:
There's no inline lambda overload of .Distinct() because when two objects compare equal they must have the same GetHashCode return value (or else the hash table used internally by Distinct will not function correctly).
But if you want it in one line, then you could also do grouping to achieve the same result:
fotos.GroupBy(f => f.smallurl).Select(g => g.First());

Modified from MSDN
public class MyEqualityComparer : IEqualityComparer<Foto>
{
public bool Equals(Foto x, Foto y)
{
//Check whether the compared objects reference the same data.
if (Object.ReferenceEquals(x, y)) return true;
//Check whether any of the compared objects is null.
if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
return false;
//Check whether the foto's properties are equal.
return x.smallurl == y.smallurl ;
}
// If Equals() returns true for a pair of objects
// then GetHashCode() must return the same value for these objects.
public int GetHashCode(Foto foto)
{
//Check whether the object is null
if (Object.ReferenceEquals(foto, null)) return 0;
//Get hash code for the foto.smallurl field if it is not null.
return foto.smallurl == null ? 0 : foto.smallurl.GetHashCode();
}
}

Much simpler code using GroupBy instead:
#foreach (Foto f in fotos.GroupBy(f => f.smallurl).Select(g => g.First()))

You should create your own EqulityComparer:
class FotoEqualityComparer : IEqualityComparer<Foto>
{
public bool Equals(Foto b1, Foto b2)
{
if (b1.smallurl == b2.smallurl)
return true;
else
return false;
}
public int GetHashCode(Foto bx)
{
int hCode = bx.smallurl ;
return hCode.GetHashCode();
}
}

Does Select() on a List lose track of the size of the collection?

In the following code, is the Select() method smart enough to keep the size of the list somewhere internally for the ToArray() method to be cheap?
List<Thing> bigList = someBigList;
var bigArray = bigList.Select(t => t.SomeField).ToArray();

That's easy to check, without looking at the implementation. Just create a class that implements IList<T>, and put a trace in the Count property:
class MyList<T> : IList<T>
{
private readonly IList<T> _list = new List<T>();
public IEnumerator<T> GetEnumerator()
{
return _list.GetEnumerator();
}
public void Add(T item)
{
_list.Add(item);
}
public void Clear()
{
_list.Clear();
}
public bool Contains(T item)
{
return _list.Contains(item);
}
public void CopyTo(T[] array, int arrayIndex)
{
_list.CopyTo(array, arrayIndex);
}
public bool Remove(T item)
{
return _list.Remove(item);
}
public int Count
{
get
{
Console.WriteLine ("Count accessed");
return _list.Count;
}
}
public bool IsReadOnly
{
get { return _list.IsReadOnly; }
}
public int IndexOf(T item)
{
return _list.IndexOf(item);
}
public void Insert(int index, T item)
{
_list.Insert(index, item);
}
public void RemoveAt(int index)
{
_list.RemoveAt(index);
}
public T this[int index]
{
get { return _list[index]; }
set { _list[index] = value; }
}
#region Implementation of IEnumerable
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
#endregion
}
If the Count property is accessed, this code should print "Count accessed":
var list = new MyList<int> { 1, 2, 3 };
var array = list.Select(x => x).ToArray();
But it doesn't print anything, so no, it doesn't keep track of the count. Of course there could be an optimization specific to List<T>, but it seems unlikely...

No, right now it does not (at least the .NET implementation). From the MS reference sources, Enumerable.ToArray is implemented as
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source) {
if (source == null) throw Error.ArgumentNull("source");
return new Buffer<TSource>(source).ToArray();
}
Buffer<TSource> creates a copy of the source sequence (in array form) on construction by iterating and resizing as necessary; it has a special "fast path" if source is an ICollection<TSource>, but the result of Enumerable.Select unsurprisingly does not implement that interface.
Be that as it may, apart from pure curiosity I don't think that this result means anything. For one, the implementation may change at any point in the future (even though a quick cost-benefit analysis won't find this likely). And in any case, you will suffer at most O(logN) reallocations. For small N the reallocations are not going to be noticeable. For large N, the amount of time spent on iterating over the collection is going to be O(N) and will therefore easily dominate.

When you apply Select operator to enumerable sequence, it creates one of following iterators:
WhereSelectArrayIterator
WhereSelectListIterator
WhereSelectEnumerableIterator
In case of List<T>, WhereSelectListIterator iterator is created. It uses list's iterator to iterate over the list and apply predicate and selector. This is a MoveNext method implementation:
while (this.enumerator.MoveNext())
{
TSource current = this.enumerator.Current;
if ((this.predicate == null) || this.predicate(current))
{
base.current = this.selector(current);
return true;
}
}
As you can see, it does not preserve information about number of items, which matched predicate, thus it does not know count of items in filtered sequence.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C#: A good and efficient implementation of IEnumerable<T>.HasDuplicates - c#

This is in production code. Works great: public static bool HasDuplicates<T>(this IEnumerable<T> sequence, IEqualityComparer<T> comparer = null) { var set = new HashSet<T>(comparer); return !sequence.All(item => set.Add(item)); }

I think the simplest extension method is the following. public static bool HasDuplicates<T>(this IEnumerable<T> enumerable) { var hs = new HashSet<T>(); foreach ( var cur in enumerable ) { if ( !hs.Add(cur) ) { return false; } } }

Related

C#: generic function that return element from collection based on condition without using delegate

How to build a sequence using a fluent interface?

Lazily partition sequence with LINQ

How to return distinct elements from my C# collection?

Does Select() on a List lose track of the size of the collection?

Categories

Resources