Intersection of two sets in the most optimized way - C#

Given two sets of values, I have to find whether there is any common element among them, i.e. whether their intersection is empty or not.
Which of the standard C# collections will suit this purpose best in terms of performance? I know that LINQ has an Intersect extension method to find the intersection of two lists/arrays, but my focus is on performance in terms of Big-O notation.
And what if I have to find the intersection of two sets as well?

Well, if you use LINQ's Intersect method it will build up a HashSet of the second sequence, and then check each element of the first sequence against it. So it's O(M+N)... and you can use foo.Intersect(bar).Any() to get an early-out.
Of course, if you store one (either) set in a HashSet<T> to start with, you can just iterate over the other one checking for containment on each step. You'd still need to build the set to start with though.
Fundamentally you've got an O(M+N) problem whatever you do - you're not going to get cheaper than that (there's always the possibility that you'll have to look at every element) and if your hash codes are reasonable, you should be able to achieve that complexity easily. Of course, some solutions may give better constant factors than others... but that's performance rather than complexity ;)
EDIT: As noted in the comments, there's also ISet<T>.Overlaps. If you've already got either set with a static type of ISet<T> or a concrete implementation, calling Overlaps makes it clearer what you're doing. If both of your sets are statically typed as ISet<T>, use larger.Overlaps(smaller) (where larger and smaller refer to the sizes of the sets), as I'd expect an implementation of Overlaps to iterate over the argument and check each element against the contents of the set you call it on.
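As a minimal, self-contained sketch of both options (the set contents here are made up):

using System;
using System.Collections.Generic;
using System.Linq;

class OverlapDemo
{
    static void Main()
    {
        var larger = new HashSet<int> { 1, 2, 3, 4, 5, 6, 7, 8 };
        var smaller = new[] { 42, 5 };

        // LINQ: Intersect buffers one sequence into a set and streams
        // the other; Any() stops at the first common element.
        bool viaLinq = larger.Intersect(smaller).Any();

        // ISet<T>.Overlaps: checks each element of the argument against
        // the set it is called on, returning true at the first hit.
        bool viaOverlaps = larger.Overlaps(smaller);

        Console.WriteLine($"{viaLinq} {viaOverlaps}"); // True True
    }
}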

As mentioned, applying Any() will give you some performance gain.
I tested it on a pretty big dataset and it gave a 25% improvement.
Also, applying larger.Intersect(smaller) rather than the opposite is very important; in my case, it gave a 35% improvement.
Ordering the list before applying Intersect gave another 7-8%.
Another thing to keep in mind: depending on the use case, you can avoid calling Intersect entirely.
For example, for an integer list, if the [min, max] ranges of the two lists don't overlap, you don't need to call Intersect, because the lists can never share an element.
The same goes for a string list, with the same idea applied to the first letter.
Again, depending on your case, try to find a cheap rule that proves intersection is impossible, so you can skip the call; a sketch follows below.
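A sketch of that guard for integer lists (the method name is made up; assumes using System.Linq):

static bool MightIntersect(List<int> a, List<int> b)
{
    // If either list is empty, or the [min, max] ranges don't overlap,
    // the lists cannot share an element, so Intersect can be skipped.
    // Min()/Max() still scan each list once, but allocate nothing.
    if (a.Count == 0 || b.Count == 0) return false;
    return a.Min() <= b.Max() && b.Min() <= a.Max();
}

// Usage: only pay for the real intersection when the guard passes.
// bool anyCommon = MightIntersect(a, b) && a.Intersect(b).Any();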

Related

C# DateTime OrderBy when list contains some identical DateTime values

I have a list of reference types that are sorted by a DateTime property. Some of them have identical DateTimes; for example, multiple baseball games starting at the exact same time.
I want to confirm that OrderBy will sort the same way every time; that is to say, because I provided the input in the order A->B->C, the output will also be A->B->C (when all items have identical DateTimes).
I wrote a unit test to confirm that the order is preserved. The test passed. But without really knowing what is going on I still don't feel confident.
Can someone please confirm the OrderBy behavior for me? I tried searching via google and couldn't find anything definitive.
The formal name for the concept you're asking about is called a Stable Sort.
Knowing this, you can check the documentation for Enumerable.OrderBy and see that it does, indeed, use a stable sorting algorithm. From near the end of the Remarks section:
This method performs a stable sort; that is, if the keys of two elements are equal, the order of the elements is preserved.
Additionally, there was some confusion with LINQ to SQL in the comments on the question. If your data is already in a List<T> object, you are not using LINQ to SQL anymore. However, it's worth noting that LINQ to SQL uses IQueryable.OrderBy rather than Enumerable.OrderBy, and IQueryable.OrderBy does not guarantee a stable sort. You may get a stable sort, but it depends on what the database engine does.
In short, OrderBy won't change the relative order of elements whose sort key is identical. So if the collection you called it on is ordered, that order is kept. If it's a list then you're all good, but other collection types, such as Dictionary or HashSet, do not guarantee order, so I imagine their output order could change, since you can't rely on the order of the underlying collection.
Edit: as someone has mentioned in the comments, LINQ to Objects' OrderBy is a stable (deterministic) sort, so the order will be the same every time, and items considered equal will not have their order changed.
First, make sure you read the excellent answer from Joel.
That said, if you cannot rely on LINQ to Objects' stable sort (with EF, for example), you also have the option of ThenBy:
OrderBy(c => c.MyDate).ThenBy(n => n.MyId)
This way you can apply a secondary ordering that breaks ties whenever the first key has multiple identical values.
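For instance (a sketch; games, StartTime, and Id are hypothetical names):

var ordered = games
    .OrderBy(g => g.StartTime)  // primary key, may contain ties
    .ThenBy(g => g.Id)          // explicit tie-break, deterministic on any provider
    .ToList();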

Get original value from HashSet

UPDATE:
Starting with .NET 4.7.2, HashSet<T>.TryGetValue (docs) is available.
HashSet.TryGetValue - SO post
I have a problem with HashSet because it does not provide any method similar to the TryGetValue known from Dictionary, and I need such a method: pass an element to find in the set, and have the set return the element it actually holds (when found).
Sidenote, to the objection "why do you need the element from the set, you already have that element?": no, I don't; equality and identity are two different things.
HashSet is not sealed, but all of its fields are private, so deriving from it is pointless. I cannot use Dictionary instead because I need the SetEquals method. I was thinking about grabbing the source for HashSet and adding the desired method, but the license is not truly open source (I can look, but I cannot distribute/modify). I could use reflection, but the arrays in HashSet are not readonly, meaning I cannot bind to those fields once per instance lifetime.
And I don't want to pull in a full-blown library for just a single class.
So far I am stuck with LINQ's SingleOrDefault. So the question is how to fix this: how can I have a HashSet with TryGetValue?
Probably you should switch from a HashSet to a SortedSet.
There is a simple TryGetValue() you can write for a SortedSet (reworked here as a self-contained extension method):
public static bool TryGetValue<T>(this SortedSet<T> sortedSet, ref T element)
{
    // GetViewBetween returns the stored items the set's comparer
    // considers equal to 'element' (a view over [element, element]).
    var foundSet = sortedSet.GetViewBetween(element, element);
    if (foundSet.Count == 1)
    {
        element = foundSet.First(); // requires using System.Linq
        return true;
    }
    return false;
}
When called, the element passed in only needs the properties used by the set's Comparer to be set; on success, the ref parameter comes back holding the element actually stored in the set.
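For instance, a hypothetical usage with a case-insensitive comparer, where the probe goes in and the stored instance comes back:

var set = new SortedSet<string>(StringComparer.OrdinalIgnoreCase) { "Hello" };
string probe = "HELLO";
if (set.TryGetValue(ref probe))
{
    Console.WriteLine(probe); // prints "Hello", the instance stored in the set
}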
I agree this is something which is basically missing. While it's only useful in rare cases, I think they're significant rare cases; most notably, key canonicalization.
I can only think of one suggestion at the moment, and it's truly foul.
You can specify your own IEqualityComparer<T> when creating a HashSet<T> - so create one which remembers the arguments to the last positive (i.e. true-returning) Equals comparison it has performed. You can then call Contains, and see what the equality comparer was asked to compare.
Caveats:
This holds on to references unnecessarily, so could end up preventing objects being garbage collected
You'd potentially want to do this on a per-thread basis (if you've got a set that isn't modified after initialization, but is then read by multiple threads, for example)
It assumes that HashSet<T> doesn't use any optimization such as "if the references are equal, don't bother consulting the equality comparer"
It's fundamentally a horrible abuse
I've been trying to think of other alternatives in terms of finding intersections, but I haven't got anywhere yet...
As noted in comments, it would be worth encapsulating this as far as possible - I suspect you only need a very limited set of operations, so I'd wrap a HashSet<T> in your own class and only expose the operations you really need - that way you get to clear the "cache" after each operation, removing my first objection above.
It still feels like a horrible abuse to me, but...
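For concreteness, here is a sketch of that comparer trick, with all the caveats above still applying (the wrapper type is made up; assumes using System.Collections.Generic):

class RememberingComparer<T> : IEqualityComparer<T>
{
    private readonly IEqualityComparer<T> inner;

    public RememberingComparer(IEqualityComparer<T> inner) => this.inner = inner;

    // Arguments of the last positive comparison. Which argument is the
    // stored element (vs. the probe) is a HashSet<T> implementation
    // detail, so both are remembered.
    public T LastX { get; private set; }
    public T LastY { get; private set; }

    public bool Equals(T x, T y)
    {
        bool equal = inner.Equals(x, y);
        if (equal) { LastX = x; LastY = y; }
        return equal;
    }

    public int GetHashCode(T obj) => inner.GetHashCode(obj);
}

// Usage sketch:
// var cmp = new RememberingComparer<string>(StringComparer.OrdinalIgnoreCase);
// var set = new HashSet<string>(cmp) { "Hello" };
// if (set.Contains("HELLO")) { /* cmp.LastX / cmp.LastY hold the match */ }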
As others have suggested, an alternative would be to use a Dictionary<TKey, TValue> and implement SetEquals yourself. That would be simple enough to do - and again, you'd want to encapsulate this in your own type. Either way, you should probably design the type itself first, and then implement it using either a HashSet<> or a Dictionary<,> as an implementation detail.
Sounds like you're trying to use the wrong tool. True, you can save some memory using a HashSet, but it seems to me that you are trying to achieve a different goal: get the actual stored element that is merely equal to a probe representation.
So in reality they are two different elements; just the memento (a unique representation) is equal.
Therefore you'd be better off using a Dictionary where you add your elements as both key and value. That way you're able to get back the (identical) stored instance, but you lose SetEquals...
I suppose SetEquals, in its implementation, does little more than sequentially compare two HashSets in bucket order, failing at the first non-equality.
So you should be equally well off using a simple SequenceEqual() (LINQ) to compare the two Keys collections.
So this extension method could do:
public static bool SetEqual<T, G>(this IDictionary<T, G> d, IDictionary<T, G> e)
{
    return d.Keys.SequenceEqual(e.Keys); // requires using System.Linq
}
This should work, because a Dictionary is basically a HashSet with an associated value, and it is more appropriate to your problem. (OK, to be correct, the code should take Dictionary<,> instead of IDictionary<,>, because key order matters.)
If you need an IEnumerable<> as the second parameter, sort both key collections first to get a defined order (not as efficient).
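A tiny sketch of that Dictionary-as-set idea, storing each element as both key and value so lookups hand back the stored instance:

var canon = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
canon["Hello"] = "Hello"; // the element is its own key
if (canon.TryGetValue("HELLO", out var original))
{
    Console.WriteLine(original); // "Hello", the instance the dictionary holds
}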
Finally added in .NET 4.7.2:
HashSet.TryGetValue(T, T) Method
An SO post with more details
Hopefully I'm not blind, but I haven't seen this answer anywhere: if you want Dictionary's TryGetValue, you can just steal it.
theHashset.ToDictionary(item => item.ID).TryGetValue(key, out value)
All you need is a quick lambda for determining unique keys. (Note that ToDictionary builds a new dictionary on every call, so this only pays off if you keep the resulting dictionary around and reuse it.)

List<T> vs HashSet<T> - is dynamic collection choice efficient or not?

var usedIds = list.Count > 20 ? new HashSet<int>() as ICollection<int> : new List<int>();
Assuming that List is more performant with 20 or fewer items and HashSet is more performant with a greater item count (from this post), is it an efficient approach to use different collection types dynamically based on the predictable item count?
All of the actions for each of the collection types will be the same.
PS: I have also found the HybridCollection class, which seems to do the same thing automatically, but I've never used it, so I have no info on its performance either.
EDIT: My collection is mostly used as a buffer, with many inserts and gets.
In theory, it could be, depending on how many and what type of operations you are performing on the collections. In practice, it would be a pretty rare case where such micro-optimization would justify the added complexity.
Also consider what type of data you are working with. If you are using int as the collection item as the first line of your question suggests, then the threshold is going to be quite a bit less than 20 where List is no longer faster than HashSet for many operations.
In any case, if you are going to do that, I would create a new collection class to handle it, something along the lines of the HybridDictionary, and expose it to your user code with some generic interface like IDictionary.
And make sure you profile it to be sure that your use case actually benefits from it.
There may even be a better option than either of those collections, depending on what exactly it is you are doing. i.e. if you are doing a lot of "before or after" inserts and traversals, then LinkedList might work better for you.
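Picking up the suggestion above of a HybridDictionary-style wrapper, a minimal sketch might look like this (the threshold is illustrative, not a measured tipping point):

using System.Collections;
using System.Collections.Generic;

class HybridIntSet : IEnumerable<int>
{
    private const int Threshold = 20;  // illustrative, profile to tune
    private List<int> list = new List<int>();
    private HashSet<int> set;          // created once the list is outgrown

    public void Add(int item)
    {
        if (set != null) { set.Add(item); return; }
        if (!list.Contains(item)) list.Add(item);
        if (list.Count > Threshold)
        {
            set = new HashSet<int>(list); // migrate once, stay a set after
            list = null;
        }
    }

    public bool Contains(int item) =>
        set != null ? set.Contains(item) : list.Contains(item);

    public IEnumerator<int> GetEnumerator() =>
        (set ?? (IEnumerable<int>)list).GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}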
Hash tables like HashSet<T> and Dictionary<K,T> are faster at searching and at inserting items in any order.
Arrays (T[]) are best used if you always have a fixed size and a lot of indexing operations. Adding items to an array is slower than adding to a List due to the covariance of arrays in C#.
List<T> is best used for dynamically sized collections with indexing operations.
I don't think it is a good idea to write something like the hybrid collection; better to use a collection that fits your requirements. If you have a buffer with a lot of index-based operations, I would not suggest a hash table; as somebody already noted, a hash table by design uses more memory.
HashSet is for faster lookups, but List is for inserts. If you don't plan on adding new items, use a HashSet; otherwise, use a List.
If your collection is very small, then performance is virtually always going to be a non-issue. If you know that n is always less than 20, O(n) is, by definition, O(1). Everything is fast for small n.
Use the data structure that most appropriately represents how you are conceptually treating the data, the type of operations that you need to perform, and the type of operations that should be most efficient.
is it an efficient approach to use different collection types dynamically based on the predictable item count?
It can be, depending on what you mean by "efficiency" (MS offers the HybridDictionary class for that purpose, though unfortunately it is non-generic). But irrespective of that, it's mostly a bad choice. I will explain both points.
From an efficiency standpoint:
Addition will always be faster in a List<T>, since a HashSet<T> has to compute and store a hash code. Even though removal and lookup become faster with a HashSet<T> as the size grows, addition to the end is where List<T> wins. You will have to decide which is more important to you.
HashSet<T> also comes with a memory overhead compared to List<T>. See this for some illustration.
From a usability standpoint, however, it may not make sense. A HashSet<T> is a set, while a List<T> is a bag. They are very different, and their uses are very different. For:
HashSet<T> cannot have duplicates.
HashSet<T> will not care about any order.
So when you return a hybrid ICollection<T>, your requirement reads like this: "It doesn't matter whether duplicates can be added or not; sometimes they will be, sometimes they won't. And of course iteration order is not important anyway." That is very rarely useful.
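The semantic difference is easy to demonstrate:

var bag = new List<int> { 1, 1, 2 };    // keeps duplicates and insertion order
var set = new HashSet<int> { 1, 1, 2 }; // silently drops the duplicate
Console.WriteLine(bag.Count); // 3
Console.WriteLine(set.Count); // 2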
HashSet can be better here: duplicates are stored only once (so it may even use less space if your data has many repeats), and you get faster access to elements.

What is the fastest/safest method to iterate over a HashSet?

I'm still quite new to C#, but I noticed through forum postings the advantages of using a HashSet instead of a List in specific cases.
My current case isn't exactly that I'm storing a tremendous amount of data in a single List, but rather that I have to check for membership often.
The catch is that I do indeed need to iterate over it as well, but the order in which items are stored or retrieved doesn't actually matter.
I've read that foreach loops are actually slower than for loops, so how else could I go about this in the fastest way possible?
The number of .Contains() checks I'm doing is definitely hurting my performance with Lists, so at the very least comparing against the performance of a HashSet would be handy.
Edit: I'm currently using Lists, iterating through them in numerous locations, with different code executed in each location. Most often, the current Lists contain point coordinates that I then use to index into a two-dimensional array, on which I do some operation or another based on the criteria of the List.
If there's no direct answer to my question, that's fine, but I assumed there might be other ways to iterate over a HashSet than just a foreach cycle. I'm currently in the dark as to what other methods there might even be, what advantages they provide, etc. Assuming there are other methods, I also assumed there would be a typical preferred method of choice that is only ignored when it doesn't suit one's needs (my needs are pretty basic).
As far as premature optimization goes, I already know that using the Lists as I am is a bottleneck; how to fix that is where I'm stuck. Not even stuck exactly, but I didn't want to re-invent the wheel by testing repeatedly only to find out I'm already doing it the best way I can (this is a large project with over 3 months invested; Lists are everywhere, but there are definitely ones where I don't want duplicates, have a lot of data, need no specific storage order, etc.).
A foreach loop has a small amount of additional overhead on indexed collections (like an array).
This is mostly because the foreach does a little more bounds checking than a for loop.
HashSet does not have an indexer so you have to use the enumerator.
In this case foreach is efficient as it only calls MoveNext() as it moves through the collection.
Also, Parallel.ForEach can dramatically improve your performance, depending on the work you are doing in the loop and the size of your HashSet; see the sketch below.
As mentioned before, profiling is your best bet.
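A small sketch of both loops over a HashSet (the per-item work is a placeholder; parallelism only pays off when that work is expensive):

var points = new HashSet<(int X, int Y)> { (1, 2), (3, 4) };

// Plain foreach: the normal, efficient way to walk a HashSet.
foreach (var p in points)
    Console.WriteLine(p);

// Parallel.ForEach: processing order is not defined, which is fine
// here since the order doesn't matter.
System.Threading.Tasks.Parallel.ForEach(points, p =>
{
    // expensive per-item work would go here
});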
You shouldn't be iterating over a HashSet in the first place to determine whether an item is in it. You should use HashSet's own Contains method (not the LINQ extension). The HashSet is designed so that it doesn't need to look through every item to see whether a given value is in the set; that is what makes it so much more powerful for lookups than a List.
Not strictly answering the question in the header, but more concerning your specific problem:
I would make your own Collection object that uses both a HashSet and a List internally. Iterating is fast as you can use the List, checking for Contains is fast as you can use the HashSet. Just make it an IEnumerable and you can use this Collection in foreach as well.
The downside is more memory, but there are only twice as many references to the objects, not twice as many objects. In the worst case it's only twice as much memory, but you seem much more concerned with performance.
Adding, checking, and iterating are fast this way, only removal is still O(N) because of the List.
EDIT: If removal needs to be O(1) as well, use a doubly linked list instead of a regular list, and make the hashSet a Dictionary<KeyType, Cell> instead. You can check the dictionary for Contains, but also to find the cell with the data in it fast, so removal from the data structure is fast.
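A minimal sketch of the combined structure (names are made up; removal is omitted since, as noted, it stays O(N) with a plain List):

using System.Collections;
using System.Collections.Generic;

class IterableSet<T> : IEnumerable<T>
{
    private readonly List<T> items = new List<T>();        // fast iteration
    private readonly HashSet<T> lookup = new HashSet<T>(); // O(1) Contains

    public bool Add(T item)
    {
        if (!lookup.Add(item)) return false; // duplicate, not added
        items.Add(item);
        return true;
    }

    public bool Contains(T item) => lookup.Contains(item);

    public IEnumerator<T> GetEnumerator() => items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}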
I had the same issue: HashSet suits the addition of unique elements very well, but is slow when retrieving elements in a for loop. I solved it by converting the HashSet to an array and then running the for loop over that.

Why do we need Single() in LINQ?

What is the main purpose of the extension method Single()?
I know it will throw an exception if more than one element in the sequence matches the predicate, but I still don't understand in which context it could be useful.
Edit:
I do understand what Single does, so you don't need to explain what this method does in your answer.
It's useful for declaratively stating
I want the single element in the list and if more than one item matches then something is very wrong
There are many times when a program needs to reduce a set of elements to the one that is interesting based on a particular predicate. If more than one matches, it indicates an error in the program. Without the Single method, the program would need to traverse parts of the potentially expensive list more than once.
Compare
Item i = someCollection.Single(thePredicate);
To
Contract.Requires(someCollection.Where(thePredicate).Count() == 1);
Item i = someCollection.First(thePredicate);
The latter requires two statements and iterates a potentially expensive list twice. Not good.
Note: Yes, First is potentially faster because it only has to iterate the enumeration up to the first element that matches; the rest of the elements are of no consequence. Single, on the other hand, must consider the entire enumeration, since it cannot stop at the first match without verifying there is no second one. If multiple matches are of no consequence to your program and indicate no programming errors, then yes, use First.
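A two-line illustration of the difference (the data is made up):

var ids = new[] { 1, 2, 2, 3 };
var first = ids.First(n => n == 2);  // returns 2, silently ignoring the duplicate
// ids.Single(n => n == 2);          // throws InvalidOperationException: more than one match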
Using Single allows you to document your expectations on the number of results, and to fail early, fail hard if they are wrong. Unless you enjoy long debugging sessions for their own sake, I'd say it's enormously useful for increasing the robustness of your code.
LINQ's sequence operators return an IEnumerable<T>. To get an actual element, you need one of the First, Last, or Single methods; you use the latter if you know for sure the sequence contains exactly one element. An example would be a 1:1 ID:Name mapping in a database.
Single returns a single instance of the class/object, not a collection. It is very handy when you fetch a single record by Id and never expect more than one row.
