HashSet limit - how to proceed?

My program creates custom objects, and I want to get a distinct list of them. So I want to use a set and add the objects one by one; the set would prevent duplicates, and at the end I would have a set of unique objects.
I would usually use a HashSet, because I don't need a sorted set. However, there are very many different potential objects: more than 2^32. The GetHashCode function returns an int, so it cannot work as a unique key for my objects.
I therefore assume that I cannot use HashSet and must use the slower SortedSet and have my objects implement IComparable / CompareTo. Is this correct? Or is there a way to have a HashSet with long hash codes?

GetHashCode does return an int, but when two hash codes compare as equal, the HashSet follows up by calling the Equals method (which you should override) to decide whether the objects are really the same.
So, no, you don't have to switch. You can keep using the same old lovable HashSet (as long as you don't run out of memory).
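
To illustrate the point above, here is a minimal sketch. BigKey and CollisionDemo are hypothetical names invented for this example, and the hash function is deliberately lossy so that a collision is guaranteed. The set still tells the colliding objects apart because it falls back to Equals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type whose identity is a 64-bit value, so there are more
// possible instances than there are int hash codes.
public sealed class BigKey : IEquatable<BigKey>
{
    public long Id { get; }

    public BigKey(long id) { Id = id; }

    public bool Equals(BigKey other) { return other != null && Id == other.Id; }

    public override bool Equals(object obj) { return Equals(obj as BigKey); }

    // Deliberately lossy: only the low 32 bits feed the hash code,
    // so Id values that differ by 2^32 collide.
    public override int GetHashCode() { return unchecked((int)Id); }
}

public static class CollisionDemo
{
    public static int CountDistinct()
    {
        var set = new HashSet<BigKey>
        {
            new BigKey(1),
            new BigKey(1L + (1L << 32)), // same hash code as BigKey(1), but not equal
            new BigKey(1),               // true duplicate, rejected via Equals
        };
        return set.Count; // 2
    }
}
```

Collisions only cost extra Equals calls; they do not cause distinct objects to be dropped.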

Related

Are hash codes of System.Type objects of types from the same assembly guaranteed to be unique?

Clarifying edit: The keys in the dictionary are actual instances of System.Type. More specifically every value is stored with its type as the key.
In a specific part of my program the usage of Dictionary<System.Type, SomeThing> takes a large chunk of CPU time, as per Visual Studio 2017 performance profiler.
A change in the type of the dictionary to Dictionary<int, SomeThing> and instead of passing the type object directly I pass the type.GetHashCode() seems to be about 20%-25% faster.
The above optimization will result in a nasty bug if two types have the same hash code, but it seems plausible to me that types can have unique hash codes, at least when it comes to types from the same assembly - which all the types used in this dictionary are.
Possibly relevant information - As per this answer the number of possible types in an assembly is far smaller than the number of values represented by System.Int32.

No. The documentation on object.GetHashCode() makes no guarantees, and states:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
...
Do not use the hash code as the key to retrieve an object from a keyed collection.
Because equal hash codes are necessary, but not sufficient, for two objects to be equal.
If you're wondering if Type.GetHashCode() follows a more restrictive definition, its documentation makes no mention of such a change, so it still does not guarantee uniqueness. The reference source does not show any attempt to make this guarantee, either.

A hash code is never guaranteed to be unique for different values, so you should not use it the way you are doing.
The same value should, however, always generate the same hash code.
This is also stated in MSDN:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash codes.
and somewhat further:
Do not use the hash code as the key to retrieve an object from a keyed collection.
Therefore, I would also not rely on GetHashCode being unique for different types, but at least you can verify it:
    var seen = new Dictionary<int, string>();
    // Replace typeof(int) with any type from the assembly you want to inspect.
    var types = typeof(int).Assembly.GetTypes();
    Console.WriteLine($"Inspecting {types.Length} types...");
    foreach (var t in types)
    {
        int hash = t.GetHashCode();
        if (seen.TryGetValue(hash, out var other))
        {
            // Two different types produced the same hash code.
            Console.WriteLine($"{t.Name} has the same hashcode as {other}");
        }
        else
        {
            seen.Add(hash, t.Name);
        }
    }
    Console.WriteLine("done!");
But even if the above test concluded that there are no collisions, I wouldn't do it, since the implementation of GetHashCode can change over time, which means that collisions might appear in the future.

A hash code isn't meant to be unique. Instead, it is used in hash-based collections such as Dictionary to limit the number of candidates that have to be compared. A hash code is nothing but an index, so instead of searching the entire collection for a match, only the few items that share a common value (the hash code) have to be considered.
In fact, you could even have a hash implementation that always returns the same number for every item. However, that makes looking up a key in your dictionary O(n), as every key has to be compared.
Anyway, you shouldn't strive for micro-optimizations that gain you a few nanoseconds at the cost of maintainability and understandability. You should instead use a data structure that gets the job done and is easy to understand.
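
The effect of a constant hash code can be made visible with a small sketch. ConstantHashComparer and ConstantHashDemo are hypothetical names made up for this illustration. The comparer returns 0 for every key and counts how often Equals is called during one lookup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: every string gets hash code 0, and Equals calls are counted.
public sealed class ConstantHashComparer : IEqualityComparer<string>
{
    public int EqualsCalls { get; private set; }

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    // All keys land in the same bucket.
    public int GetHashCode(string obj) { return 0; }
}

public static class ConstantHashDemo
{
    // Fills a dictionary with itemCount keys, then looks up a key that is
    // not present and reports how many Equals calls that single lookup took.
    public static (bool Found, int Comparisons) Lookup(int itemCount)
    {
        var comparer = new ConstantHashComparer();
        var map = new Dictionary<string, int>(comparer);
        for (int i = 0; i < itemCount; i++)
        {
            map["key" + i] = i;
        }

        int before = comparer.EqualsCalls;
        bool found = map.ContainsKey("missing");
        return (found, comparer.EqualsCalls - before);
    }
}
```

With a well-distributed hash code the lookup would typically need very few Equals calls; here the count should grow with the number of entries, which is the O(n) behaviour described above.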

Initialization of SortedDictionary without comparing

In my application, I have a SortedDictionary. Most of the time I'm inserting single values into it. In such cases, I understand that it needs to use the Compare method to determine where the new value should be added.
I was just wondering whether there is some way I can initialize this SortedDictionary from, let's say, a KeyValuePair<>[] array without causing the Compare method to run.
The thing is that sometimes I do have a KeyValuePair<>[] array that contains already sorted keys, so it could be transformed into a SortedDictionary without any additional sorting. I understand that the compiler doesn't know that my collection is sorted, but since I'm sure of it, is there some way to intentionally evade the comparison? If this request is total nonsense, could you please explain why?
The only reason I want to do this is performance: when working with big collections, the Compare method takes some time to finish.

[...] I understand that the compiler doesn't know that my collection is sorted, [...]
Sorting is a run-time detail, not a compile-time one.
I don't think this would be a good idea. Here's a summary of reasons not to do it:
A dictionary is actually a hash table, so its keys aren't sorted per se.
A sorted dictionary requires the comparer in order to provide the keys in a defined order. If you don't use a comparer, how would a simple hash table be able to expose its keys in some order?
At the end of the day, when you need a collection where its order is the insertion order, you should use a List<T>, and in your case, you should consider a List<KeyValuePair<TKey, TValue>>. Anyway, this won't work in your case. You want to provide an already sorted sequence as source of a sorted dictionary, and let the comparer work once the dictionary is filled when adding new pairs after construction-time.
I would say that if you need a sorted dictionary which relies on a sequence of pairs given during construction-time and that mustn't be re-sorted (since they're already sorted), then you'll need to think about rolling your own IDictionary<TKey, TValue> implementation to provide such feature...
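
As a rough sketch of the "roll your own" idea, and not a full IDictionary<TKey, TValue> implementation, the following read-only type trusts the caller that the input is already sorted. PresortedLookup is a hypothetical name. The class does not support adding pairs later, and it still compares keys on lookup, just not during construction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical read-only lookup over pairs that are ALREADY sorted by key.
public sealed class PresortedLookup<TKey, TValue>
{
    private readonly TKey[] _keys;
    private readonly TValue[] _values;
    private readonly IComparer<TKey> _comparer;

    // Assumption: sortedPairs is ordered by key under the comparer and has no
    // duplicate keys. This is trusted, not verified, so no Compare call runs here.
    public PresortedLookup(KeyValuePair<TKey, TValue>[] sortedPairs,
                           IComparer<TKey> comparer = null)
    {
        _comparer = comparer ?? Comparer<TKey>.Default;
        _keys = new TKey[sortedPairs.Length];
        _values = new TValue[sortedPairs.Length];
        for (int i = 0; i < sortedPairs.Length; i++)
        {
            _keys[i] = sortedPairs[i].Key;
            _values[i] = sortedPairs[i].Value;
        }
    }

    public int Count { get { return _keys.Length; } }

    // Lookups still compare keys: a binary search needs O(log n) comparisons.
    public bool TryGetValue(TKey key, out TValue value)
    {
        int index = Array.BinarySearch(_keys, key, _comparer);
        if (index >= 0)
        {
            value = _values[index];
            return true;
        }
        value = default(TValue);
        return false;
    }
}
```

If the input turns out not to be sorted, lookups silently return wrong results, which is the price of skipping the comparisons up front.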

Get original value from HashSet

UPDATE:
Starting with .NET Framework 4.7.2, HashSet.TryGetValue is available (see the docs and the related SO post on HashSet.TryGetValue).
I have a problem with HashSet because it does not provide any method similar to the TryGetValue known from Dictionary. And I need such a method -- I pass the element to find, and the set returns the matching element from its collection (when found).
Sidenote -- "why do you need element from the set, you already have that element?". No, I don't, equality and identity are two different things.
HashSet is not sealed but all its fields are private, so deriving from it is pointless. I cannot use Dictionary instead because I need SetEquals method. I was thinking about grabbing a source for HashSet and adding desired method, but the license is not truly open source (I can look, but I cannot distribute/modify). I could use reflection but the arrays in HashSet are not readonly meaning I cannot bind to those fields once per instance lifetime.
And I don't want to use full blown library for just single class.
So far I am stuck with LINQ's SingleOrDefault. So the question is how to fix this: how can I have a HashSet with TryGetValue?

You should probably switch from a HashSet to a SortedSet.
There is a simple TryGetValue() for a SortedSet:
    // sortedSet is the wrapped SortedSet<T>; First() requires System.Linq.
    public bool TryGetValue(ref T element)
    {
        // The view contains at most the one stored element equal to the argument.
        var foundSet = sortedSet.GetViewBetween(element, element);
        if (foundSet.Count == 1)
        {
            element = foundSet.First();
            return true;
        }
        return false;
    }
When called, the element only needs to have those properties set that the comparer uses. The method returns the element found in the set.

I agree this is something which is basically missing. While it's only useful in rare cases, I think they're significant rare cases - most notably, key canonicalization.
I can only think of one suggestion at the moment, and it's truly foul.
You can specify your own IEqualityComparer<T> when creating a HashSet<T> - so create one which remembers the arguments to the last positive (i.e. true-returning) Equals comparison it has performed. You can then call Contains, and see what the equality comparer was asked to compare.
Caveats:
This holds on to references unnecessarily, so could end up preventing objects being garbage collected
You'd potentially want to do this on a per-thread basis (if you've got a set that isn't modified after initialization, but is then read by multiple threads, for example)
It assumes that HashSet<T> doesn't use any optimization such as "if the references are equal, don't bother consulting the equality comparer"
It's fundamentally a horrible abuse
I've been trying to think of other alternatives in terms of finding intersections, but I haven't got anywhere yet...
As noted in comments, it would be worth encapsulating this as far as possible - I suspect you only need a very limited set of operations, so I'd wrap a HashSet<T> in your own class and only expose the operations you really need - that way you get to clear the "cache" after each operation, removing my first objection above.
It still feels like a horrible abuse to me, but...
As others have suggested, an alternative would be to use a Dictionary<TKey, TValue> and implement SetEquals yourself. That would be simple enough to do - and again, you'd want to encapsulate this in your own type. Either way, you should probably design the type itself first, and then implement it using either a HashSet<> or a Dictionary<,> as an implementation detail.
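
A minimal sketch of that last suggestion, under the assumption that only a handful of operations are needed. CanonicalSet is a hypothetical name, and SetEquals here assumes both sets were created with the same equality comparer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: a set that can hand back the stored instance.
// A Dictionary<T, T> (each item is its own key and value) is an implementation detail.
public sealed class CanonicalSet<T>
{
    private readonly Dictionary<T, T> _items;

    public CanonicalSet(IEqualityComparer<T> comparer = null)
    {
        // A null comparer makes Dictionary fall back to the default comparer.
        _items = new Dictionary<T, T>(comparer);
    }

    public int Count { get { return _items.Count; } }

    // Returns false if an equal item is already stored (the stored one is kept).
    public bool Add(T item)
    {
        if (_items.ContainsKey(item)) return false;
        _items.Add(item, item);
        return true;
    }

    // Returns the stored instance that is equal to equalValue, if any.
    public bool TryGetValue(T equalValue, out T actualValue)
    {
        return _items.TryGetValue(equalValue, out actualValue);
    }

    // Order-independent comparison: same count and every item present in the other set.
    public bool SetEquals(CanonicalSet<T> other)
    {
        if (other == null) throw new ArgumentNullException(nameof(other));
        if (Count != other.Count) return false;
        foreach (T key in _items.Keys)
        {
            if (!other._items.ContainsKey(key)) return false;
        }
        return true;
    }
}
```

For example, with StringComparer.OrdinalIgnoreCase, adding "Hello" and then calling TryGetValue("HELLO", out var stored) yields the originally stored "Hello".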

Sounds like you are trying to use the wrong tool. True, you can save some memory using a HashSet, but it seems to me that you are trying to achieve a different goal: getting the actual element that is merely equal to a representation.
So in reality they are two different elements. Just the memento (a unique representation) is equal.
Therefore you'd be better off using a Dictionary where you add your elements as both Key and Value. That way you're able to get the identical element back, but you lose your SetEquals....
I suppose SetEquals in its implementation does nothing much different than sequentially compare two HashSets in their bucket order and fail on the first non-equality.
So you should be equally well off using a simple SequenceEqual() (LINQ) comparing the two Keys collections.
So this extension method could do:
    // Must be declared inside a static class; SequenceEqual requires System.Linq.
    public static bool SetEqual<T,G>(this IDictionary<T,G> d, IDictionary<T,G> e)
    {
        return d.Keys.SequenceEqual(e.Keys);
    }
This should work, because a Dictionary basically is a HashSet with an associated value, and it is more appropriate to your problem. (OK, to be correct, the code should go for Dictionary<> instead of IDictionary<> because key order matters.)
If you need an IEnumerable<> as the second parameter, try sorting to get a defined order (not so efficient).
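
Because SequenceEqual depends on the order in which the keys are enumerated, and Dictionary does not document a particular order, an order-independent check may be the safer choice. The following is a sketch only; KeysSetEqual is a hypothetical helper name, and it assumes both dictionaries use compatible key comparers.

```csharp
using System.Collections.Generic;

public static class DictionaryKeyExtensions
{
    // Hypothetical helper: true when both dictionaries contain the same keys,
    // regardless of the order in which the keys are enumerated.
    public static bool KeysSetEqual<TKey, TValue>(
        this IDictionary<TKey, TValue> first,
        IDictionary<TKey, TValue> second)
    {
        // Different sizes can never hold the same key set.
        if (first.Count != second.Count) return false;

        // Same size, so it is enough that every key of one is in the other.
        foreach (TKey key in first.Keys)
        {
            if (!second.ContainsKey(key)) return false;
        }
        return true;
    }
}
```

Each ContainsKey call is a hash lookup, so the whole check stays roughly linear in the number of keys without any sorting.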

Finally added in .NET 4.7.2:
HashSet.TryGetValue(T, T) Method
An SO post with more details
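
A short sketch of how the built-in method can be used for the key canonicalization case mentioned earlier. TryGetValueDemo and Canonicalize are hypothetical names; HashSet<T>.TryGetValue(T, out T) itself is the framework method.

```csharp
using System;
using System.Collections.Generic;

public static class TryGetValueDemo
{
    // Returns the instance already stored in the pool if an equal one exists;
    // otherwise stores the candidate and returns it.
    public static string Canonicalize(HashSet<string> pool, string candidate)
    {
        string stored;
        if (pool.TryGetValue(candidate, out stored))
        {
            return stored; // the instance held by the set, not the argument
        }
        pool.Add(candidate);
        return candidate;
    }
}
```

With a pool created as new HashSet<string>(StringComparer.OrdinalIgnoreCase) that already contains "Hello", passing "HELLO" returns the stored "Hello".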

Hopefully I'm not blind, but I haven't seen this answer anywhere. If you want Dictionary's TryGetValue, you can just steal it.
theHashset.ToDictionary(item => item.ID).TryGetValue(key, out value)
All you need is a quick lambda for determining unique keys.

Using Dictionary<Foo, Foo> Instead of List<Foo> to Speed up Calls to Contains()

I have a question about generic collections in C#. If I need to store a collection of items, and I'm frequently going to need to check whether an item is in the collection, would it be faster to use Dictionary instead of List?
I've heard that checking if an item is in the collection is linear relative to the size for lists and constant relative to the size for dictionaries. Is using Dictionary and then setting Key and Value to the same object for each key-value pair something that other programmers frequently do in this situation?
Thanks for taking the time to read this.

Yes, yes it is. That said, you probably want to use HashSet because you don't need both a key and a value, you just need a set of items.
It's also worth noting that Dictionary was added in .NET 2.0 and HashSet in .NET 3.5, so for all that time in between it was actually fairly common to use a Dictionary when you wanted a set, just because that was all you had (without rolling your own). When I was forced to do this I just stuck null in the value, rather than using the item as both key and value, but the idea is the same.

Just use HashSet<Foo> if what you're concerned with is fast containment tests.
A Dictionary<TKey, TValue> is for looking a value up based on a key.
A List<T> is for random access and dynamic growth properties.
A HashSet<T> is for modeling a set and providing fast containment tests.
You're not looking up a value based on a key. You're not worried about random access, but rather fast containment checks. The right concept here is a HashSet<T>.

Assuming that there is only ever one copy of the item in the list, then the appropriate data structure is ISet<T>, specifically HashSet<T>.
That said, I've seen timings that indicate that a Dictionary<TKey, TValue> ContainsKey call is a wee bit faster than even HashSet<T>. Either way, both of them are going to be loads faster than a plain List<T> lookup.
Keep in mind that both of these (HashSet and Dictionary) rely on reasonably well-implemented Equals and GetHashCode implementations for T. List<T> only relies on Equals.

A Dictionary or HashSet will use more memory, but provides (almost) O(1) lookup time.

You might want to look at HashSet, which is a collection of unique objects (as long as the objects override Equals and GetHashCode consistently, or you supply an IEqualityComparer<T>).

You mention using List<T>, which implies that ordering may be important. If this is the case then you may also want to look into the SortedSet<T> type as well.

Define: What is a HashSet?

HashSet
The C# HashSet data structure was introduced in the .NET Framework 3.5. A full list of the implemented members can be found at the HashSet MSDN page.
Where is it used?
Why would you want to use it?

A HashSet holds a set of objects, but in a way that allows you to easily and quickly determine whether an object is already in the set or not. It does so by internally managing an array and storing the object using an index which is calculated from the hash code of the object.
HashSet is an unordered collection containing unique elements. It has the standard collection operations Add, Remove, Contains, but since it uses a hash-based implementation, these operations are O(1). (As opposed to List for example, which is O(n) for Contains and Remove.) HashSet also provides standard set operations such as union, intersection, and symmetric difference.
There are different implementations of Sets. Some make insertion and lookup operations super fast by hashing elements. However, that means that the order in which the elements were added is lost. Other implementations preserve the added order at the cost of slower running times.
The HashSet class in C# goes for the first approach, thus not preserving the order of elements. It is much faster than a regular List for lookups. Some basic benchmarks showed that HashSet is decently faster when dealing with primitive types (int, double, bool, etc.). It is a lot faster when working with class objects. So the point is that HashSet is fast.
The only catch of HashSet is that there is no access by index. To access elements you can either use an enumerator or use the built-in function to convert the HashSet into a List and iterate through that.

A HashSet has an internal structure (hash), where items can be searched and identified quickly. The downside is that iterating through a HashSet (or getting an item by index) is rather slow.
So why would someone want to be able to know if an entry already exists in a set?
One situation where a HashSet is useful is in getting distinct values from a list where duplicates may exist. Once an item is added to the HashSet, it is quick to determine whether the item exists (Contains method).
Other advantages of the HashSet are the Set operations: IntersectWith, IsSubsetOf, IsSupersetOf, Overlaps, SymmetricExceptWith, UnionWith.
If you are familiar with the object constraint language then you will identify these set operations. You will also see that it is one step closer to an implementation of executable UML.
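
The two uses named in this answer, removing duplicates and applying set operations, can be sketched as follows. SetOperationsDemo and its method names are hypothetical; Add, IntersectWith and the other set methods are the HashSet<T> members listed above.

```csharp
using System;
using System.Collections.Generic;

public static class SetOperationsDemo
{
    // Returns the distinct values of source in first-seen order.
    // HashSet<T>.Add returns false when the value is already present.
    public static List<int> DistinctInOrder(IEnumerable<int> source)
    {
        var seen = new HashSet<int>();
        var result = new List<int>();
        foreach (int value in source)
        {
            if (seen.Add(value)) result.Add(value);
        }
        return result;
    }

    // Returns the values present in both sequences.
    // IntersectWith modifies the set in place, so a copy of a is used.
    public static HashSet<int> Common(IEnumerable<int> a, IEnumerable<int> b)
    {
        var result = new HashSet<int>(a);
        result.IntersectWith(b);
        return result;
    }
}
```

For instance, DistinctInOrder(new[] { 3, 1, 3, 2, 1 }) gives 3, 1, 2, and Common(new[] { 1, 2, 3 }, new[] { 2, 3, 4 }) contains 2 and 3.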

Simply said and without revealing the kitchen secrets:
A set, in general, is a collection that contains no duplicate elements and whose elements are in no particular order. So a HashSet<T> is similar to a generic List<T>, but is optimized for fast lookups (via hash tables, as the name implies) at the cost of losing order.

From an application perspective, if you only need to avoid duplicates, then HashSet is what you are looking for, since its lookup, insert and remove complexities are O(1), i.e. constant. This means that no matter how many elements the HashSet has, it takes about the same amount of time to check whether a given element is present. Since inserting elements is O(1) too, it is a perfect fit for this sort of thing.
