Define: What is a HashSet? - C#

HashSet
The C# HashSet data structure was introduced in the .NET Framework 3.5. A full list of the implemented members can be found at the HashSet MSDN page.
Where is it used?
Why would you want to use it?

A HashSet holds a set of objects in a way that allows you to easily and quickly determine whether an object is already in the set. It does so by internally managing an array and storing each object at an index calculated from its hash code.
HashSet is an unordered collection containing unique elements. It has the standard collection operations Add, Remove, and Contains, but since it uses a hash-based implementation, these operations are O(1). (As opposed to List, for example, which is O(n) for Contains and Remove.) HashSet also provides standard set operations such as union, intersection, and symmetric difference.
There are different implementations of Sets. Some make insertion and lookup operations super fast by hashing elements. However, that means that the order in which the elements were added is lost. Other implementations preserve the added order at the cost of slower running times.
The HashSet class in C# goes for the first approach, thus not preserving the order of elements. Lookups are much faster than in a regular List. Some basic benchmarks showed that HashSet is decently faster when dealing with primitive types (int, double, bool, etc.), and a lot faster when working with class objects. So the point is that HashSet is fast.
The only catch of HashSet is that there is no access by index. To access elements you can either use an enumerator or convert the HashSet into a List and iterate through that, as the sketch below shows.
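A minimal sketch of both access patterns (the sample values here are invented):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Program
    {
        static void Main()
        {
            var set = new HashSet<string> { "red", "green", "blue" };

            // No indexer: set[0] does not compile. Enumerate instead:
            foreach (var color in set)
                Console.WriteLine(color);

            // Or convert to a List when index-based access is needed
            // (note that the resulting order is not guaranteed):
            List<string> asList = set.ToList();
            Console.WriteLine(asList[0]);
        }
    }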

A HashSet has an internal structure (a hash table) in which items can be searched and identified quickly. The downside is that iterating through a HashSet is rather slow compared to indexed access on a List, and there is no way to get an item by index.
So why would someone want to be able to know whether an entry already exists in a set?
One situation where a HashSet is useful is in getting distinct values from a list where duplicates may exist. Once an item is added to the HashSet it is quick to determine whether the item exists (the Contains method).
Other advantages of the HashSet are the set operations: IntersectWith, IsSubsetOf, IsSupersetOf, Overlaps, SymmetricExceptWith, UnionWith (sketched below).
If you are familiar with the Object Constraint Language then you will recognize these set operations. You will also see that it is one step closer to an implementation of executable UML.
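To make those set operations concrete, a small sketch (sample values invented); note that most of these methods mutate the set they are called on:

    using System;
    using System.Collections.Generic;

    class SetOperations
    {
        static void Main()
        {
            var a = new HashSet<int> { 1, 2, 3, 4 };
            var b = new HashSet<int> { 3, 4, 5, 6 };

            Console.WriteLine(a.Overlaps(b));   // True: they share at least one element
            Console.WriteLine(a.IsSubsetOf(b)); // False

            // UnionWith, IntersectWith, and SymmetricExceptWith modify the
            // receiver in place, so copy first if the original must survive.
            var union = new HashSet<int>(a);
            union.UnionWith(b);                 // { 1, 2, 3, 4, 5, 6 }

            var common = new HashSet<int>(a);
            common.IntersectWith(b);            // { 3, 4 }

            var symmetric = new HashSet<int>(a);
            symmetric.SymmetricExceptWith(b);   // { 1, 2, 5, 6 }

            Console.WriteLine(string.Join(", ", symmetric));
        }
    }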

Simply put, and without revealing the kitchen secrets:
a set, in general, is a collection that contains no duplicate elements and whose elements are in no particular order. So a HashSet<T> is similar to a generic List<T>, but is optimized for fast lookups (via hash tables, as the name implies) at the cost of losing order.

From an application perspective, if one needs only to avoid duplicates then HashSet is what you are looking for, since its Lookup, Insert, and Remove complexities are O(1) - constant. What this means is that it does not matter how many elements the HashSet holds: checking whether an element is present takes the same amount of time, and since you are inserting elements at O(1) too, it is perfect for this sort of thing.
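As a concrete illustration of that duplicate-avoidance use (sample data invented), Add both tests and records membership in a single O(1) call:

    using System;
    using System.Collections.Generic;

    class Deduplicate
    {
        static void Main()
        {
            int[] input = { 1, 2, 2, 3, 3, 3 };
            var seen = new HashSet<int>();

            foreach (var n in input)
            {
                // Add returns false if the element is already present,
                // so the check and the insert are a single O(1) operation.
                if (seen.Add(n))
                    Console.WriteLine("first occurrence of " + n);
            }
        }
    }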

Related

List<T> vs HashSet<T> - is dynamic collection choice efficient or not?

var usedIds = list.Count > 20 ? new HashSet<int>() as ICollection<int> : new List<int>();
Assuming that List is more performant with 20 or fewer items, and that HashSet is more performant with a greater item count (from this post), is it an efficient approach to use different collection types dynamically based on the predictable item count?
All of the actions for each of the collection types will be the same.
PS: I have also found the HybridDictionary class, which seems to do the same thing automatically, but I've never used it, so I have no info on its performance either.
EDIT: My collection is mostly used as a buffer, with many inserts and gets.
In theory, it could be, depending on how many and what type of operations you are performing on the collections. In practice, it would be a pretty rare case where such micro-optimization would justify the added complexity.
Also consider what type of data you are working with. If you are using int as the collection item, as the first line of your question suggests, then the threshold at which List stops being faster than HashSet for many operations is going to be quite a bit lower than 20.
In any case, if you are going to do that, I would create a new collection class to handle it, something along the lines of the HybridDictionary, and expose it to your user code with some generic interface like IDictionary.
And make sure you profile it to be sure that your use case actually benefits from it.
There may even be a better option than either of those collections, depending on what exactly it is you are doing. i.e. if you are doing a lot of "before or after" inserts and traversals, then LinkedList might work better for you.
Hash tables like HashSet<T> and Dictionary<K,V> are faster at searching and inserting items, in any order.
Arrays (T[]) are best used if you always have a fixed size and a lot of indexing operations. Assigning items into an array of reference types is slower than adding to a List due to the covariance check on arrays in C#.
List<T> is best used for dynamically sized collections with indexing operations.
I don't think it is a good idea to write something like the hybrid collection; better to use a collection that fits your requirements. If you have a buffer with a lot of index-based operations, I would not suggest a hash table; as somebody already noted, a hash table by design uses more memory.
HashSet is for faster access, but List is for faster inserts. If you don't plan on adding new items, use HashSet; otherwise use List.
If your collection is very small then performance is virtually always going to be a non-issue. If you know that n is always less than 20, O(n) is, by definition, O(1). Everything is fast for small n.
Use the data structure that most appropriately represents how you are conceptually treating the data, the type of operations that you need to perform, and the type of operations that should be most efficient.
is it an efficient approach to use different collection types dynamically based on the predictable item count?
It can be, depending on what you mean by "efficiency" (MS offers the HybridDictionary class for that, though unfortunately it is non-generic). But irrespective of that, it's mostly a bad choice. I will explain both points.
From an efficiency standpoint:
Addition will always be faster in a List<T>, since a HashSet<T> has to compute the hash code first and store it. Even though removal and lookup become faster in a HashSet<T> as the size grows, addition to the end is where List<T> wins. You will have to decide which is more important to you.
HashSet<T> comes with a memory overhead compared to List<T>.
From a usability standpoint, however, it need not make sense. A HashSet<T> is a set, whereas a List<T> is a bag. They are very different, and their uses are very different. For:
HashSet<T> cannot have duplicates.
HashSet<T> will not care about any order.
So when you return a hybrid ICollection<T>, your requirement goes like this: "It doesn't matter whether duplicates can be added or not - sometimes let them be added, sometimes not; and of course iteration order is not important anyway" - which is very rarely useful.
HashSet is better here because you get faster access to elements, and since duplicates are stored only once it may even use less space for data that contains them.

Why use a List over a HashSet?

Maybe an obvious question, but I've seen plenty of reasons to use a HashSet over a List/array.
I've heard it has O(1) for removing and searching for data.
I've never heard why to use a list over a HashSet.
So why vice-versa?
A List allows duplicates, a HashSet does not.
A List is ordered by its index, a HashSet has no implicit order.
Performance is often overrated; choose the right tool for the job.
They have different semantics. A list is ordered (by insert order), allows duplicates, and offers random access by index; a hash set is unordered, does not allow duplicates (it rejects them, by design), and does not offer random access. Both are perfectly valid - simply for different scenarios.
Well, for one, you can insert duplicates into a List/Array.
From the HashSet<T>.Add method documentation:
Return value (System.Boolean): true if the element is added to the HashSet object; false if the element is already present.
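A tiny sketch contrasting the two behaviours (sample values invented):

    using System;
    using System.Collections.Generic;

    class DuplicatesDemo
    {
        static void Main()
        {
            var list = new List<int> { 1, 1 };   // duplicates allowed
            Console.WriteLine(list.Count);        // 2

            var set = new HashSet<int>();
            Console.WriteLine(set.Add(1));        // True: element was added
            Console.WriteLine(set.Add(1));        // False: already present
            Console.WriteLine(set.Count);         // 1
        }
    }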
I'm very late to the party, but I want to add something to the chosen answer by turning the question on its head:
When to use HashSet over List?
When the objects in the set will certainly be unique, or you never want duplicates to be added.
You often want to look up if certain objects are in the Set.
If you often remove objects. Adding takes about the same time as adding to a List, but removing is much, much faster in a HashSet.
If you don't care about order or direct access via an index.
Keep in mind, you can "foreach" through a HashSet as well.
Rango is right that performance is sometimes overvalued; but if performance is critical (and order unimportant), HashSets can be orders of magnitude faster than Lists.

What is the fastest/safest method to iterate over a HashSet?

I'm still quite new to C#, but have noticed through forum postings the advantages of using a HashSet instead of a List in specific cases.
My current case isn't exactly that I'm storing a tremendous amount of data in a single List, but rather that I have to check for members of it often.
The catch is that I do indeed need to iterate over it as well, but the order in which the items are stored or retrieved doesn't actually matter.
I've read that foreach loops are actually slower than for loops, so how else could I go about this in the fastest way possible?
The number of .Contains() checks I'm doing is definitely hurting my performance with lists, so at least comparing to the performance of a HashSet would be handy.
Edit: I'm currently using Lists, iterating through them in numerous locations, with different code being executed in each location. Most often, the current Lists contain point coordinates that I then use to refer to a two-dimensional array, on which I do some operation or another based on the criteria of the List.
If there's not a direct answer to my question, that's fine, but I assumed there might be other methods of iterating over a HashSet than just a foreach cycle. I'm currently in the dark as to what other methods there might even be, what advantages they provide, etc. Assuming there are other methods, I also made the assumption that there would be a typical preferred method of choice that is only ignored when it doesn't suit the needs (my needs are pretty basic).
As far as premature optimization goes, I already know that using the Lists as I am is a bottleneck. How to go about helping this issue is where I'm getting stuck. Not even stuck exactly, but I didn't want to re-invent the wheel by testing repeatedly only to find out I'm already doing it the best way I could (this is a large project with over 3 months invested; Lists are everywhere, but there are definitely ones where I do not want duplicates, that hold a lot of data, that need not be stored in any specific order, etc.).
A foreach loop has a small amount of additional overhead on indexed collections (like an array), mostly because foreach does a little more bounds checking than a for loop.
HashSet does not have an indexer, so you have to use the enumerator. In this case foreach is efficient, as it only calls MoveNext() as it moves through the collection.
Also Parallel.ForEach can dramatically improve your performance, depending on the work you are doing in the loop and the size of your HashSet.
As mentioned before, profiling is your best bet.
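For illustration, a minimal sketch of a plain foreach and of Parallel.ForEach over a HashSet (the per-item work here is just a placeholder):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class IterateSet
    {
        static void Main()
        {
            var set = new HashSet<int> { 1, 2, 3, 4, 5 };

            // foreach simply drives the enumerator via MoveNext();
            // there is no cheaper way to walk a HashSet.
            foreach (var item in set)
                Console.WriteLine(item);

            // If the work per item is heavy, Parallel.ForEach may help.
            // Execution order is unspecified, which is fine for a set.
            Parallel.ForEach(set, item =>
            {
                Console.WriteLine("processed " + item);
            });
        }
    }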
You shouldn't be iterating over a HashSet in the first place to determine whether an item is in it. You should use the HashSet's own Contains method (not the LINQ extension). The HashSet is designed so that it won't need to look through every item to see whether a given value is inside the set. That is what makes it so much more powerful than a List for searching.
Not strictly answering the question in the header, but more concerning your specific problem:
I would make your own collection object that uses both a HashSet and a List internally. Iterating is fast because you can use the List; checking Contains is fast because you can use the HashSet. Just make it an IEnumerable and you can use this collection in foreach as well (see the sketch below).
The downside is more memory, but there are only twice as many references to each object, not twice as many objects. In the worst case it's only twice as much memory, and you seem much more concerned with performance.
Adding, checking, and iterating are fast this way, only removal is still O(N) because of the List.
EDIT: If removal needs to be O(1) as well, use a doubly linked list instead of a regular List, and make the HashSet a Dictionary<KeyType, Cell> instead. You can check the Dictionary for Contains, but also use it to find the cell holding the data quickly, so removal from the data structure is fast too.
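A minimal sketch of the combined collection described in this answer (the name IndexedSet is invented here; removal is omitted since, as noted, it stays O(n) unless the List is swapped for a linked list plus a Dictionary):

    using System.Collections;
    using System.Collections.Generic;

    public class IndexedSet<T> : IEnumerable<T>
    {
        private readonly List<T> _items = new List<T>();
        private readonly HashSet<T> _lookup = new HashSet<T>();

        public bool Add(T item)
        {
            // Keep both structures in sync: only add to the List
            // if the HashSet did not already contain the item.
            if (!_lookup.Add(item))
                return false;
            _items.Add(item);
            return true;
        }

        public bool Contains(T item)
        {
            return _lookup.Contains(item); // O(1) via the HashSet
        }

        public int Count
        {
            get { return _items.Count; }
        }

        // Iteration goes through the List, which is fast.
        public IEnumerator<T> GetEnumerator()
        {
            return _items.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }

Because it implements IEnumerable<T>, it works directly in foreach, while Contains stays O(1).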
I had the same issue: the HashSet suits the addition of unique elements very well, but is very slow when getting elements in a for loop. I solved it by converting the HashSet to an array and then running the for loop over that.
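That approach is just a snapshot-then-index loop; a tiny sketch:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class SnapshotLoop
    {
        static void Main()
        {
            var set = new HashSet<int> { 10, 20, 30 };

            // Snapshot once, then use a plain indexed for loop.
            int[] snapshot = set.ToArray();
            for (int i = 0; i < snapshot.Length; i++)
                Console.WriteLine(snapshot[i]);
        }
    }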

Conversion of an IEnumerable to a dictionary for performance?

I have recently seen a new trend in my firm where we change an IEnumerable to a dictionary by a simple LINQ transformation as follows:
enumerable.ToDictionary(x=>x);
We mostly end up doing this when the operation on the collection is Contains/access, and obviously a dictionary has better performance in such cases.
But I realise that converting the enumerable to a dictionary has its own cost, and I am wondering at what point it starts to break even (if it does), i.e. when the performance of IEnumerable Contains/access equals that of ToDictionary plus access/contains.
OK, I might add that there is no database access - the enumerable might be created from a database query, and that's it - and the enumerable may be edited after that too.
Also, it would be interesting to know how the data type of the key affects the performance.
The lookup might generally happen 2-5 times, but sometimes just once. I have seen things like the following.
For an enumerable:
var element = enumerable.SingleOrDefault(x => x.Id == id);
// do something if element is null, or return
For a dictionary:
if (dictionary.ContainsKey(id))
// do something if false, else return
This has been bugging me for quite some time now.
Performance of Dictionary Compared to IEnumerable
A Dictionary, when used correctly, is always faster to read from (except in cases where the data set is very small, e.g. 10 items). There can be overhead when creating it.
Given m as the number of lookups performed against the same object (these are approximate):
Performance of an IEnumerable (created from a clean list): O(mn)
This is because you need to look at all the items each time (essentially m * O(n)).
Performance of a Dictionary: O(n) + m * O(1), i.e. O(m + n)
This is because you need to insert the items first (O(n)).
In general it can be seen that the Dictionary wins when m > 1, and the IEnumerable wins when m = 1 or m = 0.
In general you should:
Use a Dictionary when doing the lookup more than once against the same dataset.
Use an IEnumerable when doing the lookup only once.
Use an IEnumerable when the data-set could be too large to fit into memory.
Keep in mind a SQL table can be used like a Dictionary, so you could use that to offset the memory pressure.
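A hedged sketch of the m = 1 versus m >> 1 cases (the Item type and the numbers are invented for illustration):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class BreakEven
    {
        class Item
        {
            public int Id;
            public string Name;
        }

        static void Main()
        {
            IEnumerable<Item> items = Enumerable.Range(0, 1000)
                .Select(i => new Item { Id = i, Name = "item " + i });

            // m = 1: a single O(n) scan; building a dictionary buys nothing.
            var one = items.FirstOrDefault(x => x.Id == 42);
            Console.WriteLine(one.Name);

            // m >> 1: pay O(n) once for ToDictionary, then each lookup is O(1).
            var byId = items.ToDictionary(x => x.Id);
            for (int id = 0; id < 500; id++)
            {
                Item found;
                if (byId.TryGetValue(id, out found))
                {
                    // ... use found ...
                }
            }
        }
    }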
Further Considerations
Dictionaries use GetHashCode() to organise their internal state. The performance of a Dictionary is strongly related to the hash code in two ways:
A poorly performing GetHashCode() results in overhead every time an item is added, looked up, or deleted.
Low-quality hash codes result in the dictionary not having O(1) lookup performance.
Most built-in .NET types (especially the value types) have very good hashing algorithms. However, with list-like types (e.g. string) GetHashCode() is O(n), because it needs to iterate over the whole string. Thus your dictionary's performance can really be seen as O(1) + M, where M is the big-O of GetHashCode().
It depends....
How long is the IEnumerable?
Does accessing the IEnumerable cause database access?
How often is it accessed?
The best thing to do would be to experiment and profile.
If you search for elements in your collection by some key very often, then the Dictionary will definitely be faster, because it is a hash-based collection and searching is many times faster. Otherwise, if you don't search through the collection a lot, the conversion is not necessary, because the time for the conversion may be greater than that of one or two searches in the collection.
IMHO: you need to measure this in your environment with representative data. In such cases I just write a quick console app that measures the execution time of the code (a sketch follows below). To get a better measure you need to execute the same code multiple times, I guess.
ADD:
It also depends on the application you are developing. Usually you gain more by spending that time and effort optimizing other places (avoiding network round-trips, caching, etc.).
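A sketch of the kind of quick console measurement described above, using Stopwatch (the collection size and lookup count are arbitrary; run it several times on your own data):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class Measure
    {
        static void Main()
        {
            List<int> data = Enumerable.Range(0, 100000).ToList();
            const int lookups = 10000;

            // Time repeated Contains against the raw list...
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < lookups; i++)
                data.Contains(i);                 // O(n) per call
            sw.Stop();
            Console.WriteLine("List.Contains:     " + sw.ElapsedMilliseconds + " ms");

            // ...and against a dictionary built once up front.
            sw.Restart();
            var dict = data.ToDictionary(x => x); // one-time O(n) build cost
            for (int i = 0; i < lookups; i++)
                dict.ContainsKey(i);              // O(1) per call
            sw.Stop();
            Console.WriteLine("ToDictionary+keys: " + sw.ElapsedMilliseconds + " ms");
        }
    }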
I'll add that you haven't told us what happens every time you "rewind" your IEnumerable<>. Is it directly backed by a data collection (for example a List<>), or is it calculated "on the fly"? If it's the first, and for small collections, enumerating them to find the wanted element is faster (a Dictionary for 3-4 elements is useless; if you want I can build some benchmark to find the breaking point). If it's the second, then you have to consider whether "caching" the IEnumerable<> in a collection is a good idea. If it is, then you can choose between a List<> and a Dictionary<>, and we return to point 1: is the IEnumerable small or big? And there is a third problem: if the collection isn't backed but is too big for memory, then clearly you can't put it in a Dictionary<>. Then perhaps it's time to make SQL work for you :-)
I'll add that "failures" have their cost: in a List<> if you try to find an element that doesn't exist, the cost is O(n), while in a Dictionary<> the cost is still O(1).

What is the difference between LinkedList and ArrayList, and when to use which one?

What is the difference between LinkedList and ArrayList? How do I know when to use which one?
The difference is the internal data structure used to store the objects.
An ArrayList will use a system array (like Object[]) and resize it when needed. On the other hand, a LinkedList will use an object that contains the data and a pointer to the next and previous objects in the list.
Different operations will have different algorithmic complexity due to this difference in the internal representation.
Don't use either. Use System.Collections.Generic.List<T>.
That really is my recommendation, probably independently of what your application is, but here's a little more color just in case you're doing something that needs a finely tuned choice here.
ArrayList and LinkedList are different implementations of the storage mechanism for a list. ArrayList uses an array that it must resize if your collection outgrows its current storage size. LinkedList, on the other hand, uses the linked-list data structure from CS 201. LinkedList is better for some head- or tail-insert-heavy workloads, but ArrayList is better for random-access workloads.
ArrayList has a good replacement, which is List<T>.
In general, List<T> is a wrapper over an array: it allows indexing and accessing items in O(1), but every time you exceed the capacity an O(n) resize cost must be paid.
LinkedList<T> won't let you access items by index, but you can count on insert always costing O(1). In addition, you can insert items at the beginning of the list and between existing items in O(1).
I think that in most cases List<T> is the default choice. Many of the common scenarios don't require special order and have no strict complexity constraints, therefore List<T> is preferred due to its usage simplicity.
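To see the array-wrapper behaviour directly, here is a small sketch that watches List<T>.Capacity grow as items are added (the exact growth numbers are an implementation detail, so treat them as illustrative):

    using System;
    using System.Collections.Generic;

    class ListGrowth
    {
        static void Main()
        {
            var list = new List<int>();
            int lastCapacity = list.Capacity;

            for (int i = 0; i < 100; i++)
            {
                list.Add(i);                      // amortized O(1)
                if (list.Capacity != lastCapacity)
                {
                    // The backing array was reallocated and copied:
                    // this is the occasional O(n) cost.
                    Console.WriteLine("resized to " + list.Capacity
                        + " at count " + list.Count);
                    lastCapacity = list.Capacity;
                }
            }

            Console.WriteLine(list[42]);          // O(1) index access
        }
    }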
The main difference between ArrayList and List<T>, LinkedList<T>, and other similar generics is that ArrayList holds Objects, while the others hold a type that you specify (i.e. List<Point> holds only Points).
Because of this, you need to cast any object you take out of an ArrayList to its actual type. This can take a lot of screen space if you have long class names.
In general it's much better to use List<T> and other typed Generics unless you really need to have a list with multiple different types of objects in it.
The difference lies in the semantics of how the List interface* is implemented:
http://en.wikipedia.org/wiki/Arraylist and http://en.wikipedia.org/wiki/LinkedList
*Meaning the basic list operations
As #sblom has stated, use the generic counterparts of LinkedList and ArrayList. There's really no reason not to, and plenty of reasons to do so.
The List<T> implementation is effectively wrapping an array. Should the user attempt to insert elements beyond the capacity of the backing array, it will be copied to a larger array (at considerable expense, but transparently to users of the List<T>).
A LinkedList<T> has a completely different implementation, in which data is held in LinkedListNode<T> instances that carry references to two other LinkedListNode<T> instances (or only one, in the case of the head or tail of the list). No external reference to mid-list items is created. This means that iterating the list is fast, but random access is slow, because one must walk the nodes from one end or the other. The best reason to use a LinkedList<T> is to allow for fast inserts, which involve simply changing the references held by the nodes, rather than rewriting the entire list to insert an item (as can be the case with List<T>).
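A short sketch of that node-based insert (sample values invented):

    using System;
    using System.Collections.Generic;

    class LinkedInsert
    {
        static void Main()
        {
            var list = new LinkedList<int>();
            LinkedListNode<int> first = list.AddFirst(1);
            list.AddLast(3);

            // O(1) insert between existing items: only the node
            // references change; nothing is shifted or copied.
            list.AddAfter(first, 2);

            Console.WriteLine(string.Join(" -> ", list)); // 1 -> 2 -> 3

            // But there is no list[i]; random access means walking
            // the nodes from one end, which is O(n).
        }
    }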
They have different performance for "inserts" (adding new elements) and lookups. For inserts, an ArrayList keeps an array internally (initially 16 items long) and doubles its size whenever you reach the maximum capacity. A LinkedList starts empty and adds a node per item as needed.
Also, with an ArrayList you are able to index the items, while with a LinkedList you have to walk to the item from the head (or the LinkedList does this for you internally).
