Dictionary performance improvement - c#

I'm trying to improve on some code that was written a while back. The function is quite important to the core functionality of the system, so I am cautious about a drastic overhaul.
I am using a dictionary to hold objects:
Dictionary<Node, int> dConnections
The object Node is in itself a complex object containing many attributes and some lists.
This dictionary could get quite large, holding around 100 or more entries.
Currently the code checks whether the dictionary contains a node like this:
dConnections.ContainsKey(Node)
So I am presuming that, to check if this node is in the dictionary, the dictionary will have to check whether the whole node and its attributes match a node in the dictionary (iterating through the dictionary until it finds a match), and that this will have a major impact on performance?
Would I be better off not using the object as the key, and rather using an object ID?

The .NET dictionary is a hash table on the inside. This means that if Node doesn't override the GetHashCode and Equals methods, when you call ContainsKey, it will match against:
Disclaimer: this is a summary. Things are a little more complicated. Please don't call me names because I oversimplified.
a partition of the hash code of the reference address of the Node object. The number of partitions depends on the number of buckets in the hash table (which depends on the total number of keys in the dictionary);
the exact reference address, if more than one Node is in the same bucket.
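As a hedged illustration of that partitioning (this is not the actual CLR source, just the general scheme a hash table uses to turn a key's hash code into a bucket index):

// Sketch only: how a bucket index is typically derived from a hash code.
static int GetBucketIndex(object key, int bucketCount)
{
    int hash = key.GetHashCode() & 0x7FFFFFFF; // clear the sign bit
    return hash % bucketCount;                 // partition by bucket count
}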
This algorithm is very efficient. When you say that you have 100 or more entries in the dictionary, it's not "a lot". It's a few.
It also means that the content of the Node object has nothing to do with the way ContainsKey will match. It will match against the exact same reference, and only against this reference.
If you implement GetHashCode and Equals yourself, be aware that their return values shouldn't change when the instance's properties change (that is, base them on immutable state). Otherwise you could well get keys in the wrong bucket, and therefore completely unreachable (without enumerating the whole dictionary).
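For illustration, a minimal sketch of a hand-rolled implementation, assuming a hypothetical Node whose identity is a single immutable id field:

public sealed class Node
{
    // Hash and equality are based only on this immutable field, so the
    // hash code can never change while the Node is used as a key.
    private readonly int id;

    public Node(int id) { this.id = id; }

    public override int GetHashCode() => id;

    public override bool Equals(object obj) =>
        obj is Node other && other.id == id;
}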

it will keep on iterating through the dictionary until it finds a match
No, dictionaries don't find matches by iterating all nodes; the hash code is obtained first and is used to limit the candidates to one, or maybe a few (depending on how good your hashing method is, and the bucket size).
So I am presuming that (to check if this node is in the dictionary) the dictionary will have to check if the whole node and its attributes match a node in the dictionary
No; for each candidate, it first checks the hash code, which is intended as a shortcut to detect non-equality vs. possible equality very quickly.
So the key here is: your Node's hashing method, aka GetHashCode. If this is complex, then another trick is to cache it the first time you need it, i.e.
int cachedHashCode;

public override int GetHashCode() {
    if (cachedHashCode == 0) {
        cachedHashCode = /* some complex code here */;
        if (cachedHashCode == 0) {
            cachedHashCode = -45; // why not... just something non-zero
        }
    }
    return cachedHashCode;
}
Note that it does still use Equals too, as the final "are they the same" check, so you obviously want Equals to be as fast as possible too - but Equals will be called relatively rarely.

Related

Are hash codes of System.Type objects of types from the same assembly guaranteed to be unique?

Clarifying edit: The keys in the dictionary are actual instances of System.Type. More specifically every value is stored with its type as the key.
In a specific part of my program the usage of Dictionary<System.Type, SomeThing> takes a large chunk of CPU time, as per the Visual Studio 2017 performance profiler.
Changing the dictionary's type to Dictionary<int, SomeThing>, and passing type.GetHashCode() instead of the type object itself, seems to be about 20%-25% faster.
The above optimization will result in a nasty bug if two types have the same hash code, but it seems plausible to me that types can have unique hash codes, at least when it comes to types from the same assembly - which all the types used in this dictionary are.
Possibly relevant information - As per this answer the number of possible types in an assembly is far smaller than the number of values represented by System.Int32.
No. The documentation on object.GetHashCode() makes no guarantees, and states:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
...
Do not use the hash code as the key to retrieve an object from a keyed collection.
This is because equal hash codes are necessary, but not sufficient, for two objects to be equal.
If you're wondering if Type.GetHashCode() follows a more restrictive definition, its documentation makes no mention of such a change, so it still does not guarantee uniqueness. The reference source does not show any attempt to make this guarantee, either.
A hash code is never guaranteed to be unique for different values, so you should not use it the way you are.
The same value should, however, always generate the same hash code.
This is also stated in MSDN:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash codes.
and somewhat further:
Do not use the hash code as the key to retrieve an object from a keyed collection.
Therefore, I would not rely on GetHashCode being unique across different types either, but at least you can verify it:
Dictionary<int, string> s = new Dictionary<int, string>();
// Pick any type from the assembly you want to inspect; typeof(int) gives mscorlib.
var types = typeof(int).Assembly.GetTypes();
Console.WriteLine($"Inspecting {types.Length} types...");
foreach (var t in types)
{
    if (s.ContainsKey(t.GetHashCode()))
    {
        Console.WriteLine($"{t.Name} has the same hashcode as {s[t.GetHashCode()]}");
    }
    else
    {
        s.Add(t.GetHashCode(), t.Name);
    }
}
Console.WriteLine("done!");
But even if the above test concluded that there are no collisions, I wouldn't do it, since the implementation of GetHashCode can change over time, which means collisions might become possible in the future.
A hash code isn't meant to be unique. Instead, it is used in hash-based collections such as Dictionary in order to limit the number of possible ambiguities. A hash code is nothing but an index: instead of searching the entire collection for a match, only the few items that share a common value - the hash code - have to be considered.
In fact, you could even have a hash implementation that always returns the same number for every item. However, that leads to O(n) lookups in your dictionary, as every key has to be compared.
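A deliberately degenerate sketch (the type is hypothetical, for illustration only):

public sealed class DegenerateKey
{
    public string Value { get; }
    public DegenerateKey(string value) { Value = value; }

    // Legal but terrible: every instance lands in the same bucket,
    // so dictionary lookups degrade from O(1) to O(n).
    public override int GetHashCode() => 42;

    public override bool Equals(object obj) =>
        obj is DegenerateKey other && other.Value == Value;
}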
Anyway, you shouldn't strive for micro-optimizations that save you a few nanoseconds in exchange for maintainability and understandability. You should instead use a data structure that gets the job done and is easy to understand.

Complexity of searching in a list and in a dictionary

Let's say I have a class:
class C
{
    public int uniqueField;
    public int otherField;
}
This is very simplified version of the actual problem. I want to store multiple instances of this class, where "uniqueField" should be unique for each instance.
What is better in this case?
a) Dictionary with uniqueField as the key
Dictionary<int, C> d;
or b) List?
List<C> l;
In the first case (a), the same data would be stored twice (as the key and as a field of the class instance). But the question is: is it faster to find an element in a dictionary than in a list? Or are they equally fast?
a)
d[searchedUniqueField]
b)
l.Find(x => x.uniqueField == searchedUniqueField);
Assuming you've got quite a lot of instances, it's likely to be much faster to find the item in the dictionary. Basically, a Dictionary<,> is a hash table, with O(1) lookup except where collisions occur.
Now if the collection is really small, then the extra overhead of finding the hash code, computing the right bucket and then looking through that bucket for matching hash codes, then performing a key equality check can take longer than just checking each element in a list.
If you might have a lot of instances but might not, I'd usually pick the dictionary approach. For one thing it expresses what you're actually trying to achieve: a simple way of accessing an element by a key. The overhead for small collections is unlikely to be very significant unless you have far more small collections than large ones.
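As a rough, hedged illustration (timings vary by machine and runtime; this is a sketch, not a rigorous benchmark), you could measure the difference yourself:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class C
{
    public int uniqueField;
    public int otherField;
}

class LookupComparison
{
    static void Main()
    {
        const int n = 10_000;
        var list = Enumerable.Range(0, n)
                             .Select(i => new C { uniqueField = i })
                             .ToList();
        var dict = list.ToDictionary(c => c.uniqueField);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            var found = dict[i];                            // hash lookup: O(1) each
        }
        Console.WriteLine($"Dictionary: {sw.Elapsed}");

        sw.Restart();
        for (int i = 0; i < n; i++)
        {
            var found = list.Find(x => x.uniqueField == i); // linear scan: O(n) each
        }
        Console.WriteLine($"List.Find:  {sw.Elapsed}");
    }
}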
Use Dictionary when the number of lookups greatly exceeds the number of insertions. It is fine to use List when you will always have fewer than four items.
Reference - http://www.dotnetperls.com/dictionary-time
If you want to ensure that clients cannot create duplicate keys, you may want the class itself to be responsible for creating the unique key. Once unique-key generation is the responsibility of the class, whether to use a dictionary or a list is the client's decision.

C# foreach loop - is order *stability* guaranteed?

Suppose I have a given collection. Without ever changing the collection in any way, I loop through its contents twice with a foreach. Barring cosmic rays and what not, is it absolutely guaranteed that the order will be consistent in both loops?
Alternatively, given a HashSet<string> with a number of elements, what can cause the output from the commented lines in the following to be unequal:
{
    var mySet = new HashSet<string>();
    // Some code which populates the HashSet<string>

    // Output1
    printContents(mySet);

    // Output2
    printContents(mySet);
}

public void printContents(HashSet<string> set) {
    foreach (var element in set) {
        Console.WriteLine(element);
    }
}
It would be helpful if I could get a general answer explaining what causes an implementation to not meet the criteria described above. Specifically, though, I am interested in Dictionary, List and arrays.
Array enumeration guarantees order.
List and List<T> are expected to provide stable order (since they are expected to implement sequentially-indexed elements).
Dictionary and HashSet explicitly do not guarantee order. It is very unlikely that two consecutive iterations will return items in a different order, but there are no guarantees or expectations. One should not expect any particular order.
Sorted versions of Dictionary/HashSet return items in sort order.
Other IEnumerable objects are free to do whatever they want. Normally one implements iterators in such a way that they match users' expectations: enumeration of something that has an implicit order should be stable, and if an explicit order is provided, it is expected to be stable. A query against a database that does not specify an order should be expected to return items in semi-random order.
Check this question for links: Does the foreach loop in C# guarantee an order of evaluation?
Everything that implements IEnumerable<T> does so in its own way. There is no general guarantee that any given collection must ensure stability.
If you are referring specifically to Collection<T> (http://msdn.microsoft.com/en-us/library/ms132397.aspx) I don't see any specific guarantee in its MSDN reference that ordering is consistent.
Will it probably be consistent? Yes. Is there a written guarantee? Not that I can find.
For many of the C# collections there are sorted versions: a HashSet is to a SortedSet as a Dictionary is to a SortedDictionary. If you're working with something where the order isn't important, like Dictionary, then you can't assume the loop order will behave the same way every time.
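For example, a small sketch contrasting the two (the order printed by the first line is whatever the implementation happens to produce; only the sorted version promises an order):

using System;
using System.Collections.Generic;

class OrderDemo
{
    static void Main()
    {
        var hashed = new HashSet<string> { "pear", "apple", "mango" };
        var sorted = new SortedSet<string> { "pear", "apple", "mango" };

        // Implementation-defined order; do not rely on it.
        Console.WriteLine(string.Join(", ", hashed));

        // Guaranteed comparer order: apple, mango, pear.
        Console.WriteLine(string.Join(", ", sorted));
    }
}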
As per your example with HashSet<T>, we now have source code to check: HashSet:Enumerator
As it is, the Slot[] set.m_slots array is iterated.
The array object is only changed in the methods TrimExcess, Initialize (both of which are only called in the constructor), OnDeserialization, and SetCapacity (only called by AddIfNotPresent and AddOrGetLocation).
The values of m_slots are only changed in methods that change elements of the HashSet (Clear, Remove, AddIfNotPresent, IntersectWith, SymmetricExceptWith).
So yes, if nothing touches the set, it enumerates in the same order.
Dictionary:Enumerator works in quite the same way, iterating an Entry[] entries that only changes when such non-readonly methods are called.

What happens when hash collision happens in Dictionary key?

I've been coding in C++ and Java my entire life, but C# feels like a totally different animal.
In the case of a hash collision in the Dictionary container in C#, what does it do? Does it even detect the collision?
With similar containers in the STL, some implementations chain colliding entries off the bucket like a linked list, while others attempt to find a different hash (rehashing).
[Update 10:56 A.M. 6/4/2010]
I am trying to keep a counter per user. The number of users is not fixed; it can both increase and decrease. And I'm expecting the size of the data to be over 1000.
So, I want:
fast access, preferably not O(n) - it's important that I get close to O(1), because I need to make sure I can force log off people before they are able to execute something silly;
dynamic growth and shrinking;
unique data.
A hash map was my solution, and it seems Dictionary is the closest thing to a hash map in C#...
Hash collisions are correctly handled by Dictionary<> - in that so long as an object implements GetHashCode() and Equals() correctly, the appropriate instance will be returned from the dictionary.
First, you shouldn't make any assumptions about how Dictionary<> works internally - that's an implementation detail that is likely to change over time. Having said that....
What you should be concerned with is whether the types you are using for keys implement GetHashCode() and Equals() correctly. The basic rules are that GetHashCode() must return the same value for the lifetime of the object, and that Equals() must return true when two instances represent the same object. Unless you override it, Equals() uses reference equality - which means it only returns true if two objects are actually the same instance. You may override how Equals() works, but then you must ensure that two objects that are 'equal' also produce the same hash code.
From a performance standpoint, you may also want to provide an implementation of GetHashCode() that generates a good spread of values to reduce the frequency of hash code collisions. The primary downside of hash code collisions is that they reduce the dictionary to a list in terms of performance. Whenever two different object instances yield the same hash code, they are stored in the same internal bucket of the dictionary. The result is that a linear scan must be performed, calling Equals() on each instance until a match is found.
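As an illustrative sketch (the type and fields here are hypothetical), a common pattern for combining several immutable fields into a well-spread hash code:

public sealed class Person
{
    private readonly string name;
    private readonly int age;

    public Person(string name, int age) { this.name = name; this.age = age; }

    public override bool Equals(object obj) =>
        obj is Person p && p.name == name && p.age == age;

    public override int GetHashCode()
    {
        // Classic prime-multiply combination; unchecked allows overflow to wrap.
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (name?.GetHashCode() ?? 0);
            hash = hash * 31 + age;
            return hash;
        }
    }
}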
According to this article at MSDN, in case of a hash collision the Dictionary class converts the bucket into a linked list. The older HashTable class, on the other hand, uses rehashing.
I offer an alternative, code-oriented answer that demonstrates a Dictionary exhibiting exception-free and functionally correct behavior when two items with different keys are added and those keys produce the same hashcode.
On .Net 4.6 the strings "699391" and "1241308" produce the same hashcode. What happens in the following code?
myDictionary.Add( "699391", "abc" );
myDictionary.Add( "1241308", "def" );
The following code demonstrates that a .Net Dictionary accepts different keys that cause a hash collision. No exception is thrown and dictionary key lookup returns the expected object.
var hashes = new Dictionary<int, string>();
var collisions = new List<string>();

for (int i = 0; ; ++i)
{
    string st = i.ToString();
    int hash = st.GetHashCode();
    if (hashes.TryGetValue(hash, out string collision))
    {
        // On .Net 4.6 we find "699391" and "1241308".
        collisions.Add(collision);
        collisions.Add(st);
        break;
    }
    else
    {
        hashes.Add(hash, st);
    }
}

Debug.Assert(collisions[0] != collisions[1],
    "Check we have produced two different strings");
Debug.Assert(collisions[0].GetHashCode() == collisions[1].GetHashCode(),
    "Prove we have different strings producing the same hashcode");

var newDictionary = new Dictionary<string, string>();
newDictionary.Add(collisions[0], "abc");
newDictionary.Add(collisions[1], "def");

Console.Write("If we get here without an exception being thrown, it demonstrates a dictionary accepts multiple items with different keys that produce the same hash value.");

Debug.Assert(newDictionary[collisions[0]] == "abc");
Debug.Assert(newDictionary[collisions[1]] == "def");
Check this link for a good explanation: An Extensive Examination of Data Structures Using C# 2.0
Basically, .NET generic dictionary chains items with the same hash value.

When should I use the HashSet<T> type?

I am exploring the HashSet<T> type, but I don't understand where it stands in collections.
Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see individual access to its elements.
Is it only for enumeration?
The important thing about HashSet<T> is right there in the name: it's a set. The only things you can do with a single set are to establish what its members are, and to check whether an item is a member.
Asking if you can retrieve a single element (e.g. set[45]) is misunderstanding the concept of the set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.
It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.
Sets are quite limited, holding only unique members. On the other hand, they're really fast.
Here's a real example of where I use a HashSet<string>:
Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a # or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a #xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except the existence of the command in the set of valid commands. Let's look at the alternatives I faced (a minimal sketch of the HashSet approach follows the list):
Dictionary<string, ?>: What type do I use for the value? The value is meaningless, since I'm just going to use ContainsKey. Note: before .NET 3.5 this was the only choice for O(1) lookups - HashSet<T> was added in 3.5 and extended to implement ISet<T> in 4.0.
List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
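Here is the minimal sketch promised above (the command names are illustrative, not the actual highlighter code):

using System;
using System.Collections.Generic;

class CommandValidator
{
    // The fixed set of valid Doxygen-style commands (an illustrative subset).
    private static readonly HashSet<string> validCommands =
        new HashSet<string> { "param", "return", "brief", "see" };

    // O(1) membership check: gray if valid, red if not.
    public static bool IsValid(string tokenText) =>
        validCommands.Contains(tokenText);

    static void Main()
    {
        Console.WriteLine(IsValid("param")); // True  -> gray
        Console.WriteLine(IsValid("xyzzy")); // False -> red
    }
}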
A HashSet<T> implements the ICollection<T> interface:
public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
    // Methods
    void Add(T item);
    void Clear();
    bool Contains(T item);
    void CopyTo(T[] array, int arrayIndex);
    bool Remove(T item);

    // Properties
    int Count { get; }
    bool IsReadOnly { get; }
}
A List<T> implements IList<T>, which extends ICollection<T>:
public interface IList<T> : ICollection<T>
{
    // Methods
    int IndexOf(T item);
    void Insert(int index, T item);
    void RemoveAt(int index);

    // Properties
    T this[int index] { get; set; }
}
A HashSet has set semantics, implemented via a hashtable internally:
A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
What does the HashSet gain, if it loses index/position/list behavior?
Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).
A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> by only adding/removing keys as values, and ignoring the dictionary values themselves. You would expect a dictionary's keys not to contain duplicates, and that's the point of the "Set" part.
Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates - and that's when you want a Set.
HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set cannot be used to replace a list (unless you should've used a set in the first place).
If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10,000 revisions of a software project, and you want to find out how many people contributed to it. You could use a Set<string>, iterate over the list of revisions, and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
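A quick sketch of that idea (the Revision type here is hypothetical):

using System;
using System.Collections.Generic;

class Revision
{
    public string Author; // hypothetical shape of a revision record
}

class ContributorCount
{
    static int CountContributors(IEnumerable<Revision> revisions)
    {
        var authors = new HashSet<string>();
        foreach (var rev in revisions)
            authors.Add(rev.Author); // duplicates are silently ignored
        return authors.Count;
    }

    static void Main()
    {
        var revs = new List<Revision>
        {
            new Revision { Author = "alice" },
            new Revision { Author = "bob" },
            new Revision { Author = "alice" },
        };
        Console.WriteLine(CountContributors(revs)); // 2
    }
}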
HashSet would be used to remove duplicate elements in an IEnumerable collection. For example,
List<string> duplicatedEnumerableStrings = new List<string> { "abc", "ghjr", "abc", "abc", "yre", "obm", "ghjr", "qwrt", "abc", "vyeu" };
HashSet<string> uniqueStrings = new HashSet<string>(duplicatedEnumerableStrings);
After this code runs, uniqueStrings holds { "abc", "ghjr", "yre", "obm", "qwrt", "vyeu" }.
Probably the most common use for hash sets is to see whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hashing function), as opposed to lists, for which the inclusion check is O(n) (and sorted sets, for which it is O(log n)). So if you do a lot of checks for whether an item is contained in some list, hash sets might be a performance improvement. If you only ever iterate over them, there won't be much difference (iterating over the whole set is O(n), same as with lists, and hash sets have somewhat more overhead when adding items).
And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, and which second etc.
HashSet<T> is a data structure in the .NET framework that is capable of representing a mathematical set as an object. In this case, it uses hash codes (the GetHashCode result of each item) to compare the equality of set elements.
A set differs from a list in that it only allows one occurrence of the same element contained within it. HashSet<T> will just return false if you try to add a second identical element. Indeed, lookup of elements is very quick (O(1) time), since the internal data structure is simply a hashtable.
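That "returns false" behavior is easy to see in a minimal sketch:

using System;
using System.Collections.Generic;

class AddDemo
{
    static void Main()
    {
        var set = new HashSet<int>();
        Console.WriteLine(set.Add(1)); // True  - inserted
        Console.WriteLine(set.Add(1)); // False - already present; set unchanged
        Console.WriteLine(set.Count);  // 1
    }
}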
If you're wondering which to use, note that using a List<T> where a HashSet<T> is appropriate is not the biggest mistake, though it may potentially allow problems where you have undesirable duplicate items in your collection. What is more, lookup (item retrieval) is vastly more efficient - ideally O(1) (for perfect bucketing) instead of O(n) time - which is quite important in many scenarios.
List<T> is used to store ordered sets of information. If you know the relative order of the elements of the list, you can access them in constant time. However, to determine where an element lies in the list, or to check if it exists in the list, the lookup time is linear. On the other hand, HashSet<T> makes no guarantees about the order of the stored data and consequently provides constant access time for its elements.
As the name implies, HashSet<T> is a data structure that implements set semantics. The data structure is optimized to implement set operations (i.e. Union, Difference, Intersect), which cannot be done as efficiently with the traditional List implementation.
So, which data type to choose really depends on what you are attempting to do in your application. If you don't care about how your elements are ordered in a collection, and only want to enumerate or check for existence, use HashSet<T>. Otherwise, consider using List<T> or another suitable data structure.
In the basic intended scenario HashSet<T> should be used when you want more specific set operations on two collections than LINQ provides. LINQ methods like Distinct, Union, Intersect and Except are enough in most situations, but sometimes you may need more fine-grained operations, and HashSet<T> provides:
UnionWith
IntersectWith
ExceptWith
SymmetricExceptWith
Overlaps
IsSubsetOf
IsProperSubsetOf
IsSupersetOf
IsProperSupersetOf
SetEquals
Another difference between LINQ and HashSet<T> "overlapping" methods is that LINQ always returns a new IEnumerable<T>, and HashSet<T> methods modify the source collection.
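A small sketch of that difference:

using System;
using System.Collections.Generic;
using System.Linq;

class SetOpsDemo
{
    static void Main()
    {
        var a = new HashSet<int> { 1, 2, 3 };
        var b = new HashSet<int> { 2, 3, 4 };

        // LINQ: returns a new sequence and leaves 'a' untouched.
        var intersection = a.Intersect(b).ToList();
        Console.WriteLine(string.Join(", ", intersection)); // 2, 3
        Console.WriteLine(a.Count);                         // still 3

        // HashSet: modifies 'a' in place.
        a.IntersectWith(b);
        Console.WriteLine(string.Join(", ", a));            // 2, 3
    }
}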
In short - anytime you are tempted to use a dictionary whose value carries no information (or a Dictionary<S, T> where S is a property of T), you should consider a HashSet<T> (or a HashSet<T> plus implementing IEquatable<T> on T so that it equates on S).
