Dictionary KeyCollection Strangeness - C#

I have a mysterious situation involving a dictionary where I enumerate keys from the dictionary, but the dictionary doesn't contain some of the keys it says it contains.
Dictionary<uint, float> dict = GetDictionary(); // Gets values, 6268 pairs
foreach(uint key in dict.Keys)
{
if (!dict.ContainsKey(key))
Console.WriteLine("Wat? "+key);
}
The above will print two of the 6268 keys. Nothing special about those two keys, both positive values smaller than Int32.MaxValue (369099203 and 520093968).
A check on the counts reveals this:
Console.WriteLine(dict.Count); // 6268
Console.WriteLine(dict.Keys.Count()); // 6268
Console.WriteLine(dict.Keys.Count(dict.Keys.Contains)); // 6266
This is single-threaded .NET 4 code running under the .NET 4.5 CLR. The dictionary is a vanilla Dictionary<uint, float>, i.e. there is no custom equality comparer. I assume there is a hash problem occurring because of the uint/int difference, but shouldn't ContainsKey(key) be guaranteed to return true for every key returned by the dictionary's Keys collection? Especially in the lower snippet, where I only look at the KeyCollection object: the total count and the count of contained objects disagree, which feels like odd ICollection behavior.
Edit:
As expected, there is a reasonable explanation: the collection was modified by two concurrent threads during its initialization. When something "sometimes breaks", it is usually a threading issue, and sure enough it was. Accessing a dictionary from several threads can apparently corrupt its internal state enough for it to be only semi-functional for the remainder of its lifetime, without ever throwing an exception.
I'm going to switch to a concurrent dictionary, and probably delete this question. Thanks.

I don't have enough rep to comment, but I tried to reproduce your issue to no avail. I would suggest that you post how GetDictionary() works, and also that you NOT iterate through the dictionary like that; do the following instead and see if it fixes things:
foreach (KeyValuePair<uint, float> pair in dict)
Console.WriteLine("[" + pair.Key + "]=" + pair.Value);

Is there a chance that GetDictionary() passes a custom key equality comparer when constructing the dictionary? If so, the problem may be related to the comparer implementation.

Important note: When I refer to GetHashCode throughout this post, I am referring to the result of IEqualityComparer<T>.GetHashCode. By default, the dictionary will use EqualityComparer<T>.Default, which will return the result of calling GetHashCode on the key itself. However, you can provide a specific implementation of IEqualityComparer<T> at the time the dictionary is created to use a different behavior.
This can happen if the result of GetHashCode for a key changes between the time the value is added to the dictionary and the point where you enumerate the keys. When you enumerate the keys, the dictionary returns the keys for all populated "buckets" in its internal array. However, when you look up a specific key, it recalculates the expected bucket from the result of GetHashCode for that key. If the hash code has changed, the actual location of the key/value pair in the dictionary's buckets and the expected location may no longer be the same, in which case ContainsKey returns false.
You should make sure that the result of GetHashCode for the keys in the dictionary can't change after a value is added to the dictionary for the key.
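Here is a minimal sketch of that failure mode (the MutableKey class is hypothetical, not taken from the question): a key whose hash code changes after insertion can still be enumerated from Keys, yet fails ContainsKey, just like the symptoms described above.

using System;
using System.Collections.Generic;

class MutableKey
{
    public int Value;                                   // mutating this changes the hash code
    public override int GetHashCode() => Value;
    public override bool Equals(object obj) => obj is MutableKey k && k.Value == Value;
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<MutableKey, string>();
        var key = new MutableKey { Value = 1 };
        dict.Add(key, "hello");

        key.Value = 2;                                  // hash code no longer matches the bucket it was stored in

        Console.WriteLine(dict.ContainsKey(key));       // False - the wrong bucket is probed
        foreach (var k in dict.Keys)                    // enumeration still yields the key...
            Console.WriteLine(dict.ContainsKey(k));     // ...but ContainsKey(k) is False
    }
}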

I encountered similar weird behaviour with System.Uri too.
It turned out to be an architecture mismatch between the key that was stored in the dictionary and the key that I was using for the lookup. In particular, the Uri stored in the dictionary was 32-bit, while the one I was looking up was 64-bit. Since GetHashCode() is not guaranteed to produce the same value on different architectures, the dictionary was unable to match the keys.

Related

Are hash codes of System.Type objects of types from the same assembly guaranteed to be unique?

Clarifying edit: The keys in the dictionary are actual instances of System.Type. More specifically every value is stored with its type as the key.
In a specific part of my program the usage of Dictionary<System.Type, SomeThing> takes a large chunk of CPU time, as per Visual Studio 2017 performance profiler.
Changing the dictionary's type to Dictionary<int, SomeThing>, and passing type.GetHashCode() instead of the type object directly, seems to be about 20%-25% faster.
The above optimization will result in a nasty bug if two types have the same hash code, but it seems plausible to me that types can have unique hash codes, at least when it comes to types from the same assembly - which all the types used in this dictionary are.
Possibly relevant information - As per this answer the number of possible types in an assembly is far smaller than the number of values represented by System.Int32.
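For context, a minimal sketch of the pattern described in the question (SomeThing is a hypothetical stand-in for the value type):

using System;
using System.Collections.Generic;

class SomeThing
{
    public string Name;
}

class Demo
{
    static void Main()
    {
        // Original, safe version: keyed by the Type itself.
        var byType = new Dictionary<Type, SomeThing>();
        byType[typeof(string)] = new SomeThing { Name = "string handler" };

        // "Optimized" version from the question: keyed by the type's hash code.
        // This silently misbehaves if two types ever share a hash code.
        var byHash = new Dictionary<int, SomeThing>();
        byHash[typeof(string).GetHashCode()] = new SomeThing { Name = "string handler" };

        Console.WriteLine(byType[typeof(string)].Name);
        Console.WriteLine(byHash[typeof(string).GetHashCode()].Name);
    }
}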
No. The documentation on object.GetHashCode() makes no guarantees, and states:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
...
Do not use the hash code as the key to retrieve an object from a keyed collection.
Because equal hash codes are necessary, but not sufficient, for two objects to be equal.
If you're wondering if Type.GetHashCode() follows a more restrictive definition, its documentation makes no mention of such a change, so it still does not guarantee uniqueness. The reference source does not show any attempt to make this guarantee, either.
A hash code is never guaranteed to be unique for different values, so you should not use it the way you are doing.
The same value should, however, always generate the same hash code.
This is also stated in MSDN:
Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash codes.
and somewhat further:
Do not use the hash code as the key to retrieve an object from a keyed collection.
Therefore, I would not rely on GetHashCode being unique for different types, but at least you can verify it:
// Check every type in an assembly (mscorlib here, via typeof(int)) for hash code collisions.
Dictionary<int, string> s = new Dictionary<int, string>();
var types = typeof(int).Assembly.GetTypes();
Console.WriteLine($"Inspecting {types.Length} types...");
foreach (var t in types)
{
    if (s.ContainsKey(t.GetHashCode()))
    {
        Console.WriteLine($"{t.Name} has the same hash code as {s[t.GetHashCode()]}");
    }
    else
    {
        s.Add(t.GetHashCode(), t.Name);
    }
}
Console.WriteLine("done!");
But even if the above test concluded that there are no collisions, I wouldn't do it, since the implementation of GetHashCode can change over time, which means that collisions might become possible in the future.
A hash code isn't meant to be unique. Instead, it is used in hash-based collections such as Dictionary in order to limit the number of candidates that have to be compared. A hash code is nothing but an index, so instead of searching the entire collection for a match, only the few items that share a common value - the hash code - have to be considered.
In fact, you could even have a hash implementation that always returns the same number for every item. However, that leads to O(n) lookups in your dictionary, as every key has to be compared (see the sketch below).
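To illustrate, here is a small sketch (ConstantHashComparer is hypothetical) where every key hashes to the same value: lookups remain correct, but they degrade to a linear scan of a single bucket.

using System;
using System.Collections.Generic;

// A comparer whose hash function always returns the same value: still correct,
// but every key ends up in one bucket, so lookups become a linear scan.
class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y);
    public int GetHashCode(string obj) => 42;
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<string, int>(new ConstantHashComparer());
        for (int i = 0; i < 1000; i++)
            dict["key" + i] = i;

        // The answer is still correct, but it is found by comparing keys one by
        // one inside the single shared bucket - effectively O(n).
        Console.WriteLine(dict["key999"]);
    }
}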
Anyway, you shouldn't strive for micro-optimizations that gain you a few nanoseconds in exchange for maintainability and understandability. You should instead use a data structure that gets the job done and is easy to understand.

ILookUp vs. Dictionary [duplicate]

I'm trying to wrap my head around which data structures are the most efficient and when / where to use which ones.
Now, it could be that I simply just don't understand the structures well enough, but how is an ILookup(of key, ...) different from a Dictionary(of key, list(of ...))?
Also where would I want to use an ILookup and where would it be more efficient in terms of program speed / memory / data accessing, etc?
Two significant differences:
Lookup is immutable. Yay :) (At least, I believe the concrete Lookup class is immutable, and the ILookup interface doesn't provide any mutating members. There could be other mutable implementations, of course.)
When you lookup a key which isn't present in a lookup, you get an empty sequence back instead of a KeyNotFoundException. (Hence there's no TryGetValue, AFAICR.)
They're likely to be equivalent in efficiency - the lookup may well use a Dictionary<TKey, GroupingImplementation<TValue>> behind the scenes, for example. Choose between them based on your requirements. Personally I find that the lookup is usually a better fit than a Dictionary<TKey, List<TValue>>, mostly due to the first two points above.
Note that as an implementation detail, the concrete implementation of IGrouping<,> which is used for the values implements IList<TValue>, which means that it's efficient to use with Count(), ElementAt() etc.
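For illustration, a small sketch of the missing-key difference mentioned above (the sample data is made up):

using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var words = new[] { "apple", "avocado", "banana" };

        ILookup<char, string> lookup = words.ToLookup(w => w[0]);
        Console.WriteLine(lookup['a'].Count());   // 2
        Console.WriteLine(lookup['z'].Count());   // 0 - empty sequence, no exception

        var dict = words.GroupBy(w => w[0])
                        .ToDictionary(g => g.Key, g => g.ToList());
        Console.WriteLine(dict['a'].Count);       // 2
        // dict['z'] would throw KeyNotFoundException
    }
}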
Interesting that nobody has stated the actual biggest difference (taken directly from MSDN):
A Lookup resembles a Dictionary. The difference is that a Dictionary maps keys to single values, whereas a Lookup maps keys to collections of values.
Both a Dictionary<Key, List<Value>> and a Lookup<Key, Value> logically can hold data organized in a similar way and both are of the same order of efficiency. The main difference is a Lookup is immutable: it has no Add() methods and no public constructor (and as Jon mentioned you can query a non-existent key without an exception and have the key as part of the grouping).
As to which do you use, it really depends on how you want to use them. If you are maintaining a map of key to multiple values that is constantly being modified, then a Dictionary<Key, List<Value>> is probably better since it is mutable.
If, however, you have a sequence of data and just want a read-only view of the data organized by key, then a lookup is very easy to construct and will give you a read-only snapshot.
Another difference not mentioned yet is that Lookup() supports null keys:
Lookup class implements the ILookup interface. Lookup is very similar to a dictionary except multiple values are allowed to map to the same key, and null keys are supported.
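A short sketch of the null-key difference (the example data is hypothetical): ToLookup groups a null element under a null key, while Dictionary rejects null keys.

using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var names = new[] { "alice", null, "bob" };

        // ToLookup happily groups the null element under a null key.
        ILookup<string, string> lookup = names.ToLookup(n => n);
        Console.WriteLine(lookup[null].Count());   // 1

        // Dictionary keys may not be null.
        var dict = new Dictionary<string, string>();
        // dict.Add(null, "x");                    // would throw ArgumentNullException
        Console.WriteLine(dict.Count);             // 0
    }
}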
The primary difference between an ILookup<K,V> and a Dictionary<K, List<V>> is that a dictionary is mutable; you can add or remove keys, and also add or remove items from the list that is looked up. An ILookup is immutable and cannot be modified once created.
The underlying implementation of both mechanisms will be either the same or similar, so their searching speed and memory footprint will be approximately the same.
When an exception is not an option, go for Lookup
If you are trying to get a structure as efficient as a Dictionary but you don't know for sure that there are no duplicate keys in the input, Lookup is safer.
As mentioned in another answer, it also supports null keys, and it always returns a valid result when queried with arbitrary data, so it is more resilient to unknown input (less prone than Dictionary to raise exceptions).
This is especially true if you compare it to the System.Linq.Enumerable.ToDictionary function:
// won't throw
new[] { 1, 1 }.ToLookup(x => x);
// System.ArgumentException: An item with the same key has already been added.
new[] { 1, 1 }.ToDictionary(x => x);
The alternative would be to write your own duplicate key management code inside of a foreach loop.
Performance considerations: Dictionary is a clear winner
If you don't need a list and you are going to manage a huge number of items, Dictionary (or even your own custom tailored structure) would be more efficient:
Stopwatch stopwatch = new Stopwatch();
var list = new List<string>();
for (int i = 0; i < 5000000; ++i)
{
    list.Add(i.ToString());
}

stopwatch.Start();
var lookup = list.ToLookup(x => x);
stopwatch.Stop();
Console.WriteLine("Lookup creation: " + stopwatch.Elapsed);

stopwatch.Restart();
var dictionary = list.ToDictionary(x => x);
stopwatch.Stop();
Console.WriteLine("Dictionary creation: " + stopwatch.Elapsed);
As a Lookup has to maintain a list of items for each key, its creation is slower than a Dictionary's (around 3x slower for a huge number of items).
Lookup speed:
Creation: 00:00:01.5760444
Dictionary speed:
Creation: 00:00:00.4418833

How can I know if I should access value by index or key in an OrderedDictionary in C#?

All this time I was using Dictionary to store key/value pairs, until I came across this class called OrderedDictionary, which has the additional feature of accessing data through an index.
So, I wanted to know when I could/would run into any situation that would require me to access a value through its index when I already have the key. I have a small snippet below.
OrderedDictionary od = new OrderedDictionary();
od.Add("Key1", "Val1");
od.Add("Key2", "Val2");
od.Add("Key3", "Val3");
od.Add("Key4", "Val4");
The code above may not seem appropriate, but I would really appreciate it if someone could give a better one to answer my question.
Many Thanks!
I wanted to know when I could/would run into any situation that would require me to access a value through its index when I already have the key
I follow the YAGNI principle - You Aren't Gonna Need It. If you already know the key, then what value is there in accessing by index? The point of a dictionary is to do FAST lookups by key (by not scanning the entire collection). With an OrderedDictionary, lookups are still fast, but inserts and updates are marginally slower because the structure must keep the keys and indices in sync. Plus, the current framework implementation is not generic, so you'll have to do more casting, but there are plenty of 3rd party generic implementations out there. The fact that MS did not create a generic implementation may tell you something about the value of that type overall.
So the situation you "could" run into is needing to access the values in key order. In that case you'll need to decide if you do that often enough to warrant the overhead of an OrderedDictionary or if you can just use Linq queries to order the items outside of the structure.
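For reference, a minimal sketch of the two access styles OrderedDictionary offers (note the non-generic API, so casts are needed):

using System;
using System.Collections.Specialized;

class Demo
{
    static void Main()
    {
        var od = new OrderedDictionary();
        od.Add("Key1", "Val1");
        od.Add("Key2", "Val2");
        od.Add("Key3", "Val3");

        string byKey = (string)od["Key2"];          // lookup by key
        string byIndex = (string)od[1];             // lookup by insertion index (same entry)
        Console.WriteLine(byKey + " " + byIndex);   // Val2 Val2
    }
}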
Theory of Hash based collections
Before choosing between Dictionary and OrderedDictionary, let's look how some collections are built.
Arrays
Arrays provide constant-time access when you know the index of your value.
So keys must be integers. If you don't know the index, you must traverse the full collection to find the value you're looking for.
Dictionaries
The purpose of a dictionary is to provide (relatively) constant-time access to any value in it when the key is not an integer. However, since there is not always a perfect hash function to derive an integer from a value, hash codes will collide, and when several values share the same hash code they are stored together in an array (a bucket). Searching for these collided values is slower (since the bucket must be traversed).
OrderedDictionary
OrderedDictionary is kind of a mix between the two previous collections. Index search will/should be the fastest (however, you need to profile to be sure of that point). The problem with index search is that, apart from special cases, you don't know the index at which your value was stored, so you must rely on the key. Which makes me wonder: why would you need an OrderedDictionary?
As one comment implies, I would be very interested to know what's your use case for such a collection. Most of the times, you either know the index or don't know it, because it relies on the value nature. So you should either use an array or a Dictionary, not a mix of both.
Two very different use cases:
KeyValuePair<string, string>[] values = new KeyValuePair<string, string>[4];
values[0] = new KeyValuePair<string, string>("Key1", "Value1");
// And so on...

// Or
Dictionary<string, Person> persons = new Dictionary<string, Person>();
var asker = new Person { FirstName = "pradeep", LastName = "pradyumna" };
persons.Add(asker.FirstName, asker);
// Later in the code, you cannot know the index of the person without having the person instance.

What happens when hash collision happens in Dictionary key?

I've been coding in C++ and Java for the entirety of my life, but C# feels like a totally different animal.
In the case of a hash collision in the Dictionary container in C#, what does it do? Does it even detect the collision?
When collisions occur in similar containers in the STL, some implementations chain the colliding entries off the bucket (linked-list style), and some attempt a different hash method (rehashing).
[Update 10:56 A.M. 6/4/2010]
I am trying to keep a counter per user. The set of users is not fixed; it can both grow and shrink. And I'm expecting the data size to be over 1000 entries.
So, I want:
fast access, preferably not O(n). It is important that lookups are close to O(1); I need to be able to force-log-off people before they can execute something silly.
dynamic growth and shrinking.
unique data.
A HashMap was my solution, and Dictionary seems to be its C# equivalent...
Hash collisions are correctly handled by Dictionary<> - in that so long as an object implements GetHashCode() and Equals() correctly, the appropriate instance will be returned from the dictionary.
First, you shouldn't make any assumptions about how Dictionary<> works internally - that's an implementation detail that is likely to change over time. Having said that....
What you should be concerned with is whether the types you are using for keys implement GetHashCode() and Equals() correctly. The basic rules are that GetHashCode() must return the same value for the lifetime of the object, and that Equals() must return true when two instances represent the same object. Unless you override it, Equals() uses reference equality - which means it only returns true if two objects are actually the same instance. You may override how Equals() works, but then you must ensure that two objects that are 'equal' also produce the same hash code.
From a performance standpoint, you may also want to provide an implementation of GetHashCode() that generates a good spread of values to reduce the frequency of hash code collisions. The primary downside of hash code collisions is that they reduce the dictionary to a list in terms of performance. Whenever two different object instances yield the same hash code, they are stored in the same internal bucket of the dictionary. The result is that a linear scan must be performed, calling Equals() on each instance until a match is found.
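As a sketch of those rules, here is a hypothetical UserKey type whose Equals() and GetHashCode() are consistent and based only on immutable state, so distinct instances that are "equal" find the same dictionary entry:

using System;
using System.Collections.Generic;

sealed class UserKey : IEquatable<UserKey>
{
    public int Id { get; }
    public string Region { get; }

    public UserKey(int id, string region)
    {
        Id = id;
        Region = region;
    }

    public bool Equals(UserKey other) =>
        other != null && Id == other.Id && Region == other.Region;

    public override bool Equals(object obj) => Equals(obj as UserKey);

    // Combine the fields so equal keys always produce equal hash codes.
    public override int GetHashCode() => (Id, Region).GetHashCode();
}

class Demo
{
    static void Main()
    {
        var counters = new Dictionary<UserKey, int>();
        counters[new UserKey(1, "EU")] = 5;

        // A different instance that compares equal finds the same entry.
        Console.WriteLine(counters[new UserKey(1, "EU")]);   // 5
    }
}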
According to this article at MSDN, in case of a hash collision the Dictionary class converts the bucket into a linked list. The older HashTable class, on the other hand, uses rehashing.
I offer an alternative code oriented answer that demonstrates a Dictionary will exhibit exception-free and functionally correct behavior when two items with different keys are added but the keys produce the same hashcode.
On .Net 4.6 the strings "699391" and "1241308" produce the same hashcode. What happens in the following code?
myDictionary.Add( "699391", "abc" );
myDictionary.Add( "1241308", "def" );
The following code demonstrates that a .Net Dictionary accepts different keys that cause a hash collision. No exception is thrown and dictionary key lookup returns the expected object.
var hashes = new Dictionary<int, string>();
var collisions = new List<string>();

for (int i = 0; ; ++i)
{
    string st = i.ToString();
    int hash = st.GetHashCode();
    if (hashes.TryGetValue( hash, out string collision ))
    {
        // On .Net 4.6 we find "699391" and "1241308".
        collisions.Add( collision );
        collisions.Add( st );
        break;
    }
    else
        hashes.Add( hash, st );
}
Debug.Assert( collisions[0] != collisions[1], "Check we have produced two different strings" );
Debug.Assert( collisions[0].GetHashCode() == collisions[1].GetHashCode(), "Prove we have different strings producing the same hashcode" );
var newDictionary = new Dictionary<string, string>();
newDictionary.Add( collisions[0], "abc" );
newDictionary.Add( collisions[1], "def" );
Console.Write( "If we get here without an exception being thrown, it demonstrates a dictionary accepts multiple items with different keys that produce the same hash value." );
Debug.Assert( newDictionary[collisions[0]] == "abc" );
Debug.Assert( newDictionary[collisions[1]] == "def" );
Check this link for a good explanation: An Extensive Examination of Data Structures Using C# 2.0
Basically, the .NET generic Dictionary chains items with the same hash value.

Does the Enumerator of a Dictionary<TKey, TValue> return key value pairs in the order they were added?

I understand that a dictionary is not an ordered collection and one should not depend on the order of insertion and retrieval in a dictionary.
However, this is what I noticed:
Added 20 key value pairs to a Dictionary
Retrieved them by doing a foreach(KeyValuePair...)
The order of retrieval was same as the order in which they were added.
Tested for around 16 key value pairs.
Is this by design?
It's by coincidence, although predictably so. You absolutely shouldn't rely on it. Usually it will happen for simple situations, but if you start deleting elements and replacing them with anything either with the same hash code or just getting in the same bucket, that element will take the position of the original, despite having been added later than others.
It's relatively fiddly to reproduce this, but I managed to do it a while ago for another question:
using System;
using System.Collections.Generic;

class Test
{
    static void Main(string[] args)
    {
        var dict = new Dictionary<int, int>();
        dict.Add(0, 0);
        dict.Add(1, 1);
        dict.Add(2, 2);
        dict.Remove(0);
        dict.Add(10, 10);

        foreach (var entry in dict)
        {
            Console.WriteLine(entry.Key);
        }
    }
}
The results show 10, 1, 2 rather than 1, 2, 10.
Note that even though it looks like the current behaviour will always yield elements in insertion order if you don't perform any deletions, there's no guarantee that future implementations will do the same... so even in the restricted case where you know you won't delete anything, please don't rely on this.
From MSDN:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair<TKey, TValue> structure representing a value and its key. The order in which the items are returned is undefined.
[Emphasis added]
If you want to iterate through a Dictionary in a fixed order you could try OrderedDictionary
It is by design that the Dictionary<TKey,TValue> is not an ordered structure as it is intended to be used primarily more for key-based access.
If you have the need to retrieve items in a specific order, you should take a look at SortedDictionary<TKey, TValue>, which takes an IComparer<TKey> that will be used to sort the keys in the SortedDictionary<TKey, TValue>.
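A quick sketch of SortedDictionary enumerating in key order regardless of insertion order (the example values are made up):

using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        var sorted = new SortedDictionary<int, string>();
        sorted.Add(10, "ten");
        sorted.Add(1, "one");
        sorted.Add(5, "five");

        foreach (var pair in sorted)
            Console.WriteLine(pair.Key + " => " + pair.Value);   // 1, 5, 10
    }
}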
Is this by design? It probably wasn't in the original .Net Framework 2.0, but now there is an implicit contract that they will be ordered in the same order as added, because to change this would break so much code that relies on the behaviour of the original generic dictionary. Compare with the Go language, where their map deliberately returns a random ordering to prevent users of maps from relying on any ordering [1].
Any improvements or changes the framework writers make to Dictionary<T,V> would have to keep that implicit contract.
[1] "Since the release of Go 1.0, the runtime has randomized map iteration order. ", https://blog.golang.org/go-maps-in-action .
I don't think so; the dictionary does not guarantee the internal ordering of the items inside it.
If you need to keep the order as well, use an additional data structure (an array or list) alongside the dictionary, as sketched below.
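A minimal sketch of that suggestion, keeping a List of keys alongside the dictionary to record insertion order explicitly (the names used here are hypothetical):

using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        var values = new Dictionary<string, int>();
        var insertionOrder = new List<string>();

        void Add(string key, int value)
        {
            values.Add(key, value);
            insertionOrder.Add(key);   // remember the order ourselves
        }

        Add("first", 1);
        Add("second", 2);
        Add("third", 3);

        foreach (var key in insertionOrder)
            Console.WriteLine(key + " = " + values[key]);
    }
}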
I believe enumerating a Dictionary<K,V> will return the keys in the same order they were inserted if all the keys hash to the same value. This is because the Dictionary<K,V> implementation uses the hash code of the key object to insert key/value pairs into buckets, and the values are (usually) stored in the buckets in the order they are inserted. If you are consistently seeing this behavior with your user-defined objects, then perhaps you have not (correctly) overridden the GetHashCode() method?
