How is the C#/.NET 3.5 Dictionary implemented?

I'm using an application which uses a number of large dictionaries (up to 10^6 elements) whose sizes are unknown in advance (though I can guess in some cases). I'm wondering how the dictionary is implemented, i.e. how bad the effect is if I don't give an initial estimate of the dictionary size. Does it internally use a (self-growing) array the way List does? In that case, letting the dictionaries grow might leave a lot of large unreferenced arrays on the LOH.

Using Reflector, I found the following: the Dictionary keeps the data in an array of structs. It keeps a count of how many empty places are left in that array. When you add an item and no empty place is left, it increases the size of the internal array (see below) and copies the data from the old array to the new one.
So I would suggest using the constructor in which you set the initial size, if you know there will be many entries.
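For example (the key/value types and the figure of one million here are only placeholders):
// Pre-sizing: the backing arrays are allocated once for roughly the expected
// number of entries, instead of growing repeatedly while the dictionary fills up.
var counters = new Dictionary<string, int>(1000000);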
EDIT: The logic is actually quite interesting: there is an internal class called HashHelpers that finds primes. To speed this up, it also stores some primes in a static array, from 3 up to 7199369 (some are missing; for the reason, see below). When you supply a capacity, it finds the next prime (the same value or larger) from that array and uses it as the initial capacity. If you pass a number larger than anything in the array, it starts checking for primes manually.
So if nothing is passed as capacity to the Dictionary, the starting capacity is three.
Once the capacity is exceeded, it multiplies the current capacity by two and then finds the next larger prime using the helper class. That is why the array does not need to contain every prime: primes that lie "too close together" are never reached by this doubling.
So if we pass no initial value, we would get (I checked the internal array):
3
7
17
37
71
163
353
761
1597
3371
7013
14591
30293
62851
130363
270371
560689
1162687
2411033
4999559
Once we pass this size, the next step falls outside the internal array, and it will manually search for larger primes. This will be quite slow. You could initialize with 7199369 (the largest value in the array), or consider if having more than about 5 million entries in a Dictionary might mean that you should reconsider your design.
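The growth step described above can be sketched roughly like this (a simplified approximation of the behaviour, not the framework's actual HashHelpers code, which consults its stored prime table first):
// Simplified sketch: double the current capacity, then round up to the next prime.
static int GetNextCapacity(int currentCapacity)
{
    int candidate = currentCapacity * 2;
    while (!IsPrime(candidate))
    {
        candidate++;
    }
    return candidate;
}

static bool IsPrime(int candidate)
{
    if (candidate < 2) return false;
    for (int divisor = 2; divisor * divisor <= candidate; divisor++)
    {
        if (candidate % divisor == 0) return false;
    }
    return true;
}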

MSDN says: "Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table." and further on "the capacity is automatically increased as required by reallocating the internal array."
But you get fewer reallocations if you give an initial estimate. If you have all the items from the beginning, the LINQ method ToDictionary might be handy.
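For example (myItems and its Id property are hypothetical; requires System.Linq):
// Builds the dictionary in one pass from an existing collection.
var lookup = myItems.ToDictionary(item => item.Id);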

Hashtables normally have something called a load factor: when the ratio of items to buckets reaches this threshold, the backing bucket store is grown. IIRC the default is something like 0.72. With perfect hashing, this could be increased to 1.0.
Also, when the hashtable needs more buckets, the entire collection has to be rehashed.
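As a rough illustration of what a load factor means (the numbers below are illustrative only, not taken from the framework source):
// Illustration only: with a load factor of 0.72 and 1000 buckets,
// the backing store is grown once the item count reaches the threshold.
const double loadFactor = 0.72;
int bucketCount = 1000;
int resizeThreshold = (int)(bucketCount * loadFactor); // 720 items trigger a resize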

The best way for me would be to use .NET Reflector.
http://www.red-gate.com/products/reflector/
Use the disassembled code to see the implementation.

JSON as dictionary:
{
    "Details":
    {
        "ApiKey": 50125
    }
}
The model should contain a property of type Dictionary<string, string>:
public Dictionary<string, string> Details { get; set; }
Iterate over the entries with a foreach loop, using KeyValuePair<string, string> as the element type:
foreach (KeyValuePair<string, string> entry in Details)
{
    switch (entry.Key)
    {
        case nameof(settings.ApiKey):
            int.TryParse(entry.Value, out int apiKey);
            settings.ApiKey = apiKey;
            break;
        default:
            break;
    }
}
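For completeness, the model above could be populated like this (a sketch assuming Json.NET is used for deserialization; Model and json are placeholder names):
// Sketch, assuming Json.NET (Newtonsoft.Json); Model is the class that
// exposes the Details dictionary property shown above.
var model = Newtonsoft.Json.JsonConvert.DeserializeObject<Model>(json);
// model.Details["ApiKey"] can then be processed with the loop above.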

Related

"I i can traverse a list using indexing as i do in an array.Why can't i add an item in a list as i do in an array using indexing."

Why do I need to use Add() to add elements to a List? Why can't I do it using indexing? When I traverse the elements of the List, I do it with the help of indexes.
int head = -1;
List<char> arr = new List<char>();

public void push(char s)
{
    ++head;
    arr[head] = s; // throws a runtime error
    arr.Add(s);
}
It doesn't throw any error at compile time, but at runtime it throws an ArgumentOutOfRangeException.
++head;
arr[head] = s;
This attempts to set the first element (index 0) of the list to s, but there is no element there yet, because you've not added anything or set the length of the list.
When you create an array, you define a length, so each item has a memory address that can be assigned to.
Lists are useful when you don't know how many items you're going to have, or what their index is going to be.
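In the question's push example, the fix is simply to let Add do the growing (a sketch; the head field is only kept here in case the index of the last element is still needed):
int head = -1;
List<char> arr = new List<char>();

public void push(char s)
{
    arr.Add(s);           // grows the list and assigns the new slot in one step
    head = arr.Count - 1; // index of the element just added
}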
Arrays are fixed sizes. Once you allocate them, you can not add or remove "slots" from it. So if you need it to be bigger, you need to:
Detect that you need a bigger array.
Allocate a new, bigger array
copy all existing values to the new, bigger array
start using the bigger array from now on everywhere
All that List<T> does is automate that precise process (sketched in code below). It automatically detects during Add() that it needs to grow and then does steps 2-4 automagically. It even takes care of picking the initial size and deciding by how much to grow (to avoid having to grow too often).
They could in theory just react to List[11000] by growing the size to 11000. But chances are very big that such a value is a huge mistake, and preventing the programmer from making huge mistakes is what half the classes and compiler rules (like strong typing) are there for. So they force you to use Add() so that such a mistake cannot happen.
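Done by hand, those four steps look roughly like this (a sketch of what List<T> automates internally, not its actual source):
char[] buffer = new char[4];
int count = 0;

void Add(char item)
{
    if (count == buffer.Length)                      // 1. detect that a bigger array is needed
    {
        char[] bigger = new char[buffer.Length * 2]; // 2. allocate a new, bigger array
        Array.Copy(buffer, bigger, count);           // 3. copy all existing values over
        buffer = bigger;                             // 4. use the bigger array from now on
    }
    buffer[count] = item;
    count++;
}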
Actually, calling myArray[2] = ... does not add an element; it just assigns the object to the specified index within the array. If the array's size is smaller, you get an IndexOutOfRangeException (an ArgumentOutOfRangeException in the case of a List<T>). So for an array, too, using the indexer assumes you actually have that many elements:
var array = new int[3];
array[5] = 4; // bang
This is because arrays have a fixed size which you can't change. If you assign an object to an index greater than the array's size, you get the same kind of exception as for a List<T>; there's no difference here.
The only real difference is that when using new int[3] you have an array of size 3 with indices up to 2, and you can call array[2]. However, this just returns the default value, which for int is zero. When using new List<int>(3), in contrast, you don't actually have three elements. In fact the list has no items at all, and calling list[2] throws the exception. The parameter to the list is just the capacity, a hint to the runtime about when the underlying array of the list should be resized, an ability your array does not even have.
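A short demonstration of that difference:
var array = new int[3];
Console.WriteLine(array[2]);   // prints 0, the default value - the slot exists

var list = new List<int>(3);   // capacity 3, but Count is 0
Console.WriteLine(list.Count); // prints 0
// Console.WriteLine(list[2]); // would throw ArgumentOutOfRangeException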
A list is an array wrapper, where the internal array size is managed by its methods. The constructor that takes a capacity simply creates an array of that size internally, but the Count property (which reflects the number of elements that have been added) will be zero. So in essence, zero slots in the array have been assigned a value.
The size of an array is managed by you, the programmer. That is why you have to call static methods like System.Array.Resize (notice that the array argument is ref) if you want to change an array yourself. That method allocates a new chunk of memory for the new size.
So to sum up, the list essentially manages an array for you, and as such, the trade-off is that you can only access as many array-like slots as have been added to it.
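For example:
int[] numbers = new int[3];
System.Array.Resize(ref numbers, 10); // allocates a new array of length 10 and copies the 3 old elements into it
Console.WriteLine(numbers.Length);    // prints 10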

Performance of Dictionary<string, object> vs List<string> + BinarySearch

I have a custom performance timer implementation. In short, it is a static data collection storing the execution duration of some code paths. In order to identify particular measurements, I need a collection of named objects with quick access to a data item by name, i.e. by a string of moderate length, say 20-50 characters.
The straightforward way to do that would be a Dictionary<string, MyPerformanceCounter> with access by key, where the key is the counter id.
What about a List<MyPerformanceCounter>, kept sorted and accessed via List<T>.BinarySearch and List<T>.Insert? Does it have a chance of performing better when I need several hundred counters?
Needless to say, I need access to the proper MyPerformanceCounter to be as quick as possible, as it is requested tens of thousands of times per second and should affect code execution as little as possible.
New counters are added relatively seldom, say once per second.
There are several potentially non-O(1) parts to a dictionary.
The first is generating a hash code. If your strings are long, it will have to generate a hash of the string every time you use it as a key in your dictionary. The dictionary stores the hashes of the existing keys, so you don't have to worry about that, just hashing what you're passing in. If the strings are all short, hashing should be fast. Long strings are probably going to take longer to hash than doing a string comparison. Hashing affects both reads and writes.
The next non-constant part of a dictionary is when you have hash collisions. It keeps a linked list of values with the same hash bucket internally, and has to go through and compare your key to each item in that bucket if you get hash collisions. Since you're using strings and they spent a lot of effort coming up with a good string hashing function, this shouldn't be too major an issue. Hash collisions slow down both reads and writes.
The last non-constant part happens only during writes: if the dictionary runs out of internal storage, it has to recalculate the whole hash table internally. This is still a lot faster than doing array inserts (as a List<> would). If you only have a few hundred items, this is definitely not going to affect you.
A list, on the other hand, is going to take an average of N/2 copies for each insert, and log2(N) for each lookup. Unless the strings all have similar prefixes, the individual comparisons will be much faster than the dictionary, but there will be a lot more of them.
So unless your strings are quite long to make hashing inefficient, chances are a dictionary is going to give you better performance.
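If in doubt, it is straightforward to measure both approaches for your own key lengths and counts. A rough sketch (the counter names, counts, and the use of object as a stand-in for MyPerformanceCounter are all made up; requires System.Linq and System.Diagnostics):
var keys = Enumerable.Range(0, 500).Select(i => "counter_" + i).ToList();
var dict = keys.ToDictionary(k => k, k => new object());
var sorted = keys.OrderBy(k => k, StringComparer.Ordinal).ToList();

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++)
{
    var hit = dict["counter_250"];
}
Console.WriteLine("Dictionary lookup: " + sw.ElapsedMilliseconds + " ms");

sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++)
{
    int ix = sorted.BinarySearch("counter_250", StringComparer.Ordinal);
}
Console.WriteLine("List.BinarySearch: " + sw.ElapsedMilliseconds + " ms");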
If you know something about the nature of your strings, you can write a more specific data structure optimized for your scenario. For example, if I knew all the strings started with an ASCII capital letter, and each is between 5 and 10 characters in length, I might create an array of 26 arrays, one for each letter, and then each of those arrays contains 6 lists, one for each length of string. Something like this:
List<string>[][] lists = new List<string>[26][];
foreach (string s in keys)
{
    // Buckets: first letter 'A'-'Z', length 5-10 (the assumptions stated above).
    var byLength = lists[s[0] - 'A'] ?? (lists[s[0] - 'A'] = new List<string>[6]);
    var list = byLength[s.Length - 5];
    if (list == null)
    {
        byLength[s.Length - 5] = list = new List<string>();
    }
    int ix = list.BinarySearch(s);
    if (ix < 0)
    {
        list.Insert(~ix, s);
    }
}
This is the kind of thing you do if you have very specific information about what kind of data you're dealing with. If you can't make assumptions, using a Dictionary is most likely going to be your best bet.
You might also want to consider SortedDictionary<TKey, TValue> if you want to go the binary search route; it uses a binary search tree internally (OrderedDictionary, by contrast, merely preserves insertion order). https://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary%28v=vs.110%29.aspx
I believe you should use the Dictionary<string, MyPerformanceCounter>.
For small sets of data the list will perform better. However, as more elements are added, the Dictionary becomes clearly superior.
A Dictionary lookup takes O(1), constant time complexity.
A List lookup takes O(N), linear time complexity (O(log N) if the list is kept sorted and searched with BinarySearch).
You could try Hashtable or SortedDictionary, but I think that you should still use Dictionary.
I provide a link with benchmarks and guidelines here: http://www.dotnetperls.com/dictionary-time
I hope this helps you.

Using Long Integer types with collection methods

I've encountered a problem, which is best illustrated with this code segment:
public static void Foo(long RemoveLocation)
{
    // Code body here...
    // MyList is a List-type collection object.
    MyList.RemoveAt(RemoveLocation);
}
Problem: RemoveLocation is a long. The RemoveAt method takes only int types. How do I get around this problem?
Solutions I'd prefer to avoid (because it's crunch time on the project):
Splitting MyList into two or more lists; that would require rewriting a lot of code.
Using int instead of long.
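If the index values you actually pass never exceed int.MaxValue (which they cannot, given List<T>'s limits), the immediate workaround is a checked cast. A sketch:
public static void Foo(long RemoveLocation)
{
    // A valid List index always fits in an int, so a checked cast is safe here;
    // if the value were somehow too large, this throws OverflowException
    // instead of silently truncating.
    MyList.RemoveAt(checked((int)RemoveLocation));
}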
If there was a way you could group similar items together, could you bring the total down below the limit? E.g. if your data contains lots of repeated X,Y coords, you might be able to reduce the number of elements and still keep one list, by creating a frequency count field. e.g. (x,y,count)
In theory, the maximum number of elements in a list is int.MaxValue, which is about 2 billion.
However, it is very inefficient to use the list type to store an extremely large number of elements. It simply has not been designed for that and you're doing way better with a tree-like data structure.
For instance, if you look at Mono's implementation of the list types, you'll see that they're using a single array to hold the elements, and I assume .NET's version does the same. Since the maximum size of a single object in .NET is 2 GB, the actual maximum number of elements is 2 GB divided by the element size. So, for instance, a list of strings on a 64-bit machine could hold at most about 268 million elements.
When using the mutable (non-readonly) list types, this array needs to be re-allocated to a larger size (usually using twice the old size) when adding items, requiring the entire contents to be copied. This is very inefficient.
In addition to this, having too large objects could also have negative impacts on the garbage collector.
Update
If you really need a very large list, you could simply write your own data type, for instance using an array or large arrays as internal storage.
There are also some useful comments about this here:
http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx
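A minimal sketch of such a custom type, backed by a list of fixed-size chunks so that no single huge array ever has to be allocated or copied (append and indexed access only; the chunk size is arbitrary):
public class ChunkedList<T>
{
    private const int ChunkSize = 1 << 20;             // ~1 million elements per chunk
    private readonly List<T[]> chunks = new List<T[]>();
    private long count;

    public long Count { get { return count; } }

    public void Add(T item)
    {
        if (count % ChunkSize == 0)
            chunks.Add(new T[ChunkSize]);               // grow by one chunk; nothing is copied
        chunks[(int)(count / ChunkSize)][count % ChunkSize] = item;
        count++;
    }

    public T this[long index]
    {
        get { return chunks[(int)(index / ChunkSize)][index % ChunkSize]; }
        set { chunks[(int)(index / ChunkSize)][index % ChunkSize] = value; }
    }
}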

How large per list item does List<uint> get in .NET 4.0?

I know that it takes 4 bytes to store a uint in memory, but how much space in memory does it take to store List<uint> for, say, x number of uints?
How does this compare to the space required by uint[]?
There is no per-item overhead in a List<T>, because it uses a T[] to store its data. However, a List<T> containing N items may have up to 2N elements allocated in its T[]. Also, the List data structure itself probably has 32 or more bytes of overhead.
You will probably not notice much difference between T[] and List<T>, but you can use
System.GC.GetTotalMemory(true);
before and after an object allocation to obtain an approximate measure of the memory used.
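For example (a rough measurement only; the runtime may allocate other objects in between):
long before = System.GC.GetTotalMemory(true);

var list = new List<uint>(1000000);
for (uint i = 0; i < 1000000; i++)
{
    list.Add(i);
}

long after = System.GC.GetTotalMemory(true);
Console.WriteLine("Approximate bytes used by the list: " + (after - before));
GC.KeepAlive(list); // keep the list reachable until after the measurement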
List<> uses an array internally, so a List<uint> should take O(4 bytes * n) space, just like a uint[]. There may be some more constant overhead in comparison to an array, but you should normally not care about this.
Depending on the specific implementation (this may be different when using Mono as a runtime instead of the MS .NET runtime), the internal array will be bigger than the number of actual items in the list. E.g. a list of 5 elements may have an internal array that can store 10, and a list of 10000 elements may have an internal array of size 11000. So you can't generally say that the internal array will always be twice as big, or 5% bigger, than the number of list elements; it may also depend on the size.
Edit: I've just seen that Hans Passant has described the growing behaviour of List<T> here.
So, if you have a collection of items that you want to append to, and you can't know the size of this collection at the time the list is created, use a List<T>. It is specifically designed for this case. It provides fast O(1) random access to the elements, and has very little memory overhead (just the internal array). It is, on the other hand, very slow at removing or inserting in the middle of the list. If you need those operations often, use a LinkedList<T>, which however has more memory overhead (per item!). If you know the size of your collection from the beginning, and you know that it won't change (or will change only very few times), use arrays.

What is dictionary compaction support?

"Here is the implementation of the dictionary without any compaction support."
This quote is taken from here: http://blogs.msdn.com/jaredpar/archive/2009/03/03/building-a-weakreference-hashtable.aspx
I know jaredpar is a member on here and posts on the C# section. What exactly is "dictionary compaction support"? I am assuming it is some way to optimise or make it smaller? But how (if this is what it is)?
Thanks
For that particular post I was referring to shrinking the dictionary in order to be a more appropriate size for the number of non-collected elements.
Under the hood most hashtables are backed by a large array, each slot of which usually points to another structure such as a linked list. The array starts out at an initial size. When the number of elements added to the hashtable exceeds a certain threshold (say 70% of the array's length), the hashtable will expand. This usually involves creating a new array at twice the size and re-adding the values into the new array.
One of the problems / features of a weak reference hashtable is that, over time, the elements are collected. This can lead to a fair amount of wasted space. Imagine that you added enough elements to go through this array-doubling process. Over time some of those elements were collected, and now the remaining elements could fit into the previous array size.
This is not necessarily a bad thing but it is wasted space. Compaction is the process where you essentially shrink the underlying data structure for the hashtable to be a more appropriate size for the data.
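In code, the idea of compaction looks roughly like this (a hedged sketch, not jaredpar's implementation; the key type is arbitrary):
// Sketch only: rebuild the table at a capacity sized for the entries whose
// weak references are still alive, dropping the collected ones.
static Dictionary<TKey, WeakReference> Compact<TKey>(Dictionary<TKey, WeakReference> table)
{
    int alive = 0;
    foreach (var pair in table)
    {
        if (pair.Value.IsAlive)
        {
            alive++;
        }
    }

    var compacted = new Dictionary<TKey, WeakReference>(alive);
    foreach (var pair in table)
    {
        if (pair.Value.IsAlive)
        {
            compacted.Add(pair.Key, pair.Value);
        }
    }
    return compacted;
}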
