In my application, I have a SortedDictionary. Most of the time I'm inserting single values into it - in such cases, I understand that it needs to use the Compare method to determine where the new value should be added.
I was just wondering whether there is some way I can initialize this SortedDictionary from, let's say, a KeyValuePair<>[] array, without causing the Compare method to run.
The thing is that sometimes I do have a KeyValuePair<>[] array whose keys are already sorted, so it could be transformed into a SortedDictionary without any additional sorting. I understand that the compiler doesn't know that my collection is sorted, but since I'm sure of it, is there some way to intentionally avoid the comparison? If this request is total nonsense, could you please explain why?
The only reason I want to do this is performance - when working with big collections, the Compare method takes some time to finish.
[...] I understand that the compiler doesn't know that my collection is
sorted, [...]
Sorting isn't a compile-time detail but a run-time one.
I don't think this would be a good idea. Here's a summary of reasons not to do it:
A dictionary is actually a hash table. Thus, its keys aren't sorted per se.
A sorted dictionary requires the comparer in order to expose the keys in a particular order. If you don't use a comparer, how would a simple hash table be able to expose its keys in any order?
At the end of the day, when you need a collection whose order is the insertion order, you should use a List<T>, and in your case you should consider a List<KeyValuePair<TKey, TValue>>. Anyway, this won't work in your case: you want to provide an already sorted sequence as the source of a sorted dictionary, and have the comparer only do its work when new pairs are added after construction time.
I would say that if you need a sorted dictionary which relies on a sequence of pairs given at construction time and that mustn't be re-sorted (since they're already sorted), then you'll need to think about rolling your own IDictionary<TKey, TValue> implementation to provide such a feature...
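If you do go down that road, a minimal sketch of the idea could look like the following. It assumes the caller really does pass pre-sorted keys; the PresortedLookup name is made up, and it only covers read-only lookups (via binary search) rather than the full IDictionary<TKey, TValue> surface:

using System;
using System.Collections.Generic;

// Hypothetical read-only wrapper over keys that are already sorted.
// Nothing is re-sorted at construction; lookups use binary search, O(log n).
public sealed class PresortedLookup<TKey, TValue>
{
    private readonly TKey[] _keys;
    private readonly TValue[] _values;
    private readonly IComparer<TKey> _comparer;

    public PresortedLookup(KeyValuePair<TKey, TValue>[] sortedPairs,
                           IComparer<TKey> comparer = null)
    {
        // The caller guarantees sortedPairs is already ordered by key.
        _comparer = comparer ?? Comparer<TKey>.Default;
        _keys = new TKey[sortedPairs.Length];
        _values = new TValue[sortedPairs.Length];
        for (int i = 0; i < sortedPairs.Length; i++)
        {
            _keys[i] = sortedPairs[i].Key;
            _values[i] = sortedPairs[i].Value;
        }
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        int index = Array.BinarySearch(_keys, key, _comparer);
        if (index >= 0)
        {
            value = _values[index];
            return true;
        }
        value = default(TValue);
        return false;
    }
}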
Related
All this time I was using Dictionary to store key/value pairs, until I came across this new class called OrderedDictionary, which has the additional feature of accessing data through an index.
So, I wanted to know when I could/would run into a situation that would require me to access a value through its index when I already have the key. I have a small snippet below.
OrderedDictionary od = new OrderedDictionary();
od.Add("Key1", "Val1");
od.Add("Key2", "Val2");
od.Add("Key3", "Val3");
od.Add("Key4", "Val4");
The code above may not seem appropriate, but I would really appreciate it if someone could give a better one to answer my question.
Many Thanks!
I wanted to know when I could/would run into a situation that would require me to access a value through its index when I already have the key
I follow the YAGNI principle - You Aren't Gonna Need It. If you already know the key, then what value is there in accessing by index? The point of a dictionary is to do FAST lookups by key (by not scanning the entire collection). With an OrderedDictionary, lookups are still fast, but inserts and updates are marginally slower because the structure must keep the keys and indices in sync. Plus, the current framework implementation is not generic, so you'll have to do more casting, but there are plenty of 3rd party generic implementations out there. The fact that MS did not create a generic implementation may tell you something about the value of that type overall.
So the situation you "could" run into is needing to access the values in their insertion order. In that case you'll need to decide if you do that often enough to warrant the overhead of an OrderedDictionary, or if you can just use LINQ queries to order the items outside of the structure.
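For what it's worth, here is a small sketch of the two access styles side by side (the keys and values are just the ones from the question; note the casts, since OrderedDictionary is non-generic):

using System;
using System.Collections.Specialized;

class Program
{
    static void Main()
    {
        OrderedDictionary od = new OrderedDictionary();
        od.Add("Key1", "Val1");
        od.Add("Key2", "Val2");

        // Lookup by key - the normal dictionary case.
        string byKey = (string)od["Key2"];

        // Lookup by position - only useful when insertion order itself carries meaning.
        string byIndex = (string)od[0];   // "Val1"

        Console.WriteLine(byKey + " / " + byIndex);
    }
}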
Theory of hash-based collections
Before choosing between Dictionary and OrderedDictionary, let's look at how some collections are built.
Arrays
Arrays provide constant-time access when you know the index of your value.
So keys must be integers. If you don't know the index, you must traverse the full collection, checking whether each value is the one you're looking for.
Dictionaries
The purpose of a dictionary is to provide (relatively) constant-time access to any value in it when the key is not an integer. However, since there is not always a perfect hash function to get an integer from a value, hash codes will collide, and when several values share the same hash code they are stored in an array (a bucket). Searching among these collided values is slower, since the bucket must be traversed.
OrderedDictionary
OrderedDictionary is a kind of mix between the two previous collections. Index search will/should be the fastest (though you need to profile to be sure of that point). The problem with index search is that, apart from special cases, you don't know the index at which your value was stored, so you must rely on the key. Which makes me wonder: why would you need an OrderedDictionary?
As one comment implies, I would be very interested to know what your use case is for such a collection. Most of the time, you either know the index or you don't, depending on the nature of the value. So you should use either an array or a Dictionary, not a mix of both.
Two very different use cases:
KeyValuePair<string, string>[] values = new KeyValuePair<string, string>[4];
values[0] = new KeyValuePair<string, string>("Key1", "Value1");
// And so on...
// Or
Dictionary<string, Person> persons = new Dictionary<string, Person>();
var asker = new Person { FirstName = "pradeep", LastName = "pradyumna" };
persons.Add(asker.LastName, asker); // keyed by one of the person's own properties
// Later in the code, you cannot know the index of the person without having the person instance.
I'm asking for something that's a bit weird, but here is my requirement (it's all a bit computation-intensive, and I couldn't find anything on it anywhere so far).
I need a collection of <TKey, TValue> of about 30 items. But the collection is used in massively nested foreach loops that would possibly iterate up to almost a billion times, seriously. The operations on the collection are trivial, something that would look like:
Dictionary<Position, Value> _cells = new Dictionary<Position, Value>();
_cells.Clear();
_cells.Add(Position.p1, v1);
_cells.Add(Position.p2, v2);
//etc
In short, nothing more than adding about 30 items and clearing the collection. Also, the values will be read from somewhere else at some point. I need this reading/retrieval by key, so I need something along the lines of a Dictionary. Now, since I'm trying to squeeze out every ounce from the CPU, I'm looking for some micro-optimizations as well. For one, I do not require the collection to check whether a duplicate already exists while adding (this typically makes a dictionary slower than a List<T> for additions). I know I won't be passing duplicates as keys.
Since the Add method does some checks, I tried this instead:
_cells[Position.p1] = v1;
_cells[Position.p2] = v2;
//etc
But this is still about 200 ms slower over about 10k iterations than a typical List<T> implementation like this:
List<KeyValuePair<Position, Value>> _cells = new List<KeyValuePair<Position, Value>>();
_cells.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
_cells.Add(new KeyValuePair<Position, Value>(Position.p2, v2));
//etc
Now that could scale to a noticeable amount of time over the full iteration. Note that in the above case I read items from the list by index (which was OK for testing purposes). The problems with a regular List<T> for us are many, the main one being that we cannot access an item by key.
My questions, in short, are:
Is there a custom collection class that would let me access an item by key, yet bypass the duplicate checking while adding? Any 3rd-party open-source collection would do.
Or else, please point me to a good starting point for implementing my own collection class from the IDictionary<TKey, TValue> interface.
Update:
I went with MiMo's suggestion and List was still faster. Perhaps it has to do with the overhead of creating the dictionary.
My suggestion would be to start with the source code of Dictionary<TKey, TValue> and change it to optimize for your specific situation.
You don't have to support removal of individual key/value pairs, which might help simplify the code. There also appear to be some checks on the validity of keys, etc., that you could get rid of.
But this is still a few ms slower for about ten iterations than a typical List implementation like this
A few milliseconds slower for ten iterations of adding just 30 values? I don't believe that. Adding just a few values should take microscopic amounts of time, unless your hashing/equality routines are very slow. (That can be a real problem. I've seen code improved massively by tweaking the key choice to be something that's hashed quickly.)
If it's really taking milliseconds longer, I'd urge you to check your diagnostics.
But it's not surprising that it's slower in general: it's doing more work. For a list, it just needs to check whether or not it needs to grow the buffer, then write to an array element, and increment the size. That's it. No hashing, no computation of the right bucket.
Is there a custom collection class that would let access item by key, yet bypass the duplicate checking while adding?
No. The very work you're trying to avoid is what makes it quick to access by key later.
When do you need to perform a lookup by key, however? Do you often use collections without ever looking up a key? How big is the collection by the time you perform a key lookup?
Perhaps you should build a list of key/value pairs, and only convert it into a dictionary when you've finished writing and are ready to start looking up.
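A sketch of that approach, reusing the hypothetical Position enum and Value type from the question: cheap sequential writes first, then a single hashing pass right before lookups begin.

using System.Collections.Generic;
using System.Linq;

enum Position { p1, p2 }
class Value { public int Number; }

class Program
{
    static void Main()
    {
        var pairs = new List<KeyValuePair<Position, Value>>();
        pairs.Add(new KeyValuePair<Position, Value>(Position.p1, new Value { Number = 1 }));
        pairs.Add(new KeyValuePair<Position, Value>(Position.p2, new Value { Number = 2 }));
        // ... more cheap appends; no hashing, no duplicate checks yet ...

        // Pay the Dictionary cost once, when key lookups are about to start.
        Dictionary<Position, Value> cells = pairs.ToDictionary(p => p.Key, p => p.Value);
        System.Console.WriteLine(cells[Position.p2].Number);
    }
}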
I have a question about generic collections in C#. If I need to store a collection of items, and I'm frequently going to need to check whether an item is in the collection, would it be faster to use Dictionary instead of List?
I've heard that checking if an item is in the collection is linear relative to the size for lists and constant relative to the size for dictionaries. Is using Dictionary and then setting Key and Value to the same object for each key-value pair something that other programmers frequently do in this situation?
Thanks for taking the time to read this.
Yes, yes it is. That said, you probably want to use HashSet<T>, because you don't need both a key and a value - you just need a set of items.
It's also worth noting that Dictionary<TKey, TValue> was added in .NET 2.0, and HashSet<T> in .NET 3.5, so for all that time in between it was actually fairly common to use a Dictionary when you wanted a set, just because that was all you had (without rolling your own). When I was forced to do this I just stuck null in the value, rather than using the item as both key and value, but the idea is the same.
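To make that concrete, here is a small sketch of both shapes (the strings are arbitrary):

using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // The pre-.NET 3.5 workaround described above: a Dictionary standing in for a set,
        // with a throwaway value because only the keys matter.
        var seen = new Dictionary<string, object>();
        if (!seen.ContainsKey("apple"))
            seen.Add("apple", null);

        // With HashSet<T>, the same intent reads directly; Add returns false on duplicates.
        var set = new HashSet<string>();
        bool addedFirst = set.Add("apple");   // true
        bool addedAgain = set.Add("apple");   // false - already present
        Console.WriteLine(addedFirst + " " + addedAgain);
    }
}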
Just use HashSet<Foo> if what you're concerned with is fast containment tests.
A Dictionary<TKey, TValue> is for looking a value up based on a key.
A List<T> is for random access and dynamic growth properties.
A HashSet<T> is for modeling a set and providing fast containment tests.
You're not looking up a value based on a key. You're not worried about random access, but rather fast containment checks. The right concept here is a HashSet<T>.
Assuming that there is only ever one copy of the item in the list, then the appropriate data structure is ISet<T>, specifically HashSet<T>.
That said, I've seen timings that indicate that a Dictionary<TKey, TValue> ContainsKey call is a wee bit faster than even HashSet<T>. Either way, both of them are going to be loads faster than a plain List<T> lookup.
Keep in mind that both of these types (HashSet and Dictionary) rely on reasonably well-implemented Equals and GetHashCode implementations for T. List<T> only relies on Equals.
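As an illustration of that point, here is a hypothetical key type whose Equals and GetHashCode agree, so a HashSet<T> (or a Dictionary key) can find an equal instance again after it has been added:

using System;
using System.Collections.Generic;

public sealed class Sku : IEquatable<Sku>
{
    public string Code { get; private set; }
    public Sku(string code) { Code = code; }

    public bool Equals(Sku other)
    {
        return other != null && string.Equals(Code, other.Code, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) { return Equals(obj as Sku); }

    // Equal Codes must yield equal hash codes, or the set will never find the item again.
    public override int GetHashCode()
    {
        return Code == null ? 0 : Code.GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        var set = new HashSet<Sku>();
        set.Add(new Sku("A-100"));
        Console.WriteLine(set.Contains(new Sku("A-100")));   // True
    }
}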
A Dictionary or HashSet will use more memory, but provides (almost) O(1) lookup time.
You might want to look at HashSet<T>, which is a collection of unique objects (uniqueness being determined by the object's Equals/GetHashCode, or by an IEqualityComparer<T> you supply).
You mention using List<T>, which implies that ordering may be important. If that is the case, then you may also want to look into the SortedSet<T> type.
HashSet
The C# HashSet data structure was introduced in the .NET Framework 3.5. A full list of the implemented members can be found at the HashSet MSDN page.
Where is it used?
Why would you want to use it?
A HashSet holds a set of objects, but in a way that allows you to easily and quickly determine whether an object is already in the set or not. It does so by internally managing an array and storing the object using an index which is calculated from the hashcode of the object. Take a look here
HashSet is an unordered collection containing unique elements. It has the standard collection operations Add, Remove, Contains, but since it uses a hash-based implementation, these operations are O(1). (As opposed to List for example, which is O(n) for Contains and Remove.) HashSet also provides standard set operations such as union, intersection, and symmetric difference. Take a look here
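Conceptually, the "index calculated from the hash code" mentioned above boils down to something like this (a toy sketch, not HashSet's actual internals):

using System;

class Program
{
    static void Main()
    {
        string item = "apple";
        int bucketCount = 16;                        // hypothetical internal array size

        // A hash-based collection maps the object's hash code to a bucket index,
        // so a containment check jumps straight to one slot instead of scanning everything.
        int hash = item.GetHashCode() & 0x7FFFFFFF;  // strip the sign bit
        int bucketIndex = hash % bucketCount;

        Console.WriteLine(bucketIndex);
    }
}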
There are different implementations of Sets. Some make insertion and lookup operations super fast by hashing elements. However, that means that the order in which the elements were added is lost. Other implementations preserve the added order at the cost of slower running times.
The HashSet class in C# goes for the first approach, thus not preserving the order of elements. It is much faster than a regular List for lookups. Some basic benchmarks showed that HashSet is decently faster when dealing with primitive types (int, double, bool, etc.). It is a lot faster when working with class objects. So the point is that HashSet is fast.
The only catch with HashSet is that there is no access by index. To access elements you can either use an enumerator or use the built-in function to convert the HashSet into a List and iterate through that. Take a look here
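For example (a minimal sketch, with arbitrary values):

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var tags = new HashSet<string> { "red", "green", "blue" };

        // No tags[0] here: HashSet exposes no indexer, so you enumerate...
        foreach (string tag in tags)
            Console.WriteLine(tag);

        // ...or copy into a List<string> when you really need positional access.
        List<string> asList = tags.ToList();
        Console.WriteLine(asList[0]);
    }
}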
A HashSet has an internal structure (a hash table) in which items can be searched and identified quickly. The downside is that iterating through a HashSet (or getting an item by position, which it doesn't directly support) is rather slow.
So why would someone want to be able to know if an entry already exists in a set?
One situation where a HashSet is useful is in getting distinct values from a list where duplicates may exist. Once an item is added to the HashSet, it is quick to determine whether the item exists (via Contains).
Other advantages of the HashSet are the Set operations: IntersectWith, IsSubsetOf, IsSupersetOf, Overlaps, SymmetricExceptWith, UnionWith.
If you are familiar with the Object Constraint Language, then you will recognize these set operations. You will also see that it is one step closer to an implementation of executable UML.
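A quick illustration of those set operations (a sketch with small int sets; note that UnionWith and IntersectWith modify the set in place, hence the copies):

using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var a = new HashSet<int> { 1, 2, 3, 4 };
        var b = new HashSet<int> { 3, 4, 5, 6 };

        var union = new HashSet<int>(a);
        union.UnionWith(b);               // { 1, 2, 3, 4, 5, 6 }

        var intersection = new HashSet<int>(a);
        intersection.IntersectWith(b);    // { 3, 4 }

        Console.WriteLine(a.Overlaps(b));         // True
        Console.WriteLine(a.IsSubsetOf(union));   // True
    }
}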
Simply put, and without revealing the kitchen secrets:
A set, in general, is a collection that contains no duplicate elements and whose elements are in no particular order. So a HashSet<T> is similar to a generic List<T>, but is optimized for fast lookups (via a hash table, as the name implies) at the cost of losing order.
From an application perspective, if one only needs to avoid duplicates, then HashSet is what you are looking for, since its Lookup, Insert, and Remove complexities are all O(1) - constant. What this means is that it does not matter how many elements the HashSet holds; it takes the same amount of time to check whether a given element is there or not. And since you are inserting elements in O(1) too, it is perfect for this sort of thing.
I often used Dictionary in C# 2.0, with the key being a string containing a unique identifier.
I am learning C# 3.0+ and it seems that I can now simply use a List and run LINQ on that object to get the specific item (with .Where()).
So, if I understand correctly, has the Dictionary class lost its purpose?
No, a dictionary is still more efficient for getting things back out given a key.
With a list you still have to iterate through it to find what you want; a dictionary does a lookup.
If you just have a List, then a LINQ query will scan through every item in the list, comparing it against the one you are looking for.
The Dictionary, however, computes a hash code of the string you are looking for (returned by the GetHashCode method). This value is then used to look up the string more efficiently. For more info on how this works, see Wikipedia.
If you have more than a few strings, the initial (List) method will start to get painfully slow.
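Here is a small sketch contrasting the two lookups (the Customer type and its data are made up for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

class Customer
{
    public string Id { get; set; }
    public string Name { get; set; }
}

class Program
{
    static void Main()
    {
        var list = new List<Customer>
        {
            new Customer { Id = "C-1", Name = "Ann" },
            new Customer { Id = "C-2", Name = "Bob" }
        };

        // LINQ over a List<T>: scans items one by one until a match is found - O(n).
        Customer viaLinq = list.Where(c => c.Id == "C-2").FirstOrDefault();

        // Dictionary: hashes the key and jumps to the entry - roughly O(1).
        Dictionary<string, Customer> byId = list.ToDictionary(c => c.Id);
        Customer viaDict = byId["C-2"];

        Console.WriteLine(viaLinq.Name + " / " + viaDict.Name);
    }
}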
IMHO the Dictionary approach will be MUCH faster than LINQ over a List, so if you have a collection with a lot of items, you should use a Dictionary instead.
Dictionary is implemented as a hash table, so it should give constant-time access for key lookups. List is implemented as a dynamic array, so searching it by value takes linear time.
Based on the underlying data structures, the Dictionary should still give you better performance.
MSDN docs on Dictionary
http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
and List
http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx