I am designing a C# class that contains a string hierarchy, where each string has 0 or 1 parents.
My inclination is to implement this with a Dictionary<string,string> where the key is the child and the value is the parent. The dictionary may hold a large number of entries, but I can't say the exact size. This seems like it should perform faster than creating a composite wrapper with references to the parent, but I could be wrong.
Is there an alternative approach I can take that would give better performance?
Retrieving values from a Dictionary<K,V> is extremely fast (close to O(1), i.e., almost constant time lookup regardless of the size of the collection) because the underlying implementation uses a hash table. Of course, if the key type uses a terrible hashing algorithm then the performance can degrade, but you can rest assured that this is likely not the case for the framework's string type.
However, as I asked in my comment, you need to answer a few questions:
Define what performance metric is most important, i.e., Time (CPU) or space (memory).
What are your requirements? How will this be used? What's your worst case scenario? Is this going to hold a ton of data with relatively infrequent lookups, do many lookups need to be performed in a short amount of time, or do both hold true?
The Dictionary<K,V> class also uses an array internally which will grow as you add items. Is this okay for you? Again, you need to be more specific in terms of your requirements before anyone can give you a complete answer.
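For reference, the child-to-parent dictionary from the question could be sketched roughly as follows. The class and method names (StringHierarchy, SetParent, TryGetParent, GetAncestors) are made up for illustration, and the sketch assumes the hierarchy contains no cycles.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper around the child -> parent dictionary described in the question.
public sealed class StringHierarchy
{
    private readonly Dictionary<string, string> parentOf =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Registers a child with its parent; a root is simply never added as a key.
    public void SetParent(string child, string parent)
    {
        parentOf[child] = parent;
    }

    // Returns true and the parent if the child has one.
    public bool TryGetParent(string child, out string parent)
    {
        return parentOf.TryGetValue(child, out parent);
    }

    // Walks up the chain with one hash lookup per level. Assumes no cycles.
    public List<string> GetAncestors(string node)
    {
        var result = new List<string>();
        string parent;
        while (parentOf.TryGetValue(node, out parent))
        {
            result.Add(parent);
            node = parent;
        }
        return result;
    }
}
```

Each step up the hierarchy costs one hash computation over the string, so whether that matters depends on how deep the chains are and how long the strings get.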
Using a Dictionary will be slower than using direct references, because the Dictionary has to compute a hash, locate the bucket and compare keys. If you really only need the parent lookup and not the child lookup (which I doubt), then you could store the strings in an array together with the index of each string's parent.
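A minimal sketch of the array-plus-parent-index idea, assuming callers can hold on to integer indices instead of strings. IndexedHierarchy and its members are hypothetical names, and -1 is used here as an arbitrary marker for "no parent".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flat layout: names[i] is a string and parentIndex[i] is the
// index of its parent, or -1 for a root.
public sealed class IndexedHierarchy
{
    private readonly List<string> names = new List<string>();
    private readonly List<int> parentIndex = new List<int>();

    // Adds a node and returns its index; pass -1 when the node has no parent.
    public int Add(string name, int parent)
    {
        if (parent < -1 || parent >= names.Count)
            throw new ArgumentOutOfRangeException("parent");
        names.Add(name);
        parentIndex.Add(parent);
        return names.Count - 1;
    }

    public string NameOf(int index) { return names[index]; }

    // Parent lookup is a plain indexed read, with no hashing involved.
    public int ParentOf(int index) { return parentIndex[index]; }
}
```

The trade-off is that finding a node by its string now needs a scan or a separate index, so this only helps if lookups start from an index you already have.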
Let's say I have a class:
class C
{
    public int uniqueField;
    public int otherField;
}
This is very simplified version of the actual problem. I want to store multiple instances of this class, where "uniqueField" should be unique for each instance.
What is better in this case?
a) Dictionary with uniqueField as the key
Dictionary<int, C> d;
or b) List?
List<C> l;
In the first case (a) the same data would be stored twice (as the key and as the field of a class instance). But the question is: Is it faster to find an element in a dictionary than in a list? Or is it equally fast?
a)
d[searchedUniqueField]
b)
l.Find(x=>x.uniqueField==searchedUniqueField);
Assuming you've got quite a lot of instances, it's likely to be much faster to find the item in the dictionary. Basically a Dictionary<,> is a hash table, with O(1) lookup other than due to collisions.
Now if the collection is really small, then the extra overhead of finding the hash code, computing the right bucket and then looking through that bucket for matching hash codes, then performing a key equality check can take longer than just checking each element in a list.
If you might have a lot of instances but might not, I'd usually pick the dictionary approach. For one thing it expresses what you're actually trying to achieve: a simple way of accessing an element by a key. The overhead for small collections is unlikely to be very significant unless you have far more small collections than large ones.
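To make the two options concrete, here is a small sketch of both lookups side by side. The class C mirrors the one in the question; LookupDemo and its method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

public class C
{
    public int uniqueField;
    public int otherField;
}

public static class LookupDemo
{
    // Option (a): hash lookup. TryGetValue avoids the KeyNotFoundException
    // that d[key] would throw for a missing key.
    public static C FindInDictionary(Dictionary<int, C> d, int key)
    {
        C found;
        return d.TryGetValue(key, out found) ? found : null;
    }

    // Option (b): linear scan. List<T>.Find returns null (default(T))
    // when no element matches.
    public static C FindInList(List<C> l, int key)
    {
        return l.Find(x => x.uniqueField == key);
    }
}
```

Both return the same instance when the key exists; the difference is only in how much work is done to get there as the collection grows.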
Use Dictionary when the number of lookups greatly exceeds the number of insertions. It is fine to use List when you will always have fewer than four items.
Reference - http://www.dotnetperls.com/dictionary-time
If you want to ensure that your client cannot create a duplicate key, you may want your class to be responsible for creating the unique key. Once unique key generation is the responsibility of the class, the choice between dictionary and list is the client's decision.
I had two questions. I was wondering if there is a simple class in the C# library that stores pairs of values instead of just one, so that I can store a class and an integer in the same node of the list. I think the easiest way is to just make a container class, but that is extra work each time, so I wanted to know whether I should be doing so or not. I know that later versions of .NET (I am using 3.5) have tuples that I could use, but those are not available to me.
I guess the bigger question is what are the memory disadvantages of using a dictionary to store the integer class map even though I don't need to access in O(1) and could afford to just search the list? What is the minimum size of the hash table? Should I just make the wrapper class I need?
If you need to store an unordered list of {integer, value}, then I would suggest making the wrapper class. If you need a data structure in which you can look up integer to get value (or, look up value to get integer), then I would suggest a dictionary.
The decision of List<Tuple<T1, T2>> (or List<KeyValuePair<T1, T2>>) vs Dictionary<T1, T2> is largely going to come down to what you want to do with it.
If you're going to be storing information and then iterating over it, without needing to do frequent lookups based on a particular key value, then a List is probably what you want. Depending on how you're going to use it, a LinkedList might be even better - slightly higher memory overheads, faster content manipulation (add/remove) operations.
On the other hand, if you're going to be primarily using the first value as a key to do frequent lookups, then a Dictionary is designed specifically for this purpose. Key value searching and comparison is significantly improved, so if you do much with the keys and your list is big a Dictionary will give you a big speed boost.
Data size is important to the decision. If you're talking about a couple hundred items or less, a List is probably fine. Above that point the lookup times will probably impact more significantly on execution time, so Dictionary might be more worth it.
There are no hard and fast rules. Every use case is different, so you'll have to balance your requirements against the overheads.
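As an illustration of the list-of-pairs option, the following sketch uses KeyValuePair<TKey, TValue>, which is available on .NET 3.5 where Tuple is not. PairListDemo, Build and TryFindFirst are hypothetical names, and the int/string pairing is just a stand-in for the integer/class pairing in the question.

```csharp
using System;
using System.Collections.Generic;

public static class PairListDemo
{
    // Builds an ordered list of (number, label) pairs. Unlike dictionary keys,
    // the same number may appear more than once.
    public static List<KeyValuePair<int, string>> Build()
    {
        var pairs = new List<KeyValuePair<int, string>>();
        pairs.Add(new KeyValuePair<int, string>(3, "three"));
        pairs.Add(new KeyValuePair<int, string>(1, "one"));
        pairs.Add(new KeyValuePair<int, string>(3, "three again"));
        return pairs;
    }

    // Linear search for the first pair with the given key: O(n), which is
    // usually acceptable for short lists.
    public static bool TryFindFirst(List<KeyValuePair<int, string>> pairs, int key, out string value)
    {
        foreach (var pair in pairs)
        {
            if (pair.Key == key)
            {
                value = pair.Value;
                return true;
            }
        }
        value = null;
        return false;
    }
}
```

If the linear search ever becomes the bottleneck, the same pairs can be loaded into a Dictionary later, provided the keys turn out to be unique.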
You can use a list of KeyValuePair: http://msdn.microsoft.com/en-us/library/5tbh8a42.aspx
You can use a Tuple<T,T1>, a list of KeyValuePair<T, T1> - or, an anonymous type, e.g.
var list = something.Select(x => new { Key = x.Something, Value = x.Value });
You can use either KeyValuePair or Tuple
For Tuple, you can read the following useful post:
What requirement was the tuple designed to solve?
Say that, in my method, I pass in a couple IEnumerables (probably because I'm going to get a bunch of objects from a db or something).
Then for each object in objects1, I want to pull out a diffobject from objects2 that has the same object.iD.
I don't want multiple enumerations (according to ReSharper), so I could make objects2 into a dictionary keyed with object.iD. Then I only enumerate once for each. (Secondary question) Is that a good pattern?
(primary question) What's too big? At what point would this be a horrible pattern? How many objects is too many objects for the dictionary?
Internally, it would be prevented from ever having more than two billion items. Since the way things are positioned within a dictionary is fairly complicated, if I were looking at dealing with a billion items (if a 16-bit value, for example, then 2GB), I'd be looking to store them in a database and retrieve them using data-access code.
I have to ask though, where are Objects1 and Objects2 coming from? It sounds as though you could do this at the DB level and it would be MUCH, MUCH more efficient than doing it in C#!
You might also want to consider using KeyValuePair[]
Dictionaries store instances of KeyValuePair
If all you ever want to do is look up values in the dictionary given their Key, then yes, Dictionary is the way to go - they're pretty quick at doing that. However, if you want to sort items or search for them using the Value or a property of it, it's better to use something else.
As far as size goes, they get a little slower as they get bigger. It's worth doing some benchmarks to see how that affects your needs, but you could always split values across multiple dictionaries based on their type or range. http://www.dotnetperls.com/dictionary-size
It's worth noting though that when you say "Then I only enumerate once for each", that's slightly incorrect. objects1 will be enumerated fully, but the dictionary of objects2 won't be enumerated. As long as you use the Key to retrieve values, it will hash the key and use the result to calculate the location where the value is stored, so a dictionary can get pretty quickly to the value you ask for. Ideally use an int for the Key because it can use that as the hash directly. You can enumerate them, but it's much better to look objects up using objects2Dictionary[key].
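The pattern from the question might look roughly like this. Item, DiffItem and MatchDemo are placeholder types invented for the sketch; only the Id field matters, and it is assumed that Ids in objects2 are unique.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item types; only the Id field matters here.
public class Item { public int Id; public string Name; }
public class DiffItem { public int Id; public string Note; }

public static class MatchDemo
{
    public static List<KeyValuePair<Item, DiffItem>> Match(
        IEnumerable<Item> objects1, IEnumerable<DiffItem> objects2)
    {
        // Enumerates objects2 exactly once. ToDictionary throws an
        // ArgumentException if two items share the same Id.
        Dictionary<int, DiffItem> byId = objects2.ToDictionary(o => o.Id);

        var matches = new List<KeyValuePair<Item, DiffItem>>();
        foreach (Item item in objects1) // enumerates objects1 exactly once
        {
            DiffItem diff;
            if (byId.TryGetValue(item.Id, out diff))
                matches.Add(new KeyValuePair<Item, DiffItem>(item, diff));
        }
        return matches;
    }
}
```

TryGetValue is used instead of the indexer so that an item in objects1 without a counterpart in objects2 is skipped rather than causing a KeyNotFoundException.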
I'm looking for a data structure that can possibly outperform Dictionary<string, object>. I have a map that has N items - the map is constructed once and then read many, many times. The map doesn't change during the lifetime of the program (no new items are added, no items are deleted and items are not reordered). Because the map doesn't change, it doesn't need to be thread-safe, even though the application using it is heavily multi-threaded. I expect that ~50% of lookups will happen for items not in the map.
Dictionary<TKey, TItem> is quite fast and I may end up using it but I wonder if there's another data structure that's faster for this scenario. While the rest of the program is obviously more expensive than this map, it is used in performance-critical parts and I'd like to speed it up as much as possible.
What you're looking for is a Perfect Hash Function. You can create one based on your list of strings, and then use it for the Dictionary.
The non-generic Hashtable has a constructor that accepts IHashCodeProvider that lets you specify your own hash function. I couldn't find an equivalent for Dictionary, so you might have to resort to using a Hashtable instead.
You can use it internally in your PerfectStringHash class, which will do all the type casting for you.
Note that you may need to be able to specify the number of buckets in the hash. I think Hashtable only lets you specify the load factor. You may find out you need to roll your own hash entirely. It's a good class for everyone to use, I guess, a generic perfect hash.
EDIT: Apparently someone already implemented some Perfect Hash algorithms in C#.
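One detail that may be useful here: Dictionary<TKey, TValue> has a constructor overload that takes an IEqualityComparer<TKey>, and the comparer's GetHashCode method is what the dictionary uses to hash keys. The sketch below shows only the plumbing. CustomHashComparer and ReadOnlyMapDemo are hypothetical names, and the hash function is a deliberately simple placeholder, not a perfect hash.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer that plugs a custom hash function into Dictionary.
// The hash below (derived from length and first character) is a placeholder,
// NOT a perfect hash; a real one would be generated from the known key set.
public sealed class CustomHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        if (s == null) throw new ArgumentNullException("s");
        return s.Length == 0 ? 0 : (s.Length * 31) ^ s[0];
    }
}

public static class ReadOnlyMapDemo
{
    // Builds the map once; afterwards it is only read.
    public static Dictionary<string, object> Build(IEnumerable<KeyValuePair<string, object>> items)
    {
        var map = new Dictionary<string, object>(new CustomHashComparer());
        foreach (var item in items) map.Add(item.Key, item.Value);
        return map;
    }
}
```

Keep in mind that the dictionary still maps hash codes onto its own buckets, so a collision-free hash function does not by itself guarantee collision-free buckets. Whether this beats the default string hashing is something to measure rather than assume.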
The read performance of the generic dictionary is "close to O(1)" according to the remarks on MSDN for most TKey (and you should get pretty good performance with just string keys). And you get this out of the box, free, from the framework, without implementing your own collection.
http://msdn.microsoft.com/en-us/library/xfhwa508(v=vs.90).aspx
If you need to stick with string keys - Dictionary is at least very good (if not the best choice).
One more thing to note when you start measuring - consider whether computing the hash itself has a measurable impact. Hashing a long string takes longer than hashing a short one. See if the items you want to search for can be represented as other objects whose hash can be computed in constant time.
I have recently seen a new trend in my firm where we change the IEnumerable to a dictionary by a simple LINQ transformation as follows:
enumerable.ToDictionary(x=>x);
We mostly end up doing this when the operation on the collection is a Contains/Access and obviously a dictionary has a better performance in such cases.
But I realise that converting the Enumerable to a dictionary has its own cost and I am wondering at what point does it start to break-even (if it does) i.e the performance of IEnumerable Contains/Access is equal to ToDictionary + access/contains.
OK, I might add that there is no database access. The enumerable might be created from a database query and that's it, and the enumerable may be edited after that too.
Also it would be interesting to know how does the datatype of the key affect the performance?
The lookup is generally performed 2-5 times, but sometimes only once. But I have seen things like:
For an enumerable:
var element = enumerable.SingleOrDefault(x => x.Id == id); // id: the key being searched for
//do something if element is null or return
for a dictionary:
if(dictionary.ContainsKey(x))
//do something if false else return
This has been bugging me for quite some time now.
Performance of Dictionary Compared to IEnumerable
A Dictionary, when used correctly, is always faster to read from (except in cases where the data set is very small, e.g. 10 items). There can be overhead when creating it.
Given n as the number of items and m as the number of lookups performed against the same object (these are approximate):
Performance of an IEnumerable (created from a clean list): O(mn)
This is because you need to look at all the items each time (essentially m * O(n)).
Performance of a Dictionary: O(n) + m * O(1), or O(m + n)
This is because you need to insert items first (O(n)).
In general it can be seen that the Dictionary wins when m > 1, and the IEnumerable wins when m = 1 or m = 0.
In general you should:
Use a Dictionary when doing the lookup more than once against the same dataset.
Use an IEnumerable when doing the lookup once.
Use an IEnumerable when the data-set could be too large to fit into memory.
Keep in mind a SQL table can be used like a Dictionary, so you could use that to offset the memory pressure.
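The two cost models above can be sketched as follows. RepeatedLookupDemo and its methods are illustrative names, and int keys are assumed purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RepeatedLookupDemo
{
    // m lookups by scanning: each Contains call walks up to n items,
    // so the total is roughly O(m * n).
    public static int CountHitsByScanning(IEnumerable<int> source, IEnumerable<int> wanted)
    {
        List<int> items = source.ToList(); // materialize once so the source is not re-enumerated
        return wanted.Count(w => items.Contains(w));
    }

    // Build the dictionary once (O(n)), then each lookup is close to O(1),
    // so the total is roughly O(n + m).
    public static int CountHitsByDictionary(IEnumerable<int> source, IEnumerable<int> wanted)
    {
        var index = new Dictionary<int, int>();
        foreach (int item in source) index[item] = item; // indexer tolerates duplicate items
        return wanted.Count(w => index.ContainsKey(w));
    }
}
```

Both methods return the same count; only the amount of work differs. The indexer is used instead of ToDictionary here because ToDictionary throws on duplicate keys.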
Further Considerations
Dictionaries use GetHashCode() to organise their internal state. The performance of a Dictionary is strongly related to the hash code in two ways.
Poorly performing GetHashCode() - results in overhead every time an item is added, looked up, or deleted.
Low quality hash codes - results in the dictionary not having O(1) lookup performance.
Most built-in .NET types (especially the value types) have very good hashing algorithms. However, with list-like types (e.g. string) GetHashCode() has O(n) performance in the length of the value, because it needs to iterate over the whole string. Thus your dictionary's lookup performance can really be seen as O(1) + M, where M is the cost of computing GetHashCode() for the key.
It depends....
How long is the IEnumerable?
Does accessing the IEnumerable cause database access?
How often is it accessed?
The best thing to do would be to experiment and profile.
If you search for elements in your collection by some key very often, the Dictionary will definitely be faster, because it is a hash-based collection and searching is many times faster. Otherwise, if you don't search the collection much, the conversion is not necessary, because the conversion may take longer than your one or two searches in the collection.
IMHO: you need to measure this on your environment with representative data. In such cases I just write a quick console app that measures the time of the code execution. To have a better measure you need to execute the same code multiple times I guess.
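A rough sketch of such a timing helper, using System.Diagnostics.Stopwatch. QuickTiming and Measure are made-up names, and this is a crude measurement (no statistics, no control over garbage collection), so treat the numbers as indicative only.

```csharp
using System;
using System.Diagnostics;

public static class QuickTiming
{
    // Runs the action once to warm up (so JIT compilation is not timed),
    // then times the given number of repetitions.
    public static TimeSpan Measure(Action action, int repetitions)
    {
        action();
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++) action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

In a console app you might call something like QuickTiming.Measure(() => list.Contains(42), 100000) and the dictionary equivalent, then compare the two elapsed times on representative data.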
ADD:
It also depends on the application you develop. Usually you gain more by spending that time and effort optimizing other places (avoiding network round trips, caching etc.).
I'll add that you haven't told us what happens every time you "rewind" your IEnumerable<>. Is it directly backed by a data collection (for example a List<>), or is it calculated "on the fly"? If it's the first, and for small collections, enumerating them to find the wanted element is faster (a Dictionary for 3 or 4 elements is useless; if you want, I can build a benchmark to find the break-even point). If it's the second, then you have to consider whether "caching" the IEnumerable<> in a collection is a good idea. If it is, then you can choose between a List<> or a Dictionary<>, and we return to point 1: is the IEnumerable small or big? And there is a third problem: if the collection isn't backed, but it's too big for memory, then clearly you can't put it in a Dictionary<>. Then perhaps it's time to make the SQL work for you :-)
I'll add that "failures" have their cost: in a List<> if you try to find an element that doesn't exist, the cost is O(n), while in a Dictionary<> the cost is still O(1).