Write-once, read-many string-to-object map - C#

I'm looking for a data structure that can possibly outperform Dictionary<string, object>. I have a map that has N items - the map is constructed once and then read many, many times. The map doesn't change during the lifetime of the program (no new items are added, no items are deleted and items are not reordered). Because the map doesn't change, it doesn't need to be thread-safe, even though the application using it is heavily multi-threaded. I expect that ~50% of lookups will happen for items not in the map.
Dictionary<TKey, TItem> is quite fast and I may end up using it, but I wonder if there's another data structure that's faster for this scenario. While the rest of the program is obviously more expensive than this map, the map is used in performance-critical parts and I'd like to speed it up as much as possible.

What you're looking for is a Perfect Hash Function. You can create one based on your list of strings, and then use it for the Dictionary.
The non-generic Hashtable has a constructor that accepts an IHashCodeProvider, which lets you specify your own hash function; the generic Dictionary<TKey, TValue> offers the equivalent through a constructor that takes an IEqualityComparer<TKey>, whose GetHashCode you control.
Either way, you can wrap the collection inside a PerfectStringHash class that hides the comparer (and, if you do use Hashtable, does the type casting) for you.
Note that you may need to control the number of buckets in the hash; I think Hashtable only lets you specify the load factor, so you may find you need to roll your own hash table entirely. A generic perfect-hash class would be a useful thing for everyone to have.
EDIT: Apparently someone already implemented some Perfect Hash algorithms in C#.
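For illustration, here is a minimal sketch of how a custom hash can be plugged into the generic dictionary via IEqualityComparer<TKey>. The comparer below uses a plain FNV-1a hash purely as a stand-in for a real generated perfect hash, so treat it as a skeleton rather than an actual perfect-hash implementation:

using System;
using System.Collections.Generic;

// Stand-in comparer: replace GetHashCode with the perfect hash generated
// from your fixed key set. FNV-1a is used here only to keep the sketch runnable.
sealed class PerfectHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        unchecked
        {
            int hash = (int)2166136261;
            foreach (char c in s)
                hash = (hash ^ c) * 16777619;
            return hash;
        }
    }
}

class Demo
{
    static void Main()
    {
        // The comparer is supplied through the Dictionary constructor.
        var map = new Dictionary<string, object>(new PerfectHashComparer())
        {
            ["alpha"] = 1,
            ["beta"]  = 2,
        };
        Console.WriteLine(map.TryGetValue("alpha", out var v) ? v : "miss");
    }
}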

The read performance of the generic dictionary is "close to O(1)" according to the remarks on MSDN for most TKey (and you should get pretty good performance with just string keys). And you get this out of the box, free, from the framework, without implementing your own collection.
http://msdn.microsoft.com/en-us/library/xfhwa508(v=vs.90).aspx
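For the write-once, read-many case in the question, the out-of-the-box usage is about as simple as it gets; a minimal sketch (key names are made up):

using System;
using System.Collections.Generic;

class Demo
{
    // Built once at startup, then only read; read-only access needs no locking
    // even from multiple threads.
    static readonly Dictionary<string, object> Map = new Dictionary<string, object>
    {
        ["alpha"] = 1,
        ["beta"]  = 2,
    };

    static void Main()
    {
        // TryGetValue handles the ~50% of lookups expected to miss without throwing.
        if (Map.TryGetValue("gamma", out var value))
            Console.WriteLine(value);
        else
            Console.WriteLine("not found");
    }
}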

If you need to stick with string keys, Dictionary is at least a very good choice, if not the best one.
One more thing to note when you start measuring: consider whether computing the hash itself has a measurable impact. Hashing a long string takes time proportional to its length, so searching for long keys costs more. See whether the items you search for can be represented by other objects whose GetHashCode runs in constant time.
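If measurement does show the string hashing itself dominating, one hedged option is a key wrapper that computes the hash once and reuses it; this only pays off if the lookup sites can hold on to the wrapper instances instead of rebuilding them per lookup. A sketch, with the type name invented here:

using System;

// Hypothetical wrapper: hashes the (possibly long) string once, then returns
// the cached value from GetHashCode on every subsequent lookup.
public readonly struct CachedKey : IEquatable<CachedKey>
{
    public readonly string Value;
    private readonly int _hash;

    public CachedKey(string value)
    {
        Value = value;
        _hash = StringComparer.Ordinal.GetHashCode(value);
    }

    public bool Equals(CachedKey other) =>
        string.Equals(Value, other.Value, StringComparison.Ordinal);

    public override bool Equals(object obj) => obj is CachedKey other && Equals(other);

    public override int GetHashCode() => _hash;
}

A Dictionary<CachedKey, object> then reuses the cached hash on every lookup, provided the caller keeps the CachedKey instances around.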

Related

C# dictionary vs list usage

I have two questions. First, is there a simple class in the C# library that stores a pair of values instead of just one, so that I can store a class and an integer in the same node of a list? I think the easiest way is to just make a container class, but that is extra work each time, so I wanted to know whether I should be doing so or not. I know that later versions of .NET (I am using 3.5) have tuples I could store, but that's not available to me.
I guess the bigger question is: what are the memory disadvantages of using a dictionary to store the integer-to-class map, even though I don't need O(1) access and could afford to just search a list? What is the minimum size of the hash table? Should I just make the wrapper class I need?
If you need to store an unordered list of {integer, value}, then I would suggest making the wrapper class. If you need a data structure in which you can look up integer to get value (or, look up value to get integer), then I would suggest a dictionary.
The decision of List<Tuple<T1, T2>> (or List<KeyValuePair<T1, T2>>) vs Dictionary<T1, T2> is largely going to come down to what you want to do with it.
If you're going to be storing information and then iterating over it, without needing to do frequent lookups based on a particular key value, then a List is probably what you want. Depending on how you're going to use it, a LinkedList might be even better - slightly higher memory overheads, faster content manipulation (add/remove) operations.
On the other hand, if you're going to be primarily using the first value as a key to do frequent lookups, then a Dictionary is designed specifically for this purpose. Key value searching and comparison is significantly improved, so if you do much with the keys and your list is big a Dictionary will give you a big speed boost.
Data size is important to the decision. If you're talking about a couple hundred items or less, a List is probably fine. Above that point the lookup times will probably impact more significantly on execution time, so Dictionary might be more worth it.
There are no hard and fast rules. Every use case is different, so you'll have to balance your requirements against the overheads.
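To make the trade-off concrete, here is a minimal sketch of both shapes (types and values are purely illustrative):

using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        // Iteration-oriented: a flat list of pairs, cheap to build and to walk in order.
        var pairs = new List<KeyValuePair<int, string>>
        {
            new KeyValuePair<int, string>(1, "one"),
            new KeyValuePair<int, string>(2, "two"),
        };
        foreach (var pair in pairs)
            Console.WriteLine(pair.Key + ": " + pair.Value);

        // Lookup-oriented: a dictionary keyed by the integer, for frequent lookups by key.
        var byId = new Dictionary<int, string> { { 1, "one" }, { 2, "two" } };
        Console.WriteLine(byId[2]);   // O(1) average lookup
    }
}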
You can use a list of KeyValuePair: http://msdn.microsoft.com/en-us/library/5tbh8a42.aspx
You can use a Tuple<T,T1>, a list of KeyValuePair<T, T1>, or an anonymous type, e.g.
var list = something.Select(x => new { Key = x.Something, Value = x.Value });
You can use either KeyValuePair or Tuple
For Tuple, you can read the following useful post:
What requirement was the tuple designed to solve?

Collection that lets you access items by key but doesn't require duplicate checking on addition?

I'm asking for something that's a bit weird, but here is my requirement (it's all quite computation-intensive, and I couldn't find anything on this anywhere so far).
I need a collection of <TKey, TValue> of about 30 items. But the collection is used in massively nested foreach loops that could iterate up to almost a billion times, seriously. The operations on the collection are trivial, something that would look like:
Dictionary<Position, Value> _cells = new Dictionary<Position, Value>();
_cells.Clear();
_cells.Add(Position.p1, v1);
_cells.Add(Position.p2, v2);
//etc
In short, nothing more than adding about 30 items and clearing the collection. Also, the values will be read from somewhere else at some point, and I need this reading/retrieval by key, so I need something along the lines of a Dictionary. Now, since I'm trying to squeeze out every ounce from the CPU, I'm looking for some micro-optimizations as well. For one, I do not require the collection to check whether a duplicate already exists while adding (this check typically makes a dictionary slower than a List<T> for additions). I know I won't be passing duplicate keys.
Since the Add method does some checks, I tried this instead:
_cells[Position.p1] = v1;
_cells[Position.p2] = v2;
//etc
But this is still about 200 ms slower over about 10k iterations than a typical List<T> implementation like this:
List<KeyValuePair<Position, Value>> _cells = new List<KeyValuePair<Position, Value>>();
_cells.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
_cells.Add(new KeyValuePair<Position, Value>(Position.p2, v2));
//etc
Now that could add up to a noticeable time over the full iteration. Note that in the above case I read items from the list by index (which was OK for testing purposes). The problems with a regular List<T> for us are many, the main one being that we cannot access an item by key.
My questions, in short, are:
Is there a custom collection class that lets you access items by key, yet bypasses the duplicate checking while adding? Any third-party open-source collection would do.
Or else, please point me to a good starting point for implementing my own collection class from the IDictionary<TKey, TValue> interface.
Update:
I went with MiMo's suggestion and the List was still faster. Perhaps it has to do with the overhead of creating the dictionary.
My suggestion would be to start with the source code of Dictionary<TKey, TValue> and change it to optimize for your specific situation.
You don't have to support removal of individual key/value pairs, which might help simplify the code. There also appear to be some checks on the validity of keys, etc., that you could get rid of.
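If Position is a small enum with contiguous values starting at zero (the question's snippets suggest something like that, but this is an assumption), an array indexed by the key is about as stripped-down as such a collection can get: no hashing, no duplicate checks, and Clear is a single Array.Clear. A minimal sketch:

using System;

// Assumed shape of the question's key type.
enum Position { p1, p2, p3 }

// Hypothetical fixed-size "map" for enum keys. Setting a key twice simply
// overwrites the previous value; there is no duplicate check at all.
sealed class PositionMap<TValue>
{
    private readonly TValue[] _values =
        new TValue[Enum.GetValues(typeof(Position)).Length];

    public TValue this[Position key]
    {
        get { return _values[(int)key]; }
        set { _values[(int)key] = value; }
    }

    public void Clear()
    {
        Array.Clear(_values, 0, _values.Length);
    }
}

Usage then mirrors the question's snippet: cells[Position.p1] = v1; and cells.Clear(); between iterations.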
But this is still a few ms slower for about ten iterations than a typical List implementation like this
A few milliseconds slower for ten iterations of adding just 30 values? I don't believe that. Adding just a few values should take microscopic amounts of time, unless your hashing/equality routines are very slow. (That can be a real problem. I've seen code improved massively by tweaking the key choice to be something that's hashed quickly.)
If it's really taking milliseconds longer, I'd urge you to check your diagnostics.
But it's not surprising that it's slower in general: it's doing more work. For a list, it just needs to check whether or not it needs to grow the buffer, then write to an array element, and increment the size. That's it. No hashing, no computation of the right bucket.
Is there a custom collection class that lets you access items by key, yet bypasses the duplicate checking while adding?
No. The very work you're trying to avoid is what makes it quick to access by key later.
When do you need to perform a lookup by key, however? Do you often use collections without ever looking up a key? How big is the collection by the time you perform a key lookup?
Perhaps you should build a list of key/value pairs, and only convert it into a dictionary when you've finished writing and are ready to start looking up.
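A sketch of that pattern, reusing the Position and Value names (and v1, v2) from the question's own snippets, so it assumes those exist in the surrounding code; ToDictionary requires System.Linq:

// Write phase: plain list appends, no hashing and no duplicate checks.
var pending = new List<KeyValuePair<Position, Value>>();
pending.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
pending.Add(new KeyValuePair<Position, Value>(Position.p2, v2));

// Read phase: convert once, then do all key lookups against the dictionary.
Dictionary<Position, Value> cells = pending.ToDictionary(p => p.Key, p => p.Value);
Value lookedUp = cells[Position.p1];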

What would be too big for a dictionary, when using IEnumerable.ToDictionary()?

Say that, in my method, I pass in a couple of IEnumerables (probably because I'm going to get a bunch of objects from a db or something).
Then, for each object in objects1, I want to pull out the corresponding object from objects2 that has the same object.iD.
I don't want multiple enumerations (according to ReSharper), so I could make objects2 into a dictionary keyed on object.iD. Then I only enumerate each one once. (Secondary question) Is that a good pattern?
(Primary question) What's too big? At what point would this be a horrible pattern? How many objects is too many objects for the dictionary?
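For concreteness, a sketch of the pattern being asked about, using the question's iD property and otherwise made-up names:

using System.Collections.Generic;
using System.Linq;

// Hypothetical element type standing in for whatever objects1/objects2 contain.
class Item
{
    public int iD { get; set; }
    public string Payload { get; set; }
}

static class PairingExample
{
    // Enumerates objects2 once (to build the dictionary) and objects1 once.
    public static IEnumerable<(Item First, Item Second)> Pair(
        IEnumerable<Item> objects1, IEnumerable<Item> objects2)
    {
        Dictionary<int, Item> byId = objects2.ToDictionary(o => o.iD);

        foreach (var obj in objects1)
        {
            if (byId.TryGetValue(obj.iD, out var match))
                yield return (obj, match);
        }
    }
}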
Internally, a dictionary is prevented from ever having more than about two billion items. Since the way items are positioned within a dictionary is fairly complicated, if I were looking at dealing with a billion items (at, say, 16 bits per value, that's already 2 GB), I'd be looking to store them in a database and retrieve them using data-access code.
I have to ask though, where are Objects1 and Objects2 coming from? It sounds as though you could do this at the DB level and it would be MUCH, MUCH more efficient than doing it in C#!
You might also want to consider using a KeyValuePair[]; dictionaries hand their contents back as instances of KeyValuePair anyway.
If all you ever want to do is look up values in the dictionary given their Key, then yes, Dictionary is the way to go - they're pretty quick at doing that. However, if you want to sort items or search for them using the Value or a property of it, it's better to use something else.
As far as size goes, they get a little slower as they get bigger. It's worth doing some benchmarks to see how that affects your needs, but you could always split values across multiple dictionaries based on their type or range: http://www.dotnetperls.com/dictionary-size
It's worth noting, though, that when you say "Then I only enumerate each one once", that's slightly incorrect. objects1 will be enumerated fully, but the dictionary built from objects2 won't be enumerated during the lookups. As long as you use the key to retrieve values, the dictionary hashes the key and uses the result to calculate the location of the value, so it can get to the value you ask for pretty quickly. Ideally use an int for the key, because it can be used as the hash directly. You can enumerate a dictionary, but it's much better to look objects up using objects2Dictionary[key].

Best performance on a String Dictionary in C#

I am designing a C# class that contains a string hierarchy, where each string has 0 or 1 parents.
My inclination is to implement this with a Dictionary<string,string> where the key is the child and the value is the parent. The dictionary may hold a large number of values, but I can't say the exact size. This seems like it should perform faster than creating a composite wrapper with references to the parent, but I could be wrong.
Is there an alternative approach I can take that will give better performance?
Retrieving values from a Dictionary<K,V> is extremely fast (close to O(1), i.e., almost constant time lookup regardless of the size of the collection) because the underlying implementation uses a hash table. Of course, if the key type uses a terrible hashing algorithm then performance can degrade, but you can rest assured that this is not the case for the framework's string type.
However, as I asked in my comment, you need to answer a few questions:
Define what performance metric is most important, i.e., Time (CPU) or space (memory).
What are your requirements? How will this be used? What's your worst case scenario? Is this going to hold a ton of data with relatively infrequent lookups, do many lookups need to be performed in a short amount of time, or do both hold true?
The Dictionary<K,V> class also uses an array internally which will grow as you add items. Is this okay for you? Again, you need to be more specific in terms of your requirements before anyone can give you a complete answer.
Using a Dictionary will be slower than following direct references, because the Dictionary has to compute a hash, etc. If you really only need to look up a string's parent and not its children (which I doubt), then you could store the strings in an array together with the index of each string's parent.
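A sketch of that array-plus-parent-index layout (names and data invented; -1 marks a root with no parent):

using System;

class ParentIndexExample
{
    static void Main()
    {
        // Each slot i holds a string and the index of its parent (-1 = no parent).
        string[] names   = { "root", "child-a", "grandchild" };
        int[]    parents = { -1,      0,         1           };

        int index = 2;                       // "grandchild"
        int parentIndex = parents[index];
        Console.WriteLine(parentIndex >= 0 ? names[parentIndex] : "(no parent)");
    }
}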

Adding keys to a list or collection - is there any value in hashing the key before adding it

I have stumbled across some code that is adding strings to a List but hashing the value before adding it.
It's using an MD5 hash (MD5CryptoServiceProvider) on the string value, then adding this to the list.
Is there any value in doing this in terms of the speed of finding the key in the list, or is it just unnecessary?
I am not going to assume to know what the authors of the code you were viewing were doing with their list. But I will say that if you have a large list and performance of searching is critical, then there's a class for that. HashSet<T> will suit your needs nicely.
First of all, a list (List<T>) does not have “keys”. However, a Dictionary<TKey, TValue> does.
Secondly, to answer your performance question: no, there is actually a performance penalty in computing that hash. However, before you jump to the conclusion that it is unnecessary, examine the surrounding code and consider whether the author may have actually needed the MD5 hash sum rather than the string itself.
Thirdly, if you need to look something up efficiently, you can use a HashSet<T> if you just need to check its existence, or Dictionary<TKey, TValue> if you need to associate the keys that you look up with a value.
If you place strings in a dictionary or hashset, C# will already generate a hash value from any string you put in. This generally uses a hash algorithm that is much faster than MD5.
I don't think it's necessary to do this for a List if the aim is to improve performance. A List is searched linearly either way; strings are looked up the same way whether they are hashed first or not.
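For comparison, a minimal sketch of the two lookup strategies discussed above: a linear scan of a List<string> versus a hashed lookup in a HashSet<string> (no MD5 involved in either case):

using System;
using System.Collections.Generic;

class LookupDemo
{
    static void Main()
    {
        var list = new List<string> { "alpha", "beta", "gamma" };
        bool inList = list.Contains("beta");   // linear scan, O(n)

        var set = new HashSet<string>(list);   // each string is hashed once on insert
        bool inSet = set.Contains("beta");     // hashed lookup, O(1) on average

        Console.WriteLine(inList + " " + inSet);
    }
}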
