In terms of speed in search, is it better to search the keys of a dictionary or the values of a list?
In other words, which of these would be most preferable?
Dictionary<string,string> dic = new Dictionary<string,string>();
if(dic.ContainsKey("needle")){ ... }
Or
List<string> list = new List<string>();
if(list.Contains("needle")){ ... }
If by "better" you mean "faster" then use a dictionary. Dictionary keys are organized by hash codes, so lookups are significantly faster than list searches once there are more than just a few items in the collection.
With a good hashing algorithm, Dictionary searches can be close to O(1), meaning the search time is independent of the size of the dictionary. Lists, on the other hand, are O(n), meaning that the time is (on average) proportional to the size of the list.
If you just have key items (not mapping keys to values) you might also try a HashSet. It has the benefit of O(1) lookups without the overhead of the Value side of a dictionary.
(Granted the overhead is probably minimal, but why have it if you don't need it?)
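As a minimal sketch of the HashSet suggestion (the class and method names here are made up for illustration, and the ordinal comparer is an assumption about how the strings should be compared):

```csharp
using System;
using System.Collections.Generic;

public static class NeedleSearch
{
    // Builds the set once; each later Contains call is a hash lookup
    // rather than a scan over every element.
    public static HashSet<string> BuildSet(IEnumerable<string> items)
    {
        return new HashSet<string>(items, StringComparer.Ordinal);
    }

    // The same membership question asked of a list: this walks the
    // elements one by one until it finds a match or runs out.
    public static bool InList(List<string> list, string needle)
    {
        return list.Contains(needle);
    }
}
```

Calling NeedleSearch.BuildSet(list).Contains("needle") answers the same question as list.Contains("needle"), but only pays the cost of building the set once.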
For lookups a dictionary is usually best because the time it takes remains constant. With a list it increases the larger the list gets.
See also: http://www.dotnetperls.com/dictionary-time
I suggest using Dictionary when the number of lookups greatly exceeds the number of insertions. It is fine to use List when you will always have fewer than four items.
For lookups, Dictionary is usually a better choice. The time required is flat, an O(1) constant time complexity. The List has an O(N) linear time complexity. Three elements can be looped over faster than looked up in a Dictionary.
Related
I have a custom performance timer implementation. In short it is a static data collection storing execution duration of some code paths. In order to identify particular measurements I need a collection of named objects with quick access to the data item by name i.e. a string of moderate length like 20-50 chars.
Straightforward way to do that could be a Dictionary<string, MyPerformanceCounter> with access by key which is the counter id.
What about a List<MyPerformanceCounter> that is kept sorted and accessed via List<T>.BinarySearch and List<T>.Insert? Does it have a chance of giving better or more predictable performance once I have several hundred counters?
Needless to say, I need access to the right MyPerformanceCounter to be as quick as possible, since it is called tens of thousands of times per second and should affect code execution as little as possible.
New counters are appended relatively seldom like once per second.
There are several potentially non-O(1) parts to a dictionary.
The first is generating a hash code. If your strings are long, it will have to generate a hash of the string every time you use it as a key in your dictionary. The dictionary stores the hashes of the existing keys, so you don't have to worry about that, just hashing what you're passing in. If the strings are all short, hashing should be fast. Long strings are probably going to take longer to hash than doing a string comparison. Hashing affects both reads and writes.
The next non-constant part of a dictionary is when you have hash collisions. It keeps a linked list of values with the same hash bucket internally, and has to go through and compare your key to each item in that bucket if you get hash collisions. Since you're using strings and they spent a lot of effort coming up with a good string hashing function, this shouldn't be too major an issue. Hash collisions slow down both reads and writes.
The last non-constant part only applies during writes: if the dictionary runs out of internal storage, it has to rebuild the whole hash table internally. This is still a lot faster than doing array inserts (like a List<> would do). If you only have a few hundred items, this is definitely not going to affect you.
A list, on the other hand, is going to take an average of N/2 copies for each insert, and log2(N) comparisons for each lookup. Unless the strings all have similar prefixes, each individual comparison will be much faster than a dictionary lookup, but there will be a lot more of them.
So unless your strings are quite long to make hashing inefficient, chances are a dictionary is going to give you better performance.
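A minimal sketch of the dictionary approach for the counters, under a few assumptions: the MyPerformanceCounter class below is a hypothetical stand-in for the real one, CounterRegistry is an invented name, and the code is not thread-safe, so it assumes either a single thread or external locking.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the counter type named in the question.
public sealed class MyPerformanceCounter
{
    public long Calls;
    public long TotalTicks;
}

public static class CounterRegistry
{
    // Ordinal comparison avoids culture-sensitive string work on the hot path.
    private static readonly Dictionary<string, MyPerformanceCounter> Counters =
        new Dictionary<string, MyPerformanceCounter>(StringComparer.Ordinal);

    // Returns the counter for an id, creating it on first use.
    // On the common path (counter already exists) this is a single
    // TryGetValue call, so the key is hashed only once.
    public static MyPerformanceCounter Get(string id)
    {
        MyPerformanceCounter counter;
        if (!Counters.TryGetValue(id, out counter))
        {
            counter = new MyPerformanceCounter();
            Counters.Add(id, counter);
        }
        return counter;
    }
}
```

If the same code path is measured repeatedly, caching the returned MyPerformanceCounter reference in a field avoids even the hash lookup on later calls.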
If you know something about the nature of your strings, you can write a more specific data structure optimized for your scenario. For example, if I knew all the strings started with an ASCII capital letter, and each is between 5 and 10 characters in length, I might create an array of 26 arrays, one for each letter, and then each of those arrays contains 6 lists, one for each length of string. Something like this:
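List<string>[][] lists = new List<string>[26][];
foreach (string s in keys)
{
    // Outer index: leading letter 'A'..'Z'. Inner index: length 5..10 mapped to 0..5.
    var row = lists[s[0] - 'A'];
    if (row == null)
    {
        lists[s[0] - 'A'] = row = new List<string>[6];
    }
    var list = row[s.Length - 5];
    if (list == null)
    {
        row[s.Length - 5] = list = new List<string>();
    }
    // A negative result is the bitwise complement of the insertion point.
    int ix = list.BinarySearch(s);
    if (ix < 0)
    {
        list.Insert(~ix, s);
    }
}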
This is the kind of thing you do if you have very specific information about what kind of data you're dealing with. If you can't make assumptions, using a Dictionary is most likely going to be your best bet.
You might also want to consider SortedDictionary<TKey,TValue> if you want to go the binary search route; it uses a binary search tree internally. (OrderedDictionary, linked here, keeps entries in insertion order rather than sorted by key: https://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary%28v=vs.110%29.aspx)
I believe you should use the Dictionary<string, MyPerformanceCounter>.
For small sets of data the list will have a better performance. However, as more elements are needed, the Dictionary becomes clearly superior.
The time required for a Dictionary is: O(1) constant time complexity.
The List has an O(N) linear time complexity.
You could try Hashtable or SortedDictionary, but I think that you should still use Dictionary.
I provide a link with benchmarks and guidelines here: http://www.dotnetperls.com/dictionary-time
I hope this helps you.
I have a list of about 500 strings "joe" "john" "jack" ... "jan"
I only need to find the ordinal.
In my example, the list will never be changed.
One could just put them in a list and IndexOf
ll.Add("joe")
ll.Add("john")
...
ll.Add("jan")
ll.IndexOf("jib") is 315
or you can put them in a dictionary, using the ordinal integers as the values,
dd.Add("joe", 1)
dd.Add("john", 2)
dd.Add("jack", 3)
...
dd.Add("jan", 571)
dd["jib"] is 315
FTR the strings are 3 to 8 characters long. FTR this is in a Unity, hence Mono, milieu.
Purely for performance, is one approach generally preferable?
1b) Indeed, I found a number of analyses of this nature: http://www.dotnetperls.com/dictionary-time (google for a number of similar analyses). Does this apply to the situation I describe or am I off here?
It's a shame there isn't a "HashSetLikeThingWithOrdinality" type of facility. If I'm missing something obvious, please let us know. Indeed, this seems like a fairly common, basic, collections use case ("get the ordinal of some strings"), so perhaps I am completely missing something.
Here's a small overview on the difference between using a Dictionary<string,int> and a (sorted)List<string> for this:
Observations:
1) In my micro benchmarks, once the dictionary is created, the dictionary is much faster. (Explanations as to why will follow shortly)
2) In my opinion, mapping in some way (e.g. a Dictionary or Hashtable) will be significantly less awkward.
Performance:
For the List<string>, to do a binary search, the system will start in the 'middle', then walk each direction (stepping into the 'middle' in the now halved search space, in a typical divide and conquer pattern) depending on if the value is greater or smaller than the value at the index it's looking at. This is O(log n) growth. This assumes that data is already sorted in some manner (also applies to stuff like SortedDictionary, which uses data structures that allow for binary searching)
Alternately, you'd do IndexOf, which is O(n) complexity because you have to walk every element.
For the Dictionary<string,int>, it uses a hash lookup: it generates a hash of the object by calling .GetHashCode() on the TKey (string in this case), uses that to look up in a hash table, does a compare to ensure it is an exact match, and gets the value out. This is roughly O(1) growth (i.e. the complexity doesn't grow meaningfully with the number of elements). [Not including worst case scenarios involving hash collisions here]
Because of this, Dictionary<string,int> takes a (relatively) constant amount of time to do lookups, while List<string> grows according to the number of elements (albeit at a logarithmic (slow) rate).
Testing:
I did a few micro benchmarks, where I took the top 500 female names and did lookups against them. The lookups looked something like this:
var searchItems = new[] { "Maci", "Daria", "Michelle", "Amber", "Henrietta"};
foreach (var item in searchItems)
{
    sortedList.BinarySearch(item); //You'd store the output here. Just looking at performance
}
And compared it to a dictionary lookup:
foreach (var item in searchItems)
{
    var output = dictionary.ContainsKey(item) ? dictionary[item] : -1; //Presumably, output would be declared outside of this, just getting rid of a compiler error
}
So, here's the thing: even for a small number of elements, with short strings as lookup keys, a sorted List<string> isn't any faster (on my machine, in my admittedly simplistic tests) than a Dictionary<string,int>. Once again, this is a microbenchmark, but, for 500 elements, the 5 lookups are roughly 3x faster with the dictionary.
Keep in mind, however, that the list was 6.3 microseconds, and the dictionary was 1.8 microseconds.
Syntax:
Using a list as a lookup to find indexes is slightly awkward. A mapping type (like Dictionary) implies intent much better than your lookup list does, which should make for more maintainable code in the end.
That said, with my syntax and performance considerations, I'd say go with the Dictionary. However, if you don't like Dictionaries for whatever reason, the performance considerations are on such a small scale that it's a pointless thing to worry about anyways.
Edit: Bonus points, you will probably want to use a case-insensitive comparer for either method. You can pass a comparer as an argument for Dictionary and BinarySearch() should support a comparer as well.
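To illustrate the comparer point, here is a small sketch assuming StringComparer.OrdinalIgnoreCase is the case-insensitive comparison you want (the class and method names are invented for the example). Note that with the list route, the list has to be sorted with the same comparer that BinarySearch is given, and the index returned is the position in the sorted list, not the original ordinal.

```csharp
using System;
using System.Collections.Generic;

public static class CaseInsensitiveLookup
{
    // Dictionary route: the comparer is fixed when the dictionary is
    // constructed. Values are the positions in the original list.
    public static Dictionary<string, int> BuildMap(IList<string> names)
    {
        var map = new Dictionary<string, int>(names.Count, StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < names.Count; i++)
        {
            map[names[i]] = i;
        }
        return map;
    }

    // List route: sort a copy with the same comparer used for searching.
    public static List<string> BuildSorted(IEnumerable<string> names)
    {
        var sorted = new List<string>(names);
        sorted.Sort(StringComparer.OrdinalIgnoreCase);
        return sorted;
    }

    // Returns the index in the sorted list, or a negative number if absent.
    public static int Find(List<string> sorted, string name)
    {
        return sorted.BinarySearch(name, StringComparer.OrdinalIgnoreCase);
    }
}
```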
I suspect that there might be a twist somewhere, as such a simple question has had no answer for 2 hours. I'll risk being down-voted, but here are my answers:
1) Dictionary (hash table-based) is clearly a better choice for a fast lookup. List, on the other hand, is the worst choice.
1.b) Yes, it applies here. Search in the List has linear complexity, while Dictionary provides constant time lookup.
2) You are trying to map a string to an ordinal; any kind of map will be natural here (while any kind of list is awkward).
Dictionary is the natural approach for a lookup.
A list would be an optimisation for less memory use at the cost of decreased speed. An array would do better still (same time, but slightly less memory again).
If you already had a list or array for some other reason, then the memory saving would be greater still, because no more memory is used than would be used anyway, making it a better optimisation for space at the same cost to speed. (If the keys are stored in sorted order, a lookup could be O(log n), but otherwise it's O(n).)
Creating the dictionary itself takes time, so although it's the fastest approach, if the number of lookups is small, building it might cost as much as it saves and so not be worth it.
I have a Dictionary of objects with strings as the keys. This Dictionary is first populated with anywhere from 50 to tens of thousands of entries. Later on my program looks for values within this dictionary, and after having found an item in the dictionary I no longer have any need to persist the object that I just found in the dictionary. My question then is, would I be able to get better total execution time if I remove entries from the dictionary once I no longer have use for them, perhaps cutting down memory usage or just making subsequent lookups slightly faster, or would the extra time spent removing items be more impactful?
I understand the answer to this may depend upon certain details such as how many total lookups are done against the dictionary, the size of the key, and the size of the object, I will try to provide these below, but is there a general answer to this? Is it unnecessary to try and improve performance in this way, or are there cases where this would be a good idea?
Key is variable length string, either 6 characters or ~20 characters.
Total lookups is completely up in the air, I may have to only check 50x or so or I may have to look 10K times completely independent of the size of the dictionary, i.e. dictionary may have 50 items and I may do 10K lookups, or I may have 10K items and only do 50 lookups.
One additional note is that if I do remove items from the dictionary and I am ever left with an empty dictionary I can then signal to a waiting thread to no longer wait for me while I process the remaining items (involves parsing through a long text file while looking up items in the dictionary to determine what to do with the parsed data).
Dictionary lookups are essentially O(1). Removing items from the dictionary will have a tiny (if any) impact on lookup speed.
In the end, it's very likely that removing items will be slower than just leaving them in.
The only reason I'd suggest removing items would be if you need to reduce your memory footprint.
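If you do decide to remove items (for the memory footprint, or to know when nothing is left to find), a sketch could look like the following. OneShotLookup and TryTake are names invented for this example; how the waiting thread is actually signalled is left out.

```csharp
using System;
using System.Collections.Generic;

public static class OneShotLookup
{
    // Looks up a key and removes it if present.
    // Returns true when the item was found. 'isEmpty' reports whether the
    // dictionary holds no more entries after the call, which is the point
    // at which a waiting thread could be signalled.
    public static bool TryTake<TValue>(
        Dictionary<string, TValue> pending,
        string key,
        out TValue value,
        out bool isEmpty)
    {
        bool found = pending.TryGetValue(key, out value);
        if (found)
        {
            pending.Remove(key);
        }
        isEmpty = pending.Count == 0;
        return found;
    }
}
```

This costs one extra hash of the key per hit (once for TryGetValue, once for Remove), which is the "extra time spent removing" that the question asks about.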
I found some interesting items over at DotNetPerls that seem to relate to your question.
The order you add keys to a Dictionary is important. It affects the performance of accessing those keys. Because the Dictionary uses a chaining algorithm, the keys that were added last are often faster to locate.
http://www.dotnetperls.com/dictionary-order
Dictionary size influences lookup performance. Smaller Dictionaries are faster than larger Dictionaries. This is true when they are tested for keys that always exist in both. Reducing Dictionary size could help improve performance.
http://www.dotnetperls.com/dictionary-size
I thought this last tidbit was really interesting. It didn't occur to me to consider my key length.
Generally, shorter [key] strings perform better than longer ones.
http://www.dotnetperls.com/dictionary-string-key
Good question!
This may be a silly question, but I am reading that Hashtables and Dictionaries are faster than a list because they index the items with keys.
I know a List or Array is for elements without values, and a Dictionary is for elements with values. So I would think that it may be smart to have a Dictionary with the value that you need as the key and the same value in all of them?
Update:
Based on the comments what I think I need is a HashSet. This question talks about their performance.
"Faster" depends on what you need them for.
A .NET List is just a slab of contiguous memory (this is not a linked list), which makes it extremely efficient to access sequentially (especially when you consider the effects of caching and prefetching of modern CPUs) or "randomly" through a known integer index. Searching or inserting elements (especially in the middle) - not so much.
Dictionary is an associative data structure - a key can be anything hashable (not just integer index), but elements are not sorted in a "meaningful" way and the access through the known key is not as fast as List's integer index.
So, pick the right tool for the job.
There are some weaknesses to Dictionary/Hashtable vs a List/array as well:
You have to compute the hash value of the object with each lookup.
For small collections, iterating through the array can be faster than computing that hash, especially because a hash is not guaranteed to be unique [1].
They are not as good at iterating over the list of items.
They are not very good at storing duplicate entries (sometimes you legitimately want a value to show in an array more than once).
Sometimes a type does not have a good key to associate with it.
Use what fits the situation. Sometimes that will be a list or an array. Sometimes it will be a Dictionary. You should almost never use a Hashtable any more (prefer Dictionary<KeyType, Object> if you really don't know what type you're storing).
[1] It usually is unique, but because there is a small potential for collisions the collection must check the bucket after computing the hash value.
Your statement "list or array is for elements without values, and dictionary is for elements with values", is not strictly true.
More accurately, a List is a collection of elements, and a Hashtable or Dictionary is a collection of elements along with a unique key to be used to access each one.
Use a list for collections of a very few elements, or when you will only need to access the entire collection, not a single element of the collection.
Use a Hashtable or Dictionary when the collection is large and/or when you will need to find/access individual members of the collection.
I have a list of strings and need to find which strings match a given input value.
What is the most efficient way (memory vs execution speed) for me to store this list of strings and be able to search through it? The start-up and loading of the list of strings isn't important, but the response time for searching is.
Should I be using a List or HashSet or just a basic string[] or something else?
It depends very much on the nature of the strings and the size of the collection. Depending on characteristics of the collection, and the expected search strings, there are ways to organize things very cleverly so that searching is very fast. You haven't given us that information.
But here's what I'd do. I'd set a reasonable performance requirement. Then I'd try an n-gram index (why? because you said in a comment you need to account for partial matches; a HashSet<string> won't help you here) and I'd profile reasonable inputs that I expect against this solution and see if it meets my performance requirements or not. If it does, I'd accept the solution and move on. If it doesn't, I'd think very carefully about whether or not my performance requirements are reasonable. If they are, I'd start thinking about whether or not there is something special about my inputs and collection that might enable me to use some more clever solutions.
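For a rough idea of what an n-gram index could look like, here is a sketch using trigrams (n = 3) and exact, case-sensitive substring matching. The TrigramIndex class is purely illustrative and not taken from any library; the choice of n and the ordinal comparison are assumptions you would tune for your data.

```csharp
using System;
using System.Collections.Generic;

// Trigram index over a fixed list of strings, for substring ("partial") matches.
public sealed class TrigramIndex
{
    private const int N = 3;
    private readonly List<string> _items;

    // Maps each trigram to the ids (positions in _items) of strings containing it.
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>(StringComparer.Ordinal);

    public TrigramIndex(IEnumerable<string> items)
    {
        _items = new List<string>(items);
        for (int id = 0; id < _items.Count; id++)
        {
            string s = _items[id];
            for (int i = 0; i + N <= s.Length; i++)
            {
                string gram = s.Substring(i, N);
                HashSet<int> ids;
                if (!_postings.TryGetValue(gram, out ids))
                {
                    ids = new HashSet<int>();
                    _postings.Add(gram, ids);
                }
                ids.Add(id);
            }
        }
    }

    // Returns every stored string that contains 'query' as a substring.
    // The order of the results is not specified.
    public List<string> Search(string query)
    {
        var results = new List<string>();
        if (query.Length < N)
        {
            // Too short to form a trigram: fall back to a plain scan.
            foreach (string s in _items)
            {
                if (s.IndexOf(query, StringComparison.Ordinal) >= 0) results.Add(s);
            }
            return results;
        }

        // A match must contain every trigram of the query.
        HashSet<int> candidates = null;
        for (int i = 0; i + N <= query.Length; i++)
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(query.Substring(i, N), out ids))
            {
                return results; // no stored string has this trigram
            }
            if (candidates == null) candidates = new HashSet<int>(ids);
            else candidates.IntersectWith(ids);
        }

        // Sharing all trigrams is necessary but not sufficient, so verify.
        foreach (int id in candidates)
        {
            if (_items[id].IndexOf(query, StringComparison.Ordinal) >= 0)
            {
                results.Add(_items[id]);
            }
        }
        return results;
    }
}
```

The index trades memory (one posting set per distinct trigram) for search speed, which is why profiling against realistic inputs is the deciding step.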
It seems the best way is to build a suffix tree of your input in O(input_len) time then do queries of your patterns in O(pattern_length) time. So if your text is really big compared to your patterns, this will work well.
See Ukkonen's algorithm for building a suffix tree.
If you want inexact matching...see the work of Gonzalo Navarro.
Using a Dictionary<string, TValue> or a HashSet<string> is probably good for you.
Look here for Dictionary
and here for HashSet
Dictionary and Hashtable are going to be the fastest at "searching" because lookups are O(1). The downside of Dictionaries and Hashtables is that they are not sorted.
Using a binary search tree you will get O(log N) searching.
Using an unsorted list you will get O(N) searching.
Using a sorted list you will get O(log N) searching, but keep in mind the list has to be sorted first, which adds to the overall time.
As for memory use, just make sure that you initialize the collection with its expected size (capacity).
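A small sketch of the capacity and sort-once points for the sorted-list route (PresizedList and LoadSorted are invented names; Dictionary<TKey,TValue> has a similar constructor that takes a capacity):

```csharp
using System;
using System.Collections.Generic;

public static class PresizedList
{
    // Passing the expected element count up front lets the list allocate
    // its backing array once instead of growing it repeatedly while loading.
    // The list is then sorted a single time so that later lookups can use
    // BinarySearch with the same comparer.
    public static List<string> LoadSorted(ICollection<string> source)
    {
        var list = new List<string>(source.Count);
        foreach (string s in source)
        {
            list.Add(s);
        }
        list.Sort(StringComparer.Ordinal);
        return list;
    }
}
```

Lookups then go through list.BinarySearch(value, StringComparer.Ordinal), which returns a negative number when the value is absent.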
So dictionary or hash table are the fastest for retrieval.
Speed classifications from best to worst are
O(1)
O(log n)
O(n)
O(n log n)
O(n^2)
O(2^n)
n being the number of elements.