Performance Dictionary<string,int> versus List<string> - c#

I have a list of about 500 strings "joe" "john" "jack" ... "jan"
I only need to find the ordinal.
In my example, the list will never be changed.
One could just put them in a list and IndexOf
ll.Add("joe")
ll.Add("john")
...
ll.Add("jan")
ll.IndexOf("jib") is 315
or you can put them in a dictionary, using the ordinal integers as the values,
dd.Add("joe", 1)
dd.Add("john", 2)
dd.Add("jack", 3)
...
dd.Add("jan", 571)
dd["jib"] is 315
FTR the strings are 3 to 8 characters long. FTR this is in a Unity, hence Mono, milieu.
Purely for performance, is one approach generally preferable?
1b) Indeed, I found a number of analyses of this nature: http://www.dotnetperls.com/dictionary-time (google for a number of similar analyses). Does this apply to the situation I describe, or am I off here?
It's a shame there isn't a "HashSetLikeThingWithOrdinality" type of facility; if I'm missing something obvious, please let us know. Indeed, this seems like a fairly common, basic collections use case ("get the ordinal of some strings"), so perhaps I am completely missing something obvious.

Here's a small overview on the difference between using a Dictionary<string,int> and a (sorted)List<string> for this:
Observations:
1) In my micro benchmarks, once the dictionary is created, the dictionary is much faster. (Explanations as to why will follow shortly)
2) In my opinion, mapping in some way (eg. a Dictionary or HashTable) will be significantly less awkward.
Performance:
For the List<string>, to do a binary search, the system will start in the 'middle', then move in one direction or the other (stepping into the 'middle' of the now halved search space, in a typical divide and conquer pattern) depending on whether the value is greater or smaller than the value at the index it's looking at. This is O(log n) growth. It assumes that the data is already sorted in some manner (the same applies to things like SortedDictionary, which uses data structures that allow for binary searching).
Alternately, you'd do IndexOf, which is O(n) complexity because in the worst case you have to walk every element.
For the Dictionary<string,int>, it uses a hash lookup: it generates a hash of the key by calling .GetHashCode() on the TKey (string in this case), uses that hash to find a bucket in a hash table, does a compare to ensure it is an exact match, and gets the value out. This is roughly O(1) growth (i.e. the complexity doesn't grow meaningfully with the number of elements), not including worst-case scenarios involving hash collisions.
Because of this, Dictionary<string,int> takes a (relatively) constant amount of time to do lookups, while List<string> grows according to the number of elements (albeit at a logarithmic (slow) rate).
Testing:
I did a few micro benchmarks, where I took the top 500 female names and did lookups against them. The lookups looked something like this:
var searchItems = new[] { "Maci", "Daria", "Michelle", "Amber", "Henrietta"};
foreach (var item in searchItems)
{
sortedList.BinarySearch(item); //You'd store the output here. Just looking at performance
}
And compared it to a dictionary lookup:
foreach (var item in searchItems)
{
var output = dictionary.ContainsKey(item) ? dictionary[item] : -1; //Presumably, output would be declared outside of this, just getting rid of a compiler error
}
So, here's the thing: even for a small number of elements, with short strings as lookup keys, a sorted List<string> isn't any faster (on my machine, in my admittedly simplistic tests) than a Dictionary<string,int>. Once again, this is a microbenchmark, but, for 500 elements, the 5 lookups are roughly 3x faster with the dictionary.
Keep in mind, however, that the list was 6.3 microseconds, and the dictionary was 1.8 microseconds.
Syntax:
Using a list as a lookup to find indexes is slightly awkward. A mapping type (like Dictionary) implies intent much better than your lookup list does, which should make for more maintainable code in the end.
That said, with my syntax and performance considerations, I'd say go with the Dictionary. However, if you don't like Dictionaries for whatever reason, the performance considerations are on such a small scale that it's a pointless thing to worry about anyways.
Edit: Bonus points, you will probably want to use a case-insensitive comparer for either method. You can pass a comparer as an argument for Dictionary and BinarySearch() should support a comparer as well.
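To make the dictionary approach concrete, here is a minimal sketch of building the name-to-ordinal map once from an existing array and then looking names up case-insensitively. The class name OrdinalLookup and its two methods are hypothetical names chosen for illustration; the ordinals are 0-based so they line up with what IndexOf would return.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalLookup
{
    // Builds a name -> index map once; the index is the position in the input array.
    public static Dictionary<string, int> Build(string[] names)
    {
        // Passing the capacity up front avoids resizing while filling;
        // OrdinalIgnoreCase makes "Joe" and "joe" the same key.
        var map = new Dictionary<string, int>(names.Length, StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < names.Length; i++)
        {
            map[names[i]] = i; // if a name repeats, the last position wins
        }
        return map;
    }

    // Returns the ordinal, or -1 when the name is unknown (mirrors List.IndexOf).
    public static int IndexOf(Dictionary<string, int> map, string name)
    {
        int index;
        return map.TryGetValue(name, out index) ? index : -1;
    }
}
```

TryGetValue does the existence check and the read in a single hash lookup, which is why it is usually preferred over ContainsKey followed by the indexer.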

I suspect that there might be a twist somewhere, as such a simple question has had no answer for 2 hours. I'll risk being down-voted, but here are my answers:
1) Dictionary (hash table-based) is clearly a better choice for a fast lookup. List, on the other hand, is the worst choice.
1.b) Yes, it applies here. Search in the List has linear complexity, while Dictionary provides constant time lookup.
2) You are trying to map a string to an ordinal; any kind of map will be natural here (while any kind of list is awkward).

Dictionary is the natural approach for a lookup.
A list would be an optimisation for less memory use at the cost of decreased speed. An array would do better still (same time, but slightly less memory again).
If you already had a list or array for some other reason, the memory saving would be greater still, because the lookup would use no memory beyond what was already allocated anyway; that makes it a better optimisation for space at the same cost to speed. (If the keys are stored in sorted order, a binary search makes the lookup O(log n); otherwise it's O(n).)
Creating the dictionary itself takes time, so while it's the fastest approach per lookup, if the number of lookups is small then building it might cost as much as it saves, and so not be worth it.

Related

Is Dictionary.ContainsKey() any better than FirstOrDefault()?

I know, nothing one million of anything's gonna be performant. But I'm needing that piece o' knowledge right now.
I have a Dictionary and a string[]. The boolean in the dictionary is just to fill the space. Let's imagine that as an Inventory System just to make things easier.
In this inventory, I wanna check if I already had gotten one item. So what I'd do is:
if (dic.ContainsKey(item_id)) // That could be a TryGetValue() as well.
{
// Do some logic.
}
But would it be better to just have an array?
if (array.FirstOrDefault(a => a == item_id) != null)
{
// Do magic.
}
I mean, which would perform better in that specific case?
I know, that's a silly question, but when you can have over one million (or over nine thousand, for the DBZ fans out there xD) checks, things can get pretty heavy, especially for mobile, VR and others with similar performance.
Plus, I just want my users to have the best experience with my Inventory (a.k.a. no lag), so I often take stuff like that in consideration.
There is a tradeoff here between space and time.
A Dictionary is a relatively heavy weight structure compared to an array.
The lookup time in a Dictionary (or a HashSet) is basically independent of the number of entries, O(1), while with the array it increases linearly, O(N).
So there is a certain number of items where the Dictionary (or HashSet) begins to be considerably faster. And 1 million is certainly above this threshold.
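Since the boolean in the question's dictionary is only a placeholder, a HashSet<string> expresses the same membership check without the dummy value. The following is a minimal sketch; the Inventory class and its Acquire/Has methods are hypothetical names for illustration, not part of any Unity API.

```csharp
using System.Collections.Generic;

public sealed class Inventory
{
    // A HashSet stores only keys, so no placeholder bool is needed.
    private readonly HashSet<string> owned = new HashSet<string>();

    // Returns false when the item was already in the inventory.
    public bool Acquire(string itemId)
    {
        return owned.Add(itemId);
    }

    // Hash-based membership test; the cost does not grow with the number of items.
    public bool Has(string itemId)
    {
        return owned.Contains(itemId);
    }
}
```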

Performance of Dictionary<string, object> vs List<string> + BinarySearch

I have a custom performance timer implementation. In short it is a static data collection storing execution duration of some code paths. In order to identify particular measurements I need a collection of named objects with quick access to the data item by name i.e. a string of moderate length like 20-50 chars.
Straightforward way to do that could be a Dictionary<string, MyPerformanceCounter> with access by key which is the counter id.
What about a List<MyPerformanceCounter> which could be accessed and kept sorted via List<T>.BinarySearch and List<T>.Insert? Would it have a chance of more linear performance once I have several hundred counters?
Needless to say, I need access to the proper MyPerformanceCounter to be as quick as possible, as it is called at rates of tens of thousands of times per second and should affect code execution as little as possible.
New counters are appended relatively seldom, about once per second.
There are several potentially non-O(1) parts to a dictionary.
The first is generating a hash code. If your strings are long, it will have to generate a hash of the string every time you use it as a key in your dictionary. The dictionary stores the hashes of the existing keys, so you don't have to worry about that, just hashing what you're passing in. If the strings are all short, hashing should be fast. Long strings are probably going to take longer to hash than doing a string comparison. Hashing affects both reads and writes.
The next non-constant part of a dictionary is when you have hash collisions. It keeps a linked list of values with the same hash bucket internally, and has to go through and compare your key to each item in that bucket if you get hash collisions. Since you're using strings and they spent a lot of effort coming up with a good string hashing function, this shouldn't be too major an issue. Hash collisions slow down both reads and writes.
The last non-constant part is only during writes, if it runs out of internal storage, it has to recalculate the whole hash table internally. This is still a lot faster than doing array inserts (like a List<> would do). If you only have a few hundred items, this is definitely not going to affect you.
A list, on the other hand, is going to take an average of N/2 copies for each insert, and log2(N) for each lookup. Unless the strings all have similar prefixes, the individual comparisons will be much faster than the dictionary, but there will be a lot more of them.
So unless your strings are quite long to make hashing inefficient, chances are a dictionary is going to give you better performance.
If you know something about the nature of your strings, you can write a more specific data structure optimized for your scenario. For example, if I knew all the strings started with an ASCII capital letter, and each is between 5 and 10 characters in length, I might create an array of 26 arrays, one for each letter, and then each of those arrays contains 6 lists, one for each length of string. Something like this:
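// Jagged array: 26 first letters x 6 lengths (5..10).
// C# cannot allocate both dimensions of a jagged array in one expression.
var lists = new List<string>[26][];
for (int i = 0; i < lists.Length; i++)
{
    lists[i] = new List<string>[6];
}
foreach (string s in keys)
{
    int letter = s[0] - 'A';   // 0..25
    int length = s.Length - 5; // 0..5
    var list = lists[letter][length];
    if (list == null)
    {
        lists[letter][length] = list = new List<string>();
    }
    int ix = list.BinarySearch(s);
    if (ix < 0)
    {
        list.Insert(~ix, s); // ~ix is the insertion point that keeps the list sorted
    }
}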
This is the kind of thing you do if you have very specific information about what kind of data you're dealing with. If you can't make assumptions, using a Dictionary is most likely going to be your best bet.
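You might also want to consider using SortedDictionary<TKey,TValue> if you want to go the binary search route; it is implemented as a binary search tree internally. Note that OrderedDictionary (https://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary%28v=vs.110%29.aspx) is a different thing: it preserves insertion order rather than keeping the keys sorted.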
I believe you should use the Dictionary<string, MyPerformanceCounter>.
For small sets of data the list will have better performance. However, as more elements are needed, the Dictionary becomes clearly superior.
The time required for a Dictionary lookup is O(1): constant time complexity.
A linear scan of the List has O(N) linear time complexity; List<T>.BinarySearch on a sorted list is O(log N).
You could try Hashtable or SortedDictionary, but I think that you should still use Dictionary.
I provide a link with benchmarks and guidelines here: http://www.dotnetperls.com/dictionary-time
I hope this helps you.

efficient way to search for string in list of string?

I have a list of strings and need to find which strings match a given input value.
What is the most efficient way (memory vs execution speed) for me to store this list of strings and be able to search through it? The start-up and loading of the list of strings isn't important, but the response time for searching is.
Should I be using a List or HashSet or just a basic string[] or something else?
It depends very much on the nature of the strings and the size of the collection. Depending on characteristics of the collection, and the expected search strings, there are ways to organize things very cleverly so that searching is very fast. You haven't given us that information.
But here's what I'd do. I'd set a reasonable performance requirement. Then I'd try a n-gram index (why? because you said in a comment you need to account for partial matches; a HashSet<string> won't help you here) and I'd profile reasonable inputs that I expect against this solution and see if it meets my performance requirements or not. If it does, I'd accept the solution and move on. If it doesn't, I'd think very carefully about whether or not my performance requirements are reasonable. If they are, I'd start thinking about whether or not there is something special about my inputs and collection that might enable me to use some more clever solutions.
It seems the best way is to build a suffix tree of your input in O(input_len) time then do queries of your patterns in O(pattern_length) time. So if your text is really big compared to your patterns, this will work well.
See Ukkonen's algorithm for building a suffix tree.
If you want inexact matching...see the work of Gonzalo Navarro.
Using a Dictionary keyed by string or a HashSet<string> is probably good for you.
Look here for Dictionary
and here for HashSet
Dictionary and Hashtable are going to be the fastest at "searching" because lookup is O(1). The downside of Dictionaries and Hashtables is that they are not sorted.
Using a binary search tree you will be able to get O(log N) searching.
Using an unsorted list you will get O(N) searching.
Using a sorted list you will get O(log N) searching, but keep in mind the list has to be sorted first, which adds to the overall time.
As for memory use, just make sure that you initialize the collection with its expected size.
So dictionary or hash table are the fastest for retrieval.
Speed classifications from best to worst are
O(1)
O(log n)
O(n)
O(n log n)
O(n^2)
O(2^n)
n being the number of elements.

Fast data structure for small sets

I'm in need of a data structure that can handle small sets (10-20 strings, at most 50, of varying length) very fast. False positives are OK, but false negatives are not.
The last requirement makes bloom filters seem like a good fit, but I'm not sure about their speed, any other recommendations?
Edit: The set only needs to support insert + membership test.
How about an array of strings that you use a for-loop over checking membership with String.Equals?
For sets this small, fancy data structures may incur too much overhead, and big-oh does not apply. Have you tried doing the simplest possible thing and measuring that?
(If false positives are ok, you might also keep e.g. an array of 1024 bools, where you compute a poor 'hash' of strings by looking at just the first two characters' lowest 5 bits to give you a 10-bit index into the boolean array. Seems like this would be just a few instructions long.)
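The boolean-array idea above can be sketched as follows. This is only an illustration under stated assumptions: the class name TwoCharFilter is hypothetical, and strings shorter than two characters are treated as if the missing character were 0, which the original suggestion does not specify.

```csharp
public sealed class TwoCharFilter
{
    // 2^10 slots, one per possible 10-bit index.
    private readonly bool[] seen = new bool[1024];

    // Builds a 10-bit index from the low 5 bits of the first two characters.
    // Missing characters count as 0 (an assumption for short strings).
    private static int Slot(string s)
    {
        int a = s.Length > 0 ? s[0] & 31 : 0;
        int b = s.Length > 1 ? s[1] & 31 : 0;
        return (a << 5) | b;
    }

    public void Add(string s)
    {
        seen[Slot(s)] = true;
    }

    // Can return true for a string that was never added (false positive),
    // but never returns false for a string that was added.
    public bool MightContain(string s)
    {
        return seen[Slot(s)];
    }
}
```

Any two strings sharing their first two characters map to the same slot, so the false positive rate depends heavily on how varied the prefixes are.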
Depending on what operations you wish to perform against the set, the fastest will likely be a HashSet<string>. See HashSet for more.
ADDITION
Asking Mr. Google, here's an article written by a gentleman who wrote a Bloom Filter function in C#. However, he's still using (multiple) hash codes to populate the filter. I would expect that on small data sets it will be slower than a HashSet.
If the set of strings to check for membership is much larger than the set of valid strings then a Trie might give you better performance than a HashSet. The speed of a lookup in a hashset is dependent on the run time of the hashing algorithm which is usually O(k) where k is the length of the string. This is true whether the string is in the hashset or not.
With a Trie, lookup is still O(k), but if the string is not in the Trie, it will terminate the lookup as soon as a single character doesn't match. So best-case, a lookup for an invalid string is O(1).
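The early exit described above can be seen in a minimal trie sketch. The Trie class here is a hypothetical illustration that uses a Dictionary<char, Node> per node for simplicity; a tuned implementation would likely use a more compact child representation.

```csharp
using System.Collections.Generic;

public sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true; // marks the end of a stored string
    }

    public bool Contains(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            // Stop at the first character that has no matching branch.
            if (!node.Children.TryGetValue(c, out node))
            {
                return false;
            }
        }
        return node.IsWord; // a mere prefix of a stored string does not count
    }
}
```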
Why not use a Radix Tree? It's a specialized set data structure based on the trie that is used to store a set of strings.
Check out the System.Collections.Specialized Namespace on MSDN.
Especially the HybridDictionary and the StringDictionary.
I know they're not sets, but you can use null values for each key. (Java does the same with its out-of-the-box Sets and is still "fast".)
If HashSet is too slow for you, you can use the classic LZ compressor technique: a fixed-size array indexed by hash code, where each entry points to a linked list of strings.
If you know the domain of your data, just construct an ideal hash function and use it.
If that's not your case, you can use string.GetHashCode() or something like Murmur hash, and use hash(str) % array.Length as the array index (taking care that the hash is non-negative first).
I suppose an array size of 256-512 entries is good enough for your data structure with 50 strings.
The main benefit of bloom filters over hash tables is that their size depends on the number of objects in the database and the permitted probability for false positives, but not on the size of the objects themselves. Since your database is so small I doubt its size is your main concern.
HashSets are theoretically the best data structure for your requirement, but since the database is so small, an O(log (n)) structure like a SortedDictionary is often preferable, or maybe even just linear search (as mentioned). I recall stories where switching from hash-based collections to tree-based ones drastically increased performance for small sets.
The best way is to switch between them and compare the performance of each.

What is the fastest way to count the unique elements in a list of billion elements?

My problem is not usual. Let's imagine a few billion strings. The strings are usually fewer than 15 characters long. In this list I need to find the number of unique elements.
First of all, what object should I use? Don't forget that when I add a new element I have to check whether it already exists in the list. That is not a problem in the beginning, but after a few million words it can really slow down the process.
That's why I thought that a Hashtable would be ideal for this task, because checking the list is ideally only O(1). Unfortunately a single object in .NET can be at most 2GB.
The next step would be to implement a custom hashtable which contains a list of 2GB hashtables.
I am wondering maybe some of you know a better solution.
(Computer has extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug, just use a database. They are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none that come built into the Framework). Be sure to get one that is balanced, like a Red Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it only contains its object, and a link to its parent and two children), so you can have a whole slew of them.
Also, because it's sorted, the retrieval and insertion times are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and needs only an amount of extra space equal to what you have. Start by dividing the input into pieces that fit into memory, sort those and then start merging.
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit int, yielding roughly 4 billion unique hash values. If you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high). Statistics isn't my strongest point, but some google research turns up that, for a perfectly distributed 32-bit hash, the probability that a newly hashed string collides with one of the N - 1 strings already hashed is roughly (N - 1) / 2^32, where N is the number of unique things that are hashed. The chance of at least one collision somewhere in the whole set grows much faster than that (the birthday problem).
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
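The array-of-hash-tables idea can be sketched as below. This is an illustration only: PartitionedStringSet is a hypothetical name, and it uses string.GetHashCode() to pick the partition for brevity, whereas the answer above argues for a hash with more bits. Each HashSet<string> still compares the full strings, so a collision in the partition index affects balance, not correctness.

```csharp
using System.Collections.Generic;

public sealed class PartitionedStringSet
{
    private readonly HashSet<string>[] partitions;

    public PartitionedStringSet(int partitionCount)
    {
        partitions = new HashSet<string>[partitionCount];
        for (int i = 0; i < partitionCount; i++)
        {
            partitions[i] = new HashSet<string>();
        }
    }

    // The same rule is used for insert and lookup, so a string always lands in the same table.
    private HashSet<string> PartitionFor(string s)
    {
        // Mask off the sign bit: GetHashCode may be negative, and % would then be negative too.
        int hash = s.GetHashCode() & 0x7FFFFFFF;
        return partitions[hash % partitions.Length];
    }

    // Returns false when the string was already present.
    public bool Add(string s)
    {
        return PartitionFor(s).Add(s);
    }

    public bool Contains(string s)
    {
        return PartitionFor(s).Contains(s);
    }

    // Total number of distinct strings across all partitions.
    public long Count
    {
        get
        {
            long total = 0;
            foreach (var p in partitions)
            {
                total += p.Count;
            }
            return total;
        }
    }
}
```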
divide and conquer - partition data by first 2 letters (say)
dictionary of xx=>dictionary of string=> count
I would use a database, any database would do.
Probably the fastest because modern databases are optimized for speed and memory usage.
You need only one column with index, and then you can count the number of records.
+1 for the SQL/DB solutions; they keep things simple and will allow you to focus on the real task at hand.
But just for academic purposes, I would like to add my 2 cents.
-1 for hashtables (I cannot vote down yet). Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus, I agree with Eric J: the chance of collisions will undermine the time-efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (that will be the case when you may need to perform search like operations on the set of strings in the future as well and you have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
Below is my solution for a one-time duplicate count with limits on how much space can be used.
If we have the storage available for holding 1 billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time and then simply traverse the collection once in O(n) time, doing this:
if (a[i] == a[i+1])
dupCount++;
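For the in-memory case, the sort-then-scan idea can be sketched end to end as follows. UniqueCounter is a hypothetical name, and the sketch uses the runtime's built-in Array.Sort rather than heap-sort specifically; it counts distinct values directly instead of counting duplicates.

```csharp
using System;

public static class UniqueCounter
{
    // Sorts the array in place, then counts how often the value changes.
    public static int CountUnique(string[] items)
    {
        if (items.Length == 0)
        {
            return 0;
        }
        Array.Sort(items, StringComparer.Ordinal);
        int unique = 1; // the first element always starts a new run
        for (int i = 1; i < items.Length; i++)
        {
            // After sorting, equal strings are adjacent, so comparing neighbours is enough.
            if (!string.Equals(items[i], items[i - 1], StringComparison.Ordinal))
            {
                unique++;
            }
        }
        return unique;
    }
}
```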
If we do not have that much memory available, we can divide the input file on disk into smaller files (until each is small enough to hold in memory), sort each small file using the above technique, and then merge them together. This requires many passes over the main input file.
I would like to keep away from quick-sort because the dataset is huge. If I could squeeze in some memory for the second case, I would rather use it to reduce the number of passes than waste it in merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a Hash-map (Dictionary in .Net)?
Dictionary<String, byte> would only take up 5 bytes per entry on x86 (4 for the pointer to the string pool, 1 for the byte), which allows about 400M elements within 2GB. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bets would be to sort the data in place on disk (after which counting unique elements is trivial), or to use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes, and many modern databases have implemented it. It is pretty fast and can work with minimal memory.
