Best Collection for Fast String Lookup - c#

I need a list of strings and a way to quickly determine if a string is contained within that list.
To enhance lookup speed, I considered SortedList and Dictionary; however, both work with KeyValuePairs when all I need is a single string.
I know I could use a KeyValuePair and simply ignore the Value portion. But I do prefer to be efficient and am just wondering if there is a collection better suited to my requirements.

If you're on .NET 3.5 or higher, use HashSet<String>.
Failing that, a Dictionary<string, byte> (or whatever type you want for the TValue type parameter) would be faster than a SortedList if you have a lot of entries - the latter will use a binary search, so it'll be O(log n) lookup, instead of O(1).
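For example, a membership test with HashSet<string> might look like this (a minimal sketch; the sample data is made up):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // HashSet<string> stores only the strings; no dummy values needed.
        var names = new HashSet<string> { "apple", "banana", "cherry" };

        // Contains is an O(1) hash lookup on average.
        Console.WriteLine(names.Contains("banana")); // True
        Console.WriteLine(names.Contains("durian")); // False
    }
}
```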

If you just want to know if a string is in the set use HashSet<string>

This sounds like a job for
var keys = new HashSet<string>();
Per MSDN: The Contains function has O(1) complexity.
But be aware that adding a duplicate does not raise an error: HashSet<T>.Add simply returns false and leaves the set unchanged.
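A quick sketch of that Add behavior:

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var keys = new HashSet<string>();
        Console.WriteLine(keys.Add("alpha")); // True: newly added
        Console.WriteLine(keys.Add("alpha")); // False: duplicate, silently ignored
        Console.WriteLine(keys.Count);        // 1
    }
}
```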

HashSet<string> is like a Dictionary, but with only keys.

If you feel like rolling your own data structure, use a Trie.
http://en.wikipedia.org/wiki/Trie
Lookup is O(length of the string); the worst case is when the string is present, since every character must be walked.
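A minimal trie could be sketched like this (an illustrative implementation, not a tuned one; children are kept in a per-node dictionary):

```csharp
using System.Collections.Generic;

// A bare-bones trie supporting Add and Contains for exact matches.
class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // O(length of word), regardless of how many words are stored.
    public bool Contains(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsWord;
    }
}
```

Usage: after t.Add("cat"), t.Contains("cat") is true while t.Contains("ca") is false, because only the final node of an added word has IsWord set.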

I know this answer is a bit late to the party, but I ran into an issue where our systems were running slow. After profiling, we found there were a LOT of string lookups happening because of the way our data structures were organized.
So we did some research, came across these benchmarks, ran our own tests, and have switched over to using SortedList now.
if (sortedlist.ContainsKey(thekey))
{
    //found it.
}
Even though the Dictionary proved to be faster, the SortedList meant less code for us to refactor, and the performance increase was good enough for us.
Anyway, I wanted to share the website in case other people run into similar issues. It compares data structures where the string you're looking for is a "key" (HashTable, Dictionary, etc.) against those where it is a "value" (List, Array, or inside a Dictionary), which is where ours are stored.

I know the question is old as hell, but I just had to solve the same problem, only for a very small set of strings(between 2 and 4).
In my case, I actually used a manual lookup over an array of strings, which turned out to be much faster than HashSet<string> (I benchmarked it).
for (int i = 0; i < this.propertiesToIgnore.Length; i++)
{
    if (this.propertiesToIgnore[i].Equals(propertyName))
    {
        return true;
    }
}
Note that it beats the hash set only for tiny arrays!
EDIT: this holds only with a manual for loop; do not use LINQ (details in the comments).

Related

Performance of Dictionary<string, object> vs List<string> + BinarySearch

I have a custom performance timer implementation. In short it is a static data collection storing execution duration of some code paths. In order to identify particular measurements I need a collection of named objects with quick access to the data item by name i.e. a string of moderate length like 20-50 chars.
A straightforward way to do that could be a Dictionary<string, MyPerformanceCounter> with access by key, where the key is the counter id.
What about a List<MyPerformanceCounter>, kept sorted and accessed via List<T>.BinarySearch and List<T>.Insert? Does it have a chance of better performance when I would need several hundred counters?
Needless to say, I need access to the proper MyPerformanceCounter to be as quick as possible, as it is called tens of thousands of times per second and should affect code execution as little as possible.
New counters are appended relatively seldom like once per second.
There are several potentially non-O(1) parts to a dictionary.
The first is generating a hash code. If your strings are long, it will have to generate a hash of the string every time you use it as a key in your dictionary. The dictionary stores the hashes of the existing keys, so you don't have to worry about that, just hashing what you're passing in. If the strings are all short, hashing should be fast. Long strings are probably going to take longer to hash than doing a string comparison. Hashing affects both reads and writes.
The next non-constant part of a dictionary is when you have hash collisions. It keeps a linked list of values with the same hash bucket internally, and has to go through and compare your key to each item in that bucket if you get hash collisions. Since you're using strings and they spent a lot of effort coming up with a good string hashing function, this shouldn't be too major an issue. Hash collisions slow down both reads and writes.
The last non-constant part occurs only during writes: if the dictionary runs out of internal storage, it has to rebuild the whole hash table. This is still a lot faster than doing array inserts (like a List<> would do). If you only have a few hundred items, this is definitely not going to affect you.
A list, on the other hand, is going to take an average of N/2 copies for each insert, and log2(N) for each lookup. Unless the strings all have similar prefixes, the individual comparisons will be much faster than the dictionary, but there will be a lot more of them.
So unless your strings are quite long to make hashing inefficient, chances are a dictionary is going to give you better performance.
If you know something about the nature of your strings, you can write a more specific data structure optimized for your scenario. For example, if I knew all the strings started with an ASCII capital letter, and each is between 5 and 10 characters in length, I might create an array of 26 arrays, one for each letter, and then each of those arrays contains 6 lists, one for each length of string. Something like this:
// C# jagged arrays are allocated one dimension at a time.
List<string>[][] lists = new List<string>[26][];
for (int i = 0; i < 26; i++)
{
    lists[i] = new List<string>[6];
}
foreach (string s in keys)
{
    var list = lists[s[0] - 'A'][s.Length - 5];
    if (list == null)
    {
        lists[s[0] - 'A'][s.Length - 5] = list = new List<string>();
    }
    int ix = list.BinarySearch(s);
    if (ix < 0)
    {
        list.Insert(~ix, s);
    }
}
This is the kind of thing you do if you have very specific information about what kind of data you're dealing with. If you can't make assumptions, using a Dictionary is most likely going to be your best bet.
You might also want to consider SortedDictionary if you want to go the binary search route; it uses a binary search tree internally. (Note that OrderedDictionary, despite its name, preserves insertion order rather than sort order.) https://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary%28v=vs.110%29.aspx
I believe you should use the Dictionary<string, MyPerformanceCounter>.
For small sets of data the list will have better performance. However, as more elements are added, the Dictionary becomes clearly superior.
A Dictionary lookup takes O(1) constant time, while a List lookup has O(N) linear time complexity.
You could try Hashtable or SortedDictionary, but I think that you should still use Dictionary.
I provide a link with benchmarks and guidelines here: http://www.dotnetperls.com/dictionary-time
I hope this helps you.

Implementing simple string interning

Problem
I have a huge collection of strings that are duplicated among some objects. What I need is string interning. These objects are serialized and deserialized with protobuf-net. I know it should handle .NET string interning, but my tests have shown that taking all those strings myself, creating a Dictionary<string, int> (a mapping between a value and its unique identifier), and replacing the original string values with ints gives better results.
The problem, though, is in the mapping. It is only one-way searchable (I mean O(1)-searchable). But I would like to search by key or by value in O(1). Not just by key.
Approach
The set of strings is fixed. This sounds like an array. Search by value is O(1), blinding fast. Not even amortized as in the dictionary - just constant, by the index.
The problem with an array is searching by keys. This sounds like hashes. But n hashes are not guaranteed to distribute evenly among exactly n cells of an n-element array; using modulo will likely lead to collisions. That's bad.
I could create, let's say, an n * 1.1-length array, and try random hashing functions until I get no collisions but... that... just... feels... wrong.
Question
How can I solve the problem and achieve O(1) lookup time both by keys (strings) and values (integers)?
Two dictionaries is not an option ;)
Two dictionaries is the answer. I know you said it isn't an option, but without justification it's hard to see how two dictionaries doesn't answer your scenario perfectly, with easy to understand, fast, memory-efficient code.
From here, it seems like you're looking to perform two basic operations:
myStore.getString(int); // O(1)
myStore.getIndexOf(string); // O(1)
you're happy for one to be implemented as a dictionary, but not the other. What is it that's giving you pause?
Can you use an array to store the strings and a hash table to relate the strings back to their indices in the array?
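That array-plus-hash-table idea can be sketched as a small store (the class and method names here are illustrative, mirroring the getString/getIndexOf operations above; a List<string> plays the role of the array):

```csharp
using System.Collections.Generic;

// Interns strings: string -> id and id -> string, both O(1) on average.
class StringStore
{
    private readonly List<string> byId = new List<string>();
    private readonly Dictionary<string, int> byValue = new Dictionary<string, int>();

    // Returns the existing id for s, or assigns the next free one.
    public int GetIndexOf(string s)
    {
        if (!byValue.TryGetValue(s, out int id))
        {
            id = byId.Count;
            byId.Add(s);
            byValue[s] = id;
        }
        return id;
    }

    // Index into the backing list: constant time, no hashing at all.
    public string GetString(int id) => byId[id];
}
```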
Your n*1.1 length array idea might be improved on by some reading on perfect hashing and dynamic perfect hashing. Wikipedia has a nice article about the latter here. Unfortunately, all of these solutions seem to involve hash tables which contain hash tables. This may break your requirement that only one hash table be used, but perhaps the way in which the hash tables are used is different here.

Performance Dictionary<string,int> versus List<string>

I have a list of about 500 strings "joe" "john" "jack" ... "jan"
I only need to find the ordinal.
In my example, the list will never be changed.
One could just put them in a list and IndexOf
ll.Add("joe")
ll.Add("john")
...
ll.Add("jan")
ll.IndexOf("jib") is 315
or you can put them in a dictionary, using the ordinal integers as the values,
dd.Add("joe", 1)
dd.Add("john", 2)
dd.Add("jack", 3)
...
dd.Add("jan", 571)
dd["jib"] is 315
FTR the strings are 3 to 8 characters long. FTR this is in a Unity, hence Mono, milieu.
Purely for performance, is one approach generally preferable?
1b) Indeed, I found a number of analyses of this nature: http://www.dotnetperls.com/dictionary-time (google for a number of similar analyses). Does this apply to the situation I describe, or am I off here?
It's a shame there isn't a "HashSetLikeThingWithOrdinality" type of facility; if I'm missing something obvious, please let us know. Indeed, this seems like a fairly common, basic collections use case ("get the ordinal of some strings"), so perhaps I am completely missing something.
Here's a small overview on the difference between using a Dictionary<string,int> and a (sorted)List<string> for this:
Observations:
1) In my micro benchmarks, once the dictionary is created, the dictionary is much faster. (Explanations as to why will follow shortly)
2) In my opinion, mapping in some way (eg. a Dictionary or HashTable) will be significantly less awkward.
Performance:
For the List<string>, to do a binary search, the system will start in the 'middle', then walk each direction (stepping into the 'middle' in the now halved search space, in a typical divide and conquer pattern) depending on if the value is greater or smaller than the value at the index it's looking at. This is O(log n) growth. This assumes that data is already sorted in some manner (also applies to stuff like SortedDictionary, which uses data structures that allow for binary searching)
Alternately, you'd do IndexOf, which is O(n) complexity because you have to walk every element.
For the Dictionary<string,int>, it uses a hash lookup: it generates a hash of the object by calling .GetHashCode() on the TKey (string in this case), uses that hash to find the right bucket in the hash table, does a compare to ensure it has an exact match, and gets the value out. This is roughly O(1) growth (i.e. the complexity doesn't grow meaningfully with the number of elements), not counting worst-case scenarios involving hash collisions.
Because of this, Dictionary<string,int> takes a (relatively) constant amount of time to do lookups, while List<string> grows according to the number of elements (albeit at a logarithmic (slow) rate).
Testing:
I did a few micro benchmarks, where I took the top 500 female names and did lookups against them. The lookups looked something like this:
var searchItems = new[] { "Maci", "Daria", "Michelle", "Amber", "Henrietta"};
foreach (var item in searchItems)
{
    sortedList.BinarySearch(item); //You'd store the output here. Just looking at performance
}
And compared it to a dictionary lookup:
foreach (var item in searchItems)
{
    var output = dictionary.ContainsKey(item) ? dictionary[item] : -1; //Presumably, output would be declared outside of this, just getting rid of a compiler error
}
So, here's the thing: even for a small number of elements, with short strings as lookup keys, a sorted List<string> isn't any faster (on my machine, in my admittedly simplistic tests) than a Dictionary<string,int>. Once again, this is a microbenchmark, but, for 500 elements, the 5 lookups are roughly 3x faster with the dictionary.
Keep in mind, however, that the list was 6.3 microseconds, and the dictionary was 1.8 microseconds.
Syntax:
Using a list as a lookup to find indexes is slightly awkward. A mapping type (like Dictionary) implies intent much better than your lookup list does, which should make for more maintainable code in the end.
That said, with my syntax and performance considerations, I'd say go with the Dictionary. However, if you don't like Dictionaries for whatever reason, the performance considerations are on such a small scale that it's a pointless thing to worry about anyways.
Edit: Bonus points, you will probably want to use a case-insensitive comparer for either method. You can pass a comparer as an argument for Dictionary and BinarySearch() should support a comparer as well.
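For instance, a sketch of both case-insensitive variants (the sample data is made up; StringComparer.OrdinalIgnoreCase implements both IEqualityComparer<string> and IComparer<string>, so it works for both APIs):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Dictionary takes an IEqualityComparer<string> in its constructor.
        var dd = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            { "joe", 1 }, { "john", 2 }
        };
        Console.WriteLine(dd["JOE"]); // 1

        // List<T>.BinarySearch takes an IComparer<string>; the list must be
        // sorted with the same comparer for the search to be valid.
        var ll = new List<string> { "jack", "joe", "john" };
        ll.Sort(StringComparer.OrdinalIgnoreCase);
        int ix = ll.BinarySearch("JOE", StringComparer.OrdinalIgnoreCase);
        Console.WriteLine(ix >= 0); // True
    }
}
```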
I suspect that there might be a twist somewhere, as such a simple question has had no answer for 2 hours. I'll risk being down-voted, but here are my answers:
1) Dictionary (hash table-based) is clearly a better choice for a fast lookup. List, on the other hand, is the worst choice.
1.b) Yes, it applies here. Search in the List has linear complexity, while Dictionary provides constant time lookup.
2) You are trying to map a string to an ordinal; any kind of map will be natural here (while any kind of list is awkward).
Dictionary is the natural approach for a lookup.
A list would be an optimisation for lower memory use at the cost of decreased speed. An array would do better still (same time, but slightly less memory again).
If you already had a list or array for some other reason, the memory saving would be greater still, because no memory beyond what you were already using would be needed, making it a better optimisation for space at the same cost in speed. (If the keys happen to be stored in sorted order, lookup could be O(log n) via binary search; otherwise it's O(n).)
Creating the dictionary itself takes time, so if the number of lookups is small, building it might cost as much as it saves and not be worth it.

efficient way to search for string in list of string?

I have a list of strings and need to find which strings match a given input value.
What is the most efficient way (memory vs. execution speed) for me to store this list of strings and be able to search through it? The start-up cost of loading the list of strings isn't important, but the response time for searching is.
Should I be using a List<string>, a HashSet<string>, a basic string[], or something else?
It depends very much on the nature of the strings and the size of the collection. Depending on characteristics of the collection, and the expected search strings, there are ways to organize things very cleverly so that searching is very fast. You haven't given us that information.
But here's what I'd do. I'd set a reasonable performance requirement. Then I'd try a n-gram index (why? because you said in a comment you need to account for partial matches; a HashSet<string> won't help you here) and I'd profile reasonable inputs that I expect against this solution and see if it meets my performance requirements or not. If it does, I'd accept the solution and move on. If it doesn't, I'd think very carefully about whether or not my performance requirements are reasonable. If they are, I'd start thinking about whether or not there is something special about my inputs and collection that might enable me to use some more clever solutions.
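As an illustration of the n-gram idea, here is a rough trigram-index sketch (all names are hypothetical; trigram hits are only candidates and still have to be verified with a real Contains check):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Maps each 3-character substring to the ids of strings containing it.
// Search: intersect the candidate sets for the query's trigrams, then verify.
class TrigramIndex
{
    private readonly List<string> items = new List<string>();
    private readonly Dictionary<string, HashSet<int>> index =
        new Dictionary<string, HashSet<int>>();

    public void Add(string s)
    {
        int id = items.Count;
        items.Add(s);
        for (int i = 0; i + 3 <= s.Length; i++)
        {
            string gram = s.Substring(i, 3);
            if (!index.TryGetValue(gram, out var ids))
                index[gram] = ids = new HashSet<int>();
            ids.Add(id);
        }
    }

    public IEnumerable<string> Search(string query)
    {
        if (query.Length < 3) // too short to use the index; fall back to a scan
            return items.Where(s => s.Contains(query));

        HashSet<int> candidates = null;
        for (int i = 0; i + 3 <= query.Length; i++)
        {
            if (!index.TryGetValue(query.Substring(i, 3), out var ids))
                return Enumerable.Empty<string>();
            candidates = candidates == null
                ? new HashSet<int>(ids)
                : new HashSet<int>(candidates.Intersect(ids));
        }
        // Verify candidates; trigram overlap alone can give false positives.
        return candidates.Select(id => items[id]).Where(s => s.Contains(query));
    }
}
```

The design choice here is the classic one: pay memory and build time up front to narrow each query down to a handful of candidates instead of scanning every string.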
It seems the best way is to build a suffix tree of your input in O(input_len) time then do queries of your patterns in O(pattern_length) time. So if your text is really big compared to your patterns, this will work well.
See Ukkonen's algorithm for building a suffix tree.
If you want inexact matching...see the work of Gonzalo Navarro.
A Dictionary<string, TValue> or a HashSet<string> is probably good for you.
Look here for Dictionary
and here for HashSet
Dictionary and Hashtable are going to be the fastest at searching because lookup is O(1). One drawback of Dictionaries and Hashtables is that they are not sorted.
Using a Binary search tree you will be able to get O(Log N) searching.
Using an unsorted list you will be O(N) speed for searching.
Using a sorted list you will get O(Log N) searching but keep in mind the list has to be sorted so that adds time to the overall speed.
As for memory use, just make sure you initialize the collection with its expected capacity.
So dictionary or hash table are the fastest for retrieval.
Speed classifications from best to worst are
O(1)
O(log n)
O(n)
O(n log n)
O(n^2)
O(2^n)
n being the number of elements.

What is the most performant way to check for existence with a collection of integers?

I have a large list of integers that are sent to my webservice. Our business rules state that these values must be unique. What is the most performant way to figure out if there are any duplicates? I don't need to know the values; I only need to know if two of the values are equal.
At first I was thinking about using a generic List of integers and the List.Exists() method, but that is O(n);
Then I was thinking about using a Dictionary and the ContainsKey method. But, I only need the Keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?
Use a HashSet<T>:
The HashSet<T> class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
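A sketch of that constructor-based approach for checking distinctness (sample values are made up):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var values = new List<int> { 3, 1, 4, 1, 5 };

        // The HashSet<int> keeps only distinct items, so a smaller Count
        // means the original list contained duplicates.
        bool hasDuplicates = new HashSet<int>(values).Count != values.Count;
        Console.WriteLine(hasDuplicates); // True
    }
}
```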
Sounds like a job for a Hashset...
If you are using framework 3.5 you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you get better performance than O(n) in case there are duplicates, as you don't have to continue looking after finding the first duplicate.
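That early-exit check might be sketched as follows (the class name is illustrative):

```csharp
using System.Collections.Generic;

static class DuplicateChecker
{
    // Stops at the first duplicate instead of processing the whole list.
    public static bool HasDuplicates(IEnumerable<int> values)
    {
        var seen = new HashSet<int>();
        foreach (int v in values)
        {
            if (!seen.Add(v)) // Add returns false if the value was already seen
                return true;
        }
        return false;
    }
}
```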
If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better if you stored the number set as a sorted array or binary tree of begin,end pairs. Then you could search to find the pair with the largest begin value that was smaller than your search key and compare with that pair's end value to see if it exists in the set.
What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.

Categories

Resources