I have a custom performance timer implementation. In short it is a static data collection storing execution duration of some code paths. In order to identify particular measurements I need a collection of named objects with quick access to the data item by name i.e. a string of moderate length like 20-50 chars.
Straightforward way to do that could be a Dictionary<string, MyPerformanceCounter> with access by key which is the counter id.
What about a List<MyPerformanceCounter> that is kept sorted and accessed via List<T>.BinarySearch and List<T>.Insert? Does it have a chance of scaling better once I need several hundred counters?
Needless to say, I need access to the right MyPerformanceCounter to be as quick as possible, as it is called tens of thousands of times per second and should affect code execution as little as possible.
New counters are added relatively seldom, about once per second.
There are several potentially non-O(1) parts to a dictionary.
The first is generating a hash code. If your strings are long, it will have to generate a hash of the string every time you use it as a key in your dictionary. The dictionary stores the hashes of the existing keys, so you don't have to worry about that, just hashing what you're passing in. If the strings are all short, hashing should be fast. Long strings are probably going to take longer to hash than doing a string comparison. Hashing affects both reads and writes.
The next non-constant part of a dictionary is when you have hash collisions. It keeps a linked list of values with the same hash bucket internally, and has to go through and compare your key to each item in that bucket if you get hash collisions. Since you're using strings and they spent a lot of effort coming up with a good string hashing function, this shouldn't be too major an issue. Hash collisions slow down both reads and writes.
The last non-constant part applies only to writes: if the dictionary runs out of internal storage, it has to rebuild the whole hash table internally. This is still a lot faster than doing array inserts (like a List<> would do). If you only have a few hundred items, this is definitely not going to affect you.
A sorted list, on the other hand, is going to take an average of N/2 element copies for each insert, and about log2(N) comparisons for each lookup. Unless the strings all have similar prefixes, each individual comparison will be much faster than hashing a key, but there will be a lot more of them.
So unless your strings are quite long to make hashing inefficient, chances are a dictionary is going to give you better performance.
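The two candidates can be put side by side in a minimal sketch. Everything here is illustrative: `MyPerformanceCounter` is a stand-in for the asker's class, and `DictionaryRegistry`, `SortedListRegistry`, and `GetOrAdd` are hypothetical names. Neither variant is thread-safe as written, so a static, shared registry would need its own synchronization.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's MyPerformanceCounter.
public sealed class MyPerformanceCounter
{
    public readonly string Id;
    public long TotalTicks;

    public MyPerformanceCounter(string id) { Id = id; }
}

// Variant A: hash lookup by counter id.
public sealed class DictionaryRegistry
{
    private readonly Dictionary<string, MyPerformanceCounter> _counters =
        new Dictionary<string, MyPerformanceCounter>(StringComparer.Ordinal);

    public MyPerformanceCounter GetOrAdd(string id)
    {
        MyPerformanceCounter counter;
        if (!_counters.TryGetValue(id, out counter))
        {
            counter = new MyPerformanceCounter(id);
            _counters.Add(id, counter);
        }
        return counter;
    }
}

// Variant B: ids kept sorted, looked up with List<T>.BinarySearch.
// The two lists are parallel: _counters[i] belongs to _ids[i].
public sealed class SortedListRegistry
{
    private readonly List<string> _ids = new List<string>();
    private readonly List<MyPerformanceCounter> _counters = new List<MyPerformanceCounter>();

    public MyPerformanceCounter GetOrAdd(string id)
    {
        int ix = _ids.BinarySearch(id, StringComparer.Ordinal);
        if (ix >= 0)
        {
            return _counters[ix];
        }

        // A negative result is the bitwise complement of the insertion point.
        var counter = new MyPerformanceCounter(id);
        _ids.Insert(~ix, id);
        _counters.Insert(~ix, counter);
        return counter;
    }
}
```

Both expose the same `GetOrAdd(id)` call, so timing them against each other with your real counter ids is a small change. Which one wins for a few hundred keys of 20-50 characters is something to measure rather than assume.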
If you know something about the nature of your strings, you can write a more specific data structure optimized for your scenario. For example, if I knew all the strings started with an ASCII capital letter, and each is between 5 and 10 characters in length, I might create an array of 26 arrays, one for each letter, and then each of those arrays contains 6 lists, one for each length of string. Something like this:
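// 26 first-letter buckets (A-Z), each holding 6 length buckets (lengths 5..10).
// A jagged array has to be allocated row by row.
var lists = new List<string>[26][];
for (int i = 0; i < lists.Length; i++)
{
    lists[i] = new List<string>[6];
}
foreach (string s in keys)
{
    int letter = s[0] - 'A';
    int length = s.Length - 5;
    var list = lists[letter][length];
    if (list == null)
    {
        lists[letter][length] = list = new List<string>();
    }
    int ix = list.BinarySearch(s);
    if (ix < 0)
    {
        list.Insert(~ix, s); // ~ix is the sorted insertion point
    }
}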
This is the kind of thing you do if you have very specific information about what kind of data you're dealing with. If you can't make assumptions, using a Dictionary is most likely going to be your best bet.
You might also want to consider SortedDictionary<TKey, TValue> if you want to go the binary search route; it is implemented as a binary search tree. Note that OrderedDictionary does not do this: it keeps entries in insertion order rather than sorted by key. https://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary%28v=vs.110%29.aspx
I believe you should use the Dictionary<string, MyPerformanceCounter>.
For small sets of data the list may perform better. However, as more elements are needed, the Dictionary becomes clearly superior.
A Dictionary lookup takes O(1) time on average (constant time complexity).
A lookup in an unsorted List takes O(N) time (linear time complexity); a sorted List searched with BinarySearch takes O(log N).
You could try Hashtable or SortedDictionary, but I think that you should still use Dictionary.
I provide a link with benchmarks and guidelines here: http://www.dotnetperls.com/dictionary-time
I hope this helps you.
Problem
I have a huge collection of strings that are duplicated among some objects. What I need is string interning. These objects are serialized and deserialized with protobuf-net. I know it should handle .NET string interning, but my tests have shown that taking all those strings myself and creating a Dictionary<string, int> (mapping between a value and its unique identifier), then replacing the original string values with ints, gives better results.
The problem, though, is in the mapping. It is only one-way searchable (I mean O(1)-searchable). But I would like to search by key or by value in O(1). Not just by key.
Approach
The set of strings is fixed. This sounds like an array. Search by value (the integer index) is O(1) and blindingly fast. Not even amortized as in the dictionary, just constant, by the index.
The problem with an array is searching by keys. This sounds like hashes. But hey, n hashes aren't said to be evenly distributed among exactly n cells of the n-element array. Using modulo, this will likely lead to collisions. That's bad.
I could create, let's say, an n * 1.1-length array, and try random hashing functions until I get no collisions but... that... just... feels... wrong.
Question
How can I solve the problem and achieve O(1) lookup time both by keys (strings) and values (integers)?
Two dictionaries is not an option ;)
Two dictionaries is the answer. I know you said it isn't an option, but without justification it's hard to see how two dictionaries doesn't answer your scenario perfectly, with easy to understand, fast, memory-efficient code.
From here, it seems like you're looking to perform two basic operations;
myStore.getString(int); // O(1)
myStore.getIndexOf(string); // O(1)
you're happy for one to be implemented as a dictionary, but not the other. What is it that's giving you pause?
Can you use an array to store the strings and a hash table to relate the strings back to their indices in the array?
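The array-plus-hash-table idea could look roughly like the sketch below. `StringTable`, `GetString`, and `TryGetIndexOf` are hypothetical names chosen for illustration, and the sketch assumes the set of strings is fixed at construction time and contains no nulls.

```csharp
using System;
using System.Collections.Generic;

// Two-way mapping between strings and dense integer ids.
// id -> string is a plain array index; string -> id is a hash lookup.
public sealed class StringTable
{
    private readonly string[] _byId;
    private readonly Dictionary<string, int> _idByString;

    public StringTable(IEnumerable<string> strings)
    {
        var unique = new List<string>();
        _idByString = new Dictionary<string, int>(StringComparer.Ordinal);

        foreach (string s in strings)
        {
            // Duplicates share the id of their first occurrence.
            if (!_idByString.ContainsKey(s))
            {
                _idByString.Add(s, unique.Count);
                unique.Add(s);
            }
        }

        _byId = unique.ToArray();
    }

    public int Count
    {
        get { return _byId.Length; }
    }

    // Throws IndexOutOfRangeException for an unknown id.
    public string GetString(int id)
    {
        return _byId[id];
    }

    public bool TryGetIndexOf(string s, out int id)
    {
        return _idByString.TryGetValue(s, out id);
    }
}
```

Only one hash table is involved; the reverse direction is a direct array access, so it is constant time without any hashing at all.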
Your n*1.1 length array idea might be improved on by some reading on perfect hashing and dynamic perfect hashing. Wikipedia has a nice article about the latter here. Unfortunately, all of these solutions seem to involve hash tables which contain hash tables. This may break your requirement that only one hash table be used, but perhaps the way in which the hash tables are used is different here.
I have a list of about 500 strings "joe" "john" "jack" ... "jan"
I only need to find the ordinal.
In my example, the list will never be changed.
One could just put them in a list and IndexOf
ll.Add("joe")
ll.Add("john")
...
ll.Add("jan")
ll.IndexOf("jib") is 315
or you can put them in a dictionary, using the ordinal integers as the values,
dd.Add("joe", 1)
dd.Add("john", 2)
dd.Add("jack", 3)
...
dd.Add("jan", 571)
dd["jib"] is 315
FTR the strings are 3 to 8 characters long. FTR this is in a Unity, hence Mono, milieu.
Purely for performance, is one approach generally preferable?
1b) Indeed, I found a number of analyses of this nature: http://www.dotnetperls.com/dictionary-time (google for a number of similar analyses). Does this apply to the situation I describe or am I off here?
It's a shame there isn't a "HashSetLikeThingWithOrdinality" type of facility - if I'm missing an obvious please let us know. Indeed, this seems like a fairly common, basic, collections use case - "get the ordinal of some strings" - perhaps I am completely missing something obvious.
Here's a small overview on the difference between using a Dictionary<string,int> and a (sorted)List<string> for this:
Observations:
1) In my micro benchmarks, once the dictionary is created, the dictionary is much faster. (Explanations as to why will follow shortly)
2) In my opinion, mapping in some way (eg. a Dictionary or HashTable) will be significantly less awkward.
Performance:
For the List<string>, to do a binary search, the system will start in the 'middle', then walk each direction (stepping into the 'middle' in the now halved search space, in a typical divide and conquer pattern) depending on if the value is greater or smaller than the value at the index it's looking at. This is O(log n) growth. This assumes that data is already sorted in some manner (also applies to stuff like SortedDictionary, which uses data structures that allow for binary searching)
Alternately, you'd do IndexOf, which is O(n) complexity because you have to walk every element.
For the Dictionary<string,int>, it uses a hash lookup: it generates a hash of the object by calling .GetHashCode() on the TKey (string in this case), uses that to look up a bucket in a hash table, does a compare to ensure it is an exact match, and gets the value out. This is roughly O(1) growth (i.e. the cost doesn't grow meaningfully with the number of elements), not including worst-case scenarios involving hash collisions.
Because of this, Dictionary<string,int> takes a (relatively) constant amount of time to do lookups, while List<string> grows according to the number of elements (albeit at a logarithmic (slow) rate).
Testing:
I did a few micro benchmarks, where I took the top 500 female names and did lookups against them. The lookups looked something like this:
var searchItems = new[] { "Maci", "Daria", "Michelle", "Amber", "Henrietta"};
foreach (var item in searchItems)
{
sortedList.BinarySearch(item); //You'd store the output here. Just looking at performance
}
And compared it to a dictionary lookup:
foreach (var item in searchItems)
{
var output = dictionary.ContainsKey(item) ? dictionary[item] : -1; //Presumably, output would be declared outside of this, just getting rid of a compiler error
}
So, here's the thing: even for a small number of elements, with short strings as lookup keys, a sorted List<string> isn't any faster (on my machine, in my admittedly simplistic tests) than a Dictionary<string,int>. Once again, this is a microbenchmark, but, for 500 elements, the 5 lookups are roughly 3x faster with the dictionary.
Keep in mind, however, that the list was 6.3 microseconds, and the dictionary was 1.8 microseconds.
Syntax:
Using a list as a lookup to find indexes is slightly awkward. A mapping type (like Dictionary) implies intent much better than your lookup list does, which should make for more maintainable code in the end.
That said, with my syntax and performance considerations, I'd say go with the Dictionary. However, if you don't like Dictionaries for whatever reason, the performance considerations are on such a small scale that it's a pointless thing to worry about anyways.
Edit: Bonus points, you will probably want to use a case-insensitive comparer for either method. You can pass a comparer as an argument for Dictionary and BinarySearch() should support a comparer as well.
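As a sketch of the comparer suggestion, the helpers below build the ordinal map once and then look names up case-insensitively. `NameOrdinals` and its method names are made up for illustration; `StringComparer.OrdinalIgnoreCase` is one reasonable comparer choice, not the only one.

```csharp
using System;
using System.Collections.Generic;

public static class NameOrdinals
{
    // Maps each name to its position in the original list.
    // If a name occurs twice (ignoring case), the first position wins,
    // which matches what List<T>.IndexOf would report.
    public static Dictionary<string, int> Build(IList<string> names)
    {
        var map = new Dictionary<string, int>(names.Count, StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < names.Count; i++)
        {
            if (!map.ContainsKey(names[i]))
            {
                map.Add(names[i], i);
            }
        }
        return map;
    }

    // Single hash lookup; returns -1 when the name is unknown.
    public static int OrdinalOf(Dictionary<string, int> map, string name)
    {
        int ordinal;
        return map.TryGetValue(name, out ordinal) ? ordinal : -1;
    }

    // Binary-search alternative. The list must already be sorted with the
    // same comparer, and the result is the position in the sorted list,
    // not the original insertion order.
    public static int PositionInSorted(List<string> sortedNames, string name)
    {
        int ix = sortedNames.BinarySearch(name, StringComparer.OrdinalIgnoreCase);
        return ix >= 0 ? ix : -1;
    }
}
```

`TryGetValue` does the existence check and the read in one lookup, which avoids the double lookup of `ContainsKey` followed by the indexer.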
I suspect that there might be a twist somewhere, as such a simple question has had no answer for 2 hours. I'll risk being down-voted, but here are my answers:
1) Dictionary (hash table-based) is clearly a better choice for a fast lookup. List, on the other hand, is the worst choice.
1.b) Yes, it applies here. Search in the List has linear complexity, while Dictionary provides constant time lookup.
2) You are trying to map a string to an ordinal; any kind of map will be natural here (while any kind of list is awkward).
Dictionary is the natural approach for a lookup.
A list would be an optimisation for less memory use at the cost of decreased speed. An array would do better still (same time, but slightly less memory again).
If you already had a list or array for some other reason then the memory saving would be greater still, because no more memory is used than would be used anyway, making it a better optimisation for space at the same cost to speed. (If the keys are stored in sorted order then a lookup can be O(log n), but otherwise it's O(n).)
Creating the dictionary itself takes time, so while it's the fastest approach if the number of times it is looked up is small then it might cost as much as it saves and so not be worth it.
I have a Dictionary of objects with strings as the keys. This Dictionary is first populated with anywhere from 50 to tens of thousands of entries. Later on my program looks for values within this dictionary, and after having found an item in the dictionary I no longer have any need to persist the object that I just found in the dictionary. My question then is, would I be able to get better total execution time if I remove entries from the dictionary once I no longer have use for them, perhaps cutting down memory usage or just making subsequent lookups slightly faster, or would the extra time spent removing items be more impactful?
I understand the answer to this may depend upon certain details such as how many total lookups are done against the dictionary, the size of the key, and the size of the object, I will try to provide these below, but is there a general answer to this? Is it unnecessary to try and improve performance in this way, or are there cases where this would be a good idea?
Key is variable length string, either 6 characters or ~20 characters.
Total lookups is completely up in the air, I may have to only check 50x or so or I may have to look 10K times completely independent of the size of the dictionary, i.e. dictionary may have 50 items and I may do 10K lookups, or I may have 10K items and only do 50 lookups.
One additional note is that if I do remove items from the dictionary and I am ever left with an empty dictionary I can then signal to a waiting thread to no longer wait for me while I process the remaining items (involves parsing through a long text file while looking up items in the dictionary to determine what to do with the parsed data).
Dictionary lookups are essentially O(1). Removing items from the dictionary will have a tiny (if any) impact on lookup speed.
In the end, it's very likely that removing items will be slower than just leaving them in.
The only reason I'd suggest removing items would be if you need to reduce your memory footprint.
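If you do want to try removal (for example to detect the "nothing left to find" condition mentioned in the question), a minimal sketch might look like this. `PendingItems`, `TryTake`, and `IsEmpty` are hypothetical names, and the sketch assumes single-threaded access to the dictionary.

```csharp
using System;
using System.Collections.Generic;

// Wraps a dictionary whose entries are consumed at most once.
public sealed class PendingItems<TValue>
{
    private readonly Dictionary<string, TValue> _pending;

    public PendingItems(IDictionary<string, TValue> items)
    {
        _pending = new Dictionary<string, TValue>(items, StringComparer.Ordinal);
    }

    // True once every entry has been taken; the caller can use this
    // to signal a waiting thread that no further lookups can succeed.
    public bool IsEmpty
    {
        get { return _pending.Count == 0; }
    }

    // Looks the key up and, on a hit, removes it so it cannot be found again.
    public bool TryTake(string key, out TValue value)
    {
        if (!_pending.TryGetValue(key, out value))
        {
            return false;
        }
        _pending.Remove(key);
        return true;
    }
}
```

Each hit now costs a lookup plus a removal, so whether the early-exit signal pays for that is worth timing on representative input.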
I found some interesting items over at DotNetPerls that seem to relate to your question.
The order you add keys to a Dictionary is important. It affects the performance of accessing those keys. Because the Dictionary uses a chaining algorithm, the keys that were added last are often faster to locate.
http://www.dotnetperls.com/dictionary-order
Dictionary size influences lookup performance. Smaller Dictionaries are faster than larger Dictionaries. This is true when they are tested for keys that always exist in both. Reducing Dictionary size could help improve performance.
http://www.dotnetperls.com/dictionary-size
I thought this last tidbit was really interesting. It didn't occur to me to consider my key length.
Generally, shorter [key] strings perform better than longer ones.
http://www.dotnetperls.com/dictionary-string-key
Good question!
I'm in need of a data structure that can handle small sets (10-20 strings, at most 50, of varying length) very fast. False positives are OK, but false negatives are not.
The last requirement makes bloom filters seem like a good fit, but I'm not sure about their speed, any other recommendations?
Edit: The set only needs to support insert + membership test.
How about an array of strings that you use a for-loop over checking membership with String.Equals?
For sets this small, fancy data structures may incur too much overhead, and big-oh does not apply. Have you tried doing the simplest possible thing and measuring that?
(If false positives are ok, you might also keep e.g. an array of 1024 bools, where you compute a poor 'hash' of strings by looking at just the first two characters' lowest 5 bits to give you a 10-bit index into the boolean array. Seems like this would be just a few instructions long.)
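The boolean-array idea in the parenthetical could be sketched as follows. `TwoCharFilter` and `MightContain` are illustrative names; strings shorter than two characters are handled by treating the missing characters as zero, which is an assumption of this sketch rather than something stated above.

```csharp
using System;

// Approximate membership test: no false negatives, frequent false positives.
public sealed class TwoCharFilter
{
    // 5 bits from each of the first two characters -> 10-bit index -> 1024 slots.
    private readonly bool[] _seen = new bool[1024];

    private static int Index(string s)
    {
        int first = s.Length > 0 ? (s[0] & 0x1F) : 0;
        int second = s.Length > 1 ? (s[1] & 0x1F) : 0;
        return (first << 5) | second;
    }

    public void Add(string s)
    {
        _seen[Index(s)] = true;
    }

    // True means "possibly present"; false means "definitely not added".
    public bool MightContain(string s)
    {
        return _seen[Index(s)];
    }
}
```

Any two strings that share their first two characters (after masking) land in the same slot, so a `true` answer still needs a real check if exactness ever matters.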
Depending on what operations you wish to perform against the set, the fastest will likely be a HashSet<string>. See HashSet for more.
ADDITION
Asking Mr. Google, here's an article written by a gentleman who wrote a Bloom Filter function in C#. However, he's still using (multiple) hashcodes to populate the filter. I would expect that on small data sets it will be slower than a HashSet.
If the set of strings to check for membership is much larger than the set of valid strings then a Trie might give you better performance than a HashSet. The speed of a lookup in a hashset is dependent on the run time of the hashing algorithm which is usually O(k) where k is the length of the string. This is true whether the string is in the hashset or not.
With a Trie, lookup is still O(k), but if the string is not in the Trie, it will terminate the lookup as soon as a single character doesn't match. So best-case, a lookup for an invalid string is O(1).
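A minimal trie showing the early exit on a mismatch, as a sketch only. The `Trie` class below is hypothetical and stores children in a per-node `Dictionary<char, Node>` for brevity; a tuned implementation would likely use arrays or another compact child representation.

```csharp
using System;
using System.Collections.Generic;

public sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            // Stop at the first character with no matching child:
            // the rest of the string is never examined.
            if (!node.Children.TryGetValue(c, out node))
            {
                return false;
            }
        }
        return node.IsWord;
    }
}
```

With only 10-50 stored strings, most non-members fail within the first character or two, which is where the best-case behaviour described above comes from.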
Why not use a Radix Tree? It's a specialized set data structure based on the trie that is used to store a set of strings.
Check out the System.Collections.Specialized Namespace on MSDN.
Especially the HybridDictionary and the StringDictionary.
I know they're not sets, but you can use null values for each key. (Java does the same with its out-of-the-box Sets and is still "fast".)
If HashSet is too slow for you, you can use classic LZ compressor's technique: fixed size array of hash codes where each entry points to linked list of strings.
In case you know domain of your data just construct ideal hash function and use it.
If that's not your case, you can use string.GetHashCode() or something like Murmur hash
and use hash(str) % array.Length as the array index.
I suppose an array size of 256-512 entries is good enough for your data structure with 50 strings.
The main benefit of bloom filters over hash tables is that their size depends on the number of objects in the database and the permitted probability for false positives, but not on the size of the objects themselves. Since your database is so small I doubt its size is your main concern.
HashSets are theoretically the best data structure for your requirement, but since the database is so small, an O(log (n)) structure like a SortedDictionary is often preferable, or maybe even just linear search (as mentioned). I recall stories where switching from hash-based collections to tree-based ones drastically increased performance for small sets.
The best way is to switch between them and compare the performance of each.
My problem is not usual. Let's imagine a few billion strings. The strings are usually fewer than 15 characters long. In this list I need to find the number of unique elements.
First of all, what object should I use? Don't forget that when I add a new element I have to check whether it already exists in the list. It is not a problem in the beginning, but after a few million words it can really slow down the process.
That's why I thought that Hashtable would be ideal for this task, because checking the list is ideally only O(1). Unfortunately, a single object in .NET can be at most 2 GB.
Next step will be to implement a custom hashtable which contains a list of 2GB hashtables.
I am wondering maybe some of you know a better solution.
(Computer has extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug, just use a database. They are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none that come built into the Framework). Be sure to get one that is balanced, like a Red Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it only contains its object, a link to its parent, and links to its two children), so you can have a whole slew of them.
Also, because it's sorted, the retrieval and insertion times are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and needs only an amount of extra space equal to what you have. Start by dividing the input into pieces that fit into memory, sort those and then start merging.
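For a single chunk that fits in memory, the "sort, then extract the unique elements" step can be sketched like this. `DistinctCounter.CountDistinct` is a hypothetical helper; it sorts the array in place and uses ordinal comparison, both of which are assumptions of the sketch. The external merge across chunks is not shown.

```csharp
using System;

public static class DistinctCounter
{
    // Sorts the array in place, then counts positions where the value changes.
    public static long CountDistinct(string[] items)
    {
        if (items.Length == 0)
        {
            return 0;
        }

        Array.Sort(items, StringComparer.Ordinal);

        long distinct = 1; // the first element always starts a new run
        for (int i = 1; i < items.Length; i++)
        {
            if (!string.Equals(items[i], items[i - 1], StringComparison.Ordinal))
            {
                distinct++;
            }
        }
        return distinct;
    }
}
```

After sorting, equal strings are adjacent, so one linear pass is enough and no extra lookup structure is needed.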
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high. .NET hash codes are 32-bit ints, yielding roughly 4 billion distinct hash values; if you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high. Statistics isn't my strongest point, but some Google research turns up that, for a perfectly distributed 32-bit hash, the probability that one new item collides with any of the N - 1 items already hashed is about (N - 1) / 2^32, where N is the number of unique things that are hashed. The probability that at least one pair among all N collides is much higher (the birthday problem), roughly 1 - e^(-N(N - 1) / 2^33).
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
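To put rough numbers on this, the chance of at least one collision among n uniformly distributed hashes of a given bit width is commonly approximated by the birthday bound, 1 - exp(-n(n - 1) / (2 * 2^bits)). The helper below is a sketch of that approximation only; `HashCollisionOdds` is a made-up name, and real string hash functions are not perfectly uniform.

```csharp
using System;

public static class HashCollisionOdds
{
    // Birthday-bound approximation for "at least one pair of items collides".
    public static double AtLeastOneCollision(double n, int bits)
    {
        double slots = Math.Pow(2.0, bits);
        return 1.0 - Math.Exp(-n * (n - 1.0) / (2.0 * slots));
    }
}
```

For a 32-bit hash the approximation crosses 50% at roughly 77,000 items, so with 100 million unique strings some collisions are effectively certain; with 160 bits (the width of SHA-1) the same formula gives a value indistinguishable from zero in double precision.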
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
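A sketch of the "array of hash tables" idea, using `HashSet<string>` per partition. `PartitionedStringSet` is a hypothetical name and the partition count is up to the caller. One detail matters in C#: `GetHashCode()` can return a negative int and `%` keeps the sign, so the hash is converted to `uint` before taking the remainder.

```csharp
using System;
using System.Collections.Generic;

public sealed class PartitionedStringSet
{
    private readonly HashSet<string>[] _partitions;

    public PartitionedStringSet(int partitionCount)
    {
        if (partitionCount <= 0)
        {
            throw new ArgumentOutOfRangeException("partitionCount");
        }

        _partitions = new HashSet<string>[partitionCount];
        for (int i = 0; i < partitionCount; i++)
        {
            _partitions[i] = new HashSet<string>(StringComparer.Ordinal);
        }
    }

    // The same function picks the partition for both inserts and lookups.
    private HashSet<string> PartitionFor(string s)
    {
        uint hash = unchecked((uint)StringComparer.Ordinal.GetHashCode(s));
        return _partitions[hash % (uint)_partitions.Length];
    }

    // Returns true if the string was not already present.
    public bool Add(string s)
    {
        return PartitionFor(s).Add(s);
    }

    public bool Contains(string s)
    {
        return PartitionFor(s).Contains(s);
    }

    // Number of distinct strings across all partitions.
    public long Count
    {
        get
        {
            long total = 0;
            foreach (HashSet<string> partition in _partitions)
            {
                total += partition.Count;
            }
            return total;
        }
    }
}
```

Each `HashSet<string>` still compares the actual strings on a hash match, so colliding hash codes cost time but do not merge distinct strings in the count.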
divide and conquer - partition data by first 2 letters (say)
dictionary of xx=>dictionary of string=> count
I would use a database, any database would do.
Probably the fastest because modern databases are optimized for speed and memory usage.
You need only one column with index, and then you can count the number of records.
+1 for the SQL/DB solutions; they keep things simple and will allow you to focus on the real task at hand.
But just for academic purposes, I would like to add my 2 cents.
-1 for hashtables. (I cannot vote down yet). Because they are implemented using buckets, the storage cost can be huge in many practical implementation. Plus I agree with Eric J, the chances of collisions will undermine the time efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (that will be the case when you may need to perform search like operations on the set of strings in the future as well and you have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
The below is my solution for a one time duplicate counting with limits on how much space can be used.
If we have the storage available for holding 1 billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time and then simply traverse the collection once in O(n) time, doing this:
if (a[i] == a[i+1])
dupCount++;
If we do not have that much memory available, we can divide the input file on disk into smaller files (till the size becomes small enough to hold the collection in memory); then sort each such small file by using the above technique; then merge them together. This requires many passes on the main input file.
I would like to keep away from quick-sort because the dataset is huge. If I could squeeze in some memory for the second case, I would rather use it to reduce the number of passes than waste it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a Hash-map (Dictionary in .Net)?
Dictionary<String, byte> needs at least 5 bytes of payload per entry on x86 (4 for the string reference, 1 for the byte), so 2 GB holds at most about 400M entries; in practice each entry also stores a cached hash code and a next index, so the real capacity is lower. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bets would be to sort the data in-place on disk (after which counting unique elements is trivial), or use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look into the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes, and many modern databases have implemented it. It is pretty fast and can work with minimal memory.