cache entry replacement algorithm - c#

I have a software project that creates a series of fingerprint (hash) values from objects of varying size. The larger the object size, of course, the more expensive the computation of the hash. The hashes are used for comparative purposes.
I now wish to cache hash values in order to improve performance of subsequent comparisons. For any given entry in the cache, I have the following metrics available:
hit count
last modification date/time
size of object hashed
So on to my question. Given the need to constrain the size of the cache (limit it to a specific number of entries), what is a well-balanced approach to replacing cache items?
Clearly, larger objects are more expensive to hash so they need to be kept around for as long as possible. However, I want to avoid a situation where populating the cache with a large quantity of large objects will prevent future (smaller) items from being cached.
So, based upon the metrics available to me (see above), I'm looking for a good general-purpose "formula" for expiring (removing) cache entries when the cache becomes full.
All thoughts, comments are appreciated.

You need to think about the nature of the objects. Think about the probability that each object will be requested again soon, and remove the least likely one.
This is very specific to the software you're using and the nature of the objects.
If they are used continuously in the program they will probably abide by the locality of reference principle, so you should use an LRU (least recently used) algorithm.
If objects with higher hit count are more likely to be called again, then use that (and remove the lowest).
Take a look at Cache Algorithms
In principle, you evict the entry with the minimum value of:
p * cost
p = the probability that the object will be requested again.
cost = the cost of recomputing and re-caching that object.

Assuming the ability to record when an entry was last accessed, I'd go with a "Cost" for each entry, where you at any time remove the least expensive entry.
Cost = Size * N - TimeSinceLastUse * M
Presuming you completely remove entries from the cache (and don't keep old hit-count data around), I'd avoid using hit count as a metric: you'd end up with an entry that has a high hit count because it's been there for a long time, and it'll stay there even longer because it has a high hit count.
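As a minimal sketch of that cost-based eviction (the CostBasedCache and CacheEntry names, the fixed weights N and M, and the field layout are all illustrative assumptions you would adapt to your own types):

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of Cost = Size * N - TimeSinceLastUse * M.
// N and M are assumed weights to be tuned for the workload.
class CacheEntry
{
    public byte[] Hash;            // the cached fingerprint
    public long Size;              // size of the object that was hashed
    public DateTime LastAccessUtc; // updated on every cache hit
}

class CostBasedCache
{
    private const double N = 1.0;   // weight for size (proxy for re-hash cost)
    private const double M = 100.0; // weight for staleness
    private readonly int _capacity;
    private readonly Dictionary<string, CacheEntry> _entries = new Dictionary<string, CacheEntry>();

    public CostBasedCache(int capacity) { _capacity = capacity; }

    private static double Cost(CacheEntry e) =>
        e.Size * N - (DateTime.UtcNow - e.LastAccessUtc).TotalSeconds * M;

    // Called before inserting a new entry: drop the currently cheapest entry.
    public void EvictIfFull()
    {
        if (_entries.Count == 0 || _entries.Count < _capacity) return;
        string victim = _entries.OrderBy(kv => Cost(kv.Value)).First().Key;
        _entries.Remove(victim);
    }
}

Scanning all entries on each eviction is O(n); for a large cache you would keep a priority structure instead, but the scan keeps the sketch readable.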

I typically use a strict least recently used (LRU) scheme for discarding things from the cache, unless it's hugely more expensive to reconstruct some items. LRU has the benefit of being trivially simple to implement, and it works really well for a wide range of applications. It also keeps the most frequently used items in the cache.
In essence, I create a linked list that's also indexed by a dictionary. When a client wants an item, I look it up in the dictionary. If it's found, I unlink the node from the linked list and move it to the head of the list. If the item isn't in the cache, I construct it (load it from disk, or whatever), put it at the head of the list, insert it into the dictionary, and then remove the item that's at the tail of the list.
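A bare-bones sketch of that structure (the LruCache name and the GetOrAdd/factory shape are my own framing of the description above, not code from the answer):

using System;
using System.Collections.Generic;

// LRU cache: a LinkedList records usage order, a Dictionary maps keys to list nodes.
class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _index =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();

    public LruCache(int capacity) { _capacity = capacity; }

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        if (_index.TryGetValue(key, out var node))
        {
            // Hit: move the node to the head of the list.
            _order.Remove(node);
            _order.AddFirst(node);
            return node.Value.Value;
        }

        // Miss: construct the value, put it at the head, evict the tail if over capacity.
        var value = factory(key);
        var newNode = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
        _index[key] = newNode;

        if (_index.Count > _capacity)
        {
            var tail = _order.Last;
            _order.RemoveLast();
            _index.Remove(tail.Value.Key);
        }
        return value;
    }
}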

Might want to try a multilevel style of cache. Dedicate a certain percentage of the cache to expensive-to-create items and a portion to easy-to-create but more heavily accessed items. You can then use different strategies for maintaining the expensive cache than you would for the less expensive one.

The algorithm could consider the cost of reproducing a missing element. That way you would keep the most valuable items in the cache.

Which collection class to use for a rolling period timeseries of data?

I want to implement a C# class (.NET 4.6.1) which contains time series data as follows:
the timeseries is a collection keyed on DateTime, each with an associated value (eg an array of doubles)
the points will be added strictly in time order
there will be a rolling time period - e.g. 1 hour - when adding a new point, any points at the start older than this period will be removed
the key issues for performance will be quickly adding new points and finding the data point for a particular time (a binary search or better is essential). There will sometimes be hundreds of thousands of points in the collection.
it has to be thread safe
it has to be relatively memory efficient - e.g. it can't just keep all the points from the beginning of time - as time moves on the memory footprint has to be fairly stable.
So what would be a good approach for that in terms of underlying collection classes? Using Lists will be slow - as the rolling period means we will need to remove from the start of the List a lot. A Queue or LinkedList would solve that - but I don't think they provide a fast way to access the nth item in the collection (ElementAt just iterates so is very slow).
The only solution I can think of will involve storing the data twice - once in a collection that's easy to remove from the start of, and again in one that's easy to search in (with some awful background process to prune the stale points from the search collection somehow).
Is there a better way to approach this?
Thanks.
When I first saw the question I immediately thought of a queue, but most built-in queues do not efficiently allow indexed access, as you've found.
The best suggestion I can come up with is to use a ConcurrentDictionary. Thread-safe, near-constant access time by key, you can key directly on DateTimes, etc. It's everything you need functionally, except the behavior to manage size/timeline. To do that, you use a little LINQ to scan the Keys property for keys more than one hour older than the newest one being added, then TryRemove() each one (you can wrap the ConcurrentDictionary in your own class so this happens automatically whenever anything new is added).
The only other potential issue is that it's not terribly memory-efficient; .NET's hash-based dictionaries keep internal bucket and entry arrays that are sized ahead of the element count, so there is some overhead and slack even with 100k items in the collection.
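A rough sketch of that wrapper (the RollingSeries name is made up; the one-hour window and the double[] payload come from the question, and pruning inside Add is the assumption here):

using System;
using System.Collections.Concurrent;
using System.Linq;

// Key points on DateTime; prune anything older than the rolling window on every add.
class RollingSeries
{
    private static readonly TimeSpan Window = TimeSpan.FromHours(1);
    private readonly ConcurrentDictionary<DateTime, double[]> _points =
        new ConcurrentDictionary<DateTime, double[]>();

    public bool TryGetPoint(DateTime t, out double[] values) =>
        _points.TryGetValue(t, out values);

    public void Add(DateTime t, double[] values)
    {
        _points.TryAdd(t, values);

        // Points arrive in time order, so anything older than (t - Window) is stale.
        var cutoff = t - Window;
        foreach (var staleKey in _points.Keys.Where(k => k < cutoff).ToList())
        {
            _points.TryRemove(staleKey, out _);
        }
    }
}

The Keys scan is O(n) per add; if that proves too slow, the pruning could be batched (say, once per second), since the question only needs the memory footprint to stay fairly stable.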

Replacement .net Dictionary

Given (Simplified description)
One of our services has a lot of instances in memory. About 85% are unique.
We need very fast key-based access to these items as they are queried very often in a single stack / call. This single context is extremely performance optimized.
So we started to put them into a dictionary. The performance was ok.
Access to the items as fast as possible is the most important thing in this case. It is ensured that there are no write operations when reads occur.
Problem
In the meantime we have hit the limit on the number of items a dictionary can store.
Die Arraydimensionen haben den unterstützten Bereich überschritten.
bei System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
bei System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
Which translates to The array dimensions have exceeded the supported range.
Solutions like Memcached are, in this specific case, just too slow. It is an isolated, very specific use case encapsulated in a single service.
So we are looking for a replacement of the dictionary for this specific scenario.
Currently I can't find one supporting this. Am I missing something? Can someone point me to one?
As an alternative, if none exists we are thinking about implementing one by ourselves.
We thought about two possibilities. Build it up from scratch or wrapping multiple dictionaries.
Wrapping multiple dictionaries
When an item is searched we could look at the key's HashCode and use its leading digits as an index into a list of wrapped dictionaries. Although this seems easy, it smells to me, and it would mean that the hash code is calculated twice (once by us, once by the inner dictionary) - and this scenario is really performance-crucial.
I know that exchanging a base type like the dictionary is the absolute last resort and I want to avoid it. But currently it looks like there is no way to make the objects more unique, to get dictionary-level performance from a database, or to save performance somewhere else.
I'm also aware of the usual "beware of premature optimization" advice, but lower performance would hit the business requirements behind this very badly.
Before I finished reading your question, the simple multiple-dictionaries approach came to my mind. But you know this solution already. I am assuming you are really hitting the maximum number of items in a dictionary, not some other limit.
I would say go for it. I do not think you should be worried about computing a hash twice. If the keys are long and hashing them is really a time-consuming operation (which I doubt, but can't be sure as you did not mention what the keys are), you do not need to use the whole key for your own hash function. Just pick whatever part you are OK with processing in your own hashing and distribute the items based on that.
The only thing you need to make sure of here is an even spread of items among your multiple dictionaries. How hard that is to achieve really depends on what your keys are. If they were completely random numbers, you could just use the first byte and it would be fine (unless you need more than 256 dictionaries). If they are not random numbers, you have to think about the distribution in their domain and write your first-level hash function in a way that achieves that even distribution.
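A hedged sketch of that wrapping idea (the ShardedDictionary name and the 256-shard default are illustrative; the first-level "hash" here is simply the key's own GetHashCode, which does mean it is computed twice, as discussed):

using System.Collections.Generic;

// Pick a shard from the key's hash code, then let the inner dictionary do its normal work.
class ShardedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] _shards;

    public ShardedDictionary(int shardCount = 256)
    {
        _shards = new Dictionary<TKey, TValue>[shardCount];
        for (int i = 0; i < shardCount; i++)
            _shards[i] = new Dictionary<TKey, TValue>();
    }

    private Dictionary<TKey, TValue> ShardFor(TKey key) =>
        _shards[(key.GetHashCode() & 0x7FFFFFFF) % _shards.Length];

    public void Add(TKey key, TValue value) => ShardFor(key).Add(key, value);

    public bool TryGetValue(TKey key, out TValue value) =>
        ShardFor(key).TryGetValue(key, out value);
}

If computing the full hash twice really matters, ShardFor can be replaced with something cheaper that looks at only part of the key, as suggested above.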
I've looked at the implementation of the .NET Dictionary and it seems like you should be able to store up to about 2^31 values in it. (Next to the buckets, which chain their entries together, there is a single array that stores all items, probably for quick iteration; that array might be the limiting factor.)
If you haven't added anywhere near that many values, it might be that your items are crowding into a few buckets (entries in a bucket form a chain, so a poorly distributed hash hurts badly). In that case you should double-check that your hash function spreads the items evenly over the dictionary. See this answer for more info: What is the best algorithm for an overridden System.Object.GetHashCode?

Any performance benefits to removing items from C# Dictionary after lookup if they only need to be read once

I have a Dictionary of objects with strings as the keys. This Dictionary is first populated with anywhere from 50 to tens of thousands of entries. Later on my program looks for values within this dictionary, and after having found an item in the dictionary I no longer have any need to persist the object that I just found in the dictionary. My question then is, would I be able to get better total execution time if I remove entries from the dictionary once I no longer have use for them, perhaps cutting down memory usage or just making subsequent lookups slightly faster, or would the extra time spent removing items be more impactful?
I understand the answer to this may depend upon certain details such as how many total lookups are done against the dictionary, the size of the key, and the size of the object, I will try to provide these below, but is there a general answer to this? Is it unnecessary to try and improve performance in this way, or are there cases where this would be a good idea?
Key is variable length string, either 6 characters or ~20 characters.
The total number of lookups is completely up in the air; I may have to check only 50 times or so, or I may have to look 10K times, completely independent of the size of the dictionary - i.e. the dictionary may have 50 items and I may do 10K lookups, or I may have 10K items and do only 50 lookups.
One additional note is that if I do remove items from the dictionary and I am ever left with an empty dictionary I can then signal to a waiting thread to no longer wait for me while I process the remaining items (involves parsing through a long text file while looking up items in the dictionary to determine what to do with the parsed data).
Dictionary lookups are essentially O(1). Removing items from the dictionary will have a tiny (if any) impact on lookup speed.
In the end, it's very likely that removing items will be slower than just leaving them in.
The only reason I'd suggest removing items would be if you need to reduce your memory footprint.
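If you do remove entries for memory reasons, here is a small sketch of the remove-after-lookup pattern, including the "signal a waiting thread when empty" idea from the question (the OneShotLookup name and the use of ManualResetEventSlim are assumptions about how that signalling might look):

using System.Collections.Generic;
using System.Threading;

// Remove an entry once it has been consumed; signal when nothing is left to look for.
class OneShotLookup<TValue>
{
    private readonly Dictionary<string, TValue> _pending;
    private readonly ManualResetEventSlim _drained = new ManualResetEventSlim(false);

    public OneShotLookup(Dictionary<string, TValue> items) { _pending = items; }

    public bool TryConsume(string key, out TValue value)
    {
        if (!_pending.TryGetValue(key, out value))
            return false;

        _pending.Remove(key);   // helps memory, not lookup speed
        if (_pending.Count == 0)
            _drained.Set();     // tell the waiting thread it can stop waiting
        return true;
    }

    public WaitHandle Drained => _drained.WaitHandle;
}

(As with a plain Dictionary, this is not safe for concurrent writers; only the completion signal is thread-safe.)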
I found some interesting items over at DotNetPerls that seem to relate to your question.
The order you add keys to a Dictionary is important. It affects the performance of accessing those keys. Because the Dictionary uses a chaining algorithm, the keys that were added last are often faster to locate.
http://www.dotnetperls.com/dictionary-order
Dictionary size influences lookup performance. Smaller Dictionaries are faster than larger Dictionaries. This is true when they are tested for keys that always exist in both. Reducing Dictionary size could help improve performance.
http://www.dotnetperls.com/dictionary-size
I thought this last tidbit was really interesting. It didn't occur to me to consider my key length.
Generally, shorter [key] strings perform better than longer ones.
http://www.dotnetperls.com/dictionary-string-key
Good question!

merge in-place without external storage

I want to merge two arrays with sorted values into one. Since both source arrays are stored as consecutive parts of a larger array, I am wondering if you know a way to merge them within that same storage - meaning an in-place merge.
All the methods I found need some external storage; they often require sqrt(n)-sized temp arrays. Is there an efficient way without it?
I'm using C#. Other languages are welcome too. Thanks in advance!
AFAIK, merging two (even sorted) arrays does not work in place without considerably increasing the necessary number of comparisons and moves of elements. See: merge sort. However, blocked variants exist which are able to sort a list of length n by using a temporary array of length sqrt(n) - as you wrote - while still keeping the number of operations considerably low. It's not bad - but it's also not "nothing", and it's apparently the best you can get.
For practical situations and if you can afford it, you better use a temporary array to merge your lists.
If the values are stored as succeeding parts of a larger array, you just want to sort the array, then remove consecutive values which are equal.
// In-place sort followed by in-place dedupe; returns the new logical length.
static int SortAndDedupe<T>(T[] a) where T : IComparable<T>
{
    // Do an efficient in-place sort
    Array.Sort(a);

    // Now deduplicate
    int lwm = 0; // low water mark: index of the last unique element kept
    int hwm = 1; // high water mark: index of the next element to examine
    while (hwm < a.Length)
    {
        // If the lwm and hwm elements are the same, it is a duplicate entry.
        if (a[lwm].CompareTo(a[hwm]) == 0)
        {
            hwm++;
        }
        else
        {
            // Not a duplicate entry - move the lwm up
            // and copy the hwm element down over the gap.
            lwm++;
            if (lwm < hwm)
            {
                a[lwm] = a[hwm];
            }
            hwm++;
        }
    }
    // The new logical length (number of unique elements) is lwm + 1;
    // the number of elements removed is a.Length - (lwm + 1).
    return a.Length == 0 ? 0 : lwm + 1;
}
Before you conclude that this will be too slow, implement it and profile it. That should take about ten minutes.
Edit: This can of course be improved by using a different sort rather than the built-in sort, e.g. Quicksort, Heapsort or Smoothsort, depending on which gives better performance in practice. Note that hardware architecture issues mean that the practical performance comparisons may very well be very different from the results of big O analysis.
Really you need to profile it with different sort algorithms on your actual hardware/OS platform.
Note: I am not attempting in this answer to give an academic answer, I am trying to give a practical one, on the assumption you are trying to solve a real problem.
Don't worry about external storage. sqrt(n) or even larger should not harm your performance. You just have to make sure the storage is pooled, especially for large data and especially when merging in loops. Otherwise the GC will get stressed and eat up a considerable part of your CPU time / memory bandwidth.
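For what pooling could look like in practice, here is a sketch using System.Buffers.ArrayPool (the MergeWithPooledBuffer name is made up; this version pools a half-array buffer for a plain merge, but the same Rent/Return pattern applies to the sqrt(n) buffers of the blocked variants):

using System;
using System.Buffers;

static class PooledMerge
{
    // data holds two sorted runs: [0, mid) and [mid, data.Length).
    public static void MergeWithPooledBuffer(int[] data, int mid)
    {
        // Rent a temporary buffer for the left run; it may be longer than requested.
        int leftLen = mid;
        int[] temp = ArrayPool<int>.Shared.Rent(leftLen);
        try
        {
            Array.Copy(data, 0, temp, 0, leftLen);

            int i = 0, j = mid, k = 0;
            while (i < leftLen && j < data.Length)
                data[k++] = temp[i] <= data[j] ? temp[i++] : data[j++];
            while (i < leftLen)
                data[k++] = temp[i++];
            // Any remaining right-run elements are already in place.
        }
        finally
        {
            ArrayPool<int>.Shared.Return(temp);
        }
    }
}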

What is the fastest way to count the unique elements in a list of billion elements?

My problem is not a usual one. Let's imagine a few billion strings, each usually less than 15 characters. In this list I need to find the number of unique elements.
First of all, what object should I use? Bear in mind that when I add a new element I have to check whether it already exists in the list. That is not a problem in the beginning, but after a few million words it can really slow the process down.
That's why I thought that a Hashtable would be ideal for this task, because checking for an element is ideally only O(1). Unfortunately a single object in .NET can only be 2GB.
The next step would be to implement a custom hashtable which contains a list of 2GB hashtables.
I am wondering whether some of you know a better solution.
(The computer has an extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug, just use a database. They are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
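A bare-bones trie sketch for counting distinct strings (it assumes lower-case ASCII 'a'..'z' purely to keep the child array small; real input would need a wider child table or a dictionary of children):

// Counts distinct strings; Add returns true only the first time a string is seen.
class CountingTrie
{
    private class Node
    {
        public Node[] Children = new Node[26];
        public bool IsWord;
    }

    private readonly Node _root = new Node();
    public long UniqueCount { get; private set; }

    public bool Add(string s)
    {
        var node = _root;
        foreach (char c in s)
        {
            int i = c - 'a';
            node = node.Children[i] ?? (node.Children[i] = new Node());
        }
        if (node.IsWord) return false;
        node.IsWord = true;
        UniqueCount++;
        return true;
    }
}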
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none that come built into the Framework). Be sure to get one that is balanced, like a Red Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it only contains its object and links to its parent and two children), so you can have a whole slew of them.
Also, because it's sorted, retrieval and insertion time are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and needs only an amount of extra space equal to what you have. Start by dividing the input into pieces that fit into memory, sort those and then start merging.
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit ints, yielding roughly 4 billion unique hash values; if you have as few as 100 million unique strings, the risk of hash collisions may be unacceptably high). Statistics isn't my strongest point, but some Google research turns up that, for a perfectly distributed 32-bit hash, the probability that the next item collides with one of the N - 1 items already hashed is (N - 1) / 2^32.
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
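A sketch of that partitioning scheme, with HashSet<string> standing in for each per-range hash table (the PartitionedStringSet name and the 256-way split mirror the example above):

using System.Collections.Generic;

// 256 hash sets selected by the string's hash code, so no single collection holds everything.
class PartitionedStringSet
{
    private readonly HashSet<string>[] _parts = new HashSet<string>[256];
    public long UniqueCount { get; private set; }

    public PartitionedStringSet()
    {
        for (int i = 0; i < _parts.Length; i++)
            _parts[i] = new HashSet<string>();
    }

    public void Add(string s)
    {
        int part = (s.GetHashCode() & 0x7FFFFFFF) % 256;
        if (_parts[part].Add(s))
            UniqueCount++;
    }
}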
Divide and conquer - partition the data by the first 2 letters (say):
a dictionary of xx => a dictionary of string => count
I would use a database, any database would do.
Probably the fastest because modern databases are optimized for speed and memory usage.
You need only one column with index, and then you can count the number of records.
+1 for the SQL/DB solutions; it keeps things simple and will allow you to focus on the real task at hand.
But just for academic purposes, I will like to add my 2 cents.
-1 for hashtables (I cannot vote down yet). Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus, I agree with Eric J: the chance of collisions will undermine the time-efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (which will be the case if you may need to perform search-like operations on the set of strings in the future as well, and you have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
Below is my solution for a one-time duplicate count with limits on how much space can be used.
If we have enough memory available to hold 1 billion elements, we can sort them in place by heap-sort in Θ(n log n) time and then simply traverse the collection once in O(n) time, doing this:
for (int i = 0; i + 1 < a.Length; i++)
    if (a[i] == a[i + 1])
        dupCount++;
// the number of unique elements is then a.Length - dupCount
If we do not have that much memory available, we can divide the input file on disk into smaller files (till the size becomes small enough to hold the collection in memory); then sort each such small file by using the above technique; then merge them together. This requires many passes on the main input file.
I would like to keep away from quick-sort because the dataset is huge. If I could squeeze out some memory for the second case, I would rather use it to reduce the number of passes than spend it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a hash map (Dictionary in .NET)?
Dictionary<String, byte> would only take up 5 bytes per entry on x86 (4 for the pointer to the string pool, 1 for the byte), which works out to roughly 400M elements within the 2GB object limit. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bet would be to sort the data in place on disk (after which counting unique elements is trivial), or to use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes. Many modern databases have implemented it. It is pretty fast and can work with minimal memory.
