I have a bunch of pairs of dates and monetary values in a SortedDictionary<DateTime, decimal>, corresponding to loan balances calculated into the future at contract-defined compounding dates. Is there an efficient way to find the key nearest to a given date (specifically, the nearest key less than or equal to the target)? The point is to store only the data at the points where the value changed, but efficiently answer the question "what was the balance on date x?" for any date in range.
A similar question was asked (What .NET dictionary supports a "find nearest key" operation?) and the answer was "no" at the time, at least from the people who responded, but that was almost 3 years ago.
The question How to find point between two keys in sorted dictionary presents the obvious solution of naively iterating through all keys. I am wondering if any built-in framework function exists to take advantage of the fact that the keys are already indexed and sorted in memory -- or alternatively a built-in Framework collection class that would lend itself better to this kind of query.
Since SortedDictionary is sorted on the key, you can create a sorted list of keys with
var keys = new List<DateTime>(dictionary.Keys);
and then efficiently perform binary search on it:
var index = keys.BinarySearch(key);
As the documentation says, if index is positive or zero then the key exists; if it is negative, then ~index is the index at which the key would be found if it existed. Therefore the index of the "immediately smaller" existing key is ~index - 1. Make sure you handle correctly the edge case where key is smaller than all of the existing keys and ~index - 1 == -1.
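For illustration, here is a minimal sketch of that lookup with the edge case handled (TryGetFloorKey is an illustrative name, not a framework method):

static bool TryGetFloorKey(List<DateTime> keys, DateTime target, out DateTime floor)
{
    int index = keys.BinarySearch(target);
    if (index < 0)
        index = ~index - 1;  // index of the largest key smaller than target
    if (index < 0)           // target precedes every key in the list
    {
        floor = default(DateTime);
        return false;
    }
    floor = keys[index];     // exact match, or the nearest smaller key
    return true;
}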
Of course the above approach really only makes sense if keys is built up once and then queried repeatedly; since it involves iterating over the whole sequence of keys and doing a binary search on top of that there's no point in trying this if you are only going to search once. In that case even naive iteration would be better.
Update
As digEmAll correctly points out, you could also switch to SortedList<DateTime, decimal> so that the Keys collection implements IList<T> (which SortedDictionary.Keys does not). That interface provides enough functionality to perform a binary search on it manually, so you could take e.g. this code and make it an extension method on IList<T>.
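Such an extension method could look roughly like this; a hand-rolled sketch assuming the list is sorted ascending and T is comparable, not the linked code verbatim:

using System;
using System.Collections.Generic;

public static class ListExtensions
{
    // Same return convention as List<T>.BinarySearch: a non-negative index
    // when found, otherwise the bitwise complement of the insertion point.
    public static int BinarySearch<T>(this IList<T> list, T item)
        where T : IComparable<T>
    {
        int lo = 0, hi = list.Count - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            int cmp = list[mid].CompareTo(item);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return ~lo;
    }
}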
You should also keep in mind that SortedList performs worse than SortedDictionary during construction if the items are not inserted in already-sorted order, although in this particular case it is highly likely that dates are inserted in chronological (sorted) order which would be perfect.
So, this doesn't directly answer your question, because you specifically asked for something built into the .NET framework, but facing a similar problem, I found the following solution to work best, and I wanted to post it here for other searchers.
I used the TreeDictionary<K, V> from the C5 Collections (GitHub/NuGet), which is an implementation of a red-black tree.
It has Predecessor/TryPredecessor and WeakPredecessor/TryWeakPredecessor methods (as well as similar methods for successors) to easily find the items nearest to a key.
More useful in your case, I think, are the RangeFrom/RangeTo/RangeFromTo methods, which allow you to retrieve a range of key-value pairs between two keys.
Note that all of these methods can also be applied to the TreeDictionary<K, V>.Keys collection, which allows you to work with only the keys as well.
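For example, a sketch of the floor-key lookup with C5 (hedged: this assumes the C5 2.x API, where TryWeakPredecessor yields the entry with the largest key less than or equal to the given one):

using System;
using C5;
using SCG = System.Collections.Generic;

var balances = new TreeDictionary<DateTime, decimal>();
balances.Add(new DateTime(2013, 1, 1), 1000m);
balances.Add(new DateTime(2013, 2, 1), 1010m);

// Balance in effect on Jan 15: the entry with the largest key <= the query.
SCG.KeyValuePair<DateTime, decimal> entry;
if (balances.TryWeakPredecessor(new DateTime(2013, 1, 15), out entry))
    Console.WriteLine(entry.Value); // prints 1000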
It really is a very neat implementation, and something like it deserves to be in the BCL.
It is not possible to efficiently find the nearest key with SortedList, SortedDictionary or any other "built-in" .NET type, if you need to interleave queries with inserts (unless your data arrives pre-sorted, or the collection is always small).
As I mentioned on the other question you referenced, I created three data structures related to B+ trees that provide find-nearest-key functionality for any sortable data type: BList<T>, BDictionary<K,V> and BMultiMap<K,V>. Each of these data structures provides FindLowerBound() and FindUpperBound() methods that work like C++'s lower_bound and upper_bound.
These are available in the Loyc.Collections NuGet package, and BDictionary typically uses about 44% less memory than SortedDictionary.
// Note: PeriodLength is assumed to be a TimeSpan field giving the length
// of one compounding period.
public static DateTime RoundDown(DateTime dateTime)
{
    // Truncate the date down to the start of its period.
    long remainingTicks = dateTime.Ticks % PeriodLength.Ticks;
    return dateTime - new TimeSpan(remainingTicks);
}
Given (Simplified description)
One of our services has a lot of instances in memory. About 85% are unique.
We need very fast key-based access to these items, as they are queried very often in a single stack/call. This single context is extremely performance-optimized.
So we started to put them into a dictionary. The performance was OK.
Accessing the items as fast as possible is the most important thing in this case. It is ensured that there are no write operations when reads occur.
Problem
In the meantime we hit the limit on the number of items a dictionary can store.
Die Arraydimensionen haben den unterstützten Bereich überschritten.
bei System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
bei System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
Which translates to "The array dimensions have exceeded the supported range."
Solutions like Memcached are, in this specific case, just too slow. It is an isolated, very specific use case encapsulated in a single service.
So we are looking for a replacement of the dictionary for this specific scenario.
Currently I can't find one supporting this. Am I missing something? Can someone point me to one?
As an alternative, if none exists, we are thinking about implementing one ourselves.
We thought about two possibilities: building it up from scratch, or wrapping multiple dictionaries.
Wrapping multiple dictionaries
When an item is searched for, we could look at the key's hash code and use its leading digits as an index into a list of wrapped dictionaries. Although this seems easy, it smells to me, and it would mean that the hash code is calculated twice (once by us, once by the inner dictionary), and this scenario is really performance-crucial.
I know that exchanging a base type like the dictionary is the absolute last possibility and I want to avoid it. But currently it looks like there is no way to make the objects more unique, or to get the performance of a dictionary from a database, or to save performance somewhere else.
I'm also aware of the warnings against premature optimization, but lower performance would very badly hit the business requirements behind this.
Before I finished reading your question, the simple multiple-dictionaries approach came to my mind. But you know this solution already. I am assuming you are really hitting the maximum number of items in a dictionary, not any other limit.
I would say go for it. I do not think you should be worried about computing a hash twice. If the keys are somehow long and getting the hash is really a time-consuming operation (which I doubt, but can't be sure as you did not mention what the keys are), you do not need to use the whole key for your hash function. Just pick whatever part you are OK to process in your own hashing and distribute the items based on that.
The only thing you need to make sure of here is an even spread of items among your multiple dictionaries. How hard that is to achieve really depends on what your keys are. If they were completely random numbers, you could just use the first byte and it would be fine (unless you needed more than 256 dictionaries). If they are not random numbers, you have to think about the distribution in their domain and code your first hash function in a way that achieves that goal of even distribution.
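A minimal sketch of that wrapping idea, assuming the built-in hash code already spreads the keys well enough (ShardedDictionary is an illustrative name, not an existing type):

using System.Collections.Generic;

public class ShardedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] _shards;

    public ShardedDictionary(int shardCount)
    {
        _shards = new Dictionary<TKey, TValue>[shardCount];
        for (int i = 0; i < shardCount; i++)
            _shards[i] = new Dictionary<TKey, TValue>();
    }

    private Dictionary<TKey, TValue> ShardFor(TKey key)
    {
        int h = key.GetHashCode() & 0x7FFFFFFF; // mask keeps the index non-negative
        return _shards[h % _shards.Length];
    }

    public void Add(TKey key, TValue value)
    {
        ShardFor(key).Add(key, value);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        return ShardFor(key).TryGetValue(key, out value);
    }
}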
I've looked at the implementation of the .NET Dictionary and it seems like you should be able to store 2^32 values in your dictionary. (Next to the list of buckets, which are themselves linked lists, there is a single array that stores all items, probably for quick iteration; that might be the limiting factor.)
If you haven't added 2^32 values, it might be that there is a limit on the items in a bucket (it's a linked list, so it's probably limited to the maximum stack frame size). In that case you should double-check that your hashing function spreads the items evenly over the dictionary. See this answer for more info: What is the best algorithm for an overridden System.Object.GetHashCode?
I am performing something similar to an N-dimensional convolution, but will be combining values that are close to one another as I proceed, to save memory and time.
I look for a key in the array.
If I find the key, I add to the value stored at that key.
If I do not find the key, I find the next highest and next lowest key.
If the closer of the two neighbors is close enough, then I accumulate with that key-value pair.
Otherwise I add a new key-value pair.
The key is a double. It is always positive and never infinite. (I handle zeroes specially.) I expect the values to range from pennies to as high as 100 billion. The rounding coarseness will change as the algorithm proceeds to maintain a maximum array size between 10,000 and 1,000,000. (Only testing will reveal the sweet spot in the trade-off between speed, memory and accuracy.) Because of the range of values versus array size, direct addressing is not practical; I need sparse storage.
The naive approach is to use a List and perform a BinarySearch to find the key or insertion point, then proceed from there. This is fast for finding the nearest key, can be iterated in key order, but inserts are horrible. (I do not need to perform deletes! Each iteration in the outer loop creates a new list from scratch.)
What data structure is recommended? Wikipedia mentions a few, like Trie, Judy array, etc.
(I implemented something Trie-like with similar characteristics years ago, but that was in Java, took me a week to implement, and was tricky. I am crunched for time.)
UPDATE:
The suggestion of SortedSet causes me to modify my requirements. While finding the next lowest and next highest key was my way of accomplishing my task, SortedSet.GetViewBetween goes about things differently. Since I just want to see if there is a value close enough to be aggregated with, and I have a certain rounding granularity G, I can just ask for all elements of interest using
var possibilities = mySet.GetViewBetween(x - G, x + G);
If that set is empty, I need to add. If it is not, it is probably a small set and I iterate through it.
I need to perform performance testing to see if it is fast enough. But even if it is not, another collection that has the same contract is an acceptable alternative to FindNextHighestKey and FindNextLowestKey.
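As a sketch, the aggregate-or-insert step could then look like this (totals and amount are my own names for a companion Dictionary<double, double> and the incoming value; requires System.Linq):

var neighbors = mySet.GetViewBetween(x - G, x + G);
if (neighbors.Count == 0)
{
    mySet.Add(x);           // no key close enough; start a new bucket
    totals[x] = amount;
}
else
{
    // The view is small, so a linear scan for the closest key is cheap.
    double nearest = neighbors.OrderBy(k => Math.Abs(k - x)).First();
    totals[nearest] += amount;
}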
UPDATE 2:
I have decided to use plain Dictionary, and force the keys into buckets using a custom rounding function. Iterating the items in sorted order is not crucial, and by using this rounding function, I can find "close enough" values with which to aggregate. I will not change the granularity during an iteration; I will adjust it every time I finish convolving with a new dimension. Each iteration I create a new array to hold the results of that pass.
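For example, a sketch of such a rounding function (RoundToBucket, the granularity g/G, and the totals dictionary are my own names, not from the question):

static double RoundToBucket(double key, double g)
{
    return Math.Round(key / g) * g; // snap the key to the nearest multiple of g
}

// Aggregating into a plain Dictionary keyed by bucket:
var totals = new Dictionary<double, double>();
double bucket = RoundToBucket(x, G);
double current;
totals.TryGetValue(bucket, out current); // current is 0 if the bucket is new
totals[bucket] = current + amount;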
If your key is unique, you may look at Dictionary<TKey,TValue> or SortedDictionary<TKey,TValue>.
I found this question, which led me to SortedSet<T>.
If you can handle O(log(n)) for insert, delete, and lookup, this might be where you should keep your keys.
Based on your new requirement... Why not just map the doubles by the granularity to sparse keys before use and go with a Dictionary<double, T>? This won't work if you want the granularity to change during runtime, but neither would the other approach, really.
Whenever I want to insert into a SortedList, I check to see if the item exists, then I insert. Is this performing the same search twice? Once to see if the item is there and again to find where to insert the item? Is there a way to optimize this to speed it up or is this just the way to do it, no changes necessary?
if( sortedList.ContainsKey( foo ) == false ){
sortedList.Add( foo, 0 );
}
You can add the items to a HashSet as well as the list; checking the hash set is the fastest way to see if you have to add the value to the list.
if( hashSet.Contains( foo ) == false ){
sortedList.Add( foo, 0 );
hashSet.Add(foo);
}
You can use the indexer. The indexer does this in an optimal way internally by first looking for the index corresponding to the key using a binary search, and then using this index to replace an existing item. Otherwise a new item is added, taking into account the index already calculated.
list["foo"] = value;
No exception is thrown whether the key already exists or not.
UPDATE:
If the new value is the same as the old value, replacing the old value will have the same effect as doing nothing.
Keep in mind that a binary search is done. This means that it takes only about 10 steps to find an item among 1000 items, since log2(1000) ≈ 10. Therefore doing an extra search will not have a significant impact on speed. Searching among 1,000,000 items only doubles this value (~20 steps).
But setting the value through the indexer will do only one search in any case. I looked at the code using Reflector and can confirm this.
I'm sorry if this doesn't answer your question, but I have to say that sometimes the default collection structures in .NET are unjustifiably limited in features. This could have been handled if the Add method returned a boolean indicating success or failure, much like HashSet<T>.Add does. Then everything happens in one step. In fact, all of ICollection<T>.Add should have returned a boolean so that implementations are forced to provide it, much like Java's Collection interface does.
You could either use a SortedDictionary<K, V> structure, as pointed out by Servy, or a combination of HashSet<K> and SortedList<K, V>, as in peer's answer, for better performance, but neither of them really sticks to the do-it-only-once philosophy. I tried a couple of open source projects to see if there was a better implementation in this respect, but couldn't find one.
Your options:
In the vast majority of cases it's OK to do two lookups; it doesn't hurt much. Stick to this one. There is no built-in solution.
Write your own SortedList<K, V> class. It's not difficult at all.
If you're desperate, you can use reflection. The Insert method is a private member of the SortedList class. There is an example that does this, but kindly don't do it. It's a very, very poor choice, mentioned here only for completeness.
ContainsKey does a binary search, which is O(log n), so unless your list is massive, I wouldn't worry about it too much. And, presumably, on insertion it does another binary search to find the location to insert at.
One option to avoid this (doing the search twice) is to use the BinarySearch method of List. This will return a negative value if the item isn't found, and that negative value is the bitwise complement of the place where the item should be inserted. So you can look for an item, and if it's not already in the list, you know exactly where it should be inserted to keep the list sorted.
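A sketch of that single-search pattern, assuming the sorted keys live in a plain List<T> rather than a SortedList<K,V>:

int index = keys.BinarySearch(foo);
if (index < 0)                // not found; ~index is the insertion point
    keys.Insert(~index, foo); // one search, one insert, and the list stays sorted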
SortedList<Key,Value> is a slow data structure that you probably shouldn't use at all. You may have already considered using SortedDictionary<Key,Value>, but found it inconvenient because the items don't have indexes (you can't write sortedDictionary[0]) and because you can write a find-nearest-key operation for SortedList but not for SortedDictionary.
But if you're willing to switch to a third-party library, you can get better performance by changing to a different data structure.
The Loyc Core libraries include a data type that works the same way as SortedList<Key,Value> but is dramatically faster when the list is large. It's called BDictionary<Key,Value>.
Now, answering your original question: yes, the way you wrote the code, it performs two searches and one insert (the insert is the slowest part). If you switch to BDictionary, there is a method bdictionary.AddIfNotPresent(key, value) which combines those two operations into a single operation. It returns true if the specified item was added, or false if it was already present.
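Usage would look something like this (a sketch based on the description above):

var dict = new BDictionary<string, int>();
bool added = dict.AddIfNotPresent("foo", 0); // true: "foo" was not present
added = dict.AddIfNotPresent("foo", 1);      // false: "foo" already exists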
I have a large list of integers that are sent to my web service. Our business rules state that these values must be unique. What is the most performant way to figure out whether there are any duplicates? I don't need to know the values; I only need to know if two of the values are equal.
At first I was thinking about using a generic List of integers and the list.Exists() method, but this is O(n).
Then I was thinking about using a Dictionary and the ContainsKey method. But, I only need the Keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?
Use a HashSet<T>:
The HashSet class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
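So the duplicate check collapses to a couple of lines (a sketch; numbers stands for your List<int>):

var set = new HashSet<int>(numbers);
bool hasDuplicates = set.Count != numbers.Count; // any collapse means a duplicate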
Sounds like a job for a HashSet...
If you are using framework 3.5 you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you get better performance than O(n) in case there are duplicates, as you don't have to continue looking after finding the first duplicate.
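A sketch of that early-exit check, relying on HashSet<T>.Add returning false for an element that is already present:

static bool HasDuplicates(IEnumerable<int> numbers)
{
    var seen = new HashSet<int>();
    foreach (int n in numbers)
        if (!seen.Add(n))   // Add returns false if n was already in the set
            return true;    // stop at the first duplicate
    return false;
}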
If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better if you stored the number set as a sorted array or binary tree of begin,end pairs. Then you could search to find the pair with the largest begin value that was smaller than your search key and compare with that pair's end value to see if it exists in the set.
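A sketch of that idea with a sorted array of ranges (the Range struct and Contains helper are illustrative, not from any library; ranges must be sorted by Begin and non-overlapping):

struct Range { public int Begin, End; }

static bool Contains(List<Range> ranges, int key)
{
    // Binary search for the last range whose Begin <= key.
    int lo = 0, hi = ranges.Count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (ranges[mid].Begin <= key) lo = mid + 1;
        else hi = mid - 1;
    }
    return hi >= 0 && key <= ranges[hi].End;
}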
What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.
I am looking for a structure that holds a sorted set of double values. I want to query this set to find the closest value to a specified reference value.
I have looked at the SortedList<double, double>, and it does quite well for me. However, since I do not need explicit key/value pairs, this seems to be overkill to me, and I wonder if I could do faster.
Conditions:
The structure is initialised only once and never changes (no inserts/deletes)
The amount of values is in the range of 100k.
The structure is queried often with new references, which must execute fast.
For simplicity and speed, the set's value just below the reference may be returned, not necessarily the nearest value
I want to use LINQ for the query, if possible, for simplicity of code.
I want to use no 3rd party code if possible. .NET 3.5 is available.
Speed is more important than memory footprint
I currently use the following code, where SortedValues is the aforementioned SortedList
IEnumerable<double> nearest = from item in SortedValues.Keys
where item <= suggestion
select item;
return nearest.ElementAt(nearest.Count() - 1);
Can I do faster?
Also I am not 100% sure if this code is really safe. IEnumerable, the return type of my query, is not by definition sorted anymore. However, a unit test with a large test data base has shown that it is in practice, so this works for me. Do you have hints regarding this aspect?
P.S. I know that there are many similar questions, but none actually answers my specific needs. Especially there is this one, C# Data Structure Like Dictionary But Without A Value, but the questioner just wants to check existence, not find anything.
The way you are doing it is incredibly slow, as it must search from the beginning of the list each time, giving O(n) performance.
A better way is to put the elements into a List and then sort the list. You say you don't need to change the contents once initialized, so sorting once is enough.
Then you can use List<T>.BinarySearch to find elements or to find the insertion point of an element if it doesn't already exist in the list.
From the docs:
Return Value: The zero-based index of item in the sorted List<T>, if item is found; otherwise, a negative number that is the bitwise complement of the index of the next element that is larger than item or, if there is no larger element, the bitwise complement of Count.
Once you have the insertion point, you need to check the elements on either side to see which is closest.
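Putting it together, a sketch of the whole lookup (FindNearest is my own name; values is the List<double> you sorted once up front):

static double FindNearest(List<double> values, double target)
{
    int index = values.BinarySearch(target);
    if (index >= 0)
        return values[index];            // exact match
    index = ~index;                      // insertion point
    if (index == 0)
        return values[0];                // target below all values
    if (index == values.Count)
        return values[values.Count - 1]; // target above all values
    double below = values[index - 1], above = values[index];
    return (target - below <= above - target) ? below : above;
}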
Might not be useful to you right now, but .Net 4 has a SortedSet class in the BCL.
I think it can be more elegant as follows:
In case your items are not sorted:
double nearest = values.OrderBy(x => x.Key).Last(x => x.Key <= requestedValue).Key;
In case your items are sorted, you may omit the OrderBy call...