Finding nearest value in a SortedDictionary - c#

I have a SortedDictionary
SortedDictionary<int, CPUOptimizationObject> myDict;
Now I want to find the first value above X.
I can do something like this
foreach (var iKey in myDict.Keys)
{
    if (iKey >= thresholdKey)
    {
        foundKey = iKey;
        break;
    }
}
but this isn't good performance-wise.
Any better suggestion?
(Is there a method for this in the collections, something like a binary search for SortedDictionary?)

While, in theory, finding the smallest item larger than a given value is an operation that can be performed efficiently on a binary search tree (which is what SortedDictionary is implemented as), SortedDictionary does not expose the means for you to perform such a search.
You would need a different implementation of a binary search tree in order to perform such a search efficiently while still using the same kind of data structure. There are no suitable built-in .NET types; you would need to use a third-party implementation (of which there are quite a few out there).

You could try whether this is faster. But I guess it will only be faster if you are performing the search multiple times.
var keys = new List<int>(myDict.Keys);
int index = keys.BinarySearch(thresholdKey);
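If the exact key is not present, BinarySearch returns the bitwise complement of the index of the next larger key, so a small continuation of the snippet above (using foundKey and thresholdKey from the question) might look like this:
if (index < 0)
    index = ~index;            // index of the first key greater than thresholdKey
if (index < keys.Count)
    foundKey = keys[index];    // smallest key >= thresholdKey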

Create a temporary list n by calling .ToList() on your SortedDictionary's Keys.
Since that gives you a List<int> n, you could do
n.Find(item => item > 20)
to retrieve the first key that matches.

I don't know if this has better performance than the foreach, but this should work:
var foo = myDict.FirstOrDefault(i => i.Key > thresholdKey);

Related

Reading a 200MiB binary file in a SortedDictionary

I have a 200MiB binary file containing pairs of Int64 and Int32. The file is already sorted by the Int64 values.
I'm reading it like this:
private SortedDictionary<Int64, int> edgeID = new SortedDictionary<Int64, int>();
// Load the edge ID mapping file
using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    Int64 keyID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();
    for (int i = 0; i < numArchi; i++)
    {
        keyID = b.ReadInt64();
        grafoID = b.ReadInt32();
        edgeID[keyID] = grafoID;
    }
}
But it's super slow. I noticed that with a normal Dictionary I could speed things up a bit by passing the size of the Dictionary to the constructor, but apparently SortedDictionary doesn't support that. Also, I think the bigger problem is that the program doesn't know I'm passing data that is already ordered, so it checks the key position at every insert.
Your second assumption is partially correct: SortedDictionary, in general, has no way of telling that the data is ordered and needs to check that each item is inserted in the right place.
Furthermore, SortedDictionary is a red-black tree internally in the current Mono and Microsoft implementations of the BCL.
Insertion of already-ordered data is actually the worst-case scenario for a red-black tree (see the insert section here).
Also, the implementation might change in the future, so optimizing for or against it is a risky move!
In comparison, Dictionary is a hash table internally (that's why you can give it an initial size; hash tables are implemented using arrays underneath, of different shapes depending on the implementation), so both insertion and look-up times are faster (amortized O(1)).
You have to consider: what do you need it for? If you just want to search by key and the data is already ordered, probably the best way is to load the data into a simple contiguous array (a .NET array or a List) and binary-search it. You will need the overload that takes a comparer if your array holds a Tuple, or you can just use two parallel arrays, one of keys and one of values; your choice.
That is much faster than SortedDictionary and Dictionary on loading (and it should be faster than SortedDictionary in searching, too).
It might be that Dictionary is still faster for searching a single item; hash tables have O(1) access, while binary search on an ordered array is O(log N), but locality plays an important role at these "intermediate" data sizes, so the real result might differ from the theoretical one. You will need to profile and measure to decide which is best in your case!
(By intermediate sizes I mean: sizes for which big-O notation does not yet play a dominant role.)
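A minimal sketch of that parallel-array idea, assuming the same file layout as in the question (a count followed by Int64/Int32 pairs); someKeyToFind is a made-up placeholder for whatever key you look up later:
long[] keys;
int[] values;
using (BinaryReader b = new BinaryReader(File.Open(@"file.bin", FileMode.Open)))
{
    int numArchi = b.ReadInt32();
    keys = new long[numArchi];
    values = new int[numArchi];
    for (int i = 0; i < numArchi; i++)
    {
        keys[i] = b.ReadInt64();     // file is already sorted by this key
        values[i] = b.ReadInt32();
    }
}

// Lookup by binary search; Array.BinarySearch works because keys is sorted.
int index = Array.BinarySearch(keys, someKeyToFind);
int grafoID = index >= 0 ? values[index] : -1;   // -1 here just means "not found"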
Using a Dictionary initialized with the number of elements, like this:
using (BinaryReader b = new BinaryReader(File.Open(@"filed", FileMode.Open)))
{
    Int64 tomtomID;
    Int32 grafoID;
    int numArchi = b.ReadInt32();
    edgeID = new Dictionary<long, int>(numArchi);
    for (int i = 0; i < numArchi; i++)
    {
        tomtomID = b.ReadInt64();
        grafoID = b.ReadInt32();
        edgeID[tomtomID] = grafoID;
    }
}
This solved the issue.

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and each has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted lists and just do x*y, but drop items from the list once I've found them so I don't check ones already found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };

foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
Sort the second array using the Array.Sort method, then for each object in the first array find its match in the second array with a binary search.
Generally, for 30-50 items this would be a little faster than the brute-force x*y approach.
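A rough sketch of that idea; the Item class, NameComparer, and the Tuple-based pair storage are illustrative assumptions, not types from the question:
class NameComparer : IComparer<Item>
{
    public int Compare(Item a, Item b)
    {
        return string.CompareOrdinal(a.Name, b.Name);
    }
}

var comparer = new NameComparer();
Array.Sort(array2, comparer);                 // O(n log n), once per pair of arrays

var pairs = new List<Tuple<Item, Item>>();
foreach (Item item1 in array1)
{
    int index = Array.BinarySearch(array2, item1, comparer);   // O(log n) per item
    if (index >= 0)
        pairs.Add(Tuple.Create(item1, array2[index]));
    // a negative index means this item has no match in the other array
}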

Best Collection for Fast String Lookup

I need a list of strings and a way to quickly determine if a string is contained within that list.
To enhance lookup speed, I considered SortedList and Dictionary; however, both work with KeyValuePairs when all I need is a single string.
I know I could use a KeyValuePair and simply ignore the Value portion. But I do prefer to be efficient and am just wondering if there is a collection better suited to my requirements.
If you're on .NET 3.5 or higher, use HashSet<String>.
Failing that, a Dictionary<string, byte> (or whatever type you want for the TValue type parameter) would be faster than a SortedList if you have a lot of entries - the latter will use a binary search, so it'll be O(log n) lookup, instead of O(1).
If you just want to know if a string is in the set use HashSet<string>
This sounds like a job for
var keys = new HashSet<string>();
Per MSDN: the Contains method is an O(1) operation.
But you should be aware that Add does not raise an error for duplicates.
HashSet<string> is like a Dictionary, but with only keys.
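A quick usage sketch; Add simply returns false for a duplicate rather than throwing:
var keys = new HashSet<string>();
bool added = keys.Add("alpha");       // true, newly added
bool addedAgain = keys.Add("alpha");  // false, already present, no exception
bool found = keys.Contains("alpha");  // true, amortized O(1)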
If you feel like rolling your own data structure, use a Trie.
http://en.wikipedia.org/wiki/Trie
The worst case is when the string is present: O(length of the string).
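If you do go that route, a minimal sketch of a trie supporting only Add and Contains might look like this (the class and member names are made up, and it is not tuned):
class Trie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        Node current = root;
        foreach (char c in word)
        {
            Node next;
            if (!current.Children.TryGetValue(c, out next))
            {
                next = new Node();
                current.Children[c] = next;
            }
            current = next;
        }
        current.IsWord = true;   // mark the end of a complete word
    }

    public bool Contains(string word)
    {
        Node current = root;
        foreach (char c in word)
        {
            if (!current.Children.TryGetValue(c, out current))
                return false;    // ran out of matching prefix
        }
        return current.IsWord;
    }
}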
I know this answer is a bit late to the party, but I was running into an issue where our systems were running slowly. After profiling we found out there were a LOT of string lookups happening because of the way our data structures were organized.
So we did some research, came across these benchmarks, did our own tests, and have switched over to using SortedList now.
if (sortedlist.ContainsKey(thekey))
{
    // found it.
}
Even though a Dictionary proved to be faster, it was less code we had to refactor, and the performance increase was good enough for us.
Anyway, I wanted to share the website in case other people are running into similar issues. They do comparisons between data structures where the string you're looking for is a "key" (as in HashTable, Dictionary, etc.) or a "value" (in a List, an Array, or in a Dictionary, etc.), which is where ours are stored.
I know the question is old as hell, but I just had to solve the same problem, only for a very small set of strings (between 2 and 4).
In my case, I actually used a manual lookup over an array of strings, which turned out to be much faster than HashSet<string> (I benchmarked it).
for (int i = 0; i < this.propertiesToIgnore.Length; i++)
{
    if (this.propertiesToIgnore[i].Equals(propertyName))
    {
        return true;
    }
}
Note that it beats the hash set only for tiny arrays!
EDIT: works only with a manual for loop, do not use LINQ, details in comments

How do I find next largest key in a collection?

Suppose I have a dictionary in C#. Assuming the keys are comparable, how do I find the smallest key greater than a given k (of the same type as the keys of the dictionary)? I would like to do this efficiently, with a collection like a SortedDictionary.
Clearly, if it were not a question of doing it efficiently, one could start with any dictionary, extract its keys, and then use the First method with a suitable predicate. But that would execute in linear time (in the number of keys), when with a sorted set of keys one should be able to find the key in log time.
Thanks.
The SortedList<TKey, TValue> class implements IDictionary<TKey, TValue> and has an IndexOfKey method; I think that's what you want:
// I'm just going to pretend your keys are ints
var collection = new SortedList<int, string>();
// populate collection with whatever
int k = GetK(); // or whatever
int kIndex = collection.IndexOfKey(k);
int? smallestKeyGreaterThanK = null;
if (collection.Count > kIndex + 1)
smallestKeyGreaterThanK = collection.Keys[kIndex + 1];
According to the MSDN documentation:
This method performs a binary search; therefore, this method is an O(log n) operation.
EDIT: If you can't be sure that the dictionary contains the key you're looking for (you just want the next-largest), there is still a way to leverage an existing binary search method from .NET for your purposes. You said you are looking for an "efficient" solution; the following fits that criterion if you mean in terms of your time (and in terms of lines of code). If you mean in terms of memory usage or performance, on the other hand, it might not be ideal. Anyway:
List<int> keysList = new List<int>(collection.Keys);
int kIndex = keysList.BinarySearch(k);
Now, BinarySearch will give you what you're looking for, but if the key's not there, it's a little wacky. The return value, from the MSDN documentation, is as follows:
The zero-based index of item in the sorted List<T>, if item is found; otherwise, a negative number that is the bitwise complement of the index of the next element that is larger than item or, if there is no larger element, the bitwise complement of Count.
What this means is that you'll need to add another line:
kIndex = kIndex >= 0 ? kIndex : ~kIndex;
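Putting the pieces together, a small sketch that returns the smallest key strictly greater than k (when k itself is present, the next larger key is one position to the right):
List<int> keysList = new List<int>(collection.Keys);
int kIndex = keysList.BinarySearch(k);
int nextIndex = kIndex >= 0 ? kIndex + 1 : ~kIndex;   // skip k itself if it was found
int? smallestKeyGreaterThanK =
    nextIndex < keysList.Count ? keysList[nextIndex] : (int?)null;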
For any plain dictionary, you will have to sort the keys yourself and then do a binary search on them to find the one that matches your value.
That gives a cost of O(n log n) + O(log n) for the whole operation.
If the keys are already sorted, you can reduce it to O(log n), but with most dictionaries this isn't the case.
That being said, it becomes a simple matter of comparing n against n log n + log n, considering how many keys you will typically perform this operation on, and deciding whether a linear or binary search is better.
Since n will always be lower than n log n + log n, in that situation it is better to just search the keys linearly.
Are you sure the SortedDictionary approach would execute in linear time? Since this is a class by Microsoft, I would expect it to be optimized.
I suggest you actually write some test methods to be sure.
br, Marcel
Since SortedDictionary implements IEnumerable, why not loop through the collection and stop when you hit the first key greater than k? Unless you have a large collection and your target is nearer to the end, this should give you reasonable performance. Just how big is your dictionary?

Algorithm for matching lists of integers

For each day we have approximately 50,000 instances of a data structure (this could eventually grow to be much larger) that encapsulate the following:
DateTime AsOfDate;
int key;
List<int> values; // list of distinct integers
This is probably not relevant but the list values is a list of distinct integers with the property that for a given value of AsOfDate, the union of values over all values of key produces a list of distinct integers. That is, no integer appears in two different values lists on the same day.
The lists usually contain very few elements (between one and five), but are sometimes as long as fifty elements.
Given adjacent days, we are trying to find instances of these objects for which the values of key on the two days are different, but the list values contain the same integers.
We are using the following algorithm. Convert the list values to a string via
string signature = String.Join("|", values.OrderBy(n => n).ToArray());
then hash signature to an integer, order the resulting lists of hash codes (one list for each day), walk through the two lists looking for matches and then check to see if the associated keys differ. (Also check the associated lists to make sure that we didn't have a hash collision.)
Is there a better method?
You could probably just hash the list itself, instead of going through String.
Apart from that, I think your algorithm is nearly optimal. Assuming no hash collisions, it is O(n log n + m log m) where n and m are the numbers of entries for each of the two days you're comparing. (The sorting is the bottleneck.)
You can do this in O(n + m) if you use a bucket array (essentially a hash table) that you put the hashes into. You can compare the two bucket arrays in O(max(n, m)), assuming their length depends on the number of entries (to get a reasonable load factor).
It should be possible to have the library do this for you (it looks like you're using .NET) by using HashSet.IntersectWith() and writing a suitable compare function.
You cannot do better than O(n + m), because every entry needs to be visited at least once.
Edit: misread, fixed.
On top of the other answers, you could make the process faster by creating a low-cost hash constructed simply by XORing all the elements of each List.
You wouldn't have to order your list, and all you would get is an int, which is easier and faster to store than a string.
Then you only need to use the resulting XORed number as a key to a Hashtable and check for the existence of the key before inserting it.
If there is already an existing key, only then do you sort the corresponding Lists and compare them.
You still need to compare them if you find a match because there may be some collisions using a simple XOR.
I suspect, though, that the result would be much faster and have a much lower memory footprint than re-ordering arrays and converting them to strings.
If you were to have your own implementation of the List<>, then you could build the generation of the XOR key within it so it would be recalculated at each operation on the List.
This would make the process of checking duplicate lists even faster.
Code
Below is a first-attempt at implementing this.
Dictionary<int, List<List<int>>> checkHash = new Dictionary<int, List<List<int>>>();

public bool CheckDuplicate(List<int> theList)
{
    bool isIdentical = false;
    int xorkey = 0;
    foreach (int v in theList) xorkey ^= v;

    List<List<int>> existingLists;
    checkHash.TryGetValue(xorkey, out existingLists);

    if (existingLists != null)
    {
        // Already in the dictionary. Check each stored list.
        foreach (List<int> li in existingLists)
        {
            isIdentical = (theList.Count == li.Count);
            if (isIdentical)
            {
                // Check all elements
                foreach (int v in theList)
                {
                    if (!li.Contains(v))
                    {
                        isIdentical = false;
                        break;
                    }
                }
            }
            if (isIdentical) break;
        }
    }

    if (existingLists == null)
    {
        // Never seen this key before: start a new bucket for it.
        checkHash.Add(xorkey, new List<List<int>> { theList });
    }
    else if (!isIdentical)
    {
        // XOR collision with different contents: add to the existing bucket.
        existingLists.Add(theList);
    }
    return isIdentical;
}
Not the most elegant or easiest to read at first sight; it's rather 'hacky', and I'm not even sure it performs better than the more elegant version from Guffa.
What it does though is take care of collision in the XOR key by storing Lists of List<int> in the Dictionary.
If a duplicate key is found, we loop through each previously stored List until we find an identical one or run out of candidates.
The good point about the code is that it should be about as fast as you can get in most cases, and still faster than building strings when there is a collision.
Implement an IEqualityComparer for List<int>; then you can use the list as a key in a dictionary.
If the lists are sorted, it could be as simple as this:
class IntListEqualityComparer : IEqualityComparer<List<int>>
{
    public int GetHashCode(List<int> list)
    {
        int code = 0;
        foreach (int value in list) code ^= value;
        return code;
    }

    public bool Equals(List<int> list1, List<int> list2)
    {
        if (list1.Count != list2.Count) return false;
        for (int i = 0; i < list1.Count; i++)
        {
            if (list1[i] != list2[i]) return false;
        }
        return true;
    }
}
Now you can create a dictionary that uses the IEqualityComparer:
Dictionary<List<int>, YourClass> day1 = new Dictionary<List<int>, YourClass>(new IntListEqualityComparer());
Add all the items from the first day to the dictionary, then loop through the items from the second day and check if the key exists in the dictionary. As the IEqualityComparer handles both the hash code and the comparison, you will not get any false matches.
You may want to test some different methods of calculating the hash code. The one in the example works, but may not give the best efficiency for your specific data. The only requirement on the hash code for the dictionary to work is that the same list always gets the same hash code, so you can do pretty much what ever you want to calculate it. The goal is to get as many different hash codes as possible for the keys in your dictionary, so that there are as few items as possible in each bucket (with the same hash code).
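A sketch of that day-to-day comparison; the Entry class (with the key and values fields from the question) and the day1Entries/day2Entries collections are assumptions for illustration, and OrderBy (System.Linq) is used because the comparer above expects sorted lists:
var day1 = new Dictionary<List<int>, Entry>(new IntListEqualityComparer());
foreach (Entry e in day1Entries)
    day1[e.values.OrderBy(n => n).ToList()] = e;   // sorted values list is the key

foreach (Entry e in day2Entries)
{
    Entry match;
    if (day1.TryGetValue(e.values.OrderBy(n => n).ToList(), out match)
        && match.key != e.key)
    {
        // same values list on both days, but under a different key
    }
}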
Does the ordering matter? i.e. [1,2] on day 1 and [2,1] on day 2, are they equal?
If they are, then hashing might not work all that well. You could use a sorted array/vector instead to help with the comparison.
Also, what are the values like? Do they have a definite range (e.g. 0-63)? You might be able to concatenate them into a large integer (which may require precision beyond 64 bits) and hash that, instead of converting to a string, because that might take a while.
It might be worthwhile to place this in a SQL database. If you don't want a full-blown DBMS you could use SQLite.
This would make uniqueness checks, unions, and these kinds of operations very simple queries, and they would be very efficient. It would also allow you to easily store the information if it is ever needed again.
Would you consider summing up the list of values to obtain an integer that can be used as a precheck of whether different lists contain the same set of values?
Though there will be many more collisions (the same sum doesn't necessarily mean the same set of values), I think it can reduce the number of full comparisons required by a large margin.
