I'm working on a C# jQuery implementation and am trying to figure out an efficient algorithm for locating elements in a subset of the entire DOM (i.e. a subselector). At present I am creating an index of the common selectors (class, ID, and tag) when the DOM is built.
The basic data structure is as one would expect: a tree of Elements, each of which contains IEnumerable<Element> Children and a Parent. Searching the whole DOM is simple using a Dictionary<string, HashSet<Element>> to store the index.
I have not been able to get my head around the most effective way to search subsets of elements using an index. I use the term "subset" to refer to the starting set against which a subsequent selector in a chain will be run. The following are methods I've thought of:
1. Retrieve matches from the entire DOM for a subquery, and eliminate those that are not part of the subset. This requires traversing up the parents of each match until the root is found (and it is eliminated) or a member of the subset is found (and it is a child, hence included).
2. Maintain the index separately for each element.
3. Maintain a set of parents for each element (to make #1 fast by eliminating traversal).
4. Rebuild the entire index for each subquery.
5. Just search manually except for primary selectors.
The cost of each possible technique depends greatly on the exact operation being done. #1 is probably pretty good most of the time, since most of the time when you do a sub-select, you're targeting specific elements. The number of iterations required would be the number of results * the average depth of each element.
The 2nd method would be by far the fastest for selecting, but at the expense of storage requirements that increase exponentially with depth, and difficult index maintenance. I've pretty much eliminated this.
The 3rd method has a fairly bad memory footprint (though much better than #2) - it may be reasonable, but in addition to the storage requirements, adding and removing elements becomes substantially more expensive and complicated.
The 4th method requires traversing the entire selection anyway, so it seems pointless since most subqueries are only going to be run once. It would only be beneficial if a subquery was expected to be repeated. (Alternatively, I could just do this while traversing a subset anyway - except some selectors don't require searching the whole subtree, e.g. ID and position selectors.)
The 5th method will be fine for limited subsets, but much worse than the 1st method for subsets that are much of the DOM.
Any thoughts or other ideas about how best to accomplish this? I could do some hybrid of #1 and #5 by guessing which is more efficient given the size of the subset being searched vs. the size of the DOM, but this is pretty fuzzy and I'd rather find some universal solution. Right now I am just using #5 (only full-DOM queries use the index), which is fine, but really bad if you decide to do something like $('body').Find('#id')
Disclaimer: This is early optimization. I don't have a bottleneck that needs solving, but as an academic problem I can't stop thinking about it...
Solution
Here's the implementation of the data structure proposed by the answer. It works perfectly as a near drop-in replacement for a dictionary.
interface IRangeSortedDictionary<TValue> : IDictionary<string, TValue>
{
    IEnumerable<string> GetRangeKeys(string subKey);
    IEnumerable<TValue> GetRange(string subKey);
}

public class RangeSortedDictionary<TValue> : IRangeSortedDictionary<TValue>
{
    protected SortedSet<string> Keys = new SortedSet<string>();
    protected Dictionary<string, TValue> Index =
        new Dictionary<string, TValue>();

    public IEnumerable<string> GetRangeKeys(string subKey)
    {
        if (string.IsNullOrEmpty(subKey))
        {
            yield break;
        }
        // Create the next possible string match by incrementing the
        // last character of the prefix.
        string lastKey = subKey.Substring(0, subKey.Length - 1) +
            (char)(subKey[subKey.Length - 1] + 1);
        foreach (var key in Keys.GetViewBetween(subKey, lastKey))
        {
            // GetViewBetween is inclusive; exclude the upper bound just in
            // case a key with exactly that value exists.
            if (key != lastKey)
            {
                yield return key;
            }
        }
    }

    public IEnumerable<TValue> GetRange(string subKey)
    {
        foreach (var key in GetRangeKeys(subKey))
        {
            yield return Index[key];
        }
    }

    // Implement the rest of the IDictionary<string, TValue> interface
    // against the internal collections (omitted; see the linked code).
}
Code is here: http://ideone.com/UIp9R
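For illustration, a hypothetical usage sketch (the selector>path key scheme anticipates the answer below; Element, someElement, and the indexer from the omitted IDictionary implementation are stand-ins):

var index = new RangeSortedDictionary<HashSet<Element>>();

// Key = selector plus the element's full path.
index[".someclass>/path/to/element"] = new HashSet<Element> { someElement };
index["#someid>/path/to/element"] = new HashSet<Element> { someElement };

// All .someclass element sets within the /path/to subtree:
foreach (HashSet<Element> matches in index.GetRange(".someclass>/path/to"))
{
    // merge/process matches...
}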
If you suspect name collisions will be uncommon, it may be fast enough to just walk up the tree.
If collisions are common though, it might be faster to use a data structure that excels at ordered prefix searches, such as a tree. Your various subsets make up the prefix. Your index keys would then include both selectors and total paths.
For the DOM:
<path>
  <to>
    <element id="someid" class="someclass" someattribute="1"/>
  </to>
</path>
You would have the following index keys:
<element>/path/to/element
#someid>/path/to/element
.someclass>/path/to/element
[someattribute]>/path/to/element
Now if you search these keys based on prefix, you can limit the query to any subset you want:
<element> ; finds all <element>, regardless of path
.someclass> ; finds all .someclass, regardless of path
.someclass>/path ; finds all .someclass that exist in the subset /path
.someclass>/path/to ; finds all .someclass that exist in the subset /path/to
#id>/body ; finds all #id that exist in the subset /body
A tree can find the lower bound (the first element greater than or equal to your search value) in O(log n), and because it is ordered, from there you simply iterate until you come to a key that no longer matches the prefix. It will be very fast!
.NET doesn't have a suitable tree structure (it has SortedDictionary but that unfortunately doesn't expose the required LowerBound method), so you'll need to either write your own or use an existing third party one. The excellent C5 Generic Collection Library features trees with suitable Range methods.
Related
So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and each has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs = from item1 in array1
            join item2 in array2 on item1.Name equals item2.Name
            select new { item1, item2 };

foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
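For what it's worth, here is a rough sketch of what the join does internally: hash the inner array once, then probe it for each element of the outer array. Item is a stand-in for your actual type, and since the names are unique a plain Dictionary suffices:

var byName = array2.ToDictionary(item => item.Name, StringComparer.Ordinal);

foreach (var item1 in array1)
{
    Item item2;
    // O(1) lookup instead of scanning array2 for every element of array1.
    if (byName.TryGetValue(item1.Name, out item2))
    {
        // item1 and item2 are a pair
    }
}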
Sort the second array using the Array.Sort method, then match each object from the first array against it using binary search.
Generally, for 30-50 items this would be a little faster than brute force x*y.
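A sketch of that approach, assuming using System.Linq and an Item type with a .Name property (the parallel key array avoids writing an IComparer<Item> for the binary search):

// Sort a copy of array2 by Name, keeping a parallel key array to search on.
string[] keys = array2.Select(i => i.Name).ToArray();
Item[] sorted = (Item[])array2.Clone();
Array.Sort(keys, sorted, StringComparer.Ordinal);

var pairs = new List<Tuple<Item, Item>>();
foreach (var item in array1)
{
    int index = Array.BinarySearch(keys, item.Name, StringComparer.Ordinal);
    if (index >= 0) // a negative result means no match (the zero-match case)
        pairs.Add(Tuple.Create(item, sorted[index]));
}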
I'm looking for the fastest way to find all strings in a collection that start with a given set of characters. I could use a sorted collection for this, however I can't find a convenient way to do it in .NET. Basically I need to find the low and high indexes in a collection that meet the criteria.
BinarySearch on List<T> does not guarantee that the returned index is that of the first matching element, so one would need to iterate up and down to find all matching strings, which is not fast if one has a large list.
There are also LINQ methods (including parallel ones), but I'm not sure which data structure will provide the best results.
List example, ~10M records:
aaaaaaaaaaaaaaabb
aaaaaaaaaaaaaaba
aaaaaaaaaaaaabc
...
zzzzzzzzzzzzzxx
zzzzzzzzzzzzzyzzz
zzzzzzzzzzzzzzzzzza
Search for strings starting from: skk...
Result: record indexes from x to y.
UPDATE: strings can have different lengths and are unique.
In terms of time complexity, you should use a trie, not a sorted set or binary search.
A trie gets you O(|S|) time complexity [while a sorted set or binary search gets you O(|S| log n)] to find the node [call it v] that represents the prefix.
All the strings [paths] in the trie that fit the prefix will "pass" via v. By adding a numberOfLeaves field to each node, you can find out exactly how many leaves [= strings] this node has.
In a single pass you can also find the index of this v [for each node u on the path from the root to v, sum numberOfLeaves for each sibling that is left of u].
This requires much more work than using already existing structures, but if the data is huge it can make your algorithm much faster, so you should consider it if performance is an issue and you expect a huge set of strings.
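A minimal sketch of such a prefix-counting trie (TrieNode and CountPrefix are illustrative names, not from any library; computing the index of v from left siblings is omitted and would additionally require ordered children, e.g. a SortedDictionary):

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public int NumberOfLeaves; // number of stored strings passing through this node
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string s)
    {
        var node = root;
        node.NumberOfLeaves++;
        foreach (char c in s)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new TrieNode();
            child.NumberOfLeaves++;
            node = child;
        }
    }

    // Number of stored strings starting with the given prefix, in O(|prefix|).
    public int CountPrefix(string prefix)
    {
        var node = root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return 0;
        return node.NumberOfLeaves;
    }
}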
You can do it with a hand-written binary search - one which just doesn't stop when it finds a match; it continues until it has narrowed the range down to a single index.
In fact, you don't even have to write the binary search bit yourself - you could create a custom comparer which never returns 0, i.e. if you're looking for "abc" then it treats "abb" as being below the target value, but "abc" as being above the target value. This way the BinarySearch will always return a negative number, which you can then just bit-flip to find the theoretical insertion point for "the string which comes between abb and abc".
You can do the same in reverse (treat "abc" as lower than the target value) to find the highest bound.
If you know the format of these strings and it won't have edge cases like Unicode NULL characters, and everything's the same length, you can even do it without writing your own comparer:
// This could be done more efficiently :)
// Decrement the last character of the target, then append a padding
// character, giving a string that sorts just below the target.
string stringJustBelow = target.Substring(0, target.Length - 1) +
    (char)(target[target.Length - 1] - 1) + "X";
string stringJustAbove = target + "X"; // Or any character
int lowerBoundInclusive = ~list.BinarySearch(stringJustBelow);
int upperBoundExclusive = ~list.BinarySearch(stringJustAbove);
So if your strings are all length 3 and you were searching for "abc" you'd actually look for where "abbX" and "abcX" would be inserted.
Put them in a SortedSet and use GetViewBetween.
This answer illustrates searching for both prefix and suffix, I'm sure you'll have no trouble adapting it to prefix-only search, if that is indeed what you want.
If you just want to search for a range (not prefix), directly using GetViewBetween should suffice.
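A sketch of the GetViewBetween approach (note that a SortedSet hands you the matching strings rather than their indexes, so this assumes the strings themselves are what you ultimately need):

var set = new SortedSet<string>(strings, StringComparer.Ordinal);

string prefix = "skk";
// The smallest string that no longer matches: bump the last prefix character.
string upper = prefix.Substring(0, prefix.Length - 1) +
    (char)(prefix[prefix.Length - 1] + 1);

// GetViewBetween is inclusive on both ends, so filter out the upper bound.
foreach (var s in set.GetViewBetween(prefix, upper))
    if (string.CompareOrdinal(s, upper) < 0)
        Console.WriteLine(s);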
I am solving the following problem:
Suppose I have a list of software packages whose names might look like this (the only known thing is that these names are formed like SOMETHING + VERSION, meaning that the version always comes after the name):
Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1
Now, I need to parse this list and select only latest versions of each package. For this example the expected result would be:
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
The current approach I use can be described as follows:
Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
Apply the regular expression (or a set of regular expressions),
which tries to deduce the version from the string and perform
the following conversion `STRING -> (VERSION, STRING_BEFORE_VERSION)`
// Example for this step:
// 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
// (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)
Search through the corresponding group (in this example - the 'E' group)
and find every other string which starts with 'STRING_BEFORE_VERSION' or
with a significant part of it. This comparison is performed in ignore-case
and ignore-special-symbols mode.
// The matches for this step:
// Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
// Efficient.Exclusive.Zip.Archiver, {123.01}
// Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
// The last one will get picked, because year is ignored.
Get the possible version from each match, pick the latest, and yield that match.
Remove every possible match (including the initial element) from the list.
}
This algorithm (as I assume) should work for something like O(N * V + N lg N * M), where M stands for the average string matching time and V stands for the version regexp working time.
However, I suspect there is a better solution (there always is!), maybe specific data structure or better matching approach.
If you can suggest something or make some notes on the current approach, please do not hesitate to do so.
How about this? (Pseudo-Code)
Dictionary<string, string> latestPackages =
    new Dictionary<string, string>(packageNameComparer);

foreach (var element in elements)
{
    // applyRegex and isNewer stand for the version-extraction and
    // version-comparison steps; packageNameComparer is described below.
    (package, version) = applyRegex(element);
    if (!latestPackages.ContainsKey(package) || isNewer)
    {
        latestPackages[package] = version;
    }
}
//print out latestPackages
Dictionary operations are O(1), so you have O(n) total runtime. No pre-grouping necessary and instead of storing all matches, you only store the one which is currently the newest.
Dictionary has a constructor which accepts an IEqualityComparer object. There you can implement your own semantics of equality between package names. Keep in mind, however, that you need to implement a GetHashCode method in this IEqualityComparer which returns the same values for objects that you consider equal. To reproduce the example above, you could return the hash code of the first character in the string, which would reproduce the grouping you had inside your dictionary. However, you will get more performance with a smarter hash code which doesn't have as many collisions, perhaps using more characters if that still yields good results.
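A sketch of such a comparer; the normalization rule (letters and digits only, lower-cased) is an assumed interpretation of "ignoring spaces, case and special symbols":

class PackageNameComparer : IEqualityComparer<string>
{
    private static string Normalize(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
            if (char.IsLetterOrDigit(c))
                sb.Append(char.ToLowerInvariant(c));
        return sb.ToString();
    }

    public bool Equals(string x, string y)
    {
        return Normalize(x) == Normalize(y);
    }

    public int GetHashCode(string obj)
    {
        // Names that compare equal must produce equal hash codes.
        return Normalize(obj).GetHashCode();
    }
}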
I think you could probably use a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) here to good effect. I think you could simply cycle down each node till you hit one that has only 1 "child". On this node, you'll have common prefixes "up" the tree and version strings below. From there, parse the version strings by removing everything that isn't a digit or a period, splitting the string by the period and converting each element of the array to an integer. This should give you an int array for each version string. Identify the highest version, record it and travel to the next node with only 1 child.
EDIT: Populating a large DAWG is a pretty expensive operation but lookup is really fast.
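The version-parsing step described there might look like the following sketch (assumes using System.Linq; very long digit runs would overflow int.Parse). Comparing two resulting arrays element by element then identifies the higher version:

static int[] ParseVersion(string version)
{
    // Keep only digits and periods, split on periods, convert each part to int.
    string cleaned = new string(version.Where(c => char.IsDigit(c) || c == '.').ToArray());
    return cleaned.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
                  .Select(int.Parse)
                  .ToArray();
}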
I have a problem where I need to remove all objects of a tree from a list.
I have a List<String> Tags which contains the tags in my entire system that match a certain criterion (generally starts with some search string). I also have a root Device object. The Device class is described as follows:
public class Device
{
    public int ID;
    public String Tag;
    public EntityCollection<Device> ChildDevices;
}
The attempt that I have made is to use a breadth first search and remove the tags from the list as each node is visited, then return whatever is leftover:
private List<String> RemoveInvalidTags(Device root, List<String> tags)
{
    var queue = new Queue<Device>();
    queue.Enqueue(root);
    while (queue.Count > 0)
    {
        var device = queue.Dequeue();
        // Load all the child devices of this device from the DB
        var childDevices = device.ChildDevices.ToList();
        foreach (var childDevice in childDevices)
            queue.Enqueue(childDevice);
        tags.Remove(device.Tag);
    }
    return tags;
}
At the moment I am visiting 2000+ device nodes and removing from a list of about 1400 tags (reduced due to the search string). This takes about 4 secs which is far too long.
I have tried changing the list of tags to a hashset but it brings negligible speed improvements.
Any ideas of an algorithm/change that I could use to make this faster?
I'm going to guess that your tree is fairly "fat". That is, that each of your nodes has MANY children, but you don't have a lot of layers. If that is the case, give Depth First Search a try. You should reach bottom quickly and then be able to start removing nodes. You still have to visit all nodes, but you won't have to store as much intermediate data as you would in BFS.
You should definitely be using some sort of hash table (sorry, not familiar with the specifics of c#) for accessing tags.
I am curious about the process of loading the child devices from the DB. Since you are iterating across the entire tree, you might be able to load more appropriately-sized chunks into memory. The breadth-first search might load most of the tree into memory before starting to remove nodes from the queue (if the tree is very wide).
It would be a good idea to instrument or profile your code to find out where most of the time is going. An earlier comment and answer about the database load (i.e. childDevices = device.ChildDevices.ToList();) taking time may be correct, but it seems possible it might instead be
tags.Remove(device.Tag); that is wasting time. A .Remove() is done for every dequeued item, and Remove takes O(n) time: "This method performs a linear search; therefore, this method is an O(n) operation, where n is Count." [MSDN]
That is, suppose you enqueue m device items, many of which have .Tag's not in your tags list of n entries. .Remove touches every element of tags when it looks for a .Tag not in the list, and on average it looks at n/2 entries to find a .Tag that is in the list, so the total work is O(m*n). By contrast, the work in the method below is O(m + n), which typically will be hundreds of times smaller.
To sidestep the problem:
Preprocess the tags list by building a hash set H of its entries
For each device.Tag, test whether it is in H
If it is, add device.Tag to a set D of tags found in the tree
After handling all the device.Tag's, for each element T of the tags list, suppress T if it is in D, else output T
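A sketch of the question's method rewritten along those lines (same breadth-first traversal, but every membership test is O(1); assumes using System.Linq, and folds H and D into a single set):

private List<String> RemoveInvalidTags(Device root, List<String> tags)
{
    var found = new HashSet<string>(); // tags seen anywhere in the tree
    var queue = new Queue<Device>();
    queue.Enqueue(root);
    while (queue.Count > 0)
    {
        var device = queue.Dequeue();
        found.Add(device.Tag);
        foreach (var childDevice in device.ChildDevices)
            queue.Enqueue(childDevice);
    }
    // Keep only the tags that matched no device: O(m + n) overall.
    return tags.Where(t => !found.Contains(t)).ToList();
}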
You can use Stopwatch to find out where the bottleneck is. If you ask me,
var childDevices = device.ChildDevices.ToList();
foreach (var childDevice in childDevices)
    queue.Enqueue(childDevice);
that's your bottleneck.
Look at this Tree implementation in C#; I hope you already know tree traversals.
Why don't you try this?
foreach (var childDevice in device.ChildDevices)
    queue.Enqueue(childDevice);
You don't need to convert device.ChildDevices to a list, because it is already enumerable. Converting it to a list makes the enumeration eager, whereas leaving it as an enumerable keeps it lazy.
Try that.
I am looking for a structure that holds a sorted set of double values. I want to query this set to find the closest value to a specified reference value.
I have looked at the SortedList<double, double>, and it does quite well for me. However, since I do not need explicit key/value pairs, this seems to be overkill, and I wonder if I could do faster.
Conditions:
The structure is initialised only once, and does never change (no insert/deletes)
The amount of values is in the range of 100k.
The structure is queried often with new references, which must execute fast.
For simplicity and speed, the set's value just below the reference may be returned, not necessarily the actual nearest value
I want to use LINQ for the query, if possible, for simplicity of code.
I want to use no 3rd party code if possible. .NET 3.5 is available.
Speed is more important than memory footprint
I currently use the following code, where SortedValues is the aforementioned SortedList:
IEnumerable<double> nearest = from item in SortedValues.Keys
                              where item <= suggestion
                              select item;
return nearest.ElementAt(nearest.Count() - 1);
Can I do faster?
Also, I am not 100% sure that this code is really safe. IEnumerable, the return type of my query, is not by definition sorted anymore. However, a unit test with a large test data set has shown that it is in practice, so this works for me. Do you have any hints regarding this aspect?
P.S. I know that there are many similar questions, but none actually answers my specific needs. In particular there is this one, C# Data Structure Like Dictionary But Without A Value, but the questioner just wants to check existence, not find anything.
The way you are doing it is incredibly slow as it must search from the beginning of the list each time giving O(n) performance.
A better way is to put the elements into a List and then sort the list. You say you don't need to change the contents once initialized, so sorting once is enough.
Then you can use List<T>.BinarySearch to find elements or to find the insertion point of an element if it doesn't already exist in the list.
From the docs:
Return Value: The zero-based index of item in the sorted List<T>, if item is found; otherwise, a negative number that is the bitwise complement of the index of the next element that is larger than item or, if there is no larger element, the bitwise complement of Count.
Once you have the insertion point, you need to check the elements on either side to see which is closest.
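Putting that together, a sketch of the nearest lookup over a pre-sorted List<double> (per condition 4 above you could simply return the element just below instead of comparing distances):

static double FindNearest(List<double> sorted, double reference)
{
    int index = sorted.BinarySearch(reference);
    if (index >= 0)
        return sorted[index]; // exact match

    index = ~index; // bitwise complement gives the insertion point
    if (index == 0)
        return sorted[0];
    if (index == sorted.Count)
        return sorted[sorted.Count - 1];

    double below = sorted[index - 1];
    double above = sorted[index];
    return reference - below <= above - reference ? below : above;
}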
It might not be useful to you right now, but .NET 4 has a SortedSet<T> class in the BCL.
I think it can be more elegant as follows:
In case your items are not sorted:
double nearest = values.OrderBy(x => x.Key).Last(x => x.Key <= requestedValue).Key;
In case your items are sorted, you may omit the OrderBy call...