I'm looking for an implementation of a Red-Black Tree in C#, with the following features:
Search, Insert and Delete in O(log n).
The member type should be generic.
Support for Comparer<T>, so that T can be sorted by different fields within it.
Searching the tree should go by the sort field: it shouldn't take a whole T, but rather a value of the field type the tree is sorted by.
Searching shouldn't be limited to exact values; it should also support finding the next-lower/next-higher element.
Thank you.
You mostly just described SortedDictionary<T, U>, except for the next-lowest/next-highest value binary search, which you could implement on your own without much difficulty.
Are there specific reasons that SortedDictionary is insufficient for you?
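To make that concrete, here is a minimal sketch of the first few requirements on top of SortedDictionary (the Person type and its Age field are hypothetical):

    // SortedDictionary<TKey, TValue> is a red-black tree internally, so both
    // operations below are O(log n), and lookup is by the key/field type, not by T.
    var byAge = new SortedDictionary<int, Person>();
    byAge.Add(person.Age, person);                     // insert
    bool found = byAge.TryGetValue(42, out Person p);  // exact-match search by field value
    // Next-lower/next-higher lookups still need custom code, as discussed below.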
Rip the TreeSet from the C5 collections library.
This is exactly the OrderedDictionary in PowerCollections. It's pretty much identical to SortedDictionary (red black tree with generics) with the addition of the ability to set a start key/end key and scan all values in that range.
SortedDictionary only exposes a GetEnumerator() function that starts at the beginning of the collection and only allows a MoveNext() call, so even if you use LINQ there is nothing magic happening: it starts at the beginning and runs your expression on every single node, in order, until it finds those matching your LINQ expression.
OrderedDictionary has a function that gets an enumerator at or before a particular key and that does the lookup in O(log n).
A word of caution though: the enumerator in the PowerCollections OrderedDictionary is implemented using "yield", and the memory usage and enumeration performance are at least O(n^2)... you can change the implementation yourself to use a traditional enumerator and both of these problems go away. I'll submit that patch to Codeplex if I can ever find the time.
UPDATE:
Starting with .NET 4.7.2, HashSet<T>.TryGetValue is available (docs).
See also: this SO post on HashSet.TryGetValue.
I have a problem with HashSet because it does not provide any method similar to TryGetValue known from Dictionary. And I need such method -- passing element to find in the set, and set returning element from its collection (when found).
Sidenote -- "why do you need element from the set, you already have that element?". No, I don't, equality and identity are two different things.
HashSet is not sealed, but all its fields are private, so deriving from it is pointless. I cannot use Dictionary instead because I need the SetEquals method. I was thinking about grabbing the source for HashSet and adding the desired method, but the license is not truly open source (I can look, but I cannot distribute/modify). I could use reflection, but the arrays in HashSet are not readonly, meaning I cannot bind to those fields once per instance lifetime.
And I don't want to use full blown library for just single class.
So far I am stuck with LINQ's SingleOrDefault. So the question is how to fix this -- how to have a HashSet with TryGetValue?
You should probably switch from a HashSet to a SortedSet.
A simple TryGetValue() can be written for a SortedSet:
public static class SortedSetExtensions
{
    // Requires: using System.Linq; (for First())
    public static bool TryGetValue<T>(this SortedSet<T> sortedSet, ref T element)
    {
        // GetViewBetween(x, x) is the view of stored elements that compare equal to x.
        var foundSet = sortedSet.GetViewBetween(element, element);
        if (foundSet.Count == 1)
        {
            element = foundSet.First();   // hand back the stored element
            return true;
        }
        return false;
    }
}
When called, the element passed in only needs the properties used by the Comparer to be set. On success, the element found in the set is written back into the ref parameter.
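For example, with a hypothetical Person type whose comparer looks only at Id:

    var people = new SortedSet<Person>(Comparer<Person>.Create((a, b) => a.Id.CompareTo(b.Id)));
    people.Add(new Person { Id = 1, Name = "Alice" });

    var probe = new Person { Id = 1 };   // only Id matters to the comparer
    if (people.TryGetValue(ref probe))
    {
        Console.WriteLine(probe.Name);   // "Alice" -- the stored element, not the probe
    }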
I agree this is something which is basically missing. While it's only useful in rare cases, I think they're significant rare cases - most notably, key canonicalization.
I can only think of one suggestion at the moment, and it's truly foul.
You can specify your own IEqualityComparer<T> when creating a HashSet<T> - so create one which remembers the arguments to the last positive (i.e. true-returning) Equals comparison it has performed. You can then call Contains, and see what the equality comparer was asked to compare. There's a sketch of this after the caveats below.
Caveats:
This holds on to references unnecessarily, so could end up preventing objects being garbage collected
You'd potentially want to do this on a per-thread basis (if you've got a set that isn't modified after initialization, but is then read by multiple threads, for example)
It assumes that HashSet<T> doesn't use any optimization such as "if the references are equal, don't bother consulting the equality comparer"
It's fundamentally a horrible abuse
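A minimal sketch of that trick (all names here are illustrative, not from any library):

    // An IEqualityComparer<T> that remembers the last pair that compared equal.
    sealed class RememberingComparer<T> : IEqualityComparer<T>
    {
        private readonly IEqualityComparer<T> inner;

        public RememberingComparer(IEqualityComparer<T> inner = null)
        {
            this.inner = inner ?? EqualityComparer<T>.Default;
        }

        // Arguments of the last Equals call that returned true. Which of the
        // two was the stored element and which was the probe is an
        // implementation detail of HashSet<T>, so keep both and let the
        // caller sort it out by reference.
        public T LastX { get; private set; }
        public T LastY { get; private set; }

        public bool Equals(T x, T y)
        {
            bool equal = inner.Equals(x, y);
            if (equal)
            {
                LastX = x;
                LastY = y;
            }
            return equal;
        }

        public int GetHashCode(T obj) => inner.GetHashCode(obj);
    }

    // Usage: probe with Contains, then read the stored element back out.
    var comparer = new RememberingComparer<string>(StringComparer.OrdinalIgnoreCase);
    var set = new HashSet<string>(comparer) { "Hello" };
    string probe = "HELLO";
    if (set.Contains(probe))
    {
        // Whichever of LastX/LastY is not the probe is the stored "Hello".
        string stored = ReferenceEquals(comparer.LastX, probe) ? comparer.LastY : comparer.LastX;
    }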
I've been trying to think of other alternatives in terms of finding intersections, but I haven't got anywhere yet...
As noted in comments, it would be worth encapsulating this as far as possible - I suspect you only need a very limited set of operations, so I'd wrap a HashSet<T> in your own class and only expose the operations you really need - that way you get to clear the "cache" after each operation, removing my first objection above.
It still feels like a horrible abuse to me, but...
As others have suggested, an alternative would be to use a Dictionary<TKey, TValue> and implement SetEquals yourself. That would be simple enough to do - and again, you'd want to encapsulate this in your own type. Either way, you should probably design the type itself first, and then implement it using either a HashSet<> or a Dictionary<,> as an implementation detail.
Sounds like you're trying to use the wrong tool. True, you can save some memory using a HashSet, but it seems to me that you are trying to achieve a different goal: get the actual element that is merely equal to a representation.
So in reality they are two different elements. Just the memento (a unique representation) is equal.
Therefore you'd be better off using a Dictionary where you add your elements as both Key and Value. That way you're able to get the identical element back, but you lose SetEquals....
I suppose SetEquals in its implementation does nothing much different than sequentially comparing two HashSets in bucket order, failing on the first non-equality.
So you should be equally well off using a simple SequenceEqual() (LINQ) comparing the two Keys collections.
So this extension method could do the job:
public static class DictionaryExtensions
{
    // Requires: using System.Linq; (for SequenceEqual())
    public static bool SetEqual<T, G>(this IDictionary<T, G> d, IDictionary<T, G> e)
    {
        return d.Keys.SequenceEqual(e.Keys);
    }
}
This should work, because a Dictionary basically is a HashSet with an associated value, and it is more appropriate to your problem. (OK, to be correct, the code should take Dictionary<,> instead of IDictionary<,>, because key order matters.)
If you need to accept an IEnumerable<> as the second parameter, try sorting both key sequences first to get a defined order (not so efficient).
Finally added in .NET 4.7.2:
HashSet.TryGetValue(T, T) Method
An SO post with more details
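For illustration, its signature is bool TryGetValue(T equalValue, out T actualValue), and key canonicalization looks like this:

    var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Hello" };
    if (set.TryGetValue("HELLO", out string actual))
    {
        // 'actual' is the element stored in the set ("Hello"), not the probe ("HELLO").
    }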
Hopefully I'm not blind, but I haven't seen this answer anywhere. If you want Dictionary's TryGetValue, you can just steal it:
theHashset.ToDictionary(item => item.ID).TryGetValue(key, out value)
All you need is a quick lambda for determining unique keys. (Note that ToDictionary rebuilds the dictionary on every call, an O(n) cost, so cache the result if you query more than once.)
Does there exist an implementation for .NET of a list collection such that both insert and lookup are worst-case O(log(n)) operations? The default System.Collections.Generic.List 'Insert' method is an O(n) operation.
By a list collection, I mean an array-like expandable data structure. By 'lookup' I mean access by index.
I suspect this can be done with balanced trees, but would be non-trivial to implement.
I do not know of a .NET implementation, but a data structure that might work for you is an indexable skip list. It has O(log n) performance similar to a balanced binary tree, but is conceptually more like a linked list.
http://en.wikipedia.org/wiki/Skip_list
I don't think it would be too hard to write one in C#.
The C5 TreeSet should give you a red/black implementation with those characteristics, including index access.
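A sketch of what that looks like, assuming (per C5's documentation) that TreeSet<T> implements IIndexed<T> and so exposes an indexer:

    using C5;

    var tree = new TreeSet<int>();
    tree.AddAll(new[] { 5, 1, 9, 3 });  // O(log n) per insert
    int third = tree[2];                // index access in sorted order -> 5, O(log n)
    tree.Add(4);                        // the red-black tree stays balanced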
I don't know if one exists in the .NET Framework, but you could implement an AA tree; insert and search are both O(log n).
There is no ready-made solution for this if you need to access elements by index. You can use a SortedList, but then insertion is O(n); or you can use SortedDictionary, but then you lose the array-like access by index.
HashSet
The C# HashSet data structure was introduced in the .NET Framework 3.5. A full list of the implemented members can be found at the HashSet MSDN page.
Where is it used?
Why would you want to use it?
A HashSet holds a set of objects, but in a way that allows you to easily and quickly determine whether an object is already in the set or not. It does so by internally managing an array and storing the object using an index which is calculated from the hashcode of the object.
HashSet is an unordered collection containing unique elements. It has the standard collection operations Add, Remove, Contains, but since it uses a hash-based implementation, these operations are O(1). (As opposed to List for example, which is O(n) for Contains and Remove.) HashSet also provides standard set operations such as union, intersection, and symmetric difference.
There are different implementations of Sets. Some make insertion and lookup operations super fast by hashing elements. However, that means that the order in which the elements were added is lost. Other implementations preserve the added order at the cost of slower running times.
The HashSet class in C# goes for the first approach, thus not preserving the order of elements. It is much faster than a regular List. Some basic benchmarks showed that HashSet is decently faster when dealing with primitive types (int, double, bool, etc.). It is a lot faster when working with class objects. So the point is that HashSet is fast.
The only catch with HashSet is that there is no access by index. To access elements you can either use an enumerator, or use the built-in function to convert the HashSet into a List and iterate through that.
A HashSet has an internal structure (hash), where items can be searched and identified quickly. The downside is that iterating through a HashSet (or getting an item by index) is rather slow.
So why would someone want to be able to know if an entry already exists in a set?
One situation where a HashSet is useful is in getting distinct values from a list where duplicates may exist. Once an item is added to the HashSet it is quick to determine if the item exists (Contains operator).
Other advantages of the HashSet are the Set operations: IntersectWith, IsSubsetOf, IsSupersetOf, Overlaps, SymmetricExceptWith, UnionWith.
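For example, deduplication plus a few of those set operations in action:

    // Getting distinct values from a list, then applying set operations.
    var numbers = new List<int> { 1, 2, 2, 3, 3, 3 };
    var distinct = new HashSet<int>(numbers);   // { 1, 2, 3 }; duplicates are dropped

    var other = new HashSet<int> { 2, 3, 4 };
    bool overlaps = distinct.Overlaps(other);   // true
    distinct.IntersectWith(other);              // distinct is now { 2, 3 }
    bool subset = distinct.IsSubsetOf(other);   // true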
If you are familiar with the object constraint language then you will identify these set operations. You will also see that it is one step closer to an implementation of executable UML.
Simply said and without revealing the kitchen secrets:
a set, in general, is a collection that contains no duplicate elements, whose elements are in no particular order. So a HashSet<T> is similar to a generic List<T>, but optimized for fast lookups (via a hash table, as the name implies) at the cost of losing order.
From an application perspective, if one only needs to avoid duplicates, then HashSet is what you are looking for, since its Lookup, Insert and Remove complexities are O(1) - constant. What this means is that it does not matter how many elements the HashSet has: it takes the same amount of time to check whether such an element exists or not, and since insertion is O(1) too, it is perfect for this sort of thing.
I have;
var maxVal = l.TakeWhile(x => x < val).Where(x => Matches(x)).Max();
How much space does this need? Does LINQ build up a list of everything matching the Where() condition, or does Max() just iterate through the IEnumerable, keeping track of the current maximum?
And where can I find more info about this, besides asking on SO?
I have verified with Reflector that each of Enumerable.TakeWhile, Enumerable.Where and Enumerable.Max run in constant space. Consequently, this entire query should run in constant space.
Not surprising, considering TakeWhile and Where are speced to use deferred execution + streaming.
Max does not use deferred execution, but only needs to store 'max so far' and the enumerator on the source enumerable.
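For illustration, a streaming Max over ints can be written in constant space like this (a sketch of the idea, not the actual BCL source):

    // Only the enumerator and the running maximum are ever stored: O(1) space.
    public static int MaxStreaming(IEnumerable<int> source)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");
            int max = e.Current;            // 'max so far'
            while (e.MoveNext())
                if (e.Current > max)
                    max = e.Current;
            return max;
        }
    }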
According to Reflector, the Max() method iterates through the enumerable.
And where can I find more info about this, besides asking on SO?
You can use Reflector to look at the implementation of any .NET assembly.
The only thing offered by Enumerable that I've found that doesn't run in constant space is ToList(), for obvious reasons.
With some enumerables this is inefficient: you already have a space complexity above constant (typically O(n), as you are storing the items), and the collection in question offers a mechanism with lower time complexity. If you are creating such a collection yourself, it makes sense to offer your own versions of the extension methods in Enumerable. For example, if you have a collection that is inherently sorted, you should be able to offer Min() and Max() in better than O(n) time (whether it is O(1), O(log n) or something else depends on how that sorting is maintained). Since instance methods hide extension methods (when called on an expression whose compile-time type declares them), callers get the better efficiency with no change to their code.
Reflector is your friend.
In particular, you can take a look at Linq to Objects extension methods in the Enumerable class in System.Linq.
The above use iteration, so they take whatever space the enumerator takes up - usually O(1). Max() is O(1) space.
However, keep in mind that nothing stops a developer from writing an enumerator that takes up more than constant space. E.g. traversing a tree may require O(log n) space. This is the case for SortedDictionary<K,V> and SortedSet<T>, for example.
So it partially depends on what l is in your code.
More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or searching/filtering extension methods) works in objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query complexity at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n), unless list is ordered with respect to 'something', in which case the search should take O(log n): a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it will take O(n) even if the list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?
Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
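For instance, a hypothetical extension along those lines, assuming the list is sorted ascending (FirstAtLeast is an invented name):

    public static class SortedListExtensions
    {
        // Returns the first element >= key; List<T>.BinarySearch does the O(log n)
        // work. When the key is absent, BinarySearch returns the bitwise complement
        // of the index of the next larger element.
        public static T FirstAtLeast<T>(this List<T> list, T key) where T : IComparable<T>
        {
            int index = list.BinarySearch(key);
            if (index < 0)
                index = ~index;
            if (index >= list.Count)
                throw new InvalidOperationException("No element >= key");
            return list[index];
        }
    }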
Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods special case for when the object implementing IEnumerable actually implements e.g. ICollection. (I've seen this for IEnumerable.Contains at least.)
In practice this means that LINQ IEnumerable.Contains calls the fast HashSet.Contains for example if the IEnumerable actually is a HashSet.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet.Contains because HashSet implements ICollection.
if (mySet.Contains(10)) { /* code */ }
You can use Reflector to check exactly how the LINQ methods are defined; that is how I figured this out.
Oh, and also LINQ contains methods IEnumerable.ToDictionary (maps key to single value) and IEnumerable.ToLookup (maps key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
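To illustrate the build-once, query-many pattern (the people collection and its City property are hypothetical):

    // O(n) to build, then each lookup is O(1) instead of a fresh O(n) Where() scan.
    var peopleByCity = people.ToLookup(p => p.City);
    IEnumerable<Person> londoners = peopleByCity["London"];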
Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.