C# DateTime OrderBy when a list contains some identical DateTime values

I have a list of reference types that is sorted by a DateTime property. Some of them have identical DateTime values; for example, multiple baseball games starting at the exact same time.
I want to confirm that OrderBy will sort the same way every time; that is, because I provided the input in the order A->B->C, the output will also be A->B->C (where all the items have an identical DateTime).
I wrote a unit test to confirm that the order is preserved. The test passed, but without really knowing what is going on I still don't feel confident.
Can someone please confirm the OrderBy behavior for me? I tried searching via Google and couldn't find anything definitive.

The formal name for the concept you're asking about is a stable sort.
Knowing this, you can check the documentation for Enumerable.OrderBy and see that it does, indeed, use a stable sorting algorithm. From near the end of the Remarks section:
This method performs a stable sort; that is, if the keys of two elements are equal, the order of the elements is preserved.
Additionally, there was some confusion with LINQ to SQL in the comments on the question. If your data is already in a List<T> object, you are not using LINQ to SQL anymore. However, it's worth noting that LINQ to SQL uses IQueryable.OrderBy rather than Enumerable.OrderBy, and IQueryable.OrderBy does not guarantee a stable sort. You may get a stable sort, but it depends on what the database engine does.
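As a minimal sketch of that guarantee in action (Game and StartTime are hypothetical names invented for this example):
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration.
record Game(string Name, DateTime StartTime);

class Program
{
    static void Main()
    {
        var start = new DateTime(2024, 5, 1, 19, 0, 0);
        var games = new List<Game>
        {
            new("A", start),
            new("B", start),   // same start time as "A"
            new("C", start),   // same start time as "A" and "B"
        };

        // Enumerable.OrderBy is documented as a stable sort, so equal keys
        // keep their relative input order: this prints A, B, C every time.
        foreach (var game in games.OrderBy(g => g.StartTime))
            Console.WriteLine(game.Name);
    }
}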

In short, OrderBy won't change the relative order of elements whose sort keys are identical. So if the collection you called it on is ordered, it will keep that order. If it's a list then you're all good, but other collection types, such as Dictionary<,> or HashSet<T>, do not guarantee order, so I imagine the output order for those could change, since you can't rely on the order of the underlying collection.
Edit: as someone has mentioned in the comments, LINQ to Objects' OrderBy is a stable (deterministic) sort, so the order will be the same every time, and items that compare equal will not have their order changed.

First, make sure you read the excellent answer from Joel.
That said, if you cannot rely on LINQ to Objects' stable sort (with EF, for example), you also have the option of ThenBy:
OrderBy(c => c.MyDate).ThenBy(n => n.MyId)
This way you can apply a secondary ordering for cases where the first key has multiple identical values.


What are we guaranteed regarding side-effects in LINQ predicates?

I just saw this bit of code that has a count++ side-effect in the .GroupBy predicate. (originally here).
object[,] data; // This contains all the data.
int count = 0;
// Flattens the 2-D array, then uses count++ as a makeshift element index
// to group every data.GetLength(1) consecutive cells into one row.
List<string[]> dataList = data.Cast<string>()
    .GroupBy(x => count++ / data.GetLength(1))
    .Select(g => g.ToArray())
    .ToList();
This terrifies me because I have no idea how many times the implementation will invoke the key selector function. And I also don't know if the function is guaranteed to be applied to each item in order. I realize that, in practice, the implementation may very well just call the function once per item in order, but I never assumed that as being guaranteed, so I'm paranoid about depending on that behaviour -- especially given what may happen on other platforms, other future implementations, or after translation or deferred execution by other LINQ providers.
As it pertains to a side-effect in the predicate, are we offered some kind of written guarantee, in terms of a LINQ specification or something, as to how many times the key selector function will be invoked, and in what order?
Please, before you mark this question as a duplicate, I am looking for a citation of documentation or specification that says one way or the other whether this is undefined behaviour or not.
For what it's worth, I would have written this kind of query the long way, by first performing a select query with a predicate that takes an index, then creating an anonymous object that includes the index and the original data, then grouping by that index, and finally selecting the original data out of the anonymous object. That seems more like a correct way of doing functional programming. And it also seems more like something that could be translated to a server-side query. The side-effect in the predicate just seems wrong to me - and against the principles of both LINQ and functional programming, so I would assume there would be no guarantee specified and that this may very well be undefined behaviour. Is it?
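For reference, here is a hedged sketch of that "long way", reusing the data array from the snippet above; the index overload of Select supplies the counter, so no external mutable state is needed:
// Side-effect-free version: Select((value, index) => ...) provides the
// element index directly instead of mutating an outer counter.
int width = data.GetLength(1);
List<string[]> dataList = data.Cast<string>()
    .Select((value, index) => new { value, index })
    .GroupBy(pair => pair.index / width)
    .Select(g => g.Select(pair => pair.value).ToArray())
    .ToList();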
I realize this question may be difficult to answer if the documentation and LINQ specification actually says nothing regarding side effects in predicates. I want to know specifically whether:
Specs say it's permissible and how. (I doubt it)
Specs say it's undefined behaviour (I suspect this is true and am looking for a citation)
Specs say nothing. (Sloppy spec, if you ask me, but it would be nice to know if others have searched for language regarding side-effects and also come up empty. Just because I can't find it doesn't mean it doesn't exist.)
According to the official C# Language Specification, on page 203, we can read (emphasis mine):
12.17.3.1: The C# language does not specify the execution semantics of query expressions. Rather, query expressions are translated into invocations of methods that adhere to the query-expression pattern (§12.17.4). Specifically, query expressions are translated into invocations of methods named Where, Select, SelectMany, Join, GroupJoin, OrderBy, OrderByDescending, ThenBy, ThenByDescending, GroupBy, and Cast. These methods are expected to have particular signatures and return types, as described in §12.17.4. These methods may be instance methods of the object being queried or extension methods that are external to the object. These methods implement the actual execution of the query.
From looking at the source code of GroupBy in corefx on GitHub, it does seem like the key selector function is indeed called once per element, in the order that the source IEnumerable provides them. I would in no way consider this a guarantee, though.
In my view, any IEnumerable that cannot safely be enumerated multiple times is a big red flag that you may want to reconsider your design choices. An interesting issue that could arise from this: if you view the contents of this IEnumerable in the Visual Studio debugger, it will probably break your code, since inspecting it enumerates the sequence and causes the count variable to go up.
The reason this code hasn't exploded so far is probably that the IEnumerable is never stored anywhere; .ToList is called right away, so there is no risk of multiple enumeration (again, with the caveat about viewing it in the debugger and so on).
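To make that hazard concrete, here is a hedged sketch of what happens if the query above is kept as a lazy IEnumerable and enumerated twice:
int count = 0;
var rows = data.Cast<string>()
    .GroupBy(x => count++ / data.GetLength(1))
    .Select(g => g.ToArray());       // no ToList(): execution is deferred

var firstPass = rows.Count();    // enumerates; count advances past the data
var secondPass = rows.Count();   // enumerates again with a shifted counter, so
                                 // the group keys (and possibly the chunk
                                 // boundaries) come out different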

Get original value from HashSet

UPDATE:
Starting with .NET 4.7.2, HashSet<T>.TryGetValue is available (see the docs and the related SO post).
I have a problem with HashSet because it does not provide any method similar to the TryGetValue known from Dictionary. And I need such a method: passing an element to find in the set, and the set returning the element from its collection (when found).
Sidenote -- "why do you need the element from the set, you already have that element?". No, I don't: equality and identity are two different things.
HashSet is not sealed, but all its fields are private, so deriving from it is pointless. I cannot use Dictionary instead because I need the SetEquals method. I was thinking about grabbing the source for HashSet and adding the desired method, but the license is not truly open source (I can look, but I cannot distribute/modify). I could use reflection, but the arrays in HashSet are not readonly, meaning they are replaced on resize, so I cannot bind to those fields once per instance lifetime.
And I don't want to pull in a full-blown library for just a single class.
So far I am stuck with LINQ's SingleOrDefault. So the question is how to fix this -- how to have a HashSet with TryGetValue?
Probably you should switch from a HashSet to a SortedSet.
A simple TryGetValue() can then be written for the SortedSet:
public static bool TryGetValue<T>(this SortedSet<T> sortedSet, ref T element)
{
    // GetViewBetween(element, element) isolates the stored element that
    // compares equal to the probe, according to the set's comparer.
    var foundSet = sortedSet.GetViewBetween(element, element);
    if (foundSet.Count == 1)
    {
        element = foundSet.First();
        return true;
    }
    return false;
}
When calling it, the element passed in only needs the properties that the set's comparer uses to be populated. The method hands back, via the ref parameter, the element actually found in the set.
I agree this is something which is basically missing. While it's only useful in rare cases, I think they're significant rare cases - most notably, key canonicalization.
I can only think of one suggestion at the moment, and it's truly foul.
You can specify your own IEqualityComparer<T> when creating a HashSet<T> - so create one which remembers the arguments to the last positive (i.e. true-returning) Equals comparison it has performed. You can then call Contains, and see what the equality comparer was asked to compare.
Caveats:
This holds on to references unnecessarily, so could end up preventing objects being garbage collected
You'd potentially want to do this on a per-thread basis (if you've got a set that isn't modified after initialization, but is then read by multiple threads, for example)
It assumes that HashSet<T> doesn't use any optimization such as "if the references are equal, don't bother consulting the equality comparer"
It's fundamentally a horrible abuse
I've been trying to think of other alternatives in terms of finding intersections, but I haven't got anywhere yet...
As noted in comments, it would be worth encapsulating this as far as possible - I suspect you only need a very limited set of operations, so I'd wrap a HashSet<T> in your own class and only expose the operations you really need - that way you get to clear the "cache" after each operation, removing my first objection above.
It still feels like a horrible abuse to me, but...
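For illustration only, a minimal sketch of that trick, with all the caveats above still applying (CachingComparer and MyKey are hypothetical names, and which argument of Equals is the stored element is an implementation detail you would have to verify):
using System.Collections.Generic;

// Hypothetical wrapper comparer that remembers the last positive comparison.
sealed class CachingComparer<T> : IEqualityComparer<T>
{
    private readonly IEqualityComparer<T> inner;

    public CachingComparer(IEqualityComparer<T> inner = null)
    {
        this.inner = inner ?? EqualityComparer<T>.Default;
    }

    // The element seen in the last Equals call that returned true.
    public T LastMatch { get; private set; }

    public bool Equals(T x, T y)
    {
        bool equal = inner.Equals(x, y);
        if (equal) LastMatch = x;   // assumption: which side is the stored
                                    // element is an implementation detail
        return equal;
    }

    public int GetHashCode(T obj) => inner.GetHashCode(obj);
}

// Usage sketch:
// var comparer = new CachingComparer<MyKey>();
// var set = new HashSet<MyKey>(comparer);
// ...
// if (set.Contains(probe)) { var stored = comparer.LastMatch; }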
As others have suggested, an alternative would be to use a Dictionary<TKey, TValue> and implement SetEquals yourself. That would be simple enough to do - and again, you'd want to encapsulate this in your own type. Either way, you should probably design the type itself first, and then implement it using either a HashSet<> or a Dictionary<,> as an implementation detail.
Sounds like you're trying to use the wrong tool. True, you can save some memory using a HashSet, but it seems to me that you are trying to achieve a different goal: get the actual element that is merely equal to a representation.
So in reality they are two different elements; just the memento (a unique representation) is equal.
Therefore you'd be better off using a Dictionary where you add your elements as both key and value. That way you're able to get back the (identical) stored instance, but you miss your SetEquals...
I suppose SetEquals in its implementation does nothing much different than sequentially comparing two HashSets in bucket order, failing on the first non-equality.
So you should be equally well off using a simple SequenceEqual() (LINQ) comparing the two Keys collections.
So this extension method could do it:
public static bool SetEqual<T, G>(this IDictionary<T, G> d, IDictionary<T, G> e)
{
    return d.Keys.SequenceEqual(e.Keys);
}
This should work, because a Dictionary basically is a HashSet with an associated value, and it is more appropriate to your problem. (OK, to be correct, the code should take Dictionary<,> instead of IDictionary<,>, because key order matters.)
If you need an IEnumerable<> as the second parameter, try sorting both key sequences first to get a defined order (not so efficient).
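As an aside, if matching key order cannot be guaranteed, a hedged order-independent sketch sidesteps the problem entirely (SetEqualUnordered is a hypothetical name):
using System.Collections.Generic;
using System.Linq;

public static class DictionarySetExtensions
{
    // Equal counts plus containment of every key is equivalent to
    // set equality of the two key sets, regardless of bucket order.
    public static bool SetEqualUnordered<T, G>(this IDictionary<T, G> d, IDictionary<T, G> e)
    {
        return d.Count == e.Count && d.Keys.All(e.ContainsKey);
    }
}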
Finally added in .NET 4.7.2:
HashSet.TryGetValue(T, T) Method
An SO post with more details
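Usage sketch, assuming probe compares equal (per the set's comparer) to a stored element:
// Returns true and hands back the stored instance, not the probe.
if (set.TryGetValue(probe, out var actual))
{
    // 'actual' is the element the HashSet actually holds.
}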
Hopefully I'm not blind, but I haven't seen this answer anywhere: if you want Dictionary's TryGetValue, you can just steal it.
theHashset.ToDictionary(item => item.ID).TryGetValue(key, out value)
All you need is a quick lambda for determining unique keys.

Why use a List over a HashSet?

Maybe an obvious question, but I've seen plenty of reasons to use a HashSet over a list/array.
I've heard it has O(1) for removing and searching for data.
I've never heard why to use a list over a HashSet.
So why vice-versa?
A list allows duplicates, a HashSet does not
A List is ordered by its index, a HashSet has no implicit order
Performance is often overrated, choose the right tool for the job
They have different semantics. A list is ordered (by insert order), allows duplicates, and offers random-access by index; a hash-set is unordered, does not allow duplicates (removes them, by design), and does not offer random-access. Both are perfectly valid, simply: for different scenarios.
Well for one, you can insert duplicates into a List/Array.
From HashSet.Add Method
Return Value: System.Boolean -- true if the element is added to the HashSet object; false if the element is already present.
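A quick illustration of the difference:
var list = new List<int>();
var set = new HashSet<int>();

list.Add(1);
list.Add(1);              // the list now holds two 1s

bool first = set.Add(1);  // true: 1 was added
bool second = set.Add(1); // false: already present; the set still holds one 1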
I'm very late to the party, but I want to add something to the chosen answer by turning the question on its head:
When to use HashSet over List?
When your objects in the Set will certainly be unique or you want to never add duplicates.
You often want to look up if certain objects are in the Set.
If you often remove objects. Adding takes about the same time as adding objects into a List, but removing them is much, much faster in a HashSet.
If you don't care about order or direct access via an index.
Keep in mind, you can "foreach" through a HashSet as well.
Rango is right that performance is sometimes overvalued; but if performance is critical (and order unimportant), HashSets can be faster than Lists by orders of magnitude.

Intersection of two sets in most optimized way

Given two sets of values, I have to find whether there is any common element among them, i.e. whether their intersection is empty or not.
Which standard C# collection will suit this purpose best (in terms of performance)? I know that LINQ has an Intersect extension method to find the intersection of two lists/arrays, but my focus is on performance in terms of Big-O notation.
And what if I have to find out the intersection of two sets as well?
Well, if you use LINQ's Intersect method it will build up a HashSet of the second sequence, and then check each element of the first sequence against it. So it's O(M+N)... and you can use foo.Intersect(bar).Any() to get an early-out.
Of course, if you store one (either) set in a HashSet<T> to start with, you can just iterate over the other one checking for containment on each step. You'd still need to build the set to start with though.
Fundamentally you've got an O(M+N) problem whatever you do - you're not going to get cheaper than that (there's always the possibility that you'll have to look at every element) and if your hash codes are reasonable, you should be able to achieve that complexity easily. Of course, some solutions may give better constant factors than others... but that's performance rather than complexity ;)
EDIT: As noted in the comments, there's also ISet<T>.Overlaps - if you've already got either set with a static type of ISet<T> or a concrete implementation, calling Overlaps makes it clearer what you're doing. If both of your sets are statically typed as ISet<T>, use larger.Overlaps(smaller) (where larger and smaller are in terms of the size of the set) as I'd expect an implementation of Overlaps to iterate over the argument and check each element against contents of the set you call it on.
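A small sketch of both approaches:
var foo = new[] { 1, 2, 3, 4 };
var bar = new[] { 9, 3, 7 };

// LINQ: Intersect builds a HashSet from bar, then streams foo;
// Any() stops at the first common element it finds.
bool anyCommon = foo.Intersect(bar).Any();   // true (3 is in both)

// ISet<T>: if one side is already a set, Overlaps says what you mean.
var set = new HashSet<int>(foo);
bool overlaps = set.Overlaps(bar);           // true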
As mentioned, applying Any() will give you some performance gain.
I tested it on a pretty big dataset and it gave a 25% improvement.
Also, applying larger.Intersect(smaller) rather than the opposite is very important; in my case, it gave a 35% improvement.
Also, ordering the list before applying Intersect gave another 7-8%.
Another thing to keep in mind is that, depending on the use case, you can avoid calling Intersect entirely.
For example, for an integer list, if the [minimum, maximum] ranges of the two lists do not overlap, they can never intersect, so you don't need to call Intersect at all.
The same goes for a string list, with the same idea applied to the first letter.
Again, depending on your case, try as much as possible to find a rule that rules out any intersection, so you can avoid calling it.

In-memory LINQ performance

More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or the searching/filtering extension methods) works on objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n), unless list is ordered with respect to something, in which case the search should take O(log n): it should be a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it should take O(n), even if the list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?
Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
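As a sketch of such a method, assuming the list is already sorted ascending by the key, a classic lower-bound binary search answers the FirstOrDefault(o => o.something > n) query from the question in O(log n) (FirstGreaterThan is a hypothetical name):
using System;
using System.Collections.Generic;

public static class SortedListSearch
{
    // Hypothetical helper: assumes 'list' is sorted ascending by keySelector.
    // Finds the first element whose key exceeds 'threshold' in O(log n);
    // returns default(T) if no such element exists (like FirstOrDefault).
    public static T FirstGreaterThan<T, TKey>(
        IReadOnlyList<T> list, Func<T, TKey> keySelector, TKey threshold)
        where TKey : IComparable<TKey>
    {
        int lo = 0, hi = list.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (keySelector(list[mid]).CompareTo(threshold) > 0)
                hi = mid;      // mid satisfies the predicate; look left
            else
                lo = mid + 1;  // mid doesn't; look right
        }
        return lo < list.Count ? list[lo] : default;
    }
}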
Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods special-case the situation where the object implementing IEnumerable actually implements, e.g., ICollection. (I've seen this for IEnumerable.Contains at least.)
In practice this means that LINQ's IEnumerable.Contains calls the fast HashSet.Contains if, for example, the IEnumerable actually is a HashSet.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet.Contains because HashSet implements ICollection.
if (mySet.Contains(10)) { /* code */ }
You can use Reflector to check exactly how the LINQ methods are defined; that is how I figured this out.
Oh, and LINQ also contains the methods IEnumerable.ToDictionary (maps a key to a single value) and IEnumerable.ToLookup (maps a key to multiple values). The dictionary/lookup can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
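For example, a hedged sketch (Order, CustomerId, orders, and Process are hypothetical names): build the lookup once, then every per-key query is a cheap hash probe instead of a full scan:
// Build once: O(n).
ILookup<int, Order> ordersByCustomer = orders.ToLookup(o => o.CustomerId);

// Query many times: each indexer access is roughly a hash probe plus the
// group's elements, instead of re-scanning 'orders' with Where(...) each time.
foreach (Order order in ordersByCustomer[42])
    Process(order);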
Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.
