Linq on DateTime Collections Best Performance - c#

My app has a Collection of Job objects. There is a unique property (Key) called Jobnumber and a DateTime property called Bookingtime which is not necessarily unique. There are various other properties also.
I want to do a lot of linq queries based on the Bookingtime property and occasional inserts and removal of objects from the collection.
If I have 1000 to 2000 objects in a collection should I use a SortedList<TKey, TValue> or just a List<T> and order it manually with linq?
Do the options change for 10,000 objects in the collection?
The objects are from database and are already sorted by Bookingtime but I need to work on certain datetime subsets.
DateTime t1, t2; // set to some values
var subSet = jobs.Where(a => a.Bookingtime >= t1 &&
                             a.Bookingtime <= t2).ToList();

As the documentation shows, a single SortedList.Add is O(n) for unsorted data (too bad there is no AddAll method as in Java that could optimize bulk insertion), so building the list from n unsorted items costs O(n^2) in the worst case, whereas OrderBy sorts in O(n log n). The implication is that SortedList can outperform LINQ only on small lists, or when the items arrive already in order, in which case each Add appends at the end in O(log n).
Furthermore, notice that LINQ uses lazy evaluation. It sorts the items only when you actually enumerate the result (for example with foreach or a ToList call). If you never do anything with the result, the data is never sorted.
This article even implements a truly lazy OrderBy, such that if you only need the first i items, it will not sort the entire list.
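Here is a minimal sketch of the deferred sorting described above. The Job class only mirrors the two properties named in the question, and CountingComparer and DeferredSortDemo are hypothetical names introduced purely to observe when the comparisons happen:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's Job type.
public sealed class Job
{
    public int Jobnumber { get; set; }
    public DateTime Bookingtime { get; set; }
}

// Illustrative comparer that counts how often two keys are compared.
public sealed class CountingComparer : IComparer<DateTime>
{
    public int Calls { get; private set; }

    public int Compare(DateTime x, DateTime y)
    {
        Calls++;
        return x.CompareTo(y);
    }
}

public static class DeferredSortDemo
{
    // Returns the comparer call counts before and after materializing the query.
    public static (int Before, int After) Run()
    {
        var start = new DateTime(2024, 1, 1);
        var jobs = Enumerable.Range(0, 100)
            .Select(i => new Job { Jobnumber = i, Bookingtime = start.AddMinutes((i * 37) % 100) })
            .ToList();

        var comparer = new CountingComparer();

        // Building the query does no sorting work yet.
        IEnumerable<Job> ordered = jobs.OrderBy(j => j.Bookingtime, comparer);
        int before = comparer.Calls;

        // ToList enumerates the query, which is when the sort actually runs.
        _ = ordered.ToList();
        int after = comparer.Calls;

        return (before, after);
    }
}
```

Calling DeferredSortDemo.Run() should report zero comparisons before ToList and a positive number afterwards.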
EDIT: Based on your updated question, you are better off incorporating the .Where filter as a WHERE clause in the SQL query. This can reduce network, memory and CPU usage, since a database in many cases has means to optimize queries enormously.

Related

Fastest way to check whether a single element in common exists between two enumerables

I have a method I'm writing where I want to be able to filter orders based on whether they have one or more ordered products in them that exist in the selection of products made by the user. Currently I'm doing this with:
SelectedProductIDs.Intersect(orderProductIDs).Any()
executed on each order (~20,000 orders total in the database and expected to grow quickly), where both SelectedProductIDs and orderProductIDs are string[]. I've also attempted to use pre-generated HashSets for both SelectedProductIDs and orderProductIDs, but this made no appreciable difference in the speed of comparison.
However, both of these are unpleasantly slow - ~300ms per selection change - particularly given that the dates made available to the sliders within the UI are predicated entirely on the results of this query, so user interaction has to halt in some fashion. Is there a (very) significantly faster way to do this?
Edit: May not have been clear enough - order objects are materialized from SQL data at launch-time and these queries are performed later, in a secondary window of the overall application. SQL is irrelevant to the specifics of this question; this is a LINQ-to-Objects question.
The LINQ intersect is going to reconstruct a new HashSet based on the input value no matter what you do, even if the input is already a HashSet. Its implementation mutates the hash set internally (which is how it avoids yielding duplicate values) so it is important to make a copy of the input sequence, even if it's already a HashSet.
You can create your own Intersect method that accepts a hashset, instead of populating a new one. To avoid mutating it though, you'll have to settle for a bag-based Intersect, rather than a set based Intersect (i.e., duplicates in the sequence will all be yielded). Clearly that's not a problem in your case:
public static IEnumerable<T> IntersectAll<T>(
    this HashSet<T> set, IEnumerable<T> sequence)
{
    foreach (var item in sequence)
        if (set.Contains(item))
            yield return item;
}
Now you can write:
SelectedProductIDs.IntersectAll(orderProductIDs).Any();
And the hashset won't need to be re-constructed each time.
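If only the yes/no answer is needed, the built-in HashSet<T>.Overlaps method does the same membership probing against the existing set and returns as soon as one common element is found. This is a swapped-in alternative to the IntersectAll method above, not a replacement for it; OrderFilter and its method and parameter names are assumptions made for the sketch:

```csharp
using System.Collections.Generic;

public static class OrderFilter
{
    // Build the selection set once per selection change, not once per order.
    public static HashSet<string> BuildSelection(IEnumerable<string> selectedProductIds)
    {
        return new HashSet<string>(selectedProductIds);
    }

    // True if the order contains at least one selected product.
    // Overlaps probes the existing set for each element of orderProductIds
    // and neither copies nor mutates the set.
    public static bool HasSelectedProduct(HashSet<string> selection, string[] orderProductIds)
    {
        return selection.Overlaps(orderProductIds);
    }
}
```

The set would be built once when the selection changes and then reused for every order, so the per-order cost is proportional to the number of products in that order.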
It sounds like you are reading all the values from the database into memory and then querying. If you instead use LINQ to EF, it will translate the LINQ query into a SQL query that gets run on the database, which could be significantly faster.

LINQ to SQL table class performance

I can't find the performance characteristics for the System.Data.Linq.Table<TEntity> methods! I refer to methods like InsertOnSubmit and DeleteOnSubmit. Are these methods O(1) or O(n)?
InsertOnSubmit and DeleteOnSubmit each take a single object, so their performance should be O(1): all they do is append to the insertion queue, which is either O(1) or amortized O(1) for all unordered containers.
InsertAllOnSubmit and DeleteAllOnSubmit, on the other hand, are O(N), where N is the length of the IEnumerable passed into the method.
I assume you mean O(n) in terms of the size of the underlying table, and you're talking about once they get submitted, not just calling the function (which is O(1) as mentioned). I haven't seen any of the implementation of LINQ, but just from the experience with it and my understanding of SQL, the insert method should be O(n) in terms of the existing table, and O(n) in terms of how many submissions there are.
Since the whole thing is submitted at once, I'm assuming it's a transaction, or union of the insert statements, meaning only the first insert suffers the O(n), and the rest of them are just O(1).
I don't think there's a way to make a delete statement happen quicker than O(n), so I'm assuming that's how long it takes.
Of course, since LINQ just translates to SQL and leaves the actual implementation to the database server, a lot of this is up to the database server.

Why are ToLookup and GroupBy different?

.ToLookup<TSource, TKey> returns an ILookup<TKey, TSource>. ILookup<TKey, TSource> also implements interface IEnumerable<IGrouping<TKey, TSource>>.
.GroupBy<TSource, TKey> returns an IEnumerable<IGrouping<TKey, TSource>>.
ILookup has the handy indexer property, so it can be used in a dictionary-like (or lookup-like) manner, whereas GroupBy can't. GroupBy without the indexer is a pain to work with; pretty much the only way you can then reference the return object is by looping through it (or using another LINQ extension method). In other words, in any case where GroupBy works, ToLookup will work as well.
All this leaves me with the question why would I ever bother with GroupBy? Why should it exist?
why would I ever bother with GroupBy? Why should it exist?
What happens when you call ToLookup on an object representing a remote database table with a billion rows in it?
The billion rows are sent over the wire, and you build the lookup table locally.
What happens when you call GroupBy on such an object?
A query object is built; end of story.
When that query object is enumerated then the analysis of the table is done on the database server and the grouped results are sent back on demand a few at a time.
Logically they are the same thing, but the performance implications of each are completely different. Calling ToLookup means "I want a cache of the entire thing right now, organized by group." Calling GroupBy means "I am building an object to represent the question 'what would these things look like if I organized them by group?'"
In simple LINQ-world words:
ToLookup() - immediate execution
GroupBy() - deferred execution
The two are similar, but are used in different scenarios. .ToLookup() returns a ready-to-use object that already has all the groups and their contents eagerly loaded. On the other hand, .GroupBy() returns a lazily loaded sequence of groups.
Different LINQ providers may have different behaviors for the eager and lazy loading of the groups. With LINQ-to-Objects it probably makes little difference, but with LINQ-to-SQL (or LINQ-to-EF, etc.), the grouping operation is performed on the database server rather than the client, so you may want to add a filter on the group key (which generates a HAVING clause) and then fetch only some of the groups instead of all of them. .ToLookup() wouldn't allow for such semantics since all items are eagerly grouped.
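The immediate versus deferred difference can be observed even in LINQ-to-Objects. The following is a small sketch, with GroupingDemo and the sample words being made-up names and data: the source list is modified after both calls, and only the GroupBy query sees the change.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GroupingDemo
{
    // Returns how many items each approach reports for key 'a'
    // after the source list has been modified.
    public static (int FromGroupBy, int FromLookup) Run()
    {
        var words = new List<string> { "apple", "avocado", "banana" };

        // Deferred: only a query object is created here.
        IEnumerable<IGrouping<char, string>> groups = words.GroupBy(w => w[0]);

        // Immediate: the source is enumerated and grouped right now.
        ILookup<char, string> lookup = words.ToLookup(w => w[0]);

        words.Add("apricot");

        // The GroupBy query runs now and sees the new item; the lookup is a snapshot.
        int fromGroupBy = groups.Single(g => g.Key == 'a').Count();
        int fromLookup = lookup['a'].Count();

        return (fromGroupBy, fromLookup);
    }
}
```

GroupingDemo.Run() returns (3, 2): the deferred query counts "apricot", while the lookup built earlier does not.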

Conversion of an IEnumerable to a dictionary for performance?

I have recently seen a new trend in my firm where we change the IEnumerable to a dictionary by a simple LINQ transformation as follows:
enumerable.ToDictionary(x=>x);
We mostly end up doing this when the operation on the collection is a Contains/Access and obviously a dictionary has a better performance in such cases.
But I realise that converting the Enumerable to a dictionary has its own cost and I am wondering at what point does it start to break-even (if it does) i.e the performance of IEnumerable Contains/Access is equal to ToDictionary + access/contains.
OK, I might add that there is no database access involved: the enumerable might be created from a database query and that's it, and the enumerable may be edited after that too.
Also it would be interesting to know how does the datatype of the key affect the performance?
The lookup might generally happen 2-5 times, but sometimes only once. But I have seen things like:
For an enumerable:
var element = enumerable.SingleOrDefault(x => x.Id == id);
// do something if element is null, or return
For a dictionary:
if (dictionary.ContainsKey(id))
// do something if false, else return
This has been bugging me for quite some time now.
Performance of Dictionary Compared to IEnumerable
A Dictionary, when used correctly, is always faster to read from (except in cases where the data set is very small, e.g. 10 items). There can be overhead when creating it.
Given m as the amount of lookups performed against the same object (these are approximate):
Performance of an IEnumerable (created from a clean list): O(mn)
This is because you need to look at all the items each time (essentially m * O(n)).
Performance of a Dictionary: O(n) + m * O(1), or O(m + n)
This is because you need to insert the items first (O(n)), after which each of the m lookups is O(1).
In general it can be seen that the Dictionary wins when m > 1, and the IEnumerable wins when m = 1 or m = 0.
In general you should:
Use a Dictionary when doing the lookup more than once against the same dataset.
Use an IEnumerable when doing the lookup once.
Use an IEnumerable when the data-set could be too large to fit into memory.
Keep in mind a SQL table can be used like a Dictionary, so you could use that to offset the memory pressure.
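As a rough sketch of the two cases above, assuming a hypothetical Item type with a unique Id (ToDictionary throws if two items share a key), the one-off lookup and the repeated lookup could look like this:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type; Id is assumed to be unique.
public sealed class Item
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public static class LookupDemo
{
    // One lookup: a linear scan avoids the O(n) cost of building an index.
    public static Item FindOnce(IEnumerable<Item> items, int id)
    {
        return items.FirstOrDefault(x => x.Id == id);
    }

    // Many lookups: pay O(n) once to build the dictionary, then O(1) per id.
    public static List<Item> FindMany(IEnumerable<Item> items, IEnumerable<int> ids)
    {
        Dictionary<int, Item> byId = items.ToDictionary(x => x.Id);
        var found = new List<Item>();
        foreach (int id in ids)
        {
            // TryGetValue stays O(1) even when the id is missing.
            if (byId.TryGetValue(id, out Item item))
                found.Add(item);
        }
        return found;
    }
}
```

Where the real break-even point lies depends on n, the key type and the data, so it is still worth measuring with representative input.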
Further Considerations
Dictionaries use GetHashCode() to organise their internal state. The performance of a Dictionary is strongly related to the hash code in two ways.
Poorly performing GetHashCode() - results in overhead every time an item is added, looked up, or deleted.
Low quality hash codes - results in the dictionary not having O(1) lookup performance.
Most built-in .NET types (especially the value types) have very good hashing algorithms. However, with list-like types (e.g. string) GetHashCode() has O(n) performance, because it needs to iterate over the whole string. Thus your dictionary's performance can really be seen as O(1) + M, where M is the big-O of an efficient GetHashCode().
It depends....
How long is the IEnumerable?
Does accessing the IEnumerable cause database access?
How often is it accessed?
The best thing to do would be to experiment and profile.
If you search for elements in your collection by some key very often, the Dictionary will definitely be faster, because it is a hash-based collection and searching it is many times faster. Otherwise, if you don't search the collection much, the conversion is not necessary, because the time for the conversion may be greater than that of your one or two searches in the collection.
IMHO: you need to measure this in your environment with representative data. In such cases I just write a quick console app that measures the time of the code execution. To get a better measurement you need to execute the same code multiple times.
ADD:
It also depends on the application you develop. Usually you gain more by spending that time and effort optimizing other places (avoiding network round trips, caching, etc.).
I'll add that you haven't told us what happens every time you "rewind" your IEnumerable<>. Is it directly backed by a data collection (for example a List<>), or is it calculated "on the fly"? If it's the first, then for small collections enumerating them to find the wanted element is faster (a Dictionary for 3 or 4 elements is useless. If you want I can build some benchmark to find the breaking point). If it's the second, then you have to consider whether "caching" the IEnumerable<> in a collection is a good idea. If it is, then you can choose between a List<> and a Dictionary<>, and we return to point 1: is the IEnumerable small or big? And there is a third problem: if the collection isn't backed by anything in memory and is too big for memory, then clearly you can't put it in a Dictionary<>. Then perhaps it's time to make the SQL work for you :-)
I'll add that "failures" have their cost: in a List<> if you try to find an element that doesn't exist, the cost is O(n), while in a Dictionary<> the cost is still O(1).

In-memory LINQ performance

More than about LINQ to [insert your favorite provider here], this question is about searching or filtering in-memory collections.
I know LINQ (or searching/filtering extension methods) works in objects implementing IEnumerable or IEnumerable<T>. The question is: because of the nature of enumeration, is every query complexity at least O(n)?
For example:
var result = list.FirstOrDefault(o => o.something > n);
In this case, every algorithm will take at least O(n) unless list is ordered with respect to 'something', in which case the search should take O(log(n)): it should be a binary search. However, if I understand correctly, this query will be resolved through enumeration, so it should take O(n), even if the list was previously ordered.
Is there something I can do to solve a query in O(log(n))?
If I want performance, should I use Array.Sort and Array.BinarySearch?
Even with parallelisation, it's still O(n). The constant factor would be different (depending on your number of cores) but as n varied the total time would still vary linearly.
Of course, you could write your own implementations of the various LINQ operators over your own data types, but they'd only be appropriate in very specific situations - you'd have to know for sure that the predicate only operated on the optimised aspects of the data. For instance, if you've got a list of people that's ordered by age, it's not going to help you with a query which tries to find someone with a particular name :)
To examine the predicate, you'd have to use expression trees instead of delegates, and life would become a lot harder.
I suspect I'd normally add new methods which make it obvious that you're using the indexed/ordered/whatever nature of the data type, and which will always work appropriately. You couldn't easily invoke those extra methods from query expressions, of course, but you can still use LINQ with dot notation.
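One possible shape for such an extra method is sketched below, assuming the list is already sorted ascending by the key. The class and method names (SortedListExtensions, IndexOfFirstGreaterThan) are made up for illustration, and the precondition is not checked:

```csharp
using System;
using System.Collections.Generic;

public static class SortedListExtensions
{
    // Index of the first element whose key is strictly greater than target,
    // or list.Count if there is none.
    // Precondition (not verified): list is sorted ascending by keySelector.
    public static int IndexOfFirstGreaterThan<T, TKey>(
        this IReadOnlyList<T> list, TKey target, Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        int lo = 0, hi = list.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (keySelector(list[mid]).CompareTo(target) <= 0)
                lo = mid + 1;   // key <= target: the answer is to the right
            else
                hi = mid;       // key > target: mid is a candidate
        }
        return lo;
    }
}
```

For the query in the question, the O(log n) counterpart of FirstOrDefault(o => o.something > n) would be to call list.IndexOfFirstGreaterThan(n, o => o.something) and take the element at that index when the index is less than list.Count. The method name makes the reliance on ordering explicit, which is the point made above.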
Yes, the generic case is always O(n), as Sklivvz said.
However, many LINQ methods have a special case for when the object implementing IEnumerable<T> also implements e.g. ICollection<T>. (I've seen this for Enumerable.Contains at least.)
In practice this means that the LINQ Contains extension method calls the fast HashSet<T>.Contains if, for example, the IEnumerable<T> actually is a HashSet<T>.
IEnumerable<int> mySet = new HashSet<int>();
// calls the fast HashSet<int>.Contains because HashSet<int> implements ICollection<int>.
if (mySet.Contains(10)) { /* code */ }
You can use reflector to check exactly how the LINQ methods are defined, that is how I figured this out.
Oh, and LINQ also contains the methods Enumerable.ToDictionary (maps a key to a single value) and Enumerable.ToLookup (maps a key to multiple values). This dictionary/lookup table can be created once and used many times, which can speed up some LINQ-intensive code by orders of magnitude.
Yes, it has to be, because the only way of accessing any member of an IEnumerable is by using its methods, which means O(n).
It seems like a classic case in which the language designers decided to trade performance for generality.
