I am trying to get to grips with LINQ. The thing that bothers me most is that even as I understand the syntax better, I don't want to unwittingly sacrifice performance for expressiveness.
Are there any good centralized repositories of information or books for 'Effective LINQ'? Failing that, what is your own personal favourite high-performance LINQ technique?
I am primarily concerned with LINQ to Objects, but all suggestions on LINQ to SQL and LINQ to XML also welcome of course. Thanks.
Linq, as a built-in technology, has performance advantages and disadvantages. The code behind the extension methods has had considerable performance attention paid to it by the .NET team, and its ability to provide lazy evaluation means that the cost of performing most manipulations on a set of objects is spread across the larger algorithm requiring the manipulated set. However, there are some things you need to know that can make or break your code's performance.
First and foremost, Linq doesn't magically save your program the time or memory needed to perform an operation; it just may delay those operations until absolutely needed. OrderBy() performs a quicksort, which will take O(n log n) time just the same as if you'd written your own QuickSorter or used List.Sort() at the right time. So, always be mindful of what you're asking Linq to do to a series when writing queries; if a manipulation is not necessary, look to restructure the query or method chain to avoid it.
By the same token, certain operations (sorting, grouping, aggregates) require knowledge of the entire set they are acting upon. The very last element in a series could be the first one the operation must return from its iterator. On top of that, because Linq operations should not alter their source enumerable, but many of the algorithms they use will (i.e. in-place sorts), these operations end up not only evaluating, but copying the entire enumerable into a concrete, finite structure, performing the operation, and yielding through it. So, when you use OrderBy() in a statement, and you ask for an element from the end result, EVERYTHING that the IEnumerable given to it can produce is evaluated, stored in memory as an array, sorted, then returned one element at a time. The moral is, any operation that needs a finite set instead of an enumerable should be placed as late in the query as possible, allowing for other operations like Where() and Select() to reduce the cardinality and memory footprint of the source set.
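As a minimal sketch of that placement rule (the 'orders' sequence and its Amount/IsOpen members are invented for illustration):

// Costly: OrderBy must buffer and sort the entire source before Where filters anything.
var slow = orders.OrderBy(o => o.Amount).Where(o => o.IsOpen);

// Better: Where streams lazily, so OrderBy only has to buffer and sort the open orders.
var fast = orders.Where(o => o.IsOpen).OrderBy(o => o.Amount);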
Lastly, Linq methods drastically increase the call stack size and memory footprint of your system. Each operation that must know of the entire set keeps the entire source set in memory until the last element has been iterated, and the evaluation of each element will involve a call stack at least twice as deep as the number of methods in your chain or clauses in your inline statement (a call to each iterator's MoveNext() or yielding GetEnumerator, plus at least one call to each lambda along the way). This is simply going to result in a larger, slower algorithm than an intelligently-engineered inline algorithm that performs the same manipulations. Linq's main advantage is code simplicity. Creating, then sorting, a dictionary of lists of grouped values is not very easy-to-understand code (trust me). Micro-optimizations can obfuscate it further. If performance is your primary concern, then don't use Linq; it will add approximately 10% time overhead and several times the memory overhead of manipulating a list in-place yourself. However, maintainability is usually the primary concern of developers, and Linq DEFINITELY helps there.
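As a rough illustration of that readability trade-off, here is a sketch (Person, City and Name are invented names) of the same grouping done with LINQ and by hand:

// LINQ version: group people by city, each group sorted by name. Easy to read.
var grouped = people
    .GroupBy(p => p.City)
    .Select(g => new { City = g.Key, Members = g.OrderBy(p => p.Name).ToList() });

// Hand-rolled equivalent: leaner at runtime, but considerably more code to follow.
var byCity = new Dictionary<string, List<Person>>();
foreach (var p in people)
{
    List<Person> list;
    if (!byCity.TryGetValue(p.City, out list))
    {
        list = new List<Person>();
        byCity[p.City] = list;
    }
    list.Add(p);
}
foreach (var list in byCity.Values)
{
    list.Sort((a, b) => string.Compare(a.Name, b.Name, StringComparison.Ordinal));
}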
On the performance kick: If performance of your algorithm is the sacred, uncompromisable first priority, you'd be programming in an unmanaged language like C++; .NET is going to be much slower just by virtue of it being a managed runtime environment, with JIT native compilation, managed memory and extra system threads. I would adopt a philosophy of it being "good enough"; Linq may introduce slowdowns by its nature, but if you can't tell the difference, and your client can't tell the difference, then for all practical purposes there is no difference. "Premature optimization is the root of all evil"; Make it work, THEN look for opportunities to make it more performant, until you and your client agree it's good enough. It could always be "better", but unless you want to be hand-packing machine code, you'll find a point short of that at which you can declare victory and move on.
Simply understanding what LINQ is doing internally should yield enough information to know whether you are taking a performance hit.
Here is a simple example where LINQ helps performance. Consider this typical old-school approach:
List<Foo> foos = GetSomeFoos();
List<Foo> filteredFoos = new List<Foo>();
foreach (Foo foo in foos)
{
    if (foo.SomeProperty == "somevalue")
    {
        filteredFoos.Add(foo);
    }
}
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();
So the above code will iterate twice and allocate a second container to hold the filtered values. What a waste! Compare with:
var foos = GetSomeFoos();
var filteredFoos = foos.Where(foo => foo.SomeProperty == "somevalue");
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();
This only iterates once (when the repeater is bound); it only ever uses the original container; filteredFoos is just an intermediate enumerator. And if, for some reason, you decide not to bind the repeater later on, nothing is wasted. You don't even iterate or evaluate once.
When you get into very complex sequence manipulations, you can potentially gain a lot by leveraging LINQ's inherent use of chaining and lazy evaluation. Again, as with anything, it's just a matter of understanding what it is actually doing.
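As one more hedged sketch (the 'lines' sequence is invented), nothing in a chain runs until you enumerate it, and an operator like Take() lets the pipeline stop early:

// No work happens here; each operator just wraps the previous enumerator.
var query = lines.Where(l => l.Contains("ERROR"))
                 .Select(l => l.Trim())
                 .Take(10);

// Work happens here, one element at a time, and stops after ten matches are found.
foreach (var line in query)
{
    Console.WriteLine(line);
}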
There are various factors which will affect performance.
Often, developing a solution using LINQ will offer pretty reasonable performance because the system can build an expression tree to represent the query without actually running the query while it builds this. Only when you iterate over the results does it use this expression tree to generate and run a query.
In terms of absolute efficiency, you may see some performance hit compared with running predefined stored procedures, but generally the approach to take is to develop a solution using a system that offers reasonable performance (such as LINQ) and not worry about a few percent loss of performance. If a query then runs slowly, that is when you look at optimisation.
The reality is that the majority of queries will not have the slightest problem with being done via LINQ. The other fact is that if your query is running slowly, it's probably more likely to be issues with indexing, structure, etc, than with the query itself, so even when looking to optimise things you'll often not touch the LINQ, just the database structure it's working against.
For handling XML, if you've got a document being loaded and parsed into memory (like anything based on the DOM model, or an XmlDocument or whatever), then you'll get more memory usage than systems that do something like raising events to indicate finding a start or end tag, without building a complete in-memory version of the document (like SAX or XmlReader). The downside is that the event-based processing is generally rather more complex. Again, with most documents there won't be a problem - most systems have several GB of RAM, so taking up a few MB for a single XML document is not a problem (and you often process a large set of XML documents at least somewhat sequentially). It's only if you have a huge XML file that would take up hundreds of MB that you need to worry about the particular choice.
Bear in mind that LINQ allows you to iterate over in-memory lists and so on as well, so in some situations (like where you're going to use a set of results again and again in a function), you may use .ToList or .ToArray to return the results. Sometimes this can be useful, although generally you want to use the database's querying rather than in-memory querying.
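For instance (the names below are made up), if the same results are reused several times within a method, materialising once avoids repeated round-trips while still leaving the heavy filtering to the database:

// One query, one round-trip; the results are then reused in memory.
var activeUsers = db.Users.Where(u => u.IsActive).ToList();

int count = activeUsers.Count;                   // no extra query
var names = activeUsers.Select(u => u.Name);     // LINQ to Objects over the cached list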
As for personal favourites - NHibernate LINQ - it's an object-relational mapping tool that allows you to define classes, define mapping details, and then get it to generate the database from your classes rather than the other way round, and the LINQ support is pretty good (certainly better than the likes of SubSonic).
In LINQ to SQL you don't need to care that much about performance. You can chain all your statements in whatever way you find most readable; LINQ just translates them all into a single SQL statement in the end, which only gets executed when you actually need the result (for example, when you call .ToList()).
A variable can hold the query without executing it, so you can append extra clauses under different conditions. Execution only happens at the end, when you translate your statements into a result such as an object or a list of objects.
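A small sketch of that idea (the DataContext, the Customer type and the filter conditions are invented for illustration):

// Nothing is sent to SQL Server yet; 'query' is just a composable expression.
IQueryable<Customer> query = db.Customers;

if (onlyActive)
    query = query.Where(c => c.IsActive);

if (!string.IsNullOrEmpty(city))
    query = query.Where(c => c.City == city);

// The whole chain is translated into a single SQL statement and executed here.
List<Customer> results = query.ToList();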
There's a codeplex project called i4o which I used a while back which can help improve the performance of Linq to Objects in cases where you're doing equality comparisons, e.g.
from p in People
where p.Age == 21
select p;
http://i4o.codeplex.com/
I haven't tested it with .NET 4, so I can't say for sure that it will still work, but it's worth checking out.
To get it to work its magic you mostly just have to decorate your class with some attributes to specify which property should be indexed. When I used it, it only worked with equality comparisons, though.
Related
Is there a way to use a custom memory allocator for LINQ?
For example when I call:
someCollection.Where(x).SelectMany(y).ToList();
Methods like ToList() or OrderBy() will always create a new array, so lots of GC will happen.
With a custom allocator, I could always use the same List, which would be cleared and refilled every time. I am aware that reusing buffers could lead to problems with reentrancy.
The background is, my application is a game and GC means stuttering.
Please don't tell me "Use C++ instead" or "Do not use LINQ", I know that :)
(Although you asked us not to suggest against it, I think this answer could help the community.)
LINQ is a facility built on top the CLR, therefore it uses the CLR allocator, and it cannot be changed.
You can tune it a little bit, for example configuring whether or not the GC cycle should be offloaded to a background thread, but you can't go any further.
The aim of LINQ is to simplify writing code for a certain class of problems, at the cost of the freedom to choose the implementation of every building block (which is exactly why we usually choose LINQ).
However, depending on the scenario, LINQ could not be your best friend as its design choices may play against yours.
If, after profiling your code, you identify that you have a serious performance problem, you should first try to see whether you can isolate the bottleneck in one of the LINQ methods and roll your own implementation for it, via extension methods.
Of course, this option is only viable when you are the main caller, unless you manage to roll something that is IEnumerable-compliant. You would need to be very lucky, because your implementation has to abide by LINQ's rules. In particular, since you are not in control of how the objects are manipulated, you cannot perform the optimizations you would in your own code.
Closures and deferred execution work against you.
Otherwise, what has been suggested by the comments, is the only viable option: avoid using LINQ for that specific task.
The reason for stepping away from LINQ is that it is not the right tool to solve your problem with performance constraint you require.
Additionally, as stated in the comments, the (ab)use of lambda expressions significantly increases memory pressure, as backing objects are created to implement the closures.
We had performance issues similar to yours, where we had to rewrite certain slow paths. In other (rare) cases, preallocating the lists and loading the results via AddRange helped.
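As a sketch of that last point (the Item type and the IsVisible check are invented; the reentrancy caveat from the question still applies), reusing one preallocated list instead of letting Where(...).ToList() allocate a fresh one each time:

// Allocated once and reused every frame, instead of allocating a new list (and garbage) per call.
private readonly List<Item> _scratch = new List<Item>(1024);

void CollectVisible(List<Item> source)
{
    _scratch.Clear();                    // keeps the existing backing array
    for (int i = 0; i < source.Count; i++)
    {
        if (source[i].IsVisible)
            _scratch.Add(source[i]);
    }
    // consume _scratch here; don't hand it to code that might hold on to it
}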
At the startup of an application that uses RavenDB, I need to load the full collection of documents of a specific type and loop through them. The number of documents should always be small (< 1000).
I can do this via:
session.Query<MyType>();
but I want to ensure that the results I am getting are immediately consistent.
It seems that this case falls between a Load() and a Query(), since (I think) the query is intended to be eventually consistent, and the load is intended to be immediately consistent?
But in this case, there should be no index involved (no filtering or sorting), so will using Query() be immediately consistent?
Query() is never immediately consistent. session.SaveChanges() writes only to the document store; the indexes are always updated asynchronously afterwards, although for the most part very, very fast!
This is generally a poor modeling design and a code smell with a document database. Since you mention that this is at application startup and a relatively small amount of data, it sounds like reference information that changes infrequently. Could you enclose all of it in a single document that contains a List<MyType> instead?
Failing that, you can try the LoadStartingWith() command, which is still fully ACID, and give it the prefix that fits all the necessary documents, although up to 1000 is still a lot and I don't know if that method will return ALL results or cut it off at some point.
If you must use Query(), you will have to use one of the variations of .WaitForNonStaleResults() (the other variants take into consideration what you are waiting for rather than requiring all indexing to be complete) to get consistent results, but that should be fairly safe if the documents change infrequently. Although I really do hate using this method in nearly all its forms, preferring to use any of the above more preferable methods.
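For what it's worth, with the client versions current at the time of writing the two suggestions look roughly like this (the key prefix, page size and exact method overloads are assumptions; check them against your RavenDB client version):

// Load by key prefix: fully ACID, no index involved.
var docs = session.Advanced.LoadStartingWith<MyType>("MyTypes/", pageSize: 1024);

// Or, if you must use Query(), wait for indexing to catch up before returning.
// RavenDB caps query results (128 by default, as I recall) unless you specify Take().
var all = session.Query<MyType>()
                 .Customize(x => x.WaitForNonStaleResults())
                 .Take(1024)
                 .ToList();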
Which will perform better while loop or cursor?
After lots of research, I have come to understand that both are equally bad for performance, that either may sometimes outperform the other depending on the situation, and that they should be used only when a set-based operation is not possible.
Now the question is: which performs better, a loop in C# or a cursor (or while loop) in SQL?
I searched the web but found no definitive answer. Does anybody have any idea?
Based on my experience I would say: it depends on which operations you perform on every item...
In one scenario I had to use a cursor loop in SQL to perform bit-wise operations on some data read from the DB, and performance was very poor (SQL is not intended for that kind of work)... in that case I got a better result by looping in C# over a cursor opened on the DB...
On the other side, if you have to perform some other complex data-mining task for every item of the loop, then it is much more convenient to do that in SQL, so that your data do not have to travel back and forth between the DB and C#.
Do you have a specific application scenario you can tell us about, so that we can give you a more concrete idea?
When you say "a for loop in C#", do you mean that you intend to first load all the results from the query, and then subsequently loop over them? The downside of that approach is of course that you need the memory to hold all those results. But you don't have to do that. There are many mechanisms in C# that allow you to loop over the results as they come in, which avoids the need to hold all results in memory. It depends of course on your database access technology.
If you use some technology based on IQueryable<T>, just use a foreach loop over the result, and avoid calling materialization functions on it, such as .ToList().
As for cursors, always avoid them if possible. For performance, always prefer set based operations over cursors. In fact, the same is true for a foreach loop in C#. If your processing of each result involves querying the database again, use a SQL join that returns you the needed data in a single query, instead of a new query for each result row retrieved.
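A sketch of that last point (table and method names invented): the first pattern issues a query per row, the second pushes the work to the database as a single set-based query:

// N+1 pattern to avoid: one extra round-trip for every order.
foreach (var order in db.Orders)
{
    var customer = db.Customers.Single(c => c.Id == order.CustomerId);
    Process(order, customer);
}

// Set-based alternative: a single query, joined on the server.
var rows = from o in db.Orders
           join c in db.Customers on o.CustomerId equals c.Id
           select new { Order = o, Customer = c };

foreach (var row in rows)
{
    Process(row.Order, row.Customer);
}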
I find myself often with a situation where I need to perform an operation on a set of properties. The operation can be anything from checking if a particular property matches anything in the set to a single iteration of actions. Sometimes the set is dynamically generated when the function is called, some built with a simple LINQ statement, other times it is a hard-coded set that will always remain the same. But one constant always exists: the set only exists for one single operation and has no use before or after it.
My problem is, I have so many points in my application where this is necessary, but I appear to be very, very inconsistent in how I store these sets. Some of them are arrays, some are lists, and just now I've found a couple of linked lists. Now, none of the operations I'm specifically concerned about has to care about indices, container size, order, or any other functionality bestowed by any of the individual container types. I picked resource efficiency because it's a better idea than flipping coins. I figured that, since an array's size is fixed and it's a very elementary container, it might be my best choice, but I figure it is a better idea to ask around. Alternatively, if there's a better choice not for resource efficiency but simply as a better fit for this kind of situation, that would be nice as well.
With your acknowledgement that this is more about coding consistency than about performance or efficiency, I think the general practice is to use a List<T>. Its actual backing store is an array, so you aren't really losing much (if anything noticeable) to container overhead. Without more qualifications, I'm not sure I can offer anything more than that.
Of course, if you truly don't care about the things that you list in your question, just type your variables as IEnumerable<T> and you're only dealing with the actual container when you're populating it; where you consume it will be entirely consistent.
There are two basic principles to be aware of regarding resource efficiency.
Runtime complexity
Memory overhead
You said that indices and order do not matter and that a frequent operation is matching. A Dictionary<TKey, TValue> (which is a hashtable) is an ideal candidate for this type of work. Lookups on the keys are very fast, which would be beneficial in your matching operation. The disadvantage is that it will consume a little more memory than what would be strictly required. The usual load factor is around .8, so we are not talking about a huge increase or anything.
For your other operations you may find that an array or List<T> is a better option especially if you do not need to have the fast lookups. As long as you are not needing high performance on specialty operations (lookups, sorting, etc.) then it is hard to beat the general resource characteristics of array based containers.
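To make the matching point concrete, here is a sketch (names invented) contrasting a linear scan with a hash-based lookup; the HashSet<T> used here is the same hashtable idea as the Dictionary above, just without separate values:

// List<T>: every Contains call walks the list, O(n) per lookup.
List<string> allowedNames = LoadAllowedNames();        // hypothetical source
bool slow = allowedNames.Contains(name);

// Hash-based container: close to O(1) per lookup, at the cost of the
// load-factor overhead mentioned above.
var allowed = new HashSet<string>(allowedNames);
bool fast = allowed.Contains(name);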
List<T> is probably fine in general. It's easy to understand (in the literate-programming sense) and reasonably efficient. The keyed collections (e.g. Dictionary, SortedList) will throw an exception if you add an entry with a duplicate key, though that may not be a problem for what you're working on now.
Only if you find that you're running into a CPU-time or memory-size problem should you look at improving the "efficiency", and then only after determining that this is the bottleneck.
No matter which approach you use, there will still be creation and deletion of the underlying objects (collection or iterator) that will eventually be garbage collected, if the application runs long enough.
I am using a collection object. In some cases the number of objects becomes large. In that case, should I loop through the object collection or do a new hit against the database? Which is better for performance?
As always with performance, it depends. The best answer is to try out both options and measure which one works best in a number of different scenarios.
+1 to Mark's answer, as it's impossible to say without knowing the full picture.
I would just add a third option, which is to not loop through the object collection manually but to use a LINQ approach to find the objects you're interested in within the collection. I'm not sure what performance difference, if any, this would give, but I thought it worth a mention.
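For example (property names invented), something along these lines instead of an explicit loop:

// Find a single object, or the subset you care about, without a manual loop.
var match = myObjects.FirstOrDefault(o => o.Id == targetId);
var subset = myObjects.Where(o => o.Category == "Widget");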
It also might depend on what you are doing. There are lots of options in databases to make life easier. Things like materialized views, for instance, can help reduce the number of rows you're selecting against (which would reduce the results you get and make your query faster), or something like a stored procedure can also be used for performance if you're doing some sort of processing.
Then again, some things will just always take a while. Last November I had to migrate 120 million records from a "detail" table into a "summary" table, and even using bulk collects and the like it just took a long time. Sometimes performance will always be "crappy."
If you have a DBA talk to them early on.