I understand that an IEnumerable carries the risk of returning different results on multiple enumerations.
But is that still a problem if we are sure the underlying record set will never change in between, and the order of iteration doesn't matter at all?
It seems a shame to call ToList / ToArray everywhere without any consideration, when it's only a "possible" risk. R# or VS could simply mark it as an error if it should never happen.
Is there really no exception at all?
Should we never iterate an IEnumerable multiple times?
This is what actually happened, in a single-threaded environment:
void Main()
{
    var result = GetFiles(new[] { path1, path2 }); // hardcoded paths
}

IList<SomeFile> GetFiles(IEnumerable<string> filePaths)
{
    var paths = filePaths.ToArray(); // <-- why do we have to do this?

    foreach (var path in paths)
    {
        // Throw an exception if the path does not exist.
    }

    foreach (var path in paths)
    {
        // Process the paths and return a list of files.
    }
}
I understand it makes little difference here because the collection is so small, but we are at the beginning of a project that has to deal with big collections of static data. This kind of practice could become a real problem if applied everywhere without considering whether it is actually necessary.
The concern of getting different results on a second iteration is a distant second to a much more realistic performance concern. Unless your IEnumerable<T> is actually a collection in memory, you run the risk of having to reproduce it each time you enumerate. This could be very costly:
If IEnumerable<T> comes from another LINQ expression, you spend CPU cycles to recompute the same thing,
If IEnumerable<T> comes from the database, you may end up re-reading the data from the server,
If IEnumerable<T> comes from a file, you will re-read the file.
None of the above affects correctness, but it may dramatically decrease speed, especially for large data sets. Since memory is relatively cheap these days and the garbage collector is pretty reliable, temporarily saving a collection in a list or an array is an inexpensive way to avoid the problem.
If it is an in-memory collection and not an abstract enumerable object, just use the appropriate interface (IReadOnlyCollection<T>, IReadOnlyList<T>, etc.) instead of IEnumerable<T>. If you accept IEnumerable<T>, you should assume it can be any IEnumerable implementation.
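A minimal sketch of that approach, applied to the GetFiles example from the question (SomeFile and its constructor come from the question; the rest is assumed):

IList<SomeFile> GetFiles(IReadOnlyList<string> filePaths)
{
    // The parameter type now guarantees an in-memory, repeatable collection,
    // so iterating twice is safe and costs nothing extra.
    foreach (var path in filePaths)
    {
        if (!File.Exists(path))
            throw new FileNotFoundException("Path does not exist.", path);
    }

    var files = new List<SomeFile>();
    foreach (var path in filePaths)
    {
        files.Add(new SomeFile(path)); // hypothetical constructor
    }
    return files;
}

The caller can still pass an array or a List<string>, but anyone holding only a lazy IEnumerable<string> is forced to materialize it exactly once, at the call site.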
Using the System.DirectoryServices.Protocols library:
I have a class LdapItemOperator that takes a SearchResultEntry object from an LDAP query (not Active Directory related) and stores the attributes for the object in a field: readonly SearchResultAttributeCollection LdapAttributes.
The problem I am experiencing is that in a large operation the garbage collector never seems to collect these objects after they ought to have been disposed, because of the LdapAttributes field in my objects; at least I think that's the problem. What ways can I try to dispose of the objects when they are no longer required? I can't seem to find a way to incorporate a using statement, although I have only a little experience with it.
As an example, let's say I have the following logic:
List<LdapItemOperator> itemList = GetList(ldapFilter);
List<bool> resultList = new List<bool>();
foreach (LdapItemOperator item in itemList) {
    bool result = doStuff(item);
    resultList.Add(result);
}
// Even though we are out of the loop now, the objects are still stored in memory. How come?
// The same goes for the earlier objects in the loop; they seem to remain in memory.
Logic.WriteResultToLog(resultList);
After a good while of running the logic on large filesets, this process starts taking up enormous amounts of memory, of course...
I think you might be a little confused about how the GC works. You can never know exactly when the GC will run, and objects you are still holding a reference to will not be collected (unless it's a weak reference...).
Also, "disposing" is yet another concept, and it doesn't have much to do with GC.
Basically, all the objects are already in memory after the call to GetList, and memory consumption will not change much after that; the foreach loop shouldn't affect it at all.
Without knowing your implementation, maybe try returning an enumerable instead of a single list, or make multiple batched calls.
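A minimal sketch of the streaming approach, assuming a hypothetical GetItems method that wraps the LDAP search (RunQuery stands in for whatever produces the SearchResultEntry objects):

IEnumerable<LdapItemOperator> GetItems(string ldapFilter)
{
    foreach (SearchResultEntry entry in RunQuery(ldapFilter)) // RunQuery is hypothetical
    {
        // Each item becomes eligible for collection as soon as the caller
        // drops its reference, instead of being kept alive by one big list.
        yield return new LdapItemOperator(entry);
    }
}

The consuming loop stays essentially the same, but only one item at a time needs to stay reachable:

foreach (var item in GetItems(ldapFilter))
    resultList.Add(doStuff(item));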
Assume the two following possible blocks of code inside a view, with a model passed to it using something like return View(db.Products.Find(id)):
List<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date).ToList();
if (myUserReview != null)
reviews = reviews.Where(ur => ur.Id != myUserReview.Id).ToList();
IEnumerable<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date);
if (myUserReview != null)
reviews = reviews.Where(ur => ur.Id != myUserReview.Id);
What are the performance implications between the two? By this point, is all of the product related data in memory, including its navigation properties? Does using ToList() in this instance matter at all? If not, is there a better approach to using Linq queries on a List without having to call ToList() every time, or is this the typical way one would do it?
Read http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Deferred execution is one of the many marvels intrinsic to LINQ. The short version is that your data is never touched (it remains idle in the source, be that in memory, in a database, or wherever). When you construct a LINQ query, all you are doing is creating an IEnumerable class that is 'capable' of enumerating your data. The work doesn't begin until you actually start enumerating; then each piece of data comes all the way from the source, through the pipeline, and is serviced by your code one item at a time. If you break your loop early, you have saved some work - later items are never processed. That's the simple version.
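A minimal sketch of that behavior using an iterator (the Console writes are only there to show when the work actually happens):

using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<int> Numbers()
{
    for (int i = 0; i < 5; i++)
    {
        Console.WriteLine("producing " + i); // runs only when enumerated
        yield return i;
    }
}

var query = Numbers().Where(n => n % 2 == 0); // nothing is printed yet

foreach (var n in query)
{
    Console.WriteLine("consuming " + n);
    if (n >= 2) break; // 3 and 4 are never produced
}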
Some LINQ operations cannot work this way. OrderBy is the best example. OrderBy has to see every piece of data, because it is possible that the last piece retrieved from the source could very well be the first piece you are supposed to get. So when an operation such as OrderBy is in the pipe, it will actually cache your dataset internally. So all data has been pulled from the source and has gone through the pipeline up to the OrderBy, and then the OrderBy becomes your new temporary data source for any operations that come after it in the expression. Even so, OrderBy tries as much as possible to follow the deferred execution paradigm by waiting until the last possible moment to build its cache. Including OrderBy in your query still doesn't do any work immediately; the work begins once you start enumerating.
To answer your question directly, your call to ToList is doing exactly that. OrderByDescending is caching the data from your datasource => ToList additionally persists it into a variable that you can actually touch (reviews) => Where starts pulling records one at a time from reviews, and if a record matches, your final ToList call stores the result in yet another list in memory.
Beyond the memory implications, ToList is additionally thwarting deferred execution, because it STOPS the processing of your view at the time of the call in order to fully process that LINQ expression and build its in-memory representation of the results.
Now none of this is a real big deal if the number of records we're talking about are in the dozens. You'll never notice the difference at runtime because it happens so quick. But when dealing with large scale datasets, deferring as much as possible for as long as possible in hopes that something will happen allowing you to cancel a full enumeration... in addition to the memory savings... gold.
In your version without ToList: OrderByDescending will still internally cache a copy of your dataset as processed through the pipeline up to that point, sorted of course. That's OK; you gotta do what you gotta do. But that doesn't happen until you actually try to retrieve your first record later in your view. Once that cache is complete, you get your first record, and for every next record you are pulling from that cache and checking it against the Where clause; you either get it or you don't based on that Where, and you have saved a couple of in-memory copies and a lot of work.
Magically, I bet even your lead-in of db.Products.Find(id) doesn't even start spinning until your view starts enumerating (if not using ToList). If db.Products is a Linq2SQL datasource, then every other element you've specified will reduce into SQL verbiage, and your entire expression will be deferred.
Hope this helps! Read further on Deferred Execution. And if you want to know 'how' that works, look into c# iterators (yield return). There's a page somewhere on MSDN that I'm having trouble finding that contains the common linq operations, and whether they defer execution or not. I'll update if I can track that down.
/*edit*/ To clarify - all of the above is in the context of raw LINQ, or Linq2Objects. Until we find that page, common sense will tell you how it works. If you close your eyes and imagine implementing OrderBy, or any other LINQ operation, and you can't think of a way to implement it with 'yield return' and without caching, then execution is not likely deferred: expect a cached copy and/or a full enumeration... OrderBy, Distinct, Count, Sum, etc. Linq2SQL is a whole other can of worms. Even in that context, ToList will still stop and process the whole expression and store the results, because a list is a list and is in memory. But Linq2SQL is uniquely capable of deferring many of those aforementioned clauses, and then some, by incorporating them into the generated SQL that is sent to the SQL server. So even OrderBy can be deferred in this way, because the clause will be pushed down into your original datasource and then ignored in the pipe.
Good luck ;)
Not enough context to know for sure.
But ToList() guarantees that the data has been copied into memory, and your first example does that twice.
The second example could involve queryable data or some other on-demand scenario. Even if the original data was all already in memory and even if you only added a call to ToList() at the end, that would be one less copy in-memory than the first example.
And it's entirely possible that in the second example, by the time you get to the end of that little snippet, no actual processing of the original data has been done at all. In that case, the database might not even be queried until some code actually enumerates the final reviews value.
As for whether there's a "better" way to do it, not possible to say. You haven't defined "better". Personally, I tend to prefer the second example...why materialize data before you need it? But in some cases, you do want to force materialization. It just depends on the scenario.
I have recently seen a new trend in my firm where we change the IEnumerable to a dictionary by a simple LINQ transformation as follows:
enumerable.ToDictionary(x => x);
We mostly end up doing this when the operation on the collection is a Contains/access, and obviously a dictionary has better performance in such cases.
But I realise that converting the enumerable to a dictionary has its own cost, and I am wondering at what point it starts to break even (if it does), i.e. when the performance of IEnumerable Contains/access equals ToDictionary + access/contains.
OK, I might add: there is no database access involved; the enumerable might be created from a database query, that's it, and the enumerable may be edited after that too.
Also, it would be interesting to know how the datatype of the key affects the performance.
The number of lookups is generally 2-5, but sometimes may be just one. But I have seen things like:
For an enumerable:
var element = enumerable.SingleOrDefault(x => x.Id == id);
// do something if element is null, or return
For a dictionary:
if (dictionary.ContainsKey(id))
    // do something, otherwise return
This has been bugging me for quite some time now.
Performance of Dictionary Compared to IEnumerable
A Dictionary, when used correctly, is always faster to read from (except in cases where the data set is very small, e.g. 10 items). There can be overhead when creating it.
Given m as the number of lookups performed against the same object (these are approximate):
Performance of an IEnumerable (created from a clean list): O(mn)
This is because you need to look at all the items each time (essentially m * O(n)).
Performance of a Dictionary: O(n) + m * O(1), i.e. O(n + m)
This is because you need to insert the items first (O(n)); each of the m lookups is then O(1).
In general it can be seen that the Dictionary wins when m > 1, and the IEnumerable wins when m = 1 or m = 0.
In general you should:
Use a Dictionary when doing the lookup more than once against the same dataset (see the sketch after this list).
Use an IEnumerable when doing the lookup once.
Use an IEnumerable when the dataset could be too large to fit into memory.
Keep in mind a SQL table can be used like a Dictionary, so you could use that to offset the memory pressure.
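A minimal sketch of the m > 1 case, assuming a hypothetical Item class with an int Id used as the key:

// Build once: O(n).
Dictionary<int, Item> byId = items.ToDictionary(x => x.Id);

// Each lookup afterwards is O(1), so m lookups cost O(m) instead of O(m * n).
foreach (int id in idsToCheck)
{
    Item item;
    if (byId.TryGetValue(id, out item))
    {
        Process(item); // hypothetical per-item work
    }
}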
Further Considerations
Dictionaries use GetHashCode() to organise their internal state. The performance of a Dictionary is strongly related to the hash code in two ways.
Poorly performing GetHashCode() - results in overhead every time an item is added, looked up, or deleted.
Low quality hash codes - results in the dictionary not having O(1) lookup performance.
Most built-in .Net types (especially the value types) have very good hashing algorithms. However, with list-like types (e.g. string), GetHashCode() has O(n) performance, because it needs to iterate over the whole string. Thus your dictionary's performance can really be seen as O(1) + M, where M is the cost of GetHashCode() (O(1) for an efficient implementation).
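A minimal sketch of a well-behaved custom key (the fields are hypothetical; the point is a cheap, well-mixed GetHashCode paired with a matching Equals):

struct OrderKey : IEquatable<OrderKey>
{
    public readonly int CustomerId;
    public readonly int OrderId;

    public OrderKey(int customerId, int orderId)
    {
        CustomerId = customerId;
        OrderId = orderId;
    }

    public bool Equals(OrderKey other)
    {
        return CustomerId == other.CustomerId && OrderId == other.OrderId;
    }

    public override bool Equals(object obj)
    {
        return obj is OrderKey && Equals((OrderKey)obj);
    }

    public override int GetHashCode()
    {
        // O(1) and mixes both fields, so dictionary buckets stay balanced.
        return (CustomerId * 397) ^ OrderId;
    }
}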
It depends....
How long is the IEnumerable?
Does accessing the IEnumerable cause database access?
How often is it accessed?
The best thing to do would be to experiment and profile.
If you search for elements in your collection by some key very often, the Dictionary will definitely be faster, because it is a hash-based collection and lookups are many times faster. Otherwise, if you don't search the collection a lot, the conversion is not necessary, because the time for the conversion may be bigger than one or two searches in the collection.
IMHO: you need to measure this in your environment with representative data. In such cases I just write a quick console app that measures the execution time of the code. To get a better measure you need to execute the same code multiple times, I guess.
ADD:
It also depends on the application you develop. Usually you gain more by optimizing other places (avoiding network roundtrips, caching, etc.) with that time and effort.
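A minimal sketch of such a console benchmark, assuming int keys and an in-memory list (adjust the data and key type to your real case):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        List<int> data = Enumerable.Range(0, 100000).ToList();
        int[] keys = { 5, 500, 50000, 99999 };

        var sw = Stopwatch.StartNew();
        foreach (int key in keys)
        {
            bool found = data.Contains(key); // O(n) per lookup
        }
        Console.WriteLine("List.Contains:              " + sw.Elapsed);

        sw.Restart();
        var dict = data.ToDictionary(x => x); // pay O(n) once
        foreach (int key in keys)
        {
            bool found = dict.ContainsKey(key); // O(1) per lookup
        }
        Console.WriteLine("ToDictionary + ContainsKey: " + sw.Elapsed);
    }
}

Run it several times, and with your real number of lookups, to see where the break-even lies on your hardware.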
I'll add that you haven't told us what happens every time you "rewind" your IEnumerable<>. Is it directly backed by a data collection (for example a List<>), or is it calculated "on the fly"? If it's the former, then for small collections enumerating them to find the wanted element is faster (a Dictionary for 3-4 elements is useless; if you want, I can build some benchmark to find the breaking point). If it's the latter, then you have to consider whether "caching" the IEnumerable<> in a collection is a good idea. If it is, then you can choose between a List<> and a Dictionary<>, and we return to point 1: is the IEnumerable small or big? And there is a third problem: if the collection isn't backed but is too big for memory, then clearly you can't put it in a Dictionary<>. Then perhaps it's time to make the SQL work for you :-)
I'll add that "failures" have their cost: in a List<> if you try to find an element that doesn't exist, the cost is O(n), while in a Dictionary<> the cost is still O(1).
I'm developing an Xbox 360 game with XNA. I'd really like to use C#'s yield return construct in a couple of places, but it seems to create a lot of garbage. Have a look at this code:
class ComponentPool<T> where T : DrawableGameComponent
{
List<T> preallocatedComponents;
public IEnumerable<T> Components
{
get
{
foreach (T component in this.preallocatedComponents)
{
// Enabled often changes during iteration over Components
// for example, it's not uncommon for bullet components to get
// disabled during collision testing
// sorry I didn't make that clear originally
if (component.Enabled)
{
yield return component;
}
}
}
}
...
I use these component pools everywhere - for bullets, enemies, explosions; anything numerous and transient. I often need to loop over their contents, and I'm only ever interested in components that are active (i.e., Enabled == true), hence the behavior of the Components property.
Currently, I'm seeing as much as ~800K per second of additional garbage when using this technique. Is this avoidable? Is there another way to use yield return?
Edit: I found this question about the broader issue of how to iterate over a resource pool without creating garbage. A lot of commenters were dismissive, apparently not understanding the limitations of the Compact Framework, but this commenter was more sympathetic and suggested creating an iterator pool. That's the solution I'm going to use.
The implementation of iterators by the compiler does indeed use class objects and the use (with foreach, for example) of an iterator implemented with yield return will indeed cause memory to be allocated. In the scheme of things this is rarely a problem because either considerable work is done while iterating or considerably more memory is allocated doing other things while iterating.
In order for the memory allocated by an iterator to become a problem, your application must be data-structure intensive and your algorithms must operate on objects without allocating any memory. Think of the Game of Life or something similar. Suddenly it is the iteration itself that dominates. And when the iteration allocates memory, a tremendous amount of memory can be allocated.
If your application fits this profile (and only if) then the first rule you should follow is:
avoid iterators in inner loops when a simpler iteration concept is available
For example, if you have an array or list like data structure, you are already exposing an indexer property and a count property so clients can simply use a for loop instead of using foreach with your iterator. This is "easy money" to reduce GC and it doesn't make your code ugly or bloated, just a little less elegant.
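A minimal sketch of that "easy money", assuming the pool exposes a Count property and an indexer (both are assumptions; the original class does not show them):

// Allocation-free iteration: no enumerator object is created.
for (int i = 0; i < pool.Count; i++)
{
    T component = pool[i];
    if (component.Enabled)
    {
        Update(component); // hypothetical per-component work
    }
}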
The second principle you should follow is:
measure memory allocations to see when and where you should apply the first rule
Just for grins, try capturing the filter in a Linq query and holding onto the query instance. This might reduce memory reallocations each time the query is enumerated.
If nothing else, the expression preallocatedComponents.Where(r => r.Enabled) is a heck of a lot less code to look at to do the same thing as your yield return.
class ComponentPool<T> where T : DrawableGameComponent
{
    readonly List<T> preallocatedComponents = new List<T>();
    readonly IEnumerable<T> enabledComponentsFilter;

    public ComponentPool()
    {
        // Build the query object once; enumerating it re-evaluates the
        // filter each time without reconstructing the query.
        enabledComponentsFilter = this.preallocatedComponents.Where(r => r.Enabled);
    }

    public IEnumerable<T> Components
    {
        get { return enabledComponentsFilter; }
    }
...
I need to enumerate through a generic IList<> of objects. The contents of the list may change, as in items being added or removed by other threads, and this kills my enumeration with a "Collection was modified; enumeration operation may not execute."
What is a good way of doing a threadsafe foreach on an IList<>? Preferably without cloning the entire list. It is not possible to clone the actual objects referenced by the list.
Cloning the list is the easiest and best way, because it ensures your list won't change out from under you. If the list is simply too large to clone, consider putting a lock around it that must be taken before reading/writing to it.
There is no such operation. The best you can do is
lock (collection)
{
    foreach (object o in collection)
    {
        ...
    }
}
Your problem is that an enumeration does not allow the IList to change. This means you have to avoid changes while going through the list.
A few possibilities come to mind:
Clone the list. Now each enumerator has its own copy to work on.
Serialize the access to the list. Use a lock to make sure no other thread can modify it while it is being enumerated.
Alternatively, you could write your own implementation of IList and IEnumerator that allows the kind of parallel access you need. However, I'm afraid this won't be simple.
ICollection MyCollection;
// Instantiate and populate the collection
lock(MyCollection.SyncRoot) {
// Some operation on the collection, which is now thread safe.
}
From MSDN
You'll find that's a very interesting topic.
The best approach relies on a reader/writer lock (ReaderWriterLock), which used to have big performance issues due to the so-called convoy problem.
The best article I've found on the subject is this one by Jeffrey Richter, which presents his own method for a high-performance solution.
So the requirements are: you need to enumerate through an IList<> without making a copy, while simultaneously adding and removing elements.
Could you clarify a few things? Are insertions and deletions happening only at the beginning or end of the list?
If modifications can occur at any point in the list, how should the enumeration behave when elements are removed or added near or on the location of the enumeration's current element?
This is certainly doable by creating a custom IEnumerable object, perhaps with an integer index, but only if you can control all access to your IList<> object (for locking and maintaining the state of your enumeration). But multithreaded programming is a tricky business under the best of circumstances, and this is a complex problem.
Foreach depends on the fact that the collection will not change. If you want to iterate over a collection that can change, use the normal for construct and be prepared for nondeterministic behavior. Locking might be a better idea, depending on what you're doing.
Default behavior for a simple indexed data structure like a linked list, b-tree, or hash table is to enumerate in order from the first to the last. It would not cause a problem to insert an element in the data structure after the iterator had already passed that point, or to insert one that the iterator would enumerate once it arrived, and such an event could be detected by the application and handled if the application required it. Detecting a change in the collection and throwing an error during enumeration I can only imagine was someone's (bad) idea of doing what they thought the programmer would want. Indeed, Microsoft has since fixed their collections to work correctly. They have called their shiny new unbroken collections ConcurrentCollections (System.Collections.Concurrent) in .NET 4.0.
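A minimal sketch with one of those collections; ConcurrentQueue<T> can be enumerated while another thread keeps adding, without throwing "Collection was modified" (the workload here is made up):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var queue = new ConcurrentQueue<int>();
        for (int i = 0; i < 1000; i++) queue.Enqueue(i);

        // Writer keeps appending while the main thread enumerates.
        Task writer = Task.Run(() =>
        {
            for (int i = 1000; i < 2000; i++) queue.Enqueue(i);
        });

        foreach (int item in queue)
        {
            // No exception here: the enumerator tolerates concurrent writes.
        }
        writer.Wait();
    }
}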
I recently spent some time multithreading a large application and had a lot of issues with foreach operating on lists of objects shared across threads.
In many cases you can use the good old for loop and immediately assign the object to a copy to use inside the loop. Just keep in mind that all threads writing to the objects in your list should write to different data of those objects. Otherwise, use a lock or a copy, as the other contributors suggest.
Example:
foreach(var p in Points)
{
// work with p...
}
Can be replaced by:
for (int i = 0; i < Points.Count; i++)
{
Point p = Points[i];
// work with p...
}
Wrap the list in a locking object for reading and writing. You can even iterate with multiple readers at once if you have a suitable lock that allows multiple concurrent readers but only a single writer (when there are no readers).
This is something that I've recently had to deal with and to me it really depends on what you're doing with the list.
If you need to use the list at a point in time (given the number of elements currently in it) AND another thread can only ADD to the end of the list, then maybe you can just switch to a FOR loop with a counter. At the point you grab the counter, you're only seeing X elements in the list. You can walk through those (while others are adding to the end); it should not cause a problem, as in the sketch below.
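A minimal sketch of that snapshot-count pattern, reusing the Points example from above and assuming adds happen only at the end:

// Snapshot the count once; items appended afterwards are simply not visited.
int count = Points.Count;
for (int i = 0; i < count; i++)
{
    Point p = Points[i];
    // work with p...
}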
Now, if the list needs to have items taken OUT of it by other threads, or CLEARED by other threads, then you'll need to implement one of the locking mechanisms mentioned above. Also, you may want to look at some of the newer "concurrent" collection classes (though I don't believe they implement IList, so you may need to refactor for a dictionary).