List<T>.AddRange implementation suboptimal

Profiling my C# application indicated that significant time is spent in List<T>.AddRange. Using Reflector to look at the code in this method indicated that it calls List<T>.InsertRange which is implemented as such:
public void InsertRange(int index, IEnumerable<T> collection)
{
    if (collection == null)
    {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.collection);
    }
    if (index > this._size)
    {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.index, ExceptionResource.ArgumentOutOfRange_Index);
    }
    ICollection<T> is2 = collection as ICollection<T>;
    if (is2 != null)
    {
        int count = is2.Count;
        if (count > 0)
        {
            this.EnsureCapacity(this._size + count);
            if (index < this._size)
            {
                Array.Copy(this._items, index, this._items, index + count, this._size - index);
            }
            if (this == is2)
            {
                Array.Copy(this._items, 0, this._items, index, index);
                Array.Copy(this._items, (int) (index + count), this._items, (int) (index * 2), (int) (this._size - index));
            }
            else
            {
                T[] array = new T[count];          // (*)
                is2.CopyTo(array, 0);              // (*)
                array.CopyTo(this._items, index);  // (*)
            }
            this._size += count;
        }
    }
    else
    {
        using (IEnumerator<T> enumerator = collection.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                this.Insert(index++, enumerator.Current);
            }
        }
    }
    this._version++;
}

private T[] _items;
One can argue that the simplicity of the interface (only having one overload of InsertRange) justifies the performance overhead of runtime type checking and casting.
But what could be the reason behind the 3 lines I have indicated with (*) ?
I think it could be rewritten to the faster alternative:
is2.CopyTo(this._items, index);
Do you see any reason for not using this simpler and apparently faster alternative?
Edit:
Thanks for the answers. So the consensus is that this is a protective measure against the input collection implementing CopyTo in a defective/malicious manner. To me it seems like overkill to constantly pay the price of 1) runtime type checking, 2) dynamic allocation of the temporary array and 3) a doubled copy operation, when all of this could have been avoided by defining two or more overloads of InsertRange: one taking IEnumerable as now, a second taking a List<T>, and a third taking T[]. The latter two could have been implemented to run around twice as fast as the current code.
Edit 2:
I did implement a class FastList, identical to List, except that it also provides an overload of AddRange which takes a T[] argument. This overload needs neither the dynamic type check nor the double copying of elements. I profiled FastList.AddRange against List.AddRange by adding 4-byte arrays 1000 times to a list which was initially empty. My implementation beats the standard List.AddRange by a factor of 9 (nine!). List.AddRange takes about 5% of the runtime in one of the important usage scenarios of our application, so replacing List with a class providing a faster AddRange could improve application runtime by 4%.
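The FastList source is not shown, so the following is only a minimal sketch of what an array overload of AddRange could look like. The class name SimpleList, its initial capacity and its growth policy are assumptions made for illustration, not the OP's code and not the framework's.

```csharp
using System;

// Hypothetical minimal list; not the OP's FastList and not List<T> itself.
sealed class SimpleList<T>
{
    private T[] _items = new T[4];   // assumed initial capacity
    private int _size;

    public int Count => _size;

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_size) throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];
        }
    }

    // Array overload: no runtime type test, no temporary array, a single copy.
    public void AddRange(T[] source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        int needed = _size + source.Length;
        if (needed > _items.Length)
        {
            // Grow geometrically so that repeated appends stay cheap on average.
            Array.Resize(ref _items, Math.Max(needed, _items.Length * 2));
        }
        Array.Copy(source, 0, _items, _size, source.Length);
        _size = needed;
    }
}
```

Because the source is known to be a real array, Array.Copy can write straight into the backing store; the safety argument about third-party CopyTo implementations does not apply to this overload.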

They are preventing the implementation of ICollection<T> from accessing indices of the destination list outside the bounds of insertion. With the implementation above, a faulty (or "manipulative") implementation of CopyTo that writes outside those bounds gets an exception (such as IndexOutOfRangeException) instead of corrupting the list.
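To make the protective effect concrete, here is a small sketch. RogueCollection is a hypothetical, deliberately misbehaving collection, and the plain int[] in DirectCopy merely stands in for the list's private _items array; neither comes from the framework.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection whose CopyTo also writes one slot BEFORE the requested index.
sealed class RogueCollection : ICollection<int>
{
    public int Count => 2;
    public bool IsReadOnly => true;

    public void CopyTo(int[] array, int arrayIndex)
    {
        array[arrayIndex - 1] = -1;   // misbehaves: outside [arrayIndex, arrayIndex + Count)
        array[arrayIndex] = 10;
        array[arrayIndex + 1] = 20;
    }

    public IEnumerator<int> GetEnumerator() { yield return 10; yield return 20; }
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    public bool Contains(int item) => item == 10 || item == 20;
    public void Add(int item) => throw new NotSupportedException();
    public void Clear() => throw new NotSupportedException();
    public bool Remove(int item) => throw new NotSupportedException();
}

static class CopyToDemo
{
    // Copying directly into the (simulated) backing array: a neighbouring element is overwritten.
    public static int[] DirectCopy()
    {
        int[] items = { 1, 2, 0, 0 };              // inserting two elements at index 2
        new RogueCollection().CopyTo(items, 2);
        return items;
    }

    // Copying into a temporary array of exactly Count elements: the stray write throws.
    public static bool TempCopyThrows()
    {
        var temp = new int[2];
        try { new RogueCollection().CopyTo(temp, 0); return false; }
        catch (IndexOutOfRangeException) { return true; }
    }
}
```

In DirectCopy the existing element at index 1 is silently replaced, whereas in TempCopyThrows the same stray write falls outside the temporary array and fails before anything reaches the destination.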
Keep in mind that T[].CopyTo is quite literally internally implemented as memcpy, so the performance overhead of adding that line is minute. When you have such a low cost of adding safety to a tremendous number of calls, you might as well do so.
Edit: The part I find strange is the fact that the call to ICollection<T>.CopyTo (copying to the temporary array) does not occur immediately following the call to EnsureCapacity. If it were moved to that location, then following any synchronous exception the list would remain unchanged. As-is, that condition only holds if the insertion happens at the end of the list. The reasoning here is:
All necessary allocation happens before altering the list elements.
The calls to Array.Copy cannot fail because
The memory is already allocated
The bounds are already checked
The element types of the source and destination arrays match
There is no "copy constructor" used like in C++ - it's just a memcpy
The only items that can throw an exception are the external call to ICollection.CopyTo and the allocations required for resizing the list and allocating the temporary array. If all three of these occur before moving elements for the insertion, the transaction to change the list cannot throw a synchronous exception.
Final note: This addresses strictly exceptional behavior - the above rationale does not add thread-safety.
Edit 2 (response to the OP's edit): Have you profiled this? You are making some bold claims that Microsoft should have chosen a more complicated API, so you should make sure you're correct in the assertions that the current method is slow. I've never had a problem with the performance of InsertRange, and I'm quite sure that any performance problems someone does face with it will be better resolved with an algorithm redesign than by reimplementing the dynamic list. Just so you don't take me as being harsh in a negative way, keep the following in mind:
I can't stand people on my dev team who like to reinvent the square wheel.
I definitely want people on my team that care about potential performance issues, and ask questions about the side effects their code may have. This point wins out when present - but as long as people are asking questions I will drive them to turn their questions into solid answers. If you can show me that an application gains a significant advantage through what initially appears to be a bad idea, then that's just the way things go sometimes.

It's a good question, and I'm struggling to come up with a reason. There's no hint in the Reference Source. One possibility is that they try to avoid a problem when the class that implements the ICollection<>.CopyTo() method objects to copying to a start index other than 0. Or it is a security measure, preventing the collection from messing with array elements it should not have access to.
Another one is that this is a counter-measure when the collection is used in thread-unsafe manner. If an item got added to the collection by another thread it will be the collection class' CopyTo() method that fails, not the Microsoft code. The right person will get the service call.
These are not great explanations.

There is a problem with your solution if you think about it for a minute: if you change the code in that way, you are essentially giving the collection being added access to an internal data structure.
This is not a good idea. For example, if the author of the List data structure figures out a better underlying structure than an array for storing the data, there is no way to change the implementation of List, since all collections expect an array to be passed to their CopyTo function.
In essence you would be cementing the implementation of the List class, even though object-oriented programming tells us that the internal implementation of a data structure should be something that can be changed without breaking other code.

Related

What's the quickest way to check the size of an IEnumerable is greater than some given value?

I know that you can use enumerable.Any() instead of enumerable.Count() to check if the collection has anything in it efficiently.
Is there an equivalent to check the size is at least any larger size?
For example, how would I efficiently do enumerable.Count() > 3.

The most efficient approach will unfortunately depend on the implementation. It's a leaky abstraction at that point.
If you're using a List<T> or similar, using Count() will be fastest. But for any lazily-evaluated sequence, that will evaluate the whole sequence.
For a lazily-evaluated sequence, using enumerable.Skip(3).Any() will be more efficient, because it can stop once it's found the fourth element. That's all you need to know about; you don't care about the actual size.
Using the Skip()/Any() approach will be slightly less efficient than using Count() for some collections - but could be much more efficient for large lazy sequences. (It will also work even for infinite sequences, which Count() wouldn't.)
The difference in efficiency for lists will depend on how many items you're skipping, of course - if you need to see whether there are "at least a million" items then using Count() would be much more efficient for a list.
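As a small illustration of the lazy case, the sketch below counts how many elements an infinite iterator actually produces when it is tested with Skip(3).Any(). The names SkipAnyDemo, Naturals and Produced are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SkipAnyDemo
{
    // Counts how many elements the lazy sequence has yielded so far.
    public static int Produced;

    // Infinite lazy sequence; calling Count() on it would never return.
    public static IEnumerable<int> Naturals()
    {
        for (int i = 0; ; i++)
        {
            Produced++;
            yield return i;
        }
    }

    // True if the sequence has more than three elements.
    public static bool HasMoreThanThree()
    {
        Produced = 0;
        return Naturals().Skip(3).Any();
    }
}
```

After HasMoreThanThree returns, Produced shows that enumeration stopped as soon as the fourth element was found, even though the sequence itself never ends.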
Sorry not to have an easy answer for you. If you really need this to be optimal in every case, you could perform the same kinds of optimization that the Count() method does. Something like this:
// FIXME: This name is horrible! Note that you'd call it with 4 in your case,
// as it's inclusive of minCount.
// Note this assumes C# 8 and its lovely switch expression support.
// It could be written with if/else etc of course.
// Needs System, System.Collections, System.Collections.Generic and System.Linq,
// and must be declared inside a static class, like any extension method.
public static bool HasAtMinElements<T>(this IEnumerable<T> source, int minCount) =>
    source switch
    {
        null => throw new ArgumentNullException(nameof(source)),
        ICollection<T> coll => coll.Count >= minCount,
        ICollection coll => coll.Count >= minCount,
        _ => source.Skip(minCount - 1).Any()
    };
That's annoying though :( Note that it doesn't optimize IIListProvider<T> like the real Count() method does, either - because that's internal.

The Enumerable.Count method is Microsoft's recommended way to return the number of elements in a sequence. That is what you are already doing, and it is the best option as far as I can see.

Running Enumerable multiple times if the data source remains unchanged

I understand that enumerating an IEnumerable multiple times risks returning different results on each run.
But is that still a problem if we are sure the underlying record set will never change in between, and the order of iteration doesn't matter at all?
It seems a shame to call ToList / ToArray everywhere, without any consideration, when it is only a "possible" risk. R# or VS could simply mark it as an error if it should never happen.
Is there really no exception at all?
Should we never iterate an IEnumerable multiple times?
This is what actually happened.
In a single threaded environment.
void Main()
{
    var result = GetFiles(new[] { path1, path2 }); // hardcoded paths
}

IList<SomeFile> GetFiles(IEnumerable<string> filePaths)
{
    var paths = filePaths.ToArray(); // <-- why do we have to do this?
    foreach (var path in paths)
    {
        // Throw an exception if the path does not exist.
    }
    foreach (var path in paths)
    {
        // Process and return a list of files.
    }
}
I understand it makes little difference here, as the collection is so small, but we are at the beginning of a project that has to deal with big collections of static data. This kind of practice might become a big problem if applied everywhere without considering whether it is necessary.

The concern of getting a different result on the second iteration is a distant second to a much more realistic performance concern. Unless your IEnumerable<T> is actually a collection in memory, you are running the risk of having to reproduce it each time you enumerate. This could be very costly:
If IEnumerable<T> comes from another LINQ expression, you spend CPU cycles to recompute the same thing,
If IEnumerable<T> comes from the database, you may end up re-reading the data from the server,
If IEnumerable<T> comes from a file, you will re-read the file.
None of the above has an effect on correctness, but each may dramatically decrease the speed, especially for large data sets. Since memory is relatively cheap these days, and the garbage collection system is pretty reliable, temporarily saving collections in a list or an array is an inexpensive way to avoid the problem.
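The re-computation is easy to observe. In the sketch below, LoadPaths is a hypothetical stand-in for an expensive source such as a file read or a database query, and Loads counts how often that source is actually run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ReenumerationDemo
{
    // How many times the "expensive" source has been read.
    public static int Loads;

    // Hypothetical stand-in for reading a file or querying a database.
    static IEnumerable<string> LoadPaths()
    {
        Loads++;
        yield return "a.txt";
        yield return "b.txt";
    }

    public static int EnumerateTwiceLazily()
    {
        Loads = 0;
        IEnumerable<string> paths = LoadPaths();
        foreach (var path in paths) { }   // first pass runs the iterator body
        foreach (var path in paths) { }   // second pass runs it again from the start
        return Loads;
    }

    public static int EnumerateTwiceMaterialized()
    {
        Loads = 0;
        string[] paths = LoadPaths().ToArray();   // the source is read once, here
        foreach (var path in paths) { }
        foreach (var path in paths) { }
        return Loads;
    }
}
```

The lazy version reads the source once per foreach, while the materialized version reads it once in total, at the cost of holding the results in memory.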

If it is an in-memory collection and not an abstract enumerable object, just use the appropriate interface (IReadOnlyCollection, IReadOnlyList, etc.) instead of IEnumerable. If you use IEnumerable, you should assume that it can be any IEnumerable implementation.

Push Item to the end of an array

No, I can't use generic Collections. What I am trying to do is pretty simple actually. In PHP I would do something like this
$foo = [];
$foo[] = 1;
What I have in C# is this
var foo = new int [10];
// yeah that's pretty much it
Now I can do something like foo[foo.Length - 1] = 1 but that obviously won't work. Another option is foo[foo.Count(x => x.HasValue)] = 1 along with a nullable int during declaration. But there has to be a simpler way around this trivial task.
This is homework and I don't want to explain to my teacher (and possibly the entire class) what foo[foo.Count(x => x.HasValue)] = 1 is and why it works etc.

The simplest way is to create a new class that holds the index of the inserted item:
public class PushPopIntArray
{
    private int[] _vals = new int[10];
    private int _nextIndex = 0;

    public void Push(int val)
    {
        if (_nextIndex >= _vals.Length)
            throw new InvalidOperationException("No more values left to push");
        _vals[_nextIndex] = val;
        _nextIndex++;
    }

    public int Pop()
    {
        if (_nextIndex <= 0)
            throw new InvalidOperationException("No more values left to pop");
        _nextIndex--;
        return _vals[_nextIndex];
    }
}
You could add overloads to get the entire array, or to index directly into it if you wanted. You could also add overloads or constructors to create different sized arrays, etc.

In C#, arrays cannot be resized dynamically. You can use Array.Resize (but this will probably be bad for performance, since it allocates a new array and copies the elements) or use the ArrayList type instead.
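For completeness, a minimal sketch of the Array.Resize route. The helper name Push and the class ArrayPush are made up for this example; the helper grows the array by one element on every call, which is exactly why it is slow for large arrays.

```csharp
using System;

static class ArrayPush
{
    // Replaces the caller's array with a copy that is one element longer
    // and stores the new value in the last slot.
    public static void Push(ref int[] array, int value)
    {
        Array.Resize(ref array, array.Length + 1);   // allocates a new array and copies
        array[array.Length - 1] = value;
    }
}
```

Usage would look like this: start with var foo = new int[0]; and call ArrayPush.Push(ref foo, 1); for each value to append.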

But there has to be a simpler way around this trivial task.
Nope. Not all languages make everything equally easy; this is why collections were invented. C# <> Python <> PHP <> Java. Pick whichever suits you better, but don't expect equivalent effort when moving from one language to another.

foo[foo.Length] won't work because index foo.Length is outside the array.
The last item is at index foo.Length - 1.
Beyond that, an array is a fixed-size structure; if you expect it to work the same as in PHP, you're simply mistaken.

Originally I wrote this as a comment, but I think it contains enough important points to warrant writing it as an answer.
You seem to be under the impression that C# is an awkward language because you stubbornly insist on using an array while having the requirement that you should "push items onto the end", as evidenced by this comment:
Isn't pushing items into the array kind of the entire purpose of the data structure?
To answer that: no, the purpose of the array data structure is to have a contiguous block of pre-allocated memory to mimic the original array structure in C(++) that you can easily index and perform pointer arithmetic on.
If you want a data structure that supports certain operations, such as pushing elements onto the end, consider a System.Collections.Generic.List<T>, or, if you insist on avoiding generics, a System.Collections.ArrayList. Some collection names hint at the underlying storage structure (such as ArrayList or LinkedList<T>), but in general the whole point of the C# library is that you don't want to concern yourself with such details: the List<T> class has certain guarantees on its operations (e.g. insertion at an arbitrary position is O(n), appending is amortized O(1), retrieval is O(1) -- just like an array), and how it actually holds the data (in practice, a backing array that is reallocated as the list grows) is an implementation detail.
Don't try to compare PHP and C# by comparing PHP arrays with C# arrays - they have different programming paradigms and the way to solve a problem in one does not necessarily carry over to the other.
To answer the question as written, I see two options then:
Use arrays the awkward way. Either create an array of Nullable<int>s and accept some boxing / unboxing and unpleasant LINQ statements for insertion; or keep an additional counter (preferably wrapped up in a class together with the array) to keep track of the last assigned element.
Use a proper data structure with appropriate guarantees on the operations that matter, such as List<T> which is effectively the (much better, optimised) built-in version of the second option above.
I understand that the latter option is not feasible for you because of the constraints imposed by your teacher, but then do not be surprised that things are harder than the canonical way in another language, if you are not allowed to use the canonical way in this language.
Afterthought:
A hybrid alternative that just came to mind, is using a List for storage and then just calling .ToArray on it. In your insert method, just Add to the list and return the new array.

C#'s `yield return` is creating a lot of garbage for me. Can it be helped?

I'm developing an Xbox 360 game with XNA. I'd really like to use C#'s yield return construct in a couple of places, but it seems to create a lot of garbage. Have a look at this code:
class ComponentPool<T> where T : DrawableGameComponent
{
    List<T> preallocatedComponents;

    public IEnumerable<T> Components
    {
        get
        {
            foreach (T component in this.preallocatedComponents)
            {
                // Enabled often changes during iteration over Components
                // for example, it's not uncommon for bullet components to get
                // disabled during collision testing
                // sorry I didn't make that clear originally
                if (component.Enabled)
                {
                    yield return component;
                }
            }
        }
    }
    ...
I use these component pools everywhere - for bullets, enemies, explosions; anything numerous and transient. I often need to loop over their contents, and I'm only ever interested in components that are active (i.e., Enabled == true), hence the behavior of the Components property.
Currently, I'm seeing as much as ~800K per second of additional garbage when using this technique. Is this avoidable? Is there another way to use yield return?
Edit: I found this question about the broader issue of how to iterate over a resource pool without creating garbage. A lot of commenters were dismissive, apparently not understanding the limitations of the Compact Framework, but this commenter was more sympathetic and suggested creating an iterator pool. That's the solution I'm going to use.

The implementation of iterators by the compiler does indeed use class objects and the use (with foreach, for example) of an iterator implemented with yield return will indeed cause memory to be allocated. In the scheme of things this is rarely a problem because either considerable work is done while iterating or considerably more memory is allocated doing other things while iterating.
In order for the memory allocated by an iterator to become a problem, your application must be data structure intensive and your algorithms must operate on objects without allocating any memory. Think of the Game of Life or something similar. Suddenly it is the iteration itself that overwhelms. And when the iteration allocates memory, a tremendous amount of memory can be allocated.
If your application fits this profile (and only if) then the first rule you should follow is:
avoid iterators in inner loops when a simpler iteration concept is available
For example, if you have an array- or list-like data structure, you are already exposing an indexer property and a count property, so clients can simply use a for loop instead of using foreach with your iterator. This is "easy money" to reduce GC and it doesn't make your code ugly or bloated, just a little less elegant.
The second principle you should follow is:
measure memory allocations to see when and where you should apply the first rule
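The first rule could be sketched as follows. Component is a hypothetical stand-in for XNA's DrawableGameComponent (only the Enabled flag matters here), and the Count, indexer and Add members are assumptions about how such a pool might be exposed.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for DrawableGameComponent.
class Component
{
    public bool Enabled;
}

class ComponentPool<T> where T : Component
{
    private readonly List<T> preallocatedComponents = new List<T>();

    // Count and indexer let callers iterate without any enumerator object.
    public int Count => preallocatedComponents.Count;
    public T this[int index] => preallocatedComponents[index];

    public void Add(T component) => preallocatedComponents.Add(component);
}

static class PoolDemo
{
    public static int CountEnabled(ComponentPool<Component> pool)
    {
        int enabled = 0;
        for (int i = 0; i < pool.Count; i++)
        {
            // The Enabled filter moves into the loop body instead of a yield-based iterator.
            if (pool[i].Enabled) enabled++;
        }
        return enabled;
    }
}
```

The caller repeats the Enabled check in each loop, which is less elegant than a filtered Components property, but the loop itself allocates nothing.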

Just for grins, try capturing the filter in a Linq query and holding onto the query instance. This might reduce memory reallocations each time the query is enumerated.
If nothing else, the statement preallocatedComponents.Where(r => r.Enabled) is a heck of a lot less code to look at to do the same thing as your yield return.
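class ComponentPool<T> where T : DrawableGameComponent
{
    // Initialized here so that the constructor below does not call Where on a null list.
    List<T> preallocatedComponents = new List<T>();
    IEnumerable<T> enabledComponentsFilter;

    public ComponentPool()
    {
        // The query is built once; it is still evaluated lazily on each enumeration.
        enabledComponentsFilter = this.preallocatedComponents.Where(r => r.Enabled);
    }

    public IEnumerable<T> Components
    {
        get { return enabledComponentsFilter; }
    }
    ...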

Avoiding array duplication

According to [MSDN: Array usage guidelines](http://msdn.microsoft.com/en-us/library/k2604h5s(VS.71).aspx):
Array Valued Properties
You should use collections to avoid code inefficiencies. In the following code example, each call to the myObj property creates a copy of the array. As a result, 2n+1 copies of the array will be created in the following loop.
[Visual Basic]
Dim i As Integer
For i = 0 To obj.myObj.Count - 1
DoSomething(obj.myObj(i))
Next i
[C#]
for (int i = 0; i < obj.myObj.Count; i++)
DoSomething(obj.myObj[i]);
Other than the change from myObj[] to ICollection myObj, what else would you recommend? Just realized that my current app is leaking memory :(
Thanks;
EDIT: Would forcing C# to pass references w/ ref (safety aside) improve performance and/or memory usage?

No, it isn't leaking memory - it is just making the garbage collector work harder than it might. Actually, the MSDN article is slightly misleading: if the property created a new collection every time it was called, it would be just as bad (memory wise) as with an array. Perhaps worse, due to the usual over-sizing of most collection implementations.
If you know a method/property does work, you can always minimise the number of calls:
var arr = obj.myObj; // var since I don't know the type!
for (int i = 0; i < arr.Length; i++) {
    DoSomething(arr[i]);
}
or even easier, use foreach:
foreach(var value in obj.myObj) {
    DoSomething(value);
}
Both approaches only call the property once. The second is clearer IMO.
Other thoughts; name it a method! i.e. obj.SomeMethod() - this sets expectation that it does work, and avoids the undesirable obj.Foo != obj.Foo (which would be the case for arrays).
Finally, Eric Lippert has a good article on this subject.

Just as a hint for those who haven't used the ReadOnlyCollection mentioned in some of the answers:
[C#]
class XY
{
    private X[] array;

    public ReadOnlyCollection<X> myObj
    {
        get
        {
            return Array.AsReadOnly(array);
        }
    }
}
Hope this might help.

Whenever I have properties that are costly (like recreating a collection on each call), I either document the property, stating that each call incurs a cost, or I cache the value in a private field. Property getters that are costly should be written as methods.
Generally, I try to expose collections as IEnumerable rather than arrays, forcing the consumer to use foreach (or an enumerator).

It will not make copies of the array unless you make it do so. However, simply passing the reference to an array privately owned by an object has some nasty side-effects. Whoever receives the reference is basically free to do whatever he likes with the array, including altering the contents in ways that cannot be controlled by its owner.
One way of preventing unauthorized meddling with the array is to return a copy of the contents. Another (slightly better) is to return a read-only collection.
Still, before doing any of these things you should ask yourself if you are about to give away too much information. In some cases (actually, quite often) it is even better to keep the array private and instead provide methods that operate on the object owning it.

myObj will not create a new item unless you explicitly create one. So, for better memory usage, I recommend using a private collection (List or any other) and exposing an indexer that returns the specified value from the private collection.
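A minimal sketch contrasting the two designs, assuming an array-valued property that clones on every call (the situation the MSDN guideline warns about). The class Samples and its CopiesMade counter are hypothetical names used only to make the cost visible.

```csharp
using System;

class Samples
{
    private readonly int[] values = { 1, 2, 3 };

    // Counts how many defensive copies the property has handed out.
    public int CopiesMade;

    // Array-valued property that protects its state by cloning on every call.
    public int[] Values
    {
        get { CopiesMade++; return (int[])values.Clone(); }
    }

    // Alternative: keep the array private and expose only a count and an indexer.
    public int Count => values.Length;
    public int this[int index] => values[index];
}

static class SamplesDemo
{
    public static int SumViaProperty(Samples s)
    {
        int sum = 0;
        // The property is read in the loop condition and again in the body.
        for (int i = 0; i < s.Values.Length; i++) sum += s.Values[i];
        return sum;
    }

    public static int SumViaIndexer(Samples s)
    {
        int sum = 0;
        for (int i = 0; i < s.Count; i++) sum += s[i];
        return sum;
    }
}
```

With n = 3 elements, SumViaProperty triggers 2n+1 = 7 copies (four condition checks plus three body reads), matching the figure in the guideline, while SumViaIndexer triggers none.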
