While looking at the Implementation of List.AddRange i found something odd i do not understand.
Sourcecode, see line 727 (AddRange calls InsertRange)
T[] itemsToInsert = new T[count];
c.CopyTo(itemsToInsert, 0);
itemsToInsert.CopyTo(_items, index);
Why doest it Copy the collection into a "temp-array" (itemsToInsert) first and then copies the temp array into the actual _items-array?
Is there any reason behind this, or is this just some leftover from copying ArrayList's source, because the same thing happens there.
My guess is that this is to hide the existence of the internal backing array. There is no way to obtain a reference to that array which is intentional. The List class does not even promise that there is such an array. (Of course, for performance and for compatibility reasons it will always be implemented with an array.)
Someone could pass in a crafted ICollection<T> that remembers the array that it is passed. Now callers can mess with the internal array of List and start depending on List internals.
Contrast this with MemoryStream which has a documented way to access the internal buffer (and shoot yourself with it): GetBuffer().
Related
Looking over the source of List<T>, it seems that there's no good way to access the private _items array of items.
What I need is basically a dynamic list of structs, which I can then modify in place. From my understanding, because C# 6 doesn't yet support ref return types, you can't have a List<T> return a reference to an element, which requires copying of the whole item, for example:
struct A {
public int X;
}
void Foo() {
var list = new List<A> { new A { X = 3; } };
list[0].X++; // this fails to compile, because the indexer returns a copy
// a proper way to do this would be
var copy = list[0];
copy.X++;
list[0] = copy;
var array = new A[] { new A { X = 3; } };
array[0].X++; // this works just fine
}
Looking at this, it's both clunky from syntax point of view, and possibly much slower than modifying the data in place (Unless the JIT can do some magic optimizations for this specific case? But I doubt they could be relied on in the general case, unless it's a special standardized optimization?)
Now if List<T>._items was protected, one could at least subclass List<T> and create a data structure with specific modify operations available. Is there another data structure in .NET that allows this, or do I have to implement my own dynamic array?
EDIT: I do not want any form of boxing or introducing any form of reference semantics. This code is intended for very high performance, and the reason I'm using an array of structs is to have them tighly packed on memory (and not everywhere around heap, resulting in cache misses).
I want to modify the structs in place because it's part of a performance critical algorithm that stores some of it's data in those structs.
Is there another data structure in .NET that allows this, or do I have to implement my own dynamic array?
Neither.
There isn't, and can't be, a data structure in .NET that avoids the structure copy, because deep integration with the C# language is needed to get around the "indexed getter makes a copy" issue. So you're right to think in terms of directly accessing the array.
But you don't have to build your own dynamic array from scratch. Many List<T>-like operations such as Resize and bulk movement of items are provided for you as static methods on type System.Array. They come in generic flavors, so no boxing is involved.
The unfortunate thing is that the high-performance Buffer.BlockCopy, which should work on any blittable type, actually contains a hard-coded check for primitive types and refuses to work on any structure.
So just go with T[] (plus int Count -- array length isn't good enough because trying to keep capacity equal to count is very inefficient) and use System.Array static methods when you would otherwise use methods of List<T>. If you wrap this as a PublicList<T> class, you can get reusability and both the convenience of methods for Add, Insert, Sort as well as direct element access by indexing directly on the array. Just exercise some restraint and never store the handle to the internal array, because it will become out-of-date the next time the list needs to grow its capacity. Immediate direct access is perfectly fine though.
Ok, maybe I'm just lazy but this might be a cool question to have on the interwebs.
I know that Buffer.BlockCopy(...) is faster than Array.Copy(...) when working with byte[]. I was about to write a CloneBuffer helper that would create an array the same size as a source array then copy the source array into it using Buffer.BlockCopy(...) when I instead wrote:
public void Send(byte[] data) {
// Copy caller-provided buffer
var buf = data.ToArray();
// Start async send here and return immediately
}
Does anyone know if the ToArray() method special-cased for byte[] or if this is going to be slower than BlockCopy?
You can look into the Microsoft .NET assemblies using a reflector program, such as ILSpy.
This tells me that the implementation of System.Linq.Enumerable::ToArray() is:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
// ...
return new Buffer<TSource>(source).ToArray();
}
And the constructor of the internal struct Buffer<T> does:
If the source enumerable implements ICollection<T>, then:
allocate an array of Count elements, and
use CopyTo() to copy the collection into the array.
Otherwise:
allocate an array of 4 elements, and
start enumerating the IEnumerable, storing each value in the array.
Is the array too small?
Create a new array that has twice the size of the old one,
and copy the old array's content into the new one,
then use the new array instead, and continue.
And Buffer<T>.ToArray() simply returns the inner array if its size matches the number of elements in it; otherwise copies the inner array to a new array with the exact size.
Note that this Buffer<T> class is internal and not related to the Buffer class you mentioned.
All copying is done using Array.Copy().
So, to conclude: all copying is done using Array.Copy() and there is no optimization for byte arrays. But I don't know whether it is slower than Buffer.BlockCopy(). The only way to know is to measure.
Yes, it is going to be slower.
When you look at the documentation for the System.Array methods, there is no definition for System.Array.ToArray(). In fact, looking at the inheritance/interface tree, it's all the way we have to go back all the way to [IEnumerable.ToArray()][2] before we find this method. Since this was implemented with only the features of IEnumerable to work with, it can't know the size of the resulting array when it begins executing. Instead, it uses a doubling algorithm to build up the array as it runs. So you might end up creating and throwing away several arrays over the course of making the copy, and copying those initial items several time in the course of destroying/recreating each intermediate buffer.
If you want a simpler, naive implementation, at least look at Array.CopyTo(). And remember: I said, "If".
I have been playing around with the BlockingCollection class, and I was wondering why the ToArray() Method is an O(n) operation. Coming from a Java background, the ArrayList's ToArray() method runs in O(1), because it just returns the internal array it uses (elementData). So why in the world do they iterate through all of the items, and create a new Array in the IEnumerable.ToArray method, when they could just override it and return the internal array the collection uses?
Coming from a Java background, the ArrayList's ToArray() method runs in O(1), because it just returns the internal array it uses (elementData).
No, it really doesn't. It creates a copy of the array. From the docs for ArrayList.toArray:
Returns an array containing all of the elements in this list in proper sequence (from first to last element).
The returned array will be "safe" in that no references to it are maintained by this list. (In other words, this method must allocate a new array). The caller is thus free to modify the returned array.
So basically, the premise of your question is flawed in the Java sense.
Now, beyond that, Enumerable.ToArray (the extension method on IEnumerable<T>) in general would be O(N), as there's no guarantee that the sequence is even backed by an array. When it's backed by an IList<T>, it uses IList<T>.CopyTo to make things more efficient, but this is an implementation-specific detail and still doesn't transform it into an O(1) operation.
ArrayList.toArray is not O(1), and it does not just return its internal array. Did you read the API specification?
The returned array will be "safe" in that no references to it are maintained by this list. (In other words, this method must allocate a new array). The caller is thus free to modify the returned array.
First, there's no array to return. BlockingCollection<T> uses an object of type IProducerConsumerCollection<T> for its internal storage, and there's no guarantee that the concrete type being used will be backed by an array. For example the default constructor uses a ConcurrentQueue<T>, which stores its data in a linked list of arrays. Even in the odd case where there is an array which represents the full contents of the collection hiding somewhere in there it won't be exposed through the IProducerConsumerCollection<T> interface.
Second, even assuming there were an array to be returned in the first place (which there isn't), it wouldn't be a safe thing to do. If the calling code made any modifications to the array it would corrupt the internal state of the collection.
I usually find myself doing something like:
string[] things = arrayReturningMethod();
int index = things.ToList<string>.FindIndex((s) => s.Equals("FOO"));
//do something with index
return things.Distinct(); //which returns an IEnumerable<string>
and I find all this mixup of types/interface a bit confusing and it tickles my potential performance problem antennae (which I ignore until proven right, of course).
Is this idiomatic and proper C# or is there a better alternative to avoid casting back and forth to access the proper methods to work with the data?
EDIT:
The question is actually twofold:
When is it proper to use either the IEnumerable interface or an array or a list (or any other IEnumerable implementing type) directly (when accepting parameters)?
Should you freely move between IEnumerables (implementation unknown) and lists and IEnumerables and arrays and arrays and Lists or is that non idiomatic (there are better ways to do it)/ non performant (not typically relevant, but might be in some cases) / just plain ugly (unmaintable, unreadable)?
In regards to performance...
Converting from List to T[] involves copying all the data from the original list to a newly allocated array.
Converting from T[] to List also involves copying all the data from the original list to a newly allocated List.
Converting from either List or T[] to IEnumerable involves casting, which is a few CPU cycles.
Converting from IEnumerable to List involves upcasting, which is also a few CPU cycles.
Converting from IEnumerable to T[] also involves upcasting.
You can't cast an IEnumerable to T[] or List unless it was a T[] or List respectively to begin with. You can use the ToArray or ToList functions, but those will also result in a copy being made.
Accessing all the values in order from start to end in a T[] will, in a straightforward loop, be optimized to use straightforward pointer arithmetic -- which makes it the fastest of them all.
Accessing all the values in order from start to end in a List involves a check on each iteration to make sure that you aren't accessing a value outside the array's bounds, and then the actual accessing of the array value.
Accessing all the values in an IEnumerable involves creating an enumerator object, calling the Next() function which increases the index pointer, and then calling the Current property which gives you the actual value and sticks it in the variable that you specified in your foreach statement. Generally, this isn't as bad as it sounds.
Accessing an arbitrary value in an IEnumerable involves starting at the beginning and calling Next() as many times as you need to get to that value. Generally, this is as bad as it sounds.
In regards to idioms...
In general, IEnumerable is useful for public properties, function parameters, and often for return values -- and only if you know that you're going to be using the values sequentially.
For instance, if you had a function PrintValues, if it was written as PrintValues(List<T> values), it would only be able to deal with List values, so the user would first have to convert, if for instance they were using a T[]. Likewise with if the function was PrintValues(T[] values). But if it was PrintValues(IEnumerable<T> values), it would be able to deal with Lists, T[]s, stacks, hashtables, dictionaries, strings, sets, etc -- any collection that implements IEnumerable, which is practically every collection.
In regards to internal use...
Use a List only if you're not sure how many items will need to be in it.
Use a T[] if you know how many items will need to be in it, but need to access the values in an arbitrary order.
Stick with the IEnumerable if that's what you've been given and you just need to use it sequentially. Many functions will return IEnumerables. If you do need to access values from an IEnumerable in an arbitrary order, use ToArray().
Also, note that casting is different from using ToArray() or ToList() -- the latter involves copying the values, which is indeed a performance and memory hit if you have a lot of elements. The former simply is to say that "A dog is an animal, so like any animal, it can eat" (downcast) or "This animal happens to be a dog, so it can bark" (upcast). Likewise, All Lists and T[]s are IEnumerables, but only some IEnumerables are Lists or T[]s.
A good rule of thumb is to always use IEnumerable (when declaring your variables/method parameters/method return types/properties/etc.) unless you have a good reason not to. By far the most type-compatible with other (especially extension) methods.
Well, you've got two apples and an orange that you are comparing.
The two apples are the array and the List.
An array in C# is a C-style array that has garbage collection built in. The upside of using them it that they have very little overhead, assuming you don't need to move things around. The bad thing is that they are not as efficient when you are adding things, removing things, and otherwise changing the array around, as memory gets shuffled around.
A List is a C# style dynamic array (similar to the vector<> class in C++). There is more overhead, but they are more efficient when you need to be moving things around a lot, as they will not try to keep the memory usage contiguous.
The best comparison I could give is saying that arrays are to Lists as strings are to StringBuilders.
The orange is 'IEnumerable'. This is not a datatype, but rather it is an interface. When a class implements the IEnumerable interface, it allows that object to be used in a foreach() loop.
When you return the list (as you did in your example), you were not converting the list to an IEnumerable. A list already is an IEnumerable object.
EDIT: When to convert between the two:
It depends on the application. There is very little that can be done with an array that cannot be done with a List, so I would generally recommend the List. Probably the best thing to do is to make a design decision that you are going to use one or the other, that way you don't have to switch between the two. If you rely on an external library, abstract it away to maintain consistent usage.
Hope this clears a little bit of the fog.
Looks to me like the problem is that you haven't bothered learning how to search an array. Hint: Array.IndexOf or Array.BinarySearch depending on whether the array is sorted.
You're right that converting to a list is a bad idea: it wastes space and time and makes the code less readable. Also, blindly upcasting to IEnumerable slows matters down and also completely prevents use of certain algorithms (such as binary search).
I try to avoid rapidly jumping between data types if it can be avoided.
It must be the case that each situation similar to that you described is sufficiently different so as to prevent a dogmatic rule about transforming your types; however, it is generally good practice to select a data structure that provides as best as possible the interface you need without having to copying elements needlessly to new data structures.
When to use what?
I would suggest returning the most specific type, and taking in the most flexible type.
Like this:
public int[] DoSomething(IEnumerable<int> inputs)
{
//...
}
public List<int> DoSomethingElse(IList<int> inputs)
{
//...
}
That way you can call methods on List< T > for whatever you get back from the method in addition to treating it as an IEnumerable. On the inputs, use as flexible as possible, so you don't dictate the users of your method what kind of collection to create.
You're right to ignore the 'performance problem' antennae until you actually have a performance problem. Most performance problems come from doing too much I/O or too much locking or doing one of them wrong, and none of these apply to this question.
My general approach is:
Use T[] for 'static' or 'snapshot'-style information. Use for things where calling .Add() wouldn't make sense anyway, and you don't need the extra methods List<T> gives you.
Accept IEnumerable<T> if you don't really care what you're given and don't need a constant-time .Length/.Count.
Only return IEnumerable<T> when you're doing simple manipulations of an input IEnumerable<T> or when you specifically want to make use of the yield syntax to do your work lazily.
In all other cases, use List<T>. It's just too flexible.
Corollary to #4: don't be afraid of ToList(). ToList() is your friend. It forces the IEnumerable<T> to evaluate right then (useful for when you're stacking several where clauses). Don't go nuts with it, but feel free to call it once you've built up your full where clause before you do the foreach over it (or the like).
Of course, this is just a rough guideline. Just please try to follow the same pattern in the same codebase -- code styles that jump around make it harder for maintenance coders to get into your frame of mind.
If my understanding of deep and shallow copying is correct my question is an impossible one.
If you have an array (a[10]) and perform a shallow copy (b[20]) wouldn't this be impossible as the data in b wouldn't be contiguous?
If i've got this completely wrong could someone advise a fast way to immitate (in c#) c++'s ability to do a realloc in order to resize an array.
NOTE
Im looking at the .Clone() and .Copy() members of the System.Array object.
You can't resize an existing array, however, you can use:
Array.Resize(ref arr, newSize);
This allocates a new array, copies the data from the old array into the new array, and updates the arr variable (which is passed by-ref in this case). Is that what you mean?
However, any other references still pointing at the old array will not be updated. A better option might be to work with List<T> - then you don't need to resize it manually, and you don't have the issue of out-of-date references. You just Add/Remove etc. Generally, you don't tend to use arrays directly very often. They have their uses, but they aren't the default case.
Re your comments;
boxing: List<T> doesn't box. That is one of the points about generics; under the hood, List<T> is a wrapper around T[], so a List<int> has an int[] - no boxing. The older ArrayList is a wrapper around object[], so that does box; of course, boxing isn't as bad as you might assume anyway.
workings of Array.Resize; if I recall, it finds the size of T, then uses Buffer.BlockCopy to blit the contents the actual details are hidden by an internal call - but essentially after allocating a new array it is a blit (memcpy) of the data between the two arrays, so it should be pretty quick; note that for reference-types this only copies the reference, not the object on the heap. However, if you are resizing regularly, List<T> would usually be a lot simpler (and quicker unless you basically re-implement what List<T> does re spare capacity to minimise the number of resizes).