I have been playing around with the BlockingCollection class, and I was wondering why the ToArray() method is an O(n) operation. Coming from a Java background, the ArrayList's ToArray() method runs in O(1), because it just returns the internal array it uses (elementData). So why in the world do they iterate through all of the items and create a new array in the IEnumerable.ToArray method, when they could just override it and return the internal array the collection uses?
Coming from a Java background, the ArrayList's ToArray() method runs in O(1), because it just returns the internal array it uses (elementData).
No, it really doesn't. It creates a copy of the array. From the docs for ArrayList.toArray:
Returns an array containing all of the elements in this list in proper sequence (from first to last element).
The returned array will be "safe" in that no references to it are maintained by this list. (In other words, this method must allocate a new array). The caller is thus free to modify the returned array.
So basically, the premise of your question is flawed in the Java sense.
Now, beyond that, Enumerable.ToArray (the extension method on IEnumerable<T>) would in general be O(N), as there's no guarantee that the sequence is even backed by an array. When the source implements ICollection<T>, it uses ICollection<T>.CopyTo to make things more efficient, but this is an implementation-specific detail and still doesn't transform it into an O(1) operation.
ArrayList.toArray is not O(1), and it does not just return its internal array. Did you read the API specification?
The returned array will be "safe" in that no references to it are maintained by this list. (In other words, this method must allocate a new array). The caller is thus free to modify the returned array.
First, there's no array to return. BlockingCollection<T> uses an object of type IProducerConsumerCollection<T> for its internal storage, and there's no guarantee that the concrete type being used will be backed by an array. For example, the default constructor uses a ConcurrentQueue<T>, which stores its data in a linked list of arrays. Even in the odd case where an array representing the full contents of the collection is hiding somewhere in there, it won't be exposed through the IProducerConsumerCollection<T> interface.
Second, even assuming there were an array to be returned in the first place (which there isn't), it wouldn't be a safe thing to do. If the calling code made any modifications to the array it would corrupt the internal state of the collection.
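To make the "copy, not the internal storage" point concrete, here is a minimal sketch. The class and method names (SnapshotDemo, SnapshotIsIndependent) are made up for illustration; it only relies on BlockingCollection<T>.Add, ToArray and Take, and on the default backing store being a FIFO ConcurrentQueue<T> as described above.

```csharp
using System;
using System.Collections.Concurrent;

public static class SnapshotDemo
{
    // Returns true when changing the array from ToArray() leaves the collection untouched.
    public static bool SnapshotIsIndependent()
    {
        using (var collection = new BlockingCollection<int>())
        {
            collection.Add(1);
            collection.Add(2);
            collection.Add(3);

            int[] snapshot = collection.ToArray(); // copies the current contents
            snapshot[0] = 99;                      // only the copy changes

            // Default backing store is a ConcurrentQueue<T>, so Take() is first-in, first-out.
            return collection.Take() == 1;
        }
    }

    public static void Main()
    {
        Console.WriteLine(SnapshotIsIndependent()); // True
    }
}
```

Because the caller only ever gets a snapshot, nothing it does to the array can corrupt the collection's state.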
I'm learning C# and basically know the difference between arrays and Lists: the latter is generic and can grow dynamically. But I'm wondering:
are List elements sequentially located in heap like array or is each element located "randomly" in a different locations?
and if that is true, does that affect the speed of access & data retrieval from memory?
and if that is true, is this what makes arrays a little faster than Lists?
Let's see the second and the third questions first:
and if that is true, does that affect the speed of access & data retrieval from memory?
and if that is true, is this what makes arrays a little faster than Lists?
There is only a single type of "native" collection in .NET (by .NET I mean the CLR, i.e. the runtime): the array. (Technically, if you consider a string a type of collection, there are two native collection types. Also technically, not every array is a "native" array: only single-dimensional, zero-based arrays are. Arrays of type T[,] aren't, and neither are arrays whose first element doesn't have index 0.) Every other collection (other than LinkedList<>) is built atop it. If you look at List<T> with ILSpy, you'll see that at its base there is a T[] plus an int for the Count (the T[].Length is the Capacity). Clearly an array is a little faster than a List<T>, because using it involves one less indirection: you access the array directly, instead of accessing the list, which in turn accesses the array.
Let's see the first question:
are List elements sequentially located in heap like array or is each element located "randomly" in a different locations?
Being based on an array internally, the List<> clearly stores its elements like an array, that is, in a contiguous block of memory. Be aware that with a List<SomeObject> where SomeObject is a reference type, the list is a list of references, not of objects, so it is the references that are stored in a contiguous block of memory. (We will ignore that, with the advanced memory management of modern computers, "contiguous block of memory" isn't exact; it would be better to say "a contiguous block of addresses".)
(yes, even Dictionary<> and HashSet<> are built atop arrays. Conversely a tree-like collection could be built without using an array, because it's more similar to a LinkedList)
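The Count/Capacity relationship mentioned above can be observed from the outside. The following is only a sketch: CapacityDemo and CountReallocations are hypothetical names, and the exact growth policy is an implementation detail, so the code only assumes that Capacity changes when the backing array is replaced.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityDemo
{
    // Adds 'count' items and returns how many times Capacity changed,
    // i.e. how many times the backing array was replaced by a larger one.
    public static int CountReallocations(int count)
    {
        var list = new List<int>();
        int reallocations = 0;
        int lastCapacity = list.Capacity;

        for (int i = 0; i < count; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                reallocations++;
                lastCapacity = list.Capacity;
            }
        }
        return reallocations;
    }

    public static void Main()
    {
        var list = new List<int> { 10, 20, 30 };
        Console.WriteLine(list.Count <= list.Capacity); // True

        // Far fewer reallocations than additions; the exact number depends on the runtime.
        Console.WriteLine(CountReallocations(1000));
    }
}
```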
Some additional details: there are four groups of instructions in the CIL language (the intermediate language used in compiled .NET programs) that are used with "native" arrays:
Newarr
Ldelem and family Ldelem_*
Stelem and family Stelem_*
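Readonly (a prefix rather than an instruction in its own right: placed before Ldelema, it skips the run-time type check on the array element and yields a controlled-mutability managed pointer)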
if you look at OpCodes.Newarr you'll see this comment in the XML documentation:
// Summary:
// Pushes an object reference to a new zero-based, one-dimensional array whose
// elements are of a specific type onto the evaluation stack.
Yes, elements in a List are stored contiguously, just like an array. A List actually uses arrays internally, but that is an implementation detail that you shouldn't really need to be concerned with.
Of course, in order to get the correct impression from that statement, you also have to understand a bit about memory management in .NET. Namely, the difference between value types and reference types, and how objects of those types are stored. Value types will be stored in contiguous memory. With reference types, the references will be stored in contiguous memory, but not the instances themselves.
The advantage of using a List is that the logic inside of the class handles allocating and managing the items for you. You can add elements anywhere, remove elements from anywhere, and grow the entire size of the collection without having to do any extra work. This is, of course, also what makes a List slightly slower than an array. If any reallocation has to happen in order to comply with your request, there'll be a performance hit as a new, larger-sized array is allocated and the elements are copied to it. But it won't be any slower than if you wrote the code to do it manually with a raw array.
If your length requirement is fixed (i.e., you never need to grow/expand the total capacity of the array), you can go ahead and use a raw array. It might even be marginally faster than a List because it avoids the extra overhead and indirection (although that is subject to being optimized out by the JIT compiler).
If you need to be able to dynamically resize the collection, or you need any of the other features provided by the List class, just use a List. The performance difference will be virtually imperceptible.
While looking at the implementation of List.AddRange I found something odd that I do not understand.
Sourcecode, see line 727 (AddRange calls InsertRange)
T[] itemsToInsert = new T[count];
c.CopyTo(itemsToInsert, 0);
itemsToInsert.CopyTo(_items, index);
Why does it copy the collection into a temporary array (itemsToInsert) first and then copy the temporary array into the actual _items array?
Is there any reason behind this, or is this just some leftover from copying ArrayList's source, because the same thing happens there.
My guess is that this is to hide the existence of the internal backing array. There is no way to obtain a reference to that array which is intentional. The List class does not even promise that there is such an array. (Of course, for performance and for compatibility reasons it will always be implemented with an array.)
Someone could pass in a crafted ICollection<T> that remembers the array that it is passed. Now callers can mess with the internal array of List and start depending on List internals.
Contrast this with MemoryStream which has a documented way to access the internal buffer (and shoot yourself with it): GetBuffer().
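To illustrate what such a "crafted" collection could look like, here is a rough sketch. SpyCollection<T> is a hypothetical type invented for this example. Note the assumption: whether the array it captures is a temporary copy (as in the source quoted in the question) or the list's own storage depends on the List<T> implementation of the runtime you run it on, so the sketch deliberately does not claim either.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical "spy" collection: it remembers the array handed to CopyTo.
public sealed class SpyCollection<T> : ICollection<T>
{
    private readonly List<T> _inner;
    public T[] Captured { get; private set; }

    public SpyCollection(IEnumerable<T> items) { _inner = new List<T>(items); }

    public int Count { get { return _inner.Count; } }
    public bool IsReadOnly { get { return true; } }

    public void CopyTo(T[] array, int arrayIndex)
    {
        Captured = array;                 // keep a reference to the caller's array
        _inner.CopyTo(array, arrayIndex);
    }

    public bool Contains(T item) { return _inner.Contains(item); }
    public IEnumerator<T> GetEnumerator() { return _inner.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    public void Add(T item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Remove(T item) { throw new NotSupportedException(); }
}

public static class SpyDemo
{
    public static void Main()
    {
        var spy = new SpyCollection<int>(new[] { 1, 2, 3 });
        var list = new List<int>();
        list.AddRange(spy);

        Console.WriteLine(list.Count); // 3
        // spy.Captured now refers to whatever array AddRange passed to CopyTo.
        Console.WriteLine(spy.Captured.Length);
    }
}
```

If the captured array were the list's internal one, writing to spy.Captured would change the list behind its back, which is exactly the leak the temporary array guards against.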
Why is it that I cannot use the normal array functions in C#, like:
string[] k = {"Hello" , "There"};
k.RemoveAt(index); //Not possible
Code completion comes with suggestions like All<>, Any<>, Cast<> or Average<>, but no function to remove strings from the array. This happens with all kind of arrays. Is this because my build target is set to .NET 4.5.1?
You cannot "Add" or "Remove" items from an array, nor should you, as arrays are defined to be a fixed size. The functions you mention (All, Any) are there because T[] implements IEnumerable<T>, so you get access to the LINQ extension methods.
While an array does implement IList<T>, the methods that would change its size (Add, Insert, Remove, RemoveAt) throw a NotSupportedException. In your case, to "remove" the string, just do:
k[index] = String.Empty; //Or null, whichever you prefer
The length of an array is fixed when it's created and doesn't change; it represents a block of memory. Arrays do actually implement IList/IList<T>, but only partially: any method that tries to change the array's size is only available after casting and will throw an exception. Arrays are used internally in most collections.
If you need to add and remove arbitrarily and have fast access by index, you should use a List<T>, which uses a resizing array internally.
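As a small illustration of both points, here is a sketch (RemoveDemo, WithoutAt and ArrayRemoveAtThrows are made-up names): copying the array into a List<T> to remove an element, and what happens if you call RemoveAt through the array's IList<T> view.

```csharp
using System;
using System.Collections.Generic;

public static class RemoveDemo
{
    // Returns a new array without the element at 'index'; the input is not modified.
    public static string[] WithoutAt(string[] source, int index)
    {
        var list = new List<string>(source); // copies the array into a resizable list
        list.RemoveAt(index);
        return list.ToArray();
    }

    // Returns true if calling RemoveAt through the IList<T> view of an array throws.
    public static bool ArrayRemoveAtThrows()
    {
        IList<string> view = new[] { "Hello", "There" };
        try
        {
            view.RemoveAt(0);
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }

    public static void Main()
    {
        string[] k = { "Hello", "There" };
        Console.WriteLine(string.Join(",", WithoutAt(k, 0))); // There
        Console.WriteLine(ArrayRemoveAtThrows());             // True
    }
}
```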
Ok, maybe I'm just lazy but this might be a cool question to have on the interwebs.
I know that Buffer.BlockCopy(...) is faster than Array.Copy(...) when working with byte[]. I was about to write a CloneBuffer helper that would create an array the same size as a source array then copy the source array into it using Buffer.BlockCopy(...) when I instead wrote:
public void Send(byte[] data) {
// Copy caller-provided buffer
var buf = data.ToArray();
// Start async send here and return immediately
}
Does anyone know if the ToArray() method is special-cased for byte[], or if this is going to be slower than BlockCopy?
You can look into the Microsoft .NET assemblies using a reflector program, such as ILSpy.
This tells me that the implementation of System.Linq.Enumerable::ToArray() is:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
// ...
return new Buffer<TSource>(source).ToArray();
}
And the constructor of the internal struct Buffer<T> does:
If the source enumerable implements ICollection<T>, then:
    allocate an array of Count elements, and
    use CopyTo() to copy the collection into the array.
Otherwise:
    allocate an array of 4 elements, and
    start enumerating the IEnumerable, storing each value in the array.
    Is the array too small?
        Create a new array that has twice the size of the old one,
        and copy the old array's content into the new one,
        then use the new array instead, and continue.
And Buffer<T>.ToArray() simply returns the inner array if its size matches the number of elements in it; otherwise copies the inner array to a new array with the exact size.
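The steps described above can be sketched roughly as follows. This is a simplified illustration written for this answer, not the framework's actual source; GrowingCopy, ToArraySketch and Lazy are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class GrowingCopy
{
    public static T[] ToArraySketch<T>(IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        // Fast path: the size is known up front, so allocate once and copy.
        if (source is ICollection<T> collection)
        {
            var exact = new T[collection.Count];
            collection.CopyTo(exact, 0);
            return exact;
        }

        // Slow path: start with 4 slots and double whenever the buffer is full.
        var items = new T[4];
        int count = 0;
        foreach (T item in source)
        {
            if (count == items.Length)
            {
                var bigger = new T[items.Length * 2];
                Array.Copy(items, bigger, count);
                items = bigger;
            }
            items[count++] = item;
        }

        // Return the buffer as-is if it is exactly full, otherwise trim it.
        if (count == items.Length) return items;
        var result = new T[count];
        Array.Copy(items, result, count);
        return result;
    }

    // A lazy sequence that does not implement ICollection<T>, to exercise the slow path.
    public static IEnumerable<int> Lazy(int n)
    {
        for (int i = 0; i < n; i++) yield return i;
    }

    public static void Main()
    {
        Console.WriteLine(ToArraySketch(Lazy(10)).Length);               // 10
        Console.WriteLine(ToArraySketch(new List<int> { 1, 2 }).Length); // 2
    }
}
```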
Note that this Buffer<T> class is internal and not related to the Buffer class you mentioned.
All copying is done using Array.Copy().
So, to conclude: all copying is done using Array.Copy() and there is no optimization for byte arrays. But I don't know whether it is slower than Buffer.BlockCopy(). The only way to know is to measure.
Yes, it is going to be slower.
When you look at the documentation for the System.Array methods, there is no definition for System.Array.ToArray(). In fact, looking at the inheritance/interface tree, we have to go all the way back to Enumerable.ToArray() (the extension method on IEnumerable<T>) before we find this method. In the general case it has only the features of IEnumerable<T> to work with, so it can't know the size of the resulting array when it begins executing. Instead, it uses a doubling algorithm to build up the array as it runs, so it might end up creating and throwing away several arrays over the course of making the copy, and copying the initial items several times along the way. (When the source implements ICollection<T>, as arrays do, the implementation can read Count, allocate once, and call CopyTo, but it still goes through interface checks that a direct copy avoids.)
If you want a simpler, naive implementation, at least look at Array.CopyTo(). And remember: I said, "If".
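For reference, here is a minimal sketch of the copying options discussed in this thread. ByteCopy and CloneWithBlockCopy are made-up names, and the sketch makes no claim about which option is fastest; that has to be measured on your own workload.

```csharp
using System;
using System.Linq;

public static class ByteCopy
{
    // Allocates a same-sized array and copies the bytes with Buffer.BlockCopy.
    public static byte[] CloneWithBlockCopy(byte[] source)
    {
        var copy = new byte[source.Length];
        Buffer.BlockCopy(source, 0, copy, 0, source.Length); // the count is in bytes
        return copy;
    }

    public static void Main()
    {
        byte[] data = { 1, 2, 3, 4 };

        byte[] a = CloneWithBlockCopy(data);
        byte[] b = (byte[])data.Clone(); // Array.Clone makes a shallow copy
        byte[] c = data.ToArray();       // LINQ's Enumerable.ToArray

        Console.WriteLine(a.SequenceEqual(data) && b.SequenceEqual(data) && c.SequenceEqual(data)); // True
        Console.WriteLine(ReferenceEquals(a, data)); // False
    }
}
```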
I usually find myself doing something like:
string[] things = arrayReturningMethod();
int index = things.ToList<string>().FindIndex((s) => s.Equals("FOO"));
//do something with index
return things.Distinct(); //which returns an IEnumerable<string>
and I find all this mixup of types/interface a bit confusing and it tickles my potential performance problem antennae (which I ignore until proven right, of course).
Is this idiomatic and proper C# or is there a better alternative to avoid casting back and forth to access the proper methods to work with the data?
EDIT:
The question is actually twofold:
When is it proper to use either the IEnumerable interface or an array or a list (or any other IEnumerable implementing type) directly (when accepting parameters)?
Should you freely move between IEnumerables (implementation unknown), Lists, and arrays, or is that non-idiomatic (there are better ways to do it), non-performant (not typically relevant, but might be in some cases), or just plain ugly (unmaintainable, unreadable)?
In regards to performance...
Converting from List to T[] involves copying all the data from the original list to a newly allocated array.
Converting from T[] to List also involves copying all the data from the original array to a newly allocated List.
Converting from either List or T[] to IEnumerable involves casting (an upcast), which is a few CPU cycles at most.
Converting from IEnumerable to List involves downcasting, which is also a few CPU cycles.
Converting from IEnumerable to T[] also involves downcasting.
You can't cast an IEnumerable to T[] or List unless it was a T[] or List respectively to begin with. You can use the ToArray or ToList functions, but those will also result in a copy being made.
Accessing all the values in order from start to end in a T[] will, in a straightforward loop, be optimized to use straightforward pointer arithmetic -- which makes it the fastest of them all.
Accessing all the values in order from start to end in a List involves a check on each iteration to make sure that you aren't accessing a value outside the array's bounds, and then the actual accessing of the array value.
Accessing all the values in an IEnumerable involves creating an enumerator object, calling the MoveNext() function, which advances to the next element, and then reading the Current property, which gives you the actual value and sticks it in the variable that you specified in your foreach statement. Generally, this isn't as bad as it sounds.
Accessing an arbitrary value in an IEnumerable involves starting at the beginning and calling MoveNext() as many times as you need to get to that value. Generally, this is as bad as it sounds.
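Written out by hand, the enumeration described above looks roughly like this. It is only a sketch with made-up names (EnumerationDemo, SumManually, ElementAtManually), but GetEnumerator, MoveNext and Current are the real interface members.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerationDemo
{
    // Roughly what a foreach over IEnumerable<int> does: get an enumerator,
    // then alternate MoveNext() and Current until MoveNext() returns false.
    public static int SumManually(IEnumerable<int> values)
    {
        int sum = 0;
        using (IEnumerator<int> e = values.GetEnumerator())
        {
            while (e.MoveNext())
            {
                sum += e.Current;
            }
        }
        return sum;
    }

    // Reaching element 'index' of a plain IEnumerable<T> means walking past everything before it.
    public static int ElementAtManually(IEnumerable<int> values, int index)
    {
        using (IEnumerator<int> e = values.GetEnumerator())
        {
            for (int i = 0; i <= index; i++)
            {
                if (!e.MoveNext()) throw new ArgumentOutOfRangeException(nameof(index));
            }
            return e.Current;
        }
    }

    public static void Main()
    {
        Console.WriteLine(SumManually(new[] { 1, 2, 3 }));                        // 6
        Console.WriteLine(ElementAtManually(new List<int> { 10, 20, 30 }, 2));    // 30
    }
}
```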
In regards to idioms...
In general, IEnumerable is useful for public properties, function parameters, and often for return values -- and only if you know that you're going to be using the values sequentially.
For instance, if you had a function PrintValues written as PrintValues(List<T> values), it would only be able to deal with List values, so the user would first have to convert if, for instance, they were using a T[]. Likewise if the function was PrintValues(T[] values). But if it was PrintValues(IEnumerable<T> values), it would be able to deal with Lists, T[]s, stacks, hashtables, dictionaries, strings, sets, etc -- any collection that implements IEnumerable, which is practically every collection.
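A sketch of that idea, with a hypothetical Format method standing in for PrintValues so that the result can be inspected instead of printed directly:

```csharp
using System;
using System.Collections.Generic;

public static class FlexibleInput
{
    // Accepts any sequence; the caller does not have to convert anything first.
    public static string Format<T>(IEnumerable<T> values)
    {
        return string.Join(", ", values);
    }

    public static void Main()
    {
        Console.WriteLine(Format(new[] { 1, 2, 3 }));         // 1, 2, 3
        Console.WriteLine(Format(new List<int> { 4, 5, 6 })); // 4, 5, 6
        Console.WriteLine(Format("abc"));                     // a, b, c
    }
}
```

The last call works because a string is itself an IEnumerable<char>.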
In regards to internal use...
Use a List only if you're not sure how many items will need to be in it.
Use a T[] if you know how many items will need to be in it, but need to access the values in an arbitrary order.
Stick with the IEnumerable if that's what you've been given and you just need to use it sequentially. Many functions will return IEnumerables. If you do need to access values from an IEnumerable in an arbitrary order, use ToArray().
Also, note that casting is different from using ToArray() or ToList(): the latter involves copying the values, which is indeed a performance and memory hit if you have a lot of elements. The former simply is to say that "A dog is an animal, so like any animal, it can eat" (upcast) or "This animal happens to be a dog, so it can bark" (downcast). Likewise, all Lists and T[]s are IEnumerables, but only some IEnumerables are Lists or T[]s.
A good rule of thumb is to always use IEnumerable (when declaring your variables/method parameters/method return types/properties/etc.) unless you have a good reason not to. It is by far the most type-compatible with other (especially extension) methods.
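The difference between casting and copying can be checked with ReferenceEquals. This is just an illustrative sketch; CastVersusCopy and Run are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CastVersusCopy
{
    // Returns four flags: same object after cast, cast back to List works,
    // cast to int[] fails, and ToList() produced a different object.
    public static bool[] Run()
    {
        var list = new List<int> { 1, 2, 3 };

        IEnumerable<int> sequence = list;           // cast to the interface: same object, no copy
        List<int> sameList = sequence as List<int>; // works: it really is a List<int>
        int[] notAnArray = sequence as int[];       // null: it never was an int[]
        List<int> copy = sequence.ToList();         // allocates a new list and copies the items

        return new[]
        {
            ReferenceEquals(sequence, list),
            sameList != null,
            notAnArray == null,
            !ReferenceEquals(copy, list)
        };
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run())); // True, True, True, True
    }
}
```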
Well, you've got two apples and an orange that you are comparing.
The two apples are the array and the List.
An array in C# is a C-style array that has garbage collection built in. The upside of using them is that they have very little overhead, assuming you don't need to move things around. The downside is that they are not as efficient when you are adding things, removing things, and otherwise changing the array around, as the array's size is fixed and each such change means allocating a new array and copying elements into it.
A List is a C#-style dynamic array (similar to the vector<> class in C++). There is more overhead, but it is more convenient and usually more efficient when the number of items changes a lot: it keeps its elements in a contiguous backing array with spare capacity, so most additions don't need a reallocation.
The best comparison I could give is saying that arrays are to Lists as strings are to StringBuilders.
The orange is 'IEnumerable'. This is not a concrete collection type, but rather an interface. When a class implements the IEnumerable interface, it allows that object to be used in a foreach() loop.
When you return the list (as you did in your example), you were not converting the list to an IEnumerable. A list already is an IEnumerable object.
EDIT: When to convert between the two:
It depends on the application. There is very little that can be done with an array that cannot be done with a List, so I would generally recommend the List. Probably the best thing to do is to make a design decision that you are going to use one or the other, that way you don't have to switch between the two. If you rely on an external library, abstract it away to maintain consistent usage.
Hope this clears a little bit of the fog.
Looks to me like the problem is that you haven't bothered learning how to search an array. Hint: Array.IndexOf or Array.BinarySearch depending on whether the array is sorted.
You're right that converting to a list is a bad idea: it wastes space and time and makes the code less readable. Also, blindly upcasting to IEnumerable slows matters down and also completely prevents use of certain algorithms (such as binary search).
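For the example in the question, searching the array directly might look like this. It is a sketch with made-up names (ArraySearch, FindFoo, FindInSorted); the Array methods themselves are the real static methods on System.Array.

```csharp
using System;

public static class ArraySearch
{
    // Predicate-based search without converting to a List<T> first.
    public static int FindFoo(string[] things)
    {
        return Array.FindIndex(things, s => s.Equals("FOO"));
    }

    // Binary search; only valid if 'sorted' is already sorted with the same comparer.
    public static int FindInSorted(string[] sorted, string value)
    {
        return Array.BinarySearch(sorted, value, StringComparer.Ordinal);
    }

    public static void Main()
    {
        string[] things = { "BAR", "FOO", "QUX" };

        Console.WriteLine(Array.IndexOf(things, "FOO")); // 1
        Console.WriteLine(FindFoo(things));              // 1
        Console.WriteLine(FindInSorted(things, "FOO"));  // 1
    }
}
```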
I try to avoid rapidly jumping between data types if it can be avoided.
Each situation similar to the one you described is probably different enough to rule out a dogmatic rule about transforming your types; however, it is generally good practice to select a data structure that provides, as closely as possible, the interface you need without needlessly copying elements to new data structures.
When to use what?
I would suggest returning the most specific type, and taking in the most flexible type.
Like this:
public int[] DoSomething(IEnumerable<int> inputs)
{
//...
}
public List<int> DoSomethingElse(IList<int> inputs)
{
//...
}
That way you can call methods on List<T> for whatever you get back from the method, in addition to treating it as an IEnumerable. On the inputs, be as flexible as possible, so you don't dictate to the users of your method what kind of collection to create.
You're right to ignore the 'performance problem' antennae until you actually have a performance problem. Most performance problems come from doing too much I/O or too much locking or doing one of them wrong, and none of these apply to this question.
My general approach is:
1. Use T[] for 'static' or 'snapshot'-style information. Use for things where calling .Add() wouldn't make sense anyway, and you don't need the extra methods List<T> gives you.
2. Accept IEnumerable<T> if you don't really care what you're given and don't need a constant-time .Length/.Count.
3. Only return IEnumerable<T> when you're doing simple manipulations of an input IEnumerable<T> or when you specifically want to make use of the yield syntax to do your work lazily.
4. In all other cases, use List<T>. It's just too flexible.
Corollary to #4: don't be afraid of ToList(). ToList() is your friend. It forces the IEnumerable<T> to evaluate right then (useful for when you're stacking several where clauses). Don't go nuts with it, but feel free to call it once you've built up your full where clause before you do the foreach over it (or the like).
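To see why evaluating "right then" matters, here is a small sketch that counts how often a Where predicate runs. DeferredDemo and CountPredicateCalls are made-up names; the behaviour shown is ordinary LINQ deferred execution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Returns the predicate call counts: after defining the query,
    // after enumerating it twice, and after a single ToList().
    public static int[] CountPredicateCalls()
    {
        int calls = 0;
        int[] source = { 1, 2, 3, 4 };

        IEnumerable<int> evens = source.Where(x => { calls++; return x % 2 == 0; });
        int afterDefinition = calls;         // nothing has run yet

        evens.Count();
        evens.Count();
        int afterTwoEnumerations = calls;    // the filter ran over all items twice

        calls = 0;
        List<int> materialized = evens.ToList(); // the filter runs once more, here
        int total = materialized.Count + materialized.Count; // no re-evaluation
        int afterToList = calls;

        return new[] { afterDefinition, afterTwoEnumerations, afterToList };
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CountPredicateCalls())); // 0, 8, 4
    }
}
```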
Of course, this is just a rough guideline. Just please try to follow the same pattern in the same codebase -- code styles that jump around make it harder for maintenance coders to get into your frame of mind.