A very basic auto-expanding list/array - C#

I have a method which returns an array of fixed type objects (let's say MyObject).
The method creates a new empty Stack<MyObject>. Then, it does some work and pushes some number of MyObjects to the end of the Stack. Finally, it returns the Stack.ToArray().
It does not change already-added items or their properties, nor does it remove them. The number of elements to add varies and cannot be predicted. There is no need to sort/order the elements.
Is Stack the best thing to use? Or should I switch to Collection or List to get better performance and/or lower memory cost?

Stack<T> will not be any faster than List<T>.
For optimal performance, you should use a List<T> and set the Capacity to a number larger than or equal to the number of items you plan to add.
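For example, a minimal sketch (the capacity of 100 is an assumed upper bound, not something from the question):

var results = new List<MyObject>(100); // pre-sizing avoids grow-and-copy cycles of the internal array
// ... do some work, calling results.Add(item) for each produced object ...
return results.ToArray();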

If the ordering doesn't matter and your method doesn't need to add/remove/edit items that have already been processed then why not return IEnumerable<MyObject> and just yield each item as you go?
Then your calling code can either use the IEnumerable<MyObject> sequence directly, or call ToArray, ToList etc as required.
For example...
// use the sequence directly
foreach (MyObject item in GetObjects())
{
    Console.WriteLine(item.ToString());
}
// ...
// convert to an array
MyObject[] myArray = GetObjects().ToArray();
// ...
// convert to a list
List<MyObject> myList = GetObjects().ToList();
// ...
public IEnumerable<MyObject> GetObjects()
{
    foreach (MyObject foo in GetObjectsFromSomewhereElse())
    {
        MyObject bar = DoSomeProcessing(foo);
        yield return bar;
    }
}

Stack<T> is not any faster than List<T> in this case, so I would probably use List, unless something about what you are doing is "stack-like". List<T> is the more standard data structure to use when what you want is basically a growable array, whereas stacks are usually used when you need LIFO behavior for the collection.

For this purpose, there are no other collections in the framework that will perform considerably better than a Stack<T>.
However, both Stack<T> and List<T> auto-grow their internal array of items when the initial capacity is exceeded. This involves creating a new, larger array and copying all items over, which costs some performance.
If you know the number of items beforehand, initialize your collection to that capacity to avoid auto-growth. If you don't know exactly, choose a capacity that is unlikely to be insufficient.
Most of the built in collections take the initial capacity as a constructor argument:
var stack = new Stack<T>(200); // Initial capacity of 200 items.
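List<T> has the same kind of constructor:

var list = new List<T>(200); // Initial capacity of 200 items.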

Use a LinkedList maybe?
Though LinkedLists are only useful when the data is accessed sequentially.

You don't need Stack<> if all you're going to do is append. You can use List<>.Add (http://msdn.microsoft.com/en-us/library/d9hw1as6.aspx) and then ToArray.
(You'll also want to set initial capacity, as others have pointed out.)

If you need the semantics of a stack (last-in first-out), then the answer is, without any doubt, yes, a stack is your best solution. If you know from the start how many elements it will end up with, you can avoid the cost of automatic resizing by calling the constructor that receives a capacity.
If you're worried about the memory cost of copying the stack into an array, and you only need sequential access to the result, then, you can return the Stack<T> as an IEnumerable<T> instead of an array and iterate it with foreach.
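A minimal sketch of that idea (the method name and capacity are assumptions carried over from the earlier answers, not from the question):

public IEnumerable<MyObject> GetObjects()
{
    var results = new Stack<MyObject>(200); // assumed known capacity
    // ... do some work, calling results.Push(item) ...
    return results; // no ToArray() copy; callers enumerate in LIFO order
}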
All that said, unless this code proves problematic in terms of performance (i.e., by looking at data from a profiler), I wouldn't bother much and would choose based on semantics.

Related

Efficiency of ToList() [duplicate]

This question already has answers here: Is there a performance impact when calling ToList()?
A lot of the developers I work with feel more comfortable working with a List as opposed to an IEnumerable (for example). I am wondering whether there is any performance impact from ToList() overuse. For example, one common pattern is to call ToList() after ordering, just to get a list back out again, i.e.
private void ListThinger(List<T> input)
{
    input = input.OrderBy(s => s.Thing).ToList();
    foreach (var thing in input)
    {
        // do things
    }
}
My question is:
How efficient is the ToList() method? Will it create a new list and how much memory does that take, assuming the contents are POCOs? Does this change if it's a value type rather than a POCO?
Will the size of the list determine efficiency or does size of list not determine cost of ToList()?
If a list is cast to an IEnumerable and then ToList() is called on it, will it just return the original object?
P.S. I understand that a single use of ToList won't break any backs, but we are building a highly concurrent system that is currently CPU bound, so I am looking for little wins that, when scaled, will add up to a big improvement.
How efficient is the ToList() method? Will it create a new list and how much memory does that take, assuming the contents are POCOs? Does this change if it's a value type rather than a POCO?
The ToList() method materializes the given collection by creating a new list and populating it with the items of the given collection. Linq.ToList() implementation:
public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw Error.ArgumentNull("source");
    return new List<TSource>(source);
}
By doing so you are not gaining the power of deferred execution, if that is needed.
Will the size of the list determine efficiency or does size of list not determine cost of ToList()?
As it calls List's copy constructor and creates a new list, it will then work on each of the items. So it runs in O(n), meaning that the list's size matters. MSDN's documentation for the copy constructor:
Initializes a new instance of the List class that contains elements copied from the specified collection and has sufficient capacity to accommodate the number of elements copied.
As @Jason mentioned in the comment below, the copy constructor is smart and efficient, but doing it when not needed is still an O(n) operation that doesn't have to happen.
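Roughly what that "smart" path looks like (a sketch based on the reference source; details vary by framework version):

// inside new List<T>(IEnumerable<T> collection), approximately:
if (collection is ICollection<T> c)
{
    T[] items = new T[c.Count]; // one right-sized allocation, no incremental growth
    c.CopyTo(items, 0);         // one bulk copy - still O(n)
}
else
{
    // otherwise it falls back to enumerating the source and Add()-ing each item
}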
If a list is cast to an IEnumerable and then ToList() is called on it, will it just return the original object?
No. It will create a new list, as seen above.
As for your example code:
input = input.OrderBy(s => s.Thing).ToList();
foreach (var thing in input)
{
    // do things
}
As you are getting a materialized list (rather than an IQueryable/IEnumerable that might use deferred execution), adding the ToList() after the ordering gives you no benefit.
You can look here, might also help: When to use LINQ's .ToList() or .ToArray()
Yes, it creates a new list. It is hard to accurately measure the memory usage, but it is likely to be around the class size + (system word size * element count). I recommend a memory profiler.
Algorithmic efficiency of operations will of course be impacted by the element count.
Yes, you get a brand new list every time. The objects referenced inside are not duplicated (only the references are copied), but value-type elements are copied.
Try it yourself:
var list = new List<int>();
bool areListsTheSame = list == ((IEnumerable<int>)list).ToList(); // false: ToList() always allocates a new list

Fast way to clear list or appropriate .NET collection for the following

I have a need for a fast collection type which will only be accessed sequentially. It needs fast adds and fast clears. I would think there are some collection types that can be cleared with very little processing time.
I have read that clearing a List<> is an O(n) operation. Is there a collection type with a super-fast clear ability? Maybe a stack? Would a List<int> or List<double> (of non-reference types) allow faster operation?
O(n) is when you are searching. If you are stepping through the list using the enumerator (or the indexer, which is also O(1) per access) lookups are O(1). Calling Clear() (as opposed to removing items one by one with Remove(T obj)) will be very fast in practice, but it is still an O(n) operation, since the internal array must be wiped so old references can be garbage collected.
If you don't need resizability, just declaring an array will give you an O(1) indexer, and to "clear" it you just drop the reference and make a new one.
If you truly need a fast clear and don't want to allocate a new piece of memory each time, you could also write your own implementation of IList<T> with a Clear that is O(1) by simply resetting the internal counter to 0. The List<T> class keeps a _size counter that determines where the list is at currently, to reset you could just set this to 0 in your own implementation. The internal array wouldn't really be cleared, but any items added to the list would override the old values and enumeration wouldn't continue past _size so you wouldn't ever touch the old values.
Essentially, you want List<T> where Count is settable to 0 rather than just gettable.
But do note that if T is a reference type, this kind of list would keep references to the old values and prevent them from being garbage collected. So it is perhaps best used only if you want to store value types.
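A minimal sketch of that idea (a hypothetical type, not part of the framework; error handling mostly omitted):

using System;

public class ResettableList<T>
{
    private T[] _items = new T[16];
    private int _size;

    public int Count => _size;

    public T this[int index]
    {
        get
        {
            if (index >= _size) throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];
        }
    }

    public void Add(T item)
    {
        if (_size == _items.Length)
            Array.Resize(ref _items, _items.Length * 2); // grow like List<T> does
        _items[_size++] = item;
    }

    public void Clear()
    {
        _size = 0; // O(1): old values stay in the array but are never enumerated again
    }
}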
If your main goal is O(1) clear, just have any suitable collection (List, LinkedList, Array) as a member of your custom class implementing IList or ICollection and forward indexing/iteration to that member. To implement Clear, simply create a new instance of that inner member.
class FastClearList<T> : IList<T>
{
    List<T> inner = new List<T>();

    public void Clear()
    {
        inner = new List<T>(); // recreating the list here gives O(1)
    }

    public IEnumerator<T> GetEnumerator()
    {
        return inner.GetEnumerator();
    }

    // forward everything else ...
}
List<T> does need to clear the references to the objects it held so they can be garbage collected. The best approach is to create a new List<> instead of clearing it, or to create a wrapper/factory that implements such an approach.

C# foreach loop - is order *stability* guaranteed?

Suppose I have a given collection. Without ever changing the collection in any way, I loop through its contents twice with a foreach. Barring cosmic rays and what not, is it absolutely guaranteed that the order will be consistent in both loops?
Alternatively, given a HashSet<string> with a number of elements, what can cause the output from the commented lines in the following to be unequal:
{
    var mySet = new HashSet<string>();
    // Some code which populates the HashSet<string>

    // Output1
    printContents(mySet);

    // Output2
    printContents(mySet);
}

public void printContents(HashSet<string> set)
{
    foreach (var element in set)
    {
        Console.WriteLine(element);
    }
}
It would be helpful if I could get a general answer explaining what causes an implementation to not meet the criteria described above. Specifically, though, I am interested in Dictionary, List and arrays.
Array enumeration guarantees order.
List and List<T> are expected to provide stable order (since they are expected to implement sequentially-indexed elements).
Dictionary and HashSet explicitly do not guarantee order. It is very unlikely that two consecutive iterations will return items in different orders, but there are no guarantees or expectations. One should not expect any particular order.
Sorted versions of Dictionary/HashSet return items in sort order.
Other IEnumerable objects are free to do whatever they want. Normally one implements iterators in such a way that they match the user's expectations: enumeration of something that has an implicit order should be stable, and if an explicit order is provided, it is expected to be stable. A query to a database that does not specify an order should be expected to return items in a semi-random order.
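To make the distinction concrete (illustrative only; the printed order is not part of any contract):

var set = new HashSet<string> { "b", "a", "c" };
Console.WriteLine(string.Join(",", set)); // some order, e.g. "b,a,c"
Console.WriteLine(string.Join(",", set)); // same order in practice, but nothing guarantees it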
Check this question for links: Does the foreach loop in C# guarantee an order of evaluation?
Everything that implements IEnumerable<T> does so in its own way. There is no general guarantee that any given collection must ensure stability.
If you are referring specifically to Collection<T> (http://msdn.microsoft.com/en-us/library/ms132397.aspx) I don't see any specific guarantee in its MSDN reference that ordering is consistent.
Will it probably be consistent? Yes. Is there a written guarantee? Not that I can find.
For many of the C# collections there are sorted versions of the collection. For instance, a HashSet is to a SortedSet as a Dictionary is to a SortedDictionary. If you're working with something where the order isn't important like the Dictionary then you can't assume the loop order will behave the same way every time.
As per your example with HashSet<T>, we now have source code to check: HashSet:Enumerator
As it is, the Slot[] set.m_slots array is iterated.
The array object is only changed in the methods TrimExcess, Initialize (both of which are only called in the constructor), OnDeserialization, and SetCapacity (only called by AddIfNotPresent and AddOrGetLocation).
The values of m_slots are only changed in methods that change elements of the HashSet (Clear, Remove, AddIfNotPresent, IntersectWith, SymmetricExceptWith).
So yes, if nothing touches the set, it enumerates in the same order.
Dictionary:Enumerator works in quite the same way, iterating an Entry[] entries that only changes when such non-readonly methods are called.

Efficiency: Creating an array of doubles incrementally?

Consider the following code:
List<double> l = new List<double>();

// add an unknown number of values to the list
l.Add(0.1);  // assume we don't have these values ahead of time
l.Add(0.11);
l.Add(0.1);

l.ToArray(); // ultimately we want an array of doubles
Anything wrong with this approach? Is there a more appropriate way to build an array, without knowing the size, or elements ahead of time?
There's nothing wrong with your approach. You are using the correct data type for the purpose.
After some observations you can get a better idea of the total elements in that list. Then you can create a new list with an initial capacity in the constructor:
List<double> l = new List<double>(capacity);
Other than this, it's the proper technique and data structure.
UPDATE:
If you:
Need only the Add and ToArray functions of the List<T> structure,
And you can't really predict the total capacity
And you end up with more than 1K elements
And better performance is really really (really!) your goal
Then you might want to write your own interface:
public interface IArrayBuilder<T>
{
    void Add(T item);
    T[] ToArray();
}
And then write your own implementation, which might be better than List<T>. Why is that? Because List<T> holds a single array internally and increases its size when needed. Growing the inner array costs performance, since it allocates new memory and copies the elements from the old array to the new one. However, if all of the conditions described above are true, all you need is to build an array; you don't really need all of the data to be stored in a single array internally while you are building it.
I know it's a long shot, but I think it's better sharing such thoughts...
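As an illustration of that idea, here is a hedged sketch of a chunked builder (a hypothetical implementation of the interface above, not benchmarked): it never copies old elements while growing, and only assembles the final array once, in ToArray.

using System;
using System.Collections.Generic;

public sealed class ChunkedArrayBuilder<T> : IArrayBuilder<T>
{
    private readonly List<T[]> _chunks = new List<T[]>();
    private T[] _current = new T[16];
    private int _countInCurrent;
    private int _total;

    public void Add(T item)
    {
        if (_countInCurrent == _current.Length)
        {
            _chunks.Add(_current);                 // retire the full chunk as-is
            _current = new T[_current.Length * 2]; // grow without copying old items
            _countInCurrent = 0;
        }
        _current[_countInCurrent++] = item;
        _total++;
    }

    public T[] ToArray()
    {
        T[] result = new T[_total];
        int offset = 0;
        foreach (T[] chunk in _chunks)
        {
            Array.Copy(chunk, 0, result, offset, chunk.Length);
            offset += chunk.Length;
        }
        Array.Copy(_current, 0, result, offset, _countInCurrent);
        return result;
    }
}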
As others have already pointed out: This is the correct approach. I'll just add that if you can somehow avoid the array and use List<T> directly or perhaps IEnumerable<T>, you'll avoid copying the array as ToArray actually copies the internal array of the list instance.
Eric Lippert has a great post about arrays, that you may find relevant.
A dynamic data structure like a List is the correct way to implement this. Note that indexing is O(1) for both an array and a List<T> (a List<T> is backed by an array), so the only real advantage arrays have over a List is slightly lower per-access overhead. The flexibility more than makes up for this small performance loss imho.

What is the difference between LinkedList and ArrayList, and when to use which one?

What is the difference between LinkedList and ArrayList? How do I know when to use which one?
The difference is the internal data structure used to store the objects.
An ArrayList will use a system array (like Object[]) and resize it when needed. On the other hand, a LinkedList will use an object that contains the data and a pointer to the next and previous objects in the list.
Different operations will have different algorithmic complexity due to this difference in the internal representation.
Don't use either. Use System.Collections.Generic.List<T>.
That really is my recommendation. Probably independently of what your application is, but here's a little more color just in case you're doing something that needs a finely tuned choice here.
ArrayList and LinkedList are different implementations of the storage mechanism for a list. ArrayList uses an array that it must resize if your collection outgrows its current storage size. LinkedList, on the other hand, uses the linked-list data structure from CS 201. LinkedList is better for some head- or tail-insert-heavy workloads, but ArrayList is better for random-access workloads.
ArrayList has a good replacement which is List<T>.
In general, List<T> is a wrapper for an array - it allows indexing and accessing items in O(1), but every time you exceed the capacity an O(n) resize cost must be paid.
LinkedList<T> won't let you access items by index, but you can count on insert always costing O(1). In addition, you can insert items at the beginning of the list and between existing items in O(1).
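For example (standard System.Collections.Generic API; illustrative only):

var linked = new LinkedList<int>();
LinkedListNode<int> first = linked.AddFirst(1); // O(1)
linked.AddLast(3);                              // O(1)
linked.AddAfter(first, 2);                      // O(1) once you already hold the node
// there is no linked[1]; to reach an item by position you must walk the nodes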
I think that in most cases List<T> is the default choice. Many of the common scenarios don't require special order and have no strict complexity constraints, therefore List<T> is preferred due to its usage simplicity.
The main difference between ArrayList and List<T>, LinkedList<T>, and other similar generics is that ArrayList holds Objects, while the others hold a type that you specify (i.e. List<Point> holds only Points).
Because of this, you need to cast any object you take out of an ArrayList to its actual type. This can take a lot of screen space if you have long class names.
In general it's much better to use List<T> and other typed Generics unless you really need to have a list with multiple different types of objects in it.
The difference lies in the semantics of how the List interface* is implemented:
http://en.wikipedia.org/wiki/Arraylist and http://en.wikipedia.org/wiki/LinkedList
*Meaning the basic list operations
As #sblom has stated, use the generic counterparts of LinkedList and ArrayList. There's really no reason not to, and plenty of reasons to do so.
The List<T> implementation is effectively wrapping an Array. Should the user attempt to insert elements beyond the bounds of the backing array, it will be copied to a larger array (at considerable expense, but transparently to users of the List<T>).
A LinkedList<T> has a completely different implementation in which data is held in LinkedListNode<T> instances, which carry references to two other LinkedListNode<T> instances (or only one in the case of the head or tail of the list). No external reference to mid-list items is created. This means that iterating the list is fast, but random access is slow, because one must walk the nodes from one end or the other. The best reason to use a LinkedList is to allow for fast inserts, which involve simply changing the references held by the nodes, rather than rewriting the entire list to insert an item (as is the case with List<T>).
They have different performance for "inserts" (adding new elements) and lookups. For inserts, ArrayList keeps an array internally (initially 16 items long) and doubles the size of the array whenever you reach the maximum capacity. A LinkedList starts empty and adds an item (node) when needed.
I also think that with ArrayList you are able to index the items directly, while with the LinkedList you have to "visit" the item starting from the head (or the LinkedList does this for you internally).
