I'm looking for a kind of array data-type that can easily have items added, without a performance hit.
System.Array - ReDim Preserve allocates a new array and copies every existing element over, so it slows down in proportion to the number of existing elements
System.Collections.ArrayList - good enough?
System.Collections.IList - good enough?
Just to summarize a few data structures:
System.Collections.ArrayList: untyped data structures are obsolete. Use List(of t) instead.
System.Collections.Generic.List(of t): this represents a resizable array. This data structure uses an internal array behind the scenes. Adding items to a List is O(1) as long as the underlying array hasn't been filled; otherwise it's O(n) to resize the internal array and copy the elements over.
List<int> nums = new List<int>(3); // creates a resizable array
// which can hold 3 elements
nums.Add(1);
// adds item in O(1). nums.Capacity = 3, nums.Count = 1
nums.Add(2);
// adds item in O(1). nums.Capacity = 3, nums.Count = 2
nums.Add(3);
// adds item in O(1). nums.Capacity = 3, nums.Count = 3
nums.Add(4);
// adds item in O(n). List doubles the size of its internal array, so
// nums.Capacity = 6, nums.Count = 4
Adding items is only efficient when adding to the back of the list. Inserting in the middle forces the array to shift every later item forward, which is an O(n) operation. Deleting items is also O(n), since the array needs to shift the items after the deleted one backward.
System.Collections.Generic.LinkedList(of t): if you don't need random or indexed access to items in your list, for example you only plan to add items and iterate from first to last, then a LinkedList is your friend. Inserts and removals are O(1), lookup is O(n).
You should use the generic List<T> (System.Collections.Generic.List) for this. Appending to it runs in amortized constant time.
It also shares the following features with Arrays.
Fast random access (you can access any element in the list in O(1))
It's quick to loop over
Slow to insert and remove objects at the start or middle (since it has to shift everything after the insertion point)
If you need quick insertions and deletions in the beginning or end, use either linked-list or queues
Would the LinkedList<T> structure work for you? It's not (in some cases) as intuitive as a straight array, but it is very quick.
AddLast to append to the end
AddBefore/AddAfter to insert into list
AddFirst to append to the beginning
It's not so quick for random access, though, as you have to iterate over the structure to reach your items. However, it has .ToList() and .ToArray() methods to grab a copy of the structure in list/array form, so for read access you could do that in a pinch. The performance increase of the inserts may or may not outweigh the performance decrease from losing random access; it will depend entirely on your situation.
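A minimal sketch of those calls (the variable names here are just illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

var list = new LinkedList<string>();

var first = list.AddFirst("a");      // AddFirst: append to the beginning
var last = list.AddLast("d");        // AddLast: append to the end
list.AddAfter(first, "b");           // O(1) insert, given the node
list.AddBefore(last, "c");           // O(1) insert, given the node

// No indexed access; take a copy when you need random reads
string[] snapshot = list.ToArray();  // LINQ's ToArray, an O(n) copy
Console.WriteLine(string.Join(", ", snapshot)); // a, b, c, d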
There's also this reference which will help you decide which is the right way to go:
When to use a linked list over an array/array list?
What is "good enough" for you? What exactly do you want to do with that data structure?
No array structure (i.e. O(1) access) allows insertion in the middle without an O(n) runtime; insertion at the end is O(n) worst case and O(1) amortized for self-resizing arrays like ArrayList.
Maybe hashtables (amortized O(1) access and insertion anywhere, but O(n) worst case for insertion) or trees (O(log(n)) for access and insertion anywhere, guaranteed) are better suited.
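In .NET terms, that suggestion maps roughly onto HashSet<T>/Dictionary<K,V> (hash tables) and SortedSet<T>/SortedDictionary<K,V> (trees). A minimal sketch:

using System;
using System.Collections.Generic;

var hash = new HashSet<int>();   // hash table: amortized O(1) add/lookup
var tree = new SortedSet<int>(); // balanced tree: O(log n) add/lookup, kept sorted

for (int i = 0; i < 1000; i++)
{
    hash.Add(i);
    tree.Add(i);
}

Console.WriteLine(hash.Contains(500)); // True, amortized O(1)
Console.WriteLine(tree.Contains(500)); // True, O(log n)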
If speed is your problem, I don't see how the selected answer is any better than using a raw Array. Although it resizes itself, it uses the exact same mechanism you would use to resize an array (and should take just a touch longer), UNLESS you are always adding to the end, in which case it is a bit smarter because it allocates a chunk at a time instead of just one element.
If you often add near the beginning/middle of your collection and don't index into the middle/end very often, you probably want a Linked List. That will have the fastest insert time and will have great iteration time, it just sucks at indexing (such as looking at the 3rd element from the end, or the 72nd element).
Related: C# List remove from end, really O(n)?
The question may sound pretty primitive, but it spawned from a (lively) discussion about the (best-case) complexity of removing an element from the end of a List.
Here are some points I have considered before posting this question:
Complexity of removing an element at the end of a List<> is O(1)
Complexity of removing any element (be it even the last one) from a static array is O(n), as it requires a new array.
Now keep in view the above two arguments, plus the fact that List<> is also implemented over an array behind the scenes. So why does the removal come out at O(1) for List<> but O(n) for the array, when both would seem to involve recreating an array? Or am I missing something here? Thanks!
UPDATE: Just to clarify, this question's major focus is not on why removal of the last element is O(1), but rather on why it's O(n) for an array vs. O(1) for List<>. The title has been updated to reflect this. Thanks to the answer (which addresses it) and the other comments.
First off, the basis for your question is incorrect: removing an item from a List<T> by index is O(n), not O(1). Let me explain...
When you remove an item from a List<T> by index (i.e. using List<T>.RemoveAt), you need to copy all of the items which appear after it one space forward in the list, in order to close the gap you just left. You can see that quite clearly in the source:
public void RemoveAt(int index)
{
    if ((uint)index >= (uint)_size)
    {
        ThrowHelper.ThrowArgumentOutOfRange_IndexException();
    }
    _size--;
    if (index < _size)
    {
        Array.Copy(_items, index + 1, _items, index, _size - index);
    }
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        _items[_size] = default!;
    }
    _version++;
}
If you remove the last element, there's nothing to copy; if you remove the first element, you'll have to copy every remaining element.
In general, then, the cost of removing an arbitrary element from a list scales with the number of elements in the list, making it linear. Indeed, to quote the docs:
This method is an O(n) operation, where n is (Count - index).
With that out of the way, there is a small difference in cost between removing an element from a list and removing an element from an array.
List<T> is backed by an array: the underlying array can be larger than the value returned by List<T>.Count. As more items are added to the list the backing array gradually fills up, before a new backing array is allocated and all of the old elements copied over.
This means that you're allowed to have a backing array which is larger than the number of elements in the List<T>. If we remove an element from the List<T>, we don't need to allocate an entirely new backing array.
An array, however, always has to be exactly sized. If you have an array containing 3 elements, the array has to have a Length of exactly 3. If you want to reduce that to a Length of 2, you need to allocate a whole new array with that length.
Any change in size to the array therefore means you need to allocate a new array of the correct size, and copy over everything you care about.
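A small sketch of the contrast:

using System;
using System.Collections.Generic;

var list = new List<int> { 1, 2, 3 };
list.RemoveAt(list.Count - 1); // O(1): decrements the size, Capacity unchanged
Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");

var array = new[] { 1, 2, 3 };
// Arrays are fixed-size: "removing" the last element means allocating a
// new, smaller array and copying the survivors over - O(n).
Array.Resize(ref array, array.Length - 1);
Console.WriteLine(array.Length); // 2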
It is common, when implementing list-like data structures, to allocate the underlying memory as arrays sized to hold many list elements, keeping track of the position of the last-used array entry. When more space is needed during insertion of new entries, the allocation is generally increased by about 50% (the starting size and the growth policy are implementation details).
Therefore removing an entry from the tail of a list generally involves no more than decrementing the variable used to track the position of the last element - an O(1) operation since no (de)allocation or moving of other list elements is required.
However, note that Big O notation is used to classify algorithms according to their limiting (worst-case) behaviour/requirements. The underlying memory allocated to lists must still be managed, and although growing and shrinking arrays and moving entries around happen less frequently than for pure arrays, they are O(n) operations. While your question makes a statement that is probably generally true, it assumes specific implementation details which need not hold (particularly that deletion never shrinks the underlying array).
I have a list that I divided into a fixed number of sections (with some that might be empty).
Every section contains unordered elements, however the sections themselves are ordered backwards.
I reference the beginning of each section through a fixed-size array, whose elements are the indexes at which each section starts in the list.
I regularly extract the whole section at the tail of the list and, when I do so, I set its index inside the array to 0 (so the section will start to regrow from the head of the list) and then circularly increment the lastSection variable that I use to keep track of which section is at the tail of the list.
With the same frequency I also need to insert back into the list some new elements that will be spread across one or more sections.
I chose a single sectioned list (instead of a list of lists or something like that) because, even if the sections may vary a lot (from empty to a length of some thousands), the total number of elements has little variations during the application runtime AND because I also frequently need to get all the elements in the list, and didn't want to concatenate multiple lists in order to get the result.
Graphical representation of the data structure
Existential question:
Up to this point, have I made any mistakes in my choice of data structure, given that the operations described above are all I do with it?
Going forward:
The problem I am trying to address, since this is the core of the application I am building (and I want to squeeze out every slice of performance I can since it should run on smartphones), is: how can I do those multiple inserts as fast as possible?
Trivial solution:
For each new group of elements belonging to a certain section, just do an insertRange (sectionBeginning, groupOfElements).
Performance footprint:
every insertRange will force the list to shift all the content after the root of a section to the right, and with multiple insertRanges this means that some data will be shifted up to M times, where M is the number of insertRanges done with index != list.Count.
Little smarter solution:
Knowing before every multiple-inserts step which and how many new elements per section I need to add, I can add empty elements to the back of the list, perform M shifts of determined size, then copy the new elements to the corresponding "holes" left inside the list.
I could extend the list class and implement a new insertRange(int[] index, IEnumerable[] collection) where each index points to the beginning of a section. However, I am worried that the list class has internal optimizations (like an Array.Copy I don't think I can access) that would make my hand-rolled for-loop shifts perform worse. Is there a way to do a performant list shift in order to implement this and gain an advantage over multiple standard insertRanges? (See the sketch below.)
Note: index and collections should be ordered by section.
Graphical representation of the multiple-at once insertRange approach
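One way to realize the "all inserts in one pass" idea without touching List<T> internals is to build the result in a fresh buffer, so each surviving element is copied at most once. This is only a sketch under the question's assumptions; MultiInsert and its parameters are hypothetical names:

using System.Collections.Generic;

// indices: insertion points into `source`, sorted ascending;
// groups[i] is the block of new elements to insert at indices[i].
static List<T> MultiInsert<T>(List<T> source, int[] indices, IReadOnlyList<T>[] groups)
{
    int total = source.Count;
    foreach (var g in groups) total += g.Count;

    var result = new List<T>(total); // single allocation, no repeated shifting
    int srcPos = 0;
    for (int i = 0; i < indices.Length; i++)
    {
        // copy the untouched run before this insertion point
        while (srcPos < indices[i]) result.Add(source[srcPos++]);
        // then append the new group
        result.AddRange(groups[i]);
    }
    // finally the tail of the original list
    while (srcPos < source.Count) result.Add(source[srcPos++]);
    return result;
}

If element-by-element copying through the indexer turns out to dominate, List<T>.CopyTo(int, T[], int, int) can bulk-copy ranges into an array first, but measure before assuming it helps.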
Another similar thread about insertRange:
Replace multiple InsertRange() into efficient way
Another similar thread about shifts in lists:
Does code exist, for shifting List elements to left or right by specified amount, in C#?
So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and each has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store the pairs; I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach over the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };

foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
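Under the hood, a LINQ join builds a hash-based lookup over the inner sequence, so if you would rather be explicit, a dictionary version looks roughly like this (Item, Name, array1, and array2 stand in for the question's types):

using System.Collections.Generic;

// Build an O(1)-lookup index over array2, then stream array1 through it.
var byName = new Dictionary<string, Item>(array2.Length);
foreach (var b in array2)
    byName[b.Name] = b;                        // names are unique per the question

var pairs = new List<(Item A, Item B)>(array1.Length);
foreach (var a in array1)
    if (byName.TryGetValue(a.Name, out var b)) // zero-or-one match
        pairs.Add((a, b));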
Sort the second array using the Array.Sort method, then find each first-array object's match in the second array using binary search.
Generally, for 30-50 items this would be a little faster than brute force x*y.
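A sketch of that approach, again with Item/Name standing in for the real types, and using an ordinal comparer so matches are exact:

using System;

// Sort array2 by Name once, via a parallel key array we can binary-search.
string[] keys = Array.ConvertAll(array2, o => o.Name);
Array.Sort(keys, array2, StringComparer.Ordinal);      // O(n log n), keyed sort

foreach (var a in array1)
{
    int i = Array.BinarySearch(keys, a.Name, StringComparer.Ordinal); // O(log n)
    if (i >= 0)
    {
        var match = array2[i]; // pair (a, match) however you store pairs
    }
}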
I know that it takes 4 bytes to store a uint in memory, but how much space in memory does it take to store List<uint> for, say, x number of uints?
How does this compare to the space required by uint[]?
There is no per-item overhead of a List<T> because it uses a T[] to store its data. However, a List<T> containing N items may have 2N elements in its T[]. Also, the List data structure itself has probably 32 or more bytes of overhead.
You probably won't notice much difference between T[] and List<T>, but you can use
System.GC.GetTotalMemory(true);
before and after an object allocation to obtain an approximate memory usage.
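For example, a rough measurement sketch (results are approximate and depend on the runtime and GC state):

using System;
using System.Collections.Generic;

long before = GC.GetTotalMemory(true);  // force a collection first
var list = new List<uint>(1_000_000);
for (uint i = 0; i < 1_000_000; i++) list.Add(i);
long after = GC.GetTotalMemory(true);

Console.WriteLine($"~{after - before} bytes for the List<uint>");
GC.KeepAlive(list);                     // keep the list alive through measurement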
List<> uses an array internally, so a List<uint> should take O(4 bytes * n) space, just like a uint[]. There may be some more constant overhead in comparison to an array, but you should normally not care about this.
Depending on the specific implementation (this may be different when using Mono as a runtime instead of the MS .NET runtime), the internal array will be bigger than the number of actual items in the list. E.g.: a list of 5 elements may have an internal array that can store 10, and a list of 10000 elements may have an internal array of size 11000. So you can't generally say that the internal array will always be twice as big, or 5% bigger, than the number of list elements; it may also depend on the size.
Edit: I've just seen that Hans Passant has described the growing behaviour of List<T> here.
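If you want to observe the growth policy on your own runtime, a quick sketch (the doubling in the comment is what the MS runtime typically does):

using System;
using System.Collections.Generic;

var list = new List<int>();
int lastCapacity = -1;
for (int i = 0; i < 200; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");
        lastCapacity = list.Capacity;
    }
}
// Typically prints Capacity = 4, 8, 16, 32, 64, 128, 256 (doubling)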
So, if you have a collection of items that you want to append to, and you can't know the size of this collection at the time the list is created, use a List<T>. It is specifically designed for this case. It provides fast O(1) random access to the elements and has very little memory overhead (one internal array). It is, on the other hand, very slow at removing or inserting in the middle of the list. If you need those operations often, use a LinkedList<T>, which in turn has more memory overhead (per item!). If you know the size of your collection from the beginning, and you know it won't change (or will change only very few times), use arrays.
I was browsing this question and some similar ones:
Getting a sub-array from an existing array
Many places I read answers like this:
Getting a sub-array from an existing array
What I am wondering is why Skip and Take are not constant time operations for arrays?
In turn, if they were constant-time operations, wouldn't the Skip and Take methods (without calling ToArray() at the end) have the same running time, without the overhead of doing an Array.Copy, while also being more space-efficient?
You have to differentiate between the work that the Skip and Take methods do, and the work of consuming the data that the methods return.
The Skip and Take methods themselves are O(1) operations, as the work they do does not scale with the input size. They just set up an enumerator that is capable of returning items from the array.
It's when you use the enumerator that the work is done. That is an O(n) operation, where n is the number of items that the enumerator produces. As the enumerators read from the array, they don't contain a copy of the data, and you have to keep the data in the array intact as long as you are using the enumerator.
(If you use Skip on a collection that is not accessible by index like an array, getting the first item is an O(n) operation, where n is the number of items skipped.)
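A short sketch of the difference between setting up the enumerator and consuming it:

using System;
using System.Linq;

int[] data = Enumerable.Range(0, 1_000_000).ToArray();

// O(1): just constructs an enumerator; nothing is read or copied yet
var slice = data.Skip(10).Take(5);

// The work happens here, while enumerating. On runtimes where LINQ
// special-cases indexed collections, Skip jumps straight to index 10.
foreach (var x in slice)
    Console.WriteLine(x); // 10, 11, 12, 13, 14

// Only ToArray() actually pays for a copy of the slice
int[] copy = slice.ToArray();
Console.WriteLine(copy.Length); // 5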