I know exactly how many items I want to keep in a list. The items are ordered, and I just need the list to end at a specific index that I already know. I don't want to shrink the capacity or use TrimExcess to make it smaller, because otherwise the list will have to grow (doubling its size) again once I start adding items.
How can I set the Count instead of using Remove, RemoveAt, or RemoveRange?
My priority is the speed of this operation.
Important: I know I could use an array, but I'm not allowed to. Also, I'm adding and removing items all the time; I just want the capacity to stay around roughly the same amount, which I don't know exactly but which will stabilize.
If you remove elements, the Capacity won't change. So if you don't use TrimExcess(), the Capacity will only ever grow (to the maximum the list has ever needed), and there's no reallocation penalty for removing elements and adding them back later. You can also set the initial capacity in the constructor, which is a good idea if you know (or can estimate) how many elements you'll be using, because it avoids the overhead of the repeated doubling while the list is initially built up.
Note: Insert/Remove in a list is still O(n), because elements may need to be copied around (unless you operate only at the tail end of the list).
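For example, a minimal sketch (method and parameter names are made up) showing that RemoveRange at the tail effectively "sets the Count" while leaving Capacity alone:

using System.Collections.Generic;

// Drops everything after the first keepCount items in one call.
// Count goes down to keepCount; Capacity stays whatever it already was.
static void TruncateTo(List<int> items, int keepCount)
{
    if (items.Count > keepCount)
        items.RemoveRange(keepCount, items.Count - keepCount);
}

Because the removed range is at the tail, no elements need to be shifted, so this is about as cheap as a "set Count" operation could be.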
Use an array and (in C# 8.0 onwards) use Indices and Ranges with slicing.
https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-8#indices-and-ranges
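For example, a minimal sketch of index/range syntax (note that slicing an array with a range allocates a new array, i.e. it is a copy rather than a view):

int[] data = { 10, 20, 30, 40, 50 };

int[] firstThree = data[..3];   // { 10, 20, 30 } - copy of the first three items
int[] lastTwo    = data[^2..];  // { 40, 50 }     - copy of the last two items
int middle       = data[^3];    // 30             - index from the end, no copy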
Related
I am trying to think of a fast and efficient way to handle a ton of items, all of the same struct type, where the collection can grow over time and I can quickly and selectively remove items when the conditions are right.
The application will have a large amount of data streaming in at a relatively fast rate, and I need to quickly analyze it, update some UI info, and drop the older datapoints to make room for new ones. There are certain data points of interest that I need to hang onto for a longer amount of time than others.
The data payload contains 2 integer numbers that represent physical spectrum data: frequency, power, etc. The "age out" thing was just some meta-data I was going to use to determine when it was a good time to drop old data.
I thought that using a LinkedList would be a good choice as it can easily remove items from the middle of the collection, but I need to be able to perform the following pseudo-code:
for(int i = 0; i < myCollection.Length; i++)
{
    myCollection[i].AgeOutVal--;
    if(myCollection[i].AgeOutVal == 0)
    {
        myCollection.Remove(i);
        i--;
    }
}
But I'm getting compiler errors indicating that I cannot use a collection like this. What would be a good/fast way to do this?
I would recommend that first, you do some serious performance analysis of your program. Processing a million items per second only leaves you a few thousand cycles per item, which is certainly doable. But with that kind of performance goal your performance is going to be heavily influenced by things like data locality and the resulting cache misses.
Second, I would recommend that you separate the concern of "does this thing need to be removed from the queue" from whatever concern the object itself represents.
Third, you do not say how big the "age" field can get, just that it is counting down. It seems inefficient to mutate the entire collection every time through the loop just to find the ones to remove. Some ideas:
Suppose the "age" counts down from ten to zero. Instead of creating one collection and each item in the collection has an age, create ten collections, one for things that will time out in one, one for things that will time out in two, and so on. Each tick you throw away the "time out in one" collection, then the "time out in two" collection becomes the "time out in one" collection, and so on. Every time through the loop you just move around a tiny number of collection references, rather than mutating a huge number of items.
Why is "age" counting down at all? Time is increasing. Mark each item according to when it was created, and never change that. Use a queue, so you can insert new items on one end and delete them from the other end. The queue will therefore be sorted by age. Each tick, dequeue items that are too old until you get to an item that is not too old. As mentioned elsewhere, a circular buffer implementation of a queue is likely to be efficient.
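A minimal sketch of that queue idea; the DataPoint fields, the tick counter, and the age-out threshold are all assumptions for illustration:

using System.Collections.Generic;

// Hypothetical names; the real payload and age-out rule are up to you.
struct DataPoint
{
    public int Frequency;
    public int Power;
    public long ArrivalTick;   // stamped once on creation, never mutated
}

class SampleWindow
{
    const long MaxAge = 10;    // assumed age-out threshold
    readonly Queue<DataPoint> _window = new Queue<DataPoint>();
    long _currentTick;

    public void OnNewSample(int frequency, int power)
    {
        _currentTick++;
        _window.Enqueue(new DataPoint { Frequency = frequency, Power = power, ArrivalTick = _currentTick });

        // The queue is ordered by arrival, so only the front can be stale.
        while (_window.Count > 0 && _currentTick - _window.Peek().ArrivalTick > MaxAge)
            _window.Dequeue();
    }
}

Each tick does O(1) work per item actually removed, and nothing in the middle of the collection is ever touched.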
I was browsing this question and some similar ones:
Getting a sub-array from an existing array
Many places I read answers along the lines of calling Skip and Take on the array and then ToArray() on the result.
What I am wondering is why Skip and Take are not constant time operations for arrays?
In turn, if they were constant-time operations, wouldn't Skip and Take (without calling ToArray() at the end) have the same running time, but without the overhead of doing an Array.Copy, and also be more space efficient?
You have to differentiate between the work that the Skip and Take methods do, and the work of consuming the data that the methods return.
The Skip and Take methods themselves are O(1) operations, as the work they do does not scale with the input size. They just set up an enumerator that is capable of returning items from the array.
It's when you use the enumerator that the work is done. That is an O(n) operation, where n is the number of items that the enumerator produces. As the enumerators read from the array, they don't contain a copy of the data, and you have to keep the data in the array intact as long as you are using the enumerator.
(If you use Skip on a collection that is not accessible by index the way an array is, getting the first item is an O(n) operation, where n is the number of items skipped.)
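For illustration, a small sketch (the array contents are arbitrary) showing where the O(1) setup ends and where the real work happens:

using System;
using System.Collections.Generic;
using System.Linq;

int[] data = Enumerable.Range(0, 1_000_000).ToArray();

// O(1): nothing is read from the array yet; this only sets up an enumerator.
IEnumerable<int> query = data.Skip(500_000).Take(20);

// The work happens when the enumerator is consumed - proportional to the items produced.
foreach (int value in query)
    Console.WriteLine(value);

// ToArray() also triggers the enumeration, plus one copy into a new array.
int[] copy = query.ToArray();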
I have a program that needs to store data values and periodically get the last 'x' data values.
I initially thought a stack was the way to go, but I need to be able to see more than just the top value - something like a PeekRange method where I can peek at the last 'x' number of values.
At the moment I'm just using a list and get the last, say, 20 values like this:
var last20 = myList.Skip(myList.Count - 20).ToList();
The list grows all the time the program runs, but I only ever want the last 20 values. Could someone give some advice on a better data structure?
I'd probably be using a ring buffer. It's not hard to implement one on your own; AFAIK there's no implementation provided by the Framework.
Well since you mentioned the stack, I guess you only need modifications at the end of the list?
In that case the list is actually a nice solution (cache-efficient, with fast insertion/removal at the end). However, your way of extracting the last few items is somewhat inefficient, because IEnumerable<T> doesn't expose the random access that the List provides, so the Skip() implementation has to scan the whole list until it reaches the end (or do a runtime type check first to detect that the container implements IList<T>). It is more efficient to either access the items directly by index, or (if you need a second array) to use List<T>.CopyTo() - see the sketch below.
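For example, a minimal sketch of the CopyTo approach; the list name, element type, and helper method are placeholders:

using System;
using System.Collections.Generic;

var samples = new List<double> { 1.0, 2.0, 3.0, 4.0, 5.0 };
double[] last3 = LastN(samples, 3);              // { 3.0, 4.0, 5.0 }

static double[] LastN(List<double> source, int n)
{
    int start = Math.Max(0, source.Count - n);
    var buffer = new double[source.Count - start];

    // List<T>.CopyTo block-copies the tail without scanning the earlier elements.
    source.CopyTo(start, buffer, 0, buffer.Length);
    return buffer;
}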
If you need fast removal/insertion at the beginning, you may want to consider a ring buffer or (doubly) linked list (see LinkedList<T>). The linked list will be less cache-efficient, but it is easy and efficient to navigate and alter from both directions. The ring buffer is a bit harder to implement, but will be more cache- and space-efficient. So its probably better if only small value types or reference types are stored. Especially when the buffers size is fixed.
You could just removeat(0) after each add (if the list is longer than 20), so the list will never be longer than 20 items.
You said stack, but you also said you only ever want the last 20 items. I don't think these two requirements really go together.
I would say that Johannes is right about a ring buffer. It is VERY easy to implement this yourself in .NET; just use a Queue<T> and once you reach your capacity (20) start dequeuing (popping) on every enqueue (push).
If you want your PeekRange to enumerate from the most recent to the least recent, you can define GetEnumerator to do something like return _queue.Reverse().GetEnumerator();
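A rough sketch of that Queue<T> approach (class and member names are made up for illustration):

using System.Collections.Generic;
using System.Linq;

class RecentValues<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private readonly int _capacity;

    public RecentValues(int capacity) { _capacity = capacity; }

    public void Add(T item)
    {
        _queue.Enqueue(item);
        if (_queue.Count > _capacity)
            _queue.Dequeue();                 // drop the oldest once we exceed capacity
    }

    // Most recent first, as suggested above.
    public IEnumerable<T> PeekRange() => _queue.Reverse();
}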
Whoops, .Take() won't do it.
Here's an implementation of .TakeLast()
http://www.codeproject.com/Articles/119666/LINQ-Introducing-The-Take-Last-Operators.aspx
I have a large list of integers that are sent to my web service. Our business rules state that these values must be unique. What is the most performant way to figure out whether there are any duplicates? I don't need to know the values; I only need to know if two of the values are equal.
At first I was thinking about using a generic List of integers and the list.Exists() method, but that is O(n).
Then I was thinking about using a Dictionary and the ContainsKey method. But I only need the keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?
Use a HashSet<T>:
The HashSet class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
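For example, a minimal sketch of that constructor-based check ('ids' is a placeholder for the incoming values):

using System.Collections.Generic;

var ids = new List<int> { 7, 12, 12, 40 };

// Building the set drops duplicates, so a smaller set means the list had repeats.
var unique = new HashSet<int>(ids);
bool hasDuplicates = unique.Count != ids.Count;   // true here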
Sounds like a job for a Hashset...
If you are using framework 3.5 you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of checking afterwards, you get better performance than a full O(n) pass when there are duplicates, because you don't have to continue looking after finding the first duplicate.
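A small sketch of that check-while-adding idea, relying on HashSet<T>.Add returning false for values that are already present:

using System.Collections.Generic;

var ids = new List<int> { 7, 12, 12, 40 };
bool duplicated = HasDuplicates(ids);             // true - stops at the second 12

static bool HasDuplicates(IEnumerable<int> values)
{
    var seen = new HashSet<int>();
    foreach (int value in values)
    {
        // Add returns false when the value is already in the set.
        if (!seen.Add(value))
            return true;                          // early exit on the first repeat
    }
    return false;
}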
If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better if you stored the number set as a sorted array or binary tree of begin,end pairs. Then you could search to find the pair with the largest begin value that was smaller than your search key and compare with that pair's end value to see if it exists in the set.
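A rough sketch of how that begin/end idea might look, assuming the ranges are already sorted by their begin value and do not overlap:

using System.Collections.Generic;

class RangeSet
{
    private readonly List<(int Begin, int End)> _ranges;

    // Assumes sortedRanges is ordered by Begin and the ranges are non-overlapping.
    public RangeSet(List<(int Begin, int End)> sortedRanges) { _ranges = sortedRanges; }

    public bool Contains(int key)
    {
        int lo = 0, hi = _ranges.Count - 1;
        int candidate = -1;

        // Binary search for the last range whose Begin is <= key.
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (_ranges[mid].Begin <= key) { candidate = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }

        return candidate >= 0 && key <= _ranges[candidate].End;
    }
}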
What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.
I'm looking for a kind of array data-type that can easily have items added, without a performance hit.
System.Array - ReDim Preserve copies the entire contents from the old array to a new one, so it is as slow as the number of existing elements
System.Collections.ArrayList - good enough?
System.Collections.IList - good enough?
Just to summarize a few data structures:
System.Collections.ArrayList: untyped data structures are obsolete. Use List(Of T) instead.
System.Collections.Generic.List(Of T): this represents a resizable array. This data structure uses an internal array behind the scenes. Adding an item to the List is O(1) as long as the underlying array hasn't been filled; otherwise it's O(n) to resize the internal array and copy the elements over.
List<int> nums = new List<int>(3); // creates a resizable array
                                   // which can hold 3 elements
nums.Add(1);
// adds item in O(1). nums.Capacity = 3, nums.Count = 1
nums.Add(2);
// adds item in O(1). nums.Capacity = 3, nums.Count = 2
nums.Add(3);
// adds item in O(1). nums.Capacity = 3, nums.Count = 3
nums.Add(4);
// adds item in O(n). The List doubles the size of its internal array, so
// nums.Capacity = 6, nums.Count = 4
Adding items is only efficient when adding to the back of the list. Inserting in the middle forces the array to shift all items forward, which is an O(n) operation. Deleting items is also O(n), since the array needs to shift items backward.
System.Collections.Generic.LinkedList(of t): if you don't need random or indexed access to items in your list, for example you only plan to add items and iterate from first to last, then a LinkedList is your friend. Inserts and removals are O(1), lookup is O(n).
You should use the generic List<T> (System.Collections.Generic.List<T>) for this. Adding to the end runs in amortized constant time.
It also shares the following features with Arrays.
Fast random access (you can access any element in the list in O(1))
It's quick to loop over
Slow to insert and remove objects at the start or middle (since it has to do a copy of the entire list, I believe)
If you need quick insertions and deletions at the beginning or end, use either a linked list or a queue
Would the LinkedList<T> structure work for you? It's not (in some cases) as intuitive as a straight array, but is very quick.
AddLast to append to the end
AddBefore/AddAfter to insert into list
AddFirst to append to the beginning
It's not so quick for random access, though, as you have to iterate over the structure to reach your items. It does work with .ToList() and .ToArray() (via LINQ) to grab a copy of the structure in list/array form, so for read access you could do that in a pinch. The performance gain on inserts may or may not outweigh the cost of slower random access; it depends entirely on your situation.
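A quick sketch of those operations (the values are arbitrary):

using System.Collections.Generic;
using System.Linq;

var list = new LinkedList<int>();

list.AddLast(3);                                  // append to the end
LinkedListNode<int> first = list.AddFirst(1);     // prepend to the beginning
list.AddAfter(first, 2);                          // O(1) insert after a node you already hold

int[] snapshot = list.ToArray();                  // { 1, 2, 3 } - copy out when you need indexing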
There's also this reference which will help you decide which is the right way to go:
When to use a linked list over an array/array list?
What is "good enough" for you? What exactly do you want to do with that data structure?
No array structure (i.e. one with O(1) access) allows insertion in the middle without an O(n) runtime; insertion at the end is O(n) worst case and O(1) amortized for self-resizing arrays like ArrayList.
Maybe hashtables (amortized O(1) access and insertion anywhere, but O(n) worst case for insertion) or trees (O(log(n)) for access and insertion anywhere, guaranteed) are better suited.
If speed is your problem, I don't see how the selected answer is any better than using a raw array. Although it resizes itself, it uses the exact same mechanism you would use to resize an array (and should take just a touch longer), UNLESS you are always adding to the end, in which case it's a bit smarter because it allocates a chunk at a time instead of just one element.
If you often add near the beginning/middle of your collection and don't index into the middle/end very often, you probably want a Linked List. That will have the fastest insert time and will have great iteration time, it just sucks at indexing (such as looking at the 3rd element from the end, or the 72nd element).