Good way to manage very large collection

I am trying to think of a fast and efficient way to handle a ton of items, all of the same struct type, in a collection that can grow over time and from which I can quickly and selectively remove items when the conditions are right.
The application will have a large amount of data streaming in at a relatively fast rate, and I need to quickly analyze it, update some UI info, and drop the older datapoints to make room for new ones. There are certain data points of interest that I need to hang onto for a longer amount of time than others.
The data payload contains 2 integer numbers that represent physical spectrum data: frequency, power, etc. The "age out" thing was just some meta-data I was going to use to determine when it was a good time to drop old data.
I thought that using a LinkedList would be a good choice as it can easily remove items from the middle of the collection, but I need to be able to perform the following pseudo-code:
for(int i = 0; i < myCollection.Length; i++)
{
    myCollection[i].AgeOutVal--;
    if(myCollection[i].AgeOutVal == 0)
    {
        myCollection.Remove(i);
        i--;
    }
}
But I'm getting compiler errors indicating that I cannot use a collection like this. What would be a good/fast way to do this?

I would recommend that first, you do some serious performance analysis of your program. Processing a million items per second only leaves you a few thousand cycles per item, which is certainly doable. But with that kind of performance goal your performance is going to be heavily influenced by things like data locality and the resulting cache misses.
Second, I would recommend that you separate the concern of "does this thing need to be removed from the queue" from whatever concern the object itself represents.
Third, you do not say how big the "age" field can get, just that it is counting down. It seems inefficient to mutate the entire collection every time through the loop just to find the ones to remove. Some ideas:
Suppose the "age" counts down from ten to zero. Instead of creating one collection in which each item has an age, create ten collections: one for things that will time out in one tick, one for things that will time out in two ticks, and so on. Each tick you throw away the "time out in one" collection, then the "time out in two" collection becomes the "time out in one" collection, and so on. Every time through the loop you just move around a tiny number of collection references, rather than mutating a huge number of items.
Why is "age" counting down at all? Time is increasing. Mark each item according to when it was created, and never change that. Use a queue, so you can insert new items on one end and delete them from the other end. The queue will therefore be sorted by age. Each tick, dequeue items that are too old until you get to an item that is not too old. As mentioned elsewhere, a circular buffer implementation of a queue is likely to be efficient.
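A minimal sketch of the timestamp-and-queue idea, under some assumptions: the Sample struct, its field names, the AgingQueue class, and the use of an integer tick as the clock are all hypothetical, chosen only to illustrate the approach. Items are stamped once on arrival and never mutated; expiry only ever looks at the old end of the queue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical payload; the field names are illustrative only.
public struct Sample
{
    public int Frequency;
    public int Power;
    public long CreatedTick;   // set once on arrival, never changed
}

public sealed class AgingQueue
{
    private readonly Queue<Sample> _items = new Queue<Sample>();
    private readonly long _maxAge;

    public AgingQueue(long maxAge) { _maxAge = maxAge; }

    public int Count { get { return _items.Count; } }

    // Items arrive in time order, so the queue stays sorted by age.
    public void Add(int frequency, int power, long nowTick)
    {
        _items.Enqueue(new Sample { Frequency = frequency, Power = power, CreatedTick = nowTick });
    }

    // Dequeue from the old end until the oldest remaining item is still fresh.
    // Returns the number of items removed.
    public int Expire(long nowTick)
    {
        int removed = 0;
        while (_items.Count > 0 && nowTick - _items.Peek().CreatedTick > _maxAge)
        {
            _items.Dequeue();
            removed++;
        }
        return removed;
    }
}
```

Each call to Expire does work proportional to the number of items actually dropped, not to the size of the collection. Data points that need to live longer could go into a second AgingQueue with a larger maxAge, though that is only one possible way to handle them.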

Related

Is Dictionary.ContainsKey() any better than FirstOrDefault()?

I know, one million of anything isn't going to be performant. But I need that piece of knowledge right now.
I have a Dictionary<string, bool> and a string[]. The boolean in the dictionary is just to fill the space. Let's imagine that as an Inventory System just to make things easier.
In this inventory, I wanna check if I already had gotten one item. So what I'd do is:
if (dic.ContainsKey(item_id)) // That could be a TryGetValue() as well.
{
// Do some logic.
}
But would it be better to just have an array?
if (array.FirstOrDefault(a => a == item_id) != null)
{
// Do magic.
}
I mean, which would perform better in that specific case?
I know, that's a silly question, but when you can have over one million (or over nine thousand, for the DBZ fans out there xD) checks, things can get pretty heavy, especially for mobile, VR and others with similar performance.
Plus, I just want my users to have the best experience with my Inventory (a.k.a. no lag), so I often take stuff like that in consideration.

There is a tradeoff here between space and time.
A Dictionary is a relatively heavyweight structure compared to an array.
The lookup time in a Dictionary (or a HashSet) is basically independent of the number of entries, O(1), while with the array it increases linearly, O(N).
So there is a certain number of items where the Dictionary (or HashSet) begins to be considerably faster. And 1 million is certainly above this threshold.
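To make the difference concrete, here is a small sketch of both lookups. The InventoryLookup class and its method names are made up for illustration; only Array.IndexOf and HashSet<T> are real framework APIs.

```csharp
using System;
using System.Collections.Generic;

public static class InventoryLookup
{
    // O(N): scans the array until it finds a match or runs out of items.
    public static bool ContainsLinear(string[] items, string itemId)
    {
        return Array.IndexOf(items, itemId) >= 0;
    }

    // Build the set once; each Contains call on it is then O(1) on average.
    public static HashSet<string> BuildIndex(string[] items)
    {
        return new HashSet<string>(items);
    }
}
```

If the bool in the dictionary really is just filler, a HashSet<string> expresses the same "have I got this item" check without the unused value.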

Which collection class to use for a rolling period timeseries of data?

I want to implement a C# class (.NET 4.6.1) which contains time series data as follows:
the timeseries is a collection keyed on DateTime, each with an associated value (eg an array of doubles)
the points will be added strictly in time order
there will be a rolling time period - e.g. 1 hour - when adding a new point, any points at the start older than this period will be removed
the key issues for performance will be quickly adding new points, and finding the data point for a particular time (a binary search or better is essential). There will be 100k's of points in the collection sometimes.
it has to be thread safe
it has to be relatively memory efficient - e.g. it can't just keep all the points from the beginning of time - as time moves on the memory footprint has to be fairly stable.
So what would be a good approach for that in terms of underlying collection classes? Using Lists will be slow - as the rolling period means we will need to remove from the start of the List a lot. A Queue or LinkedList would solve that - but I don't think they provide a fast way to access the nth item in the collection (ElementAt just iterates so is very slow).
The only solution I can think of will involve storing the data twice: once in a collection that's easy to remove from the start of, and again in one that's easy to search in (with some awful background process to prune the stale points from the search collection somehow).
Is there a better way to approach this?
Thanks.

When I first saw the question I immediately thought of a queue, but most built-in queues do not efficiently allow indexed access, as you've found.
Best suggestion I can come up with is to use a ConcurrentDictionary. It's thread-safe, has near-constant access time by key, and you can key directly on DateTimes; it's everything you need functionally, except the behavior to manage size/timeline. To do that, you use a little Linq to scan the Keys property for keys more than one hour older than the newest one being added, then TryRemove() each one. TryAdd() is not virtual, so you cannot override it in a derived class; instead, wrap the ConcurrentDictionary in your own class whose add method does the pruning automatically.
Only other potential issue is it's not terribly memory-efficient; .NET dictionaries are hash tables, which keep a bucket array indexed by the hash of the key plus per-entry bookkeeping, so they use more memory per item than a plain array or list would.
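A rough sketch of such a wrapper, assuming the values are double[] as in the question. The RollingSeries class and its member names are hypothetical; the ConcurrentDictionary calls (TryAdd, TryRemove, TryGetValue, Keys) are the real API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

public sealed class RollingSeries
{
    private readonly ConcurrentDictionary<DateTime, double[]> _points =
        new ConcurrentDictionary<DateTime, double[]>();
    private readonly TimeSpan _window;

    public RollingSeries(TimeSpan window) { _window = window; }

    public int Count { get { return _points.Count; } }

    // Adds a point, then prunes everything older than the rolling window
    // measured back from the point just added.
    public bool Add(DateTime time, double[] value)
    {
        if (!_points.TryAdd(time, value)) return false;

        DateTime cutoff = time - _window;
        // Keys returns a snapshot copy, so this scan is O(n) on every add.
        foreach (DateTime stale in _points.Keys.Where(k => k < cutoff))
        {
            double[] ignored;
            _points.TryRemove(stale, out ignored);
        }
        return true;
    }

    // Exact-key lookup only; there is no "nearest time" search here.
    public bool TryGet(DateTime time, out double[] value)
    {
        return _points.TryGetValue(time, out value);
    }
}
```

Two caveats with this sketch: the full key scan on every add may be too slow with hundreds of thousands of points, and a hash lookup only finds a point whose DateTime matches exactly, so it does not cover the "find the point for a particular time" case if the time falls between samples.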

cache entry replacement algorithm

I have a software project that creates a series of fingerprint (hash) values from objects of varying size. The larger the object size, of course, the more expensive the computation of the hash. The hashes are used for comparative purposes.
I now wish to cache hash values in order to improve performance of subsequent comparisons. For any given entry in the cache, I have the following metrics available:
hit count
last modification date/time
size of object hashed
So on to my question. Given the need to constrain the size of the cache (limit it to a specific number of entries), what is a well-balanced approach to replacing cache items?
Clearly, larger objects are more expensive to hash so they need to be kept around for as long as possible. However, I want to avoid a situation where populating the cache with a large quantity of large objects will prevent future (smaller) items from being cached.
So, based upon the metrics available to me (see above), I'm looking for a good general-purpose "formula" for expiring (removing) cache entries when the cache becomes full.
All thoughts, comments are appreciated.

You need to think about the nature of the objects. Think about the probability of the objects to be called again soon. And remove the least likely object.
This is very specific to the software you're using and the nature of the objects.
If they are used continuously in the program they will probably abide by the locality of reference principle. So you should use the LRU (least recently used) algorithm.
If objects with higher hit count are more likely to be called again, then use that (and remove the lowest).
Take a look at Cache Algorithms
In principle, you need to calculate:
min(p*cost)
p = probability to be called again.
cost = The cost of caching that object again.

Assuming the ability to record when an entry was last accessed, I'd go with a "Cost" for each entry, where you at any time remove the least expensive entry.
Cost = Size * N - TimeSinceLastUse * M
Presuming you completely remove entries from the cache (and do not keep old hit count data around), I'd avoid using hit count as a metric. You'd end up with an entry that has a high hit count because it's been there for a long time, and it'll be there even longer because it has a high hit count.
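As a sketch of how that formula could drive eviction: the CacheEntry and CostEviction types below are hypothetical, time since last use is measured in seconds as an arbitrary choice, and N and M are tuning weights you would have to pick by experiment.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache entry carrying the metrics the formula needs.
public sealed class CacheEntry
{
    public string Key;
    public long Size;            // size of the object that was hashed
    public DateTime LastUsed;    // updated on every cache hit
}

public static class CostEviction
{
    // Cost = Size * N - TimeSinceLastUse * M
    public static double Cost(CacheEntry e, DateTime now, double n, double m)
    {
        return e.Size * n - (now - e.LastUsed).TotalSeconds * m;
    }

    // Returns the entry with the lowest cost, or null if there are none.
    public static CacheEntry PickVictim(IEnumerable<CacheEntry> entries, DateTime now, double n, double m)
    {
        CacheEntry victim = null;
        double lowest = double.MaxValue;
        foreach (CacheEntry e in entries)
        {
            double cost = Cost(e, now, n, m);
            if (cost < lowest) { lowest = cost; victim = e; }
        }
        return victim;
    }
}
```

The linear scan is fine for a small cache; a larger one would probably want the entries kept in some ordered structure, though the cost changes with time, so that ordering is not free to maintain.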

I typically use a strict least recently used (LRU) scheme for discarding things from the cache, unless it's hugely more expensive to reconstruct some items. LRU has the benefit of being trivially simple to implement, and it works really well for a wide range of applications. It also keeps the most frequently used items in the cache.
In essence, I create a linked list that's also indexed by a dictionary. When a client wants an item, I look it up in the dictionary. If it's found, I unlink the node from the linked list and move it to the head of the list. If the item isn't in the cache, I construct it (load it from disk, or whatever), put it at the head of the list, and insert it into the dictionary. If the cache is then over its size limit, I remove the item that's at the tail of the list, along with its dictionary entry.
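A minimal sketch of that linked-list-plus-dictionary arrangement. The LruCache class, the GetOrAdd name, and the factory delegate are illustrative choices, not anything from a library, and the sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _index =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        _capacity = capacity;
    }

    public int Count { get { return _index.Count; } }

    public bool ContainsKey(TKey key) { return _index.ContainsKey(key); }

    // Returns the cached value, constructing it with 'factory' on a miss.
    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_index.TryGetValue(key, out node))
        {
            // Hit: unlink the node and move it to the head (most recently used).
            _order.Remove(node);
            _order.AddFirst(node);
            return node.Value.Value;
        }

        // Miss: construct, put at the head, and index it.
        TValue value = factory(key);
        node = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
        _index[key] = node;

        if (_index.Count > _capacity)
        {
            // Over capacity: evict the tail (least recently used).
            LinkedListNode<KeyValuePair<TKey, TValue>> last = _order.Last;
            _order.RemoveLast();
            _index.Remove(last.Value.Key);
        }
        return value;
    }
}
```

Every operation here is O(1) on average, because the dictionary hands back the list node directly and LinkedList<T> can unlink a known node without searching.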

Might want to try a multilevel style of cache. Dedicate a certain percentage of the cache to expensive-to-create items and a portion to easy-to-create but more heavily accessed items. You can then use a different strategy for maintaining the expensive cache than you would for the less expensive one.

The algorithm could consider the cost of reproducing a missing element. That way you would keep the most valuable items in the cache.

PeekRange on a stack in C#?

I have a program that needs to store data values and periodically get the last 'x' data values.
I initially thought a stack was the way to go, but I need to be able to see more than just the top value: something like a PeekRange method where I can peek at the last 'x' number of values.
At the moment I'm just using a list and get the last, say, 20 values like this:
var last20 = myList.Skip(myList.Count - 20).ToList();
The list grows all the time the program runs, but I only ever want the last 20 values. Could someone give some advice on a better data structure?

I'd probably be using a ring buffer. It's not hard to implement one on your own; AFAIK there's no implementation provided by the Framework.

Well since you mentioned the stack, I guess you only need modifications at the end of the list?
In that case the list is actually a nice solution (cache efficient and with fast insertion/removal at the end). However, your way of extracting the last few items is somewhat inefficient, because IEnumerable<T> does not expose the random access provided by the List. So the Skip() implementation has to scan the whole List until it reaches the end (or do a runtime type check first to detect that the container implements IList<T>). It is more efficient to either access the items directly by index or, if you need a second array, to use List<T>.CopyTo().
If you need fast removal/insertion at the beginning, you may want to consider a ring buffer or (doubly) linked list (see LinkedList<T>). The linked list will be less cache-efficient, but it is easy and efficient to navigate and alter from both directions. The ring buffer is a bit harder to implement, but will be more cache- and space-efficient. So it's probably better if only small value types or reference types are stored, especially when the buffer's size is fixed.
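For the index-based extraction, something along these lines should do. The ListTail class and LastItems name are made up; List<T>.GetRange is the real method and copies the requested slice directly, without walking the earlier items.

```csharp
using System;
using System.Collections.Generic;

public static class ListTail
{
    // Copies the last 'count' items, or the whole list if it is shorter.
    public static List<T> LastItems<T>(List<T> source, int count)
    {
        int n = Math.Max(0, Math.Min(count, source.Count));
        return source.GetRange(source.Count - n, n);
    }
}
```

Note that this only fixes the cost of reading the tail; the list itself still grows for as long as the program runs.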

You could just call RemoveAt(0) after each Add (if the list is longer than 20), so the list will never be longer than 20 items.

You said stack, but you also said you only ever want the last 20 items. I don't think these two requirements really go together.
I would say that Johannes is right about a ring buffer. It is VERY easy to implement this yourself in .NET; just use a Queue<T> and once you reach your capacity (20) start dequeuing (popping) on every enqueue (push).
If you want your PeekRange to enumerate from the most recent to least recent, you can define GetEnumerator to do something like return _queue.Reverse().GetEnumerator();
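Here is roughly what that could look like. RecentBuffer and its members are hypothetical names; it is a thin wrapper over Queue<T> rather than a hand-written ring buffer, and PeekRange uses LINQ's Reverse, so it copies the queue on every call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class RecentBuffer<T>
{
    private readonly Queue<T> _queue;
    private readonly int _capacity;

    public RecentBuffer(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        _capacity = capacity;
        _queue = new Queue<T>(capacity);
    }

    public int Count { get { return _queue.Count; } }

    // Once full, every add drops the oldest item first.
    public void Add(T item)
    {
        if (_queue.Count == _capacity) _queue.Dequeue();
        _queue.Enqueue(item);
    }

    // Returns up to 'count' items, most recent first.
    public List<T> PeekRange(int count)
    {
        return _queue.Reverse().Take(count).ToList();
    }
}
```

With a capacity of 20 the copy made by Reverse is negligible; for a much larger buffer an array-backed ring buffer with direct indexing would avoid it.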

Whoops, .Take() won't do it.
Here's an implementation of .TakeLast()
http://www.codeproject.com/Articles/119666/LINQ-Introducing-The-Take-Last-Operators.aspx

Double ended priority queue

I have a set of data and I want to find the biggest and smallest items (multiple times). What's the best way to do this?
For anyone interested in the application: I'm developing a level-of-detail system and I need to find the items with the biggest and smallest screen-space error. Every time I subdivide or merge an item I have to insert it into the queue, but every time the camera moves the entire dataset changes, so it might be best to just use a sorted list and defer adding new items until the next time I sort (since it happens so often).

You can use a Min-Max Heap as described in the paper Min-Max Heaps and Generalized Priority Queues:
A simple implementation of double ended priority queues is presented. The proposed structure, called a min-max heap, can be built in linear time; in contrast to conventional heaps, it allows both FindMin and FindMax to be performed in constant time; Insert, DeleteMin, and DeleteMax operations can be performed in logarithmic time.
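If writing a min-max heap is more than you want to take on, a SortedSet<T> gives the same double-ended interface with less code. To be clear, the sketch below is not a min-max heap: it swaps in the framework's balanced tree, so finding the min or max is logarithmic rather than constant time and there is no linear-time build. The DoubleEndedPriorityQueue class and its member names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class DoubleEndedPriorityQueue<T>
{
    private sealed class Entry
    {
        public double Priority;
        public long Sequence;   // tie-breaker so equal priorities can coexist
        public T Item;
    }

    private sealed class EntryComparer : IComparer<Entry>
    {
        public int Compare(Entry a, Entry b)
        {
            int c = a.Priority.CompareTo(b.Priority);
            return c != 0 ? c : a.Sequence.CompareTo(b.Sequence);
        }
    }

    private readonly SortedSet<Entry> _set = new SortedSet<Entry>(new EntryComparer());
    private long _nextSequence;

    public int Count { get { return _set.Count; } }

    public void Insert(T item, double priority)
    {
        _set.Add(new Entry { Priority = priority, Sequence = _nextSequence++, Item = item });
    }

    // Removes and returns the item with the smallest priority.
    public T DeleteMin() { return Take(_set.Min); }

    // Removes and returns the item with the largest priority.
    public T DeleteMax() { return Take(_set.Max); }

    private T Take(Entry e)
    {
        // SortedSet.Min and Max return null for an empty set of reference types.
        if (e == null) throw new InvalidOperationException("Queue is empty.");
        _set.Remove(e);
        return e.Item;
    }
}
```

Since the whole dataset changes whenever the camera moves, rebuilding this structure each time costs O(n log n), which is the same order as the sort-and-defer approach from the question, so it is worth measuring both.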
