Double ended priority queue - c#

I have a set of data and I want to find the biggest and smallest items, repeatedly. What's the best way to do this?
For anyone interested in the application: I'm developing a level-of-detail system, and I need to find the items with the biggest and smallest screen-space error. Every time I subdivide or merge an item I have to insert it into the queue, but every time the camera moves the entire dataset changes. It might therefore be best to just use a sorted list and defer adding new items until the next sort, since sorting happens so often.

You can use a Min-Max Heap as described in the paper Min-Max Heaps and Generalized Priority Queues:
A simple implementation of double ended priority queues is presented. The proposed structure, called a min-max heap, can be built in linear time; in contrast to conventional heaps, it allows both FindMin and FindMax to be performed in constant time; Insert, DeleteMin, and DeleteMax operations can be performed in logarithmic time.
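If writing a min-max heap by hand is more than you need, a rough substitute can be sketched on top of SortedSet<T>, which is a balanced tree rather than a heap. This is not the structure from the paper: finding the minimum or maximum costs logarithmic rather than constant time. The class name DoubleEndedPriorityQueue and its members are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: a double-ended priority queue backed by SortedSet<T>.
// This is a balanced-tree stand-in, not the min-max heap from the paper.
public sealed class DoubleEndedPriorityQueue<T>
{
    // The sequence number breaks ties so that equal priorities can coexist in the set.
    private readonly SortedSet<(double Priority, long Seq)> _order =
        new SortedSet<(double Priority, long Seq)>();
    private readonly Dictionary<long, T> _items = new Dictionary<long, T>();
    private long _nextSeq;

    public int Count => _order.Count;

    public void Insert(T item, double priority)
    {
        long seq = _nextSeq++;
        _order.Add((priority, seq));
        _items.Add(seq, item);
    }

    // Removes and returns the item with the smallest priority.
    public T DeleteMin() => Remove(_order.Min);

    // Removes and returns the item with the largest priority
    // (among equal priorities, the most recently inserted one).
    public T DeleteMax() => Remove(_order.Max);

    private T Remove((double Priority, long Seq) key)
    {
        if (_order.Count == 0)
            throw new InvalidOperationException("The queue is empty.");
        _order.Remove(key);
        T item = _items[key.Seq];
        _items.Remove(key.Seq);
        return item;
    }
}
```

For the level-of-detail case, the priority would be the screen-space error, and DeleteMin/DeleteMax would hand back the next candidate to merge or subdivide.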

Related

Which collection class to use for a rolling period timeseries of data?

I want to implement a C# class (.NET 4.6.1) which contains time series data as follows:
the timeseries is a collection keyed on DateTime, each with an associated value (eg an array of doubles)
the points will be added strictly in time order
there will be a rolling time period - e.g. 1 hour - when adding a new point, any points at the start older than this period will be removed
the key issues for performance will be quickly adding new points, and finding the data point for a particular time (a binary search or better is essential). There will be 100k's of points in the collection sometimes.
it has to be thread safe
it has to be relatively memory efficient - e.g. it can't just keep all the points from the beginning of time - as time moves on the memory footprint has to be fairly stable.
So what would be a good approach for that in terms of underlying collection classes? Using Lists will be slow - as the rolling period means we will need to remove from the start of the List a lot. A Queue or LinkedList would solve that - but I don't think they provide a fast way to access the nth item in the collection (ElementAt just iterates so is very slow).
The only solution I can think of will involve storing the data twice: once in a collection that's easy to remove from the start of, and again in one that's easy to search in (with some awful background process to prune the stale points from the search collection somehow).
Is there a better way to approach this?
Thanks.
When I first saw the question I immediately thought of a queue, but most built-in queues do not efficiently allow indexed access, as you've found.
The best suggestion I can come up with is to use a ConcurrentDictionary. It is thread-safe, gives near-constant access time by key, and can be keyed directly on DateTime; functionally it is everything you need except the behavior to manage size/timeline. To do that, use a little LINQ to scan the Keys property for keys more than one hour older than the newest one being added, then TryRemove() each one. Note that TryAdd() is not virtual, so you cannot override it in a derived class; wrap the dictionary in your own class whose add method does the pruning automatically.
The only other potential issue is that it's not terribly memory-efficient: the hash-table-based implementation of .NET dictionaries needs bucket storage and per-entry bookkeeping (hash code, link to the next entry) on top of the keys and values, so it uses more memory per item than a plain array or list.
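A minimal sketch of that wrapper, assuming each point is a double[] and the window is a fixed TimeSpan. The class name RollingTimeSeries and its members are hypothetical. Note that scanning Keys on every add is linear in the number of points, so with hundreds of thousands of points you would want to measure it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

// Hypothetical wrapper around ConcurrentDictionary that prunes stale points on add.
public sealed class RollingTimeSeries
{
    private readonly ConcurrentDictionary<DateTime, double[]> _points =
        new ConcurrentDictionary<DateTime, double[]>();
    private readonly TimeSpan _window;

    public RollingTimeSeries(TimeSpan window)
    {
        _window = window;
    }

    public int Count => _points.Count;

    // Adds a point, then removes every point older than the window
    // measured back from the newly added timestamp.
    public bool Add(DateTime time, double[] values)
    {
        if (!_points.TryAdd(time, values))
            return false; // a point with this exact timestamp already exists

        DateTime cutoff = time - _window;
        // Keys returns a snapshot, so removing while iterating is safe.
        foreach (DateTime stale in _points.Keys.Where(k => k < cutoff))
        {
            _points.TryRemove(stale, out _);
        }
        return true;
    }

    // Exact-timestamp lookup only; a hash table cannot find the nearest point in time.
    public bool TryGet(DateTime time, out double[] values)
    {
        return _points.TryGetValue(time, out values);
    }
}
```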

Good way to manage very large collection

I am trying to think of a fast and efficient way to handle a ton of items, all of the same struct type, in which the array can grow over time and quickly and selectively remove items when the conditions are right.
The application will have a large amount of data streaming in at a relatively fast rate, and I need to quickly analyze it, update some UI info, and drop the older datapoints to make room for new ones. There are certain data points of interest that I need to hang onto for a longer amount of time than others.
The data payload contains 2 integer numbers that represent physical spectrum data: frequency, power, etc. The "age out" thing was just some meta-data I was going to use to determine when it was a good time to drop old data.
I thought that using a LinkedList would be a good choice as it can easily remove items from the middle of the collection, but I need to be able to perform the following pseudo-code:
for(int i = 0; i < myCollection.Length; i++)
{
    myCollection[i].AgeOutVal--;
    if(myCollection[i].AgeOutVal == 0)
    {
        myCollection.Remove(i);
        i--;
    }
}
But I'm getting compiler errors indicating that I cannot use a collection like this. What would be a good/fast way to do this?
I would recommend that first, you do some serious performance analysis of your program. Processing a million items per second only leaves you a few thousand cycles per item, which is certainly doable. But with that kind of performance goal your performance is going to be heavily influenced by things like data locality and the resulting cache misses.
Second, I would recommend that you separate the concern of "does this thing need to be removed from the queue" from whatever concern the object itself represents.
Third, you do not say how big the "age" field can get, just that it is counting down. It seems inefficient to mutate the entire collection every time through the loop just to find the ones to remove. Some ideas:
Suppose the "age" counts down from ten to zero. Instead of creating one collection and each item in the collection has an age, create ten collections, one for things that will time out in one, one for things that will time out in two, and so on. Each tick you throw away the "time out in one" collection, then the "time out in two" collection becomes the "time out in one" collection, and so on. Every time through the loop you just move around a tiny number of collection references, rather than mutating a huge number of items.
Why is "age" counting down at all? Time is increasing. Mark each item according to when it was created, and never change that. Use a queue, so you can insert new items on one end and delete them from the other end. The queue will therefore be sorted by age. Each tick, dequeue items that are too old until you get to an item that is not too old. As mentioned elsewhere, a circular buffer implementation of a queue is likely to be efficient.
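The queue idea can be sketched as follows. The Sample struct, the AgingQueue class, and the use of a plain long tick counter as the clock are all assumptions made for illustration. Queue<T> in .NET is backed by a circular array, so it already fits the circular buffer suggestion.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical payload: two integers plus the tick at which the item was created.
public struct Sample
{
    public int Frequency;
    public int Power;
    public long CreatedTick;
}

// Items are stamped once on insert and never mutated afterwards.
public sealed class AgingQueue
{
    private readonly Queue<Sample> _queue = new Queue<Sample>();
    private readonly long _maxAgeTicks;

    public AgingQueue(long maxAgeTicks)
    {
        _maxAgeTicks = maxAgeTicks;
    }

    public int Count => _queue.Count;

    public void Add(int frequency, int power, long nowTick)
    {
        _queue.Enqueue(new Sample { Frequency = frequency, Power = power, CreatedTick = nowTick });
    }

    // Dequeues from the old end until the front item is young enough.
    // Returns how many items were dropped.
    public int Expire(long nowTick)
    {
        int removed = 0;
        while (_queue.Count > 0 && nowTick - _queue.Peek().CreatedTick > _maxAgeTicks)
        {
            _queue.Dequeue();
            removed++;
        }
        return removed;
    }
}
```

Because items are enqueued in creation order, each call to Expire only touches the items it actually removes plus one. Data points that must live longer could go into a second queue with a larger maximum age.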

How can I manage my memory issue

I am using an ObservableCollection to store information on my CPU usage and transferring this information to a line chart. The information is updated every second. It is working fine, but I realize that this is going to fill up my memory over time because it just keeps adding information to the list.
What is the norm in this situation? Do you reset the list after every minute? I feel that would mess up how the chart looks every time it resets. Please advise how I could manage this memory issue. Thanks.
ObservableCollection<KeyValuePair<DateTime, double>> chart1 = new ObservableCollection<KeyValuePair<DateTime, double>>();
chart1.Add(new KeyValuePair<DateTime, double>(DateTime.Now, getCurrentCpuUsage()));
What exactly you should do depends on your requirements.
If you only need to retain a certain amount of data (e.g. 10 minutes or whatever), a bounded queue may be more appropriate than an ObservableCollection. That way, events that are "too old" automatically fall out of the data structure, allowing you to cap memory usage.
If you still want to be able to access older data in the future, you could write the data coming out of the end of the queue to a file or database instead of just dropping it.
For one implementation of a bounded queue see
Limit size of Queue<T> in .NET?
Since you may need an observable bounded queue, here are some notes on how to implement one (fairly straightforward)
Observable Stack and Queue
Bounded Queue Explained
A regular queue is like the line at a checkout stand. People line up (or as the British would say queue up) at the end of the line, and the cashier takes the person at the front of the line. First in, first out. FIFO.
A bounded queue sets a maximum length for the line. For a real-life line, new people would be prevented from joining the line if it is too long. Some bounded queues in software work that way too. The other option to keep the queue length from exceeding a limit is to remove from the front of the queue when the line is too long. In real life that's probably not very fair, but for software algorithms that's sometimes just what you need.
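As a rough sketch of the second kind of bounded queue (drop from the front when full), one option is to derive from ObservableCollection<T> so that the chart binding keeps working. The class name BoundedObservableCollection is hypothetical, and the sketch assumes items are only ever appended with Add.

```csharp
using System;
using System.Collections.ObjectModel;

// Hypothetical bounded collection: once the capacity is exceeded,
// the oldest items (at index 0) are removed.
public class BoundedObservableCollection<T> : ObservableCollection<T>
{
    private readonly int _capacity;

    public BoundedObservableCollection(int capacity)
    {
        if (capacity <= 0)
            throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    // Add() and Insert() both go through InsertItem, so the bound is enforced in one place.
    protected override void InsertItem(int index, T item)
    {
        base.InsertItem(index, item);
        while (Count > _capacity)
        {
            RemoveAt(0); // raises a Remove notification, so bound views stay in sync
        }
    }
}
```

With one sample per second, a capacity of 600 would keep roughly the last ten minutes of readings.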
Store as much information as you need and drop (or save to file) the rest.
If you need to store data for very long time spans (e.g. more than 1M data points), you may try compressing the data. But figure out your goals and measure time and memory usage first.

Drawing signal with a lot of samples

I need to display a set of signals. Each signal is defined by millions of samples. Just processing the collection (for converting samples to points according to bitmap size) of samples takes a significant amount of time (especially during scrolling).
So I implemented some kind of downsampling. I just skip some points: take every 2nd, every 3rd, every 50th point depending on signal characteristics. It increases speed very much but significantly distorts signal form.
Are there any smarter approaches?
We've had a similar issue in a recent application. Our visualization (a simple line graph) became too cluttered when zoomed out to see the full extent of the data (about 7 days of samples with a sample taken every 6 seconds more or less), so down-sampling was actually the way to go. If we didn't do that, zooming out wouldn't have much meaning, as all you would see was just a big blob of lines smeared out over the screen.
It all depends on how you are going to implement the down-sampling. There's two (simple) approaches: down-sample at the moment you get your sample or down-sample at display time.
What really gives a huge performance boost in both of these cases is the proper selection of your data-sources.
Let's say you have 7 million samples, and your viewing window is just interested in the last million points. If your implementation depends on an IEnumerable, this means that the IEnumerable will have to MoveNext 6 million times before actually starting. However, if you're using something which is optimized for random reads (a List comes to mind), you can implement your own enumerator for that, more or less like this:
public IEnumerator<T> GetEnumerator(int start, int count, int skip)
{
    // Assumes a field named _data in the class which holds the samples as a List<T>.
    // Walks the window [start, start + count) and yields every skip-th sample (skip must be >= 1).
    for (int i = start; i < start + count && i < _data.Count; i += skip)
    {
        yield return _data[i];
    }
}
Obviously this is a very naive implementation, but you can do whatever you want within the for-loop (for example, average over the surrounding samples). However, averaging will usually smooth out any extreme spikes in your signal, so be wary of that.
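One common way to keep spikes visible, which is a different technique from the skip-based enumerator above, is to split the visible window into one bucket per pixel column and keep the minimum and maximum of each bucket. The sketch below assumes samples are doubles; the Downsampler class and its parameter names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class Downsampler
{
    // Splits data[start .. start + count) into `buckets` contiguous ranges and
    // returns the (min, max) pair of each non-empty range.
    public static List<(double Min, double Max)> MinMax(
        IReadOnlyList<double> data, int start, int count, int buckets)
    {
        var result = new List<(double Min, double Max)>();
        int end = Math.Min(start + count, data.Count);
        if (buckets <= 0 || start < 0 || start >= end)
            return result;

        int length = end - start;
        for (int b = 0; b < buckets; b++)
        {
            // long arithmetic avoids overflow for very large sample counts
            int from = start + (int)((long)b * length / buckets);
            int to = start + (int)((long)(b + 1) * length / buckets);
            if (from == to)
                continue; // more buckets than samples

            double min = data[from];
            double max = data[from];
            for (int i = from + 1; i < to; i++)
            {
                if (data[i] < min) min = data[i];
                if (data[i] > max) max = data[i];
            }
            result.Add((min, max));
        }
        return result;
    }
}
```

Drawing a vertical line from Min to Max in each pixel column costs at most two values per column regardless of how many samples fall into it, and a single-sample spike still shows up.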
Another approach would be to create some generalized versions of your dataset for different ranges, which update themselves whenever you receive a new signal. You usually don't need to update the complete dataset; just updating the end of your set is probably good enough. This allows you to do a bit more advanced processing of your data, but it will cost more memory. You will have to cache the distinct 'layers' of detail in your application.
However, reading your (short) explanation, I think a display-time optimization might be good enough. You will always get a distortion in your signal if you generalize. You always lose data. It's up to the algorithm you choose on how this distortion will occur, and how noticeable it will be.
You need a better sampling algorithm. You can also employ the parallel processing features of C#; refer to the Task Parallel Library.

cache entry replacement algorithm

I have a software project that creates a series of fingerprint (hash) values from objects of varying size. The larger the object size, of course, the more expensive the computation of the hash. The hashes are used for comparative purposes.
I now wish to cache hash values in order to improve performance of subsequent comparisons. For any given entry in the cache, I have the following metrics available:
hit count
last modification date/time
size of object hashed
So on to my question. Given the need to constrain the size of the cache (limit it to a specific number of entries), what is a well-balanced approach to replacing cache items?
Clearly, larger objects are more expensive to hash so they need to be kept around for as long as possible. However, I want to avoid a situation where populating the cache with a large quantity of large objects will prevent future (smaller) items from being cached.
So, based upon the metrics available to me (see above), I'm looking for a good general-purpose "formula" for expiring (removing) cache entries when the cache becomes full.
All thoughts, comments are appreciated.
You need to think about the nature of the objects. Think about the probability of the objects to be called again soon. And remove the least likely object.
This is very specific to the software you're using and the nature of the objects.
If they are used continuously in the program, they will probably abide by the locality of reference principle, so you should use an LRU (least recently used) algorithm.
If objects with higher hit count are more likely to be called again, then use that (and remove the lowest).
Take a look at Cache Algorithms
In principle, you need to calculate:
min(p*cost)
p = probability to be called again.
cost = The cost of caching that object again.
Assuming the ability to record when an entry was last accessed, I'd go with a "Cost" for each entry, where you at any time remove the least expensive entry.
Cost = Size * N - TimeSinceLastUse * M
Presuming you completely remove entries from the cache (and do not keep old hit count data around), I'd avoid using hit count as a metric. You'd end up with an entry that has a high hit count because it's been there for a long time, and it'll stay there even longer because it has a high hit count.
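A small sketch of that formula, under the assumption that time since last use is measured in seconds. The weights n and m are tuning knobs you would have to pick for your workload, and the CacheEntryInfo and Eviction names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical bookkeeping record for one cache entry.
public sealed class CacheEntryInfo
{
    public string Key;
    public long Size;         // size of the hashed object
    public DateTime LastUsed; // when the entry was last accessed
}

public static class Eviction
{
    // Cost = Size * N - TimeSinceLastUse * M, with time measured in seconds.
    public static double Cost(CacheEntryInfo entry, DateTime now, double n, double m)
    {
        return entry.Size * n - (now - entry.LastUsed).TotalSeconds * m;
    }

    // Returns the entry with the lowest cost, or null if there are no entries.
    public static CacheEntryInfo PickVictim(
        IEnumerable<CacheEntryInfo> entries, DateTime now, double n, double m)
    {
        CacheEntryInfo victim = null;
        double lowest = 0;
        foreach (CacheEntryInfo entry in entries)
        {
            double cost = Cost(entry, now, n, m);
            if (victim == null || cost < lowest)
            {
                victim = entry;
                lowest = cost;
            }
        }
        return victim;
    }
}
```

A larger m relative to n makes stale large objects leave sooner, which addresses the worry about big entries crowding out small ones.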
I typically use a strict least recently used (LRU) scheme for discarding things from the cache, unless it's hugely more expensive to reconstruct some items. LRU has the benefit of being trivially simple to implement, and it works really well for a wide range of applications. It also keeps the most frequently used items in the cache.
In essence, I create a linked list that's also indexed by a dictionary. When a client wants an item, I look it up in the dictionary. If it's found, I unlink the node from the linked list and move it to the head of the list. If the item isn't in the cache, I construct it (load it from disk, or whatever), put it at the head of the list, insert it into the dictionary, and then remove the item that's at the tail of the list.
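The linked list plus dictionary arrangement described above might look roughly like this. It is a minimal, single-threaded sketch: the LruCache name and the factory delegate used to construct missing items are assumptions, and a real cache would likely need locking.

```csharp
using System;
using System.Collections.Generic;

// Minimal LRU cache: the head of the list is the most recently used entry,
// the tail is the next one to be evicted.
public sealed class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Func<TKey, TValue> _factory; // constructs an item on a cache miss
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _index =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();

    public LruCache(int capacity, Func<TKey, TValue> factory)
    {
        if (capacity <= 0)
            throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
    }

    public int Count => _index.Count;

    public TValue Get(TKey key)
    {
        if (_index.TryGetValue(key, out var node))
        {
            // Hit: unlink the node and move it to the head of the list.
            _order.Remove(node);
            _order.AddFirst(node);
            return node.Value.Value;
        }

        // Miss: construct the item, put it at the head, and index it.
        TValue value = _factory(key);
        node = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
        _index.Add(key, node);

        // Evict the least recently used entry if the cache is over capacity.
        if (_index.Count > _capacity)
        {
            var tail = _order.Last;
            _order.RemoveLast();
            _index.Remove(tail.Value.Key);
        }
        return value;
    }
}
```

Both the lookup and the move-to-head are constant-time operations, which is why this pairing is a popular way to implement LRU.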
You might want to try a multilevel style of cache. Dedicate a certain percentage of the cache to expensive-to-create items and a portion to easy-to-create but more heavily accessed items. You can then use different strategies for maintaining the expensive cache than you would for the less expensive one.
The algorithm could consider the cost of reproducing a missing element. That way you would keep the most valuable items in the cache.
