How can I manage my memory issue - c#

I am using an ObservableCollection to store information about my CPU usage and feed it to a line chart. The information is updated every second. It is working fine, but I realize that this is going to use up more and more memory over time because it just keeps adding entries to the list.
What is the norm in this situation? Do you reset the list every minute? I feel that would mess up how the chart looks every time it resets. Please advise how I can manage this memory issue. Thanks.
ObservableCollection<KeyValuePair<DateTime, double>> chart1 = new ObservableCollection<KeyValuePair<DateTime, double>>();
chart1.Add(new KeyValuePair<DateTime, double>(DateTime.Now, getCurrentCpuUsage()));

What exactly you should do depends on your requirements.
If you only need to retain a certain amount of data (e.g. 10 minutes or whatever), a bounded queue may be more appropriate than an ObservableCollection. That way, events that are "too old" automatically fall out of the data structure, allowing you to cap memory usage.
If you still want to be able to access older data in the future, you could write the data coming out of the end of the queue to a file or database instead of just dropping it.
For one implementation of a bounded queue see
Limit size of Queue<T> in .NET?
Since you may need an observable bounded queue, here are some notes on how to implement one (it's fairly straightforward):
Observable Stack and Queue
Bounded Queue Explained
A regular queue is like the line at a checkout stand. People line up (or as the British would say queue up) at the end of the line, and the cashier takes the person at the front of the line. First in, first out. FIFO.
A bounded queue sets a maximum length for the line. For a real-life line, new people would be prevented from joining the line if it is too long. Some bounded queues in software work that way too. The other option to keep the queue length from exceeding a limit is to remove from the front of the queue when the line is too long. In real life that's probably not very fair, but for software algorithms that's sometimes just what you need.
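As a concrete illustration (the class name and capacity are illustrative, not taken from the linked answers), here is a minimal sketch of a bounded observable collection that drops its oldest entries once a fixed capacity is exceeded, so both the chart and the memory usage stay bounded:

using System.Collections.ObjectModel;

public class BoundedObservableCollection<T> : ObservableCollection<T>
{
    private readonly int _capacity;

    public BoundedObservableCollection(int capacity)
    {
        _capacity = capacity;
    }

    protected override void InsertItem(int index, T item)
    {
        base.InsertItem(index, item);

        // Drop items from the front (the oldest samples) until we are
        // back under the capacity limit.
        while (Count > _capacity)
            RemoveAt(0);
    }
}

For the CPU chart above, a capacity of 600 would keep roughly the last 10 minutes of data at one sample per second.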

Store as much information as you need and drop (or save to file) the rest.
If you need to store data for very long time spans (more than, say, 1M data points) you may try compressing the data. But figure out your goals and measure processing time/memory usage first.

Related

Which collection class to use for a rolling period timeseries of data?

I want to implement a C# class (.NET 4.6.1) which contains time series data as follows:
the timeseries is a collection keyed on DateTime, each key with an associated value (e.g. an array of doubles)
the points will be added strictly in time order
there will be a rolling time period - e.g. 1 hour - when adding a new point, any points at the start older than this period will be removed
the key issues for performance will be quickly adding new points, and finding the data point for a particular time (a binary search or better is essential). There will be 100k's of points in the collection sometimes.
it has to be thread safe
it has to be relatively memory efficient - e.g. it can't just keep all the points from the beginning of time - as time moves on the memory footprint has to be fairly stable.
So what would be a good approach for that in terms of underlying collection classes? Using Lists will be slow - as the rolling period means we will need to remove from the start of the List a lot. A Queue or LinkedList would solve that - but I don't think they provide a fast way to access the nth item in the collection (ElementAt just iterates so is very slow).
The only solution I can think of involves storing the data twice: once in a collection that's easy to remove from the start of, and again in one that's easy to search in (with some awful background process to prune the stale points from the search collection somehow).
Is there a better way to approach this?
Thanks.
When I first saw the question I immediately thought of a queue, but most built-in queues do not efficiently allow indexed access, as you've found.
The best suggestion I can come up with is to use a ConcurrentDictionary. It's thread-safe, gives near-constant access time by key, and you can key directly on DateTimes. It's everything you need functionally, except the behaviour that manages the size/timeline. To do that, you use a little LINQ to scan the Keys property for keys more than one hour older than the newest one being added, then TryRemove() each one (you can wrap the dictionary in your own class so this happens automatically whenever something new is added).
The only other potential issue is that it's not terribly memory-efficient; the hash-based storage behind .NET dictionaries allocates bucket and entry arrays sized from the hash of the key, and those arrays will be sparsely populated even with 100k items in the collection.
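A rough sketch of that idea follows; the class and member names are hypothetical, and the pruning is done inline in the Add method rather than inside the dictionary itself:

using System;
using System.Collections.Concurrent;
using System.Linq;

public class RollingTimeSeries
{
    private readonly ConcurrentDictionary<DateTime, double[]> _points =
        new ConcurrentDictionary<DateTime, double[]>();
    private readonly TimeSpan _window;

    public RollingTimeSeries(TimeSpan window)
    {
        _window = window;
    }

    public void Add(DateTime timestamp, double[] values)
    {
        _points.TryAdd(timestamp, values);

        // Prune anything that has fallen out of the rolling window.
        DateTime cutoff = timestamp - _window;
        foreach (var staleKey in _points.Keys.Where(k => k < cutoff).ToList())
        {
            double[] removed;
            _points.TryRemove(staleKey, out removed);
        }
    }

    public bool TryGet(DateTime timestamp, out double[] values)
    {
        return _points.TryGetValue(timestamp, out values);
    }
}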

Good way to manage very large collection

I am trying to think of a fast and efficient way to handle a ton of items, all of the same struct type, where the collection can grow over time and I can quickly and selectively remove items when the conditions are right.
The application will have a large amount of data streaming in at a relatively fast rate, and I need to quickly analyze it, update some UI info, and drop the older datapoints to make room for new ones. There are certain data points of interest that I need to hang onto for a longer amount of time than others.
The data payload contains 2 integer numbers that represent physical spectrum data: frequency, power, etc. The "age out" thing was just some meta-data I was going to use to determine when it was a good time to drop old data.
I thought that using a LinkedList would be a good choice as it can easily remove items from the middle of the collection, but I need to be able to perform the following pseudo-code:
for (int i = 0; i < myCollection.Length; i++)
{
    myCollection[i].AgeOutVal--;
    if (myCollection[i].AgeOutVal == 0)
    {
        myCollection.Remove(i);
        i--;
    }
}
But I'm getting compiler errors indicating that I cannot use a collection like this. What would be a good/fast way to do this?
I would recommend that first, you do some serious performance analysis of your program. Processing a million items per second only leaves you a few thousand cycles per item, which is certainly doable. But with that kind of performance goal your performance is going to be heavily influenced by things like data locality and the resulting cache misses.
Second, I would recommend that you separate the concern of "does this thing need to be removed from the queue" from whatever concern the object itself represents.
Third, you do not say how big the "age" field can get, just that it is counting down. It seems inefficient to mutate the entire collection every time through the loop just to find the ones to remove. Some ideas:
Suppose the "age" counts down from ten to zero. Instead of creating one collection and each item in the collection has an age, create ten collections, one for things that will time out in one, one for things that will time out in two, and so on. Each tick you throw away the "time out in one" collection, then the "time out in two" collection becomes the "time out in one" collection, and so on. Every time through the loop you just move around a tiny number of collection references, rather than mutating a huge number of items.
Why is "age" counting down at all? Time is increasing. Mark each item according to when it was created, and never change that. Use a queue, so you can insert new items on one end and delete them from the other end. The queue will therefore be sorted by age. Each tick, dequeue items that are too old until you get to an item that is not too old. As mentioned elsewhere, a circular buffer implementation of a queue is likely to be efficient.

Norms, rules or guidelines for calculating and showing "ETA/ETC" for a process

ETC = "Estimated Time of Completion"
I'm counting the time it takes to run through a loop and showing the user some numbers that tell him/her approximately how much time the full process will take. I feel like this is a common thing that everyone does on occasion, and I would like to know if you have any guidelines that you follow.
Here's an example I'm using at the moment:
int itemsLeft;       // The number of items left to run through.
double timeLeft;
TimeSpan TsTimeLeft;
List<double> average;
double milliseconds; // The time each loop takes to complete; reset every loop.

// The background worker calls this event once for each item. The total number
// of items is in the hundreds for this particular application, and every loop
// takes roughly one second.
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    // An item has been completed!
    itemsLeft--;
    average.Add(milliseconds);

    // Get an average time per item and multiply it by the items left.
    timeLeft = average.Sum() / average.Count * itemsLeft;
    TsTimeLeft = TimeSpan.FromSeconds(timeLeft);

    this.Text = String.Format("ETC: {0}:{1:D2}:{2:D2} ({3:N2}s/file)",
        TsTimeLeft.Hours,
        TsTimeLeft.Minutes,
        TsTimeLeft.Seconds,
        average.Sum() / average.Count);

    // Only use the last 20-30 logs in the calculation to prevent an
    // unnecessarily long List<>.
    if (average.Count > 30)
        average.RemoveRange(0, 10);

    milliseconds = 0;
}

// this.profiler.Interval = 10;
private void profiler_Tick(object sender, EventArgs e)
{
    milliseconds += 0.01;
}
As I am a programmer at the very start of my career, I'm curious to see what you would do in this situation. My main concern is the fact that I calculate and update the UI on every loop; is this bad practice?
Are there any dos and don'ts when it comes to estimations like this? Are there any preferred ways of doing it, e.g. update every second, update every ten logs, or calculate and update the UI separately? Also, when would an ETA/ETC be a good or bad idea?
The real problem with estimating the time taken by a process is quantifying the workload. Once you can quantify that, you can make a better estimate.
Examples of good estimates
File system I/O or network transfer. Whether or not the file system has bad performance is something you can know in advance: you can quantify the total number of bytes to be processed and you can measure the speed. Once you have these, and once you can monitor how many bytes you have transferred, you get a good estimate. Random factors may affect it (e.g. another application starting up meanwhile), but you still get a meaningful value.
Encryption on large streams, for the reasons above. Even if you are computing an MD5 hash, you always know how many blocks have been processed, how many are left, and the total.
Item synchronization. This is a little trickier. If you can assume that the per-unit workload is constant, or you can make a good estimate of the time required to process an item when the variance is low or insignificant, then you can make another good estimate of the process. Take email synchronization: if you don't know the byte size of the messages (otherwise you fall into case 1), but common practice tells you that the majority of emails have roughly the same size, then you can use the mean time taken to download/upload the already processed emails to estimate the time taken to process a single email. This won't work in 100% of cases and is subject to error, but you will still see the progress bar progressing on a large account.
In general the rule is that you can make a good estimate of ETC/ETA (ETA is actually the date and time the operation is expected to complete) if you have a homogeneous process about which you know the numbers. Homogeneity guarantees that the time to process one work item is comparable to the others, i.e. the time taken to process previous items can be used to estimate the future; the numbers are what let you make correct calculations.
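As a concrete illustration of that rule (all names here are illustrative, not from the question), a minimal sketch that estimates the remaining time of a homogeneous process from the average per-item duration:

using System;

class EtaEstimator
{
    private int _itemsDone;
    private TimeSpan _elapsed = TimeSpan.Zero;

    public void ItemCompleted(TimeSpan itemDuration)
    {
        _itemsDone++;
        _elapsed += itemDuration;
    }

    public TimeSpan EstimateRemaining(int itemsLeft)
    {
        if (_itemsDone == 0)
            return TimeSpan.Zero; // nothing measured yet

        // Average time per item so far, multiplied by the remaining count.
        double avgSeconds = _elapsed.TotalSeconds / _itemsDone;
        return TimeSpan.FromSeconds(avgSeconds * itemsLeft);
    }
}

Averaging only the last N durations, as the question's code does, makes the estimate adapt faster when the per-item time drifts.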
Examples of bad estimates
Operations on a number of files of unknown size. This time you only know how many files you want to process (e.g. to download), but you don't know their size in advance. If the file sizes have high variance, you run into trouble. Having downloaded half of the files, when these happened to be the smallest ones and sum up to only 10% of the total bytes, can you say you are halfway? No! You just see the progress bar grow quickly to 50% and then advance much more slowly.
Heterogeneous processes, e.g. Windows installations. As pointed out by @HansPassant, Windows installations provide a worse-than-bad estimate. Installing Windows software involves several processes including file copying (this can be estimated), registry modifications (usually never estimated), and execution of transactional code. The real problem is the last one. Transactional processes involving execution of custom installer code are discussed below.
Execution of generic code. This can never be estimated. A code fragment contains conditional statements, and executing them means taking different paths depending on conditions external to the code. This means, for example, that a program behaves differently depending on whether you have a printer installed or not, whether you have a local or a domain account, etc.
Conclusions
Estimating the duration of a software process is neither an impossible nor an exact/deterministic task.
It's not impossible because, even in the case of code fragments, you can either find a model for your code (take an LU factorization as an example; that can be estimated), or you might redesign your code into an estimation phase, where you first determine the branch conditions, and an execution phase, where all the pre-determined branches are taken. I said might because this is in practice often impossible: most code determines its branches as effects of previous conditions, meaning that estimating a branch actually requires running the code. A chicken-and-egg problem.
It's not a deterministic process. Computer systems, especially multitasking ones, are affected by a number of random factors that can impact the process you are estimating. You will never get a correct estimate before running your process. At most, you can detect external factors and re-estimate. The gap between your estimate and the real duration converges to zero as you approach the end of the process (lim[x->N] |est(x) - real(x)| = 0, where N is the process duration).
If your user interface is so obscure that you have to explain that ETC doesn't mean Etcetera then you are doing it wrong. Every user understands what a progress bar does, don't help.
Nothing is quite as annoying as an inaccurate progress bar. Particularly ones that promise a quick finish but then don't deliver. I'd give the progress bar displayed by any installer on Windows as a good example of one that is fundamentally broken. Just not a shining example of an implementation that you should pursue.
Such a progress bar is broken because it is utterly impossible to guess up front how long it is going to take to install a program. File systems have very unpredictable perf. This is a very common problem with estimating execution time. Better UI models are the spinning dots you'd see in a video player and many programs in Windows 8. Or the marquee style supported by the common ProgressBar control. Just feedback that says "I'm not dead, working on it". Even the hour-glass cursor is better than a bad estimate. If you have something to report beyond a technicality that no user is really interested in then don't hesitate to display that. Like the number of files you've processed or the number of kilobytes you've downloaded. The actual value of the number isn't that useful, seeing the rate at which it increases is the interesting tidbit.
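For reference, a minimal WinForms sketch of the marquee style mentioned above (assuming a standard ProgressBar control); it gives pure "I'm not dead, working on it" feedback with no estimate attached:

using System.Windows.Forms;

static class ProgressBarSetup
{
    public static void UseMarquee(ProgressBar bar)
    {
        // Marquee style: continuous animation, no percentage to get wrong.
        bar.Style = ProgressBarStyle.Marquee;
        bar.MarqueeAnimationSpeed = 30; // milliseconds per animation step
    }
}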

Drawing signal with a lot of samples

I need to display a set of signals. Each signal is defined by millions of samples. Just processing the collection of samples (converting samples to points according to the bitmap size) takes a significant amount of time, especially during scrolling.
So I implemented some kind of downsampling: I just skip some points, taking every 2nd, every 3rd, or every 50th point depending on the signal characteristics. It increases the speed very much but significantly distorts the signal shape.
Are there any smarter approaches?
We've had a similar issue in a recent application. Our visualization (a simple line graph) became too cluttered when zoomed out to see the full extent of the data (about 7 days of samples with a sample taken every 6 seconds, more or less), so down-sampling was actually the way to go. If we hadn't done that, zooming out wouldn't have had much meaning, as all you would have seen was just a big blob of lines smeared over the screen.
It all depends on how you are going to implement the down-sampling. There are two (simple) approaches: down-sample at the moment you get your sample, or down-sample at display time.
What really gives a huge performance boost in both of these cases is the proper selection of your data-sources.
Let's say you have 7 million samples, and your viewing window is just interested in the last million points. If your implementation depends on an IEnumerable, this means that the IEnumerable will have to MoveNext 6 million times before actually starting. However, if you're using something which is optimized for random reads (a List comes to mind), you can implement your own enumerator for that, more or less like this:
public IEnumerator<T> GetEnumerator(int start, int count, int skip)
{
    // Assume we have a field in the class which contains the data as a List<T>, named _data.
    int yielded = 0;
    for (int i = start; i < _data.Count && yielded < count; i += skip)
    {
        yield return _data[i];
        yielded++;
    }
}
Obviously this is a very naive implementation, but you can do whatever you want within the for-loop (use an algorithm based on the surrounding samples to average?). However, this approach will usually smooth out any extreme spikes in your signal, so be wary of that.
Another approach would be to create generalized versions of your dataset for different ranges, which update themselves whenever you receive a new signal. You usually don't need to update the complete dataset; just updating the end of your set is probably good enough. This allows you to do a bit more advanced processing of your data, but it will cost more memory: you will have to cache the distinct 'layers' of detail in your application.
However, reading your (short) explanation, I think a display-time optimization might be good enough. You will always get distortion in your signal if you generalize; you always lose data. It's up to the algorithm you choose how this distortion occurs and how noticeable it will be.
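As one possible display-time reduction that avoids flattening spikes (a common alternative, not something from the answer above; names are illustrative): keep the minimum and maximum of each bucket of samples.

using System;
using System.Collections.Generic;

static class Downsampler
{
    public static IEnumerable<double> MinMaxPerBucket(IReadOnlyList<double> samples, int bucketSize)
    {
        for (int start = 0; start < samples.Count; start += bucketSize)
        {
            int end = Math.Min(start + bucketSize, samples.Count);
            double min = double.MaxValue, max = double.MinValue;

            for (int i = start; i < end; i++)
            {
                if (samples[i] < min) min = samples[i];
                if (samples[i] > max) max = samples[i];
            }

            // Emitting min then max keeps two points per bucket; ordering them
            // by original index would preserve the waveform shape even better.
            yield return min;
            yield return max;
        }
    }
}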
You need a better sampling algorithm; you can also employ the parallel processing features of C#. Refer to the Task Parallel Library.

Double ended priority queue

I have a set of data and I want to find the biggest and smallest items (multiple times); what's the best way to do this?
For anyone interested in the application: I'm developing a level-of-detail system and I need to find the items with the biggest and smallest screen-space error. Obviously every time I subdivide/merge an item I have to insert it into the queue, but every time the camera moves the entire dataset changes, so it might be best to just use a sorted list and defer adding new items until the next time I sort (since that happens so often).
You can use a Min-Max Heap as described in the paper Min-Max Heaps and Generalized Priority Queues:
A simple implementation of double ended priority queues is presented. The proposed structure, called a min-max heap, can be built in linear time; in contrast to conventional heaps, it allows both FindMin and FindMax to be performed in constant time; Insert, DeleteMin, and DeleteMax operations can be performed in logarithmic time.
