I have a program that needs to store data values and periodically get the last 'x' data values.
I initially thought a stack was the way to go, but I need to be able to see more than just the top value - something like a PeekRange method where I can peek at the last 'x' values.
At the moment I'm just using a list and get the last, say, 20 values like this:
var last20 = myList.Skip(myList.Count - 20).ToList();
The list grows all the time the program runs, but I only ever want the last 20 values. Could someone give some advice on a better data structure?
I'd probably use a ring buffer. It's not hard to implement one on your own; AFAIK the Framework doesn't provide an implementation.
Well since you mentioned the stack, I guess you only need modifications at the end of the list?
In that case the list is actually a nice solution (cache-efficient and with fast insertion/removal at the end). However, your way of extracting the last few items is somewhat inefficient, because IEnumerable<T> doesn't expose the random access provided by the List. So the Skip() implementation has to enumerate the list from the start until it reaches the items you want (unless it first does a runtime type check to detect that the container implements IList<T>). It is more efficient either to access the items directly by index or, if you need a second array, to use List<T>.CopyTo().
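As a rough sketch of the index-based approach, List<T>.GetRange copies a slice straight out of the backing array without enumerating the items before it. The ListTail class and LastItems method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the framework.
public static class ListTail
{
    // Returns a copy of the last `count` items (or fewer if the list is shorter).
    public static List<T> LastItems<T>(List<T> source, int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        int take = Math.Min(count, source.Count);
        // GetRange copies the slice directly; nothing before it is visited.
        return source.GetRange(source.Count - take, take);
    }
}
```

With this, var last20 = ListTail.LastItems(myList, 20); costs time proportional to the 20 items copied rather than to the length of the list.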
If you need fast removal/insertion at the beginning, you may want to consider a ring buffer or a (doubly) linked list (see LinkedList<T>). The linked list will be less cache-efficient, but it is easy and efficient to navigate and alter from both directions. The ring buffer is a bit harder to implement, but will be more cache- and space-efficient. So it's probably the better choice if only small value types or reference types are stored, especially when the buffer's size is fixed.
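For illustration, a fixed-size ring buffer could look roughly like the following. This is only a minimal sketch: the RingBuffer<T> class and its members are hypothetical names, it is not thread safe, and it simply overwrites the oldest item once it is full.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fixed-capacity ring buffer; overwrites the oldest item when full.
public sealed class RingBuffer<T>
{
    private readonly T[] _items;
    private int _start;   // index of the oldest item
    private int _count;   // number of items currently stored

    public RingBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
    }

    public int Count { get { return _count; } }

    public void Add(T item)
    {
        // When the buffer is full this slot is the oldest item, which gets overwritten.
        int end = (_start + _count) % _items.Length;
        _items[end] = item;
        if (_count == _items.Length)
            _start = (_start + 1) % _items.Length;
        else
            _count++;
    }

    // Yields the newest `n` items (or fewer), oldest of those first.
    public IEnumerable<T> PeekRange(int n)
    {
        int take = Math.Max(0, Math.Min(n, _count));
        for (int i = _count - take; i < _count; i++)
            yield return _items[(_start + i) % _items.Length];
    }
}
```

Adding is O(1) and no memory is allocated after construction, which is where the cache and space advantage over a linked list comes from.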
You could just call RemoveAt(0) after each Add (if the list is longer than 20), so the list will never hold more than 20 items.
You said stack, but you also said you only ever want the last 20 items. I don't think these two requirements really go together.
I would say that Johannes is right about a ring buffer. It is VERY easy to implement this yourself in .NET; just use a Queue<T> and once you reach your capacity (20) start dequeuing (popping) on every enqueue (push).
If you want your PeekRange to enumerate from the most recent to the least recent, you can define GetEnumerator to do something like return _queue.Reverse().GetEnumerator();
Whoops, .Take() won't do it.
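Put together, the Queue<T> idea might be sketched like this. RecentItems<T> is a hypothetical wrapper name, and Reverse() here is the LINQ extension method, since Queue<T> has no Reverse method of its own.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: keeps at most `capacity` items, newest enumerated first.
public sealed class RecentItems<T> : IEnumerable<T>
{
    private readonly Queue<T> _queue;
    private readonly int _capacity;

    public RecentItems(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
        _queue = new Queue<T>(capacity);
    }

    public void Add(T item)
    {
        if (_queue.Count == _capacity)
            _queue.Dequeue();      // discard the oldest item
        _queue.Enqueue(item);
    }

    // Enumerates from the most recent item to the least recent.
    public IEnumerator<T> GetEnumerator()
    {
        return _queue.Reverse().GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Because the wrapper implements IEnumerable<T>, a PeekRange of the newest x items is then just recent.Take(x).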
Here's an implementation of .TakeLast()
http://www.codeproject.com/Articles/119666/LINQ-Introducing-The-Take-Last-Operators.aspx
Related
I know exactly how many items I want to keep in a list. They are ordered, and I only need the list to end at a specific index that I know. But I don't want to alter the capacity or use TrimExcess to make it smaller, because then adding an item again would double the size of the list again.
How can I set the Count directly instead of using Remove, RemoveAt or RemoveRange?
My priority is optimization of speed for this operation.
Important: I know I can use an array, but I am not allowed. Also, I'm adding items and removing them all the time. I just want the capacity to stay around a similar amount which I don't know exactly but it will stabilize.
If you remove elements, the Capacity won't change. So if you don't use TrimExcess(), the Capacity will only ever increase (to the maximum you ever used for this list), and there's no performance penalty in removing elements and adding them again. You can set the initial capacity in the constructor, which is a good idea if you know how many elements you'll be using (or have an estimate), because that avoids the overhead of the repeated doubling while initially building up the list.
Note: Insert/Remove in a list is still O(n), because the elements after the affected position need to be copied around (unless you operate only at the tail end of the list).
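To make that concrete, truncating at the tail with RemoveRange leaves Capacity alone. The CapacityDemo class and Truncate method below are hypothetical names for a small sketch:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the framework.
public static class CapacityDemo
{
    // Truncates the list to `newCount` items without shrinking its backing array.
    public static void Truncate<T>(List<T> list, int newCount)
    {
        if (list == null) throw new ArgumentNullException(nameof(list));
        if (newCount < 0 || newCount > list.Count)
            throw new ArgumentOutOfRangeException(nameof(newCount));

        // Removing from the tail shifts no elements; Capacity is left unchanged.
        list.RemoveRange(newCount, list.Count - newCount);
    }
}
```

After Truncate(list, 10) on a list built with new List<int>(1000), list.Count is 10 and list.Capacity is still 1000, so later Adds do not trigger a reallocation until the old maximum is exceeded.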
Use an array and (in C# 8.0 onwards) use Indices and Ranges with slicing.
https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-8#indices-and-ranges
I want to implement a C# class (.NET 4.6.1) which contains time series data as follows:
the timeseries is a collection keyed on DateTime, each with an associated value (eg an array of doubles)
the points will be added strictly in time order
there will be a rolling time period - e.g. 1 hour - when adding a new point, any points at the start older than this period will be removed
the key issues for performance will be quickly adding new points, and finding the data point for a particular time (a binary search or better is essential). There will be 100k's of points in the collection sometimes.
it has to be thread safe
it has to be relatively memory efficient - e.g. it can't just keep all the points from the beginning of time - as time moves on the memory footprint has to be fairly stable.
So what would be a good approach for that in terms of underlying collection classes? Using Lists will be slow - as the rolling period means we will need to remove from the start of the List a lot. A Queue or LinkedList would solve that - but I don't think they provide a fast way to access the nth item in the collection (ElementAt just iterates so is very slow).
The only solution I can think of involves storing the data twice - once in a collection that's easy to remove from the start of, and again in one that's easy to search in (with some awful background process to prune the stale points from the search collection somehow).
Is there a better way to approach this?
Thanks.
When I first saw the question I immediately thought of a queue, but most built-in queues do not efficiently allow indexed access, as you've found.
Best suggestion I can come up with is to use a ConcurrentDictionary. It's thread-safe, has near-constant access time by key, and you can key directly on DateTimes; it's everything you need functionally, except the behavior to manage size/timeline. To do that, you use a little LINQ to scan the Keys property for keys more than one hour older than the newest one being added, then TryRemove() each one (you can wrap the ConcurrentDictionary in your own class whose add method does this automatically; TryAdd() itself isn't virtual, so deriving from ConcurrentDictionary and overriding it won't work).
Only other potential issue is it's not terribly memory-efficient; the hash-table-based implementation of .NET dictionaries needs bucket and per-entry storage on top of the items themselves, and the bucket array stays sparsely populated to keep collisions low, even with 100k items in the collection.
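A minimal sketch of that wrapper idea, assuming values are double[] and the window is passed in as a TimeSpan. RollingSeries and its members are hypothetical names. Note that it only supports exact-key lookups and that the prune step scans every key on each add, so it may well be too slow for 100k points.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

// Hypothetical wrapper around ConcurrentDictionary that prunes stale points on add.
public sealed class RollingSeries
{
    private readonly ConcurrentDictionary<DateTime, double[]> _points =
        new ConcurrentDictionary<DateTime, double[]>();
    private readonly TimeSpan _window;

    public RollingSeries(TimeSpan window)
    {
        _window = window;
    }

    public int Count { get { return _points.Count; } }

    public void Add(DateTime time, double[] value)
    {
        _points[time] = value;
        DateTime cutoff = time - _window;

        // Keys returns a snapshot, so this scan is O(n) on every add.
        foreach (DateTime stale in _points.Keys.Where(k => k < cutoff))
        {
            double[] removed;
            _points.TryRemove(stale, out removed);
        }
    }

    // Exact-match lookup only; there is no "nearest time" search here.
    public bool TryGet(DateTime time, out double[] value)
    {
        return _points.TryGetValue(time, out value);
    }
}
```

Each dictionary operation is thread safe on its own, but Add as a whole is not atomic: a reader may briefly see a new point alongside points that are about to be pruned.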
Given (Simplified description)
One of our services has a lot of instances in memory. About 85% are unique.
We need a very fast key based access to these items as they are queried very often in a single stack / call. This single context is extremely performance optimized.
So we started to put them into a dictionary. The performance was OK.
Access to the items as fast as possible is the most important thing in this case. It is ensured that there are no write operations when reads occur.
Problem
In the meantime we hit the limit on the number of items a dictionary can store.
The array dimensions have exceeded the supported range. (Translated from the German original: "Die Arraydimensionen haben den unterstützten Bereich überschritten.")
at System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
Solutions like Memcached are just too slow in this specific case. It is an isolated, very specific use case encapsulated in a single service.
So we are looking for a replacement of the dictionary for this specific scenario.
Currently I can't find one supporting this. Am I missing something? Can someone point me to one?
As an alternative, if none exists we are thinking about implementing one by ourselves.
We thought about two possibilities. Build it up from scratch or wrapping multiple dictionaries.
Wrapping multiple dictionaries
When an item is searched for, we could look at the key's hash code and use its leading digits as an index into a list of wrapped dictionaries. Although this seems easy, it smells to me, and it would mean that the hash code is calculated twice (once by us, once by the inner dictionary), and this scenario is really, really performance-critical.
I know that exchanging a basetype like the dictionary is the absolute last possibility and I want to avoid it. But currently it looks like there is no way to make the objects more unique or to get the performance of a dictionary from a database or to save performance somewhere else.
I'm also aware of "beware of premature optimization", but lower performance would hit the business requirements behind it very badly.
Before I finished reading your questions, the simple multiple dictionaries came to my mind. But you know this solution already. I am assuming you are really hitting the maximum number of items in a dictionary, not any other limit.
I would say go for it. I do not think you should be worried about computing a hash twice. If the keys are long and getting the hash is really a time-consuming operation (which I doubt, but can't be sure, as you did not mention what the keys are), you do not need to use the whole key for your hash function. Just pick whatever part you are OK to process in your own hashing and distribute the items based on that.
The only thing you need to make sure of here is an even spread of items among your multiple dictionaries. How hard this is to achieve really depends on what your keys are. If they were completely random numbers, you could just use the first byte and it would be fine (unless you needed more than 256 dictionaries). If they are not random numbers, you have to think about their distribution in their domain and code your first hash function so that it achieves that goal of even distribution.
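As a rough illustration of the wrapping approach, the sketch below picks the inner dictionary from the key's hash code. ShardedDictionary and ShardFor are hypothetical names, and using GetHashCode() modulo the shard count is just one possible distribution function; whether it spreads your keys evenly depends on the keys.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper that spreads items over several inner dictionaries.
public sealed class ShardedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] _shards;

    public ShardedDictionary(int shardCount)
    {
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));
        _shards = new Dictionary<TKey, TValue>[shardCount];
        for (int i = 0; i < shardCount; i++)
            _shards[i] = new Dictionary<TKey, TValue>();
    }

    private Dictionary<TKey, TValue> ShardFor(TKey key)
    {
        // Mask off the sign bit so the modulo result is never negative.
        int hash = key.GetHashCode() & 0x7FFFFFFF;
        return _shards[hash % _shards.Length];
    }

    public void Add(TKey key, TValue value)
    {
        ShardFor(key).Add(key, value);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        return ShardFor(key).TryGetValue(key, out value);
    }

    // The total can exceed what a single dictionary can hold, hence long.
    public long Count
    {
        get
        {
            long total = 0;
            foreach (var shard in _shards) total += shard.Count;
            return total;
        }
    }
}
```

The hash is indeed computed twice per lookup (once in ShardFor, once inside the inner dictionary); if that turns out to matter, caching the hash code in the key type is one way to make the second call cheap.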
I've looked at the implementation of the .NET Dictionary, and since its sizes and indexes are Int32, it cannot hold more than about 2^31 items. (Next to the array of buckets, whose entries are chained together through indexes, there is a single array that stores all items; the maximum size of that array is probably the limiting factor, which would fit the Resize call in your stack trace.)
If you haven't added anywhere near that many values, you should double check that your hashing function spreads the items evenly over the dictionary; long collision chains in a bucket won't change the limit, but they will hurt lookup speed. See this answer for more info: What is the best algorithm for an overridden System.Object.GetHashCode?
Whenever I want to insert into a SortedList, I check to see if the item exists, then I insert. Is this performing the same search twice? Once to see if the item is there and again to find where to insert the item? Is there a way to optimize this to speed it up or is this just the way to do it, no changes necessary?
if( sortedList.ContainsKey( foo ) == false ){
sortedList.Add( foo, 0 );
}
You can add the items to a HashSet and the List, searching in the hash set is the fastest way to see if you have to add the value to the list.
if( hashSet.Contains( foo ) == false ){
sortedList.Add( foo, 0 );
hashSet.Add(foo);
}
You can use the indexer. The indexer does this in an optimal way internally: it first looks for the index corresponding to the key using a binary search and then uses this index to replace the existing item. Otherwise a new item is added, taking into account the index already calculated.
list["foo"] = value;
No exception is thrown whether the key already exists or not.
UPDATE:
If the new value is the same as the old value, replacing the old value has the same effect as doing nothing.
Keep in mind that a binary search is done. This means that it takes about 10 steps to find an item among 1000 items! log2(1000) ~= 10. Therefore doing an extra search will not have a significant impact on speed. Searching among 1,000,000 items will only double this value (~ 20 steps).
But setting the value through the indexer will do only one search in any case. I looked at the code using Reflector and can confirm this.
I'm sorry if this doesn't answer your question, but I have to say that sometimes the default collection structures in .NET are unjustifiably limited in features. This could have been handled if the Add method returned a boolean indicating success/failure, very much like HashSet<T>.Add does, so everything would happen in one step. In fact ICollection<T>.Add as a whole should have returned a boolean so that implementations are forced to do this, very much like Collection<E>.add in Java does.
You could either use a SortedDictionary<K, V> structure as pointed out by Servy, or a combination of HashSet<K> and SortedList<K, V> as in peer's answer for better performance, but neither of them really sticks to the do-it-only-once philosophy. I tried a couple of open source projects to see if there is a better implementation in this respect, but couldn't find one.
Your options:
In the vast majority of cases it's OK to do two lookups; it doesn't hurt much. Stick with that. There is no built-in solution.
Write your own SortedList<K, V> class. It's not difficult at all.
If you're desperate, you can use reflection. The Insert method is a private member of the SortedList class. An example that does.. Kindly don't do it. It's a very, very poor choice. Mentioned here for completeness.
ContainsKey does a binary search, which is O(log n), so unless your list is massive, I wouldn't worry about it too much. And, presumably, on insertion it does another binary search to find the location to insert at.
One option to avoid this (doing the search twice) is to use the BinarySearch method of List. This returns a negative value if the item isn't found, and that negative value is the bitwise complement of the index where the item should be inserted. So you can look for an item, and if it's not already in the list, you know exactly where it should be inserted to keep the list sorted.
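A short sketch of that BinarySearch pattern follows. The SortedInsert class and AddIfMissing method are hypothetical names; the list must already be sorted, and T needs a default comparer (for example by implementing IComparable<T>).

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of the framework.
public static class SortedInsert
{
    // Inserts `item` into an already sorted list unless it is present.
    // Returns true if the item was inserted.
    public static bool AddIfMissing<T>(List<T> sorted, T item)
    {
        int index = sorted.BinarySearch(item);
        if (index >= 0)
            return false;            // already present, nothing to do

        sorted.Insert(~index, item); // ~index is the insertion point
        return true;
    }
}
```

Only one search is performed per call. The Insert itself still has to shift the following elements, so the overall cost remains O(n) in the worst case.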
SortedList<Key,Value> is slow for inserts once it gets large (each insertion has to shift the following elements, so it is O(n)), and you probably shouldn't use it for big collections at all. You may have already considered using SortedDictionary<Key,Value> but found it inconvenient because the items don't have indexes (you can't write sortedDictionary[0]) and because you can write a find-nearest-key operation for SortedList but not for SortedDictionary.
But if you're willing to switch to a third-party library, you can get better performance by changing to a different data structure.
The Loyc Core libraries include a data type that works the same way as SortedList<Key,Value> but is dramatically faster when the list is large. It's called BDictionary<Key,Value>.
Now, answering your original question: yes, the way you wrote the code, it performs two searches and one insert (the insert is the slowest part). If you switch to BDictionary, there is a method bdictionary.AddIfNotPresent(key, value) which combines those two operations into a single operation. It returns true if the specified item was added, or false if it was already present.
Let's assume I have plenty of strings that need processing. I'd like to keep the most recently processed strings in memory to avoid processing them again. I only need to record the last 100 strings, which means that if I use
List<string> oldString, then after oldString.Add() I have to use something like oldString.TakeFromEnd(100). As you know, TakeFromEnd() does not exist, which means that if I go down this path I have to write a lot of code to keep the List at a length of 100, which I imagine would lead to bad performance.
I'd like to ask: is there any built-in class that just holds a fixed amount of data and throws away the oldest item when a new one is added? Thanks
[EDIT]
Queue<string> is very good indeed: use .Any() to check whether an item already exists, use .Enqueue() to add (the answer below was posted with this misspelled as Equeue, missing an n), use .Count to check the length, and .Dequeue() to remove the item that was added first.
Work with a Queue
And the idea is to:
public void addToQueue(Object obj){
    // Drop the oldest entry first so the queue never holds more than 100 items.
    if (myQueue.Count >= 100)
        myQueue.Dequeue();
    myQueue.Enqueue(obj);
}
This is only a rough sketch of the code that you need, but you'll get the idea.
You will then have a queue that contains only the latest 100 records.