I need to manage some lists with timers: each element of a list is associated with a timer, and when the timer expires, the corresponding element must be removed from the list.
This way the length of the list does not grow too much because, as time goes on, elements are progressively removed. How fast the list grows also depends on the rate at which new elements are added.
However, I need to add the following constraint: the amount of RAM used by a list must not exceed a certain limit, i.e. the user must specify the maximum number of items that can be stored in RAM.
Therefore, if the rate of addition of elements is low, all items can be kept in RAM. If, however, the rate of addition is high, old items are likely to be lost before their timers expire.
Intuitively, I thought about taking a cue from the swapping technique used by operating systems.
class SwappingList
{
    private List<string> _list;
    private SwapManager _swapManager;

    public SwappingList(int capacity, SwapManager swapManager)
    {
        _list = new List<string>(capacity);
        _swapManager = swapManager;
        // TODO
    }
}
One of the lists I manage is made up of constant-length strings, and it must work as a hash table, so I should use a HashMap, but how can I define the maximum capacity of a HashMap object?
Basically I would like to implement a caching mechanism in which the RAM used by the cache is limited to a certain number of items or bytes, which means that old items that have not yet expired must be moved to a file.
According to the comments above, you want a caching mechanism.
.NET 4 has this built-in (see http://msdn.microsoft.com/en-us/library/system.runtime.caching.aspx). It comes with a configurable caching policy that you can use to configure expiration, among other things. It even provides events you can assign delegates to that are called prior to removing a cache entry, so you can customize this process further.
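For illustration, here is a minimal sketch using MemoryCache from System.Runtime.Caching with a memory cap and an absolute expiration; the cache name and the limits are arbitrary choices for the example:

using System;
using System.Collections.Specialized;
using System.Runtime.Caching;

class CacheDemo
{
    static void Main()
    {
        // Cap the cache at roughly 10 MB (value chosen arbitrarily for the example).
        var config = new NameValueCollection { { "cacheMemoryLimitMegabytes", "10" } };
        var cache = new MemoryCache("demoCache", config);

        var policy = new CacheItemPolicy
        {
            // The entry expires 30 seconds after insertion.
            AbsoluteExpiration = DateTimeOffset.Now.AddSeconds(30),
            // Invoked after an entry is evicted, whether it expired or was pushed out.
            RemovedCallback = args =>
                Console.WriteLine(args.CacheItem.Key + " removed: " + args.RemovedReason)
        };

        cache.Set("key1", "value1", policy);
        Console.WriteLine(cache.Get("key1")); // prints "value1"
    }
}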
You cannot specify the maximum capacity of a HashMap. You need to implement a wrapper around it, which, after each insertion, checks to see if the maximum count has been reached.
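For illustration, a minimal sketch of such a wrapper with a simple reject-on-full policy (the class and method names are my own; the branch that rejects the insert is also where a swap-to-file step could hook in):

using System.Collections.Generic;

class BoundedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _inner = new Dictionary<TKey, TValue>();
    private readonly int _maxCount;

    public BoundedDictionary(int maxCount)
    {
        _maxCount = maxCount;
    }

    public bool TryAdd(TKey key, TValue value)
    {
        // Reject (or swap out an old entry) once the cap is reached.
        if (_inner.Count >= _maxCount)
            return false;

        _inner[key] = value;
        return true;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        return _inner.TryGetValue(key, out value);
    }
}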
It is not clear to me whether that's all you are asking. If you have more questions, please be sure to state them clearly and use a question mark with each one of them.
I'm building a multithreading program that handles big data and I wonder what I can do to tweak it.
Right now I have around 50 million entries in a normal List, and since I use multithreading I use a lock statement.
public string getUsername()
{
    string user = null;
    lock (UsersToCheckExistList)
    {
        user = UsersToCheckExistList.First();
        UsersToCheckExistList.Remove(user);
    }
    return user;
}
When I run smaller lists (500k lines) it works much faster, but when I load a bigger list (5-50 million) it starts to slow down. One way to solve this issue is to create many small lists dynamically and store them in a Dictionary, and that is the way I think I will go. But as I want to learn more about optimizing, I wonder if there is a better solution for this task?
All I want is to get a value from the collection and remove it from the collection at the same time.
You're using the wrong tools for the job - explicit locking is quite expensive, not to mention that the cost of removing the head of a List is O(Count). If you want a collection that is accessed concurrently it's best to use types in System.Collections.Concurrent, as they are heavily optimised for concurrent accesses. From your use case it seems you want a queue of users, so using a ConcurrentQueue:
ConcurrentQueue<string> UsersQueue = new ConcurrentQueue<string>();

public string getUsername()
{
    // TryDequeue atomically removes and returns the head; user stays null if the queue is empty.
    string user = null;
    UsersQueue.TryDequeue(out user);
    return user;
}
The problem is that removing the first item from a list is O(n), so as your list grows it takes longer to remove the first item. You would probably be better off using a Queue instead. Since you need thread safety, you can use ConcurrentQueue, which handles efficient locking for you.
You can put them all in a ConcurrentBag (https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentbag-1?view=netframework-4.8); then each thread can just use the TryTake method to grab one entry and remove it at the same time, and you don't need to worry about doing your own locking.
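A quick sketch of that pattern (the sample usernames are placeholders; note that ConcurrentBag is unordered, so this only fits if retrieval order doesn't matter):

using System.Collections.Concurrent;

var bag = new ConcurrentBag<string>(new[] { "alice", "bob", "carol" });

// TryTake removes and returns an arbitrary entry in one atomic step.
if (bag.TryTake(out string user))
{
    // process user
}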
If you have enough RAM for your data, you should definitely use a ConcurrentQueue for FIFO access to your data.
But if you do not have enough RAM, you can try a database. Modern databases cache data very effectively; you will have almost instant access to your data and save the OS memory from swapping.
I have a need to maintain a cache of database objects that are uniquely keyed (by an integer). A query delivers an instance of IEnumerable<MyEntity> (MyEntity uses an int primary key) with the results, and I'd like to initialize an instance of Dictionary<int, MyEntity> as fast as possible, because this query can return a few hundred thousand rows.
What is the most performant way to initialize an instance of Dictionary<int, MyEntity> from an IEnumerable<MyEntity>?
In short, I want to know if there is a more performant way to do this:
IEnumerable<MyEntity> entities = DoSomeQuery();

var cache = new Dictionary<int, MyEntity>();
foreach (var entity in entities)
    cache.Add(entity.Id, entity);

// or...
cache = entities.ToDictionary(e => e.Id);
Of course, the query has the biggest potential performance consequences, but it's important that I shave milliseconds wherever I can for my use case.
EDIT:
Worth noting here that .ToDictionary<TKey, TElement> literally runs a foreach loop like the first example, so one could assume the performance would be exactly the same, if not slightly worse. Maybe that's my answer right there.
You're about as fast as you can get.
If you can quickly determine the number of elements you are going to add then passing that as the capacity to the Dictionary constructor will give a bit of a boost by preventing internal resize operations (the .NET Core version of ToDictionary() does that, the other versions do not).
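For example, a sketch reusing MyEntity and DoSomeQuery from the question (ToList() needs System.Linq):

// Materializing first makes the count cheap; pre-sizing then avoids the
// internal grow-and-rehash steps while the dictionary fills up.
List<MyEntity> entityList = DoSomeQuery().ToList();
var cache = new Dictionary<int, MyEntity>(entityList.Count);
foreach (var entity in entityList)
    cache.Add(entity.Id, entity);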
If the keys are relatively tightly packed then you can benefit from sizing to the range rather than the count. E.g. if you had Ids of {5, 6, 7, 9, 10, 11} then it would be beneficial to size to 7 (the number of values you would have if the missing 8 were there) rather than 6. (Actually, it would make no difference here, as the effect only kicks in with larger sets than this.) The effect is rather small though, so it is not worth doing if you would waste a lot of memory (e.g. it's definitely not worth storing {8, 307} in a 300-capacity dictionary!). The benefit comes from increasing how often a key will be hashed to something that won't clash with another element during the period when the internal size (and hence the internal hash reduction) is smaller than it will be when you have finished adding them all.
If they are tightly packed but you can't predict the size, then there is a benefit to storing them in order, because as the internal storage grows there will more often be a case where the dictionary wants to store something with an as-yet-unused reduced hash code. The benefit, though, will be smaller than the cost of sorting in memory (and that would require finding the number of elements anyway, either explicitly or within an OrderBy operation), so it is only helpful if there is a way of getting that ordering done for you cheaply. (E.g. some web services require that some ordering criterion be given, so you may as well give the id as the criterion. Mostly this won't be applicable.)
These points, especially the last two, are tiny effects though, likely to not add up to anything measurable. Even the first is going to be smaller than the cost of obtaining the count if it isn't already in a source that has a cheap Count or Length operation.
The foreach itself can perhaps be improved by replacing it with indexing (when applicable), but sometimes that's worse. It also tends to do better on a concretely-typed source (foreach over a T[] array beats foreach over a List<T>, which beats foreach over an IEnumerable<T>), but that means exposing implementation details between layers and is rarely worth it, especially since many collection types gain no benefit from this.
After reading the excellent accepted answer in this question:
How is the c#/.net 3.5 dictionary implemented?
I decided to set my initial capacity to a large guess and then trim it after I have read in all values. How can I do this? That is, how can I trim a Dictionary so the GC will collect the unused space later?
My goal with this is optimization. I often have large datasets, and the time penalty for small datasets is acceptable. I want to avoid the overhead of reallocating and copying the data that is incurred with small initial capacities on large datasets.
According to Reflector, the Dictionary class never shrinks. void Resize() is hard-coded to always double the size.
You can probably create a new dictionary and use the respective constructor to copy over the items. This will be quite inefficient.
Or, implement your own dictionary using the existing one as a blueprint. This is less work than you might think at first.
Be sure to benchmark both approaches.
In .NET 5 there is the method TrimExcess, which does exactly what you're asking:

    Sets the capacity of this dictionary to what it would be if it had been originally initialized with all its entries.
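A small usage sketch (the initial guess and the types are arbitrary):

// Start with a generous guess, then shrink once the real size is known.
var dict = new Dictionary<int, string>(1_000_000);
// ... read in all values ...
dict.TrimExcess(); // capacity drops to fit the current entry count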
You might consider putting your data in a list first. Then you know the list's size, and can create a dictionary with that capacity (now exactly right for the data you want) and populate it.
Allowing the list to dynamically resize (as you add the elements) should be cheaper than allowing a dictionary to resize. (But, as others have noted, test the performance yourself!) Resizing a dictionary involves a rehashing operation, which means every element's GetHashCode will get called again, as well as the reference being copied into the new data structure. Resizing a list just means copying the references, so should be cheaper.
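A rough sketch of that idea, with ReadAllValues() standing in as a hypothetical data source:

// Buffer into a list first (cheap resizes, no rehashing), then build the
// dictionary with exactly the right capacity.
var buffer = new List<KeyValuePair<int, string>>();
foreach (var pair in ReadAllValues())
    buffer.Add(pair);

var dict = new Dictionary<int, string>(buffer.Count);
foreach (var pair in buffer)
    dict.Add(pair.Key, pair.Value);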
I had some problems with a WCF web service (some dumps, memory leaks, etc.) and I ran a profiling tool (ANTS Memory Profiler).
Just to find out that even with the processing over (I ran a specific test and then stopped), Generation 2 was 25% of the memory for the web service. I tracked down this memory and found that I had a dictionary object full of (null, null) items, with a hash code of -1.
The workflow of the web service implies that during specific processing, items are added to and then removed from the dictionary (just simple Add and Remove). Not a big deal. But it seems that after all items are removed, the dictionary is full of (null, null) KeyValuePairs. Thousands of them, in fact, such that they occupy a big part of memory, and eventually an overflow occurs, with the corresponding forced application pool recycle and DW20.exe getting all the CPU cycles it can.
The dictionary is in fact Dictionary<SomeKeyType, IEnumerable<KeyValuePair<SomeOtherKeyType, SomeCustomType>>> (System.OutOfMemoryException because of Large Dictionary), so I have already checked whether some kind of reference is holding on to things.
The dictionary is contained in a static object (to make it accessible to different processing threads throughout processing), so from this question and many more (Do static members ever get garbage collected?) I understand why that dictionary is in Generation 2. But is this also the cause of those (null, null) pairs? Even if I remove items from the dictionary, will something always remain occupied in memory?
It's not a speed issue like in this question: Deallocate memory from large data structures in C#. It seems that the memory is never reclaimed.
Is there something I can do to actually remove items from the dictionary, not just keep filling it with (null, null) pairs?
Is there anything else I need to check out?
Dictionaries store items in a hash table. An array is used internally for this. Because of the way hash tables work, this array must always be larger than the actual number of items stored (at least about 30% larger). Microsoft uses a load factor of 72%, i.e. at least 28% of the array will be empty (see An Extensive Examination of Data Structures Using C# 2.0, especially The System.Collections.Hashtable Class and The System.Collections.Generic.Dictionary Class). Therefore the null/null entries could just represent this free space.
If the array is too small, it will grow automatically; however, when items are removed, the array does not shrink. The space that is freed up will, though, be reused when new items are inserted.
If you are in control of this dictionary, you could try to re-create it in order to shrink it:
theDict = new Dictionary<TKey, IEnumerable<KeyValuePair<TKey2, TVal>>>(theDict);
But the problem might arise from the actual (non-empty) entries. Your dictionary is static and will therefore never be reclaimed automatically by the garbage collector unless you assign it another dictionary or null (theDict = new ... or theDict = null). This is only true for the dictionary itself, which is static, not for its entries. As long as references to removed entries exist somewhere else, they will persist. The GC will reclaim any object (sooner or later) that can no longer be reached through some reference. It makes no difference whether this object was declared static or not; the objects themselves are not static, only their references.
As #RobertTausig kindly pointed out, since .NET Core 2.1 there is the new Dictionary.TrimExcess(), which is what you actually wanted, but didn't exist back then.
Looks like you need to recycle space in that dict periodically. You can do that by creating a new one: new Dictionary<a,b>(oldDict). Be sure to do this in a thread-safe manner.
When to do this? Either on the tick of a timer (60 sec?) or when a specific number of writes has occurred (100k?). (You'd need to keep a modification counter.)
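A sketch of the counter-based variant (the names, value type, and threshold are placeholders; all access is serialized on one lock, which matches the simple Add/Remove workflow described above):

using System.Collections.Generic;

static class SharedCache
{
    private static readonly object _sync = new object();
    private static Dictionary<string, object> _dict = new Dictionary<string, object>();
    private static int _writes;

    public static void Add(string key, object value)
    {
        lock (_sync)
        {
            _dict[key] = value;
            // Every ~100k writes, copy the live entries into a fresh, right-sized
            // dictionary so the oversized internal array can be collected.
            if (++_writes >= 100_000)
            {
                _dict = new Dictionary<string, object>(_dict);
                _writes = 0;
            }
        }
    }

    public static void Remove(string key)
    {
        lock (_sync) { _dict.Remove(key); }
    }
}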
A solution could be to call the Clear() method on the static dictionary.
That way, the reference to the dictionary remains available, but the objects it contained are released.
I'm asking for something that's a bit weird, but here is my requirement (it is all rather computation intensive, and I couldn't find anything on this so far).
I need a collection of <TKey, TValue> of about 30 items. But the collection is used in massively nested foreach loops that may iterate almost a billion times, seriously. The operations on the collection are trivial, something like:
var _cells = new Dictionary<Position, Value>();

_cells.Clear();
_cells.Add(Position.p1, v1);
_cells.Add(Position.p2, v2);
// etc.
In short, nothing more than the addition of about 30 items and the clearing of the collection. Also, the values will be read from somewhere else at some point, and I need this reading/retrieval by key. So I need something along the lines of a Dictionary. Now, since I'm trying to squeeze out every ounce from the CPU, I'm looking for some micro-optimizations as well. For one, I do not require the collection to check whether a duplicate already exists while adding (this check typically makes a dictionary slower than a List<T> for additions). I know I won't be passing duplicate keys.
Since the Add method does some checks, I tried this instead:
_cells[Position.p1] = v1;
_cells[Position.p2] = v2;
// etc.
But this is still about 200 ms slower for about 10k iterations than a typical List<T> implementation like this:
var _cells = new List<KeyValuePair<Position, Value>>();

_cells.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
_cells.Add(new KeyValuePair<Position, Value>(Position.p2, v2));
// etc.
Now that could scale to a noticeable time over the full iteration. Note that in the above case I read items from the list by index (which was OK for testing purposes). The problems with a regular List<T> for us are many, the main one being the inability to access an item by key.
My questions, in short, are:
Is there a custom collection class that would let me access items by key, yet bypass the duplicate checking while adding? Any third-party open-source collection would do.
Or else, please point me to a good starting point for implementing my own collection class from the IDictionary<TKey, TValue> interface.
Update:
I went with MiMo's suggestion and List was still faster. Perhaps it has to do with the overhead of creating the dictionary.
My suggestion would be to start with the source code of Dictionary<TKey, TValue> and change it to optimize for your specific situation.
You don't have to support removal of individual key/value pairs, which might help simplify the code. There also appear to be some checks on the validity of keys, etc. that you could get rid of.
But this is still a few ms slower for about ten iterations than a typical List implementation like this
A few milliseconds slower for ten iterations of adding just 30 values? I don't believe that. Adding just a few values should take microscopic amounts of time, unless your hashing/equality routines are very slow. (That can be a real problem. I've seen code improved massively by tweaking the key choice to be something that's hashed quickly.)
If it's really taking milliseconds longer, I'd urge you to check your diagnostics.
But it's not surprising that it's slower in general: it's doing more work. For a list, it just needs to check whether or not it needs to grow the buffer, then write to an array element, and increment the size. That's it. No hashing, no computation of the right bucket.
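If you want to sanity-check the numbers yourself, a minimal timing sketch might look like this (a bare Stopwatch loop; a proper harness such as BenchmarkDotNet would be more trustworthy):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class TimingSketch
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000; i++)
        {
            // 30 additions per iteration, mirroring the scenario in the question.
            var cells = new Dictionary<int, int>();
            for (int j = 0; j < 30; j++)
                cells[j] = j;
        }
        sw.Stop();
        Console.WriteLine("Dictionary: " + sw.ElapsedMilliseconds + " ms");
    }
}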
Is there a custom collection class that would let access item by key, yet bypass the duplicate checking while adding?
No. The very work you're trying to avoid is what makes it quick to access by key later.
When do you need to perform a lookup by key, however? Do you often use collections without ever looking up a key? How big is the collection by the time you perform a key lookup?
Perhaps you should build a list of key/value pairs, and only convert it into a dictionary when you've finished writing and are ready to start looking up.
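A sketch of that two-phase idea, reusing the question's Position/Value types and v1/v2 values (ToDictionary needs System.Linq):

using System.Collections.Generic;
using System.Linq;

// Phase 1: cheap appends in the hot loop, no hashing at all.
var pairs = new List<KeyValuePair<Position, Value>>();
pairs.Add(new KeyValuePair<Position, Value>(Position.p1, v1));
pairs.Add(new KeyValuePair<Position, Value>(Position.p2, v2));
// etc.

// Phase 2: one-off conversion when the reads (key lookups) begin.
var cells = pairs.ToDictionary(p => p.Key, p => p.Value);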