"Here is the implementation of the dictionary without any compaction support."
This quote is taken from here: http://blogs.msdn.com/jaredpar/archive/2009/03/03/building-a-weakreference-hashtable.aspx
I know jaredpar is a member on here and posts on the C# section. What exactly is "dictionary compaction support"? I am assuming it is some way to optimise or make it smaller? But how (if this is what it is)?
Thanks
For that particular post I was referring to shrinking the dictionary in order to be a more appropriate size for the number of non-collected elements.
Under the hood, most hashtables are backed by a large array whose entries usually point to another structure such as a linked list. The array starts out at an initial size. When the number of elements added to the hashtable exceeds a certain threshold (say 70% of the number of slots in the array), the hashtable will expand. This usually involves creating a new array at twice the size and re-adding the values into the new array.
One of the problems / features of a weak reference hashtable is that over time the elements are collected, which can lead to wasted space. Imagine that you added enough elements to go through this array-doubling process several times. Later, some of those elements were collected, and now the remaining elements could fit into a previous, smaller array size.
This is not necessarily a bad thing but it is wasted space. Compaction is the process where you essentially shrink the underlying data structure for the hashtable to be a more appropriate size for the data.
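This is not the code from the post, just a minimal sketch of the idea under the assumption of a Dictionary<TKey, WeakReference> backing store: Compact() copies only the still-alive entries into a fresh dictionary so the oversized internal storage of the old one can be collected.

using System;
using System.Collections.Generic;

// Sketch only (not jaredpar's implementation): a weak-reference map whose
// Compact() rebuilds the backing dictionary around the entries that are still alive.
class WeakMap<TKey, TValue> where TValue : class
{
    private Dictionary<TKey, WeakReference> _map = new Dictionary<TKey, WeakReference>();

    public void Add(TKey key, TValue value)
    {
        _map[key] = new WeakReference(value);
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        value = null;
        WeakReference weak;
        if (!_map.TryGetValue(key, out weak))
            return false;
        value = weak.Target as TValue;
        return value != null;
    }

    // Compaction: copy the live entries into a new dictionary sized for them
    // and let the GC reclaim the old, oversized one.
    public void Compact()
    {
        var compacted = new Dictionary<TKey, WeakReference>();
        foreach (var pair in _map)
        {
            if (pair.Value.IsAlive)
                compacted.Add(pair.Key, pair.Value);
        }
        _map = compacted;
    }
}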
Related
I'm currently dealing with a queue that has a couple thousand entries in it. To save on RAM usage I'm currently using the TrimExcess() method built into the queue data type by Microsoft. As mentioned in the documentation, this method is inefficient when it comes to large lists and results in a significant time loss whenever it is called.
Is there a more efficient way to remove items from a queue that actually deletes them in RAM as well?
Edit: to clarify, there are still elements in the queue; I just want to remove the elements that have already been dequeued from RAM.
The answer to your question is "Don't worry about that, and it's very likely that you should not do that." This answer is an elaboration of the comments from @madreflection and myself.
The Queue<T> class, like other collection classes, uses an array to hold the items it collects. Like other collection classes, if the array runs out of slots, the Queue<T> class creates a new, larger array and copies the data from the old array.
I haven't looked at the Queue<T> source, but using the debugger, I can see that this array is the _array member of the class. It starts with an array of 0 length. When you enqueue one item, it gets replaced by an array of length 4. After that, the array doubles in size whenever it needs more space.
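If you want to watch this happen yourself, a quick (and fragile) sketch is to peek at that internal array with reflection. The field name "_array" is what shows up in the debugger; it is an implementation detail and may differ between runtime versions.

using System;
using System.Collections.Generic;
using System.Reflection;

class QueueCapacityDemo
{
    static void Main()
    {
        var queue = new Queue<int>();

        // "_array" is an internal field; this may break on other runtimes.
        FieldInfo arrayField = typeof(Queue<int>)
            .GetField("_array", BindingFlags.NonPublic | BindingFlags.Instance);

        for (int i = 1; i <= 40; i++)
        {
            queue.Enqueue(i);
            int capacity = ((int[])arrayField.GetValue(queue)).Length;
            Console.WriteLine("Count = {0,2}, backing array length = {1}", queue.Count, capacity);
        }
    }
}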
You say your queue "has a couple thousand entries in it". I'm going to use 2000 in this analysis as a rough guess.
As you enqueue more and more new entries into the queue, that array doubling will happen several times:
At first:  4 entries
After 5:   double to 8
After 9:   double to 16
After 17:  double to 32
After 33:  double to 64
It will keep doing this until the capacity reaches 2048, which takes nine doublings after the initial array of 4. At that point, you will have allocated 10 arrays, nine of which are garbage, and done roughly 2,000 element copies.
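A quick back-of-the-envelope sketch of that arithmetic, assuming the start-at-4-then-double growth described above:

using System;

class GrowthCost
{
    static void Main()
    {
        int target = 2000;       // rough queue size from the question
        int capacity = 4;        // initial backing array size
        int allocations = 1;     // the first array of 4
        long copies = 0;         // elements copied during re-allocations

        while (capacity < target)
        {
            copies += capacity;  // the full array is copied when it doubles
            capacity *= 2;
            allocations++;
        }

        Console.WriteLine("Final capacity:   {0}", capacity);     // 2048
        Console.WriteLine("Arrays allocated: {0}", allocations);  // 10
        Console.WriteLine("Elements copied:  {0}", copies);       // 2044
    }
}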
Now think about it. I'm guessing you are enqueuing reference type objects. A reference type object is represented by an object reference (in effect, a pointer). If you have 2000 instances in a queue, that will represent 8 KB on a 32-bit machine (plus some one-time overhead for the members of the queue class). On a 64-bit machine, it's 16 KB. That's nothing for a modern computer. The .NET garbage collector has two strategies for managing memory, a normal one and one for large objects. The boundary is 85 KB; your queue will never be a large object.
If you are enqueuing large value types, then more memory is needed (since the value type objects will be copied into the array elements that make up the queue entries). You'd need to be using very large value type objects before your queue becomes a large object.
The other thing that will happen is that as your queue grows in size, it will settle into the Garbage Collector's Gen2 memory area. Gen2 collections are expensive, but once an object becomes stable in Gen2, it doesn't bother the garbage collector at all.
But, think about what happens if you reduce your queue size way down to, say, 100 entries and call TrimExcess. At that point, yet another new array will be created (this time, much smaller) and the entries in your queue will be copied to that new array (that's what the Remarks section of the TrimExcess documentation is referring to when it talks about the cost of reallocating and copying a large Queue<T>). If your queue starts growing again, you will start doubling/copying that array over and over again, spinning off more garbage and spinning your wheels doing the copying.
A better strategy is to look at your estimated queue size, inflate it a bit, and pre-allocate the space for all of those entries at construction time. If you expect to have 2000 entries, allocate space for 2500 in the constructor:
var myQueue = new Queue<SomeType>(2500);
Now, you do one allocation, there should be no reallocation or array copying, and your memory will quickly migrate to Gen2, but will never be touched by the GC.
I have a Dictionary of objects with strings as the keys. This Dictionary is first populated with anywhere from 50 to tens of thousands of entries. Later on my program looks for values within this dictionary, and after having found an item in the dictionary I no longer have any need to persist the object that I just found in the dictionary. My question then is, would I be able to get better total execution time if I remove entries from the dictionary once I no longer have use for them, perhaps cutting down memory usage or just making subsequent lookups slightly faster, or would the extra time spent removing items be more impactful?
I understand the answer may depend on details such as how many total lookups are done against the dictionary, the size of the key, and the size of the object; I will try to provide these below. But is there a general answer? Is it unnecessary to try and improve performance in this way, or are there cases where this would be a good idea?
The key is a variable-length string, either 6 characters or ~20 characters.
The total number of lookups is completely up in the air; I may only have to check ~50 times, or I may have to look up 10K times, completely independent of the size of the dictionary, i.e. the dictionary may have 50 items and I may do 10K lookups, or I may have 10K items and only do 50 lookups.
One additional note: if I do remove items from the dictionary and am ever left with an empty dictionary, I can signal a waiting thread to stop waiting for me while I process the remaining items (this involves parsing through a long text file while looking up items in the dictionary to determine what to do with the parsed data).
Dictionary lookups are essentially O(1). Removing items from the dictionary will have a tiny (if any) impact on lookup speed.
In the end, it's very likely that removing items will be slower than just leaving them in.
The only reason I'd suggest removing items would be if you need to reduce your memory footprint.
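If you want to measure it for your own data, a rough micro-benchmark sketch (made-up sizes; the numbers will depend on your keys and hardware) could look like this:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupVsRemove
{
    static void Main()
    {
        const int items = 10000;
        var keepDict = new Dictionary<string, object>();
        var removeDict = new Dictionary<string, object>();
        var keys = new List<string>();

        for (int i = 0; i < items; i++)
        {
            string key = "key-" + i;
            keys.Add(key);
            keepDict[key] = new object();
            removeDict[key] = new object();
        }

        // Look up every key and leave the entries in place.
        object value;
        var sw = Stopwatch.StartNew();
        foreach (string key in keys)
        {
            keepDict.TryGetValue(key, out value);
        }
        sw.Stop();
        Console.WriteLine("Lookup only:         {0}", sw.Elapsed);

        // Look up every key and remove it afterwards.
        sw.Restart();
        foreach (string key in keys)
        {
            if (removeDict.TryGetValue(key, out value))
                removeDict.Remove(key);
        }
        sw.Stop();
        Console.WriteLine("Lookup plus removal: {0}", sw.Elapsed);
    }
}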
I found some interesting items over at DotNetPerls that seem to relate to your question.
The order you add keys to a Dictionary is important. It affects the performance of accessing those keys. Because the Dictionary uses a chaining algorithm, the keys that were added last are often faster to locate.
http://www.dotnetperls.com/dictionary-order
Dictionary size influences lookup performance. Smaller Dictionaries are faster than larger Dictionaries. This is true when they are tested for keys that always exist in both. Reducing Dictionary size could help improve performance.
http://www.dotnetperls.com/dictionary-size
I thought this last tidbit was really interesting. It didn't occur to me to consider my key length.
Generally, shorter [key] strings perform better than longer ones.
http://www.dotnetperls.com/dictionary-string-key
Good question!
I've encountered a problem, which is best illustrated with this code segment:
public static void Foo(long RemoveLocation)
{
    // Code body here...
    // MyList is a List type collection object.
    MyList.RemoveAt(RemoveLocation);
}
Problem: RemoveLocation is a long. The RemoveAt method takes only int types. How do I get around this problem?
Solutions I'd prefer to avoid (because it's crunch time on the project):
Splitting MyList into two or more lists; that would require rewriting a lot of code.
Using int instead of long.
If there were a way to group similar items together, could you bring the total down below the limit? E.g. if your data contains lots of repeated X,Y coordinates, you might be able to reduce the number of elements and still keep one list by adding a frequency-count field, e.g. (x, y, count), as sketched below.
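A hedged sketch of that idea (the class and method names here are made up for illustration): collapse repeated coordinates into one dictionary entry plus a count.

using System;
using System.Collections.Generic;

// Sketch only: collapse repeated (x, y) coordinates into one entry with a count,
// so the total element count stays well below the int-indexed limit.
static class CoordinateCompressor
{
    public static Dictionary<Tuple<long, long>, long> Compress(IEnumerable<Tuple<long, long>> points)
    {
        var counts = new Dictionary<Tuple<long, long>, long>();
        foreach (var p in points)
        {
            long current;
            counts.TryGetValue(p, out current);
            counts[p] = current + 1;   // frequency count instead of a duplicate entry
        }
        return counts;
    }
}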
In theory, the maximum number of elements in a list is int.MaxValue, which is about 2 billion.
However, it is very inefficient to use the list type to store an extremely large number of elements. It simply has not been designed for that and you're doing way better with a tree-like data structure.
For instance, if you look at Mono's implementation of the list types, you'll see that they're using a single array to hold the elements, and I assume .NET's version does the same. Since the maximum size of an object in .NET is 2 GB, the actual maximum number of elements is 2 GB divided by the element size. So, for instance, a list of string references on a 64-bit machine could hold at most about 268 million elements.
When using the mutable (non-readonly) list types, this array needs to be re-allocated to a larger size (usually using twice the old size) when adding items, requiring the entire contents to be copied. This is very inefficient.
In addition to this, very large objects can also have a negative impact on the garbage collector.
Update
If you really need a very large list, you could simply write your own data type, for instance using an array of large arrays as internal storage (see the sketch below).
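A minimal sketch of such a type (assumed names, append and index access only, not a drop-in replacement for List<T>): it spreads the elements over fixed-size chunks so that no single backing array comes near the 2 GB object limit.

using System;
using System.Collections.Generic;

// Sketch of a "big list" backed by a list of large chunks instead of one
// single array. Supports Add and long-based indexing only.
class ChunkedList<T>
{
    private const int ChunkSize = 1 << 20;               // about one million elements per chunk
    private readonly List<T[]> _chunks = new List<T[]>();
    private long _count;

    public long Count { get { return _count; } }

    public void Add(T item)
    {
        int offset = (int)(_count % ChunkSize);
        if (offset == 0)
            _chunks.Add(new T[ChunkSize]);                // start a new chunk
        _chunks[(int)(_count / ChunkSize)][offset] = item;
        _count++;
    }

    public T this[long index]
    {
        get
        {
            if (index < 0 || index >= _count)
                throw new ArgumentOutOfRangeException("index");
            return _chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)];
        }
        set
        {
            if (index < 0 || index >= _count)
                throw new ArgumentOutOfRangeException("index");
            _chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value;
        }
    }
}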
There are also some useful comments about this here:
http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx
My game has gotten to the point where it's generating too much garbage and is resulting in long GC times. I've been going around and reducing a lot of the garbage generated but there's one spot that's allocating a large amount of memory too frequently and I'm stuck on how I can resolve this.
My game is a Minecraft-type world that generates new regions as you walk. I have a large, variable-size array that is allocated on the creation of a new region and is used to store the vertex data for the terrain. After the array is filled with data, it's passed to a SlimDX DataStream so it can be used for rendering.
The problem is the fact that this is a variable-size array and that it needs to be passed to SlimDX, which calls GCHandle.Alloc on it. Since it's a variable size, it may have to be resized in order to reuse it. I also can't just allocate a max-sized array for each region because it would require impossibly large amounts of memory. I can't use a List because of the GCHandle business with SlimDX.
So far, resizing the array only when it needs to be made bigger seems to be the only plausible option to me, but it may not work out that well and will likely be a pain to implement. I'd need to keep track of the actual size of the array separately and use unsafe code to get a pointer to the array and pass that to SlimDX. It may also end up eventually using such a large amount of memory that I have to occasionally go and reduce the size of all the arrays down to the minimum needed.
I'm hesitant to jump at this solution and would like to know if anyone sees any better solutions to this.
I'd suggest a tighter integration with the SlimDX library. It's open source, so you could dig in and find the critical path that you need for the rendering. Then you could integrate more tightly by using a DMA-style memory sharing approach.
Since SlimDX is open source and it is too slow, the time has come to change the source to suit your performance needs. What I see here is that you want to keep a much larger array but hand SlimDX only the actually used region, to prevent additional memory allocations for this potentially huge array.
There is a type in the .NET Framework named ArraySegment<T> which was made exactly for this purpose.
// Taken from MSDN
// Create and initialize a new string array.
String[] myArr = { "The", "quick", "brown", "fox", "jumps", "over", "the",
                   "lazy", "dog" };

// Define an array segment that contains the middle five values of the array.
ArraySegment<String> myArrSegMid = new ArraySegment<String>(myArr, 2, 5);

public static void PrintIndexAndValues(ArraySegment<String> arrSeg)
{
    for (int i = arrSeg.Offset; i < (arrSeg.Offset + arrSeg.Count); i++)
    {
        Console.WriteLine(" [{0}] : {1}", i, arrSeg.Array[i]);
    }
    Console.WriteLine();
}
That said, I have found the usage of ArraySegment somewhat strange because I always have to use the offset and the index, which just does not behave like a regular array. Instead you can distill your own struct which allows zero-based indexing (a sketch follows below); it is much easier to use but comes at the cost that every index-based access adds the base offset. If the usage pattern is mainly foreach loops, then it does not really matter.
I have had situations where ArraySegment was also too costly, because you create a struct every time and pass it to all methods by value on the stack. You need to watch closely where its usage is OK and whether it is not created at too high a rate.
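As a sketch of that zero-based wrapper (illustrative names only, not part of SlimDX or the BCL):

using System;

// Sketch: a slice over a larger array with zero-based indexing. Every indexer
// access adds the base offset, as described above; per-access bounds checks
// are omitted for brevity.
struct ArraySlice<T>
{
    private readonly T[] _array;
    private readonly int _offset;
    private readonly int _count;

    public ArraySlice(T[] array, int offset, int count)
    {
        if (array == null)
            throw new ArgumentNullException("array");
        if (offset < 0 || count < 0 || offset + count > array.Length)
            throw new ArgumentOutOfRangeException();
        _array = array;
        _offset = offset;
        _count = count;
    }

    public int Count { get { return _count; } }

    public T this[int index]
    {
        get { return _array[_offset + index]; }
        set { _array[_offset + index] = value; }
    }
}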
I sympathize with your problem with an older library, SlimDX, which may not be .NET compliant. I have dealt with such a situation.
Suggestions:
Use a more performance-efficient list or array type like ArrayList. It keeps track of the size of the array so you don't have to. Allocate the list in chunks, e.g. 100 elements at a time.
Use C++ .NET and take advantage of unsafe arrays or the .NET classes like ArrayList.
Updated: Use the idea of virtual memory. Save some data to an XML file or SQL database, freeing up a huge amount of memory.
I realize it's a gamble either way.
I know that it takes 4 bytes to store a uint in memory, but how much space in memory does it take to store List<uint> for, say, x number of uints?
How does this compare to the space required by uint[]?
There is no per-item overhead of a List<T> because it uses a T[] to store its data. However, a List<T> containing N items may have 2N elements in its T[]. Also, the List data structure itself has probably 32 or more bytes of overhead.
You probably will not notice much difference between T[] and List<T>, but you can use
System.GC.GetTotalMemory(true);
before and after an object allocation to obtain an approximate memory usage.
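A rough measurement sketch along those lines (the numbers are approximate and depend on the runtime and GC state):

using System;
using System.Collections.Generic;

class MemoryComparison
{
    static void Main()
    {
        const int n = 1000000;

        long before = GC.GetTotalMemory(true);

        var array = new uint[n];
        long afterArray = GC.GetTotalMemory(true);

        var list = new List<uint>();
        for (int i = 0; i < n; i++)
            list.Add((uint)i);
        long afterList = GC.GetTotalMemory(true);

        Console.WriteLine("uint[] of {0} items:     ~{1} bytes", n, afterArray - before);
        Console.WriteLine("List<uint> of {0} items: ~{1} bytes", n, afterList - afterArray);

        // Keep both alive so the GC cannot collect them before the measurements.
        GC.KeepAlive(array);
        GC.KeepAlive(list);
    }
}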
List<> uses an array internally, so a List<uint> should take O(4 bytes * n) space, just like a uint[]. There may be some more constant overhead in comparison to an array, but you should normally not care about this.
Depending on the specific implementation (this may be different when using Mono as a runtime instead of the MS .NET runtime), the internal array will be bigger than the number of actual items in the list. E.g.: a list of 5 elements has an internal array that can store 10, and a list of 10000 elements may have an internal array of size 11000. So you can't generally say that the internal array will always be twice as big, or 5% bigger, than the number of list elements; it may also depend on the size.
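You can observe the growth through the Capacity property, for example:

using System;
using System.Collections.Generic;

class CapacityDemo
{
    static void Main()
    {
        var list = new List<uint>();
        int lastCapacity = -1;

        for (int i = 0; i < 1000; i++)
        {
            list.Add((uint)i);
            if (list.Capacity != lastCapacity)
            {
                // On the MS runtime this typically prints 4, 8, 16, 32, ...;
                // other runtimes may grow differently.
                Console.WriteLine("Count = {0,4}, Capacity = {1,4}", list.Count, list.Capacity);
                lastCapacity = list.Capacity;
            }
        }
    }
}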
Edit: I've just seen that Hans Passant has described the growing behaviour of List<T> here.
So, if you have a collection of items that you want to append to, and you can't know the size of this collection at the time the list is created, use a List<T>. It is specifically designed for this case. It provides fast O(1) random access to the elements and has very little memory overhead (internal array). It is, on the other hand, very slow at removing or inserting in the middle of the list. If you need those operations often, use a LinkedList<T>, which then has more memory overhead (per item!), however. If you know the size of your collection from the beginning, and you know that it won't change (or only very few times), use arrays.
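For a feel of the difference, here is a rough comparison sketch of inserting at the front of both collections (the timings are illustrative only and will vary by machine):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class InsertComparison
{
    static void Main()
    {
        const int n = 100000;

        var list = new List<int>();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            list.Insert(0, i);            // every insert shifts all existing elements
        sw.Stop();
        Console.WriteLine("List<int>.Insert(0, ...):   {0}", sw.Elapsed);

        var linked = new LinkedList<int>();
        sw.Restart();
        for (int i = 0; i < n; i++)
            linked.AddFirst(i);           // O(1), but each node is a separate heap allocation
        sw.Stop();
        Console.WriteLine("LinkedList<int>.AddFirst(): {0}", sw.Elapsed);
    }
}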