List collection in C# where internal representation only grows?

List collection in C# where internal representation only grows? - c#

One of the reasons List's are generally good for adding and removing items is that the internal data representation is allocated larger than needed to reduce the number of reallocations.
Is there a way to make an instance of this class (or another similar class) to grow as needed by a decent chunk size, but to prevent reducing the size of the internal array?

I'm not aware that the internal array ever reduces in size automatically (you can use TrimExcess to manually reduce it). List<T> always increases capacity doubling the size of the internal array whenever it runs out of space. You could write a wrapper class that increases the Capacity however you want if you don't like the built-in policy.

Related

C# How do you monitor accesses to array elements?

Arrays are all of type Array, instead of their underlying type, meaning making arrays out of your own custom primitives with event handling and stuff is useless. Someone on Discord said that it'll probably take either reflection or unsafe constructs. Adding code to the get accessor is useless, because Array.Sort() only calls that accessor twice. What should I do instead to call code like events whenever an array element is accessed (like read or written)?
Here's what I'm trying to make, a benchmarker for sorting algorithms that charts the number of comparisons and total array accesses on a whole range of array sizes
Sort benchmarker

If you just want to count the number of comparisons you should probably provide an IComparer<T> implementation instead. Most sorting implementations take such an interface.
If you want to measure the number of accesses you need to use another interface, like IList<T>. But this will not be usable for most built in sort methods, since accessing elements thru an interface will reduce performance.
But measuring "array access" is probably not a meaningful metric. In many cases this will just be a memory access, and the time of this varies greatly depending on locality. A register access is "free", while a uncached memory read is many hundreds of cycles. So using profilers or writing an actual benchmark will probably be a much better tool to measure overall performance.

Why most of the data structures in generic collections use array despite of Large Object Heap fragmentation?

I could see that CoreCLR and CoreFx implicitly use array for most of the generic collections. what is the main driving factor to go with arrays and how it handles any side effects of LOH fragmentation.

What other then arrays should collections be?
More importnatly, what other then arrays could collections be?
In use collection boils down to "arrays - and stuff we wrap around arrays, for ease of use.":
The pure thing (arrays), wich do offer some conveniences like bounds checks in C#/.NET
Self growing arrays (Lists)
Two synchronized arrays that allow the mapping of any any input to any element (Dictionaries key/value pair)
Three synchornized array: Key, Value and a Hashvalue to quickly identify not-matching keys (HastTable).
Below the hood - regardless of how hard .NET makes it to use pointers - it all boils down to some code doing C/C++ style pointer arythmethic to get the next element.
Edit 1: As I learned in another place, .NET Dictionaries are actually implemented as HashLists. The HashList class is just the pre-generics version. Object has a GetHashCode function with sensible default behavior wich can be used, but also fully overwritten.
Fragmentation wise the "best" would be a array of references. It can be as small as the reference width (a Pointer or slightly bigger) and the GC can move around the instances to defragment memory. Of course then you get the slight overhead of accessing references rather the just counting/mathing up a pointer, so as usualy it is a memory vs speed tradeoff. However this might go into Speed Rant Territory of detail.
Edit 2: As Markus Appel pointed out in the comments, there is something even better for fragmentation avoidance: Linked lists. Even that single array of references - if you just make it big enough - will take quite some memory in one indivisible chunk. So it might run into object size limits or array indexer limits. A linked list will do neither. But as a result the performance is around a disk that was never defragmented.
Generics is just a convience to have typesafety in collections/other places. It avoids you having to use the dreaded Object as type, wich ruins all compile-time typesafety. Afaik they add nothing else to this situation. List<string> works the same as a StringList would.

Array access is faster as it is a linear storage. If Arrays can solve a problem well enough they are a better storage for traversal rather than always identifying where the next object is stored. For Large data structures this performance benefit will also be amplified.

Using arrays can cause fragmentation if used carelessly. In the general case though, the performance gains outweigh the cost.
When the buffer runs out, the collection allocates a new one with double the size. If the code inserts a lot of items without specifying a capacity, this results in log2(N) reallocations. If the code does specify a capacity though, even a very rough approximation, there may be no fragmentation issues at all.
Removal is another expensive case as the collection will have to move the items after the deleted item(s) to the left.
In general though, array storage offers far better performance than other storage structures though, both for reading, inserting and allocating memory. Deletions are rare in most cases.
For example, inserting N items in a linked list requires allocating N objects to hold that value and storing N pointers. That cost will be paid for every insertion, while the GC will have a lot more objects to track and collect. Inserting 100K items in a linked list would allocate 100K node objects that would need tracking.
With an array there won't be any allocations unless the buffer runs out. In the majority of cases insertion means simply writing to a buffer location and updating a count. When the buffer runs out there will be a single reallocation and an (expensive) copy operation. For 100K items, that's 17 allocations. In most cases, that's an acceptable cost.
To reduce or even get rid of allocations, the code can specify a capacity that's used as the initial buffer size. Specifying even a very rough estimate can reduce allocations a lot. Specifying 1024 as the initial capacity for 100K items would reduce reallocations to 7.

Where is the List<MyClass> object buffer maintained? Is it on RAM or HDD?

My question might sound a little vague. But what I want to know is where the List<> buffer is maintained.
I have a list List<MyClass> to which I am adding items from an infinite loop. But the RAM consumption of the Windows Service(inside which I am creating the List) never goes beyond 17 MB. In fact it hovers between 15-16MB even if I continue adding items to the List.
I was trying to do some Load Testing of My Service and came across this thing.
Can anyone tell me whether it dumps the data to some temporary location on the machine, and picks it from there as I don't see an increase in RAM consumption.
The method which I am calling infinitely is AddMessageToList().
class MainClass
{
List<MessageDetails> messageList = new List<MessageDetails>();
private void AddMessageToList()
{
SendMessage(ApplicationName,Address, Message);
MessageDetails obj= new MessageDetails();
obj.ApplicationName= ApplicationName;
obj.Address= Address;
obj.Message= Message;
lock(messageList)
{
messageList.Add(obj);
}
}
}
class MessageDetails
{
public string Message
{
get;
set;
}
public string ApplicationName
{
get;
set;
}
public string Address
{
get;
set;
}
}

The answer to your question is: "In Memory".
That can mean RAM, and it can also mean the hard drive (Virtual Memory). The OS memory manager decides when to page memory to Virtual Memory, which mostly has to do with how often the memory is accessed (though I don't pretend to know Microsoft's specific algorithm).
You also asked why your memory usages isn't going up. First off, a MegaByte is a HUGE amount of memory. Unless your class is quite large, you will need a LOT of them to make a MB appear. Eventually your memory usage should go up though.

In general C# objects are created from the Heap, which resides in memory. If you want to store things on disk there are ways to go about it, but a standard List<T> will live in memory.
When you create an object it will occupy a certain number of bytes in memory plus the size of the pointers used to reference it. Adding it to a list only adds a pointer to the object you've already created, so if you're adding lots of copies of the same instance into the list, it won't grow as fast as you expect.
If you really want to test the impact of large data structures on your memory, you're going to have to ramp up the numbers. A few thousand average objects aren't going to occupy much memory, but a few million might.
You might also be interested in the GC.GetTotalMemory() method and its friends.

Note that pretty much all memory on Windows (and .NET) is Virtual Memory - its "real, physical" location is arbitrary, Windows memory management handles that. However, regardless of whether it's currently using physical RAM or a page file on the HDD, it will show up as committed private memory.
So it's up to how you're actually creating the items and adding them to the List<T>. How many objects are there? Are you adding the same object over and over again, or creating a new one every time? Are you using the same List instance, or are you creating others? Do you keep references to the created objects (and List instances), or are you throwing them away? Do you actually do anything with the object / list? If not, the optimizer might have removed the code alltogether (it's very conservative, though, so I wouldn't count on that in adding items to a list - that's a very complex scenario with possible side effects).
In the lowest memory ideal case, you could be using about four bytes per list item, that's not much - you'd need 262 144 items to consume a single MiB of memory!
Show us your code, the whole loop and it's surroundings. Then we can tell you what you're actually doing.
EDIT: This is in a WCF service? You should have said so before. Where do you store the MainClass class? If it's inside the WCF service class, it might not last longer than a single request. And even if you fix that, and store it in something a bit more persistent, like a static class, you get into the complexities of when everything is collected, how the service is being restarted etc. If you need the data to be safely held for longer than a single request, storing it in process memory isn't good enough. If you don't care that the data can get thrown away once in a while, you can make the List instance static (it's not going to be shared nor persisted otherwise). Otherwise, use a database.

Speculating from your sparse description, there are two sorts of things you might not realize:
The 15-16 MB usage you see might have nothing to do with the size of your list: it could be the memory requirements for the rest of the program, and your list only consumes a negligible amount of memory in comparison. Even if you don't explicitly create objects, your program still has to load libraries and stuff, which takes memory.
I don't know C# so I don't know if this applies to List, but one of the standard container class implementations to dynamically allocate an array to hold the objects... and if the array is ever filled, then you allocate a new array twice the size and copy everything over to the new array and continue along (the actual ratio may be something other than $2$). This can have the effect that your memory usage remains constant for a long time, until you finally fill up the array and then it suddenly jumps up in size, only to remain constant again for a long time.

Is there a way to trim a Dictionary's capacity once it is known to be fixed size?

After reading the excellent accepted answer in this question:
How is the c#/.net 3.5 dictionary implemented?
I decided to set my initial capacity to a large guess and then trim it after I read in all values. How can I do this? That is, how can I trim a Dictionary so the gc will collect the unused space later?
My goal with this is optimization. I often have large datasets and the time penalty for small datasets is acceptable. I want to avoid the overhead of reallocating and copying the data that is incured with small initial capacities on large datasets.

According to Reflector, the Dictionary class never shrinks. void Resize() is hard-coded to always double the size.
You can probably create a new dictionary and use the respective constructor to copy over the items. This will be quite inefficient.
Or, implement your own dictionary with the existing one as a blue-print. This is less work than you might think at first.
Be sure to benchmark both approaches.

In .NET 5 there is the method TrimExcess doing exactly what you're asking:
Sets the capacity of this dictionary to what it would be if it had
been originally initialized with all its entries.

You might consider putting your data in a list first. Then you know the list's size, and can create a dictionary with that capacity (now exactly right for the data you want) and populate it.
Allowing the list to dynamically resize (as you add the elements) should be cheaper than allowing a dictionary to resize. (But, as others have noted, test the performance yourself!) Resizing a dictionary involves a rehashing operation, which means every element's GetHashCode will get called again, as well as the reference being copied into the new data structure. Resizing a list just means copying the references, so should be cheaper.

Limit memory consumption of class

I have a performance counters reporter class that holds in multiple members different lists. Is there a neat way to tell the class the memory limit of its consumption (to be bullet proof case the lists will be pushed with enormous amount of data), or should I go to each member and change it to blocked list? (and this is less dynamic in a way)

What you're asking doesn't make sense. How could a class limit its memory consumption?
Consider: you have a public property that is a list of data. You set the value of that property to be a 2GB set of data but the class is limited to 100MB. How does the class decide what data to throw away? What happens to the data that's thrown away? How does the rest of your program deal with the fact that half its data has disappeared?
None of these questions are sensibly answered, because each program will have a different answer. For that reason, you'd have to implement such logic yourself.
However, more importantly, you should consider this: if I create a List<int> that contains 2GB of data, and assign this list to a property of your "reporter class," the memory consumption of your reporter class doesn't change. This is because your reporter class has a property that is a List<int>, and what that means is that the property stores the memory address of a List<int> that is held somewhere else in the heap. This memory address - a pointer to what we consider the "value" of the property - is fixed according to the architecture of your machine/application and will never change. It's the same size when your pointer is null as it is when the pointer points to a 2GB list. And so in that sense, the memory consumption of your class itself won't be as big as you think.
You can redefine the question to say "when calculating consumption, include all objects pointed to by my properties" but this has its own problems. What happens if you assign the List<int> to a property on two different objects, each with its own memory limit?
Also, if your reporting class has two properties that could hold large data, and I assign large values to each, how do you decide what to throw away? If I have a 100MB limit for the class, and assign 200MB of data to one property and 1GB of data to the other, which data do I truncate? What would happen if I then cleared one of the properties - I now have "spare" memory consumption but data is irretrievably lost.
In short: this is a very complex requirement to request. You'd have to create your own logic to implement this, and it's unlikely you'll find anything "standard" to handle it, because no two implementations would be the same.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.