I've been reading about collections in .NET. As is well known, generic collections have some advantages over non-generic ones: they are type-safe, and there is no casting and no boxing/unboxing. That's why generic collections have a performance advantage.
If we consider that non-generic collections store every member as an object, then we can think that generics also have a memory advantage. However, I haven't found any information about the memory usage difference.
Can anyone clarify this point?
If we consider that non-generic collections store every member as an object, then we can think that generics also have a memory advantage. However, I haven't found any information about the memory usage difference. Can anyone clarify this point?
Sure. Let's consider an ArrayList that contains ints vs a List<int>. Let's suppose there are 1000 ints in each list.
In both, the collection type is a thin wrapper around an array -- hence the name ArrayList. In the case of ArrayList, there's an underlying object[] that contains at least 1000 boxed ints. In the case of List<int>, there's an underlying int[] that contains at least 1000 ints.
Why did I say "at least"? Because both use a double-when-full strategy. If you set the capacity of a collection when you create it then it allocates enough space for that many things. If you don't, then the collection has to guess, and if it guesses wrong and you need more capacity, then it doubles its capacity. So, best case, our collection arrays are exactly the right size. Worst case, they are possibly twice as big as they need to be; there could be room for 2000 objects or 2000 ints in the arrays.
But let's suppose for simplicity that we're lucky and there are about 1000 in each.
To start with, what's the memory burden of just the array? An object[1000] takes up 4000 bytes on a 32 bit system and 8000 bytes on a 64 bit system, just for the references, which are pointer sized. An int[1000] takes up 4000 bytes regardless. (There are also a few extra bytes taken up by array bookkeeping, but these costs are small compared to the marginal costs.)
So already we see that the non-generic solution possibly consumes twice as much memory just for the array. What about the contents of the array?
Well, the thing about value types is they are stored right there in their own variable. There is no additional space beyond those 4000 bytes used to store the 1000 integers; they get packed right into the array. So the additional cost is zero for the generic case.
For the object[] case, each member of the array is a reference, and that reference refers to an object; in this case, a boxed integer. What's the size of a boxed integer?
An unboxed value type doesn't need to store any information about its type, because its type is determined by the type of the storage it's in, and that's known to the runtime. A boxed value type needs to store the type of the thing in the box somewhere, and that takes space. It turns out that the bookkeeping overhead for an object is 8 bytes in 32 bit .NET and 16 bytes on 64 bit systems. That's just the overhead; we of course also need 4 bytes for the int. But wait, it gets worse: on 64 bit systems, the box must be aligned to an 8 byte boundary, so we need another 4 bytes of padding.
Add it all up: our int[] takes about 4KB on both 32 and 64 bit systems. Our object[] containing 1000 ints takes about 16KB on 32 bit systems, and 32KB on 64 bit systems. So the non-generic case uses either 4 or 8 times as much memory as the int[].
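If you want to see this on your own machine, here's a rough sketch comparing the two. GC.GetTotalMemory is approximate, and the exact numbers depend on the runtime and bitness, so treat the output as a ballpark, not a precise measurement:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class MemoryComparison
{
    // Measures approximately how much managed heap an allocation consumes.
    static long Measure(Func<object> allocate, out object keepAlive)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        keepAlive = allocate();
        long after = GC.GetTotalMemory(forceFullCollection: false);
        return after - before;
    }

    public static (long boxed, long unboxed) Compare()
    {
        // 1000 ints in an ArrayList: an object[] of references to boxed ints.
        long boxedBytes = Measure(() =>
        {
            var a = new ArrayList(1000);
            for (int i = 0; i < 1000; i++) a.Add(i);
            return a;
        }, out object keep1);

        // 1000 ints in a List<int>: a single int[] with the values packed in.
        long unboxedBytes = Measure(() =>
        {
            var l = new List<int>(1000);
            for (int i = 0; i < 1000; i++) l.Add(i);
            return l;
        }, out object keep2);

        GC.KeepAlive(keep1);
        GC.KeepAlive(keep2);
        return (boxedBytes, unboxedBytes);
    }

    public static void Main()
    {
        var (boxed, unboxed) = Compare();
        Console.WriteLine($"ArrayList of 1000 ints: ~{boxed} bytes");
        Console.WriteLine($"List<int> of 1000 ints: ~{unboxed} bytes");
    }
}
```

On a typical 64 bit runtime the boxed figure comes out several times larger, in line with the arithmetic above.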
But wait, it gets even worse. That's just size. What about access time?
To access an integer from an array of integers, the runtime must:
verify that the array is valid
verify that the index is valid
fetch the value from the variable at the given index
To access an integer from an array of boxed integers, the runtime must:
verify that the array is valid
verify that the index is valid
fetch the reference from the variable at the given index
verify that the reference is not null
verify that the reference is a boxed integer
extract the integer from the box
That's a lot more steps, so it takes a lot longer.
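The extra steps also show up in the code you write: reading from the non-generic collection requires a cast, and that cast is exactly where the runtime performs the null check, the type check, and the unboxing:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

class AccessComparison
{
    static void Main()
    {
        var boxed = new ArrayList { 42 };
        var unboxed = new List<int> { 42 };

        // Non-generic: the element comes back as object, so the runtime
        // must null-check, type-check, and unbox on every read.
        int a = (int)boxed[0];

        // Generic: the element is already an int; this compiles down to
        // a plain array load with no type check and no unboxing.
        int b = unboxed[0];

        Console.WriteLine(a + b); // prints 84
    }
}
```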
BUT WAIT IT GETS WORSE.
Modern processors use caches on the chip itself to avoid going back to main memory. An array of 1000 plain integers is highly likely to end up in the cache so that accesses to the first, second, third, etc, members of the array in quick succession are all pulled from the same cache line; this is insanely fast. But boxed integers can be all over the heap, which increases cache misses, which greatly slows down access even further.
Hopefully that sufficiently clarifies your understanding of the boxing penalty.
What about non-boxed types? Is there a significant difference between an ArrayList of strings and a List<string>?
Here the penalty is much, much smaller, since an object[] and a string[] have similar performance characteristics and memory layouts. The only additional penalty in this case is (1) not catching your bugs until runtime, (2) making the code harder to read and edit, and (3) the slight penalty of a run-time type check.
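The first of those penalties is easy to demonstrate. An ArrayList happily accepts an element of the wrong type, and the mistake only surfaces as an exception at run time:

```csharp
using System;
using System.Collections;

class LateError
{
    static void Main()
    {
        var names = new ArrayList { "Alice", "Bob" };
        names.Add(42); // compiles fine: everything is an object

        // The foreach inserts a hidden cast per element; it prints
        // "Alice" and "Bob", then throws InvalidCastException at 42.
        foreach (string name in names)
            Console.WriteLine(name);
    }
}
```

With a List<string>, the `names.Add(42)` line would not compile at all, so the bug could never reach production.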
then we can think that generics have also memory advantage
That assumption doesn't hold in general; the memory advantage only applies to value types. Consider this:
new ArrayList { 1, 2, 3 };
This implicitly converts every integer to object (known as boxing) in order to store it in the ArrayList. That is where the memory overhead comes from, because a boxed object is bigger than a plain int.
For reference types, however, there's no difference, as there's no need for boxing.
Choosing one or the other shouldn't be driven by performance or memory concerns. Instead, ask yourself what you want to do with the results. In particular, if you know the type(s) stored in your collection at compile time, there's no reason not to give that information to the compiler by using the right generic type argument.
In any case, you should always use generic collections instead of non-generic ones, because of the type safety mentioned above.
EDIT: Your actual question of whether to use a non-generic collection or a generic one is easy to answer: always use the generic one. But not because of its memory usage. See this:
ArrayList a = new ArrayList { 1, 2, 3};
vs.
List<object> a = new List<object> { 1, 2, 3 };
Both lists consume the same amount of memory, even though the second one is generic. That's because both box your integers into object. So the answer to the question has nothing to do with memory.
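A quick sketch to verify that both forms box; the ReferenceEquals check at the end shows that each Add of the same int creates a distinct box on the heap:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

class BothBox
{
    static void Main()
    {
        ArrayList a = new ArrayList { 1, 2, 3 };
        List<object> b = new List<object> { 1, 2, 3 };

        // In both cases each element is a separate boxed Int32 on the heap.
        Console.WriteLine(a[0].GetType()); // System.Int32
        Console.WriteLine(b[0].GetType()); // System.Int32

        // Adding the same int twice boxes it twice: two distinct heap objects.
        int x = 7;
        a.Add(x);
        a.Add(x);
        Console.WriteLine(ReferenceEquals(a[3], a[4])); // False
    }
}
```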
Put the other way around: for reference types there's no memory difference at all:
ArrayList a = new ArrayList { myInstance, anotherInstance };
vs.
List<MyClass> a = new List<MyClass> { myInstance, anotherInstance };
will produce the same memory outcome. However, the second one is far easier to maintain, since you can work with the instances directly without casting them.
Let's assume we have this statement:
int valueType = 1;
so now we have a value on the stack as follows:
stack
valueType = 1
Now consider we do this:
object boxingObject = valueType;
Now we have two new things stored in memory: the reference boxingObject on the stack, and the boxed copy of the value 1 on the heap:
stack
boxingObject
heap
1
So in the case of boxing a value type, there will be extra memory usage, as the Microsoft docs state:
Boxing a value type allocates an object instance on the heap and copies the value into the new object.
See the Microsoft docs on boxing and unboxing for full information.
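The "copies the value" part of that quote matters in practice: after boxing, the box and the original variable are independent. A minimal sketch:

```csharp
using System;

class BoxCopy
{
    static void Main()
    {
        int valueType = 1;
        object boxingObject = valueType; // allocates a box on the heap, copies 1 into it

        valueType = 99;                  // changes only the stack copy
        Console.WriteLine(boxingObject); // prints 1: the box still holds the old value

        int unboxed = (int)boxingObject; // unboxing: type check + copy back out
        Console.WriteLine(unboxed);      // prints 1
    }
}
```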
Related
I'm learning C# and basically know the difference between arrays and Lists: the latter is generic and can grow dynamically. But I'm wondering:
are List elements located sequentially in the heap like an array's, or is each element located "randomly" in a different location?
and if that is true, does that affect the speed of access and data retrieval from memory?
and if that is true, is this what makes arrays a little faster than Lists?
Let's see the second and the third questions first:
and if that is true, does that affect the speed of access and data retrieval from memory?
and if that is true, is this what makes arrays a little faster than Lists?
There is only a single type of "native" collection in .NET (by .NET I mean the CLR, i.e. the runtime): the array. (Technically, if you consider a string a type of collection, then there are two native collection types :-) And technically part 2: not everything you think of as an array is a "native" array; only one-dimensional, zero-based arrays are. Arrays of type T[,] aren't, and neither are arrays whose first element doesn't have index 0.) Every other collection (other than the LinkedList<>) is built atop the array. If you look at List<T> with ILSpy, you'll see that at its base there is a T[] plus an added int for the Count (the T[].Length is the Capacity). Clearly an array is a little faster than a List<T>, because using the list costs one extra indirection: you access the array directly, instead of accessing the list, which accesses the array.
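To make the "array plus a count" layout concrete, here is a toy list. All names here are made up for illustration; the real List<T> adds version checks, interface implementations, and smarter growth logic:

```csharp
using System;

// A toy illustration of List<T>'s core layout: a T[] for storage
// (its Length plays the role of Capacity) plus an int for the Count.
class TinyList<T>
{
    private T[] _items = new T[4]; // starting capacity chosen arbitrarily here
    private int _count;

    public int Count => _count;
    public int Capacity => _items.Length;

    public void Add(T item)
    {
        if (_count == _items.Length)
            Array.Resize(ref _items, _items.Length * 2); // double when full
        _items[_count++] = item;
    }

    // One extra indirection vs. a raw array: list -> _items -> element.
    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_count)
                throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];
        }
    }
}

class Demo
{
    static void Main()
    {
        var list = new TinyList<int>();
        for (int i = 0; i < 10; i++) list.Add(i);
        // Capacity grew 4 -> 8 -> 16 to fit 10 items.
        Console.WriteLine($"{list.Count} items, capacity {list.Capacity}");
    }
}
```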
Let's see the first question:
are List elements located sequentially in the heap like an array's, or is each element located randomly in a different location?
Being based on an array internally, the List<> clearly stores its elements like an array does: in a contiguous block of memory. (Be aware that with a List<SomeObject>, where SomeObject is a reference type, the list is a list of references, not of objects, so it is the references that are put in a contiguous block. And strictly speaking, given the virtual memory management of modern computers, "contiguous block of memory" really means "contiguous block of addresses".)
(yes, even Dictionary<> and HashSet<> are built atop arrays. Conversely a tree-like collection could be built without using an array, because it's more similar to a LinkedList)
Some additional details: there are four groups of instructions in the CIL language (the intermediate language used in compiled .NET programs) that are used with "native" arrays:
Newarr
Ldelem and family Ldelem_*
Stelem and family Stelem_*
Readonly (per its documentation, a prefix specifying that the subsequent array address operation performs no type check at run time and returns a managed pointer whose mutability is restricted)
if you look at OpCodes.Newarr you'll see this comment in the XML documentation:
// Summary:
// Pushes an object reference to a new zero-based, one-dimensional array whose
// elements are of a specific type onto the evaluation stack.
Yes, elements in a List are stored contiguously, just like an array. A List actually uses arrays internally, but that is an implementation detail that you shouldn't really need to be concerned with.
Of course, in order to get the correct impression from that statement, you also have to understand a bit about memory management in .NET. Namely, the difference between value types and reference types, and how objects of those types are stored. Value types will be stored in contiguous memory. With reference types, the references will be stored in contiguous memory, but not the instances themselves.
The advantage of using a List is that the logic inside of the class handles allocating and managing the items for you. You can add elements anywhere, remove elements from anywhere, and grow the entire size of the collection without having to do any extra work. This is, of course, also what makes a List slightly slower than an array. If any reallocation has to happen in order to comply with your request, there'll be a performance hit as a new, larger-sized array is allocated and the elements are copied to it. But it won't be any slower than if you wrote the code to do it manually with a raw array.
If your length requirement is fixed (i.e., you never need to grow/expand the total capacity of the array), you can go ahead and use a raw array. It might even be marginally faster than a List because it avoids the extra overhead and indirection (although that is subject to being optimized out by the JIT compiler).
If you need to be able to dynamically resize the collection, or you need any of the other features provided by the List class, just use a List. The performance difference will be virtually imperceptible.
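The reallocation behavior described above is observable through List<T>.Capacity. The exact growth sequence is an implementation detail, but the doubling is easy to see:

```csharp
using System;
using System.Collections.Generic;

class GrowthDemo
{
    static void Main()
    {
        var list = new List<int>();
        int lastCapacity = list.Capacity;

        for (int i = 0; i < 100; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                // Each change means a new, larger backing array was
                // allocated and all existing elements were copied over.
                Console.WriteLine($"Count={list.Count,3}  Capacity={list.Capacity}");
                lastCapacity = list.Capacity;
            }
        }

        // Pre-sizing avoids the intermediate reallocations entirely:
        var presized = new List<int>(100);
        Console.WriteLine(presized.Capacity); // 100
    }
}
```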
I know that if I have an array int A[512] that the reference A can point to the first element. In pointer arithmetic, the memory is referenced as A + index.
But if I'm not mistaken, the pointer/reference also takes up a machine word of space. Assuming an int takes up a machine word, does that mean that the 512 integers of the above array take up 513 words of space?
Is the same true/false for objects and their data members in C++ or C#?
Update: Wow you guys are fast. To clarify, I'm interested in how C++ and C# differ in how they handle this, and how I can size objects to fit in a cache line (if possible).
Update: I have been made aware of the distinction between pointers and arrays. I understand that arrays are not pointers, and that the pointer arithmetic I referenced above is only valid after the array has been converted to a pointer. I don't think this distinction is relevant to the overall question however. I'm interested in how both arrays and other objects are stored in memory in both C++ and C#.
Note that when you're talking about fitting data into a cache line, the variable containing the reference and the actual data it refers to are not going to be located in near proximity. The reference is going to wind up in a register (eventually), but it's probably originally stored as part of another object somewhere else in memory, or as a local variable on the stack. The array contents themselves can still fit in cache lines when being operated on, regardless of whatever other overhead is associated with the 'object'. If you're curious about how this works in C#, Visual Studio has a Disassembler view that shows the actual x86 or x64 assembly generated for your code.
Array references have special baked-in support at the IL (intermediate language) level, so you'll find that the way memory is loaded/used is essentially the same as using an array in C++. Under the hood, indexing into an array is exactly the same operation. Where you'll start to notice differences is if you index through arrays using 'foreach' or start having to 'unbox' references when the array is an array of object types.
Note that one difference as far as memory locality between C++ and C# can show up when you instantiate objects locally in a method. C++ allows you to instantiate arrays on the stack, which creates a special case where the array memory is actually stored in close proximity to the 'reference' and other local variables. In C#, a (managed) array's contents will always wind up being allocated on the heap.
On the other hand, when referring to heap-allocated objects, C# can sometimes have better locality of memory than C++, especially for short-lived objects. This is due to the way that the GC stores objects by their 'generation' (how long they've been alive) and the heap compaction it does. Short-lived objects are allocated quickly on a growing heap; when collected, the heap is also compacted, preventing the 'fragmentation' that can cause subsequent allocations in a non-compacted heap to be scattered in memory.
You can get similar memory locality benefits in C++ using an 'object pooling' technique (or by avoiding frequent small short-lived objects), but that takes a bit of extra work and design. In C#, the cost of those benefits, of course, is that the GC has to run: thread hijacking, generation promotion, compaction, and reference fixups cause a measurable overhead at somewhat unpredictable times. In practice, the overhead is rarely a problem, especially with Gen0 collection, which is highly optimized for a usage pattern of frequently allocated short-lived objects.
You appear to have a misunderstanding about arrays and pointers in C++.
The array
int A[512];
This declaration gets you an array of 512 ints. Nothing else. No pointer, no nothing. Just an array of ints. The size of the array will be 512 * sizeof(int).
The name
The name A refers to that array. It's not of pointer type. It's of array type. It is a name and it refers to the array. Names are simply compile-time constructs for telling the compiler what object you're talking about. Names don't exist at run-time.
The conversion
There is a conversion called array-to-pointer conversion that may occur in some circumstances. The conversion takes an expression which is of array type (such as the simple expression A) and converts it to a pointer to its first element. That is, in some situations, the expression A (which denotes the array) may be converted to an int* (which points at the first element in the array).
The pointer
The pointer that is created by array-to-pointer conversion exists for the duration of the expression it is part of. It is just a temporary object that appears in those particular circumstances.
The circumstances
An array-to-pointer conversion is a standard conversion and circumstances in which it may occur include:
When casting from an array to a pointer. For example, (int*)A.
When initialising an object of pointer type, e.g. int* p = A;.
Whenever a glvalue referring to an array appears as the operand of an expression that expects a prvalue.
This is what happens when you subscript an array, such as with A[20]. The subscript operator expects a prvalue of pointer type, so A undergoes array-to-pointer conversion.
No, objects in the CLR do not map to the "simple" memory layout of C++ that (I imagine) you refer to. Remember that you can operate on CLR objects using reflection, which means that every object has to carry additional type information inside it. This already adds more memory than just the plain content of the object; add to that a pointer used for lock management in multithreaded environments, and you end up far from the memory allocation you'd expect for a CLR object.
Also remember that pointer size differs between 32 and 64 bit machines.
I think you're confusing an array and pointer in C++.
An array of int is just that: an array of N memory locations, each taking up sizeof(int), in which you can store N ints, indexed 0 to N-1.
A pointer is a type which can point to a memory location, and takes up CPU register size in memory, so on a 32 bit machine, sizeof(int*) would be 32 bits.
If you want to have a pointer into your array, you do this: int * ptr = &A[0]; This points to the first element in the array. Now you have the pointer taking up memory (CPU word size) and you have your array of ints.
When you pass an array to a function in C or C++, it decays to a pointer to the first element in the array. That doesn't say that a pointer is an array, it says there is a decay from an array to a pointer.
In C# your array is a reference type, and you do not have pointers, so you don't worry about it. It just takes up the size of your array.
An array, int A[512], takes up 512 * sizeof(int) (+ any padding the compiler decides to add; in this particular instance, very likely no padding).
The fact that the array A can be converted to a pointer to its first element and used as A + index reflects the fact that, in the implementation, A[index] is almost always exactly the same instructions as *(A + index). The conversion to pointer happens in both cases, because to get to A[index] we take the first address of the array A and add index times sizeof(int); whether you write A[index] or *(A + index) makes no difference. In both cases, A refers to the first address of the array, and index to the number of elements into it.
There is no extra space used here.
The above applies to C and C++.
In C# and other languages that use "managed memory", there is extra overhead for each heap object. This does not change the size of the array data itself, but every object carries a header that the runtime uses for things like its type information and the garbage collector's bookkeeping. So every array, whether it holds a single integer or is very large, has some fixed overhead stored alongside it, including its length.
Concerning native C++:
But if I'm not mistaken, the pointer/reference also takes up a machine word of space
A reference does not necessarily take space in memory. Per Paragraph 8.3.2/4 of the C++11 Standard:
It is unspecified whether or not a reference requires storage (3.7).
In this case, you can use A like a pointer, and indeed it does decay to a pointer when necessary (e.g. when passing it as an argument to functions), but the type of A is int[512], not int*: therefore, A is not a pointer. For instance, you cannot do this:
int A[512];
int B;
A = &B;
There doesn't need to be any memory location used to store A (i.e. used to store the memory address where the array begins), so most likely your compiler will not allocate any extra bytes of memory for holding the address of A.
We have multiple different examples here, given that we even have several languages to discuss.
Let's start with the simple example, a simple array in C++:
int array[512];
What happens in terms of memory allocation here? 512 words of memory are allocated on the stack for the array. No heap memory is allocated. There is no overhead of any kind; no pointers to the array, no nothing, just the 512 words of memory.
Here is an alternate method of creating an array in C++:
int * array = new int[512];
Here we're creating an array on the heap. It will allocate 512 words of memory, with no additional memory allocated on the heap. Then, once that is done, the address of the start of that array is placed in a variable on the stack, taking up one additional word of memory. If you look at the total memory footprint for the entire application, yes, it will be 513 words, but it's worth noting that one is on the stack and the rest are on the heap (stack memory is much cheaper to allocate and doesn't cause fragmentation, but if you overuse or misuse it, you can run out of it more easily).
Now onto C#. In C# we don't have the two different syntaxes, all you have is:
int[] array = new int[512];
This will create a new array object on the heap. It will contain 512 words of memory for the data in the array, as well as a bit of extra memory for the overhead of the array object: 4 bytes to hold the array's length, a synchronization block, and a few other bits of bookkeeping that we don't really need to think about. That overhead is small, and not dependent on the size of the array.
There will also be a pointer (or "reference", as would be more appropriate to use in C#) to that array that is placed on the stack, which will take up a word of memory. Like C++, the stack memory can be allocated/deallocated very quickly, and without fragmenting memory, so when considering the memory footprint of your program it often makes sense to separate it.
All arrays of any types in .NET are initialized to 0 by default (or null for reference types).
Is there any way to skip this initialization? Just to save processor time. Say I'm sure it will be initialized again later with different values:
Random rnd = new Random();
Int32[] nums = new Int32[666];
for (int i = 0; i < nums.Length; i++) nums[i] = rnd.Next();
Why should the CLR initialize the nums array to zeros? When it's 666*4 bytes long, that's fine. But when it's 10^6 bytes? Then it clears 1MB without any need.
It has nothing to do with arrays. Arrays simply initialize each element using default(T). If you have a value type then it must be constructed. You cannot have a value type any other way, otherwise you would have a situation which is in direct violation of the semantic goals of value types.
I question whether or not this is actually a problem. Have you profiled your code and determined that initialization of these arrays is a bottleneck? I highly doubt it. It certainly is not a bottleneck even in your example. Focus on solving real problems.
If you truly need such low level management then why in the world are you using C# to begin with?
If C# is anything like Java in this regard the clearing is a necessary feature. If a reference array were not cleared then garbage collection would interpret the entries as valid pointers and go off the deep end.
There's no way to avoid this: the CLR guarantees that every array element starts out as default(T), and making your own value type won't change that, since array allocation never runs any constructor.
It looks like you are looking for dynamic allocation. This is not something you can do with arrays, but it is doable with other collections (such as lists).
This, however, doesn't mean that it's more performant; you are just deferring allocation until it's needed.
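For what it's worth, runtimes newer than these answers do expose a way to skip the zeroing: GC.AllocateUninitializedArray<T> (available on .NET Core 3.0 and later, if memory serves). A sketch:

```csharp
using System;

class UninitializedAlloc
{
    static void Main()
    {
        // Normal allocation: every element is guaranteed to start at 0.
        int[] zeroed = new int[1_000_000];
        Console.WriteLine(zeroed[12345]); // prints 0, guaranteed

        // Skips the zeroing (for arrays above a small internal size
        // threshold); contents are garbage until written, so only use it
        // when you are certain you'll overwrite every element anyway.
        int[] raw = GC.AllocateUninitializedArray<int>(1_000_000);
        var rnd = new Random();
        for (int i = 0; i < raw.Length; i++)
            raw[i] = rnd.Next();
    }
}
```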
As I understand it, using struct value types will always give better performance than using reference types in an array or list. Is there any downside to using a struct instead of a class in a generic list?
PS: I am aware that MSDN recommends that a struct be at most 16 bytes, but I have been using 100+ byte structures without problems so far. Also, when I exceed the maximum stack size using a struct, I also run out of heap space if I use a class instead.
There is a lot of misinformation out there about structs vs. reference types in .NET. Anything that makes a blanket statement like "structs will always perform better in..." is almost certainly wrong; it's nearly impossible to make blanket statements about performance.
Here are several items related to value types in a generic collection which will / can affect performance.
Using a value type in a generic instantiation can cause extra copies of methods to be JIT'd at runtime; for reference types, only one instance of the code is generated
Using value types makes the size of the allocated array count * the size of the specific value type, vs. reference types, where all elements have the same (pointer) size
Adding / accessing values in the collection incurs copy overhead. The cost varies with the size of the item: for references it's the same no matter the type, while for value types it grows with the size of the struct
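A sketch of that copy overhead (LargeStruct is a made-up 128-byte example, far past the ~16-byte guideline):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical 128-byte struct: sixteen 8-byte fields.
struct LargeStruct
{
    public long A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P;
}

class CopyCost
{
    static void Main()
    {
        const int n = 100_000;
        var structs = new List<LargeStruct>(n);
        for (int i = 0; i < n; i++)
            structs.Add(new LargeStruct { A = i }); // each Add copies all 128 bytes in

        var sw = Stopwatch.StartNew();
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += structs[i].A; // each indexer call copies the whole struct back out
        sw.Stop();

        // With a List<SomeClass>, the same loop would copy only 4/8-byte
        // references, at the price of one pointer indirection per element.
        Console.WriteLine($"sum={sum}, elapsed={sw.Elapsed.TotalMilliseconds:F1}ms");
    }
}
```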
As others have pointed out, there are many downsides to using large structures in a list. Some ramifications of what others have said:
Say you're sorting a list whose members are 100+ byte structures. Every time items have to be swapped, the following occurs:
var temp = list[i];
list[i] = list[j];
list[j] = temp;
The amount of data copied is 3*sizeof(your_struct). If you're sorting a list that's made up of reference types, the amount of data copied is 3*sizeof(IntPtr): 12 bytes in the 32-bit runtime, or 24 bytes in the 64-bit runtime. I can tell you from experience that copying large structures is far more expensive than the indirection inherent in using reference types.
Using structures also reduces the maximum number of items you can have in a list. In .NET, the maximum size of any single data structure is 2 gigabytes (minus a little bit). A list of structures has a maximum capacity of 2^31/sizeof(your_struct). So if your structure is 100 bytes in size, you can have at most about 21.5 million of them in a list. But if you use reference types, your maximum is about 536 million in the 32-bit runtime (although you'll run out of memory before you reach that limit), or 268 million in the 64-bit runtime. And, yes, some of us really do work with that many things in memory.
using structure value types will always give better performance than using reference types in an array or list
There is nothing true in that statement.
With structs, you cannot have code reuse in the form of class inheritance. A struct can only implement interfaces but cannot inherit from a class or another struct whereas a class can inherit from another class and of course implement interfaces.
When storing data in a List<T> or other collection (as opposed to keeping a list of controls or other active objects) and one wishes to allow the data to change, one should generally follow one of four patterns:
Store immutable objects in the list, and allow the list itself to change
Store mutable objects in the list, but only allow objects created by the owner of the list to be stored therein. Allow outsiders to access the mutable objects themselves.
Only store mutable objects to which no outside references exist, and don't expose to the outside world any references to objects within the list; if information from the list is requested, copy it from the objects in the list.
Store value types in the list.
Approach #1 is the simplest, if the objects one wants to store are immutable. Of course, the requirement that objects be immutable can be somewhat limiting.
Approach #2 can be convenient in some cases, and it permits convenient updating of data in the list (e.g. MyList[index].SomeProperty += 5;) but the exact semantics of how returned properties are, or remain, attached to items in the list may sometimes be unclear. Further, there's no clear way to load all the properties of an item in the list from an 'example' object.
Approach #3 has simple-to-understand semantics (changing an object after giving it to the list will have no effect, objects retrieved from the list will not be affected by subsequent changes to the list, and changes to objects retrieved from a list will not affect the list themselves unless the objects are explicitly written back), but requires defensive copying on every list access, which can be rather bothersome.
Approach #4 offers essentially the same semantics as approach #3, but copying a struct is cheaper than making a defensive copy of a class object. Note that if the struct is mutable, the semantics of:
var temp = MyList[index];
temp.SomeField += 5;
MyList[index] = temp;
are clearer than anything that can be achieved with so-called "immutable" (i.e. mutation-only-by-assignment) structs. To know what the above does, all one needs to know about the struct is that SomeField is a public field of some particular type. By contrast, even something like:
var temp = MyList[index];
temp = temp.WithSomeField(temp.SomeField + 5);
MyList[index] = temp;
which is about the best one could hope for with such a struct, would be much harder to read than the easily-mutable-struct version. Further, to be sure of what the above actually does, one would have to examine the definition of the struct's WithSomeField method and any constructors or methods employed thereby, as well as all of the struct's fields, to determine whether it had any side-effects other than modifying SomeField.
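Note that the C# compiler actually enforces the read-modify-write pattern for struct lists: mutating a List<T> element in place is a compile error (CS1612), because the indexer returns a copy. A small sketch, using a made-up Point struct:

```csharp
using System;
using System.Collections.Generic;

struct Point { public int X; }

class StructListDemo
{
    static void Main()
    {
        var list = new List<Point> { new Point { X = 1 } };

        // list[0].X += 5;   // does not compile: error CS1612. list[0]
        //                   // returns a copy, and a change to that
        //                   // temporary would be silently lost.

        // The explicit pattern from above works and reads clearly:
        var temp = list[0];
        temp.X += 5;
        list[0] = temp;
        Console.WriteLine(list[0].X); // prints 6

        // Arrays behave differently: arr[0] is a variable, not a copy,
        // so in-place mutation is allowed there.
        var arr = new Point[] { new Point { X = 1 } };
        arr[0].X += 5;
        Console.WriteLine(arr[0].X); // prints 6
    }
}
```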
A String is a reference type even though it has most of the characteristics of a value type such as being immutable and having == overloaded to compare the text rather than making sure they reference the same object.
Why isn't string just a value type then?
Strings aren't value types since they can be huge and need to be stored on the heap. A value type's data lives inline in its storage location, which for local variables means the stack (in all implementations of the CLR so far). Stack allocating strings would break all sorts of things: the stack is only 1MB for 32-bit and 4MB for 64-bit, you'd have to box each string, incurring a copy penalty, you couldn't intern strings, memory usage would balloon, etc...
(Edit: Added clarification about value type storage being an implementation detail, which leads to this situation where we have a type with value sematics not inheriting from System.ValueType. Thanks Ben.)
It is not a value type because performance (space and time!) would be terrible if it were a value type and its value had to be copied every time it were passed to and returned from methods, etc.
It has value semantics to keep the world sane. Can you imagine how difficult it would be to code if
string s = "hello";
string t = "hello";
bool b = (s == t);
set b to be false? Imagine how difficult coding just about any application would be.
A string is a reference type with value semantics. This design is a tradeoff which allows certain performance optimizations.
The distinction between reference types and value types is basically a performance tradeoff in the design of the language. Reference types have some overhead on construction, destruction, and garbage collection, because they are created on the heap. Value types, on the other hand, have overhead on assignments and method calls (if the data size is larger than a pointer), because the whole object is copied in memory rather than just a pointer. Because strings can be (and typically are) much larger than the size of a pointer, they are designed as reference types. Furthermore, the size of a value type must be known at compile time, which is not always the case for strings.
But strings have value semantics which means they are immutable and compared by value (i.e. character by character for a string), not by comparing references. This allows certain optimizations:
Interning means that if multiple strings are known to be equal, the compiler can just use a single string, thereby saving memory. This optimization only works if strings are immutable, otherwise changing one string would have unpredictable results on other strings.
String literals (which are known at compile time) can be interned and stored in a special static area of memory by the compiler. This saves time at runtime since they don't need to be allocated and garbage collected.
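Both optimizations are observable from C#; a small sketch (string.Concat is used here only to force the concatenation to happen at runtime rather than being folded by the compiler):

```csharp
using System;

class Program
{
    static void Main()
    {
        // Identical literals are interned: both variables point at one object.
        string a = "my string";
        string b = "my string";
        Console.WriteLine(object.ReferenceEquals(a, b)); // True

        // A string built at runtime is a fresh object...
        string c = string.Concat("my ", "string");
        Console.WriteLine(object.ReferenceEquals(a, c)); // False

        // ...until it is explicitly interned.
        Console.WriteLine(object.ReferenceEquals(a, string.Intern(c))); // True
    }
}
```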
Immutability does increase the cost of certain operations. For example, you can't replace a single character in-place; you have to allocate a new string for any change. But this is a small cost compared to the benefit of the optimizations.
Value semantics effectively hides the distinction between reference type and value types for the user. If a type has value semantics, it doesn't matter for the user if the type is a value type or reference type - it can be considered an implementation detail.
This is a late answer to an old question, but all other answers are missing the point, which is that .NET did not have generics until .NET 2.0 in 2005.
String is a reference type instead of a value type because it was of crucial importance for Microsoft to ensure that strings could be stored in the most efficient way in non-generic collections, such as System.Collections.ArrayList.
Storing a value-type in a non-generic collection requires a special conversion to the type object which is called boxing. When the CLR boxes a value type, it wraps the value inside a System.Object and stores it on the managed heap.
Reading the value from the collection requires the inverse operation which is called unboxing.
Both boxing and unboxing have non-negligible cost: boxing requires an additional allocation, unboxing requires type checking.
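The difference shows up directly in code; a minimal sketch contrasting the non-generic and generic collections:

```csharp
using System.Collections;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Non-generic: every int is boxed into a heap-allocated object on Add,
        // and must be unboxed (with a runtime type check) on the way out.
        var oldStyle = new ArrayList();
        oldStyle.Add(42);            // boxing: wraps the int in a System.Object
        int x = (int)oldStyle[0];    // unboxing: explicit cast required

        // Generic: ints are stored inline in the underlying int[]; no boxing.
        var newStyle = new List<int>();
        newStyle.Add(42);
        int y = newStyle[0];         // no cast, no unboxing
    }
}
```

Reference types such as string dodge this cost entirely in an ArrayList, because storing a reference in an object[] requires no conversion at all.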
Some answers claim incorrectly that string could never have been implemented as a value type because its size is variable. Actually it is easy to implement string as a fixed-length data structure containing two fields: an integer for the length of the string, and a pointer to a char array. You can also use a Small String Optimization strategy on top of that.
If generics had existed from day one I guess having string as a value type would probably have been a better solution, with simpler semantics, better memory usage and better cache locality. A List<string> containing only small strings could have been a single contiguous block of memory.
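A sketch of the fixed-length representation this answer has in mind (this is purely hypothetical; it is NOT how .NET implements string):

```csharp
using System;

// Hypothetical value-type string: a fixed-size struct regardless of the
// text's length. A List<ValueString> would then be one contiguous block
// of these (length, pointer) pairs, as the answer describes.
struct ValueString
{
    public int Length;    // number of characters
    public IntPtr Chars;  // pointer to the character data on the heap
}
```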
Strings are not the only immutable reference types; multicast delegates are immutable too.
That is why it is safe to write
protected void OnMyEventHandler()
{
    // Copy the delegate reference; because delegates are immutable,
    // the copy cannot change even if a subscriber is removed concurrently.
    EventHandler handler = this.MyEventHandler;
    if (handler != null)
    {
        handler(this, new EventArgs());
    }
}
I suppose that strings are immutable because this is the safest way to work with them and allocate memory.
Why are they not value types? Previous authors are right about stack size, etc. I would also add that making strings reference types allows saving on assembly size when you use the same constant string in the program. If you define
string s1 = "my string";
//some code here
string s2 = "my string";
chances are that both occurrences of the "my string" constant will be stored in your assembly only once.
If you would like to manage strings like a usual (mutable) reference type, put the string inside a new StringBuilder(string s), or use a MemoryStream.
If you are creating a library where you expect huge strings to be passed to your functions, define the parameter as a StringBuilder or as a Stream.
In very simple words, any value that has a definite size can be treated as a value type.
Also consider the way strings are implemented (different for each platform) and what happens when you start stitching them together, e.g. using a StringBuilder: it allocates a buffer for you to copy into, and once you reach the end, it allocates even more memory for you, in the hope that if you do a large concatenation, performance won't be hindered.
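You can watch that buffer growth from the outside (the exact growth pattern is an implementation detail of the runtime, so treat the printed sizes as illustrative only):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        var sb = new StringBuilder();  // starts with a small default capacity
        int last = sb.Capacity;
        Console.WriteLine(last);

        for (int i = 0; i < 100; i++)
        {
            sb.Append("0123456789");
            if (sb.Capacity != last)   // capacity grows in chunks, not per Append
            {
                Console.WriteLine(sb.Capacity);
                last = sb.Capacity;
            }
        }
    }
}
```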
Maybe Jon Skeet can help us out here?
It is mainly a performance issue.
Having strings behave LIKE a value type helps when writing code, but having them BE a value type would cause a huge performance hit.
For an in-depth look, take a peek at a nice article on strings in the .NET Framework.
How can you tell string is a reference type? I'm not sure that it matters how it is implemented. Strings in C# are immutable precisely so that you don't have to worry about this issue.
Actually, strings have very few resemblances to value types. For starters, not all value types are immutable: you can change the value of an Int32 all you want, and it will still be at the same address on the stack.
Strings are immutable for a very good reason. It has nothing to do with their being a reference type, but has a lot to do with memory management: it's just more efficient to create a new object when the string size changes than to shift things around on the managed heap. I think you're mixing together the concepts of value/reference types and immutable objects.
As far as "==": Like you said "==" is an operator overload, and again it was implemented for a very good reason to make framework more useful when working with strings.
The reason many mention the stack and memory with respect to value types and primitive types is that they must fit into a register in the microprocessor. You cannot push or pop something to/from the stack if it takes more bits than a register has... the instructions are, for example, "pop eax" -- because eax is 32 bits wide on a 32-bit system.
Floating-point primitive types are handled by the FPU, which is 80 bits wide.
This was all decided long before there was an OOP language to obfuscate the definition of "primitive type", and I assume that "value type" is a term created specifically for OOP languages.
Isn't it just as simple as this: strings are made up of character arrays? I look at strings as character arrays. Therefore they are on the heap, because the reference memory location is stored on the stack and points to the beginning of the array's memory location on the heap. The string size is not known before it is allocated... perfect for the heap.
That is why a string is really immutable: when you change it, even if the new value is the same size, the runtime doesn't know that and has to allocate a new array and assign characters to the positions in the array. It makes sense if you think of strings as a way that languages protect you from having to allocate memory on the fly (read: C-like programming).