Why .NET String is immutable? [duplicate] - c#

This question already has answers here:
Why can't strings be mutable in Java and .NET?
(17 answers)
Closed 9 years ago.
As we all know, String is immutable. What are the reasons for String being immutable and the introduction of StringBuilder class as mutable?

Instances of immutable types are inherently thread-safe, since no thread can modify it, the risk of a thread modifying it in a way that interferes with another is removed (the reference itself is a different matter).
Similarly, the fact that aliasing can't produce changes (if x and y both refer to the same object a change to x entails a change to y) allows for considerable compiler optimisations.
Memory-saving optimisations are also possible. Interning and atomising being the most obvious examples, though we can do other versions of the same principle. I once produced a memory saving of about half a GB by comparing immutable objects and replacing references to duplicates so that they all pointed to the same instance (time-consuming, but a minute's extra start-up to save a massive amount of memory was a performance win in the case in question). With mutable objects that can't be done.
No side-effects can come from passing an immutable type as a method to a parameter unless it is out or ref (since that changes the reference, not the object). A programmer therefore knows that if string x = "abc" at the start of a method, and that doesn't change in the body of the method, then x == "abc" at the end of the method.
Conceptually, the semantics are more like value types; in particular equality is based on state rather than identity. This means that "abc" == "ab" + "c". While this doesn't require immutability, the fact that a reference to such a string will always equal "abc" throughout its lifetime (which does require immutability) makes uses as keys where maintaining equality to previous values is vital, much easier to ensure correctness of (strings are indeed commonly used as keys).
Conceptually, it can make more sense to be immutable. If we add a month onto Christmas, we haven't changed Christmas, we have produced a new date in late January. It makes sense therefore that Christmas.AddMonths(1) produces a new DateTime rather than changing a mutable one. (Another example, if I as a mutable object change my name, what has changed is which name I am using, "Jon" remains immutable and other Jons will be unaffected.
Copying is fast and simple, to create a clone just return this. Since the copy can't be changed anyway, pretending something is its own copy is safe.
[Edit, I'd forgotten this one]. Internal state can be safely shared between objects. For example, if you were implementing list which was backed by an array, a start index and a count, then the most expensive part of creating a sub-range would be copying the objects. However, if it was immutable then the sub-range object could reference the same array, with only the start index and count having to change, with a very considerable change to construction time.
In all, for objects which don't have undergoing change as part of their purpose, there can be many advantages in being immutable. The main disadvantage is in requiring extra constructions, though even here it's often overstated (remember, you have to do several appends before StringBuilder becomes more efficient than the equivalent series of concatenations, with their inherent construction).
It would be a disadvantage if mutability was part of the purpose of an object (who'd want to be modeled by an Employee object whose salary could never ever change) though sometimes even then it can be useful (in a many web and other stateless applications, code doing read operations is separate from that doing updates, and using different objects may be natural - I wouldn't make an object immutable and then force that pattern, but if I already had that pattern I might make my "read" objects immutable for the performance and correctness-guarantee gain).
Copy-on-write is a middle ground. Here the "real" class holds a reference to a "state" class. State classes are shared on copy operations, but if you change the state, a new copy of the state class is created. This is more often used with C++ than C#, which is why it's std:string enjoys some, but not all, of the advantages of immutable types, while remaining mutable.

Making strings immutable has many advantages. It provides automatic thread safety, and makes strings behave like an intrinsic type in a simple, effective manner. It also allows for extra efficiencies at runtime (such as allowing effective string interning to reduce resource usage), and has huge security advantages, since it's impossible for an third party API call to change your strings.
StringBuilder was added in order to address the one major disadvantage of immutable strings - runtime construction of immutable types causes a lot of GC pressure and is inherently slow. By making an explicit, mutable class to handle this, this issue is addressed without adding unneeded complication to the string class.

Strings are not really immutable. They are just publicly immutable.
It means you cannot modify them from their public interface. But in the inside the are actually mutable.
If you don't believe me look at the String.Concat definition using reflector.
The last lines are...
int length = str0.Length;
string dest = FastAllocateString(length + str1.Length);
FillStringChecked(dest, 0, str0);
FillStringChecked(dest, length, str1);
return dest;
As you can see the FastAllocateString returns an empty but allocated string and then it is modified by FillStringChecked
Actually the FastAllocateString is an extern method and the FillStringChecked is unsafe so it uses pointers to copy the bytes.
Maybe there are better examples but this is the one I have found so far.

string management is an expensive process. keeping strings immutable allows repeated strings to be reused, rather than re-created.

Why are string types immutable in C#
String is a reference type, so it is never copied, but passed by reference.
Compare this to the C++ std::string
object (which is not immutable), which
is passed by value. This means that if
you want to use a String as a key in a
Hashtable, you're fine in C++, because
C++ will copy the string to store the
key in the hashtable (actually
std::hash_map, but still) for later
comparison. So even if you later
modify the std::string instance,
you're fine. But in .Net, when you use
a String in a Hashtable, it will store
a reference to that instance. Now
assume for a moment that strings
aren't immutable, and see what
happens:
1. Somebody inserts a value x with key "hello" into a Hashtable.
2. The Hashtable computes the hash value for the String, and places a
reference to the string and the value
x in the appropriate bucket.
3. The user modifies the String instance to be "bye".
4. Now somebody wants the value in the hashtable associated with "hello". It
ends up looking in the correct bucket,
but when comparing the strings it says
"bye"!="hello", so no value is
returned.
5. Maybe somebody wants the value "bye"? "bye" probably has a different
hash, so the hashtable would look in a
different bucket. No "bye" keys in
that bucket, so our entry still isn't
found.
Making strings immutable means that
step 3 is impossible. If somebody
modifies the string he's creating a
new string object, leaving the old one
alone. Which means the key in the
hashtable is still "hello", and thus
still correct.
So, probably among other things,
immutable strings are a way to enable
strings that are passed by reference
to be used as keys in a hashtable or
similar dictionary object.

Just to throw this in, an often forgotten view is of security, picture this scenario if strings were mutable:
string dir = "C:\SomePlainFolder";
//Kick off another thread
GetDirectoryContents(dir);
void GetDirectoryContents(string directory)
{
if(HasAccess(directory) {
//Here the other thread changed the string to "C:\AllYourPasswords\"
return Contents(directory);
}
return null;
}
You see how it could be very, very bad if you were allowed to mutate strings once they were passed.

You never have to defensively copy immutable data. Despite the fact that you need to copy it to mutate it, often the ability to freely alias and never have to worry about unintended consequences of this aliasing can lead to better performance because of the lack of defensive copying.

Strings are passed as reference types in .NET.
Reference types place a pointer on the stack, to the actual instance that resides on the managed heap. This is different to Value types, who hold their entire instance on the stack.
When a value type is passed as a parameter, the runtime creates a copy of the value on the stack and passes that value into a method. This is why integers must be passed with a 'ref' keyword to return an updated value.
When a reference type is passed, the runtime creates a copy of the pointer on the stack. That copied pointer still points to the original instance of the reference type.
The string type has an overloaded = operator which creates a copy of itself, instead of a copy of the pointer - making it behave more like a value type. However, if only the pointer was copied, a second string operation could accidently overwrite the value of a private member of another class causing some pretty nasty results.
As other posts have mentioned, the StringBuilder class allows for the creation of strings without the GC overhead.

Strings and other concrete objects are typically expressed as immutable objects to improve readability and runtime efficiency. Security is another, a process can't change your string and inject code into the string

Imagine you pass a mutable string to a function but don't expect it to be changed. Then what if the function changes that string? In C++, for instance, you could simply do call-by-value (difference between std::string and std::string& parameter), but in C# it's all about references so if you passed mutable strings around every function could change it and trigger unexpected side effects.
This is just one of various reasons. Performance is another one (interned strings, for example).

There are five common ways by which a class data store data that cannot be modified outside the storing class' control:
As value-type primitives
By holding a freely-shareable reference to class object whose properties of interest are all immutable
By holding a reference to a mutable class object that will never be exposed to anything that might mutate any properties of interest
As a struct, whether "mutable" or "immutable", all of whose fields are of types #1-#4 (not #5).
By holding the only extant copy of a reference to an object whose properties can only be mutated via that reference.
Because strings are of variable length, they cannot be value-type primitives, nor can their character data be stored in a struct. Among the remaining choices, the only one which wouldn't require that strings' character data be stored in some kind of immutable object would be #5. While it would be possible to design a framework around option #5, that choice would require that any code which wanted a copy of a string that couldn't be changed outside its control would have to make a private copy for itself. While it hardly be impossible to do that, the amount of extra code required to do that, and the amount of extra run-time processing necessary to make defensive copies of everything, would far outweigh the slight benefits that could come from having string be mutable, especially given that there is a mutable string type (System.Text.StringBuilder) which accomplishes 99% of what could be accomplished with a mutable string.

Immutable Strings also prevent concurrency-related issues.

Imagine being an OS working with a string that some other thread was
modifying behind your back. How could you validate anything without
making a copy?

Related

Should I use Class or Struct in the following case (data structure with many fields)? [duplicate]

I'm about to create 100,000 objects in code. They are small ones, only with 2 or 3 properties. I'll put them in a generic list and when they are, I'll loop them and check value a and maybe update value b.
Is it faster/better to create these objects as class or as struct?
EDIT
a. The properties are value types (except the string i think?)
b. They might (we're not sure yet) have a validate method
EDIT 2
I was wondering: are objects on the heap and the stack processed equally by the garbage collector, or does that work different?
Is it faster to create these objects as class or as struct?
You are the only person who can determine the answer to that question. Try it both ways, measure a meaningful, user-focused, relevant performance metric, and then you'll know whether the change has a meaningful effect on real users in relevant scenarios.
Structs consume less heap memory (because they are smaller and more easily compacted, not because they are "on the stack"). But they take longer to copy than a reference copy. I don't know what your performance metrics are for memory usage or speed; there's a tradeoff here and you're the person who knows what it is.
Is it better to create these objects as class or as struct?
Maybe class, maybe struct. As a rule of thumb:
If the object is :
1. Small
2. Logically an immutable value
3. There's a lot of them
Then I'd consider making it a struct. Otherwise I'd stick with a reference type.
If you need to mutate some field of a struct it is usually better to build a constructor that returns an entire new struct with the field set correctly. That's perhaps slightly slower (measure it!) but logically much easier to reason about.
Are objects on the heap and the stack processed equally by the garbage collector?
No, they are not the same because objects on the stack are the roots of the collection. The garbage collector does not need to ever ask "is this thing on the stack alive?" because the answer to that question is always "Yes, it's on the stack". (Now, you can't rely on that to keep an object alive because the stack is an implementation detail. The jitter is allowed to introduce optimizations that, say, enregister what would normally be a stack value, and then it's never on the stack so the GC doesn't know that it is still alive. An enregistered object can have its descendents collected aggressively, as soon as the register holding onto it is not going to be read again.)
But the garbage collector does have to treat objects on the stack as alive, the same way that it treats any object known to be alive as alive. The object on the stack can refer to heap-allocated objects that need to be kept alive, so the GC has to treat stack objects like living heap-allocated objects for the purposes of determining the live set. But obviously they are not treated as "live objects" for the purposes of compacting the heap, because they're not on the heap in the first place.
Is that clear?
Sometimes with struct you don't need to call the new() constructor, and directly assign the fields making it much faster that usual.
Example:
Value[] list = new Value[N];
for (int i = 0; i < N; i++)
{
list[i].id = i;
list[i].isValid = true;
}
is about 2 to 3 times faster than
Value[] list = new Value[N];
for (int i = 0; i < N; i++)
{
list[i] = new Value(i, true);
}
where Value is a struct with two fields (id and isValid).
struct Value
{
int id;
bool isValid;
public Value(int i, bool isValid)
{
this.i = i;
this.isValid = isValid;
}
}
On the other hand is the items needs to be moved or selected value types all that copying is going to slow you down. To get the exact answer I suspect you have to profile your code and test it out.
Arrays of structs are represented on the heap in a contiguous block of memory, whereas an array of objects is represented as a contiguous block of references with the actual objects themselves elsewhere on the heap, thus requiring memory for both the objects and for their array references.
In this case, as you are placing them in a List<> (and a List<> is backed onto an array) it would be more efficient, memory-wise to use structs.
(Beware though, that large arrays will find their way on the Large Object Heap where, if their lifetime is long, may have an adverse affect on your process's memory management. Remember, also, that memory is not the only consideration.)
Structs may seem similar to classes, but there are important differences that you should be aware of. First of all, classes are reference types and structs are value types. By using structs, you can create objects that behave like the built-in types and enjoy their benefits as well.
When you call the New operator on a class, it will be allocated on the heap. However, when you instantiate a struct, it gets created on the stack. This will yield performance gains. Also, you will not be dealing with references to an instance of a struct as you would with classes. You will be working directly with the struct instance. Because of this, when passing a struct to a method, it's passed by value instead of as a reference.
More here:
http://msdn.microsoft.com/en-us/library/aa288471(VS.71).aspx
If they have value semantics, then you should probably use a struct. If they have reference semantics, then you should probably use a class. There are exceptions, which mostly lean towards creating a class even when there are value semantics, but start from there.
As for your second edit, the GC only deals with the heap, but there is a lot more heap space than stack space, so putting things on the stack isn't always a win. Besides which, a list of struct-types and a list of class-types will be on the heap either way, so this is irrelevant in this case.
Edit:
I'm beginning to consider the term evil to be harmful. After all, making a class mutable is a bad idea if it's not actively needed, and I would not rule out ever using a mutable struct. It is a poor idea so often as to almost always be a bad idea though, but mostly it just doesn't coincide with value semantics so it just doesn't make sense to use a struct in the given case.
There can be reasonable exceptions with private nested structs, where all uses of that struct are hence restricted to a very limited scope. This doesn't apply here though.
Really, I think "it mutates so it's a bad stuct" is not much better than going on about the heap and the stack (which at least does have some performance impact, even if a frequently misrepresented one). "It mutates, so it quite likely doesn't make sense to consider it as having value semantics, so it's a bad struct" is only slightly different, but importantly so I think.
The best solution is to measure, measure again, then measure some more. There may be details of what you're doing that may make a simplified, easy answer like "use structs" or "use classes" difficult.
A struct is, at its heart, nothing more nor less than an aggregation of fields. In .NET it's possible for a structure to "pretend" to be an object, and for each structure type .NET implicitly defines a heap object type with the same fields and methods which--being a heap object--will behave like an object. A variable which holds a reference to such a heap object ("boxed" structure) will exhibit reference semantics, but one which holds a struct directly is simply an aggregation of variables.
I think much of the struct-versus-class confusion stems from the fact that structures have two very different usage cases, which should have very different design guidelines, but the MS guidelines don't distinguish between them. Sometimes there is a need for something which behaves like an object; in that case, the MS guidelines are pretty reasonable, though the "16 byte limit" should probably be more like 24-32. Sometimes, however, what's needed is an aggregation of variables. A struct used for that purpose should simply consist of a bunch of public fields, and possibly an Equals override, ToString override, and IEquatable(itsType).Equals implementation. Structures which are used as aggregations of fields are not objects, and shouldn't pretend to be. From the structure's point of view, the meaning of field should be nothing more or less than "the last thing written to this field". Any additional meaning should be determined by the client code.
For example, if a variable-aggregating struct has members Minimum and Maximum, the struct itself should make no promise that Minimum <= Maximum. Code which receives such a structure as a parameter should behave as though it were passed separate Minimum and Maximum values. A requirement that Minimum be no greater than Maximum should be regarded like a requirement that a Minimum parameter be no greater than a separately-passed Maximum one.
A useful pattern to consider sometimes is to have an ExposedHolder<T> class defined something like:
class ExposedHolder<T>
{
public T Value;
ExposedHolder() { }
ExposedHolder(T val) { Value = T; }
}
If one has a List<ExposedHolder<someStruct>>, where someStruct is a variable-aggregating struct, one may do things like myList[3].Value.someField += 7;, but giving myList[3].Value to other code will give it the contents of Value rather than giving it a means of altering it. By contrast, if one used a List<someStruct>, it would be necessary to use var temp=myList[3]; temp.someField += 7; myList[3] = temp;. If one used a mutable class type, exposing the contents of myList[3] to outside code would require copying all the fields to some other object. If one used an immutable class type, or an "object-style" struct, it would be necessary to construct a new instance which was like myList[3] except for someField which was different, and then store that new instance into the list.
One additional note: If you are storing a large number of similar things, it may be good to store them in possibly-nested arrays of structures, preferably trying to keep the size of each array between 1K and 64K or so. Arrays of structures are special, in that indexing one will yield a direct reference to a structure within, so one can say "a[12].x = 5;". Although one can define array-like objects, C# does not allow for them to share such syntax with arrays.
Use classes.
On a general note. Why not update value b as you create them?
From a c++ perspective I agree that it will be slower modifying a structs properties compared to a class. But I do think that they will be faster to read from due to the struct being allocated on the stack instead of the heap. Reading data from the heap requires more checks than from the stack.
Well, if you go with struct afterall, then get rid of string and use fixed size char or byte buffer.
That's re: performance.

what does it mean by "Immutable strings are threadsafe"

I have recently started to read up on mutable and immutable objects in C# and the constant thing i find wherever i read is hat being immutable makes things threadsafe and useful when used as keys in hashtables but what i dont understand is as far as the concept goes while we cannot change the content we can change the reference that is :
string s = "Hi";
s = "Bye";
While here the reference of s is changed to "Bye" but the main thing is that the content of s (or rather what it was pointing to) has changed and from the point of view of programming that is the same, so how does this make a particular function threadsafe or usable in hashtable if the string is changed ??
Simple. If you were to pass s to code that runs on a different thread, this code will receive the string pointed to by s at the time the parameter is passed. Like all strings in .net, it will not change over time, so your threaded code does not need to take into account that you may reassign s to a different value.
If you assign "Bye" to s, the original string lives on (until its garbage collected), and your variable s points to a new string.
In dictionaries, it is slightly different. If you change your mutable key in a way such that its hashcode changes, the dictionary will fail to find the key: the hashcode is used to search in an index, and the dictionary will not find the correct record if the hashcode changes over time. So this does not so really require immutability, but immutability will ensure consistent computation of hashcodes.
What immutability does for you is it gives the ability to think of the object as if it were a value type (such as int), which is often easier to reason about.
In your example, s is reassigned to reference a different string object ("Bye"), but the object that s previously referenced ("Hi") hasn't changed. Anything else that has a reference to the string "Hi" (another thread, a Dictionary, etc.) will be unaffected. As you mention, string is immutable - it's content cannot be changed once created. If you append one string to another, for example, you get a new string object. The two original string objects remain the same. This is what makes string thread-safe, and suitable for use in a hashtable.
The reference s isn't thread-safe - to ensure thread safety when using a reference, you'd need to put a lock around the reference assignment to ensure that one thread didn't attempt to read from the reference while another thread was writing to it.

Immutable class vs struct

The following are the only ways classes are different from structs in C# (please correct me if I'm wrong):
Class variables are references, while struct variables are values, therefore the entire value of struct is copied in assignments and parameter passes
Class variables are pointers stored on stack that point to the memory on heap, while struct variables are on stored heap as values
Suppose I have an immutable struct, that is struct with fields that cannot be modified once initialized. Each time I pass this struct as a parameter or use in assignments, the value would be copied and stored on stack.
Then suppose I make this immutable struct to be an immutable class. The single instance of this class would be created once, and only the reference to the class would be copied in assignments and parameter passes.
If the object was mutable, the behavior in these two cases would be different: when one would change the object, in the first case the copy of the struct would be modified, while in the second case the original object would be changed. However, in both cases the object is immutable, therefore there is no difference whether this is actually a class or a struct for the user of this object.
Since copying reference is cheaper than copying struct, why would one use an immutable struct?
Also, since mutable structs are evil, it looks like there is no reason to use structs at all.
Where am I wrong?
Since copying reference is cheaper than copying struct, why would one use an immutable struct?
This isn't always true. Copying a reference is going to be 8 bytes on a 64bit OS, which is potentially larger than many structs.
Also note that creation of the class is likely more expensive. Creating a struct is often done completely on the stack (though there are many exceptions), which is very fast. Creating a class requires creating the object handle (for the garbage collector), creating the reference on the stack, and tracking the object's lifetime. This can add GC pressure, which also has a real cost.
That being said, creating a large immutable struct is likely not a good idea, which is part of why the Guidelines for choosing between Classes and Structures recommend always using a class if your struct will be more than 16 bytes, if it will be boxed, and other issues that make the difference smaller.
That being said, I often base my decision more on the intended usage and meaning of the type in question. Value types should be used to refer to a single value (again, refer to guidelines), and often have a semantic meaning and expected usage different than classes. This is often just as important as the performance characteristics when making the choice between class or struct.
Reed's answer is quite good but just to add a few extra points:
please correct me if I'm wrong
You are basically on the right track here. You've made the common error of confusing variables with values. Variables are storage locations; values are stored in variables. And you are flirting with the commonly-stated myth that "value types go on the stack"; rather, variables go on either short-term or long-term storage, because variables are storage locations. Whether a variable goes on short or long term storage depends on its known lifetime, not its type.
But all of that is not particularly relevant to your question, which boils down to asking for a refutation of this syllogism:
Mutable structs are evil.
Reference copying is cheaper than struct copying, so immutable structs are always worse.
Therefore, there is never any use for structs.
We can refute the syllogism in several ways.
First, yes, mutable structs are evil. However, they are sometimes very useful because in some limited scenarios, you can get a performance advantage. I do not recommend this approach unless other reasonable avenues have been exhausted and there is a real performance problem.
Second, reference copying is not necessarily cheaper than struct copying. References are typically implemented as 4 or 8 byte managed pointers (though that is an implementation detail; they could be implemented as opaque handles). Copying a reference-sized struct is neither cheaper nor more expensive than copying a reference-sized reference.
Third, even if reference copying is cheaper than struct copying, references must be dereferenced in order to get at their fields. Dereferencing is not zero cost! Not only does it take machine cycles to dereference a reference, doing so might mess up the processor cache, and that can make future dereferences far more expensive!
Fourth, even if reference copying is cheaper than struct copying, who cares? If that is not the bottleneck that is producing an unacceptable performance cost then which one is faster is completely irrelevant.
Fifth, references are far, far more expensive in memory space than structs are.
Sixth, references add expense because the network of references must be periodically traced by the garbage collector; "blittable" structs may be ignored by the garbage collector entirely. Garbage collection is a large expense.
Seventh, immutable value types cannot be null, unlike reference types. You know that every value is a good value. And as Reed pointed out, in order to get a good value of a reference type you have to run both an allocator and a constructor. That's not cheap.
Eighth, value types represent values, and programs are often about the manipulation of values. It makes sense to "bake in" the metaphors of both "value" and "reference" in a language, regardless of which is "cheaper".
From MSDN;
Classes are reference types and structures are value types. Reference
types are allocated on the heap, and memory management is handled by
the garbage collector. Value types are allocated on the stack or
inline and are deallocated when they go out of scope. In general,
value types are cheaper to allocate and deallocate. However, if they
are used in scenarios that require a significant amount of boxing and
unboxing, they perform poorly as compared to reference types.
Do not define a structure unless the type has all of the following characteristics:
It logically represents a single value, similar to primitive types (integer, double, and so on).
It has an instance size smaller than 16 bytes.
It is immutable.
It will not have to be boxed frequently.
So, you should always use a class instead of struct, if your struct will be more than 16 bytes. Also read from http://www.dotnetperls.com/struct
There are two usage cases for structures. Opaque structures are useful for things which could be implemented using immutable classes, but are sufficiently small that even in the best of circumstances there wouldn't be much--if any--benefit to using a class, especially if the frequency with which they are created and discarded is a significant fraction of the frequency with which they will be simply copied. For example, Decimal is a 16-byte struct, so holding a million Decimal values would take 16 megabytes. If it were a class, each reference to a Decimal instance would take 4 or 8 bytes, but each distinct instance would probably take another 20-32 bytes. If one had many large arrays whose elements were copied from a small number of distinct Decimal instances, the class could win out, but in most scenarios one would be more likely to have an array with a million references to a million distinct instances of Decimal, which would mean the struct would win out.
Using structures in this way is generally only good if the guidelines quoted from MSDN apply (though the immutability guideline is mainly a consequence of the fact that there isn't yet any way via which struct methods can indicate that they modify the underlying struct). If any of the last three guidelines don't apply, one is likely better off using an immutable class than a struct. If the first guideline does not apply, however, that means one shouldn't use an opaque struct, but not that one should use a class instead.
In some situations, the purpose of a data type is simply to fasten a group of variables together with duct tape so that their values can be passed around as a unit, but they still remain semantically as distinct variables. For example, a lot of methods may need to pass around groups of three floating-point numbers representing 3d coordinates. If one wants to draw a triangle, it's a lot more convenient to pass three Point3d parameters than nine floating-point numbers. In many cases, the purpose of such types is not to impart any domain-specific behavior, but rather to simply provide a means of passing things around conveniently. In such cases, structures can offer major performance advantages over classes, if one uses them properly. A struct which is supposed to represent three varaibles of type double fastened together with duct tape should simply have three public fields of type double. Such a struct will allow two common operations to be performed efficiently:
Given an instance, take a snapshot of its state so the instance can be modified without disturbing the snapshot
Given an instance which is no longer needed, somehow come up with an instance which is slightly different
Immutable class types allow the first to be performed at fixed cost regardless of the amount of data held by the class, but they are inefficient at the second. The greater the amount of data the variable is supposed to represent, the greater the advantage of immutable class types versus structs when performing the first operation, and the greater the advantage of exposed-field structs when performing the second.
Mutable class types can be efficient in scenarios where the second operation dominates, and the first is needed seldom if ever, but it can be difficult for an object to expose the present values in a mutable class object without exposing the object itself to outside modification.
Note that depending upon usage patterns, large exposed-field structures may be much more efficient than either opaque structures or class types. Structure larger than 17 bytes are often less efficient than smaller ones, but they can still be vastly more efficient than classes. Further, the cost of passing a structure as a ref parameter does not depend upon its size. Large structs are inefficient if one accesses them via properties rather than fields, passes them by value needlessly, etc. but if one is careful to avoid redundant "copy" operations, there are usage patterns where there is no break-even point for classes versus structs--structs will simply perform better.
Some people may recoil in horror at the idea of a type having exposed fields, but I would suggest that a struct such as I describe shouldn't be thought of so much as an entity unto itself, but rather an extension of the things that read or write it. For example:
public struct SlopeAndIntercept
{
public double Slope,Intercept;
}
public SlopeAndIntercept FindLeastSquaresFit() ...
Code which is going to perform a least-squares-fit of a bunch of points will have to do a significant amount of work to find either the slope or Y intercept of the resulting line; finding both would not cost much more. Code which calls the FindLeastSquaresFit method is likely going to want to have the slope in one variable and the intercept in another. If such code does:
var resultLine = FindLeastSquaresFit();
the result will be to effectively create two variables resultLine.Slope and resultLine.Intercept which the method can manipulate as it sees fit. The fields of resultLine don't really belong to SlopeIntercept, nor to FindLeastSquaresFit; they belong to the code that declares resultLine. The situation is little different from if the method were used as:
double Slope, Intercept;
FindLeastSquaresFit(out Slope, out Intercept);
In that context, it would be clear that immediately following the function call, the two variables have the meaning assigned by the method, but that their meaning at any other time will depend upon what else the method does with them. Likewise for the fields of the aforementioned structure.
There are some situations where it may be better to return data using an immutable class rather than a transparent structure. Among other things, using a class will make it easier for future versions of a function that returns a Foo to return something which includes additional information. On the other hand, there are many situations where code is going to expect to deal with a specific set of discrete things, and changing that set of things would fundamentally change what clients have to do with it. For example, if one has a bunch of code that deals with (x,y) points, adding a "z" coordinate is going to require that code to be rewritten, and there's nothing the "point" type can do to mitigate that.

Why isn't there an equivalent of StringBuilder for other immutable types?

StringBuilder exists purely for the reason that strings in .NET are immutable, that is that traditional string concatenation can use lots of resources (due to lots of String objects being created).
So, since an Int32 is also immutable why don't classes exist for multiple addition for example?
There is. There's UriBuilder for building Uri objects.
What would an Int32Builder do? What meaningful operation on a single integer is going to be more convenient and/or more performant through use of such a class?
For an XXXBuilder class to make sense, the following have to hold:
The class or struct is immutable.
Changing the value by replacing it with one based on the previous (e.g. someString += "abc" or someDate = someDate.AddDays(1)) has to be relatively expensive (true in the former example more than the latter) and/or relatively convoluted to code.
The requirement for such a XXXBuilder class is common enough that it makes sense to provide it rather than just letting those who do need it code their own.
None of the above applies to int. They do apply to string and Uri. I don't think reference vs value type is particularly relevant except that cases where point 2 fits are also going to be cases where a class is almost certainly a better design choice than a value type.
Indeed, the combination of point 1 and point 2 is relatively uncommon in .NET. Some would argue less common than it should be (those who favour heavy use of immutable types). And if we can avoid point 2, then we would, wouldn't we? Nobody will think "I'll code this to be expensive and clumsy and provide a builder class". Rather they may on occasion think "The downside to my well thought-out immutability is that while it gives me many advantages it makes some operations expensive and clumsy, so I'll provide a builder class as well".
A concatenated string gets longer, which requires heap memory allocations and memory copies.
These get more expensive the longer the string gets, ergo we've a helper class (i.e. StringBuilder) to minimise the amount of copying that goes when when strings are concatenated.
Ints aren't concatinated, as you multiply ints you don't need more memory to hold the result of two multiplied ints, you just need another int (or the same int if it's *=).
You'd only need a helper class if you need to concatenate ints into some form of list . . . oh wait, List<int>!
Int32 is a value type.
String is a reference type. StringBuilder exists because String is an immutable reference type. String is also a collection of Char - so many allocations happen when you concatenate strings - StringBuilder makes these allocations beforehand, making creation of concatenated strings much more efficient. This is not an issue with value types.
Because Int32 is a value type, usually allocated on the stack (or within the body of a heap object). The compiler will automatically reuse the memory location when adding many value types in a loop for example.
The answer is basically "because of an implementation detail which means it is unnecessary".
The fact that string concatenation is slow, leading to a requirement for StringBuilder, is itself an implementation detail.
Value types can have their lifetime tracked because they have value type semantics. Whether this occurs is an implementation detail. In practice it does, and that is the reason why there is no need for an IntBuilder class.

In C#, why is String a reference type that behaves like a value type?

A String is a reference type even though it has most of the characteristics of a value type such as being immutable and having == overloaded to compare the text rather than making sure they reference the same object.
Why isn't string just a value type then?
Strings aren't value types since they can be huge, and need to be stored on the heap. Value types are (in all implementations of the CLR as of yet) stored on the stack. Stack allocating strings would break all sorts of things: the stack is only 1MB for 32-bit and 4MB for 64-bit, you'd have to box each string, incurring a copy penalty, you couldn't intern strings, and memory usage would balloon, etc...
(Edit: Added clarification about value type storage being an implementation detail, which leads to this situation where we have a type with value sematics not inheriting from System.ValueType. Thanks Ben.)
It is not a value type because performance (space and time!) would be terrible if it were a value type and its value had to be copied every time it were passed to and returned from methods, etc.
It has value semantics to keep the world sane. Can you imagine how difficult it would be to code if
string s = "hello";
string t = "hello";
bool b = (s == t);
set b to be false? Imagine how difficult coding just about any application would be.
A string is a reference type with value semantics. This design is a tradeoff which allows certain performance optimizations.
The distinction between reference types and value types are basically a performance tradeoff in the design of the language. Reference types have some overhead on construction and destruction and garbage collection, because they are created on the heap. Value types on the other hand have overhead on assignments and method calls (if the data size is larger than a pointer), because the whole object is copied in memory rather than just a pointer. Because strings can be (and typically are) much larger than the size of a pointer, they are designed as reference types. Furthermore the size of a value type must be known at compile time, which is not always the case for strings.
But strings have value semantics which means they are immutable and compared by value (i.e. character by character for a string), not by comparing references. This allows certain optimizations:
Interning means that if multiple strings are known to be equal, the compiler can just use a single string, thereby saving memory. This optimization only works if strings are immutable, otherwise changing one string would have unpredictable results on other strings.
String literals (which are known at compile time) can be interned and stored in a special static area of memory by the compiler. This saves time at runtime since they don't need to be allocated and garbage collected.
Immutable strings does increase the cost for certain operations. For example you can't replace a single character in-place, you have to allocate a new string for any change. But this is a small cost compared to the benefit of the optimizations.
Value semantics effectively hides the distinction between reference type and value types for the user. If a type has value semantics, it doesn't matter for the user if the type is a value type or reference type - it can be considered an implementation detail.
This is a late answer to an old question, but all other answers are missing the point, which is that .NET did not have generics until .NET 2.0 in 2005.
String is a reference type instead of a value type because it was of crucial importance for Microsoft to ensure that strings could be stored in the most efficient way in non-generic collections, such as System.Collections.ArrayList.
Storing a value-type in a non-generic collection requires a special conversion to the type object which is called boxing. When the CLR boxes a value type, it wraps the value inside a System.Object and stores it on the managed heap.
Reading the value from the collection requires the inverse operation which is called unboxing.
Both boxing and unboxing have non-negligible cost: boxing requires an additional allocation, unboxing requires type checking.
Some answers claim incorrectly that string could never have been implemented as a value type because its size is variable. Actually it is easy to implement string as a fixed-length data structure containing two fields: an integer for the length of the string, and a pointer to a char array. You can also use a Small String Optimization strategy on top of that.
If generics had existed from day one I guess having string as a value type would probably have been a better solution, with simpler semantics, better memory usage and better cache locality. A List<string> containing only small strings could have been a single contiguous block of memory.
Not only strings are immutable reference types.
Multi-cast delegates too.
That is why it is safe to write
protected void OnMyEventHandler()
{
delegate handler = this.MyEventHandler;
if (null != handler)
{
handler(this, new EventArgs());
}
}
I suppose that strings are immutable because this is the most safe method to work with them and allocate memory.
Why they are not Value types? Previous authors are right about stack size etc. I would also add that making strings a reference types allow to save on assembly size when you use the same constant string in the program. If you define
string s1 = "my string";
//some code here
string s2 = "my string";
Chances are that both instances of "my string" constant will be allocated in your assembly only once.
If you would like to manage strings like usual reference type, put the string inside a new StringBuilder(string s). Or use MemoryStreams.
If you are to create a library, where you expect a huge strings to be passed in your functions, either define a parameter as a StringBuilder or as a Stream.
In a very simple words any value which has a definite size can be treated as a value type.
Also, the way strings are implemented (different for each platform) and when you start stitching them together. Like using a StringBuilder. It allocats a buffer for you to copy into, once you reach the end, it allocates even more memory for you, in the hopes that if you do a large concatenation performance won't be hindered.
Maybe Jon Skeet can help up out here?
It is mainly a performance issue.
Having strings behave LIKE value type helps when writing code, but having it BE a value type would make a huge performance hit.
For an in-depth look, take a peek at a nice article on strings in the .net framework.
How can you tell string is a reference type? I'm not sure that it matters how it is implemented. Strings in C# are immutable precisely so that you don't have to worry about this issue.
Actually strings have very few resemblances to value types. For starters, not all value types are immutable, you can change the value of an Int32 all you want and it it would still be the same address on the stack.
Strings are immutable for a very good reason, it has nothing to do with it being a reference type, but has a lot to do with memory management. It's just more efficient to create a new object when string size changes than to shift things around on the managed heap. I think you're mixing together value/reference types and immutable objects concepts.
As far as "==": Like you said "==" is an operator overload, and again it was implemented for a very good reason to make framework more useful when working with strings.
The fact that many mention the stack and memory with respect to value types and primitive types is because they must fit into a register in the microprocessor. You cannot push or pop something to/from the stack if it takes more bits than a register has....the instructions are, for example "pop eax" -- because eax is 32 bits wide on a 32-bit system.
Floating-point primitive types are handled by the FPU, which is 80 bits wide.
This was all decided long before there was an OOP language to obfuscate the definition of primitive type and I assume that value type is a term that has been created specifically for OOP languages.
Isn't just as simple as Strings are made up of characters arrays. I look at strings as character arrays[]. Therefore they are on the heap because the reference memory location is stored on the stack and points to the beginning of the array's memory location on the heap. The string size is not known before it is allocated ...perfect for the heap.
That is why a string is really immutable because when you change it even if it is of the same size the compiler doesn't know that and has to allocate a new array and assign characters to the positions in the array. It makes sense if you think of strings as a way that languages protect you from having to allocate memory on the fly (read C like programming)

Categories

Resources