Should .NET strings really be considered immutable? - c#

Consider the following code:
unsafe
{
string foo = string.Copy("This can't change");
fixed (char* ptr = foo)
{
char* pFoo = ptr;
pFoo[8] = pFoo[9] = ' ';
}
Console.WriteLine(foo); // "This can change"
}
This creates a pointer to the first character of foo, copies it into a writable pointer, and overwrites the characters at indices 8 and 9 (the apostrophe and the t) with spaces.
Notice I never actually reassigned foo; instead, I changed its value by modifying its state, or mutating the string. Therefore, .NET strings are mutable.
This works so well, in fact, that the following code:
unsafe
{
string bar = "Watch this";
fixed (char* p = bar)
{
char* pBar = p;
pBar[0] = 'C';
}
string baz = "Watch this";
Console.WriteLine(baz); // Unrelated, right?
}
will print "Catch this" due to string literal interning.
This has plenty of applicable uses, for example this:
string GetForInputData(byte[] inputData)
{
// allocate a mutable buffer...
char[] buffer = new char[inputData.Length];
// fill the buffer with input data
// ...and a string to return
return new string(buffer);
}
gets replaced by:
unsafe string GetForInputData(byte[] inputData)
{
// allocate a string to return
string result = new string('\0', inputData.Length);
fixed (char* ptr = result)
{
// fill the result with input data
}
return result; // return it
}
This could save potentially huge memory allocation / performance costs if you work in a speed-critical field (e.g. encodings).
I guess you could say that this doesn't count because it "uses a hack" to make pointers mutable, but then again it was the C# language designers who supported assigning a string to a pointer in the first place. (In fact, this is done all the time internally in String and StringBuilder, so technically you could make your own StringBuilder with this.)
So, should .NET strings really be considered immutable?

§ 18.6 of the C# language specification (The fixed statement) specifically addresses the case of modifying a string through a fixed pointer, and indicates that doing so can result in undefined behavior:
Modifying objects of managed type through fixed pointers can result in undefined behavior. For example, because strings are immutable, it is the programmer’s responsibility to ensure that the characters referenced by a pointer to a fixed string are not modified.
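For the buffer-filling case in the question, newer runtimes offer a supported route that avoids both the intermediate char[] and writing through a fixed pointer. A minimal sketch, assuming .NET Core 2.1 or later (where string.Create is available); the byte-to-char widening is only a placeholder for whatever decoding the real code would do:

```csharp
using System;

// Builds the string in place: the callback receives a writable span over the
// new string's character storage before the string is handed to the caller.
static string GetForInputData(byte[] inputData) =>
    string.Create(inputData.Length, inputData, (span, data) =>
    {
        for (int i = 0; i < data.Length; i++)
        {
            span[i] = (char)data[i]; // placeholder: widen each byte to a char
        }
    });

Console.WriteLine(GetForInputData(new byte[] { 72, 105 })); // Hi
```

The string is only written to before anyone else can observe it, so no existing string instance is mutated.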

I just had to play with this and experiment to confirm whether identical string literals point to the same memory location.
The results are:
string foo = "Fix value?"; //New address: 0x02b215f8
string foo2 = "Fix value?"; //Points to same address: 0x02b215f8
string fooCopy = string.Copy(foo); //New address: 0x021b2888
fixed (char* p = foo)
{
p[9] = '!';
}
Console.WriteLine(foo);
Console.WriteLine(foo2);
Console.WriteLine(fooCopy);
//References are equal, which means both refer to the same memory address
Console.WriteLine(string.ReferenceEquals(foo, foo2)); //true
//References are not equal, because string.Copy created another string at a new memory address
Console.WriteLine(string.ReferenceEquals(foo, fooCopy)); //false
We see that foo is initialized from a string literal located at memory address 0x02b215f8 on my PC. Assigning the same string literal to foo2 references the same memory address, while creating a copy of that string literal produces a new object. Further testing via string.ReferenceEquals() confirms that foo and foo2 are the same reference, while foo and fooCopy are different references.
It is interesting to see how a string literal can be manipulated in memory and how that affects other variables that merely reference it. This behavior is something to be careful about.

Related

How to view the address of string, to check its (reference) type in C#?

using System;
using System.Runtime.InteropServices;
using System.Security.Claims;
using System.Text;
Method();
unsafe void Method()
{
string a = "hello";
string b = a;
//or: string b = "hello";
Console.WriteLine(object.ReferenceEquals(a, b)); // True.
string aa = "hello";
string bb = "h";
bb += "ello";
Console.WriteLine(object.ReferenceEquals(aa, bb)); // False.
int aaa = 100;
int bbb = aaa;
Console.WriteLine(object.ReferenceEquals(aaa, bbb)); // False.
string* pointer1;
string* pointer2;
string word1 = "Hello";
string word2 = "Hello";
pointer1 = &word1;
pointer2 = &word2;
ulong addr1 = (ulong)pointer1;
ulong addr2 = (ulong)pointer2;
Console.WriteLine($"Address of variable named word1: {addr1}");
Console.WriteLine($"Address of variable named word2: {addr2}");
}
Why different locations?
It works correctly with object/string.ReferenceEquals. But I can't see the ADDRESSes of strings. Beginner in the world of IT. Be kind, people.
We'll start from here:
string word1 = "Hello";
string word2 = "Hello";
It seems you expect word1 and word2 to refer to the same string object in memory. But that's not how it works for normal objects (strings can be a little different... we'll get there). For normal reference types, you should expect two different objects. The two objects have equivalent values, but they are still different objects.
This is important. Imagine the next line changed the string for word1. You would not want the word2 variable to also change.
Now, strings are a little bit "special" in this area. Depending on which version of .Net you're running, the compiler may opt to intern equivalent strings. This means it will use the same object in memory for strings with equivalent values.
This is possible because strings are immutable. That is, calling, say, word1.Replace("e", "3") does not change the value of the string in word1 to instead be "H3llo", and therefore word2 is also not modified by changes made from word1. Instead, the Replace() call returns a new string. Additionally, all the string methods and properties work this way, such that there is no supported way to change an existing string in-place.
If you want word1 to receive that new value, you must also assign it to the variable: word1 = word1.Replace("e", "3");. Since this is a new assignment and only assigns to word1, the word2 variable will still show "Hello". So everything works as expected, and you were able to save some memory use while the two values were equal. Again: strings have special treatment here, and this is a little different from how most reference objects work by default.
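A short sketch of that behaviour (the variable named replaced is only for illustration):

```csharp
using System;

string word1 = "Hello";
string word2 = word1;                        // both variables refer to one object

string replaced = word1.Replace("e", "3");   // returns a new string
Console.WriteLine(word1);                    // Hello (the original is untouched)
Console.WriteLine(replaced);                 // H3llo

word1 = replaced;                            // only word1 is re-pointed
Console.WriteLine(word1);                    // H3llo
Console.WriteLine(word2);                    // Hello
```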
But there's another important thing to understand about memory management in .Net. The Garbage Collector can sometimes move objects to new locations. This means any address you see at one moment may not be the address it uses the next moment. This can especially happen during the compaction phase of garbage collection.
Now, it is possible to pin objects via the fixed keyword, but this is not usually a good idea; it's something to avoid unless you really need it: say to pass the object to an outside unmanaged library. There are a number of reasons for this, but one is that it prevents the garbage collector from moving the object until the fixed block closes, which can fragment the heap.
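Note that &word1 in the question takes the address of the local variable (the slot holding the reference), not of the string object, which is why the two numbers differ. If you only want to look at where the string data lives, one option that needs no unsafe code is a pinned GCHandle. This is a sketch for experimentation only; the addresses are meaningful only while the handles are allocated:

```csharp
using System;
using System.Runtime.InteropServices;

string word1 = "Hello";
string word2 = "Hello";

// Pin both strings so the GC cannot move them while we look at them.
GCHandle h1 = GCHandle.Alloc(word1, GCHandleType.Pinned);
GCHandle h2 = GCHandle.Alloc(word2, GCHandleType.Pinned);
IntPtr addr1 = IntPtr.Zero;
IntPtr addr2 = IntPtr.Zero;
try
{
    // For a string, this is the address of its first character.
    addr1 = h1.AddrOfPinnedObject();
    addr2 = h2.AddrOfPinnedObject();
    Console.WriteLine($"word1 data: 0x{addr1.ToInt64():X}");
    Console.WriteLine($"word2 data: 0x{addr2.ToInt64():X}");
    // Equal addresses mean both variables refer to one string object.
    Console.WriteLine(addr1 == addr2);
}
finally
{
    // Always release the handles so the objects are unpinned again.
    h1.Free();
    h2.Free();
}
```

For everyday code, object.ReferenceEquals(word1, word2) answers the same question without pinning anything.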

c# modify interned string through Span/Memory and MemoryMarshal

I started digging into new C#/.net core features called Span and Memory and so far they look very good.
However, when I encountered MemoryMarshal.AsMemory method I found out the following interesting use case:
const string source1 = "immutable string";
const string source2 = "immutable string";
var memory = MemoryMarshal.AsMemory(source1.AsMemory());
ref char first = ref memory.Span[0];
first = 'X';
Console.WriteLine(source1);
Console.WriteLine(source2);
Output in both cases is Xmmutable string (tested on Windows 10 x64, .net471 and .netcore2.1). And as far as I can see any string that is interned can now be modified in one place and then all references to that string will use updated value.
Is there any way to prevent such behavior? And is it possible to "unintern" string?
This is just the way it works
MemoryMarshal.AsMemory<T>(ReadOnlyMemory<T>) Method
Creates a Memory<T> instance from a ReadOnlyMemory<T>.
Returns
- Memory<T> A memory block that represents the same memory as the ReadOnlyMemory<T>.
Remarks
This method must be used with extreme caution. ReadOnlyMemory<T> is used to represent immutable data and other memory that is not meant to be written to. Memory<T> instances created by this method should not be written to. The purpose of this method is to allow variables typed as Memory<T> but only used for reading to store a ReadOnlyMemory<T>.
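As for preventing it: the practical answer is never to write through such a Memory<char>, and to work on a private copy whenever a modified value is needed. A small sketch (variable names are illustrative):

```csharp
using System;

const string source = "immutable string";

// Copy the characters into a buffer we own, change the copy,
// then build a brand-new string from it.
char[] buffer = source.ToCharArray();
buffer[0] = 'X';
string changed = new string(buffer);

Console.WriteLine(source);   // immutable string
Console.WriteLine(changed);  // Xmmutable string
```

The interned literal is never touched, so every other reference to it keeps seeing the original text.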
More things you shouldn't do
private const string source1 = "immutable string1";
private const string source2 = "immutable string2";
public unsafe static void Main()
{
fixed(char* c = source1)
{
*c = 'f';
}
Console.WriteLine(source1);
Console.WriteLine(source2);
Console.ReadKey();
}
Output
fmmutable string1
immutable string2

String inside structure behavior

Let's say we have structure
struct MyStruct
{
public string a;
}
When we assign it to a new variable, what happens to the string? For example, we expect that the string should be shared when structs are copied on the stack. We're using this code to test it, but it returns different pointers:
var a = new MyStruct();
a.a = "test";
var b = a;
IntPtr pA = Marshal.StringToCoTaskMemAnsi(a.a);
IntPtr pB = Marshal.StringToCoTaskMemAnsi(b.a);
Console.WriteLine("Pointer of a : {0}", (int)pA);
Console.WriteLine("Pointer of b : {0}", (int)pB);
The question is: when structs that contain a string are copied on the stack, is the string shared or is it recreated?
[UPDATE]
We also tried this code, it returns different pointers as well:
char charA2 = a.a[0];
char charB2 = b.a[0];
unsafe
{
var pointerA2 = &charA2;
var pointerB2 = &charB2;
Console.WriteLine("Pointer of a : {0}", (int)pointerA2);
Console.WriteLine("Pointer of b : {0}", (int)pointerB2);
}
The code you use to test it 'Copies the contents of a managed String to a block of memory allocated from the unmanaged COM task allocator.' according to MSDN. I would be surprised if any two subsequent calls to StringToCoTaskMemAnsi would return the same pointer. You can look at the memory address of the two string references or assign an object id using the debugger. Or easier: object.ReferenceEquals(a.a, b.a);
In your update, you are pointing to the stack location of the character variables, also not a good way of finding out. In any case, you are just copying the reference when you assign a string to another string, so they should always be the same.
Strings are immutable in storage and are a reference type. Furthermore, in your example the string "test" is interned. So no matter how many copies of the struct you make, you ultimately have multiple pointers to the same underlying storage (unless you go through gyrations to copy it to a new block of memory, which your contrived examples are doing).
Rest assured that there is only one copy, pointed at multiple times.
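A quick way to check this, using the struct from the question and the ReferenceEquals test suggested above:

```csharp
using System;

var a = new MyStruct { a = "test" };
var b = a; // copies the struct, i.e. copies the reference held in field a

// Both fields refer to the very same string object.
Console.WriteLine(object.ReferenceEquals(a.a, b.a)); // True

// Re-pointing one copy's field does not affect the other copy.
b.a = "other";
Console.WriteLine(a.a); // test
Console.WriteLine(b.a); // other

struct MyStruct
{
    public string a;
}
```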

Strings in Java and C#

I recently moved over to C# from Java and wanted to know how do we explicitly define a string thats stored on heap.
For example:
In Java, there are two ways we can define Strings:
String s = "Hello"; //Goes on string pool and is interned
String s1 = new String("Hello"); //creates a new string on heap
AFAIK, C# has only one way of defining String:
String s = "Hello"; // Goes on heap and is interned
Is there a way I can force this string to be created on heap, like we do in Java using new operator? There is no business need for me to do this, it's just for my understanding.
In C#, strings are ALWAYS created on the heap. Constant strings are also (by default) always interned.
You can force a non-constant string to be interned using string.Intern(), as the following code demonstrates:
string a1 = "TE";
string a2 = "ST";
string a = a1 + a2;
if (string.IsInterned(a) != null)
Console.WriteLine("a was interned");
else
Console.WriteLine("a was not interned");
string.Intern(a);
if (string.IsInterned(a) != null)
Console.WriteLine("a was interned");
else
Console.WriteLine("a was not interned");
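Note that string.Intern returns the pooled instance, so the return value is what you would normally keep. A small sketch of the difference between equal content and the same object (variable names are illustrative):

```csharp
using System;

string part1 = "TE";
string part2 = "ST";
string first = part1 + part2;    // built at run time
string second = part1 + part2;   // built at run time again

Console.WriteLine(first == second);                        // True: same characters
Console.WriteLine(object.ReferenceEquals(first, second));  // False: two separate objects

// Intern returns the single pooled instance for this content.
string pooled1 = string.Intern(first);
string pooled2 = string.Intern(second);
Console.WriteLine(object.ReferenceEquals(pooled1, pooled2)); // True
```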
In C#, the datatypes can be either
value types - which are typically stored on the stack when used as local variables (e.g. int, struct)
reference types - whose instances are created on the heap (e.g. string, class)
Since string is a reference type, it is always created on the heap.
In the .net platform, strings are always created on the heap. If you want to edit a string, say:
string foo = "abc";
foo = foo + "efg";
it will create a new string, it WON'T EDIT the previous one. The previous one becomes eligible for garbage collection once nothing references it any more. But, to conclude, it will always be created on the heap.
Like Java:
char[] letters = { 'A', 'B', 'C' };
string alphabet = new string(letters);
and various ways are explained in this link.
On .Net your literal string will be created on the heap and a reference added to the intern pool before the program starts.
Allocating a new string on the heap occurs at runtime if you do something dynamic like concatenating two variables:
String s = string1 + string2;
See: http://msdn.microsoft.com/library/system.string.intern.aspx

how String object is allocate memory without having new keyword or constructor?

In C# if we want to create a variable of type string we can use:
string str="samplestring"; // this will allocate the space to hold the string
In C#, string is a class type, so if we want to create an object, normally we have to use the new keyword. So how is allocation happening without new or constructors?
When you write
string str="samplestring";
compiler will generate two instructions:
Firstly, ldstr gets a string literal from the metadata; allocates the requisite amount of memory; creates a new String object and pushes the reference to it onto the stack.
Then stloc (or one of its short forms, e.g. stloc.0) stores that reference in the local variable str.
Note, that ldstr will allocate memory only once for each sequence of characters.
So in example below both variables will point at the same object in memory:
// CLR will allocate memory and create a new String object
// from the string literal stored in the metadata
string a = "abc";
// CLR won't create a new String object. Instead, it will look up for an existing
// reference pointing to the String object created from "abc" literal
string b = "abc";
This process is known as string interning.
Also, as you know, in .NET strings are immutable. So the contents of a String object cannot be changed after the object is created. That is, every time you're concatenating string, CLR will create a new String object.
For example, the following lines of code:
string a = "abc";
string b = a + "xyz";
Will be compiled into the following IL (not exactly, of course):
ldstr will allocate memory and create a new String object from "abc" literal
stloc will store the reference to that object in the local variable a
ldloc will push that reference onto the stack
ldstr will allocate memory and create a new String object from "xyz" literal
call will invoke the System.String::Concat on these String objects on the stack
A call to System.String::Concat will be decomposed into dozens of IL instructions and internal calls. Which, in short, will check lengths of both strings and allocate the requisite amount of memory to store the concatenation result and then copy those strings into the newly allocated memory.
stloc will store the reference to the newly created string in the local variable b
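The observable effect of those steps can be checked from C# itself; a minimal sketch:

```csharp
using System;

string a = "abc";
string b = a + "xyz";   // a is not a constant, so this is a run-time String.Concat call

Console.WriteLine(a);                             // abc (unchanged)
Console.WriteLine(b);                             // abcxyz
Console.WriteLine(object.ReferenceEquals(a, b));  // False: b is a newly allocated string
```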
This is simply the C# compiler giving you a shortcut by allowing string literals.
If you'd rather, you can instantiate a string by any number of different constructors. For example:
char[] chars = { 'w', 'o', 'r', 'd' };
string myStr = new String(chars);
According to the MS documentation you do not need to use the new command to use the default string constructor.
However this does work.
char[] letters = { 'A', 'B', 'C' };
string alphabet = new string(letters);
c# Strings (from MSDN programming guide)
Strings are in fact reference types. The variable holds a reference to the value in memory. Therefore you are just assigning the reference, not the value, to the variable. I would recommend you have a look at this video from Pluralsight (you can get a free 14-day trial)
Pluralsight C# Fundamentals - Strings
Disclaimer: I am in no way related to Pluralsight. I am a subscriber and I love the videos over there
While everything is an object in .net there are still primitive types (int, bool, etc) that do not require instantiation.
As you can see here, a string variable is an address-sized reference (4 bytes in a 32-bit process) pointing to a vector/array structure that can extend up to 2 GB. Remember that strings are immutable types, so when you change a string you are not editing the existing object, but instead allocating new memory for the new value and then changing your string reference to point to the new memory structure.
hope that helps
When you create a string using a literal, internally, depending on whether your assembly is marked with the NoStringInterning flag or not, it conceptually looks like:
// with NoStringInterning
String str = new String("samplestring");
// or without NoStringInterning
String str = String.Intern("samplestring");
In java if you write something like that:
String s1 = "abc";
String s2 = "abc";
memory will be allocated for "abc" in so called string pool and both s1 and s2 will refer to that memory. And s1 == s2 will return true ("==" compares references). But if you write:
String s1 = new String("abc");
String s2 = new String("abc");
s1 == s2 will return false. I guess in c# it'll be the same.
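One difference worth noting for C#: the == operator on string compares the characters, not the references, so the result is not the same as in Java. A small sketch that builds two separate objects with the same content:

```csharp
using System;

string s1 = new string(new[] { 'a', 'b', 'c' });
string s2 = new string(new[] { 'a', 'b', 'c' });

Console.WriteLine(s1 == s2);                        // True: string == compares content
Console.WriteLine(object.ReferenceEquals(s1, s2));  // False: two distinct objects
Console.WriteLine((object)s1 == (object)s2);        // False: object == compares references
```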
