string concatenation and reference equality

string concatenation and reference equality - c#

In C# strings are immutable and managed. In theory that would mean the concatenation of any strings A and B would cause the allocation of a new buffer however this is all pretty obfuscated. When you concatenate with the identity (the empty string) the reference maintains intact. Is this a compile time optimization or is the overloaded assignment operator making the decision to not realloc at runtime? Furthermore, how does the runtime/compiler handle s2's value/allocation when I modify the value of s1? My program would indicate that the memory at the original address of s1 remains intact (and s2 continues pointing there) while a relloc occurs for the new value and then s1 is pointed there, is this an accurate description of what happens under the covers?
Example program;
static void Main(string[] args)
{
string s1 = "Some random text I chose";
string s2 = s1;
string s3 = s2;
Console.WriteLine(Object.ReferenceEquals(s1, s2)); // true
s1 = s1 + "";
Console.WriteLine(Object.ReferenceEquals(s1, s2)); // true
Console.WriteLine(s2);
s1 = s1 + " something else";
Console.WriteLine(Object.ReferenceEquals(s1, s2)); // false cause s1 got realloc'd
Console.WriteLine(Object.ReferenceEquals(s2, s3));
Console.WriteLine(s2);
Console.ReadKey();
}

When you concatenate with the identity (the empty string) the reference maintains intact. Is this a compile time optimization or is the overloaded assignment operator making the decision to not realloc at runtime?
It is both a compile time optimization and also an optimization performed in the implementation of the overloaded concatenation operator. If you concat two compile time literals, or concat a string known to be null or empty at compile time, the concatenation is done at compile time, and then potentially interned, and will therefore be reference equal to any other compile time literal string that has the same value.
Additionally, String.Concat is implemented such that if you concat a string with either null or an empty string, it just returns the other string (unless the other string was null, in which case it returns an empty string). The test you already have demonstrates this, as you're concatting a non-compile time literal string with an empty string and it's staying reference equal.
Of course if you don't believe your own test, you can look at the source to see that if one of the arguments is null then it simply returns the other.
if (IsNullOrEmpty(str0)) {
if (IsNullOrEmpty(str1)) {
return String.Empty;
}
return str1;
}
if (IsNullOrEmpty(str1)) {
return str0;
}

When you concatenate with the identity (the empty string) the reference maintains intact. Is this a compile time optimization or is the overloaded assignment operator making the decision to not realloc at runtime?
This is a run-time optimization. Here is how it is implemented in Mono:
public static String Concat(String str0, String str1) {
Contract.Ensures(Contract.Result() != null);
Contract.Ensures(Contract.Result().Length ==
(str0 == null ? 0 : str0.Length) +
(str1 == null ? 0 : str1.Length));
Contract.EndContractBlock();
// ========= OPTIMIZATION BEGINS ===============
if (IsNullOrEmpty(str0)) {
if (IsNullOrEmpty(str1)) {
return String.Empty;
}
return str1;
}
if (IsNullOrEmpty(str1)) {
return str0;
}
// ========== OPTIMIZATION ENDS =============
int str0Length = str0.Length;
String result = FastAllocateString(str0Length + str1.Length);
FillStringChecked(result, 0, str0);
FillStringChecked(result, str0Length, str1);
return result;
}
The compiler may produce additional optimizations of its own - for example, concatenating two string literals produces a new literal value at compile time, without calling string.Concat. This is not different from C#'s handling of other expressions that include compile-time constants of other data types, though.
Furthermore, how does the runtime/compiler handle s2's value/allocation when I modify the value of s1?
s1 and s2 are independent references to the same string object, which is immutable. Reassigning another object to one of them does not change the other reference.

It is a decision by the String.Concat function not to concat the string. It checks whether s1 is null and assigns "" to s1 if yes.
s1 = s1 + "";
gets optimized by the comiler.
s1 = s1 ?? "";
If you want to learn more check out this link

String concatenation is specified to return a string whose sequence of characters is the concatenation of the sequences encapsulated by the string representations of the things being concatenated. In cases where no existing string contains the proper sequence of characters, the concatenation code will need to create a new one; further, even in cases where an existing string might contain the proper sequence of characters, it will usually be faster for the computer to create a new string than try to find the existing one. I believe, however, that concatenation is allowed to return an existing string in any case where it can quickly find one that contains the proper characters, and in the case of concatenating a zero-length string to a non-zero-length string, finding a string which contains the proper characters is easy.
Because of behavioral details like the above, in most cases the only legitimate application of ReferenceEquals with strings is in situations where a true result is interpreted to say "the strings definitely contain the same characters" and a "false" result to say "the strings might not contain the same characters". It should not be interpreted as saying anything about where the strings came, how they were created, or anything like that.

When you concatenate with the identity (the empty string) the
reference maintains intact. Is this a compile time optimization or is
the overloaded assignment operator making the decision to not realloc
at runtime?
Neither. It's the Concat method that does that decision. The code is actually compiled into:
s1 = String.Concat(s1, "");
The Concat method contains this code, that makes it return the first parameter if the second is empty:
if (IsNullOrEmpty(str1)) {
return str0;
}
Ref: Microsoft reference source: String.Concat(string, string)
My program would indicate that the memory at the original address of
s1 remains intact (and s2 continues pointing there) while a relloc
occurs for the new value and then s1 is pointed there
That is correct.

Related

How to view the address of string, to check its (reference) type in C#?

using System;
using System.Runtime.InteropServices;
using System.Security.Claims;
using System.Text;
Method();
unsafe void Method()
{
string a = "hello";
string b = a;
//or: string b = "hello";
Console.WriteLine(object.ReferenceEquals(a, b)); // True.
string aa = "hello";
string bb = "h";
bb += "ello";
Console.WriteLine(object.ReferenceEquals(aa, bb)); // False.
int aaa = 100;
int bbb = aaa;
Console.WriteLine(object.ReferenceEquals(aaa, bbb)); // False.
string* pointer1;
string* pointer2;
string word1 = "Hello";
string word2 = "Hello";
pointer1 = &word1;
pointer2 = &word2;
ulong addr1 = (ulong)pointer1;
ulong addr2 = (ulong)pointer2;
Console.WriteLine($"Address of variable named word1: {addr1}");
Console.WriteLine($"Address of variable named word2: {addr2}");
}
Why different locations?
It works correctly with object/string.ReferenceEquals. But I can't see the ADDRESSes of strings. Beginner in the world of IT. Be kind, people.

We'll start from here:
string word1 = "Hello";
string word2 = "Hello";
It seems you expect word1 and word2 to refer to the same string object in memory. But that's not how it works for normal objects (strings can be a little different... we'll get there). For normal reference types, you should expect two different objects. The two objects have equivalent values, but they are still different objects.
This is important. Imagine the next line changed the string for word1. You would not want the word2 variable to also change.
Now, strings are a little bit "special" in this area. Depending on which version of .Net you're running, the compiler may opt to intern equivalent strings. This means it will use the same object in memory for strings with equivalent values.
This is possible because strings are immutable. That is, calling, say, word1.Replace("e", "3") does not change the value of the string in word1 to instead be "h3llo", and therefore word2 is also not modified by changes made from word1. Instead, the Replace() call returns a new string. Additionally, all the string methods and properties work this way, such that there is no way to change an existing string in-place.
If you want word1 to receive that new value, you must also assign it to the variable: word1 = word1.Replace("e", "3");. Since this is a new assignment and only assigns to word1, the word2 variable will still show "hello". So everything works as expected, and you were able to save some memory use while the two values were equal. Again: strings have special treatment here, and this is a little different from how most reference objects work by default.
But there's another important thing to understand about memory managemnet in .Net. The Garbage Collector can sometimes move objects to new locations. This means any address you see at one moment may not be the address it uses the next moment. This can especially happen during the compaction phase of garbage collection.
Now, it is possible to pin objects via the fixed keyword, but this is not usually a good idea; it's something to avoid unless you really need it: say to pass the object to an outside unmanaged library. There are a number of reasons for this, but one is it prevents the garbage collector from collecting the resource at all until the fixed block closes.

Why does String.IsInterned return a string

I see that String.Intern will actually add a string to the intern-pool and String.IsInterned will return the reference to that corresponding interned string. This makes me wonder:
Why does IsInterned return the referenced interned string and not a bool indicating whether a given string has been interned so far? I feel it's a funny use for an Is notation.
In what case would the code below return true?
bool InternCheck(string s)
{
string internedString = String.IsInterned(s);
return internedString != null && !String.Equals(internedString, s);
}

Why does IsInterned return the referenced interned string and not a bool indicating whether a given string has been interned so far? I feel it's a funny use for an Is notation.
For definitive "why?" you need to ask Microsoft. However, compare IsInterned() with similar (though functionally different of course) HashSet<T>.Add(). I.e. it's convenient to have a method that checks whether something is true, and if it is, provides the value you wanted as part of returning the information you want.
Why this method doesn't follow the TryXXX() pattern, again…you'd have to ask Microsoft, but we can easily guess. Obviously the method could have returned a bool and providing the string reference as an out parameter. But note that here, we know the value type is a nullable reference, and so can be null as an adequate way to indicate non-existence, which is different from the various types that implement TryXXX() methods.
In what case would the code below return true?
I don't see how that code would ever return true. If the string is not interned, it will return false, and if it is interned, then the interned string is necessarily always equal to the string that was passed in, and so !string.Equals(...) would also be false.
Is there some reason you think otherwise?

Let's imagine if the String.IsInterned method where to return a bool. Then all you'd know from calling bool whoopie = String.IsInterned(s); is that the value of your string is the same as a string that is interned. There is no indication that you have the same reference to the interned string.
Now the point of interning is to hold memory pressure down. You know you're creating a lot of similar strings and you want to ensure that you're not clogging up memory.
There's a cost to interning and that cost better be less than the cost of using up RAM.
So, back to String.IsInterned hypothetically returning a bool.
Since you don't know if you have the interned reference, which you'd want otherwise there's no point in interning, you'd end up writing this code a lot:
if (String.IsInterned(s))
{
s = String.GetInterned(s);
}
Or:
s = String.IsInterned(s) ? String.GetInterned(s) : s;
String.GetInterned is also a hypothetical method.
With the actual implementation of IsInterned this code becomes slightly simpler:
s = String.IsInterned(s) ?? s;
Let's see if we can improve this design.
If I try to implement a TryGetInterned style of operator I might implement it like this:
public static bool TryGetInterned(this string input, out string output)
{
string intermediate = String.IsInterned(input);
output = intermediate ?? input;
return intermediate != null;
}
This code works perfectly fine, but it leads to this kind of code repetition:
string s = "Hello World";
if (s.TryGetInterned(out string s2))
Console.WriteLine(s2); // `s` is interned
else
Console.WriteLine(s2); // `s` is NOT interned
This seems pretty pointless.
Compare this to the current IsInterned method:
string s = "Hello World";
s = String.IsInterned(s) ?? s;
Console.WriteLine(s);
Much simpler.
The only implementation that I could consider an improvement, in some circumstances, is this:
public static string GetIsInternedOrSelf(this string input)
=> String.IsInterned(input) ?? input;
Now I have this:
string s = "Hello World";
s = s.GetIsInternedOrSelf();
Console.WriteLine(s);
It's an improvement, but we've lost the ability to know if the string was interned.
The bottom-line is that I think String.IsInterned is probably as well designed as it could be.

C# string passed as an function argument

string myString;
void WriteString( string myString ) // This myString is copied.
{
// Writing to myString.
myString[0] = 'b'; // chaning this is just changing copy
}
void ReadString( string myString ) // Is this myString copied, eventhough I'm not writing at all?
{
if( myString[0] == 'a' ) // calling just get property in string
DebugConsole.Write("I just read myString and first character was 'a'");
}
Hello. I wonder if, in the case above, compiler would distinguish two functions and try to optimize ReadString function by passing myString as reference or inlining the function. If that is not the case, what should be done if myString is too huge to just ignore copying?
Thank you.

Regardless of the compiler's optimizations (which, no, would not make all that much of a difference anyway here), the string type in C# is always passed by reference.
Furthermore, the string reference is immutable. That means that your WriteString function wouldn't compile in the first place.
StringBuilder builder = new StringBuilder(myString);
builder[0] = 'b';
myString = builder.ToString();
Note, of course, that this solution will not change any references to the string made outside the function. In order to do that, pass it as a ref parameter.

Strings in Java and C#

I recently moved over to C# from Java and wanted to know how do we explicitly define a string thats stored on heap.
For example:
In Java, there are two ways we can define Strings:
String s = "Hello" //Goes on string pool and is interned
String s1 = new String("Hello") //creates a new string on heap
AFAIK, C# has only one way of defining String:
String s = "Hello" // Goes on heap and is interned
Is there a way I can force this string to be created on heap, like we do in Java using new operator? There is no business need for me to do this, its just for my understanding.

In C#, strings are ALWAYS created on the heap. Constant strings are also (by default) always interned.
You can force a non-constant string to be interned using string.Intern(), as the following code demonstrates:
string a1 = "TE";
string a2 = "ST";
string a = a1 + a2;
if (string.IsInterned(a) != null)
Console.WriteLine("a was interned");
else
Console.WriteLine("a was not interned");
string.Intern(a);
if (string.IsInterned(a) != null)
Console.WriteLine("a was interned");
else
Console.WriteLine("a was not interned");

In C#, the datatypes can be either
value types - which gets created in the stack (e.g. int, struct )
reference type - which gets created in the heap (e.g string, class)
Since strings are reference types and it always gets created in a heap.

In the .net platform, strings are created on the heap always. If you want to edit a string stay:
string foo = "abc";
string foo = "abc"+ "efg";
it will create a new string, it WON'T EDIT the previous one. The previous one will be deleted from the heap. But, to conclude, it will always be created on the heap.

Like Java:
char[] letters = { 'A', 'B', 'C' };
string alphabet = new string(letters);
and various ways are explained in this link.

On .Net your literal string will be created on the heap and a reference added to the intern pool before the program starts.
Allocating a new string on the heap occurs at runtime if you do something dynamic like concatenating two variables:
String s = string1 + string2;
See: http://msdn.microsoft.com/library/system.string.intern.aspx

"bool" as object vs "string" as object testing equality

I am relatively new to C#, and I noticed something interesting today that I guess I have never noticed or perhaps I am missing something. Here is an NUnit test to give an example:
object boolean1 = false;
object booloan2 = false;
Assert.That(boolean1 == booloan2);
This unit test fails, but this one passes:
object string1 = "string";
object string2 = "string";
Assert.That(string1 == string2);
I'm not that surprised in and of itself that the first one fails seeing as boolean1, and boolean2 are different references. But it is troubling to me that the first one fails, and the second one passes. I read (on MSDN somewhere) that some magic was done to the String class to facilitate this. I think my question really is why wasn't this behavior replicated in bool? As a note... if the boolean1 and 2 are declared as bool then there is no problem.
What is the reason for these differences or why it was implemented that way? Is there a situation where you would want to reference a bool object for anything except its value?

It's because the strings are in fact referring the same instance. Strings are interned, so that unique strings are reused. This means that in your code, the two string variables will refer to the same, interned string instance.
You can read some more about it here: Strings in .NET and C# (by Jon Skeet)
Update
Just for completeness; as Anthony points out string literals are interned, which can be showed with the following code:
object firstString = "string1";
object secondString = "string1";
Console.WriteLine(firstString == secondString); // prints True
int n = 1;
object firstString = "string" + n.ToString();
object secondString = "string" + n.ToString();
Console.WriteLine(firstString == secondString); // prints False

Operator Overloading.
The Boolean class does not have an overloaded == operator. The String class does.

As Fredrik said, you are doing a reference compare with the boolean comparison. The reason the string scenario works is because the == operator has been overloaded for strings to do a value compare. See the System.String page on MSDN.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.