Why doesn't interning work on copies of a string?

Given:
object literal1 = "abc";
object literal2 = "abc";
object copiedVariable = string.Copy((string)literal1);
if (literal1 == literal2)
    Console.WriteLine("objects are equal because of interning"); // Are equal
if (literal1 == copiedVariable)
    Console.WriteLine("copy is equal");
else
    Console.WriteLine("copy not eq"); // NOT equal
These results imply that copiedVariable is not subject to string interning. Why?
Is there a circumstance where it's useful to have equivalent strings that are not interned, or is this behavior due to some language detail?

If you think about it, the interning of strings is a process that is triggered at compile time on literals. Which implies that:
it is implicit when you assign/bind a literal to a variable
it is implicit when you copy a reference (i.e. string a = some_other_string_variable;), since the copy points to the same instance
On the other hand, if you create an instance of a string manually at run time, by using a StringBuilder or by Copy-ing, then you have to specifically request to intern it by invoking the Intern method of the String class.
Even in the remarks section of the documentation it is stated that:
The common language runtime conserves string storage by maintaining a
table, called the intern pool, that contains a single reference to
each unique literal string declared or created programmatically in
your program. Consequently, an instance of a literal string with a
particular value only exists once in the system. For example, if you
assign the same literal string to several variables, the runtime
retrieves the same reference to the literal string from the intern
pool and assigns it to each variable.
And the documentation for the Copy method of the String class states that it:
Creates a new instance of String with the same value as a specified
String.
which implies that it's not going to just return a reference to the same string (from the intern pool). Again, if it did there wouldn't be much use for it then, would there?!
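A minimal sketch of the point above, assuming a current .NET runtime with default settings: a copy made at run time is a distinct object until it is passed through string.Intern. MakeCopy is a hypothetical helper standing in for string.Copy, which newer .NET versions mark as obsolete.

```csharp
using System;

static class InternDemo
{
    // Hypothetical helper: builds a new string object with the same characters.
    public static string MakeCopy(string s) => new string(s.ToCharArray());

    public static void Main()
    {
        string literal = "abc";
        string copy = MakeCopy(literal);

        // Same characters, different objects.
        Console.WriteLine(literal == copy);                            // True (value comparison)
        Console.WriteLine(object.ReferenceEquals(literal, copy));      // False (reference comparison)

        // Intern returns the pooled instance, which here is the literal itself.
        string interned = string.Intern(copy);
        Console.WriteLine(object.ReferenceEquals(literal, interned));  // True
    }
}
```

The copy stays a separate object because nothing asked the runtime to look it up in the pool; only the explicit Intern call does that.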

Some languages require the result to be a copy for certain methods/procedures.
For example, in substring-type methods. The semantics would then be the same even if you call foo.substring(0, foo.length) (which is how you would probably implement a string copy).
Note: IIRC*, this is NOT the case with .NET's implementation of string.Substring though. It is not really clear from MSDN either (see below).
It returns:
A string that is equivalent to the substring of length length that
begins at startIndex in this instance, or Empty if startIndex is equal
to the length of this instance and length is zero.
It notes:
This method does not modify the value of the current instance.
Instead, it returns a new string with length characters starting from
the startIndex position in the current string.
UPDATE
I remembered correctly: Substring delegates to string InternalSubString(int startIndex, int length, bool fAlwaysCopy), which returns the original instance for a full-length substring unless fAlwaysCopy is true. Substring passes false to this method.
UPDATE 2
It looks like string.Copy could have used InternalSubString, passing true for the aforementioned parameter, but looking at the disassembly, it seems to use a slightly more optimized version, possibly to save a method call.
Sorry for the redundant information.
* I remember this from implementing the substring procedure for IronScheme, which the R6RS specification requires to make a copy :)
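The shortcut described in the updates can be observed from user code. This is only a sketch, and it relies on an implementation detail of the .NET versions discussed here rather than on a documented guarantee about object identity:

```csharp
using System;

static class SubstringDemo
{
    public static void Main()
    {
        string foo = "IronScheme";

        // A substring covering the whole string: no copy is made.
        string whole = foo.Substring(0, foo.Length);
        Console.WriteLine(object.ReferenceEquals(foo, whole)); // True

        // A proper substring is always a new string object.
        string part = foo.Substring(0, 4);
        Console.WriteLine(part);                               // Iron
    }
}
```

A language that requires substring to return a fresh copy therefore cannot map that procedure directly onto string.Substring.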

Related

Will Substring create another instance in C#?

I am new to C# strings and I am confused about
Object.ReferenceEquals
I was reading an article which says that ReferenceEquals checks whether two references are the same instance. In the program below I am checking object.ReferenceEquals(s1, s4). Even though they hold the same data, why does it return false?
string s1 = "akhil";
string s2 = "akhil";
Console.WriteLine(object.ReferenceEquals(s1, s2)); //true
s2 = "akhil jain";
Console.WriteLine(object.ReferenceEquals(s1, s2)); //false
//Console.WriteLine(s1 == s2);
//Console.WriteLine(s1.Equals(s2));
string s3 = "akhil";
//1".Substring(0, 5);
Console.WriteLine(s3+" " +s1);
Console.WriteLine(object.ReferenceEquals(s1,s3)); //true
string s4 = "akhil1".Substring(0, 5);
Console.WriteLine(object.ReferenceEquals(s1, s4)); // false: confusing, since s4 holds the same data as s1

The references are the same because a string literal gets interned. Substring returns a new string and a new reference; it doesn't try to second-guess your parameters and check the intern pool.
String.Intern(String) Method
The common language runtime conserves string storage by maintaining a
table, called the intern pool, that contains a single reference to
each unique literal string declared or created programmatically in
your program. Consequently, an instance of a literal string with a
particular value only exists once in the system.
For example, if you assign the same literal string to several variables, the runtime retrieves the same reference to the literal
string from the intern pool and assigns it to each variable.
Though, useless fact 3454345.2: since .NET 2.0, you have been able to turn it off, for whatever reasons you may have
CompilationRelaxations Enum
NoStringInterning Marks an assembly as not requiring string-literal interning. In an application domain, the common
language runtime creates one string object for each unique string
literal, rather than making multiple copies. This behavior, called
string interning, internally requires building auxiliary tables that
consume memory resources.

When you instantiate two objects, their references are not equal, so the Object.ReferenceEquals method returns false. However, strings are a special case. If you declare a string literal in code, the CLR maintains it in a table called the intern pool. This causes two string literals with the same value to reference the same object in memory, which in turn causes Object.ReferenceEquals to return true.
When a string is formed by some operation in your code, it is not automatically added to the intern pool. It therefore has a different reference, although the content of the string might be the same. This is also explained in the remarks of the documentation of Object.ReferenceEquals.
Note that the String.Equals() method would return true. In C# you can also use the '==' operator on strings. See your adjusted code below.
string s1 = "akhil";
string s2 = "akhil";
Console.WriteLine(s1.Equals(s2)); //true
s2 = "akhil jain";
Console.WriteLine(s1.Equals(s2)); //false
string s3 = "akhil";
Console.WriteLine(s3 + " " + s1);
Console.WriteLine(s1.Equals(s3)); //true
string s4 = "akhil1".Substring(0, 5);
Console.WriteLine(s1.Equals(s4)); //this now returns true as well
Console.WriteLine(s1 == s4); //so does this

object.ReferenceEquals returns false because it checks whether both references point to the same object. ReferenceEquals does not check for data equality, only whether both references denote the same object in memory.
As TheGeneral already mentioned, string literals are interned and stored in a table called the intern pool. This is done to store string objects efficiently.
When a string literal is assigned to multiple variables, they all point to the same object in the intern pool. Hence, you get true for object.ReferenceEquals. But when you compare this with a substring, a new object has been created in memory. This results in false when the references are compared, since they are two different objects occupying different memory locations.
Strings that are created dynamically or read from an external source are not interned automatically.
If you try the following, you will get true for object.ReferenceEquals:
Console.WriteLine(object.ReferenceEquals(s1, string.Intern(s4)));
You can check with primitive data types that ReferenceEquals returns false even when one variable is assigned to another.
int a = 10;
int b = a;
Console.WriteLine(ReferenceEquals(a, b)); //false
This is because each int argument is boxed into a separate object when it is passed to ReferenceEquals.
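A short sketch of why the int comparison above fails. ReferenceEquals takes object parameters, so every value-type argument is boxed separately, even when the same variable is passed twice:

```csharp
using System;

static class BoxingDemo
{
    public static void Main()
    {
        int a = 10;

        // Each int argument is boxed into its own object before the comparison.
        Console.WriteLine(object.ReferenceEquals(a, a));          // False

        // Boxing once and comparing that single box with itself gives true.
        object boxed = a;
        Console.WriteLine(object.ReferenceEquals(boxed, boxed));  // True
    }
}
```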

Does string.Replace(string, string) create additional strings?

We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from a few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string that does not create an additional string.

Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create a new string, because, as you already know, strings are immutable.
Why?
There is no reason to create a new string on each replacement: it's very simple to avoid, and avoiding it gives a huge performance boost.
If you are very interested, referencesource.microsoft.com and the SSCLI 2.0 source code demonstrate this (see also "How to see code of method which marked as MethodImplOptions.InternalCall"):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
        StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
    // unnecessary code omitted
    while (((index = COMStringBuffer::LocalIndexOfString(thisBuffer, oldBuffer,
            thisLength, oldLength, index)) > -1) && (index <= endIndex - oldLength))
    {
        replaceIndex[replaceCount++] = index;
        index += oldLength;
    }
    if (replaceCount != 0)
    {
        // Calculate the new length of the string and ensure that we have
        // sufficient room.
        INT64 retValBuffLength = thisLength -
            ((oldLength - newLength) * (INT64)replaceCount);
        gc.retValString = COMString::NewString((INT32)retValBuffLength);
        // unnecessary code omitted
    }
}
As you can see, retValBuffLength is calculated up front from replaceCount, so the result string is allocated only once. The real implementation may be a bit different in .NET 4.0 (SSCLI 4.0 is not released), but I assure you it's not doing anything silly :-).
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes: a reusable StringBuilder with a capacity of ~2000 characters, which avoids any memory allocation. This is only true if the replacement lengths are equal, and it can get you a nice performance gain if you're in a tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.
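One way to sketch the reusable-buffer idea, under the assumption that the cleaned text is headed for an output StringBuilder anyway: copy the characters straight into that builder and skip the unwanted one, so no intermediate string is created per record. AppendWithout is a hypothetical helper name, not a framework method.

```csharp
using System;
using System.Text;

static class SlashStripper
{
    // Hypothetical helper: appends text to output, skipping every occurrence of unwanted.
    public static void AppendWithout(StringBuilder output, string text, char unwanted)
    {
        foreach (char c in text)
        {
            if (c != unwanted)
            {
                output.Append(c);
            }
        }
    }

    public static void Main()
    {
        var output = new StringBuilder(2000); // capacity figure taken from the answer above
        AppendWithout(output, "01/01/2014", '/');
        Console.WriteLine(output.ToString()); // 01012014
    }
}
```

Whether this beats a plain String.Replace for a given file is something only a benchmark on realistic data can tell.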

Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we follow the StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on string objects:
Although String is immutable in .NET, this is not some kind of limitation; it's a contract. String is actually a reference type, and what it holds is the length of the actual string plus the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass a string to its constructor it copies the string's buffer to its own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string and copies its buffer there.
So, if you need a fast and memory-efficient way to make changes to a string, StringBuilder is definitely your choice, especially given that Microsoft explicitly recommends using StringBuilder if you "perform repeated modifications to a string".

I haven't found any sources, but I strongly doubt that the implementation creates a new string for every single replacement. I'd also implement it with a StringBuilder internally. So String.Replace is absolutely fine if you want to replace once in a huge string. But if you have to call Replace many times, you should consider using StringBuilder.Replace, because every call of String.Replace creates a new string.
So you can use StringBuilder.Replace, since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()
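A brief sketch of StringBuilder.Replace, assuming a record that already sits in a builder (the sample record text is made up for illustration). The calls modify the builder's own buffer and return the same builder, so they can be chained without producing a string per step:

```csharp
using System;
using System.Text;

static class BuilderReplaceDemo
{
    public static void Main()
    {
        var record = new StringBuilder("01/01/2014;31/12/2014");

        // Both Replace calls edit the builder in place.
        record.Replace("/", "").Replace(';', ',');

        // Only this final ToString call creates a new string.
        Console.WriteLine(record.ToString()); // 01012014,31122014
    }
}
```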

There is no string method for that. You are on your own. But you can try something like this:
oldFormat = "dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}{2}", dt[0], dt[1], dt[2]); // ddmmyyyy
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.AppendFormat("{0}{1}", dt[1], dt[2]); // sb now holds ddmmyyyy

(string)combination Purpose?

I'm following an exercise which tasks me to...
"Declare two variables of type string with values "Hello" and "World".
Declare a variable of type object. Assign the value obtained of
concatenation of the two string variables (add space if necessary) to
this variable. Print the variable of type object".
Now here was my original solution:
string hi = "Hello";
string wo = "World";
object hiwo = hi + " " + wo;
Console.WriteLine(hiwo);
Console.ReadLine();
I found a good website that gives sample solutions to the exercises I am working through, and I have started comparing them to my answers. For this one I noticed I was nearly spot on, apart from an extra line. I've modified my original code to make the comparison easier.
My modified code:
string firstWord = "Hello";
string secondWord = "World";
object combination = firstWord + " " + secondWord;
Console.WriteLine(combination);
Given Solution:
string firstWord = "Hello";
string secondWord = "World";
object combination = firstWord + " " + secondWord;
string a = (string)combination;
Console.WriteLine(a);
I believe understanding this extra line is the purpose of the exercise. So my question is: why does the extra line exist, and what are the benefits of having it? The section of the book is about understanding types and variables.

The extra line is a type cast:
A cast is a way of explicitly informing the compiler that you intend to make the conversion and that you are aware that data loss might occur.
Usually, a cast doesn't really return a different object. It just checks if the object is, at runtime, of the type you're casting to. That is, the expression firstWord + secondWord returns an object of type string. Assigning it to a variable of type object doesn't change the fact it's really a string. Similarly, doing (string) combination doesn't return a different object – it just tells the compiler that the expression is of type string. (If combination wasn't really a string, the check would fail and throw an exception.)
In this case there is no benefit to having it there that I can see. Console.WriteLine(object) converts the object to a string internally, and an object that is already a string will just "convert" to itself.
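A small sketch of both claims about casts, reusing the variable names from the question: the cast hands back the very same object, and a cast to the wrong type fails at run time.

```csharp
using System;

static class CastDemo
{
    public static void Main()
    {
        string firstWord = "Hello";
        string secondWord = "World";
        object combination = firstWord + " " + secondWord;

        // The cast yields the same object, now typed as string.
        string a = (string)combination;
        Console.WriteLine(object.ReferenceEquals(combination, a)); // True

        // Casting an object that is not a string throws InvalidCastException.
        object number = 42;
        try
        {
            string s = (string)number;
            Console.WriteLine(s);
        }
        catch (InvalidCastException)
        {
            Console.WriteLine("42 is not a string");
        }
    }
}
```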

In your solution, when you call
Console.WriteLine(combination)
the ToString() method is called internally, so you don't notice any difference.
From MSDN
If value is null, only the line terminator is written. Otherwise, the ToString method of value is called to produce its string representation, and the resulting string is written to the standard output stream.
Whereas in the given solution object is first converted to string and then written.
To understand the difference let's take another example
TextBox tb = new TextBox();
Console.WriteLine(tb);
The output would be "System.Windows.Forms.TextBox, Text: ", that is, a description based on the type of the object.

In your version, the line Console.WriteLine calls the virtual ToString method; because it is virtual, the version that actually executes is the one implemented in the string class (which just returns the string itself).
The given solution explicitly casts the object to string. The difference is thus increased readability: fewer things happen behind the scenes, and it is made explicit that you're operating on a string instance.

The extra line is basically casting the object to a string type in order for it to be printed out.
Another way would be...
string firstWord = "hello";
string secondWord = "world";
object combination = string.Format("{0} {1}", firstWord, secondWord);
Console.WriteLine(combination.ToString());

C# Changing a string after it has been created

Okay I know this question is painfully simple, and I'll admit that I am pretty new to C# as well. But the title doesn't describe the entire situation here so hear me out.
I need to alter a URL string which is being created in a C# code-behind, removing the substring ".aspx" from the end of the string. So basically I know that my URL, coming into this class, will be something like "Blah.aspx" and I want to get rid of the ".aspx" part of that string. I assume this is quite easy to do by just finding that substring and removing it if it exists (or some similar strategy; I would appreciate an elegant solution if someone has done this before). Here is the problem:
"Because strings are immutable, it is not possible (without using unsafe code) to modify the value of a string object after it has been created." This is from the MSDN official website. So I'm wondering now, if strings are truly immutable, then I simply can't (shouldn't) alter the string after it has been made. So how can I make sure that what I'm planning to do is safe?

You don't change the string, you change the variable. Instead of that variable referring to a string such as "foo.aspx", alter it to point to a new string that has the value "foo".
As an analogy, adding one to the number two doesn't change the number two. Two is still just the same as it always was; you have changed a variable from referring to one number to referring to another.
As for your specific case, EndsWith and Remove make it easy enough:
if (url.EndsWith(".aspx"))
    url = url.Remove(url.Length - ".aspx".Length);
Note here that Remove is taking one string, an integer, and giving us a brand new string, which we need to assign back to our variable. It doesn't change the string itself.
Also note that there is a Uri class that you can use for parsing URLs, and it will be able to handle all of the complex situations that can arise, including hashes, query parameters, etc. You should use that to parse out the aspects of a URL that you are interested in.
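The EndsWith/Remove idea can be wrapped in a small helper. TrimSuffix is a hypothetical name, and the choice of a case-insensitive ordinal comparison is an assumption (so that "Blah.ASPX" is handled too); use StringComparison.Ordinal if case matters.

```csharp
using System;

static class UrlHelper
{
    // Hypothetical helper: returns value without suffix when it ends with it,
    // otherwise returns value unchanged. The input string is never modified.
    public static string TrimSuffix(string value, string suffix)
    {
        if (value.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
        {
            return value.Substring(0, value.Length - suffix.Length);
        }
        return value;
    }

    public static void Main()
    {
        string url = "Blah.aspx";
        url = TrimSuffix(url, ".aspx"); // the variable now refers to a new string
        Console.WriteLine(url);         // Blah
    }
}
```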

String immutability is not a problem for normal usage -- it just means that member functions like "Replace", instead of modifying the existing string object, return a new one. In practical terms that usually just means you have to remember to copy the change back to the original, like:
string x = "Blah.aspx";
x.Replace(".aspx", ""); // still "Blah.aspx"
x = x.Replace(".aspx", ""); // now "Blah"
The weirdness around strings comes from the fact that System.String inherits System.Object, yet, because of its immutability, behaves like a value type rather than an object. For example, if you pass a string into a function, there's no way to modify it, unless you pass it by reference:
void Test(string y)
{
    y = "bar";
}
void Test(ref string z)
{
    z = "baz";
}
string x = "foo";
Test(x); // x is still "foo"
Test(ref x); // x is now "baz"

A String in C# is immutable, as you say. Meaning that this would create multiple String objects in memory:
String s = "String of numbers 0";
s += "1";
s += "2";
So, while the variable s would return to you the value String of numbers 012, internally it required the creation of three strings in memory to accomplish that.
In your particular case, the solution is quite simple:
String myPath = "C:\\folder1\\folder2\\myFile.aspx";
myPath = Path.Combine(Path.GetDirectoryName(myPath), Path.GetFileNameWithoutExtension(myPath));
Again, this appears as if myPath has changed, but it really has not. An internal copy and assign took place and you get to keep using the same variable.
Also, if you must preserve the original variable, you could simply make a new variable:
String myPath = "C:\\folder1\\folder2\\myFile.aspx";
String thePath = Path.Combine(Path.GetDirectoryName(myPath), Path.GetFileNameWithoutExtension(myPath));
Either way, you end up with a variable you can use.
Note that the use of the Path methods ensures you get proper path operations, and not blind String replacements that could have unintended side-effects.

String.Replace() will not modify the string. It will create a new one. So the following code:
String myUrl = @"http://mypath.aspx";
String withoutExtension = myUrl.Replace(".aspx", "");
will create a brand-new string which is assigned to withoutExtension.

Intern string literals misunderstanding?

I don't understand:
MSDN says
http://msdn.microsoft.com/en-us/library/system.string.intern.aspx
Consequently, an instance of a literal string with a particular value
only exists once in the system.
For example, if you assign the same literal string to several
variables, the runtime retrieves the same reference to the literal
string from the intern pool and assigns it to each variable.
Is this behavior the default (without Intern), or does it only apply when using the Intern method?
If it is the default, why would I want to use Intern? (There will already be only one instance...)
If it is NOT the default: if I write this line 1000 times:
Console.WriteLine("lalala");
1) Will I get 1000 occurrences of "lalala" in memory (without using Intern)?
2) Will "lalala" eventually be GC'ed?
3) Is "lalala" already interned? And if it is, why would I need to "get" it from the pool instead of just writing "lalala" again?
I'm a bit confused.

String literals get interned automatically (so, if your code contains "lalala" 1000 times, only one instance will exist).
Such strings will not get GC'd and any time they are referenced the reference will be the interned one.
string.Intern is there for strings that are not literals - say from user input or read from a file or database and that you know will be repeated very often and as such are worth interning for the lifetime of the process.
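A minimal sketch of interning non-literal strings. The text passed to Split stands in for data read from a file or database:

```csharp
using System;

static class TokenInterning
{
    public static void Main()
    {
        // Stand-in for text read from a file or database.
        string[] tokens = "cat dog cat".Split(' ');

        // Split produces separate string objects for the two "cat" tokens.
        Console.WriteLine(object.ReferenceEquals(tokens[0], tokens[2])); // False

        // After interning, both refer to the single pooled "cat".
        string first = string.Intern(tokens[0]);
        string third = string.Intern(tokens[2]);
        Console.WriteLine(object.ReferenceEquals(first, third));         // True
    }
}
```

Once interned, a string stays in the pool for the lifetime of the process, which is why this only pays off for values that really do repeat often.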

Interning is something that happens behind the scenes, so you as a programmer never have to worry about it. You generally do not have to put anything to the pool, or get anything from the pool. Like garbage collection: you never have to invoke it, or worry that it may happen, or worry that it may not happen. (Well, in 99.999% of the cases. And the remaining 0.001 percent is when you are doing very weird stuff.)
The compiler takes care of interning all string literals that are contained within your source file, so "lalala" will be interned without you having to do anything, or having any control over the matter. And whenever you refer to "lalala" in your program, the compiler makes sure to fetch it from the intern pool, again without you having to do anything, nor having any control over the matter.
The intern pool contains a more-or-less fixed number of strings, generally of a very small size, (only a fraction of the total size of your .exe,) so it does not matter that they never get garbage-collected.
EDIT
The purpose of interning strings is to improve the execution time of certain string operations like Equals(). The Equals() method of String first checks whether the strings are equal by reference, which is extremely fast; if the references are equal, it returns true immediately. If the references are not equal, it compares the lengths and then the characters. Equals() itself cannot tell whether a string is interned, so the immediate "not equal" answer is available only when you know both strings are interned and compare the references yourself, because all strings in the intern pool are different from each other.
So, suppose that you are reading tokens from a file in string s, and you have a switch statement of the following form:
switch( s )
{
    case "cat": ....
    case "dog": ....
    case "tod": ....
}
The string literals "cat", "dog", "tod" have all been interned, but you are comparing each of them against s, which has not been interned, so you are not reaping the benefits of the intern pool. If you intern s right before the switch statement, a matching comparison can succeed on the reference check alone, without comparing characters.
Of course, if there is any possibility that your file might contain garbage, then you do NOT want to do this, because loading lots of random strings into the intern pool is sure to hurt the performance of your program, and may eventually exhaust memory.
