Intern string literals misunderstanding?

Intern string literals misunderstanding? - c#

I dont understand :
MSDN says
http://msdn.microsoft.com/en-us/library/system.string.intern.aspx
Consequently, an instance of a literal string with a particular value
only exists once in the system.
For example, if you assign the same literal string to several
variables, the runtime retrieves the same reference to the literal
string from the intern pool and assigns it to each variable.
Does this behavior is the Default (without intern ) ? or by using Intern method ?
If its default , so why will I want to use intern? (the instance will be once already...) ?
If its NOT default : if I write 1000 times this row :
Console.WriteLine("lalala");
1 ) will I get 1000 occurrences of "lalala" in memory ? ( without using intern ...)
2) will "lalala" will eventually Gc'ed ?
3) Does "lalala" is already interned ? and if it does , why will i need to "get" it from the pool , and not just write "lalala" again ?
Im a bit confuse.

String literals get interned automatically (so, if your code contains "lalala" 1000 times, only one instance will exist).
Such strings will not get GC'd and any time they are referenced the reference will be the interned one.
string.Intern is there for strings that are not literals - say from user input or read from a file or database and that you know will be repeated very often and as such are worth interning for the lifetime of the process.

Interning is something that happens behind the scenes, so you as a programmer never have to worry about it. You generally do not have to put anything to the pool, or get anything from the pool. Like garbage collection: you never have to invoke it, or worry that it may happen, or worry that it may not happen. (Well, in 99.999% of the cases. And the remaining 0.001 percent is when you are doing very weird stuff.)
The compiler takes care of interning all string literals that are contained within your source file, so "lalala" will be interned without you having to do anything, or having any control over the matter. And whenever you refer to "lalala" in your program, the compiler makes sure to fetch it from the intern pool, again without you having to do anything, nor having any control over the matter.
The intern pool contains a more-or-less fixed number of strings, generally of a very small size, (only a fraction of the total size of your .exe,) so it does not matter that they never get garbage-collected.
EDIT
The purpose of interning strings is to greatly improve the execution time of certain string operations like Equals(). The Equals() method of String first checks whether the strings are equal by reference, which is extremely fast; if the references are equal, then it returns true immediately; if the references are not equal, and the strings are both interned, then it returns false immediately, because they cannot possibly be equal, since all strings in the intern pool are different from each other. If none of the above holds true, then it proceeds with a character by character string comparison. (Actually, it is even more complicated than that, because it also checks the hashcodes of the strings, but let's keep things simple in this discussion.)
So, suppose that you are reading tokens from a file in string s, and you have a switch statement of the following form:
switch( s )
{
case "cat": ....
case "dog": ....
case "tod": ....
}
The string literals "cat", "dog", "tod" have all been interned, but you are comparing each and every one of them against s, which has not been interned, so you are not reaping the benefits of the intern pool. If you intern s right before the switch statement, then the comparisons that will be done by the switch statement will be a lot faster.
Of course, if there is any possibility that your file might contain garbage, then you do NOT want to do this, because loading lots of random strings into the intern pool is sure to kill the performance of your program, and eventually run out of memory.

Related

Does string.Replace(string, string) create additional strings?

We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of the time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string, that does not additionally create an string.

Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create new string, because, as you already know, strings are immutable.
Why?
There is no reason to a create new string on each replacement - it's very simple to avoid it, and it will give huge performance boost.
If you are very interested, referencesource.microsoft.com and SSCLI2.0 source code will demonstrate this(how-to-see-code-of-method-which-marked-as-methodimploptions-internalcall):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
// unnecessary code ommited
while (((index=COMStringBuffer::LocalIndexOfString(thisBuffer,oldBuffer,
thisLength,oldLength,index))>-1) && (index<=endIndex-oldLength))
{
replaceIndex[replaceCount++] = index;
index+=oldLength;
}
if (replaceCount != 0)
{
//Calculate the new length of the string and ensure that we have
// sufficent room.
INT64 retValBuffLength = thisLength -
((oldLength - newLength) * (INT64)replaceCount);
gc.retValString = COMString::NewString((INT32)retValBuffLength);
// unnecessary code ommited
}
}
as you can see, retValBuffLength is calculated, which knows the amount of replaceCount's. The real implementation can be a bit different for .NET 4.0(SSCLI 4.0 is not released), but I assure you it's not doing anything silly :-).
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes. Reusable StringBuilder that has capacity of ~2000 characters. Avoid any memory allocation. This is only true if the the replacement lengths are equal, and can get you a nice performance gain if you're in tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.

Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we'll follow to StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on a string objects:
Although String is immutable in .NET, this is not some kind of limitation, it's a contract. String is actually a reference type, and what it includes is the length of the actual string + the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass the string to its constructor it actually copies the string's buffer to his own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string any copies his buffer there.
So, if you need a fast and memory efficient way to make changes in a string, StringBuilder is definitely your choice. Especially regarding that Microsoft explicitly recommends to use StringBuilder if you "perform repeated modifications to a string".

I haven't found any sources but i strongly doubt that the implementation creates always new strings. I'd implement it also with a StringBuilder internally. Then String.Replace is absolutely fine if you want to replace once a huge string. But if you have to replace it many times you should consider to use StringBuilder.Replace because every call of Replace creates a new string.
So you can use StringBuilder.Replace since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()

There is no string method for that. You are own your own. But you can try something like this:
oldFormat="dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}/{2}", dt[0], dt[1], dt[2]);
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.AppendFormat("{0}/{1}", dt[1], dt[2]);

How to work properly with strings in C#?

I know there is a rule about strings in C# that says:
When we create a textual string of type string, we can never change its value! When putting different value for a string variable thje first string will stay in memory and variable (which is kind of reference type) just gets the address of the new string.
So doing something like this:
string a = "aaa";
a = a.Trim(); // Creates a new string
is not recommended.
But what if I need to do some actions on the string according to user preferences, like so:
string a = "aaa";
if (doTrim)
a = a.Trim();
if (doSubstring)
a = a.Substring(...);
etc...
How can I do it without creating new strings on every action ?
I thougt about sending the string to a function by ref, like so:
void DoTrim(ref string value)
{
value = value.Trim(); // also creates new string
}
But this also creates a new string...
Can someone please tell me if there is a way of doing it without wasteing memory on each action ?

You are correct in that the operations you're performing are creating new strings, and not mutating a single string.
You are incorrect in that this is generally problematic or something to be avoided.
If your strings are hundreds of thousands of characters, then sure, copying all of those just to remove a few leading spaces, or to add a few characters to the end of it (repeatedly, in a loop, in particular) can actually be a problem.
If your strings aren't large, and you're not performing many (an in thousands of) operations on the string, then you almost certainly don't have a problem.
Now there are a handful of contexts, generally rather rare, that do run into problems with string manipulation. Probably the most common of the problematic contexts is appending a bunch of strings together, as doing so means copying all of the previously appended data for each new addition. If you're in that situation consider using something like a StringBuilder or a single call to string.Concat (the overload accepting a sequence of strings to concat) to perform this operation.
Other contexts are, for example, programs dealing with processing DNA strands. They'll often be taking strings of millions of characters and creating hundreds of thousands of many thousand character long substrings of that string. Using standard C# string operations would therefore result in a lot of unnecessary copying. People writing such programs end up creating objects that can represent a substring of another string without copying the data and instead referring to the existing string's underlying data source with an offset.

Sticking my neck out here a bit so I'll preface with saying in most cases Servy's answer is the correct answer. However, if you really do need lower level access and less string allocations, you could consider creating a character buffer (simple array for instance) that is big enough to fit your processed string and allow you direct manipulation of the characters. There are some significant downfalls to this, though. Including that you'll probably have to write your own Substring() and Trim() modifiers, and your buffer will likely be bigger than your input strings in many cases to accommodate unexpected string sizes. Once you are done manipulating your buffer, you could then package the character array up as a String. Since all of your manipulations are done on a single buffer, you should save a lot of allocations.
I would seriously consider if the above is worth the hassle, but if you really need the performance, this is the best solution I can think of.

How can I do it without creating new strings on every action?
You should only worry about that if you're handling big strings or if you're doing many string operations in a short period of time.
Even then, the performance loss due to creating more references is minimal.
The Garbage Collector has to collect all the unused string variables, but hey - that only really matters if you're doing MANY string operations.
So rather focus on readability in your code, rather than trying to optimize its performance in the first place.
If you really have to keep the same reference of string, you can simply use a StringBuilder.

Why do you feel uncomfortable creating new strings? There is a reason for the string API to be designed this way. For example, immutable objects are thread-safe (and they allow for a more functional programming style).
If you replace your simple string code by stringbuilders, your code might be more error-prone in multithreading scenarios (which is quite normal in a web application for example).
StringBuilders are used for concatenating strings, inserting characters, removing characters, etc. But they will need to reallocate and copy their internal characters arrays every now and then, too.
When you speak about memory consumption you have started to micro-optimize your code. Don't.
BTW: Have a look at the LINQ API. What does each operation do? Rats - it creates a new enumerator! A query like foos.Where(bar).Select(baz).FirstOrDefault() could certainly be memory-optimized by just creating a single enumerator object and modifying the criteria it applies when enumerating. </irony>

It will depend on what your exact use case is, but you might want to explore using the StringBuilder class which you can use to build and modify strings.

Why doesn't interning work on copies of a string?

Given:
object literal1 = "abc";
object literal2 = "abc";
object copiedVariable = string.Copy((string)literal1);
if (literal1 == literal2)
Console.WriteLine("objects are equal because of interning");//Are equal
if(literal1 == copiedVariable)
Console.WriteLine("copy is equal");
else
Console.WriteLine("copy not eq");//NOT equal
These results imply that copiedVariable is not subject to string interning. Why?
Is there a circumstance where its useful to have equivalent strings that are not interned or is this behavior due to some language detail?

If you think about it, the interning of strings is a process that it triggered at compile time on literals. Which implies that:
it is implicit when you assign/bind a literal to a variable
it is implicit when you copy a reference (i.e. string a = some_other_string_variable;)
On the other hand, if you create an instance of a string manually - at run-time by using a StringBuilder, or by Copy-ing, than you have to specifically request to intern it by invoking the Intern method of the String class.
Even in the remarks section of the documentation it is stated that:
The common language runtime conserves string storage by maintaining a
table, called the intern pool, that contains a single reference to
each unique literal string declared or created programmatically in
your program. Consequently, an instance of a literal string with a
particular value only exists once in the system. For example, if you
assign the same literal string to several variables, the runtime
retrieves the same reference to the literal string from the intern
pool and assigns it to each variable.
And the documentation for the Copy method of the String class states that it:
Creates a new instance of String with the same value as a specified
String.
which implies that it's not going to just return a reference to the same string (from the intern pool). Again, if it did there wouldn't be much use for it then, would there?!

Some languages requires the result be a copy for certain methods/procedures.
For example in substring type methods. The semantics would then be the same, even if if you call foo.substring(0, foo.length) (and how you would probably implement stringcopy).
Note: IIRC*, this is NOT the case with .NET's implementation of string.Substring though. It is not really clear from MSDN either. (see below)
It returns:
A string that is equivalent to the substring of length length that
begins at startIndex in this instance, or Empty if startIndex is equal
to the length of this instance and length is zero.
It notes:
This method does not modify the value of the current instance.
Instead, it returns a new string with length characters starting from
the startIndex position in the current string.
UPDATE
I remember correctly, it does indeed do a check with string InternalSubString(int startIndex, int length, bool fAlwaysCopy) if fAlwaysCopy is not false. Substring passes false to this method.
UPDATE 2
It looks like string.Copy could have used InternalSubString and passing true to the aforementioned parameter, but looking at the disassembly, it seems to use a slightly more optimized version and possibly save a method call.
Sorry for the redundant information.
* The reason I remember was when implementing the substring procedure for IronScheme, which the R6RS specification requires to make a copy :)

Memory-wise, is it better to store a long non-dynamic string as a single string object or to have the program build it out of it's repetitive parts?

This is a bit of an odd question and more of a though experiment that anything I need, but I'm still curious about the answer: If I have a string that I know ahead of time will never change but is (mostly) made up of repetitive parts, would it be better to have said string as just a single string object, get called when needed, and be done with it - or should I break the string up into smaller strings that represent the repeated parts and concatenate them when needed?
Let me use an example: Let's say we have a naive programmer who wants to create a regular expression for validating IP Addresses (in other words, I know this regular expression won't work as intended, but it helps show what I mean by repetitive parts and saves me a bit of typing for the second part of the example). So he writes this function:
private bool isValidIP(string ip)
{
Regex checkIP = new Regex("\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
return checkIP.IsMatch(ip);
}
Now our young programmer notices that he has "\d", "\d?", and "\." just repeated a few times. This gives him an idea that he could both save some storage space and help remind himself what this means for later. So he remakes the function:
private bool isValidIP(string ip)
{
string escape = "\\";
string digi = "d";
string digit = escape + digi;
string possibleDigit = digit + '?';
string IpByte = digit + possibleDigit + possibleDigit;
string period = escape + '.';
Regex checkIP = new Regex(IpByte + period + IpByte + period + IpByte + period + IpByte);
return checkIP.IsMatch(ip);
}
The first method is simple. It just stores 38 chars in the program's instructions, which are just read into memory each time the function is called.
The second method stores (I suspect) two 1 length strings and two chars into the program's instructions as well as all of the calls to concatenate those four into different orders. This creates at least 8 strings in memory when the program is called (the six named strings, a temporary string for the first four parts of the regex, and then the final string created from the previous string + the three strings of the regex). This second method also happens to help explain what the regex is looking for - though not what the final regex would look like. It could also help with refactoring, say if our hypothetical programmer realizes that his current regex will allow for more than just 0-255 in the IP Address, and the constitute parts can be changed without having to find every single item that would need to be fixed.
Again, which method would be better? Would it just be as simple as a trade-off between program size vs. memory usage? Of course, with something as simple as this, the trade-off is negligible at best, but what about a much larger, more complex string?
Oh, yes, and a much better regex for IP Addresses would be:
^(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)(\\.(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)){3}$
Wouldn't work as well as an example, would it?

The first is by far the better option. Here's why
It's clearer.
It's cheaper. Any time you declare a new object it's an "expensive" process. You have to make space for it on the heap (well for strings at least). Yes, you could in theory be saving a byte or so, but your spending a lot more time (probably, I haven't tested it) going through and allocating space for each string, additional memory instructions etc. Not to mention the fact that remember, you also have to factor in the use of the GC. You keep allocating strings and eventually you are going to have to contend with it taking up process ticks also. You really want to hit on optimization, I can easily tell this code isn't as efficient as it could be. There are no constants for one thing, which means that you are possibly creating more objects than you need instead of letting the compiler optimize for strings that don't need to change. This leads me to think, that as a person reviewing this code, I need to take a much closer look at what is going to see what is going on and figure out if something is wrong.
It's clearer (yes, I said this again). You want to do an academic pursuit to see how efficient you can make it. That's cool. I get that. I do it myself. It's fun. I NEVER let that slip into production code. I don't care about losing a tick, I care about having a bug in production, and I care about if other programmers can understand what my code does. Reading someone else's code is hard enough, I don't want to add the extra task of them having to try and figure out which micro-optimization I put in and what happens if they "nudge" the wrong piece of code.
You hit on another point. What if the original regex is wrong. Google will tell you this problem has been solved. You can Google another regex that's right and has been tested. You can't Google "What's wrong with my code." You can post it on SO sure, but that means that someone else has to get involved and look through it.
Here's how to make the first example win the horse race easily:
Regex checkIP = new Regex(
"\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
private bool isValidIP(string ip)
{
return checkIP.IsMatch(ip);
}
Declare once, reuse over and over. If you are taking the time to recreate the regex dynamically to save a few, don't get to do that. Technically you could do that and still only create the object once, but that is a lot more work than say, moving it to a class level variable.

You're effectively attempting to game the compiler here and implement your own string compression. For the kinds of string literals you're describing, it seems like your savings will be mere tens of bytes shaved off of the compiled binary, which due to memory alignment may not even be realized. In exchange for these few bytes of saved space, this approach adds code complexity and runtime overhead, not to mention difficulty in debugging.
Storage is cheap. Why make your life (and the lives of your coworkers) harder? Keep your code simple, clear, and evident - you'll thank yourself later.

The second is worse off in memory consumption, as every time you concatenate two strings you've got three in memory.
Although the compiler started handling some instances of string constants by creating a StringBuilder for you, I'd still vote for the first one being less memory intensive, because if the system does create the StringBuilder for you, you are going to have the overhead for that, and if it doesn't see the first paragraph...
I am now curious how compiling the RegEx would effect the memory usage.

Savings here are illusionary and splitting this string up is a big overshot. Saving insignificant amount of memory and complicating so simple code is just pointless. You will not see any savings but next person to maintain that code will spend 10x more time understanding it.
Strings are immutable so if your string never/rarely changes keep it in one piece. Intense string concatenations give garbage collector additional strain.
Unless your strings and sub-strings are big and you could save at least kilobytes, do not spend your time and effort on such optimizations.

Why are strings notoriously expensive

What is it about the way strings are implemented that makes them so expensive to manipulate?
Is it impossible to make a "cheap" string implementation?
or am I completely wrong in my understanding?
Thanks

Which language?
Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.
This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.
If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.
If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.

The problem with strings is that they are not primitive types. They are arrays.
Therefore, they suffer the same speed and memory problems as arrays(with a few optimizations, perhaps).
Now, "cheap" implementations would require lots of stuff: concatenation, indexOf, etc.
There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.

Since it creates new copy of the object every time in java its advisable to use StringBuffer
Syntax
StringBuffer strBuff=new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String string=strBuff.tostring();

Many of the points here are well taken. In isolated cases you may be able to cheat and do thing like using a 64bit int to compare 8 bytes at time in a string, but there are not a lot of generalized cases where you can optimize operations. If you have "pascal style" string with a numeric length field compares can be short circuited logic to only check the rest of the string if the length is not the same. Other operations typically require you to handle the characters a byte at time or completely copy them when you use them.
i.e. concatenation => get length of string 1, get length of string 2, allocated memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string libary, but the overhead of setting it up for small strings would outweigh the benefits.
Pete

It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.

Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:
Concatenation.
Slicing.
Getting a character at an index.
Changing a character at an index.
Locating the index of a character.
Traversing the string.
Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.

If you want a universal string working in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between getting one thing fast and another. So... either you use a "standard" string working properly (but not in an optimal way), or a string implementation which is very fast in some cases and cumbersome in other.
Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...

Changes and copying of strings tends to involve memory management.
Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.

You want to read this Joel Spolsky article:
http://www.joelonsoftware.com/articles/fog0000000319.html
Me, I'm disappointed .NET doesn't have a native type called F***edString.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.