I have a String which I would like to modify in some way. For example: reverse it or upcase it.
I have discovered that the fastest way to do this is by using a unsafe block and pointers.
For example:
unsafe
{
fixed (char* str = text)
{
*str = 'X';
}
}
Are there any reasons why I should never ever do this?
The .Net framework requires strings to be immutable. Due to this requirement it is able to optimise all sorts of operations.
String interning is one great example of this requirement is leveraged heavily. To speed up some string comparisons (and reduce memory consumption) the .Net framework maintains a Dictionary of pointers, all pre-defined strings will live in this dictionary or any strings where you call the String.intern method on. When the IL instruction ldstr is called it will check the interned dictionary and avoid memory allocation if we already have the string allocated, note: String.Concat will not check for interned strings.
This property of the .net framework means that if you start mucking around directly with strings you can corrupt your intern table and in turn corrupt other references to the same string.
For example:
// these strings get interned
string hello = "hello";
string hello2 = "hello";
string helloworld, helloworld2;
helloworld = hello;
helloworld += " world";
helloworld2 = hello;
helloworld2 += " world";
unsafe
{
// very bad, this changes an interned string which affects
// all app domains.
fixed (char* str = hello2)
{
*str = 'X';
}
fixed (char* str = helloworld2)
{
*str = 'X';
}
}
Console.WriteLine("hello = {0} , hello2 = {1}", hello, hello2);
// output: hello = Xello , hello2 = Xello
Console.WriteLine("helloworld = {0} , helloworld2 = {1}", helloworld, helloworld2);
// output : helloworld = hello world , helloworld2 = Xello world
Are there any reasons why I should never ever do this?
Yes, very simple: Because .NET relies on the fact that strings are immutable. Some operations (e.g. s.SubString(0, s.Length)) actually return a reference to the original string. If this now gets modified, all other references will as well.
Better use a StringBuilder to modify a string since this is the default way.
Put it this way: how would you feel if another programmer decided to replace 0 with 1 everywhere in your code, at execution time? It would play hell with all your assumptions. The same is true with strings. Everyone expects them to be immutable, and codes with that assumption in mind. If you violate that, you are likely to introduce bugs - and they'll be really hard to trace.
Oh dear lord yes.
1) Because that class is not designed to be tampered with.
2) Because strings are designed and expected throughout the framework to be immutable. That means that code that everyone else writes (including MSFT) is expecting a string's underlying value never to change.
3) Because this is premature optimization and that is E V I L.
Agreed about StringBuilder, or just convert your string to an array of chars/bytes and work there. Also, you gave the example of "upcasing" -- the String class has a ToUpper method, and if that's not at least as fast as your unsafe "upcasing", I'll eat my hat.
Related
I want to know the process and internals of string interning specific to .Net framework. Would also like to know the benefits of using interning and the scenarios/situations where we should use string interning to improve the performance. Though I have studied interning from the Jeffery Richter's CLR book but I am still confused and would like to know it in more detail.
[Editing] to ask a specific question with a sample code as below:
private void MethodA()
{
string s = "String"; // line 1 - interned literal as explained in the answer
//s.intern(); // line 2 - what would happen in line 3 if we uncomment this line, will it make any difference?
}
private bool MethodB(string compareThis)
{
if (compareThis == "String") // line 3 - will this line use interning (with and without uncommenting line 2 above)?
{
return true;
}
return false;
}
In general, interning is something that just happens, automatically, when you use literal string values. Interning provides the benefit of only having one copy of the literal in memory, no matter how often it's used.
That being said, it's rare that there is a reason to intern your own strings that are generated at runtime, or ever even think about string interning for normal development.
There are potentially some benefits if you're going to be doing a lot of work with comparisons of potentially identical runtime generated strings (as interning can speed up comparisons via ReferenceEquals). However, this is a highly specialized usage, and would require a fair amount of profiling and testing, and wouldn't be an optimization I'd consider unless there was a measured problem in place.
This is an "old" question, but I have a different angle on it.
If you're going to have a lot of long-lived strings from a small pool, interning can improve memory efficiency.
In my case, I was interning another type of object in a static dictionary because they were reused frequently, and this served as a fast cache before persisting them to disk.
Most of the fields in these objects are strings, and the pool of values is fairly small (much smaller than the number of instances, anyway).
If these were transient objects, it wouldn't matter because the string fields would be garbage collected often. But because references to them were being held, their memory usage started to accumulate (even when no new unique values were being added).
So interning the objects reduced the memory usage substantially, and so did interning their string values while they were being interned.
Interning is an internal implementation detail. Unlike boxing, I do not think there is any benefit in knowing more than what you have read in Richter's book.
Micro-optimisation benefits of interning strings manually are minimal hence is generally not recommended.
This probably describes it:
class Program
{
const string SomeString = "Some String"; // gets interned
static void Main(string[] args)
{
var s1 = SomeString; // use interned string
var s2 = SomeString; // use interned string
var s = "String";
var s3 = "Some " + s; // no interning
Console.WriteLine(s1 == s2); // uses interning comparison
Console.WriteLine(s1 == s3); // do NOT use interning comparison
}
}
Interned strings have the following characteristics:
Two interned strings that are identical will have the same address in memory.
Memory occupied by interned strings is not freed until your application terminates.
Interning a string involves calculating a hash and looking it up in a dictionary which consumes CPU cycles.
If multiple threads intern strings at the same time they will block each other because accesses to the dictionary of interned strings are serialized.
The consequences of these characteristics are:
You can test two interned strings for equality by just comparing the address pointer which is a lot faster than comparing each character in the string. This is especially true if the strings are very long and start with the same characters. You can compare interned strings with the Object.ReferenceEquals method, but it is safer to use the string == operator because it checks to see if the strings are interned first.
If you use the same string many times in your application, your application will only store one copy of the string in memory reducing the memory required to run your application.
If you intern many different strings this will allocate memory for those strings that will never be freed, and your application will consume ever increasing amounts of memory.
If you have a very large number of interned strings, string interning can become slow, and threads will block each other when accessing the interned string dictionary.
You should use string interning only if:
The set of strings you are interning is fairly small.
You compare these strings many times for each time that you intern them.
You really care about minute performance optimizations.
You don't have many threads aggressively interning strings.
Internalization of strings affects memory consumption.
For example if you read strings and keep them it in a list for caching; and the exact same string occurs 10 times, the string is actually stored only once in memory if string.Intern is used. If not, the string is stored 10 times.
In the example below, the string.Intern variant consumes about 44 MB and the without-version (uncommented) consumes 1195 MB.
static void Main(string[] args)
{
var list = new List<string>();
for (int i = 0; i < 5 * 1000 * 1000; i++)
{
var s = ReadFromDb();
list.Add(string.Intern(s));
//list.Add(s);
}
Console.WriteLine(Process.GetCurrentProcess().PrivateMemorySize64 / 1024 / 1024 + " MB");
}
private static string ReadFromDb()
{
return "abcdefghijklmnopqrstuvyxz0123456789abcdefghijklmnopqrstuvyxz0123456789abcdefghijklmnopqrstuvyxz0123456789" + 1;
}
Internalization also improves performance for equals-compare. The example below the intern version takes about 1 time units while the non-intern takes 7 time units.
static void Main(string[] args)
{
var a = string.Intern(ReadFromDb());
var b = string.Intern(ReadFromDb());
//var a = ReadFromDb();
//var b = ReadFromDb();
int equals = 0;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 250 * 1000 * 1000; i++)
{
if (a == b) equals++;
}
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed + ", equals: " + equals);
}
I was experimenting with C++ to observe the effects of variables' scope-bounded-declarations and usage in loops on the running time of the program, as follows:
for(int i=0; i<10000000 ; ++i){
string s = "HELLO THERE!";
}
and
string s;
for(int i=0; i<10000000 ; ++i){
s = "HELLO THERE!";
}
The first program run in ~1 second while the second one run in ~250 milliseconds, as expected. Trying built in types wouldn't cause a significant difference, so I stick with strings in both languages.
I was discussing this with a friend of mine and he said this wouldn't happen in C#. We tried and observed ourselves that this did not happen in C# as it turned out, scope-bounded declarations of strings won't affect running time of the program.
Why is this difference? Is that a bad optimization in C++ strings (I strongly doubt that tho) or something else?
Strings in C# are immutable, so the assignment can just copy over a reference. In C++, however, strings are mutable, so the entire contents of the string need to be copied over.
If you want to verify this hypothesis, try with a (significantly) longer string constant. Runtime in C++ should go up, but runtime in C# should remain the same.
Strings in C# are immutable.
C# uses references and the memory it's not copied!
in C# "HELLO THERE!" will be automatically assigned to a piece of memory and won't be copied each time
for example:
string a = "HELLO";
string b = a;
they are pointing to the same piece of memory, but in C++ no! the string will be the same but not in the same place, if you want to obtain the same reult you should use pointers (or smart pointers)
string *a = new string("hello");
string *b = a;
When developing in Java a couple of years ago I learned that it is better to append a char if I had a single character instead of a string with one character because the VM would not have to do any lookup on the string value in its internal string pool.
string stringappend = "Hello " + name + ".";
string charappend = "Hello " + name + '.'; // better?
When I started programming in C# I never thought of the chance that it would be the same with its "VM". I came across C# String Theory—String intern pool that states that C# also has an internal string pool (I guess it would be weird if it didn't) so my question is,
are there actually any benefits in appending a char instead of a string when concatenating to a string regarding C# or is it just jibberish?
Edit: Please disregard StringBuilder and string.Format, I am more interested in why I would replace "." with '.' in code. I am well aware of those classes and functions.
If given a choice, I would pass a string rather than a char when calling System.String.Concat or the (equivalent) + operator.
The only overloads that I see for System.String.Concat all take either strings or objects. Since a char isn't a string, the object version would be chosen. This would cause the char to be boxed. After Concat verifies that the object reference isn't null, it would then call object.ToString on the char. It would then generate the dreaded single-character string that was being avoided in the first place, before creating the new concatinated string.
So I don't see how passing a char is going to gain anything.
Maybe someone wants to look at the Concat operation in Reflector to see if there is special handling for char?
UPDATE
As I thought, this test confirms that char is slightly slower.
using System;
using System.Diagnostics;
namespace ConsoleApplication19
{
class Program
{
static void Main(string[] args)
{
TimeSpan throwAwayString = StringTest(100);
TimeSpan throwAwayChar = CharTest(100);
TimeSpan realStringTime = StringTest(10000000);
TimeSpan realCharTime = CharTest(10000000);
Console.WriteLine("string time: {0}", realStringTime);
Console.WriteLine("char time: {0}", realCharTime);
Console.ReadLine();
}
private static TimeSpan StringTest(int attemptCount)
{
Stopwatch sw = new Stopwatch();
string concatResult = string.Empty;
sw.Start();
for (int counter = 0; counter < attemptCount; counter++)
concatResult = counter.ToString() + ".";
sw.Stop();
return sw.Elapsed;
}
private static TimeSpan CharTest(int attemptCount)
{
Stopwatch sw = new Stopwatch();
string concatResult = string.Empty;
sw.Start();
for (int counter = 0; counter < attemptCount; counter++)
concatResult = counter.ToString() + '.';
sw.Stop();
return sw.Elapsed;
}
}
}
Results:
string time: 00:00:02.1878399
char time: 00:00:02.6671247
When developing in Java a couple of years ago I learned that it is better to append a char if I had a single character instead of a string with one character because the VM would not have to do any lookup on the string value in its internal string pool.
Appending a char to a String is likely to be slightly faster than appending a 1 character String because:
the append(char) operation doesn't have to load the string length,
it doesn't have to load the reference to the string characters array,
it doesn't have to load and add the string's start offset,
it doesn't have to do a bounds check on the array index, and
it doesn't have to increment and test a loop variable.
Take a look at the Java source code for String and related classes. You might be surprised what goes on under the hood.
The intern pool has nothing to do with it. The interning of string literals happens just once during class loading. Interning of non-literal strings occurs only if the application explicitly calls String.intern().
This may be interesting:
http://www.codeproject.com/KB/cs/StringBuilder_vs_String.aspx
Stringbuilder are not necessarily faster than Strings, it, as said before, depends. It depends on machine configuration, available memory vs processor power, framework version and machine config. Your profiler is your best buddy in this case :)
Back 2 Topic:
You should just TRY which is faster. Do that concatenation a bazillion times and let your profiler watch. You will see possible differences.
All string concatenation in .NET (with the standard operators i.e. +) requires the runtime to reserve enough memory for a complete new string to hold the results of the concatenation. This is due to the string type being immutable.
If you are performing string concatenation many times over (i.e. within a loop etc.) you will suffer performance issues (and eventually memory issues if the string is sufficiently large) as the .NET runtime needs to continually allocate and deallocate memory space to hold each new string.
It's probably for this reason that you're thinking (correctly) that excessive string concatenation can be problematic. It has very little (if anything) to do with concatenating a char rather than a string type.
The alternative to this is to use the StringBuilder class within the System.Text namespace. This class represents a mutable string-like object that can be used to concatenate strings without much of the resulting performance issues. This is because the StringBuilder class will reserve a specific amount of memory for a string, and will allow concatenations to be appended to the end of the reserved memory amount without requiring a complete new copy of the entire string.
EDIT:
With regard to the specifics of string lookups versus char lookups, I whipped up this little test:
class Program
{
static void Main(string[] args)
{
string stringtotal = "";
string chartotal = "";
Stopwatch stringconcat = new Stopwatch();
Stopwatch charconcat = new Stopwatch();
stringconcat.Start();
for (int i = 0; i < 100000; i++)
{
stringtotal += ".";
}
stringconcat.Stop();
charconcat.Start();
for (int i = 0; i < 100000; i++)
{
chartotal += '.';
}
charconcat.Stop();
Console.WriteLine("String: " + stringconcat.Elapsed.ToString());
Console.WriteLine("Char : " + charconcat.Elapsed.ToString());
Console.ReadLine();
}
}
It merely times (using the high-performance StopWatch class) how long it takes to concatenate 100000 dots/periods (.) of type string vs. 100000 dots/periods of type char.
I ran this test a few times over to prevent the results being skewed from one specific run, however, each time the results were similar to as follows:
String: 00:00:06.4606331
Char : 00:00:06.4528073
Therefore, in the context of multiple concatenations, I'd say that there's very little difference (in all likelihood, no difference when taking standard test run tolerances into account) between the two!
I agree with what everyone is saying about using StringBuilder if you are doing lots of string concatenation because String is an immutable type, but don't forget there's an overhead with creating the StringBuilder class too so you'll have to make a choice when to use which.
In one of Bill Wagner's Effect C# books (or might be in all 3 of them..), he touched on this too. Broadly speaking, if all you need is to add a few string fragments together, string.Format is better but if you need to build up a large string value in a potentially large loop, use the StringBuilder.
Every time when you concatenate strings using + operator, runtime creates a new string, and for avoiding that, recommended practice is usage of StringBuilder class, which has Append method. You can also use AppendLine and AppendFormat.
If you do not want to use StringBuilder, then you can use string.Format:
string str = string.Format("Hello {0}.", name);
Since strings are immutable types both would require creating a new instance of a string before the value is returned back to you.
I would consider string.Concat(...) for a small number of concatenations or use the StringBuilder class for many string concatenations.
I can't speak to C#, but in Java, the main advantage is not the compile-time gain but the run-time gain.
Yes, if you use a String, than at compile time Java will have to look the String up in its internal pool and possibly create a new String object. But this just happens once, at compile-time, when you create the .class files. The user will never see this.
What the user will see is that at run-time, if you give a character the program just has to retrieve the character. Done. If you give a String, it must first retrieve the String object handle. Then it must set up a loop to go through all the characters, retrieve the one character, observe that there are no more characters, and stop. I haven't looked at the generated byte-code but it's clearly seveal times as much work.
My question is this: Is string concatenation in C# safe? If string concatenation leads to unexpected errors, and replacing that string concatenation by using StringBuilder causes those errors to disappear, what might that indicate?
Background: I am developing a small command line C# application. It takes command line arguments, performs a slightly complicated SQL query, and outputs about 1300 rows of data into a formatted XML file.
My initial program would always run fine in debug mode. However, in release mode it would get to about the 750th SQL result, and then die with an error. The error was that a certain column of data could not be read, even through the Read() method of the SqlDataReader object had just returned true.
This problem was fixed by using StringBuilder for all operations in the code, where previously there had been "string1 + string2". I'm not talking about string concatenation inside the SQL query loop, where StringBuilder was already in use. I'm talking about simple concatenations between two or three short string variables earlier in the code.
I had the impression that C# was smart enough to handle the memory management for adding a few strings together. Am I wrong? Or does this indicate some other sort of code problem?
To answer your question:
String contatenation in C# (and .NET in general) is "safe", but doing it in a tight loop as you describe is likely to cause severe memory pressure and put strain on the garbage collector.
I would hazard a guess that the errors you speak of were related to resource exhaustion of some sort, but it would be helpful if you could provide more detail — for example, did you receive an exception? Did the application terminate abnormally?
Background:
.NET strings are immutable, so when you do a concatenation like this:
var stringList = new List<string> {"aaa", "bbb", "ccc", "ddd", //... };
string result = String.Empty;
foreach (var s in stringList)
{
result = result + s;
}
This is roughly equivalent to the following:
string result = "";
result = "aaa"
string temp1 = result + "bbb";
result = temp1;
string temp2 = temp1 + "ccc";
result = temp2;
string temp3 = temp2 + "ddd";
result = temp3;
// ...
result = tempN + x;
The purpose of this example is to emphasise that each time around the loop results in the allocation of a new temporary string.
Since the strings are immutable, the runtime has no alternative options but to allocate a new string each time you add another string to the end of your result.
Although the result string is constantly updated to point to the latest and greatest intermediate result, you are producing a lot of these un-named temporary string that become eligible for garbage collection almost immediately.
At the end of this concatenation you will have the following strings stored in memory (assuming, for simplicity, that the garbage collector has not yet run).
string a = "aaa";
string b = "bbb";
string c = "ccc";
// ...
string temp1 = "aaabbb";
string temp2 = "aaabbbccc";
string temp3 = "aaabbbcccddd";
string temp4 = "aaabbbcccdddeee";
string temp5 = "aaabbbcccdddeeefff";
string temp6 = "aaabbbcccdddeeefffggg";
// ...
Although all of these implicit temporary variables are eligible for garbage collection almost immediately, they still have to be allocated. When performing concatenation in a tight loop this is going to put a lot of strain on the garbage collector and, if nothing else, will make your code run very slowly. I have seen the performance impact of this first hand, and it becomes truly dramatic as your concatenated string becomes larger.
The recommended approach is to always use a StringBuilder if you are doing more than a few string concatenations. StringBuilder uses a mutable buffer to reduce the number of allocations that are necessary in building up your string.
String concatenation is safe though more memory intensive than using a StringBuilder if contatenating large numbers of strings in a loop. And in extreme cases you could be running out of memory.
It's almost certainly a bug in your code.
Maybe you're contatenating a very large number of strings. Or maybe it's something else completely different.
I'd go back to debugging without any preconceptions of the root cause - if you're still having problems try to reduce it to the minimum needed to repro the problem and post code.
Apart from what you're doing is probably best done with XML APIs instead of strings or StringBuilder I doubt that the error you see is due to string concatenation. Maybe switching to StringBuilder just masked the error or went over it gracefully, but I doubt using strings really was the cause.
How long would it take the concatenation version vs the string builder version? It's possible that your connection to the DB is being closed. If you are doing a lot of concatenation, i would go w/ StringBuilder as it is a bit more efficient.
One cause may be that strings are immutable in .Net so when you do an operation on one such as concatenation you are actually creating a new string.
Another possible cause is that string length is an int so the maximum possible length is Int32.MaxValue or 2,147,483,647.
In either case a StringBuilder is better than "string1 + string2" for this type of operation. Although, using the built-in XML capabilities would be even better.
string.Concat(string[]) is by far the fastest way to concatenate strings. It litterly kills StringBuilder in performance when used in loops, especially if you create the StringBuilder in each iteration.
There are loads of references if you Google "c# string format vs stringbuilder" or something like that.
http://www.codeproject.com/KB/cs/StringBuilder_vs_String.aspx gives you an ideer about the times. Here string.Join wins the concatenation test but I belive this is because the string.Concat(string, string) is used instead of the overloaded version that takes an array.
If you take a look at the MSIL code that is generated by the different methods you'll see what going on beneath the hood.
Here is my shot in the dark...
Strings in .NET (not stringbuilders) go into the String Intern Pool. This is basically an area managed by the CLR to share strings to improve performance. There has to be some limit here, although I have no idea what that limit is. I imagine all the concatenation you are doing is hitting the ceiling of the string intern pool. So SQL says yes I have a value for you, but it can't put it anywhere so you get an exception.
A quick and easy test would be to nGen your assembly and see if you still get the error. After nGen'ing, you application no longer will use the pool.
If that fails, I'd contact Microsoft to try and get some hard details. I think my idea sounds plausible, but I have no idea why it works in debug mode. Perhaps in debug mode strings aren't interned. I am also no expert.
When compounding strings together I always use StringBuilder. It's designed for it and is more efficient that simply using "string1 + string2".
Does String.ToLower() return the same reference (e.g. without allocating any new memory) if all the characters are already lower-case?
Memory allocation is cheap, but running a quick check on zillions of short strings is even cheaper. Most of the time the input I'm working with is already lower-case, but I want to make it that way if it isn't.
I'm working with C# / .NET in particular, but my curiosity extends to other languages so feel free to answer for your favorite one!
NOTE: Strings are immutable but that does not mean a function always has to return a new one, rather it means nothing can change their character content.
I expect so, yes. A quick test agrees (but this is not evidence):
string a = "abc", b = a.ToLower();
bool areSame = ReferenceEquals(a, b); // false
In general, try to work with comparers that do what you want. For example, if you want a case-insensitive dictionary, use one:
var lookup = new Dictionary<string, int>(
StringComparer.InvariantCultureIgnoreCase);
Likewise:
bool ciEqual = string.Equals("abc", "ABC",
StringComparison.InvariantCultureIgnoreCase);
String is an immutable. String.ToLower() will always return new instance thereby generating another instance on every ToLower() call.
Java implementation of String.toLowerCase() from Sun actually doesn't always allocate new String. It checks if all chars are lowercase, and if so, it returns original string.
[edit]
Interning doesn't help -- see the comments to this answer.
If you use the following code it will not allocate new memory and it will overwrite the original string (this may or may not be what you want). It expects an ascii string. Expect weird things to occur if you call this on strings returned from functions you do not control.
public static unsafe void UnsafeToLower(string asciiString)
{
fixed (char* pstr = asciiString)
{
for(char* p = pstr; *p != 0; ++p)
*p = (*p > 0x40) && (*p < 0x5b) ? (char)(*p | 0x60) : (*p);
}
}
It takes about 25% as long as ToLowerInvariant and avoids memory allocation.
I would only use something like this if you are doing say 100,000 or more strings regularly inside a tight loop.