My question is this: Is string concatenation in C# safe? If string concatenation leads to unexpected errors, and replacing that string concatenation by using StringBuilder causes those errors to disappear, what might that indicate?
Background: I am developing a small command line C# application. It takes command line arguments, performs a slightly complicated SQL query, and outputs about 1300 rows of data into a formatted XML file.
My initial program would always run fine in debug mode. However, in release mode it would get to about the 750th SQL result, and then die with an error. The error was that a certain column of data could not be read, even though the Read() method of the SqlDataReader object had just returned true.
This problem was fixed by using StringBuilder for all operations in the code, where previously there had been "string1 + string2". I'm not talking about string concatenation inside the SQL query loop, where StringBuilder was already in use. I'm talking about simple concatenations between two or three short string variables earlier in the code.
I had the impression that C# was smart enough to handle the memory management for adding a few strings together. Am I wrong? Or does this indicate some other sort of code problem?
To answer your question:
String concatenation in C# (and .NET in general) is "safe", but doing it in a tight loop as you describe is likely to cause severe memory pressure and put strain on the garbage collector.
I would hazard a guess that the errors you speak of were related to resource exhaustion of some sort, but it would be helpful if you could provide more detail — for example, did you receive an exception? Did the application terminate abnormally?
Background:
.NET strings are immutable, so when you do a concatenation like this:
var stringList = new List<string> {"aaa", "bbb", "ccc", "ddd" /* ... */ };
string result = String.Empty;
foreach (var s in stringList)
{
result = result + s;
}
This is roughly equivalent to the following:
string result = "";
result = "aaa";
string temp1 = result + "bbb";
result = temp1;
string temp2 = temp1 + "ccc";
result = temp2;
string temp3 = temp2 + "ddd";
result = temp3;
// ...
result = tempN + x;
The purpose of this example is to emphasise that each time around the loop results in the allocation of a new temporary string.
Since the strings are immutable, the runtime has no alternative options but to allocate a new string each time you add another string to the end of your result.
Although the result string is constantly updated to point to the latest and greatest intermediate result, you are producing a lot of these unnamed temporary strings that become eligible for garbage collection almost immediately.
At the end of this concatenation you will have the following strings stored in memory (assuming, for simplicity, that the garbage collector has not yet run).
string a = "aaa";
string b = "bbb";
string c = "ccc";
// ...
string temp1 = "aaabbb";
string temp2 = "aaabbbccc";
string temp3 = "aaabbbcccddd";
string temp4 = "aaabbbcccdddeee";
string temp5 = "aaabbbcccdddeeefff";
string temp6 = "aaabbbcccdddeeefffggg";
// ...
Although all of these implicit temporary variables are eligible for garbage collection almost immediately, they still have to be allocated. When performing concatenation in a tight loop this is going to put a lot of strain on the garbage collector and, if nothing else, will make your code run very slowly. I have seen the performance impact of this first hand, and it becomes truly dramatic as your concatenated string becomes larger.
The recommended approach is to always use a StringBuilder if you are doing more than a few string concatenations. StringBuilder uses a mutable buffer to reduce the number of allocations that are necessary in building up your string.
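As a minimal sketch of that recommendation, the loop from the example above could be rewritten around a single StringBuilder. The class and method names here (ConcatDemo, JoinWithBuilder) are made up for illustration and do not come from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ConcatDemo
{
    // Appends every part to one mutable buffer instead of
    // allocating a new intermediate string on each iteration.
    public static string JoinWithBuilder(IEnumerable<string> parts)
    {
        var sb = new StringBuilder();
        foreach (var s in parts)
        {
            sb.Append(s);
        }
        // The final immutable string is produced once, at the end.
        return sb.ToString();
    }

    public static void Main()
    {
        var stringList = new List<string> { "aaa", "bbb", "ccc", "ddd" };
        Console.WriteLine(JoinWithBuilder(stringList)); // aaabbbcccddd
    }
}
```

For a handful of short strings the difference is unlikely to be measurable; the builder pays off as the number of appends and the size of the result grow.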
String concatenation is safe, though it is more memory-intensive than using a StringBuilder when concatenating large numbers of strings in a loop. In extreme cases you could run out of memory.
It's almost certainly a bug in your code.
Maybe you're concatenating a very large number of strings. Or maybe it's something completely different.
I'd go back to debugging without any preconceptions of the root cause - if you're still having problems try to reduce it to the minimum needed to repro the problem and post code.
Apart from the fact that what you're doing is probably best done with XML APIs instead of strings or StringBuilder, I doubt that the error you see is due to string concatenation. Maybe switching to StringBuilder just masked the error or got past it gracefully, but I doubt that using strings really was the cause.
How long does the concatenation version take compared with the StringBuilder version? It's possible that your connection to the DB is being closed. If you are doing a lot of concatenation, I would go with StringBuilder, as it is a bit more efficient.
One cause may be that strings are immutable in .Net so when you do an operation on one such as concatenation you are actually creating a new string.
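Another possible cause is that string length is an int, so the theoretical maximum length is Int32.MaxValue, or 2,147,483,647 characters. In practice the limit is lower, because a single .NET object cannot exceed roughly 2 GB.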
In either case a StringBuilder is better than "string1 + string2" for this type of operation. Although, using the built-in XML capabilities would be even better.
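To illustrate the "built-in XML capabilities" mentioned above, here is a small sketch that uses XmlWriter rather than hand-assembled strings. The element and attribute names ("rows", "row", "id") and the shape of the input are assumptions made up for this example; the original question does not show its output format.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlOutputDemo
{
    // Each row is a hypothetical { id, text } pair.
    public static string WriteRows(string[][] rows)
    {
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        using (var text = new StringWriter())
        {
            using (var writer = XmlWriter.Create(text, settings))
            {
                writer.WriteStartElement("rows");
                foreach (var row in rows)
                {
                    writer.WriteStartElement("row");
                    writer.WriteAttributeString("id", row[0]);
                    // WriteString escapes characters such as '<' and '&'.
                    writer.WriteString(row[1]);
                    writer.WriteEndElement();
                }
                writer.WriteEndElement();
            } // disposing the writer flushes it into the StringWriter
            return text.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(WriteRows(new[] { new[] { "1", "a < b" } }));
    }
}
```

Besides avoiding manual concatenation, the writer takes care of escaping, which is easy to get wrong when XML is built from raw strings.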
string.Concat(string[]) is by far the fastest way to concatenate strings. It literally kills StringBuilder in performance when used in loops, especially if you create the StringBuilder in each iteration.
There are loads of references if you Google "c# string format vs stringbuilder" or something like that.
http://www.codeproject.com/KB/cs/StringBuilder_vs_String.aspx gives you an idea of the timings. There string.Join wins the concatenation test, but I believe this is because string.Concat(string, string) is used instead of the overload that takes an array.
If you take a look at the MSIL code that is generated by the different methods, you'll see what is going on beneath the hood.
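For reference, this is roughly what the array-based calls look like. It is only a usage sketch with made-up names (JoinDemo, ConcatAll, JoinAll) and makes no claim about relative speed; that would need to be measured for the workload in question.

```csharp
using System;

public static class JoinDemo
{
    // string.Concat(string[]) builds the result in one call.
    public static string ConcatAll(string[] parts)
    {
        return string.Concat(parts);
    }

    // string.Join does the same but inserts a separator between items.
    public static string JoinAll(string separator, string[] parts)
    {
        return string.Join(separator, parts);
    }

    public static void Main()
    {
        string[] parts = { "aaa", "bbb", "ccc" };
        Console.WriteLine(ConcatAll(parts));    // aaabbbccc
        Console.WriteLine(JoinAll(", ", parts)); // aaa, bbb, ccc
    }
}
```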
Here is my shot in the dark...
Strings in .NET (not stringbuilders) go into the String Intern Pool. This is basically an area managed by the CLR to share strings to improve performance. There has to be some limit here, although I have no idea what that limit is. I imagine all the concatenation you are doing is hitting the ceiling of the string intern pool. So SQL says yes I have a value for you, but it can't put it anywhere so you get an exception.
A quick and easy test would be to nGen your assembly and see if you still get the error. After nGen'ing, your application will no longer use the pool.
If that fails, I'd contact Microsoft to try and get some hard details. I think my idea sounds plausible, but I have no idea why it works in debug mode. Perhaps in debug mode strings aren't interned. I am also no expert.
When compounding strings together I always use StringBuilder. It's designed for it and is more efficient than simply using "string1 + string2".
Related
How does StringBuilder work?
What does it do internally? Does it use unsafe code?
And why is it so fast (compared to the + operator)?
When you use the + operator to build up a string:
string s = "01";
s += "02";
s += "03";
s += "04";
then on the first concatenation we make a new string of length four and copy "01" and "02" into it -- four characters are copied. On the second concatenation we make a new string of length six and copy "0102" and "03" into it -- six characters are copied. On the third concat, we make a string of length eight and copy "010203" and "04" into it -- eight characters are copied. So far a total of 4 + 6 + 8 = 18 characters have been copied for this eight-character string. Keep going.
...
s += "99";
On the 98th concat we make a string of length 198 and copy "010203...98" and "99" into it. That gives us a total of 4 + 6 + 8 + ... + 198 = a lot, in order to make this 198 character string.
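To put a number on "a lot", the following sketch simply replays the concatenations described above and adds up the length of every newly created string. The names (CopyCount, CountCopiedChars) are illustrative only.

```csharp
using System;

public static class CopyCount
{
    // Replays s = "01"; s += "02"; ... s += "99"; and totals the
    // characters copied, assuming each concatenation copies every
    // character of the newly created string.
    public static long CountCopiedChars()
    {
        string s = "01";
        long copied = 0;
        for (int i = 2; i <= 99; i++)
        {
            s += i.ToString("00");
            copied += s.Length;
        }
        return copied;
    }

    public static void Main()
    {
        Console.WriteLine(CountCopiedChars()); // 9898
    }
}
```

So building the 198-character string this way copies 9,898 characters in total, about fifty times the length of the result.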
A string builder doesn't do all that copying. Rather, it maintains a mutable array that is hoped to be larger than the final string, and stuffs new things into the array as necessary.
What happens when the guess is wrong and the array gets full? There are two strategies. In the previous version of the framework, the string builder reallocated and copied the array when it got full, and doubled its size. In the new implementation, the string builder maintains a linked list of relatively small arrays, and appends a new array onto the end of the list when the old one gets full.
Also, as you have conjectured, the string builder can do tricks with "unsafe" code to improve its performance. For example, the code which writes the new data into the array can already have checked that the array write is going to be within bounds. By turning off the safety system it can avoid the per-write check that the jitter might otherwise insert to verify that every write to the array is safe. The string builder does a number of these sorts of tricks to do things like ensuring that buffers are reused rather than reallocated, ensuring that unnecessary safety checks are avoided, and so on. I recommend against these sorts of shenanigans unless you are really good at writing unsafe code correctly, and really do need to eke out every last bit of performance.
StringBuilder's implementation has changed between versions, I believe. Fundamentally though, it maintains a mutable structure of some form. I believe it used to use a string which was still being mutated (using internal methods) and would just make sure it would never be mutated after it was returned.
The reason StringBuilder is faster than using string concatenation in a loop is precisely because of the mutability - it doesn't require a new string to be constructed after each mutation, which would mean copying all the data within the string etc.
For just a single concatenation, it's actually slightly more efficient to use + than to use StringBuilder. It's only when you're performing multiple operations and you don't really need the intermediate results that StringBuilder shines.
See my article on StringBuilder for more information.
The Microsoft CLR does do some operations with internal call (not quite the same as unsafe code). The biggest performance benefit over a bunch of + concatenated strings is that it writes to a char[] and doesn't create as many intermediate strings. When you call ToString(), it builds a completed, immutable string from your contents.
The StringBuilder uses a string buffer that can be altered, compared to a regular String that can't be. When you call the ToString method of the StringBuilder it will just freeze the string buffer and convert it into a regular string, so it doesn't have to copy all the data one extra time.
As the StringBuilder can alter the string buffer, it doesn't have to create a new string value for each and every change to the string data. When you use the + operator, the compiler turns that into a String.Concat call that creates a new string object. This seemingly innocent piece of code:
str += ",";
compiles into this:
str = String.Concat(str, ",");
We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string, that does not additionally create a string.
Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create a new string, because, as you already know, strings are immutable.
Why?
There is no reason to create a new string on each replacement. It's very simple to avoid, and avoiding it gives a huge performance boost.
If you are very interested, referencesource.microsoft.com and the SSCLI 2.0 source code demonstrate this (see also the question on how to see the code of a method marked as MethodImplOptions.InternalCall):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
// unnecessary code omitted
while (((index=COMStringBuffer::LocalIndexOfString(thisBuffer,oldBuffer,
thisLength,oldLength,index))>-1) && (index<=endIndex-oldLength))
{
replaceIndex[replaceCount++] = index;
index+=oldLength;
}
if (replaceCount != 0)
{
//Calculate the new length of the string and ensure that we have
// sufficent room.
INT64 retValBuffLength = thisLength -
((oldLength - newLength) * (INT64)replaceCount);
gc.retValString = COMString::NewString((INT32)retValBuffLength);
// unnecessary code omitted
}
}
As you can see, the occurrences are counted first (replaceCount), retValBuffLength is calculated from that count, and the result string is allocated once. The real implementation may be a bit different in .NET 4.0 (SSCLI 4.0 has not been released), but I assure you it's not doing anything silly :-).
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes: a reusable StringBuilder with a capacity of ~2000 characters, which avoids any memory allocation. This is only true if the replacement lengths are equal, and it can get you a nice performance gain if you're in a tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.
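If benchmarks do show a problem, one possible variation on the reusable-buffer idea is sketched below. It swaps the StringBuilder for a plain char array and writes the cleaned field straight to the output, so no new string is created per record. Everything here (DateFieldWriter, WriteWithoutSlashes, the 64-character scratch size, the use of a TextWriter for output) is an assumption for illustration, not something taken from the question.

```csharp
using System;
using System.IO;

public static class DateFieldWriter
{
    // Scratch buffer reused for every record. Not thread-safe.
    private static readonly char[] Scratch = new char[64];

    // Copies the field into the scratch buffer, skipping '/' characters,
    // then writes the buffer to the output without building a string.
    public static void WriteWithoutSlashes(string field, TextWriter output)
    {
        if (field.Length > Scratch.Length)
        {
            throw new ArgumentException("Field is longer than the scratch buffer.", "field");
        }

        int count = 0;
        foreach (char c in field)
        {
            if (c != '/')
            {
                Scratch[count++] = c;
            }
        }
        output.Write(Scratch, 0, count);
    }

    public static void Main()
    {
        WriteWithoutSlashes("01/01/2014", Console.Out); // 01012014
        Console.WriteLine();
    }
}
```

This only helps when the cleaned value goes directly to a writer; as soon as a string is needed again, an allocation is unavoidable.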
Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we follow this through to the StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on string objects:
Although String is immutable in .NET, this is not some kind of limitation, it's a contract. String is actually a reference type, and what it includes is the length of the actual string + the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass a string to its constructor it actually copies the string's buffer to its own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string and copies its buffer there.
So, if you need a fast and memory-efficient way to make changes in a string, StringBuilder is definitely your choice, especially given that Microsoft explicitly recommends using StringBuilder if you "perform repeated modifications to a string".
I haven't found any sources, but I strongly doubt that the implementation creates a new string for every replacement. I would implement it with a StringBuilder internally as well. So String.Replace is absolutely fine if you want to replace once in a huge string. But if you have to call it many times you should consider using StringBuilder.Replace, because every call to String.Replace creates a new string.
So you can use StringBuilder.Replace since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()
There is no string method for that. You are on your own. But you can try something like this:
oldFormat = "dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}{2}", dt[0], dt[1], dt[2]);
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.AppendFormat("{0}{1}", dt[1], dt[2]);
I know there is a rule about strings in C# that says:
When we create a textual string of type string, we can never change its value! When a different value is assigned to a string variable, the first string stays in memory and the variable (which is a kind of reference) just gets the address of the new string.
So doing something like this:
string a = "aaa";
a = a.Trim(); // Creates a new string
is not recommended.
But what if I need to do some actions on the string according to user preferences, like so:
string a = "aaa";
if (doTrim)
a = a.Trim();
if (doSubstring)
a = a.Substring(...);
etc...
How can I do it without creating new strings on every action ?
I thought about sending the string to a function by ref, like so:
void DoTrim(ref string value)
{
value = value.Trim(); // also creates new string
}
But this also creates a new string...
Can someone please tell me if there is a way of doing it without wasting memory on each action?
You are correct in that the operations you're performing are creating new strings, and not mutating a single string.
You are incorrect in that this is generally problematic or something to be avoided.
If your strings are hundreds of thousands of characters, then sure, copying all of those just to remove a few leading spaces, or to add a few characters to the end of it (repeatedly, in a loop, in particular) can actually be a problem.
If your strings aren't large, and you're not performing many (as in thousands of) operations on the string, then you almost certainly don't have a problem.
Now there are a handful of contexts, generally rather rare, that do run into problems with string manipulation. Probably the most common of the problematic contexts is appending a bunch of strings together, as doing so means copying all of the previously appended data for each new addition. If you're in that situation consider using something like a StringBuilder or a single call to string.Concat (the overload accepting a sequence of strings to concat) to perform this operation.
Other contexts are, for example, programs dealing with processing DNA strands. They'll often be taking strings of millions of characters and creating hundreds of thousands of many thousand character long substrings of that string. Using standard C# string operations would therefore result in a lot of unnecessary copying. People writing such programs end up creating objects that can represent a substring of another string without copying the data and instead referring to the existing string's underlying data source with an offset.
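On runtimes that provide ReadOnlySpan<char> (an assumption here: the original answer predates it, and it requires .NET Core 2.1 or later, or the System.Memory package), such a "substring without copying" no longer needs a hand-written type. A rough sketch, with made-up names and sample data:

```csharp
using System;

public static class SliceDemo
{
    // Works on a window into an existing string; no characters are copied.
    public static int CountChar(ReadOnlySpan<char> window, char target)
    {
        int n = 0;
        foreach (char c in window)
        {
            if (c == target)
            {
                n++;
            }
        }
        return n;
    }

    public static void Main()
    {
        string strand = "ACGTACGTGGCA";
        // A view of 6 characters starting at offset 4 ("ACGTGG"),
        // referring to the original string's memory.
        ReadOnlySpan<char> window = strand.AsSpan(4, 6);
        Console.WriteLine(CountChar(window, 'G')); // 3
    }
}
```

A span is just an offset and a length over the existing data, which is the same idea as the custom substring objects described above.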
Sticking my neck out here a bit, so I'll preface this by saying that in most cases Servy's answer is the correct answer. However, if you really do need lower-level access and fewer string allocations, you could consider creating a character buffer (a simple array, for instance) that is big enough to fit your processed string and allows you to manipulate the characters directly. There are some significant downsides to this, though: you'll probably have to write your own Substring() and Trim() modifiers, and your buffer will likely be bigger than your input strings in many cases to accommodate unexpected string sizes. Once you are done manipulating your buffer, you could then package the character array up as a String. Since all of your manipulations are done on a single buffer, you should save a lot of allocations.
I would seriously consider if the above is worth the hassle, but if you really need the performance, this is the best solution I can think of.
How can I do it without creating new strings on every action?
You should only worry about that if you're handling big strings or if you're doing many string operations in a short period of time.
Even then, the performance loss due to creating more strings is minimal.
The Garbage Collector has to collect all the unused strings, but hey - that only really matters if you're doing MANY string operations.
So focus on readability in your code first, rather than trying to optimize its performance.
If you really have to keep the same reference of string, you can simply use a StringBuilder.
Why do you feel uncomfortable creating new strings? There is a reason for the string API to be designed this way. For example, immutable objects are thread-safe (and they allow for a more functional programming style).
If you replace your simple string code by stringbuilders, your code might be more error-prone in multithreading scenarios (which is quite normal in a web application for example).
StringBuilders are used for concatenating strings, inserting characters, removing characters, etc. But they will need to reallocate and copy their internal characters arrays every now and then, too.
When you speak about memory consumption you have started to micro-optimize your code. Don't.
BTW: Have a look at the LINQ API. What does each operation do? Rats - it creates a new enumerator! A query like foos.Where(bar).Select(baz).FirstOrDefault() could certainly be memory-optimized by just creating a single enumerator object and modifying the criteria it applies when enumerating. </irony>
It will depend on what your exact use case is, but you might want to explore using the StringBuilder class which you can use to build and modify strings.
At the moment I maintain a quirky codebase, and I have come across the following pattern more than 100 times:
string NotMySqlQuery = ""; //why initialize the string with "", only to overwrite it on the next line?
NotMySqlQuery = "The query to be executed";
Since I came across this so often, I now doubt my own good judgement.
Is this a trick to optimize the compiler or does it bring any other advantages?
It reminds me a bit of the old times when I did write some code in C++, but it still doesn't look like proper dealing with strings to me.
Why would someone write code like that?
There is no performance advantage to that syntax. If anything it is slightly worse than not initializing the variable at all: strings are immutable in C#, so the empty string is never modified in place; the variable is simply pointed at a second string, and the first assignment is wasted.
For your simple case, it is better to merge the two lines into one. There is no point in assigning an empty string and then immediately assigning another value.
string NotMySqlQuery = "The query to be executed";
This is clearer.
When I need to join two strings I use String.Format (or StringBuilder if it happens in several places in the code).
I see that some good programmers don't pay attention to the complexity of joining strings and just use the '+' operator.
I know that using the '+' operator makes the application use more memory, but what about complexity?
This is an excellent article about the different string join methods by our own Jeff Atwood on Coding Horror:
The Sad Tragedy of Micro-Optimization Theater
Here is the gist of the post.
[several string join methods shown]
Take your itchy little trigger finger
off that compile key and think about
this for a minute. Which one of these
methods will be faster?
Got an answer? Great!
And.. drumroll please.. the correct
answer:
It. Just. Doesn't. Matter!
This answer assumes you are talking about the runtime complexity.
Using + creates a new string object, which means the contents of both the old string objects must be copied into the new one. With a large amount of concatenation, such as in a tight loop, this can turn into an O(n^2) operation.
As an informal proof, say you had the following code:
string foo = "a";
for(int i = 0; i < 1000; i++)
{
foo += "a";
}
In the first iteration of the loop, the contents of foo ("a") are copied into a new string object, followed by the contents of the literal "a". That's two copies. The second iteration has three copies: two from the new foo, and one from the literal "a". The 1000th iteration will have 1001 copy operations. The total number of copies is 2 + 3 + ... + 1001. In general, if in a loop you are only concatenating one character each iteration (and you start at one character long), and the number of iterations is n, there will be 2 + 3 + ... + (n + 1) copies. That's n(n + 3)/2 = (n^2 + 3n)/2, which is O(n^2).
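The count can be checked mechanically. The sketch below (names are illustrative) adds up the per-iteration copies for the loop above and compares the total with the sum 2 + 3 + ... + (n + 1), which works out to n(n + 3)/2.

```csharp
using System;

public static class QuadraticCopies
{
    // Iteration i copies the current string (i + 1 characters when
    // starting from length 1 at i = 0) plus the one appended character.
    public static long CountCopies(int n)
    {
        long copies = 0;
        int length = 1; // foo starts as "a"
        for (int i = 0; i < n; i++)
        {
            copies += length + 1;
            length++;
        }
        return copies;
    }

    // Closed form of 2 + 3 + ... + (n + 1).
    public static long ClosedForm(int n)
    {
        return (long)n * (n + 3) / 2;
    }

    public static void Main()
    {
        Console.WriteLine(CountCopies(1000)); // 501500
        Console.WriteLine(ClosedForm(1000));  // 501500
    }
}
```

For 1000 iterations that is about half a million character copies to build a 1001-character string.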
Depends on the situation. The + can sometimes reduce the complexity of the code. Consider the following code:
output = "<p>" + intro + "</p>";
That is a good, clear line. No String.Format required.
If you use + only once, you have no disadvantage from it and it increases readability (as Colin Pickard already stated).
As far as I know + means: take left operand and right operand and copy them into a new buffer (as strings are immutable).
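So conceptually, using + two times (as in Colin Pickard's example) creates 2 temporary strings: first when adding "<p>" to intro, and then when adding "</p>" to the newly created string. In practice the C# compiler turns a chain of + operators within a single expression into one String.Concat call, so only the final string is allocated there; the extra temporaries appear when the concatenation is spread over several statements.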
You have to consider for yourself when to use which method. Even for a small example like seen above performance drop can be serious if intro is a large enough string.
Unless your application is very string-intensive (profile, profile, profile!), this doesn't really matter. Good programmers put readability above performance for mundane operations.
I think in terms of complexity you trade reiteration of newly created strings for parsing format string.
For example "A" + "B" + "C" + "D" means that you would have to copy "A", "AB", and at last "ABC" in order to form "ABCD". Copying is reiteration, right? So if for example you have a 1000 character string that you will sum with a thousand one-character strings, you will copy (1000+N) character strings 1000 times. It leads to O(n^2) complexity in worst cases.
String.Format, even considering parsing, and StringBuilder should be around O(n).
Because strings are immutable in languages like Java and C#, everytime two strings are concatenated a new string has to be created, in which the contents of the two old strings are copied.
Assume strings which are on average c characters long.
Now the first concatenation only has to copy 2*c characters, but the last one has to copy the concatenation of the first n-1 strings, which is (n-1)*c characters long, and the last one itself, which is c characters long, for a total of n*c characters. For n concatenations this makes n^2*c/2 character copies, which means an algorithmic complexity of O(n^2).
In most cases in practice however this quadratic complexity will not be noticeable (as Jeff Atwood shows in the blog entry linked to by Robert C. Cartaino) and I'd advise to just write the code as readable as possible.
There are cases however when it does matter, and using O(n^2) in such cases may be deadly.
In practice I've seen this for example for generating big Word XML files in memory, including base64 encoded pictures. This generation used to take over 10 minutes due to using O(n^2) string concatenation. After I replaced concatenation using + with StringBuilder the running time for the same document reduced below 10 seconds.
Similarly I've seen a piece of software that generated an epically big piece of SQL code as a string using + for concatenation. I haven't even waited till this finished (had been waiting for over an hour already), but just rewrote it using StringBuilder. This faster version finished within a minute.
In short, just do whatever is most readable / easiest to write and only think about this when you'll be creating a freaking huge string :-)
StringBuilder should be used if you are building a large string in several steps. It is also a good thing if you know roughly how large the string will eventually be: you can then initialize the builder with the size you need and avoid costly re-allocations. For small operations there will be no considerable performance loss from using the + operator, and it will result in clearer code (which is also faster to write...)
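A short sketch of the pre-sizing idea, using the StringBuilder(int capacity) constructor. The helper name and the way the size is estimated are assumptions for illustration.

```csharp
using System;
using System.Text;

public static class PresizedBuilder
{
    // The final length is known up front, so the builder is created
    // with enough capacity and does not need to grow while appending.
    public static string Repeat(string piece, int count)
    {
        var sb = new StringBuilder(piece.Length * count);
        for (int i = 0; i < count; i++)
        {
            sb.Append(piece);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Repeat("ab", 3)); // ababab
    }
}
```

The capacity is only a hint: if the estimate is too small the builder still grows as needed, it just loses the benefit.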
Plenty of input already, but I've always felt that the best way to approach the issue of performance is to understand the performance differences of all viable solutions and for those that meet performance requirements, pick the one that is the most reliable and the most supportable.
There are many who use Big O notation to understand complexity, but I've found that in most cases (including understanding which string concatenation methods work best), a simple time trial will suffice. Just compare strA + strB to sb.Append(strB) on a StringBuilder in a loop of 100,000 iterations to see which works faster.
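Such a time trial might look like the sketch below. It is a rough comparison only: the names are made up, there is no warm-up or repetition, and the timings depend entirely on the machine and runtime, so no expected numbers are given.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ConcatTimeTrial
{
    public static string WithPlus(int iterations)
    {
        string s = "";
        for (int i = 0; i < iterations; i++)
        {
            s += "x"; // allocates a new string every time
        }
        return s;
    }

    public static string WithBuilder(int iterations)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < iterations; i++)
        {
            sb.Append("x"); // appends to a mutable buffer
        }
        return sb.ToString();
    }

    public static void Main()
    {
        const int iterations = 100000;

        var watch = Stopwatch.StartNew();
        WithPlus(iterations);
        watch.Stop();
        Console.WriteLine("+ operator:    " + watch.ElapsedMilliseconds + " ms");

        watch.Restart();
        WithBuilder(iterations);
        watch.Stop();
        Console.WriteLine("StringBuilder: " + watch.ElapsedMilliseconds + " ms");
    }
}
```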
The compiler optimizes string literal concatenation into one string literal. For example:
string s = "a" + "b" + "c";
is optimized to the following at compile time:
string s = "abc";
See this question and this MSDN article for more information.
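One way to observe this folding is a reference comparison, sketched below with made-up names. Because the concatenation of literals is a compile-time constant, it ends up as the same interned object as the literal "abc", whereas a string assembled at run time is a separate object with equal contents.

```csharp
using System;

public static class LiteralFolding
{
    // All operands are literals, so the compiler emits the single constant "abc".
    public static bool LiteralsAreSameObject()
    {
        string s = "a" + "b" + "c";
        return object.ReferenceEquals(s, "abc");
    }

    // The prefix is only known at run time, so a new string is created.
    public static bool RuntimeConcatIsSameObject(string prefix)
    {
        string s = prefix + "bc";
        return object.ReferenceEquals(s, "abc");
    }

    public static void Main()
    {
        Console.WriteLine(LiteralsAreSameObject());          // True
        Console.WriteLine(RuntimeConcatIsSameObject("a"));   // False
    }
}
```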
For operands that are not compile-time constants, the compiler replaces an expression such as a + b + c with a call to the String.Concat method (not String.Format, as the comments corrected me).
I benchmarked this forever ago, and it hasn't really made a difference since .NET 1.0 or 1.1.
Back then, if you had some process that was going to hit a line of code that was concatenating strings a few million times, you could get a huge speed increase by using String.Concat, String.Format, or StringBuilder.
Now it doesn't matter at all. At least it hasn't mattered since .Net 2.0 came out anyhow. Put it out of your mind and code in whatever manner makes it easiest for you to read.