Strings joining and complexity?


When I need to join two strings I use String.Format (or StringBuilder if it happens in several places in the code).
I see that some good programmers don't pay attention to the complexity of string joining and just use the '+' operator.
I know that using the '+' operator makes the application use more memory, but what about complexity?

This is an excellent article about the different string join methods by our own Jeff Atwood on Coding Horror:
(source: codinghorror.com)
The Sad Tragedy of Micro-Optimization Theater
Here is the gist of the post.
[several string join methods shown]
Take your itchy little trigger finger
off that compile key and think about
this for a minute. Which one of these
methods will be faster?
Got an answer? Great!
And.. drumroll please.. the correct
answer:
It. Just. Doesn't. Matter!

This answer assumes you are talking about the runtime complexity.
Using + creates a new string object, which means the contents of both the old string objects must be copied into the new one. With a large amount of concatenation, such as in a tight loop, this can turn into an O(n^2) operation.
As an informal proof, say you had the following code:
string foo = "a";
for (int i = 0; i < 1000; i++)
{
    foo += "a";
}
In the first iteration of the loop, the contents of foo ("a") are copied into a new string object, followed by the contents of the literal "a". That's two copies. The second iteration has three copies: two from the new foo, and one from the literal "a". The 1000th iteration will have 1001 copy operations. The total number of copies is 2 + 3 + ... + 1001. In general, if you concatenate one character per iteration (and you start at one character long), then for n iterations there will be 2 + 3 + ... + (n + 1) copies. That sum is n(n + 3)/2 = (n^2 + 3n)/2, which grows like 1 + 2 + 3 + ... + n = n(n + 1)/2 and is O(n^2).
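The count above can be checked with a short sketch. The names CopyCounter and CountCopies are illustrative only, and the method merely models the character copies implied by repeated +=; it does not measure what the runtime actually does.

```csharp
using System;

public static class CopyCounter
{
    // Models the loop above: start with a 1-character string and append
    // one character per iteration, counting every character copied.
    public static long CountCopies(int iterations)
    {
        long copies = 0;
        long length = 1; // length of foo before the loop starts
        for (int i = 0; i < iterations; i++)
        {
            copies += length + 1; // old contents plus the appended character
            length += 1;          // the new string is one character longer
        }
        return copies;
    }
}
```

For 1000 iterations, CopyCounter.CountCopies(1000) adds up 2 + 3 + ... + 1001, which is 501500.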

Depends on the situation. The + can sometimes reduce the complexity of the code. Consider the following code:
output = "<p>" + intro + "</p>";
That is a good, clear line. No String.Format required.

If you use + only once, you have no disadvantage from it and it increases readability (as Colin Pickard already stated).
As far as I know + means: take left operand and right operand and copy them into a new buffer (as strings are immutable).
So using + two times (as in Colin Pickard's example) you already create 2 temporary strings: first when adding "<p>" to intro, and then when adding "</p>" to the newly created string.
You have to consider for yourself when to use which method. Even for a small example like the one above, the performance drop can be serious if intro is a large enough string.

Unless your application is very string-intensive (profile, profile, profile!), this doesn't really matter. Good programmers put readability above performance for mundane operations.

I think in terms of complexity you trade reiteration of newly created strings for parsing the format string.
For example, "A" + "B" + "C" + "D" means that you would have to copy "A", "AB", and finally "ABC" in order to form "ABCD". Copying is reiteration, right? So if, for example, you have a 1000-character string to which you append a thousand one-character strings, you will copy strings of (1000 + N) characters 1000 times. That leads to O(n^2) complexity in the worst case.
String.Format, even considering parsing, and StringBuilder should be around O(n).

Because strings are immutable in languages like Java and C#, every time two strings are concatenated a new string has to be created, into which the contents of the two old strings are copied.
Assume strings which are on average c characters long.
Now the first concatenation only has to copy 2*c characters, but the last one has to copy the concatenation of the first n-1 strings, which is (n-1)*c characters long, and the last one itself, which is c characters long, for a total of n*c characters. For n concatenations this makes n^2*c/2 character copies, which means an algorithmic complexity of O(n^2).
In most cases in practice, however, this quadratic complexity will not be noticeable (as Jeff Atwood shows in the blog entry linked to by Robert C. Cartaino), and I'd advise just writing the code as readably as possible.
There are cases however when it does matter, and using O(n^2) in such cases may be deadly.
In practice I've seen this for example for generating big Word XML files in memory, including base64 encoded pictures. This generation used to take over 10 minutes due to using O(n^2) string concatenation. After I replaced concatenation using + with StringBuilder the running time for the same document reduced below 10 seconds.
Similarly I've seen a piece of software that generated an epically big piece of SQL code as a string using + for concatenation. I haven't even waited till this finished (had been waiting for over an hour already), but just rewrote it using StringBuilder. This faster version finished within a minute.
In short, just do whatever is most readable / easiest to write and only think about this when you'll be creating a freaking huge string :-)
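A minimal sketch of the kind of rewrite described above, assuming a loop that appends many small fragments. The class name BigStringDemo and the "<row/>" fragment are made up for illustration; they are not taken from the Word XML or SQL generators mentioned.

```csharp
using System.Text;

public static class BigStringDemo
{
    // Quadratic: every += copies the whole string built so far.
    public static string WithPlus(int parts)
    {
        string result = "";
        for (int i = 0; i < parts; i++)
        {
            result += "<row/>";
        }
        return result;
    }

    // Roughly linear: Append writes into the builder's internal buffer,
    // and the final string is created once by ToString().
    public static string WithBuilder(int parts)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < parts; i++)
        {
            sb.Append("<row/>");
        }
        return sb.ToString();
    }
}
```

Both methods return the same text; only the amount of copying differs, which is why the difference shows up only for very large outputs.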

StringBuilder should be used if you are building a large string in several steps. It is also a good thing if you know roughly how large the string will eventually be, because then you can initialize the builder with the size you need and prevent costly re-allocations. For small operations, using the + operator will not cause a noticeable performance loss, and it will result in clearer code (that is also faster to write...).
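As a sketch of the pre-sizing advice, the StringBuilder(int capacity) constructor accepts an initial capacity in characters. The helper name PresizedBuilder.Repeat is hypothetical and only there to make the example self-contained.

```csharp
using System.Text;

public static class PresizedBuilder
{
    public static string Repeat(string part, int count)
    {
        // The capacity is only a starting size in characters;
        // the builder still grows if more is appended.
        var sb = new StringBuilder(part.Length * count);
        for (int i = 0; i < count; i++)
        {
            sb.Append(part);
        }
        return sb.ToString();
    }
}
```

Here the final length is known exactly, so the buffer never has to grow while the loop runs.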

Plenty of input already, but I've always felt that the best way to approach the issue of performance is to understand the performance differences of all viable solutions and for those that meet performance requirements, pick the one that is the most reliable and the most supportable.
There are many who use Big O notation to understand complexity, but I've found that in most cases (including understanding which string concatenation methods work best), a simple time trial will suffice. Just compare strA + strB to stringBuilder.Append(strB) in a loop of 100,000 iterations to see which works faster.
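One way such a time trial could look, using System.Diagnostics.Stopwatch. This is a rough sketch rather than a rigorous benchmark: the class and method names are invented here, and results depend heavily on the machine, runtime version, and JIT warm-up.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ConcatTimeTrial
{
    // Times `iterations` appends of a one-character string using +=.
    public static TimeSpan TimePlus(int iterations)
    {
        var watch = Stopwatch.StartNew();
        string s = "";
        for (int i = 0; i < iterations; i++)
        {
            s += "x";
        }
        watch.Stop();
        GC.KeepAlive(s); // keep the result in use so the work is not skipped
        return watch.Elapsed;
    }

    // Times the same work using StringBuilder.Append plus one ToString().
    public static TimeSpan TimeBuilder(int iterations)
    {
        var watch = Stopwatch.StartNew();
        var sb = new StringBuilder();
        for (int i = 0; i < iterations; i++)
        {
            sb.Append("x");
        }
        string s = sb.ToString();
        watch.Stop();
        GC.KeepAlive(s);
        return watch.Elapsed;
    }
}
```

Calling ConcatTimeTrial.TimePlus(100000) and ConcatTimeTrial.TimeBuilder(100000) and printing both values gives a first impression; running each several times and discarding the first run makes the numbers more trustworthy.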

The compiler optimizes string literal concatenation into one string literal. For example:
string s = "a" + "b" + "c";
is optimized to the following at compile time:
string s = "abc";
See this question and this MSDN article for more information.
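A small way to see this for yourself: a const initializer must be a compile-time constant expression, so the following only compiles because the compiler evaluates the concatenation itself. The class name LiteralFolding is just for illustration.

```csharp
public static class LiteralFolding
{
    // Accepted by the compiler only because "a" + "b" + "c" is a
    // constant expression that it folds into a single literal.
    public const string Folded = "a" + "b" + "c";
}
```

Replacing one of the literals with a non-constant string variable makes the const declaration fail to compile, which shows that the folding applies only to constant operands.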

The compiler will replace a chain of + operators on non-constant strings, such as a + b + c, with a call to the String.Concat method (not String.Format, as the comments corrected me).

I benchmarked this forever ago, and it hasn't really made a difference since .NET 1.0 or 1.1.
Back then, if you had some process that was going to hit a line of code that was concatenating strings a few million times, you could get a huge speed increase by using String.Concat, String.Format, or StringBuilder.
Now it doesn't matter at all. At least it hasn't mattered since .NET 2.0 came out, anyhow. Put it out of your mind and code in whatever manner makes it easiest for you to read.

Related

StringBuilder - Should only Append and AppendLine be used without combining strings in arguments?

I have started to use StringBuilder as I hear it's much more optimized when it comes to outputting strings.
My question is regarding the use of + as strings are immutable and when you add them together it allocates a new string.
If I use this operator in the arguments for the StringBuilder.Append function, I assume it will essentially have the same overhead.
For example:
string animal1 = "dog";
string animal2 = "cat";
stringBuilder.Append("Today I saw a " + animal1 + " and " + animal2);
My guess is that this could concatenate these texts together allocating memory anyway.
I assume the more efficient (albeit verbose) way to do this would be:
stringBuilder.Append("Today I saw a");
stringBuilder.Append(animal1);
stringBuilder.Append(" and ");
stringBuilder.Append(animal2);
Is this correct?

You have a bug in the second example, as you miss a space between "Today I saw a" and animal1. Unless you are doing this in an excessive way (within a loop with many iterations) you'll probably find no measurable difference, so your best bet is probably to aim for readability.
$"Today I saw a {animal1} and a {animal2}"
Yes, I added an a before the second animal too :) I'm not handling cases for "an" though.
You also have the option of using AppendFormat if you want to be less verbose with all those appends...
stringBuilder.AppendFormat("Today I saw a {0} and a {1}", animal1, animal2);

Concatenating strings like that allocates the memory for a new string, just to have it appended to your StringBuilder, which is just wasteful. As you noted, you should just explicitly Append them instead.
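The variants discussed in this thread can be put side by side as follows. The class AnimalSentence and its method names are made up for this sketch; Append and AppendFormat are the StringBuilder methods named above.

```csharp
using System.Text;

public static class AnimalSentence
{
    // Concatenates first, so a temporary string is allocated before Append.
    public static string WithConcat(string animal1, string animal2)
    {
        var sb = new StringBuilder();
        sb.Append("Today I saw a " + animal1 + " and " + animal2);
        return sb.ToString();
    }

    // Appends each piece directly; Append returns the builder, so calls chain.
    public static string WithAppends(string animal1, string animal2)
    {
        var sb = new StringBuilder();
        sb.Append("Today I saw a ").Append(animal1).Append(" and ").Append(animal2);
        return sb.ToString();
    }

    // Less verbose, at the cost of parsing the format string.
    public static string WithAppendFormat(string animal1, string animal2)
    {
        var sb = new StringBuilder();
        sb.AppendFormat("Today I saw a {0} and {1}", animal1, animal2);
        return sb.ToString();
    }
}
```

All three produce the same text for the same inputs; they differ only in how many intermediate strings are created along the way.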

Time complexity for C# StringBuilder initialized with string [duplicate]

How does StringBuilder work?
What does it do internally? Does it use unsafe code?
And why is it so fast (compared to the + operator)?

When you use the + operator to build up a string:
string s = "01";
s += "02";
s += "03";
s += "04";
then on the first concatenation we make a new string of length four and copy "01" and "02" into it -- four characters are copied. On the second concatenation we make a new string of length six and copy "0102" and "03" into it -- six characters are copied. On the third concat, we make a string of length eight and copy "010203" and "04" into it -- eight characters are copied. So far a total of 4 + 6 + 8 = 18 characters have been copied for this eight-character string. Keep going.
...
s += "99";
On the 98th concat we make a string of length 198 and copy "010203...98" and "99" into it. That gives us a total of 4 + 6 + 8 + ... + 198 = a lot, in order to make this 198 character string.
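To put a number on "a lot", the sum can be computed directly. The sketch below only models the copy counts described above; the names ConcatCost and TotalCopied are illustrative.

```csharp
public static class ConcatCost
{
    // Total characters copied when building "0102...99" with +=,
    // starting from "01" and appending two characters at a time.
    public static int TotalCopied()
    {
        int total = 0;
        int length = 2; // "01"
        for (int n = 2; n <= 99; n++)
        {
            length += 2;     // length of the new string after appending
            total += length; // every character of the new string is copied
        }
        return total;
    }
}
```

ConcatCost.TotalCopied() returns 9898, so roughly fifty times as many characters are copied as the 198 that end up in the final string.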
A string builder doesn't do all that copying. Rather, it maintains a mutable array that is hoped to be larger than the final string, and stuffs new things into the array as necessary.
What happens when the guess is wrong and the array gets full? There are two strategies. In the previous version of the framework, the string builder reallocated and copied the array when it got full, and doubled its size. In the new implementation, the string builder maintains a linked list of relatively small arrays, and appends a new array onto the end of the list when the old one gets full.
Also, as you have conjectured, the string builder can do tricks with "unsafe" code to improve its performance. For example, the code which writes the new data into the array can already have checked that the array write is going to be within bounds. By turning off the safety system it can avoid the per-write check that the jitter might otherwise insert to verify that every write to the array is safe. The string builder does a number of these sorts of tricks to do things like ensuring that buffers are reused rather than reallocated, ensuring that unnecessary safety checks are avoided, and so on. I recommend against these sorts of shenanigans unless you are really good at writing unsafe code correctly, and really do need to eke out every last bit of performance.

StringBuilder's implementation has changed between versions, I believe. Fundamentally though, it maintains a mutable structure of some form. I believe it used to use a string which was still being mutated (using internal methods) and would just make sure it would never be mutated after it was returned.
The reason StringBuilder is faster than using string concatenation in a loop is precisely because of the mutability - it doesn't require a new string to be constructed after each mutation, which would mean copying all the data within the string etc.
For just a single concatenation, it's actually slightly more efficient to use + than to use StringBuilder. It's only when you're performing multiple operations and you don't really need the intermediate results that StringBuilder shines.
See my article on StringBuilder for more information.

The Microsoft CLR does do some operations with internal call (not quite the same as unsafe code). The biggest performance benefit over a bunch of + concatenated strings is that it writes to a char[] and doesn't create as many intermediate strings. When you call ToString(), it builds a completed, immutable string from your contents.

The StringBuilder uses a string buffer that can be altered, compared to a regular String that can't be. When you call the ToString method of the StringBuilder it will just freeze the string buffer and convert it into a regular string, so it doesn't have to copy all the data one extra time.
As the StringBuilder can alter the string buffer, it doesn't have to create a new string value for each and every change to the string data. When you use the + operator, the compiler turns that into a String.Concat call that creates a new string object. This seemingly innocent piece of code:
str += ",";
compiles into this:
str = String.Concat(str, ",");

Does string.Replace(string, string) create additional strings?

We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from a few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string that does not create an additional string.

Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create a new string, because, as you already know, strings are immutable.
Why?
There is no reason to create a new string on each replacement: it's very simple to avoid, and avoiding it gives a huge performance boost.
If you are very interested, referencesource.microsoft.com and the SSCLI 2.0 source code will demonstrate this (see also the question on how to see the code of a method marked as MethodImplOptions.InternalCall):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
        StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
    // unnecessary code omitted
    while (((index=COMStringBuffer::LocalIndexOfString(thisBuffer,oldBuffer,
             thisLength,oldLength,index))>-1) && (index<=endIndex-oldLength))
    {
        replaceIndex[replaceCount++] = index;
        index+=oldLength;
    }
    if (replaceCount != 0)
    {
        //Calculate the new length of the string and ensure that we have
        // sufficient room.
        INT64 retValBuffLength = thisLength -
            ((oldLength - newLength) * (INT64)replaceCount);
        gc.retValString = COMString::NewString((INT32)retValBuffLength);
        // unnecessary code omitted
    }
}
As you can see, all matches are located first, and retValBuffLength is then calculated from replaceCount (the number of replacements found), so the result string is allocated exactly once. The real implementation may be a bit different for .NET 4.0 (SSCLI 4.0 is not released), but I assure you it's not doing anything silly :-).
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes: use a reusable StringBuilder with a capacity of ~2000 characters, so that no further memory allocation is needed. This is only true if the replacement lengths are equal, and it can get you a nice performance gain if you're in a tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.
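A minimal sketch of the reusable-builder idea, assuming single-threaded use. The names DateReformatter and StripSlashes and the 2000-character capacity are illustrative; note also that the final ToString() still allocates the result string, so this only avoids the builder's own re-allocations.

```csharp
using System.Text;

public static class DateReformatter
{
    // One shared buffer, reused for every line. Not thread-safe.
    private static readonly StringBuilder Buffer = new StringBuilder(2000);

    public static string StripSlashes(string line)
    {
        Buffer.Clear();          // reset the length, keep the allocated capacity
        Buffer.Append(line);
        Buffer.Replace("/", ""); // in-place replacement inside the builder
        return Buffer.ToString();
    }
}
```

For example, DateReformatter.StripSlashes("01/01/2014") gives "01012014". Whether this beats a plain String.Replace for a given workload is something only a benchmark on realistic files can tell.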

Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we'll follow to StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on string objects:
Although String is immutable in .NET, this is not some kind of limitation, it's a contract. String is actually a reference type, and what it includes is the length of the actual string plus the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass a string to its constructor it actually copies the string's buffer to its own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string and copies its buffer there.
So, if you need a fast and memory efficient way to make changes in a string, StringBuilder is definitely your choice. Especially regarding that Microsoft explicitly recommends to use StringBuilder if you "perform repeated modifications to a string".

I haven't found any sources, but I strongly doubt that the implementation always creates new intermediate strings. I'd also implement it with a StringBuilder internally. So String.Replace is absolutely fine if you want to replace once in a huge string. But if you have to replace many times, you should consider using StringBuilder.Replace, because every call of String.Replace creates a new string.
So you can use StringBuilder.Replace since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()

There is no string method for that. You are on your own. But you can try something like this:
oldFormat = "dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}{2}", dt[0], dt[1], dt[2]);
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.AppendFormat("{0}{1}", dt[1], dt[2]);

Difference between using the Append method of the StringBuilder class and the concatenation "+" operator [duplicate]

What is the difference between using the Append method of the StringBuilder class and concatenation using the "+" operator?
In what way is the Append method more efficient or faster than the "+" operator when concatenating two strings?

First of all, String and StringBuilder are different classes.
The String class represents an immutable type, while the StringBuilder class represents a mutable type.
When you use + to concatenate your strings, it uses the String.Concat method, and every time it returns a new string object.
The StringBuilder.Append method appends a copy of the specified string. It doesn't return a new string; it changes the builder itself.
As for efficiency, you should read Jeff's article called The Sad Tragedy of Micro-Optimization Theater:
It. Just. Doesn't. Matter!
We already know none of these operations
will be performed in a loop, so we can rule out brutally poor
performance characteristics of naive string concatenation. All that's
left is micro-optimization, and the minute you begin worrying about
tiny little optimizations, you've already gone down the wrong path.
Oh, you don't believe me? Sadly, I didn't believe it myself, which is
why I got drawn into this in the first place. Here are my results --
for 100,000 iterations, on a dual core 3.5 GHz Core 2 Duo.
1: Simple Concatenation 606 ms
2: String.Format 665 ms
3: string.Concat 587 ms
4: String.Replace 979 ms
5: StringBuilder 588 ms

Strings are immutable, so when you append, you actually create a new object in the background.
When you use StringBuilder, it provides an efficient method for concatenating strings.
To be honest, you are not really going to notice a big improvement if you use it once or twice. But the efficiency comes in when you use the StringBuilder in loops.

When you concatenate two strings you actually create a new string with the result. A StringBuilder has the ability to resize itself as you add to it, which can be faster.
As with all things, it depends. If you are simply concatenating two small strings like this:
string s = "a" + "b";
Then at best there will be no difference in performance, but likely this will be quicker than using a StringBuilder and is also easier to read.
StringBuilder is more suitable for cases where you are concatenating an arbitrary number of strings, which you don't know at compile time.

Memory-wise, is it better to store a long non-dynamic string as a single string object or to have the program build it out of its repetitive parts?

This is a bit of an odd question and more of a thought experiment than anything I need, but I'm still curious about the answer: If I have a string that I know ahead of time will never change but is (mostly) made up of repetitive parts, would it be better to have said string as just a single string object, get called when needed, and be done with it - or should I break the string up into smaller strings that represent the repeated parts and concatenate them when needed?
Let me use an example: Let's say we have a naive programmer who wants to create a regular expression for validating IP Addresses (in other words, I know this regular expression won't work as intended, but it helps show what I mean by repetitive parts and saves me a bit of typing for the second part of the example). So he writes this function:
private bool isValidIP(string ip)
{
    Regex checkIP = new Regex("\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
    return checkIP.IsMatch(ip);
}
Now our young programmer notices that he has "\d", "\d?", and "\." just repeated a few times. This gives him an idea that he could both save some storage space and help remind himself what this means for later. So he remakes the function:
private bool isValidIP(string ip)
{
    string escape = "\\";
    string digi = "d";
    string digit = escape + digi;
    string possibleDigit = digit + '?';
    string IpByte = digit + possibleDigit + possibleDigit;
    string period = escape + '.';
    Regex checkIP = new Regex(IpByte + period + IpByte + period + IpByte + period + IpByte);
    return checkIP.IsMatch(ip);
}
The first method is simple. It just stores 38 chars in the program's instructions, which are just read into memory each time the function is called.
The second method stores (I suspect) two 1 length strings and two chars into the program's instructions as well as all of the calls to concatenate those four into different orders. This creates at least 8 strings in memory when the function is called (the six named strings, a temporary string for the first four parts of the regex, and then the final string created from the previous string + the three strings of the regex). This second method also happens to help explain what the regex is looking for - though not what the final regex would look like. It could also help with refactoring, say if our hypothetical programmer realizes that his current regex will allow for more than just 0-255 in the IP Address, and the constituent parts can be changed without having to find every single item that would need to be fixed.
Again, which method would be better? Would it just be as simple as a trade-off between program size vs. memory usage? Of course, with something as simple as this, the trade-off is negligible at best, but what about a much larger, more complex string?
Oh, yes, and a much better regex for IP Addresses would be:
^(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)(\\.(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)){3}$
Wouldn't work as well as an example, would it?
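For completeness, here is a sketch of how that stricter pattern could be used. The class name IpValidator is made up for the example, and the pattern is written as a verbatim string (@"...") so the backslashes are not doubled; it is otherwise the expression quoted above.

```csharp
using System.Text.RegularExpressions;

public static class IpValidator
{
    // Built once and reused for every call.
    private static readonly Regex Ipv4 = new Regex(
        @"^(25[0-5]|2[0-4]\d|[01]?\d\d?)(\.(25[0-5]|2[0-4]\d|[01]?\d\d?)){3}$");

    public static bool IsValid(string ip)
    {
        return Ipv4.IsMatch(ip);
    }
}
```

With this, IpValidator.IsValid("192.168.1.1") is true, while "256.1.1.1" and "1.2.3" are rejected.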

The first is by far the better option. Here's why:
It's clearer.
It's cheaper. Any time you declare a new object it's an "expensive" process. You have to make space for it on the heap (well, for strings at least). Yes, you could in theory be saving a byte or so, but you're spending a lot more time (probably, I haven't tested it) going through and allocating space for each string, additional memory instructions etc. Not to mention that you also have to factor in the use of the GC. You keep allocating strings and eventually you are going to have to contend with it taking up process ticks too. If you really want to hit on optimization, I can easily tell this code isn't as efficient as it could be. There are no constants for one thing, which means that you are possibly creating more objects than you need instead of letting the compiler optimize for strings that don't need to change. This leads me to think that, as a person reviewing this code, I need to take a much closer look to see what is going on and figure out whether something is wrong.
It's clearer (yes, I said this again). You want to do an academic pursuit to see how efficient you can make it. That's cool. I get that. I do it myself. It's fun. I NEVER let that slip into production code. I don't care about losing a tick, I care about having a bug in production, and I care about if other programmers can understand what my code does. Reading someone else's code is hard enough, I don't want to add the extra task of them having to try and figure out which micro-optimization I put in and what happens if they "nudge" the wrong piece of code.
You hit on another point. What if the original regex is wrong. Google will tell you this problem has been solved. You can Google another regex that's right and has been tested. You can't Google "What's wrong with my code." You can post it on SO sure, but that means that someone else has to get involved and look through it.
Here's how to make the first example win the horse race easily:
private static readonly Regex checkIP = new Regex(
    "\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
private bool isValidIP(string ip)
{
    return checkIP.IsMatch(ip);
}
Declare once, reuse over and over. If you take the time to rebuild the regex dynamically on every call to save a few bytes, you don't get to do that. Technically you could build it dynamically and still only create the object once, but that is a lot more work than, say, moving it to a class-level variable.

You're effectively attempting to game the compiler here and implement your own string compression. For the kinds of string literals you're describing, it seems like your savings will be mere tens of bytes shaved off of the compiled binary, which due to memory alignment may not even be realized. In exchange for these few bytes of saved space, this approach adds code complexity and runtime overhead, not to mention difficulty in debugging.
Storage is cheap. Why make your life (and the lives of your coworkers) harder? Keep your code simple, clear, and evident - you'll thank yourself later.

The second is worse off in memory consumption, as every time you concatenate two strings you've got three in memory.
Although the compiler started handling some instances of string constants by creating a StringBuilder for you, I'd still vote for the first one being less memory intensive, because if the system does create the StringBuilder for you, you are going to have the overhead for that, and if it doesn't see the first paragraph...
I am now curious how compiling the RegEx would affect the memory usage.

The savings here are illusory, and splitting this string up is overkill. Saving an insignificant amount of memory by complicating such simple code is just pointless. You will not see any savings, but the next person to maintain that code will spend 10x more time understanding it.
Strings are immutable so if your string never/rarely changes keep it in one piece. Intense string concatenations give garbage collector additional strain.
Unless your strings and sub-strings are big and you could save at least kilobytes, do not spend your time and effort on such optimizations.
