String concatenation in C# with interned strings

I know this question has been done but I have a slightly different twist to it. Several have pointed out that this is premature optimization, which is entirely true if I were asking for practicality's sake and practicality's sake only. My problem is rooted in a practical problem but I'm still curious nonetheless.
I'm creating a bunch of SQL statements to create a script (as in it will be saved to disk) to recreate a database schema (easily many many hundreds of tables, views, etc.). This means my string concatenation is append-only. StringBuilder, according to MSDN, works by keeping an internal buffer (surely a char[]) and copying string characters into it and reallocating the array as necessary.
However, my code has a lot of repeat strings ("CREATE TABLE [", "GO\n", etc.) which means I can take advantage of them being interned but not if I use StringBuilder since they would be copied each time. The only variables are essentially table names and such that already exist as strings in other objects that are already in memory.
So as far as I can tell that after my data is read in and my objects created that hold the schema information then all my string information can be reused by interning, yes?
Assuming that, then wouldn't a List or LinkedList of strings be faster because they retain pointers to interned strings? Then it's only one call to String.Concat() for a single memory allocation of the whole string that is exactly the correct length.
A List would have to reallocate string[] of interned pointers and a linked list would have to create nodes and modify pointers, so they aren't "free" to do but if I'm concatenating many thousands of interned strings then they would seem like they would be more efficient.
Now I suppose I could come up with some heuristic on character counts for each SQL statement & count each type and get a rough idea and pre-set my StringBuilder capacity to avoid reallocating its char[] but I would have to overshoot by a fair margin to reduce the probability of reallocating.
So for this case, which would be fastest to get a single concatenated string:
StringBuilder
List<string> of interned strings
LinkedList<string> of interned strings
StringBuilder with a capacity heuristic
Something else?
As a separate question (I may not always go to disk) to the above: would a single StreamWriter to an output file be faster yet? Alternatively, use a List or LinkedList then write them to a file from the list instead of first concatenating in memory.
EDIT:
As requested, the reference (.NET 3.5) to MSDN. It says: "New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer." That to me means a char[] that is realloced to make it larger (which requires copying old data to the resized array) then appending.

For your separate question, Win32 has a WriteFileGather function, which could efficiently write a list of (interned) strings to disk - but it would make a notable difference only when being called asynchronously, as the disk write will overshadow all but extremely large concatenations.
For your main question: unless you are reaching megabytes of script, or tens of thousands of scripts, don't worry.
You can expect StringBuilder to double the allocation size on each reallocation. That would mean growing a buffer from 256 bytes to 1MB is just 12 reallocations - quite good, given that your initial estimate was 3 orders of magnitude off the target.
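The doubling arithmetic above is easy to check. The following is only a small sketch; the helper name DoublingsNeeded is made up for illustration, and it assumes the capacity strictly doubles on each reallocation, which is the behaviour this answer describes rather than a guarantee for every .NET version.

```csharp
using System;

static class GrowthEstimate
{
    // Counts how many times a capacity must double, starting at
    // 'initial', before it reaches at least 'target'.
    // Assumes initial > 0 and pure doubling growth.
    public static int DoublingsNeeded(long initial, long target)
    {
        int count = 0;
        for (long capacity = initial; capacity < target; capacity *= 2)
            count++;
        return count;
    }
}
```

Calling GrowthEstimate.DoublingsNeeded(256, 1024 * 1024) gives 12, matching the figure quoted above.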
Purely as an exercise, some estimates: building a buffer of 1MB will sweep roughly 3 MB memory (1MB source, 1MB target, 1MB due to copying during reallocation).
A linked list implementation will sweep about 2MB (and that's ignoring the 8 byte / object overhead per string reference). So you are saving 1 MB of memory reads/writes, compared to a typical memory bandwidth of 10Gbit/s and 1MB L2 cache.
Yes, a list implementation is potentially faster, and the difference would matter if your buffers are an order of magnitude larger.
For the much more common case of small strings, the algorithmic gain is negligible, and easily offset by other factors: the StringBuilder code is likely in the code cache already, and a viable target for microoptimizations. Also, using a string internally means no copy at all if the final string fits the initial buffer.
Using a linked list will also bring down the reallocation problem from O(number of characters) to O(number of segments) - your list of string references faces the same problem as a string of characters!
So, IMO the implementation of StringBuilder is the right choice, optimized for the common case, and degrades mostly for unexpectedly large target buffers. I'd expect a list implementation to degrade for very many small segments first, which is actually the extreme kind of scenario StringBuilder is trying to optimize for.
Still, it would be interesting to see a comparison of the two ideas, and when the list starts to be faster.
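A comparison along those lines could start from something like the sketch below. It is not a rigorous benchmark: the class and method names are invented for illustration, and a single Stopwatch run says little without warm-up, repeated runs, and realistic input sizes. string.Concat is called with an array so the sketch also compiles on older framework versions.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

static class ConcatComparison
{
    // Copies the characters of every piece into a growing StringBuilder.
    public static string WithStringBuilder(IList<string> pieces)
    {
        var sb = new StringBuilder();
        foreach (string piece in pieces)
            sb.Append(piece);
        return sb.ToString();
    }

    // Stores references only, then lets string.Concat allocate the
    // final string once at exactly the right length.
    public static string WithList(IList<string> pieces)
    {
        var list = new List<string>(pieces.Count);
        foreach (string piece in pieces)
            list.Add(piece);
        return string.Concat(list.ToArray());
    }

    // Times one call of the given builder function.
    public static TimeSpan Time(Func<IList<string>, string> build, IList<string> pieces)
    {
        Stopwatch watch = Stopwatch.StartNew();
        build(pieces);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

For example, ConcatComparison.Time(ConcatComparison.WithList, pieces) returns the elapsed time for one list-based build; running both variants over a range of piece counts would show where, if anywhere, the list approach pulls ahead.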

If I were implementing something like this, I would never build a StringBuilder (or any other in memory buffer of your script).
I would just stream it out to your file instead, and make all strings inline.
Here's a rough sketch (myTables and Table stand in for your own schema objects):
using (StreamWriter f = new StreamWriter("yourscript.sql"))
{
    foreach (Table t in myTables)
    {
        f.Write("CREATE TABLE [");
        f.Write(t.ToString());
        f.Write("]");
        ....
    }
}
Then, you'll never need an in memory representation of your script, with all the copying of strings.
Opinions?

In my experience, a properly allocated StringBuilder outperforms most everything else for large amounts of string data. It's worth wasting some memory, even, by overshooting your estimate by 20% or 30% in order to prevent reallocation. I don't currently have hard numbers to back it up using my own data, but take a look at this page for more.
However, as Jeff is fond of pointing out, don't prematurely optimize!
EDIT: As @Colin Burnett pointed out, the tests that Jeff conducted don't agree with Brian's tests, but the point of linking Jeff's post was about premature optimization in general. Several commenters on Jeff's page noted issues with his tests.

Actually StringBuilder uses an instance of String internally. String is in fact mutable within the System assembly, which is why StringBuilder can be built on top of it. You can make StringBuilder a wee bit more efficient by assigning a reasonable length when you create the instance. That way you will eliminate/reduce the number of resize operations.
String interning works for strings that can be identified at compile time. Thus if you generate a lot of strings during the execution they will not be interned unless you do so yourself by calling the interning method on string.
Interning will only benefit you if your strings are identical. Almost identical strings don't benefit from interning, so "SOMESTRINGA" and "SOMESTRINGB" will be two different strings even if they are interned.
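To make the interning behaviour concrete, here is a minimal sketch. InternDemo and its method names are invented for illustration, and it assumes both arguments are non-empty (string.Concat can return one of its arguments unchanged when the other is empty).

```csharp
using System;

static class InternDemo
{
    // Returns true when two equal strings built at run time start out as
    // separate objects and collapse to one instance after string.Intern.
    public static bool SameInstanceAfterIntern(string prefix, string suffix)
    {
        // Built at run time, so these are distinct objects with equal contents.
        string first = string.Concat(prefix, suffix);
        string second = string.Concat(prefix, suffix);
        bool distinctBefore = !object.ReferenceEquals(first, second);

        // string.Intern returns the single pooled instance for this content.
        bool sameAfter = object.ReferenceEquals(string.Intern(first), string.Intern(second));
        return distinctBefore && sameAfter;
    }

    // Strings with different contents never share an instance, interned or not.
    public static bool DifferentContentsShareInstance(string a, string b)
    {
        return object.ReferenceEquals(string.Intern(a), string.Intern(b));
    }
}
```

InternDemo.SameInstanceAfterIntern("CREATE TABLE [", "Orders") returns true, while InternDemo.DifferentContentsShareInstance("SOMESTRINGA", "SOMESTRINGB") returns false.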

If all (or most) of the strings being concatenated are interned, then your scheme MIGHT give you a performance boost, as it could potentially use less memory, and could save a few large string copies.
However, whether or not it actually improves perf depends on the volume of data you are processing, because the improvement is in constant factors, not in the order of magnitude of the algorithm.
The only way to really tell is to run your app using both ways and measure the results. However, unless you are under significant memory pressure, and need a way to save bytes, I wouldn't bother and would just use string builder.

A StringBuilder doesn't use a char[] to store the data, it uses an internal mutable string. That means that there is no extra step to create the final string as it is when you concatenate a list of strings, the StringBuilder just returns the internal string buffer as a regular string.
The reallocations that the StringBuilder does to increase the capacity mean that the data is on average copied an extra 1.33 times. If you can provide a good estimate of the size when you create the StringBuilder you can reduce that even further.
However, to get a bit of perspective, you should look at what it is that you are trying to optimise. What will take most of the time in your program is to actually write the data to disk, so even if you can optimise your string handling to be twice as fast as using a StringBuilder (which is very unlikely), the overall difference will still only be a few percent.

Have you considered C++ for this? Is there a library class that already builds T/SQL expressions, preferably written in C++?
Slowest thing about strings is malloc. It takes 4KB per string on 32-bit platforms. Consider optimizing number of string objects created.
If you must use C#, I'd recommend something like this:
string varString1 = tableName;
string varString2 = tableName;
StringBuilder sb1 = new StringBuilder("const expression");
sb1.Append(varString1);
StringBuilder sb2 = new StringBuilder("const expression");
sb2.Append(varString2);
string resultingString = sb1.ToString() + sb2.ToString();
I would even go as far as letting the computer evaluate the best path for object instantiation with dependency injection frameworks, if perf is THAT important.

Related

String Builder and string size

Why StringBuilder Size is greater than string(~250MB).
Please read the question. I want to know the reason of size constraint in the string, but not in stringbuilder. I have fixed the problem of reading file.
Yes, I know there are operations we can perform on a StringBuilder, like append, replace, remove, etc. But what is the use of it when we can't call ToString() on it and can't write it directly to a file? We have to call ToString() to actually use it, but because its size is out of string range it throws an exception.
So in particular, is there any use in a StringBuilder holding more than a string can? I read a file of around 1 GB into a StringBuilder but can't get it into a string. I read all the pros and cons of StringBuilder over String but I can't find anything explaining this.
Update:
I want to load an XMLDocument from the file. If I read it in chunks, the data cannot be loaded, because the root-level node needs its closing tag, which will be in another chunk.
Update:
I know it is not the correct approach and I am now using a different process, but I still want to know the reason for the size constraint on string but not on StringBuilder.
Update:
I have fixed my problem and want to know the reason why there is no memory constraint on StringBuilder.

Why StringBuilder Size is greater than string(~250MB).
The reason depends on the version of .net.
There are two implementations Eric Lippert mentions here: https://stackoverflow.com/a/6524401/360211
Internally a string builder maintains a char[]. When you append it may have to resize this array. In order to stop it needing to be resized every time you append, it resizes to a larger size to anticipate future appends (it actually doubles in size). So the StringBuilder often ends up larger than its content, as much as double the size.
A newer implementation maintains a linked list of char[]. If you do many small appends, the overhead of the linked list may account for the extra 250MB.
In normal use, an extra 100% size on a string temporarily doesn't make one bit of difference given the performance benefits, but when you are dealing with a GB, it becomes significant and that is not its intended usage.
Why you get OutOfMemoryException
The linked list implementation can fit more in memory than a string because it does not need one contiguous block of 1GB. When you call ToString it is forced to find another GB, which must also be contiguous, and that is the problem.
Why is there no constraint preventing this?
Well there is. The constraint is if there is not enough memory to create a string during ToString, throw an OutOfMemoryException.
You may want this to happen during Append operations, but that would be impossible to determine. StringBuilder could look at the free memory, but that might change before you call ToString. So the author of StringBuilder could have set an arbitrary limit, but that can't suit all systems equally, as some will have more memory than others.
You also might want to do operations that reduce the size of the StringBuilder before calling ToString, or not call ToString at all! So just because StringBuilder is too large to ToString at any point is not a reason to throw an exception.

You can use StringBuilder.ToString(int, int) to get smaller-sized chunks of your huge content out of the StringBuilder.
In addition, you might want to consider whether you are really using the right tool for the job. StringBuilder's purpose is to build and modify strings, not to load huge files to memory.
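As a sketch of the chunked approach, the helper below drains a StringBuilder into any TextWriter using StringBuilder.ToString(int, int), so no single string larger than chunkSize characters is ever created. The class name ChunkedWriter and the chunk size are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkedWriter
{
    // Writes the builder's content to 'destination' in pieces of at most
    // 'chunkSize' characters instead of materialising one huge string.
    public static void WriteInChunks(StringBuilder source, TextWriter destination, int chunkSize)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException("chunkSize");

        for (int offset = 0; offset < source.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, source.Length - offset);
            destination.Write(source.ToString(offset, count));
        }
    }
}
```

With a file as the target this might be used as: using (StreamWriter writer = File.CreateText("out.txt")) ChunkedWriter.WriteInChunks(sb, writer, 64 * 1024); where the file name is a placeholder.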

You can try the following to handle large XML files.
CodeProject

Does string.Replace(string, string) create additional strings?

We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string, that does not additionally create a string.

Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create a new string, because, as you already know, strings are immutable.
Why?
There is no reason to create a new string on each replacement - it's very simple to avoid it, and avoiding it gives a huge performance boost.
If you are very interested, referencesource.microsoft.com and SSCLI2.0 source code will demonstrate this(how-to-see-code-of-method-which-marked-as-methodimploptions-internalcall):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
        StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
    // unnecessary code omitted
    while (((index=COMStringBuffer::LocalIndexOfString(thisBuffer,oldBuffer,
             thisLength,oldLength,index))>-1) && (index<=endIndex-oldLength))
    {
        replaceIndex[replaceCount++] = index;
        index+=oldLength;
    }
    if (replaceCount != 0)
    {
        //Calculate the new length of the string and ensure that we have
        // sufficient room.
        INT64 retValBuffLength = thisLength -
            ((oldLength - newLength) * (INT64)replaceCount);
        gc.retValString = COMString::NewString((INT32)retValBuffLength);
        // unnecessary code omitted
    }
}
As you can see, retValBuffLength is computed from replaceCount, so the result string is allocated once at its final size. The real implementation may be a bit different for .NET 4.0 (SSCLI 4.0 has not been released), but I assure you it's not doing anything silly :-).
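The same two-pass idea (find every match first, then allocate the result once at its exact length) can be sketched in managed code. This is an illustration of the technique only, not the framework's actual implementation; TwoPassReplace is a made-up name, matching is ordinal, and newValue is assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;

static class TwoPassReplace
{
    public static string Replace(string input, string oldValue, string newValue)
    {
        if (string.IsNullOrEmpty(oldValue))
            throw new ArgumentException("oldValue must be non-empty");

        // Pass 1: record where each match starts.
        var matches = new List<int>();
        int index = 0;
        while ((index = input.IndexOf(oldValue, index, StringComparison.Ordinal)) >= 0)
        {
            matches.Add(index);
            index += oldValue.Length;
        }
        if (matches.Count == 0)
            return input;

        // Pass 2: allocate a buffer of exactly the final length and fill it.
        int newLength = input.Length + matches.Count * (newValue.Length - oldValue.Length);
        var result = new char[newLength];
        int src = 0, dst = 0;
        foreach (int match in matches)
        {
            int gap = match - src;                  // unchanged text before the match
            input.CopyTo(src, result, dst, gap);
            dst += gap;
            newValue.CopyTo(0, result, dst, newValue.Length);
            dst += newValue.Length;
            src = match + oldValue.Length;
        }
        input.CopyTo(src, result, dst, input.Length - src); // trailing text
        return new string(result);
    }
}
```

No intermediate strings such as "0101/2014" are ever produced; the only allocations are the match list, the buffer, and the final string.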
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes. Reusable StringBuilder that has capacity of ~2000 characters. Avoid any memory allocation. This is only true if the replacement lengths are equal, and can get you a nice performance gain if you're in a tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.
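A reusable buffer along those lines might look like the sketch below. It is a variation rather than exactly what the answer describes: because the separator is removed rather than replaced by something of equal length, the final ToString() still allocates the result string, and only the working buffer is reused between calls. SeparatorStripper is an invented name, and the instance is not thread-safe.

```csharp
using System;
using System.Text;

sealed class SeparatorStripper
{
    // One buffer reused for every record; 2000 is the capacity suggested above.
    private readonly StringBuilder buffer = new StringBuilder(2000);

    // Returns 'input' with every occurrence of 'separator' removed.
    public string Strip(string input, char separator)
    {
        buffer.Length = 0; // reset the content but keep the allocated capacity
        foreach (char c in input)
        {
            if (c != separator)
                buffer.Append(c);
        }
        return buffer.ToString();
    }
}
```

If the stripped text is only going to be written out, the buffer's content could be appended straight to the output StringBuilder instead of being turned into a string first, which would avoid the remaining allocation.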

Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we'll follow to StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on a string objects:
Although String is immutable in .NET, this is not some kind of limitation, it's a contract. String is actually a reference type, and what it includes is the length of the actual string + the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass the string to its constructor it actually copies the string's buffer to its own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string and copies its buffer there.
So, if you need a fast and memory efficient way to make changes in a string, StringBuilder is definitely your choice, especially given that Microsoft explicitly recommends using StringBuilder if you "perform repeated modifications to a string".

I haven't found any sources, but I strongly doubt that the implementation creates a new string for every single replacement. I'd also implement it with a StringBuilder internally. So String.Replace is absolutely fine if you want to do one replace on a huge string. But if you have to replace many times, you should consider using StringBuilder.Replace, because every call of String.Replace creates a new string.
So you can use StringBuilder.Replace since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()

There is no string method for that. You are on your own. But you can try something like this:
string oldFormat = "dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}{2}", dt[0], dt[1], dt[2]);
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.Append(dt[1]).Append(dt[2]);

How to work properly with strings in C#?

I know there is a rule about strings in C# that says:
When we create a textual string of type string, we can never change its value! When a different value is assigned to a string variable, the first string stays in memory and the variable (which is a kind of reference type) just gets the address of the new string.
So doing something like this:
string a = "aaa";
a = a.Trim(); // Creates a new string
is not recommended.
But what if I need to do some actions on the string according to user preferences, like so:
string a = "aaa";
if (doTrim)
a = a.Trim();
if (doSubstring)
a = a.Substring(...);
etc...
How can I do it without creating new strings on every action ?
I thought about sending the string to a function by ref, like so:
void DoTrim(ref string value)
{
value = value.Trim(); // also creates new string
}
But this also creates a new string...
Can someone please tell me if there is a way of doing it without wasting memory on each action?

You are correct in that the operations you're performing are creating new strings, and not mutating a single string.
You are incorrect in that this is generally problematic or something to be avoided.
If your strings are hundreds of thousands of characters, then sure, copying all of those just to remove a few leading spaces, or to add a few characters to the end of it (repeatedly, in a loop, in particular) can actually be a problem.
If your strings aren't large, and you're not performing many (as in thousands of) operations on the string, then you almost certainly don't have a problem.
Now there are a handful of contexts, generally rather rare, that do run into problems with string manipulation. Probably the most common of the problematic contexts is appending a bunch of strings together, as doing so means copying all of the previously appended data for each new addition. If you're in that situation consider using something like a StringBuilder or a single call to string.Concat (the overload accepting a sequence of strings to concat) to perform this operation.
Other contexts are, for example, programs dealing with processing DNA strands. They'll often be taking strings of millions of characters and creating hundreds of thousands of many thousand character long substrings of that string. Using standard C# string operations would therefore result in a lot of unnecessary copying. People writing such programs end up creating objects that can represent a substring of another string without copying the data and instead referring to the existing string's underlying data source with an offset.
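Such a "substring without copying" object can be sketched as a small struct that stores a reference to the original string plus an offset and a length. StringSlice is a hypothetical type written for this illustration; newer .NET versions ship ReadOnlySpan<char> and ReadOnlyMemory<char>, which cover the same need.

```csharp
using System;

struct StringSlice
{
    private readonly string source;
    private readonly int offset;
    public readonly int Length;

    public StringSlice(string source, int offset, int length)
    {
        if (source == null)
            throw new ArgumentNullException("source");
        if (offset < 0 || length < 0 || offset > source.Length - length)
            throw new ArgumentOutOfRangeException();
        this.source = source;
        this.offset = offset;
        this.Length = length;
    }

    // Reads a character through the view; nothing is copied.
    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length)
                throw new ArgumentOutOfRangeException("index");
            return source[offset + index];
        }
    }

    // A sub-slice shares the same underlying string.
    public StringSlice Slice(int start, int length)
    {
        if (start < 0 || length < 0 || start > Length - length)
            throw new ArgumentOutOfRangeException();
        return new StringSlice(source, offset + start, length);
    }

    // Only here are characters actually copied into a new string.
    public override string ToString()
    {
        return source.Substring(offset, Length);
    }
}
```

Creating hundreds of thousands of slices over one multi-million-character string then costs a few words per slice rather than a copy of each substring.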

Sticking my neck out here a bit, so I'll preface this by saying that in most cases Servy's answer is the correct answer. However, if you really do need lower level access and fewer string allocations, you could consider creating a character buffer (a simple array, for instance) that is big enough to fit your processed string and allows you direct manipulation of the characters. There are some significant downsides to this, though, including that you'll probably have to write your own Substring() and Trim() modifiers, and your buffer will likely be bigger than your input strings in many cases to accommodate unexpected string sizes. Once you are done manipulating your buffer, you could then package the character array up as a String. Since all of your manipulations are done on a single buffer, you should save a lot of allocations.
I would seriously consider if the above is worth the hassle, but if you really need the performance, this is the best solution I can think of.
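A minimal sketch of that idea follows. CharBuffer is an invented class: it copies the input into a char[] once, implements Trim and Substring by moving a start index and a length, and only builds a string at the end. It covers just those two operations and does nothing about growing the buffer.

```csharp
using System;

sealed class CharBuffer
{
    private readonly char[] chars;
    private int start;
    private int length;

    public CharBuffer(string value)
    {
        chars = value.ToCharArray(); // the one up-front copy
        start = 0;
        length = chars.Length;
    }

    // Drops leading and trailing whitespace by adjusting the window.
    public void Trim()
    {
        while (length > 0 && char.IsWhiteSpace(chars[start]))
        {
            start++;
            length--;
        }
        while (length > 0 && char.IsWhiteSpace(chars[start + length - 1]))
        {
            length--;
        }
    }

    // Narrows the window to 'count' characters beginning at 'offset'.
    public void Substring(int offset, int count)
    {
        if (offset < 0 || count < 0 || offset > length - count)
            throw new ArgumentOutOfRangeException();
        start += offset;
        length = count;
    }

    // Packages the current window up as a regular string.
    public override string ToString()
    {
        return new string(chars, start, length);
    }
}
```

With this, a chain such as Trim() followed by Substring(6, 5) allocates one array and one final string, however many steps are applied in between.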

How can I do it without creating new strings on every action?
You should only worry about that if you're handling big strings or if you're doing many string operations in a short period of time.
Even then, the performance loss due to creating more references is minimal.
The Garbage Collector has to collect all the unused string variables, but hey - that only really matters if you're doing MANY string operations.
So focus on the readability of your code first, rather than trying to optimize its performance.
If you really have to keep the same reference of string, you can simply use a StringBuilder.

Why do you feel uncomfortable creating new strings? There is a reason for the string API to be designed this way. For example, immutable objects are thread-safe (and they allow for a more functional programming style).
If you replace your simple string code by stringbuilders, your code might be more error-prone in multithreading scenarios (which is quite normal in a web application for example).
StringBuilders are used for concatenating strings, inserting characters, removing characters, etc. But they will need to reallocate and copy their internal characters arrays every now and then, too.
When you speak about memory consumption you have started to micro-optimize your code. Don't.
BTW: Have a look at the LINQ API. What does each operation do? Rats - it creates a new enumerator! A query like foos.Where(bar).Select(baz).FirstOrDefault() could certainly be memory-optimized by just creating a single enumerator object and modifying the criteria it applies when enumerating. </irony>

It will depend on what your exact use case is, but you might want to explore using the StringBuilder class which you can use to build and modify strings.

non contiguous String object C#.net

As I understand it, String and StringBuilder objects both allocate contiguous memory underneath.
My program runs for days, buffering its output in a String object. This sometimes causes an OutOfMemoryException, which I think is due to a lack of available contiguous memory. My string can grow up to 100 MB, and I am concatenating new strings frequently, which causes a new string object to be allocated each time. I can reduce the creation of new string objects by using StringBuilder, but that would not solve my problem entirely.
Is there an alternative to a contiguous string object?

A rope data structure may be an option but I don't know of any ready-to-use implementations for .NET.
Have you tried using a LinkedList of strings instead? Or perhaps you can modify your architecture to read and write a file on disk instead of keeping everything in memory.

DO NOT USE STRINGS.
Strings will copy and allocate a new string for every operation. That is, if you have a 50mb string and add one character, until garbage collection happens, you will have two (approx.) 50mb strings around.
Then, you add another char, you'll have 3.... and so on.
On the other hand, proper use of StringBuilder, that is, using "Append" should not have any problem with 100 mbs.
Another optimization is creating the StringBuilder with your estimated size,
StringBuilder SB;
SB= new StringBuilder(capacity); // being capacity the suggested starting size
Use StringBuilder to hold your big string, and then use Append.
HTH

By going so large your strings are moved to the Large Object Heap (LOH) and you run a greater risk of fragmentation.
A few options:
Use a StringBuilder. You will be re-allocating less frequently. And try to pre-allocate, like new StringBuilder(100*1000*1000);
re-design your solution. There must be alternatives to keeping such large strings around. A List<string> for instance, that is only converted to 1 single string when (really) necessary.
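The List<string> option could be sketched roughly as follows. ChunkedText is a made-up class: it keeps each appended piece as its own string, so no single large contiguous block is needed while accumulating, and it can stream the pieces to a writer without ever joining them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

sealed class ChunkedText
{
    private readonly List<string> chunks = new List<string>();
    private long totalLength;

    // Total number of characters appended so far.
    public long Length
    {
        get { return totalLength; }
    }

    // Stores a reference to the piece; no characters are copied.
    public void Append(string text)
    {
        if (string.IsNullOrEmpty(text))
            return;
        chunks.Add(text);
        totalLength += text.Length;
    }

    // Writes the pieces one by one, without building one big string.
    public void WriteTo(TextWriter writer)
    {
        foreach (string chunk in chunks)
            writer.Write(chunk);
    }

    // Only call this when a single string is really necessary:
    // it needs one contiguous block for the whole content.
    public override string ToString()
    {
        return string.Concat(chunks.ToArray());
    }
}
```

The list's internal array of references still has to be contiguous, but it is far smaller than the text itself.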

I don't believe there's any solution for this using either String or StringBuilder. Both will require contiguous memory. Is it possible to change your architecture such that you can save the ongoing data to a List, a file, a database, or some other structure designed for such purposes?

First you should examine why you are doing that and see if there are other things you can do that give you the same value.
Then you have lots of options (depending on what you need) ranging from using logging to writing a simple class that collects strings into a List.

You can try saving the string to a text file or to a database such as SQL Server Express, MySQL, MS Access, etc. This way if your server gets shut down for any reason (power outage, someone bumped the UPS, thunderstorm, etc.) you would not lose your data. It is a little slower than RAM but I think the trade-off is worth it.
If this is not an option, most definitely use the StringBuilder for adding strings.

Why are strings notoriously expensive

What is it about the way strings are implemented that makes them so expensive to manipulate?
Is it impossible to make a "cheap" string implementation?
or am I completely wrong in my understanding?
Thanks

Which language?
Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.
This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.
If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.
If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.

The problem with strings is that they are not primitive types. They are arrays.
Therefore, they suffer the same speed and memory problems as arrays(with a few optimizations, perhaps).
Now, "cheap" implementations would require lots of stuff: concatenation, indexOf, etc.
There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.

Since a new copy of the object is created every time, in Java it's advisable to use StringBuffer.
Syntax
StringBuffer strBuff = new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String string = strBuff.toString();

Many of the points here are well taken. In isolated cases you may be able to cheat and do things like using a 64-bit int to compare 8 bytes at a time in a string, but there are not a lot of generalized cases where you can optimize operations. If you have a "pascal style" string with a numeric length field, compares can be short-circuited: the rest of the string only needs to be checked if the lengths are the same. Other operations typically require you to handle the characters a byte at a time or completely copy them when you use them.
i.e. concatenation => get length of string 1, get length of string 2, allocate memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string library, but the overhead of setting it up for small strings would outweigh the benefits.
Pete
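The two operations described above can be written out step by step. This is only an illustrative sketch with invented names (ManualStringOps); the real string.Concat and string.Equals are implemented differently and avoid the extra copy that new string(char[]) makes here.

```csharp
using System;

static class ManualStringOps
{
    // Concatenation: get both lengths, allocate once, copy each input.
    public static string Concat(string first, string second)
    {
        var result = new char[first.Length + second.Length];
        first.CopyTo(0, result, 0, first.Length);
        second.CopyTo(0, result, first.Length, second.Length);
        return new string(result);
    }

    // Comparison with a length prefix: different lengths short-circuit,
    // and characters are only compared when the lengths match.
    public static bool AreEqual(string a, string b)
    {
        if (a.Length != b.Length)
            return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
                return false;
        }
        return true;
    }
}
```

Both loops are linear in the number of characters, which is the cost the answers in this thread keep coming back to.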

It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.

Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:
Concatenation.
Slicing.
Getting a character at an index.
Changing a character at an index.
Locating the index of a character.
Traversing the string.
Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.

If you want a universal string working in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between getting one thing fast and another. So... either you use a "standard" string working properly (but not in an optimal way), or a string implementation which is very fast in some cases and cumbersome in other.
Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...

Changes and copying of strings tends to involve memory management.
Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.

You want to read this Joel Spolsky article:
http://www.joelonsoftware.com/articles/fog0000000319.html
Me, I'm disappointed .NET doesn't have a native type called F***edString.
