Non-contiguous String object in C#/.NET - c#

From what I understand, String and StringBuilder objects both allocate contiguous memory underneath.
My program runs for days, buffering its output in a String object. This sometimes causes an OutOfMemoryException, which I think is due to the unavailability of a large enough contiguous block of memory. My string can grow up to ~100 MB, and I concatenate new strings frequently, which allocates a new string object each time. I can reduce new string allocations by using a StringBuilder, but that would not solve my problem entirely.
Is there an alternative to a contiguous string object?

A rope data structure may be an option but I don't know of any ready-to-use implementations for .NET.
Have you tried using a LinkedList of strings instead? Or perhaps you can modify your architecture to read and write a file on disk instead of keeping everything in memory.

DO NOT USE STRINGS.
Strings are immutable: every operation copies and allocates a new string. That is, if you have a 50 MB string and add one character, then until garbage collection happens you will have two (approximately) 50 MB strings around.
Add another character and you'll have three... and so on.
On the other hand, proper use of StringBuilder, that is, using Append, should have no problem with 100 MB.
Another optimization is creating the StringBuilder with your estimated size:
StringBuilder sb = new StringBuilder(capacity); // capacity being the suggested starting size
Use the StringBuilder to hold your big string, and then use Append.
HTH

Because your strings are so large, they are allocated on the Large Object Heap (LOH), and you run a greater risk of fragmentation.
A few options:
Use a StringBuilder. You will re-allocate less frequently. And try to pre-allocate, like new StringBuilder(100 * 1000 * 1000);
Re-design your solution. There must be alternatives to keeping such large strings around. A List<string>, for instance, that is only converted to a single string when (really) necessary.
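A minimal sketch of that List<string> approach (sizes and fragment contents are invented for illustration):

```csharp
using System;
using System.Collections.Generic;

// Collect output as many small fragments; each one is a small, GC-friendly
// object instead of one growing 100 MB contiguous block.
var parts = new List<string>();
for (int i = 0; i < 1000; i++)
{
    parts.Add("chunk" + i);
}

// Materialize one contiguous string only when it is really needed:
// string.Concat sizes the final buffer once, so there is a single large
// allocation instead of one per concatenation.
string result = string.Concat(parts);
Console.WriteLine(result.Length);
```

Note that the final Concat still needs one contiguous block, so if even that fails, streaming to disk remains the fallback.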

I don't believe there's any solution for this using either String or StringBuilder. Both will require contiguous memory. Is it possible to change your architecture such that you can save the ongoing data to a List, a file, a database, or some other structure designed for such purposes?

First you should examine why you are doing that and see if there are other things you can do that give you the same value.
Then you have lots of options (depending on what you need) ranging from using logging to writing a simple class that collects strings into a List.

You can try saving the string to durable storage such as a text file, SQL Server Express, MySQL, MS Access, etc. This way, if your server gets shut down for any reason (power outage, someone bumped the UPS, thunderstorm, etc.), you would not lose your data. It is a little slower than RAM, but I think the trade-off is worth it.
If this is not an option -- most definitely use a StringBuilder for appending strings.

Related

String Builder and string size

Why can a StringBuilder grow larger (~250 MB) than a string?
Please read the question. I want to know the reason for the size constraint on string but not on StringBuilder. I have already fixed the problem of reading the file.
Yes, I know there are operations we can perform on a StringBuilder, like Append, Replace, Remove, etc. But what is the use of them when we can't get ToString() from it and can't write it directly to a file? We have to call ToString() to actually use it, but because its size is beyond the string's range, it throws an exception.
So, in particular, is there any use in a StringBuilder being able to grow larger than a string? I read a file of around 1 GB into a StringBuilder but can't get it into a string. I have read all the pros and cons of StringBuilder over String, but I can't find anything explaining this.
Update:
I want to load an XmlDocument from a file. If I read it in chunks, the data cannot be loaded, because a root-level node needs its closing tag, which will be in another chunk.
Update:
I know it is not a correct approach; I am now using a different process, but I still want to know the reason for the size constraint on string but not on StringBuilder.
Update:
I have fixed my problem and want to know the reason why there is no memory constraint on StringBuilder.
Why can a StringBuilder grow larger (~250 MB) than a string?
The reason depends on the version of .NET.
There are two implementations, which Eric Lippert mentions here: https://stackoverflow.com/a/6524401/360211
Internally, a StringBuilder maintains a char[]. When you append, it may have to resize this array. To avoid resizing the array on every append, it grows to a larger size in anticipation of future appends (it actually doubles in size). So the StringBuilder often ends up larger than its content, by as much as double the size.
A newer implementation maintains a linked list of char[]. If you do many small appends, the overhead of the linked list may account for the extra 250MB.
In normal use, an extra 100% size on a string temporarily doesn't make one bit of difference given the performance benefits, but when you are dealing with a GB, it becomes significant and that is not its intended usage.
Why you get OutOfMemoryException
The linked-list implementation can fit more in memory than a string because it does not need one contiguous block of 1 GB. When you call ToString, it is forced to find another GB that is also contiguous, and that is the problem.
Why is there no constraint preventing this?
Well there is. The constraint is if there is not enough memory to create a string during ToString, throw an OutOfMemoryException.
You may want this to happen during Append operations, but that would be impossible to determine. StringBuilder could look at the free memory, but that might change before you call ToString. So the author of StringBuilder could have set an arbitrary limit, but that can't suit all systems equally, as some will have more memory than others.
You also might want to do operations that reduce the size of the StringBuilder before calling ToString, or not call ToString at all! So just because StringBuilder is too large to ToString at any point is not a reason to throw an exception.
You can use StringBuilder.ToString(int, int) to get smaller-sized chunks of your huge content out of the StringBuilder.
In addition, you might want to consider whether you are really using the right tool for the job. StringBuilder's purpose is to build and modify strings, not to load huge files to memory.
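The chunked extraction via ToString(startIndex, length) might look like this (chunk size and data are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

var sb = new StringBuilder();
for (int i = 0; i < 100; i++)
{
    sb.Append("0123456789");              // 1,000 chars in total
}

const int chunkSize = 64;                 // illustrative; tune for real data
var chunks = new List<string>();
for (int start = 0; start < sb.Length; start += chunkSize)
{
    int len = Math.Min(chunkSize, sb.Length - start);
    chunks.Add(sb.ToString(start, len));  // copies only this one chunk
}

// In real code each chunk could be written straight to a StreamWriter
// instead of being kept in a list.
Console.WriteLine(chunks.Count);
```

Each call copies only one chunk-sized slice, so no single allocation ever has to be as large as the whole content.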
You can try the following CodeProject article on handling large XML files:
CodeProject

How to work properly with strings in C#?

I know there is a rule about strings in C# that says:
When we create a textual string of type string, we can never change its value! When we assign a different value to a string variable, the first string stays in memory and the variable (which is a kind of reference type) just gets the address of the new string.
So doing something like this:
string a = "aaa";
a = a.Trim(); // Creates a new string
is not recommended.
But what if I need to do some actions on the string according to user preferences, like so:
string a = "aaa";
if (doTrim)
a = a.Trim();
if (doSubstring)
a = a.Substring(...);
etc...
How can I do it without creating new strings on every action ?
I thought about passing the string to a function by ref, like so:
void DoTrim(ref string value)
{
    value = value.Trim(); // also creates a new string
}
But this also creates a new string...
Can someone please tell me if there is a way of doing it without wasting memory on each action?
You are correct in that the operations you're performing are creating new strings, and not mutating a single string.
You are incorrect in that this is generally problematic or something to be avoided.
If your strings are hundreds of thousands of characters, then sure, copying all of those just to remove a few leading spaces, or to add a few characters to the end of it (repeatedly, in a loop, in particular) can actually be a problem.
If your strings aren't large, and you're not performing many (as in thousands of) operations on the string, then you almost certainly don't have a problem.
Now there are a handful of contexts, generally rather rare, that do run into problems with string manipulation. Probably the most common of the problematic contexts is appending a bunch of strings together, as doing so means copying all of the previously appended data for each new addition. If you're in that situation consider using something like a StringBuilder or a single call to string.Concat (the overload accepting a sequence of strings to concat) to perform this operation.
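The difference between repeated appends and a single Concat call can be sketched like this (tiny sizes, purely illustrative):

```csharp
using System;
using System.Linq;

var items = Enumerable.Range(0, 5).Select(i => i.ToString()).ToArray();

// Quadratic pattern: every += copies everything accumulated so far.
string slow = "";
foreach (var s in items)
{
    slow += s;
}

// Linear alternative: string.Concat computes the total length first and
// fills one correctly-sized buffer.
string fast = string.Concat(items);

Console.WriteLine(slow == fast);
```

Both produce the same string; the difference is only in how much intermediate copying happens along the way.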
Other contexts are, for example, programs dealing with processing DNA strands. They'll often be taking strings of millions of characters and creating hundreds of thousands of many thousand character long substrings of that string. Using standard C# string operations would therefore result in a lot of unnecessary copying. People writing such programs end up creating objects that can represent a substring of another string without copying the data and instead referring to the existing string's underlying data source with an offset.
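In modern .NET (Core 2.1 and later), that "substring referring to the existing data" idea is available out of the box as ReadOnlyMemory<char>; a toy sketch (the data is invented):

```csharp
using System;

string strand = "ACGTACGTACGTACGT";       // toy stand-in for a huge DNA string

// AsMemory records only an offset and a length into the original string;
// no characters are copied until ToString() is called on the slice.
ReadOnlyMemory<char> view = strand.AsMemory(4, 8);

Console.WriteLine(view.Length);
Console.WriteLine(view.ToString());       // the copy happens only here
```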
Sticking my neck out here a bit so I'll preface with saying in most cases Servy's answer is the correct answer. However, if you really do need lower level access and less string allocations, you could consider creating a character buffer (simple array for instance) that is big enough to fit your processed string and allow you direct manipulation of the characters. There are some significant downfalls to this, though. Including that you'll probably have to write your own Substring() and Trim() modifiers, and your buffer will likely be bigger than your input strings in many cases to accommodate unexpected string sizes. Once you are done manipulating your buffer, you could then package the character array up as a String. Since all of your manipulations are done on a single buffer, you should save a lot of allocations.
I would seriously consider if the above is worth the hassle, but if you really need the performance, this is the best solution I can think of.
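A hand-rolled flavour of that buffer idea: do the "trim" by moving indices over a char[] and allocate the final string only once (the input is illustrative):

```csharp
using System;

char[] buffer = "   hello world   ".ToCharArray();
int start = 0, length = buffer.Length;

// "Trim" without allocating: just move the window's start and length.
while (length > 0 && char.IsWhiteSpace(buffer[start]))
{
    start++;
    length--;
}
while (length > 0 && char.IsWhiteSpace(buffer[start + length - 1]))
{
    length--;
}

// One allocation at the end, no matter how many operations were applied.
string result = new string(buffer, start, length);
Console.WriteLine(result);
```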
How can I do it without creating new strings on every action?
You should only worry about that if you're handling big strings or if you're doing many string operations in a short period of time.
Even then, the performance loss due to creating more references is minimal.
The Garbage Collector has to collect all the unused string variables, but hey - that only really matters if you're doing MANY string operations.
So rather focus on readability in your code, rather than trying to optimize its performance in the first place.
If you really have to keep the same reference of string, you can simply use a StringBuilder.
Why do you feel uncomfortable creating new strings? There is a reason for the string API to be designed this way. For example, immutable objects are thread-safe (and they allow for a more functional programming style).
If you replace your simple string code by stringbuilders, your code might be more error-prone in multithreading scenarios (which is quite normal in a web application for example).
StringBuilders are used for concatenating strings, inserting characters, removing characters, etc. But they will need to reallocate and copy their internal characters arrays every now and then, too.
When you speak about memory consumption you have started to micro-optimize your code. Don't.
BTW: Have a look at the LINQ API. What does each operation do? Rats - it creates a new enumerator! A query like foos.Where(bar).Select(baz).FirstOrDefault() could certainly be memory-optimized by just creating a single enumerator object and modifying the criteria it applies when enumerating. </irony>
It will depend on what your exact use case is, but you might want to explore using the StringBuilder class which you can use to build and modify strings.

Faster way to get an SQL-compatible string from a Guid

I noticed this particular line of code when I was profiling my application (which creates a boatload of database insertions by processing some raw data):
myStringBuilder.AppendLine(
    string.Join(
        BULK_SEPARATOR, new string[] { myGuid.ToString() ...
Bearing in mind that the resultant string is going to end up in a file called via the TSQL command BULK INSERT, is there a way I can do this step faster? I know that getting a byte array is faster, but I can't just plug that into the file.
The fastest and simplest way would be not to use BULK INSERT with a raw file at all. Instead, use the SqlBulkCopy class. That should speed this up significantly by sending the data directly over the pipe instead of using an intermediate file.
(You'll also be able to use the Guid directly without any string conversions, although I can't be 100% sure what SqlBulkCopy does internally with it.)
You aren't indicating where you are getting the Guid from. Also, I don't believe that getting the bytes is going to be any faster, as you would be doing what the ToString method on the Guid class already does: iterating through the bytes and converting them to a string value.
Rather, I think a few general areas that this code can possibly be improved upon in terms of performance are (and assuming you are doing this in a loop):
Are you reusing the myStringBuilder instance upon a new iteration of your loop? You should be setting the Length (not the Capacity) property to 0, and then rebuild your string using that. This will prevent having to warm up a new StringBuilder instance, and the memory allocation for a larger string will already have been made.
Use calls to Append on myStringBuilder instead of calling String.Join. String.Join is going to preallocate a bunch of memory and then return a string instance which you will just allocate again (if on the first iteration) or copy into already-allocated space. There is no reason to do this twice. Instead, iterate through the array you are creating (or unroll the loop; it seems you have a fixed-size array) and call Append, passing in the Guid and then the BULK_SEPARATOR. It's easier to remove the single trailing character this way, btw: just decrement the Length property of the StringBuilder by one if you actually appended Guids.
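Both suggestions combined might look roughly like this (the separator, row/column counts, and names are invented for the sketch):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

const char BULK_SEPARATOR = ',';          // assumed separator
var lines = new List<string>();
var sb = new StringBuilder(1024);         // one instance, reused per row

for (int row = 0; row < 3; row++)
{
    sb.Length = 0;                        // reset, keeping the grown buffer
    for (int col = 0; col < 4; col++)
    {
        sb.Append(Guid.NewGuid());        // Append directly, no String.Join
        sb.Append(BULK_SEPARATOR);
    }
    sb.Length--;                          // drop the trailing separator
    lines.Add(sb.ToString());
}

Console.WriteLine(lines.Count);
```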
If time is critical: could you pre-create a sufficiently long list of GUIDs converted to strings ahead of time, and then use it in your code? Either in C#, or possibly in SQL Server, depending on your requirements?

String concatenation in C# with interned strings

I know this question has been done but I have a slightly different twist to it. Several have pointed out that this is premature optimization, which is entirely true if I were asking for practicality's sake and practicality's sake only. My problem is rooted in a practical problem but I'm still curious nonetheless.
I'm creating a bunch of SQL statements to create a script (as in, it will be saved to disk) to recreate a database schema (easily many hundreds of tables, views, etc.). This means my string concatenation is append-only. StringBuilder, according to MSDN, works by keeping an internal buffer (surely a char[]) and copying string characters into it, reallocating the array as necessary.
However, my code has a lot of repeat strings ("CREATE TABLE [", "GO\n", etc.) which means I can take advantage of them being interned but not if I use StringBuilder since they would be copied each time. The only variables are essentially table names and such that already exist as strings in other objects that are already in memory.
So as far as I can tell that after my data is read in and my objects created that hold the schema information then all my string information can be reused by interning, yes?
Assuming that, then wouldn't a List or LinkedList of strings be faster because they retain pointers to interned strings? Then it's only one call to String.Concat() for a single memory allocation of the whole string that is exactly the correct length.
A List would have to reallocate string[] of interned pointers and a linked list would have to create nodes and modify pointers, so they aren't "free" to do but if I'm concatenating many thousands of interned strings then they would seem like they would be more efficient.
Now I suppose I could come up with some heuristic on character counts for each SQL statement & count each type and get a rough idea and pre-set my StringBuilder capacity to avoid reallocating its char[] but I would have to overshoot by a fair margin to reduce the probability of reallocating.
So for this case, which would be fastest to get a single concatenated string:
StringBuilder
List<string> of interned strings
LinkedList<string> of interned strings
StringBuilder with a capacity heuristic
Something else?
As a separate question (I may not always go to disk) to the above: would a single StreamWriter to an output file be faster yet? Alternatively, use a List or LinkedList then write them to a file from the list instead of first concatenating in memory.
EDIT:
As requested, the reference (.NET 3.5) to MSDN. It says: "New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer." To me that means a char[] that is reallocated to a larger size (which requires copying the old data into the resized array) before appending.
For your separate question, Win32 has a WriteFileGather function, which could efficiently write a list of (interned) strings to disk - but it would make a notable difference only when being called asynchronously, as the disk write will overshadow all but extremely large concatenations.
For your main question: unless you are reaching megabytes of script, or tens of thousands of scripts, don't worry.
You can expect StringBuilder to double the allocation size on each reallocation. That would mean growing a buffer from 256 bytes to 1MB is just 12 reallocations - quite good, given that your initial estimate was 3 orders of magnitude off the target.
Purely as an exercise, some estimates: building a buffer of 1 MB will sweep roughly 3 MB of memory (1 MB source, 1 MB target, 1 MB due to copying during reallocation).
A linked-list implementation will sweep about 2 MB (and that's ignoring the 8-byte per-object overhead per string reference). So you are saving 1 MB of memory reads/writes, compared to a typical memory bandwidth of 10 Gbit/s and a 1 MB L2 cache.
Yes, a list implementation is potentially faster, and the difference would matter if your buffers are an order of magnitude larger.
For the much more common case of small strings, the algorithmic gain is negligible, and easily offset by other factors: the StringBuilder code is likely in the code cache already, and a viable target for microoptimizations. Also, using a string internally means no copy at all if the final string fits the initial buffer.
Using a linked list will also bring down the reallocation problem from O(number of characters) to O(number of segments) - your list of string references faces the same problem as a string of characters!
So, IMO the implementation of StringBuilder is the right choice, optimized for the common case, and degrades mostly for unexpectedly large target buffers. I'd expect a list implementation to degrade for very many small segments first, which is actually the extreme kind of scenario StringBuilder is trying to optimize for.
Still, it would be interesting to see a comparison of the two ideas, and when the list starts to be faster.
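A tiny harness along those lines (piece count is arbitrary; tick counts vary wildly by machine, so only result equality is checked):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

var pieces = new List<string>();
for (int i = 0; i < 10_000; i++)
{
    pieces.Add("GO\n");                       // repeated fragment, as in the script case
}

var sw = Stopwatch.StartNew();
var sb = new StringBuilder();
foreach (var p in pieces)
{
    sb.Append(p);                             // copies each fragment into the char[]
}
string viaBuilder = sb.ToString();
long builderTicks = sw.ElapsedTicks;

sw.Restart();
string viaConcat = string.Concat(pieces);     // one exact-size final allocation
long concatTicks = sw.ElapsedTicks;

Console.WriteLine($"builder={builderTicks} concat={concatTicks} equal={viaBuilder == viaConcat}");
```

For a meaningful comparison you would sweep the fragment count and size; this only shows the shape of such a test.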
If I were implementing something like this, I would never build a StringBuilder (or any other in memory buffer of your script).
I would just stream it out to your file instead, and make all strings inline.
Here's an example in pseudo-code (not exactly syntactically correct):
using (StreamWriter f = new StreamWriter("yourscript.sql"))
{
    foreach (Table t in myTables)
    {
        f.Write("CREATE TABLE [");
        f.Write(t.ToString());
        f.Write("]");
        // ....
    }
}
Then, you'll never need an in memory representation of your script, with all the copying of strings.
Opinions?
In my experience, a properly allocated StringBuilder outperforms almost everything else for large amounts of string data. It's even worth wasting some memory by overshooting your estimate by 20% or 30% in order to prevent reallocation. I don't currently have hard numbers of my own to back this up, but take a look at this page for more.
However, as Jeff is fond of pointing out, don't prematurely optimize!
EDIT: As #Colin Burnett pointed out, the tests that Jeff conducted don't agree with Brian's tests, but the point of linking Jeff's post was about premature optimization in general. Several commenters on Jeff's page noted issues with his tests.
Actually, StringBuilder uses an instance of String internally. String is in fact mutable within the System assembly, which is why StringBuilder can be built on top of it. You can make StringBuilder a wee bit more efficient by specifying a reasonable capacity when you create the instance. That way you will eliminate or reduce the number of resize operations.
String interning works for strings that can be identified at compile time. Thus, if you generate a lot of strings during execution, they will not be interned unless you do so yourself by calling String.Intern.
Interning will only benefit you if your strings are identical. Almost-identical strings don't benefit from interning, so "SOMESTRINGA" and "SOMESTRINGB" will be two different strings even if they are interned.
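The compile-time vs. run-time distinction can be demonstrated directly (string values invented for the demo):

```csharp
using System;

string literal = "CREATE TABLE [";                 // interned at compile time

// Built at run time, so it is a distinct object with equal contents.
string built = string.Concat("CREATE ", "TABLE [");

Console.WriteLine(ReferenceEquals(literal, built));                  // False
Console.WriteLine(ReferenceEquals(literal, string.Intern(built)));   // True
```

string.Intern returns the pooled instance, which is the same object the compile-time literal refers to.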
If all (or most) of the strings being concatenated are interned, then your scheme MIGHT give you a performance boost, as it could potentally use less memory, and could save a few large string copies.
However, whether or not it actually improves perf depends on the volume of data you are processing, because the improvement is in constant factors, not in the order of magnitude of the algorithm.
The only way to really tell is to run your app using both ways and measure the results. However, unless you are under significant memory pressure, and need a way to save bytes, I wouldn't bother and would just use string builder.
A StringBuilder doesn't use a char[] to store the data, it uses an internal mutable string. That means that there is no extra step to create the final string as it is when you concatenate a list of strings, the StringBuilder just returns the internal string buffer as a regular string.
The reallocations that the StringBuilder performs to increase capacity mean that the data is, on average, copied an extra 1.33 times. If you can provide a good estimate of the size when you create the StringBuilder, you can reduce that even further.
However, to get a bit of perspective, you should look at what it is that you are trying to optimise. What will take most of the time in your program is to actually write the data to disk, so even if you can optimise your string handling to be twice as fast as using a StringBuilder (which is very unlikely), the overall difference will still only be a few percent.
Have you considered C++ for this? Is there a library class that already builds T/SQL expressions, preferably written in C++.
The slowest thing about strings is malloc; it takes 4 KB per string on 32-bit platforms. Consider optimizing the number of string objects created.
If you must use C#, I'd recommend something like this:
string varString1 = tableName;
string varString2 = tableName;
StringBuilder sb1 = new StringBuilder("const expression");
sb1.Append(varString1);
StringBuilder sb2 = new StringBuilder("const expression");
sb2.Append(varString2);
string resultingString = sb1.ToString() + sb2.ToString();
I would even go as far as letting the computer evaluate the best path for object instantiation with dependency injection frameworks, if perf is THAT important.

Why are strings notoriously expensive

What is it about the way strings are implemented that makes them so expensive to manipulate?
Is it impossible to make a "cheap" string implementation?
or am I completely wrong in my understanding?
Thanks
Which language?
Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.
This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.
If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.
If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.
The problem with strings is that they are not primitive types; they are arrays.
Therefore, they suffer the same speed and memory problems as arrays (with a few optimizations, perhaps).
Now, "cheap" implementations would require lots of stuff: concatenation, indexOf, etc.
There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.
Since Java creates a new copy of the string object on every change, it's advisable to use StringBuffer there.
Syntax:
StringBuffer strBuff = new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String string = strBuff.toString();
Many of the points here are well taken. In isolated cases you may be able to cheat and do things like using a 64-bit int to compare 8 bytes at a time in a string, but there are not a lot of generalized cases where you can optimize operations. If you have a "Pascal-style" string with a numeric length field, a compare can short-circuit: it only needs to check the rest of the string when the lengths are the same. Other operations typically require you to handle the characters a byte at a time, or to copy them completely when you use them.
i.e. concatenation => get the length of string 1, get the length of string 2, allocate memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string library, but the overhead of setting it up for small strings would outweigh the benefits.
Pete
It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.
Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:
Concatenation.
Slicing.
Getting a character at an index.
Changing a character at an index.
Locating the index of a character.
Traversing the string.
Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.
If you want a universal string working in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between getting one thing fast and another. So... either you use a "standard" string working properly (but not in an optimal way), or a string implementation which is very fast in some cases and cumbersome in other.
Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...
Changes and copying of strings tends to involve memory management.
Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.
You want to read this Joel Spolsky article:
http://www.joelonsoftware.com/articles/fog0000000319.html
Me, I'm disappointed .NET doesn't have a native type called F***edString.
