Does String.ToLower() return the same reference (e.g. without allocating any new memory) if all the characters are already lower-case?
Memory allocation is cheap, but running a quick check on zillions of short strings is even cheaper. Most of the time the input I'm working with is already lower-case, but I want to make it that way if it isn't.
I'm working with C# / .NET in particular, but my curiosity extends to other languages so feel free to answer for your favorite one!
NOTE: Strings are immutable but that does not mean a function always has to return a new one, rather it means nothing can change their character content.
I expect so, yes. A quick test agrees (but this is not evidence):
string a = "abc", b = a.ToLower();
bool areSame = ReferenceEquals(a, b); // false
In general, try to work with comparers that do what you want. For example, if you want a case-insensitive dictionary, use one:
var lookup = new Dictionary<string, int>(
StringComparer.InvariantCultureIgnoreCase);
Likewise:
bool ciEqual = string.Equals("abc", "ABC",
StringComparison.InvariantCultureIgnoreCase);
String is an immutable. String.ToLower() will always return new instance thereby generating another instance on every ToLower() call.
Java implementation of String.toLowerCase() from Sun actually doesn't always allocate new String. It checks if all chars are lowercase, and if so, it returns original string.
[edit]
Interning doesn't help -- see the comments to this answer.
If you use the following code it will not allocate new memory and it will overwrite the original string (this may or may not be what you want). It expects an ascii string. Expect weird things to occur if you call this on strings returned from functions you do not control.
public static unsafe void UnsafeToLower(string asciiString)
{
fixed (char* pstr = asciiString)
{
for(char* p = pstr; *p != 0; ++p)
*p = (*p > 0x40) && (*p < 0x5b) ? (char)(*p | 0x60) : (*p);
}
}
It takes about 25% as long as ToLowerInvariant and avoids memory allocation.
I would only use something like this if you are doing say 100,000 or more strings regularly inside a tight loop.
Related
In C, if we have an array, we can pass it by reference to a function. We can also use simple addition of (n-1) to pass the reference starting from n-th element of the array like this:
char *strArr[5];
char *str1 = "I want that!\n";
char *str2 = "I want this!\n";
char *str3 = "I want those!\n";
char *str4 = "I want these!\n";
char *str5 = "I want them!\n";
strArr[0] = str1;
strArr[1] = str2;
strArr[2] = str3;
strArr[3] = str4;
strArr[4] = str5;
printPartially(strArr + 1, 4); //we can pass like this to start printing from 2nd element
....
void printPartially(char** strArrPart, char size){
int i;
for (i = 0; i < size; ++i)
printf(strArrPart[i]);
}
Resulting in these:
I want this!
I want those!
I want these!
I want them!
Process returned 0 (0x0) execution time : 0.006 s
Press any key to continue.
In C#, we can also pass reference to an object by ref (or, out). The object includes array, which is the whole array (or at least, this is how I suppose it works). But how are we to pass by reference to the n-th element of the array such that internal to the function, there is only string[] whose elements are one less than the original string[] without the need to create new array?
Must we use unsafe? I am looking for a solution (if possible) without unsafe
Edit:
I understand that we could pass Array in C# without ref keyword. Perhaps my question sounds quite misleading by mentioning ref when we talk about Array. The point why I put ref there, I should rather put it this way: is the ref keyword can be used, say, to pass the reference to n-th element of the array as much as C does other than passing reference to any object (without mentioning the n-th element or something alike)? My apology for any misunderstanding occurs by my question's phrasing.
The "safe" approach would be to pass an ArraySegment struct instead.
You can of course pass a pointer to a character using unsafe c#, but then you need to worry about buffer overruns.
Incidentally, an Array in C# is (usually) allocated on the heap, so passing it normally (without ref) doesn't mean copying the array- it's still a reference that is passed (just a new one).
Edit:
You won't be able to do it as you do in C in safe code.
A C# array (i.e. string[]) is derived from abstract type Array.
It is not only a simple memory block as it is in C.
So you can't send one of it's element's reference and start iterate from there.
But there are some solutions which will give you the same taste of course (without unsafe):
Like:
As #Chris mentioned you can use ArraySegment<T>.
As Array is also an IEnumerable<T> you can use .Skip and send the returned value. (but this will give you an IEnumerable<T> instead of an Array). But it will allow you iterate.
etc...
If the method should only read from the array, you can use linq:
string[] strings = {"str1", "str2", "str3", ...."str10"};
print(strings.Skip(1).Take(4).ToArray());
Your confusion is a very common one. The essential point is realizing that "reference types" and "passing by reference" (ref keyboard) are totally independent. In this specific case, since string[] is a reference type (as are all arrays), it means the object is not copied when you pass it around, hence you are always referring to the same object.
Modified Version of C# Code:
string[] strArr = new string[5];
strArr[0] = "I want that!\n";
strArr[1] = "I want this!\n";
strArr[2] = "I want those!\n";
strArr[3] = "I want these!\n";
strArr[4] = "I want them!\n";
printPartially(strArr.Skip(1).Take(4).ToArray());
void printPartially(string[] strArr)
{
foreach (string str in strArr)
{
Console.WriteLine(str);
}
}
Question is old, but maybe answer will be useful for someone.
As of C# 7.2 there are much more types to use in that case, ex. Span or Memory.
They allow exactly for the thing you mentioned in your question (and much more).
Here's great article about them
Currently, if you want to use them, remeber to add <LangVersion>7.2</LangVersion> in .csproj file of your project to use C# 7.2 features
After doing some profiling, we've discovered that the current way in which our app concatenates strings causes an enormous amount of memory churn and CPU time.
We're building a List<string> of strings to concatenate that is on the order of 500 thousand elements long, referencing several hundred megabytes worth of strings. We're trying to optimize this one small part of our app since it seems to account for a disproportionate amount of CPU and memory usage.
We do a lot of text processing :)
Theoretically, we should be able to perform the concatenation in a single allocation and N copies - we can know how many total characters are available in our string, so it should just be as simple as summing up the lengths of the component strings and allocating enough underlying memory to hold the result.
Assuming we're starting with a pre-filled List<string>, is it possible to concatenate all strings in that list using a single allocation?
Currently, we're using the StringBuilder class, but this stores its own intermediate buffer of all of the characters - so we have an ever growing chunk array, with each chunk storing a copy of the characters we're giving it. Far from ideal. The allocations for the array of chunks aren't horrible, but the worst part is that it allocates intermediate character arrays, which means N allocations and copies.
The best we can do right now is to call List<string>.ToArray() - which performs one copy of a 500k element array - and pass the resulting string[] to string.Concat(params string[]). string.Concat() then performs two allocations, one to copy the input array into an internal array, and the one to allocate the destination string's memory.
From referencesource.microsoft.com:
public static String Concat(params String[] values) {
if (values == null)
throw new ArgumentNullException("values");
Contract.Ensures(Contract.Result<String>() != null);
// Spec#: Consider a postcondition saying the length of this string == the sum of each string in array
Contract.EndContractBlock();
int totalLength=0;
// -----------> Allocation #1 <---------
String[] internalValues = new String[values.Length];
for (int i=0; i<values.Length; i++) {
string value = values[i];
internalValues[i] = ((value==null)?(String.Empty):(value));
totalLength += internalValues[i].Length;
// check for overflow
if (totalLength < 0) {
throw new OutOfMemoryException();
}
}
return ConcatArray(internalValues, totalLength);
}
private static String ConcatArray(String[] values, int totalLength) {
// -----------------> Allocation #2 <---------------------
String result = FastAllocateString(totalLength);
int currPos=0;
for (int i=0; i<values.Length; i++) {
Contract.Assert((currPos <= totalLength - values[i].Length),
"[String.ConcatArray](currPos <= totalLength - values[i].Length)");
FillStringChecked(result, currPos, values[i]);
currPos+=values[i].Length;
}
return result;
}
Thus, in the best case, we have three allocations, two for arrays referencing the component strings, and one for the destination concatenated string.
Can we improve on this? Is it possible to concatenate a List<string> using a single allocation and a single loop of character copies?
Edit 1
I'd like to summarize the various approaches discussed so far, and why they are still sub-optimal. I'd also like to set the parameters of the situation in concrete a little more, since I've received a lot of questions that try to side step the central question.
...
First, the structure of the code that I am working within. There are three layers:
Layer one is a set of methods that produce my content. These methods return small-ish string objects, which I will call my 'component' strings'. These string objects will eventually be concatenated into a single string. I do not have the ability to modify these methods; I have to face the reality that they return string objects and move forward.
Layer two is my code that calls these content producers and assembles the output, and is the subject of this question. I must call the content producer methods, collect the strings they return, and eventually concatenate the returned strings into a single string (reality is a little more complex; the returned strings are partitioned depending on how they're routed for output, and so I have several sets of large collections of strings).
Layer three is a set of methods that accept a single large string for further processing. Changing the interface of that code is beyond my control.
Talking about some numbers: a typical batch run will collect ~500000 strings from the content producers, representing about 200-500 MB of memory. I need the most efficient way to concatenate these 500k strings into a single string.
...
Now I'd like to examine the approaches discussed so far. For the sake of numbers, assume we're running 64-bit, assume that we are collecting 500000 string objects, and assume that the aggregate size of the string objects totals 200 megabytes worth of character data. Also, assume that the original string object's memory is not counted toward any approach's total in the below analysis. I make this assumption because it is necessarily common to any and all approaches, because it is an assumption that we cannot change the interface of the content producers - they return 500k relatively small fully formed strings objects that I must then accept and somehow concatenate. As stated above, I cannot change this interface.
Approach #1
Content producers ----> StringBuilder ----> string
Conceptually, this would be invoking the content producers, and directly writing the strings they return to a StringBuilder, and then later calling StringBuilder.ToString() to obtain the concatenated string.
By analyzing StringBuilder's implementation, we can see that the cost of this boils down to 400 MB of allocations and copies:
During the stage where we collect the output from the content producers, we're writing 200 MB of data to the StringBuilder. We would be performing one 200 MB allocation to pre-allocate the StringBuilder, and then 200 MB worth of copies as we copy and discard the strings returned from the content producers
After we've collected all output from the content producers and have a fully formed StringBuilder, we then need to call StringBuilder.ToString(). This performs exactly one allocation (string.FastAllocateString()), and then copies the string data from its internal buffers to the string object's internal memory.
Total cost: approximately 400 MB of allocations and copies
Approach #2
Content producers ---> pre-allocated char[] ---> string
This strategy is fairly simple. Assuming we know roughly how much character data we're going to be collecting from the producers, we can pre-allocate a char[] that is 200 MB large. Then, as we call the content producers, we copy the strings they return into our char[]. This accounts for 200 MB of allocations and copies. The final step to turn this into a string object is to pass it to the new string(char[]) constructor. However, since strings are immutable and arrays are not, the constructor will make a copy of that entire array, causing it to allocate and copy another 200 MB of character data.
Total cost: approximately 400 MB of allocations and copies
Approach #3:
Content producers ---> List<string> ----> string[] ----> string.Concat(string[])
Pre-allocate a List<string> to be about 500k elements - approximately 4 MB of allocations for List's underlying array (500k * 8 bytes per pointer == 4 MB of memory).
Call all of the content producers to collect their strings. Approximately 4 MB of copies, as we copy the pointer to the returned string into List's underlying array.
Call List<string>.ToArray() to obtain a string[]. Approximately 4 MB of allocations and copies (again, we're really just copying pointers).
Call string.Concat(string[]):
Concat will make a copy of the array provided to it before it does any real work. Approximately 4 MB of allocations and copies, again.
Concat will then allocate a single 'destination' string object using the internal string.FastAllocateString() special method. Approximately 200 MB of allocations.
Concat will then copy strings from its internal copy of the provided array directly into the destination. Approximately 200 MB of copies.
Total cost: approximately 212 MB of allocations and copies
None of these approaches are ideal, however approach #3 is very close. We're assuming that the absolute minimum of memory that needs to be allocated and copied is 200 MB (for the destination string), and here we get pretty close - 212 MB.
If there were a string.Concat overload that 1) Accepted an IList<string> and 2) did not make a copy of that IList before using it, then the problem would be solved. No such method is provided by .Net, hence the subject of this question.
Edit 2
Progress on a solution.
I've done some testing with some hacked IL, and found that directly invoking string.FastAllocateString(n) (which is not usually invokable...) is about as fast as invoking new string('\0', n), and both seem to allocate exactly as much memory as is expected.
From there, it seems its possible to acquire a pointer to the freshly allocated string using the unsafe and fixed statements.
And so, a rough solution begins to appear:
private static string Concat( List<string> list )
{
int concatLength = 0;
for( int i = 0; i < list.Count; i++ )
{
concatLength += list[i].Length;
}
string newString = new string( '\0', concatLength );
unsafe
{
fixed( char* ptr = newString )
{
...
}
}
return newString;
}
The next biggest hurdle is implementing or finding an efficient block copy method, ala Buffer.BlockCopy, except one that will accept char* types.
If you can determine the length of the concatenation before trying to perform the operation, a char array can beat string builder in some use cases. Manipulating the characters within the array prevents the multiple allocations.
See: http://blogs.msdn.com/b/cisg/archive/2008/09/09/performance-analysis-reveals-char-array-is-better-than-stringbuilder.aspx
UPDATE
Please check out this internal implementation of the String.Join from .NET - it uses unsafe code with pointers to avoid multiple allocations. Unless I'm missing something, it would seem you can re-write this using your List to accomplish what you want:
[System.Security.SecuritySafeCritical] // auto-generated
public unsafe static String Join(String separator, String[] value, int startIndex, int count) {
//Range check the array
if (value == null)
throw new ArgumentNullException("value");
if (startIndex < 0)
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_StartIndex"));
if (count < 0)
throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_NegativeCount"));
if (startIndex > value.Length - count)
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_IndexCountBuffer"));
Contract.EndContractBlock();
//Treat null as empty string.
if (separator == null) {
separator = String.Empty;
}
//If count is 0, that skews a whole bunch of the calculations below, so just special case that.
if (count == 0) {
return String.Empty;
}
int jointLength = 0;
//Figure out the total length of the strings in value
int endIndex = startIndex + count - 1;
for (int stringToJoinIndex = startIndex; stringToJoinIndex <= endIndex; stringToJoinIndex++) {
if (value[stringToJoinIndex] != null) {
jointLength += value[stringToJoinIndex].Length;
}
}
//Add enough room for the separator.
jointLength += (count - 1) * separator.Length;
// Note that we may not catch all overflows with this check (since we could have wrapped around the 4gb range any number of times
// and landed back in the positive range.) The input array might be modifed from other threads,
// so we have to do an overflow check before each append below anyway. Those overflows will get caught down there.
if ((jointLength < 0) || ((jointLength + 1) < 0) ) {
throw new OutOfMemoryException();
}
//If this is an empty string, just return.
if (jointLength == 0) {
return String.Empty;
}
string jointString = FastAllocateString( jointLength );
fixed (char * pointerToJointString = &jointString.m_firstChar) {
UnSafeCharBuffer charBuffer = new UnSafeCharBuffer( pointerToJointString, jointLength);
// Append the first string first and then append each following string prefixed by the separator.
charBuffer.AppendString( value[startIndex] );
for (int stringToJoinIndex = startIndex + 1; stringToJoinIndex <= endIndex; stringToJoinIndex++) {
charBuffer.AppendString( separator );
charBuffer.AppendString( value[stringToJoinIndex] );
}
Contract.Assert(*(pointerToJointString + charBuffer.Length) == '\0', "String must be null-terminated!");
}
return jointString;
}
Source: http://www.dotnetframework.org/default.aspx/4#0/4#0/DEVDIV_TFS/Dev10/Releases/RTMRel/ndp/clr/src/BCL/System/String#cs/1305376/String#cs
UPDATE 2
Good point on the fast allocate. According to an old SO post, you can wrap FastAllocate using reflection (assuming of course you'd cache the fastAllocate method reference so you just called Invoke each time. Perhaps the tradeoff of the call is better than what you're doing now.
var fastAllocate = typeof (string).GetMethods(BindingFlags.NonPublic | BindingFlags.Static)
.First(x => x.Name == "FastAllocateString");
var newString = (string)fastAllocate.Invoke(null, new object[] {20});
Console.WriteLine(newString.Length); // 20
Perhaps another approach is to use unsafe code to copy your allocation into a char* array, then pass this to the string constructor. The string constructor with char* is an extern passed to the underlying C++ implementation. I haven't found a reliable source for that code to confirm, but perhaps this can be faster for you. The non-prod ready code (no checks for potential overflow, add fixed to lock strings from garbage collection, etc) would start with:
public unsafe string MyConcat(List<string> values)
{
int index = 0;
int totalLength = values.Sum(m => m.Length);
char* concat = stackalloc char[totalLength + 1]; // Add additional char for null term
foreach (var value in values)
{
foreach (var c in value)
{
concat[index] = c;
index++;
}
}
concat[index] = '\0';
return new string(concat);
}
Now I'm all out of ideas for this :) Perhaps somebody can figure out a method here with marshalling to avoid unsafe code. Since introducing unsafe code requires adding the unsafe flag to compilation, consider adding this piece as a separate dll to minimize your app's security risk if you go down that route.
Unless the average length of the strings is very small, the most efficient approach, given a List<String>, will be to use ToArray() to copy it to a new String[], and pass that to a concatenation or joining method. Doing that may cause a wasted allocation for an array of references if the concatenation or joining method wants to make a copy of its array before it starts, but that would only allocate one reference per string, there will only be one allocation to hold character data, and it will be correctly sized to hold the entire string.
If you're building the data structure yourself, you might gain a little bit of efficiency by initializing a String[] to the estimated required size, populating it yourself, and expanding it as needed. That would save one allocation of a String[] worth of data.
Another approach would be to allocate a String[8192][] and then allocate a String[8192] for each array of strings as you go along. Once you're all done, you'll know exactly what size String[] you need to pass to the Concat method so you can create an array of that exact size. This approach would require a greater quantity of allocations, but only the final String[] and the String itself would need to go on the Large Object Heap.
It's a shame the constraints you're putting on yourself. It's very blockily structured, and it's hard to get any flow going. For example, if you didn't expect a IList but only expected IEnumerable you might be able to make it easier for the producer of your content. Not only that, you could make your processing benefit from being able to consume the strings only as you need them - and only as they're produced.
This gets you on down the road to some nice asynchrony.
One the other end, they're making you send to whole thing at once. That's tough.
But having said that, and since you're going to run it over and over, etc... I'm wondering if you couldn't create your string buffer or byte buffer or StringBuilder or whatever - and reuse it between executions - allocate the max monster (or progressively bump-reallocate it as needed) one time - and don't let the gc have it. The string constructor will copy it over and over again - but that's a single allocation per cycle. If you're running this so much you're making the machine hot, then it might be worth the hit. I've made precisely that tradeoff in the near past (but I didn't have 5gb to choke on). It felt dirty at first - but ooohh - the throughput spoke loudly!
Also, it may be possible, that while your native API expects a string, but you can lie to it - let it think you're giving it a string. You can very probably pass the buffer with a null char at the end - or with the length - depending on the API's particulars. I think one or two commenters spoke to this. In such a case, you may probably need your buffer pinned for the duration of the calls to the native consumer of your big ol' string.
If this is the case, you're down to a one-time allocation of a buffer, repeated copies into it, and that's it. It could go way under your proposed best case.
I have implemented a method to concatenate a List into a single string that performs exactly one allocation.
The following code compiles under .Net 4.6 - Block.MemoryCopy wasn't added to .Net until 4.6.
The "unsafe" implementation:
public static unsafe class FastConcat
{
public static string Concat( IList<string> list )
{
string destinationString;
int destLengthChars = 0;
for( int i = 0; i < list.Count; i++ )
{
destLengthChars += list[i].Length;
}
destinationString = new string( '\0', destLengthChars );
unsafe
{
fixed( char* origDestPtr = destinationString )
{
char* destPtr = origDestPtr; // a pointer we can modify.
string source;
for( int i = 0; i < list.Count; i++ )
{
source = list[i];
fixed( char* sourcePtr = source )
{
Buffer.MemoryCopy(
sourcePtr,
destPtr,
long.MaxValue,
source.Length * sizeof( char )
);
}
destPtr += source.Length;
}
}
}
return destinationString;
}
}
The competing implementation is the following "safe" implementation:
public static string Concat( IList<string> list )
{
return string.Concat( list.ToArray() )
}
Memory consumption
The "unsafe" implementation performs exactly one allocation and zero temporary allocations. The List<string> is directly concatenated into a single, freshly allocated string object.
The "safe" implementation requires two copies of the list - one, when I call ToArray() to pass it to string.Concat, and another when string.Concat performs its own internal copy of the array.
When concatenating a 500k element list, the "safe" string.Concat method allocates exactly 8 MB of extra memory in a 64-bit process, which I've confirmed by running the test driver in a memory monitor. This is what we would expect with the array copies performed by the safe implementation.
CPU performance
For small worksets, the unsafe implementation seems to win by about 25%.
The test driver was tested by compiling for 64-bit, installing the program into the native image cache via NGEN, and running from outside the debugger on an unloaded workstation.
From my test driver with a small workset (500k strings each 2-10 chars long):
Unsafe Time: 17.266 ms
Unsafe Time: 18.419 ms
Unsafe Time: 16.876 ms
Safe Time: 21.265 ms
Safe Time: 21.890 ms
Safe Time: 24.492 ms
Unsafe average: 17.520 ms. Safe average: 22.549 ms. Safe takes about 25% longer than unsafe. This is likely due to the extra work the safe implementation has to do, allocating temporary arrays.
...
From my test driver with a large workset (500k strings, each 500-800 chars long):
Unsafe Time: 498.122 ms
Unsafe Time: 513.725 ms
Unsafe Time: 515.016 ms
Safe Time: 487.456 ms
Safe Time: 499.508 ms
Safe Time: 512.390 ms
As you can see, the performance difference with large strings is roughly zero, likely because the time is dominated by the raw copy.
Conclusion
If you don't care about the array copies, the safe implementation is dead simple to implement, and is roughly as fast as the unsafe implementation. If you want to be absolutely perfect with memory usage, use the unsafe implementation.
I've attached the code I used for the test harness:
class PerfTestHarness
{
private List<string> corpus;
public PerfTestHarness( List<string> corpus )
{
this.corpus = corpus;
// Warm up the JIT
// Note that `result` is discarded. We reference it via 'result[0]' as an
// unused paramater to my prints to be absolutely sure it doesn't get
// optimized out. Cheap hack, but it works.
string result;
result = FastConcat.Concat( this.corpus );
Console.WriteLine( "Fast warmup done", result[0] );
result = string.Concat( this.corpus.ToArray() );
Console.WriteLine( "Safe warmup done", result[0] );
GC.Collect();
GC.WaitForPendingFinalizers();
}
public void PerfTestSafe()
{
Stopwatch watch = new Stopwatch();
string result;
GC.Collect();
GC.WaitForPendingFinalizers();
watch.Start();
result = string.Concat( this.corpus.ToArray() );
watch.Stop();
Console.WriteLine( "Safe Time: {0:0.000} ms", watch.Elapsed.TotalMilliseconds, result[0] );
Console.WriteLine( "Memory usage: {0:0.000} MB", Environment.WorkingSet / 1000000.0 );
Console.WriteLine();
}
public void PerfTestUnsafe()
{
Stopwatch watch = new Stopwatch();
string result;
GC.Collect();
GC.WaitForPendingFinalizers();
watch.Start();
result = FastConcat.Concat( this.corpus );
watch.Stop();
Console.WriteLine( "Unsafe Time: {0:0.000} ms", watch.Elapsed.TotalMilliseconds, result[0] );
Console.WriteLine( "Memory usage: {0:0.000} MB", Environment.WorkingSet / 1000000.0 );
Console.WriteLine();
}
}
StringBuilder was designed to concatenate strings efficiently. It has no other purpose.Use the constructor which sets the initial capacity:
int totalLength = CalcTotalLength();
// sufficient capacity
StringBuilder sb = new StringBuilder(totalLength);
But then you say that even StringBuilder allocates intermediate memory, and you want to do better...
These are unusual requirements, so you need to write a function which suits your situation (creating a char[] of appropriate size, then filling it in). I'm sure you are more than capable.
The first two of my answers have now been already incorporated in the question. Here is my highly situation dependent, but useful -
Third Answer
If in all these MBs of string you are getting a lot of strings that are same, then a smarter way would be use two dictionaries, one would be Dictionary<int, int> to store position and "Id" of the string at that position while another would be a Dictionary<int, int> to store the "Id" and the index of actual string in the original string[].
Coincidentally for me, what I am trying to do is already implemented in C#. Goes kinda like this...
If indeed there are a lot of same strings, is it a rare case where String Interning is useful? You are guaranteed to save considerable amount of your 200 MB target if a lot of matching strings are coming from the content producers.
What is String.Intern?
When you use strings in C#, the CLR does something clever called
string interning. It's a way of storing one copy of any string. If you
end up having a hundred—or, worse, a million—strings with the same
value, it's a waste to take up all of that memory storing the same
string over and over again. String interning is a way around that.
The CLR maintains a table called the intern pool that contains a
single, unique reference to every literal string that's either
declared or created programmatically while your program's running. And
the .NET Framework gives you two useful methods for interacting with
the intern pool: String.Intern() and String.IsInterned().
The way String.Intern() works is pretty straightforward. You pass it a
single string as an argument. If that string is already in the intern
pool, it returns a reference to that string. If it's not already in
the intern pool, it adds it and returns the same reference you passed
into it.
The way to use String Interning is explained in the link. For the sake of completeness of this answer I can add the code here but only if you feel that these solutions are useful.
I've found a memory leak in my parser. I don't know how to fix that problem.
Let's see that basic routing.
private void parsePage() {
String[] tmp = null;
foreach (String row in rows) {
tmp = row.Split(new []{" "}, StringSplitOptions.None);
PrivateRow t = new PrivateRow();
t.Field1 = tmp[1];
t.Field2 = tmp[2];
t.Field3 = tmp[3];
t.Field4 = String.Join(" ", tmp);
myBigCollection.Add(t);
}
}
private void parseFromFile() {
String[] tmp = null;
foreach (String row in rows) {
PrivateRow t = new PrivateRow();
t.Field1 = "mystring1";
t.Field2 = "mystring2222";
t.Field3 = "mystring3333";
t.Field4 = "mystring1 xxx yy zzz";
myBigCollection.Add(t);
}
}
Launching parsePage(), on a collection (rows is a List of 100000 elements) make my app grown from 20MB to 70MB.
Launching parseFromFile(), that read SAME collection from file, but avoiding split/join, take about 1MB.
Using a MemoryProfiler, I see that "t" fields and PrivateRow, kkep reference to String.Split() array and Split.Join.
I suppose that's because I assign a reference, not a copy, that can be garbage collected.
Ok, use 70mb isn't a big deal, but when I launch on production, with a lot o site, it can raise 2.5-3GB...
Cheers
This isn't a memory leak per se. It's actually behaving properly. The reason your second function uses so much less memory, is simply because you only have four strings in use. Each of these four strings is allocated only once, and subsequent uses of the strings for new t.Fieldx instances actually refer to the same string values. Strings are immutable, so if you refer to the same string value more than once, it can be handled by the same string instances. See the paragraph labelled "Interning" at this article on String in .NET for some more detail on this.
In your first function, you have what are probably mostly different strings for each field, and each time through the loop. That simply is much more varied data. The fact that those strings are held on to is what you want to have happen for as long as your PrivateRow objects exist.
You don't have a memory leak at all, it's just garbage collector takes time to process it.
I suppose that's because I assign a reference, not a copy, that can
be garbage collected.
That is not correct assumption. string during assignment is copied, even if it is a reference type. It is special, kind of, unique type inside BCL.
Now what about possible solution, in case you have intensive memory pressure. If you have massive amount of string to process from file, you may look on 2 options.
1) Process them in sequence, by reading a srteam (not load all at once). Loading as less data in memory as possible/required/makes sence.
2) Use MemoryMappedFile to, again, load only chunks of data and process them in sequence.
2nd can be combined with 1st.
Like others have said, there is no evidence of a memory leak here, just delayed garbage collection. All memory should be cleaned up eventually.
That being said, there are a couple things you can do to help keep memory usage lower or recover it more quickly:
1)You should be able to replace
t.Field4 = String.Join(" ", tmp);
with
t.Field4 = row;
You created tmp by splitting row, then you're joining it back together. Avoid creating a new string by just using row.
2) Call GC.Collect(); at the end of the method to request immediate garbage collection. This won't reduce the memory used within the method, but it should free up memory more quickly.
If your application is memory-usage critical and there is a lot of repeating data you should replace string values with Enums.
When the application runs, the IDE tells me Input string is not in the correct format.
(Convert.ToInt32(_subStr.Reverse().ToString().Substring(4, _subStr.Length - 4))*1.6).ToString()
I don't know how the Reverse() can be exactly used here.
There is no Reverse method in the String class, so the method you're using is actually the Enumerable.Reverse extension method. This compiles because String implements IEnumerable<char>, but the result is not a string, it's another implementation of IEnumerable<char>. When you call ToString() on that, you get this: System.Linq.Enumerable+<ReverseIterator>d__a01[System.Char]`.
If you want to convert this IEnumerable<char> to a string, you can dot it like this:
string reversed = new string(_subStr.Reverse().ToArray());
(Convert.ToInt32(reversed.Substring(4, _subStr.Length - 4))*1.6).ToString()
Note, however, that it is not a correct way of reversing a string, it will fail in some cases due to Unicode surrogate pairs and combining characters. See here and there for explanations.
As the Reverse method is an extension of IEnumerable<T>, you get an IEnumerable<T> as result, and as that doesn't override the ToString method, you will get the original implementation from the Object class, that simply returns the type name of the object. Turn the IEnumerable<T> into an array, then you can create a string from it.
You should first get the part of the string that is digits, then reverse it. That way it will work, regardless of what characters you have in the rest of the string. Although using the Reverse extension doesn't work properly to reverse any string (as Thomas Levesque pointed out), it will work for a string with only digits:
(
Int32.Parse(
new String(_subStr.SubString(0, _subStr.Length - 4).Reverse().ToArray())
) * 1.6
).ToString();
the simplest approach would be:
string backwards = "sdrawkcab";
string reverse = "";
for(int i = backwards.Length-1; i >= 0; i--)
reverse += backwards[i];
The result is: reverse == "backwards".
then you can do:
reverse = (Convert.ToInt32(reverse) * 1.6).toString();
Your Piping approach string.Substring().Split().Last()... will very quickly lead to a memory bottleneck, if you loop through lines of text.
Also important to notice: strings are Immutable, therefor each iteration of the for loop we create a new string instance in the memory because of the += operator, this will give us less optimal memory efficieny compared to other, more efficient algorithms, this is an O(n2) algorithm.
for a more efficient implementation you can check:
https://stackoverflow.com/a/1009707/14473033
I have a String which I would like to modify in some way. For example: reverse it or upcase it.
I have discovered that the fastest way to do this is by using a unsafe block and pointers.
For example:
unsafe
{
fixed (char* str = text)
{
*str = 'X';
}
}
Are there any reasons why I should never ever do this?
The .Net framework requires strings to be immutable. Due to this requirement it is able to optimise all sorts of operations.
String interning is one great example of this requirement is leveraged heavily. To speed up some string comparisons (and reduce memory consumption) the .Net framework maintains a Dictionary of pointers, all pre-defined strings will live in this dictionary or any strings where you call the String.intern method on. When the IL instruction ldstr is called it will check the interned dictionary and avoid memory allocation if we already have the string allocated, note: String.Concat will not check for interned strings.
This property of the .net framework means that if you start mucking around directly with strings you can corrupt your intern table and in turn corrupt other references to the same string.
For example:
// these strings get interned
string hello = "hello";
string hello2 = "hello";
string helloworld, helloworld2;
helloworld = hello;
helloworld += " world";
helloworld2 = hello;
helloworld2 += " world";
unsafe
{
// very bad, this changes an interned string which affects
// all app domains.
fixed (char* str = hello2)
{
*str = 'X';
}
fixed (char* str = helloworld2)
{
*str = 'X';
}
}
Console.WriteLine("hello = {0} , hello2 = {1}", hello, hello2);
// output: hello = Xello , hello2 = Xello
Console.WriteLine("helloworld = {0} , helloworld2 = {1}", helloworld, helloworld2);
// output : helloworld = hello world , helloworld2 = Xello world
Are there any reasons why I should never ever do this?
Yes, very simple: Because .NET relies on the fact that strings are immutable. Some operations (e.g. s.SubString(0, s.Length)) actually return a reference to the original string. If this now gets modified, all other references will as well.
Better use a StringBuilder to modify a string since this is the default way.
Put it this way: how would you feel if another programmer decided to replace 0 with 1 everywhere in your code, at execution time? It would play hell with all your assumptions. The same is true with strings. Everyone expects them to be immutable, and codes with that assumption in mind. If you violate that, you are likely to introduce bugs - and they'll be really hard to trace.
Oh dear lord yes.
1) Because that class is not designed to be tampered with.
2) Because strings are designed and expected throughout the framework to be immutable. That means that code that everyone else writes (including MSFT) is expecting a string's underlying value never to change.
3) Because this is premature optimization and that is E V I L.
Agreed about StringBuilder, or just convert your string to an array of chars/bytes and work there. Also, you gave the example of "upcasing" -- the String class has a ToUpper method, and if that's not at least as fast as your unsafe "upcasing", I'll eat my hat.