I have a filename that has the following format:
timestamp-username-1
This file is constantly getting written to, but before it gets too large I want to create a new file:
timestamp-username-2
How can I achieve this using the least amount of memory (i.e., few or no variables)?
Here is my version:
private void Split() {
char[] strArr = FlowArgs.Filename.ToCharArray();
int num;
//get the last number
if(Int32.TryParse(strArr[strArr.Length - 1].ToString(), out num)) {
num += 1;
}
//replace the old number with the new number
char.TryParse(num.ToString(), out strArr[strArr.Length - 1]);
FlowArgs.Filename = new string(strArr); // strArr.ToString() would return "System.Char[]"
}
Edit:
I have added a "version" property (int) to the FlowArgs class. However, my new problem is: how can I append this to the end of the filename?
I think you should just store the counter in an int. I understand that you want to save memory, but to be honest, an extra int is really in the "acceptable" category. The Int32 parser is probably wasting much more memory. Don't forget that on x86, memory is split into 4096-byte pages, so much more memory is wasted than these 4 bytes.
EDIT: You probably want to have a method like GetNextFileName() in your class that generates the next filename (being able to refactor your code into small bits is important, much more important than saving memory):
private int nextFileNumber = 0;
private string GetNextFileName(string userName)
{
return String.Format("{0}-{1}-{2}", DateTime.Now, userName,
nextFileNumber++);
}
"Least amount of memory" is NOT the same as "few or no variables".
Local variables themselves take very little memory.
Object creation on the heap takes a lot more, and those objects require the GC to clean them up.
In your example, the ToCharArray() and ToString() calls create 4 objects (not counting objects created indirectly).
Your string variable can already be indexed like a character array (read-only), so no ToCharArray() copy is needed:
string name = FlowArgs.Filename;
int num = 0;
//get the last number
if (Int32.TryParse(name[name.Length - 1].ToString(), out num))
num++;
//strings are immutable, so build a new string with the last character replaced
FlowArgs.Filename = name.Substring(0, name.Length - 1) + num;
Instead of using a running counter, consider using datetime of creation as the changing part of your filename. This way, you don't have to store and retrieve the previous value.
Using the ToBinary() method, you can get a numeric representation of the time.
Of course, any time format that is acceptable in a filename can be used - see custom date and time format strings.
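As a rough sketch of the timestamp idea, the snippet below formats a creation time with a custom format string that avoids characters such as ':' and '/', which are not allowed in Windows filenames. The class name LogFileNames, the method Create, and the "yyyyMMdd-HHmmss" pattern are illustrative choices, not something taken from the original code.

```csharp
using System;
using System.Globalization;

public static class LogFileNames
{
    // Hypothetical helper: builds "<timestamp>-<userName>" from a creation time.
    // The format string contains only digits and '-', so the result is filename-safe.
    public static string Create(DateTime createdAt, string userName)
    {
        string stamp = createdAt.ToString("yyyyMMdd-HHmmss", CultureInfo.InvariantCulture);
        return stamp + "-" + userName;
    }
}
```

Calling LogFileNames.Create(DateTime.Now, userName) each time a new file is started yields a fresh name, with no counter to store or parse, provided files are not rolled over more than once per second.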
How does StringBuilder work?
What does it do internally? Does it use unsafe code?
And why is it so fast (compared to the + operator)?
When you use the + operator to build up a string:
string s = "01";
s += "02";
s += "03";
s += "04";
then on the first concatenation we make a new string of length four and copy "01" and "02" into it -- four characters are copied. On the second concatenation we make a new string of length six and copy "0102" and "03" into it -- six characters are copied. On the third concat, we make a string of length eight and copy "010203" and "04" into it -- eight characters are copied. So far a total of 4 + 6 + 8 = 18 characters have been copied for this eight-character string. Keep going.
...
s += "99";
On the 98th concat we make a string of length 198 and copy "010203...98" and "99" into it. That gives us a total of 4 + 6 + 8 + ... + 198 = a lot, in order to make this 198 character string.
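To put a number on "a lot", here is a small sketch that counts the characters copied under the model described above, where every concatenation writes each character of the new string. The class and method names are made up for illustration.

```csharp
using System;

public static class ConcatCost
{
    // Counts characters copied when `pieces` strings of `pieceLength` chars each
    // are appended one at a time with +=, assuming each step copies the whole result.
    public static long CharsCopied(int pieces, int pieceLength)
    {
        long copied = 0;
        long length = pieceLength;      // the first piece is simply assigned
        for (int i = 1; i < pieces; i++)
        {
            length += pieceLength;      // new string = old content + one more piece
            copied += length;           // every character of the new string is written
        }
        return copied;
    }
}
```

Under this model, CharsCopied(4, 2) gives the 18 from the "01".."04" example, and CharsCopied(99, 2) gives 4 + 6 + ... + 198 = 9898 characters copied to build a 198-character string.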
A string builder doesn't do all that copying. Rather, it maintains a mutable array that is hoped to be larger than the final string, and stuffs new things into the array as necessary.
What happens when the guess is wrong and the array gets full? There are two strategies. In the previous version of the framework, the string builder reallocated and copied the array when it got full, and doubled its size. In the new implementation, the string builder maintains a linked list of relatively small arrays, and appends a new array onto the end of the list when the old one gets full.
Also, as you have conjectured, the string builder can do tricks with "unsafe" code to improve its performance. For example, the code which writes the new data into the array can already have checked that the array write is going to be within bounds. By turning off the safety system it can avoid the per-write check that the jitter might otherwise insert to verify that every write to the array is safe. The string builder does a number of these sorts of tricks to do things like ensuring that buffers are reused rather than reallocated, ensuring that unnecessary safety checks are avoided, and so on. I recommend against these sorts of shenanigans unless you are really good at writing unsafe code correctly, and really do need to eke out every last bit of performance.
StringBuilder's implementation has changed between versions, I believe. Fundamentally though, it maintains a mutable structure of some form. I believe it used to use a string which was still being mutated (using internal methods) and would just make sure it would never be mutated after it was returned.
The reason StringBuilder is faster than using string concatenation in a loop is precisely because of the mutability - it doesn't require a new string to be constructed after each mutation, which would mean copying all the data within the string etc.
For just a single concatenation, it's actually slightly more efficient to use + than to use StringBuilder. It's only when you're performing multiple operations and you don't really need the intermediate results that StringBuilder shines.
See my article on StringBuilder for more information.
The Microsoft CLR does do some operations with internal call (not quite the same as unsafe code). The biggest performance benefit over a bunch of + concatenated strings is that it writes to a char[] and doesn't create as many intermediate strings. When you call ToString(), it builds a completed, immutable string from your contents.
The StringBuilder uses a string buffer that can be altered, compared to a regular String that can't be. When you call the ToString method of the StringBuilder it will just freeze the string buffer and convert it into a regular string, so it doesn't have to copy all the data one extra time.
As the StringBuilder can alter the string buffer, it doesn't have to create a new string value for each and every change to the string data. When you use the + operator, the compiler turns that into a String.Concat call that creates a new string object. This seemingly innocent piece of code:
str += ",";
compiles into this:
str = String.Concat(str, ",");
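For comparison, a minimal sketch of the two styles side by side. Both methods produce the same text; the difference is that the first allocates a new string on every iteration while the second appends into one mutable buffer. The names BuildDemo, WithPlus and WithBuilder are invented for this example, and the capacity hint assumes two characters per item.

```csharp
using System;
using System.Text;

public static class BuildDemo
{
    // Each += compiles to String.Concat and allocates a brand-new string.
    public static string WithPlus(int count)
    {
        string s = "";
        for (int i = 0; i < count; i++)
        {
            s += i.ToString("00");
        }
        return s;
    }

    // Appends go into the builder's buffer; one string is created at the end.
    public static string WithBuilder(int count)
    {
        StringBuilder sb = new StringBuilder(count * 2); // capacity hint: 2 chars per item
        for (int i = 0; i < count; i++)
        {
            sb.Append(i.ToString("00"));
        }
        return sb.ToString();
    }
}
```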
After doing some profiling, we've discovered that the current way in which our app concatenates strings causes an enormous amount of memory churn and CPU time.
We're building a List<string> of strings to concatenate that is on the order of 500 thousand elements long, referencing several hundred megabytes worth of strings. We're trying to optimize this one small part of our app since it seems to account for a disproportionate amount of CPU and memory usage.
We do a lot of text processing :)
Theoretically, we should be able to perform the concatenation in a single allocation and N copies - we can know how many total characters are available in our string, so it should just be as simple as summing up the lengths of the component strings and allocating enough underlying memory to hold the result.
Assuming we're starting with a pre-filled List<string>, is it possible to concatenate all strings in that list using a single allocation?
Currently, we're using the StringBuilder class, but this stores its own intermediate buffer of all of the characters - so we have an ever growing chunk array, with each chunk storing a copy of the characters we're giving it. Far from ideal. The allocations for the array of chunks aren't horrible, but the worst part is that it allocates intermediate character arrays, which means N allocations and copies.
The best we can do right now is to call List<string>.ToArray() - which performs one copy of a 500k element array - and pass the resulting string[] to string.Concat(params string[]). string.Concat() then performs two allocations, one to copy the input array into an internal array, and the one to allocate the destination string's memory.
From referencesource.microsoft.com:
public static String Concat(params String[] values) {
if (values == null)
throw new ArgumentNullException("values");
Contract.Ensures(Contract.Result<String>() != null);
// Spec#: Consider a postcondition saying the length of this string == the sum of each string in array
Contract.EndContractBlock();
int totalLength=0;
// -----------> Allocation #1 <---------
String[] internalValues = new String[values.Length];
for (int i=0; i<values.Length; i++) {
string value = values[i];
internalValues[i] = ((value==null)?(String.Empty):(value));
totalLength += internalValues[i].Length;
// check for overflow
if (totalLength < 0) {
throw new OutOfMemoryException();
}
}
return ConcatArray(internalValues, totalLength);
}
private static String ConcatArray(String[] values, int totalLength) {
// -----------------> Allocation #2 <---------------------
String result = FastAllocateString(totalLength);
int currPos=0;
for (int i=0; i<values.Length; i++) {
Contract.Assert((currPos <= totalLength - values[i].Length),
"[String.ConcatArray](currPos <= totalLength - values[i].Length)");
FillStringChecked(result, currPos, values[i]);
currPos+=values[i].Length;
}
return result;
}
Thus, in the best case, we have three allocations, two for arrays referencing the component strings, and one for the destination concatenated string.
Can we improve on this? Is it possible to concatenate a List<string> using a single allocation and a single loop of character copies?
Edit 1
I'd like to summarize the various approaches discussed so far, and why they are still sub-optimal. I'd also like to set the parameters of the situation in concrete a little more, since I've received a lot of questions that try to side step the central question.
...
First, the structure of the code that I am working within. There are three layers:
Layer one is a set of methods that produce my content. These methods return small-ish string objects, which I will call my 'component' strings. These string objects will eventually be concatenated into a single string. I do not have the ability to modify these methods; I have to face the reality that they return string objects and move forward.
Layer two is my code that calls these content producers and assembles the output, and is the subject of this question. I must call the content producer methods, collect the strings they return, and eventually concatenate the returned strings into a single string (reality is a little more complex; the returned strings are partitioned depending on how they're routed for output, and so I have several sets of large collections of strings).
Layer three is a set of methods that accept a single large string for further processing. Changing the interface of that code is beyond my control.
Talking about some numbers: a typical batch run will collect ~500000 strings from the content producers, representing about 200-500 MB of memory. I need the most efficient way to concatenate these 500k strings into a single string.
...
Now I'd like to examine the approaches discussed so far. For the sake of numbers, assume we're running 64-bit, assume that we are collecting 500000 string objects, and assume that the aggregate size of the string objects totals 200 megabytes worth of character data. Also, assume that the original string object's memory is not counted toward any approach's total in the below analysis. I make this assumption because it is necessarily common to any and all approaches, because it is an assumption that we cannot change the interface of the content producers - they return 500k relatively small fully formed strings objects that I must then accept and somehow concatenate. As stated above, I cannot change this interface.
Approach #1
Content producers ----> StringBuilder ----> string
Conceptually, this would be invoking the content producers, and directly writing the strings they return to a StringBuilder, and then later calling StringBuilder.ToString() to obtain the concatenated string.
By analyzing StringBuilder's implementation, we can see that the cost of this boils down to 400 MB of allocations and copies:
During the stage where we collect the output from the content producers, we're writing 200 MB of data to the StringBuilder. We would be performing one 200 MB allocation to pre-allocate the StringBuilder, and then 200 MB worth of copies as we copy and discard the strings returned from the content producers
After we've collected all output from the content producers and have a fully formed StringBuilder, we then need to call StringBuilder.ToString(). This performs exactly one allocation (string.FastAllocateString()), and then copies the string data from its internal buffers to the string object's internal memory.
Total cost: approximately 400 MB of allocations and copies
Approach #2
Content producers ---> pre-allocated char[] ---> string
This strategy is fairly simple. Assuming we know roughly how much character data we're going to be collecting from the producers, we can pre-allocate a char[] that is 200 MB large. Then, as we call the content producers, we copy the strings they return into our char[]. This accounts for 200 MB of allocations and copies. The final step to turn this into a string object is to pass it to the new string(char[]) constructor. However, since strings are immutable and arrays are not, the constructor will make a copy of that entire array, causing it to allocate and copy another 200 MB of character data.
Total cost: approximately 400 MB of allocations and copies
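A minimal sketch of what Approach #2 might look like in code. To stay self-contained it takes an already collected list and computes the exact size first, which is a simplification of the "roughly known size" described above; the class name CharArrayConcat is made up.

```csharp
using System;
using System.Collections.Generic;

public static class CharArrayConcat
{
    public static string Concat(IList<string> parts)
    {
        int total = 0;
        for (int i = 0; i < parts.Count; i++)
        {
            total += parts[i].Length;
        }

        char[] buffer = new char[total];    // allocation #1: the intermediate char[]
        int pos = 0;
        for (int i = 0; i < parts.Count; i++)
        {
            // copy #1: characters go from each component string into the buffer
            parts[i].CopyTo(0, buffer, pos, parts[i].Length);
            pos += parts[i].Length;
        }

        // allocation #2 and copy #2: the constructor copies the whole buffer,
        // because the resulting string must not alias a mutable array
        return new string(buffer);
    }
}
```

The final line is where the second 200 MB comes from: the data ends up in memory twice, once in the char[] and once in the string.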
Approach #3:
Content producers ---> List<string> ----> string[] ----> string.Concat(string[])
Pre-allocate a List<string> to be about 500k elements - approximately 4 MB of allocations for List's underlying array (500k * 8 bytes per pointer == 4 MB of memory).
Call all of the content producers to collect their strings. Approximately 4 MB of copies, as we copy the pointer to the returned string into List's underlying array.
Call List<string>.ToArray() to obtain a string[]. Approximately 4 MB of allocations and copies (again, we're really just copying pointers).
Call string.Concat(string[]):
Concat will make a copy of the array provided to it before it does any real work. Approximately 4 MB of allocations and copies, again.
Concat will then allocate a single 'destination' string object using the internal string.FastAllocateString() special method. Approximately 200 MB of allocations.
Concat will then copy strings from its internal copy of the provided array directly into the destination. Approximately 200 MB of copies.
Total cost: approximately 212 MB of allocations and copies
None of these approaches is ideal, but approach #3 is very close. We're assuming that the absolute minimum of memory that needs to be allocated and copied is 200 MB (for the destination string), and here we get pretty close: 212 MB.
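The 212 MB figure is just the three pointer arrays plus the destination string. As a quick check of that arithmetic, here is a throwaway sketch (names invented, decimal megabytes assumed, 8-byte references as in a 64-bit process):

```csharp
using System;

public static class Approach3Cost
{
    // Three reference arrays (List's backing array, the ToArray() copy, and
    // Concat's internal copy) plus the character data of the destination string.
    public static double TotalMegabytes(int stringCount, int pointerBytes, double charDataMb)
    {
        double arrayMb = (double)stringCount * pointerBytes / 1000000.0;
        return 3 * arrayMb + charDataMb;
    }
}
```

With 500000 strings, 8 bytes per reference and 200 MB of character data, this gives 3 * 4 + 200 = 212.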
If there were a string.Concat overload that 1) Accepted an IList<string> and 2) did not make a copy of that IList before using it, then the problem would be solved. No such method is provided by .Net, hence the subject of this question.
Edit 2
Progress on a solution.
I've done some testing with some hacked IL, and found that directly invoking string.FastAllocateString(n) (which is not usually invokable...) is about as fast as invoking new string('\0', n), and both seem to allocate exactly as much memory as is expected.
From there, it seems it's possible to acquire a pointer to the freshly allocated string using the unsafe and fixed statements.
And so, a rough solution begins to appear:
private static string Concat( List<string> list )
{
int concatLength = 0;
for( int i = 0; i < list.Count; i++ )
{
concatLength += list[i].Length;
}
string newString = new string( '\0', concatLength );
unsafe
{
fixed( char* ptr = newString )
{
...
}
}
return newString;
}
The next biggest hurdle is implementing or finding an efficient block copy method, à la Buffer.BlockCopy, except one that will accept char* types.
If you can determine the length of the concatenation before trying to perform the operation, a char array can beat string builder in some use cases. Manipulating the characters within the array prevents the multiple allocations.
See: http://blogs.msdn.com/b/cisg/archive/2008/09/09/performance-analysis-reveals-char-array-is-better-than-stringbuilder.aspx
UPDATE
Please check out this internal implementation of the String.Join from .NET - it uses unsafe code with pointers to avoid multiple allocations. Unless I'm missing something, it would seem you can re-write this using your List to accomplish what you want:
[System.Security.SecuritySafeCritical] // auto-generated
public unsafe static String Join(String separator, String[] value, int startIndex, int count) {
//Range check the array
if (value == null)
throw new ArgumentNullException("value");
if (startIndex < 0)
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_StartIndex"));
if (count < 0)
throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_NegativeCount"));
if (startIndex > value.Length - count)
throw new ArgumentOutOfRangeException("startIndex", Environment.GetResourceString("ArgumentOutOfRange_IndexCountBuffer"));
Contract.EndContractBlock();
//Treat null as empty string.
if (separator == null) {
separator = String.Empty;
}
//If count is 0, that skews a whole bunch of the calculations below, so just special case that.
if (count == 0) {
return String.Empty;
}
int jointLength = 0;
//Figure out the total length of the strings in value
int endIndex = startIndex + count - 1;
for (int stringToJoinIndex = startIndex; stringToJoinIndex <= endIndex; stringToJoinIndex++) {
if (value[stringToJoinIndex] != null) {
jointLength += value[stringToJoinIndex].Length;
}
}
//Add enough room for the separator.
jointLength += (count - 1) * separator.Length;
// Note that we may not catch all overflows with this check (since we could have wrapped around the 4gb range any number of times
// and landed back in the positive range.) The input array might be modifed from other threads,
// so we have to do an overflow check before each append below anyway. Those overflows will get caught down there.
if ((jointLength < 0) || ((jointLength + 1) < 0) ) {
throw new OutOfMemoryException();
}
//If this is an empty string, just return.
if (jointLength == 0) {
return String.Empty;
}
string jointString = FastAllocateString( jointLength );
fixed (char * pointerToJointString = &jointString.m_firstChar) {
UnSafeCharBuffer charBuffer = new UnSafeCharBuffer( pointerToJointString, jointLength);
// Append the first string first and then append each following string prefixed by the separator.
charBuffer.AppendString( value[startIndex] );
for (int stringToJoinIndex = startIndex + 1; stringToJoinIndex <= endIndex; stringToJoinIndex++) {
charBuffer.AppendString( separator );
charBuffer.AppendString( value[stringToJoinIndex] );
}
Contract.Assert(*(pointerToJointString + charBuffer.Length) == '\0', "String must be null-terminated!");
}
return jointString;
}
Source: http://www.dotnetframework.org/default.aspx/4#0/4#0/DEVDIV_TFS/Dev10/Releases/RTMRel/ndp/clr/src/BCL/System/String#cs/1305376/String#cs
UPDATE 2
Good point on the fast allocate. According to an old SO post, you can wrap FastAllocate using reflection (assuming of course you'd cache the fastAllocate method reference so you just call Invoke each time). Perhaps the tradeoff of the call is better than what you're doing now.
var fastAllocate = typeof (string).GetMethods(BindingFlags.NonPublic | BindingFlags.Static)
.First(x => x.Name == "FastAllocateString");
var newString = (string)fastAllocate.Invoke(null, new object[] {20});
Console.WriteLine(newString.Length); // 20
Perhaps another approach is to use unsafe code to copy your allocation into a char* array, then pass this to the string constructor. The string constructor with char* is an extern passed to the underlying C++ implementation. I haven't found a reliable source for that code to confirm, but perhaps this can be faster for you. The non-prod ready code (no checks for potential overflow, add fixed to lock strings from garbage collection, etc) would start with:
public unsafe string MyConcat(List<string> values)
{
int index = 0;
int totalLength = values.Sum(m => m.Length);
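// Extra char for the null terminator. Caution: stackalloc takes memory from the thread's
// stack (about 1 MB by default), so this only works when totalLength is small.
char* concat = stackalloc char[totalLength + 1];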
foreach (var value in values)
{
foreach (var c in value)
{
concat[index] = c;
index++;
}
}
concat[index] = '\0';
return new string(concat);
}
Now I'm all out of ideas for this :) Perhaps somebody can figure out a method here with marshalling to avoid unsafe code. Since introducing unsafe code requires adding the unsafe flag to compilation, consider adding this piece as a separate dll to minimize your app's security risk if you go down that route.
Unless the average length of the strings is very small, the most efficient approach, given a List<String>, will be to use ToArray() to copy it to a new String[], and pass that to a concatenation or joining method. Doing that may cause a wasted allocation for an array of references if the concatenation or joining method wants to make a copy of its array before it starts, but that would only allocate one reference per string, there will only be one allocation to hold character data, and it will be correctly sized to hold the entire string.
If you're building the data structure yourself, you might gain a little bit of efficiency by initializing a String[] to the estimated required size, populating it yourself, and expanding it as needed. That would save one allocation of a String[] worth of data.
Another approach would be to allocate a String[8192][] and then allocate a String[8192] for each array of strings as you go along. Once you're all done, you'll know exactly what size String[] you need to pass to the Concat method so you can create an array of that exact size. This approach would require a greater quantity of allocations, but only the final String[] and the String itself would need to go on the Large Object Heap.
It's a shame about the constraints you're putting on yourself. It's very blockily structured, and it's hard to get any flow going. For example, if you didn't expect an IList but only expected IEnumerable, you might be able to make it easier for the producer of your content. Not only that, you could make your processing benefit from being able to consume the strings only as you need them, and only as they're produced.
This gets you on down the road to some nice asynchrony.
On the other end, they're making you send the whole thing at once. That's tough.
But having said that, and since you're going to run it over and over, etc... I'm wondering if you couldn't create your string buffer or byte buffer or StringBuilder or whatever - and reuse it between executions - allocate the max monster (or progressively bump-reallocate it as needed) one time - and don't let the gc have it. The string constructor will copy it over and over again - but that's a single allocation per cycle. If you're running this so much you're making the machine hot, then it might be worth the hit. I've made precisely that tradeoff in the near past (but I didn't have 5gb to choke on). It felt dirty at first - but ooohh - the throughput spoke loudly!
Also, it may be possible that, while your native API expects a string, you can lie to it and let it think you're giving it a string. You can very probably pass the buffer with a null char at the end, or with the length, depending on the API's particulars. I think one or two commenters spoke to this. In such a case, you will probably need your buffer pinned for the duration of the calls to the native consumer of your big ol' string.
If this is the case, you're down to a one-time allocation of a buffer, repeated copies into it, and that's it. It could go way under your proposed best case.
I have implemented a method to concatenate a List<string> into a single string that performs exactly one allocation.
The following code compiles under .Net 4.6 - Buffer.MemoryCopy wasn't added to .Net until 4.6.
The "unsafe" implementation:
public static unsafe class FastConcat
{
public static string Concat( IList<string> list )
{
string destinationString;
int destLengthChars = 0;
for( int i = 0; i < list.Count; i++ )
{
destLengthChars += list[i].Length;
}
destinationString = new string( '\0', destLengthChars );
unsafe
{
fixed( char* origDestPtr = destinationString )
{
char* destPtr = origDestPtr; // a pointer we can modify.
string source;
for( int i = 0; i < list.Count; i++ )
{
source = list[i];
fixed( char* sourcePtr = source )
{
Buffer.MemoryCopy(
sourcePtr,
destPtr,
long.MaxValue,
source.Length * sizeof( char )
);
}
destPtr += source.Length;
}
}
}
return destinationString;
}
}
The competing implementation is the following "safe" implementation:
public static string Concat( IList<string> list )
{
return string.Concat( list.ToArray() ); // ToArray() on IList<string> needs "using System.Linq;"
}
Memory consumption
The "unsafe" implementation performs exactly one allocation and zero temporary allocations. The List<string> is directly concatenated into a single, freshly allocated string object.
The "safe" implementation requires two copies of the list - one, when I call ToArray() to pass it to string.Concat, and another when string.Concat performs its own internal copy of the array.
When concatenating a 500k element list, the "safe" string.Concat method allocates exactly 8 MB of extra memory in a 64-bit process, which I've confirmed by running the test driver in a memory monitor. This is what we would expect with the array copies performed by the safe implementation.
CPU performance
For small worksets, the unsafe implementation seems to win by about 25%.
The test driver was tested by compiling for 64-bit, installing the program into the native image cache via NGEN, and running from outside the debugger on an unloaded workstation.
From my test driver with a small workset (500k strings each 2-10 chars long):
Unsafe Time: 17.266 ms
Unsafe Time: 18.419 ms
Unsafe Time: 16.876 ms
Safe Time: 21.265 ms
Safe Time: 21.890 ms
Safe Time: 24.492 ms
Unsafe average: 17.520 ms. Safe average: 22.549 ms. Safe takes roughly 25-30% longer than unsafe (22.549 / 17.520 ≈ 1.29). This is likely due to the extra work the safe implementation has to do, allocating temporary arrays.
...
From my test driver with a large workset (500k strings, each 500-800 chars long):
Unsafe Time: 498.122 ms
Unsafe Time: 513.725 ms
Unsafe Time: 515.016 ms
Safe Time: 487.456 ms
Safe Time: 499.508 ms
Safe Time: 512.390 ms
As you can see, the performance difference with large strings is roughly zero, likely because the time is dominated by the raw copy.
Conclusion
If you don't care about the array copies, the safe implementation is dead simple to implement, and is roughly as fast as the unsafe implementation. If you want to be absolutely perfect with memory usage, use the unsafe implementation.
I've attached the code I used for the test harness:
class PerfTestHarness
{
private List<string> corpus;
public PerfTestHarness( List<string> corpus )
{
this.corpus = corpus;
// Warm up the JIT
// Note that `result` is discarded. We reference it via 'result[0]' as an
// unused parameter to my prints to be absolutely sure it doesn't get
// optimized out. Cheap hack, but it works.
string result;
result = FastConcat.Concat( this.corpus );
Console.WriteLine( "Fast warmup done", result[0] );
result = string.Concat( this.corpus.ToArray() );
Console.WriteLine( "Safe warmup done", result[0] );
GC.Collect();
GC.WaitForPendingFinalizers();
}
public void PerfTestSafe()
{
Stopwatch watch = new Stopwatch();
string result;
GC.Collect();
GC.WaitForPendingFinalizers();
watch.Start();
result = string.Concat( this.corpus.ToArray() );
watch.Stop();
Console.WriteLine( "Safe Time: {0:0.000} ms", watch.Elapsed.TotalMilliseconds, result[0] );
Console.WriteLine( "Memory usage: {0:0.000} MB", Environment.WorkingSet / 1000000.0 );
Console.WriteLine();
}
public void PerfTestUnsafe()
{
Stopwatch watch = new Stopwatch();
string result;
GC.Collect();
GC.WaitForPendingFinalizers();
watch.Start();
result = FastConcat.Concat( this.corpus );
watch.Stop();
Console.WriteLine( "Unsafe Time: {0:0.000} ms", watch.Elapsed.TotalMilliseconds, result[0] );
Console.WriteLine( "Memory usage: {0:0.000} MB", Environment.WorkingSet / 1000000.0 );
Console.WriteLine();
}
}
StringBuilder was designed to concatenate strings efficiently. It has no other purpose. Use the constructor which sets the initial capacity:
int totalLength = CalcTotalLength();
// sufficient capacity
StringBuilder sb = new StringBuilder(totalLength);
But then you say that even StringBuilder allocates intermediate memory, and you want to do better...
These are unusual requirements, so you need to write a function which suits your situation (creating a char[] of appropriate size, then filling it in). I'm sure you are more than capable.
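For reference, a fuller sketch of the pre-sized StringBuilder suggestion. The CalcTotalLength() call above is not defined in this answer, so the sketch simply sums the lengths of a list; PresizedBuilder is an invented name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PresizedBuilder
{
    public static string Concat(IList<string> parts)
    {
        // Stand-in for CalcTotalLength(): sum the component lengths up front.
        int totalLength = 0;
        foreach (string part in parts)
        {
            totalLength += part.Length;
        }

        // Sufficient capacity, so the builder should not need to grow while appending.
        StringBuilder sb = new StringBuilder(totalLength);
        foreach (string part in parts)
        {
            sb.Append(part);
        }
        return sb.ToString();
    }
}
```

This avoids regrowth during the appends, but ToString() still produces a second copy of the character data, which is the intermediate memory the question objects to.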
The first two of my answers have now been already incorporated in the question. Here is my highly situation dependent, but useful -
Third Answer
If, in all these MBs of strings, you are getting a lot of strings that are the same, then a smarter way would be to use two dictionaries: one would be a Dictionary<int, int> to store the position and the "Id" of the string at that position, while the other would be a Dictionary<int, int> to store the "Id" and the index of the actual string in the original string[].
Coincidentally for me, what I am trying to do is already implemented in C#. Goes kinda like this...
If indeed there are a lot of same strings, is it a rare case where String Interning is useful? You are guaranteed to save considerable amount of your 200 MB target if a lot of matching strings are coming from the content producers.
What is String.Intern?
When you use strings in C#, the CLR does something clever called
string interning. It's a way of storing one copy of any string. If you
end up having a hundred—or, worse, a million—strings with the same
value, it's a waste to take up all of that memory storing the same
string over and over again. String interning is a way around that.
The CLR maintains a table called the intern pool that contains a
single, unique reference to every literal string that's either
declared or created programmatically while your program's running. And
the .NET Framework gives you two useful methods for interacting with
the intern pool: String.Intern() and String.IsInterned().
The way String.Intern() works is pretty straightforward. You pass it a
single string as an argument. If that string is already in the intern
pool, it returns a reference to that string. If it's not already in
the intern pool, it adds it and returns the same reference you passed
into it.
The way to use String Interning is explained in the link. For the sake of completeness of this answer I can add the code here but only if you feel that these solutions are useful.
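A minimal sketch of what interning does, assuming only the documented behaviour of String.Intern: two separately built strings with equal content map to one shared instance. The helper name InternDemo is invented.

```csharp
using System;

public static class InternDemo
{
    // Returns true when a and b resolve to the same object after interning,
    // which is expected whenever their contents are equal.
    public static bool SharesInstance(string a, string b)
    {
        return object.ReferenceEquals(string.Intern(a), string.Intern(b));
    }
}
```

In the scenario above, each string coming back from a content producer would be replaced by string.Intern(s) before it is stored, so duplicates collapse to a single object. Keep in mind that interned strings stay in the pool for the lifetime of the process, so this only pays off when duplicates are common.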
I am in need of a way to detect whether a string changes within my code, however, I am at the same time cautious about my performance:
In reference to this question and answer.
Specifically, this code snippet:
// holds a copy of the previous value for comparison purposes
private string oldString = string.Empty;
private void button6_Click(object sender, EventArgs e)
{
// Get the new string value
string newString = //some varying value I get from other parts of my program
// Compare the old string to the new one
if (oldString != newString)
{
// The string values are different, so update the ListBox
listBox1.Items.Clear();
listBox1.Items.Add(x + /*other things*/);
}
// Save the new value back into the temporary variable
oldString = newString;
}
I am currently working on a Grasshopper 3D component. Each component is itself a class library, and the main method is a method called SolveInstance(). I'm not actually too sure under what conditions it runs, but what I do know is that, at a minimum, it runs a number of times a second, so the graphical UI is pretty much real-time, or the delay is imperceptible to the human eye.
For my particular example, this is what my particular case would look like (it's untested pseudo-code).
// instance vars
private string _oldOutputString = string.Empty;
private string _newOutputString = string.Empty;
// Begin SolveInstance() method
// This constructor call saves a string to _newOutputString based on two lists
_valueList = new ValueList(firstList, secondList);
// Compare the old string to the new one
if (_oldOutputString != _newOutputString)
{
// Save the new value back into the temporary variable
_oldOutputString = _newOutputString;
// Call eventargs method
Menu_MyCustomItemClicked(_sender, _e);
}
DA.SetData(0, _oldOutputString);
My question is: Would doing this, where that particular piece of code gets called many times a second, take a hit in performance?
That string comparison should take on the order of a microsecond or less.
You're only doing it once per button click.
How fast can you click a button - ten times per second?
That means, worst case, that comparison can cost you on the order of ten microseconds per second, or 0.001 percent of time.
Don't worry about anything taking less than 1 percent of time, or even 10%, because if you could fix it, it would save you no more than that.
Would doing this, where that particular piece of code gets called many times a second, take a hit in performance?
Yes.
That's not the important thing, though; the important thing is whether it will cause a significant hit.
Which is a matter of how long it takes, vs. what's significant to you.
The only way to know is to measure.
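As a rough sketch of what "measure" could look like here: the CompareTiming class and the Measure method are hypothetical names, and the timing uses the same Stopwatch approach as any quick-and-dirty benchmark, so treat the numbers as indicative only.

```csharp
using System;
using System.Diagnostics;

public static class CompareTiming
{
    // Times `iterations` ordinal equality checks between a and b.
    // equalCount is returned so the comparison result is actually used
    // and the loop cannot simply be optimised away.
    public static TimeSpan Measure(string a, string b, int iterations, out int equalCount)
    {
        equalCount = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            if (string.Equals(a, b, StringComparison.Ordinal))
            {
                equalCount++;
            }
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Call it once with a small iteration count to warm up the JIT, then with a large count using strings that are representative of what your component really produces, and compare the result against the time budget of one SolveInstance() call.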
However, it's worth considering what makes some string comparisons slower than others.
The fastest string comparison is where two strings are in fact the exact same object. For example:
string a = "ABC";
string b = a;
bool c = a == b; // Very fast.
The next fastest is when one, but not both are null (both being null is an example of the above case, anyway).
The next fastest is when they have different lengths, for those types of comparison where they can't be equivalent if they have different lengths. This doesn't apply to case-insensitive comparisons, because if you capitalise "weißbier" to "WEISSBIER" the lengths are different, but it does apply to exact-match comparisons.
The next fastest is when they differ early on.
The slowest is when two strings are different objects, but are in fact equal.
The average cost of a string equality test is proportional to the length of the strings.
We can reduce the cost of the slowest case by interning strings (whether in the default intern pool or a custom cache of strings), and if we know all strings have gone through the same process, we can make all equality comparisons as fast as the fastest case (because we've made sure that two strings will either be the same object or not be equivalent). However, interning itself takes time, so it's not always worth it.
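A "custom cache of strings" could be as small as the following sketch. StringCache is a hypothetical name, not a framework type, and this version is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class StringCache
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Returns the canonical instance for the given contents,
    // remembering the argument itself the first time it is seen.
    public string Get(string value)
    {
        if (value == null)
        {
            return null;
        }

        string existing;
        if (_pool.TryGetValue(value, out existing))
        {
            return existing;
        }

        _pool[value] = value;
        return value;
    }
}
```

If every string passes through the same cache, then ReferenceEquals(x, y) gives the same answer as x == y for those strings, which is the fast case described above. Unlike the default intern pool, the cache can be dropped when it is no longer needed, but each Get call still has to hash the whole string, which is the cost mentioned above.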
In all, if you only change the string when there is a real change, that will in practice be much faster than building up the string repeatedly, potentially to end up with the same string as you had to begin with.
I have a very big char array that I need to convert to string in order to use Regex on it.
But it's so big that I get OutOfMemoryException when I pass that to string constructor.
I know that string is immutable and therefore it shouldn't be possible to specify its underlying character collection but I need a way to use regular expressions on that without copying the whole thing.
How do I get that array?
I get it from a file using StreamReader. I know the starting position and the length of the content to read, and the Read and ReadBlock methods need me to supply a char[] buffer.
So here are the things I want to know:
Is there a way to specify a string's underlying collection? (Does it even keep its chars in an array?)
...or using Regex directly on a char array?
...or getting the part of the file directly as string?
If you have a character or pattern that you could search for that is guaranteed NOT to be in the pattern you're trying to find, you could scan the array for that character and create smaller strings to process individually. The process would be something like:
char token = '|';
int start = 0; // index where the current segment begins
for (int i = 0; i <= charArray.Length; i++)
{
    // a segment ends at each token and at the end of the array
    if (i == charArray.Length || charArray[i] == token)
    {
        string split = new string(charArray, start, i - start);
        // check the string using the regex
        // the next segment starts just after the token
        start = i + 1;
    }
}
That way you're copying smaller segments of the string that would be GCed after each attempt versus the entire string.
I would think your best bet would be to read multiple char[] chunks into individual strings that overlap by a certain amount. This way you'd be able to run your Regex on the individual chunks, and the overlap would ensure that a "break" between chunks doesn't break the search pattern. In a pseudo-code manner:
int chunkSize = 100000;
int overlap = 2000;
for (int i = 0; i < myCharArray.Length; i += chunkSize - overlap)
{
    // Grab your array chunk into a partial string
    // By having your iteration slightly smaller than
    // your chunk size you guarantee not to miss any
    // character groupings. You just need to make sure
    // your overlap is sufficient to cover the expression
    int count = Math.Min(chunkSize, myCharArray.Length - i); // the last chunk may be shorter
    string chunk = new string(myCharArray, i, count);
    // run your regex
}
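One thing the overlapping-chunk idea above has to deal with is that a match sitting inside the overlap is found twice. Here is a fuller sketch that removes those duplicates by collecting absolute start positions in a set. ChunkedRegex and FindMatchStarts are hypothetical names. It assumes the overlap is at least one character shorter than the chunk and at least as long as the longest possible match minus one; for variable-length patterns a chunk boundary can still cut a match short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChunkedRegex
{
    // Returns the distinct absolute start indexes of all matches,
    // searching the array one overlapping chunk at a time.
    public static SortedSet<int> FindMatchStarts(char[] data, Regex regex, int chunkSize, int overlap)
    {
        if (overlap < 0 || overlap >= chunkSize)
        {
            throw new ArgumentException("overlap must be non-negative and smaller than chunkSize");
        }

        var starts = new SortedSet<int>();
        int step = chunkSize - overlap;

        for (int offset = 0; offset < data.Length; offset += step)
        {
            // The last chunk may be shorter than chunkSize.
            int count = Math.Min(chunkSize, data.Length - offset);
            string chunk = new string(data, offset, count);

            foreach (Match m in regex.Matches(chunk))
            {
                // The set silently drops matches already seen in the overlap.
                starts.Add(offset + m.Index);
            }

            if (offset + count >= data.Length)
            {
                break; // this chunk reached the end of the array
            }
        }

        return starts;
    }
}
```

Only one chunk-sized string is alive at a time, so the peak extra memory is roughly the chunk size rather than the size of the whole array.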
One rather ugly option would be to use an unmanaged RegEx library (like the POSIX regular expression library) and unsafe code. You can obtain a char* pointer to the char array (cast it to byte* if the library expects bytes) and pass it directly to the unmanaged library, then marshal the responses back. Keep in mind that the data is UTF-16, two bytes per char, so the library has to be able to handle that.
unsafe
{
    fixed (char* pArray = largeCharArray)
    {
        // call unmanaged code with pArray
    }
}
If you are using .NET 4.0 or higher, what you should be using is a MemoryMappedFile. This class was designed for working with very large files. From the MSDN documentation:
A memory-mapped file maps the contents of a file to an application’s logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.
Once you have your memory-mapped file, check out this Stack Overflow answer on how to apply RegEx to the memory-mapped file.
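To give an idea of the API, here is a minimal sketch that maps a file read-only and decodes just one region of it. MappedSlice and ReadSlice are hypothetical names. Note that the offset and length are in bytes, not chars, so you need to know the file's encoding, and the result is still a string copy of that region; the sketch only shows how to get at part of the file without first reading everything into a char[].

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class MappedSlice
{
    // Decodes `byteCount` bytes starting at `byteOffset` using `encoding`.
    public static string ReadSlice(string path, long byteOffset, int byteCount, Encoding encoding)
    {
        // Capacity 0 means "use the current size of the file on disk".
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        // The view covers only the requested region of the file.
        using (MemoryMappedViewStream view = mmf.CreateViewStream(
            byteOffset, byteCount, MemoryMappedFileAccess.Read))
        using (StreamReader reader = new StreamReader(view, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, ReadSlice(path, 4, 6, Encoding.ASCII) on a file containing 0123456789abcdef would give back the six bytes starting at offset 4.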
Hope this helps!
When developing in Java a couple of years ago I learned that it is better to append a char if I had a single character instead of a string with one character because the VM would not have to do any lookup on the string value in its internal string pool.
string stringappend = "Hello " + name + ".";
string charappend = "Hello " + name + '.'; // better?
When I started programming in C# I never thought of the chance that it would be the same with its "VM". I came across C# String Theory—String intern pool that states that C# also has an internal string pool (I guess it would be weird if it didn't) so my question is,
are there actually any benefits to appending a char instead of a string when concatenating to a string in C#, or is it just gibberish?
Edit: Please disregard StringBuilder and string.Format, I am more interested in why I would replace "." with '.' in code. I am well aware of those classes and functions.
If given a choice, I would pass a string rather than a char when calling System.String.Concat or the (equivalent) + operator.
The only overloads that I see for System.String.Concat all take either strings or objects. Since a char isn't a string, the object version would be chosen. This would cause the char to be boxed. After Concat verifies that the object reference isn't null, it would then call object.ToString on the char. It would then generate the dreaded single-character string that was being avoided in the first place, before creating the new concatenated string.
So I don't see how passing a char is going to gain anything.
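To make that reasoning concrete, here are the two variants from the question next to a version that spells out the conversion by hand. ConcatDemo is a made-up class, and the exact IL the compiler emits for the char case is an assumption that may differ between compiler versions; either way the char has to become a one-character string before it can be concatenated.

```csharp
using System;

public static class ConcatDemo
{
    // String literal on the right-hand side.
    public static string WithString(string name)
    {
        return "Hello " + name + ".";
    }

    // Char literal on the right-hand side.
    public static string WithChar(string name)
    {
        return "Hello " + name + '.';
    }

    // Roughly what the char version has to do under the hood:
    // turn the char into a one-character string, then concatenate.
    public static string WithCharSpelledOut(string name)
    {
        return string.Concat("Hello ", name, '.'.ToString());
    }
}
```

All three return the same text, so any difference is purely in how much work is done to get there.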
Maybe someone wants to look at the Concat operation in Reflector to see if there is special handling for char?
UPDATE
As I thought, this test confirms that char is slightly slower.
using System;
using System.Diagnostics;
namespace ConsoleApplication19
{
class Program
{
static void Main(string[] args)
{
TimeSpan throwAwayString = StringTest(100);
TimeSpan throwAwayChar = CharTest(100);
TimeSpan realStringTime = StringTest(10000000);
TimeSpan realCharTime = CharTest(10000000);
Console.WriteLine("string time: {0}", realStringTime);
Console.WriteLine("char time: {0}", realCharTime);
Console.ReadLine();
}
private static TimeSpan StringTest(int attemptCount)
{
Stopwatch sw = new Stopwatch();
string concatResult = string.Empty;
sw.Start();
for (int counter = 0; counter < attemptCount; counter++)
concatResult = counter.ToString() + ".";
sw.Stop();
return sw.Elapsed;
}
private static TimeSpan CharTest(int attemptCount)
{
Stopwatch sw = new Stopwatch();
string concatResult = string.Empty;
sw.Start();
for (int counter = 0; counter < attemptCount; counter++)
concatResult = counter.ToString() + '.';
sw.Stop();
return sw.Elapsed;
}
}
}
Results:
string time: 00:00:02.1878399
char time: 00:00:02.6671247
When developing in Java a couple of years ago I learned that it is better to append a char if I had a single character instead of a string with one character because the VM would not have to do any lookup on the string value in its internal string pool.
Appending a char to a String is likely to be slightly faster than appending a 1 character String because:
the append(char) operation doesn't have to load the string length,
it doesn't have to load the reference to the string characters array,
it doesn't have to load and add the string's start offset,
it doesn't have to do a bounds check on the array index, and
it doesn't have to increment and test a loop variable.
Take a look at the Java source code for String and related classes. You might be surprised what goes on under the hood.
The intern pool has nothing to do with it. The interning of string literals happens just once during class loading. Interning of non-literal strings occurs only if the application explicitly calls String.intern().
This may be interesting:
http://www.codeproject.com/KB/cs/StringBuilder_vs_String.aspx
StringBuilder is not necessarily faster than plain strings; as said before, it depends. It depends on machine configuration, available memory versus processor power, and framework version. Your profiler is your best buddy in this case :)
Back to the topic:
You should just TRY which is faster. Do that concatenation a bazillion times and let your profiler watch. You will see possible differences.
All string concatenation in .NET (with the standard operators i.e. +) requires the runtime to reserve enough memory for a complete new string to hold the results of the concatenation. This is due to the string type being immutable.
If you are performing string concatenation many times over (i.e. within a loop etc.) you will suffer performance issues (and eventually memory issues if the string is sufficiently large) as the .NET runtime needs to continually allocate and deallocate memory space to hold each new string.
It's probably for this reason that you're thinking (correctly) that excessive string concatenation can be problematic. It has very little (if anything) to do with concatenating a char rather than a string type.
The alternative to this is to use the StringBuilder class within the System.Text namespace. This class represents a mutable string-like object that can be used to concatenate strings without most of the resulting performance issues. This is because the StringBuilder class will reserve a specific amount of memory for a string, and will allow concatenations to be appended to the end of the reserved memory without requiring a complete new copy of the entire string.
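A small sketch of the difference, with made-up names (DotBuilder, WithConcat, WithBuilder): the first method allocates a new string on every pass through the loop, while the second appends into a buffer that was reserved up front.

```csharp
using System;
using System.Text;

public static class DotBuilder
{
    // Every += allocates a new string and copies the old contents into it.
    public static string WithConcat(int count)
    {
        string result = "";
        for (int i = 0; i < count; i++)
        {
            result += ".";
        }
        return result;
    }

    // The builder reserves `count` chars once and appends in place.
    public static string WithBuilder(int count)
    {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++)
        {
            sb.Append('.');
        }
        return sb.ToString();
    }
}
```

Both produce the same string; the gap between them only becomes noticeable when the loop runs many times or the string gets long.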
EDIT:
With regard to the specifics of string lookups versus char lookups, I whipped up this little test:
class Program
{
static void Main(string[] args)
{
string stringtotal = "";
string chartotal = "";
Stopwatch stringconcat = new Stopwatch();
Stopwatch charconcat = new Stopwatch();
stringconcat.Start();
for (int i = 0; i < 100000; i++)
{
stringtotal += ".";
}
stringconcat.Stop();
charconcat.Start();
for (int i = 0; i < 100000; i++)
{
chartotal += '.';
}
charconcat.Stop();
Console.WriteLine("String: " + stringconcat.Elapsed.ToString());
Console.WriteLine("Char : " + charconcat.Elapsed.ToString());
Console.ReadLine();
}
}
It merely times (using the high-performance Stopwatch class) how long it takes to concatenate 100000 dots/periods (.) of type string vs. 100000 dots/periods of type char.
I ran this test a few times to prevent the results being skewed by one specific run; each time, the results were similar to the following:
String: 00:00:06.4606331
Char : 00:00:06.4528073
Therefore, in the context of multiple concatenations, I'd say that there's very little difference (in all likelihood, no difference when taking standard test run tolerances into account) between the two!
I agree with what everyone is saying about using StringBuilder if you are doing lots of string concatenation because String is an immutable type, but don't forget there's an overhead with creating the StringBuilder class too so you'll have to make a choice when to use which.
In one of Bill Wagner's Effective C# books (or it might be in all three of them), he touched on this too. Broadly speaking, if all you need is to add a few string fragments together, string.Format is better, but if you need to build up a large string value in a potentially large loop, use the StringBuilder.
Every time you concatenate strings using the + operator, the runtime creates a new string. To avoid that, the recommended practice is to use the StringBuilder class, which has an Append method. You can also use AppendLine and AppendFormat.
If you do not want to use StringBuilder, then you can use string.Format:
string str = string.Format("Hello {0}.", name);
Since strings are immutable types both would require creating a new instance of a string before the value is returned back to you.
I would consider string.Concat(...) for a small number of concatenations or use the StringBuilder class for many string concatenations.
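As a sketch of where I'd draw that line (the Greeting class and its methods are hypothetical): a fixed handful of pieces goes through one string.Concat call, while an open-ended number of pieces goes through a StringBuilder.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Greeting
{
    // A few known pieces: one Concat call, one result string.
    public static string Hello(string name)
    {
        return string.Concat("Hello ", name, ".");
    }

    // An unknown number of pieces: append into a single builder
    // and create the final string once at the end.
    public static string HelloAll(IEnumerable<string> names)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string name in names)
        {
            sb.AppendFormat("Hello {0}.", name).Append('\n');
        }
        return sb.ToString();
    }
}
```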
I can't speak to C#, but in Java, the main advantage is not the compile-time gain but the run-time gain.
Yes, if you use a String, then at compile time Java will have to look the String up in its internal pool and possibly create a new String object. But this just happens once, at compile-time, when you create the .class files. The user will never see this.
What the user will see is that at run-time, if you give a character the program just has to retrieve the character. Done. If you give a String, it must first retrieve the String object handle. Then it must set up a loop to go through all the characters, retrieve the one character, observe that there are no more characters, and stop. I haven't looked at the generated byte-code but it's clearly several times as much work.