Related
We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of the time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string, that does not additionally create an string.
Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create new string, because, as you already know, strings are immutable.
Why?
There is no reason to a create new string on each replacement - it's very simple to avoid it, and it will give huge performance boost.
If you are very interested, referencesource.microsoft.com and SSCLI2.0 source code will demonstrate this(how-to-see-code-of-method-which-marked-as-methodimploptions-internalcall):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
// unnecessary code ommited
while (((index=COMStringBuffer::LocalIndexOfString(thisBuffer,oldBuffer,
thisLength,oldLength,index))>-1) && (index<=endIndex-oldLength))
{
replaceIndex[replaceCount++] = index;
index+=oldLength;
}
if (replaceCount != 0)
{
//Calculate the new length of the string and ensure that we have
// sufficent room.
INT64 retValBuffLength = thisLength -
((oldLength - newLength) * (INT64)replaceCount);
gc.retValString = COMString::NewString((INT32)retValBuffLength);
// unnecessary code ommited
}
}
as you can see, retValBuffLength is calculated, which knows the amount of replaceCount's. The real implementation can be a bit different for .NET 4.0(SSCLI 4.0 is not released), but I assure you it's not doing anything silly :-).
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes. Reusable StringBuilder that has capacity of ~2000 characters. Avoid any memory allocation. This is only true if the the replacement lengths are equal, and can get you a nice performance gain if you're in tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.
Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we'll follow to StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on a string objects:
Although String is immutable in .NET, this is not some kind of limitation, it's a contract. String is actually a reference type, and what it includes is the length of the actual string + the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass the string to its constructor it actually copies the string's buffer to his own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string any copies his buffer there.
So, if you need a fast and memory efficient way to make changes in a string, StringBuilder is definitely your choice. Especially regarding that Microsoft explicitly recommends to use StringBuilder if you "perform repeated modifications to a string".
I haven't found any sources but i strongly doubt that the implementation creates always new strings. I'd implement it also with a StringBuilder internally. Then String.Replace is absolutely fine if you want to replace once a huge string. But if you have to replace it many times you should consider to use StringBuilder.Replace because every call of Replace creates a new string.
So you can use StringBuilder.Replace since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()
There is no string method for that. You are own your own. But you can try something like this:
oldFormat="dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}/{2}", dt[0], dt[1], dt[2]);
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.AppendFormat("{0}/{1}", dt[1], dt[2]);
I know there is a rule about strings in C# that says:
When we create a textual string of type string, we can never change its value! When putting different value for a string variable thje first string will stay in memory and variable (which is kind of reference type) just gets the address of the new string.
So doing something like this:
string a = "aaa";
a = a.Trim(); // Creates a new string
is not recommended.
But what if I need to do some actions on the string according to user preferences, like so:
string a = "aaa";
if (doTrim)
a = a.Trim();
if (doSubstring)
a = a.Substring(...);
etc...
How can I do it without creating new strings on every action ?
I thougt about sending the string to a function by ref, like so:
void DoTrim(ref string value)
{
value = value.Trim(); // also creates new string
}
But this also creates a new string...
Can someone please tell me if there is a way of doing it without wasteing memory on each action ?
You are correct in that the operations you're performing are creating new strings, and not mutating a single string.
You are incorrect in that this is generally problematic or something to be avoided.
If your strings are hundreds of thousands of characters, then sure, copying all of those just to remove a few leading spaces, or to add a few characters to the end of it (repeatedly, in a loop, in particular) can actually be a problem.
If your strings aren't large, and you're not performing many (an in thousands of) operations on the string, then you almost certainly don't have a problem.
Now there are a handful of contexts, generally rather rare, that do run into problems with string manipulation. Probably the most common of the problematic contexts is appending a bunch of strings together, as doing so means copying all of the previously appended data for each new addition. If you're in that situation consider using something like a StringBuilder or a single call to string.Concat (the overload accepting a sequence of strings to concat) to perform this operation.
Other contexts are, for example, programs dealing with processing DNA strands. They'll often be taking strings of millions of characters and creating hundreds of thousands of many thousand character long substrings of that string. Using standard C# string operations would therefore result in a lot of unnecessary copying. People writing such programs end up creating objects that can represent a substring of another string without copying the data and instead referring to the existing string's underlying data source with an offset.
Sticking my neck out here a bit so I'll preface with saying in most cases Servy's answer is the correct answer. However, if you really do need lower level access and less string allocations, you could consider creating a character buffer (simple array for instance) that is big enough to fit your processed string and allow you direct manipulation of the characters. There are some significant downfalls to this, though. Including that you'll probably have to write your own Substring() and Trim() modifiers, and your buffer will likely be bigger than your input strings in many cases to accommodate unexpected string sizes. Once you are done manipulating your buffer, you could then package the character array up as a String. Since all of your manipulations are done on a single buffer, you should save a lot of allocations.
I would seriously consider if the above is worth the hassle, but if you really need the performance, this is the best solution I can think of.
How can I do it without creating new strings on every action?
You should only worry about that if you're handling big strings or if you're doing many string operations in a short period of time.
Even then, the performance loss due to creating more references is minimal.
The Garbage Collector has to collect all the unused string variables, but hey - that only really matters if you're doing MANY string operations.
So rather focus on readability in your code, rather than trying to optimize its performance in the first place.
If you really have to keep the same reference of string, you can simply use a StringBuilder.
Why do you feel uncomfortable creating new strings? There is a reason for the string API to be designed this way. For example, immutable objects are thread-safe (and they allow for a more functional programming style).
If you replace your simple string code by stringbuilders, your code might be more error-prone in multithreading scenarios (which is quite normal in a web application for example).
StringBuilders are used for concatenating strings, inserting characters, removing characters, etc. But they will need to reallocate and copy their internal characters arrays every now and then, too.
When you speak about memory consumption you have started to micro-optimize your code. Don't.
BTW: Have a look at the LINQ API. What does each operation do? Rats - it creates a new enumerator! A query like foos.Where(bar).Select(baz).FirstOrDefault() could certainly be memory-optimized by just creating a single enumerator object and modifying the criteria it applies when enumerating. </irony>
It will depend on what your exact use case is, but you might want to explore using the StringBuilder class which you can use to build and modify strings.
This is a bit of an odd question and more of a though experiment that anything I need, but I'm still curious about the answer: If I have a string that I know ahead of time will never change but is (mostly) made up of repetitive parts, would it be better to have said string as just a single string object, get called when needed, and be done with it - or should I break the string up into smaller strings that represent the repeated parts and concatenate them when needed?
Let me use an example: Let's say we have a naive programmer who wants to create a regular expression for validating IP Addresses (in other words, I know this regular expression won't work as intended, but it helps show what I mean by repetitive parts and saves me a bit of typing for the second part of the example). So he writes this function:
private bool isValidIP(string ip)
{
Regex checkIP = new Regex("\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
return checkIP.IsMatch(ip);
}
Now our young programmer notices that he has "\d", "\d?", and "\." just repeated a few times. This gives him an idea that he could both save some storage space and help remind himself what this means for later. So he remakes the function:
private bool isValidIP(string ip)
{
string escape = "\\";
string digi = "d";
string digit = escape + digi;
string possibleDigit = digit + '?';
string IpByte = digit + possibleDigit + possibleDigit;
string period = escape + '.';
Regex checkIP = new Regex(IpByte + period + IpByte + period + IpByte + period + IpByte);
return checkIP.IsMatch(ip);
}
The first method is simple. It just stores 38 chars in the program's instructions, which are just read into memory each time the function is called.
The second method stores (I suspect) two 1 length strings and two chars into the program's instructions as well as all of the calls to concatenate those four into different orders. This creates at least 8 strings in memory when the program is called (the six named strings, a temporary string for the first four parts of the regex, and then the final string created from the previous string + the three strings of the regex). This second method also happens to help explain what the regex is looking for - though not what the final regex would look like. It could also help with refactoring, say if our hypothetical programmer realizes that his current regex will allow for more than just 0-255 in the IP Address, and the constitute parts can be changed without having to find every single item that would need to be fixed.
Again, which method would be better? Would it just be as simple as a trade-off between program size vs. memory usage? Of course, with something as simple as this, the trade-off is negligible at best, but what about a much larger, more complex string?
Oh, yes, and a much better regex for IP Addresses would be:
^(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)(\\.(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)){3}$
Wouldn't work as well as an example, would it?
The first is by far the better option. Here's why
It's clearer.
It's cheaper. Any time you declare a new object it's an "expensive" process. You have to make space for it on the heap (well for strings at least). Yes, you could in theory be saving a byte or so, but your spending a lot more time (probably, I haven't tested it) going through and allocating space for each string, additional memory instructions etc. Not to mention the fact that remember, you also have to factor in the use of the GC. You keep allocating strings and eventually you are going to have to contend with it taking up process ticks also. You really want to hit on optimization, I can easily tell this code isn't as efficient as it could be. There are no constants for one thing, which means that you are possibly creating more objects than you need instead of letting the compiler optimize for strings that don't need to change. This leads me to think, that as a person reviewing this code, I need to take a much closer look at what is going to see what is going on and figure out if something is wrong.
It's clearer (yes, I said this again). You want to do an academic pursuit to see how efficient you can make it. That's cool. I get that. I do it myself. It's fun. I NEVER let that slip into production code. I don't care about losing a tick, I care about having a bug in production, and I care about if other programmers can understand what my code does. Reading someone else's code is hard enough, I don't want to add the extra task of them having to try and figure out which micro-optimization I put in and what happens if they "nudge" the wrong piece of code.
You hit on another point. What if the original regex is wrong. Google will tell you this problem has been solved. You can Google another regex that's right and has been tested. You can't Google "What's wrong with my code." You can post it on SO sure, but that means that someone else has to get involved and look through it.
Here's how to make the first example win the horse race easily:
Regex checkIP = new Regex(
"\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
private bool isValidIP(string ip)
{
return checkIP.IsMatch(ip);
}
Declare once, reuse over and over. If you are taking the time to recreate the regex dynamically to save a few, don't get to do that. Technically you could do that and still only create the object once, but that is a lot more work than say, moving it to a class level variable.
You're effectively attempting to game the compiler here and implement your own string compression. For the kinds of string literals you're describing, it seems like your savings will be mere tens of bytes shaved off of the compiled binary, which due to memory alignment may not even be realized. In exchange for these few bytes of saved space, this approach adds code complexity and runtime overhead, not to mention difficulty in debugging.
Storage is cheap. Why make your life (and the lives of your coworkers) harder? Keep your code simple, clear, and evident - you'll thank yourself later.
The second is worse off in memory consumption, as every time you concatenate two strings you've got three in memory.
Although the compiler started handling some instances of string constants by creating a StringBuilder for you, I'd still vote for the first one being less memory intensive, because if the system does create the StringBuilder for you, you are going to have the overhead for that, and if it doesn't see the first paragraph...
I am now curious how compiling the RegEx would effect the memory usage.
Savings here are illusionary and splitting this string up is a big overshot. Saving insignificant amount of memory and complicating so simple code is just pointless. You will not see any savings but next person to maintain that code will spend 10x more time understanding it.
Strings are immutable so if your string never/rarely changes keep it in one piece. Intense string concatenations give garbage collector additional strain.
Unless your strings and sub-strings are big and you could save at least kilobytes, do not spend your time and effort on such optimizations.
I have to read a huge xml file which consists of over 3 million records and over 10 million nested elements.
Naturally I am using xmltextreader and have got my parsing time down to about 40 seconds from earlier 90 seconds using multiple optimization tricks and tips.
But I want to further save processing time as much as I can hence below question.
Quite a few elements are of type xs:boolean and the data provider always represents values as "true" or "false" - never "1" or "0".
For such cases my earliest code was:
if (xmlTextReader.Value == "true")
{
bool subtitled = true;
}
which i further optimized to:
if (string.Equals(xmlTextReader.Value, "true", StringComparison.OrdinalIgnoreCase))
{
bool subtitled = true;
}
I wanted to know if below would be fastest (because its either "true" or "false")?
if (xtr.value.length == 4)
{
bool subtitled = true;
}
Yes, it is faster, because you only compare exactly one value, namely the length of the string.
By comparing two strings with each other, you compare each and every character, as long as both characters are the same. So if you're finding a match for the string "true", you're going to do 4 comparisons before the predicate evaluates to true.
The only problem you have with this solution is, that if someday the value is going to change from true to let's say 1, you're going to run into a problem here.
Comparing length will be faster, but less readable. I wouldn't use it unless I profile the performance of the code and conclude that I need this optimization.
What about comparing the first character to "t"?
Should (maybe :) be faster than comparing the whole string..
Measuring the length would almost invariably be faster. That said, unless this is an experiment in micro-optimization, I'd just focus on making the code to be readable and convey the proper semantics.
You might also try something like that uses the following approach:
Boolean.TryParse(xmlTextReader.Value, out subtitled)
I know that has nothing to do with your question, but I figured I'd throw it out there anyway.
Cant you just write a unit test? Run each scenario for example 1000 times and compare the datetimes.
If you know it's either "true" or "false", the last snippet must be fastest.
Anyway, you can also write:
bool subtitled = (xtr.Value.length == 4);
That should be even faster.
Old question I know but the accepted answer is wrong, or at least, incorrect in it's explanation.
Comparing the lengths maybe be the slightest bit faster but only because string.Equals is likely doing some other comparisons before it too checks the lengths and decides that they are not equal strings.
So in practice this is an optimization of last resort.
Here you can find the source for .NET core string comparison.
String comparing and parsing is very slow in .Net, I'd recommend avoid intensive using string parsing/comparing in .Net.
If you're forced to do it -- use highly optimized unmanaged or unsafe code and use parallelism.
IMHO.
What is it about the way strings are implemented that makes them so expensive to manipulate?
Is it impossible to make a "cheap" string implementation?
or am I completely wrong in my understanding?
Thanks
Which language?
Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.
This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.
If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.
If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.
The problem with strings is that they are not primitive types. They are arrays.
Therefore, they suffer the same speed and memory problems as arrays(with a few optimizations, perhaps).
Now, "cheap" implementations would require lots of stuff: concatenation, indexOf, etc.
There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.
Since it creates new copy of the object every time in java its advisable to use StringBuffer
Syntax
StringBuffer strBuff=new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String string=strBuff.tostring();
Many of the points here are well taken. In isolated cases you may be able to cheat and do thing like using a 64bit int to compare 8 bytes at time in a string, but there are not a lot of generalized cases where you can optimize operations. If you have "pascal style" string with a numeric length field compares can be short circuited logic to only check the rest of the string if the length is not the same. Other operations typically require you to handle the characters a byte at time or completely copy them when you use them.
i.e. concatenation => get length of string 1, get length of string 2, allocated memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string libary, but the overhead of setting it up for small strings would outweigh the benefits.
Pete
It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.
Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:
Concatenation.
Slicing.
Getting a character at an index.
Changing a character at an index.
Locating the index of a character.
Traversing the string.
Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.
If you want a universal string working in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between getting one thing fast and another. So... either you use a "standard" string working properly (but not in an optimal way), or a string implementation which is very fast in some cases and cumbersome in other.
Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...
Changes and copying of strings tends to involve memory management.
Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.
You want to read this Joel Spolsky article:
http://www.joelonsoftware.com/articles/fog0000000319.html
Me, I'm disappointed .NET doesn't have a native type called F***edString.