I read today that in C# strings are immutable: once created, they can't be changed. So how come the code below works?
string str="a";
str +="b";
str +="c";
str +="d";
str +="e";
Console.Write(str); // output: abcde
How come the value of the variable changed?
String objects are immutable, but variables can be reassigned.
You created separate objects
a
ab
abc
abcd
abcde
Each of these immutable strings was assigned, in turn, to the variable str.
You cannot change the contents (the characters inside) of a string.
Changing the variable is a different thing altogether.
It is easy to show that the post did not mutate the String object as believed:
string a = "a";
string b = a;
Console.WriteLine(object.ReferenceEquals(a, b)); // true
Console.WriteLine(a); // a
Console.WriteLine(b); // a
b += "b";
Console.WriteLine(object.ReferenceEquals(a, b)); // false
Console.WriteLine(a); // a
Console.WriteLine(b); // ab
This is because the x += y operator is equivalent to x = x + y, just with less typing.
Happy coding.
Use Reflector to look at the IL code and you will see exactly what is going on. Although your code logically appends new content onto the end of the string, behind the scenes the compiler emits IL that creates a new string for each assignment.
The picture gets a little muddier if you concatenate literal strings in a single statement like this:
str = "a" + "b" + "c" ...
In this case the compiler is usually smart enough not to create all the extra strings (and thus work for the garbage collector) and will translate it for you into IL equivalent to:
str = "abc"
That said, doing it on separate lines like that might not trigger that optimization.
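A quick way to see that folding in action (a small sketch; it relies on the fact that string literals and compile-time constants are interned):

string folded = "a" + "b" + "c";   // the compiler folds this into the single literal "abc"
Console.WriteLine(object.ReferenceEquals(folded, "abc")); // True: same interned string instance

string x = "a";
string runtime = x + "b" + "c";    // built at run time via String.Concat, so a new object
Console.WriteLine(object.ReferenceEquals(runtime, "abc")); // False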
Immutable means the contents at the memory location holding the string value never get changed; a new value always goes to a new location.
string str ="a"; //stored at some memory location
str+= "b"; // now str='ab' and stored at some other memory location
str+= "c"; // now str='abc' and stored at some other memory location
and so on...
Whenever you change the value of a string, the new value is never stored at the original location; it is stored at a new memory location.
string a="Hello";
string b=a;
a="changed";
Console.WriteLine(b);
Output
Hello // variable b still referring to the original location.
Please check Jon Skeet's page:
http://csharpindepth.com/Articles/General/Strings.aspx
Concepts:
The variable and the instance are separate concepts. A variable is something that holds a value; in the case of a string, the variable holds a reference to the string data stored somewhere else: the instance.
A variable can always be assigned and reassigned if you want... it is variable after all! =)
The instance, which as I said lives somewhere else, cannot be changed in the case of a string.
By concatenating strings like you did, you are in fact creating a lot of separate allocations, one for each concatenation.
The correct way to do it:
To concatenate strings, you can use the StringBuilder class:
StringBuilder b = new StringBuilder();
b.Append("abcd");
b.Append(" more text");
string result = b.ToString();
You could also use a list of strings and then join it:
List<string> l = new List<string>();
l.Add("abcd");
l.Add(" more text");
string result = string.Join("", l);
@pst - I agree that readability is important, and in most cases on a PC it won't matter, but what about when you're on a mobile platform where system resources are constrained?
It's important to understand that StringBuilder is the best way to concat strings. It is much faster and more efficient.
You highlighted the important difference, though, as to whether it is significantly faster, and in what scenarios. It is illustrative that the difference has to be measured in ticks at low volumes, because it can't be measured in milliseconds.
It's important to know that for everyday scenarios on a desktop platform, the difference is imperceptible. But it's also important to know that for mobile platforms, edge cases where you're building large strings or doing thousands of concats, or for performance optimization, StringBuilder does win. With a very large number of concats, it is worth noting that StringBuilder takes slightly more memory.
This is by no means a perfect comparison, but for the fool that concats 1,000,000 strings, StringBuilder beats plain string concatenation by ~10 mins (on a Core 2 Duo E8500 @ 3.16 GHz in Win 7 x64):
String concat (10): 9 ticks, 0 ms, 8192 bytes
String Builder (10): 2 ticks, 0 ms, 8192 bytes
String concat (100): 30 ticks, 0 ms, 16384 bytes
String Builder (100): 6 ticks, 0 ms, 8192 bytes
String concat (1000): 1658 ticks, 0 ms, 1021964 bytes
String Builder (1000): 29 ticks, 0 ms, 8192 bytes
String concat (10000): 105451 ticks, 34 ms, 2730396 bytes
String Builder (10000): 299 ticks, 0 ms, 40624 bytes
String concat (100000): 15908144 ticks, 5157 ms, 200020 bytes
String Builder (100000): 2776 ticks, 0 ms, 216888 bytes
String concat (1000000): 1847164850 ticks, 598804 ms, 1999804 bytes
String Builder (1000000): 27339 ticks, 8 ms, 2011576 bytes
Code:
class Program
{
    static void Main(string[] args)
    {
        TestStringCat(10);
        TestStringBuilder(10);
        TestStringCat(100);
        TestStringBuilder(100);
        TestStringCat(1000);
        TestStringBuilder(1000);
        TestStringCat(10000);
        TestStringBuilder(10000);
        TestStringCat(100000);
        TestStringBuilder(100000);
        TestStringCat(1000000);
        TestStringBuilder(1000000);
        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }

    static void TestStringCat(int iterations)
    {
        GC.Collect();
        String s = String.Empty;
        long memory = GC.GetTotalMemory(true);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s += "a";
        }
        sw.Stop();
        memory = GC.GetTotalMemory(false) - memory;
        Console.WriteLine("String concat \t({0}):\t\t{1} ticks,\t{2} ms,\t\t{3} bytes", iterations, sw.ElapsedTicks, sw.ElapsedMilliseconds, memory);
    }

    static void TestStringBuilder(int iterations)
    {
        GC.Collect();
        StringBuilder sb = new StringBuilder();
        long memory = GC.GetTotalMemory(true);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sb.Append("a");
        }
        sw.Stop();
        memory = GC.GetTotalMemory(false) - memory;
        Console.WriteLine("String Builder \t({0}):\t\t{1} ticks,\t{2} ms,\t\t{3} bytes", iterations, sw.ElapsedTicks, sw.ElapsedMilliseconds, memory);
    }
}
It didn't; it's actually overwriting the variable with a new value, creating a new string object each time.
Immutability means that the instance cannot be modified. For every += there, you are creating a completely new string object.
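To make that concrete, this is roughly what the compiler emits for each +=:

string str = "a";
// str += "b"; compiles to roughly:
str = string.Concat(str, "b");   // allocates a brand-new "ab"; the original "a" object is untouched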
I wrote a program that runs a simple for loop in both C++ and C#, yet the same thing takes dramatically longer in C#. Why is that? Did I fail to account for something in my test?
C# (13.95s)
static double timeStamp() {
return (double)(DateTime.UtcNow.Subtract(new DateTime(1970, 1, 1))).TotalSeconds;
}
static void Main(string[] args) {
double timeStart = timeStamp();
string f = "";
for(int i=0; i<100000; i++) {
f += "Sample";
}
double timeEnd = timeStamp();
double timeDelta = timeEnd - timeStart;
Console.WriteLine(timeDelta.ToString());
Console.Read();
}
C++ (0.20s)
long int timeStampMS() {
milliseconds ms = duration_cast<milliseconds> (system_clock::now().time_since_epoch());
return ms.count();
}
int main() {
long int timeBegin = timeStampMS();
string test = "";
for (int i = 0; i < 100000; i++) {
test += "Sample";
}
long int timeEnd = timeStampMS();
long double delta = timeEnd - timeBegin;
cout << to_string(delta) << endl;
cin.get();
}
On my PC, after changing the code to use StringBuilder and converting to a String at the end, the execution time went from 26.15 seconds to 0.0012 seconds, or over 20,000 times faster.
var fb = new StringBuilder();
for (int i = 0; i < 100000; ++i) {
fb.Append("Sample");
}
var f = fb.ToString();
As explained in the .Net documentation, the StringBuilder class is a mutable string object that is useful for when you are making many changes to a string, as opposed to the String class, which is an immutable object that requires a new object creation every time you e.g. concatenate two Strings together. Because the implementation of StringBuilder is a linked list of character arrays, and new blocks are added up to 8000 characters at a time, StringBuilder.Append is much faster.
The C++ loop may be fast because it doesn't actually need to do anything. A good optimizer can prove that removing the entire loop makes no observable difference in the behaviour of the program (execution time doesn't count as observable). I don't know whether the C# runtime is allowed to do a similar optimization. In any case, to guarantee sensible measurements, you must always use the result in a way that is observable.
Assuming the optimizer didn't remove the loop, appending a constant-length string to a std::string has amortized constant complexity. Strings in C# are immutable, so the operation creates a new copy of the string every time, and so it has linear complexity. The longer the string becomes, the more significant this difference in asymptotic complexity becomes. You can achieve the same asymptotic complexity by using the mutable StringBuilder in C#.
Since strings are immutable, each concatenation creates a new string.
The used strings are left for dead, awaiting garbage collection.
A StringBuilder is instantiated once, and new chunks of data can be added as needed; it expands its capacity via MakeRoom (see the .NET source).
Test it using a StringBuilder:
string stringToAppend = "Sample";
int iteratorMaxValue = 100000;
StringBuilder sb = new StringBuilder(stringToAppend.Length * iteratorMaxValue);
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
for (int i = 0; i < iteratorMaxValue; i++) {
sb.Append(stringToAppend);
}
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);
4 Milliseconds on my machine.
This simple piece of C# code, which is meant to find script blocks in HTML, takes 0.5 seconds to run on a 74K-character string with only 9 script blocks in it. This is an undebugged release binary on a 2.8 GHz i7 CPU. I made several runs through this code to make sure that performance is not impeded by the JIT. It is not.
This is VS2010, .NET 4.0 Client Profile, x64.
Why is this so slow?
int[] _exclStart = new int[100];
int[] _exclStop = new int[100];
int _excl = 0;
for (int f = input.IndexOf("<script", 0); f != -1; )
{
_exclStart[_excl] = f;
f = input.IndexOf("</script", f + 8);
if (f == -1)
{
_exclStop[_excl] = input.Length;
break;
}
_exclStop[_excl] = f;
f = input.IndexOf("<script", f + 8);
++_excl;
}
I used the source of this page as an example, then duplicated the content 8 times, resulting in a page some 334,312 bytes long. Using StringComparison.Ordinal yields a massive performance difference.
string newInput = string.Format("{0}{0}{0}{0}{0}{0}{0}{0}", input.Trim().ToLower());
//string newInput = input.Trim().ToLower();
System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();
int[] _exclStart = new int[100];
int[] _exclStop = new int[100];
int _excl = 0;
for (int f = newInput.IndexOf("<script", 0, StringComparison.Ordinal); f != -1; )
{
_exclStart[_excl] = f;
f = newInput.IndexOf("</script", f + 8, StringComparison.Ordinal);
if (f == -1)
{
_exclStop[_excl] = newInput.Length;
break;
}
_exclStop[_excl] = f;
f = newInput.IndexOf("<script", f + 8, StringComparison.Ordinal);
++_excl;
}
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalMilliseconds);
Running it 5 times yields almost the same result each time (the loop timings did not change significantly, so for this simple code there is almost no time spent on JIT compilation).
Output using your original code (in Milliseconds):
10.2786
11.4671
11.1066
10.6537
10.0723
Output using the above code instead (in Milliseconds):
0.3055
0.2953
0.2972
0.3112
0.3347
Notice that my test results are around 0.010 seconds (original code) and 0.0003 seconds (for ordinal code). Meaning you have something else wrong other than this code directly.
If, as you say, using StringComparison.Ordinal does nothing for your performance, that means either you're using incorrect timers to measure performance, or you have a large overhead in reading your input value, such as reading it from a stream again without realising it.
Tested under Windows 7 x64 running on a 3GHz i5 using .NET 4 Client Profile.
Suggestions:
use StringComparison.Ordinal
Make sure you're using System.Diagnostics.Stopwatch to time performance
Declare a local variable for the input instead of using values external to the function (eg: string newInput = input.Trim().ToLower();)
Again I stress: I am getting 50 times the speed on test data that is apparently over 4 times larger, using the exact same code you provided. That means my test is running some 200 times faster than yours, which is not something anyone would expect given we're both running the same environment, with just an i5 (me) versus an i7 (you).
The IndexOf overload you're using is culture-sensitive, which will affect performance. Instead, use:
input.IndexOf("<script", 0, StringComparison.Ordinal);
I would recommend using Regex for this; it offers a significant performance improvement because the expression is compiled only once. IndexOf, on the other hand, is essentially a loop that runs on a per-character basis, which means you effectively have 3 "loops" within your main for loop. Of course, IndexOf won't be as slow as a regular loop, but the time still increases as the input grows. Regex has built-in methods that return the number and positions of occurrences of each pattern you define.
Edit: this might shed some more light on the performance of IndexOf: IndexOf Perf
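A rough sketch of that idea (the pattern here is only illustrative; real-world HTML needs something more robust, and Singleline lets . span newlines):

using System.Text.RegularExpressions;

// Compiled once and reused for every document.
static readonly Regex ScriptBlock = new Regex(
    "<script.*?</script",
    RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

// later, per document:
foreach (Match m in ScriptBlock.Matches(input))
{
    Console.WriteLine("script block at {0}, length {1}", m.Index, m.Length);
}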
I just tested IndexOf performance with .NET 4.0 on Windows 7:
public void Test()
{
var input = "Hello world, I'm ekk. This is test string";
TestStringIndexOfPerformance(input, StringComparison.CurrentCulture);
TestStringIndexOfPerformance(input, StringComparison.InvariantCulture);
TestStringIndexOfPerformance(input, StringComparison.Ordinal);
Console.ReadLine();
}
private static void TestStringIndexOfPerformance(string input, StringComparison stringComparison)
{
var count = 0;
var startTime = DateTime.UtcNow;
TimeSpan result;
for (var index = 0; index != 1000000; index++)
{
count = input.IndexOf("<script", 0, stringComparison);
}
result = DateTime.UtcNow.Subtract(startTime);
Console.WriteLine("{0}: {1}", stringComparison, count);
Console.WriteLine("Total time: {0}", result.TotalMilliseconds);
Console.WriteLine("--------------------------------");
}
And the result is:
CurrentCulture:
Total time: 225.4008
InvariantCulture:
Total time: 187.2003
Ordinal:
Total time: 124.8003
As you can see performance of Ordinal is a little better.
I won't discuss the code here (it could probably be written with Regex and so on), but in my opinion it is slow because the IndexOf() inside the for loop always rescans the string from the beginning (it always starts from index 0); try scanning from the last occurrence found instead.
How can I optimize the following code so that it executes faster?
static void Main(string[] args)
{
String a = "Hello ";
String b = " World! ";
for (int i=0; i<20000; i++)
{
a = a + b;
}
Console.WriteLine(a);
}
From the StringBuilder documentation:
Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer.
The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs. A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.
static void Main(string[] args) {
String a = "Hello ";
String b = " World! ";
StringBuilder result = new StringBuilder(a.Length + b.Length * 20000);
result.Append(a);
for (int i=0; i<20000; i++) {
result.Append(b);
}
Console.WriteLine(result.ToString());
}
Since its output is predetermined, it would run faster if you just hardcoded the literal value that is built by the loop.
Perform output in the loop (5x faster, same result):
static void Main(string[] args)
{
Console.Write("Hello ");
for (int i=0; i<20000; i++)
Console.Write(" World! ");
Console.Write(Environment.NewLine);
}
Or allocate the memory beforehand and fill it up (4x faster, same result):
static void Main(string[] args)
{
String a = "Hello ";
String b = " World! ";
int it = 20000;
char[] result = new char[a.Length + it*b.Length];
a.ToCharArray().CopyTo(result, 0);
for (int i = 0; i < it; i++)
b.ToCharArray().CopyTo(result, a.Length + i * b.Length);
Console.WriteLine(result);
}
static void Main(string[] args)
{
const String a = "Hello " +
/* insert string literal here that contains " World! " 20000 times. */ ;
Console.WriteLine(a);
}
I can't believe that they teach nonsense like this in schools. There isn't a real-world example of why you would ever do this, let alone optimize it. All this teaches is how to micro-optimize a program that does nothing useful, and that's counterproductive to the student's health as a programmer/developer.
MemoryStream is slightly faster than using StringBuilder:
static void Main(string[] args)
{
String a = "Hello ";
String b = " World! ";
System.IO.MemoryStream ms = new System.IO.MemoryStream(20000 * b.Length + a.Length);
System.IO.StreamWriter sw = new System.IO.StreamWriter(ms);
sw.Write(a);
for (int i = 0; i < 20000; i++)
{
sw.Write(b);
}
sw.Flush(); // flush the StreamWriter's buffer into the MemoryStream before reading it back
ms.Seek(0, System.IO.SeekOrigin.Begin);
System.IO.StreamReader sr = new System.IO.StreamReader(ms);
Console.WriteLine(sr.ReadToEnd());
}
It's likely to be IO dominated ( writing the output to the console or a file will be the slowest part ), so probably won't benefit from a high degree of optimisation. Simply removing obvious pessimisations should suffice.
As a general rule, don't create temporary objects. Each iteration of your loop creates a temporary string, copying the entire previous string in a plus the value of the string in b, so it has to do up to 20000 times the length of b operations each time through the loop. Even so, that's only about 3 billion bytes to copy, so it should complete in less than a second on a modern machine (assuming the runtime uses the right operations for the target hardware). Dumping 160,008 characters to the console may well take longer.
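Rough arithmetic behind that figure: " World! " is 8 characters, so the string copied on iteration i is roughly 8*i characters long; summed over 20,000 iterations that is about 8 * 20,000^2 / 2, roughly 1.6 billion characters, or about 3 billion bytes at two bytes per UTF-16 character.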
One technique is to use a builder or buffer to create fewer temporary objects, instead creating a long string in memory using a StringBuilder then copying that to a string, then outputting that string.
However, you can go one stage further and achieve the same functionality by writing the output directly, rather than creating any temporary strings or buffers, by using Console.Write in the loop instead. That will remove two of the copying operations (the string b is copied to the buffer, then the buffer is copied to a string object, then the string's data to the output buffer; the final copy is the one internal to Console.Write, so it is not avoidable in C#), but it requires more operating system calls, so it may or may not be faster.
Another common optimisation is to unroll the loop. So instead of having a loop which has one line which writes one " World! " and is looped 20,000 times, you can have (say) five lines which write one " World! " each and loop them 4,000 times. That's normally only worth doing in itself if the cost of incrementing and testing the loop variable is high compared to what you're doing in the loop, but it can lead to other optimisations.
Having partially unrolled the loop, you can combine the code in the loop and write five or ten " World! "s with one call to Console.Write, which should save some time in that you're only making one fifth the number of system calls.
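A minimal sketch of that partially unrolled version (the batch size of 10 is just an illustrative choice):

using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string b = " World! ";
        // Ten copies of b in one string; 2,000 writes of it give the same 20,000 copies
        // as the original loop, with a tenth of the Console.Write calls.
        string tenWorlds = string.Concat(Enumerable.Repeat(b, 10));
        Console.Write("Hello ");
        for (int i = 0; i < 2000; i++)
        {
            Console.Write(tenWorlds);
        }
        Console.Write(Environment.NewLine);
    }
}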
Writing to the console in a cmd window, it appears to be limited by the speed of the console window:
(times in seconds for 100 runs)
724.5312500 - concat
53.2187500 - direct
30.3906250 - direct writing b x10
30.3750000 - direct writing b x100
30.3750000 - builder
30.3750000 - builder writing b x100
Writing to a file, the times for the different techniques differ:
205.0000000 - concat
9.7031250 - direct
1.0781250 - direct writing b x10
0.5000000 - builder
0.4843750 - direct writing b x100
0.4531250 - builder writing b x100
From this it's possible to draw two conclusions:
Most of the improvements don't matter if you're writing to the console in a cmd.exe window. You do have to profile the system as a whole, and (unless you're trying to reduce the energy use of the CPU) there's no point optimising one component beyond the capabilities of the rest of the system.
Although it apparently does more work (copying the data more while calling the same number of functions), the StringBuilder approach is faster. This implies that there's quite a high overhead in each call to Console.Write, compared with the equivalent in non-managed languages.
writing to file, using gcc C99 on Win XP:
0.375 - direct ( fputs ( b, stdout ) 20000 times )
0.171 - direct unrolled ( fputs ( b x 100, stdout ) 200 times )
0.171 - copy to b to a buffer 20000 times then puts once
The lower cost of the system call in C allows it to get closer to being IO bound, rather than limited by the .NET runtime overhead. So when optimizing .NET, managed/unmanaged boundaries become important.
I wonder, would this be any faster?
static void Main(string[] args) {
String a = "Hello ";
String b = " World! ";
int worldCount = 20000;
StringBuilder worldList = new StringBuilder(b.Length * worldCount);
worldList.Append(b);
StringBuilder result = new StringBuilder(a.Length + b.Length * worldCount);
result.Append(a);
while (worldCount > 0) {
if ((worldCount & 0x1) > 0) { // Fewer appends, more ToStrings.
result.Append(worldList); // would the ToString here kill performance?
}
worldCount >>= 1;
if (worldCount > 0) {
worldList.Append(worldList);
}
}
Console.WriteLine(result.ToString());
}
Depends on what's going on in the String object, I guess. If internally all they have is a null-terminated string, then you could optimize by storing the length of the string somewhere. Also, if you're just outputting to stdout, it would make more sense to move your output call inside the loop (less memory overhead), and it should also be faster.
Here are some timing results. Each test was conducted starting with 20,000 iterations. Every test includes output in the timings unless stated otherwise. Each number in a group means the number of iterations was 10 times greater than the previous one. If there are fewer than 4 numbers, the test was taking too long so I killed it. "Parallelize it" means I split the number of concatenations evenly over 4 threads and appended the results when all had finished (I could probably have saved a little time by putting them into a queue and appending them as they finished, but didn't think of that until now). All times are in milliseconds.
output hello, loop output world (no concatenation): 656, 6658, 66999, 370717
build with StringBuilder, then output: 658, 6641, 65807, 554546
build with StringBuilder with large initial size, no output: 664, 6571, 65676, 314140
OP, strings only (killed test while concatenating; nothing printed to screen): 2761, 367042
parallelize it, OP, no output: 167, 43227
parallelize it, StringBuilder, no output: 27, 40, 323, 1758
I understand the benefits of StringBuilder.
But if I want to concatenate 2 strings, then I assume that it is better (faster) to do it without StringBuilder. Is this correct?
At what point (number of strings) does it become better to use StringBuilder?
I warmly suggest you read The Sad Tragedy of Micro-Optimization Theater, by Jeff Atwood.
It covers simple concatenation vs. StringBuilder vs. other methods.
Now, if you want to see some numbers and graphs, follow the link ;)
But if I want to concatenate 2 strings, then I assume that it is better (faster) to do it without StringBuilder. Is this correct?
That is indeed correct; you can find exactly why explained very well in:
Article about strings and StringBuilder
Summed up: if you can concatenate strings in one go, like
var result = a + " " + b + " " + c + ..
you are better off without StringBuilder, because only one copy is made (the length of the resulting string is calculated beforehand).
For a structure like
var result = a;
result += " ";
result += b;
result += " ";
result += c;
..
new objects are created each time, so there you should consider StringBuilder.
At the end, the article sums up these rules of thumb:
Rules Of Thumb
So, when should you use StringBuilder, and when should you use the string concatenation operators?
Definitely use StringBuilder when you're concatenating in a non-trivial loop - especially if you don't know for sure (at compile time) how many iterations you'll make through the loop. For example, reading a file a character at a time, building up a string as you go using the += operator is potentially performance suicide.
Definitely use the concatenation operator when you can (readably) specify everything which needs to be concatenated in one statement. (If you have an array of things to concatenate, consider calling String.Concat explicitly - or String.Join if you need a delimiter.)
Don't be afraid to break literals up into several concatenated bits - the result will be the same. You can aid readability by breaking a long literal into several lines, for instance, with no harm to performance.
If you need the intermediate results of the concatenation for something other than feeding the next iteration of concatenation, StringBuilder isn't going to help you. For instance, if you build up a full name from a first name and a last name, and then add a third piece of information (the nickname, maybe) to the end, you'll only benefit from using StringBuilder if you don't need the (first name + last name) string for other purposes (as we do in the example which creates a Person object).
If you just have a few concatenations to do, and you really want to do them in separate statements, it doesn't really matter which way you go. Which way is more efficient will depend on the number of concatenations, the sizes of string involved, and what order they're concatenated in. If you really believe that piece of code to be a performance bottleneck, profile or benchmark it both ways.
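For the array case mentioned in the quote, a tiny illustration:

string[] parts = { "red", "green", "blue" };   // example data
string joinedUp = string.Concat(parts);        // "redgreenblue"  (no delimiter)
string withCommas = string.Join(", ", parts);  // "red, green, blue"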
System.String is an immutable object - meaning that whenever you "modify" its content, a new string is allocated, and this takes time (and memory).
Using StringBuilder you modify the actual content of the object without allocating a new one.
So use StringBuilder when you need to make many modifications to the string.
Not really...you should use StringBuilder if you concatenate large strings or you have many concatenations, like in a loop.
If you concatenate strings in a loop, you should consider using StringBuilder instead of regular String
In case it's single concatenation, you may not see the difference in execution time at all
Here is a simple test app to prove the point:
static void Main(string[] args)
{
//warm-up rounds:
Test(500);
Test(500);
//test rounds:
Test(500);
Test(1000);
Test(10000);
Test(50000);
Test(100000);
Console.ReadLine();
}
private static void Test(int iterations)
{
int testLength = iterations;
Console.WriteLine($"----{iterations}----");
//TEST 1 - String
var startTime = DateTime.Now;
var resultString = "test string";
for (var i = 0; i < testLength; i++)
{
resultString += i.ToString();
}
Console.WriteLine($"STR: {(DateTime.Now - startTime).TotalMilliseconds}");
//TEST 2 - StringBuilder
startTime = DateTime.Now;
var stringBuilder = new StringBuilder("test string");
for (var i = 0; i < testLength; i++)
{
stringBuilder.Append(i.ToString());
}
string resultString2 = stringBuilder.ToString();
Console.WriteLine($"StringBuilder: {(DateTime.Now - startTime).TotalMilliseconds}");
Console.WriteLine("---------------");
Console.WriteLine("");
}
Results (in milliseconds):
----500----
STR: 0.1254
StringBuilder: 0
---------------
----1000----
STR: 2.0232
StringBuilder: 0
---------------
----10000----
STR: 28.9963
StringBuilder: 0.9986
---------------
----50000----
STR: 1019.2592
StringBuilder: 4.0079
---------------
----100000----
STR: 11442.9467
StringBuilder: 10.0363
---------------
There's no definitive answer, only rules-of-thumb. My own personal rules go something like this:
If concatenating in a loop, always use a StringBuilder.
If the strings are large, always use a StringBuilder.
If the concatenation code is tidy and readable on the screen then it's probably ok.
If it isn't, use a StringBuilder.
To paraphrase
Then shalt thou count to three, no more, no less. Three shall be the number thou shalt count, and the number of the counting shall be three. Four shalt thou not count, neither count thou two, excepting that thou then proceed to three. Once the number three, being the third number, be reached, then lobbest thou thy Holy Hand Grenade of Antioch
I generally use string builder for any block of code which would result in the concatenation of three or more strings.
Since it's difficult to find an explanation for this that is not influenced by opinion or followed by a battle of prides, I decided to write a bit of code in LINQPad to test this myself.
I found that using small sized strings rather than using i.ToString() changes response times (visible in small loops).
The test uses different sequences of iterations to keep time measurements in sensibly comparable ranges.
I'll copy the code at the end so you can try it yourself (the results.Chart(...).Dump() calls won't work outside LINQPad).
Output (X-Axis: Number of iterations tested, Y-Axis: Time taken in ticks):
Iterations sequence: 2, 3, 4, 5, 6, 7, 8, 9, 10
Iterations sequence: 10, 20, 30, 40, 50, 60, 70, 80
Iterations sequence: 100, 200, 300, 400, 500
Code (Written using LINQPad 5):
void Main()
{
Test(2, 3, 4, 5, 6, 7, 8, 9, 10);
Test(10, 20, 30, 40, 50, 60, 70, 80);
Test(100, 200, 300, 400, 500);
}
void Test(params int[] iterationsCounts)
{
$"Iterations sequence: {string.Join(", ", iterationsCounts)}".Dump();
int testStringLength = 10;
RandomStringGenerator.Setup(testStringLength);
var sw = new System.Diagnostics.Stopwatch();
var results = new Dictionary<int, TimeSpan[]>();
// This call before starting to measure time removes initial overhead from first measurement
RandomStringGenerator.GetRandomString();
foreach (var iterationsCount in iterationsCounts)
{
TimeSpan elapsedForString, elapsedForSb;
// string
sw.Restart();
var str = string.Empty;
for (int i = 0; i < iterationsCount; i++)
{
str += RandomStringGenerator.GetRandomString();
}
sw.Stop();
elapsedForString = sw.Elapsed;
// string builder
sw.Restart();
var sb = new StringBuilder(string.Empty);
for (int i = 0; i < iterationsCount; i++)
{
sb.Append(RandomStringGenerator.GetRandomString());
}
sw.Stop();
elapsedForSb = sw.Elapsed;
results.Add(iterationsCount, new TimeSpan[] { elapsedForString, elapsedForSb });
}
// Results
results.Chart(r => r.Key)
.AddYSeries(r => r.Value[0].Ticks, LINQPad.Util.SeriesType.Line, "String")
.AddYSeries(r => r.Value[1].Ticks, LINQPad.Util.SeriesType.Line, "String Builder")
.DumpInline();
}
static class RandomStringGenerator
{
static Random r;
static string[] strings;
public static void Setup(int testStringLength)
{
r = new Random(DateTime.Now.Millisecond);
strings = new string[10];
for (int i = 0; i < strings.Length; i++)
{
strings[i] = Guid.NewGuid().ToString().Substring(0, testStringLength);
}
}
public static string GetRandomString()
{
var indx = r.Next(0, strings.Length);
return strings[indx];
}
}
But if I want to concatenate 2 strings, then I assume that it's better and faster to do so without StringBuilder. Is this correct?
Yes. But more importantly, it is vastly more readable to use a vanilla String in such situations. Using StringBuilder in a loop, on the other hand, makes sense and can be just as readable as concatenation.
I'd be wary of rules of thumb that cite a specific number of concatenations as the threshold. Using StringBuilder in loops (and loops only) is probably just as useful, easier to remember, and makes more sense.
As long as you can physically type the number of concatenations (a + b + c ...) it shouldn't make a big difference. N squared (at N = 10) is a 100X slowdown, which shouldn't be too bad.
The big problem is when you are concatenating hundreds of strings. At N=100, you get a 10000X times slowdown. Which is pretty bad.
A single concatenation is not worth using a StringBuilder. I've typically used 5 concatenations as a rule of thumb.
I don't think there's a fine line between when to use it and when not to, unless of course someone performs some extensive testing to come up with the golden conditions.
For me, I will not use StringBuilder if I'm just concatenating 2 huge strings. If there's a loop with a nondeterministic count, I'm likely to use it, even if the loop count might be small.
I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is a string that holds the values separated by the delimiter.
Measuring the performance of my ReadNextRow method I noticed that it spends 66% on String.Split, so I was wondering if someone knows of a faster method to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast; I've done some testing here trying to outperform it, and it's not easy.
But there's one thing you can do, and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
int l = s.Length;
int i = 0, j = s.IndexOf( c, 0, l );
if ( j == -1 ) // No such substring
{
yield return s; // Return original and break
yield break;
}
while ( j != -1 )
{
if ( j - i > 0 ) // Non empty?
{
yield return s.Substring( i, j - i ); // Return non-empty match
}
i = j + 1;
j = s.IndexOf( c, i, l - i );
}
if ( i < l ) // Has remainder?
{
yield return s.Substring( i, l - i ); // Return remaining trail
}
}
The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking; to be faster you need to optimize those away and implement your own helper methods. You can beat string.Split's performance, but it's going to take clever int-hacking. You can read my post about that here.
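Usage would look something like this (hypothetical input; assuming GetSplit lives in a static class so the extension syntax works). Because it is an iterator, each field is produced only when the foreach asks for it:

string line = "one,two,three";          // hypothetical input
foreach (string field in line.GetSplit(','))
{
    Console.WriteLine(field);           // produced lazily, one field at a time
}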
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas inside quoted values in the file, e.g.:
1,"Something, with a comma",2,3
The other thing I'll point out, without knowing how you profiled, is to be careful about profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play, and you may have significant overhead from just looping, so use some sort of control value.
That being said, split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple (a rough sketch follows the list):
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (ie \") and arguably escaped commas (\,).
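Here's a rough C# sketch of that algorithm, purely for illustration: it uses the doubled-quote convention ("") that most CSV writers emit rather than the backslash escapes mentioned above, and it doesn't try to cover every CSV dialect.

using System;
using System.Collections.Generic;
using System.Text;

static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                current.Append('"');            // doubled quote inside a quoted field
                i++;
            }
            else if (c == '"')
            {
                inQuotes = false;               // closing quote
            }
            else
            {
                current.Append(c);
            }
        }
        else if (c == '"')
        {
            inQuotes = true;                    // opening quote
        }
        else if (c == ',')
        {
            fields.Add(current.ToString());     // stop at every comma
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }
    fields.Add(current.ToString());             // the last field
    return fields;
}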
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade();
int index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Quantity = float.Parse(span.Slice(0, index));
return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Depending on your usage, you can speed this up by using Pattern.split instead of String.split (this is Java's API; see the C# analogue below). If you have this code in a loop (which I assume you probably do, since it sounds like you are parsing lines from a file), String.split(String regex) will call Pattern.compile on your regex string every time that statement executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
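The closest C# analogue of that advice would be to construct the Regex once (optionally with RegexOptions.Compiled) outside the loop and reuse it; the pattern here is only a hypothetical placeholder:

using System.Text.RegularExpressions;

// Compiled once, outside the per-line read loop.
static readonly Regex Delimiter = new Regex(",", RegexOptions.Compiled);

// inside the loop:
string[] values = Delimiter.Split(line);

(For a single literal separator character, plain String.Split(',') will still usually beat a regex; the point is only to avoid re-compiling the pattern for every line.)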
I found this implementation, which is 30% faster, on Dejan Pelzel's blog. I quote from there:
The Solution
With this in mind, I set out to create a string splitter that uses an internal buffer, similarly to a StringBuilder. It uses very simple logic: it goes through the string and saves the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
    // this.buffer is (presumably) a pre-allocated string[] field of the
    // StringSplitter class, surfaced as Results in the usage example below.
    int resultIndex = 0;
    int startIndex = 0;

    // Find the mid-parts
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == separator)
        {
            this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
            resultIndex++;
            startIndex = i + 1;
        }
    }

    // Find the last part
    this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
    resultIndex++;

    return resultIndex;
}
How To Use
The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object rather than creating a new instance of it in loops or for single-time use; in that case it would be better to just use the built-in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
Console.WriteLine("It works!");
}
The Split method returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Split() vs. Regex and other methods.
We are talking ms savings over very large strings, though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split does, writing your own can be an improvement.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.
CSV parsing is actually fiendishly complex to get right; I used classes based on wrapping the ODBC text driver the one and only time I had to do this.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speech marks, strings including commas and speech marks, etc.)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
This is my solution:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length ' Mid is 1-based, so include the last character
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
Return kwds
End Function
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim sw As New Stopwatch
sw.Start()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length ' Mid is 1-based, so include the last character
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
sw.Stop()
Dim fsTime As Long = sw.ElapsedTicks
sw.Start()
Dim strings() As String = inputString.Split(separator)
sw.Stop()
Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
Return kwds
End Function
Here are some results on relatively small strings of varying sizes, up to 8 KB blocks (times are in ticks):
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
// Note: the fixed pointer aliases the input string's own character buffer,
// so this destructively overwrites `input` while splitting it.
public static unsafe List<string> SplitString(char separator, string input)
{
    List<string> result = new List<string>();
    int i = 0;
    fixed (char* buffer = input)
    {
        for (int j = 0; j < input.Length; j++)
        {
            if (buffer[j] == separator)
            {
                buffer[i] = (char)0;              // null-terminate the collected field
                result.Add(new String(buffer));   // new String(char*) copies up to the null
                i = 0;
            }
            else
            {
                buffer[i] = buffer[j];            // compact the current field to the front
                i++;
            }
        }
        buffer[i] = (char)0;
        result.Add(new String(buffer));           // the last field
    }
    return result;
}
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
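A minimal sketch of that StringShim idea (the names are hypothetical; on modern .NET, ReadOnlyMemory&lt;char&gt;/ReadOnlySpan&lt;char&gt; give you essentially this for free):

using System.Collections.Generic;

// Refers to a slice of an existing string without copying its characters.
public struct StringShim
{
    public readonly string Source;
    public readonly int Begin;   // inclusive
    public readonly int End;     // exclusive

    public StringShim(string source, int begin, int end)
    {
        Source = source;
        Begin = begin;
        End = end;
    }

    public int Length => End - Begin;

    // Only materialize a real string when you actually need one.
    public override string ToString() => Source.Substring(Begin, Length);
}

// Splitting into shims: one pass over the line, no string data copied.
public static List<StringShim> SplitToShims(string s, char delimiter)
{
    var shims = new List<StringShim>();
    int start = 0;
    for (int i = 0; i < s.Length; i++)
    {
        if (s[i] == delimiter)
        {
            shims.Add(new StringShim(s, start, i));
            start = i + 1;
        }
    }
    shims.Add(new StringShim(s, start, s.Length));
    return shims;
}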
String.split is rather slow; if you want some faster methods, here you go. :)
However, CSV is much better parsed by a rule-based parser.
This guy has made a rule-based tokenizer for Java (it requires some copy and pasting, unfortunately):
http://www.csdgn.org/code/rule-tokenizer
private static final String[] fSplit(String src, char delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+1;
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}
private static final String[] fSplit(String src, String delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+delim.length();
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}