I understand the benefits of StringBuilder.
But if I want to concatenate 2 strings, then I assume that it is better (faster) to do it without StringBuilder. Is this correct?
At what point (number of strings) does it become better to use StringBuilder?
I warmly suggest you to read The Sad Tragedy of Micro-Optimization Theater, by Jeff Atwood.
It treats Simple Concatenation vs. StringBuilder vs. other methods.
Now, if you want to see some numbers and graphs, follow the link ;)
But if I want to concatinate 2
strings, then I assume that it is
better (faster) to do it without
StringBuilder. Is this correct?
That is indeed correct, you can find why exactly explained very well on :
Article about strings and StringBuilder
Summed up : if you can concatinate strings in one go like
var result = a + " " + b + " " + c + ..
you are better off without StringBuilder for only on copy is made (the length of the resulting string is calculated beforehand.);
For structure like
var result = a;
result += " ";
result += b;
result += " ";
result += c;
..
new objects are created each time, so there you should consider StringBuilder.
At the end the article sums up these rules of thumb :
Rules Of Thumb
So, when should you use StringBuilder,
and when should you use the string
concatenation operators?
Definitely use StringBuilder when
you're concatenating in a non-trivial
loop - especially if you don't know
for sure (at compile time) how many
iterations you'll make through the
loop. For example, reading a file a
character at a time, building up a
string as you go using the += operator
is potentially performance suicide.
Definitely use the concatenation
operator when you can (readably)
specify everything which needs to be
concatenated in one statement. (If you
have an array of things to
concatenate, consider calling
String.Concat explicitly - or
String.Join if you need a delimiter.)
Don't be afraid to break literals up
into several concatenated bits - the
result will be the same. You can aid
readability by breaking a long literal
into several lines, for instance, with
no harm to performance.
If you need the intermediate results
of the concatenation for something
other than feeding the next iteration
of concatenation, StringBuilder isn't
going to help you. For instance, if
you build up a full name from a first
name and a last name, and then add a
third piece of information (the
nickname, maybe) to the end, you'll
only benefit from using StringBuilder
if you don't need the (first name +
last name) string for other purpose
(as we do in the example which creates
a Person object).
If you just have a few concatenations
to do, and you really want to do them
in separate statements, it doesn't
really matter which way you go. Which
way is more efficient will depend on
the number of concatenations the sizes
of string involved, and what order
they're concatenated in. If you really
believe that piece of code to be a
performance bottleneck, profile or
benchmark it both ways.
System.String is an immutable object - it means that whenever you modify its content it will allocate a new string and this takes time (and memory?).
Using StringBuilder you modify the actual content of the object without allocating a new one.
So use StringBuilder when you need to do many modifications on the string.
Not really...you should use StringBuilder if you concatenate large strings or you have many concatenations, like in a loop.
If you concatenate strings in a loop, you should consider using StringBuilder instead of regular String
In case it's single concatenation, you may not see the difference in execution time at all
Here is a simple test app to prove the point:
static void Main(string[] args)
{
//warm-up rounds:
Test(500);
Test(500);
//test rounds:
Test(500);
Test(1000);
Test(10000);
Test(50000);
Test(100000);
Console.ReadLine();
}
private static void Test(int iterations)
{
int testLength = iterations;
Console.WriteLine($"----{iterations}----");
//TEST 1 - String
var startTime = DateTime.Now;
var resultString = "test string";
for (var i = 0; i < testLength; i++)
{
resultString += i.ToString();
}
Console.WriteLine($"STR: {(DateTime.Now - startTime).TotalMilliseconds}");
//TEST 2 - StringBuilder
startTime = DateTime.Now;
var stringBuilder = new StringBuilder("test string");
for (var i = 0; i < testLength; i++)
{
stringBuilder.Append(i.ToString());
}
string resultString2 = stringBuilder.ToString();
Console.WriteLine($"StringBuilder: {(DateTime.Now - startTime).TotalMilliseconds}");
Console.WriteLine("---------------");
Console.WriteLine("");
}
Results (in milliseconds):
----500----
STR: 0.1254
StringBuilder: 0
---------------
----1000----
STR: 2.0232
StringBuilder: 0
---------------
----10000----
STR: 28.9963
StringBuilder: 0.9986
---------------
----50000----
STR: 1019.2592
StringBuilder: 4.0079
---------------
----100000----
STR: 11442.9467
StringBuilder: 10.0363
---------------
There's no definitive answer, only rules-of-thumb. My own personal rules go something like this:
If concatenating in a loop, always use a StringBuilder.
If the strings are large, always use a StringBuilder.
If the concatenation code is tidy and readable on the screen then it's probably ok.
If it isn't, use a StringBuilder.
To paraphrase
Then shalt thou count to three, no more, no less. Three shall be the number thou shalt count, and the number of the counting shall be three. Four shalt thou not count, neither count thou two, excepting that thou then proceed to three. Once the number three, being the third number, be reached, then lobbest thou thy Holy Hand Grenade of Antioch
I generally use string builder for any block of code which would result in the concatenation of three or more strings.
Since it's difficult to find an explanation for this that's not either influenced by opinions or followed by a battle of prides I thought to write a bit of code on LINQpad to test this myself.
I found that using small sized strings rather than using i.ToString() changes response times (visible in small loops).
The test uses different sequences of iterations to keep time measurements in sensibly comparable ranges.
I'll copy the code at the end so you can try it yourself (results.Charts...Dump() won't work outside LINQPad).
Output (X-Axis: Number of iterations tested, Y-Axis: Time taken in ticks):
Iterations sequence: 2, 3, 4, 5, 6, 7, 8, 9, 10
Iterations sequence: 10, 20, 30, 40, 50, 60, 70, 80
Iterations sequence: 100, 200, 300, 400, 500
Code (Written using LINQPad 5):
void Main()
{
Test(2, 3, 4, 5, 6, 7, 8, 9, 10);
Test(10, 20, 30, 40, 50, 60, 70, 80);
Test(100, 200, 300, 400, 500);
}
void Test(params int[] iterationsCounts)
{
$"Iterations sequence: {string.Join(", ", iterationsCounts)}".Dump();
int testStringLength = 10;
RandomStringGenerator.Setup(testStringLength);
var sw = new System.Diagnostics.Stopwatch();
var results = new Dictionary<int, TimeSpan[]>();
// This call before starting to measure time removes initial overhead from first measurement
RandomStringGenerator.GetRandomString();
foreach (var iterationsCount in iterationsCounts)
{
TimeSpan elapsedForString, elapsedForSb;
// string
sw.Restart();
var str = string.Empty;
for (int i = 0; i < iterationsCount; i++)
{
str += RandomStringGenerator.GetRandomString();
}
sw.Stop();
elapsedForString = sw.Elapsed;
// string builder
sw.Restart();
var sb = new StringBuilder(string.Empty);
for (int i = 0; i < iterationsCount; i++)
{
sb.Append(RandomStringGenerator.GetRandomString());
}
sw.Stop();
elapsedForSb = sw.Elapsed;
results.Add(iterationsCount, new TimeSpan[] { elapsedForString, elapsedForSb });
}
// Results
results.Chart(r => r.Key)
.AddYSeries(r => r.Value[0].Ticks, LINQPad.Util.SeriesType.Line, "String")
.AddYSeries(r => r.Value[1].Ticks, LINQPad.Util.SeriesType.Line, "String Builder")
.DumpInline();
}
static class RandomStringGenerator
{
static Random r;
static string[] strings;
public static void Setup(int testStringLength)
{
r = new Random(DateTime.Now.Millisecond);
strings = new string[10];
for (int i = 0; i < strings.Length; i++)
{
strings[i] = Guid.NewGuid().ToString().Substring(0, testStringLength);
}
}
public static string GetRandomString()
{
var indx = r.Next(0, strings.Length);
return strings[indx];
}
}
But if I want to concatenate 2 strings, then I assume that it's better and faster to do so without StringBuilder. Is this correct?
Yes. But more importantly, it is vastly more readable to use a vanilla String in such situations. Using it in a loop, on the other hand, makes sense and can also be as readable as concatenation.
I’d be wary of rules of thumb that cite specific numbers of concatenation as a threshold. Using it in loops (and loops only) is probably just as useful, easier to remember and makes more sense.
As long as you can physically type the number of concatenations (a + b + c ...) it shouldn't make a big difference. N squared (at N = 10) is a 100X slowdown, which shouldn't be too bad.
The big problem is when you are concatenating hundreds of strings. At N=100, you get a 10000X times slowdown. Which is pretty bad.
A single concatenation is not worth using a StringBuilder. I've typically used 5 concatenations as a rule of thumb.
I don't think there's a fine line between when to use or when not to. Unless of course someone performed some extensive testings to come out with the golden conditions.
For me, I will not use StringBuilder if just concatenating 2 huge strings. If there's loop with an undeterministic count, I'm likely to, even if the loop might be small counts.
Related
I have a string like this:
-82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0
I need to have output like this:
-82.9494547 36.2913021, -83.0784938 36.2347521, -82.9537782,36.079235
I have tried this following to code to achieve the desired output:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++)
{
coordinatesVal[i] = coordinatesVal[i].Trim();
coordinatesVal[i] = coordinatesVal[i].Replace(',', ' ');
numbers.Append(coordinatesVal[i]);
if (i != coordinatesVal.Length - 1)
{
coordinatesVal.Append(", ");
}
}
But this process does not seem to me the professional solution. Can anyone please suggest more efficient way of doing this?
Your code is okay. You could dismiss temporary results and chain method calls
var numbers = new StringBuilder();
string[] coordinatesVal = coordinateTxt
.Trim()
.Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++) {
numbers
.Append(coordinatesVal[i].Trim().Replace(',', ' '))
.Append(", ");
}
numbers.Length -= 2;
Note that the last statement assumes that there is at least one coordinate pair available. If the coordinates can be empty, you would have to enclose the loop and this last statement in if (coordinatesVal.Length > 0 ) { ... }. This is still more efficient than having an if inside the loop.
You ask about efficiency, but you don't specify whether you mean code efficiency (execution speed) or programmer efficiency (how much time you have to spend on it).
One key part of professional programming is to judge which one of these is more important in any given situation.
The other answers do a good job of covering programmer efficiency, so I'm taking a stab at code efficiency. I'm doing this at home for fun, but for professional work I would need a good reason before putting in the effort to even spend time comparing the speeds of the methods given in the other answers, let alone try to improve on them.
Having said that, waiting around for the program to finish doing the conversion of millions of coordinate pairs would give me such a reason.
One of the speed pitfalls of C# string handling is the way String.Replace() and String.Trim() return a whole new copy of the string. This involves allocating memory, copying the characters, and eventually cleaning up the garbage generated. Do that a few million times, and it starts to add up. With that in mind, I attempted to avoid as many allocations and copies as possible.
enum CurrentField
{
FirstNum,
SecondNum,
UnwantedZero
};
static string ConvertStateMachine(string input)
{
// Pre-allocate enough space in the string builder.
var numbers = new StringBuilder(input.Length);
var state = CurrentField.FirstNum;
int i = 0;
while (i < input.Length)
{
char c = input[i++];
switch (state)
{
// Copying the first number to the output, next will be another number
case CurrentField.FirstNum:
if (c == ',')
{
// Separate the two numbers by space instead of comma, then move on
numbers.Append(' ');
state = CurrentField.SecondNum;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
// Copying the second number to the output, next will be the ,0\n that we don't need
case CurrentField.SecondNum:
if (c == ',')
{
numbers.Append(", ");
state = CurrentField.UnwantedZero;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
case CurrentField.UnwantedZero:
// Output nothing, just track when the line is finished and we start all over again.
if (c == '\n')
{
state = CurrentField.FirstNum;
}
break;
}
}
return numbers.ToString();
}
This uses a state machine to treat incoming characters differently depending on whether they are part of the first number, second number, or the rest of the line, and output characters accordingly. Each character is only copied once into the output, then I believe once more when the output is converted to a string at the end. This second conversion could probably be avoided by using a char[] for the output.
The bottleneck in this code seems to be the number of calls to StringBuilder.Append(). If more speed were required, I would first attempt to keep track of how many characters were to be copied directly into the output, then use .Append(string value, int startIndex, int count) to send an entire number across in one call.
I put a few example solutions into a test harness, and ran them on a string containing 300,000 coordinate-pair lines, averaged over 50 runs. The results on my PC were:
String Split, Replace each line (see Olivier's answer, though I pre-allocated the space in the StringBuilder):
6542 ms / 13493147 ticks, 130.84ms / 269862.9 ticks per conversion
Replace & Trim entire string (see Heriberto's second version):
3352 ms / 6914604 ticks, 67.04 ms / 138292.1 ticks per conversion
- Note: Original test was done with 900000 coord pairs, but this entire-string version suffered an out of memory exception so I had to rein it in a bit.
Split and Join (see Łukasz's answer):
8780 ms / 18110672 ticks, 175.6 ms / 362213.4 ticks per conversion
Character state machine (see above):
1685 ms / 3475506 ticks, 33.7 ms / 69510.12 ticks per conversion
So, the question of which version is most efficient comes down to: what are your requirements?
Your solution is fine. Maybe you could write it a bit more elegant like this:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" },
StringSplitOptions.RemoveEmptyEntries);
string result = string.Empty;
foreach (string line in coordinatesVal)
{
string[] numbers = line.Trim().Split(',');
result += numbers[0] + " " + numbers[1] + ", ";
}
result = result.Remove(result.Count()-2, 2);
Note the StringSplitOptions.RemoveEmptyEntries parameter of Split method so you don't have to deal with empty lines into foreach block.
Or you can do extremely short one-liner. Harder to debug, but in simple cases does the work.
string result =
string.Join(", ",
coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.RemoveEmptyEntries).
Select(i => i.Replace(",", " ")));
heres another way without defining your own loops and replace methods, or using LINQ.
string coordinateTxt = #" -82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string[] coordinatesVal = coordinateTxt.Replace(",", "*").Trim().Split(new string[] { "*0", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(",", coordinatesVal).Replace("*", " ");
Console.WriteLine(result);
or even
string coordinateTxt = #" -82.9494540,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string result = coordinateTxt.Replace(Environment.NewLine, "").Replace($",", " ").Replace(" 0", ", ").Trim(new char[]{ ',',' ' });
Console.WriteLine(result);
I need to convert a 8-bit number such as 00001110 to char. The problem is easy so I wrote the code and everything is working fine, but now I need to optimize for speed as much as possible.
In test class :
class Program
{
static void Main(string[] args)
{
Random r = new Random();
int[] testTab = new int[8];
Normal n = new Normal();
long time;
Stopwatch watch = new Stopwatch();
watch.Start();
for (int i = 0; i < 9000; i++)
{
for (int j = 0; j < 8; j++)
{
testTab[j] = r.Next(2);
}
n.SetTable(testTab);
n.Decode();
}
watch.Stop();
time = watch.ElapsedTicks;
Console.WriteLine(time);
time = watch.ElapsedMilliseconds;
Console.WriteLine(time);
Console.ReadKey();
}
}
and class with algorithm :
class Normal
{
private int[] _tab = new int[8];
public void SetTable(int[] tab)
{
_tab = tab;
}
public void Decode()
{
char a = ((char)( _tab[0]*1 + _tab[1]*2 + _tab[2]*4 + _tab[3]*8 + _tab[4]*16 + _tab[5]*32 +
_tab[6]*64 + _tab[7]*124));
}
}
In the output for 9000 times I get time 2ms it is not a long time ( for 9000 ) time, but I have good proc in my PC.
The final code will be running in smartphone so there is no powerful CPU. In my algorithm I use random data, in final version I will load data by Camera (so it will be longer ) and try to repeat this operation 10 times in one second so that is why I need best time in even smallest operations.
Is there a faster way to convert byte to char than this?
char a = ((char)( _tab[0]*1 + _tab[1]*2 + _tab[2]*4 + _tab[3]*8 + _tab[4]*16 + _tab[5]*32 + _tab[6]*64 + _tab[7]*128));
tl;dr Your conversion code is already efficient, and is not your bottleneck.
Your benchmarking is flawed. You are not just timing the conversion of binary stored in int[] to integer value. You are also timing the generation of your random data. I expect that the majority of the time is spent generating the random data.
Re-write your benchmarking program to operate on data prepared before you start timing. Make sure that the duration of the test is at least 5 or 10 seconds so that you can generate meaningful answers. If you only run for two milliseconds then the granularity of your timer affects the quality of your results.
Bear in mind that in your real application you will be taking a picture on a camera of a QR code and decoding that. The cost of that is many orders of magnitude greater than the cost of converting the 8 bit int arrays.
Your code to do that conversion is already efficient. Do not seek to optimize it further. Not only is there no need to optimize it, there is little hope for significant gains. For the sake of clarity and conciseness you may well opt to use one of the .net library methods that perform such a conversion, but performance of this part of your program is not an issue.
As an aside, it looks like you need to be converting the 8 bit value to byte, adding these values to a byte array, and then feeding to Encoding.GetString to obtain your text. A cast to UTF-16 char as per your code is not correct.
It worth a try this:
var yourString = "00100000";
char yourChar = (char) Convert.ToByte(yourString, 2); // you got ' ' (space)
It may or may not faster, but definitely simpler, more stable and more maintainable.
I ran some tests with different implementations.
First was #Melnikovl answer.
Second was mine, where I replaced + with | and * with << operator.
Third was author's original solution.
I tested with modified code and measured only conversion code.
First and second solution showed a little better performance. But BitConverter a little more often was better, so I think you should choose it (also because if simplicity of code)
var byte[] bytes = { 1, 1, 1, 1 };
int i = BitConverter.ToInt32(bytes, 0);
char a = (char)i;
Don't forgot to check if byte array litte or big endian
I read today, in c# strings are immutable, like once created they cant be changed, so how come below code works
string str="a";
str +="b";
str +="c";
str +="d";
str +="e";
console.write(str) //output: abcde
How come the value of variable changed??
String objects are immutable, but variables can be reassigned.
You created separate objects
a
ab
abc
abcd
abcde
Each of these immutable strings was assigned, in turn, to the variable str.
You can not change the contents (characters inside) a string.
Changing the variable is a different thing altogether.
It is easy to show that the post did not mutate the String object as believed:
string a = "a";
string b = a;
Console.WriteLine(object.ReferenceEquals(a, b)); // true
Console.WriteLine(a); // a
Console.WriteLine(b); // a
b += "b";
Console.WriteLine(object.ReferenceEquals(a, b)); // false
Console.WriteLine(a); // a
Console.WriteLine(b); // ab
The is because the x += y operator is equivalent to x = x + y, but with less typing.
Happy coding.
Use reflector to look at the ILM code and you will see exactly what is going on. Although your code logically appends new contents onto the end of the string, behind the scenes the compiler is creating ILM code that is creating a new string for each assignment.
The picture gets a little muddier if you concatenate literal strings in a single statement like this:
str = "a" + "b" + "c" ...
In this case the compiler is usually smart enough to not create all the extra strings (and thus work for the Garbage collector and will translate it for you to ILM code equivalent to:
str = "abc"
That said, doing it on separate lines like that might not trigger that optimization.
Immutable means the memory location used to store the string variable never gets changed.
string str ="a"; //stored at some memory location
str+= "b"; // now str='ab' and stored at some other memory location
str+= "c"; // now str='abc' and stored at some other memory location
and so on...
Whenever you change the value of string type, you never actually store the new value at the original location, rather keep storing it at new memory locations.
string a="Hello";
string b=a;
a="changed";
console.writeline(b);
Output
Hello // variable b still referring to the original location.
Please check John Skeet's page
http://csharpindepth.com/Articles/General/Strings.aspx
Concepts:
The variable and the instance are separate concepts. A variable is something that holds a value. In the case of a string, the variable holds a pointer to that string that is stored somewhere else: the instance.
A variable can always be assigned and reassigned if you want... it is variable after all! =)
The instance, that I told is somewhere else, can not be changed in the case of a string.
By concatenating strings like you did, you are in fact creating a lot of different storages, one for each string concatenation.
The correct way to do it:
To concatenate string, you can use StringBuilder class:
StringBuilder b = new StringBuilder();
b.Append("abcd");
b.Append(" more text");
string result = b.ToString();
You could also use a list of string, and then join it:
List<string> l = new List<string>();
l.Add("abcd");
l.Add(" more text");
string result = string.Join("", l);
#pst - I agree that readability is important, and in most cases on a PC it won't matter, but what about if you're on a mobile platform where system resources are constrained?
It's important to understand that StringBuilder is the best way to concat strings. It is much faster and more efficient.
You highlighted the important difference, though, as to whether it is significantly faster, and in what scenarios. It is illustrative that the difference has to be measured in ticks at low volumes, because it can't be measured in milliseconds.
It's important to know that for everyday scenarios on a desktop platform, the difference is imperceptible. But it's also important to know that for mobile platforms, edge cases where you're building large strings or doing thousands of concats, or for performance optimization, StringBuilder does win. With a very large number of concats, it is worth noting that StringBuilder takes slightly more memory.
This is by no means a perfect comparison, but for the fool that concats 1,000,000 strings, StringBuilder beats plain string concatenation by ~10 mins (on a Core 2 Duo E8500 # 3.16GHz in Win 7 x64):
String concat (10): 9 ticks, 0 ms, 8192 bytes
String Builder (10): 2 ticks, 0 ms, 8192 bytes
String concat (100): 30 ticks, 0 ms, 16384 bytes
String Builder (100): 6 ticks, 0 ms, 8192 bytes
String concat (1000): 1658 ticks, 0 ms, 1021964 bytes
String Builder (1000): 29 ticks, 0 ms, 8192 bytes
String concat (10000): 105451 ticks, 34 ms, 2730396 bytes
String Builder (10000): 299 ticks, 0 ms, 40624 bytes
String concat (100000): 15908144 ticks, 5157 ms, 200020 bytes
String Builder (100000): 2776 ticks, 0 ms, 216888 bytes
String concat (1000000): 1847164850 ticks, 598804 ms, 1999804 bytes
String Builder (1000000): 27339 ticks, 8 ms, 2011576 bytes
Code:
class Program
{
static void Main(string[] args)
{
TestStringCat(10);
TestStringBuilder(10);
TestStringCat(100);
TestStringBuilder(100);
TestStringCat(1000);
TestStringBuilder(1000);
TestStringCat(10000);
TestStringBuilder(10000);
TestStringCat(100000);
TestStringBuilder(100000);
TestStringCat(1000000);
TestStringBuilder(1000000);
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
static void TestStringCat(int iterations)
{
GC.Collect();
String s = String.Empty;
long memory = GC.GetTotalMemory(true);
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
s += "a";
}
sw.Stop();
memory = GC.GetTotalMemory(false) - memory;
Console.WriteLine("String concat \t({0}):\t\t{1} ticks,\t{2} ms,\t\t{3} bytes", iterations, sw.ElapsedTicks, sw.ElapsedMilliseconds, memory);
}
static void TestStringBuilder(int iterations)
{
GC.Collect();
StringBuilder sb = new StringBuilder();
long memory = GC.GetTotalMemory(true);
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
sb.Append("a");
}
sw.Stop();
memory = GC.GetTotalMemory(false) - memory;
Console.WriteLine("String Builder \t({0}):\t\t{1} ticks,\t{2} ms,\t\t{3} bytes", iterations, sw.ElapsedTicks, sw.ElapsedMilliseconds, memory);
}
}
it didn't, it's actually overwriting the variable with the new value creating a new string object each time.
Immutability means that the class cannot be modified. For every += there you are creating a completely new string object.
How can I optimize the following code so that it executes faster?
static void Main(string[] args)
{
String a = "Hello ";
String b = " World! ";
for (int i=0; i<20000; i++)
{
a = a + b;
}
Console.WriteLine(a);
}
From the StringBuilder documentation:
Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer.
The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs. A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.
static void Main(string[] args) {
String a = "Hello ";
String b = " World! ";
StringBuilder result = new StringBuilder(a.Length + b.Length * 20000);
result.Append(a);
for (int i=0; i<20000; i++) {
result.Append(b);
}
Console.WriteLine(result.ToString());
}
Since its output is predetermined, it would run faster if you just hardcoded the literal value that is built by the loop.
Perform output in the loop (5x faster, same result):
static void Main(string[] args)
{
Console.Write("Hello ");
for (int i=0; i<20000; i++)
Console.Write(" World! ");
Console.Write(Environment.NewLine);
}
Or allocate the memory on forehand and fill it up (4x faster, same result):
static void Main(string[] args)
{
String a = "Hello ";
String b = " World! ";
int it = 20000;
char[] result = new char[a.Length + it*b.Length];
a.ToCharArray().CopyTo(result, 0);
for (int i = 0; i < it; i++)
b.ToCharArray().CopyTo(result, a.Length + i * b.Length);
Console.WriteLine(result);
}
static void Main(string[] args)
{
const String a = "Hello " +
/* insert string literal here that contains " World! " 20000 times. */ ;
Console.WriteLine(a);
}
I can't believe that they teach nonsense like this in schools. There isn't a real-world example of why you would ever do this, let alone optimize it. All this teaches is how to micro-optimize a program that does nothing useful, and that's counterproductive to the student's health as a programmer/developer.
MemoryStream is slightly faster than using StringBuilder:
static void Main(string[] args)
{
String a = "Hello ";
String b = " World! ";
System.IO.MemoryStream ms = new System.IO.MemoryStream(20000 * b.Length + a.Length);
System.IO.StreamWriter sw = new System.IO.StreamWriter(ms);
sw.Write(a);
for (int i = 0; i < 20000; i++)
{
sw.Write(b);
}
ms.Seek(0,System.IO.SeekOrigin.Begin);
System.IO.StreamReader sr = new System.IO.StreamReader(ms);
Console.WriteLine(sr.ReadToEnd());
}
It's likely to be IO dominated ( writing the output to the console or a file will be the slowest part ), so probably won't benefit from a high degree of optimisation. Simply removing obvious pessimisations should suffice.
As a general rule, don't create temporary objects. Each iteration of your loop creates a temporary string, coping the entire previous string in a and the value of the string in b, so has to do up to 20000 times the length of b operations each time through the loop. Even so, that's only 3 billion bytes to copy, and so should complete in less than a second on a modern machine ( assuming the runtime uses the right operations for the target hardware ). Dumping 160,008 characters to the console may well take longer.
One technique is to use a builder or buffer to create fewer temporary objects, instead creating a long string in memory using a StringBuilder then copying that to a string, then outputting that string.
However, you can go one stage further and achieve the same functionality by writing the output directly, rather than creating any temporary strings or buffers, using Console.Write in the loop instead. That will remove two of copying operations ( the string b is copied to the buffer then the buffer is copied to a string object then the string's data to the output buffer; the final copy operation is the one internal to Console.Write so is not avoidable in C# ), but require more operating system calls, so may or may not be faster.
Another common optimisation is to unroll the loop. So instead of having a loop which has one line which writes one " World! " and is looped 20,000 times, you can have (say) five lines which write one " World! " each and loop them 4,000 times. That's normally only worth doing in itself if the cost of incrementing and testing the loop variable is high compared to what you're doing in the loop, but it can lead to other optimisations.
Having partially unrolled the loop, you can combine the code in the loop and write five or ten " World! "s with one call to Console.Write, which should save some time in that you're only making one fifth the number of system calls.
Writing to console, in cmd window, it appears limited by speed of console window:
( times in seconds for 100 runs )
724.5312500 - concat
53.2187500 - direct
30.3906250 - direct writing b x10
30.3750000 - direct writing b x100
30.3750000 - builder
30.3750000 - builder writing b x100
writing to file, the times for the different techniques differ:
205.0000000 - concat
9.7031250 - direct
1.0781250 - direct writing b x10
0.5000000 - builder
0.4843750 - direct writing b x100
0.4531250 - builder writing b x100
From this it's possible to draw two conclusions:
Most of the improvements don't matter if you're writing to the console in a cmd.exe window. You do have to profile the system as a whole, and (unless you're trying to reduce the energy use of the CPU ) there's no point optimising one component beyond the capbilities of the rest of the system.
Although apparently doing more work - copying the data more and calling the same number of functions, the StringBuilder approach is faster. This implies that there's quite a high overhead in each call to Console.Write, compared with the equivalent in non-managed languages.
writing to file, using gcc C99 on Win XP:
0.375 - direct ( fputs ( b, stdout ) 20000 times )
0.171 - direct unrolled ( fputs ( b x 100, stdout ) 200 times )
0.171 - copy to b to a buffer 20000 times then puts once
The lower cost of the system call in C allows it to get towards being IO bound, rather than limited by the .net runtime boundaries. So when optimizing .net, managed/unmanaged boundaries become important.
I wonder, would this be any faster?
static void Main(string[] args) {
String a = "Hello ";
String b = " World! ";
int worldCount = 20000;
StringBuilder worldList = new StringBuilder(b.Length * worldCount);
worldList.append(b);
StringBuilder result = new StringBuilder(a.Length + b.Length * worldCount);
result.Append(a);
while (worldCount > 0) {
if ((worldCount & 0x1) > 0) { // Fewer appends, more ToStrings.
result.Append(worldList); // would the ToString here kill performance?
}
worldCount >>= 1;
if (worldCount > 0) {
worldList.Append(worldList);
}
}
Console.WriteLine(result.ToString());
}
Depends on what's going in the String object I guess. If internally all they have is a null-terminated string, then you could optimize by storing the length of the string somewhere. Also, if you're just outputting to stdout, it would make more sense to move your output call inside the loop (less memory overhead), and it should also be faster.
Here are some timing results. Each test was conducted starting with 20000 iterations. Every test includes output in the timings, unless stated otherwise. Each number for a group means number of iterations was 10 times greater than the previous. If there are less than 4 numbers, the test was taking too long so i killed it. "Parallize it" means i split the number of concatenations evenly over 4 threads and appended the results when all had finished (could probable have saved a little time here and put them into a queue and appended them as they finished, but didn't think of that until now). All times are in milliseconds.
656
6658
66999
370717
output hello loop output world. no concatenation.
658
6641
65807
554546
build with stringbuilder then output
664
6571
65676
314140
build with stringbuilder with large initial size no output
2761
367042
OP, strings only (killed test while concatenating; nothing printed to screen)
167
43227
parallize it OP no output
27
40
323
1758
parallize it stringbuilder no output
I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is the a string that holds the values that are seperated by the delimiter.
Measuring the performance of my ReadNextRow method I noticed that it spends 66% on String.Split, so I was wondering if someone knows of a faster method to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast, I've done some testing here trying to out preform it and it's not easy.
But there's one thing you can do and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
int l = s.Length;
int i = 0, j = s.IndexOf( c, 0, l );
if ( j == -1 ) // No such substring
{
yield return s; // Return original and break
yield break;
}
while ( j != -1 )
{
if ( j - i > 0 ) // Non empty?
{
yield return s.Substring( i, j - i ); // Return non-empty match
}
i = j + 1;
j = s.IndexOf( c, i, l - i );
}
if ( i < l ) // Has remainder?
{
yield return s.Substring( i, l - i ); // Return remaining trail
}
}
The above method is not necessarily faster than string.Split for small strings but it returns results as it finds them, this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring which does too much index of out range checking and to be faster you need to optimize away these and implement your own helper methods. You can beat the string.Split performance but it's gonna take cleaver int-hacking. You can read my post about that here.
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas in the file eg:
1,"Something, with a comma",2,3
The other thing I'll point out without knowing how you profiled is be careful about profiling this kind of low level detail. The granularity of the Windows/PC timer might come into play and you may have a significant overhead in just looping so use some sort of control value.
That being said, split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple:
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (ie \") and arguably escaped commas (\,).
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade();
index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);
index = span.IndexOf(';');
result.Quantity = float.Parse(span.Slice(0, index));
return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
I found this implementation which is 30% faster from Dejan Pelzel's blog. I qoute from there:
The Solution
With this in mind, I set to create a string splitter that would use an internal buffer similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
int resultIndex = 0;
int startIndex = 0;
// Find the mid-parts
for (int i = 0; i < value.Length; i++)
{
if (value[i] == separator)
{
this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
resultIndex++;
startIndex = i + 1;
}
}
// Find the last part
this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
resultIndex++;
return resultIndex;
How To Use
The StringSplitter class is incredibly simple to use as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for a single time use. In this case it would be better to juse use the built in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
Console.WriteLine("It works!");
}
The Split methods returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Slit() vs Regex and other methods.
We are talking ms savings over very large strings though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split would, it can make an improvement to make your own.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.
CSV parsing is actually fiendishly complex to get right, I used classes based on wrapping the ODBC Text driver the one and only time I had to do this.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
This is my solution:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length - 1
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
Return kwds
End Function
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
Dim sw As New Stopwatch
sw.Start()
Dim kwds(1) As String
Dim k = 0
Dim tmp As String = ""
For l = 1 To inputString.Length - 1
tmp = Mid(inputString, l, 1)
If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
kwds(k) &= tmp
Next
sw.Stop()
Dim fsTime As Long = sw.ElapsedTicks
sw.Start()
Dim strings() As String = inputString.Split(separator)
sw.Stop()
Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
Return kwds
End Function
Here are some results on relatively small strings but with varying sizes, up to 8kb blocks. (times are in ticks)
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
public static unsafe List<string> SplitString(char separator, string input)
{
List<string> result = new List<string>();
int i = 0;
fixed(char* buffer = input)
{
for (int j = 0; j < input.Length; j++)
{
if (buffer[j] == separator)
{
buffer[i] = (char)0;
result.Add(new String(buffer));
i = 0;
}
else
{
buffer[i] = buffer[j];
i++;
}
}
buffer[i] = (char)0;
result.Add(new String(buffer));
}
return result;
}
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
String.split is rather slow, if you want some faster methods, here you go. :)
However CSV is much better parsed by a rule based parser.
This guy, has made a rule based tokenizer for java. (requires some copy and pasting unfortunately)
http://www.csdgn.org/code/rule-tokenizer
private static final String[] fSplit(String src, char delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+1;
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}
private static final String[] fSplit(String src, String delim) {
ArrayList<String> output = new ArrayList<String>();
int index = 0;
int lindex = 0;
while((index = src.indexOf(delim,lindex)) != -1) {
output.add(src.substring(lindex,index));
lindex = index+delim.length();
}
output.add(src.substring(lindex));
return output.toArray(new String[output.size()]);
}